Error Handling

MARSYS provides a comprehensive error handling system for robust operation, graceful degradation, and intelligent recovery in multi-agent workflows.

Overview

The error handling system provides:

  • Hierarchical Exceptions: Granular error categorization with rich context
  • Intelligent Recovery: Automatic retry strategies and fallback mechanisms
  • Error Routing: Route errors to User nodes for human intervention
  • Provider-Specific Handling: Tailored strategies for different AI providers

Exception Hierarchy

Base Exceptions

from marsys.agents.exceptions import AgentFrameworkError, AgentError

# The imported classes are defined as follows (shown for reference):
class AgentFrameworkError(Exception):
    """Base exception for all MARSYS framework errors."""
    pass

class AgentError(AgentFrameworkError):
    """Exception for agent-specific errors."""
    pass

Error Categories

from marsys.agents.exceptions import (
    AgentFrameworkError,
    AgentError,
    ToolExecutionError,
    ModelError,
)

# Validation Errors
class ValidationError(MarsysError):
    """Input validation failures."""
    pass

class ActionValidationError(ValidationError):
    """Invalid agent actions."""
    pass

# Configuration Errors
class ConfigurationError(MarsysError):
    """Configuration problems."""
    pass

class TopologyError(ConfigurationError):
    """Topology definition errors."""
    pass

# Execution Errors
class ExecutionError(MarsysError):
    """Runtime execution failures."""
    pass

class TimeoutError(ExecutionError):
    """Operation timeout."""
    pass

# API Errors
class APIError(MarsysError):
    """External API failures."""
    pass

class RateLimitError(APIError):
    """API rate limit exceeded."""
    recoverable = True
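
Because every category derives from a shared base, a handler can catch broadly and still branch on the concrete type. The sketch below uses stand-in classes that mirror the names above rather than importing them from MARSYS. Reading `recoverable` through `getattr` with a default of `False` is an assumption, since only `RateLimitError` is shown setting that attribute; `is_recoverable` and `classify` are hypothetical helper names.

```python
# Stand-in classes mirroring the hierarchy above; illustration only,
# not imported from MARSYS.
class MarsysError(Exception):
    """Stand-in base class."""


class APIError(MarsysError):
    """External API failures."""


class RateLimitError(APIError):
    """API rate limit exceeded."""
    recoverable = True


def is_recoverable(exc: Exception) -> bool:
    # Assumption: errors that do not define `recoverable` are treated as fatal.
    return bool(getattr(exc, "recoverable", False))


def classify(exc: Exception) -> str:
    """Return a coarse label, checking the most specific type first."""
    if isinstance(exc, RateLimitError):
        return "rate_limit"
    if isinstance(exc, APIError):
        return "api"
    if isinstance(exc, MarsysError):
        return "framework"
    return "unexpected"
```

Checking subclasses before their parents matters here: a `RateLimitError` is also an `APIError`, so the order of the `isinstance` tests (or of `except` clauses) decides which branch runs.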

Error Handling Patterns

Comprehensive Try-Catch

import asyncio
import logging

logger = logging.getLogger(__name__)

# ValidationError, RateLimitError, TimeoutError, ExecutionError and MarsysError
# are the exception classes listed under Error Categories.
# correct_validation_issues is an application-defined helper, not shown here.

async def execute_agent_task(agent, task, context):
    """Execute task with comprehensive error handling."""
    try:
        result = await agent.run(task, context)
        return result
    except ValidationError as e:
        # Handle validation errors: fix the task once and retry
        logger.warning(f"Validation error: {e}")
        corrected_task = correct_validation_issues(task, e)
        return await agent.run(corrected_task, context)
    except RateLimitError:
        # Handle rate limits with a fixed wait.
        # Note: this retry is unbounded; cap the attempts in production code.
        wait_time = 60
        logger.info(f"Rate limited. Waiting {wait_time}s...")
        await asyncio.sleep(wait_time)
        return await execute_agent_task(agent, task, context)
    except TimeoutError:
        # Handle timeout with up to 3 retries, tracked in the context dict
        if context.get("retry_count", 0) < 3:
            context["retry_count"] = context.get("retry_count", 0) + 1
            return await execute_agent_task(agent, task, context)
        raise
    except MarsysError as e:
        # Log framework errors and propagate them unchanged
        logger.error(f"Framework error: {e}")
        raise
    except Exception as e:
        # Wrap unexpected errors, keeping the original as the cause
        logger.error(f"Unexpected error: {e}")
        raise ExecutionError(f"Unexpected error: {str(e)}") from e

Built-in Retry Logic

MARSYS automatically retries server-side API errors with exponential backoff at the adapter level, so most API calls need no manual retry logic.
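
To make the idea concrete, here is a rough sketch of what retrying server-side errors with exponential backoff can involve. It is not the MARSYS adapter code: the set of retryable status codes, the function names, and the default delays are all assumptions chosen for illustration.

```python
import random

# Assumption: these HTTP statuses count as transient server-side errors.
RETRYABLE_STATUSES = frozenset({500, 502, 503, 504})


def is_retryable(status_code: int) -> bool:
    """Return True if a response with this status is worth retrying."""
    return status_code in RETRYABLE_STATUSES


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  jitter: float = 0.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    The delay doubles on each attempt (base * 2**attempt) and is capped
    at `cap`. `jitter` adds a random extra of up to that fraction of the
    delay, which helps keep many clients from retrying in lockstep.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0.0, jitter * delay)
```

With the defaults, successive delays are 1, 2, 4, 8, and 16 seconds, then stay at the 30-second cap. Client-side errors such as a 400 are not in the retryable set, since repeating an invalid request will not fix it.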

Error Routing to User

Route errors to User nodes for human intervention:

# Topology with error handling
topology = {
    "agents": ["User", "Processor", "ErrorHandler"],
    "flows": [
        "User -> Processor",
        "Processor -> User",          # Success path
        "Processor -> ErrorHandler",  # Error path
        "ErrorHandler -> User",       # Report to user
    ],
}

# Configure error routing
config = ExecutionConfig(
    enable_error_routing=True,
    preserve_error_context=True,
)

result = await Orchestra.run(
    task=task,
    topology=topology,
    execution_config=config,
)

Recovery Strategies

Automatic Retry

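import asyncio

# Automatic retry with exponential backoff.
# RateLimitError is the framework exception listed under Error Categories.
# MaxRetriesExceeded is not defined in this document; it is assumed to be
# an application-level exception.
async def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await func()
        except RateLimitError:
            if attempt == max_retries - 1:
                break  # no point in sleeping after the final attempt
            wait_time = 2 ** attempt  # 1s, then 2s, then 4s, ...
            await asyncio.sleep(wait_time)
    raise MaxRetriesExceeded()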

Fallback Mechanisms

# Fallback to alternative provider
async def call_with_fallback(primary_agent, fallback_agent, task):
    try:
        return await primary_agent.run(task)
    except APIError:
        logger.info("Primary failed, using fallback")
        return await fallback_agent.run(task)
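
The fallback pattern can be exercised without real providers. In the sketch below, `StubAgent` and `demo` are hypothetical names, and `APIError` is a local stand-in rather than the class imported from MARSYS; the only assumption about agents is that they expose an async `run(task)` method, as in the snippet above.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class APIError(Exception):
    """Stand-in for the framework's APIError; illustration only."""


class StubAgent:
    """Hypothetical agent exposing only an async run(task) method."""

    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail

    async def run(self, task: str) -> str:
        if self.fail:
            raise APIError(f"{self.name} is unavailable")
        return f"{self.name} handled: {task}"


async def call_with_fallback(primary_agent, fallback_agent, task):
    try:
        return await primary_agent.run(task)
    except APIError:
        logger.info("Primary failed, using fallback")
        return await fallback_agent.run(task)


async def demo() -> str:
    # The primary always fails, so the fallback should answer.
    primary = StubAgent("primary", fail=True)
    fallback = StubAgent("fallback")
    return await call_with_fallback(primary, fallback, "summarize report")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Only `APIError` triggers the fallback. Any other exception from the primary agent propagates to the caller, and so does a failure of the fallback agent itself.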

Best Practices

  • Catch specific exceptions: Handle different error types appropriately
  • Log with context: Include relevant information for debugging
  • Use recovery paths: Design topologies with error handling routes
  • Set timeouts: Prevent hanging operations with appropriate timeouts
  • Test error scenarios: Verify your error handling works correctly
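
As one way to apply the "Set timeouts" advice, the sketch below wraps an awaitable in a deadline using the standard library's `asyncio.wait_for`. The helper name `run_with_timeout` and its parameters are hypothetical. It raises `asyncio.TimeoutError`, which is separate from the framework's own `TimeoutError` class; how MARSYS maps one onto the other is not covered here.

```python
import asyncio


async def run_with_timeout(coro_factory, timeout_s: float, retries: int = 1):
    """Await coro_factory() with a deadline, retrying on timeout.

    Makes up to `retries + 1` attempts, then re-raises the last timeout.
    A factory is used (not a coroutine object) because a coroutine can
    only be awaited once, so each attempt needs a fresh one.
    """
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
        except asyncio.TimeoutError:
            if attempt == retries:
                raise
```

A call might look like `await run_with_timeout(lambda: agent.run(task), timeout_s=30)`, where `agent` and `task` are whatever the surrounding code already has.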

Don't Swallow Errors

Always either handle an error appropriately or re-raise it. Silent failures make debugging extremely difficult.
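
The difference is easiest to see side by side. Both functions below are hypothetical examples unrelated to the MARSYS API: the first hides the failure, the second logs it with context and re-raises with the original exception chained as the cause.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def parse_port_silently(raw: str) -> Optional[int]:
    # Anti-pattern: the failure disappears and the caller just gets None.
    try:
        return int(raw)
    except ValueError:
        pass
    return None


def parse_port(raw: str) -> int:
    # Preferred: log with context, then re-raise with the cause chained.
    try:
        return int(raw)
    except ValueError as exc:
        logger.error("Invalid port value %r", raw)
        raise ValueError(f"invalid port: {raw!r}") from exc
```

With the second version, the traceback shows both the wrapping error and the original one, so the bad input can be traced back to its source.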