Error Handling

MARSYS provides a comprehensive error handling system for robust operation, graceful degradation, and intelligent recovery in multi-agent workflows.

Overview

The error handling system provides:

  • Hierarchical Exceptions: Granular error categorization with rich context
  • Intelligent Recovery: Automatic retry strategies and fallback mechanisms
  • Error Routing: Route errors to User nodes for human intervention
  • Provider-Specific Handling: Tailored strategies for different AI providers

Exception Hierarchy

Base Exceptions

from marsys.agents.exceptions import AgentFrameworkError, AgentError

class AgentFrameworkError(Exception):
    """Base exception for all MARSYS framework errors."""
    pass

class AgentError(AgentFrameworkError):
    """Exception for agent-specific errors."""
    pass

Error Categories

from marsys.agents.exceptions import (
    AgentFrameworkError,
    AgentError,
    ToolExecutionError,
    ToolCallError,
    ModelError
)

# Agent-specific errors
class AgentError(AgentFrameworkError):
    """Agent execution and initialization errors."""
    pass

# Tool execution errors
class ToolExecutionError(AgentFrameworkError):
    """Tool function execution failures."""
    pass

class ToolCallError(AgentFrameworkError):
    """Tool call format errors (malformed arguments, missing required fields)."""
    pass

# Validation errors
class ActionValidationError(ValidationError):
    """Invalid agent actions."""
    pass

# Configuration errors
class ConfigurationError(MarsysError):
    """Configuration problems."""
    pass

class AgentConfigurationError(ConfigurationError):
    """Agent setup errors."""
    pass

class TopologyError(ConfigurationError):
    """Topology definition errors."""
    pass

# Execution errors
class ExecutionError(MarsysError):
    """Runtime execution failures."""
    pass

class AgentExecutionError(ExecutionError):
    """Agent execution failures."""
    pass

# Shares its name with Python's built-in TimeoutError; import with care.
class TimeoutError(ExecutionError):
    """Operation timeout."""
    pass

# Permission errors
# Shares its name with Python's built-in PermissionError; import with care.
class PermissionError(MarsysError):
    """Access control violations."""
    pass

class AgentPermissionError(PermissionError):
    """Agent invocation denied."""
    pass

# API errors
class APIError(MarsysError):
    """External API failures."""
    pass

class RateLimitError(APIError):
    """API rate limit exceeded."""
    recoverable = True
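The `recoverable = True` class attribute on `RateLimitError` suggests a way to branch on recoverability without listing every exception class. The following is a minimal sketch, assuming that exceptions without the attribute should be treated as non-recoverable. The stand-in classes only mirror the names above so the snippet runs on its own, and `is_recoverable` is a hypothetical helper, not part of MARSYS.

```python
# Stand-in classes so the sketch is self-contained; in real code these
# would be imported from the framework instead of redefined.
class MarsysError(Exception):
    """Stand-in for the framework base exception."""


class APIError(MarsysError):
    """Stand-in for external API failures."""


class RateLimitError(APIError):
    """Stand-in for rate-limit errors, flagged as recoverable."""
    recoverable = True


def is_recoverable(error: BaseException) -> bool:
    """Hypothetical helper: an error is recoverable only if it says so."""
    # Assumption: a missing attribute means "not recoverable".
    return bool(getattr(error, "recoverable", False))
```

A handler could call `is_recoverable(e)` inside a broad `except MarsysError as e:` block to decide between retrying and re-raising.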

Error Handling Patterns

Comprehensive Try-Except

import asyncio
import logging

logger = logging.getLogger(__name__)

async def execute_agent_task(agent, task, context):
    """Execute task with comprehensive error handling."""
    try:
        result = await agent.run(task, context)
        return result
    except ValidationError as e:
        # Handle validation errors: correct the task and try once more.
        # correct_validation_issues is an application-supplied helper.
        logger.warning(f"Validation error: {e}")
        corrected_task = correct_validation_issues(task, e)
        return await agent.run(corrected_task, context)
    except RateLimitError:
        # Handle rate limits with a fixed wait, then retry.
        # Note: this branch retries for as long as the rate limit persists.
        wait_time = 60
        logger.info(f"Rate limited. Waiting {wait_time}s...")
        await asyncio.sleep(wait_time)
        return await execute_agent_task(agent, task, context)
    except TimeoutError:
        # Handle timeout with up to 3 retries, tracked in the context
        if context.get("retry_count", 0) < 3:
            context["retry_count"] = context.get("retry_count", 0) + 1
            return await execute_agent_task(agent, task, context)
        raise
    except MarsysError as e:
        # Log framework errors and propagate them
        logger.error(f"Framework error: {e}")
        raise
    except Exception as e:
        # Wrap unexpected errors, keeping the original as the cause
        logger.error(f"Unexpected error: {e}")
        raise ExecutionError(f"Unexpected error: {str(e)}") from e

Automatic Retry in API Adapters

Built-in Retry Logic

MARSYS automatically retries server-side API errors with exponential backoff at the adapter level, so API calls need no manual retry logic.

How It Works

API adapters automatically retry the following status codes (some apply to a single provider, as noted):

  • 500 - Internal Server Error
  • 502 - Bad Gateway
  • 503 - Service Unavailable
  • 504 - Gateway Timeout
  • 529 - Overloaded (Anthropic)
  • 408 - Request Timeout (OpenRouter)
  • 429 - Rate Limit (respects retry-after header)

Configuration: max retries 3 (4 attempts in total), base delay 1s, exponential backoff 1s → 2s → 4s.
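The retry schedule described above can be sketched in a few lines. This is an illustration of the stated settings (3 retries, 1s base delay, doubling each time), not the adapter's actual code. `retry_delay` and the constant names are hypothetical, and letting the `retry-after` value override the backoff for 429 responses is an assumption about what "respects retry-after header" means.

```python
from typing import Optional

# Status codes and settings copied from the text above.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504, 529}
MAX_RETRIES = 3
BASE_DELAY = 1.0  # seconds


def retry_delay(status: int, retry: int,
                retry_after: Optional[float] = None) -> Optional[float]:
    """Return seconds to wait before retry number `retry` (1-based).

    Returns None when the request should not be retried.
    """
    if status not in RETRYABLE_STATUS or retry > MAX_RETRIES:
        return None
    if status == 429 and retry_after is not None:
        # Assumption: a server-provided retry-after value wins over backoff.
        return retry_after
    # Exponential backoff: the delay doubles with each retry.
    return BASE_DELAY * 2 ** (retry - 1)


# Delays for retries 1..4 after a 503: three waits, then give up.
schedule = [retry_delay(503, n) for n in range(1, MAX_RETRIES + 2)]
```

Here `schedule` evaluates to `[1.0, 2.0, 4.0, None]`, matching the 1s → 2s → 4s progression and the limit of four attempts in total.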

# Example log output during retry:
# 2025-11-05 00:58:46 - WARNING - Server error 503 from gpt-4. Retry 1/3 after 1.0s
# 2025-11-05 00:58:47 - WARNING - Server error 503 from gpt-4. Retry 2/3 after 2.0s
# 2025-11-05 00:58:49 - INFO - Request successful after 2 retries

Provider-Specific Behavior

Provider   | Retryable Errors         | Non-Retryable
OpenRouter | 408, 429, 502, 503, 500+ | 400, 401, 402, 403
OpenAI     | 429, 500, 502, 503       | 400, 401, 404
Anthropic  | 429, 500, 529            | 400, 401, 403, 413
Google     | 429, 500, 503, 504       | 400, 403, 404
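For code that needs to reason about these rules, for example when deciding whether an application-level retry would only repeat what the adapter already did, the table can be transcribed into a lookup. This is a sketch only: the dictionary layout and `is_retryable` are illustrative and not MARSYS internals, and reading OpenRouter's "500+" entry as "any status of 500 or above" is an assumption.

```python
# Status codes transcribed from the table above.
RETRYABLE = {
    "openrouter": {408, 429, 502, 503},
    "openai": {429, 500, 502, 503},
    "anthropic": {429, 500, 529},
    "google": {429, 500, 503, 504},
}


def is_retryable(provider: str, status: int) -> bool:
    """Return True if the table lists `status` as retryable for `provider`."""
    key = provider.lower()
    if key == "openrouter" and status >= 500:
        # Assumed meaning of the table's "500+" entry.
        return True
    # Unknown providers fall back to "not retryable".
    return status in RETRYABLE.get(key, set())
```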

When to Use Manual Retry

Built-in adapter retry handles transient API errors. Use manual retry for:

  • Application-level errors (business logic failures)
  • Multi-step workflows (coordinating multiple agents)
  • Custom retry policies (non-standard backoff)

Error Routing to User

Route errors to User nodes for human intervention:

# Topology with error handling
topology = {
    "agents": ["User", "Processor", "ErrorHandler"],
    "flows": [
        "User -> Processor",
        "Processor -> User",          # Success path
        "Processor -> ErrorHandler",  # Error path
        "ErrorHandler -> User"        # Report to user
    ]
}

# Configure error routing
config = ExecutionConfig(
    enable_error_routing=True,
    preserve_error_context=True
)

result = await Orchestra.run(
    task=task,
    topology=topology,
    execution_config=config
)

Recovery Strategies

Automatic Retry

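import asyncio

class MaxRetriesExceeded(Exception):
    """Raised when every attempt fails (defined here so the example runs)."""

# Automatic retry with exponential backoff
async def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await func()
        except RateLimitError:
            if attempt == max_retries - 1:
                break  # Do not sleep after the final attempt
            wait_time = 2 ** attempt  # 1, then 2 seconds between the 3 attempts
            await asyncio.sleep(wait_time)
    raise MaxRetriesExceeded()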

Fallback Mechanisms

# Fallback to alternative provider
async def call_with_fallback(primary_agent, fallback_agent, task):
    try:
        return await primary_agent.run(task)
    except APIError:
        logger.info("Primary failed, using fallback")
        return await fallback_agent.run(task)

Best Practices

  • Catch specific exceptions: Handle different error types appropriately
  • Log with context: Include relevant information for debugging
  • Use recovery paths: Design topologies with error handling routes
  • Set timeouts: Prevent hanging operations with appropriate timeouts
  • Test error scenarios: Verify your error handling works correctly
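For the "Set timeouts" practice, one general-purpose option is the standard library's `asyncio.wait_for`. The wrapper below is a minimal sketch, and `run_with_timeout` is a hypothetical name. The `asyncio.TimeoutError` it raises is distinct from the framework's own `TimeoutError` class listed earlier, so a handler needs to catch whichever one applies.

```python
import asyncio


async def run_with_timeout(coro, seconds: float):
    """Await `coro`, raising asyncio.TimeoutError after `seconds`."""
    # wait_for cancels the wrapped coroutine when the deadline passes.
    return await asyncio.wait_for(coro, timeout=seconds)
```

Usage would look like `result = await run_with_timeout(agent.run(task), seconds=30)`, assuming `agent.run` returns an awaitable as in the examples above.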

Don't Swallow Errors

Always either handle an error appropriately or re-raise it. Silent failures make debugging extremely difficult.
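A small sketch of the "log, then re-raise" pattern, using only the standard library. `TaskFailed` and `parse_count` are hypothetical names chosen for illustration. The point is the `raise ... from exc` chaining, which keeps the root cause in the traceback instead of hiding it.

```python
import logging

logger = logging.getLogger(__name__)


class TaskFailed(Exception):
    """Hypothetical application-level error used only in this sketch."""


def parse_count(raw: str) -> int:
    """Parse an integer, surfacing failures instead of hiding them."""
    try:
        return int(raw)
    except ValueError as exc:
        logger.error("Could not parse count from %r", raw)
        # Chain the original exception so the traceback keeps the root cause.
        raise TaskFailed(f"invalid count: {raw!r}") from exc
```

The anti-pattern to avoid is an `except` block that contains only `pass`: the caller then receives no result and no explanation.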