Error Handling
MARSYS provides a comprehensive error handling system for robust operation, graceful degradation, and intelligent recovery in multi-agent workflows.
Overview
The error handling system provides:
- Hierarchical Exceptions: Granular error categorization with rich context
- Intelligent Recovery: Automatic retry strategies and fallback mechanisms
- Error Routing: Route errors to User nodes for human intervention
- Provider-Specific Handling: Tailored strategies for different AI providers
Exception Hierarchy
Base Exceptions
```python
from marsys.agents.exceptions import AgentFrameworkError, AgentError


class AgentFrameworkError(Exception):
    """Base exception for all MARSYS framework errors."""
    pass


class AgentError(AgentFrameworkError):
    """Exception for agent-specific errors."""
    pass
```
Error Categories
```python
from marsys.agents.exceptions import (
    AgentFrameworkError,
    AgentError,
    ToolExecutionError,
    ToolCallError,
    ModelError,
)

# Note: MarsysError and ValidationError are used as base classes below
# but are not defined in this listing.

# Agent-specific errors
class AgentError(AgentFrameworkError):
    """Agent execution and initialization errors."""
    pass

# Tool execution errors
class ToolExecutionError(AgentFrameworkError):
    """Tool function execution failures."""
    pass

class ToolCallError(AgentFrameworkError):
    """Tool call format errors (malformed arguments, missing required fields)."""
    pass

class ActionValidationError(ValidationError):
    """Invalid agent actions."""
    pass

# Configuration Errors
class ConfigurationError(MarsysError):
    """Configuration problems."""
    pass

class AgentConfigurationError(ConfigurationError):
    """Agent setup errors."""
    pass

class TopologyError(ConfigurationError):
    """Topology definition errors."""
    pass

# Execution Errors
class ExecutionError(MarsysError):
    """Runtime execution failures."""
    pass

class AgentExecutionError(ExecutionError):
    """Agent execution failures."""
    pass

class TimeoutError(ExecutionError):
    """Operation timeout."""
    pass

# Permission Errors
class PermissionError(MarsysError):
    """Access control violations."""
    pass

class AgentPermissionError(PermissionError):
    """Agent invocation denied."""
    pass

# API Errors
class APIError(MarsysError):
    """External API failures."""
    pass

class RateLimitError(APIError):
    """API rate limit exceeded."""
    recoverable = True
```
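Because the exceptions form a hierarchy, a handler for a base class also catches its subclasses, so checks should run from most specific to least specific. The following is a minimal sketch of that idea. The three classes are local stand-ins that mirror part of the listing above; they are not imported from marsys, and `classify` is a hypothetical helper introduced only for illustration.

```python
# Stand-in classes mirroring part of the hierarchy described above.
# They are local to this sketch and are NOT the real marsys classes.
class MarsysError(Exception):
    """Stand-in root of the framework hierarchy."""


class APIError(MarsysError):
    """External API failures."""


class RateLimitError(APIError):
    """API rate limit exceeded."""
    recoverable = True


def classify(exc: Exception) -> str:
    """Return a coarse handling decision, checking the most specific type first."""
    if isinstance(exc, RateLimitError):
        return "retry"
    if isinstance(exc, APIError):
        return "fallback"
    if isinstance(exc, MarsysError):
        return "report"
    return "unexpected"
```

If the `APIError` check came first, a `RateLimitError` would be classified as `"fallback"` and never reach the retry branch. The same ordering rule applies to `except` clauses.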
Error Handling Patterns
Comprehensive Try-Catch
```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def execute_agent_task(agent, task, context):
    """Execute task with comprehensive error handling."""
    try:
        result = await agent.run(task, context)
        return result
    except ValidationError as e:
        # Handle validation errors
        logger.warning(f"Validation error: {e}")
        # correct_validation_issues is an application-supplied helper (not shown)
        corrected_task = correct_validation_issues(task, e)
        return await agent.run(corrected_task, context)
    except RateLimitError:
        # Handle rate limits with backoff
        wait_time = 60
        logger.info(f"Rate limited. Waiting {wait_time}s...")
        await asyncio.sleep(wait_time)
        return await execute_agent_task(agent, task, context)
    except TimeoutError:
        # Handle timeout with retry (at most 3 retries, tracked in context)
        if context.get("retry_count", 0) < 3:
            context["retry_count"] = context.get("retry_count", 0) + 1
            return await execute_agent_task(agent, task, context)
        raise
    except MarsysError as e:
        # Log framework errors
        logger.error(f"Framework error: {e}")
        raise
    except Exception as e:
        # Wrap unexpected errors, keeping the original as the cause
        logger.error(f"Unexpected error: {e}")
        raise ExecutionError(f"Unexpected error: {str(e)}") from e
```
Automatic Retry in API Adapters
Built-in Retry Logic
MARSYS automatically retries server-side API errors with exponential backoff at the adapter level, so API calls need no manual retry logic.
How It Works
API adapters automatically retry the following status codes (the exact set varies by provider; see Provider-Specific Behavior):
- 500 - Internal Server Error
- 502 - Bad Gateway
- 503 - Service Unavailable
- 504 - Gateway Timeout
- 529 - Overloaded (Anthropic)
- 408 - Request Timeout (OpenRouter)
- 429 - Rate Limit (respects the `retry-after` header)
Configuration:
- Max retries: 3 (4 attempts in total)
- Base delay: 1s
- Exponential backoff: 1s → 2s → 4s
```python
# Example log output during retry:
# 2025-11-05 00:58:46 - WARNING - Server error 503 from gpt-4. Retry 1/3 after 1.0s
# 2025-11-05 00:58:47 - WARNING - Server error 503 from gpt-4. Retry 2/3 after 2.0s
# 2025-11-05 00:58:49 - INFO - Request successful after 2 retries
```
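To make the retry schedule concrete, here is a minimal sketch of an adapter-style retry loop using the settings listed above (3 retries, 1s base delay, doubling each time, `retry-after` taking precedence). It is an illustration only and not the MARSYS adapter code: `backoff_delay`, `call_with_retry`, and the `send` callback returning `(status, body, retry_after)` are all hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Status codes listed above as retryable at the adapter level.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504, 529}


def backoff_delay(attempt: int, base_delay: float = 1.0,
                  retry_after: Optional[float] = None) -> float:
    """Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ...

    If the server supplied a retry-after value (in seconds), use it instead.
    """
    if retry_after is not None:
        return retry_after
    return base_delay * (2 ** attempt)


async def call_with_retry(send: Callable[[], Awaitable[tuple]],
                          max_retries: int = 3,
                          sleep=asyncio.sleep):
    """Call `send` until the status is not retryable or retries run out.

    `send` is a hypothetical coroutine returning (status, body, retry_after).
    """
    for attempt in range(max_retries + 1):  # 1 initial attempt + max_retries
        status, body, retry_after = await send()
        if status not in RETRYABLE_STATUS or attempt == max_retries:
            return status, body
        await sleep(backoff_delay(attempt, retry_after=retry_after))
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets tests substitute a no-op coroutine instead of actually waiting.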
Provider-Specific Behavior
| Provider | Retryable Errors | Non-Retryable |
|---|---|---|
| OpenRouter | 408, 429, 502, 503, 500+ | 400, 401, 402, 403 |
| OpenAI | 429, 500, 502, 503 | 400, 401, 404 |
| Anthropic | 429, 500, 529 | 400, 401, 403, 413 |
| | 429, 500, 503, 504 | 400, 403, 404 |
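The table can be read as a per-provider lookup. The sketch below transcribes it into a dictionary and a hypothetical `is_retryable` helper; neither name comes from MARSYS. It assumes that "500+" for OpenRouter means any status of 500 or above, and it leaves out the row whose provider name is missing from the table.

```python
# Retryable status codes per provider, transcribed from the table above.
# The row without a provider name is intentionally omitted.
RETRYABLE = {
    "OpenRouter": {408, 429, 502, 503},
    "OpenAI": {429, 500, 502, 503},
    "Anthropic": {429, 500, 529},
}


def is_retryable(provider: str, status: int) -> bool:
    """Return True if `status` is listed as retryable for `provider`."""
    # Assumption: "500+" in the OpenRouter row means any status >= 500.
    if provider == "OpenRouter" and status >= 500:
        return True
    return status in RETRYABLE.get(provider, set())
```

Unknown providers fall through to an empty set, so the helper reports every status as non-retryable rather than guessing.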
When to Use Manual Retry
Built-in adapter retry handles transient API errors. Use manual retry for:
- Application-level errors (business logic failures)
- Multi-step workflows (coordinating multiple agents)
- Custom retry policies (non-standard backoff)
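As an example of a custom policy, the sketch below retries an application-level coroutine with capped exponential backoff and random jitter, which the built-in 1s → 2s → 4s schedule does not provide. `retry_custom` and its parameters are hypothetical and are not part of MARSYS. The caller chooses which exception types count as retryable.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type


async def retry_custom(
    func: Callable[[], Awaitable],
    retry_on: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 10.0,
    sleep=asyncio.sleep,
):
    """Retry `func` on the given exception types with capped, jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error unchanged
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: wait a random time between 0 and the capped delay.
            await sleep(random.uniform(0, cap))
```

Exceptions not listed in `retry_on` propagate immediately, so only the failures you have judged to be transient are retried.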
Error Routing to User
Route errors to User nodes for human intervention:
```python
# Topology with error handling
topology = {
    "agents": ["User", "Processor", "ErrorHandler"],
    "flows": [
        "User -> Processor",
        "Processor -> User",          # Success path
        "Processor -> ErrorHandler",  # Error path
        "ErrorHandler -> User"        # Report to user
    ]
}

# Configure error routing
config = ExecutionConfig(
    enable_error_routing=True,
    preserve_error_context=True
)

result = await Orchestra.run(
    task=task,
    topology=topology,
    execution_config=config
)
```
Recovery Strategies
Automatic Retry
```python
import asyncio


# Automatic retry with exponential backoff
async def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await func()
        except RateLimitError:
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            await asyncio.sleep(wait_time)
    # MaxRetriesExceeded is assumed to be defined by the application
    raise MaxRetriesExceeded()
```
Fallback Mechanisms
```python
# Fallback to alternative provider
async def call_with_fallback(primary_agent, fallback_agent, task):
    try:
        return await primary_agent.run(task)
    except APIError:
        logger.info("Primary failed, using fallback")
        return await fallback_agent.run(task)
```
Best Practices
- Catch specific exceptions: Handle different error types appropriately
- Log with context: Include relevant information for debugging
- Use recovery paths: Design topologies with error handling routes
- Set timeouts: Prevent hanging operations with appropriate timeouts
- Test error scenarios: Verify your error handling works correctly
Don't Swallow Errors
Always either handle an error appropriately or re-raise it. Silent failures make debugging extremely difficult.
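A minimal sketch of the alternative to swallowing: log the failure with its context, then re-raise with the original exception chained so the traceback keeps the root cause. `TaskFailed` and `parse_count` are hypothetical names used only for this illustration.

```python
import logging

logger = logging.getLogger(__name__)


class TaskFailed(Exception):
    """Hypothetical application-level error used only in this sketch."""


def parse_count(raw: str, task_id: str) -> int:
    """Parse a count, logging context and re-raising on failure."""
    try:
        return int(raw)
    except ValueError as e:
        # Log with context, then re-raise with the original error chained.
        logger.error("Task %s: invalid count %r", task_id, raw)
        raise TaskFailed(f"task {task_id}: invalid count {raw!r}") from e
```

A bare `except ValueError: pass` in the same place would return `None` silently and move the failure to wherever the result is used next.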