Error Handling

MARSYS provides a comprehensive error handling system that ensures robust operation, graceful degradation, and intelligent recovery in multi-agent workflows.

Overview

The error handling system provides:

  • Hierarchical Exceptions: Granular error categorization with rich context
  • Intelligent Recovery: Automatic retry strategies and fallback mechanisms
  • Error Routing: Route errors to User nodes for human intervention
  • Provider-Specific Handling: Tailored strategies for different AI providers
  • Comprehensive Logging: Detailed error tracking and monitoring

Exception Hierarchy

Base Exception

All MARSYS exceptions inherit from AgentFrameworkError:

from marsys.agents.exceptions import AgentFrameworkError, AgentError

class AgentFrameworkError(Exception):
    """Base exception for all MARSYS framework errors."""
    pass

class AgentError(AgentFrameworkError):
    """Exception for agent-specific errors."""
    pass

# The actual exception classes are simpler than previously documented.
# Complex error metadata is handled at the coordination layer, not in exceptions.
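
Catching the base class lets a caller handle any framework error in one place. A minimal sketch (assumes an async context with agent, task, context, and logger already defined):

try:
    result = await agent.run(task, context)
except AgentFrameworkError as e:
    # Any MARSYS error (agent, tool, model) lands here
    logger.error(f"Framework error: {e}")
    raise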

Error Categories

The framework uses specific exception types for different error scenarios. The sketch below shows the category hierarchy, with every base rooted in AgentFrameworkError. Later examples on this page also read context, suggestion, and recoverable metadata from these exceptions; treat that metadata as an illustrative application-level convention, since the shipped classes are simpler (see the note above).

from marsys.agents.exceptions import (
    AgentFrameworkError,
    AgentError,
    ToolExecutionError,
    ToolCallError,
    ModelError
)

# Agent-specific errors
class AgentError(AgentFrameworkError):
    """Agent execution and initialization errors."""
    pass

# Tool execution errors
class ToolExecutionError(AgentFrameworkError):
    """Tool function execution failures."""
    pass

class ToolCallError(AgentFrameworkError):
    """Tool call format errors (malformed arguments, missing required fields)."""
    pass

# Validation errors
class ValidationError(AgentFrameworkError):
    """Input and output validation failures."""
    pass

class ActionValidationError(ValidationError):
    """Invalid agent actions."""
    pass

# Configuration errors
class ConfigurationError(AgentFrameworkError):
    """Configuration problems."""
    pass

class AgentConfigurationError(ConfigurationError):
    """Agent setup errors."""
    pass

class TopologyError(ConfigurationError):
    """Topology definition errors."""
    pass

# Execution errors
class ExecutionError(AgentFrameworkError):
    """Runtime execution failures."""
    pass

class AgentExecutionError(ExecutionError):
    """Agent execution failures."""
    pass

class TimeoutError(ExecutionError):  # Note: shadows the built-in TimeoutError
    """Operation timeout."""
    pass

# Permission errors
class PermissionError(AgentFrameworkError):  # Note: shadows the built-in PermissionError
    """Access control violations."""
    pass

class AgentPermissionError(PermissionError):
    """Agent invocation denied."""
    pass

# API errors
class APIError(AgentFrameworkError):
    """External API failures."""
    pass

class RateLimitError(APIError):
    """API rate limit exceeded."""
    recoverable = True

Error Handling Patterns

Try-Catch with Context

import asyncio
import logging

logger = logging.getLogger(__name__)

async def execute_agent_task(agent, task, context):
    """Execute task with comprehensive error handling.

    Assumes the illustrative hierarchy above, where exceptions carry
    context/suggestion/recoverable metadata.
    """
    try:
        result = await agent.run(task, context)
        return result
    except ValidationError as e:
        # Handle validation errors
        logger.warning(f"Validation error: {e.message}")
        if e.suggestion:
            logger.info(f"Suggestion: {e.suggestion}")
        # Try with corrected input (app-defined helper)
        corrected_task = correct_validation_issues(task, e)
        return await agent.run(corrected_task, context)
    except RateLimitError as e:
        # Handle rate limits with backoff
        # (unbounded here for brevity; cap retries in production)
        wait_time = e.context.get("retry_after", 60)
        logger.info(f"Rate limited. Waiting {wait_time}s...")
        await asyncio.sleep(wait_time)
        return await execute_agent_task(agent, task, context)
    except AgentPermissionError as e:
        # Route to user for permission (app-defined helper)
        if context.get("user_recovery"):
            return await route_to_user_for_permission(e, context)
        raise
    except TimeoutError as e:
        # Handle timeout with retry or cancellation
        if e.recoverable and context.get("retry_count", 0) < 3:
            context["retry_count"] = context.get("retry_count", 0) + 1
            return await execute_agent_task(agent, task, context)
        raise
    except ToolCallError as e:
        # Tool call format errors need steering, not blind retries.
        # The framework automatically sets error context for the agent;
        # let the steering system guide the agent to fix the tool call.
        logger.warning(f"Tool call format error: {e}")
        raise
    except AgentFrameworkError as e:
        # Log framework errors
        logger.error(f"Framework error: {e.to_dict()}")
        raise
    except Exception as e:
        # Wrap unexpected errors
        wrapped = ExecutionError(
            f"Unexpected error: {str(e)}",
            context={"original_error": str(e), "type": type(e).__name__}
        )
        logger.error(f"Unexpected error: {wrapped.to_dict()}")
        raise wrapped

Agent Name Used as Tool Call

When an agent emits a tool_calls entry whose name matches a peer agent, MARSYS returns a targeted error instead of a generic "tool not found" message.

Current behavior:

  • The executor checks next-hop peer agents from topology context.
  • If the name is a peer agent, the response explains that peer agents are not tools.
  • The response includes the correct JSON pattern for agent invocation.

Use this form for peer-agent handoff:

{
  "thought": "Need specialized processing from another agent",
  "next_action": "invoke_agent",
  "action_input": [
    {
      "agent_name": "Analyzer",
      "request": "Analyze the uploaded dataset and report anomalies"
    }
  ]
}

Do not place peer-agent names inside tool_calls.
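
A minimal sketch of the check the executor performs (illustrative only; check_tool_call_target and peer_agents are assumptions, not the actual implementation):

def check_tool_call_target(tool_name: str, peer_agents: set[str]) -> str | None:
    """Return a targeted error message if a peer agent name was used as a tool."""
    if tool_name in peer_agents:
        return (
            f"'{tool_name}' is a peer agent, not a tool. "
            "To hand off, respond with next_action='invoke_agent' and "
            "action_input=[{'agent_name': ..., 'request': ...}]."
        )
    return None  # Not a peer agent; fall through to normal tool lookup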

Automatic Retry in API Adapters

Built-in Retry Logic

MARSYS automatically retries server-side API errors with exponential backoff at the adapter level. No manual retry logic needed for API calls!

How It Works

All API adapters (APIProviderAdapter and AsyncBaseAPIAdapter) automatically retry:

Retryable Status Codes:

  • 500 - Internal Server Error
  • 502 - Bad Gateway
  • 503 - Service Unavailable
  • 504 - Gateway Timeout
  • 529 - Overloaded (Anthropic)
  • 408 - Request Timeout (OpenRouter)
  • 429 - Rate Limit (respects retry-after header)

Configuration:

  • Max retries: 3 (4 total attempts)
  • Base delay: 1 second
  • Backoff schedule: 1s, 2s, 4s (exponential)

# Example log output during retry:
# 2025-11-05 00:58:46 - WARNING - Server error 503 from gpt-5.3-codex. Retry 1/3 after 1.0s
# 2025-11-05 00:58:47 - WARNING - Server error 503 from gpt-5.3-codex. Retry 2/3 after 2.0s
# 2025-11-05 00:58:49 - INFO - Request successful after 2 retries
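
In sketch form, the adapter's retry loop behaves like the following (illustrative; send stands in for the provider HTTP call and is an assumption, not the adapter's actual API):

import asyncio

RETRYABLE_STATUS = {500, 502, 503, 504, 529, 408, 429}

async def request_with_retry(send, max_retries: int = 3, base_delay: float = 1.0):
    """Retry retryable status codes with exponential backoff (1s, 2s, 4s)."""
    for attempt in range(max_retries + 1):  # max_retries=3 -> 4 total attempts
        response = await send()
        if response.status_code not in RETRYABLE_STATUS or attempt == max_retries:
            return response
        # For 429, respect the server's retry-after header when present
        delay = float(response.headers.get("retry-after", base_delay * 2 ** attempt))
        await asyncio.sleep(delay)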

Provider-Specific Behavior

Adapters also tune retry delays per status code. An example mapping:

Retryable errors:

  • 408: Request Timeout -> retry after 5s
  • 429: Rate Limit -> respect X-RateLimit-Reset header
  • 502: Bad Gateway (provider error) -> retry after 5s
  • 503: Service Unavailable -> retry after 10s
  • 500+: Server errors -> retry after 10s

Non-retryable errors (fail immediately):

  • 400: Bad Request (malformed request)
  • 401: Unauthorized (invalid API key)
  • 402: Insufficient Credits
  • 403: Forbidden (moderation flagged)
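
A sketch of how such a mapping might look in code (classify_status is illustrative, not the adapter's actual API):

def classify_status(code: int, headers: dict) -> tuple[bool, float]:
    """Map an HTTP status to (retryable, delay_seconds) per the table above."""
    if code == 408:
        return True, 5.0
    if code == 429:
        # Respect the rate-limit reset header when present
        return True, float(headers.get("X-RateLimit-Reset", 5.0))
    if code == 502:
        return True, 5.0
    if code >= 500:
        return True, 10.0
    return False, 0.0  # 400/401/402/403 etc. fail immediately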

Payload Too Large Recovery

MARSYS has a dedicated recovery path for oversized request payloads (commonly image-heavy traces).

Classification rules:

  • HTTP 413 is classified provider-agnostically as request_too_large
  • Message-text override also classifies request_too_large when payload hints appear, including: request_too_large, payload too large, request exceeds the maximum, request body is too large

Execution flow (a condensed code sketch follows the list):

  • Provider adapter / BaseAPIModel raises ModelAPIError with classification request_too_large
  • Agent._run() re-raises this specific error (instead of converting to an error Message)
  • Agent.run_step() catches it, triggers memory.compact_for_payload_error(...), re-prepares messages, and retries
  • Retry is bounded to a hard cap of 2 attempts (_MAX_PAYLOAD_RETRIES = 2)
  • If compaction does not reduce payload enough (or retries are exhausted), the agent returns an error Message
  • Coordination treats request_too_large as terminal (terminal_error) to avoid endless retry loops
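
A condensed sketch of this flow (ModelAPIError, compact_for_payload_error, and _MAX_PAYLOAD_RETRIES come from the flow above; _error_message and _prepare_messages are illustrative stand-ins):

_MAX_PAYLOAD_RETRIES = 2  # hard retry cap from the flow above

async def run_step(self, messages):
    for attempt in range(_MAX_PAYLOAD_RETRIES + 1):
        try:
            return await self._run(messages)  # re-raises request_too_large
        except ModelAPIError as e:
            if e.classification != "request_too_large":
                raise
            if attempt == _MAX_PAYLOAD_RETRIES:
                # Compaction could not shrink the payload enough;
                # coordination treats this as terminal_error
                return self._error_message(e)
            # Drop heavy content (e.g. images) and rebuild the request
            self.memory.compact_for_payload_error(e)
            messages = self._prepare_messages()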

Payload Compaction Notes

The payload compaction path is separate from adapter-level transient retries (5xx/429). Recovery compaction success is measured by payload-byte reduction, not just token estimates.

When to Use Manual Retry

The built-in adapter retry handles transient API errors. Use manual retry for:

  • Application-level errors (business logic failures)
  • Multi-step workflows (coordinating multiple agents)
  • Custom retry policies (non-standard backoff)

Manual Retry with Exponential Backoff

For application-level retry needs:

import asyncio
import logging
import random
from typing import Any, Callable

logger = logging.getLogger(__name__)

class RetryHandler:
    """Intelligent retry with exponential backoff."""

    def __init__(
        self,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        exponential_base: float = 2.0
    ):
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.exponential_base = exponential_base

    async def execute_with_retry(
        self,
        func: Callable,
        *args,
        **kwargs
    ) -> Any:
        """Execute function with retry logic."""
        last_exception = None
        for attempt in range(self.max_retries):
            try:
                return await func(*args, **kwargs)
            except RateLimitError as e:
                # Use API-provided retry delay if available
                delay = e.context.get("retry_after", self._calculate_delay(attempt))
                logger.info(f"Rate limited. Retry {attempt + 1}/{self.max_retries} in {delay}s")
                await asyncio.sleep(delay)
                last_exception = e
            except TimeoutError as e:
                if not e.recoverable:
                    raise
                delay = self._calculate_delay(attempt)
                logger.warning(f"Timeout. Retry {attempt + 1}/{self.max_retries} in {delay}s")
                await asyncio.sleep(delay)
                last_exception = e
            except APIError as e:
                # Note: API adapters already retry server errors automatically.
                # This manual retry is for application-level API error handling.
                if e.context.get("status_code") in [500, 502, 503, 504]:
                    delay = self._calculate_delay(attempt)
                    logger.warning(
                        f"Server error (after adapter retries). "
                        f"Retry {attempt + 1}/{self.max_retries} in {delay}s"
                    )
                    await asyncio.sleep(delay)
                    last_exception = e
                else:
                    raise
            except Exception:
                # Don't retry unexpected errors
                raise

        # Max retries exceeded
        raise ExecutionError(
            f"Max retries ({self.max_retries}) exceeded",
            context={"last_error": str(last_exception)},
            suggestion="Consider increasing timeout or using a different approach"
        )

    def _calculate_delay(self, attempt: int) -> float:
        """Calculate exponential backoff delay."""
        delay = self.base_delay * (self.exponential_base ** attempt)
        # Add jitter to avoid thundering-herd retries
        delay = delay * (0.5 + random.random())
        return min(delay, self.max_delay)
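
Usage sketch (reuses agent, task, and context from the earlier examples):

handler = RetryHandler(max_retries=3, base_delay=1.0)
result = await handler.execute_with_retry(agent.run, task, context)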

Circuit Breaker Pattern

import logging
from datetime import datetime
from typing import Any, Callable

logger = logging.getLogger(__name__)

class CircuitBreaker:
    """Prevent cascading failures with a circuit breaker."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
        expected_exception: type = APIError
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """Execute function with circuit breaker protection."""
        # Check circuit state
        if self.state == "open":
            if self._should_attempt_reset():
                self.state = "half-open"
            else:
                raise ExecutionError(
                    "Circuit breaker is open",
                    context={"failure_count": self.failure_count},
                    suggestion=f"Wait {self.recovery_timeout}s for recovery"
                )

        # Execute function
        try:
            result = await func(*args, **kwargs)
            # Success - reset circuit
            if self.state == "half-open":
                self._reset()
            return result
        except self.expected_exception:
            self._record_failure()
            if self.failure_count >= self.failure_threshold:
                self._trip()
            raise

    def _should_attempt_reset(self) -> bool:
        """Check if we should try to reset the circuit."""
        return (
            self.last_failure_time is not None and
            (datetime.now() - self.last_failure_time).total_seconds()
            >= self.recovery_timeout
        )

    def _record_failure(self):
        """Record a failure."""
        self.failure_count += 1
        self.last_failure_time = datetime.now()

    def _reset(self):
        """Reset circuit to closed state."""
        self.failure_count = 0
        self.last_failure_time = None
        self.state = "closed"

    def _trip(self):
        """Trip circuit to open state."""
        self.state = "open"
        logger.warning(
            f"Circuit breaker tripped after {self.failure_count} failures"
        )
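
A breaker is typically created once per fragile dependency and reused across calls. Usage sketch (model_client.complete is an illustrative callable):

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60.0)
result = await breaker.call(model_client.complete, prompt)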

Error Configuration

ErrorHandlingConfig

from marsys.coordination.config import ErrorHandlingConfig

config = ErrorHandlingConfig(
    # Classification
    use_error_classification=True,
    classify_as_critical=["PermissionError", "ConfigurationError"],
    classify_as_recoverable=["RateLimitError", "TimeoutError"],

    # Notifications
    notify_on_critical_errors=True,
    notification_channels=["terminal", "log"],

    # Recovery
    enable_error_routing=True,  # Route to User node
    error_recovery_timeout=300.0,

    # Retry settings
    auto_retry_on_rate_limits=True,
    max_rate_limit_retries=3,
    base_retry_delay=1.0,
    max_retry_delay=60.0,

    # Circuit breaker
    enable_circuit_breaker=True,
    circuit_breaker_threshold=5,
    circuit_breaker_timeout=60.0,

    # Provider-specific
    provider_settings={
        "openai": {
            "max_retries": 3,
            "base_retry_delay": 60,
            "rate_limit_strategy": "exponential_backoff",
            "insufficient_quota_action": "fallback"
        },
        "anthropic": {
            "max_retries": 2,
            "base_retry_delay": 30,
            "rate_limit_strategy": "fixed_delay",
            "insufficient_quota_action": "raise"
        },
        "google": {
            "max_retries": 3,
            "base_retry_delay": 45,
            "rate_limit_strategy": "exponential_backoff",
            "insufficient_quota_action": "queue"
        }
    }
)

Error Recovery Strategies

User-Driven Recovery

Route errors to User node for human intervention:

from typing import Any, Dict, List, Optional

class UserRecoveryHandler:
    """Handle error recovery through user interaction."""

    async def handle_error_with_user(
        self,
        error: AgentFrameworkError,
        context: Dict[str, Any],
        topology: Topology
    ) -> Optional[Any]:
        """Route error to the User node for recovery."""
        if not topology.has_node("User"):
            return None

        # Format error for user
        error_message = self._format_error_for_user(error)

        # Create recovery options
        options = self._get_recovery_options(error)

        # Ask user for decision
        user_prompt = f"""
{error_message}

How would you like to proceed?
{self._format_options(options)}
"""

        # Get user response
        response = await self._get_user_input(user_prompt, context)

        # Execute recovery based on user choice
        return await self._execute_recovery(response, error, context)

    def _get_recovery_options(self, error: AgentFrameworkError) -> List[str]:
        """Get recovery options based on error type."""
        options = ["Abort execution"]
        if error.recoverable:
            options.insert(0, "Retry operation")
        if isinstance(error, ValidationError):
            options.insert(0, "Provide corrected input")
        if isinstance(error, PermissionError):
            options.insert(0, "Grant permission")
        if isinstance(error, RateLimitError):
            options.insert(0, "Wait and retry")
        return options

Fallback Strategies

import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)

class FallbackHandler:
    """Implement fallback strategies for errors."""

    def __init__(self, fallback_map: Dict[str, List[str]]):
        self.fallback_map = fallback_map  # Primary -> [fallbacks]

    async def execute_with_fallback(
        self,
        primary_agent: str,
        task: str,
        context: Dict[str, Any]
    ) -> Any:
        """Execute with fallback agents on failure."""
        agents_to_try = [primary_agent] + self.fallback_map.get(primary_agent, [])
        last_error = None
        for agent_name in agents_to_try:
            try:
                agent = AgentRegistry.get(agent_name)
                if not agent:
                    continue  # Unknown agent; try the next fallback
                result = await agent.run(task, context)
                if agent_name != primary_agent:
                    logger.info(f"Fallback to {agent_name} succeeded")
                return result
            except AgentFrameworkError as e:
                logger.warning(f"Agent {agent_name} failed: {e.message}")
                last_error = e
                continue

        # No agent succeeded (or none were registered); fail loudly
        # rather than silently returning None
        raise ExecutionError(
            f"All agents failed: {', '.join(agents_to_try)}",
            context={"last_error": str(last_error)},
            suggestion="Consider revising the task or adding more fallback agents"
        )
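
Usage sketch (agent names are illustrative):

fallbacks = FallbackHandler({"Analyzer": ["BackupAnalyzer", "GeneralistAgent"]})
result = await fallbacks.execute_with_fallback("Analyzer", task, context)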

Best Practices

1. Use Specific Exceptions

# GOOD - Specific exception with context
raise MessageFormatError(
    "Invalid JSON in model response",
    context={
        "response": response[:200],
        "expected": "JSON object with 'next_action' field"
    },
    suggestion="Ensure model is prompted to return valid JSON",
    recoverable=True
)

# BAD - Generic exception
raise Exception("Invalid response")

2. Provide Recovery Information

# GOOD - Actionable error
raise RateLimitError(
    "OpenAI API rate limit exceeded",
    context={
        "limit": "10000 tokens/min",
        "used": "10500 tokens",
        "retry_after": 60
    },
    suggestion="Wait 60 seconds or use a different API key",
    recoverable=True
)

# BAD - No recovery info
raise APIError("Rate limit hit")

3. Chain Exceptions

import traceback

# GOOD - Preserve original context
try:
    result = await risky_operation()
except ValueError as e:
    raise ValidationError(
        f"Operation failed: {str(e)}",
        context={"original_error": str(e), "traceback": traceback.format_exc()}
    ) from e

# BAD - Lost context
try:
    result = await risky_operation()
except:
    raise ValidationError("Operation failed")

4. Log at Appropriate Levels

# GOOD - Appropriate logging
try:
    result = await operation()
except ValidationError as e:
    logger.warning(f"Validation failed: {e.message}")  # Warning for recoverable
except PermissionError as e:
    logger.error(f"Permission denied: {e.message}")    # Error for critical
except Exception as e:
    logger.exception(f"Unexpected error: {e}")         # Full traceback for unexpected

5. Don't Swallow Errors

Always either handle an error appropriately or re-raise it. Silent failures make debugging extremely difficult in multi-agent workflows.
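
A minimal illustration in the GOOD/BAD style used above:

# GOOD - Handle, then re-raise so the coordinator sees the failure
try:
    result = await operation()
except APIError as e:
    logger.error(f"API call failed: {e}")
    raise

# BAD - Silent failure hides the problem
try:
    result = await operation()
except Exception:
    pass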

Error Handling Ready!

You now understand MARSYS error handling. Robust error management ensures your multi-agent workflows are resilient and maintainable.