Models API

Complete API documentation for the MARSYS model system, providing unified interfaces for local and API-based language models.

See Also

For guidance on choosing models, see the Models Concept Guide.

ModelConfig

Configuration schema for all model types using Pydantic validation.

Class Definition

from marsys.models import ModelConfig

class ModelConfig(BaseModel):
    # Core settings
    type: Literal["local", "api"]                # Model type
    name: str                                    # Model identifier

    # API settings
    provider: Optional[str] = None               # openai, anthropic, google, openrouter
    base_url: Optional[str] = None               # Custom API endpoint
    api_key: Optional[str] = None                # API key (auto-loads from env)

    # Generation parameters
    max_tokens: int = 8192
    temperature: float = 0.7
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0

    # Reasoning parameters
    thinking_budget: Optional[int] = 1024
    reasoning_effort: Optional[str] = "low"      # low, medium, high

    # Local model settings
    model_class: Optional[Literal["llm", "vlm"]] = None
    backend: Optional[Literal["huggingface", "vllm"]] = "huggingface"
    torch_dtype: str = "auto"
    device_map: str = "auto"

    # vLLM-specific settings
    tensor_parallel_size: Optional[int] = 1
    gpu_memory_utilization: Optional[float] = 0.9
    quantization: Optional[Literal["awq", "gptq", "fp8"]] = None

    # Additional parameters
    parameters: Dict[str, Any] = {}

Usage Examples

from marsys.models import ModelConfig

# Anthropic Claude Haiku
claude_haiku_config = ModelConfig(
    type="api",
    provider="openrouter",
    name="anthropic/claude-haiku-4.5",
    temperature=0.7,
    max_tokens=12000
)

# Anthropic Claude Sonnet
claude_config = ModelConfig(
    type="api",
    provider="openrouter",
    name="anthropic/claude-sonnet-4.5",
    temperature=0.5,
    max_tokens=12000
)

# Local LLM (HuggingFace backend)
llm_config = ModelConfig(
    type="local",
    name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",
    backend="huggingface",  # Default, can be omitted
    torch_dtype="bfloat16",
    device_map="auto",
    max_tokens=4096
)

# Local VLM (vLLM backend for production)
vlm_config = ModelConfig(
    type="local",
    name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    backend="vllm",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    quantization="fp8",
    max_tokens=4096
)

# Custom API endpoint
custom_config = ModelConfig(
    type="api",
    name="custom-model",
    base_url="https://api.mycompany.com/v1",
    api_key="custom-key",
    parameters={"custom_param": "value"}
)

Local Model Architecture

MARSYS uses an adapter pattern for local models, supporting two backends:

               ┌──────────────────────────────┐
               │        BaseLocalModel        │
               │     (Unified Interface)      │
               └──────────────┬───────────────┘
               ┌──────────────┴───────────────┐
               │     LocalAdapterFactory      │
               └──────────────┬───────────────┘
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│  HuggingFaceLLM  │ │  HuggingFaceVLM  │ │   VLLMAdapter    │
│     Adapter      │ │     Adapter      │ │   (LLM & VLM)    │
└──────────────────┘ └──────────────────┘ └──────────────────┘

BaseLocalModel

Unified interface for local models. Recommended for most use cases.

from marsys.models import BaseLocalModel

# HuggingFace backend (development)
model = BaseLocalModel(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",
    backend="huggingface",
    torch_dtype="bfloat16",
    device_map="auto",
    max_tokens=4096
)

response = model.run(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
print(response["content"])

# vLLM backend (production)
vlm_model = BaseLocalModel(
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    backend="vllm",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_tokens=4096
)

LocalProviderAdapter

Abstract base class for local model adapters. Used internally by BaseLocalModel.

class LocalProviderAdapter(ABC):
    """Abstract base class for local model provider adapters."""

    # Training access (HuggingFace only)
    model: Any = None        # Raw PyTorch model
    tokenizer: Any = None    # HuggingFace tokenizer

    @property
    def supports_training(self) -> bool:
        """True for HuggingFace adapters, False for vLLM."""

    @property
    def backend(self) -> str:
        """Backend name: 'huggingface' or 'vllm'."""

HuggingFaceLLMAdapter

Adapter for text-only language models using HuggingFace transformers.

from marsys.models import HuggingFaceLLMAdapter

adapter = HuggingFaceLLMAdapter(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_tokens=4096,
    torch_dtype="bfloat16",
    device_map="auto",
    thinking_budget=256,
    trust_remote_code=True
)

# Access for training
pytorch_model = adapter.model      # AutoModelForCausalLM
tokenizer = adapter.tokenizer      # AutoTokenizer
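
Because adapter.model and adapter.tokenizer are ordinary transformers objects, they can be used directly with the HuggingFace ecosystem. A sketch using standard transformers calls (not MARSYS-specific APIs):

import torch

# Tokenize a prompt and run the underlying model directly
inputs = tokenizer("Summarize: adapters decouple inference backends.", return_tensors="pt")
inputs = {k: v.to(pytorch_model.device) for k, v in inputs.items()}

with torch.no_grad():
    output_ids = pytorch_model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))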

HuggingFaceVLMAdapter

Adapter for vision-language models using HuggingFace transformers.

from marsys.models import HuggingFaceVLMAdapter

adapter = HuggingFaceVLMAdapter(
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    max_tokens=4096,
    torch_dtype="bfloat16",
    device_map="auto",
    thinking_budget=256
)

# Process images in messages
response = adapter.run(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}}
            ]
        }
    ]
)

VLLMAdapter

Adapter for high-throughput production inference using vLLM.

from marsys.models import VLLMAdapter

adapter = VLLMAdapter(
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    max_tokens=4096,
    tensor_parallel_size=2,        # Multi-GPU
    gpu_memory_utilization=0.9,    # Memory fraction
    quantization="fp8",            # awq, gptq, fp8
    trust_remote_code=True
)

# Note: vLLM doesn't support training
assert not adapter.supports_training

LocalAdapterFactory

Factory that creates the appropriate adapter for a given backend and model class.

from marsys.models import LocalAdapterFactory

# Create HuggingFace LLM adapter
adapter = LocalAdapterFactory.create_adapter(
    backend="huggingface",
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",
    torch_dtype="bfloat16",
    device_map="auto"
)

# Create vLLM VLM adapter
adapter = LocalAdapterFactory.create_adapter(
    backend="vllm",
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    tensor_parallel_size=2
)

BaseAPIModel

Base class for API-based models (OpenAI, Anthropic, etc.).

from marsys.models import BaseAPIModel

api_model = BaseAPIModel(
    model_name="anthropic/claude-sonnet-4.5",
    provider="openrouter",
    max_tokens=12000,
    temperature=0.7
)

response = await api_model.run(
    messages=[
        {"role": "user", "content": "Explain machine learning"}
    ]
)
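
BaseAPIModel.run is a coroutine, so outside of async code it needs to be wrapped in an event loop. A minimal sketch (it assumes the response exposes a "content" key, mirroring the local model example above):

import asyncio

from marsys.models import BaseAPIModel

async def main() -> None:
    api_model = BaseAPIModel(
        model_name="anthropic/claude-sonnet-4.5",
        provider="openrouter",
        max_tokens=12000
    )
    response = await api_model.run(
        messages=[{"role": "user", "content": "Explain machine learning"}]
    )
    # Assumption: API responses expose "content" like local models do
    print(response["content"])

if __name__ == "__main__":
    asyncio.run(main())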

Supported Providers

Provider    | Models                                     | Environment Variable
------------|--------------------------------------------|---------------------
openrouter  | All major models                           | OPENROUTER_API_KEY
openai      | gpt-5, gpt-5-mini, gpt-5-chat, etc.        | OPENAI_API_KEY
anthropic   | claude-haiku-4.5, claude-sonnet-4.5, etc.  | ANTHROPIC_API_KEY
google      | gemini-2.5-pro, gemini-2.5-flash, etc.     | GOOGLE_API_KEY
xai         | grok-4, grok-4-fast, grok-3, etc.          | XAI_API_KEY
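
When the matching environment variable is set, the api_key field can typically be left out of the config (see the "auto-loads from env" note on ModelConfig above). A sketch, with a placeholder key value:

import os

from marsys.models import ModelConfig

# Key is read from the environment, so it never has to appear in code
os.environ.setdefault("OPENROUTER_API_KEY", "your-key-here")

config = ModelConfig(
    type="api",
    provider="openrouter",
    name="google/gemini-2.5-flash"
    # api_key omitted: auto-loaded from OPENROUTER_API_KEY
)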

OpenRouter Recommended

OpenRouter is recommended as it provides access to multiple providers through a single API, making it easy to switch between models.

Error Handling

Built-in Resilience

API adapters automatically retry transient server errors with exponential backoff. No manual retry needed for API calls!

Automatic Retry Behavior

Configuration:

  • Max retries: 3 (total 4 attempts)
  • Backoff: 1s → 2s → 4s (exponential)

Retryable Status Codes:

  • 500 - Internal Server Error
  • 502 - Bad Gateway
  • 503 - Service Unavailable
  • 504 - Gateway Timeout
  • 529 - Overloaded (Anthropic)
  • 408 - Request Timeout (OpenRouter)
  • 429 - Rate Limit (respects retry-after header)
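
For reference, the retry policy described above is roughly equivalent to the following sketch. It is illustrative only; the real logic lives inside the API adapters, and send_request here stands in for whatever function issues the HTTP call:

import asyncio
import random

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504, 529}
MAX_RETRIES = 3  # 4 attempts total

async def call_with_retries(send_request):
    """Illustrative exponential-backoff loop matching the documented policy."""
    for attempt in range(MAX_RETRIES + 1):
        response = await send_request()
        if response.status_code not in RETRYABLE_STATUS:
            return response
        if attempt == MAX_RETRIES:
            return response  # retries exhausted, surface the error
        # Respect the retry-after header (e.g. on 429) when the provider sends one
        retry_after = response.headers.get("retry-after")
        delay = float(retry_after) if retry_after else 2 ** attempt  # 1s, 2s, 4s
        await asyncio.sleep(delay + random.uniform(0, 0.1))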

Provider-Specific Retry Behavior

Provider    | Retryable Errors          | Non-Retryable
------------|---------------------------|--------------------
OpenRouter  | 408, 429, 502, 503, 500+  | 400, 401, 402, 403
OpenAI      | 429, 500, 502, 503        | 400, 401, 404
Anthropic   | 429, 500, 529             | 400, 401, 403, 413
Google      | 429, 500, 503, 504        | 400, 403, 404

Manual Error Handling

For errors that aren't automatically retried (client errors, quota issues, etc.):

import logging

from marsys.agents.exceptions import (
    ModelError,
    ModelAPIError,
    ModelTimeoutError,
    ModelRateLimitError,
    ModelTokenLimitError
)

logger = logging.getLogger(__name__)

try:
    response = await model.run(messages)
except ModelRateLimitError as e:
    # Rate limits are auto-retried, but if exhausted:
    logger.error("Rate limit exceeded after retries")
    if e.retry_after:
        logger.info(f"Retry after {e.retry_after}s")
except ModelTokenLimitError as e:
    # Token limit requires reducing input
    logger.warning(f"Token limit exceeded: {e.message}")
    messages = truncate_messages(messages, e.limit)  # truncate_messages: your own helper
    response = await model.run(messages)
except ModelAPIError as e:
    # Check error classification
    logger.error(f"API error: {e.status_code} - {e.message}")