Models API
Complete API documentation for the MARSYS model system, providing unified interfaces for local and API-based language models.
See Also
For guidance on choosing models, see the Models Concept Guide.
ModelConfig
Configuration schema for all model types using Pydantic validation.
Class Definition
```python
from marsys.models import ModelConfig

class ModelConfig(BaseModel):
    # Core settings
    type: Literal["local", "api"]            # Model type
    name: str                                # Model identifier

    # API settings
    provider: Optional[str] = None           # openai, anthropic, google, openrouter
    base_url: Optional[str] = None           # Custom API endpoint
    api_key: Optional[str] = None            # API key (auto-loads from env)

    # Generation parameters
    max_tokens: int = 8192
    temperature: float = 0.7
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0

    # Reasoning parameters
    thinking_budget: Optional[int] = 1024
    reasoning_effort: Optional[str] = "low"  # low, medium, high

    # Local model settings
    model_class: Optional[Literal["llm", "vlm"]] = None
    backend: Optional[Literal["huggingface", "vllm"]] = "huggingface"
    torch_dtype: str = "auto"
    device_map: str = "auto"

    # vLLM-specific settings
    tensor_parallel_size: Optional[int] = 1
    gpu_memory_utilization: Optional[float] = 0.9
    quantization: Optional[Literal["awq", "gptq", "fp8"]] = None

    # Additional parameters
    parameters: Dict[str, Any] = {}
```
Usage Examples
```python
from marsys.models import ModelConfig

# Anthropic Claude Haiku
claude_haiku_config = ModelConfig(
    type="api",
    provider="openrouter",
    name="anthropic/claude-haiku-4.5",
    temperature=0.7,
    max_tokens=12000
)

# Anthropic Claude Sonnet
claude_config = ModelConfig(
    type="api",
    provider="openrouter",
    name="anthropic/claude-sonnet-4.5",
    temperature=0.5,
    max_tokens=12000
)

# Local LLM (HuggingFace backend)
llm_config = ModelConfig(
    type="local",
    name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",
    backend="huggingface",  # Default, can be omitted
    torch_dtype="bfloat16",
    device_map="auto",
    max_tokens=4096
)

# Local VLM (vLLM backend for production)
vlm_config = ModelConfig(
    type="local",
    name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    backend="vllm",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    quantization="fp8",
    max_tokens=4096
)

# Custom API endpoint
custom_config = ModelConfig(
    type="api",
    name="custom-model",
    base_url="https://api.mycompany.com/v1",
    api_key="custom-key",
    parameters={"custom_param": "value"}
)
```
Local Model Architecture
MARSYS uses an adapter pattern for local models, supporting two backends:
```
                ┌────────────────────────────┐
                │       BaseLocalModel       │
                │    (Unified Interface)     │
                └─────────────┬──────────────┘
                              │
                ┌─────────────┴──────────────┐
                │    LocalAdapterFactory     │
                └─────────────┬──────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│  HuggingFaceLLM  │ │  HuggingFaceVLM  │ │   VLLMAdapter    │
│     Adapter      │ │     Adapter      │ │   (LLM & VLM)    │
└──────────────────┘ └──────────────────┘ └──────────────────┘
```
BaseLocalModel
Unified interface for local models. Recommended for most use cases.
```python
from marsys.models import BaseLocalModel

# HuggingFace backend (development)
model = BaseLocalModel(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",
    backend="huggingface",
    torch_dtype="bfloat16",
    device_map="auto",
    max_tokens=4096
)

response = model.run(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
print(response["content"])

# vLLM backend (production)
vlm_model = BaseLocalModel(
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    backend="vllm",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_tokens=4096
)
```
LocalProviderAdapter
Abstract base class for local model adapters. Used internally by BaseLocalModel.
```python
class LocalProviderAdapter(ABC):
    """Abstract base class for local model provider adapters."""

    # Training access (HuggingFace only)
    model: Any = None        # Raw PyTorch model
    tokenizer: Any = None    # HuggingFace tokenizer

    @property
    def supports_training(self) -> bool:
        """True for HuggingFace adapters, False for vLLM."""

    @property
    def backend(self) -> str:
        """Backend name: 'huggingface' or 'vllm'."""
```
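A short sketch of how these attributes are typically consulted. This is illustrative only: `adapter` is any local adapter instance, and `fine_tune` is a hypothetical user-defined routine, not part of MARSYS.

```python
# Sketch: gate training access on the adapter's capabilities.
# `fine_tune` is a hypothetical user-defined routine, not part of MARSYS.
if adapter.supports_training:
    pytorch_model = adapter.model    # raw PyTorch model (HuggingFace backends only)
    tokenizer = adapter.tokenizer    # HuggingFace tokenizer
    fine_tune(pytorch_model, tokenizer)
else:
    # vLLM adapters expose inference only
    raise RuntimeError(f"Backend '{adapter.backend}' does not expose training hooks")
```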
HuggingFaceLLMAdapter
Adapter for text-only language models using HuggingFace transformers.
```python
from marsys.models import HuggingFaceLLMAdapter

adapter = HuggingFaceLLMAdapter(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_tokens=4096,
    torch_dtype="bfloat16",
    device_map="auto",
    thinking_budget=256,
    trust_remote_code=True
)

# Access for training
pytorch_model = adapter.model    # AutoModelForCausalLM
tokenizer = adapter.tokenizer    # AutoTokenizer
```
HuggingFaceVLMAdapter
Adapter for vision-language models using HuggingFace transformers.
```python
from marsys.models import HuggingFaceVLMAdapter

adapter = HuggingFaceVLMAdapter(
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    max_tokens=4096,
    torch_dtype="bfloat16",
    device_map="auto",
    thinking_budget=256
)

# Process images in messages
response = adapter.run(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}}
            ]
        }
    ]
)
```
VLLMAdapter
Adapter for high-throughput production inference using vLLM.
```python
from marsys.models import VLLMAdapter

adapter = VLLMAdapter(
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    max_tokens=4096,
    tensor_parallel_size=2,        # Multi-GPU
    gpu_memory_utilization=0.9,    # Memory fraction
    quantization="fp8",            # awq, gptq, fp8
    trust_remote_code=True
)

# Note: vLLM doesn't support training
assert not adapter.supports_training
```
LocalAdapterFactory
Factory that creates the appropriate adapter for the requested backend and model class.
```python
from marsys.models import LocalAdapterFactory

# Create HuggingFace LLM adapter
adapter = LocalAdapterFactory.create_adapter(
    backend="huggingface",
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",
    torch_dtype="bfloat16",
    device_map="auto"
)

# Create vLLM VLM adapter
adapter = LocalAdapterFactory.create_adapter(
    backend="vllm",
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    tensor_parallel_size=2
)
```
BaseAPIModel
Base class for API-based models (OpenAI, Anthropic, etc.).
```python
from marsys.models import BaseAPIModel

api_model = BaseAPIModel(
    model_name="anthropic/claude-sonnet-4.5",
    provider="openrouter",
    max_tokens=12000,
    temperature=0.7
)

response = await api_model.run(
    messages=[{"role": "user", "content": "Explain machine learning"}]
)
```
Supported Providers
| Provider | Models | Environment Variable |
|---|---|---|
| openrouter | All major models | OPENROUTER_API_KEY |
| openai | gpt-5, gpt-5-mini, gpt-5-chat, etc. | OPENAI_API_KEY |
| anthropic | claude-haiku-4.5, claude-sonnet-4.5, etc. | ANTHROPIC_API_KEY |
| google | gemini-2.5-pro, gemini-2.5-flash, etc. | GOOGLE_API_KEY |
| xai | grok-4, grok-4-fast, grok-3, etc. | XAI_API_KEY |
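Because the api_key field auto-loads from the matching environment variable (see the ModelConfig schema above), a provider can usually be configured without passing a key explicitly. A minimal sketch, assuming OPENROUTER_API_KEY is already exported in your shell or CI environment:

```python
from marsys.models import ModelConfig

# Assumes OPENROUTER_API_KEY is set in the environment;
# api_key is omitted and resolved automatically per the schema above.
config = ModelConfig(
    type="api",
    provider="openrouter",
    name="google/gemini-2.5-flash",
)
```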
OpenRouter Recommended
OpenRouter is recommended as it provides access to multiple providers through a single API, making it easy to switch between models.
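For example, switching models through OpenRouter typically means changing only the `name`; the rest of the configuration stays the same. A sketch using models from the table above:

```python
from marsys.models import ModelConfig

# Shared settings; only the model name changes between configs.
base = dict(type="api", provider="openrouter", max_tokens=12000, temperature=0.7)

sonnet_config = ModelConfig(name="anthropic/claude-sonnet-4.5", **base)
gemini_config = ModelConfig(name="google/gemini-2.5-flash", **base)
```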
Error Handling
Built-in Resilience
API adapters automatically retry transient server errors with exponential backoff. No manual retry needed for API calls!
Automatic Retry Behavior
Configuration:

- Max retries: 3 (4 attempts total)
- Backoff: 1s → 2s → 4s (exponential)
Retryable Status Codes:
- 500 - Internal Server Error
- 502 - Bad Gateway
- 503 - Service Unavailable
- 504 - Gateway Timeout
- 529 - Overloaded (Anthropic)
- 408 - Request Timeout (OpenRouter)
- 429 - Rate Limit (respects retry-after header)
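The snippet below is an illustrative sketch of the documented schedule (3 retries, exponential 1s → 2s → 4s backoff), not the actual adapter internals; `call_api` and `is_retryable` are hypothetical placeholders.

```python
import asyncio

# Illustrative only: mirrors the documented retry schedule, not MARSYS internals.
# `call_api` and `is_retryable` are hypothetical placeholders.
async def run_with_retries(call_api, is_retryable, max_retries: int = 3):
    for attempt in range(max_retries + 1):       # 4 total attempts
        try:
            return await call_api()
        except Exception as exc:
            if attempt == max_retries or not is_retryable(exc):
                raise                            # retries exhausted or non-retryable error
            await asyncio.sleep(2 ** attempt)    # 1s, 2s, 4s
```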
Provider-Specific Retry Behavior
| Provider | Retryable Errors | Non-Retryable |
|---|---|---|
| OpenRouter | 408, 429, 502, 503, 500+ | 400, 401, 402, 403 |
| OpenAI | 429, 500, 502, 503 | 400, 401, 404 |
| Anthropic | 429, 500, 529 | 400, 401, 403, 413 |
| Google | 429, 500, 503, 504 | 400, 403, 404 |
Manual Error Handling
For errors that aren't automatically retried (client errors, quota issues, etc.):
```python
from marsys.agents.exceptions import (
    ModelError,
    ModelAPIError,
    ModelTimeoutError,
    ModelRateLimitError,
    ModelTokenLimitError
)

try:
    response = await model.run(messages)
except ModelRateLimitError as e:
    # Rate limits are auto-retried, but if exhausted:
    logger.error("Rate limit exceeded after retries")
    if e.retry_after:
        logger.info(f"Retry after {e.retry_after}s")
except ModelTokenLimitError as e:
    # Token limit requires reducing input
    logger.warning(f"Token limit exceeded: {e.message}")
    # truncate_messages is a user-defined helper that trims messages to fit e.limit
    messages = truncate_messages(messages, e.limit)
    response = await model.run(messages)
except ModelAPIError as e:
    # Check error classification
    logger.error(f"API error: {e.status_code} - {e.message}")
```