Models API Reference
Complete API documentation for the MARSYS model system, providing unified interfaces for local and API-based language models.
Model Selection Guide
For guidance on choosing models and when to use VLM, see the Models Concept Guide.
ModelConfig
Configuration schema for all model types using Pydantic validation.
Class Definition
```python
from pydantic import BaseModel, Field
from typing import Literal, Optional, Dict, Any

class ModelConfig(BaseModel):
    """Unified configuration for all model types."""

    # Core settings
    type: Literal["local", "api"] = Field(description="Model type - local or API-based")
    name: str = Field(description="Model identifier or HuggingFace path")

    # API settings
    provider: Optional[str] = Field(
        default=None,
        description="API provider (openai, anthropic, google, openrouter, xai, openai-oauth, anthropic-oauth)"
    )
    base_url: Optional[str] = Field(default=None, description="Custom API endpoint URL")
    api_key: Optional[str] = Field(default=None, description="API key (auto-loaded from env if None)")
    oauth_profile: Optional[str] = Field(
        default=None,
        description="OAuth profile name for openai-oauth / anthropic-oauth"
    )

    # Generation parameters
    max_tokens: int = Field(default=8192, description="Maximum output tokens")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0, description="Sampling temperature")
    top_p: float = Field(default=1.0, ge=0.0, le=1.0, description="Nucleus sampling parameter")
    frequency_penalty: float = Field(default=0.0, ge=-2.0, le=2.0, description="Frequency penalty")
    presence_penalty: float = Field(default=0.0, ge=-2.0, le=2.0, description="Presence penalty")

    # Reasoning parameters
    thinking_budget: Optional[int] = Field(
        default=1024,
        description="Token budget for extended thinking (models with thinking support)"
    )
    reasoning_effort: Optional[str] = Field(
        default="low",
        description="Reasoning effort level (low, medium, high)"
    )

    # Local model settings
    model_class: Optional[Literal["llm", "vlm"]] = Field(
        default=None,
        description="Local model class (required for type='local')"
    )
    backend: Optional[Literal["huggingface", "vllm"]] = Field(
        default="huggingface",
        description="Backend: 'huggingface' (dev) or 'vllm' (production)"
    )
    torch_dtype: str = Field(default="auto", description="PyTorch dtype (auto, float16, bfloat16, float32)")
    device_map: str = Field(default="auto", description="Device mapping strategy (HuggingFace only)")

    # vLLM-specific settings
    tensor_parallel_size: Optional[int] = Field(
        default=1,
        description="Number of GPUs for tensor parallelism (vLLM only)"
    )
    gpu_memory_utilization: Optional[float] = Field(
        default=0.9,
        description="GPU memory utilization fraction 0-1 (vLLM only)"
    )
    quantization: Optional[Literal["awq", "gptq", "fp8"]] = Field(
        default=None,
        description="Quantization method (vLLM only)"
    )

    # Additional parameters
    parameters: Dict[str, Any] = Field(default_factory=dict, description="Provider-specific parameters")
```
Usage Examples
```python
from marsys.models import ModelConfig

# OpenAI GPT-5 Codex
gpt5_config = ModelConfig(
    type="api",
    provider="openrouter",
    name="openai/gpt-5-codex",
    temperature=0.7,
    max_tokens=12000
)

# Anthropic Claude Opus 4.6
claude_config = ModelConfig(
    type="api",
    provider="openrouter",
    name="anthropic/claude-opus-4.6",
    temperature=0.5,
    max_tokens=12000
)

# Local LLM (HuggingFace backend)
llm_config = ModelConfig(
    type="local",
    name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",
    backend="huggingface",  # Default, can be omitted
    torch_dtype="bfloat16",
    device_map="auto",
    max_tokens=4096
)

# Local VLM (vLLM backend for production)
vlm_config = ModelConfig(
    type="local",
    name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    backend="vllm",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    quantization="fp8",
    max_tokens=4096
)

# Custom API endpoint
custom_config = ModelConfig(
    type="api",
    name="custom-model",
    base_url="https://api.mycompany.com/v1",
    api_key="custom-key",
    parameters={"custom_param": "value"}
)
```
OAuth Providers (No API Keys)
MARSYS supports OAuth-backed providers that use local CLI credentials instead of API keys:
- `openai-oauth`: ChatGPT subscription via Codex CLI (`codex login`)
- `anthropic-oauth`: Claude Max subscription via Claude CLI (`claude login`)
Credentials are read from local files and can be overridden with environment variables:
- OpenAI OAuth: `~/.codex/auth.json` (override with `CODEX_AUTH_PATH`)
- Anthropic OAuth: `~/.claude/.credentials.json` (override with `CLAUDE_AUTH_PATH`)
OAuth profile resolution order:
1. Explicit `credentials_path` in `ModelConfig`
2. `oauth_profile` in `ModelConfig`
3. Provider default profile set via `marsys oauth set-default ...`
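As an illustration of the resolution order above, the priority chain can be sketched as a small helper. This is a hypothetical sketch, not the actual MARSYS internals; the function name and return format are invented for the example:

```python
from typing import Optional

def resolve_oauth_credentials(
    credentials_path: Optional[str] = None,
    oauth_profile: Optional[str] = None,
    default_profile: Optional[str] = None,
) -> str:
    """Pick the OAuth credential source, highest priority first.

    Mirrors the documented order: explicit path, then config profile,
    then the provider default set via `marsys oauth set-default`.
    """
    if credentials_path:        # 1. Explicit credentials_path in ModelConfig
        return f"path:{credentials_path}"
    if oauth_profile:           # 2. oauth_profile in ModelConfig
        return f"profile:{oauth_profile}"
    if default_profile:         # 3. Provider default profile
        return f"profile:{default_profile}"
    raise ValueError("No OAuth credentials configured")
```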
```python
# OpenAI ChatGPT OAuth (Codex CLI)
openai_oauth = ModelConfig(
    type="api",
    provider="openai-oauth",
    name="gpt-5.3-codex",
    credentials_path="~/.codex/auth.json"  # Optional override
)

# Anthropic Claude OAuth (Claude CLI)
anthropic_oauth = ModelConfig(
    type="api",
    provider="anthropic-oauth",
    name="claude-opus-4-6",
    credentials_path="~/.claude/.credentials.json"  # Optional override
)
```
Use At Your Own Risk (Anthropic OAuth)
anthropic-oauth relies on an unofficial integration path and may violate provider Terms of Service. Use at your own risk.
OpenAI OAuth Compliance
MARSYS does not make a legal determination about OpenAI ToS coverage for this OAuth path. Review OpenAI terms for your use case.
Model Classes
Local Model Architecture
MARSYS uses an adapter pattern for local models, supporting two backends:
```
                 ┌──────────────────────────────┐
                 │        BaseLocalModel        │
                 │     (Unified Interface)      │
                 └────────────┬─────────────────┘
                              │
                 ┌────────────┴─────────────────┐
                 │     LocalAdapterFactory      │
                 └────────────┬─────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│  HuggingFaceLLM  │ │  HuggingFaceVLM  │ │   VLLMAdapter    │
│     Adapter      │ │     Adapter      │ │   (LLM & VLM)    │
└──────────────────┘ └──────────────────┘ └──────────────────┘
```
BaseLocalModel
Unified interface for local models. Recommended for most use cases.
```python
from marsys.models import BaseLocalModel

class BaseLocalModel:
    """Base class for local models using adapter pattern."""

    def __init__(
        self,
        model_name: str,
        model_class: str = "llm",
        backend: str = "huggingface",
        max_tokens: int = 1024,
        thinking_budget: Optional[int] = None,
        **kwargs
    ):
        """Initialize local model.

        Args:
            model_name: HuggingFace model identifier
            model_class: "llm" or "vlm"
            backend: "huggingface" or "vllm"
            max_tokens: Maximum generation tokens
            thinking_budget: Token budget for thinking models
            **kwargs: Backend-specific parameters:
                - HuggingFace: torch_dtype, device_map, trust_remote_code
                - vLLM: tensor_parallel_size, gpu_memory_utilization, quantization
        """
```
Methods
run(messages, **kwargs) -> Dict[str, Any]
Execute the model synchronously.
Parameters:
- `messages` (List[Dict]): Conversation messages
- `json_mode` (bool): Enable JSON output mode
- `max_tokens` (Optional[int]): Override max tokens
- `tools` (Optional[List[Dict]]): Tool definitions
- `images` (Optional[List]): Images for VLM
- `**kwargs`: Additional generation parameters
Returns:
```python
{
    "role": "assistant",
    "content": "Generated response text",
    "thinking": "Optional thinking content for thinking models",
    "tool_calls": []
}
```
arun(messages, **kwargs) -> HarmonizedResponse
Execute the model asynchronously.
Example:
```python
from marsys.models import BaseLocalModel

# HuggingFace backend (development)
model = BaseLocalModel(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",
    backend="huggingface",
    torch_dtype="bfloat16",
    device_map="auto",
    max_tokens=4096
)

response = model.run(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
print(response["content"])

# vLLM backend (production)
vlm_model = BaseLocalModel(
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    backend="vllm",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_tokens=4096
)
```
LocalProviderAdapter
Abstract base class for local model adapters. Used internally by BaseLocalModel.
```python
class LocalProviderAdapter(ABC):
    """Abstract base class for local model provider adapters."""

    # Training access (HuggingFace only)
    model: Any = None      # Raw PyTorch model
    tokenizer: Any = None  # HuggingFace tokenizer

    @property
    def supports_training(self) -> bool:
        """True for HuggingFace adapters, False for vLLM."""

    @property
    def backend(self) -> str:
        """Backend name: 'huggingface' or 'vllm'."""
```
HuggingFaceLLMAdapter
Adapter for text-only language models using HuggingFace transformers.
```python
from marsys.models import HuggingFaceLLMAdapter

adapter = HuggingFaceLLMAdapter(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_tokens=4096,
    torch_dtype="bfloat16",
    device_map="auto",
    thinking_budget=256,
    trust_remote_code=True
)

# Access for training
pytorch_model = adapter.model  # AutoModelForCausalLM
tokenizer = adapter.tokenizer  # AutoTokenizer
```
HuggingFaceVLMAdapter
Adapter for vision-language models using HuggingFace transformers.
```python
from marsys.models import HuggingFaceVLMAdapter

adapter = HuggingFaceVLMAdapter(
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    max_tokens=4096,
    torch_dtype="bfloat16",
    device_map="auto",
    thinking_budget=256
)

# Process images in messages
response = adapter.run(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}}
        ]
    }]
)
```
VLLMAdapter
Adapter for high-throughput production inference using vLLM.
```python
from marsys.models import VLLMAdapter

adapter = VLLMAdapter(
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    max_tokens=4096,
    tensor_parallel_size=2,      # Multi-GPU
    gpu_memory_utilization=0.9,  # Memory fraction
    quantization="fp8",          # awq, gptq, fp8
    trust_remote_code=True
)

# Note: vLLM doesn't support training
assert not adapter.supports_training
```
LocalAdapterFactory
Factory to create the appropriate adapter.
```python
from marsys.models import LocalAdapterFactory

# Create HuggingFace LLM adapter
adapter = LocalAdapterFactory.create_adapter(
    backend="huggingface",
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",
    torch_dtype="bfloat16",
    device_map="auto"
)

# Create vLLM VLM adapter
adapter = LocalAdapterFactory.create_adapter(
    backend="vllm",
    model_name="Qwen/Qwen3-VL-8B-Instruct",
    model_class="vlm",
    tensor_parallel_size=2
)
```
BaseAPIModel
Base class for API-based models.
```python
class BaseAPIModel:
    """API model wrapper."""

    def __init__(
        self,
        provider: str,
        model_name: str,
        api_key: Optional[str] = None,
        base_url: Optional[str] = None,
        max_tokens: int = 1024,
        **kwargs
    ):
        """Initialize API model.

        Args:
            provider: API provider name
            model_name: Model identifier
            api_key: API key (auto-loaded from env if None)
            base_url: Custom endpoint URL
            max_tokens: Maximum tokens
            **kwargs: Provider-specific parameters
        """
```
Supported Providers
| Provider | Models | Environment Variable |
|---|---|---|
| openrouter | All major models | OPENROUTER_API_KEY |
| openai | gpt-5-codex, etc. | OPENAI_API_KEY |
| openai-oauth | gpt-5.3-codex | codex login (~/.codex/auth.json) |
| anthropic | claude-opus-4-6, claude-opus-4.6 (alias), etc. | ANTHROPIC_API_KEY |
| anthropic-oauth | claude-opus-4-6 | claude login (~/.claude/.credentials.json) |
| google | gemini-3-flash-preview, gemini-3-pro-preview, etc. | GOOGLE_API_KEY |
| xai | grok-4, grok-4-fast, grok-3, etc. | XAI_API_KEY |
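When `api_key` is None, the key is auto-loaded from the provider's environment variable per the table above. A minimal sketch of that lookup, assuming a simple provider-to-variable mapping (`PROVIDER_ENV_VARS` and `load_api_key` are illustrative names, not the actual MARSYS internals):

```python
import os
from typing import Optional

# Provider -> environment variable, following the table above.
PROVIDER_ENV_VARS = {
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "xai": "XAI_API_KEY",
}

def load_api_key(provider: str, explicit_key: Optional[str] = None) -> Optional[str]:
    """Return the explicit key if given, else fall back to the provider's env var.

    OAuth providers have no entry here: they read local CLI credential
    files instead of API keys.
    """
    if explicit_key:
        return explicit_key
    env_var = PROVIDER_ENV_VARS.get(provider)
    return os.environ.get(env_var) if env_var else None
```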
Methods
run(messages, **kwargs) -> Dict[str, Any]
Execute API model.
Parameters:
- `messages` (List[Dict]): Conversation messages
- `json_mode` (bool): Force JSON response (non-schema mode)
- `response_schema` (Optional[Dict]): Strict JSON schema for structured output
- `tools` (Optional[List[Dict]]): Function definitions
- `tool_choice` (Optional[str]): Tool selection strategy
- `**kwargs`: Provider-specific parameters
Example:
```python
from marsys.models import BaseAPIModel

model = BaseAPIModel(
    provider="openrouter",
    model_name="anthropic/claude-opus-4.6",
    temperature=0.7,
    max_tokens=12000
)

response = await model.run(
    messages=[{"role": "user", "content": "Hello!"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            }
        }
    }]
)

if response.get("tool_calls"):
    for tool_call in response["tool_calls"]:
        print(f"Tool: {tool_call['function']['name']}")
        print(f"Args: {tool_call['function']['arguments']}")
```
Model Factory
Model Creation
For API models, use BaseAPIModel.from_config():
```python
from marsys.models import BaseAPIModel, ModelConfig

config = ModelConfig(
    type="api",
    provider="openrouter",
    name="anthropic/claude-opus-4.6",
    max_tokens=12000
)

model = BaseAPIModel.from_config(config)
response = await model.arun(messages=[{"role": "user", "content": "Hello!"}])
```
For local models, use BaseLocalModel:
```python
from marsys.models import BaseLocalModel, ModelConfig

config = ModelConfig(
    type="local",
    model_class="llm",
    name="Qwen/Qwen3-4B-Instruct-2507",
    backend="huggingface",
    torch_dtype="bfloat16",
    device_map="auto"
)

model = BaseLocalModel(
    model_name=config.name,
    model_class=config.model_class,
    backend=config.backend,
    torch_dtype=config.torch_dtype,
    device_map=config.device_map,
    max_tokens=config.max_tokens
)

response = model.run(messages=[{"role": "user", "content": "Hello!"}])
```
LocalAdapterFactory
For direct adapter creation:
```python
from marsys.models import LocalAdapterFactory

# Creates the appropriate adapter based on backend and model_class
adapter = LocalAdapterFactory.create_adapter(
    backend="huggingface",  # or "vllm"
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    model_class="llm",      # or "vlm"
    torch_dtype="bfloat16",
    device_map="auto"
)
```
Advanced Features
Tool Calling
Models support OpenAI-compatible function calling:
```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "max_results": {"type": "integer", "description": "Maximum results", "default": 5}
            },
            "required": ["query"]
        }
    }
}]

response = await model.run(
    messages=[{"role": "user", "content": "Find information about Mars rovers"}],
    tools=tools,
    tool_choice="auto"  # auto, none, or specific function name
)

# Handle tool calls (search_web is your application's implementation)
if response.get("tool_calls"):
    for call in response["tool_calls"]:
        if call["function"]["name"] == "search_web":
            args = json.loads(call["function"]["arguments"])
            results = search_web(args["query"], args.get("max_results", 5))
            # Add tool result to conversation
            messages.append({
                "role": "tool",
                "content": json.dumps(results),
                "tool_call_id": call["id"]
            })
```
JSON Mode
Force structured JSON output:
```python
response = await model.run(
    messages=[
        {"role": "system", "content": "Always respond with JSON: {\"answer\": str, \"confidence\": float}"},
        {"role": "user", "content": "What is 2+2?"}
    ],
    json_mode=True
)

data = json.loads(response["content"])
print(f"Answer: {data['answer']} (Confidence: {data['confidence']})")
```
Structured Output (response_schema)
Use response_schema for strict schema-constrained JSON:
```python
schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}

response = await model.run(
    messages=[{"role": "user", "content": "What is 2+2?"}],
    response_schema=schema,
)
```
Provider behavior:
- OpenAI / OpenRouter / OpenAI OAuth: native JSON schema mode
- Google: `responseSchema` in generation config
- Anthropic / Anthropic OAuth: native `output_config.format` JSON schema
- `response_schema` takes precedence over `json_mode`
Strict schema note:
- MARSYS auto-normalizes schema objects with `additionalProperties: false` where required by strict providers.
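To make the normalization concrete, here is a hypothetical sketch of what adding `additionalProperties: false` recursively looks like. The function name is invented for illustration; it is not the actual MARSYS implementation:

```python
def normalize_strict_schema(schema: dict) -> dict:
    """Recursively add additionalProperties: false to every object schema.

    Strict structured-output modes typically reject object schemas that
    allow unknown keys, so each nested object gets the flag as well.
    """
    result = dict(schema)
    if result.get("type") == "object":
        result.setdefault("additionalProperties", False)
        if "properties" in result:
            result["properties"] = {
                name: normalize_strict_schema(sub)
                for name, sub in result["properties"].items()
            }
    elif result.get("type") == "array" and "items" in result:
        result["items"] = normalize_strict_schema(result["items"])
    return result
```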
Streaming Responses
Stream model output (when supported):
```python
async for chunk in model.stream(messages=[{"role": "user", "content": "Write a story"}]):
    print(chunk["content"], end="", flush=True)
```
Error Handling
Automatic Retry for Server Errors
Built-in Resilience
API adapters automatically retry transient server errors with exponential backoff. No manual retry needed!
Automatic Retry Behavior:
- Max Retries: 3 (total 4 attempts)
- Backoff: 1s → 2s → 4s (exponential)
- Retryable Status Codes:
- `500` - Internal Server Error
- `502` - Bad Gateway
- `503` - Service Unavailable
- `504` - Gateway Timeout
- `529` - Overloaded (Anthropic)
- `408` - Request Timeout (OpenRouter)
- `429` - Rate Limit (respects `retry-after` header)
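The backoff schedule above (1s → 2s → 4s, with the `retry-after` header honored for rate limits) can be sketched as follows. This is an illustrative sketch of the documented behavior, not the actual MARSYS retry code:

```python
from typing import Optional

def retry_delay(attempt: int, base: float = 1.0, retry_after: Optional[float] = None) -> float:
    """Delay in seconds before retry `attempt` (0-indexed).

    Exponential backoff doubling from `base`, overridden by a
    server-supplied retry-after value when one is present (e.g. 429).
    """
    if retry_after is not None:
        return retry_after
    return base * (2 ** attempt)
```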
Example:
```python
from marsys.models import BaseAPIModel

model = BaseAPIModel(
    provider="openrouter",
    model_name="anthropic/claude-opus-4.6",
    api_key=api_key
)

# API adapter automatically retries server errors (500, 502, 503, etc.)
# No manual retry logic needed!
response = await model.arun(messages)

# Logs will show retry attempts:
# WARNING - Server error 503 from claude-opus-4.6. Retry 1/3 after 1.0s
# WARNING - Server error 503 from claude-opus-4.6. Retry 2/3 after 2.0s
# INFO - Request successful after 2 retries
```
What Gets Retried Automatically
| Provider | Retryable Errors | Non-Retryable Errors |
|---|---|---|
| OpenRouter | 408, 429, 502, 503, 500+ | 400, 401, 402, 403 |
| OpenAI | 429, 500, 502, 503 | 400, 401, 404 |
| Anthropic | 429, 500, 529 | 400, 401, 403, 413 |
| Google | 429, 500, 503, 504 | 400, 403, 404 |
Manual Error Handling
For errors that aren't automatically retried (client errors, quota issues, etc.):
```python
from marsys.agents.exceptions import (
    ModelError,
    ModelAPIError,
    ModelTimeoutError,
    ModelRateLimitError,
    ModelTokenLimitError
)

try:
    response = await model.run(messages)
except ModelRateLimitError as e:
    # Rate limits are auto-retried, but if exhausted:
    logger.error(f"Rate limit exceeded after {e.context.get('max_retries', 3)} retries")
    if e.retry_after:
        logger.info(f"Retry after {e.retry_after}s")
except ModelTokenLimitError as e:
    # Token limit requires reducing input
    logger.warning(f"Token limit exceeded: {e.message}")
    messages = truncate_messages(messages, e.limit)
    response = await model.run(messages)
except ModelAPIError as e:
    # Check if it's a server error (already auto-retried)
    if e.status_code and e.status_code >= 500:
        logger.error(f"Server error persisted after retries: {e.message}")
    else:
        # Client error (400-level)
        logger.error(f"Client error: {e.status_code} - {e.message}")
        # Handle based on error classification
        if e.classification == "invalid_request":
            # Fix request and retry
            pass
        elif e.classification == "insufficient_credits":
            # Handle quota
            pass
```
Error Classification
All ModelAPIError instances include classification:
```python
except ModelAPIError as e:
    print(f"Error Code: {e.error_code}")
    print(f"Classification: {e.classification}")
    print(f"Is Retryable: {e.is_retryable}")
    print(f"Retry After: {e.retry_after}s")
    print(f"Suggested Action: {e.suggested_action}")

# Example output for OpenRouter 503:
# Error Code: MODEL_API_SERVICE_UNAVAILABLE_ERROR
# Classification: service_unavailable
# Is Retryable: True
# Retry After: 10s
# Suggested Action: Service temporarily unavailable. Please try again later.
```
Usage Tracking
Token Usage
```python
response = await model.run(messages)

usage = response.get("usage", {})
print(f"Prompt tokens: {usage.get('prompt_tokens', 0)}")
print(f"Completion tokens: {usage.get('completion_tokens', 0)}")
print(f"Total tokens: {usage.get('total_tokens', 0)}")

# Estimate cost (OpenAI pricing example)
cost_per_1k_prompt = 0.03      # $0.03 per 1K tokens
cost_per_1k_completion = 0.06  # $0.06 per 1K tokens

prompt_cost = (usage.get('prompt_tokens', 0) / 1000) * cost_per_1k_prompt
completion_cost = (usage.get('completion_tokens', 0) / 1000) * cost_per_1k_completion
total_cost = prompt_cost + completion_cost
print(f"Estimated cost: ${total_cost:.4f}")
```
Best Practices
1. Configuration Management
```python
# ✅ GOOD - Environment-based config
import os
from marsys.models import ModelConfig

config = ModelConfig(
    type="api",
    provider="openrouter",
    name=os.getenv("MODEL_NAME", "anthropic/claude-opus-4.6"),
    temperature=float(os.getenv("MODEL_TEMPERATURE", "0.7")),
    max_tokens=int(os.getenv("MAX_TOKENS", "12000"))
)

# ❌ BAD - Hardcoded values
config = ModelConfig(
    type="api",
    provider="openrouter",
    name="anthropic/claude-opus-4.6",
    api_key="sk-..."  # Never hardcode!
)
```
2. Error Recovery
```python
# ✅ GOOD - Graceful degradation
async def robust_model_call(messages, fallback_model=None):
    try:
        return await primary_model.run(messages)
    except ModelError as e:
        if fallback_model:
            logger.warning(f"Primary failed, using fallback: {e}")
            return await fallback_model.run(messages)
        raise

# ❌ BAD - No error handling
response = await model.run(messages)  # Can fail!
```
3. Resource Management
```python
# ✅ GOOD - Proper cleanup for local models
class ModelManager:
    def __init__(self):
        self.models = {}

    def get_model(self, config: ModelConfig):
        key = f"{config.type}:{config.name}"
        if key not in self.models:
            self.models[key] = create_model(config)
        return self.models[key]

    def cleanup(self):
        for model in self.models.values():
            if hasattr(model, 'cleanup'):
                model.cleanup()
```
Related Documentation
- Agents - How agents use models
- Configuration - Model configuration guide
- Error Handling - Error management
- Examples - Model usage examples