Multimodal Agents Guide
A comprehensive guide to building agents that can process and reason about visual content alongside text.
Overview
MARSYS supports multimodal AI agents that can:
- Process Images: analyze screenshots, photos, diagrams, and charts
- Read Documents: extract text and images from PDFs
- Visual Reasoning: answer questions about visual content
- Mixed Input: handle tasks with both text and images
Quick Start
Basic Multimodal Task
```python
from marsys.coordination import Orchestra
from marsys.agents import Agent
from marsys.models import ModelConfig
import os

# Create a vision-enabled agent
vision_agent = Agent(
    model_config=ModelConfig(
        type="api",
        name="anthropic/claude-sonnet-4",
        provider="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY")
    ),
    name="VisionAnalyst",
    goal="Expert at analyzing visual content",
    instruction="You are an expert at analyzing images and providing detailed insights."
)

# Run with images
result = await Orchestra.run(
    task={
        "content": "What is shown in this image? Provide a detailed analysis.",
        "images": ["/path/to/screenshot.png"]
    },
    topology={"agents": ["VisionAnalyst"], "flows": []}
)

print(result.final_response)
```
Model Configuration
```python
from marsys.models import ModelConfig
import os

# OpenRouter with Claude Sonnet for vision
claude_config = ModelConfig(
    type="api",
    name="anthropic/claude-sonnet-4",
    provider="openrouter",
    api_key=os.getenv("OPENROUTER_API_KEY"),
    temperature=0.7,
    max_tokens=2000
)

# Anthropic model with vision (Claude Sonnet 4.5)
anthropic_config = ModelConfig(
    type="api",
    name="claude-sonnet-4-5-20250929",
    provider="anthropic",
    api_key=os.getenv("ANTHROPIC_API_KEY")
)

# Google model with vision (Gemini 2.5 Pro)
google_config = ModelConfig(
    type="api",
    name="gemini-2.5-pro",
    provider="google",
    api_key=os.getenv("GOOGLE_API_KEY")
)
```
Multimodal Tasks
Task Format
Tasks with images use this format:
task = {"content": "Text description of what you want","images": ["path/to/image1.png", "path/to/image2.jpg"]}
Image Sources
The images field accepts multiple formats:
```python
# Local file paths
task = {
    "content": "Analyze these screenshots",
    "images": [
        "/home/user/screenshot1.png",
        "/tmp/image.jpg"
    ]
}

# URLs (if supported by the model)
task = {
    "content": "What's in this image?",
    "images": ["https://example.com/photo.jpg"]
}

# Mixed sources
task = {
    "content": "Compare these images",
    "images": [
        "/local/path/image1.png",
        "https://example.com/image2.jpg"
    ]
}
```
How It Works
When you provide images in a task:
1. Extraction: Orchestra extracts the images field from the task.
2. Memory: The images are added to the first agent's memory.
3. Encoding: Local images are automatically base64-encoded.
4. Format: The message is formatted for the target provider.
5. Process: The vision model processes the combined input (see the sketch below).
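To make steps 3 and 4 concrete, here is a minimal sketch of how a local image could end up in an OpenAI-style message content array. The exact structure MARSYS produces varies by provider, so treat this as an illustration rather than the framework's internal code path:

```python
import base64

# Illustration only: encode a local image and place it in an
# OpenAI-style content array (steps 3 and 4 above). MARSYS performs
# the equivalent work internally.
with open("/path/to/screenshot.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is shown in this image?"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ],
}
```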
Multimodal Tools
ToolResponse: The Proper Way to Return Multimodal Content
Tools should return ToolResponse objects for multimodal content. This enables ordered sequences of text and images that map directly to LLM message content arrays.
```python
from marsys.environment.tool_response import ToolResponse, ToolResponseContent

def tool_read_pdf(pdf_path: str) -> ToolResponse:
    """Read a PDF and return text + page images for vision analysis.

    Args:
        pdf_path: Path to the PDF file.

    Returns:
        ToolResponse with ordered text and image content blocks.
    """
    # Extract text (extract_text_from_pdf and convert_pdf_pages_to_images
    # are helper functions assumed to be defined elsewhere)
    text = extract_text_from_pdf(pdf_path)

    # Convert pages to images
    page_images = convert_pdf_pages_to_images(pdf_path)

    # Build ordered content blocks
    content_blocks = [
        ToolResponseContent(text=f"Extracted {len(page_images)} pages from PDF.")
    ]

    # Add each page image with its text
    for i, (page_text, page_image) in enumerate(zip(text.split("---"), page_images), 1):
        content_blocks.append(ToolResponseContent(text=f"--- Page {i} ---"))
        content_blocks.append(ToolResponseContent(image_path=page_image))
        content_blocks.append(ToolResponseContent(text=page_text.strip()))

    return ToolResponse(
        content=content_blocks,
        metadata={"pages": len(page_images), "file": pdf_path}
    )
```
ToolResponseContent Options
ToolResponseContent supports three content fields; text accepts either a string or a dict:
```python
from marsys.environment.tool_response import ToolResponseContent

# 1. Text content (string)
ToolResponseContent(text="Hello world")

# 2. Text content (dict - will be JSON stringified)
ToolResponseContent(text={"key": "value", "count": 42})

# 3. Image with local file path (auto-converted to base64)
ToolResponseContent(image_path="/path/to/image.png")

# 4. Image with base64 data URL (already encoded)
ToolResponseContent(image_data="data:image/png;base64,iVBORw0KGgo...")
```
Mutually Exclusive
Each ToolResponseContent must have exactly ONE of: text, image_path, or image_data. You cannot combine them in the same content block; if an item needs both a caption and an image, use two consecutive blocks, as shown below.
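A minimal sketch of that pattern (assuming invalid combinations are rejected at validation time):

```python
from marsys.environment.tool_response import ToolResponseContent

# Wrong: two fields in one block - presumably rejected by validation
# ToolResponseContent(text="Figure 1", image_path="/tmp/fig1.png")

# Right: one field per block, ordered as they should appear to the model
blocks = [
    ToolResponseContent(text="Figure 1"),
    ToolResponseContent(image_path="/tmp/fig1.png"),
]
```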
ToolResponse Content Formats
ToolResponse supports three content formats:
```python
from marsys.environment.tool_response import ToolResponse, ToolResponseContent

# 1. Simple string (for text-only results)
ToolResponse(content="File read successfully")

# 2. Dictionary (for structured data)
ToolResponse(content={"status": "success", "count": 42})

# 3. List of ToolResponseContent (for ordered multimodal content)
ToolResponse(
    content=[
        ToolResponseContent(text="Chapter 1"),
        ToolResponseContent(image_data="data:image/png;base64,..."),
        ToolResponseContent(text="Figure 1.1: System Architecture")
    ],
    metadata={"pages": "1-5", "images": 3}
)
```
Tool Result Flow
The framework automatically handles ToolResponse processing:
```python
from marsys.environment.tool_response import ToolResponse, ToolResponseContent

# 1. Agent calls the tool
#    agent: "I need to read this PDF"
#    tool_call: tool_read_pdf("/path/to/document.pdf")

# 2. Tool returns ToolResponse with multimodal content
tool_result = ToolResponse(
    content=[
        ToolResponseContent(text="Extracted 3 pages."),
        ToolResponseContent(image_path="/tmp/page1.png"),
        ToolResponseContent(text="Page 1: Annual Report..."),
        ToolResponseContent(image_path="/tmp/page2.png"),
        ToolResponseContent(text="Page 2: Financial Data...")
    ]
)

# 3. Framework processes the ToolResponse (tool_executor.py)
#    - Detects the ToolResponse object
#    - Calls to_content_array() for the proper LLM message format

# 4. Framework creates a two-message pattern (step_executor.py)
#    - role="tool": metadata message about the tool execution
#    - role="user": the actual multimodal content array

# 5. Agent receives properly formatted multimodal context
#    The vision model sees ordered text + images with the correct structure
```
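For intuition, the array returned by to_content_array() in step 3 plausibly looks like the structure below. The exact keys are an assumption for illustration; check the ToolResponse source for the real schema:

```python
# Hypothetical shape of tool_result.to_content_array() - key names are
# an assumption; what matters is that text/image ordering is preserved.
content_array = [
    {"type": "text", "text": "Extracted 3 pages."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    {"type": "text", "text": "Page 1: Annual Report..."},
    # ... remaining pages follow in order
]
```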
Complete Tool Example with ToolResponse
```python
import PyPDF2
from pdf2image import convert_from_path
from pathlib import Path
import tempfile

from marsys.environment.tool_response import ToolResponse, ToolResponseContent

def tool_analyze_document(
    file_path: str,
    extract_images: bool = True
) -> ToolResponse:
    """Analyze a document file (PDF, image, text).

    Args:
        file_path: Path to the document.
        extract_images: Whether to extract visual content.

    Returns:
        ToolResponse with document content and optional images.
    """
    file_path = Path(file_path)

    if not file_path.exists():
        return ToolResponse(
            content=f"Error: File not found: {file_path}",
            metadata={"error": "file_not_found"}
        )

    # Handle PDFs
    if file_path.suffix.lower() == '.pdf':
        content_blocks = []

        # Extract text
        text_parts = []
        with open(file_path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            for i, page in enumerate(reader.pages, 1):
                page_text = page.extract_text()
                if page_text.strip():
                    text_parts.append((i, page_text.strip()))

        # Optionally convert pages to images
        image_paths = []
        if extract_images:
            temp_dir = Path(tempfile.mkdtemp(prefix="pdf_analysis_"))
            pdf_images = convert_from_path(str(file_path), dpi=200)
            for i, img in enumerate(pdf_images, 1):
                img_path = temp_dir / f"page_{i}.png"
                img.save(str(img_path), 'PNG')
                image_paths.append(str(img_path))

        # Build ordered content: text, then image for each page
        content_blocks.append(ToolResponseContent(
            text=f"Document: {file_path.name} ({len(text_parts)} pages)"
        ))
        for i, (page_num, page_text) in enumerate(text_parts):
            content_blocks.append(ToolResponseContent(
                text=f"--- Page {page_num} ---\n{page_text}"
            ))
            if i < len(image_paths):
                content_blocks.append(ToolResponseContent(image_path=image_paths[i]))

        return ToolResponse(
            content=content_blocks,
            metadata={
                "pages": len(text_parts),
                "file_type": "pdf",
                "visual_pages": len(image_paths)
            }
        )

    # Handle image files
    elif file_path.suffix.lower() in ['.png', '.jpg', '.jpeg', '.gif', '.bmp']:
        return ToolResponse(
            content=[
                ToolResponseContent(text=f"Image file: {file_path.name}"),
                ToolResponseContent(image_path=str(file_path))
            ],
            metadata={"file_type": "image"}
        )

    # Handle text files
    else:
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            return ToolResponse(
                content=content,
                metadata={"file_type": "text", "chars": len(content)}
            )
        except Exception as e:
            return ToolResponse(
                content=f"Error reading file: {e}",
                metadata={"error": "read_failed"}
            )
```
Real-World Examples
Example 1: Screenshot Analysis
```python
from marsys.coordination import Orchestra
from marsys.agents import Agent
from marsys.models import ModelConfig
import os

# Create agent
ui_analyst = Agent(
    model_config=ModelConfig(
        type="api",
        name="anthropic/claude-sonnet-4",
        provider="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY")
    ),
    name="UIAnalyst",
    goal="Expert at analyzing user interfaces",
    instruction="""You are a UI/UX expert. When analyzing screenshots:
1. Identify all UI elements
2. Assess usability and accessibility
3. Note any design issues
4. Suggest improvements"""
)

# Analyze screenshots
result = await Orchestra.run(
    task={
        "content": "Analyze these app screenshots and identify UI/UX issues",
        "images": [
            "./screenshots/login_screen.png",
            "./screenshots/dashboard.png",
            "./screenshots/settings.png"
        ]
    },
    topology={"agents": ["UIAnalyst"], "flows": []}
)

print(result.final_response)
```
Example 2: Document Processing with Tools
```python
from marsys.coordination import Orchestra
from marsys.coordination.topology.patterns import PatternConfig
from marsys.agents import Agent
from marsys.models import ModelConfig
import os

# Create agents
coordinator = Agent(
    model_config=ModelConfig(
        type="api",
        name="anthropic/claude-sonnet-4",
        provider="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY")
    ),
    name="Coordinator",
    goal="Coordinates document analysis",
    instruction="You coordinate document analysis tasks between agents.",
    tools={"read_file": tool_analyze_document}
)

analyst = Agent(
    model_config=ModelConfig(
        type="api",
        name="anthropic/claude-sonnet-4",
        provider="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY")
    ),
    name="DocumentAnalyst",
    goal="Analyzes document content",
    instruction="You are an expert at analyzing documents and extracting key information."
)

# Define workflow
topology = PatternConfig.hub_and_spoke(
    hub="Coordinator",
    spokes=["DocumentAnalyst"]
)

# Execute
result = await Orchestra.run(
    task="Read the file report.pdf and extract all key metrics and charts",
    topology=topology
)
```
Example 3: Multi-Agent Vision Pipeline
```python
from marsys.coordination import Orchestra
from marsys.coordination.topology.patterns import PatternConfig
from marsys.agents import Agent
from marsys.models import ModelConfig
import os

# Create specialized agents
image_captioner = Agent(
    model_config=ModelConfig(
        type="api",
        name="anthropic/claude-sonnet-4",
        provider="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY")
    ),
    name="Captioner",
    goal="Creates detailed image captions",
    instruction="Generate detailed, accurate captions for images."
)

object_detector = Agent(
    model_config=ModelConfig(
        type="api",
        name="gemini-2.5-pro",
        provider="google",
        api_key=os.getenv("GOOGLE_API_KEY")
    ),
    name="ObjectDetector",
    goal="Identifies objects in images",
    instruction="List all objects visible in images with confidence scores."
)

scene_analyzer = Agent(
    model_config=ModelConfig(
        type="api",
        name="claude-sonnet-4-5-20250929",
        provider="anthropic",
        api_key=os.getenv("ANTHROPIC_API_KEY")
    ),
    name="SceneAnalyzer",
    goal="Analyzes overall scene context",
    instruction="Analyze the overall scene, context, and relationships."
)

synthesizer = Agent(
    model_config=ModelConfig(
        type="api",
        name="anthropic/claude-sonnet-4",
        provider="openrouter",
        api_key=os.getenv("OPENROUTER_API_KEY")
    ),
    name="Synthesizer",
    goal="Synthesizes analysis from specialists",
    instruction="Combine insights from specialists into a comprehensive report."
)

# Pipeline topology
topology = PatternConfig.pipeline(
    stages=[
        {"name": "analysis", "agents": ["Captioner", "ObjectDetector", "SceneAnalyzer"]},
        {"name": "synthesis", "agents": ["Synthesizer"]}
    ],
    parallel_within_stage=True  # Run analysis agents in parallel
)

# Execute
result = await Orchestra.run(
    task={
        "content": "Provide comprehensive analysis of these images",
        "images": ["./photos/scene1.jpg", "./photos/scene2.jpg"]
    },
    topology=topology
)
```
Best Practices
1. Optimize Images Before Sending
```python
from PIL import Image
from pathlib import Path

def optimize_image(image_path: str, max_size: tuple = (2048, 2048)) -> str:
    """Resize large images to reduce API costs."""
    img = Image.open(image_path)
    if img.size[0] > max_size[0] or img.size[1] > max_size[1]:
        img.thumbnail(max_size, Image.Resampling.LANCZOS)
        # Save to a temp location
        output_path = f"/tmp/optimized_{Path(image_path).name}"
        img.save(output_path, optimize=True, quality=85)
        return output_path
    return image_path

# BAD - Sending huge images wastes tokens and money
task = {
    "content": "Analyze this image",
    "images": ["./huge_8k_photo.jpg"]  # 8000x6000 pixels!
}
```
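The GOOD counterpart simply routes image paths through the helper before building the task:

```python
# GOOD - Downscale first, then send the optimized copy
task = {
    "content": "Analyze this image",
    "images": [optimize_image("./huge_8k_photo.jpg")]
}
```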
2. Provide Clear Context
```python
# GOOD - Clear context
task = {
    "content": """Analyze these medical X-rays and identify:
1. Any abnormalities
2. Potential diagnoses
3. Recommended follow-up tests

Images are: chest X-ray (front), chest X-ray (side)""",
    "images": ["./xray_front.jpg", "./xray_side.jpg"]
}

# BAD - Vague context
task = {
    "content": "Look at these",
    "images": ["./xray_front.jpg", "./xray_side.jpg"]
}
```
3. Handle Tool Errors Gracefully with ToolResponse
```python
from pathlib import Path

from marsys.environment.tool_response import ToolResponse, ToolResponseContent

# GOOD - Robust error handling with ToolResponse
def tool_read_file(file_path: str) -> ToolResponse:
    """Read a file with proper error handling."""
    try:
        if not Path(file_path).exists():
            return ToolResponse(
                content=f"Error: File not found: {file_path}",
                metadata={"error": "file_not_found"}
            )

        # Process the file and return multimodal content
        # (is_image is a helper, assumed defined elsewhere, that checks the extension)
        return ToolResponse(
            content=[
                ToolResponseContent(text=f"Reading: {file_path}"),
                ToolResponseContent(image_path=file_path)
                if is_image(file_path)
                else ToolResponseContent(text="..."),
            ],
            metadata={"file": file_path}
        )

    except Exception as e:
        return ToolResponse(
            content=f"Error processing file: {e}",
            metadata={"error": "processing_failed"}
        )

# BAD - Unhandled exceptions crash the workflow
def tool_read_file(file_path: str) -> ToolResponse:
    file = open(file_path)  # Raises FileNotFoundError
    return ToolResponse(content=file.read())
```
4. Clean Up Temporary Files with ToolResponse
```python
import tempfile
import shutil
import threading
import time
from pathlib import Path

from marsys.environment.tool_response import ToolResponse, ToolResponseContent

def tool_process_pdf(pdf_path: str) -> ToolResponse:
    """Process a PDF with cleanup of temporary files."""
    temp_dir = Path(tempfile.mkdtemp(prefix="pdf_"))
    try:
        # Extract pages to temp_dir (extract_pdf_pages is a helper
        # assumed defined elsewhere)
        image_paths = extract_pdf_pages(pdf_path, temp_dir)

        # Build multimodal content
        content_blocks = [
            ToolResponseContent(text=f"Processed PDF: {len(image_paths)} pages")
        ]
        for i, img_path in enumerate(image_paths, 1):
            content_blocks.append(ToolResponseContent(text=f"Page {i}:"))
            content_blocks.append(ToolResponseContent(image_path=img_path))

        return ToolResponse(
            content=content_blocks,
            metadata={"pages": len(image_paths)}
        )
    finally:
        # Clean up after a delay (allow time for the framework to encode images)
        def cleanup():
            time.sleep(60)  # Wait 60s
            shutil.rmtree(temp_dir, ignore_errors=True)

        threading.Thread(target=cleanup, daemon=True).start()
```
Configuration
Memory Management
Vision models use more tokens due to image encoding:
```python
from marsys.agents import Agent

# Configure memory retention
vision_agent = Agent(
    model_config=vision_model_config,
    name="VisionAgent",
    memory_retention="single_run",  # Clear after each run
    max_tokens=4000  # Higher limit for vision
)
```
Cost Optimization
```python
from marsys.coordination.config import ExecutionConfig

# Limit steps to control costs
result = await Orchestra.run(
    task=vision_task,
    topology=topology,
    max_steps=20,  # Prevent runaway costs
    execution_config=ExecutionConfig(
        step_timeout=60.0,  # Timeout per step
        convergence_timeout=300.0
    )
)
```
Debugging
Check Image Encoding
```python
# Verify images are being added to memory
agent = vision_agent
messages = agent.memory.retrieve_all()

for msg in messages:
    if hasattr(msg, 'images') and msg.images:
        print(f"Message {msg.message_id} has {len(msg.images)} images")
```
Inspect Tool Results
```python
from marsys.environment.tool_response import ToolResponse

# Check that the tool is returning a ToolResponse correctly
result = tool_read_file("test.pdf")

print(f"Result type: {type(result)}")
if isinstance(result, ToolResponse):
    print(f"Content type: {type(result.content)}")
    print(f"Has images: {result.has_images()}")
    if isinstance(result.content, list):
        text_count = sum(1 for c in result.content if c.type == "text")
        image_count = sum(1 for c in result.content if c.type == "image")
        print(f"Text blocks: {text_count}, Image blocks: {image_count}")
    print(f"Metadata: {result.metadata}")
else:
    print("Warning: Tool should return ToolResponse for multimodal content")
```
Ready for Vision!
You now have everything you need to build powerful multimodal agents that can reason about visual content alongside text.
Token Costs
Vision models typically charge based on image resolution. Optimize images before sending to reduce costs. See the File Operations guide for token estimation by provider.
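For a rough estimate with Anthropic models, the documented approximation is tokens ≈ (width × height) / 750. A sketch using that formula follows; other providers bill differently, so treat the result as an order-of-magnitude check only:

```python
from PIL import Image

def estimate_claude_image_tokens(image_path: str) -> int:
    """Approximate Claude's image token count: (width * height) / 750.
    This is Anthropic's published rule of thumb; other providers differ."""
    width, height = Image.open(image_path).size
    return (width * height) // 750

# e.g. a 1092x1092 image comes out to roughly 1,590 tokens
```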
Related Documentation
- File Operations: image support and token estimation.
- Tools API: create custom multimodal tools.
- Orchestra API: complete API reference for multimodal tasks.
- Models API: configure vision-capable models.