Browser Automation

MARSYS provides browser automation through the BrowserAgent, which handles web scraping, page interaction, and LLM-guided navigation in multi-agent workflows.

Overview

The browser automation system provides:

  • Dual Operation Modes: PRIMITIVE for fast content extraction, ADVANCED for complex multi-step scenarios with visual interaction
  • Web Navigation: Navigate, scrape, and interact with websites
  • Intelligent Automation: LLM-guided browser control and decision making
  • Dynamic Content Handling: JavaScript execution and async content loading
  • Form Automation: Fill forms, click elements, and handle interactions
  • Multimodal Capabilities: Screenshot-based visual understanding with element detection (ADVANCED mode)
  • Robust Error Handling: Retry mechanisms and resilient operations

Operation Modes

BrowserAgent supports two distinct operation modes optimized for different use cases:

PRIMITIVE Mode

Purpose: Fast, efficient content extraction without visual interaction

Characteristics:

  • High-level tools for quick content retrieval
  • No visual feedback or screenshots
  • No vision model required
  • Optimized for speed and simplicity
  • Single-step operations

Available Tools (6):

  • fetch_url - Navigate and extract content in one step
  • get_page_metadata - Get page title, URL, and links
  • download_file - Download files from URLs
  • list_downloads - List files in the downloads directory
  • get_page_elements - Get interactive elements with selectors (token-efficient format)
  • inspect_element - Get element details by selector (truncated text preview)

Best For: Web scraping and data extraction, content aggregation, simple information retrieval, API-like web interactions

ADVANCED Mode

Purpose: Complex multi-step scenarios requiring visual interaction and coordinate-based control

Characteristics:

  • Low-level coordinate-based tools
  • Visual feedback with auto-screenshot support
  • Vision model integration for visual understanding
  • Multi-step navigation and interaction
  • Form filling and complex workflows

Available Tools (20+): All PRIMITIVE mode tools, plus:

  • goto - Navigate to URL (auto-detects downloads)
  • scroll_up / scroll_down - Scroll the page
  • mouse_click - Click at specific coordinates (auto-detects downloads)
  • keyboard_input - Type text into focused input fields (search boxes, forms)
  • keyboard_press - Press special keys (Enter, Tab, arrows, etc.) (auto-detects downloads)
  • search_page - Find text on page with Chrome-like highlighting
  • go_back - Navigate back
  • reload - Reload current page
  • get_url / get_title - Get page information
  • screenshot - Take screenshot with element highlighting (returns multimodal ToolResponse)
  • inspect_at_position - Get element info at screen coordinates (x, y)
  • list_tabs / get_active_tab / switch_to_tab / close_tab - Tab management
  • save_session - Save browser session state for persistence

Best For: Form automation with complex interactions, multi-step workflows requiring visual confirmation, handling cookie popups and modals, sites with anti-bot protections, tasks requiring precise element interaction

Choosing the Right Mode

from marsys.agents import BrowserAgent, BrowserAgentMode
from marsys.models import ModelConfig

# `config` is a ModelConfig defined elsewhere (see "Creating a BrowserAgent")

# Mode selection with enum (type-safe)
browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="scraper",
    mode=BrowserAgentMode.PRIMITIVE,  # Using enum
    goal="Efficiently fetch and extract content from web pages"
)

# Mode selection with string (convenient)
browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="scraper",
    mode="primitive",  # Using string
    goal="Efficiently fetch and extract content from web pages"
)

# ADVANCED mode - Visual interaction
browser_agent = await BrowserAgent.create_safe(
    model_config=config,  # Main agent model (Claude Haiku/Sonnet recommended)
    name="navigator",
    mode=BrowserAgentMode.ADVANCED,  # or mode="advanced"
    auto_screenshot=True,  # Enable visual feedback
    vision_model_config=ModelConfig(  # Vision model for screenshot analysis
        type="api",
        provider="openrouter",
        name="google/gemini-3-flash-preview",  # Recommended: fast and cost-effective
        # For complex tasks, use: "google/gemini-3-pro-preview"
        temperature=0,
        thinking_budget=0  # Disable thinking for faster vision responses
    ),
    goal="Navigate and interact with web pages like a human"
)

BrowserAgent

Creating a BrowserAgent

from marsys.agents import BrowserAgent
from marsys.models import ModelConfig

# PRIMITIVE Mode - Fast content extraction
browser_agent = await BrowserAgent.create_safe(
    model_config=ModelConfig(
        type="api",
        provider="openrouter",
        name="anthropic/claude-opus-4.6",
        temperature=0.3
    ),
    name="web_scraper",
    mode="primitive",  # Simple string mode selection
    goal="Fast web scraping agent for content extraction",
    headless=True,
    tmp_dir="./runs/run-20260206"
)

# ADVANCED Mode - Visual interaction with auto-screenshot
browser_agent_advanced = await BrowserAgent.create_safe(
    model_config=ModelConfig(
        type="api",
        provider="openrouter",
        name="anthropic/claude-opus-4.6",  # Main agent for decision-making and planning
        temperature=0.3
    ),
    name="web_navigator",
    mode="advanced",  # Simple string mode selection
    goal="Expert web automation agent for complex interactions",
    auto_screenshot=True,  # Enable visual feedback
    vision_model_config=ModelConfig(  # Required for auto-screenshot
        type="api",
        provider="openrouter",
        name="google/gemini-3-flash-preview",  # Recommended for browser vision
        temperature=0,
        thinking_budget=0  # Disable thinking for faster vision responses
    ),
    headless=False,
    tmp_dir="./runs/run-20260206"
)

# Always clean up
try:
    # Use the agent
    result = await browser_agent.run("Navigate to example.com and extract the main heading")
finally:
    if browser_agent.browser_tool:
        await browser_agent.browser_tool.close()

Virtual paths: BrowserAgent returns virtual paths for artifacts such as ./downloads/report.pdf and ./screenshots/step_1.png. See Run Filesystem.

BrowserAgent Artifact Configuration

BrowserAgent.create_safe(...) accepts options that control where downloads are stored on the host, how their paths appear to the agent, and what the download tool is called:

browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="web_scraper",
    mode="primitive",
    tmp_dir="./runs/run-20260206",
    downloads_subdir="downloads",  # Host folder under tmp_dir
    downloads_virtual_dir="./downloads",  # Path shown to the agent
    fetch_file_tool_name="fetch_file",  # Expose download tool under custom name
)
  • downloads_subdir changes host-side layout under tmp_dir.
  • downloads_virtual_dir changes what agents see/return in tool outputs.
  • fetch_file_tool_name remaps the download tool name from the default download_file.
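
The relationship between the host layout and the virtual path can be pictured with a small standalone sketch. The helper name to_virtual_path is hypothetical and is not part of MARSYS; it only illustrates how a file stored under tmp_dir/downloads_subdir might be reported to the agent under downloads_virtual_dir, assuming a simple prefix swap.

```python
from pathlib import Path


def to_virtual_path(
    host_file: str,
    tmp_dir: str,
    downloads_subdir: str = "downloads",
    downloads_virtual_dir: str = "./downloads",
) -> str:
    """Map a host-side download path to the path an agent would see.

    Hypothetical helper for illustration; the real mapping inside
    MARSYS may differ.
    """
    host_root = Path(tmp_dir) / downloads_subdir
    # relative_to raises ValueError if the file is outside the downloads root
    relative = Path(host_file).relative_to(host_root)
    return downloads_virtual_dir.rstrip("/") + "/" + relative.as_posix()


virtual = to_virtual_path(
    "./runs/run-20260206/downloads/report.pdf",
    tmp_dir="./runs/run-20260206",
)
print(virtual)
```

Keeping the virtual prefix separate from the host layout means the host folder can be renamed without changing the paths that appear in agent prompts and tool outputs.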

Viewport Auto-Detection

If viewport_width/viewport_height are not provided, BrowserAgent picks defaults by model family:

  • Google/Gemini: 1000x1000
  • Anthropic/Claude: 1344x896
  • OpenAI/GPT: 1024x768
  • Fallback: 1536x1536
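
As a rough illustration of the selection rule above, the sketch below maps a model name to a default viewport. The function pick_viewport and its keyword matching are assumptions made for this example, not MARSYS internals; only the four sizes come from the list above. The sketch also assumes explicit values win only when both width and height are given.

```python
from typing import Dict, Optional, Tuple

# Sizes taken from the list above.
VIEWPORT_DEFAULTS: Dict[str, Tuple[int, int]] = {
    "google": (1000, 1000),
    "anthropic": (1344, 896),
    "openai": (1024, 768),
}
FALLBACK_VIEWPORT: Tuple[int, int] = (1536, 1536)

# Assumed keywords used to recognise each model family.
FAMILY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "google": ("google", "gemini"),
    "anthropic": ("anthropic", "claude"),
    "openai": ("openai", "gpt"),
}


def pick_viewport(
    model_name: str,
    width: Optional[int] = None,
    height: Optional[int] = None,
) -> Tuple[int, int]:
    """Return explicit dimensions if both are given, else a family default."""
    if width is not None and height is not None:
        return (width, height)
    lowered = model_name.lower()
    for family, keywords in FAMILY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return VIEWPORT_DEFAULTS[family]
    return FALLBACK_VIEWPORT


print(pick_viewport("google/gemini-3-flash-preview"))
print(pick_viewport("anthropic/claude-opus-4.6"))
```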

Using AgentPool for Parallel Browsing

import asyncio
from typing import List

from marsys.agents import AgentPool, BrowserAgent

# Create pool of browser agents
browser_pool = AgentPool(
    agent_class=BrowserAgent,
    num_instances=3,
    model_config=config,
    agent_name="BrowserPool",
    headless=True
)

# Scrape a single URL, holding the pooled agent until the run completes
async def scrape_one(branch_id: str, url: str):
    async with browser_pool.acquire(branch_id) as agent:
        return await agent.run(f"Scrape content from {url}")

# Parallel scraping
async def scrape_urls(urls: List[str]):
    tasks = [scrape_one(f"branch_{i}", url) for i, url in enumerate(urls)]
    results = await asyncio.gather(*tasks)
    return results

# Cleanup pool
await browser_pool.cleanup()

Text Search on Page

New feature: search_page()

search_page() finds text on the current web page, highlights every match in a Chrome-like style, and steps through the matches on repeated calls.

# Search for text on the current page
result = await browser_tool.search_page("quantum computing")
# Returns: "Match 1/5 found and highlighted"
# All matches highlighted in YELLOW, current match in ORANGE
# Navigate to next match - call again with SAME term
result = await browser_tool.search_page("quantum computing")
# Returns: "Match 2/5"
# Scrolls to and highlights next occurrence
# Continue navigating
result = await browser_tool.search_page("quantum computing")
# Returns: "Match 3/5"
# Wraps around after last match back to first

Features:

  • Visual Highlighting: All matches in YELLOW, current in ORANGE (Chrome-like)
  • Auto-scroll: Automatically scrolls to current match (centered in viewport)
  • Match Counter: Shows "Match X/Y" so you know your progress
  • Wrap-around: After last match, returns to first match
  • Case-insensitive: Finds text regardless of case
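
The counter and wrap-around behaviour described above can be modelled in a few lines. The SearchCursor class below is a hypothetical stand-in that works on plain text, not the actual search_page implementation; highlighting and scrolling are omitted, and the message returned when nothing matches is an assumption.

```python
import re
from typing import List, Optional


class SearchCursor:
    """Tracks the current match across repeated searches for one term."""

    def __init__(self, page_text: str) -> None:
        self._text = page_text.lower()
        self._term: Optional[str] = None
        self._offsets: List[int] = []
        self._index = -1

    def search(self, term: str) -> str:
        if not term:
            return "No matches found"
        key = term.lower()
        if key != self._term:
            # New term: recompute matches and restart before the first one
            self._term = key
            self._offsets = [m.start() for m in re.finditer(re.escape(key), self._text)]
            self._index = -1
        if not self._offsets:
            return "No matches found"
        # Advance to the next match, wrapping back to the first after the last
        self._index = (self._index + 1) % len(self._offsets)
        return f"Match {self._index + 1}/{len(self._offsets)}"


cursor = SearchCursor("Quantum computing. quantum COMPUTING, and Quantum Computing again.")
for _ in range(4):
    print(cursor.search("quantum computing"))
```

Calling search with a different term resets the cursor, which mirrors the instruction to call search_page again with the same term in order to move to the next match.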

Limitations:

  • Does NOT work with PDF files (PDFs are auto-downloaded, not displayed)
  • Does NOT search across multiple pages
  • Works with regular web pages, including shadow DOM content

Example - Finding Specific Information:

# Navigate to documentation page
await browser_tool.goto("https://docs.example.com/api")
# Search for specific API endpoint
result = await browser_tool.search_page("/api/v2/users")
# Match 1/3 found - scrolls to first occurrence
# Check if it's the right one with screenshot
screenshot = await browser_tool.screenshot()
# Visual: See highlighted text in orange
# Not the right one? Navigate to next match
result = await browser_tool.search_page("/api/v2/users")
# Match 2/3 - scrolls to second occurrence

Automatic Download Detection

Smart Download Handling

When an action (a click, an Enter key press, or a navigation) triggers a file download, the browser detects it automatically and reports it in the tool response:

# Clicking a download link automatically detects the download
result = await browser_tool.mouse_click(x=450, y=300)
# Returns: "Action 'mouse_click' triggered a file download.
# File 'report.pdf' has been downloaded to: ./downloads/report.pdf"
# Navigating to a PDF URL triggers automatic download
result = await browser_tool.goto("https://example.com/paper.pdf")
# Returns: "Action 'goto' triggered a file download.
# File 'paper.pdf' has been downloaded to: ./downloads/paper.pdf"
# Pressing Enter on a download button
await browser_tool.mouse_click(x=500, y=400) # Focus download button
await browser_tool.keyboard_press("Enter")
# Returns: "Action 'keyboard_press' triggered a file download.
# File 'data.xlsx' has been downloaded to: ./downloads/data.xlsx"

Automatic Detection Features:

  • Detects downloads triggered by clicks, keyboard presses, or navigation
  • Returns file path and filename in response
  • Downloads saved under virtual ./downloads (host default: ./tmp/downloads)
  • PDFs are always downloaded (never displayed in browser)
  • Works with all file types (PDF, Excel, CSV, images, etc.)

download_file itself uses a dual strategy:

  • Primary: Playwright request context (inherits browser cookies/session)
  • Fallback: browser navigation + download-event detection
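
The primary-then-fallback pattern can be sketched independently of any browser. In the example below, download_with_fallback and both stand-in strategies are hypothetical and exist only for illustration; the source does not state which failures cause download_file to fall back, so this sketch assumes any exception from the primary strategy does.

```python
import asyncio
from typing import Awaitable, Callable

Strategy = Callable[[str], Awaitable[str]]


async def download_with_fallback(url: str, primary: Strategy, fallback: Strategy) -> str:
    """Try the primary strategy; on any error, try the fallback once."""
    try:
        return await primary(url)
    except Exception as primary_error:
        try:
            return await fallback(url)
        except Exception as fallback_error:
            raise RuntimeError(
                f"Both strategies failed for {url}: "
                f"primary={primary_error!r}, fallback={fallback_error!r}"
            ) from fallback_error


# Stand-in strategies for demonstration only (not MARSYS functions).
async def request_context_download(url: str) -> str:
    raise ConnectionError("request context refused")


async def navigation_download(url: str) -> str:
    return "./downloads/" + url.rsplit("/", 1)[-1]


path = asyncio.run(
    download_with_fallback(
        "https://example.com/paper.pdf",
        request_context_download,
        navigation_download,
    )
)
print(path)
```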

Listing Downloads:

# List all files in the downloads directory
downloads = await browser_tool.list_downloads()
# Returns a formatted list with sizes and paths

PDF-Specific Behavior:

# PDFs are NEVER displayed in browser - always downloaded
await browser_tool.goto("https://research.org/paper.pdf")
# Automatically downloads to ./downloads/paper.pdf
# Browser stays on previous page
# search_page() does NOT work with PDFs
# Instead, use file operation tools on the downloaded file

Download Path Configuration:

browser_tool = await BrowserTool.create_safe(
    downloads_path="/custom/path/downloads",  # Custom host download directory
    temp_dir="/custom/tmp",  # Custom temp directory (default: ./tmp)
    downloads_virtual_dir="./downloads",  # Virtual path returned to agents
)

Session Persistence

Browser Session Persistence

BrowserAgent supports saving and loading browser sessions (cookies, localStorage) using Playwright's storage_state feature. This enables persistent authentication across browser sessions.

Loading a Saved Session

from marsys.agents import BrowserAgent

# Create agent with existing session state
agent = await BrowserAgent.create_safe(
    model_config=config,
    name="AuthenticatedBrowser",
    mode="advanced",
    session_path="./sessions/linkedin_session.json",  # Load existing session
    headless=True
)

# Browser is now initialized with saved cookies and localStorage
# Already logged in to LinkedIn, Google, etc.
await agent.run("Go to linkedin.com/feed and extract posts")

Saving a Session

# Save via BrowserAgent tool invocation
result = await agent.run("Save the current session to ./sessions/my_session.json")
# Returns a success message with cookie/origin counts
# You can save additional checkpoints as needed
result = await agent.run("Save the current session to ./sessions/backup.json")

Session File Format

The session file is a JSON file compatible with Playwright's storage_state:

{
  "cookies": [
    {
      "name": "session_id",
      "value": "abc123",
      "domain": ".example.com",
      "path": "/",
      "expires": 1735689600,
      "httpOnly": true,
      "secure": true
    }
  ],
  "origins": [
    {
      "origin": "https://example.com",
      "localStorage": [
        {"name": "user_token", "value": "xyz789"}
      ]
    }
  ]
}
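
Before reusing a saved session, it can help to check how many cookies it holds and whether any have expired. The sketch below parses a session document in the format shown above; summarize_session is a hypothetical helper, not a MARSYS function. It relies on the Playwright convention that a cookie with expires set to -1 is a session cookie with no fixed expiry.

```python
import json
import time
from typing import Any, Dict, Optional

SAMPLE_SESSION = """
{
  "cookies": [
    {"name": "session_id", "value": "abc123", "domain": ".example.com",
     "path": "/", "expires": 1735689600, "httpOnly": true, "secure": true}
  ],
  "origins": [
    {"origin": "https://example.com",
     "localStorage": [{"name": "user_token", "value": "xyz789"}]}
  ]
}
"""


def summarize_session(state: Dict[str, Any], now: Optional[float] = None) -> Dict[str, int]:
    """Count cookies, origins, and cookies whose expiry time has passed."""
    current = time.time() if now is None else now
    cookies = state.get("cookies", [])
    origins = state.get("origins", [])
    # expires <= 0 (including -1) is treated as "no fixed expiry"
    expired = sum(1 for cookie in cookies if 0 < cookie.get("expires", -1) <= current)
    return {"cookies": len(cookies), "origins": len(origins), "expired_cookies": expired}


state = json.loads(SAMPLE_SESSION)
print(summarize_session(state))
```

For a real file, replace the sample string with the contents of the saved session, for example by reading ./sessions/my_session.json with json.load.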

Error Handling

Resilient Operations

import asyncio
from typing import Any, Callable

from marsys.agents import BrowserAgent
# LogLevel must also be imported from the MARSYS package (module path not shown here)


class ResilientBrowserAgent(BrowserAgent):
    """Browser agent with enhanced error handling."""

    async def retry_operation(
        self,
        operation: Callable,
        max_retries: int = 3,
        backoff_factor: float = 2.0,
        context = None
    ):
        """Execute operation with exponential backoff retry."""
        last_error = None
        wait_time = 1.0
        for attempt in range(max_retries):
            try:
                result = await operation()
                if attempt > 0:
                    await self._log_progress(
                        context, LogLevel.INFO,
                        f"Operation succeeded on attempt {attempt + 1}"
                    )
                return result
            except Exception as e:
                last_error = e
                await self._log_progress(
                    context, LogLevel.WARNING,
                    f"Attempt {attempt + 1} failed: {e}"
                )
                # Sleep only if another attempt remains
                if attempt < max_retries - 1:
                    await asyncio.sleep(wait_time)
                    wait_time *= backoff_factor
        # Chain the last error so the original traceback is preserved
        raise RuntimeError(
            f"Operation failed after {max_retries} attempts: {last_error}"
        ) from last_error

    async def safe_extract(
        self,
        selector: str,
        default: Any = None,
        context = None
    ):
        """Safely extract element with fallback."""
        try:
            text = await self.browser_tool.get_text(selector)
            if text:
                return text.strip()
        except Exception as e:
            await self._log_progress(
                context, LogLevel.DEBUG,
                f"Failed to extract {selector}: {e}"
            )
        return default

Performance Optimization

Resource Blocking

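class OptimizedBrowserAgent(BrowserAgent):
    """Optimized browser agent for faster scraping."""

    BLOCKED_RESOURCE_TYPES = ("image", "stylesheet", "font", "media")

    async def setup_fast_scraping(self, context = None):
        """Configure browser for fast text scraping."""

        async def block_heavy_resources(route):
            # Abort requests for resources that text scraping does not need
            if route.request.resource_type in self.BLOCKED_RESOURCE_TYPES:
                await route.abort()
            else:
                await route.continue_()

        # Block unnecessary resources
        await self.browser_tool.context.route("**/*", block_heavy_resources)

        # Disable JavaScript if not needed.
        # Assuming browser_tool.context is a Playwright BrowserContext, JavaScript
        # cannot be toggled on an existing context; it has to be disabled when the
        # context is created (java_script_enabled=False).

        await self._log_progress(
            context, LogLevel.INFO,
            "Optimized browser for fast scraping"
        )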

Best Practices

1. Explicit Waits

# GOOD - Wait for specific conditions
await browser_tool.wait_for_selector("#content", timeout=10000, state="visible")
await browser_tool.wait_for_navigation()
# BAD - Fixed delays
await asyncio.sleep(5) # Unreliable and slow

2. Robust Selectors

# GOOD - Specific, stable selectors
await browser_tool.click("[data-testid='submit-button']")
await browser_tool.click("#unique-id")
# BAD - Fragile selectors
await browser_tool.click("div > span:nth-child(3)")

3. Resource Management

# GOOD - Always cleanup
browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="CleanupExample",
    mode="advanced",
    headless=True,
)
try:
    # Use agent
    result = await browser_agent.run(task)
finally:
    await browser_agent.browser_tool.close()

# BAD - Leaving browsers open
browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="LeakyBrowser",
    mode="advanced",
    headless=True,
)
result = await browser_agent.run(task)
# Browser left running!

4. Error Context

# GOOD - Detailed error context
try:
    await browser_tool.click(selector)
except Exception as e:
    await self._log_progress(
        context, LogLevel.ERROR,
        f"Failed to click {selector} on {await browser_tool.get_url()}: {e}"
    )
    # Take screenshot for debugging
    await browser_tool.screenshot("error_screenshot.png")

# BAD - Generic error handling
try:
    await browser_tool.click(selector)
except:
    print("Click failed")

Browser Automation Ready!

You now understand browser automation in MARSYS. The BrowserAgent provides powerful web interaction capabilities for your multi-agent workflows.