Browser Automation

MARSYS provides browser automation through the BrowserAgent, which supports web scraping, page interaction, and LLM-guided navigation in multi-agent workflows.

Overview

The browser automation system provides:

Dual Operation Modes: PRIMITIVE for fast content extraction, ADVANCED for complex multi-step scenarios with visual interaction
Web Navigation: Navigate, scrape, and interact with websites
Intelligent Automation: LLM-guided browser control and decision making
Dynamic Content Handling: JavaScript execution and async content loading
Form Automation: Fill forms, click elements, and handle interactions
Multimodal Capabilities: Screenshot-based visual understanding with element detection (ADVANCED mode)
Robust Error Handling: Retry mechanisms and resilient operations

Operation Modes

BrowserAgent supports two distinct operation modes optimized for different use cases:

PRIMITIVE Mode

Purpose: Fast, efficient content extraction without visual interaction

Characteristics:

High-level tools for quick content retrieval
No visual feedback or screenshots
No vision model required
Optimized for speed and simplicity
Single-step operations

Available Tools (6):

fetch_url - Navigate and extract content in one step
get_page_metadata - Get page title, URL, and links
download_file - Download files from URLs
list_downloads - List files in the downloads directory
get_page_elements - Get interactive elements with selectors (token-efficient format)
inspect_element - Get element details by selector (truncated text preview)

Best For: Web scraping and data extraction, content aggregation, simple information retrieval, API-like web interactions

ADVANCED Mode

Purpose: Complex multi-step scenarios requiring visual interaction and coordinate-based control

Characteristics:

Low-level coordinate-based tools
Visual feedback with auto-screenshot support
Vision model integration for visual understanding
Multi-step navigation and interaction
Form filling and complex workflows

Available Tools (20+): All PRIMITIVE mode tools, plus:

goto - Navigate to URL (auto-detects downloads)
scroll_up / scroll_down - Scroll the page
mouse_click - Click at specific coordinates (auto-detects downloads)
keyboard_input - Type text into focused input fields (search boxes, forms)
keyboard_press - Press special keys (Enter, Tab, arrows, etc.) (auto-detects downloads)
search_page - Find text on page with Chrome-like highlighting
go_back - Navigate back
reload - Reload current page
get_url / get_title - Get page information
screenshot - Take screenshot with element highlighting (returns multimodal ToolResponse)
inspect_element - Get element details by selector (truncated text preview)
inspect_at_position - Get element info at screen coordinates (x, y)
list_tabs / get_active_tab / switch_to_tab / close_tab - Tab management
save_session - Save browser session state for persistence

Best For: Form automation with complex interactions, multi-step workflows requiring visual confirmation, handling cookie popups and modals, sites with anti-bot protections, tasks requiring precise element interaction

Choosing the Right Mode

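from marsys.agents import BrowserAgent, BrowserAgentMode
from marsys.models import ModelConfig

# `config` below is a ModelConfig for the main agent model, defined elsewhere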

# Mode selection with enum (type-safe)
browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="scraper",
    mode=BrowserAgentMode.PRIMITIVE,  # Using enum
    goal="Efficiently fetch and extract content from web pages"
)

# Mode selection with string (convenient)
browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="scraper",
    mode="primitive",  # Using string
    goal="Efficiently fetch and extract content from web pages"
)

# ADVANCED mode - Visual interaction
browser_agent = await BrowserAgent.create_safe(
    model_config=config,  # Main agent model (Claude Haiku/Sonnet recommended)
    name="navigator",
    mode=BrowserAgentMode.ADVANCED,  # or mode="advanced"
    auto_screenshot=True,  # Enable visual feedback
    vision_model_config=ModelConfig(  # Vision model for screenshot analysis
        type="api",
        provider="openrouter",
        name="google/gemini-3-flash-preview",  # Recommended: fast and cost-effective
        # For complex tasks, use: "google/gemini-3-pro-preview"
        temperature=0,
        thinking_budget=0  # Disable thinking for faster vision responses
    ),
    goal="Navigate and interact with web pages like a human"
)

BrowserAgent

Creating a BrowserAgent

from marsys.agents import BrowserAgent
from marsys.models import ModelConfig

# PRIMITIVE Mode - Fast content extraction
browser_agent = await BrowserAgent.create_safe(
    model_config=ModelConfig(
        type="api",
        provider="openrouter",
        name="anthropic/claude-opus-4.6",
        temperature=0.3
    ),
    name="web_scraper",
    mode="primitive",  # Simple string mode selection
    goal="Fast web scraping agent for content extraction",
    headless=True,
    tmp_dir="./runs/run-20260206"
)

# ADVANCED Mode - Visual interaction with auto-screenshot
browser_agent_advanced = await BrowserAgent.create_safe(
    model_config=ModelConfig(
        type="api",
        provider="openrouter",
        name="anthropic/claude-opus-4.6",  # Main agent for decision-making and planning
        temperature=0.3
    ),
    name="web_navigator",
    mode="advanced",  # Simple string mode selection
    goal="Expert web automation agent for complex interactions",
    auto_screenshot=True,  # Enable visual feedback
    vision_model_config=ModelConfig(  # Required for auto-screenshot
        type="api",
        provider="openrouter",
        name="google/gemini-3-flash-preview",  # Recommended for browser vision
        temperature=0,
        thinking_budget=0  # Disable thinking for faster vision responses
    ),
    headless=False,
    tmp_dir="./runs/run-20260206"
)

# Always clean up
try:
    # Use the agent
    result = await browser_agent.run("Navigate to example.com and extract the main heading")
finally:
    if browser_agent.browser_tool:
        await browser_agent.browser_tool.close()

Virtual paths: BrowserAgent returns virtual paths for artifacts such as ./downloads/report.pdf and ./screenshots/step_1.png. See Run Filesystem.

BrowserAgent Artifact Configuration

BrowserAgent.create_safe(...) supports explicit download path behavior and tool naming:

browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="web_scraper",
    mode="primitive",
    tmp_dir="./runs/run-20260206",
    downloads_subdir="downloads",             # Host folder under tmp_dir
    downloads_virtual_dir="./downloads",      # Path shown to the agent
    fetch_file_tool_name="fetch_file",        # Expose download tool under custom name
)

downloads_subdir changes host-side layout under tmp_dir.
downloads_virtual_dir changes what agents see/return in tool outputs.
fetch_file_tool_name remaps the download tool name from the default download_file.
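
To make the split between host layout and agent-visible paths concrete, here is a minimal sketch of how a virtual download path could be mapped back to the host. The helper `to_host_path` is hypothetical and not part of MARSYS; it only assumes that the host folder is `tmp_dir/downloads_subdir` and that agents see paths rooted at `downloads_virtual_dir`, as described above.

```python
from pathlib import PurePosixPath


def to_host_path(
    virtual_path: str,
    tmp_dir: str = "./runs/run-20260206",
    downloads_subdir: str = "downloads",
    downloads_virtual_dir: str = "./downloads",
) -> PurePosixPath:
    """Map an agent-facing virtual path to its assumed host-side location.

    Hypothetical helper for illustration; MARSYS may resolve paths differently.
    """
    virtual = PurePosixPath(virtual_path)
    virtual_root = PurePosixPath(downloads_virtual_dir)
    # relative_to raises ValueError when the path is outside the virtual root
    relative = virtual.relative_to(virtual_root)
    return PurePosixPath(tmp_dir) / downloads_subdir / relative
```

With the defaults shown, `to_host_path("./downloads/report.pdf")` resolves under `runs/run-20260206/downloads/`, while a path outside the virtual downloads directory raises `ValueError`.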

Viewport Auto-Detection

If viewport_width/viewport_height are not provided, BrowserAgent picks defaults by model family:

Google/Gemini: 1000x1000
Anthropic/Claude: 1344x896
OpenAI/GPT: 1024x768
Fallback: 1536x1536
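
The selection rule can be sketched as a small lookup. This is an illustration only: the function name `pick_viewport` and the substring matching on the model name are assumptions, and which model's name MARSYS actually consults (main or vision) is not stated here. The sizes are the ones listed above.

```python
from typing import Optional, Tuple

# Sizes come from the list above; the keyword matching is an assumption
# made for this sketch, not MARSYS's actual detection logic.
VIEWPORT_DEFAULTS = {
    ("google", "gemini"): (1000, 1000),
    ("anthropic", "claude"): (1344, 896),
    ("openai", "gpt"): (1024, 768),
}
FALLBACK_VIEWPORT = (1536, 1536)


def pick_viewport(
    model_name: str,
    viewport_width: Optional[int] = None,
    viewport_height: Optional[int] = None,
) -> Tuple[int, int]:
    """Return explicit dimensions if both are given, else a family default."""
    if viewport_width is not None and viewport_height is not None:
        return (viewport_width, viewport_height)
    lowered = model_name.lower()
    for keywords, size in VIEWPORT_DEFAULTS.items():
        if any(keyword in lowered for keyword in keywords):
            return size
    return FALLBACK_VIEWPORT
```

Passing both dimensions explicitly bypasses the lookup, which matches the "if not provided" condition above.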

Using AgentPool for Parallel Browsing

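import asyncio
from typing import List

from marsys.agents import AgentPool, BrowserAgent

# Create pool of browser agents
browser_pool = AgentPool(
    agent_class=BrowserAgent,
    num_instances=3,
    model_config=config,
    agent_name="BrowserPool",
    headless=True
)

# Parallel scraping
async def scrape_urls(urls: List[str]):
    async def scrape_one(i: int, url: str):
        # Await the run inside the context so the agent is held until its
        # task finishes; exiting early would release it back to the pool
        async with browser_pool.acquire(f"branch_{i}") as agent:
            return await agent.run(f"Scrape content from {url}")

    # Each branch acquires its own agent; gather runs the branches concurrently
    results = await asyncio.gather(
        *(scrape_one(i, url) for i, url in enumerate(urls))
    )
    return results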

# Cleanup pool
await browser_pool.cleanup()

Text Search on Page

New Feature: search_page()

Find text on web pages with Chrome-like visual highlighting and navigation!

# browser_tool refers to the agent's underlying tool (browser_agent.browser_tool)
# Search for text on the current page
result = await browser_tool.search_page("quantum computing")
# Returns: "Match 1/5 found and highlighted"
# All matches highlighted in YELLOW, current match in ORANGE

# Navigate to next match - call again with SAME term
result = await browser_tool.search_page("quantum computing")
# Returns: "Match 2/5"
# Scrolls to and highlights next occurrence

# Continue navigating
result = await browser_tool.search_page("quantum computing")
# Returns: "Match 3/5"
# Wraps around after last match back to first

Features:

Visual Highlighting: All matches in YELLOW, current in ORANGE (Chrome-like)
Auto-scroll: Automatically scrolls to current match (centered in viewport)
Match Counter: Shows "Match X/Y" so you know your progress
Wrap-around: After last match, returns to first match
Case-insensitive: Finds text regardless of case

Limitations:

Does NOT work with PDF files (PDFs are auto-downloaded, not displayed)
Does NOT search across multiple pages

search_page does work on regular web pages, including shadow DOM content.
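
The navigation behavior described above (repeat the same term to advance, wrap around after the last match, ignore case) can be modeled in a few lines. The class below is a hypothetical stand-in that works on plain text rather than a live page; the "No matches found" message is an assumption, since the no-match response is not documented here.

```python
import re
from typing import List, Optional


class SearchCursor:
    """Tracks the current match across repeated searches over page text."""

    def __init__(self, page_text: str) -> None:
        self.page_text = page_text
        self.term: Optional[str] = None
        self.positions: List[int] = []
        self.index = -1

    def search(self, term: str) -> str:
        if term != self.term:
            # New term: recompute matches case-insensitively and restart
            self.term = term
            pattern = re.compile(re.escape(term), re.IGNORECASE)
            self.positions = [m.start() for m in pattern.finditer(self.page_text)]
            self.index = -1
        if not self.positions:
            return "No matches found"  # assumed wording
        # Same term again: advance, wrapping to the first match after the last
        self.index = (self.index + 1) % len(self.positions)
        return f"Match {self.index + 1}/{len(self.positions)}"
```

Calling `search` with a different term resets the counter, which is why the examples above always pass the same term to move to the next match.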

Example - Finding Specific Information:

# Navigate to documentation page
await browser_tool.goto("https://docs.example.com/api")

# Search for specific API endpoint
result = await browser_tool.search_page("/api/v2/users")
# Match 1/3 found - scrolls to first occurrence

# Check if it's the right one with screenshot
screenshot = await browser_tool.screenshot()
# Visual: See highlighted text in orange

# Not the right one? Navigate to next match
result = await browser_tool.search_page("/api/v2/users")
# Match 2/3 - scrolls to second occurrence

Automatic Download Detection

Smart Download Handling

The browser detects when an action (a click, an Enter key press, or a navigation) triggers a file download and reports the saved file in the tool response:

# Clicking a download link automatically detects the download
result = await browser_tool.mouse_click(x=450, y=300)
# Returns: "Action 'mouse_click' triggered a file download.
#          File 'report.pdf' has been downloaded to: ./downloads/report.pdf"

# Navigating to a PDF URL triggers automatic download
result = await browser_tool.goto("https://example.com/paper.pdf")
# Returns: "Action 'goto' triggered a file download.
#          File 'paper.pdf' has been downloaded to: ./downloads/paper.pdf"

# Pressing Enter on a download button
await browser_tool.mouse_click(x=500, y=400)  # Focus download button
await browser_tool.keyboard_press("Enter")
# Returns: "Action 'keyboard_press' triggered a file download.
#          File 'data.xlsx' has been downloaded to: ./downloads/data.xlsx"

Automatic Detection Features:

Detects downloads triggered by clicks, keyboard presses, or navigation
Returns file path and filename in response
Downloads saved under virtual ./downloads (host default: ./tmp/downloads)
PDFs are always downloaded (never displayed in browser)
Works with all file types (PDF, Excel, CSV, images, etc.)

download_file itself uses a dual strategy:

Primary: Playwright request context (inherits browser cookies/session)
Fallback: browser navigation + download-event detection
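
The primary/fallback arrangement follows a common pattern, sketched below with generic async callables. Everything here is illustrative: `download_with_fallback` and the two stand-in strategies are hypothetical names, and the sketch assumes the fallback is tried on any exception from the primary, which the description above does not specify.

```python
from typing import Awaitable, Callable

Downloader = Callable[[str], Awaitable[str]]


async def download_with_fallback(
    url: str, primary: Downloader, fallback: Downloader
) -> str:
    """Try the primary strategy; on failure, try the fallback."""
    try:
        return await primary(url)
    except Exception as primary_error:
        try:
            return await fallback(url)
        except Exception as fallback_error:
            # Chain the errors so both failures stay visible in the traceback
            raise fallback_error from primary_error


# Hypothetical stand-ins for the two strategies
async def via_request_context(url: str) -> str:
    raise ConnectionError("request context rejected")


async def via_navigation(url: str) -> str:
    return "./downloads/" + url.rsplit("/", 1)[-1]
```

Trying the request-based strategy first is the cheaper path because no page load is needed; navigation serves as a backup for cases where the direct request fails.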

Listing Downloads:

# List all files in the downloads directory
downloads = await browser_tool.list_downloads()
# Returns a formatted list with sizes and paths

PDF-Specific Behavior:

# PDFs are NEVER displayed in browser - always downloaded
await browser_tool.goto("https://research.org/paper.pdf")
# Automatically downloads to ./downloads/paper.pdf
# Browser stays on previous page

# search_page() does NOT work with PDFs
# Instead, use file operation tools on the downloaded file

Download Path Configuration:

browser_tool = await BrowserTool.create_safe(
    downloads_path="/custom/path/downloads",  # Custom host download directory
    temp_dir="/custom/tmp",  # Custom temp directory (default: ./tmp)
    downloads_virtual_dir="./downloads",  # Virtual path returned to agents
)

Session Persistence

Browser Session Persistence

BrowserAgent supports saving and loading browser sessions (cookies, localStorage) using Playwright's storage_state feature. This enables persistent authentication across browser sessions.

Loading a Saved Session

from marsys.agents import BrowserAgent

# Create agent with existing session state
agent = await BrowserAgent.create_safe(
    model_config=config,
    name="AuthenticatedBrowser",
    mode="advanced",
    session_path="./sessions/linkedin_session.json",  # Load existing session
    headless=True
)

# Browser is now initialized with saved cookies and localStorage
# Already logged in to LinkedIn, Google, etc.
await agent.run("Go to linkedin.com/feed and extract posts")

Saving a Session

# Save via BrowserAgent tool invocation
result = await agent.run("Save the current session to ./sessions/my_session.json")
# Returns a success message with cookie/origin counts

# You can save additional checkpoints as needed
result = await agent.run("Save the current session to ./sessions/backup.json")

Session File Format

The session file is a JSON file compatible with Playwright's storage_state:

{
  "cookies": [
    {
      "name": "session_id",
      "value": "abc123",
      "domain": ".example.com",
      "path": "/",
      "expires": 1735689600,
      "httpOnly": true,
      "secure": true
    }
  ],
  "origins": [
    {
      "origin": "https://example.com",
      "localStorage": [
        {"name": "user_token", "value": "xyz789"}
      ]
    }
  ]
}
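
Because the file is plain JSON, it can be inspected before it is passed as session_path, for example to spot expired cookies. The helpers below are a hypothetical sketch using only the standard library; they assume the structure shown above and Playwright's convention of `expires: -1` for session cookies.

```python
import json
import time
from typing import Any, Dict, Optional


def summarize_session(
    state: Dict[str, Any], now: Optional[float] = None
) -> Dict[str, int]:
    """Count cookies, expired cookies, and origins in a storage_state dict."""
    now = time.time() if now is None else now
    cookies = state.get("cookies", [])
    # Session cookies (expires == -1) are never counted as expired here
    expired = [c for c in cookies if 0 <= c.get("expires", -1) < now]
    return {
        "cookies": len(cookies),
        "expired_cookies": len(expired),
        "origins": len(state.get("origins", [])),
    }


def load_session(path: str) -> Dict[str, Any]:
    """Read a saved session file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A non-zero `expired_cookies` count suggests the saved login may no longer be valid and the session should be saved again after re-authenticating.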

Error Handling

Resilient Operations

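import asyncio
from typing import Any, Callable

# LogLevel is assumed to be imported from MARSYS; its module path is not shown here

class ResilientBrowserAgent(BrowserAgent):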
    """Browser agent with enhanced error handling."""

    async def retry_operation(
        self,
        operation: Callable,
        max_retries: int = 3,
        backoff_factor: float = 2.0,
        context = None
    ):
        """Execute operation with exponential backoff retry."""

        last_error = None
        wait_time = 1.0

        for attempt in range(max_retries):
            try:
                result = await operation()
                if attempt > 0:
                    await self._log_progress(
                        context, LogLevel.INFO,
                        f"Operation succeeded on attempt {attempt + 1}"
                    )
                return result

            except Exception as e:
                last_error = e
                await self._log_progress(
                    context, LogLevel.WARNING,
                    f"Attempt {attempt + 1} failed: {e}"
                )

                if attempt < max_retries - 1:
                    await asyncio.sleep(wait_time)
                    wait_time *= backoff_factor

        # Chain the last error so the original traceback is preserved
        raise RuntimeError(
            f"Operation failed after {max_retries} attempts: {last_error}"
        ) from last_error

    async def safe_extract(
        self,
        selector: str,
        default: Any = None,
        context = None
    ):
        """Safely extract element with fallback."""

        try:
            text = await self.browser_tool.get_text(selector)
            if text:
                return text.strip()
        except Exception as e:
            await self._log_progress(
                context, LogLevel.DEBUG,
                f"Failed to extract {selector}: {e}"
            )

        return default
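
To see what the retry defaults cost in wall-clock time, the wait schedule from retry_operation can be computed on its own. The function `backoff_delays` is a hypothetical helper that mirrors the loop above (start at 1.0 second, multiply by backoff_factor, no wait after the final attempt).

```python
from typing import List


def backoff_delays(
    max_retries: int = 3, backoff_factor: float = 2.0, initial_wait: float = 1.0
) -> List[float]:
    """Waits inserted between attempts; none follows the final attempt."""
    delays: List[float] = []
    wait_time = initial_wait
    for attempt in range(max_retries):
        if attempt < max_retries - 1:
            delays.append(wait_time)
            wait_time *= backoff_factor
    return delays
```

With the defaults, a fully failing operation sleeps for 1 second and then 2 seconds before the final attempt, so retries add 3 seconds on top of the time spent in the attempts themselves.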

Performance Optimization

Resource Blocking

class OptimizedBrowserAgent(BrowserAgent):
    """Optimized browser agent for faster scraping."""

    async def setup_fast_scraping(self, context = None):
        """Configure browser for fast text scraping."""

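        # Block resource types that text scraping does not need
        blocked_types = {"image", "stylesheet", "font", "media"}

        async def block_heavy_resources(route):
            if route.request.resource_type in blocked_types:
                await route.abort()
            else:
                await route.continue_()

        await self.browser_tool.context.route("**/*", block_heavy_resources)

        # Note: a Playwright context has no runtime switch for JavaScript.
        # If scripts are not needed, the context must be created with
        # java_script_enabled=False (an option of browser.new_context).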

        await self._log_progress(
            context, LogLevel.INFO,
            "Optimized browser for fast scraping"
        )

Best Practices

1. Explicit Waits

# GOOD - Wait for specific conditions
await browser_tool.wait_for_selector("#content", timeout=10000, state="visible")
await browser_tool.wait_for_navigation()

# BAD - Fixed delays
await asyncio.sleep(5)  # Unreliable and slow

2. Robust Selectors

# GOOD - Specific, stable selectors
await browser_tool.click("[data-testid='submit-button']")
await browser_tool.click("#unique-id")

# BAD - Fragile selectors
await browser_tool.click("div > span:nth-child(3)")

3. Resource Management

# GOOD - Always cleanup
browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="CleanupExample",
    mode="advanced",
    headless=True,
)
try:
    # Use agent
    result = await browser_agent.run(task)
finally:
    if browser_agent.browser_tool:
        await browser_agent.browser_tool.close()

# BAD - Leaving browsers open
browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="LeakyBrowser",
    mode="advanced",
    headless=True,
)
result = await browser_agent.run(task)
# Browser left running!
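
One way to make the cleanup hard to forget is to wrap it in an async context manager. The helper `closing_browser` below is a hypothetical sketch, not part of MARSYS; it relies only on the `browser_tool.close()` call used in the examples above.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def closing_browser(agent):
    """Yield the agent and close its browser_tool on exit, even on error."""
    try:
        yield agent
    finally:
        # browser_tool may be missing or None if initialization failed
        tool = getattr(agent, "browser_tool", None)
        if tool is not None:
            await tool.close()


# Intended usage (inside an async function), assuming an already created agent:
#
#     async with closing_browser(browser_agent) as agent:
#         result = await agent.run(task)
```

The `finally` clause runs whether the body completes or raises, so the browser is closed in both cases without repeating the try/finally at every call site.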

4. Error Context

# GOOD - Detailed error context
try:
    await browser_tool.click(selector)
except Exception as e:
    await self._log_progress(
        context, LogLevel.ERROR,
        f"Failed to click {selector} on {await browser_tool.get_url()}: {e}"
    )
    # Take screenshot for debugging
    await browser_tool.screenshot("error_screenshot.png")

# BAD - Generic error handling
try:
    await browser_tool.click(selector)
except:
    print("Click failed")

Browser Automation Ready!

You now understand browser automation in MARSYS. The BrowserAgent provides powerful web interaction capabilities for your multi-agent workflows.