Browser Automation

MARSYS provides browser automation through the BrowserAgent, which supports web scraping, page interaction, and LLM-guided navigation.

Overview

The browser automation system provides:

  • Dual Operation Modes: PRIMITIVE for fast content extraction, ADVANCED for complex multi-step scenarios with visual interaction
  • Web Navigation: Navigate, scrape, and interact with websites
  • Intelligent Automation: LLM-guided browser control and decision making
  • Dynamic Content Handling: JavaScript execution and async content loading
  • Form Automation: Fill forms, click elements, and handle interactions
  • Multimodal Capabilities: Screenshot-based visual understanding with element detection (ADVANCED mode)
  • Session Persistence: Save and load browser sessions across runs
  • Robust Error Handling: Retry mechanisms and resilient operations

Operation Modes

PRIMITIVE Mode

Fast, efficient content extraction without visual interaction:

  • High-level tools for quick content retrieval
  • No visual feedback or screenshots
  • No vision model required
  • Optimized for speed and simplicity
  • Best for web scraping, content aggregation, and API-like web interactions

ADVANCED Mode

Complex multi-step scenarios requiring visual interaction and coordinate-based control:

  • Low-level coordinate-based tools
  • Visual feedback with auto-screenshot support
  • Vision model integration for visual understanding
  • Multi-step navigation and interaction
  • Best for form automation, cookie popups, anti-bot sites, and complex workflows

Creating a BrowserAgent

PRIMITIVE Mode

from marsys.agents import BrowserAgent, BrowserAgentMode
from marsys.models import ModelConfig

# PRIMITIVE Mode - Fast content extraction
browser_agent = await BrowserAgent.create_safe(
    model_config=ModelConfig(
        type="api",
        provider="openrouter",
        name="anthropic/claude-haiku-4.5",
        temperature=0.3
    ),
    name="web_scraper",
    mode="primitive",  # or BrowserAgentMode.PRIMITIVE
    goal="Fast web scraping agent for content extraction",
    headless=True,
    tmp_dir="./tmp/browser"
)

ADVANCED Mode

# ADVANCED Mode - Visual interaction
browser_agent = await BrowserAgent.create_safe(
    model_config=ModelConfig(
        type="api",
        provider="openrouter",
        name="anthropic/claude-haiku-4.5",
        temperature=0.3
    ),
    name="web_navigator",
    mode="advanced",  # or BrowserAgentMode.ADVANCED
    goal="Expert web automation for complex interactions",
    auto_screenshot=True,  # Enable visual feedback
    vision_model_config=ModelConfig(
        type="api",
        provider="openrouter",
        name="google/gemini-2.5-flash",  # Fast and cost-effective
        temperature=0,
        thinking_budget=0
    ),
    headless=False,
    tmp_dir="./tmp/screenshots"
)

Cleanup

Always clean up browser resources when done to avoid memory leaks. Use await browser_agent.cleanup() in a finally block.
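
One way to make this guarantee hard to forget is to wrap the agent in an async context manager. The sketch below is not part of MARSYS: `managed` is a hypothetical helper, and it assumes only that the object exposes an awaitable `cleanup()` method, as described above. A stand-in class replaces the real agent so the example runs without a browser.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def managed(agent):
    """Yield the agent and always await agent.cleanup() on exit (hypothetical helper)."""
    try:
        yield agent
    finally:
        await agent.cleanup()


class StandInAgent:
    """Stand-in for a BrowserAgent so the sketch runs without MARSYS installed."""

    def __init__(self):
        self.cleaned_up = False

    async def cleanup(self):
        self.cleaned_up = True


async def demo():
    agent = StandInAgent()
    try:
        async with managed(agent):
            # Simulate a task that fails halfway through
            raise RuntimeError("simulated failure mid-task")
    except RuntimeError:
        pass
    return agent.cleaned_up


cleanup_ran = asyncio.run(demo())
```

With a real agent, the same pattern would read `async with managed(browser_agent): ...`, and the explicit `finally` block is no longer needed at each call site.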

Available Tools

PRIMITIVE Mode Tools

  • fetch_url - Navigate and extract content in one step
  • get_page_metadata - Get page title, URL, and links
  • download_file - Download files from URLs
  • get_page_elements - Get interactive elements with selectors (token-efficient format)
  • inspect_element - Get element details by selector (truncated text preview)

ADVANCED Mode Tools (Additional)

All PRIMITIVE mode tools, plus:

  • goto - Navigate to URL (auto-detects downloads)
  • scroll_up / scroll_down - Scroll the page
  • mouse_click - Click at specific coordinates (auto-detects downloads)
  • keyboard_input - Type text into focused input fields (search boxes, forms)
  • keyboard_press - Press special keys (Enter, Tab, arrows, etc.) (auto-detects downloads)
  • search_page - Find text on page with Chrome-like highlighting
  • go_back / reload - Navigation controls
  • get_url / get_title - Get current page information
  • screenshot - Take screenshot with element highlighting (returns multimodal ToolResponse)
  • inspect_at_position - Get element info at screen coordinates (x, y)
  • list_tabs / get_active_tab / switch_to_tab / close_tab - Tab management
  • save_session - Save browser session state for persistence

Usage Example

from marsys.agents import BrowserAgent
from marsys.coordination import Orchestra

# Create browser agent (config is a ModelConfig, as in "Creating a BrowserAgent")
browser_agent = await BrowserAgent.create_safe(
    model_config=config,
    name="scraper",
    mode="primitive",
    goal="Extract data from websites"
)

try:
    # Use with Orchestra
    topology = {
        "agents": ["scraper"],
        "flows": []
    }
    result = await Orchestra.run(
        task="Go to example.com and extract all headings",
        topology=topology
    )
finally:
    # Release browser resources even if the run fails
    await browser_agent.cleanup()

Text Search on Page

search_page()

Find text on the current web page with Chrome-like highlighting, and step through the matches by calling the tool again with the same term.

# Search for text on the current page
result = await browser_tool.search_page("quantum computing")
# Returns: "Match 1/5 found and highlighted"
# All matches highlighted in YELLOW, current match in ORANGE

# Navigate to next match - call again with SAME term
result = await browser_tool.search_page("quantum computing")
# Returns: "Match 2/5"
# Scrolls to and highlights next occurrence

# Continue navigating (wraps around after last match)
result = await browser_tool.search_page("quantum computing")
# Returns: "Match 3/5"

Features:

  • Visual Highlighting: All matches in YELLOW, current in ORANGE (Chrome-like)
  • Auto-scroll: Automatically scrolls to current match (centered in viewport)
  • Match Counter: Shows "Match X/Y" so you know your progress
  • Wrap-around: After last match, returns to first match
  • Case-insensitive: Finds text regardless of case

Limitations:

  • ❌ Does NOT work with PDF files (PDFs are auto-downloaded, not displayed)
  • ❌ Does NOT search across multiple pages
  • ✅ Works with regular web pages, including shadow DOM content
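
The counter semantics described above (case-insensitive matching, advancing on a repeated term, wrapping after the last match) can be modelled in a few lines. This is a toy model on a plain string, not the MARSYS implementation: the class name, the reset on a new term, and the "No matches found" message are assumptions made for illustration.

```python
class SearchCursor:
    """Toy model of repeated search_page calls against a fixed page text."""

    def __init__(self, page_text: str):
        self.page_text = page_text
        self.term = None   # last searched term, lowercased
        self.index = 0     # 1-based position of the current match

    def search(self, term: str) -> str:
        if not term:
            raise ValueError("search term must not be empty")
        needle = term.lower()
        # Case-insensitive count of non-overlapping occurrences
        total = self.page_text.lower().count(needle)
        if total == 0:
            self.term, self.index = None, 0
            return "No matches found"  # assumed wording
        if needle == self.term:
            # Same term again: advance, wrapping after the last match
            self.index = self.index % total + 1
        else:
            # New term: start from the first match (assumed behaviour)
            self.term, self.index = needle, 1
        return f"Match {self.index}/{total}"


cursor = SearchCursor(
    "Quantum computing is hard. QUANTUM COMPUTING is fun. quantum computing wins."
)
results = [cursor.search("quantum computing") for _ in range(4)]
# results == ["Match 1/3", "Match 2/3", "Match 3/3", "Match 1/3"]
```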

Automatic Download Detection

Smart Download Handling

Actions that trigger a file download are detected automatically, and the tool response reports the file name and the path it was saved to.

# Clicking a download link automatically detects the download
result = await browser_tool.mouse_click(x=450, y=300)
# Returns: "Action 'mouse_click' triggered a file download.
# File 'report.pdf' has been downloaded to: /path/to/tmp/downloads/report.pdf"

# Navigating to a PDF URL triggers automatic download
result = await browser_tool.goto("https://example.com/paper.pdf")
# Returns: "Action 'goto' triggered a file download.
# File 'paper.pdf' has been downloaded to: /path/to/tmp/downloads/paper.pdf"

# Pressing Enter on a download button
await browser_tool.mouse_click(x=500, y=400)  # Focus download button
await browser_tool.keyboard_press("Enter")
# Returns: "Action 'keyboard_press' triggered a file download.
# File 'data.xlsx' has been downloaded to: /path/to/tmp/downloads/data.xlsx"

Automatic Detection Features:

  • ✅ Detects downloads triggered by clicks, keyboard presses, or navigation
  • ✅ Returns file path and filename in response
  • ✅ Downloads saved to ./tmp/downloads/ by default
  • ✅ PDFs are always downloaded (never displayed in browser)
  • ✅ Works with all file types (PDF, Excel, CSV, images, etc.)
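
If calling code needs the file path rather than the full message, the response text can be parsed. The sketch below assumes the result is (or can be converted to) a string with exactly the wording shown in the examples above. `parse_download_message` and `DownloadInfo` are hypothetical names, and the pattern would need adjusting if the real message differs.

```python
import re
from typing import NamedTuple, Optional


class DownloadInfo(NamedTuple):
    action: str
    filename: str
    path: str


# Pattern based on the example messages in this section (an assumption,
# not a documented format guarantee).
_DOWNLOAD_RE = re.compile(
    r"Action '(?P<action>[^']+)' triggered a file download\.\s*"
    r"File '(?P<filename>[^']+)' has been downloaded to:\s*(?P<path>\S.*?)\s*$",
    re.DOTALL,
)


def parse_download_message(message: str) -> Optional[DownloadInfo]:
    """Extract action, file name, and path, or return None if no download is reported."""
    match = _DOWNLOAD_RE.search(message)
    if match is None:
        return None
    return DownloadInfo(match["action"], match["filename"], match["path"])


message = (
    "Action 'mouse_click' triggered a file download.\n"
    "File 'report.pdf' has been downloaded to: /path/to/tmp/downloads/report.pdf"
)
info = parse_download_message(message)
```

Returning `None` for non-matching text lets the same helper be applied to every tool result without first checking whether a download occurred.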

Session Persistence

Browser Session Persistence

BrowserAgent supports saving and loading browser sessions (cookies, localStorage) using Playwright's storage_state feature. This enables persistent authentication across browser sessions.

Loading a Saved Session

from marsys.agents import BrowserAgent

# Create agent with existing session state
agent = await BrowserAgent.create_safe(
    model_config=config,
    name="AuthenticatedBrowser",
    mode="advanced",
    session_path="./sessions/linkedin_session.json",  # Load existing session
    headless=True
)

# Browser is now initialized with saved cookies and localStorage
# Already logged in to LinkedIn, Google, etc.
await agent.run("Go to linkedin.com/feed and extract posts")

Saving a Session

# After logging in manually or programmatically, save the session
result = await agent.browser_tool.save_session("./sessions/my_session.json")
# Returns: "Session saved successfully to ./sessions/my_session.json. Saved 15 cookies and 3 origin storage entries."
# The agent can also save sessions via tool calls
result = await agent.run("Save the current session to ./sessions/backup.json")
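
Before reusing a saved session, it can help to inspect the file. The sketch below assumes the file is written in Playwright's storage_state JSON layout (a top-level `cookies` list and an `origins` list, with cookie `expires` given as a Unix timestamp and `-1` for session cookies) and that MARSYS stores it unchanged. `summarize_session` is a hypothetical helper, and the example data is made up.

```python
import json
import tempfile
import time
from pathlib import Path


def summarize_session(path, now=None):
    """Count cookies and origins in a storage_state file and list expired cookies."""
    now = time.time() if now is None else now
    state = json.loads(Path(path).read_text(encoding="utf-8"))
    cookies = state.get("cookies", [])
    origins = state.get("origins", [])
    # expires == -1 marks a session cookie, which has no expiry timestamp
    expired = [c["name"] for c in cookies if 0 < c.get("expires", -1) < now]
    return {
        "cookies": len(cookies),
        "origins": len(origins),
        "expired_cookies": expired,
    }


# Made-up session data in the assumed layout
example_state = {
    "cookies": [
        {"name": "session_id", "value": "abc", "domain": "example.com",
         "path": "/", "expires": -1},
        {"name": "old_token", "value": "xyz", "domain": "example.com",
         "path": "/", "expires": 1_600_000_000},
        {"name": "remember_me", "value": "1", "domain": "example.com",
         "path": "/", "expires": 1_900_000_000},
    ],
    "origins": [
        {"origin": "https://example.com",
         "localStorage": [{"name": "theme", "value": "dark"}]},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    session_file = Path(tmp) / "my_session.json"
    session_file.write_text(json.dumps(example_state), encoding="utf-8")
    summary = summarize_session(session_file, now=1_700_000_000)
```

A non-empty `expired_cookies` list suggests the saved login may no longer be valid and the session should be refreshed before a headless run.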

Choosing the Right Mode

Mode Selection

Use PRIMITIVE for simple scraping tasks where speed matters.
Use ADVANCED when you need to interact with forms, handle popups, or navigate complex multi-step workflows.
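
The rule of thumb above can be written down as a small decision helper. This is a sketch of the guidance in this section only: `TaskProfile` and `choose_mode` are hypothetical names, and the returned strings match the `mode` values used in the earlier examples.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of what a browsing task requires."""
    fills_forms: bool = False
    handles_popups: bool = False
    multi_step: bool = False
    needs_screenshots: bool = False


def choose_mode(task: TaskProfile) -> str:
    """Return "advanced" if the task needs interaction, else "primitive"."""
    needs_interaction = any((
        task.fills_forms,
        task.handles_popups,
        task.multi_step,
        task.needs_screenshots,
    ))
    return "advanced" if needs_interaction else "primitive"


scrape_mode = choose_mode(TaskProfile())
login_mode = choose_mode(TaskProfile(fills_forms=True, multi_step=True))
```

The result can be passed directly as the `mode` argument when creating the agent.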