Architecture

How a prompt becomes browser automation

Components, command layers, the agent loop, and two browser modes — and how they fit together.


The core parts

Three components are always involved, plus a fourth — the Chrome extension — that's used only in session mode.

claude-code
Agent

Reasons about the scenario, calls MCP tools, observes results, decides the next step. Every command is a markdown file Claude executes verbatim.

mcp/
MCP server

Long-running Node.js process. Owns the live browser session, exposes ~25 tools, and routes calls to either Playwright or the extension based on mode.

tool/
Tools

Browser tools (navigate, click, get_accessibility_tree…) and file-only tools (write_page_object, write_feature_doc, record_run…). Each tool is a single TypeScript module.

Optional fourth part — Chrome extension. An MV3 extension that connects to the MCP server via WebSocket and proxies commands into your real Chrome tab using playwright-crx. Loaded only when running in Mode 2 (session); ignored in the default Playwright mode.

The MCP server is the centre of gravity. It's worth a closer look before we describe how anything else works.


The MCP server

The MCP server is the long-running process every other component talks to. Claude Code launches it as a subprocess on demand, it exposes the brow-use tools as a single MCP toolset, and it owns the live browser handle for the duration of the user's session — so the browser stays open and in the same state between tool calls. That continuity is what makes the agent loop possible: each step picks up exactly where the last one left off.

Two routing decisions live here. Mode — each browser tool call goes to either Playwright (default) or the WebSocket bridge that talks to the extension (session). Type — file-only tools (write a page object, write a feature doc, record a run) always run on the Node side regardless of mode; pure-compute tools (like fingerprint comparison) run with neither browser nor disk.
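
The two routing decisions can be pictured as one small dispatch function. The sketch below is illustrative only: `Mode`, `ToolKind`, `Backend`, and `routeToolCall` are hypothetical names chosen for this example, not identifiers from the brow-use source.

```typescript
// Hypothetical names throughout: a sketch of the routing rules described
// above, not the actual MCP server code.
type Mode = "playwright" | "session";
type ToolKind = "browser" | "file" | "compute";
type Backend = "playwright" | "extension-bridge" | "node-fs" | "in-memory";

function routeToolCall(kind: ToolKind, mode: Mode): Backend {
  switch (kind) {
    case "file":
      // File-only tools always run on the Node side, whatever the mode.
      return "node-fs";
    case "compute":
      // Pure-compute tools need neither a browser nor the disk.
      return "in-memory";
    case "browser":
      // Only browser tools are affected by the mode.
      return mode === "session" ? "extension-bridge" : "playwright";
  }
}

// A browser tool follows the mode; a file tool ignores it.
const clickInSession = routeToolCall("browser", "session");
const writeDocInSession = routeToolCall("file", "session");
```

Keeping the tool type check ahead of the mode check mirrors the text: mode only matters once a call is known to be a browser call.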

The commands that call these tools fall into four layers. Knowing that taxonomy first makes the agent loop section easier to follow, since what the loop reads and writes varies by layer.


The command layers

Commands are organised into four layers. Each layer builds on the artifacts produced by the one below it, but every layer is independently runnable — you don't have to do /bu:explore first to run /bu:explore-guided, and you don't have to generate Page Objects to run /bu:run-instruction.

Layer 1   Live capture

Drive the browser to capture raw artifacts: traces and aria-tree logs. The user describes intent; the agent executes it.

Layer 2   Deterministic post-processing

Read trace zips and turn them into queryable artifacts. No browser, no LLM — pure shell and TypeScript so the same trace always produces the same output.

Layer 3   Knowledge generation

Take Layer 1 + Layer 2 artifacts and produce human and machine knowledge: docs the user reads and Page Objects tests can import.

Layer 4   Grounded execution & composition

With docs and POMs in place, either execute business-level intents against the app or generate reusable workflow code grounded in the knowledge layer below.

Layers 1, 3, and 4 each run the same agent loop — the shape stays constant; only the inputs and outputs change. That's the next section.


The agent loop

Every long-running brow-use command (Layers 1, 3, and 4) runs an agent loop, not a script. Layer 2 is the exception — pure deterministic post-processing with no LLM in the path — so this section describes the rest. The loop's shape stays the same across the three LLM-driven layers — perceive, reason, act, record — but what each step reads from and writes to changes by layer.

Layer 1 form — driving the live browser

When the loop drives a real browser, perception is primarily the accessibility tree, with snapshots (screenshots) and a filtered, safety-policy-applied element list as supporting inputs. Actions navigate, click, or type.

Perceive (accessibility tree): primary input, with snapshots and a filtered element list as supporting inputs.
Reason (Claude): pick the next action based on intent and current state.
Act (click / type / navigate): browser state changes.
Record (trace + write_*): trace event recorded; artifacts written when ready.
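
The four steps can be sketched as a loop over injected functions. This is a minimal illustration under stated assumptions: the `Action` and `LoopDeps` types, the `runAgentLoop` name, the `"done"` stop signal, and the `maxSteps` cap are all hypothetical, and an in-memory string stands in for the browser.

```typescript
// Hypothetical types: a sketch of the perceive/reason/act/record shape,
// not the real brow-use implementation.
interface Action {
  kind: "click" | "type" | "navigate" | "done";
  target?: string;
}

interface LoopDeps<State> {
  perceive: () => State;                            // e.g. read the accessibility tree
  reason: (intent: string, state: State) => Action; // the model picks the next step
  act: (action: Action) => void;                    // browser state changes
  record: (state: State, action: Action) => void;   // trace event or artifact write
}

function runAgentLoop<State>(
  intent: string,
  deps: LoopDeps<State>,
  maxSteps = 20, // assumed safety cap so a confused agent cannot loop forever
): Action[] {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const state = deps.perceive();
    const action = deps.reason(intent, state);
    if (action.kind === "done") break;
    deps.act(action);
    deps.record(state, action);
    history.push(action);
  }
  return history;
}

// Usage with an in-memory stand-in for the browser.
let page = "login";
const log: string[] = [];
const steps = runAgentLoop("open dashboard", {
  perceive: () => page,
  reason: (_intent, state) =>
    state === "login" ? { kind: "navigate", target: "dashboard" } : { kind: "done" },
  act: (a) => { if (a.target) page = a.target; },
  record: (state, a) => log.push(`${state} -> ${a.kind}:${a.target ?? ""}`),
});
```

Because each iteration starts with a fresh `perceive` call, every step picks up from whatever state the previous action left behind, which is the continuity the MCP server's long-lived browser handle provides.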

How the loop changes across layers

Layer 3 — knowledge generation. Same loop shape, no live browser. Perceive reads captured artifacts from a prior explore run (aria log, observed-edge list, screenshots extracted from the trace). Act writes derived files via tools like write_feature_doc and write_page_object. The loop continues until every page in the source run has been processed.

Layer 4 — grounded execution. A hybrid. Before the live loop starts the agent loads the knowledge stack (docs, POMs, and workflows produced by earlier runs). That grounding biases every reasoning step in the Layer 1 form that follows — so picking which page to navigate to or which element to click is informed by recorded knowledge of the app, not just the current aria tree.

All Layer 1 / Layer 4 commands ultimately reach a real browser. There are two ways that can happen.


Two execution modes

The same agent, the same tools, the same outputs — but two different ways the browser tool calls actually reach a browser.

Capability         | Mode 1 (Playwright)                   | Mode 2 (Session)
Browser            | Fresh Chromium launched by Playwright | Your real Chrome with your profile
Login state        | None; must authenticate per run       | Already logged in
Cookies & storage  | Empty                                 | Real session data
Trace fidelity     | Full                                  | Full; same format, same viewer
Visible indicator  | Separate browser window appears       | Yellow "DevTools is debugging" banner
Setup              | Nothing; default mode                 | Build & load the Chrome extension once

The diagrams below focus on what differs between modes — the path between the agent and the browser. The user prompt that feeds Claude Code, and the artifact outputs that come out the other end, are the same in both modes (and are described elsewhere on this page).

Mode 1 — Default (Playwright)

Claude Code (agent · MCP server)
  → drives the browser via Playwright in-process
Playwright (runtime)
  →
Browser (Chromium · live session · headless: false)

Mode 2 — Session (Extension)

Claude Code (agent · MCP server)
  → browser tool calls forwarded over WebSocket :3456
brow-use extension (browser extension, MV3)
  • background.ts
  • playwright-crx
  • chrome.debugger
  • WS client :3456
  →
Chrome, logged in (your browser: your profile · real cookies · active tab)

In session mode the MCP server runs a WebSocket server on port 3456; the brow-use extension connects as a client. Every browser tool call is forwarded over the socket and executed by the extension via playwright-crx — a full Playwright API backed by chrome.debugger instead of a separate browser process. File-writing tools always run on the Node side regardless of mode.
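
Forwarding calls over a socket usually means matching each reply to the request that caused it. The sketch below shows one common way to do that with request ids. The wire format (`BridgeRequest`, `BridgeResponse`) and the `ExtensionBridge` class are assumptions made for illustration, since the real brow-use message shape is not described here, and a plain callback stands in for the WebSocket so the example runs on its own.

```typescript
// Hypothetical wire format and class: the actual brow-use protocol may differ.
interface BridgeRequest {
  id: number;
  tool: string;
  args: Record<string, unknown>;
}

interface BridgeResponse {
  id: number;
  ok: boolean;
  result?: unknown;
  error?: string;
}

class ExtensionBridge {
  private nextId = 1;
  private pending = new Map<number, (res: BridgeResponse) => void>();
  private sendRaw: (frame: string) => void;

  // `sendRaw` stands in for sending a text frame on the WebSocket connection.
  constructor(sendRaw: (frame: string) => void) {
    this.sendRaw = sendRaw;
  }

  call(tool: string, args: Record<string, unknown>, onDone: (res: BridgeResponse) => void): void {
    const req: BridgeRequest = { id: this.nextId++, tool, args };
    this.pending.set(req.id, onDone);
    this.sendRaw(JSON.stringify(req));
  }

  // Would be wired to the socket's incoming-message handler.
  handleFrame(frame: string): void {
    const res = JSON.parse(frame) as BridgeResponse;
    const onDone = this.pending.get(res.id);
    if (!onDone) return; // unknown or duplicate reply: ignore it
    this.pending.delete(res.id);
    onDone(res);
  }
}

// Usage with a fake extension that echoes the requested URL back.
const sent: string[] = [];
const replies: BridgeResponse[] = [];
const bridge = new ExtensionBridge((frame) => sent.push(frame));
bridge.call("navigate", { url: "https://example.com" }, (res) => replies.push(res));
const request = JSON.parse(sent[0]) as BridgeRequest;
bridge.handleFrame(JSON.stringify({ id: request.id, ok: true, result: request.args.url }));
```

Tracking pending calls by id lets several tool calls be in flight at once without depending on replies arriving in order.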

For details on the extension itself, see Chrome extension. For session mode setup, see Session mode.


The run database

Every command that drives the browser ends with a record_run call. The result is appended to .brow-use/runs.json — a flat JSON array, one entry per run, that captures:

Downstream commands like /bu:document and /bu:generate-page-objects read runs.json as their list of available source runs. The user picks one and the command resolves all paths from the entry.
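
A flat JSON array makes both sides of this simple: recording appends one entry, and a downstream command looks an entry up and reads its paths. The sketch below works on the file's text so it stays free of file I/O. The `RunEntry` fields, the run ids, and the example paths are hypothetical, since the real entry schema is not listed here.

```typescript
// Field names are hypothetical: the document does not list the real schema.
interface RunEntry {
  id: string;
  command: string;
  paths: Record<string, string>;
}

// Returns the new contents of runs.json after appending one run.
function appendRun(fileText: string, entry: RunEntry): string {
  const runs: unknown = fileText.trim() === "" ? [] : JSON.parse(fileText);
  if (!Array.isArray(runs)) throw new Error("runs.json must contain a JSON array");
  return JSON.stringify([...runs, entry], null, 2);
}

// What a downstream command might do once the user has picked a run.
function resolvePath(fileText: string, runId: string, key: string): string | undefined {
  const runs = JSON.parse(fileText) as RunEntry[];
  return runs.find((r) => r.id === runId)?.paths[key];
}

// Example values below are made up for illustration.
const afterFirst = appendRun("", {
  id: "run-1",
  command: "/bu:explore",
  paths: { trace: "output/run-1/trace.zip" },
});
const afterSecond = appendRun(afterFirst, {
  id: "run-2",
  command: "/bu:explore-guided",
  paths: { trace: "output/run-2/trace.zip" },
});
const pickedTrace = resolvePath(afterSecond, "run-2", "trace");
```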

Observability

Every long-running command additionally writes to output/reasoning/<sessionId>.jsonl via the log_reasoning tool — one line per non-obvious decision (plan, decision, observation, error). This is the audit trail you read after the fact to understand why the agent did what it did, separate from the trace's what.
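
A JSONL file holds one JSON object per line, which keeps appends cheap and lets a reader filter entries without loading a nested structure. The sketch below reads the four parenthesised words as entry types, which is an assumption. The `message` and `step` fields and the function names are hypothetical as well.

```typescript
// `kind` values follow the text; every other field name is hypothetical.
type ReasoningKind = "plan" | "decision" | "observation" | "error";

interface ReasoningEntry {
  kind: ReasoningKind;
  step: number;
  message: string;
}

// Serialise entries as JSONL: one JSON object per line, newline-terminated.
function toJsonl(entries: ReasoningEntry[]): string {
  return entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
}

// Parse JSONL back into entries, skipping blank lines.
function parseJsonl(text: string): ReasoningEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ReasoningEntry);
}

const sessionLog = toJsonl([
  { kind: "plan", step: 0, message: "Log in, then open the settings page" },
  { kind: "observation", step: 1, message: "Login form has two text fields" },
  { kind: "decision", step: 2, message: "Use the Settings link in the sidebar" },
]);

// Reading the audit trail after the fact: keep only the decisions.
const decisions = parseJsonl(sessionLog).filter((e) => e.kind === "decision");
```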

The viewer/ app stitches all of this together: trace + sidecar + reasoning + aria log → a navigable timeline of the run with screenshots, decisions, and navigation edges side by side.


Where to next

User guide

Try it. The four capabilities, the recommended command sequence, and what each layer's commands actually do.

Developer guide

Build, run, and extend the MCP server, the extension, or the command set.