Architecture

How a prompt becomes browser automation

Components, command layers, the agent loop, and two browser modes — and how they fit together.


The core parts

Three components are always involved, plus a fourth — the Chrome extension — that's used only in session mode.

claude-code
Agent

Reasons about the scenario, calls MCP tools, observes results, decides the next step. Every command is a markdown file Claude executes verbatim.

mcp/
MCP server

Long-running Node.js process. Owns the live browser session, exposes ~25 tools, and routes calls to either Playwright or the extension based on mode.

tool/
Tools

Browser tools (navigate, click, get_accessibility_tree…) and file-only tools (write_page_object, write_feature_doc, record_run…). Each tool is a single TypeScript module.

Optional fourth part — Chrome extension. An MV3 extension that connects to the MCP server via WebSocket and proxies commands into your real Chrome tab using playwright-crx. Loaded only when running in Mode 2 (session); ignored in the default Playwright mode.

The MCP server is the centre of gravity. It's worth a closer look before we describe how anything else works.


The MCP server

The MCP server is the long-running process every other component talks to. Claude Code launches it as a subprocess on demand. It exposes the brow-use tools as a single MCP toolset and owns the live browser handle for the duration of the user's session, so the browser stays open and in the same state between tool calls. That continuity is what makes the agent loop possible: each step picks up exactly where the last one left off.

Two routing decisions live here. Mode — each browser tool call goes to either Playwright (default) or the WebSocket bridge that talks to the extension (session). Type — file-only tools (write a page object, write a feature doc, record a run) always run on the Node side regardless of mode; pure-compute tools (like fingerprint comparison) run with neither browser nor disk.
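The two routing decisions can be pictured as one small dispatch function. This is a minimal sketch under stated assumptions, not the server's actual code: the `ToolKind` and `Backend` types, the `routeTool` function, the lookup table, and the `compare_fingerprints` tool name are all hypothetical. Only the tool names navigate, click, get_accessibility_tree, write_page_object, write_feature_doc and record_run, and the two mode values, come from this page.

```typescript
// Hypothetical sketch of the mode/type routing described above.
type Mode = "playwright" | "crx"; // "crx" is session mode (the extension)
type ToolKind = "browser" | "file" | "compute"; // assumed classification
type Backend = "playwright" | "extension-ws" | "node-fs" | "in-process";

// Assumed lookup table; the real server exposes ~25 tools.
const TOOL_KINDS: Record<string, ToolKind> = {
  navigate: "browser",
  click: "browser",
  get_accessibility_tree: "browser",
  write_page_object: "file",
  write_feature_doc: "file",
  record_run: "file",
  compare_fingerprints: "compute", // hypothetical name for a pure-compute tool
};

function routeTool(tool: string, mode: Mode): Backend {
  const kind = TOOL_KINDS[tool];
  if (kind === undefined) throw new Error(`unknown tool: ${tool}`);
  // Type decision first: file and compute tools ignore the mode entirely.
  if (kind === "file") return "node-fs";
  if (kind === "compute") return "in-process";
  // Mode decision: only browser tools are affected by it.
  return mode === "crx" ? "extension-ws" : "playwright";
}
```

Checking the type before the mode mirrors the rule in the text: a file-only tool lands on the Node side whichever mode is active.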

For the full per-tool surface — every mcp__bu__* tool exposed by the MCP server, and how to drive them from your own agent or skill — see Agent integration.


End-to-end: generating a page object

Before zooming in on the command taxonomy and the agent loop, here's the whole runtime stitched together for one concrete outcome — producing a typed Page Object for a page in your app. Two prompts, with a deterministic post-processing step between them. The first prompt drives a real browser via the Chrome extension; the second never touches a browser at all. That split is the key idea.

Phase 1 — Capture (live, session mode)

You run /bu:explore. The agent walks your app in a real Chrome tab and records everything. The trace and aria log it produces are the raw material the next phase reads.

you: /bu:explore (typed in Claude Code)
→ agent: Claude Code (executes the markdown verbatim: perceive · reason · act · record)
→ brow-use plugin · long-running Node process: MCP server (routes browser tool calls over WebSocket :3456 in session mode)
→ browser extension (MV3): brow-use extension (playwright-crx drives the active tab through chrome.debugger)
→ your browser: Chrome, logged in (your profile · real cookies · active tab)

When the loop ends, the agent calls stop_trace and record_run — the trace zip lands at output/traces/<sessionId>.zip and the run is appended to .brow-use/runs.json. Then make extract SESSION=<id> (Layer 2 — pure TypeScript, no agent, no browser) turns the zip into the aria-tree log and per-step screenshots that Phase 2 reads.

Phase 2 — Generate (offline, no browser)

You run /bu:generate-page-objects. The same MCP server is still there, but no browser tool is called this time — the agent reads the captured artifacts and writes typed POMs.

you: /bu:generate-page-objects (typed in Claude Code)
→ agent: Claude Code (picks a source run from .brow-use/runs.json, iterates page by page)
→ brow-use plugin · file-only tools: MCP server (no browser, no extension, no WebSocket; straight filesystem reads & writes)
→ disk · output/: Artifacts
  • aria-log/<sessionId>.jsonl (input)
  • screenshots/<sessionId>/ (input)
  • pom/<page>.ts (output)

The phase split is the deliberate part. Everything fragile and stateful (real browser, real cookies, network timing) lives in Phase 1. By the time Phase 2 runs, the app is frozen as a fixed set of files, so re-running the generator against the same source run reads identical inputs without re-walking the app. That's what lets you iterate on the generator (or fix prompts in generate-page-objects.md) without re-paying the cost of a live capture.
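The on-disk layout that connects the two phases can be written down as a pair of path helpers. This is only a sketch of the layout named on this page: `sessionPaths`, `pomPath`, and the `outputDir` parameter are hypothetical names, and the real commands resolve paths from the run entry's artifacts map rather than by convention.

```typescript
// Sketch: where the artifacts named on this page live for one session.
interface SessionPaths {
  trace: string;          // written at the end of Phase 1
  ariaLog: string;        // produced by the extract step, read in Phase 2
  screenshotsDir: string; // produced by the extract step, read in Phase 2
}

function sessionPaths(sessionId: string, outputDir = "output"): SessionPaths {
  return {
    trace: `${outputDir}/traces/${sessionId}.zip`,
    ariaLog: `${outputDir}/aria-log/${sessionId}.jsonl`,
    screenshotsDir: `${outputDir}/screenshots/${sessionId}/`,
  };
}

// Phase 2 output: one typed Page Object file per page.
function pomPath(page: string, outputDir = "output"): string {
  return `${outputDir}/pom/${page}.ts`;
}
```

For example, `sessionPaths("explore-1714062110123").trace` evaluates to `output/traces/explore-1714062110123.zip`.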

The commands used above belong to a four-layer taxonomy. Knowing it first makes the section after it, on the agent loop, read more cleanly, since what the loop reads and writes varies by layer.


The command layers

Commands are organised into four layers. Each layer builds on the artifacts produced by the one below it, but every layer is independently runnable — you don't have to do /bu:explore first to run /bu:explore-guided, and you don't have to generate Page Objects to run /bu:run-instruction.

Layer 1   Live capture

Drive the browser to capture raw artifacts: traces and aria-tree logs. The user describes intent; the agent executes it.

Layer 2   Deterministic post-processing

Read trace zips and turn them into queryable artifacts. No browser, no LLM — pure shell and TypeScript so the same trace always produces the same output.

Layer 3   Knowledge generation

Take Layer 1 + Layer 2 artifacts and produce human and machine knowledge: docs the user reads and Page Objects tests can import.

Layer 4   Grounded execution & composition

With docs and POMs in place, either execute business-level intents against the app or generate reusable workflow code grounded in the knowledge layer below.

Layers 1, 3, and 4 each run the same agent loop — the shape stays constant; only the inputs and outputs change. That's the next section.


The agent loop

Every long-running brow-use command (Layers 1, 3, and 4) runs an agent loop, not a script. Layer 2 is the exception — pure deterministic post-processing with no LLM in the path — so this section describes the rest. The loop's shape stays the same across the three LLM-driven layers — perceive, reason, act, record — but what each step reads from and writes to changes by layer.

Layer 1 form — driving the live browser

When the loop drives a real browser, perception is primarily the accessibility tree, with snapshots (screenshots) and a filtered, safety-policy-applied element list as supporting inputs. Actions navigate, click, or type.

Perceive (accessibility tree): Primary input. Snapshots and a filtered element list as supporting inputs.
→ Reason (Claude): Pick the next action based on intent + current state.
→ Act (click / type / navigate): Browser state changes.
→ Record (trace + write_*): Trace event recorded; artifacts written when ready.
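The loop shape can be sketched as a generic driver whose four steps are injected. Everything here is illustrative: `LoopSteps`, `runLoop`, and the fake frontier are hypothetical, and in brow-use the reasoning step is Claude following a markdown command rather than a TypeScript callback. The two return values borrow the "frontier-empty" and "maxSteps" termination reasons listed in the run database.

```typescript
// Hypothetical shape of one perceive → reason → act → record loop.
interface LoopSteps<State, Action> {
  perceive: () => State;
  reason: (state: State) => Action | null; // null means nothing left to do
  act: (action: Action) => void;
  record: (state: State, action: Action) => void;
}

function runLoop<S, A>(
  steps: LoopSteps<S, A>,
  maxSteps: number,
): "frontier-empty" | "maxSteps" {
  for (let i = 0; i < maxSteps; i++) {
    const state = steps.perceive();
    const action = steps.reason(state);
    if (action === null) return "frontier-empty";
    steps.act(action);
    steps.record(state, action);
  }
  return "maxSteps"; // budget cap hit before the frontier emptied
}

// Usage: a fake three-page frontier stands in for the live browser.
const frontier = ["/home", "/orders", "/settings"];
const visited: string[] = [];
const outcome = runLoop<string[], string>(
  {
    perceive: () => frontier,
    reason: (pages) => pages[0] ?? null,
    act: () => { frontier.shift(); },
    record: (_state, page) => { visited.push(page); },
  },
  10,
);
```

With this fake frontier the loop visits all three pages and stops with "frontier-empty"; lowering the budget to 2 would stop it early with "maxSteps".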

How the loop changes across layers

Layer 3 — knowledge generation. Same loop shape, no live browser. Perceive reads captured artifacts from a prior explore run (aria log, observed-edge list, screenshots extracted from the trace). Act writes derived files via tools like write_feature_doc and write_page_object. The loop continues until every page in the source run has been processed.

Layer 4 — grounded execution. A hybrid. Before the live loop starts the agent loads the knowledge stack (docs, POMs, and workflows produced by earlier runs). That grounding biases every reasoning step in the Layer 1 form that follows — so picking which page to navigate to or which element to click is informed by recorded knowledge of the app, not just the current aria tree.

All Layer 1 / Layer 4 commands ultimately reach a real browser. There are two ways that can happen.


Two execution modes

The same agent, the same tools, the same outputs — but two different ways the browser tool calls actually reach a browser.

| Capability | Mode 1: Playwright | Mode 2: Session |
| --- | --- | --- |
| Browser | Fresh Chromium launched by Playwright | Your real Chrome with your profile |
| Login state | None; must authenticate per run | Already logged in |
| Cookies & storage | Empty | Real session data |
| Trace fidelity | Full | Full (same format, same viewer) |
| Visible indicator | Separate browser window appears | Yellow "DevTools is debugging" banner |
| Setup | Nothing (default mode) | Build & load the Chrome extension once |

The diagrams below focus on what differs between modes — the path between the agent and the browser. The user prompt that feeds Claude Code, and the artifact outputs that come out the other end, are the same in both modes (and are described elsewhere on this page).

Mode 1 — Default (Playwright)

agent · MCP server: Claude Code (drives the browser via Playwright in-process)
→ runtime: Playwright
→ Browser: Chromium · live session · headless: false

Mode 2 — Session (Extension)

agent · MCP server: Claude Code (browser tool calls forwarded over WebSocket :3456)
→ browser extension (MV3): brow-use extension
  • background.ts
  • playwright-crx
  • chrome.debugger
  • WS client :3456
→ your browser: Chrome, logged in (your profile · real cookies · active tab)

In session mode the MCP server runs a WebSocket server on port 3456; the brow-use extension connects as a client. Every browser tool call is forwarded over the socket and executed by the extension via playwright-crx — a full Playwright API backed by chrome.debugger instead of a separate browser process. File-writing tools always run on the Node side regardless of mode.
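Forwarding a tool call over a socket usually comes down to tagging each request with an id and matching the reply to it. The sketch below illustrates that idea only. The message shapes `ToolRequest` and `ToolResponse`, the `CallForwarder` class, and the `send` callback are invented for illustration; this page does not specify the actual wire format used between the server and the extension.

```typescript
// Hypothetical wire format: the real message shape is not documented here.
interface ToolRequest { id: number; tool: string; args: Record<string, unknown>; }
interface ToolResponse { id: number; ok: boolean; result?: unknown; error?: string; }

type OnDone = (response: ToolResponse) => void;

class CallForwarder {
  private nextId = 1;
  private readonly pending = new Map<number, OnDone>();
  private readonly send: (raw: string) => void;

  // `send` stands in for the server-side socket's send method.
  constructor(send: (raw: string) => void) {
    this.send = send;
  }

  // Serialise a tool call, remember who is waiting for it, and push it out.
  call(tool: string, args: Record<string, unknown>, onDone: OnDone): number {
    const id = this.nextId++;
    this.pending.set(id, onDone);
    const request: ToolRequest = { id, tool, args };
    this.send(JSON.stringify(request));
    return id;
  }

  // Feed every message received from the extension into this method.
  handleMessage(raw: string): void {
    const response = JSON.parse(raw) as ToolResponse;
    const onDone = this.pending.get(response.id);
    if (onDone === undefined) return; // unknown or already-settled id
    this.pending.delete(response.id);
    onDone(response);
  }

  // Number of calls still waiting for a reply.
  get inFlight(): number {
    return this.pending.size;
  }
}
```

Keeping the transport behind a plain callback means the same class can be exercised in a test with an array standing in for the socket.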

For details on the extension itself, see Chrome extension. For session mode setup, see Session mode.


The run database

Every command that drives the browser ends with a record_run call. The result is appended to .brow-use/runs.json — a flat JSON array, one entry per run. Downstream commands like /bu:document and /bu:generate-page-objects read it as their list of available source runs; the user picks one and the command resolves all paths from the entry.

Each entry captures the following fields:

| Field | What it captures | Format | Recorded by |
| --- | --- | --- | --- |
| sessionId | Unique id used to correlate every downstream artifact (trace, aria log, screenshots, docs folder, result file) back to this run. | <command>-<unix-ms>, e.g. explore-1714062110123 | all |
| command | Which brow-use command produced the run. Drives how downstream tools interpret the entry and which optional fields are present. | "explore" · "explore-guided" · "run-instruction" · "investigate" | all |
| startedAt / endedAt | Wall-clock window of the run. Used to sort runs and to align with external logs when debugging. | ISO 8601 timestamp | all |
| url | The URL the run started from. Anchors the run to a specific target so its outputs can't be confused with another site's. | string · undefined if not recorded | all |
| mode | Which browser the run drove: a fresh Playwright Chromium or a real Chrome tab via the extension. | "playwright" · "crx" | all |
| artifacts | Map of artifact label to file or directory path. Lets downstream commands resolve every output the run produced without guessing. Keys vary by command: tracePath, ariaLog for explore / explore-guided; tracePath, resultPath, howPath for run-instruction; tracePath, findingsPath for investigate. | JSON object with string keys and string paths | all |
| pagesVisited | Number of distinct pages the autonomous walk reached before terminating. A coarse coverage signal. | integer | explore |
| terminationReason | Why the explore loop stopped: clean exhaustion of the frontier, or a budget cap was hit. | "frontier-empty" · "maxSteps" · "maxLoopHits" · "error" | explore |
| intent | The plain-English instruction the user gave the agent for this run. For investigate this is the combined two-part input: "Run: {whatToRun} \| Investigate: {howToHelp}". | string | explore-guided · run-instruction · investigate |
| format | Output format the user asked for. Picks the renderer used by write_result. | "markdown" · "csv" · "json" · "txt" | run-instruction |
| recordsExtracted | Number of rows in the result file. A quick way to spot empty or partial extractions without opening the file. | integer | run-instruction |
| sourceExploreId | The earlier explore run whose docs / aria log this run was grounded in. Omitted entirely when the run was ungrounded. | sessionId of a prior explore or explore-guided run | run-instruction (when grounded) |
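The field list above translates fairly directly into a type. The sketch below is one possible reading of it, not the project's own declaration: the `RunEntry` and `Command` names and the `listSourceRuns` helper are hypothetical, and treating only explore and explore-guided runs as source runs is an assumption based on the sourceExploreId row.

```typescript
type Command = "explore" | "explore-guided" | "run-instruction" | "investigate";

// One element of the flat array in .brow-use/runs.json (assumed typing).
interface RunEntry {
  sessionId: string; // <command>-<unix-ms>
  command: Command;
  startedAt: string; // ISO 8601
  endedAt: string;   // ISO 8601
  url?: string;
  mode: "playwright" | "crx";
  artifacts: Record<string, string>; // label → file or directory path
  pagesVisited?: number; // explore only
  terminationReason?: "frontier-empty" | "maxSteps" | "maxLoopHits" | "error";
  intent?: string;
  format?: "markdown" | "csv" | "json" | "txt";
  recordsExtracted?: number;
  sourceExploreId?: string;
}

// Assumption: only explore-style runs can ground later commands.
function isSourceRun(run: RunEntry): boolean {
  return run.command === "explore" || run.command === "explore-guided";
}

// Candidate source runs, newest first. filter() copies, so the input
// array is left untouched by the sort.
function listSourceRuns(runs: RunEntry[]): RunEntry[] {
  return runs
    .filter(isSourceRun)
    .sort((a, b) => Date.parse(b.startedAt) - Date.parse(a.startedAt));
}
```

A command that offers the user a choice of source run could show the result of `listSourceRuns` and then read every path it needs from the chosen entry's `artifacts` map.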

Observability

Every long-running command additionally writes to output/reasoning/<sessionId>.jsonl via the log_reasoning tool — one line per non-obvious decision (plan, decision, observation, error). This is the audit trail you read after the fact to understand why the agent did what it did, separate from the trace's what.
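A JSONL reasoning log is simple to produce and to read back. The sketch below assumes a shape for each line: the four kinds come from the text above, but the `ts` and `message` field names, the `ReasoningEntry` type, and all three helpers are hypothetical rather than the log_reasoning tool's real schema.

```typescript
type ReasoningKind = "plan" | "decision" | "observation" | "error";

// Assumed line shape; the real field names may differ.
interface ReasoningEntry {
  ts: string; // ISO 8601 timestamp
  kind: ReasoningKind;
  message: string;
}

function reasoningLogPath(sessionId: string): string {
  return `output/reasoning/${sessionId}.jsonl`;
}

// One entry becomes exactly one newline-terminated line.
function toJsonlLine(entry: ReasoningEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Read a whole log back, skipping blank lines.
function parseJsonl(raw: string): ReasoningEntry[] {
  return raw
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ReasoningEntry);
}
```

Because JSON.stringify escapes newlines inside strings, a multi-line message still occupies a single line of the file.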

The viewer/ app stitches all of this together: trace + sidecar + reasoning + aria log → a navigable timeline of the run with screenshots, decisions, and navigation edges side by side.


Where to next

User guide

Try it. The four capabilities, the recommended command sequence, and what each layer's commands actually do.

Developer guide

Build, run, and extend the MCP server, the extension, or the command set.