Architecture

How a prompt becomes browser automation

Components, command layers, the agent loop, and two browser modes — and how they fit together.

The core parts

Three components are always involved, plus a fourth — the Chrome extension — that's used only in session mode.

claude-code

Agent

Reasons about the scenario, calls MCP tools, observes results, decides the next step. Every command is a markdown file Claude executes verbatim.

mcp/

MCP server

Long-running Node.js process. Owns the live browser session, exposes ~25 tools, and routes calls to either Playwright or the extension based on mode.

tool/

Tools

Browser tools (navigate, click, get_accessibility_tree…) and file-only tools (write_page_object, write_feature_doc, record_run…). Each tool is a single TypeScript module.

Optional fourth part — Chrome extension. An MV3 extension that connects to the MCP server via WebSocket and proxies commands into your real Chrome tab using playwright-crx. Loaded only when running in Mode 2 (session); ignored in the default Playwright mode.

The MCP server is the centre of gravity. It's worth a closer look before we describe how anything else works.

The MCP server

The MCP server is the long-running process every other component talks to. Claude Code launches it as a subprocess on demand, it exposes the brow-use tools as a single MCP toolset, and it owns the live browser handle for the duration of the user's session — so the browser stays open and in the same state between tool calls. That continuity is what makes the agent loop possible: each step picks up exactly where the last one left off.

Two routing decisions live here. Mode — each browser tool call goes to either Playwright (default) or the WebSocket bridge that talks to the extension (session). Type — file-only tools (write a page object, write a feature doc, record a run) always run on the Node side regardless of mode; pure-compute tools (like fingerprint comparison) run with neither browser nor disk.
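The two routing decisions can be pictured as one small function. This is a minimal sketch, not the server's actual code: the names `Mode`, `ToolKind`, `Backend`, and `routeToolCall` are hypothetical, and the backend labels are placeholders for whatever the server dispatches to.

```typescript
// Hypothetical names throughout: none of these identifiers come from the
// brow-use source. They only illustrate the mode/type routing described above.
type Mode = "playwright" | "session";
type ToolKind = "browser" | "file" | "compute";
type Backend = "playwright" | "extension-websocket" | "node-filesystem" | "in-process";

function routeToolCall(kind: ToolKind, mode: Mode): Backend {
  switch (kind) {
    case "file":
      // File-only tools run on the Node side regardless of mode.
      return "node-filesystem";
    case "compute":
      // Pure-compute tools need neither a browser nor the disk.
      return "in-process";
    case "browser":
      // Only browser tools are affected by the mode.
      return mode === "session" ? "extension-websocket" : "playwright";
  }
}
```

Under these assumptions, `routeToolCall("browser", "session")` yields `"extension-websocket"`, while a file-only tool resolves to `"node-filesystem"` in either mode.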

For the full per-tool surface — every mcp__bu__* tool exposed by the MCP server, and how to drive them from your own agent or skill — see Agent integration.

End-to-end: generating a page object

Before zooming in on the command taxonomy and the agent loop, here's the whole runtime stitched together for one concrete outcome — producing a typed Page Object for a page in your app. Two prompts, with a deterministic post-processing step between them. The first prompt drives a real browser via the Chrome extension; the second never touches a browser at all. That split is the key idea.

Phase 1 — Capture (live, session mode)

You run /bu:explore. The agent walks your app in a real Chrome tab and records everything. The trace and aria log it produces are the raw material the next phase reads.

you

/bu:explore

typed in Claude Code

↓ reads explore.md

agent

Claude Code

executes the markdown verbatim — perceive · reason · act · record

↓ MCP tool calls — start_trace, navigate, get_accessibility_tree, click…

↑ tool results

brow-use plugin · long-running Node process

MCP server

routes browser tool calls over WebSocket :3456 in session mode

↓ commands · WebSocket

↑ results

browser extension (MV3)

brow-use extension

playwright-crx drives the active tab through chrome.debugger

↓ CDP commands

↑ accessibility tree · snapshots

your browser

Chrome logged in

your profile · real cookies · active tab

When the loop ends, the agent calls stop_trace and record_run — the trace zip lands at output/traces/<sessionId>.zip and the run is appended to .brow-use/runs.json. Then make extract SESSION=<id> (Layer 2 — pure TypeScript, no agent, no browser) turns the zip into the aria-tree log and per-step screenshots that Phase 2 reads.

Phase 2 — Generate (offline, no browser)

You run /bu:generate-page-objects. The same MCP server is still there, but no browser tool is called this time — the agent reads the captured artifacts and writes typed POMs.

you

/bu:generate-page-objects

typed in Claude Code

↓ reads generate-page-objects.md

agent

Claude Code

picks a source run from .brow-use/runs.json, iterates page by page

↓ MCP tool calls — read_observed_edges, read_pom_summary, write_page_object

↑ page structure · navigation edges

brow-use plugin · file-only tools

MCP server

no browser, no extension, no WebSocket — straight filesystem reads & writes

↑ aria log · observed edges · screenshots

↓ generated POM files

disk · output/

Artifacts

aria-log/<sessionId>.jsonl — input
screenshots/<sessionId>/ — input
pom/<page>.ts — output

The phase split is the deliberate part. Everything fragile and stateful (real browser, real cookies, network timing) lives in Phase 1. By the time Phase 2 runs, the app is frozen as a fixed set of files; re-running the generator against the same source run reads exactly the same inputs without re-walking the app. That's what lets you iterate on the generator (or fix prompts in generate-page-objects.md) without re-paying the cost of a live capture.
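The artifact locations named in the two phases all derive from the session id or the page name, so they can be collected in one helper. The sketch below is illustrative only: `artifactPathsFor` and `pageObjectPathFor` are hypothetical names, and it assumes the aria-log, screenshots, and pom folders all sit under the same output/ root as the traces folder.

```typescript
// Hypothetical helpers: names and return shape are illustrative.
// Assumption: every artifact folder sits under a single output/ root.
interface RunArtifactPaths {
  trace: string;
  ariaLog: string;
  screenshotsDir: string;
}

function artifactPathsFor(sessionId: string, outputRoot = "output"): RunArtifactPaths {
  return {
    // The trace zip lands here at the end of Phase 1.
    trace: `${outputRoot}/traces/${sessionId}.zip`,
    // The next two are produced from the zip by make extract.
    ariaLog: `${outputRoot}/aria-log/${sessionId}.jsonl`,
    screenshotsDir: `${outputRoot}/screenshots/${sessionId}/`,
  };
}

// Phase 2 output: one Page Object file per page.
function pageObjectPathFor(page: string, outputRoot = "output"): string {
  return `${outputRoot}/pom/${page}.ts`;
}
```

Because every path is a pure function of the session id, two generator runs against the same source run read the same files.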

The command taxonomy comes next. Knowing it first makes the agent loop section that follows easier to read: the loop keeps the same shape, but what it reads and writes varies by layer.

The command layers

Commands are organised into four layers. Each layer builds on the artifacts produced by the one below it, but every layer is independently runnable — you don't have to do /bu:explore first to run /bu:explore-guided, and you don't have to generate Page Objects to run /bu:run-instruction.

Layer 1 Live capture

Drive the browser to capture raw artifacts: traces and aria-tree logs. The user describes intent; the agent executes it.

/bu:explore — walk the app autonomously, breadth-first, with loop detection.
/bu:explore-guided — carry out a one-off plain-English intent and record it.
/bu:investigate — run a small action and answer a live question about what the page did, choosing the investigation technique on the fly.
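To make "breadth-first, with loop detection" concrete, here is a minimal sketch of such a walk over an in-memory link graph. It is not the /bu:explore implementation: the function name and the `links` map are hypothetical, and counting every revisit of an already-seen page as a loop hit is an assumption. The termination reasons reuse the values recorded in the run database.

```typescript
// Hypothetical sketch: `links` stands in for the live browser
// (page id -> pages reachable from it). Names are illustrative.
type TerminationReason = "frontier-empty" | "maxSteps" | "maxLoopHits";

interface ExploreResult {
  pagesVisited: number;
  terminationReason: TerminationReason;
}

function exploreBreadthFirst(
  start: string,
  links: Map<string, string[]>,
  maxSteps: number,
  maxLoopHits: number,
): ExploreResult {
  const seen = new Set<string>([start]); // pages already queued or visited
  const frontier: string[] = [start];    // FIFO queue gives breadth-first order
  let pagesVisited = 0;
  let loopHits = 0;

  while (frontier.length > 0) {
    if (pagesVisited >= maxSteps) {
      return { pagesVisited, terminationReason: "maxSteps" };
    }
    const page = frontier.shift();
    if (page === undefined) break;
    pagesVisited += 1;

    for (const next of links.get(page) ?? []) {
      if (seen.has(next)) {
        // Assumption: every link back to a known page counts as a loop hit.
        loopHits += 1;
        if (loopHits >= maxLoopHits) {
          return { pagesVisited, terminationReason: "maxLoopHits" };
        }
        continue;
      }
      seen.add(next);
      frontier.push(next);
    }
  }
  return { pagesVisited, terminationReason: "frontier-empty" };
}
```

The two budget caps guarantee termination even on apps whose link graph is effectively infinite; a clean run ends only when the frontier is empty.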

Layer 2 Deterministic post-processing

Read trace zips and turn them into queryable artifacts. No browser, no LLM — pure shell and TypeScript so the same trace always produces the same output.

make extract SESSION=<id> — extract aria-tree log + per-step screenshots from a trace.
The viewer ingester — turns trace + sidecar + reasoning logs into the run-viewer database.

Layer 3 Knowledge generation

Take Layer 1 + Layer 2 artifacts and produce human and machine knowledge: docs the user reads and Page Objects tests can import.

/bu:document — feature docs + a page-transitions index, scoped per explore run.
/bu:generate-page-objects — typed POMs from the aria log; uses observed-edges to type navigation methods.

Layer 4 Grounded execution & composition

With docs and POMs in place, either execute business-level intents against the app or generate reusable workflow code grounded in the knowledge layer below.

/bu:run-instruction — extract data, follow a flow, satisfy an intent. Optionally grounded in an earlier explore run. Output in the format you ask for.
/bu:generate-workflow-function — generate a Playwright async function (no POM — calls Playwright APIs directly) from a plain-English goal, grounded in feature docs when available and the aria log otherwise.

Layers 1, 3, and 4 each run the same agent loop — the shape stays constant; only the inputs and outputs change. That's the next section.

The agent loop

Every long-running brow-use command (Layers 1, 3, and 4) runs an agent loop, not a script. Layer 2 is the exception — pure deterministic post-processing with no LLM in the path — so this section describes the rest. The loop's shape stays the same across the three LLM-driven layers — perceive, reason, act, record — but what each step reads from and writes to changes by layer.

Layer 1 form — driving the live browser

When the loop drives a real browser, perception is primarily the accessibility tree, with snapshots (screenshots) and a filtered, safety-policy-applied element list as supporting inputs. Actions navigate, click, or type.

Perceive

accessibility tree

Primary input. Snapshots and a filtered element list as supporting inputs.

→

Reason

Claude

Pick the next action based on intent + current state

→

Act

click / type / navigate

Browser state changes

→

Record

trace + write_*

Trace event recorded; artifacts written when ready
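The four steps above can be sketched as a generic loop. This is a simplified illustration rather than how Claude Code actually schedules tool calls: `LoopSteps` and `runAgentLoop` are hypothetical names, the loop is synchronous for brevity (real MCP tool calls are asynchronous), and a `null` action is an assumed convention for "nothing left to do".

```typescript
// Hypothetical shapes: LoopSteps and runAgentLoop are illustrative names.
// Real MCP tool calls are asynchronous; this sketch is synchronous for brevity.
interface LoopSteps<State, Action> {
  perceive(): State;                          // e.g. read the accessibility tree
  reason(state: State): Action | null;        // null means the intent is satisfied
  act(action: Action): void;                  // e.g. click / type / navigate
  record(state: State, action: Action): void; // e.g. trace event, write_* tools
}

function runAgentLoop<State, Action>(
  steps: LoopSteps<State, Action>,
  maxIterations: number,
): number {
  let iterations = 0;
  while (iterations < maxIterations) {
    const state = steps.perceive();
    const action = steps.reason(state);
    if (action === null) break;
    steps.act(action);
    steps.record(state, action);
    iterations += 1;
  }
  return iterations;
}

// Usage with a two-page stand-in for the browser.
let currentPage = "home";
const trace: string[] = [];
const stepsTaken = runAgentLoop<string, string>(
  {
    perceive: () => currentPage,
    reason: (page) => (page === "home" ? "pricing" : null),
    act: (target) => { currentPage = target; },
    record: (page, target) => { trace.push(`${page} -> ${target}`); },
  },
  10,
);
```

Swapping the four callbacks is all it takes to move between layers in this picture: Layer 3 would perceive from captured files and act by writing derived ones, with the loop itself unchanged.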

How the loop changes across layers

Layer 3 — knowledge generation. Same loop shape, no live browser. Perceive reads captured artifacts from a prior explore run (aria log, observed-edge list, screenshots extracted from the trace). Act writes derived files via tools like write_feature_doc and write_page_object. The loop continues until every page in the source run has been processed.

Layer 4 — grounded execution. A hybrid. Before the live loop starts, the agent loads the knowledge stack (docs, POMs, and workflows produced by earlier runs). That grounding biases every reasoning step in the Layer 1 form that follows, so picking which page to navigate to or which element to click is informed by recorded knowledge of the app, not just the current aria tree.

All Layer 1 / Layer 4 commands ultimately reach a real browser. There are two ways that can happen.

Two execution modes

The same agent, the same tools, the same outputs — but two different ways the browser tool calls actually reach a browser.

Capability	Mode 1 — Playwright	Mode 2 — Session
Browser	Fresh Chromium launched by Playwright	Your real Chrome with your profile
Login state	None — must authenticate per run	Already logged in
Cookies & storage	Empty	Real session data
Trace fidelity	Full	Full — same format, same viewer
Visible indicator	Separate browser window appears	Yellow "DevTools is debugging" banner
Setup	Nothing — default mode	Build & load the Chrome extension once

The diagrams below focus on what differs between modes — the path between the agent and the browser. The user prompt that feeds Claude Code, and the artifact outputs that come out the other end, are the same in both modes (and are described elsewhere on this page).

Mode 1 — Default (Playwright)

agent · MCP server

Claude Code

drives the browser via Playwright in-process

↓ navigate / click / type

↑ accessibility tree / snapshots

↑ trace zip (on stop)

runtime

Playwright

Browser

Chromium · live session · headless: false

Mode 2 — Session (Extension)

agent · MCP server

Claude Code

browser tool calls forwarded over WebSocket :3456

↓ commands · WebSocket

↑ results

browser extension (MV3)

brow-use extension

background.ts
playwright-crx
chrome.debugger
WS client :3456

↓ navigate / click

↑ tree / snapshots

your browser

Chrome logged in

your profile · real cookies · active tab

In session mode the MCP server runs a WebSocket server on port 3456; the brow-use extension connects as a client. Every browser tool call is forwarded over the socket and executed by the extension via playwright-crx — a full Playwright API backed by chrome.debugger instead of a separate browser process. File-writing tools always run on the Node side regardless of mode.
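Forwarding calls over a single socket means each result has to be matched back to the call that produced it. The sketch below shows one common way to do that, with a numeric id per request. The wire format is an assumption: the `id`/`tool`/`args` envelope, `BridgeCalls`, and the callback style are all hypothetical and not taken from the brow-use protocol. The `send` function stands in for writing to the WebSocket.

```typescript
// Hypothetical wire format: the id/tool/args envelope and all names below are
// assumptions for illustration, not the actual brow-use protocol.
interface BridgeRequest {
  id: number;
  tool: string;
  args: Record<string, unknown>;
}

interface BridgeResponse {
  id: number;
  ok: boolean;
  result?: unknown;
  error?: string;
}

type ResultHandler = (response: BridgeResponse) => void;

class BridgeCalls {
  private nextId = 1;
  private readonly pending = new Map<number, ResultHandler>();
  private readonly send: (raw: string) => void;

  constructor(send: (raw: string) => void) {
    this.send = send;
  }

  // Forward one browser tool call and remember who is waiting for the result.
  call(tool: string, args: Record<string, unknown>, onResult: ResultHandler): number {
    const id = this.nextId++;
    this.pending.set(id, onResult);
    const request: BridgeRequest = { id, tool, args };
    this.send(JSON.stringify(request));
    return id;
  }

  // Match an incoming response to the pending call with the same id.
  handleMessage(raw: string): boolean {
    const response = JSON.parse(raw) as BridgeResponse;
    const handler = this.pending.get(response.id);
    if (handler === undefined) return false; // unknown or already-settled id
    this.pending.delete(response.id);
    handler(response);
    return true;
  }
}
```

Keeping the correlation table on the Node side fits the description above, where the server owns the session and the extension only executes what it is sent.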

For details on the extension itself, see Chrome extension. For session mode setup, see Session mode.

The run database

Every command that drives the browser ends with a record_run call. The result is appended to .brow-use/runs.json — a flat JSON array, one entry per run. Downstream commands like /bu:document and /bu:generate-page-objects read it as their list of available source runs; the user picks one and the command resolves all paths from the entry.

Each entry captures the following fields:

Field	What it captures	Format	Recorded by
`sessionId`	Unique id used to correlate every downstream artifact — trace, aria log, screenshots, docs folder, result file — back to this run.	`<command>-<unix-ms>` e.g. `explore-1714062110123`	all
`command`	Which brow-use command produced the run. Drives how downstream tools interpret the entry and which optional fields are present.	`"explore"` · `"explore-guided"` · `"run-instruction"` · `"investigate"`	all
`startedAt` / `endedAt`	Wall-clock window of the run. Used to sort runs and to align with external logs when debugging.	ISO 8601 timestamp	all
`url`	The URL the run started from. Anchors the run to a specific target so its outputs can't be confused with another site's.	string · `undefined` if not recorded	all
`mode`	Which browser the run drove — a fresh Playwright Chromium or a real Chrome tab via the extension.	`"playwright"` · `"crx"`	all
`artifacts`	Map of artifact label to file or directory path. Lets downstream commands resolve every output the run produced without guessing. Keys vary by command — `tracePath`, `ariaLog` for explore / explore-guided; `tracePath`, `resultPath`, `howPath` for run-instruction; `tracePath`, `findingsPath` for investigate.	JSON object — string keys, string paths	all
`pagesVisited`	Number of distinct pages the autonomous walk reached before terminating. A coarse coverage signal.	integer	explore
`terminationReason`	Why the explore loop stopped: the frontier was cleanly exhausted, a budget cap was hit, or an error occurred.	`"frontier-empty"` · `"maxSteps"` · `"maxLoopHits"` · `"error"`	explore
`intent`	The plain-English instruction the user gave the agent for this run. For `investigate` this is the combined two-part input: `"Run: {whatToRun} | Investigate: {howToHelp}"`.	string	explore-guided · run-instruction · investigate
`format`	Output format the user asked for. Picks the renderer used by `write_result`.	`"markdown"` · `"csv"` · `"json"` · `"txt"`	run-instruction
`recordsExtracted`	Number of rows in the result file. A quick way to spot empty or partial extractions without opening the file.	integer	run-instruction
`sourceExploreId`	The earlier explore run whose docs / aria log this run was grounded in. Omitted entirely when the run was ungrounded.	sessionId of a prior explore or explore-guided run	run-instruction (when grounded)
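The table translates fairly directly into a TypeScript type. The sketch below mirrors the field names and value sets listed above; the helper functions are hypothetical additions. Two assumptions are worth flagging: `listSourceRuns` treats only explore and explore-guided runs as valid sources (inferred from the `sourceExploreId` row), and it sorts by comparing `startedAt` strings, which is only correct if all timestamps share the same ISO 8601 form and timezone.

```typescript
// Field names and literal values follow the table; helper names are hypothetical.
type Command = "explore" | "explore-guided" | "run-instruction" | "investigate";

interface RunEntry {
  sessionId: string;
  command: Command;
  startedAt: string; // ISO 8601
  endedAt: string;   // ISO 8601
  url?: string;
  mode: "playwright" | "crx";
  artifacts: Record<string, string>;
  pagesVisited?: number;
  terminationReason?: "frontier-empty" | "maxSteps" | "maxLoopHits" | "error";
  intent?: string;
  format?: "markdown" | "csv" | "json" | "txt";
  recordsExtracted?: number;
  sourceExploreId?: string;
}

// Split "<command>-<unix-ms>" on the LAST hyphen, since commands such as
// "explore-guided" contain hyphens themselves.
function parseSessionId(sessionId: string): { command: string; unixMs: number } | null {
  const cut = sessionId.lastIndexOf("-");
  if (cut <= 0) return null;
  const suffix = sessionId.slice(cut + 1);
  if (!/^\d+$/.test(suffix)) return null;
  return { command: sessionId.slice(0, cut), unixMs: Number(suffix) };
}

// Assumption: only explore and explore-guided runs can serve as source runs.
function listSourceRuns(runs: RunEntry[]): RunEntry[] {
  return runs
    .filter((run) => run.command === "explore" || run.command === "explore-guided")
    .sort((a, b) => b.startedAt.localeCompare(a.startedAt)); // newest first
}
```

Marking the command-specific fields optional keeps a single entry type for the whole array, at the cost of the type checker not enforcing which fields each command must set.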

Observability

Every long-running command additionally writes to output/reasoning/<sessionId>.jsonl via the log_reasoning tool — one line per non-obvious decision (plan, decision, observation, error). This is the audit trail you read after the fact to understand why the agent did what it did, separate from the trace's what.
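A JSONL reasoning log is easy to produce and to filter after the fact. The sketch below shows the general pattern; the entry shape is an assumption (the `ts`, `kind`, and `message` field names are hypothetical), and only the four kinds come from the description above.

```typescript
// Hypothetical entry shape: the ts/kind/message field names are assumptions.
type ReasoningKind = "plan" | "decision" | "observation" | "error";

interface ReasoningEntry {
  ts: string; // ISO 8601 timestamp
  kind: ReasoningKind;
  message: string;
}

// JSON.stringify escapes embedded newlines, so each entry stays on one line.
function toJsonlLine(entry: ReasoningEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Read a whole log back, skipping blank lines such as the trailing one.
function parseJsonl(text: string): ReasoningEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ReasoningEntry);
}

function entriesOfKind(entries: ReasoningEntry[], kind: ReasoningKind): ReasoningEntry[] {
  return entries.filter((entry) => entry.kind === kind);
}
```

Appending one line per decision means a run that crashes midway still leaves a readable log up to the point of failure.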

The viewer/ app stitches all of this together: trace + sidecar + reasoning + aria log → a navigable timeline of the run with screenshots, decisions, and navigation edges side by side.

Where to next

User guide

Try it. The four capabilities, the recommended command sequence, and what each layer's commands actually do.

Developer guide

Build, run, and extend the MCP server, the extension, or the command set.