How a prompt becomes browser automation
Components, command layers, the agent loop, and two browser modes — and how they fit together.
The core parts
Three components are always involved, plus a fourth — the Chrome extension — that's used only in session mode.
- Claude Code: reasons about the scenario, calls MCP tools, observes results, and decides the next step. Every command is a markdown file Claude executes verbatim.
- MCP server: a long-running Node.js process. It owns the live browser session, exposes ~25 tools, and routes calls to either Playwright or the extension based on mode.
- Tools: browser tools (navigate, click, get_accessibility_tree…) and file-only tools (write_page_object, write_feature_doc, record_run…). Each tool is a single TypeScript module.
- Chrome extension: uses playwright-crx. Loaded only when running in Mode 2 (session); ignored in the default Playwright mode.
The MCP server is the centre of gravity. It's worth a closer look before we describe how anything else works.
The MCP server
The MCP server is the long-running process every other component talks to. Claude Code launches it as a subprocess on demand, it exposes the brow-use tools as a single MCP toolset, and it owns the live browser handle for the duration of the user's session — so the browser stays open and in the same state between tool calls. That continuity is what makes the agent loop possible: each step picks up exactly where the last one left off.
Two routing decisions live here. Mode — each browser tool call goes to either Playwright (default) or the WebSocket bridge that talks to the extension (session). Type — file-only tools (write a page object, write a feature doc, record a run) always run on the Node side regardless of mode; pure-compute tools (like fingerprint comparison) run with neither browser nor disk.
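The two routing decisions can be sketched as a small dispatch function. This is a minimal illustration, not the server's actual code: the type names, the `routeToolCall` function, the target labels, and the `compare_fingerprints` tool name are all hypothetical, and only a handful of the ~25 tools are listed.

```typescript
// Hypothetical names throughout: the text describes the routing rules,
// not the real types or function names.
type Mode = "playwright" | "session";
type ToolKind = "browser" | "file" | "compute";
type Target = "playwright" | "extension-bridge" | "node-fs" | "in-process";

// Illustrative subset of the tools named on this page.
const TOOL_KINDS = new Map<string, ToolKind>([
  ["navigate", "browser"],
  ["click", "browser"],
  ["get_accessibility_tree", "browser"],
  ["write_page_object", "file"],
  ["write_feature_doc", "file"],
  ["record_run", "file"],
  ["compare_fingerprints", "compute"], // assumed name for fingerprint comparison
]);

function routeToolCall(tool: string, mode: Mode): Target {
  const kind = TOOL_KINDS.get(tool);
  if (kind === undefined) {
    throw new Error(`Unknown tool: ${tool}`);
  }
  switch (kind) {
    case "browser":
      // Only browser tools are affected by the mode.
      return mode === "session" ? "extension-bridge" : "playwright";
    case "file":
      // File-only tools run on the Node side in both modes.
      return "node-fs";
    case "compute":
      // Pure-compute tools touch neither browser nor disk.
      return "in-process";
  }
}
```

Under these assumptions, `routeToolCall("click", "session")` resolves to the extension bridge, while `routeToolCall("record_run", "session")` still resolves to the Node side.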
Before looking at the agent loop, it helps to know how commands are grouped: the loop keeps the same shape everywhere, but what it reads and writes depends on the layer.
The command layers
Commands are organised into four layers. Each layer builds on the artifacts produced by
the one below it, but every layer is independently runnable — you don't have to do
/bu:explore first to run /bu:explore-guided, and you don't have
to generate Page Objects to run /bu:run-instruction.
Layer 1 Live capture
Drive the browser to capture raw artifacts: traces and aria-tree logs. The user describes intent; the agent executes it.
- /bu:explore: walk the app autonomously, breadth-first, with loop detection.
- /bu:explore-guided: carry out a one-off plain-English intent and record it.
Layer 2 Deterministic post-processing
Read trace zips and turn them into queryable artifacts. No browser, no LLM — pure shell and TypeScript so the same trace always produces the same output.
- make extract SESSION=<id>: extract the aria-tree log and per-step screenshots from a trace.
- The viewer ingester: turns trace + sidecar + reasoning logs into the run-viewer database.
Layer 3 Knowledge generation
Take Layer 1 + Layer 2 artifacts and produce human and machine knowledge: docs the user reads and Page Objects tests can import.
- /bu:document: feature docs plus a page-transitions index, scoped per explore run.
- /bu:generate-page-objects: typed POMs from the aria log; uses observed-edges to type navigation methods.
Layer 4 Grounded execution & composition
With docs and POMs in place, either execute business-level intents against the app or generate reusable workflow code grounded in the knowledge layer below.
- /bu:run-instruction: extract data, follow a flow, or satisfy an intent. Optionally grounded in an earlier explore run. Output in the format you ask for.
- /bu:generate-workflow-function: generate a Playwright async function (no POM; it calls Playwright APIs directly) from a plain-English goal, grounded in feature docs when available and the aria log otherwise.
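The four layers above can be summarised as a lookup table. The sketch below is only a reading aid: the `COMMAND_LAYERS` map and `describeCommand` helper are hypothetical, and the two derived flags simply restate what this page says about which layers run the agent loop and which reach a live browser.

```typescript
type Layer = 1 | 2 | 3 | 4;

// Commands named on this page, keyed to their layer.
const COMMAND_LAYERS = new Map<string, Layer>([
  ["/bu:explore", 1],
  ["/bu:explore-guided", 1],
  ["make extract", 2],
  ["/bu:document", 3],
  ["/bu:generate-page-objects", 3],
  ["/bu:run-instruction", 4],
  ["/bu:generate-workflow-function", 4],
]);

interface CommandSummary {
  layer: Layer;
  usesAgentLoop: boolean;      // Layers 1, 3, and 4 are LLM-driven
  reachesLiveBrowser: boolean; // Layers 1 and 4 drive a real browser
}

function describeCommand(name: string): CommandSummary {
  const layer = COMMAND_LAYERS.get(name);
  if (layer === undefined) {
    throw new Error(`Unknown command: ${name}`);
  }
  return {
    layer,
    usesAgentLoop: layer !== 2,
    reachesLiveBrowser: layer === 1 || layer === 4,
  };
}
```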
Layers 1, 3, and 4 each run the same agent loop — the shape stays constant; only the inputs and outputs change. That's the next section.
The agent loop
Every long-running brow-use command in Layers 1, 3, and 4 runs an agent loop, not a script. Layer 2 is the exception: it is pure deterministic post-processing with no LLM in the path, so this section does not cover it. Across the three LLM-driven layers the loop keeps the same four steps (perceive, reason, act, record), but what each step reads from and writes to changes by layer.
Layer 1 form — driving the live browser
When the loop drives a real browser, perception is primarily the accessibility tree, with snapshots (screenshots) and a filtered, safety-policy-applied element list as supporting inputs. Actions navigate, click, or type.
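The perceive, reason, act, record cycle can be sketched as follows. Every name here (`Observation`, `Action`, `LoopDeps`, `runAgentLoop`, the `maxSteps` guard) is an assumption made for illustration; in brow-use the reasoning step is Claude and the other steps are MCP tool calls, so this shows only the shape of the loop.

```typescript
// Hypothetical types: what the agent sees and what it can do.
interface Observation {
  url: string;
  ariaTree: string; // primary perception input in the Layer 1 form
}

type Action =
  | { kind: "navigate"; url: string }
  | { kind: "click"; target: string }
  | { kind: "type"; target: string; text: string }
  | { kind: "stop"; reason: string };

interface StepRecord {
  observation: Observation;
  action: Action;
}

interface LoopDeps {
  perceive(): Promise<Observation>;
  reason(obs: Observation, history: StepRecord[]): Promise<Action>;
  act(action: Action): Promise<void>;
}

async function runAgentLoop(deps: LoopDeps, maxSteps: number): Promise<StepRecord[]> {
  const history: StepRecord[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const observation = await deps.perceive();              // perceive
    const action = await deps.reason(observation, history); // reason
    if (action.kind !== "stop") {
      await deps.act(action);                               // act
    }
    history.push({ observation, action });                  // record
    if (action.kind === "stop") {
      break;
    }
  }
  return history;
}
```

Because the browser session persists between tool calls, each `perceive` call in this sketch would see the state left behind by the previous `act`.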
How the loop changes across layers
Layer 3 — knowledge generation. Same loop shape, no live browser.
Perceive reads captured artifacts from a prior explore run (aria log,
observed-edge list, screenshots extracted from the trace). Act writes derived
files via tools like write_feature_doc and write_page_object.
The loop continues until every page in the source run has been processed.
Layer 4 — grounded execution. A hybrid. Before the live loop starts the agent loads the knowledge stack (docs, POMs, and workflows produced by earlier runs). That grounding biases every reasoning step in the Layer 1 form that follows — so picking which page to navigate to or which element to click is informed by recorded knowledge of the app, not just the current aria tree.
All Layer 1 / Layer 4 commands ultimately reach a real browser. There are two ways that can happen.
Two execution modes
The same agent, the same tools, the same outputs — but two different ways the browser tool calls actually reach a browser.
| Capability | Mode 1 — Playwright | Mode 2 — Session |
|---|---|---|
| Browser | Fresh Chromium launched by Playwright | Your real Chrome with your profile |
| Login state | None — must authenticate per run | Already logged in |
| Cookies & storage | Empty | Real session data |
| Trace fidelity | Full | Full — same format, same viewer |
| Visible indicator | Separate browser window appears | Yellow "DevTools is debugging" banner |
| Setup | Nothing — default mode | Build & load the Chrome extension once |
The diagrams below focus on what differs between modes — the path between the agent and the browser. The user prompt that feeds Claude Code, and the artifact outputs that come out the other end, are the same in both modes (and are described elsewhere on this page).
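Mode 1: Default (Playwright)
[Diagram: the MCP server drives a Chromium window launched by Playwright with headless: false.]
Mode 2: Session (Extension)
[Diagram: the MCP server's WebSocket server on :3456 connects to the WS client in the extension's background.ts, which drives Chrome through playwright-crx and chrome.debugger.]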
In session mode the MCP server runs a WebSocket server on port 3456; the brow-use extension connects as a client. Every browser tool call is forwarded over the socket and executed by the extension via playwright-crx, a full Playwright API backed by chrome.debugger instead of a separate browser process. File-writing tools always run on the Node side regardless of mode.
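Forwarding tool calls over a socket needs some way to match each reply to its request. The sketch below shows one common approach, correlating by a numeric id. The wire format (`BridgeRequest`, `BridgeResponse`) and the `BridgeClient` class are assumptions for illustration; the source states only that calls are forwarded over a WebSocket on port 3456. The `send` callback stands in for the real socket so the example stays self-contained.

```typescript
// Hypothetical message envelope for the bridge.
interface BridgeRequest {
  id: number;
  tool: string;
  args: Record<string, unknown>;
}

interface BridgeResponse {
  id: number;
  ok: boolean;
  result?: unknown;
  error?: string;
}

interface PendingCall {
  resolve: (value: unknown) => void;
  reject: (error: Error) => void;
}

class BridgeClient {
  private nextId = 1;
  private pending = new Map<number, PendingCall>();
  private send: (raw: string) => void;

  // `send` would wrap the WebSocket's send method in a real server.
  constructor(send: (raw: string) => void) {
    this.send = send;
  }

  // Forward one tool call and resolve when the matching reply arrives.
  call(tool: string, args: Record<string, unknown>): Promise<unknown> {
    const id = this.nextId++;
    const request: BridgeRequest = { id, tool, args };
    return new Promise<unknown>((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.send(JSON.stringify(request));
    });
  }

  // Wire this to the socket's message event.
  handleMessage(raw: string): void {
    const response = JSON.parse(raw) as BridgeResponse;
    const entry = this.pending.get(response.id);
    if (entry === undefined) {
      return; // unknown or already-settled id
    }
    this.pending.delete(response.id);
    if (response.ok) {
      entry.resolve(response.result);
    } else {
      entry.reject(new Error(response.error ?? "tool call failed"));
    }
  }
}
```

A production bridge would also need timeouts and cleanup of pending calls when the extension disconnects; both are omitted here.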
For details on the extension itself, see Chrome extension. For session mode setup, see Session mode.
The run database
Every command that drives the browser ends with a record_run call. The result
is appended to .brow-use/runs.json — a flat JSON array, one entry per run, that
captures:
- Identity — sessionId, command, app, mode, timestamps.
- Outcome — pages visited, termination reason, records extracted, intent.
- Artifacts — paths to the trace, aria log, docs folder, result file.
Downstream commands like /bu:document and /bu:generate-page-objects
read runs.json as their list of available source runs. The user picks one and
the command resolves all paths from the entry.
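As a rough picture of what an entry might look like, here is a sketch of the record shape and of the lookup a downstream command performs. Only `sessionId`, `command`, `app`, and `mode` are named in the text; every other field name, and the helper functions themselves, are assumptions chosen to mirror the three groups above.

```typescript
// Sketch only: field names beyond sessionId, command, app, and mode are assumed.
interface RunEntry {
  // Identity
  sessionId: string;
  command: string;
  app: string;
  mode: "playwright" | "session";
  startedAt: string;
  finishedAt: string;
  // Outcome
  pagesVisited: number;
  terminationReason: string;
  recordsExtracted?: number;
  intent?: string;
  // Artifacts
  artifacts: {
    trace?: string;
    ariaLog?: string;
    docsDir?: string;
    resultFile?: string;
  };
}

// Parse the contents of runs.json, described above as a flat JSON array.
function parseRuns(json: string): RunEntry[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) {
    throw new Error("runs.json must contain a JSON array");
  }
  return data as RunEntry[];
}

// Appending a run yields a new array with the entry at the end.
function appendRun(runs: RunEntry[], entry: RunEntry): RunEntry[] {
  return [...runs, entry];
}

// Look up the run the user picked and return its artifact paths.
function resolveArtifacts(runs: RunEntry[], sessionId: string): RunEntry["artifacts"] {
  const run = runs.find((r) => r.sessionId === sessionId);
  if (run === undefined) {
    throw new Error(`No run with sessionId ${sessionId}`);
  }
  return run.artifacts;
}
```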
Observability
Every long-running command additionally writes to output/reasoning/<sessionId>.jsonl
via the log_reasoning tool — one line per non-obvious decision (plan,
decision, observation, error). This is the audit trail you read
after the fact to understand why the agent did what it did, separate from the
trace's what.
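A JSONL reasoning log is one JSON object per line. The sketch below shows how such lines could be written and read back. The four kinds come from the text; the field names (`ts`, `sessionId`, `kind`, `message`) and both helpers are assumptions, not the actual log_reasoning schema.

```typescript
type ReasoningKind = "plan" | "decision" | "observation" | "error";

// Hypothetical entry shape.
interface ReasoningEntry {
  ts: string; // ISO 8601 timestamp
  sessionId: string;
  kind: ReasoningKind;
  message: string;
}

// Serialise one entry as a single JSONL line. JSON.stringify escapes any
// newlines inside the message, so the entry always occupies exactly one line.
function toJsonlLine(entry: ReasoningEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Read a log back, skipping blank lines.
function parseJsonl(text: string): ReasoningEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ReasoningEntry);
}
```

Appending one line per decision keeps the file readable while a run is still in progress, which suits an audit trail that is written incrementally.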
The viewer/ app stitches all of this together: trace + sidecar +
reasoning + aria log → a navigable timeline of the run with screenshots, decisions, and
navigation edges side by side.