Agent = Model + Harness. The harness is everything except the weights: system prompts, tools, permissions, context management, feedback loops, verification systems, and the orchestration glue that connects them. Harness engineering is the discipline of getting that wrapper right.
Martin Fowler and Birgitta Boeckeler at ThoughtWorks formalized the term in early 2026. OpenAI proved the concept at scale by shipping a million-line codebase with zero manually written code. The discipline shifts engineers from writing code to designing environments where agents write code reliably.
The core equation
Configuration failures cause more damage than model failures. Mitchell Hashimoto framed it well: every time an agent makes a mistake, you engineer a solution so it never makes that mistake again. The harness accumulates these lessons as deterministic constraints rather than hoping the model “learns.”
Components of a harness
Context files (CLAUDE.md, AGENTS.md): Project instructions injected into every session. An ETH Zurich study found LLM-generated agent files actually hurt performance while burning 20%+ more tokens. Human-written files, kept under 60 lines, gave modest (~4%) but real gains. Treat these as a table of contents, not an encyclopedia. Claude Code loads CLAUDE.md at session start for exactly this purpose.
Tools: The capabilities you expose to the model. Each tool description eats context tokens, so more tools mean more capability but also a larger fixed context cost per session. Pi Coding Agent ships just four tools (read, write, edit, bash) to minimize this cost; Claude Code ships 10+ and uses MCP for extensibility. HumanLayer replaced a bloated Linear MCP server with a 6-command CLI wrapper, saving thousands of tokens.
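A minimal four-tool surface in the spirit of Pi Coding Agent can be sketched as below. The tool names match the article; the schemas, lambdas, and `tool_manifest` helper are illustrative, not Pi's actual API. The manifest is the only tool text the model pays for up front.

```python
# Four tools, four one-line descriptions: the whole fixed context cost.
# (Illustrative sketch; not Pi Coding Agent's real implementation.)
import subprocess
from pathlib import Path

TOOLS = {
    "read":  {"description": "Return the contents of a file.",
              "run": lambda path: Path(path).read_text()},
    "write": {"description": "Overwrite a file with new contents.",
              "run": lambda path, text: Path(path).write_text(text)},
    "edit":  {"description": "Replace one string with another in a file.",
              "run": lambda path, old, new: Path(path).write_text(
                  Path(path).read_text().replace(old, new, 1))},
    "bash":  {"description": "Run a shell command and capture its output.",
              "run": lambda cmd: subprocess.run(
                  cmd, shell=True, capture_output=True, text=True).stdout},
}

def tool_manifest() -> str:
    """The only tool text the model sees at session start."""
    return "\n".join(f"{name}: {t['description']}" for name, t in TOOLS.items())
```

Every tool beyond these four adds a description to the manifest, which is exactly the token cost the article describes.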
Skills (progressive disclosure): Knowledge modules that load only when matched. The agent sees names and descriptions at startup. Full content loads on demand. This solves the “dumb zone” problem where too many tokens push the model past its effective instruction budget.
Sub-agents (context firewalls): Isolated context windows for discrete tasks. The parent agent sees only the final result, not intermediate tool calls. Chroma research confirmed that model performance degrades as context length grows. Sub-agents keep each task in the “smart zone” by resetting context for each subtask.
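The firewall property reduces to one rule: intermediate turns live and die inside the sub-agent's window. A minimal sketch, with `call_model` standing in for a real model API:

```python
# Context-firewall sketch: the sub-agent accumulates its own history;
# the parent receives only the final answer. `call_model` is a stand-in.
def call_model(context: list[str]) -> str:
    return f"result for: {context[-1]}"   # placeholder model call

def run_subagent(task: str) -> str:
    context = [task]                 # fresh window, no parent history
    for _ in range(3):               # intermediate tool calls stay here
        context.append(call_model(context))
    return context[-1]               # only the final result escapes

def parent_loop(tasks: list[str]) -> list[str]:
    parent_context = []
    for t in tasks:
        parent_context.append(run_subagent(t))   # one line per subtask
    return parent_context
```

However long each subtask runs, the parent's context grows by one entry per subtask, not by the subtask's full transcript.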
Hooks (deterministic control): Shell commands triggered at lifecycle events. Run linters on save, typecheck before commit, notify Slack on completion. Success stays silent; failures surface with exit code 2 to force the agent back into fixing mode. Hooks always execute, regardless of what the model decides.
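A post-edit lint hook can be sketched as follows. The JSON-on-stdin shape mirrors Claude Code hooks, but treat the `file_path` field name as illustrative; `py_compile` stands in for a real linter so the sketch is self-contained.

```python
# Hook sketch: silent on success, exit code 2 on failure so the agent
# is pushed back into fixing mode. (Field names are illustrative;
# py_compile is a stand-in for a real linter such as ruff.)
import json
import subprocess
import sys

def check(event: dict) -> int:
    path = event.get("file_path", "")
    if not path.endswith(".py"):
        return 0                                  # not our concern; stay silent
    lint = subprocess.run([sys.executable, "-m", "py_compile", path],
                          capture_output=True, text=True)
    if lint.returncode == 0:
        return 0                                  # success stays silent
    sys.stderr.write(lint.stderr)                 # surface the failure to the agent
    return 2                                      # exit 2 forces a fix cycle

if __name__ == "__main__":
    sys.exit(check(json.load(sys.stdin)))         # lifecycle event arrives on stdin
```

The key design choice is the asymmetry: a zero exit adds nothing to context, while exit code 2 feeds stderr back and blocks the agent from moving on.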
Back-pressure mechanisms: Tests, builds, and typechecks that let agents validate their own output. Context efficiency matters more than comprehensiveness here. One team found 4,000 lines of passing test output caused the agent to hallucinate about test files instead of doing its job.
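Context efficiency here means compressing the validation signal. A hedged sketch of one common shape (the function name and 40-line default are my own): return a bare pass on success, and only the tail of the output on failure.

```python
# Back-pressure with context discipline: thousands of lines of passing
# output are pure noise; on failure, only the tail is diagnostic.
# (run_check and the 40-line default are illustrative choices.)
import subprocess

def run_check(cmd: list[str], tail_lines: int = 40) -> str:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return "PASS"                        # one token instead of 4,000 lines
    tail = (proc.stdout + proc.stderr).splitlines()[-tail_lines:]
    return "FAIL\n" + "\n".join(tail)        # just enough to diagnose
```

This directly addresses the failure mode above: the agent never sees 4,000 lines of passing test output to hallucinate about.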
Guides vs. sensors
Fowler’s framework splits harness controls into two categories:
- Guides (feedforward): steer the agent before it acts. System prompts, architectural constraints, linter rules, file templates.
- Sensors (feedback): observe after the agent acts and trigger correction. Tests, type checkers, evaluator agents, CI pipelines.
Both can be computational (fast, deterministic: linters, type checkers) or inferential (LLM-based: semantic review, design evaluation). Computational checks run pre-commit; inferential ones run post-integration.
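The split can be expressed as a tiny pipeline: guides transform the task before the agent acts, sensors inspect the output afterward and report failures. All names here are my own framing of Fowler's categories, not an API from the article.

```python
# Guides steer before acting (feedforward); sensors check after
# (feedback). Both computational and inferential checks fit the
# Sensor shape. (Illustrative framing, not a published API.)
from typing import Callable, Optional

Guide = Callable[[str], str]             # task -> steered task
Sensor = Callable[[str], Optional[str]]  # output -> error message or None

def run_with_harness(task: str, agent: Callable[[str], str],
                     guides: list[Guide], sensors: list[Sensor]):
    for g in guides:                     # feedforward: steer before acting
        task = g(task)
    output = agent(task)
    failures = [msg for s in sensors if (msg := s(output))]
    return output, failures              # feedback: what to fix, if anything
```

Usage with one guide and one computational sensor:

```python
guides = [lambda t: t + " Follow the repo lint rules."]
sensors = [lambda out: "unfinished work left" if "TODO" in out else None]
_, fails = run_with_harness("add endpoint", lambda t: "code with TODO", guides, sensors)
# fails is non-empty, so the harness triggers a correction cycle
```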
Three regulation categories
- Maintainability: Code quality, style, structural rules. Most mature. Linters and formatters handle this well.
- Architecture fitness: Performance, observability, dependency boundaries. OpenAI enforces a strict layering sequence: Types → Config → Repo → Service → Runtime → UI.
- Behavior: Functional correctness. The least solved and most important. Tests help but can’t cover subjective quality like design.
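A layering rule like OpenAI's is cheap to enforce computationally: a module may depend only on its own layer or layers earlier in the sequence. The layer names below come from the article; the import-map representation and violation checker are a sketch of my own.

```python
# Architecture-fitness linter sketch: imports must point "backward"
# in the layer sequence, never forward. (Layer order from the article;
# the checker itself is illustrative.)
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

def layer_violations(imports: dict[str, list[str]]) -> list[str]:
    """imports maps a module's layer to the layers it imports from."""
    errors = []
    for layer, deps in imports.items():
        for dep in deps:
            if RANK[dep] > RANK[layer]:      # importing a later layer is illegal
                errors.append(f"{layer} may not import from {dep}")
    return errors
```

Wired into a hook, this turns an architectural convention into a deterministic constraint the agent cannot drift past.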
Anthropic’s three-agent pattern
Anthropic’s harness for long-running tasks splits work across three specialized agents:
- Planner: converts brief prompts into comprehensive specs
- Generator: implements features in sprints, self-evaluating before handoff
- Evaluator: uses Playwright to interact with the running app like a user, grading against design quality, originality, craft, and functionality
The evaluator exists because agents overrate their own work. Training a separate skeptical evaluator turned out to be far more tractable than making a generator critical of itself. Sprint contracts between generator and evaluator define what “done” looks like before work begins.
Context resets between agents prevent degradation. A claude-progress.txt file alongside git history lets the next agent reconstruct state from scratch. Opus 4.6 reduced the need for resets, showing that model improvements can simplify the harness. Each component encodes an assumption about what the model can’t handle, and better models invalidate old assumptions.
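The handoff discipline above can be sketched minimally: each role starts from durable artifacts (the progress file, git history) rather than the previous role's context window. The role behavior here is a stand-in for real model calls; only the `claude-progress.txt` convention comes from the article.

```python
# Context-reset sketch: roles share state only through durable
# artifacts, never through a shared context window. The model call
# is a placeholder; only the progress-file convention is from the article.
from pathlib import Path

PROGRESS = Path("claude-progress.txt")

def run_role(role: str, instructions: str) -> str:
    prior = PROGRESS.read_text() if PROGRESS.exists() else ""   # reconstruct state
    result = f"[{role} output for: {instructions}]"             # stand-in model call
    PROGRESS.write_text(prior + f"{role}: done\n")              # trail for the next role
    return result

spec = run_role("planner", "brief prompt")
build = run_role("generator", spec)
verdict = run_role("evaluator", build)
```

Because nothing crosses the boundary except the file and the repo, each role begins in the smart zone regardless of how long the previous role ran.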
OpenAI’s million-line experiment
Three engineers, five months, ~1,500 merged PRs, zero manually written lines. They averaged 3.5 PRs per engineer per day, and throughput increased as the team grew to seven. The constraint (“no manual code”) forced them to build a robust harness: cross-linked design documents as machine-readable artifacts, custom linters enforcing layered architecture, telemetry-driven feedback loops, and recurring “garbage collection” scans for drift.
What doesn’t work
- Pre-designing the ideal harness before real failures happen
- Installing dozens of MCP servers and skills “just in case” (see: MCP risks)
- Running full test suites on every agent session (5+ minute runs kill iteration speed)
- LLM-generated CLAUDE.md files (hurt performance per ETH Zurich data)
- Micro-optimizing tool access across sub-agents before understanding actual needs
What works
- Start simple. Add configuration only when failures force it.
- Optimize for iteration speed over single-shot correctness.
- Treat harness changes like code: design, test, iterate, discard what doesn’t help.
- Distribute battle-tested configs at the repo level so every engineer benefits.
- Keep context files short and human-written.
The minimalism question
Pi Coding Agent and Claude Code represent opposite ends of the spectrum. Pi bets that four tools and a 1,000-token system prompt are enough because frontier models already understand coding-agent semantics. Claude Code bets that a batteries-included harness (sub-agents, teams, MCP, hooks, skills, 10+ tools) produces more reliable output for most users.
Both positions have data. Terminal-Bench 2.0 showed Opus 4.6 ranking #33 in Claude Code but #5 in an unseen harness, suggesting configuration diversity matters more than matching the training harness. The PocketFlow lesson applies: most teams overestimate how much framework they need.
As models get better, how much harness do you shed? Harnesses should shrink over time because each component exists to compensate for a model limitation. When the limitation disappears, the component becomes dead weight. Auditing those assumptions regularly separates good harness engineering from cargo-culted configuration.