Note on Terminology: An Agent Framework (e.g., LangChain, Google ADK) provides the library of primitives—the "lego blocks" for defining tools and loops. An Agent Harness is the product-specific engine built using those frameworks; it is the industrial machinery that enforces security, handles retries, and ensures task completion. You use a Framework to build a Harness.
Definition
An agent harness is the execution system surrounding a language model that converts model inference into reliable task completion through orchestration, tool control, context management, execution boundaries, verification, and operational feedback loops.
The harness is the operational substrate of the agent.
The model produces tokens. The harness produces work.
A harness includes:
- agent loop
- tool dispatch
- context and memory management
- execution sandboxing
- permission boundaries
- verification logic
- retry and repair loops
- multi-agent coordination
- evaluation systems
- tracing and observability
- prompt routing
- model routing
- completion criteria
The model is one component inside the harness. It is not the harness.
Formal Execution Model
A production agent executes through:
input
→ context construction
→ reasoning
→ tool selection
→ tool execution
→ verification
→ repair or completion
→ trace persistence
The model participates in reasoning. The harness controls the rest.
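The stages above can be sketched as an explicit pipeline, with the model confined to the reasoning stage and the harness owning everything around it. This is an illustrative shape, not any product's actual API; all names (`HarnessStages`, `runPipeline`, and the stage methods) are hypothetical.

```typescript
// Hypothetical sketch: the execution model as explicit harness stages.
type StageResult = { ok: boolean; output: string };

interface HarnessStages {
  buildContext(input: string): string;       // context construction
  reason(context: string): string;           // the model participates here
  selectTool(reasoning: string): string;     // tool selection
  executeTool(tool: string): StageResult;    // tool execution
  verify(result: StageResult): boolean;      // external verification
  persistTrace(record: object): void;        // trace persistence
}

function runPipeline(stages: HarnessStages, input: string): StageResult {
  const context = stages.buildContext(input);
  const reasoning = stages.reason(context);
  const tool = stages.selectTool(reasoning);
  const result = stages.executeTool(tool);
  const verified = stages.verify(result);
  stages.persistTrace({ input, tool, verified });
  return verified ? result : { ok: false, output: "verification failed" };
}
```

The point of the shape: every stage except `reason` is deterministic harness code, which is why the harness, not the model, controls reliability.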
Agent = Model + Harness
For production systems:
harness engineering complexity >> model integration complexity
This matches the architecture of Cursor, Anthropic Claude Code, Factory Droid, and Replit Agent.
Model Layer vs Harness Layer
Model layer
Provides:
- reasoning
- generation
- summarization
- classification
- planning proposals
- tool selection proposals
Examples: OpenAI GPT-5 family, Anthropic Claude models, Google Gemini models.
The model is replaceable.
Harness layer
Provides:
- execution authority
- workflow enforcement
- tool permissions
- validation
- state persistence
- observability
- reliability guarantees
The harness is product-specific. It is not commoditized.
Evidence: Harness Quality Beats Model Substitution
In February 2026, the LangChain team reported that their open-source coding agent deepagents-cli improved from 52.8% to 66.5% on Terminal-Bench 2.0 while using the same underlying model (GPT-5.2-Codex). The model did not change. The harness changed.
+13.7 points
outside Top 30 → Top 5
Improving orchestration can yield larger practical gains than swapping models.
Harness Planes
A production harness is composed of operational planes. A plane is a bounded subsystem responsible for one reliability domain.
The seven common planes:
Agent loop plane
Defines how the model repeatedly reasons and acts until termination.
Common patterns:
- ReAct (Reason + Act)
- plan-execute
- generate-test-repair
- planner-executor
- verifier-repair loop
- supervisor-worker
reason
→ choose tool
→ execute
→ inspect result
→ continue or terminate
Implementations: LangGraph (see 1.2.10), XState (see 1.2.9), or custom orchestration.
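The reason/act loop above reduces to a small skeleton. This is a minimal ReAct-style sketch, not LangGraph's or XState's API; `reason`, `execute`, and the `Step` shape are hypothetical.

```typescript
// Minimal ReAct-style loop skeleton (illustrative, not a framework API).
type Step = { tool: string | null }; // null means the model signals completion

function agentLoop(
  reason: (history: string[]) => Step,
  execute: (tool: string) => string,
  maxSteps = 10,
): string[] {
  const history: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = reason(history);     // model proposes the next action
    if (step.tool === null) break;    // termination condition
    history.push(execute(step.tool)); // harness executes and records the result
  }
  return history;
}
```

The `maxSteps` bound is a harness concern: the model never gets to decide unilaterally how long it runs.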
Tool plane
Defines how external capabilities are exposed to the model.
A tool entry has:
- name
- input schema
- execution boundary
- permission rules
- timeout policy
- retry policy
- audit trace
Examples: filesystem access, deployment systems, CRM operations, GitHub actions, MCP tools, shell execution.
Tool design optimises for LLM reliability, not human ergonomics. Replit reported that function-calling reliability degrades as argument complexity grows, and moved parts of execution toward constrained-interpreter patterns rather than unrestricted, argument-heavy function calling.
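A tool entry with the fields listed above might be typed as follows. This is an illustrative shape (production systems would typically use a schema library such as Zod for the input schema); `ToolEntry` and `dispatch` are hypothetical names.

```typescript
// Illustrative tool-registry entry matching the fields above.
interface ToolEntry {
  name: string;
  validateInput(input: unknown): boolean;   // input schema
  execute(input: unknown): Promise<string>; // runs inside the execution boundary
  allowedRoles: string[];                   // permission rules
  timeoutMs: number;                        // timeout policy
  maxRetries: number;                       // retry policy
}

// Dispatch enforces the contract before the tool ever runs.
async function dispatch(tool: ToolEntry, input: unknown, role: string): Promise<string> {
  if (!tool.allowedRoles.includes(role)) throw new Error("permission denied");
  if (!tool.validateInput(input)) throw new Error("invalid input");
  return tool.execute(input);
}
```

Note that permission and validation checks live in `dispatch`, not in the tool: the model can propose any call, but only the harness decides whether it runs.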
Context plane
Controls what information is visible to the model during reasoning.
Includes:
- retrieval
- summarisation
- memory
- progressive disclosure
- context compression
- scratchpads
- state snapshots
The objective is bounded relevance, not maximum context. Large context windows do not remove the requirement for retrieval architecture.
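Bounded relevance can be sketched as a budgeted selection over scored context items. The relevance scores and the token estimate are stand-ins for real retrieval and a real tokenizer; `buildContext` is a hypothetical name.

```typescript
// Illustrative context budgeting: keep the most relevant items within a token budget.
interface ContextItem { text: string; relevance: number }

function buildContext(items: ContextItem[], tokenBudget: number): string[] {
  const approxTokens = (s: string) => Math.ceil(s.length / 4); // crude estimate
  const selected: string[] = [];
  let used = 0;
  // Highest-relevance first; skip anything that would exceed the budget.
  for (const item of [...items].sort((a, b) => b.relevance - a.relevance)) {
    const cost = approxTokens(item.text);
    if (used + cost > tokenBudget) continue; // bounded relevance, not maximum context
    selected.push(item.text);
    used += cost;
  }
  return selected;
}
```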
Sandbox plane
Isolates execution and prevents unsafe side effects.
Controls:
- filesystem boundaries
- network restrictions
- secret isolation
- temporary environments
- execution quotas
- destructive action controls
Implementations: Docker containers, devcontainers, isolated VMs, Kubernetes namespaces, ephemeral workspaces.
This is mandatory for production agents with write authority.
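Real isolation comes from containers or VMs, but the policy checks that sit in front of them have a simple shape. The sketch below is illustrative only; `SandboxPolicy` and both check functions are hypothetical, and a production path check would also resolve symlinks and canonicalise the path.

```typescript
// Illustrative sandbox policy gate: deny writes outside the workspace and
// destructive shell commands before anything reaches the isolated environment.
interface SandboxPolicy {
  workspaceRoot: string;
  allowNetwork: boolean;
  deniedCommands: RegExp[];
}

function checkShellCommand(policy: SandboxPolicy, cmd: string): { allowed: boolean; reason?: string } {
  for (const pattern of policy.deniedCommands) {
    if (pattern.test(cmd)) return { allowed: false, reason: `matches denied pattern ${pattern}` };
  }
  return { allowed: true };
}

function checkWritePath(policy: SandboxPolicy, path: string): boolean {
  // Simplified escape guard; real code canonicalises and resolves symlinks.
  return path.startsWith(policy.workspaceRoot) && !path.includes("..");
}
```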
Permission plane
Determines whether a proposed action may execute.
LLM proposes
→ system validates
→ tool executes
Validation:
- role checks
- approval thresholds
- policy enforcement
- tenant boundaries
- least privilege
- audit requirements
For a deeper treatment of the LLM-event-boundary pattern, tool authority, and approval gates, see 1.1.1 State machines in agentic systems.
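The propose/validate/execute split can be sketched as a single decision function. This is an illustrative shape, not any product's policy engine; `ProposedAction`, `Decision`, and `validateAction` are hypothetical names.

```typescript
// Illustrative permission gate: the model proposes, the harness validates,
// and only then does the tool execute.
interface ProposedAction { tool: string; role: string; destructive: boolean }

type Decision = "execute" | "require_approval" | "deny";

function validateAction(action: ProposedAction, approvedRoles: string[]): Decision {
  if (!approvedRoles.includes(action.role)) return "deny"; // role check
  if (action.destructive) return "require_approval";       // approval threshold
  return "execute";                                        // least-privilege default
}
```

The key property: the model's output is a proposal, and no code path lets a proposal reach a tool without passing through this function.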
Evaluation plane
Measures whether the system is improving or regressing.
- offline evals
- regression testing
- online evals
- benchmark tracking
- hallucination detection
- failure replay
- task completion measurement
Without evaluation, harness improvement is unmeasurable.
Implementations: LangSmith, Braintrust, Arize Phoenix, custom TypeScript runners.
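The offline-eval and regression-testing items above share one mechanism: replay fixed cases and compare the pass rate against a baseline. A minimal sketch, with a hypothetical `runEvals` and a trivial equality check standing in for real graders:

```typescript
// Illustrative offline eval runner: replay fixed cases through the agent and
// compare the pass rate against a stored baseline to catch regressions.
interface EvalCase { input: string; expected: string }

function runEvals(
  agent: (input: string) => string,
  cases: EvalCase[],
  baselinePassRate: number,
): { passRate: number; regressed: boolean } {
  const passed = cases.filter((c) => agent(c.input) === c.expected).length;
  const passRate = passed / cases.length;
  return { passRate, regressed: passRate < baselinePassRate };
}
```

In practice the grader is rarely string equality, but the baseline comparison is what makes "Without evaluation, harness improvement is unmeasurable" operational.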
Observability plane
Records runtime behavior for debugging, audit, and optimisation.
Required traces:
- prompts
- model outputs
- tool calls
- tool failures
- retries
- approval events
- completion criteria
- execution cost
- latency
- failure classification
Implementations: OpenTelemetry, LangSmith, Helicone, Grafana, Prometheus, Cloud Logging.
Claude Code, Cursor, and enterprise internal systems all depend on harness observability rather than prompt inspection alone.
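The required traces above suggest a small event model. This sketch is illustrative; production systems would emit these as OpenTelemetry spans rather than an in-memory log, and `TraceEvent`/`TraceLog` are hypothetical names.

```typescript
// Illustrative trace recorder covering several of the fields listed above.
interface TraceEvent {
  kind: "prompt" | "model_output" | "tool_call" | "tool_failure" | "retry" | "approval";
  payload: string;
  latencyMs: number;
  costUsd: number;
}

class TraceLog {
  private events: TraceEvent[] = [];
  record(event: TraceEvent): void { this.events.push(event); }
  // Aggregations like these drive cost dashboards and failure classification.
  totalCost(): number { return this.events.reduce((sum, e) => sum + e.costUsd, 0); }
  failures(): TraceEvent[] { return this.events.filter((e) => e.kind === "tool_failure"); }
}
```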
Verification
Verification determines whether work is complete and correct.
Completion is not model confidence. Completion is external validation.
Examples:
- tests pass
- deployment health green
- file exists
- contract parsed correctly
- support ticket resolved
- expected schema validated
Verification is the most important production boundary. A system without verification is prompt automation, not an agent.
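"Completion is external validation" has a direct code shape: run a set of observable checks and require all of them to pass. This is a minimal sketch; `Check` and `verifyCompletion` are hypothetical names.

```typescript
// Illustrative completion check: only external, observable validators count.
// Model confidence never appears here.
type Check = { name: string; run: () => boolean };

function verifyCompletion(checks: Check[]): { complete: boolean; failed: string[] } {
  const failed = checks.filter((c) => !c.run()).map((c) => c.name);
  return { complete: failed.length === 0, failed };
}
```

Each `run` would wrap something like a test suite invocation, a health probe, or a schema validation, matching the examples above.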
Failure-to-Harness Conversion
Harness improvement is the process of converting repeated failure classes into deterministic system behavior.
failure observed
→ root cause isolated
→ harness rule added
→ future failure prevented
Examples:
- add lint rule
- add retry policy
- add validation schema
- add approval checkpoint
- add context rule
- add specialised sub-agent
- add tool wrapper
This is the compounding mechanism of agent systems. Mitchell Hashimoto described this as the practical definition of harness engineering in early 2026.
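One concrete instance of failure-to-harness conversion: a repeated class of transient tool failures becomes a deterministic retry policy. A minimal sketch, with a hypothetical `withRetry` helper:

```typescript
// Illustrative harness rule: a transient-failure class converted into a
// deterministic retry policy, so the failure no longer surfaces to the loop.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // observed failure; retry instead of surfacing it
    }
  }
  throw lastError; // exhausted: now it is a real failure worth classifying
}
```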
Why Harnesses Compound
Model improvement
Model improvement is vendor-controlled and externally released.
GPT-5.2 → GPT-5.3
This resets baseline capability. It is not owned by the product team.
Harness improvement
Harness improvement is internally accumulated.
prompt repair
+ tool validation
+ approval logic
+ eval improvements
+ routing optimisation
This persists across model changes. It compounds. It is the primary source of defensibility.
Framework Layer vs Product Harness
Framework
Examples: LangChain, Vercel AI SDK, CrewAI, LangGraph, XState.
Frameworks provide primitives. They do not provide product correctness.
Product harness
Examples: Claude Code, Cursor, Devin, Factory Droid, Sourcegraph Amp, Replit Agent, Vercel v0.
These are opinionated systems built on top of frameworks.
Every serious production system adds a custom harness layer.
Harness and Vendor Lock-In
Harness architecture determines model portability. If orchestration depends on one provider's runtime assumptions, model substitution becomes expensive.
Good:
tool contracts stable
model replaceable
routing abstracted
Bad:
business logic embedded in vendor-specific prompt or runtime behavior
Model abstraction belongs at the harness boundary, not inside prompts.
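Putting the abstraction at the harness boundary means tools and routing depend on a narrow interface, never on a vendor SDK. An illustrative sketch, with hypothetical `ModelClient` and `makeRouter` names:

```typescript
// Illustrative model abstraction at the harness boundary: everything above
// this interface is vendor-neutral; vendor SDKs live behind it.
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

function makeRouter(models: Record<string, ModelClient>, defaultModel: string) {
  return {
    async complete(prompt: string, model?: string): Promise<string> {
      const client = models[model ?? defaultModel];
      if (!client) throw new Error(`unknown model: ${model}`);
      return client.complete(prompt);
    },
  };
}
```

Swapping GPT-5.2 for GPT-5.3, or routing cheap tasks to a smaller model, then touches one registration, not every tool and prompt.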
Reference Architecture
Node / TypeScript
- LangGraph or XState
- Fastify or NestJS
- Postgres
- Redis
- BullMQ
- OpenTelemetry
- Zod
- Docker
- LangSmith or Braintrust
Google Cloud
- Cloud Run
- Pub/Sub
- Firestore
- BigQuery
- Vertex AI
- IAM
- Secret Manager
- Cloud Logging
On-Prem
- Kubernetes
- Postgres
- Redis
- Ollama
- vLLM
- Prometheus
- Grafana
- Vault
The harness architecture remains the same. Only infrastructure changes.
Build Decision
Use a stock harness
Use Claude Code, Codex CLI, or Cursor directly when:
- prototyping
- low operational risk
- evaluation maturity is low
- workflow economics are unclear
Extend an existing harness
Use hooks, MCP servers, AGENTS.md, sub-agents, and custom tool layers when:
- domain constraints exist
- permissions matter
- evaluation is stable
- product behavior must be repeatable
Build a custom harness
Build when:
- sustained evaluation gap exceeds operational threshold
- token economics dominate margins
- audit requirements exceed stock systems
- domain-specific tools are core product value
- model portability is strategic
The threshold is operational, not ideological.
Final Rule
The model provides intelligence. The harness provides reliability.
The model is replaceable. The harness is the product.
Cross-references
- 1.1.1 State machines in agentic systems — orchestration patterns and distributed-systems concerns underlying the permission, sandbox, and verification planes.
- 1.2.10 LangGraph — primary framework for the agent loop plane in JS/TS.
- 1.2.9 XState — alternative framework for deterministic transition control inside the agent loop plane.
- Headless agent runtime (this submodule's neighbour) — execution-tier concerns that overlap with the harness's permission, observability, and persistence planes.
References
- LangChain deepagents-cli — Terminal-Bench 2.0 result, February 2026
- Anthropic Claude Code — claude.com/claude-code
- Cursor — cursor.com
- LangGraph (JS) — docs.langchain.com/oss/javascript/langgraph/overview
- LangSmith — smith.langchain.com
- OpenTelemetry — opentelemetry.io