Note on Terminology: An Agent Framework (e.g., LangChain, Google ADK) provides the library of primitives—the "lego blocks" for defining tools and loops. An Agent Harness is the product-specific engine built using those frameworks; it is the industrial machinery that enforces security, handles retries, and ensures task completion. You use a Framework to build a Harness.
Definition
An agent harness is the execution system surrounding a language model that converts model inference into reliable task completion through orchestration, tool control, context management, execution boundaries, verification, and operational feedback loops.
The harness is the operational substrate of the agent.
The model produces tokens. The harness produces work.
A harness includes:
- agent loop
- tool dispatch
- context and memory management
- execution sandboxing
- permission boundaries
- verification logic
- retry and repair loops
- multi-agent coordination
- evaluation systems
- tracing and observability
- prompt routing
- model routing
- completion criteria
The model is one component inside the harness. It is not the harness.
Formal Execution Model
A production agent executes through:
input
→ context construction
→ reasoning
→ tool selection
→ tool execution
→ verification
→ repair or completion
→ trace persistence
The model participates in reasoning. The harness controls the rest.
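The stages above can be sketched as an explicit pipeline, with the model confined to the reasoning stage and the harness owning everything around it. This is an illustrative shape, not any product's actual API; all names (`HarnessStages`, `runPipeline`, and the stage methods) are hypothetical.

```typescript
// Hypothetical sketch: the execution model as explicit harness stages.
type StageResult = { ok: boolean; output: string };

interface HarnessStages {
  buildContext(input: string): string;       // context construction
  reason(context: string): string;           // the model participates here
  selectTool(reasoning: string): string;     // tool selection
  executeTool(tool: string): StageResult;    // tool execution
  verify(result: StageResult): boolean;      // external verification
  persistTrace(record: object): void;        // trace persistence
}

function runPipeline(stages: HarnessStages, input: string): StageResult {
  const context = stages.buildContext(input);
  const reasoning = stages.reason(context);
  const tool = stages.selectTool(reasoning);
  const result = stages.executeTool(tool);
  const verified = stages.verify(result);
  stages.persistTrace({ input, tool, verified });
  return verified ? result : { ok: false, output: "verification failed" };
}
```

The point of the shape: every stage except `reason` is deterministic harness code, which is why the harness, not the model, controls reliability.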
Agent = Model + Harness
For production systems:
harness engineering complexity >> model integration complexity
This matches the architecture of Cursor, Anthropic Claude Code, Factory Droid, and Replit Agent.
Model Layer vs Harness Layer
Model layer
Provides:
- reasoning
- generation
- summarization
- classification
- planning proposals
- tool selection proposals
Examples: OpenAI GPT-5 family, Anthropic Claude models, Google Gemini models.
The model is replaceable.
Harness layer
Provides:
- execution authority
- workflow enforcement
- tool permissions
- validation
- state persistence
- observability
- reliability guarantees
The harness is product-specific. It is not commoditized.
Evidence: Harness Quality Beats Model Substitution
In February 2026, the LangChain team reported that their open-source coding agent deepagents-cli improved from 52.8% to 66.5% on Terminal-Bench 2.0 while using the same underlying model (GPT-5.2-Codex). The model did not change. The harness changed.
+13.7 points
outside Top 30 → Top 5
Improving orchestration can yield larger practical gains than swapping models.
Harness Planes
A production harness is composed of operational planes. A plane is a bounded subsystem responsible for one reliability domain.
The seven common planes:
Agent loop plane
Defines how the model repeatedly reasons and acts until termination.
Common patterns:
- ReAct (Reason + Act)
- plan-execute
- generate-test-repair
- planner-executor
- verifier-repair loop
- supervisor-worker
reason
→ choose tool
→ execute
→ inspect result
→ continue or terminate
Implementations: LangGraph (see 1.2.10), XState (see 1.2.9), or custom orchestration.
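The reason/act loop above reduces to a small skeleton. This is a minimal ReAct-style sketch, not LangGraph's or XState's API; `reason`, `execute`, and the `Step` shape are hypothetical.

```typescript
// Minimal ReAct-style loop skeleton (illustrative, not a framework API).
type Step = { tool: string | null }; // null means the model signals completion

function agentLoop(
  reason: (history: string[]) => Step,
  execute: (tool: string) => string,
  maxSteps = 10,
): string[] {
  const history: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = reason(history);     // model proposes the next action
    if (step.tool === null) break;    // termination condition
    history.push(execute(step.tool)); // harness executes and records the result
  }
  return history;
}
```

The `maxSteps` bound is a harness concern: the model never gets to decide unilaterally how long it runs.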
Tool plane
Defines how external capabilities are exposed to the model.
A tool entry has:
- name
- input schema
- execution boundary
- permission rules
- timeout policy
- retry policy
- audit trace
Examples: filesystem access, deployment systems, CRM operations, GitHub actions, MCP tools, shell execution.
Tool design optimises for LLM reliability, not human ergonomics. Replit reported that function-calling reliability degrades as argument complexity grows, and moved parts of execution toward constrained-interpreter patterns rather than unrestricted, argument-heavy function calling.
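A tool entry with the fields listed above might be typed as follows. This is an illustrative shape (production systems would typically use a schema library such as Zod for the input schema); `ToolEntry` and `dispatch` are hypothetical names.

```typescript
// Illustrative tool-registry entry matching the fields above.
interface ToolEntry {
  name: string;
  validateInput(input: unknown): boolean;   // input schema
  execute(input: unknown): Promise<string>; // runs inside the execution boundary
  allowedRoles: string[];                   // permission rules
  timeoutMs: number;                        // timeout policy
  maxRetries: number;                       // retry policy
}

// Dispatch enforces the contract before the tool ever runs.
async function dispatch(tool: ToolEntry, input: unknown, role: string): Promise<string> {
  if (!tool.allowedRoles.includes(role)) throw new Error("permission denied");
  if (!tool.validateInput(input)) throw new Error("invalid input");
  return tool.execute(input);
}
```

Note that permission and validation checks live in `dispatch`, not in the tool: the model can propose any call, but only the harness decides whether it runs.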
Context plane
Controls what information is visible to the model during reasoning.
Includes:
- retrieval
- summarisation
- memory
- progressive disclosure
- context compression
- scratchpads
- state snapshots
The objective is bounded relevance, not maximum context. Large context windows do not remove the requirement for retrieval architecture.
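Bounded relevance can be sketched as a budgeted selection over scored context items. The relevance scores and the token estimate are stand-ins for real retrieval and a real tokenizer; `buildContext` is a hypothetical name.

```typescript
// Illustrative context budgeting: keep the most relevant items within a token budget.
interface ContextItem { text: string; relevance: number }

function buildContext(items: ContextItem[], tokenBudget: number): string[] {
  const approxTokens = (s: string) => Math.ceil(s.length / 4); // crude estimate
  const selected: string[] = [];
  let used = 0;
  // Highest-relevance first; skip anything that would exceed the budget.
  for (const item of [...items].sort((a, b) => b.relevance - a.relevance)) {
    const cost = approxTokens(item.text);
    if (used + cost > tokenBudget) continue; // bounded relevance, not maximum context
    selected.push(item.text);
    used += cost;
  }
  return selected;
}
```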
Sandbox plane
Isolates execution and prevents unsafe side effects.
Controls:
- filesystem boundaries
- network restrictions
- secret isolation
- temporary environments
- execution quotas
- destructive action controls
Implementations: Docker containers, devcontainers, isolated VMs, Kubernetes namespaces, ephemeral workspaces.
This is mandatory for production agents with write authority.
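Real isolation comes from containers or VMs, but the policy checks that sit in front of them have a simple shape. The sketch below is illustrative only; `SandboxPolicy` and both check functions are hypothetical, and a production path check would also resolve symlinks and canonicalise the path.

```typescript
// Illustrative sandbox policy gate: deny writes outside the workspace and
// destructive shell commands before anything reaches the isolated environment.
interface SandboxPolicy {
  workspaceRoot: string;
  allowNetwork: boolean;
  deniedCommands: RegExp[];
}

function checkShellCommand(policy: SandboxPolicy, cmd: string): { allowed: boolean; reason?: string } {
  for (const pattern of policy.deniedCommands) {
    if (pattern.test(cmd)) return { allowed: false, reason: `matches denied pattern ${pattern}` };
  }
  return { allowed: true };
}

function checkWritePath(policy: SandboxPolicy, path: string): boolean {
  // Simplified escape guard; real code canonicalises and resolves symlinks.
  return path.startsWith(policy.workspaceRoot) && !path.includes("..");
}
```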
Permission plane
Determines whether a proposed action may execute.
LLM proposes
→ system validates
→ tool executes
Validation:
- role checks
- approval thresholds
- policy enforcement
- tenant boundaries
- least privilege
- audit requirements
For a deeper treatment of the LLM-event-boundary pattern, tool authority, and approval gates, see 1.1.1 State machines in agentic systems.
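The propose/validate/execute split can be sketched as a single decision function. This is an illustrative shape, not any product's policy engine; `ProposedAction`, `Decision`, and `validateAction` are hypothetical names.

```typescript
// Illustrative permission gate: the model proposes, the harness validates,
// and only then does the tool execute.
interface ProposedAction { tool: string; role: string; destructive: boolean }

type Decision = "execute" | "require_approval" | "deny";

function validateAction(action: ProposedAction, approvedRoles: string[]): Decision {
  if (!approvedRoles.includes(action.role)) return "deny"; // role check
  if (action.destructive) return "require_approval";       // approval threshold
  return "execute";                                        // least-privilege default
}
```

The key property: the model's output is a proposal, and no code path lets a proposal reach a tool without passing through this function.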
Evaluation plane
Measures whether the system is improving or regressing.
- offline evals
- regression testing
- online evals
- benchmark tracking
- hallucination detection
- failure replay
- task completion measurement
Without evaluation, harness improvement is unmeasurable.
Implementations: LangSmith, Braintrust, Arize Phoenix, custom TypeScript runners.
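The offline-eval and regression-testing items above share one mechanism: replay fixed cases and compare the pass rate against a baseline. A minimal sketch, with a hypothetical `runEvals` and a trivial equality check standing in for real graders:

```typescript
// Illustrative offline eval runner: replay fixed cases through the agent and
// compare the pass rate against a stored baseline to catch regressions.
interface EvalCase { input: string; expected: string }

function runEvals(
  agent: (input: string) => string,
  cases: EvalCase[],
  baselinePassRate: number,
): { passRate: number; regressed: boolean } {
  const passed = cases.filter((c) => agent(c.input) === c.expected).length;
  const passRate = passed / cases.length;
  return { passRate, regressed: passRate < baselinePassRate };
}
```

In practice the grader is rarely string equality, but the baseline comparison is what makes "Without evaluation, harness improvement is unmeasurable" operational.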
Observability plane
Records runtime behavior for debugging, audit, and optimisation.
Required traces:
- prompts
- model outputs
- tool calls
- tool failures
- retries
- approval events
- completion criteria
- execution cost
- latency
- failure classification
Implementations: OpenTelemetry, LangSmith, Helicone, Grafana, Prometheus, Cloud Logging.
Claude Code, Cursor, and enterprise internal systems all depend on harness observability rather than prompt inspection alone.
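The required traces above suggest a small event model. This sketch is illustrative; production systems would emit these as OpenTelemetry spans rather than an in-memory log, and `TraceEvent`/`TraceLog` are hypothetical names.

```typescript
// Illustrative trace recorder covering several of the fields listed above.
interface TraceEvent {
  kind: "prompt" | "model_output" | "tool_call" | "tool_failure" | "retry" | "approval";
  payload: string;
  latencyMs: number;
  costUsd: number;
}

class TraceLog {
  private events: TraceEvent[] = [];
  record(event: TraceEvent): void { this.events.push(event); }
  // Aggregations like these drive cost dashboards and failure classification.
  totalCost(): number { return this.events.reduce((sum, e) => sum + e.costUsd, 0); }
  failures(): TraceEvent[] { return this.events.filter((e) => e.kind === "tool_failure"); }
}
```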
Verification
Verification determines whether work is complete and correct.
Completion is not model confidence. Completion is external validation.
Examples:
- tests pass
- deployment health green
- file exists
- contract parsed correctly
- support ticket resolved
- expected schema validated
Verification is the most important production boundary. A system without verification is prompt automation, not an agent.
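"Completion is external validation" has a direct code shape: run a set of observable checks and require all of them to pass. This is a minimal sketch; `Check` and `verifyCompletion` are hypothetical names.

```typescript
// Illustrative completion check: only external, observable validators count.
// Model confidence never appears here.
type Check = { name: string; run: () => boolean };

function verifyCompletion(checks: Check[]): { complete: boolean; failed: string[] } {
  const failed = checks.filter((c) => !c.run()).map((c) => c.name);
  return { complete: failed.length === 0, failed };
}
```

Each `run` would wrap something like a test suite invocation, a health probe, or a schema validation, matching the examples above.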
Failure-to-Harness Conversion
Harness improvement is the process of converting repeated failure classes into deterministic system behavior.
failure observed
→ root cause isolated
→ harness rule added
→ future failure prevented
Examples:
- add lint rule
- add retry policy
- add validation schema
- add approval checkpoint
- add context rule
- add specialised sub-agent
- add tool wrapper
This is the compounding mechanism of agent systems. Mitchell Hashimoto described this as the practical definition of harness engineering in early 2026.
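One concrete instance of failure-to-harness conversion: a repeated class of transient tool failures becomes a deterministic retry policy. A minimal sketch, with a hypothetical `withRetry` helper:

```typescript
// Illustrative harness rule: a transient-failure class converted into a
// deterministic retry policy, so the failure no longer surfaces to the loop.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // observed failure; retry instead of surfacing it
    }
  }
  throw lastError; // exhausted: now it is a real failure worth classifying
}
```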
Why Harnesses Compound
Model improvement
Model improvement is vendor-controlled and externally released.
GPT-5.2 → GPT-5.3
This resets baseline capability. It is not owned by the product team.
Harness improvement
Harness improvement is internally accumulated.
prompt repair
+ tool validation
+ approval logic
+ eval improvements
+ routing optimisation
This persists across model changes. It compounds. It is the primary source of defensibility.
Framework Layer vs Product Harness
Framework
Examples: LangChain, Vercel AI SDK, CrewAI, LangGraph, XState.
Frameworks provide primitives. They do not provide product correctness.
Product harness
Examples: Claude Code, Cursor, Devin, Factory Droid, Sourcegraph Amp, Replit Agent, Vercel v0.
These are opinionated systems built on top of frameworks.
Every serious production system adds a custom harness layer.
Harness and Vendor Lock-In
Harness architecture determines model portability. If orchestration depends on one provider's runtime assumptions, model substitution becomes expensive.
Good:
tool contracts stable
model replaceable
routing abstracted
Bad:
business logic embedded in vendor-specific prompt or runtime behavior
Model abstraction belongs at the harness boundary, not inside prompts.
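Putting the abstraction at the harness boundary means tools and routing depend on a narrow interface, never on a vendor SDK. An illustrative sketch, with hypothetical `ModelClient` and `makeRouter` names:

```typescript
// Illustrative model abstraction at the harness boundary: everything above
// this interface is vendor-neutral; vendor SDKs live behind it.
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

function makeRouter(models: Record<string, ModelClient>, defaultModel: string) {
  return {
    async complete(prompt: string, model?: string): Promise<string> {
      const client = models[model ?? defaultModel];
      if (!client) throw new Error(`unknown model: ${model}`);
      return client.complete(prompt);
    },
  };
}
```

Swapping GPT-5.2 for GPT-5.3, or routing cheap tasks to a smaller model, then touches one registration, not every tool and prompt.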
Reference Architecture
Node / TypeScript
- LangGraph or XState
- Fastify or NestJS
- Postgres
- Redis
- BullMQ
- OpenTelemetry
- Zod
- Docker
- LangSmith or Braintrust
Google Cloud
- Cloud Run
- Pub/Sub
- Firestore
- BigQuery
- Vertex AI
- IAM
- Secret Manager
- Cloud Logging
On-Prem
- Kubernetes
- Postgres
- Redis
- Ollama
- vLLM
- Prometheus
- Grafana
- Vault
The harness architecture remains the same. Only infrastructure changes.
Build Decision
Use a stock harness
Use Claude Code, Codex CLI, or Cursor directly when:
- prototyping
- low operational risk
- evaluation maturity is low
- workflow economics are unclear
Extend an existing harness
Use hooks, MCP servers, AGENTS.md, sub-agents, and custom tool layers when:
- domain constraints exist
- permissions matter
- evaluation is stable
- product behavior must be repeatable
Build a custom harness
Build when:
- sustained evaluation gap exceeds operational threshold
- token economics dominate margins
- audit requirements exceed stock systems
- domain-specific tools are core product value
- model portability is strategic
The threshold is operational, not ideological.
Final Rule
The model provides intelligence. The harness provides reliability.
The model is replaceable. The harness is the product.
Cross-references
- 1.1.1 State machines in agentic systems — orchestration patterns and distributed-systems concerns underlying the permission, sandbox, and verification planes.
- 1.2.10 LangGraph — primary framework for the agent loop plane in JS/TS.
- 1.2.9 XState — alternative framework for deterministic transition control inside the agent loop plane.
- Headless agent runtime (this submodule's neighbour) — execution-tier concerns that overlap with the harness's permission, observability, and persistence planes.
References
- LangChain deepagents-cli — Terminal-Bench 2.0 result, February 2026
- Anthropic Claude Code — claude.com/claude-code
- Cursor — cursor.com
- LangGraph (JS) — docs.langchain.com/oss/javascript/langgraph/overview
- LangSmith — smith.langchain.com
- OpenTelemetry — opentelemetry.io