Definition
Evaluation is the process of measuring whether an AI system produces the required outcome under defined conditions using explicit criteria, repeatable inputs, and observable outputs.
An evaluation framework is the structure used to define, execute, compare, and monitor those measurements across development and production environments.
Evaluation includes:
- input definition
- expected behavior definition
- scoring methodology
- execution environment
- result storage
- regression comparison
- decision thresholds
Evaluation must be deterministic where possible and explicitly probabilistic where it is not.
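As a minimal sketch, the components above can be expressed as TypeScript types; the names below are illustrative assumptions, not a standard API.

```ts
// Illustrative types for the evaluation components listed above.
interface EvalCase {
  id: string;
  input: string;              // input definition
  expected?: string;          // expected behavior (may be a scoring rule instead)
  tags: string[];
}

interface EvalScore {
  caseId: string;
  score: number;              // output of the scoring methodology, e.g. 0..1
  pass: boolean;
}

interface EvalRun {
  datasetVersion: string;                     // repeatable inputs
  environment: "dev" | "staging" | "prod";    // execution environment
  scores: EvalScore[];                        // result storage
  baselineRunId?: string;                     // regression comparison
  minPassRate: number;                        // decision threshold
}
```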
Offline Evaluation
Definition
Offline evaluation is the measurement of system performance using predefined datasets outside live user traffic.
Inputs are fixed.
Outputs are scored against expected results.
No production users are involved.
Components
Evaluation Dataset
A fixed set of test cases.
Each case contains:
- input
- expected output or scoring rule
- metadata
- failure classification tags
Example:
{
"input": "Summarize contract clause 7",
"expected": "Termination requires 30-day notice",
"tags": ["legal", "summary"]
}
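A sketch of how such cases could be validated before entering the dataset, assuming Zod and the field names from the example above:

```ts
import { z } from "zod";

// Schema mirroring the example case format above.
const EvalCaseSchema = z.object({
  input: z.string(),
  expected: z.string(),
  tags: z.array(z.string()),
});

type EvalCase = z.infer<typeof EvalCaseSchema>;

// Reject malformed cases before they are added to the dataset.
const evalCase: EvalCase = EvalCaseSchema.parse(
  JSON.parse(
    '{"input":"Summarize contract clause 7","expected":"Termination requires 30-day notice","tags":["legal","summary"]}',
  ),
);
```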
Scoring Function
A scoring function converts a system output into one or more measurable values.
Common scoring types:
- exact match
- semantic similarity
- rubric-based LLM judge
- human review
- tool success validation
- schema validation
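As a sketch, the first two types can be written as small TypeScript functions; the token-overlap scorer below is only a stand-in for a real embedding-based similarity score.

```ts
// A scorer maps (output, expected) to a value in 0..1.
type Scorer = (output: string, expected: string) => number;

// Exact match: deterministic and strict.
const exactMatch: Scorer = (output, expected) =>
  output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;

// Token overlap: a crude placeholder for embedding-based semantic similarity.
const tokenOverlap: Scorer = (output, expected) => {
  const outTokens = new Set(output.toLowerCase().split(/\s+/));
  const expTokens = expected.toLowerCase().split(/\s+/);
  if (expTokens.length === 0) return 0;
  return expTokens.filter((t) => outTokens.has(t)).length / expTokens.length;
};
```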
Evaluation Runner
The execution system that runs prompts, retrieval, tool calls, and model calls against the dataset and collects the outputs for scoring.
Examples:
- LangSmith (LangChain)
- Phoenix (Arize AI)
- Weights & Biases Weave
- Braintrust
- Humanloop
- local TypeScript runners using Vitest + Zod + OpenAI SDK
Node / TypeScript Stack Example
Typical stack:
- evaluation cases stored as JSON or Postgres rows
- runner built in TypeScript
- assertions with Vitest
- schema validation with Zod
- tracing with OpenTelemetry
- storage in BigQuery or Postgres
- dashboards in Grafana or Looker
Libraries:
- OpenAI SDK
- Google Cloud Vertex AI SDK
- LangSmith SDK
- Braintrust SDK
- Helicone
- OpenTelemetry JS
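A minimal sketch of such a local runner, assuming an eval-cases.json file in the case format shown earlier, an OPENAI_API_KEY in the environment, and an illustrative model name:

```ts
import { readFileSync } from "node:fs";
import { describe, it, expect } from "vitest";
import OpenAI from "openai";
import { z } from "zod";

// Hypothetical dataset file using the case format shown earlier.
const CaseSchema = z.object({
  input: z.string(),
  expected: z.string(),
  tags: z.array(z.string()),
});
const cases = z.array(CaseSchema).parse(
  JSON.parse(readFileSync("eval-cases.json", "utf8")),
);

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

describe("offline evaluation", () => {
  for (const c of cases) {
    it(`answers: ${c.input}`, async () => {
      const res = await client.chat.completions.create({
        model: "gpt-4o-mini", // assumed model name
        messages: [{ role: "user", content: c.input }],
      });
      const output = res.choices[0].message.content ?? "";
      // Simple containment check; swap in a real scoring function in practice.
      expect(output.toLowerCase()).toContain(c.expected.toLowerCase());
    });
  }
});
```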
Online Evaluation
Definition
Online evaluation is the measurement of system performance using live production traffic.
Real users interact with the deployed system.
Outputs are measured through behavioral and operational metrics.
Common Metrics
Success Metrics
Measure whether the task was completed.
Examples:
- resolution rate
- support deflection
- ticket closure
- successful workflow completion
- tool execution success
Quality Metrics
Measure output quality.
Examples:
- groundedness
- correctness
- citation validity
- hallucination rate
- escalation rate
Operational Metrics
Measure runtime behavior.
Examples:
- latency
- token cost
- retries
- tool failures
- timeout rate
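A sketch of recording some of these operational metrics with the OpenTelemetry JS metrics API; instrument names and attributes are assumptions, and a MeterProvider is assumed to be registered at startup.

```ts
import { metrics } from "@opentelemetry/api";

// Assumes a MeterProvider was registered during application startup.
const meter = metrics.getMeter("llm-eval");

const latencyMs = meter.createHistogram("llm.request.latency_ms");
const tokensUsed = meter.createCounter("llm.request.tokens");
const toolFailures = meter.createCounter("llm.tool.failures");

export function recordCall(durationMs: number, totalTokens: number, toolFailed: boolean) {
  latencyMs.record(durationMs, { model: "gpt-4o-mini" }); // attribute value is illustrative
  tokensUsed.add(totalTokens);
  if (toolFailed) toolFailures.add(1);
}
```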
Online Evaluation Methods
A/B Testing
Traffic is split between system variants.
Comparison is statistical.
Examples:
- prompt A vs prompt B
- model A vs model B
- retriever version A vs B
Tools:
- LaunchDarkly
- Statsig
- PostHog
- custom routing with feature flags
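A minimal sketch of the custom-routing option: deterministic bucketing by user id so the same user always sees the same variant. A feature-flag service such as LaunchDarkly or Statsig would normally own this assignment.

```ts
import { createHash } from "node:crypto";

// Stable bucket in 0..99 derived from the user id.
function assignVariant(userId: string, percentB = 50): "prompt_a" | "prompt_b" {
  const bucket = createHash("sha256").update(userId).digest()[0] % 100;
  return bucket < percentB ? "prompt_b" : "prompt_a";
}

const prompts = {
  prompt_a: "Summarize the clause in one sentence.",
  prompt_b: "Summarize the clause and cite the section number.",
};

const variant = assignVariant("user-123");
const systemPrompt = prompts[variant];
// Log the variant with every outcome event so the comparison stays statistical.
```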
Shadow Testing
A candidate system runs in parallel without affecting users.
Outputs are recorded but not shown.
Used for:
- model migration
- retriever changes
- tool execution validation
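A sketch of the pattern: the candidate runs alongside the production path, its output is logged, and any candidate failure is swallowed so it can never affect the user. The system functions are placeholders.

```ts
// Placeholder calls; in practice these invoke the deployed and candidate stacks.
async function productionSystem(input: string): Promise<string> {
  return `production answer for: ${input}`;
}
async function candidateSystem(input: string): Promise<string> {
  return `candidate answer for: ${input}`;
}

async function handleRequest(input: string): Promise<string> {
  // Fire-and-forget the candidate; record its output, never return it.
  candidateSystem(input)
    .then((output) => console.log(JSON.stringify({ kind: "shadow", input, output })))
    .catch((err) => console.error("shadow run failed", err));
  return productionSystem(input);
}
```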
Canary Deployment
A small percentage of users receive the new system.
Rollback thresholds are predefined.
Tools:
- Google Cloud Run
- Kubernetes
- Temporal workflows
Human Evaluation
Definition
Human evaluation is scoring performed by reviewers using explicit criteria.
Used when deterministic scoring is insufficient.
Rubric Design
A rubric defines scoring dimensions.
Each dimension must be explicit.
Example:
| Dimension | Score |
|---|---|
| Correctness | 1–5 |
| Groundedness | 1–5 |
| Clarity | 1–5 |
| Policy Compliance | pass/fail |
Rubrics must avoid ambiguous labels.
Bad:
"good answer"
Good:
"answer cites correct internal source and does not invent facts"
Review Platforms
Examples:
- Labelbox
- Scale AI
- Humanloop
- Braintrust
- internal review UI built with React/Angular
LLM-as-Judge
Definition
LLM-as-judge is evaluation where a language model scores another model's output using a rubric.
The evaluator model is separate from the target model.
Requirements
- stable prompt template
- fixed rubric
- explicit scoring schema
- evaluator version tracking
- calibration against human review
Risks
- evaluator drift
- evaluator bias
- prompt sensitivity
- false agreement
- hidden hallucination acceptance
LLM-as-judge must be calibrated with human-reviewed gold sets.
Example
Judge prompt:
Score whether the answer is grounded only in provided sources.
Return:
{
"score": 1-5,
"reason": "..."
}
Validation:
Use Zod schema validation before storing.
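A sketch of that validation step, assuming Zod and the judge output format shown above:

```ts
import { z } from "zod";

// Matches the judge output format above; anything else is rejected before storage.
const JudgeResultSchema = z.object({
  score: z.number().int().min(1).max(5),
  reason: z.string().min(1),
});

type JudgeResult = z.infer<typeof JudgeResultSchema>;

function parseJudgeOutput(raw: string): JudgeResult {
  // Throws if the judge returned malformed JSON or an out-of-range score.
  return JudgeResultSchema.parse(JSON.parse(raw));
}
```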
Golden Datasets
Definition
A golden dataset is a curated set of stable evaluation cases used as the authoritative regression baseline.
It must change slowly.
It is used for release gating.
Properties
- high-quality labels
- representative failure modes
- edge cases included
- stable versioning
- reviewed ownership
Storage
Examples:
- Git repository
- Postgres table
- BigQuery
- Firestore
- dedicated eval platform
Versioning must be explicit.
Example:
golden-set-v14
Not:
latest_final_fixed_real_final.json
Task Success Metrics
Definition
Task success metrics measure whether the business objective was completed, not whether the generated text appears correct.
Examples
Support Agent
Metric:
refund processed successfully
Not:
response sounded helpful
Internal Search
Metric:
document found and used
Not:
answer looked confident
Workflow Agent
Metric:
approval completed without manual intervention
Not:
agent explanation quality
Instrumentation
Requires:
- event logging
- workflow completion tracking
- tool result capture
- user override detection
Typical tools:
- PostHog
- BigQuery
- Mixpanel
- internal event bus
- OpenTelemetry traces
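A minimal sketch of this instrumentation: one structured event per workflow outcome. Field names are assumptions, and the console call stands in for an event bus or analytics client.

```ts
interface TaskEvent {
  workflowId: string;
  task: "refund" | "search" | "approval";
  success: boolean;            // the business outcome, not text quality
  manualIntervention: boolean; // user override detection
  toolErrors: number;
  timestamp: string;
}

function emitTaskEvent(event: TaskEvent): void {
  // Stand-in for PostHog, an internal event bus, or an OpenTelemetry span event.
  console.log(JSON.stringify(event));
}

emitTaskEvent({
  workflowId: "wf-8841",
  task: "refund",
  success: true,
  manualIntervention: false,
  toolErrors: 0,
  timestamp: new Date().toISOString(),
});
```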
Regression Evaluation
Definition
Regression evaluation compares current system behavior against a known previous baseline to detect degradation.
Types
Prompt Regression
Detects behavior change caused by prompt edits.
Model Regression
Detects behavior change caused by model upgrades.
Example:
OpenAI GPT version change
Retrieval Regression
Detects behavior change caused by chunking, embeddings, or reranking changes.
Tool Regression
Detects behavior change caused by tool execution or API contract changes.
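Whatever the cause, the comparison itself is the same: score the current system on the baseline cases and flag anything that got worse. A sketch, with the tolerance value as an assumption:

```ts
interface ScoredCase {
  caseId: string;
  score: number; // 0..1
}

// Returns the ids of cases whose score dropped by more than the tolerance.
function findRegressions(
  baseline: ScoredCase[],
  current: ScoredCase[],
  tolerance = 0.05,
): string[] {
  const baselineScores = new Map(baseline.map((c) => [c.caseId, c.score]));
  return current
    .filter((c) => {
      const prev = baselineScores.get(c.caseId);
      return prev !== undefined && c.score < prev - tolerance;
    })
    .map((c) => c.caseId);
}
```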
Release Gates
Definition
A release gate is a formal threshold that must be passed before deployment.
Example:
- hallucination rate < 2%
- tool success > 95%
- latency < 2 seconds
- cost increase < 10%
Deployment is blocked if any threshold fails.
Release gates must be machine-enforced, not applied by manual judgment.
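A sketch of machine enforcement: a CI step reads the eval summary produced by the runner and fails the pipeline when any threshold is violated. The file name, field names, and use of p95 latency are assumptions; the thresholds mirror the example above.

```ts
import { readFileSync } from "node:fs";

interface EvalSummary {
  hallucinationRate: number; // 0..1
  toolSuccessRate: number;   // 0..1
  p95LatencyMs: number;
  costIncreasePct: number;
}

function gate(s: EvalSummary): string[] {
  const failures: string[] = [];
  if (!(s.hallucinationRate < 0.02)) failures.push("hallucination rate not < 2%");
  if (!(s.toolSuccessRate > 0.95)) failures.push("tool success not > 95%");
  if (!(s.p95LatencyMs < 2000)) failures.push("latency not < 2s");
  if (!(s.costIncreasePct < 10)) failures.push("cost increase not < 10%");
  return failures;
}

const summary: EvalSummary = JSON.parse(readFileSync("eval-summary.json", "utf8"));
const failures = gate(summary);
if (failures.length > 0) {
  console.error("Release gate failed:", failures.join("; "));
  process.exit(1); // blocks deployment in CI
}
```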
Google Stack Example
Typical Enterprise Evaluation Stack
Storage
- BigQuery
- Cloud SQL
- Firestore
Execution
- Cloud Run
- Vertex AI
- Pub/Sub
Monitoring
- Cloud Logging
- Cloud Monitoring
- Looker
Security
- IAM
- VPC Service Controls
- Audit Logs
Example Flow
- evaluation cases stored in BigQuery
- Cloud Run runner executes evals
- Vertex AI runs model calls
- scores written back to BigQuery
- Looker dashboard tracks regressions
- deployment blocked via CI/CD policy
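A sketch of the write-back step, assuming the @google-cloud/bigquery client and illustrative dataset and table names:

```ts
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery(); // uses application default credentials

// Streaming insert of per-case scores; dataset and table names are assumptions.
async function writeScores(rows: { run_id: string; case_id: string; score: number }[]) {
  await bigquery.dataset("evals").table("scores").insert(rows);
}
```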
Local Stack Example
Fully Controlled Environment
Storage
- Postgres
- SQLite
- local files
Execution
- Node.js
- TypeScript
- Docker
- local GPU inference
Models
- Ollama
- vLLM
- Hugging Face Transformers
Monitoring
- Grafana
- Prometheus
- OpenTelemetry
This stack is common in regulated environments and secure internal deployments.
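A sketch of a local model call in such an environment, against Ollama's default HTTP endpoint; the model name is an assumption.

```ts
// Non-streaming generation via Ollama's local API (default port 11434).
async function localGenerate(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model: "llama3.1", prompt, stream: false }),
  });
  const data = (await res.json()) as { response: string };
  return data.response;
}
```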