Definition

Evaluation is the process of measuring whether an AI system produces the required outcome under defined conditions using explicit criteria, repeatable inputs, and observable outputs.

An evaluation framework is the structure used to define, execute, compare, and monitor those measurements across development and production environments.

Evaluation includes:

  1. input definition
  2. expected behavior definition
  3. scoring methodology
  4. execution environment
  5. result storage
  6. regression comparison
  7. decision thresholds

Evaluation must be deterministic where possible and explicitly probabilistic where determinism is not possible.
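
As an illustration, these components map to simple data structures. The following TypeScript sketch is illustrative only; the type and field names are assumptions, not a fixed standard.

// Illustrative types only; names and fields are assumptions, not a standard.
interface EvalCase {
  id: string;
  input: string;                              // 1. input definition
  expected: string;                           // 2. expected behavior definition
  tags: string[];
}

interface EvalResult {
  caseId: string;
  output: string;
  score: number;                              // 3. scoring methodology output
  environment: "local" | "ci" | "staging";    // 4. execution environment
  runId: string;                              // 5. result storage key
  baselineScore?: number;                     // 6. regression comparison
}

interface ReleaseThresholds {                 // 7. decision thresholds
  minPassRate: number;
  maxLatencySeconds: number;
}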


Offline Evaluation

Definition

Offline evaluation is the measurement of system performance using predefined datasets outside live user traffic.

Inputs are fixed.

Outputs are scored against expected results.

No production users are involved.


Components

Evaluation Dataset

A fixed set of test cases.

Each case contains:

  1. input
  2. expected output or scoring rule
  3. metadata
  4. failure classification tags

Example:

{
  "input": "Summarize contract clause 7",
  "expected": "Termination requires 30-day notice",
  "tags": ["legal", "summary"]
}
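
A case file like this can be validated before use. A minimal sketch with Zod follows; the schema name and the optional metadata field are assumptions based on the component list above.

import { z } from "zod";

// Schema mirroring the example case; field names follow the component list above.
const EvalCaseSchema = z.object({
  input: z.string(),
  expected: z.string(),                       // or a reference to a scoring rule
  tags: z.array(z.string()).default([]),
  metadata: z.record(z.string()).optional(),
});

type EvalCase = z.infer<typeof EvalCaseSchema>;

const parsed: EvalCase = EvalCaseSchema.parse({
  input: "Summarize contract clause 7",
  expected: "Termination requires 30-day notice",
  tags: ["legal", "summary"],
});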

Scoring Function

A scoring function converts a system output into a measurable value; a minimal sketch follows the list of scoring types below.

Common scoring types:

  1. exact match
  2. semantic similarity
  3. rubric-based LLM judge
  4. human review
  5. tool success validation
  6. schema validation
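
A minimal TypeScript sketch of two of these scoring types, exact match and schema validation; the function names are illustrative, and semantic similarity or LLM-judge scoring would sit behind the same result shape.

import { z } from "zod";

type ScoreResult = { pass: boolean; value: number; reason?: string };

// 1. exact match
function exactMatch(output: string, expected: string): ScoreResult {
  const pass = output.trim() === expected.trim();
  return { pass, value: pass ? 1 : 0 };
}

// 6. schema validation
function schemaValid(output: string, schema: z.ZodTypeAny): ScoreResult {
  try {
    schema.parse(JSON.parse(output));
    return { pass: true, value: 1 };
  } catch (err) {
    return { pass: false, value: 0, reason: String(err) };
  }
}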

Evaluation Runner

The execution system that runs prompts, retrieval, tool calls, and model generations against the dataset and records the outputs for scoring.

Examples:

  • LangSmith (LangChain)
  • Phoenix (Arize AI)
  • Weights & Biases Weave
  • Braintrust
  • Humanloop
  • local TypeScript runners using Vitest + Zod + OpenAI SDK

Node / TypeScript Stack Example

Typical stack (a minimal runner sketch follows the library list below):

  1. evaluation cases stored as JSON or Postgres rows
  2. runner built in TypeScript
  3. assertions with Vitest
  4. schema validation with Zod
  5. tracing with OpenTelemetry
  6. storage in BigQuery or Postgres
  7. dashboards in Grafana or Looker

Libraries:

  • OpenAI SDK
  • Google Cloud Vertex AI SDK
  • LangSmith SDK
  • Braintrust SDK
  • Helicone
  • OpenTelemetry JS
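
A minimal runner sketch using this stack, assuming cases live in a local JSON file and an OpenAI chat model is the target; the file path and model name are placeholders.

import { describe, it, expect } from "vitest";
import { readFileSync } from "node:fs";
import OpenAI from "openai";

// Placeholder path and model name; OPENAI_API_KEY is read from the environment.
const cases = JSON.parse(readFileSync("evals/golden-set.json", "utf8")) as
  { input: string; expected: string; tags: string[] }[];
const client = new OpenAI();

describe("golden set", () => {
  for (const c of cases) {
    it(`case: ${c.input}`, async () => {
      const res = await client.chat.completions.create({
        model: "gpt-4o-mini",   // placeholder model
        messages: [{ role: "user", content: c.input }],
      });
      const output = res.choices[0].message.content ?? "";
      // Exact-substring scoring for simplicity; swap in a rubric or judge as needed.
      expect(output).toContain(c.expected);
    }, 30_000);   // allow time for model calls
  }
});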

Online Evaluation

Definition

Online evaluation is the measurement of system performance using live production traffic.

Real users interact with the deployed system.

Outputs are measured through user behavior and operational metrics.


Common Metrics

Success Metrics

Measures whether the task was completed.

Examples:

  1. resolution rate
  2. support deflection
  3. ticket closure
  4. successful workflow completion
  5. tool execution success

Quality Metrics

Measures output quality.

Examples:

  1. groundedness
  2. correctness
  3. citation validity
  4. hallucination rate
  5. escalation rate

Operational Metrics

Measures runtime behavior.

Examples:

  1. latency
  2. token cost
  3. retries
  4. tool failures
  5. timeout rate

Online Evaluation Methods

A/B Testing

Traffic is split between system variants.

Comparison is statistical.

Examples:

  1. prompt A vs prompt B
  2. model A vs model B
  3. retriever version A vs B

Tools:

  • LaunchDarkly
  • Statsig
  • PostHog
  • custom routing with feature flags
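
A minimal sketch of the "custom routing with feature flags" option: a deterministic hash-based split rather than a vendor SDK. The prompts and percentage are illustrative.

import { createHash } from "node:crypto";

// Deterministic bucket assignment: the same user always lands in the same variant.
function assignVariant(userId: string, percentB: number): "A" | "B" {
  const bucket = createHash("sha256").update(userId).digest().readUInt32BE(0) % 100;
  return bucket < percentB ? "B" : "A";
}

// Illustrative prompts; route 10% of users to prompt B.
const promptA = "Summarize the clause in one sentence.";
const promptB = "Summarize the clause and cite the section number.";
const prompt = assignVariant("user-123", 10) === "B" ? promptB : promptA;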

Shadow Testing

A candidate system runs in parallel without affecting users.

Outputs are recorded but not shown to users.

Used for:

  1. model migration
  2. retriever changes
  3. tool execution validation
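
A minimal shadow-testing sketch: the production pipeline serves the user while the candidate runs in parallel and is only logged. All function names are stand-ins.

// Stand-ins for the production pipeline, the candidate pipeline, and the result log.
async function productionAnswer(question: string): Promise<string> {
  return `live answer for: ${question}`;
}
async function candidateAnswer(question: string): Promise<string> {
  return `candidate answer for: ${question}`;
}
function logShadowResult(record: object): void {
  console.log("shadow", JSON.stringify(record));
}

async function handleRequest(question: string): Promise<string> {
  const liveAnswer = await productionAnswer(question);

  // Fire-and-forget: the candidate never blocks or changes the user-facing response.
  candidateAnswer(question)
    .then((shadow) => logShadowResult({ question, liveAnswer, shadow }))
    .catch((err) => logShadowResult({ question, liveAnswer, error: String(err) }));

  return liveAnswer;
}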

Canary Deployment

A small percentage of users receive the new system.

Rollback thresholds are predefined.

Tools:

  • Google Cloud Run
  • Kubernetes
  • Temporal workflows

Human Evaluation

Definition

Human evaluation is scoring performed by reviewers using explicit criteria.

Used when deterministic scoring is insufficient.


Rubric Design

A rubric defines scoring dimensions.

Each dimension must be explicit.

Example:

Dimension           Score
Correctness         1–5
Groundedness        1–5
Clarity             1–5
Policy Compliance   pass/fail

Rubrics must avoid ambiguous labels.

Bad:

"good answer"

Good:

"answer cites correct internal source and does not invent facts"


Review Platforms

Examples:

  • Labelbox
  • Scale AI
  • Humanloop
  • Braintrust
  • internal review UI built with React/Angular

LLM-as-Judge

Definition

LLM-as-judge is evaluation where a language model scores another model's output using a rubric.

The evaluator model is separate from the target model.


Requirements

  1. stable prompt template
  2. fixed rubric
  3. explicit scoring schema
  4. evaluator version tracking
  5. calibration against human review

Risks

  1. evaluator drift
  2. evaluator bias
  3. prompt sensitivity
  4. false agreement
  5. hidden hallucination acceptance

LLM-as-judge scoring must be calibrated against human-reviewed golden datasets.


Example

Judge prompt:

Score whether the answer is grounded only in provided sources.

Return:
{
  "score": 1-5,
  "reason": "..."
}

Validation:

Use Zod schema validation before storing.
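
A minimal sketch of that validation step, assuming the judge returns the JSON shape shown above; the schema name is illustrative.

import { z } from "zod";

// Mirrors the judge's return shape above; anything outside 1–5 is rejected before storage.
const JudgeScoreSchema = z.object({
  score: z.number().int().min(1).max(5),
  reason: z.string(),
});

function parseJudgeOutput(raw: string) {
  return JudgeScoreSchema.parse(JSON.parse(raw));   // throws on malformed output
}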


Golden Datasets

Definition

A golden dataset is a curated set of stable evaluation cases used as the authoritative regression baseline.

It must change slowly.

It is used for release gating.


Properties

  1. high-quality labels
  2. representative failure modes
  3. edge cases included
  4. stable versioning
  5. explicit ownership and review process

Storage

Examples:

  1. Git repository
  2. Postgres table
  3. BigQuery
  4. Firestore
  5. dedicated eval platform

Versioning must be explicit.

Example:

golden-set-v14

Not:

latest_final_fixed_real_final.json

Task Success Metrics

Definition

Task success metrics measure whether the business objective was completed, not whether the generated text appears correct.


Examples

Support Agent

Metric:

refund processed successfully

Not:

response sounded helpful

Internal Search

Metric:

document found and used

Not:

answer looked confident

Workflow Agent

Metric:

approval completed without manual intervention

Not:

agent explanation quality


Instrumentation

Requires:

  1. event logging
  2. workflow completion tracking
  3. tool result capture
  4. user override detection

Typical tools:

  • PostHog
  • BigQuery
  • Mixpanel
  • internal event bus
  • OpenTelemetry traces
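
A minimal sketch of the event record such instrumentation might emit; the field names are illustrative, and the sink (PostHog, BigQuery, event bus) is interchangeable.

// Illustrative event shape covering the four requirements above.
interface TaskEvent {
  workflowId: string;
  event: string;                                   // 1. event logging
  workflowCompleted: boolean;                      // 2. workflow completion tracking
  toolResults: { tool: string; ok: boolean }[];    // 3. tool result capture
  userOverride: boolean;                           // 4. user override detection
  timestamp: string;
}

// Stand-in sink; in practice this is PostHog, BigQuery, or an internal event bus.
async function emitEvent(event: TaskEvent): Promise<void> {
  console.log(JSON.stringify(event));
}

void emitEvent({
  workflowId: "wf-123",
  event: "refund.processed",
  workflowCompleted: true,
  toolResults: [{ tool: "refund_api", ok: true }],
  userOverride: false,
  timestamp: new Date().toISOString(),
});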

Regression Evaluation

Definition

Regression evaluation compares current system behavior against a known previous baseline to detect degradation.


Types

Prompt Regression

Detects behavior change caused by prompt edits.

Model Regression

Detects behavior change caused by model upgrades.

Example:

OpenAI GPT version change

Retrieval Regression

Detects behavior change caused by chunking, embeddings, or reranking changes.

Tool Regression

Detects behavior change caused by tool execution or API contract changes.
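
A minimal sketch of a regression comparison over per-case scores, assuming the current and baseline runs scored the same golden cases; the pass/fail encoding is illustrative.

// caseId -> pass (1) or fail (0) for a single run over the golden set.
type RunScores = Record<string, number>;

// A regression is any case that passed in the baseline but fails now.
function findRegressions(baseline: RunScores, current: RunScores): string[] {
  return Object.keys(baseline).filter(
    (caseId) => baseline[caseId] === 1 && current[caseId] !== 1,
  );
}

const regressions = findRegressions(
  { "case-1": 1, "case-2": 1, "case-3": 0 },
  { "case-1": 1, "case-2": 0, "case-3": 0 },
);
// regressions === ["case-2"]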


Release Gates

Definition

A release gate is a formal threshold that must be passed before deployment.

Example:

  1. hallucination rate < 2%
  2. tool success > 95%
  3. latency < 2 seconds
  4. cost increase < 10%

Deployment is blocked if thresholds fail.

Release gates must be machine-enforced.

Not manual opinion.
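
A minimal sketch of a machine-enforced gate as a CI step; the metric names mirror the example thresholds above, values are assumed to come from the latest evaluation run, and a non-zero exit blocks deployment.

// Threshold values mirror the example gates above; metric values are assumed to
// come from the latest evaluation run (for example a Postgres or BigQuery query).
interface RunMetrics {
  hallucinationRate: number;   // fraction of graded outputs, e.g. 0.015 = 1.5%
  toolSuccessRate: number;
  latencySeconds: number;
  costIncreaseRate: number;
}

function releaseGate(m: RunMetrics): string[] {
  const failures: string[] = [];
  if (!(m.hallucinationRate < 0.02)) failures.push("hallucination rate >= 2%");
  if (!(m.toolSuccessRate > 0.95)) failures.push("tool success <= 95%");
  if (!(m.latencySeconds < 2)) failures.push("latency >= 2 seconds");
  if (!(m.costIncreaseRate < 0.10)) failures.push("cost increase >= 10%");
  return failures;
}

const failures = releaseGate({
  hallucinationRate: 0.015,
  toolSuccessRate: 0.97,
  latencySeconds: 1.4,
  costIncreaseRate: 0.05,
});

if (failures.length > 0) {
  console.error("Release gate failed:", failures.join("; "));
  process.exit(1);   // blocks the CI/CD pipeline
}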


Google Stack Example

Typical Enterprise Evaluation Stack

Storage

  • BigQuery
  • Cloud SQL
  • Firestore

Execution

  • Cloud Run
  • Vertex AI
  • Pub/Sub

Monitoring

  • Cloud Logging
  • Cloud Monitoring
  • Looker

Security

  • IAM
  • VPC Service Controls
  • Audit Logs

Example Flow

  1. evaluation cases stored in BigQuery
  2. Cloud Run runner executes evals
  3. Vertex AI runs model calls
  4. scores written back to BigQuery
  5. Looker dashboard tracks regressions
  6. deployment blocked via CI/CD policy
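
A minimal sketch of step 4, writing a score row back to BigQuery from the runner; dataset, table, and column names are placeholders.

import { BigQuery } from "@google-cloud/bigquery";

// Uses Application Default Credentials; dataset, table, and column names are placeholders.
const bq = new BigQuery();

async function writeScore(caseId: string, score: number, runId: string): Promise<void> {
  await bq.dataset("evals").table("scores").insert([
    { case_id: caseId, score, run_id: runId, created_at: new Date().toISOString() },
  ]);
}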

Local Stack Example

Fully Controlled Environment

Storage

  • Postgres
  • SQLite
  • local files

Execution

  • Node.js
  • TypeScript
  • Docker
  • local GPU inference

Models

  • Ollama
  • vLLM
  • Hugging Face Transformers

Monitoring

  • Grafana
  • Prometheus
  • OpenTelemetry

This is common in regulated environments and secure internal deployments.
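
A minimal sketch of calling a local model from the runner through Ollama's HTTP API; the model name is a placeholder and an Ollama server is assumed on its default port.

// Assumes an Ollama server on localhost:11434 and a pulled model; both are placeholders.
async function localCompletion(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3.1", prompt, stream: false }),
  });
  const data = (await res.json()) as { response: string };
  return data.response;
}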