Definition

Evaluation is the process of measuring whether an AI system produces the required outcome under defined conditions using explicit criteria, repeatable inputs, and observable outputs.

An evaluation framework is the structure used to define, execute, compare, and monitor those measurements across development and production environments.

Evaluation includes:

  1. input definition
  2. expected behavior definition
  3. scoring methodology
  4. execution environment
  5. result storage
  6. regression comparison
  7. decision thresholds

Evaluation must be deterministic where possible and explicitly probabilistic where determinism is not possible.
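
As an illustration, these components map to simple data structures. The following TypeScript sketch is illustrative only; the type and field names are assumptions, not a fixed standard.

// Illustrative types only; names and fields are assumptions, not a standard.
interface EvalCase {
  id: string;
  input: string;                              // 1. input definition
  expected: string;                           // 2. expected behavior definition
  tags: string[];
}

interface EvalResult {
  caseId: string;
  output: string;
  score: number;                              // 3. scoring methodology output
  environment: "local" | "ci" | "staging";    // 4. execution environment
  runId: string;                              // 5. result storage key
  baselineScore?: number;                     // 6. regression comparison
}

interface ReleaseThresholds {                 // 7. decision thresholds
  minPassRate: number;
  maxLatencySeconds: number;
}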


Offline Evaluation

Definition

Offline evaluation is the measurement of system performance using predefined datasets outside live user traffic.

Inputs are fixed.

Outputs are scored against expected results.

No production users are involved.


Components

Evaluation Dataset

A fixed set of test cases.

Each case contains:

  1. input
  2. expected output or scoring rule
  3. metadata
  4. failure classification tags

Example:

{
  "input": "Summarize contract clause 7",
  "expected": "Termination requires 30-day notice",
  "tags": ["legal", "summary"]
}
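
A case file like this can be validated before use. A minimal sketch with Zod follows; the schema name and the optional metadata field are assumptions based on the component list above.

import { z } from "zod";

// Schema mirroring the example case; field names follow the component list above.
const EvalCaseSchema = z.object({
  input: z.string(),
  expected: z.string(),                       // or a reference to a scoring rule
  tags: z.array(z.string()).default([]),
  metadata: z.record(z.string()).optional(),
});

type EvalCase = z.infer<typeof EvalCaseSchema>;

const parsed: EvalCase = EvalCaseSchema.parse({
  input: "Summarize contract clause 7",
  expected: "Termination requires 30-day notice",
  tags: ["legal", "summary"],
});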

Scoring Function

A scoring function converts a system output into a measurable value; a minimal sketch follows the list of scoring types below.

Common scoring types:

  1. exact match
  2. semantic similarity
  3. rubric-based LLM judge
  4. human review
  5. tool success validation
  6. schema validation
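
A minimal TypeScript sketch of two of these scoring types, exact match and schema validation; the function names are illustrative, and semantic similarity or LLM-judge scoring would sit behind the same result shape.

import { z } from "zod";

type ScoreResult = { pass: boolean; value: number; reason?: string };

// 1. exact match
function exactMatch(output: string, expected: string): ScoreResult {
  const pass = output.trim() === expected.trim();
  return { pass, value: pass ? 1 : 0 };
}

// 6. schema validation
function schemaValid(output: string, schema: z.ZodTypeAny): ScoreResult {
  try {
    schema.parse(JSON.parse(output));
    return { pass: true, value: 1 };
  } catch (err) {
    return { pass: false, value: 0, reason: String(err) };
  }
}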

Evaluation Runner

The execution system that runs prompts, retrieval, tool calls, and model generations against the dataset and records the outputs for scoring.

Examples:

  • LangSmith (LangChain)
  • Phoenix (Arize AI)
  • Weights & Biases Weave
  • Braintrust
  • Humanloop
  • local TypeScript runners using Vitest + Zod + OpenAI SDK

Node / TypeScript Stack Example

Typical stack (a minimal runner sketch follows the library list below):

  1. evaluation cases stored as JSON or Postgres rows
  2. runner built in TypeScript
  3. assertions with Vitest
  4. schema validation with Zod
  5. tracing with OpenTelemetry
  6. storage in BigQuery or Postgres
  7. dashboards in Grafana or Looker

Libraries:

  • OpenAI SDK
  • Google Cloud Vertex AI SDK
  • LangSmith SDK
  • Braintrust SDK
  • Helicone
  • OpenTelemetry JS
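
A minimal runner sketch using this stack, assuming cases live in a local JSON file and an OpenAI chat model is the target; the file path and model name are placeholders.

import { describe, it, expect } from "vitest";
import { readFileSync } from "node:fs";
import OpenAI from "openai";

// Placeholder path and model name; OPENAI_API_KEY is read from the environment.
const cases = JSON.parse(readFileSync("evals/golden-set.json", "utf8")) as
  { input: string; expected: string; tags: string[] }[];
const client = new OpenAI();

describe("golden set", () => {
  for (const c of cases) {
    it(`case: ${c.input}`, async () => {
      const res = await client.chat.completions.create({
        model: "gpt-4o-mini",   // placeholder model
        messages: [{ role: "user", content: c.input }],
      });
      const output = res.choices[0].message.content ?? "";
      // Exact-substring scoring for simplicity; swap in a rubric or judge as needed.
      expect(output).toContain(c.expected);
    }, 30_000);   // allow time for model calls
  }
});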

Online Evaluation

Definition

Online evaluation is the measurement of system performance using live production traffic.

Real users interact with the deployed system.

Outputs are measured through user behavior and operational metrics.


Common Metrics

Success Metrics

Measures whether the task was completed.

Examples:

  1. resolution rate
  2. support deflection
  3. ticket closure
  4. successful workflow completion
  5. tool execution success

Quality Metrics

Measures output quality.

Examples:

  1. groundedness
  2. correctness
  3. citation validity
  4. hallucination rate
  5. escalation rate

Operational Metrics

Measures runtime behavior.

Examples:

  1. latency
  2. token cost
  3. retries
  4. tool failures
  5. timeout rate

Online Evaluation Methods

A/B Testing

Traffic is split between system variants.

Comparison is statistical.

Examples:

  1. prompt A vs prompt B
  2. model A vs model B
  3. retriever version A vs B

Tools:

  • LaunchDarkly
  • Statsig
  • PostHog
  • custom routing with feature flags
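
A minimal sketch of the "custom routing with feature flags" option: a deterministic hash-based split rather than a vendor SDK. The prompts and percentage are illustrative.

import { createHash } from "node:crypto";

// Deterministic bucket assignment: the same user always lands in the same variant.
function assignVariant(userId: string, percentB: number): "A" | "B" {
  const bucket = createHash("sha256").update(userId).digest().readUInt32BE(0) % 100;
  return bucket < percentB ? "B" : "A";
}

// Illustrative prompts; route 10% of users to prompt B.
const promptA = "Summarize the clause in one sentence.";
const promptB = "Summarize the clause and cite the section number.";
const prompt = assignVariant("user-123", 10) === "B" ? promptB : promptA;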

Shadow Testing

A candidate system runs in parallel without affecting users.

Outputs are recorded but not shown to users.

Used for:

  1. model migration
  2. retriever changes
  3. tool execution validation
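
A minimal shadow-testing sketch: the production pipeline serves the user while the candidate runs in parallel and is only logged. All function names are stand-ins.

// Stand-ins for the production pipeline, the candidate pipeline, and the result log.
async function productionAnswer(question: string): Promise<string> {
  return `live answer for: ${question}`;
}
async function candidateAnswer(question: string): Promise<string> {
  return `candidate answer for: ${question}`;
}
function logShadowResult(record: object): void {
  console.log("shadow", JSON.stringify(record));
}

async function handleRequest(question: string): Promise<string> {
  const liveAnswer = await productionAnswer(question);

  // Fire-and-forget: the candidate never blocks or changes the user-facing response.
  candidateAnswer(question)
    .then((shadow) => logShadowResult({ question, liveAnswer, shadow }))
    .catch((err) => logShadowResult({ question, liveAnswer, error: String(err) }));

  return liveAnswer;
}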

Canary Deployment

A small percentage of users receive the new system.

Rollback thresholds are predefined.

Tools:

  • Google Cloud Run
  • Kubernetes
  • Temporal workflows

Human Evaluation

Definition

Human evaluation is scoring performed by reviewers using explicit criteria.

Used when deterministic scoring is insufficient.


Rubric Design

A rubric defines scoring dimensions.

Each dimension must be explicit.

Example:

Dimension           Score
Correctness         1–5
Groundedness        1–5
Clarity             1–5
Policy Compliance   pass/fail

Rubrics must avoid ambiguous labels.

Bad:

"good answer"

Good:

"answer cites correct internal source and does not invent facts"


Review Platforms

Examples:

  • Labelbox
  • Scale AI
  • Humanloop
  • Braintrust
  • internal review UI built with React/Angular

LLM-as-Judge

Definition

LLM-as-judge is evaluation where a language model scores another model's output using a rubric.

The evaluator model is separate from the target model.


Requirements

  1. stable prompt template
  2. fixed rubric
  3. explicit scoring schema
  4. evaluator version tracking
  5. calibration against human review

Risks

  1. evaluator drift
  2. evaluator bias
  3. prompt sensitivity
  4. false agreement
  5. hidden hallucination acceptance

LLM-as-judge scoring must be calibrated against human-reviewed golden datasets.


Example

Judge prompt:

Score whether the answer is grounded only in provided sources.

Return:
{
  "score": 1-5,
  "reason": "..."
}

Validation:

Use Zod schema validation before storing.
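
A minimal sketch of that validation step, assuming the judge returns the JSON shape shown above; the schema name is illustrative.

import { z } from "zod";

// Mirrors the judge's return shape above; anything outside 1–5 is rejected before storage.
const JudgeScoreSchema = z.object({
  score: z.number().int().min(1).max(5),
  reason: z.string(),
});

function parseJudgeOutput(raw: string) {
  return JudgeScoreSchema.parse(JSON.parse(raw));   // throws on malformed output
}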


Golden Datasets

Definition

A golden dataset is a curated set of stable evaluation cases used as the authoritative regression baseline.

It must change slowly.

It is used for release gating.


Properties

  1. high-quality labels
  2. representative failure modes
  3. edge cases included
  4. stable versioning
  5. explicit ownership and review process

Storage

Examples:

  1. Git repository
  2. Postgres table
  3. BigQuery
  4. Firestore
  5. dedicated eval platform

Versioning must be explicit.

Example:

golden-set-v14

Not:

latest_final_fixed_real_final.json

Task Success Metrics

Definition

Task success metrics measure whether the business objective was completed, not whether the generated text appears correct.


Examples

Support Agent

Metric:

refund processed successfully

Not:

response sounded helpful

Internal Search

Metric:

document found and used

Not:

answer looked confident

Workflow Agent

Metric:

approval completed without manual intervention

Not:

agent explanation quality


Instrumentation

Requires:

  1. event logging
  2. workflow completion tracking
  3. tool result capture
  4. user override detection

Typical tools:

  • PostHog
  • BigQuery
  • Mixpanel
  • internal event bus
  • OpenTelemetry traces
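
A minimal sketch of the event record such instrumentation might emit; the field names are illustrative, and the sink (PostHog, BigQuery, event bus) is interchangeable.

// Illustrative event shape covering the four requirements above.
interface TaskEvent {
  workflowId: string;
  event: string;                                   // 1. event logging
  workflowCompleted: boolean;                      // 2. workflow completion tracking
  toolResults: { tool: string; ok: boolean }[];    // 3. tool result capture
  userOverride: boolean;                           // 4. user override detection
  timestamp: string;
}

// Stand-in sink; in practice this is PostHog, BigQuery, or an internal event bus.
async function emitEvent(event: TaskEvent): Promise<void> {
  console.log(JSON.stringify(event));
}

void emitEvent({
  workflowId: "wf-123",
  event: "refund.processed",
  workflowCompleted: true,
  toolResults: [{ tool: "refund_api", ok: true }],
  userOverride: false,
  timestamp: new Date().toISOString(),
});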

Regression Evaluation

Definition

Regression evaluation compares current system behavior against a known previous baseline to detect degradation.


Types

Prompt Regression

Detects behavior change caused by prompt edits.

Model Regression

Detects behavior change caused by model upgrades.

Example:

OpenAI GPT version change

Retrieval Regression

Detects behavior change caused by chunking, embeddings, or reranking changes.

Tool Regression

Detects behavior change caused by tool execution or API contract changes.
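
A minimal sketch of a regression comparison over per-case scores, assuming the current and baseline runs scored the same golden cases; the pass/fail encoding is illustrative.

// caseId -> pass (1) or fail (0) for a single run over the golden set.
type RunScores = Record<string, number>;

// A regression is any case that passed in the baseline but fails now.
function findRegressions(baseline: RunScores, current: RunScores): string[] {
  return Object.keys(baseline).filter(
    (caseId) => baseline[caseId] === 1 && current[caseId] !== 1,
  );
}

const regressions = findRegressions(
  { "case-1": 1, "case-2": 1, "case-3": 0 },
  { "case-1": 1, "case-2": 0, "case-3": 0 },
);
// regressions === ["case-2"]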


Release Gates

Definition

A release gate is a formal threshold that must be passed before deployment.

Example:

  1. hallucination rate < 2%
  2. tool success > 95%
  3. latency < 2 seconds
  4. cost increase < 10%

Deployment is blocked if thresholds fail.

Release gates must be machine-enforced.

Not manual opinion.
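
A minimal sketch of a machine-enforced gate as a CI step; the metric names mirror the example thresholds above, values are assumed to come from the latest evaluation run, and a non-zero exit blocks deployment.

// Threshold values mirror the example gates above; metric values are assumed to
// come from the latest evaluation run (for example a Postgres or BigQuery query).
interface RunMetrics {
  hallucinationRate: number;   // fraction of graded outputs, e.g. 0.015 = 1.5%
  toolSuccessRate: number;
  latencySeconds: number;
  costIncreaseRate: number;
}

function releaseGate(m: RunMetrics): string[] {
  const failures: string[] = [];
  if (!(m.hallucinationRate < 0.02)) failures.push("hallucination rate >= 2%");
  if (!(m.toolSuccessRate > 0.95)) failures.push("tool success <= 95%");
  if (!(m.latencySeconds < 2)) failures.push("latency >= 2 seconds");
  if (!(m.costIncreaseRate < 0.10)) failures.push("cost increase >= 10%");
  return failures;
}

const failures = releaseGate({
  hallucinationRate: 0.015,
  toolSuccessRate: 0.97,
  latencySeconds: 1.4,
  costIncreaseRate: 0.05,
});

if (failures.length > 0) {
  console.error("Release gate failed:", failures.join("; "));
  process.exit(1);   // blocks the CI/CD pipeline
}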


Google Stack Example

Typical Enterprise Evaluation Stack

Storage

  • BigQuery
  • Cloud SQL
  • Firestore

Execution

  • Cloud Run
  • Vertex AI
  • Pub/Sub

Monitoring

  • Cloud Logging
  • Cloud Monitoring
  • Looker

Security

  • IAM
  • VPC Service Controls
  • Audit Logs

Example Flow

  1. evaluation cases stored in BigQuery
  2. Cloud Run runner executes evals
  3. Vertex AI runs model calls
  4. scores written back to BigQuery
  5. Looker dashboard tracks regressions
  6. deployment blocked via CI/CD policy
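
A minimal sketch of step 4, writing a score row back to BigQuery from the runner; dataset, table, and column names are placeholders.

import { BigQuery } from "@google-cloud/bigquery";

// Uses Application Default Credentials; dataset, table, and column names are placeholders.
const bq = new BigQuery();

async function writeScore(caseId: string, score: number, runId: string): Promise<void> {
  await bq.dataset("evals").table("scores").insert([
    { case_id: caseId, score, run_id: runId, created_at: new Date().toISOString() },
  ]);
}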

Local Stack Example

Fully Controlled Environment

Storage

  • Postgres
  • SQLite
  • local files

Execution

  • Node.js
  • TypeScript
  • Docker
  • local GPU inference

Models

  • Ollama
  • vLLM
  • Hugging Face Transformers

Monitoring

  • Grafana
  • Prometheus
  • OpenTelemetry

This is common in regulated environments and secure internal deployments.
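
A minimal sketch of calling a local model from the runner through Ollama's HTTP API; the model name is a placeholder and an Ollama server is assumed on its default port.

// Assumes an Ollama server on localhost:11434 and a pulled model; both are placeholders.
async function localCompletion(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3.1", prompt, stream: false }),
  });
  const data = (await res.json()) as { response: string };
  return data.response;
}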