RAG Evaluation & Observability
Phase 2. The differentiator between hireable AI engineers and tutorial-watchers.
Why eval is everything
"If you can't measure it, you can't improve it. If you can't improve it, the client will."
Without eval, you ship blind. With eval:
- You can claim numbers in interviews
- You can A/B test prompts/chunking/models
- You can defend your architecture decisions
- You can detect drift in production
The staff-engineer mental model: RAG eval is a two-stage funnel
A RAG answer can be wrong for exactly two reasons, and they require different fixes. A senior engineer never debugs "the answer is bad" as a single problem; they decompose it:
```
          QUESTION
              │
      ┌───────▼────────┐   RETRIEVAL stage ── measured by context_precision / context_recall
      │   Retriever    │   "Did we put the right chunks in the prompt?"
      └───────┬────────┘   Fix here = chunking, embeddings, hybrid search, reranking, top-k
              │ (context)
      ┌───────▼────────┐   GENERATION stage ── measured by faithfulness / answer_relevancy
      │   Generator    │   "Given those chunks, did the model answer correctly?"
      └───────┬────────┘   Fix here = prompt, model, output schema, citation enforcement
              │
           ANSWER
```

The diagnostic rule: high context_recall + low faithfulness → the retriever did its job, the generator hallucinated (fix the prompt/model). Low context_recall → the answer was doomed before generation; tuning the prompt is wasted effort (fix the retriever). This single decomposition is what separates "I tweaked things until the demo looked good" from "I instrumented each stage and attacked the bottleneck." Bring it to every interview.
A corollary: never report a single aggregate "RAG score." It hides which stage is failing. Always report the two stages separately, plus end-to-end answer correctness.
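The diagnostic rule can be sketched as a small triage function. This is a minimal illustration, not part of Ragas or TruLens: `StageScores` and `diagnose` are hypothetical names, and the 0.8 default is only the starting heuristic discussed in the next section.

```python
from dataclasses import dataclass


@dataclass
class StageScores:
    context_recall: float
    context_precision: float
    faithfulness: float
    answer_relevancy: float


def diagnose(s: StageScores, threshold: float = 0.8) -> str:
    """Apply the two-stage rule: rule out retrieval first, then look at generation."""
    if s.context_recall < threshold:
        return "retrieval: missing evidence (chunking, embeddings, top-k, hybrid search)"
    if s.context_precision < threshold:
        return "retrieval: noisy context (reranker, lower top-k, query rewriting)"
    if s.faithfulness < threshold:
        return "generation: ungrounded answer (prompt, model, citation enforcement)"
    if s.answer_relevancy < threshold:
        return "generation: off-topic answer (prompt, query understanding)"
    return "both stages pass: check end-to-end answer correctness"


# High recall + low faithfulness points at the generator, not the retriever.
print(diagnose(StageScores(0.92, 0.88, 0.55, 0.90)))
```

Checking retrieval before generation mirrors the funnel: a generation fix cannot help if the evidence never reached the prompt.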
The eval metrics (Ragas / TruLens)
| Metric | Stage | What it measures | Needs ground truth? | Range | Fix when low |
|---|---|---|---|---|---|
| Context Recall | Retrieval | Did we retrieve all relevant info | ✅ (expected sources/answer) | 0-1 | chunking, embeddings, top-k, hybrid search |
| Context Precision | Retrieval | Are retrieved chunks relevant (and ranked high) | ✅ | 0-1 | reranker, lower top-k, query rewriting |
| Faithfulness | Generation | Is the answer grounded in the retrieved context | ❌ (reference-free) | 0-1 | prompt ("cite only from context"), model, lower temp |
| Answer Relevancy | Generation | Does the answer address the question | ❌ (reference-free) | 0-1 | prompt, query understanding |
| Answer Correctness | End-to-end | Semantic match vs. the gold answer | ✅ | 0-1 | depends — diagnose via the two stages above |
Why the "needs ground truth?" column is load-bearing. The reference-free metrics (faithfulness, answer relevancy) can be computed on live production traffic — you have no gold answer at 3am, but you can still ask an LLM judge "is this answer grounded in this context?". The ground-truth metrics (recall, precision, correctness) require your golden set and run offline in CI / on a sampled basis. A mature system runs both: golden-set metrics gate deploys, reference-free metrics monitor prod.
Target for production : all > 0.8 — but treat that number as a starting heuristic, not gospel. The right bar is domain- and cost-of-error-dependent: a legal/medical RAG may demand faithfulness > 0.95 (a hallucinated citation is a liability), while an internal docs search tolerates 0.75. Set the threshold from the business consequence of a wrong answer, then defend it with the golden-set distribution, not a round number you read in a blog post.
⚠️ Faithfulness is computed by an LLM judge and is itself noisy. A faithfulness of 0.82 on a 50-question set has a confidence interval roughly ±0.10 (binomial-ish). Reporting "we went from 0.82 → 0.85" as a win without a significance check is a junior move — see the A/B section.
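A rough sketch of where the ±0.10 comes from, assuming each question is scored pass/fail and using the normal approximation to the binomial. The helper name `binomial_ci` is illustrative, not from any library.

```python
import math


def binomial_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a pass rate (normal approximation)."""
    if n <= 0:
        raise ValueError("n must be positive")
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


low, high = binomial_ci(0.82, 50)
print(f"0.82 on n=50 -> [{low:.2f}, {high:.2f}]")  # prints: 0.82 on n=50 -> [0.71, 0.93]
```

The interval [0.71, 0.93] comfortably contains 0.85, which is why a 3-point move on 50 questions proves nothing by itself. At n=250 the same formula gives a half-width of about 0.05.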
Ground truth dataset — how to build it
- Pick 50 representative questions users will actually ask
- Manually write expected answers + source chunks for each
- Store as JSONL:

```jsonl
{"question": "...", "expected_answer": "...", "expected_sources": ["doc_id_1", "doc_id_8"]}
```

- Keep this in an `evals/golden.jsonl` file in the repo
- Grow it over time: every prod bug becomes a new test case
How a staff engineer thinks about the golden set
The golden set is the spec. If it's biased, every metric you report is a lie you tell yourself. Concerns that separate a real eval harness from a toy :
- Stratify by question type, not just count. 50 questions all of the form "what is X?" will pass while your system silently fails on multi-hop ("compare X and Y"), negation ("which docs do not mention X"), and out-of-scope ("what's the weather"). Bucket your set: factoid / multi-hop / aggregation / unanswerable. The unanswerable bucket is the one juniors forget and the one that catches hallucination.
- Include adversarial & "I don't know" cases. A RAG system that never says "the documents don't contain that" is over-confident. ~15–20% of your golden set should have `expected_answer: "INSUFFICIENT_CONTEXT"` so you can measure the refusal path, not just the happy path.
- Freeze it, version it, review it. `evals/golden.jsonl` is checked into git and changes go through PR review like code. A silent edit to the golden set that makes the numbers go up is fraud, not progress.
- Beware leakage. If you build the golden answers by reading the model's output, you're grading the model against itself. Write expected answers from the source documents, independently of what the system currently produces.
- Bootstrapping with an LLM is fine — verification is not optional. Generating candidate Q/A pairs with Claude to seed the set is a legitimate time-saver, but a human must verify every entry that gates a deploy. LLM-generated golden answers that no human checked are not ground truth.
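Several of the checks above can be automated in CI. The sketch below audits a golden set for required keys, bucket mix, and the share of unanswerable cases. It assumes each row carries an extra `bucket` field (a hypothetical addition to the JSONL schema shown earlier), and `audit_golden_set` is an illustrative name.

```python
import json
from collections import Counter

REQUIRED_KEYS = {"question", "expected_answer", "expected_sources"}


def audit_golden_set(jsonl_text: str, min_unanswerable: float = 0.15) -> dict:
    """Validate required keys and report the bucket mix of a golden set."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    for i, row in enumerate(rows, start=1):
        missing = REQUIRED_KEYS - set(row)
        if missing:
            raise ValueError(f"row {i} is missing {sorted(missing)}")
    # "bucket" is an assumed field: factoid / multi-hop / aggregation / unanswerable
    buckets = Counter(row.get("bucket", "unlabeled") for row in rows)
    unanswerable = sum(r["expected_answer"] == "INSUFFICIENT_CONTEXT" for r in rows)
    share = unanswerable / len(rows) if rows else 0.0
    return {
        "n": len(rows),
        "buckets": dict(buckets),
        "unanswerable_share": share,
        "enough_unanswerable": share >= min_unanswerable,
    }
```

Running this in the same PR check that reviews changes to `evals/golden.jsonl` makes a skewed set visible before it gates a deploy.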
The cold-start answer (interview gold). "How do you eval with zero golden data?" → Start reference-free (faithfulness + answer relevancy need no labels), ship behind a feedback widget, mine thumbs-down + low-faithfulness traces into your first 50 golden cases, then graduate to ground-truth metrics. You are never blocked on labels to start measuring.
Eval frameworks
Ragas (RAG-specific)
- Docs
- Python-first
- Uses LLM-as-judge for many metrics
- Good for quick experimentation
- Use this by default for project 1
TruLens
- trulens.org
- More general
- Better dashboards
- Supports custom feedback functions
LangSmith
- Built into LangChain ecosystem
- Visual interface
- Datasets + experiments
- Free tier sufficient for portfolio
Phoenix (Arize)
- phoenix.arize.com
- Open source
- Strong tracing
- Self-hostable
Custom eval
- For specific use cases: write your own with `pytest`
- Example: "Does the answer cite at least one source?" → simple regex check
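A minimal sketch of such a check, written so that pytest would collect the `test_*` functions. The citation format (`[doc_id]` in square brackets) is an assumption about how the generator is prompted to cite; adapt the pattern to your own convention.

```python
import re

# Assumed citation format: the generator cites sources as [doc_id], e.g. [doc_id_8].
CITATION = re.compile(r"\[([A-Za-z0-9_\-]+)\]")


def cited_sources(answer: str) -> list[str]:
    """Return every doc id cited in the answer, in order of appearance."""
    return CITATION.findall(answer)


def test_answer_cites_at_least_one_source():
    answer = "Refunds are processed within 14 days [doc_id_8]."
    assert cited_sources(answer) == ["doc_id_8"]


def test_uncited_answer_is_flagged():
    assert cited_sources("Refunds are processed within 14 days.") == []
```

A deterministic check like this costs nothing to run on every trace, so it complements the LLM judge rather than replacing it.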
How a staff engineer chooses (the part the bullets above hide)
The frameworks are not interchangeable; they sit at different layers and a mature system uses two of them, not one.
| Layer | Question it answers | Tool that fits | Runs |
|---|---|---|---|
| Metric computation | "What's the faithfulness on this set?" | Ragas (RAG-specific metrics out of the box), or your own LLM-judge | CI gate (offline) + sampled prod |
| Tracing / observability | "Why was this answer wrong, and where's the latency?" | LangFuse / Phoenix / LangSmith | Always-on, every request |
| Experiment tracking | "Is variant B better than A, with significance?" | LangSmith datasets, or a notebook + scipy | Per experiment |
The senior mental model: buy the tracing, own the metric. Tracing (spans, token counts, latency breakdown, replay-by-request_id) is undifferentiated plumbing — Phoenix or LangFuse do it better than you will, self-host Phoenix if data residency matters. The metric — what "faithful" means for your domain, the rubric, the calibration against your humans — is your competitive moat and the thing an interviewer probes. Don't outsource the rubric to a library default and then quote the number as if it were ground truth.
Avoid framework lock-in at the trace boundary. Log a typed RagTrace (the schema in the observability section below) to your own store first, then mirror to whatever vendor you're trialing. The day you swap LangSmith for Phoenix you don't want to lose three months of golden-set history. Treat the eval vendor as a renderer over your data, not the system of record.
The honest answer to "which one?" For a portfolio project: Ragas for the metrics, Phoenix (self-hosted, free) for tracing, `scipy` for the A/B test. That trio costs nothing, runs locally, and demonstrates every layer. Reach for LangSmith's hosted datasets only when a team needs shared, versioned experiment history, and say that in the interview, not "I used LangChain because it was there."
Observability — what to log per request
The trace is the unit of debuggability. If a user complains "the answer was wrong," a senior engineer pulls the trace by request_id and can tell within 30 seconds which stage failed — without re-running anything. Model it as a typed schema, not a free-form dict, so the shape never drifts :
```python
from pydantic import BaseModel
from typing import Literal


class RetrievedChunk(BaseModel):
    doc_id: str
    score: float
    rank: int
    text: str | None = None  # store a snippet, not the whole chunk, to bound log size


class RagTrace(BaseModel):
    request_id: str  # propagate this header end-to-end (NestJS → Python → logs)
    user_id: str | None
    tenant_id: str | None  # multi-tenant cost & isolation
    query: str
    rewritten_query: str | None = None

    # Retrieval stage: log EVERY stage so you can attribute a bad answer to the right one
    vector_results: list[RetrievedChunk]
    bm25_results: list[RetrievedChunk]
    fused_results: list[RetrievedChunk]
    reranked_results: list[RetrievedChunk]
    context_tokens: int

    # Generation stage
    model: str = "claude-opus-4-8"
    input_tokens: int
    output_tokens: int
    cache_read_input_tokens: int = 0  # from resp.usage; proves caching works
    cache_creation_input_tokens: int = 0
    response: str
    stop_reason: str | None = None  # "end_turn" | "max_tokens" | "refusal" | ...

    # Derived / operational
    cost_usd: float
    latency_ms: int
    retrieval_latency_ms: int  # split the budget: where did the 1840ms go?
    generation_latency_ms: int
    experiment_id: str | None = None  # A/B bucket
    user_feedback: Literal["up", "down"] | None = None  # filled in later
    error: str | None = None
```

→ Log this to LangSmith / LangFuse / Phoenix / your own DB (one row per request; the chunk lists go into a JSON column or a child table).
Three things juniors omit and seniors always log
- `request_id` propagated across services. Your stack is NestJS (API) → Python (RAG) → Claude. Generate the id at the edge, pass it as a header, and stamp it on every log line and every span. Without it, correlating a frontend error with a Python stack trace is archaeology.
- `cache_read_input_tokens` from `resp.usage`. This is how you prove prompt caching is actually hitting. If it's zero across requests that share a system+context prefix, a silent invalidator (a timestamp in the prompt, a per-request UUID, unsorted JSON) is costing you ~10× on input tokens.
- Split latency by stage. A p95 of 1840ms is a meaningless aggregate. `retrieval_latency_ms` vs `generation_latency_ms` tells you whether to optimize the vector DB / reranker or the LLM call. The reranker (a cross-encoder) is a frequent hidden tax, so measure it.
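One way to capture the per-stage split is a small timing context manager. This is a sketch, not a tracing library: `stage_timer` is a hypothetical helper, and the `time.sleep` calls stand in for the real retrieval and generation work.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timings: dict[str, int], stage: str):
    """Record the wall-clock duration of one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # the finally block records the duration even when the stage raises
        timings[stage] = int((time.perf_counter() - start) * 1000)


timings: dict[str, int] = {}
with stage_timer(timings, "retrieval_latency_ms"):
    time.sleep(0.02)  # stand-in for vector search + BM25 + rerank
with stage_timer(timings, "generation_latency_ms"):
    time.sleep(0.01)  # stand-in for the LLM call
timings["latency_ms"] = timings["retrieval_latency_ms"] + timings["generation_latency_ms"]
print(timings)
```

The resulting keys match the latency fields of the trace schema, so the dict can be merged straight into the trace record.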
Computing cost_usd correctly (defend the number)
Cost is per-token, and cached reads are ~10× cheaper than fresh input — if you bill at the flat input rate you'll overstate cost and make wrong model-choice decisions. For the 2026 flagship claude-opus-4-8 at 1M context : $5 / 1M input, $25 / 1M output, with cache reads at ~0.1× input. The mid tier claude-sonnet-4-6 is $3 / $15; the cheap claude-haiku-4-5 is $1 / $5.
```python
# Prices in USD per 1M tokens (input, output); keep this in config, not hardcoded
PRICES = {
    "claude-opus-4-8": (5.0, 25.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-haiku-4-5": (1.0, 5.0),
}
CACHE_READ_MULTIPLIER = 0.1  # cached input tokens bill at ~10% of base input price
# Assumption: cache writes bill at a premium over base input (1.25x for the short
# default TTL); confirm against the current price sheet before relying on it.
CACHE_WRITE_MULTIPLIER = 1.25


def cost_usd(model: str, usage) -> float:
    in_price, out_price = PRICES[model]
    fresh_in = usage.input_tokens  # uncached remainder only
    # the cache fields may be missing or None, so coerce them to 0
    cached_in = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    out = usage.output_tokens
    return (
        fresh_in * in_price / 1e6
        + cached_in * in_price * CACHE_READ_MULTIPLIER / 1e6
        + cache_write * in_price * CACHE_WRITE_MULTIPLIER / 1e6
        + out * out_price / 1e6
    )
```

Gotcha: `usage.input_tokens` is the uncached remainder, not the total prompt size. Total prompt = `input_tokens + cache_creation_input_tokens + cache_read_input_tokens`. If you compute cost from a separately-counted "prompt length," you'll double-count the cached portion.
Cost as a first-class SLO, not an afterthought
A staff engineer treats cost the same way they treat p95 latency: a budget with an alarm, not a number you read off the invoice at the end of the month. Three reflexes:
- Unit-economics, not totals. "We spent $4,200 last month" is unactionable. `cost / answered_query` and `cost / resolved_ticket` are the numbers you defend to a PM and that catch a regression. A chunking change that doubles context tokens shows up here on day one, not on the invoice 30 days later.
- Attribute every dollar to a tenant. In a multi-tenant RAG, `cost_usd` rolls up by `tenant_id` so you can spot the one customer whose 50-page PDFs are eating the margin, and so you can bill them, rate-limit them, or move them to `claude-haiku-4-5`. The trace schema above carries `tenant_id` precisely for this.
- Know your model mix. Serving generation on `claude-haiku-4-5` ($1/$5) and judging on `claude-opus-4-8` ($5/$25) is a deliberate asymmetry: the cheap model answers 10k prod queries, the expensive model grades a sample. If your eval cost is a meaningful fraction of your serving cost, you're judging too much: sample harder, or judge reference-free metrics only on the low-confidence tail.
The model-downgrade decision, defended with numbers. "Can we move generation from Opus to Haiku?" is not a vibe call. Run both over the golden set, compare `answer_correctness` and `faithfulness` with a paired test (see A/B section), and put the cost delta next to the quality delta. If Haiku is 5× cheaper and loses 1 point of faithfulness inside the noise band, you ship Haiku and pocket the margin. If it drops faithfulness 8 points on the multi-hop bucket, you keep Opus there and route only the factoid bucket to Haiku. Per-bucket model routing is a senior move; "we use Opus everywhere because it's safest" is a junior one that quietly burns money.
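The unit-economics reflex can be sketched as a rollup over stored traces. The assumptions here are illustrative: traces are plain dicts carrying the `tenant_id`, `cost_usd` and `error` fields of the trace schema, and a query counts as answered when `error` is empty.

```python
from collections import defaultdict


def cost_per_answered_query(traces: list[dict]) -> dict[str, float]:
    """Roll cost_usd up by tenant_id and divide by that tenant's answered queries."""
    spend: dict[str, float] = defaultdict(float)
    answered: dict[str, int] = defaultdict(int)
    for t in traces:
        tenant = t.get("tenant_id") or "unknown"
        spend[tenant] += t["cost_usd"]  # failed requests still cost money
        if t.get("error") is None:
            answered[tenant] += 1
    return {
        tenant: (spend[tenant] / answered[tenant] if answered[tenant] else float("inf"))
        for tenant in spend
    }


sample = [
    {"tenant_id": "acme", "cost_usd": 0.02, "error": None},
    {"tenant_id": "acme", "cost_usd": 0.04, "error": None},
    {"tenant_id": "acme", "cost_usd": 0.03, "error": "timeout"},
    {"tenant_id": "globex", "cost_usd": 0.01, "error": None},
]
print(cost_per_answered_query(sample))
```

Charging failed requests to the numerator but not the denominator is a deliberate choice: retries and timeouts are part of what an answered query really costs.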
Production monitoring dashboards to build
- Latency: p50/p95/p99 per endpoint, split by stage (retrieval vs generation). An aggregate p95 hides whether the reranker or the LLM is the tax.
- Cost: daily $$ + per-user / per-tenant + cost/answered-query (the unit metric). Overlay `cache_read_input_tokens` as a %; a sudden drop means a silent cache invalidator went out in a deploy.
- Eval scores: daily reference-free judge (faithfulness, relevancy) on sampled prod queries; golden-set metrics on every deploy.
- Error rate: LLM timeouts, `RateLimitError` / `OverloadedError`, retries exhausted, fallbacks triggered, `stop_reason: "refusal"` rate.
- User feedback: thumbs up/down rate, and crucially the down-rate on high-faithfulness answers (the model was confident and grounded but still wrong → your retrieval or your golden set is lying to you).
- Drift signals: query distribution shift (input drift), retrieval top-1 similarity drop, faithfulness trend (output drift).
The rule that separates a dashboard from a wall of charts
Every panel must map to an action. A chart nobody acts on is a chart nobody looks at. For each metric, write the runbook line before you build the panel:
| Panel | Threshold | Action when it fires |
|---|---|---|
| p95 generation latency | > 3s | Check max_tokens, drop to effort: "low", or stream |
| cost/answered-query | > 1.3× 7-day median | Audit cache hit-rate; check for a context-size regression |
| cache_read % | < 50% on cacheable traffic | A silent invalidator shipped; diff the rendered prefix |
| faithfulness (sampled prod) | rolling mean −2σ | Page on-call; pull low-faithfulness traces; check index freshness |
| refusal rate | spikes | Adjacent-benign false positives, or a real attack — inspect stop_details |
If you can't write the action, you don't need the panel — you need a different metric. Alert on symptoms users feel (down-rate, latency, refusals) and page on leading indicators (faithfulness drift, cache collapse) before users feel them. Don't alert on raw token counts; alert on the cost/query they roll up into.
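Two of the thresholds in the table can be sketched with the standard library alone. The function names are hypothetical, and the inputs are assumed to be daily aggregates you already compute (one cost/answered-query value and one sampled faithfulness mean per day).

```python
from statistics import mean, median, stdev


def cost_regressed(today: float, last_7_days: list[float], ratio: float = 1.3) -> bool:
    """Fire when today's cost/answered-query exceeds `ratio` x the 7-day median."""
    return today > ratio * median(last_7_days)


def faithfulness_drifted(today: float, history: list[float], sigmas: float = 2.0) -> bool:
    """Fire when today's sampled faithfulness is more than `sigmas` std devs below the mean."""
    return today < mean(history) - sigmas * stdev(history)


cost_history = [0.010, 0.011, 0.012, 0.011, 0.010, 0.012, 0.011]
faith_history = [0.84, 0.86, 0.85, 0.83, 0.87, 0.85, 0.84]
print(cost_regressed(0.016, cost_history), faithfulness_drifted(0.78, faith_history))
```

Using the median for cost keeps one expensive day from moving the baseline, while the σ-based rule for faithfulness adapts to how noisy the judge is on your traffic.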
LLM-as-judge — when to trust it
Use for: faithfulness, relevancy, factuality, tone; in short, anything where "is this good?" is a judgment a competent human could make from the text alone. Don't use for: code correctness (run the tests), numerical/financial accuracy (compute it), or specialized domains where the judge lacks expertise (the judge will confidently agree with a plausible-but-wrong answer).
Tip : use a stronger model as judge than the one generating. The flagship claude-opus-4-8 judges output from the cheaper claude-sonnet-4-6 / claude-haiku-4-5 you serve in prod — not the other way round. A judge weaker than the generator can't reliably catch the generator's mistakes.
A real, runnable judge (Anthropic SDK, structured output)
A senior judge implementation has four properties juniors miss: it's async (you grade hundreds of traces), it returns a structured verdict with a reason (a bare 0/1 is undebuggable), it uses AsyncAnthropic + structured outputs instead of hand-rolled JSON prompting, and it logs usage so eval cost is itself observable.
```python
import asyncio
from anthropic import AsyncAnthropic
from pydantic import BaseModel, Field

client = AsyncAnthropic(max_retries=4)  # SDK retries 429/5xx with backoff


class Faithfulness(BaseModel):
    grounded: bool = Field(description="True iff every claim in the answer is supported by the context")
    unsupported_claims: list[str] = Field(description="Claims in the answer NOT found in the context")
    reason: str


JUDGE_SYSTEM = (
    "You are a strict RAG faithfulness grader. An answer is faithful ONLY if every "
    "factual claim it makes is directly supported by the provided context. "
    "Do not use outside knowledge. If the answer adds anything not in the context, it is NOT grounded."
)


async def judge_faithfulness(question: str, context: str, answer: str) -> Faithfulness:
    resp = await client.messages.parse(  # native structured output, no XML/JSON prompting
        model="claude-opus-4-8",  # judge is STRONGER than the generator
        max_tokens=1024,
        thinking={"type": "adaptive"},  # let the judge reason before verdict
        system=[{
            "type": "text",
            "text": JUDGE_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # cache the stable judge prefix across calls
        }],
        messages=[{
            "role": "user",
            "content": (
                f"<question>{question}</question>\n"
                f"<context>{context}</context>\n"
                f"<answer>{answer}</answer>"
            ),
        }],
        output_config={"format": Faithfulness},
    )
    # resp.usage carries input/output/cache tokens; log it so eval cost is observable too
    return resp.parsed_output


async def grade_batch(traces: list[dict]) -> list[Faithfulness]:
    # asyncio.gather parallelizes the judge calls; grading 200 traces serially is the junior mistake
    return await asyncio.gather(*[
        judge_faithfulness(t["query"], t["context"], t["response"]) for t in traces
    ])
```

Why adaptive thinking, not a `budget_tokens` value? On `claude-opus-4-8` the old `thinking={"type": "enabled", "budget_tokens": N}` form is removed and returns HTTP 400 (so are `temperature` / `top_p` / `top_k`). Use `thinking={"type": "adaptive"}` and, if you need to dial depth, `output_config={"effort": "low"|"medium"|"high"|"max"}`. For a cheap, high-throughput judge, `effort: "low"` is often enough: it consolidates reasoning and cuts tokens, which is exactly what you want when grading hundreds of traces.
⚠️ The caching gotcha that bites everyone (and breaks Exercise 2 if you don't know it). Prompt caching is a prefix match with a minimum cacheable prefix, and on `claude-opus-4-8` that minimum is ~4096 tokens. The `JUDGE_SYSTEM` prompt above is ~60 tokens, so putting `cache_control` on it caches nothing (`cache_read_input_tokens` stays 0, no error, no warning). To actually get cache hits on a judge, the stable cached prefix has to clear the minimum: pack the rubric, few-shot calibration examples, and detailed grading instructions into the cached system block until it's > 4096 tokens, then keep the volatile per-trace `<question>` / `<context>` / `<answer>` after the breakpoint. A skinny system prompt with a `cache_control` marker is the #1 reason "caching doesn't work." (On `claude-sonnet-4-6` the minimum is ~2048 tokens, so a prefix that caches on Sonnet silently won't on Opus.)
Structured output, the senior way. `messages.parse()` with a Pydantic schema is the canonical pattern: the SDK injects the JSON-schema constraint (`output_config.format`), validates the response, and hands you a typed `Faithfulness` object via `resp.parsed_output`. Do not hand-roll `"respond in JSON like {...}"` prompting and then `json.loads()` the text: it's brittle (the model wanders off-schema on edge cases), it costs you a retry loop, and it can't enforce the enum/bool/`list[str]` shape the way the constrained decoder does. Native structured output also forbids assistant-prefill on 4.8, which is another reason the old `{"role": "assistant", "content": "{"}` JSON-forcing trick is dead.
Failure modes of LLM judges (and the fixes)
| Failure mode | Symptom | Fix |
|---|---|---|
| Self-preference bias | Judge rates its own family's output higher | Use a different/stronger model as judge; calibrate against humans |
| Position bias | In A/B pairwise judging, prefers whichever answer is shown first | Randomize order; run both orders and average |
| Verbosity bias | Longer answers score higher regardless of correctness | Explicit rubric; penalize unsupported claims, not length |
| Leniency / sycophancy | Everything scores 0.9+; no discrimination | Force a structured rubric + require listing unsupported_claims |
| Non-determinism | Same input, different score across runs | effort: "low", structured output, and calibrate: measure judge-vs-human agreement (Cohen's κ) on a labeled subset before you trust the judge at scale |
The meta-rule: the LLM judge is itself a model that needs eval. Before you trust it on 10k prod traces, measure its agreement with human labels on ~50 cases. If κ < 0.6, fix the rubric before scaling — otherwise you're monitoring with a broken thermometer.
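For a binary faithful / not-faithful label, Cohen's κ is short enough to compute by hand. The sketch below is a plain-Python version for two raters; `cohens_kappa` is an illustrative name, and the labels in the example are made up to show a lenient judge.

```python
def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Cohen's kappa for two binary raters labelling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h, p_j = sum(human) / n, sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)  # agreement expected by chance
    if expected == 1.0:
        return float("nan")  # both raters used one identical label: kappa is undefined
    return (observed - expected) / (1 - expected)


# Illustrative labels: the judge calls 2 of the 4 unfaithful answers "grounded".
human = [True] * 6 + [False] * 4
judge = [True] * 6 + [True, True, False, False]
print(f"kappa = {cohens_kappa(human, judge):.2f}")  # prints: kappa = 0.55
```

Raw agreement here is 80%, yet κ lands at 0.55, below the 0.6 bar. That gap is why κ, not percent agreement, is the number to report.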
Making the judge production-grade (the resilience juniors skip)
The pretty judge_faithfulness above is the happy path. Grading 200 traces means 200 network calls, and at scale some will fail. A senior wraps the batch grader so one flaky call doesn't poison the whole run:
- Typed exceptions, not string matching. Catch `anthropic.RateLimitError`, `anthropic.OverloadedError`, `anthropic.APITimeoutError`, `anthropic.APIStatusError`, never `if "429" in str(e)`. The SDK's `max_retries=4` already retries 429/5xx with backoff; your job is to decide what happens when retries are exhausted.
- Bound concurrency, don't gather 200 at once. `asyncio.gather` over 200 coroutines fires 200 simultaneous requests and you'll trip your own rate limit. Gate with a `Semaphore` (e.g. 10–20 in flight) so the judge saturates throughput without self-DoSing.
- `return_exceptions=True` + a sentinel. A single failed grade should not crash a 200-trace run. Collect failures, log them, and exclude them from the aggregate, and report the exclusion count, because "faithfulness 0.84 over 187/200 graded" is honest; silently averaging 187 and calling it 200 is not.
- Per-call timeout. A wedged call shouldn't hang the batch. Set a timeout on the client (or `asyncio.wait_for`) so a stuck grade fails fast and falls into the exception bucket.
- Log `resp.usage` per call. Eval cost is real cost. If you grade every prod trace on `claude-opus-4-8`, your monitoring bill can rival your serving bill, which is the whole reason you sample and grade reference-free on the tail, not everything.
```python
import asyncio
import logging

import anthropic

log = logging.getLogger(__name__)
SEM = asyncio.Semaphore(15)  # cap in-flight judge calls


async def judge_one(trace: dict) -> Faithfulness | None:
    async with SEM:
        try:
            return await asyncio.wait_for(
                judge_faithfulness(trace["query"], trace["context"], trace["response"]),
                timeout=30,
            )
        except (anthropic.RateLimitError, anthropic.OverloadedError,
                anthropic.APITimeoutError, asyncio.TimeoutError) as e:
            # retries already exhausted by the SDK: log and drop this one grade
            log.warning("judge failed for %s: %s", trace["request_id"], e)
            return None  # sentinel; excluded from the aggregate, counted as a failure


async def grade_batch_resilient(traces: list[dict]) -> tuple[list[Faithfulness], int]:
    results = await asyncio.gather(*(judge_one(t) for t in traces))
    graded = [r for r in results if r is not None]
    failed = len(results) - len(graded)
    return graded, failed  # report BOTH; never hide the denominator
```

A/B testing in production
Once shipped:
- Tag each request with `experiment_id` (e.g., "chunk_strategy_v2")
- Route X% of traffic to variant
- Collect eval scores + user feedback
- Statistical significance check before rollout
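Routing a fixed share of traffic is usually done with a deterministic hash rather than a random draw, so that a given user always lands in the same arm. A minimal sketch, assuming a string `user_id` is available at the edge; `assign_variant` and the 10% default are illustrative.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.1) -> str:
    """Deterministically bucket a user: same user + experiment always gets the same arm."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    # map the hash to one of 10,000 evenly spaced buckets in [0, 1)
    bucket = (int.from_bytes(digest[:8], "big") % 10_000) / 10_000
    return "variant" if bucket < treatment_share else "control"


arm = assign_variant("user-42", "chunk_strategy_v2")
print(arm)  # stable across calls and processes for this user/experiment pair
```

Salting the hash with `experiment_id` re-shuffles users between experiments, so the same 10% of users are not the test subjects every time.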
Defending the number (the part interviews probe)
"Faithfulness went from 0.82 to 0.85, ship it" is how juniors lose money on noise. Both variants were scored on the same finite dataset; the question is whether the gap between them is signal or sampling noise.
- Offline (golden set, paired): the same N questions run through both variants → use a paired test (paired t-test, or McNemar for pass/fail). Paired is far more powerful than unpaired because it cancels per-question difficulty. On a 50-question set, a 3-point faithfulness move is usually inside the noise band — you typically need a few hundred questions to detect small effects.
- Online (live traffic): randomize per-user (not per-request, or the same user sees both and contaminates the comparison), pick the metric before you look (thumbs-up rate, or low-faithfulness rate), and compute a confidence interval. Don't peek-and-stop — sequential peeking inflates false positives; fix the sample size or use a sequential test.
- Watch the guardrails, not just the win metric. A chunking change that lifts faithfulness but adds 400ms p95 and 30% cost is a loss dressed as a win. Every experiment reports the win metric and latency p95 and cost/query.
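```python
from scipy import stats


def paired_ab(a: list[float], b: list[float]) -> tuple[float, float]:
    """Paired t-test on per-question faithfulness from the SAME golden set.

    a = control scores, b = variant scores, aligned question by question.
    """
    if len(a) != len(b):
        raise ValueError("paired test needs one score per question for each variant")
    t, p = stats.ttest_rel(b, a)  # paired t-test
    delta = (sum(b) - sum(a)) / len(a)
    return delta, p


# delta, p = paired_ab(control_scores, variant_scores)
# print(f"Δ={delta:+.3f} p={p:.3f} → ship only if p < 0.05 AND guardrails OK")
```

🏋️ Exercises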
Demanding and progressive. Each builds on the previous. "Change a constant" is not here.
1. Build the two-stage harness. Objective: prove you can attribute a failure to retrieval vs generation
Take a 30-question golden set (factoid + multi-hop + 5 unanswerable) and compute context_recall, context_precision, faithfulness, answer_relevancy separately, emitting a per-question table. Then deliberately break retrieval (set top_k=1) and break generation (raise temperature / weaken the prompt), and show that your metrics correctly localize which you broke. Hint/Solution: recall drops when you starve retrieval; faithfulness drops when you loosen generation. If both move together, your metrics aren't isolating the stages, and your expected_sources labels are probably wrong.
2. Production-grade async judge with caching. Objective: grade 200 traces fast and cheap, and prove the cache hits
Wrap the judge_faithfulness function above in a batch grader using AsyncAnthropic + asyncio.gather, with cache_control on the judge system prefix. Log resp.usage for every call and assert cache_read_input_tokens > 0 after the first call. Report total eval cost and p95 judge latency. Hint/Solution: the stable judge system prompt must come first and carry the cache_control breakpoint; the volatile per-trace question/context/answer goes after it. The cached prefix must also clear the model's minimum cacheable length (see the caching gotcha above), so pad it with the rubric and few-shot examples. If cache_read_input_tokens still stays 0, you put something volatile (a timestamp, the trace id) ahead of the breakpoint.
3. Calibrate the judge against humans. Objective: don't trust a thermometer you never checked
Hand-label 40 (answer, context) pairs as faithful / not. Run your LLM judge on the same 40. Compute Cohen's κ and a confusion matrix. If κ < 0.6, iterate on the rubric (force unsupported_claims, strengthen the system prompt) until it clears. Hint/Solution: most disagreements are the judge being lenient (calling unsupported-but-plausible claims grounded). Tightening "use NO outside knowledge" and requiring it to enumerate unsupported claims usually moves κ the most.
4. Break it then fix it: silent cache invalidation. Objective: feel the 10× cost cliff
Intentionally inject datetime.now() into the judge's system prompt. Run the batch grader and watch cache_read_input_tokens collapse to 0 and the input-token share of cost_usd jump ~10×. Then move the timestamp out (or into the user turn after the breakpoint) and restore the hit rate. Hint/Solution: caching is a prefix match, so any byte change before the breakpoint invalidates everything after it. The fix is architectural (freeze the prefix), not "add more cache_control markers."
5. A/B with statistical honesty. Objective: refuse to ship noise
Run two chunking strategies over a 50-question golden set. Compute the paired t-test on faithfulness. Then expand the set to 250 questions (LLM-bootstrap candidates, human-verify) and re-run. Show how the p-value and confidence interval change. Write the one-paragraph ship/no-ship recommendation a staff engineer would put in the PR, including latency p95 and cost/query guardrails. Hint/Solution: a gap that looked "promising" at N=50 often crosses p=0.05 only at N=250, or evaporates. The deliverable is the decision, not the metric.
6. Drift alarm (production-grade). Objective: catch degradation before the client does
Build a daily job that samples K prod traces, runs the reference-free judge (faithfulness + relevancy; no ground truth needed live), and alerts when the rolling 7-day mean drops > 2σ OR when the retrieval-score distribution shifts (KS test on top-1 similarity). Wire it to fire a PushNotification-style alert. Hint/Solution: two independent drift signals matter: input drift (queries change → KS test on retrieval scores) and output drift (answers degrade → faithfulness trend). They have different root causes (new content / new users vs. model or index regression). Alert on both.
7. Break the batch grader, then make it survive. Objective: a flaky network must not crash a 200-trace eval
Take the naive grade_batch (plain asyncio.gather over 200 coroutines, no semaphore, no try/except). Run it against a judge client with max_retries=0 while injecting failures: monkeypatch ~10% of calls to raise RateLimitError, and add a Semaphore-free burst so you trip your own TPM limit. Watch the whole run die on the first unhandled exception. Then rebuild it into grade_batch_resilient (semaphore-bounded, wait_for timeout, typed-exception catch, None sentinel) and prove it: 200 traces in, ~180 graded, ~20 logged failures, and the reported faithfulness carries the real denominator (e.g. "0.84 over 183/200"). Compare wall-clock and peak concurrency before/after. Hint/Solution: the unbounded gather self-DoSes: you cause the 429s you then fail on. The fix is two-layered: cap in-flight calls with the semaphore (stop creating the rate-limit pressure) and catch exhausted-retry exceptions per-call (survive the ones that still fail). The trap is reporting mean(graded) as if it were over 200, which silently inflates the score by dropping the hard cases that timed out. Always surface the failure count next to the metric.
8. Cost-defend a model downgrade. Objective: turn "use Opus everywhere" into a per-bucket routing decision backed by numbers
Run generation on claude-opus-4-8 and again on claude-haiku-4-5 over a stratified golden set (factoid / multi-hop / aggregation / unanswerable). For each bucket compute answer_correctness and faithfulness with a paired test, and compute cost/answered-query from resp.usage for each model. Produce a routing recommendation: which buckets can safely move to Haiku (quality delta inside the noise band) and which must stay on Opus, with the projected blended cost/query and the quality you're trading for it. Hint/Solution: the win is almost never "downgrade everything" or "keep Opus everywhere"; it's routing. Factoid/unanswerable buckets usually survive Haiku with no significant faithfulness loss (cheap wins); multi-hop and aggregation often regress significantly (keep Opus). The deliverable is a table the PM can sign off on (quality delta + p-value + cost delta per bucket), not a single blended number that hides where you'd be hurting users to save pennies.
🎤 In the interview
"Your faithfulness went from 0.82 to 0.85. Do you ship?" Not on that alone. On a 50-question set that gap is likely inside the noise band; I run a paired test, check p < 0.05, and confirm latency p95 and cost/query guardrails didn't regress before shipping.
"How do you eval a RAG system with no labeled data?" Start reference-free: faithfulness and answer relevancy need no gold answers. Ship behind a feedback widget, mine thumbs-down + low-faithfulness traces into the first 50 golden cases, then graduate to ground-truth metrics. Never blocked on labels to start measuring.
"An answer is wrong. How do you find out why?" Pull the trace by `request_id` and read the two stages separately. High `context_recall` + low faithfulness = generator hallucinated (fix prompt/model). Low recall = the answer was doomed at retrieval (fix chunking/embeddings/reranking). I never debug "the answer is bad" as one thing.
"Can you trust an LLM as a judge?" Only after calibrating it. The judge is itself a model that needs eval: I measure Cohen's κ against ~50 human labels before scaling it, use a stronger judge than the generator (`claude-opus-4-8` judging `claude-sonnet-4-6` output), and control for position/verbosity/self-preference bias with a structured rubric.
"Your prompt caching shows zero hits. Where do you look?" First: is the cached prefix above the model's minimum? On `claude-opus-4-8` it's ~4096 tokens, so a 60-token system prompt with `cache_control` caches nothing, silently. Second: is there a silent invalidator ahead of the breakpoint (a `datetime.now()`, a per-request UUID, unsorted JSON) that changes the prefix bytes every call? I diff the rendered prefix of two requests to find it. `cache_read_input_tokens` from `resp.usage` is how I confirm the fix.
"You're spending too much. Where's the first cut?" I look at cost/answered-query and attribute it by `tenant_id`, not the monthly total. Usually the win is per-bucket model routing (move the factoid and unanswerable traffic to `claude-haiku-4-5`, keep multi-hop on `claude-opus-4-8`), defended with a paired quality test so I know exactly what I'm trading. Then I check the cache hit-rate: if reads are near zero on cacheable traffic, I'm paying ~10× on input tokens for nothing.
"How much of your eval do you run, and on what model?" Golden-set metrics gate every deploy (offline, full set). Reference-free metrics monitor prod on a sample: I don't grade every trace, I grade the low-confidence tail. The judge runs on a model at least as strong as the generator, which means eval cost is real; if it approaches serving cost, I'm over-judging and I sample harder.