RAG Evaluation & Observability

Phase 2. This is what separates hireable AI engineers from tutorial watchers.

Why eval is everything

"If you can't measure it, you can't improve it. If you can't improve it, the client will."

Without eval, you ship blind. With eval :

You can claim numbers in interviews
You can A/B test prompts/chunking/models
You can defend your architecture decisions
You can detect drift in production

The staff-engineer mental model : RAG eval is a two-stage funnel

A RAG answer can be wrong for exactly two reasons, and they require different fixes. A senior never debugs "the answer is bad" — they decompose it :

        QUESTION
           │
   ┌───────▼────────┐   RETRIEVAL stage  ── measured by context_precision / context_recall
   │   Retriever    │   "Did we put the right chunks in the prompt?"
   └───────┬────────┘   Fix here = chunking, embeddings, hybrid search, reranking, top-k
           │  (context)
   ┌───────▼────────┐   GENERATION stage ── measured by faithfulness / answer_relevancy
   │   Generator    │   "Given those chunks, did the model answer correctly?"
   └───────┬────────┘   Fix here = prompt, model, output schema, citation enforcement
           │
        ANSWER

The diagnostic rule : high context_recall + low faithfulness → the retriever did its job, the generator hallucinated (fix the prompt/model). Low context_recall → the answer was doomed before generation; tuning the prompt is wasted effort (fix the retriever). This single decomposition is what separates "I tweaked things until the demo looked good" from "I instrumented each stage and attacked the bottleneck." Bring it to every interview.
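The diagnostic rule above can be written down as a tiny triage function. This is a minimal sketch: the `StageScores` container, the `diagnose` name, and the 0.8 cut-off (borrowed from the production target discussed further down) are illustrative assumptions, not part of any eval library.

```python
from dataclasses import dataclass


@dataclass
class StageScores:
    context_recall: float
    faithfulness: float


def diagnose(scores: StageScores, threshold: float = 0.8) -> str:
    """Return the stage to fix first: 'retrieval', 'generation', or 'ok'."""
    if scores.context_recall < threshold:
        # The right chunks never reached the prompt; prompt tuning cannot recover them.
        return "retrieval"
    if scores.faithfulness < threshold:
        # The chunks were there, but the answer strayed from them.
        return "generation"
    return "ok"


print(diagnose(StageScores(context_recall=0.92, faithfulness=0.55)))  # generation
print(diagnose(StageScores(context_recall=0.40, faithfulness=0.90)))  # retrieval
```

Retrieval is checked first on purpose: when recall is low, a low faithfulness score tells you nothing about the generator yet.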

A corollary: never report a single aggregate "RAG score." It hides which stage is failing. Always report the two stages separately, plus end-to-end answer correctness.

The core eval metrics (Ragas / TruLens)

Metric	Stage	What it measures	Needs ground truth?	Range	Fix when low
Context Recall	Retrieval	Did we retrieve all relevant info	✅ (expected sources/answer)	0-1	chunking, embeddings, top-k, hybrid search
Context Precision	Retrieval	Are retrieved chunks relevant (and ranked high)	✅	0-1	reranker, lower top-k, query rewriting
Faithfulness	Generation	Is the answer grounded in the retrieved context	❌ (reference-free)	0-1	prompt ("cite only from context"), model, lower temp
Answer Relevancy	Generation	Does the answer address the question	❌ (reference-free)	0-1	prompt, query understanding
Answer Correctness	End-to-end	Semantic match vs. the gold answer	✅	0-1	depends — diagnose via the two stages above
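To make the two retrieval rows concrete, here is a simplified, label-based version of both metrics. It is a sketch under a stated assumption: relevance is decided by exact `doc_id` match against `expected_sources`, whereas Ragas judges relevance over chunk text with an LLM. The function names mirror the metric names but are not the Ragas API.

```python
def context_recall(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Share of expected sources that appear anywhere in the retrieved list."""
    if not expected_ids:
        return 1.0  # nothing to find; treat as trivially satisfied
    expected = set(expected_ids)
    return len(expected & set(retrieved_ids)) / len(expected)


def context_precision(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Rank-aware precision: mean of precision@k over the ranks holding a relevant chunk."""
    expected = set(expected_ids)
    hits = 0
    precisions = []
    for k, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


retrieved = ["doc_id_8", "doc_other", "doc_id_1"]  # made-up ranking, best first
expected = ["doc_id_1", "doc_id_8"]
print(context_recall(retrieved, expected))               # 1.0
print(round(context_precision(retrieved, expected), 3))  # 0.833
```

Recall is perfect because both sources were found. Precision is below 1 because an irrelevant chunk outranks a relevant one, which is exactly the case a reranker fixes.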

Why the "needs ground truth?" column is load-bearing. The reference-free metrics (faithfulness, answer relevancy) can be computed on live production traffic — you have no gold answer at 3am, but you can still ask an LLM judge "is this answer grounded in this context?". The ground-truth metrics (recall, precision, correctness) require your golden set and run offline in CI / on a sampled basis. A mature system runs both: golden-set metrics gate deploys, reference-free metrics monitor prod.

Target for production : all > 0.8 — but treat that number as a starting heuristic, not gospel. The right bar is domain- and cost-of-error-dependent: a legal/medical RAG may demand faithfulness > 0.95 (a hallucinated citation is a liability), while an internal docs search tolerates 0.75. Set the threshold from the business consequence of a wrong answer, then defend it with the golden-set distribution, not a round number you read in a blog post.

⚠️ Faithfulness is computed by an LLM judge and is itself noisy. A faithfulness of 0.82 on a 50-question set has a 95% confidence interval of roughly ±0.10 (binomial approximation, treating each question as pass/fail). Reporting "we went from 0.82 → 0.85" as a win without a significance check is a junior move; see the A/B section.
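The ±0.10 figure comes from the normal approximation to a binomial proportion. The sketch below reproduces it and inverts the formula to estimate how many questions a tighter interval needs. It assumes each question is scored pass/fail and that questions are independent; the function names are illustrative.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile


def half_width(score: float, n: int, z: float = Z_95) -> float:
    """Normal-approximation half-width of the interval around a pass rate."""
    return z * math.sqrt(score * (1 - score) / n)


def questions_needed(score: float, target_half_width: float, z: float = Z_95) -> int:
    """Smallest n whose interval is no wider than +/- target_half_width."""
    return math.ceil(z * z * score * (1 - score) / target_half_width ** 2)


print(round(half_width(0.82, 50), 3))  # 0.106
print(questions_needed(0.82, 0.03))    # 631
```

This is the unpaired picture for a single score. A paired comparison of two variants on the same questions needs fewer questions for the same sensitivity.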

Ground truth dataset — how to build it

Pick 50 representative questions users will actually ask
Manually write expected answers + source chunks for each

Store as JSONL :

jsonl

{"question": "...", "expected_answer": "...", "expected_sources": ["doc_id_1", "doc_id_8"]}

Keep this in an evals/golden.jsonl file in the repo
Grow it over time — every prod bug becomes a new test case

How a staff engineer thinks about the golden set

The golden set is the spec. If it's biased, every metric you report is a lie you tell yourself. Concerns that separate a real eval harness from a toy :

Stratify by question type, not just count. 50 questions all of the form "what is X?" will pass while your system silently fails on multi-hop ("compare X and Y"), negation ("which docs do not mention X"), and out-of-scope ("what's the weather"). Bucket your set: factoid / multi-hop / aggregation / unanswerable. The unanswerable bucket is the one juniors forget and the one that catches hallucination.
Include adversarial & "I don't know" cases. A RAG system that never says "the documents don't contain that" is over-confident. ~15–20% of your golden set should have expected_answer: "INSUFFICIENT_CONTEXT" so you can measure the refusal path, not just the happy path.
Freeze it, version it, review it. evals/golden.jsonl is checked into git and changes go through PR review like code. A silent edit to the golden set that makes the numbers go up is fraud, not progress.
Beware leakage. If you build the golden answers by reading the model's output, you're grading the model against itself. Write expected answers from the source documents, independently of what the system currently produces.
Bootstrapping with an LLM is fine — verification is not optional. Generating candidate Q/A pairs with Claude to seed the set is a legitimate time-saver, but a human must verify every entry that gates a deploy. LLM-generated golden answers that no human checked are not ground truth.
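The concerns above can be enforced mechanically before any metric runs. The sketch below parses the JSONL format shown earlier and reports bucket counts plus the share of refusal cases. The `bucket` field is an assumption added for stratification (it is not in the JSONL example above), and the function names are illustrative.

```python
import json
from collections import Counter

REQUIRED_KEYS = {"question", "expected_answer", "expected_sources"}


def load_golden(lines: list[str]) -> list[dict]:
    """Parse JSONL lines and fail loudly on a malformed entry."""
    cases = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"line {n}: missing {sorted(missing)}")
        cases.append(case)
    return cases


def audit(cases: list[dict]) -> dict:
    """Report how the set is stratified and how much of it tests the refusal path."""
    buckets = Counter(c.get("bucket", "unlabeled") for c in cases)
    refusals = sum(c["expected_answer"] == "INSUFFICIENT_CONTEXT" for c in cases)
    share = refusals / len(cases) if cases else 0.0
    return {"buckets": dict(buckets), "refusal_share": share}
```

Calling `audit(load_golden(open("evals/golden.jsonl").read().splitlines()))` in CI, and failing the build when `refusal_share` falls below the 15% floor suggested above, keeps the set honest as it grows.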

The cold-start answer (interview gold). "How do you eval with zero golden data?" → Start reference-free (faithfulness + answer relevancy need no labels), ship behind a feedback widget, mine thumbs-down + low-faithfulness traces into your first 50 golden cases, then graduate to ground-truth metrics. You are never blocked on labels to start measuring.

Eval frameworks

Ragas (RAG-specific)

docs.ragas.io
Python-first
Uses LLM-as-judge for many metrics
Good for quick experimentation
Use this by default for project 1

TruLens

trulens.org
More general
Better dashboards
Supports custom feedback functions

LangSmith

Built into LangChain ecosystem
Visual interface
Datasets + experiments
Free tier sufficient for portfolio

Phoenix (Arize)

phoenix.arize.com
Open source
Strong tracing
Self-hostable

Custom eval

For specific use cases : write your own with pytest
Example : "Does the answer cite at least one source?" → simple regex check
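A sketch of that citation check as pytest-style tests. The inline `[doc_id]` citation format is an assumption (adapt the pattern to whatever your prompt enforces), and the helper names are illustrative. The check goes one step past "cites something" and requires the cited id to be one that was actually in the context.

```python
import re

# Assumed citation format: sources cited inline as [doc_id], e.g. [doc_id_8].
CITATION = re.compile(r"\[([A-Za-z0-9_\-]+)\]")


def cited_sources(answer: str) -> list[str]:
    """All doc ids cited in the answer, in order of appearance."""
    return CITATION.findall(answer)


def cites_known_source(answer: str, context_ids: set[str]) -> bool:
    """True if the answer cites at least one doc_id that was in the retrieved context."""
    return any(doc_id in context_ids for doc_id in cited_sources(answer))


def test_answer_cites_a_retrieved_source():
    answer = "Refunds take 14 days [doc_id_8]."  # made-up answer for illustration
    assert cites_known_source(answer, {"doc_id_1", "doc_id_8"})


def test_fabricated_citation_is_rejected():
    assert not cites_known_source("See [doc_id_99].", {"doc_id_1"})
```

Deterministic checks like this are free to run on every request, so they complement the LLM judge rather than replace it.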

How a staff engineer chooses (the part the bullets above hide)

The frameworks are not interchangeable; they sit at different layers and a mature system uses two of them, not one.

Layer	Question it answers	Tool that fits	Runs
Metric computation	"What's the faithfulness on this set?"	Ragas (RAG-specific metrics out of the box), or your own LLM-judge	CI gate (offline) + sampled prod
Tracing / observability	"Why was this answer wrong, and where's the latency?"	LangFuse / Phoenix / LangSmith	Always-on, every request
Experiment tracking	"Is variant B better than A, with significance?"	LangSmith datasets, or a notebook + `scipy`	Per experiment

The senior mental model: buy the tracing, own the metric. Tracing (spans, token counts, latency breakdown, replay-by-request_id) is undifferentiated plumbing: Phoenix or LangFuse do it better than you will, and you can self-host Phoenix if data residency matters. The metric (what "faithful" means for your domain, the rubric, the calibration against your humans) is your competitive moat and the thing an interviewer probes. Don't outsource the rubric to a library default and then quote the number as if it were ground truth.

Avoid framework lock-in at the trace boundary. Log a typed RagTrace (the schema in the observability section below) to your own store first, then mirror to whatever vendor you're trialing. The day you swap LangSmith for Phoenix you don't want to lose three months of golden-set history. Treat the eval vendor as a renderer over your data, not the system of record.
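One way to sketch "own store first, vendor second" is a single logging function that writes the trace locally and then mirrors it on a best-effort basis. Everything here is hypothetical scaffolding: the `Mirror` callable stands in for whichever vendor client you wrap, and the `StringIO` stands in for a file or database writer.

```python
import io
import json
from typing import Callable, TextIO

Mirror = Callable[[dict], None]  # hypothetical vendor adapter: takes one trace dict


def log_trace(trace: dict, store: TextIO, mirrors: list[Mirror]) -> list[str]:
    """Write to our own store first, then best-effort mirror to vendors."""
    store.write(json.dumps(trace, sort_keys=True) + "\n")  # system of record
    failures = []
    for mirror in mirrors:
        try:
            mirror(trace)
        except Exception as exc:  # a vendor outage must not lose or block the trace
            failures.append(f"{getattr(mirror, '__name__', 'mirror')}: {exc}")
    return failures


def flaky_vendor(trace: dict) -> None:
    """Simulated vendor adapter that is currently down."""
    raise ConnectionError("vendor unreachable")


store = io.StringIO()  # stands in for a file or a database writer
failures = log_trace({"request_id": "req-1", "query": "refund policy?"}, store, [flaky_vendor])
print(failures)  # ['flaky_vendor: vendor unreachable']
```

Swapping vendors then means replacing one adapter, while the history in your own store stays intact.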

The honest answer to "which one?" For a portfolio project: Ragas for the metrics, Phoenix (self-hosted, free) for tracing, scipy for the A/B test. That trio costs nothing, runs locally, and demonstrates every layer. Reach for LangSmith's hosted datasets only when a team needs shared, versioned experiment history — and say that in the interview, not "I used LangChain because it was there."

Observability — what to log per request

The trace is the unit of debuggability. If a user complains "the answer was wrong," a senior engineer pulls the trace by request_id and can tell within 30 seconds which stage failed — without re-running anything. Model it as a typed schema, not a free-form dict, so the shape never drifts :

python

from pydantic import BaseModel
from typing import Literal


class RetrievedChunk(BaseModel):
    doc_id: str
    score: float
    rank: int
    text: str | None = None  # store a snippet, not the whole chunk, to bound log size


class RagTrace(BaseModel):
    request_id: str           # propagate this header end-to-end (NestJS → Python → logs)
    user_id: str | None
    tenant_id: str | None     # multi-tenant cost & isolation
    query: str
    rewritten_query: str | None = None

    # Retrieval stage — log EVERY stage so you can attribute a bad answer to the right one
    vector_results: list[RetrievedChunk]
    bm25_results: list[RetrievedChunk]
    fused_results: list[RetrievedChunk]
    reranked_results: list[RetrievedChunk]
    context_tokens: int

    # Generation stage
    model: str = "claude-opus-4-8"
    input_tokens: int
    output_tokens: int
    cache_read_input_tokens: int = 0     # from resp.usage — proves caching works
    cache_creation_input_tokens: int = 0
    response: str
    stop_reason: str | None = None       # "end_turn" | "max_tokens" | "refusal" | ...

    # Derived / operational
    cost_usd: float
    latency_ms: int
    retrieval_latency_ms: int            # split the budget: where did the 1840ms go?
    generation_latency_ms: int
    experiment_id: str | None = None     # A/B bucket
    user_feedback: Literal["up", "down"] | None = None  # filled in later
    error: str | None = None

→ Log this to LangSmith / LangFuse / Phoenix / your own DB (one row per request; the chunk lists go into a JSON column or a child table).

Three things juniors omit and seniors always log

request_id propagated across services. Your stack is NestJS (API) → Python (RAG) → Claude. Generate the id at the edge, pass it as a header, and stamp it on every log line and every span. Without it, correlating a frontend error with a Python stack trace is archaeology.
cache_read_input_tokens from resp.usage. This is how you prove prompt caching is actually hitting. If it's zero across requests that share a system+context prefix, a silent invalidator (a timestamp in the prompt, a per-request UUID, unsorted JSON) is costing you ~10× on input tokens.
Split latency by stage. A p95 of 1840ms is a meaningless aggregate. retrieval_latency_ms vs generation_latency_ms tells you whether to optimize the vector DB / reranker or the LLM call. The reranker (a cross-encoder) is a frequent hidden tax, so measure it.

Computing `cost_usd` correctly (defend the number)

Cost is per-token, and cached reads are ~10× cheaper than fresh input — if you bill at the flat input rate you'll overstate cost and make wrong model-choice decisions. For the 2026 flagship claude-opus-4-8 at 1M context : $5 / 1M input, $25 / 1M output, with cache reads at ~0.1× input. The mid tier claude-sonnet-4-6 is $3 / $15; the cheap claude-haiku-4-5 is $1 / $5.

python

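# Prices in USD per 1M tokens (input, output); keep this in config, not hardcoded
PRICES = {
    "claude-opus-4-8":   (5.0, 25.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-haiku-4-5":  (1.0,  5.0),
}
CACHE_READ_MULTIPLIER = 0.1    # cached input tokens bill at ~10% of base input price
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: 5-minute-TTL cache writes bill at ~1.25x base input; check current pricing


def cost_usd(model: str, usage) -> float:
    in_price, out_price = PRICES[model]
    fresh_in = usage.input_tokens                      # uncached remainder only
    # `or 0` covers both a missing attribute and an explicit None on the usage object
    cached_in = getattr(usage, "cache_read_input_tokens", 0) or 0
    written_in = getattr(usage, "cache_creation_input_tokens", 0) or 0
    out = usage.output_tokens
    return (
        fresh_in     * in_price / 1e6
        + cached_in  * in_price * CACHE_READ_MULTIPLIER / 1e6
        + written_in * in_price * CACHE_WRITE_MULTIPLIER / 1e6  # cache writes are not free
        + out        * out_price / 1e6
    )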

Gotcha: usage.input_tokens is the uncached remainder, not the total prompt size. Total prompt = input_tokens + cache_creation_input_tokens + cache_read_input_tokens. If you compute cost from a separately-counted "prompt length," you'll double-count the cached portion.
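A worked example of that accounting, using a stand-in `Usage` dataclass with the same field names as `resp.usage` (the dataclass and the token counts are made up for illustration):

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in for resp.usage, with the field names used in the text above."""
    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0


def total_prompt_tokens(u: Usage) -> int:
    """Full prompt size: uncached remainder + cache writes + cache reads."""
    return u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens


def cache_hit_ratio(u: Usage) -> float:
    """Share of the prompt that was served from cache."""
    total = total_prompt_tokens(u)
    return u.cache_read_input_tokens / total if total else 0.0


u = Usage(input_tokens=400, output_tokens=300, cache_read_input_tokens=5600)
print(total_prompt_tokens(u))        # 6000
print(round(cache_hit_ratio(u), 3))  # 0.933
```

Here `input_tokens` alone (400) would understate the prompt by 15×, and billing all 6000 tokens at the fresh rate would overstate the input cost.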

Cost as a first-class SLO, not an afterthought

A staff engineer treats cost the same way they treat p95 latency: a budget with an alarm, not a number you read off the invoice at the end of the month. Three reflexes:

Unit-economics, not totals. "We spent $4,200 last month" is unactionable. cost / answered_query and cost / resolved_ticket are the numbers you defend to a PM and that catch a regression. A chunking change that doubles context tokens shows up here on day one, not on the invoice 30 days later.
Attribute every dollar to a tenant. In a multi-tenant RAG, cost_usd rolls up by tenant_id so you can spot the one customer whose 50-page PDFs are eating the margin — and so you can bill them, rate-limit them, or move them to claude-haiku-4-5. The trace schema above carries tenant_id precisely for this.
Know your model mix. Serving generation on claude-haiku-4-5 ($1/$5) and judging on claude-opus-4-8 ($5/$25) is a deliberate asymmetry: the cheap model answers 10k prod queries, the expensive model grades a sample. If your eval cost is a meaningful fraction of your serving cost, you're judging too much — sample harder, or judge reference-free metrics only on the low-confidence tail.
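The first two reflexes reduce to two small rollups over logged traces. This sketch assumes each trace is a dict carrying the `cost_usd`, `tenant_id`, and `error` fields from the trace schema; the function names and the sample rows are illustrative.

```python
from collections import defaultdict


def cost_per_answered_query(traces: list[dict]) -> float:
    """Total spend divided by requests that produced an answer (failed calls still cost money)."""
    answered = sum(1 for t in traces if t.get("error") is None)
    total = sum(t["cost_usd"] for t in traces)
    return total / answered if answered else float("inf")


def cost_by_tenant(traces: list[dict]) -> dict[str, float]:
    """Spend per tenant, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for t in traces:
        totals[t.get("tenant_id") or "unknown"] += t["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


traces = [  # made-up rows for illustration
    {"tenant_id": "acme", "cost_usd": 0.02, "error": None},
    {"tenant_id": "globex", "cost_usd": 0.01, "error": None},
    {"tenant_id": "acme", "cost_usd": 0.03, "error": "timeout"},
]
print(round(cost_per_answered_query(traces), 4))  # 0.03
print(list(cost_by_tenant(traces)))               # ['acme', 'globex']
```

Dividing by answered queries rather than all requests is deliberate: a spike in failed calls raises the unit cost, which is the regression you want to see.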

The model-downgrade decision, defended with numbers. "Can we move generation from Opus to Haiku?" is not a vibe call. Run both over the golden set, compare answer_correctness and faithfulness with a paired test (see A/B section), and put the cost delta next to the quality delta. If Haiku is 5× cheaper and loses 1 point of faithfulness inside the noise band, you ship Haiku and pocket the margin. If it drops faithfulness 8 points on the multi-hop bucket, you keep Opus there and route only the factoid bucket to Haiku. Per-bucket model routing is a senior move; "we use Opus everywhere because it's safest" is a junior one that quietly burns money.

Production monitoring dashboards to build

Latency — p50/p95/p99 per endpoint, split by stage (retrieval vs generation). An aggregate p95 hides whether the reranker or the LLM is the tax.
Cost — daily $$ + per-user / per-tenant + cost/answered-query (the unit metric). Overlay cache_read_input_tokens as a % — a sudden drop means a silent cache invalidator went out in a deploy.
Eval scores — daily reference-free judge (faithfulness, relevancy) on sampled prod queries; golden-set metrics on every deploy.
Error rate — LLM timeouts, RateLimitError/OverloadedError, retries exhausted, fallbacks triggered, stop_reason: "refusal" rate.
User feedback — thumbs up/down rate, and crucially the down-rate on high-faithfulness answers (the model was confident and grounded but still wrong → your retrieval or your golden set is lying to you).
Drift signals — query distribution shift (input drift), retrieval top-1 similarity drop, faithfulness trend (output drift).
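For the retrieval-similarity drift signal, a two-sample Kolmogorov-Smirnov test compares last week's top-1 similarity scores with today's. The sketch below is dependency-free and uses the large-sample critical value; in practice `scipy.stats.ks_2samp` gives an exact p-value and is the more sensible choice. Function names and sample scores are illustrative.

```python
import math


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:  # step both CDFs past every value <= x
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drifted(baseline: list[float], current: list[float], alpha: float = 0.05) -> bool:
    """True when the KS statistic exceeds the asymptotic critical value at level alpha."""
    n, m = len(baseline), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, current) > critical


baseline = [0.80 + 0.001 * i for i in range(50)]  # made-up top-1 similarities
today = [0.60 + 0.001 * i for i in range(50)]     # the whole distribution moved down
print(drifted(baseline, today))     # True
print(drifted(baseline, baseline))  # False
```

A KS test fires on any change in the shape of the distribution, not only a drop in the mean, which is why it suits input drift.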

The rule that separates a dashboard from a wall of charts

Every panel must map to an action. A chart nobody acts on is a chart nobody looks at. For each metric, write the runbook line before you build the panel:

Panel	Threshold	Action when it fires
p95 generation latency	> 3s	Check `max_tokens`, drop to `effort: "low"`, or stream
cost/answered-query	> 1.3× 7-day median	Audit cache hit-rate; check for a context-size regression
`cache_read` %	< 50% on cacheable traffic	A silent invalidator shipped — diff the rendered prefix
faithfulness (sampled prod)	rolling mean −2σ	Page on-call; pull low-faithfulness traces; check index freshness
refusal rate	spikes	Adjacent-benign false positives, or a real attack — inspect `stop_details`

If you can't write the action, you don't need the panel — you need a different metric. Alert on symptoms users feel (down-rate, latency, refusals) and page on leading indicators (faithfulness drift, cache collapse) before users feel them. Don't alert on raw token counts; alert on the cost/query they roll up into.
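Two of the thresholds in the table above, written as alarm predicates. This is a sketch: the function names and the sample week are made up, and the history windows are whatever trailing period you already store.

```python
import statistics


def faithfulness_alarm(history: list[float], today: float, sigmas: float = 2.0) -> bool:
    """Fire when today's sampled score falls below the trailing mean minus N sigma."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    floor = statistics.fmean(history) - sigmas * statistics.stdev(history)
    return today < floor


def cost_alarm(history: list[float], today: float, ratio: float = 1.3) -> bool:
    """Fire when today's cost/answered-query exceeds ratio x the trailing median."""
    return bool(history) and today > ratio * statistics.median(history)


week = [0.84, 0.86, 0.85, 0.83, 0.87, 0.85, 0.84]  # made-up daily faithfulness
print(faithfulness_alarm(week, today=0.80))  # True
print(faithfulness_alarm(week, today=0.84))  # False
print(cost_alarm([0.010] * 7, today=0.014))  # True
```

The cost rule uses the median rather than the mean so that one expensive day in the window does not raise the bar for the next alarm.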

LLM-as-judge — when to trust it

Use for : faithfulness, relevancy, factuality, tone — anything where "is this good?" is a judgment a competent human could make from the text alone. Don't use for : code correctness (run the tests), numerical/financial accuracy (compute it), or specialized domain where the judge lacks expertise (the judge will confidently agree with a plausible-but-wrong answer).

Tip : use a stronger model as judge than the one generating. The flagship claude-opus-4-8 judges output from the cheaper claude-sonnet-4-6 / claude-haiku-4-5 you serve in prod — not the other way round. A judge weaker than the generator can't reliably catch the generator's mistakes.

A real, runnable judge (Anthropic SDK, structured output)

A senior judge implementation has four properties juniors miss: it's async (you grade hundreds of traces), it returns a structured verdict with a reason (a bare 0/1 is undebuggable), it uses AsyncAnthropic + structured outputs instead of hand-rolled JSON prompting, and it logs usage so eval cost is itself observable.

python

import asyncio
from anthropic import AsyncAnthropic
from pydantic import BaseModel, Field

client = AsyncAnthropic(max_retries=4)  # SDK retries 429/5xx with backoff


class Faithfulness(BaseModel):
    grounded: bool = Field(description="True iff every claim in the answer is supported by the context")
    unsupported_claims: list[str] = Field(description="Claims in the answer NOT found in the context")
    reason: str


JUDGE_SYSTEM = (
    "You are a strict RAG faithfulness grader. An answer is faithful ONLY if every "
    "factual claim it makes is directly supported by the provided context. "
    "Do not use outside knowledge. If the answer adds anything not in the context, it is NOT grounded."
)


async def judge_faithfulness(question: str, context: str, answer: str) -> Faithfulness:
    resp = await client.messages.parse(           # native structured output — no XML/JSON prompting
        model="claude-opus-4-8",                  # judge is STRONGER than the generator
        max_tokens=1024,
        thinking={"type": "adaptive"},            # let the judge reason before verdict
        system=[{
            "type": "text",
            "text": JUDGE_SYSTEM,
            # Caches only once this prefix clears the model's minimum cacheable length;
            # a prompt as short as JUDGE_SYSTEM is silently not cached.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": (
                f"<question>{question}</question>\n"
                f"<context>{context}</context>\n"
                f"<answer>{answer}</answer>"
            ),
        }],
        output_config={"format": Faithfulness},
    )
    # resp.usage carries input/output/cache tokens — log it so eval cost is observable too
    return resp.parsed_output


async def grade_batch(traces: list[dict]) -> list[Faithfulness]:
    # asyncio.gather parallelizes the judge calls — grading 200 traces serially is the junior mistake
    return await asyncio.gather(*[
        judge_faithfulness(t["query"], t["context"], t["response"]) for t in traces
    ])

Why adaptive thinking, not a budget_tokens value? On claude-opus-4-8 the old thinking={"type": "enabled", "budget_tokens": N} form is removed and returns HTTP 400 (so are temperature/top_p/top_k). Use thinking={"type": "adaptive"} and, if you need to dial depth, output_config={"effort": "low"|"medium"|"high"|"max"}. For a cheap, high-throughput judge, effort: "low" is often enough — it consolidates reasoning and cuts tokens, which is exactly what you want when grading hundreds of traces.

⚠️ The caching gotcha that bites everyone (and breaks Exercise 2 if you don't know it). Prompt caching is a prefix match with a minimum cacheable prefix, and on claude-opus-4-8 that minimum is ~4096 tokens. The JUDGE_SYSTEM prompt above is ~60 tokens — putting cache_control on it caches nothing (cache_read_input_tokens stays 0, no error, no warning). To actually get cache hits on a judge, the stable cached prefix has to clear the minimum: pack the rubric, few-shot calibration examples, and detailed grading instructions into the cached system block until it's > 4096 tokens, then keep the volatile per-trace <question>/<context>/<answer> after the breakpoint. A skinny system prompt with a cache_control marker is the #1 reason "caching doesn't work." (On claude-sonnet-4-6 the minimum is ~2048 tokens — a prefix that caches on Sonnet silently won't on Opus.)

Structured output, the senior way. messages.parse() with a Pydantic schema is the canonical pattern — the SDK injects the JSON-schema constraint (output_config.format), validates the response, and hands you a typed Faithfulness object via resp.parsed_output. Do not hand-roll "respond in JSON like {...}" prompting and then json.loads() the text: it's brittle (the model wanders off-schema on edge cases), it costs you a retry loop, and it can't enforce the enum/bool/list[str] shape the way the constrained decoder does. Native structured output also forbids assistant-prefill on 4.8 — another reason the old {"role": "assistant", "content": "{"} JSON-forcing trick is dead.

Failure modes of LLM judges (and the fixes)

Failure mode	Symptom	Fix
Self-preference bias	Judge rates its own family's output higher	Use a different/stronger model as judge; calibrate against humans
Position bias	In A/B pairwise judging, prefers whichever answer is shown first	Randomize order; run both orders and average
Verbosity bias	Longer answers score higher regardless of correctness	Explicit rubric; penalize unsupported claims, not length
Leniency / sycophancy	Everything scores 0.9+; no discrimination	Force a structured rubric + require listing `unsupported_claims`
Non-determinism	Same input, different score across runs	`effort: "low"`, structured output, and calibrate: measure judge-vs-human agreement (Cohen's κ) on a labeled subset before you trust the judge at scale
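The position-bias fix ("run both orders and average") is small enough to sketch. The `PairJudge` signature is hypothetical: any callable that takes a question and two answers and returns "first" or "second" for the better one. The two toy judges exist only to show the effect.

```python
from typing import Callable

# Hypothetical pairwise judge: returns "first" or "second" for the better answer.
PairJudge = Callable[[str, str, str], str]


def debiased_preference(question: str, answer_a: str, answer_b: str, judge: PairJudge) -> float:
    """Score for answer A in [0, 1], averaged over both presentation orders."""
    a_shown_first = 1.0 if judge(question, answer_a, answer_b) == "first" else 0.0
    a_shown_second = 1.0 if judge(question, answer_b, answer_a) == "second" else 0.0
    return (a_shown_first + a_shown_second) / 2


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge with a real (if naive) preference that ignores position."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference("q", "long detailed answer", "short", always_first))    # 0.5
print(debiased_preference("q", "long detailed answer", "short", prefers_longer))  # 1.0
```

A purely position-biased judge collapses to 0.5 (no information), while a judge with a genuine preference keeps its verdict. A score of exactly 0.5 on many pairs is itself a signal that the judge is reading position, not content.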

The meta-rule: the LLM judge is itself a model that needs eval. Before you trust it on 10k prod traces, measure its agreement with human labels on ~50 cases. If κ < 0.6, fix the rubric before scaling — otherwise you're monitoring with a broken thermometer.
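Cohen's κ corrects raw agreement for the agreement two raters would reach by chance: κ = (p_observed − p_expected) / (1 − p_expected). A minimal sketch for binary faithful / not-faithful labels follows; the function name and the 40 sample labels are made up, and `sklearn.metrics.cohen_kappa_score` is the library equivalent if scikit-learn is already a dependency.

```python
def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h = sum(human) / n  # share the human marked faithful
    p_j = sum(judge) / n  # share the judge marked faithful
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up calibration set: the human marks 20 of 40 as faithful;
# the judge is lenient and also passes 10 of the 20 unfaithful ones.
human = [True] * 20 + [False] * 20
judge = [True] * 30 + [False] * 10
print(cohens_kappa(human, judge))  # 0.5
```

Raw agreement here is 75%, which sounds acceptable, but κ is 0.5 and below the 0.6 bar. That gap is the reason to report κ rather than accuracy.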

Making the judge production-grade (the resilience juniors skip)

The pretty judge_faithfulness above is the happy path. Grading 200 traces means 200 network calls, and at scale some will fail. A senior wraps the batch grader so one flaky call doesn't poison the whole run:

Typed exceptions, not string matching. Catch anthropic.RateLimitError, anthropic.OverloadedError, anthropic.APITimeoutError, anthropic.APIStatusError — never if "429" in str(e). The SDK's max_retries=4 already retries 429/5xx with backoff; your job is to decide what happens when retries are exhausted.
Bound concurrency, don't gather 200 at once. asyncio.gather over 200 coroutines fires 200 simultaneous requests and you'll trip your own rate limit. Gate with a Semaphore (e.g. 10–20 in flight) so the judge saturates throughput without self-DoSing.
return_exceptions=True + a sentinel. A single failed grade should not crash a 200-trace run. Collect failures, log them, and exclude them from the aggregate — and report the exclusion count, because "faithfulness 0.84 over 187/200 graded" is honest; silently averaging 187 and calling it 200 is not.
Per-call timeout. A wedged call shouldn't hang the batch. Set a timeout on the client (or asyncio.wait_for) so a stuck grade fails fast and falls into the exception bucket.
Log resp.usage per call. Eval cost is real cost. If you grade every prod trace on claude-opus-4-8, your monitoring bill can rival your serving bill — which is the whole reason you sample and grade reference-free on the tail, not everything.

python

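import asyncio
import logging

import anthropic

# Faithfulness and judge_faithfulness come from the judge block above
log = logging.getLogger(__name__)
SEM = asyncio.Semaphore(15)  # cap in-flight judge calls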


async def judge_one(trace: dict) -> Faithfulness | None:
    async with SEM:
        try:
            return await asyncio.wait_for(
                judge_faithfulness(trace["query"], trace["context"], trace["response"]),
                timeout=30,
            )
        except (anthropic.RateLimitError, anthropic.OverloadedError,
                anthropic.APITimeoutError, asyncio.TimeoutError) as e:
            # retries already exhausted by the SDK — log and drop this one grade
            log.warning("judge failed for %s: %s", trace["request_id"], e)
            return None  # sentinel; excluded from the aggregate, counted as a failure


async def grade_batch_resilient(traces: list[dict]) -> tuple[list[Faithfulness], int]:
    results = await asyncio.gather(*(judge_one(t) for t in traces))
    graded = [r for r in results if r is not None]
    failed = len(results) - len(graded)
    return graded, failed   # report BOTH — never hide the denominator

A/B testing in production

Once shipped :

Tag each request with experiment_id (e.g., "chunk_strategy_v2")
Route X% of traffic to variant
Collect eval scores + user feedback
Statistical significance check before rollout
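Routing a fixed share of traffic is usually done with deterministic hashing rather than a random draw per request, so a given user always lands in the same arm. A minimal sketch, assuming a string `user_id` is available at the edge; the function name and the two arm labels are illustrative.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, treatment_pct: int = 10) -> str:
    """Stable per-user assignment: the same user always gets the same arm."""
    # Salting with experiment_id keeps buckets independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "treatment" if bucket < treatment_pct else "control"


arm = assign_variant("user-42", "chunk_strategy_v2")
print(arm == assign_variant("user-42", "chunk_strategy_v2"))  # True
```

Because no assignment table is stored, any service that knows `user_id` and `experiment_id` can recompute the arm. Stamp the result on the trace as `experiment_id` plus arm so offline analysis does not depend on re-deriving it.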

Defending the number (the part interviews probe)

"Faithfulness went from 0.82 to 0.85, ship it" is how juniors lose money on noise. Two variants will almost never score identically on the same dataset; the question is whether the gap between them is signal or sampling noise.

Offline (golden set, paired): the same N questions run through both variants → use a paired test (paired t-test, or McNemar for pass/fail). Paired is far more powerful than unpaired because it cancels per-question difficulty. On a 50-question set, a 3-point faithfulness move is usually inside the noise band — you typically need a few hundred questions to detect small effects.
Online (live traffic): randomize per-user (not per-request, or the same user sees both and contaminates the comparison), pick the metric before you look (thumbs-up rate, or low-faithfulness rate), and compute a confidence interval. Don't peek-and-stop — sequential peeking inflates false positives; fix the sample size or use a sequential test.
Watch the guardrails, not just the win metric. A chunking change that lifts faithfulness but adds 400ms p95 and 30% cost is a loss dressed as a win. Every experiment reports the win metric and latency p95 and cost/query.

python

from scipy import stats

# Paired: per-question faithfulness for variant A and B on the SAME golden set
a = [...]  # control scores
b = [...]  # variant scores
t, p = stats.ttest_rel(b, a)          # paired t-test
delta = (sum(b) - sum(a)) / len(a)
print(f"Δ={delta:+.3f}  p={p:.3f}  → ship only if p < 0.05 AND guardrails OK")
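When the per-question score is pass/fail rather than continuous, McNemar's test is the paired counterpart of the t-test above: it looks only at the questions where the two variants disagree. Below is a dependency-free sketch of the exact (binomial) version; the function name and the 50 sample outcomes are made up.

```python
from math import comb


def mcnemar_exact(a_pass: list[bool], b_pass: list[bool]) -> float:
    """Two-sided exact McNemar p-value computed from the discordant pairs only."""
    only_a = sum(a and not b for a, b in zip(a_pass, b_pass))
    only_b = sum(b and not a for a, b in zip(a_pass, b_pass))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the variants never disagree
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k), X ~ Binomial(n, 0.5)
    return min(1.0, 2 * tail)


# Made-up run on 50 questions: A passes 41 (0.82), B passes 43 (0.86).
a_pass = [True] * 3 + [False] * 5 + [True] * 38 + [False] * 4
b_pass = [False] * 3 + [True] * 5 + [True] * 38 + [False] * 4
print(round(mcnemar_exact(a_pass, b_pass), 3))  # 0.727
```

A four-point gain that rests on 5 wins against 3 losses gives p ≈ 0.73, nowhere near significance.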

🏋️ Exercises

Demanding and progressive. Each builds on the previous. "Change a constant" is not here.

1. Build the two-stage harness. Goal: prove you can attribute a failure to retrieval vs generation

Take a 30-question golden set (factoid + multi-hop + 5 unanswerable) and compute context_recall, context_precision, faithfulness, answer_relevancy separately, emitting a per-question table. Then deliberately break retrieval (set top_k=1) and break generation (weaken the prompt, or raise temperature on a model that still accepts the parameter), and show that your metrics correctly localize which you broke. Hint/Solution: recall drops when you starve retrieval; faithfulness drops when you loosen generation. If both move together, your metrics aren't isolating the stages, and your expected_sources labels are probably wrong.

2. Production-grade async judge with caching. Goal: grade 200 traces fast and cheap, and prove the cache hits

Wrap the judge_faithfulness function above in a batch grader using AsyncAnthropic + asyncio.gather, with cache_control on the judge system prefix. Log resp.usage for every call and assert cache_read_input_tokens > 0 after the first call. Report total eval cost and p95 judge latency. Hint/Solution: the stable judge system prompt must come first and carry the cache_control breakpoint; the volatile per-trace question/context/answer goes after it. If cache_read_input_tokens stays 0, either you put something volatile (a timestamp, the trace id) ahead of the breakpoint, or the cached prefix is shorter than the model's minimum cacheable length (see the caching gotcha above).

3. Calibrate the judge against humans. Goal: don't trust a thermometer you never checked

Hand-label 40 (answer, context) pairs as faithful / not. Run your LLM judge on the same 40. Compute Cohen's κ and a confusion matrix. If κ < 0.6, iterate on the rubric (force unsupported_claims, strengthen the system prompt) until it clears. Hint/Solution: most disagreements are the judge being lenient (calling unsupported-but-plausible claims grounded). Tightening "use NO outside knowledge" and requiring it to enumerate unsupported claims usually moves κ the most.

4. Break it then fix it: silent cache invalidation. Goal: feel the 10× cost cliff

Intentionally inject datetime.now() into the judge's system prompt. Run the batch grader and watch cache_read_input_tokens collapse to 0 and the input-token share of cost_usd jump by up to ~10×. Then move the timestamp out (or into the user turn after the breakpoint) and restore the hit rate. Hint/Solution: caching is a prefix match, so any byte change before the breakpoint invalidates everything after it. The fix is architectural (freeze the prefix), not "add more cache_control markers."

5. A/B with statistical honesty. Goal: refuse to ship noise

Run two chunking strategies over a 50-question golden set. Compute the paired t-test on faithfulness. Then expand the set to 250 questions (LLM-bootstrap candidates, human-verify) and re-run. Show how the p-value and confidence interval change. Write the one-paragraph ship/no-ship recommendation a staff engineer would put in the PR, including latency p95 and cost/query guardrails. Hint/Solution: a gap that looked "promising" at N=50 often crosses p=0.05 only at N=250, or evaporates. The deliverable is the decision, not the metric.

6. Drift alarm (production-grade). Goal: catch degradation before the client does

Build a daily job that samples K prod traces, runs the reference-free judge (faithfulness + relevancy, no ground truth needed live), and alerts when the daily score falls more than 2σ below its rolling 7-day mean OR when the retrieval-score distribution shifts (KS test on top-1 similarity). Wire it to fire a PushNotification-style alert. Hint/Solution: two independent drift signals matter: input drift (queries change → KS test on retrieval scores) and output drift (answers degrade → faithfulness trend). They have different root causes (new content / new users vs. model or index regression). Alert on both.

7. Break the batch grader, then make it survive. Goal: a flaky network must not crash a 200-trace eval

Take the naive grade_batch (plain asyncio.gather over 200 coroutines, no semaphore, no try/except). Run it against a judge client with max_retries=0 while injecting failures: monkeypatch ~10% of calls to raise RateLimitError, and add a Semaphore-free burst so you trip your own tokens-per-minute (TPM) limit. Watch the whole run die on the first unhandled exception. Then rebuild it into grade_batch_resilient (semaphore-bounded, wait_for timeout, typed-exception catch, None sentinel) and prove it: 200 traces in, ~180 graded, ~20 logged failures, and the reported faithfulness carries the real denominator ("0.84 over 183/200"). Compare wall-clock and peak concurrency before/after. Hint/Solution: the un-bounded gather self-DoSes: you cause the 429s you then fail on. The fix is two-layered: cap in-flight calls with the semaphore (stop creating the rate-limit pressure) and catch exhausted-retry exceptions per-call (survive the ones that still fail). The trap is reporting mean(graded) as if it were over 200, which silently inflates the score by dropping the hard cases that timed out. Always surface the failure count next to the metric.

8. Cost-defend a model downgrade — Goal: turn "use Opus everywhere" into a per-bucket routing decision backed by numbers

Run generation on claude-opus-4-8 and again on claude-haiku-4-5 over a stratified golden set (factoid / multi-hop / aggregation / unanswerable). For each bucket, compute answer_correctness and faithfulness with a paired test, and compute cost/answered-query from resp.usage for each model. Produce a routing recommendation: which buckets can safely move to Haiku (quality delta inside the noise band) and which must stay on Opus, with the projected blended cost/query and the quality you're trading for it. Hint/Solution: the win is almost never "downgrade everything" or "keep Opus everywhere"; it's routing. Factoid/unanswerable buckets usually survive Haiku with no significant faithfulness loss (cheap wins); multi-hop and aggregation often regress significantly (keep Opus). The deliverable is a table the PM can sign off on (quality delta + p-value + cost delta per bucket), not a single blended number that hides where you'd be hurting users to save pennies.
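The per-bucket decision and the blended cost could be computed along these lines. Everything numeric here is a placeholder: the prices in `PRICE_PER_MTOK` are hypothetical, the keys "large" and "small" stand in for the two models, and `BucketResult`, `route`, and `blended_cost` are names invented for this sketch. Cache-read and cache-write tokens are ignored for brevity.

```python
from dataclasses import dataclass

# HYPOTHETICAL prices in USD per million tokens; substitute the current price sheet.
PRICE_PER_MTOK = {
    "large": {"input": 10.0, "output": 50.0},
    "small": {"input": 1.0, "output": 5.0},
}


def cost_usd(model, input_tokens, output_tokens):
    """Cost of one call from the token counts reported in the response usage."""
    price = PRICE_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


@dataclass
class BucketResult:
    bucket: str
    quality_delta: float  # small minus large on the chosen metric
    p_value: float        # from the paired test on this bucket
    traffic_share: float  # fraction of production queries in this bucket
    cost_large: float     # mean USD per answered query on the large model
    cost_small: float     # mean USD per answered query on the small model


def route(result, alpha=0.05):
    """Keep the large model only where the small one regresses significantly."""
    regressed = result.quality_delta < 0 and result.p_value < alpha
    return "large" if regressed else "small"


def blended_cost(results, alpha=0.05):
    """Projected USD per query once each bucket is routed."""
    total = 0.0
    for r in results:
        chosen = r.cost_large if route(r, alpha) == "large" else r.cost_small
        total += r.traffic_share * chosen
    return total


def routing_table(results, alpha=0.05):
    """One row per bucket: the table a PM can sign off on."""
    return [
        (r.bucket, route(r, alpha), r.quality_delta, r.p_value, r.cost_small - r.cost_large)
        for r in results
    ]
```

A non-significant delta is not proof of equivalence, especially on a small bucket, so the table keeps the raw delta and p-value next to each routing decision instead of reporting only the verdict.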

🎤 In the interview

"Your faithfulness went from 0.82 to 0.85 — do you ship?" Not on that alone. On a 50-question set that gap is likely inside the noise band; I run a paired test, check p < 0.05, and confirm latency p95 and cost/query guardrails didn't regress before shipping.
"How do you eval a RAG system with no labeled data?" Start reference-free — faithfulness and answer relevancy need no gold answers. Ship behind a feedback widget, mine thumbs-down + low-faithfulness traces into the first 50 golden cases, then graduate to ground-truth metrics. Never blocked on labels to start measuring.
"An answer is wrong. How do you find out why?" Pull the trace by request_id and read the two stages separately. High context_recall + low faithfulness = generator hallucinated (fix prompt/model). Low recall = the answer was doomed at retrieval (fix chunking/embeddings/reranking). I never debug "the answer is bad" as one thing.
"Can you trust an LLM as a judge?" Only after calibrating it. The judge is itself a model that needs eval — I measure Cohen's κ against ~50 human labels before scaling it, use a stronger judge than the generator (claude-opus-4-8 judging claude-sonnet-4-6 output), and control for position/verbosity/self-preference bias with a structured rubric.
"Your prompt caching shows zero hits. Where do you look?" First: is the cached prefix above the model's minimum? On claude-opus-4-8 it's ~4096 tokens — a 60-token system prompt with cache_control caches nothing, silently. Second: is there a silent invalidator ahead of the breakpoint — a datetime.now(), a per-request UUID, unsorted JSON — that changes the prefix bytes every call? I diff the rendered prefix of two requests to find it. cache_read_input_tokens from resp.usage is how I confirm the fix.
"You're spending too much. Where's the first cut?" I look at cost/answered-query and attribute it by tenant_id, not the monthly total. Usually the win is per-bucket model routing — move the factoid and unanswerable traffic to claude-haiku-4-5, keep multi-hop on claude-opus-4-8 — defended with a paired quality test so I know exactly what I'm trading. Then I check the cache hit-rate: if reads are near zero on cacheable traffic, I'm paying ~10× on input tokens for nothing.
"How much of your eval do you run, and on what model?" Golden-set metrics gate every deploy (offline, full set). Reference-free metrics monitor prod on a sample: I don't grade every trace, I grade the low-confidence tail. The judge runs on a model at least as strong as the generator, which means eval cost is real; if it approaches serving cost, I'm over-judging and I lower the sampling rate.
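The prefix diff mentioned in the prompt-caching answer can be as small as the sketch below. It assumes the cacheable prefix can be rendered to a single string. `render_prefix` is a hypothetical renderer and `first_divergence` a hypothetical helper; neither is part of any SDK.

```python
import json


def render_prefix(system_prompt, tools):
    """Hypothetical renderer: serialize the cacheable prefix deterministically."""
    # sort_keys removes one common silent invalidator: unstable JSON key order.
    return system_prompt + "\n" + json.dumps(tools, sort_keys=True, separators=(",", ":"))


def first_divergence(a, b, context=30):
    """Return (offset, snippet_a, snippet_b) at the first difference, or None if equal."""
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            lo = max(0, i - context)
            return i, a[lo:i + context], b[lo:i + context]
    if len(a) != len(b):
        lo = max(0, limit - context)
        return limit, a[lo:limit + context], b[lo:limit + context]
    return None


# Two requests whose only difference is a timestamp baked into the system prompt.
req_1 = render_prefix("You are a grader. Now: 2024-01-01T10:00:00", {"b": 1, "a": 2})
req_2 = render_prefix("You are a grader. Now: 2024-01-01T10:00:07", {"a": 2, "b": 1})
print(first_divergence(req_1, req_2))
```

The reported offset points at the timestamp, which is the kind of per-request value that has to move behind the cache breakpoint or out of the prompt entirely.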

