LangGraph — Agentic State Machines
Phase 3. Companion to DL.AI AI Agents in LangGraph.
Why LangGraph (not LangChain "agents")
LangChain "agents" (AgentExecutor) = magic. A while loop you don't own, a prompt you can't see, a control flow you can't step through. Great for a demo, miserable in production : when it loops forever, calls the wrong tool, or burns $40 of tokens on one request, you have no seam to put a breakpoint, a budget, or a retry.
LangGraph = explicit state machine. You define :
- State (typed dict of what flows through the graph)
- Nodes (pure-ish functions that read state and return a partial update)
- Edges (which node runs next — static or conditional on state)
This is MUCH closer to how Temporal works (which you know from Dravos). Same mental model : a durable, inspectable, resumable execution graph where each step is a unit of work and the orchestrator — not the LLM — owns the control flow.
The mental model that matters
The single most important reframe : the LLM is a node, not the loop. In a LangChain agent, the model decides what happens next on every turn and you hope it stops. In LangGraph, you wrote the edges. The model can influence routing (via a value it writes into state that a conditional edge reads), but it cannot invent a new transition you didn't draw. That is the entire reason a staff engineer reaches for it : the failure surface is bounded by the graph, not by the model's mood.
| You know (Temporal / NestJS) | LangGraph equivalent |
|---|---|
| Workflow definition | The compiled StateGraph |
| Activity | A node function |
| Workflow state / event history | The State TypedDict + checkpointer |
Signal / await condition | interrupt() (human-in-the-loop) |
| Continue-as-new / replay | Checkpoint + resume from thread_id |
| Saga / compensation | Conditional edge to an error/rollback node |
| Deterministic replay | Same — replaying the graph from a checkpoint must be deterministic |
Internalize this table and most of LangGraph is "Temporal for LLM steps."
Core concepts
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
# State is a TypedDict. Channels can declare a *reducer* (how concurrent / repeated
# writes merge). `add_messages` appends instead of overwriting — critical for chat.
class AgentState(TypedDict):
messages: Annotated[list, add_messages] # appended, not replaced
retrieved_docs: list
answer: str | None
needs_clarification: bool
revisions: int # loop guard — see below
def retrieve(state: AgentState) -> dict:
docs = vector_search(state["messages"][-1].content)
return {"retrieved_docs": docs} # return a PARTIAL update, not full state
def generate(state: AgentState) -> dict:
answer = llm_answer(state["messages"], context=state["retrieved_docs"])
return {"answer": answer, "revisions": state["revisions"] + 1}
def should_clarify(state: AgentState) -> str:
if state["needs_clarification"] and state["revisions"] < 3:
return "clarify"
return "end"
graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", should_clarify, {"clarify": "retrieve", "end": END})
graph.set_entry_point("retrieve")
app = graph.compile()Three things a staff engineer notices immediately about this snippet :
- Nodes return partial updates, not the whole state.
retrievereturns{"retrieved_docs": docs}— LangGraph merges it via each channel's reducer. Returning the full blob is the #1 beginner bug : it silently clobbers fields other nodes wrote, especially under parallel (map-reduce) execution. - Reducers are the concurrency model.
Annotated[list, add_messages]means "when two nodes writemessagesin the same superstep, append both" instead of last-writer-wins. This is exactly the channel/CRDT idea — get it wrong and parallel branches lose each other's work nondeterministically. - Every loop needs a guard.
revisions < 3is not optional polish. A conditional edge that can route back to a prior node is an unbounded loop unless state caps it. This is the LLM-agent version of "always set a Temporal workflow timeout."
Patterns to learn
1. Linear pipeline
A → B → C. Simplest. Replaces basic chains. Use when there's no branching — don't reach for a graph if a function would do.
2. Conditional routing
After A, decide between B and C based on state. The router is a plain Python function returning a string key — keep the routing decision out of the LLM where you can (a classifier or a rule is cheaper, faster, and testable). When the LLM must route, have it emit a structured field (see structured outputs below) rather than free text you regex.
3. Loop with exit condition
"Retrieve → generate → critique → if bad, retrieve again, up to N times." The reflection / self-critique pattern. The exit condition lives in state (revisions < N and a quality gate), never "until the model says it's happy" — models will happily say they're happy on turn 1 or never.
4. Map-reduce (fan-out / fan-in)
Run N nodes in parallel, aggregate. LangGraph expresses this with Send (dispatch one branch per item) and a reducer channel that merges results. This is where async matters : if each branch calls the model, you want them genuinely concurrent (see the production section). The classic failure mode is a non-appending reducer that drops all but one branch's output.
5. Human-in-the-loop
Pause the graph, wait for human input via interrupt(), resume. Requires a checkpointer (the graph must persist its state to survive the wait). This is the Temporal signal. Use it for irreversible actions : send-email, place-order, merge-PR. The approval gate is an edge, not a prompt instruction.
6. Checkpointing
Persist state between supersteps (a database checkpointer). Enables :
- Long-running agents (pause/resume across days)
- Crash recovery (resume from the last checkpoint, not turn 0)
- Time-travel debugging (replay to any checkpoint, fork an alternate path)
- Multi-tenant memory keyed by
thread_id
In production you do not use the in-memory checkpointer. Use langgraph-checkpoint-postgres (or SQLite for single-node). The checkpointer is your durability boundary — treat it like a Temporal event store.
State design
State design is the architecture. Get it right and nodes stay simple, testable, and parallel-safe.
Bad state (one big blob) :
class State(TypedDict):
everything: dictGood state (typed, granular, with reducers where needed) :
from typing import Annotated, TypedDict
from operator import add
class State(TypedDict):
messages: Annotated[list[Message], add_messages]
retrieved: list[Document]
plan: list[Step]
cursor: int
errors: Annotated[list[Error], add] # branches append errors, don't overwrite
final_answer: str | None→ Easier to debug, type-check, visualize, and — crucially — to reason about under concurrency. The reducer annotation is the difference between "two parallel branches each found an error" being recorded as two errors versus one randomly-surviving error.
Staff-level rule of thumb : a field with no reducer is last-writer-wins. That's fine for scalars one node owns (cursor, final_answer), wrong for anything multiple nodes or parallel branches append to. Audit every channel for "who writes this, and can two writers race?"
Memory in LangGraph
Short-term (within one thread / conversation)
- Just keep it in state (
messageswith theadd_messagesreducer). - Watch the window : an unbounded
messageslist grows your prompt — and your bill — every turn. Trim or summarize. On the model side, prompt caching (below) makes the stable prefix cheap, but the growing tail still costs full price.
Long-term (across conversations / threads)
- Store in a DB / vector store, retrieve via a "memory" node.
- LangGraph's
StoreAPI gives you namespaced cross-thread KV (e.g. per-user preferences). - LangMem / Letta integrate here if you want managed memory.
The split mirrors Claude's own surfaces : short-term = the message list (cheap, cache the prefix); long-term = an external store you read into a node. Don't conflate them — putting a user's entire history into every prompt is the classic cost blowup.
Plugging in the model — the node where the LLM lives
A LangGraph node is just a function, so the model call is yours to own. For a Python/NestJS-shaped backend serving concurrent users, this is where the production-grade Anthropic SDK usage belongs. The 2026 flagship is claude-opus-4-8 (Opus 4.8, $5 / $25 per Mtok in/out at 1M context); use claude-sonnet-4-6 for routine nodes and claude-haiku-4-5 ($1 / $5) for cheap routers/classifiers.
import anthropic
from anthropic import AsyncAnthropic
# One client for the process. AsyncAnthropic for a server — never the sync client
# inside an async graph, or you block the event loop on every model call.
client = AsyncAnthropic(max_retries=3) # SDK retries 429/5xx with backoff
SYSTEM = [
{
"type": "text",
"text": LARGE_STABLE_SYSTEM_PROMPT, # frozen, byte-identical every call
"cache_control": {"type": "ephemeral"}, # cache the stable prefix
}
]
async def generate(state: AgentState) -> dict:
try:
resp = await client.messages.create(
model="claude-opus-4-8",
max_tokens=16000,
system=SYSTEM,
# Adaptive thinking — NOT budget_tokens (that form is removed on 4.7/4.8
# and returns HTTP 400). Control depth with output_config.effort instead.
thinking={"type": "adaptive"},
output_config={"effort": "high"}, # low | medium | high | xhigh | max
messages=state["messages"],
timeout=60.0, # per-call wall-clock budget
)
except anthropic.RateLimitError:
# SDK already retried; surface for a graph-level fallback edge.
raise
except anthropic.APIStatusError as e:
log.error("model call failed", status=e.status_code, type=e.type)
raise
log.info("usage", **resp.usage.model_dump()) # log usage for cost/observability
text = "".join(b.text for b in resp.content if b.type == "text")
return {"answer": text, "revisions": state["revisions"] + 1}Correctness notes a reviewer will check :
- Adaptive thinking, not a thinking budget. The old
thinking={"type": "enabled", "budget_tokens": N}form is removed on Opus 4.7/4.8 and returns a 400. Usethinking={"type": "adaptive"}and tune depth withoutput_config={"effort": ...}. Sonnet 4.6 / Haiku take adaptive thinking too (no budget). AsyncAnthropicin an async graph. A sync client insideasync defnodes serializes every model call behind the event loop — your "parallel" map-reduce becomes sequential and your server's throughput collapses.- Typed exceptions. Branch on
RateLimitError/APIStatusError/OverloadedError/APITimeoutError, never on string-matching the message. resp.usageis your cost telemetry. Log it from every node; it's the only ground truth for per-request spend and cache hit rate (cache_read_input_tokens).
Structured routing — let the model fill a schema, not free text
When a node's output drives a conditional edge, do not parse free text. Use native structured outputs so the router reads a typed field :
from pydantic import BaseModel
class Route(BaseModel):
intent: str # "search" | "answer" | "clarify"
needs_clarification: bool
async def classify(state: AgentState) -> dict:
resp = await client.messages.parse( # native parse → validated Pydantic
model="claude-haiku-4-5", # cheap model for a routing decision
max_tokens=256,
messages=state["messages"],
output_config={"format": Route},
)
r = resp.parsed # typed; no regex, no XML hand-rolling
return {"needs_clarification": r.needs_clarification}messages.parse() with a Pydantic schema (or output_config.format) beats hand-rolled "respond in JSON" prompting every time : the SDK validates, and a malformed response is an exception you can route on instead of a KeyError three nodes later.
Parallel tool calls — gather, don't loop
import asyncio
async def fan_out_tools(state: AgentState) -> dict:
# Independent tool calls run concurrently. asyncio.gather, not a for-loop of awaits.
results = await asyncio.gather(
*(run_tool(call) for call in state["pending_tool_calls"])
)
return {"tool_results": results} # reducer-merged into stateStreaming
graph.stream(input) yields updates as the graph progresses. Three modes worth knowing :
stream_mode="values"— full state after each superstep (simplest).stream_mode="updates"— just the diff each node produced (less data over the wire).stream_mode="messages"— token-level LLM output, for "typing" UIs.
Use for :
- UI shows the agent's "thinking"/progress in real time (better perceived latency even when wall-clock is identical).
- Token streaming to the user from the model node.
- Backpressure / early-cancel : a streamed graph can be aborted mid-run.
On the model side, stream the Anthropic call for any large output (max_tokens over ~16K) to avoid SDK HTTP timeouts, and bridge those deltas into the graph's messages stream.
Tracing / Observability
LangGraph integrates with LangSmith out of the box — every node call, every state transition, every model request traced, with token counts and latency per step. Also works with OpenTelemetry / Phoenix if you want vendor-neutral spans into your existing stack.
What a staff engineer actually instruments :
- Per-node latency and token cost — find the expensive node, not the expensive request.
- Cache hit rate (
cache_read_input_tokens/ total) — a regression here means a silent prefix invalidator crept in (a timestamp in the system prompt, a reordered tool list). - Loop counts — alert when
revisionssaturates its cap; that's a quality regression hiding as "it still works." - Checkpoint size — unbounded
messagesshows up as a growing checkpoint and a growing bill.
Determinism, testing, and why the orchestrator owns control flow
The reason a staff engineer reaches for LangGraph over a hand-rolled loop is testability through a deterministic seam. The graph topology — nodes, edges, reducers — is pure Python you can unit-test without a single model call. The nondeterminism is quarantined inside node bodies, which you mock.
- Test the routing, not the model. A conditional edge is
should_clarify(state) -> str. Feed it synthetic states and assert the returned key. No API, no flakiness, no cost. This is the single highest-leverage test in an agentic system: it proves the control flow is correct independent of what the LLM says. - Test loop termination adversarially. Construct a state that always "needs clarification" and assert the graph halts within N supersteps. A loop guard you never tested is a loop guard you don't have.
- Golden-path integration tests use a stub model node. Swap the real
generatenode for one that returns a canned answer. You're now testing the graph's wiring end-to-end at zero cost and zero latency. - Replay determinism is a contract. Resuming from a checkpoint must reproduce the same transitions given the same state — exactly the Temporal rule. The hazard: a node that reads
datetime.now(), a random seed, or an unsorted set leaks nondeterminism into state, and your "time-travel" replay diverges. Push all nondeterminism (clock, RNG, external reads) into node inputs you can pin, never into node bodies that run on replay.
The mental model: the model is the only nondeterministic component, and you've boxed it inside a node. Everything around it — routing, looping, fan-out, persistence — is deterministic infrastructure you test like any other state machine.
Failure modes a staff engineer designs against
| Failure mode | Symptom | Defense |
|---|---|---|
| Unbounded loop | A request burns minutes and dollars; revisions saturates and never exits | A loop guard in state read by the conditional edge — revisions < N and a real quality gate. Never "until the model is happy." |
| Full-state clobber | A field another node wrote silently vanishes, nondeterministically, under parallel branches | Nodes return partial updates; every multi-writer channel has a reducer (add_messages, operator.add). Audit: "who writes this, can two writers race?" |
| Checkpointer is in-memory in prod | A crash or restart loses the entire run; human-in-the-loop never resumes | langgraph-checkpoint-postgres (or SQLite single-node). The in-memory checkpointer is for tests only. |
| Silent cache invalidation | Cost doubles; cache_read_input_tokens drops to ~0 with no error | A timestamp/UUID crept into the stable system prefix, or the tool list reordered. Alert on cache hit rate, not just on errors. |
| Sync client in an async graph | Throughput collapses; "parallel" map-reduce runs sequentially | AsyncAnthropic + asyncio.gather. A sync call inside async def serializes every model call behind the event loop. |
| Model overload not handled | A 529/OverloadedError aborts the whole graph mid-run | Typed-exception branch routes to a fallback edge: downgrade claude-opus-4-8 → claude-sonnet-4-6, or back off and retry. The SDK retries transient errors; you handle the terminal ones. |
| Checkpoint bloat | Latency and bill creep up over a long conversation; checkpoint size grows unbounded | Trim/summarize messages; cap the window. An unbounded add_messages channel is a slow leak. |
| Non-deterministic replay | Time-travel debugging produces a different answer than production did | No clock/RNG/external read inside node bodies on the replay path — pin them as inputs. |
The throughline: bound every loop, give every multi-writer channel a reducer, make durability real (Postgres), and treat the model call as a thing that fails — typed, retried, and fallback-routed.
Comparison to alternatives
| Framework | Strength | Weakness |
|---|---|---|
| LangGraph | Explicit, debuggable, checkpointable, mature | Steeper learning curve; you own state design |
LangChain Agent (AgentExecutor) | Easy start, magic | Opaque control flow, hard to bound, deprecated trend |
| crewAI | Multi-agent natural API | Less flexible, smaller ecosystem, weak control |
| AutoGen | Conversational multi-agent | Microsoft-specific feel, research-y |
| Anthropic Managed Agents | Anthropic runs the loop + hosts the tool sandbox; persisted versioned agent configs | Less control over orchestration; vendor-hosted |
| Vercel AI SDK + raw | TS-native, simple | No state machine — you build orchestration |
| Raw SDK (Anthropic) | Maximum control | You build the loop, checkpointing, everything |
→ In 2026, the production patterns are : LangGraph + raw SDK + MCP when you want to own orchestration and host your own tools; Anthropic Managed Agents when you want Anthropic to run the agent loop and host a per-session container. crewAI for prototyping, AutoGen for research. The decision hinges on one question : do you want to own the loop and the tool sandbox, or rent them?
TypeScript option
LangGraph JS exists and has matured a lot — same State/Node/Edge model, same checkpointer concept.
→ For your stack (TS + Python), the usual split : Python LangGraph for the agentic backend (richer ecosystem, the data/ML libs live here), TS/Angular for the frontend, talking to the graph over a streaming endpoint (SSE/WebSocket bridged from graph.stream). If your whole team is TS and the agents are simple, LangGraph JS is a legitimate single-language choice.
🎤 En entretien
- "Why LangGraph over a LangChain
AgentExecutor?" — Because the orchestrator owns control flow, not the model : explicit edges bound the failure surface, conditional routing and loops are testable, and checkpointing gives crash-recovery and human-in-the-loop.AgentExecutoris an opaquewhileloop you can't put a breakpoint, a budget, or a deterministic test on. - "How do you stop an agentic loop from running forever / costing too much?" — A loop guard in state (e.g.
revisions < N) read by the conditional edge, plus a hardmax_tokensand a per-calltimeouton the model, plusoutput_config.effortto cap reasoning depth. Never "loop until the model says it's done." - "Two parallel branches both write the same state field. What happens?" — Without a reducer it's last-writer-wins and you nondeterministically lose one branch's work. You annotate the channel with a reducer (
add_messages,operator.add) so concurrent writes merge — that's LangGraph's concurrency model. - "How is this different from Temporal, which I already know?" — It's the same shape : durable, inspectable, resumable execution graph with a persisted event history (the checkpointer). Nodes ≈ activities, conditional edges ≈ saga branches,
interrupt()≈ signals, resume-from-thread_id≈ replay. The novelty is only that some nodes call an LLM — which is exactly why you still want deterministic orchestration around the nondeterministic step. - "How do you test an agentic graph without burning money on every CI run?" — The topology is pure Python : unit-test router functions (
should_clarify(state) -> str) on synthetic states with no API call, assert loop termination on adversarial inputs, and stub the model node with a canned response for golden-path integration tests. The model is the only nondeterministic component and it's boxed inside a node you mock — everything else is a deterministic state machine you test directly. - "What makes a checkpointed graph's replay non-deterministic, and why do you care?" — A node body that reads the clock, an RNG, or an external source leaks nondeterminism into state, so replaying from a checkpoint diverges from what production did — time-travel debugging lies and crash-recovery can produce a different answer. Fix : push all nondeterminism into pinned node inputs, never into node bodies that run on the replay path. Same discipline as Temporal's deterministic-workflow rule.
🏋️ Exercices
Convert a LangChain agent to a LangGraph graph.Objectif : turn an opaque
AgentExecutor(retrieval + 2 tools) into an explicit State/Node/Edge graph with a typed state. Indice/Solution : model the tools as nodes, the "which tool / answer now" decision as a conditional edge driven by a structured-output classifier (Haiku viamessages.parse()), and a router function returning string keys. Verify the graph compiles andgraph.get_graph().draw_mermaid()matches your whiteboard.Build a research agent : decompose → search each sub-query in parallel → aggregate.Objectif : implement true fan-out/fan-in with
Sendand a reducer channel. Indice/Solution : the decompose node emits NSend("search", {...}); thesearchnode is async and usesAsyncAnthropic; aggregate via anAnnotated[list, operator.add]channel. Prove it's actually concurrent by logging start/end timestamps — if they don't overlap you accidentally used the sync client or afor awaitloop.Add a human-in-the-loop approval gate before an irreversible action (send email).Objectif : pause the graph at the gate, surface the draft, resume on approval/rejection. Indice/Solution : attach a Postgres checkpointer, call
interrupt()in the gate node, resume withCommand(resume=decision)keyed bythread_id. Now break it : kill the process between interrupt and resume, restart, and resume the samethread_id— it must continue, not restart. If it restarts, your checkpointer is in-memory.Make the loop production-grade : bound it, observe it, and survive a model overload.Objectif : take the reflect/critique loop from a toy to something you'd page on. Indice/Solution : add a
revisionscap and a quality gate to the exit edge; wrap the model node with typed-exception handling that routesOverloadedError/RateLimitErrorto a fallback edge (downgradeclaude-opus-4-8→claude-sonnet-4-6, or back off and retry); logresp.usageper node; assert in a test that the loop terminates within N supersteps on an adversarial input that keeps "needing clarification."Defend the number : cost-optimize the graph and prove the saving.Objectif : cut per-request token spend without losing quality, and show the receipts. Indice/Solution : put
cache_controlon the stable system+tools prefix, move all volatile content (timestamps, per-request IDs) after the last breakpoint, route the cheap classification node toclaude-haiku-4-5, and trim themessageswindow. Measure before/after withusage.cache_read_input_tokensand total input tokens across a fixed eval set. Be ready to explain why cache hit rate is the lever (cache reads ≈ 0.1× input price) and what silently invalidates it.Time-travel debug a wrong answer.Objectif : use checkpoints to find and fix the node that introduced a bad state, without re-running the whole graph. Indice/Solution : list checkpoints for the
thread_id, inspect state at each superstep, identify the node whose output diverged, fork from the prior checkpoint with a corrected input, and confirm the downstream answer changes. Bonus : write a regression test that pins that checkpoint's input → expected output so the bug can't silently return.Break replay determinism, then make it deterministic.Objectif : prove that a node body reading the clock/RNG corrupts checkpoint replay, then fix it. Indice/Solution : put a
datetime.now()(orrandom.random()) inside a node and write its result into state. Run to a checkpoint, then replay from an earlier checkpoint — observe that downstream state diverges from the original run, so time-travel debugging now lies. Fix by hoisting the clock/RNG to a pinned input (passed in at invoke time, or read once at the entry node and stored), so every replay reproduces the same transitions. Assert in a test that replaying the same checkpoint twice yields byte-identical downstream state.Make the graph testable end-to-end at zero API cost.Objectif : build a CI suite that proves control-flow correctness without a single model call. Indice/Solution : (a) unit-test each router function on synthetic states; (b) stub the model node (dependency-inject the client, or monkeypatch the node) to return canned answers, and assert the golden path reaches
ENDwith the expected state; (c) inject a model node that always setsneeds_clarification=Trueand assert the loop still terminates within N supersteps. The whole suite must run offline. If any test needs a real API key, the nondeterminism isn't boxed inside a node yet — refactor until it is.
Resources
- Docs : langchain-ai.github.io/langgraph
- Course : DL.AI "AI Agents in LangGraph"
- Source code : github.com/langchain-ai/langgraph
- Tutorials : langchain-ai.github.io/langgraph/tutorials