Skip to content

LangGraph — Agentic State Machines

Phase 3. Companion to DL.AI AI Agents in LangGraph.

Why LangGraph (not LangChain "agents")

LangChain "agents" (AgentExecutor) = magic. A while loop you don't own, a prompt you can't see, a control flow you can't step through. Great for a demo, miserable in production : when it loops forever, calls the wrong tool, or burns $40 of tokens on one request, you have no seam to put a breakpoint, a budget, or a retry.

LangGraph = explicit state machine. You define :

  • State (typed dict of what flows through the graph)
  • Nodes (pure-ish functions that read state and return a partial update)
  • Edges (which node runs next — static or conditional on state)

This is MUCH closer to how Temporal works (which you know from Dravos). Same mental model : a durable, inspectable, resumable execution graph where each step is a unit of work and the orchestrator — not the LLM — owns the control flow.

The mental model that matters

The single most important reframe : the LLM is a node, not the loop. In a LangChain agent, the model decides what happens next on every turn and you hope it stops. In LangGraph, you wrote the edges. The model can influence routing (via a value it writes into state that a conditional edge reads), but it cannot invent a new transition you didn't draw. That is the entire reason a staff engineer reaches for it : the failure surface is bounded by the graph, not by the model's mood.

You know (Temporal / NestJS)LangGraph equivalent
Workflow definitionThe compiled StateGraph
ActivityA node function
Workflow state / event historyThe State TypedDict + checkpointer
Signal / await conditioninterrupt() (human-in-the-loop)
Continue-as-new / replayCheckpoint + resume from thread_id
Saga / compensationConditional edge to an error/rollback node
Deterministic replaySame — replaying the graph from a checkpoint must be deterministic

Internalize this table and most of LangGraph is "Temporal for LLM steps."

Core concepts

python
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

# State is a TypedDict. Channels can declare a *reducer* (how concurrent / repeated
# writes merge). `add_messages` appends instead of overwriting — critical for chat.
class AgentState(TypedDict):
    messages: Annotated[list, add_messages]   # appended, not replaced
    retrieved_docs: list
    answer: str | None
    needs_clarification: bool
    revisions: int                            # loop guard — see below

def retrieve(state: AgentState) -> dict:
    docs = vector_search(state["messages"][-1].content)
    return {"retrieved_docs": docs}           # return a PARTIAL update, not full state

def generate(state: AgentState) -> dict:
    answer = llm_answer(state["messages"], context=state["retrieved_docs"])
    return {"answer": answer, "revisions": state["revisions"] + 1}

def should_clarify(state: AgentState) -> str:
    if state["needs_clarification"] and state["revisions"] < 3:
        return "clarify"
    return "end"

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", should_clarify, {"clarify": "retrieve", "end": END})
graph.set_entry_point("retrieve")

app = graph.compile()

Three things a staff engineer notices immediately about this snippet :

  1. Nodes return partial updates, not the whole state. retrieve returns {"retrieved_docs": docs} — LangGraph merges it via each channel's reducer. Returning the full blob is the #1 beginner bug : it silently clobbers fields other nodes wrote, especially under parallel (map-reduce) execution.
  2. Reducers are the concurrency model. Annotated[list, add_messages] means "when two nodes write messages in the same superstep, append both" instead of last-writer-wins. This is exactly the channel/CRDT idea — get it wrong and parallel branches lose each other's work nondeterministically.
  3. Every loop needs a guard. revisions < 3 is not optional polish. A conditional edge that can route back to a prior node is an unbounded loop unless state caps it. This is the LLM-agent version of "always set a Temporal workflow timeout."

Patterns to learn

1. Linear pipeline

A → B → C. Simplest. Replaces basic chains. Use when there's no branching — don't reach for a graph if a function would do.

2. Conditional routing

After A, decide between B and C based on state. The router is a plain Python function returning a string key — keep the routing decision out of the LLM where you can (a classifier or a rule is cheaper, faster, and testable). When the LLM must route, have it emit a structured field (see structured outputs below) rather than free text you regex.

3. Loop with exit condition

"Retrieve → generate → critique → if bad, retrieve again, up to N times." The reflection / self-critique pattern. The exit condition lives in state (revisions < N and a quality gate), never "until the model says it's happy" — models will happily say they're happy on turn 1 or never.

4. Map-reduce (fan-out / fan-in)

Run N nodes in parallel, aggregate. LangGraph expresses this with Send (dispatch one branch per item) and a reducer channel that merges results. This is where async matters : if each branch calls the model, you want them genuinely concurrent (see the production section). The classic failure mode is a non-appending reducer that drops all but one branch's output.

5. Human-in-the-loop

Pause the graph, wait for human input via interrupt(), resume. Requires a checkpointer (the graph must persist its state to survive the wait). This is the Temporal signal. Use it for irreversible actions : send-email, place-order, merge-PR. The approval gate is an edge, not a prompt instruction.

6. Checkpointing

Persist state between supersteps (a database checkpointer). Enables :

  • Long-running agents (pause/resume across days)
  • Crash recovery (resume from the last checkpoint, not turn 0)
  • Time-travel debugging (replay to any checkpoint, fork an alternate path)
  • Multi-tenant memory keyed by thread_id

In production you do not use the in-memory checkpointer. Use langgraph-checkpoint-postgres (or SQLite for single-node). The checkpointer is your durability boundary — treat it like a Temporal event store.

State design

State design is the architecture. Get it right and nodes stay simple, testable, and parallel-safe.

Bad state (one big blob) :

python
class State(TypedDict):
    everything: dict

Good state (typed, granular, with reducers where needed) :

python
from typing import Annotated, TypedDict
from operator import add

class State(TypedDict):
    messages: Annotated[list[Message], add_messages]
    retrieved: list[Document]
    plan: list[Step]
    cursor: int
    errors: Annotated[list[Error], add]   # branches append errors, don't overwrite
    final_answer: str | None

→ Easier to debug, type-check, visualize, and — crucially — to reason about under concurrency. The reducer annotation is the difference between "two parallel branches each found an error" being recorded as two errors versus one randomly-surviving error.

Staff-level rule of thumb : a field with no reducer is last-writer-wins. That's fine for scalars one node owns (cursor, final_answer), wrong for anything multiple nodes or parallel branches append to. Audit every channel for "who writes this, and can two writers race?"

Memory in LangGraph

Short-term (within one thread / conversation)

  • Just keep it in state (messages with the add_messages reducer).
  • Watch the window : an unbounded messages list grows your prompt — and your bill — every turn. Trim or summarize. On the model side, prompt caching (below) makes the stable prefix cheap, but the growing tail still costs full price.

Long-term (across conversations / threads)

  • Store in a DB / vector store, retrieve via a "memory" node.
  • LangGraph's Store API gives you namespaced cross-thread KV (e.g. per-user preferences).
  • LangMem / Letta integrate here if you want managed memory.

The split mirrors Claude's own surfaces : short-term = the message list (cheap, cache the prefix); long-term = an external store you read into a node. Don't conflate them — putting a user's entire history into every prompt is the classic cost blowup.

Plugging in the model — the node where the LLM lives

A LangGraph node is just a function, so the model call is yours to own. For a Python/NestJS-shaped backend serving concurrent users, this is where the production-grade Anthropic SDK usage belongs. The 2026 flagship is claude-opus-4-8 (Opus 4.8, $5 / $25 per Mtok in/out at 1M context); use claude-sonnet-4-6 for routine nodes and claude-haiku-4-5 ($1 / $5) for cheap routers/classifiers.

python
import anthropic
from anthropic import AsyncAnthropic

# One client for the process. AsyncAnthropic for a server — never the sync client
# inside an async graph, or you block the event loop on every model call.
client = AsyncAnthropic(max_retries=3)   # SDK retries 429/5xx with backoff

SYSTEM = [
    {
        "type": "text",
        "text": LARGE_STABLE_SYSTEM_PROMPT,        # frozen, byte-identical every call
        "cache_control": {"type": "ephemeral"},    # cache the stable prefix
    }
]

async def generate(state: AgentState) -> dict:
    try:
        resp = await client.messages.create(
            model="claude-opus-4-8",
            max_tokens=16000,
            system=SYSTEM,
            # Adaptive thinking — NOT budget_tokens (that form is removed on 4.7/4.8
            # and returns HTTP 400). Control depth with output_config.effort instead.
            thinking={"type": "adaptive"},
            output_config={"effort": "high"},       # low | medium | high | xhigh | max
            messages=state["messages"],
            timeout=60.0,                            # per-call wall-clock budget
        )
    except anthropic.RateLimitError:
        # SDK already retried; surface for a graph-level fallback edge.
        raise
    except anthropic.APIStatusError as e:
        log.error("model call failed", status=e.status_code, type=e.type)
        raise

    log.info("usage", **resp.usage.model_dump())     # log usage for cost/observability
    text = "".join(b.text for b in resp.content if b.type == "text")
    return {"answer": text, "revisions": state["revisions"] + 1}

Correctness notes a reviewer will check :

  • Adaptive thinking, not a thinking budget. The old thinking={"type": "enabled", "budget_tokens": N} form is removed on Opus 4.7/4.8 and returns a 400. Use thinking={"type": "adaptive"} and tune depth with output_config={"effort": ...}. Sonnet 4.6 / Haiku take adaptive thinking too (no budget).
  • AsyncAnthropic in an async graph. A sync client inside async def nodes serializes every model call behind the event loop — your "parallel" map-reduce becomes sequential and your server's throughput collapses.
  • Typed exceptions. Branch on RateLimitError / APIStatusError / OverloadedError / APITimeoutError, never on string-matching the message.
  • resp.usage is your cost telemetry. Log it from every node; it's the only ground truth for per-request spend and cache hit rate (cache_read_input_tokens).

Structured routing — let the model fill a schema, not free text

When a node's output drives a conditional edge, do not parse free text. Use native structured outputs so the router reads a typed field :

python
from pydantic import BaseModel

class Route(BaseModel):
    intent: str          # "search" | "answer" | "clarify"
    needs_clarification: bool

async def classify(state: AgentState) -> dict:
    resp = await client.messages.parse(     # native parse → validated Pydantic
        model="claude-haiku-4-5",           # cheap model for a routing decision
        max_tokens=256,
        messages=state["messages"],
        output_config={"format": Route},
    )
    r = resp.parsed                          # typed; no regex, no XML hand-rolling
    return {"needs_clarification": r.needs_clarification}

messages.parse() with a Pydantic schema (or output_config.format) beats hand-rolled "respond in JSON" prompting every time : the SDK validates, and a malformed response is an exception you can route on instead of a KeyError three nodes later.

Parallel tool calls — gather, don't loop

python
import asyncio

async def fan_out_tools(state: AgentState) -> dict:
    # Independent tool calls run concurrently. asyncio.gather, not a for-loop of awaits.
    results = await asyncio.gather(
        *(run_tool(call) for call in state["pending_tool_calls"])
    )
    return {"tool_results": results}        # reducer-merged into state

Streaming

graph.stream(input) yields updates as the graph progresses. Three modes worth knowing :

  • stream_mode="values" — full state after each superstep (simplest).
  • stream_mode="updates" — just the diff each node produced (less data over the wire).
  • stream_mode="messages" — token-level LLM output, for "typing" UIs.

Use for :

  • UI shows the agent's "thinking"/progress in real time (better perceived latency even when wall-clock is identical).
  • Token streaming to the user from the model node.
  • Backpressure / early-cancel : a streamed graph can be aborted mid-run.

On the model side, stream the Anthropic call for any large output (max_tokens over ~16K) to avoid SDK HTTP timeouts, and bridge those deltas into the graph's messages stream.

Tracing / Observability

LangGraph integrates with LangSmith out of the box — every node call, every state transition, every model request traced, with token counts and latency per step. Also works with OpenTelemetry / Phoenix if you want vendor-neutral spans into your existing stack.

What a staff engineer actually instruments :

  • Per-node latency and token cost — find the expensive node, not the expensive request.
  • Cache hit rate (cache_read_input_tokens / total) — a regression here means a silent prefix invalidator crept in (a timestamp in the system prompt, a reordered tool list).
  • Loop counts — alert when revisions saturates its cap; that's a quality regression hiding as "it still works."
  • Checkpoint size — unbounded messages shows up as a growing checkpoint and a growing bill.

Determinism, testing, and why the orchestrator owns control flow

The reason a staff engineer reaches for LangGraph over a hand-rolled loop is testability through a deterministic seam. The graph topology — nodes, edges, reducers — is pure Python you can unit-test without a single model call. The nondeterminism is quarantined inside node bodies, which you mock.

  • Test the routing, not the model. A conditional edge is should_clarify(state) -> str. Feed it synthetic states and assert the returned key. No API, no flakiness, no cost. This is the single highest-leverage test in an agentic system: it proves the control flow is correct independent of what the LLM says.
  • Test loop termination adversarially. Construct a state that always "needs clarification" and assert the graph halts within N supersteps. A loop guard you never tested is a loop guard you don't have.
  • Golden-path integration tests use a stub model node. Swap the real generate node for one that returns a canned answer. You're now testing the graph's wiring end-to-end at zero cost and zero latency.
  • Replay determinism is a contract. Resuming from a checkpoint must reproduce the same transitions given the same state — exactly the Temporal rule. The hazard: a node that reads datetime.now(), a random seed, or an unsorted set leaks nondeterminism into state, and your "time-travel" replay diverges. Push all nondeterminism (clock, RNG, external reads) into node inputs you can pin, never into node bodies that run on replay.

The mental model: the model is the only nondeterministic component, and you've boxed it inside a node. Everything around it — routing, looping, fan-out, persistence — is deterministic infrastructure you test like any other state machine.

Failure modes a staff engineer designs against

Failure modeSymptomDefense
Unbounded loopA request burns minutes and dollars; revisions saturates and never exitsA loop guard in state read by the conditional edge — revisions < N and a real quality gate. Never "until the model is happy."
Full-state clobberA field another node wrote silently vanishes, nondeterministically, under parallel branchesNodes return partial updates; every multi-writer channel has a reducer (add_messages, operator.add). Audit: "who writes this, can two writers race?"
Checkpointer is in-memory in prodA crash or restart loses the entire run; human-in-the-loop never resumeslanggraph-checkpoint-postgres (or SQLite single-node). The in-memory checkpointer is for tests only.
Silent cache invalidationCost doubles; cache_read_input_tokens drops to ~0 with no errorA timestamp/UUID crept into the stable system prefix, or the tool list reordered. Alert on cache hit rate, not just on errors.
Sync client in an async graphThroughput collapses; "parallel" map-reduce runs sequentiallyAsyncAnthropic + asyncio.gather. A sync call inside async def serializes every model call behind the event loop.
Model overload not handledA 529/OverloadedError aborts the whole graph mid-runTyped-exception branch routes to a fallback edge: downgrade claude-opus-4-8claude-sonnet-4-6, or back off and retry. The SDK retries transient errors; you handle the terminal ones.
Checkpoint bloatLatency and bill creep up over a long conversation; checkpoint size grows unboundedTrim/summarize messages; cap the window. An unbounded add_messages channel is a slow leak.
Non-deterministic replayTime-travel debugging produces a different answer than production didNo clock/RNG/external read inside node bodies on the replay path — pin them as inputs.

The throughline: bound every loop, give every multi-writer channel a reducer, make durability real (Postgres), and treat the model call as a thing that fails — typed, retried, and fallback-routed.

Comparison to alternatives

FrameworkStrengthWeakness
LangGraphExplicit, debuggable, checkpointable, matureSteeper learning curve; you own state design
LangChain Agent (AgentExecutor)Easy start, magicOpaque control flow, hard to bound, deprecated trend
crewAIMulti-agent natural APILess flexible, smaller ecosystem, weak control
AutoGenConversational multi-agentMicrosoft-specific feel, research-y
Anthropic Managed AgentsAnthropic runs the loop + hosts the tool sandbox; persisted versioned agent configsLess control over orchestration; vendor-hosted
Vercel AI SDK + rawTS-native, simpleNo state machine — you build orchestration
Raw SDK (Anthropic)Maximum controlYou build the loop, checkpointing, everything

In 2026, the production patterns are : LangGraph + raw SDK + MCP when you want to own orchestration and host your own tools; Anthropic Managed Agents when you want Anthropic to run the agent loop and host a per-session container. crewAI for prototyping, AutoGen for research. The decision hinges on one question : do you want to own the loop and the tool sandbox, or rent them?

TypeScript option

LangGraph JS exists and has matured a lot — same State/Node/Edge model, same checkpointer concept.

→ For your stack (TS + Python), the usual split : Python LangGraph for the agentic backend (richer ecosystem, the data/ML libs live here), TS/Angular for the frontend, talking to the graph over a streaming endpoint (SSE/WebSocket bridged from graph.stream). If your whole team is TS and the agents are simple, LangGraph JS is a legitimate single-language choice.

🎤 En entretien

  • "Why LangGraph over a LangChain AgentExecutor?" — Because the orchestrator owns control flow, not the model : explicit edges bound the failure surface, conditional routing and loops are testable, and checkpointing gives crash-recovery and human-in-the-loop. AgentExecutor is an opaque while loop you can't put a breakpoint, a budget, or a deterministic test on.
  • "How do you stop an agentic loop from running forever / costing too much?" — A loop guard in state (e.g. revisions < N) read by the conditional edge, plus a hard max_tokens and a per-call timeout on the model, plus output_config.effort to cap reasoning depth. Never "loop until the model says it's done."
  • "Two parallel branches both write the same state field. What happens?" — Without a reducer it's last-writer-wins and you nondeterministically lose one branch's work. You annotate the channel with a reducer (add_messages, operator.add) so concurrent writes merge — that's LangGraph's concurrency model.
  • "How is this different from Temporal, which I already know?" — It's the same shape : durable, inspectable, resumable execution graph with a persisted event history (the checkpointer). Nodes ≈ activities, conditional edges ≈ saga branches, interrupt() ≈ signals, resume-from-thread_id ≈ replay. The novelty is only that some nodes call an LLM — which is exactly why you still want deterministic orchestration around the nondeterministic step.
  • "How do you test an agentic graph without burning money on every CI run?" — The topology is pure Python : unit-test router functions (should_clarify(state) -> str) on synthetic states with no API call, assert loop termination on adversarial inputs, and stub the model node with a canned response for golden-path integration tests. The model is the only nondeterministic component and it's boxed inside a node you mock — everything else is a deterministic state machine you test directly.
  • "What makes a checkpointed graph's replay non-deterministic, and why do you care?" — A node body that reads the clock, an RNG, or an external source leaks nondeterminism into state, so replaying from a checkpoint diverges from what production did — time-travel debugging lies and crash-recovery can produce a different answer. Fix : push all nondeterminism into pinned node inputs, never into node bodies that run on the replay path. Same discipline as Temporal's deterministic-workflow rule.

🏋️ Exercices

  1. Convert a LangChain agent to a LangGraph graph.Objectif : turn an opaque AgentExecutor (retrieval + 2 tools) into an explicit State/Node/Edge graph with a typed state. Indice/Solution : model the tools as nodes, the "which tool / answer now" decision as a conditional edge driven by a structured-output classifier (Haiku via messages.parse()), and a router function returning string keys. Verify the graph compiles and graph.get_graph().draw_mermaid() matches your whiteboard.

  2. Build a research agent : decompose → search each sub-query in parallel → aggregate.Objectif : implement true fan-out/fan-in with Send and a reducer channel. Indice/Solution : the decompose node emits N Send("search", {...}); the search node is async and uses AsyncAnthropic; aggregate via an Annotated[list, operator.add] channel. Prove it's actually concurrent by logging start/end timestamps — if they don't overlap you accidentally used the sync client or a for await loop.

  3. Add a human-in-the-loop approval gate before an irreversible action (send email).Objectif : pause the graph at the gate, surface the draft, resume on approval/rejection. Indice/Solution : attach a Postgres checkpointer, call interrupt() in the gate node, resume with Command(resume=decision) keyed by thread_id. Now break it : kill the process between interrupt and resume, restart, and resume the same thread_id — it must continue, not restart. If it restarts, your checkpointer is in-memory.

  4. Make the loop production-grade : bound it, observe it, and survive a model overload.Objectif : take the reflect/critique loop from a toy to something you'd page on. Indice/Solution : add a revisions cap and a quality gate to the exit edge; wrap the model node with typed-exception handling that routes OverloadedError/RateLimitError to a fallback edge (downgrade claude-opus-4-8claude-sonnet-4-6, or back off and retry); log resp.usage per node; assert in a test that the loop terminates within N supersteps on an adversarial input that keeps "needing clarification."

  5. Defend the number : cost-optimize the graph and prove the saving.Objectif : cut per-request token spend without losing quality, and show the receipts. Indice/Solution : put cache_control on the stable system+tools prefix, move all volatile content (timestamps, per-request IDs) after the last breakpoint, route the cheap classification node to claude-haiku-4-5, and trim the messages window. Measure before/after with usage.cache_read_input_tokens and total input tokens across a fixed eval set. Be ready to explain why cache hit rate is the lever (cache reads ≈ 0.1× input price) and what silently invalidates it.

  6. Time-travel debug a wrong answer.Objectif : use checkpoints to find and fix the node that introduced a bad state, without re-running the whole graph. Indice/Solution : list checkpoints for the thread_id, inspect state at each superstep, identify the node whose output diverged, fork from the prior checkpoint with a corrected input, and confirm the downstream answer changes. Bonus : write a regression test that pins that checkpoint's input → expected output so the bug can't silently return.

  7. Break replay determinism, then make it deterministic.Objectif : prove that a node body reading the clock/RNG corrupts checkpoint replay, then fix it. Indice/Solution : put a datetime.now() (or random.random()) inside a node and write its result into state. Run to a checkpoint, then replay from an earlier checkpoint — observe that downstream state diverges from the original run, so time-travel debugging now lies. Fix by hoisting the clock/RNG to a pinned input (passed in at invoke time, or read once at the entry node and stored), so every replay reproduces the same transitions. Assert in a test that replaying the same checkpoint twice yields byte-identical downstream state.

  8. Make the graph testable end-to-end at zero API cost.Objectif : build a CI suite that proves control-flow correctness without a single model call. Indice/Solution : (a) unit-test each router function on synthetic states; (b) stub the model node (dependency-inject the client, or monkeypatch the node) to return canned answers, and assert the golden path reaches END with the expected state; (c) inject a model node that always sets needs_clarification=True and assert the loop still terminates within N supersteps. The whole suite must run offline. If any test needs a real API key, the nondeterminism isn't boxed inside a node yet — refactor until it is.

Resources

My notes

Bibliothèque tech perso — Achref