
LiveKit + ElevenLabs — Production Voice Stack

Phase 4. The production pattern serious teams use.

The mental model: voice is a latency budget, not a feature

A text agent has one number that matters: was the answer correct. A voice agent has two, and the second one is unforgiving — time-to-first-audio (TTFA). Humans perceive a conversation as broken above ~800ms of silence after they stop talking. That single constraint dictates the entire architecture: every component in the pipeline is chosen, configured, and streamed to defend a latency budget, not to maximize quality in isolation.

A staff engineer reasons about a voice agent as a pipeline of streaming stages, each with its own latency contribution, each able to start emitting before the previous one finishes:

Mic → VAD (endpointing) → STT (streaming) → LLM (streaming) → TTS (streaming) → Speaker
        ~                  ~200ms partial    ~300ms TTFT       ~75-150ms          ~

The naive mental model ("transcribe, then think, then speak") is wrong and will produce a 2-3 second pause that feels dead. The correct model is: the STT transcribes while the user is still speaking, the LLM starts generating the moment the turn is endpointed, the TTS starts speaking before the LLM finishes its response, and the whole thing can be interrupted mid-word. Get this and everything else follows.

LiveKit — what it does

WebRTC-based real-time platform. Handles:

  • Audio transport (browser ↔ backend, low latency) — UDP-based, jitter-buffered, packet-loss-resilient. This is the part you do not want to build yourself.
  • Multi-participant rooms (calls, meetings, agents)
  • TURN/STUN servers (NAT traversal — you don't manage)
  • Recording, streaming, transcoding

Open source + cloud. Use LiveKit Cloud for production (free tier generous, then usage-based on participant-minutes).

Why WebRTC and not WebSocket? A WebSocket over TCP gives you head-of-line blocking: one lost packet stalls every packet behind it, and on a flaky mobile connection that turns into audible stutter and growing latency. WebRTC runs audio over UDP/SRTP with a jitter buffer and forward error correction — it drops late packets instead of waiting for them. For a voice agent, consistent low latency beats occasionally lower latency. This is the single biggest reason to use LiveKit instead of streaming raw audio over a WebSocket to your backend.
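
To make the head-of-line-blocking argument concrete, here is a toy model (not a network simulator). It compares when audio frames become available under in-order, retransmitting delivery versus jitter-buffered playout that skips a lost frame. The function names, the 20 ms frame spacing, and the delay values are all illustrative assumptions.

```python
def tcp_delivery_ms(send_ms, lost, one_way_ms, retransmit_ms):
    """In-order delivery (TCP-like): a lost packet arrives late after a
    retransmit, and every packet behind it waits for it."""
    delivered = []
    ready_at = 0
    for i, sent in enumerate(send_ms):
        arrival = sent + one_way_ms + (retransmit_ms if i in lost else 0)
        ready_at = max(ready_at, arrival)  # cannot be handed up before earlier packets
        delivered.append(ready_at)
    return delivered


def udp_playout_ms(send_ms, lost, one_way_ms, jitter_buffer_ms):
    """Jitter-buffered playout (WebRTC-like): each frame plays at a fixed offset;
    a lost frame is skipped (None) instead of stalling the frames behind it."""
    return [
        None if i in lost else sent + one_way_ms + jitter_buffer_ms
        for i, sent in enumerate(send_ms)
    ]


frames = [0, 20, 40, 60, 80]  # send times of 20 ms audio frames (hypothetical)
tcp = tcp_delivery_ms(frames, lost={1}, one_way_ms=40, retransmit_ms=200)
udp = udp_playout_ms(frames, lost={1}, one_way_ms=40, jitter_buffer_ms=40)
print(tcp)  # [40, 260, 260, 260, 260]
print(udp)  # [80, None, 120, 140, 160]
```

In the TCP-like case one lost frame delays every later frame to 260 ms; in the jitter-buffered case the stream loses one 20 ms frame and stays on schedule.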

LiveKit Agents framework

Python framework on top of LiveKit for AI voice agents.

python
from livekit import agents, rtc
from livekit.agents.voice import Agent, AgentSession
from livekit.plugins import anthropic, elevenlabs, deepgram, silero

async def entrypoint(ctx: agents.JobContext):
    agent = Agent(
        instructions="You are a helpful assistant",
        # STT → LLM → TTS pipeline (Pattern B)
        llm=anthropic.LLM(model="claude-haiku-4-5"),  # fast + cheap for turn-by-turn voice
        stt=deepgram.STT(model="nova-3"),
        tts=elevenlabs.TTS(voice="Adam"),
        vad=silero.VAD.load(),
    )
    session = AgentSession()
    await session.start(agent=agent, room=ctx.room)

agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

That's a complete voice agent. Add tools, state, etc.

Model choice note. For a turn-by-turn voice pipeline, the LLM is in the critical latency path on every turn. Default to claude-haiku-4-5 ($1 / $5 per Mtok) for routine conversational turns — it has the lowest TTFT and is plenty for IVR-style flows. Step up to claude-sonnet-4-6 when turns need real reasoning (multi-step booking logic, policy lookups), and reserve claude-opus-4-8 ($5 / $25 per Mtok at 1M context) for the rare voice agent that does genuinely hard reasoning mid-call. The wrong move is reaching for Opus by default "because it's the best" — you'll pay for latency the user hears as a pause. Pick the cheapest model whose TTFT and quality clear the bar, then defend that choice with a measured TTFA number (see Exercises).

Note on the Anthropic SDK. The LiveKit anthropic.LLM plugin wraps AsyncAnthropic under the hood and streams tokens. If you build the LLM stage yourself instead of using the plugin, the production checklist a senior expects is:

  • AsyncAnthropic, never the sync client — the sync client blocks the event loop, and on a voice server one blocked turn stalls every concurrent call on that worker.
  • Always stream (client.messages.stream(...)) and emit text to the TTS sentence-by-sentence as it arrives — never await the full response. The first sentence should be on its way to ElevenLabs before the LLM has finished the second.
  • Don't set thinking on a latency-critical turn. Adaptive thinking adds a reasoning preamble before the first user-visible token — exactly the TTFA you're protecting. (The legacy thinking={"type": "enabled", "budget_tokens": N} form is removed on 4.7/4.8 and returns HTTP 400 anyway. On the rare non-latency-critical turn where you do want reasoning, use thinking={"type": "adaptive"} with output_config={"effort": "low"}.)
  • Typed exceptions + retries. Wrap the call so RateLimitError / OverloadedError / APIStatusError / APITimeoutError each route to the degraded path (see Failure modes), and lean on the SDK's max_retries with a tight per-call timeout. In voice, the SDK's default request timeout (10 minutes) is a dead conversation; budget ~3–5s.
  • cache_control on the stable prefix (system prompt + tool definitions + early transcript) so a long call doesn't re-bill the whole context every turn (see State / memory).
  • Log resp.usage (input_tokens, output_tokens, cache_read_input_tokens) per turn — this is your per-minute cost telemetry and your cache-hit proof.

A minimal hand-rolled streaming LLM stage looks like:

python
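import logging
import re

# Assumption: your SDK version exports OverloadedError (HTTP 529) at top level;
# if not, catch APIStatusError and check status_code == 529 instead.
from anthropic import AsyncAnthropic, RateLimitError, OverloadedError, APITimeoutError

log = logging.getLogger("voice.llm")
client = AsyncAnthropic(max_retries=2)  # SDK retries 429/5xx with backoff

# Placeholder prompt; keep the real one byte-stable so the cache keeps hitting.
SYSTEM_PROMPT = "You are a helpful assistant"

_SENTENCE_END = re.compile(r"[.!?](?=\s)")

def _sentence_end(buf: str) -> int | None:
    """Index of the first sentence terminator followed by whitespace, else None."""
    match = _SENTENCE_END.search(buf)
    return match.start() if match else None

async def stream_turn(history: list[dict], tts_say):
    try:
        async with client.messages.stream(
            model="claude-haiku-4-5",      # fast + cheap on the critical path
            max_tokens=1024,
            timeout=5.0,                   # voice budget, not the SDK's 10-minute default
            system=[{                      # stable prefix → cache it
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }],
            messages=history,
        ) as stream:
            buf = ""
            async for text in stream.text_stream:
                buf += text
                # flush to TTS at sentence boundaries to minimise TTFA
                while (i := _sentence_end(buf)) is not None:
                    await tts_say(buf[: i + 1])
                    buf = buf[i + 1 :]
            if buf.strip():
                await tts_say(buf)         # trailing text with no terminator
            # read usage while the stream context is still open
            usage = (await stream.get_final_message()).usage
        log.info("llm.usage %s", usage.model_dump())  # per-turn cost telemetry
    except (RateLimitError, OverloadedError, APITimeoutError):
        await tts_say("Sorry, give me one second.")  # degrade, don't dead-air
        raise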

ElevenLabs — when to choose

  • Best voice quality on the market (as of May 2026)
  • Voice cloning (clone a real person's voice with about 5 minutes of training)
  • Strong multilingual support (the French voices are excellent)
  • Flash v2.5: ~75ms latency, ~$0.30 per 1,000 characters
  • Conversational AI: their own end-to-end voice agent product (an alternative to building it yourself)

Cost vs OpenAI Realtime:

  • ElevenLabs TTS only: cheap, ~$0.40 per minute of audio
  • OpenAI Realtime: ~$3/min
  • → If voice quality matters more than ultra-low latency, ElevenLabs wins

The model tiers matter for latency. ElevenLabs offers a quality↔latency dial, and a senior picks per-use-case rather than always reaching for the best-sounding model:

| Model | Latency | Use when |
|---|---|---|
| Flash v2.5 | ~75ms | Real-time conversation, IVR, anything turn-by-turn |
| Turbo v2.5 | ~250-300ms | Quality matters more than the last 200ms |
| Multilingual v2 | Higher | Long-form generation, narration, non-interactive |

For an interactive agent, Flash is almost always the right default — the quality gap vs Turbo is inaudible in a phone-quality call, and the 200ms you save is 200ms of the human-perceptible silence budget you get to keep.

Deepgram (STT)

For the STT → LLM → TTS pipeline:

  • Nova-3: current best model, very fast
  • Multilingual: French works well
  • Streaming: sub-200ms partial transcripts
  • Price: ~$0.0043/min

The endpointing tradeoff is the hardest tuning problem in voice. STT gives you partial transcripts continuously and a final transcript when it thinks the user stopped. But "stopped talking" is ambiguous — a 400ms pause might be the end of the turn, or it might be the user thinking mid-sentence ("I'd like to book a table for… four people"). Tune endpointing too aggressive and you interrupt people; too lax and the agent feels slow to respond. This is a VAD + STT-endpointing + (optionally) a semantic turn-detection model working together, and it's where most of your post-launch tuning time will go. There is no universally correct value — it depends on your users, your domain, and whether you bias toward "never interrupt" or "feel snappy."
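
The tradeoff is easy to see with a minimal silence-threshold endpointer. This is a sketch under simple assumptions: the VAD emits one boolean per fixed-size frame, and `endpoint_index` is a hypothetical helper, not a LiveKit or Deepgram API. Real stacks layer STT endpointing and semantic turn detection on top of this.

```python
def endpoint_index(frames, frame_ms, silence_threshold_ms):
    """Return the index of the frame where the turn is declared over, or None.

    frames: list of bool, True when the VAD detected speech in that frame.
    The turn ends once trailing silence reaches silence_threshold_ms.
    """
    silent_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_ms = 0            # any speech resets the silence timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_threshold_ms:
                return i
    return None


# "I'd like to book a table for ... four people": 1 s of speech, a 400 ms
# thinking pause, 0.5 s of speech, then real silence (100 ms frames).
utterance = [True] * 10 + [False] * 4 + [True] * 5 + [False] * 10

print(endpoint_index(utterance, 100, 300))  # 12: fires inside the thinking pause
print(endpoint_index(utterance, 100, 700))  # 25: waits 700 ms after the real end
```

The 300 ms threshold cuts the user off mid-sentence; the 700 ms threshold never interrupts but adds 700 ms to every turn before the LLM can start.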

Twilio (telephony)

If you need phone calls (not browser-based):

  • Twilio Voice for SIP/PSTN
  • LiveKit + Twilio integration available (SIP trunk into a LiveKit room)
  • Use case: voice agents answering business phone lines

Pricing: ~$0.013/min phone audio + standard LLM costs.

Phone audio is 8kHz, narrowband, lossy. This degrades STT accuracy versus a clean browser mic, which means your prompts and tools must be more forgiving (confirm critical values back to the user: "that's four people at seven PM, correct?"). It also means a voice clone that sounds pristine in your demo will sound different over a PSTN line — always test on a real phone call, not just the browser.

Architecture patterns

Pattern A — Pure speech-to-speech (OpenAI Realtime)

User → LiveKit → Backend (OpenAI Realtime) → LiveKit → User

Pros: simplest, lowest latency, most natural prosody/back-channel
Cons: expensive (~$3/min), less control, no model choice flexibility, harder to inject/audit tool logic, vendor-locked to one provider's voice + brain

Pattern B — STT → LLM → TTS pipeline

User → LiveKit → Deepgram (STT) → Claude (LLM) → ElevenLabs (TTS) → LiveKit → User

Pros: cheaper, model flexibility (swap any stage independently), better voice quality (ElevenLabs), full control + observability at every stage, can log/redact transcripts
Cons: more latency to manage (~700ms vs 300ms), more code, more failure modes to handle

The Pattern B latency budget, decomposed

The "~700ms" hides where the time actually goes. A staff engineer carries this breakdown in their head so they know which knob to turn when someone says "it feels slow":

| Stage | Typical contribution | Dominated by | How you cut it |
|---|---|---|---|
| Endpointing wait | 200–700ms | VAD silence threshold + semantic turn detection | Lower the silence threshold (risks cutting users off); add a turn-detection model |
| STT final transcript | ~100–200ms after endpoint | Deepgram streaming | Already streamed; little to do |
| LLM TTFT | 200–500ms | Model choice + prompt size | Use Haiku, stream, skip thinking, cache the prefix |
| TTS first byte | 75–150ms | ElevenLabs model tier | Use Flash v2.5 (~75ms) |
| Network / jitter buffer | ~50–100ms | WebRTC, client distance | Edge regions, LiveKit Cloud |

The two fat, tunable stages are endpointing wait and LLM TTFT — they're where 80% of your perceived-latency wins live, and they're the two you'll name in an interview when asked "where does the latency go." Everything else is already near its floor once you're streaming. Note these overlap: TTS starts on the LLM's first sentence, so TTFA ≈ endpointing + STT + LLM-first-sentence + TTS-first-byte, not the naive sum of full-stage durations.
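
As a worked example of that formula, the sketch below compares the overlapped (streamed) TTFA with the naive serial sum. The stage values are hypothetical numbers picked from inside the ranges in the table above, and the dataclass and function names are illustrative. Network and jitter-buffer delay is left out, as in the formula.

```python
from dataclasses import dataclass


@dataclass
class StageLatency:
    endpointing_ms: int         # silence wait before the turn is declared over
    stt_final_ms: int           # endpoint -> final transcript
    llm_first_sentence_ms: int  # request -> first complete sentence (includes TTFT)
    llm_full_response_ms: int   # request -> last token
    tts_first_byte_ms: int      # text in -> first audio byte
    tts_full_audio_ms: int      # text in -> whole reply synthesized


def ttfa_streamed_ms(s: StageLatency) -> int:
    """TTS starts on the first LLM sentence, so only 'first' latencies count."""
    return (s.endpointing_ms + s.stt_final_ms
            + s.llm_first_sentence_ms + s.tts_first_byte_ms)


def ttfa_serial_ms(s: StageLatency) -> int:
    """Naive 'transcribe, then think, then speak': wait for each full stage."""
    return (s.endpointing_ms + s.stt_final_ms
            + s.llm_full_response_ms + s.tts_full_audio_ms)


example = StageLatency(
    endpointing_ms=250, stt_final_ms=100,
    llm_first_sentence_ms=350, llm_full_response_ms=1200,
    tts_first_byte_ms=75, tts_full_audio_ms=1000,
)
print(ttfa_streamed_ms(example))  # 775
print(ttfa_serial_ms(example))    # 2550
```

With these assumed inputs, streaming keeps the turn under the ~800ms threshold, while the serial version lands in the 2-3 second dead-pause range.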

Pattern C — Hybrid

OpenAI Realtime for back-and-forth, fallback to STT-LLM-TTS for long-form generation, or for turns that need tool use / a specific model.

How a staff engineer chooses

| Dimension | Pattern A (S2S) | Pattern B (pipeline) |
|---|---|---|
| TTFA | ~300ms | ~500-800ms (tunable) |
| Cost/min | ~$3 | ~$0.40-0.60 |
| Model control | None | Full (any LLM, any STT, any voice) |
| Observability | Opaque | Transcript + per-stage timing at every hop |
| Voice quality / cloning | Provider's voices | Best-in-class (ElevenLabs) |
| Tool use / business logic | Awkward | Native: it's just an LLM call |
| Compliance (PII redaction, audit) | Hard | Easy: you own the transcript |
| Failure isolation | All-or-nothing | Degrade one stage at a time |

The decision rule: if the agent is a regulated, tool-heavy, brand-voiced business workflow → Pattern B. If it's a consumer toy or a demo where naturalness is everything and budget is irrelevant → Pattern A. Most paying client work is Pattern B, because clients want their voice, their model, their logs, and a per-minute cost they can put in a spreadsheet.

Tool use during voice

Same as text: define the tools, the model calls them, you execute them and return the result, and the voice agent continues.

Tip: acknowledge before executing ("Let me check that for you…") so the user knows the agent is working. This reduces perceived latency and buys you the tool round-trip for free: emit the filler phrase to TTS the instant you detect a tool_use, then run the tool while it's being spoken.

Parallelize independent tool calls. If a turn triggers two independent lookups (check inventory and fetch the user's profile), fire them concurrently with asyncio.gather rather than awaiting them in series — in a voice context those serialized round-trips are silence the user hears. Set a per-tool timeout and a graceful fallback ("I'm having trouble reaching that system, can I take a message?") — a hung tool call in a voice agent is a hung conversation, far worse than in text where the user just waits.
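
A minimal sketch of the acknowledge-then-parallelize pattern with per-tool timeouts, using only asyncio. The two tools are hypothetical stubs standing in for real lookups, and `tts_say` is assumed to be an async callable that speaks a string.

```python
import asyncio


async def check_inventory(item: str) -> dict:
    """Hypothetical tool stub; a real one would call the inventory system."""
    await asyncio.sleep(0.05)
    return {"item": item, "in_stock": True}


async def fetch_profile(user_id: str) -> dict:
    """Hypothetical tool stub; a real one would query the customer database."""
    await asyncio.sleep(0.05)
    return {"user_id": user_id, "name": "Sam"}


async def call_with_timeout(coro, timeout_s: float, fallback):
    """Await one tool call; return `fallback` instead of hanging the call."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def run_tools(tts_say, item: str, user_id: str, timeout_s: float = 2.0):
    # Speak the filler phrase while the tools run, not before them.
    ack = asyncio.create_task(tts_say("Let me check that for you…"))
    inventory, profile = await asyncio.gather(
        call_with_timeout(check_inventory(item), timeout_s, None),
        call_with_timeout(fetch_profile(user_id), timeout_s, None),
    )
    await ack
    return inventory, profile  # a None result means: use the graceful fallback line
```

Because the two lookups run concurrently, the wall-clock cost is roughly the slower of the two rather than their sum, and the caller can turn a `None` result into the "can I take a message?" fallback.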

State / memory

  • Within session: conversation history in LLM context
  • Across sessions: same as text agents (DB, vector store, etc.)

Cost note for long calls. The LLM context grows every turn, and you re-send it every turn — a 20-minute call resends the whole transcript dozens of times. Use prompt caching on the stable prefix (system prompt + tool definitions + early conversation) via cache_control so the repeated tokens bill at ~0.1× instead of full price. On a long support call this is the difference between a few cents and a real number. Log resp.usage (cache_read_input_tokens vs input_tokens) per turn so you can prove the cache is hitting.
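
The arithmetic behind that claim can be sketched as follows. It is a deliberately simplified model: every turn re-sends the whole history, cached tokens bill at the ~0.1× read rate, and the cache-write premium and cache expiry are ignored. The turn count and token sizes are hypothetical.

```python
def call_input_cost(turns, prefix_tokens, tokens_per_turn, price_per_mtok,
                    cached_read_multiplier=None):
    """Total LLM input cost (USD) for one call.

    Each turn sends prefix + accumulated history + the new turn. With caching,
    everything before the new turn is billed at cached_read_multiplier.
    """
    context_tokens = 0  # tokens that were already sent on an earlier turn
    new_tokens = 0      # tokens seen for the first time
    history = 0
    for _ in range(turns):
        context_tokens += prefix_tokens + history
        new_tokens += tokens_per_turn
        history += tokens_per_turn
    multiplier = 1.0 if cached_read_multiplier is None else cached_read_multiplier
    billable = context_tokens * multiplier + new_tokens
    return billable * price_per_mtok / 1_000_000


# Hypothetical call: 60 turns, 2,000-token system+tools prefix, 150 new tokens
# per turn, Haiku input at $1 per Mtok.
uncached = call_input_cost(60, 2000, 150, 1.0)
cached = call_input_cost(60, 2000, 150, 1.0, cached_read_multiplier=0.1)
print(f"${uncached:.2f}")  # $0.39
print(f"${cached:.2f}")    # $0.05
```

Under these assumptions caching removes close to 90% of the input cost, and the gap widens with call length because the uncached cost grows quadratically with the number of turns.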

Failure modes a senior expects (and designs around)

  • Barge-in (interruption) handling. The user starts talking while the agent is mid-sentence. You MUST stop TTS immediately, flush the queued audio, and re-open STT. If you don't, the agent talks over the user — instantly feels broken. LiveKit Agents handles the plumbing, but you have to want it and test it.
  • Hallucinated STT on silence/noise. Background noise or a cough can produce a phantom transcript. VAD gating + ignoring very short / low-confidence finals mitigates this.
  • TTS/LLM provider outage mid-call. A single-provider pipeline is a single point of failure on a live call. Have a degraded path (cached canned responses, a fallback TTS voice, a "please hold" message) rather than dead air.
  • Endpointing too aggressive → you cut users off. Too lax → the agent feels slow. Tune per domain; there is no free lunch.
  • Cost runaway on a wedged session. A call that never ends (user walked away, line stuck open) keeps an LLM context alive and billing. Enforce a max session duration and an inactivity timeout.
  • PII in transcripts. Voice agents capture names, card numbers, health info. In Pattern B you own the transcript — redact before logging, encrypt at rest, and know your retention obligations (GDPR/HIPAA). This is a feature of Pattern B, not a bug: in Pattern A you can't.
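
The barge-in item above can be sketched with plain asyncio cancellation. This is an illustration of the control flow only, not how LiveKit Agents implements it: `Speaker`, `handle_barge_in`, and the `play_chunk` callable are hypothetical names, and `say` must be called from inside a running event loop.

```python
import asyncio
from collections import deque


class Speaker:
    """Plays queued TTS chunks in order; interrupt() stops playback and flushes."""

    def __init__(self, play_chunk):
        self._play_chunk = play_chunk  # async callable that plays one audio chunk
        self._queue = deque()
        self._task = None

    def say(self, chunk):
        self._queue.append(chunk)
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._drain())

    async def _drain(self):
        while self._queue:
            await self._play_chunk(self._queue.popleft())

    async def interrupt(self):
        self._queue.clear()                      # flush audio not yet played
        if self._task is not None and not self._task.done():
            self._task.cancel()                  # stop the chunk being played
            try:
                await self._task
            except asyncio.CancelledError:
                pass


async def handle_barge_in(speaker, llm_task):
    """Call when the VAD reports user speech during agent playback."""
    if llm_task is not None and not llm_task.done():
        llm_task.cancel()        # stop generating text nobody will hear
    await speaker.interrupt()    # after this, hand the floor back to STT
```

The order matters less than doing both: cancelling only the TTS leaves the LLM producing sentences that get queued again, and cancelling only the LLM leaves already-queued audio talking over the user.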

Observability — what to measure on every call

You cannot tune what you don't measure. Instrument per turn:

  • TTFA (time from user-stop to first agent audio) — the north-star metric
  • Per-stage latency: VAD endpoint → STT final → LLM first token → TTS first byte
  • Interruption rate (barge-ins per minute) — high means endpointing or verbosity is off
  • Cost per minute, split by stage (STT, LLM in/out tokens, TTS chars)
  • Cache hit rate on the LLM (cache_read_input_tokens / total)
  • Turn error rate (tool timeouts, STT empties, provider 5xx)

Ship these as structured logs/metrics from day one. The first production incident will be "it feels slow" and you'll need the per-stage breakdown to find which hop regressed.
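
A per-turn timing record along those lines might look like the sketch below. The field names are assumptions; timestamps are integer milliseconds from a monotonic clock, with the user-stop moment as the reference. The p95 uses the simple nearest-rank method.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TurnTimings:
    """Timestamps (ms, monotonic) for one turn, in pipeline order."""
    user_stopped: int
    vad_endpoint: int
    stt_final: int
    llm_first_token: int
    tts_first_byte: int
    first_audio: int

    @property
    def ttfa_ms(self) -> int:
        return self.first_audio - self.user_stopped

    def stages_ms(self) -> dict:
        """Latency of each hop, ready to emit as a structured log record."""
        return {
            "endpointing": self.vad_endpoint - self.user_stopped,
            "stt_final": self.stt_final - self.vad_endpoint,
            "llm_ttft": self.llm_first_token - self.stt_final,
            "tts_first_byte": self.tts_first_byte - self.llm_first_token,
            "playout": self.first_audio - self.tts_first_byte,
        }


def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


turn = TurnTimings(0, 300, 420, 740, 815, 860)
print(turn.ttfa_ms)      # 860
print(turn.stages_ms())  # endpointing 300, stt_final 120, llm_ttft 320, ...

ttfas = [620, 580, 700, 1100, 640]
print(median(ttfas), percentile(ttfas, 95))  # 640 1100
```

Logging `stages_ms()` next to `ttfa_ms` on every turn is what lets you answer "which hop regressed" without re-running the call.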

Use cases that pay well in 2026

| Use case | Price |
|---|---|
| Outbound call agent (B2B sales) | €30-50k/project |
| Inbound IVR replacement | €20-40k + maintenance |
| Voice-first customer support | €25-50k |
| Voice booking / scheduling agent | €15-30k |
| Voice-enabled internal tools (sales rep coach) | €20-40k |

→ Most clients pay a fixed price, not a daily rate (TJM). Be careful with scope. The scope killer in voice is latency and interruption tuning: it's invisible in the spec and eats weeks. Price it in, and define "good enough" with a measurable TTFA SLA in the contract, not a vibe.

🏋️ Exercises

  1. Stand up Pattern B end-to-end.
     Goal: a working browser voice agent (LiveKit room + Deepgram Nova-3 STT + claude-haiku-4-5 + ElevenLabs Flash v2.5) that holds a multi-turn conversation.
     Hint/Solution: use the LiveKit Agents AgentSession from the snippet above; load Silero VAD; run agents.cli.run_app. Confirm you hear streamed audio rather than a 3-second pause; a pause means your TTS isn't being fed incrementally.

  2. Add two tools and parallelize them.
     Goal: a check_availability and a lookup_customer tool that fire concurrently on a single turn, with an "acknowledge before executing" filler phrase.
     Hint/Solution: on tool_use, immediately push "Let me check that for you…" to TTS, then await asyncio.gather(check_availability(...), lookup_customer(...)). Wrap each in asyncio.wait_for with a 2s timeout and a graceful fallback. Measure the latency saved vs awaiting them serially.

  3. Instrument and defend a TTFA number.
     Goal: produce a per-turn latency breakdown (VAD→STT→LLM-first-token→TTS-first-byte→audio) and state a defensible median TTFA for your stack, with the bottleneck stage named.
     Hint/Solution: timestamp each stage transition; log structured per turn; aggregate the median and p95 over ~30 turns. Expect the LLM TTFT or endpointing wait to dominate. Now defend the number in an interview: "median TTFA 620ms, p95 1.1s; the long tail is endpointing waiting 700ms for a final on trailing-off speakers."

  4. Break it, then fix it: barge-in.
     Goal: reproduce the agent talking over the user, then implement correct interruption handling.
     Hint/Solution: disable interruption handling and confirm the failure (agent keeps talking when you speak). Then: on VAD detecting user speech during agent playback, immediately stop and flush the TTS output, cancel any in-flight LLM generation for that turn, and re-open STT. Verify the agent yields within ~200ms of you starting to talk.

  5. Drive the LLM cost down on a long call.
     Goal: on a simulated 15-minute call, cut LLM input cost by >70% using prompt caching, and prove it with usage.
     Hint/Solution: put cache_control: {type: "ephemeral"} on the stable system+tools+early-history prefix; keep the volatile per-turn content after the last breakpoint. Log cache_read_input_tokens vs input_tokens per turn and show the read ratio climbing. Watch for silent invalidators (a timestamp interpolated into the system prompt will zero out your hit rate).

  6. Production-grade resilience: kill a provider mid-call.
     Goal: make the agent survive an ElevenLabs (or LLM) 5xx mid-conversation without dead air, and enforce session limits.
     Hint/Solution: wrap each stage in typed-exception handling (SDK RateLimitError / APIStatusError / OverloadedError / APITimeoutError for the LLM; HTTP errors for TTS/STT) with max_retries + a fallback path (canned "please hold" audio, fallback voice). Add a max-session-duration watchdog and an inactivity timeout that ends and logs the call. Verify by injecting a fault and confirming the user hears a graceful degradation, not silence.
     Hard mode: distinguish a retryable OverloadedError (529: back off and retry on the same provider) from a terminal one (sustained outage: flip to the fallback voice / canned path) so you don't burn the latency budget retrying a dead provider.

  7. Defend the per-minute cost number in a pricing meeting.
     Goal: build a per-minute cost model for a Pattern B inbound support agent and defend it line-by-line, then show where prompt caching changes the answer.
     Hint/Solution: sum STT ($/min), LLM (in/out tokens × model rate: claude-haiku-4-5 at $1/$5 per Mtok vs claude-sonnet-4-6 at $3/$15), TTS ($/char), and telephony ($/min). Pull real token counts from resp.usage on a recorded call, not guesses. Now answer the CFO's two questions: "what does a 12-minute call cost?" and "what breaks the model?" The answer to the second is context growth on long calls, which prompt caching (cache_read at ~0.1×) flattens. Be ready to state the number with and without caching and name your model choice as the dominant lever.

  8. Load: 50 concurrent calls on one worker.
     Goal: prove (or disprove) that your agent holds its TTFA SLA under concurrency, and find the bottleneck.
     Hint/Solution: drive 50 simulated callers against one worker; watch p95 TTFA as concurrency climbs. The usual failure is a blocking call somewhere on the event loop (a sync Anthropic client, a sync DB call, CPU-bound VAD on the main thread) that serializes turns; p95 will hockey-stick. Fix by making every I/O call async (AsyncAnthropic), moving CPU-bound work off the loop, and capping per-worker concurrency so you scale horizontally instead of degrading every call. Defend the number: "one worker holds p95 TTFA < 900ms up to N concurrent calls; past N we add workers, we don't degrade."
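
For the session limits in exercise 6, one possible shape is a small watchdog with an injectable clock so it can be tested without waiting. `SessionWatchdog`, `watch`, and the `end_call` callback are hypothetical names for this sketch, not part of LiveKit Agents.

```python
import asyncio
import time


class SessionWatchdog:
    """Tracks max session duration and inactivity for one call."""

    def __init__(self, max_duration_s, inactivity_s, clock=time.monotonic):
        self._clock = clock
        self._max_duration_s = max_duration_s
        self._inactivity_s = inactivity_s
        self._started = clock()
        self._last_activity = self._started

    def touch(self):
        """Call on every user or agent turn."""
        self._last_activity = self._clock()

    def expired_reason(self):
        """Return 'max_duration', 'inactivity', or None if the call may continue."""
        now = self._clock()
        if now - self._started >= self._max_duration_s:
            return "max_duration"
        if now - self._last_activity >= self._inactivity_s:
            return "inactivity"
        return None


async def watch(watchdog, end_call, poll_s=1.0):
    """Poll until a limit trips, then end the call with the reason (for logging)."""
    while (reason := watchdog.expired_reason()) is None:
        await asyncio.sleep(poll_s)
    await end_call(reason)
    return reason
```

Run `watch` as a background task next to the session and call `touch()` from the turn handler; passing the reason to `end_call` gives the log line that explains why the call was closed.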

🎤 In interviews

  • "Why does a voice agent need WebRTC instead of a WebSocket?" — TCP head-of-line blocking: one lost packet stalls the stream, producing audible stutter and growing latency on flaky networks. WebRTC runs audio over UDP/SRTP with a jitter buffer and FEC, dropping late packets instead of waiting — consistent low latency beats occasionally-lower latency for live audio.
  • "Walk me through where the latency goes in an STT→LLM→TTS pipeline and how you'd cut it." — Endpointing wait + STT final + LLM TTFT + TTS first-byte, all streamed and overlapped; the usual bottlenecks are endpointing (tune VAD + semantic turn detection) and LLM TTFT (use Haiku, stream tokens, skip thinking). You overlap stages — start TTS on the first LLM sentence, don't wait for the full response.
  • "How do you handle the user interrupting the agent?" — Barge-in: VAD detects user speech during playback → immediately stop and flush queued TTS audio, cancel the in-flight LLM turn, re-open STT. If you skip this the agent talks over the user and instantly feels broken; target yielding within ~200ms.
  • "This is a healthcare booking line. Pattern A or Pattern B, and why?" — Pattern B. You need to own the transcript (PII redaction, audit, retention), pick a model and a branded voice, use tools for the booking system, and degrade gracefully on a provider outage — none of which Pattern A's opaque speech-to-speech gives you. The latency cost is worth the control and compliance.
  • "Which LLM do you put on a turn-by-turn voice pipeline, and how do you defend it?" — Default to claude-haiku-4-5 ($1/$5 per Mtok) because the LLM is on the critical path every turn and Haiku has the lowest TTFT; step up to claude-sonnet-4-6 only for turns that need real reasoning (multi-step booking, policy lookups). Reaching for Opus by default is the junior move — you pay for latency the user hears as a pause. Defend the choice with a measured TTFA, not a vibe.
  • "A 20-minute support call is getting expensive. What's the one lever?" — Prompt caching on the stable prefix (system + tools + early transcript) via cache_control, so the repeated context bills at ~0.1× instead of full price every turn; prove it with cache_read_input_tokens in resp.usage. Watch for silent invalidators — a timestamp interpolated into the system prompt zeroes the hit rate. Model choice is the other lever, but caching is the one that's free.


Bibliothèque tech perso — Achref