- Status: Accepted
- Date: 2026-05-03
- Affected: `backend/workers/voice-talkback/`, `apps/web/app/(app)/talkback/`
## Context
V0.2 ships voice talk-back. There are two architectural shapes for real-time voice with LLMs in 2026:
- Bundled — a vendor’s all-in-one realtime API (OpenAI Realtime, Gemini Live). One WebSocket carries audio in / audio out; the provider handles STT, LLM, and TTS internally.
- Composed — orchestrate independent STT, LLM, and TTS services over a single pipeline. Pipecat is the open-source orchestrator.
docs/01_SOFTWARE_PLAN.md §1.8 already favored the composed approach.
This ADR documents why and what was given up.
## Options considered
### Option A — OpenAI Realtime
- Pros: Lowest end-to-end latency; voice quality strong; one API; tool use built in.
- Cons: Vendor lock-in — STT/LLM/TTS aren’t independently swappable. Cost ≈ $0.06/min vs ≈ $0.02/min for a Pipecat-orchestrated Deepgram + Haiku + Cartesia stack. Group-conversation / multi-party support is weaker than LiveKit’s. Tool-calling parity for our memory tools is less mature than the Claude Agent SDK’s.
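To make the cost gap concrete, a quick back-of-envelope using the per-minute figures above. The 10,000 min/month volume is an illustrative assumption, not a product figure:

```python
# Illustrative cost comparison using the per-minute rates cited above.
# The 10,000 min/month volume is an assumed example, not real usage data.
BUNDLED_PER_MIN = 0.06   # OpenAI Realtime, $/min (approx.)
COMPOSED_PER_MIN = 0.02  # Deepgram + Haiku + Cartesia via Pipecat, $/min

minutes_per_month = 10_000
bundled = BUNDLED_PER_MIN * minutes_per_month    # $600
composed = COMPOSED_PER_MIN * minutes_per_month  # $200
print(f"bundled=${bundled:.0f} composed=${composed:.0f} "
      f"savings=${bundled - composed:.0f}/mo")
```

At that volume the composed stack saves roughly $400/month; the gap scales linearly with minutes.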
### Option B — Gemini Live
- Same shape as OpenAI Realtime; same trade-offs. Additionally: Google’s voice selection is narrower, and Anthropic models cannot be used.
### Option C — Pipecat + Deepgram Nova-3 + Claude Haiku + Cartesia Sonic (chosen)
- Pros: Each component is independently swappable: if Cartesia ships a better voice, flip one service; if Deepgram cuts latency with a new model, drop it in. Cost ≈ 1/3 of the bundled options. LiveKit Agents (which Pipecat composes with) is the proven multi-party WebRTC pattern, so V0.3 group mode lands cleanly. Tool use is mature on Claude.
- Cons: One more service to deploy (the Pipecat orchestrator). The naturalness gap vs end-to-end speech models is real and requires attention to prompt and TTS choice (see ADR-0006).
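The “independently swappable” claim can be sketched as plain interfaces. Nothing below is Pipecat’s real API — the protocol names and stub classes are illustrative stand-ins for the composed shape Option C buys us:

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical stage interfaces; real services stream frames, but the
# swap-one-stage property is the same.
class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def reply(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

@dataclass
class VoicePipeline:
    stt: STT
    llm: LLM
    tts: TTS

    def turn(self, audio_in: bytes) -> bytes:
        # audio in -> text -> reply -> audio out; each stage is swappable
        return self.tts.synthesize(self.llm.reply(self.stt.transcribe(audio_in)))

# Trivial stubs to show the wiring.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()

class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()

class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()

pipe = VoicePipeline(stt=EchoSTT(), llm=UpperLLM(), tts=BytesTTS())
print(pipe.turn(b"hello"))  # b'HELLO'
```

Replacing any one stage (say, `BytesTTS` with an ElevenLabs-backed class) touches only that constructor argument; the other two stages are untouched. A bundled API offers no seam at which to make that swap.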
## Decision
Use Pipecat as the V0.2 voice orchestrator with Deepgram Nova-3 streaming STT, Claude Haiku 4.5 as the LLM, and Cartesia Sonic TTS. Keep OpenAI Realtime as a fallback driver in the swappable agent layer (the `packages/agents/src/drivers/realtime-voice.ts` slot exists per the plan), available if a future use case needs the lower latency.
## Consequences
- Voice service is a separate Python process (`backend/workers/voice-talkback/`). Deploys via Modal / Fly / any container runner.
- Latency budget per `01_SOFTWARE_PLAN.md` §1.8: <1.5 s from end of utterance to the AI starting to speak; realistically 750 ms–1.55 s on the warm path.
- Group mode (V0.3) builds on this same Pipecat pipeline — swap the WebSocket transport for LiveKit, keep the STT/LLM/TTS choices.
- TTS swap path (Cartesia → ElevenLabs v3 → Sesame CSM) is one line in `main.py`. See ADR-0006.
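The “one line in `main.py`” swap can be sketched as a provider factory. The constructor names below are hypothetical stand-ins for the real Pipecat TTS service classes:

```python
# Hypothetical stand-ins for TTS service constructors; in the real worker
# these would be the Pipecat TTS service classes for each provider.
def cartesia_sonic() -> str:
    return "cartesia-sonic"

def elevenlabs_v3() -> str:
    return "elevenlabs-v3"

def sesame_csm() -> str:
    return "sesame-csm"

# The one line that changes when swapping providers (per ADR-0006):
make_tts = cartesia_sonic

print(make_tts())  # -> cartesia-sonic
```

Swapping to ElevenLabs means editing only the `make_tts = ...` binding; the rest of the pipeline never references a provider by name.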
## Notes
If the bundled-API approach becomes meaningfully cheaper or the naturalness gap becomes intolerable, revisit. The Pipecat structure doesn’t preclude swapping to Realtime later — it just means we’d lose the cost advantage and the LiveKit-friendly group mode.