
ADR-0002: Pipecat orchestrator (Deepgram + Claude + Cartesia) over OpenAI Realtime / Gemini Live

  • Status: Accepted
  • Date: 2026-05-03
  • Affected: backend/workers/voice-talkback/, apps/web/app/(app)/talkback/

Context

V0.2 ships voice talk-back. There are two architectural shapes for real-time voice with LLMs in 2026:

  1. Bundled — a vendor’s all-in-one realtime API (OpenAI Realtime, Gemini Live). One WebSocket carries audio in / audio out; the provider handles STT, LLM, and TTS internally.
  2. Composed — orchestrate independent STT, LLM, and TTS services over a single pipeline. Pipecat is the open-source orchestrator.

docs/01_SOFTWARE_PLAN.md §1.8 already favored the composed approach. This ADR documents why and what was given up.

Options considered

Option A — OpenAI Realtime

  • Pros: Lowest end-to-end latency; voice quality strong; one API; tool use built in.
  • Cons: Vendor lock-in: STT/LLM/TTS aren’t independently swappable. Cost ≈ $0.06/min vs Pipecat-orchestrated Deepgram+Haiku+Cartesia ≈ $0.02/min. Group conversation / multi-party support is weaker than LiveKit’s. Tool-calling support for our memory tools is less mature than the Claude Agent SDK’s.

Option B — Gemini Live

  • Same shape as OpenAI Realtime, with the same trade-offs. Additionally, Google’s voice selection is narrower, and Anthropic models aren’t an option.

Option C — Pipecat + Deepgram Nova-3 + Claude Haiku + Cartesia Sonic (chosen)

  • Pros: Each component is independently swappable: if Cartesia ships a better voice → swap the TTS service; if Deepgram ships a faster model → drop it in. Cost ≈ 1/3 of bundled. LiveKit Agents (which Pipecat composes with) is the proven multi-party WebRTC pattern, so V0.3 group mode lands cleanly. Tool use is mature on Claude.
  • Cons: One more service to deploy (the Pipecat orchestrator). Naturalness gap vs end-to-end speech models is real and requires attention to prompt + TTS choice (see ADR-0006).

Decision

Use Pipecat as the V0.2 voice orchestrator with Deepgram Nova-3 streaming STT, Claude Haiku 4.5 LLM, and Cartesia Sonic TTS. Keep OpenAI Realtime as a fallback driver in the swappable agent layer (packages/agents/src/drivers/realtime-voice.ts slot exists per the plan), available if a future use case needs the latency.
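
For orientation, here is a minimal sketch of what that composed pipeline looks like inside the voice worker. It assumes pipecat-ai’s Deepgram, Anthropic, and Cartesia service classes; the exact import paths, constructor arguments, and the Haiku model ID vary by Pipecat release, so treat it as illustrative rather than the shipped worker code.

```python
# Illustrative sketch only, not the shipped backend/workers/voice-talkback/ code.
# Assumes pipecat-ai's Deepgram / Anthropic / Cartesia services; import paths
# and constructor arguments differ across Pipecat releases.
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService

SYSTEM_PROMPT = "You are the voice talk-back assistant."  # placeholder prompt


async def run_voice_session(transport):
    """Wire STT -> LLM -> TTS behind whatever transport the caller provides."""
    stt = DeepgramSTTService(api_key=os.environ["DEEPGRAM_API_KEY"])  # Nova-3 streaming
    llm = AnthropicLLMService(
        api_key=os.environ["ANTHROPIC_API_KEY"],
        model="claude-haiku-4-5",  # model ID assumed; use the current Haiku 4.5 name
    )
    tts = CartesiaTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id=os.environ["CARTESIA_VOICE_ID"],  # Sonic voice
    )

    # Conversation state lives in a context aggregator so the LLM sees whole
    # turns rather than raw transcription frames.
    context = OpenAILLMContext([{"role": "system", "content": SYSTEM_PROMPT}])
    aggregator = llm.create_context_aggregator(context)

    # Each stage is an independent frame processor: swapping a vendor means
    # replacing one constructor, not re-architecting the session.
    pipeline = Pipeline([
        transport.input(),        # user audio in
        stt,                      # speech -> text
        aggregator.user(),        # append the user turn to context
        llm,                      # context -> response tokens
        tts,                      # tokens -> audio
        transport.output(),       # synthesized audio out
        aggregator.assistant(),   # append the assistant turn to context
    ])

    await PipelineRunner().run(PipelineTask(pipeline))
```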

Consequences

  • Voice service is a separate Python process (backend/workers/voice-talkback/). Deploys via Modal, Fly, or any long-running worker host.
  • Latency budget per 01_SOFTWARE_PLAN.md §1.8: <1.5 s from end of utterance to the AI starting to speak. Realistically 750 ms–1.55 s on the warm path.
  • Group mode (V0.3) builds on this same Pipecat pipeline: swap the WebSocket transport for LiveKit and keep the STT/LLM/TTS choices (see the transport sketch after this list).
  • TTS swap path (Cartesia → ElevenLabs v3 → Sesame CSM) is one line in main.py, illustrated after this list. See ADR-0006.
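
As a concrete illustration of that one-line swap, only the TTS constructor changes; the ElevenLabs class name is taken from pipecat-ai’s service catalog and should be checked against the installed version.

```python
# Hypothetical one-line change in main.py; everything upstream of the TTS
# stage (transport, STT, LLM, context) is untouched.
# Before: Cartesia Sonic
tts = CartesiaTTSService(api_key=os.environ["CARTESIA_API_KEY"],
                         voice_id=os.environ["CARTESIA_VOICE_ID"])

# After: ElevenLabs (class name assumed from pipecat-ai's elevenlabs extra)
tts = ElevenLabsTTSService(api_key=os.environ["ELEVENLABS_API_KEY"],
                           voice_id=os.environ["ELEVENLABS_VOICE_ID"])
```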

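The group-mode change is similar in spirit: keep the pipeline, swap only the transport. A hedged sketch, assuming Pipecat ships a LiveKit transport; the module path, class names, and parameters below are assumptions to verify against the installed release.

```python
# Sketch only; assumes pipecat-ai's LiveKit transport. Verify the module path
# and constructor signature for the installed Pipecat version.
from pipecat.transports.services.livekit import LiveKitParams, LiveKitTransport

transport = LiveKitTransport(
    url=os.environ["LIVEKIT_URL"],
    token=os.environ["LIVEKIT_TOKEN"],
    room_name="talkback-group",  # hypothetical room name
    params=LiveKitParams(audio_in_enabled=True, audio_out_enabled=True),
)

# run_voice_session(transport) from the pipeline sketch above is unchanged;
# the STT/LLM/TTS choices documented in this ADR stay as they are.
```
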
Notes

If the bundled-API approach becomes meaningfully cheaper or the naturalness gap becomes intolerable, revisit. The Pipecat structure doesn’t preclude swapping to Realtime later — it just means we’d lose the cost advantage and the LiveKit-friendly group mode.