- Status: Accepted
- Date: 2026-05-03
- Affected: `backend/workers/voice-talkback/`, `packages/agents/src/roles/`
Context
V0.2 ships Pipecat-orchestrated voice (per ADR-0002). The default Cartesia Sonic voice is decent but doesn’t cross what Sesame’s CSM paper calls the “uncanny valley of voice” — context-aware prosody, disfluency, conversational register.
The question: how do we maximize naturalness without locking ourselves into a single TTS vendor?
Options considered
Option A — Pick the best TTS we can find, accept what it produces
- Pros: One decision, ship.
- Cons: TTS vendors leapfrog quarterly. Locking in one buys today’s ceiling and tomorrow’s regret. Most “AI voice” tells aren’t TTS tells — they’re text tells. Even the best voice reading AI-shaped text sounds like a chatbot.
Option B — Adopt OpenAI Realtime / Gemini Live for voice
- Pros: End-to-end speech models close the prosody gap by conditioning on audio history, not just text.
- Cons: Rejected for V0.2 in ADR-0002 for vendor lock-in and cost reasons; nothing here changes that assessment.
Option C — Three layers, each independently swappable (chosen)
- Text shape — system prompt enforces phone-call register: contractions, fillers, no preambles, no lists, ALL CAPS for one-word emphasis. The biggest naturalness lever; the LLM can’t sound human if its words don’t.
- TTS provider — Cartesia today, ElevenLabs v3 (audio tags) or Sesame CSM tomorrow. One-line swap in the Pipecat pipeline.
- Prosody control — emotion classifier on user audio → prosody/emotion tag injected into TTS metadata. Custom `FrameProcessor` between LLM and TTS in the Pipecat pipeline.
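Layer 1 is the one worth seeing concretely. A minimal sketch of the text-shape difference — both strings and the crude checker are invented for illustration, not taken from the shipped prompt or code:

```python
# Illustrative only: "AI-shaped" text vs. the phone-call register that
# layer 1 enforces. Both example strings are invented.
AI_SHAPED = (
    "Certainly! Here are three considerations: 1. Latency. 2. Cost. "
    "3. Vendor lock-in. Let me know if you need more detail."
)
PHONE_REGISTER = (
    "Honestly? It's mostly latency. Cost matters too, but, uh, "
    "lock-in is the one I'd REALLY worry about."
)

def sounds_spoken(text: str) -> bool:
    """Crude heuristic: contractions present, no list/preamble markers."""
    has_contraction = "'" in text
    has_list_marker = any(m in text for m in ("1.", "2.", "Here are"))
    return has_contraction and not has_list_marker
```

Even the best TTS reading `AI_SHAPED` sounds like a chatbot; that is the point of putting text shape first.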
Decision
Ship layer 1 immediately. The voice service's `DEFAULT_SYSTEM_PROMPT` is rewritten in spoken register. `composeVoiceSystemPrompt(role)` in `@arcive/agents` builds the prompt for any role by appending the shared voice addendum.
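The real `composeVoiceSystemPrompt` lives in TypeScript in `@arcive/agents`; the composition pattern it implements is simple enough to sketch in Python. The addendum wording below is illustrative, not the shipped prompt:

```python
# Python sketch of the composition pattern only; the shipped helper is
# composeVoiceSystemPrompt(role) in @arcive/agents (TypeScript).
# The addendum text here is invented for illustration.
VOICE_MODE_ADDENDUM = """\
You are on a live phone call. Speak the way people actually talk:
use contractions and the occasional filler, no preambles, no lists,
ALL CAPS only for one-word emphasis."""

def compose_voice_system_prompt(role_prompt: str) -> str:
    """Layer the shared voice addendum onto a role's full text prompt."""
    return f"{role_prompt}\n\n{VOICE_MODE_ADDENDUM}"
```

Because the addendum is appended rather than substituted, every role keeps its full text-channel prompt and gains the voice register for free.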
Layer 2 stays open. The Pipecat `tts =` line is kept as a single swap-point. ElevenLabs v3 and Sesame CSM are documented as upgrade paths in `backend/workers/voice-talkback/README.md`.
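The swap-point is worth making explicit. The class names below mirror Pipecat's TTS service naming, but they are stubbed here so the shape is visible without the `pipecat` dependency; treat the exact imports and constructor signatures as assumptions to verify against the Pipecat docs:

```python
# Sketch of the single swap-point. Class names echo Pipecat's TTS
# services but are stubbed; real code would import them from pipecat.
class CartesiaTTSService:
    def __init__(self, voice_id: str):
        self.name = "cartesia"

class ElevenLabsTTSService:
    def __init__(self, voice_id: str):
        self.name = "elevenlabs"

def make_tts(provider: str, voice_id: str):
    """The one place the pipeline knows which vendor it is using."""
    factories = {
        "cartesia": CartesiaTTSService,
        "elevenlabs": ElevenLabsTTSService,
    }
    return factories[provider](voice_id)

# tts = make_tts("cartesia", voice_id="...")  # the one-line swap
```

Everything downstream of `tts =` is vendor-agnostic, which is what keeps the swap a one-line change.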
Layer 3 deferred — V0.3 work; the Pipecat processor architecture already supports it.
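For layer 3, the deferred V0.3 work, a pure-Python sketch of what the custom processor between LLM and TTS might do — map a classified user emotion to prosody metadata without touching the text. In Pipecat this would subclass `FrameProcessor`; all names and tag values here are hypothetical:

```python
# Hypothetical sketch of the deferred layer 3: a classifier labels the
# user's audio, and this step injects matching prosody hints into the
# TTS request. In Pipecat this logic would live in a custom
# FrameProcessor between the LLM and TTS stages.
PROSODY_TAGS = {
    "frustrated": {"speed": "slow", "emotion": "calm"},
    "excited":    {"speed": "fast", "emotion": "upbeat"},
    "neutral":    {},
}

def inject_prosody(tts_request: dict, user_emotion: str) -> dict:
    """Attach prosody hints as metadata; the text itself is unchanged."""
    tagged = dict(tts_request)
    tagged["metadata"] = {
        **tagged.get("metadata", {}),
        **PROSODY_TAGS.get(user_emotion, {}),
    }
    return tagged
```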
Consequences
- The naturalness wins ship now without infra changes. Most of the perceived "AI voice" gap closes from the text rewrite alone.
- Future TTS swaps don’t require rewriting roles or prompts — the voice addendum is TTS-agnostic.
- ElevenLabs audio tags (`[whispers]`, `[laughs]`, `[soft]`) need a prompt addition to get emitted. When we swap to ElevenLabs, add a short bullet to `VOICE_MODE_ADDENDUM` saying so. Sesame CSM doesn't use inline tags — it conditions on history; that's a service-level change, not a prompt change.
- A real test of naturalness needs a deployed Pipecat instance. Pre-deployment, read the LLM output aloud yourself; if it sounds AI to you, it'll sound worse coming from a TTS.
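One way the ElevenLabs-only bullet could be gated on the active provider, so Cartesia and Sesame CSM never see tag instructions they can't honor — bullet wording and function names are invented for illustration:

```python
# Sketch of gating the audio-tag instruction on the active provider.
# ELEVENLABS_TAGS_BULLET wording and the helper name are illustrative.
ELEVENLABS_TAGS_BULLET = (
    "- You may mark delivery with inline audio tags like "
    "[whispers], [laughs], or [soft] where they fit naturally."
)

def voice_addendum_for(provider: str, base_addendum: str) -> str:
    """Only ElevenLabs interprets inline audio tags; others get the base."""
    if provider == "elevenlabs":
        return f"{base_addendum}\n{ELEVENLABS_TAGS_BULLET}"
    return base_addendum  # Cartesia / Sesame CSM: no inline tags
```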
Notes
The voice prompt is a separate file from the role’s text-channel prompt (the role keeps its full prompt; the addendum gets layered). This means:
- Reviewer chat in `/roles/<reviewer>` uses the role's text prompt.
- Reviewer voice in `/talkback?role=reviewer` uses `composeVoiceSystemPrompt(REVIEWER_ROLE)`, which is text + addendum.
This composition pattern works for any future role.