ADR-0006: Voice naturalness — text-first, TTS-second, prosody-third

  • Status: Accepted
  • Date: 2026-05-03
  • Affected: backend/workers/voice-talkback/, packages/agents/src/roles/

Context

V0.2 ships Pipecat-orchestrated voice (per ADR-0002). The default Cartesia Sonic voice is decent, but it doesn’t cross what Sesame’s CSM paper calls the “uncanny valley of voice”: it lacks context-aware prosody, natural disfluency, and a conversational register.

The question: how do we maximize naturalness without locking ourselves into a single TTS vendor?

Options considered

Option A — Pick the best TTS we can find, accept what it produces

  • Pros: One decision, ship.
  • Cons: TTS vendors leapfrog quarterly. Locking in one buys today’s ceiling and tomorrow’s regret. Most “AI voice” tells aren’t TTS tells — they’re text tells. Even the best voice reading AI-shaped text sounds like a chatbot.

Option B — Adopt OpenAI Realtime / Gemini Live for voice

  • Pros: End-to-end speech models close the prosody gap by conditioning on audio history, not just text.
  • Cons: Rejected for V0.2 in ADR-0002 over vendor lock-in and cost; that reasoning hasn’t changed.

Option C — Three layers, each independently swappable (chosen)

  1. Text shape — system prompt enforces phone-call register: contractions, fillers, no preambles, no lists, ALL CAPS for one-word emphasis. The biggest naturalness lever; the LLM can’t sound human if its words don’t.
  2. TTS provider — Cartesia today, ElevenLabs v3 (audio tags) or Sesame CSM tomorrow. One-line swap in the Pipecat pipeline.
  3. Prosody control — emotion classifier on user audio → prosody/emotion tag injected into TTS metadata, via a custom FrameProcessor between LLM and TTS in the Pipecat pipeline (see the sketch after this list).
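
For concreteness, a sketch of where layers 2 and 3 sit in the pipeline. This assumes pipecat-ai’s Python API (import paths vary across releases); transport, stt, llm, and context_aggregator stand for whatever the voice worker already constructs, passed in as parameters here.

    from pipecat.pipeline.pipeline import Pipeline

    def build_pipeline(transport, stt, llm, context_aggregator, tts):
        # Layer 1 lives inside llm: its system prompt is the composed
        # voice prompt (role prompt + voice addendum).
        return Pipeline([
            transport.input(),              # user audio in
            stt,                            # speech -> text
            context_aggregator.user(),
            llm,                            # layer 1: text shape via system prompt
            # layer 3 slots in here in V0.3: a FrameProcessor that tags
            # LLM output with prosody/emotion before it reaches the TTS
            tts,                            # layer 2: the swappable TTS service
            transport.output(),             # agent audio out
            context_aggregator.assistant(),
        ])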

Decision

Ship layer 1 immediately. The voice service’s DEFAULT_SYSTEM_PROMPT is rewritten in spoken register. composeVoiceSystemPrompt(role) in @arcive/agents builds the prompt for any role by appending the shared voice addendum.
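
The composition itself is small. The shipped helper is TypeScript in @arcive/agents; the following is a hypothetical Python sketch of the same shape, with illustrative addendum wording rather than the real constant:

    # Hypothetical sketch; the shipped helper is composeVoiceSystemPrompt
    # in @arcive/agents (TypeScript). Addendum wording is illustrative.
    VOICE_MODE_ADDENDUM = (
        "You are speaking on a phone call. Use contractions and natural "
        "fillers. No preambles, no lists, no markdown. Use ALL CAPS only "
        "for one-word emphasis."
    )

    def compose_voice_system_prompt(role_prompt: str) -> str:
        # The role keeps its full text-channel prompt; the voice addendum
        # is layered on top, so the composition stays TTS-agnostic.
        return f"{role_prompt}\n\n{VOICE_MODE_ADDENDUM}"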

Layer 2 stays open. The tts = line in the Pipecat pipeline is kept as the single swap point. ElevenLabs v3 and Sesame CSM are documented as upgrade paths in backend/workers/voice-talkback/README.md.
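
The swap point, assuming Pipecat’s bundled service classes (constructor arguments per recent pipecat-ai releases; keys and voice IDs are placeholders):

    import os

    from pipecat.services.cartesia.tts import CartesiaTTSService

    # Today:
    tts = CartesiaTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id=os.environ["CARTESIA_VOICE_ID"],
    )

    # Tomorrow: one import and one constructor, nothing else moves.
    # from pipecat.services.elevenlabs.tts import ElevenLabsTTSService
    # tts = ElevenLabsTTSService(
    #     api_key=os.environ["ELEVENLABS_API_KEY"],
    #     voice_id=os.environ["ELEVENLABS_VOICE_ID"],
    # )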

Layer 3 is deferred to V0.3; the Pipecat processor architecture already supports it.
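
For reference, a minimal sketch of what that processor could look like, built on Pipecat’s FrameProcessor base class. The classifier and the inline-tag format are placeholders, not a committed design; injecting into TTS metadata instead would be a variant of the same slot.

    from pipecat.frames.frames import Frame, TextFrame
    from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

    class ProsodyTagger(FrameProcessor):
        """Hypothetical layer-3 processor: prefixes LLM text with an emotion
        tag derived from the user's audio, for TTS services that accept
        inline tags. A real version would tag once per utterance, not once
        per streamed text chunk."""

        def __init__(self, classifier):
            super().__init__()
            self._classifier = classifier  # emotion classifier over user audio

        async def process_frame(self, frame: Frame, direction: FrameDirection):
            await super().process_frame(frame, direction)
            if isinstance(frame, TextFrame) and frame.text.strip():
                tag = self._classifier.current_emotion()  # e.g. "soft"
                frame = TextFrame(f"[{tag}] {frame.text}")
            await self.push_frame(frame, direction)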

Consequences

  • The naturalness win ships now, with no infra change. Most of the perceived “AI voice” gap closes from the text rewrite alone.
  • Future TTS swaps don’t require rewriting roles or prompts — the voice addendum is TTS-agnostic.
  • ElevenLabs audio tags ([whispers], [laughs], [soft]) need a prompt addition to get emitted. When we swap to ElevenLabs, add a short bullet to VOICE_MODE_ADDENDUM saying so (illustrated after this list). Sesame CSM doesn’t use inline tags; it conditions on audio history, so that swap is a service-level change, not a prompt change.
  • A real test of naturalness needs a deployed Pipecat instance. Pre-deployment, read the LLM output aloud yourself; if it sounds AI to you, it’ll sound worse coming from a TTS.
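
To illustrate the audio-tag bullet above: once the addendum licenses tags, a tagged line from the LLM might look like this (hypothetical output, not a committed tag set):

    [soft] Okay, so... the deploy’s actually fine. [laughs] You were right about the cache.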

Notes

The voice prompt lives in a separate file from the role’s text-channel prompt (the role keeps its full prompt; the voice addendum is layered on top). This means:

  • Reviewer chat in /roles/<reviewer> uses the role’s text prompt.
  • Reviewer voice in /talkback?role=reviewer uses composeVoiceSystemPrompt(REVIEWER_ROLE), which is text + addendum.

This composition pattern works for any future role.