ADR-0006: Voice naturalness — text-first, TTS-second, prosody-third

  • Status: Accepted
  • Date: 2026-05-03
  • Affected: backend/workers/voice-talkback/, packages/agents/src/roles/

Context

V0.2 ships Pipecat-orchestrated voice (per ADR-0002). The default Cartesia Sonic voice is decent, but it doesn’t cross what Sesame’s CSM paper calls the “uncanny valley of voice”: it lacks context-aware prosody, natural disfluency, and a conversational register.

The question: how do we maximize naturalness without locking ourselves into a single TTS vendor?

Options considered

Option A — Pick the best TTS we can find, accept what it produces

  • Pros: One decision, ship.
  • Cons: TTS vendors leapfrog quarterly. Locking in one buys today’s ceiling and tomorrow’s regret. Most “AI voice” tells aren’t TTS tells — they’re text tells. Even the best voice reading AI-shaped text sounds like a chatbot.

Option B — Adopt OpenAI Realtime / Gemini Live for voice

  • Pros: End-to-end speech models close the prosody gap by conditioning on audio history, not just text.
  • Cons: Rejected for V0.2 in ADR-0002 over vendor lock-in and cost; that reasoning hasn’t changed.

Option C — Three layers, each independently swappable (chosen)

  1. Text shape — system prompt enforces phone-call register: contractions, fillers, no preambles, no lists, ALL CAPS for one-word emphasis. The biggest naturalness lever; the LLM can’t sound human if its words don’t.
  2. TTS provider — Cartesia today, ElevenLabs v3 (audio tags) or Sesame CSM tomorrow. One-line swap in the Pipecat pipeline.
  3. Prosody control — emotion classifier on user audio → prosody/emotion tag injected into TTS metadata, via a custom FrameProcessor between LLM and TTS in the Pipecat pipeline (see the sketch after this list).
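
For concreteness, a sketch of where layers 2 and 3 sit in the pipeline. This assumes pipecat-ai’s Python API (import paths vary across releases); transport, stt, llm, and context_aggregator stand for whatever the voice worker already constructs, passed in as parameters here.

    from pipecat.pipeline.pipeline import Pipeline

    def build_pipeline(transport, stt, llm, context_aggregator, tts):
        # Layer 1 lives inside llm: its system prompt is the composed
        # voice prompt (role prompt + voice addendum).
        return Pipeline([
            transport.input(),              # user audio in
            stt,                            # speech -> text
            context_aggregator.user(),
            llm,                            # layer 1: text shape via system prompt
            # layer 3 slots in here in V0.3: a FrameProcessor that tags
            # LLM output with prosody/emotion before it reaches the TTS
            tts,                            # layer 2: the swappable TTS service
            transport.output(),             # agent audio out
            context_aggregator.assistant(),
        ])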

Decision

Ship layer 1 immediately. The voice service’s DEFAULT_SYSTEM_PROMPT is rewritten in spoken register. composeVoiceSystemPrompt(role) in @arcive/agents builds the prompt for any role by appending the shared voice addendum.
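
The composition itself is small. The shipped helper is TypeScript in @arcive/agents; the following is a hypothetical Python sketch of the same shape, with illustrative addendum wording rather than the real constant:

    # Hypothetical sketch; the shipped helper is composeVoiceSystemPrompt
    # in @arcive/agents (TypeScript). Addendum wording is illustrative.
    VOICE_MODE_ADDENDUM = (
        "You are speaking on a phone call. Use contractions and natural "
        "fillers. No preambles, no lists, no markdown. Use ALL CAPS only "
        "for one-word emphasis."
    )

    def compose_voice_system_prompt(role_prompt: str) -> str:
        # The role keeps its full text-channel prompt; the voice addendum
        # is layered on top, so the composition stays TTS-agnostic.
        return f"{role_prompt}\n\n{VOICE_MODE_ADDENDUM}"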

Layer 2 stays open. The tts = line in the Pipecat pipeline is kept as the single swap point. ElevenLabs v3 and Sesame CSM are documented as upgrade paths in backend/workers/voice-talkback/README.md.
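
The swap point, assuming Pipecat’s bundled service classes (constructor arguments per recent pipecat-ai releases; keys and voice IDs are placeholders):

    import os

    from pipecat.services.cartesia.tts import CartesiaTTSService

    # Today:
    tts = CartesiaTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id=os.environ["CARTESIA_VOICE_ID"],
    )

    # Tomorrow: one import and one constructor, nothing else moves.
    # from pipecat.services.elevenlabs.tts import ElevenLabsTTSService
    # tts = ElevenLabsTTSService(
    #     api_key=os.environ["ELEVENLABS_API_KEY"],
    #     voice_id=os.environ["ELEVENLABS_VOICE_ID"],
    # )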

Layer 3 is deferred to V0.3; the Pipecat processor architecture already supports it.
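
For reference, a minimal sketch of what that processor could look like, built on Pipecat’s FrameProcessor base class. The classifier and the inline-tag format are placeholders, not a committed design; injecting into TTS metadata instead would be a variant of the same slot.

    from pipecat.frames.frames import Frame, TextFrame
    from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

    class ProsodyTagger(FrameProcessor):
        """Hypothetical layer-3 processor: prefixes LLM text with an emotion
        tag derived from the user's audio, for TTS services that accept
        inline tags. A real version would tag once per utterance, not once
        per streamed text chunk."""

        def __init__(self, classifier):
            super().__init__()
            self._classifier = classifier  # emotion classifier over user audio

        async def process_frame(self, frame: Frame, direction: FrameDirection):
            await super().process_frame(frame, direction)
            if isinstance(frame, TextFrame) and frame.text.strip():
                tag = self._classifier.current_emotion()  # e.g. "soft"
                frame = TextFrame(f"[{tag}] {frame.text}")
            await self.push_frame(frame, direction)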

Consequences

  • The naturalness win ships now, with no infra change. Most of the perceived “AI voice” gap closes from the text rewrite alone.
  • Future TTS swaps don’t require rewriting roles or prompts — the voice addendum is TTS-agnostic.
  • ElevenLabs audio tags ([whispers], [laughs], [soft]) need a prompt addition to get emitted. When we swap to ElevenLabs, add a short bullet to VOICE_MODE_ADDENDUM saying so (illustrated after this list). Sesame CSM doesn’t use inline tags; it conditions on audio history, so that swap is a service-level change, not a prompt change.
  • A real test of naturalness needs a deployed Pipecat instance. Pre-deployment, read the LLM output aloud yourself; if it sounds AI to you, it’ll sound worse coming from a TTS.
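
To illustrate the audio-tag bullet above: once the addendum licenses tags, a tagged line from the LLM might look like this (hypothetical output, not a committed tag set):

    [soft] Okay, so... the deploy’s actually fine. [laughs] You were right about the cache.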

Notes

The voice prompt lives in a separate file from the role’s text-channel prompt (the role keeps its full prompt; the voice addendum is layered on top). This means:

  • Reviewer chat in /roles/<reviewer> uses the role’s text prompt.
  • Reviewer voice in /talkback?role=reviewer uses composeVoiceSystemPrompt(REVIEWER_ROLE), which is text + addendum.

This composition pattern works for any future role.