
ADR-0010: Defer voice talk-back beyond V0.2

  • Status: Accepted
  • Date: 2026-05-05
  • Affected: backend/workers/voice-talkback/, apps/web/app/(app)/talkback/, packages/agents/src/drivers/realtime-voice.ts (placeholder), docs/03_PROGRESS.md, docs/00_MASTER_PLAN.md Phase 2 roadmap entry

Context

V0.2 scoped voice talk-back as a web feature: Pipecat orchestrator (Deepgram Nova-3 → Claude Haiku 4.5 → Cartesia Sonic) with a /talkback page. The service is scaffolded and the web client renders; neither is on the critical path to a usable demo today.

Three things converged to make the case for deferral concrete:

  1. Voice is web-only and most users live on phone. Mobile talk-back was the next-up #1 item (03_PROGRESS.md). A spike on 2026-05-05 (recorded in 04_JOURNAL.md) confirmed @pipecat-ai/client-js won’t run in Expo Go — it depends on navigator.mediaDevices.getUserMedia and Web Audio APIs that only react-native-webrtc provides, and that requires an EAS dev build. Per project_apple_deferred the Apple Developer + EAS work is parked until release prep. So mobile voice is structurally blocked until that unlocks.
  2. The on-page promise doesn’t match what ships. The /talkback page currently advertises “holds the conversation while it retrieves from your memories” — but the LLM has no retrieval. MCP wiring (V0.3 next-up) was the planned fix, but adds a double-LLM tool-use hop that puts every retrieval-needing turn at 2-2.5s, beyond the 1.5s budget set in 01_SOFTWARE_PLAN §1.8. A hybrid (pre-fetch + tool-use) would work but is a real chunk of work for a feature with no users.
  3. The architecture envelope is shifting. Through 2025 OpenAI Realtime, Gemini Live, and AWS Nova Sonic moved speech-to-speech models with native MCP tool use into GA — Pipecat’s STT→LLM→TTS pipeline (the shape ARCIVE picked in ADR-0002) is starting to feel dated for solo voice chat. Anthropic still has no native real-time speech model, so staying on Claude means staying on Pipecat-shaped composition. That tension is not urgent to resolve; it’s also not productive to invest more in Pipecat-only polish until the resolution is clearer.

Meanwhile, the things users do touch — capture, browse, search, spaces, the AI assistant — have validation gaps that are cheaper to close and more visible if broken.
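The latency arithmetic behind point 2 can be sketched with illustrative per-stage numbers. These stage figures are assumptions for illustration, not measurements; only the 1.5s budget (01_SOFTWARE_PLAN §1.8) and the ~2s tool-use total come from the docs above.

```typescript
// Illustrative latency budget for one voice turn. Stage numbers are
// rough assumptions chosen to match the ~2s figure cited in this ADR.
const BUDGET_MS = 1500; // per 01_SOFTWARE_PLAN §1.8

const singleHopMs = {
  stt: 300, // Deepgram Nova-3 finalization
  llm: 600, // one Claude Haiku completion
  tts: 200, // Cartesia Sonic time-to-first-audio
};

// MCP tool use adds a second LLM round trip: the model first plans the
// tool call, retrieval runs, then the model generates the spoken answer.
const toolUseExtraMs = {
  llmToolCall: 600, // second Claude hop
  retrieval: 300,   // memory search
};

const plainTurn = Object.values(singleHopMs).reduce((a, b) => a + b, 0);
const retrievalTurn =
  plainTurn + Object.values(toolUseExtraMs).reduce((a, b) => a + b, 0);

console.log(plainTurn, retrievalTurn, retrievalTurn > BUDGET_MS);
```

Under these assumed numbers a plain turn fits the budget (1100ms) while every retrieval-bearing turn blows it (2000ms), which is the shape of the problem the hybrid approach tries to dodge.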

Options considered

Option A — Hybrid retrieval on voice, full ship

Wire MCP into the voice service with session-open pre-fetch + tool-use for explicit recall, plus prompt caching. Port the web client to RN via a custom WS transport (since Pipecat client-js is browser-only). Ship voice as a real V0.2 feature.

  • Pros: Truth-in-advertising on the talkback page; cross-platform parity per feedback_cross_platform_parity; validates ADR-0002 + ADR-0004 (MCP separation) end-to-end.
  • Cons: ~1-2 weeks of work concentrated on a feature with no users; RN port is structurally blocked by Expo Go (see Context #1) so the “ship” is web-only anyway; doesn’t address the speech-to-speech architecture question.
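The hybrid split in Option A could be sketched as follows. This is a hypothetical illustration, not ARCIVE code: fetchTopMemories stands in for an MCP retrieval call, and the trigger phrases are invented examples.

```typescript
// Hypothetical sketch of hybrid retrieval: pre-fetch at session open,
// tool use only for explicit recall. Names here are assumptions.
type Memory = { id: string; text: string };

// Stub standing in for an MCP retrieval call.
async function fetchTopMemories(userId: string, query?: string): Promise<Memory[]> {
  return [{ id: "m1", text: `memory for ${userId}${query ? ` about ${query}` : ""}` }];
}

// Session open: one pre-fetch, folded into the system prompt, so most
// turns never pay the double-LLM tool-use hop.
async function openSessionPrompt(userId: string): Promise<string> {
  const memories = await fetchTopMemories(userId);
  return `You can reference these memories:\n${memories.map(m => `- ${m.text}`).join("\n")}`;
}

// Per turn: only explicit recall language escalates to a live tool call
// (and accepts the extra latency on that turn alone).
const RECALL_TRIGGERS = [/remember when/i, /what did i say about/i, /find my/i];

function needsToolUse(utterance: string): boolean {
  return RECALL_TRIGGERS.some(re => re.test(utterance));
}
```

The design point is that the expensive path is opt-in per turn: casual conversation rides on the pre-fetched context, and only turns that clearly ask for recall pay the second LLM hop.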

Option B — Swap voice brain to OpenAI Realtime / Gemini Live

Adopt a real-time speech-to-speech model with native MCP tool support. Pipecat can host either. Closes the latency gap and the tool-use double-hop in one move.

  • Pros: Lowest end-to-end latency; native MCP tool use; matches where the industry has moved.
  • Cons: Breaks single-brain coherence with the rest of ARCIVE (Claude everywhere else); ~3× cost per minute; supersedes ADR-0002 which was just accepted; doesn’t solve the Expo Go mobile blocker; doesn’t repay the work given low usage.

Option C — Defer voice talk-back, hold the architecture (chosen)

Pause active development on voice talk-back. Keep the deployed service and the web page in place behind an admin flag (already wired: v0_2_voice_talkback) so the feature is reactivatable in a single PR, but stop allocating effort against it. Surface “paused — coming back later” on the page; remove the link from primary nav. Revisit when EAS unblocks mobile voice and the speech-to-speech-vs-Pipecat question can be settled with a concrete user use-case driving it.
  • Pros: Frees the next 1-2 weeks for higher-leverage work (real e2e pipeline test, mobile parity items users actually see, prod hardening); avoids investing in either Pipecat polish or a Realtime swap before the deciding constraints (EAS, naturalness bar, tool-use latency tolerance) are clearer; preserves all built artifacts for resumption.
  • Cons: Voice talk-back was a public V0.2 storyline — stakeholders may notice the rollback. The marketing copy (“talk to the Reviewer, hands-free”) will need a stewardship pass.

Decision

Option C: defer voice talk-back beyond V0.2. Reactivate when:

  1. EAS dev build / Apple Developer setup completes (release-prep gate per project_apple_deferred) — unblocks mobile parity.
  2. A concrete user signal (interview, support volume, demo need) establishes whether the speech-to-speech pivot is worth superseding ADR-0002, OR confirms Pipecat + Claude + hybrid retrieval is the shape we want.

Until then: the deployed service stays runnable, the page is paused, the nav link is removed, and no further engineering is invested. ADR-0002 is not superseded — it remains the chosen shape if/when work resumes on the Pipecat side. ADR-0006 (voice naturalness) likewise remains accepted but inert.

Consequences

  • apps/web/app/(app)/talkback/page.tsx shows a paused notice; <VoiceClient /> is not rendered. voice-client.tsx and the Pipecat client deps stay in package.json (resumption is a single PR; uninstalling and re-installing for an indeterminate pause adds friction without saving anything).
  • Primary nav (apps/web/app/(app)/layout.tsx) drops the Voice link. The route still resolves; bookmarks and external links don’t 404, they hit the paused page.
  • backend/workers/voice-talkback/ stays deployed (or re-deployable) but is removed from active feature lists. README carries a “paused” banner pointing to this ADR.
  • MCP wiring on voice service drops out of next-up. The MCP server itself is unchanged — web /api/chat still uses it.
  • PROGRESS moves the two voice items from “Next up” into a “Deferred” bucket with this ADR as the reason.
  • MASTER_PLAN Phase 2 keeps the historical text intact but is annotated: voice talk-back was scaffolded then paused.
  • Roadmap copy (README, About page) is updated to reflect that voice conversation is paused, not “built — pending validation”.
  • Feature flag v0_2_voice_talkback stays defined in apps/web/lib/feature-flags.ts; defaults already false. No flag flip needed.
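A minimal sketch of what the flag module could look like. Only the flag name v0_2_voice_talkback comes from this ADR; the module shape is an assumption (in the real apps/web/lib/feature-flags.ts these would be exported).

```typescript
// Minimal sketch of a feature-flag module; the flag name is from the ADR,
// everything else is assumed for illustration.
const featureFlags = {
  v0_2_voice_talkback: false, // paused; resumption is a one-line flip
} as const;

type FeatureFlag = keyof typeof featureFlags;

function isEnabled(flag: FeatureFlag): boolean {
  return featureFlags[flag];
}

console.log(isEnabled("v0_2_voice_talkback")); // logs false: page stays paused
```

Because the default is already false, deferral requires no flag flip, and reactivation is a one-line change gated behind the same admin surface.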

Notes

The 2025-2026 voice landscape — menu for the resumption decision

The reason we’re holding rather than swapping ADR-0002 outright: the menu of credible voice backends widened sharply in 2025. Anyone re-litigating this should pick from this list with a concrete user constraint in hand, not pre-emptively. The shape of the choice is no longer “Pipecat vs OpenAI Realtime” — it’s at least six paths.

Bundled commercial speech-to-speech (lowest latency, vendor lock):

  • OpenAI Realtime API — GA 2025. ~300-500ms first audio. Native MCP tool support. ~$0.06/min. Voices fixed (4-6).
  • Google Gemini Live — same shape; multimodal (vision in audio context); Google-only models.
  • AWS Nova Sonic — Bedrock-native real-time speech. Tool use, streaming. Useful if AWS is the gravity well; we’re on Supabase so it isn’t.

Open-source speech-to-speech (free to self-host, more ops):

  • Microsoft VibeVoice (MIT, Aug 2025) — frontier-quality long-form TTS, expressive, multi-speaker dialog support. Replaces Cartesia for the TTS slot specifically. Doesn’t change the orchestration — Pipecat still wraps it. Most realistic OSS swap: drops in for Cartesia, removes a paid dependency.
  • Kyutai Moshi (open weights, full-duplex) — true speech-to-speech in a single 7B model. Different architecture from STT→LLM→TTS composition. Quality below frontier closed models but improving.
  • Sesame CSM — already named in ADR-0006 as a TTS upgrade path; Canopy Labs released open weights. Naturalness-focused.
  • Phi-4-multimodal (Microsoft, Feb 2025) — open-weights multimodal including audio in/out. Smaller than frontier real-time; useful for edge / local-only variants on the platform roadmap.

Bundled-but-cheaper-or-friendlier middle ground:

  • ElevenLabs Conversational AI — bundled platform; STT + LLM (any via API) + ElevenLabs TTS, tool use built in. Less lock than OpenAI Realtime since you bring your own LLM; more managed than Pipecat.
  • Hume EVI 3 — emotion-aware voice agent platform. Niche but relevant if the Caregiver/companionship use case is the deciding signal on resumption.
  • Pipecat Cloud / Daily — managed Pipecat. Same architecture as what we have, hosted. Lowest-friction path to “make voice an ops problem someone else solves” without changing the model choices.

Free-for-end-users tier (Microsoft Copilot Voice, ChatGPT Voice free tier): worth noting as competitive context — users now expect voice chat to be free in general-purpose assistants. ARCIVE-shaped voice (memory-grounded, role-personalized) is differentiated, but the pricing perception is set elsewhere.

Decision shape on resumption

The question to ask is no longer “Pipecat or Realtime?” but “what’s the deciding constraint?”:

| If the deciding constraint is… | Likely answer |
| --- | --- |
| “Mobile voice in Expo Go” | Still blocked — same as 2026-05-05. Doesn’t matter which backend; no native WebRTC, no go. Solve EAS first. |
| “Naturalness gap” | TTS swap to VibeVoice / Sesame / ElevenLabs (smallest move; ADR-0006 path). |
| “Latency budget on retrieval-bearing turns” | Either hybrid retrieval on Pipecat or a Realtime/Gemini Live swap. Supersedes ADR-0002 only in the second case. |
| “Cost per minute at scale” | Stay on Pipecat composition; OSS TTS (VibeVoice). |
| “Ops burden of running our own voice service” | Pipecat Cloud / managed; no architecture change. |
| “Single-brain coherence with Claude” | Stay composed (Pipecat or managed Pipecat); rule out Realtime/Live/Sonic. |

ARCIVE so far has answered the last row “yes — one brain”. Until usage data argues otherwise, that answer stands. The landscape expansion strengthens the case for deferral: spending a week polishing Pipecat against a moving target is less defensible now than when ADR-0002 was written 2 days ago.

Resumption checklist

  1. Re-add Voice to nav in layout.tsx.
  2. Remove the paused notice in talkback/page.tsx and re-render <VoiceClient />.
  3. Pick from the menu above based on the deciding constraint; write a follow-up ADR (hybrid retrieval) or a superseding ADR (Realtime/Live swap).
  4. Refresh README “What works” / “Paused” sections + About page V0.2 row + PROGRESS Deferred bucket.
  5. Re-run the mobile-voice spike if EAS is now unlocked.