- Status: Accepted
- Date: 2026-05-05
- Affected: `backend/workers/voice-talkback/`, `apps/web/app/(app)/talkback/`, `packages/agents/src/drivers/realtime-voice.ts` (placeholder), `docs/03_PROGRESS.md`, `docs/00_MASTER_PLAN.md` Phase 2 roadmap entry
## Context
V0.2 scoped voice talk-back as a web feature: Pipecat orchestrator
(Deepgram Nova-3 → Claude Haiku 4.5 → Cartesia Sonic) with a
`/talkback` page. The service is scaffolded and the web client renders;
neither is on the critical path to a usable demo today.
Three things converged to make the case for deferral concrete:
- Voice is web-only and most users live on phone. Mobile talk-back was the next-up #1 item (03_PROGRESS.md). A spike on 2026-05-05 (recorded in 04_JOURNAL.md) confirmed `@pipecat-ai/client-js` won't run in Expo Go — it depends on `navigator.mediaDevices.getUserMedia` and Web Audio APIs that only `react-native-webrtc` provides, and that requires an EAS dev build. Per project_apple_deferred the Apple Developer + EAS work is parked until release prep. So mobile voice is structurally blocked until that unlocks.
- The on-page promise doesn't match what ships. The `/talkback` page currently advertises "holds the conversation while it retrieves from your memories" — but the LLM has no retrieval. MCP wiring (V0.3 next-up) was the planned fix, but it adds a double-LLM tool-use hop that puts every retrieval-needing turn at 2-2.5s, beyond the 1.5s budget set in 01_SOFTWARE_PLAN §1.8. A hybrid (pre-fetch + tool-use) would work, but it is a real chunk of work for a feature with no users.
- The architecture envelope is shifting. Through 2025, OpenAI Realtime, Gemini Live, and AWS Nova Sonic moved speech-to-speech models with native MCP tool use into GA — Pipecat's STT→LLM→TTS pipeline (the shape ARCIVE picked in ADR-0002) is starting to feel dated for solo voice chat. Anthropic still has no native real-time speech model, so staying on Claude means staying on Pipecat-shaped composition. That tension is not urgent to resolve; it's also not productive to invest more in Pipecat-only polish until the resolution is clearer.
Meanwhile, the things users do touch — capture, browse, search, spaces, the AI assistant — have validation gaps that are cheaper to close and more visible if broken.
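The 1.5s-budget claim above can be sanity-checked with back-of-envelope arithmetic. All stage numbers below are illustrative assumptions for the sketch, not measured ARCIVE figures; only the 1.5s budget itself comes from 01_SOFTWARE_PLAN §1.8.

```typescript
// Illustrative per-turn latency accounting for the voice pipeline.
type StageMs = Record<string, number>;

const BUDGET_MS = 1500; // budget per 01_SOFTWARE_PLAN §1.8

function turnLatency(stages: StageMs): number {
  return Object.values(stages).reduce((a, b) => a + b, 0);
}

// Plain turn: STT finalization + one LLM pass + TTS first audio.
const plainTurn: StageMs = { stt: 300, llm: 700, tts: 250 };

// Retrieval-bearing turn with the MCP double hop: the first LLM pass emits a
// tool call, retrieval runs, then a second LLM pass produces the spoken reply.
const doubleHopTurn: StageMs = {
  stt: 300,
  llmToolCall: 600,
  retrieval: 300,
  llmAnswer: 700,
  tts: 250,
};

console.log(turnLatency(plainTurn) <= BUDGET_MS);     // within budget
console.log(turnLatency(doubleHopTurn) <= BUDGET_MS); // blows the budget
```

With these assumed stage times the double-hop turn lands at 2150ms — inside the 2-2.5s range the spike reported, and well past the budget even before network jitter.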
## Options considered
### Option A — Hybrid retrieval on voice, full ship
Wire MCP into the voice service with session-open pre-fetch + tool-use for explicit recall, plus prompt caching. Port the web client to RN via a custom WS transport (since Pipecat client-js is browser-only). Ship voice as a real V0.2 feature.
- Pros: Truth-in-advertising on the talkback page; cross-platform parity per feedback_cross_platform_parity; validates ADR-0002 + ADR-0004 (MCP separation) end-to-end.
- Cons: ~1-2 weeks of work concentrated on a feature with no users; RN port is structurally blocked by Expo Go (see Context #1) so the “ship” is web-only anyway; doesn’t address the speech-to-speech architecture question.
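The hybrid shape Option A describes can be sketched as follows. All names and the recall-trigger heuristic are illustrative assumptions, not ARCIVE code: pre-fetch memory context once at session open (off the per-turn latency path), and pay the tool-use hop only on turns that explicitly ask for recall.

```typescript
// Hypothetical sketch of Option A's hybrid retrieval shape.
type Memory = { id: string; text: string };

interface MemoryStore {
  topMemories(userId: string, k: number): Promise<Memory[]>;
  search(query: string, k: number): Promise<Memory[]>;
}

// One retrieval at session open — amortized across the whole conversation,
// so ordinary turns stay on the single-LLM-pass latency path.
async function buildSessionContext(
  store: MemoryStore,
  userId: string
): Promise<string> {
  const memories = await store.topMemories(userId, 5);
  return memories.map((m) => `- ${m.text}`).join("\n");
}

// Crude trigger for the expensive path; a real system would let the LLM
// decide via tool use rather than keyword-match the transcript.
function needsExplicitRecall(utterance: string): boolean {
  return /remember|what did i|last time/i.test(utterance);
}
```

Only turns where `needsExplicitRecall` (or the LLM's own tool-use decision) fires would incur the double hop; everything else answers from the pre-fetched context.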
### Option B — Swap voice brain to OpenAI Realtime / Gemini Live
Adopt a real-time speech-to-speech model with native MCP tool support. Pipecat can host either. Closes the latency gap and the tool-use double-hop in one move.
- Pros: Lowest end-to-end latency; native MCP tool use; matches where the industry has moved.
- Cons: Breaks single-brain coherence with the rest of ARCIVE (Claude everywhere else); ~3× cost per minute; supersedes ADR-0002 which was just accepted; doesn’t solve the Expo Go mobile blocker; doesn’t repay the work given low usage.
### Option C — Defer voice talk-back, hold the architecture (chosen)
Pause active development on voice talk-back. Keep the deployed service and web page in place behind an admin flag (already wired: `v0_2_voice_talkback`) so it's reactivatable in a single PR, but stop allocating effort against it. Surface "paused — coming back later" on the page; remove from primary nav. Revisit when EAS unblocks mobile voice and the speech-to-speech-vs-Pipecat question can be settled with a concrete user use-case driving it.
- Pros: Frees the next 1-2 weeks for higher-leverage work (real e2e pipeline test, mobile parity items users actually see, prod hardening); avoids investing in either Pipecat polish or a Realtime swap before the deciding constraints (EAS, naturalness bar, tool-use latency tolerance) are clearer; preserves all built artifacts for resumption.
- Cons: Voice talk-back was a public V0.2 storyline — stakeholders may notice the rollback. The marketing copy (“talk to the Reviewer, hands-free”) will need a stewardship pass.
## Decision
Option C: defer voice talk-back beyond V0.2. Reactivate when:
- EAS dev build / Apple Developer setup completes (release-prep gate per project_apple_deferred) — unblocks mobile parity.
- A concrete user signal (interview, support volume, demo need) establishes whether the speech-to-speech pivot is worth superseding ADR-0002, OR confirms Pipecat + Claude + hybrid retrieval is the shape we want.
Until then: the deployed service stays runnable, the page is paused, the nav link is removed, and no further engineering is invested. ADR-0002 is not superseded — it remains the chosen shape if/when work resumes on the Pipecat side. ADR-0006 (voice naturalness) likewise remains accepted but inert.
## Consequences
- `apps/web/app/(app)/talkback/page.tsx` shows a paused notice; `<VoiceClient />` is not rendered. `voice-client.tsx` and the Pipecat client deps stay in `package.json` (resumption is a single PR; uninstalling and re-installing for an indeterminate pause adds friction without saving anything).
- Primary nav (`apps/web/app/(app)/layout.tsx`) drops the Voice link. The route still resolves; bookmarks and external links don't 404, they hit the paused page.
- `backend/workers/voice-talkback/` stays deployed (or re-deployable) but is removed from active feature lists. README carries a "paused" banner pointing to this ADR.
- MCP wiring on the voice service drops out of next-up. The MCP server itself is unchanged — web `/api/chat` still uses it.
- PROGRESS moves the two voice items from "Next up" into a "Deferred" bucket with this ADR as the reason.
- MASTER_PLAN Phase 2 keeps the historical text intact but is annotated: voice talk-back was scaffolded then paused.
- Roadmap copy (README, About page) is updated to reflect that voice conversation is paused, not "built — pending validation".
- Feature flag `v0_2_voice_talkback` stays defined in `apps/web/lib/feature-flags.ts`; it already defaults to false, so no flag flip is needed.
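The no-flip consequence can be illustrated with a minimal sketch of the flag module. Only the flag name `v0_2_voice_talkback` comes from this ADR; the structure, helper name, and override mechanism are assumptions for illustration, not ARCIVE's actual `feature-flags.ts` API.

```typescript
// Hypothetical sketch of the flag shape in apps/web/lib/feature-flags.ts.
const FEATURE_FLAGS = {
  v0_2_voice_talkback: false, // paused per this ADR; flip in the reactivation PR
} as const;

type FlagName = keyof typeof FEATURE_FLAGS;

function isFeatureEnabled(
  name: FlagName,
  overrides: Partial<Record<FlagName, boolean>> = {}
): boolean {
  // Admin overrides (e.g. from a session claim) win over compiled defaults.
  return overrides[name] ?? FEATURE_FLAGS[name];
}

// Default-off: the talkback page shows the paused notice.
console.log(isFeatureEnabled("v0_2_voice_talkback"));
// An admin override renders <VoiceClient /> without any code change.
console.log(isFeatureEnabled("v0_2_voice_talkback", { v0_2_voice_talkback: true }));
```

Because the default is already `false`, pausing is purely a nav and page-copy change; reactivation is an override (or a one-line default flip) plus re-rendering the client.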
## Notes

### The 2025-2026 voice landscape — menu for the resumption decision
The reason we’re holding rather than swapping ADR-0002 outright: the menu of credible voice backends widened sharply in 2025. Anyone re-litigating this should pick from this list with a concrete user constraint in hand, not pre-emptively. The shape of the choice is no longer “Pipecat vs OpenAI Realtime” — it’s at least six paths.
Bundled commercial speech-to-speech (lowest latency, vendor lock):
- OpenAI Realtime API — GA 2025. ~300-500ms first audio. Native MCP tool support. ~$0.06/min. Voices fixed (4-6).
- Google Gemini Live — same shape; multimodal (vision in audio context); Google-only models.
- AWS Nova Sonic — Bedrock-native real-time speech. Tool use, streaming. Useful if AWS is the gravity well; we’re on Supabase so it isn’t.
Open-source speech-to-speech (free to self-host, more ops):
- Microsoft VibeVoice (MIT, Aug 2025) — frontier-quality long-form TTS, expressive, multi-speaker dialog support. Replaces Cartesia for the TTS slot specifically. Doesn’t change the orchestration — Pipecat still wraps it. Most realistic OSS swap: drops in for Cartesia, removes a paid dependency.
- Kyutai Moshi (open weights, full-duplex) — true speech-to-speech in a single 7B model. Different architecture from STT→LLM→TTS composition. Quality below frontier closed models but improving.
- Sesame CSM — already named in ADR-0006 as a TTS upgrade path; Canopy Labs released open weights. Naturalness-focused.
- Phi-4-multimodal (Microsoft, Feb 2025) — open-weights multimodal including audio in/out. Smaller than frontier real-time; useful for edge / local-only variants on the platform roadmap.
Bundled-but-cheaper-or-friendlier middle ground:
- ElevenLabs Conversational AI — bundled platform; STT + LLM (any via API) + ElevenLabs TTS, tool use built in. Less lock than OpenAI Realtime since you bring your own LLM; more managed than Pipecat.
- Hume EVI 3 — emotion-aware voice agent platform. Niche but relevant if the Caregiver/companionship use case is the deciding signal on resumption.
- Pipecat Cloud / Daily — managed Pipecat. Same architecture as what we have, hosted. Lowest-friction path to “make voice an ops problem someone else solves” without changing the model choices.
Free-for-end-users tier (Microsoft Copilot Voice, ChatGPT Voice free tier): worth noting as competitive context — users now expect voice chat to be free in general-purpose assistants. ARCIVE-shaped voice (memory-grounded, role-personalized) is differentiated, but the pricing perception is set elsewhere.
### Decision shape on resumption
The question to ask is no longer “Pipecat or Realtime?” but “what’s the deciding constraint?”:
| If the deciding constraint is… | Likely answer |
|---|---|
| "Mobile voice in Expo Go" | Still blocked — same as 2026-05-05. Doesn't matter which backend; no native WebRTC, no go. Solve EAS first. |
| "Naturalness gap" | TTS swap to VibeVoice / Sesame / ElevenLabs (smallest move; ADR-0006 path). |
| "Latency budget on retrieval-bearing turns" | Either hybrid retrieval on Pipecat or a Realtime/Gemini Live swap. Supersedes ADR-0002 only in the second case. |
| "Cost per minute at scale" | Stay on Pipecat composition; OSS TTS (VibeVoice). |
| "Ops burden of running our own voice service" | Pipecat Cloud / managed; no architecture change. |
| "Single-brain coherence with Claude" | Stay composed (Pipecat or managed Pipecat); rule out Realtime/Live/Sonic. |
ARCIVE so far has answered the last row “yes — one brain”. Until usage data argues otherwise, that answer stands. The landscape expansion strengthens the case for deferral: spending a week polishing Pipecat against a moving target is less defensible now than when ADR-0002 was written 2 days ago.
### Resumption checklist
- Re-add Voice to nav in `layout.tsx`.
- Remove the paused notice in `talkback/page.tsx` and re-render `<VoiceClient />`.
- Pick from the menu above based on the deciding constraint; write a follow-up ADR (hybrid retrieval) or a superseding ADR (Realtime/Live swap).
- Refresh README “What works” / “Paused” sections + About page V0.2 row + PROGRESS Deferred bucket.
- Re-run the mobile-voice spike if EAS is now unlocked.