- Status: Accepted
- Date: 2026-05-05
- Affected: every layer that consumes an external AI vendor —
  `supabase/functions/transcribe-step/`, `supabase/functions/summarize-step/`, `supabase/functions/embed-step/`, `supabase/functions/diarize-step/`, `supabase/functions/reid-step/`, `apps/web/app/api/chat/`, `packages/agents/src/drivers/`, `backend/workers/voice-talkback/`, `backend/mcp/arcive-memory-mcp/`
- Companion working notes: ../discussions/2026-05-05_ai_strategy_architecture.md
## Context
Two things forced this question on 2026-05-05:
- A small task surfaced a big one. Picking up “summarize + topics generation, including backfill for old memories” raised the practical question “which API are we using? Is there a consolidated place?” That expanded into “what about free options?” → “what about OpenAI / Microsoft?” → ultimately “step back as an architect — how are current players doing this?”
- The voice deferral (ADR-0010) opened the broader vendor question. Once we paused voice, the implicit assumption that ARCIVE would have one coherent vendor strategy across capture → summarize → chat → voice → multimodal → unified surfaces became visible — and visibly contestable.
The companion working-session notes (../discussions/2026-05-05_ai_strategy_architecture.md) walk through the full vendor landscape, cost projections at ARCIVE volumes, how credible adjacent players (Notion, Granola, Otter, Apple, ElevenLabs Conversational AI, Sesame, Hume) actually compose AI services, and the architectural shapes considered. This ADR captures the resulting commitment in immutable form.
The pre-existing constraints that scoped the answer:
- ADR-0002 — Pipecat over OpenAI Realtime: voice loop is composed (STT → LLM → TTS), not bundled.
- ADR-0003 — Swappable agent driver layer: the chat brain is replaceable behind an `AgentDriver` interface.
- ADR-0004 — MCP as separate service: retrieval is its own component, vendor-agnostic.
- ADR-0006 — Voice naturalness via swappable TTS: voice fidelity (per-role voices, audio tags, prosody control) is strategic — TTS must stay swappable. End-to-end speech-to-speech models cannot deliver this.
- ADR-0010 — Defer voice talk-back: voice is paused; resumption defaults to composed pipeline.
## Options considered

### Option A — Consolidate to one vendor across all stages
Pick one of {Anthropic, Google, OpenAI} and route every AI capability through them. One billing relationship, one SDK, one “throat to choke.”
- Pros: simpler procurement; possible volume discounts at scale; single-vendor support story for B2B; cleanest path to a literal Stage 6 unified-real-time experience.
- Cons:
- Forces dropping ADR-0006’s TTS swappability — speech-to-speech models from OpenAI Realtime / Gemini Live / Nova Sonic cannot do per-role voice cloning, audio tags, or granular prosody control. ARCIVE’s brand depends on Reviewer / Tutor / Caregiver / Brainstorm having distinct voices; consolidating breaks that.
- Forces a Stage 3 chat migration off Claude Agent SDK — quality regression risk + the ADR-0003 rewrite cost.
- Locks ARCIVE to one vendor’s roadmap, pricing, and outage profile. A model regression, price hike, or geo restriction at the chosen vendor cascades to every layer simultaneously.
- Doesn’t match how any credible non-frontier-lab player operates. Notion uses OpenAI + Anthropic. Granola uses Whisper + OpenAI/Anthropic. Apple uses Apple + OpenAI as fallback. ElevenLabs Conversational AI explicitly tells customers to bring their own LLM.
### Option B — Decouple text-LLM from voice; lean on Gemini for the text side
Consolidate text-bearing stages (summarize, multimodal share, eventually chat) onto Gemini; keep voice composed per ADR-0006.
- Pros: respects ADR-0006; Gemini’s free tier is genuinely useful; multimodal coverage is broadest; today’s code change is identical to status quo (summarize-step already has Gemini as fallback).
- Cons:
- Still implies “Gemini wins the text side” — a vendor lean rather than a framework. Invites awkward exceptions when (a) Anthropic drops prices, (b) Mistral becomes the right pick for EU data residency, (c) DeepSeek / Llama hits a new quality bar at lower cost, (d) a customer requires a specific vendor.
- Doesn’t distinguish strategic from commodity layers — treats every text-LLM choice as one decision when in reality summarize and chat have very different procurement constraints.
- Doesn’t ground the choice in industry practice; it’s still an internal optimization argument.
### Option C — Layered architecture: best-of-breed per task; ARCIVE owns the moat (chosen)
Compose specialized AI capabilities around a memory store. Layers, not pipelines. Vendors are commodities at most layers, strategic at a few, fully ARCIVE-owned at the rest.
- Pros:
- Matches how every credible player in adjacent spaces actually ships — Notion, Granola, Otter, Limitless, Plaud, Bee, Rewind, Personal.ai, Apple Intelligence (with OpenAI fallback), ElevenLabs Conversational AI (BYO LLM), Sesame (owns CSM specifically).
- Preserves all prior ADRs (especially 0002, 0006).
- Failure of any single vendor is layer-local, not strategic.
- Each layer can swap independently as the market shifts; no rolling rewrite is forced.
- Differentiation moves UP the stack to the actual moat (memory store, retrieval, auto-correlation), where ARCIVE controls quality.
- Today’s code change for the immediate summarize backfill is identical to Options A and B.
- Cons:
- More vendor relationships to maintain (today: Groq, Voyage, Anthropic, Deepgram, Modal, Cartesia paused, Stripe, Resend, Sentry, PostHog — adding Google as a 5th AI vendor for summarize). Mitigated by Supabase Edge Function secrets being the consolidated key store regardless.
- Tempting to grow scope inside any one layer (“just standardize everything on Cohere embeddings”) — discipline required to keep the framework intact.
- Slightly harder to explain to a non-technical stakeholder than “we use $VENDOR.” Mitigated by the layer model being a clearer story than “$VENDOR plus exceptions.”
## Decision
Adopt Option C — Layered architecture, best-of-breed per task.
ARCIVE composes specialized AI capabilities around a memory store. The product moat is the memory store + retrieval + auto-correlation, not any AI capability. AI vendors are commodities at most layers, strategic at a few. Each layer picks the right vendor for that task with explicit, coded fallback. ARCIVE never consolidates to a single vendor across the stack.
## The layer model (canonical)

This table is the canonical reference for ARCIVE’s AI architecture. Other docs that describe the system (`docs/01_SOFTWARE_PLAN.md`, the strategy discussion notes, future ADRs) link here rather than re-declaring layer numbers. If a layer needs to change, change it here first.
| # | Layer | What it does | Class | Today | Procurement rule |
|---|---|---|---|---|---|
| 1 | Capture | mic, transcode, ingest | ARCIVE-owned | recorders + audio-transcode worker | Build internally |
| 2 | Transcribe | audio → text + timestamps | commodity | Groq Whisper | Cheapest acceptable STT; coded fallback |
| 3 | Understand | text → summary, topics, entities | commodity | Anthropic Haiku 4.5 → Gemini Flash fallback (target: flip to Gemini primary) | Cheapest acceptable text-LLM; coded fallback |
| 4 | Embed | text/image → vector | commodity | Voyage-3-lite | Best-of-breed embedding; future swap to multimodal-aware model |
| 5 | Retrieve | query → ranked memories | ARCIVE-owned (MOAT) | MCP server + pgvector HNSW + tsv + memory_edges + consent gate | Build internally; never outsource |
| 6 | Reason | chat with memories as context | semi-strategic | Claude Agent SDK | Quality + tool-use bar matters; re-eval on quality, not consolidation pressure |
| 7 | Hear | real-time STT (voice loop) | commodity | Deepgram Nova-3 (paused with voice) | Best-of-breed streaming STT |
| 8 | Speak | TTS | strategic (ADR-0006) | Cartesia → ElevenLabs v3 → Sesame CSM | Best-of-breed for fidelity; per-role voices, audio tags, cloning |
| 9 | Voice orchestration | real-time turn management | strategic (ADR-0002, paused per ADR-0010) | Pipecat (paused) | Composed on resumption — never speech-to-speech |
| 10 | Auto-correlate | cross-source memory linkage | ARCIVE-owned (MOAT) | edges job; future per 2026-05-04_multimodal_expansion.md | Build internally |
| 11 | Surfaces | PWA, mobile, MCP-as-output, future email-in / share-target | ARCIVE-owned | web/mobile + MCP server | Build internally |
### Three layer classes, three procurement rules
- ARCIVE-owned (Capture, Retrieve, Auto-correlate, Surfaces) — built internally. The moat. Never outsourced. No external vendor sits in the path.
- Strategic AI (Speak, Voice orchestration, Reason — semi) — best-in-class on the strategic axis. Vendor changes require a follow-up ADR justifying the swap against the strategic dimension (fidelity for Speak; quality + tool-use for Reason).
- Commodity AI (Transcribe, Understand, Embed, Hear) — cheapest acceptable vendor with a coded fallback. Vendor swaps within this class do not require a new ADR. They happen as the market shifts.
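The commodity-class rule — “cheapest acceptable vendor with a coded fallback” — can be sketched as a small helper. This is an illustrative sketch, not ARCIVE’s actual implementation: the `Summarizer` type, the `summarizeWithFallback` function, and the inline mock providers are all hypothetical names for this example.

```typescript
// Hypothetical shape of a commodity-layer provider call.
type Summarizer = (text: string) => Promise<string>;

// Try each provider in priority order; fall through on any failure.
// The terminal no-op keeps the pipeline alive even if every vendor is down.
async function summarizeWithFallback(
  text: string,
  providers: { name: string; run: Summarizer }[],
): Promise<{ provider: string; summary: string }> {
  for (const p of providers) {
    try {
      return { provider: p.name, summary: await p.run(text) };
    } catch {
      // Vendor outage or rate limit: continue down the chain.
    }
  }
  return { provider: "no-op", summary: "" }; // coded terminal fallback
}

// Mock chain mirroring the Layer 3 priority order: Gemini → Anthropic → Groq.
const chain = [
  {
    name: "gemini",
    run: async (_t: string): Promise<string> => {
      throw new Error("simulated outage"); // primary is down in this example
    },
  },
  { name: "anthropic", run: async (t: string) => `summary of ${t.length} chars` },
  { name: "groq", run: async (t: string) => `summary of ${t.length} chars` },
];
```

The point of the sketch is that the fallback order is data, not branching logic — flipping a commodity-layer primary is a reorder of the array, which is why such swaps need no new ADR.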
## Stage 6 is redefined, not deferred
The earlier framing implied Stage 6 (“talk + chat together in one experience”) would eventually demand single-model unification. Layered redefines it: Stage 6 ships as two surfaces (Talk + Chat) sharing the same memory store, MCP retrieval, and AI auto-correlation. Same data, different input modalities. This matches every credible memory-product reference (Notion, Granola, Limitless, Personal.ai, Apple Intelligence) and never forces ARCIVE to abandon ADR-0006.
If at some future point a speech-to-speech model delivers per-role custom voices, voice cloning, and audio-tag-equivalent prosody control — the fidelity bar ADR-0006 sets — revisit this section in a superseding ADR. Until then, the two-surfaces shape is the operating answer.
## Immediate vendor picks under this framework
These are the picks emerging today. Future picks within commodity-class layers happen without new ADRs.
| Layer | Pick | Justification |
|---|---|---|
| Layer 2 — Transcribe | Groq Whisper (status quo) | Cheap + fast; existing key |
| Layer 3 — Understand | Gemini 2.5 Flash via AI Studio (free tier) primary → Anthropic Haiku 4.5 fallback → Groq Llama 3.3 70B third fallback → no-op — flip existing fallback order in supabase/functions/summarize-step/index.ts and add Groq as third path | Free tier covers ARCIVE volumes through early prod; multimodal-ready for future image/PDF kinds; Anthropic + Groq fallbacks provide resilience without new vendor relationships |
| Layer 4 — Embed | Voyage-3-lite (status quo); revisit when multimodal expansion is scoped | Status quo works; multimodal embedding swap is its own evaluation (future ADR) |
| Layer 6 — Reason | Claude Agent SDK (status quo) | Quality bar + ADR-0003 + tool-use maturity; re-eval triggers are quality-driven, not consolidation-driven |
| Layer 7 — Hear | Deepgram Nova-3 (paused with voice) | Status quo per ADR-0010 |
| Layer 8 — Speak | Cartesia → ElevenLabs v3 / Sesame CSM (paused per ADR-0010; per ADR-0006 path) | Voice fidelity strategy intact |
| Layer 9 — Voice orchestration | Pipecat (paused per ADR-0010) | Composed on resumption; never speech-to-speech |
## Consequences

### What changes
- `supabase/functions/summarize-step/index.ts` flips provider order — Gemini Flash becomes primary, Anthropic Haiku becomes fallback. One-line swap.
- `GEMINI_API_KEY` is set in Supabase Edge Function secrets. Operational config; no commitment beyond the commodity-class layer pick.
- A `backfill-summaries` Edge Function is added to enqueue `summarize` jobs for memories with null/empty summary or topics. Lands in a separate PR.
- `docs/03_PROGRESS.md` picks up “Stage 2 summarize backfill” as a current-work item, links to this ADR.
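The backfill’s selection rule can be sketched as a pure predicate. This is an illustrative sketch only: the `Memory` interface, `needsBackfill`, and `backfillIds` are hypothetical names — the real Edge Function would read rows from Postgres and enqueue `summarize` jobs for the matches.

```typescript
// Hypothetical memory-row shape for the backfill predicate.
interface Memory {
  id: string;
  summary: string | null;
  topics: string[] | null;
}

// A memory needs backfill when its summary or topics are null/empty.
function needsBackfill(m: Memory): boolean {
  const emptySummary = !m.summary || m.summary.trim() === "";
  const emptyTopics = !m.topics || m.topics.length === 0;
  return emptySummary || emptyTopics;
}

// Given a batch of rows, return the ids to enqueue as summarize jobs.
function backfillIds(rows: Memory[]): string[] {
  return rows.filter(needsBackfill).map((m) => m.id);
}
```

Keeping the predicate pure makes it trivially testable independent of the queue and the database.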
### What does not change
- ADR-0002, ADR-0003, ADR-0004, ADR-0006, ADR-0007, ADR-0010 all stand. None are superseded.
- `/api/chat` keeps Claude Agent SDK. No migration.
- Voice talk-back stays paused per ADR-0010. Resumption defaults are unchanged (composed pipeline + ElevenLabs v3 / Sesame CSM TTS).
- Existing keys (`GROQ_API_KEY`, `VOYAGE_API_KEY`, `ANTHROPIC_API_KEY`, `DEEPGRAM_API_KEY`) stay configured. Adding `GEMINI_API_KEY` makes 5 AI-vendor relationships, all rotated through Supabase secrets.
- The `/talkback` page and the `voice-talkback` worker stay paused per ADR-0010 with no changes.
### Future work the framework enables (not committed by this ADR)
- Multimodal embedding swap (Voyage Multimodal-3 / Cohere Embed v4) — a Layer 4 commodity-layer evaluation. Future ADR-0012 when scoped, per the multimodal-expansion discussion doc.
- MCP-write surface forward to V0.2 — per the multimodal-expansion discussion. Layer 11 (Surfaces) work; future ADR if accepted.
- EU data residency — future trigger when a customer requires it. Layer 3 might switch to Mistral La Plateforme; Layer 4 might switch to Cohere EU-hosted; Layer 6 to Mistral or Anthropic EU. Each is a commodity-class swap (or for Layer 6, a strategic re-evaluation in its own ADR).
- Voice resumption with ElevenLabs Conversational AI as a third path — beyond pure-Pipecat (current default) and pure-speech-to-speech (rejected). Decision deferred to ADR-0010 resumption work.
- Failover testing — exercise the coded fallbacks across commodity layers to confirm they actually work.
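The failover-testing item can be exercised with a tiny simulation harness before touching real vendors. This is an illustrative sketch under stated assumptions: `Provider`, `firstHealthy`, and `chainWithOutages` are hypothetical names, and the “outage” is a thrown error rather than a real vendor failure.

```typescript
// Minimal failover harness: simulate outages in a provider chain and check
// which provider ends up serving the request. All names are hypothetical.
type Provider = { name: string; run: () => Promise<string> };

// Walk the chain; return the name of the first provider that responds.
async function firstHealthy(chain: Provider[]): Promise<string> {
  for (const p of chain) {
    try {
      await p.run();
      return p.name; // this provider served the request
    } catch {
      // Simulated outage: continue down the chain.
    }
  }
  return "no-op"; // terminal coded fallback
}

// Build a chain where every provider named in `down` throws.
function chainWithOutages(names: string[], down: Set<string>): Provider[] {
  return names.map((name) => ({
    name,
    run: async () => {
      if (down.has(name)) throw new Error(`${name} outage`);
      return "ok";
    },
  }));
}
```

Running the harness with different outage sets confirms the chain degrades in priority order and terminates at the no-op rather than failing the pipeline.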
## Notes

### Why this isn’t a commitment to Gemini
The immediate Layer 3 pick is Gemini Flash, but the framework binds harder than the vendor pick. If next quarter Anthropic drops Haiku pricing 5×, or DeepSeek lands a quality jump, or Mistral becomes the EU answer, the swap happens within the commodity-class rules without a new ADR. The framework is durable; vendor picks are tactical.
### Groq’s role across the layer model
Groq is currently a single-layer vendor for ARCIVE — Layer 2 (Transcribe / Whisper) — but is capable across more layers. Documenting the surface so future readers don’t re-litigate it:
| Groq capability | Layer | Status for ARCIVE |
|---|---|---|
| Whisper STT (`whisper-large-v3-turbo`) | 2 — Transcribe | ✅ in use today |
| Llama 3.3 70B / Llama 4 Scout / Llama 4 Maverick | 3 — Understand, 6 — Reason (plus Stage 5 multimodal share) | available; not the primary pick (see below) |
| DeepSeek-R1-distill (reasoning) | 6 — Reason | available; budget option for future freemium chat |
| Mixtral / Qwen / Kimi K2 / GPT-OSS | 3 — Understand | available; redundant with Gemini at this layer |
| TTS | 8 — Speak | ❌ Groq does not ship |
| Speech-to-speech | 9 — Voice orchestration | ❌ Groq does not ship |
| Embeddings | 4 — Embed | ❌ Groq does not ship |
| Vision-only models | 3 — Understand (Stage 5 multimodal share) | partial — Llama 4 covers it |
Why Groq is not the Layer 3 primary today: Gemini Flash wins on multimodal coverage (text + audio + image + PDF + video), which future-proofs Stage 5 multimodal share; Groq’s Llama 4 Scout is text + image only. The pick turned on multimodal coverage, not on Groq quality — Groq’s Llama 4 / 3.3 models are perfectly competent text summarizers.
Where Groq is added now: the Layer 3 fallback chain extends to Gemini → Anthropic → Groq Llama 3.3 70B → no-op. It costs nothing while the higher-priority providers stay up, and provides real resilience if Gemini and Anthropic both have a transient outage. It reuses the existing `GROQ_API_KEY` — no new vendor relationship.
Future Groq plays explicitly available within this framework (no new ADR needed):
- Add Groq Llama as a freemium chat fallback (Layer 6) when V1.0+ free-tier scaling becomes the constraint.
- Use Groq Llama for synchronous summarize when a UX driver justifies the latency advantage (Groq is ~10× faster than Gemini at inference).
- Use DeepSeek-R1-distill on Groq for reasoning-heavy tasks like the auto-correlation feature in ../discussions/2026-05-04_multimodal_expansion.md.
Groq plays explicitly not allowed within this framework:
- Do not extend Groq into Layers 4 / 7 / 8 / 9. Groq doesn’t ship those capabilities; pretending otherwise would mean self-hosting (out of scope until V2.x local-only platform variant).
### Why this isn’t a commitment to Claude either
Layer 6 keeps Claude Agent SDK because today’s quality + tool-use bar plus ADR-0003’s investment make it the right pick now. The framework labels it semi-strategic, which means: when re-evaluation is warranted (Stage 5/6 design pressure, cost at scale, a quality leapfrog), a future ADR walks through the swap. No vendor lean implied.
### Why this isn’t a commitment against unification
If a future model — Gemini Live N+2, OpenAI Realtime N+2, an OSS speech-to-speech with cloning + audio tags — closes the voice-fidelity gap that today blocks ADR-0006 from running on a unified architecture, revisit Layer 8 + 9 in a superseding ADR. This ADR doesn’t preclude that future; it commits to the present-state architecture given today’s vendor capabilities.
### How this maps to the 6-stage pipeline ARCIVE is building
| Pipeline stage | Resolves to layers |
|---|---|
| Stage 1 — Transcribe | Layer 2 (commodity) |
| Stage 2 — Summarize + topics | Layer 3 (commodity) |
| Stage 3 — Text chat | Layer 6 (semi-strategic) + Layer 5 (retrieve, MOAT) |
| Stage 4 — Voice conversational | Layers 7 + 8 + 9 (strategic per ADR-0006); paused per ADR-0010 |
| Stage 5 — Multimodal share | Layers 11 (Surfaces) + 3 (Understand, per-kind) + 4 (multimodal embed) + 5 (retrieve) |
| Stage 6 — Unified | Two surfaces sharing memory store, not one model. Achieved when Stages 3 + 4 are both shipping; no extra Stage 6 architecture needed. |
## TL;DR for someone walking in cold
ARCIVE’s product moat is the memory store + retrieval + auto-correlation, not any AI capability. AI vendors are commodities at most layers (transcribe, understand, embed, hear), strategic at a few (speak, voice orchestration, reason), and the moat layers (capture, retrieve, auto-correlate, surfaces) are ARCIVE-owned. Pick best-of-breed per task with explicit fallback. Don’t consolidate to one vendor — that’s a Big-Tech aspiration, not a shipping reality outside the frontier labs. Stage 6 unified ships as two surfaces (Talk + Chat) sharing the memory store, never as one model that would force ARCIVE to abandon ADR-0006 voice fidelity.