- Status: Accepted
- Date: 2026-05-05
- Affected: every layer that consumes an external AI vendor —
  `supabase/functions/transcribe-step/`, `supabase/functions/summarize-step/`, `supabase/functions/embed-step/`, `supabase/functions/diarize-step/`, `supabase/functions/reid-step/`, `apps/web/app/api/chat/`, `packages/agents/src/drivers/`, `backend/workers/voice-talkback/`, `backend/mcp/arcive-memory-mcp/`
- Companion working notes: ../discussions/2026-05-05_ai_strategy_architecture.md
## Context
Two things forced this question on 2026-05-05:
- A small task surfaced a big one. Picking up “summarize + topics generation, including backfill for old memories” raised the practical question “which API are we using? Is there a consolidated place?” That expanded into “what about free options?” → “what about OpenAI / Microsoft?” → ultimately “step back as an architect — how are current players doing this?”
- The voice deferral (ADR-0010) opened the broader vendor question. Once we paused voice, the implicit assumption that ARCIVE would have one coherent vendor strategy across capture → summarize → chat → voice → multimodal → unified surfaces became visible — and visibly contestable.
The companion working-session notes (../discussions/2026-05-05_ai_strategy_architecture.md) walk through the full vendor landscape, cost projections at ARCIVE volumes, how credible adjacent players (Notion, Granola, Otter, Apple, ElevenLabs Conversational AI, Sesame, Hume) actually compose AI services, and the architectural shapes considered. This ADR captures the resulting commitment in immutable form.
The pre-existing constraints that scoped the answer:
- ADR-0002 — Pipecat over OpenAI Realtime: voice loop is composed (STT → LLM → TTS), not bundled.
- ADR-0003 — Swappable agent driver layer: the chat brain is replaceable behind an `AgentDriver` interface.
- ADR-0004 — MCP as separate service: retrieval is its own component, vendor-agnostic.
- ADR-0006 — Voice naturalness via swappable TTS: voice fidelity (per-role voices, audio tags, prosody control) is strategic — TTS must stay swappable. End-to-end speech-to-speech models cannot deliver this.
- ADR-0010 — Defer voice talk-back: voice is paused; resumption defaults to composed pipeline.
## Options considered

### Option A — Consolidate to one vendor across all stages
Pick one of {Anthropic, Google, OpenAI} and route every AI capability through them. One billing relationship, one SDK, one “throat to choke.”
- Pros: simpler procurement; possible volume discounts at scale; single-vendor support story for B2B; cleanest path to a literal Stage 6 unified-real-time experience.
- Cons:
- Forces dropping ADR-0006’s TTS swappability — speech-to-speech models from OpenAI Realtime / Gemini Live / Nova Sonic cannot do per-role voice cloning, audio tags, or granular prosody control. ARCIVE’s brand depends on Reviewer / Tutor / Caregiver / Brainstorm having distinct voices; consolidating breaks that.
- Forces a Stage 3 chat migration off Claude Agent SDK — quality regression risk + the ADR-0003 rewrite cost.
- Locks ARCIVE to one vendor’s roadmap, pricing, and outage profile. A model regression, price hike, or geo restriction at the chosen vendor cascades to every layer simultaneously.
- Doesn’t match how any credible non-frontier-lab player operates. Notion uses OpenAI + Anthropic. Granola uses Whisper + OpenAI/Anthropic. Apple uses Apple + OpenAI as fallback. ElevenLabs Conversational AI explicitly tells customers to bring their own LLM.
### Option B — Decouple text-LLM from voice; lean on Gemini for the text side
Consolidate text-bearing stages (summarize, multimodal share, eventually chat) onto Gemini; keep voice composed per ADR-0006.
- Pros: respects ADR-0006; Gemini’s free tier is genuinely useful; multimodal coverage is broadest; today’s code change is identical to status quo (summarize-step already has Gemini as fallback).
- Cons:
- Still implies “Gemini wins the text side” — a vendor lean rather than a framework. Invites awkward exceptions when (a) Anthropic drops prices, (b) Mistral becomes the right pick for EU data residency, (c) DeepSeek / Llama hits a new quality bar at lower cost, (d) a customer requires a specific vendor.
- Doesn’t distinguish strategic from commodity layers — treats every text-LLM choice as one decision when in reality summarize and chat have very different procurement constraints.
- Doesn’t ground the choice in industry practice; it’s still an internal optimization argument.
### Option C — Layered architecture: best-of-breed per task; ARCIVE owns the moat (chosen)
Compose specialized AI capabilities around a memory store. Layers, not pipelines. Vendors are commodities at most layers, strategic at a few, fully ARCIVE-owned at the rest.
- Pros:
- Matches how every credible player in adjacent spaces actually ships — Notion, Granola, Otter, Limitless, Plaud, Bee, Rewind, Personal.ai, Apple Intelligence (with OpenAI fallback), ElevenLabs Conversational AI (BYO LLM), Sesame (owns CSM specifically).
- Preserves all prior ADRs (especially 0002, 0006).
- Failure of any single vendor is layer-local, not strategic.
- Each layer can swap independently as the market shifts; no rolling rewrite is forced.
- Differentiation moves UP the stack to the actual moat (memory store, retrieval, auto-correlation), where ARCIVE controls quality.
- Today’s code change for the immediate summarize backfill is identical to Options A and B.
- Cons:
- More vendor relationships to maintain (today: Groq, Voyage, Anthropic, Deepgram, Modal, Cartesia paused, Stripe, Resend, Sentry, PostHog — adding Google as a 5th AI vendor for summarize). Mitigated by Supabase Edge Function secrets being the consolidated key store regardless.
- Tempting to grow scope inside any one layer (“just standardize everything on Cohere embeddings”) — discipline required to keep the framework intact.
- Slightly harder to explain to a non-technical stakeholder than “we use $VENDOR.” Mitigated by the layer model being a clearer story than “$VENDOR plus exceptions.”
## Decision
Adopt Option C — Layered architecture, best-of-breed per task.
ARCIVE composes specialized AI capabilities around a memory store. The product moat is the memory store + retrieval + auto-correlation, not any AI capability. AI vendors are commodities at most layers, strategic at a few. Each layer picks the right vendor for that task with explicit, coded fallback. ARCIVE never consolidates to a single vendor across the stack.
## The layer model (canonical)

This table is the canonical reference for ARCIVE’s AI architecture. Other docs that describe the system (`docs/01_SOFTWARE_PLAN.md`, the strategy discussion notes, future ADRs) link here rather than re-declaring layer numbers. If a layer needs to change, change it here first.
| # | Layer | What it does | Class | Today | Procurement rule |
|---|---|---|---|---|---|
| 1 | Capture | mic, transcode, ingest | ARCIVE-owned | recorders + audio-transcode worker | Build internally |
| 2 | Transcribe | audio → text + timestamps | commodity | Groq Whisper | Cheapest acceptable STT; coded fallback |
| 3 | Understand | text → summary, topics, entities | commodity | Anthropic Haiku 4.5 → Gemini Flash fallback (target: flip to Gemini primary) | Cheapest acceptable text-LLM; coded fallback |
| 4 | Embed | text/image → vector | commodity | Voyage-3-lite | Best-of-breed embedding; future swap to multimodal-aware model |
| 5 | Retrieve | query → ranked memories | ARCIVE-owned (MOAT) | MCP server + pgvector HNSW + tsv + memory_edges + consent gate | Build internally; never outsource |
| 6 | Reason | chat with memories as context | semi-strategic | Claude Agent SDK | Quality + tool-use bar matters; re-eval on quality, not consolidation pressure |
| 7 | Hear | real-time STT (voice loop) | commodity | Deepgram Nova-3 (paused with voice) | Best-of-breed streaming STT |
| 8 | Speak | TTS | strategic (ADR-0006) | Cartesia → ElevenLabs v3 → Sesame CSM | Best-of-breed for fidelity; per-role voices, audio tags, cloning |
| 9 | Voice orchestration | real-time turn management | strategic (ADR-0002, paused per ADR-0010) | Pipecat (paused) | Composed on resumption — never speech-to-speech |
| 10 | Auto-correlate | cross-source memory linkage | ARCIVE-owned (MOAT) | edges job; future per 2026-05-04_multimodal_expansion.md | Build internally |
| 11 | Surfaces | PWA, mobile, MCP-as-output, future email-in / share-target | ARCIVE-owned | web/mobile + MCP server | Build internally |
### Three layer classes, three procurement rules
- ARCIVE-owned (Capture, Retrieve, Auto-correlate, Surfaces) — built internally. The moat. Never outsourced. No external vendor sits in the path.
- Strategic AI (Speak, Voice orchestration, Reason — semi) — best-in-class on the strategic axis. Vendor changes require a follow-up ADR justifying the swap against the strategic dimension (fidelity for Speak; quality + tool-use for Reason).
- Commodity AI (Transcribe, Understand, Embed, Hear) — cheapest acceptable vendor with a coded fallback. Vendor swaps within this class do not require a new ADR. They happen as the market shifts.
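The commodity-class rule — “cheapest acceptable vendor with a coded fallback” — can be sketched as a small helper. This is an illustrative sketch, not ARCIVE’s actual implementation: the `Summarizer` type, the `summarizeWithFallback` function, and the inline mock providers are all hypothetical names for this example.

```typescript
// Hypothetical shape of a commodity-layer provider call.
type Summarizer = (text: string) => Promise<string>;

// Try each provider in priority order; fall through on any failure.
// The terminal no-op keeps the pipeline alive even if every vendor is down.
async function summarizeWithFallback(
  text: string,
  providers: { name: string; run: Summarizer }[],
): Promise<{ provider: string; summary: string }> {
  for (const p of providers) {
    try {
      return { provider: p.name, summary: await p.run(text) };
    } catch {
      // Vendor outage or rate limit: continue down the chain.
    }
  }
  return { provider: "no-op", summary: "" }; // coded terminal fallback
}

// Mock chain mirroring the Layer 3 priority order: Gemini → Anthropic → Groq.
const chain = [
  {
    name: "gemini",
    run: async (_t: string): Promise<string> => {
      throw new Error("simulated outage"); // primary is down in this example
    },
  },
  { name: "anthropic", run: async (t: string) => `summary of ${t.length} chars` },
  { name: "groq", run: async (t: string) => `summary of ${t.length} chars` },
];
```

The point of the sketch is that the fallback order is data, not branching logic — flipping a commodity-layer primary is a reorder of the array, which is why such swaps need no new ADR.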
## Stage 6 is redefined, not deferred
The earlier framing implied Stage 6 (“talk + chat together in one experience”) would eventually demand single-model unification. Layered redefines it: Stage 6 ships as two surfaces (Talk + Chat) sharing the same memory store, MCP retrieval, and AI auto-correlation. Same data, different input modalities. This matches every credible memory-product reference (Notion, Granola, Limitless, Personal.ai, Apple Intelligence) and never forces ARCIVE to abandon ADR-0006.
If at some future point a speech-to-speech model delivers per-role custom voices, voice cloning, and audio-tag-equivalent prosody control — the fidelity bar ADR-0006 sets — revisit this section in a superseding ADR. Until then, the two-surfaces shape is the operating answer.
## Immediate vendor picks under this framework
These are the picks emerging today. Future picks within commodity-class layers happen without new ADRs.
| Layer | Pick | Justification |
|---|---|---|
| Layer 2 — Transcribe | Groq Whisper (status quo) | Cheap + fast; existing key |
| Layer 3 — Understand | Gemini 2.5 Flash via AI Studio (free tier) primary → Anthropic Haiku 4.5 fallback → Groq Llama 3.3 70B third fallback → no-op — flip existing fallback order in supabase/functions/summarize-step/index.ts and add Groq as third path | Free tier covers ARCIVE volumes through early prod; multimodal-ready for future image/PDF kinds; Anthropic + Groq fallbacks provide resilience without new vendor relationships |
| Layer 4 — Embed | Voyage-3-lite (status quo); revisit when multimodal expansion is scoped | Status quo works; multimodal embedding swap is its own evaluation (future ADR) |
| Layer 6 — Reason | Claude Agent SDK (status quo) | Quality bar + ADR-0003 + tool-use maturity; re-eval triggers are quality-driven, not consolidation-driven |
| Layer 7 — Hear | Deepgram Nova-3 (paused with voice) | Status quo per ADR-0010 |
| Layer 8 — Speak | Cartesia → ElevenLabs v3 / Sesame CSM (paused per ADR-0010; per ADR-0006 path) | Voice fidelity strategy intact |
| Layer 9 — Voice orchestration | Pipecat (paused per ADR-0010) | Composed on resumption; never speech-to-speech |
## Consequences

### What changes
- `supabase/functions/summarize-step/index.ts` flips provider order — Gemini Flash becomes primary, Anthropic Haiku becomes fallback. One-line swap.
- `GEMINI_API_KEY` is set in Supabase Edge Function secrets. Operational config; no commitment beyond the commodity-class layer pick.
- A `backfill-summaries` Edge Function is added to enqueue `summarize` jobs for memories with null/empty summary or topics. Lands in a separate PR.
- `docs/03_PROGRESS.md` picks up “Stage 2 summarize backfill” as a current-work item, links to this ADR.
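The backfill’s selection rule can be sketched as a pure predicate. This is an illustrative sketch only: the `Memory` interface, `needsBackfill`, and `backfillIds` are hypothetical names — the real Edge Function would read rows from Postgres and enqueue `summarize` jobs for the matches.

```typescript
// Hypothetical memory-row shape for the backfill predicate.
interface Memory {
  id: string;
  summary: string | null;
  topics: string[] | null;
}

// A memory needs backfill when its summary or topics are null/empty.
function needsBackfill(m: Memory): boolean {
  const emptySummary = !m.summary || m.summary.trim() === "";
  const emptyTopics = !m.topics || m.topics.length === 0;
  return emptySummary || emptyTopics;
}

// Given a batch of rows, return the ids to enqueue as summarize jobs.
function backfillIds(rows: Memory[]): string[] {
  return rows.filter(needsBackfill).map((m) => m.id);
}
```

Keeping the predicate pure makes it trivially testable independent of the queue and the database.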
### What does not change
- ADR-0002, ADR-0003, ADR-0004, ADR-0006, ADR-0007, ADR-0010 all stand. None are superseded.
- `/api/chat` keeps Claude Agent SDK. No migration.
- Voice talk-back stays paused per ADR-0010. Resumption defaults are unchanged (composed pipeline + ElevenLabs v3 / Sesame CSM TTS).
- Existing keys (`GROQ_API_KEY`, `VOYAGE_API_KEY`, `ANTHROPIC_API_KEY`, `DEEPGRAM_API_KEY`) stay configured. Adding `GEMINI_API_KEY` makes 5 AI-vendor relationships, all rotated through Supabase secrets.
- The `/talkback` page and the `voice-talkback` worker stay paused per ADR-0010 with no changes.
### Future work the framework enables (not committed by this ADR)
- Multimodal embedding swap (Voyage Multimodal-3 / Cohere Embed v4) — a Layer 4 commodity-layer evaluation. Future ADR-0012 when scoped, per the multimodal-expansion discussion doc.
- MCP-write surface forward to V0.2 — per the multimodal-expansion discussion. Layer 11 (Surfaces) work; future ADR if accepted.
- EU data residency — future trigger when a customer requires it. Layer 3 might switch to Mistral La Plateforme; Layer 4 might switch to Cohere EU-hosted; Layer 6 to Mistral or Anthropic EU. Each is a commodity-class swap (or for Layer 6, a strategic re-evaluation in its own ADR).
- Voice resumption with ElevenLabs Conversational AI as a third path — beyond pure-Pipecat (current default) and pure-speech-to-speech (rejected). Decision deferred to ADR-0010 resumption work.
- Failover testing — exercise the coded fallbacks across commodity layers to confirm they actually work.
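The failover-testing item can be exercised with a tiny simulation harness before touching real vendors. This is an illustrative sketch under stated assumptions: `Provider`, `firstHealthy`, and `chainWithOutages` are hypothetical names, and the “outage” is a thrown error rather than a real vendor failure.

```typescript
// Minimal failover harness: simulate outages in a provider chain and check
// which provider ends up serving the request. All names are hypothetical.
type Provider = { name: string; run: () => Promise<string> };

// Walk the chain; return the name of the first provider that responds.
async function firstHealthy(chain: Provider[]): Promise<string> {
  for (const p of chain) {
    try {
      await p.run();
      return p.name; // this provider served the request
    } catch {
      // Simulated outage: continue down the chain.
    }
  }
  return "no-op"; // terminal coded fallback
}

// Build a chain where every provider named in `down` throws.
function chainWithOutages(names: string[], down: Set<string>): Provider[] {
  return names.map((name) => ({
    name,
    run: async () => {
      if (down.has(name)) throw new Error(`${name} outage`);
      return "ok";
    },
  }));
}
```

Running the harness with different outage sets confirms the chain degrades in priority order and terminates at the no-op rather than failing the pipeline.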
## Notes

### Why this isn’t a commitment to Gemini
The immediate Layer 3 pick is Gemini Flash, but the framework binds harder than the vendor pick. If next quarter Anthropic drops Haiku pricing 5×, or DeepSeek lands a quality jump, or Mistral becomes the EU answer, the swap happens within the commodity-class rules without a new ADR. The framework is durable; vendor picks are tactical.
### Groq’s role across the layer model
Groq is currently a single-layer vendor for ARCIVE — Layer 2 (Transcribe / Whisper) — but is capable across more layers. Documenting the surface so future readers don’t re-litigate it:
| Groq capability | Layer | Status for ARCIVE |
|---|---|---|
| Whisper STT (`whisper-large-v3-turbo`) | 2 — Transcribe | ✅ in use today |
| Llama 3.3 70B / Llama 4 Scout / Llama 4 Maverick | 3 — Understand, 6 — Reason (plus Stage 5 multimodal share) | available; not the primary pick (see below) |
| DeepSeek-R1-distill (reasoning) | 6 — Reason | available; budget option for future freemium chat |
| Mixtral / Qwen / Kimi K2 / GPT-OSS | 3 — Understand | available; redundant with Gemini at this layer |
| TTS | 8 — Speak | ❌ Groq does not ship |
| Speech-to-speech | 9 — Voice orchestration | ❌ Groq does not ship |
| Embeddings | 4 — Embed | ❌ Groq does not ship |
| Vision-only models | 3 — Understand (Stage 5 multimodal share) | partial — Llama 4 covers it |
Why Groq is not the Layer 3 primary today: Gemini Flash wins on multimodal coverage (text + audio + image + PDF + video), which future-proofs Stage 5 multimodal share; Groq’s Llama 4 Scout is text + image only. The pick turned on multimodal coverage, not on Groq quality — Groq’s Llama 4 / 3.3 models are perfectly competent text summarizers.
Where Groq is added now: the Layer 3 fallback chain extends to Gemini → Anthropic → Groq Llama 3.3 70B → no-op. It costs nothing while the higher-priority providers stay up, and provides real resilience if Gemini and Anthropic both have a transient outage. It reuses the existing `GROQ_API_KEY` — no new vendor relationship.
Future Groq plays explicitly available within this framework (no new ADR needed):
- Add Groq Llama as a freemium chat fallback (Layer 6) when V1.0+ free-tier scaling becomes the constraint.
- Use Groq Llama for synchronous summarize when a UX driver justifies the latency advantage (Groq is ~10× faster than Gemini at inference).
- Use DeepSeek-R1-distill on Groq for reasoning-heavy tasks like the auto-correlation feature in ../discussions/2026-05-04_multimodal_expansion.md.
Groq plays explicitly not allowed within this framework:
- Do not extend Groq into Layers 4 / 7 / 8 / 9. Groq doesn’t ship those capabilities; pretending otherwise would mean self-hosting (out of scope until V2.x local-only platform variant).
### Why this isn’t a commitment to Claude either
Layer 6 keeps Claude Agent SDK because today’s quality + tool-use bar plus ADR-0003’s investment make it the right pick now. The framework labels it semi-strategic, which means: when re-evaluation is warranted (Stage 5/6 design pressure, cost at scale, a quality leapfrog), a future ADR walks through the swap. No vendor lean implied.
### Why this isn’t a commitment against unification
If a future model — Gemini Live N+2, OpenAI Realtime N+2, an OSS speech-to-speech with cloning + audio tags — closes the voice-fidelity gap that today blocks ADR-0006 from running on a unified architecture, revisit Layer 8 + 9 in a superseding ADR. This ADR doesn’t preclude that future; it commits to the present-state architecture given today’s vendor capabilities.
### How this maps to the 6-stage pipeline ARCIVE is building
| Pipeline stage | Resolves to layers |
|---|---|
| Stage 1 — Transcribe | Layer 2 (commodity) |
| Stage 2 — Summarize + topics | Layer 3 (commodity) |
| Stage 3 — Text chat | Layer 6 (semi-strategic) + Layer 5 (retrieve, MOAT) |
| Stage 4 — Voice conversational | Layers 7 + 8 + 9 (strategic per ADR-0006); paused per ADR-0010 |
| Stage 5 — Multimodal share | Layers 11 (Surfaces) + 3 (Understand, per-kind) + 4 (multimodal embed) + 5 (retrieve) |
| Stage 6 — Unified | Two surfaces sharing memory store, not one model. Achieved when Stages 3 + 4 are both shipping; no extra Stage 6 architecture needed. |
## TL;DR for someone walking in cold
ARCIVE’s product moat is the memory store + retrieval + auto-correlation, not any AI capability. AI vendors are commodities at most layers (transcribe, understand, embed, hear), strategic at a few (speak, voice orchestration, reason), and the moat layers (capture, retrieve, auto-correlate, surfaces) are ARCIVE-owned. Pick best-of-breed per task with explicit fallback. Don’t consolidate to one vendor — that’s a Big-Tech aspiration, not a shipping reality outside the frontier labs. Stage 6 unified ships as two surfaces (Talk + Chat) sharing the memory store, never as one model that would force ARCIVE to abandon ADR-0006 voice fidelity.