ARCIVE AI Strategy & Architecture — Working Session Notes

Date: 2026-05-05
Format: Working session notes — strategic, fed into ADR-0011.
Outcome: ADR-0011 — AI vendor strategy: best-of-breed per task, ARCIVE owns the moat (Accepted)
Companion: 2026-05-04_multimodal_expansion.md — the multimodal expansion discussion this builds on


Why this session happened

The trigger was small: pick the next thing to build. We started with “voice talk-back on mobile (#1 priority)”, spiked it, found Pipecat-in-Expo-Go was structurally blocked, and wrote ADR-0010 deferring the whole voice feature.

We then pivoted to the next item — summarize and topics generation, plus backfill for older memories. That triggered the question “which API are we using? Is there a consolidated place?”, which expanded into “what about free options?”, “have you considered OpenAI / Microsoft?”, “isn’t this fragmented?” — and ultimately “step back and think as an architect — how are current players actually doing this?”

This doc captures the architectural reasoning that emerged. The conclusion isn’t a vendor pick; it’s a framework for picking vendors per task, plus a clear-eyed look at what ARCIVE actually owns and what it doesn’t.


1. The staged AI pipeline ARCIVE is building

User framing surfaced this six-stage progression — the actual roadmap, not just a wish list:

Stage 1 — Transcribe (audio → text) ✓ shipped (Groq Whisper)
Stage 2 — Summarize + topics (text → JSON) ⚠ scaffolded, missing API key
Stage 3 — Text chat (text + retrieval → response) ✓ shipped (Claude Agent SDK + MCP)
Stage 4 — Voice conversational (real-time speech loop) ⏸ paused — ADR-0010
Stage 5 — Share multimodal (image / link / PDF / video) □ exploratory — see multimodal_expansion
Stage 6 — Unified (talk + chat) (real-time multimodal) □ V1.0+

Each stage adds a new modality or interaction shape on top of the same underlying memory store. The vendor question isn’t “best LLM for summarize” — it’s “what vendor strategy scales coherently across all 6 stages.”


2. What ARCIVE has already committed to (the constraints)

Six prior ADRs frame this strategy. Skipping any of them would mean overruling existing architectural intent:

| ADR | Commitment | Implication for vendor strategy |
|---|---|---|
| 0002 — Pipecat over OpenAI Realtime | Composed STT→LLM→TTS pipeline, not bundled speech-to-speech | Stage 4 voice is composed, not unified |
| 0003 — Swappable agent driver layer | AgentDriver interface allows replacing the chat brain | Stage 3 LLM can be swapped without rewriting /api/chat callers |
| 0004 — MCP as separate service | Memory retrieval lives in its own MCP server | Retrieval is independent of any model vendor |
| 0006 — Voice naturalness via swappable TTS | TTS is an upgrade path: Cartesia → ElevenLabs v3 → Sesame CSM. Per-role voices, audio tags, cloning | Voice fidelity is strategic, not commodity. Forces a composed pipeline at Stage 4 |
| 0007 — Consent gate on retrieval | Agents see consent-scoped memories only | Retrieval layer enforces the policy, not the model |
| 0010 — Defer voice talk-back | Stage 4 paused until EAS unlocks + speech-to-speech-vs-composed settles | Voice-related vendor picks are deferred, not abandoned |

ADR-0006 is the load-bearing constraint for this whole conversation. End-to-end speech-to-speech models cannot deliver swappable TTS, voice cloning, or audio tags. That kills any “consolidate to one vendor across all 6 stages” strategy on Stage 4.
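
For orientation, here is roughly what ADR-0003’s driver layer implies in code — a minimal sketch, assuming TypeScript and hypothetical names (`AgentDriver`, `ClaudeAgentDriver` are illustrative, not the actual ARCIVE identifiers):

```typescript
// Hypothetical shape of ADR-0003's swappable driver layer.
// /api/chat callers depend only on this interface, never on a vendor SDK.
interface AgentDriver {
  /** Stream a reply for one user turn, using consent-scoped retrieval. */
  chat(input: { message: string; sessionId: string }): AsyncIterable<string>;
}

class ClaudeAgentDriver implements AgentDriver {
  async *chat(input: { message: string; sessionId: string }) {
    // Claude Agent SDK + MCP retrieval would be wired up here (Stage 3 today).
    yield "...";
  }
}

// Swapping the chat brain = swapping this one binding, per ADR-0003.
const driver: AgentDriver = new ClaudeAgentDriver();
```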


3. The vendor menu (free + paid, multimodal + text-only)

Full landscape as of 2026-05-05. Costs per 1M tokens at published rates. “Forever-free” means a sustained free tier, not a trial credit.

| Vendor / Model | $/M in | $/M out | Forever-free | Multimodal | Notes |
|---|---|---|---|---|---|
| Google Gemini 2.5 Flash | $0.075 | $0.30 | AI Studio: 1M tok/day, 15 RPM | text + audio + image + PDF + video | Most expansive multimodal; native audio in |
| Google Gemini 2.5 Flash-Lite | $0.025 | $0.10 | Same AI Studio | text + image | Cheapest paid Gemini |
| OpenAI GPT-5-nano | $0.05 | $0.20 | $5 trial credit only | text + vision | Cheap, no free-forever; native MCP |
| OpenAI GPT-5-mini | $0.40 | $2.00 | $5 trial credit | text + vision; Realtime audio | Native MCP across Apps/Agents/Realtime SDKs |
| Anthropic Haiku 4.5 | $1.00 | $5.00 | $5 trial credit | text + vision (no audio) | What summarize-step uses today; ~10× Gemini cost |
| Anthropic Sonnet 4.6 | $3.00 | $15.00 | $5 trial credit | text + vision | What /api/chat Agent SDK uses |
| Groq Llama 4 Scout | — | — | Generous free tier (existing GROQ_API_KEY) | text + image | Fastest inference on the market |
| Groq Llama 3.3 70B | — | — | Same free tier | text only | Solid quality, free |
| Cloudflare Workers AI | — | — | 10k req/day free | text + some vision | Edge-deployed; low latency |
| GitHub Models | — | — | 50 req/day per model, ~30 models, free with GitHub account | varies | Single token gives Claude + GPT + Llama + Phi + Mistral. Eval gold; 50/day cap kills prod use |
| Microsoft Phi-4-multimodal | self-host GPU cost | — | Free weights (MIT) | text + audio + vision | OSS option for V2.x local-only platform variant |
| DeepSeek V3.1 | $0.27 | $1.10 | Free `:free` variant on OpenRouter | text only | Very cheap, surprising quality |
| Mistral Small 3 / Pixtral | $0.20 | $0.60 | Free tier on La Plateforme | Pixtral multimodal | EU data residency (matters for V1.0 SOC 2 / EU users) |
| Cohere Command R+ | $2.50 | $10.00 | Free trial only | text + vision | Embed v4 is the more interesting Cohere product for ARCIVE (see multimodal_expansion) |
| OpenRouter | varies | varies | Several `:free` OSS models | varies | One key, ~300 models — but routing overhead |
| Azure OpenAI | OpenAI prices + Azure premium | — | None | OpenAI lineup | Only relevant for SOC 2 / enterprise (V1.0+) |

Voice-specific (Stage 4):

| Vendor / Model | Type | Voice cloning per role? | Audio tags / emotion control? | Latency | Notes |
|---|---|---|---|---|---|
| Cartesia Sonic | TTS | partial | | <100ms first chunk | Current TTS in voice-talkback (paused) |
| ElevenLabs v3 | TTS | ✅ | best-in-class ([laughs], [whispers]) | ~200ms | ADR-0006 step 2 upgrade target |
| Sesame CSM | TTS (open weights) | partial | | varies | ADR-0006 step 2 alt; OSS path |
| Microsoft VibeVoice | TTS (MIT, Aug 2025) | partial | | varies | Open-source long-form TTS |
| Deepgram Nova-3 | streaming STT | n/a | n/a | ~100ms | Current STT for voice-talkback |
| OpenAI Realtime API | speech-to-speech | | | 300-500ms | 8 fixed voices; kills ADR-0006 |
| Google Gemini Live | speech-to-speech | | ⚠ primitive | 300-500ms | ~30 fixed voices; kills ADR-0006 |
| AWS Nova Sonic | speech-to-speech | | | varies | AWS-native; not in our gravity well |
| Kyutai Moshi | speech-to-speech (open) | n/a | partial | low | Single-model full-duplex |
| ElevenLabs Conversational AI | composed managed platform | | | ~500ms | Bring-your-own-LLM — wraps ElevenLabs TTS + Deepgram-equivalent STT + your LLM + tool-use orchestration. Worth re-evaluating on Stage 4 resumption |
| Hume EVI 3 | bundled emotion-aware | partial | ✅ emotion-native | varies | Niche; relevant for Caregiver use case |
| Pipecat Cloud / Daily | managed Pipecat | ✅ (TTS-dependent) | ✅ (TTS-dependent) | composed latency | Same architecture as current, hosted |

Embedding-specific (Stage 5):

| Vendor / Model | Modality | Cost | Notes |
|---|---|---|---|
| Voyage-3-lite | text only | free tier 50M tok | Current. 512-d. |
| Voyage Multimodal-3 | text + image | paid | Multimodal expansion #1 leverage move |
| Cohere Embed v4 | text + image | paid | Alternative; eval needed |
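
Because Embed is commodity-class (see §7), the swap the companion doc calls the #1 leverage move should be a rebind plus a re-embed backfill, not a refactor. A minimal sketch of that shape — `Embedder`, `voyageLite`, and the request wrapper are illustrative assumptions, not ARCIVE code:

```typescript
// Commodity-class embed layer behind one interface: swapping Voyage-3-lite
// for a multimodal model changes one binding (plus a re-embed job).
interface Embedder {
  readonly dimensions: number;
  embed(input: { text?: string; imageUrl?: string }): Promise<number[]>;
}

const voyageLite: Embedder = {
  dimensions: 512, // matches the 512-d column noted above
  async embed({ text }) {
    if (!text) throw new Error("voyage-3-lite is text-only");
    const res = await fetch("https://api.voyageai.com/v1/embeddings", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "voyage-3-lite", input: [text] }),
    });
    const json = await res.json();
    return json.data[0].embedding;
  },
};

// A Voyage Multimodal-3 / Cohere Embed v4 implementation would accept
// imageUrl too — same interface, different binding.
```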

4. ARCIVE summarize traffic — what cost actually means

A typical memory: ~3K input tokens (transcript), ~200 output tokens. Estimated ARCIVE volume scenarios:

| Scenario | Memories/day | Tokens/day (in + out) | Anthropic Haiku 4.5 | Gemini 2.5 Flash paid | OpenAI GPT-5-nano | Groq / Gemini AI Studio (free) |
|---|---|---|---|---|---|---|
| Personal use | 10 | 30K + 2K | $0.04/mo | $0.003/mo | $0.002/mo | $0 |
| Early prod | 1,000 | 3M + 200K | $39/mo | $2.50/mo | $1.65/mo | $0 (within quota) |
| Mid scale | 10,000 | 30M + 2M | $390/mo | $25/mo | $16/mo | Likely over free quota → paid |
| Large scale | 100,000 | 300M + 20M | $3,900/mo | $250/mo | $160/mo | Paid |

Cost gradient is real but not catastrophic until the 10K+/day mark. All non-Anthropic options are <$30/mo at early prod scale. The real cost question is: at what scale does Anthropic’s premium stop being worth it for commodity tasks like summarize?

For Stage 3 chat (Sonnet 4.6, more expensive), the per-turn cost matters more — but the quality bar is also higher there.
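
For unit-cost intuition, the raw per-memory arithmetic at the listed rates — a quick sketch; note the monthly figures in the table above may bake in assumptions (batching, quota mixes) not reproduced here:

```typescript
// Back-of-envelope unit cost for one summarize call (~3K in, ~200 out).
// Rates are the $/1M-token figures from the vendor menu in §3.
function costPerMemory(ratesPerMTok: { in: number; out: number }): number {
  const IN_TOK = 3_000;
  const OUT_TOK = 200;
  return (IN_TOK / 1e6) * ratesPerMTok.in + (OUT_TOK / 1e6) * ratesPerMTok.out;
}

costPerMemory({ in: 1.0, out: 5.0 });   // Haiku 4.5     → $0.004 per memory
costPerMemory({ in: 0.075, out: 0.3 }); // Gemini Flash  → ~$0.000285 (≈14× cheaper)
costPerMemory({ in: 0.05, out: 0.2 });  // GPT-5-nano    → ~$0.00019
```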


5. How credible players actually compose AI services

The unlock in this conversation was looking outside ARCIVE: nobody in adjacent product spaces consolidates to one vendor. The pattern across the industry:

| Player | STT | LLM (text) | LLM (chat) | TTS / Voice | Strategy |
|---|---|---|---|---|---|
| Notion AI | n/a | OpenAI + Anthropic | OpenAI + Anthropic | n/a | Multi-vendor LLM, picks per task |
| Granola | OpenAI Whisper / Deepgram | OpenAI / Anthropic | OpenAI / Anthropic | n/a | Best-of-breed; doesn’t ship voice |
| Otter.ai | Their own (trained corpus) | OpenAI (rumored) | n/a | n/a | Owns STT (the moat); commoditizes the rest |
| Limitless / Plaud / Bee | Whisper / Deepgram | OpenAI / Anthropic | minimal | n/a | Audio capture + summary; no voice talk-back |
| Rewind / Personal.ai | Whisper variants | OpenAI / Anthropic | OpenAI / Anthropic | n/a | Same shape |
| ElevenLabs Conversational AI | Deepgram-eq | bring your own | bring your own | ElevenLabs (own) | Owns voice fidelity; commoditizes everything else |
| Sesame (companion) | own | | | own CSM (open-sourced) | Owns voice quality specifically |
| Hume EVI 3 | own | own | own | own | Vertical: emotion-aware voice |
| Apple Intelligence / Siri | Apple | Apple + OpenAI fallback | Apple + OpenAI fallback | Apple | Multi-vendor, routes per task |
| OpenAI / Anthropic / Google | own | own | own | own | They are the vendors — vertically integrated by definition |

Key observations:

  1. Nobody outside frontier labs uses one vendor for everything. Even Apple — the most vertically integrated company there is — uses OpenAI as fallback for tasks Apple Intelligence doesn’t handle.
  2. STT and TTS are routinely separated from LLM, even when unified models exist. Reasons: cost, task-specific quality, fallback resilience.
  3. Players that own a layer end-to-end (Otter on STT, ElevenLabs on TTS, Sesame on voice) own it because that’s their moat. They commoditize everything else.
  4. Best-of-breed per task is the operating norm. Single-vendor consolidation is a Big-Tech sales pitch, not a shipping reality.
  5. Frontier models commoditize the “smart” layer. Differentiation moves up the stack: data, retrieval, UX, brand.

6. Two architectural shapes considered (and why we reject both)

Shape A — Consolidate to one vendor across all 6 stages

Pick one of {Gemini, OpenAI} and run everything through them. Pros: single billing relationship, single SDK, vendor ergonomics, possible cost discounts at scale.

Why we reject this:

  • Forces dropping Claude Agent SDK (Stage 3) — quality regression risk + ADR-0003 driver layer rewrite
  • Forces dropping ADR-0006’s TTS swappability at Stage 4 — speech-to-speech models can’t do per-role voice cloning, audio tags, or prosody control
  • Locks ARCIVE to one vendor’s roadmap, pricing, and outage profile
  • Doesn’t match how any credible player in adjacent spaces actually operates
  • Premature optimization for unification that doesn’t exist yet (Stage 6 isn’t here)

Shape B — Decouple text-LLM from voice (Path D in earlier discussion)

Lean toward Gemini for text-bearing stages (2, 5, eventually 3); keep voice composed per ADR-0006.

Why we reject this too:

  • Implies “Gemini wins the text side” — invites awkward exceptions when Anthropic, Mistral, or DeepSeek ships a better text model
  • Doesn’t distinguish strategic from commodity layers — treats all text-LLM choices as one decision
  • Doesn’t ground the decision in industry practice
  • Creates a vendor lean that has to be walked back when the layer-specific situation changes (e.g., EU customers need EU data residency → Mistral, not Gemini)

7. Shape C — Layered architecture, best-of-breed per task (chosen)

ARCIVE is a system that composes specialized AI capabilities around a memory store. Layers, not pipelines. Vendors are commodities at most layers, strategic at a few.

The canonical layer model is in ADR-0011 and reproduced in 01_SOFTWARE_PLAN.md §1.4. This section captures the reasoning that produced it; the ADR is the immutable reference. If the layer numbers ever need to change, change them in ADR-0011 first.

The 11 layers, summarized for context:

| # | Layer | Class | Today’s vendor |
|---|---|---|---|
| 1 | Capture | ARCIVE-owned | recorders + audio-transcode |
| 2 | Transcribe | commodity | Groq Whisper |
| 3 | Understand | commodity | Gemini Flash → Anthropic Haiku → Groq Llama (fallback chain) |
| 4 | Embed | commodity | Voyage-3-lite |
| 5 | Retrieve (MOAT) | ARCIVE-owned | MCP + pgvector + tsv + edges + consent |
| 6 | Reason | semi-strategic | Claude Agent SDK |
| 7 | Hear | commodity | Deepgram Nova-3 (paused per ADR-0010) |
| 8 | Speak | strategic (ADR-0006) | Cartesia → ElevenLabs v3 / Sesame CSM (paused per ADR-0010) |
| 9 | Voice orchestration | strategic (ADR-0002) | Pipecat (paused per ADR-0010); composed on resumption |
| 10 | Auto-correlate (MOAT) | ARCIVE-owned | edges job; multimodal expansion future |
| 11 | Surfaces | ARCIVE-owned | web + mobile + MCP server |

Three classes of layer, three procurement rules

| Class | Layers | Procurement rule |
|---|---|---|
| ARCIVE-owned | Capture, Retrieve, Auto-correlate, Surfaces | Build internally. This is the moat. Don’t outsource. |
| Strategic AI | Speak (ADR-0006), Reason (semi) | Pick best-in-class for the strategic axis. Justify every change in an ADR. |
| Commodity AI | Transcribe, Understand, Embed, Hear | Pick the cheapest acceptable vendor with a coded fallback. Swap freely as the market shifts. No ADR needed for vendor swaps within this class. |
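
One way to keep the procurement rule mechanical rather than tribal knowledge — a hypothetical registry sketch (the names and the `vendorSwapNeedsAdr` helper are illustrative, not existing ARCIVE code):

```typescript
// The three classes, encoded. Commodity layers swap vendors freely;
// everything else either requires an ADR or is never outsourced at all.
type LayerClass = "arcive-owned" | "strategic" | "commodity";

const LAYERS = {
  capture: "arcive-owned",
  transcribe: "commodity",
  understand: "commodity",
  embed: "commodity",
  retrieve: "arcive-owned",
  reason: "strategic", // "semi-strategic" today; see open questions in §11
  hear: "commodity",
  speak: "strategic", // ADR-0006
  voiceOrchestration: "strategic", // ADR-0002
  autoCorrelate: "arcive-owned",
  surfaces: "arcive-owned",
} as const satisfies Record<string, LayerClass>;

// The procurement rule, mechanized: only commodity-layer swaps ship without an ADR.
function vendorSwapNeedsAdr(layer: keyof typeof LAYERS): boolean {
  return LAYERS[layer] !== "commodity";
}
```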

How this resolves each stage of the pipeline

| Stage | Layer mapping | Vendor decision frame |
|---|---|---|
| 1 — Transcribe | Layer 2 (commodity) | Cheapest acceptable STT; Groq Whisper today, Deepgram if streaming, OpenAI Whisper if quality |
| 2 — Summarize + topics | Layer 3 (commodity) | Cheapest acceptable text-LLM; Gemini Flash today (free tier, multimodal-ready); fallback Anthropic |
| 3 — Text chat | Layer 6 (semi-strategic) | Claude Agent SDK today; re-eval at Stage 5/6 pressure, not on consolidation pressure |
| 4 — Voice | Layers 7+8+9 (strategic — voice fidelity) | Composed pipeline, ElevenLabs v3 / Sesame CSM TTS. Never speech-to-speech. Paused per ADR-0010 |
| 5 — Multimodal share | Layer 3 (per-kind LLM) + Layer 4 (multimodal embed) | Per-kind: image = Sonnet/Gemini vision, PDF = text-extract + LLM, link = fetch + LLM. Embedding swap is the leverage move |
| 6 — Unified | Redefined: two surfaces (Talk + Chat) sharing the memory store, not literally one model | Don’t try to literally unify. Accept Talk/Chat as separate UIs over the same MCP retrieval layer |

Why Stage 6 is redefined, not deferred

The earlier framing said “defer Stage 6 until speech-to-speech voice fidelity matures.” Layered says: the unification was never the goal; the experience was. A user who can talk or type to ARCIVE about their memories, with the same retrieval and memory state, gets the experience without ARCIVE adopting a single-model architecture that breaks ADR-0006. Two surfaces, one memory store.

This matches how every credible memory product ships: same data, multiple input modalities, no model trying to be all things. If a future speech-to-speech model gains true TTS swappability, voice cloning, and audio tags, revisit. Until then, the layered Talk + Chat shape ships sooner and is more durable.
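
In miniature, the “two surfaces, one memory store” shape looks like this — every function name below is a hypothetical stand-in for the real layers, but the point survives: both surfaces call the same consent-scoped retrieval and the same Reason layer, so no unified model is required:

```typescript
// Stubs standing in for the real layers (hypothetical names throughout).
declare function transcribe(audio: ArrayBuffer): Promise<string>;        // Layer 7 (Hear)
declare function reasonLLM(msg: string, ctx: string[]): Promise<string>; // Layer 6 (Reason)
declare function synthesizeSpeech(text: string): Promise<ArrayBuffer>;   // Layer 8 (Speak)
declare const mcpClient: {
  callTool(name: string, args: object): Promise<string[]>;
};

// Both surfaces resolve to the same consent-scoped MCP retrieval (ADR-0004/0007).
async function retrieveMemories(query: string): Promise<string[]> {
  return mcpClient.callTool("search_memories", { query });
}

// Chat surface: text in → retrieval → Reason → text out.
async function chatTurn(message: string): Promise<string> {
  const context = await retrieveMemories(message);
  return reasonLLM(message, context);
}

// Talk surface (on resumption): Hear → the same brain + memory state → Speak.
async function talkTurn(audio: ArrayBuffer): Promise<ArrayBuffer> {
  const text = await transcribe(audio);
  const reply = await chatTurn(text); // identical retrieval, identical memory
  return synthesizeSpeech(reply);
}
```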


8. Comparing the options I considered along the way

For the record (and for the next person re-litigating this):

| Path | Frame | Code change today | Long-term shape | Why rejected (or chosen) |
|---|---|---|---|---|
| Path A | Gemini unified across all 6 stages | Same | Stage 4 voice on Gemini Live | Rejected — kills ADR-0006 voice fidelity |
| Path B | OpenAI everything | Same + replace Claude/Whisper | Stage 4 on OpenAI Realtime | Rejected — same ADR-0006 kill + bigger migration cost + no free tier |
| Path C | Composed throughout, multi-vendor | Same | No literal Stage 6 unification | Half-right — preserves voice fidelity but doesn’t articulate the framework |
| Path D | Decouple text-LLM (Gemini lean) from voice (composed) | Same | Stage 4 composed, text leans Gemini | Half-right — same operational outcome as Layered, but invites “why Gemini?” objections |
| Path E | Path A with eyes open — bet voice fidelity tools converge | Same | Same as A, with monitoring | Rejected — a timing bet on a market trajectory, not a strategy |
| Layered (chosen) | Best-of-breed per task; ARCIVE owns the moat | Same | Two surfaces, shared memory; layer-by-layer vendor choice | Chosen — matches industry practice, preserves all prior ADRs, durable to vendor changes |

Net: today’s code change is essentially identical across the paths (only Path B adds migration work). The choice is about the frame that governs every future per-task decision. Layered is the most durable frame because it doesn’t commit ARCIVE to any vendor’s roadmap.


9. Connection to the multimodal expansion discussion

The companion doc 2026-05-04_multimodal_expansion.md lays out:

  • Schema generalization: kind enum, polymorphic assets table, reuse of transcript+embedding+summary as universal text/vector layer
  • Universal /ingest endpoint with per-kind dispatcher
  • Surfaces: PWA share-target, Chrome MV3 extension, iOS share extension, email-in, MCP-write
  • The key realization: MCP-first is a better primitive than share extensions — pull MCP-write forward to V0.2 so Claude Desktop / ChatGPT / Apple Intelligence become the share surface
  • The half-day high-leverage move: multimodal embedding swap (Voyage Multimodal-3 / Cohere Embed v4) replaces caption-then-embed
  • Brand-aligned philosophy: every shared item becomes an audio memory with attachment (or auto-correlation — the version only ARCIVE could ship)

This Layered architecture maps cleanly:

| Multimodal expansion concept | Layered equivalent |
|---|---|
| Schema generalization (kind enum, assets table) | Capture layer extension |
| Universal /ingest per-kind dispatcher | Surfaces layer + per-kind Understand/Embed routing |
| MCP-write forward to V0.2 | Surfaces layer (MCP-as-output) |
| Multimodal embedding swap | Embed layer vendor swap (commodity-class procurement) |
| Auto-correlation differentiator | Auto-correlate layer (ARCIVE-owned moat) |
| “Every entry is still a memory” philosophy | Same memory schema across all kinds; surfaces vary, data stays uniform |

Layered architecture doesn’t preempt any of those decisions. The discussion doc’s open questions remain open. But anything that lands from the discussion doc lands in a clean architectural slot under this framework.
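
To make “clean architectural slot” concrete: the companion doc’s per-kind dispatcher would route each shared kind through the right commodity layers and converge on the same memory shape. A sketch, with every helper name hypothetical:

```typescript
// Universal /ingest per-kind dispatcher (sketch; all helpers are stand-ins).
// Every kind converges on the same memory shape: text + summary + embedding.
type Kind = "audio" | "image" | "link" | "pdf" | "video";

declare function transcribe(url: string): Promise<string>;       // Layer 2
declare function describeImage(url: string): Promise<string>;    // Layer 3, vision
declare function fetchAndExtract(url: string): Promise<string>;  // fetch + LLM
declare function extractPdfText(url: string): Promise<string>;   // text-extract + LLM
declare function summarize(text: string): Promise<string>;       // Layer 3
declare function embed(text: string): Promise<number[]>;         // Layer 4
declare function saveMemory(memory: object): Promise<void>;

async function ingest(kind: Kind, payload: { url: string }): Promise<void> {
  let text: string;
  switch (kind) {
    case "audio": text = await transcribe(payload.url); break;
    case "image": text = await describeImage(payload.url); break;
    case "link":  text = await fetchAndExtract(payload.url); break;
    case "pdf":   text = await extractPdfText(payload.url); break;
    case "video": text = await transcribe(payload.url); break; // audio track first
    default: throw new Error("unknown kind");
  }
  const summary = await summarize(text); // Layer 3 (Understand)
  const vector = await embed(text);      // Layer 4 (Embed)
  await saveMemory({ kind, text, summary, vector });
}
```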


10. Decisions captured here (to formalize in ADRs)

These are the decisions emerging from the session. Each becomes (or extends) an ADR:

| # | Decision | Owning ADR |
|---|---|---|
| 1 | AI vendor strategy is best-of-breed per task; ARCIVE owns the memory + retrieval + auto-correlation moat | ADR-0011 (new — to draft) |
| 2 | Each layer is classified strategic / commodity / ARCIVE-owned; commodity-layer vendor swaps don’t need new ADRs | ADR-0011 |
| 3 | Stage 2 summarize uses Gemini Flash via AI Studio (free tier) primary, Anthropic Haiku fallback — already coded, just needs the key | ADR-0011 (vendor pick within commodity-layer rules) |
| 4 | Stage 4 voice stays composed on resumption (Pipecat + ElevenLabs v3 / Sesame CSM); never speech-to-speech | Reaffirms ADR-0002, ADR-0006, ADR-0010 |
| 5 | Stage 6 unified is redefined as Talk + Chat surfaces sharing the memory store, not single-model | New section in ADR-0011 |
| 6 | Multimodal embedding swap (Voyage Multimodal-3 / Cohere Embed v4) is the next high-leverage commodity-layer evaluation | Future ADR-0012 (when scoped) |
| 7 | MCP-write surface moves forward to V0.2 per multimodal_expansion discussion | Future ADR if accepted |

10b. Groq’s role across the layer model

Groq is currently a single-layer vendor for ARCIVE — Layer 2 (Transcribe / Whisper) — but is capable across more layers. Documenting the surface here so future readers don’t re-litigate it. (Also captured in ADR-0011 Notes.)

| Groq capability | Layer | Status for ARCIVE |
|---|---|---|
| Whisper STT (whisper-large-v3-turbo) | 2 — Transcribe | ✅ in use today |
| Llama 3.3 70B / Llama 4 Scout / Maverick | 3, 5, 6 | Available; not primary. Llama 3.3 70B added as Layer 3 third fallback below Anthropic |
| DeepSeek-R1-distill (reasoning) | 6 | Available; budget option for future freemium chat |
| Mixtral / Qwen / Kimi K2 / GPT-OSS | 3 | Available; redundant with Gemini |
| TTS | 8 | ❌ Groq doesn’t ship |
| Speech-to-speech | 9 | ❌ Groq doesn’t ship |
| Embeddings | 4 | ❌ Groq doesn’t ship |

Why Gemini won the Layer 3 primary pick over Groq Llama: multimodal coverage including audio (text + audio + image + PDF + video) future-proofs Stage 5 multimodal share, while Llama 4 Scout is text + image only. Groq is fast, free, and competent; the call came down to multimodal coverage.

Where Groq is added now: Layer 3 fallback chain becomes Gemini → Anthropic → Groq Llama 3.3 70B → no-op. Reuses the existing GROQ_API_KEY — no new vendor relationship — and provides resilience if Gemini AND Anthropic both have transient outages.
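
Spelled out as code, the chain is a loop over providers with an explicit no-op tail — a sketch, assuming thin per-vendor wrapper functions (all names illustrative):

```typescript
// Layer 3 (Understand) commodity fallback chain:
// Gemini → Anthropic → Groq Llama 3.3 70B → no-op.
type Summarizer = (transcript: string) => Promise<{ summary: string; topics: string[] }>;

// Thin per-vendor wrappers, each returning summary JSON (stand-ins).
declare const geminiFlash: Summarizer;
declare const anthropicHaiku: Summarizer;
declare const groqLlama70b: Summarizer;

const CHAIN: Array<[string, Summarizer]> = [
  ["gemini-2.5-flash", geminiFlash],
  ["claude-haiku-4.5", anthropicHaiku],
  ["llama-3.3-70b (groq)", groqLlama70b], // reuses the existing GROQ_API_KEY
];

async function summarizeWithFallback(transcript: string) {
  for (const [name, run] of CHAIN) {
    try {
      return await run(transcript);
    } catch (err) {
      console.warn(`summarize: ${name} failed, falling back`, err);
    }
  }
  return null; // no-op tail: leave summary null; the backfill job retries later
}
```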

Future Groq plays available within the framework (no new ADR needed):

  • Groq Llama as freemium chat fallback (Layer 6) at V1.0+ when free-tier scaling matters
  • Groq for any synchronous-summarize UX moment (Groq inference is ~10× faster than Gemini)
  • DeepSeek-R1-distill on Groq for reasoning-heavy tasks (e.g., auto-correlation per multimodal expansion)

Plays not allowed within the framework: extending Groq into Layers 4 / 7 / 8 / 9 — Groq doesn’t ship those capabilities; pretending otherwise means self-hosting, which is out of scope until V2.x local-only platform variant.


11. Open questions

  • Quality eval — should we run a 50-transcript shootout on GitHub Models (free, 30+ models, single token) to verify Gemini Flash quality vs Haiku 4.5 on real ARCIVE transcripts before committing? Probably yes, half-day, but not gating the backfill PR.
  • When does Stage 3 chat get re-evaluated? Today Claude Agent SDK is the right pick on quality + tool-use grounds. The trigger to revisit is either (a) the discussion-doc multimodal expansion makes Stage 5 vision a chat-layer concern, or (b) cost at scale crosses some threshold.
  • B2B EU data residency — when a customer requires EU-only data, which layer needs the swap? Likely Understand (Mistral La Plateforme), possibly Embed (Cohere has EU-hosted), Reason (Mistral or Anthropic EU when available). Not urgent for V0.3; relevant for V1.0 SOC 2 prep.
  • Failover testing — the framework assumes commodity-layer fallbacks work. Have we actually exercised them? Summarize-step has Anthropic→Gemini→no-op coded but never tested with the primary down. Worth a chaos test once the Gemini key is live.
  • Voice resumption trigger — per ADR-0010 the resumption depends on EAS + voice-fidelity-on-speech-to-speech maturity. The ElevenLabs Conversational AI managed platform (BYO LLM, ElevenLabs TTS) is an underweighted middle option. Should ADR-0010’s resumption checklist add it as a third path?
  • Layer 6 (Reason) class assignment — currently labeled “semi-strategic”. Is that a stable classification or a temporary one until quality bars settle?

12. What ships from this session

In order:

  1. feat(pipeline): summarize backfill via Gemini Flash + Groq third fallback — set GEMINI_API_KEY, flip provider order (Gemini primary, Anthropic fallback, Groq Llama 3.3 70B third fallback), add backfill-summaries Edge Function enqueueing summarize jobs for null-summary memories. All commodity-layer picks within the framework rules.

  2. docs: ADR-0011 — AI vendor strategy (best-of-breed per task) — formalizes the framework + the layer classification. References this discussion doc.

  3. (Future) Multimodal embedding swap ADR + scoping when ready. Discussion doc’s #1 leverage move.

  4. (Future) ADR-0010 resumption checklist update if ElevenLabs Conversational AI is added as a third voice path.


13. The TL;DR for someone walking in cold

ARCIVE composes specialized AI capabilities around a memory store. The product moat is the memory store + retrieval + auto-correlation, not any AI capability. AI vendors are commodities at most layers (transcribe, understand, embed, hear) and strategic at a few (speak, reason, retrieve). Pick best-of-breed per task with explicit fallback. Don’t consolidate to one vendor — that’s a Big-Tech aspiration, not a shipping reality outside the frontier labs. Voice fidelity is strategic (ADR-0006), so the voice loop stays composed forever. Stage 6 “talk + chat together” ships as two surfaces sharing the memory store, not one unified model. Today’s work: Gemini Flash for summarize (free, multimodal-ready, already coded as fallback), backfill old memories.