Date: 2026-05-05
Format: Working session notes — strategic, fed into ADR-0011.
Outcome: ADR-0011 — AI vendor strategy: best-of-breed per task, ARCIVE owns the moat (Accepted)
Companion: 2026-05-04_multimodal_expansion.md — the multimodal expansion discussion this builds on
Why this session happened
The trigger was small: pick the next thing to build. We started with “voice talk-back on mobile (#1 priority)”, spiked it, found Pipecat-in-Expo-Go was structurally blocked, and wrote ADR-0010 deferring the whole voice feature.
We then pivoted to the next item — summarize and topics generation, and backfill for older memories. That triggered the question “which API are we using? Is there a consolidated place?” which expanded into “what about free options?” → “have you considered OpenAI / Microsoft?” → “isn’t this fragmented?” → ultimately “step back and think as an architect — how are current players actually doing this?”
This doc captures the architectural reasoning that emerged. The conclusion isn’t a vendor pick; it’s a framework for picking vendors per task, plus a clear-eyed look at what ARCIVE actually owns and what it doesn’t.
1. The staged AI pipeline ARCIVE is building
User framing surfaced this six-stage progression — the actual roadmap, not just a wish list:
| Stage | Status |
|---|---|
| Stage 1 — Transcribe (audio → text) | ✓ shipped (Groq Whisper) |
| Stage 2 — Summarize + topics (text → JSON) | ⚠ scaffolded, missing API key |
| Stage 3 — Text chat (text + retrieval → response) | ✓ shipped (Claude Agent SDK + MCP) |
| Stage 4 — Voice conversational (real-time speech loop) | ⏸ paused — ADR-0010 |
| Stage 5 — Share multimodal (image / link / PDF / video) | □ exploratory — see multimodal_expansion |
| Stage 6 — Unified (talk + chat) (real-time multimodal) | □ V1.0+ |

Each stage adds a new modality or interaction shape on top of the same underlying memory store. The vendor question isn’t “best LLM for summarize” — it’s “what vendor strategy scales coherently across all 6 stages.”
2. What ARCIVE has already committed to (the constraints)
Three prior ADRs frame this strategy. Skipping any of them would mean overruling existing architectural intent:
| ADR | Commitment | Implication for vendor strategy |
|---|---|---|
| 0002 — Pipecat over OpenAI Realtime | Composed STT→LLM→TTS pipeline, not bundled speech-to-speech | Stage 4 voice is composed, not unified |
| 0003 — Swappable agent driver layer | AgentDriver interface allows replacing the chat brain | Stage 3 LLM can be swapped without rewriting /api/chat callers |
| 0004 — MCP as separate service | Memory retrieval lives in its own MCP server | Retrieval is independent of any model vendor |
| 0006 — Voice naturalness via swappable TTS | TTS is upgrade path: Cartesia → ElevenLabs v3 → Sesame CSM. Per-role voices, audio tags, cloning | Voice fidelity is strategic, not commodity. Forces composed pipeline at Stage 4 |
| 0007 — Consent gate on retrieval | Agents see consent-scoped memories only | Retrieval layer enforces the policy, not the model |
| 0010 — Defer voice talk-back | Stage 4 paused until EAS unlocks + speech-to-speech-vs-composed settles | Voice-related vendor picks are deferred, not abandoned |
ADR-0006 is the load-bearing constraint for this whole conversation. End-to-end speech-to-speech models cannot deliver swappable TTS, voice cloning, or audio tags. That kills any “consolidate to one vendor across all 6 stages” strategy on Stage 4.
3. The vendor menu (free + paid, multimodal + text-only)
Full landscape as of 2026-05-05. Costs per 1M tokens at published rates. “Forever-free” means a sustained free tier, not a trial credit.
| Vendor / Model | $/M in | $/M out | Forever-free | Multimodal | Notes |
|---|---|---|---|---|---|
| Google Gemini 2.5 Flash | $0.075 | $0.30 | AI Studio: 1M tok/day, 15 RPM | text + audio + image + PDF + video | Most expansive multimodal; native audio in |
| Google Gemini 2.5 Flash-Lite | $0.025 | $0.10 | Same AI Studio | text + image | Cheapest paid Gemini |
| OpenAI GPT-5-nano | $0.05 | $0.20 | $5 trial credit only | text + vision | Cheap, no free-forever; native MCP |
| OpenAI GPT-5-mini | $0.40 | $2.00 | $5 trial credit | text + vision; Realtime audio | Native MCP across Apps/Agents/Realtime SDKs |
| Anthropic Haiku 4.5 | $1.00 | $5.00 | $5 trial credit | text + vision (no audio) | What summarize-step uses today; ~10× Gemini cost |
| Anthropic Sonnet 4.6 | $3.00 | $15.00 | $5 trial credit | text + vision | What /api/chat Agent SDK uses |
| Groq Llama 4 Scout | — | — | Generous free tier (existing GROQ_API_KEY) | text + image | Fastest LLM on the market |
| Groq Llama 3.3 70B | — | — | Same free tier | text only | Solid quality, free |
| Cloudflare Workers AI | — | — | 10k req/day free | text + some vision | Edge-deployed; low latency |
| GitHub Models | — | — | 50 req/day per model, ~30 models, free with GitHub account | varies | Single token, gives Claude+GPT+Llama+Phi+Mistral. Eval gold; 50/day cap kills prod use |
| Microsoft Phi-4-multimodal | self-host GPU $ | — | Free weights (MIT) | text + audio + vision | OSS option for V2.x local-only platform variant |
| DeepSeek V3.1 | $0.27 | $1.10 | Free :free variant on OpenRouter | text only | Very cheap, surprising quality |
| Mistral Small 3 / Pixtral | $0.20 | $0.60 | Free tier on La Plateforme | Pixtral multimodal | EU data residency (matters for V1.0 SOC 2 / EU users) |
| Cohere Command R+ | $2.50 | $10.00 | Free trial only | text + vision | Embed v4 is the more interesting Cohere product for ARCIVE (see multimodal_expansion) |
| OpenRouter | varies | varies | Several :free OSS models | varies | One key, ~300 models — but routing overhead |
| Azure OpenAI | OpenAI prices + Azure premium | — | None | OpenAI lineup | Only relevant for SOC 2 / enterprise (V1.0+) |
Voice-specific (Stage 4):
| Vendor / Model | Type | Voice cloning per role? | Audio tags / emotion control? | Latency | Notes |
|---|---|---|---|---|---|
| Cartesia Sonic | TTS | ✅ | partial | <100ms first chunk | Current TTS in voice-talkback (paused) |
| ElevenLabs v3 | TTS | ✅ | ✅ best-in-class ([laughs], [whispers]) | ~200ms | ADR-0006 step 2 upgrade target |
| Sesame CSM | TTS (open weights) | ✅ | partial | varies | ADR-0006 step 2 alt; OSS path |
| Microsoft VibeVoice | TTS (MIT, Aug 2025) | ✅ | partial | varies | Open-source long-form TTS |
| Deepgram Nova-3 | streaming STT | n/a | n/a | ~100ms | Current STT for voice-talkback |
| OpenAI Realtime API | speech-to-speech | ❌ | ❌ | 300-500ms | 8 fixed voices; kills ADR-0006 |
| Google Gemini Live | speech-to-speech | ❌ | ⚠ primitive | 300-500ms | ~30 fixed voices; kills ADR-0006 |
| AWS Nova Sonic | speech-to-speech | ❌ | ❌ | varies | AWS-native; not in our gravity well |
| Kyutai Moshi | speech-to-speech (open) | n/a | partial | low | Single-model full-duplex |
| ElevenLabs Conversational AI | composed managed platform | ✅ | ✅ | ~500ms | Bring-your-own-LLM — wraps ElevenLabs TTS + Deepgram-equivalent STT + your LLM + tool-use orchestration. Worth re-evaluating on Stage 4 resumption |
| Hume EVI 3 | bundled emotion-aware | partial | ✅ emotion-native | varies | Niche; relevant for Caregiver use case |
| Pipecat Cloud / Daily | managed Pipecat | ✅ (TTS-dependent) | ✅ (TTS-dependent) | composed latency | Same architecture as current, hosted |
Embedding-specific (Stage 5):
| Vendor / Model | Modality | Cost | Notes |
|---|---|---|---|
| Voyage-3-lite | text only | free tier 50M tok | Current. 512-d. |
| Voyage Multimodal-3 | text + image | paid | Multimodal expansion #1 leverage move |
| Cohere Embed v4 | text + image | paid | Alternative; eval needed |
4. ARCIVE summarize traffic — what cost actually means
A typical memory: ~3K input tokens (transcript), ~200 output tokens. Estimated ARCIVE volume scenarios:
| Scenario | Memories/day | Tokens/day (in + out) | Anthropic Haiku 4.5 | Gemini 2.5 Flash paid | OpenAI GPT-5-nano | Groq / Gemini AI Studio (free) |
|---|---|---|---|---|---|---|
| Personal use | 10 | 30K + 2K | $0.04/mo | $0.003/mo | $0.002/mo | $0 |
| Early prod | 1,000 | 3M + 200K | $39/mo | $2.50/mo | $1.65/mo | $0 (within quota) |
| Mid scale | 10,000 | 30M + 2M | $390/mo | $25/mo | $16/mo | Likely over free quota → paid |
| Large scale | 100,000 | 300M + 20M | $3,900/mo | $250/mo | $160/mo | Paid |
Cost gradient is real but not catastrophic until the 10K+/day mark. All non-Anthropic options are <$30/mo at early prod scale. The real cost question is: at what scale does Anthropic’s premium stop being worth it for commodity tasks like summarize?
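As a sanity check on that gradient, here is a tiny per-memory cost helper (hypothetical code, using the published per-1M-token rates from the vendor table and the ~3K-in / ~200-out memory shape from above):

```typescript
// Hypothetical helper: per-memory summarize cost at published per-1M-token rates.
type Rates = { inPerM: number; outPerM: number };

function perMemoryCost(rates: Rates, inTokens: number, outTokens: number): number {
  return (inTokens / 1_000_000) * rates.inPerM + (outTokens / 1_000_000) * rates.outPerM;
}

// A typical memory: ~3K input tokens (transcript), ~200 output tokens (summary JSON).
const haiku: Rates = { inPerM: 1.0, outPerM: 5.0 };         // Anthropic Haiku 4.5
const geminiFlash: Rates = { inPerM: 0.075, outPerM: 0.3 }; // Gemini 2.5 Flash

const haikuCost = perMemoryCost(haiku, 3_000, 200);         // ≈ $0.004 per memory
const geminiCost = perMemoryCost(geminiFlash, 3_000, 200);  // ≈ $0.000285 per memory
console.log(haikuCost / geminiCost);                        // ≈ 14× — the "~10× Gemini cost" order of magnitude
```

This is back-of-envelope only; the scenario table's monthly figures may bake in different volume or billing-period assumptions.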
For Stage 3 chat (Sonnet 4.6, more expensive): the per-turn cost matters more, but quality bar is also higher there.
5. How credible players actually compose AI services
The turning point in this conversation was looking outside ARCIVE. Nobody in adjacent product spaces consolidates to one vendor. The pattern across the industry:
| Player | STT | LLM (text) | LLM (chat) | TTS / Voice | Strategy |
|---|---|---|---|---|---|
| Notion AI | n/a | OpenAI + Anthropic | OpenAI + Anthropic | n/a | Multi-vendor LLM, picks per task |
| Granola | OpenAI Whisper / Deepgram | OpenAI / Anthropic | OpenAI / Anthropic | n/a | Best-of-breed; doesn’t ship voice |
| Otter.ai | Their own (trained corpus) | OpenAI (rumored) | n/a | n/a | Owns STT (the moat); commoditizes the rest |
| Limitless / Plaud / Bee | Whisper / Deepgram | OpenAI / Anthropic | minimal | n/a | Audio capture + summary; no voice talk-back |
| Rewind / Personal.ai | Whisper variants | OpenAI / Anthropic | OpenAI / Anthropic | n/a | Same shape |
| ElevenLabs Conversational AI | Deepgram-eq | bring your own LLM | bring your own | ElevenLabs (own) | Owns voice fidelity; commoditizes everything else |
| Sesame (companion) | — | — | own | own CSM (open-sourced) | Owns voice quality specifically |
| Hume EVI 3 | own | own | own | own | Vertical: emotion-aware voice |
| Apple Intelligence / Siri | Apple | Apple + OpenAI fallback | Apple + OpenAI fallback | Apple | Multi-vendor, routes per task |
| OpenAI / Anthropic / Google | own | own | own | own | They are the vendors — vertically integrated by definition |
Key observations:
- Nobody outside frontier labs uses one vendor for everything. Even Apple — the most vertically integrated company there is — uses OpenAI as fallback for tasks Apple Intelligence doesn’t handle.
- STT and TTS are routinely separated from LLM, even when unified models exist. Reasons: cost, task-specific quality, fallback resilience.
- Players that own a layer end-to-end (Otter on STT, ElevenLabs on TTS, Sesame on voice) own it because that’s their moat. They commoditize everything else.
- Best-of-breed per task is the operating norm. Single-vendor consolidation is a Big-Tech sales pitch, not a shipping reality.
- Frontier models commoditize the “smart” layer. Differentiation moves up the stack: data, retrieval, UX, brand.
6. Two architectural shapes considered (and why we reject both)
Shape A — Consolidate to one vendor across all 6 stages
Pick one of {Gemini, OpenAI} and run everything through them. Pros: single billing relationship, single SDK, vendor ergonomics, possible cost discounts at scale.
Why we reject this:
- Forces dropping Claude Agent SDK (Stage 3) — quality regression risk + ADR-0003 driver layer rewrite
- Forces dropping ADR-0006’s TTS swappability at Stage 4 — speech-to-speech models can’t do per-role voice cloning, audio tags, or prosody control
- Locks ARCIVE to one vendor’s roadmap, pricing, and outage profile
- Doesn’t match how any credible player in adjacent spaces actually operates
- Premature optimization for unification that doesn’t exist yet (Stage 6 isn’t here)
Shape B — Decouple text-LLM from voice (Path D in earlier discussion)
Lean toward Gemini for text-bearing stages (2, 5, eventually 3); keep voice composed per ADR-0006.
Why we reject this too:
- Implies “Gemini wins the text side” — invites awkward exceptions when Anthropic, Mistral, or DeepSeek ships a better text model
- Doesn’t distinguish strategic from commodity layers — treats all text-LLM choices as one decision
- Doesn’t ground the decision in industry practice
- Creates a vendor lean that has to be walked back when the layer-specific situation changes (e.g., EU customers need EU data residency → Mistral, not Gemini)
7. Shape C — Layered architecture, best-of-breed per task (chosen)
ARCIVE is a system that composes specialized AI capabilities around a memory store. Layers, not pipelines. Vendors are commodities at most layers, strategic at a few.
The canonical layer model is in ADR-0011 and reproduced in 01_SOFTWARE_PLAN.md §1.4. This section captures the reasoning that produced it; the ADR is the immutable reference. If the layer numbers ever need to change, change them in ADR-0011 first.
The 11 layers, summarized for context:
| # | Layer | Class | Today’s vendor |
|---|---|---|---|
| 1 | Capture | ARCIVE-owned | recorders + audio-transcode |
| 2 | Transcribe | commodity | Groq Whisper |
| 3 | Understand | commodity | Gemini Flash → Anthropic Haiku → Groq Llama (fallback chain) |
| 4 | Embed | commodity | Voyage-3-lite |
| 5 | Retrieve (MOAT) | ARCIVE-owned | MCP + pgvector + tsv + edges + consent |
| 6 | Reason | semi-strategic | Claude Agent SDK |
| 7 | Hear | commodity | Deepgram Nova-3 (paused per ADR-0010) |
| 8 | Speak | strategic (ADR-0006) | Cartesia → ElevenLabs v3 / Sesame CSM (paused per ADR-0010) |
| 9 | Voice orchestration | strategic (ADR-0002, paused per ADR-0010) | Pipecat (paused) — composed on resumption |
| 10 | Auto-correlate (MOAT) | ARCIVE-owned | edges job; multimodal expansion future |
| 11 | Surfaces | ARCIVE-owned | web + mobile + MCP server |
Three classes of layer, three procurement rules
| Class | Layers | Procurement rule |
|---|---|---|
| ARCIVE-owned | Capture, Retrieve, Auto-correlate, Surfaces | Build internally. This is the moat. Don’t outsource. |
| Strategic AI | Speak (ADR-0006), Reason (semi) | Pick best-in-class for the strategic axis. Justify every change in an ADR. |
| Commodity AI | Transcribe, Understand, Embed, Hear | Pick the cheapest acceptable vendor with a coded fallback. Swap freely as the market shifts. No ADR needed for vendor swaps within this class. |
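The class assignment could be encoded as data, so a proposed vendor swap can be checked mechanically against the "no ADR needed for commodity swaps" rule. A sketch with illustrative names (this is not shipped code):

```typescript
// Illustrative encoding of the three procurement classes; layer keys are hypothetical.
type LayerClass = "arcive-owned" | "strategic" | "commodity";

const layerClasses: Record<string, LayerClass> = {
  capture: "arcive-owned",
  transcribe: "commodity",
  understand: "commodity",
  embed: "commodity",
  retrieve: "arcive-owned",
  reason: "strategic", // "semi-strategic" in the table; treated as strategic here
  hear: "commodity",
  speak: "strategic",
  voiceOrchestration: "strategic",
  autoCorrelate: "arcive-owned",
  surfaces: "arcive-owned",
};

// Procurement rule: only commodity layers may swap vendors without a new ADR.
function vendorSwapNeedsAdr(layer: string): boolean {
  const cls = layerClasses[layer];
  if (cls === undefined) throw new Error(`unknown layer: ${layer}`);
  if (cls === "arcive-owned") throw new Error(`${layer} is the moat: build internally, no vendors`);
  return cls === "strategic";
}
```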
How this resolves each stage of the pipeline
| Stage | Layer mapping | Vendor decision frame |
|---|---|---|
| 1 — Transcribe | Layer 2 (commodity) | Cheapest acceptable STT; Groq Whisper today, Deepgram if streaming, OpenAI Whisper if quality |
| 2 — Summarize + topics | Layer 3 (commodity) | Cheapest acceptable text-LLM; Gemini Flash today (free tier, multimodal-ready); fallback Anthropic |
| 3 — Text chat | Layer 6 (semi-strategic) | Claude Agent SDK today; re-eval at Stage 5/6 pressure, not on consolidation pressure |
| 4 — Voice | Layers 7+8+9 (strategic — voice fidelity) | Composed pipeline, ElevenLabs v3 / Sesame CSM TTS. Never speech-to-speech. Paused per ADR-0010 |
| 5 — Multimodal share | Layer 3 (per-kind LLM) + Layer 4 (multimodal embed) | Per-kind: image=Sonnet/Gemini vision, PDF=text-extract+LLM, link=fetch+LLM. Embedding swap is the leverage move |
| 6 — Unified | Redefined: two surfaces (Talk + Chat) sharing the memory store, not literally one model | Don’t try to literally unify. Accept Talk/Chat as separate UIs over the same MCP retrieval layer |
Why Stage 6 is redefined, not deferred
The earlier framing said “defer Stage 6 until speech-to-speech voice fidelity matures.” Layered says: the unification was never the goal; the experience was. A user who can talk or type to ARCIVE about their memories, with the same retrieval and memory state, gets the experience without ARCIVE adopting a single-model architecture that breaks ADR-0006. Two surfaces, one memory store.
This matches how every credible memory product ships: same data, multiple input modalities, no model trying to be all things. If a future speech-to-speech model gains true TTS swappability, voice cloning, and audio tags, revisit. Until then, the layered Talk + Chat shape ships sooner and is more durable.
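The "two surfaces, one memory store" shape reduces to both UIs calling the same retrieval interface. A hypothetical sketch (the interface and function names are illustrative, not the actual MCP contract):

```typescript
// Hypothetical shapes; the real contract is the MCP server's, not this interface.
interface MemoryRetrieval {
  retrieve(query: string): string[]; // consent-scoped memories (ADR-0007)
}

// Two surfaces, one store: Chat and Talk differ only in input/output modality.
function chatTurn(store: MemoryRetrieval, typed: string): string {
  const memories = store.retrieve(typed);
  return `LLM response grounded in ${memories.length} memories`;
}

function talkTurn(store: MemoryRetrieval, transcribed: string): string {
  // Same retrieval path, then composed TTS on the way out (ADR-0006).
  return chatTurn(store, transcribed);
}
```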
8. Comparing the options I considered along the way
For the record (and for the next person re-litigating this):
| Path | Frame | Code change today | Long-term shape | Why rejected (or chosen) |
|---|---|---|---|---|
| Path A | Gemini unified across all 6 stages | Same | Stage 4 voice on Gemini Live | Rejected — kills ADR-0006 voice fidelity |
| Path B | OpenAI everything | Same + replace Claude/Whisper | Stage 4 on OpenAI Realtime | Rejected — same ADR-0006 kill + bigger migration cost + no free tier |
| Path C | Composed throughout, multi-vendor | Same | No literal Stage 6 unification | Half-right — preserves voice fidelity but doesn’t articulate the framework |
| Path D | Decouple text-LLM (Gemini lean) from voice (composed) | Same | Stage 4 composed, text leans Gemini | Half-right — same operational outcome as Layered, but invites “why Gemini?” objections |
| Path E | Path A with eyes open — bet voice fidelity tools converge | Same | Same as A, with monitoring | Rejected — timing bet on a market trajectory, not a strategy |
| Layered (chosen) | Best-of-breed per task; ARCIVE owns the moat | Same | Two surfaces, shared memory; layer-by-layer vendor choice | Chosen — matches industry practice, preserves all prior ADRs, durable to vendor changes |
Net: today’s code change is identical across all paths. The choice is about the frame that governs every future per-task decision. Layered is the most durable frame because it doesn’t commit ARCIVE to any vendor’s roadmap.
9. Connection to the multimodal expansion discussion
The companion doc 2026-05-04_multimodal_expansion.md lays out:
- Schema generalization: `kind` enum, polymorphic `assets` table, reuse of `transcript` + `embedding` + `summary` as universal text/vector layer
- Universal `/ingest` endpoint with per-kind dispatcher
- Surfaces: PWA share-target, Chrome MV3 extension, iOS share extension, email-in, MCP-write
- The key realization: MCP-first is the better primitive than share extensions — pull MCP-write forward to V0.2 so Claude Desktop / ChatGPT / Apple Intelligence become the share surface
- The half-day high-leverage move: multimodal embedding swap (Voyage Multimodal-3 / Cohere Embed v4) replaces caption-then-embed
- Brand-aligned philosophy: every shared item becomes an audio memory with attachment (or auto-correlation — the version only ARCIVE could ship)
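The per-kind dispatcher idea from that doc might look roughly like this (handler bodies and kind names are illustrative stand-ins, not shipped code):

```typescript
// Illustrative per-kind dispatch for a universal /ingest endpoint.
type Kind = "audio" | "image" | "link" | "pdf" | "video";

type Handler = (payload: string) => { kind: Kind; text: string };

const handlers: Record<Kind, Handler> = {
  audio: (p) => ({ kind: "audio", text: `transcript of ${p}` }),      // Layer 2: STT
  image: (p) => ({ kind: "image", text: `caption of ${p}` }),         // vision LLM
  link:  (p) => ({ kind: "link",  text: `fetched + summarized ${p}` }),
  pdf:   (p) => ({ kind: "pdf",   text: `extracted text of ${p}` }),
  video: (p) => ({ kind: "video", text: `transcript + frames of ${p}` }),
};

// Every kind funnels into the same transcript + embedding + summary columns.
function ingest(kind: Kind, payload: string) {
  return handlers[kind](payload);
}
```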
This Layered architecture maps cleanly:
| Multimodal expansion concept | Layered equivalent |
|---|---|
| Schema generalization (`kind` enum, `assets` table) | Capture layer extension |
| Universal `/ingest` per-kind dispatcher | Surfaces layer + per-kind Understand/Embed routing |
| MCP-write forward to V0.2 | Surfaces layer (MCP-as-output) |
| Multimodal embedding swap | Embed layer vendor swap (commodity-class procurement) |
| Auto-correlation differentiator | Auto-correlate layer (ARCIVE-owned moat) |
| “Every entry is still a memory” philosophy | Resolves to: same memory schema across all kinds; surface variation but data uniformity |
Layered architecture doesn’t preempt any of those decisions. The discussion doc’s open questions remain open. But anything that lands from the discussion doc lands in a clean architectural slot under this framework.
10. Decisions captured here (to formalize in ADRs)
These are the decisions emerging from the session. Each becomes (or extends) an ADR:
| # | Decision | Owning ADR |
|---|---|---|
| 1 | AI vendor strategy is best-of-breed per task; ARCIVE owns the memory + retrieval + auto-correlation moat | ADR-0011 (new — to draft) |
| 2 | Each layer is classified strategic / commodity / ARCIVE-owned; commodity-layer vendor swaps don’t need new ADRs | ADR-0011 |
| 3 | Stage 2 summarize uses Gemini Flash via AI Studio (free tier) primary, Anthropic Haiku fallback — already coded, just needs the key | ADR-0011 (vendor pick within commodity-layer rules) |
| 4 | Stage 4 voice stays composed on resumption (Pipecat + ElevenLabs v3 / Sesame CSM); never speech-to-speech | Reaffirms ADR-0002, ADR-0006, ADR-0010 |
| 5 | Stage 6 unified is redefined as Talk + Chat surfaces sharing memory store, not single-model | New section in ADR-0011 |
| 6 | Multimodal embedding swap (Voyage Multimodal-3 / Cohere Embed v4) is the next high-leverage commodity-layer evaluation | Future ADR-0012 (when scoped) |
| 7 | MCP-write surface moves forward to V0.2 per multimodal_expansion discussion | Future ADR if accepted |
10b. Groq’s role across the layer model
Groq is currently a single-layer vendor for ARCIVE — Layer 2 (Transcribe / Whisper) — but is capable across more layers. Documenting the surface here so future readers don’t re-litigate it. (Also captured in ADR-0011 Notes.)
| Groq capability | Layer | Status for ARCIVE |
|---|---|---|
| Whisper STT (`whisper-large-v3-turbo`) | 2 — Transcribe | ✅ in use today |
| Llama 3.3 70B / Llama 4 Scout / Maverick | 3, 5, 6 | available; not primary; Llama 3.3 70B added as Layer 3 third fallback below Anthropic |
| DeepSeek-R1-distill (reasoning) | 6 | available; budget option for future freemium chat |
| Mixtral / Qwen / Kimi K2 / GPT-OSS | 3 | available; redundant with Gemini |
| TTS | 8 | ❌ Groq doesn’t ship |
| Speech-to-speech | 9 | ❌ Groq doesn’t ship |
| Embeddings | 4 | ❌ Groq doesn’t ship |
Why Gemini won the Layer 3 primary pick over Groq Llama: multimodal coverage including audio (text + audio + image + PDF + video) future-proofs Stage 5 multimodal share. Llama 4 Scout is text + image only. Groq is fast, free, and competent; the deciding factor was multimodal coverage.
Where Groq is added now: Layer 3 fallback chain becomes Gemini → Anthropic → Groq Llama 3.3 70B → no-op. Reuses the existing GROQ_API_KEY — no new vendor relationship — and provides resilience if Gemini AND Anthropic both have transient outages.
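The chain described above reduces to a first-success loop. A minimal sketch, with stub provider functions standing in for the real API clients (the actual summarize step's interface may differ):

```typescript
// First-success fallback chain: Gemini → Anthropic → Groq → no-op.
type Summarizer = (transcript: string) => string;

function withFallback(chain: Array<[string, Summarizer]>): Summarizer {
  return (transcript) => {
    for (const [_name, summarize] of chain) {
      try {
        return summarize(transcript);
      } catch {
        // transient outage: fall through to the next provider
      }
    }
    return ""; // no-op tail: memory stays un-summarized, backfill retries later
  };
}

// Stub providers: two outages, then a working Groq client.
const down: Summarizer = () => { throw new Error("503"); };
const groq: Summarizer = (t) => `summary via groq: ${t.slice(0, 10)}`;

const summarize = withFallback([["gemini", down], ["anthropic", down], ["groq", groq]]);
```

Note the no-op tail is a deliberate design choice: summarize is enrichment, not capture, so a fully-down chain degrades to "memory saved, summary pending" rather than a failed write.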
Future Groq plays available within the framework (no new ADR needed):
- Groq Llama as freemium chat fallback (Layer 6) at V1.0+ when free-tier scaling matters
- Groq for any synchronous-summarize UX moment (Groq is ~10× faster inference than Gemini)
- DeepSeek-R1-distill on Groq for reasoning-heavy tasks (e.g., auto-correlation per multimodal expansion)
Plays not allowed within the framework: extending Groq into Layers 4 / 7 / 8 / 9 — Groq doesn’t ship those capabilities; pretending otherwise means self-hosting, which is out of scope until V2.x local-only platform variant.
11. Open questions
- Quality eval — should we run a 50-transcript shootout on GitHub Models (free, 30+ models, single token) to verify Gemini Flash quality vs Haiku 4.5 on real ARCIVE transcripts before committing? Probably yes, half-day, but not gating the backfill PR.
- When does Stage 3 chat get re-evaluated? Today Claude Agent SDK is the right pick on quality + tool-use grounds. The trigger to revisit is either (a) the discussion-doc multimodal expansion makes Stage 5 vision a chat-layer concern, or (b) cost at scale crosses some threshold.
- B2B EU data residency — when a customer requires EU-only data, which layer needs the swap? Likely Understand (Mistral La Plateforme), possibly Embed (Cohere has EU-hosted), Reason (Mistral or Anthropic EU when available). Not urgent for V0.3; relevant for V1.0 SOC 2 prep.
- Failover testing — the framework assumes commodity-layer fallbacks work. Have we actually exercised them? Summarize-step has Anthropic→Gemini→no-op coded but never tested with the primary down. Worth a chaos test once the Gemini key is live.
- Voice resumption trigger — per ADR-0010 the resumption depends on EAS + voice-fidelity-on-speech-to-speech maturity. The ElevenLabs Conversational AI managed platform (BYO LLM, ElevenLabs TTS) is an underweighted middle option. Should ADR-0010’s resumption checklist add it as a third path?
- Layer 6 (Reason) class assignment — currently labeled “semi-strategic”. Is that a stable classification or a temporary one until quality bars settle?
12. What ships from this session
In order:
1. `feat(pipeline): summarize backfill via Gemini Flash + Groq third fallback` — set `GEMINI_API_KEY`, flip provider order (Gemini primary, Anthropic fallback, Groq Llama 3.3 70B third fallback), add `backfill-summaries` Edge Function enqueueing `summarize` jobs for null-summary memories. All commodity-layer picks within the framework rules.
2. `docs: ADR-0011 — AI vendor strategy (best-of-breed per task)` — formalizes the framework + the layer classification. References this discussion doc.
3. (Future) Multimodal embedding swap ADR + scoping when ready. Discussion doc’s #1 leverage move.
4. (Future) ADR-0010 resumption checklist update if ElevenLabs Conversational AI is added as a third voice path.
13. The TL;DR for someone walking in cold
ARCIVE composes specialized AI capabilities around a memory store. The product moat is the memory store + retrieval + auto-correlation, not any AI capability. AI vendors are commodities at most layers (transcribe, understand, embed, hear) and strategic at a few (speak, reason, retrieve). Pick best-of-breed per task with explicit fallback. Don’t consolidate to one vendor — that’s a Big-Tech aspiration, not a shipping reality outside the frontier labs. Voice fidelity is strategic (ADR-0006), so the voice loop stays composed forever. Stage 6 “talk + chat together” ships as two surfaces sharing the memory store, not one unified model. Today’s work: Gemini Flash for summarize (free, multimodal-ready, already coded as fallback), backfill old memories.