ARCIVE AI Strategy & Architecture — Working Session Notes

Date: 2026-05-05
Format: Working session notes — strategic, fed into ADR-0011.
Outcome: ADR-0011 — AI vendor strategy: best-of-breed per task, ARCIVE owns the moat (Accepted)
Companion: 2026-05-04_multimodal_expansion.md — the multimodal expansion discussion this builds on


Why this session happened

The trigger was small: pick the next thing to build. We started with “voice talk-back on mobile (#1 priority)”, spiked it, found Pipecat-in-Expo-Go was structurally blocked, and wrote ADR-0010 deferring the whole voice feature.

We then pivoted to the next item — summarize and topics generation, plus backfill for older memories. That triggered the question “which API are we using? Is there a consolidated place?”, which expanded into “what about free options?”, “have you considered OpenAI / Microsoft?”, “isn’t this fragmented?” — and ultimately “step back and think as an architect — how are current players actually doing this?”

This doc captures the architectural reasoning that emerged. The conclusion isn’t a vendor pick; it’s a framework for picking vendors per task, plus a clear-eyed look at what ARCIVE actually owns and what it doesn’t.


1. The staged AI pipeline ARCIVE is building

User framing surfaced this six-stage progression — the actual roadmap, not just a wish list:

Stage 1 — Transcribe (audio → text) ✓ shipped (Groq Whisper)
Stage 2 — Summarize + topics (text → JSON) ⚠ scaffolded, missing API key
Stage 3 — Text chat (text + retrieval → response) ✓ shipped (Claude Agent SDK + MCP)
Stage 4 — Voice conversational (real-time speech loop) ⏸ paused — ADR-0010
Stage 5 — Share multimodal (image / link / PDF / video) □ exploratory — see multimodal_expansion
Stage 6 — Unified (talk + chat) (real-time multimodal) □ V1.0+

Each stage adds a new modality or interaction shape on top of the same underlying memory store. The vendor question isn’t “best LLM for summarize” — it’s “what vendor strategy scales coherently across all 6 stages.”


2. What ARCIVE has already committed to (the constraints)

Six prior ADRs frame this strategy. Skipping any of them would mean overruling existing architectural intent:

| ADR | Commitment | Implication for vendor strategy |
|---|---|---|
| 0002 — Pipecat over OpenAI Realtime | Composed STT→LLM→TTS pipeline, not bundled speech-to-speech | Stage 4 voice is composed, not unified |
| 0003 — Swappable agent driver layer | AgentDriver interface allows replacing the chat brain | Stage 3 LLM can be swapped without rewriting /api/chat callers |
| 0004 — MCP as separate service | Memory retrieval lives in its own MCP server | Retrieval is independent of any model vendor |
| 0006 — Voice naturalness via swappable TTS | TTS is an upgrade path: Cartesia → ElevenLabs v3 → Sesame CSM. Per-role voices, audio tags, cloning | Voice fidelity is strategic, not commodity. Forces a composed pipeline at Stage 4 |
| 0007 — Consent gate on retrieval | Agents see consent-scoped memories only | Retrieval layer enforces the policy, not the model |
| 0010 — Defer voice talk-back | Stage 4 paused until EAS unlocks + speech-to-speech-vs-composed settles | Voice-related vendor picks are deferred, not abandoned |

ADR-0006 is the load-bearing constraint for this whole conversation. End-to-end speech-to-speech models cannot deliver swappable TTS, voice cloning, or audio tags. That kills any “consolidate to one vendor across all 6 stages” strategy on Stage 4.
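
For orientation, here is roughly what ADR-0003’s driver layer implies in code — a minimal sketch, assuming TypeScript and hypothetical names (`AgentDriver`, `ClaudeAgentDriver` are illustrative, not the actual ARCIVE identifiers):

```typescript
// Hypothetical shape of ADR-0003's swappable driver layer.
// /api/chat callers depend only on this interface, never on a vendor SDK.
interface AgentDriver {
  /** Stream a reply for one user turn, using consent-scoped retrieval. */
  chat(input: { message: string; sessionId: string }): AsyncIterable<string>;
}

class ClaudeAgentDriver implements AgentDriver {
  async *chat(input: { message: string; sessionId: string }) {
    // Claude Agent SDK + MCP retrieval would be wired up here (Stage 3 today).
    yield "...";
  }
}

// Swapping the chat brain = swapping this one binding, per ADR-0003.
const driver: AgentDriver = new ClaudeAgentDriver();
```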


3. The vendor menu (free + paid, multimodal + text-only)

Full landscape as of 2026-05-05. Costs per 1M tokens at published rates. “Forever-free” means a sustained free tier, not a trial credit.

| Vendor / Model | $/M in | $/M out | Forever-free | Multimodal | Notes |
|---|---|---|---|---|---|
| Google Gemini 2.5 Flash | $0.075 | $0.30 | AI Studio: 1M tok/day, 15 RPM | text + audio + image + PDF + video | Most expansive multimodal; native audio in |
| Google Gemini 2.5 Flash-Lite | $0.025 | $0.10 | Same AI Studio | text + image | Cheapest paid Gemini |
| OpenAI GPT-5-nano | $0.05 | $0.20 | $5 trial credit only | text + vision | Cheap, no free-forever; native MCP |
| OpenAI GPT-5-mini | $0.40 | $2.00 | $5 trial credit | text + vision; Realtime audio | Native MCP across Apps/Agents/Realtime SDKs |
| Anthropic Haiku 4.5 | $1.00 | $5.00 | $5 trial credit | text + vision (no audio) | What summarize-step uses today; ~10× Gemini cost |
| Anthropic Sonnet 4.6 | $3.00 | $15.00 | $5 trial credit | text + vision | What /api/chat Agent SDK uses |
| Groq Llama 4 Scout | — | — | Generous free tier (existing GROQ_API_KEY) | text + image | Fastest inference on the market |
| Groq Llama 3.3 70B | — | — | Same free tier | text only | Solid quality, free |
| Cloudflare Workers AI | — | — | 10k req/day free | text + some vision | Edge-deployed; low latency |
| GitHub Models | — | — | 50 req/day per model, ~30 models, free with GitHub account | varies | Single token gives Claude + GPT + Llama + Phi + Mistral. Eval gold; 50/day cap kills prod use |
| Microsoft Phi-4-multimodal | self-host GPU cost | — | Free weights (MIT) | text + audio + vision | OSS option for V2.x local-only platform variant |
| DeepSeek V3.1 | $0.27 | $1.10 | Free `:free` variant on OpenRouter | text only | Very cheap, surprising quality |
| Mistral Small 3 / Pixtral | $0.20 | $0.60 | Free tier on La Plateforme | Pixtral multimodal | EU data residency (matters for V1.0 SOC 2 / EU users) |
| Cohere Command R+ | $2.50 | $10.00 | Free trial only | text + vision | Embed v4 is the more interesting Cohere product for ARCIVE (see multimodal_expansion) |
| OpenRouter | varies | varies | Several `:free` OSS models | varies | One key, ~300 models — but routing overhead |
| Azure OpenAI | OpenAI prices + Azure premium | — | None | OpenAI lineup | Only relevant for SOC 2 / enterprise (V1.0+) |

Voice-specific (Stage 4):

| Vendor / Model | Type | Voice cloning per role? | Audio tags / emotion control? | Latency | Notes |
|---|---|---|---|---|---|
| Cartesia Sonic | TTS | partial | | <100ms first chunk | Current TTS in voice-talkback (paused) |
| ElevenLabs v3 | TTS | ✅ | best-in-class ([laughs], [whispers]) | ~200ms | ADR-0006 step 2 upgrade target |
| Sesame CSM | TTS (open weights) | partial | | varies | ADR-0006 step 2 alt; OSS path |
| Microsoft VibeVoice | TTS (MIT, Aug 2025) | partial | | varies | Open-source long-form TTS |
| Deepgram Nova-3 | streaming STT | n/a | n/a | ~100ms | Current STT for voice-talkback |
| OpenAI Realtime API | speech-to-speech | | | 300-500ms | 8 fixed voices; kills ADR-0006 |
| Google Gemini Live | speech-to-speech | | ⚠ primitive | 300-500ms | ~30 fixed voices; kills ADR-0006 |
| AWS Nova Sonic | speech-to-speech | | | varies | AWS-native; not in our gravity well |
| Kyutai Moshi | speech-to-speech (open) | n/a | partial | low | Single-model full-duplex |
| ElevenLabs Conversational AI | composed managed platform | | | ~500ms | Bring-your-own-LLM — wraps ElevenLabs TTS + Deepgram-equivalent STT + your LLM + tool-use orchestration. Worth re-evaluating on Stage 4 resumption |
| Hume EVI 3 | bundled emotion-aware | partial | ✅ emotion-native | varies | Niche; relevant for Caregiver use case |
| Pipecat Cloud / Daily | managed Pipecat | ✅ (TTS-dependent) | ✅ (TTS-dependent) | composed latency | Same architecture as current, hosted |

Embedding-specific (Stage 5):

| Vendor / Model | Modality | Cost | Notes |
|---|---|---|---|
| Voyage-3-lite | text only | free tier 50M tok | Current. 512-d. |
| Voyage Multimodal-3 | text + image | paid | Multimodal expansion #1 leverage move |
| Cohere Embed v4 | text + image | paid | Alternative; eval needed |
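
Because Embed is commodity-class (see §7), the swap the companion doc calls the #1 leverage move should be a rebind plus a re-embed backfill, not a refactor. A minimal sketch of that shape — `Embedder`, `voyageLite`, and the request wrapper are illustrative assumptions, not ARCIVE code:

```typescript
// Commodity-class embed layer behind one interface: swapping Voyage-3-lite
// for a multimodal model changes one binding (plus a re-embed job).
interface Embedder {
  readonly dimensions: number;
  embed(input: { text?: string; imageUrl?: string }): Promise<number[]>;
}

const voyageLite: Embedder = {
  dimensions: 512, // matches the 512-d column noted above
  async embed({ text }) {
    if (!text) throw new Error("voyage-3-lite is text-only");
    const res = await fetch("https://api.voyageai.com/v1/embeddings", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "voyage-3-lite", input: [text] }),
    });
    const json = await res.json();
    return json.data[0].embedding;
  },
};

// A Voyage Multimodal-3 / Cohere Embed v4 implementation would accept
// imageUrl too — same interface, different binding.
```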

4. ARCIVE summarize traffic — what cost actually means

A typical memory: ~3K input tokens (transcript), ~200 output tokens. Estimated ARCIVE volume scenarios:

| Scenario | Memories/day | Tokens/day (in + out) | Anthropic Haiku 4.5 | Gemini 2.5 Flash paid | OpenAI GPT-5-nano | Groq / Gemini AI Studio (free) |
|---|---|---|---|---|---|---|
| Personal use | 10 | 30K + 2K | $0.04/mo | $0.003/mo | $0.002/mo | $0 |
| Early prod | 1,000 | 3M + 200K | $39/mo | $2.50/mo | $1.65/mo | $0 (within quota) |
| Mid scale | 10,000 | 30M + 2M | $390/mo | $25/mo | $16/mo | Likely over free quota → paid |
| Large scale | 100,000 | 300M + 20M | $3,900/mo | $250/mo | $160/mo | Paid |

Cost gradient is real but not catastrophic until the 10K+/day mark. All non-Anthropic options are <$30/mo at early prod scale. The real cost question is: at what scale does Anthropic’s premium stop being worth it for commodity tasks like summarize?

For Stage 3 chat (Sonnet 4.6, more expensive), the per-turn cost matters more — but the quality bar is also higher there.
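
For unit-cost intuition, the raw per-memory arithmetic at the listed rates — a quick sketch; note the monthly figures in the table above may bake in assumptions (batching, quota mixes) not reproduced here:

```typescript
// Back-of-envelope unit cost for one summarize call (~3K in, ~200 out).
// Rates are the $/1M-token figures from the vendor menu in §3.
function costPerMemory(ratesPerMTok: { in: number; out: number }): number {
  const IN_TOK = 3_000;
  const OUT_TOK = 200;
  return (IN_TOK / 1e6) * ratesPerMTok.in + (OUT_TOK / 1e6) * ratesPerMTok.out;
}

costPerMemory({ in: 1.0, out: 5.0 });   // Haiku 4.5     → $0.004 per memory
costPerMemory({ in: 0.075, out: 0.3 }); // Gemini Flash  → ~$0.000285 (≈14× cheaper)
costPerMemory({ in: 0.05, out: 0.2 });  // GPT-5-nano    → ~$0.00019
```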


5. How credible players actually compose AI services

The unlock in this conversation was looking outside ARCIVE: nobody in adjacent product spaces consolidates to one vendor. The pattern across the industry:

| Player | STT | LLM (text) | LLM (chat) | TTS / Voice | Strategy |
|---|---|---|---|---|---|
| Notion AI | n/a | OpenAI + Anthropic | OpenAI + Anthropic | n/a | Multi-vendor LLM, picks per task |
| Granola | OpenAI Whisper / Deepgram | OpenAI / Anthropic | OpenAI / Anthropic | n/a | Best-of-breed; doesn’t ship voice |
| Otter.ai | Their own (trained corpus) | OpenAI (rumored) | n/a | n/a | Owns STT (the moat); commoditizes the rest |
| Limitless / Plaud / Bee | Whisper / Deepgram | OpenAI / Anthropic | minimal | n/a | Audio capture + summary; no voice talk-back |
| Rewind / Personal.ai | Whisper variants | OpenAI / Anthropic | OpenAI / Anthropic | n/a | Same shape |
| ElevenLabs Conversational AI | Deepgram-eq | bring your own | bring your own | ElevenLabs (own) | Owns voice fidelity; commoditizes everything else |
| Sesame (companion) | own | | | own CSM (open-sourced) | Owns voice quality specifically |
| Hume EVI 3 | own | own | own | own | Vertical: emotion-aware voice |
| Apple Intelligence / Siri | Apple | Apple + OpenAI fallback | Apple + OpenAI fallback | Apple | Multi-vendor, routes per task |
| OpenAI / Anthropic / Google | own | own | own | own | They are the vendors — vertically integrated by definition |

Key observations:

  1. Nobody outside frontier labs uses one vendor for everything. Even Apple — the most vertically integrated company there is — uses OpenAI as fallback for tasks Apple Intelligence doesn’t handle.
  2. STT and TTS are routinely separated from LLM, even when unified models exist. Reasons: cost, task-specific quality, fallback resilience.
  3. Players that own a layer end-to-end (Otter on STT, ElevenLabs on TTS, Sesame on voice) own it because that’s their moat. They commoditize everything else.
  4. Best-of-breed per task is the operating norm. Single-vendor consolidation is a Big-Tech sales pitch, not a shipping reality.
  5. Frontier models commoditize the “smart” layer. Differentiation moves up the stack: data, retrieval, UX, brand.

6. Two architectural shapes considered (and why we reject both)

Shape A — Consolidate to one vendor across all 6 stages

Pick one of {Gemini, OpenAI} and run everything through them. Pros: single billing relationship, single SDK, vendor ergonomics, possible cost discounts at scale.

Why we reject this:

  • Forces dropping Claude Agent SDK (Stage 3) — quality regression risk + ADR-0003 driver layer rewrite
  • Forces dropping ADR-0006’s TTS swappability at Stage 4 — speech-to-speech models can’t do per-role voice cloning, audio tags, or prosody control
  • Locks ARCIVE to one vendor’s roadmap, pricing, and outage profile
  • Doesn’t match how any credible player in adjacent spaces actually operates
  • Premature optimization for unification that doesn’t exist yet (Stage 6 isn’t here)

Shape B — Decouple text-LLM from voice (Path D in earlier discussion)

Lean toward Gemini for text-bearing stages (2, 5, eventually 3); keep voice composed per ADR-0006.

Why we reject this too:

  • Implies “Gemini wins the text side” — invites awkward exceptions when Anthropic, Mistral, or DeepSeek ships a better text model
  • Doesn’t distinguish strategic from commodity layers — treats all text-LLM choices as one decision
  • Doesn’t ground the decision in industry practice
  • Creates a vendor lean that has to be walked back when the layer-specific situation changes (e.g., EU customers need EU data residency → Mistral, not Gemini)

7. Shape C — Layered architecture, best-of-breed per task (chosen)

ARCIVE is a system that composes specialized AI capabilities around a memory store. Layers, not pipelines. Vendors are commodities at most layers, strategic at a few.

The canonical layer model is in ADR-0011 and reproduced in 01_SOFTWARE_PLAN.md §1.4. This section captures the reasoning that produced it; the ADR is the immutable reference. If the layer numbers ever need to change, change them in ADR-0011 first.

The 11 layers, summarized for context:

| # | Layer | Class | Today’s vendor |
|---|---|---|---|
| 1 | Capture | ARCIVE-owned | recorders + audio-transcode |
| 2 | Transcribe | commodity | Groq Whisper |
| 3 | Understand | commodity | Gemini Flash → Anthropic Haiku → Groq Llama (fallback chain) |
| 4 | Embed | commodity | Voyage-3-lite |
| 5 | Retrieve (MOAT) | ARCIVE-owned | MCP + pgvector + tsv + edges + consent |
| 6 | Reason | semi-strategic | Claude Agent SDK |
| 7 | Hear | commodity | Deepgram Nova-3 (paused per ADR-0010) |
| 8 | Speak | strategic (ADR-0006) | Cartesia → ElevenLabs v3 / Sesame CSM (paused per ADR-0010) |
| 9 | Voice orchestration | strategic (ADR-0002) | Pipecat (paused per ADR-0010); composed on resumption |
| 10 | Auto-correlate (MOAT) | ARCIVE-owned | edges job; multimodal expansion future |
| 11 | Surfaces | ARCIVE-owned | web + mobile + MCP server |

Three classes of layer, three procurement rules

| Class | Layers | Procurement rule |
|---|---|---|
| ARCIVE-owned | Capture, Retrieve, Auto-correlate, Surfaces | Build internally. This is the moat. Don’t outsource. |
| Strategic AI | Speak (ADR-0006), Reason (semi) | Pick best-in-class for the strategic axis. Justify every change in an ADR. |
| Commodity AI | Transcribe, Understand, Embed, Hear | Pick the cheapest acceptable vendor with a coded fallback. Swap freely as the market shifts. No ADR needed for vendor swaps within this class. |
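
One way to keep the procurement rule mechanical rather than tribal knowledge — a hypothetical registry sketch (the names and the `vendorSwapNeedsAdr` helper are illustrative, not existing ARCIVE code):

```typescript
// The three classes, encoded. Commodity layers swap vendors freely;
// everything else either requires an ADR or is never outsourced at all.
type LayerClass = "arcive-owned" | "strategic" | "commodity";

const LAYERS = {
  capture: "arcive-owned",
  transcribe: "commodity",
  understand: "commodity",
  embed: "commodity",
  retrieve: "arcive-owned",
  reason: "strategic", // "semi-strategic" today; see open questions in §11
  hear: "commodity",
  speak: "strategic", // ADR-0006
  voiceOrchestration: "strategic", // ADR-0002
  autoCorrelate: "arcive-owned",
  surfaces: "arcive-owned",
} as const satisfies Record<string, LayerClass>;

// The procurement rule, mechanized: only commodity-layer swaps ship without an ADR.
function vendorSwapNeedsAdr(layer: keyof typeof LAYERS): boolean {
  return LAYERS[layer] !== "commodity";
}
```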

How this resolves each stage of the pipeline

| Stage | Layer mapping | Vendor decision frame |
|---|---|---|
| 1 — Transcribe | Layer 2 (commodity) | Cheapest acceptable STT; Groq Whisper today, Deepgram if streaming, OpenAI Whisper if quality |
| 2 — Summarize + topics | Layer 3 (commodity) | Cheapest acceptable text-LLM; Gemini Flash today (free tier, multimodal-ready); fallback Anthropic |
| 3 — Text chat | Layer 6 (semi-strategic) | Claude Agent SDK today; re-eval at Stage 5/6 pressure, not on consolidation pressure |
| 4 — Voice | Layers 7+8+9 (strategic — voice fidelity) | Composed pipeline, ElevenLabs v3 / Sesame CSM TTS. Never speech-to-speech. Paused per ADR-0010 |
| 5 — Multimodal share | Layer 3 (per-kind LLM) + Layer 4 (multimodal embed) | Per-kind: image = Sonnet/Gemini vision, PDF = text-extract + LLM, link = fetch + LLM. Embedding swap is the leverage move |
| 6 — Unified | Redefined: two surfaces (Talk + Chat) sharing the memory store, not literally one model | Don’t try to literally unify. Accept Talk/Chat as separate UIs over the same MCP retrieval layer |

Why Stage 6 is redefined, not deferred

The earlier framing said “defer Stage 6 until speech-to-speech voice fidelity matures.” Layered says: the unification was never the goal; the experience was. A user who can talk or type to ARCIVE about their memories, with the same retrieval and memory state, gets the experience without ARCIVE adopting a single-model architecture that breaks ADR-0006. Two surfaces, one memory store.

This matches how every credible memory product ships: same data, multiple input modalities, no model trying to be all things. If a future speech-to-speech model gains true TTS swappability, voice cloning, and audio tags, revisit. Until then, the layered Talk + Chat shape ships sooner and is more durable.
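
In miniature, the “two surfaces, one memory store” shape looks like this — every function name below is a hypothetical stand-in for the real layers, but the point survives: both surfaces call the same consent-scoped retrieval and the same Reason layer, so no unified model is required:

```typescript
// Stubs standing in for the real layers (hypothetical names throughout).
declare function transcribe(audio: ArrayBuffer): Promise<string>;        // Layer 7 (Hear)
declare function reasonLLM(msg: string, ctx: string[]): Promise<string>; // Layer 6 (Reason)
declare function synthesizeSpeech(text: string): Promise<ArrayBuffer>;   // Layer 8 (Speak)
declare const mcpClient: {
  callTool(name: string, args: object): Promise<string[]>;
};

// Both surfaces resolve to the same consent-scoped MCP retrieval (ADR-0004/0007).
async function retrieveMemories(query: string): Promise<string[]> {
  return mcpClient.callTool("search_memories", { query });
}

// Chat surface: text in → retrieval → Reason → text out.
async function chatTurn(message: string): Promise<string> {
  const context = await retrieveMemories(message);
  return reasonLLM(message, context);
}

// Talk surface (on resumption): Hear → the same brain + memory state → Speak.
async function talkTurn(audio: ArrayBuffer): Promise<ArrayBuffer> {
  const text = await transcribe(audio);
  const reply = await chatTurn(text); // identical retrieval, identical memory
  return synthesizeSpeech(reply);
}
```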


8. Comparing the options I considered along the way

For the record (and for the next person re-litigating this):

| Path | Frame | Code change today | Long-term shape | Why rejected (or chosen) |
|---|---|---|---|---|
| Path A | Gemini unified across all 6 stages | Same | Stage 4 voice on Gemini Live | Rejected — kills ADR-0006 voice fidelity |
| Path B | OpenAI everything | Same + replace Claude/Whisper | Stage 4 on OpenAI Realtime | Rejected — same ADR-0006 kill + bigger migration cost + no free tier |
| Path C | Composed throughout, multi-vendor | Same | No literal Stage 6 unification | Half-right — preserves voice fidelity but doesn’t articulate the framework |
| Path D | Decouple text-LLM (Gemini lean) from voice (composed) | Same | Stage 4 composed, text leans Gemini | Half-right — same operational outcome as Layered, but invites “why Gemini?” objections |
| Path E | Path A with eyes open — bet voice fidelity tools converge | Same | Same as A, with monitoring | Rejected — a timing bet on a market trajectory, not a strategy |
| Layered (chosen) | Best-of-breed per task; ARCIVE owns the moat | Same | Two surfaces, shared memory; layer-by-layer vendor choice | Chosen — matches industry practice, preserves all prior ADRs, durable to vendor changes |

Net: today’s code change is essentially identical across the paths (only Path B adds migration work). The choice is about the frame that governs every future per-task decision. Layered is the most durable frame because it doesn’t commit ARCIVE to any vendor’s roadmap.


9. Connection to the multimodal expansion discussion

The companion doc 2026-05-04_multimodal_expansion.md lays out:

  • Schema generalization: kind enum, polymorphic assets table, reuse of transcript+embedding+summary as universal text/vector layer
  • Universal /ingest endpoint with per-kind dispatcher
  • Surfaces: PWA share-target, Chrome MV3 extension, iOS share extension, email-in, MCP-write
  • The key realization: MCP-first is a better primitive than share extensions — pull MCP-write forward to V0.2 so Claude Desktop / ChatGPT / Apple Intelligence become the share surface
  • The half-day high-leverage move: multimodal embedding swap (Voyage Multimodal-3 / Cohere Embed v4) replaces caption-then-embed
  • Brand-aligned philosophy: every shared item becomes an audio memory with attachment (or auto-correlation — the version only ARCIVE could ship)

This Layered architecture maps cleanly:

| Multimodal expansion concept | Layered equivalent |
|---|---|
| Schema generalization (kind enum, assets table) | Capture layer extension |
| Universal /ingest per-kind dispatcher | Surfaces layer + per-kind Understand/Embed routing |
| MCP-write forward to V0.2 | Surfaces layer (MCP-as-output) |
| Multimodal embedding swap | Embed layer vendor swap (commodity-class procurement) |
| Auto-correlation differentiator | Auto-correlate layer (ARCIVE-owned moat) |
| “Every entry is still a memory” philosophy | Same memory schema across all kinds; surfaces vary, data stays uniform |

Layered architecture doesn’t preempt any of those decisions. The discussion doc’s open questions remain open. But anything that lands from the discussion doc lands in a clean architectural slot under this framework.
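
To make “clean architectural slot” concrete: the companion doc’s per-kind dispatcher would route each shared kind through the right commodity layers and converge on the same memory shape. A sketch, with every helper name hypothetical:

```typescript
// Universal /ingest per-kind dispatcher (sketch; all helpers are stand-ins).
// Every kind converges on the same memory shape: text + summary + embedding.
type Kind = "audio" | "image" | "link" | "pdf" | "video";

declare function transcribe(url: string): Promise<string>;       // Layer 2
declare function describeImage(url: string): Promise<string>;    // Layer 3, vision
declare function fetchAndExtract(url: string): Promise<string>;  // fetch + LLM
declare function extractPdfText(url: string): Promise<string>;   // text-extract + LLM
declare function summarize(text: string): Promise<string>;       // Layer 3
declare function embed(text: string): Promise<number[]>;         // Layer 4
declare function saveMemory(memory: object): Promise<void>;

async function ingest(kind: Kind, payload: { url: string }): Promise<void> {
  let text: string;
  switch (kind) {
    case "audio": text = await transcribe(payload.url); break;
    case "image": text = await describeImage(payload.url); break;
    case "link":  text = await fetchAndExtract(payload.url); break;
    case "pdf":   text = await extractPdfText(payload.url); break;
    case "video": text = await transcribe(payload.url); break; // audio track first
    default: throw new Error("unknown kind");
  }
  const summary = await summarize(text); // Layer 3 (Understand)
  const vector = await embed(text);      // Layer 4 (Embed)
  await saveMemory({ kind, text, summary, vector });
}
```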


10. Decisions captured here (to formalize in ADRs)

These are the decisions emerging from the session. Each becomes (or extends) an ADR:

| # | Decision | Owning ADR |
|---|---|---|
| 1 | AI vendor strategy is best-of-breed per task; ARCIVE owns the memory + retrieval + auto-correlation moat | ADR-0011 (new — to draft) |
| 2 | Each layer is classified strategic / commodity / ARCIVE-owned; commodity-layer vendor swaps don’t need new ADRs | ADR-0011 |
| 3 | Stage 2 summarize uses Gemini Flash via AI Studio (free tier) primary, Anthropic Haiku fallback — already coded, just needs the key | ADR-0011 (vendor pick within commodity-layer rules) |
| 4 | Stage 4 voice stays composed on resumption (Pipecat + ElevenLabs v3 / Sesame CSM); never speech-to-speech | Reaffirms ADR-0002, ADR-0006, ADR-0010 |
| 5 | Stage 6 unified is redefined as Talk + Chat surfaces sharing the memory store, not single-model | New section in ADR-0011 |
| 6 | Multimodal embedding swap (Voyage Multimodal-3 / Cohere Embed v4) is the next high-leverage commodity-layer evaluation | Future ADR-0012 (when scoped) |
| 7 | MCP-write surface moves forward to V0.2 per multimodal_expansion discussion | Future ADR if accepted |

10b. Groq’s role across the layer model

Groq is currently a single-layer vendor for ARCIVE — Layer 2 (Transcribe / Whisper) — but is capable across more layers. Documenting the surface here so future readers don’t re-litigate it. (Also captured in ADR-0011 Notes.)

| Groq capability | Layer | Status for ARCIVE |
|---|---|---|
| Whisper STT (whisper-large-v3-turbo) | 2 — Transcribe | ✅ in use today |
| Llama 3.3 70B / Llama 4 Scout / Maverick | 3, 5, 6 | Available; not primary. Llama 3.3 70B added as Layer 3 third fallback below Anthropic |
| DeepSeek-R1-distill (reasoning) | 6 | Available; budget option for future freemium chat |
| Mixtral / Qwen / Kimi K2 / GPT-OSS | 3 | Available; redundant with Gemini |
| TTS | 8 | ❌ Groq doesn’t ship |
| Speech-to-speech | 9 | ❌ Groq doesn’t ship |
| Embeddings | 4 | ❌ Groq doesn’t ship |

Why Gemini won the Layer 3 primary pick over Groq Llama: multimodal coverage including audio (text + audio + image + PDF + video) future-proofs Stage 5 multimodal share, while Llama 4 Scout is text + image only. Groq is fast, free, and competent; the call came down to multimodal coverage.

Where Groq is added now: Layer 3 fallback chain becomes Gemini → Anthropic → Groq Llama 3.3 70B → no-op. Reuses the existing GROQ_API_KEY — no new vendor relationship — and provides resilience if Gemini AND Anthropic both have transient outages.
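
Spelled out as code, the chain is a loop over providers with an explicit no-op tail — a sketch, assuming thin per-vendor wrapper functions (all names illustrative):

```typescript
// Layer 3 (Understand) commodity fallback chain:
// Gemini → Anthropic → Groq Llama 3.3 70B → no-op.
type Summarizer = (transcript: string) => Promise<{ summary: string; topics: string[] }>;

// Thin per-vendor wrappers, each returning summary JSON (stand-ins).
declare const geminiFlash: Summarizer;
declare const anthropicHaiku: Summarizer;
declare const groqLlama70b: Summarizer;

const CHAIN: Array<[string, Summarizer]> = [
  ["gemini-2.5-flash", geminiFlash],
  ["claude-haiku-4.5", anthropicHaiku],
  ["llama-3.3-70b (groq)", groqLlama70b], // reuses the existing GROQ_API_KEY
];

async function summarizeWithFallback(transcript: string) {
  for (const [name, run] of CHAIN) {
    try {
      return await run(transcript);
    } catch (err) {
      console.warn(`summarize: ${name} failed, falling back`, err);
    }
  }
  return null; // no-op tail: leave summary null; the backfill job retries later
}
```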

Future Groq plays available within the framework (no new ADR needed):

  • Groq Llama as freemium chat fallback (Layer 6) at V1.0+ when free-tier scaling matters
  • Groq for any synchronous-summarize UX moment (Groq inference is ~10× faster than Gemini)
  • DeepSeek-R1-distill on Groq for reasoning-heavy tasks (e.g., auto-correlation per multimodal expansion)

Plays not allowed within the framework: extending Groq into Layers 4 / 7 / 8 / 9 — Groq doesn’t ship those capabilities; pretending otherwise means self-hosting, which is out of scope until V2.x local-only platform variant.


11. Open questions

  • Quality eval — should we run a 50-transcript shootout on GitHub Models (free, 30+ models, single token) to verify Gemini Flash quality vs Haiku 4.5 on real ARCIVE transcripts before committing? Probably yes, half-day, but not gating the backfill PR.
  • When does Stage 3 chat get re-evaluated? Today Claude Agent SDK is the right pick on quality + tool-use grounds. The trigger to revisit is either (a) the discussion-doc multimodal expansion makes Stage 5 vision a chat-layer concern, or (b) cost at scale crosses some threshold.
  • B2B EU data residency — when a customer requires EU-only data, which layer needs the swap? Likely Understand (Mistral La Plateforme), possibly Embed (Cohere has EU-hosted), Reason (Mistral or Anthropic EU when available). Not urgent for V0.3; relevant for V1.0 SOC 2 prep.
  • Failover testing — the framework assumes commodity-layer fallbacks work. Have we actually exercised them? Summarize-step has Anthropic→Gemini→no-op coded but never tested with the primary down. Worth a chaos test once the Gemini key is live.
  • Voice resumption trigger — per ADR-0010 the resumption depends on EAS + voice-fidelity-on-speech-to-speech maturity. The ElevenLabs Conversational AI managed platform (BYO LLM, ElevenLabs TTS) is an underweighted middle option. Should ADR-0010’s resumption checklist add it as a third path?
  • Layer 6 (Reason) class assignment — currently labeled “semi-strategic”. Is that a stable classification or a temporary one until quality bars settle?

12. What ships from this session

In order:

  1. feat(pipeline): summarize backfill via Gemini Flash + Groq third fallback — set GEMINI_API_KEY, flip provider order (Gemini primary, Anthropic fallback, Groq Llama 3.3 70B third fallback), add backfill-summaries Edge Function enqueueing summarize jobs for null-summary memories. All commodity-layer picks within the framework rules.

  2. docs: ADR-0011 — AI vendor strategy (best-of-breed per task) — formalizes the framework + the layer classification. References this discussion doc.

  3. (Future) Multimodal embedding swap ADR + scoping when ready. Discussion doc’s #1 leverage move.

  4. (Future) ADR-0010 resumption checklist update if ElevenLabs Conversational AI is added as a third voice path.


13. The TL;DR for someone walking in cold

ARCIVE composes specialized AI capabilities around a memory store. The product moat is the memory store + retrieval + auto-correlation, not any AI capability. AI vendors are commodities at most layers (transcribe, understand, embed, hear) and strategic at a few (speak, reason, retrieve). Pick best-of-breed per task with explicit fallback. Don’t consolidate to one vendor — that’s a Big-Tech aspiration, not a shipping reality outside the frontier labs. Voice fidelity is strategic (ADR-0006), so the voice loop stays composed forever. Stage 6 “talk + chat together” ships as two surfaces sharing the memory store, not one unified model. Today’s work: Gemini Flash for summarize (free, multimodal-ready, already coded as fallback), backfill old memories.