Companion to 00_MASTER_PLAN.md and 02_HARDWARE_PLAN.md. This document is the source of truth for everything that runs on a server, in a browser, or on a phone.
Software and hardware are co-equal, parallel tracks. Software ships standalone (phone-mic) at every phase, but is built to integrate with the device from Phase 0. The hardware↔software contract in 00_MASTER_PLAN §6 is frozen at the start of each phase; both teams build to it.
1. Stack — Final Decisions
| Concern | Choice | Notes |
|---|---|---|
| Web app framework | Next.js 15 (App Router) on Vercel | RSC, edge runtime, ISR, free hobby tier |
| Mobile app | Expo SDK 52+ with Expo Router | iOS + Android from one codebase, EAS for builds & OTA |
| Backend | Supabase | Postgres + Auth + Storage + Realtime + Edge Functions |
| Vector store | pgvector with HNSW | Same DB; no Pinecone/Weaviate needed |
| Transcription (V0, batch) | Groq Whisper-large-v3-turbo | ~$0.04/hr, faster than realtime but batch-only (POST complete file → get complete transcript). Fits V0 chunked-upload model perfectly. |
| Transcription (V0.1+, streaming) | Deepgram Nova-3 | True WebSocket streaming + live diarization, ~$0.26/hr. Used wherever live transcript or sub-second latency is required (voice talk-back, group mode, live captions). |
| Speaker re-ID | Pyannote.audio on Modal | Cross-session identity, owned voice embeddings |
| Embeddings | Voyage-3-lite | Cheap, top retrieval benchmarks |
| Summary / topic extraction | Gemini 2.5 Flash or Claude Haiku 4.5 | Cents per recording |
| Agent framework | Claude Agent SDK | Tool use, memory tool, role system prompts |
| Voice talk-back (V0.2+) | Pipecat + Cartesia Sonic TTS + Deepgram streaming STT | Sub-second turn latency |
| Group conversation media | LiveKit Cloud | WebRTC, used by OpenAI Realtime, free tier |
| Pipeline orchestration | pgmq + pg_cron (V0.1) → Inngest (V0.2+) if complexity demands | Queue-based, durable, retryable |
| Auth | Supabase Auth + magic links | No passwords |
| Payments | Stripe Billing + RevenueCat for mobile | One SDK across iOS/Android/web |
| Analytics | PostHog | Product analytics + session replay + feature flags |
| Errors | Sentry | Web + mobile + Edge Functions |
| Search (text) | Postgres FTS (tsvector) | No Algolia needed |
| Search (semantic) | pgvector HNSW | Same query layer |
| File storage | Supabase Storage | Audio + exports |
| Code structure | pnpm workspace monorepo | apps/web, apps/mobile, packages/db, packages/shared, packages/agents |
| Type safety | TypeScript everywhere + Zod for runtime | Generated types from Supabase schema |
| ORM | Drizzle (when raw SQL gets painful) | Type-safe, lightweight |
| Testing | Vitest + Playwright for web, Maestro for mobile | Skip what’s not high-value |
| CI/CD | GitHub Actions + Vercel previews + EAS Update | OTA mobile fixes without store review |
1.4. AI Architecture — Layered (canonical: ADR-0011)
ARCIVE composes specialized AI capabilities around a memory store. The product moat is the memory store + retrieval + auto-correlation, not any AI capability. The §1 “Stack” table above lists individual choices per concern; this section is how those choices fit together as a system, and which classes drive procurement.
Three layer classes:
- ARCIVE-owned — built internally; never outsourced. The moat.
- Strategic AI — best-in-class on a strategic axis (e.g. voice fidelity per ADR-0006, agent quality per ADR-0003). Vendor swaps require their own ADR.
- Commodity AI — cheapest acceptable vendor with a coded fallback. Vendor swaps within this class do not require a new ADR. They happen as the market shifts.
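The commodity-class contract — cheapest vendor first, coded fallback behind it — can be sketched as a plain provider chain. A minimal sketch; the provider functions below are illustrative stand-ins, not the real vendor clients:

```typescript
// One commodity capability (here: summarization) expressed as a plain
// async function, so vendors are interchangeable without an ADR.
type Summarizer = (transcript: string) => Promise<string>;

// Try each vendor in order; a thrown error falls through to the next.
async function summarizeWithFallback(
  transcript: string,
  chain: Summarizer[],
): Promise<string> {
  let lastError: unknown;
  for (const provider of chain) {
    try {
      return await provider(transcript);
    } catch (err) {
      lastError = err; // record and move to the next vendor in the chain
    }
  }
  throw new Error(`all providers failed: ${String(lastError)}`);
}
```

Swapping the market leader is then a one-line reorder of the chain, which is exactly why commodity swaps need no ADR.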
The 11 layers:
| # | Layer | Class | Today’s vendor |
|---|---|---|---|
| 1 | Capture (mic, transcode, ingest) | ARCIVE-owned | recorders + audio-transcode worker |
| 2 | Transcribe (audio → text + timestamps) | commodity | Groq Whisper |
| 3 | Understand (text → summary, topics, entities) | commodity | Gemini Flash → Anthropic Haiku → Groq Llama (fallback chain) |
| 4 | Embed (text/image → vector) | commodity | Voyage-3-lite |
| 5 | Retrieve (query → ranked memories) | ARCIVE-owned (MOAT) | MCP server + pgvector HNSW + tsv + edges + consent gate |
| 6 | Reason (chat with memories as context) | semi-strategic | Claude Agent SDK |
| 7 | Hear (real-time STT for voice loop) | commodity | Deepgram Nova-3 (paused per ADR-0010) |
| 8 | Speak (TTS) | strategic (ADR-0006) | Cartesia → ElevenLabs v3 → Sesame CSM (paused per ADR-0010) |
| 9 | Voice orchestration (real-time turn management) | strategic (ADR-0002, paused per ADR-0010) | Pipecat (paused) — composed on resumption, never speech-to-speech |
| 10 | Auto-correlate (cross-source memory linkage) | ARCIVE-owned (MOAT) | edges job; future per discussions/2026-05-04_multimodal_expansion.md |
| 11 | Surfaces (PWA, mobile, MCP-as-output, future email-in / share-target) | ARCIVE-owned | web + mobile + MCP server |
ARCIVE never consolidates to a single vendor across the stack. This matches how every credible adjacent player (Notion, Granola, Otter, Limitless, Apple Intelligence, ElevenLabs Conversational AI) actually operates. Full reasoning, cost projections at ARCIVE volumes, and the rejected alternatives are in ADR-0011 and the working session notes discussions/2026-05-05_ai_strategy_architecture.md.
Stage 6 (unified “talk + chat together”) is redefined: ships as two surfaces (Talk + Chat) sharing the memory store, not as a single-model unified architecture (which would force ARCIVE to abandon ADR-0006 voice fidelity). Same data, different input modalities.
1.5. Capture-Surface Capabilities (Honest Truth Table)
Different input surfaces have different capabilities. We do NOT promise capabilities a surface can’t deliver. This table is canonical — UX copy must reflect it.
| Capability | ARCIVE Device | Mobile App (native, Expo) | Web App / PWA |
|---|---|---|---|
| Always-on capture (records when not in foreground) | ✅ | ✅ (with foreground service / background audio mode) | ❌ — browsers suspend tabs after lock/switch |
| Press-to-record (intentional dictation) | ✅ | ✅ | ✅ |
| Recording while phone screen is off / locked | ✅ | ✅ (Android foreground service; iOS background audio capability) | ❌ |
| Recording survives WiFi outage | ✅ (30-min local buffer) | ✅ (SQLite/file queue) | ⚠️ (IndexedDB queue, lost if user closes tab) |
| Multi-speaker far-field capture | ✅ (4-mic array, 5m) | ❌ (single phone mic, ~1m) | ❌ |
| DoA / speaker positioning metadata | ✅ | ❌ | ❌ |
| Onboard VAD (silence not uploaded) | ✅ (XVF3800 hardware) | ✅ Silero VAD via onnxruntime-react-native or @picovoice/cobra-react-native (decision at V0.2 implementation time) | ✅ (@ricky0123/vad-web, AudioWorklet, Silero ONNX) |
| Opus encoding | ✅ (firmware) | ✅ (native codec) | ⚠️ Chrome ✅, Safari often falls back to MP4/AAC |
| Group mode (continuous WebRTC stream) | ✅ (Phase 3+) | ✅ (Phase 3+) | ✅ for joining group sessions, not as primary capturer |
| BLE pairing with ARCIVE device | n/a | ✅ (react-native-ble-plx) | ⚠️ Web Bluetooth: Chrome Android ✅, Safari iOS ❌ |
| Voice talk-back (real-time) | ✅ via paired phone | ✅ | ✅ (best when tab is foregrounded) |
| Push notifications | n/a | ✅ | ⚠️ iOS 16.4+ only when installed to home screen |
| Install to home screen | n/a | ✅ App Store / Play Store | ✅ PWA (see §1.6) |
| Offline browsing of past memories | n/a | ✅ (SQLite cache) | ✅ (Service Worker + IndexedDB) |
Implications
- The web/PWA is for intentional dictation and review. UX copy says “Tap to record.” It does NOT say “ARCIVE listens to your day.” That promise belongs to the device and the native mobile app.
- The mobile app is the always-on phone-only capture surface (when no device is paired). Foreground service on Android, background audio mode on iOS — both legitimate platform features, both require explicit permission disclosure.
- The ARCIVE device is the unrestricted always-on capture surface. No OS in the way.
- We never promise background recording on the web.
iOS background-mode footnote
Declaring `UIBackgroundModes: ["audio", "bluetooth-central"]` in Info.plist is necessary but not sufficient. Always-on capture + always-paired BLE on iOS requires:
- An active audio session (category `playAndRecord`, option `mixWithOthers`) at all times capture is expected — going idle risks suspension.
- CBCentralManager state restoration implemented (`CBCentralManagerOptionRestoreIdentifierKey` + `centralManager(_:willRestoreState:)`) so iOS can re-wake the app and reconnect to the device after suspension.
- The `bluetooth-central` background mode declared in addition to `audio` — they are independent capabilities.

Without state restoration, iOS will silently drop the BLE connection during suspension and not re-pair until the user reopens the app.
1.6. How the PWA Works
The “PWA” in “Next.js PWA” is not a separate framework — it’s a set of browser standards layered onto a normal website. Three pieces, simple individually, powerful together.
Piece 1 — manifest.webmanifest (the install metadata)
A JSON file at apps/web/public/manifest.webmanifest. Tells the browser “this site is installable, here’s its name, icon, color, start URL, and orientation.”
```json
{
  "name": "ARCIVE",
  "short_name": "ARCIVE",
  "description": "Record, retrieve, interact, and create memories.",
  "start_url": "/today",
  "display": "standalone",
  "background_color": "#F8F6F1",
  "theme_color": "#88B4A0",
  "icons": [
    { "src": "/icons/192.png", "sizes": "192x192", "type": "image/png" },
    { "src": "/icons/512.png", "sizes": "512x512", "type": "image/png", "purpose": "maskable" }
  ]
}
```

Linked from `<head>`:
```html
<link rel="manifest" href="/manifest.webmanifest" />
<meta name="theme-color" content="#88B4A0" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="default" />
```

When the user visits the site in Chrome on Android or Safari on iOS, the browser detects this manifest and offers an “Add to Home Screen” prompt (Android), or the user adds it manually from the Share menu (iOS). Once added, ARCIVE launches in standalone mode — no browser chrome, no URL bar — and looks identical to a native app.
Piece 2 — Service Worker (the offline + caching brain)
A JavaScript file (apps/web/public/sw.js) the browser registers in the background. It intercepts network requests and can serve cached responses, queue failed requests for retry, handle push notifications, and run independently of any open tab.
For ARCIVE we use it for three things:
- Offline shell — cache the app’s HTML / JS / CSS so the UI loads instantly even with no network. Memories themselves come from API; if API fails, show cached list.
- Failed-upload retry queue — if the user records a memory and the upload fails (no WiFi), the service worker keeps the chunk in IndexedDB and retries when the network comes back, even if the tab is closed.
- Web push notifications (iOS 16.4+, all of Android) — server can push “Your memory is processed” without the app being open.
We don’t write the service worker by hand — we use @serwist/next (the maintained successor to next-pwa) which generates one from config. Setup is a single Next.js plugin.
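A minimal config sketch, assuming the shape documented by `@serwist/next` (the `swSrc`/`swDest` paths are our assumed layout, not verified against the repo):

```typescript
// next.config.ts — wraps the normal Next.js config with Serwist,
// which generates and registers the service worker at build time.
import withSerwistInit from "@serwist/next";

const withSerwist = withSerwistInit({
  swSrc: "app/sw.ts",       // our service-worker source (assumed path)
  swDest: "public/sw.js",   // generated worker served from /sw.js
});

export default withSerwist({
  // ...the rest of the usual Next.js config goes here
});
```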
Piece 3 — Web APIs the PWA uses
Once installed, the PWA can use the same browser APIs as a normal website — but UX is fullscreen, app-like:
- `getUserMedia` for mic
- `MediaRecorder` for capture
- `AudioWorklet` for VAD
- Web Bluetooth for device pairing (Android only)
- `IndexedDB` for local cache
- Web Push for notifications
What the user experiences
- Visits `arcive.app` in Chrome Android or Safari iOS.
- Browser shows “Add ARCIVE to Home Screen” (or the user does it manually on iOS).
- Icon appears on home screen, indistinguishable from a native app.
- Tap → app launches fullscreen (no browser chrome).
- Works offline — past memories visible, new recordings queued for upload.
- Push notifications arrive (iOS 16.4+, all Android).
- Updates ship instantly — next time the app opens, it grabs the latest version. No App Store review.
What the PWA still cannot do (so we don’t pretend)
| Limitation | Workaround |
|---|---|
| No background audio recording on iOS | Use the native mobile app for that |
| No Web Bluetooth on Safari iOS | Pairing flow goes through native mobile app on iOS |
| Push notifications iOS-limited | Acceptable for V0; native app at V0.2 fixes. iOS Web Push prerequisites (all required): PWA installed to home screen via Safari, manifest has display: standalone or fullscreen, VAPID keys configured server-side, permission prompt fired from a user-gesture event handler (not on page load), not in private browsing. |
| Full filesystem access | Not needed for our use case |
| Speech recognition (`SpeechRecognition`) spotty in Safari | We do STT server-side anyway |
Why PWA-first beats native-first for V0
- Ship a URL in 3 weeks, not 6. No App Store review, no provisioning profiles, no $99/yr Apple Developer fee for V0.
- One codebase across desktop, mobile web, installed-PWA. Native (Expo) comes at V0.2 once the pattern is proven.
- Iteration speed. Push a fix, refresh the page, done. No EAS build queue, no TestFlight invites.
- Marketing site is the same Next.js app. No second codebase to maintain.
The PWA is genuinely the right V0 surface. The native mobile app at V0.2 then unlocks always-on capture and full BLE — the things the PWA can’t do.
Implementation cost
Adding PWA support to a Next.js 15 app is roughly:
- 1 hour: install `@serwist/next`, configure
- 1 hour: write `manifest.webmanifest`, generate icons (192, 512, maskable)
- 1 hour: add iOS-specific `<meta>` tags, test “Add to Home Screen” flow on real devices
- 2 hours: write IndexedDB upload-queue logic for failed uploads
- 1 hour: test offline shell loads when network is off
Half a day’s work for a real PWA. Worth it from V0.
1.7. Hosting Strategy
Three things to host. Each has a clear best home.
What needs hosting
| Thing | Recommended home | Alternatives |
|---|---|---|
| Next.js web app | Vercel for V0–V0.2; Cloudflare Pages for V0.2+ if cost matters | Netlify, self-hosted on VPS, Fly.io |
| Backend (functions, DB, auth, storage, realtime) | Supabase (locked) | — |
| GPU / heavy workers (Pyannote, future on-device-grade STT) | Modal | Replicate, Fly Machines (GPU), self-host |
| MCP server (Phase 4+) | Cloudflare Workers (cheap, edge, always-on) | Vercel Edge Functions, Fly |
| Firmware OTA binaries | Supabase Storage | Any object storage |
Hosted-Supabase-only development
We do not run a local Supabase stack. No Docker, no local Postgres
container — every developer points at the same hosted project (or their
own personal one) over HTTPS. Trade-off: every migration / Edge
Function change is a supabase db push / functions deploy, not an
in-process reload. Win: zero Docker dependency, zero LAN-IP gymnastics
when testing on a real phone or device, identical environment between
laptop and CI.
```bash
# One-time install
npm install -g supabase   # CLI for `link`, `db push`, `functions deploy` — no Docker needed
brew install pnpm         # or: npm install -g pnpm

# Bootstrap
git clone <repo>
pnpm install

# Link to a hosted Supabase project (yours or the team's)
supabase login
supabase link --project-ref <PROJECT_REF>
supabase db push
pnpm functions:deploy

# Fill apps/web/.env.local with the hosted project's URL + anon key
cp .env.example apps/web/.env.local

# Run
pnpm --filter web dev       # Next.js on localhost:3030 (project standard — not Next default 3000)

# When testing the mobile app
pnpm --filter mobile start  # phone hits the same hosted project; no tunnel, no LAN IP

# When testing the device (Phase 0+)
# Firmware uploads directly to https://<project-ref>.supabase.co/functions/v1/ingest-audio
```

Hosting decision matrix
| Provider | Best for | V0 cost | V0.2 cost | Setup time | Verdict |
|---|---|---|---|---|---|
| Vercel | Next.js | $0 (hobby) | $20–50 (Pro) | 5 min | Default. Use this for V0. |
| Cloudflare Pages + Workers | Next.js at scale | $0 | $5–20 | 30 min | Migrate here at V0.2 if Vercel bill matters. |
| Netlify | Next.js | $0 | $19+ | 10 min | Fine, but no advantage over Vercel or Cloudflare |
| Hostinger / VPS Node.js | Cheapest fixed cost | ~$4 | ~$4–10 | 1–2 days | Skip unless you have a strong reason |
| Self-host on Hetzner / DigitalOcean | Full control | ~$5 | ~$10 | 1–2 days | Skip for V0; consider for local-only variant (V2.x) |
| Fly.io | Container apps | $0–5 | $5–20 | 1 hr | Better for backend than for Next.js |
| Railway / Render | Simple PaaS | $5–10 | $10–30 | 30 min | OK middle ground; no real edge |
Recommendation
- Local-first for the first 1–2 weeks of V0. No hosting yet.
- Deploy V0 to Vercel. Free hobby tier, 5-minute setup, PR previews, optimized for Next.js. Don’t reconsider until the bill is meaningful (>$50/mo).
- At V0.2, evaluate Cloudflare Pages. Same Next.js code, much cheaper at scale, generous Workers free tier (100k req/day). Migration is 1–2 days.
- Backend stays on Supabase regardless of where the web app lives. Don’t conflate.
- Long-running Pyannote workers go to Modal (pay-per-second GPU, no always-on cost).
- Skip cheap VPS hosts (Hostinger, DigitalOcean droplets, Hetzner) for V0. Saving $4/mo is not worth losing 2 days of engineering. Reconsider only when:
- You’re at 100k+ users and Vercel bill > $500/mo
- You have data-residency requirements (EU/UK/India local hosting)
- You’re building the local-only ARCIVE variant (V2.x) — that genuinely needs self-hosted or device-as-server architecture
Why not Hostinger / cheap VPS for V0
- No CDN — slow first paint for users far from the server
- No edge runtime — every request hits one region
- No image optimization — bigger pages, slower loads
- No PR previews — slows iteration
- No zero-config Next.js — you maintain Node, PM2, nginx, SSL, deploy scripts
- ~1–2 days of setup vs. 5 minutes
- Saves ~$15/mo, costs days of engineering you don’t have at V0
Engineering principle
At V0, optimize for engineering speed, not infrastructure cost. The bill is small at the scale where speed matters most. Cost optimization is a Phase 2+ exercise.
1.8. Latency & Streaming Architecture
This section is the source of truth for what streams where, what’s batch, and what the latency budget is at every realtime touchpoint. It corrects three things that were vague or wrong in earlier sections.
Streaming vs. batch — the two pipelines
ARCIVE has two parallel data pipelines, not one. They use different transcription providers because they have different latency requirements.
PIPELINE A — INGESTION (batch, V0+)

```
capture surface ──► chunk to 30s + Opus mono @ 24kbps ──► HTTPS POST /ingest-audio
                                                                    │
                                                                    ▼
                                         Groq Whisper-large-v3-turbo (batch)
                                                                    │
                                                                    ▼
                                 queue: diarize → embed → summarize → edges
                                                                    │
                                                                    ▼
                                 Supabase Realtime → app shows new memory
```

Use: dictation, hardware capture, mobile background capture, group session post-processing
Total user-visible latency: ~2–5s after chunk seal
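The 30-second chunk-seal rule in Pipeline A amounts to the following. An illustrative sketch only — the real capture path seals chunks from live PCM frames rather than from a known total duration:

```typescript
// Seal a chunk every CHUNK_SECONDS; the final chunk may be shorter.
const CHUNK_SECONDS = 30;

// Returns [start, end) boundaries in seconds for a recording.
function chunkBoundaries(totalSeconds: number): Array<[number, number]> {
  const out: Array<[number, number]> = [];
  for (let start = 0; start < totalSeconds; start += CHUNK_SECONDS) {
    out.push([start, Math.min(start + CHUNK_SECONDS, totalSeconds)]);
  }
  return out;
}
```

Each sealed boundary corresponds to one Opus-encoded POST to `/ingest-audio`, which is why user-visible latency is measured from chunk seal, not from recording start.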
PIPELINE B — LIVE (streaming, V0.2+)

```
capture surface ──► WebSocket / WebRTC ──► Deepgram Nova-3 streaming
                                                │  partial transcripts (~200ms)
                                                ▼
                                       Pipecat orchestrator
                                                │
                                                ▼
                              Claude Agent SDK (Haiku 4.5, streaming)
                                                │
                                                ▼
                            Cartesia Sonic TTS (sub-100ms first chunk)
                                                │
                                                ▼
                                       audio back to user
```

Use: voice talk-back, role-play conversations, group mode interjections, live captions
Total user-visible latency budget: <1.5s end-of-utterance → AI starts speaking

Latency budget (Pipeline B, voice talk-back)
| Stage | Component | Target | Realistic |
|---|---|---|---|
| End-of-utterance detection | Pipecat + Deepgram VAD | 200ms | 200–400ms |
| STT final result | Deepgram Nova-3 streaming | +100ms after EOU | 100–300ms |
| LLM first token | Claude Haiku 4.5 streaming | +300ms | 300–500ms |
| TTS first audio chunk | Cartesia Sonic | +100ms | 90–150ms |
| Network RTT | variable | 50ms | 50–200ms |
| Total | — | ~750ms ideal | ~750ms–1.55s realistic |
Phase 2 ships with explicit latency benchmarking against a 1500ms hard ceiling. If we exceed it, options: co-locate Deepgram + Cartesia in the same region as the user, switch to Cartesia’s edge endpoints, or use a smaller LLM (Haiku → 7B model on Groq).
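The benchmark gate falls directly out of the table above. A sketch; stage names mirror the table rows, and collecting the per-turn timings is assumed to happen elsewhere in the voice loop:

```typescript
// Per-turn timings for the five stages in the latency-budget table.
interface TurnTiming {
  eouMs: number;           // end-of-utterance detection
  sttMs: number;           // STT final result after EOU
  llmFirstTokenMs: number; // LLM first token
  ttsFirstChunkMs: number; // TTS first audio chunk
  networkMs: number;       // round-trip network
}

const HARD_CEILING_MS = 1500; // Phase 2 hard ceiling

function turnLatency(t: TurnTiming): number {
  return t.eouMs + t.sttMs + t.llmFirstTokenMs + t.ttsFirstChunkMs + t.networkMs;
}

function withinBudget(t: TurnTiming): boolean {
  return turnLatency(t) <= HARD_CEILING_MS;
}
```

Note the realistic worst case in the table (400 + 300 + 500 + 150 + 200 = 1550ms) already exceeds the ceiling, which is why the mitigation options exist.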
Why Pipecat over OpenAI Realtime / Gemini Live
Both OpenAI Realtime API and Gemini Live offer bundled STT+LLM+TTS in a single WebSocket. Tempting, but rejected because:
- Vendor lock: you cannot swap STT, LLM, or TTS independently. If Cartesia ships a better voice or Deepgram beats them on latency, you can’t take advantage.
- Cost: ~$0.06/min for OpenAI Realtime vs. ~$0.02/min for Pipecat-orchestrated Deepgram + Haiku + Cartesia.
- Tool use parity: Pipecat with Claude Agent SDK has more mature tool-calling for our memory retrieval tools.
- Group mode: LiveKit Agents (which Pipecat composes with cleanly) is the proven pattern for multi-party WebRTC. OpenAI Realtime has weaker multi-party support.
We keep OpenAI Realtime as a fallback driver in the swappable agent layer — packages/agents/drivers/openai-realtime.ts exists as an option but is not the default.
Realtime / streaming coverage matrix
| Capability | Web (V0) | Mobile (V0.2) | Device (V0) | Device (V0.3 group) |
|---|---|---|---|---|
| Continuous mic capture (foreground) | ✅ getUserMedia | ✅ expo-audio | ✅ XVF3800 → I2S | ✅ |
| Continuous mic capture (background) | ❌ tab suspends | ✅ background-audio mode | ✅ always | ✅ always |
| On-device VAD | ✅ Silero in AudioWorklet | ✅ Silero ONNX or Cobra | ✅ XVF3800 hardware | ✅ XVF3800 hardware |
| Chunked HTTPS upload | ✅ fetch + ReadableStream | ✅ fetch | ✅ esp_https_client | ✅ |
| Resumable upload on network drop | ✅ IndexedDB queue | ✅ SQLite queue | ✅ 30-min circular buffer | ✅ |
| Streaming STT (live transcript) | ✅ Deepgram WS from V0.1 | ✅ from V0.2 | n/a (server-side after upload) | ✅ via backend bridge |
| Voice talk-back loop (Pipecat) | ✅ from V0.2 | ✅ from V0.2 | ✅ via paired phone playback | ✅ |
| Group mode WebRTC | ✅ LiveKit JS SDK from V0.3 | ✅ LiveKit RN SDK from V0.3 | ⚠️ HTTPS chunked → backend bridge | ⚠️ same |
Validated stack capabilities
| Component | Streams? | Verified |
|---|---|---|
| Supabase Edge Functions (Deno) | ✅ Request.body is ReadableStream | Used in production by audio products |
| Supabase Storage | ✅ resumable uploads via tus | Native support |
| Supabase Realtime | ✅ WebSocket, sub-100ms push | Native |
| Groq Whisper-large-v3-turbo | ❌ batch only | Confirmed by Groq API docs |
| Deepgram Nova-3 | ✅ WebSocket streaming + live diarization | Confirmed |
| Claude Agent SDK | ✅ streaming responses + tool use | Native |
| Cartesia Sonic TTS | ✅ <100ms first-byte | Confirmed by published benchmarks |
| Pipecat | ✅ provider-agnostic STT→LLM→TTS pipeline | Active OSS, used by many |
| LiveKit Cloud + Agents | ✅ multi-party WebRTC + server agent participants | Used by OpenAI Realtime, Character.ai |
| pgvector HNSW | ✅ sub-50ms similarity at our scale | Standard |
| @ricky0123/vad-web | ✅ Silero VAD in AudioWorklet | Standard |
| MCP server | ✅ JSON-RPC over stdio/HTTP, supports streaming responses | Confirmed |
2. Repository Layout (software-only)
```
arcive/
├── apps/
│   ├── web/                          # Next.js 15 PWA
│   │   ├── app/
│   │   │   ├── (auth)/login
│   │   │   ├── (app)/today
│   │   │   ├── (app)/universe
│   │   │   ├── (app)/memory/[id]
│   │   │   ├── (app)/people
│   │   │   ├── (app)/roles
│   │   │   └── (app)/settings
│   │   ├── components/
│   │   ├── lib/
│   │   └── public/manifest.webmanifest
│   │
│   └── mobile/                       # Expo (V0.2+)
│       ├── app/(tabs)/today.tsx
│       ├── app/(tabs)/universe.tsx
│       ├── app/memory/[id].tsx
│       ├── app/pair-device.tsx       # V0.3+
│       └── lib/ble.ts                # V0.3+
│
├── packages/
│   ├── db/
│   │   ├── migrations/               # Supabase migrations
│   │   ├── schema.ts                 # Drizzle schema
│   │   └── types.ts                  # Generated
│   │
│   ├── shared/
│   │   ├── zod/                      # Validation schemas (Memory, Person, Role)
│   │   ├── ble-uuids.ts              # ← shared with firmware (HW Plan §6)
│   │   ├── agent-interface.ts        # AgentSession contract
│   │   └── api-contracts.ts          # Edge Function I/O types
│   │
│   └── agents/
│       ├── roles/                    # Built-in role definitions
│       │   ├── reviewer.ts
│       │   ├── tutor.ts
│       │   ├── caregiver.ts
│       │   └── brainstorm.ts
│       ├── tools/
│       │   ├── memory-search.ts
│       │   ├── person-lookup.ts
│       │   └── timeline-window.ts
│       └── drivers/
│           ├── stateless-rag.ts      # V0.1
│           ├── claude-agent-sdk.ts   # V0.2
│           └── realtime-voice.ts     # V0.2 (voice) / V0.3 (group)
│
├── backend/
│   ├── functions/                    # Supabase Edge Functions
│   │   ├── ingest-audio/             # Hardware + app upload entrypoint
│   │   ├── transcribe-step/          # Queue worker
│   │   ├── embed-step/
│   │   ├── identify-speakers-step/
│   │   ├── summarize-step/
│   │   ├── compute-edges-step/
│   │   ├── pair-device/              # V0.3+
│   │   └── revoke-device/
│   │
│   ├── workers/                      # Long-running workers (Modal/Fly)
│   │   └── pyannote-reid/            # Speaker re-ID
│   │
│   └── mcp/                          # MCP server (V0.3+ internal, V1 public)
│       └── arcive-memory-mcp/
│
└── shared/
    └── ble-uuids.ts                  # Mirrored to firmware repo
```

3. Database Schema (V0 — full)
```sql
-- Vector extension
create extension if not exists vector;

-- Subscription enum
create type subscription_tier as enum ('free', 'pro', 'family', 'enterprise');

-- Users handled by Supabase Auth; supplemental table for app-level fields
create table user_profiles (
  id uuid primary key references auth.users,
  display_name text,
  subscription_tier subscription_tier default 'free',
  stripe_customer_id text unique,
  monthly_seconds_used int default 0,
  monthly_seconds_reset_at timestamptz,
  consent_granted_at timestamptz,
  created_at timestamptz default now()
);

-- People in the user's life (including "self")
create table people (
  id uuid primary key default gen_random_uuid(),
  user_id uuid references auth.users not null,
  display_name text not null,            -- "Self", "Mom", "Dr. Singh"
  voice_embedding vector(192),           -- Pyannote embedding (V0.1+)
  relationship text,                     -- "self" | "family" | "friend" | "professional"
  notes text,                            -- user-editable context for the agent
  consent_status text default 'pending', -- "granted" | "pending" | "revoked"
  created_at timestamptz default now()
);

-- Devices (hardware + phones; designed for full variant lineup per Master Plan §2.5)
create type device_kind as enum (
  'phone_ios', 'phone_android', 'web',
  'watch_apple', 'watch_wearos',
  'arcive_clip', 'arcive_pendant', 'arcive_tabletop', 'arcive_card',
  'arcive_screen', 'arcive_cellular', 'arcive_local'
);

create table devices (
  id uuid primary key default gen_random_uuid(),
  user_id uuid references auth.users not null,
  kind device_kind not null,
  name text,
  mac_address text unique,          -- null for phones/web
  firmware_version text,
  capabilities jsonb,               -- {has_mic_array, has_screen, has_cellular, has_local_compute, mic_count, ...}
  connectivity text[],              -- ["wifi", "ble"] | ["cellular", "ble"] | ["local_only"]
  cellular_iccid text,              -- variant-specific, null otherwise
  local_mode boolean default false, -- if true, device does on-device STT/embed and uploads only summaries
  paired_at timestamptz,
  revoked_at timestamptz,
  created_at timestamptz default now()
);

-- Raw recordings
create table recordings (
  id uuid primary key default gen_random_uuid(),
  device_id uuid references devices,
  user_id uuid references auth.users not null,
  storage_path text not null,
  duration_seconds int,
  recorded_at timestamptz not null,
  doa_metadata jsonb,            -- HW only
  status text default 'pending', -- pending | processing | done | error
  error_message text,
  created_at timestamptz default now()
);

-- Processed memories
create table memories (
  id uuid primary key default gen_random_uuid(),
  recording_id uuid references recordings unique,
  user_id uuid references auth.users not null,
  transcript text,
  transcript_tsv tsvector generated always as
    (to_tsvector('english', coalesce(transcript, ''))) stored,
  summary text,
  topics text[],
  embedding vector(512),         -- Voyage-3-lite is 512-dim
  recorded_at timestamptz,
  created_at timestamptz default now()
);

create index memories_tsv_idx on memories using gin(transcript_tsv);
create index memories_embedding_idx on memories using hnsw (embedding vector_cosine_ops);
```
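Retrieval queries both legs — the HNSW index for semantic similarity and the GIN index for full-text search — and fuses the two rankings. A minimal fusion sketch; reciprocal rank fusion is one reasonable choice here (an assumption, not something the schema mandates), and the function is illustrative:

```typescript
// Merge two best-first id rankings with reciprocal rank fusion (RRF).
// k dampens the influence of rank position; 60 is the conventional default.
function rrfMerge(vectorIds: string[], ftsIds: string[], k = 60): string[] {
  const score = new Map<string, number>();
  for (const [rank, id] of vectorIds.entries())
    score.set(id, (score.get(id) ?? 0) + 1 / (k + rank + 1));
  for (const [rank, id] of ftsIds.entries())
    score.set(id, (score.get(id) ?? 0) + 1 / (k + rank + 1));
  // Ids appearing in both legs accumulate score and float to the top.
  return [...score.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```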
```sql
-- Speaker segments per memory
create table memory_participants (
  id uuid primary key default gen_random_uuid(),
  memory_id uuid references memories on delete cascade,
  person_id uuid references people,  -- nullable until re-ID resolves
  speaker_label text,                -- "Speaker A", "Speaker B" from diarization
  speaking_time_seconds int,
  segments jsonb                     -- [{start_s, end_s, text}]
);

-- Semantic edges between memories (V0.1+)
create table memory_edges (
  id uuid primary key default gen_random_uuid(),
  memory_a uuid references memories on delete cascade,
  memory_b uuid references memories on delete cascade,
  similarity float not null,
  created_at timestamptz default now(),
  unique(memory_a, memory_b)
);

-- AI Roles (built-in + user-created + marketplace)
create table roles (
  id uuid primary key default gen_random_uuid(),
  user_id uuid references auth.users,  -- null for built-in
  name text not null,
  description text,
  system_prompt text not null,
  voice_id text,                       -- Cartesia/ElevenLabs voice
  retrieval_config jsonb,              -- {window_days, person_filter, top_k, ...}
  guardrails jsonb,                    -- {avoid_topics, escalation_triggers}
  is_premium boolean default false,
  is_published boolean default false,  -- marketplace
  price_cents int,                     -- marketplace
  created_at timestamptz default now()
);

-- Conversation sessions with a role
create table role_sessions (
  id uuid primary key default gen_random_uuid(),
  user_id uuid references auth.users not null,
  role_id uuid references roles not null,
  started_at timestamptz default now(),
  ended_at timestamptz,
  transcript text,
  memory_id uuid references memories  -- the convo itself becomes a memory
);

-- Family / shared spaces (V0.2+)
create table spaces (
  id uuid primary key default gen_random_uuid(),
  owner_id uuid references auth.users not null,
  name text not null,
  created_at timestamptz default now()
);

create table space_members (
  space_id uuid references spaces on delete cascade,
  user_id uuid references auth.users,
  role text default 'member',  -- "owner" | "member" | "caregiver"
  primary key (space_id, user_id)
);

create table memory_spaces (
  memory_id uuid references memories on delete cascade,
  space_id uuid references spaces on delete cascade,
  primary key (memory_id, space_id)
);

-- Pipeline queue (pgmq table; managed by extension)
-- See pgmq docs

-- RLS policies (selected)
alter table memories enable row level security;
create policy "user owns memories" on memories
  for all using (auth.uid() = user_id);

alter table people enable row level security;
create policy "user owns people" on people
  for all using (auth.uid() = user_id);
```

4. Pipeline (Queue Architecture)
```
                        ┌──────────────────────┐
ingest-audio ─────────► │ recordings (pending) │
(POST endpoint)         └──────────┬───────────┘
                                   │ enqueue: pipeline.transcribe
                                   ▼
                        ┌──────────────────────┐
                        │   transcribe-step    │  Groq / Deepgram
                        └──────────┬───────────┘
                                   │ enqueue: pipeline.identify-speakers (if multi-speaker)
                                   ▼
                        ┌──────────────────────┐
                        │  identify-speakers   │  Pyannote on Modal
                        └──────────┬───────────┘
                                   │ enqueue: pipeline.embed
                                   ▼
                        ┌──────────────────────┐
                        │      embed-step      │  Voyage-3-lite
                        └──────────┬───────────┘
                                   │ enqueue: pipeline.summarize
                                   ▼
                        ┌──────────────────────┐
                        │    summarize-step    │  Gemini Flash / Haiku
                        └──────────┬───────────┘
                                   │ enqueue: pipeline.compute-edges
                                   ▼
                        ┌──────────────────────┐
                        │  compute-edges-step  │  pgvector top-K
                        └──────────┬───────────┘
                                   │ update: recordings.status = 'done'
                                   ▼
                        Realtime push → app
```

- Each step is a separate Edge Function with a single responsibility.
- Queue: `pgmq` (Postgres-native message queue) for V0.1; consider Inngest at V0.2 if observability hurts.
- Failures retry with exponential backoff up to 5 times, then move to a dead-letter table.
- Idempotency: each step writes its result keyed on `recording_id`; safe to replay.
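The idempotent-replay property can be sketched as a wrapper around any step. Illustrative only — the real steps key their writes on `recording_id` in Postgres; the in-memory map below is a stand-in for that table:

```typescript
// A pipeline step: given a recording id, produce its result.
type StepFn = (recordingId: string) => Promise<string>;

// Wrap a step so that redelivered queue messages are no-ops:
// if a result already exists for this recording_id, return it unchanged.
function makeIdempotentStep(run: StepFn, results: Map<string, string>) {
  return async (recordingId: string): Promise<string> => {
    const existing = results.get(recordingId);
    if (existing !== undefined) return existing; // replayed message: skip work
    const out = await run(recordingId);
    results.set(recordingId, out);
    return out;
  };
}
```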
5. Agent Interface (Swappable Layer)
```typescript
export interface AgentSession {
  start(input: { userId: string; roleId: string; modality: 'text' | 'voice' }): Promise<void>;
  send(message: string | AudioChunk): AsyncIterable<AgentEvent>;
  interject(): Promise<void>;  // V0.3 group mode
  assumeRole(roleId: string): Promise<void>;
  end(): Promise<SessionSummary>;
}

export type AgentEvent =
  | { type: 'partial'; text: string }
  | { type: 'final'; text: string; audio?: ArrayBuffer }
  | { type: 'tool_call'; name: string; args: unknown }
  | { type: 'error'; message: string };

// Drivers
// packages/agents/drivers/stateless-rag.ts      (V0.1)
// packages/agents/drivers/claude-agent-sdk.ts   (V0.2)
// packages/agents/drivers/realtime-voice.ts     (V0.2 voice, V0.3 group)
```

The app code only ever imports `AgentSession`. Swapping drivers is a config change.
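Hypothetical usage with a stub driver, showing how app code stays driver-agnostic by consuming the event stream. The stub and the pared-down types below are illustrative, not one of the real drivers:

```typescript
// Pared-down versions of the contract, enough to show the consumption pattern.
type AgentEvent =
  | { type: "partial"; text: string }
  | { type: "final"; text: string };

interface MinimalSession {
  send(message: string): AsyncIterable<AgentEvent>;
}

// A stub driver that streams a canned reply.
const stubDriver: MinimalSession = {
  async *send(message: string) {
    yield { type: "partial" as const, text: "Thinking…" };
    yield { type: "final" as const, text: `echo: ${message}` };
  },
};

// App code: render partials as they arrive, return the final text.
async function ask(session: MinimalSession, message: string): Promise<string> {
  let finalText = "";
  for await (const ev of session.send(message)) {
    if (ev.type === "final") finalText = ev.text;
    // partials would drive a streaming UI here
  }
  return finalText;
}
```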
6. Hardware-Facing API Contracts
(Mirrors §6 of 00_MASTER_PLAN.md. Software is the server side of these contracts.)
6.1 POST /functions/v1/ingest-audio
```ts
// backend/functions/ingest-audio/index.ts (sketch)
serve(async (req) => {
  const url = new URL(req.url); // needed for the query params below
  const auth = req.headers.get('Authorization');
  const deviceId = req.headers.get('X-Device-Id');
  const recordedAt = url.searchParams.get('recorded_at');
  const doaJson = url.searchParams.get('doa_json');

  // 1. Verify device JWT, look up device + user
  // 2. Stream body to Supabase Storage at audio/<user>/<recording_id>.wav
  // 3. Insert recordings row (status=pending)
  // 4. Enqueue pipeline.transcribe
  // 5. Return 202 { recording_id }
});
```

6.2 POST /functions/v1/pair-device
App-initiated. Generates a one-shot `device_jwt` valid for `ingest-audio` and returns it bundled into a QR payload.
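One plausible shape for that QR payload, with the caveat that every field name here is illustrative and not part of the frozen contract:

```typescript
// Assumed payload shape; the real contract lives in 00_MASTER_PLAN §6.
interface PairQrPayload {
  v: 1;              // payload version, for forward compatibility
  ingestUrl: string; // POST target for ingest-audio
  deviceJwt: string; // one-shot token minted by pair-device
  expiresAt: string; // ISO-8601; device must complete pairing before this
}

function buildQrPayload(
  ingestUrl: string,
  deviceJwt: string,
  ttlMs: number,
  now: number = Date.now(),
): string {
  const payload: PairQrPayload = {
    v: 1,
    ingestUrl,
    deviceJwt,
    expiresAt: new Date(now + ttlMs).toISOString(),
  };
  return JSON.stringify(payload); // the app renders this string as a QR code
}
```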
6.3 POST /functions/v1/revoke-device
User-initiated from app. Marks device revoked, invalidates JWT, notifies device over BLE if connected.
6.4 Realtime channels
- `recordings:user_id=eq.<uid>` — new rows + status updates
- `memories:user_id=eq.<uid>` — when pipeline completes
- `devices:user_id=eq.<uid>` — pairing + revocation events
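Clients build these filters from the signed-in user's id. A small helper keeps the filter string in one place; the supabase-js v2 subscription call is shown in comments since it needs a live client:

```typescript
// Builds the per-user Postgres-changes filter used by all three channels.
function userFilter(uid: string): string {
  return `user_id=eq.${uid}`;
}

// Usage with supabase-js v2 (assuming a configured `supabase` client):
//
// supabase
//   .channel('recordings-feed')
//   .on('postgres_changes',
//       { event: '*', schema: 'public', table: 'recordings', filter: userFilter(uid) },
//       (payload) => handleRecordingChange(payload))
//   .subscribe();
```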
7. Phase Deliverables — Software Track
Phase 0 — V0 (Wk 1–3)
- Next.js 15 PWA deployed to Vercel
- Supabase project provisioned, full schema migrated
- Magic-link auth + consent screen
- Phone-mic recording (`getUserMedia` with VAD via `@ricky0123/vad-web`)
- Audio upload to Supabase Storage + recordings row
- Synchronous Edge Function chain (no queue yet): transcribe → store memory
- Today (list) + Memory detail views
- Postgres FTS text search
- PostHog + Sentry instrumented
- Stripe customer pre-created on signup
- Internal `getMemories(query, filter, k)` retrieval API
- 20 invited users
- `/ingest-audio` endpoint accepts hardware uploads from day one — same endpoint, same recordings row, just a different `device_id`. The hardware track uses this to land its first end-to-end demo in the same phase.
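The "VAD before upload" decision can be illustrated with a naive energy gate. Note this is not the `@ricky0123/vad-web` API (that library runs a Silero model); it just shows the drop-silence-before-upload decision on raw samples:

```typescript
// Root-mean-square energy of a PCM frame in [-1, 1] float samples.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Gate: frames below the threshold never reach a paid transcription API.
// The 0.01 threshold is a placeholder; a real VAD model replaces this.
function shouldUpload(samples: Float32Array, threshold = 0.01): boolean {
  return rms(samples) >= threshold;
}
```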
Phase 1 — V0.1 (Wk 4–6)
- Move pipeline to `pgmq` queue + step workers
- Diarization (Deepgram Nova-3)
- Pyannote.audio worker on Modal for speaker re-ID; populates `people.voice_embedding`
- Voyage-3-lite embeddings + HNSW index
- Universe view (`react-force-graph` in web)
- First role: Reviewer (text-only, RAG driver)
- Stripe Pro tier ($12/mo) goes live; PostHog feature flags gate features
- Export to Markdown
- App-side BLE pairing UX — required for the Phase 1 integration demo (“Pair dev-kit via QR scan → device joins WiFi → captures meeting”). On Chrome Android: Web Bluetooth in the PWA. On iOS: a developer-only CLI tool (10 dev-kits, internal use only); consumer iOS pairing UX ships with the Expo mobile app at V0.2.
- App calls `POST /functions/v1/pair-device` and `POST /functions/v1/revoke-device` from settings UI
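The compute-edges step and the Universe view both rest on nearest-neighbour similarity. In production that is a pgvector HNSW query; the underlying cosine top-K, sketched in plain TypeScript for clarity:

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Brute-force top-K; HNSW gives the same answer approximately, in log time.
function topK(query: number[], rows: { id: string; emb: number[] }[], k: number) {
  return rows
    .map((r) => ({ id: r.id, score: cosine(query, r.emb) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```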
Phase 2 — V0.2 (Wk 7–10)
- Expo mobile app — feature parity with web
- pnpm workspace, shared `packages/db`, `packages/shared`, `packages/agents`
- RevenueCat for mobile billing; Stripe for web
- Voice talk-back: Pipecat + Deepgram streaming STT + Cartesia Sonic TTS
- Claude Agent SDK driver replaces stateless RAG
- Two more roles: Tutor, Brainstorm Partner
- Family tier launches: spaces, multi-member, caregiver role
- Offline recording on mobile (queue locally, upload when online)
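The offline-recording bullet reduces to a persistent FIFO that flushes when connectivity returns. A sketch with storage and network stubbed out (class and method names are illustrative):

```typescript
type Pending = { id: string; bytes: Uint8Array };

class OfflineQueue {
  private pending: Pending[] = []; // persisted to device storage in the real app

  // upload returns true on success; injected so it can be stubbed in tests.
  constructor(private upload: (p: Pending) => Promise<boolean>) {}

  enqueue(p: Pending): void {
    this.pending.push(p);
  }

  // Called on the 'online' event: preserves order, stops at the first failure
  // so nothing is dropped, and returns how many chunks were sent.
  async flush(): Promise<number> {
    let sent = 0;
    while (this.pending.length > 0) {
      const ok = await this.upload(this.pending[0]);
      if (!ok) break;
      this.pending.shift();
      sent++;
    }
    return sent;
  }
}
```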
Phase 3 — V0.3 (Wk 11–14)
- LiveKit-based group mode (continuous streaming session, server-side)
- Backend bridge: device → HTTPS chunked upload (1s Opus chunks) → bridge publishes as a LiveKit participant track. Device does not speak WebRTC directly — see 02_HARDWARE_PLAN.md §6.2. Web/mobile clients join the room directly via LiveKit SDKs.
- Agent interjection capability via `interject()` — agent can speak into the LiveKit room as a participant
- Polished pairing/revocation UX in the Expo mobile app (replaces the Phase 1 CLI dev tool for iOS; QR scan + BLE handshake)
- Hardware status UI (battery, recording state, mute control)
- MCP server built (internal use only) — agents and roles use it as the retrieval layer
- Caregiving role with explicit per-person consent flow
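Because the device uploads 1 s Opus chunks over HTTPS rather than speaking WebRTC, the bridge must re-serialize them before publishing to LiveKit. A sketch of the reorder buffer that implies; per-chunk sequence numbers are an assumption here, and the frozen contract is in 02_HARDWARE_PLAN.md §6.2:

```typescript
// Holds out-of-order chunks and releases them in sequence order.
class ReorderBuffer {
  private next = 0; // next sequence number we can publish
  private held = new Map<number, Uint8Array>();

  // Returns the chunks now safe to publish to the LiveKit track, in order.
  push(seq: number, chunk: Uint8Array): Uint8Array[] {
    this.held.set(seq, chunk);
    const ready: Uint8Array[] = [];
    while (this.held.has(this.next)) {
      ready.push(this.held.get(this.next)!);
      this.held.delete(this.next);
      this.next++;
    }
    return ready;
  }
}
```

A real bridge would also need a gap timeout so one lost chunk cannot stall the stream forever.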
Phase 4 — V1.0 (Wk 15–22)
- DoA fusion: combine XVF3800 azimuth with Pyannote re-ID for higher-confidence speaker labels
- Public MCP server — Pro users get an endpoint they plug into Claude Desktop / ChatGPT / Cursor
- Role marketplace UI (browse, install, publish)
- Stripe Connect for marketplace payouts (70/30 split)
- B2B admin dashboard (multi-staff, audit log, org-level billing)
- SOC 2 Type 1 prep started
Phase 5 — V1.1+ (Wk 23+)
- Vertical packages: Caregiving, Education, Therapy
- Memory-as-a-Service API tier (usage-billed)
- Smartwatch app
- Multi-language support (start with ES, FR, DE)
- On-device summary cache for fast offline browsing
8. Privacy, Security, Consent
- All audio encrypted at rest (Supabase Storage default + customer-managed keys for B2B)
- Per-user “delete everything” flow (cascades to recordings, memories, embeddings, voice prints)
- Default retention: indefinite for owner, but UI nudge to set retention windows
- Voice embeddings of non-self people require their consent (`people.consent_status`)
- Two-party consent default in legal mode (user can disable for personal use, with a confirmation)
- Audit log table for B2B: who accessed what memory when
- Device JWT short-lived (rotated every 30 days, refreshed via BLE when phone is paired)
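The 30-day rotation rule is a single comparison; sketched here so both teams agree on the boundary condition (rotation fires at exactly 30 days, not after):

```typescript
const ROTATION_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// True when the device JWT is due for a BLE-delivered refresh.
function needsRotation(issuedAtMs: number, nowMs: number): boolean {
  return nowMs - issuedAtMs >= ROTATION_MS;
}
```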
9. Cost Discipline Rules
- VAD before upload — never send silence to a paid API
- Cheapest model that works — measure quality, not vendor prestige
- Cache ruthlessly — embeddings, summaries, role responses (semantic cache for common queries)
- Per-user usage tracking in `user_profiles.monthly_seconds_used`, enforced at upload time
- Hard cutoffs at tier limits — record locally, queue for upload, but reject if cap exceeded
- Free tier Pyannote calls batched — daily, not real-time; Pro gets real-time
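The upload-time cap check against `user_profiles.monthly_seconds_used` is then a pure function. The per-tier caps below are placeholders, not decided pricing:

```typescript
// Placeholder caps; actual tier limits are a pricing decision, not fixed here.
const TIER_CAP_SECONDS: Record<string, number> = {
  free: 10 * 60 * 60,  // e.g. 10 h/month
  pro: 100 * 60 * 60,  // e.g. 100 h/month
};

// Reject at the cap; the device keeps recording locally and retries next cycle.
function acceptUpload(tier: string, usedSeconds: number, chunkSeconds: number): boolean {
  const cap = TIER_CAP_SECONDS[tier] ?? 0; // unknown tier gets nothing
  return usedSeconds + chunkSeconds <= cap;
}
```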
10. Open Questions
- Should free tier get the graph view or only Pro? (Probably free, since it’s the wow moment.)
- Self-host Pyannote vs. use Speechmatics managed? (Default: self-host on Modal; switch if ops pain exceeds savings.)
- Web Bluetooth for hardware pairing on Android? (Nice-to-have at V1.0; iOS requires native app anyway.)
- Should role marketplace allow free roles? (Yes — drives adoption; only premium ones are gated.)
- Encrypted-at-client option for Pro+ users? (Probably V1.1; complicates retrieval but is a compelling sell for therapy / caregiving B2B.)