Branch: claude/multimodal-platform-expansion-4i8XF
Date: 2026-05-04
Format: Working session notes — exploratory, not a committed plan.
Outcomes (so far):
- ADR-0011 — AI vendor strategy: best-of-breed per task (Accepted) — picked up the multimodal-embedding-swap and auto-correlation reasoning from §“Open decisions”.
- ADR-0012 — Topics as first-class graph nodes (Accepted) — this discussion’s “every modality eventually becomes text + one embedding vector” insight is what makes a single topics surface viable across kinds.
Followed up by: 2026-05-05_ai_strategy_architecture.md, 2026-05-06_graph_tagging_strategy.md.
Question 1 — Can ARCIVE be made truly multi-modal, and what would it take to add it as a share/context option on PC, Chrome, or phone? Auto-fit as a new searchable entry. Effort + how to do it in our stack.
Where we are today
- Audio-only by design. Schema: recordings → memories (transcript + speaker segments) with pgvector HNSW for semantic search and Postgres tsvector for FTS.
- Single ingest endpoint: POST /functions/v1/ingest-audio (Supabase Edge Function).
- PWA is installable but has no Web Share Target.
- No browser extension, no iOS share extension.
- Claude Haiku 4.5 does summaries; Voyage-3-lite produces 512-d embeddings.
- Source: docs/01_SOFTWARE_PLAN.md.
Architectural insight
Every modality eventually becomes (a) some text/caption and (b) one embedding vector. The existing graph (memory_edges) and search (transcript_tsv + embedding) work as-is once those two artifacts exist.
Recommended approach (in our stack)
1. Schema generalization (S — ~1 day)
- Keep memories, add a kind enum (audio, image, video, document, link, note, clipboard).
- New polymorphic assets table (memory_id, mime, storage_path, bytes, derived_from).
- Reuse transcript, transcript_tsv, embedding, summary as the universal text/vector layer.
- pgvector index unchanged.
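A minimal TypeScript type sketch of the generalized rows, following the column names in the bullets above. The real change would be a Supabase SQL migration; the field types here are illustrative only.

```ts
// Illustrative row types; the actual change is a SQL migration.
type MemoryKind =
  | "audio" | "image" | "video" | "document" | "link" | "note" | "clipboard";

interface MemoryRow {
  id: string;
  kind: MemoryKind;             // new enum column; existing rows backfilled to "audio"
  transcript: string | null;    // caption / extracted text for non-audio kinds
  transcript_tsv: unknown;      // Postgres tsvector, maintained exactly as today
  embedding: number[] | null;   // pgvector column, index unchanged
  summary: string | null;
}

interface AssetRow {
  id: string;
  memory_id: string;            // FK to memories.id
  mime: string;                 // e.g. "image/jpeg", "application/pdf"
  storage_path: string;         // object path in storage
  bytes: number;
  derived_from: string | null;  // e.g. a key frame derived from a video asset
}
```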
2. Universal ingest endpoint (M — ~3 days)
- POST /functions/v1/ingest accepting multipart/form-data with kind, file(s), optional url, optional text, and source (share, extension, clipboard, device).
- Dispatcher per kind:
- image → Claude Sonnet 4.6 vision → caption + tags → embed caption
- document/PDF → text extract (pdf-parse / unstructured) → chunked embeddings
- link → fetch + readability → caption + embed; screenshot via Modal worker w/ chromium
- video → audio track through existing Whisper pipeline; key-frame captions via vision
- audio → existing pipeline (no change)
- Same JWT/device auth.
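A minimal Deno-style Edge Function sketch of the dispatcher, assuming the form fields above; the per-kind handlers (handleImage, handleLink, etc.) are placeholders for the logic described in this step, not existing code.

```ts
// Sketch of the universal ingest Edge Function (Deno runtime on Supabase).
type Handler = (form: FormData, jwt: string) => Promise<Response>;
declare const handleImage: Handler;     // vision caption + tags, then embed the caption
declare const handleDocument: Handler;  // text extract, then chunked embeddings
declare const handleLink: Handler;      // fetch + readability, then caption + embed
declare const handleVideo: Handler;     // audio track through existing pipeline; key frames via vision
declare const handleAudio: Handler;     // delegate to the existing ingest-audio logic
declare const handleText: Handler;      // note / clipboard: embed the raw text

Deno.serve(async (req) => {
  const jwt = req.headers.get("Authorization") ?? ""; // same JWT/device auth as today
  if (!jwt) return new Response("unauthorized", { status: 401 });

  const form = await req.formData(); // multipart/form-data: kind, file(s), url?, text?, source
  const kind = String(form.get("kind") ?? "note");

  const dispatch: Record<string, Handler> = {
    image: handleImage,
    document: handleDocument,
    link: handleLink,
    video: handleVideo,
    audio: handleAudio,
    note: handleText,
    clipboard: handleText,
  };
  return (dispatch[kind] ?? handleText)(form, jwt);
});
```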
3. Share-target surfaces
| Surface | Mechanism | Effort |
|---|---|---|
| Android (Chrome/installed PWA) | share_target in manifest.webmanifest → /share route → /ingest | S, ~½ day |
| Chrome/Edge desktop (PWA installed) | Same share_target; macOS Chrome doesn’t show share menu — use extension | S |
| Chrome/Edge MV3 extension | contextMenus (“Save to ARCIVE”), action popup, OAuth via Supabase | M, ~2–3 days |
| iOS Share Extension | Expo config plugin / custom native → /ingest | M–L, ~3–5 days w/ TestFlight |
| macOS Services / Windows Share | Edge “Install as app” registers Windows Share; macOS = publish a .shortcut | S each |
| Clipboard / quick-add | /quick PWA route + extension hotkey | S |
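For the Android/installed-PWA rows above, a minimal service-worker sketch of the /share route, assuming manifest.webmanifest declares a POST multipart share_target with action "/share" and params title/text/url/files; routes and field names follow the table, everything else is illustrative.

```ts
// Service-worker sketch for the installed-PWA share_target flow.
declare const self: ServiceWorkerGlobalScope;

self.addEventListener("fetch", (event: FetchEvent) => {
  const url = new URL(event.request.url);
  if (event.request.method !== "POST" || url.pathname !== "/share") return;

  event.respondWith(
    (async () => {
      const shared = await event.request.formData();
      const files = shared.getAll("files").filter((f): f is File => f instanceof File);

      const out = new FormData();
      out.set("source", "share");
      out.set("kind", files.length ? "image" : shared.get("url") ? "link" : "note");
      for (const f of files) out.append("file", f);

      const text = [shared.get("title"), shared.get("text")].filter(Boolean).join("\n");
      if (text) out.set("text", text);
      const link = shared.get("url");
      if (typeof link === "string" && link) out.set("url", link);

      await fetch("/functions/v1/ingest", { method: "POST", body: out }); // auth header added as today
      return Response.redirect("/quick?shared=1", 303);                   // land on the quick-add route
    })(),
  );
});
```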
4. Auto-fit as searchable entry
Once /ingest writes a row with transcript + embedding, existing FTS + HNSW + memory_edges job picks it up automatically. The “Universe” graph becomes multi-modal — show a kind icon on each node.
5. Privacy / consent
Audio consent doesn’t blanket-cover other modalities. Add per-source toggles (“extension can save links without prompt”, “image shares always confirm”).
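A small illustrative shape for those per-source toggles; none of these names exist in the schema yet.

```ts
// Illustrative consent settings shape; nothing here exists in the schema yet.
interface SourceConsent {
  source: "share" | "extension" | "email" | "clipboard" | "device";
  allowedKinds: string[];      // e.g. ["link", "image"]: modalities this source may save
  confirmBeforeSave: boolean;  // "image shares always confirm"
  enabled: boolean;            // "extension can save links without prompt" = enabled + !confirmBeforeSave
}
```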
Effort
- MVP (image + link + text via PWA share-target on Android, Chrome extension, universal /ingest, schema migration): ~2 weeks, one engineer.
- + iOS share extension + video + PDF: another ~2 weeks.
- Polish (graph icons, per-source consent, dedup): ~1 week.
- Total ~5 weeks end-to-end without disturbing the audio path.
Suggested order
- Schema migration + /ingest router (unblocks everything)
- PWA share_target (cheapest big Android win)
- Chrome MV3 extension (covers desktop on all OSes)
- iOS share extension via Expo
- Video + PDF (longest tail)
Question 2 — Is there a better way we are overlooking?
Yes. Four angles the first answer mis-weighted, plus one philosophical reframe.
1. MCP-first, not share-target-first (probably the biggest miss)
We’re already building a public MCP server for V1.0. Building 5 native share extensions is the old model. The new model:
Expose arcive.write_memory(...) as an MCP tool and let Claude Desktop, ChatGPT Desktop, Cursor, Apple Intelligence, and Android Gemini become the share surface.
One integration, every current and future assistant becomes a “Save to ARCIVE” button — including ones that don’t exist yet. Pull MCP-write forward from V1.0 to V0.2.
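A minimal sketch of the write tool using the official MCP TypeScript SDK, assuming the universal /ingest endpoint from Question 1; the project URL, env var name, and "mcp" source value are placeholders, not existing config.

```ts
// MCP write-tool sketch; URL, env var, and source value are placeholders.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "arcive", version: "0.2.0" });

server.tool(
  "write_memory",
  "Save text, a link, or a note to ARCIVE as a new searchable memory",
  {
    kind: z.enum(["note", "link", "clipboard"]),
    text: z.string().optional(),
    url: z.string().url().optional(),
  },
  async ({ kind, text, url }) => {
    const form = new FormData();
    form.set("kind", kind);
    form.set("source", "mcp"); // would be a new source value alongside share/extension/clipboard/device
    if (text) form.set("text", text);
    if (url) form.set("url", url);

    const res = await fetch("https://<project-ref>.supabase.co/functions/v1/ingest", {
      method: "POST",
      headers: { Authorization: `Bearer ${process.env.ARCIVE_TOKEN}` },
      body: form,
    });
    return { content: [{ type: "text" as const, text: res.ok ? "Saved to ARCIVE." : "Ingest failed." }] };
  },
);

await server.connect(new StdioServerTransport());
```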
2. Email-in beats most share extensions combined
Single inbound address u_<userid>@in.arcive.io (SES/Postmark webhook → /ingest):
- iOS Mail share (no extension to ship/review)
- Android share (every app has “Share via email”)
- Newsletter auto-archive
- Slack/Discord forwards
- Works on every device, no install
- ~1 day to build; covers ~60% of share use cases.
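A minimal Supabase Edge Function sketch of the inbound-email webhook, assuming a Postmark-style JSON payload (ToFull, Subject, TextBody, Attachments with base64 Content); field names, env vars, and the new "email" source value should all be verified before building.

```ts
// Inbound-email webhook sketch; payload shape assumed, verify against provider docs.
Deno.serve(async (req) => {
  const mail = await req.json();

  // u_<userid>@in.arcive.io -> userid
  const rcpt: string = mail.ToFull?.[0]?.Email ?? mail.To ?? "";
  const userId = /^u_([a-z0-9-]+)@/i.exec(rcpt)?.[1];
  if (!userId) return new Response("unknown recipient", { status: 404 });

  const form = new FormData();
  form.set("source", "email"); // would be a new source value
  form.set("kind", mail.Attachments?.length ? "document" : "note");
  form.set("text", [mail.Subject, mail.TextBody].filter(Boolean).join("\n\n"));
  form.set("user_id", userId); // ingest would need to accept service-role writes on behalf of a user

  for (const a of mail.Attachments ?? []) {
    const bytes = Uint8Array.from(atob(a.Content), (c) => c.charCodeAt(0));
    form.append("file", new File([bytes], a.Name, { type: a.ContentType }));
  }

  const res = await fetch(`${Deno.env.get("SUPABASE_URL")}/functions/v1/ingest`, {
    method: "POST",
    headers: { Authorization: `Bearer ${Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")}` },
    body: form,
  });
  return new Response(res.ok ? "ok" : "ingest failed", { status: res.ok ? 200 : 502 });
});
```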
3. App Intents / Android Intents instead of Share Extensions
- Apple App Intents (iOS 16+) registers actions with Spotlight, Siri, Shortcuts, the Action Button, and the share sheet from one declaration.
- Android equivalent surfaces in Assistant + share + Quick Settings.
- Cheaper than a custom Share Extension, with more surfaces. If we already need Expo native code, this is the right primitive.
4. Use a real multimodal embedding model — don’t caption-then-embed
Original plan: image → Claude vision → caption → Voyage text embed. Lossy + slow.
Better: Voyage Multimodal-3 or Cohere Embed v4 — image+text in one shared space.
- Query “photo of mom’s handwritten recipe” finds it even if OCR failed
- One model, one index, one HNSW (existing pgvector unchanged)
- Cuts ~$0.005/image vision cost + latency
- Same model handles video keyframes
- Half-day code swap, big recall win (existing memories would need a one-time re-embed so old and new vectors share one space).
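A hedged sketch of caption-free image embedding against Voyage's multimodal endpoint; the URL path, payload and response shape, model id, and output dimension (likely not the current 512) all need checking against their current docs before the swap.

```ts
// Direct image embedding sketch; API details assumed, verify before use.
async function embedImage(imageBase64Jpeg: string, text?: string): Promise<number[]> {
  const content: Array<Record<string, string>> = [
    { type: "image_base64", image_base64: `data:image/jpeg;base64,${imageBase64Jpeg}` },
  ];
  if (text) content.push({ type: "text", text }); // optional caption/OCR text in the same input

  const res = await fetch("https://api.voyageai.com/v1/multimodalembeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${Deno.env.get("VOYAGE_API_KEY")}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "voyage-multimodal-3", inputs: [{ content }] }),
  });
  if (!res.ok) throw new Error(`embedding failed: ${res.status}`);
  const json = await res.json();
  return json.data[0].embedding; // one vector in a shared text+image space
}
```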
Philosophical reframe — stay on-brand
ARCIVE is calm, ambient, passive. Share targets are active, deliberate, productivity-coded. Two ways to keep multi-modal on-brand:
- Voice-annotated capture: every shared item still becomes an audio memory with the link/image/doc attached, not a parallel “image entry.” Search/graph code unchanged.
- Auto-correlation, not user-initiated share: with consent, auto-attach camera-roll photos taken during a recorded conversation, calendar events, foregrounded URL. The user never “shares” — ARCIVE notices. This is the version only ARCIVE could ship; Notion/Evernote/Raindrop can’t.
Revised order of operations
- Multimodal embedding swap (½ day, biggest leverage)
- Email-in (1 day, covers long tail)
- MCP write tools (already on roadmap — pull forward to V0.2)
- App Intents (iOS) + Intents (Android) (replaces share-extension work)
- Chrome MV3 extension (only surface MCP/email don’t cover well)
- PWA share_target as a freebie on Android
- Skip dedicated native share extensions entirely
Net: same coverage, ~half the code, more on-brand, future-proofed against the agent shift.
Open decisions
- Do we adopt the MCP-first strategy and pull MCP-write into V0.2?
- Confirm multimodal embedding model choice (Voyage Multimodal-3 vs Cohere Embed v4) — needs eval on our domain.
- Email-in: SES vs Postmark — pick based on existing infra.
- Do we commit to the “every entry is still a memory” philosophical model, or allow first-class non-audio entries?
- Auto-correlation features need explicit consent UX design before scoping.
Next concrete step (if greenlit)
Spike either (a) the MCP write-tool surface or (b) the email-in handler — both are 1-day builds and unlock a disproportionate share of the surface area.