
Multi-Modal Expansion — Discussion Notes

Branch: claude/multimodal-platform-expansion-4i8XF
Date: 2026-05-04
Format: Working session notes — exploratory, not a committed plan.
Outcomes (so far): followed up by 2026-05-05_ai_strategy_architecture.md and 2026-05-06_graph_tagging_strategy.md.


Question 1 — Can ARCIVE be made truly multi-modal, and what would it take to add it as a share/context option on PC, in Chrome, or on a phone, with each shared item auto-fitting as a new searchable entry? Effort estimate + how to do it in our stack.

Where we are today

  • Audio-only by design. Schema: memories (renamed from recordings) holds transcript + speaker segments, with pgvector HNSW for semantic search and Postgres tsvector for FTS.
  • Single ingest endpoint: POST /functions/v1/ingest-audio (Supabase Edge Function).
  • PWA is installable but has no Web Share Target.
  • No browser extension, no iOS share extension.
  • Claude Haiku 4.5 does summaries; Voyage-3-lite produces 512-d embeddings.
  • Source: docs/01_SOFTWARE_PLAN.md.

Architectural insight

Every modality eventually becomes (a) some text/caption and (b) one embedding vector. The existing graph (memory_edges) and search (transcript_tsv + embedding) work as-is once those two artifacts exist.

1. Schema generalization (S — ~1 day)

  • Keep memories, add kind enum (audio, image, video, document, link, note, clipboard).
  • New polymorphic assets table (memory_id, mime, storage_path, bytes, derived_from).
  • Reuse transcript, transcript_tsv, embedding, summary as universal text/vector layer.
  • pgvector index unchanged.
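
A minimal sketch of the generalized rows described above, assuming the column names from these notes (types are illustrative, not a final migration):

```ts
// Illustrative row shapes after the migration. Column names come from the
// notes above; id/timestamp types are assumptions.
type MemoryKind =
  | "audio" | "image" | "video" | "document" | "link" | "note" | "clipboard";

interface MemoryRow {
  id: string;
  kind: MemoryKind;
  transcript: string | null;   // universal text layer: transcript, caption, or extracted text
  transcript_tsv: unknown;     // Postgres tsvector, maintained DB-side
  summary: string | null;
  embedding: number[] | null;  // pgvector, 512-d (Voyage-3-lite today)
  created_at: string;
}

interface AssetRow {
  id: string;
  memory_id: string;           // FK → memories
  mime: string;                // e.g. "image/jpeg"
  storage_path: string;        // object path in storage
  bytes: number;
  derived_from: string | null; // e.g. a keyframe extracted from a video asset
}
```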

2. Universal ingest endpoint (M — ~3 days)

  • POST /functions/v1/ingest accepting multipart/form-data with kind, file(s), optional url, optional text, source (share, extension, clipboard, device).
  • Dispatcher per kind:
    • image → Claude Sonnet 4.6 vision → caption + tags → embed caption
    • document/PDF → text extract (pdf-parse / unstructured) → chunked embeddings
    • link → fetch + readability → caption + embed; screenshot via Modal worker w/ chromium
    • video → audio track through existing Whisper pipeline; key-frame captions via vision
    • audio → existing pipeline (no change)
  • Same JWT/device auth.
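
A sketch of the dispatcher, assuming the Supabase Edge runtime (Deno.serve). The per-kind handlers are hypothetical stubs standing in for the real pipelines; auth is elided:

```ts
// supabase/functions/ingest/index.ts — dispatcher sketch, not production code.
// Hypothetical per-kind handlers; stubs here, real pipelines in practice.
const accepted = () => new Response(null, { status: 202 });
async function handleImage(f: File, src: string) { /* vision caption → embed */ return accepted(); }
async function handleDocument(f: File, src: string) { /* pdf-parse → chunk → embed */ return accepted(); }
async function handleLink(url: string, src: string) { /* fetch + readability → embed */ return accepted(); }
async function handleVideo(f: File, src: string) { /* audio → Whisper; keyframes → vision */ return accepted(); }
async function handleAudio(f: File, src: string) { /* existing ingest-audio path */ return accepted(); }

Deno.serve(async (req) => {
  // Same JWT/device auth as ingest-audio would run here first.
  const form = await req.formData();
  const kind = String(form.get("kind") ?? "");
  const source = String(form.get("source") ?? "unknown"); // share | extension | clipboard | device
  const file = form.get("file") as File | null;
  const url = form.get("url") as string | null;

  switch (kind) {
    case "image":    return handleImage(file!, source);
    case "document": return handleDocument(file!, source);
    case "link":     return handleLink(url!, source);
    case "video":    return handleVideo(file!, source);
    case "audio":    return handleAudio(file!, source);
    default:         return new Response("unknown kind", { status: 400 });
  }
});
```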

3. Share-target surfaces

| Surface | Mechanism | Effort |
| --- | --- | --- |
| Android (Chrome / installed PWA) | share_target in manifest.webmanifest, /share route → /ingest | S, ~½ day |
| Chrome/Edge desktop (PWA installed) | Same share_target; macOS Chrome doesn’t show a share menu, so use the extension | S |
| Chrome/Edge MV3 extension | contextMenus (“Save to ARCIVE”), action popup, OAuth via Supabase | M, ~2–3 days |
| iOS Share Extension | Expo config plugin / custom native → /ingest | M–L, ~3–5 days w/ TestFlight |
| macOS Services / Windows Share | Edge “Install as app” registers Windows Share; macOS: publish a .shortcut | S each |
| Clipboard / quick-add | /quick PWA route + extension hotkey | S |
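
For the Android/PWA row, the whole change is a few lines. A sketch of the share_target addition (structure follows the Web Share Target spec; the param names and /share route are our choices), shown as a TS literal for convenience:

```ts
// Shape of the manifest.webmanifest addition (JSON, written as a TS literal).
const shareTargetPatch = {
  share_target: {
    action: "/share",            // PWA route that forwards to /ingest
    method: "POST",
    enctype: "multipart/form-data",
    params: {
      title: "title",
      text: "text",
      url: "url",
      files: [{ name: "file", accept: ["image/*", "application/pdf"] }],
    },
  },
} as const;
```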

4. Auto-fit as searchable entry

Once /ingest writes a row with transcript + embedding, the existing FTS + HNSW + memory_edges job picks it up automatically. The “Universe” graph becomes multi-modal — show a kind icon on each node.
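
Nothing downstream changes. A hypothetical supabase-js call against the existing search path (the RPC name here is illustrative, not our actual function) would return image/link/document rows alongside audio:

```ts
import { createClient } from "@supabase/supabase-js";

declare const SUPABASE_URL: string, SUPABASE_ANON_KEY: string, queryEmbedding: number[];
const supabase = createClient(SUPABASE_URL, SUPABASE_ANON_KEY);

// "hybrid_search" is an illustrative RPC name. The point: new kinds are just
// rows with transcript + embedding, so the same query already covers them.
const { data, error } = await supabase.rpc("hybrid_search", {
  query_text: "mom's handwritten recipe",
  query_embedding: queryEmbedding, // vector from the same embedding model
  match_count: 20,
});
```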

5. Privacy / consent

Audio consent doesn’t blanket-cover other modalities. Add per-source toggles (“extension can save links without prompt”, “image shares always confirm”).

Effort

  • MVP (image + link + text via PWA share-target on Android, Chrome extension, universal /ingest, schema migration): ~2 weeks, one engineer.
  • + iOS share extension + video + PDF: another ~2 weeks.
  • Polish (graph icons, per-source consent, dedup): ~1 week.
  • Total ~5 weeks end-to-end, without disturbing the audio path.

Suggested order

  1. Schema migration + /ingest router (unblocks everything)
  2. PWA share_target (cheapest big Android win)
  3. Chrome MV3 extension (covers desktop on all OSes)
  4. iOS share extension via Expo
  5. Video + PDF (longest tail)

Question 2 — Is there a better way we are overlooking?

Yes. Four angles the first answer mis-weighted, plus one philosophical reframe.

1. MCP-first, not share-target-first (probably the biggest miss)

We’re already building a public MCP server for V1.0. Building 5 native share extensions is the old model. The new model:

Expose arcive.write_memory(...) as an MCP tool and let Claude Desktop, ChatGPT Desktop, Cursor, Apple Intelligence, Android Gemini become the share surface.

One integration, and every current and future assistant becomes a “Save to ARCIVE” button — including ones that don’t exist yet. Pull MCP-write forward from V1.0 to V0.2.
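
A sketch of the write tool using the MCP TypeScript SDK. The SDK calls are real; the tool schema, ingest URL, and env-var names are assumptions (and we post JSON here for simplicity, where the real /ingest takes multipart):

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "arcive", version: "0.2.0" });

// Hypothetical tool surface; ARCIVE_URL / ARCIVE_TOKEN wiring is ours to define.
server.tool(
  "write_memory",
  {
    kind: z.enum(["image", "video", "document", "link", "note", "clipboard"]),
    text: z.string().optional(),
    url: z.string().url().optional(),
  },
  async (args) => {
    const res = await fetch(`${process.env.ARCIVE_URL}/functions/v1/ingest`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ARCIVE_TOKEN}`,
      },
      body: JSON.stringify({ ...args, source: "mcp" }),
    });
    return { content: [{ type: "text", text: res.ok ? "Saved to ARCIVE." : "Save failed." }] };
  },
);

await server.connect(new StdioServerTransport());
```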

2. Email-in beats most share extensions combined

Single inbound address u_<userid>@in.arcive.io (SES/Postmark webhook → /ingest):

  • iOS Mail share (no extension to ship/review)
  • Android share (every app has “Share via email”)
  • Newsletter auto-archive
  • Slack/Discord forwards
  • Works on every device, no install
  • ~1 day to build; covers ~60% of share use cases.
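
A sketch of the inbound handler, assuming Postmark’s inbound webhook (PascalCase field names from their inbound format; verify against current docs) and the Supabase Edge runtime. User resolution and auth are assumptions:

```ts
// supabase/functions/email-in/index.ts — inbound email → /ingest, sketch only.
Deno.serve(async (req) => {
  const msg = await req.json(); // Postmark inbound payload (field names assumed per their docs)
  const to: string = msg.OriginalRecipient ?? msg.To ?? "";
  const userId = /^u_([a-z0-9]+)@in\.arcive\.io$/i.exec(to)?.[1];
  if (!userId) return new Response("unknown recipient", { status: 404 });

  // Forward subject/body as a note; attachments would loop through /ingest per file.
  // Passing user_id with a service token is one option; the real auth model is TBD.
  await fetch(`${Deno.env.get("ARCIVE_URL")}/functions/v1/ingest`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${Deno.env.get("SERVICE_TOKEN")}`,
    },
    body: JSON.stringify({
      kind: "note",
      source: "email",
      user_id: userId,
      text: `${msg.Subject ?? ""}\n\n${msg.TextBody ?? ""}`,
    }),
  });
  return new Response("ok");
});
```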

3. App Intents / Android Intents instead of Share Extensions

  • Apple App Intents (iOS 16+) registers actions with Spotlight, Siri, Shortcuts, Action Button, share sheet from one declaration.
  • Android equivalent surfaces in Assistant + share + Quick Settings.
  • Cheaper than a custom Share Extension and reaches more surfaces. If we already need Expo native code, this is the right primitive.

4. Use a real multimodal embedding model — don’t caption-then-embed

Original plan: image → Claude vision → caption → Voyage text embed. Lossy + slow.

Better: Voyage Multimodal-3 or Cohere Embed v4 — image+text in one shared space.

  • Query “photo of mom’s handwritten recipe” finds it even if OCR failed
  • One model, one index, one HNSW (existing pgvector unchanged)
  • Cuts ~$0.005/image vision cost + latency
  • Same model handles video keyframes
  • Half-day swap, big recall win.
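
A hedged sketch of the swap, with Voyage’s multimodal endpoint as the example. The endpoint and payload shape are taken from Voyage’s multimodal docs as of these notes; treat the field names as assumptions and verify before building:

```ts
// Sketch: embed the image itself (plus any caption text) in one shared space,
// instead of caption-then-text-embed. Verify field names against Voyage docs.
async function embedImage(imageBase64DataUrl: string, caption?: string): Promise<number[]> {
  const res = await fetch("https://api.voyageai.com/v1/multimodalembeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
    },
    body: JSON.stringify({
      model: "voyage-multimodal-3",
      inputs: [{
        content: [
          ...(caption ? [{ type: "text", text: caption }] : []),
          { type: "image_base64", image_base64: imageBase64DataUrl },
        ],
      }],
    }),
  });
  const json = await res.json();
  // NB: confirm output dimensionality against the existing 512-d column before
  // declaring the pgvector index truly "unchanged".
  return json.data[0].embedding;
}
```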

Philosophical reframe — stay on-brand

ARCIVE is calm, ambient, passive. Share targets are active, deliberate, productivity-coded. Two ways to keep multi-modal on-brand:

  • Voice-annotated capture: every shared item still becomes an audio memory with the link/image/doc attached, not a parallel “image entry.” Search/graph code unchanged.
  • Auto-correlation, not user-initiated share: with consent, auto-attach camera-roll photos taken during a recorded conversation, overlapping calendar events, or the foregrounded URL. The user never “shares” — ARCIVE notices. This is the version only ARCIVE could ship; Notion/Evernote/Raindrop can’t.

Revised order of operations

  1. Multimodal embedding swap (½ day, biggest leverage)
  2. Email-in (1 day, covers long tail)
  3. MCP write tools (already on roadmap — pull forward to V0.2)
  4. App Intents (iOS) + Intents (Android) (replaces share-extension work)
  5. Chrome MV3 extension (only surface MCP/email don’t cover well)
  6. PWA share_target as freebie on Android
  7. Skip dedicated native share extensions entirely

Net: same coverage, ~half the code, more on-brand, future-proofed against the agent shift.


Open decisions

  • Do we adopt the MCP-first strategy and pull MCP-write into V0.2?
  • Confirm multimodal embedding model choice (Voyage Multimodal-3 vs Cohere Embed v4) — needs eval on our domain.
  • Email-in: SES vs Postmark — pick based on existing infra.
  • Do we commit to the “every entry is still a memory” philosophical model, or allow first-class non-audio entries?
  • Auto-correlation features need explicit consent UX design before scoping.

Next concrete step (if greenlit)

Spike either (a) the MCP write-tool surface or (b) the email-in handler — both are 1-day builds and unlock a disproportionate share of the surface area.