
Multi-Modal Expansion — Discussion Notes

Branch: claude/multimodal-platform-expansion-4i8XF
Date: 2026-05-04
Format: Working session notes — exploratory, not a committed plan.
Outcomes (so far): followed up by 2026-05-05_ai_strategy_architecture.md and 2026-05-06_graph_tagging_strategy.md.


Question 1 — Can ARCIVE be made truly multi-modal, and what would it take to add it as a share/context option on PC, in Chrome, or on a phone, with each shared item auto-fitting as a new searchable entry? Effort estimate + how to do it in our stack.

Where we are today

  • Audio-only by design. Schema: memories (renamed from recordings) holds transcript + speaker segments, with pgvector HNSW for semantic search and Postgres tsvector for FTS.
  • Single ingest endpoint: POST /functions/v1/ingest-audio (Supabase Edge Function).
  • PWA is installable but has no Web Share Target.
  • No browser extension, no iOS share extension.
  • Claude Haiku 4.5 does summaries; Voyage-3-lite produces 512-d embeddings.
  • Source: docs/01_SOFTWARE_PLAN.md.

Architectural insight

Every modality eventually becomes (a) some text/caption and (b) one embedding vector. The existing graph (memory_edges) and search (transcript_tsv + embedding) work as-is once those two artifacts exist.

1. Schema generalization (S — ~1 day)

  • Keep memories, add kind enum (audio, image, video, document, link, note, clipboard).
  • New polymorphic assets table (memory_id, mime, storage_path, bytes, derived_from).
  • Reuse transcript, transcript_tsv, embedding, summary as universal text/vector layer.
  • pgvector index unchanged.
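
A minimal sketch of the generalized rows described above, assuming the column names from these notes (types are illustrative, not a final migration):

```ts
// Illustrative row shapes after the migration. Column names come from the
// notes above; id/timestamp types are assumptions.
type MemoryKind =
  | "audio" | "image" | "video" | "document" | "link" | "note" | "clipboard";

interface MemoryRow {
  id: string;
  kind: MemoryKind;
  transcript: string | null;   // universal text layer: transcript, caption, or extracted text
  transcript_tsv: unknown;     // Postgres tsvector, maintained DB-side
  summary: string | null;
  embedding: number[] | null;  // pgvector, 512-d (Voyage-3-lite today)
  created_at: string;
}

interface AssetRow {
  id: string;
  memory_id: string;           // FK → memories
  mime: string;                // e.g. "image/jpeg"
  storage_path: string;        // object path in storage
  bytes: number;
  derived_from: string | null; // e.g. a keyframe extracted from a video asset
}
```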

2. Universal ingest endpoint (M — ~3 days)

  • POST /functions/v1/ingest accepting multipart/form-data with kind, file(s), optional url, optional text, source (share, extension, clipboard, device).
  • Dispatcher per kind:
    • image → Claude Sonnet 4.6 vision → caption + tags → embed caption
    • document/PDF → text extract (pdf-parse / unstructured) → chunked embeddings
    • link → fetch + readability → caption + embed; screenshot via Modal worker w/ chromium
    • video → audio track through existing Whisper pipeline; key-frame captions via vision
    • audio → existing pipeline (no change)
  • Same JWT/device auth.
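
A sketch of the dispatcher, assuming the Supabase Edge runtime (Deno.serve). The per-kind handlers are hypothetical stubs standing in for the real pipelines; auth is elided:

```ts
// supabase/functions/ingest/index.ts — dispatcher sketch, not production code.
// Hypothetical per-kind handlers; stubs here, real pipelines in practice.
const accepted = () => new Response(null, { status: 202 });
async function handleImage(f: File, src: string) { /* vision caption → embed */ return accepted(); }
async function handleDocument(f: File, src: string) { /* pdf-parse → chunk → embed */ return accepted(); }
async function handleLink(url: string, src: string) { /* fetch + readability → embed */ return accepted(); }
async function handleVideo(f: File, src: string) { /* audio → Whisper; keyframes → vision */ return accepted(); }
async function handleAudio(f: File, src: string) { /* existing ingest-audio path */ return accepted(); }

Deno.serve(async (req) => {
  // Same JWT/device auth as ingest-audio would run here first.
  const form = await req.formData();
  const kind = String(form.get("kind") ?? "");
  const source = String(form.get("source") ?? "unknown"); // share | extension | clipboard | device
  const file = form.get("file") as File | null;
  const url = form.get("url") as string | null;

  switch (kind) {
    case "image":    return handleImage(file!, source);
    case "document": return handleDocument(file!, source);
    case "link":     return handleLink(url!, source);
    case "video":    return handleVideo(file!, source);
    case "audio":    return handleAudio(file!, source);
    default:         return new Response("unknown kind", { status: 400 });
  }
});
```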

3. Share-target surfaces

| Surface | Mechanism | Effort |
| --- | --- | --- |
| Android (Chrome / installed PWA) | share_target in manifest.webmanifest, /share route → /ingest | S, ~½ day |
| Chrome/Edge desktop (PWA installed) | Same share_target; macOS Chrome doesn’t show a share menu, so use the extension | S |
| Chrome/Edge MV3 extension | contextMenus (“Save to ARCIVE”), action popup, OAuth via Supabase | M, ~2–3 days |
| iOS Share Extension | Expo config plugin / custom native → /ingest | M–L, ~3–5 days w/ TestFlight |
| macOS Services / Windows Share | Edge “Install as app” registers Windows Share; macOS: publish a .shortcut | S each |
| Clipboard / quick-add | /quick PWA route + extension hotkey | S |
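
For the Android/PWA row, the whole change is a few lines. A sketch of the share_target addition (structure follows the Web Share Target spec; the param names and /share route are our choices), shown as a TS literal for convenience:

```ts
// Shape of the manifest.webmanifest addition (JSON, written as a TS literal).
const shareTargetPatch = {
  share_target: {
    action: "/share",            // PWA route that forwards to /ingest
    method: "POST",
    enctype: "multipart/form-data",
    params: {
      title: "title",
      text: "text",
      url: "url",
      files: [{ name: "file", accept: ["image/*", "application/pdf"] }],
    },
  },
} as const;
```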

4. Auto-fit as searchable entry

Once /ingest writes a row with transcript + embedding, the existing FTS + HNSW + memory_edges job picks it up automatically. The “Universe” graph becomes multi-modal — show a kind icon on each node.
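
Nothing downstream changes. A hypothetical supabase-js call against the existing search path (the RPC name here is illustrative, not our actual function) would return image/link/document rows alongside audio:

```ts
import { createClient } from "@supabase/supabase-js";

declare const SUPABASE_URL: string, SUPABASE_ANON_KEY: string, queryEmbedding: number[];
const supabase = createClient(SUPABASE_URL, SUPABASE_ANON_KEY);

// "hybrid_search" is an illustrative RPC name. The point: new kinds are just
// rows with transcript + embedding, so the same query already covers them.
const { data, error } = await supabase.rpc("hybrid_search", {
  query_text: "mom's handwritten recipe",
  query_embedding: queryEmbedding, // vector from the same embedding model
  match_count: 20,
});
```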

5. Privacy / consent

Audio consent doesn’t blanket-cover other modalities. Add per-source toggles (“extension can save links without prompt”, “image shares always confirm”).

Effort

  • MVP (image + link + text via PWA share-target on Android, Chrome extension, universal /ingest, schema migration): ~2 weeks, one engineer.
  • + iOS share extension + video + PDF: another ~2 weeks.
  • Polish (graph icons, per-source consent, dedup): ~1 week.
  • Total ~5 weeks end-to-end, without disturbing the audio path.

Suggested order

  1. Schema migration + /ingest router (unblocks everything)
  2. PWA share_target (cheapest big Android win)
  3. Chrome MV3 extension (covers desktop on all OSes)
  4. iOS share extension via Expo
  5. Video + PDF (longest tail)

Question 2 — Is there a better way we are overlooking?

Yes. Four angles the first answer mis-weighted, plus one philosophical reframe.

1. MCP-first, not share-target-first (probably the biggest miss)

We’re already building a public MCP server for V1.0. Building 5 native share extensions is the old model. The new model:

Expose arcive.write_memory(...) as an MCP tool and let Claude Desktop, ChatGPT Desktop, Cursor, Apple Intelligence, Android Gemini become the share surface.

One integration, and every current and future assistant becomes a “Save to ARCIVE” button — including ones that don’t exist yet. Pull MCP-write forward from V1.0 to V0.2.
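
A sketch of the write tool using the MCP TypeScript SDK. The SDK calls are real; the tool schema, ingest URL, and env-var names are assumptions (and we post JSON here for simplicity, where the real /ingest takes multipart):

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "arcive", version: "0.2.0" });

// Hypothetical tool surface; ARCIVE_URL / ARCIVE_TOKEN wiring is ours to define.
server.tool(
  "write_memory",
  {
    kind: z.enum(["image", "video", "document", "link", "note", "clipboard"]),
    text: z.string().optional(),
    url: z.string().url().optional(),
  },
  async (args) => {
    const res = await fetch(`${process.env.ARCIVE_URL}/functions/v1/ingest`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.ARCIVE_TOKEN}`,
      },
      body: JSON.stringify({ ...args, source: "mcp" }),
    });
    return { content: [{ type: "text", text: res.ok ? "Saved to ARCIVE." : "Save failed." }] };
  },
);

await server.connect(new StdioServerTransport());
```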

2. Email-in beats most share extensions combined

Single inbound address u_<userid>@in.arcive.io (SES/Postmark webhook → /ingest):

  • iOS Mail share (no extension to ship/review)
  • Android share (every app has “Share via email”)
  • Newsletter auto-archive
  • Slack/Discord forwards
  • Works on every device, no install
  • ~1 day to build; covers ~60% of share use cases.
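
A sketch of the inbound handler, assuming Postmark’s inbound webhook (PascalCase field names from their inbound format; verify against current docs) and the Supabase Edge runtime. User resolution and auth are assumptions:

```ts
// supabase/functions/email-in/index.ts — inbound email → /ingest, sketch only.
Deno.serve(async (req) => {
  const msg = await req.json(); // Postmark inbound payload (field names assumed per their docs)
  const to: string = msg.OriginalRecipient ?? msg.To ?? "";
  const userId = /^u_([a-z0-9]+)@in\.arcive\.io$/i.exec(to)?.[1];
  if (!userId) return new Response("unknown recipient", { status: 404 });

  // Forward subject/body as a note; attachments would loop through /ingest per file.
  // Passing user_id with a service token is one option; the real auth model is TBD.
  await fetch(`${Deno.env.get("ARCIVE_URL")}/functions/v1/ingest`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${Deno.env.get("SERVICE_TOKEN")}`,
    },
    body: JSON.stringify({
      kind: "note",
      source: "email",
      user_id: userId,
      text: `${msg.Subject ?? ""}\n\n${msg.TextBody ?? ""}`,
    }),
  });
  return new Response("ok");
});
```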

3. App Intents / Android Intents instead of Share Extensions

  • Apple App Intents (iOS 16+) registers actions with Spotlight, Siri, Shortcuts, Action Button, share sheet from one declaration.
  • Android equivalent surfaces in Assistant + share + Quick Settings.
  • Cheaper than a custom Share Extension and reaches more surfaces. If we already need Expo native code, this is the right primitive.

4. Use a real multimodal embedding model — don’t caption-then-embed

Original plan: image → Claude vision → caption → Voyage text embed. Lossy + slow.

Better: Voyage Multimodal-3 or Cohere Embed v4 — image+text in one shared space.

  • Query “photo of mom’s handwritten recipe” finds it even if OCR failed
  • One model, one index, one HNSW (existing pgvector unchanged)
  • Cuts ~$0.005/image vision cost + latency
  • Same model handles video keyframes
  • Half-day swap, big recall win.
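
A hedged sketch of the swap, with Voyage’s multimodal endpoint as the example. The endpoint and payload shape are taken from Voyage’s multimodal docs as of these notes; treat the field names as assumptions and verify before building:

```ts
// Sketch: embed the image itself (plus any caption text) in one shared space,
// instead of caption-then-text-embed. Verify field names against Voyage docs.
async function embedImage(imageBase64DataUrl: string, caption?: string): Promise<number[]> {
  const res = await fetch("https://api.voyageai.com/v1/multimodalembeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
    },
    body: JSON.stringify({
      model: "voyage-multimodal-3",
      inputs: [{
        content: [
          ...(caption ? [{ type: "text", text: caption }] : []),
          { type: "image_base64", image_base64: imageBase64DataUrl },
        ],
      }],
    }),
  });
  const json = await res.json();
  // NB: confirm output dimensionality against the existing 512-d column before
  // declaring the pgvector index truly "unchanged".
  return json.data[0].embedding;
}
```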

Philosophical reframe — stay on-brand

ARCIVE is calm, ambient, passive. Share targets are active, deliberate, productivity-coded. Two ways to keep multi-modal on-brand:

  • Voice-annotated capture: every shared item still becomes an audio memory with the link/image/doc attached, not a parallel “image entry.” Search/graph code unchanged.
  • Auto-correlation, not user-initiated share: with consent, auto-attach camera-roll photos taken during a recorded conversation, overlapping calendar events, or the foregrounded URL. The user never “shares” — ARCIVE notices. This is the version only ARCIVE could ship; Notion/Evernote/Raindrop can’t.

Revised order of operations

  1. Multimodal embedding swap (½ day, biggest leverage)
  2. Email-in (1 day, covers long tail)
  3. MCP write tools (already on roadmap — pull forward to V0.2)
  4. App Intents (iOS) + Intents (Android) (replaces share-extension work)
  5. Chrome MV3 extension (only surface MCP/email don’t cover well)
  6. PWA share_target as freebie on Android
  7. Skip dedicated native share extensions entirely

Net: same coverage, ~half the code, more on-brand, future-proofed against the agent shift.


Open decisions

  • Do we adopt the MCP-first strategy and pull MCP-write into V0.2?
  • Confirm multimodal embedding model choice (Voyage Multimodal-3 vs Cohere Embed v4) — needs eval on our domain.
  • Email-in: SES vs Postmark — pick based on existing infra.
  • Do we commit to the “every entry is still a memory” philosophical model, or allow first-class non-audio entries?
  • Auto-correlation features need explicit consent UX design before scoping.

Next concrete step (if greenlit)

Spike either (a) the MCP write-tool surface or (b) the email-in handler — both are 1-day builds and unlock a disproportionate share of the surface area.