Date: 2026-05-06
Format: Working session notes — exploratory, not a committed plan.
Outcome: ADR-0012 — Topics as first-class graph nodes (Proposed)
Companion docs: 2026-05-04_multimodal_expansion.md · 2026-05-05_ai_strategy_architecture.md
Question 1 — What are standard practices for tags in Obsidian and journaling apps? Inline #tag in prose, frontmatter, end-of-note? And what works best for an Obsidian-style graph view when the AI does most of the tagging and the user only occasionally intervenes?
Clarification first: in Obsidian, tags and graph view are mostly separate systems. The graph is driven by [[wikilinks]] between notes, not by #tags (tags appear as nodes but are secondary).
Standard practices, ranked by how systematic users do it:
- YAML frontmatter for structured tags — top of note, `tags:` field. Best for AI-generated tags: clean to parse, easy to edit programmatically, doesn’t clutter prose.
- Inline `#tags` within text — used when the tag is contextual to a paragraph. Good for human flow-of-thought tagging, messy for AI to manage.
- Tags appended at end — older journaling convention; superseded by frontmatter in Obsidian circles.
- `[[Wikilinks]]` inline — for actual concepts, people, projects. These build the graph. Zettelkasten leans heavily on this.
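For concreteness, a sketch of an AI-populated note under the frontmatter-plus-wikilinks convention (the note content is invented, illustrative only):

```ts
// Illustrative only: what an AI-populated note looks like under the
// "frontmatter for tags, inline wikilinks for entities" convention.
const exampleNote = `---
tags: [journal, anxiety, work]
---
Coffee with [[Sarah Kim]] about the [[Q2 Roadmap]]. Felt calmer than last
week; the prep is paying off.`;
```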
For an AI-mostly setup: frontmatter for tags, inline wikilinks for entities. AI populates both on ingest; user adjusts occasionally.
On graph behavior: wikilinks create note-to-note edges (the main structure). Tags appear as their own node type but don’t link to each other — they cluster, they don’t web. The interesting graph everyone screenshots comes almost entirely from wikilinks.
Question 2 — How do Obsidian, Miro, and mind-mapping tools compare, and what strategy does each imply?
Four families, sharply different philosophies:
Link-driven graphs — Obsidian, Logseq, Roam Research
Connections emerge from [[wikilinks]] and #tags written into the text. Graph is a byproduct of writing.
- AI fit: high — easy to inject links and frontmatter into markdown.
Typed-node / schema systems — Tana, Capacities, Notion
Tana’s supertags are the headline idea: a #meeting tag isn’t just a label, it’s a schema that adds fields (attendees, project, date) and turns the node into a queryable database row. Capacities works similarly.
- AI fit: very high — structured fields are exactly what LLMs produce well, and queries beat graph-staring for retrieval.
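To make "tag as schema" concrete, a sketch of what a supertag-shaped record could look like. The field and type names here are invented for illustration, not Tana's actual data model:

```ts
// Hypothetical shapes: illustrates "tag as schema", not Tana's real model.
interface SupertagField {
  name: string;                   // e.g. "attendees", "project", "date"
  type: "text" | "date" | "ref";  // refs point at other nodes
}

interface Supertag {
  name: string;                   // "#meeting"
  fields: SupertagField[];        // applying the tag adds these fields
}

interface Node {
  id: string;
  title: string;
  tags: string[];                                // supertag names applied
  values: Record<string, string | Date | Node>;  // filled per the schemas
}
// Retrieval is then a field query ("meetings where project = X"),
// not graph-staring.
```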
Spatial canvases — Miro, Heptabase, Scrintal, Excalidraw, Obsidian Canvas
No automatic graph — place cards, draw arrows. The “tagging” is spatial: clusters, colors, frames.
- AI fit: low for auto-organization, decent for generating starter cards.
Mind maps — XMind, MindMeister, Coggle, FreeMind
Strict parent-child hierarchy, one root. Tree, not graph. Cross-links are second-class.
- AI fit: medium — easy to generate a tree, but the model is too rigid for evolving knowledge.
Strategy implications for ARCIVE (AI-mostly, ambient, multimodal)
- Want emergent graph aesthetic → Obsidian-style (AI generates wikilinks for entities + frontmatter for categories). Cheapest, prettiest graph, weakest retrieval.
- Want actually queryable knowledge → Tana / Capacities. AI fills typed fields; you ask “all journal entries tagged anxious where I mentioned Sarah” and get a real answer. Graph is less central but more useful.
- Want visual thinking → Heptabase. AI drafts cards, user arranges.
- Mind maps and Miro are poor fits for ongoing journaling — they’re for one-shot ideation.
Honest take: Obsidian’s graph view is mostly eye candy past a few hundred notes. Tana-style typed tags scale better for AI-generated content because the AI’s output becomes structured data rather than prose-with-sprinkled-tags.
Question 3 — How would the Obsidian-style emergent graph be best implemented in Next.js 15 / Vercel + Expo, given multimodal input?
Key architectural insight: don’t store markdown with [[wikilinks]] like Obsidian does — ARCIVE isn’t file-based. Store the captured artifact for display, but extract entities and tags into relational tables so the graph and search are real database queries.
Data model:
```
notes(id, user_id, body_md, frontmatter jsonb, created_at, embedding vector)
entities(id, user_id, canonical_name, type, aliases text[], embedding vector)
tags(id, user_id, name, kind)
note_entities(note_id, entity_id, mention_text, char_offset)
note_tags(note_id, tag_id)
```

The entities table is what gives you the Obsidian graph: nodes = entities, edges = co-occurrence. Embeddings on entities solve the “is this Sarah the same Sarah?” problem — without entity resolution, you’ll have “Sarah”, “Sarah K”, “sarah” as three separate graph nodes within a week.
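Under that model the graph falls out of one query. A sketch, assuming node-postgres and the tables above (not shipped code):

```ts
import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.DATABASE_URL });

// Entity graph for one user: nodes are entities, an edge exists when two
// entities are mentioned in the same note, weighted by shared-note count.
export async function entityEdges(userId: string) {
  const { rows } = await db.query(
    `select a.entity_id as source,
            b.entity_id as target,
            count(*)    as weight
       from note_entities a
       join note_entities b
         on a.note_id   = b.note_id
        and a.entity_id < b.entity_id   -- each pair once, no self-edges
       join notes n on n.id = a.note_id
      where n.user_id = $1
      group by 1, 2
      order by weight desc`,
    [userId]
  );
  return rows; // [{ source, target, weight }]: feed straight into a force layout
}
```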
Multimodal extension — every modality eventually becomes (a) some text/caption and (b) one embedding vector. Per-attachment derivation:
| Modality | Derive step | Tool |
|---|---|---|
| Voice/video audio | Transcribe with timestamps | Deepgram, AssemblyAI, Whisper via Groq |
| Image | Caption + entity detection | Vision model (one call gets caption + entities) |
| Handwriting | OCR | Vision model; Google Vision as fallback |
| Screenshot | OCR + UI context | Vision model |
| EXIF | Pull location, time | exifr; reverse-geocode location to a place entity |
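The EXIF row is the cheapest win in the table. A sketch using exifr (a real library); the reverse-geocode step is a stub standing in for whatever service gets picked:

```ts
import exifr from "exifr";

// Stand-in for whatever reverse-geocoding service gets chosen; not a real API.
async function reverseGeocode(lat: number, lon: number): Promise<string> {
  return `place near ${lat.toFixed(3)},${lon.toFixed(3)}`; // stub
}

// Pull capture time + GPS from a photo and turn the location into a
// place-entity candidate.
export async function placeFromPhoto(file: Buffer) {
  const gps = await exifr.gps(file); // { latitude, longitude } | undefined
  const meta = await exifr.parse(file, ["DateTimeOriginal"]);
  if (!gps) return null;
  const place = await reverseGeocode(gps.latitude, gps.longitude);
  return { place, takenAt: meta?.DateTimeOriginal as Date | undefined };
}
```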
Multimodal embedding model > caption-then-embed. Voyage Multimodal-3 or Cohere Embed v4 — image+text in one shared space, one model, one HNSW index. Half-day swap, big recall win.
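A sketch of the one-model call path. The endpoint path and payload shape here are recollection, not verified against Voyage's docs, so treat every field name below as an assumption to check before relying on it:

```ts
// HEDGED SKETCH: endpoint and payload shape are from memory, NOT verified.
export async function embedMultimodal(text: string, imageUrl?: string) {
  const content: Record<string, unknown>[] = [{ type: "text", text }];
  if (imageUrl) content.push({ type: "image_url", image_url: imageUrl });

  const res = await fetch("https://api.voyageai.com/v1/multimodalembeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "voyage-multimodal-3", inputs: [{ content }] }),
  });
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding; // one vector regardless of modality mix
}
```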
Confidence weighting on edges:
- Text mention of “Sarah” → high
- Photo with face matching prior Sarahs → medium
- Voice memo whose transcript mentions Sarah → high (transcript is text)
- GPS in EXIF matching a place entity → automatic place edge
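If this lands, it can start as a flat weight table on the edge writer. Illustrative numbers only, untuned:

```ts
// Illustrative weights only; the real values would need tuning against data.
const EDGE_CONFIDENCE: Record<string, number> = {
  text_mention: 0.9,       // "Sarah" appears in prose
  transcript_mention: 0.9, // voice memo transcript is just text
  face_match: 0.6,         // vision-matched face against prior Sarahs
  exif_place: 1.0,         // GPS hit on a known place is deterministic
};
```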
Photos and places give you nodes that have thumbnails — way better aesthetic than Obsidian’s dot-and-label.
Cost reality:
- 5-min voice memo: ~$0.03 (transcription + extraction)
- Single photo: ~$0.01 vision
- Heavy user with 10 voice + 20 photos/day: ~$15/month API costs
Implications: debounce extraction (don’t re-extract on every edit), cache derivations, tier the model (Haiku 4.5 routine, escalate to Sonnet for ambiguous cases), batch photos in single vision calls.
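A minimal per-note debounce sketch for the first item; in-memory timers here, though the real version would debounce at the queue layer:

```ts
// Debounce extraction per note: rapid edits collapse into a single
// extraction call once the note has been quiet for `delayMs`.
const timers = new Map<string, ReturnType<typeof setTimeout>>();

export function scheduleExtraction(
  noteId: string,
  extract: (id: string) => Promise<void>,
  delayMs = 30_000,
) {
  const pending = timers.get(noteId);
  if (pending) clearTimeout(pending); // reset the clock on every edit
  timers.set(
    noteId,
    setTimeout(() => {
      timers.delete(noteId);
      void extract(noteId);
    }, delayMs),
  );
}
```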
Question 4 — Are we already extracting these tags? Could the same extraction produce inline-highlighted links in the viewer plus tags as separate metadata?
Yes — already doing it. supabase/functions/summarize-step/index.ts:24-51 extracts tags from transcript text. The prompt explicitly says topics are “GRAPH EDGES, not search facets” and biases toward recurring nodes (people, projects, themes), 2–4 typical, 5 max. Stored on memories.topics.
The chat summary’s “AI extracts tags from text” advice isn’t new — it’s the existing pipeline. The question is what we do downstream of that.
Three gaps in ARCIVE today (not redesign, but small additions):
1. Topics aren’t edges yet — only embeddings are
compute-edges-step/index.ts:46-64 builds memory_edges purely from embedding similarity (match_memories RPC, top-8, ≥0.55 cosine). Two memories that both tag Daniel are only connected if their vectors happen to be close. The cheaper, sharper edge is the topic itself: shared topic = explicit edge, no LLM/vector cost.
2. No topic canonicalization
Daniel / daniel / Dan will become three nodes within a week. Today’s topics text[] column has no canonical identity. The minimum fix is a topics table with normalize-on-insert (lowercase + trigram match against existing labels for that user).
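A minimal sketch of that fix, assuming pg_trgm is enabled and a topics(id, user_id, label) table; the similarity threshold is a guess that would need tuning:

```ts
import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.DATABASE_URL });

// Normalize-on-insert: lowercase the label, reuse an existing topic if a
// trigram match clears the threshold, otherwise insert a new row.
// Assumes `create extension pg_trgm` and a topics(id, user_id, label) table.
export async function resolveTopic(userId: string, raw: string): Promise<string> {
  const label = raw.trim().toLowerCase();

  const { rows } = await db.query(
    `select id, similarity(label, $2) as sim
       from topics
      where user_id = $1
        and similarity(label, $2) > 0.4   -- threshold needs tuning
      order by sim desc
      limit 1`,
    [userId, label]
  );
  if (rows[0]) return rows[0].id;

  const inserted = await db.query(
    `insert into topics (user_id, label) values ($1, $2) returning id`,
    [userId, label]
  );
  return inserted.rows[0].id;
}
```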
3. No inline highlighting → no in-text links
The right split: proper-case topics (Daniel, Q2 Roadmap) become inline highlights in the transcript viewer; lowercase topics (anxiety, public speaking) become chips above the transcript. The existing prompt almost gives this for free — case is already used to distinguish entities from themes.
Char offsets from the LLM are fragile. Just string-match topics against transcript at render time, case-insensitive word boundary, longest-match-first. That’s what most viewers do; cheap and robust.
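A sketch of that matcher (regex-escaping helper included; the React wiring to turn spans into highlights is omitted):

```ts
// Render-time topic matching: case-insensitive, word-boundary, longest topic
// first so "Q2 Roadmap" wins over "Q2". Returns {start, end, topic} spans.
function escapeRegex(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

export function findTopicSpans(transcript: string, topics: string[]) {
  const spans: { start: number; end: number; topic: string }[] = [];
  const taken: boolean[] = new Array(transcript.length).fill(false);

  // Longest-match-first: sort topics by length, descending.
  for (const topic of [...topics].sort((a, b) => b.length - a.length)) {
    const re = new RegExp(`\\b${escapeRegex(topic)}\\b`, "gi");
    for (const m of transcript.matchAll(re)) {
      const start = m.index!;
      const end = start + m[0].length;
      // Skip if any character is already claimed by a longer topic.
      if (taken.slice(start, end).some(Boolean)) continue;
      for (let i = start; i < end; i++) taken[i] = true;
      spans.push({ start, end, topic });
    }
  }
  return spans.sort((a, b) => a.start - b.start);
}
```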
Synthesis — the framing that became ADR-0012
The chat’s framing (“Obsidian aesthetic, Tana retrieval”) is useful as vocabulary but ARCIVE is structurally closer to Tana with a force-graph skin: typed entity nodes, AI-populated, queryable, with a graph view as one of several surfaces.
This is on-brand:
- Calm, ambient, AI-does-the-work (2026-05-04_multimodal_expansion.md § “Philosophical reframe”) — user never types `[[wikilinks]]`.
- Auto-correlation, not user-initiated tagging, is ARCIVE’s MOAT layer (ADR-0011 Layer 10).
- Multimodal landings — vision/OCR-extracted topics from photos flow into the same `memory_topics` table, no parallel system.
What lands in ADR-0012
Ships now (V0.3 slice):
- `topics` + `memory_topics` schema with `kind` (person | place | project | theme | event) and an embedding column reusing the Voyage-3-lite 512-d space.
- Extend the `summarize-step` prompt to return `[{label, kind}]`.
- New `link-topics-step` pgmq function with hybrid pg_trgm + pgvector resolution.
- Extend `compute-edges-step` to write topic-shared edges with `kind='topic'`; existing edges become `kind='semantic'` (edge-write sketch after this list).
- Render-time inline highlighting in the transcript viewer + chip row above.
- Universe view nodes are topics; memory↔memory edges become a secondary toggle.
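A sketch of the topic-shared edge write; the memory_edges column names (source_id, target_id, kind) are assumptions here, not confirmed schema:

```ts
import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.DATABASE_URL });

// Topic-shared edges: two memories that reference the same topic get an
// explicit kind='topic' edge; no LLM or vector cost involved.
export async function writeTopicEdges(userId: string) {
  await db.query(
    `insert into memory_edges (source_id, target_id, kind)
     select a.memory_id, b.memory_id, 'topic'
       from memory_topics a
       join memory_topics b
         on a.topic_id  = b.topic_id
        and a.memory_id < b.memory_id     -- each pair once
       join topics t on t.id = a.topic_id
      where t.user_id = $1
     on conflict do nothing`,
    [userId]
  );
}
```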
Deferred (future feature work):
- Confidence weighting on memory↔topic links.
- EXIF → place topics, free.
- Diarized speaker → person topic attribution.
- Vision/OCR-extracted topics for images (lands with multimodal ingest).
- MCP retrieval over topics (`memories_by_topic`, `related_topics`).
- Topic merge/split UX.
- Cluster-around-entity zoom view.
Open decisions
- Adopt ADR-0012 — accepted, shipped 2026-05-07 (PR #11).
- When mobile Universe view ships, build against topic nodes from day one (avoid a rebuild). Shipped 2026-05-06 (`feature/mobile-universe`) — Skia + d3-force, edges/nodes already topic-aware via the β.1 schema, no retrofit needed.
- Confirm Voyage Multimodal-3 vs Cohere Embed v4 for the multimodal swap (separate decision tracked in 2026-05-04_multimodal_expansion.md).
Lessons recorded
- Read the prompts before proposing infrastructure. The conversation was heading toward “we should add topic extraction” — a half-day discovery in the existing code revealed we’d been doing it for weeks and just stranding the output. The actual ADR is “use what we already extract,” not “extract more.”
- Obsidian-style wikilinks-in-prose is a file-format artifact, not a strategy. ARCIVE has no editable prose layer; that pattern is irrelevant. The vocabulary stays useful (entities-as-nodes, graph-as-surface) — the markup mechanics don’t.
- The graph ≠ the value. Obsidian power users discover the force-graph stops being useful past ~500 nodes. The actual product win is retrieval (typed entity queries, MCP surfaces) with the graph as one rendering of the same data.