ADR-0012: Topics as first-class graph nodes (entity-style), not stranded text[]

  • Status: Accepted
  • Date: 2026-05-06
  • Deciders: Sujith
  • Affected: supabase/migrations/, supabase/functions/summarize-step/, supabase/functions/compute-edges-step/, apps/web/app/(app)/memory/, apps/web/components/universe/, apps/mobile/ memory viewer, backend/mcp/arcive-memory-mcp/

Context

Three things forced this question on 2026-05-06:

  1. The Universe view is “rough.” README.md §“Built — pending validation” lists Universe as built but unvalidated. Inspection shows why: it renders memory_edges, which today are pure embedding-similarity edges (top-8, ≥0.55 cosine) from compute-edges-step/index.ts:46-64. The result is a fuzzy blob — not the legible “people/projects/places connect across entries” graph the product implies.
  2. Topics are already extracted from text but stranded. summarize-step/index.ts:24-51 extracts 2–4 topics per memory with the explicit framing “topics are GRAPH EDGES, not search facets.” They’re written to memories.topics text[] and never used as graph primitives. Two memories that both tag Daniel are only connected if their embeddings happen to land near each other.
  3. A working session on graph/tagging strategy (Obsidian/Tana/Heptabase comparison, multimodal expansion) raised the structural question: are entity nodes a separate layer or just a re-skin of memory edges? Working notes: ../discussions/2026-05-06_graph_tagging_strategy.md (to be added alongside this ADR).

The pre-existing constraints that scoped the answer: summarize-step already extracts topics, the product has a single Voyage-3-lite 512-d embedding space, and there is no editable prose layer (inputs are voice memos, photos, shared URLs).

Options considered

Option A — Leave topics as text[], do nothing structural

Keep the current shape. Universe view stays semantic-edge-only. Topics surface as chips on the memory detail view but don’t drive the graph.

  • Pros: zero migration cost; Universe today still “works” in the eye-candy sense.
  • Cons:
    • “Daniel” / “daniel” / “Dan” become three independent labels within a week — no canonicalization, no dedupe.
    • The graph never gets the entity-as-node aesthetic the product implies (and the working session validated as the right target).
    • Topic-shared connections — the cheapest, sharpest edge — are extracted and discarded.
    • Multimodal text artifacts (image captions, OCR, link readability) will produce more text[] topics with the same canonicalization gap, multiplying the problem.

Option B — Obsidian/Tana-style markdown rewrite

Rewrite the data model around editable markdown notes with [[wikilinks]] and YAML frontmatter; each tag is a Tana-style supertag with a schema.

  • Pros: very rich; community-validated patterns; aesthetically maximal.
  • Cons:
    • Wrong shape for ARCIVE. Inputs are voice memos, photos, shared URLs — there is no editable prose layer. Frontmatter and [[wikilinks]] are file-format artifacts of a tool ARCIVE isn’t.
    • Forces a UX (markdown editing, manual link-typing) that contradicts the brand’s “calm, ambient, AI-does-the-work” positioning (../discussions/2026-05-04_multimodal_expansion.md §“Philosophical reframe”).
    • Multi-month rewrite for an aesthetic that’s mostly delivered by Option C in days.

Option C — Promote topics to first-class entity nodes; render-time inline highlighting (chosen)

Keep extraction where it is. Add a small canonical layer — topics and memory_topics tables with a kind field — and use it in two places: the graph (entity-as-node, with topic-shared edges) and the transcript viewer (inline highlights for proper-noun topics, chip row for theme topics). No frontmatter, no wikilinks-in-prose, no Obsidian markdown layer.

  • Pros:
    • Reuses what already works — the existing summarize-step prompt is already topic-discipline-aware (“recurring nodes, not hyper-specific one-offs”); only the output schema and the consumer side need to change.
    • Topic-shared edges are a single SQL join — sharper and cheaper than embedding similarity, and still composable alongside it.
    • Canonicalization fixes “Daniel/daniel/Dan” once, and every downstream surface (search filter, graph node, MCP retrieval) inherits it.
    • Render-time string-match for inline highlighting avoids LLM-generated character offsets (which are routinely off-by-one and brittle).
    • Multimodal lands cleanly: vision/OCR/link extractors all emit topics into the same table — photo of Sarah connects to voice memo mentioning Sarah without new edge logic.
    • Lifts “Universe view” from “rough” to legible without a frontend rewrite — same canvas, richer nodes.
  • Cons:
    • One new migration, one summarize-step prompt extension, one new normalize-on-insert step, one viewer change.
    • kind taxonomy needs a starting set (person | place | project | theme | event) that we’ll likely refine. Acceptable: kinds are not user-visible primary structure, just a render hint.
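The "single SQL join" claim above is concrete enough to sketch. A minimal illustration in application code, assuming hypothetical names (`MemoryTopicRow`, `topicSharedEdges`) — the real implementation would be a SQL join inside compute-edges-step, not TypeScript:

```typescript
// Illustrative sketch: derive topic-shared edges from memory_topics
// join rows. One edge per unordered pair of memories sharing a topic.
interface MemoryTopicRow {
  memory_id: string;
  topic_id: string;
}

interface TopicEdge {
  a: string; // lexicographically smaller memory_id
  b: string;
  topic_id: string;
  kind: "topic";
}

function topicSharedEdges(rows: MemoryTopicRow[]): TopicEdge[] {
  // Group memory ids by topic.
  const byTopic = new Map<string, string[]>();
  for (const r of rows) {
    const list = byTopic.get(r.topic_id) ?? [];
    list.push(r.memory_id);
    byTopic.set(r.topic_id, list);
  }
  // Emit one edge per unordered pair within each topic group.
  const edges: TopicEdge[] = [];
  for (const [topic_id, mems] of byTopic) {
    const uniq = [...new Set(mems)].sort();
    for (let i = 0; i < uniq.length; i++) {
      for (let j = i + 1; j < uniq.length; j++) {
        edges.push({ a: uniq[i], b: uniq[j], topic_id, kind: "topic" });
      }
    }
  }
  return edges;
}
```

The same shape composes alongside the existing semantic edges: both land in memory_edges, distinguished by the kind column.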

Decision

Adopt Option C — Topics as first-class entity nodes; render-time inline highlighting.

Topics are extracted, canonicalized, and used as both graph nodes (Universe view) and viewer overlays (inline highlights + chip row). The change is layered — no rewrite of summarize, no rewrite of compute-edges, no rewrite of the Universe canvas.

Framing. This is an evolution of memory_edges, not a rewrite. ARCIVE is structurally closer to Tana with a force-graph skin (typed entity nodes, AI-populated, queryable, graph as one of several surfaces) than to Obsidian (file-based markdown with user-typed wikilinks). The vocabulary from Obsidian/Tana/Heptabase is useful; the file-format mechanics are not.

What ships now (V0.2 — bundled in one feature branch)

These are small, reinforcing changes. None are speculative.

  1. Schema migration: add topics(id, user_id, label, canonical_key, kind, embedding vector(512), created_at) and memory_topics(memory_id, topic_id, source). RLS mirrors memories. Index on (user_id, canonical_key); HNSW index on embedding for the vector half of resolution. The embedding column reuses the existing Voyage-3-lite 512-d space (01_SOFTWARE_PLAN §1) so topic and memory vectors live in the same space — no second embedding model.
  2. Prompt extension in summarize-step/index.ts: topics returned as [{label, kind}] instead of string[]. kind ∈ {person, place, project, theme, event}. Existing memories.topics text[] stays populated for backward compatibility (one release window), then is deprecated.
  3. Normalize-on-insert (new pgmq step link-topics, runs between summarize and embed): for each extracted {label, kind}, do hybrid resolution — pg_trgm + pgvector. First, pg_trgm similarity (≥0.85) against existing labels for that user catches surface-form variants (Daniel/daniel/Dan). Then, embed the label-in-context and run pgvector cosine match (≥0.80) against existing topic embeddings to catch semantic aliases that trigram misses (Dr. Patel ↔ my doctor, Q2 Roadmap ↔ the Q2 plan). Reuse on either match; create otherwise. Writes the memory_topics join row. New step lives in supabase/functions/link-topics-step/ and is wired into the pipeline tick the same way the existing steps are.
  4. Topic-shared edges in compute-edges-step/index.ts: alongside the existing semantic edges, write edges for memories sharing a topic_id. Add kind column to memory_edges ('semantic' | 'topic'); migrate existing rows to 'semantic'.
  5. Viewer inline highlighting: in the transcript viewer (web + mobile), word-boundary string-match memory_topics.label against the transcript; render proper-noun kinds (person | place | project) as clickable spans linking to the entity page; render theme/event kinds as chips above the transcript.
  6. Universe view: nodes are topics (not memories) by default, sized by memory-count. Photo/place topics get thumbnails on the node (first photo for kind='place', first attached photo for kind='person' once vision-extracted topics arrive — until then, kind-based glyph). Memories surface on click. Memory↔memory edges become a secondary toggle (“show similarity edges”) rather than the primary structure. No change to the canvas library — same renderer, different node source.
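Step 3's surface-form half can be sketched in a few lines. This is an illustrative TypeScript approximation of what pg_trgm does in SQL — `canonicalKey`, `trigrams`, and `trigramSimilarity` are hypothetical names, and the real resolution runs in the database, not application code:

```typescript
// Sketch of canonical-key derivation plus pg_trgm-style similarity.
// canonicalKey is the value stored in topics.canonical_key.
function canonicalKey(label: string): string {
  // Lowercase, strip diacritics and punctuation, collapse whitespace.
  return label
    .toLowerCase()
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/[^a-z0-9\s]/g, "")
    .trim()
    .replace(/\s+/g, " ");
}

function trigrams(s: string): Set<string> {
  // pg_trgm pads with two leading spaces and one trailing space.
  const padded = `  ${s} `;
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) grams.add(padded.slice(i, i + 3));
  return grams;
}

function trigramSimilarity(a: string, b: string): number {
  // Jaccard over trigram sets, as pg_trgm's similarity() computes it.
  const ta = trigrams(canonicalKey(a));
  const tb = trigrams(canonicalKey(b));
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}
```

Case variants (Daniel/daniel) score 1.0 and clear the ≥0.85 bar trivially; shortened forms like Dan score much lower, which is exactly why the second, pgvector half of the hybrid exists.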
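Step 5's render-time matching can also be sketched. A minimal version, assuming hypothetical names (`Segment`, `highlightTopics`) and ignoring per-platform rendering:

```typescript
// Sketch: word-boundary match of topic labels against a transcript,
// emitting alternating plain-text and entity segments for the viewer.
type Segment =
  | { kind: "text"; value: string }
  | { kind: "entity"; value: string; topicId: string };

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function highlightTopics(
  transcript: string,
  topics: { id: string; label: string }[],
): Segment[] {
  if (topics.length === 0) return [{ kind: "text", value: transcript }];
  // Longest labels first so "Daniel Park" wins over "Daniel".
  const sorted = [...topics].sort((a, b) => b.label.length - a.label.length);
  const pattern = sorted.map((t) => escapeRegExp(t.label)).join("|");
  const re = new RegExp(`\\b(${pattern})\\b`, "gi");
  const segments: Segment[] = [];
  let last = 0;
  for (const m of transcript.matchAll(re)) {
    const idx = m.index!;
    if (idx > last) segments.push({ kind: "text", value: transcript.slice(last, idx) });
    const hit = sorted.find((t) => t.label.toLowerCase() === m[0].toLowerCase())!;
    segments.push({ kind: "entity", value: m[0], topicId: hit.id });
    last = idx + m[0].length;
  }
  if (last < transcript.length) segments.push({ kind: "text", value: transcript.slice(last) });
  return segments;
}
```

Because the match happens at render time against the canonical label, there are no stored character offsets to drift when transcripts are re-run — which is the point of avoiding LLM-generated offsets.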

What’s deferred (future feature — re-evaluate after V0.2 ships)

  1. Confidence weighting on memory↔topic links. Today every link is implicitly 1.0. Future kinds — face-match in photos, GPS-derived places, voice prosody-derived emotions — need a confidence float and a slider in the graph view.
  2. EXIF → place topics, free. Photo ingest reads GPS, reverse-geocodes once, creates a kind='place' topic with no LLM call. Lands with the multi-modal ingest endpoint per ../discussions/2026-05-04_multimodal_expansion.md.
  3. Diarized speaker → person topic attribution. Today’s memory_participants + Pyannote re-ID identifies speakers but doesn’t link them to topics. Future: when diarization identifies Sarah as the speaker, attribute the topics she utters to Sarah, not just to the journal owner.
  4. Vision/OCR-extracted topics for images. Reuses the same memory_topics table; no new edges, no new UI. Blocked on multimodal ingest endpoint.
  5. MCP retrieval over topics. Add arcive.memories_by_topic(label) and arcive.related_topics(label) to the MCP server so the agent can navigate the graph, not just embed-search. Cheap once the schema lands.
  6. Topic merge/split UX. When canonicalization is wrong (Q2 Roadmap and Q2 roadmap were actually the same topic; Sarah was actually two different Sarahs), users need to merge or split. Keep on the shelf until first user complaint.
  7. Cluster-around-entity zoom view. Force-graphs degrade visually past ~500 nodes. Plan a primary “show me Sarah” zoomed view and reserve the global force-graph as secondary aesthetic. Re-evaluate when any user crosses ~300 topics.

Consequences

Easier

  • The Universe view stops being “rough” because nodes are now meaningful (people/projects/places) rather than implicit similarity blobs.
  • Search gets a free filter axis (memory_topics.topic_id).
  • The MCP server (Layer 5 retrieval) gains a topic-keyed surface that’s faster than vector search for known entities.
  • Multi-modal ingest reuses the same topic surface — no per-modality tagging system to maintain.

Harder / new responsibilities

  • One more pgmq step (or one extension to summarize-step) to maintain — normalize-on-insert.
  • The kind taxonomy becomes a small piece of product copy: when extraction is wrong, where it’s wrong matters (a theme rendered as inline link is awkward; a person rendered as a chip is wasted).
  • Backfill: a one-time job to re-run summarize on existing memories so legacy topics text[] becomes memory_topics rows. The existing backfill-summaries function is the right home — extend, don’t add a new function.

Impossible / explicitly out of scope

  • Editable markdown notes with user-typed [[wikilinks]]. Not now, not later — wrong shape for a multimodal ambient journal.
  • Frontmatter or YAML in stored memory bodies.

What this enables that wasn’t possible before

  • The Universe view becomes the graph the README implies, not the graph the README hopes you don’t look at too closely.
  • A memory without a recording (notes, shared links, future image-only entries) participates in the graph the same way a recorded memory does — the topic surface is modality-agnostic.

Recommendation on what’s already overdue

The slice marked “ships now” is already overdue, not future work:

  • Topics have been extracted and discarded for the duration of V0.1. Every summarize-step run since the prompt was tuned has thrown away the cheapest graph signal we have.
  • The canonicalization gap silently degrades the data with every recording — fixing it later costs more than fixing it now (legacy duplicates accumulate).
  • Universe view validation (per ../03_PROGRESS.md “validate Universe with real users”) is blocked on the graph being legible enough to validate. Showing users a fuzzy blob and asking “is this useful?” answers the wrong question.

The deferred items are genuine future work — they’re cheap additions once the schema lands but each adds product surface (confidence sliders, merge UX, place-from-EXIF) that would distract from validating the core slice.

Notes