ADR-0012: Topics as first-class graph nodes (entity-style), not stranded text[]

  • Status: Accepted
  • Date: 2026-05-06
  • Deciders: Sujith
  • Affected: supabase/migrations/, supabase/functions/summarize-step/, supabase/functions/compute-edges-step/, apps/web/app/(app)/memory/, apps/web/components/universe/, apps/mobile/ memory viewer, backend/mcp/arcive-memory-mcp/

Context

Three things forced this question on 2026-05-06:

  1. The Universe view is “rough.” README.md §“Built — pending validation” lists Universe as built but unvalidated. Inspection shows why: it renders memory_edges, which today are pure embedding-similarity edges (top-8, ≥0.55 cosine) from compute-edges-step/index.ts:46-64. The result is a fuzzy blob — not the legible “people/projects/places connect across entries” graph the product implies.
  2. Topics are already extracted from text but stranded. summarize-step/index.ts:24-51 extracts 2–4 topics per memory with the explicit framing “topics are GRAPH EDGES, not search facets.” They’re written to memories.topics text[] and never used as graph primitives. Two memories that both tag Daniel are only connected if their embeddings happen to land near each other.
  3. A working session on graph/tagging strategy (Obsidian/Tana/Heptabase comparison, multimodal expansion) raised the structural question: are entity nodes a separate layer or just a re-skin of memory edges? Working notes: ../discussions/2026-05-06_graph_tagging_strategy.md (to be added alongside this ADR).

The pre-existing constraints that scoped the answer: summarize-step already extracts topics, the product has a single Voyage-3-lite 512-d embedding space, and there is no editable prose layer (inputs are voice memos, photos, shared URLs).

Options considered

Option A — Leave topics as text[], do nothing structural

Keep the current shape. Universe view stays semantic-edge-only. Topics surface as chips on the memory detail view but don’t drive the graph.

  • Pros: zero migration cost; Universe today still “works” in the eye-candy sense.
  • Cons:
    • “Daniel” / “daniel” / “Dan” become three independent labels within a week — no canonicalization, no dedupe.
    • The graph never gets the entity-as-node aesthetic the product implies (and the working session validated as the right target).
    • Topic-shared connections — the cheapest, sharpest edge — are extracted and discarded.
    • Multimodal text artifacts (image captions, OCR, link readability) will produce more text[] topics with the same canonicalization gap, multiplying the problem.

Option B — Obsidian/Tana-style markdown rewrite

Rewrite the data model around editable markdown notes with [[wikilinks]] and YAML frontmatter; each tag is a Tana-style supertag with a schema.

  • Pros: very rich; community-validated patterns; aesthetically maximal.
  • Cons:
    • Wrong shape for ARCIVE. Inputs are voice memos, photos, shared URLs — there is no editable prose layer. Frontmatter and [[wikilinks]] are file-format artifacts of a tool ARCIVE isn’t.
    • Forces a UX (markdown editing, manual link-typing) that contradicts the brand’s “calm, ambient, AI-does-the-work” positioning (../discussions/2026-05-04_multimodal_expansion.md §“Philosophical reframe”).
    • Multi-month rewrite for an aesthetic that’s mostly delivered by Option C in days.

Option C — Promote topics to first-class entity nodes; render-time inline highlighting (chosen)

Keep extraction where it is. Add a small canonical layer — topics and memory_topics tables with a kind field — and use it in two places: the graph (entity-as-node, with topic-shared edges) and the transcript viewer (inline highlights for proper-noun topics, chip row for theme topics). No frontmatter, no wikilinks-in-prose, no Obsidian markdown layer.

  • Pros:
    • Reuses what already works — the existing summarize-step prompt is already topic-discipline-aware (“recurring nodes, not hyper-specific one-offs”); only the output schema and the consumer side need to change.
    • Topic-shared edges are a single SQL join — sharper and cheaper than embedding similarity, and still composable alongside it.
    • Canonicalization fixes “Daniel/daniel/Dan” once, and every downstream surface (search filter, graph node, MCP retrieval) inherits it.
    • Render-time string-match for inline highlighting avoids LLM-generated character offsets (which are routinely off-by-one and brittle).
    • Multimodal lands cleanly: vision/OCR/link extractors all emit topics into the same table — photo of Sarah connects to voice memo mentioning Sarah without new edge logic.
    • Lifts “Universe view” from “rough” to legible without a frontend rewrite — same canvas, richer nodes.
  • Cons:
    • One new migration, one summarize-step prompt extension, one new normalize-on-insert step, one viewer change.
    • kind taxonomy needs a starting set (person | place | project | theme | event) that we’ll likely refine. Acceptable: kinds are not user-visible primary structure, just a render hint.
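The "single SQL join" claim above is concrete enough to sketch. A minimal illustration in application code, assuming hypothetical names (`MemoryTopicRow`, `topicSharedEdges`) — the real implementation would be a SQL join inside compute-edges-step, not TypeScript:

```typescript
// Illustrative sketch: derive topic-shared edges from memory_topics
// join rows. One edge per unordered pair of memories sharing a topic.
interface MemoryTopicRow {
  memory_id: string;
  topic_id: string;
}

interface TopicEdge {
  a: string; // lexicographically smaller memory_id
  b: string;
  topic_id: string;
  kind: "topic";
}

function topicSharedEdges(rows: MemoryTopicRow[]): TopicEdge[] {
  // Group memory ids by topic.
  const byTopic = new Map<string, string[]>();
  for (const r of rows) {
    const list = byTopic.get(r.topic_id) ?? [];
    list.push(r.memory_id);
    byTopic.set(r.topic_id, list);
  }
  // Emit one edge per unordered pair within each topic group.
  const edges: TopicEdge[] = [];
  for (const [topic_id, mems] of byTopic) {
    const uniq = [...new Set(mems)].sort();
    for (let i = 0; i < uniq.length; i++) {
      for (let j = i + 1; j < uniq.length; j++) {
        edges.push({ a: uniq[i], b: uniq[j], topic_id, kind: "topic" });
      }
    }
  }
  return edges;
}
```

The same shape composes alongside the existing semantic edges: both land in memory_edges, distinguished by the kind column.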

Decision

Adopt Option C — Topics as first-class entity nodes; render-time inline highlighting.

Topics are extracted, canonicalized, and used as both graph nodes (Universe view) and viewer overlays (inline highlights + chip row). The change is layered — no rewrite of summarize, no rewrite of compute-edges, no rewrite of the Universe canvas.

Framing. This is an evolution of memory_edges, not a rewrite. ARCIVE is structurally closer to Tana with a force-graph skin (typed entity nodes, AI-populated, queryable, graph as one of several surfaces) than to Obsidian (file-based markdown with user-typed wikilinks). The vocabulary from Obsidian/Tana/Heptabase is useful; the file-format mechanics are not.

What ships now (V0.2 — bundled in one feature branch)

These are small, reinforcing changes. None are speculative.

  1. Schema migration: add topics(id, user_id, label, canonical_key, kind, embedding vector(512), created_at) and memory_topics(memory_id, topic_id, source). RLS mirrors memories. Index on (user_id, canonical_key); HNSW index on embedding for the vector half of resolution. The embedding column reuses the existing Voyage-3-lite 512-d space (01_SOFTWARE_PLAN §1) so topic and memory vectors live in the same space — no second embedding model.
  2. Prompt extension in summarize-step/index.ts: topics returned as [{label, kind}] instead of string[]. kind ∈ {person, place, project, theme, event}. Existing memories.topics text[] stays populated for backward compatibility (one release window), then is deprecated.
  3. Normalize-on-insert (new pgmq step link-topics, runs between summarize and embed): for each extracted {label, kind}, do hybrid resolution — pg_trgm + pgvector. First, pg_trgm similarity (≥0.85) against existing labels for that user catches surface-form variants (Daniel/daniel/Dan). Then, embed the label-in-context and run pgvector cosine match (≥0.80) against existing topic embeddings to catch semantic aliases that trigram misses (Dr. Patel ↔ my doctor, Q2 Roadmap ↔ the Q2 plan). Reuse on either match; create otherwise. Writes the memory_topics join row. New step lives in supabase/functions/link-topics-step/ and is wired into the pipeline tick the same way the existing steps are.
  4. Topic-shared edges in compute-edges-step/index.ts: alongside the existing semantic edges, write edges for memories sharing a topic_id. Add kind column to memory_edges ('semantic' | 'topic'); migrate existing rows to 'semantic'.
  5. Viewer inline highlighting: in the transcript viewer (web + mobile), word-boundary string-match memory_topics.label against the transcript; render proper-noun kinds (person | place | project) as clickable spans linking to the entity page; render theme/event kinds as chips above the transcript.
  6. Universe view: nodes are topics (not memories) by default, sized by memory-count. Photo/place topics get thumbnails on the node (first photo for kind='place', first attached photo for kind='person' once vision-extracted topics arrive — until then, kind-based glyph). Memories surface on click. Memory↔memory edges become a secondary toggle (“show similarity edges”) rather than the primary structure. No change to the canvas library — same renderer, different node source.
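Step 3's surface-form half can be sketched in a few lines. This is an illustrative TypeScript approximation of what pg_trgm does in SQL — `canonicalKey`, `trigrams`, and `trigramSimilarity` are hypothetical names, and the real resolution runs in the database, not application code:

```typescript
// Sketch of canonical-key derivation plus pg_trgm-style similarity.
// canonicalKey is the value stored in topics.canonical_key.
function canonicalKey(label: string): string {
  // Lowercase, strip diacritics and punctuation, collapse whitespace.
  return label
    .toLowerCase()
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/[^a-z0-9\s]/g, "")
    .trim()
    .replace(/\s+/g, " ");
}

function trigrams(s: string): Set<string> {
  // pg_trgm pads with two leading spaces and one trailing space.
  const padded = `  ${s} `;
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) grams.add(padded.slice(i, i + 3));
  return grams;
}

function trigramSimilarity(a: string, b: string): number {
  // Jaccard over trigram sets, as pg_trgm's similarity() computes it.
  const ta = trigrams(canonicalKey(a));
  const tb = trigrams(canonicalKey(b));
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}
```

Case variants (Daniel/daniel) score 1.0 and clear the ≥0.85 bar trivially; shortened forms like Dan score much lower, which is exactly why the second, pgvector half of the hybrid exists.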
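Step 5's render-time matching can also be sketched. A minimal version, assuming hypothetical names (`Segment`, `highlightTopics`) and ignoring per-platform rendering:

```typescript
// Sketch: word-boundary match of topic labels against a transcript,
// emitting alternating plain-text and entity segments for the viewer.
type Segment =
  | { kind: "text"; value: string }
  | { kind: "entity"; value: string; topicId: string };

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function highlightTopics(
  transcript: string,
  topics: { id: string; label: string }[],
): Segment[] {
  if (topics.length === 0) return [{ kind: "text", value: transcript }];
  // Longest labels first so "Daniel Park" wins over "Daniel".
  const sorted = [...topics].sort((a, b) => b.label.length - a.label.length);
  const pattern = sorted.map((t) => escapeRegExp(t.label)).join("|");
  const re = new RegExp(`\\b(${pattern})\\b`, "gi");
  const segments: Segment[] = [];
  let last = 0;
  for (const m of transcript.matchAll(re)) {
    const idx = m.index!;
    if (idx > last) segments.push({ kind: "text", value: transcript.slice(last, idx) });
    const hit = sorted.find((t) => t.label.toLowerCase() === m[0].toLowerCase())!;
    segments.push({ kind: "entity", value: m[0], topicId: hit.id });
    last = idx + m[0].length;
  }
  if (last < transcript.length) segments.push({ kind: "text", value: transcript.slice(last) });
  return segments;
}
```

Because the match happens at render time against the canonical label, there are no stored character offsets to drift when transcripts are re-run — which is the point of avoiding LLM-generated offsets.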

What’s deferred (future feature — re-evaluate after V0.2 ships)

  1. Confidence weighting on memory↔topic links. Today every link is implicitly 1.0. Future kinds — face-match in photos, GPS-derived places, voice prosody-derived emotions — need a confidence float and a slider in the graph view.
  2. EXIF → place topics, free. Photo ingest reads GPS, reverse-geocodes once, creates a kind='place' topic with no LLM call. Lands with the multi-modal ingest endpoint per ../discussions/2026-05-04_multimodal_expansion.md.
  3. Diarized speaker → person topic attribution. Today’s memory_participants + Pyannote re-ID identifies speakers but doesn’t link them to topics. Future: when diarization identifies Sarah as the speaker, attribute the topics she utters to Sarah, not just to the journal owner.
  4. Vision/OCR-extracted topics for images. Reuses the same memory_topics table; no new edges, no new UI. Blocked on multimodal ingest endpoint.
  5. MCP retrieval over topics. Add arcive.memories_by_topic(label) and arcive.related_topics(label) to the MCP server so the agent can navigate the graph, not just embed-search. Cheap once the schema lands.
  6. Topic merge/split UX. When canonicalization is wrong (Q2 Roadmap and Q2 roadmap were actually the same topic; Sarah was actually two different Sarahs), users need to merge or split. Keep on the shelf until first user complaint.
  7. Cluster-around-entity zoom view. Force-graphs degrade visually past ~500 nodes. Plan a primary “show me Sarah” zoomed view and reserve the global force-graph as secondary aesthetic. Re-evaluate when any user crosses ~300 topics.

Consequences

Easier

  • The Universe view stops being “rough” because nodes are now meaningful (people/projects/places) rather than implicit similarity blobs.
  • Search gets a free filter axis (memory_topics.topic_id).
  • The MCP server (Layer 5 retrieval) gains a topic-keyed surface that’s faster than vector search for known entities.
  • Multi-modal ingest reuses the same topic surface — no per-modality tagging system to maintain.

Harder / new responsibilities

  • One more pgmq step (or one extension to summarize-step) to maintain — normalize-on-insert.
  • The kind taxonomy becomes a small piece of product copy: when extraction is wrong, where it’s wrong matters (a theme rendered as inline link is awkward; a person rendered as a chip is wasted).
  • Backfill: a one-time job to re-run summarize on existing memories so legacy topics text[] becomes memory_topics rows. The existing backfill-summaries function is the right home — extend, don’t add a new function.

Impossible / explicitly out of scope

  • Editable markdown notes with user-typed [[wikilinks]]. Not now, not later — wrong shape for a multimodal ambient journal.
  • Frontmatter or YAML in stored memory bodies.

What this enables that wasn’t possible before

  • The Universe view becomes the graph the README implies, not the graph the README hopes you don’t look at too closely.
  • A memory without a recording (notes, shared links, future image-only entries) participates in the graph the same way a recorded memory does — the topic surface is modality-agnostic.

Recommendation on what’s already overdue

The slice marked “ships now” is already overdue, not future work:

  • Topics have been extracted and discarded for the duration of V0.1. Every summarize-step run since the prompt was tuned has thrown away the cheapest graph signal we have.
  • The canonicalization gap silently degrades the data with every recording — fixing it later costs more than fixing it now (legacy duplicates accumulate).
  • Universe view validation (per ../03_PROGRESS.md “validate Universe with real users”) is blocked on the graph being legible enough to validate. Showing users a fuzzy blob and asking “is this useful?” answers the wrong question.

The deferred items are genuine future work — they’re cheap additions once the schema lands but each adds product surface (confidence sliders, merge UX, place-from-EXIF) that would distract from validating the core slice.

Notes