ADR-0009: Web recorder long-session strategy (>30s)

Status: Proposed
Date: 2026-05-04
Affected: apps/web/components/recorder.tsx, supabase/functions/ingest-audio/, supabase/migrations/, apps/web/app/(app)/today/, apps/web/app/(app)/memory/[id]/

Context

The web recorder today is a single-shot capture: hold-record / stop / upload-once. The whole Blob lives in browser memory until stop, then POSTs as one body to /ingest-audio. The ingest function caps at 50 MB and the integration contract caps at 30 s per chunk (00_MASTER_PLAN §6.1, 02_HARDWARE_PLAN §6.1). A user holding record for >30 s on the web is currently invisible to the contract: the client has nothing telling it to stop, and the server applies the 30 s cap only as a billing ceiling while accepting the full blob up to 50 MB. So today, “5-minute web dictation” works by accident, not by design, and gets billed as 30 s.
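To make the billing gap concrete, here is a minimal sketch of the arithmetic described above. It assumes the billed duration is simply the recorded duration clamped to the 30 s cap; the helper names are hypothetical and the real ingest function's billing code is not shown in this ADR.

```typescript
// Assumption: mirrors the MAX_CHUNK_SECONDS = 30 cap this ADR cites for the
// ingest function. The clamping rule itself is inferred from the text.
const MAX_CHUNK_SECONDS = 30;

// Hypothetical helper: seconds billed for one upload under today's behavior.
function billedSecondsToday(recordedSeconds: number): number {
  return Math.min(recordedSeconds, MAX_CHUNK_SECONDS);
}

// Hypothetical helper: seconds of audio accepted and stored but never billed.
function unbilledSeconds(recordedSeconds: number): number {
  return recordedSeconds - billedSecondsToday(recordedSeconds);
}

const fiveMinutes = 5 * 60;
console.log(billedSecondsToday(fiveMinutes)); // 30
console.log(unbilledSeconds(fiveMinutes)); // 270
```

Under that assumption, a 5-minute dictation is billed for 30 s and the remaining 270 s (4.5 minutes) goes unbilled.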

The plan does say web should do chunked uploads (01_SOFTWARE_PLAN.md:336) and resumable on network drop (01_SOFTWARE_PLAN.md:121). What the plan does not say: when one user-intent is one continuous recording session that spans multiple 30 s chunks, do those chunks become N memories or one memory with N parts?

The current schema strongly implies one-memory-per-chunk. The memories table has recording_id uuid unique references recordings (migrations/20260503000000_init_v0_schema.sql:96), so each recording can produce at most one memory. There’s no recording_session_id and no concept of “parts.” Today’s ingest function mints a fresh recording_id per upload and chains transcribe → diarize → reid → summarize → embed against that single row. The pipeline has no notion of “stitch 10 chunks back into a single memory.”

This decision matters now because we just shipped two surfaces that make the gap visible: the Today list and the audio-playback button. Both render per-recording. A 5-minute dictation today would appear as one row with one transcript fragment and silently drop the other 4.5 minutes from billing. As soon as we either (a) start enforcing the 30 s cap on the client or (b) actually fix billing to be honest, the UX consequence shows up directly and we have to pick a model.

This is also a good moment to reaffirm the framing in 01_SOFTWARE_PLAN.md:65: “the web/PWA is for intentional dictation and review.” Long-form always-on capture is the device’s job and the native mobile app’s job. So the question is not “should web do hours of audio” — it’s “what’s the right shape for the 2–5 minute dictation case that does happen.”

Options considered

Option A — Session-level recording (multi-part schema)

Introduce a recording_sessions table. A session has many recordings (chunks). The web (and eventually device) seal each 30 s chunk and POST it with a session_id. The pipeline still runs per-chunk for transcribe/diarize/reid/embed, but summarize runs once at session-close against the stitched transcript, and the memories row points to the session, not a chunk.

Pros: Models the user’s mental model directly — “I recorded one thing.” Today list shows one row per session. Search and the graph stay coherent. Billing is the sum of chunk durations, which is honest.
Cons: Schema migration touches memories.recording_id, memory_participants, memory_spaces, the embed step, the summarize step, the export markdown shape, and the realtime subscription on memory-list-live. The session-close trigger has to handle “user closed the tab mid-session” (server-side timeout to flush). This is a real lift — XL by the size convention.
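As a rough illustration of what Option A asks of the pipeline, the sketch below stitches per-chunk transcripts in part order and sums chunk durations for billing. Every name here (SessionChunk, partIndex, stitchTranscript, sessionBilledSeconds) is hypothetical; none of them exist in the current schema or codebase.

```typescript
// Hypothetical shape of one chunk inside a recording session.
interface SessionChunk {
  recordingId: string;
  sessionId: string;
  partIndex: number; // 0-based position within the session (assumed field)
  durationSeconds: number;
  transcript: string; // output of the per-chunk transcribe step
}

// Join per-chunk transcripts in part order so summarize can run once at
// session close. Chunks may arrive out of order, so sort a copy first.
function stitchTranscript(chunks: SessionChunk[]): string {
  return [...chunks]
    .sort((a, b) => a.partIndex - b.partIndex)
    .map((chunk) => chunk.transcript.trim())
    .filter((text) => text.length > 0)
    .join(" ");
}

// Billing under Option A: the sum of chunk durations.
function sessionBilledSeconds(chunks: SessionChunk[]): number {
  return chunks.reduce((total, chunk) => total + chunk.durationSeconds, 0);
}
```

Sorting by an explicit part index rather than arrival time is one way to tolerate per-chunk retries; whether the real design would carry such an index is an open question for ADR-0010.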

Option B — Chunk-per-memory (status quo, made honest)

Keep recording_id-as-memory-key. Client slices into 30 s chunks and POSTs each as its own recording. Each becomes its own memory, its own list row. UX cue: when the user records >30 s, render the resulting list rows as a visually grouped cluster (“Continued recording — 4 of 10”) with shared timestamp range.

Pros: Zero schema change. Pipeline unchanged. Billing already works correctly (per chunk). Ships in a small diff. Resumable retry per-chunk is trivial — each chunk is independent.
Cons: Search and the universe graph see 10 fragments of one thought instead of one thought. Summary becomes nearly useless per 30-second slice. The clustering UI has to be a real feature, not a band-aid, or the list becomes unreadable. We end up rebuilding “session” semantics in the client without naming it, which is the worst of both options.
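The slicing and grouping cue in Option B can be sketched as a pure function that plans chunk boundaries and a position label for each list row. This is illustrative only: planChunks, ChunkPlan, and positionLabel are hypothetical names, and the sketch assumes fixed 30 s slices with a shorter final slice.

```typescript
// Assumption: slice length equals the 30 s per-chunk contract cap.
const CHUNK_SECONDS = 30;

// Hypothetical description of one planned chunk and its list-row cue.
interface ChunkPlan {
  index: number; // 1-based position in the cluster
  total: number;
  startSeconds: number;
  endSeconds: number;
  positionLabel: string; // e.g. "4 of 10"
}

// Plan the chunks for a recording of the given length.
function planChunks(totalSeconds: number): ChunkPlan[] {
  const total = Math.ceil(totalSeconds / CHUNK_SECONDS);
  const plans: ChunkPlan[] = [];
  for (let i = 0; i < total; i++) {
    const startSeconds = i * CHUNK_SECONDS;
    const endSeconds = Math.min(startSeconds + CHUNK_SECONDS, totalSeconds);
    plans.push({
      index: i + 1,
      total,
      startSeconds,
      endSeconds,
      positionLabel: `${i + 1} of ${total}`,
    });
  }
  return plans;
}

console.log(planChunks(300).length); // 10
```

A 5-minute dictation yields ten rows under this plan, which is the fragmentation the cons above describe.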

Option C — Defer web long-form to mobile parity (chosen)

Cap web recordings at 30 s on the client. When the user has been holding record for 25 s, show a visible warning (“Wrapping up in 5 s. Start a new recording for more”). Stop at 30 s. The web remains the “intentional dictation” surface the plan named it as. Long-form always-on capture lives in the V0.2 mobile app (03_PROGRESS.md:82) and the V0.x device (02_HARDWARE_PLAN.md), both of which have the right surfaces for it (background-audio mode, foreground service, on-device buffer).

When mobile and device land, they’ll need session semantics anyway — at that point Option A becomes the right design conversation, informed by real device traffic instead of speculation.

Pros: Honors the plan’s framing of web’s role. Zero schema change. Forces the session-vs-chunk question to be answered when the surfaces that actually need it (mobile, device) arrive, with real telemetry to inform the choice. Keeps the memories.recording_id unique invariant intact. Smallest diff: one client-side timer + a UI cue.
Cons: Web users who today record >30 s “by accident” get a ceiling. We have to either (a) message that clearly or (b) accept it as a regression in a feature that was never billed honestly anyway. The 30 s ceiling needs UX work — a visible countdown is the difference between “feels intentional” and “feels broken.”

Decision

Option C. Cap web recordings at 30 s with a visible countdown, and defer the session-vs-chunk question to whenever the mobile or device track first records something >30 s. At that point we will revisit with ADR-0010 and almost certainly land on Option A — because mobile and device fundamentally cannot operate as single-shot uploaders.

The web ceiling is implemented as: at 25 s, recorder shows a “5 s left” indicator; at 30 s, recorder auto-stops and uploads. The existing MAX_CHUNK_SECONDS = 30 server-side cap (supabase/functions/ingest-audio/index.ts:75) becomes load-bearing instead of decorative — the client now respects it and the contract is end-to-end consistent.
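The ceiling rule above can be expressed as a small pure function that maps elapsed time to what the recorder should show. This is a sketch under assumptions, not the code in recorder.tsx: RecorderPhase, phaseAt, and WARNING_AT_SECONDS are hypothetical names, and only MAX_CHUNK_SECONDS = 30 comes from this ADR.

```typescript
const MAX_CHUNK_SECONDS = 30; // matches the server-side cap cited above
const WARNING_AT_SECONDS = 25; // assumed constant for the "5 s left" cue

// Hypothetical UI phases for the recorder.
type RecorderPhase =
  | { kind: "recording" }
  | { kind: "warning"; secondsLeft: number }
  | { kind: "stopped" };

// Decide the phase from elapsed recording time in seconds.
function phaseAt(elapsedSeconds: number): RecorderPhase {
  if (elapsedSeconds >= MAX_CHUNK_SECONDS) {
    return { kind: "stopped" }; // caller auto-stops and uploads
  }
  if (elapsedSeconds >= WARNING_AT_SECONDS) {
    // Round up so the indicator never shows "0 s left" while still recording.
    const secondsLeft = Math.ceil(MAX_CHUNK_SECONDS - elapsedSeconds);
    return { kind: "warning", secondsLeft };
  }
  return { kind: "recording" };
}

console.log(phaseAt(25)); // { kind: "warning", secondsLeft: 5 }
```

Keeping the rule pure means the component only has to call it on each timer tick, and the thresholds can be unit-tested without a browser or a microphone.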

The “Web recorder: chunked upload for >30s sessions” line in 03_PROGRESS.md:122 is closed without implementation. Replaced by:

“Web recorder: 30 s ceiling with countdown UI” (ships now)
“Session schema decision (ADR-0010)” (deferred to first mobile or device long-form arrival)

Consequences

The current “5-minute dictation works by accident” path stops working. Users who were relying on it (probably nobody — the surface is small) get a clear cap and a message.
Billing becomes honest. Today, anything >30 s gets billed as 30 s and the rest is free audio storage. After the cap, every second the user records is also a second they’re billed for — matches the tier-cap intent in 00_MASTER_PLAN.md §8.
The 30 s constraint is now one rule with one enforcement point per surface — server caps every body, client respects it before it sends. Hardware already does the same. No surface lies to another.
The memories.recording_id unique invariant holds. No schema motion. The audio-playback feature shipped this week works unchanged — one signed URL per memory, full recording included.
The session-vs-chunk question is not answered, only deferred. This is on purpose. We will know more in 2–3 weeks once mobile background capture is real and the device is uploading actual multi-minute streams. Speculating now means designing for traffic we haven’t seen.
The plan-level row “Chunked HTTPS upload — ✅ fetch + ReadableStream” in 01_SOFTWARE_PLAN.md:336 becomes a description of mobile/device only, not web. Update needed: qualify the web column with a footnote pointing at this ADR.
Pipeline B (group streaming, V0.3+) is unaffected. It already has a separate endpoint, separate session token, separate data path — and never planned to share /ingest-audio’s single-chunk contract.

Notes

This ADR closes the “chunked upload” TODO without writing chunked upload, on purpose. The honest framing is: chunked upload on the web in isolation is a feature with no clear data model behind it, and shipping it ahead of the schema discussion would create either fragmented memories (bad UX) or implicit session semantics in the client (worst of both worlds).

When ADR-0010 lands, the work it unlocks is: recording_sessions schema, session-aware summarize, list-view grouping, search across sessions, export shape. That’s the right time to also revisit whether the web should join — by then we’ll know if anyone wants to.
