- Status: Accepted
- Date: 2026-05-03
- Affected: `supabase/migrations/`, `supabase/functions/pipeline-tick/`, `supabase/functions/_shared/pipeline.ts`
## Context
ADR-0001 chose pgmq + pg_cron for V0.1 pipeline orchestration and explicitly deferred the DLQ question: “no exponential backoff yet, no DLQ — see ADR-TBD when we add one.” That moment is now — production hardening item #1 in the session handoff.
Today’s behavior: each step Edge Function deletes its own pgmq message
on success and leaves it for vt-based redelivery on failure. Failure
is silent and unbounded. A poison message — bad audio file, malformed
diarization payload, deleted recording row — retries every 60 seconds
forever, consumes capacity, and is invisible until someone tails Edge
Function logs.
We need three things: (1) a retry ceiling, (2) a queryable place for
stuck jobs to go, and (3) a path to manual replay once the underlying
issue is fixed. The pgmq message itself already carries `read_ct`, the
delivery-attempt counter — that’s enough signal to detect “stuck”
without a sidecar counter.
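For orientation — not code from this repo — this is roughly what a batch read returns, using pgmq’s standard `read` signature; the queue name and batch size match the pipeline described here, but the exact call site lives in `pipeline-tick`:

```sql
-- Read up to 5 messages, hiding each for 60 seconds (the visibility timeout).
-- Each returned row carries read_ct, the delivery-attempt counter this ADR
-- relies on: a healthy message shows read_ct = 1, a poison message keeps
-- climbing on every redelivery.
select msg_id, read_ct, enqueued_at, vt, message
from pgmq.read(
  queue_name => 'pipeline_jobs',
  vt         => 60,
  qty        => 5
);
```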
## Options considered
### Option A — Sibling DLQ queue per source queue (chosen)
Promote a stuck message by `pgmq.send`-ing the original payload into
`pipeline_jobs_dlq`, then `pgmq.delete`-ing it from `pipeline_jobs`.
Insert a row into a `pipeline_dead_letters` diagnostics table with
the error string, step, `recording_id`, `read_ct`, and timestamps. The
DLQ message payload stays minimal and identical to the live-queue
payload — no error text, no metadata duplication.
- Pros: pgmq-idiomatic; the sibling queue uses the same `send`/`read`/`delete` ops the live queue does. Clean separation: the live queue holds work, the DLQ holds dead messages, the diagnostics table holds the error narrative. Manual replay is `pgmq.read` from DLQ → `pgmq.send` to live → delete the diagnostics row, with no bespoke tooling. Scales 1:1 if more queues land later (each gets its own `_dlq` sibling), which makes it obvious which dead messages belong to which pipeline.
- Cons: Two queues to operate. A small amount of extra motion (one `send` + one `delete`) per dead message — but only at the dead boundary, not on the hot path.
### Option B — One global `pipeline_dead_letters` table, mark-and-store
Skip the DLQ queue. When `read_ct > MAX_RETRIES`, copy the payload
into a regular table with the error context, then delete from the
live queue.
- Pros: One artifact instead of two. Slightly simpler migration.
- Cons: Replay tooling becomes bespoke — there’s no `pgmq.send` from a table row; you’d write SQL or a script to push messages back into the queue. Throws away the queue affordance for re-draining.
### Option C — Mark in place via sidecar set
Don’t move the message at all. Maintain a `pipeline_dead_msg_ids`
table the tick consults; skip dispatch if a message’s id is in that
set. The dead message stays in the live queue, gets read repeatedly,
and is no-op’d.
- Pros: Zero data motion.
- Cons: Mixes dead and live messages. Every `pgmq.read` batch wastes slots on dead ones. The sidecar set keys on `msg_id`, which is pgmq’s internal counter — fragile if the queue is recreated. Defeats the point of the `read_ct` ceiling: dead messages still tick the counter forever.
## Decision
Option A. Add a sibling DLQ queue `pipeline_jobs_dlq` and a
`pipeline_dead_letters` diagnostics table. The dead-letter promotion
check lives at the dispatch boundary in `pipeline-tick`, not
inside step functions — one place, one rule, no per-step drift.
When `pipeline-tick` reads a batch and a message has
`read_ct > MAX_RETRIES`:

1. `pgmq.send` the payload to `pipeline_jobs_dlq`.
2. `insert into pipeline_dead_letters` with `(queue_name, original_msg_id, payload, step, recording_id, read_ct, last_error, first_seen, dead_at)`. `last_error` is the most recent step-side failure if available; `null` is acceptable for V0.3.
3. `pgmq.delete` from `pipeline_jobs`.
4. Skip dispatch.
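The promotion sequence can be sketched as the SQL that `pipeline-tick` would issue per stuck message. The `pgmq.send`/`pgmq.delete` calls are the real pgmq API; the `:param` placeholders and the exact `pipeline_dead_letters` column types are illustrative, since the migration defining them ships with this slice:

```sql
-- Illustrative promotion of one stuck message. :payload, :msg_id, :step,
-- :recording_id, :read_ct, :last_error, :enqueued_at are bound by the tick.

-- 1. Copy the original payload, unchanged, into the sibling DLQ.
select pgmq.send('pipeline_jobs_dlq', :payload);

-- 2. Record the error narrative in the diagnostics table.
insert into pipeline_dead_letters
  (queue_name, original_msg_id, payload, step, recording_id,
   read_ct, last_error, first_seen, dead_at)
values
  ('pipeline_jobs', :msg_id, :payload, :step, :recording_id,
   :read_ct, :last_error, :enqueued_at, now());

-- 3. Remove the message from the live queue so it stops redelivering.
select pgmq.delete('pipeline_jobs', :msg_id);

-- 4. (In the tick itself: skip dispatch for this message.)
```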
`MAX_RETRIES` defaults to 5, env-tunable via
`PIPELINE_MAX_RETRIES`. The first delivery counts as `read_ct = 1`,
so 5 means “five honest tries before we give up.”
Step functions keep their current pattern: succeed and `pgmq.delete`,
or do nothing and let the visibility timeout re-deliver. Step
functions do not decide when to dead-letter. Step-side terminal
errors (e.g. recording row deleted) can be added as an explicit
“promote now” path in a later slice; for V0.3, the retry ceiling is
sufficient.
The diagnostics table includes a 30-day retention policy. The pruning
job (a pg_cron `delete from pipeline_dead_letters where dead_at < now() - interval '30 days'`) can ship in this slice or a follow-up,
but the policy is part of this decision — without a TTL, the table
grows unbounded, and that’s a bug, not an open question.
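Whenever the pruning job ships, it would look roughly like this, using pg_cron’s standard `cron.schedule` call; the job name and 03:00 UTC schedule are placeholders, not settled here:

```sql
-- Hypothetical daily prune enforcing the 30-day diagnostics TTL.
-- 'prune-pipeline-dead-letters' and the cron expression are assumptions.
select cron.schedule(
  'prune-pipeline-dead-letters',
  '0 3 * * *',  -- daily at 03:00 UTC
  $$ delete from pipeline_dead_letters
     where dead_at < now() - interval '30 days' $$
);
```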
## Consequences
- New surface area: one queue (`pipeline_jobs_dlq`), one table (`pipeline_dead_letters`), one env var (`PIPELINE_MAX_RETRIES`). `pipeline-tick` grows a check at the top of the dispatch loop. The hot path is unchanged when no message is stuck.
- Operations gets a simple SQL surface: `select * from pipeline_dead_letters order by dead_at desc` shows what’s stuck; joining with `recordings` shows whose pipeline failed.
- Replay is queue-native. `pgmq.read('pipeline_jobs_dlq', …)` → `pgmq.send('pipeline_jobs', payload)` → `pgmq.delete` from DLQ → delete the matching diagnostics row. No new code path needed; an operator can do it from the SQL editor. A Settings UI for replay is a future nice-to-have, not a blocker.
- TTL on diagnostics: 30 days. The pruning job is a follow-up slice (likely a daily `pg_cron.schedule` entry), but the contract is set here so anyone reading dead-letter rows knows they’re ephemeral. Anything we want to keep longer goes into Sentry (see hardening item #3) or PostHog, not this table.
- DLQ retention is unbounded for V0.3. A pgmq DLQ with a few hundred rows is fine; if a queue accumulates real volume, we’ll add a `pgmq.archive` policy later. The diagnostics table TTL is the safety valve for now.
- No exponential backoff in this slice. pgmq’s fixed 60s visibility timeout is the entire backoff for now. Five retries at 60s = ~5 minutes from first failure to dead-letter, which is fast enough to detect a real outage and slow enough that a transient blip recovers without ceremony.
- The `read_ct > MAX_RETRIES` check fires at the next tick after the 5th failure, not at the 5th failure itself. This is fine — the message is invisible during `vt`, so no work is wasted. The alternative (counting on the step side) requires shared state and reintroduces the per-step drift this ADR is closing.
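The queue-native replay above can be run by hand from the SQL editor. A sketch under stated assumptions — the `msg_id` values (DLQ message id vs. the `original_msg_id` recorded at promotion time) are placeholders the operator reads off first:

```sql
-- Step 1+2: read one dead message (hide it for 60s so two operators
-- don't race) and push its payload back onto the live queue.
select pgmq.send('pipeline_jobs', message)
from pgmq.read('pipeline_jobs_dlq', 60, 1);

-- Step 3: remove it from the DLQ; 42 is the DLQ msg_id seen in step 1.
select pgmq.delete('pipeline_jobs_dlq', 42);

-- Step 4: clear the diagnostics row; 17 is the original live-queue
-- msg_id recorded at promotion time.
delete from pipeline_dead_letters
where original_msg_id = 17;
```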
## Notes
pgmq exposes `read_ct` on every message read (see the `QueueMessage`
type in `_shared/pipeline.ts`). Promotion uses the same `pgmq.send` /
`pgmq.delete` calls we already use elsewhere. The DLQ queue is created
in the same defensive `DO ... EXCEPTION` block style as
`pipeline_jobs`, so missing extensions never break the migration suite
— the direct-HTTP fallback (per ADR-0001) skips the queue entirely
anyway.
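A minimal sketch of that defensive creation style — the exact exception conditions guarded in the real `pipeline_jobs` migration may differ:

```sql
-- Create the DLQ only if pgmq is installed; swallow the error otherwise
-- so the migration suite never fails on a missing extension.
do $$
begin
  perform pgmq.create('pipeline_jobs_dlq');
exception
  when undefined_function or undefined_table then
    raise notice 'pgmq not installed; skipping pipeline_jobs_dlq creation';
end
$$;
```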
If the V0.2+ migration to Inngest from ADR-0001’s “Re-evaluate” clause ever happens, this DLQ design ports cleanly: Inngest has native dead-letter semantics. The diagnostics table is the only piece worth keeping.