
ADR-0008: pgmq dead-letter queue shape

  • Status: Accepted
  • Date: 2026-05-03
  • Affected: supabase/migrations/, supabase/functions/pipeline-tick/, supabase/functions/_shared/pipeline.ts

Context

ADR-0001 chose pgmq + pg_cron for V0.1 pipeline orchestration and explicitly deferred the DLQ question: “no exponential backoff yet, no DLQ — see ADR-TBD when we add one.” That moment is now — production hardening item #1 in the session handoff.

Today’s behavior: each step Edge Function deletes its own pgmq message on success and leaves it for vt-based redelivery on failure. Failure is silent and unbounded. A poison message — bad audio file, malformed diarization payload, deleted recording row — retries every 60 seconds forever, consumes capacity, and is invisible until someone tails Edge Function logs.

We need three things: (1) a retry ceiling, (2) a place stuck jobs go that’s queryable, (3) a path to manual replay once the underlying issue is fixed. The pgmq message itself already carries read_ct, the delivery-attempt counter — that’s enough signal to detect “stuck” without a sidecar counter.
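
For concreteness, a sketch of what the tick already sees on every batch read, using pgmq's standard read signature (the 60-second vt and batch size of 10 are illustrative, not the tuned values):

    -- Read up to 10 messages, making each invisible for 60 seconds.
    -- pgmq increments read_ct on every delivery, so a message that
    -- arrives with read_ct = 6 has already been tried five times.
    select msg_id, read_ct, enqueued_at, message
    from pgmq.read('pipeline_jobs', 60, 10);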

Options considered

Option A — Sibling DLQ queue per source queue (chosen)

Promote a stuck message by pgmq.send-ing the original payload into pipeline_jobs_dlq, then pgmq.delete-ing it from pipeline_jobs. Insert a row into a pipeline_dead_letters diagnostics table with the error string, step, recording_id, read_ct, and timestamps. The DLQ message payload stays minimal and identical to the live-queue payload — no error text, no metadata duplication.

  • Pros: pgmq-idiomatic; the sibling queue uses the same send/read/delete ops the live queue does. Clean separation: the live queue holds work, the DLQ holds the dead, the diagnostics table holds the error narrative. Manual replay is pgmq.read from the DLQ → pgmq.send to live → pgmq.delete from the DLQ → delete the diagnostics row, with no bespoke tooling. Scales 1:1 if more queues land later (each gets its own _dlq sibling), and the 1:1 mapping makes it obvious which dead messages belong to which pipeline.
  • Cons: Two queues to operate. A small amount of motion (one extra send + one delete) per dead message — but only at the dead boundary, not the hot path.

Option B — One global pipeline_dead_letters table, mark-and-store

Skip the DLQ queue. When read_ct > MAX_RETRIES, copy the payload into a regular table with the error context, then delete from the live queue.

  • Pros: One artifact instead of two. Slightly simpler migration.
  • Cons: Replay tooling becomes bespoke: there is no pgmq.send from a row, so you'd write custom SQL or a script to push messages back into the live queue. Throws away the queue affordances that make re-draining trivial.

Option C — Mark in place via sidecar set

Don’t move the message at all. Maintain a pipeline_dead_msg_ids table the tick consults; skip dispatch if a message’s id is in that set. The dead message stays in the live queue, gets read repeatedly, and is no-op’d.

  • Pros: Zero data motion.
  • Cons: Mixes dead and live messages. Every pgmq.read batch wastes slots on dead ones. The sidecar set keys on msg_id, pgmq's internal per-queue counter, which is fragile if the queue is ever recreated. Defeats the point of a read_ct ceiling: dead messages still tick the counter forever.

Decision

Option A. Add a sibling DLQ queue pipeline_jobs_dlq and a pipeline_dead_letters diagnostics table. The dead-letter promotion check lives at the dispatch boundary in pipeline-tick, not inside step functions — one place, one rule, no per-step drift.
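
A minimal sketch of the two new artifacts; the column types, the identity key, and the uuid type for recording_id are assumptions, not a reviewed migration:

    -- Sibling DLQ: same pgmq machinery as the live queue.
    select pgmq.create('pipeline_jobs_dlq');

    -- Diagnostics table: holds the error narrative, never the work.
    create table if not exists pipeline_dead_letters (
      id              bigint generated always as identity primary key,
      queue_name      text        not null,
      original_msg_id bigint      not null,
      payload         jsonb       not null,
      step            text        not null,
      recording_id    uuid,
      read_ct         int         not null,
      last_error      text,                 -- null is acceptable for V0.3
      first_seen      timestamptz,
      dead_at         timestamptz not null default now()
    );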

When pipeline-tick reads a batch and a message has read_ct > MAX_RETRIES:

  1. pgmq.send the payload to pipeline_jobs_dlq.
  2. insert into pipeline_dead_letters with (queue_name, original_msg_id, payload, step, recording_id, read_ct, last_error, first_seen, dead_at). last_error is the most recent step-side failure if available; null is acceptable for V0.3.
  3. pgmq.delete from pipeline_jobs.
  4. Skip dispatch.

MAX_RETRIES defaults to 5, env-tunable via PIPELINE_MAX_RETRIES. The first delivery counts as read_ct = 1, so 5 means “five honest tries before we give up.”
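
The same sequence sketched as SQL for a single stuck message. In practice pipeline-tick drives this from the dispatch loop; the :placeholders stand in for values the tick already holds from pgmq.read and are illustrative, not a real binding scheme:

    -- 1. Move the payload, unchanged, onto the sibling DLQ.
    select pgmq.send('pipeline_jobs_dlq', :payload);

    -- 2. Record the error narrative in the diagnostics table.
    insert into pipeline_dead_letters
      (queue_name, original_msg_id, payload, step, recording_id,
       read_ct, last_error, first_seen, dead_at)
    values
      ('pipeline_jobs', :msg_id, :payload, :step, :recording_id,
       :read_ct, :last_error, :first_seen, now());

    -- 3. Remove the dead message from the live queue.
    select pgmq.delete('pipeline_jobs', :msg_id);

    -- 4. pipeline-tick then skips dispatch for this message.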

Step functions keep their current pattern: succeed and pgmq.delete, or do nothing and let the visibility timeout re-deliver. Step functions do not decide when to dead-letter. Step-side terminal errors (e.g. recording row deleted) can be added as an explicit “promote now” path in a later slice; for V0.3, retry-ceiling is sufficient.

The diagnostics table includes a 30-day retention policy. The pruning job (a pg_cron delete from pipeline_dead_letters where dead_at < now() - interval '30 days') can ship in this slice or a follow-up, but the policy is part of this decision: without a TTL the table grows without bound, and that's a bug, not an open question.
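
A sketch of that pruning job, assuming stock pg_cron; the job name and the daily 03:00 UTC schedule are illustrative:

    -- Once a day, drop diagnostics rows older than the 30-day TTL.
    select cron.schedule(
      'prune-pipeline-dead-letters',
      '0 3 * * *',
      $$ delete from pipeline_dead_letters
         where dead_at < now() - interval '30 days' $$
    );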

Consequences

  • New surface area: one queue (pipeline_jobs_dlq), one table (pipeline_dead_letters), one env var (PIPELINE_MAX_RETRIES).
  • pipeline-tick grows a check at the top of the dispatch loop. The hot path is unchanged when no message is stuck.
  • Operations gets a simple SQL surface: select * from pipeline_dead_letters order by dead_at desc shows what’s stuck; joining with recordings shows whose pipeline failed.
  • Replay is queue-native. pgmq.read('pipeline_jobs_dlq', …) → pgmq.send('pipeline_jobs', payload) → pgmq.delete from the DLQ → delete the matching diagnostics row (sketched after this list). No new code path needed; an operator can do it from the SQL editor. A Settings UI for replay is a future nice-to-have, not a blocker.
  • TTL on diagnostics: 30 days. The pruning job (likely a daily cron.schedule entry, sketched in the Decision section) can ship in this slice or a follow-up, but the contract is set here so anyone reading dead-letter rows knows they’re ephemeral. Anything we want to keep longer goes into Sentry (see hardening item #3) or PostHog, not this table.
  • DLQ retention is unbounded for V0.3. A pgmq DLQ with a few hundred rows is fine; if a queue accumulates real volume, we’ll add a pgmq.archive policy later. The diagnostics table TTL is the safety valve for now.
  • No exponential backoff in this slice. pgmq’s fixed 60s visibility timeout is the entire backoff for now. Five retries at 60s = ~5 minutes from first failure to dead-letter, which is fast enough to detect a real outage and slow enough that a transient blip recovers without ceremony.
  • The read_ct > MAX_RETRIES check fires at the next tick after the 5th failure, not the 5th failure itself. This is fine — the message is invisible during vt, so no work is wasted. The alternative (counting on the step side) requires shared state and reintroduces the per-step drift this ADR is closing.
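
The replay sequence referenced above, sketched as statements an operator could run from the SQL editor. The :placeholders stand in for the msg_id and message returned by the read, and matching the diagnostics row on recording_id is an assumption — the DLQ message carries a new msg_id, so original_msg_id can’t be taken from it directly:

    -- Pull one dead message; it stays invisible for 60s while we work.
    select msg_id, message from pgmq.read('pipeline_jobs_dlq', 60, 1);

    -- Push the payload back onto the live queue.
    select pgmq.send('pipeline_jobs', :message);

    -- Clean up: delete from the DLQ, then the matching diagnostics row.
    select pgmq.delete('pipeline_jobs_dlq', :msg_id);
    delete from pipeline_dead_letters
    where recording_id = :recording_id;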

Notes

pgmq exposes read_ct on every message read (see the QueueMessage type in _shared/pipeline.ts). Promotion uses the same pgmq.send / pgmq.delete operations we already use elsewhere. The DLQ queue is created in the same defensive DO ... EXCEPTION block style as pipeline_jobs, so a missing extension never breaks the migration suite; the direct-HTTP fallback (per ADR-0001) skips the queue entirely anyway.
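
A sketch of that defensive style for the new queue; the exact exception list should mirror whatever the pipeline_jobs migration already catches:

    do $$
    begin
      -- No-op gracefully when pgmq is absent so the migration suite
      -- still passes on environments without the extension.
      perform pgmq.create('pipeline_jobs_dlq');
    exception
      when undefined_function or invalid_schema_name then
        raise notice 'pgmq unavailable; skipping pipeline_jobs_dlq';
    end $$;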

If the V0.2+ migration to Inngest from ADR-0001’s “Re-evaluate” clause ever happens, this DLQ design ports cleanly: Inngest has native dead-letter semantics. The diagnostics table is the only piece worth keeping.