
ADR-0008: pgmq dead-letter queue shape

  • Status: Accepted
  • Date: 2026-05-03
  • Affected: supabase/migrations/, supabase/functions/pipeline-tick/, supabase/functions/_shared/pipeline.ts

Context

ADR-0001 chose pgmq + pg_cron for V0.1 pipeline orchestration and explicitly deferred the DLQ question: “no exponential backoff yet, no DLQ — see ADR-TBD when we add one.” That moment is now — production hardening item #1 in the session handoff.

Today’s behavior: each step Edge Function deletes its own pgmq message on success and leaves it for vt-based redelivery on failure. Failure is silent and unbounded. A poison message — bad audio file, malformed diarization payload, deleted recording row — retries every 60 seconds forever, consumes capacity, and is invisible until someone tails Edge Function logs.

We need three things: (1) a retry ceiling, (2) a place stuck jobs go that’s queryable, (3) a path to manual replay once the underlying issue is fixed. The pgmq message itself already carries read_ct, the delivery-attempt counter — that’s enough signal to detect “stuck” without a sidecar counter.
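
For concreteness, a sketch of what the tick already sees on every batch read, using pgmq's standard read signature (the 60-second vt and batch size of 10 are illustrative, not the tuned values):

    -- Read up to 10 messages, making each invisible for 60 seconds.
    -- pgmq increments read_ct on every delivery, so a message that
    -- arrives with read_ct = 6 has already been tried five times.
    select msg_id, read_ct, enqueued_at, message
    from pgmq.read('pipeline_jobs', 60, 10);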

Options considered

Option A — Sibling DLQ queue per source queue (chosen)

Promote a stuck message by pgmq.send-ing the original payload into pipeline_jobs_dlq, then pgmq.delete-ing it from pipeline_jobs. Insert a row into a pipeline_dead_letters diagnostics table with the error string, step, recording_id, read_ct, and timestamps. The DLQ message payload stays minimal and identical to the live-queue payload — no error text, no metadata duplication.

  • Pros: pgmq-idiomatic; the sibling queue uses the same send/read/delete ops the live queue does. Clean separation: the live queue holds work, the DLQ holds the dead, the diagnostics table holds the error narrative. Manual replay is pgmq.read from the DLQ → pgmq.send to live → pgmq.delete from the DLQ → delete the diagnostics row, with no bespoke tooling. Scales 1:1 if more queues land later (each gets its own _dlq sibling), and the 1:1 mapping makes it obvious which dead messages belong to which pipeline.
  • Cons: Two queues to operate. A small amount of motion (one extra send + one delete) per dead message — but only at the dead boundary, not the hot path.

Option B — One global pipeline_dead_letters table, mark-and-store

Skip the DLQ queue. When read_ct > MAX_RETRIES, copy the payload into a regular table with the error context, then delete from the live queue.

  • Pros: One artifact instead of two. Slightly simpler migration.
  • Cons: Replay tooling becomes bespoke: there is no pgmq.send from a row, so you'd write custom SQL or a script to push messages back into the live queue. Throws away the queue affordances that make re-draining trivial.

Option C — Mark in place via sidecar set

Don’t move the message at all. Maintain a pipeline_dead_msg_ids table the tick consults; skip dispatch if a message’s id is in that set. The dead message stays in the live queue, gets read repeatedly, and is no-op’d.

  • Pros: Zero data motion.
  • Cons: Mixes dead and live messages. Every pgmq.read batch wastes slots on dead ones. The sidecar set keys on msg_id, pgmq's internal per-queue counter, which is fragile if the queue is ever recreated. Defeats the point of a read_ct ceiling: dead messages still tick the counter forever.

Decision

Option A. Add a sibling DLQ queue pipeline_jobs_dlq and a pipeline_dead_letters diagnostics table. The dead-letter promotion check lives at the dispatch boundary in pipeline-tick, not inside step functions — one place, one rule, no per-step drift.
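
A minimal sketch of the two new artifacts; the column types, the identity key, and the uuid type for recording_id are assumptions, not a reviewed migration:

    -- Sibling DLQ: same pgmq machinery as the live queue.
    select pgmq.create('pipeline_jobs_dlq');

    -- Diagnostics table: holds the error narrative, never the work.
    create table if not exists pipeline_dead_letters (
      id              bigint generated always as identity primary key,
      queue_name      text        not null,
      original_msg_id bigint      not null,
      payload         jsonb       not null,
      step            text        not null,
      recording_id    uuid,
      read_ct         int         not null,
      last_error      text,                 -- null is acceptable for V0.3
      first_seen      timestamptz,
      dead_at         timestamptz not null default now()
    );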

When pipeline-tick reads a batch and a message has read_ct > MAX_RETRIES:

  1. pgmq.send the payload to pipeline_jobs_dlq.
  2. insert into pipeline_dead_letters with (queue_name, original_msg_id, payload, step, recording_id, read_ct, last_error, first_seen, dead_at). last_error is the most recent step-side failure if available; null is acceptable for V0.3.
  3. pgmq.delete from pipeline_jobs.
  4. Skip dispatch.

MAX_RETRIES defaults to 5, env-tunable via PIPELINE_MAX_RETRIES. The first delivery counts as read_ct = 1, so 5 means “five honest tries before we give up.”
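
The same sequence sketched as SQL for a single stuck message. In practice pipeline-tick drives this from the dispatch loop; the :placeholders stand in for values the tick already holds from pgmq.read and are illustrative, not a real binding scheme:

    -- 1. Move the payload, unchanged, onto the sibling DLQ.
    select pgmq.send('pipeline_jobs_dlq', :payload);

    -- 2. Record the error narrative in the diagnostics table.
    insert into pipeline_dead_letters
      (queue_name, original_msg_id, payload, step, recording_id,
       read_ct, last_error, first_seen, dead_at)
    values
      ('pipeline_jobs', :msg_id, :payload, :step, :recording_id,
       :read_ct, :last_error, :first_seen, now());

    -- 3. Remove the dead message from the live queue.
    select pgmq.delete('pipeline_jobs', :msg_id);

    -- 4. pipeline-tick then skips dispatch for this message.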

Step functions keep their current pattern: succeed and pgmq.delete, or do nothing and let the visibility timeout re-deliver. Step functions do not decide when to dead-letter. Step-side terminal errors (e.g. recording row deleted) can be added as an explicit “promote now” path in a later slice; for V0.3, retry-ceiling is sufficient.

The diagnostics table includes a 30-day retention policy. The pruning job (a pg_cron delete from pipeline_dead_letters where dead_at < now() - interval '30 days') can ship in this slice or a follow-up, but the policy is part of this decision: without a TTL the table grows without bound, and that's a bug, not an open question.
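
A sketch of that pruning job, assuming stock pg_cron; the job name and the daily 03:00 UTC schedule are illustrative:

    -- Once a day, drop diagnostics rows older than the 30-day TTL.
    select cron.schedule(
      'prune-pipeline-dead-letters',
      '0 3 * * *',
      $$ delete from pipeline_dead_letters
         where dead_at < now() - interval '30 days' $$
    );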

Consequences

  • New surface area: one queue (pipeline_jobs_dlq), one table (pipeline_dead_letters), one env var (PIPELINE_MAX_RETRIES).
  • pipeline-tick grows a check at the top of the dispatch loop. The hot path is unchanged when no message is stuck.
  • Operations gets a simple SQL surface: select * from pipeline_dead_letters order by dead_at desc shows what’s stuck; joining with recordings shows whose pipeline failed.
  • Replay is queue-native. pgmq.read('pipeline_jobs_dlq', …) → pgmq.send('pipeline_jobs', payload) → pgmq.delete from the DLQ → delete the matching diagnostics row (sketched after this list). No new code path needed; an operator can do it from the SQL editor. A Settings UI for replay is a future nice-to-have, not a blocker.
  • TTL on diagnostics: 30 days. The pruning job (likely a daily cron.schedule entry, sketched in the Decision section) can ship in this slice or a follow-up, but the contract is set here so anyone reading dead-letter rows knows they’re ephemeral. Anything we want to keep longer goes into Sentry (see hardening item #3) or PostHog, not this table.
  • DLQ retention is unbounded for V0.3. A pgmq DLQ with a few hundred rows is fine; if a queue accumulates real volume, we’ll add a pgmq.archive policy later. The diagnostics table TTL is the safety valve for now.
  • No exponential backoff in this slice. pgmq’s fixed 60s visibility timeout is the entire backoff for now. Five retries at 60s = ~5 minutes from first failure to dead-letter, which is fast enough to detect a real outage and slow enough that a transient blip recovers without ceremony.
  • The read_ct > MAX_RETRIES check fires at the next tick after the 5th failure, not the 5th failure itself. This is fine — the message is invisible during vt, so no work is wasted. The alternative (counting on the step side) requires shared state and reintroduces the per-step drift this ADR is closing.
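
The replay sequence referenced above, sketched as statements an operator could run from the SQL editor. The :placeholders stand in for the msg_id and message returned by the read, and matching the diagnostics row on recording_id is an assumption — the DLQ message carries a new msg_id, so original_msg_id can’t be taken from it directly:

    -- Pull one dead message; it stays invisible for 60s while we work.
    select msg_id, message from pgmq.read('pipeline_jobs_dlq', 60, 1);

    -- Push the payload back onto the live queue.
    select pgmq.send('pipeline_jobs', :message);

    -- Clean up: delete from the DLQ, then the matching diagnostics row.
    select pgmq.delete('pipeline_jobs_dlq', :msg_id);
    delete from pipeline_dead_letters
    where recording_id = :recording_id;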

Notes

pgmq exposes read_ct on every message read (see the QueueMessage type in _shared/pipeline.ts). Promotion uses the same pgmq.send / pgmq.delete operations we already use elsewhere. The DLQ queue is created in the same defensive DO ... EXCEPTION block style as pipeline_jobs, so a missing extension never breaks the migration suite; the direct-HTTP fallback (per ADR-0001) skips the queue entirely anyway.
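
A sketch of that defensive style for the new queue; the exact exception list should mirror whatever the pipeline_jobs migration already catches:

    do $$
    begin
      -- No-op gracefully when pgmq is absent so the migration suite
      -- still passes on environments without the extension.
      perform pgmq.create('pipeline_jobs_dlq');
    exception
      when undefined_function or invalid_schema_name then
        raise notice 'pgmq unavailable; skipping pipeline_jobs_dlq';
    end $$;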

If the V0.2+ migration to Inngest from ADR-0001’s “Re-evaluate” clause ever happens, this DLQ design ports cleanly: Inngest has native dead-letter semantics. The diagnostics table is the only piece worth keeping.