Diagnostic and scaling playbook for the live pipeline. Source of truth for “what’s wrong, what to look at, what to bump.” Update this file when you discover a new failure mode worth recording — the next person at 3 AM shouldn’t have to re-derive the diagnostic chain you just walked.
Companion docs:
- SHARING.md — how to share / read these docs and call Edge Functions from your terminal
- decisions/0011-ai-vendor-strategy.md — the AI architecture (11 layers)
1. Pipeline at a glance
ARCIVE’s background pipeline is a chain of pgmq queues fed by pg_cron:
```
recording uploaded
        ↓
ingest-audio ──► (size/auth/tier checks)
        ↓
pipeline_jobs queue ◄── pg_cron pipeline-tick (every 30s)
        │
        ├─► transcribe-step     (Layer 2 — Groq Whisper)
        ├─► diarize-step        (Layer 7 — Deepgram Nova-3)
        ├─► reid-step           (Pyannote on Modal)
        ├─► summarize-step      (Layer 3 — Gemini → Anthropic → Groq)
        ├─► link-topics-step    (Layer 10 — pg_trgm + pgvector hybrid resolution; ADR-0012)
        ├─► embed-step          (Layer 4 — Voyage-3.5-lite)
        └─► compute-edges-step  (Auto-correlate — pgvector HNSW)
```

Each step is idempotent and deletes its own pgmq message on success.
Failures retry up to PIPELINE_MAX_RETRIES (default 5), then promote to
pipeline_jobs_dlq per ADR-0008.
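To push a single recording through one step by hand, a job can be enqueued directly from the SQL Editor. A minimal sketch; the `{'step', 'recording_id'}` payload shape is the same one §7 uses, but the `'transcribe'` step name is an assumption about what transcribe-step reads:

```sql
-- Enqueue one job for a single recording (step name assumed; payload shape per §7)
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'transcribe', 'recording_id', '<recording uuid>'),
  0  -- delay in seconds before the message becomes visible
);
```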
2. Health snapshot — one query
Run this first when something feels off. It’s the “is the patient alive” test:
```sql
SELECT
  (SELECT COUNT(*) FROM pgmq.q_pipeline_jobs) AS queue_depth,
  (SELECT COUNT(*) FROM pgmq.q_pipeline_jobs WHERE read_ct > 1) AS retrying,
  (SELECT COUNT(*) FROM pipeline_dead_letters
     WHERE created_at > now() - interval '10 minutes') AS recent_dlq,
  (SELECT COUNT(*) FROM net._http_response
     WHERE created > now() - interval '2 minutes' AND status_code = 200) AS ticks_ok,
  (SELECT COUNT(*) FROM net._http_response
     WHERE created > now() - interval '2 minutes' AND status_code <> 200) AS ticks_failed;
```

| Result | Interpretation |
|---|---|
| queue_depth = 0, ticks_ok > 0, everything else zero | Healthy and idle |
| queue_depth > 0 and ticks_ok > 0 | Healthy and processing |
| retrying > 5 | Downstream step is failing — check pipeline_dead_letters and step function logs |
| recent_dlq > 0 | Jobs gave up after retries — investigate last_error in pipeline_dead_letters |
| ticks_failed > 0 | pipeline-tick or pg_cron broken — see §3 |
| ticks_ok = 0 AND ticks_failed = 0 | pg_cron not firing — see §4 |
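To see what is actually sitting in the queue without consuming anything, read the queue table directly. A sketch assuming pgmq's standard queue-table layout (msg_id, read_ct, enqueued_at, vt, message):

```sql
-- Oldest waiting messages, with retry count and payload
SELECT msg_id, read_ct, enqueued_at, vt, message
FROM pgmq.q_pipeline_jobs
ORDER BY enqueued_at
LIMIT 10;
```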
3. Pipeline-tick failures
When ticks_failed > 0 or queue isn’t draining:
```sql
-- Last 5 pipeline-tick HTTP responses with body
SELECT id, status_code, LEFT(content::text, 250) AS body, created
FROM net._http_response
WHERE created > now() - interval '5 minutes'
ORDER BY created DESC
LIMIT 5;
```

| Status / Body | Likely cause | Fix |
|---|---|---|
| 401 Unauthorized | Auth headers wrong; cron command stale | Use cron.alter_job to make the command send both apikey and Authorization headers (see §6) |
| 500 {"error":"Invalid schema: pgmq"} | PostgREST not exposing pgmq | Re-apply 20260505000000_pgmq_read_wrapper.sql; the function falls back to public.pgmq_read |
| 500 (other) | Edge Function crashed | Dashboard → Edge Functions → pipeline-tick → Logs |
| Body says {"drained": 0, ...} repeatedly with a non-empty queue | Cron has the wrong service-role key (vault stale) | See §6 |
| All null status_code | pg_net request never completed — wrong URL or empty vault key | See §6 |
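To separate "cron isn't firing" from "the call itself is broken", fire one tick by hand from the SQL Editor, then check net._http_response a few seconds later. A sketch that mirrors the §6 cron command and reads the same vault secrets:

```sql
-- One manual tick; note the returned request_id and look it up in net._http_response
SELECT net.http_post(
  url := (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'project_url')
         || '/functions/v1/pipeline-tick',
  headers := jsonb_build_object(
    'Content-Type', 'application/json',
    'apikey', (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'service_role_key'),
    'Authorization', 'Bearer ' || (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'service_role_key')
  ),
  body := '{}'::jsonb
) AS request_id;
```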
4. pg_cron not firing
```sql
-- Is the cron job scheduled and active?
SELECT jobid, jobname, schedule, command, active
FROM cron.job
WHERE jobname LIKE '%pipeline%';
```

```sql
-- Recent run history
SELECT runid, status, return_message,
       to_char(start_time, 'HH24:MI:SS') AS started
FROM cron.job_run_details
WHERE jobid IN (SELECT jobid FROM cron.job WHERE jobname = 'pipeline-tick')
ORDER BY start_time DESC
LIMIT 10;
```

| Symptom | Fix |
|---|---|
| No row in cron.job | Cron entry was dropped — re-create with cron.schedule(...) (see migrations and the sketch after this table) |
| Row exists, active = false | UPDATE cron.job SET active = true WHERE jobid = N; |
| Row exists but cron.job_run_details has no rows | pg_cron extension may need re-enabling — Dashboard → Database → Extensions |
| Runs succeed but net._http_response is empty | pg_net not writing responses — check extension status |
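If the cron entry is gone entirely (first row above), the migrations are the source of truth. A re-creation sketch, assuming the deployed pg_cron accepts second-level schedules and reusing the full §6 command body:

```sql
-- Re-create the tick job (command abbreviated; use the full §6 command body)
SELECT cron.schedule(
  'pipeline-tick',
  '30 seconds',
  $cmd$
  SELECT net.http_post(
    url := (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'project_url')
           || '/functions/v1/pipeline-tick',
    headers := jsonb_build_object(
      'Content-Type', 'application/json',
      'apikey', (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'service_role_key'),
      'Authorization', 'Bearer ' || (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'service_role_key')
    ),
    body := '{}'::jsonb
  ) AS request_id;
  $cmd$
);
```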
5. Vendor / step function failures
```sql
-- Anything dead-lettered, with the actual error
SELECT step, recording_id, last_error, read_ct, created_at
FROM pipeline_dead_letters
ORDER BY created_at DESC
LIMIT 10;
```

Common patterns:

| last_error contains | Cause | Fix |
|---|---|---|
| Anthropic (401) / Gemini (403) | Vendor API key invalid in the deployed env | supabase secrets set <KEY>=<value> and redeploy the step function |
| Anthropic (429) / Gemini quota | Rate limit on the free tier | The cascade should fall through to the next provider automatically; if all three are exhausted, enable Gemini billing or split traffic across vendors |
| Voyage (429) | Embedding API rate limit | Voyage's free tier covers 50M tokens; if hit, throttle batch size or pay |
| pgmq schema errors | PostgREST schema visibility regressed | Re-apply the pgmq wrapper migrations |
| memory missing | Recording was deleted before the pipeline ran | Benign — the DLQ row can be ignored |
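Once the underlying cause is fixed (key rotated, quota restored), dead-lettered jobs can be pushed back onto the queue. A sketch assuming the DLQ rows carry enough of the payload to rebuild the job, using the same step/recording_id shape as §7; clean up the DLQ rows separately if you want them gone:

```sql
-- Re-enqueue recent summarize failures after fixing the vendor key
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', step, 'recording_id', recording_id),
  0
)
FROM pipeline_dead_letters
WHERE step = 'summarize'
  AND created_at > now() - interval '1 day';
```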
6. Vault + cron auth — the dependency chain
pg_cron calls pipeline-tick over HTTP using credentials stored in
vault.decrypted_secrets. Two secrets are required:
| Vault secret name | What it holds |
|---|---|
| project_url | https://<project-ref>.supabase.co |
| service_role_key | The current service-role key (legacy JWT or new sb_secret_*) |
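If either secret is missing rather than stale, it can be created from the SQL Editor. A sketch using Vault's create_secret; values shown are placeholders, and as with the update below, don't save the query with real values:

```sql
-- Create the two secrets the cron command reads (placeholder values)
SELECT vault.create_secret('https://<project-ref>.supabase.co', 'project_url');
SELECT vault.create_secret('<service-role key>', 'service_role_key');
```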
When you rotate the service-role key, update the vault secret manually — Supabase doesn’t auto-sync vault on rotation:
```sql
-- Check current state (no values exposed)
SELECT name,
       CASE
         WHEN decrypted_secret IS NULL THEN 'NULL'
         WHEN decrypted_secret = '' THEN 'EMPTY'
         ELSE 'len=' || length(decrypted_secret) || ' starts="' || left(decrypted_secret, 12) || '"'
       END AS state
FROM vault.decrypted_secrets
WHERE name IN ('project_url', 'service_role_key');
```
```sql
-- Update key (do not save the query — value is in plaintext)
SELECT vault.update_secret(
  (SELECT id FROM vault.secrets WHERE name = 'service_role_key'),
  '<new key>'
);
```

The cron command must send both apikey and Authorization headers (same value) — the new sb_secret_* format is rejected by Supabase's gateway in Authorization alone:
```sql
SELECT cron.alter_job(
  job_id := (SELECT jobid FROM cron.job WHERE jobname = 'pipeline-tick'),
  command := $cmd$
    select net.http_post(
      url := (select decrypted_secret from vault.decrypted_secrets where name = 'project_url')
             || '/functions/v1/pipeline-tick',
      headers := jsonb_build_object(
        'Content-Type', 'application/json',
        'apikey', (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key'),
        'Authorization', 'Bearer ' || (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key')
      ),
      body := '{}'::jsonb
    ) as request_id;
  $cmd$
);
```

7. Re-running a backfill
For one-shot backfill of memories missing summary or topics, prefer SQL Editor over the Edge Function:
```sql
-- Dry-run: count memories needing backfill
SELECT COUNT(*) AS candidates
FROM memories
WHERE (summary IS NULL OR topics IS NULL)
  AND transcript IS NOT NULL
  AND transcript <> ''
  AND recording_id IS NOT NULL;
```
```sql
-- Enqueue summarize jobs (production memories — by recording_id)
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'summarize', 'recording_id', recording_id),
  0
)
FROM memories
WHERE (summary IS NULL OR topics IS NULL)
  AND transcript IS NOT NULL
  AND transcript <> ''
  AND recording_id IS NOT NULL
LIMIT 500;
```
```sql
-- Enqueue summarize jobs (seed/fixture memories — by memory_id)
-- summarize-step accepts either; the embed step is skipped when recording_id is null
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'summarize', 'memory_id', id),
  0
)
FROM memories
WHERE recording_id IS NULL
  AND transcript IS NOT NULL
  AND transcript <> '';
```

Force-rewrite all summaries (e.g. after a prompt change):
```sql
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'summarize', 'memory_id', id),
  0
)
FROM memories
WHERE transcript IS NOT NULL
  AND transcript <> '';
```

summarize-step overwrites unconditionally — no NULL-check on existing summaries.
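To watch a backfill drain, count the jobs still queued per step. A sketch assuming pgmq's standard queue-table layout (jsonb message column):

```sql
-- Remaining queued jobs by step
SELECT message->>'step' AS step, COUNT(*) AS queued
FROM pgmq.q_pipeline_jobs
GROUP BY 1
ORDER BY 2 DESC;
```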
8. Scaling thresholds
Pipeline throughput formula:
```
throughput = BATCH / cron_interval
```

Currently: BATCH = 25, cron_interval = 30s → ~50 jobs/min → ~8K memories/day (3 jobs/memory).
| Trigger | Action |
|---|---|
| Queue depth consistently > 100 | Bump BATCH from 25 to 50 in pipeline-tick:20 |
| Queue still climbing after batch bump | Bump the cron interval from 30s to 15s via cron.alter_job (sketch below) |
| Vendor rate limits getting hit on summarize but compute_edges idle | Split into separate queues so each step’s batch can be tuned independently |
| 1K+ active users on free tiers | Enable Gemini billing (still ~$2.50/mo at this scale); split summarize traffic between Gemini and Groq Llama for elasticity |
| 10K+ users / B2B SLAs | Move orchestration off pgmq+cron to Inngest (already mentioned as V0.2+ option in 01_SOFTWARE_PLAN.md §1) |
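For the interval bump in the table above, a sketch, assuming the deployed pg_cron accepts second-level schedules:

```sql
-- Tighten the tick from every 30s to every 15s
SELECT cron.alter_job(
  job_id := (SELECT jobid FROM cron.job WHERE jobname = 'pipeline-tick'),
  schedule := '15 seconds'
);
```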
Don’t pre-scale. The default settings hold to ~3K active users at 3 recordings/day — bigger than V0.3 needs to be.
9. Known anti-patterns (do not do)
- Don’t manually delete pgmq messages without reading them — the queue’s visibility timeouts handle in-flight jobs; an explicit pgmq.delete outside a step function leaves orphaned memories with no summary/embedding.
- Don’t set SUPABASE_SERVICE_ROLE_KEY as a custom Supabase secret on Edge Functions — Supabase auto-injects it; setting it manually can conflict with platform behavior. Update via key rotation in the dashboard.
- Don’t put real keys into git — even .env.example should only document variable names, not actual values. Use Supabase secrets for deployed Edge Functions, Vercel env vars for the web app, and Modal secrets for workers.
- Don’t paste the service_role key into chat (AI or human) — it grants full DB access and bypasses RLS. Rotate immediately if leaked.
10. When in doubt
Run §2’s health snapshot, then drill down via §3-§5 based on what the snapshot shows. If still stuck after that:
- Edge Function logs: Dashboard → Edge Functions → click the function → Logs
- Step function direct test: Dashboard → Edge Functions → click → Test function → POST with body
- Recent dashboard activity: Dashboard → Logs → Postgres / Edge Functions