ARCIVE Operations Runbook

Diagnostic and scaling playbook for the live pipeline. Source of truth for “what’s wrong, what to look at, what to bump.” Update this file when you discover a new failure mode worth recording — the next person at 3 AM shouldn’t have to re-derive the diagnostic chain you just walked.

Companion docs:


1. Pipeline at a glance

ARCIVE’s background pipeline is a chain of pgmq queues fed by pg_cron:

```
recording uploaded
  │
ingest-audio ──► (size/auth/tier checks)
  │
pipeline_jobs queue ◄── pg_cron pipeline-tick (every 30s)
  ├─► transcribe-step      (Layer 2 — Groq Whisper)
  ├─► diarize-step         (Layer 7 — Deepgram Nova-3)
  ├─► reid-step            (Pyannote on Modal)
  ├─► summarize-step       (Layer 3 — Gemini → Anthropic → Groq)
  ├─► link-topics-step     (Layer 10 — pg_trgm + pgvector hybrid resolution; ADR-0012)
  ├─► embed-step           (Layer 4 — Voyage-3.5-lite)
  └─► compute-edges-step   (Auto-correlate — pgvector HNSW)
```

Each step is idempotent and deletes its own pgmq message on success. Failures retry up to PIPELINE_MAX_RETRIES (default 5), then promote to pipeline_jobs_dlq per ADR-0008.
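
To see which in-flight jobs are creeping toward that cap before they dead-letter, you can read the queue table directly. A minimal sketch, assuming the standard pgmq queue columns (msg_id, read_ct, enqueued_at, message) and the default PIPELINE_MAX_RETRIES of 5:

```sql
-- Messages one delivery away from dead-lettering (assumes PIPELINE_MAX_RETRIES = 5)
SELECT msg_id,
       read_ct,
       enqueued_at,
       message->>'step'          AS step,
       message->>'recording_id'  AS recording_id
FROM pgmq.q_pipeline_jobs
WHERE read_ct >= 4
ORDER BY read_ct DESC, enqueued_at ASC;
```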


2. Health snapshot — one query

Run this first when something feels off. It’s the “is the patient alive” test:

```sql
SELECT
  (SELECT COUNT(*) FROM pgmq.q_pipeline_jobs)                   AS queue_depth,
  (SELECT COUNT(*) FROM pgmq.q_pipeline_jobs WHERE read_ct > 1) AS retrying,
  (SELECT COUNT(*) FROM pipeline_dead_letters
     WHERE created_at > now() - interval '10 minutes')          AS recent_dlq,
  (SELECT COUNT(*) FROM net._http_response
     WHERE created > now() - interval '2 minutes' AND status_code = 200)  AS ticks_ok,
  (SELECT COUNT(*) FROM net._http_response
     WHERE created > now() - interval '2 minutes' AND status_code <> 200) AS ticks_failed;
```

| Result | Interpretation |
| --- | --- |
| queue_depth = 0, ticks_ok > 0, everything else 0 | Healthy and idle |
| queue_depth > 0 and ticks_ok > 0 | Healthy and processing |
| retrying > 5 | Downstream step is failing — check pipeline_dead_letters and step function logs |
| recent_dlq > 0 | Jobs gave up after retries — investigate last_error in pipeline_dead_letters |
| ticks_failed > 0 | pipeline-tick or pg_cron broken — see §3 |
| ticks_ok = 0 AND ticks_failed = 0 | pg_cron not firing — see §4 |
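
When queue_depth is non-zero and you want to see what is actually sitting there, peek at the raw queue (same assumption as above: standard pgmq queue columns) for the oldest messages and their step:

```sql
-- Oldest queued jobs first; age should stay low when the tick is keeping up
SELECT msg_id,
       read_ct,
       message->>'step' AS step,
       now() - enqueued_at AS age
FROM pgmq.q_pipeline_jobs
ORDER BY enqueued_at ASC
LIMIT 10;
```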

3. Pipeline-tick failures

When ticks_failed > 0 or queue isn’t draining:

```sql
-- Last 5 pipeline-tick HTTP responses with body
SELECT id, status_code, LEFT(content::text, 250) AS body, created
FROM net._http_response
WHERE created > now() - interval '5 minutes'
ORDER BY created DESC
LIMIT 5;
```

| Status / Body | Likely cause | Fix |
| --- | --- | --- |
| 401 Unauthorized | Auth headers wrong; cron command stale | Update cron.alter_job to send both apikey and Authorization headers (see §6) |
| 500 {"error":"Invalid schema: pgmq"} | PostgREST not exposing pgmq | Re-apply 20260505000000_pgmq_read_wrapper.sql; the function falls back to public.pgmq_read |
| 500 (other) | Edge Function crashed | Dashboard → Edge Functions → pipeline-tick → Logs |
| Body says {"drained": 0, ...} repeatedly with a non-empty queue | Cron has wrong service-role key (vault stale) | See §6 |
| All status_code values null | pg_net request never completed — wrong URL or vault key empty | See §6 |
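
If the responses are stale, or you want a fresh data point without waiting for the next cron tick, you can fire the same HTTP call by hand from the SQL Editor. This mirrors the cron command in §6 and assumes both vault secrets are populated:

```sql
-- One-off manual tick; the response appears in net._http_response shortly after
SELECT net.http_post(
  url := (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'project_url')
         || '/functions/v1/pipeline-tick',
  headers := jsonb_build_object(
    'Content-Type', 'application/json',
    'apikey', (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'service_role_key'),
    'Authorization', 'Bearer ' || (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'service_role_key')
  ),
  body := '{}'::jsonb
) AS request_id;
```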

4. pg_cron not firing

```sql
-- Is the cron job scheduled and active?
SELECT jobid, jobname, schedule, command, active
FROM cron.job
WHERE jobname LIKE '%pipeline%';

-- Recent run history
SELECT runid, status, return_message,
       to_char(start_time, 'HH24:MI:SS') AS started
FROM cron.job_run_details
WHERE jobid IN (SELECT jobid FROM cron.job WHERE jobname = 'pipeline-tick')
ORDER BY start_time DESC
LIMIT 10;
```

| Symptom | Fix |
| --- | --- |
| No row in cron.job | Cron entry was dropped — re-create with cron.schedule(...) (see migrations and the sketch below) |
| Row exists, active = false | UPDATE cron.job SET active = true WHERE jobid = N; |
| Row exists but cron.job_run_details has no rows | pg_cron extension may need re-enabling — Dashboard → Database → Extensions |
| Runs succeed but net._http_response is empty | pg_net not writing responses — check extension status |
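
For the "no row in cron.job" case, the sketch below re-registers the job using the §6 command. The '30 seconds' schedule string matches the documented tick interval but is an assumption; check the migrations for the canonical schedule before running:

```sql
-- Hedged sketch: re-create pipeline-tick (schedule string is an assumption; see migrations)
SELECT cron.schedule(
  'pipeline-tick',
  '30 seconds',
  $cmd$
  select net.http_post(
    url := (select decrypted_secret from vault.decrypted_secrets where name = 'project_url')
           || '/functions/v1/pipeline-tick',
    headers := jsonb_build_object(
      'Content-Type', 'application/json',
      'apikey', (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key'),
      'Authorization', 'Bearer ' || (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key')
    ),
    body := '{}'::jsonb
  ) as request_id;
  $cmd$
);
```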

5. Vendor / step function failures

```sql
-- Anything dead-lettered with the actual error
SELECT step, recording_id, last_error, read_ct, created_at
FROM pipeline_dead_letters
ORDER BY created_at DESC
LIMIT 10;
```

Common patterns:

| last_error contains | Cause | Fix |
| --- | --- | --- |
| Anthropic (401) / Gemini (403) | Vendor API key invalid in deployed env | supabase secrets set <KEY>=<value> and redeploy the step function |
| Anthropic (429) / Gemini quota | Rate limit on free tier | Cascade should fall back to the next provider automatically; if all three are exhausted, enable Gemini billing or split traffic across vendors |
| Voyage (429) | Embedding API rate limit | Voyage has a 50M-token free tier; if hit, throttle batch size or pay |
| pgmq schema errors | PostgREST schema visibility regressed | Re-apply the pgmq wrapper migrations |
| memory missing | Recording was deleted before the pipeline ran | Benign — the DLQ row can be ignored |
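
Once the root cause is fixed (key rotated, quota restored), a dead-lettered job can be re-enqueued by hand. A hedged sketch, assuming pipeline_dead_letters keeps the original step and recording_id as shown in the query above; swap in memory_id for seed/fixture rows per §7:

```sql
-- Re-enqueue a single dead-lettered job after fixing the underlying failure
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', step, 'recording_id', recording_id),
  0
)
FROM pipeline_dead_letters
WHERE recording_id = '<recording uuid>'   -- fill in the row you are retrying
  AND step = 'summarize';                 -- or whichever step dead-lettered
```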

6. Vault + cron auth — the dependency chain

pg_cron calls pipeline-tick over HTTP using credentials stored in vault.decrypted_secrets. Two secrets are required:

| Vault secret name | What it holds |
| --- | --- |
| project_url | https://<project-ref>.supabase.co |
| service_role_key | The current service-role key (legacy JWT or new sb_secret_*) |
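
If either row is missing entirely (fresh project, restored database), the check query below will return fewer than two rows. A hedged sketch for creating them, assuming Supabase Vault's vault.create_secret(secret, name) helper; never commit the values anywhere:

```sql
-- Create the two secrets once (values shown as placeholders; do not save this query)
SELECT vault.create_secret('https://<project-ref>.supabase.co', 'project_url');
SELECT vault.create_secret('<service-role key>', 'service_role_key');
```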

When you rotate the service-role key, update the vault secret manually — Supabase doesn’t auto-sync vault on rotation:

```sql
-- Check current state (no values exposed)
SELECT
  name,
  CASE
    WHEN decrypted_secret IS NULL THEN 'NULL'
    WHEN decrypted_secret = '' THEN 'EMPTY'
    ELSE 'len=' || length(decrypted_secret) || ' starts="' || left(decrypted_secret, 12) || '"'
  END AS state
FROM vault.decrypted_secrets
WHERE name IN ('project_url', 'service_role_key');

-- Update key (do not save the query — value is in plaintext)
SELECT vault.update_secret(
  (SELECT id FROM vault.secrets WHERE name = 'service_role_key'),
  '<new key>'
);
```

The cron command must send both apikey and Authorization headers (same value) — the new sb_secret_* format is rejected by Supabase’s gateway when sent in the Authorization header alone:

```sql
SELECT cron.alter_job(
  job_id  := (SELECT jobid FROM cron.job WHERE jobname = 'pipeline-tick'),
  command := $cmd$
    select net.http_post(
      url := (select decrypted_secret from vault.decrypted_secrets where name = 'project_url')
             || '/functions/v1/pipeline-tick',
      headers := jsonb_build_object(
        'Content-Type', 'application/json',
        'apikey', (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key'),
        'Authorization', 'Bearer ' || (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key')
      ),
      body := '{}'::jsonb
    ) as request_id;
  $cmd$
);
```
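
After updating the secret and the cron command, confirm the fix took: within a minute or two the newest pipeline-tick responses should be 200s again (same table as the §3 query).

```sql
-- Should show fresh 200s once the next tick fires
SELECT status_code, LEFT(content::text, 100) AS body, created
FROM net._http_response
ORDER BY created DESC
LIMIT 3;
```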

7. Re-running a backfill

For one-shot backfill of memories missing summary or topics, prefer SQL Editor over the Edge Function:

```sql
-- Dry-run: count memories needing backfill
SELECT COUNT(*) AS candidates
FROM memories
WHERE (summary IS NULL OR topics IS NULL)
  AND transcript IS NOT NULL AND transcript <> ''
  AND recording_id IS NOT NULL;

-- Enqueue summarize jobs (production memories — by recording_id)
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'summarize', 'recording_id', recording_id),
  0
)
FROM memories
WHERE (summary IS NULL OR topics IS NULL)
  AND transcript IS NOT NULL AND transcript <> ''
  AND recording_id IS NOT NULL
LIMIT 500;

-- Enqueue summarize jobs (seed/fixture memories — by memory_id)
-- summarize-step accepts either; the embed step is skipped when recording_id is null
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'summarize', 'memory_id', id),
  0
)
FROM memories
WHERE recording_id IS NULL
  AND transcript IS NOT NULL AND transcript <> '';
```

Force-rewrite all summaries (e.g. after a prompt change):

```sql
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'summarize', 'memory_id', id),
  0
)
FROM memories
WHERE transcript IS NOT NULL AND transcript <> '';
```

summarize-step overwrites unconditionally — no NULL-check on existing summaries.
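
To watch a backfill drain, re-run the dry-run count alongside the queue depth; both should trend toward zero as pipeline-tick works through the jobs.

```sql
-- Backfill progress: candidates remaining vs. jobs still queued
SELECT
  (SELECT COUNT(*) FROM memories
     WHERE (summary IS NULL OR topics IS NULL)
       AND transcript IS NOT NULL AND transcript <> ''
       AND recording_id IS NOT NULL) AS remaining_candidates,
  (SELECT COUNT(*) FROM pgmq.q_pipeline_jobs) AS queue_depth;
```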


8. Scaling thresholds

Pipeline throughput formula:

throughput = BATCH / cron_interval

Currently: BATCH = 25, cron_interval = 30s → ~50 jobs/min → ~8K memories/day (3 jobs/memory).

| Trigger | Action |
| --- | --- |
| Queue depth consistently > 100 | Bump BATCH from 25 to 50 in pipeline-tick:20 |
| Queue still climbing after the batch bump | Shorten the cron interval from 30s to 15s via cron.alter_job (see the sketch below) |
| Vendor rate limits hit on summarize while compute_edges sits idle | Split into separate queues so each step’s batch size can be tuned independently |
| 1K+ active users on free tiers | Enable Gemini billing (still ~$2.50/mo at this scale); split summarize traffic between Gemini and Groq Llama for elasticity |
| 10K+ users / B2B SLAs | Move orchestration off pgmq + cron to Inngest (already noted as a V0.2+ option in 01_SOFTWARE_PLAN.md §1) |
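
For the "queue still climbing" row, the interval change is a one-liner. A sketch, assuming pg_cron's second-granularity schedule syntax (the same format the existing 30-second job uses):

```sql
-- Tighten the tick from every 30 s to every 15 s
SELECT cron.alter_job(
  job_id   := (SELECT jobid FROM cron.job WHERE jobname = 'pipeline-tick'),
  schedule := '15 seconds'
);
```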

Don’t pre-scale. The default settings hold to ~3K active users at 3 recordings/day — bigger than V0.3 needs to be.


9. Known anti-patterns (do not do)

  • Don’t manually delete pgmq messages without reading them — the queue’s visibility timeouts handle in-flight messages; an explicit pgmq.delete outside a step function leaves orphaned memories with no summary/embedding.
  • Don’t set SUPABASE_SERVICE_ROLE_KEY as a custom Supabase secret on Edge Functions — Supabase auto-injects this; manually setting it can conflict with platform behavior. Update via key rotation in the dashboard.
  • Don’t put real keys into git — even .env.example should only document variable names, not actual values. Use Supabase secrets for deployed Edge Functions; Vercel env vars for the web app; Modal secrets for workers.
  • Don’t paste the service_role key into chat (AI or human) — it grants full DB access, bypasses RLS. Rotate immediately if leaked.

10. When in doubt

Run §2’s health snapshot, then drill down via §3-§5 based on what the snapshot shows. If still stuck after that:

  • Edge Function logs: Dashboard → Edge Functions → click the function → Logs
  • Step function direct test: Dashboard → Edge Functions → click → Test function → POST with body
  • Recent dashboard activity: Dashboard → Logs → Postgres / Edge Functions