ARCIVE Operations Runbook

Diagnostic and scaling playbook for the live pipeline. Source of truth for “what’s wrong, what to look at, what to bump.” Update this file when you discover a new failure mode worth recording — the next person at 3 AM shouldn’t have to re-derive the diagnostic chain you just walked.

Companion docs:


1. Pipeline at a glance

ARCIVE’s background pipeline is a chain of pgmq queues fed by pg_cron:

```
recording uploaded
  │
ingest-audio ──► (size/auth/tier checks)
  │
pipeline_jobs queue ◄── pg_cron pipeline-tick (every 30s)
  ├─► transcribe-step      (Layer 2 — Groq Whisper)
  ├─► diarize-step         (Layer 7 — Deepgram Nova-3)
  ├─► reid-step            (Pyannote on Modal)
  ├─► summarize-step       (Layer 3 — Gemini → Anthropic → Groq)
  ├─► link-topics-step     (Layer 10 — pg_trgm + pgvector hybrid resolution; ADR-0012)
  ├─► embed-step           (Layer 4 — Voyage-3.5-lite)
  └─► compute-edges-step   (Auto-correlate — pgvector HNSW)
```

Each step is idempotent and deletes its own pgmq message on success. Failures retry up to PIPELINE_MAX_RETRIES (default 5), then promote to pipeline_jobs_dlq per ADR-0008.
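
To see which in-flight jobs are creeping toward that cap before they dead-letter, you can read the queue table directly. A minimal sketch, assuming the standard pgmq queue columns (msg_id, read_ct, enqueued_at, message) and the default PIPELINE_MAX_RETRIES of 5:

```sql
-- Messages one delivery away from dead-lettering (assumes PIPELINE_MAX_RETRIES = 5)
SELECT msg_id,
       read_ct,
       enqueued_at,
       message->>'step'          AS step,
       message->>'recording_id'  AS recording_id
FROM pgmq.q_pipeline_jobs
WHERE read_ct >= 4
ORDER BY read_ct DESC, enqueued_at ASC;
```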


2. Health snapshot — one query

Run this first when something feels off. It’s the “is the patient alive” test:

```sql
SELECT
  (SELECT COUNT(*) FROM pgmq.q_pipeline_jobs)                   AS queue_depth,
  (SELECT COUNT(*) FROM pgmq.q_pipeline_jobs WHERE read_ct > 1) AS retrying,
  (SELECT COUNT(*) FROM pipeline_dead_letters
     WHERE created_at > now() - interval '10 minutes')          AS recent_dlq,
  (SELECT COUNT(*) FROM net._http_response
     WHERE created > now() - interval '2 minutes' AND status_code = 200)  AS ticks_ok,
  (SELECT COUNT(*) FROM net._http_response
     WHERE created > now() - interval '2 minutes' AND status_code <> 200) AS ticks_failed;
```

| Result | Interpretation |
| --- | --- |
| queue_depth = 0, ticks_ok > 0, everything else 0 | Healthy and idle |
| queue_depth > 0 and ticks_ok > 0 | Healthy and processing |
| retrying > 5 | Downstream step is failing — check pipeline_dead_letters and step function logs |
| recent_dlq > 0 | Jobs gave up after retries — investigate last_error in pipeline_dead_letters |
| ticks_failed > 0 | pipeline-tick or pg_cron broken — see §3 |
| ticks_ok = 0 AND ticks_failed = 0 | pg_cron not firing — see §4 |
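
When queue_depth is non-zero and you want to see what is actually sitting there, peek at the raw queue (same assumption as above: standard pgmq queue columns) for the oldest messages and their step:

```sql
-- Oldest queued jobs first; age should stay low when the tick is keeping up
SELECT msg_id,
       read_ct,
       message->>'step' AS step,
       now() - enqueued_at AS age
FROM pgmq.q_pipeline_jobs
ORDER BY enqueued_at ASC
LIMIT 10;
```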

3. Pipeline-tick failures

When ticks_failed > 0 or queue isn’t draining:

```sql
-- Last 5 pipeline-tick HTTP responses with body
SELECT id, status_code, LEFT(content::text, 250) AS body, created
FROM net._http_response
WHERE created > now() - interval '5 minutes'
ORDER BY created DESC
LIMIT 5;
```

| Status / Body | Likely cause | Fix |
| --- | --- | --- |
| 401 Unauthorized | Auth headers wrong; cron command stale | Update cron.alter_job to send both apikey and Authorization headers (see §6) |
| 500 {"error":"Invalid schema: pgmq"} | PostgREST not exposing pgmq | Re-apply 20260505000000_pgmq_read_wrapper.sql; the function falls back to public.pgmq_read |
| 500 (other) | Edge Function crashed | Dashboard → Edge Functions → pipeline-tick → Logs |
| Body says {"drained": 0, ...} repeatedly with a non-empty queue | Cron has wrong service-role key (vault stale) | See §6 |
| All status_code values null | pg_net request never completed — wrong URL or vault key empty | See §6 |
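
If the responses are stale, or you want a fresh data point without waiting for the next cron tick, you can fire the same HTTP call by hand from the SQL Editor. This mirrors the cron command in §6 and assumes both vault secrets are populated:

```sql
-- One-off manual tick; the response appears in net._http_response shortly after
SELECT net.http_post(
  url := (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'project_url')
         || '/functions/v1/pipeline-tick',
  headers := jsonb_build_object(
    'Content-Type', 'application/json',
    'apikey', (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'service_role_key'),
    'Authorization', 'Bearer ' || (SELECT decrypted_secret FROM vault.decrypted_secrets WHERE name = 'service_role_key')
  ),
  body := '{}'::jsonb
) AS request_id;
```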

4. pg_cron not firing

```sql
-- Is the cron job scheduled and active?
SELECT jobid, jobname, schedule, command, active
FROM cron.job
WHERE jobname LIKE '%pipeline%';

-- Recent run history
SELECT runid, status, return_message,
       to_char(start_time, 'HH24:MI:SS') AS started
FROM cron.job_run_details
WHERE jobid IN (SELECT jobid FROM cron.job WHERE jobname = 'pipeline-tick')
ORDER BY start_time DESC
LIMIT 10;
```

| Symptom | Fix |
| --- | --- |
| No row in cron.job | Cron entry was dropped — re-create with cron.schedule(...) (see migrations and the sketch below) |
| Row exists, active = false | UPDATE cron.job SET active = true WHERE jobid = N; |
| Row exists but cron.job_run_details has no rows | pg_cron extension may need re-enabling — Dashboard → Database → Extensions |
| Runs succeed but net._http_response is empty | pg_net not writing responses — check extension status |
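
For the "no row in cron.job" case, the sketch below re-registers the job using the §6 command. The '30 seconds' schedule string matches the documented tick interval but is an assumption; check the migrations for the canonical schedule before running:

```sql
-- Hedged sketch: re-create pipeline-tick (schedule string is an assumption; see migrations)
SELECT cron.schedule(
  'pipeline-tick',
  '30 seconds',
  $cmd$
  select net.http_post(
    url := (select decrypted_secret from vault.decrypted_secrets where name = 'project_url')
           || '/functions/v1/pipeline-tick',
    headers := jsonb_build_object(
      'Content-Type', 'application/json',
      'apikey', (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key'),
      'Authorization', 'Bearer ' || (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key')
    ),
    body := '{}'::jsonb
  ) as request_id;
  $cmd$
);
```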

5. Vendor / step function failures

```sql
-- Anything dead-lettered with the actual error
SELECT step, recording_id, last_error, read_ct, created_at
FROM pipeline_dead_letters
ORDER BY created_at DESC
LIMIT 10;
```

Common patterns:

| last_error contains | Cause | Fix |
| --- | --- | --- |
| Anthropic (401) / Gemini (403) | Vendor API key invalid in deployed env | supabase secrets set <KEY>=<value> and redeploy the step function |
| Anthropic (429) / Gemini quota | Rate limit on free tier | Cascade should fall back to the next provider automatically; if all three are exhausted, enable Gemini billing or split traffic across vendors |
| Voyage (429) | Embedding API rate limit | Voyage has a 50M-token free tier; if hit, throttle batch size or pay |
| pgmq schema errors | PostgREST schema visibility regressed | Re-apply the pgmq wrapper migrations |
| memory missing | Recording was deleted before the pipeline ran | Benign — the DLQ row can be ignored |
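
Once the root cause is fixed (key rotated, quota restored), a dead-lettered job can be re-enqueued by hand. A hedged sketch, assuming pipeline_dead_letters keeps the original step and recording_id as shown in the query above; swap in memory_id for seed/fixture rows per §7:

```sql
-- Re-enqueue a single dead-lettered job after fixing the underlying failure
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', step, 'recording_id', recording_id),
  0
)
FROM pipeline_dead_letters
WHERE recording_id = '<recording uuid>'   -- fill in the row you are retrying
  AND step = 'summarize';                 -- or whichever step dead-lettered
```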

6. Vault + cron auth — the dependency chain

pg_cron calls pipeline-tick over HTTP using credentials stored in vault.decrypted_secrets. Two secrets are required:

| Vault secret name | What it holds |
| --- | --- |
| project_url | https://<project-ref>.supabase.co |
| service_role_key | The current service-role key (legacy JWT or new sb_secret_*) |
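
If either row is missing entirely (fresh project, restored database), the check query below will return fewer than two rows. A hedged sketch for creating them, assuming Supabase Vault's vault.create_secret(secret, name) helper; never commit the values anywhere:

```sql
-- Create the two secrets once (values shown as placeholders; do not save this query)
SELECT vault.create_secret('https://<project-ref>.supabase.co', 'project_url');
SELECT vault.create_secret('<service-role key>', 'service_role_key');
```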

When you rotate the service-role key, update the vault secret manually — Supabase doesn’t auto-sync vault on rotation:

```sql
-- Check current state (no values exposed)
SELECT
  name,
  CASE
    WHEN decrypted_secret IS NULL THEN 'NULL'
    WHEN decrypted_secret = '' THEN 'EMPTY'
    ELSE 'len=' || length(decrypted_secret) || ' starts="' || left(decrypted_secret, 12) || '"'
  END AS state
FROM vault.decrypted_secrets
WHERE name IN ('project_url', 'service_role_key');

-- Update key (do not save the query — value is in plaintext)
SELECT vault.update_secret(
  (SELECT id FROM vault.secrets WHERE name = 'service_role_key'),
  '<new key>'
);
```

The cron command must send both apikey and Authorization headers (same value) — the new sb_secret_* format is rejected by Supabase’s gateway when sent in the Authorization header alone:

```sql
SELECT cron.alter_job(
  job_id  := (SELECT jobid FROM cron.job WHERE jobname = 'pipeline-tick'),
  command := $cmd$
    select net.http_post(
      url := (select decrypted_secret from vault.decrypted_secrets where name = 'project_url')
             || '/functions/v1/pipeline-tick',
      headers := jsonb_build_object(
        'Content-Type', 'application/json',
        'apikey', (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key'),
        'Authorization', 'Bearer ' || (select decrypted_secret from vault.decrypted_secrets where name = 'service_role_key')
      ),
      body := '{}'::jsonb
    ) as request_id;
  $cmd$
);
```
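
After updating the secret and the cron command, confirm the fix took: within a minute or two the newest pipeline-tick responses should be 200s again (same table as the §3 query).

```sql
-- Should show fresh 200s once the next tick fires
SELECT status_code, LEFT(content::text, 100) AS body, created
FROM net._http_response
ORDER BY created DESC
LIMIT 3;
```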

7. Re-running a backfill

For one-shot backfill of memories missing summary or topics, prefer SQL Editor over the Edge Function:

```sql
-- Dry-run: count memories needing backfill
SELECT COUNT(*) AS candidates
FROM memories
WHERE (summary IS NULL OR topics IS NULL)
  AND transcript IS NOT NULL AND transcript <> ''
  AND recording_id IS NOT NULL;

-- Enqueue summarize jobs (production memories — by recording_id)
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'summarize', 'recording_id', recording_id),
  0
)
FROM memories
WHERE (summary IS NULL OR topics IS NULL)
  AND transcript IS NOT NULL AND transcript <> ''
  AND recording_id IS NOT NULL
LIMIT 500;

-- Enqueue summarize jobs (seed/fixture memories — by memory_id)
-- summarize-step accepts either; the embed step is skipped when recording_id is null
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'summarize', 'memory_id', id),
  0
)
FROM memories
WHERE recording_id IS NULL
  AND transcript IS NOT NULL AND transcript <> '';
```

Force-rewrite all summaries (e.g. after a prompt change):

```sql
SELECT pgmq.send(
  'pipeline_jobs',
  jsonb_build_object('step', 'summarize', 'memory_id', id),
  0
)
FROM memories
WHERE transcript IS NOT NULL AND transcript <> '';
```

summarize-step overwrites unconditionally — no NULL-check on existing summaries.
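
To watch a backfill drain, re-run the dry-run count alongside the queue depth; both should trend toward zero as pipeline-tick works through the jobs.

```sql
-- Backfill progress: candidates remaining vs. jobs still queued
SELECT
  (SELECT COUNT(*) FROM memories
     WHERE (summary IS NULL OR topics IS NULL)
       AND transcript IS NOT NULL AND transcript <> ''
       AND recording_id IS NOT NULL) AS remaining_candidates,
  (SELECT COUNT(*) FROM pgmq.q_pipeline_jobs) AS queue_depth;
```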


8. Scaling thresholds

Pipeline throughput formula:

throughput = BATCH / cron_interval

Currently: BATCH = 25, cron_interval = 30s → ~50 jobs/min → ~8K memories/day (3 jobs/memory).

| Trigger | Action |
| --- | --- |
| Queue depth consistently > 100 | Bump BATCH from 25 to 50 in pipeline-tick:20 |
| Queue still climbing after the batch bump | Shorten the cron interval from 30s to 15s via cron.alter_job (see the sketch below) |
| Vendor rate limits hit on summarize while compute_edges sits idle | Split into separate queues so each step’s batch size can be tuned independently |
| 1K+ active users on free tiers | Enable Gemini billing (still ~$2.50/mo at this scale); split summarize traffic between Gemini and Groq Llama for elasticity |
| 10K+ users / B2B SLAs | Move orchestration off pgmq + cron to Inngest (already noted as a V0.2+ option in 01_SOFTWARE_PLAN.md §1) |
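
For the "queue still climbing" row, the interval change is a one-liner. A sketch, assuming pg_cron's second-granularity schedule syntax (the same format the existing 30-second job uses):

```sql
-- Tighten the tick from every 30 s to every 15 s
SELECT cron.alter_job(
  job_id   := (SELECT jobid FROM cron.job WHERE jobname = 'pipeline-tick'),
  schedule := '15 seconds'
);
```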

Don’t pre-scale. The default settings hold to ~3K active users at 3 recordings/day — bigger than V0.3 needs to be.


9. Known anti-patterns (do not do)

  • Don’t manually delete pgmq messages without reading them — the queue’s visibility timeouts handle in-flight messages; an explicit pgmq.delete outside a step function leaves orphaned memories with no summary/embedding.
  • Don’t set SUPABASE_SERVICE_ROLE_KEY as a custom Supabase secret on Edge Functions — Supabase auto-injects this; manually setting it can conflict with platform behavior. Update via key rotation in the dashboard.
  • Don’t put real keys into git — even .env.example should only document variable names, not actual values. Use Supabase secrets for deployed Edge Functions; Vercel env vars for the web app; Modal secrets for workers.
  • Don’t paste the service_role key into chat (AI or human) — it grants full DB access, bypasses RLS. Rotate immediately if leaked.

10. When in doubt

Run §2’s health snapshot, then drill down via §3-§5 based on what the snapshot shows. If still stuck after that:

  • Edge Function logs: Dashboard → Edge Functions → click the function → Logs
  • Step function direct test: Dashboard → Edge Functions → click → Test function → POST with body
  • Recent dashboard activity: Dashboard → Logs → Postgres / Edge Functions