
Runbook

Day-2 operations for Atlas. This page summarises the most common incident classes and where to look first. It mirrors the structure of the repository's docs/RUNBOOK.md, which carries the full command-level detail.

Triage map

API and dependencies

When the API is unhealthy, check liveness first, then use the readiness endpoint to identify which dependency it cannot reach.

curl -s "http://localhost:${API_PORT:-8000}/health" # process alive?
curl -s "http://localhost:${API_PORT:-8000}/health/ready" # which dependency is not ready?
docker compose ps # are postgres/neo4j/redis/minio healthy?
docker compose logs -f osint-api # boot/runtime errors

The API is fail-closed: if the ontology schema or TranslationRegistry fails to load at boot, the process aborts rather than serving requests. A readiness failure shortly after a deploy often means a schema or migration problem, so check Flyway first.
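After a restart or deploy it can be easier to poll readiness than to re-run curl by hand. The following is a minimal sketch, not part of the Atlas tooling: wait_until and probe_ready are hypothetical helper names, the host is assumed to be localhost, and the sketch assumes /health/ready returns a non-2xx status while a dependency is not ready.

```shell
# Hypothetical helper: retry a probe command until it succeeds.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Assumption: readiness answers with a non-2xx status when a dependency
# is down, so curl -f exits non-zero.
probe_ready() {
  curl -fsS "http://localhost:${API_PORT:-8000}/health/ready" >/dev/null
}

# Example: wait up to about a minute, then fall back to the logs.
# wait_until 30 2 probe_ready || docker compose logs --tail=100 osint-api
```

Passing the probe as a command keeps the retry loop reusable for other checks, for example a docker compose ps filter.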

Database failures

docker compose logs flyway # migration success/failure
docker compose run --rm flyway info # current version + pending
# inside postgres: check long-running queries, locks, connection count vs pool

A migration failure blocks API boot by design. Resolve the migration, then restart the API.
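The "inside postgres" checks above can be sketched as a few queries against the standard pg_stat_activity and pg_locks views. This is an illustrative sketch: the helper names are hypothetical, and the usage line assumes a compose service named postgres and POSTGRES_USER / POSTGRES_DB variables, none of which are confirmed by this page.

```shell
# Print SQL listing non-idle sessions running longer than N seconds (default 60).
long_running_sql() {
  min_seconds=${1:-60}
  cat <<SQL
SELECT pid, state, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${min_seconds} seconds'
ORDER BY runtime DESC;
SQL
}

# Print SQL listing lock requests that are still waiting.
waiting_locks_sql() {
  cat <<'SQL'
SELECT pid, locktype, mode FROM pg_locks WHERE NOT granted;
SQL
}

# Print SQL comparing open sessions with the server-side connection limit.
connection_count_sql() {
  cat <<'SQL'
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
SQL
}

# Example (service name and variables are assumptions):
# long_running_sql 30 | docker compose exec -T postgres psql -U "$POSTGRES_USER" -d "$POSTGRES_DB"
```

Compare the connection count with the API's configured pool size as well as with max_connections; a pool that is saturated well below the server limit points at the application side.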

Temporal and stuck investigations

  • Confirm temporal-worker is running and registered on the task queue.
  • Use the Temporal UI to find the workflow by investigation id and inspect failed/timed-out activities and retry counts.
  • A single slow module is often caused by a slow external dependency (MCP tool, LLM, or provider API).

Investigations tolerate partial module failure — they complete with available data and report what failed. See Temporal workflows.
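If the Temporal CLI is available, the same checks can be run from a terminal instead of the UI. This is a sketch under assumptions: it presumes the temporal CLI is installed and already pointed at the right cluster and namespace, the helper names are hypothetical, and the workflow ID and task queue name are placeholders you must supply (this page does not state how a workflow ID is derived from an investigation id).

```shell
# Hypothetical helper: print status and event history for one workflow.
# The history shows failed or timed-out activities and their retry attempts.
inspect_workflow() {
  workflow_id=$1
  temporal workflow describe --workflow-id "$workflow_id"
  temporal workflow show --workflow-id "$workflow_id"
}

# Hypothetical helper: list pollers on a task queue to confirm that a
# worker is registered on it.
check_task_queue() {
  temporal task-queue describe --task-queue "$1"
}

# Example (both values are placeholders):
# check_task_queue "<task-queue-name>"
# inspect_workflow "<workflow-id>"
```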

Data correctness — graph parity

PostgreSQL is the source of truth; Neo4j is a projection. If relationship queries look wrong, check parity.

# Check parity between PostgreSQL and Neo4j (see GraphParityService)
# Per-investigation parity checks are available via the graph/debug surfaces

If the projection has drifted, re-run the sync and, if needed, the cleanup script to remove stale/orphaned nodes. See Graph sync.
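As a rough first check before reaching for GraphParityService, comparing a row count from PostgreSQL with a node count from Neo4j can show which direction the drift runs. The helper below is a hypothetical sketch, not the service's logic: you supply the two counts from queries that fit your schema, and equal counts do not prove parity.

```shell
# Hypothetical helper: compare a PostgreSQL row count with a Neo4j node count.
# Usage: parity_report PG_COUNT NEO4J_COUNT
parity_report() {
  pg_count=$1
  neo_count=$2
  if [ "$pg_count" -eq "$neo_count" ]; then
    echo "in sync: $pg_count"
  elif [ "$neo_count" -gt "$pg_count" ]; then
    # More nodes than source rows suggests leftovers in the projection.
    echo "neo4j has $((neo_count - pg_count)) extra: check for stale/orphaned nodes"
  else
    # Fewer nodes than source rows suggests the sync has not caught up.
    echo "neo4j is missing $((pg_count - neo_count)): re-run the sync"
  fi
}

# Example with made-up numbers:
# parity_report 1200 1187
```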

Auth failures

  • Verify Keycloak is reachable and the tenant realm exists.
  • A 401 with a valid-looking token usually means an unresolved tenant context — check AUTH_ENABLED and the AUTH_* group/path mappings.
  • Cross-tenant 404s are expected: RLS hides other tenants' rows. See Security & multi-tenancy.
  • See the repo's docs/debug/adr-022-login-failure.md for a worked login-failure investigation.
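When a token looks valid but still yields a 401, reading its claims can show whether the issuer and group claims match what the AUTH_* mappings expect. The sketch below decodes the payload segment of a JWT without verifying the signature; jwt_payload is a hypothetical helper name, and it assumes a base64 that accepts -d (GNU coreutils, BusyBox, recent macOS).

```shell
# Hypothetical helper: print the JSON payload of a JWT (no signature check).
# Usage: jwt_payload "$TOKEN"
jwt_payload() {
  # Take the middle segment and map base64url characters back to base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url encoding strips.
  case $(( ${#payload} % 4 )) in
    2) payload="$payload==" ;;
    3) payload="$payload=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Example: jwt_payload "$TOKEN"
```

For Keycloak-issued tokens the iss claim ends in /realms/<realm-name>, which gives a quick check of the tenant realm. Treat decoded tokens as secrets and avoid pasting production tokens into shared terminals.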

Performance and rate limits

  • 429 responses come from the slowapi rate limiter — review RATE_LIMIT_* values.
  • For slow reads, check DB load (long queries, locks, pool saturation) and cache TTLs.
  • MAX_CONCURRENT_INVESTIGATIONS bounds investigation parallelism; raising it increases load on LLMs/providers and the DB.
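To review the effective limits, it can help to list only the relevant variables from the running container's environment. A minimal sketch, assuming the API container ships an env binary and that the settings are passed as environment variables; filter_limits is a hypothetical helper name.

```shell
# Hypothetical helper: keep only rate-limit and concurrency settings
# from KEY=VALUE lines on stdin, sorted by name.
filter_limits() {
  grep -E '^(RATE_LIMIT_[A-Za-z0-9_]*|MAX_CONCURRENT_INVESTIGATIONS)=' | sort
}

# Example (assumes env exists in the osint-api container):
# docker compose exec -T osint-api env | filter_limits
```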

Escalation

Conflicts requiring human judgement surface through the mutation queue as review tasks, not as incidents. Operational alerting (Slack/email) is configured via SLACK_WEBHOOK_URL and SMTP_*. For deeper procedures (key rotation, DR drills, vulnerability management) consult the security runbooks in the main repository.
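To confirm that Slack alerting is wired up, a test message can be posted to the configured webhook. This sketch assumes SLACK_WEBHOOK_URL is a standard Slack incoming webhook that accepts a JSON body with a text field; slack_test_alert is a hypothetical helper name, and running it posts a real message to the channel.

```shell
# Hypothetical helper: post a test message to the configured Slack webhook.
slack_test_alert() {
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL is not set" >&2
    return 1
  fi
  # -f makes curl exit non-zero on an HTTP error response.
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data '{"text":"Atlas alerting test"}' "$SLACK_WEBHOOK_URL"
}

# Example: slack_test_alert && echo "webhook accepted the message"
```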