Runbook
Day-2 operations for Atlas. This page summarises the most common incident classes and where to
look first. It mirrors the structure of the repository's docs/RUNBOOK.md, which carries the full
command-level detail.
Triage map
API and dependencies
When the API is unhealthy, check liveness/readiness and the dependency it can't reach.
curl -s "http://localhost:${API_PORT:-8000}/health" # process alive?
curl -s "http://localhost:${API_PORT:-8000}/health/ready" # which dependency is not ready?
docker compose ps # are postgres/neo4j/redis/minio healthy?
docker compose logs -f osint-api # boot/runtime errors
Remember the API is fail-closed: if the ontology schema or TranslationRegistry fails to load at boot, it aborts rather than serving. A readiness failure shortly after deploy often means a schema or migration problem — check Flyway first.
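The liveness and readiness checks above can be chained into a single pass. This is a minimal sketch, not part of the repository: it assumes the API is reachable on localhost and that both endpoints answer HTTP 200 when healthy (the runbook does not state the status codes). `probe` and `triage_api` are hypothetical helper names.

```shell
# Assumption: the API listens on localhost at API_PORT (default 8000).
API_BASE="http://localhost:${API_PORT:-8000}"

# Print only the HTTP status code for a path.
# curl prints 000 when it cannot connect at all.
probe() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${API_BASE}$1"
}

# Check liveness first, then readiness, and point at the next place to look.
triage_api() {
  live=$(probe /health)
  if [ "$live" != "200" ]; then
    echo "liveness failed (HTTP $live): check 'docker compose logs osint-api'"
    return 1
  fi
  ready=$(probe /health/ready)
  if [ "$ready" != "200" ]; then
    echo "readiness failed (HTTP $ready): check dependencies and 'docker compose logs flyway'"
    return 2
  fi
  echo "API live and ready"
}
```

Calling `triage_api` returns 1 for a liveness failure and 2 for a readiness failure, so it can gate a deploy script or a cron check.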
Database failures
docker compose logs flyway # migration success/failure
docker compose run --rm flyway info # current version + pending
# inside postgres: check long-running queries, locks, connection count vs pool
A migration failure blocks API boot by design. Resolve the migration, then restart the API.
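The "inside postgres" checks can be run from the host against the standard `pg_stat_activity` and `pg_locks` views. The sketch below assumes the compose service is named `postgres` and uses placeholder credentials (`PGUSER_NAME`, `PGDB_NAME`); the helper names and the 30-second threshold are illustrative, and the connection count is compared with the server's `max_connections` rather than the application pool size, which the runbook does not specify.

```shell
# Hypothetical helper: run one SQL statement inside the postgres container.
# PGUSER_NAME and PGDB_NAME are placeholders, not values from the runbook.
pg_sql() {
  docker compose exec -T postgres \
    psql -U "${PGUSER_NAME:-postgres}" -d "${PGDB_NAME:-postgres}" -At -c "$1"
}

# Non-idle queries that have been running for more than 30 seconds, longest first.
long_queries() {
  pg_sql "SELECT pid, now() - query_start AS runtime, state, left(query, 80)
          FROM pg_stat_activity
          WHERE state <> 'idle' AND now() - query_start > interval '30 seconds'
          ORDER BY runtime DESC;"
}

# Lock requests that are still waiting to be granted.
waiting_locks() {
  pg_sql "SELECT pid, locktype, mode, relation::regclass
          FROM pg_locks WHERE NOT granted;"
}

# Open connections versus the server-side limit.
connection_usage() {
  pg_sql "SELECT count(*) || '/' || current_setting('max_connections')
          FROM pg_stat_activity;"
}
```

Run `long_queries`, `waiting_locks`, and `connection_usage` in that order; a long-running query holding a lock often explains both of the other symptoms.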
Temporal and stuck investigations
- Confirm temporal-worker is running and registered on the task queue.
- Use the Temporal UI to find the workflow by investigation id and inspect failed/timed-out activities and retry counts.
- A single slow module is often an external dependency (MCP tool, LLM, or provider API).
Investigations tolerate partial module failure — they complete with available data and report what failed. See Temporal workflows.
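When the Temporal UI is not at hand, the same checks can be approximated from a terminal. This sketch assumes the `temporal` CLI is installed, that `temporal-worker` is the compose service name, and that the frontend is at `localhost:7233`. The task queue name and the mapping from investigation id to workflow id are not given in this page, so both are passed in by the caller.

```shell
# Assumption: Temporal frontend address; override via TEMPORAL_ADDRESS.
TEMPORAL_ADDRESS="${TEMPORAL_ADDRESS:-localhost:7233}"

# Is the worker container up, and is anything polling the given task queue?
# $1 = task queue name (supplied by the caller, not defined in the runbook).
worker_status() {
  docker compose ps temporal-worker
  temporal task-queue describe --address "$TEMPORAL_ADDRESS" --task-queue "$1"
}

# Describe one workflow execution to see its status and pending activities.
# $1 = workflow id (how it derives from the investigation id is an assumption).
workflow_status() {
  temporal workflow describe --address "$TEMPORAL_ADDRESS" --workflow-id "$1"
}
```

If `worker_status` shows no pollers on the queue, restart the worker before digging into individual workflows.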
Data correctness — graph parity
PostgreSQL is the source of truth; Neo4j is a projection. If relationship queries look wrong, check parity.
# Check parity between PostgreSQL and Neo4j (see GraphParityService)
# Per-investigation parity checks are available via the graph/debug surfaces
If the projection has drifted, re-run the sync and, if needed, the cleanup script to remove stale/orphaned nodes. See Graph sync.
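At its simplest, a parity check compares a count taken from PostgreSQL with the matching count from Neo4j. The helper below is a hypothetical illustration of that comparison only; it is not GraphParityService, and it expects the caller to supply both numbers because the table and label names are not given here.

```shell
# Compare a PostgreSQL count with the matching Neo4j count.
# $1 = label for the report, $2 = PostgreSQL count, $3 = Neo4j count.
# This sketch does not query either store.
parity_report() {
  label=$1
  pg=$2
  neo=$3
  delta=$(( pg - neo ))
  if [ "$delta" -eq 0 ]; then
    echo "$label: in parity ($pg)"
  else
    echo "$label: drift postgres=$pg neo4j=$neo delta=$delta"
    return 1
  fi
}
```

For example, `parity_report relationships 120 118` prints `relationships: drift postgres=120 neo4j=118 delta=2`. A positive delta suggests the projection is missing items and needs a sync; a negative delta suggests stale or orphaned nodes that the cleanup script should remove.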
Auth failures
- Verify Keycloak is reachable and the tenant realm exists.
- A 401 with a valid-looking token usually means an unresolved tenant context: check AUTH_ENABLED and the AUTH_* group/path mappings.
- Cross-tenant 404s are expected: RLS hides other tenants' rows. See Security & multi-tenancy.
- See the repo's docs/debug/adr-022-login-failure.md for a worked login-failure investigation.
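To see why a valid-looking token fails to resolve a tenant, it helps to read the claims it actually carries and compare them with the AUTH_* mappings. The sketch below decodes the payload segment of a JWT (Keycloak access tokens are JWTs). `jwt_payload` is a hypothetical helper, it does not verify the signature, and which claim holds the tenant or group is not specified in this page.

```shell
# Print the decoded JSON payload of a JWT passed as $1.
# Inspection only: the signature is NOT verified.
jwt_payload() {
  # Take the second dot-separated segment and map base64url to plain base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that JWT encoding strips.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}
```

Usage: `jwt_payload "$ACCESS_TOKEN"`, where `ACCESS_TOKEN` is whatever variable holds the bearer token. Avoid pasting production tokens into shared terminals or logs.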
Performance and rate limits
- 429 responses come from the slowapi rate limiter: review RATE_LIMIT_* values.
- For slow reads, check DB load (long queries, locks, pool saturation) and cache TTLs.
- MAX_CONCURRENT_INVESTIGATIONS bounds investigation parallelism; raising it increases load on LLMs/providers and the DB.
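Before tuning anything, confirm which throttling values the running container actually sees. A minimal sketch, assuming the settings are plain environment variables in the `osint-api` container; `throttle_settings` is a hypothetical helper that filters `KEY=VALUE` lines on stdin.

```shell
# Keep only the rate-limit and concurrency settings from KEY=VALUE input.
throttle_settings() {
  grep -E '^(RATE_LIMIT_[A-Z0-9_]*|MAX_CONCURRENT_INVESTIGATIONS)=' | sort
}
```

Usage: `docker compose exec -T osint-api env | throttle_settings`. Empty output means none of these variables are set in the container, so the application defaults apply.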
Escalation
Conflicts requiring human judgement surface through the mutation queue
as review tasks, not as incidents. Operational alerting (Slack/email) is configured via
SLACK_WEBHOOK_URL and SMTP_*. For deeper procedures (key rotation, DR drills, vulnerability
management) consult the security runbooks in the main repository.
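A quick way to confirm the Slack side of alerting is to post a test message to the configured webhook. This sketch assumes SLACK_WEBHOOK_URL is a standard Slack incoming webhook, which accepts a JSON body with a `text` field; `slack_smoke_test` is a hypothetical helper and the message text is arbitrary.

```shell
# Post a test message to the webhook in SLACK_WEBHOOK_URL.
# Returns 1 without calling curl when the variable is unset or empty.
slack_smoke_test() {
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL is not set" >&2
    return 1
  fi
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data '{"text":"Atlas alerting smoke test"}' \
    "$SLACK_WEBHOOK_URL"
}
```

Run it from an environment that has the same variables as the service, so the test exercises the webhook the alerts will actually use.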