Observability
Atlas is observable at three levels: LLM tracing (what the crews did), workflow history (what the investigation did), and logs/metrics/alerts (what the platform did). Together they let you answer not just "is it up?" but also "why did this investigation produce this result?"
LLM tracing — Langfuse
Crew LLM calls are traced to Langfuse (src/observability/), giving per-investigation
visibility into prompts, tool calls, latencies, and cost. The Langfuse stack
(langfuse-web, langfuse-worker, ClickHouse, MinIO) is part of the deployment and configured via
LANGFUSE_* variables. The flush_langfuse() and shutdown_langfuse() lifecycle hooks flush buffered
traces so that they are not lost on shutdown.
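
The flush/shutdown pattern described above can be sketched without the Langfuse SDK. This is a minimal illustration, not the code in src/observability/: BufferedTracer, flush_traces, and shutdown_traces are hypothetical stand-ins, and the in-memory `exported` list stands in for the network export a real client would perform.

```python
import atexit
from typing import Callable, List


class BufferedTracer:
    """Hypothetical stand-in for a tracing client that batches events in memory."""

    def __init__(self, export: Callable[[List[dict]], None]) -> None:
        self._export = export
        self._buffer: List[dict] = []
        self._closed = False

    def record(self, event: dict) -> None:
        if self._closed:
            raise RuntimeError("tracer is shut down")
        self._buffer.append(event)

    def flush(self) -> None:
        # Export everything buffered so far, then clear the buffer.
        if self._buffer:
            self._export(list(self._buffer))
            self._buffer.clear()

    def shutdown(self) -> None:
        # Flush first so nothing is dropped, then refuse new events.
        if not self._closed:
            self.flush()
            self._closed = True


# Assumption: a plain list plays the role of the trace backend.
exported: List[dict] = []
tracer = BufferedTracer(export=exported.extend)


def flush_traces() -> None:
    """Call at the end of a unit of work, e.g. one investigation."""
    tracer.flush()


def shutdown_traces() -> None:
    """Call once when the process exits; safe to call more than once."""
    tracer.shutdown()


# A normal interpreter exit will still drain the buffer.
atexit.register(shutdown_traces)

tracer.record({"name": "crew.search", "latency_ms": 120})
flush_traces()
```

Flushing per unit of work keeps traces visible soon after an investigation finishes, while the exit hook covers whatever is still buffered when the worker stops.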
Use Langfuse when a finding looks wrong and you need to see exactly what the crew searched, what the tools returned, and how the model reasoned.
Workflow history — Temporal
Because every investigation step is a recorded Temporal activity, the Temporal UI
(temporal-ui) shows full workflow history — which activities ran, which retried, which timed out.
In the Console, the TemporalActivityPanel surfaces module/activity progress to analysts in real
time. See Temporal workflows.
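
When reading a long history, it can help to reduce it to the three questions above: what was scheduled, what retried, and what timed out. The sketch below works on a simplified list of dicts rather than the Temporal SDK's history objects. The key names ("event_id", "scheduled_event_id", "attempt", "activity") and the activity names are assumptions made for illustration, loosely modelled on Temporal's activity task events.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def summarize_activities(events: Iterable[dict]) -> Dict[str, List[str]]:
    """Group activity names by outcome from a simplified event list."""
    name_by_scheduled_id: Dict[int, str] = {}
    summary: Dict[str, List[str]] = defaultdict(list)

    for event in events:
        kind = event["type"]
        if kind == "ActivityTaskScheduled":
            # Remember the name so later events can be linked back to it.
            name_by_scheduled_id[event["event_id"]] = event["activity"]
            summary["scheduled"].append(event["activity"])
            continue

        name = name_by_scheduled_id.get(event.get("scheduled_event_id"))
        if name is None:
            continue  # not an activity event we can attribute

        if kind == "ActivityTaskStarted" and event.get("attempt", 1) > 1:
            # An attempt number above 1 means earlier attempts failed.
            summary["retried"].append(name)
        elif kind == "ActivityTaskTimedOut":
            summary["timed_out"].append(name)

    return dict(summary)


# Hypothetical history for one investigation.
history = [
    {"event_id": 5, "type": "ActivityTaskScheduled", "activity": "resolve_entity"},
    {"event_id": 6, "type": "ActivityTaskStarted", "scheduled_event_id": 5, "attempt": 3},
    {"event_id": 7, "type": "ActivityTaskCompleted", "scheduled_event_id": 5},
    {"event_id": 8, "type": "ActivityTaskScheduled", "activity": "fetch_filings"},
    {"event_id": 9, "type": "ActivityTaskTimedOut", "scheduled_event_id": 8},
]
summary = summarize_activities(history)
```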
Logs, metrics, and errors
| Signal | Where |
|---|---|
| Structured logs | LOG_LEVEL, LOG_FORMAT; container/pod logs |
| Investigation activity log | investigation_logs (per-investigation timeline) |
| Cost & entity metrics | metrics_router (/metrics) |
| Centrality / analytics | analytics_router |
| Error reporting | SENTRY_DSN |
| MCP / pipeline internals | debug_router |
The per-investigation activity log is especially useful: it records the discrete steps and
domain events (EntityDiscovered, EntityMerged, …) of a single run.
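
A per-investigation timeline of this kind can be modelled as an append-only list filtered by investigation and ordered by time. The sketch below is only an illustration of that idea: LogEntry, InvestigationLog, and their fields are hypothetical and do not describe the actual investigation_logs schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class LogEntry:
    investigation_id: str
    event: str  # e.g. "EntityDiscovered" or "EntityMerged"
    detail: str
    at: datetime


@dataclass
class InvestigationLog:
    """Hypothetical in-memory stand-in for an activity log store."""

    entries: List[LogEntry] = field(default_factory=list)

    def append(
        self,
        investigation_id: str,
        event: str,
        detail: str,
        at: Optional[datetime] = None,
    ) -> LogEntry:
        # Default to the current UTC time when no timestamp is supplied.
        entry = LogEntry(investigation_id, event, detail, at or datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def timeline(self, investigation_id: str) -> List[LogEntry]:
        # Only this investigation's entries, oldest first.
        own = (e for e in self.entries if e.investigation_id == investigation_id)
        return sorted(own, key=lambda e: e.at)


log = InvestigationLog()
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
log.append("inv-1", "EntityMerged", "merged two records", at=t0.replace(minute=5))
log.append("inv-2", "EntityDiscovered", "unrelated run", at=t0.replace(minute=1))
log.append("inv-1", "EntityDiscovered", "found a company", at=t0)

events_for_inv_1 = [e.event for e in log.timeline("inv-1")]
```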
Alerting
Operational notifications go to Slack (SLACK_WEBHOOK_URL) and email (SMTP_*,
NOTIFICATION_EMAIL). Tie alerts to readiness failures, migration failures, and stuck-workflow
conditions rather than to expected business outcomes (a flagged high-risk company is a result, not
an incident).
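
The distinction between incidents and results can be enforced at the point where notifications are built. The following is a sketch under stated assumptions: the event kind names and the build_alert and send_to_slack helpers are hypothetical, and the payload uses the plain {"text": ...} form accepted by Slack incoming webhooks.

```python
import json
import urllib.request
from typing import Optional

# Assumption: these kind names are illustrative, not Atlas identifiers.
OPERATIONAL_KINDS = {"readiness_failure", "migration_failure", "stuck_workflow"}


def build_alert(kind: str, message: str) -> Optional[dict]:
    """Return a Slack webhook payload, or None when the event is not an incident."""
    if kind not in OPERATIONAL_KINDS:
        # Business outcomes (e.g. a flagged company) are results, not alerts.
        return None
    return {"text": f"[atlas] {kind}: {message}"}


def send_to_slack(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload as JSON to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


incident = build_alert("stuck_workflow", "no progress for 30m")
result = build_alert("high_risk_company_flagged", "risk score above threshold")
```

In use, a caller would pass the value of SLACK_WEBHOOK_URL to send_to_slack only when build_alert returns a payload, so expected business outcomes never page anyone.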
What to watch
See the Runbook for how to act on each of these and Backup & DR for recovery procedures.