
Observability

Atlas is observable at three levels: LLM tracing (what the crews did), workflow history (what the investigation did), and logs/metrics/alerts (what the platform did). Together they let you answer not just "is it up?" but also "why did this investigation produce this result?"

LLM tracing — Langfuse

Crew LLM calls are traced to Langfuse (src/observability/), giving per-investigation visibility into prompts, tool calls, latencies, and cost. The Langfuse stack (langfuse-web, langfuse-worker, ClickHouse, MinIO) is part of the deployment and is configured via LANGFUSE_* variables. The flush_langfuse() and shutdown_langfuse() lifecycle hooks ensure traces are not lost on shutdown.
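A minimal sketch of how the two lifecycle hooks might be wired into a worker's lifetime. The hook bodies here are stand-ins that only record that they were called; the real functions live in src/observability/, and their zero-argument signatures are an assumption. The names traced_lifetime and run_worker are hypothetical.

```python
from contextlib import contextmanager

# Stand-ins for the real hooks in src/observability/ (signatures assumed).
calls = []

def flush_langfuse():
    calls.append("flush")

def shutdown_langfuse():
    calls.append("shutdown")

@contextmanager
def traced_lifetime():
    """Flush and shut down tracing when the block exits, even on error."""
    try:
        yield
    finally:
        flush_langfuse()     # push any pending traces first
        shutdown_langfuse()  # then release the tracing client

def run_worker():
    # Hypothetical entry point: everything traced runs inside the block.
    with traced_lifetime():
        pass  # run crews / handle investigations here
```

Putting both calls in a finally clause means an exception in the worker still triggers the flush, which is the failure mode the hooks are meant to cover.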

Use Langfuse when a finding looks wrong and you need to see exactly what the crew searched, what the tools returned, and how the model reasoned.

Workflow history — Temporal

Because every investigation step is a recorded Temporal activity, the Temporal UI (temporal-ui) shows full workflow history — which activities ran, which retried, which timed out. In the Console, the TemporalActivityPanel surfaces module/activity progress to analysts in real time. See Temporal workflows.
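To illustrate the kind of question workflow history answers, the sketch below summarizes a list of history records into which activities ran, retried, or timed out. It assumes simplified, flattened records (one dict per event with the activity name attached), not the raw Temporal history format; the event type names match Temporal's, while the activity names and the summarize helper are hypothetical.

```python
from collections import defaultdict

def summarize(events):
    """Group flattened history records into ran / retried / timed-out activity names."""
    summary = defaultdict(set)
    for e in events:
        name = e["activity"]
        if e["type"] == "ActivityTaskStarted":
            summary["ran"].add(name)
            # An attempt number above 1 means earlier attempts failed and were retried.
            if e.get("attempt", 1) > 1:
                summary["retried"].add(name)
        elif e["type"] == "ActivityTaskTimedOut":
            summary["timed_out"].add(name)
    return {key: sorted(names) for key, names in summary.items()}

# Hypothetical records for one investigation.
events = [
    {"type": "ActivityTaskStarted", "activity": "resolve_entity", "attempt": 1},
    {"type": "ActivityTaskStarted", "activity": "fetch_filings", "attempt": 3},
    {"type": "ActivityTaskTimedOut", "activity": "fetch_filings"},
]
result = summarize(events)
```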

Logs, metrics, and errors

| Signal | Where |
| --- | --- |
| Structured logs | LOG_LEVEL, LOG_FORMAT; container/pod logs |
| Investigation activity log | investigation_logs (per-investigation timeline) |
| Cost & entity metrics | metrics_router (/metrics) |
| Centrality / analytics | analytics_router |
| Error reporting | SENTRY_DSN |
| MCP / pipeline internals | debug_router |
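As an illustration of the structured-logs entry above, here is a minimal sketch of how LOG_LEVEL and LOG_FORMAT could drive Python's standard logging module. The accepted values ("json" versus anything else for LOG_FORMAT), the defaults, and the JSON field names are assumptions, not Atlas's documented behavior; configure_logging and JsonFormatter are hypothetical names.

```python
import json
import logging
import os

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

def configure_logging(env=os.environ):
    # Assumed defaults: INFO level, plain-text format.
    level = env.get("LOG_LEVEL", "INFO").upper()
    fmt = env.get("LOG_FORMAT", "text").lower()
    handler = logging.StreamHandler()
    if fmt == "json":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    root = logging.getLogger()
    root.handlers = [handler]  # replace any existing handlers
    root.setLevel(level)       # setLevel accepts level names such as "DEBUG"
    return handler
```

One JSON object per line keeps container/pod logs easy to parse with whatever log collector sits in front of them.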

The per-investigation activity log is especially useful: it records the discrete steps and domain events (EntityDiscovered, EntityMerged, …) of a single run.

Alerting

Operational notifications go to Slack (SLACK_WEBHOOK_URL) and email (SMTP_*, NOTIFICATION_EMAIL). Tie alerts to readiness failures, migration failures, and stuck-workflow conditions rather than to expected business outcomes (a flagged high-risk company is a result, not an incident).
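The distinction between incidents and results can be sketched as a small gate in front of the Slack webhook. The condition labels, the message format, and the function names below are hypothetical; the only external facts relied on are that SLACK_WEBHOOK_URL holds the webhook address and that Slack incoming webhooks accept a JSON body with a "text" field.

```python
import json
import os
import urllib.request

# Conditions worth paging on, per the guidance above (labels are hypothetical).
INCIDENT_KINDS = {"readiness_failure", "migration_failure", "stuck_workflow"}

def should_alert(kind):
    """Business outcomes (e.g. a flagged company) are results, not incidents."""
    return kind in INCIDENT_KINDS

def build_request(webhook_url, kind, detail):
    body = json.dumps({"text": f"[{kind}] {detail}"}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def notify(kind, detail, env=os.environ):
    """Post to Slack only for operational incidents; return whether a post was sent."""
    url = env.get("SLACK_WEBHOOK_URL")
    if not url or not should_alert(kind):
        return False
    with urllib.request.urlopen(build_request(url, kind, detail), timeout=5):
        pass
    return True
```

Keeping the allow-list explicit makes it harder for a new event type to start paging people by accident.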

What to watch

See the Runbook for how to act on each of these and Backup & DR for recovery procedures.