Temporal workflows

Investigations are long-running, multi-step, and must survive process restarts — exactly the problem Temporal solves. Atlas models an investigation as a durable workflow whose steps are activities: idempotent units of work that Temporal will retry and whose progress is persisted. The code lives in src/temporal/.

Why Temporal

A single investigation runs seven OSINT modules, each an agentic LLM loop that can take minutes, then reconciles entities, syncs the graph, scores risk, and generates a report. If the worker crashes mid-investigation, the work must resume — not restart. Temporal gives Atlas durable execution, automatic retries, timeouts, and full history for observability.

The investigation workflow

InvestigationWorkflow (src/temporal/workflows.py) is the entry point. It fans out the seven modules in parallel, then runs reconciliation, persistence, graph sync, risk scoring, and report generation as activities. Partial module failure is tolerated — the investigation completes with the data it has and surfaces what failed.
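The fan-out with tolerated partial failure can be sketched with plain asyncio. This is an illustration, not the code in workflows.py: the module names and the stand-in execute_module body are hypothetical, and in a real Temporal workflow each call would presumably be an activity invocation (for example via workflow.execute_activity) rather than a local coroutine.

```python
import asyncio

# Hypothetical module names; the source says there are seven but does not list them.
MODULES = [f"module_{i}" for i in range(1, 8)]


async def execute_module(name: str) -> dict:
    """Stand-in for a module run; module_3 fails on purpose."""
    await asyncio.sleep(0)
    if name == "module_3":
        raise RuntimeError("provider unavailable")
    return {"module": name, "entities": [f"{name}-entity"]}


async def run_modules(names: list) -> tuple:
    """Run all modules concurrently and separate results from failures."""
    outcomes = await asyncio.gather(
        *(execute_module(n) for n in names),
        return_exceptions=True,  # a failing module must not cancel its siblings
    )
    results, errors = [], {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            errors[name] = repr(outcome)  # surfaced to the caller, not raised
        else:
            results.append(outcome)
    return results, errors


results, errors = asyncio.run(run_modules(MODULES))
print(len(results), "modules succeeded; failed:", sorted(errors))
```

Collecting exceptions instead of raising them is what lets the later stages run on whatever data the surviving modules produced.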

Activities

Activity file	Responsibility
`activities.py`	Core: module execution, persistence, reconciliation, risk
`enrichment_activities.py`	Freshness checks, context building
`sync_activities.py`	Neo4j sync orchestration
`risk_rules.py`	Per-module risk-rule application

Client, worker, and queues

Client (client.py) exposes start_investigation(), cancel_investigation(), and get_workflow_history(), which talk to the Temporal server over gRPC.
Worker (worker.py) polls task queues and executes workflows and activities. In production this is a separate temporal-worker deployment; a second deployment, workflow-engine-worker, runs the low-code workflow engine.
Task queues separate concerns (general activities, graph sync, risk scoring), each with its own timeouts, plus heartbeats for long-running steps.

Relationship to the low-code workflow engine

Beyond investigations, Atlas has an experimental workflow engine v2.0 (src/workflows/, ADR-007) that lets users compose custom task sequences in a visual builder. It runs on its own worker and is feature-flagged. The investigation workflow described here is the production path; the workflow engine is the extensibility path.

Observability

Because every step is a recorded activity, an in-flight investigation is fully inspectable — the Temporal UI (temporal-ui service) shows workflow history, and the frontend's TemporalActivityPanel surfaces module/activity progress to analysts. See Operations → Observability.

Deep dive: workflow internals

This section reflects the code in src/temporal/: workflows.py, activities.py, enrichment_activities.py, and constants.py.

The six stages

1. Enrichment fetches provider context for the prompts; failure is non-fatal (the investigation proceeds without it).
2. Modules run in full parallel. Each execute_module is its own activity (not a child workflow), bracketed by open_run/close_run for provenance.
3. Reconciliation merges entities across modules and syncs the graph.
4. Risk computes overall level/score/recommendation.
5. Persist writes the result; a failure here is logged into output.errors, not fatal.
6. Graph sync runs as a detached child workflow (ABANDON) so it never blocks investigation completion.
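The two non-fatal stages behave differently: enrichment failure is simply absorbed, while persist failure is recorded in output.errors. A minimal sketch of that distinction follows. The Output dataclass, the failing enrich and persist stand-ins, and the placeholder risk value are all hypothetical, and the module, reconciliation, and graph-sync stages are elided.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Output:
    """Hypothetical result container; only `errors` is named in the source."""
    context: dict = field(default_factory=dict)
    risk: Optional[str] = None
    errors: list = field(default_factory=list)
    persisted: bool = False


async def enrich() -> dict:
    raise ConnectionError("provider down")  # simulate an enrichment failure


async def persist(output: Output) -> None:
    raise IOError("database unavailable")  # simulate a persist failure


async def run_stages() -> Output:
    out = Output()
    try:
        out.context = await enrich()
    except Exception:
        out.context = {}  # non-fatal: proceed without provider context
    # Modules and reconciliation would run here; elided in this sketch.
    out.risk = "low"  # placeholder for the risk stage
    try:
        await persist(out)
        out.persisted = True
    except Exception as exc:
        out.errors.append(f"persist: {exc}")  # recorded, not raised
    return out


out = asyncio.run(run_stages())
print(out)
```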

Configuration constants

Setting	Value
Task queue	`osint-investigations` (`constants.py`)
Module retry	initial interval 1s, maximum interval 1m, 3 attempts, backoff coefficient 2.0
Per-module timeout	20 minutes
Reconciliation / persist / enrichment timeouts	5m / 2m / 2m
Tool result truncation (model)	`MAX_LLM_TOOL_RESULT_CHARS` = 12,000
Evidence retention	`MAX_EVIDENCE_TOOL_RESULT_CHARS` = 200,000
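To make these numbers concrete, the sketch below works out the retry delays implied by the module retry settings and applies the two truncation limits to one oversized tool result. It assumes the usual exponential-backoff rule (delay = initial × backoff^n, capped at the maximum) and assumes truncation is a simple cut from the start of the string; the helper names retry_delays and truncate are hypothetical. In the Temporal Python SDK these retry values would typically be passed to a RetryPolicy.

```python
from datetime import timedelta

# Values from the table above.
INITIAL_INTERVAL = timedelta(seconds=1)
MAXIMUM_INTERVAL = timedelta(minutes=1)
BACKOFF_COEFFICIENT = 2.0
MAXIMUM_ATTEMPTS = 3
MAX_LLM_TOOL_RESULT_CHARS = 12_000
MAX_EVIDENCE_TOOL_RESULT_CHARS = 200_000


def retry_delays(initial, backoff, maximum, max_attempts):
    """Delay before each retry; the first attempt has no preceding delay."""
    return [
        min(initial * (backoff ** n), maximum)
        for n in range(max_attempts - 1)
    ]


def truncate(text: str, limit: int) -> str:
    """Assumed behavior: keep the first `limit` characters."""
    return text[:limit]


delays = retry_delays(
    INITIAL_INTERVAL, BACKOFF_COEFFICIENT, MAXIMUM_INTERVAL, MAXIMUM_ATTEMPTS
)
raw_tool_result = "x" * 250_000
for_model = truncate(raw_tool_result, MAX_LLM_TOOL_RESULT_CHARS)
for_evidence = truncate(raw_tool_result, MAX_EVIDENCE_TOOL_RESULT_CHARS)
print([d.total_seconds() for d in delays], len(for_model), len(for_evidence))
```

With only three attempts the one-minute cap is never reached; it would only matter if the attempt count were raised.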

Partial-failure semantics

The investigation is marked successful if any module completes (or all are skipped) — a single failing module degrades the result rather than failing the whole run. Cancellation propagates asyncio.CancelledError to in-flight activities.
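The success rule can be written as a small predicate. This is a sketch of the rule as stated above, not the code in workflows.py; the ModuleStatus enum and the function name are hypothetical.

```python
from enum import Enum


class ModuleStatus(Enum):
    """Hypothetical status values for a single module run."""
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


def investigation_succeeded(statuses) -> bool:
    """Success if any module completed, or if every module was skipped."""
    statuses = list(statuses)
    any_completed = any(s is ModuleStatus.COMPLETED for s in statuses)
    # Note: all() is True for an empty list, so "no modules" counts as all skipped.
    all_skipped = all(s is ModuleStatus.SKIPPED for s in statuses)
    return any_completed or all_skipped


C, F, S = ModuleStatus.COMPLETED, ModuleStatus.FAILED, ModuleStatus.SKIPPED
print(investigation_succeeded([C, F, F, F, F, F, F]))  # one success is enough
print(investigation_succeeded([F, S, F]))              # nothing completed
```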

Transient vs permanent errors

Transient exceptions (RemoteProtocolError, ConnectError, TimeoutException, ReadTimeout, …) auto-retry per the RetryPolicy; permanent module exceptions are recorded in output.errors and don't cascade. A MissingTenantCredentialsError surfaces as a structural failure (no retry).
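One way to picture the three-way split is a classifier keyed on exception class names. This sketch is an assumption about the shape of the logic, not the actual implementation: the exception classes are local stand-ins (the names match exception classes in httpx, though the source does not say which library they come from), only the four names listed explicitly are included, and the classify function is hypothetical. In Temporal itself, the retry decision is normally expressed through the RetryPolicy and non-retryable error types rather than a function like this.

```python
# Only the names listed explicitly above; the source elides the rest.
TRANSIENT_ERROR_NAMES = {
    "RemoteProtocolError",
    "ConnectError",
    "TimeoutException",
    "ReadTimeout",
}


class MissingTenantCredentialsError(Exception):
    """Stand-in for the structural failure named in the text."""


class TimeoutException(Exception):
    """Local stand-in for a transient transport error."""


class ReadTimeout(TimeoutException):
    """Local stand-in; a subclass of the stand-in above."""


def classify(exc: BaseException) -> str:
    """Return 'structural', 'transient', or 'permanent' for an exception."""
    if isinstance(exc, MissingTenantCredentialsError):
        return "structural"  # no retry
    # Match on the class and its bases so subclasses are covered too.
    names = {cls.__name__ for cls in type(exc).__mro__}
    if names & TRANSIENT_ERROR_NAMES:
        return "transient"  # retried under the retry policy
    return "permanent"  # recorded in output.errors, no cascade


for exc in (ReadTimeout(), ValueError("bad input"), MissingTenantCredentialsError()):
    print(type(exc).__name__, "->", classify(exc))
```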

To add a module, see Add an OSINT module.
