Data model & persistence

Atlas persists structured data in two complementary stores: PostgreSQL is the system of record for the canonical ontology and all operational data, and Neo4j holds a synced property graph optimised for relationship traversal. Alongside these, MinIO stores binary report documents and Redis backs caching.

Two stores, one truth

| | PostgreSQL | Neo4j |
|---|---|---|
| Role | System of record | Read projection |
| Holds | Entities, attributes, claims, relationships, mutations, and all operational tables | Entities & relationships as a property graph |
| Optimised for | Tenant-scoped CRUD, transactions, RLS | Multi-hop traversal (ownership, shared addresses, common connections) |
| Written by | API + Temporal activities | Neo4jSyncService from SQL changes |

The graph is always downstream of PostgreSQL. Writes land in PostgreSQL first, then a sync activity projects them into Neo4j. A parity service audits divergence. See Graph sync.
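
The write ordering described above can be illustrated with a small in-memory sketch. It is not the actual Neo4jSyncService: `RelationalStore`, `GraphProjection`, `sync_to_graph`, and `parity_diff` are hypothetical stand-ins chosen only to show that writes land in the relational store first, a sync step projects them, and a parity check reports divergence.

```python
from dataclasses import dataclass, field


@dataclass
class RelationalStore:
    """Stand-in for PostgreSQL: the system of record (hypothetical)."""
    entities: dict[str, dict] = field(default_factory=dict)
    pending: list[str] = field(default_factory=list)  # ids changed since last sync

    def upsert_entity(self, entity_id: str, data: dict) -> None:
        # Writes always land here first.
        self.entities[entity_id] = data
        self.pending.append(entity_id)


@dataclass
class GraphProjection:
    """Stand-in for Neo4j: a read projection, never written to directly by callers."""
    nodes: dict[str, dict] = field(default_factory=dict)


def sync_to_graph(store: RelationalStore, graph: GraphProjection) -> int:
    """Project pending relational changes into the graph; return how many were synced."""
    synced = 0
    while store.pending:
        entity_id = store.pending.pop(0)
        graph.nodes[entity_id] = dict(store.entities[entity_id])
        synced += 1
    return synced


def parity_diff(store: RelationalStore, graph: GraphProjection) -> set[str]:
    """Return ids whose graph copy is missing or differs from the relational row."""
    return {eid for eid, row in store.entities.items() if graph.nodes.get(eid) != row}
```

Until `sync_to_graph` runs, `parity_diff` reports the newly written entity as divergent, which is the kind of gap a parity audit is meant to surface.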

Core relational tables

The ontology core models entities and their attributes as claims so that multiple providers can assert different values for the same field. The illustrative shape:

-- Representative structure (columns abbreviated); see src/database/entity_repository.py and migrations/
entities (id, tenant_id, entity_type, name, properties JSONB, created_at, updated_at)
relationships (id, tenant_id, source_id, target_id, rel_type, properties JSONB, ...)
attributes (id, entity_id, attr_name, attr_value, provider_id, confidence, source_url, ...)
claims (id, entity_id, claim_type, claim_data JSONB, supporting_evidence JSONB, ...)
mutations (id, entity_id, provider_id, mutation_type, before_value JSONB, after_value JSONB, ...)
conflicts (id, entity_id, mutation_ids, status, resolved_at, resolution_strategy)
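
To illustrate why attributes are stored per provider rather than as a single column value, the following sketch groups competing assertions for the same field. `AttributeClaim`, `claims_by_field`, and `disputed_fields` are hypothetical names; the fields mirror the representative attributes columns above, and no survivorship rule is implied.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeClaim:
    """One provider's assertion about one field of one entity (hypothetical shape)."""
    entity_id: str
    attr_name: str
    attr_value: str
    provider_id: str
    confidence: float


def claims_by_field(claims):
    """Group claims by (entity_id, attr_name) so competing values sit side by side."""
    grouped = defaultdict(list)
    for claim in claims:
        grouped[(claim.entity_id, claim.attr_name)].append(claim)
    return dict(grouped)


def disputed_fields(claims):
    """Return the fields for which providers assert more than one distinct value."""
    return {
        key
        for key, group in claims_by_field(claims).items()
        if len({c.attr_value for c in group}) > 1
    }
```

Because every assertion keeps its own row, a disagreement between providers is visible as data instead of being lost to a last-writer-wins update.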

Every table that holds tenant data carries a tenant_id and is guarded by an RLS policy. See Security & multi-tenancy.

Operational tables

Beyond the ontology core, PostgreSQL holds the operational domain:

| Domain | Tables (representative) |
|---|---|
| Investigations | investigations, investigation_logs, provider_runs, evidence |
| Reports | reports |
| Companies | companies, portfolio status history |
| Risk | risk_scores, risk_matrix_schemas, risk_matrix_assignments, risk_matrix_evaluations |
| Tenancy & config | tenants, data_provider_credentials, agent_configurations, agent_prompts, ontology_schema_versions, reference_datasets |

Repositories

Data access is mediated by ~21 repository classes under src/database/ rather than ad-hoc SQL in handlers. This keeps tenant-scoping, RLS, and query patterns in one place.

| Repository | Responsibility |
|---|---|
| OntologyPersistenceRepository | Entity & relationship persistence, merges (entity_repository.py, large) |
| CompanyRepository | Company registry and subject-entity lookup |
| InvestigationRepository / ReportRepository | Investigation & report lifecycle |
| EvidenceRepository | Finding evidence artefacts |
| ProviderRunRepository | Provider execution logs |
| RiskRepository | Risk scores & indicators |
| MutationRepository / conflict repo | Provenance & conflict tracking (src/mutation_queue/) |
| TenantRepository, SettingsRepository, SchemaRepository, SegmentsRepository, DataProviderRepository | Tenancy, settings, schema versions, segments, providers |

Connection pool & RLS

A single DatabasePool (asyncpg) is initialised at startup. The application connects through a restricted atlas_app role so that RLS is enforced rather than bypassed by a superuser. The current tenant is set on the session per request, and policies of the form tenant_id = current_setting('app.current_tenant_id') filter every query against tenant-owned tables. See Security & multi-tenancy for the full model and the relevant ADR.

Migrations

Schema is versioned with Flyway (the migrations/ directory, 130+ versioned scripts) and applied automatically at pod/container startup before the API serves traffic. See Operations → Deployment.


Deep dive: pool, RLS binding, and JSONB discipline

This section reflects src/database/connection.py, tenant_session.py, and the repositories.

Two connection patterns

Tenant-owned tables are always read through a session that sets app.current_tenant_id; truly platform-shared tables (provider registry, MCP servers, coverage) use a raw pool connection. The distinction is deliberate: a query against an RLS-protected table must run with a tenant context, otherwise it returns no rows.
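
The two patterns might look roughly like the sketch below. It is written against an asyncpg-style pool (`pool.acquire()`, `conn.transaction()`, `conn.execute()`), but the helper names `tenant_session` and `shared_connection` are assumptions, as is the choice of a transaction-local `set_config`; the actual tenant_session.py may bind the setting differently.

```python
from contextlib import asynccontextmanager

TENANT_SETTING = "app.current_tenant_id"  # setting name as given in the text above


@asynccontextmanager
async def tenant_session(pool, tenant_id: str):
    """Yield a connection whose transaction carries the tenant context for RLS."""
    async with pool.acquire() as conn:
        async with conn.transaction():
            # set_config(..., true) scopes the value to this transaction, so it
            # cannot leak to the next request that reuses the pooled connection.
            await conn.execute(
                "SELECT set_config($1, $2, true)", TENANT_SETTING, tenant_id
            )
            yield conn


@asynccontextmanager
async def shared_connection(pool):
    """Yield a raw pooled connection for platform-shared tables (no tenant context)."""
    async with pool.acquire() as conn:
        yield conn
```

A tenant-owned read would then run inside `async with tenant_session(pool, tenant_id) as conn:`, while a provider-registry lookup would use `shared_connection(pool)`.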

Repositories own the SQL

Handlers never write ad-hoc SQL. ~21 repositories under src/database/ encapsulate query patterns, so tenant scoping, JSONB handling, and RLS are applied consistently. New endpoints inject a repository via Depends (see Backend).
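
As a rough illustration of the split between handler and repository, the sketch below keeps the SQL inside a repository class and passes that repository to the handler. The method `find_by_name`, the `id` and `name` columns, and the handler are hypothetical, the `$1` placeholder is asyncpg's native style, and the Depends wiring itself is omitted.

```python
from typing import Any, Protocol


class Connection(Protocol):
    """The subset of an asyncpg-style connection the repository relies on."""

    async def fetch(self, query: str, *args: Any) -> list: ...


class CompanyRepository:
    """Hypothetical shape: all company SQL lives here, not in handlers."""

    def __init__(self, conn: Connection) -> None:
        # The connection is expected to already carry the tenant context,
        # so RLS scopes the query without a tenant_id filter in this SQL.
        self._conn = conn

    async def find_by_name(self, name: str) -> list[dict]:
        rows = await self._conn.fetch(
            "SELECT id, name FROM companies WHERE name = $1 ORDER BY name",
            name,
        )
        return [dict(row) for row in rows]


async def list_companies_handler(repo: CompanyRepository, name: str) -> dict:
    """A handler only orchestrates; it receives the repository as a dependency."""
    return {"items": await repo.find_by_name(name)}
```

Keeping the query in one method means a change to tenant scoping or column handling is made once, not in every endpoint that lists companies.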

JSONB decode discipline

A subtle but load-bearing rule: top-level JSONB columns (entity_data, provenance) are decoded at the top level only — individual leaves are not re-parsed — so a numeric-looking registration number like "007" never silently coerces to an int. The claims read path (Claims & survivorship) depends on this to preserve exact source values.
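
The rule can be demonstrated with the standard library alone. This sketch assumes the column arrives as a JSON string and contrasts a single top-level decode with a leaf-coercing helper; `decode_jsonb_column` and `coerce_leaves` are hypothetical names, and `coerce_leaves` exists only to show the failure mode.

```python
import json


def decode_jsonb_column(raw: str) -> dict:
    """Decode a JSONB column once, at the top level; leaves keep their JSON types."""
    return json.loads(raw)


def coerce_leaves(value):
    """Anti-pattern, for contrast: re-interpreting string leaves as numbers."""
    if isinstance(value, dict):
        return {k: coerce_leaves(v) for k, v in value.items()}
    if isinstance(value, list):
        return [coerce_leaves(v) for v in value]
    if isinstance(value, str) and value.isdigit():
        return int(value)  # "007" becomes 7: the leading zeros are lost
    return value


raw = '{"registration_number": "007", "employee_count": 12}'
decoded = decode_jsonb_column(raw)
```

Here `decoded["registration_number"]` remains the string "007", whereas `coerce_leaves(decoded)` would turn it into the integer 7 and the source value could no longer be recovered.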

asyncpg parameter binding

Per project convention, JSON parameters use CAST(:param AS jsonb) rather than a ::jsonb cast on a bind parameter — asyncpg binds positionally and the explicit CAST keeps parameterization correct.
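
A minimal sketch of the convention, assuming a query layer that accepts `:name`-style bind parameters (the execution call is not shown). The statement targets the representative claims columns from earlier on this page; `INSERT_CLAIM` and `claim_params` are hypothetical names.

```python
import json

# Form required by the convention above: an explicit CAST around the bind
# parameter, with no PostgreSQL-style cast operator attached to it.
INSERT_CLAIM = (
    "INSERT INTO claims (entity_id, claim_type, claim_data) "
    "VALUES (:entity_id, :claim_type, CAST(:claim_data AS jsonb))"
)


def claim_params(entity_id: str, claim_type: str, claim_data: dict) -> dict:
    """Build bind parameters; the JSON payload is serialised to text for the CAST."""
    return {
        "entity_id": entity_id,
        "claim_type": claim_type,
        "claim_data": json.dumps(claim_data),
    }
```

Serialising with `json.dumps` before binding keeps the payload a plain text parameter, and the CAST in the statement is the only place where it becomes jsonb.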

To add a tenant-scoped table and endpoint, see Add a frontend feature → backend.