
Graph sync

Atlas keeps two representations of the knowledge graph: the system of record in PostgreSQL and a property graph in Neo4j optimised for relationship traversal. The graph is always a projection of PostgreSQL — never an independent source of truth. The sync logic lives in src/graph/.

Why two stores

Relational storage is ideal for tenant-scoped CRUD, transactions, and Row-Level Security. But questions like "show every company sharing this address" or "walk the ownership chain to the UBO" are multi-hop traversals that a property graph answers far more naturally. So Atlas writes to PostgreSQL, then projects changes into Neo4j.

The sync service

Neo4jSyncService (src/graph/neo4j_sync.py) batches entity and relationship changes from PostgreSQL into transactional Neo4j writes, with a retry queue on failure so a transient Neo4j outage doesn't lose updates. A higher-level GraphSyncService orchestrates sync, and an EventProjector (projections.py) turns domain events into graph mutations.
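The batching and retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual Neo4jSyncService: the Change and BatchSyncer names, the batch_size and max_attempts parameters, and the write_batch callable (standing in for a transactional Neo4j write) are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Iterable, List


@dataclass
class Change:
    """One entity or relationship change read from PostgreSQL (hypothetical shape)."""
    id: str
    attempts: int = 0


@dataclass
class BatchSyncer:
    # write_batch stands in for one transactional Neo4j write; it raises on failure.
    write_batch: Callable[[List[Change]], None]
    batch_size: int = 100      # assumed value
    max_attempts: int = 3      # assumed value
    retry_queue: Deque[Change] = field(default_factory=deque)
    dead: List[Change] = field(default_factory=list)

    def sync(self, changes: Iterable[Change]) -> int:
        """Write changes batch by batch and return how many were written."""
        pending = list(changes)
        written = 0
        for start in range(0, len(pending), self.batch_size):
            batch = pending[start:start + self.batch_size]
            try:
                self.write_batch(batch)
                written += len(batch)
            except Exception:  # real code would catch driver-specific errors
                # A failed batch is queued, not dropped, and later batches still run.
                for change in batch:
                    change.attempts += 1
                    if change.attempts >= self.max_attempts:
                        self.dead.append(change)
                    else:
                        self.retry_queue.append(change)
        return written

    def drain_retries(self) -> int:
        """Re-attempt everything currently in the retry queue."""
        queued = list(self.retry_queue)
        self.retry_queue.clear()
        return self.sync(queued)
```

Because a failed batch only moves its changes to the queue, a short outage delays the projection instead of losing updates. Changes that keep failing end up in a separate list so they can be inspected rather than retried forever.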

Parity auditing

Because two stores can drift, GraphParityService (parity_service.py) audits divergence between the relational truth and the graph projection. If the projection falls behind or diverges, parity checks surface the difference so it can be reconciled. A cleanup script can remove stale or orphaned nodes.

Graph model and queries

Neo4j holds labels for the ontology entity types (LegalEntity, Person, Address, …) and relationships such as Ownership, Directorship, and RegisteredAt. graph_router.py exposes the traversal queries.

Typical queries include ownership chains, ultimate beneficial owners, shared-address clusters, common connections between entities, and centrality scores. These power the GraphExplorer and the report's entity/lineage graphs in the frontend.
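As an illustration of what an ownership-chain traversal might look like, here is a hedged sketch of a query builder. The function name, the OWNERSHIP relationship type, the owner-to-owned edge direction, and the depth cap are assumptions for the example and are not taken from graph_router.py. One fact it relies on: Cypher does not accept parameters for relationship types or variable-length bounds, so those two values are validated and interpolated while the IDs stay parameterised.

```python
from typing import Any, Dict, Tuple


def ownership_chain_query(
    entity_id: str,
    tenant_id: str,
    max_depth: int = 5,
    rel_type: str = "OWNERSHIP",  # assumed relationship type name
) -> Tuple[str, Dict[str, Any]]:
    """Build a bounded multi-hop query for everything that owns entity_id."""
    # Bounds and relationship types cannot be query parameters, so validate them.
    if isinstance(max_depth, bool) or not isinstance(max_depth, int):
        raise ValueError("max_depth must be an integer")
    if not 1 <= max_depth <= 10:
        raise ValueError("max_depth must be between 1 and 10")
    if not rel_type.isidentifier():
        raise ValueError(f"unsafe relationship type: {rel_type!r}")

    query = (
        f"MATCH path = (owner)-[:{rel_type}*1..{max_depth}]->"
        "(target:LegalEntity {id: $entity_id, tenant_id: $tenant_id})\n"
        # Every node on the path must belong to the same tenant.
        "WHERE all(n IN nodes(path) WHERE n.tenant_id = $tenant_id)\n"
        "RETURN path"
    )
    return query, {"entity_id": entity_id, "tenant_id": tenant_id}
```

Capping the depth keeps a traversal over a cyclic or very deep ownership structure from running unbounded.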

Clients

Two graph clients exist under src/graph/: a Neo4jClient (neo4j-driver) and an AGEClient for Apache AGE (a graph extension running inside PostgreSQL). Cypher is generated via cypher_generator.py with pre-built patterns in cypher_queries.py.


Deep dive: sync, parity, and schema-driven Cypher

This section reflects src/graph/neo4j_sync.py, neo4j_client.py, cypher_generator.py, and parity_service.py.

Tenant-scoped full sync

The sync reads through a tenant session so that RLS scopes it, upserts each entity and relationship into Neo4j, stamps neo4j_synced_at on success, and queues failures for retry rather than aborting. The Neo4jClient connects over Bolt with a connection pool (maximum connection lifetime 3600 s, pool size 50, acquisition timeout 60 s) and supports atomic execute_batch transactions.
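The pool settings above map onto keyword arguments of the official neo4j Python driver. The sketch below shows one way such a client could be wired up; pool_settings, make_driver, and this execute_batch are illustrative helpers, not the actual Neo4jClient code, and the driver import is deferred so the module loads even where the neo4j package is not installed.

```python
from typing import Any, Dict, Iterable, Tuple


def pool_settings() -> Dict[str, Any]:
    """Connection-pool options matching the values quoted in the text."""
    return {
        "max_connection_lifetime": 3600,       # seconds
        "max_connection_pool_size": 50,
        "connection_acquisition_timeout": 60,  # seconds
    }


def make_driver(uri: str, user: str, password: str):
    """Create a Bolt driver with the pool settings (requires the neo4j package)."""
    from neo4j import GraphDatabase  # deferred import

    return GraphDatabase.driver(uri, auth=(user, password), **pool_settings())


def execute_batch(driver, statements: Iterable[Tuple[str, Dict[str, Any]]]) -> int:
    """Run (query, params) pairs in one write transaction: all commit or none do."""
    statements = list(statements)

    def work(tx):
        for query, params in statements:
            tx.run(query, params)
        return len(statements)

    with driver.session() as session:
        # execute_write commits if work() returns and rolls back if it raises.
        return session.execute_write(work)
```

The driver may re-run a transaction function after a transient error, which is another reason for the sync statements to be idempotent MERGE upserts.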

Schema-driven Cypher with tenant isolation

The CypherGenerator reads entity/relationship types from the ontology — nothing is hard-coded. A relationship's Neo4j label comes from its neo4j_type (falling back to upper-snake-case), and every generated statement carries tenant_id in its MERGE/MATCH keys:

MERGE (e:LegalEntity {id: $entity_id, tenant_id: $tenant_id})
SET e += $properties

Bookkeeping fields (provenance, matching_key_hash, is_active, tenant_id, …) are excluded from node properties.
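A rough sketch of how such a generator could derive labels, filter bookkeeping fields, and emit the tenant-scoped MERGE is shown below. The function names are hypothetical, and the exclusion set contains only the fields named above (the real list is longer). Because Cypher does not allow labels to be passed as parameters, the label is validated before being interpolated.

```python
import re
from typing import Any, Dict, Optional, Tuple

# Only the bookkeeping fields named in the text; the real list has more entries.
EXCLUDED_FIELDS = frozenset({"provenance", "matching_key_hash", "is_active", "tenant_id"})
_LABEL_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def to_upper_snake(name: str) -> str:
    """Convert a CamelCase type name, e.g. RegisteredAt, to REGISTERED_AT."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).upper()


def relationship_label(type_name: str, neo4j_type: Optional[str] = None) -> str:
    """Prefer the ontology's neo4j_type; otherwise fall back to upper snake case."""
    return neo4j_type or to_upper_snake(type_name)


def node_properties(row: Dict[str, Any]) -> Dict[str, Any]:
    """Drop bookkeeping fields so they never become node properties."""
    return {k: v for k, v in row.items() if k not in EXCLUDED_FIELDS}


def merge_entity(
    label: str, entity_id: str, tenant_id: str, row: Dict[str, Any]
) -> Tuple[str, Dict[str, Any]]:
    """Build a tenant-scoped upsert statement and its parameters."""
    if not _LABEL_RE.match(label):
        raise ValueError(f"unsafe label: {label!r}")
    query = (
        f"MERGE (e:{label} {{id: $entity_id, tenant_id: $tenant_id}})\n"
        "SET e += $properties"
    )
    params = {
        "entity_id": entity_id,
        "tenant_id": tenant_id,
        "properties": node_properties(row),
    }
    return query, params
```

Putting tenant_id in the MERGE key means two tenants with the same entity ID get two distinct nodes, so one tenant's sync cannot overwrite another's data.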

Parity auditing

GraphParityService.check_parity() compares counts (and optionally IDs) between PostgreSQL and Neo4j, returning a ParityResult with a status of in_sync, minor_drift, drift_detected, or error, plus missing_in_neo4j and orphans_in_neo4j ID lists. If Neo4j is unavailable, sync returns a failed status and the relational truth stays intact: graph features degrade gracefully rather than corrupting data.
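The ID-level comparison can be sketched with plain set arithmetic. This is an illustrative stand-in for check_parity(), not the real method: the fetch callables, the minor_drift_ratio threshold, and its 1% default are assumptions, since the source does not say where the line between minor_drift and drift_detected is drawn.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class ParityResult:
    status: str  # in_sync | minor_drift | drift_detected | error
    missing_in_neo4j: List[str] = field(default_factory=list)
    orphans_in_neo4j: List[str] = field(default_factory=list)


def check_parity(
    fetch_postgres_ids: Callable[[], Iterable[str]],
    fetch_neo4j_ids: Callable[[], Iterable[str]],
    minor_drift_ratio: float = 0.01,  # assumed threshold
) -> ParityResult:
    """Compare entity IDs in the system of record against the graph projection."""
    postgres_ids = set(fetch_postgres_ids())
    try:
        neo4j_ids = set(fetch_neo4j_ids())
    except Exception:  # real code would catch driver-specific errors
        # Neo4j unreachable: report it and leave PostgreSQL untouched.
        return ParityResult(status="error")

    missing = sorted(postgres_ids - neo4j_ids)  # not yet projected
    orphans = sorted(neo4j_ids - postgres_ids)  # stale nodes in the graph
    drift = len(missing) + len(orphans)

    if drift == 0:
        status = "in_sync"
    elif drift <= minor_drift_ratio * max(len(postgres_ids), 1):
        status = "minor_drift"
    else:
        status = "drift_detected"
    return ParityResult(status, missing, orphans)
```

The two lists point at different repairs: IDs in missing_in_neo4j need to be re-synced from PostgreSQL, while IDs in orphans_in_neo4j are candidates for the cleanup script.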