Entity resolution

When a provider or crew reports an entity, Atlas must decide: is this the same company or person we already know, or a new one? Getting this right is what turns fragmented provider data into a single coherent graph. The logic lives in src/ontology/ — entity_matcher.py, entity_resolution.py, blocking.py, reconciliation.py, and the specialised person_matcher.py and address_reconciler.py.

The resolution pipeline

Blocking — making it tractable

Comparing every incoming entity against every existing one is quadratic. Blocking (src/ontology/blocking.py) generates a cheap key for each entity (for example a normalised, surname-first form of a person's name, or a normalised company name), and only candidates that share a key are scored. This shrinks the comparison set from "everything" to "plausibly the same".

Similarity scoring

Candidates are scored by type-appropriate signals (entity_resolution.py, person_matcher.py):

Entity	Signals
LegalEntity	Name similarity, registration-number match, jurisdiction
Person	Name similarity, date of birth, nationality, role/company overlap
Address	Normalised address equality, geographic proximity (haversine)

The score drives a three-way decision:

Band	Decision
High (≥ merge threshold)	Merge into the existing entity
Ambiguous middle band	Conflict — raise a review task
Low	Create a new entity

Reconciliation

When merging, the EntityReconciler (src/ontology/reconciliation.py) combines the incoming and existing attributes, detects conflicts, and applies survivorship to pick winners. Address-specific reconciliation (address_reconciler.py) deduplicates addresses, and person deduplication (deduplicate_person_entities()) collapses near-duplicate people.
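
To make the merge step concrete, here is a minimal sketch of conflict detection plus trust-based survivorship. It is not the EntityReconciler itself: the `Claim` shape, the `reconcile()` function, and the provider names and trust values are all hypothetical. Only the 0.50 default trust for unknown providers is taken from this page.

```python
from dataclasses import dataclass

# Hypothetical trust table; only the 0.50 default for unknown providers
# comes from this page.
PROVIDER_TRUST = {"registry": 0.90, "vendor_a": 0.70}
DEFAULT_TRUST = 0.50


@dataclass(frozen=True)
class Claim:
    field: str
    value: str
    provider: str


def trust(provider: str) -> float:
    """Return the trust score for a provider, falling back to the default."""
    return PROVIDER_TRUST.get(provider, DEFAULT_TRUST)


def reconcile(existing: list[Claim], incoming: list[Claim]):
    """Combine claims per field, flag disagreements, pick a winner by trust."""
    by_field: dict[str, list[Claim]] = {}
    for claim in [*existing, *incoming]:
        by_field.setdefault(claim.field, []).append(claim)

    winners: dict[str, str] = {}
    conflicts: list[str] = []
    for field, claims in by_field.items():
        if len({c.value for c in claims}) > 1:
            conflicts.append(field)  # sources disagree on this field
        # max() keeps the first claim on ties, so existing data wins a tie.
        winners[field] = max(claims, key=lambda c: trust(c.provider)).value
    return winners, conflicts


existing = [
    Claim("name", "Redesign B.V.", "vendor_a"),
    Claim("city", "Amsterdam", "unknown_src"),
]
incoming = [
    Claim("name", "Redesign BV", "registry"),
    Claim("city", "Amsterdam", "vendor_a"),
]
winners, conflicts = reconcile(existing, incoming)
```

In this sketch a conflicting field still gets a provisional winner. Whether a given conflict must instead wait for human resolution is a policy decision outside the scope of the example.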

Multi-source disagreement

Disagreement is a first-class concept, not an error to be hidden. When sources contradict each other on a meaningful field, Atlas can require human resolution rather than silently picking a side — enforced per ADR-017. Missing data is similarly explicit: the absence of a value from a source that should have it is itself a signal.

Ownership and UBOs

Once entities and Ownership/Directorship relationships are resolved, ownership chains can be walked to compute Ultimate Beneficial Owners (src/ontology/ownership.py, calculate_ubos_for_entity()). These traversals run efficiently against the synced Neo4j graph.
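
The chain walk can be illustrated with a small in-memory sketch. This is not `calculate_ubos_for_entity()` and does not touch Neo4j: the `compute_ubos()` function, the `owners` mapping, and the rule that a node with no recorded owners counts as an ultimate owner are assumptions made for illustration. The 25% baseline threshold and the depth limit of 10 are the values listed under Key constants.

```python
import logging

logger = logging.getLogger(__name__)

UBO_THRESHOLD = 0.25  # baseline; the strict setting is 0.10
MAX_DEPTH = 10


def compute_ubos(
    owners: dict[str, list[tuple[str, float]]],
    entity: str,
    threshold: float = UBO_THRESHOLD,
    max_depth: int = MAX_DEPTH,
) -> dict[str, float]:
    """Sum effective ownership over all chains leading to `entity`.

    `owners` maps an owned entity id to (owner id, fraction) pairs.
    Assumption: a node with no recorded owners is an ultimate owner.
    """
    totals: dict[str, float] = {}

    def walk(node: str, share: float, path: frozenset, depth: int) -> None:
        direct = owners.get(node, [])
        if not direct:
            if node != entity:
                totals[node] = totals.get(node, 0.0) + share
            return
        if depth >= max_depth:
            logger.warning("max depth %d reached at %s", max_depth, node)
            return
        for owner, fraction in direct:
            if owner in path:
                # Circular ownership: log it and terminate this chain.
                logger.warning("circular ownership: %s -> %s", node, owner)
                continue
            walk(owner, share * fraction, path | {owner}, depth + 1)

    walk(entity, 1.0, frozenset({entity}), 0)
    return {o: s for o, s in totals.items() if s >= threshold}


owners = {
    "acme": [("holdco", 0.6), ("alice", 0.4)],
    "holdco": [("bob", 0.5), ("carol", 0.3), ("acme", 0.2)],  # cycle to acme
}
baseline = compute_ubos(owners, "acme")
strict = compute_ubos(owners, "acme", threshold=0.10)
```

Here bob holds 0.6 × 0.5 of acme through holdco and alice holds 0.4 directly, so both clear the baseline threshold. Carol's 0.6 × 0.3 only qualifies under the strict setting.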

The outputs of resolution — canonical entities, claims, mutations, and conflicts — feed Claims & survivorship, the mutation queue, and ultimately risk scoring and reporting.

Deep dive: the matching internals

This section is for engineers who need to tune or extend resolution. It reflects the implementation in src/ontology/ — blocking.py, entity_matcher.py, entity_resolution.py, person_matcher.py, address_reconciler.py, reconciliation.py.

Blocking keys

Blocking turns an O(n²) all-pairs comparison into the sum of per-block costs. The key is cheap and type-specific (blocking.py):

Type	Blocking key	Example
LegalEntity	`{jurisdiction}:{first-3 of normalized name}`	`nl:red` for "Redesign B.V."
Person	`person:{first-3 of sorted name parts}`	`person:fra` for "Frank / François"
Address	geohash of GPS (~100 m) ▸ postal+city ▸ street+city ▸ full ▸ city+country	`addr:geo:52.37

Name normalization (_normalize_company_name) strips 65+ legal suffixes (B.V., GmbH, SRL, Ltd, …) and diacritics; person normalization handles Turkish/Polish/Danish characters and reduces to first+last.
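
As an illustration of the key formats in the table, the sketch below builds company and person keys and groups records into blocks. It is a simplified stand-in for blocking.py, not a copy of it: the function names are hypothetical, the suffix list is a four-item sample of the 65+ real suffixes, and "first-3 of sorted name parts" is read here as the first three characters of the alphabetically sorted, concatenated name parts, which is only one possible interpretation.

```python
import unicodedata
from collections import defaultdict

# Illustrative subset; the real list is described as 65+ legal suffixes.
LEGAL_SUFFIXES = {"bv", "gmbh", "srl", "ltd"}


def strip_diacritics(text: str) -> str:
    """Decompose characters and drop combining marks (ç -> c)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_company_name(name: str) -> str:
    """Lowercase, remove dots and diacritics, drop legal-suffix tokens."""
    cleaned = strip_diacritics(name).lower().replace(".", "")
    return " ".join(t for t in cleaned.split() if t not in LEGAL_SUFFIXES)


def company_blocking_key(name: str, jurisdiction: str) -> str:
    return f"{jurisdiction.lower()}:{normalize_company_name(name)[:3]}"


def person_blocking_key(name: str) -> str:
    # Assumed reading of "first-3 of sorted name parts".
    parts = sorted(strip_diacritics(name).lower().split())
    return f"person:{''.join(parts)[:3]}"


def build_blocks(records):
    """Group (id, name, jurisdiction) records by company blocking key."""
    blocks = defaultdict(list)
    for rec_id, name, jurisdiction in records:
        blocks[company_blocking_key(name, jurisdiction)].append(rec_id)
    return dict(blocks)


records = [
    ("a", "Redesign B.V.", "NL"),
    ("b", "Redesign Holding GmbH", "NL"),
    ("c", "Blue Ocean Ltd", "GB"),
]
blocks = build_blocks(records)
```

Only records inside the same block are compared, so "a" and "b" are scored against each other while "c" is never compared with either.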

Matching keys & similarity

Within a block, a matching key is generated (identifier-first, name fallback), and pairs are scored with several algorithms combined by weight:

Field weights (entity_resolution.py): name 0.40, identifier 0.30, attribute 0.15, structural/shared-relationships 0.15.

Decision thresholds:

Band	Threshold	Decision
Exact	≥ 0.95	`MERGE`
Fuzzy	≥ 0.80	strong candidate
Review	≥ 0.60	`NEEDS_REVIEW` (human)
Minimum	≥ 0.40	keep as candidate
Below	< 0.40	discard → `CREATE` new

A registration-number match yields a high-confidence key (≈0.95–0.99); a name-only key is weaker (≈0.70). An exact normalized-name match is boosted to 0.96 to force a merge.
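
The weights and bands above can be combined as in the following sketch. The function names and the shape of the `signals` dictionary are hypothetical; the weights, thresholds, and the 0.96 exact-name boost are the values stated on this page. How entity_resolution.py computes each per-field similarity is not shown here.

```python
WEIGHTS = {
    "name": 0.40,
    "identifier": 0.30,
    "attribute": 0.15,
    "structural": 0.15,
}

# (threshold, band, outcome) in descending order, as in the table above.
BANDS = [
    (0.95, "exact", "MERGE"),
    (0.80, "fuzzy", "strong candidate"),
    (0.60, "review", "NEEDS_REVIEW"),
    (0.40, "minimum", "keep as candidate"),
]

EXACT_NAME_BOOST = 0.96


def combined_score(signals: dict[str, float], exact_name_match: bool = False) -> float:
    """Weighted sum of per-field similarities, each expected in [0, 1]."""
    score = sum(weight * signals.get(field, 0.0) for field, weight in WEIGHTS.items())
    if exact_name_match:
        # An exact normalized-name match is lifted into the merge band.
        score = max(score, EXACT_NAME_BOOST)
    return score


def classify(score: float) -> tuple[str, str]:
    """Return the (band, outcome) pair for a combined score."""
    for threshold, band, outcome in BANDS:
        if score >= threshold:
            return band, outcome
    return "below", "CREATE"


signals = {"name": 0.9, "identifier": 1.0, "attribute": 0.5, "structural": 0.0}
score = combined_score(signals)
band, outcome = classify(score)
```

With these inputs the score is 0.36 + 0.30 + 0.075 = 0.735, which falls in the review band.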

Person matching with a DOB veto

Person resolution (person_matcher.py, is_same_person_v2) adds tie-breakers and a veto:

Middle-initial conflicts ("John A." vs "John B.") block a match.
Significant name parts (length ≥ 3) of one name must appear in the other.
Five DOB formats are accepted; an unparseable DOB disables the veto rather than guessing.
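
These rules can be sketched as follows. This is not `is_same_person_v2`: the function names are hypothetical, the five date formats are placeholders because the page does not list the accepted ones, and the veto is assumed to mean "two parseable but different dates of birth block the match".

```python
from datetime import date, datetime
from typing import Optional

# Placeholder formats; the page says five are accepted but does not name them.
DOB_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y", "%Y/%m/%d")


def parse_dob(raw: Optional[str]) -> Optional[date]:
    """Try each format in turn; return None if nothing parses."""
    if not raw:
        return None
    for fmt in DOB_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def _tokens(name: str) -> list[str]:
    return [t.strip(".").lower() for t in name.split() if t.strip(".")]


def middle_initials_conflict(a: str, b: str) -> bool:
    """True when both names carry single-letter initials that differ."""
    ia = {t for t in _tokens(a)[1:] if len(t) == 1}
    ib = {t for t in _tokens(b)[1:] if len(t) == 1}
    return bool(ia) and bool(ib) and ia.isdisjoint(ib)


def significant_parts_match(a: str, b: str) -> bool:
    """Parts of length >= 3 from one name must all appear in the other."""
    sa = {t for t in _tokens(a) if len(t) >= 3}
    sb = {t for t in _tokens(b) if len(t) >= 3}
    return bool(sa) and bool(sb) and (sa <= sb or sb <= sa)


def same_person(name_a, name_b, dob_a=None, dob_b=None) -> bool:
    if middle_initials_conflict(name_a, name_b):
        return False
    if not significant_parts_match(name_a, name_b):
        return False
    da, db = parse_dob(dob_a), parse_dob(dob_b)
    # Assumed veto semantics: only applies when both dates parse.
    if da is not None and db is not None and da != db:
        return False
    return True
```

For example, `same_person("John Smith", "John Smith", "1980-01-02", "sometime in 1981")` returns `True` in this sketch, because the second date cannot be parsed and the veto is therefore skipped.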

Address clustering

Addresses cluster by Haversine distance with a 500 m threshold (address_reconciler.py); within a cluster the golden record is chosen by score (geocode source = Google Maps, type = registered office, has street, geocode confidence, completeness). Addresses are never merged across companies.
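
A minimal sketch of the distance check and the per-company constraint follows. The greedy, seed-based grouping and the dictionary fields (`company_id`, `lat`, `lon`) are assumptions chosen for brevity; address_reconciler.py may cluster differently, and golden-record scoring is left out.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius
CLUSTER_THRESHOLD_M = 500


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cluster_addresses(addresses, threshold_m: float = CLUSTER_THRESHOLD_M):
    """Greedy clustering: join the first cluster whose seed address belongs
    to the same company and lies within the threshold, else start a new one."""
    clusters: list[list[dict]] = []
    for addr in addresses:
        for cluster in clusters:
            seed = cluster[0]
            same_company = addr["company_id"] == seed["company_id"]
            if same_company and haversine_m(
                addr["lat"], addr["lon"], seed["lat"], seed["lon"]
            ) <= threshold_m:
                cluster.append(addr)
                break
        else:
            clusters.append([addr])
    return clusters


addresses = [
    {"id": "a", "company_id": "c1", "lat": 52.3700, "lon": 4.8900},
    {"id": "b", "company_id": "c1", "lat": 52.3710, "lon": 4.8905},  # ~120 m from a
    {"id": "c", "company_id": "c1", "lat": 52.3800, "lon": 4.8900},  # ~1.1 km from a
    {"id": "d", "company_id": "c2", "lat": 52.3700, "lon": 4.8900},  # other company
]
clusters = cluster_addresses(addresses)
cluster_ids = [[a["id"] for a in c] for c in clusters]
```

Address "d" shares coordinates with "a" but stays in its own cluster, which reflects the rule that addresses are never merged across companies.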

Key constants

Constant	Value	Where
Merge / fuzzy / review / min thresholds	0.95 / 0.80 / 0.60 / 0.40	`entity_resolution.py`
Name / id / attr / structural weights	0.40 / 0.30 / 0.15 / 0.15	`entity_resolution.py`
Address cluster threshold	500 m	`address_reconciler.py`
Blocking name-prefix length	3 chars	`blocking.py`
UBO threshold (baseline / strict)	25% / 10%	`ownership.py`
Max UBO chain depth	10	`ownership.py`

Invariants & failure modes

Deterministic keys: the same entity data always yields the same matching key.
No cross-company address merges.
Registration-number collisions fall back to name-based keys (confidence drops to ~0.70).
Circular ownership is detected during UBO traversal, logged, and the chain terminated.
Unknown provider in survivorship → default trust 0.50.

To add a new entity type or matcher, see Add an entity type.
