Entity resolution

When a provider or crew reports an entity, Atlas must decide: is this the same company or person we already know, or a new one? Getting this right is what turns fragmented provider data into a single coherent graph. The logic lives in src/ontology/entity_matcher.py, entity_resolution.py, blocking.py, reconciliation.py, and the specialised person_matcher.py and address_reconciler.py.

The resolution pipeline

Blocking — making it tractable

Comparing every incoming entity against every existing one is quadratic. Blocking (src/ontology/blocking.py) generates a cheap key (for example a normalised, surname-first form of a person's name, or a normalised company name) and only candidates sharing that key are scored. This shrinks the comparison set from "everything" to "plausibly the same".
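
The effect of blocking can be illustrated with a minimal sketch. This is not the code in blocking.py: the `blocking_key` helper and its three-character prefix rule are simplifications assumed here for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical cheap key: first three characters of the lowercased name."""
    return name.strip().lower()[:3]


def candidate_pairs(names: list[str]) -> list[tuple[str, str]]:
    """Return only the pairs whose members share a blocking key."""
    blocks: dict[str, list[str]] = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)

    pairs: list[tuple[str, str]] = []
    for members in blocks.values():
        # Pairs are formed inside a block, never across blocks.
        pairs.extend(combinations(members, 2))
    return pairs


names = ["Redesign", "Redesign Group", "Atlas Holdings", "Atlas Holding", "Zenith"]
pairs = candidate_pairs(names)
```

With these five sample names an all-pairs comparison would score 10 pairs; blocking leaves 2, one in the "red" block and one in the "atl" block.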

Similarity scoring

Candidates are scored by type-appropriate signals (entity_resolution.py, person_matcher.py):

| Entity | Signals |
| --- | --- |
| LegalEntity | Name similarity, registration-number match, jurisdiction |
| Person | Name similarity, date of birth, nationality, role/company overlap |
| Address | Normalised address equality, geographic proximity (haversine) |

The score drives a three-way decision:

| Band | Decision |
| --- | --- |
| High (≥ merge threshold) | Merge into the existing entity |
| Ambiguous middle band | Conflict: raise a review task |
| Low | Create a new entity |

Reconciliation

When merging, the EntityReconciler (src/ontology/reconciliation.py) combines the incoming and existing attributes, detects conflicts, and applies survivorship to pick winners. Address-specific reconciliation (address_reconciler.py) deduplicates addresses, and person deduplication (deduplicate_person_entities()) collapses near-duplicate people.
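
As a rough sketch of how attribute merging, conflict detection, and survivorship fit together, the following assumes each attribute is stored as a (value, provider) pair and that the winner is the claim from the more trusted provider. The `reconcile` function, the `PROVIDER_TRUST` values, and the tie-breaking rule are hypothetical; only the 0.50 default trust for unknown providers comes from this page. The real EntityReconciler may weigh other factors, and for meaningful fields Atlas can route the conflict to a human instead of picking a winner.

```python
from typing import Dict, List, Tuple

Claim = Tuple[str, str]  # (value, provider)

DEFAULT_TRUST = 0.50  # documented default for unknown providers
# Hypothetical trust scores, for illustration only.
PROVIDER_TRUST = {"registry": 0.90, "vendor_x": 0.70}


def trust(provider: str) -> float:
    return PROVIDER_TRUST.get(provider, DEFAULT_TRUST)


def reconcile(
    existing: Dict[str, Claim], incoming: Dict[str, Claim]
) -> Tuple[Dict[str, Claim], List[str]]:
    """Merge two attribute maps; return the merged map and the conflicting fields."""
    merged: Dict[str, Claim] = {}
    conflicts: List[str] = []
    for field in sorted(existing.keys() | incoming.keys()):
        old, new = existing.get(field), incoming.get(field)
        if old is None or new is None:
            # Only one side has the field: nothing to arbitrate.
            merged[field] = old if new is None else new
            continue
        if old[0] != new[0]:
            conflicts.append(field)
        # Assumed rule: higher trust wins, ties keep the existing claim.
        merged[field] = new if trust(new[1]) > trust(old[1]) else old
    return merged, conflicts


existing = {"name": ("Redesign B.V.", "registry"), "city": ("Amsterdam", "unknown_feed")}
incoming = {
    "name": ("Redesign BV", "vendor_x"),
    "city": ("Rotterdam", "vendor_x"),
    "country": ("NL", "vendor_x"),
}
merged, conflicts = reconcile(existing, incoming)
```

Here the registry's name survives, the unknown feed's city loses to vendor_x because it falls back to the default trust, and both fields are reported as conflicts.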

Multi-source disagreement

Disagreement is a first-class concept, not an error to be hidden. When sources contradict each other on a meaningful field, Atlas can require human resolution rather than silently picking a side — enforced per ADR-017. Missing data is similarly explicit: the absence of a value from a source that should have it is itself a signal.

Ownership and UBOs

Once entities and Ownership/Directorship relationships are resolved, ownership chains can be walked to compute Ultimate Beneficial Owners (src/ontology/ownership.py, calculate_ubos_for_entity()). These traversals run efficiently against the synced Neo4j graph.
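
The traversal can be sketched in plain Python over an in-memory edge map. This is an illustration under stated assumptions, not the code in ownership.py or a Neo4j query: the `owners_of` structure, the function names, multiplying fractions along a chain, summing across chains, and treating the threshold as inclusive are all assumptions. The 25% baseline threshold, the depth limit of 10, and the handling of cycles (log, then terminate the chain) follow the constants and invariants listed later on this page.

```python
import logging
from typing import Dict, List, Set, Tuple

logger = logging.getLogger(__name__)

UBO_THRESHOLD = 0.25   # baseline; the strict setting on this page is 0.10
MAX_CHAIN_DEPTH = 10

Edges = Dict[str, List[Tuple[str, float]]]  # owned entity -> [(owner, fraction)]


def effective_stakes(target: str, owners_of: Edges, persons: Set[str]) -> Dict[str, float]:
    """Sum each person's indirect stake in `target` over all ownership chains."""
    totals: Dict[str, float] = {}

    def walk(node: str, share: float, path: Tuple[str, ...]) -> None:
        # `path` holds the entities on the current chain, target first.
        if len(path) > MAX_CHAIN_DEPTH:
            return
        for owner, fraction in owners_of.get(node, []):
            if owner in path:
                logger.warning("circular ownership: %s", " -> ".join(path + (owner,)))
                continue  # terminate this chain
            stake = share * fraction
            if owner in persons:
                totals[owner] = totals.get(owner, 0.0) + stake
            else:
                walk(owner, stake, path + (owner,))

    walk(target, 1.0, (target,))
    return totals


def find_ubos(target: str, owners_of: Edges, persons: Set[str],
              threshold: float = UBO_THRESHOLD) -> Dict[str, float]:
    stakes = effective_stakes(target, owners_of, persons)
    return {person: s for person, s in stakes.items() if s >= threshold}


owners_of = {
    "OpCo": [("HoldCo", 0.60), ("alice", 0.40)],
    "HoldCo": [("bob", 0.50), ("carol", 0.30), ("OpCo", 0.20)],  # last edge is a cycle
}
persons = {"alice", "bob", "carol"}
ubos = find_ubos("OpCo", owners_of, persons)
```

In this example alice holds 40% directly and bob holds 30% through HoldCo, so both pass the baseline threshold; carol's 18% only qualifies under the strict 10% setting.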

The outputs of resolution — canonical entities, claims, mutations, and conflicts — feed Claims & survivorship, the mutation queue, and ultimately risk scoring and reporting.


Deep dive: the matching internals

This section is for engineers who need to tune or extend resolution. It reflects the implementation in src/ontology/blocking.py, entity_matcher.py, entity_resolution.py, person_matcher.py, address_reconciler.py, reconciliation.py.

Blocking keys

Blocking turns an O(n²) all-pairs comparison into the sum of per-block costs. The key is cheap and type-specific (blocking.py):

| Type | Blocking key | Example |
| --- | --- | --- |
| LegalEntity | {jurisdiction}:{first-3 of normalized name} | nl:red for "Redesign B.V." |
| Person | person:{first-3 of sorted name parts} | person:fra for "Frank / François" |
| Address | geohash of GPS (~100 m) ▸ postal+city ▸ street+city ▸ full ▸ city+country | addr:geo:52.37… |

Name normalization (_normalize_company_name) strips 65+ legal suffixes (B.V., GmbH, SRL, Ltd, …) and diacritics; person normalization handles Turkish/Polish/Danish characters and reduces to first+last.
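
A simplified sketch of normalization and key generation follows. It is not the real `_normalize_company_name`: the suffix set is a four-item subset of the 65+ suffixes mentioned above, and the helper names and the reading of "first-3 of sorted name parts" (sort the parts, join them, take three characters) are assumptions. Note that Unicode decomposition alone does not fold letters such as Polish ł, Danish ø, or Turkish ı, so a production implementation needs explicit mappings for those.

```python
import unicodedata

# Illustrative subset only; the real list is much longer.
LEGAL_SUFFIXES = {"bv", "gmbh", "srl", "ltd"}


def strip_diacritics(text: str) -> str:
    """Decompose characters and drop combining marks (ç -> c, é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_company_name(name: str) -> str:
    # "B.V." becomes "bv" so that it matches the suffix set.
    cleaned = strip_diacritics(name).lower().replace(".", "")
    tokens = [tok for tok in cleaned.split() if tok not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def legal_entity_key(jurisdiction: str, name: str) -> str:
    return f"{jurisdiction.lower()}:{normalize_company_name(name)[:3]}"


def person_key(full_name: str) -> str:
    # Sorting the parts makes the key independent of name order.
    parts = sorted(strip_diacritics(full_name).lower().split())
    return f"person:{''.join(parts)[:3]}"
```

Under these assumptions `legal_entity_key("NL", "Redesign B.V.")` gives `nl:red`, and both "Frank" and "François" map to `person:fra`, matching the examples in the table.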

Matching keys & similarity

Within a block, a matching key is generated (identifier-first, with a name fallback), and each candidate pair is scored by several similarity algorithms whose results are combined by weight.

Field weights (entity_resolution.py): name 0.40, identifier 0.30, attribute 0.15, structural/shared-relationships 0.15.
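
The weighting can be read as a weighted sum of per-field similarities in [0, 1]. The sketch below uses the documented weights; the `combined_score` name and the choice to let a missing signal contribute 0 (rather than renormalising the remaining weights) are assumptions.

```python
WEIGHTS = {
    "name": 0.40,
    "identifier": 0.30,
    "attribute": 0.15,
    "structural": 0.15,  # shared relationships
}


def combined_score(signals: dict[str, float]) -> float:
    """Weighted sum of per-field similarities; absent fields count as 0."""
    return sum(weight * signals.get(field, 0.0) for field, weight in WEIGHTS.items())


# Strong name match, exact identifier, partial attributes, no shared relationships.
score = combined_score({"name": 0.9, "identifier": 1.0, "attribute": 0.5})
```

For this pair the score is 0.40 × 0.9 + 0.30 × 1.0 + 0.15 × 0.5 = 0.735, which lands in the review band described next.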

Decision thresholds:

| Band | Threshold | Decision |
| --- | --- | --- |
| Exact | 0.95 | MERGE |
| Fuzzy | 0.80 | strong candidate |
| Review | 0.60 | NEEDS_REVIEW (human) |
| Minimum | 0.40 | keep as candidate |
| Below | < 0.40 | discard → CREATE new |

A registration-number match yields a high-confidence key (≈0.95–0.99); a name-only key is weaker (≈0.70). An exact normalized-name match is boosted to 0.96, just above the 0.95 merge threshold, to force a merge.
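
The threshold table maps onto a simple cascade. In this sketch the thresholds are the documented ones and are assumed to be inclusive lower bounds; MERGE and NEEDS_REVIEW are the labels used in the table, while STRONG_CANDIDATE, CANDIDATE, and CREATE are placeholder names for the remaining bands.

```python
MERGE_THRESHOLD = 0.95
FUZZY_THRESHOLD = 0.80
REVIEW_THRESHOLD = 0.60
MIN_THRESHOLD = 0.40


def classify(score: float) -> str:
    """Map a combined similarity score to a decision band."""
    if score >= MERGE_THRESHOLD:
        return "MERGE"
    if score >= FUZZY_THRESHOLD:
        return "STRONG_CANDIDATE"
    if score >= REVIEW_THRESHOLD:
        return "NEEDS_REVIEW"
    if score >= MIN_THRESHOLD:
        return "CANDIDATE"
    return "CREATE"  # discarded as a match, so a new entity is created
```

Read this way, a name-only key at about 0.70 falls in the review band, while the 0.96 boost for an exact normalized-name match clears the merge threshold.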

Person matching with a DOB veto

Person resolution (person_matcher.py, is_same_person_v2) adds tie-breakers and a veto:

  • Middle-initial conflicts ("John A." vs "John B.") block a match.
  • Significant name parts (length ≥ 3) of one name must appear in the other.
  • Five DOB formats are accepted; an unparseable DOB disables the veto rather than guessing.
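
The rules above can be sketched as follows. This is not `is_same_person_v2`: the function names are hypothetical, the five date formats are guesses (the page does not list them), and the veto is assumed to mean that two successfully parsed but different dates of birth rule out a match.

```python
from datetime import date, datetime
from typing import List, Optional

# Hypothetical format list; the accepted formats are not named on this page.
DOB_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y", "%Y/%m/%d")


def parse_dob(raw: Optional[str]) -> Optional[date]:
    """Return a date, or None when the value is missing or unparseable."""
    if not raw:
        return None
    for fmt in DOB_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def name_tokens(name: str) -> List[str]:
    return name.replace(".", " ").lower().split()


def same_person_sketch(name_a: str, dob_a: Optional[str],
                       name_b: str, dob_b: Optional[str]) -> bool:
    ta, tb = name_tokens(name_a), name_tokens(name_b)

    # Initials: single-letter tokens after the first name.
    ia = {t for t in ta[1:] if len(t) == 1}
    ib = {t for t in tb[1:] if len(t) == 1}
    if ia and ib and ia.isdisjoint(ib):
        return False  # "John A." vs "John B."

    # Significant parts (length >= 3) of one name must appear in the other.
    sa = {t for t in ta if len(t) >= 3}
    sb = {t for t in tb if len(t) >= 3}
    if not sa or not sb:
        return False  # assumption: nothing significant to compare
    if not (sa <= set(tb) or sb <= set(ta)):
        return False

    # DOB veto applies only when both dates parse.
    da, db = parse_dob(dob_a), parse_dob(dob_b)
    if da is not None and db is not None and da != db:
        return False
    return True
```

Because `parse_dob` returns None for anything it cannot read, a malformed date never triggers the veto; the pair is then decided on name evidence alone.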

Address clustering

Addresses cluster by Haversine distance with a 500 m threshold (address_reconciler.py). Within a cluster, the golden record is the highest-scoring address, where the score rewards a Google Maps geocode source, a registered-office type, the presence of a street, geocode confidence, and completeness. Addresses are never merged across companies.
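
A sketch of distance-based clustering under the documented 500 m threshold and the no-cross-company rule. The Haversine formula is standard; the greedy strategy of comparing each address with the first member of each cluster, the tuple layout, and the function names are assumptions, and golden-record scoring is left out.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0
CLUSTER_THRESHOLD_M = 500.0

Address = Tuple[str, float, float]  # (company_id, latitude, longitude)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cluster_addresses(addresses: List[Address]) -> List[List[Address]]:
    """Greedy clustering: join the first same-company cluster whose seed is in range."""
    clusters: List[List[Address]] = []
    for addr in addresses:
        company, lat, lon = addr
        for members in clusters:
            seed_company, seed_lat, seed_lon = members[0]
            if (seed_company == company
                    and haversine_m(lat, lon, seed_lat, seed_lon) <= CLUSTER_THRESHOLD_M):
                members.append(addr)
                break
        else:
            clusters.append([addr])
    return clusters


addresses = [
    ("acme", 52.3700, 4.8900),
    ("acme", 52.3710, 4.8905),    # roughly 120 m from the first
    ("acme", 52.3800, 4.8900),    # roughly 1.1 km from the first
    ("globex", 52.3700, 4.8900),  # same point, different company
]
clusters = cluster_addresses(addresses)
```

The two nearby acme addresses share a cluster, the distant one stands alone, and the globex address stays separate even though its coordinates are identical to acme's.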

Key constants

| Constant | Value | Where |
| --- | --- | --- |
| Merge / fuzzy / review / min thresholds | 0.95 / 0.80 / 0.60 / 0.40 | entity_resolution.py |
| Name / id / attr / structural weights | 0.40 / 0.30 / 0.15 / 0.15 | entity_resolution.py |
| Address cluster threshold | 500 m | address_reconciler.py |
| Blocking name-prefix length | 3 chars | blocking.py |
| UBO threshold (baseline / strict) | 25% / 10% | ownership.py |
| Max UBO chain depth | 10 | ownership.py |

Invariants & failure modes

  • Deterministic keys: the same entity data always yields the same matching key.
  • No cross-company address merges.
  • Registration-number collisions fall back to name-based keys (confidence drops to ~0.70).
  • Circular ownership is detected during UBO traversal, logged, and the chain terminated.
  • Unknown provider in survivorship → default trust 0.50.

To add a new entity type or matcher, see Add an entity type.