Entity resolution
When a provider or crew reports an entity, Atlas must decide: is this the same company or person
we already know, or a new one? Getting this right is what turns fragmented provider data into a
single coherent graph. The logic lives in src/ontology/ —
entity_matcher.py, entity_resolution.py, blocking.py, reconciliation.py, and the
specialised person_matcher.py and address_reconciler.py.
The resolution pipeline
Blocking — making it tractable
Comparing every incoming entity against every existing one is quadratic. Blocking
(src/ontology/blocking.py) generates a cheap key (for example a normalised, surname-first form of
a person's name, or a normalised company name) and only candidates sharing that key are scored.
This shrinks the comparison set from "everything" to "plausibly the same".
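The idea can be sketched as an index from blocking key to known records, so an incoming entity is only compared with the records in its own block. This is a minimal illustration, not the code in blocking.py: `blocking_key`, `build_index`, `candidates_for`, and the record shape are all hypothetical.

```python
from collections import defaultdict


def blocking_key(record: dict) -> str:
    # Hypothetical cheap key: entity type plus the first three
    # characters of the lowercased name.
    return f"{record['type']}:{record['name'].lower()[:3]}"


def build_index(existing: list[dict]) -> dict[str, list[dict]]:
    # Group known records by key once, so each lookup is a dict access.
    index: dict[str, list[dict]] = defaultdict(list)
    for record in existing:
        index[blocking_key(record)].append(record)
    return index


def candidates_for(incoming: dict, index: dict[str, list[dict]]) -> list[dict]:
    # Only records sharing the incoming record's key are scored further.
    return index.get(blocking_key(incoming), [])
```

With this shape, an incoming "Redesign Holdings" company is compared only with known companies whose names start with "red", not with every stored entity.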
Similarity scoring
Candidates are scored by type-appropriate signals (entity_resolution.py,
person_matcher.py):
| Entity | Signals |
|---|---|
| LegalEntity | Name similarity, registration-number match, jurisdiction |
| Person | Name similarity, date of birth, nationality, role/company overlap |
| Address | Normalised address equality, geographic proximity (haversine) |
The score drives a three-way decision:
| Band | Decision |
|---|---|
| High (≥ merge threshold) | Merge into the existing entity |
| Ambiguous middle band | Conflict — raise a review task |
| Low | Create a new entity |
Reconciliation
When merging, the EntityReconciler (src/ontology/reconciliation.py) combines the incoming and
existing attributes, detects conflicts, and applies survivorship to pick
winners. Address-specific reconciliation (address_reconciler.py) deduplicates addresses, and
person deduplication (deduplicate_person_entities()) collapses near-duplicate people.
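As a rough sketch of the merge step, the function below combines two attribute sets, records the fields on which they disagree, and picks a winner by provider trust. It assumes survivorship is trust-based and that ties keep the existing value; the provider names and trust values are invented for illustration, and only the 0.50 default for unknown providers comes from this page. The real EntityReconciler may work differently.

```python
DEFAULT_TRUST = 0.50  # default trust for an unknown provider

# Hypothetical trust table; the real provider names and values are not shown here.
PROVIDER_TRUST = {"official_registry": 0.90, "commercial_vendor": 0.70}


def trust(provider: str) -> float:
    return PROVIDER_TRUST.get(provider, DEFAULT_TRUST)


def reconcile(existing: dict, incoming: dict) -> tuple[dict, list[str]]:
    """Each argument looks like {"provider": str, "attributes": dict}."""
    merged: dict = {}
    conflicts: list[str] = []
    fields = sorted(set(existing["attributes"]) | set(incoming["attributes"]))
    for field in fields:
        old = existing["attributes"].get(field)
        new = incoming["attributes"].get(field)
        if old is None or new is None:
            # Only one side has a value: keep whichever is present.
            merged[field] = old if new is None else new
        elif old == new:
            merged[field] = old
        else:
            # Disagreement: record it, then let the more trusted provider win.
            conflicts.append(field)
            incoming_wins = trust(incoming["provider"]) > trust(existing["provider"])
            merged[field] = new if incoming_wins else old
    return merged, conflicts
```

Returning the conflict list alongside the merged attributes keeps disagreements visible to the caller instead of discarding them once a winner is chosen.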
Multi-source disagreement
Disagreement is a first-class concept, not an error to be hidden. When sources contradict each other on a meaningful field, Atlas can require human resolution rather than silently picking a side — enforced per ADR-017. Missing data is similarly explicit: the absence of a value from a source that should have it is itself a signal.
Ownership and UBOs
Once entities and Ownership/Directorship relationships are resolved, ownership chains can be
walked to compute Ultimate Beneficial Owners (src/ontology/ownership.py,
calculate_ubos_for_entity()). These traversals run efficiently against the synced
Neo4j graph.
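The traversal can be illustrated in memory, without Neo4j. The sketch below multiplies ownership fractions along each chain, sums the results per ultimate owner, and stops a chain when it revisits an entity or exceeds a depth limit. `compute_ubos` and the `owners_of` mapping are hypothetical and are not the signature of calculate_ubos_for_entity(). Treating any owner with no recorded owners as an ultimate owner, and treating the threshold as inclusive, are assumptions.

```python
def compute_ubos(
    target: str,
    owners_of: dict[str, list[tuple[str, float]]],
    threshold: float = 0.25,
    max_depth: int = 10,
) -> dict[str, float]:
    """owners_of maps an owned entity to (owner, fraction) pairs."""
    totals: dict[str, float] = {}

    def walk(entity: str, share: float, path: frozenset, depth: int) -> None:
        if depth >= max_depth:
            return  # depth limit reached: stop following this chain
        for owner, fraction in owners_of.get(entity, []):
            if owner in path:
                continue  # circular ownership: terminate this chain (real code also logs it)
            effective = share * fraction
            if owner in owners_of:
                # Intermediate holder: keep walking upwards.
                walk(owner, effective, path | {owner}, depth + 1)
            else:
                # No recorded owners: assumed to be an ultimate owner.
                totals[owner] = totals.get(owner, 0.0) + effective

    walk(target, 1.0, frozenset({target}), 0)
    return {owner: share for owner, share in totals.items() if share >= threshold}
```

For example, if "HoldCo" owns 60% of a target and "bob" owns 50% of "HoldCo", then "bob" holds an effective 30% and clears a 25% threshold.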
The outputs of resolution — canonical entities, claims, mutations, and conflicts — feed Claims & survivorship, the mutation queue, and ultimately risk scoring and reporting.
Deep dive: the matching internals
This section is for engineers who need to tune or extend resolution. It reflects the implementation
in src/ontology/ — blocking.py, entity_matcher.py, entity_resolution.py, person_matcher.py,
address_reconciler.py, reconciliation.py.
Blocking keys
Blocking replaces the O(n²) all-pairs comparison with a sum of per-block costs, each quadratic only in the size of its block. The key is cheap and
type-specific (blocking.py):
| Type | Blocking key | Example |
|---|---|---|
| LegalEntity | {jurisdiction}:{first-3 of normalized name} | nl:red for "Redesign B.V." |
| Person | person:{first-3 of sorted name parts} | person:fra for "Frank / François" |
| Address | geohash of GPS (~100 m) ▸ postal+city ▸ street+city ▸ full ▸ city+country | `addr:geo:52.37 |
Name normalization (_normalize_company_name) strips 65+ legal suffixes (B.V., GmbH, SRL, Ltd, …)
and diacritics; person normalization handles Turkish/Polish/Danish characters and reduces to
first+last.
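The LegalEntity and Person keys from the table can be approximated as follows. This is a sketch under stated assumptions: the suffix set is a tiny illustrative subset of the 65+ real suffixes, the helper names are hypothetical (the real normalizer is `_normalize_company_name`), and "first-3 of sorted name parts" is interpreted as the first three characters of the sorted parts joined together.

```python
import unicodedata

# Small illustrative subset; the real list has 65+ legal suffixes.
LEGAL_SUFFIXES = {"b.v.", "bv", "gmbh", "srl", "ltd", "ltd."}


def strip_diacritics(text: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_company_name(name: str) -> str:
    tokens = strip_diacritics(name).lower().split()
    # Remove trailing legal-form tokens such as "b.v." or "gmbh".
    while tokens and tokens[-1].strip(",") in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def legal_entity_key(name: str, jurisdiction: str) -> str:
    return f"{jurisdiction.lower()}:{normalize_company_name(name)[:3]}"


def person_key(name: str) -> str:
    # Assumed interpretation: sort the name parts, join them, keep three characters.
    parts = sorted(strip_diacritics(name).lower().split())
    return f"person:{''.join(parts)[:3]}"
```

Here `legal_entity_key("Redesign B.V.", "NL")` produces `nl:red`, and "Frank" and "François" both map to `person:fra`. Note that NFKD decomposition alone does not cover letters such as ł, ø, or ı, which have no Unicode decomposition and need an explicit mapping.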
Matching keys & similarity
Within a block, a matching key is generated (identifier-first, with a name fallback), and each candidate pair is scored with several similarity algorithms whose results are combined by weight.
Field weights (entity_resolution.py): name 0.40, identifier 0.30, attribute 0.15,
structural/shared-relationships 0.15.
Decision thresholds:
| Band | Threshold | Decision |
|---|---|---|
| Exact | ≥ 0.95 | MERGE |
| Fuzzy | ≥ 0.80 | strong candidate |
| Review | ≥ 0.60 | NEEDS_REVIEW (human) |
| Minimum | ≥ 0.40 | keep as candidate |
| Below | < 0.40 | discard → CREATE new |
A registration-number match yields a high-confidence key (≈0.95–0.99); a name-only key is weaker (≈0.70). An exact normalized-name match is boosted to 0.96 to force a merge.
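The weights and thresholds above combine as in the sketch below. The constants are the ones listed on this page; the function names, the per-field signal dictionary, the labels other than MERGE and NEEDS_REVIEW, and the choice to treat a missing signal as 0.0 are assumptions made for illustration.

```python
WEIGHTS = {"name": 0.40, "identifier": 0.30, "attribute": 0.15, "structural": 0.15}

MERGE_THRESHOLD = 0.95
FUZZY_THRESHOLD = 0.80
REVIEW_THRESHOLD = 0.60
MIN_THRESHOLD = 0.40


def combined_score(signals: dict[str, float]) -> float:
    # Weighted sum of per-field similarities, each expected in [0, 1].
    # Assumption: a missing signal contributes 0.0.
    return sum(weight * signals.get(field, 0.0) for field, weight in WEIGHTS.items())


def decision_band(score: float) -> str:
    if score >= MERGE_THRESHOLD:
        return "MERGE"
    if score >= FUZZY_THRESHOLD:
        return "STRONG_CANDIDATE"  # hypothetical label
    if score >= REVIEW_THRESHOLD:
        return "NEEDS_REVIEW"
    if score >= MIN_THRESHOLD:
        return "CANDIDATE"  # hypothetical label
    return "CREATE"
```

Checking the bands from highest to lowest means each score lands in exactly one band, and the boosted 0.96 for an exact normalized-name match falls in MERGE.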
Person matching with a DOB veto
Person resolution (person_matcher.py, is_same_person_v2) adds tie-breakers and a date-of-birth (DOB) veto.
Middle-initial conflicts ("John A." vs "John B.") block a match, and the significant name parts (length ≥ 3) of one name must appear in the other. Five DOB formats are accepted; an unparseable DOB disables the veto rather than guessing.
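These rules can be sketched as below. This is not is_same_person_v2: the function names are hypothetical, the three date formats are illustrative stand-ins for the five accepted ones (which are not listed here), and the veto is assumed to fire only when both dates parse and differ.

```python
from datetime import date, datetime
from typing import Optional

# Illustrative formats only; the real list of five is not shown on this page.
DOB_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_dob(raw: Optional[str]) -> Optional[date]:
    if not raw:
        return None
    for fmt in DOB_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None  # unparseable: the caller must not guess


def initials(name: str) -> set[str]:
    # Single-letter tokens such as "A." are treated as initials.
    return {p.strip(".").lower() for p in name.split() if len(p.strip(".")) == 1}


def significant_parts(name: str) -> set[str]:
    return {p.strip(".").lower() for p in name.split() if len(p.strip(".")) >= 3}


def is_same_person_sketch(name_a: str, dob_a: Optional[str],
                          name_b: str, dob_b: Optional[str]) -> bool:
    parsed_a, parsed_b = parse_dob(dob_a), parse_dob(dob_b)
    if parsed_a is not None and parsed_b is not None and parsed_a != parsed_b:
        return False  # DOB veto (assumed rule: both parse and differ)
    init_a, init_b = initials(name_a), initials(name_b)
    if init_a and init_b and init_a != init_b:
        return False  # conflicting middle initials
    sig_a, sig_b = significant_parts(name_a), significant_parts(name_b)
    # The significant parts of one name must all appear in the other.
    return bool(sig_a) and bool(sig_b) and (sig_a <= sig_b or sig_b <= sig_a)
```

Because `parse_dob` returns `None` for anything it cannot read, a malformed date simply leaves the veto switched off and the name rules decide the outcome.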
Address clustering
Addresses cluster by haversine distance with a 500 m threshold (address_reconciler.py). Within
a cluster, the golden record is the highest-scoring address; the score considers geocode source
(Google Maps), type (registered office), whether a street is present, geocode confidence, and
completeness. Addresses are never merged across companies.
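The distance test and the cross-company rule can be sketched as follows. The haversine formula is standard; the greedy strategy of comparing each address with the first member of each cluster, the `company_id`, `lat`, and `lon` field names, and the function names are assumptions for illustration, and golden-record scoring is left out.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres
CLUSTER_THRESHOLD_M = 500


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    # Great-circle distance in metres between two latitude/longitude points.
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def cluster_addresses(addresses: list[dict]) -> list[list[dict]]:
    clusters: list[list[dict]] = []
    for addr in addresses:
        for cluster in clusters:
            seed = cluster[0]
            same_company = seed["company_id"] == addr["company_id"]
            if same_company and haversine_m(
                seed["lat"], seed["lon"], addr["lat"], addr["lon"]
            ) <= CLUSTER_THRESHOLD_M:
                cluster.append(addr)
                break
        else:
            # No existing cluster qualifies: start a new one.
            clusters.append([addr])
    return clusters
```

Checking `company_id` before measuring distance means two addresses at identical coordinates still end up in separate clusters when they belong to different companies.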
Key constants
| Constant | Value | Where |
|---|---|---|
| Merge / fuzzy / review / min thresholds | 0.95 / 0.80 / 0.60 / 0.40 | entity_resolution.py |
| Name / id / attr / structural weights | 0.40 / 0.30 / 0.15 / 0.15 | entity_resolution.py |
| Address cluster threshold | 500 m | address_reconciler.py |
| Blocking name-prefix length | 3 chars | blocking.py |
| UBO threshold (baseline / strict) | 25% / 10% | ownership.py |
| Max UBO chain depth | 10 | ownership.py |
Invariants & failure modes
- Deterministic keys: the same entity data always yields the same matching key.
- No cross-company address merges.
- Registration-number collisions fall back to name-based keys (confidence drops to ~0.70).
- Circular ownership is detected during UBO traversal, logged, and the chain terminated.
- Unknown provider in survivorship → default trust 0.50.
To add a new entity type or matcher, see Add an entity type.