Claims & survivorship

Atlas does not store a single "true" value for each attribute and overwrite it as new data arrives. Instead, every attribute can carry multiple claims — one per provider — and survivorship rules decide which claim is presented as canonical. This preserves provenance and makes disagreement auditable. The decision logic lives in src/ontology/survivorship.py, deal_breaker_consolidator.py, and the reconciler.

Why claims

A claim records who said it, with what confidence, and from what source. Because claims are kept, Atlas can answer "why does the report say Acme Corp?" and show the competing assertions and the rule that chose between them. See ADR-020.
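As a rough illustration of the idea, a claim can be modelled as a small immutable record. The field names and sample values below are assumptions made for this sketch; the actual claim schema in Atlas is not shown on this page.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Claim:
    """One provider's assertion about one attribute (hypothetical shape)."""
    attribute_path: str   # e.g. "legal_name"
    value: Any
    provider: str         # who said it
    confidence: float     # how sure the provider is, 0.0 to 1.0
    source: str           # where the provider got it


# Two providers disagree about the same attribute; both claims are kept.
claims = [
    Claim("legal_name", "Acme Corp", "northdata", 0.99, "registry extract"),
    Claim("legal_name", "ACME Corporation Ltd", "osint", 0.70, "company website"),
]

# Simplified stand-in rule: the highest-confidence claim is shown as canonical.
winner = max(claims, key=lambda c: c.confidence)
losers = [c for c in claims if c is not winner]
```

Because the losing claims are retained rather than overwritten, the question "why does the report say Acme Corp?" can be answered by listing `winner` next to `losers`.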

The survivorship decision

Strategies

Each attribute in the ontology declares a strategy:

| Strategy | Behaviour | Example field |
| --- | --- | --- |
| most_trusted | Use the value from the most trusted source | legal_name, registration_number |
| most_recent | Use the most recently retrieved value | volatile status fields |
| most_complete | Prefer a non-null / non-empty value | sparse attributes |
| combine | Merge arrays / union values | registration_numbers |
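A minimal sketch of how three of these strategies could be dispatched by name. The claim dictionaries, the `STRATEGIES` mapping, and the `resolve` helper are hypothetical names invented for illustration, not the Atlas implementation; `combine` is left out to keep the example short.

```python
from datetime import datetime

# Illustrative claims for one volatile status field (all values are made up).
claims = [
    {"value": "", "trust": 0.50, "retrieved_at": datetime(2023, 1, 1)},
    {"value": "active", "trust": 0.70, "retrieved_at": datetime(2024, 5, 1)},
    {"value": "dissolved", "trust": 0.95, "retrieved_at": datetime(2024, 1, 1)},
]


def is_empty(value):
    """Treat None, empty strings, and empty lists as 'no value'."""
    return value is None or value == "" or value == []


# Each strategy maps a list of claims to the value presented as canonical.
STRATEGIES = {
    "most_trusted": lambda cs: max(cs, key=lambda c: c["trust"])["value"],
    "most_recent": lambda cs: max(cs, key=lambda c: c["retrieved_at"])["value"],
    "most_complete": lambda cs: next(
        (c["value"] for c in cs if not is_empty(c["value"])), None
    ),
}


def resolve(strategy, cs):
    """Look up the declared strategy and apply it to the claims."""
    return STRATEGIES[strategy](cs)
```

The same three claims yield different canonical values depending on the declared strategy, which is why the strategy is set per attribute rather than globally.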

Trust weights

Ties within most_trusted are resolved by two weight tables from the ontology:

  • Module trust: the relative authority of OSINT modules, ranging from cir (10) down to dfwo (3).
  • Provider trust: per-field confidence for external providers (e.g. NorthData registration_number 0.99, KVK postal_code 0.95, OSINT _default 0.70).
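The two tables can be pictured as plain mappings. The sketch below uses only the weights quoted above; the lowercase keys, the lookup order (field, then the provider's `_default`, then a global fallback), and the 0.50 fallback for unknown providers are assumptions about how the pieces fit together.

```python
# Weights quoted in the text; the dictionary layout is assumed.
MODULE_TRUST = {"cir": 10, "dfwo": 3}

PROVIDER_TRUST = {
    "northdata": {"registration_number": 0.99},
    "kvk": {"postal_code": 0.95},
    "osint": {"_default": 0.70},
}

GLOBAL_FALLBACK = 0.50  # used when neither the field nor a _default is listed


def provider_trust(provider: str, field: str) -> float:
    """Per-field trust, then the provider's _default, then the global fallback."""
    table = PROVIDER_TRUST.get(provider.lower(), {})
    return table.get(field, table.get("_default", GLOBAL_FALLBACK))
```

For example, `provider_trust("osint", "legal_name")` resolves through the `_default` entry, while a provider that is not in the table at all gets the global fallback.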

Protected fields

Some fields must never be silently overwritten by a provider — the PEP and sanctions flags on Person and LegalEntity. They originate from screening crews (SPEPWS, AMLRR), and enforce_protected_fields() guarantees a registry provider cannot clobber them. This is load-bearing compliance behaviour, covered by tests.

Deal-breaker consolidation

For multi-value fields where claims genuinely conflict (and a simple merge would be wrong), deal_breaker_consolidator.py can call an LLM to consolidate — "which of these values are real?" — using the survivorship weights as context, with a deterministic fallback if the LLM is unavailable.
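The "LLM first, deterministic fallback" shape can be sketched as follows. Everything here is hypothetical: the `ask_llm` callable, the candidate tuples, the sample registration numbers, and the fallback rule itself (keep only the top-weighted values) are stand-ins, since the actual fallback in deal_breaker_consolidator.py is not described on this page.

```python
from typing import Callable, Optional

Candidate = tuple[str, float]  # (value, survivorship weight)


def deterministic_fallback(candidates: list[Candidate]) -> list[str]:
    """Hypothetical fallback: keep only the value(s) with the top weight."""
    if not candidates:
        return []
    top = max(weight for _, weight in candidates)
    return sorted({value for value, weight in candidates if weight == top})


def consolidate(
    candidates: list[Candidate],
    ask_llm: Optional[Callable[[list[Candidate]], list[str]]] = None,
) -> list[str]:
    """Try the LLM first; use the deterministic rule on any failure."""
    if ask_llm is not None:
        try:
            return ask_llm(candidates)
        except Exception:
            pass  # LLM unavailable: fall through to the deterministic path
    return deterministic_fallback(candidates)


def offline_llm(_candidates):
    """Stand-in for a real outage."""
    raise ConnectionError("LLM unavailable")


# Made-up conflicting values: the second looks like an OCR slip of the first.
candidates = [("HRB 12345", 0.99), ("HRB 1234S", 0.70)]
result = consolidate(candidates, ask_llm=offline_llm)
```

Passing the LLM in as a callable keeps the fallback path testable without any network access.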

Every decision is a mutation

Whatever survivorship decides is written as a mutation (before/after, with provider attribution). Significant changes can raise a conflict for human review. This is the audit trail described in Mutation queue, and it is what lets an analyst trust — and defend — the canonical value.
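A minimal sketch of what such a mutation record might contain, assuming hypothetical field names and an in-memory list in place of the real mutation queue.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Mutation:
    """Before/after snapshot of one attribute, attributed to a provider."""
    attribute_path: str
    before: Any
    after: Any
    provider: str
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def record_mutation(log: list, path: str, before: Any, after: Any, provider: str) -> Mutation:
    """Append the decision to the log so it can be audited later."""
    mutation = Mutation(path, before, after, provider)
    log.append(mutation)
    return mutation


audit_log: list = []
record_mutation(audit_log, "legal_name", "Acme", "Acme Corp", "northdata")
```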


Deep dive: how a winner is chosen

This reflects src/ontology/survivorship.py, reconciliation.py, and canonical_synthesis.py.

The pairwise decision

resolve_field_value(existing, existing_provider, new, new_provider, field) decides each field with an explicit branch order:

"More complete" (_is_more_complete) prefers longer non-placeholder strings, longer lists, and dicts with more keys — and explicitly skips placeholders like "unknown", "n/a", "not found".
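The completeness rule described above can be approximated like this. It is a sketch under assumptions: the placeholder set contains only the three examples quoted in the text, the comparison is case-insensitive, and mixed-type comparisons simply return False. The real _is_more_complete may differ on all three points.

```python
PLACEHOLDERS = {"unknown", "n/a", "not found"}  # only the examples quoted above


def is_placeholder(value) -> bool:
    """True for strings that stand in for 'no data'."""
    return isinstance(value, str) and value.strip().lower() in PLACEHOLDERS


def is_empty(value) -> bool:
    """None, placeholders, and zero-length containers carry no information."""
    if value is None or is_placeholder(value):
        return True
    return isinstance(value, (str, list, dict)) and len(value) == 0


def is_more_complete(new, existing) -> bool:
    """True when `new` carries more information than `existing`."""
    if is_empty(new):
        return False
    if is_empty(existing):
        return True
    if isinstance(new, str) and isinstance(existing, str):
        return len(new.strip()) > len(existing.strip())
    if isinstance(new, list) and isinstance(existing, list):
        return len(new) > len(existing)
    if isinstance(new, dict) and isinstance(existing, dict):
        return len(new) > len(existing)
    return False  # mixed types: do not guess
```

Checking for placeholders before comparing lengths matters: "not found" is longer than "Acme" but must never win.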

Trust lookup

get_provider_trust(provider, field) reads the ontology's provider_trust table via the SchemaCache; an unknown provider/field falls back to 0.50. Reconciliation combines field confidence with module trust as effective_confidence = (field_confidence + module_trust) / 2.
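The averaging formula is simple enough to show directly. One assumption is made loudly here: both inputs are taken to be on a 0.0 to 1.0 scale, so the integer module weights are divided by 10 before averaging. How Atlas actually normalizes module trust is not stated on this page.

```python
def effective_confidence(field_confidence: float, module_trust: float) -> float:
    """Average of the two inputs, both assumed to be on a 0.0 to 1.0 scale."""
    return (field_confidence + module_trust) / 2


# Assumption: a module weight on the 3 to 10 scale is divided by 10 first.
from_cir = effective_confidence(0.80, 10 / 10)   # trusted module, modest field score
from_dfwo = effective_confidence(0.90, 3 / 10)   # weaker module, higher field score
```

Under these assumptions the cir value comes out ahead even though its field confidence is lower, which is the point of blending the two signals.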

Strategy implementations

The reconciler (reconciliation.py) applies the per-field strategy and emits a conflict record when multiple distinct values exist:

| Strategy | Implementation |
| --- | --- |
| most_trusted | max by effective confidence; ties → module then provider trust |
| most_recent | max by timestamp |
| most_complete | longest / non-empty |
| most_specific | rank by part-count + length + confidence (e.g. "England & Wales" > "UK") |
| combine / aggregate | union arrays; boolean OR for flags (a true PEP wins) |
| canonical | normalize to canonical form (proper case, ISO codes) |
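Two of the less obvious rows, most_specific and combine, can be sketched as below. The separator set used to count parts and the reading of "part-count + length + confidence" as a lexicographic tuple (rather than a numeric sum) are assumptions; the function names are illustrative only.

```python
import re


def specificity(value: str, confidence: float) -> tuple:
    """Rank key: number of parts, then length, then confidence."""
    parts = [p for p in re.split(r"[,&/]", value) if p.strip()]
    return (len(parts), len(value.strip()), confidence)


def most_specific(claims: list) -> str:
    """claims is a list of (value, confidence) pairs."""
    return max(claims, key=lambda c: specificity(*c))[0]


def combine(values: list):
    """Boolean OR for flags; otherwise an order-preserving union."""
    if values and all(isinstance(v, bool) for v in values):
        return any(values)
    merged = []
    for v in values:
        for item in (v if isinstance(v, list) else [v]):
            if item not in merged:
                merged.append(item)
    return merged
```

With these definitions, `most_specific([("UK", 0.9), ("England & Wales", 0.8)])` prefers the two-part value despite its lower confidence, and `combine([False, True])` keeps the true flag.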

Claim-plus-rank (Phase 110) and the read path

Each attribute can have many entity_claims rows. recompute_preferred_for_entity() does a pairwise reduction across all claims for an (entity_id, attribute_path) and flips a single is_preferred flag in one UPDATE (avoiding a transient dual-winner). The read path, synthesize_entity_attributes() (canonical_synthesis.py), then:

Each result is a CanonicalAttribute(value, source, trust, updated_at, alternative_claims?). When only one source contributed, alternative_claims is absent from the JSON (not null) — the model_dump(exclude_none=True) contract. Paths with no preferred claim fall back to the legacy entity_data + _field_provenance (source "legacy", trust 0.5).
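The omission contract can be illustrated with the standard library alone. This sketch uses a dataclass and a small helper as a stand-in for pydantic's `model_dump(exclude_none=True)`; the `updated_at` string format, the alternative-claim shape, and the sample values are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class CanonicalAttribute:
    value: Any
    source: str
    trust: float
    updated_at: str
    alternative_claims: Optional[list] = None  # None means "single source"


def dump_exclude_none(attr: CanonicalAttribute) -> dict:
    """Stand-in for model_dump(exclude_none=True): drop keys whose value is None."""
    return {k: v for k, v in asdict(attr).items() if v is not None}


single = CanonicalAttribute("Acme Corp", "northdata", 0.99, "2024-05-01T00:00:00Z")
contested = CanonicalAttribute(
    "Acme Corp", "northdata", 0.99, "2024-05-01T00:00:00Z",
    alternative_claims=[{"value": "ACME Corporation Ltd", "source": "osint"}],
)

single_json = dump_exclude_none(single)
contested_json = dump_exclude_none(contested)
```

A consumer can therefore test for the presence of the `alternative_claims` key to tell whether any disagreement existed, without having to distinguish null from an empty list.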

Protected fields, precisely

enforce_protected_fields() strips protected fields from an incoming payload unless the provider is authorized (spepws, amlrr). On Person the protected set is is_pep, pep_position, pep_source, pep_details, pep_category, is_sanctioned, sanctions_matches, sanctions_source, sanctions_lists.
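In outline, the stripping step might look like the following. The field set and the authorized providers are taken from the text above; the function name `strip_protected`, its signature, and the case-insensitive provider match are assumptions, since the real signature of enforce_protected_fields() is not shown here.

```python
PROTECTED_PERSON_FIELDS = frozenset({
    "is_pep", "pep_position", "pep_source", "pep_details", "pep_category",
    "is_sanctioned", "sanctions_matches", "sanctions_source", "sanctions_lists",
})

AUTHORIZED_PROVIDERS = frozenset({"spepws", "amlrr"})


def strip_protected(payload: dict, provider: str) -> dict:
    """Return a copy of payload without protected fields, unless authorized."""
    if provider.lower() in AUTHORIZED_PROVIDERS:
        return dict(payload)
    return {k: v for k, v in payload.items() if k not in PROTECTED_PERSON_FIELDS}


# A registry provider tries to reset a PEP flag alongside a legitimate update.
incoming = {"full_name": "Jane Doe", "is_pep": False}
from_registry = strip_protected(incoming, "northdata")
from_screening = strip_protected(incoming, "spepws")
```

Returning a new dictionary rather than mutating the payload keeps the original input available for the audit trail.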

Invariants

  • One winner per attribute path (the orphan-loser guard prevents a loser surfacing as canonical).
  • Single-source omission: alternative_claims is None (and is therefore omitted from the serialized JSON), never [].
  • JSONB decode discipline: claim leaves are decoded carefully so a numeric registration number never coerces to int.
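The last invariant guards against a subtle type change: a registration number stored as a bare JSON number decodes to an int and loses its identity as text. One possible way to guard against that, assuming a hypothetical `decode_claim_leaf` helper and a `string_typed` flag supplied by the caller, is sketched below; the actual decode logic in Atlas is not shown on this page.

```python
import json


def decode_claim_leaf(raw: str, string_typed: bool = True):
    """Decode one JSON leaf; keep numeric-looking values as text when asked."""
    decoded = json.loads(raw)
    if isinstance(decoded, bool):
        return decoded  # bool is a subclass of int, so check it first
    if string_typed and isinstance(decoded, (int, float)):
        return raw.strip()  # preserve the digits exactly as stored
    return decoded
```

A leaf stored as `"01234567"` (a JSON string) keeps its leading zero, and a leaf stored as the bare number `12345678` comes back as the string `"12345678"` rather than an int.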

To change policy without code, see Add a risk matrix and the ontology's survivorship config.