ADR-020: Ontology Field Provenance & Temporal History

| Field | Value |
|---|---|
| Status | Superseded by Phase 115 |
| Date | 2026-04-03 |
| Relates | ADR-008, ADR-008a, ADR-016, ADR-019 |

Phase 115 supersession (2026-05-13): ADR-020 originally proposed a separate append-only field_observations table. Phase 115 deliberately chose a smaller, non-breaking implementation: nullable audit-lineage columns on entity_claims (source_ref, retrieval_run_id, ingested_at, plugin_version, mapping_spec_version, transform_id, raw_payload_pointer). The append-only observation-log model is deferred as a future temporal-history design, not the current provenance implementation. See V131 migration and Phase 115 CONTEXT.md D-01..D-12.


1. Problem Statement

The platform ingests entity data from multiple sources (registry feeds, API integrations, manual entry, document extraction) into the ontology layer. When a field is updated — whether by re-ingestion, correction, or enrichment — the previous value is overwritten. There is no record of what the system knew about an entity at a given point in time.

This creates three gaps:

  1. Audit reconstruction. A regulator asks "Why was Entity X rated high risk in Q1 2026?" The evaluation result (ADR-019) records the value that was scored (country_of_incorporation = "PA"), but there is no independent way to verify that PA was indeed what the system held at that time. The proof relies entirely on the evaluation snapshot rather than a corroborating source-of-truth.

  2. Data freshness visibility. A compliance officer reviewing a portfolio has no way to know which entities were scored with data that is 6, 12, or 18 months old. Stale data is a material risk in AML assessments: EBA GL/2021/02 requires firms to keep customer due diligence information current.

  3. Change-driven re-evaluation. When a source corrects a value (e.g., a registry feed updates an entity's jurisdiction), the platform cannot determine which evaluations were affected by the old value and should be flagged for re-scoring.


2. Decision Drivers

  • EBA GL/2021/02 requires firms to demonstrate the basis for risk assessments and keep customer due diligence information current.
  • FATF Recommendation 10 requires ongoing due diligence including keeping documents and data up to date.
  • ADR-019 (Risk Matrix Scoring Pipeline) embeds a lightweight provenance snapshot in evaluation results (field, value, observed_at, source). This ADR defines where that provenance metadata originates.
  • The Schema Mapping Designer (ADR-016) already tracks which source fields map to which ontology fields. Provenance extends this by recording when and from where each value was actually observed.

3. Core Concept: Field Observations

Rather than versioning entire entity records (expensive, complex), the platform tracks provenance at the individual field level using an append-only observation log.

field_observations Table

CREATE TABLE field_observations (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    entity_id     UUID NOT NULL REFERENCES entities(id),
    field_path    TEXT NOT NULL,                      -- e.g. "country_of_incorporation"
    value         JSONB NOT NULL,                     -- the observed value (any type)
    source_id     TEXT NOT NULL,                      -- e.g. "kvk_registry", "manual_entry", "opencorporates_api"
    source_ref    TEXT,                               -- external record ID, e.g. "kvk:12345678"
    observed_at   TIMESTAMPTZ NOT NULL,               -- when the source reported this value
    ingested_at   TIMESTAMPTZ NOT NULL DEFAULT now(), -- when the platform recorded it
    superseded_at TIMESTAMPTZ,                        -- set when a newer observation arrives for this entity+field
    confidence    TEXT,                               -- "authoritative", "derived", "manual", "unverified"
    metadata      JSONB                               -- source-specific context (API response ID, document page, etc.)
);

CREATE INDEX idx_field_obs_entity_field ON field_observations(entity_id, field_path, observed_at DESC);
CREATE INDEX idx_field_obs_source ON field_observations(source_id, observed_at DESC);
CREATE INDEX idx_field_obs_superseded ON field_observations(entity_id, field_path) WHERE superseded_at IS NULL;

How It Works

  1. On ingestion: When a source provides a value for an entity field, the ingestion pipeline writes a row to field_observations. If a previous observation exists for the same entity_id + field_path with superseded_at IS NULL, that row's superseded_at is set to now(). The entity's current field value is updated as before — the main entity record remains the fast-read path.

  2. On scoring (ADR-019): The scoring engine reads the entity's current field value and the active observation's provenance (source_id, source_ref, observed_at). Both are embedded in the evaluation result, making each evaluation a self-contained audit artifact.

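  3. On query ("what did we know at time T?"): SELECT * FROM field_observations WHERE entity_id = ? AND field_path = ? AND observed_at <= T ORDER BY observed_at DESC LIMIT 1 returns the most recent value a source had reported as of T. Because a late-arriving observation can carry an observed_at earlier than its ingested_at, add AND ingested_at <= T when the question is what the platform itself had recorded at T.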
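The ingestion and query steps above can be sketched as follows. This is a minimal illustration, not the platform's pipeline: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, stores timestamps as ISO-8601 UTC strings, and takes the supersede timestamp as a parameter instead of now() so the result is deterministic. The function names and the "NL" sample value are hypothetical.

```python
import json
import sqlite3
import uuid

# Simplified SQLite stand-in for the PostgreSQL table (assumption: TEXT
# columns hold ISO-8601 UTC timestamps, which sort chronologically).
SCHEMA = """
CREATE TABLE field_observations (
    id            TEXT PRIMARY KEY,
    entity_id     TEXT NOT NULL,
    field_path    TEXT NOT NULL,
    value         TEXT NOT NULL,   -- JSON-encoded
    source_id     TEXT NOT NULL,
    source_ref    TEXT,
    observed_at   TEXT NOT NULL,
    ingested_at   TEXT NOT NULL,
    superseded_at TEXT
);
"""

def record_observation(conn, entity_id, field_path, value, source_id,
                       source_ref, observed_at, ingested_at):
    """Supersede the active row (if any), then append the new observation."""
    with conn:  # one transaction: both statements commit or neither does
        conn.execute(
            "UPDATE field_observations SET superseded_at = ? "
            "WHERE entity_id = ? AND field_path = ? AND superseded_at IS NULL",
            (ingested_at, entity_id, field_path),
        )
        conn.execute(
            "INSERT INTO field_observations (id, entity_id, field_path, value, "
            "source_id, source_ref, observed_at, ingested_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), entity_id, field_path, json.dumps(value),
             source_id, source_ref, observed_at, ingested_at),
        )

def value_as_of(conn, entity_id, field_path, t):
    """Return the latest value with observed_at <= t, or None if none exists."""
    row = conn.execute(
        "SELECT value FROM field_observations "
        "WHERE entity_id = ? AND field_path = ? AND observed_at <= ? "
        "ORDER BY observed_at DESC LIMIT 1",
        (entity_id, field_path, t),
    ).fetchone()
    return json.loads(row[0]) if row else None

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Illustrative sample data only.
record_observation(conn, "e1", "country_of_incorporation", "NL",
                   "kvk_registry", "kvk:12345678",
                   "2025-06-01T00:00:00Z", "2025-06-02T00:00:00Z")
record_observation(conn, "e1", "country_of_incorporation", "PA",
                   "kvk_registry", "kvk:12345678",
                   "2026-03-15T10:30:00Z", "2026-03-15T11:00:00Z")
```

After the second call, the first row has superseded_at set and exactly one active row remains for the entity and field, while value_as_of still answers queries for any earlier point in time.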


4. Connection to ADR-019 Evaluation Results

The evaluation result's per-factor indicator gains an input block:

{
  "factor_id": "jurisdiction_risk",
  "raw_score": 8,
  "capped_score": 8,
  "max_score": 10,
  "input": {
    "field": "country_of_incorporation",
    "value": "PA",
    "source_id": "kvk_registry",
    "source_ref": "kvk:12345678",
    "observed_at": "2026-03-15T10:30:00Z"
  },
  "contributing_indicators": [{
    "method": "REFERENCE_LOOKUP",
    "value": "PA",
    "dataset": "country_risk",
    "matched_score": 8
  }]
}

This means years from now, you can open any evaluation and see:

  • What was scored: country_of_incorporation = "PA"
  • Where it came from: KVK registry, record kvk:12345678
  • When it was observed: 2026-03-15
  • How it was scored: REFERENCE_LOOKUP against country_risk dataset → score 8
  • Which matrix version: foreign key on the evaluation record
  • What the reference data looked like: snapshotted at matrix publish time (ADR-008a)

No reconstruction, no joins against historical state. The evaluation is the audit trail.
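As a rough sketch of how a scorer might attach that input block, assuming the active observation has already been loaded: the ActiveObservation class and with_input_block function are hypothetical names for illustration and are not taken from the RiskMatrixScorer.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class ActiveObservation:
    # Hypothetical in-memory view of the active field_observations row.
    field_path: str
    value: Any
    source_id: str
    source_ref: Optional[str]
    observed_at: str  # ISO-8601 UTC timestamp

def with_input_block(factor_result: dict, obs: ActiveObservation) -> dict:
    """Return a copy of the factor result with provenance embedded as 'input'."""
    return {
        **factor_result,
        "input": {
            "field": obs.field_path,
            "value": obs.value,
            "source_id": obs.source_id,
            "source_ref": obs.source_ref,
            "observed_at": obs.observed_at,
        },
    }

obs = ActiveObservation(
    field_path="country_of_incorporation",
    value="PA",
    source_id="kvk_registry",
    source_ref="kvk:12345678",
    observed_at="2026-03-15T10:30:00Z",
)
factor = with_input_block(
    {"factor_id": "jurisdiction_risk", "raw_score": 8,
     "capped_score": 8, "max_score": 10},
    obs,
)
```

Copying the provenance values into the result, rather than storing a reference to the observation row, is what keeps the evaluation readable without joins later on.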


5. Use Cases Served

| Use Case | How Provenance Helps |
|---|---|
| Regulatory audit | Open the evaluation → every factor shows its input value, source, and observation date. Self-contained proof. |
| Data freshness alerting | Query field_observations WHERE superseded_at IS NULL AND observed_at < now() - interval '6 months' to find stale fields. Dashboard shows entities scored with old data. |
| Change-driven re-evaluation | When a source correction arrives, the old observation is superseded. Query evaluation results where input.source_ref matches the corrected record → flag for re-scoring. |
| Matrix comparison | Compare two evaluation runs: same entity, different matrix versions. The input blocks show whether the underlying data also changed between runs, or only the scoring logic. |
| Dispute resolution | Entity challenges their score. Show them: "Your jurisdiction was recorded as PA based on KVK registry data from March 2026. This scored 8/10 in our country risk dataset. Here is the dataset entry." |
| Model validation | Correlate evaluation inputs and scores with outcomes (SARs, exits) over time. Provenance lets you filter for "evaluations based on authoritative sources" vs "evaluations based on manual/unverified data." |
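The change-driven re-evaluation case can be illustrated with a small in-memory filter. This sketch assumes each evaluation is a dict with an evaluation_id and a factors list whose entries carry the input block; that envelope, the function name, and the sample data are assumptions for illustration, since ADR-019 defines the real result shape.

```python
def evaluations_to_rescore(evaluations, field, source_ref, corrected_value):
    """Return IDs of evaluations that scored a value the source has since corrected."""
    flagged = []
    for evaluation in evaluations:
        for factor in evaluation["factors"]:
            used = factor.get("input", {})
            if (used.get("field") == field
                    and used.get("source_ref") == source_ref
                    and used.get("value") != corrected_value):
                flagged.append(evaluation["evaluation_id"])
                break  # one affected factor is enough to flag the evaluation
    return flagged

def _evaluation(evaluation_id, value, source_ref):
    # Helper that builds illustrative sample data in the assumed shape.
    return {
        "evaluation_id": evaluation_id,
        "factors": [{
            "factor_id": "jurisdiction_risk",
            "input": {"field": "country_of_incorporation",
                      "value": value, "source_ref": source_ref},
        }],
    }

evaluations = [
    _evaluation("ev-1", "PA", "kvk:12345678"),  # scored the old value
    _evaluation("ev-2", "NL", "kvk:12345678"),  # already scored the corrected value
    _evaluation("ev-3", "PA", "kvk:99999999"),  # different source record
]
flagged = evaluations_to_rescore(
    evaluations, "country_of_incorporation", "kvk:12345678", "NL")
```

Only ev-1 is flagged here: it used the corrected record and scored a value that differs from the correction.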

6. Data Freshness Service

A background service periodically scans active observations and generates freshness alerts:

For each entity in active portfolio:
    For each field used by the entity's applicable risk matrix (via wire mappings):
        Check the active observation's observed_at
        If older than the configured threshold (per source or global):
            Create a freshness alert → surfaces in the Risk Center dashboard

Thresholds are configurable per source (registry data may be acceptable at 12 months; sanctions screening should be refreshed weekly).
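A minimal sketch of that scan with per-source thresholds might look like the following. It assumes active observations are supplied as dicts and that the set of fields used by the risk matrix is already known. The threshold values mirror the examples above, with 12 months approximated as 365 days, and the sanctions_screening source ID is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

# Assumed configuration: a global default plus per-source overrides.
DEFAULT_MAX_AGE = timedelta(days=365)
MAX_AGE_BY_SOURCE = {
    "kvk_registry": timedelta(days=365),       # roughly 12 months
    "sanctions_screening": timedelta(days=7),  # weekly refresh
}

def find_stale(active_observations, fields_in_use, now):
    """Return one alert dict per active observation older than its threshold."""
    alerts = []
    for obs in active_observations:
        if obs["field_path"] not in fields_in_use:
            continue  # field is not an input to the applicable risk matrix
        max_age = MAX_AGE_BY_SOURCE.get(obs["source_id"], DEFAULT_MAX_AGE)
        age = now - datetime.fromisoformat(obs["observed_at"])
        if age > max_age:
            alerts.append({
                "entity_id": obs["entity_id"],
                "field_path": obs["field_path"],
                "source_id": obs["source_id"],
                "age_days": age.days,
            })
    return alerts

# Illustrative sample data only.
now = datetime(2026, 4, 3, tzinfo=timezone.utc)
active = [
    {"entity_id": "e1", "field_path": "country_of_incorporation",
     "source_id": "kvk_registry", "observed_at": "2026-03-15T10:30:00+00:00"},
    {"entity_id": "e2", "field_path": "sanctions_status",
     "source_id": "sanctions_screening", "observed_at": "2026-03-20T00:00:00+00:00"},
    {"entity_id": "e3", "field_path": "website",
     "source_id": "manual_entry", "observed_at": "2020-01-01T00:00:00+00:00"},
]
stale = find_stale(active, {"country_of_incorporation", "sanctions_status"}, now)
```

The registry value is 19 days old and passes, the sanctions value is 14 days old and exceeds its 7-day threshold, and the website field is skipped because no risk matrix uses it.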


7. Retention & Storage Considerations

Field observations are append-only and will grow over time. Retention policy options:

  • Keep all observations indefinitely for entities in active portfolios. Regulatory retention requirements (typically 5 years post-relationship) determine the minimum.
  • Archive superseded observations older than the retention period to cold storage.
  • Partition by ingested_at for efficient archival and querying.
  • The active observation set (one row per entity per field, superseded_at IS NULL) is the hot path and stays small relative to the full history.

8. Scope & Boundaries

In Scope

  • field_observations table and ingestion pipeline integration
  • Provenance metadata in ADR-019 evaluation results (input block)
  • Data freshness alerting service
  • Point-in-time state reconstruction queries

Out of Scope (Future Work)

  • Full entity-level versioning or event sourcing
  • Source reliability scoring (weighting authoritative vs. derived sources)
  • Automated conflict resolution when multiple sources disagree on the same field
  • Provenance UI in the entity detail page (display layer — depends on this ADR's data model)

9. Open Questions

  1. Granularity of field_path. Should nested fields (e.g., addresses[0].country) each get their own observation row, or should complex fields be stored as a single observation with a JSONB value? Leaning toward single observation with JSONB — keeps the observation count manageable.

  2. Multi-source conflict. If two sources report different values for the same field at similar times, which one becomes the active entity value? Current approach: last-write-wins with source priority (authoritative > derived > manual). This ADR records both observations; conflict resolution is a separate concern.

  3. Bulk ingestion performance. Portfolio re-ingestion from a registry feed might update thousands of entities × dozens of fields. The supersede-and-insert pattern needs to be efficient. Likely a batch upsert with a CTE.
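For the bulk ingestion question, one way to picture the supersede-and-insert pattern at batch scale is the sketch below. It is not the CTE-based upsert the ADR anticipates for PostgreSQL: it uses Python's sqlite3 as a stand-in and issues two batched statements inside a single transaction. It assumes each batch holds at most one row per entity and field, and the table is reduced to the columns the pattern needs. The function name and sample rows are hypothetical.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE field_observations (
    id            TEXT PRIMARY KEY,
    entity_id     TEXT NOT NULL,
    field_path    TEXT NOT NULL,
    value         TEXT NOT NULL,
    source_id     TEXT NOT NULL,
    observed_at   TEXT NOT NULL,
    ingested_at   TEXT NOT NULL,
    superseded_at TEXT
);
CREATE INDEX idx_field_obs_active
    ON field_observations(entity_id, field_path) WHERE superseded_at IS NULL;
"""

def ingest_batch(conn, rows, ingested_at):
    """Supersede and insert a batch atomically.

    rows: iterable of (entity_id, field_path, value_json, source_id, observed_at).
    Assumes at most one row per (entity_id, field_path) in the batch.
    """
    rows = list(rows)
    supersede_params = [(ingested_at, r[0], r[1]) for r in rows]
    insert_params = [(str(uuid.uuid4()), *r, ingested_at) for r in rows]
    with conn:  # single transaction for the whole batch
        conn.executemany(
            "UPDATE field_observations SET superseded_at = ? "
            "WHERE entity_id = ? AND field_path = ? AND superseded_at IS NULL",
            supersede_params,
        )
        conn.executemany(
            "INSERT INTO field_observations (id, entity_id, field_path, value, "
            "source_id, observed_at, ingested_at) VALUES (?, ?, ?, ?, ?, ?, ?)",
            insert_params,
        )
    return len(insert_params)

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Illustrative sample data only.
ingest_batch(conn, [
    ("e1", "country_of_incorporation", '"NL"', "kvk_registry", "2025-06-01T00:00:00Z"),
    ("e2", "country_of_incorporation", '"DE"', "kvk_registry", "2025-06-01T00:00:00Z"),
], ingested_at="2025-06-02T00:00:00Z")
ingest_batch(conn, [
    ("e1", "country_of_incorporation", '"PA"', "kvk_registry", "2026-03-15T10:30:00Z"),
    ("e2", "country_of_incorporation", '"DE"', "kvk_registry", "2026-03-15T10:30:00Z"),
], ingested_at="2026-03-15T11:00:00Z")
```

The partial index on active rows keeps the supersede step an index lookup per key, which is the same role idx_field_obs_superseded plays in the proposed schema.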


10. POC Scope

Minimal implementation to validate the concept alongside ADR-019's scoring pipeline POC:

  1. Create the field_observations table.
  2. Modify one ingestion pipeline (the simplest registry source) to write observations on ingest.
  3. Modify the RiskMatrixScorer to read provenance alongside entity values and embed the input block in evaluation results.
  4. Build one query: "show me all entities with field data older than N months."
  5. Verify: open an evaluation result from the POC and trace one factor score back to its source observation.