ADR-020: Ontology Field Provenance & Temporal History
| Field | Value |
|---|---|
| Status | Superseded by Phase 115 |
| Date | 2026-04-03 |
| Relates | ADR-008, ADR-008a, ADR-016, ADR-019 |
Phase 115 supersession (2026-05-13): ADR-020 originally proposed a separate append-only `field_observations` table. Phase 115 deliberately chose a smaller, non-breaking implementation: nullable audit-lineage columns on `entity_claims` (`source_ref`, `retrieval_run_id`, `ingested_at`, `plugin_version`, `mapping_spec_version`, `transform_id`, `raw_payload_pointer`). The append-only observation-log model is deferred as a future temporal-history design, not the current provenance implementation. See V131 migration and Phase 115 CONTEXT.md D-01..D-12.
1. Problem Statement
The platform ingests entity data from multiple sources (registry feeds, API integrations, manual entry, document extraction) into the ontology layer. When a field is updated — whether by re-ingestion, correction, or enrichment — the previous value is overwritten. There is no record of what the system knew about an entity at a given point in time.
This creates three gaps:
- Audit reconstruction. A regulator asks "Why was Entity X rated high risk in Q1 2026?" The evaluation result (ADR-019) records the value that was scored (`country_of_incorporation = "PA"`), but there is no independent way to verify that PA was indeed what the system held at that time. The proof relies entirely on the evaluation snapshot rather than a corroborating source of truth.
- Data freshness visibility. A compliance officer reviewing a portfolio has no way to know which entities were scored with data that is 6, 12, or 18 months old. Stale data is a material risk in AML assessments: the EBA guidelines explicitly require firms to keep customer information up to date.
- Change-driven re-evaluation. When a source corrects a value (e.g., a registry feed updates an entity's jurisdiction), the platform cannot determine which evaluations were affected by the old value and should be flagged for re-scoring.
2. Decision Drivers
- EBA GL/2021/02 requires firms to demonstrate the basis for risk assessments and keep customer due diligence information current.
- FATF Recommendation 10 requires ongoing due diligence including keeping documents and data up to date.
- ADR-019 (Risk Matrix Scoring Pipeline) embeds a lightweight provenance snapshot in evaluation results (`field`, `value`, `observed_at`, `source`). This ADR defines where that provenance metadata originates.
- The Schema Mapping Designer (ADR-016) already tracks which source fields map to which ontology fields. Provenance extends this by recording when and from where each value was actually observed.
3. Core Concept: Field Observations
Rather than versioning entire entity records (expensive, complex), the platform tracks provenance at the individual field level using an append-only observation log.
field_observations Table
    CREATE TABLE field_observations (
        id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        entity_id     UUID NOT NULL REFERENCES entities(id),
        field_path    TEXT NOT NULL,                      -- e.g. "country_of_incorporation"
        value         JSONB NOT NULL,                     -- the observed value (any type)
        source_id     TEXT NOT NULL,                      -- e.g. "kvk_registry", "manual_entry", "opencorporates_api"
        source_ref    TEXT,                               -- external record ID, e.g. "kvk:12345678"
        observed_at   TIMESTAMPTZ NOT NULL,               -- when the source reported this value
        ingested_at   TIMESTAMPTZ NOT NULL DEFAULT now(), -- when the platform recorded it
        superseded_at TIMESTAMPTZ,                        -- set when a newer observation arrives for this entity+field
        confidence    TEXT,                               -- "authoritative", "derived", "manual", "unverified"
        metadata      JSONB                               -- source-specific context (API response ID, document page, etc.)
    );

    CREATE INDEX idx_field_obs_entity_field ON field_observations(entity_id, field_path, observed_at DESC);
    CREATE INDEX idx_field_obs_source ON field_observations(source_id, observed_at DESC);
    CREATE INDEX idx_field_obs_superseded ON field_observations(entity_id, field_path) WHERE superseded_at IS NULL;
How It Works
- On ingestion: When a source provides a value for an entity field, the ingestion pipeline writes a row to `field_observations`. If a previous observation exists for the same `entity_id` + `field_path` with `superseded_at IS NULL`, that row's `superseded_at` is set to `now()`. The entity's current field value is updated as before; the main entity record remains the fast-read path.
- On scoring (ADR-019): The scoring engine reads the entity's current field value and the active observation's provenance (`source_id`, `source_ref`, `observed_at`). Both are embedded in the evaluation result, making each evaluation a self-contained audit artifact.
- On query ("what did we know at time T?"): `SELECT * FROM field_observations WHERE entity_id = ? AND field_path = ? AND observed_at <= T ORDER BY observed_at DESC LIMIT 1` reconstructs the state at any historical point. Because `observed_at` is when the source reported the value, an observation ingested after T with an earlier `observed_at` would also match; to reconstruct strictly what the platform held at T, add `AND ingested_at <= T`.
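The three steps above can be sketched end to end. This is a minimal illustration, not the platform's ingestion code: it uses an in-memory SQLite database as a stand-in for PostgreSQL, keeps only the columns the flow needs, and the helper names `record_observation` and `value_as_of` are hypothetical.

```python
import sqlite3

# SQLite stand-in for the PostgreSQL table (assumption: reduced column set).
SCHEMA = """
CREATE TABLE field_observations (
    id            INTEGER PRIMARY KEY,
    entity_id     TEXT NOT NULL,
    field_path    TEXT NOT NULL,
    value         TEXT NOT NULL,
    source_id     TEXT NOT NULL,
    observed_at   TEXT NOT NULL,  -- ISO-8601 UTC strings sort chronologically
    ingested_at   TEXT NOT NULL,
    superseded_at TEXT
);
"""


def record_observation(conn, entity_id, field_path, value, source_id,
                       observed_at, ingested_at):
    """Supersede the active row (if any), then append the new observation."""
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute(
            "UPDATE field_observations SET superseded_at = ? "
            "WHERE entity_id = ? AND field_path = ? AND superseded_at IS NULL",
            (ingested_at, entity_id, field_path),
        )
        conn.execute(
            "INSERT INTO field_observations "
            "(entity_id, field_path, value, source_id, observed_at, ingested_at) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (entity_id, field_path, value, source_id, observed_at, ingested_at),
        )


def value_as_of(conn, entity_id, field_path, as_of):
    """Point-in-time lookup using the ADR's query (filter on observed_at)."""
    row = conn.execute(
        "SELECT value FROM field_observations "
        "WHERE entity_id = ? AND field_path = ? AND observed_at <= ? "
        "ORDER BY observed_at DESC LIMIT 1",
        (entity_id, field_path, as_of),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record_observation(conn, "e1", "country_of_incorporation", "NL", "kvk_registry",
                   "2025-06-01T00:00:00Z", "2025-06-02T00:00:00Z")
record_observation(conn, "e1", "country_of_incorporation", "PA", "kvk_registry",
                   "2026-03-15T10:30:00Z", "2026-03-15T11:00:00Z")
```

Running the supersede and the insert in one transaction keeps the "one active row per entity and field" property intact even if the insert fails.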
4. Connection to ADR-019 Evaluation Results
The evaluation result's per-factor indicator gains an input block:
    {
      "factor_id": "jurisdiction_risk",
      "raw_score": 8,
      "capped_score": 8,
      "max_score": 10,
      "input": {
        "field": "country_of_incorporation",
        "value": "PA",
        "source_id": "kvk_registry",
        "source_ref": "kvk:12345678",
        "observed_at": "2026-03-15T10:30:00Z"
      },
      "contributing_indicators": [{
        "method": "REFERENCE_LOOKUP",
        "value": "PA",
        "dataset": "country_risk",
        "matched_score": 8
      }]
    }
This means years from now, you can open any evaluation and see:
- What was scored: `country_of_incorporation = "PA"`
- Where it came from: KVK registry, record `kvk:12345678`
- When it was observed: 2026-03-15
- How it was scored: REFERENCE_LOOKUP against country_risk dataset → score 8
- Which matrix version: foreign key on the evaluation record
- What the reference data looked like: snapshotted at matrix publish time (ADR-008a)
No reconstruction, no joins against historical state. The evaluation is the audit trail.
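A scorer could assemble the `input` block by copying provenance from the active observation at evaluation time. The sketch below is one possible shape, not the `RiskMatrixScorer` implementation: the `ActiveObservation` class and both functions are hypothetical, and capping the score at `max_score` is an assumption about how `capped_score` is derived.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ActiveObservation:
    """Hypothetical in-memory view of the active field_observations row."""
    field_path: str
    value: Any
    source_id: str
    source_ref: Optional[str]
    observed_at: str  # ISO-8601 timestamp as reported by the source


def build_input_block(obs: ActiveObservation) -> dict:
    """Copy provenance into the evaluation so later audits need no joins."""
    return {
        "field": obs.field_path,
        "value": obs.value,
        "source_id": obs.source_id,
        "source_ref": obs.source_ref,
        "observed_at": obs.observed_at,
    }


def build_factor_result(factor_id: str, raw_score: int, max_score: int,
                        obs: ActiveObservation) -> dict:
    """Assemble the per-factor result with its embedded provenance."""
    return {
        "factor_id": factor_id,
        "raw_score": raw_score,
        "capped_score": min(raw_score, max_score),  # assumption: cap at max_score
        "max_score": max_score,
        "input": build_input_block(obs),
    }


obs = ActiveObservation("country_of_incorporation", "PA", "kvk_registry",
                        "kvk:12345678", "2026-03-15T10:30:00Z")
result = build_factor_result("jurisdiction_risk", 8, 10, obs)
```

Because the values are copied rather than referenced, a later supersession of the observation does not alter what the evaluation shows.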
5. Use Cases Served
| Use Case | How Provenance Helps |
|---|---|
| Regulatory audit | Open the evaluation → every factor shows its input value, source, and observation date. Self-contained proof. |
| Data freshness alerting | Query field_observations WHERE superseded_at IS NULL AND observed_at < now() - interval '6 months' to find stale fields. Dashboard shows entities scored with old data. |
| Change-driven re-evaluation | When a source correction arrives, the old observation is superseded. Query evaluation results where input.source_ref matches the corrected record → flag for re-scoring. |
| Matrix comparison | Compare two evaluation runs: same entity, different matrix versions. The input blocks show whether the underlying data also changed between runs, or only the scoring logic. |
| Dispute resolution | Entity challenges their score. Show them: "Your jurisdiction was recorded as PA based on KVK registry data from March 2026. This scored 8/10 in our country risk dataset. Here is the dataset entry." |
| Model validation | Correlate evaluation inputs and scores with outcomes (SARs, exits) over time. Provenance lets you filter for "evaluations based on authoritative sources" vs "evaluations based on manual/unverified data." |
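
The change-driven re-evaluation row can be illustrated with a small filter over stored evaluation results. This is a sketch under assumptions: the `evaluation_id` and `factors` keys and the function name `evaluations_to_rescore` are hypothetical, and only the per-factor `input.source_ref` field comes from this ADR.

```python
def evaluations_to_rescore(evaluations, corrected_source_ref):
    """Return IDs of evaluations with a factor scored from the corrected record."""
    flagged = []
    for evaluation in evaluations:
        # Collect the source_ref of every factor that carries an input block.
        refs = {
            factor["input"].get("source_ref")
            for factor in evaluation["factors"]
            if "input" in factor
        }
        refs.discard(None)  # factors without a source_ref never match
        if corrected_source_ref in refs:
            flagged.append(evaluation["evaluation_id"])
    return flagged


evaluations = [
    {"evaluation_id": "ev-1",
     "factors": [{"factor_id": "jurisdiction_risk",
                  "input": {"source_ref": "kvk:12345678"}}]},
    {"evaluation_id": "ev-2",
     "factors": [{"factor_id": "jurisdiction_risk",
                  "input": {"source_ref": "kvk:99999999"}}]},
]
to_rescore = evaluations_to_rescore(evaluations, "kvk:12345678")
```

In a real deployment this lookup would more likely be a database query over the stored evaluation JSON; the in-memory loop only shows the matching rule.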
6. Data Freshness Service
A background service periodically scans active observations and generates freshness alerts:
    For each entity in active portfolio:
        For each field used by the entity's applicable risk matrix (via wire mappings):
            Check the active observation's observed_at
            If older than the configured threshold (per source or global):
                Create a freshness alert → surfaces in the Risk Center dashboard
Thresholds are configurable per source (registry data may be acceptable at 12 months; sanctions screening should be refreshed weekly).
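The scan can be sketched as a pure function over the active observations. The threshold values, the `sanctions_screening` source name, the dict shapes, and the function name `freshness_alerts` are all illustrative assumptions; the ADR fixes only that thresholds are per source with a global fallback.

```python
from datetime import datetime, timedelta, timezone

# Assumed example thresholds; real values would come from configuration.
GLOBAL_THRESHOLD = timedelta(days=183)
SOURCE_THRESHOLDS = {
    "kvk_registry": timedelta(days=365),
    "sanctions_screening": timedelta(days=7),
}


def freshness_alerts(active_observations, fields_in_use, now):
    """Return one alert per active observation older than its threshold.

    active_observations: dicts with entity_id, field_path, source_id and a
    timezone-aware observed_at datetime.
    fields_in_use: field paths referenced by the applicable risk matrix.
    """
    alerts = []
    for obs in active_observations:
        if obs["field_path"] not in fields_in_use:
            continue  # fields the matrix does not read cannot make a score stale
        threshold = SOURCE_THRESHOLDS.get(obs["source_id"], GLOBAL_THRESHOLD)
        age = now - obs["observed_at"]
        if age > threshold:
            alerts.append({
                "entity_id": obs["entity_id"],
                "field_path": obs["field_path"],
                "source_id": obs["source_id"],
                "age_days": age.days,
            })
    return alerts


now = datetime(2026, 4, 3, tzinfo=timezone.utc)
observations = [
    {"entity_id": "e1", "field_path": "country_of_incorporation",
     "source_id": "kvk_registry",
     "observed_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"entity_id": "e2", "field_path": "country_of_incorporation",
     "source_id": "kvk_registry",
     "observed_at": datetime(2026, 3, 15, tzinfo=timezone.utc)},
]
alerts = freshness_alerts(observations, {"country_of_incorporation"}, now)
```

Passing `now` in as a parameter keeps the check deterministic and easy to test.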
7. Retention & Storage Considerations
Field observations are append-only and will grow over time. Retention policy options:
- Keep all observations indefinitely for entities in active portfolios. Regulatory retention requirements (typically 5 years post-relationship) determine the minimum.
- Archive superseded observations older than the retention period to cold storage.
- Partition by `ingested_at` for efficient archival and querying.
- The active observation set (one row per entity per field, `superseded_at IS NULL`) is the hot path and stays small relative to the full history.
8. Scope & Boundaries
In Scope
- `field_observations` table and ingestion pipeline integration
- Provenance metadata in ADR-019 evaluation results (`input` block)
- Data freshness alerting service
- Point-in-time state reconstruction queries
Out of Scope (Future Work)
- Full entity-level versioning or event sourcing
- Source reliability scoring (weighting authoritative vs. derived sources)
- Automated conflict resolution when multiple sources disagree on the same field
- Provenance UI in the entity detail page (display layer — depends on this ADR's data model)
9. Open Questions
- Granularity of `field_path`. Should nested fields (e.g., `addresses[0].country`) each get their own observation row, or should complex fields be stored as a single observation with a JSONB value? Leaning toward a single observation with JSONB, which keeps the observation count manageable.
- Multi-source conflict. If two sources report different values for the same field at similar times, which one becomes the active entity value? Current approach: last-write-wins with source priority (authoritative > derived > manual). This ADR records both observations; conflict resolution is a separate concern.
- Bulk ingestion performance. Portfolio re-ingestion from a registry feed might update thousands of entities × dozens of fields. The supersede-and-insert pattern needs to be efficient. Likely a batch upsert with a CTE.
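One reading of "last-write-wins with source priority" from the multi-source conflict question is sketched below: compare source priority first and use recency only to break ties. That ordering, the rank given to `unverified`, and the function name `pick_active` are assumptions; the ADR leaves conflict resolution as a separate concern.

```python
from datetime import datetime, timezone

# Priority order from the ADR; ranking "unverified" lowest is an assumption.
PRIORITY = {"authoritative": 3, "derived": 2, "manual": 1, "unverified": 0}


def pick_active(observations):
    """Choose the observation that should drive the entity's current value.

    Higher-priority confidence wins; among equal priority the most recent
    observed_at wins (last-write-wins).
    """
    return max(
        observations,
        key=lambda o: (PRIORITY.get(o["confidence"], 0), o["observed_at"]),
    )


candidates = [
    {"value": "NL", "confidence": "authoritative",
     "observed_at": datetime(2026, 3, 1, tzinfo=timezone.utc)},
    {"value": "PA", "confidence": "manual",
     "observed_at": datetime(2026, 3, 15, tzinfo=timezone.utc)},
]
winner = pick_active(candidates)
```

Under this reading the older authoritative value outranks the newer manual one, while both rows stay in the observation log.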
10. POC Scope
Minimal implementation to validate the concept alongside ADR-019's scoring pipeline POC:
- Create the `field_observations` table.
- Modify one ingestion pipeline (the simplest registry source) to write observations on ingest.
- Modify the `RiskMatrixScorer` to read provenance alongside entity values and embed the `input` block in evaluation results.
- Build one query: "show me all entities with field data older than N months."
- Verify: open an evaluation result from the POC and trace one factor score back to its source observation.