Skip to main content

ADR-008a: Reference Data Registry

Status: Proposed Date: 2026-03-28 Author: Atlas Architecture Prerequisite for: ADR-008 (EBA Risk Matrix Engine)


Context

ADR-008 introduces a versioned EBA risk matrix whose reference_data_snapshot column freezes external risk lists into each published matrix version. However, ADR-008 does not specify where these lists originate, who maintains them, or how they are versioned independently of the matrix itself.

Today, reference data is scattered across the codebase:

  • JurisdictionRiskFlags in src/ontology/schemas/entities.py defines the data structure for country risk but contains no actual country data.
  • UBO ownership thresholds in src/ontology/ownership.py are hardcoded per jurisdiction with no ability for customers to override.
  • Risk scoring weights in config/ontology_schema_v3.yaml are static YAML with no versioning or tenant customisation.
  • The DataProviderCountry model in the data providers layer tracks provider coverage by country but does not classify countries by risk.

Compliance teams need to maintain these lists on their own cadence — FATF updates its grey list roughly three times per year, the EU Commission updates its high-risk third countries list via delegated regulations, and CPI scores are published annually. A matrix that was published six months ago should not silently use stale country lists; equally, updating a country list should not retroactively change existing evaluations.


Decision

Introduce a Reference Data Registry — a tenant-scoped, versioned store for the configurable inputs that feed into the EBA risk matrix and other compliance features. Each dataset type is independently versioned, auditable, and freezable into a matrix snapshot at publish time.


Dataset Types

1. Country Risk Lists

These are the primary configurable inputs for the country_risk_list indicator type in the matrix scorer.

ListSourceUpdate FrequencyPurpose
eu_high_risk_third_countriesEC Delegated Regulation (AMLD)~1–2x per yearCountries with strategic AML/CFT deficiencies
fatf_grey_listFATF "Increased Monitoring"~3x per year (after each plenary)Jurisdictions under increased monitoring
fatf_black_listFATF "High-Risk Jurisdictions"~3x per yearCall for action jurisdictions
eu_tax_blacklistEU Council~2x per yearNon-cooperative jurisdictions for tax
cpi_below_40Transparency International CPIAnnualCountries with CPI score below 40
secrecy_jurisdictionsTax Justice Network FSI~BiennialHigh financial secrecy jurisdictions

Each list contains ISO 3166-1 alpha-2 country codes. Customers can extend any list with additional countries or create entirely custom lists (e.g., internal_restricted_countries).

2. Industry / Sector Risk Classifications

Configurable mappings from industry codes to risk tiers. Different customers have different risk appetites — a crypto exchange's high-risk sectors differ from a trade finance bank's.

DatasetContentUsed By
high_risk_sectorsIndustry code → risk tier mappingbusiness_profile factor
cash_intensive_industriesNACE/SIC codes of cash-intensive sectorsbusiness_profile factor
regulated_activitiesActivities requiring specific licensesregulatory_status factor

3. PEP Classification Tiers

Configurable mapping of PEP levels to risk scores. The EBA guidelines give latitude on how granularly PEP exposure is scored.

DatasetContentUsed By
pep_tiersPEP classification → score mappingpep_exposure factor

Default tiers: head_of_state (30), senior_government (30), member_of_parliament (25), senior_military (25), regional_political (20), senior_judicial (20), family_member (15), close_associate (15).

4. Sanctions List Sources

Configurable set of sanctions lists to check against and match-type weighting.

DatasetContentUsed By
active_sanctions_listsWhich lists to screen (OFAC, EU, UN, UK, etc.)sanctions_exposure factor
sanctions_match_weightsMatch type → score mappingsanctions_exposure factor

5. Product / Service Risk Taxonomy

Customer-defined mapping of their own product catalogue to risk scores. This populates the product_service dimension which ADR-008 currently leaves as input_type: "manual_or_workflow" with default_score: 0.

DatasetContentUsed By
product_risk_taxonomyProduct/service code → risk scoreproduct_complexity factor

6. UBO Ownership Thresholds

Jurisdiction-specific thresholds for beneficial ownership determination. Currently hardcoded in src/ontology/ownership.py.

DatasetContentUsed By
ubo_thresholdsCountry → ownership % thresholdOwnership analysis, ownership_complexity factor

Default: 25% (FATF baseline). EU: 25% (AMLD). US: 25% (FinCEN). Customers may need lower thresholds for enhanced due diligence scenarios.


Database Schema

-- V57__reference_data_registry.sql

-- ============================================================
-- Reference data types: defines what kinds of lists exist
-- ============================================================
CREATE TABLE reference_data_types (
id VARCHAR(100) PRIMARY KEY, -- e.g. 'country_risk_list'
name VARCHAR(255) NOT NULL,
description TEXT,
schema_spec JSONB NOT NULL, -- JSON Schema for validating entries
-- e.g. {"type": "array", "items": {"type": "string", "pattern": "^[A-Z]{2}$"}}
-- for country lists: array of ISO 3166-1 alpha-2 codes
-- for tier mappings: object with string keys and numeric values
created_at TIMESTAMPTZ DEFAULT NOW()
);

-- ============================================================
-- Reference datasets: versioned, tenant-scoped data
-- ============================================================
CREATE TABLE reference_datasets (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
type_id VARCHAR(100) NOT NULL REFERENCES reference_data_types(id),
list_key VARCHAR(150) NOT NULL, -- e.g. 'fatf_grey_list', 'pep_tiers'
version INTEGER NOT NULL DEFAULT 1,
name VARCHAR(255) NOT NULL,
description TEXT,

status VARCHAR(20) NOT NULL DEFAULT 'draft',
-- 'draft', 'active', 'archived'

-- The actual data
data JSONB NOT NULL,

-- Provenance: where this data came from
source VARCHAR(255), -- 'FATF Plenary Feb 2026', 'Manual entry'
source_url TEXT, -- Link to authoritative source
source_date DATE, -- Date of the source publication

-- Tenant scoping
tenant_id UUID, -- NULL = system default (shared)

-- Audit
created_by VARCHAR(100),
activated_by VARCHAR(100),
activated_at TIMESTAMPTZ,
archived_by VARCHAR(100),
archived_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),

-- Sentinel UUID used in place of NULL tenant_id for unique constraint purposes.
-- PostgreSQL treats NULLs as distinct in UNIQUE constraints, so a raw
-- UNIQUE(list_key, version, tenant_id) would allow duplicate system-default
-- rows. COALESCE to a fixed sentinel closes that hole.
UNIQUE(list_key, version, tenant_id) -- covers tenant-scoped rows (tenant_id NOT NULL)
);

-- Version uniqueness for system defaults (tenant_id IS NULL):
-- COALESCE sentinel ensures NULLs collide correctly.
CREATE UNIQUE INDEX uq_rd_version_system_default
ON reference_datasets(list_key, version)
WHERE tenant_id IS NULL;

-- Only one active version per list_key per tenant (including system default)
CREATE UNIQUE INDEX uq_rd_single_active
ON reference_datasets(list_key, COALESCE(tenant_id, '00000000-0000-0000-0000-000000000000'::UUID))
WHERE status = 'active';

CREATE INDEX idx_rd_type ON reference_datasets(type_id);
CREATE INDEX idx_rd_list_key ON reference_datasets(list_key);
CREATE INDEX idx_rd_tenant ON reference_datasets(tenant_id);
CREATE INDEX idx_rd_status ON reference_datasets(status, list_key);

-- Prevent modification of active datasets.
-- Two rules:
-- 1. The only allowed status transition FROM 'active' is to 'archived'.
-- 2. When OLD.status = 'active', NO data/provenance/key columns may change,
-- regardless of the target status. This closes the loophole where a
-- caller archives a dataset while silently mutating its contents.
CREATE OR REPLACE FUNCTION prevent_active_reference_data_update()
RETURNS TRIGGER AS $$
BEGIN
IF OLD.status = 'active' THEN
-- Rule 1: only allow transition to 'archived'
IF NEW.status NOT IN ('active', 'archived') THEN
RAISE EXCEPTION 'Active reference datasets may only transition to archived.';
END IF;

-- Rule 2: data and provenance columns are always immutable once active
IF NEW.data IS DISTINCT FROM OLD.data
OR NEW.list_key IS DISTINCT FROM OLD.list_key
OR NEW.version IS DISTINCT FROM OLD.version
OR NEW.type_id IS DISTINCT FROM OLD.type_id
OR NEW.tenant_id IS DISTINCT FROM OLD.tenant_id
OR NEW.source IS DISTINCT FROM OLD.source
OR NEW.source_url IS DISTINCT FROM OLD.source_url
OR NEW.source_date IS DISTINCT FROM OLD.source_date
OR NEW.name IS DISTINCT FROM OLD.name
OR NEW.created_by IS DISTINCT FROM OLD.created_by
OR NEW.activated_by IS DISTINCT FROM OLD.activated_by
OR NEW.activated_at IS DISTINCT FROM OLD.activated_at THEN
RAISE EXCEPTION 'Cannot modify an active reference dataset. Create a new version instead.';
END IF;

-- Rule 3: archived_by/archived_at may only be set during the
-- active→archived transition (not while staying active)
IF NEW.status = 'active' THEN
IF NEW.archived_by IS DISTINCT FROM OLD.archived_by
OR NEW.archived_at IS DISTINCT FROM OLD.archived_at THEN
RAISE EXCEPTION 'Cannot set archival metadata on an active dataset.';
END IF;
END IF;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_prevent_active_reference_data_update
BEFORE UPDATE ON reference_datasets
FOR EACH ROW
EXECUTE FUNCTION prevent_active_reference_data_update();


-- ============================================================
-- Audit log: tracks every change to reference data
-- ============================================================
CREATE TABLE reference_data_audit_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
dataset_id UUID NOT NULL REFERENCES reference_datasets(id) ON DELETE RESTRICT,
action VARCHAR(50) NOT NULL, -- 'created', 'activated', 'archived', 'used_in_snapshot'
actor VARCHAR(100),
details JSONB, -- Action-specific context
created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_rdal_dataset ON reference_data_audit_log(dataset_id, created_at DESC);

How Reference Data Flows into the Matrix

┌─────────────────────────────────────────────────┐
│ Reference Data Registry │
│ │
│ fatf_grey_list v3 (active) │
│ eu_high_risk_third_countries v2 (active) │
│ cpi_below_40 v1 (active) │
│ pep_tiers v1 (active) │
│ product_risk_taxonomy v2 (active) │
│ ubo_thresholds v1 (active) │
└───────────────────────┬─────────────────────────┘

│ Matrix publish resolves
│ all referenced lists

┌─────────────────────────────────────────────────┐
│ risk_matrix_schemas.reference_data_snapshot │
│ │
│ { │
│ "fatf_grey_list": ["MM", "SY", "YE", ...], │
│ "eu_high_risk_third_countries": ["AF", ...], │
│ "cpi_below_40": ["SO", "SS", "SY", ...], │
│ "pep_tiers": {"head_of_state": 30, ...}, │
│ "product_risk_taxonomy": {"crypto": 20, ...}, │
│ "_snapshot_metadata": { │
│ "resolved_at": "2026-03-28T14:00:00Z", │
│ "resolver_tenant_id": "abc-123-...", │
│ "datasets": { │
│ "fatf_grey_list": { │
│ "dataset_id": "d1e2f3...", │
│ "version": 3, │
│ "tenant_id": null, │
│ "resolution_tier": "system_default", │
│ "source": "FATF Plenary Feb 2026", │
│ "source_url": "https://fatf-gafi.org/…",│
│ "source_date": "2026-02-21", │
│ "activated_at": "2026-02-25T09:00:00Z" │
│ }, │
│ "eu_high_risk_third_countries": { │
│ "dataset_id": "a4b5c6...", │
│ "version": 2, │
│ "tenant_id": "abc-123-...", │
│ "resolution_tier": "tenant_override", │
│ "source": "EC Delegated Reg 2025/…", │
│ "source_url": "https://eur-lex.…", │
│ "source_date": "2025-11-15", │
│ "activated_at": "2025-12-01T10:00:00Z" │
│ } │
│ } │
│ } │
│ } │
└─────────────────────────────────────────────────┘

When a matrix schema is published:

  1. The publish endpoint reads every country_risk_list reference and every other dataset key used by any factor in the matrix definition.
  2. For each key, ReferenceDataResolver resolves the active version for the tenant (falling back to the system default if no tenant override exists).
  3. For each resolved dataset, the resolver records the full provenance: dataset_id, version, tenant_id, resolution_tier (tenant_override or system_default), source, source_url, source_date, and activated_at. This metadata makes the snapshot fully replayable — an auditor can trace every input back to the exact dataset row, its authoritative source, and which resolution tier won.
  4. All resolved data is merged into reference_data_snapshot with the per-dataset metadata under _snapshot_metadata.datasets.
  5. The existing _validate_reference_lists method (ADR-008) still runs — but now it validates against the freshly resolved snapshot rather than a manually assembled one.

Tenant Override Model

Reference data follows a layered resolution strategy:

1. Tenant-specific active dataset → highest priority
2. System default active dataset → fallback
3. (missing) → publish-time validation error

This means Atlas ships with sensible system defaults (official FATF/EU/CPI lists) that customers can override per-tenant. A payment institution can add extra countries to eu_high_risk_third_countries without affecting other tenants.

The override only affects future matrix publications. Existing published matrices retain their original snapshots.


API Endpoints

All under /reference-data prefix.

# Dataset type management (admin only)
GET /reference-data/types List all dataset types
GET /reference-data/types/{type_id} Get type with JSON Schema spec

# Dataset CRUD
GET /reference-data/datasets List datasets (filterable by type, status, tenant)
POST /reference-data/datasets Create new dataset (draft)
GET /reference-data/datasets/{id} Get dataset by ID
PUT /reference-data/datasets/{id} Update draft dataset
POST /reference-data/datasets/{id}/activate Activate dataset (archives previous active version)
POST /reference-data/datasets/{id}/archive Archive dataset

# Versioning
GET /reference-data/datasets/{list_key}/versions List all versions of a dataset
GET /reference-data/datasets/{list_key}/active Get currently active version
POST /reference-data/datasets/{id}/new-version Create new version from existing
GET /reference-data/datasets/{list_key}/diff/{v1}/{v2} Diff between two versions

# Resolution (used internally by matrix publish)
GET /reference-data/resolve Resolve all active datasets for a tenant
GET /reference-data/resolve/{list_key} Resolve single active dataset for a tenant

# Audit
GET /reference-data/datasets/{id}/audit-log View change history

# Bulk import (for initial setup and periodic updates)
POST /reference-data/import Import datasets from CSV/JSON

ADR-008 and ADR-008a together introduce a navigation restructure. The existing sidebar section Configure (containing only Workflows) is replaced by Compliance Studio, grouping the three compliance authoring tools:

Compliance Studio
├── Workflows /workflows/schemas (existing, icon: flows)
├── Risk Matrices /risk-matrices (ADR-008, icon: heat-grid)
└── Risk Categories /studio/risk-categories (ADR-008a, icon: th-derived)

These three tools form a pipeline that compliance officers configure end-to-end: Risk Categories defines the facts (which countries are high-risk, what PEP tiers exist), Risk Matrices defines the scoring logic (how facts map to risk scores), and Workflows orchestrates the process (when to investigate, score, review, and decide).

The navigation change is already implemented in AppShell.tsx. The Risk Matrices and Risk Categories routes are registered in App.tsx.

System-level configuration (Data Providers, general settings) remains under Settings (/settings) in the bottom nav — it is not compliance content.


Frontend: Compliance Studio → Risk Categories

A dedicated page at /studio/risk-categories under the Compliance Studio navigation section. The UI should support:

Country Risk Lists View

  • Table showing all country risk list types with their active version, last update date, and entry count.
  • Drill into a list to see all countries with the ability to add/remove entries.
  • Visual world map highlighting countries in each list (stretch goal).
  • Diff view when activating a new version: shows countries added and removed.

Risk Classification Views

  • PEP tiers: table with tier name, description, and configurable score value.
  • Industry risk: table with industry code, description, and risk tier assignment.
  • Product taxonomy: table with product/service, description, and risk score.
  • UBO thresholds: table with jurisdiction and ownership percentage.

Common UI Patterns

  • Each dataset shows provenance (source, source date, source URL).
  • "New Version" button copies the current active version into a draft for editing.
  • "Activate" button with confirmation dialog showing the diff against the current active version.
  • Audit log sidebar showing who changed what and when.

Migration Path

Phase 1: Schema + Seeding (Week 1)

  • Create V57__reference_data_registry.sql migration.
  • Seed system defaults from authoritative sources: current FATF grey/black lists, current EU high-risk third countries, latest CPI scores, default PEP tiers.
  • Seed UBO thresholds from the existing hardcoded values in src/ontology/ownership.py.

Phase 2: Backend API (Weeks 2–3)

  • Repository layer (ReferenceDataRepo).
  • FastAPI router with all CRUD, versioning, and resolution endpoints.
  • JSON Schema validation on create/update.
  • Integration with matrix publish flow: ReferenceDataResolver called during POST /risk-matrix/schemas/{id}/publish.

Phase 3: Frontend (Weeks 3–4)

  • Settings → Reference Data tab with list management UI.
  • Country list editor with add/remove.
  • Tier/classification editors.
  • Version diff view.

Phase 4: Wire Into Existing Features (Week 5)

  • Replace hardcoded UBO thresholds in src/ontology/ownership.py with registry lookups.
  • Replace hardcoded risk scoring weights in ontology_schema_v3.yaml with registry-backed values.
  • Update JurisdictionRiskFlags population to use registry data.

Interaction with ADR-008

This ADR is an upstream prerequisite for ADR-008. The matrix publish flow (POST /risk-matrix/schemas/{id}/publish) cannot populate reference_data_snapshot without a running registry and resolver. ADR-008's scorer, ontology mapper, and version manager do not depend on this ADR directly — they consume the already-frozen snapshot — so those components can be developed in parallel once the registry schema and resolver are in place.

No ADR-008 interfaces change. The reference_data_snapshot column, the _validate_reference_lists method, and the country_risk_list indicator type all work exactly as specified. The only change is where the snapshot data comes from — instead of manual assembly, it is resolved from the registry at publish time.

The _validate_reference_lists call during matrix publish now has a concrete upstream: if a matrix factor references fatf_grey_list but no active dataset with list_key = 'fatf_grey_list' exists for the tenant (or as a system default), the validation fails and the publish is rejected.


File Structure

src/reference_data/ # NEW directory
├── __init__.py
├── repository.py # ReferenceDataRepo (CRUD + resolution)
├── resolver.py # ReferenceDataResolver (tenant-aware lookup)
├── validator.py # JSON Schema validation per dataset type
├── router.py # FastAPI endpoints
└── seed.py # System default seeding logic

config/reference_data/ # NEW directory
├── fatf_grey_list.json # Current FATF increased monitoring list
├── fatf_black_list.json # Current FATF high-risk jurisdictions
├── eu_high_risk_third_countries.json # Current EC delegated regulation list
├── cpi_below_40.json # Countries with CPI < 40
├── pep_tiers.json # Default PEP classification → score mapping
├── ubo_thresholds.json # Default jurisdiction → threshold mapping
└── dataset_types.json # Type definitions with JSON Schema specs

migrations/
├── V57__reference_data_registry.sql # Registry tables + audit log

Effort Estimate

ComponentEstimateDependencies
Database migration (V57)0.5 weeksNone
Repository + resolver1 weekMigration
JSON Schema validation0.5 weeksRepository
API endpoints (router)1 weekRepository
System default seeding0.5 weeksMigration
Matrix publish integration0.5 weeksResolver, ADR-008
Frontend: Settings tab1.5 weeksAPI endpoints
Wire into ownership/ontology0.5 weeksResolver
Testing1 weekAll components
Total7 weeks

ADR-008a is an upstream dependency for ADR-008. The registry, resolver, and seed data (Phases 1–2) must land before the matrix publish flow can be implemented. The frontend (Phase 3) and legacy wiring (Phase 4) can proceed in parallel with ADR-008's scorer and mapper work since those components do not depend on the registry UI.


Rollback Strategy

Fully reversible:

  1. Remove router registration from main.py.
  2. Revert matrix publish endpoint to accept manually assembled reference_data_snapshot.
  3. Drop tables: DROP TABLE reference_data_audit_log, reference_datasets, reference_data_types CASCADE.
  4. Remove src/reference_data/ and config/reference_data/ directories.

ADR-008's reference_data_snapshot column continues to work — it just requires manual population instead of automatic resolution.