ADR-008a: Reference Data Registry

Status: Proposed Date: 2026-03-28 Author: Atlas Architecture Prerequisite for: ADR-008 (EBA Risk Matrix Engine)

Context

ADR-008 introduces a versioned EBA risk matrix whose reference_data_snapshot column freezes external risk lists into each published matrix version. However, ADR-008 does not specify where these lists originate, who maintains them, or how they are versioned independently of the matrix itself.

Today, reference data is scattered across the codebase:

JurisdictionRiskFlags in src/ontology/schemas/entities.py defines the data structure for country risk but contains no actual country data.
UBO ownership thresholds in src/ontology/ownership.py are hardcoded per jurisdiction with no ability for customers to override.
Risk scoring weights in config/ontology_schema_v3.yaml are static YAML with no versioning or tenant customisation.
The DataProviderCountry model in the data providers layer tracks provider coverage by country but does not classify countries by risk.

Compliance teams need to maintain these lists on their own cadence — FATF updates its grey list roughly three times per year, the EU Commission updates its high-risk third countries list via delegated regulations, and CPI scores are published annually. A matrix that was published six months ago should not silently use stale country lists; equally, updating a country list should not retroactively change existing evaluations.

Decision

Introduce a Reference Data Registry — a tenant-scoped, versioned store for the configurable inputs that feed into the EBA risk matrix and other compliance features. Each dataset type is independently versioned, auditable, and freezable into a matrix snapshot at publish time.

Dataset Types

1. Country Risk Lists

These are the primary configurable inputs for the country_risk_list indicator type in the matrix scorer.

List	Source	Update Frequency	Purpose
`eu_high_risk_third_countries`	EC Delegated Regulation (AMLD)	~1–2x per year	Countries with strategic AML/CFT deficiencies
`fatf_grey_list`	FATF "Increased Monitoring"	~3x per year (after each plenary)	Jurisdictions under increased monitoring
`fatf_black_list`	FATF "High-Risk Jurisdictions"	~3x per year	Call for action jurisdictions
`eu_tax_blacklist`	EU Council	~2x per year	Non-cooperative jurisdictions for tax
`cpi_below_40`	Transparency International CPI	Annual	Countries with CPI score below 40
`secrecy_jurisdictions`	Tax Justice Network FSI	~Biennial	High financial secrecy jurisdictions

Each list contains ISO 3166-1 alpha-2 country codes. Customers can extend any list with additional countries or create entirely custom lists (e.g., internal_restricted_countries).

2. Industry / Sector Risk Classifications

Configurable mappings from industry codes to risk tiers. Different customers have different risk appetites — a crypto exchange's high-risk sectors differ from a trade finance bank's.

Dataset	Content	Used By
`high_risk_sectors`	Industry code → risk tier mapping	`business_profile` factor
`cash_intensive_industries`	NACE/SIC codes of cash-intensive sectors	`business_profile` factor
`regulated_activities`	Activities requiring specific licenses	`regulatory_status` factor

3. PEP Classification Tiers

Configurable mapping of PEP levels to risk scores. The EBA guidelines give latitude on how granularly PEP exposure is scored.

Dataset	Content	Used By
`pep_tiers`	PEP classification → score mapping	`pep_exposure` factor

Default tiers: head_of_state (30), senior_government (30), member_of_parliament (25), senior_military (25), regional_political (20), senior_judicial (20), family_member (15), close_associate (15).

4. Sanctions List Sources

Configurable set of sanctions lists to check against and match-type weighting.

Dataset	Content	Used By
`active_sanctions_lists`	Which lists to screen (OFAC, EU, UN, UK, etc.)	`sanctions_exposure` factor
`sanctions_match_weights`	Match type → score mapping	`sanctions_exposure` factor

5. Product / Service Risk Taxonomy

Customer-defined mapping of their own product catalogue to risk scores. This populates the product_service dimension which ADR-008 currently leaves as input_type: "manual_or_workflow" with default_score: 0.

Dataset	Content	Used By
`product_risk_taxonomy`	Product/service code → risk score	`product_complexity` factor

6. UBO Ownership Thresholds

Jurisdiction-specific thresholds for beneficial ownership determination. Currently hardcoded in src/ontology/ownership.py.

Dataset	Content	Used By
`ubo_thresholds`	Country → ownership % threshold	Ownership analysis, `ownership_complexity` factor

Default: 25% (FATF baseline). EU: 25% (AMLD). US: 25% (FinCEN). Customers may need lower thresholds for enhanced due diligence scenarios.

Database Schema

-- V57__reference_data_registry.sql

-- ============================================================
-- Reference data types: defines what kinds of lists exist
-- ============================================================
CREATE TABLE reference_data_types (
    id              VARCHAR(100) PRIMARY KEY,    -- e.g. 'country_risk_list'
    name            VARCHAR(255) NOT NULL,
    description     TEXT,
    schema_spec     JSONB NOT NULL,              -- JSON Schema for validating entries
    -- e.g. {"type": "array", "items": {"type": "string", "pattern": "^[A-Z]{2}$"}}
    -- for country lists: array of ISO 3166-1 alpha-2 codes
    -- for tier mappings: object with string keys and numeric values
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

-- ============================================================
-- Reference datasets: versioned, tenant-scoped data
-- ============================================================
CREATE TABLE reference_datasets (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    type_id         VARCHAR(100) NOT NULL REFERENCES reference_data_types(id),
    list_key        VARCHAR(150) NOT NULL,       -- e.g. 'fatf_grey_list', 'pep_tiers'
    version         INTEGER NOT NULL DEFAULT 1,
    name            VARCHAR(255) NOT NULL,
    description     TEXT,

    status          VARCHAR(20) NOT NULL DEFAULT 'draft',
        -- 'draft', 'active', 'archived'

    -- The actual data
    data            JSONB NOT NULL,

    -- Provenance: where this data came from
    source          VARCHAR(255),                -- 'FATF Plenary Feb 2026', 'Manual entry'
    source_url      TEXT,                        -- Link to authoritative source
    source_date     DATE,                        -- Date of the source publication

    -- Tenant scoping
    tenant_id       UUID,                        -- NULL = system default (shared)

    -- Audit
    created_by      VARCHAR(100),
    activated_by    VARCHAR(100),
    activated_at    TIMESTAMPTZ,
    archived_by     VARCHAR(100),
    archived_at     TIMESTAMPTZ,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW(),

    -- Sentinel UUID used in place of NULL tenant_id for unique constraint purposes.
    -- PostgreSQL treats NULLs as distinct in UNIQUE constraints, so a raw
    -- UNIQUE(list_key, version, tenant_id) would allow duplicate system-default
    -- rows.  COALESCE to a fixed sentinel closes that hole.
    UNIQUE(list_key, version, tenant_id)  -- covers tenant-scoped rows (tenant_id NOT NULL)
);

-- Version uniqueness for system defaults (tenant_id IS NULL):
-- COALESCE sentinel ensures NULLs collide correctly.
CREATE UNIQUE INDEX uq_rd_version_system_default
    ON reference_datasets(list_key, version)
    WHERE tenant_id IS NULL;

-- Only one active version per list_key per tenant (including system default)
CREATE UNIQUE INDEX uq_rd_single_active
    ON reference_datasets(list_key, COALESCE(tenant_id, '00000000-0000-0000-0000-000000000000'::UUID))
    WHERE status = 'active';

CREATE INDEX idx_rd_type ON reference_datasets(type_id);
CREATE INDEX idx_rd_list_key ON reference_datasets(list_key);
CREATE INDEX idx_rd_tenant ON reference_datasets(tenant_id);
CREATE INDEX idx_rd_status ON reference_datasets(status, list_key);

-- Prevent modification of active datasets.
-- Two rules:
--   1. The only allowed status transition FROM 'active' is to 'archived'.
--   2. When OLD.status = 'active', NO data/provenance/key columns may change,
--      regardless of the target status.  This closes the loophole where a
--      caller archives a dataset while silently mutating its contents.
CREATE OR REPLACE FUNCTION prevent_active_reference_data_update()
RETURNS TRIGGER AS $$
BEGIN
    IF OLD.status = 'active' THEN
        -- Rule 1: only allow transition to 'archived'
        IF NEW.status NOT IN ('active', 'archived') THEN
            RAISE EXCEPTION 'Active reference datasets may only transition to archived.';
        END IF;

        -- Rule 2: data and provenance columns are always immutable once active
        IF NEW.data IS DISTINCT FROM OLD.data
           OR NEW.list_key IS DISTINCT FROM OLD.list_key
           OR NEW.version IS DISTINCT FROM OLD.version
           OR NEW.type_id IS DISTINCT FROM OLD.type_id
           OR NEW.tenant_id IS DISTINCT FROM OLD.tenant_id
           OR NEW.source IS DISTINCT FROM OLD.source
           OR NEW.source_url IS DISTINCT FROM OLD.source_url
           OR NEW.source_date IS DISTINCT FROM OLD.source_date
           OR NEW.name IS DISTINCT FROM OLD.name
           OR NEW.created_by IS DISTINCT FROM OLD.created_by
           OR NEW.activated_by IS DISTINCT FROM OLD.activated_by
           OR NEW.activated_at IS DISTINCT FROM OLD.activated_at THEN
            RAISE EXCEPTION 'Cannot modify an active reference dataset. Create a new version instead.';
        END IF;

        -- Rule 3: archived_by/archived_at may only be set during the
        -- active→archived transition (not while staying active)
        IF NEW.status = 'active' THEN
            IF NEW.archived_by IS DISTINCT FROM OLD.archived_by
               OR NEW.archived_at IS DISTINCT FROM OLD.archived_at THEN
                RAISE EXCEPTION 'Cannot set archival metadata on an active dataset.';
            END IF;
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_prevent_active_reference_data_update
    BEFORE UPDATE ON reference_datasets
    FOR EACH ROW
    EXECUTE FUNCTION prevent_active_reference_data_update();


-- ============================================================
-- Audit log: tracks every change to reference data
-- ============================================================
CREATE TABLE reference_data_audit_log (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    dataset_id      UUID NOT NULL REFERENCES reference_datasets(id) ON DELETE RESTRICT,
    action          VARCHAR(50) NOT NULL,        -- 'created', 'activated', 'archived', 'used_in_snapshot'
    actor           VARCHAR(100),
    details         JSONB,                       -- Action-specific context
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_rdal_dataset ON reference_data_audit_log(dataset_id, created_at DESC);

How Reference Data Flows into the Matrix

┌─────────────────────────────────────────────────┐
│  Reference Data Registry                         │
│                                                   │
│  fatf_grey_list v3 (active)                      │
│  eu_high_risk_third_countries v2 (active)        │
│  cpi_below_40 v1 (active)                        │
│  pep_tiers v1 (active)                           │
│  product_risk_taxonomy v2 (active)               │
│  ubo_thresholds v1 (active)                      │
└───────────────────────┬─────────────────────────┘
                        │
                        │  Matrix publish resolves
                        │  all referenced lists
                        ▼
┌─────────────────────────────────────────────────┐
│  risk_matrix_schemas.reference_data_snapshot      │
│                                                   │
│  {                                                │
│    "fatf_grey_list": ["MM", "SY", "YE", ...],    │
│    "eu_high_risk_third_countries": ["AF", ...],   │
│    "cpi_below_40": ["SO", "SS", "SY", ...],      │
│    "pep_tiers": {"head_of_state": 30, ...},      │
│    "product_risk_taxonomy": {"crypto": 20, ...},  │
│    "_snapshot_metadata": {                        │
│      "resolved_at": "2026-03-28T14:00:00Z",      │
│      "resolver_tenant_id": "abc-123-...",         │
│      "datasets": {                                │
│        "fatf_grey_list": {                        │
│          "dataset_id": "d1e2f3...",               │
│          "version": 3,                            │
│          "tenant_id": null,                       │
│          "resolution_tier": "system_default",     │
│          "source": "FATF Plenary Feb 2026",       │
│          "source_url": "https://fatf-gafi.org/…",│
│          "source_date": "2026-02-21",             │
│          "activated_at": "2026-02-25T09:00:00Z"   │
│        },                                         │
│        "eu_high_risk_third_countries": {           │
│          "dataset_id": "a4b5c6...",               │
│          "version": 2,                            │
│          "tenant_id": "abc-123-...",              │
│          "resolution_tier": "tenant_override",    │
│          "source": "EC Delegated Reg 2025/…",     │
│          "source_url": "https://eur-lex.…",       │
│          "source_date": "2025-11-15",             │
│          "activated_at": "2025-12-01T10:00:00Z"   │
│        }                                          │
│      }                                            │
│    }                                              │
│  }                                                │
└─────────────────────────────────────────────────┘

When a matrix schema is published:

The publish endpoint reads every country_risk_list reference and every other dataset key used by any factor in the matrix definition.
For each key, ReferenceDataResolver resolves the active version for the tenant (falling back to the system default if no tenant override exists).
For each resolved dataset, the resolver records the full provenance: dataset_id, version, tenant_id, resolution_tier (tenant_override or system_default), source, source_url, source_date, and activated_at. This metadata makes the snapshot fully replayable — an auditor can trace every input back to the exact dataset row, its authoritative source, and which resolution tier won.
All resolved data is merged into reference_data_snapshot with the per-dataset metadata under _snapshot_metadata.datasets.
The existing _validate_reference_lists method (ADR-008) still runs — but now it validates against the freshly resolved snapshot rather than a manually assembled one.

Tenant Override Model

Reference data follows a layered resolution strategy:

Tenant-specific active dataset  →  highest priority
System default active dataset   →  fallback
(missing)                       →  publish-time validation error

This means Atlas ships with sensible system defaults (official FATF/EU/CPI lists) that customers can override per-tenant. A payment institution can add extra countries to eu_high_risk_third_countries without affecting other tenants.

The override only affects future matrix publications. Existing published matrices retain their original snapshots.

API Endpoints

All under /reference-data prefix.

# Dataset type management (admin only)
GET    /reference-data/types                              List all dataset types
GET    /reference-data/types/{type_id}                    Get type with JSON Schema spec

# Dataset CRUD
GET    /reference-data/datasets                           List datasets (filterable by type, status, tenant)
POST   /reference-data/datasets                           Create new dataset (draft)
GET    /reference-data/datasets/{id}                      Get dataset by ID
PUT    /reference-data/datasets/{id}                      Update draft dataset
POST   /reference-data/datasets/{id}/activate             Activate dataset (archives previous active version)
POST   /reference-data/datasets/{id}/archive              Archive dataset

# Versioning
GET    /reference-data/datasets/{list_key}/versions       List all versions of a dataset
GET    /reference-data/datasets/{list_key}/active          Get currently active version
POST   /reference-data/datasets/{id}/new-version           Create new version from existing
GET    /reference-data/datasets/{list_key}/diff/{v1}/{v2}  Diff between two versions

# Resolution (used internally by matrix publish)
GET    /reference-data/resolve                            Resolve all active datasets for a tenant
GET    /reference-data/resolve/{list_key}                 Resolve single active dataset for a tenant

# Audit
GET    /reference-data/datasets/{id}/audit-log             View change history

# Bulk import (for initial setup and periodic updates)
POST   /reference-data/import                              Import datasets from CSV/JSON

ADR-008 and ADR-008a together introduce a navigation restructure. The existing sidebar section Configure (containing only Workflows) is replaced by Compliance Studio, grouping the three compliance authoring tools:

Compliance Studio
├── Workflows         /workflows/schemas    (existing, icon: flows)
├── Risk Matrices     /risk-matrices        (ADR-008, icon: heat-grid)
└── Risk Categories   /studio/risk-categories (ADR-008a, icon: th-derived)

These three tools form a pipeline that compliance officers configure end-to-end: Risk Categories defines the facts (which countries are high-risk, what PEP tiers exist), Risk Matrices defines the scoring logic (how facts map to risk scores), and Workflows orchestrates the process (when to investigate, score, review, and decide).

The navigation change is already implemented in AppShell.tsx. The Risk Matrices and Risk Categories routes are registered in App.tsx.

System-level configuration (Data Providers, general settings) remains under Settings (/settings) in the bottom nav — it is not compliance content.

Frontend: Compliance Studio → Risk Categories

A dedicated page at /studio/risk-categories under the Compliance Studio navigation section. The UI should support:

Country Risk Lists View

Table showing all country risk list types with their active version, last update date, and entry count.
Drill into a list to see all countries with the ability to add/remove entries.
Visual world map highlighting countries in each list (stretch goal).
Diff view when activating a new version: shows countries added and removed.

Risk Classification Views

PEP tiers: table with tier name, description, and configurable score value.
Industry risk: table with industry code, description, and risk tier assignment.
Product taxonomy: table with product/service, description, and risk score.
UBO thresholds: table with jurisdiction and ownership percentage.

Common UI Patterns

Each dataset shows provenance (source, source date, source URL).
"New Version" button copies the current active version into a draft for editing.
"Activate" button with confirmation dialog showing the diff against the current active version.
Audit log sidebar showing who changed what and when.

Migration Path

Phase 1: Schema + Seeding (Week 1)

Create V57__reference_data_registry.sql migration.
Seed system defaults from authoritative sources: current FATF grey/black lists, current EU high-risk third countries, latest CPI scores, default PEP tiers.
Seed UBO thresholds from the existing hardcoded values in src/ontology/ownership.py.

Phase 2: Backend API (Weeks 2–3)

Repository layer (ReferenceDataRepo).
FastAPI router with all CRUD, versioning, and resolution endpoints.
JSON Schema validation on create/update.
Integration with matrix publish flow: ReferenceDataResolver called during POST /risk-matrix/schemas/{id}/publish.

Phase 3: Frontend (Weeks 3–4)

Settings → Reference Data tab with list management UI.
Country list editor with add/remove.
Tier/classification editors.
Version diff view.

Phase 4: Wire Into Existing Features (Week 5)

Replace hardcoded UBO thresholds in src/ontology/ownership.py with registry lookups.
Replace hardcoded risk scoring weights in ontology_schema_v3.yaml with registry-backed values.
Update JurisdictionRiskFlags population to use registry data.

Interaction with ADR-008

This ADR is an upstream prerequisite for ADR-008. The matrix publish flow (POST /risk-matrix/schemas/{id}/publish) cannot populate reference_data_snapshot without a running registry and resolver. ADR-008's scorer, ontology mapper, and version manager do not depend on this ADR directly — they consume the already-frozen snapshot — so those components can be developed in parallel once the registry schema and resolver are in place.

No ADR-008 interfaces change. The reference_data_snapshot column, the _validate_reference_lists method, and the country_risk_list indicator type all work exactly as specified. The only change is where the snapshot data comes from — instead of manual assembly, it is resolved from the registry at publish time.

The _validate_reference_lists call during matrix publish now has a concrete upstream: if a matrix factor references fatf_grey_list but no active dataset with list_key = 'fatf_grey_list' exists for the tenant (or as a system default), the validation fails and the publish is rejected.

File Structure

src/reference_data/                    # NEW directory
├── __init__.py
├── repository.py                      # ReferenceDataRepo (CRUD + resolution)
├── resolver.py                        # ReferenceDataResolver (tenant-aware lookup)
├── validator.py                       # JSON Schema validation per dataset type
├── router.py                          # FastAPI endpoints
└── seed.py                            # System default seeding logic

config/reference_data/                 # NEW directory
├── fatf_grey_list.json                # Current FATF increased monitoring list
├── fatf_black_list.json               # Current FATF high-risk jurisdictions
├── eu_high_risk_third_countries.json  # Current EC delegated regulation list
├── cpi_below_40.json                  # Countries with CPI < 40
├── pep_tiers.json                     # Default PEP classification → score mapping
├── ubo_thresholds.json                # Default jurisdiction → threshold mapping
└── dataset_types.json                 # Type definitions with JSON Schema specs

migrations/
├── V57__reference_data_registry.sql   # Registry tables + audit log

Effort Estimate

Component	Estimate	Dependencies
Database migration (V57)	0.5 weeks	None
Repository + resolver	1 week	Migration
JSON Schema validation	0.5 weeks	Repository
API endpoints (router)	1 week	Repository
System default seeding	0.5 weeks	Migration
Matrix publish integration	0.5 weeks	Resolver, ADR-008
Frontend: Settings tab	1.5 weeks	API endpoints
Wire into ownership/ontology	0.5 weeks	Resolver
Testing	1 week	All components
Total	7 weeks

ADR-008a is an upstream dependency for ADR-008. The registry, resolver, and seed data (Phases 1–2) must land before the matrix publish flow can be implemented. The frontend (Phase 3) and legacy wiring (Phase 4) can proceed in parallel with ADR-008's scorer and mapper work since those components do not depend on the registry UI.

Rollback Strategy

Fully reversible:

Remove router registration from main.py.
Revert matrix publish endpoint to accept manually assembled reference_data_snapshot.
Drop tables: DROP TABLE reference_data_audit_log, reference_datasets, reference_data_types CASCADE.
Remove src/reference_data/ and config/reference_data/ directories.

ADR-008's reference_data_snapshot column continues to work — it just requires manual population instead of automatic resolution.

Context​

Decision​

Dataset Types​

1. Country Risk Lists​

2. Industry / Sector Risk Classifications​

3. PEP Classification Tiers​

4. Sanctions List Sources​

5. Product / Service Risk Taxonomy​

6. UBO Ownership Thresholds​

Database Schema​

How Reference Data Flows into the Matrix​

Tenant Override Model​

API Endpoints​

Navigation: Compliance Studio​

Frontend: Compliance Studio → Risk Categories​

Country Risk Lists View​

Risk Classification Views​

Common UI Patterns​

Migration Path​

Phase 1: Schema + Seeding (Week 1)​

Phase 2: Backend API (Weeks 2–3)​

Phase 3: Frontend (Weeks 3–4)​

Phase 4: Wire Into Existing Features (Week 5)​

Interaction with ADR-008​

File Structure​

Effort Estimate​

Rollback Strategy​