ADR-008a: Reference Data Registry
Status: Proposed Date: 2026-03-28 Author: Atlas Architecture Prerequisite for: ADR-008 (EBA Risk Matrix Engine)
Context
ADR-008 introduces a versioned EBA risk matrix whose reference_data_snapshot column freezes external risk lists into each published matrix version. However, ADR-008 does not specify where these lists originate, who maintains them, or how they are versioned independently of the matrix itself.
Today, reference data is scattered across the codebase:
JurisdictionRiskFlagsinsrc/ontology/schemas/entities.pydefines the data structure for country risk but contains no actual country data.- UBO ownership thresholds in
src/ontology/ownership.pyare hardcoded per jurisdiction with no ability for customers to override. - Risk scoring weights in
config/ontology_schema_v3.yamlare static YAML with no versioning or tenant customisation. - The
DataProviderCountrymodel in the data providers layer tracks provider coverage by country but does not classify countries by risk.
Compliance teams need to maintain these lists on their own cadence — FATF updates its grey list roughly three times per year, the EU Commission updates its high-risk third countries list via delegated regulations, and CPI scores are published annually. A matrix that was published six months ago should not silently use stale country lists; equally, updating a country list should not retroactively change existing evaluations.
Decision
Introduce a Reference Data Registry — a tenant-scoped, versioned store for the configurable inputs that feed into the EBA risk matrix and other compliance features. Each dataset type is independently versioned, auditable, and freezable into a matrix snapshot at publish time.
Dataset Types
1. Country Risk Lists
These are the primary configurable inputs for the country_risk_list indicator type in the matrix scorer.
| List | Source | Update Frequency | Purpose |
|---|---|---|---|
eu_high_risk_third_countries | EC Delegated Regulation (AMLD) | ~1–2x per year | Countries with strategic AML/CFT deficiencies |
fatf_grey_list | FATF "Increased Monitoring" | ~3x per year (after each plenary) | Jurisdictions under increased monitoring |
fatf_black_list | FATF "High-Risk Jurisdictions" | ~3x per year | Call for action jurisdictions |
eu_tax_blacklist | EU Council | ~2x per year | Non-cooperative jurisdictions for tax |
cpi_below_40 | Transparency International CPI | Annual | Countries with CPI score below 40 |
secrecy_jurisdictions | Tax Justice Network FSI | ~Biennial | High financial secrecy jurisdictions |
Each list contains ISO 3166-1 alpha-2 country codes. Customers can extend any list with additional countries or create entirely custom lists (e.g., internal_restricted_countries).
2. Industry / Sector Risk Classifications
Configurable mappings from industry codes to risk tiers. Different customers have different risk appetites — a crypto exchange's high-risk sectors differ from a trade finance bank's.
| Dataset | Content | Used By |
|---|---|---|
high_risk_sectors | Industry code → risk tier mapping | business_profile factor |
cash_intensive_industries | NACE/SIC codes of cash-intensive sectors | business_profile factor |
regulated_activities | Activities requiring specific licenses | regulatory_status factor |
3. PEP Classification Tiers
Configurable mapping of PEP levels to risk scores. The EBA guidelines give latitude on how granularly PEP exposure is scored.
| Dataset | Content | Used By |
|---|---|---|
pep_tiers | PEP classification → score mapping | pep_exposure factor |
Default tiers: head_of_state (30), senior_government (30), member_of_parliament (25), senior_military (25), regional_political (20), senior_judicial (20), family_member (15), close_associate (15).
4. Sanctions List Sources
Configurable set of sanctions lists to check against and match-type weighting.
| Dataset | Content | Used By |
|---|---|---|
active_sanctions_lists | Which lists to screen (OFAC, EU, UN, UK, etc.) | sanctions_exposure factor |
sanctions_match_weights | Match type → score mapping | sanctions_exposure factor |
5. Product / Service Risk Taxonomy
Customer-defined mapping of their own product catalogue to risk scores. This populates the product_service dimension which ADR-008 currently leaves as input_type: "manual_or_workflow" with default_score: 0.
| Dataset | Content | Used By |
|---|---|---|
product_risk_taxonomy | Product/service code → risk score | product_complexity factor |
6. UBO Ownership Thresholds
Jurisdiction-specific thresholds for beneficial ownership determination. Currently hardcoded in src/ontology/ownership.py.
| Dataset | Content | Used By |
|---|---|---|
ubo_thresholds | Country → ownership % threshold | Ownership analysis, ownership_complexity factor |
Default: 25% (FATF baseline). EU: 25% (AMLD). US: 25% (FinCEN). Customers may need lower thresholds for enhanced due diligence scenarios.
Database Schema
-- V57__reference_data_registry.sql
-- ============================================================
-- Reference data types: defines what kinds of lists exist
-- ============================================================
CREATE TABLE reference_data_types (
id VARCHAR(100) PRIMARY KEY, -- e.g. 'country_risk_list'
name VARCHAR(255) NOT NULL,
description TEXT,
schema_spec JSONB NOT NULL, -- JSON Schema for validating entries
-- e.g. {"type": "array", "items": {"type": "string", "pattern": "^[A-Z]{2}$"}}
-- for country lists: array of ISO 3166-1 alpha-2 codes
-- for tier mappings: object with string keys and numeric values
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- ============================================================
-- Reference datasets: versioned, tenant-scoped data
-- ============================================================
CREATE TABLE reference_datasets (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
type_id VARCHAR(100) NOT NULL REFERENCES reference_data_types(id),
list_key VARCHAR(150) NOT NULL, -- e.g. 'fatf_grey_list', 'pep_tiers'
version INTEGER NOT NULL DEFAULT 1,
name VARCHAR(255) NOT NULL,
description TEXT,
status VARCHAR(20) NOT NULL DEFAULT 'draft',
-- 'draft', 'active', 'archived'
-- The actual data
data JSONB NOT NULL,
-- Provenance: where this data came from
source VARCHAR(255), -- 'FATF Plenary Feb 2026', 'Manual entry'
source_url TEXT, -- Link to authoritative source
source_date DATE, -- Date of the source publication
-- Tenant scoping
tenant_id UUID, -- NULL = system default (shared)
-- Audit
created_by VARCHAR(100),
activated_by VARCHAR(100),
activated_at TIMESTAMPTZ,
archived_by VARCHAR(100),
archived_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
-- Sentinel UUID used in place of NULL tenant_id for unique constraint purposes.
-- PostgreSQL treats NULLs as distinct in UNIQUE constraints, so a raw
-- UNIQUE(list_key, version, tenant_id) would allow duplicate system-default
-- rows. COALESCE to a fixed sentinel closes that hole.
UNIQUE(list_key, version, tenant_id) -- covers tenant-scoped rows (tenant_id NOT NULL)
);
-- Version uniqueness for system defaults (tenant_id IS NULL):
-- COALESCE sentinel ensures NULLs collide correctly.
CREATE UNIQUE INDEX uq_rd_version_system_default
ON reference_datasets(list_key, version)
WHERE tenant_id IS NULL;
-- Only one active version per list_key per tenant (including system default)
CREATE UNIQUE INDEX uq_rd_single_active
ON reference_datasets(list_key, COALESCE(tenant_id, '00000000-0000-0000-0000-000000000000'::UUID))
WHERE status = 'active';
CREATE INDEX idx_rd_type ON reference_datasets(type_id);
CREATE INDEX idx_rd_list_key ON reference_datasets(list_key);
CREATE INDEX idx_rd_tenant ON reference_datasets(tenant_id);
CREATE INDEX idx_rd_status ON reference_datasets(status, list_key);
-- Prevent modification of active datasets.
-- Two rules:
-- 1. The only allowed status transition FROM 'active' is to 'archived'.
-- 2. When OLD.status = 'active', NO data/provenance/key columns may change,
-- regardless of the target status. This closes the loophole where a
-- caller archives a dataset while silently mutating its contents.
CREATE OR REPLACE FUNCTION prevent_active_reference_data_update()
RETURNS TRIGGER AS $$
BEGIN
IF OLD.status = 'active' THEN
-- Rule 1: only allow transition to 'archived'
IF NEW.status NOT IN ('active', 'archived') THEN
RAISE EXCEPTION 'Active reference datasets may only transition to archived.';
END IF;
-- Rule 2: data and provenance columns are always immutable once active
IF NEW.data IS DISTINCT FROM OLD.data
OR NEW.list_key IS DISTINCT FROM OLD.list_key
OR NEW.version IS DISTINCT FROM OLD.version
OR NEW.type_id IS DISTINCT FROM OLD.type_id
OR NEW.tenant_id IS DISTINCT FROM OLD.tenant_id
OR NEW.source IS DISTINCT FROM OLD.source
OR NEW.source_url IS DISTINCT FROM OLD.source_url
OR NEW.source_date IS DISTINCT FROM OLD.source_date
OR NEW.name IS DISTINCT FROM OLD.name
OR NEW.created_by IS DISTINCT FROM OLD.created_by
OR NEW.activated_by IS DISTINCT FROM OLD.activated_by
OR NEW.activated_at IS DISTINCT FROM OLD.activated_at THEN
RAISE EXCEPTION 'Cannot modify an active reference dataset. Create a new version instead.';
END IF;
-- Rule 3: archived_by/archived_at may only be set during the
-- active→archived transition (not while staying active)
IF NEW.status = 'active' THEN
IF NEW.archived_by IS DISTINCT FROM OLD.archived_by
OR NEW.archived_at IS DISTINCT FROM OLD.archived_at THEN
RAISE EXCEPTION 'Cannot set archival metadata on an active dataset.';
END IF;
END IF;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_prevent_active_reference_data_update
BEFORE UPDATE ON reference_datasets
FOR EACH ROW
EXECUTE FUNCTION prevent_active_reference_data_update();
-- ============================================================
-- Audit log: tracks every change to reference data
-- ============================================================
CREATE TABLE reference_data_audit_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
dataset_id UUID NOT NULL REFERENCES reference_datasets(id) ON DELETE RESTRICT,
action VARCHAR(50) NOT NULL, -- 'created', 'activated', 'archived', 'used_in_snapshot'
actor VARCHAR(100),
details JSONB, -- Action-specific context
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_rdal_dataset ON reference_data_audit_log(dataset_id, created_at DESC);
How Reference Data Flows into the Matrix
┌─────────────────────────────────────────────────┐
│ Reference Data Registry │
│ │
│ fatf_grey_list v3 (active) │
│ eu_high_risk_third_countries v2 (active) │
│ cpi_below_40 v1 (active) │
│ pep_tiers v1 (active) │
│ product_risk_taxonomy v2 (active) │
│ ubo_thresholds v1 (active) │
└───────────────────────┬─────────────────────────┘
│
│ Matrix publish resolves
│ all referenced lists
▼
┌─────────────────────────────────────────────────┐
│ risk_matrix_schemas.reference_data_snapshot │
│ │
│ { │
│ "fatf_grey_list": ["MM", "SY", "YE", ...], │
│ "eu_high_risk_third_countries": ["AF", ...], │
│ "cpi_below_40": ["SO", "SS", "SY", ...], │
│ "pep_tiers": {"head_of_state": 30, ...}, │
│ "product_risk_taxonomy": {"crypto": 20, ...}, │
│ "_snapshot_metadata": { │
│ "resolved_at": "2026-03-28T14:00:00Z", │
│ "resolver_tenant_id": "abc-123-...", │
│ "datasets": { │
│ "fatf_grey_list": { │
│ "dataset_id": "d1e2f3...", │
│ "version": 3, │
│ "tenant_id": null, │
│ "resolution_tier": "system_default", │
│ "source": "FATF Plenary Feb 2026", │
│ "source_url": "https://fatf-gafi.org/…",│
│ "source_date": "2026-02-21", │
│ "activated_at": "2026-02-25T09:00:00Z" │
│ }, │
│ "eu_high_risk_third_countries": { │
│ "dataset_id": "a4b5c6...", │
│ "version": 2, │
│ "tenant_id": "abc-123-...", │
│ "resolution_tier": "tenant_override", │
│ "source": "EC Delegated Reg 2025/…", │
│ "source_url": "https://eur-lex.…", │
│ "source_date": "2025-11-15", │
│ "activated_at": "2025-12-01T10:00:00Z" │
│ } │
│ } │
│ } │
│ } │
└─────────────────────────────────────────────────┘
When a matrix schema is published:
- The publish endpoint reads every
country_risk_listreference and every other dataset key used by any factor in the matrix definition. - For each key,
ReferenceDataResolverresolves theactiveversion for the tenant (falling back to the system default if no tenant override exists). - For each resolved dataset, the resolver records the full provenance:
dataset_id,version,tenant_id,resolution_tier(tenant_overrideorsystem_default),source,source_url,source_date, andactivated_at. This metadata makes the snapshot fully replayable — an auditor can trace every input back to the exact dataset row, its authoritative source, and which resolution tier won. - All resolved data is merged into
reference_data_snapshotwith the per-dataset metadata under_snapshot_metadata.datasets. - The existing
_validate_reference_listsmethod (ADR-008) still runs — but now it validates against the freshly resolved snapshot rather than a manually assembled one.
Tenant Override Model
Reference data follows a layered resolution strategy:
1. Tenant-specific active dataset → highest priority
2. System default active dataset → fallback
3. (missing) → publish-time validation error
This means Atlas ships with sensible system defaults (official FATF/EU/CPI lists) that customers can override per-tenant. A payment institution can add extra countries to eu_high_risk_third_countries without affecting other tenants.
The override only affects future matrix publications. Existing published matrices retain their original snapshots.
API Endpoints
All under /reference-data prefix.
# Dataset type management (admin only)
GET /reference-data/types List all dataset types
GET /reference-data/types/{type_id} Get type with JSON Schema spec
# Dataset CRUD
GET /reference-data/datasets List datasets (filterable by type, status, tenant)
POST /reference-data/datasets Create new dataset (draft)
GET /reference-data/datasets/{id} Get dataset by ID
PUT /reference-data/datasets/{id} Update draft dataset
POST /reference-data/datasets/{id}/activate Activate dataset (archives previous active version)
POST /reference-data/datasets/{id}/archive Archive dataset
# Versioning
GET /reference-data/datasets/{list_key}/versions List all versions of a dataset
GET /reference-data/datasets/{list_key}/active Get currently active version
POST /reference-data/datasets/{id}/new-version Create new version from existing
GET /reference-data/datasets/{list_key}/diff/{v1}/{v2} Diff between two versions
# Resolution (used internally by matrix publish)
GET /reference-data/resolve Resolve all active datasets for a tenant
GET /reference-data/resolve/{list_key} Resolve single active dataset for a tenant
# Audit
GET /reference-data/datasets/{id}/audit-log View change history
# Bulk import (for initial setup and periodic updates)
POST /reference-data/import Import datasets from CSV/JSON
Navigation: Compliance Studio
ADR-008 and ADR-008a together introduce a navigation restructure. The existing sidebar section Configure (containing only Workflows) is replaced by Compliance Studio, grouping the three compliance authoring tools:
Compliance Studio
├── Workflows /workflows/schemas (existing, icon: flows)
├── Risk Matrices /risk-matrices (ADR-008, icon: heat-grid)
└── Risk Categories /studio/risk-categories (ADR-008a, icon: th-derived)
These three tools form a pipeline that compliance officers configure end-to-end: Risk Categories defines the facts (which countries are high-risk, what PEP tiers exist), Risk Matrices defines the scoring logic (how facts map to risk scores), and Workflows orchestrates the process (when to investigate, score, review, and decide).
The navigation change is already implemented in AppShell.tsx. The Risk Matrices and Risk Categories routes are registered in App.tsx.
System-level configuration (Data Providers, general settings) remains under Settings (/settings) in the bottom nav — it is not compliance content.
Frontend: Compliance Studio → Risk Categories
A dedicated page at /studio/risk-categories under the Compliance Studio navigation section. The UI should support:
Country Risk Lists View
- Table showing all country risk list types with their active version, last update date, and entry count.
- Drill into a list to see all countries with the ability to add/remove entries.
- Visual world map highlighting countries in each list (stretch goal).
- Diff view when activating a new version: shows countries added and removed.
Risk Classification Views
- PEP tiers: table with tier name, description, and configurable score value.
- Industry risk: table with industry code, description, and risk tier assignment.
- Product taxonomy: table with product/service, description, and risk score.
- UBO thresholds: table with jurisdiction and ownership percentage.
Common UI Patterns
- Each dataset shows provenance (source, source date, source URL).
- "New Version" button copies the current active version into a draft for editing.
- "Activate" button with confirmation dialog showing the diff against the current active version.
- Audit log sidebar showing who changed what and when.
Migration Path
Phase 1: Schema + Seeding (Week 1)
- Create
V57__reference_data_registry.sqlmigration. - Seed system defaults from authoritative sources: current FATF grey/black lists, current EU high-risk third countries, latest CPI scores, default PEP tiers.
- Seed UBO thresholds from the existing hardcoded values in
src/ontology/ownership.py.
Phase 2: Backend API (Weeks 2–3)
- Repository layer (
ReferenceDataRepo). - FastAPI router with all CRUD, versioning, and resolution endpoints.
- JSON Schema validation on create/update.
- Integration with matrix publish flow:
ReferenceDataResolvercalled duringPOST /risk-matrix/schemas/{id}/publish.
Phase 3: Frontend (Weeks 3–4)
- Settings → Reference Data tab with list management UI.
- Country list editor with add/remove.
- Tier/classification editors.
- Version diff view.
Phase 4: Wire Into Existing Features (Week 5)
- Replace hardcoded UBO thresholds in
src/ontology/ownership.pywith registry lookups. - Replace hardcoded risk scoring weights in
ontology_schema_v3.yamlwith registry-backed values. - Update
JurisdictionRiskFlagspopulation to use registry data.
Interaction with ADR-008
This ADR is an upstream prerequisite for ADR-008. The matrix publish flow (POST /risk-matrix/schemas/{id}/publish) cannot populate reference_data_snapshot without a running registry and resolver. ADR-008's scorer, ontology mapper, and version manager do not depend on this ADR directly — they consume the already-frozen snapshot — so those components can be developed in parallel once the registry schema and resolver are in place.
No ADR-008 interfaces change. The reference_data_snapshot column, the _validate_reference_lists method, and the country_risk_list indicator type all work exactly as specified. The only change is where the snapshot data comes from — instead of manual assembly, it is resolved from the registry at publish time.
The _validate_reference_lists call during matrix publish now has a concrete upstream: if a matrix factor references fatf_grey_list but no active dataset with list_key = 'fatf_grey_list' exists for the tenant (or as a system default), the validation fails and the publish is rejected.
File Structure
src/reference_data/ # NEW directory
├── __init__.py
├── repository.py # ReferenceDataRepo (CRUD + resolution)
├── resolver.py # ReferenceDataResolver (tenant-aware lookup)
├── validator.py # JSON Schema validation per dataset type
├── router.py # FastAPI endpoints
└── seed.py # System default seeding logic
config/reference_data/ # NEW directory
├── fatf_grey_list.json # Current FATF increased monitoring list
├── fatf_black_list.json # Current FATF high-risk jurisdictions
├── eu_high_risk_third_countries.json # Current EC delegated regulation list
├── cpi_below_40.json # Countries with CPI < 40
├── pep_tiers.json # Default PEP classification → score mapping
├── ubo_thresholds.json # Default jurisdiction → threshold mapping
└── dataset_types.json # Type definitions with JSON Schema specs
migrations/
├── V57__reference_data_registry.sql # Registry tables + audit log
Effort Estimate
| Component | Estimate | Dependencies |
|---|---|---|
| Database migration (V57) | 0.5 weeks | None |
| Repository + resolver | 1 week | Migration |
| JSON Schema validation | 0.5 weeks | Repository |
| API endpoints (router) | 1 week | Repository |
| System default seeding | 0.5 weeks | Migration |
| Matrix publish integration | 0.5 weeks | Resolver, ADR-008 |
| Frontend: Settings tab | 1.5 weeks | API endpoints |
| Wire into ownership/ontology | 0.5 weeks | Resolver |
| Testing | 1 week | All components |
| Total | 7 weeks |
ADR-008a is an upstream dependency for ADR-008. The registry, resolver, and seed data (Phases 1–2) must land before the matrix publish flow can be implemented. The frontend (Phase 3) and legacy wiring (Phase 4) can proceed in parallel with ADR-008's scorer and mapper work since those components do not depend on the registry UI.
Rollback Strategy
Fully reversible:
- Remove router registration from
main.py. - Revert matrix publish endpoint to accept manually assembled
reference_data_snapshot. - Drop tables:
DROP TABLE reference_data_audit_log, reference_datasets, reference_data_types CASCADE. - Remove
src/reference_data/andconfig/reference_data/directories.
ADR-008's reference_data_snapshot column continues to work — it just requires manual population instead of automatic resolution.