
TL;DR
- Nobody knows where most FCPA cases come from. The DOJ says ~20% originate from whistleblowers. The other 80% — SARs, media, foreign regulators, internal audits — is a black box.
- SAR data is open-set dirty data. Banks file defensively, generating two million SARs a year with a 4% law enforcement response rate. You can’t find FCPA violations by scanning the database — the noise looks identical to the signal.
- The solution is data triangulation. Every enforcement dataset is noisy for different reasons, so the noise is uncorrelated. The overlap between independently noisy datasets — SARs, civil litigation, whistleblower tips, journalism, SEC filings — is where real cases live.
- A proof of concept suggests 30–50% of FCPA corporate actions have SAR-matchable financial signatures. The SAR doesn’t have to originate the case — a tip or a news article gives prosecutors a reason to look, and the pre-existing SARs provide the evidence.
- Build this yourself in a weekend. Stanford’s FCPA data is free, synthetic SAR generators are open-source, and the matching logic fits in a few hundred lines of Python.
The Stanford FCPA Clearinghouse has cataloged every Foreign Corrupt Practices Act enforcement action since 1977 — hundreds of cases, with defendants, countries, bribery schemes, and sanctions, all structured and searchable. FinCEN receives over two million suspicious activity reports every year from banks flagging transactions that look like bribery, money laundering, or fraud. SARs are strictly confidential — unauthorized disclosure is a federal criminal offense.
How many FCPA enforcement actions started because a compliance officer at a bank filed a SAR? The DOJ doesn’t say. But the financial patterns that trigger SARs are the same patterns described in FCPA charging documents. If you build a synthetic dataset that mimics what SARs look like, and match it against what FCPA enforcement actions describe, you can estimate how often the banking system’s surveillance machinery feeds the enforcement pipeline.
The Problem: Open-Set Dirty Data
The DOJ has disclosed one number on FCPA case origination: according to the OECD Working Group Phase 4 Report, approximately 20% of FCPA matters came from whistleblowers. That leaves 80% originating from other channels — self-disclosure, media reports, foreign government referrals, SARs, proactive analytics — and the breakdown is opaque.
SARs don’t need to start an investigation to be the thing that makes it succeed. A whistleblower provides a company name and a country. A Reuters investigation names suspicious payments. What proves the case is the financial trail — and that trail may already be sitting in FinCEN’s database, filed months or years earlier by a compliance officer who flagged the same wire transfers. The leap from tip to case is short when the corroborating evidence is pre-assembled.
But finding that trail is a different kind of data problem than anything else in enforcement.
As we described in The Government’s Data Advantage, when the DOJ investigates PPP fraud, it works with a closed set: 11.8 million loan applications submitted by borrowers under penalty of federal fraud charges, cross-referenced against IRS payroll tax returns, HUD income records, and the SSA’s Death Master File. Every data point was created by the person under investigation, with legal consequences for inaccuracy. The government knows exactly who’s in the dataset. Anomalies are real. The signal is the data.
SAR data is open-set dirty data. There’s no defined population — banks are reporting on the entire flow of global financial transactions, an unbounded stream where “normal” is impossible to define. The bank filing the SAR isn’t reporting its own conduct — it’s reporting someone else’s transactions, under threat of penalty for under-reporting and with no consequence for over-reporting. The rational response is to file on everything that looks even mildly unusual. FinCEN acknowledged this in October 2025, telling institutions to stop defensive filing and focus on activity that provides value to law enforcement.
Two million SARs a year. A 4% median law enforcement response rate. FinCEN’s staff shrank by 10% from 2009 to 2019 while filing volumes kept climbing. A defensive filing about a legitimate wire transfer to Nigeria reads the same as a genuine filing about a bribe payment to Nigeria. You can stare at the database forever. The bottleneck isn’t evidence — FinCEN’s database is full of evidence. The bottleneck is knowing where to look.
The Solution: Data Triangulation
In research methodology, data triangulation means using multiple independent sources to cross-validate findings that no single source can confirm on its own. A UVA framework recently formalized this for “measurement of adversarial systems” where “ground truth is unobservable or strategically concealed” — which is exactly the FCPA enforcement problem.
SARs aren’t the only open-set dirty data. Every enforcement signal source has its own noise problem. Whistleblower tips include disgruntled employees and award-chasers. Civil litigation dockets are full of nuisance suits. Investigative journalism includes sensationalized reporting. No single dataset reliably separates misconduct from noise. But each is noisy for different reasons — banks file defensively, plaintiffs’ lawyers file opportunistically, journalists chase headlines — so the noise is uncorrelated. The signal, actual misconduct, is the thing that shows up across multiple sources.
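The independence argument can be made concrete with a little Bayes arithmetic. The sketch below is illustrative only: the base rate of real misconduct and the per-source hit and false-alarm rates are hypothetical numbers, not estimates drawn from any dataset, and the function assumes each source flags an entity independently of the others once you know whether misconduct actually occurred.

```python
from math import prod


def posterior_all_flag(base_rate, sources):
    """P(misconduct | every listed source flags the entity).

    sources: list of (hit_rate, false_alarm_rate) pairs.
    Assumes sources are conditionally independent given the truth.
    """
    p_flags_if_real = prod(hit for hit, _ in sources)
    p_flags_if_clean = prod(false_alarm for _, false_alarm in sources)
    numerator = base_rate * p_flags_if_real
    return numerator / (numerator + (1 - base_rate) * p_flags_if_clean)


# Hypothetical rates, chosen only to show the shape of the effect.
BASE_RATE = 0.001            # share of entities actually paying bribes
SAR = (0.60, 0.05)           # (hit rate, false-alarm rate)
LITIGATION = (0.30, 0.02)
NEWS = (0.40, 0.03)

one = posterior_all_flag(BASE_RATE, [SAR])
two = posterior_all_flag(BASE_RATE, [SAR, LITIGATION])
three = posterior_all_flag(BASE_RATE, [SAR, LITIGATION, NEWS])

print(f"SAR only:                {one:.1%}")
print(f"SAR + litigation:        {two:.1%}")
print(f"SAR + litigation + news: {three:.1%}")
```

With these made-up inputs, a lone SAR is almost always noise, while agreement across three sources makes misconduct more likely than not. The gain depends entirely on the independence assumption: where one source reacts to another (a bank filing a SAR because it read the news story), the false alarms are correlated and the overlap is worth less than the product suggests.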
Tips and news investigations give prosecutors something to aim at instead of shooting in the dark: a company name, a country, a time period. In a closed-set domain, you don’t need that starting point; the anomalies surface themselves. In open-set dirty data, the starting point is everything, because it turns an unsearchable ocean into a targeted query: “show me every SAR involving Company X in Country Y between 2018 and 2023.” The answer might be the entire financial trail of a bribery scheme, pre-assembled by compliance officers who had no idea they were building a prosecution file.
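As a rough illustration of how small that targeted query is once a tip supplies the parameters, here is a minimal sketch over synthetic records. The `SarRecord` fields, the sample rows, and the `targeted_query` helper are all hypothetical; real SAR data is confidential and its schema is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SarRecord:
    # Hypothetical fields for a synthetic SAR; not FinCEN's actual schema.
    filer: str
    subject: str
    country: str
    filed: date
    amount_usd: float


def targeted_query(records, subject, country, start, end):
    """Return SARs naming `subject` in `country` filed within [start, end],
    oldest first. Name matching is exact apart from letter case."""
    wanted = subject.casefold()
    hits = (
        r for r in records
        if r.subject.casefold() == wanted
        and r.country == country
        and start <= r.filed <= end
    )
    return sorted(hits, key=lambda r: r.filed)


SYNTHETIC_SARS = [
    SarRecord("Bank A", "Company X", "NG", date(2019, 3, 4), 250_000.0),
    SarRecord("Bank B", "Company X", "NG", date(2021, 7, 19), 410_000.0),
    SarRecord("Bank A", "Company X", "GB", date(2020, 1, 8), 90_000.0),
    SarRecord("Bank C", "Company Z", "NG", date(2020, 5, 2), 75_000.0),
    SarRecord("Bank B", "company x", "NG", date(2017, 11, 30), 120_000.0),
]

trail = targeted_query(
    SYNTHETIC_SARS, "Company X", "NG", date(2018, 1, 1), date(2023, 12, 31)
)
for record in trail:
    print(record.filed, record.filer, f"${record.amount_usd:,.0f}")
```

The hard part in practice is not the filter but entity resolution: subsidiaries, shell companies, and transliterated names will not match on a case-insensitive string comparison, so a real pipeline would need fuzzy or graph-based matching in place of the equality test.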
The triangulation works bidirectionally across every signal source:
Civil litigation ↔ SARs. Securities class actions allege concealed FCPA risk; qui tam suits surface the same conduct from a False Claims Act angle; employment litigation by terminated compliance officers describes transaction patterns a SAR would flag. Many are nuisance suits — but cross-reference against the SAR database and the suits naming entities with a cluster of defensive filings look different from those with no SAR activity. The litigation vets the SAR; the SAR vets the litigation. Stanford’s Securities Class Action Clearinghouse makes cross-referencing with the FCPA Clearinghouse straightforward.
Whistleblower tips ↔ SARs. The DOJ’s whistleblower program has received over 1,100 submissions since 2024 (~80% referred to prosecutors). Check those tips against FinCEN’s database: if multiple banks independently flagged the same entity, the tip is corroborated by financial data the tipster never saw.
Journalism ↔ SARs. ICIJ’s FinCEN Files analysis found banks filing SARs in response to news reports. Causality runs in both directions.
SEC disclosures ↔ SARs. Companies disclose internal FCPA investigations in 10-K risk factors and 8-K filings. Match disclosure timing against SAR patterns to see whether the banking system flagged payments before the company came forward.
Foreign referrals ↔ SARs. The U.S. has MLATs with over 65 countries. The International Anti-Corruption Prosecutorial Taskforce (UK/Switzerland/France, 2025) is designed to pick up cases the U.S. deprioritizes.
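The SEC disclosure comparison above reduces to a lead-time calculation: for each company, how many days before the public disclosure did the first SAR appear? A minimal sketch, assuming synthetic dates and a hypothetical `sar_lead_days` helper:

```python
from datetime import date


def sar_lead_days(sar_dates, disclosure_date):
    """Days from the earliest SAR to the public disclosure.

    Positive: a bank flagged the entity before the company disclosed.
    Negative: the first SAR came after the disclosure.
    None: no SARs on file for the entity.
    """
    if not sar_dates:
        return None
    return (disclosure_date - min(sar_dates)).days


# Synthetic example data; entity names and dates are made up.
disclosures = {
    "Company X": date(2022, 2, 15),
    "Company Y": date(2021, 6, 1),
    "Company Z": date(2023, 3, 10),
}
sars_by_entity = {
    "Company X": [date(2020, 11, 3), date(2021, 4, 22)],
    "Company Y": [date(2021, 9, 9)],
}

lead_times = {
    name: sar_lead_days(sars_by_entity.get(name, []), disclosed)
    for name, disclosed in disclosures.items()
}
for name, days in lead_times.items():
    print(name, days)
```

A negative lead time is worth keeping rather than discarding: it may indicate a bank filing in reaction to the disclosure itself, which is the correlated-noise case that weakens triangulation.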
Signal can also be extracted from SAR data itself: network analysis surfaces hot clusters of connected entities flagged by multiple banks (ICIJ used graph databases for this); quality scoring uses an LLM to separate detailed narratives from boilerplate; temporal clustering treats continuing activity reports as evidence of persistence; peer comparison flags outlier SAR volumes; and corridor analysis maps concentrations through high-risk routes against Transparency International’s CPI.
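Of the techniques listed above, the network step is the easiest to sketch without a graph database. The example below is an assumption-laden stand-in, not ICIJ's method: it treats each synthetic SAR as an edge between a sender and a receiver, groups entities into connected components with a small union-find, and keeps only the components flagged by at least `min_banks` distinct filers. The `hot_clusters` name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def hot_clusters(sars, min_banks=2):
    """Group entities linked by SARs and keep clusters seen by several banks.

    sars: iterable of (filing_bank, sender, receiver) tuples.
    Returns a list of (sorted_entities, sorted_banks) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    sars = list(sars)
    for _, sender, receiver in sars:
        union(sender, receiver)

    members = defaultdict(set)
    banks = defaultdict(set)
    for bank, sender, receiver in sars:
        root = find(sender)
        members[root].update((sender, receiver))
        banks[root].add(bank)

    return [
        (sorted(members[root]), sorted(banks[root]))
        for root in members
        if len(banks[root]) >= min_banks
    ]


# Synthetic edges: three banks each see one hop of the same chain.
synthetic = [
    ("Bank A", "SubCo", "ShellCo"),
    ("Bank B", "ShellCo", "Agent Ltd"),
    ("Bank C", "Agent Ltd", "Official LLC"),
    ("Bank A", "Retailer", "Supplier"),
]
clusters = hot_clusters(synthetic)
print(clusters)
```

No single bank in the synthetic data sees more than one hop, which is the point of the exercise: the chain only becomes visible when filings from independent institutions are joined.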
No noisy dataset is reliable alone, but the overlap between independently noisy datasets is where the signal lives.
The SAR is the bridge between “we heard Company X might be bribing officials in Country Y” and “here are seventeen wire transfers from Company X’s subsidiary to a shell company in Country Y, totaling $4.2 million over three years, routed through a correspondent bank in London.” That’s not an origination channel. It’s an evidence accelerator.
The Proof of Concept
FCPA data. The Stanford FCPA Clearinghouse (FCPAC) provides enforcement actions with defendants, countries, industries, time periods, and payment descriptions — updated through July 2025, available to academic researchers. The key fields overlap with what a SAR would contain: entity, country, time period, payment mechanism.
Synthetic SARs. Peer-reviewed synthetic AML datasets — IBM’s AMLSim (NeurIPS 2023), SynthAML (16M transactions), Tide (University of Amsterdam) — provide the base. Extend with FCPA-specific bribery typologies from the FCPA Resource Guide, calibrate against Stanford’s country/industry distributions, and generate narratives with an LLM following FinCEN’s five-W format.
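For a sense of scale, a toy generator along these lines fits in a few dozen lines of standard-library Python. Everything in the sketch is a placeholder: the country weights stand in for a distribution you would calibrate against Stanford's data, the typology labels are illustrative rather than quoted from the FCPA Resource Guide, the 2% bribe rate is arbitrary, and the LLM narrative step is omitted. It does not use or imitate the APIs of the published generators.

```python
import random
from datetime import date, timedelta

# Hypothetical calibration targets; replace with empirical distributions.
COUNTRY_WEIGHTS = {"CN": 0.30, "BR": 0.20, "IN": 0.20, "NG": 0.15, "MX": 0.15}
TYPOLOGIES = [
    "third-party agent commission",
    "shell company invoice",
    "consulting fee with no deliverable",
    "donation to official-linked charity",
]
BRIBE_RATE = 0.02  # arbitrary share of SARs that reflect real bribery


def generate_sars(n, seed=0):
    """Return n synthetic SAR-like records; deterministic for a given seed."""
    rng = random.Random(seed)
    countries = list(COUNTRY_WEIGHTS)
    weights = list(COUNTRY_WEIGHTS.values())
    first_day = date(2015, 1, 1)
    records = []
    for i in range(n):
        records.append({
            "sar_id": f"SYN-{i:06d}",
            "country": rng.choices(countries, weights=weights, k=1)[0],
            "typology": rng.choice(TYPOLOGIES),
            "filed": first_day + timedelta(days=rng.randrange(3650)),
            # Log-normal amounts: many small wires, a long tail of large ones.
            "amount_usd": round(rng.lognormvariate(11.5, 1.0), 2),
            # Ground-truth label that a real analyst would never see.
            "is_bribe": rng.random() < BRIBE_RATE,
        })
    return records


sample = generate_sars(1000, seed=42)
print(sample[0])
print(sum(r["is_bribe"] for r in sample), "labelled bribes out of", len(sample))
```

The hidden `is_bribe` label is what makes synthetic data useful here: it lets you measure how often a matching rule recovers true cases, which is impossible with real SARs.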
Matching. For each enforcement action: does it describe a transaction pattern that would trigger a SAR? Score on country, time period, industry, payment typology, and semantic similarity of narratives. A conservative estimate: 30–50% of corporate FCPA enforcement actions match standard SAR-triggering typologies. Exceptions: bribery through non-financial channels or payments too embedded in legitimate flows to surface.
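One way to sketch the scoring step is a weighted sum over the five features named above. The weights, field names, and sample records below are hypothetical, and token-overlap (Jaccard) similarity is swapped in as a crude stand-in for semantic similarity, which in practice would more likely use text embeddings.

```python
def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for semantic similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def years_overlap(a, b):
    """True if two inclusive (start_year, end_year) ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical weights that sum to 1; tune against labelled synthetic data.
WEIGHTS = {"country": 0.25, "period": 0.20, "industry": 0.15,
           "typology": 0.20, "narrative": 0.20}


def match_score(action, sar):
    """Score in [0, 1] for how well a SAR fits an enforcement action."""
    features = {
        "country": float(action["country"] == sar["country"]),
        "period": float(years_overlap(action["years"], sar["years"])),
        "industry": float(action["industry"] == sar["industry"]),
        "typology": float(action["typology"] == sar["typology"]),
        "narrative": jaccard(action["narrative"], sar["narrative"]),
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())


action = {"country": "NG", "years": (2016, 2019), "industry": "oil and gas",
          "typology": "third-party agent commission",
          "narrative": "payments to a local agent routed to a government official"}
near = {"country": "NG", "years": (2018, 2018), "industry": "oil and gas",
        "typology": "third-party agent commission",
        "narrative": "wire payments to a local agent with no clear business purpose"}
far = {"country": "DE", "years": (2010, 2011), "industry": "retail",
       "typology": "structuring",
       "narrative": "repeated cash deposits just under reporting threshold"}

print(round(match_score(action, near), 3), round(match_score(action, far), 3))
```

An estimate like the 30–50% figure would then be the share of enforcement actions whose best-scoring synthetic SAR clears a chosen threshold, so the result is only as defensible as the weights and the threshold behind it.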
Triangulation layer. For each SAR-matched action, search for a contemporaneous public signal — news, SEC filings, civil litigation, foreign regulatory actions. Convergence of SAR match + independent signal = high-confidence match. Validate against the 20% whistleblower baseline, charging documents that mention bank referrals, and the FinCEN Files transaction data for cases where leaked SARs overlap with known FCPA matters.
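The convergence rule can be sketched as a windowed join between SAR-matched actions and public signals. The one-year window, the source labels, and the `triangulate` helper are assumptions made for illustration; what counts as "contemporaneous" is a judgment call the post leaves open.

```python
from datetime import date, timedelta


def triangulate(sar_matches, public_signals, window_days=365):
    """Attach independent public signals to each SAR-matched entity.

    sar_matches: {entity: date of the matched SAR}
    public_signals: iterable of (entity, source, date) tuples
    An entity is "high" confidence if at least one public signal about it
    falls within window_days of the SAR date, otherwise "sar-only".
    """
    window = timedelta(days=window_days)
    public_signals = list(public_signals)
    results = {}
    for entity, sar_date in sar_matches.items():
        sources = sorted({
            source
            for sig_entity, source, sig_date in public_signals
            if sig_entity == entity and abs(sig_date - sar_date) <= window
        })
        results[entity] = {
            "sources": sources,
            "confidence": "high" if sources else "sar-only",
        }
    return results


# Synthetic inputs; entities, sources, and dates are made up.
sar_matches = {
    "Company X": date(2020, 6, 1),
    "Company Y": date(2019, 3, 15),
    "Company Z": date(2021, 1, 10),
}
public_signals = [
    ("Company X", "news", date(2020, 9, 12)),
    ("Company X", "sec_filing", date(2021, 2, 1)),
    ("Company Y", "civil_litigation", date(2022, 8, 30)),
    ("Company Q", "news", date(2020, 6, 5)),
]

report = triangulate(sar_matches, public_signals)
for entity, result in report.items():
    print(entity, result["confidence"], result["sources"])
```

Counting distinct source types rather than raw signal volume is a deliberate choice in this sketch: ten news articles about the same entity are one noisy source repeated, not ten independent confirmations.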
Why it matters now. President Trump paused FCPA enforcement in February 2025. New guidelines restarted it under narrower priorities in June. The DOJ closed roughly half its open investigations. But enforcement priorities are political; the pipeline is structural. Banks don’t stop filing SARs when the White House changes. The defensive SARs your correspondent bank filed are sitting in FinCEN’s database. They don’t expire.
What Comes Next
This is the second post in Data Analytics and Fraud Enforcement — a series using AI and open datasets to illuminate enforcement patterns. The same methodology — and the same open-set dirty data framework — applies wherever enforcement depends on matching noisy government records against other noisy external signals.
Further Reading
- Stanford FCPA Clearinghouse. The definitive open dataset of FCPA enforcement actions.
- FinCEN Suspicious Activity Reports. SAR requirements and filing procedures.
- FCPA Resource Guide (DOJ/SEC). Joint guidance on FCPA enforcement and common bribery typologies.
- FinCEN Files Investigation (ICIJ). The 2020 investigation that revealed how SARs work in practice.
- SynthAML: A Synthetic AML Benchmark Dataset. Peer-reviewed synthetic financial data for AML research.
- Realistic Synthetic Financial Transactions (IBM/NeurIPS 2023). Agent-based synthetic transaction generator.
- Tide: A Customisable Dataset Generator for AML Research. Open-source generator from the University of Amsterdam.
- Measurement for Opaque Systems: Multi-Source Triangulation (UVA). Academic framework for finding signal in adversarial, data-sparse environments.
- FCPA Enforcement 2025 Year in Review (Paul Weiss). The enforcement pause and new guidelines.
- Gibson Dunn 2024 Year-End FCPA Update. Enforcement trends and case summaries.
Read next in this series: The Data Miner’s Dilemma.
This post is part of the Data Analytics and Fraud Enforcement series on LegalRealist AI. It is intended for informational and educational purposes only and does not constitute legal advice. The synthetic data methodology described here produces directional estimates, not verified conclusions. Actual SAR filings are confidential under the Bank Secrecy Act; nothing in this post is based on or reveals the contents of any actual SAR. FCPA enforcement data reflects publicly available information as of the publication date. Laws governing anti-corruption enforcement vary by jurisdiction and are subject to rapid policy change.



