
Following the Money: Which FCPA Cases Look SAR-Matchable?

Author
LegalRealist AI
Data Analytics and Fraud - This article is part of a series.
Part 2: This Article

TL;DR

  • Nobody knows where most FCPA cases come from. DOJ has reportedly told the OECD that roughly 20% of FCPA matters originate from whistleblowers. The rest — self-disclosures, media, foreign regulators, internal audits, SARs, proactive analytics — is mostly opaque from the outside.
  • SAR data is open-set dirty data. Banks file millions of SARs, often defensively. You cannot infer FCPA violations by scanning the database alone because ordinary, defensive, and serious filings can look similar until another signal narrows the search.
  • The solution is data triangulation. Each enforcement signal is noisy in its own way. The overlap between SAR-like financial patterns and outside signals — litigation, whistleblower tips, journalism, SEC filings, foreign referrals — is more useful than any one dataset alone.
  • A proof of concept can identify SAR-matchable FCPA patterns, not SAR origination. Synthetic SAR matching cannot prove that a real SAR was filed, searched, or used by DOJ or SEC. It can test whether public FCPA facts resemble the patterns that would plausibly generate SARs.
  • The practical claim is evidence acceleration. A tip, article, or disclosure gives prosecutors a reason to look; SARs can point investigators toward underlying bank records, counterparties, dates, and payment corridors.
Corrections & Updates
  • June 24, 2026: Reframed this post from estimating SAR origination to identifying SAR-matchable financial signatures. Synthetic matching cannot prove that a real SAR existed or caused an FCPA case; it can only test whether public case facts resemble SAR-triggering patterns. The post was also updated to clarify that SARs are investigative leads pointing to underlying bank evidence, not courtroom evidence by themselves.

The Stanford FCPA Clearinghouse has cataloged every Foreign Corrupt Practices Act enforcement action since 1977 — hundreds of cases, with defendants, countries, bribery schemes, and sanctions, all structured and searchable. FinCEN receives millions of suspicious activity reports from banks flagging transactions that look like bribery, money laundering, fraud, or other reportable activity. SARs are strictly confidential — unauthorized disclosure is a federal criminal offense.

How many FCPA enforcement actions describe facts that would plausibly generate a SAR? That is a narrower question than case origination, and it is the one public data can support. The financial patterns that trigger SARs are often the same patterns described in FCPA charging documents: unusual consulting payments, shell companies, high-risk corridors, round-dollar wires, intermediary accounts, and payments with weak business purpose. If you build a synthetic dataset that mimics SAR-triggering patterns and match it against what FCPA enforcement actions describe, you can estimate how often public FCPA facts look SAR-matchable. You cannot prove that a SAR started the case.

The Problem: Open-Set Dirty Data

The DOJ has disclosed one number on FCPA case origination: according to the National Whistleblower Center’s summary of the OECD Working Group on Bribery’s Phase 4 Report, approximately 20% of FCPA matters came from whistleblowers. That leaves most matters originating from other channels (self-disclosure, media reports, foreign government referrals, SARs, proactive analytics), and the breakdown is opaque.

SARs do not need to start an investigation to matter. A whistleblower provides a company name and a country. A Reuters investigation names suspicious payments. The SAR may tell investigators where to look next: which bank saw the activity, which counterparties appeared, which dates and corridors matter, and which records to subpoena. What proves the case is usually the underlying bank evidence — wires, account records, KYC files, emails, invoices, beneficial-ownership documents — not the SAR itself.

But finding that trail is a different kind of data problem than anything else in enforcement.

As we described in The Government’s Data Advantage, pandemic-relief fraud was closer to a closed-set problem: millions of PPP and EIDL applications were submitted into federal systems, under penalty of federal fraud charges, and could be cross-referenced against tax, payroll, identity, and benefits records. The data was not perfectly clean, and anomalies still required investigation. But the government at least knew the applicant population, the submission fields, and the external records to compare.

SAR data is open-set dirty data. There is no defined population — banks are reporting on the flow of financial transactions, an unbounded stream where “normal” is hard to define. The bank filing the SAR is often reporting someone else’s transactions, under strong incentives to avoid under-reporting. The rational response can drift toward filing on activity that looks even mildly unusual. FinCEN acknowledged this problem in October 2025 when it issued FAQs aimed at reducing low-value filings and focusing SAR narratives on information useful to law enforcement.

Millions of SARs are filed into a system whose value depends on selection, narrative quality, and investigative context. A defensive filing about a legitimate wire transfer to Nigeria can resemble a genuine filing about a bribe payment to Nigeria until another source gives investigators a target, time period, or counterparty. You can stare at the database forever. The bottleneck is not only access to records. The bottleneck is knowing which records deserve attention.

Figure: Split-panel comparison of bounded government application data with cross-agency verification versus open-set SAR filings from multiple banks, where noisy filings require outside context before they become useful leads.

The Solution: Data Triangulation

In research methodology, data triangulation means using multiple independent sources to cross-validate findings that no single source can confirm on its own. A UVA framework recently formalized this for “measurement of adversarial systems” where “ground truth is unobservable or strategically concealed” — which is exactly the FCPA enforcement problem.

SARs are not the only open-set dirty data. Every enforcement signal source has its own noise problem. Whistleblower tips include disgruntled employees and award-chasers. Civil litigation dockets include nuisance suits. Investigative journalism can overread incomplete documents. No single dataset reliably separates misconduct from noise. But the noise is not identical across sources — banks file defensively, plaintiffs’ lawyers file opportunistically, journalists chase stories, companies draft cautious risk disclosures — so overlap can be more probative than any one signal alone.
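One way to see why overlap helps is a back-of-the-envelope odds calculation. The sketch below is illustrative only: the 1% prior and the likelihood ratio of 5 per signal are made-up numbers, not estimates from any dataset, and multiplying the ratios assumes the signals are conditionally independent, which real sources only approximate (banks sometimes file in response to news reports, for example).

```python
def posterior_probability(prior, likelihood_ratios):
    """Update a prior probability with one likelihood ratio per signal.

    Assumes the signals are conditionally independent, so their
    likelihood ratios can be multiplied into the prior odds.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# Hypothetical inputs, chosen only to show the shape of the effect.
PRIOR = 0.01        # assumed base rate of real misconduct among flagged entities
SIGNAL_RATIO = 5.0  # assumed: each signal is 5x likelier if misconduct is real

one_signal = posterior_probability(PRIOR, [SIGNAL_RATIO])
two_signals = posterior_probability(PRIOR, [SIGNAL_RATIO] * 2)
three_signals = posterior_probability(PRIOR, [SIGNAL_RATIO] * 3)

print(f"{one_signal:.3f} {two_signals:.3f} {three_signals:.3f}")
```

With these made-up inputs, the probability rises from about 5% with one signal to about 20% with two and about 56% with three. The specific numbers mean nothing; the point is that weak signals compound only to the extent their noise is independent.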

Tips and news investigations give prosecutors a reason to search: a company name, a country, a time period. In a more bounded domain, anomalies can surface from the data itself. In open-set dirty data, outside context is essential because it turns an unsearchable ocean into a targeted query: “show me SARs involving Company X in Country Y between 2018 and 2023.” The answer might point to the financial trail of a bribery scheme, assembled across multiple banks before investigators knew which subpoenas to send.
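The targeted query described above can be sketched against synthetic records. Everything here is hypothetical: the `SyntheticSar` fields, the bank and company names, and the `targeted_query` helper are stand-ins for illustration and do not reflect FinCEN's actual schema or query tools.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SyntheticSar:
    """A made-up record shape; real SAR fields are not public in this form."""
    sar_id: str
    filing_bank: str
    subject: str
    country: str
    filed_on: date


def targeted_query(sars, subject, country, start, end):
    """Return synthetic SARs naming `subject` in `country`, filed in [start, end]."""
    return [
        s for s in sars
        if s.subject.lower() == subject.lower()
        and s.country == country
        and start <= s.filed_on <= end
    ]


sars = [
    SyntheticSar("S1", "Bank A", "Company X", "NG", date(2019, 3, 1)),
    SyntheticSar("S2", "Bank B", "Company X", "NG", date(2021, 7, 15)),
    SyntheticSar("S3", "Bank A", "Company X", "BR", date(2020, 1, 10)),
    SyntheticSar("S4", "Bank C", "Company Z", "NG", date(2022, 5, 5)),
    SyntheticSar("S5", "Bank B", "Company X", "NG", date(2016, 9, 9)),
]

hits = targeted_query(sars, "Company X", "NG", date(2018, 1, 1), date(2023, 12, 31))
banks = sorted({s.filing_bank for s in hits})
print([s.sar_id for s in hits], banks)
```

The outside context does the narrowing: without a subject, a country, and a date range, all five records look equally plausible.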

The triangulation works bidirectionally across every signal source:

Civil litigation ↔ SARs. Securities class actions allege concealed FCPA risk; qui tam suits surface adjacent government-contracting facts; employment litigation by terminated compliance officers can describe transaction patterns a SAR would flag. Many suits are weak, but a suit naming entities that also appear in multiple relevant SARs would look different to investigators from a suit with no financial corroboration. Stanford’s Securities Class Action Clearinghouse makes cross-referencing public securities litigation with the FCPA Clearinghouse straightforward.

Whistleblower tips ↔ SARs. Reported updates on the DOJ’s whistleblower program suggest a substantial volume of post-2024 submissions. Check those tips against FinCEN’s database: if multiple banks independently flagged the same entity, corridor, or counterparty, the tip is corroborated by financial data the tipster may never have seen.

Journalism ↔ SARs. ICIJ’s FinCEN Files analysis found banks filing SARs in response to news reports, so causality can run in both directions.

SEC disclosures ↔ SARs. Companies disclose internal FCPA investigations in 10-K risk factors and 8-K filings. Match disclosure timing against SAR patterns to see whether the banking system flagged payments before the company came forward.
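A minimal sketch of that timing comparison, assuming a disclosure date taken from a public filing and a list of dates for synthetic SAR-like flags. The `lead_time_days` helper and the dates are hypothetical, and real SAR filing dates are not publicly available.

```python
from datetime import date


def lead_time_days(disclosure_date, sar_dates):
    """Days between the earliest SAR-like flag and the disclosure.

    Positive means the flag preceded the disclosure; negative means the
    disclosure came first. Returns None when there are no flags.
    """
    if not sar_dates:
        return None
    return (disclosure_date - min(sar_dates)).days


# Hypothetical dates for illustration only.
flag_first = lead_time_days(date(2021, 3, 1), [date(2020, 9, 15), date(2020, 3, 1)])
disclosure_first = lead_time_days(date(2019, 6, 1), [date(2019, 6, 11)])
no_flags = lead_time_days(date(2019, 6, 1), [])

print(flag_first, disclosure_first, no_flags)
```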

Foreign referrals ↔ SARs. The U.S. has mutual legal assistance treaties (MLATs) with over 65 countries. The International Anti-Corruption Prosecutorial Taskforce (UK/Switzerland/France, 2025) is designed to pick up cases the U.S. deprioritizes.

Potential leads can also be extracted from SAR metadata and narratives: network analysis surfaces clusters of connected entities flagged by multiple banks (ICIJ used graph databases for this); quality scoring uses an LLM to separate detailed narratives from boilerplate; temporal clustering treats continuing activity reports as evidence of persistence; peer comparison flags outlier SAR volumes; and corridor analysis maps concentrations through high-risk routes against Transparency International’s CPI.
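The network-analysis idea can be sketched with the standard library alone: link entities that appear in the same filing, find connected clusters, and keep only clusters flagged by more than one bank. This is a toy stand-in and not ICIJ's graph-database approach; the filings, names, and the `min_banks` threshold are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical input: (filing bank, entities named together in one synthetic SAR).
filings = [
    ("Bank A", {"Company X", "Shell Co 1"}),
    ("Bank B", {"Shell Co 1", "Consultant Y"}),
    ("Bank C", {"Company Q", "Vendor R"}),
    ("Bank C", {"Vendor R", "Company Q"}),
]


def multi_bank_clusters(filings, min_banks=2):
    """Return (entities, banks) for each connected cluster flagged by >= min_banks banks."""
    adjacency = defaultdict(set)
    banks_by_entity = defaultdict(set)
    for bank, entities in filings:
        for entity in entities:
            adjacency[entity] |= entities - {entity}
            banks_by_entity[entity].add(bank)

    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        banks = set().union(*(banks_by_entity[e] for e in component))
        if len(banks) >= min_banks:
            clusters.append((sorted(component), sorted(banks)))
    return clusters


clusters = multi_bank_clusters(filings)
print(clusters)
```

Here the Company Q cluster is dropped because only one bank flagged it, even though that bank filed twice. Independent flags from different institutions carry more weight in this sketch than repeated filings from one.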

No noisy dataset is reliable alone, but overlap between differently noisy datasets is where the better leads live.

Figure: Data triangulation diagram showing five overlapping noisy datasets (SARs, civil litigation, whistleblower tips, journalism, and SEC filings), each labeled with its noise type, converging on a central zone labeled higher-confidence lead.

The SAR is not the evidence. It is the bridge between “we heard Company X might be bribing officials in Country Y” and “here are the banks, counterparties, dates, amounts, and corridors worth subpoenaing.” That is not an origination claim. It is a lead-intelligence claim.

Figure: Three-stage pipeline in which a tip or news article enters as a reason to look, queries FinCEN’s SAR database for matching filings, and points investigators toward underlying bank records, counterparties, dates, amounts, and correspondent banks.

The Proof of Concept

FCPA data. The Stanford FCPA Clearinghouse (FCPAC) provides enforcement actions with defendants, countries, industries, time periods, and payment descriptions — updated through July 2025, available to academic researchers. The key fields overlap with what a SAR would contain: entity, country, time period, payment mechanism.

Synthetic SARs. Peer-reviewed synthetic AML datasets — IBM’s AMLSim (NeurIPS 2023), SynthAML (16M transactions), Tide (University of Amsterdam) — provide the base. Extend with FCPA-specific bribery typologies from the FCPA Resource Guide, calibrate against Stanford’s country/industry distributions, and generate narratives with an LLM following FinCEN’s five-W format.
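A rough sketch of the generation step, under loud assumptions: the country weights, typology labels, date range, and amounts below are invented placeholders and are not drawn from the Stanford data or any of the named datasets, and a fixed template stands in for the LLM-written narrative. The five fields follow the who, what, when, where, why structure mentioned above.

```python
import random
from datetime import date, timedelta

# Hypothetical calibration weights; real ones would come from the enforcement data.
COUNTRY_WEIGHTS = {"CN": 0.4, "BR": 0.3, "NG": 0.3}

# Hypothetical typology labels loosely echoing the patterns named in this post.
TYPOLOGIES = [
    "consulting payment to an intermediary",
    "round-dollar wire to a shell company",
    "payment routed through an intermediary account",
]


def generate_synthetic_sars(n, seed=0):
    """Generate n synthetic SAR-like records with template narratives."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    countries = list(COUNTRY_WEIGHTS)
    weights = list(COUNTRY_WEIGHTS.values())
    records = []
    for i in range(n):
        country = rng.choices(countries, weights=weights, k=1)[0]
        typology = rng.choice(TYPOLOGIES)
        when = date(2018, 1, 1) + timedelta(days=rng.randrange(365 * 5))
        amount = rng.randrange(50, 500) * 1000  # round-dollar amounts
        narrative = (
            f"WHO: Synthetic Entity {i}. "
            f"WHAT: {typology} of USD {amount:,}. "
            f"WHEN: {when.isoformat()}. "
            f"WHERE: beneficiary account in {country}. "
            f"WHY: weak documented business purpose."
        )
        records.append({
            "sar_id": f"SYN-{i:05d}",
            "country": country,
            "typology": typology,
            "date": when,
            "amount": amount,
            "narrative": narrative,
        })
    return records


sample = generate_synthetic_sars(3, seed=42)
print(sample[0]["narrative"])
```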

Matching. For each enforcement action: does it describe a transaction pattern that would plausibly trigger a SAR? Score on country, time period, industry, payment typology, intermediaries, payment purpose, and semantic similarity of narratives. The output should be reported as a sensitivity band: how many corporate FCPA actions are SAR-matchable under strict, moderate, and broad typology assumptions. The point is to measure that share, not assume an origination percentage.
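The sensitivity band can be sketched as a count of matched dimensions compared against three cutoffs. The dimension names, the cutoffs (5, 4, and 3 of 6), and the sample records are assumptions made for illustration; exact equality is a crude stand-in for real matching, and the semantic-similarity component is left out.

```python
DIMENSIONS = (
    "country", "time_period", "industry",
    "payment_typology", "intermediaries", "payment_purpose",
)

# Hypothetical cutoffs: minimum number of matched dimensions out of six.
THRESHOLDS = {"strict": 5, "moderate": 4, "broad": 3}


def match_score(action, sar):
    """Count the dimensions on which an action and a synthetic SAR agree."""
    return sum(
        1 for d in DIMENSIONS
        if action.get(d) is not None and action.get(d) == sar.get(d)
    )


def sensitivity_band(actions, synthetic_sars):
    """Share of actions whose best match clears each threshold."""
    scores = [
        max((match_score(a, s) for s in synthetic_sars), default=0)
        for a in actions
    ]
    total = len(scores)
    return {
        label: (sum(score >= cutoff for score in scores) / total if total else 0.0)
        for label, cutoff in THRESHOLDS.items()
    }


template = {"country": "NG", "time_period": 2019, "industry": "oil_gas",
            "payment_typology": "consulting_fee", "intermediaries": True,
            "payment_purpose": "vague"}

actions = [
    dict(template),                                                  # agrees on 6
    {**template, "industry": "telecom", "time_period": 2015},        # agrees on 4
    {"country": "NG", "payment_typology": "consulting_fee",
     "intermediaries": True, "industry": "pharma"},                  # agrees on 3
    {"country": "BR", "intermediaries": True},                       # agrees on 1
]

band = sensitivity_band(actions, [template])
print(band)
```

Reporting all three shares side by side keeps the typology assumptions visible instead of burying them in a single headline percentage.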

Triangulation layer. For each SAR-matched action, search for a contemporaneous public signal: news, SEC filings, civil litigation, or foreign regulatory actions. Where a SAR match converges with an independent signal, treat the action as a higher-confidence match. Validate against the 20% whistleblower baseline, charging documents that mention bank referrals, and the FinCEN Files transaction data for cases where leaked SARs overlap with known FCPA matters.
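A minimal sketch of the triangulation step, assuming each SAR-matched action carries an entity name and a scheme period, and each public signal carries an entity, a source type, and a date. The record shapes, the one-year window, and the sample data are hypothetical.

```python
from datetime import date, timedelta


def contemporaneous_signals(action, signals, window_days=365):
    """Signals about the same entity dated within the scheme period plus or minus a window."""
    window = timedelta(days=window_days)
    earliest = action["start"] - window
    latest = action["end"] + window
    return [
        s for s in signals
        if s["entity"] == action["entity"] and earliest <= s["date"] <= latest
    ]


def triangulate(sar_matched_actions, signals, window_days=365):
    """Mark each SAR-matched action as higher-confidence if any public signal lines up."""
    results = {}
    for action in sar_matched_actions:
        hits = contemporaneous_signals(action, signals, window_days)
        sources = sorted({s["source"] for s in hits})
        results[action["entity"]] = {
            "sources": sources,
            "higher_confidence": bool(sources),
        }
    return results


# Hypothetical actions and signals for illustration only.
matched_actions = [
    {"entity": "Company X", "start": date(2018, 1, 1), "end": date(2020, 12, 31)},
    {"entity": "Company Z", "start": date(2015, 1, 1), "end": date(2016, 12, 31)},
]
public_signals = [
    {"entity": "Company X", "source": "news", "date": date(2021, 6, 1)},
    {"entity": "Company X", "source": "sec_filing", "date": date(2019, 3, 1)},
    {"entity": "Company Z", "source": "civil_litigation", "date": date(2022, 1, 1)},
]

results = triangulate(matched_actions, public_signals)
print(results)
```

In this toy data, Company X is corroborated by two source types, while the Company Z lawsuit falls outside the window and does not count. The window length is a judgment call and should be varied as part of the sensitivity analysis.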

What this cannot prove. Synthetic SAR matching cannot show that a real SAR was filed, that FinCEN indexed it correctly, that DOJ or SEC searched it, or that it caused an enforcement action. It cannot see confidential SAR contents. It cannot calibrate real SAR base rates from public data alone. It can only test whether public FCPA facts resemble known SAR-triggering typologies and whether those patterns line up with other public signals.

Why it matters now. President Trump paused FCPA enforcement in February 2025. New guidelines restarted it under narrower priorities in June 2025. The DOJ closed roughly half its open investigations. But enforcement priorities are political; the reporting infrastructure is structural. Banks do not stop filing SARs when the White House changes. SARs and related financial records can remain searchable for law-enforcement purposes long after the underlying payment clears.

What Comes Next

This is the second post in Data Analytics and Fraud — a series using AI and open datasets to illuminate enforcement patterns. The same methodology — and the same open-set dirty data framework — applies wherever enforcement depends on matching noisy government records against other noisy external signals.

Further Reading

Read next in this series: The Data Miner’s Dilemma.


This post is part of the Data Analytics and Fraud series on LegalRealist AI. It is intended for informational and educational purposes only and does not constitute legal advice. The synthetic data methodology described here produces directional estimates, not verified conclusions. It does not prove actual SAR filing, SAR use, or case origination by DOJ or SEC. Actual SAR filings are confidential under the Bank Secrecy Act; nothing in this post is based on or reveals the contents of any actual SAR. FCPA enforcement data reflects publicly available information as of the publication date and update date. Laws governing anti-corruption enforcement vary by jurisdiction and are subject to rapid policy change.
