The Government Already Has the Data

Data Analytics and Fraud Enforcement - This article is part of a series.
Part 1: This Article


TL;DR

  • The government holds a closed, mostly clean dataset of every transaction it needs to find fraud. Medicare claims, tax returns, PPP applications — these aren’t gathered through investigation. They’re submitted by the targets themselves, in structured digital format, directly into federal databases. Fraud detection is a query, not a manhunt.
  • PPP fraud was the proving ground. Data scientists analyzed 33 million loan applications using machine learning and cross-referenced Social Security numbers across agencies, flagging $79 billion in potential identity fraud and generating over 95,000 investigative leads.
  • Healthcare fraud enforcement hit record scale in 2025. The DOJ’s annual takedown charged 324 defendants across $14.6 billion in alleged fraud — and credited proactive data analytics with catching a $10.6 billion scheme before most payments went out.
  • The IRS now runs 126 active AI applications. Machine learning models score millions of returns simultaneously, cross-referencing W-2s, 1099s, and banking data to flag noncompliance — with an 18:1 return on investment on its fraud prevention system.
  • Anomaly detection catches outliers, not intent. Providers and businesses whose billing or filings deviate from statistical norms can be flagged even when acting in good faith. Compliance programs need to understand what the algorithms are looking for.

A transnational criminal organization bought dozens of medical supply companies across the United States and submitted $10.6 billion in fraudulent Medicare claims for urinary catheters that were never delivered, using stolen identities from over one million Americans. The DOJ’s Health Care Fraud Unit Data Analytics Team detected the anomalous billing through proactive data analytics. CMS blocked all but $41 million of the $4.45 billion scheduled to be paid. The scheme — Operation Gold Rush, the largest healthcare fraud case ever charged — was caught not by a whistleblower, not by a patient complaint, not by a Suspicious Activity Report, but by an algorithm that noticed billing patterns that didn’t make sense.

No bank filed a SAR. No insider picked up the phone. The government found this fraud in its own data — the Medicare claims it already collects, the billing patterns it already tracks, the enrollment records it already maintains. The federal government has spent the past five years building analytics infrastructure that treats its own structured data as an enforcement asset, cross-referencing billions of records across agencies to flag anomalies and generate investigative leads without waiting for anyone to report anything.

This post — the first in a series on data analytics and fraud enforcement — maps how three federal enforcement pipelines work: pandemic relief (the proving ground), healthcare (the largest target), and tax (the broadest reach). For defense counsel and compliance teams, the question is no longer whether regulators have the data. It’s whether your client’s patterns look normal.

The Closed Dataset

In one model, the government depends on intermediaries to flag suspicious activity. A bank employee notices unusual wire transfers, writes a narrative describing why the transactions look wrong, and files a Suspicious Activity Report with FinCEN. Financial institutions filed 4.7 million SARs in FY 2024 — over 12,000 per day — each one a human interpretation of suspicious behavior, written in unstructured text, requiring further human analysis to act on. SARs remain valuable for banking enforcement. But the model depends entirely on someone else noticing something wrong and choosing to report it.

In the other model, the government already has the data. Every Medicare claim is a structured digital record — procedure codes, dollar amounts, provider identifiers, patient identifiers, dates — submitted directly to CMS in machine-readable format. Every tax return is the same: income figures, deduction categories, employer IDs, all filed directly with the IRS. Every PPP loan application landed in the SBA’s systems with payroll numbers, employee counts, and Social Security numbers attached.

This is a closed, mostly clean dataset. Mostly — not perfectly. PPP applications didn’t require dates of birth. Medicare claims have coding errors. Tax returns contain mistakes and ambiguities. But the data is structured, machine-readable, and already in federal databases, submitted by the very entities being scrutinized. The government doesn’t need a third party to observe and report. Fraud detection isn’t an investigation that starts from a tip — it’s a query against data the government already holds. When CMS runs anomaly detection across 11 million daily Medicare claims, it’s not waiting for someone to call. It’s asking its own data: which billing patterns don’t look like the others?

The contrast with SARs matters because it explains why this enforcement model scales differently. A SAR requires a bank to notice, interpret, write, and file. A closed-dataset query requires only compute. The government can run the same anomaly detection across every Medicare provider, every tax return, every PPP borrower — simultaneously and continuously.
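In spirit, such a query is just peer-group outlier scoring: compare each filer against the statistical norm for its category and flag the ones that sit far from it. A minimal sketch in Python (provider names, dollar figures, and the z-score threshold are all invented; real systems score far richer features than a single billing average):

```python
from statistics import mean, stdev

# Hypothetical daily billing totals, keyed by provider (illustrative numbers).
claims = {
    "provider_A": [1200, 1300],
    "provider_B": [1100, 1250],
    "provider_C": [9800, 10400],  # billing that dwarfs the peer group
}

# Peer norm across every observation in the billing category.
all_amounts = [amt for amts in claims.values() for amt in amts]
mu, sigma = mean(all_amounts), stdev(all_amounts)

# Flag any provider whose average billing deviates sharply from the norm.
outliers = [
    provider
    for provider, amts in claims.items()
    if abs(mean(amts) - mu) / sigma > 1.0
]
print(outliers)  # ['provider_C']
```

The point of the sketch is the shape of the operation: no tip, no narrative, no human observer, just a computation over records the agency already holds.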

[Figure: Side-by-side comparison of the SAR model (five human-dependent steps from observation to lead) versus the closed-dataset model (structured records submitted directly to federal databases, with algorithms generating leads automatically).]

From “Pay and Chase” to “Detect and Deploy”

For decades, federal fraud enforcement ran on a simple model: pay the claim, wait for a tip, investigate, and try to recover. The False Claims Act’s qui tam provision — which lets private whistleblowers file suits on the government’s behalf and collect a share of the recovery — was the primary enforcement engine. In FY 2025, whistleblowers filed a record 1,297 qui tam actions, and total FCA recoveries hit $6.8 billion, the highest in the statute’s history. Whistleblowers aren’t going away.

But the government is building a parallel track. Proactive data analytics — cross-referencing claims data, tax filings, enrollment records, and third-party databases using machine learning and anomaly detection — now generates enforcement leads independently of whistleblower tips. The DOJ-HHS False Claims Act Working Group, formed in July 2025, explicitly plans to use enhanced data mining to drive new investigative leads. CMS Administrator Dr. Mehmet Oz described the shift in March 2026: the agency is replacing the “pay and chase” model with a “detect and deploy” strategy that uses AI to identify fraud before payments go out.

PPP’s $1 Trillion Experiment

The Paycheck Protection Program was the largest fraud detection stress test in U.S. government history. The SBA distributed approximately $800 billion in loans through over 5,000 lenders to more than 8 million borrowers — and did it in weeks, with self-certification and reduced internal controls. The speed that saved businesses also created the largest fraud surface the government had ever seen.

The Data Pipeline

Phase 1: Screening (reactive, limited). The SBA built a four-step anti-fraud process that compared applications against public and private databases, ran data analytics, flagged applications for manual review, and referred likely fraud to its Office of Inspector General. But the full process wasn’t in place until more than half of program funds had already been distributed — over $525 billion in PPP loans approved before the screening was fully operational. The SBA’s machine learning tool focused on prioritizing loans with existing flags for human review rather than identifying new suspicious patterns, limiting its ability to catch complex fraud schemes.

Phase 2: Cross-agency analytics (the PACE model). Congress funded the Pandemic Analytics Center of Excellence (PACE), a centralized data analytics hub run by the Pandemic Response Accountability Committee (PRAC). PACE assembled 59 datasets with access to over 1.6 billion records from public, non-public, and commercial sources. Its data scientists did something the SBA couldn’t do alone: they cross-referenced PPP and EIDL applications against Social Security Administration records, HUD housing benefit applications, and other federal program data.

PACE analyzed 33 million loan applications and identified over 69,000 questionable Social Security numbers used to obtain $5.4 billion in pandemic loans and grants. Many of those SSNs were never issued by the SSA or didn’t match the names and dates of birth on the applications. A later analysis using random sampling across 67.5 million funded applications estimated that approximately $79 billion in potential identity fraud could have been prevented with pre-award vetting.
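At its core, the SSN cross-check is a join against a reference file. A toy sketch, with invented records standing in for loan applications and SSA data:

```python
# Hypothetical records: loan applications vs. an SSA reference file.
applications = [
    {"ssn": "111-22-3333", "name": "Ada Ruiz"},
    {"ssn": "999-99-0001", "name": "Bo Chen"},   # SSN never issued
    {"ssn": "444-55-6666", "name": "Cy Park"},   # name mismatch
]
ssa_records = {
    "111-22-3333": "Ada Ruiz",
    "444-55-6666": "Dee Park",
}

# Each application either matches the reference file or becomes a lead.
leads = []
for app in applications:
    issued_name = ssa_records.get(app["ssn"])
    if issued_name is None:
        leads.append((app["ssn"], "SSN not issued by SSA"))
    elif issued_name != app["name"]:
        leads.append((app["ssn"], "name does not match SSA record"))
print(leads)
```

A check this cheap only becomes possible once the datasets sit in the same place — which is exactly what PACE's 59-dataset hub provided.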

PACE also compared income reported by PPP applicants against income reported to HUD for housing benefits — finding applicants who may have deliberately misrepresented their incomes to one program or the other. One such cross-agency analysis led to a large-scale criminal conspiracy case.

Phase 3: Knowledge graphs and lead generation. Private-sector analytics firms working with federal stakeholders built knowledge graph systems that mapped networks of control and ownership across PPP borrowers. These systems ingested raw loan data, enriched it with public records, watchlists, conviction databases, and commercial ownership datasets, then used entity resolution to connect duplicate or alias identities. The result: investigators could visualize networks — multiple loans tied to the same beneficial owner, unusually fast disbursement patterns, geographic clusters of suspicious applicants — rather than investigating loans one at a time.

An automated fraud detection tool built by eSimplicity for the SBA OIG identified over $200 billion in potential fraud and generated more than 95,000 actionable leads — representing, by the firm’s estimate, over 100 years of manual investigative casework. Those leads contributed to 1,632 indictments, 1,213 arrests, and 1,045 convictions related to COVID-EIDL and PPP fraud.

The Enforcement Results

As of January 2026, PACE had supported over 1,200 pandemic-related investigations involving more than 24,000 subjects and $2.5 billion in estimated fraud loss. The DOJ obtained more than 200 civil settlements and judgments totaling over $230 million for pandemic fraud in FY 2025 alone, bringing total civil recoveries to over $820 million. The PPP fraud conviction rate stands at 81.8%, with 2,532 defendants convicted out of 3,096 charged as of late 2024 — and 81% of those sentenced received prison time.

The statute of limitations for PPP fraud was extended to 10 years, meaning prosecutors have until 2030-2032 to bring cases. And a new category of enforcement actor has emerged: data-miner relators — private parties who use publicly available PPP loan data, corporate ownership records, and employment filings to identify potential FCA claims without any insider knowledge. They file qui tam suits based entirely on pattern analysis.

What PPP Proved

Cross-agency data sharing catches fraud that siloed systems miss. An SSN that looks clean in the SBA’s database might belong to a deceased person in SSA records, or to someone reporting $600 in annual income to HUD while claiming an $875,000 monthly payroll to the SBA. Machine learning at scale generates leads that no human review team could produce — 95,000 leads from a single analytics tool. And the infrastructure built for pandemic oversight works for other programs. The PRAC urged Congress to make PACE permanent and expand its jurisdiction to all federal spending. Data-sharing agreements like the one between the SBA OIG and USDA OIG, signed in February 2026, signal that the cross-agency model is spreading.

[Figure: Network diagram showing how a single entity’s records are cross-referenced across SBA, IRS, SSA, CMS, HUD, DEA, and state agencies, with discrepancy callouts showing the specific mismatches that generate investigative leads.]

Healthcare: The $6.8 Billion Enforcement Machine

Healthcare fraud enforcement has used data analytics longer than any other federal domain — the CMS Fraud Prevention System has run machine learning against Medicare claims since 2011 — but the scale has changed dramatically.

The Data Fusion Center

In June 2025, alongside the largest healthcare fraud takedown in history (324 defendants, $14.6 billion in alleged fraud), the DOJ announced the creation of a Health Care Fraud Data Fusion Center. The Fusion Center brings together the DOJ’s Health Care Fraud Unit Data Analytics Team, HHS-OIG, the FBI, and other agencies to use cloud computing, AI, and shared analytics platforms. Its stated purpose: break down information silos and enable rapid prosecution of emerging fraud schemes. The initiative implements Executive Order 14243, “Stopping Waste, Fraud, and Abuse by Eliminating Information Silos.”

CMS launched its own Fraud Defense Operations Center (FDOC) in 2025. By March 2026, it had triaged more than 340 suspect providers and prevented over $1.4 billion in potential payments while investigations continued. CMS estimates it prevented $11.9 billion in potentially fraudulent Medicare payments from FY 2022 through 2024.

How the Analytics Work

CMS analyzes Medicare fee-for-service claims on a streaming, nationwide basis — processing over 11 million pre-paid claims daily. The system flags billing spikes, geographic anomalies, and provider-level outliers. When patterns deviate from norms, CMS can suspend payments, revoke billing privileges, or refer cases for investigation.
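A pre-payment screen of this general shape can be sketched as a streaming check against a provider's own recent baseline. Everything below is illustrative — the window size and ratio threshold are invented, not CMS's actual rules:

```python
from collections import deque
from statistics import mean

class SpikeDetector:
    """Toy streaming screen: hold a claim that far exceeds the
    provider's recent billing baseline (invented thresholds)."""

    def __init__(self, window: int = 5, ratio: float = 3.0):
        self.history = deque(maxlen=window)  # rolling baseline
        self.ratio = ratio

    def screen(self, amount: float) -> str:
        # Only screen once a minimal baseline exists.
        if len(self.history) >= 3 and amount > self.ratio * mean(self.history):
            return "hold"  # suspend payment pending review
        self.history.append(amount)
        return "pay"

detector = SpikeDetector()
decisions = [detector.screen(a) for a in [500, 520, 480, 510, 4800]]
print(decisions)  # ['pay', 'pay', 'pay', 'pay', 'hold']
```

The design choice that matters is the decision point: the check runs before payment, so an anomaly produces a held claim rather than a recovery action.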

The Operation Gold Rush case from the opening of this post shows how this works. The Data Analytics Team spotted anomalous billing from newly acquired DME companies whose billing volume didn’t match normal provider behavior. CMS froze the payments before they went out. In March 2026, CMS imposed a six-month nationwide moratorium on new DME supplier enrollment and revoked billing privileges for 5,586 providers and suppliers.

The Fusion Center also enables what individual agencies couldn’t do alone: connecting billing anomalies in Medicare data with prescribing patterns tracked by the DEA, complaint data from state attorneys general, and financial transaction patterns flagged by law enforcement. As Blank Rome’s analysis notes, the government’s data-driven approach detects anomalies, not intent — meaning providers whose practices differ from regional or national averages may be identified as outliers even when the variation reflects legitimate clinical specialization, patient demographics, or innovative care models.

The FCA Pipeline

Healthcare accounted for over $5.7 billion of the $6.8 billion in FCA recoveries in FY 2025. The DOJ-HHS Working Group explicitly plans to use enhanced data mining to drive new investigative leads — shifting the mix from whistleblower-initiated to government-initiated FCA actions. Data analytics was deployed successfully for PPP fraud, and the government is now expanding that playbook to healthcare.

CMS also launched the Wasteful and Inappropriate Service Reduction (WISeR) Model in January 2026 — a voluntary program in six states that uses AI, machine learning, and human clinical review to introduce prior authorization requirements for services historically associated with fraud and inappropriate utilization. It’s the first CMS program designed to use AI to prevent waste before claims are paid, rather than recovering funds after.

The IRS Cross-References Everything

The IRS now runs 126 active AI applications — up from 10 in August 2022 — spanning audit selection, fraud detection, taxpayer services, and operational workflows. The Return Review Program (RRP), the primary system for pre-refund fraud detection, uses supervised and unsupervised machine learning to flag suspicious returns before refunds are issued. From 2015 through 2019, the RRP permanently froze nearly $11 billion in refunds, producing an 18:1 return on investment over its first decade. Treasury’s AI tools have helped prevent or recover more than $4 billion in taxpayer losses from fraudulent returns, improper payments, and check schemes.

The core mechanism is cross-referencing. The IRS matches reported income against W-2s, 1099s, crypto exchange reports, banking data, and state tax records through reciprocal agreements with state governments. When what you report doesn’t match what your employer, broker, or bank reported, the system flags the discrepancy. Machine learning models now analyze millions of returns simultaneously, scoring each for noncompliance risk and adapting their detection criteria roughly six times per tax year.

For complex cases, the IRS uses AI to target 75 of the largest U.S. partnerships, each with assets exceeding $10 billion — including hedge funds, real estate investment partnerships, and law firms. The Large Partnership Compliance program uses machine learning to assess accounting rules and tax law compliance in structures that are too complex for traditional audit selection.

The IRS also cross-references PPP data. Federal prosecutors are using AI software to cross-check Form 941 payroll tax filings, banking records, and unemployment data against PPP loan applications — flagging businesses that claimed large payrolls to the SBA but reported minimal payroll taxes to the IRS. When your PPP application says 50 employees with an $875,000 monthly payroll and your Form 941 says otherwise, the discrepancy is visible instantly.
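That cross-check reduces to comparing the same employer's numbers across two filings. A toy illustration — the EINs, dollar amounts, and the 5x mismatch threshold are invented for the example:

```python
# Hypothetical filings for the same employers, pulled from two agencies.
ppp_applications = {
    "EIN-001": {"employees": 50, "monthly_payroll": 875_000},
    "EIN-002": {"employees": 4,  "monthly_payroll": 22_000},
}
form_941_quarterly_wages = {
    "EIN-001": 30_000,   # ~$10k/month reported to the IRS
    "EIN-002": 64_000,   # ~$21k/month, roughly consistent
}

flags = []
for ein, app in ppp_applications.items():
    irs_monthly = form_941_quarterly_wages.get(ein, 0) / 3
    # Flag when the SBA-claimed payroll dwarfs the IRS-reported payroll.
    if app["monthly_payroll"] > 5 * irs_monthly:
        flags.append(ein)
print(flags)  # ['EIN-001']
```

No single filing in the example is suspicious on its own; the lead exists only because two agencies' records were joined on the same identifier.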

[Figure: Four-stage enforcement pipeline showing data ingestion (billions of records from Medicare, tax, PPP, and cross-agency sources), analytics (anomaly detection, cross-agency matching, knowledge graphs, ML scoring producing 95,000+ leads), triage (Fusion Center, PACE, FDOC reviewing cases), and enforcement outcomes ($11.9B prevented, 5,586 billing revocations, 324 criminal charges, $6.8B in FCA recoveries).]

What This Means for Defense Counsel and Compliance Teams

Your client’s data is the first witness. Before an agent knocks on the door, algorithms have already compared your client’s billing, filings, or applications against every peer in the dataset. The investigation begins with a statistical anomaly, not a complaint. Defense counsel should understand what normal looks like in their client’s billing category — because the government’s analytics team already does. As one analysis put it, providers whose practices differ from statistical norms may be flagged regardless of whether the variation reflects legitimate clinical judgment.

Cross-agency data sharing eliminates the old silos. The government that couldn’t connect SSA data to SBA loan applications in 2020 can now do it routinely. Information-sharing agreements between agencies mean that a discrepancy in one dataset — an SSN mismatch, an income inconsistency, a billing outlier — can trigger scrutiny across multiple programs. Mintz recommends that companies follow the government’s lead: track the DOJ’s AI Use Case Inventory and consider using the same analytical tools as part of their compliance programs to identify what the government will see before the government sees it.

Self-disclosure has never been more valuable. The DOJ has reiterated its commitment to rewarding voluntary self-disclosure, cooperation, and remediation — with several FY 2025 settlements reflecting reduced penalties for companies that came forward early. When the government’s analytics are generating leads at industrial scale, the window between “we don’t know about this yet” and “the algorithm flagged it” is shrinking. Identifying and disclosing problems before they become leads is a compliance advantage that didn’t exist when enforcement depended on whistleblowers with longer timelines.

The question for compliance teams isn’t whether your data is clean. It’s whether the data you already submitted — the claims, the returns, the applications — looks clean to an algorithm comparing it against every other filer in the same category. The government doesn’t need to come get your data. It already has it.

Next in this series: how specific analytics techniques — anomaly detection, network analysis, predictive modeling, and natural language processing — actually work, and what each one can and can’t catch.

This post is part of the Data Analytics and Fraud Enforcement series on LegalRealist AI. It is intended for informational and educational purposes only and does not constitute legal advice. Enforcement statistics, agency capabilities, and regulatory programs described here reflect publicly available information as of the publication date and are subject to change. Laws governing fraud enforcement vary by jurisdiction and program.
