From Kaggle to MCP: Open-Source Medicare Fraud Detection

Author
LegalRealist AI
Data Analytics and Fraud - This article is part of a series.
Part 5: This Article

TL;DR

The PPP fraud analysis worked because the SBA released everything — 11.5 million loans, every field, one CSV. An open-source pipeline flagged 1.19 million suspicious loans, and every top-flagged lender matched a congressional investigation or DOJ enforcement action. Medicare spends roughly $1 trillion annually — more than four times what PPP disbursed over its entire lifetime. The public data available for outside analysis is worse by an order of magnitude.

The Kaggle Foundation

Most open-source Medicare fraud detection traces back to one dataset: Rohit Anand Gupta’s Healthcare Provider Fraud Detection Analysis, posted to Kaggle in May 2019. It contains 5,012 providers, roughly 558,000 claims, beneficiary demographics, inpatient and outpatient splits, diagnosis and procedure codes, and — crucially — pre-labeled fraud flags at the provider level. It has spawned dozens of GitHub repositories, Kaggle notebooks, Medium writeups, a March 2026 medRxiv preprint, a Journal of Big Data paper (2023), and an IEEE conference paper.

The repos follow a consistent pattern: feature engineering on claim amounts, chronic condition counts, and provider geography, then classification via logistic regression, random forest, XGBoost, or autoencoders, producing AUC scores in the 0.85–0.97 range. SMOTE handles class rebalancing. SHAP values provide explainability. The machine learning works. The dataset doesn’t resemble reality.

The Kaggle set is pre-balanced relative to actual Medicare fraud rates, pre-processed into clean features, and small enough to fit in memory on a laptop. Models trained on it produce impressive accuracy numbers that wouldn’t survive contact with the real CMS data — where the fraud rate is orders of magnitude lower, the features are aggregated and de-identified, and the clinical context that distinguishes fraud from unusual-but-legitimate practice doesn’t exist.

The Pyligent CMS-Medicare-Data-FRAUD-Detection repo (20 stars, 23 forks) sits between the Kaggle sandbox and real detection. It uses actual CMS Part D prescriber data — over 3GB — processed through Apache Spark and PySpark, joined with LEIE exclusion data and pharmaceutical payment records. More ambitious data handling, more realistic scale, but still supervised classification against known exclusions.

The Bridge: Real CMS Data

Two repos cross the line from Kaggle toy to actual CMS data — and together they outline a pipeline nobody has built yet.

dchannah/fraudhacker (18 stars, 8 forks) is the most complete tool architecturally. It loads raw CMS provider utilization data into PostgreSQL, runs clustering-based outlier detection per specialty per state, and serves results through a Flask dashboard a non-coder can browse. The approach is more honest than supervised classification: it says “this provider bills differently from peers” rather than “this provider resembles previously caught fraudsters.” But fraudhacker never checks whether its outliers actually turned out to be fraudulent. No LEIE cross-reference, no ground-truth validation, no feedback loop. It flags statistical anomalies and stops.
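The peer-comparison idea can be sketched in a few lines. This is not fraudhacker's code: it swaps the repo's clustering for a simpler robust z-score (median and median absolute deviation) within each specialty-and-state group. The field layout, the 3.5 threshold, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (npi, specialty, state, total_billed). Not real CMS data.
rows = [
    ("1000000001", "Cardiology", "FL", 210_000.0),
    ("1000000002", "Cardiology", "FL", 195_000.0),
    ("1000000003", "Cardiology", "FL", 205_000.0),
    ("1000000004", "Cardiology", "FL", 220_000.0),
    ("1000000005", "Cardiology", "FL", 940_000.0),
    ("1000000006", "Cardiology", "TX", 900_000.0),
    ("1000000007", "Cardiology", "TX", 950_000.0),
    ("1000000008", "Cardiology", "TX", 925_000.0),
]

def peer_outliers(rows, threshold=3.5):
    """Flag providers whose billing deviates from same-specialty, same-state peers."""
    groups = defaultdict(list)
    for npi, specialty, state, billed in rows:
        groups[(specialty, state)].append((npi, billed))

    flagged = []
    for key, members in groups.items():
        values = [billed for _, billed in members]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread in the peer group, so nothing to compare against
        for npi, billed in members:
            # 0.6745 rescales the MAD so the score is comparable to a z-score
            score = 0.6745 * (billed - med) / mad
            if abs(score) > threshold:
                flagged.append((npi, key, round(score, 1)))
    return flagged

flagged = peer_outliers(rows)
```

In this toy data only the Florida provider billing $940,000 is flagged. The Texas providers bill similar amounts but match their own peers, which is the point of comparing within a group rather than against a national figure.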

brenfrrs/medicare_fraud does what fraudhacker doesn’t: it joins Part D prescriber data against the OIG’s List of Excluded Individuals and Entities (LEIE) using NPI numbers. The LEIE lists providers excluded from federal healthcare programs for fraud convictions, license revocations, or program abuse — the closest thing to a public ground-truth fraud label. The repo’s models found gender, total claim count, and 30-day fill counts as the strongest predictors of exclusion — which immediately raises the question every honest fraud analyst has to ask: are those fraud signals, or demographic and volume correlates that happen to overlap with the providers OIG already caught?

Both repos use the same freely downloadable CMS public datasets. Both hit the same wall: the LEIE is a binary, lagging label. A provider appears on the list only after investigation, prosecution or administrative action, and formal exclusion. The label says who got caught. It doesn’t say how the fraud worked or what the billing looked like before exclusion.
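A minimal sketch of the NPI join, using only the standard library. It is not taken from either repo. The column names (NPI, EXCLTYPE, EXCLDATE on the LEIE side, npi on the prescriber side), the all-zero placeholder for a missing NPI, and the inline sample rows are assumptions to check against the actual downloads.

```python
import csv
import io

# Inline stand-ins for the two downloads. Column names and values are
# assumptions for illustration, not a verified schema.
LEIE_CSV = """NPI,EXCLTYPE,EXCLDATE
1000000005,1128a1,20210415
0000000000,1128b4,20190102
"""
PRESCRIBER_CSV = """npi,specialty,total_claims
1000000001,Cardiology,1200
1000000005,Cardiology,9800
"""

def load_exclusions(text):
    """Map NPI -> exclusion row, skipping rows with an empty or all-zero NPI."""
    exclusions = {}
    for row in csv.DictReader(io.StringIO(text)):
        npi = row["NPI"].strip()
        if npi and set(npi) != {"0"}:
            exclusions[npi] = row
    return exclusions

def label_prescribers(text, exclusions):
    """Attach a binary exclusion label and the exclusion date to each prescriber."""
    labeled = []
    for row in csv.DictReader(io.StringIO(text)):
        hit = exclusions.get(row["npi"])
        row["excluded"] = hit is not None
        row["excl_date"] = hit["EXCLDATE"] if hit else None
        labeled.append(row)
    return labeled

exclusions = load_exclusions(LEIE_CSV)
labeled = label_prescribers(PRESCRIBER_CSV, exclusions)
```

The output shows the limitation described above: each provider gets a yes-or-no label and a date, and nothing about what the billing looked like beforehand.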

The Backtest Nobody’s Built

The data to go further already exists. The LEIE includes exclusion dates and exclusion type codes under Sections 1128(a) and 1128(b) of the Social Security Act — structured categories, not prose. 1128(a)(1) is conviction for program-related fraud (billing for services not rendered). 1128(b)(7) is excessive claims or furnishing unnecessary services (real services, too many of them). Those are different fraud schemes that should produce different billing signatures. CMS publishes Part B provider utilization data going back to 2013. The NPI is the join key.

The pipeline: pull the LEIE with exclusion dates and type codes, pull the historical CMS billing data for excluded providers in the two to three years before exclusion, reconstruct their billing trajectories — what changed, what spiked, what deviated from specialty peers — and cluster those trajectories by exclusion type. The output is a set of labeled billing signatures: this is what upcoding looked like in cardiology in Florida before this provider got caught. This is what overutilization looked like in DME suppliers in Texas. Then run those signatures against current provider data to find active providers whose billing trajectories match.
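The first half of that pipeline can be sketched as follows, under heavy simplification. It pulls each excluded provider's pre-exclusion window, reduces it to year-over-year growth ratios, and groups the results by exclusion type. The dictionary shapes, the type-code strings, and every number are hypothetical. The peer-deviation and clustering steps are not shown.

```python
from collections import defaultdict

# Hypothetical inputs: exclusion year and type code per NPI, and annual
# billed totals per NPI. None of these values are real.
exclusions = {
    "1000000005": (2021, "1128a1"),
    "1000000009": (2020, "1128b7"),
}
billing = {
    "1000000005": {2017: 200_000, 2018: 210_000, 2019: 420_000, 2020: 900_000},
    "1000000009": {2017: 300_000, 2018: 330_000, 2019: 360_000},
    "1000000001": {2018: 200_000, 2019: 205_000, 2020: 210_000},  # never excluded
}

def pre_exclusion_window(npi, lookback=3):
    """Annual totals for the `lookback` years before the exclusion year."""
    excl_year, _ = exclusions[npi]
    history = billing.get(npi, {})
    years = range(excl_year - lookback, excl_year)
    return [(year, history[year]) for year in years if year in history]

def growth_ratios(window):
    """Year-over-year ratios between consecutive years present in the window."""
    return [
        round(cur / prev, 2)
        for (y0, prev), (y1, cur) in zip(window, window[1:])
        if y1 == y0 + 1 and prev > 0
    ]

# Group trajectories by exclusion type, the precursor to clustering them.
trajectories = defaultdict(dict)
for npi, (_, excl_type) in exclusions.items():
    trajectories[excl_type][npi] = growth_ratios(pre_exclusion_window(npi))
```

Aligning the window on the exclusion year rather than the calendar year is what makes this a temporal backtest instead of a static label. In practice the exclusion date lags the conduct by an unknown interval, so the lookback length is itself a parameter to tune.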

Fraudhacker does the clustering but skips the validation. brenfrrs does the LEIE join but only as a static label on current-year data — not as a temporal backtest. Nobody chains them together. It’s not a weekend project — it requires joining across multiple years of multi-gigabyte CMS datasets and aligning temporal windows correctly — but it’s not research-level hard either. It’s the kind of analysis the Dicklesworthstone PPP pipeline did for loans.

The backtest is buildable with today’s public data. But “buildable” and “practical” aren’t the same thing. CMS publishes each year’s data as a separate file in a slightly different format. The LEIE’s exclusion type codes map to legal categories, not billing patterns — translating “conviction for program-related fraud” into “which CPT codes spiked” requires the feature engineering CMS could provide but doesn’t. And the analysis would still run on provider-level aggregates with patients de-identified and programs siloed. Better data from CMS wouldn’t just help — it would determine whether the backtest produces actionable fraud signatures or statistical noise.

The PPP Benchmark

The PPP contrast shows what better data looks like.

The SBA FOIA dataset gave outside analysts identified borrowers (business name, address), identified lenders, exact loan amounts, self-reported employee counts, NAICS codes, approval dates, and forgiveness amounts — 11.5 million loans in a single 8.4GB CSV. The fraud typologies were self-evident from the data: a business claiming ten employees with zero payroll on tax records is the scheme. A lender approving thousands of loans with identical metadata is the red flag. No external translation required. An open-source pipeline built on this data flagged 1.19 million suspicious loans, and every top-flagged lender matched a congressional investigation or DOJ enforcement action.

CMS gives outside analysts something fundamentally different. Provider-level billing aggregates — not claim-level records. Patient identifiers removed. Part B billing, Part D prescribing, DMEPOS orders, and Open Payments pharmaceutical income released as separate datasets on different schedules in different formats. They’re joinable on NPI, but the joins are non-trivial and the temporal alignment between datasets is inconsistent.

Medicare fraud typologies require a clinical translation layer the public data doesn’t provide. “Upcoding” means billing a higher-complexity office visit code than the encounter warranted — but the encounter notes aren’t in the public data. “Unbundling” means billing lab tests separately instead of as a panel — detectable if you know which CPT codes should be bundled, but CMS doesn’t publish the bundling rules alongside the billing data. “Phantom billing” — services billed but not rendered — should show up as high volume with low unique beneficiaries, but that pattern also describes a legitimate high-throughput specialist.
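The phantom-billing ambiguity is easy to reproduce. The sketch below computes services per unique beneficiary from provider-level aggregates and applies a cutoff. The provider names, counts, and the cutoff of 50 are invented for illustration.

```python
# Hypothetical provider-level aggregates; the values are invented.
providers = {
    "legit_high_throughput": {"services": 12_000, "beneficiaries": 150},
    "phantom_biller": {"services": 11_500, "beneficiaries": 140},
    "typical_practice": {"services": 2_400, "beneficiaries": 800},
}

def services_per_beneficiary(record):
    """Billed services divided by unique beneficiaries; None if undefined."""
    if record["beneficiaries"] <= 0:
        return None
    return record["services"] / record["beneficiaries"]

def high_intensity(providers, cutoff=50.0):
    """Names of providers whose ratio exceeds a (hypothetical) cutoff."""
    flagged = []
    for name, record in providers.items():
        ratio = services_per_beneficiary(record)
        if ratio is not None and ratio > cutoff:
            flagged.append(name)
    return sorted(flagged)

flagged = high_intensity(providers)
```

Both the legitimate specialist and the phantom biller clear the cutoff with nearly identical ratios. Nothing in the aggregate distinguishes them, which is why the signal needs clinical context the public data lacks.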

The biggest gap isn’t any single missing field. PPP was one self-incriminating table — the fraud signals were in the application. Medicare is five tables that CMS treats as unrelated releases, and the information that would connect billing anomalies to fraud patterns is stripped or siloed.

The MCP Frontier

The tooling is getting better even as the data stays the same. openpharma-org/medicare-mcp wraps CMS data into Model Context Protocol (MCP) tool calls — search_providers (Part B, 2013–2023), search_prescribers (Part D by drug, NPI, specialty, and state), search_hospitals (inpatient utilization and payment), search_spending (drug spending trends), search_formulary (Part D plan coverage), plus hospital quality metrics including star ratings, readmission rates, and mortality.

This drops the barrier from “can you write Python and handle multi-gigabyte CMS data formats” to “can you ask the right question.” A compliance officer connected to this MCP server through Claude Cowork or another LLM-based tool can ask “which cardiologists in South Florida billed Medicare for more than 3x the state average for stress tests last year” and get a structured answer without opening a Jupyter notebook. A qui tam relator screening for potential leads can run the same kind of outlier identification the Python repos do (specialty-level billing comparisons, geographic clustering, provider-level anomaly detection) through conversation.

The MCP server fits into a broader pattern: DeepJudge built an MCP connector for searching a firm’s prior matters. Midpage connected legal research tools for citation verification. Domain-specific datasets are becoming LLM tools. For healthcare fraud, the dataset is there. But the LLM queries the same de-identified, aggregated, program-siloed data underneath. A better interface to limited data is still limited analysis.

Closing the Gap

The backtest pipeline described above is buildable on today’s public data, but every step is harder than it needs to be. The PPP dataset is the standard for what CMS could release. Five changes would close the gap.

Cross-program provider linkage. A provider’s Part B billing, Part D prescribing, DMEPOS orders, and Open Payments income currently arrive as separate datasets. A unified provider profile — one record per NPI per year linking all four programs — is what DOJ builds internally when it constructs a fraud case. The join key exists. CMS just doesn’t do the join publicly. All provider-level, no patient exposure.
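A unified profile is structurally simple once the extracts share a key. The sketch below merges four per-program tables into one record per NPI per year and records which programs contributed. The table contents and field names are hypothetical, not CMS column names.

```python
from collections import defaultdict

# Hypothetical per-program extracts keyed by (npi, year). Field names are
# illustrative, not CMS column names.
part_b = {("1000000005", 2020): {"part_b_billed": 900_000}}
part_d = {
    ("1000000005", 2020): {"part_d_claims": 9_800},
    ("1000000001", 2020): {"part_d_claims": 1_200},
}
dmepos = {("1000000005", 2020): {"dmepos_orders": 310}}
open_payments = {("1000000001", 2020): {"industry_payments": 4_500}}

def unified_profiles(**programs):
    """One record per (npi, year); `programs_present` lists which sources matched."""
    profiles = defaultdict(lambda: {"programs_present": []})
    for name, table in programs.items():
        for key, fields in table.items():
            profiles[key].update(fields)
            profiles[key]["programs_present"].append(name)
    return dict(profiles)

profiles = unified_profiles(
    part_b=part_b, part_d=part_d, dmepos=dmepos, open_payments=open_payments
)
```

The merge is the easy part. The work an outside analyst actually faces sits upstream of it: normalizing each year's file format and reconciling release schedules so that the same (npi, year) key means the same thing in every table.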

Longitudinal billing trends. The current releases are annual snapshots. A provider whose billing doubles year-over-year, whose specialty mix shifts suddenly, or whose patient panel changes dramatically doesn’t show that trajectory in a single year’s file. Multi-year trend data at the provider level would add the temporal dimension that made PPP anomaly detection work: the Dicklesworthstone pipeline flagged lenders partly because their approval volumes spiked in ways that didn’t match normal lending behavior over time.

Provider-level outcome data. CMS already publishes hospital-level quality metrics — star ratings, readmission rates, mortality. Extending aggregate, de-identified outcome data to the provider level would let outside analysts distinguish high-billing providers whose patients do well (legitimate high-acuity practice) from high-billing providers whose patients fare worse (potential overtreatment, fraud, or low-quality care). Billing volume alone can’t make that distinction. Billing volume paired with outcomes can.

Structured fraud typologies. This is the biggest gap between PPP and Medicare — and the one the backtest pipeline would most benefit from. PPP fraud typologies were self-evident from the data. Medicare fraud typologies require clinical translation. DOJ press releases describe schemes in prose. There’s no structured dataset mapping those descriptions to billing signatures.

For each LEIE exclusion — or at least a representative sample — CMS could publish the fraud typology (structured categories, not narrative), the billing signature (which CPT codes, what volume patterns, what geographic and temporal markers), and the pre-exclusion billing trajectory from the provider utilization data. A dataset that says “this is what upcoding looks like in Part B claims — here are 50 confirmed examples with their billing patterns before exclusion.” That’s what would turn the backtest from a feasible-but-painful exercise into a practical detection tool. DOJ’s FOCUS initiative tells data miners to bring more sophisticated analysis. Releasing typology data would give them something to apply that sophistication to.

Coding and bundling reference data. CMS knows which CPT codes should be billed together, which modifier combinations are legitimate, and what normal utilization ranges look like by specialty and geography. Publishing that reference data alongside the billing data would let outside analysts flag unbundling and upcoding against a known standard — the way the PPP pipeline flagged impossible NAICS-employee combinations against business registration norms — instead of relying on statistical deviation from peer averages.
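With a bundling table in hand, an unbundling check against provider-level counts could look like the sketch below. The panel and component codes are placeholders, not real CPT codes. The bundling table stands in for reference data CMS does not currently publish alongside the billing files, and the ratio is one possible heuristic rather than an established measure.

```python
# Hypothetical bundling table: panel code -> component codes that would
# normally be billed together as that panel. Placeholders, not CPT codes.
BUNDLES = {"PANEL_A": ("TEST_1", "TEST_2", "TEST_3")}

def unbundling_ratio(service_counts, panel, components):
    """Share of panel-eligible volume that was billed as separate components."""
    panel_n = service_counts.get(panel, 0)
    # The smallest component count bounds how many full panels could have been split.
    split_n = min(service_counts.get(code, 0) for code in components)
    total = panel_n + split_n
    return split_n / total if total else 0.0

# Hypothetical per-provider service counts by code.
provider_x = {"PANEL_A": 10, "TEST_1": 400, "TEST_2": 390, "TEST_3": 410}
provider_y = {"PANEL_A": 380, "TEST_1": 20, "TEST_2": 5, "TEST_3": 30}

ratio_x = unbundling_ratio(provider_x, "PANEL_A", BUNDLES["PANEL_A"])
ratio_y = unbundling_ratio(provider_y, "PANEL_A", BUNDLES["PANEL_A"])
```

Provider X bills almost all of its panel-eligible volume as separate tests, while provider Y almost never does. Because the comparison is against a known rule rather than a peer average, it would still work in a specialty where unbundling is common enough to look statistically normal.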

All five proposals expose provider behavior, not patient identity. Open Payments already publishes provider-identified pharmaceutical industry payment data — full names, searchable, downloadable — under the Physician Payments Sunshine Act. The precedent for provider-level transparency exists.

The data for a Medicare fraud detection pipeline already exists — historical billing, exclusion labels with dates and type codes, NPI linkage across programs. An outside analyst can build the backtest today. But each step requires joining datasets CMS treats as unrelated, aligning temporal windows across files published on different schedules, and translating legal exclusion categories into billing features without reference data. The five proposals above wouldn’t just make new analysis possible — they’d make the analysis that’s already possible practical. The PPP experience showed what happens when the friction drops: outside analysts find the patterns enforcement finds, and they find them in the long tail where DOJ doesn’t have resources to look. Medicare’s long tail is estimated at 3–10% of total spending — $100–350 billion annually. The data to surface it already exists. It’s sitting inside CMS.

This post is part of the Data Analytics and Fraud series on LegalRealist AI. It is intended for informational and educational purposes only and does not constitute legal advice. The data access proposals discussed here are the author’s analysis — not policy recommendations. Open-source repositories referenced are third-party projects not affiliated with LegalRealist AI. AI capabilities, data availability, and enforcement practices described here reflect publicly available information as of the publication date and are subject to change. Laws governing healthcare fraud, data privacy, and qui tam litigation vary by jurisdiction.

