
TL;DR
- Public data can source prosecution leads. An open-source fraud-scoring system, run against the full SBA PPP dataset, identified the same lenders, geographies, and loan populations that DOJ prosecuted — using nothing but a downloadable CSV.
- Every top-flagged lender appeared in DOJ enforcement actions. Harvest, Capital Plus, Benworth, Kabbage — the system’s highest-risk lenders are the same institutions whose loans DOJ has been prosecuting. The scoring methodology produces outputs consistent with known enforcement targets.
- The geographic clusters match prosecution geography. South suburban Cook County, metro Atlanta, and South Florida — the three regions where DOJ has concentrated PPP enforcement — are the same three regions the scoring system flags at 3.2–4.4x overrepresentation.
- The system identifies where fraud lives. The application file proves it. In every prosecution I checked, the scoring system correctly flagged the right neighborhood, the right lender, the right population. What converted the lead into a case was the SBA’s internal data — certifications, payroll records, bank activity — which isn’t in the public dataset.
- DOJ found the big cases. The small ones need data miners — and better data. Kabbage and Robertson are cases DOJ builds with internal resources. The $20K fictitious sole proprietorship is the case DOJ can’t staff. FOCUS asks data miners to bring better analysis. This system is what better analysis looks like — and it shows that the bottleneck is data, not skill.
DOJ’s Pandemic Analytics Center of Excellence (PACE) cross-referenced 33 million PPP and EIDL applications against Social Security, HUD, and IRS records to source fraud cases. Its prosecutors have since brought hundreds of PPP prosecutions. An open-source scoring system, running on nothing but the SBA’s public FOIA data, identifies the same lenders, the same geographies, and the same loan populations that DOJ acted on. Public data can source prosecution leads. The question is what happens with them — and whether the government will give data miners enough to reach the cases DOJ can’t staff.
The Scoring System#
The Dicklesworthstone PPP fraud analysis is an open-source, three-stage scoring system built by Jeff Emanuel — a former hedge fund analyst turned developer with 21,000+ GitHub stars across 170+ projects — to process the full 8.4GB SBA FOIA dataset: 11.5 million loans, every field structured and machine-readable. It’s not the only open-source PPP analysis — the Washington Post’s investigative team cross-referenced SBA data against SEC filings to identify public companies that took loans meant for small businesses, a different approach to the same underlying thesis. But the Dicklesworthstone system is the most comprehensive fraud-scoring effort: 3,281 lines of Python, pre-computed output publicly available, reproducible by cloning the repo and running it overnight on a standard machine. It flagged 1.19 million loans out of 6.27 million in the $5K–$22K range as suspicious — a 19% flag rate. That sounds high until you consider the baseline: the SBA OIG estimates at least $200 billion in fraud out of ~$800 billion disbursed, roughly 25% program-wide. A 19% flag rate in the most fraud-vulnerable segment — self-certified sole proprietor loans — is consistent with that estimate, if anything conservative. The flags are based on a weighted composite of six detection strategies:
Business name analysis. Over 100 regex patterns match against borrower names, each assigned a suspicion weight. Names referencing luxury brands, fictional places (“Wakanda”), or get-rich-quick phrases (“Quick Cash,” “Free Money”) score near the maximum. More subtle signals include unusually short names (suspicious loans average 1.8 words vs. 2.5 overall, confirmed by Mann-Whitney U test at p < 0.01), multiple spaces suggesting copy-paste errors, and generic structures that don’t match how legitimate businesses typically name themselves.
Address network mapping. The system builds address-to-business graphs. When multiple “businesses” share a single residential address — say, four sole proprietorships at Apt 4B — each loan gets 15 risk points per overlap plus 5 per connected business.
Temporal clustering. The system tracks loan volumes by ZIP code and SBA office per day. If a ZIP code that averages 8 loans per day suddenly spikes to 50 on a single date, each loan from that spike gets a logarithmic risk boost. Sequential or near-sequential loan numbers — the last five digits differing by fewer than 20 — add 25 points, catching batch submissions from loan agents processing fabricated applications in rapid succession.
Lender risk weighting. A dictionary maps lenders to risk weights based on known enforcement history. Kabbage, for instance, is assigned a 0.8 weight — the system adds 12 risk points (15 × 0.8) to any Kabbage-originated loan that also carries other flags. The weighting is conservative: lender risk alone doesn’t trigger a flag. It amplifies existing signals.
Loan amount and demographic patterns. Loans hitting exactly $20,832 or $20,833 — the PPP maximum for a single employee — get 25 points. Suspicious loans cluster at $19K mean vs. $17K overall. Blank demographic fields compound with other flags.
Each loan gets a composite risk score. The system uses a threshold of 100 to flag, then a cutoff of 140 to sort the highest-risk loans for detailed analysis. The system minimizes false positives through known-business verification, contextual checks, and multi-factor requirements — a loan needs several independent risk signals to clear the threshold. Certain flag combinations use multiplicative interaction rather than additive stacking.
The engineering underneath is serious. The 8.4GB dataset is processed in chunks with peak memory capped at 1–2GB — this runs on a standard machine, not a cluster. Statistical tests validate every major finding: Mann-Whitney U on business name length (p < 0.01), chi-square on demographic patterns (p < 0.001), t-tests on loan amount distributions (p < 0.05). A secondary XGBoost analysis tests whether the flagged population is statistically distinguishable from the general population on dimensions the primary heuristics don’t directly measure. The entire system — data source, code, parameters, thresholds, output — is public and reproducible.
A scope caveat: the system analyzes only small-dollar loans, predominantly sole proprietors and self-employed filers. This is the population most vulnerable to fraud (applications were largely self-certified) but also the population most likely to include legitimate filers caught by heuristics not designed for them. “Suspicious” here means “statistically anomalous on multiple risk dimensions.” It does not mean “fraudulent.”
I didn’t build, modify, or run this system. I’m examining whether the risk concentrations it surfaces — using only the public CSV — correspond to the patterns that federal investigators acted on with subpoena power and non-public agency data. The Government Already Has the Data described what the government has; The Data Miner’s Dilemma described what outsiders don’t. The output puts numbers on that gap.
The Overlay#
Lenders#
The most overrepresented lenders — those appearing in the suspicious loan dataset far more often than in the general population — are the same institutions that originated the loans DOJ has been prosecuting:
| Rank | Lender | Overrep. | DOJ Enforcement |
|---|---|---|---|
| #1 | Harvest Small Business Finance | 2.81x | Named as originator in multiple borrower fraud prosecutions; served as lender for Womply-routed applications |
| #3 | Capital Plus Financial | 2.46x | Lender for Blueacorn-routed loans; both Blueacorn co-founders prosecuted — Reis sentenced to 10 years/$66M restitution, Hockridge convicted of conspiracy |
| #5 | Prestamos CDFI | 2.26x | Lender for Womply-routed applications; 8,839 unforgiven Texas loans totaling $143M |
| #6 | Benworth Capital | 2.22x | Named as originator across multiple fraud ring indictments in Brooklyn and Oklahoma federal court |
| #8 | Cross River Bank | 1.99x | Bank partner for fintech lenders including Blueacorn; processed billions in PPP loans |
| #13 | Kabbage | 1.33x | $120M DOJ settlement — inflated loans, removed underwriting steps, submitted thousands of fraudulent applications |
Source: Scoring output from Dicklesworthstone/ppp_loan_fraud_analysis. Enforcement outcomes from DOJ press releases and cited sources.
A caveat this table needs: these lenders appear at the top partly because they were the largest fintech-partnered PPP lenders by application volume. Harvest, Capital Plus, and Prestamos collectively processed hundreds of thousands of small-dollar loans — the exact population the system scores. The overrepresentation may partly reflect “fintech-processed sole proprietor loans” rather than “fraud” per se. The lender overlay alone would be trivially expected from volume data. Its value is as a calibration check: the scoring methodology produces outputs consistent with known enforcement targets, which gives the geographic and temporal findings — harder to dismiss as volume artifacts — more weight.
What DOJ found underneath: Blueacorn, the fintech routing applications through Capital Plus and other lenders, ran a “VIPPP” service that coached borrowers on submitting false applications and charged kickbacks based on a percentage of loan proceeds. Kabbage removed underwriting steps to maximize processing fees and discouraged fraud reviewers from requesting additional documentation. The scoring system sees the statistical shadow these failures cast across the dataset. DOJ proved the failures themselves.
Kabbage’s position at #13 rather than #1 illustrates a real distinction, though not one the system was designed to make. Kabbage’s fraud was institutional — inflating existing loan amounts, weakening controls — rather than volumetric, processing huge numbers of questionable applications through fintech passthrough arrangements. The heuristics respond to volume-based patterns, so per-loan manipulation registers differently.
Geography#
The geographic output is dominated by three regional clusters — the same three regions where DOJ has concentrated PPP enforcement.
South suburban Cook County, Illinois: Harvey (4.28x overrepresentation, 1,933 flagged loans), Riverdale (4.33x, 1,554), Dolton (4.20x, 2,838), Calumet City (4.21x, 3,756), Bellwood (4.25x, 1,869), Maywood (4.21x, 2,160). Chicago proper at 3.78x with 98,244 flagged loans. DOJ has prosecuted dozens of cases here. In Dolton, federal prosecutors indicted police officer William Frederick Reed for fabricating payroll — Reed lived in Hazel Crest, also flagged at 4.06x. In Calumet Park (4.13x), the Illinois AG charged Raymond Harris, a Cook County Sheriff’s Office employee who fabricated a sole proprietorship to obtain $41,000.
Metro Atlanta, Georgia: Lithonia (3.70x, 6,452), Jonesboro (3.54x, 4,569), Riverdale GA (3.73x, 4,150), Decatur (3.35x, 7,234), Atlanta (3.29x, 34,417). Former assistant city attorney Shelitha Robertson was sentenced to over seven years for a $15M fraud scheme. The Northern District of Georgia separately charged ten individuals in a $9M ring.
South Florida: Opa-Locka (4.21x), Belle Glade (3.42x, 429), Lauderdale Lakes (3.29x, 1,592), Lauderhill (3.22x, 2,840). The Southern District of Florida hosts one of three national COVID-19 Fraud Strike Force teams and has charged over 185 people in PPP fraud schemes totaling $220M+. Miami-Dade, Broward, and Palm Beach counties received 4.2% of all PPP loans by count despite representing 1.8% of the national population — a disproportionate share that the scoring system independently identifies.
Three confounding factors apply to all three clusters. First, demographics: the system’s $5K–$22K range captures sole proprietor loans filed disproportionately in lower-income communities. Some overrepresentation may reflect who filed small-dollar PPP loans, not who committed fraud. Second, prosecution capacity: NDIL, NDGA, and SDFL are large, well-resourced U.S. Attorney’s offices. Third, fintech marketing: Blueacorn and Womply marketed aggressively in specific communities; geographic clusters may partly trace marketing reach. These factors may inflate the overrepresentation ratios. But Reed, Harris, Robertson, and the 185 South Florida defendants aren’t demographic artifacts.
From Lead to Case#
Reed (Dolton, IL): The system places his loan in a 4.20x geographic cluster — correctly. DOJ proved he fabricated his security company’s payroll, earned $189,000 as a police officer while claiming he needed a $5,862 PPP loan, and concealed the proceeds in a bankruptcy filing. The lead was in the public data. The proof was in the application file.
Harris (Calumet Park, IL): The system might flag the loan for a suspicious NAICS code or business name pattern — correctly. The IL AG proved the business never existed, using the application itself and the absence of any state business registration. The lead was in the public data. The proof was in the state records.
Robertson (Atlanta, GA): The system sees metro Atlanta’s 3.29x overrepresentation — correctly. DOJ proved she submitted applications for companies she controlled, received $15M, and used the proceeds to buy a Rolls-Royce and jewelry. The lead was in the public data. The proof was in bank records and transaction tracing.
In each case, the public data sourced a lead that pointed at a real prosecution. What converted the lead into a case was the SBA’s internal data — borrower certifications, payroll records, bank activity. The government doesn’t have better algorithms. It has better data. And it won’t share it.
The Cases DOJ Can’t Staff#
DOJ didn’t need this analysis to find Kabbage. A lender processing $7 billion in PPP loans with broken fraud controls surfaces itself. The $120M settlement, Robertson’s $15M scheme, the Blueacorn $65M conspiracy — DOJ builds these with PACE, cross-agency referrals, and multi-agency task force capacity. Big fish generate their own signal.
The question is below that threshold. The SBA referred 562,000 suspected fraudulent loans to Treasury for collection in April 2026 — $22.2 billion flagged but never previously investigated. Congress extended the statute of limitations to ten years. Enforcement runs through 2031. And DOJ has finite attorneys.
A $20,000 fabricated sole proprietorship in Calumet Park isn’t a case DOJ originates on its own. The expected recovery doesn’t justify the investigative cost. But it’s exactly the kind of case a data miner files as a qui tam — low dollar amount, pattern-based identification, high volume. Harris’s $41,000 scheme was caught by the Illinois AG, not DOJ. The analysis flags the population where these cases live. Multiply Harris by the thousands of statistically similar loans in the same geographic and lender clusters, and you’re looking at the bulk of PPP fraud by case count — the cases DOJ can’t reach without outside help.
The Scoring System as Exhibit A#
FOCUS — DOJ’s initiative to improve data miner qui tam quality — asks relators to bring better analysis. This scoring system is what better analysis looks like — and the three prosecutions above demonstrate that it sources leads pointing at real cases. The ceiling isn’t analytical. It’s informational.
This is the argument The Data Miner’s Dilemma made in theory. The scoring system makes it in evidence. FOCUS is a demand-side intervention — it wants higher-quality filings. But the system shows that the ceiling isn’t analytical. It’s informational. Data miners have already built tools that identify the same risk concentrations DOJ acts on. The PPP natural experiment proved that when the government releases better data — even involuntarily, via court-ordered FOIA — data miners produce $850M+ in enforcement outcomes. FOCUS asks for better output from the same inputs. This system is the best possible output from those inputs. If that’s not sufficient, the question isn’t whether data miners need better algorithms.
What would better inputs look like? Not full application files — those contain sensitive personal information. But the SBA could release forgiveness status by lender, anonymized payroll-to-loan ratios, and cross-program flags showing whether the same borrower applied for both PPP and EIDL with inconsistent employee counts. None of this identifies individual borrowers. Cross-program flags are the strongest lever: if Harris claimed zero employees on his EIDL application and five on his PPP application, that inconsistency is in the government’s data right now. Releasing it anonymized wouldn’t expose Harris. It would expose the pattern — and the pattern is what moves data miners from “statistically anomalous” toward “inconsistent with the borrower’s own filings.”
The SBA already runs this analysis internally. The GAO found that SBA’s own analytics contributed to $4.7 billion in loan proceeds not being forgiven and 669,000 referrals to the OIG. The data exists. The analytical framework exists. The question is whether the government treats private enforcement capacity as a resource to be equipped or a nuisance to be managed. The system demonstrates that the analytical talent is there. The data isn’t.
Further Reading#
- Dicklesworthstone PPP Fraud Analysis Pipeline. The open-source scoring tool and pre-computed output used in this post.
- Washington Post PPP Loans Database. The Post’s investigative team cross-referenced PPP recipients against SEC filings to identify public companies that took small-business loans.
- SBA PPP FOIA Dataset. The full public PPP loan dataset.
- House Select Subcommittee Report on Fintech Fraud in PPP. Findings from 83,000 pages of subpoenaed lender documents.
- GAO: COVID Relief Fraud Schemes and Indicators. Analysis of 330 PPP and COVID-EIDL fraud cases and SBA data analytics capabilities.
- Kabbage $120M DOJ Settlement. Constantine Cannon’s analysis of the largest PPP lender settlement.
- SBA Refers 562,000 Suspected Fraudulent Loans to Treasury. April 2026 announcement on $22.2 billion in delinquent PPP and EIDL loans.
- PPP Fraud Enforcement Survey. Benesch’s overview of civil and criminal enforcement trends.
- Fraud and Abuse in the Paycheck Protection Program. Academic study cross-referencing PPP loans against SEC filings to estimate fraud rates in investment advisory firms.
- The Government Already Has the Data. Series post #1 on how DOJ, CMS, and IRS source fraud cases from structured data.
- The Data Miner’s Dilemma. Series post #2 on the gap between data miner analysis and DOJ’s evidentiary bar.
This post is part of the Data Analytics and Fraud series on LegalRealist AI. It is intended for informational and educational purposes only and does not constitute legal advice. The scoring system output discussed in this post is a statistical analysis of publicly available data — it does not identify fraud, and no individual, business, or lender discussed here should be presumed to have committed fraud based on statistical overrepresentation alone. Enforcement statistics, scoring output, and data availability described here reflect publicly available information as of the publication date and are subject to change. Laws governing fraud enforcement, the False Claims Act, and the public disclosure bar vary by jurisdiction and are evolving.



