The Fundamental Limits #
On April 18, 2026, Andrew Dietderich — co-head of Sullivan & Cromwell’s global restructuring group, a partner at one of the most prestigious law firms in the world — sent a letter to Chief Judge Martin Glenn of U.S. Bankruptcy Court in Manhattan. Attached was a three-page, single-spaced chart of fabricated case citations, invented quotations, and incorrect case numbers that the firm had submitted in a motion on behalf of a client. The errors were AI-generated. They were caught not by Sullivan & Cromwell’s own review process, but by opposing counsel at Boies Schiller Flexner.
Sullivan & Cromwell has comprehensive AI policies. It has training requirements. It has citation review procedures. None of them prevented this filing from reaching the court. Partners at the firm charge roughly $2,000 per hour for bankruptcy work.
Every legal AI product runs on large language models. Every large language model hallucinates. This is not a defect that better engineering will fix. It is a consequence of how the technology works. Before evaluating any legal AI tool — a task we take up in the next post in this series — you need to understand this constraint.
Why LLMs Hallucinate: Indeterminacy by Design #
The transformer architecture underlying every LLM is a probabilistic system. It doesn’t retrieve facts from a database. It predicts the most statistically likely next token — the next word fragment — given everything that came before it. When you ask an LLM to cite a case, it isn’t looking up that case. It is generating a sequence of characters that looks like a citation, based on patterns it absorbed during training. If the statistically likely sequence happens to match a real case with an accurate holding, the output is correct. If it doesn’t, the output is a hallucination. The model has no mechanism to distinguish between the two.
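To see what that means in practice, here is a toy sketch in Python of the sampling step at the heart of every transformer. The prompt fragment and the probability table are invented for illustration, and no real model is involved; the point is structural. Nothing in this loop consults a database or checks whether the citation the tokens eventually spell out actually exists.

```python
import random

# Toy next-token probabilities after the prompt fragment
# "The leading case on this point is". Numbers are invented for
# illustration; a real transformer scores every token in its
# vocabulary in the same way.
NEXT_TOKEN_PROBS = {
    "Mata":    0.31,  # continues toward a real citation
    "Smith":   0.28,  # continues toward a plausible but fabricated one
    "Johnson": 0.22,
    "Whiting": 0.19,
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick one token in proportion to its probability.

    Note what is absent: no database lookup, no check that the chosen
    token leads to a citation that exists. Statistical plausibility is
    the only criterion the architecture applies.
    """
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(NEXT_TOKEN_PROBS))
```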
This is what makes hallucination fundamentally different from a software bug. A bug is a deviation from intended behavior that can be identified and patched. Hallucination is the intended behavior — probabilistic text generation — producing an unintended result. You can reduce the frequency with better engineering. You cannot eliminate it without replacing the architecture, because the architecture is designed to generate plausible language, not to verify truth.
MIT researchers found that AI models use more confident language when hallucinating than when stating facts. The prose reads identically whether the citation is real or fabricated.
How Bad Is It? #
Stanford RegLab researchers quantified the problem. General-purpose LLMs hallucinate between 58% and 88% of the time when asked specific, verifiable legal questions about federal court cases. On tasks requiring precedential reasoning — determining whether one case supports or contradicts another — most models performed no better than a coin flip. The models were worst on lower-court cases and state-level law.
The Sanctions Landscape #
A database maintained by researcher Damien Charlotin at HEC Paris has cataloged more than 1,200 cases worldwide in which hallucinated AI-generated content was submitted to a court, up from fewer than 200 a year earlier.
In Whiting v. City of Athens, a Sixth Circuit case, attorneys Van R. Irion and Russ Egli were sanctioned $15,000 each and ordered to reimburse the opposing party’s fees in full. A DOJ attorney resigned after fabricated quotations were discovered in a government filing — caught by a pro se plaintiff, not by the DOJ’s review process. A federal court in Alabama disqualified attorneys from a case entirely, even though the firm had an internal AI policy. In a Colorado defamation case, the judge sanctioned two attorneys $3,000 each for a filing with more than two dozen errors, including hallucinated cases.
A California appellate court went further: it sanctioned one attorney for filing AI-generated fake citations and also penalized the opposing counsel for failing to detect and report them.
Sanctions in Q1 2026 totaled at least $145,000, the highest quarterly total recorded to date.
Grounding Techniques: Working Around the Architecture #
Since hallucination can’t be eliminated at the model level, every serious legal AI tool builds layers of containment around it.
Retrieval-Augmented Generation (RAG) #
The baseline technique. Instead of generating from training data, the system first retrieves relevant documents from a verified database and provides them as context. CoCounsel retrieves from Westlaw; Protégé retrieves from LexisNexis; Everlaw retrieves from the case document corpus.
RAG doesn’t change the model’s architecture — the model is still generating probabilistic text. What RAG changes is the input: by placing verified documents in the context window, it shifts the probability distribution toward outputs grounded in real sources. But the model can still ignore the retrieved context, mischaracterize it, or blend it with patterns from training data. RAG reduces the hallucination rate. It does not change the mechanism that produces hallucinations.
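The pattern itself is simple enough to sketch. What follows is a minimal illustration of retrieve-then-generate, not any vendor’s pipeline; `search_verified_database` and `call_llm` are hypothetical placeholders that return canned values so the sketch runs without a real index or model.

```python
def search_verified_database(query: str, top_k: int = 5) -> list[str]:
    """Placeholder for a search over a verified corpus (for example a
    citator-backed case-law index). Returns canned text so the sketch runs."""
    return ["Excerpt: the cited case was affirmed on appeal in 2021."][:top_k]

def call_llm(prompt: str) -> str:
    """Placeholder for any LLM API call."""
    return f"[model output conditioned on {len(prompt)} characters of context]"

def answer_with_rag(question: str) -> str:
    # 1. Retrieve verified documents before generating anything.
    excerpts = search_verified_database(question)

    # 2. Put them in the context window. This shifts the model's probability
    #    distribution toward the retrieved text; it does not stop the model
    #    from ignoring or mischaracterizing it.
    context = "\n\n".join(excerpts)
    prompt = (
        "Answer using ONLY the sources below. If the sources do not answer "
        "the question, say so.\n\n"
        f"SOURCES:\n{context}\n\nQUESTION: {question}"
    )

    # 3. Generation is still probabilistic; the grounding lives in the input.
    return call_llm(prompt)

print(answer_with_rag("Has this holding been reversed?"))
```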
Citation Verification #
A post-generation check. CoCounsel runs outputs against KeyCite to flag cases with negative treatment. Protégé does the same with Shepard’s Citations. LexisNexis describes a minimum of five quality checkpoints per prompt.
These systems check whether a cited case still says what the model claims — whether the holding has been reversed, distinguished, or superseded. This is the specific advantage publisher tools hold over startups: Shepard’s and KeyCite are decades-old verification systems with no equivalent outside Thomson Reuters and LexisNexis.
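A stripped-down version of the post-generation check looks like the sketch below. It is not KeyCite or Shepard’s, both of which are proprietary services; the verified set and the citation pattern are toy stand-ins, and the docstring flags the limit that the “What Remains Unsolved” section returns to.

```python
import re

# Toy stand-in for a citator lookup (KeyCite and Shepard's are proprietary
# services; here, a plain set of citations the checker treats as verified).
VERIFIED = {
    "678 F. Supp. 3d 443",  # Mata v. Avianca (real)
}

# Crude pattern for federal reporter citations, e.g. "678 F. Supp. 3d 443".
CITE_PATTERN = re.compile(r"\d+\s+F\.(?:\s?Supp\.)?(?:\s?\dd?)?\s+\d+")

def flag_unverified_citations(draft: str) -> list[str]:
    """Return citations in the draft that fail the lookup.

    Note the limit: this catches citations that do not resolve to a
    verified case. It cannot tell whether a real case actually supports
    the proposition the draft attaches to it.
    """
    found = CITE_PATTERN.findall(draft)
    return [cite for cite in found if cite not in VERIFIED]

draft = "See Mata v. Avianca, 678 F. Supp. 3d 443; accord Doe v. Roe, 999 F. Supp. 3d 111."
print(flag_unverified_citations(draft))  # -> ['999 F. Supp. 3d 111']
```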
Knowledge Graphs #
LexisNexis integrates its Shepard’s Citation Knowledge Graph into the retrieval pipeline — a technique called GraphRAG that retrieves entire subgraphs of related cases, citation hierarchies, and legal taxonomies rather than isolated documents. Research on graph-based hallucination detection (HalluGraph) shows this catches entity-level errors — swapping party names, misattributing holdings — that similarity-based retrieval misses.
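A toy version of graph-based retrieval shows the difference from similarity search. The citation graph below is invented, and a production knowledge graph such as Shepard’s also carries treatment labels and topic nodes; the breadth-first expansion from a seed case to its neighborhood is the core idea.

```python
from collections import deque

# Toy citation graph: case -> cases it cites. Invented structure for
# illustration only.
CITATION_GRAPH = {
    "Case A": ["Case B", "Case C"],
    "Case B": ["Case D"],
    "Case C": [],
    "Case D": [],
}

def retrieve_subgraph(seed_case: str, max_hops: int = 2) -> set[str]:
    """Breadth-first walk from the seed case, returning its neighborhood.

    Where plain similarity search returns isolated documents, graph
    retrieval hands the model the seed case plus the authorities it
    rests on, so entity-level errors (wrong party, misattributed
    holding) can be checked against related nodes.
    """
    seen = {seed_case}
    frontier = deque([(seed_case, 0)])
    while frontier:
        case, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for cited in CITATION_GRAPH.get(case, []):
            if cited not in seen:
                seen.add(cited)
                frontier.append((cited, hops + 1))
    return seen

print(retrieve_subgraph("Case A"))  # {'Case A', 'Case B', 'Case C', 'Case D'}
```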
Multi-Model Consensus #
Luminance’s “Panel of Judges” architecture runs multiple models independently on the same clause and requires probabilistic agreement before surfacing a result. This exploits the indeterminacy: if two independent probabilistic systems produce the same answer, that answer is more likely grounded in the input than generated from noise. The trade-off is cost — running multiple models on every clause multiplies compute.
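The agreement gate itself is straightforward to sketch. This is not Luminance’s implementation; it assumes each model returns a label for the clause and that a configurable fraction of the panel must agree before anything is surfaced.

```python
from collections import Counter

def consensus_label(clause: str, models: list, min_agreement: float = 0.75):
    """Run several independent models on the same clause and only surface
    a result when enough of them agree.

    `models` is a list of callables, each standing in for an independent
    model; `min_agreement` is the fraction of the panel that must return
    the same label before the answer is surfaced.
    """
    votes = Counter(model(clause) for model in models)
    label, count = votes.most_common(1)[0]
    if count / len(models) >= min_agreement:
        return label
    return None  # disagreement: escalate to human review instead of guessing

# Toy panel: three "models" that agree, one that does not.
panel = [
    lambda clause: "indemnification",
    lambda clause: "indemnification",
    lambda clause: "indemnification",
    lambda clause: "limitation of liability",
]
print(consensus_label("The Supplier shall indemnify...", panel))  # 'indemnification'
```

The useful failure mode is the `None` branch: disagreement becomes a signal to route the clause to a human rather than a reason to pick the most confident-sounding answer.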
Human-in-the-Loop Review #
EvenUp uses the most resource-intensive approach: over 100 nurses, paralegals, and lawyers review every output before delivery, with corrections feeding back into model training. In personal injury — where a missing medical bill can cost $5,000 in settlement value — the economics justify it.
Constrained Generation #
Everlaw’s AI is grounded exclusively in the case document corpus. When insufficient evidence exists, the system is designed to say so rather than generate from general knowledge. LegalOn constrains outputs to attorney-authored playbooks. Both narrow the model’s output space — reducing the probability of hallucination by reducing the space in which it can occur.
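A minimal sketch of that refusal behavior follows, assuming some relevance score over the corpus; the keyword-overlap scorer below is a placeholder for whatever retrieval scoring a real product uses.

```python
def score_relevance(question: str, document: str) -> float:
    """Placeholder relevance score in [0, 1]: crude keyword overlap,
    purely for illustration."""
    q_terms = set(question.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def answer_from_corpus(question: str, corpus: list[str], threshold: float = 0.4):
    """Answer only when the case corpus contains sufficiently relevant
    material; otherwise refuse rather than fall back on general knowledge."""
    scored = sorted(corpus, key=lambda doc: score_relevance(question, doc), reverse=True)
    best = scored[0] if scored else ""
    if score_relevance(question, best) < threshold:
        return "Insufficient evidence in the case documents to answer."
    return f"Based on the case documents: {best}"

corpus = ["The deposition of Jane Smith was taken on March 3."]
print(answer_from_corpus("When was Jane Smith deposed?", corpus))
print(answer_from_corpus("What damages model applies here?", corpus))
```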
How Much Do These Techniques Help? #
Research shows RAG alone reduces hallucinations by roughly 71% compared to ungrounded generation. Applied to the 58–88% baseline for general-purpose LLMs, a reduction of that size works out to a residual hallucination rate of roughly 17–26%.
But a Stanford study tested the actual RAG-based products from LexisNexis and Thomson Reuters — tools that layer RAG with citation verification, knowledge graphs, and proprietary legal databases. Even with all of those techniques combined, these tools hallucinated between 17% and 34% of the time. LexisNexis’s Lexis+ AI performed best at 17%. Westlaw AI-Assisted Research hallucinated 34% — twice the rate.
Some of those hallucinations were subtler than outright fabrication: a cited case might be real but mischaracterized, or a legal proposition attributed to a source that says something different. The Stanford researchers called providers’ claims of eliminating hallucinations “overstated.”
The gap between 88% (raw LLM) and 17% (best grounded tool) is real progress. The gap between 17% and zero is where every unsolved problem lives.
What Remains Unsolved #
Even layered defenses leave gaps, because the underlying indeterminacy is still there.
Mischaracterization is harder to catch than fabrication. Inventing a nonexistent case is easy to detect — a lookup confirms it. Citing a real case but misstating its holding is far harder. Citation verification catches reversed or overruled cases; it doesn’t catch subtle misrepresentation. The Stanford study found this type of hallucination common even in the best RAG-based tools.
Novel questions expose the limits of retrieval. Every grounding technique works best when the answer exists clearly in the database. For novel legal theories, emerging regulatory frameworks, or cross-jurisdictional questions with sparse authority, the retrieval system finds nothing on point. The model’s indeterminacy fills the gap: it generates plausible-sounding reasoning with no actual legal basis, and the retrieval system has no document to check it against.
Verification remains non-delegable. ABA Formal Opinion 512 (July 2024) addresses lawyers’ use of generative AI under their existing duties, including competence and confidentiality: a lawyer must understand a tool’s limitations and remains responsible for the work product it helps produce. Courts have applied the same principle to accuracy: your duty to verify citations applies regardless of source. The tools that make verification easiest — inline citations, confidence flagging, integrated Shepard’s or KeyCite — reduce the time verification takes. They cannot reduce the obligation.
When vendors quote time savings, they rarely account for verification overhead. If a tool drafts a motion in 10 minutes but requires 45 minutes of citation checking, the relevant comparison is 55 minutes against how long the motion would have taken without the tool, not the 10-minute draft against it.
Next in this series: The Tools — ten legal AI products across litigation and corporate practice, what they do, what they run on, and what they cost.
Further Reading #
- Large Legal Fictions (Stanford RegLab). The foundational study on LLM hallucination rates in legal contexts (58–88%).
- Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (Stanford HAI). The follow-up testing RAG-based tools (17–34%).
- Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y. 2023). The landmark sanctions opinion — $5,000 fine for submitting ChatGPT-fabricated citations.
- Johnson v. Dunn, No. 2:21-cv-1701 (N.D. Ala. July 23, 2025). Court disqualifies attorneys from the case for AI hallucinations, finding monetary sanctions insufficient as a deterrent.
- Whiting v. City of Athens (6th Cir. March 2026). Sixth Circuit imposes $30,000 in sanctions for two dozen fabricated citations.
- Noland v. Land of the Free, L.P. (Cal. Ct. App. 2025). Court sanctions filer and penalizes opposing counsel for failing to detect AI fabrications.
- AI Hallucination Cases Database. Damien Charlotin’s tracker of 1,200+ court filings with AI-generated fabrications.
- ABA Formal Opinion 512. ABA guidance on lawyer competence and AI (July 2024).
- 1,227 Fabricated Citations and Counting. Analysis of the sanctions landscape through early 2026.
This post is part of the Legal AI Landscape series on LegalAI Insights. It is intended for informational and educational purposes only and does not constitute legal advice. AI capabilities and hallucination rates described here reflect publicly available research as of the publication date and are subject to change as models and tools improve. Laws governing AI use vary by jurisdiction.