# Finding the Needle in the Haystack
TL;DR
- LLMs get worse as you give them more to read — even on simple tasks. Chroma tested 18 frontier models and found every one degraded with increasing input length, a phenomenon called “context rot.” For legal document review, this means the model analyzing your 500th page is measurably less reliable than the one analyzing your 5th.
- The benchmarks that matter aren’t the ones vendors cite. Standard needle-in-a-haystack tests use exact word matching. Real investigations require semantic inference — connecting “鉛筆をなめる” to data falsification. When Adobe researchers removed lexical overlap, even GPT-4o’s performance collapsed.
- Evidence hides in the middle, where models pay the least attention. Transformer architecture creates a U-shaped attention curve: strong recall at the beginning and end of context, weak in the middle. If your smoking gun sits on page 47 of an 80-page document, the model is architecturally biased against finding it.
- Chunking is where most retrieval systems silently fail. How you split documents determines what the model can find. Cut a paragraph between a euphemism and its context, and the retrieval system will never connect them.
- You beat context rot by making the haystack smaller, not the model bigger. Concrete strategies — metadata pre-filtering, hierarchical chunking, subagent isolation, and position-aware prompting — keep irrelevant tokens out of the context window so the model can attend to what matters.
In Japanese corporate culture, there’s an expression: enpitsu wo nameru (鉛筆をなめる) — literally, “to lick the pencil.” It originally described writing with care, moistening an old-style pencil tip to make lines clearer. Over time, its meaning shifted. Today it’s a euphemism for fudging numbers, adjusting figures, making the data say what you need it to say.
In 2024, Japan’s transport ministry discovered that Toyota, Honda, Mazda, Suzuki, and Yamaha had all falsified safety and emissions certification data — in some cases for over a decade. Toyota subsidiary Daihatsu’s internal probe found irregularities in 174 items across 25 test categories spanning 64 vehicle models. These weren’t rogue employees. Investigators found systematic data falsification running through multiple companies, quality certification bodies, and an accreditation system that looked the other way for years.
Now imagine you’re the litigation team on the other side of this. You’ve received a production of 2 million documents — internal emails, test reports, meeting minutes, engineering logs — in Japanese, English, and German. Somewhere in that corpus is evidence that an engineer wrote something equivalent to “we need to lick the pencil on the NOx figures.” Not those exact words. A euphemism, in an idiomatic register, possibly in a language the reviewing attorneys don’t speak, buried on page 47 of an 80-page engineering report, surrounded by thousands of pages of routine compliance documentation.
Every AI vendor in legal tech will tell you their tool can find it. The research says otherwise.
## What Needle-in-a-Haystack Actually Tests
The needle-in-a-haystack (NIAH) test is how the AI industry measures whether a model can find specific information in a large body of text. Greg Kamradt designed the original version in late 2023: insert a known fact into a long document at various positions and depths, then ask the model to retrieve it. When Anthropic released Claude 3 in early 2024, near-perfect NIAH scores were a headline feature. Context windows expanded to 200K, then 1 million, then 2 million tokens. The implication was clear: bigger windows, better retrieval, problem solved.
The implication was wrong.
NIAH tests a narrow capability: lexical retrieval. The “needle” is a sentence with distinctive keywords. The question asks about those same keywords. The model pattern-matches surface-level text — useful for measuring whether it can physically attend to tokens at different positions, but nearly useless for real investigations, where the relationship between query and evidence is semantic.
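To make the mechanics concrete, here is a minimal sketch of a classic NIAH case. Instead of calling a real model, a toy `lexical_guess` function answers purely by keyword overlap with the question, which is enough to find this needle and is exactly the shortcut the benchmark rewards; the filler text and needle are illustrative.

```python
# Minimal sketch of a classic NIAH case. `lexical_guess` answers purely by keyword
# overlap with the question, the shortcut that standard NIAH rewards.
import re

def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    words = filler.split()
    cut = int(len(words) * depth)
    return " ".join(words[:cut] + [needle] + words[cut:])

def lexical_guess(context: str, question: str) -> str:
    """Return the context sentence sharing the most words with the question."""
    q_words = set(re.findall(r"\w+", question.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", context)
    return max(sentences, key=lambda s: len(q_words & set(re.findall(r"\w+", s.lower()))))

filler = "Routine compliance documentation follows. " * 400
needle = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
question = "What is the best thing to do in San Francisco?"

haystack = build_haystack(filler, needle, depth=0.5)
print(lexical_guess(haystack, question))  # prints the sentence containing the needle
```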
## The Benchmark That Matters: NoLiMa
In February 2025, Adobe researchers published NoLiMa (No Literal Matching) — a needle-in-a-haystack variant that removes the lexical shortcut. Instead of asking a question that shares keywords with the needle, NoLiMa uses needle-question pairs with minimal word overlap. The model has to infer the connection.
One example from the paper: the needle states that a character “lives next to the Kiasma museum.” The question asks which character has been to Helsinki. To answer correctly, the model must know that Kiasma is in Helsinki — a latent association, not a keyword match.
Models that scored near-perfectly on standard NIAH showed significant performance drops on NoLiMa as context length increased. Even frontier models like GPT-4o degraded substantially when they couldn’t rely on word-level pattern matching. The paper was published at ICML 2025.
Nobody writes “I am committing fraud” in an email. They write “let’s adjust the baseline,” “the numbers need to be more competitive,” or — in a Japanese engineering report — enpitsu wo nameru.
## Context Rot: More Tokens, Worse Performance
In July 2025, Chroma published “Context Rot”: LLMs get measurably worse as input length increases, even on tasks that don’t get harder.
Chroma tested 18 frontier models — including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 — across controlled experiments that held task complexity constant while varying only input length. The findings:
- Every model tested exhibited performance degradation as input length grew.
- Lower similarity between the question and the answer accelerated the degradation. When the question and needle shared fewer words, performance dropped faster as context grew — precisely the scenario that matters for investigations.
- Distractors made it worse, but not uniformly. Text that was topically related to the needle but didn’t contain the answer — the kind of noise that fills every real document corpus — degraded performance in unpredictable, model-specific ways.
- The structure of the surrounding text mattered. When Chroma shuffled the haystack’s sentences to destroy narrative flow, model performance changed — suggesting that models attend to structural patterns, not just content.
A model processing a 200-page document is not as reliable as the same model processing a 20-page document, even if both tasks are equally difficult. The 2-million-token context window that vendors advertise as a feature is simultaneously a liability.
## Lost in the Middle: The U-Shaped Blind Spot
In 2023, researchers from Stanford and UC Berkeley published “Lost in the Middle” — a study showing that LLMs retrieve information best from the beginning and end of their context window, and worst from the middle. The performance curve is U-shaped: high at the edges, low in the center.
The architectural explanation is now well understood. Transformer models use causal masking: each token can only attend to tokens that came before it. Token #1 gets attended to by every subsequent token in the sequence. Token #5,000, sitting in the middle of a 10,000-token document, is only attended to by tokens #5,001 onward. Earlier tokens therefore accumulate disproportionate attention weight across the model's layers, producing the primacy bias; the recency bias at the other end is generally attributed to positional encodings that weight nearby tokens more heavily, leaving the middle doubly disadvantaged. A 2025 Meta paper proved mathematically that this U-shaped bias exists at initialization — before any training occurs.
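A toy illustration of that asymmetry (a sketch, not the Meta paper's derivation): under a causal mask, the number of later tokens that can ever attend back to a position shrinks as the position moves toward the end of the sequence.

```python
# Toy illustration of the causal-mask asymmetry (not the Meta paper's derivation).
import numpy as np

# Causal mask for a 4-token sequence: row i may attend only to columns j <= i.
mask = np.tril(np.ones((4, 4), dtype=int))
print(mask)

# In an n-token sequence, position i can be attended to by the (n - 1 - i) tokens
# that follow it, so early positions accumulate far more attention opportunities.
n = 10_000
for pos in (0, 5_000, 9_998):
    print(f"token {pos}: attended to by {n - 1 - pos} later tokens")
```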
For legal document review, this means position matters. If the critical clause is in section 12 of a 24-section contract, the model is architecturally biased against attending to it. If the key email in a thread sits between an innocuous opener and a routine sign-off, it’s in the blind spot.
## Chunking: Where Retrieval Systems Silently Fail
No production legal AI system feeds entire document corpora into a context window. They use retrieval-augmented generation (RAG): split documents into chunks, embed those chunks as vectors, retrieve the most relevant chunks for a given query, and feed only those chunks to the model. This is how CoCounsel queries Westlaw, how Everlaw powers Deep Dive, and how every RAG-based legal tool works under the hood. The failure mode at this layer is invisible to the user.
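A stripped-down version of that pipeline is sketched below, with bag-of-words counts standing in for a real embedding model; every name here is illustrative rather than any vendor's API.

```python
# Stripped-down RAG loop: chunk, embed, retrieve, build a prompt from the winners.
# Bag-of-words counts stand in for a real embedding model; names are illustrative.
import re
from collections import Counter

def chunk(text: str, size: int = 500) -> list[str]:
    # Split into ~500-word chunks (a rough proxy for tokens).
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, retrieved: list[str]) -> str:
    # The model only ever sees the retrieved chunks, never the full corpus.
    return "\n\n".join(retrieved) + f"\n\nQuestion: {query}"
```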
### The Chunking Dilemma
Consider an 80-page engineering report from a Japanese automaker’s emissions testing lab. A standard chunking strategy might split this document into 500-token chunks — roughly one page each. The chunk containing a reference to enpitsu wo nameru gets embedded as a vector. But the phrase’s significance depends on context that might be several pages away: the preceding section describing the specific test procedure, the subsequent section showing the reported results, and a footnote referencing the regulatory threshold. Split across chunks, each fragment is individually innocuous. The euphemism, detached from what it’s euphemizing, embeds as a comment about writing instruments, not about data falsification.
Research on legal-specific RAG identifies this as Document-Level Retrieval Mismatch (DRM): the retrieval system pulls chunks from the wrong document entirely because the relevant chunk, stripped of its document-level context, looks less relevant than a superficially similar chunk from an unrelated document. One mitigation — Summary-Augmented Chunking — enriches each chunk with a document-level summary, injecting the global context that standard chunking destroys. But even Summary-Augmented Chunking can’t solve the deeper problem: the relationship between the euphemism and the fraud is implicit, cultural, and semantic.
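Mechanically, the summary-augmentation step itself is simple. A minimal sketch, assuming a `summarize` call made once per document at ingestion (stubbed here):

```python
# Summary-Augmented Chunking (sketch): prepend a document-level summary to every chunk
# before embedding, so each chunk carries global context into the vector store.

def summarize(document: str) -> str:
    # Placeholder: in practice an LLM call made once per document at ingestion time.
    return document[:200]

def augment_chunks(document: str, chunks: list[str]) -> list[str]:
    summary = summarize(document)
    return [f"[Document summary: {summary}]\n{chunk}" for chunk in chunks]

# A chunk that contains only the euphemism now embeds alongside "emissions certification
# test report" rather than as a stray remark about pencils.
```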
### Why Bigger Chunks Don’t Help
The intuitive response is to use bigger chunks — preserve more context per retrieval unit. But Chroma’s research shows why this backfires: larger chunks mean more tokens per retrieval, which means more context for the model to process, which means more context rot. You’re trading retrieval precision for generation reliability. Retrieval research confirms that the optimal chunk size depends on the query type, and investigation queries — which are open-ended, semantic, and often require cross-document reasoning — don’t have a single optimal size.
## From Benchmark to Investigation: What Actually Breaks
### Cross-Language Semantic Inference
Research on multilingual NIAH shows that model performance drops substantially when the needle is in a language other than English, and drops further when the query language differs from the needle language. A model might score 95% on English-English NIAH and 60% on English-Japanese retrieval at the same context length.
Enpitsu wo nameru isn’t just a translation problem — a dictionary lookup returns “to lick a pencil,” which is meaningless without cultural context. The model needs to know that this idiom, in a corporate Japanese context, signals number manipulation.
### Cross-Document Pattern Recognition
Individual documents rarely contain a complete fraud narrative. The evidence is distributed: an email setting expectations in January, a test report with adjusted figures in March, a meeting minute noting the discrepancy in May, and a corrective memo burying it in July. No single document is a smoking gun. The pattern across documents is.
Current RAG systems retrieve chunks per query. They don’t natively detect patterns across independently retrieved chunks from different documents, different dates, and different authors. Graph-based approaches like DISCOG — which use knowledge graphs to model relationships between documents, entities, and events — show promise, but they’re research prototypes, not production legal tools.
### Adversarial Document Design
In a regulatory investigation, the documents weren’t written casually — they were written by people who knew they might be reviewed. The euphemism is the adversarial design. When Honda’s CEO told reporters that falsification was about “making the tests more efficient, so that we don’t have to repeat them,” he was demonstrating language designed to be technically accurate while obscuring the underlying conduct. In an investigation corpus, legitimate compliance documentation uses the same technical vocabulary as the fraudulent documents — structurally similar to what Chroma calls a “distractor.”
## Slicing the Haystack: How to Work Around Context Rot
The core insight from Chroma’s research is counterintuitive: the answer to context rot isn’t a bigger context window. It’s a smaller one. Context engineering — the discipline of curating the optimal set of tokens during inference — works along four axes: select the right tokens, compress what’s verbose, isolate tasks into separate contexts, and write structured memory for cross-turn persistence. Applied to legal document review, each axis maps to a concrete strategy.
### Strategy 1: Pre-Filter Before the Model Sees Anything
The cheapest token is the one you never send. Before any LLM touches a document, use metadata filters to eliminate what can’t be relevant: date ranges, custodians, file types, departments. An emissions investigation doesn’t need the HR onboarding files or the cafeteria vendor contracts. A second-request response doesn’t need documents outside the relevant time period.
This sounds obvious, but many RAG implementations skip it. They embed everything, retrieve by vector similarity alone, and force the model to sort signal from noise. Temporal filtering and metadata boosting — weighting recent documents higher, filtering by department or custodian before retrieval — can eliminate 60-80% of a corpus before the LLM ever fires. Every document you remove is context rot you prevent.
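A sketch of that pre-filter, assuming each document carries custodian, department, and date metadata; the field names are illustrative, not any specific platform's schema.

```python
# Metadata pre-filter (sketch): drop documents that cannot be relevant before any
# embedding or LLM call. Field names (custodian, department, sent_date) are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    doc_id: str
    custodian: str
    department: str
    sent_date: date
    text: str

def prefilter(docs: list[Doc], custodians: set[str], departments: set[str],
              start: date, end: date) -> list[Doc]:
    return [
        d for d in docs
        if d.custodian in custodians
        and d.department in departments
        and start <= d.sent_date <= end
    ]

# Only the survivors get embedded and retrieved; everything else never costs a token.
```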
### Strategy 2: Hierarchical Chunking — Summaries First, Details on Demand
Instead of feeding the model raw document chunks, build a two-tier retrieval system. The first tier retrieves document-level summaries — generated during ingestion, not at query time. The model scans summaries to identify which documents are worth reading in full. The second tier loads the full text of only those documents.
This directly addresses the chunking dilemma. Summary-Augmented Chunking enriches each chunk with its parent document’s summary, so the retrieval system understands that a chunk mentioning enpitsu wo nameru comes from an emissions testing report, not a stationery inventory. The model sees fewer tokens at each stage, preserving its attention budget for the documents that matter.
For an investigation across 2 million documents, the math works out: summarize all 2 million (cheap, parallelizable, can use a budget-tier model). Retrieve the top 500 summaries. Load the full text of the top 50. The model processes maybe 200,000 tokens instead of 2 billion — a 10,000x reduction in context, with a corresponding reduction in rot.
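A sketch of the two-tier funnel, with `score` standing in for similarity against pre-computed summary embeddings; the default tier sizes mirror the arithmetic above.

```python
# Two-tier retrieval funnel (sketch). `score` stands in for vector similarity against
# pre-computed summary embeddings; the tier sizes mirror the arithmetic above.

def score(query: str, text: str) -> float:
    # Placeholder for embedding similarity between the query and a summary.
    return sum(word in text.lower() for word in query.lower().split())

def two_tier_retrieve(query: str, summaries: dict[str, str], full_texts: dict[str, str],
                      n_summaries: int = 500, n_full: int = 50) -> list[str]:
    # Tier 1: rank cheap, pre-computed summaries across the whole corpus.
    shortlist = sorted(summaries, key=lambda d: score(query, summaries[d]), reverse=True)
    shortlist = shortlist[:n_summaries]
    # Tier 2: decide which shortlisted documents to read in full. In practice a model
    # makes this call from the summaries; the placeholder keeps the top-scoring ones.
    chosen = shortlist[:n_full]
    return [full_texts[d] for d in chosen]
```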
### Strategy 3: Isolate Tasks into Subagent Contexts
Context rot accelerates when you ask a model to do multiple things in one context window — retrieve, classify, extract, and reason. Each task adds tokens; each token dilutes attention on the others.
The Morph framework for context engineering recommends isolating search into subagents with their own context windows and returning only precise results to the parent. Applied to document review: one agent scans and classifies. A second agent, starting with a clean context, receives only the classified-relevant documents and extracts specific findings. A third agent, again clean, synthesizes the extractions into a narrative.
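A minimal sketch of that isolation pattern, with `run_llm` as a placeholder for an actual model call; each stage starts from an empty context and passes forward only its distilled output.

```python
# Subagent isolation (sketch): each stage starts from a fresh context and hands the next
# stage only its distilled output. `run_llm` is a placeholder for a real model call.

def run_llm(instruction: str, context: str) -> str:
    raise NotImplementedError("wire up an actual model call here")

def classify_stage(documents: list[str]) -> list[str]:
    # Agent 1: reads raw documents, keeps only the ones it judges relevant.
    return [d for d in documents if "RELEVANT" in run_llm("Classify this document.", d)]

def extract_stage(relevant_docs: list[str]) -> list[str]:
    # Agent 2: clean context; sees only the classified-relevant documents.
    return [run_llm("Extract specific findings with citations.", d) for d in relevant_docs]

def synthesize_stage(findings: list[str]) -> str:
    # Agent 3: clean context again; sees only the extracted findings, not the documents.
    return run_llm("Synthesize these findings into a narrative.", "\n\n".join(findings))
```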
Thomson Reuters’ Deep Research agents follow this pattern — specialized agents for case law, statutes, and secondary sources, each tuned to its domain, each operating in its own context. Lighthouse’s HSR second-request workflow does something similar: separate AI passes for relevance review, privilege review, and privilege log drafting, rather than one model doing everything at once.
### Strategy 4: Position-Aware Prompting
Since the U-shaped attention bias means models under-attend to the middle of their context, put the most important content where models attend best: at the beginning and end. In a RAG pipeline, this means ordering retrieved documents so the most relevant sit at the edges of the prompt rather than buried in the middle, or duplicating critical context in both a preamble and a closing instruction.
For document review, this translates to a practical rule: if you’re asking a model to analyze a long document, extract the key sections first (using a fast first pass) and place them at the top of the prompt. Feed the full document as supporting context below. The model will attend most strongly to the extracted sections and use the full document for verification rather than discovery.
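A sketch of the ordering step, assuming the retriever already returns chunks ranked by relevance; the interleaving is one simple way to push the strongest chunks to the edges of the prompt.

```python
# Position-aware ordering (sketch): place the strongest retrieved chunks at the start and
# end of the prompt and the weakest in the middle, matching the U-shaped attention curve.

def edge_order(chunks_by_relevance: list[str]) -> list[str]:
    """chunks_by_relevance[0] is most relevant; returns them reordered edges-first."""
    head, tail = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (head if i % 2 == 0 else tail).append(chunk)
    return head + tail[::-1]  # best chunks land first and last, worst land mid-prompt

print(edge_order(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']  (r1 and r2 sit at the edges, r5 in the middle)
```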
### Strategy 5: Cross-Document Entity Graphs
Before the model touches any document, a separate pipeline extracts entities (people, dates, test procedures, regulatory thresholds) and builds a graph. The graph reveals relationships — who communicated with whom, which test results were reported after which internal discussions — that no single-document retrieval can surface. LexisNexis’s GraphRAG architecture does this for citation networks; the same approach applied to investigation corpora would connect the January email to the March report to the May meeting.
The graph doesn’t require a large context window. It’s a structured data layer that sits outside the LLM, feeding the model only the subgraph relevant to a specific query. Each query gets a small, focused context built from graph traversal — not a dump of every document mentioning a keyword.
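A sketch of that structured layer, using plain dictionaries in place of a graph database; entity extraction itself (NER, date parsing) is assumed to happen upstream.

```python
# Entity graph as a structured layer outside the LLM (sketch). Entity extraction (people,
# dates, test procedures) is assumed to happen upstream; the graph here is plain dicts.
from collections import defaultdict

entity_docs: dict[str, set[str]] = defaultdict(set)   # entity -> documents mentioning it
doc_entities: dict[str, set[str]] = defaultdict(set)  # document -> entities it mentions

def add_mention(doc_id: str, entity: str) -> None:
    entity_docs[entity].add(doc_id)
    doc_entities[doc_id].add(entity)

def subgraph_docs(entities: set[str], hops: int = 1) -> set[str]:
    """Documents connected to the query entities within `hops` rounds of co-mention."""
    docs = set().union(*(entity_docs[e] for e in entities if e in entity_docs))
    for _ in range(hops):
        linked = set().union(*(doc_entities[d] for d in docs)) if docs else set()
        docs |= set().union(*(entity_docs[e] for e in linked)) if linked else set()
    return docs

# Query time: only this small subgraph's documents are handed to the model,
# not every document that happens to match a keyword.
```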
### Strategy 6: Adversarial Retrieval Testing
Before deployment, test the pipeline the way Chroma tests models: with controlled experiments that vary needle-question similarity, distractor density, and document position. Insert known euphemisms at known positions in a test corpus. Measure recall. If the system can’t find the evidence you planted, it won’t find the evidence you didn’t.
For multilingual investigations, this means planting culturally specific idioms — not just translated keywords — and verifying retrieval across language pairs. Current multilingual embeddings don’t reliably capture culture-specific idiomatic meaning, which is why human investigators who speak the language remain irreplaceable for seeding the test corpus and validating results.
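A sketch of such a test harness: `retrieve` stands in for the pipeline's own retrieval call, and the planted queries and document IDs are invented for illustration.

```python
# Adversarial retrieval test (sketch): plant known needles at known positions, run the
# pipeline's own retrieval, and measure recall. `retrieve` is a placeholder here.

def retrieve(query: str, k: int = 20) -> list[str]:
    raise NotImplementedError("call the production retrieval pipeline here")

# (query an investigator would actually ask, document where the needle was planted)
planted = [
    ("evidence of adjusted NOx test figures", "doc_0471"),     # euphemism, mid-document
    ("誰がNOxの数値を調整したか", "doc_0471"),                    # same needle, queried in Japanese
    ("pressure to hit certification deadlines", "doc_1204"),   # planted among distractors
]

def recall_at_k(k: int = 20) -> float:
    hits = sum(target in retrieve(query, k) for query, target in planted)
    return hits / len(planted)

# If recall on needles you planted is low, recall on evidence you didn't plant will be worse.
```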
### Strategy 7: Human in the Loop
Every strategy above reduces context rot. None of them solve the fundamental problem: a model that doesn’t know what enpitsu wo nameru implies can’t find it no matter how clean its context window is. The human feedback loop is what closes that gap — and in e-discovery, it’s also what makes AI retrieval defensible.
The pattern that works is iterative: AI surfaces candidate documents, a human reviewer evaluates them, and the reviewer’s judgments feed back into the retrieval system to sharpen the next pass. This isn’t new — technology-assisted review (TAR) has used human seed sets since 2012. What’s new is that LLM-based systems can incorporate richer feedback than binary relevant/not-relevant coding. A reviewer who flags a document and annotates why — “this euphemism refers to test data manipulation” — gives the system a semantic signal it can propagate across the corpus: find other documents that use similar phrasing in similar contexts.
In the emissions scenario, a Japanese-speaking attorney who recognizes enpitsu wo nameru on first encounter transforms the entire investigation. That single annotation becomes a retrieval seed: the system can now search for semantically similar idioms, co-occurring terminology, and documents from the same custodians discussing the same test procedures. One human judgment, amplified across 2 million documents.
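A sketch of that propagation step, assuming the pipeline's embedding model and a chunk-level vector index already exist; the function names and annotation structure are illustrative.

```python
# Propagating one reviewer annotation across the corpus (sketch). `embed` and the chunk
# index stand in for whatever embedding model and vector store the pipeline already uses.

def embed(text: str) -> list[float]:
    raise NotImplementedError("use the pipeline's embedding model")

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def expand_from_annotation(flagged_passage: str, reviewer_note: str,
                           chunk_vectors: dict[str, list[float]], k: int = 200) -> list[str]:
    # The seed query carries both the flagged language and the reviewer's explanation of
    # what it means, so the search targets the concept rather than the literal phrase.
    seed = embed(flagged_passage + " " + reviewer_note)
    ranked = sorted(chunk_vectors, key=lambda c: cosine(seed, chunk_vectors[c]), reverse=True)
    return ranked[:k]
```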
The DOJ and SEC are already using AI-powered analytics to identify suspicious patterns in corporate data — anomalous billing, unusual trading, bid-rigging signals. The London Metropolitan Police’s recent Palantir deployment uncovered misconduct in a week that years of human supervision had missed. But the investigators still decided which flags were real and which were noise. The AI doesn’t need to understand the idiom. It needs to flag the document as anomalous and put it in front of someone who does — and then learn from what that person decides.
The pencil-lickers of the world are counting on that gap.
## Further Reading
- Context Rot: How Increasing Input Tokens Impacts LLM Performance (Chroma, July 2025). The foundational study on performance degradation across 18 frontier models.
- NoLiMa: Long-Context Evaluation Beyond Literal Matching (Adobe Research, ICML 2025). The benchmark that removes lexical shortcuts from needle-in-a-haystack.
- Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias (Meta, 2025). Mathematical proof that the U-shaped attention bias exists at initialization.
- A Guide to Context Engineering for LLMs (ByteByteGo). Practical overview of select, compress, isolate, and write strategies.
- Context Engineering: The Definitive Guide (FlowHunt). Deep dive on the four-axis framework for managing context windows.
- Context Rot: Why LLMs Degrade as Context Grows (Morph). Subagent isolation and context management for production systems.
- Towards Reliable Retrieval in RAG Systems for Large Legal Datasets. Summary-Augmented Chunking and Document-Level Retrieval Mismatch.
- Context Poisoning in LLMs: How to Defend Your RAG System (Elasticsearch). Metadata filtering and temporal awareness for retrieval.
- Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery (DISCOG). Graph-based approaches to legal document retrieval.
- LLMTest Needle in a Haystack (Greg Kamradt). The original NIAH test codebase.
- Multilingual Needle in a Haystack. Research on cross-lingual retrieval degradation.
- The US Government Is Using AI To Detect Potential Wrongdoing (Skadden). How DOJ and SEC deploy AI analytics in investigations.
This is the first post in AI Under the Hood, a series on LegalRealist AI examining the technical foundations beneath legal AI products. It is intended for informational and educational purposes only and does not constitute legal advice. AI capabilities and research findings described here reflect publicly available information as of the publication date. The “enpitsu wo nameru” scenario is a hypothetical constructed for illustration; it does not reference any specific ongoing investigation.

