The Foundation #
If you’ve sat through a legal AI demo recently, you’ve heard the claims. Kira (now part of Litera) advertises “90%+ accuracy” from “1,400+ proprietary AI fields” refined over 45,000 lawyer hours. LegalOn claims “98% of customers achieve immediate time savings” on contract review. Everlaw promises AI that handles cases “exceeding 10 million documents” with answers “grounded in evidence.” CoCounsel, Thomson Reuters’ AI assistant, emphasizes legal research answers backed by Westlaw and Practical Law content.
Every vendor says “proprietary AI.” Almost none of them built the underlying model. Training a frontier large language model costs $100 million or more, requires thousands of specialized processors, and takes months. No legal tech company has that budget. What these vendors actually build is a proprietary application layer on top of a foundation model from OpenAI, Anthropic, Google, or another lab: custom prompts, retrieval pipelines, fine-tuning, and user interfaces. The foundation model does the reading and writing; the application tells it what to read and how to write.
Understanding what artificial intelligence (AI) foundation models are, who builds them, and how they differ is the minimum context you need to evaluate any legal AI product. This is the first post in our Legal AI Landscape series.
Large Language Model (LLM) #
A large language model (LLM) is a type of AI system trained on enormous volumes of text to predict and generate language. When a legal AI tool summarizes a deposition, flags a risky clause, or drafts a motion to compel, an LLM is doing the heavy lifting. The model doesn’t “understand” law the way a lawyer does. It processes patterns in language at a scale and speed no human can match, producing outputs that are often impressively useful and occasionally dangerously wrong.
Transformer Architecture #
The architecture behind virtually every modern LLM traces back to a single 2017 Google research paper: “Attention Is All You Need” (Vaswani et al.). It introduced the transformer, a neural network design built around “self-attention,” the model’s ability to weigh how every word in a passage relates to every other word, regardless of distance in the text. (For visual walkthroughs, see Jay Alammar’s The Illustrated Transformer or Georgia Tech’s interactive Transformer Explainer.)
Before transformers, language models processed text sequentially, word by word. That made them slow and prone to losing context in longer passages. Transformers process all positions in parallel, making them far better at maintaining coherence across long documents: statutes that cross-reference subsections, contracts with nested definitions, opinions that thread arguments across dozens of paragraphs.
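To make "self-attention" concrete, here is a minimal sketch of the core computation in Python with numpy. It is a toy illustration only: real transformers add learned query/key/value projections, multiple attention heads, and dozens of stacked layers.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a toy embedding matrix.

    X has shape (tokens, dim). A real transformer first projects X into
    separate query/key/value matrices with learned weights and runs many
    heads in parallel; those are omitted here to keep the mechanism visible.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)        # how strongly each token relates to every other
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X                   # each output is a weighted mix of all positions

# Four "tokens" with 8-dimensional toy embeddings
X = np.random.default_rng(0).normal(size=(4, 8))
print(self_attention(X).shape)  # (4, 8): every position attends to every other in one step
```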
Generative Pre-trained Transformers (GPTs) #
OpenAI popularized the term GPT (Generative Pre-trained Transformer) starting with GPT-1 in 2018. Generative: the model produces new text. Pre-trained: it learns general language patterns from a massive corpus before being fine-tuned for specific tasks. Transformer: it uses the architecture above.
The “pre-trained” part is what makes these models versatile. A single foundation model can be adapted through fine-tuning, prompt engineering, or retrieval-augmented generation (RAG) to perform legal research, contract review, or regulatory analysis without being rebuilt from scratch. This is why one underlying model can power dozens of seemingly different legal AI products.
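Here is a hedged sketch of what that application layer looks like in code: one foundation model, steered into two different "products" purely by the system prompt and retrieved context. It assumes the OpenAI Python SDK; the file name, prompts, and retrieval stub are hypothetical placeholders.

```python
# One foundation model, two "products." Assumes the OpenAI Python SDK
# (pip install openai) and an API key in the environment.
from openai import OpenAI

client = OpenAI()

def run_tool(system_prompt: str, retrieved_context: str, document: str) -> str:
    """A vendor's 'proprietary AI' is largely this: prompt + retrieval + model."""
    response = client.chat.completions.create(
        model="gpt-4.1",  # the same underlying model for both "products"
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{retrieved_context}\n\nDocument:\n{document}"},
        ],
    )
    return response.choices[0].message.content

contract = open("vendor_agreement.txt").read()  # placeholder document

# "Product" 1: a contract-review tool
review = run_tool(
    "You are a contract-review assistant. Flag indemnification, "
    "limitation-of-liability, and termination risks as a bulleted list.",
    retrieved_context="(clauses retrieved from the firm's playbook would go here)",
    document=contract,
)

# "Product" 2: a plain-English summarizer -- same model, different instructions
summary = run_tool(
    "Summarize this agreement in plain English for a business client.",
    retrieved_context="",
    document=contract,
)
```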
Frontier AI Labs #
Frontier AI labs build the most advanced foundation models, the technology underneath the legal AI tools you’re evaluating. (The Stanford HAI AI Index Report tracks industry trends annually.)
OpenAI #
The company behind GPT-4, GPT-5, and ChatGPT. OpenAI pioneered the commercial LLM market, and its application programming interface (API) powers a significant share of legal AI tools. Recent models include the o3 and o4 series, which use internal chain-of-thought reasoning that improves performance on complex analytical tasks but adds cost. Closed-source, API-only. See OpenAI’s model documentation.
Anthropic #
The maker of Claude, including the current flagship Claude Opus 4.6. Founded by former OpenAI researchers, Anthropic emphasizes AI safety research and has built strong performance on coding, reasoning, and extended document analysis. Closed-source, API-only. See Anthropic’s model overview.
Google #
Google DeepMind develops the Gemini family. Gemini’s standout feature is its context window: some versions support up to two million tokens, enough for an entire merger agreement with exhibits in a single prompt. Aggressive pricing on Flash-tier models makes it competitive for high-volume processing. See Gemini documentation.
xAI #
Elon Musk’s AI company, behind Grok. Grok models have climbed leaderboard rankings quickly, but xAI remains a newer entrant for legal-specific work.
Meta #
Meta releases its Llama models as open-weight, meaning anyone can download, run, and fine-tune them. For firms where sending client documents to a third-party API is a non-starter, Llama’s approach is significant.
DeepSeek #
A Chinese lab that shook the industry in January 2025 with DeepSeek-R1, a reasoning model matching top closed-source models at a fraction of the training cost. Open-source licenses, popular for self-hosted deployments. The trade-off for Western legal users: data residency and regulatory considerations.
Other Notable Labs #
Alibaba builds the Qwen family, strong on multilingual tasks. Z.ai (formerly Zhipu AI), a Tsinghua University spinoff, released the MIT-licensed GLM-5, a 744B-parameter model trained entirely on Huawei Ascend chips. Mistral, based in Paris, emphasizes efficiency and European Union (EU) data sovereignty. Moonshot AI builds Kimi, focused on long-context and multi-agent orchestration.
Open v. Closed Models #
This is one of the most consequential distinctions for legal teams. (For formal definitions, see the Open Source Initiative’s AI definition.)
Closed-source models (OpenAI, Anthropic, Google) are accessed via APIs, meaning your data travels to the provider’s servers. You get the highest raw performance and managed infrastructure, but can’t inspect model weights or fully control data handling. Open-weight models (Llama, DeepSeek, GLM, Mistral) release model parameters for self-hosting. Your documents never leave your firm’s network, but you need GPU (graphics processing unit) infrastructure and machine learning expertise to run them.
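For a sense of what the self-hosted path looks like, here is a minimal sketch using the Hugging Face transformers library (a recent version, with accelerate installed and a capable GPU). The model ID is one open-weight example among many (it requires license acceptance on Hugging Face), and the file name is a placeholder; nothing in this pipeline leaves the local machine.

```python
# Self-hosted inference with an open-weight model: documents are processed
# locally and never transmitted to an outside party.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # one open-weight option among many
    device_map="auto",  # spread weights across available local GPU(s)
)

contract = open("vendor_agreement.txt").read()  # placeholder document
messages = [
    {"role": "user", "content": f"List the termination provisions in this contract:\n{contract}"},
]
result = generator(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])  # the model's reply
```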
The performance gap has narrowed from ~18 percentage points in late 2023 to essentially zero on knowledge and reasoning benchmarks by early 2026. Closed models still lead on complex agentic tasks, but for classification, extraction, and summarization, open models deliver comparable results at a fraction of the cost.
The right question isn’t “open or closed?” It’s “what am I processing, who can see it, and what performance level does the task actually require?”
Privilege, Confidentiality, and Data Retention #
When you send a document to a closed-source API, it leaves your firm’s network. That raises two questions lawyers should be asking before any AI tool touches client work.
Does using this tool waive privilege? The answer depends on your jurisdiction, your engagement letter, and the specific terms of the provider’s data processing agreement. ABA Formal Opinion 512 (July 2024) requires lawyers using AI to understand how the technology handles confidential information and to take reasonable measures to protect it. Sending privileged work product to a third-party server isn’t automatically a waiver, but it requires the same diligence you’d apply to any outside vendor with access to client data: a written agreement governing use, retention, and disclosure.
Does the provider retain or train on my data? This is the question most lawyers don’t ask, and it’s the one that matters most. Early LLM APIs used customer inputs to improve future models, meaning your client’s contract could end up influencing outputs for someone else’s query. Major providers now offer zero-retention policies and opt-out provisions for model training. OpenAI’s API data usage policy states that API inputs are not used for training by default. Anthropic’s data policy similarly commits to not training on API inputs. Google’s Gemini API terms offer similar protections for paid API tiers. But these policies apply to the API, not necessarily to the consumer chat products (ChatGPT, claude.ai, Gemini chat), which may have different terms. If your team is pasting client documents into a chat window instead of using the API through a vetted legal AI tool, the retention and training terms may be very different.
For firms that can’t accept any data leaving their network, self-hosted open-weight models eliminate the question entirely. Your documents are processed on your own hardware, and nothing is transmitted to an outside party.
Tokenomics #
Every LLM interaction is metered in tokens, subword units roughly equal to four English characters. A 10-page contract runs ~4,000-5,000 tokens; a full deposition transcript might hit 50,000. (Try OpenAI’s free Tokenizer tool to see how your documents get split.)
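You can also count tokens yourself with OpenAI’s open-source tiktoken library, a quick sanity check before estimating costs. Other providers tokenize somewhat differently, so treat the counts as approximations.

```python
# Rough token counts with OpenAI's open-source tokenizer (pip install tiktoken).
# Other providers split text differently, so treat these as estimates.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

clause = ("Licensee shall indemnify, defend, and hold harmless Licensor "
          "from and against any and all claims, damages, and expenses.")
tokens = enc.encode(clause)
print(f"{len(clause)} characters -> {len(tokens)} tokens")
# Prints something like: 120 characters -> 24 tokens (about 4-5 characters per token)
```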
APIs charge separately for input tokens (what you send) and output tokens (what the model generates). Output typically costs 3-10x more per token because generation requires more compute than reading. This asymmetry shapes legal AI costs: a system analyzing long contracts but producing short classifications has a fundamentally different cost profile than one drafting lengthy memoranda.
Every model also has a context window, the maximum tokens it can process at once, essentially its working memory. A 200K-token window holds a lengthy contract and exhibits; 1M-2M token windows can ingest entire deal rooms. Context size matters because if a document exceeds the window, the model can’t see a definition on page 3 when analyzing a clause on page 47.
To make pricing concrete, here’s what common legal tasks cost across models. The per-task cost formula:
Task cost = (input tokens × input price per token) + (output tokens × output price per token)
(For live pricing, see PE Collective’s comparison or BenchLM’s tracker.)
- Reviewing a 20-page contract → 1-page risk summary: ~7,500 input + ~750 output tokens
- Summarizing a 100-page deposition transcript: ~40,000 input + ~2,000 output tokens
- Drafting a 10-page brief from a detailed outline: ~3,000 input + ~7,500 output tokens
- Due diligence on 500 documents (extract key terms from each): ~1,000,000 input + ~50,000 output tokens
| Model | Provider | Review 1 Contract | Summarize 1 Deposition | Draft 10-pg Brief | DD: 500 Docs |
|---|---|---|---|---|---|
| **Budget Tier** | | | | | |
| GPT-4.1 Nano | OpenAI | $0.001 | $0.005 | $0.003 | $0.12 |
| Gemini 2.5 Flash | Google | $0.002 | $0.007 | $0.005 | $0.18 |
| Claude Haiku 4.5 | Anthropic | $0.011 | $0.050 | $0.041 | $1.25 |
| DeepSeek V3 | DeepSeek | $0.002 | $0.012 | $0.004 | $0.30 |
| **Mid Tier** | | | | | |
| GPT-4.1 | OpenAI | $0.021 | $0.096 | $0.066 | $2.40 |
| Gemini 2.5 Pro | Google | $0.017 | $0.070 | $0.079 | $1.75 |
| Claude Sonnet 4.6 | Anthropic | $0.034 | $0.150 | $0.122 | $3.75 |
| **Frontier Tier** | | | | | |
| GPT-5.4 | OpenAI | $0.030 | $0.130 | $0.120 | $3.25 |
| Gemini 3.1 Pro | Google | $0.024 | $0.104 | $0.096 | $2.60 |
| Claude Opus 4.6 | Anthropic | $0.056 | $0.250 | $0.203 | $6.25 |
Raw per-token pricing and context window specs are available at each provider link above. For a side-by-side comparison, see PE Collective’s pricing tracker or BenchLM. Task estimates: contract review = 7,500 in + 750 out; deposition summary = 40,000 in + 2,000 out; brief drafting = 3,000 in + 7,500 out; due diligence = 2,000 in + 100 out per document × 500 docs. Pricing as of April 2026.
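If you’d rather compute these estimates yourself, the formula above translates directly into a few lines of Python. The per-token prices below are illustrative placeholders, not any provider’s actual rates; substitute current pricing from the links above.

```python
# The per-task cost formula in code. Prices are dollars per million tokens
# and purely illustrative placeholders -- pull current rates from the
# provider pages before relying on the output.
PRICES = {
    "budget":   (0.10, 0.40),   # (input $/M, output $/M) -- hypothetical
    "mid":      (2.00, 8.00),
    "frontier": (5.00, 25.00),
}

def task_cost(input_tokens: int, output_tokens: int, tier: str) -> float:
    in_price, out_price = PRICES[tier]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Contract review from the table: 7,500 input + 750 output tokens
for tier in PRICES:
    print(f"{tier:>8}: ${task_cost(7_500, 750, tier):.4f}")
```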
For a single prompt, those numbers are trivially cheap. But two things inflate real-world costs: iteration and volume.
Iteration. You rarely get a usable result on the first try. You ask the model to review a contract, get back a summary that misses the indemnification cap, refine your prompt, and resubmit. Each round-trip re-sends the entire document plus the growing conversation history:
Iteration cost ≈ single-prompt cost × N rounds × ~1.5-2x average context growth
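One way to model that in code: sum the cost of each round as the context compounds. The growth factor is the ~1.5-2x estimate above, and the numbers are the mid-tier contract review from the pricing table.

```python
# Modeling iteration: each round re-sends the document plus the growing
# history, so per-round cost compounds. growth is the ~1.5-2x estimate above.
def iteration_cost(single_prompt_cost: float, rounds: int, growth: float = 1.6) -> float:
    total, context = 0.0, 1.0   # context size as a multiple of the first prompt
    for _ in range(rounds):
        total += single_prompt_cost * context
        context *= growth       # history accumulates before the next round
    return total

# Contract review on a mid-tier model: $0.034 single prompt, 3 rounds
print(f"${iteration_cost(0.034, 3):.2f}")  # ~$0.18, in line with the table's ~5x multiplier
```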
Here’s what that looks like in practice, using Claude Sonnet 4.6 as the reference model:
| Task | Single Prompt | Typical Rounds | What You’re Fixing | Real Cost | Multiplier |
|---|---|---|---|---|---|
| Contract review | $0.034 | 3 | “Missed the indemnity cap,” “add limitation of liability,” “format as table” | ~$0.17 | 5x |
| Deposition summary | $0.150 | 2 | “Focus on the September meeting testimony, not the full transcript” | ~$0.45 | 3x |
| Brief drafting | $0.122 | 4-5 | Fix citations, strengthen section III, shorten facts, adjust tone for judge | ~$0.70 | 6x |
| Due diligence (500 docs) | $3.75 | 1 | Structured extraction; usually works first pass | $3.75 | 1x |
Multiply single-prompt estimates by 3-6x for tasks that require back-and-forth, which is most of them. The exception is structured extraction and classification, where one-shot accuracy is high enough that iteration is rare.
Volume. A single contract review at $0.17 (after iteration) is a rounding error. But at scale the model tier matters: a litigation team summarizing 200 depositions over the life of a case pays ~$90 on a mid-tier model at the iteration-adjusted ~$0.45 per transcript, versus ~$1 single-pass on a budget model. A deal team running due diligence on 5,000 documents pays $37.50 on a mid-tier model or $1.20 on a budget model, but the budget model’s extractions may need more manual cleanup, and associate time costs far more per hour than the model savings.
Output weight. Brief drafting costs more than contract review despite shorter input, because generating 7,500 output tokens costs far more than reading 7,500 input tokens. When budgeting, pay attention to which direction the tokens flow.
Raw Model Cost vs. What You Actually Pay #
These are all raw model costs, not what you’ll pay for a legal AI product.
A contract review that costs the model five cents in tokens might cost you $3-10 through a vendor’s platform. That 60-200x markup covers months of prompt engineering to get reliable outputs on your document types, a retrieval pipeline that pulls in your firm’s playbook and precedent clauses, a user interface your associates can use without training, SOC 2 and ISO 27001 compliance certifications, customer support, ongoing testing as the underlying model gets updated, and the R&D to build all of it in the first place. The model API is the cheapest line item in a legal AI product’s cost structure.
The markup also reflects something subtler: the vendor has already solved the iteration problem for you. The reason a raw API call takes 3-5 rounds of refinement is that you’re writing prompts from scratch. A well-built product has spent thousands of hours tuning its prompts, building guardrails against hallucination, and testing edge cases on real legal documents. You’re paying for that accumulated work every time you click “analyze.”
Build vs. Buy #
If the model cost for reviewing a contract is $0.17 and the vendor charges $5, a firm reviewing 1,000 contracts a year is paying $5,000 for something that costs $170 in model fees. The $4,830 difference could fund internal development.
But the real cost of building isn’t the API bill. It’s everything else:
- Engineering. You need at least one developer who understands prompt engineering, retrieval-augmented generation, and LLM evaluation. That’s a $200,000-350,000 salary, plus the opportunity cost of not hiring another associate.
- Maintenance. When OpenAI updates GPT-5.4 or Anthropic ships a new Claude version, your prompts may break. Someone has to test, fix, and redeploy. Vendors do this continuously; an internal tool needs the same attention.
- Compliance. If you’re processing client data through an API, your firm’s information security team needs to vet the pipeline. A vendor has already done this and can hand you their SOC 2 report.
- Evaluation. How do you know your internal tool is accurate? You need a testing framework, a set of ground-truth documents, and someone to run evaluations regularly. This is the work that legalbenchmarks.ai is trying to standardize for the industry.
The general rule: build when you have a high-volume, narrow task that no vendor serves well, and you have the engineering talent to maintain it. Buy when the task is well-served by existing products and your volume doesn’t justify a dedicated hire. Most firms should buy first, learn how the technology works in practice, and only build when they’ve identified a specific gap no vendor fills.
AI Economics: Does the Math Work? #
The formula for whether AI saves money on a given task:
Monthly savings = (Human cost per task - AI cost per task) × Volume
Where:
- Human cost per task = attorney or paralegal time × what that person’s time costs the firm
- AI cost per task = model cost + application markup + human review time on AI output
- Volume = number of times you perform this task per month
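As a sketch, the formula is a three-line calculator. The inputs below mirror the contract-review example in the next paragraph; plug in your own rates and volumes.

```python
# The savings formula as a quick calculator. All inputs are illustrative;
# substitute your own rates and volumes.
def monthly_savings(human_cost: float, ai_cost: float,
                    review_cost: float, volume: int) -> float:
    return (human_cost - (ai_cost + review_cost)) * volume

# $175 of associate time vs. a $5 vendor fee plus ~$30 of attorney review,
# at 50 contracts per month
print(monthly_savings(human_cost=175.0, ai_cost=5.0, review_cost=30.0, volume=50))
# -> 7000.0, inside the $5,000-8,000 range below
```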
Contract review: a junior associate spending 45 minutes on a 20-page vendor contract costs the firm $150-200 in associate time. An AI tool does the same pass for $3-10. Even with 10 minutes of attorney review on top, the total drops to $35-45. At 50 contracts a month, that’s $5,000-8,000 in savings.
Brief drafting is trickier. Model cost per draft is low, but attorney review time is high because generated text needs checking for hallucinated citations, misapplied standards, and tone. If review takes nearly as long as writing from scratch, the AI economics don’t work.
Additional savings come from prompt caching (50-90% off repeated system prompts), batch APIs (50% discount for async processing), and model routing (cheap models for simple tasks, frontier models only where quality matters).
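Model routing in particular is simple to sketch. The task taxonomy and tier names below are illustrative assumptions, not any vendor’s API; the point is that routing is often just a lookup table.

```python
# Model routing as a lookup table: cheap models for structured, high-volume
# tasks; frontier models where quality dominates. Task names and tiers
# are illustrative assumptions.
ROUTES = {
    "classify_email":       "budget",
    "extract_key_terms":    "budget",
    "summarize_deposition": "mid",
    "draft_brief":          "frontier",
}

def route(task: str) -> str:
    return ROUTES.get(task, "mid")  # default to mid tier for unprofiled tasks

print(route("extract_key_terms"))  # budget: high volume, structured output
print(route("draft_brief"))        # frontier: review time outweighs model cost
```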
Benchmarks #
Benchmarks are standardized tests for comparing LLM performance. They matter because vendors cite them selectively. (For a practitioner overview, see this Good Journey Consulting guide.)
LMArena (Chatbot Arena) #
LMArena, formerly LMSYS (Large Model Systems Organization) Chatbot Arena, ranks LLMs by crowdsourced human preference. Two models answer the same prompt blindly; a human picks the winner. Over six million votes produce Elo ratings. Top 5 as of April 23, 2026:
| Rank | Model | Provider | Elo Score |
|---|---|---|---|
| 1 | `claude-opus-4-7-thinking` | Anthropic | 1503 |
| 2 | `claude-opus-4-6-thinking` | Anthropic | 1503 |
| 3 | `claude-opus-4-6` | Anthropic | 1496 |
| 4 | `claude-opus-4-7` | Anthropic | 1494 |
| 5 | `gemini-3.1-pro-preview` | Google | 1493 |
Source: lmarena.ai/leaderboard/text. Rankings shift daily. GPT-5.4 sits at #9 (1481 Elo); the first open-weight model (Z.ai’s glm-5.1) appears at #15.
Anthropic holds four of the top five spots, but the entire top 10 spans only 24 Elo points; the frontier is tightly packed. LMArena also publishes category-specific leaderboards for coding, long-context, and hard reasoning, so check the category that matches your use case.
LegalBench #
LegalBench is the legal profession’s own benchmark: 162 tasks covering issue-spotting, rule-recall, rule-application, and interpretation, published at NeurIPS 2023 and available on Hugging Face. Vals AI runs models on LegalBench independently. Top 5 as of April 2026:
| Rank | Model | Provider | Accuracy |
|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | Google | 87.40% |
| 2 | Gemini 3 Pro | Google | 87.04% |
| 3 | Gemini 3 Flash | Google | 86.86% |
| 4 | GPT-5.4 | OpenAI | 86.04% |
| 5 | GPT-4.1 | OpenAI | ~85% |
Source: vals.ai/benchmarks/legal_bench. Claude Opus 4.6, tied for the top Elo on LMArena, drops to ~84% here (#8). Llama 4 Scout sits at ~82% (#10).
Google sweeps the top three; Claude Opus drops from #1 overall to #8 on legal tasks. A model’s general ranking is not its legal ranking. Models also score well on short-clause classification (88%+) but struggle with longer text, so a model that aces clause review may falter on multi-page compliance disclosures.
legalbenchmarks.ai #
legalbenchmarks.ai benchmarks finished legal AI tools on real workflows. Their Phase 2 research found that specialized tools didn’t always produce better drafts than general-purpose LLMs, but offered much better workflow integration. Their open-access Legal AI Evaluation Framework provides a structured scoring system for procurement.
The Limits of Public Benchmarks #
All the benchmarks above measure what’s easy to measure, not what matters to your practice. LMArena captures preference from anonymous internet users, most of whom aren’t lawyers. LegalBench tests pattern-matching on short tasks, not judgment-heavy work. A model can score 87% on LegalBench and still produce a demand letter your supervising partner would reject.
This is sometimes called the “vibe check” problem. A model that benchmarks well can still feel wrong: it buries the conclusion, hedges where a lawyer would be direct, or confidently cites a case that doesn’t exist. No standardized test captures these failures because they depend on your firm’s standards, your jurisdiction, and your work.
Run Your Own Benchmark #
The most useful evaluation takes about an hour:
- Pick a real task you’ve already completed. Something you do repeatedly where you know what good looks like: a contract risk summary, a client intake memo, an email classification batch.
- Pull your answer key. The output you’d consider acceptable. Without it, you’ll grade against a vague feeling of quality.
- Give the same task to 2-3 models. Same prompt, same document. Try one budget, one mid-tier, one frontier.
- Grade blind. Print outputs without model names. Score on what matters: factual accuracy, completeness, tone, citation reliability, and whether you’d send it after light editing or need to rewrite.
- Compare cost vs. quality. If the budget model needs 5 minutes of editing and the frontier model needs 2, calculate whether that 3-minute difference justifies a 50x price gap at your expected volume.
Almost no one does this. An hour of hands-on testing with your own documents tells you more than any leaderboard.
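If you want to script steps 3 and 4, a minimal harness looks like this. It uses the OpenAI Python SDK for illustration, with model names following this post’s examples and a placeholder file name; swap in whichever providers and documents you’re actually testing.

```python
# Steps 3-4 as a script: same prompt to several models, outputs shuffled
# and stripped of names for blind grading.
import random
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4.1-nano", "gpt-4.1", "gpt-5.4"]  # budget / mid / frontier

prompt = "Summarize the key risks in the attached vendor contract."
document = open("vendor_agreement.txt").read()  # placeholder document

outputs = []
for model in MODELS:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{prompt}\n\n{document}"}],
    )
    outputs.append((model, resp.choices[0].message.content))

random.shuffle(outputs)  # hide which output came from which model
for i, (_, text) in enumerate(outputs, 1):
    print(f"--- Output {i} ---\n{text}\n")

# After grading on paper, reveal the mapping:
# for i, (model, _) in enumerate(outputs, 1): print(i, model)
```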
Questions to Bring to Your Next Demo #
- “What foundation model does your product run on, and what happens when that model is updated?” If the vendor can’t name the model, they can’t explain how their product will change when the model provider ships an update.
- “Where are my client’s documents processed? Are they stored or used for training?” The answer should be specific: which data center, how long documents are retained, and whether the model provider ever sees them.
- “What’s your accuracy rate on [my specific document type], and who measured it?” A vendor claiming “95% accuracy” on NDAs may have never tested on the 80-page credit agreements your team reviews.
- “What does a single task cost you in model fees, and what do you charge me?” The raw model cost of reviewing a contract is under a dollar. If the vendor charges $50 per document, you should know the ratio before you sign.
- “Can I run a pilot on my own documents with a blind comparison against our current process?” Any vendor confident in their product will say yes.
Next in this series: how foundation models get turned into legal AI products, the application layer where RAG, fine-tuning, and prompt engineering transform a general-purpose LLM into something that can actually help you review a purchase agreement.
Further Reading #
- Attention Is All You Need. The 2017 transformer paper.
- The Illustrated Transformer. Jay Alammar’s visual guide.
- LegalBench (NeurIPS 2023). The legal reasoning benchmark paper.
- Vals AI Benchmarks. Independent legal and financial AI leaderboards.
- legalbenchmarks.ai Evaluation Framework. Open-access legal AI evaluation toolkit.
- LMArena Leaderboard. Live crowdsourced LLM rankings.
- PE Collective LLM Pricing. Updated pricing across providers.
- Stanford HAI AI Index. Annual AI industry trends report (Human-Centered Artificial Intelligence).
This post is part of the Legal AI Landscape series on LegalAI Insights. It is intended for informational and educational purposes only and does not constitute legal advice. AI capabilities, pricing, and benchmark results described here reflect publicly available information as of the publication date and are subject to rapid change. Laws governing AI use vary by jurisdiction.