
TL;DR
- Law is the highest-scoring domain in the most rigorous professional AI benchmark ever built. GPT-5 (Thinking=High) scores 77.9% on law tasks in APEX, while the top score in any other profession is medicine’s 65.5%, with consulting at 64.0% and investment banking at 63.0%. Legal work is where AI performs best against human baselines.
- The people who wrote the tasks are people who’ve done the work. Mercor’s evals team built and runs the benchmark; the tasks and grading rubrics were authored by 137 contractors on its platform, professionals averaging 7+ years of experience, with the law contributors drawn from firms Mercor describes as “like Latham, Skadden, and Cravath.” Cass Sunstein advised and is a listed author on the technical report. Each task takes a seasoned professional 1–8 hours. This isn’t an academic exercise. It’s the day-to-day output of a BigLaw associate, turned into a grading rubric.
- When AI has to act like an associate — not just answer like one — it still fails nearly two legal tasks in three. APEX-Agents tests long-horizon tasks requiring file navigation, tool use, and multi-step reasoning across Google Workspace. The best agent scored 24% Pass@1 at launch in January 2026; as of June 2026 the best corporate-law score is 37.5%. The gap between answering a legal question and completing a legal project is the gap between a chatbot and a colleague.
- The benchmark tells you exactly which tasks to automate and which to keep human. Structured analysis and drafting — where AI scores 77.9% — are automatable now with human review. Multi-step workflows requiring judgment across documents are not. The line is specific, not abstract.
Every legal AI vendor has a demo that looks impressive. Harvey shows a contract analysis. CoCounsel walks through a research query. Spellbook redlines an NDA. The demo always works.
The question a managing partner should be asking is different: if I hand this tool the same work I’d hand a third-year associate — not a curated demo, but an actual assignment that takes hours, requires reading source documents, and needs to meet specific quality criteria — what percentage of the work meets the standard?
A team at Mercor built the benchmark to answer that question. Mercor is not a neutral party: it sells the expert data that benchmarks like this one show models need, so read what follows with that in mind.[1]
How APEX Was Built
The AI Productivity Index (APEX) is a benchmark for measuring whether frontier AI models can perform economically valuable professional work. It launched in September 2025 and was extended in December 2025 with a larger evaluation set and improved methodology. The full technical report is open-access. The evaluation harness and a 100-task development set are open-source.
APEX covers four professions: investment banking associate, management consultant, BigLaw associate, and primary care physician. What makes it different from LegalBench or other legal AI benchmarks is who built it and what it tests.
The Experts
Mercor’s core business is supplying frontier labs with expert-generated training and evaluation data — roughly what Scale AI is for general data labeling, but for specialized professions like law, banking, and medicine. For APEX, it recruited 137 professionals through its platform — people with an average of 7+ years of experience at top-tier firms. Law contributors came, in Mercor’s framing, from firms like Latham & Watkins, Skadden, and Cravath. The benchmark is advised by Cass Sunstein, the Harvard law professor and former White House regulatory administrator who is one of the most-cited legal scholars in the United States.
To be precise about roles: Mercor’s research team controls the methodology, the held-out evaluation set, and the leaderboard. The professionals are anonymous contractors who created the tasks and rubrics — the “firms like” framing can’t be checked against named individuals — and the marquee advisors (Sunstein for law, Dominic Barton for consulting, Eric Topol for medicine) lent domain review and author credit, not construction. Notably, Larry Summers appeared in Mercor’s launch marketing but is not on the current author list. None of this makes the rubrics worse — working-level lawyers writing working-level tasks is arguably more credible than big names would be. But the big names are doing marketing work, and you should weigh them accordingly.
Each expert created tasks drawn from their actual day-to-day work — not simplified academic problems. A law task might be: review this set of financial filings, identify the indemnification structure, draft a memo analyzing whether the buyer’s exposure exceeds the cap, and cite the specific provisions. The expert then wrote a grading rubric — an average of 14 specific criteria per task, each a binary pass/fail — functioning like unit tests for legal work product. Did the response identify the correct cap amount? Did it cite the right section? Did it flag the carve-out? Did the analysis reach a defensible conclusion?
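The unit-test analogy can be made concrete. The sketch below is a hypothetical illustration, not Mercor’s grading code: the criterion names are taken from the example questions above, the pass/fail values are invented, and scoring a response as the fraction of criteria met is an assumption about how binary rubric results roll up into a percentage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One binary rubric item, analogous to a single unit test."""
    description: str
    passed: bool


def rubric_score(criteria: list[Criterion]) -> float:
    """Fraction of criteria met (assumed aggregation, not confirmed by the report)."""
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    return sum(c.passed for c in criteria) / len(criteria)


# Hypothetical grading of one response to the indemnification memo task.
graded = [
    Criterion("Identifies the correct cap amount", True),
    Criterion("Cites the right section", True),
    Criterion("Flags the carve-out", False),
    Criterion("Reaches a defensible conclusion", True),
]

score = rubric_score(graded)
print(f"{score:.1%}")  # 75.0%
```

A real APEX rubric averages 14 criteria rather than four, but the shape is the same: each line is either satisfied by the work product or it is not.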
The Scale
The hidden evaluation set contains 400 tasks (100 per profession). Each task comes with source documents — PDFs, spreadsheets, contracts, financial models — averaging 26,677 tokens per task. The mean time for a seasoned professional to complete a task in the real world is 2.7 hours, ranging from 30 minutes to 20 hours. Models are run 8 times per task to account for variance, and scores are reported with 95% confidence intervals.
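To see what eight runs and a 95% interval look like in practice, here is a minimal sketch. It assumes a Student-t interval on the mean of the run scores, which may not be the method the APEX report uses, and the eight scores are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 7 degrees of freedom (8 runs).
T_CRIT_8_RUNS = 2.365


def mean_with_ci(run_scores: list[float], t_crit: float = T_CRIT_8_RUNS) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for repeated runs of one model on one task.

    The default critical value is only valid for exactly 8 runs; pass a
    different t_crit for any other run count.
    """
    n = len(run_scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(run_scores)
    standard_error = statistics.stdev(run_scores) / math.sqrt(n)
    half_width = t_crit * standard_error
    return mean, mean - half_width, mean + half_width


# Hypothetical rubric scores from 8 runs of the same task.
runs = [0.71, 0.79, 0.64, 0.86, 0.79, 0.71, 0.93, 0.79]
mean, lower, upper = mean_with_ci(runs)
print(f"mean {mean:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

The point of the interval is practical: two models whose intervals overlap heavily should not be read as meaningfully ranked, whatever the headline order says.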
This is not a multiple-choice test. It’s not a classification task. It’s the actual deliverable a partner would expect to see on their desk — graded against the same criteria the expert would use to evaluate an associate’s work.
The Leaderboard
In the December 2025 technical report (models tested in late November 2025), GPT-5 (Thinking=High) leads the overall leaderboard with a mean score of 67.0%, followed by Gemini 3 Pro (Thinking=High) at 64.3% and Grok 4 at 63.5%.
But the domain breakdown is where the story gets interesting.
Law is the highest-scoring domain by a wide margin. Three OpenAI models take the top three spots on law tasks. Claude Opus 4.5 (Thinking=On) scores 74.0% on law, fourth in the domain but notably strong on the hardest tasks. When APEX uses z-scores to measure performance on difficult tasks relative to other models, Opus 4.5 jumps from fifth to second overall, suggesting it handles the most complex legal work better than its mean score indicates.
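A small sketch shows why a z-score ranking can differ from a mean-score ranking. This is one plausible reading of the approach, not the report’s exact procedure: each task is standardized against the mean and spread of all models on that task, then averaged per model. The model names and scores are hypothetical.

```python
import statistics


def mean_z_scores(scores_by_model: dict[str, list[float]]) -> dict[str, float]:
    """Average each model's per-task z-score across tasks."""
    models = list(scores_by_model)
    task_counts = {len(v) for v in scores_by_model.values()}
    if len(task_counts) != 1:
        raise ValueError("every model needs a score for every task")
    n_tasks = task_counts.pop()

    totals = dict.fromkeys(models, 0.0)
    used = 0
    for task in range(n_tasks):
        column = [scores_by_model[m][task] for m in models]
        spread = statistics.pstdev(column)
        if spread == 0:
            continue  # every model tied, so the task carries no ranking signal
        centre = statistics.fmean(column)
        for m in models:
            totals[m] += (scores_by_model[m][task] - centre) / spread
        used += 1
    if used == 0:
        return dict.fromkeys(models, 0.0)
    return {m: totals[m] / used for m in models}


# Hypothetical scores: task 0 is easy, task 1 is hard for everyone.
scores = {
    "model_a": [0.95, 0.20],
    "model_b": [0.70, 0.40],
    "model_c": [0.60, 0.30],
}
raw_means = {m: statistics.fmean(v) for m, v in scores.items()}
z_means = mean_z_scores(scores)
best_raw = max(raw_means, key=raw_means.get)
best_z = max(z_means, key=z_means.get)
print(best_raw, best_z)  # model_a model_b
```

Here model_a wins on raw mean by running up the score on the easy task, while model_b wins on z-scores because it is the clear leader on the task where every model struggles.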
The live leaderboard has moved since the report. As of June 2026, GPT 5.4 (Thinking=High) leads overall at 67.2%, with Claude Opus 4.6 close behind at 65.7% — and on law tasks the race is effectively a tie: GPT 5 at 76.6%, Opus 4.6 at 76.4%. The ordering shuffles with each model release. The domain pattern doesn’t: law has been the highest-scoring profession in every snapshot since launch.
Why Law Scores Highest
The APEX team doesn’t offer a definitive explanation, but the data suggests a structural reason. Legal tasks in APEX tend to have clearer evaluation criteria than consulting or banking tasks. A memo either identifies the correct indemnification cap or it doesn’t. A contract analysis either cites the relevant provision or it misses it. The rubrics for law tasks have well-defined right answers grounded in source documents.
Consulting and banking tasks involve more subjective judgment — a market-entry recommendation might be defensible from multiple angles, and the rubric criteria reflect that ambiguity. Medicine tasks require highly specific clinical reasoning with precise guideline references.
This maps to a pattern visible across AI benchmarks generally: models perform best on tasks where the answer is verifiable against source material — exactly the kind of grounding that RAG architectures provide. Legal analysis, with its emphasis on specific textual provisions and defined terms, plays to the model’s strengths. Strategic judgment, with its emphasis on weighing incommensurable factors, exposes its weaknesses.
APEX-Agents: The Agentic Gap
APEX measures whether a model can produce a good answer to a professional task. APEX-Agents, published in January 2026, measures something harder: whether an AI agent can complete a project — navigating files, using tools, and executing multi-step workflows across real software environments.
The benchmark is built differently. Former partners and managing directors from Goldman Sachs, McKinsey, and leading law firms constructed 33 project scenarios — simulated 5-to-10-day client engagements — inside Google Workspace and Box environments. Each scenario contains an average of 166 files: emails, spreadsheets, memos, contracts, chat logs. The 480 tasks (including ~160 corporate law tasks across 12 legal “worlds”) require agents to find the right files, extract the right information, reason across multiple documents, and produce a deliverable. Average real-world completion time: 1.8 hours per task.
The scoring metric is Pass@1 — does the agent get it right on the first try? At publication in January 2026, the best agent, Gemini 3 Flash (Thinking=High), scored 24.0%. GPT-5.2 followed at 23.0%. Claude Opus 4.5 (Thinking=High) and Gemini 3 Pro (Thinking=High) clustered around 21–22%.
Open-source models scored below 5%.
The agents leaderboard has improved fast: as of June 4, 2026, Gemini 3.5 Flash leads at 49.6% Pass@1 overall, and Claude Opus 4.8 tops the corporate-law tasks at 37.5%. Read those numbers carefully, though: they come from Mercor’s newer “Loop” agent harness. Under the ReAct harness the January paper used, the best June law score is 29.8%. The same models score roughly eight points higher on law when the scaffolding around them improves, a gain comparable to a full model generation. Harnesses matter more than leaderboard headlines suggest, which is worth remembering when a legal AI vendor’s pitch attributes its performance to “the model.”
But the domain breakdown inverts the APEX story. Law’s best, 37.5%, sits well behind consulting (55.5%) and investment banking (57.0%). Law is the easiest profession for a model to answer and the hardest for an agent to execute.
The gap between APEX (77.9% on law) and APEX-Agents (37.5% on law) is the gap between “answer this question given these documents” and “complete this project given this messy information environment.” The first is a research task. The second is associate work. (The attempts themselves are cheap: at list prices, a single agent run burns $1.50–$3 of tokens, with Gemini 3 Flash’s low per-token rate offset by its 5.3-million-token appetite per run. That works out to roughly $8–14 per task that actually passes, against work a professional takes 1.8 hours to complete.)
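The cost-per-pass arithmetic is simple enough to sketch. If runs are treated as independent attempts, the expected number of runs per passing task is 1 divided by Pass@1, so the expected spend is the per-run cost divided by the pass rate. Pairing the $1.50–$3 range with the 21–24% launch pass rates is an assumption here; the post’s own figure presumably matches each model’s cost to its own pass rate.

```python
def cost_per_pass(cost_per_run: float, pass_rate: float) -> float:
    """Expected spend per passing task, assuming independent runs."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_run / pass_rate


# Per-run cost range and launch-era Pass@1 values quoted in the text.
for pass_rate in (0.21, 0.24):
    low = cost_per_pass(1.50, pass_rate)
    high = cost_per_pass(3.00, pass_rate)
    print(f"Pass@1 {pass_rate:.0%}: ${low:.2f} to ${high:.2f} per passing task")
# Pass@1 21%: $7.14 to $14.29 per passing task
# Pass@1 24%: $6.25 to $12.50 per passing task
```

Those ranges land in the same neighborhood as the $8–14 quoted above, which is all the sketch is meant to show.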
The inversion has measurable causes. Law tasks carry more all-or-nothing grading criteria than banking tasks — 4.57 versus 2.93 per task, and Pass@1 requires meeting every one — and they’re the longest in the benchmark, at an expert-estimated 2.4 hours versus 1.4 for banking. The third cause is the most telling: holding the agent harness constant, the best banking score rose 14.4 points between January and June while law rose 3.9. Generic agentic improvements — file navigation, tool use, computation — convert directly into completed banking tasks, because banking outputs are machine-checkable numbers. Law’s bottleneck sits after the navigation: a judgment-graded deliverable that better tool use alone doesn’t complete.
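The effect of all-or-nothing criteria can be illustrated with a deliberately crude model. Assume every criterion is met independently with the same probability. Neither assumption is from the paper, and the 0.8 figure is hypothetical. Then the chance of clearing all of them falls geometrically with the count.

```python
def all_or_nothing_pass_rate(per_criterion_rate: float, n_criteria: float) -> float:
    """Probability of meeting every criterion, assuming independence.

    n_criteria may be fractional because the benchmark reports averages
    per task (4.57 for law, 2.93 for banking).
    """
    if not 0 <= per_criterion_rate <= 1:
        raise ValueError("per_criterion_rate must be in [0, 1]")
    return per_criterion_rate ** n_criteria


HYPOTHETICAL_RATE = 0.8  # illustrative only, not a measured value

law = all_or_nothing_pass_rate(HYPOTHETICAL_RATE, 4.57)
banking = all_or_nothing_pass_rate(HYPOTHETICAL_RATE, 2.93)
print(f"law {law:.0%}, banking {banking:.0%}")
```

Under these assumptions the same per-criterion reliability yields roughly 36% for law and 52% for banking, so a sizable part of the cross-domain gap can appear before any difference in legal reasoning ability enters the picture.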
Why Agents Fail Where Models Succeed
The APEX-Agents analysis identifies three failure modes that legal teams should understand.
Fragmented context. Real work involves information scattered across email threads, document folders, and chat histories. Agents lose track of constraints when they have to assemble the relevant context from multiple sources instead of receiving it in a single prompt. A lawyer reading a data room navigates this automatically — scanning folder structures, recognizing which exhibits matter, skipping boilerplate. Current agents don’t replicate that triage.
Planning over long horizons. A legal project isn’t one question — it’s a sequence of dependent steps. Review the purchase agreement, identify the indemnification provisions, cross-reference against the disclosure schedules, check the definitions section for defined terms, and then draft the analysis. Current agents struggle to maintain a coherent plan across more than a few steps, particularly when an intermediate result changes the direction of the analysis.
Verifiable vs. subjective outputs. The same pattern from APEX appears in APEX-Agents: agents perform best on tasks with hard constraints (extract this number, find this clause) and worst on tasks requiring judgment (assess whether this risk is material, draft a recommendation). Software engineering — where code either runs or doesn’t — saw Pass@1 rates jump from 4.4% in 2023 to over 80% by early 2026. Legal work, where “correct” is often a matter of professional judgment, hasn’t seen the same trajectory.
Post-Training Closes Part of the Gap
Applied Compute, a training infrastructure company, used reinforcement learning on Mercor’s expert-labeled data to post-train a model specifically for APEX-Agents tasks. The results on corporate law: Pass@1 more than tripled, from 4.4% to 16.3%. Mean score nearly doubled. The model learned to navigate files efficiently rather than loop or abandon tasks.
This matters because it demonstrates that the agentic gap isn’t purely a model-capability problem; it’s partly a training-data problem.[1] Models trained on expert demonstrations of how professionals actually navigate information environments get materially better at doing the same thing. The trajectory suggests that purpose-built legal agents, trained on real legal workflows rather than general web text, will close the gap further. But 16.3% Pass@1 is still a long way from reliable.
What This Means for Law Firms
APEX and APEX-Agents together draw a specific, actionable line.
What AI does well now (77.9% territory): Structured analysis tasks where the model receives source documents and produces a deliverable graded against specific criteria. Contract review with defined rubrics. Memo drafting from identified sources. Provision extraction. Issue-spotting against a known framework. These are the tasks where a 77.9% score — with human review on the output — translates to genuine productivity gains. The model does the assembly; the lawyer does the judgment.
What AI doesn’t do well yet (37% territory): Multi-step projects requiring autonomous file navigation, cross-document reasoning, and judgment calls across ambiguous information. The full workflow of “here’s a data room, tell me what to worry about” — the assignment a partner actually gives a third-year associate — remains beyond current agents. Not because the models can’t reason about law (they clearly can), but because they can’t reliably navigate messy information environments, maintain a plan across dependent steps, and produce work product that meets professional standards on the first try.
The practical implication: the vendors selling agentic AI for legal work are ahead of where the technology actually is. Harvey’s 700,000 daily agentic tasks, CoCounsel’s autonomous workflows, Anthropic’s Claude for Legal plugins — all of these are real products doing real work. But the APEX-Agents data says the best agent still fails nearly two of three realistic corporate-law tasks as of June 2026. That doesn’t mean the products are useless — it means they need human oversight on every output, which is what every vendor says in the fine print and what every demo implicitly denies.
That oversight is where the real cost lives. The tokens for an agent run cost a few dollars; the verification doesn’t. A Pass@1 between 24% and 37% means an agent’s deliverable arrives with no signal about whether this was one of the runs that passed — so a lawyer must check every output as if it were wrong. Those attorney hours dwarf the API bill by orders of magnitude, and they scale with everything the agent produces, not just its failures. Any honest cost model for agentic legal AI is a verification budget with a rounding error attached for compute.
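A back-of-the-envelope version of that cost model might look like the sketch below. The review time and attorney rate are hypothetical placeholders, not figures from APEX or any firm; only the few-dollar token cost comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationBudget:
    token_cost_per_run: float       # dollars of API spend per agent run
    review_hours_per_output: float  # attorney time to check one deliverable
    attorney_rate: float            # dollars per attorney hour

    def review_cost(self) -> float:
        """Attorney cost of checking one output, pass or fail."""
        return self.review_hours_per_output * self.attorney_rate

    def cost_per_output(self) -> float:
        """Total cost of one agent deliverable, reviewed."""
        return self.token_cost_per_run + self.review_cost()

    def compute_share(self) -> float:
        """Fraction of the total that is API spend."""
        return self.token_cost_per_run / self.cost_per_output()


# Hypothetical inputs: 30 minutes of review at $400/hour per output.
budget = VerificationBudget(
    token_cost_per_run=3.00,
    review_hours_per_output=0.5,
    attorney_rate=400.0,
)
print(f"review ${budget.review_cost():.2f}, compute share {budget.compute_share():.1%}")
# review $200.00, compute share 1.5%
```

Because every output is reviewed whether or not it passed, the review term multiplies by total outputs, not by failures, which is why improving Pass@1 alone does not shrink this budget until the firm trusts the agent enough to review less.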
The firms that will get the most value from current AI are the ones that match the tool to the task type. Use AI for the 77.9% work — structured analysis, first-draft generation, extraction, classification — and keep humans on the 37% work. The APEX data tells you exactly where that line is. Most firms are guessing.
Further Reading
- The AI Productivity Index: APEX-v1-extended. Vidgen et al. The APEX technical report.
- APEX-Agents. Vidgen et al. The agentic benchmark.
- APEX Leaderboard. Live results across all tested models.
- APEX-Agents Leaderboard. Agentic task results.
- APEX-v1-extended on Hugging Face. Open-source development set (100 tasks, CC-BY).
- Building State-of-the-Art Agents with Mercor. Applied Compute’s post-training results on APEX-Agents.
- BigLaw Bench. Harvey’s legal-specific benchmark.
- LegalBench. The 162-task legal reasoning benchmark from Stanford/NeurIPS.
- Vals AI Legal Benchmarks. Independent model evaluations on LegalBench tasks.
This post is part of The Evidence series on LegalRealist AI. It is intended for informational and educational purposes only and does not constitute legal advice. AI capabilities, benchmark results, and model performance described here reflect publicly available information as of the publication date and are subject to rapid change. APEX technical-report figures reflect models tested in November 2025; APEX-Agents launch figures reflect January 2026. Live leaderboard figures are as of June 4, 2026. Current model versions may perform differently. Laws and ethics rules governing AI use in legal practice vary by jurisdiction.
[1] A disclosure worth weighing: Mercor’s core business is selling expert-generated training data to AI labs, with a reported $500 million revenue run rate and $10 billion valuation as of late 2025. A benchmark showing that agents fail at professional work, paired with a case study showing that training on Mercor’s data fixes it, aligns neatly with that commercial interest. The methodology is open and auditable; the “it’s a data problem” framing is also the sales pitch.



