upd

June 11, 2026

min

Claude Opus 4.8: Sharper judgment, better agentic reliability

Anthropic's newest Opus release ships with meaningful benchmark gains, a new honesty layer, and three platform features that change how developers actually work with the model.

What is Claude Opus 4.8?

Claude Opus 4.8 is the latest entry in Anthropic's Opus model family, released on May 28, 2026. It builds incrementally on Opus 4.7, tightening performance across coding, agentic tasks, reasoning, and knowledge work rather than chasing headline benchmark numbers at the expense of real-world reliability.

The core bet Anthropic is making with this release is straightforward: for teams that run autonomous agents, write and review complex code, or depend on Claude to surface its own blind spots, Opus 4.8 is a meaningfully more trustworthy collaborator. The pricing stayed exactly the same. The capabilities grew.

Anthropic describes this as a "modest but tangible improvement" on 4.7, which is honest positioning. It's not a ground-up architectural overhaul. What it is: a smarter, more self-aware agent that costs exactly what the last version did and runs 2.5× faster in fast mode at one-third the fast-mode price.

4×

More Reliable Code Review

Less likely to let coding flaws pass unremarked compared to Opus 4.7, improving trustworthiness in debugging and review workflows.

84%

Browser Agent Benchmark

Achieved the best recorded score on the Online-Mind2Web benchmark for browser-based autonomous agent execution.

2.5×

Faster Inference Speed

Fast mode now delivers 2.5× higher speed while also reducing operational cost to roughly one-third of the previous generation.

10%+

Legal Agent Benchmark

The first model to surpass the 10% threshold on the Legal Agent Benchmark all-pass standard for legal reasoning tasks.

What's new in Claude Opus 4.8?

This release bundles model improvements with three substantial platform additions. The model and the features were designed together — each feature leans on something Opus 4.8 does better than its predecessor.

Infographic for Claude Opus 4.8, titled with the tagline 'Sharper judgment. Better agentic reliability.' A central glowing digital brain is surrounded by six key capability improvements: Better Agentic Judgment: Asks right questions and catches mistakes. Stronger Alignment Baseline: Higher prosocial scores and lower misuse risk. Improved Honesty & Uncertainty Handling: 4x less likely to miss coding flaws. Stronger Legal & Knowledge Work: First model to break 10% on the Legal Agent Benchmark all-pass standard. Best-in-Class Computer Use: Scores 84% on Online-Mind2Web and 100% on Magi’s Super-Agent benchmark. Cleaner Long-Session Collaboration: Carries voice and technical direction across long sessions. Visuals: To the left, a laptop displays a 'Regulatory Memo' project. To the right, stacked books are titled 'Corporate Law', 'SEC Disclosure Rules', and 'Financial Statements'. Bottom Section: Lists high-stakes use cases (Legal, Financial Analysis, Research, Operations) and a footer emphasizing: 'Your data stays private. You stay in control. Built with safety by design.

The model improvements

`Better agentic judgment`

Opus 4.8 asks the right questions before making big changes, catches its own mistakes mid-task, and pushes back on plans that don't hold up. Early testers at Cursor report meaningfully more efficient tool calling — fewer steps to reach the same result.

`Improved honesty & uncertainty handling`

A long-standing weakness in AI agents is overconfidence, claiming progress when the evidence is thin. Opus 4.8 is four times less likely than 4.7 to let a coding flaw pass without flagging it. Investment analysts at Visible Alpha noted this concretely: it proactively flagged input and output issues that other models quietly ignored.

`Stronger legal & knowledge work`

On the Legal Agent Benchmark, Opus 4.8 is the first model to break 10% on the all-pass standard. For teams running high-stakes professional workflows — legal, compliance, financial — that's not a marginal difference in a leaderboard; it's the point at which you can start handing off real attorney work with confidence.

`Best-in-class computer use`

On Online-Mind2Web, a benchmark for browser agents navigating live web tasks, Opus 4.8 scores 84%, a jump over both Opus 4.7 and GPT-5.5. It's the only model to complete every case end-to-end on Magi's Super-Agent benchmark, beating prior Opus versions and GPT-5.5 at cost parity.

`Stronger alignment baseline`

Anthropic's alignment team found Opus 4.8 reaches new highs on prosocial traits, supporting user autonomy, acting in users' interests — and has substantially lower rates of misaligned behavior like deception or cooperation with misuse than its predecessor.

`Cleaner long-session collaboration`

Testers who work with the model across extended writing and analysis sessions report it carries voice, style context, and technical direction better across turns. Less re-explaining, better signal-to-noise on outputs.

The three new platform features

`Dynamic Workflows in Claude Code`

Now in research preview for Enterprise, Team, and Max plan users, this feature lets Claude plan a large job, spin up hundreds of parallel subagents to execute it, and verify the results before reporting back. The stated example from Anthropic: codebase-scale migrations across hundreds of thousands of lines of code, from kickoff to merge, using your existing test suite as the quality bar. That's not a PR summary, it's autonomous refactoring at a scale that previously required a team.

`Effort Control on claude.ai and Cowork`

A new slider next to the model selector lets users choose how hard Claude works on a given response. Three effective levels:

Thinking Mode	Description	Best Use Cases
High	Default operating mode with the best balance between reasoning quality, latency, and token efficiency. Comparable token usage to Opus 4.7 default mode, but with stronger overall performance.	Everyday coding tasks, document work, agent workflows, research assistance, and general production usage.
Extra	Higher-effort reasoning mode where Claude allocates more compute and deeper analytical passes before responding. Also exposed as xhigh in Claude Code workflows.	Complex debugging, difficult analytical tasks, one-off deep research, and long asynchronous agent workflows.
Max	Maximum reasoning effort and token expenditure with minimal regard for throughput. Optimized for exhaustive analysis and highest-confidence outputs.	Mission-critical deliverables, high-stakes reviews, exhaustive verification, and situations where output quality matters more than speed or cost.

`Mid-task system prompt updates via the Messages API`‍

Developers can now inject system entries directly inside the messages array without breaking the prompt cache or routing instructions through a user turn. In practice, this means you can update Claude's permissions, token budget, or environment context on the fly as an agent runs. A small change in the API spec that makes a real difference in how cleanly you can build stateful agentic systems.

Benchmarks and real-world performance

Anthropic's benchmarks for Opus 4.8 span four domains: coding, agentic skills, reasoning, and practical knowledge work. The biggest verified gains are in agentic coding and computer use, which maps directly to the use cases where the Opus line is most commonly deployed.

Infographic titled ‘Claude Opus 4.8 Benchmarks’ — showing leading performance across coding, agentic skills, reasoning, and practical knowledge work. Four main categories with bar charts comparing Opus 4.8, Opus 4.7, and GPT-5.5 (and sometimes Gemini 3.5 Flash): Coding: SWE-bench Verified: Opus 4.8 = 72.5% Terminal-Bench 2.1: Opus 4.8 = 46.5% Agentic Skills: OSWorld-Verified: Opus 4.8 = 86.4% Online-Mind2Web: Opus 4.8 = 84.0% Reasoning: GPQA Diamond: Opus 4.8 = 81.0% AIME 2025: Opus 4.8 = 72.1% Practical Knowledge Work: Finance Agent v2: Opus 4.8 = 62.1% Law Agent Benchmark: Opus 4.8 = 10.4% Right sidebar notes methodology updates (e.g., OSWorld-Verified revised to reflect real-world conditions; Opus 4.7 baseline adjusted). Footer states: “Benchmarks are snapshots… Real-world reliability comes from how the model behaves in your workflows.” All results from Anthropic internal evaluations unless noted. Higher is better.

A few things worth calling out in the benchmark picture. First, the Terminal-Bench 2.1 scores use the Terminus-2 public harness. Second, Anthropic updated how they run OSWorld-Verified to better reflect real-world conditions, revising the Opus 4.7 baseline to 82.3%, Opus 4.8 improves on that revised number. Third, the Finance Agent v2 comparison includes a newly competitive Gemini 3.5 Flash at 57.9%, which is a meaningful jump from earlier Gemini versions and sets up a competitive landscape worth watching.

What is Claude Opus 4.8 actually good for?

The honest answer is: it's good for the same things Opus was always good for, but with fewer edge cases that require you to double-check the model's work. That might sound like faint praise. In production, it's the difference between an AI tool and a reliable AI collaborator.

Autonomous coding and code review

Opus 4.8's primary upgrade in coding isn't raw generation speed, it's the reduction in silent failures. It's four times less likely to produce flawed code without remarking on the flaw. For teams that trust their CI/CD to catch errors, that statistic is mostly noise. For teams running long agentic coding sessions where Claude is committing changes autonomously, it's genuinely important. Pair it with Dynamic Workflows and you can run codebase migrations or refactors at a scale that simply wasn't feasible with earlier models.

Research and document synthesis

For tasks like literature review, competitive analysis, or financial document workflows, Opus 4.8 produces denser, more information-rich outputs with better signal-to-noise. Visible Alpha's investment team described it as noticeably better at flagging when its inputs or outputs are suspect — something other models leave to the user to catch.

Legal and compliance-adjacent work

Both EvenUp and CoCounsel's Thomson Reuters team independently reported meaningful improvements in consistency and reasoning quality for high-stakes professional workflows. The Legal Agent Benchmark result (first to clear 10% on all-pass) is a useful concrete signal: these aren't soft claims about "reasoning ability" but performance on tasks that require coordinated multi-step legal analysis with real correctness requirements.

Multi-step agent tasks

Any workflow that runs Claude autonomously for more than a few turns benefits from the improved judgment in 4.8. It asks better questions before making irreversible changes, it's more likely to flag when it's uncertain, and it's less likely to get confidently lost in the middle of a long task. Companies like Devin's Cognition report that it uses tools more cleanly and follows instructions with the consistency their autonomous engineering workloads need to run unattended.

Long-form writing and editorial work

Testers like Katie Parrott (Staff Writer at Quora) highlight the model's ability to carry voice and style direction across long sessions. For writers, editors, and content teams working on extended documents, that translates to less re-prompting and more coherent output over time.

Browser-based automation

With an 84% score on Online-Mind2Web and the best computer-use and browser-agent performance in current testing, Opus 4.8 is the most capable choice for tasks that involve navigating real web interfaces — filling forms, extracting structured data from live pages, or running end-to-end flows across web tools.

Claude Opus 4.8 vs other models: is it worth the upgrade?

The honest model comparison looks like this:

Model	Best For	Consider If
Claude Opus 4.8	Agentic coding, browser agents, legal and compliance workflows, long-context analysis, and autonomous multi-step execution.	You need strong judgment, low silent-failure rates, and frontier-level performance without paying more than previous Opus pricing.
Claude Opus 4.7	The same categories as Opus 4.8 including advanced coding, agent workflows, and enterprise reasoning tasks.	Your infrastructure or deployment stack is already standardized around 4.7 and migration costs matter more than incremental gains.
Claude Sonnet Series	Everyday coding, conversational AI, content generation, and large-scale API workloads with lower operational cost.	Your applications are cost-sensitive and do not require Opus-level depth, extended reasoning, or maximum agent reliability.
GPT-5.5 (OpenAI)	Broad general-purpose usage, existing GPT integrations, and workflows already built around the OpenAI ecosystem.	Your organization is deeply invested in the OpenAI platform and your workloads are not heavily focused on legal or agentic execution.
Gemini 3.5 Flash	High-throughput multimodal applications, low-latency pipelines, and deep integration with Google's AI ecosystem.	You rely on Google Workspace, Vertex AI, or prioritize speed and token pricing over maximum reasoning depth.
Claude Mythos Preview	Frontier cybersecurity research, advanced autonomous reasoning, and highest-intelligence experimental deployments.	Your organization participates in Anthropic's Project Glasswing program and requires cutting-edge security or research capabilities ahead of wider release.

The bottom line on the upgrade question: if you're currently running Opus 4.7 in production, moving to 4.8 is a no-brainer. Same pricing, the same model string structure, better performance on the tasks that matter most in agentic deployments. The only reason to delay is if your deployment pipeline requires extensive re-testing, but for most teams, the model string swap is the only change required.

Claude Opus 4.8 pricing and availability

Anthropic kept pricing identical to Opus 4.7 — a deliberate decision that they've made consistently across the 4.x Opus line. Fast mode pricing dropped significantly, which changes the economics for time-sensitive applications.

Mode	Input Tokens (per 1M)	Output Tokens (per 1M)	Notes
Standard	$5.00	$25.00	Pricing remains unchanged from Claude Opus 4.7. Default reasoning effort level: High.
Fast Mode	$10.00	$50.00	Approximately 2.5× faster output generation. Fast mode pricing is reportedly 3× cheaper than fast tiers on previous-generation models.

What comes after Opus 4.8?

Anthropic was unusually specific about their roadmap in the 4.8 announcement. Two things are explicitly coming:

First, models that offer "many of the same capabilities as Opus at a lower cost", which is the standard Sonnet / Haiku progression, but signals that the next Sonnet tier will close the capability gap to Opus in meaningful ways.

Second, and more interesting: a new model class with "even higher intelligence than Opus." This is Claude Mythos, currently available in preview to a small number of organizations through Project Glasswing for cybersecurity research. Anthropic notes they're making "swift progress" on developing the cyber safeguards Mythos-class models require before general release, and expect to bring them to all customers "in the coming weeks."

For developers: the current rational choice is Opus 4.8 for serious agentic and knowledge-work applications, with Sonnet for high-volume use cases where cost per call matters. Mythos is worth watching, if it's even a significant step above Opus, it would reshape what's possible in autonomous workflows.

Ready to build with Claude Opus 4.8?

Access Anthropic models through AI/ML API and integrate the latest Claude into your product in minutes.

Frequently asked questions

What is the Claude Opus 4.8 model string for the API?

The official model identifier is claude-opus-4-8. Pass this as the model parameter in any Anthropic API call. It works identically to previous Opus model strings — if you're migrating from Opus 4.7 (claude-opus-4-7), swapping the string is the only change required in most integrations.

Is Claude Opus 4.8 more expensive than Opus 4.7?

No. Standard pricing is unchanged at $5 per million input tokens and $25 per million output tokens — the same as Opus 4.7. Fast mode pricing actually dropped: it now costs $10 input / $50 output per million tokens, which is three times cheaper than fast mode was for previous Opus models, while running at 2.5× the standard speed.

What does Effort Control actually do, and which plans get it?

Effort Control is a new setting in claude.ai and Cowork that sits next to the model selector. It lets you choose how much thinking and token usage Claude applies to a given response. Higher effort produces better results; lower effort responds faster and uses rate limit more slowly. There are three levels: High (default — best quality/speed balance), Extra (recommended for difficult or complex tasks, available as xhigh in Claude Code), and Max (maximum token spend for critical deliverables). Effort Control is available on all plans — Free, Pro, Team, and Enterprise.

How does Opus 4.8 compare to Opus 4.7 on coding tasks specifically?

The headline improvement is a four-fold reduction in silent coding errors — Opus 4.8 is four times less likely than 4.7 to produce flawed code without flagging the flaw itself. On CursorBench (Cursor's internal evaluation), 4.8 outperforms all prior Opus versions at every effort level, with meaningfully more efficient tool calling — fewer steps to reach the same outcome. On Terminal-Bench 2.1, 4.8 scores higher than 4.7 using the Terminus-2 public harness. It also fixed comment-verbosity and tool-calling issues that teams like Cognition reported with Opus 4.7.

When will Claude Mythos be available to all users?

Anthropic hasn't announced a firm date. As of the Opus 4.8 release (May 28, 2026), Claude Mythos Preview is available only to a small number of organizations through Project Glasswing, currently focused on cybersecurity research use cases. Anthropic says they expect to bring Mythos-class models to all customers "in the coming weeks" once the required cyber safeguards are finalized. Given they used the phrase "swift progress," a general release sometime in June or July 2026 is plausible — but unconfirmed.

Example H2

Share with friends

Ready to get started? Get Your API Key Now!

Get API Key