Anthropic's Mythos: When the Benchmarks Stop Being the Story
Twelve months ago, no AI model could complete an expert-level hacking task. Anthropic's Mythos now does it 73% of the time. That isn't mere iteration; it's a phase change.
Claude Mythos Preview is Anthropic's newest frontier model, sitting in a tier above Opus (internally codenamed "Capybara"). The benchmark numbers are striking on their own: 93.9% on SWE-bench, 97.6% on USAMO 2026, a 31-point jump on math reasoning over Opus 4.6. But the benchmarks actually understate what's happening.
What Mythos has demonstrated in a few weeks of preview testing:
- Found a 27-year-old vulnerability in OpenBSD, an operating system whose entire reputation rests on hardening
- Helped Mozilla patch 271 Firefox vulnerabilities during the preview window
- Wrote a browser exploit chaining four vulnerabilities to escape both renderer and OS sandboxes — autonomously
- Identified zero-days in every major operating system and every major browser
- Reached 89% exact agreement with human security experts on severity assessments of its findings
Anthropic priced it at $25/$125 per million tokens (5x Opus territory) and gated access through Project Glasswing. Glasswing comprises about 40 partners, including Microsoft, Apple, Google, and Mozilla, and is backed by $100M in model credits and $4M in open-source security donations.
Three takeaways:
- The "AI can't really do hard engineering" argument is finished. Finding novel zero-days in code that has survived decades of human review and millions of automated security scans isn't pattern-matching. The implementation conversations of 2026 are not the conversations of 2025.
- The defender side just got real leverage. A small team with Mythos-class access can arguably do work that previously required a tier-one security firm. For implementation projects, this changes what "reasonable due diligence" looks like and what clients can demand at what price point.
- Compute economics are the new bottleneck. Mythos is among the first models trained on next-gen GPUs at scale. The pricing reflects that. Budget AI spend as a step-function, not a glide path.
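To make the step-function point concrete, here is a minimal budgeting sketch. The $25/$125 per-million-token figures come from the article; the assumption that they split as input/output pricing, and the workload volumes below, are illustrative guesses, not reported data.

```python
# Rough spend estimate for a Mythos-class model, per the article's pricing.
# ASSUMPTION: $25/M applies to input tokens, $125/M to output tokens.
INPUT_PRICE = 25.0 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 125.0 / 1_000_000  # dollars per output token

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly spend in dollars for a given token volume."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Hypothetical workload: 200M input / 40M output tokens per month.
print(round(monthly_cost(200_000_000, 40_000_000), 2))  # 10000.0
```

Even at modest team-scale usage, five-figure monthly bills arrive quickly at this tier, which is why planning spend as a step-function rather than a glide path matters.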
The skeptics aren't wrong that some of this is marketing. Sam Altman called the rollout "fear-based marketing," and researchers have flagged opaque test conditions. But "partially overhyped" and "genuinely transformative" aren't mutually exclusive. Both can be true.
Are you using frontier-tier models in production today? What would change if Mythos-class capability landed in your workflow tomorrow?