Anthropic's Mythos: When the Benchmarks Stop Being the Story
Twelve months ago, no AI model could complete an expert-level hacking task. Anthropic's Mythos now does it 73% of the time. That isn't mere iteration; it's a phase change.
Claude Mythos Preview is Anthropic's newest frontier model, sitting in a tier above Opus (internally codenamed "Capybara"). The benchmark numbers are striking on their own: 93.9% on SWE-bench, 97.6% on USAMO 2026, a 31-point jump on math reasoning over Opus 4.6. But the benchmarks actually understate what's happening.
What Mythos has demonstrated in a few weeks of preview testing:
- Found a 27-year-old vulnerability in OpenBSD, an operating system whose entire reputation rests on hardening
- Helped Mozilla patch 271 Firefox vulnerabilities during the preview window
- Wrote a browser exploit chaining four vulnerabilities to escape both renderer and OS sandboxes — autonomously
- Identified zero-days in every major operating system and every major browser
- Reached 89% exact agreement with human security experts on severity assessments of its findings
Anthropic priced it at $25/$125 per million tokens (5x Opus territory) and gated access through Project Glasswing. Glasswing comprises about 40 partners, including Microsoft, Apple, Google, and Mozilla, and is backed by $100M in model credits and $4M in open-source security donations.
Three takeaways:
- The "AI can't really do hard engineering" argument is finished. Finding novel zero-days in code that has survived decades of human review and millions of automated security scans isn't pattern-matching. The implementation conversations of 2026 are not the conversations of 2025.
- The defender side just got real leverage. A small team with Mythos-class access can arguably do work that previously required a tier-one security firm. For implementation projects, this changes what "reasonable due diligence" looks like and what clients can demand at what price point.
- Compute economics are the new bottleneck. Mythos is among the first models trained on next-gen GPUs at scale. The pricing reflects that. Budget AI spend as a step-function, not a glide path.
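To make the step-function point concrete, here is a minimal budgeting sketch. The $25/$125 per-million-token figures come from the article; the assumption that they split as input/output pricing, and the workload volumes below, are illustrative guesses, not reported data.

```python
# Rough spend estimate for a Mythos-class model, per the article's pricing.
# ASSUMPTION: $25/M applies to input tokens, $125/M to output tokens.
INPUT_PRICE = 25.0 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 125.0 / 1_000_000  # dollars per output token

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly spend in dollars for a given token volume."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Hypothetical workload: 200M input / 40M output tokens per month.
print(round(monthly_cost(200_000_000, 40_000_000), 2))  # 10000.0
```

Even at modest team-scale usage, five-figure monthly bills arrive quickly at this tier, which is why planning spend as a step-function rather than a glide path matters.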
The skeptics aren't wrong that some of this is marketing. Sam Altman called the rollout "fear-based marketing," and researchers have flagged opaque test conditions. But "partially overhyped" and "genuinely transformative" aren't mutually exclusive. Both can be true.
Are you using frontier-tier models in production today? What would change if Mythos-class capability landed in your workflow tomorrow?