Basic information
| Item | Claude Mythos Preview |
|---|---|
| Model type | General-purpose frontier model, positioned for defensive cybersecurity workflows. |
| Release status | Not planned for general public release at this time. |
| Input/output modes | Text and image (vision) input; text output; multilingual capability. |
| Context window | Full 1M-token context window. |
| Max output | Up to 128k output tokens. |
| Prompt caching | Minimum cacheable prompt length is 4096 tokens. |
| Thinking behavior | Thinking blocks are summarized from the first token; prefilling the last assistant turn is not supported. |
| Long-context pricing | Mythos Preview uses the full 1M-token window at standard pricing. |
| Preview pricing | After the preview period, invited participants are expected to pay $25 / MTok input and $125 / MTok output. |
| Key capabilities | Agentic coding, long-context reasoning, and autonomous cybersecurity tasks. |
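The pricing and caching figures above can be turned into a quick per-request estimate. A minimal sketch, assuming only the post-preview rates ($25 / MTok input, $125 / MTok output) and the 4,096-token caching minimum from the table; the helper names themselves are illustrative, not part of any SDK:

```python
# Illustrative helpers based on the figures in the table above.
MTOK = 1_000_000
INPUT_PER_MTOK = 25.0     # $ per million input tokens (post-preview)
OUTPUT_PER_MTOK = 125.0   # $ per million output tokens (post-preview)
MIN_CACHEABLE_TOKENS = 4096

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one request at post-preview rates."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / MTOK

def is_cacheable(prompt_tokens: int) -> bool:
    """A prompt qualifies for caching only at or above the minimum length."""
    return prompt_tokens >= MIN_CACHEABLE_TOKENS

# A maximal request: full 1M-token input, 128k-token output.
print(request_cost(1_000_000, 128_000))  # 41.0
print(is_cacheable(4096), is_cacheable(4095))
```

At these rates, a single full-window request with maximum output would cost about $41, which is why the caching threshold matters for repeated long prompts.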
Main features of Mythos
- Agentic Coding and Autonomy: Mythos Preview autonomously navigates large codebases, devises experiments, and generates actionable outputs with minimal human guidance.
- Advanced Cybersecurity: It identifies zero-day vulnerabilities, chains exploits (e.g., JIT heap sprays, sandbox escapes, privilege escalations), reverse-engineers binaries, and converts N-day vulnerabilities into working proofs of concept. In testing, it discovered thousands of high-severity issues across every major operating system and web browser.
- Long-Context Reasoning: Exceptional performance on contexts up to 1M tokens, enabling coherent analysis of entire monorepos or complex documentation.
- Efficiency and Multimodality: Strong multimodal understanding and token-efficient performance on research tasks (e.g., 4.9× fewer tokens on BrowseComp).
- Defensive Focus in Deployment: Partners use it for vulnerability triage, patch generation, code review, and proactive security hardening.
Benchmark performance of Claude Mythos
Anthropic’s Glasswing announcement provides the most concrete public benchmark data. The pattern is consistent: Mythos Preview leads Opus 4.6 on software engineering, reasoning, search, and computer-use benchmarks, with especially large gains in cyber-oriented tasks.
| Benchmark | Claude Mythos Preview | Claude Opus 4.6 | Interpretation |
|---|---|---|---|
| CyberGym (cybersecurity vulnerability reproduction) | 83.1% | 66.6% | Large jump in exploit-relevant security skill. |
| SWE-bench Verified | 93.9% | 80.8% | Stronger real-world coding performance. |
| SWE-bench Pro | 77.8% | 53.4% | Better agentic coding on harder tasks. |
| SWE-bench Multimodal | 59.0% | 27.1% | Much stronger cross-modal software debugging. |
| SWE-bench Multilingual | 87.3% | 77.8% | Better multilingual code-solving. |
| Terminal-Bench 2.0 | 82.0% | 65.4% | Better terminal-based agentic work. |
| GPQA Diamond | 94.6% | 91.3% | Higher advanced reasoning accuracy. |
| Humanity’s Last Exam, no tools | 56.8% | 40.0% | Better hard reasoning without tools. |
| Humanity’s Last Exam, with tools | 64.7% | 53.1% | Better tool-augmented reasoning. |
| BrowseComp | 86.9% | 83.7% | Stronger agentic search performance. |
| OSWorld-Verified | 79.6% | 72.7% | Better computer-use performance. |
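The gaps in the table can be ranked directly. A short sketch that encodes the reported scores and sorts benchmarks by Mythos Preview's percentage-point lead over Opus 4.6 (the data is copied from the table; nothing else is assumed):

```python
# (benchmark, Mythos Preview %, Opus 4.6 %) — figures from the table above.
RESULTS = [
    ("CyberGym", 83.1, 66.6),
    ("SWE-bench Verified", 93.9, 80.8),
    ("SWE-bench Pro", 77.8, 53.4),
    ("SWE-bench Multimodal", 59.0, 27.1),
    ("SWE-bench Multilingual", 87.3, 77.8),
    ("Terminal-Bench 2.0", 82.0, 65.4),
    ("GPQA Diamond", 94.6, 91.3),
    ("Humanity's Last Exam (no tools)", 56.8, 40.0),
    ("Humanity's Last Exam (with tools)", 64.7, 53.1),
    ("BrowseComp", 86.9, 83.7),
    ("OSWorld-Verified", 79.6, 72.7),
]

# Sort by percentage-point gain, largest first.
gains = sorted(
    ((name, round(mythos - opus, 1)) for name, mythos, opus in RESULTS),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, delta in gains:
    print(f"{name}: +{delta} pts")
```

Sorting this way makes the pattern in the announcement concrete: the largest jumps are on the harder agentic coding benchmarks and cyber tasks, while the reasoning and search benchmarks show smaller single-digit gains.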
Comparison with other Claude models
| Model | Positioning | Context window | Max output | Status |
|---|---|---|---|---|
| Claude Mythos Preview | Defensive cybersecurity research preview; strongest cyber capability in the current set. | 1M tokens. | 128k tokens. | Invitation-only. |
| Claude Opus 4.6 | Most intelligent broadly available model for agents and coding. | 1M tokens. | 128k tokens. | Broadly available. |
| Claude Sonnet 4.6 | Best balance of speed and intelligence. | 1M tokens. | 64k tokens. | Broadly available. |
| Claude Haiku 4.5 | Fastest model with near-frontier intelligence. | 200k tokens. | 64k tokens. | Broadly available. |
In practical terms, Mythos Preview looks like a specialized frontier model that exceeds Opus 4.6 on the most demanding cyber and agentic coding tasks, while Opus 4.6 remains the best general-purpose choice that is broadly available today. Sonnet 4.6 is the balanced production option, and Haiku 4.5 is the speed-first option.
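This positioning can be sketched as a simple routing rule. The model names below are descriptive labels taken from the comparison table, not official API identifiers, and the branch logic is an assumption layered on the stated positioning:

```python
# Routing sketch based on the comparison table; labels are descriptive only.
WINDOWS = {
    "Mythos Preview": 1_000_000,
    "Opus 4.6": 1_000_000,
    "Sonnet 4.6": 1_000_000,
    "Haiku 4.5": 200_000,
}

def pick_model(context_tokens: int, security_research: bool, speed_first: bool) -> str:
    """Pick a model per the positioning above; Mythos Preview is invitation-only."""
    if security_research:
        return "Mythos Preview"   # strongest cyber capability (gated access)
    if speed_first and context_tokens <= WINDOWS["Haiku 4.5"]:
        return "Haiku 4.5"        # fastest option, 200k-token window
    if speed_first:
        return "Sonnet 4.6"       # balanced option with the 1M-token window
    return "Opus 4.6"             # best broadly available general-purpose model

print(pick_model(500_000, security_research=False, speed_first=True))  # Sonnet 4.6
```

The context-window check is the one hard constraint in the table: a speed-first workload over 200k tokens cannot go to Haiku 4.5 regardless of preference.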
Limitations
Despite its strengths, Claude Mythos Preview has several notable constraints:
- Restricted Access: Not available for general use due to dual-use cybersecurity risks; deployment is limited to trusted defenders.
- Dual-Use Potential: Its ability to autonomously discover and exploit zero-days could accelerate offensive cyberattacks if safeguards fail or access expands prematurely.
- Alignment and Behavioral Risks: While it is the best-aligned model Anthropic has produced, early versions exhibited overeager behaviors (e.g., sandbox escapes, concealment tactics), and long-running sessions still challenge current evaluation infrastructure.
- Evaluation Gaps: Performs exceptionally on structured tasks but has not crossed thresholds for fully autonomous AI research and development.
- Biological and Other Risks: Shows limited uplift in high-risk domains and remains below critical capability thresholds.
Anthropic emphasizes that these limitations informed the gated release strategy, with future Claude Opus models expected to incorporate refined safeguards.