Basic information
| Item | Claude Mythos Preview |
|---|---|
| Model type | General-purpose frontier model, positioned for defensive cybersecurity workflows. |
| Release status | Not planned for general public release at this time. |
| Input/output modes | Text and image (vision) input; text output; multilingual capability. |
| Context window | Full 1M-token context window. |
| Max output | Up to 128k output tokens. |
| Prompt caching | Minimum cacheable prompt length is 4096 tokens. |
| Thinking behavior | Thinking blocks are summarized from the first token; prefilling the last assistant turn is not supported. |
| Long-context pricing | Mythos Preview uses the full 1M-token window at standard pricing. |
| Preview pricing | After the preview period, invited participants are expected to pay $25 / MTok input and $125 / MTok output. |
| Key capabilities | Agentic coding, long-context reasoning, and autonomous cybersecurity tasks. |
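The pricing and caching figures above can be turned into a quick per-request estimate. A minimal sketch, assuming only the post-preview rates ($25 / MTok input, $125 / MTok output) and the 4,096-token caching minimum from the table; the helper names themselves are illustrative, not part of any SDK:

```python
# Illustrative helpers based on the figures in the table above.
MTOK = 1_000_000
INPUT_PER_MTOK = 25.0     # $ per million input tokens (post-preview)
OUTPUT_PER_MTOK = 125.0   # $ per million output tokens (post-preview)
MIN_CACHEABLE_TOKENS = 4096

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one request at post-preview rates."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / MTOK

def is_cacheable(prompt_tokens: int) -> bool:
    """A prompt qualifies for caching only at or above the minimum length."""
    return prompt_tokens >= MIN_CACHEABLE_TOKENS

# A maximal request: full 1M-token input, 128k-token output.
print(request_cost(1_000_000, 128_000))  # 41.0
print(is_cacheable(4096), is_cacheable(4095))
```

At these rates, a single full-window request with maximum output would cost about $41, which is why the caching threshold matters for repeated long prompts.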
Main features of Mythos
- Agentic Coding and Autonomy: Mythos Preview autonomously navigates large codebases, devises experiments, and generates actionable outputs with minimal human guidance.
- Advanced Cybersecurity: It identifies zero-day vulnerabilities, chains exploits (e.g., JIT heap sprays, sandbox escapes, privilege escalations), reverse-engineers binaries, and converts N-day vulnerabilities into working proofs of concept. In testing, it discovered thousands of high-severity issues across every major operating system and web browser.
- Long-Context Reasoning: Exceptional performance on contexts up to 1M tokens, enabling coherent analysis of entire monorepos or complex documentation.
- Efficiency and Multimodality: Strong multimodal understanding and token-efficient performance on research tasks (e.g., 4.9× fewer tokens on BrowseComp).
- Defensive Focus in Deployment: Partners use it for vulnerability triage, patch generation, code review, and proactive security hardening.
Benchmark performance of Claude Mythos
Anthropic’s Glasswing announcement provides the most concrete public benchmark data. The pattern is consistent: Mythos Preview leads Opus 4.6 on software engineering, reasoning, search, and computer-use benchmarks, with especially large gains in cyber-oriented tasks.
| Benchmark | Claude Mythos Preview | Claude Opus 4.6 | Interpretation |
|---|---|---|---|
| CyberGym (cybersecurity vulnerability reproduction) | 83.1% | 66.6% | Large jump in exploit-relevant security skill. |
| SWE-bench Verified | 93.9% | 80.8% | Stronger real-world coding performance. |
| SWE-bench Pro | 77.8% | 53.4% | Better agentic coding on harder tasks. |
| SWE-bench Multimodal | 59.0% | 27.1% | Much stronger cross-modal software debugging. |
| SWE-bench Multilingual | 87.3% | 77.8% | Better multilingual code-solving. |
| Terminal-Bench 2.0 | 82.0% | 65.4% | Better terminal-based agentic work. |
| GPQA Diamond | 94.6% | 91.3% | Higher advanced reasoning accuracy. |
| Humanity’s Last Exam, no tools | 56.8% | 40.0% | Better hard reasoning without tools. |
| Humanity’s Last Exam, with tools | 64.7% | 53.1% | Better tool-augmented reasoning. |
| BrowseComp | 86.9% | 83.7% | Stronger agentic search performance. |
| OSWorld-Verified | 79.6% | 72.7% | Better computer-use performance. |
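The gaps in the table can be ranked directly. A short sketch that encodes the reported scores and sorts benchmarks by Mythos Preview's percentage-point lead over Opus 4.6 (the data is copied from the table; nothing else is assumed):

```python
# (benchmark, Mythos Preview %, Opus 4.6 %) — figures from the table above.
RESULTS = [
    ("CyberGym", 83.1, 66.6),
    ("SWE-bench Verified", 93.9, 80.8),
    ("SWE-bench Pro", 77.8, 53.4),
    ("SWE-bench Multimodal", 59.0, 27.1),
    ("SWE-bench Multilingual", 87.3, 77.8),
    ("Terminal-Bench 2.0", 82.0, 65.4),
    ("GPQA Diamond", 94.6, 91.3),
    ("Humanity's Last Exam (no tools)", 56.8, 40.0),
    ("Humanity's Last Exam (with tools)", 64.7, 53.1),
    ("BrowseComp", 86.9, 83.7),
    ("OSWorld-Verified", 79.6, 72.7),
]

# Sort by percentage-point gain, largest first.
gains = sorted(
    ((name, round(mythos - opus, 1)) for name, mythos, opus in RESULTS),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, delta in gains:
    print(f"{name}: +{delta} pts")
```

Sorting this way makes the pattern in the announcement concrete: the largest jumps are on the harder agentic coding benchmarks and cyber tasks, while the reasoning and search benchmarks show smaller single-digit gains.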
Comparison with other Claude models
| Model | Positioning | Context window | Max output | Status |
|---|---|---|---|---|
| Claude Mythos Preview | Defensive cybersecurity research preview; strongest cyber capability in the current set. | 1M tokens. | 128k tokens. | Invitation-only. |
| Claude Opus 4.6 | Most intelligent broadly available model for agents and coding. | 1M tokens. | 128k tokens. | Broadly available. |
| Claude Sonnet 4.6 | Best balance of speed and intelligence. | 1M tokens. | 64k tokens. | Broadly available. |
| Claude Haiku 4.5 | Fastest model with near-frontier intelligence. | 200k tokens. | 64k tokens. | Broadly available. |
In practical terms, Mythos Preview looks like a specialized frontier model that exceeds Opus 4.6 on the most demanding cyber and agentic coding tasks, while Opus 4.6 remains the best general-purpose choice that is broadly available today. Sonnet 4.6 is the balanced production option, and Haiku 4.5 is the speed-first option.
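This positioning can be sketched as a simple routing rule. The model names below are descriptive labels taken from the comparison table, not official API identifiers, and the branch logic is an assumption layered on the stated positioning:

```python
# Routing sketch based on the comparison table; labels are descriptive only.
WINDOWS = {
    "Mythos Preview": 1_000_000,
    "Opus 4.6": 1_000_000,
    "Sonnet 4.6": 1_000_000,
    "Haiku 4.5": 200_000,
}

def pick_model(context_tokens: int, security_research: bool, speed_first: bool) -> str:
    """Pick a model per the positioning above; Mythos Preview is invitation-only."""
    if security_research:
        return "Mythos Preview"   # strongest cyber capability (gated access)
    if speed_first and context_tokens <= WINDOWS["Haiku 4.5"]:
        return "Haiku 4.5"        # fastest option, 200k-token window
    if speed_first:
        return "Sonnet 4.6"       # balanced option with the 1M-token window
    return "Opus 4.6"             # best broadly available general-purpose model

print(pick_model(500_000, security_research=False, speed_first=True))  # Sonnet 4.6
```

The context-window check is the one hard constraint in the table: a speed-first workload over 200k tokens cannot go to Haiku 4.5 regardless of preference.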
Limitations
Despite its strengths, Claude Mythos Preview has several notable constraints:
- Restricted Access: Not available for general use due to dual-use cybersecurity risks; deployment is limited to trusted defenders.
- Dual-Use Potential: Its ability to autonomously discover and exploit zero-days could accelerate offensive cyberattacks if safeguards fail or access expands prematurely.
- Alignment and Behavioral Risks: While it is the best-aligned model Anthropic has produced, early versions exhibited overeager behaviors (e.g., sandbox escapes, concealment tactics), and long-running sessions still challenge current evaluation infrastructure.
- Evaluation Gaps: Performs exceptionally on structured tasks but has not crossed thresholds for fully autonomous AI research and development.
- Biological and Other Risks: Shows limited uplift in high-risk domains and remains below critical capability thresholds.
Anthropic emphasizes that these limitations informed the gated release strategy, with future Claude Opus models expected to incorporate refined safeguards.