MiMo-V2-Flash Overview
MiMo-V2-Flash is Xiaomi MiMo’s open-weight Mixture-of-Experts (MoE) reasoning model, served through the MiMo-V2-Flash API and built for fast inference, coding, and agentic workflows. The model card and technical report describe it as a 309B-parameter MoE with 15B active parameters, a hybrid attention design, and multi-token prediction for faster decoding.
Technical specifications
| Item | MiMo-V2-Flash |
|---|---|
| Provider | Xiaomi MiMo |
| Model family | MiMo-V2 |
| Model type | Mixture-of-Experts (MoE) language model |
| Total parameters | 309B |
| Active parameters | 15B |
| Native context length | 32K |
| Extended context length | Up to 256K |
| Attention design | Hybrid Sliding Window Attention (5:1 SWA to Global Attention) |
| Sliding window size | 128 tokens |
| MTP layers | 3 |
| Training scale | 27T tokens |
| Output modality | Text |
| Release date | 2025-12-16 |
| Repository license | Apache-2.0 (GitHub repo) |
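The table's 5:1 attention ratio and 128-token window translate directly into KV-cache savings. The sketch below is back-of-envelope arithmetic only: the layer depth is a placeholder, not a published MiMo-V2-Flash figure; only the 5:1 ratio and 128-token window come from the table above.

```python
# Back-of-envelope KV-cache comparison: fully global attention vs. a 5:1
# SWA-to-global hybrid layout. `num_layers` is a placeholder value, NOT a
# published MiMo-V2-Flash figure.

def kv_tokens_cached(num_layers, seq_len, swa_per_global=5, window=128):
    """Total token positions held in KV cache across all layers when
    every (swa_per_global + 1)-th layer is global and the rest are SWA."""
    group = swa_per_global + 1
    n_global = num_layers // group
    n_swa = num_layers - n_global
    return n_swa * min(seq_len, window) + n_global * seq_len

layers, ctx = 48, 262_144                 # placeholder depth, 256K context
hybrid = kv_tokens_cached(layers, ctx)
dense = layers * ctx                      # every layer caches the full context
print(f"hybrid cache is {hybrid / dense:.1%} of a fully global cache")
```

At long context the window term becomes negligible, so cache size is dominated by the one-in-six global layers, roughly a 6x reduction versus caching every layer in full.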
What is MiMo-V2-Flash?
MiMo-V2-Flash is Xiaomi’s inference-efficient foundation model for reasoning-heavy workloads. It is designed to balance long-context handling with lower serving cost, using sliding window attention to reduce cache pressure and multi-token prediction to speed up decoding.
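Multi-token prediction speeds decoding by proposing several tokens per step instead of one. The toy loop below shows the generic draft-and-verify idea; it is a sketch of that general technique, not MiMo's actual decoding algorithm, and all function names are illustrative.

```python
# Toy draft-and-verify loop: cheap MTP-style heads draft k tokens, the main
# model verifies them in one pass, and every accepted draft token saves a
# sequential decode step. Generic sketch, not MiMo's implementation.

def decode_with_mtp(prompt, draft_fn, verify_fn, n_tokens, k=3):
    """draft_fn(seq, k) -> k proposed next tokens; verify_fn(seq, draft)
    -> accepted tokens (always at least one, as in speculative decoding)."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(verify_fn(out, draft_fn(out, k)))
    return out[len(prompt):len(prompt) + n_tokens]

# Toy target model: the next token is (last + 1) mod 10.
def target_next(seq):
    return (seq[-1] + 1) % 10

# Toy draft head: forgets the mod, so it is wrong whenever the last token is 9.
def draft_fn(seq, k):
    s = list(seq)
    for _ in range(k):
        s.append(s[-1] + 1)
    return s[len(seq):]

def verify_fn(seq, draft):
    accepted, s = [], list(seq)
    for t in draft:
        correct = target_next(s)
        accepted.append(correct)
        s.append(correct)
        if t != correct:            # first mismatch: keep correction, stop
            break
    return accepted

print(decode_with_mtp([5], draft_fn, verify_fn, 8))   # [6, 7, 8, 9, 0, 1, 2, 3]
```

When the draft heads are usually right, each verify pass commits several tokens at once, which is where the decode-speed win comes from.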
Main features of MiMo-V2-Flash
- MoE efficiency with a small active footprint: 309B total parameters with only 15B active per token, which keeps per-token compute low and is central to the model's efficient-serving positioning.
- Hybrid attention for long context: The architecture alternates five SWA layers with one global attention layer, using a 128-token window to cut KV-cache cost.
- Multi-token prediction for faster decoding: The model includes 3 MTP layers, and the technical materials describe this as a speed and throughput optimization for generation.
- Built for agentic workflows: Xiaomi positions it for reasoning, coding, and agent use cases, and the evaluation suite includes SWE-Bench, Terminal-Bench, and BrowseComp.
- Long-context support: The repo reports support up to 256K, while the vLLM recipe provides practical serving guidance for lower `max-model-len` values depending on memory budget.
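The hybrid attention layout above can be made concrete with two toy masks and the 5:1 layer interleave. This is a minimal sketch of the general pattern: the real model uses a 128-token window, while a tiny window is used here so the masks stay readable.

```python
# Minimal sketch of the two mask types in a 5:1 hybrid stack.
# mask[i][j] is True when position i may attend to position j.

def causal_mask(n):
    # Global-attention layers: each position sees all earlier positions.
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, window):
    # SWA layers: each position sees at most `window` tokens, itself included.
    return [[j <= i and i - j < window for j in range(n)] for i in range(n)]

def layer_kinds(num_layers, swa_per_global=5):
    """5:1 interleave: five SWA layers, then one global layer, repeating."""
    return ["global" if (i + 1) % (swa_per_global + 1) == 0 else "swa"
            for i in range(num_layers)]

print(layer_kinds(12))                  # ten 'swa' layers, two 'global'
print(sliding_window_mask(6, 3)[5])     # position 5 sees only 3, 4, 5
```

The SWA layers bound per-layer cache growth at the window size, while the periodic global layers preserve long-range information flow across the full context.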
Benchmark performance
The base-model table in the repo shows MiMo-V2-Flash performing competitively against larger open models on general knowledge, math, coding, and long-context tasks. The post-training table highlights strong agentic and reasoning results.
| Benchmark | MiMo-V2-Flash | What it suggests |
|---|---|---|
| MMLU-Pro | 84.9 | Strong broad reasoning |
| GPQA-Diamond | 83.7 | Solid difficult QA performance |
| AIME 2025 | 94.1 | Strong math reasoning |
| LiveCodeBench-v6 | 80.6 | Competitive coding ability |
| SWE-Bench Verified | 73.4 | Strong software-agent performance |
| SWE-Bench Multilingual | 71.7 | Good multilingual coding/agent coverage |
| Terminal-Bench 2.0 | 38.5 | Useful but not top-of-class on terminal-heavy tasks |
| NIAH-Multi 256K | 96.7 | Long-context retrieval remains strong at 256K |
MiMo-V2-Flash vs comparable reasoning models
| Model | MMLU-Pro | SWE-Bench Verified | Terminal-Bench 2.0 | Notes |
|---|---|---|---|---|
| MiMo-V2-Flash | 84.9 | 73.4 | 38.5 | Efficient open-weight reasoning model |
| Kimi-K2 Thinking | 84.6 | 71.3 | 35.7 | Close on reasoning, weaker on terminal tasks |
| DeepSeek-V3.2 Thinking | 85.0 | 73.1 | 46.4 | Strong terminal performance, similar reasoning tier |
Best use cases
MiMo-V2-Flash fits best when you need a model that can reason over long inputs, help with coding tasks, and stay efficient in production. It is a strong choice for document-heavy RAG, multi-step agent workflows, code assistance, and long-context analysis where serving cost matters.
Limitations
MiMo-V2-Flash is optimized for inference efficiency, but real-world throughput still depends on batching, tensor parallelism, and the serving configuration. The vLLM guide also shows that practical `max-model-len` settings may sit below the headline 256K depending on memory and latency tradeoffs.
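As a rough illustration of the tradeoff above, a vLLM launch might cap the context below 256K. This is a hedged config fragment: the model identifier, tensor-parallel degree, and context cap are placeholders, and the official MiMo-V2-Flash vLLM recipe should be consulted for supported values.

```shell
# Illustrative vLLM launch; model id, TP degree, and context cap are
# placeholder values, not recommendations from the MiMo recipe.
vllm serve XiaomiMiMo/MiMo-V2-Flash \
  --tensor-parallel-size 8 \
  --max-model-len 131072
```

Lowering `--max-model-len` shrinks the worst-case KV-cache reservation, trading maximum context for headroom that can go to larger batches or lower latency.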