MiMo-V2-Flash Overview

MiMo-V2-Flash is Xiaomi MiMo's open-weight Mixture-of-Experts (MoE) reasoning model, built for fast inference, coding, and agentic workflows. The model card and technical report describe it as a 309B-parameter MoE with 15B active parameters per token, a hybrid attention design, and multi-token prediction (MTP) for faster decoding.

Technical specifications

Item	MiMo-V2-Flash
Provider	Xiaomi MiMo
Model family	MiMo-V2
Model type	Mixture-of-Experts (MoE) language model
Total parameters	309B
Active parameters	15B
Native context length	32K tokens
Extended context length	Up to 256K tokens
Attention design	Hybrid Sliding Window Attention (5:1 SWA to Global Attention)
Sliding window size	128 tokens
MTP layers	3
Training scale	27T tokens
Output modality	Text
Release date	2025-12-16
Repository license	Apache-2.0 (GitHub repo)
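
To put the total and active parameter counts in perspective, the short sketch below computes the fraction of weights used per token and a rough per-token compute figure. The "about 2 FLOPs per active parameter" rule is a common back-of-the-envelope approximation that ignores attention cost; it is an assumption here, not a figure from the model card.

```python
# Parameter counts taken from the specification table above.
TOTAL_PARAMS = 309e9
ACTIVE_PARAMS = 15e9


def active_fraction(total: float, active: float) -> float:
    """Share of the model's weights that participate in one token's forward pass."""
    return active / total


def approx_forward_flops_per_token(active: float) -> float:
    """Rough estimate: ~2 FLOPs per active parameter (assumption, ignores attention)."""
    return 2.0 * active


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
    print(f"approx FLOPs/token: {approx_forward_flops_per_token(ACTIVE_PARAMS):.2e}")
```

Under this approximation, per-token compute tracks the 15B active parameters rather than the 309B total, although all 309B still have to be held in memory when serving.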

What is MiMo-V2-Flash?

MiMo-V2-Flash is Xiaomi’s inference-efficient foundation model for reasoning-heavy workloads. It is designed to balance long-context handling with lower serving cost, using sliding window attention to reduce cache pressure and multi-token prediction to speed up decoding.
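
The decoding speed-up from multi-token prediction can be illustrated with a small model of speculative decoding. This is a sketch under stated assumptions, not Xiaomi's implementation: it assumes the MTP layers act as draft heads that each propose one extra token, and that every draft token is accepted independently with the same probability. The acceptance rate used below is hypothetical, not a measured value.

```python
def expected_tokens_per_step(num_draft_tokens: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per decoding step under a simple speculative model.

    One token always comes from the main model. Draft token i only counts if
    all earlier drafts were accepted, which happens with probability rate**i.
    """
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    return sum(acceptance_rate ** i for i in range(num_draft_tokens + 1))


if __name__ == "__main__":
    MTP_LAYERS = 3               # from the specification table
    HYPOTHETICAL_ACCEPTANCE = 0.8  # illustrative assumption only
    print(expected_tokens_per_step(MTP_LAYERS, HYPOTHETICAL_ACCEPTANCE))
```

With three draft tokens and a hypothetical 0.8 acceptance rate, the model yields about 2.95 tokens per step instead of 1. Real gains depend on the measured acceptance rate and on the overhead of running the draft layers.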

Main features of MiMo-V2-Flash

MoE efficiency with a small active footprint: The model has 309B total parameters but activates only 15B per token (roughly 5%), which keeps per-token compute low and underpins its positioning for efficient serving.
Hybrid attention for long context: The architecture interleaves sliding window attention (SWA) and global attention layers in a 5:1 ratio, using a 128-token window in the SWA layers to cut KV-cache cost.
Multi-token prediction for faster decoding: The model includes 3 MTP layers, and the technical materials describe this as a speed and throughput optimization for generation.
Built for agentic workflows: Xiaomi positions it for reasoning, coding, and agent use cases, and the evaluation suite includes SWE-Bench, Terminal-Bench, and BrowseComp.
Long-context support: The repo reports support for contexts up to 256K tokens, while the vLLM recipe gives practical serving guidance for lower max-model-len values, depending on memory budget.
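
The KV-cache saving from the 5:1 hybrid layout can be sketched with simple counting. The code below is illustrative only and rests on several assumptions: the layer count (NUM_LAYERS) is hypothetical, the global layer is placed last in each group of six, every layer is assumed to store the same number of bytes per cached token, and 256K is taken to mean 262,144 tokens.

```python
SWA_PER_GLOBAL = 5   # 5:1 SWA-to-global ratio from the specification table
WINDOW = 128         # sliding window size in tokens
NUM_LAYERS = 48      # hypothetical layer count, for illustration only


def layer_pattern(num_layers: int) -> list[str]:
    """Label each layer 'swa' or 'global', repeating five SWA layers then one global."""
    period = SWA_PER_GLOBAL + 1
    return ["global" if i % period == period - 1 else "swa" for i in range(num_layers)]


def cached_tokens(pattern: list[str], context_len: int) -> int:
    """Total cached token slots: global layers keep everything, SWA layers keep a window."""
    return sum(
        context_len if kind == "global" else min(context_len, WINDOW)
        for kind in pattern
    )


def cache_ratio(num_layers: int, context_len: int) -> float:
    """Hybrid cache size relative to a stack where every layer uses global attention."""
    hybrid = cached_tokens(layer_pattern(num_layers), context_len)
    return hybrid / (num_layers * context_len)


if __name__ == "__main__":
    for ctx in (128, 32 * 1024, 256 * 1024):
        print(ctx, round(cache_ratio(NUM_LAYERS, ctx), 4))
```

Under these assumptions, the hybrid stack caches roughly one sixth of the token slots that a fully global stack would need at 256K, because the SWA layers contribute almost nothing once the context is much longer than the window.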

Benchmark performance

The base-model table in the repo shows MiMo-V2-Flash performing competitively against larger open models on general knowledge, math, coding, and long-context tasks. The post-training table highlights strong agentic and reasoning results.

Benchmark	MiMo-V2-Flash	What it suggests
MMLU-Pro	84.9	Strong broad reasoning
GPQA-Diamond	83.7	Solid difficult QA performance
AIME 2025	94.1	Strong math reasoning
LiveCodeBench-v6	80.6	Competitive coding ability
SWE-Bench Verified	73.4	Strong software-agent performance
SWE-Bench Multilingual	71.7	Good multilingual coding/agent coverage
Terminal-Bench 2.0	38.5	Useful but not top-of-class on terminal-heavy tasks
NIAH-Multi 256K	96.7	Long-context retrieval remains strong at 256K

MiMo-V2-Flash vs nearby reasoning models

Model	MMLU-Pro	SWE-Bench Verified	Terminal-Bench 2.0	Notes
MiMo-V2-Flash	84.9	73.4	38.5	Efficient open-weight reasoning model
Kimi-K2 Thinking	84.6	71.3	35.7	Close on reasoning, weaker on terminal tasks
DeepSeek-V3.2 Thinking	85.0	73.1	46.4	Strong terminal performance, similar reasoning tier
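
The gaps in the comparison table are easier to read as differences against MiMo-V2-Flash. The following sketch copies the scores from the table above and computes per-benchmark deltas and leaders; the helper names are illustrative.

```python
# Scores copied from the comparison table above.
SCORES = {
    "MiMo-V2-Flash": {"MMLU-Pro": 84.9, "SWE-Bench Verified": 73.4, "Terminal-Bench 2.0": 38.5},
    "Kimi-K2 Thinking": {"MMLU-Pro": 84.6, "SWE-Bench Verified": 71.3, "Terminal-Bench 2.0": 35.7},
    "DeepSeek-V3.2 Thinking": {"MMLU-Pro": 85.0, "SWE-Bench Verified": 73.1, "Terminal-Bench 2.0": 46.4},
}


def deltas_vs(baseline: str, scores: dict) -> dict:
    """Score difference of every other model relative to the baseline, per benchmark."""
    base = scores[baseline]
    return {
        model: {bench: round(value - base[bench], 1) for bench, value in row.items()}
        for model, row in scores.items()
        if model != baseline
    }


def leaders(scores: dict) -> dict:
    """Model with the highest score on each benchmark."""
    benchmarks = next(iter(scores.values())).keys()
    return {bench: max(scores, key=lambda m: scores[m][bench]) for bench in benchmarks}


if __name__ == "__main__":
    for model, row in deltas_vs("MiMo-V2-Flash", SCORES).items():
        print(model, row)
    print(leaders(SCORES))
```

The differences on MMLU-Pro and SWE-Bench Verified are small (within about two points), while the Terminal-Bench 2.0 gap to DeepSeek-V3.2 Thinking is the largest in the table at 7.9 points.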

Best use cases

MiMo-V2-Flash fits best when you need a model that can reason over long inputs, help with coding tasks, and stay efficient in production. It is a strong choice for document-heavy RAG, multi-step agent workflows, code assistance, and long-context analysis where serving cost matters.

Limitations

MiMo-V2-Flash is optimized for inference efficiency, but real-world throughput still depends on batching, tensor parallelism, and the serving configuration. The vLLM guide also shows that practical max-model-len settings may be lower than the headline 256K, depending on memory and latency tradeoffs.

MiMo-V2-Flash is tuned for fast reasoning, coding, and agentic workflows rather than pure chat polish. Xiaomi describes it as a 309B-parameter MoE model with 15B active parameters and a hybrid attention design built to reduce serving cost while keeping long-context performance.

MiMo-V2-Flash supports up to 256K tokens of context. Its native pretraining length is 32K tokens, which was later extended.

MiMo-V2-Flash is a credible option for code assistants and agent loops. In the post-training table, it scores 73.4 on SWE-Bench Verified, 71.7 on SWE-Bench Multilingual, and 38.5 on Terminal-Bench 2.0.

Use MiMo-V2-Flash when you want a strong open-weight model with a smaller active compute footprint and good all-around reasoning and agent performance. It is slightly ahead of Kimi-K2 Thinking on MMLU-Pro (84.9 vs 84.6) and SWE-Bench Verified (73.4 vs 71.3), while DeepSeek-V3.2 Thinking is clearly stronger on Terminal-Bench 2.0 (46.4 vs 38.5). The better choice therefore depends on whether you care more about serving efficiency or about terminal-heavy agent tasks.

MiMo-V2-Flash is well suited to long-context work. The architecture uses sliding window attention to reduce long-sequence cost, and the repo reports a NIAH-Multi score of 96.7 at 256K context. That makes it a sensible fit for long-document retrieval, summarization, and multi-hop context stitching.

The main limitation is that MiMo-V2-Flash's efficiency depends on deployment: speed and memory use vary with batching, tensor parallelism, and the exact serving stack. A smaller runtime context can be a better production choice than the headline maximum if you need lower latency or lower memory use.

The vLLM recipe serves it from XiaomiMiMo/MiMo-V2-Flash with --trust-remote-code, --served-model-name mimo_v2_flash, and tensor parallelism tuned for your hardware. If you need agent-style tool calling, the recipe also shows parser options such as qwen3_xml and qwen3.
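
A launch command along the lines described above can be assembled as follows. This is a minimal sketch, not the recipe itself: the tensor-parallel size and max-model-len values are placeholders to tune for your hardware, and pairing qwen3_xml with --tool-call-parser and qwen3 with --reasoning-parser is an assumption to check against the vLLM recipe. The helper name build_vllm_command is illustrative.

```python
import shlex


def build_vllm_command(tensor_parallel_size, max_model_len=None, enable_tools=False):
    """Assemble a `vllm serve` command line for MiMo-V2-Flash as a list of arguments."""
    cmd = [
        "vllm", "serve", "XiaomiMiMo/MiMo-V2-Flash",
        "--trust-remote-code",
        "--served-model-name", "mimo_v2_flash",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if max_model_len is not None:
        # A lower context cap trades maximum input length for memory and latency.
        cmd += ["--max-model-len", str(max_model_len)]
    if enable_tools:
        # Assumed flag/parser pairing; verify against the vLLM recipe.
        cmd += [
            "--enable-auto-tool-choice",
            "--tool-call-parser", "qwen3_xml",
            "--reasoning-parser", "qwen3",
        ]
    return cmd


if __name__ == "__main__":
    # Placeholder values: 8-way tensor parallelism and a 64K context cap.
    print(shlex.join(build_vllm_command(8, max_model_len=65536, enable_tools=True)))
```

Building the command as a list keeps each flag and value separate, so it can be passed directly to subprocess.run or printed with shlex.join for a shell.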