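MiMo-V2-Flash Overview

MiMo-V2-Flash is an open-weight mixture-of-experts reasoning model from Xiaomi MiMo, offered through the MiMo-V2-Flash API and built around fast reasoning, coding, and agentic workflows. The model card and technical report describe it as an MoE model with 309B total parameters and 15B active parameters, a hybrid attention design, and multi-token prediction for faster decoding.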

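Technical Specifications

Item	MiMo-V2-Flash
Provider	Xiaomi MiMo
Model family	MiMo-V2
Model type	Mixture-of-experts (MoE) language model
Total parameters	309B
Active parameters	15B
Native context length	32K
Extended context length	Up to 256K
Attention design	Hybrid sliding window attention (5:1 ratio of SWA to global attention)
Sliding window size	128 tokens
MTP layers	3
Training scale	27T tokens
Output modality	Text
Release date	2025-12-16
Repository license	Apache-2.0 (GitHub repository)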
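The gap between total and active parameters is easier to judge as a ratio. The following is a minimal sketch of that arithmetic using only the two figures from the table; the function name is illustrative and not part of any Xiaomi tooling.

```python
# Figures taken from the specification table above.
TOTAL_PARAMS_B = 309   # total parameters, in billions
ACTIVE_PARAMS_B = 15   # parameters activated per token, in billions


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active share per token: {frac:.1%}")  # prints "Active share per token: 4.9%"
```

Roughly one parameter in twenty takes part in each forward step, although all 309B still have to be held in memory by the serving stack.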

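What is MiMo-V2-Flash?

MiMo-V2-Flash is Xiaomi's inference-efficient foundation model for reasoning-heavy workloads. It aims to balance long-context capability against serving cost, using sliding window attention to reduce cache pressure and multi-token prediction to speed up decoding.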
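To get a feel for how multi-token prediction can speed up decoding, the sketch below estimates the expected number of tokens emitted per decoding step when MTP drafts are used speculatively. It is a back-of-the-envelope model, not Xiaomi's method: it assumes each of the 3 MTP layers drafts one token, that drafts are verified as a chain, and that every draft is accepted independently with the same probability. The acceptance rates in the loop are hypothetical, not measured values.

```python
def expected_tokens_per_step(accept_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per decoding step with chained drafts.

    The draft at position i only counts if all earlier drafts were
    accepted, so it contributes accept_rate ** i. The leading 1 is the
    token the main model always produces itself.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    return sum(accept_rate ** i for i in range(draft_tokens + 1))


MTP_LAYERS = 3  # from the model description

# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.8, 0.9):
    tokens = expected_tokens_per_step(rate, MTP_LAYERS)
    print(f"accept_rate={rate}: {tokens:.3f} tokens per step")
```

Under these assumptions the ceiling is 4 tokens per step (all three drafts accepted), and the benefit falls off quickly as the acceptance rate drops.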

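Key Features of MiMo-V2-Flash

MoE efficiency with a small active footprint: The model has 309B total parameters but activates only 15B per token, which is a key reason it is positioned as an efficient serving model.
Hybrid attention for long context: The architecture interleaves five SWA layers with one global attention layer and uses a 128-token window to cut KV-cache cost.
Faster decoding through multi-token prediction: The model includes 3 MTP layers, which the technical materials describe as a design for generation speed and throughput.
Built for agentic workflows: Xiaomi positions it for reasoning, coding, and agentic use cases, and the evaluation suite includes SWE-Bench, Terminal-Bench, and BrowseComp.
Long-context support: The repository reports support for up to 256K, while the vLLM recipe gives practical deployment guidance for lower max-model-len values depending on memory budget.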
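The five-to-one interleaving and the 128-token window from the list above can be turned into a rough estimate of how much KV cache the hybrid stack keeps compared with an all-global stack. This is a sketch under stated assumptions: the layer count is hypothetical, the order of layers inside each group of six is assumed, and every layer is assumed to store the same amount of KV data per token, which may not hold for the real model.

```python
SWA_PER_GLOBAL = 5   # five SWA layers per global layer (from the text)
WINDOW = 128         # sliding window size in tokens (from the text)


def layer_pattern(num_layers: int) -> list[str]:
    """Repeat five "swa" layers followed by one "global" layer (assumed order)."""
    block = ["swa"] * SWA_PER_GLOBAL + ["global"]
    return [block[i % len(block)] for i in range(num_layers)]


def relative_kv_cache(context_len: int, num_layers: int) -> float:
    """KV entries kept by the hybrid stack, relative to all-global attention."""
    cached = sum(
        # An SWA layer keeps at most WINDOW tokens; a global layer keeps all.
        min(context_len, WINDOW) if kind == "swa" else context_len
        for kind in layer_pattern(num_layers)
    )
    return cached / (num_layers * context_len)


NUM_LAYERS = 48  # hypothetical depth, chosen only because it is a multiple of six

for ctx in (32 * 1024, 256 * 1024):
    print(f"context={ctx}: {relative_kv_cache(ctx, NUM_LAYERS):.3f} of all-global KV cache")
```

At long contexts the ratio approaches one sixth, because the global layers dominate and the SWA layers contribute a fixed 128 tokens each regardless of sequence length.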

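Benchmark Performance

The base-model table in the repository shows MiMo-V2-Flash competing well with larger open models on general knowledge, math, coding, and long-context tasks. The post-training results table highlights its strong performance on agentic and reasoning tasks.

Benchmark	MiMo-V2-Flash	Notes
MMLU-Pro	84.9	Strong broad reasoning
GPQA-Diamond	83.7	Solid performance on hard question answering
AIME 2025	94.1	Strong mathematical reasoning
LiveCodeBench-v6	80.6	Competitive coding ability
SWE-Bench Verified	73.4	Strong software-agent performance
SWE-Bench Multilingual	71.7	Good multilingual coding and agent coverage
Terminal-Bench 2.0	38.5	Usable, but not best in class on terminal-heavy tasks
NIAH-Multi 256K	96.7	Retains strong long-context retrieval at 256K context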

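MiMo-V2-Flash vs. Neighboring Reasoning Models

Model	MMLU-Pro	SWE-Bench Verified	Terminal-Bench 2.0	Notes
MiMo-V2-Flash	84.9	73.4	38.5	Efficient open-weight reasoning model
Kimi-K2 Thinking	84.6	71.3	35.7	Similar reasoning, weaker on terminal tasks
DeepSeek-V3.2 Thinking	85.0	73.1	46.4	Stronger terminal performance, similar reasoning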
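The comparison can also be read as per-benchmark gaps relative to MiMo-V2-Flash. The sketch below copies the scores from the table above into a dictionary and computes the leader on each benchmark plus each model's difference from MiMo-V2-Flash; the helper names are illustrative.

```python
# Scores copied from the comparison table above.
SCORES = {
    "MiMo-V2-Flash": {
        "MMLU-Pro": 84.9, "SWE-Bench Verified": 73.4, "Terminal-Bench 2.0": 38.5,
    },
    "Kimi-K2 Thinking": {
        "MMLU-Pro": 84.6, "SWE-Bench Verified": 71.3, "Terminal-Bench 2.0": 35.7,
    },
    "DeepSeek-V3.2 Thinking": {
        "MMLU-Pro": 85.0, "SWE-Bench Verified": 73.1, "Terminal-Bench 2.0": 46.4,
    },
}


def leaders(scores: dict) -> dict:
    """Map each benchmark to the model with the highest score."""
    benchmarks = next(iter(scores.values())).keys()
    return {b: max(scores, key=lambda m: scores[m][b]) for b in benchmarks}


def gaps_to(scores: dict, baseline: str) -> dict:
    """Score difference of every other model relative to the baseline."""
    base = scores[baseline]
    return {
        model: {b: round(v - base[b], 1) for b, v in row.items()}
        for model, row in scores.items()
        if model != baseline
    }


print(leaders(SCORES))
print(gaps_to(SCORES, "MiMo-V2-Flash"))
```

On these three benchmarks the spread is under half a point on MMLU-Pro, while Terminal-Bench 2.0 shows the only large gap (7.9 points in favor of DeepSeek-V3.2 Thinking).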

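Best Use Cases

MiMo-V2-Flash fits best when you need a model that handles long inputs, assists with coding tasks, and stays efficient in production. It is well suited to document-heavy RAG, multi-step agent workflows, code assistance, and long-context analysis where serving cost matters.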

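Limitations

MiMo-V2-Flash is optimized for inference efficiency, so real-world throughput depends on batching, tensor parallelism, and serving configuration. The vLLM guide also indicates that, depending on memory and latency trade-offs, the practical max-model-len setting may be lower than the nominal 256K.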

MiMo-V2-Flash is tuned for fast reasoning, coding, and agentic workflows rather than pure chat polish. Xiaomi describes it as a 309B-parameter MoE model with 15B active parameters and a hybrid attention design built to reduce serving cost while keeping long-context performance.

MiMo-V2-Flash supports up to 256K context, extended from a native 32K pretraining length.

MiMo-V2-Flash is a credible option for code assistants and agent loops. In the post-training table, it scores 73.4 on SWE-Bench Verified, 71.7 on SWE-Bench Multilingual, and 38.5 on Terminal-Bench 2.0.

Use MiMo-V2-Flash when you want a strong open-weight model with a smaller active compute footprint and good all-around reasoning plus agent performance. It is competitive with Kimi-K2 Thinking on MMLU-Pro and SWE-Bench, while DeepSeek-V3.2 Thinking is stronger on terminal-heavy tasks, so the better choice depends on whether you care more about efficiency or terminal depth.

MiMo-V2-Flash is a sensible fit for long-document retrieval, summarization, and multi-hop context stitching. The architecture uses sliding window attention to reduce long-sequence cost, and the repo reports very strong NIAH-Multi results even at 256K context.

MiMo-V2-Flash is optimized for inference efficiency, so speed and memory use still depend on batching, tensor parallelism, and the exact serving stack. A smaller runtime context can be a better production choice than the headline maximum if you need lower latency or lower memory use.

The vLLM recipe serves it from XiaomiMiMo/MiMo-V2-Flash with --trust-remote-code, --served-model-name mimo_v2_flash, and tensor parallelism tuned for your hardware. If you need agent-style tool calling, the recipe also shows parser options such as qwen3_xml and qwen3.
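As a rough illustration, the helper below assembles a `vllm serve` command line from the options named above. It is a sketch, not the recipe itself: the tensor-parallel size and max-model-len values are hypothetical placeholders, and pairing qwen3_xml with `--tool-call-parser` and qwen3 with `--reasoning-parser` is an assumption about how the two parser names are used, so check the recipe before relying on it.

```python
import shlex
from typing import List, Optional


def build_vllm_command(
    tensor_parallel_size: int,
    max_model_len: Optional[int] = None,
    enable_tools: bool = False,
) -> List[str]:
    """Assemble a `vllm serve` argument list for MiMo-V2-Flash."""
    cmd = [
        "vllm", "serve", "XiaomiMiMo/MiMo-V2-Flash",
        "--trust-remote-code",
        "--served-model-name", "mimo_v2_flash",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if max_model_len is not None:
        # A smaller context than the 256K maximum trades reach for memory.
        cmd += ["--max-model-len", str(max_model_len)]
    if enable_tools:
        # Assumed pairing of the parser names mentioned in the text.
        cmd += [
            "--enable-auto-tool-choice",
            "--tool-call-parser", "qwen3_xml",
            "--reasoning-parser", "qwen3",
        ]
    return cmd


# Hypothetical hardware settings, for illustration only.
command = build_vllm_command(tensor_parallel_size=4, max_model_len=65536, enable_tools=True)
print(shlex.join(command))
```

Building the argument list in code keeps the context length and parallelism settings in one place, which helps when the same deployment is tuned for different memory budgets.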