
mimo-v2-flash

Input:$0.08/M
Output:$0.24/M
MiMo-V2-Flash is a comprehensive upgrade to Thinking Mode. It significantly enhances coding and complex-logic capabilities, boosts tool-calling accuracy to 97%, and optimizes Chain-of-Thought (CoT) reasoning to reduce hallucinations while lowering latency and token costs.
New
Commercial Use

MiMo-V2-Flash Overview

MiMo-V2-Flash is Xiaomi MiMo’s open-weight Mixture-of-Experts reasoning model, available through the MiMo-V2-Flash API and built around fast inference, coding, and agentic workflows. The model card and technical report describe it as a 309B-parameter MoE with 15B active parameters, a hybrid attention design, and multi-token prediction for faster decoding.

Technical specifications

| Item | MiMo-V2-Flash |
| --- | --- |
| Provider | Xiaomi MiMo |
| Model family | MiMo-V2 |
| Model type | Mixture-of-Experts (MoE) language model |
| Total parameters | 309B |
| Active parameters | 15B |
| Native context length | 32K |
| Extended context length | Up to 256K |
| Attention design | Hybrid Sliding Window Attention (5:1 SWA to Global Attention) |
| Sliding window size | 128 tokens |
| MTP layers | 3 |
| Training scale | 27T tokens |
| Output modality | Text |
| Release date | 2025-12-16 |
| Repository license | Apache-2.0 (GitHub repo) |

What is MiMo-V2-Flash?

MiMo-V2-Flash is Xiaomi’s inference-efficient foundation model for reasoning-heavy workloads. It is designed to balance long-context handling with lower serving cost, using sliding window attention to reduce cache pressure and multi-token prediction to speed up decoding.

Main features of MiMo-V2-Flash

  • MoE efficiency with a small active footprint: 309B total parameters but only 15B active per token, which is a big part of why the model is positioned for efficient serving.
  • Hybrid attention for long context: The architecture alternates five SWA layers with one global attention layer, using a 128-token window to cut KV-cache cost.
  • Multi-token prediction for faster decoding: The model includes 3 MTP layers, and the technical materials describe this as a speed and throughput optimization for generation.
  • Built for agentic workflows: Xiaomi positions it for reasoning, coding, and agent use cases, and the evaluation suite includes SWE-Bench, Terminal-Bench, and BrowseComp.
  • Long-context support: The repo reports support up to 256K, while the vLLM recipe provides practical serving guidance for lower max-model-len values depending on memory budget.
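The efficiency claims in the list above can be checked with simple arithmetic. The following is a minimal sketch, not a measurement: it uses only the figures from the specification table (309B total, 15B active, a 5:1 SWA-to-global ratio, a 128-token window) and assumes every layer stores the same number of bytes per cached token, which the source does not confirm. The function names are illustrative.

```python
# Back-of-the-envelope figures taken from the specification table.
TOTAL_PARAMS_B = 309   # total parameters, in billions
ACTIVE_PARAMS_B = 15   # active parameters per token, in billions
SWA_PER_GLOBAL = 5     # five sliding-window layers per global-attention layer
WINDOW = 128           # sliding window size, in tokens


def active_fraction() -> float:
    """Share of the parameters used in one token's forward pass."""
    return ACTIVE_PARAMS_B / TOTAL_PARAMS_B


def kv_cache_ratio(context_tokens: int) -> float:
    """Cached positions in one 5:1 block relative to six global layers.

    Assumption: each layer stores the same bytes per cached token, so
    counting cached positions is a fair proxy for KV-cache memory.
    """
    swa_positions = SWA_PER_GLOBAL * min(WINDOW, context_tokens)
    hybrid = swa_positions + context_tokens         # 5 SWA layers + 1 global layer
    full = (SWA_PER_GLOBAL + 1) * context_tokens    # 6 global layers
    return hybrid / full


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.1%}")
    for ctx in (32 * 1024, 256 * 1024):
        print(f"{ctx:>7} tokens: KV cache ~{kv_cache_ratio(ctx):.1%} of full attention")
```

Under these assumptions, about 4.9% of the parameters are active per token, and at long context the hybrid block caches roughly one sixth of what six global-attention layers would.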

Benchmark performance

The base-model table in the repo shows MiMo-V2-Flash performing competitively against larger open models on general knowledge, math, coding, and long-context tasks. The post-training table highlights strong agentic and reasoning results.

| Benchmark | MiMo-V2-Flash | What it suggests |
| --- | --- | --- |
| MMLU-Pro | 84.9 | Strong broad reasoning |
| GPQA-Diamond | 83.7 | Solid difficult QA performance |
| AIME 2025 | 94.1 | Strong math reasoning |
| LiveCodeBench-v6 | 80.6 | Competitive coding ability |
| SWE-Bench Verified | 73.4 | Strong software-agent performance |
| SWE-Bench Multilingual | 71.7 | Good multilingual coding/agent coverage |
| Terminal-Bench 2.0 | 38.5 | Useful but not top-of-class on terminal-heavy tasks |
| NIAH-Multi 256K | 96.7 | Long-context retrieval remains strong at 256K |

MiMo-V2-Flash vs nearby reasoning models

| Model | MMLU-Pro | SWE-Bench Verified | Terminal-Bench 2.0 | Notes |
| --- | --- | --- | --- | --- |
| MiMo-V2-Flash | 84.9 | 73.4 | 38.5 | Efficient open-weight reasoning model |
| Kimi-K2 Thinking | 84.6 | 71.3 | 35.7 | Close on reasoning, weaker on terminal tasks |
| DeepSeek-V3.2 Thinking | 85.0 | 73.1 | 46.4 | Strong terminal performance, similar reasoning tier |
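To make the comparison above easier to query, here is a small sketch that loads the three rows into a dictionary and reports the leader per benchmark and the gap between two models. The scores are copied from the table; the helper names `leader` and `gap` are illustrative and not part of any published tooling.

```python
# Scores copied from the comparison table above.
SCORES = {
    "MiMo-V2-Flash": {
        "MMLU-Pro": 84.9, "SWE-Bench Verified": 73.4, "Terminal-Bench 2.0": 38.5,
    },
    "Kimi-K2 Thinking": {
        "MMLU-Pro": 84.6, "SWE-Bench Verified": 71.3, "Terminal-Bench 2.0": 35.7,
    },
    "DeepSeek-V3.2 Thinking": {
        "MMLU-Pro": 85.0, "SWE-Bench Verified": 73.1, "Terminal-Bench 2.0": 46.4,
    },
}


def leader(benchmark: str) -> str:
    """Return the model with the highest score on the given benchmark."""
    return max(SCORES, key=lambda model: SCORES[model][benchmark])


def gap(model: str, other: str, benchmark: str) -> float:
    """Score difference (model minus other), rounded to one decimal place."""
    return round(SCORES[model][benchmark] - SCORES[other][benchmark], 1)


if __name__ == "__main__":
    for bench in ("MMLU-Pro", "SWE-Bench Verified", "Terminal-Bench 2.0"):
        print(f"{bench}: {leader(bench)}")
    print(gap("MiMo-V2-Flash", "DeepSeek-V3.2 Thinking", "Terminal-Bench 2.0"))
```

On these three benchmarks MiMo-V2-Flash leads only on SWE-Bench Verified, and the widest gap is the Terminal-Bench 2.0 deficit against DeepSeek-V3.2 Thinking.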

Best use cases

MiMo-V2-Flash fits best when you need a model that can reason over long inputs, help with coding tasks, and stay efficient in production. It is a strong choice for document-heavy RAG, multi-step agent workflows, code assistance, and long-context analysis where serving cost matters.

Limitations

MiMo-V2-Flash is optimized for inference efficiency, but real-world throughput still depends on batching, tensor parallelism, and the serving configuration. The vLLM guide also shows that practical max-model-len settings may be lower than the headline 256K depending on memory and latency tradeoffs.

FAQ

What does the MiMo-V2-Flash API do best?

MiMo-V2-Flash is tuned for fast reasoning, coding, and agentic workflows rather than pure chat polish. Xiaomi describes it as a 309B-parameter MoE model with 15B active parameters and a hybrid attention design built to reduce serving cost while keeping long-context performance.

How much context can the MiMo-V2-Flash API handle?

MiMo-V2-Flash supports up to 256K tokens of context. Its native pretraining length is 32K, which was later extended.
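A quick pre-flight check can tell you whether a request fits that window. The sketch below rests on two assumptions that the source does not state: that "256K" means 262,144 tokens, and that prompt and generated tokens share the same window. The helper names are hypothetical, and token counts must come from your own tokenizer.

```python
CONTEXT_LIMIT = 256 * 1024  # assumption: "256K" means 262,144 tokens


def remaining_output_budget(prompt_tokens: int, limit: int = CONTEXT_LIMIT) -> int:
    """Tokens left for generation after the prompt, never below zero."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(limit - prompt_tokens, 0)


def fits_context(prompt_tokens: int, max_output_tokens: int,
                 limit: int = CONTEXT_LIMIT) -> bool:
    """True when the prompt plus the requested output stay within the window."""
    if max_output_tokens < 0:
        raise ValueError("max_output_tokens must be non-negative")
    return max_output_tokens <= remaining_output_budget(prompt_tokens, limit)


if __name__ == "__main__":
    print(fits_context(prompt_tokens=200_000, max_output_tokens=8_000))
    print(fits_context(prompt_tokens=260_000, max_output_tokens=8_000))
```

If your deployment runs with a smaller configured context, pass that value as `limit` instead of the headline maximum.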

Can the MiMo-V2-Flash API handle coding and terminal-style agents?

Yes. In the post-training table, MiMo-V2-Flash scores 73.4 on SWE-Bench Verified, 71.7 on SWE-Bench Multilingual, and 38.5 on Terminal-Bench 2.0, which makes it a credible option for code assistants and agent loops.

When should I use the MiMo-V2-Flash API instead of Kimi-K2 Thinking or DeepSeek-V3.2 Thinking?

Use MiMo-V2-Flash when you want a strong open-weight model with a smaller active compute footprint and good all-around reasoning plus agent performance. It is competitive with Kimi-K2 Thinking on MMLU-Pro and SWE-Bench, while DeepSeek-V3.2 Thinking is stronger on terminal-heavy tasks, so the better choice depends on whether you care more about efficiency or terminal depth.

Is the MiMo-V2-Flash API suitable for long-document RAG or summarization?

Yes. The architecture uses sliding window attention to reduce long-sequence cost, and the repo reports very strong NIAH-Multi results even at 256K context. That makes it a sensible fit for long-document retrieval, summarization, and multi-hop context stitching.

What are the known limitations of the MiMo-V2-Flash API?

It is optimized for inference efficiency, but speed and memory use still depend on batching, tensor parallelism, and the exact serving stack. A smaller runtime context can be a better production choice than the headline maximum if you need lower latency or lower memory use.

How do I integrate MiMo-V2-Flash with vLLM?

The vLLM recipe serves it from XiaomiMiMo/MiMo-V2-Flash with --trust-remote-code, --served-model-name mimo_v2_flash, and tensor parallelism tuned for your hardware. If you need agent-style tool calling, the recipe also shows parser options such as qwen3_xml and qwen3.
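As a rough illustration of that setup, the sketch below assembles a `vllm serve` command from the options named in the answer. The helper `build_vllm_command` and its default values are hypothetical placeholders. Pairing `qwen3_xml` with `--tool-call-parser` and `qwen3` with `--reasoning-parser` is an assumption to verify against the vLLM recipe before use. The script only prints the command and does not start a server.

```python
import shlex
from typing import List, Optional


def build_vllm_command(tensor_parallel_size: int = 4,
                       max_model_len: Optional[int] = None,
                       enable_tools: bool = False) -> List[str]:
    """Assemble an argv list for serving MiMo-V2-Flash with vLLM.

    The default tensor_parallel_size is a placeholder; tune it for your hardware.
    """
    cmd = [
        "vllm", "serve", "XiaomiMiMo/MiMo-V2-Flash",
        "--trust-remote-code",
        "--served-model-name", "mimo_v2_flash",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if max_model_len is not None:
        # A value below the 256K maximum trades context for memory and latency.
        cmd += ["--max-model-len", str(max_model_len)]
    if enable_tools:
        # Assumed parser mapping; confirm against the vLLM recipe.
        cmd += [
            "--enable-auto-tool-choice",
            "--tool-call-parser", "qwen3_xml",
            "--reasoning-parser", "qwen3",
        ]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_vllm_command(max_model_len=65536, enable_tools=True)))
```

Once the server is running, an OpenAI-compatible client would address the model by its served name, `mimo_v2_flash`.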

Features for mimo-v2-flash

Explore the key features of mimo-v2-flash, designed to enhance performance and usability. Discover how these capabilities can benefit your projects and improve user experience.

Pricing for mimo-v2-flash

Explore competitive pricing for mimo-v2-flash, designed to fit various budgets and usage needs. Our flexible plans ensure you only pay for what you use, making it easy to scale as your requirements grow. Discover how mimo-v2-flash can enhance your projects while keeping costs manageable.
| | Comet Price (USD / M Tokens) | Official Price (USD / M Tokens) | Discount |
| --- | --- | --- | --- |
| Input | $0.08 | $0.10 | -20% |
| Output | $0.24 | $0.30 | -20% |
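For budgeting, the per-million-token prices listed above translate into request costs as follows. This is a minimal sketch using only the listed prices; the function names and the example token counts are illustrative.

```python
from decimal import Decimal

# USD per one million tokens, as listed in the pricing section.
PRICES = {
    "comet": {"input": Decimal("0.08"), "output": Decimal("0.24")},
    "official": {"input": Decimal("0.10"), "output": Decimal("0.30")},
}
MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int, plan: str = "comet") -> Decimal:
    """Estimated cost in USD for one workload under the given price list."""
    rate = PRICES[plan]
    total = Decimal(input_tokens) * rate["input"] + Decimal(output_tokens) * rate["output"]
    return total / MILLION


def discount(kind: str) -> Decimal:
    """Relative difference between the Comet and official price (negative = cheaper)."""
    official = PRICES["official"][kind]
    return (PRICES["comet"][kind] - official) / official


if __name__ == "__main__":
    print(estimate_cost(2_000_000, 500_000))              # Comet price list
    print(estimate_cost(2_000_000, 500_000, "official"))  # official price list
    print(discount("input"), discount("output"))
```

For 2M input tokens and 0.5M output tokens, this works out to $0.28 at the Comet price versus $0.35 at the official price, matching the listed 20% discount.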

Sample code and API for mimo-v2-flash

Access comprehensive sample code and API resources for mimo-v2-flash to streamline your integration process. Our detailed documentation provides step-by-step guidance, helping you leverage the full potential of mimo-v2-flash in your projects.
Python
from openai import OpenAI
import os

# Get your CometAPI key from https://api.cometapi.com/console/token, and paste it here
COMETAPI_KEY = os.environ.get("COMETAPI_KEY") or "<YOUR_COMETAPI_KEY>"

client = OpenAI(api_key=COMETAPI_KEY, base_url="https://api.cometapi.com/v1")

# mimo-v2-flash is optimized for speed; test structured JSON output
completion = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Respond in JSON only."},
        {"role": "user", "content": "List 3 programming languages with their primary use case."},
    ],
    response_format={"type": "json_object"},
)

print(completion.choices[0].message.content)
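The sample above prints the raw string returned by the model. If you want to work with the result as data, a defensive parsing step can help. The following is a sketch, not part of the CometAPI documentation: `parse_json_reply` is a hypothetical helper, and the Markdown-fence handling is a precaution in case the reply arrives wrapped in one.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(content):
    """Parse a chat reply that is expected to contain one JSON object.

    Raises ValueError when the reply is empty, is not valid JSON,
    or does not have an object at the top level.
    """
    if not content or not content.strip():
        raise ValueError("reply is empty")
    text = content.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag)
        # and the closing fence line if one is present.
        lines = text.splitlines()
        lines = lines[1:-1] if lines[-1].strip() == FENCE else lines[1:]
        text = "\n".join(lines)
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


if __name__ == "__main__":
    sample = '{"languages": [{"name": "Python", "use_case": "data analysis"}]}'
    print(parse_json_reply(sample)["languages"][0]["name"])
```

In the sample above, you would call it as `parse_json_reply(completion.choices[0].message.content)`.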
