
mimo-v2-omni

Input:$0.32/M
Output:$1.6/M
MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capability - visual grounding, multi-step planning, tool use, and code execution - making it well-suited for complex real-world tasks that span modalities. 256K context window.
New · Commercial Use

MiMo-V2-Omni Overview

MiMo-V2-Omni is Xiaomi MiMo’s omni foundation model for the API platform, built to see, hear, read, and act in the same workflow. Xiaomi positions it as a multimodal agent model that combines image, video, audio, and text understanding with structured tool calling, function execution, and UI grounding.
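
A minimal sketch of an image-input call, assuming CometAPI accepts standard OpenAI-style image_url content parts for mimo-v2-omni (the payload shape and the example image URL are assumptions, not confirmed on this page):

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ.get("COMETAPI_KEY") or "<YOUR_COMETAPI_KEY>",
    base_url="https://api.cometapi.com/v1",
)

# Assumption: mimo-v2-omni accepts OpenAI-style image_url content parts.
completion = client.chat.completions.create(
    model="mimo-v2-omni",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What product is shown in this screenshot?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)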

Technical specifications

Item | MiMo-V2-Omni
Provider | Xiaomi MiMo
Model family | MiMo-V2
Modality | Image, video, audio, text
Output type | Text
Native audio support | Yes
Native audio-video joint input | Yes
Structured tool calling | Yes
Function execution | Yes
UI grounding | Yes
Long audio handling | Over 10 hours of continuous audio understanding
Release date | 2026-03-18
Public numeric context length | Not stated on the official Omni page

What is MiMo-V2-Omni?

MiMo-V2-Omni is designed for agentic systems that need perception and action in one model. Xiaomi says the model fuses dedicated image, video, and audio encoders into one shared backbone, trained to anticipate what should happen next rather than only describe what is already visible.

Main features of MiMo-V2-Omni

  • Unified multimodal perception: image, video, audio, and text are handled as one perceptual stream rather than separate add-ons.
  • Agent-ready outputs: the model natively supports structured tool calling, function execution, and UI grounding for real agent frameworks.
  • Long-form audio understanding: Xiaomi claims it can handle continuous audio longer than 10 hours, which is unusually strong for a general omni model.
  • Native audio-video reasoning: the official page highlights joint audio-video input for video comprehension instead of a text-only transcript pipeline (see the frame-based sketch after this list).
  • Browser and workflow execution: Xiaomi demonstrates end-to-end browser shopping and TikTok upload flows using MiMo-V2-Omni plus OpenClaw.
  • Perception-to-action framing: the model is trained to connect what it sees with what it should do next, which is the core difference between a demo model and an agentic model.
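
As referenced in the audio-video bullet above, one common workaround with OpenAI-compatible endpoints is to pass sampled video frames as a sequence of image parts. Whether CometAPI exposes a native video or joint audio-video payload for mimo-v2-omni is not stated on this page, so treat this as an approximation; the frame files are hypothetical:

import base64
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("COMETAPI_KEY") or "<YOUR_COMETAPI_KEY>",
    base_url="https://api.cometapi.com/v1",
)

# Hypothetical frames sampled from a video (e.g., ffmpeg -i clip.mp4 -vf fps=1 frame_%03d.jpg)
frame_paths = ["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"]

def to_data_url(path: str) -> str:
    # Inline each frame as a base64 data URL, a widely supported image_url form.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

content = [{"type": "text", "text": "Describe what happens across these video frames."}]
content += [{"type": "image_url", "image_url": {"url": to_data_url(p)}} for p in frame_paths]

completion = client.chat.completions.create(
    model="mimo-v2-omni",
    messages=[{"role": "user", "content": content}],
)
print(completion.choices[0].message.content)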

Benchmark performance

[Benchmark chart: mimo-v2-omni]

Xiaomi’s benchmark chart shows Omni exceeding Gemini 3 Pro on audio understanding, exceeding Claude Opus 4.6 on image understanding, and performing on par with the strongest reasoning models on agentic productivity benchmarks.

MiMo-V2-Omni vs MiMo-V2-Pro vs MiMo-V2-Flash

Model | Core strength | Context / scale | Best fit
MiMo-V2-Omni | Multimodal perception + agent action | Public context length not stated on the Omni page | Audio, image, video, UI, and browser agents
MiMo-V2-Pro | Largest flagship agent model | Up to 1M-token context; 1T+ params, 42B active | Heavy agent orchestration and long-horizon work
MiMo-V2-Flash | Fast reasoning and coding | 256K context; 309B total, 15B active | Efficient reasoning, coding, and high-throughput agent tasks

Best use cases

MiMo-V2-Omni is the right pick when your workflow depends on non-text inputs or outputs: screen understanding, voice and audio analysis, video review, browser automation, multimodal assistants, and robotics-style agent loops. If your workload is mostly text-only and you care more about raw speed or maximum context, the sibling Pro and Flash models are the more obvious alternatives.

FAQ

What can the MiMo-V2-Omni API understand besides text?

MiMo-V2-Omni is built to treat image, video, audio, and text as one unified perceptual system rather than separate modality add-ons, which makes it a better fit for multimodal agents than a text-only LLM.

Can MiMo-V2-Omni API process audio and video together?

Yes. The model supports native audio-video joint input for video comprehension, so it can reason over what is happening on screen and in the soundtrack at the same time.

How long of an audio file can MiMo-V2-Omni API handle?

MiMo-V2-Omni supports continuous audio understanding beyond 10 hours. That is a strong signal that it is meant for long-form audio analysis rather than short clip transcription only.
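
A minimal request sketch, assuming CometAPI mirrors OpenAI’s input_audio content part for this model (the payload shape and the local file name are assumptions; multi-hour recordings would presumably be chunked or referenced rather than inlined as base64):

import base64
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("COMETAPI_KEY") or "<YOUR_COMETAPI_KEY>",
    base_url="https://api.cometapi.com/v1",
)

# Hypothetical local clip; base64 inlining suits short audio, not 10-hour files.
with open("meeting_clip.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

completion = client.chat.completions.create(
    model="mimo-v2-omni",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this audio clip."},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)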

When should I use MiMo-V2-Omni API instead of MiMo-V2-Pro?

Use MiMo-V2-Omni when the job depends on multimodal perception: screens, videos, voice, or audio-visual workflows. Use MiMo-V2-Pro when the workload is mostly agentic text work and you want the largest flagship context window, which Xiaomi says reaches 1M tokens.

Does MiMo-V2-Omni API support structured tool calling?

Yes. MiMo-V2-Omni natively supports structured tool calling, function execution, and UI grounding, which is exactly what you want for agent automation.
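
The samples on this page use a CometAPI-specific web_search tool type; assuming standard OpenAI-style function tools are accepted as well (not confirmed here), a minimal sketch with a hypothetical get_weather function:

import json
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("COMETAPI_KEY") or "<YOUR_COMETAPI_KEY>",
    base_url="https://api.cometapi.com/v1",
)

# Hypothetical function exposed to the model as a structured tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

completion = client.chat.completions.create(
    model="mimo-v2-omni",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decides to call the tool, inspect the structured arguments.
for call in completion.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))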

Is MiMo-V2-Omni API good for browser automation and real-world agents?

Yes. Xiaomi’s demos show it browsing JD.com to give shopping advice and completing a TikTok upload workflow through OpenClaw. That makes it a strong fit for browser agents, workflow automation, and UI-driven tasks.

Features for mimo-v2-omni

Explore the key features of mimo-v2-omni, designed to enhance performance and usability. Discover how these capabilities can benefit your projects and improve user experience.

Pricing for mimo-v2-omni

Explore competitive pricing for mimo-v2-omni, designed to fit various budgets and usage needs. Our flexible plans ensure you only pay for what you use, making it easy to scale as your requirements grow. Discover how mimo-v2-omni can enhance your projects while keeping costs manageable.
Comet Price (USD / M Tokens) | Official Price (USD / M Tokens) | Discount
Input: $0.32/M, Output: $1.6/M | Input: $0.4/M, Output: $2/M | -20%
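
As a quick sanity check against the CometAPI rates above, per-request cost can be estimated from the usage object returned with each completion:

# CometAPI rates for mimo-v2-omni: $0.32 / M input tokens, $1.60 / M output tokens.
def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return prompt_tokens / 1_000_000 * 0.32 + completion_tokens / 1_000_000 * 1.60

# e.g., values from completion.usage.prompt_tokens / completion.usage.completion_tokens
print(f"${estimate_cost(12_000, 1_500):.4f}")  # 12K in + 1.5K out ≈ $0.0062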

Sample code and API for mimo-v2-omni

Access comprehensive sample code and API resources for mimo-v2-omni to streamline your integration process. Our detailed documentation provides step-by-step guidance, helping you leverage the full potential of mimo-v2-omni in your projects.
POST /v1/chat/completions
POST /v1/messages

Python Code Example

from openai import OpenAI
import os

# Get your CometAPI key from https://api.cometapi.com/console/token, and paste it here
COMETAPI_KEY = os.environ.get("COMETAPI_KEY") or "<YOUR_COMETAPI_KEY>"

client = OpenAI(api_key=COMETAPI_KEY, base_url="https://api.cometapi.com/v1")

# mimo-v2-omni: built-in web_search tool (pass as top-level tools param)
completion = client.chat.completions.create(
    model="mimo-v2-omni",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who is Lei Jun?"},
    ],
    tools=[{"type": "web_search", "force_search": True, "max_keyword": 3, "limit": 1}],
    tool_choice="auto",
    extra_body={"thinking": {"type": "disabled"}},
)

msg = completion.choices[0].message
if msg.content:
    print(msg.content)

# annotations are populated when web_search runs (content may be null on search-only responses)
raw = completion.model_dump()
annotations = raw["choices"][0]["message"].get("annotations") or []
if annotations:
    print("\n--- Sources ---")
    for ann in annotations:
        c = ann.get("url_citation") or {}
        print(f"[{c.get('title')}] {c.get('url')}")

JavaScript Code Example

// Get your CometAPI key from https://api.cometapi.com/console/token, and paste it here
const api_key = process.env.COMETAPI_KEY || "<YOUR_COMETAPI_KEY>";

// mimo-v2-omni: use fetch for web_search (non-standard tool type unsupported by openai SDK)
const resp = await fetch("https://api.cometapi.com/v1/chat/completions", {
  method: "POST",
  headers: { Authorization: `Bearer ${api_key}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "mimo-v2-omni",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Who is Lei Jun?" },
    ],
    tools: [{ type: "web_search", force_search: true, max_keyword: 3, limit: 1 }],
    tool_choice: "auto",
    thinking: { type: "disabled" },
  }),
});

const data = await resp.json();
const msg = data.choices[0].message;
if (msg.content) console.log(msg.content);

const annotations = msg.annotations ?? [];
if (annotations.length) {
  console.log("\n--- Sources ---");
  for (const ann of annotations) {
    const c = ann.url_citation ?? {};
    console.log(`[${c.title}] ${c.url}`);
  }
}

Curl Code Example

# Get your CometAPI key from https://api.cometapi.com/console/token
# Export it as: export COMETAPI_KEY="your-key-here"

curl https://api.cometapi.com/v1/chat/completions \
  -H "Authorization: Bearer $COMETAPI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mimo-v2-omni",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who is Lei Jun?"}
    ],
    "tools": [{"type": "web_search", "force_search": true, "max_keyword": 3, "limit": 1}],
    "thinking": {"type": "disabled"}
  }'
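
The endpoint list above also shows POST /v1/messages. Assuming it mirrors the Anthropic-style Messages API and accepts the same Bearer key as the chat completions route (neither is confirmed on this page), a request sketch:

import os
import requests

resp = requests.post(
    "https://api.cometapi.com/v1/messages",
    headers={
        "Authorization": f"Bearer {os.environ.get('COMETAPI_KEY', '<YOUR_COMETAPI_KEY>')}",
        "Content-Type": "application/json",
    },
    json={
        "model": "mimo-v2-omni",
        "max_tokens": 1024,  # required by the Anthropic-style schema
        "messages": [{"role": "user", "content": "Who is Lei Jun?"}],
    },
)
print(resp.json())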

More Models

Claude Opus 4.7

Input:$4/M
Output:$20/M
Claude Opus 4.7 is a hybrid reasoning model designed specifically for frontier-level coding, AI agents, and complex multi-step professional work. Unlike lighter models (e.g., Sonnet or Haiku variants), Opus 4.7 prioritizes depth, consistency, and autonomy on the hardest tasks.

Claude Sonnet 4.6

Input:$2.4/M
Output:$12/M
Claude Sonnet 4.6 is Anthropic’s most capable Sonnet model yet. It’s a full upgrade of the model’s skills across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. Sonnet 4.6 also features a 1M token context window in beta.

Grok 4.3

Input:$1/M
Output:$2/M
Excels at agentic reasoning, knowledge work, and tool use.

GPT 5.5 Pro

Input:$24/M
Output:$144/M
An advanced model engineered for extremely complex logic and professional demands, representing the highest standard of deep reasoning and precise analytical capabilities.

GPT 5.5

Input:$4/M
Output:$24/M
A next-generation multimodal flagship model balancing exceptional performance with efficient response, dedicated to providing comprehensive and stable general-purpose AI services.

GPT Image 2 ALL

Per Request:$0.04
GPT Image 2 is OpenAI’s state-of-the-art image generation model for fast, high-quality image generation and editing. It supports flexible image sizes and high-fidelity image inputs.

Related Blog

MiMo V2 Pro vs Omni vs Flash: How should I choose in 2026?
Mar 26, 2026

MiMo V2 Pro is the flagship choice for demanding agentic work, MiMo V2 Omni is the multimodal specialist for image, video, audio, and tool-using agents, and MiMo V2 Flash is the fast, low-cost open-source option for reasoning, coding, and everyday agent workflows.
How to Use MiMo V2 API for Free in 2026: Complete Guide (Pro, Omni & Flash)
Mar 25, 2026

To use MiMo V2 API for free, get free quota via CometAPI or self-host the open-source weights on Hugging Face. For Pro and Omni, leverage OpenRouter routing, CometAPI aggregation, or Puter.js user-pays proxies. All models use a standard OpenAI-compatible endpoint. Official Xiaomi pricing starts at $1/$3 per million tokens for Pro (cheaper than Claude Opus 4.6), but free tiers and aggregators make high-performance agentic AI accessible without upfront costs.