qwen3-vl-235b-a22b

Input:$0.24/M
Output:$0.96/M
Context:2M
Max Output:30K
qwen3-vl-235b-a22b is a multimodal model that unifies strong text generation with visual understanding for images and videos. Its Instruct variant optimizes instruction-following for general multimodal tasks. It excels in perception of real-world/synthetic categories, 2D/3D spatial grounding, and long-form visual comprehension, achieving competitive multimodal benchmark results.

What is Qwen3-VL-235B-A22B

Qwen3-VL-235B-A22B is a high-capacity multimodal LLM from the Qwen (Alibaba) family. It combines a large MoE transformer backbone with cross-modal vision encoders and new positional/time encoding techniques to handle multi-image and long-duration video inputs, and to perform tasks such as visual question answering (VQA), long-document OCR, spatial/3D grounding, multimodal code generation, and agentic GUI control. The release includes both Instruct (task/few-shot tuned for instruction following) and Thinking (additional reasoning support and internal “think” mode) variants.


Main features (what makes Qwen3-VL-235B-A22B distinctive)

  • Large MoE design with high active capacity: a MoE stack that activates a subset of experts per request (≈22B active) to give more compute when needed while controlling inference cost.
  • Very long native context (256K) and scalable to ~1M: intended for book-length documents, hours of video, and multi-document workflows without aggressive chunking.
  • Advanced visual reasoning (spatial & temporal): Interleaved-MRoPE and DeepStack modules provide timestamp alignment and fine-grained image–text fusion, enabling video timeline queries and 3D grounding.
  • Improved OCR & document parsing: expanded OCR language support (advertised ~32 languages), stronger robustness to blur/tilt/low light and long, multi-page document structure parsing.
  • Visual agent + GUI automation: explicit agent capabilities to identify GUI elements, invoke functions or tools, and perform automation tasks on PC/mobile UIs.
  • Visual coding & multimodal program synthesis: can translate images/video/UI sketches into Draw.io/HTML/CSS/JS and assist in UI debugging.
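Since the model is served through an OpenAI-compatible chat endpoint, an image question can be expressed as a multimodal message with a text part and an image part. A minimal sketch of the request body (the helper name and image URL are illustrative, not from the official docs):

```python
import json

def build_vision_payload(question, image_url, model="qwen3-vl-235b-a22b"):
    """Request body for POST /v1/chat/completions with one text part
    and one image_url part (OpenAI-compatible multimodal format)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_vision_payload(
    "What text appears in this scanned page?",
    "https://example.com/page-1.png",  # hypothetical image URL
)
print(json.dumps(payload, indent=2))
```

The same `content` list can carry several image parts for multi-image inputs.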

How Qwen3-VL-235B-A22B compares to other models

Below are high-level comparisons to contemporaries; numbers and caps are taken from public provider/model pages and aggregator writeups.

  • Google Gemini 3 Pro — Gemini emphasizes very large multimodal reasoning and agentic tool use; Google advertises 1M token context modes and deep product integrations. Gemini is positioned as a general leader in agentic multimodality (closed-source / proprietary), and often outperforms publicly available open models on some productized benchmarks. Qwen3-VL competes more directly as a high-capacity open-weight alternative optimized for OCR, video timeline alignment, and MoE cost tradeoffs.
  • Grok-4 Heavy (xAI) — Grok-4 is another long-context, high-reasoning model family; some Grok variants list ~256K context windows and strong coding/math performance. Qwen3-VL and Grok-4 both target long-form reasoning; Qwen3-VL differentiates via heavy visual/video/OCR tooling and MoE scaling.
  • DeepSeek-R1 / DeepSeek family — DeepSeek R1 emphasizes efficient training and competitive reasoning performance at lower inference cost; it is often used as an open alternative for reasoning/code tasks. Qwen3-VL targets stronger multimodal and spatial/video capabilities than R1’s primary focus on text reasoning.

Representative use cases

  • Document parsing and large-scale OCR — long, multi-page invoices, books, historical documents with multilingual text.
  • Video understanding & timeline queries — summarize hours of recorded video, locate events by time, align text to video timestamps.
  • Visual question answering & multimodal assistants — multi-turn image + text dialogs (customer support with screenshots, medical imaging notes).
  • GUI automation / visual agents — detect UI elements and drive PC/mobile flows (automation, testing, assistive agents).
  • Multimodal code generation & UI prototyping — convert mockups / images into HTML/CSS/JS or Draw.io diagrams.
  • Research & large-document analysis — book-level summarization, multi-document synthesis with a single context.
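For the multimodal-assistant use case above, a multi-turn dialog is simply a growing message list that is resent with each request. A minimal sketch (helper names are illustrative):

```python
def start_chat(system_prompt="You are a helpful assistant."):
    """Begin a conversation history in chat-completions format."""
    return [{"role": "system", "content": system_prompt}]

def add_turn(history, user_text, assistant_text):
    """Record one user question and the model's reply, so the next
    request carries the full dialog context."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history

chat = start_chat()
add_turn(chat, "What is in this screenshot?", "A login form with two fields.")
chat.append({"role": "user", "content": "Which field is focused?"})
print(len(chat))  # 4 messages sent on the next request
```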

How to access Qwen3-VL-235B-A22B API

Step 1: Sign Up for API Key

Log in to cometapi.com; if you don't have an account yet, register first. In your CometAPI console, open the API token page in the personal center, click "Add Token", and copy the generated key (format: sk-xxxxx).

Step 2: Send Requests to Qwen3-VL-235B-A22B API

Select the "Qwen3-VL-235B-A22B" endpoint and set the request body. The request method and body format are described in our website's API documentation; an Apifox test page is also provided for convenience. Replace <YOUR_API_KEY> with your actual CometAPI key from your account, and use https://api.cometapi.com/v1 as the base URL.

Insert your question or request into the content field; this is what the model will respond to.

Step 3: Retrieve and Verify Results

The API responds with the task status and output data; parse the response to extract the generated answer and confirm the request completed successfully.
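As an illustration, here is how a typical OpenAI-compatible response body (the values below are made up) can be parsed to retrieve and verify the result:

```python
import json

# A hypothetical chat-completions response body, shaped like the
# OpenAI-compatible format the endpoint returns.
raw = """{
  "id": "chatcmpl-123",
  "model": "qwen3-vl-235b-a22b",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello! How can I help?"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21}
}"""

resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
finish = resp["choices"][0]["finish_reason"]
print(answer)   # the generated text
print(finish)   # "stop" means the model completed normally
```

A `finish_reason` of "length" instead of "stop" would indicate the output was cut off by the max-output limit.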

Features for qwen3-vl-235b-a22b

Explore the key features of qwen3-vl-235b-a22b, designed to enhance performance and usability. Discover how these capabilities can benefit your projects and improve user experience.

Pricing for qwen3-vl-235b-a22b

Explore competitive pricing for qwen3-vl-235b-a22b, designed to fit various budgets and usage needs. Our flexible plans ensure you only pay for what you use, making it easy to scale as your requirements grow. Discover how qwen3-vl-235b-a22b can enhance your projects while keeping costs manageable.
Comet Price vs. Official Price (USD / M tokens):
  • Input: $0.24/M (official: $0.3/M, a 20% discount)
  • Output: $0.96/M (official: $1.2/M, a 20% discount)
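At these rates, estimating a request's cost is simple arithmetic; a small sketch:

```python
def cost_usd(input_tokens, output_tokens, in_price=0.24, out_price=0.96):
    """Cost at CometAPI's rates for qwen3-vl-235b-a22b (USD per M tokens)."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# e.g. a long-document request: 50K input tokens, 5K output tokens
print(f"${cost_usd(50_000, 5_000):.4f}")  # $0.0168
```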

Sample code and API for qwen3-vl-235b-a22b

Access comprehensive sample code and API resources for qwen3-vl-235b-a22b to streamline your integration process. Our detailed documentation provides step-by-step guidance, helping you leverage the full potential of qwen3-vl-235b-a22b in your projects.
POST
/v1/chat/completions

Python Code Example

from openai import OpenAI
import os

# Get your CometAPI key from https://api.cometapi.com/console/token, and paste it here
COMETAPI_KEY = os.environ.get("COMETAPI_KEY") or "<YOUR_COMETAPI_KEY>"
BASE_URL = "https://api.cometapi.com/v1"

client = OpenAI(base_url=BASE_URL, api_key=COMETAPI_KEY)

completion = client.chat.completions.create(
    model="qwen3-vl-235b-a22b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message.content)

JavaScript Code Example

import OpenAI from "openai";

// Get your CometAPI key from https://api.cometapi.com/console/token, and paste it here
const api_key = process.env.COMETAPI_KEY || "<YOUR_COMETAPI_KEY>";
const base_url = "https://api.cometapi.com/v1";

const openai = new OpenAI({
  apiKey: api_key,
  baseURL: base_url,
});

const completion = await openai.chat.completions.create({
  model: "qwen3-vl-235b-a22b",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Hello!" },
  ],
});

console.log(completion.choices[0].message.content);

Curl Code Example

curl https://api.cometapi.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $COMETAPI_KEY" \
  -d '{
    "model": "qwen3-vl-235b-a22b",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

Versions of qwen3-vl-235b-a22b

The reason qwen3-vl-235b-a22b has multiple snapshots may include potential factors such as variations in output after updates requiring older snapshots for consistency, providing developers a transition period for adaptation and migration, and different snapshots corresponding to global or regional endpoints to optimize user experience. For detailed differences between versions, please refer to the official documentation.
  • qwen3-vl-235b-a22b: standard
  • qwen3-vl-235b-a22b-thinking: thinking version
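Since both snapshots are reached through the same endpoint, switching variants is just a matter of the model string. An illustrative helper (the selection criterion is an assumption, not an official recommendation):

```python
MODELS = {
    "standard": "qwen3-vl-235b-a22b",
    "thinking": "qwen3-vl-235b-a22b-thinking",
}

def pick_model(needs_deep_reasoning: bool) -> str:
    """Choose the thinking snapshot for harder reasoning tasks."""
    return MODELS["thinking" if needs_deep_reasoning else "standard"]

print(pick_model(True))  # qwen3-vl-235b-a22b-thinking
```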

More Models


Claude Opus 4.7

Input:$3/M
Output:$15/M
Claude Opus 4.7 is a hybrid reasoning model designed specifically for frontier-level coding, AI agents, and complex multi-step professional work. Unlike lighter models (e.g., Sonnet or Haiku variants), Opus 4.7 prioritizes depth, consistency, and autonomy on the hardest tasks.

Claude Sonnet 4.6

Input:$2.4/M
Output:$12/M
Claude Sonnet 4.6 is our most capable Sonnet model yet. It’s a full upgrade of the model’s skills across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. Sonnet 4.6 also features a 1M token context window in beta.

GPT 5.5 Pro

Input:$24/M
Output:$144/M
An advanced model engineered for extremely complex logic and professional demands, representing the highest standard of deep reasoning and precise analytical capabilities.

GPT 5.5

Input:$4/M
Output:$24/M
A next-generation multimodal flagship model balancing exceptional performance with efficient response, dedicated to providing comprehensive and stable general-purpose AI services.

GPT Image 2 ALL

Per Request:$0.04
GPT Image 2 is OpenAI's state-of-the-art image generation model for fast, high-quality image generation and editing. It supports flexible image sizes and high-fidelity image inputs.

GPT 5.5 ALL

Input:$4/M
Output:$24/M
GPT-5.5 excels in code writing, online research, data analysis, and cross-tool operations. The model not only improves its autonomy in handling complex multi-step tasks but also significantly improves reasoning capabilities and execution efficiency while maintaining the same latency as its predecessor, marking an important step toward AI-driven office automation.