On Lunar New Year’s Eve (Feb 16–17, 2026), Alibaba Group released its next-generation model, Qwen 3.5, a multimodal, agent-capable model positioned for what the company calls an “agentic AI” era. Industry coverage highlighted claims of large gains in efficiency and cost, plus rapid support from hardware and cloud vendors. CometAPI is an option for developers who want hosted API access or an OpenAI-compatible integration, while AMD announced day-0 GPU support for the model on its Instinct line. ByteDance, one of the principal domestic competitors, released upgrades in the same holiday window. OpenAI remains a reference point for benchmark comparisons and integration style.
What is Qwen 3.5?
Alibaba's Qwen 3.5 is the company’s latest generation multimodal large language model (LLM) positioned for the so-called “agentic AI” era — models that not only answer questions but can orchestrate multi-step workflows, call tools, work with images/video, and act across application boundaries. The model was publicly announced during the Lunar New Year period (the release window reported around 16 February 2026), a strategic date for product publicity in China and for capturing user attention during holiday spikes. Qwen 3.5 delivers significant cost and throughput improvements over its predecessors while focusing on long contexts and agent-style automation.
At a glance, the distinguishing technical and business claims about Qwen 3.5 are:
- A native multimodal architecture supporting text, image, and video inputs and outputs, plus new in-model capabilities to call tools, act on browser content, and chain steps (agentic behavior). These features unlock automation such as form filling and end-to-end workflows, but they demand stronger safety controls.
- A hybrid mixture-of-experts architecture with very large total parameters but a smaller set active per forward pass — public technical notes indicate architectures like “397B total / 17B active” for one Qwen3.5 variant used in efficient serving. This design produces high capability at improved inference efficiency.
- Competitive benchmarks versus leading global closed-source models, with Alibaba claiming cost advantages and parity or better results on many practical tasks.
Editions you will encounter
- qwen3.5-397b-a17b (open-weights release): downloadable checkpoints and community forks for local and custom deployments. See the official project repositories and mirrors.
- qwen3.5-plus (Hosted “Plus” variant): fully managed on Alibaba Cloud Model Studio with the largest context window and built-in tools (tool calling, code assistant, web extraction). This is the version enterprise customers will likely call via API for reliability and scale.
What are Qwen-3.5’s headline features?
Architecture & training highlights
Below is a concise feature table for the release:
| Feature | Qwen-3.5 (public details) | Practical impact |
|---|---|---|
| Architecture | Hybrid: linear attention + sparse MoE + dense transformer backbones. | Better decoding throughput and scaling efficiency vs purely dense models. |
| Multimodality | Native vision–language agentic abilities (taking actions across UIs). | Enables app control/multi-step agents, not just text-and-image QA. |
| Model series & open weights | Public release of at least one “open-weights” variant (e.g., Qwen3.5-397B-A17B). | Allows on-prem and third-party fine tuning; accelerates community evaluation. |
| Languages | >200 languages & dialects (release claims). | Broad international coverage for localization and multilingual agents. |
| RL / agents | Large-scale RL environment scaling and agent training pipelines. | Improves long-horizon planning and action sequencing in real tasks. |
Multimodality & agentic actions
Qwen-3.5 is explicitly engineered for agentic workflows — that means the model is designed not just to answer, but to plan, chain actions (APIs, UI interactions, file ops), and integrate visual inputs (screenshots, UI DOMs, images) into its decision loop. Alibaba highlights native vision–language fusion and tighter control hooks for executing tasks across mobile and desktop app boundaries.
Hybrid architecture (efficiency focus)
Alibaba’s materials and industry summaries say Qwen-3.5 uses a hybrid of linear-attention mechanisms with sparse Mixture-of-Experts routing (MoE) so that the effective parameter activation for common prompts is far lower than the headline number. The practical benefit: higher capability per unit of compute and lower inference cost — the firm claims up to ~60% lower deployment cost relative to prior releases.
Context window & multilingual support
Public notes indicate expanded context windows (256k tokens are mentioned for some open weights variants across the Qwen family) and broader language coverage (Alibaba has steadily expanded language/dialect support across Qwen generations). The result: better long-document and cross-lingual agent tasks.
How do I access Qwen 3.5 via CometAPI?
CometAPI provides a unified, OpenAI-compatible gateway to 500+ models (including Qwen hosted or third-party endpoints). That abstraction lets your code switch providers with minimum friction while CometAPI normalizes responses and offers usage analytics and pay-as-you-go billing.
Step-by-step: basic flow to call Qwen 3.5 via CometAPI
- Sign up & get an API key from the CometAPI dashboard.
- Choose the Qwen 3.5 variant in the CometAPI model list (e.g., `qwen3.5-plus` or `qwen3.5-397b-a17b`). CometAPI typically exposes the provider-specific model name as a string you pass in the `model` field.
- Make a Chat Completion request using the OpenAI-compatible endpoint (base URL example: `https://api.cometapi.com/v1`). You can use the OpenAI SDK or raw HTTP. CometAPI’s docs show both approaches and recommend binding your library’s base URL to the CometAPI endpoint so existing OpenAI code works with little to no change.
Minimal examples
cURL (simple chat call)
```bash
export COMETAPI_KEY="sk-xxxx"
curl -s -X POST "https://api.cometapi.com/v1/chat/completions" \
  -H "Authorization: Bearer $COMETAPI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-plus",
    "messages": [
      {"role": "system", "content": "You are a concise engineering assistant."},
      {"role": "user", "content": "Summarize the tradeoffs between retrieval-augmented generation and fine-tuning."}
    ],
    "max_tokens": 512
  }'
```
Python (OpenAI client with base_url override)
```python
# Using the OpenAI-compatible client pattern many API hubs support:
from openai import OpenAI

client = OpenAI(api_key="YOUR_COMETAPI_KEY", base_url="https://api.cometapi.com/v1")
resp = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[
        {"role": "system", "content": "You are a concise engineering assistant."},
        {"role": "user", "content": "Explain how to implement streaming responses in production (short)."},
    ],
    max_tokens=400,
)
print(resp.choices[0].message.content)
```
Note: CometAPI normalizes many vendor differences; consult the CometAPI model list to pick the exact string name for each Qwen variant.
Calling image / multimodal capabilities through the gateway
If you want to use vision features (image + text), CometAPI typically exposes vendor capabilities through a single API but may require attaching binary/image data or signed URLs. The general pattern is to include an input_image (or vendor-specific parameter) and set the model to the appropriate multimodal Qwen-3.5 variant.
How much does a Qwen 3.5 cost?
API and Token Pricing of Aliyun
| Model | Input tokens per request | Input price (per 1M tokens) | Output price, non-thinking mode (per 1M tokens) | Output price, thinking mode (CoT + response, per 1M tokens) | Free quota |
|---|---|---|---|---|---|
| qwen3.5-plus | 0 < tokens ≤ 256K | $0.4 | $2.4 | $2.4 | 1 million tokens each; valid for 90 days after activating Model Studio |
| qwen3.5-plus | 256K < tokens ≤ 1M | $1.2 | $7.2 | $7.2 | |
| qwen3.5-plus-2026-02-15 | 0 < tokens ≤ 256K | $0.4 | $2.4 | $2.4 | |
| qwen3.5-plus-2026-02-15 | 256K < tokens ≤ 1M | $1.2 | $7.2 | $7.2 | |
Pricing for qwen3.5-plus in CometAPI
CometAPI offers pay-as-you-go billing and helps centralize billing across providers; its exact per-token charges depend on the upstream provider and any CometAPI margin/discount applied. In practice, using a gateway like CometAPI simplifies provider switching and usage analytics at a small additional cost — useful for teams who want multi-vendor redundancy or want to compare performance vs. price without reengineering.
Explore competitive pricing for qwen3.5-plus, designed to fit various budgets and usage needs. Our flexible plans ensure you only pay for what you use, making it easy to scale as your requirements grow. Discover how qwen3.5-plus can enhance your projects while keeping costs manageable.
| Comet Price (USD / M tokens) | Official Price (USD / M tokens) | Discount |
|---|---|---|
| Input: $0.32/M; Output: $1.92/M | Input: $0.4/M; Output: $2.4/M | -20% |
Can I run Qwen 3.5 on-prem or on custom infra?
Yes, but with caveats:
- Large variants (hundreds of billions of parameters) require specialized hardware (multiple A100/H100 or AMD Instinct clusters). AMD announced day-0 support for Qwen 3.5 on Instinct GPUs, and community projects (vLLM, Hugging Face) provide recipes for optimized inference stacks. Expect substantial engineering effort and high hardware cost at production scale.
- Lighter Qwen family variants (smaller parameter sets, Qwen-Turbo-like weights) are easier to host and are useful for many production tasks with acceptable quality/cost tradeoffs.
If compliance or data residency mandates on-premise deployment, consider a hybrid approach: run embeddings and retrieval locally, and call hosted Qwen for complex multimodal or agentic tasks.
Which cloud or hosted options exist?
- Alibaba Cloud Model Studio: provides hosted Qwen endpoints, OpenAI-compatible interfaces, and integration tools (RAG, toolkits). Good for teams already using Alibaba Cloud.
- Third-party APIs (CometAPI, etc.): quick go-to for multi-model experiments, vendor-agnostic switching and cost comparison.
- Open weights / self-host: if you require full data locality, download the open weights and serve them on your cluster (NCCL/ROCm or CUDA stacks).
Hardware: what GPUs and stacks?
- Day-0 AMD support: AMD announced Day-0 ROCm tooling and containers for Qwen 3.5 on Instinct GPUs — useful if you deploy on AMD hardware. For NVIDIA shops, optimized containers and Triton support are likely to appear quickly.
- Inference optimizations: quantization (INT8/4), tensor slicing, and MoE routing tweaks lower memory and compute needs; choose model size accordingly. For real-time agents, prefer lower-parameter models with aggressive batching and small beam widths.
Best practices when integrating Qwen 3.5
Below are practical rules and engineering patterns — distilled from vendor docs, early reviews, and standard LLM engineering practice — to build robust, scalable, and cost-efficient systems.
Prompting & system message hygiene
- Use explicit system messages to set persona, token budgets, and output formats.
- Prefer short, structured prompts for predictable JSON or function outputs; reserve long chain-of-thought prompts for when they are genuinely needed (they cost more and may increase latency).
- “Thinking” vs. “non-thinking” modes: choose non-thinking for deterministic, plain responses and switch to thinking for heavy reasoning.
Token and context management (critical with 1M windows)
- Chunk long documents and use retrieval augmentation to keep active context small; even though Qwen Plus supports 1M tokens, passing huge contexts every call is expensive. Instead: index documents, fetch relevant chunks, and include only necessary snippets.
- Use embeddings + vector DBs for retrieval first; then call the model with the retrieved context plus a concise instruction. This RAG pattern reduces token costs and latency.
Cost optimization strategies
- Control output size with `max_tokens` and explicit “answer in N words” instructions.
- Use non-thinking mode for templates and short answers; reserve chain-of-thought only when quality gains justify the cost. Alibaba’s docs explicitly map hybrid thinking modes to cost/perf tradeoffs.
- Batch requests where possible (multiple prompts in one request) to amortize overheads for throughput-oriented workloads.
- Track tokens per request and latency with provider analytics (CometAPI provides usage dashboards). Monitor top-N prompts by cost to find optimization targets.
Reliability and rate limiting
- Implement exponential backoff + jitter for 429/503 errors.
- Use the gateway (CometAPI) or vendor dashboard to monitor quotas and set alerts. CometAPI provides usage analytics that can help spot cost spikes quickly.
Function calling / tools / agent design
Treat tool calls as a distinct stage: model suggests a tool + arguments, you validate/authorize and then execute the tool server-side. Never blindly execute untrusted tool instructions. Qwen 3.5 advertises built-in tool patterns; adopt strict input validation and access control.
Closing perspective: what to watch next
Qwen 3.5’s Lunar New Year release is strategic: it packages advanced agentic features, big context handling, and lower operating costs into both open-weight and hosted offerings. The immediate developer story is strong: multiple ways to try the model (hosted APIs like CometAPI, cloud hosting via Alibaba Cloud, or self-hosted weights) and fast hardware support (AMD).
Developers can access the Qwen 3.5 API via CometAPI now. To begin, explore the model’s capabilities in the Playground and consult the API guide for detailed instructions. Before accessing, make sure you have logged in to CometAPI and obtained an API key. CometAPI offers prices lower than the official rates to help you integrate.
Ready to go? → Sign up for Qwen 3.5 today!
If you want to know more tips, guides and news on AI follow us on VK, X and Discord!
