
GLM 4.6

Input:$0.96/M
Output:$3.84/M
Context:200,000
Max Output:128,000
Zhipu's latest flagship model, GLM-4.6, has been released: 355B total parameters, 32B active. Its core capabilities surpass GLM-4.5 across the board. Coding: on par with Claude Sonnet 4, the strongest among Chinese domestic models. Context: extended to 200K tokens (from 128K). Reasoning: improved, with support for tool calls during inference. Search: optimized tool use and agent framework. Writing: better aligned with human preferences in style and role-playing. Multilingual: improved translation quality.

GLM-4.6 is the latest major release in Z.ai’s (formerly Zhipu AI) GLM family: a fourth-generation Mixture-of-Experts (MoE) large language model tuned for agentic workflows, long-context reasoning, and real-world coding. The release emphasizes practical agent/tool integration, a very large context window, and open-weight availability for local deployment.

Key features

  • Long context — native 200K token context window (expanded from 128K). (docs.z.ai)
  • Coding & agentic capability — marketed improvements on real-world coding tasks and better tool invocation for agents.
  • Efficiency — reported ~30% lower token consumption vs GLM-4.5 on Z.ai’s tests.
  • Deployment & quantization — first announced FP8 and Int4 integration for Cambricon chips; native FP8 support on Moore Threads via vLLM.
  • Model size & tensor type — published artifacts indicate a ~357B-parameter model (BF16 / F32 tensors) on Hugging Face.

Technical details

Modalities & formats. GLM-4.6 is a text-only LLM (input and output modalities: text). Context length = 200K tokens; max output = 128K tokens.

Quantization & hardware support. The team reports FP8/Int4 quantization on Cambricon chips and native FP8 execution on Moore Threads GPUs using vLLM for inference — important for lowering inference cost and allowing on-prem and domestic cloud deployments.

Tooling & integrations. GLM-4.6 is distributed through Z.ai’s API, third-party provider networks (e.g., CometAPI), and integrated into coding agents (Claude Code, Cline, Roo Code, Kilo Code).


Benchmark performance

  • Published evaluations: GLM-4.6 was tested on eight public benchmarks covering agents, reasoning, and coding, and shows clear gains over GLM-4.5. On human-evaluated, real-world coding tests (extended CC-Bench), GLM-4.6 uses ~15% fewer tokens than GLM-4.5 and posts a ~48.6% win rate against Anthropic’s Claude Sonnet 4 (near-parity on many leaderboards).
  • Positioning: results claim GLM-4.6 is competitive with leading domestic and international models (examples cited include DeepSeek-V3.1 and Claude Sonnet 4).

Limitations & risks

  • Hallucinations & mistakes: like all current LLMs, GLM-4.6 can and does make factual errors — Z.ai’s docs explicitly warn outputs may contain mistakes. Users should apply verification & retrieval/RAG for critical content.
  • Model complexity & serving cost: 200K context and very large outputs dramatically increase memory & latency demands and can raise inference costs; quantized/inference engineering is required to run at scale.
  • Domain gaps: while GLM-4.6 reports strong agent/coding performance, some public reports note it still lags certain versions of competing models in specific microbenchmarks (e.g., some coding metrics vs Sonnet 4.5). Assess per-task before replacing production models.
  • Safety & policy: open weights increase accessibility but also raise stewardship questions (mitigations, guardrails, and red-teaming remain the user’s responsibility).

Use cases

  • Agentic systems & tool orchestration: long agent traces, multi-tool planning, dynamic tool invocation; the model’s agentic tuning is a key selling point.
  • Real-world coding assistants: multi-turn code generation, code review and interactive IDE assistants (integrated in Claude Code, Cline, Roo Code—per Z.ai). Token efficiency improvements make it attractive for heavy-use developer plans.
  • Long-document workflows: summarization, multi-document synthesis, long legal/technical reviews due to the 200K window.
  • Content creation & virtual characters: extended dialogues, consistent persona maintenance in multi-turn scenarios.
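For the long-document workflows above, inputs must still fit the 200K-token window. A minimal chunking sketch, assuming a rough heuristic of ~4 characters per token (exact counts require the model's actual tokenizer; `reserve` leaves headroom for the prompt and reply):

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def chunk_for_context(text: str, max_tokens: int = 200_000, reserve: int = 8_000) -> list[str]:
    """Split text into chunks that each fit the context window,
    reserving some token budget for instructions and the model's reply."""
    budget_chars = (max_tokens - reserve) * 4  # invert the chars-per-token heuristic
    return [text[i:i + budget_chars] for i in range(0, len(text), budget_chars)]
```

For multi-document synthesis, each chunk's summary can then be concatenated and summarized again (map-reduce style) when the corpus exceeds even the 200K window.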

How GLM-4.6 compares to other models

  • GLM-4.5 → GLM-4.6: step change in context size (128K → 200K) and token efficiency (~15% fewer tokens on CC-Bench); improved agent/tool use.
  • GLM-4.6 vs Claude Sonnet 4 / Sonnet 4.5: Z.ai reports near parity on several leaderboards and a ~48.6% win rate on the CC-Bench real-world coding tasks (i.e., close competition, with some microbenchmarks where Sonnet still leads). For many engineering teams, GLM-4.6 is positioned as a cost-efficient alternative.
  • GLM-4.6 vs other long-context models (DeepSeek, Gemini variants, GPT-4 family): GLM-4.6 emphasizes large context & agentic coding workflows; relative strengths depend on metric (token efficiency/agent integration vs raw code synthesis accuracy or safety pipelines). Empirical selection should be task-driven.


FAQ

What are the context window and output limits for GLM-4.6?

GLM-4.6 supports a 200,000-token context window (extended from the 128K of GLM-4.5) with up to 128,000 output tokens, enabling extensive document analysis and long-form generation.
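These two limits interact: assuming input and output share the context window, as is typical, tokens spent on input reduce the space left for generation. A small budget check using the numbers from this page (actual enforcement is server-side):

```python
CONTEXT_WINDOW = 200_000  # total context, per this page
MAX_OUTPUT = 128_000      # output cap, per this page

def max_completion_tokens(input_tokens: int) -> int:
    """Largest completion that still fits: bounded by both the model's
    output cap and the space remaining in the context window."""
    if input_tokens >= CONTEXT_WINDOW:
        raise ValueError("input alone exceeds the context window")
    return min(MAX_OUTPUT, CONTEXT_WINDOW - input_tokens)
```

For example, a 150K-token input leaves only 50K tokens of generation headroom despite the 128K output cap.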

How does GLM-4.6 compare to Claude Sonnet 4 in coding?

According to Zhipu, GLM-4.6's coding capabilities are on par with Claude Sonnet 4, making it the strongest coding model among Chinese domestic models.

Does GLM-4.6 support tool calling and agent workflows?

Yes. GLM-4.6 features improved inference with enhanced tool-calling support and an optimized agent framework for complex multi-step task automation.
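Tool calling uses the standard OpenAI-style `tools` parameter. A sketch of how such a request payload is assembled (the `get_weather` function here is a made-up example for illustration; the model name matches the sample code on this page):

```python
def build_tool_request(user_message: str) -> dict:
    """Assemble an OpenAI-style chat request exposing one tool.
    `get_weather` is a hypothetical function used only as an example."""
    return {
        "model": "glm-4.6",
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Look up current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }
```

When the model decides to invoke the tool, the response carries a `tool_calls` entry instead of plain text; the caller executes the function and returns the result in a follow-up `tool` message.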

What is the architecture of GLM-4.6?

GLM-4.6 is a Mixture-of-Experts model with 355B total parameters and 32B active parameters, balancing capability with efficiency.

What makes GLM-4.6 different from GLM-4.5?

GLM-4.6 offers extended context (200K vs 128K), improved reasoning and tool calling, writing better aligned with human preferences, stronger multilingual translation, and optimized role-playing.

Is GLM-4.6 suitable for enterprise Chinese-language applications?

Yes. GLM-4.6 is particularly strong at Chinese-language tasks, including translation, content writing, and conversational AI, with enhanced multilingual capabilities.

When should I choose GLM-4.6 over GPT-5.2 or Claude?

Choose GLM-4.6 for Chinese-first applications, cost-effective 200K-context needs, or when you need a strong domestic alternative with coding capabilities comparable to frontier models.

Features for GLM 4.6

GLM 4.6's key capabilities are summarized in the Key features section above: a native 200K-token context window, improved agentic coding and tool invocation, ~30% lower token consumption than GLM-4.5, and open-weight availability with FP8/Int4 deployment options.

Pricing for GLM 4.6

Explore competitive pricing for GLM 4.6, designed to fit various budgets and usage needs. Our flexible plans ensure you only pay for what you use, making it easy to scale as your requirements grow. Discover how GLM 4.6 can enhance your projects while keeping costs manageable.
Comet Price (USD / M Tokens): Input $0.96, Output $3.84
Official Price (USD / M Tokens): Input $1.20, Output $4.80
Discount: -20%
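At the CometAPI rates listed above, per-request cost is simple arithmetic (prices in USD per million tokens, as listed at time of writing):

```python
INPUT_PRICE = 0.96   # USD per million input tokens (CometAPI rate above)
OUTPUT_PRICE = 3.84  # USD per million output tokens (CometAPI rate above)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
```

For example, a 100K-token input producing a 10K-token reply costs roughly $0.13 at these rates.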

Sample code and API for GLM 4.6

Python
from openai import OpenAI
import os

# Get your CometAPI key from https://api.cometapi.com/console/token, and paste it here
COMETAPI_KEY = os.environ.get("COMETAPI_KEY") or "<YOUR_COMETAPI_KEY>"
BASE_URL = "https://api.cometapi.com/v1"

client = OpenAI(base_url=BASE_URL, api_key=COMETAPI_KEY)

completion = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message.content)
