Short answer: there isn’t an officially released, downloadable checkpoint for GLM-4.7-Flash yet. If you need to run something locally/offline, use the closest open models from the GLM family (e.g., THUDM/glm-4-9b-chat) and serve them with an inference engine for a “flash”-like experience.
Two practical ways to run GLM locally:
Option A — vLLM (fast, OpenAI-compatible API)
1) Requirements
- NVIDIA GPU (≥12 GB VRAM recommended; 4-bit quant works with ~8–12 GB)
- Python 3.10+, CUDA toolchain that matches your PyTorch/vLLM build
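Optional sanity check for these requirements — a short illustrative snippet (not specific to vLLM) that confirms PyTorch can see your GPU and reports its VRAM:
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU visible to PyTorch; vLLM requires one.")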
2) Install
pip install vllm openai
(vLLM installs a compatible PyTorch and transformers itself; the openai package is only needed for the client example below.)
3) Start an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model THUDM/glm-4-9b-chat \
  --trust-remote-code \
  --gpu-memory-utilization 0.9
This exposes http://127.0.0.1:8000/v1
4) Call it like OpenAI (Python)
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="THUDM/glm-4-9b-chat",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
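For interactive use, the same endpoint supports streaming through the standard OpenAI SDK; this sketch reuses the client created above:
stream = client.chat.completions.create(
    model="THUDM/glm-4-9b-chat",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    stream=True,  # server sends incremental deltas instead of one response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()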
Option B — Transformers (single-process script)
1) Install
pip install transformers accelerate torch bitsandbytes
2) FP16/BF16 (fast GPU)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
name = "THUDM/glm-4-9b-chat"
tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# Chat models should be prompted through their chat template, not a raw string.
messages = [{"role": "user", "content": "Hello!"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
3) 4-bit quant (smaller GPUs)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
name = "THUDM/glm-4-9b-chat"
tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)
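Generation then works exactly as in the FP16/BF16 example above. To confirm what the quantized weights actually occupy, Transformers models expose get_memory_footprint():
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")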
Notes
- If/when Zhipu releases GLM-4.7-Flash weights, you can swap in the new model name and keep the same steps.
- For higher throughput, consider LMDeploy or TensorRT-LLM, and enable FlashAttention if your environment supports it (a sketch follows these notes).
- If “local” just means running the client locally against a cloud API, you can call Zhipu’s API with the GLM-4.7-Flash model name via their SDK or any OpenAI-compatible client.
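The FlashAttention note above, as a hedged Transformers sketch — it assumes the flash-attn package is installed in your environment; attn_implementation is a standard from_pretrained argument:
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,               # FlashAttention needs fp16/bf16
    attn_implementation="flash_attention_2",  # raises if flash-attn is missing
    device_map="auto",
    trust_remote_code=True,
)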
If you share your OS, GPU VRAM, and whether you need fully offline operation or just a local client calling a cloud API, I can tailor exact commands.
GLM-4.7-Flash is a lightweight, high-performance 30B-A3B MoE member of the GLM-4.7 family, designed for low-cost local deployment for coding, agentic workflows, and general reasoning. You can run it locally in three practical ways: (1) via Ollama (easy to use, managed local runtime), (2) via Hugging Face / Transformers / vLLM / SGLang (GPU-first server deployments), or (3) via GGUF + llama.cpp / llama-cpp-python (CPU/edge-friendly); a sketch of path (3) follows.
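A minimal llama-cpp-python sketch for path (3); the GGUF filename and quantization below are placeholders for whatever GLM GGUF file you actually download:
from llama_cpp import Llama

llm = Llama(
    model_path="glm-4.7-flash-Q4_K_M.gguf",  # hypothetical local file
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU; set 0 for CPU-only
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])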