zai-org/GLM-5
GLM-5.2 & GLM-5.1 & GLM-5
π Join our Wechat or Discord community.
π Check out the GLM-5.2 blog and GLM-5 Technical report.
π Use GLM-5.2 API services on Z.ai API Platform.
π Try GLM-5.2 at z.ai.
Introduction
GLM-5.2
GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context.
GLM-5.2’s new capabilities include:
- Solid 1M Context: A solid 1M-token context that stably sustains long-horizon work
- Advanced Coding with Flexible Effort: Stronger coding capabilities with multiple thinking effort levels to balance performance and latency
- Improved Architecture: We propose IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9Γ at a 1M context length. We also improve GLM-5.2βs MTP layer for speculative decoding, increasing the acceptance length by up to 20%

On standard coding benchmarks, GLM-5.2 is the strongest open-source model, improving on GLM-5.1 by a wide margin: 81.0 vs. 62.0 on Terminal-Bench 2.1 and 62.1 vs. 58.4 on SWE-bench Pro. It also closes much of the gap to the closed-source frontier β on Terminal-Bench 2.1 (81.0) it lands within a few points of Claude Opus 4.8 (85.0) β while staying ahead of Gemini 3.1 Pro.
For more detail, check our blog.
GLM-5.1
GLM-5.1 is our next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).

But the most meaningful leap goes beyond first-pass performance. Previous modelsβincluding GLM-5βtend to exhaust their repertoire early: they apply familiar techniques for quick initial gains, then plateau. Giving them more time doesn’t help.
GLM-5.1, by contrast, is built to stay effective on agentic tasks over much longer horizons. We’ve found that the model handles ambiguous problems with better judgment and stays productive over longer sessions. It breaks complex problems down, runs experiments, reads results, and identifies blockers with real precision. By revisiting its reasoning and revising its strategy through repeated iteration, GLM-5.1 sustains optimization over hundreds of rounds and thousands of tool calls. The longer it runs, the better the result.
GLM-5
We are launching GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.
Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.

GLM-5 is purpose-built for complex systems engineering and long-horizon agentic tasks. On our internal evaluation suite CC-Bench-V2, GLM-5 significantly outperforms GLM-4.7 across frontend, backend, and long-horizon tasks, narrowing the gap to Claude Opus 4.5.

On Vending Bench 2, a benchmark that measures long-term operational capability, GLM-5 ranks #1 among open-source models. Vending Bench 2 requires the model to run a simulated vending machine business over a one-year horizon; GLM-5 finishes with a final account balance of $4,432, approaching Claude Opus 4.5 and demonstrating strong long-term planning and resource management.

Download Model
| Model | Download Links | Model Size | Precision |
|---|---|---|---|
| GLM-5.2 | π€ Hugging Face π€ ModelScope |
744B-A40B | BF16 |
| GLM-5.2-FP8 | π€ Hugging Face π€ ModelScope |
744B-A40B | FP8 |
| GLM-5.1 | π€ Hugging Face π€ ModelScope |
744B-A40B | BF16 |
| GLM-5.1-FP8 | π€ Hugging Face π€ ModelScope |
744B-A40B | FP8 |
| GLM-5 | π€ Hugging Face π€ ModelScope |
744B-A40B | BF16 |
| GLM-5-FP8 | π€ Hugging Face π€ ModelScope |
744B-A40B | FP8 |
Serve GLM-5 Series Locally
GLM-5.2 supports deployment with the following frameworks. Feel free to try them out:
- SGLang (v0.5.13.post1+) β see cookbook
- vLLM (v0.23.0+) β see recipes
- Transformers (v0.5.12+) β see transformers docs
- KTransformers (v0.5.12+) β see tutorial
- For deployment on the
Ascend NPUplatform, inference frameworks such as vLLM-Ascend, xLLM and SGLang are supported β see here.
GLM-5 supports controlling the thinking budget through the reasoning_effort parameter, which accepts two levels: max and high. max is the default β if reasoning_effort is left unset (or set to any value other than high), the model runs at Max. To use the High level, you must explicitly pass reasoning_effort="high". For default scenarios such as benchmark/leaderboard reproduction, keep Max (no setting required); only set reasoning_effort="high" when you specifically want the High level. Thinking can be turned off entirely by setting enable_thinking=false.
Citation
If you find GLM-5 series model useful in your research, please cite our technical report:
|
|