GLM-5: When Large Models Learn to 'Write Code' - The Leap from Vibe Coding to Agentic Engineering
🎯 Summary in One Sentence: Zhizhu AI, in collaboration with Tsinghua University, released the 744B-parameter GLM-5 model, which cuts long-context attention cost with DeepSeek Sparse Attention (DSA), raises long-task training efficiency with a fully asynchronous reinforcement learning (Async RL) framework, and applies a multi-stage post-training process, letting large models evolve from 'Vibe Coding' to 'Agentic Engineering' and complete real engineering projects on their own.
Why is this Paper Needed?
Andrej Karpathy proposed an interesting concept in early 2025—Vibe Coding, meaning you just need to describe your requirements in natural language and let AI write code 'by feel'. This is indeed the mainstream experience of AI programming today: you say a sentence, and the model generates a piece of code, with the effectiveness largely depending on luck.
But here's the problem: real software engineering is far more complex than just 'writing code'. A true engineer needs to understand project architecture, debug errors, manage dependencies, and handle cross-module collaboration—none of which can be solved by 'one prompt, one piece of code'. The goal of the GLM-5 paper is to transform the model from a 'code-writing assistant' into an 'engineer capable of independently managing an entire project'.
This is no small goal. To achieve it, the Zhizhu team made significant innovations in model architecture, training processes, and reinforcement learning algorithms. This interpretation will break down these technical details for you.
Core Contributions: Three Key Innovations
Before diving into details, let's clarify the three core contributions of GLM-5:
| Contribution | Problem Solved | Core Idea |
|---|---|---|
| DSA Sparse Attention | Computation cost explosion at 128K context | Dynamically selects important tokens and skips irrelevant ones, saving 1.5-2x compute |
| Asynchronous Reinforcement Learning Framework | GPU idling during long-task RL training | Fully decouples generation from training, enabling pipeline parallelism |
| Multi-Stage Post-Training Process | Balancing reasoning, coding, agent, and other capabilities | SFT → reasoning RL → agent RL → general RL, adding capabilities gradually |
Model Architecture: Subtracting on the MoE Framework
Basic Configuration
GLM-5 adopts a Mixture-of-Experts (MoE) architecture, with a total of 744B parameters, but only about 40B parameters are activated during each inference. This 'large and sparse' design has become an industry consensus—DeepSeek-V3/R1 and Qwen3 have followed a similar route.
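The 'large and sparse' idea can be sketched in a few lines: a gating network scores all experts, but only the top-k actually run, so compute scales with k rather than with the total expert count. This is a minimal illustrative sketch, not GLM-5's actual routing; all names, dimensions, and the top-k value here are assumptions.

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Route a token to its top-k experts and mix their outputs by the
    softmax-normalized gate scores (illustrative MoE sketch)."""
    logits = x @ gate_w                        # (d,) @ (d, n_experts) -> (n_experts,)
    top = np.argsort(logits)[-top_k:]          # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # normalize gate weights over the selected experts
    # Only the selected experts execute, so cost scales with top_k, not n_experts.
    return sum(wi * experts[i](x) for i, wi in zip(top, w))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
# Toy experts: each is just a linear map here.
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d))) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = moe_forward(rng.normal(size=d), experts, gate_w)
print(y.shape)  # (8,)
```

With 16 toy experts and top_k=2, only 1/8 of the expert parameters touch each token, which is the same principle behind activating ~40B of 744B parameters.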
How Does DSA Work?
The core idea of DSA can be understood with a metaphor: imagine you are looking for information in a library. Standard attention is like flipping through every book in the entire library and then deciding which ones are useful. In contrast, DSA is more like an experienced librarian—it first uses a Lightning Index to quickly scan the titles on the shelves, locking onto a few potentially relevant areas, and then only reads specific paragraphs in those areas in detail.
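The librarian metaphor maps onto a two-stage computation: a cheap index pass scores every key, then full softmax attention runs only over the top-k survivors. The sketch below is illustrative; the index formulation, dimensions, and top-k value are assumptions, not the paper's exact design.

```python
import numpy as np

def sparse_attention(q, K, V, idx_q, idx_K, top_k=4):
    """Two-stage attention sketch: a low-dimensional 'lightning index'
    scans all keys cheaply, then standard attention reads only the
    selected few in detail."""
    # Stage 1: cheap index scores for every key (the 'title scan').
    index_scores = idx_K @ idx_q                   # (n_keys,)
    keep = np.argsort(index_scores)[-top_k:]       # keys worth reading closely
    # Stage 2: full softmax attention, restricted to the selected keys.
    scores = K[keep] @ q / np.sqrt(q.size)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[keep]

rng = np.random.default_rng(1)
n, d, d_idx = 32, 16, 4                            # 32 keys; index vectors are 4-dim, not 16
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
idx_q, idx_K = rng.normal(size=d_idx), rng.normal(size=(n, d_idx))
out = sparse_attention(q, K, V, idx_q, idx_K)
print(out.shape)  # (16,)
```

The savings come from the asymmetry: the index pass is low-dimensional and runs over all keys, while the expensive full-dimension attention touches only top_k of them.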
Training Process: Four-Stage 'Leveling Up'
The training process of GLM-5 is the highlight of this paper, divided into pre-training and post-training stages.
Pre-Training Stage
- Data Scale: 27T tokens, data mix includes web pages, code, academic papers, books, etc.
- Context Expansion: Gradually expands context from 4K to 200K through mid-term training, using RoPE frequency adjustments.
- Annealing Phase: Uses higher quality data for 'fine-tuning' at the end of pre-training.
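The 'RoPE frequency adjustment' mentioned above typically means scaling the rotary base so position encodings rotate more slowly and remain distinguishable far beyond the original window. The sketch below shows that effect under an assumed base scaling; the paper's exact schedule is not given here.

```python
import numpy as np

def rope_freqs(d, base=10000.0):
    """Per-dimension-pair rotation frequencies for RoPE.
    Raising `base` lowers the frequencies, i.e. lengthens the wavelengths,
    so distant positions stay distinguishable."""
    return base ** (-np.arange(0, d, 2) / d)

# Illustrative only: base values are assumptions, not GLM-5's actual settings.
short_ctx = rope_freqs(64, base=10000.0)     # original short window (e.g. 4K)
long_ctx = rope_freqs(64, base=1000000.0)    # extended window during mid-training
print(long_ctx[-1] < short_ctx[-1])          # slowest rotation gets even slower
```

Mid-training on progressively longer sequences then lets the model actually learn to use the stretched position range, rather than relying on the frequency change alone.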
Post-Training Quartet
This is the most distinctive part of the pipeline. GLM-5 underwent four rounds of post-training:
- Supervised Fine-Tuning (SFT) fine-tunes with high-quality instruction data.
- Reasoning Reinforcement Learning (Reasoning RL) conducts RL training on mathematical and code reasoning tasks.
- Agentic Reinforcement Learning (Agentic RL), which is a key innovation.
- General Reinforcement Learning (General RL) conducts RL on a broader range of general tasks.
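The four stages above can be sketched as a simple sequential pipeline: each stage starts from the previous stage's checkpoint, so capabilities are layered on incrementally. The stage names come from the paper; the driver function and data descriptions are illustrative assumptions.

```python
# Stage names from the paper; the training function itself is hypothetical.
STAGES = [
    ("sft",          "high-quality instruction data"),
    ("reasoning_rl", "math and code reasoning tasks"),
    ("agentic_rl",   "multi-step agent tasks"),
    ("general_rl",   "broad general tasks"),
]

def run_post_training(model, train_stage):
    """Run each post-training stage in order; every stage resumes from
    the checkpoint produced by the stage before it."""
    for name, data in STAGES:
        model = train_stage(model, name, data)
    return model

# Toy usage: the 'training' step just records the lineage of checkpoints.
final = run_post_training("base", lambda m, name, data: f"{m}+{name}")
print(final)  # base+sft+reasoning_rl+agentic_rl+general_rl
```

The ordering matters: reasoning RL builds on SFT's instruction following, and agentic RL builds on the reasoning skills, before general RL broadens coverage.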
Asynchronous Reinforcement Learning: Preventing GPU Idling
Traditional RL training is synchronous: collect a batch of data → compute rewards → update the model → collect again. This works fine when each task is short, but agent tasks often require dozens of interaction steps, so the training GPUs sit idle waiting for long rollouts to finish. GLM-5's framework fully decouples generation from training, letting the two proceed as a pipeline in parallel.
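The decoupling can be pictured as a producer-consumer pipeline: rollout workers stream finished trajectories into a queue, and the trainer consumes them concurrently instead of blocking on each batch. This is a minimal single-process sketch; the queue size, worker count, trajectory format, and sentinel protocol are all illustrative assumptions.

```python
import queue
import threading
import time

traj_queue = queue.Queue(maxsize=8)  # buffer between generation and training

def rollout_worker(n_episodes):
    """Producer: runs agent episodes and streams finished trajectories."""
    for i in range(n_episodes):
        time.sleep(0.001)              # stands in for a long multi-step episode
        traj_queue.put(f"traj-{i}")    # trajectory ready for the trainer
    traj_queue.put(None)               # sentinel: no more data

def trainer():
    """Consumer: performs an update whenever a trajectory is available,
    never waiting for a whole synchronous batch to complete."""
    updates = 0
    while (item := traj_queue.get()) is not None:
        updates += 1                   # stands in for a gradient step
    return updates

t = threading.Thread(target=rollout_worker, args=(16,))
t.start()
n_updates = trainer()
t.join()
print(n_updates)  # 16
```

In the real framework both sides run on separate GPU pools, so generation throughput and training throughput are limited by their own hardware rather than by each other.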
In-Depth Interpretation of Experimental Results
Main Benchmark Comparisons
| Benchmark | GLM-5 | DeepSeek-V3.2 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| MMLU-Pro | 78.0 | 75.9 | 78.0 | 74.3 | 76.1 |
| GPQA-Diamond | 71.7 | 68.4 | 67.1 | 63.6 | 70.5 |
| BrowseComp | 57.1 | 32.0 | 26.3 | 25.1 | 46.9 |
Conclusion
The GLM-5 paper is rich in information. Setting aside specific numbers, the core message it conveys is: the next battlefield for large models is in 'doing work' rather than just 'answering questions'.
In terms of competition, GLM-5 demonstrates the competitiveness of Chinese AI teams in cutting-edge research on large models.
Paper Information
- Title: GLM-5: from Vibe Coding to Agentic Engineering
- Institution: Zhizhu AI & Tsinghua University
- Link: https://arxiv.org/abs/2602.15763

