
Abstract

We introduce ChatGLM, an evolving family of LLMs that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models, trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models have been pre-trained on ten trillion tokens, mostly in Chinese and English, along with a small corpus from 24 other languages, and are aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 on general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 on long-context tasks, and 4) outperforms GPT-4 in Chinese alignment as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use -- including the web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems with the Python interpreter. Over the course of this development, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.

Figure: Timeline of GLM models, highlighting ChatGLM language models. Public APIs: https://bigmodel.cn.

Overview

  • The paper presents a detailed analysis and development history of the ChatGLM family of LLMs, particularly the GLM-4 series, which includes GLM-4, GLM-4-Air, and GLM-4-9B.

  • Key advancements in model architecture, pre-training, and post-training techniques are discussed, emphasizing transformer architecture optimizations, multilingual data pre-training, and reinforcement learning from human feedback.

  • The GLM-4 models demonstrate robust performance across various benchmarks and practical applications, with the open-source nature of these models contributing to their widespread adoption and impact.

ChatGLM: A Comprehensive Overview

The paper "ChatGLM: A Family of LLMs from GLM-130B to GLM-4 All Tools" presents an in-depth analysis and development trajectory of the ChatGLM family of LLMs. This research is a collaborative effort by Zhipu AI and Tsinghua University. The primary focus of this report is on the GLM-4 models, including GLM-4, GLM-4-Air, and GLM-4-9B, which are built upon the experiences and learnings from previous generations of ChatGLM.

Model Architecture and Pre-training

ChatGLM models utilize a Transformer architecture and incorporate several optimization techniques. The team has explored various strategies, such as DeepNorm, Rotary Positional Encoding (RoPE), Gated Linear Unit with GeLU activation, and more recently, RMSNorm and SwiGLU to enhance model performance. The GLM-4 models adopt a "No Bias Except QKV" approach to increase training speed and reduce inference costs.
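To make these architectural choices concrete, here is a minimal PyTorch-style sketch of the relevant components: an RMSNorm layer, a SwiGLU feed-forward block, and attention projections where only the fused QKV projection keeps a bias. This is an illustrative reconstruction under common conventions, not the actual GLM-4 implementation, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean subtraction, no bias term."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class AttentionProjections(nn.Module):
    """'No Bias Except QKV': only the fused QKV projection carries a bias."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=True)   # bias kept
        self.out = nn.Linear(dim, dim, bias=False)      # bias dropped
```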

The pre-training data is a multilingual corpus of 10 trillion tokens, drawn primarily from Chinese and English with a smaller portion from 24 other languages. Deduplication, filtering, and tokenization are applied to ensure high-quality, diverse training data. The models are trained with context lengths extended from 2K to 128K, and up to 1M tokens, using techniques such as positional encoding extension and long-context alignment to handle long-context tasks.
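Context extension for RoPE-based models is often done by rescaling the rotary base frequency so that positions beyond the original training window fall within wavelengths the model has effectively seen. The sketch below illustrates this general idea (NTK-style base scaling); the scaling factor, lengths, and dimensions are assumptions for illustration, not the specific extension recipe reported for GLM-4.

```python
import torch

def rope_frequencies(head_dim: int, max_len: int, base: float = 10000.0):
    """Standard RoPE angle table: one rotation frequency per pair of dims."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_len).float()
    return torch.outer(positions, inv_freq)  # shape: (max_len, head_dim // 2)

# Illustrative extension: enlarging the base stretches the rotary wavelengths,
# keeping far positions in a familiar regime (hypothetical numbers).
train_len, target_len, head_dim = 2048, 131072, 128
scale = target_len / train_len
extended_base = 10000.0 * scale ** (head_dim / (head_dim - 2))
angles = rope_frequencies(head_dim, target_len, base=extended_base)
```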

Post-training Alignment and Techniques

Post-training, consisting of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), plays a critical role in aligning the models with human preferences. For the GLM-4 series, SFT and RLHF are instrumental in enhancing the models' performance in understanding human intent, instruction following, and maintaining multi-turn dialogue coherence. The paper highlights that authentic human interactions significantly contribute to alignment quality.
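As a rough illustration of the SFT stage, the snippet below shows the common practice of computing the next-token cross-entropy only over response tokens while masking out prompt tokens. This is a generic sketch with placeholder shapes and token IDs, not GLM-4's actual training code.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy on response tokens only; prompt tokens are masked out.

    logits:     (seq_len, vocab) model outputs
    input_ids:  (seq_len,) prompt tokens followed by response tokens
    prompt_len: number of leading prompt tokens excluded from the loss
    """
    labels = input_ids.clone()
    labels[:prompt_len] = -100            # ignored by cross_entropy
    # shift so each position predicts the next token
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)

# Toy example with placeholder token IDs.
vocab, seq_len, prompt_len = 32, 10, 4
logits = torch.randn(seq_len, vocab)
input_ids = torch.randint(0, vocab, (seq_len,))
loss = sft_loss(logits, input_ids, prompt_len)
```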

Noteworthy techniques developed during this journey include:

  • LongAlign for long-context alignment.
  • Self-Contrast for feedback-free alignment.
  • ChatGLM-Math for improving math problem-solving using self-critique.
  • AgentTuning to bolster agent capabilities.
  • APAR for auto-parallel auto-regressive generation.

Several new benchmarks, including AgentBench, LongBench, AlignBench, and HumanEval-X, were introduced to evaluate these models comprehensively.

Evaluation and Capabilities

The GLM-4 models have been rigorously evaluated on various academic and practical benchmarks:

  • Academic Benchmarks: GLM-4 closely rivals GPT-4 in metrics like MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, exhibiting strong performance.
  • Instruction Following: On IFEval, GLM-4 matches GPT-4-Turbo in both prompt and instruction levels in English and Chinese.
  • Alignment: In AlignBench, GLM-4 outperforms GPT-4 in Chinese language alignment across eight dimensions.
  • Long Context Handling: GLM-4's long-context model, evaluated on LongBench-Chat, matches or outperforms models like GPT-4 Turbo and Claude 3 Opus.
  • Coding: On NaturalCodeBench, GLM-4 demonstrates close performance to Claude 3 Opus in real-world coding tasks.

Practical Applications and All Tools Model

The GLM-4 All Tools model is notable for its ability to autonomously decide and use appropriate tools to complete complex tasks. This includes web browsing, Python interpretation, text-to-image generation, and user-defined functions, often surpassing GPT-4 All Tools in practical applications.
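The general pattern behind such an "All Tools" setup is an agent loop: the model either answers directly or emits a structured tool call, which the runtime executes and feeds back for the next step. The dispatcher below is a simplified, hypothetical sketch; the tool registry, message format, and `model_step` interface are assumptions for illustration, not the GLM-4 All Tools API.

```python
import json

# Hypothetical tool registry: names and implementations are illustrative only.
TOOLS = {
    "web_browser": lambda query: f"search results for {query!r}",
    "python_interpreter": lambda code: str(eval(code, {"__builtins__": {}})),
}

def try_parse_tool_call(text):
    """Treat the reply as a tool call only if it is JSON naming a known tool."""
    try:
        call = json.loads(text)
        return call if call.get("tool") in TOOLS else None
    except (json.JSONDecodeError, TypeError):
        return None

def run_all_tools_agent(model_step, user_request, max_turns=5):
    """Loop: the model either returns a final answer or a JSON tool call,
    whose result is appended to the conversation for the next step."""
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_turns):
        reply = model_step(messages)        # assumed model interface
        call = try_parse_tool_call(reply)
        if call is None:
            return reply                    # plain answer, we are done
        result = TOOLS[call["tool"]](call["argument"])
        messages.append({"role": "tool", "content": result})
    return "max turns reached"
```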

Open-Source Contributions

The paper emphasizes the open-source nature of the ChatGLM models, including ChatGLM-6B, GLM-4-9B, WebGLM, and CodeGeeX. These models have collectively received over 10 million downloads on platforms like Hugging Face, reflecting their accessibility and widespread usage.

Implications and Future Directions

The practical and theoretical implications of this research are significant. Practically, the GLM-4 models provide robust performance in a variety of tasks, aligning closely with state-of-the-art models. Theoretically, the techniques developed offer new insights into LLM training and alignment methodologies. Future developments could see improvements in model safety, efficiency, and further refinements in agent capabilities.

Safety and potential risks are also addressed through rigorous data filtering and alignment processes, with continuous efforts to ensure model harmlessness.

Conclusion

The ChatGLM family of models represents a substantial advancement in the field of LLMs. The GLM-4's capabilities in handling diverse and complex tasks, combined with their open-source nature, contribute significantly to the broader AI research community. As the development of these models continues, they stand poised to push the boundaries of what LLMs can achieve. The commitment of Zhipu AI and Tsinghua University to democratize cutting-edge AI technologies through open-source efforts will undoubtedly foster further innovation and accessibility in AI research.
