Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

(2404.04167)
Published Apr 5, 2024 in cs.CL and cs.AI

Abstract

In this study, we introduce CT-LLM, a 2B LLM that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.

Figure: Data processing flow and deduplication ratios, with a schematic of similar-line deduplication.

Overview

  • CT-LLM is a 2 billion parameter Large Language Model focused on the Chinese language, trained on 1,200 billion tokens, roughly two-thirds of which are Chinese.

  • The dataset for CT-LLM comprises 840.48 billion Chinese tokens, 314.88 billion English tokens, and 99.3 billion code tokens, with data filtering tailored for Chinese text quality.

  • CT-LLM utilizes a transformer-based architecture with modifications for Chinese language optimization, and underwent Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to enhance its multilingual capacity and align it with human preferences.

  • The model demonstrates exceptional capabilities in Chinese language processing and sets a milestone for linguistic diversity in LLM training, with its open-source training process facilitating further research.

Pretraining a Chinese-Centric Large Language Model (CT-LLM)

Introduction to CT-LLM

The development of LLMs has traditionally leveraged extensive English datasets, leading to advances in understanding and generating natural language. However, this practice tends to overshadow the linguistic diversity inherent in human languages. Addressing this gap, the recently introduced Chinese Tiny LLM (CT-LLM), a 2 billion parameter model, signifies a shift toward prioritizing the Chinese language from the outset. Unlike conventional models, CT-LLM was pretrained from scratch on a comprehensive corpus of 1,200 billion tokens, the majority of them Chinese. This model challenges the prevailing norms in LLM training, showcasing remarkable capabilities on Chinese language tasks and suggesting a broader scope for training methodologies that embrace linguistic diversity.

Methodology Behind CT-LLM

Dataset Composition

The training dataset for CT-LLM was assembled to ensure vast and diverse coverage of Chinese text, encompassing 840.48 billion Chinese tokens, 314.88 billion English tokens, and 99.3 billion code tokens. To refine dataset quality, filtering employed heuristic rules tailored specifically to Chinese text, addressing the variance in data diversity and quality noted in previous work.
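
The exact filtering rules and deduplication pipeline are not spelled out in this summary; the sketch below only illustrates the kind of line-level heuristic filter and similar-line deduplication such a pipeline applies to Chinese web text. The thresholds and helper names (chinese_ratio, keep_line, dedup_similar_lines) are assumptions, not the authors' implementation.

```python
import hashlib
import re

CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")

def chinese_ratio(line: str) -> float:
    """Fraction of characters in the line that are CJK ideographs."""
    if not line:
        return 0.0
    return len(CHINESE_CHAR.findall(line)) / len(line)

def keep_line(line: str, min_len: int = 10, min_chinese_ratio: float = 0.3) -> bool:
    """Heuristic quality filter: drop very short lines and lines with little
    Chinese text. Thresholds are illustrative, not the paper's actual values."""
    line = line.strip()
    return len(line) >= min_len and chinese_ratio(line) >= min_chinese_ratio

def dedup_similar_lines(lines):
    """Exact-hash deduplication over whitespace-normalized lines, a crude
    stand-in for the similar-line deduplication shown in the figure above."""
    seen, kept = set(), []
    for line in lines:
        key = hashlib.md5(re.sub(r"\s+", "", line).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

corpus = ["今天天气很好。", "今天 天气 很好。", "short", "The weather is nice today."]
print(dedup_similar_lines([l for l in corpus if keep_line(l, min_len=5)]))
```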

Model Architecture and Training

CT-LLM utilizes a transformer-based architecture with design choices including multi-head attention, SwiGLU activations, and rotary position embeddings (RoPE), selected to optimize performance for the Chinese language. The tokenizer design and vocabulary size were carefully chosen to encode numerical data effectively and to accommodate the nuances of the Chinese language.
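
As a rough illustration of two of the components named above, the following PyTorch sketch implements a SwiGLU feed-forward block and rotary position embeddings; the dimensions are placeholders and do not reflect CT-LLM's actual configuration.

```python
import torch
import torch.nn as nn

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit, as used in LLaMA-style models."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, seq, heads, head_dim)."""
    _, s, _, d = x.shape
    half = d // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Toy dimensions (placeholders, not CT-LLM's actual configuration)
x = torch.randn(2, 16, 512)
print(SwiGLU(512, 1376)(x).shape)             # torch.Size([2, 16, 512])
print(rope(torch.randn(2, 16, 8, 64)).shape)  # torch.Size([2, 16, 8, 64])
```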

Supervised Fine-Tuning (SFT) and Human Preferences Learning

SFT employed both Chinese and English data to enhance the model's multilingual capacity. The model was fine-tuned with various ratios of Chinese to English data, and the results indicated remarkable proficiency on Chinese language tasks. Additionally, Direct Preference Optimization (DPO) was used to align the model more closely with human preferences, focusing on generating harmless and helpful responses.
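
For reference, the standard DPO objective scores each preference pair by the difference between the policy's and a frozen reference model's log-ratios of the chosen versus rejected response. The sketch below shows that loss; beta and the toy log-probabilities are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard Direct Preference Optimization loss.
    Inputs are per-example sequence log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    beta is a placeholder, not the value used for CT-LLM."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy example: the policy slightly prefers the chosen responses relative to the reference
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-20.0, -18.0]),
                torch.tensor([-13.0, -15.5]), torch.tensor([-19.0, -17.5]))
print(loss.item())
```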

Evaluation and Benchmarks

CT-LLM underwent rigorous evaluation across multiple benchmarks, demonstrating exceptional ability in Chinese language processing and multilingual tasks. A new benchmark, the Chinese Hard Case Benchmark (CHC-Bench), designed specifically to measure instruction understanding in Chinese, further confirmed the model's adeptness. Successful alignment with human preferences also marked significant progress toward safer and more user-friendly LLMs.

Implications and Future Directions

By diverging from predominantly English-focused training methodologies, CT-LLM paves the way for more inclusive and versatile LLMs. Its remarkable performance in understanding and generating Chinese text underscores the potential of LLMs dedicated to other languages. Moreover, the open-sourcing of CT-LLM's training process, including the comprehensive dataset and benchmarks, invites further exploration and innovation in the field, potentially leading to advances in multilingual LLMs and their applications across diverse linguistic landscapes. Future research might explore the scalability of such models, the integration of even greater linguistic diversity, and refined methodologies for aligning LLMs with human preferences across different cultural contexts.
