Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

(2404.04167)
Published Apr 5, 2024 in cs.CL and cs.AI

Abstract

In this study, we introduce CT-LLM, a 2B LLM that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.

Figure: Data processing flow and deduplication ratios, with a schematic of similar-line deduplication.

Overview

  • CT-LLM is a 2 billion parameter Large Language Model focused on the Chinese language, trained on 1,200 billion tokens, roughly two-thirds of which are Chinese.

  • The dataset for CT-LLM comprises 840.48 billion Chinese tokens, 314.88 billion English tokens, and 99.3 billion code tokens, with data filtering tailored for Chinese text quality.

  • CT-LLM utilizes a transformer-based architecture with modifications for Chinese language optimization, and underwent Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to enhance its multilingual capacity and align it with human preferences.

  • The model demonstrates exceptional capabilities in Chinese language processing and sets a milestone for linguistic diversity in LLM training, with its open-source training process facilitating further research.

Pretraining a Chinese-Centric Large Language Model (CT-LLM)

Introduction to CT-LLM

The development of LLMs has traditionally leveraged extensive English datasets, leading to advances in understanding and generating natural language. However, this practice tends to overshadow the linguistic diversity inherent in human languages. Addressing this gap, the recently introduced Chinese Tiny LLM (CT-LLM), a 2 billion parameter model, signifies a shift toward prioritizing the Chinese language from the outset. Unlike conventional models, CT-LLM was pretrained from scratch on a comprehensive corpus of 1,200 billion tokens, the majority of them Chinese. This model challenges the prevailing norms in LLM training, showcasing remarkable capabilities on Chinese language tasks and suggesting a broader scope for training methodologies that embrace linguistic diversity.

Methodology Behind CT-LLM

Dataset Composition

The training dataset for CT-LLM was assembled to ensure vast and diverse coverage of Chinese text, encompassing 840.48 billion Chinese tokens, 314.88 billion English tokens, and 99.3 billion code tokens. To refine dataset quality, filtering employed heuristic rules tailored specifically to Chinese text, addressing the variance in data diversity and quality noted in previous work.
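
The exact filtering rules and deduplication pipeline are not spelled out in this summary; the sketch below only illustrates the kind of line-level heuristic filter and similar-line deduplication such a pipeline applies to Chinese web text. The thresholds and helper names (chinese_ratio, keep_line, dedup_similar_lines) are assumptions, not the authors' implementation.

```python
import hashlib
import re

CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")

def chinese_ratio(line: str) -> float:
    """Fraction of characters in the line that are CJK ideographs."""
    if not line:
        return 0.0
    return len(CHINESE_CHAR.findall(line)) / len(line)

def keep_line(line: str, min_len: int = 10, min_chinese_ratio: float = 0.3) -> bool:
    """Heuristic quality filter: drop very short lines and lines with little
    Chinese text. Thresholds are illustrative, not the paper's actual values."""
    line = line.strip()
    return len(line) >= min_len and chinese_ratio(line) >= min_chinese_ratio

def dedup_similar_lines(lines):
    """Exact-hash deduplication over whitespace-normalized lines, a crude
    stand-in for the similar-line deduplication shown in the figure above."""
    seen, kept = set(), []
    for line in lines:
        key = hashlib.md5(re.sub(r"\s+", "", line).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

corpus = ["今天天气很好。", "今天 天气 很好。", "short", "The weather is nice today."]
print(dedup_similar_lines([l for l in corpus if keep_line(l, min_len=5)]))
```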

Model Architecture and Training

CT-LLM utilizes a transformer-based architecture with design choices including multi-head attention, SwiGLU activations, and rotary position embeddings (RoPE), selected to optimize performance for the Chinese language. The tokenizer design and vocabulary size were carefully chosen to encode numerical data effectively and to accommodate the nuances of the Chinese language.
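
As a rough illustration of two of the components named above, the following PyTorch sketch implements a SwiGLU feed-forward block and rotary position embeddings; the dimensions are placeholders and do not reflect CT-LLM's actual configuration.

```python
import torch
import torch.nn as nn

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit, as used in LLaMA-style models."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, seq, heads, head_dim)."""
    _, s, _, d = x.shape
    half = d // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Toy dimensions (placeholders, not CT-LLM's actual configuration)
x = torch.randn(2, 16, 512)
print(SwiGLU(512, 1376)(x).shape)             # torch.Size([2, 16, 512])
print(rope(torch.randn(2, 16, 8, 64)).shape)  # torch.Size([2, 16, 8, 64])
```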

Supervised Fine-Tuning (SFT) and Human Preferences Learning

SFT employed both Chinese and English data to enhance the model's multilingual capacity. The model was fine-tuned with various ratios of Chinese to English data, and the results indicated remarkable proficiency on Chinese language tasks. Additionally, Direct Preference Optimization (DPO) was used to align the model more closely with human preferences, focusing on generating harmless and helpful responses.
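
For reference, the standard DPO objective scores each preference pair by the difference between the policy's and a frozen reference model's log-ratios of the chosen versus rejected response. The sketch below shows that loss; beta and the toy log-probabilities are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard Direct Preference Optimization loss.
    Inputs are per-example sequence log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    beta is a placeholder, not the value used for CT-LLM."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy example: the policy slightly prefers the chosen responses relative to the reference
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-20.0, -18.0]),
                torch.tensor([-13.0, -15.5]), torch.tensor([-19.0, -17.5]))
print(loss.item())
```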

Evaluation and Benchmarks

CT-LLM underwent rigorous evaluation across multiple benchmarks, demonstrating exceptional ability in Chinese language processing and multilingual tasks. A new benchmark, the Chinese Hard Case Benchmark (CHC-Bench), designed specifically to measure instruction understanding in Chinese, further confirmed the model's adeptness. Successful alignment with human preferences also marked significant progress toward safer and more user-friendly LLMs.

Implications and Future Directions

By diverging from predominantly English-focused training methodologies, CT-LLM paves the way for more inclusive and versatile LLMs. Its remarkable performance in understanding and generating Chinese text underscores the potential of LLMs dedicated to other languages. Moreover, the open-sourcing of CT-LLM's training process, including the comprehensive dataset and benchmarks, invites further exploration and innovation in the field, potentially leading to advances in multilingual LLMs and their applications across diverse linguistic landscapes. Future research might explore the scalability of such models, the integration of even greater linguistic diversity, and refined methodologies for aligning LLMs with human preferences across different cultural contexts.
