CPM: A Large-scale Generative Chinese Pre-trained Language Model

Published 1 Dec 2020 in cs.CL | (2012.00413v1)

Abstract: Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many NLP tasks in the settings of few-shot (even zero-shot) learning. The code and parameters are available at https://github.com/TsinghuaAI/CPM-Generate.

Summary

The paper introduces CPM, a 2.6-billion-parameter transformer tailored for Chinese that outperforms smaller models in zero-shot and few-shot scenarios.
It employs a custom sub-word vocabulary and an increased token batch size to effectively capture Chinese semantic nuances and stabilize training.
CPM achieves notable improvements in text classification, dialogue generation, and entity probing, setting a new benchmark for Chinese NLP research.

Overview of CPM: A Large-scale Generative Chinese Pre-trained Language Model

This paper introduces the Chinese Pre-trained Language Model (CPM), a transformer-based autoregressive language model with 2.6 billion parameters, trained on 100GB of Chinese data. At the time of release, the authors described it as, to the best of their knowledge, the largest Chinese pre-trained language model. It is intended to support downstream Chinese NLP tasks such as conversation, essay generation, cloze tests, and language understanding.

Motivation and Objectives

CPM is motivated by the difficulty of applying existing large-scale models such as GPT-3 to Chinese NLP tasks. GPT-3's few-shot and zero-shot results highlighted the potential of such models, but two barriers limit its use for Chinese: its parameters are not publicly available, and its training corpus is primarily English.

Technical Approach

CPM uses a transformer architecture modelled closely on GPT-3's generative design but tailored to Chinese. It constructs a new sub-word vocabulary for Chinese text, on the grounds that character-level or BERT-based vocabularies split multi-character words and can lose part of their meaning. The vocabulary therefore contains both words and characters. In addition, the model is trained with a large batch size of 3,072 sequences (about 3 million tokens), chosen to counter the sparseness of the Chinese word distribution and keep training stable.
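To make the idea of a vocabulary that mixes words and characters concrete, here is a minimal sketch. It uses greedy longest-match segmentation with a single-character fallback, which is an illustrative stand-in and not a description of CPM's actual tokenizer. The function name `tokenize` and the toy vocabulary `TOY_VOCAB` are hypothetical.

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation with single-character fallback.

    Illustrative only: this is NOT CPM's tokenizer, just a way to show how
    one vocabulary can hold both multi-character words and single characters.
    """
    max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted, even if it is not
            # listed in the vocabulary (the character-level fallback).
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary mixing words and a single character.
TOY_VOCAB = {"清华", "大学", "清华大学", "语言", "模型", "的"}

print(tokenize("清华大学的语言模型", TOY_VOCAB))
# ['清华大学', '的', '语言', '模型']
print(tokenize("北京大学", TOY_VOCAB))
# ['北', '京', '大学']
```

The second call shows the fallback: "北京" is not in the toy vocabulary, so it is emitted character by character, while the known word "大学" stays whole.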

Experimental Results

The performance of CPM has been benchmarked across various tasks:

Text Classification: Across datasets such as TNEWS, IFLYTEK, and OCNLI, CPM demonstrated promising accuracy, especially in zero-shot settings, outperforming smaller-sized models and underscoring the advantage of larger parameter counts.
Chinese Idiom Cloze (ChID): The model's ability was benchmarked in both supervised and unsupervised settings. CPM-Large surpassed its smaller counterparts even in unsupervised scenarios, emphasizing the model's learned language proficiency.
Dialogue Generation: Using the Short-Text Conversation (STC) dataset, CPM was shown to achieve a higher diversity in generated responses compared to other state-of-the-art models, particularly when evaluated in few-shot settings.
Question Answering: Results on the CMRC2018 and DuReader benchmarks showed that the model struggles to generate precise answers without tuning, though scores improved in the one-shot setting.
Entity Generation: The model’s ability to generate accurate tail entities in the XLORE dataset was notable, especially in few-shot conditions, revealing its capacity for factual knowledge probing.
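The summary above does not spell out how a generative model is applied to classification or cloze tasks without fine-tuning. One common approach, assumed here for illustration, is to fill each candidate into a prompt and keep the candidate whose completed text the model finds least surprising (lowest perplexity). The sketch below shows that selection rule with a toy character-bigram model standing in for CPM. `CharBigramLM`, `pick_candidate`, and the example sentences are all hypothetical.

```python
import math
from collections import Counter


class CharBigramLM:
    """Toy add-one-smoothed character bigram model (a stand-in, not CPM)."""

    def __init__(self, corpus):
        self.contexts = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            chars = ["<s>"] + list(sentence)
            self.contexts.update(chars[:-1])
            self.bigrams.update(zip(chars[:-1], chars[1:]))
        # Distinct characters seen in training, plus one slot for unseen ones.
        self.vocab_size = len({c for s in corpus for c in s}) + 1

    def perplexity(self, text):
        if not text:
            raise ValueError("text must be non-empty")
        chars = ["<s>"] + list(text)
        log_prob = 0.0
        for prev, cur in zip(chars[:-1], chars[1:]):
            num = self.bigrams[(prev, cur)] + 1            # add-one smoothing
            den = self.contexts[prev] + self.vocab_size
            log_prob += math.log(num / den)
        # Normalize by the number of predicted characters.
        return math.exp(-log_prob / (len(chars) - 1))


def pick_candidate(lm, template, candidates):
    """Return the candidate whose filled-in template has the lowest perplexity."""
    return min(candidates, key=lambda c: lm.perplexity(template.format(c)))


# Hypothetical toy training sentences.
lm = CharBigramLM(["今天天气很好", "今天心情很好", "明天天气很好"])
print(pick_candidate(lm, "今天天气很{}", ["好", "坏"]))
# 好
```

With a real model, `perplexity` would be computed from the model's token log-probabilities. The selection logic in `pick_candidate` stays the same whether the candidates are class labels or idioms to fill a blank.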

Implications and Future Directions

CPM is a significant step towards better NLP tools for Chinese. Practical applications range from enhancing automated essay scoring systems to improving information retrieval algorithms in Chinese contexts. From a theoretical standpoint, CPM adds evidence on how model scale affects language model capabilities.

Looking forward, the authors suggest optimizations in pre-training frameworks to handle computational costs more effectively, including distributed training strategies and model compression techniques. Expansion plans include incorporating multi-lingual corpora to develop a comprehensive multi-lingual model, and integrating structured data like knowledge graphs for improved contextual language understanding.

In essence, CPM stands as a pivotal advancement in NLP for the Chinese language, setting a benchmark for future research endeavors aimed at non-English language applications.
