Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model (2404.04167v5)
Abstract: In this study, we introduce CT-LLM, a 2B-parameter LLM that marks a pivotal shift towards prioritizing the Chinese language in developing LLMs. Trained from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, drawing on an extensive corpus of 1,200 billion tokens: 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition gives the model strong proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. CT-LLM performs strongly on CHC-Bench, excelling in Chinese-language tasks, and demonstrates its adeptness in English after supervised fine-tuning (SFT). This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data-processing pipeline, the resulting Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a carefully curated multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-parameter Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile LLMs.
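The corpus composition above implies a roughly 8:3:1 split across Chinese, English, and code data. As a minimal sketch, assuming the sampling mixture is simply proportional to the stated token counts (the paper's actual data schedule and tooling are not reproduced here, and the names below are illustrative), the per-source weights work out as follows:

```python
# Minimal sketch (assumption): per-source sampling weights proportional to the
# token counts stated in the abstract. Source names and structure are illustrative.

CORPUS_TOKENS_B = {
    "chinese": 800,  # billions of Chinese tokens
    "english": 300,  # billions of English tokens
    "code": 100,     # billions of code tokens
}

total_b = sum(CORPUS_TOKENS_B.values())  # 1,200B tokens overall
weights = {src: n / total_b for src, n in CORPUS_TOKENS_B.items()}

for src, n in CORPUS_TOKENS_B.items():
    print(f"{src:>7s}: {n:4d}B tokens -> {weights[src]:5.1%} of the mix")
# chinese:  800B tokens -> 66.7% of the mix
# english:  300B tokens -> 25.0% of the mix
#    code:  100B tokens ->  8.3% of the mix
```

This only illustrates the reported 800/300/100 billion-token ratio; the released training recipe should be consulted for the actual sampling schedule.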