Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca (2304.08177v3)
Abstract: LLMs, such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards AGI. Nonetheless, the high costs of training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several LLMs, such as LLaMA, have been open-sourced by the community, they predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with the ability to understand and generate Chinese text and to follow instructions. We achieve this by extending LLaMA's existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate secondary pre-training on Chinese data and fine-tune the model with Chinese instruction datasets, significantly enhancing its ability to comprehend and execute instructions. Our experimental results indicate that the proposed model markedly improves the original LLaMA's proficiency in understanding and generating Chinese content. Additionally, on the C-Eval benchmark our model achieves performance competitive with models several times its size. We have made our pre-trained models, training scripts, and other resources available through GitHub, fostering open research for our community. Chinese LLaMA series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca} and Chinese Llama-2 series: \url{https://github.com/ymcui/Chinese-LLaMA-Alpaca-2}
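To make the vocabulary-extension step concrete, the sketch below merges a separately trained Chinese SentencePiece model into LLaMA's tokenizer and resizes the model's embedding matrix to match. It is a minimal illustration under assumed tooling (Hugging Face transformers plus the sentencepiece protobuf bindings), not the authors' released merging script; the model path and the Chinese SentencePiece file are placeholders.

```python
# Minimal sketch of vocabulary extension: merge pieces from a separately
# trained Chinese SentencePiece model into LLaMA's tokenizer, then resize
# the model's embeddings to the enlarged vocabulary.
# "path/to/llama" and "chinese_sp.model" are placeholders, not released artifacts.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer, LlamaForCausalLM

# Load the original LLaMA tokenizer and a Chinese SentencePiece model.
llama_tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")
chinese_sp = spm.SentencePieceProcessor(model_file="chinese_sp.model")

# Deserialize both SentencePiece models so pieces can be compared and appended.
llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_proto = sp_pb2.ModelProto()
chinese_proto.ParseFromString(chinese_sp.serialized_model_proto())

# Append Chinese pieces that the original LLaMA vocabulary does not contain.
existing = {p.piece for p in llama_proto.pieces}
for piece in chinese_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

# Save the merged SentencePiece model and wrap it as a LLaMA tokenizer.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
merged_tokenizer = LlamaTokenizer(vocab_file="merged_tokenizer.model")

# Resize the embedding matrix (and tied output head) to the new vocabulary size;
# the new rows are randomly initialized and learned during secondary pre-training.
model = LlamaForCausalLM.from_pretrained("path/to/llama")
model.resize_token_embeddings(len(merged_tokenizer))
```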
- Revisiting pre-trained models for Chinese natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 657–668, Online, November 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.findings-emnlp.58.
- Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3504–3514, 2021. doi: 10.1109/TASLP.2021.3124365.
- LERT: A linguistically-motivated pre-trained language model. arXiv preprint arXiv:2211.05344, 2022.
- QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N19-1423.
- GPTQ: Accurate post-training compression for generative pretrained transformers. arXiv preprint arXiv:2210.17323, 2022.
- Georgi Gerganov. llama.cpp. https://github.com/ggerganov/llama.cpp, 2023.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.
- OpenAssistant Conversations – Democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. URL https://aclanthology.org/D18-2012.
- RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1082. URL https://aclanthology.org/D17-1082.
- Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.
- OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- Improving language understanding by generative pre-training. OpenAI technical report, 2018.
- DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506, 2020.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
- Noam Shazeer. GLU variants improve transformer, 2020.
- RoFormer: Enhanced transformer with rotary position embedding, 2021.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- Bright Xu. NLP Chinese corpus: Large scale Chinese corpus for NLP, September 2019. URL https://doi.org/10.5281/zenodo.3402023.
- CINO: A Chinese minority pre-trained language model. In Proceedings of the 29th International Conference on Computational Linguistics, pp. 3937–3949, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.346.
- GLM-130b: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=-Aw0rrrPUF.
- Root Mean Square Layer Normalization. In Advances in Neural Information Processing Systems 32, Vancouver, Canada, 2019. URL https://openreview.net/references/pdf?id=S1qBAf6rr.