YAYI 2: Multilingual Open-Source Large Language Models (2312.14862v1)
Abstract: As the latest advancement in natural language processing, large language models (LLMs) have achieved human-level language understanding and generation abilities in many real-world tasks, and have even been regarded as a potential path to artificial general intelligence. To better facilitate research on LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been proposed and achieve performance comparable to proprietary models. However, these models are primarily designed for English scenarios and perform poorly in Chinese contexts. In this technical report, we propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus containing 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback. Extensive experiments on multiple benchmarks, such as MMLU and CMMLU, consistently demonstrate that the proposed YAYI 2 outperforms other similarly sized open-source models.
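The abstract only summarizes the alignment pipeline, so as a loose, generic illustration of the supervised fine-tuning step, the sketch below computes the standard prompt-masked next-token loss on a toy causal LM. The toy model, vocabulary size, and token ids are placeholders for illustration only, not YAYI 2's actual architecture, tokenizer, or training data.

```python
# Minimal sketch of instruction-style supervised fine-tuning (SFT) loss masking.
# The tiny model, vocabulary size, and token ids are placeholders, not the
# 30B YAYI 2 model or its tokenizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 1000, 64

class ToyLM(nn.Module):
    """Toy causal LM standing in for the base model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))  # (batch, seq, vocab)

model = ToyLM()

# One instruction-response pair, already tokenized (placeholder ids).
prompt_ids = torch.tensor([[5, 17, 42, 8]])      # instruction tokens
response_ids = torch.tensor([[301, 77, 12, 2]])  # target response tokens
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# Labels: next-token targets, with prompt positions set to -100 so the loss
# is computed only on the response tokens.
labels = input_ids.clone()
labels[:, : prompt_ids.size(1)] = -100

logits = model(input_ids)
# Shift so that position t predicts token t+1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = labels[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
loss.backward()  # an optimizer step would follow in a real training loop
print(float(loss))
```

Masking the prompt positions with -100 restricts the cross-entropy to response tokens, a common convention for instruction tuning; the RLHF stage mentioned in the abstract would then build on such an SFT model with a separate reward model and a policy-optimization loop.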
- 01-AI. 2023. Yi: A series of large language models trained from scratch by developers at 01-ai. https://github.com/01-ai/Yi.
- Falcon-40B: An open large language model with state-of-the-art performance. https://huggingface.co/tiiuae/falcon-40b.
- PaLM 2 technical report.
- Program synthesis with large language models.
- Layer normalization.
- BAAI. 2023. Aquila2 series proposed by BAAI. https://github.com/FlagAI-Open/Aquila2.
- Qwen technical report.
- Constitutional AI: Harmlessness from AI feedback.
- Baichuan. 2023. A large-scale 7B pretraining language model developed by baichuan Inc. https://github.com/baichuan-inc/Baichuan-7B.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems.
- Evaluating large language models trained on code.
- Training verifiers to solve math word problems.
- Together Computer. 2023. RedPajama: An open dataset for training large language models. https://github.com/togethercomputer/RedPajama-Data.
- Efficient and effective text encoding for Chinese LLaMA and Alpaca.
- Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning.
- Measuring massive multitask language understanding. In International Conference on Learning Representations.
- Measuring mathematical problem solving with the MATH dataset. In Conference on Neural Information Processing Systems Track on Datasets and Benchmarks.
- C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models.
- InternLM. 2023. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM.
- Challenges and applications of large language models.
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.
- xFormers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers.
- CMMLU: Measuring massive multitask language understanding in Chinese.
- Let’s verify step by step.
- Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. In International Conference on Learning Representations.
- MosaicML. 2023. MPT-30B: Raising the bar for open-source foundation models. https://www.mosaicml.com/blog/mpt-30b.
- OpenCompass. 2023. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass.
- Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Workshop on the Challenges in the Management of Large Corpora.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only.
- YaRN: Efficient context window extension of large language models.
- Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations.
- ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
- Proximal policy optimization algorithms.
- Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need.
- Noam Shazeer. 2020. GLU variants improve transformer.
- SlimPajama-DC: Understanding data combinations for LLM training.
- Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University.
- RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063.
- Challenging BIG-Bench tasks and whether chain-of-thought can solve them.
- LLaMA: Open and efficient foundation language models.
- Llama 2: Open foundation and fine-tuned chat models.
- Attention is all you need. In Advances in Neural Information Processing Systems.
- BLOOM: A 176B-parameter open-access multilingual language model.
- XVERSE. 2023. XVERSE-13B: A multilingual large language model developed by XVERSE Technology Inc. https://github.com/xverse-ai/XVERSE-13B.
- Baichuan 2: Open large-scale language models.
- GLM-130B: An open bilingual pre-trained model. In International Conference on Learning Representations.
- Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. In Advances in Neural Information Processing Systems.
- Evaluating the performance of large language models on GAOKAO benchmark.
- AGIEval: A human-centric benchmark for evaluating foundation models.