MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (2404.06395v3)
Abstract: The burgeoning interest in developing LLMs with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically its 1.2B and 2.4B non-embedding-parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur with the WSD LRS. With the WSD LRS, we can efficiently study the data-model scaling law without extensive retraining experiments along both the model and data axes, from which we derive a much higher compute-optimal data-to-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are publicly available at https://github.com/OpenBMB/MiniCPM.
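The WSD scheduler described above proceeds in three phases: a warmup ramp, a long stable phase at the peak learning rate, and a short final decay. Below is a minimal illustrative sketch of such a schedule; the linear warmup and linear decay shapes, along with the function and parameter names (`wsd_lr`, `warmup_steps`, `stable_steps`, `decay_steps`), are assumptions chosen for clarity rather than the paper's exact formulation.

```python
# Illustrative sketch of a Warmup-Stable-Decay (WSD) learning rate schedule.
# NOTE: the three-phase structure follows the abstract; the specific warmup
# and decay shapes below are simplifying assumptions, not the paper's exact
# formulation.

def wsd_lr(step: int, max_lr: float, warmup_steps: int,
           stable_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Return the learning rate at a given training step."""
    if step < warmup_steps:                       # warmup: ramp up to max_lr
        return max_lr * step / max(warmup_steps, 1)
    if step < warmup_steps + stable_steps:        # stable: hold max_lr
        return max_lr
    # decay: anneal from max_lr toward min_lr
    progress = (step - warmup_steps - stable_steps) / max(decay_steps, 1)
    progress = min(progress, 1.0)
    return max_lr + (min_lr - max_lr) * progress


if __name__ == "__main__":
    # Example: 1,000 warmup steps, 8,000 stable steps, 1,000 decay steps.
    for s in (0, 500, 1000, 5000, 9500, 10000):
        print(s, wsd_lr(s, max_lr=1e-2, warmup_steps=1000,
                        stable_steps=8000, decay_steps=1000))
```

Because the learning rate is held constant throughout the stable phase, training can be continued for as long as desired and a decay phase launched whenever a finished model is needed, which is what makes the schedule conducive to continuous training, domain adaptation, and studying scaling laws without repeated retraining from scratch.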
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pp. 265–279. PMLR, 2023.
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.298. URL https://aclanthology.org/2023.emnlp-main.298.
- The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.
- A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
- Program synthesis with large language models, 2021.
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- Gemma: Introducing new state-of-the-art open models. https://blog.google/technology/developers/gemma-open-models/, 2024.
- Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
- Datasheet for the pile. arXiv preprint arXiv:2201.07311, 2022.
- bloc97. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/, 2023.
- Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp. 21–29. IEEE, 1997.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
- Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Ultrafeedback: Boosting language models with high-quality feedback, 2023.
- Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208, 2023.
- Enhancing chat language models by scaling high-quality instructional conversations, 2023.
- Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
- Understanding emergent abilities of language models from the loss perspective. arXiv preprint arXiv:2403.15796, 2024.
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
- GPTQ: accurate post-training quantization for generative pre-trained transformers. CoRR, abs/2210.17323, 2022. doi: 10.48550/ARXIV.2210.17323. URL https://doi.org/10.48550/arXiv.2210.17323.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Query-key normalization for transformers. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4246–4253, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.379. URL https://aclanthology.org/2020.findings-emnlp.379.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
- Unlock predictable scaling from emergent abilities. arXiv preprint arXiv:2310.03262, 2023.
- C-eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36, 2024.
- Sharpdarts: Faster and more accurate differentiable architecture search. arXiv preprint arXiv:1903.09900, 2019.
- Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2023.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- The stack: 3 TB of permissively licensed source code. Preprint, 2022.
- Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055, 2022.
- Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626, 2023.
- Cmmlu: Measuring massive multitask language understanding in Chinese, 2024.
- Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023a.
- Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023b.
- Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023a. URL https://huggingface.co/Open-Orca/SlimOrca.
- Slimorca dedup: A deduplicated subset of slimorca, 2023b. URL https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup/.
- Mobilellm: Optimizing sub-billion parameter language models for on-device use cases. arXiv preprint arXiv:2402.14905, 2024.
- Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
- Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
- Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
- Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. arXiv preprint arXiv:2401.00448, 2023.
- Neural machine translation of rare words with subword units. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162.
- In-context pretraining: Language modeling beyond document boundaries. arXiv preprint arXiv:2310.10638, 2023.
- Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
- Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv preprint, 2024. URL https://arxiv.org/abs/2402.00159.
- Zebra: Extending context window with layerwise grouped local-global attention. arXiv preprint arXiv:2312.08618, 2023.
- Challenging big-bench tasks and whether chain-of-thought can solve them, 2022.
- LLMFarm team. LLMFarm, 2023a. URL https://github.com/guinmoon/LLMFarm.
- MLC team. MLC-LLM, 2023b. URL https://github.com/mlc-ai/mlc-llm.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Zephyr: Direct distillation of lm alignment, 2023.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120, 2023.
- Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023.
- Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
- Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2024.
- Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.
- Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
- Tensor programs vi: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023.
- Data mixing laws: Optimizing data mixtures by predicting language modeling performance. arXiv preprint arXiv:2403.16952, 2024.
- Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.
- Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024a.
- Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
- ∞Bench: Extending long context evaluation beyond 100k tokens. arXiv preprint arXiv:2402.13718, 2024b.
- Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.