Abstract

The burgeoning interest in developing LLMs with up to a trillion parameters has been met with concerns about resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically its 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both the model and data dimensions for future LLM research. For model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur with the WSD LRS. With the WSD LRS, we can now efficiently study the data-model scaling law without extensive retraining experiments along both the model and data axes, from which we derive a compute-optimal data-model ratio much higher than the Chinchilla-optimal one. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation for diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM .

Figure: MiniCPM-DPO-2.4B outperforms larger models on MT-Bench scores.

Overview

  • The paper introduces MiniCPM, a family of Small Language Models (SLMs), showcasing their competitive performance against larger models through scalable training strategies.

  • It details the Model Wind Tunnel Experiment (MWTE) method for assessing the scalability and stability of SLMs, contributing valuable insights towards the development of Large Language Models (LLMs).

  • A novel Warmup-Stable-Decay Learning Rate Scheduler (WSD LRS) is introduced for effective training dynamics, reducing the computational effort typically required in data-model scaling law studies.

  • The MiniCPM family is highlighted for its diversity and scalability, demonstrating SLMs' robustness and adaptability across a range of AI tasks, prompting a reevaluation of the focus on larger models.

MiniCPM: Demonstrating the Efficiency and Scalability of Small Language Models

Introduction

The paper "MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies" explore the realm of Small Language Models (SLMs) as an alternative to the more commonly discussed LLMs. The authors bring to light the significant capabilities of MiniCPM, a family of models particularly the 1.2B and 2.4B non-embedding variant models, asserting their remarkable performance, which competes with larger counterparts ranging from 7B to 13B parameters. This study emphasizes a scalable approach in training strategies, which can be beneficial for both model and data dimensions, setting a potential pathway for future research into larger models.

Model Wind Tunnel Experiment (MWTE)

The paper introduces the concept of Model Wind Tunnel Experiments (MWTE), aimed at exploring the limits of SLMs before transitioning learned insights to LLMs. The MWTE comprises extensive hyper-parameter optimization, optimal batch-size scaling, and learning rate stability, among other factors. Such comprehensive testing, inspired by aerodynamic wind tunnel testing, is crucial for understanding the scalability and stability of SLMs, thereby informing the development strategy for larger models.
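To make the wind-tunnel idea concrete, the sketch below shows one way small-scale measurements might be distilled into a scaling rule, here by fitting a power law that relates a compute-efficient batch size to training loss from a handful of small runs. The data points, the functional form, and the predict_batch_size helper are illustrative assumptions for this summary, not the paper's actual measurements or method.

```python
import numpy as np

# Hypothetical "wind tunnel" measurements from small-scale runs:
# for each run, the training loss reached and the batch size (in tokens)
# that was most compute-efficient at that loss. Numbers are illustrative only.
losses = np.array([3.5, 3.2, 3.0, 2.8, 2.6])
optimal_bs = np.array([0.5e6, 1.1e6, 2.0e6, 3.8e6, 7.5e6])

# Assume a power-law form bs = a * loss^(-b) and fit it in log space.
slope, log_a = np.polyfit(np.log(losses), np.log(optimal_bs), deg=1)
b = -slope

def predict_batch_size(target_loss: float) -> float:
    """Extrapolate the fitted rule to a larger-scale target loss."""
    return float(np.exp(log_a) * target_loss ** (-b))

print(f"fitted exponent b = {b:.2f}")
print(f"suggested batch size at loss 2.3: {predict_batch_size(2.3):,.0f} tokens")
```

The point of such a fit is that it can be extrapolated to the target training regime without rerunning expensive experiments at full scale.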

Warmup-Stable-Decay Learning Rate Scheduler (WSD LRS)

One of the notable contributions of this research is the WSD learning rate scheduler, which is conducive to continuous training and domain adaptation. The scheduler exhibits distinctive training dynamics, particularly during the decay phase, where a sharp decrease in loss is observed. This insight can drastically reduce the effort required to study data-model scaling laws, providing an efficient alternative to traditionally compute-intensive approaches. Furthermore, the WSD LRS surfaces training dynamics that are not captured by common learning rate schedules.
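A minimal sketch of a Warmup-Stable-Decay schedule is shown below, following the three phases described above: a linear warmup, a constant plateau, and a rapid final decay. The specific phase lengths, the peak learning rate, and the exponential decay form are assumptions chosen for illustration, not the paper's exact configuration.

```python
def wsd_lr(step: int,
           peak_lr: float = 1e-2,
           warmup_steps: int = 2_000,
           stable_steps: int = 80_000,
           decay_steps: int = 8_000,
           final_lr_ratio: float = 0.1) -> float:
    """Warmup-Stable-Decay (WSD) learning rate schedule (illustrative sketch).

    - Warmup: learning rate rises linearly from 0 to peak_lr.
    - Stable: learning rate is held constant at peak_lr, which allows training
      to be continued or branched without re-planning the whole schedule.
    - Decay: learning rate drops quickly (exponentially here) toward
      final_lr_ratio * peak_lr; this is the phase where the sharp loss
      decrease is observed.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return peak_lr * (final_lr_ratio ** progress)

# Example: sample the schedule at a few points across the three phases.
for s in (0, 1_000, 50_000, 84_000, 90_000):
    print(s, f"{wsd_lr(s):.4e}")
```

Because the stable phase uses a constant learning rate, one checkpoint from it can be branched into multiple short decay runs, which is what makes reusing a single training trajectory for scaling-law studies or domain adaptation cheap.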

MiniCPM Family: Diverse Applications and Scalability

The introduction of the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE, and MiniCPM-128K, exemplifies the diversity and scalability of SLMs. Each variant targets a different application area or technical challenge, from preference alignment with DPO to mixture-of-experts scaling and long-context tasks. This diversity not only demonstrates the robustness of MiniCPM models but also their adaptability to a wide range of AI tasks, further reinforcing the potential of SLMs in practical applications.

Implications and Future Directions

This research underlines a critical consideration in the AI field: the importance of exploring efficient and scalable training strategies for SLMs. The demonstrated efficiency of MiniCPM models suggests a reevaluation of the current focus on exponentially growing LLMs, advocating for a scientific and sustainable model scaling approach. Moreover, the successful application of WSD LRS introduces a promising direction for optimizing training strategies, potentially impacting future developments in both SLMs and LLMs.

Conclusion

The paper "MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies" accentuates the untapped potential of SLMs for achieving remarkable performance on par with LLMs, highlighting the significance of efficient training methodologies. The scalability demonstrated through various MiniCPM variants suggests a broad applicability of SLMs, further advocating for their utility in research and practical deployments. This work paves the way for future explorations into more sustainable, efficient, and scientifically grounded approaches to model training and scaling within the AI community.
