
GEB-1.3B: Open Lightweight Large Language Model

(arXiv: 2406.09900)
Published Jun 14, 2024 in cs.CL

Abstract

Recently developed LLMs such as ChatGPT, Claude, and Llama have demonstrated impressive abilities, and even surpass human-level performance in several tasks. Despite their success, the resource-intensive demands of these models, requiring significant computational power for both training and inference, limit their deployment to high-performance servers. Additionally, the extensive calculation requirements of the models often lead to increased latency in response times. With the increasing need for LLMs to operate efficiently on CPUs, research about lightweight models that are optimized for CPU inference has emerged. In this work, we introduce GEB-1.3B, a lightweight LLM trained on 550 billion tokens in both Chinese and English languages. We employ novel training techniques, including ROPE, Group-Query-Attention, and FlashAttention-2, to accelerate training while maintaining model performance. Additionally, we fine-tune the model using 10 million samples of instruction data to enhance alignment. GEB-1.3B exhibits outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU, outperforming comparative models such as MindLLM-1.3B and TinyLLaMA-1.1B. Notably, the FP32 version of GEB-1.3B achieves commendable inference times on CPUs, with ongoing efforts to further enhance speed through advanced quantization techniques. The release of GEB-1.3B as an open-source model marks a significant contribution to the development of lightweight LLMs, promising to foster further research and innovation in the field.

Figure: Comparison of loss curves before and after implementing the four training-stability measures.

Overview

  • The paper introduces GEB-1.3B, a lightweight large language model designed for efficient CPU usage, enabling deployment on devices like laptops and smartphones.

  • GEB-1.3B has 1.3 billion parameters and is trained on a diverse bilingual corpus using techniques such as RoPE and Group-Query-Attention (GQA); it is aligned with Supervised Fine-Tuning and Direct Preference Optimization.

  • The model outperforms several comparable LLMs and demonstrates practical benefits in terms of reduced toxicity and efficient CPU inference, with future improvements focusing on quantization techniques to enhance speed.

Overview of GEB-1.3B: Open Lightweight Large Language Model

The paper presents GEB-1.3B, a lightweight large language model designed to run efficiently on CPUs. Unlike many contemporary LLMs, which are resource-intensive and typically run on high-performance servers, GEB-1.3B can be deployed on more accessible devices such as laptops and smartphones. This article analyzes the paper's methodology, results, and implications.

Model Architecture and Training Techniques

GEB-1.3B is engineered with 1.3 billion parameters and trained on a diverse corpus of 550 billion tokens spanning Chinese and English. Training employs several advanced techniques, including RoPE, Group-Query-Attention (GQA), and FlashAttention-2, to speed up training while preserving model performance.
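For readers unfamiliar with these components, here is a minimal PyTorch sketch of rotary positional embeddings (the half-split variant used in GPT-NeoX-style implementations); the tensor shapes and the helper name apply_rope are illustrative and not taken from the GEB-1.3B codebase.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional embedding for a tensor of shape
    (batch, seq_len, n_heads, head_dim). Illustrative sketch only."""
    b, t, h, d = x.shape
    half = d // 2
    # One frequency per channel pair, decaying geometrically with the pair index.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).type_as(x)

# Example: rotate queries (and, identically, keys) before attention.
q = torch.randn(2, 16, 8, 64)       # (batch, seq, heads, head_dim)
q_rot = apply_rope(q)
```

RoPE encodes absolute positions as rotations of query/key channel pairs, so relative offsets emerge from the attention dot product without a learned position-embedding table.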

Key architectural decisions include the tokenizer from the ChatGLM-3 model with a relatively compact vocabulary of 64,896 entries, untied input and output embeddings, and Rotary Positional Embedding (RoPE) for positional encoding. The transformer blocks adopt GQA instead of traditional multi-head attention, improving inference speed without sacrificing accuracy. The model uses SwiGLU as the activation function and Post-RMSNorm for normalization to aid convergence.
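To make the GQA trade-off concrete, the sketch below has several query heads share each key/value head, shrinking the key/value projections and cache; the head counts and model width are illustrative placeholders rather than GEB-1.3B's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal grouped-query attention: n_q_heads query heads share
    n_kv_heads key/value heads (n_q_heads % n_kv_heads == 0).
    Head counts here are illustrative, not GEB-1.3B's actual config."""
    def __init__(self, d_model: int, n_q_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_q_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        # Each group of query heads attends to the same shared K/V head.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 512)
attn = GroupedQueryAttention(d_model=512, n_q_heads=8, n_kv_heads=2)
y = attn(x)   # (2, 16, 512)
```

With 8 query heads sharing 2 key/value heads, the K/V projections and KV cache are 4x smaller than in standard multi-head attention, which is where the inference-speed benefit comes from.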

The training infrastructure uses BFloat16 mixed precision on 64 NVIDIA RTX 3090 Ti GPUs with a global batch size of 320 samples. Four stability measures, Batch Sample Replacement, Iteration Skipping, Embedding Layer Gradient Shrinkage, and Learning Rate Adjustment, were implemented to mitigate the training instability caused by the small batch size.
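As a rough illustration of one of these measures, the sketch below implements embedding-layer gradient shrinkage (a stabilization trick also used in GLM-130B) by scaling only the gradient that flows back into the token embedding; the shrink factor alpha and the hidden size are assumed values, not figures reported in the paper.

```python
import torch
import torch.nn as nn

class ShrunkEmbedding(nn.Module):
    """Token embedding whose gradient is scaled by `alpha` during backprop
    while the forward output is unchanged (the shrink trick also used in
    GLM-130B). alpha and d_model here are assumed values, not the paper's."""
    def __init__(self, vocab_size: int, d_model: int, alpha: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.alpha = alpha

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(token_ids)
        # Forward value equals h; only alpha * h carries gradient to the weights.
        return h * self.alpha + h.detach() * (1.0 - self.alpha)

emb = ShrunkEmbedding(vocab_size=64896, d_model=2048, alpha=0.1)
hidden = emb(torch.randint(0, 64896, (2, 16)))   # (2, 16, 2048)
```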

Alignment Techniques

GEB-1.3B is aligned with human conversational norms through a combination of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Approximately 16 million instruction instances covering a wide spectrum of benign and sensitive topics were used for SFT. DPO then further refines the model, teaching it to avoid harmful or unethical responses despite a relatively small preference dataset of 10,000 samples.
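For context, the standard DPO objective fits in a few lines. The sketch below assumes per-response summed log-probabilities from the policy and from a frozen reference model; beta = 0.1 is an assumed placeholder rather than a setting reported in the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization: push the policy to prefer the chosen
    response over the rejected one, relative to a frozen reference model.
    Inputs are per-example summed log-probabilities of each full response;
    beta = 0.1 is an assumed placeholder, not a value from the paper."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Example with dummy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```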

Evaluation and Performance

GEB-1.3B was evaluated against several well-known LLMs, including Llama2-7B, Baichuan-7B, Falcon-7B, MPT-7B, and ChatGLM-6B, on benchmarks such as MMLU, C-Eval, and CMMLU. It significantly outperforms Llama-7B and also surpasses similar-sized models such as MindLLM-1.3B and TinyLLaMA-1.1B, achieving notably higher scores on the Chinese benchmarks while maintaining solid English capabilities.

In terms of toxicity, GEB-1.3B generates less toxic content compared to larger models such as Falcon-7B and Llama2-7B, as demonstrated by its lower scores on the ToxiGen dataset. Furthermore, the model exhibits efficient CPU inference with a rate of 12 tokens per second in its FP32 version, indicating practical applicability on edge devices.
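Some back-of-envelope arithmetic shows what these figures mean for edge deployment; apart from the 1.3 billion parameter count and the reported 12 tokens/s, the numbers below (reply length, byte widths) are assumptions for illustration.

```python
# Back-of-envelope deployment figures (simple arithmetic; only the 1.3B
# parameter count and 12 tokens/s come from the summary above).
params = 1.3e9
print(f"FP32 weights: ~{params * 4 / 1e9:.1f} GB")   # ~5.2 GB
print(f"INT8 weights: ~{params * 1 / 1e9:.1f} GB")   # ~1.3 GB

tokens_per_second = 12        # reported FP32 CPU throughput
reply_tokens = 256            # assumed typical reply length
print(f"~{reply_tokens / tokens_per_second:.0f} s for a {reply_tokens}-token reply")  # ~21 s
```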

Implications and Future Developments

The release of GEB-1.3B as an open-source model is a substantial contribution to the field of lightweight LLMs. It serves as a robust alternative for scenarios where deploying resource-heavy models is impractical. Its superior performance on CPUs opens possibilities for wider application, particularly in mobile and low-computation environments.

Future work will focus on enhancing GEB-1.3B’s inference speed through advanced quantization techniques. This may yield even faster performance, further bolstering the model's utility on ubiquitous computing devices.
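As an illustration of what weight quantization involves, the sketch below applies generic symmetric per-channel INT8 quantization to a weight matrix; this is not the authors' quantization pipeline, whose details are not covered in this summary.

```python
import torch

def quantize_int8_per_channel(w: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a 2-D weight matrix.
    Generic sketch; not the paper's actual quantization scheme."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0  # one scale per row
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(2048, 2048)
q, s = quantize_int8_per_channel(w)
err = (dequantize(q, s) - w).abs().mean().item()
print(f"mean abs reconstruction error: {err:.4f}")
```

Per-channel scales keep high-magnitude rows from dominating the quantization error, which is why this scheme is a common starting point for CPU inference.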

Conclusion

GEB-1.3B stands out as an efficient and effective lightweight large language model, suitable for deployment on CPUs. The model sets a benchmark for the development of smaller yet potent LLMs, providing a pathway toward more accessible and widely usable AI technology. While it showcases remarkable performance across various benchmarks and tasks, ongoing refinement and user vigilance remain crucial in addressing the typical limitations inherent to LLMs.
