
Tele-FLM Technical Report

(arXiv:2404.16645)
Published Apr 25, 2024 in cs.CL and cs.AI

Abstract

LLMs have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies for efficiently scaling LLMs beyond 50 billion parameters with minimal trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpora. Moreover, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.

Figure: BPB curves comparing Tele-FLM with the Llama series on various linguistic and code validation datasets.

Overview

  • Tele-FLM is a 52-billion-parameter, open-sourced multilingual Large Language Model (LLM) built for scaling efficiency, with strong multilingual language-modeling ability achieved through careful training methodology and engineering optimizations.

  • The model is trained with 3D parallelism (data, tensor, and pipeline) across 896 Nvidia A800 GPUs and delivers strong results on English and Chinese benchmarks, competitive with larger open-source models.

  • Future plans include expanding Tele-FLM's capabilities and its application spectrum, emphasizing ongoing improvements in model scalability and performance across diverse language tasks.

Efficient Scaling of Multilingual LLMs: Introducing Tele-FLM (FLM-2)

Introduction

This paper introduces Tele-FLM, a 52-billion-parameter, open-sourced multilingual Large Language Model (LLM) that demonstrates efficient scaling and strong multilingual capabilities. A streamlined model-producing pipeline and μP-based hyperparameter search keep the trial-and-error cost and computational resources usually associated with training models at this scale under control.

Pre-training Details

Data Processing and Model Configurations:

  • The training dataset comprises texts from diverse domains, processed using a robust pipeline to ensure high-quality and uniform distribution, especially focusing on English and Chinese texts.
  • Modifications from its predecessor, FLM-101B, include refined normalization and activation functions, which contribute to stable training dynamics; a hedged sketch of typical choices follows below.
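
The report's exact layer definitions are not restated in this summary, but as an illustration of what "refined normalization and activation functions" typically means in this model family, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block. The class names and dimensions are illustrative, and the assumption that Tele-FLM uses exactly these variants is not confirmed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the features,
    with a learned gain and no mean-centering (cheaper than LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Tiny smoke test with illustrative sizes (not the 52B configuration).
x = torch.randn(2, 16, 512)
y = SwiGLU(512, 1376)(RMSNorm(512)(x))
print(y.shape)  # torch.Size([2, 16, 512])
```

RMSNorm drops the mean-centering of LayerNorm, and SwiGLU gates the feed-forward projection; both are widely adopted for training stability and quality in recent LLMs.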

Parallelism and Training Infrastructure:

  • Tele-FLM employs 3D parallel training, combining data, tensor, and pipeline parallelism to optimize computational resources across a cluster of 896 Nvidia A800 GPUs.
  • The utilization of advanced parallel training techniques facilitates efficient scaling and robust training dynamics, enabling the model to train with minimal restarts and computational waste; a sketch of how such a rank layout is organized follows below.
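
As a back-of-the-envelope sketch of how a 3D layout maps onto 896 GPUs, the snippet below factors a world size into tensor-, pipeline-, and data-parallel degrees and converts a flat rank into coordinates in that grid. The specific degrees (tensor 4, pipeline 2, data 112) are hypothetical round numbers chosen only because they multiply to 896; the report's actual split may differ.

```python
from dataclasses import dataclass

@dataclass
class Parallel3D:
    tensor: int    # intra-layer (tensor) model-parallel degree
    pipeline: int  # inter-layer (pipeline) parallel degree
    data: int      # data-parallel degree (number of model replicas)

    @property
    def world_size(self) -> int:
        return self.tensor * self.pipeline * self.data

    def coords(self, rank: int):
        """Map a flat GPU rank to (data, pipeline, tensor) coordinates,
        assuming ranks are laid out tensor-fastest, then pipeline, then data."""
        tp = rank % self.tensor
        pp = (rank // self.tensor) % self.pipeline
        dp = rank // (self.tensor * self.pipeline)
        return dp, pp, tp

# Hypothetical degrees that multiply to 896 A800s; not the reported configuration.
layout = Parallel3D(tensor=4, pipeline=2, data=112)
assert layout.world_size == 896
print(layout.coords(0), layout.coords(895))  # (0, 0, 0) (111, 1, 3)
```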

Performance and Evaluation

Benchmark Performance:

  • Tele-FLM achieves strong scores on both English and Chinese language-modeling benchmarks, reflected in low Bits-Per-Byte (BPB), a tokenizer-agnostic compression metric for which lower is better (see the conversion sketch after this list).
  • The model performs on par with or better than larger models such as Llama2-70B and Qwen1.5-72B on various datasets, substantiating its robust multilingual capabilities.
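
BPB can be computed from ordinary language-modeling loss: sum the per-token cross-entropy in nats over a held-out corpus, convert to bits by dividing by ln 2, and divide by the corpus size in UTF-8 bytes so the number is comparable across tokenizers. The sketch below shows that conversion; the corpus figures in it are invented for illustration.

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed token-level cross-entropy (in nats) over a corpus
    into bits-per-byte: divide by ln(2) to get bits, then by the number of
    UTF-8 bytes so the metric does not depend on the tokenizer."""
    return total_nll_nats / math.log(2) / num_bytes

# Illustrative numbers only: a 1,000,000-byte corpus tokenized into
# 250,000 tokens with an average per-token loss of 2.0 nats.
num_tokens, avg_loss_nats, num_bytes = 250_000, 2.0, 1_000_000
print(round(bits_per_byte(num_tokens * avg_loss_nats, num_bytes), 4))  # ~0.7213
```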

Evaluation Insights:

  • Detailed evaluation results highlight Tele-FLM's consistent performance across English and Chinese benchmarks.
  • It shows particular strength in tasks requiring in-depth language understanding and reasoning, as evidenced by its results on specialized benchmarks such as HumanEval (code generation) and Big-Bench Hard (multi-step reasoning).

Discussion and Implications

General Observations:

  • High-quality, diversified pre-training data significantly contributes to the model's comprehensive language understanding capabilities.
  • Effective hyperparameter tuning, in particular search based on μP (Maximal Update Parametrization), plays a crucial role in enhancing model performance and ensuring efficient scaling; a sketch of the width-transfer rule follows below.
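
The appeal of μP is that hyperparameters tuned on a narrow proxy model transfer to the full-width model through simple width-dependent rules, which is what keeps the search cheap. The helper below sketches the commonly cited Adam learning-rate rule (hidden-weight learning rates shrink in proportion to width, while vector-like parameters keep the proxy's rate); the base learning rate and widths are hypothetical, and Tele-FLM's exact transfer rules may differ.

```python
def mup_lr(base_lr: float, base_width: int, width: int, kind: str) -> float:
    """Scale a learning rate tuned on a small proxy model to a wider target
    under muP-style rules: matrix-like (hidden) weights get their LR shrunk
    by width / base_width, while vector-like parameters (embeddings, biases,
    norms) keep the proxy's LR. A simplified sketch, not the report's schedule."""
    if kind == "matrix":
        return base_lr * base_width / width
    if kind == "vector":
        return base_lr
    raise ValueError(f"unknown parameter kind: {kind}")

# Hypothetical example: a grid search done at hidden size 1024,
# transferred to a much wider target model.
base_lr, base_width, target_width = 6e-3, 1024, 8192
print(mup_lr(base_lr, base_width, target_width, "matrix"))  # 0.00075
print(mup_lr(base_lr, base_width, target_width, "vector"))  # 0.006
```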

Technical Insights:

  • Tele-FLM inherits and improves upon the low-carbon techniques and advanced pre-training objectives of the FLM family, ensuring an eco-friendly yet powerful modeling approach.
  • The provided documentation of model architecture, pre-training details, and training dynamics offers valuable insights for both academic research and practical applications in the AI community.

Future Directions

The authors plan to continue refining Tele-FLM's capabilities to broaden its application spectrum and improve its efficiency. Future developments may include exploring larger model scales and enhancing the model's adaptability across more diverse languages and tasks.

Conclusions

The introduction of Tele-FLM marks significant progress in the development of scalable and efficient LLMs. By offering detailed insights and open-sourcing the model, the paper makes a valuable contribution to ongoing research and development in AI and LLMs. Furthermore, the strategic improvements in model training and resource utilization point to a promising direction for future large-scale AI model development.
