
Tele-FLM Technical Report

(arXiv:2404.16645)
Published Apr 25, 2024 in cs.CL and cs.AI

Abstract

LLMs have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies for efficiently scaling LLMs beyond 50 billion parameters with minimal trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpora. Moreover, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.

Figure: BPB curves comparing Tele-FLM with the Llama series on various linguistic and code validation datasets.

Overview

  • Tele-FLM is a 52-billion-parameter, open-sourced multilingual Large Language Model (LLM) built for scaling efficiency, with strong multilingual language-modeling ability achieved through careful training methodology and engineering optimizations.

  • The model is trained with 3D parallelism (data, tensor, and pipeline) across 896 Nvidia A800 GPUs and delivers strong results on English and Chinese benchmarks, competitive with larger open-source models.

  • Future plans include expanding Tele-FLM's capabilities and its application spectrum, emphasizing ongoing improvements in model scalability and performance across diverse language tasks.

Efficient Scaling of Multilingual LLMs: Introducing Tele-FLM (FLM-2)

Introduction

This paper introduces Tele-FLM, a 52-billion-parameter, open-sourced multilingual Large Language Model (LLM) that demonstrates efficient scaling and strong multilingual capabilities. A streamlined model-producing pipeline and μP-based hyperparameter search keep the trial-and-error cost and computational resources usually associated with training models at this scale under control.

Pre-training Details

Data Processing and Model Configurations:

  • The training dataset comprises texts from diverse domains, processed using a robust pipeline to ensure high-quality and uniform distribution, especially focusing on English and Chinese texts.
  • Modifications from its predecessor, FLM-101B, include refined normalization and activation functions, which contribute to stable training dynamics; a hedged sketch of typical choices follows below.
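
The report's exact layer definitions are not restated in this summary, but as an illustration of what "refined normalization and activation functions" typically means in this model family, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block. The class names and dimensions are illustrative, and the assumption that Tele-FLM uses exactly these variants is not confirmed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the features,
    with a learned gain and no mean-centering (cheaper than LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), projected back down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Tiny smoke test with illustrative sizes (not the 52B configuration).
x = torch.randn(2, 16, 512)
y = SwiGLU(512, 1376)(RMSNorm(512)(x))
print(y.shape)  # torch.Size([2, 16, 512])
```

RMSNorm drops the mean-centering of LayerNorm, and SwiGLU gates the feed-forward projection; both are widely adopted for training stability and quality in recent LLMs.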

Parallelism and Training Infrastructure:

  • Tele-FLM employs 3D parallel training, combining data, tensor, and pipeline parallelism to optimize computational resources across a cluster of 896 Nvidia A800 GPUs.
  • The utilization of advanced parallel training techniques facilitates efficient scaling and robust training dynamics, enabling the model to train with minimal restarts and computational waste; a sketch of how such a rank layout is organized follows below.
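
As a back-of-the-envelope sketch of how a 3D layout maps onto 896 GPUs, the snippet below factors a world size into tensor-, pipeline-, and data-parallel degrees and converts a flat rank into coordinates in that grid. The specific degrees (tensor 4, pipeline 2, data 112) are hypothetical round numbers chosen only because they multiply to 896; the report's actual split may differ.

```python
from dataclasses import dataclass

@dataclass
class Parallel3D:
    tensor: int    # intra-layer (tensor) model-parallel degree
    pipeline: int  # inter-layer (pipeline) parallel degree
    data: int      # data-parallel degree (number of model replicas)

    @property
    def world_size(self) -> int:
        return self.tensor * self.pipeline * self.data

    def coords(self, rank: int):
        """Map a flat GPU rank to (data, pipeline, tensor) coordinates,
        assuming ranks are laid out tensor-fastest, then pipeline, then data."""
        tp = rank % self.tensor
        pp = (rank // self.tensor) % self.pipeline
        dp = rank // (self.tensor * self.pipeline)
        return dp, pp, tp

# Hypothetical degrees that multiply to 896 A800s; not the reported configuration.
layout = Parallel3D(tensor=4, pipeline=2, data=112)
assert layout.world_size == 896
print(layout.coords(0), layout.coords(895))  # (0, 0, 0) (111, 1, 3)
```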

Performance and Evaluation

Benchmark Performance:

  • Tele-FLM achieves strong scores on both English and Chinese language-modeling benchmarks, reflected in low Bits-Per-Byte (BPB), a tokenizer-agnostic compression metric for which lower is better (see the conversion sketch after this list).
  • The model performs on par with or better than larger models such as Llama2-70B and Qwen1.5-72B on various datasets, substantiating its robust multilingual capabilities.
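
BPB can be computed from ordinary language-modeling loss: sum the per-token cross-entropy in nats over a held-out corpus, convert to bits by dividing by ln 2, and divide by the corpus size in UTF-8 bytes so the number is comparable across tokenizers. The sketch below shows that conversion; the corpus figures in it are invented for illustration.

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed token-level cross-entropy (in nats) over a corpus
    into bits-per-byte: divide by ln(2) to get bits, then by the number of
    UTF-8 bytes so the metric does not depend on the tokenizer."""
    return total_nll_nats / math.log(2) / num_bytes

# Illustrative numbers only: a 1,000,000-byte corpus tokenized into
# 250,000 tokens with an average per-token loss of 2.0 nats.
num_tokens, avg_loss_nats, num_bytes = 250_000, 2.0, 1_000_000
print(round(bits_per_byte(num_tokens * avg_loss_nats, num_bytes), 4))  # ~0.7213
```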

Evaluation Insights:

  • Detailed evaluation results highlight Tele-FLM's consistent performance across English and Chinese benchmarks.
  • It shows particular strength in tasks requiring in-depth language understanding and reasoning, as evidenced by its results on specialized benchmarks such as HumanEval (code generation) and Big-Bench Hard (multi-step reasoning).

Discussion and Implications

General Observations:

  • High-quality, diversified pre-training data significantly contributes to the model's comprehensive language understanding capabilities.
  • Effective hyperparameter tuning, in particular search based on μP (Maximal Update Parametrization), plays a crucial role in enhancing model performance and ensuring efficient scaling; a sketch of the width-transfer rule follows below.
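
The appeal of μP is that hyperparameters tuned on a narrow proxy model transfer to the full-width model through simple width-dependent rules, which is what keeps the search cheap. The helper below sketches the commonly cited Adam learning-rate rule (hidden-weight learning rates shrink in proportion to width, while vector-like parameters keep the proxy's rate); the base learning rate and widths are hypothetical, and Tele-FLM's exact transfer rules may differ.

```python
def mup_lr(base_lr: float, base_width: int, width: int, kind: str) -> float:
    """Scale a learning rate tuned on a small proxy model to a wider target
    under muP-style rules: matrix-like (hidden) weights get their LR shrunk
    by width / base_width, while vector-like parameters (embeddings, biases,
    norms) keep the proxy's LR. A simplified sketch, not the report's schedule."""
    if kind == "matrix":
        return base_lr * base_width / width
    if kind == "vector":
        return base_lr
    raise ValueError(f"unknown parameter kind: {kind}")

# Hypothetical example: a grid search done at hidden size 1024,
# transferred to a much wider target model.
base_lr, base_width, target_width = 6e-3, 1024, 8192
print(mup_lr(base_lr, base_width, target_width, "matrix"))  # 0.00075
print(mup_lr(base_lr, base_width, target_width, "vector"))  # 0.006
```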

Technical Insights:

  • Tele-FLM inherits and improves upon the low-carbon techniques and advanced pre-training objectives of the FLM family, ensuring an eco-friendly yet powerful modeling approach.
  • The provided documentation of model architecture, pre-training details, and training dynamics offers valuable insights for both academic research and practical applications in the AI community.

Future Directions

The authors plan to continue refining Tele-FLM's capabilities to broaden its application spectrum and improve its efficiency. Future developments may include exploring larger model scales and enhancing the model's adaptability across more diverse languages and tasks.

Conclusions

The introduction of Tele-FLM marks significant progress in the development of scalable and efficient LLMs. By offering detailed insights and open-sourcing the model, the paper makes a valuable contribution to ongoing research and development in AI and LLMs. Furthermore, the strategic improvements in model training and resource utilization point to a promising direction for future large-scale AI model development.
