
Abstract

LLMs have gained significant attention in the field of NLP due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, due to the difficulty of acquiring a large-scale corpus and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA2 model, aiming to align cross-language representations and facilitate knowledge transfer specifically to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA2-7B.

Figure: Win rates for models in non-tie matches; ChatFlow ranks 5th among 7B models.

Overview

  • This paper introduces ChatFlow, a novel approach to enhance Chinese LLMs using cross-language transfer learning and a dynamic data sampler.

  • ChatFlow employs a systematic training regimen involving bilingual corpora and dynamic progressive data sampling, ensuring a gradual transition from unsupervised pre-training to supervised fine-tuning.

  • The proposed method demonstrates significant improvements in model performance, training efficiency, and bilingual capabilities, with implications for extending this technique to other languages.

Dynamic Data Sampler for Cross-Language Transfer Learning in LLMs

The paper under examination, titled "Dynamic Data Sampler for Cross-Language Transfer Learning in LLMs," addresses the substantial challenge of training LLMs for non-English languages, such as Chinese. This work, authored by Yudong Li et al., from institutions including Shenzhen University and Tencent AI Lab, introduces a novel approach named ChatFlow to facilitate cost-effective training via cross-language transfer learning.

Overview and Motivation

Prevalent LLMs such as LLaMA2 typically excel because of the massive English-language corpora available for training. However, the data disparity between languages presents significant obstacles to creating high-quality LLMs for languages like Chinese, which constitutes only 1.4% of the web corpus. Existing Chinese models such as ChatGLM and Baichuan often rely on private datasets, hindering reproducibility and broader research efforts. The proposed ChatFlow method aims to fill this gap by leveraging English-language resources to enhance Chinese LLMs through a cross-language transfer mechanism.

Methodology

Transfer Learning with Dynamic Data Sampler

ChatFlow stands on the shoulders of the LLaMA2-7B model, augmenting it with Chinese language capabilities through a methodical training regimen involving bilingual (Chinese and English) corpora and dynamic progressive data sampling. The dynamic data sampler plays a critical role by ensuring a smooth transition from unsupervised pre-training to supervised fine-tuning (SFT), inspired by curriculum learning principles.

Instead of abruptly shifting from pre-training to fine-tuning, the dynamic sampler gradually increases the proportion of Chinese data and supervised instruction tasks in the training batches. This careful calibration supports appropriate representation learning and mitigates the instability that abrupt changes in the data distribution can cause.
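To make the idea concrete, below is a minimal sketch of such a progressive sampler, assuming a linear schedule. The corpus names, mixture endpoints, and schedule shape are illustrative assumptions rather than the exact configuration used for ChatFlow.

```python
import random

# Illustrative mixture endpoints (assumptions, not ChatFlow's reported proportions):
# early training is dominated by unsupervised English text, late training by
# Chinese text and supervised instruction (SFT) data.
START_MIX = {"en_unsup": 0.60, "zh_unsup": 0.20, "parallel": 0.20, "sft": 0.00}
END_MIX   = {"en_unsup": 0.10, "zh_unsup": 0.40, "parallel": 0.10, "sft": 0.40}

def mixture_weights(step: int, total_steps: int) -> dict:
    """Linearly interpolate sampling weights from START_MIX toward END_MIX."""
    t = min(step / total_steps, 1.0)
    return {k: (1 - t) * START_MIX[k] + t * END_MIX[k] for k in START_MIX}

def sample_source(step: int, total_steps: int) -> str:
    """Pick which corpus the next training example is drawn from."""
    weights = mixture_weights(step, total_steps)
    sources, probs = zip(*weights.items())
    return random.choices(sources, weights=probs, k=1)[0]

if __name__ == "__main__":
    # Batch composition drifts smoothly from pre-training-style data
    # toward instruction data as the step count grows.
    for step in (0, 50_000, 100_000):
        print(step, {k: round(v, 2) for k, v in mixture_weights(step, 100_000).items()})
```

Because the weights change smoothly with the step count, the batch composition drifts from pre-training-style data toward instruction data without an abrupt distribution shift.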

Training Data Composition

The training data comprises approximately 50 GB, spanning an unsupervised corpus, a parallel Chinese-English corpus, and instruction data (a sketch of one way to mix these sources follows the list):

  • Parallel Corpus: Notable sources such as ParaCrawl v9 and WikiMatrix help align cross-language representations, enabling efficient knowledge transfer from English to Chinese.
  • Unsupervised Corpus: Incorporates Chinese datasets such as CLUECorpus and CSL, along with a subset of the English RefinedWeb corpus, preserving existing knowledge while expanding Chinese capabilities.
  • Instruction Data: Utilizes diverse sources like BELLE and UltraChat to enhance the model’s interaction proficiency.
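One plausible way to realize such a mixture in practice is with the Hugging Face `datasets` library. The sketch below uses placeholder dataset paths and illustrative probabilities; it is not the paper's released data pipeline.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder paths: in practice these would point to the actual sources
# (e.g., CLUECorpus/CSL, a RefinedWeb subset, ParaCrawl v9/WikiMatrix,
# BELLE/UltraChat), which are not reproduced here.
en_unsup = load_dataset("path/to/english-unsup", split="train", streaming=True)
zh_unsup = load_dataset("path/to/chinese-unsup", split="train", streaming=True)
parallel = load_dataset("path/to/zh-en-parallel", split="train", streaming=True)
sft_data = load_dataset("path/to/instruction-data", split="train", streaming=True)

# Fixed probabilities shown for simplicity; with the dynamic sampler these
# would be recomputed per training stage rather than held constant.
mixed = interleave_datasets(
    [en_unsup, zh_unsup, parallel, sft_data],
    probabilities=[0.50, 0.30, 0.15, 0.05],
    seed=42,
    stopping_strategy="all_exhausted",
)
```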

Experimental Results

Performance Metrics

ChatFlow’s performance was rigorously evaluated on several benchmarks, including MMLU, C-Eval, CMMLU, and GAOKAO (a sketch of the standard multiple-choice scoring procedure follows the list):

  • Superior Performance: ChatFlow exhibited superior results compared to other Chinese models post-trained on LLaMA2-7B, such as HFL-Alpaca2, especially in the domains of Chinese understanding and bilingual capabilities.
  • Training Efficiency: The dynamic data sampler facilitated faster model convergence and higher stability across training stages, evidenced by tracking loss curves and performance metrics.
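For reference, benchmarks such as MMLU, C-Eval, and CMMLU are commonly scored by comparing the model's log-likelihood of each candidate answer. The sketch below shows that generic procedure with a placeholder model ID; it is not the paper's exact evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID, not ChatFlow's actual published checkpoint name.
model_id = "path/to/chatflow-7b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option` given `question`.
    Assumes the question tokens form a prefix of the joint tokenization,
    a common approximation in multiple-choice evaluation."""
    q_ids = tok(question, return_tensors="pt").input_ids
    full_ids = tok(question + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    option_len = full_ids.shape[1] - q_ids.shape[1]
    # Log-probability of each token given its preceding context.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    target = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(target.shape[0]), target]
    return token_lp[-option_len:].sum().item()

def predict(question: str, options: list[str]) -> str:
    """Return the candidate answer the model assigns the highest likelihood."""
    return max(options, key=lambda o: option_logprob(question, o))
```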

Human Evaluation

In a human evaluation on the SuperCLUE platform, ChatFlow ranked 5th among comparable 7B-scale models, reflecting the advantage of transfer learning from an English foundation model. It still trails state-of-the-art commercial models, leaving clear avenues for further enhancement.

Implications and Future Directions

The proposed methodology highlights important practical and theoretical implications:

  • Practical Utility: ChatFlow offers a reproducible and efficient framework for bilingual LLM training, with a significant focus on resource efficiency and open availability.
  • Theoretical Insights: The work underscores the importance of dynamic data sampling in transfer learning, providing empirical evidence of its benefits in stabilizing learning processes in multilingual contexts.

Future research directions may explore extending this approach to other languages with similarly limited training data, refining the dynamic data sampler mechanism, and integrating reinforcement learning from human feedback (RLHF) to further optimize model performance.

Conclusion

The paper introduces ChatFlow, a well-structured, cost-effective strategy for enhancing Chinese LLMs through cross-language transfer. By innovatively employing a dynamic data sampler and leveraging both bilingual and instruction datasets, the study contributes a valuable reference point for future cross-linguistic AI model developments. With its successful outcomes and open-source commitment, ChatFlow represents a meaningful step toward inclusive and reproducible AI research initiatives.
