
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data (2309.11235v2)

Published 20 Sep 2023 in cs.CL

Abstract: Nowadays, open-source LLMs like LLaMA have emerged. Recent developments have incorporated supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to align these models with human goals. However, SFT methods treat all training data with mixed quality equally, while RLFT methods require high-quality pairwise or ranking-based preference data. In this study, we present a novel framework, named OpenChat, to advance open-source LLMs with mixed-quality data. Specifically, we consider the general SFT training data, consisting of a small amount of expert data mixed with a large proportion of sub-optimal data, without any preference labels. We propose the C(onditioned)-RLFT, which regards different data sources as coarse-grained reward labels and learns a class-conditioned policy to leverage complementary data quality information. Interestingly, the optimal policy in C-RLFT can be easily solved through single-stage, RL-free supervised learning, which is lightweight and avoids costly human preference labeling. Through extensive experiments on three standard benchmarks, our openchat-13b fine-tuned with C-RLFT achieves the highest average performance among all 13b open-source LLMs. Moreover, we use AGIEval to validate the model generalization performance, in which only openchat-13b surpasses the base model. Finally, we conduct a series of analyses to shed light on the effectiveness and robustness of OpenChat. Our code, data, and models are publicly available at https://github.com/imoneoi/openchat and https://huggingface.co/openchat.

Citations (196)

Summary

  • The paper introduces the C-RLFT framework for effectively training LLMs with mixed-quality data.
  • It uses class-conditioned policies to differentiate expert from sub-optimal data, achieving superior win rates on benchmarks.
  • Experimental results on AlpacaEval and AGIEval confirm OpenChat-13b's robust generalization and high-quality outputs.

Advancing Open-source LLMs with Mixed-Quality Data: A Review of OpenChat

The paper "OpenChat: Advancing Open-source LLMs with Mixed-Quality Data" discusses an innovative approach to enhance the performance of open-source LLMs, specifically targeting scenarios where training data is of mixed quality. The authors introduce OpenChat, emphasizing a new framework called Conditioned-Reinforcement Learning Fine-tuning (C-RLFT) to effectively utilize mixed-quality datasets without the necessity of fine-grained preference labels.

Key Contributions and Methods

Problem Scope

The primary issue addressed by the paper is a prevalent shortcoming of supervised fine-tuning (SFT) methods: they treat all training data equally, even though datasets often contain both high-quality and sub-optimal examples. Reinforcement learning fine-tuning (RLFT) methods, on the other hand, typically require high-quality pairwise or ranking-based preference data, which is expensive to gather. The authors seek to bridge this gap by proposing an approach that leverages mixed-quality data effectively without preference labels.

Conditioned-RLFT Framework

The proposed C-RLFT framework resolves these limitations by introducing a class-conditioned policy that distinguishes data sources using coarse-grained reward labels. Here's a detailed overview of the method:

  1. Class-Conditioned Dataset and Rewards:
    • The authors label the data by source, expert versus sub-optimal, and assign a coarse-grained reward of 1 to expert data and a lower value (α < 1) to sub-optimal data.
  2. Policy Optimization:
    • The policy is conditioned on the data source as an additional input, and the model is optimized under a KL-regularized RL objective. Instead of regularizing toward the base model alone, this approach regularizes toward a class-conditioned reference policy, which sharpens the separation between data qualities (a minimal training sketch follows this list).
  3. Model Inference:
    • During inference, the OpenChat model is conditioned on the prompt associated with the high-quality data source, steering generation toward responses aligned with expert-data patterns.
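
To make the reduction to single-stage, RL-free supervised learning concrete, here is a minimal PyTorch sketch of the reward-weighted, class-conditioned SFT loss that the C-RLFT closed-form solution implies. The tag strings, the ALPHA value, and the helper names below are illustrative assumptions, not the paper's released implementation (see the official repository for that).

```python
# Minimal sketch of C-RLFT-style training as reward-weighted, class-conditioned SFT.
# Hypothetical names (EXPERT_TAG, SUBOPT_TAG, ALPHA, weighted_sft_loss) are for
# illustration only; the released OpenChat code may differ.

import torch
import torch.nn.functional as F

ALPHA = 0.1  # assumed coarse-grained reward for sub-optimal data (expert data gets 1.0)

# Each data source is marked with its own conditioning prompt, so the policy
# becomes class-conditioned: pi_theta(y | x, c).
EXPERT_TAG = "GPT4 User:"   # hypothetical expert-source prompt
SUBOPT_TAG = "GPT3 User:"   # hypothetical sub-optimal-source prompt

def condition_on_source(prompt: str, is_expert: bool) -> str:
    """Prepend the class tag so the model can tell data sources apart."""
    tag = EXPERT_TAG if is_expert else SUBOPT_TAG
    return f"{tag} {prompt}"

def weighted_sft_loss(logits: torch.Tensor,
                      labels: torch.Tensor,
                      rewards: torch.Tensor) -> torch.Tensor:
    """Single-stage, RL-free objective: - E[ r_c * log pi_theta(y | x, c) ].

    logits:  (batch, seq_len, vocab) next-token predictions
    labels:  (batch, seq_len) target token ids, -100 for masked positions
    rewards: (batch,) coarse-grained reward per example (1.0 or ALPHA)
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )                                          # (batch, seq_len)
    mask = (labels != -100).float()
    per_example = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (rewards * per_example).mean()      # reward-weighted regression

if __name__ == "__main__":
    # Toy shapes only; in practice the logits come from the LLM being fine-tuned.
    batch, seq_len, vocab = 4, 8, 32
    logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
    labels = torch.randint(0, vocab, (batch, seq_len))
    is_expert = torch.tensor([1.0, 0.0, 1.0, 0.0])
    rewards = torch.where(is_expert.bool(), torch.tensor(1.0), torch.tensor(ALPHA))
    loss = weighted_sft_loss(logits, labels, rewards)
    loss.backward()
    print(condition_on_source("Explain C-RLFT briefly.", is_expert=True))
    print(f"weighted SFT loss: {loss.item():.3f}")
```

The key design choice this sketch highlights is that no reward model or RL loop is needed: the coarse-grained data-source label enters only as a conditioning prompt and as a scalar weight on the standard cross-entropy loss, and inference simply conditions on the expert-class prompt.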

Experimental Validation and Implications

The authors validated OpenChat on several benchmarks, including AlpacaEval, MT-bench, and Vicuna-bench for instruction-following abilities, and AGIEval to assess model generalization. The OpenChat-13b model consistently demonstrated superior performance:

  • AlpacaEval and MT-bench:
    • OpenChat-13b achieved the highest win rate among all 13b open-source models, outperforming even gpt-3.5-turbo in several instances.
  • AGIEval:
    • The model surpassed the base llama-2-13b in generalization tasks, indicating robustness against overfitting to the fine-tuning data while maintaining accuracy across diverse tasks.

These results are significant because they show that the OpenChat framework can effectively exploit mixed-quality datasets, offering a practical recipe for deploying LLMs in applications where data quality is not uniformly high.

Future Directions

The paper opens several avenues for future research:

  1. Fine-grained Reward Tuning:
    • While the coarse-grained reward system used in OpenChat is efficient, exploring more nuanced reward structures could further improve model performance.
  2. Extended Applications:
    • Extending the C-RLFT framework to enhance reasoning abilities alongside instruction-following capabilities could broaden the practical applications of LLMs in complex task scenarios.
  3. Data Source Quality Metrics:
    • Developing metrics to better quantify and utilize the quality of data sources could improve the robustness of models trained on mixed-quality datasets.

Conclusion

The authors of "OpenChat: Advancing Open-source Language Models with Mixed-Quality Data" present a compelling framework to address inherent challenges in training LLMs with heterogeneous data quality. Their approach leverages class-conditioned policies and a reward-weighted optimization strategy to achieve superior performance and robustness. This work is a significant contribution to open-source LLM research, offering a practical and effective way to enhance model capabilities, and the principles and methodologies introduced in the paper are likely to inspire further innovations and applications.
