Abstract

LLMs have demonstrated remarkable abilities in general scenarios, and instruction fine-tuning empowers them to align with humans on various tasks. Nevertheless, the diversity and quality of instruction data remain two main challenges for instruction fine-tuning. To address this, we propose a novel gradient-based method that automatically selects high-quality and diverse instruction fine-tuning data for machine translation. Our key innovation is analyzing how individual training examples influence the model during training: using influence functions together with a small high-quality seed dataset, we select as high-quality those training examples that exert beneficial influence on the model. Moreover, to enhance the diversity of the training data, we maximize the variety of influences the data have on the model by clustering on their gradients and resampling. Extensive experiments on WMT22 and FLORES translation tasks demonstrate the superiority of our method, and in-depth analysis further validates its effectiveness and generalization.

Figure: Comparison of G-DIG and embedding-based methods on Zh→En translation with different training data amounts.

Overview

  • The paper presents G-DIG, a gradient-based method for selecting high-quality and diverse instruction data to fine-tune LLMs for machine translation.

  • G-DIG utilizes influence functions to measure the impact of training examples on model performance and enhances data diversity through clustering and resampling techniques.

  • Experimental results on WMT22 and FLORES translation tasks show G-DIG's superior performance over baselines and a competitive edge against state-of-the-art models, especially for Zh→En and De→En translation.

Gradient-based Data Selection for Machine Translation Fine-Tuning in LLMs

The paper introduces a novel gradient-based method, named G-DIG, aimed at automatically selecting high-quality and diverse instruction fine-tuning data for machine translation using LLMs. This method focuses on overcoming the inherent challenges posed by the diversity and quality of instruction data, which are critical to model alignment and performance.

Key Methodology

The central innovation of the paper lies in the application of influence functions to the data selection process. An influence function estimates how the model's loss on a test instance would change if a given training example were upweighted, so it provides a principled measure of each training example's impact on the model's behavior at test time. The approach entails two main components:

  1. High-Quality Data Selection: Training examples that exert beneficial influences on a pre-constructed small set of high-quality seed data are selected. This process involves:
    • Manually curating a small seed dataset.
    • Using influence functions to quantify the positive impact of candidate data on the seed dataset.
  2. Data Diversity Enhancement: The diversity of the selected data is ensured by clustering candidates on their gradients and resampling. The authors measure gradient similarity with Euclidean distance and employ K-means clustering to maximize diversity among selected examples; a minimal sketch of both steps follows this list.
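As a rough illustration, here is a minimal sketch of both steps in Python. It makes two loud simplifications: the inverse Hessian in the classic influence function (−∇L(seed)ᵀH⁻¹∇L(candidate), after Koh and Liang) is approximated by the identity, reducing the score to a gradient dot product, and gradients are flattened over all trainable parameters, whereas a 7B model would require restricting or projecting them. All function names and the loss interface are illustrative, not the authors' code.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans

def flat_grad(model, loss):
    """Flatten the gradient of `loss` over all trainable parameters.
    For a 7B model this vector is enormous; in practice one would
    restrict it to a parameter subset or project it down."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_score(model, loss_fn, candidate, seed_set):
    """First-order influence of `candidate` on the seed set:
    -grad(seed)^T grad(candidate), i.e. the classic influence
    function with H^{-1} replaced by the identity. More negative
    scores mean upweighting the candidate is expected to reduce
    loss on the seed data, marking it as beneficial."""
    g_cand = flat_grad(model, loss_fn(model, candidate))
    return sum(
        -torch.dot(flat_grad(model, loss_fn(model, s)), g_cand).item()
        for s in seed_set
    )

def diverse_resample(candidate_grads, n_clusters, per_cluster):
    """Diversity step: K-means over candidate gradients (Euclidean
    distance), then an even number of picks from each cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(candidate_grads)
    picks = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        picks.extend(members[:per_cluster].tolist())
    return picks
```

Under these assumptions, a candidate pool would be ranked by `influence_score` (most negative first) and the survivors resampled with `diverse_resample`, mirroring the quality and diversity stages described above.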

Experimental Validation

The effectiveness of G-DIG was demonstrated through extensive experiments on the WMT22 and FLORES translation tasks, focusing on the Zh→En and De→En translation directions. The evaluation used COMET, BLEU, and BLEURT scores; a brief scoring sketch follows the setup list. The experimental setup involved:

  • Using Baichuan2-7B for Zh→En and Llama2-7B for De→En.
  • Collecting large candidate pools from various sources such as WMT22 datasets.
  • Comparing G-DIG against both baselines and state-of-the-art models like Bayling-13B, BigTranslate-13B, and TIM-7B.
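As context for the metrics above, here is a minimal reference-based scoring sketch using sacrebleu and Unbabel's COMET; the file names are placeholders, and BLEURT would be scored analogously with its own library. This is a generic evaluation recipe under those assumptions, not the authors' exact pipeline.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Placeholder file names for sources, system outputs, and references.
sources = [line.strip() for line in open("test.zh", encoding="utf-8")]
hypotheses = [line.strip() for line in open("system.en", encoding="utf-8")]
references = [line.strip() for line in open("ref.en", encoding="utf-8")]

# Corpus-level BLEU via sacrebleu, the standard WMT scorer.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# Reference-based COMET with the wmt22-comet-da checkpoint.
ckpt_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(ckpt_path)
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
result = comet_model.predict(data, batch_size=8, gpus=1)
print(f"COMET: {result.system_score:.4f}")
```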

Results and Analysis

The results indicated that:

  • Superiority Over Baselines: G-DIG consistently outperformed random selection and reward model-based selection methods across various data sizes.
  • Competitive with SOTA Models: When compared with larger, more complex models, G-DIG enabled 7B models to achieve comparable, and sometimes superior, performance in terms of translation quality.
  • Impact of Data Diversity: The analyses revealed that diversity enhancement was particularly beneficial when the size of the training data was limited. This benefit diminished as the number of training examples increased.

Theoretical and Practical Implications

The theoretical implications of this work are manifold. By utilizing gradient information via influence functions, the paper provides a robust mechanism for leveraging model-intrinsic behaviors in training data selection, without relying on external models. This approach not only aligns with theoretical advances in gradient-based learning but also sets a precedent for further research into data-centric model optimization.

Practically, the G-DIG method offers a scalable solution to fine-tuning LLMs for specific tasks such as machine translation. By systematically enhancing both the quality and diversity of training data, the method ensures that models are better aligned with human-like instructional responses, which is crucial for applications in precise language translation tasks.

Future Directions

The paper acknowledges the computational cost associated with calculating influence functions, especially for large models. Future research can focus on optimizing these calculations or exploring alternative methods that retain the core benefits of gradient-based data selection while reducing computational overhead. Further investigations might also explore the applicability of G-DIG in other domains where instructional fine-tuning is critical, such as different NLP tasks or beyond.

Conclusion

G-DIG represents a significant advancement in the methods used to fine-tune LLMs for specialized tasks like machine translation. By combining high-quality data selection with gradient-based diversity enhancements, this paper offers a sophisticated and effective approach to address the challenges inherent in instruction fine-tuning, ensuring robust and high-performing language models. Considering the demonstrated effectiveness and potential applications, G-DIG sets a higher standard for future research on intelligent data selection methodologies in the context of generative AI and LLMs.
