Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation (2401.06477v4)
Abstract: In this paper, we introduce Kun, a novel approach for creating high-quality instruction-tuning datasets for LLMs without relying on manual annotations. Adapting a self-training algorithm based on instruction back-translation and answer polishment, Kun leverages unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a substantial dataset of over a million Chinese instructional data points. This approach significantly deviates from traditional methods by using a self-curation process to refine and select the most effective instruction-output pairs. Our experiments with the 6B-parameter Yi model across various benchmarks demonstrate Kun's robustness and scalability. Our method's core contributions lie in its algorithmic advancement, which enhances data retention and clarity, and its innovative data generation approach that substantially reduces the reliance on costly and time-consuming manual annotations. This methodology presents a scalable and efficient solution for improving the instruction-following capabilities of LLMs, with significant implications for their application across diverse fields. The code and dataset can be found at https://github.com/Zheng0428/COIG-Kun
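The abstract outlines a three-step pipeline: instruction back-translation over unlabelled text, answer polishment, and self-curation of the resulting instruction-output pairs. The sketch below illustrates that flow under assumptions; the `generate()` helper, prompt wording, and scoring threshold are illustrative placeholders, not the authors' implementation.

```python
# Illustrative sketch of a Kun-style pipeline as described in the abstract:
# (1) instruction back-translation: infer an instruction for a raw text,
# (2) answer polishment: refine the raw text into a well-formed response,
# (3) self-curation: score candidate pairs and keep only the highest-rated.
# The generate() callable and all prompts are assumptions for illustration.

from typing import Callable, List, Tuple


def back_translate_instruction(text: str, generate: Callable[[str], str]) -> str:
    """Ask the model to write an instruction that the raw text would answer."""
    prompt = f"Write an instruction for which the following text is a good answer:\n{text}"
    return generate(prompt).strip()


def polish_answer(instruction: str, text: str, generate: Callable[[str], str]) -> str:
    """Rewrite the raw text into a clean response aligned with the instruction."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Draft answer: {text}\n"
        "Rewrite the draft so it directly and clearly answers the instruction."
    )
    return generate(prompt).strip()


def curate(pairs: List[Tuple[str, str]],
           generate: Callable[[str], str],
           threshold: int = 4) -> List[Tuple[str, str]]:
    """Keep only pairs the model itself rates highly (self-curation)."""
    kept = []
    for instruction, answer in pairs:
        prompt = (
            "Rate from 1 to 5 how well the answer follows the instruction.\n"
            f"Instruction: {instruction}\nAnswer: {answer}\nScore:"
        )
        try:
            score = int(generate(prompt).strip()[0])
        except (ValueError, IndexError):
            continue  # skip unparsable ratings
        if score >= threshold:
            kept.append((instruction, answer))
    return kept


def build_dataset(raw_texts: List[str],
                  generate: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Run back-translation, polishment, and curation over unlabelled texts."""
    pairs = []
    for text in raw_texts:
        instruction = back_translate_instruction(text, generate)
        answer = polish_answer(instruction, text, generate)
        pairs.append((instruction, answer))
    return curate(pairs, generate)
```

In this reading, the polishment step is what distinguishes the approach from plain instruction back-translation: the raw web text is rewritten into a response that actually answers the generated instruction before the pair is scored and retained.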