
Chinese Open Instruction Generalist: A Preliminary Release (2304.07987v4)

Published 17 Apr 2023 in cs.CL and cs.AI

Abstract: Instruction tuning is widely recognized as a key technique for building generalist LLMs, which has attracted the attention of researchers and the public with the release of InstructGPT (Ouyang et al., 2022) and ChatGPT (https://chat.openai.com/). Despite impressive progress in English-oriented large-scale LLMs, it is still under-explored whether English-based foundation LLMs can perform similarly on multilingual tasks compared to English tasks with well-designed instruction tuning and how we can construct the corpora needed for the tuning. To remedy this gap, we propose the project as an attempt to create a Chinese instruction dataset by various methods adapted to the intrinsic characteristics of 4 sub-tasks. We collect around 200k Chinese instruction tuning samples, which have been manually checked to guarantee high quality. We also summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora. The resulting Chinese Open Instruction Generalist (COIG) corpora are available on Huggingface (https://huggingface.co/datasets/BAAI/COIG) and GitHub (https://github.com/BAAI-Zlab/COIG), and will be continuously updated.

Citations (27)

Summary

  • The paper presents a novel large-scale Chinese instruction dataset that enhances LLM tuning with rigorous manual verification and quality checks.
  • It details a blended methodology combining translated corpora, exam content, value-aligned instructions, counterfactual corrections, and coding tasks.
  • Empirical evaluation emphasizes the importance of cultural context and specialized pipelines in improving instruction-following performance in LLMs.

An Academic Analysis of the Chinese Open Instruction Generalist Preliminary Release

This document introduces the Chinese Open Instruction Generalist (COIG), a significant effort to create a high-quality Chinese instruction dataset aimed at enhancing the performance and instruction-following capabilities of LLMs. As the use of LLMs in AI applications continues to grow, the demand for diverse, high-quality training data becomes increasingly pressing. The paper describes the authors' methodology for assembling a manually verified Chinese instruction dataset, addressing a notable gap in the availability of non-English instruction-tuning data.

Instruction Tuning and Its Challenges

Instruction tuning is essential for empowering LLMs with the capability to interpret and execute tasks as described by specific instructions. Although English instruction tuning datasets are abundant, their Chinese counterparts are relatively underdeveloped in both scale and diversity. The COIG project's primary contribution is to fill this gap by assembling a comprehensive, well-verified Chinese instruction dataset.

Data Collection Methodology

The COIG project is meticulous in its approach to data collection:

  1. Translation-Based General Instruction Corpus: The corpus is derived from translations of existing high-quality English datasets, such as Unnatural Instructions and Self-Instruct. This phase involved automatic translation followed by stringent manual verification to ensure cultural relevance and accuracy. The paper emphasizes the high correctness rate achieved through a multi-step quality-verification process.
  2. Exam Instructions: Leveraging existing Chinese educational materials, this dataset comprises a variety of question formats and subjects. It employs manual annotation to ensure the integrity and educational relevance of the instructional data.
  3. Human Value Alignment Instructions: The dataset considers cultural nuances unique to the Chinese-speaking world. It carefully selects seeds from ethics education materials, promotes widely shared human values, and eschews regional beliefs or political content, thus ensuring that the resulting instructions resonate culturally while aligning with ethical standards.
  4. Counterfactual Correction Multi-round Chat: This dataset addresses factual errors and hallucinations in LLM responses. By utilizing role-play dialogues based on a knowledge base, the paper aims to enhance the factual consistency and accuracy of Chinese LLMs.
  5. Leetcode Instructions: Given the significance of code-related instructions, the dataset includes tasks that align with Chinese language processing and span various coding genres.
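To make the five sub-corpora concrete, the sketch below shows a hypothetical shape for a single instruction-tuning record and a toy quality gate reflecting the manual-verification step described above. The field names (`instruction`, `input`, `output`, `source`, `verified`) are illustrative assumptions, not the actual COIG schema.

```python
# Hypothetical record shape for one instruction-tuning sample; the real
# COIG field names and sub-corpus labels may differ from this sketch.
sample = {
    "instruction": "Translate the following sentence into Chinese.",
    "input": "Instruction tuning improves generalist LLMs.",
    "output": "指令微调可以提升通用大语言模型的能力。",
    "source": "translated-general",  # which of the five sub-corpora it came from
    "verified": True,                # whether it passed manual verification
}

def keep(record):
    """Toy quality gate: keep only manually verified, non-empty records."""
    return bool(record["verified"] and record["instruction"] and record["output"])

dataset = [sample]
clean = [r for r in dataset if keep(r)]
print(len(clean))  # 1
```

A real pipeline would layer several such gates (translation-quality checks, deduplication, annotator agreement) before a record enters the released corpus.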

Empirical Evaluation and Contributions

The evaluation discussion in this paper highlights the importance of in-context learning (ICL) for instruction expansion, as well as the strategic use of human verification to bridge cultural gaps in translated datasets. This suggests that a nuanced understanding of the target audience is critical when developing multilingual instruction corpora, and that future research should attend to the cultural and contextual nuances present in the data.
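ICL-based instruction expansion typically means showing an LLM a handful of seed instructions and asking it to generate new ones in the same style. The prompt template below is a minimal illustrative sketch, not the prompt used by the COIG authors.

```python
# Sketch of in-context-learning prompt assembly for instruction expansion:
# seed instructions are listed as few-shot examples, and the model is asked
# to produce new instructions of a similar style. Template is illustrative.
seed_instructions = [
    "列出三种常见的机器学习评估指标。",  # "List three common ML evaluation metrics."
    "将下面的句子改写成正式语气。",      # "Rewrite the sentence below in a formal tone."
]

def build_expansion_prompt(seeds, n_new=5):
    """Assemble a numbered few-shot prompt asking for n_new new instructions."""
    shots = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(seeds))
    return (
        "下面是一些任务指令示例：\n"       # "Here are some example task instructions:"
        f"{shots}\n"
        f"请再生成 {n_new} 条风格相似但内容不同的新指令。"
        # "Please generate n_new new instructions, similar in style but different in content."
    )

prompt = build_expansion_prompt(seed_instructions)
print(prompt)
```

Generated candidates would then pass through the same human-verification step as translated data before being added to the corpus.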

Furthermore, the project outlines several significant contributions to the field:

  • The construction of one of the most extensive Chinese instruction tuning corpora to date.
  • A workflow model for future instruction corpus construction that balances automated and manual processes.
  • Insights into domain-specific pipeline design, crucial for handling different domains like academic exams or human value alignment.

Practical and Theoretical Implications

Practically, COIG data provides a robust foundation for developing Chinese LLMs capable of better instruction comprehension and execution. The project also facilitates further research on improving instructional data quality and diversification in non-English languages.

Theoretically, the paper opens discussions on potential algorithmic improvements. For instance, the disparity in instruction utility suggests the need for active learning methodologies to identify the most informative data samples. Additionally, overcoming gradient interference during instruction tuning may improve model convergence and performance.
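One way to act on the disparity in instruction utility is uncertainty-based active learning: score candidate samples by how unsure the model is about them (for example, by per-token loss) and prioritize the least confident ones for annotation or training. The snippet below is a toy sketch of that selection step; the sample names and scores are invented for illustration.

```python
import heapq

# Toy active-learning selection: rank candidate instruction samples by a
# model-assigned uncertainty score (e.g., mean per-token loss) and keep
# the most informative ones. Scores here are made up for illustration.
candidates = [
    ("sample_a", 0.31),
    ("sample_b", 2.75),
    ("sample_c", 1.42),
    ("sample_d", 0.08),
]

def select_most_informative(scored, k):
    """Return the k sample names the model is least confident about."""
    return [name for name, _ in heapq.nlargest(k, scored, key=lambda x: x[1])]

print(select_most_informative(candidates, 2))  # ['sample_b', 'sample_c']
```

In practice the scores would come from a forward pass of the model being tuned, and selection would be interleaved with training rounds rather than done once.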

Concluding Thoughts

In summary, the COIG project presents a meticulously constructed dataset that provides a substantial contribution to Chinese instruction tuning for LLMs. While acknowledging the project's early phase, the document emphasizes a commitment to continual updates and invites collaboration. Future research can build upon this foundation, exploring more advanced or specialized tuning strategies and data curation methods, potentially extending the concepts discussed to other languages and cultures.
