Emergent Mind

COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

(2403.18058)
Published Mar 26, 2024 in cs.CL and cs.AI

Abstract

Recently, there have been significant advancements in LLMs, particularly focused on the English language. These advancements have enabled these LLMs to understand and execute complex instructions with unprecedented accuracy and fluency. However, despite these advancements, there remains a noticeable gap in the development of Chinese instruction tuning. The unique linguistic features and cultural depth of the Chinese language pose challenges for instruction tuning tasks. Existing datasets are either derived from English-centric LLMs or are ill-suited for aligning with the interaction patterns of real-world Chinese users. To bridge this gap, we introduce COIG-CQIA, a high-quality Chinese instruction tuning dataset. Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions. To this end, we collect a high-quality human-written corpus from various sources on the Chinese Internet, including Q&A communities, Wikis, examinations, and existing NLP datasets. This corpus was rigorously filtered and carefully processed to form the COIG-CQIA dataset. Furthermore, we train models of various scales on different subsets of CQIA, following in-depth evaluation and analyses. The findings from our experiments offer valuable insights for selecting and developing Chinese instruction-tuning datasets. We also find that models trained on CQIA-Subset achieve competitive results in human assessment as well as knowledge and security benchmarks. Data are available at https://huggingface.co/datasets/m-a-p/COIG-CQIA

Figure: Length distribution of combined instructions and responses in the dataset.

Overview

  • The COIG-CQIA dataset introduces a comprehensive corpus for Chinese instruction fine-tuning, addressing the shortage of high-quality resources for Chinese language models.

  • Derived from diverse internet sources, the dataset contains 48,375 instances, covering formal and informal language use across various domains including STEM and humanities.

  • The dataset supports a broad range of task types, aiding in training models for intricate understanding and response generation in Chinese.

  • Models trained on COIG-CQIA showed improved performance in tasks requiring deep understanding, underscoring its potential to enhance human-like interaction in Chinese language AI systems.

COIG-CQIA: A High-Quality Chinese Instruction Tuning Dataset for Improved Human-Like Interaction

Introduction to COIG-CQIA

The evolution of LLMs has drastically enhanced machine understanding and response generation capabilities, especially in the context of instruction-following tasks. However, the existing resources for instruction tuning predominantly cater to English, leaving a significant void in high-quality datasets for Chinese instruction fine-tuning. This gap impairs the development of models capable of understanding and executing instructions in Chinese with high fidelity. To address this, the introduction of the COIG-CQIA dataset marks a significant step forward. It aims to offer a comprehensive corpus tailored for instruction tuning in Chinese, meticulously curated from diverse, authentic internet sources and processed to meet high-quality standards.

Dataset Curation

COIG-CQIA stands out due to its methodical curation process and the wealth of sources it taps into for data collection. The dataset is derived from a mixture of social media platforms, Q&A communities, encyclopedias, exams, and existing NLP datasets, ensuring a broad coverage that spans both formal and informal usage, as well as a variety of domains such as STEM, humanities, and general knowledge.

The compilation process involved rigorous steps to ensure the quality and relevance of the data:

  • Filtering and Processing: Combined automated and manual review to remove low-quality and irrelevant content and to ensure the cleanliness of the data.
  • Diverse Sources: Data were collected from over 22 distinct sources, including prominent Chinese websites and forums, giving the dataset a rich diversity of instruction-response pairs.
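The paper's exact filtering rules are not reproduced here, but a minimal heuristic filter of the kind described above might look like the following sketch. The field names (`instruction`, `response`) and the length thresholds are illustrative assumptions, not the authors' actual pipeline:

```python
# Illustrative quality filter for instruction-response pairs.
# Field names and thresholds are assumptions for demonstration only.

def is_high_quality(pair: dict, min_len: int = 10, max_len: int = 4096) -> bool:
    """Keep a pair only if both fields are non-empty and the combined
    character length falls within the given bounds."""
    instruction = pair.get("instruction", "").strip()
    response = pair.get("response", "").strip()
    if not instruction or not response:
        return False
    combined = len(instruction) + len(response)
    return min_len <= combined <= max_len

def deduplicate(pairs: list) -> list:
    """Drop pairs with empty or exactly duplicated instructions,
    keeping the first occurrence of each instruction."""
    seen, kept = set(), []
    for p in pairs:
        key = p.get("instruction", "").strip()
        if key and key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

raw = [
    {"instruction": "什么是指令微调？",  # "What is instruction tuning?"
     "response": "指令微调是在指令-回复对上继续训练模型的方法。"},
    {"instruction": "什么是指令微调？", "response": "重复条目。"},  # duplicate
    {"instruction": "", "response": "缺少指令。"},  # missing instruction
]
cleaned = [p for p in deduplicate(raw) if is_high_quality(p)]
print(len(cleaned))  # 1
```

Real pipelines of this kind typically add further stages (language identification, toxicity screening, near-duplicate detection), which are omitted here for brevity.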

Dataset Composition and Characteristics

  • Task Variety: COIG-CQIA encompasses a wide array of task types, from question answering and knowledge extraction to generation tasks, facilitating comprehensive model training.
  • Volume and Diversity: The dataset comprises 48,375 instances, and this breadth is crucial for training models to understand and generate a wide range of responses.
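Summary statistics such as the combined instruction-response length distribution mentioned earlier can be computed with a short script. This is a hedged sketch over toy data; the field names and bucket size are assumptions, not taken from the paper:

```python
from collections import Counter

def length_histogram(pairs, bucket: int = 100) -> dict:
    """Bucket combined instruction+response lengths (in characters)
    into ranges of `bucket` characters, keyed by the bucket's lower bound."""
    counts = Counter()
    for p in pairs:
        total = len(p.get("instruction", "")) + len(p.get("response", ""))
        counts[(total // bucket) * bucket] += 1
    return dict(sorted(counts.items()))

# Toy pairs standing in for real dataset instances.
sample = [
    {"instruction": "a" * 40, "response": "b" * 80},  # 120 chars -> bucket 100
    {"instruction": "c" * 10, "response": "d" * 30},  #  40 chars -> bucket 0
    {"instruction": "e" * 50, "response": "f" * 70},  # 120 chars -> bucket 100
]
print(length_histogram(sample))  # {0: 1, 100: 2}
```

Run over the full 48,375 instances, a histogram like this is what underlies length-distribution plots for the dataset.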

Data Analysis and Evaluation

The dataset was rigorously analyzed for diversity, quality, and coverage. The influence of data sourced from different platforms on model performance was also evaluated across benchmarks, demonstrating the dataset's effectiveness in improving models' ability to understand and execute Chinese instructions accurately.

Experimental Findings and Implications

Models trained on the COIG-CQIA dataset achieved competitive results in both human assessment and benchmark evaluations, particularly in tasks requiring deep understanding and complex response generation. These findings underscore COIG-CQIA's potential to advance the development of instruction-tuned LLMs that comprehensively understand and interact in Chinese.

Conclusion and Future Directions

The development of COIG-CQIA represents a significant stride toward bridging the resource gap for Chinese instruction tuning. Its comprehensive curation from a wide range of sources, coupled with meticulous cleaning and processing, ensures high quality and diversity, making it a valuable asset for the Chinese NLP community.

The dataset’s release invites further research and exploration into instruction tuning for Chinese LLMs, with the potential to pave the way for models that demonstrate improved alignment with human interactions in Chinese. As the NLP field continues to evolve, datasets like COIG-CQIA will be instrumental in fostering advancements that bring us closer to achieving truly human-like interaction capabilities in AI systems.
