A Large-Scale Chinese Short-Text Conversation Dataset

Published 10 Aug 2020 in cs.CL | (2008.03946v2)

Abstract: The advancements of neural dialogue generation models show promising results on modeling short-text conversations. However, training such models usually needs a large-scale high-quality dialogue corpus, which is hard to access. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-training dialogue models which are trained on LCCC-base and LCCC-large respectively. The cleaned dataset and the pre-training models will facilitate the research of short-text conversation modeling. All the models and datasets are available at https://github.com/thu-coai/CDial-GPT.

Authors (7)

Citations (126)

Summary

The paper presents a large-scale Chinese conversation dataset comprising 6.8M and 12M dialogue instances.
The authors employed a rigorous two-phase cleaning process using heuristic rules and classifiers to reduce noise.
Pre-trained models such as CDial-GPT, post-trained on LCCC, are reported to produce more fluent and relevant responses in dialogue tasks.

The paper presents a significant contribution to the field of natural language processing by introducing a large-scale Chinese short-text conversation dataset, known as LCCC. This dataset is designed to address the scarcity of Chinese dialogue corpora, which has been a hindrance for developing pre-training models for Chinese dialogue generation. The authors have meticulously constructed and cleaned the dataset to ensure its quality, making it suitable for advancing research in open-domain dialogue generation.

Dataset Construction and Quality

The LCCC dataset comes in two versions: LCCC-base with 6.8 million dialogues and LCCC-large with 12.0 million dialogues. The dialogues originate from social media platforms such as Weibo and were cleaned in two phases. First, heuristic rules filtered out dialogues that were obviously noisy. Second, a classifier trained on 110K manually annotated dialogue pairs removed the low-quality dialogues that the rules had missed. This approach mitigates common problems in online data, such as toxic comments and irrelevant content, which can degrade the performance of dialogue models.
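The two-phase cleaning described above can be sketched as a simple filter chain. This is a minimal illustration, not the authors' pipeline: the specific rules (length bounds, URL and repeated-character checks), the names passes_rules, clean, and toy_score, and the 0.5 threshold are all assumptions made for the example, and the scoring function is a placeholder where a trained classifier would go.

```python
import re
from typing import Callable, List, Tuple

Dialogue = Tuple[str, str]  # (post, response)

# Phase 1: hypothetical heuristic rules (illustrative, not the paper's rule set).
URL_PATTERN = re.compile(r"https?://\S+")
REPEAT_PATTERN = re.compile(r"(.)\1{5,}")  # same character six or more times in a row


def passes_rules(dialogue: Dialogue, min_len: int = 2, max_len: int = 100) -> bool:
    """Return True if every utterance survives the rule-based checks."""
    for utterance in dialogue:
        text = utterance.strip()
        if not (min_len <= len(text) <= max_len):
            return False
        if URL_PATTERN.search(text) or REPEAT_PATTERN.search(text):
            return False
    return True


def clean(
    dialogues: List[Dialogue],
    score_fn: Callable[[Dialogue], float],
    threshold: float = 0.5,
) -> List[Dialogue]:
    """Apply rules first, then keep dialogues the classifier scores at or above threshold."""
    rule_filtered = [d for d in dialogues if passes_rules(d)]
    # Phase 2: score_fn stands in for a classifier trained on annotated pairs.
    return [d for d in rule_filtered if score_fn(d) >= threshold]


def toy_score(dialogue: Dialogue) -> float:
    """Placeholder scorer: treats a response that parrots the post as low quality."""
    post, response = dialogue
    return 0.0 if post.strip() == response.strip() else 1.0


if __name__ == "__main__":
    sample = [
        ("今天天气怎么样", "挺好的，适合出门"),
        ("看这个 http://example.com", "好的"),
        ("哈哈哈哈哈哈哈哈", "笑什么"),
        ("你好", "你好"),
    ]
    for post, response in clean(sample, toy_score):
        print(post, "->", response)
```

Running the rules before the classifier keeps the more expensive scoring step off dialogues that are trivially rejectable. In a real setting, toy_score would be replaced by the probability output of a trained model.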

Pre-Training Models

Using the cleaned dataset, the authors also release pre-trained models such as CDial-GPT, tailored for Chinese dialogue generation. These models were first pre-trained on a Chinese novel corpus and then post-trained on the LCCC dataset. They provide a starting point for further research and development in Chinese dialogue generation.

Comparative Analysis

The new dataset and pre-trained models were evaluated against existing methods and datasets. The authors report noticeably less noise than in previous datasets, such as the STC dataset, and improved model performance. Both automatic and human evaluations were conducted, and they indicate better fluency, relevance, and informativeness for models trained on the LCCC dataset.

Implications and Future Directions

The introduction of the LCCC dataset and associated models holds substantial implications for the field of NLP. By providing a high-quality resource for Chinese dialogue generation, this work facilitates more accurate and contextually aware conversational models. Moreover, these developments could be instrumental in practical applications such as chatbots and virtual assistants in Mandarin-speaking regions.

Looking forward, the availability of such resources is likely to spur further innovations in AI, especially in the realms of cross-lingual dialogue systems and personalized conversation agents. Future research might explore refining the dataset further, expanding its scope, or integrating it with multimodal data for even richer interaction models.

In summary, this paper marks a significant step forward in the development of Chinese NLP resources, contributing both a carefully cleaned dataset and pre-trained dialogue models built on it. The release of these resources supports further exploration in open-domain conversation modeling and related applications.
