CSL: A Large-scale Chinese Scientific Literature Dataset

Published 12 Sep 2022 in cs.CL | (2209.05034v1)

Abstract: Scientific literature serves as a high-quality corpus, supporting a lot of NLP research. However, existing datasets are centered around the English language, which restricts the development of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396k papers. To our knowledge, CSL is the first scientific document dataset in Chinese. The CSL can serve as a Chinese corpus. Also, this semi-structured data is a natural annotation that can constitute many supervised NLP tasks. Based on CSL, we present a benchmark to evaluate the performance of models across scientific domain tasks, i.e., summarization, keyword generation and text classification. We analyze the behavior of existing text-to-text models on the evaluation tasks and reveal the challenges for Chinese scientific NLP tasks, which provides a valuable reference for future research. Data and code are available at https://github.com/ydli-ai/CSL


Authors (7)

Citations (45)


Summary

The paper introduces a large-scale Chinese scientific literature dataset comprising metadata from 396,209 peer-reviewed papers across 67 disciplines, addressing the shortage of non-English scientific NLP resources.
It benchmarks text-to-text models (T5, PEGASUS, and BART) in a unified multi-task setup covering summarization, keyword generation, and text classification.
The domain-adapted CSL-T5 model outperforms general-domain baselines, indicating the benefit of domain-adaptive training, and the dataset is positioned as a resource for cross-task and few-shot learning research.

Overview of "CSL: A Large-scale Chinese Scientific Literature Dataset"

This paper introduces CSL, a dataset intended to support NLP research on Chinese scientific literature. Existing scientific-document datasets are centered on English, and CSL fills that gap for Chinese. It comprises metadata from 396,209 papers: titles, abstracts, keywords, and academic fields. These four fields can serve both as a raw corpus and as supervision for several NLP tasks.

Dataset Characteristics

CSL is distinguished by its focus on Chinese and by its coverage of 67 disciplines grouped into 13 first-level categories. Unlike existing datasets, which are predominantly English, CSL draws on peer-reviewed Chinese academic journals, which supports the reliability of the data. The metadata is taken directly from the source database, which helps keep it accurate.

NLP Task Derivation and Benchmarking

Because each record is semi-structured, its fields act as natural annotations from which several supervised tasks can be built: text summarization, keyword generation, and text classification. The authors assemble these tasks into a benchmark for evaluating models on Chinese scientific text. Specifically, they use title prediction from the abstract as summarization, keyword prediction as keyword generation, and academic field prediction as classification.
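The derivation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `PaperRecord` class, the `derive_tasks` function, the `"; "` keyword separator, and the placeholder record are all assumptions made for illustration, and the choice of the abstract as the input for every task is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PaperRecord:
    """One metadata record with the four fields the dataset provides (hypothetical schema)."""
    title: str
    abstract: str
    keywords: List[str]
    academic_field: str


def derive_tasks(record: PaperRecord, sep: str = "; ") -> Dict[str, Tuple[str, str]]:
    """Return one (input, target) pair per task, all derived from the same record."""
    return {
        # Abstract -> title
        "summarization": (record.abstract, record.title),
        # Abstract -> keywords joined into one target string (separator is an assumption)
        "keyword_generation": (record.abstract, sep.join(record.keywords)),
        # Abstract -> academic field label
        "classification": (record.abstract, record.academic_field),
    }


# Made-up placeholder record, not taken from CSL.
example = PaperRecord(
    title="A placeholder title",
    abstract="A placeholder abstract describing the work.",
    keywords=["keyword one", "keyword two"],
    academic_field="placeholder field",
)
pairs = derive_tasks(example)
```

Each record yields three supervised examples without any extra labeling, which is what makes the metadata usable as a benchmark.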

Methodology and Evaluation

The paper uses text-to-text models, namely T5, PEGASUS, and BART, to establish baselines. The authors cast every task as text generation, so a single model can be trained on all of them in a multi-task setting, and they fine-tune the models on the CSL tasks. As evidence of the dataset's value, CSL-T5, a T5 model pre-trained on CSL, improves over general-domain models, which supports the effectiveness of domain-adaptive training.
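One common way to unify tasks into a text generation format is to prepend a task prefix to each input and shuffle the examples from all tasks into one training stream. The sketch below shows that idea only. The prefix strings, the `build_multitask_examples` function, and the toy pairs are hypothetical; the summary does not state the prompt wording or the mixing strategy the authors actually used.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical task prefixes; the actual prompt wording is not given in the text.
TASK_PREFIXES: Dict[str, str] = {
    "summarization": "summarize: ",
    "keyword_generation": "keywords: ",
    "classification": "classify: ",
}


def build_multitask_examples(
    pairs_by_task: Dict[str, List[Tuple[str, str]]],
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Turn per-task (input, target) pairs into one shuffled text-to-text stream."""
    examples: List[Dict[str, str]] = []
    for task, pairs in pairs_by_task.items():
        prefix = TASK_PREFIXES[task]
        for source, target in pairs:
            # The prefix tells the model which task to perform on this input.
            examples.append({"task": task, "source": prefix + source, "target": target})
    # Seeded shuffle so tasks are interleaved reproducibly.
    random.Random(seed).shuffle(examples)
    return examples


# Made-up placeholder pairs, not taken from CSL.
toy_pairs = {
    "summarization": [("placeholder abstract", "placeholder title")],
    "keyword_generation": [("placeholder abstract", "keyword one; keyword two")],
    "classification": [("placeholder abstract", "placeholder field")],
}
mixed = build_multitask_examples(toy_pairs, seed=0)
```

The resulting `source` and `target` strings could then be tokenized and fed to any sequence-to-sequence model. Because the classification label is produced as text, no task-specific output head is needed.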

Experimental Outcomes

The results suggest that existing models achieve only modest performance on these tasks, leaving substantial room for improvement. CSL-T5 performs best among the compared models, which highlights the benefit of domain-specific training. The study also notes that, because many tasks can be constructed from the same records, CSL could serve as a resource for cross-task and few-shot learning research.

Implications and Future Directions

CSL adds a sizable resource to Chinese NLP and, more broadly, to non-English scientific NLP. By providing data and a benchmark that span many scientific disciplines, it enables domain-specific work that was previously limited by a lack of resources.

Anticipated future work includes extending the dataset with multi-label annotations and exploring its use in few-shot learning scenarios. CSL could also support cross-lingual studies and comparisons.

In conclusion, CSL is a significant contribution to NLP, especially for researchers working with non-English resources. Its broad coverage and peer-reviewed source material provide a basis for further progress in processing Chinese scientific literature.
