
BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark (2302.09432v2)

Published 18 Feb 2023 in cs.CL

Abstract: To advance Chinese financial NLP, we introduce BBT-FinT5, a new Chinese financial pre-training language model based on the T5 model. To support this effort, we have built BBT-FinCorpus, a large-scale financial corpus with approximately 300GB of raw text from four different sources. In general domain NLP, comprehensive benchmarks like GLUE and SuperGLUE have driven significant advancements in language model pre-training by enabling head-to-head comparisons among models. Drawing inspiration from these benchmarks, we propose BBT-CFLEB, a Chinese Financial Language understanding and generation Evaluation Benchmark, which includes six datasets covering both understanding and generation tasks. Our aim is to facilitate research in the development of NLP within the Chinese financial domain. Our model, corpus and benchmark are released at https://github.com/ssymmetry/BBT-FinCUGE-Applications. Our work belongs to the Big Bang Transformer (BBT), a large-scale pre-trained language model project.

Citations (41)

Summary

  • The paper presents BBT-FinT5, a domain-specific PLM that integrates a 300GB Chinese financial corpus with a novel KETM pre-training strategy.
  • The methodology adapts the T5 architecture to address limitations of general models, enhancing performance in financial NLP tasks.
  • Experimental results show that the FinT5-large model outperforms competitors on key financial benchmarks, demonstrating the benefits of scaling domain-specific pre-training.

Overview of BBT-Fin: Chinese Financial Domain Pre-trained Language Model

The paper "BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark" introduces BBT-FinT5, a novel pre-trained language model (PLM) aimed at advancing NLP in the Chinese financial domain. The effort is supported by BBT-FinCorpus, a large-scale financial corpus of approximately 300GB of raw text drawn from diverse sources, and by BBT-CFLEB, a dedicated evaluation benchmark for comparing models on both understanding and generation tasks.

BBT-FinT5 builds upon the T5 architecture, known for its effectiveness in transfer learning across varied NLP tasks, and adapts it for domain-specific pre-training. The paper notes the limitations of general-purpose PLMs such as BERT and T5 when applied to domain-specific text, and draws on these prior findings to guide the pre-training of BBT-FinT5, which is released in base and large versions with roughly 220 million and 1 billion parameters, respectively. The model employs a knowledge-enhanced pre-training strategy, Knowledge Enhancement via Triple Masking (KETM), to improve the retention of entity knowledge.
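
The summary does not spell out the exact KETM implementation, but the general idea of masking elements of a knowledge triple can be cast in T5's text-to-text sentinel format. The sketch below is a minimal illustration under that assumption; the triple serialization, the Chinese label, and the masking choice are illustrative placeholders, not the authors' published recipe.

```python
import random

# Minimal sketch of a KETM-style objective: mask one element of a knowledge
# triple attached to the text and ask the model to reconstruct it, using
# T5's sentinel-token convention. The serialization format is an assumption
# for illustration, not the authors' exact recipe.

def ketm_example(sentence: str, triple: tuple[str, str, str], rng: random.Random):
    """Build one (input, target) pair by masking a random triple element."""
    slot = rng.choice(range(3))      # mask subject, relation, or object
    masked_value = triple[slot]
    sentinel = "<extra_id_0>"        # T5's first sentinel token

    fields = list(triple)
    fields[slot] = sentinel
    source = f"{sentence} 三元组: {fields[0]} | {fields[1]} | {fields[2]}"
    target = f"{sentinel} {masked_value}"
    return source, target


if __name__ == "__main__":
    rng = random.Random(0)
    src, tgt = ketm_example(
        "公司发布了2022年度财务报告。",
        ("公司", "发布", "财务报告"),
        rng,
    )
    print(src)
    print(tgt)
```

In actual pre-training such pairs would presumably be mixed with the standard span-corruption objective; the label "三元组" ("triple") is likewise only a placeholder.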

Core Components

  1. BBT-FinCorpus: This extensive corpus covers text types essential for financial NLP, including corporate reports, financial news, and social media commentary, providing the diversity and scale needed for effective domain pre-training. The acquisition and filtering of these sources address shortcomings of existing financial corpora (a minimal filtering sketch follows this list).
  2. BBT-CFLEB Benchmark: Designed to evaluate both understanding and generation capabilities, BBT-CFLEB comprises six datasets reflecting prevalent tasks in the financial industry. These tasks offer a comprehensive measure of a model's capability to handle domain-specific challenges.
  3. Knowledge Enhanced Pre-training: The proposed KETM method enriches the T5 model's pre-training process by integrating a specialized task that fosters the comprehension and retention of entity knowledge pivotal in financial texts.
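
The summary mentions acquisition and filtering of heterogeneous financial text but does not give the rules. The sketch below shows one common style of heuristic cleaning (length thresholds, a Chinese-character ratio check, and hash-based deduplication); the specific thresholds and checks are assumptions for illustration, not the published BBT-FinCorpus pipeline.

```python
import hashlib
from typing import Iterable, Iterator

# Illustrative corpus-cleaning pass: keep documents above a minimum length,
# require a share of CJK characters so non-Chinese or boilerplate pages are
# dropped, and remove exact duplicates via content hashing. The thresholds
# are assumptions, not the published BBT-FinCorpus rules.

MIN_CHARS = 200          # assumed minimum document length
MIN_CJK_RATIO = 0.3      # assumed minimum share of Chinese characters


def is_cjk(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"


def clean_corpus(docs: Iterable[str]) -> Iterator[str]:
    seen: set[str] = set()
    for doc in docs:
        text = doc.strip()
        if len(text) < MIN_CHARS:
            continue
        if sum(is_cjk(c) for c in text) / len(text) < MIN_CJK_RATIO:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:   # exact-duplicate removal
            continue
        seen.add(digest)
        yield text
```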

Experimental Validation

The experimentation section details evaluations comparing BBT-FinT5 against notable baselines such as GPT2-base, T5-base, FinBERT, and Mengzi-BERT. Results indicate that BBT-FinT5, particularly when augmented with knowledge-enhanced pre-training, surpasses these baselines on several metrics, underscoring the effectiveness of domain-specific pre-training. The stronger performance of FinT5-large further supports the benefits of scaling model parameters.
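
The released checkpoints and evaluation scripts live in the linked repository; the snippet below only sketches how a T5-style checkpoint could be queried on a BBT-CFLEB-style classification task cast as text-to-text with Hugging Face Transformers. The model identifier and prompt wording are hypothetical placeholders, not the authors' published interface.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint path; the real identifiers are listed in the
# BBT-FinCUGE-Applications repository and may differ.
MODEL_NAME = "path/to/bbt-fint5-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Cast a financial sentiment example as text-to-text; the prompt format is
# an assumption for illustration only.
prompt = "金融情感分类: 该公司三季度净利润同比增长20%。"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```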

Implications and Future Directions

The introduction of a comprehensive corpus, a large-scale PLM, and a dedicated benchmark establishes a robust foundation for advancing NLP in the Chinese financial sector. This framework not only addresses the existing capacity limitations of prior models but also sets the stage for subsequent enhancements in domain-specific language processing capabilities.

Practically, the BBT-Fin framework supports applications that require precise language understanding and generation in the Chinese financial market. Theoretically, the paper contributes to domain-specific PLM development, particularly regarding effective strategies for integrating external knowledge resources.

Future developments may include expanding the corpus and model scope, exploring multilingual capabilities, and incorporating multimodal data sources to further bolster the adaptability of PLMs in this domain. As domain-specific demands continue to grow, such innovations will be crucial in bridging the gap between general NLP advancements and practical industry applications.
