FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models (2308.09975v2)

Published 19 Aug 2023 in cs.CL

Abstract: LLMs have demonstrated outstanding performance in various natural language processing tasks, but their security capabilities in the financial domain have not been explored, and their performance on complex tasks such as acting as a financial agent remains unknown. This paper presents FinEval, a benchmark designed to evaluate LLMs' financial domain knowledge and practical abilities. The dataset contains 8,351 questions categorized into four key areas: Financial Academic Knowledge, Financial Industry Knowledge, Financial Security Knowledge, and Financial Agent. Financial Academic Knowledge comprises 4,661 multiple-choice questions spanning 34 subjects such as finance and economics. Financial Industry Knowledge contains 1,434 questions covering practical scenarios like investment research. Financial Security Knowledge assesses models through 1,640 questions on topics like application security and cryptography. Financial Agent evaluates tool usage and complex reasoning with 616 questions. FinEval has multiple evaluation settings, including zero-shot, five-shot with chain-of-thought, and assesses model performance using objective and subjective criteria. Our results show that Claude 3.5-Sonnet achieves the highest weighted average score of 72.9 across all financial domain categories under the zero-shot setting. Our work provides a comprehensive benchmark closely aligned with the Chinese financial domain.

Summary

The paper introduces FinEval, a benchmark assessing Chinese financial knowledge with 4,661 questions spanning Finance, Economy, Accounting, and Certificate categories.
The methodology employs diverse prompts, including zero-shot, few-shot, answer-only, and chain-of-thought, to capture both direct and complex reasoning skills.
The evaluation shows GPT-4 achieving nearly 70% accuracy, underlining advanced LLM capabilities while highlighting areas for further domain-specific improvements.

Overview of "FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for LLMs"

The paper "FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for LLMs" introduces FinEval, a specialized benchmark for evaluating LLMs in financial contexts. The need for such a benchmark arises from the critical role that finance plays in shaping societal structures and economic growth, coupled with the increasing application of LLMs across diverse domains.

Key Contributions

1. Benchmark Design and Scope:

FinEval is designed to assess the proficiency of LLMs in Chinese financial knowledge. The benchmark comprises 4,661 multiple-choice questions spread across four major categories: Finance, Economy, Accounting, and Certificate. Together these categories cover 34 distinct subjects relevant to the financial sector. This breadth allows evaluation across many facets of financial knowledge and distinguishes FinEval from other existing benchmarks.

2. Evaluation Methodology:

The benchmark employs a range of prompt settings, including zero-shot, few-shot, answer-only (AO), and chain-of-thought (CoT), to provide a nuanced assessment of LLM performance. Varying the settings helps capture the models' capabilities on both straightforward question answering and more complex reasoning tasks.
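The prompt settings above can be illustrated with a minimal sketch. This is not the benchmark's actual template: the `Item` class, the `build_prompt` helper, and the English wording ("Question:", "Answer:", "Let's think step by step.") are assumptions made for illustration. Passing no exemplars gives a zero-shot prompt, passing a few gives a few-shot prompt, and the `cot` flag switches between answer-only and chain-of-thought formatting.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

LABELS = "ABCD"  # assumed four-option multiple-choice format


@dataclass
class Item:
    """Hypothetical container for one multiple-choice question."""
    question: str
    choices: Sequence[str]
    answer: Optional[str] = None     # gold label, e.g. "B"
    rationale: Optional[str] = None  # worked explanation, used only in CoT exemplars


def render(item: Item, cot: bool, with_answer: bool) -> str:
    """Format one question, optionally followed by its answer (for exemplars)."""
    lines = [f"Question: {item.question}"]
    lines += [f"{label}. {text}" for label, text in zip(LABELS, item.choices)]
    if not with_answer:
        # Target question: leave the answer slot open for the model.
        lines.append("Let's think step by step." if cot else "Answer:")
    elif cot:
        lines.append(
            f"Let's think step by step. {item.rationale} "
            f"So the answer is {item.answer}."
        )
    else:
        lines.append(f"Answer: {item.answer}")
    return "\n".join(lines)


def build_prompt(target: Item, exemplars: Sequence[Item] = (), cot: bool = False) -> str:
    """Zero-shot when `exemplars` is empty, few-shot otherwise."""
    blocks = [render(ex, cot, with_answer=True) for ex in exemplars]
    blocks.append(render(target, cot, with_answer=False))
    return "\n\n".join(blocks)
```

For example, `build_prompt(target)` produces a zero-shot answer-only prompt, while `build_prompt(target, exemplars, cot=True)` produces a few-shot chain-of-thought prompt in which each exemplar shows its reasoning before the answer.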

3. Model Performance and Insights:

By evaluating state-of-the-art Chinese and English LLMs on FinEval, the paper identifies how well current models perform and where they fall short. Notably, GPT-4 achieves the highest accuracy, close to 70% across different settings, which underscores the potential of advanced LLMs in the financial domain.
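Accuracy on multiple-choice questions of this kind can be computed along the following lines. This is a sketch under stated assumptions, not the paper's scoring code: the `extract_choice` and `score` helpers are hypothetical, the answer is taken to be the last standalone A-D letter in the model output, and the overall figure weights each category by its number of questions.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Match A-D not embedded in a longer ASCII word. Lookarounds are used instead
# of \b so that a letter directly following Chinese text is still found.
CHOICE_RE = re.compile(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])")


def extract_choice(output: str) -> Optional[str]:
    """Return the last standalone A-D letter in the output, or None.

    Heuristic only: an English article "A" would also match, so real
    pipelines usually need stricter, template-specific parsing.
    """
    matches = CHOICE_RE.findall(output)
    return matches[-1] if matches else None


def score(
    records: Iterable[Tuple[str, str, str]]
) -> Tuple[Dict[str, float], float]:
    """records: (category, gold_label, model_output) triples.

    Returns per-category accuracy and the question-count-weighted average,
    which equals total correct divided by total questions.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, gold, output in records:
        total[category] += 1
        correct[category] += int(extract_choice(output) == gold)
    per_category = {c: correct[c] / total[c] for c in total}
    weighted = sum(correct.values()) / sum(total.values())
    return per_category, weighted
```

Weighting by question count keeps large categories from being under-represented in the overall score; an unweighted mean of per-category accuracies would instead treat every category equally regardless of size.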

Contributions to the Field

The implications of FinEval are both practical and theoretical. On a practical level, FinEval provides a detailed benchmarking tool that can guide the tuning and improvement of LLMs for better performance in financial applications. The availability of specific data sets and evaluation criteria enables an objective comparison of different LLMs, fostering competition and innovation in model development.

Theoretically, the paper highlights the challenges associated with processing financial data, particularly in the nuanced Chinese context. The empirical results emphasize the complexity of financial problems and the necessity of domain-specific training to achieve meaningful performance improvements. Furthermore, the decrease in model accuracy in CoT settings across many subjects suggests opportunities for further exploration and enhancement in reasoning capabilities.

Future Directions

The paper concludes with plans to extend FinEval to more specialized financial scenarios such as virtual assistants and fraud detection, reflecting a continued effort to refine and expand the evaluation of LLMs in specialized domains. It also identifies tailored instruction tuning of foundation models as an important direction for future research, particularly the use of few-shot learning to adapt models further to domain-specific tasks.

In summary, FinEval offers a broad, structured resource for evaluating and advancing LLM capabilities in the financial domain. Its coverage of 34 subjects and its results across prompt settings give a concrete reference point for domain-specific model assessment.

GitHub

GitHub - SUFE-AIFLM-Lab/FinEval: FinEval is a collection of high-quality multiple-choice questions covering finance, economy, accounting, and certificate domains. (207 stars)