FinBen: A Holistic Financial Benchmark for Large Language Models (2402.12659v2)
Abstract: LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks, covering seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: while LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLM shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets, results, and code are released for the research community: https://github.com/The-FinAI/PIXIU.