Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

Published 27 Jan 2024 in cs.CL and cs.AI | (2402.01722v1)

Abstract: LLMs generate responses to questions; however, their effectiveness is often hindered by sub-optimal quality of answers and occasional failures to provide accurate responses to questions. To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models. The objective is to enhance AI models through continuous feedback loops, utilizing metrics such as cosine similarity, LLM evaluation and Rouge-L scores to evaluate the models. Leveraging LLMs like GPT-3.5, GPT4ALL, and LLaMA2, and Claude, this approach is benchmarked on financial datasets, including the FinanceBench and RAG Instruct Benchmark Tester Dataset, illustrating the necessity of fine-tuning. The results showcase the capability of fine-tuned models to surpass the accuracy of zero-shot LLMs, providing superior question and answering capabilities. Notably, the combination of fine-tuning the LLM with a process known as Retrieval Augmented Generation (RAG) proves to generate responses with improved accuracy.

Abstract PDF HTML Upgrade to Chat

Citations (8)

View on Semantic Scholar

Summary

The paper demonstrates that fine-tuning with supervised and parameter-efficient methods significantly boosts LLM accuracy in financial question-answering.
It employs Retrieval Augmented Generation (RAG) and active retrieval to reduce hallucinations and ground responses in authoritative sources.
Evaluation using metrics like ROUGE-L and cosine similarity confirms the effectiveness of these techniques in enhancing model reliability.

The paper "Enhancing LLM Performance To Answer Questions and Extract Information More Accurately" explores techniques to improve the accuracy of LLM question answering, especially in the financial domain. The central problem addressed is the propensity of LLMs to "hallucinate" or provide inaccurate information, which is unacceptable in contexts requiring high precision, such as financial analysis. The authors employ fine-tuning methods and Retrieval Augmented Generation (RAG) to mitigate these issues.

The paper begins by introducing the challenges associated with LLMs, including their tendency to generate incorrect responses. It highlights the importance of addressing these inaccuracies, particularly in fields like finance, where even minor errors can have significant consequences. The adaptability of LLMs through techniques like zero-shot learning, few-shot learning, and fine-tuning is discussed, setting the stage for the methodologies employed in the study.

The study investigates RAG, a technique designed to enhance the relevance and accuracy of LLM outputs by enabling them to reference external knowledge bases. RAG involves retrieving relevant documents and incorporating them into the input prompt, thereby grounding the LLM's responses in authoritative sources. The limitations of RAG, such as the potential for retrieving irrelevant information and the challenges associated with semantic search and chunking, are also acknowledged.

Fine-tuning is presented as another method for customizing LLM outputs for specific use cases. Unlike RAG, fine-tuning involves modifying the model's parameters using additional data. The paper discusses various fine-tuning methods, including Supervised Fine-tuning (SFT) and Parameter Efficient Fine Tuning (PEFT). SFT involves training the model on paired inputs and outputs, while PEFT aims to reduce the computational resources required for fine-tuning. Low Rank Adaptation (LoRA) and QLoRA (Quantized Low-Rank Adaptation) are introduced as techniques within PEFT that further reduce the number of trainable parameters and memory requirements.

The authors detail their process for enhancing question-answering accuracy through supervised fine-tuning of GPT-3.5 Turbo, reprompting on GPT4All and Llama2, and benchmarking with RAG. The absence of standardized methodologies for evaluating model performance is noted, and the evaluation metrics used such as cosine similarity and Rouge-L are described.

Two question-answering datasets were used for fine-tuning and testing: FinanceBench and RAG Instruct Benchmark tester. FinanceBench, developed by Patronus AI, comprises 10,231 questions about publicly traded companies, along with corresponding answers and evidence strings. The RAG Instruct Benchmark tester dataset, designed by LLMWARE, includes 200 questions with context passages from various 'retrieval scenarios' such as financial news and earnings releases.

A key aspect of the methodology is the preprocessing of data to conform to the specific formats required by different LLMs. For example, the LLaMA-2 model requires data to be formatted with specific tags indicating user vs. system prompts. The authors emphasize the importance of prompt engineering in guiding the model's behavior and reducing hallucinations.

To enhance both question-answering and retrieval abilities, the authors employed several techniques. For question-answering, they utilized human-curated datasets to fine-tune the models. For retrieval, they explored Forward-looking active retrieval augmented generation (FLARE), an active approach that prompts the LLM to generate a hypothetical response and uses both the query and hypothetical response to search for relevant chunks.

The evaluation method involved a detailed assessment using ROUGE-L, cosine similarity, and LLM evaluation. The authors tested multiple versions of each LLM, with varying amounts of labeled data, to assess the impact on performance. The trade-offs between open-source and closed-source LLMs, particularly with respect to data privacy and computational resources, are also discussed.

The results section presents plots showing the increase in Rouge score and cosine similarity score with increasing numbers of labels, demonstrating the effectiveness of fine-tuning. The conclusion reiterates the importance of human feedback and the limitations of simple RAG pipelines for domain-specific questions.

In terms of next steps, the authors suggest further tuning of model parameters, experimentation with different embedding models, and the incorporation of unsupervised fine-tuning and reinforcement learning with human feedback. They also plan to explore additional methods for improving retrieval algorithms, such as implementing a re-ranking algorithm and utilizing metadata annotations.