Calibrating Large Language Models Using Their Generations Only

(2403.05973)
Published Mar 9, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

As LLMs are increasingly deployed in user-facing applications, building trust and maintaining safety by accurately quantifying a model's confidence in its prediction becomes even more important. However, finding effective ways to calibrate LLMs - especially when the only interface to the models is their generated text - remains a challenge. We propose APRICOT (auxiliary prediction of confidence targets): A method to set confidence targets and train an additional model that predicts an LLM's confidence based on its textual input and output alone. This approach has several advantages: It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages, for instance by verbalizing the predicted confidence or adjusting the given answer based on the confidence. We show how our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering to detect incorrect LLM answers.

APRICOT trains an auxiliary model to predict a target LLM's confidence from input and answer.

Overview

  • APRICOT introduces a method for quantifying an LLM's confidence in its responses by training an auxiliary model to predict confidence targets, without needing access to the LLM's internals.

  • The approach applies to both white-box and black-box LLMs, achieving competitive calibration error and improved detection of incorrect answers in closed-book question-answering tasks.

  • The method sets confidence targets by clustering question embeddings, relying only on textual inputs and outputs rather than access to the model's internals or question metadata.

  • In experiments on datasets such as TriviaQA and CoQA, APRICOT has been shown to outperform baselines, indicating its potential for wider application in enhancing user trust and safety in AI.

Calibrating LLMs Through Auxiliary Models Predicting Confidence

Introduction to APRICOT

As LLMs find more applications in user-facing services, ensuring that they provide not just fluent but reliable and trustworthy responses is paramount. A significant challenge in this context is the calibration of LLMs: how can one quantify and improve a model's confidence in its own predictions when interaction with the model is limited to its generated text? The paper introduces APRICOT (auxiliary prediction of confidence targets), a method that tackles this problem by training an auxiliary model to predict the confidence of an LLM's answers based solely on the textual input and output.

Key Contributions

The paper positions APRICOT as a conceptually simple approach to calibrating LLMs that requires no access to the model beyond its outputs. This is particularly useful given the increasing prevalence of black-box LLMs offered as services, where internal model details and token probabilities are not accessible. The auxiliary model trained by APRICOT provides information about the LLM's confidence in its answers without interfering with the language generation process, making it versatile and applicable across various implementations and scenarios. The authors empirically demonstrate APRICOT's effectiveness in reducing calibration error for both white-box and black-box LLMs on closed-book question-answering tasks, specifically focusing on the ability to detect incorrect answers.
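To make the interface concrete, the sketch below scores a (question, answer) pair with a small off-the-shelf text classifier. The backbone (DistilBERT), the randomly initialized single-output head, and the function name predict_confidence are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of an auxiliary confidence predictor that scores an LLM's
# answer from the question and generated answer text alone. The backbone and
# the (randomly initialized) single-output head are assumptions for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
calibrator = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1  # single confidence output
)

def predict_confidence(question: str, answer: str) -> float:
    """Predict the target LLM's confidence using only its textual input and output."""
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logit = calibrator(**inputs).logits.squeeze()
    return torch.sigmoid(logit).item()  # confidence in [0, 1]
```

In practice such a calibrator would first be fine-tuned on (question, answer, target) triples; untrained, the snippet only illustrates the input/output contract: text in, confidence score out.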

Methodological Overview

APRICOT obtains calibration targets without requiring information about the LLM's internals or question metadata. Instead, it relies on the text given to and produced by the LLM: similar questions are clustered based on their embeddings, and the LLM's observed accuracy within each cluster serves as the confidence target for the questions it contains. This approach is both simple and practical, given that many LLMs deployed today expose nothing beyond their generated text.
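The target construction can be sketched as follows, under assumed components (a sentence-transformers embedding model and KMeans clustering; the paper's exact choices of embedder and clustering algorithm may differ): each question's confidence target is the LLM's accuracy on the cluster of semantically similar questions it falls into.

```python
# Sketch of clustering-based calibration targets. The embedding model
# ("all-MiniLM-L6-v2") and KMeans clustering are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def clustering_targets(questions, correctness, n_clusters=10, seed=0):
    """Assign each question, as its calibration target, the LLM's accuracy
    over the cluster of semantically similar questions it belongs to."""
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(questions)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    correctness = np.asarray(correctness, dtype=float)  # 1.0 = answer was correct
    return np.array([correctness[labels == c].mean() for c in labels])
```

These per-question targets are what the auxiliary calibrator can then be trained to regress from the question and answer text.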

Experimentation and Results

The experiments conducted to validate APRICOT's approach are thorough in their methodology and analysis. The authors used datasets such as TriviaQA and CoQA for testing, with both white-box (Vicuna v1.5) and black-box (GPT-3.5) LLMs. APRICOT demonstrated competitive performance in terms of calibration error while also significantly outperforming baselines in detecting incorrect model answers across different scenarios and configurations. Notably, APRICOT effectively calibrated LLMs using both fine-grained targets obtained through clustering and a binary approach that focused on answer correctness.
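For reference, here is a minimal sketch of the two evaluation quantities discussed above: expected calibration error (ECE) over the predicted confidences, and AUROC for flagging incorrect answers. The bin count and toy data are illustrative assumptions.

```python
# Sketch of the evaluation metrics: expected calibration error (ECE) and
# AUROC for detecting incorrect answers. Bin count and toy data are assumed.
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(confidences, correctness, n_bins=10):
    """Size-weighted average gap between accuracy and mean confidence
    over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correctness[in_bin].mean()
                                       - confidences[in_bin].mean())
    return ece

confs = np.array([0.9, 0.8, 0.3, 0.6, 0.2])  # predicted confidences
correct = np.array([1, 1, 0, 0, 0])          # 1 = LLM answer was correct
print("ECE:", expected_calibration_error(confs, correct))
print("AUROC (incorrect-answer detection):",
      roc_auc_score(1 - correct, 1 - confs))  # higher score = likely wrong
```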

Practical Implications and Future Directions

This work underscores the importance of calibrated LLM confidence for improving user trust and safety in AI applications. APRICOT offers a practical solution to a difficult problem, providing a pathway to more reliable and interpretable AI without requiring invasive access to or modification of the underlying models. Looking forward, the techniques presented here could extend to other domains of AI beyond text generation, offering a general method for enhancing model reliability.

Conclusion

In summary, APRICOT offers a compelling approach to the calibration of LLMs through an auxiliary model that requires no internal model access. By leveraging textual inputs and outputs for confidence prediction, APRICOT paves the way for more trustworthy and safe applications of LLMs in real-world scenarios. The method's simplicity, effectiveness, and versatility stand to significantly impact the future development and deployment of LLMs across various industries and applications.
