
Large Language Model Confidence Estimation via Black-Box Access

(2406.04370)
Published Jun 1, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of LLMs with only black-box or query access to them. We propose a simple and extensible framework where we engineer novel features and train an (interpretable) model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating the confidence of flan-ul2, llama-13b and mistral-7b, consistently outperforming existing black-box confidence estimation approaches on benchmark datasets such as TriviaQA, SQuAD, CoQA and Natural Questions by over $10\%$ (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.

Figure: Framework for estimating the confidence of LLM responses using prompt perturbations and a logistic regression model.

Overview

  • The paper presents a novel framework for estimating the confidence of responses generated by LLMs using black-box access and logistic regression based on engineered features.

  • The methodology involves prompt perturbations and featurization techniques to generate data for training an interpretable and efficient logistic regression model, tested across multiple LLMs and datasets.

  • Results show the framework significantly outperforms state-of-the-art baselines on AUROC and AUARC metrics, that confidence models transfer across LLMs on the same dataset, and that a small set of features is consistently predictive across LLM-dataset combinations.

Large Language Model Confidence Estimation via Black-Box Access

The paper, authored by Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, and Prasanna Sattigeri, presents a significant contribution to the field of LLMs. The research focuses on estimating the confidence of LLM responses with only black-box access, addressing a pertinent problem in AI reliability and trustworthiness.

Background and Problem Statement

Confidence estimation in the outputs of LLMs is crucial for several reasons, including trust evaluation, benchmarking, and hallucination mitigation. Unlike traditional tasks where outputs are exact and can be directly compared with ground truths, LLMs often produce varied, semantically equivalent responses. This variability necessitates a sophisticated approach to assess the model's confidence.

Here, the problem is framed formally: given an input prompt $x$ and an LLM $f$, the goal is to estimate the probability that the output $f(x)$ meets or exceeds a semantic similarity threshold $\theta$ with respect to the expected response $y$. Prior approaches have explored this issue, but most rely on strategies that require extensive internal model access or computational resources, limiting their practicality.
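As a sketch of this formulation (using the notation above, with $s(\cdot,\cdot)$ denoting a semantic similarity measure such as the Rouge score used later for labeling; the exact similarity function and threshold are design choices), the quantity being estimated is

$$\hat{c}(x) = \Pr\big[\, s\big(f(x),\, y\big) \ge \theta \,\big].$$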

Methodology

The authors propose a simple but powerful framework that uses logistic regression to estimate LLM response confidence from engineered features obtained through various prompt perturbations. The novelty lies in pairing these features with an interpretable model, which reveals which features drive the confidence estimates. Key components of the methodology include:

Prompt Perturbations (a minimal code sketch follows this list):

  • Stochastic Decoding (SD): Multiple responses for a single prompt using different decoding strategies.
  • Paraphrasing (PP): Back-translation-based paraphrasing of the context in the prompt.
  • Sentence Permutation (SP): Reordering sentences containing named entities to create alternative prompts.
  • Entity Frequency Amplification (EFA): Repeating sentences with named entities within the context.
  • Stopword Removal (SR): Removing stopwords from the context while maintaining the semantic content.
  • Split Response Consistency (SRC): Contradiction detection within split responses using an NLI model.
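To make the perturbations concrete, here is a minimal Python sketch of three of them (SD, SP, SR). The `llm_generate` helper is hypothetical and stands in for whatever query-only API is available; the paper restricts sentence permutation to sentences containing named entities, whereas this sketch shuffles all context sentences for brevity.

```python
import random
import re

# Hypothetical black-box generation call (assumption): wraps whatever
# query-only LLM endpoint is available, matching the paper's access model.
def llm_generate(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("plug in your black-box LLM endpoint here")

# Small illustrative stopword list; a real implementation would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "for", "on"}

def stochastic_decoding(prompt: str, k: int = 5) -> list[str]:
    # SD: sample multiple responses for the same prompt via temperature sampling.
    return [llm_generate(prompt, temperature=1.0) for _ in range(k)]

def sentence_permutation(context: str, question: str) -> str:
    # SP: reorder context sentences to form an alternative prompt
    # (the paper permutes the sentences containing named entities).
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    random.shuffle(sentences)
    return " ".join(sentences) + "\n" + question

def stopword_removal(context: str, question: str) -> str:
    # SR: drop stopwords from the context while keeping the content words.
    kept = [word for word in context.split() if word.lower() not in STOPWORDS]
    return " ".join(kept) + "\n" + question
```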

Featurization (a minimal code sketch follows this list):

  • Semantic Set: Number of sets of semantically equivalent responses among the sampled responses.
  • Syntactic Similarity: Average syntactic similarity between responses.
  • SRC Minimum Value: Highest contradiction probability in the split responses.
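A sketch of how two of these features could be computed, assuming the perturbed responses have already been collected. The pairwise similarity uses Rouge-L from the `rouge-score` package, and the semantic-equivalence check is left as a pluggable callable (e.g., a bidirectional NLI entailment test), since the paper's exact choices may differ.

```python
from itertools import combinations
from rouge_score import rouge_scorer  # pip install rouge-score

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def syntactic_similarity(responses: list[str]) -> float:
    # Average pairwise Rouge-L F1 over the sampled responses.
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    scores = [_scorer.score(a, b)["rougeL"].fmeasure for a, b in pairs]
    return sum(scores) / len(scores)

def num_semantic_sets(responses: list[str], equivalent) -> int:
    # Greedy clustering: responses land in the same set when `equivalent`
    # (e.g., a bidirectional NLI entailment check) judges them equivalent.
    sets: list[list[str]] = []
    for response in responses:
        for group in sets:
            if equivalent(response, group[0]):
                group.append(response)
                break
        else:
            sets.append([response])
    return len(sets)
```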

Training and Validation:

  • Logistic regression is employed for its interpretability and efficiency. The model is trained on features derived from the perturbations, with correctness labels determined by the Rouge score of the response against the ground truth, as sketched below.
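A minimal training sketch under these assumptions; the Rouge-L threshold of 0.3 and the use of scikit-learn are illustrative choices, not necessarily the paper's exact settings.

```python
import numpy as np
from rouge_score import rouge_scorer
from sklearn.linear_model import LogisticRegression

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def correctness_label(response: str, reference: str, thresh: float = 0.3) -> int:
    # A response is labeled correct when its Rouge-L F1 against the ground
    # truth clears the threshold (the threshold value is an assumption).
    return int(_scorer.score(reference, response)["rougeL"].fmeasure >= thresh)

def train_confidence_model(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    # X: one row of engineered perturbation features per prompt (semantic-set
    #    count, SD/PP/SP/EFA/SR syntactic similarities, SRC scores, ...).
    # y: 0/1 correctness labels computed as above on the training split.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

# clf.predict_proba(X_new)[:, 1] then serves as the estimated confidence.
```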

Experimental Results

The framework was evaluated across three prominent LLMs (Mistral-7B-Instruct-v0.2, llama-2-13b-chat, and flan-ul2) on four datasets: CoQA, SQuAD, TriviaQA, and Natural Questions (NQ).

Key Findings:

  • AUROC and AUARC Performance: The framework consistently outperformed state-of-the-art baselines, improving AUROC by over 10% in several cases, with substantial gains in AUARC as well (both metrics are sketched after this list).
  • Feature Importance: Analysis revealed that features like SD syntactic similarity and SP syntactic similarity were critical across various LLM-dataset combinations, indicating the robustness of these features in estimating confidence.
  • Model Transferability: Logistic confidence models trained for one LLM showed effective performance when applied to other LLMs on the same dataset, suggesting the potential for a universal confidence model applicable across multiple LLMs.
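For reference, AUROC can be computed directly with scikit-learn, and one common way to compute AUARC (the area under the accuracy-rejection curve) is to average accuracy over all retention levels when samples are kept in decreasing order of confidence; the paper's exact discretization may differ. A minimal sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auarc(confidence: np.ndarray, correct: np.ndarray) -> float:
    # Area under the accuracy-rejection curve: keep the most-confident
    # fraction of samples and average accuracy over all retention levels.
    order = np.argsort(-confidence)                # most confident first
    correct_sorted = correct[order].astype(float)
    cumulative_acc = np.cumsum(correct_sorted) / np.arange(1, len(correct_sorted) + 1)
    return float(cumulative_acc.mean())

# Usage with per-example confidences and 0/1 correctness labels:
# auroc = roc_auc_score(correct, confidence)
# area_under_arc = auarc(confidence, correct)
```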

Implications and Future Directions

The implications of this research are both practical and theoretical:

  • Practical Impact: The proposed framework provides a scalable and interpretable solution for black-box LLM confidence estimation without requiring internal model access or modifications.
  • Theoretical Insights: The discovery that certain features are universally important across different LLMs and datasets underscores the potential for developing generalizable confidence models.

Future work could extend this approach to more varied LLMs and datasets, including non-English languages, to further validate the framework's robustness. Additionally, incorporating more sophisticated perturbations and exploring deeper featurization strategies might enhance the confidence estimation further.

Conclusion

This paper by Pedapati et al. introduces an effective and interpretable framework for LLM confidence estimation using black-box access. By leveraging engineered features from prompt perturbations and employing logistic regression, the authors deliver a method that not only outperforms existing approaches but also offers insights into the generalizability of confidence features across different LLMs. This work opens avenues for more reliable and trustworthy AI applications, underscoring the importance of confidence estimation in modern AI systems.
