Calibrating the Confidence of Large Language Models by Eliciting Fidelity

Published 3 Apr 2024 in cs.CL (arXiv:2404.02655v2)

Abstract: Large language models (LLMs) optimized with techniques such as RLHF have achieved good alignment in being helpful and harmless. After alignment, however, these LLMs often exhibit overconfidence: their expressed confidence is not well calibrated with their actual correctness rate. In this paper, we decompose LLM confidence into the Uncertainty about the question and the Fidelity to the answer generated by the LLM. We then propose a plug-and-play method to estimate the confidence of LLMs. In experiments with six RLHF-LMs on four MCQA datasets, our method shows good calibration performance. Moreover, we propose two novel metrics, IPR and CE, to evaluate model calibration, and we discuss in detail what constitutes Truly Well-Calibrated Confidence. Our method could serve as a strong baseline, and we hope this work will provide some insights into model confidence calibration.