Calibrating Large Language Models Using Their Generations Only (2403.05973v1)
Abstract: As LLMs are increasingly deployed in user-facing applications, building trust and maintaining safety by accurately quantifying a model's confidence in its predictions becomes even more important. However, finding effective ways to calibrate LLMs - especially when the only interface to the models is their generated text - remains a challenge. We propose APRICOT (auxiliary prediction of confidence targets): a method that sets confidence targets and trains an additional model to predict an LLM's confidence based on its textual input and output alone. This approach has several advantages: it is conceptually simple, does not require access to the target model beyond its output, does not interfere with language generation, and has a multitude of potential uses, for instance verbalizing the predicted confidence or adjusting the given answer based on it. We show that our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering, and that it can detect incorrect LLM answers.
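The core pipeline described in the abstract - train an auxiliary model that maps a question and an LLM's generated answer to a probability that the answer is correct - can be sketched as follows. This is a minimal illustration, not the paper's implementation: the authors fine-tune a pretrained encoder (a DeBERTa-style model), whereas the stand-in below uses a TF-IDF plus logistic-regression classifier, and the toy examples and labels are invented for demonstration.

```python
# Hypothetical sketch of APRICOT's auxiliary calibrator: given only the
# question and the LLM's generated answer as text, predict a confidence
# score (probability the answer is correct). A TF-IDF + logistic
# regression classifier stands in for the fine-tuned encoder used in
# the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy calibration set: (question, LLM answer) strings paired with binary
# correctness targets, e.g. obtained by comparing the generated answers
# against gold answers on a held-out QA set.
inputs = [
    "Q: What is the capital of France? A: Paris",
    "Q: What is the capital of France? A: Lyon",
    "Q: Who wrote Hamlet? A: Shakespeare",
    "Q: Who wrote Hamlet? A: Marlowe",
]
targets = [1, 0, 1, 0]  # 1 = answer judged correct, 0 = incorrect

# The auxiliary model sees text alone -- no logits or internals of the
# target LLM are needed, which is what makes the approach black-box.
calibrator = make_pipeline(TfidfVectorizer(), LogisticRegression())
calibrator.fit(inputs, targets)

# At deployment, the predicted probability of correctness serves as the
# confidence estimate for a new (question, answer) pair.
confidence = calibrator.predict_proba(
    ["Q: What is the capital of France? A: Paris"]
)[0, 1]
print(f"predicted confidence: {confidence:.2f}")
```

Because the calibrator consumes only generated text, the same recipe applies unchanged to API-only (black-box) LLMs; the predicted confidence can then be verbalized back to the user or used to flag likely-incorrect answers.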