Alignment for Honesty (2312.07000v2)
Abstract: Recent research has made significant strides in aligning LLMs with helpfulness and harmlessness. In this paper, we argue for the importance of alignment for \emph{honesty}, ensuring that LLMs proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. However, a pivotal aspect of alignment for honesty involves discerning an LLM's knowledge boundaries, which demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. We address these challenges by first establishing a precise problem definition and defining ``honesty'' inspired by the Analects of Confucius. This definition serves as a cornerstone for developing metrics that effectively measure an LLM's honesty by quantifying its progress after alignment. Furthermore, we introduce a flexible training framework, instantiated with several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Our extensive experiments reveal that these aligned models show a marked increase in honesty, as indicated by our proposed metrics. We open-source all relevant resources to facilitate future research at \url{https://github.com/GAIR-NLP/alignment-for-honesty}.
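The abstract frames honesty as refusing questions outside the model's knowledge while not becoming overly conservative. As an illustration only of how such behavior might be quantified (the field names and metric labels below are assumptions for this sketch, not the paper's exact metric definitions), one can tally each evaluated response as correct, wrong, or refused, and compare the resulting rates before and after alignment:

```python
from dataclasses import dataclass


@dataclass
class Response:
    correct: bool   # answer matches the gold label (only meaningful when not refused)
    refused: bool   # model explicitly declined to answer ("I don't know")


def honesty_scores(responses: list[Response]) -> dict[str, float]:
    """Illustrative honesty-style metrics over labeled model responses.

    These are assumed, example metrics, not the paper's definitions:
      - accuracy:     fraction of questions answered correctly
      - refusal_rate: fraction of questions the model declined
      - wrong_rate:   fraction answered incorrectly (ideally reduced
                      after honesty alignment, as unknown questions
                      are refused instead of answered wrongly)
    """
    n = len(responses)
    correct = sum(r.correct and not r.refused for r in responses)
    refused = sum(r.refused for r in responses)
    wrong = n - correct - refused
    return {
        "accuracy": correct / n,
        "refusal_rate": refused / n,
        "wrong_rate": wrong / n,
    }


if __name__ == "__main__":
    demo = [Response(True, False), Response(False, True), Response(False, False)]
    print(honesty_scores(demo))
```

Under this reading, a model aligned for honesty should lower the wrong rate by converting would-be incorrect answers into refusals, while largely preserving accuracy on questions it can actually answer.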
- Zeyuan Allen-Zhu and Yuanzhi Li. 2023. Physics of language models: Part 3.2, knowledge manipulation. CoRR, abs/2309.14402.
- PaLM 2 technical report. CoRR, abs/2305.10403.
- A general language assistant as a laboratory for alignment. CoRR, abs/2112.00861.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR, abs/2204.05862.
- Constitutional AI: harmlessness from AI feedback. CoRR, abs/2212.08073.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
- Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Extracting training data from large language models. In 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, pages 2633–2650. USENIX Association.
- FacTool: Factuality detection in generative AI - A tool augmented framework for multi-task and multi-domain scenarios. CoRR, abs/2307.13528.
- Scaling instruction-finetuned language models. CoRR, abs/2210.11416.
- Selectively answering ambiguous questions. CoRR, abs/2305.14613.
- The Analects of Confucius.
- UltraFeedback: Boosting language models with high-quality feedback. CoRR, abs/2310.01377.
- Enhancing chat language models by scaling high-quality instructional conversations. CoRR, abs/2305.14233.
- RAFT: reward ranked finetuning for generative foundation model alignment. CoRR, abs/2304.06767.
- Truthful AI: developing and governing AI that does not lie. CoRR, abs/2110.06674.
- Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR.
- Improving alignment of dialogue agents via targeted human judgements. CoRR, abs/2209.14375.
- Textbooks are all you need. CoRR, abs/2306.11644.
- Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. CoRR, abs/2307.04657.
- Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12):248:1–248:38.
- How can we know when language models know? On the calibration of language models for question answering. Trans. Assoc. Comput. Linguistics, 9:962–977.
- Active retrieval augmented generation. CoRR, abs/2305.06983.
- TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601–1611. Association for Computational Linguistics.
- Personas as a way to model truthfulness in language models. CoRR, abs/2310.18168.
- Language models (mostly) know what they know. CoRR, abs/2207.05221.
- Challenges and applications of large language models. CoRR, abs/2307.10169.
- Alignment of language agents. CoRR, abs/2103.14659.
- Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452–466.
- Factuality enhanced language models for open-ended text generation. In NeurIPS.
- Generative judge for evaluating alignment. CoRR, abs/2310.05470.
- Inference-time intervention: Eliciting truthful answers from a language model. CoRR, abs/2306.03341.
- Self-alignment with instruction backtranslation. CoRR, abs/2308.06259.
- Textbooks are all you need II: phi-1.5 technical report. CoRR, abs/2309.05463.
- Let’s verify step by step. CoRR, abs/2305.20050.
- Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 21-26 July, 2004, Barcelona, Spain, pages 605–612. ACL.
- Teaching models to express their uncertainty in words. Trans. Mach. Learn. Res., 2022.
- TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3214–3252. Association for Computational Linguistics.
- Trustworthy LLMs: a survey and guideline for evaluating large language models' alignment. CoRR, abs/2308.05374.
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- CoLLiE: Collaborative training of large language models in an efficient way. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023, pages 527–542. Association for Computational Linguistics.
- James E. Mahon. 2015. The definition of lying and deception.
- When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9802–9822. Association for Computational Linguistics.
- FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. CoRR, abs/2305.14251.
- AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 5783–5797. Association for Computational Linguistics.
- WebGPT: Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332.
- OpenAI. 2023a. GPT-4 technical report. CoRR, abs/2303.08774.
- OpenAI. 2023b. Introducing ChatGPT.
- Training language models to follow instructions with human feedback. In NeurIPS.
- How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions. CoRR, abs/2309.15840.
- AI deception: A survey of examples, risks, and potential solutions. CoRR, abs/2308.14752.
- Check your facts and try again: Improving large language models with external knowledge and automated feedback. CoRR, abs/2302.12813.
- John Schulman. 2023. Reinforcement learning from human feedback: Progress and challenges.
- Towards understanding sycophancy in language models. CoRR, abs/2310.13548.
- REPLUG: retrieval-augmented black-box language models. CoRR, abs/2301.12652.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. CoRR, abs/2305.14975.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
- Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13484–13508. Association for Computational Linguistics.
- Simple synthetic data reduces sycophancy in large language models. CoRR, abs/2308.03958.
- Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. CoRR, abs/2306.13063.
- WizardLM: Empowering large language models to follow complex instructions. CoRR, abs/2304.12244.
- Improving language models via plug-and-play retrieval feedback. CoRR, abs/2305.14002.
- RRHF: rank responses to align language models with human feedback without tears. CoRR, abs/2304.05302.
- Eliezer Yudkowsky. 2018. Meta-honesty: Firming up honesty around its edge-cases.
- Siren’s song in the AI ocean: A survey on hallucination in large language models. CoRR, abs/2309.01219.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. CoRR, abs/2306.05685.
- Why does ChatGPT fall short in providing truthful answers?
- LIMA: less is more for alignment. CoRR, abs/2305.11206.
- Navigating the grey area: Expressions of overconfidence and uncertainty in language models. CoRR, abs/2302.13439.
- Zeyuan Allen-Zhu and Yuanzhi Li. 2023. Physics of language models: Part 3.1, knowledge storage and extraction. CoRR, abs/2309.14316.