Contrastive Decoding Improves Reasoning in Large Language Models (2309.09117v2)
Abstract: We demonstrate that Contrastive Decoding, a simple, computationally light, and training-free text generation method proposed by Li et al. (2022), achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between a strong model and a weak model. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler failure modes such as copying sections of the input during chain-of-thought. Overall, Contrastive Decoding outperforms nucleus sampling for long-form generation and greedy decoding for reasoning tasks, making it a powerful general-purpose method for generating text from LLMs.
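The "weighted difference in likelihood between strong and weak models" can be illustrated with a single next-token step. The following is a minimal sketch, not the authors' implementation: the function names, the toy logits, the plausibility threshold `alpha`, and the contrast weight `beta` are all assumptions introduced here for illustration, and the abstract itself does not specify them.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_step(strong_logits, weak_logits, alpha=0.1, beta=0.5):
    """Score one decoding step by contrasting a strong and a weak model.

    alpha and beta are illustrative hyperparameters (assumed, not from the
    abstract). Tokens whose strong-model probability is below alpha times
    the strong model's top probability are masked out; the remaining tokens
    are scored with a weighted log-likelihood difference.
    """
    strong_lp = log_softmax(strong_logits)
    weak_lp = log_softmax(weak_logits)

    # Plausibility cutoff in log space: log(alpha * max_p) = log(alpha) + max_lp.
    cutoff = math.log(alpha) + max(strong_lp)

    scores = [
        (1 + beta) * s - beta * w if s >= cutoff else float("-inf")
        for s, w in zip(strong_lp, weak_lp)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy four-token vocabulary with hypothetical logits.
strong = [2.0, 1.8, -3.0, 0.0]
weak = [2.0, 0.0, 3.0, 0.0]

greedy_choice = max(range(len(strong)), key=strong.__getitem__)
contrastive_choice, step_scores = contrastive_step(strong, weak)
print("greedy:", greedy_choice, "contrastive:", contrastive_choice)
```

In this toy setup, greedy decoding on the strong model alone picks token 0, which the weak model also rates highly. The contrastive score instead prefers token 1, which the strong model likes almost as much but the weak model does not. Token 2 is excluded by the plausibility mask, so a token the strong model considers very unlikely cannot win merely because the two models disagree about it.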
- PaLM 2 technical report, 2023.
- PIQA: Reasoning about physical commonsense in natural language, 2019.
- DoLa: Decoding by contrasting layers improves factuality in large language models, 2023.
- Scaling instruction-finetuned language models, 2022.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions, 2019.
- Think you have solved question answering? Try ARC, the AI2 reasoning challenge, 2018.
- Training verifiers to solve math word problems, 2021.
- Hierarchical neural story generation, 2018.
- Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance, 2023.
- Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies, 2021.
- ROSCOE: A suite of metrics for scoring step-by-step reasoning, 2022.
- Measuring massive multitask language understanding, 2021a.
- Measuring mathematical problem solving with the MATH dataset, 2021b.
- The curious case of neural text degeneration, 2020.
- TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension, 2017.
- Scaling laws for neural language models, 2020.
- Discriminator-guided multi-step reasoning with language models, 2023.
- Contrastive decoding: Open-ended text generation as optimization, 2022.
- Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 158–167, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1015.
- DExperts: Decoding-time controlled text generation with experts and anti-experts, 2021.
- Locally typical sampling, 2023.
- A diverse corpus for evaluating and developing English math word problem solvers, 2021.
- Can a suit of armor conduct electricity? A new dataset for open book question answering, 2018.
- OpenAI. GPT-4 technical report, 2023.
- Are NLP models really able to solve simple math word problems?, 2021.
- Reasoning with language model prompting: A survey, 2023.
- WinoGrande: An adversarial Winograd schema challenge at scale, 2019.
- SocialIQA: Commonsense reasoning about social interactions, 2019.
- CommonsenseQA: A question answering challenge targeting commonsense knowledge, 2019.
- LLaMA: Open and efficient foundation language models, 2023.
- Towards understanding chain-of-thought prompting: An empirical study of what matters, 2023a.
- Self-consistency improves chain of thought reasoning in language models, 2023b.
- Chain-of-thought prompting elicits reasoning in large language models, 2023.
- FUDGE: Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.naacl-main.276.
- Surfacing biases in large language models using contrastive input decoding, 2023.
- HellaSwag: Can a machine really finish your sentence?, 2019.