Deductive Verification of Chain-of-Thought Reasoning
Abstract: Large Language Models (LLMs) benefit significantly from Chain-of-Thought (CoT) prompting on a variety of reasoning tasks. While CoT allows models to produce more comprehensive reasoning processes, its emphasis on intermediate reasoning steps can inadvertently introduce hallucinations and accumulated errors, thereby limiting models' ability to solve complex reasoning tasks. Inspired by how humans engage in careful and meticulous deductive logical reasoning to solve tasks, we seek to enable LLMs to perform explicit and rigorous deductive reasoning, and to ensure the trustworthiness of their reasoning process through self-verification. However, directly verifying the validity of an entire deductive reasoning process is challenging, even with advanced models like ChatGPT. In light of this, we propose to decompose the reasoning verification process into a series of step-by-step subprocesses, each receiving only its necessary context and premises. To facilitate this procedure, we propose Natural Program, a natural language-based deductive reasoning format. Our approach enables models to generate precise reasoning steps in which subsequent steps are more rigorously grounded in prior steps. It also empowers LLMs to carry out reasoning self-verification in a step-by-step manner. By integrating this verification process into each deductive reasoning stage, we significantly enhance the rigor and trustworthiness of the generated reasoning steps. In the process, we also improve answer correctness on complex reasoning tasks. Code will be released at https://github.com/lz1oceani/verify_cot.
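The step-by-step verification idea in the abstract can be illustrated with a minimal Python sketch. It is not the released code and does not reproduce the Natural Program format: the `Step` layout, the `#n` labels, the `verify_chain` and `chain_is_valid` helpers, and the `toy_verifier` are all hypothetical names chosen for illustration. The verifier is a plain callable standing in for an LLM call. The point shown is only that each step is checked in isolation, with the statements it cites as its sole context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One reasoning step plus the labels of the statements it relies on.

    Hypothetical layout for illustration; not the paper's actual format.
    """
    label: str
    text: str
    cites: Tuple[str, ...]


# A verifier sees only the cited statements and the step under test.
# In practice this would wrap an LLM call (assumption); here it is any callable.
Verifier = Callable[[List[str], str], bool]


def verify_chain(premises: Dict[str, str], steps: List[Step],
                 verifier: Verifier) -> List[bool]:
    """Check each step on its own, passing only the context it cites."""
    known = dict(premises)  # statements available so far, keyed by label
    verdicts = []
    for step in steps:
        if all(label in known for label in step.cites):
            # Minimal context: cited statements only, not the whole chain.
            context = [known[label] for label in step.cites]
            verdicts.append(verifier(context, step.text))
        else:
            # A step that cites an unknown statement cannot be grounded.
            verdicts.append(False)
        known[step.label] = step.text  # later steps may cite this one
    return verdicts


def chain_is_valid(verdicts: List[bool]) -> bool:
    """Accept a reasoning chain only if every individual step passed."""
    return all(verdicts)


def toy_verifier(context: List[str], step_text: str) -> bool:
    """Stand-in for a model call: accept any step that cites something."""
    return len(context) > 0


premises = {
    "#1": "Alice has 3 apples.",
    "#2": "Bob has 4 apples.",
    "#3": "It is raining.",  # irrelevant; never passed to the verifier
}
steps = [
    Step("#4", "Together they have 3 + 4 = 7 apples.", ("#1", "#2")),
    Step("#5", "The answer is 7.", ("#4",)),
]

verdicts = verify_chain(premises, steps, toy_verifier)
print(verdicts, chain_is_valid(verdicts))
```

Because the verifier never sees uncited statements such as `#3`, an irrelevant premise cannot distract the check of a given step. Swapping `toy_verifier` for a real model call would leave the rest of the sketch unchanged.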