
General Purpose Verification for Chain of Thought Prompting (2405.00204v1)

Published 30 Apr 2024 in cs.CL and cs.AI

Abstract: Many of the recent capabilities demonstrated by LLMs arise primarily from their ability to exploit contextual information. In this paper, we explore ways to improve reasoning capabilities of LLMs through (1) exploration of different chains of thought and (2) validation of the individual steps of the reasoning process. We propose three general principles that a model should adhere to while reasoning: (i) Relevance, (ii) Mathematical Accuracy, and (iii) Logical Consistency. We apply these constraints to the reasoning steps generated by the LLM to improve the accuracy of the final generation. The constraints are applied in the form of verifiers: the model itself is asked to verify if the generated steps satisfy each constraint. To further steer the generations towards high-quality solutions, we use the perplexity of the reasoning steps as an additional verifier. We evaluate our method on 4 distinct types of reasoning tasks, spanning a total of 9 different datasets. Experiments show that our method is always better than vanilla generation, and, in 6 out of the 9 datasets, it is better than best-of N sampling which samples N reasoning chains and picks the lowest perplexity generation.


Summary

  • The paper introduces a verification framework that enforces relevance, mathematical accuracy, and logical consistency in chain-of-thought prompting.
  • It employs dedicated verifiers and a geometric mean aggregation strategy to systematically score reasoning steps and improve performance on benchmarks like GSM8k.
  • Human evaluations correlate with verifier metrics, highlighting both the framework’s benefits and potential areas for further refinement in LLM reasoning.

General Purpose Verification for Chain of Thought Prompting

Introduction

The paper "General Purpose Verification for Chain of Thought Prompting" describes a methodology aimed at enhancing the reasoning capabilities of LLMs. The authors focus on two main areas: exploring diverse chains of thought and validating individual reasoning steps. They propose three key principles for ensuring sound reasoning: relevance, mathematical accuracy, and logical consistency. These principles are enforced through dedicated verifiers, which check if each reasoning step adheres to these constraints. Additionally, the perplexity of the reasoning steps is used as an auxiliary verifier to guide the models towards producing high-quality solutions.

Methodology

Solution Generation

The approach uses a solution generator, typically an LLM, that takes a prompt and produces a sequence of reasoning steps. Generation follows the chain-of-thought prompting method: intermediate reasoning steps are produced explicitly, which tends to yield more robust solutions than single-step generation. The prompt encourages the LLM to reason in a manner akin to step-by-step human reasoning.
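
To make this concrete, here is a minimal sketch of step-wise generation; the prompt wording and the `call_llm` helper are illustrative stand-ins, not the authors' actual code or API.

```python
# Minimal sketch of step-by-step chain-of-thought generation.
# `call_llm` is a hypothetical placeholder, not the authors' API.
def call_llm(prompt: str) -> str:
    # Replace with a real model call; the canned output keeps the sketch runnable.
    return "Step 1: The train travels 60 km in 1.5 hours.\nStep 2: 60 / 1.5 = 40.\nAnswer: 40 km/h"

def generate_chain(question: str) -> list[str]:
    prompt = (
        "Answer the question by reasoning step by step, one step per line.\n"
        f"Question: {question}\n"
        "Steps:\n"
    )
    completion = call_llm(prompt)
    # Treat each non-empty line of the completion as one reasoning step.
    return [line.strip() for line in completion.splitlines() if line.strip()]

steps = generate_chain("A train travels 60 km in 1.5 hours. What is its average speed?")
print(steps)
```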

Step Verification

Verification of reasoning steps operates on three principles (an illustrative sketch of the corresponding checks follows the list):

  1. Relevance: Ensures that each step contributes meaningfully to the final solution, avoiding irrelevant or unrelated information.
  2. Mathematical Accuracy: Verifies the correctness of any mathematical calculations in a reasoning step, employing structured extraction of mathematical expressions to validate them systematically.
  3. Logical Consistency: Assesses whether a reasoning step contradicts prior steps, ensuring coherence throughout the reasoning chain.

Figure 1: An example of each of our proposed verifiers applied to a given question and previous steps.
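
The sketch below illustrates how such self-verification might be prompted; the verifier prompts, the binary scoring, and the `call_llm` stub are assumptions made for illustration, not the paper's exact implementation.

```python
# Illustrative self-verification sketch; prompts and scoring are assumptions.
VERIFIER_PROMPTS = {
    "relevance": "Does the step below help answer the question? Answer yes or no.",
    "math_accuracy": "Is every calculation in the step below arithmetically correct? Answer yes or no.",
    "logical_consistency": "Does the step below contradict any previous step? Answer yes or no.",
}

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real model call.
    return "yes"

def verify_step(question: str, previous_steps: list[str], step: str) -> dict[str, float]:
    context = "\n".join(previous_steps) or "(none)"
    scores: dict[str, float] = {}
    for name, instruction in VERIFIER_PROMPTS.items():
        prompt = (
            f"{instruction}\n"
            f"Question: {question}\n"
            f"Previous steps:\n{context}\n"
            f"Step: {step}"
        )
        said_yes = call_llm(prompt).strip().lower().startswith("yes")
        if name == "logical_consistency":
            scores[name] = 0.0 if said_yes else 1.0  # "yes" means a contradiction was found
        else:
            scores[name] = 1.0 if said_yes else 0.0
    return scores
```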

Aggregation Strategy

To derive a comprehensive score for a complete reasoning chain, the verifiers' outputs are aggregated. Each reasoning step is scored individually, and a geometric mean combines these step-level scores into an overall chain score. Because the geometric mean penalizes low values heavily, a single weak step noticeably lowers the score of the whole chain.
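
As a rough sketch of this aggregation, the example below averages each step's verifier scores and then combines the per-step values with a geometric mean; the per-step averaging is an assumed simplification, since the exact weighting of verifiers is not reproduced in this summary.

```python
import math

def aggregate_chain_score(step_scores: list[dict[str, float]]) -> float:
    """Combine per-step verifier scores into one chain score via a geometric mean."""
    # Average each step's verifier scores (relevance, math accuracy, consistency,
    # plus any extra verifiers such as perplexity-based scores).
    per_step = [sum(s.values()) / len(s) for s in step_scores]
    # Clamp to avoid log(0) when a verifier assigns a hard zero.
    per_step = [max(v, 1e-6) for v in per_step]
    return math.exp(sum(math.log(v) for v in per_step) / len(per_step))

# Example: a three-step chain scored by the three verifiers.
chain = [
    {"relevance": 1.0, "math_accuracy": 1.0, "logical_consistency": 1.0},
    {"relevance": 1.0, "math_accuracy": 0.5, "logical_consistency": 1.0},
    {"relevance": 0.8, "math_accuracy": 1.0, "logical_consistency": 1.0},
]
print(aggregate_chain_score(chain))  # roughly 0.92
```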

Experimental Results

Datasets and Results

The evaluation spans four types of reasoning tasks across nine datasets, including GSM8k and BigBench Date Understanding. The proposed verification framework consistently outperforms vanilla generation and, on six of the nine datasets, also surpasses a best-of-N baseline that samples N reasoning chains and keeps the lowest-perplexity one.

Figure 2: Correlation between the scores of our proposed verifiers and the assessment of human annotators over the three reasoning principles explored in this work. All correlations have a p-value of less than 0.0001.
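
To make the comparison with the best-of-N baseline concrete, the hypothetical sketch below selects the chain with the highest aggregated verifier score, whereas the baseline keeps the lowest-perplexity chain; all helper functions here are placeholders, not the paper's implementation.

```python
import math
import random

def sample_chain(question: str) -> list[str]:
    # Placeholder sampler; a real system would call the LLM with temperature > 0.
    return [f"Step {i + 1}: ... ({random.random():.2f})" for i in range(3)]

def chain_verifier_score(chain: list[str]) -> float:
    # Placeholder for the aggregated verifier score described above.
    return random.random()

def chain_perplexity(chain: list[str]) -> float:
    # Placeholder for the model-assigned perplexity of the chain.
    return math.exp(random.uniform(0.5, 2.0))

def pick_best(question: str, n: int = 5):
    chains = [sample_chain(question) for _ in range(n)]
    by_verifier = max(chains, key=chain_verifier_score)   # verifier-guided choice
    by_perplexity = min(chains, key=chain_perplexity)     # best-of-N baseline
    return by_verifier, by_perplexity

print(pick_best("A train travels 60 km in 1.5 hours. What is its average speed?"))
```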

Human Evaluation

To assess the alignment of the verification framework with human judgment, an extensive human evaluation was conducted. The verifiers' scores show moderate positive correlations with human assessments, indicating that the principles capture a substantial portion of reasoning quality while leaving room to improve the individual verifiers.
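
Correlations of this kind can be computed as in the short sketch below; the choice of Spearman rank correlation and the toy data are assumptions for illustration, not the paper's reported statistic or values.

```python
# Sketch of measuring agreement between verifier scores and human ratings.
from scipy.stats import spearmanr

verifier_scores = [0.92, 0.40, 0.75, 0.10, 0.88, 0.55]  # per-step verifier outputs (toy data)
human_ratings = [5, 2, 4, 1, 5, 3]                       # per-step human judgments (toy data)

rho, p_value = spearmanr(verifier_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```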

Implications and Future Work

The authors conclude that although current verification methodologies provide notable improvements on reasoning tasks, there is room to further refine verifier accuracy and applicability. Developing more precise verifiers and adapting them to specific domains could offer substantial benefits. Investigating alternative strategies for constructing verifiers that maintain performance while remaining computationally efficient also remains a vital avenue for future work.

Conclusion

The verification framework for chain-of-thought prompting provides a novel approach to refine LLM reasoning by integrating model-agnostic and task-agnostic principles that align with human evaluation criteria. By demonstrating measurable performance gains across various datasets, this paper establishes a foundation for more robust and interpretable LLM applications in complex reasoning tasks.

This paper opens pathways to leveraging structured verification methods, potentially translating into better decision-making and inference capabilities in LLMs across diverse applied contexts.
