Large Language Models are Better Reasoners with Self-Verification (2212.09561v5)

Published 19 Dec 2022 in cs.AI and cs.CL

Abstract: Recently, with the chain of thought (CoT) prompting, LLMs, e.g., GPT-3, have shown strong reasoning ability in several natural language processing tasks such as arithmetic, commonsense, and logical reasoning. However, LLMs with CoT require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes and vulnerable to error accumulation. The above issues make the LLMs need the ability to verify the answers. In fact, after inferring conclusions in some thinking decision tasks, people often check them by re-verifying steps to avoid some mistakes. In this paper, we propose and prove that LLMs also have similar self-verification abilities. We take the conclusion obtained by CoT as one of the conditions for solving the original problem. By performing a backward verification of the answers that LLM deduced for itself, we can obtain interpretable answer validation scores to select the candidate answer with the highest score. Experimental results demonstrate that the proposed method can improve the reasoning performance on various arithmetic, commonsense, and logical reasoning datasets. Our code is publicly available at: https://github.com/WENGSYX/Self-Verification.

Citations (135)

Summary

  • The paper introduces a self-verification method that utilizes backward verification to reduce cumulative reasoning errors in large language models.
  • It employs condition masking and sampling-based candidate generation to enhance performance in arithmetic, commonsense, and logical tasks.
  • Experimental results demonstrate significant accuracy improvements across datasets, highlighting the method's robustness and scalability in data-scarce scenarios.

LLMs Are Better Reasoners with Self-Verification

Introduction

The paper investigates the reasoning capabilities of LLMs when equipped with self-verification mechanisms. Models like GPT-3 have leveraged chain of thought (CoT) prompting to perform reasoning tasks, yet are susceptible to cumulative errors due to their sensitivity to minor mistakes. This research introduces a novel self-verification strategy, paralleling human cognitive practices where individuals re-evaluate their conclusions to mitigate errors. The proposed method incorporates backward verification of solutions derived from CoT, enhancing the models' reasoning reliability without the need for additional verifiers.

Methodology

Forward Reasoning

Forward reasoning involves standard CoT prompting, where LLMs are tasked to solve a problem by generating a sequence of intermediate steps leading to a final answer. This process generates multiple candidate answers via sampling decoding, promoting diversity and enhancing robustness against prediction errors (Figure 1).

Figure 1: The answer to a question can be verified by masking and predicting the conditions of the original context. To mimic the self-verification ability of humans, we predict the accuracy of $f_\mathcal{C}$.
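
As a rough illustration of this step, the sketch below samples several CoT completions and extracts a candidate answer from each. The generic `llm` callable (prompt in, completion out), the prompt wording, and the last-number answer-extraction heuristic are assumptions made for illustration, not the paper's exact implementation.

```python
import re
from typing import Callable, List

def sample_candidates(llm: Callable[[str], str],
                      question: str,
                      few_shot_prompt: str,
                      num_samples: int = 5) -> List[str]:
    """Sample several CoT completions and extract a candidate answer from each."""
    candidates = []
    for _ in range(num_samples):
        # Sampling decoding (temperature > 0) is assumed to be configured inside `llm`,
        # so repeated calls yield diverse reasoning chains.
        completion = llm(f"{few_shot_prompt}\nQ: {question}\nA: Let's think step by step.")
        # Illustrative heuristic: treat the last number in the completion as the answer.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        if numbers:
            candidates.append(numbers[-1])
    return candidates
```

Because decoding is sampled rather than greedy, repeated calls produce diverse reasoning chains and possibly conflicting candidate answers, which the verification step then arbitrates.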

Backward Verification

The verification process follows forward reasoning, focusing on assessing and validating each candidate answer's correctness. The process entails rewriting each candidate conclusion as a declarative statement and utilizing methods such as Condition Mask Verification (CMV) or True-False Item Verification (TFV) to ascertain consistency. These verifications involve masking certain conditions and re-predicting them, subsequently comparing the predicted conditions with the original values to calculate verification scores (Figure 2).

Figure 2: Example of self-verification. In step one, LLM generates candidate answers and forms different conclusions. Then, in step two, LLM verifies these conclusions in turn and computes the verification score.
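
A minimal sketch of the condition-masking idea (in the spirit of CMV for arithmetic questions) is shown below, using the same generic `llm` callable as above. The masked condition is chosen naively, the candidate conclusion is appended as a simple declarative sentence, and the prompt template is assumed rather than taken from the paper; TFV, which instead asks the model to judge true/false items, is not shown.

```python
import re
from typing import Callable, List

def cmv_score(llm: Callable[[str], str],
              question: str,
              candidate_answer: str,
              num_passes: int = 3) -> int:
    """Mask one numeric condition, state the candidate answer as a fact,
    and count how often the model re-derives the masked value."""
    conditions = re.findall(r"\d+(?:\.\d+)?", question)
    if not conditions:
        return 0
    masked_value = conditions[0]
    masked_question = question.replace(masked_value, "X", 1)
    prompt = (f"{masked_question} The answer is {candidate_answer}. "
              f"What is the value of X? Let's think step by step.")
    score = 0
    for _ in range(num_passes):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", llm(prompt))
        if numbers and numbers[-1] == masked_value:
            score += 1
    return score

def select_answer(llm: Callable[[str], str],
                  question: str,
                  candidates: List[str]) -> str:
    """Return the candidate whose conclusion yields the highest verification score."""
    return max(candidates, key=lambda c: cmv_score(llm, question, c))
```

The key design choice is that verification runs the model "backwards": the candidate conclusion is treated as a given condition, and the score reflects how reliably the original conditions can be recovered from it.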

Experimental Results

Extensive experiments conducted on various reasoning datasets demonstrated significant performance improvements when integrating self-verification. The model's accuracy in solving arithmetic, commonsense, and logical tasks noticeably increased, as evidenced by comparison against traditional CoT and other forward reasoning methods (Figure 3).

Figure 3: The self-verification ability of models with different sizes.

Dataset Performance

  • Arithmetic Reasoning: The method showed a marked improvement in datasets like GSM8K and SingleEq, achieving higher accuracy than prior state-of-the-art baselines.
  • Commonsense Reasoning: Self-verification provided moderate enhancements but was less impactful compared to arithmetic tasks.
  • Logical Reasoning: The approach helped rectify errors from CoT-generated solutions, showcasing improvements in datasets like Date Understanding (Figure 4).

    Figure 4: Problem solve rate (%) comparison of 2-shot to 8-shot prompts.

Discussion

The introduction of self-verification without additional training data or verifiers positions this method as a scalable solution viable across multiple domains. The empirical results indicate that larger models, characterized by superior reasoning capabilities, benefit more significantly from self-verification. Interestingly, the robustness of the approach with limited data, such as low-shot scenarios, underscores its utility in data-constrained environments (Figure 5).

Figure 5: Comparison of problem solve rate (%) between single-condition verification and multiple-condition verification.
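
Under the same assumptions as the earlier sketch, multiple-condition verification can be approximated by masking each numeric condition in turn and summing the per-condition scores; the loop below is illustrative only, not the paper's exact procedure.

```python
import re
from typing import Callable

def multi_condition_score(llm: Callable[[str], str],
                          question: str,
                          candidate_answer: str) -> int:
    """Mask each numeric condition in turn and sum how many are correctly re-derived."""
    total = 0
    for value in re.findall(r"\d+(?:\.\d+)?", question):
        masked = question.replace(value, "X", 1)
        prompt = (f"{masked} The answer is {candidate_answer}. "
                  f"What is the value of X? Let's think step by step.")
        numbers = re.findall(r"-?\d+(?:\.\d+)?", llm(prompt))
        if numbers and numbers[-1] == value:
            total += 1
    return total
```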

Conclusion

The research asserts that LLMs equipped with self-verification mechanisms exhibit enhanced reasoning capabilities. By utilizing backward verification, these models can efficiently self-assess and validate their predictions, thus improving the accuracy and reliability of reasoning tasks. Future explorations could focus on optimizing backward verification for diverse reasoning challenges and further honing LLM reasoning by integrating advanced self-verification strategies.