
Verification and Refinement of Natural Language Explanations through LLM-Symbolic Theorem Proving (2405.01379v4)

Published 2 May 2024 in cs.CL

Abstract: Natural language explanations represent a proxy for evaluating explanation-based and multi-step Natural Language Inference (NLI) models. However, assessing the validity of explanations for NLI is challenging as it typically involves the crowd-sourcing of apposite datasets, a process that is time-consuming and prone to logical errors. To address existing limitations, this paper investigates the verification and refinement of natural language explanations through the integration of LLMs and Theorem Provers (TPs). Specifically, we present a neuro-symbolic framework, named Explanation-Refiner, that integrates TPs with LLMs to generate and formalise explanatory sentences and suggest potential inference strategies for NLI. In turn, the TP is employed to provide formal guarantees on the logical validity of the explanations and to generate feedback for subsequent improvements. We demonstrate how Explanation-Refiner can be jointly used to evaluate explanatory reasoning, autoformalisation, and error correction mechanisms of state-of-the-art LLMs as well as to automatically enhance the quality of explanations of variable complexity in different domains.


Summary

  • The paper introduces a framework that employs LLMs and symbolic theorem proving to iteratively verify and improve natural language explanations.
  • The methodology leverages autoformalisation via Neo-Davidsonian semantics and First-Order Logic, using tools like Isabelle/HOL to translate sentences into structured proofs.
  • Empirical results show significant improvements, with logical validity rising to as high as 84% and syntax error reductions of up to roughly 69% across NLI datasets.

Verification and Refinement of Natural Language Explanations through LLM-Symbolic Theorem Proving

Introduction

The paper introduces a novel neuro-symbolic framework, Explanation-Refiner, which integrates LLMs with theorem provers (TPs) to verify and refine natural language explanations in Natural Language Inference (NLI). It highlights the limitations of previous approaches, in which language generation metrics often fail to capture logical reasoning, allowing incomplete or logically erroneous explanations to go undetected.

Explanation-Refiner Framework

Explanation-Refiner is designed to leverage LLMs to generate and formalise explanatory sentences and suggest potential inference strategies for NLI tasks. TPs provide formal guarantees of logical validity and generate feedback for improving human-annotated explanations. The framework emphasises the combined use of Neo-Davidsonian event semantics and First-Order Logic for systematically translating natural language sentences into proofs (Figure 1).

Figure 1: The overall pipeline of Explanation-Refiner illustrating its operational phases.

Explanation verification is achieved through an iterative process in which the TP constructs deductive proofs. If the initial proof is invalid, specific erroneous steps are identified through TP feedback, prompting an LLM-based refinement of the explanation. This ensures logical consistency and completeness while accommodating the iterative enhancement of explanation validity.
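
To make the iterative pipeline concrete, the following is a minimal Python sketch of the verify-and-refine loop described above. The helper functions (llm_formalise, prover_check, llm_refine) are hypothetical stand-ins for the LLM prompts and the Isabelle/HOL call used by Explanation-Refiner; the sketch illustrates the control flow only, not the authors' implementation.

```python
# Minimal sketch of the verify-and-refine loop, assuming hypothetical
# helpers for the LLM and theorem-prover interactions (not the paper's API).
from dataclasses import dataclass

@dataclass
class ProverResult:
    valid: bool          # did the prover verify premise + explanation => hypothesis?
    feedback: str = ""   # error message or failed proof step reported by the prover

def llm_formalise(premise: str, explanation: list[str], hypothesis: str) -> str:
    """Ask the LLM to translate the NLI instance into a formal theory (stub)."""
    raise NotImplementedError

def prover_check(theory: str) -> ProverResult:
    """Send the formal theory to the theorem prover and collect its verdict (stub)."""
    raise NotImplementedError

def llm_refine(explanation: list[str], feedback: str) -> list[str]:
    """Ask the LLM to repair the explanation using the prover's feedback (stub)."""
    raise NotImplementedError

def refine_explanation(premise, explanation, hypothesis, max_iterations=10):
    """Iterate until the prover accepts the explanation or the budget runs out."""
    for _ in range(max_iterations):
        theory = llm_formalise(premise, explanation, hypothesis)
        result = prover_check(theory)
        if result.valid:
            return explanation, True   # logically valid explanation found
        explanation = llm_refine(explanation, result.feedback)
    return explanation, False          # budget exhausted; return best attempt so far
```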

Implementation Details

Several state-of-the-art LLMs, including GPT-4, GPT-3.5, Llama, and Mistral, are used in conjunction with the Isabelle/HOL proof assistant. The use of Neo-Davidsonian semantics in autoformalisation helps preserve semantic fidelity when sentences are translated into logical forms.
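
To illustrate the notation, an explanatory sentence such as “If someone gives a speech, then they are speaking” can be rendered in Neo-Davidsonian style roughly as the first-order formula below, with explicit event variables and thematic roles (agent, patient) linking events to their participants. The predicate and role names are illustrative assumptions; the formalisations in the paper are produced by the LLMs and may differ in detail.

```latex
% Illustrative Neo-Davidsonian rendering (predicate names are assumed):
% "If someone gives a speech, then they are speaking."
\forall e\, \forall x\, \Bigl(
  \bigl( \mathit{give}(e) \wedge \mathit{agent}(e, x) \wedge
         \exists y\, ( \mathit{speech}(y) \wedge \mathit{patient}(e, y) ) \bigr)
  \rightarrow
  \exists e'\, \bigl( \mathit{speak}(e') \wedge \mathit{agent}(e', x) \bigr)
\Bigr)
```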

Significant performance improvements were observed across the e-SNLI, QASC, and WorldTree datasets, with logical validity rising from 36% to 84%, 12% to 55%, and 2% to 37%, respectively. Additionally, integrating TPs reduced syntax errors by 68.67%, 62.31%, and 55.17%.

Empirical Evaluation

Comparative experiments with several LLMs demonstrated that closed-source models like GPT-4 outperform others in explanation reasoning and autoformalisation. The experiments highlighted that the complexity of explanations impacts formalisation accuracy, with more complex datasets like WorldTree posing greater challenges (Figure 2).

Figure 2: Average proof steps processed by the proof assistant versus total suggested proof steps in both refined and unrefined conditions.

Autoformalisation and Proof Construction

The framework uses autoformalisation to convert natural language into structured logical representations, relying on Neo-Davidsonian semantics to avoid losing semantic detail in the translation. Explanations are then iteratively refined through proof construction, allowing the LLM to identify the non-redundant logical steps needed to establish entailment of the hypothesis.
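
Schematically, for a premise P, explanation sentences E_1, ..., E_n, and a hypothesis H, the proof obligation handed to the theorem prover has roughly the shape sketched below; this is a schematic rendering, not the exact Isabelle/HOL theory emitted by the framework. Explanation sentences that never contribute to a successful derivation are natural candidates for removal as redundant during refinement.

```latex
% Schematic proof obligation: the formalised premise and explanation
% sentences, taken together as axioms, must entail the hypothesis.
\{\, P,\ E_1,\ E_2,\ \dots,\ E_n \,\} \;\vdash\; H
```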

Importance of External Feedback

External feedback from TPs plays a central role in directing the refinement of LLM-generated explanations. This feedback mechanism allows logical errors to be corrected, yielding substantial improvements on NLI tasks. The iterative refinement cycle is a crucial component in achieving syntactic and logical consistency, offering a pathway for enhancing the quality of AI-generated explanations.
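
As a sketch of how prover feedback might be folded back into the loop, the snippet below assembles a refinement instruction from the prover's error message and the proof step it flagged. The ProverFeedback structure and the prompt wording are assumptions for illustration only, not the prompts used in the paper.

```python
# Hypothetical illustration of turning theorem-prover feedback into a
# refinement prompt for the LLM (data structure and wording are assumed).
from dataclasses import dataclass

@dataclass
class ProverFeedback:
    failed_step: str      # the proof step the prover could not discharge
    error_message: str    # the prover's error or syntax diagnostic

def build_refinement_prompt(premise: str,
                            explanation: list[str],
                            hypothesis: str,
                            feedback: ProverFeedback) -> str:
    """Compose an instruction asking the LLM to repair the flagged step."""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(explanation))
    return (
        "The theorem prover could not verify that the premise and the "
        "explanation entail the hypothesis.\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Explanation sentences:\n{numbered}\n"
        f"Failed proof step: {feedback.failed_step}\n"
        f"Prover message: {feedback.error_message}\n"
        "Remove redundant sentences and revise the explanation so that the "
        "entailment holds, returning one sentence per line."
    )
```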

Conclusion

The Explanation-Refiner framework effectively bridges LLMs with symbolic TPs, enhancing logical validity and explanation quality in NLI tasks. This research emphasizes the potential for neuro-symbolic integration in advancing explainable AI, with robust implications for future developments in AI-generated explanations. Future work could explore extending the framework to complex domains, targeting both explanation precision and logical soundness across a broader spectrum of AI applications.


Explain it Like I'm 14

Easy summary

This paper is about teaching AI systems to explain their answers in a way that is correct, step by step, and checkable. The authors built a system called Explanation-Refiner that combines two kinds of tools:

  • LLMs, which are good at writing explanations in plain English.
  • A “theorem prover,” a strict logic checker that can verify whether those explanations really prove what they claim.

By making these two work together in a loop—with the logic checker giving feedback and the LLM fixing mistakes—the system turns messy or partly wrong explanations into clear, logically correct ones.

What questions did the paper ask?

The researchers focused on three simple questions:

  • Can we automatically check and improve (refine) the explanations written in natural language?
  • Can we fix and improve explanations written by humans, not just by AI?
  • How good are today’s LLMs at explaining their reasoning, turning sentences into logic, and fixing errors across different kinds of problems?

How did they do it?

To make this understandable, think of solving a puzzle in class:

  • The LLM is like a student who writes down how they solved the puzzle.
  • The theorem prover is like a very strict teacher who checks whether each step is logically valid.
  • If the teacher finds a mistake, they point to the exact step that failed. The student then fixes that step and tries again.

Here’s the approach, step by step:

Step 1: Write and translate the explanation

  • The LLM writes explanation sentences in plain English, like “If someone gives a speech, then they are speaking.”
  • Those sentences are then translated into formal logic—a precise “math-like” language that computers can check. The paper uses a style called Neo-Davidsonian event semantics to keep track of actions and who did what.
    • For example, “A man gives a speech” becomes something like: there is an event of giving, with the man as the agent (doer) and the speech as the patient (thing acted on); this is spelled out as a short formula just after this list.
    • This helps keep all the important details (who, what, action) when turning sentences into logic.
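
In formal notation, that example comes out roughly as the formula below (the predicate names are illustrative, not the paper's exact output):

```latex
% "A man gives a speech": an event of giving whose agent is a man
% and whose patient is a speech (illustrative rendering).
\exists e\, \exists x\, \exists y\, \bigl(
  \mathit{man}(x) \wedge \mathit{speech}(y) \wedge \mathit{give}(e)
  \wedge \mathit{agent}(e, x) \wedge \mathit{patient}(e, y)
\bigr)
```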

Step 2: Check with a logic tool (theorem prover)

  • The team uses a tool called Isabelle/HOL, a proof assistant. It tries to prove that:
    • Premise + Explanation ⇒ Hypothesis
  • If something doesn’t make sense, Isabelle tells you where the logic breaks. It also catches “syntax errors” (like missing brackets or mismatched types—similar to a programming typo).

Step 3: Fix it in a loop (feedback and refinement)

  • Using the exact error from the theorem prover, the LLM:
    • Removes irrelevant or repeated explanation steps.
    • Repairs the broken step.
    • Improves the logic translation and proof steps.
  • This loop repeats a few times until the explanation works or the system gives up after a set number of tries.

What problems did they test on?

They tested on three datasets with growing difficulty:

  • e-SNLI: Short, simple examples with one explanation sentence.
  • QASC: Science questions with a couple of explanation sentences.
  • WorldTree: Harder science questions that may need many (up to 16) explanation sentences combined together.

What did they find?

Here are the key results and why they matter:

  • The logic checker’s feedback makes explanations much better.
    • Using GPT-4, the percentage of logically valid explanations jumped:
      • e-SNLI: from 36% to 84%
      • QASC: from 12% to 55%
      • WorldTree: from 2% to 37%
    • Why this matters: It shows that careful checking and fixing can make AI explanations far more trustworthy.
  • The system reduced “syntax errors” in the logical code a lot (think of it like fewer typos and formatting mistakes):
    • Average reductions were about 69%, 62%, and 55% across the three datasets.
    • Why this matters: Cleaner logic code means the theorem prover can do its job and verify reasoning.
  • More complex explanations are harder.
    • Short, simple cases (e-SNLI) worked best.
    • Longer, multi-step science explanations (WorldTree) were toughest.
  • Some LLMs are better at this than others.
    • GPT-4 and GPT-3.5 did better than open-source models like Llama and Mistral, both at writing explanations and turning them into logic.
    • Swapping in GPT-4 just for the “logic translation” step helped weaker models a lot.
  • Human check: Are the refined explanations true and non-trivial?
    • Most refined explanations were factually correct (very high rates).
    • Only a small number in the science datasets had over-generalizations (e.g., treating all tetrapods as having four limbs, which wrongly includes snakes).
    • Explanations usually weren’t trivial (i.e., not just repeating the premise or hypothesis).

Why is this important?

When AIs explain their answers, we want those explanations to be:

  • Correct: They actually support the answer.
  • Clear: Each step follows logically.
  • Checkable: A separate system can verify them.

This paper shows a practical way to achieve that by combining natural language generation (LLMs) with strict logical checking (theorem provers). It also helps clean up human-written explanations in datasets, which are sometimes incomplete or slightly wrong. That means better training data and fairer evaluations of AI reasoning.

What could this change in the future?

This approach can:

  • Make AI explanations more reliable in areas like education, science, law, or medicine where correctness matters.
  • Help build better benchmarks by automatically fixing noisy human-written explanations.
  • Encourage new systems that mix flexible language understanding (neural nets) with precise logic (symbolic tools).

The authors note limitations too:

  • The hardest, multi-step explanations still challenge today’s models.
  • Better logical consistency doesn’t automatically guarantee overall safety or full correctness in all real-world uses.
  • Future work aims to handle more complex explanations with fewer refinement steps, making the process faster and more robust.

In short, Explanation-Refiner is like giving AIs both a good “writer” and a strict “proof-checking teacher,” and letting them work together until the explanation is both easy to read and logically solid.
