
Reasoning over Uncertain Text by Generative Large Language Models (2402.09614v3)

Published 14 Feb 2024 in cs.CL and cs.AI

Abstract: This paper considers the challenges LLMs face when reasoning over text that includes information involving uncertainty explicitly quantified via probability values. This type of reasoning is relevant to a variety of contexts ranging from everyday conversations to medical decision-making. Despite improvements in the mathematical reasoning capabilities of LLMs, they still exhibit significant difficulties when it comes to probabilistic reasoning. To deal with this problem, we introduce the Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically designed to test the probabilistic reasoning capabilities of LLMs. We use BLInD to find out the limitations of LLMs for tasks involving probabilistic reasoning. In addition, we present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming. We conclude by providing an evaluation of our methods on BLInD and an adaptation of a causal reasoning question-answering dataset. Our empirical results highlight the effectiveness of our proposed strategies for multiple LLMs.


Summary

  • The paper introduces BLInD and a two-stage approach that decomposes probabilistic reasoning into number extraction and graph generation.
  • It shows that structured prompting with code-generation, Monte Carlo simulation, and ProbLog mapping significantly improves performance on complex Bayesian networks.
  • Empirical results reveal that GPT-4 achieves near-perfect accuracy with neuro-symbolic methods while GPT-3.5 remains challenged by increasing BN complexity.

Reasoning over Uncertain Text by Generative LLMs

This paper addresses the notable limitations of current LLMs in probabilistic reasoning over natural language text that expresses explicit uncertainty, particularly quantitative probabilities. The central contribution is twofold: the introduction of the Bayesian Linguistic Inference Dataset (BLInD), a challenging new resource for evaluating probabilistic reasoning from text, and a systematic empirical analysis of LLM reasoning strategies blending prompt engineering, code generation, and neuro-symbolic mapping. The scope encompasses both model evaluation (GPT-3.5, GPT-4) and the development of prompting/representation recipes that make LLMs more effective at answering probabilistic queries over explicitly uncertain contexts.

Dataset Design and Task Formulation

BLInD is designed to systematically investigate whether LLMs can extract, represent, and manipulate explicit probability information embedded in text. Each instance in BLInD encodes:

  • A Bayesian network (BN) with binary variables (up to 10 variables per instance), where each conditional probability table (CPT) entry is translated into a simple sentence template.
  • A probabilistic query phrased in natural language, e.g., "What is the probability that G is true and P is false given O is false?"
  • The corresponding ground-truth answer, obtained by exact inference with pgmpy, which guarantees correctness and eliminates annotation noise.

The dataset allows fine-grained difficulty scaling by varying the number of nodes and the BN's graph structure (e.g., arborescence, non-tree structures).
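
To make the ground-truth generation concrete, here is a minimal sketch of exact inference with pgmpy on a toy two-variable network; the variable names and CPT values are illustrative, not taken from the dataset:

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Toy network O -> G with binary variables (state 0 = false, 1 = true).
model = BayesianNetwork([("O", "G")])

# P(O = true) = 0.3
cpd_o = TabularCPD("O", 2, [[0.7], [0.3]])
# P(G = true | O = false) = 0.2, P(G = true | O = true) = 0.8
cpd_g = TabularCPD("G", 2,
                   [[0.8, 0.2],   # G = false; columns: O = false, O = true
                    [0.2, 0.8]],  # G = true
                   evidence=["O"], evidence_card=[2])
model.add_cpds(cpd_o, cpd_g)
assert model.check_model()

# Exact inference answers a query such as
# "What is the probability that G is true given O is false?"
inference = VariableElimination(model)
print(inference.query(["G"], evidence={"O": 0}))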

Baseline Evaluation and Model Limitations

Baseline methods using direct question-answering (BQA) and chain-of-thought (CoT) prompting show that both GPT-3.5 and GPT-4 perform dramatically worse here than on standard math and word-problem benchmarks. Without structured guidance, models frequently fail to map textual probability statements to correct symbolic representations, misapply independence/dependence relations, or hallucinate numbers.

  • GPT-3.5 achieves less than 10% accuracy in complex settings (more than 5 variables) with either BQA or CoT.
  • GPT-4 fares better, but accuracy sharply declines as BN size/complexity increases (falling below 40% for cases with more than 6 variables).

Structured Decomposition and Subtasks

To isolate the bottlenecks, the paper decomposes probabilistic reasoning into subtasks:

  • Number Extraction (NE): Extracting all explicit probabilities and CPT entries from the context.
  • Graph Generation (GG): Recovering the BN structure (edges/arcs denoting conditional dependencies).

Empirical analysis shows that NE is relatively robust (GPT-4 achieves 100% accuracy; GPT-3.5 >90% for moderate-scale BNs), but GG performance degrades with graph size and complexity, especially for GPT-3.5.

Combining the two subtasks in a sequenced prompt (first NE, then GG) improves joint accuracy, suggesting the importance of modular decomposition for prompt-based reasoning over complex contexts.
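
As an illustration, a sequenced NE-then-GG chain can be as simple as the following sketch; the prompt wording and the `llm` callable are hypothetical, not the paper's exact prompts:

def decompose(context, llm):
    """Sequenced NE -> GG prompting; `llm` is any text-in/text-out callable."""
    # Stage 1 (NE): pull out every explicit probability statement.
    probabilities = llm(
        "List every probability in the text as lines of the form "
        "P(variable | parents) = value.\n\nText:\n" + context
    )
    # Stage 2 (GG): recover the BN edges, conditioned on the NE output.
    graph = llm(
        "Using the probabilities below, list the directed edges of the "
        "Bayesian network as 'Parent -> Child' pairs.\n\n"
        + probabilities + "\n\nText:\n" + context
    )
    return probabilities, graph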

Symbolic Mapping Techniques

The core practical advancement is mapping the probabilistic reasoning task, post-decomposition, to formal representations that mitigate LLMs' limitations in numerical computation and inference:

  • Program-Aided Language Models (PAL):
    • LLMs are prompted to output Python code encoding the probability computation, using the NE extraction as input. The code is executed separately for the final answer.
    • This method exposes accuracy bottlenecks due to code generation errors and faulty variable mapping, especially as the BN grows in size.
  • Monte Carlo Approximate Inference (MC):
    • LLMs generate Python-based MC simulation code that samples BN variables according to the extracted structure and CPTs, then empirically estimates the answer.
    • MC prompting is notably more robust for larger BNs, particularly when GG is included in the prompt to help establish the variable ordering required for proper sampling (see the sampling sketch after this list).
  • Neuro-Symbolic Probabilistic Logic Mapping (ProbLog):
    • LLMs generate a probabilistic logic program (ProbLog) representation from the text, and inference is executed outside the model.
    • This representation is the most scalable and precise with GPT-4, which can almost perfectly map context/query pairs to correct ProbLog code for non-trivial BNs.
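
Two minimal sketches of these mappings on the toy O -> G network from earlier (P(O = true) = 0.3; P(G = true | O) = 0.8 if O is true, else 0.2) may help; the sampling logic and program text are illustrative of what the LLM is prompted to generate, not the paper's exact outputs. First, MC-style rejection sampling:

import random

def mc_estimate(n_samples=100_000):
    """Rejection sampling for P(G = true | O = false) on the toy O -> G net."""
    hits = kept = 0
    for _ in range(n_samples):
        o = random.random() < 0.3                   # sample O ~ P(O)
        g = random.random() < (0.8 if o else 0.2)   # sample G ~ P(G | O)
        if not o:           # keep only samples consistent with the evidence
            kept += 1
            hits += g
    return hits / kept

print(mc_estimate())  # approximately 0.2

And the ProbLog route, where the emitted probabilistic logic program is evaluated outside the model, here via the problog Python package:

from problog import get_evaluatable
from problog.program import PrologString

program = PrologString("""
0.3::o.
0.8::g :- o.
0.2::g :- \\+o.
evidence(o, false).
query(g).
""")

# External exact inference over the generated program.
print(get_evaluatable().create_from(program).evaluate())  # {g: 0.2}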

Quantitative Results

Numerical results are tabulated for both models and all strategies:

Method           GPT-3.5 (avg, V6–V10)   GPT-4 (avg, V6–V10)
BQA zero-shot     2%                      10%
CoT zero-shot     3%                      15%
PAL + NE          9%                      29%
MC + GG          25%                      90%
ProbLog          46%                      98%

Key observations:

  • MC+GG and ProbLog methods allow GPT-4 to maintain >90% accuracy even for BNs with 6–10 variables.
  • GPT-3.5 still struggles with code generation and logic mapping, but its accuracy improves significantly over prompting-based QA alone.

Results are consistent on an adapted version of the CLADDER causal reasoning dataset, underscoring the generality of the neuro-symbolic approaches.

Implementation Considerations

For practitioners, practical implementation involves:

  1. Subtask Chaining: Design multi-step prompts that force the LLM to extract all necessary symbolic information before reasoning. This includes explicit NE and GG stages to prevent number hallucination and dependency confusion.
  2. External Code Execution: Always execute LLM-generated code externally; do not trust string outputs as final answers due to computation and parsing errors.
  3. Prompt Length Management: As the number of variables increases, model output length increases sharply. There is a trade-off between completeness (including all NE/GG info) and hitting context/window limits.
  4. Model Selection: GPT-4 is substantially more reliable than GPT-3.5 for multi-step, code-generating prompts, especially for unfamiliar domains (e.g., ProbLog).
  5. Error Checking: Run the NE and GG evaluation steps as acceptance tests before executing probabilistic code; missing or incorrect CPTs or edges produce downstream answer errors (see the validation sketch after the pipeline below).

An abstracted pseudocode pipeline for applying these techniques:

def solve_probabilistic_query(context, query, model):
    """Extract symbolic information, map it to a formal program, solve externally."""
    probabilities = model.extract_probabilities(context)  # NE step
    graph = model.extract_graph(context)                  # GG step
    code = model.generate_symbolic_code(probabilities, graph, query)
    answer = execute_external_code(code)  # PAL/MC/ProbLog runner
    return answer
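
As a concrete form of the acceptance test in item 5, a validator can check the extracted CPTs and edges before any inference runs; the data formats assumed here (a mapping from (child, parent-assignment) tuples to floats, and an edge list) are hypothetical:

def validate_extraction(probabilities, graph):
    """Check NE/GG output before inference (formats are illustrative)."""
    children = {child for _, child in graph}
    for child in children:
        parents = [p for p, c in graph if c == child]
        # A binary child needs one CPT entry per assignment of its parents.
        expected = 2 ** len(parents)
        actual = sum(1 for key in probabilities if key[0] == child)
        if actual != expected:
            raise ValueError(f"incomplete CPT for {child}: {actual}/{expected}")
    for value in probabilities.values():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"probability out of range: {value}")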

Theoretical and Practical Implications

The work demonstrates that current LLMs are not natively Bayesian reasoners over text, despite improvements in mathematical reasoning on benchmarks like GSM8K. In probabilistic settings, unstructured prompts and vanilla CoT fail due to:

  • Difficulty mapping natural language uncertainty to joint/marginal distributions.
  • Poor management of variable dependencies and CPTs.
  • Susceptibility to number hallucination and symbolic inconsistency over long chains.

Structured neuro-symbolic decomposition, with external formal representation and code execution, is tractable and enables accurate performance. However, offloading inference this way trades away the end-to-end differentiability and flexibility that make LLMs attractive.

Important empirical findings:

  • Adding more structured symbolic steps improves accuracy more than simply adding more reasoning text (CoT).
  • Accuracy bottlenecks often occur in the information-extraction steps (NE and GG) rather than in the final inference step, particularly with larger or denser Bayesian networks.
  • MC-based approximate inference is often more scalable on large BNs than explicit symbolic computation, but still limited by correct dependency extraction and variable ordering.

Future Directions

Possible directions inferred from the analysis:

  • Architectures and objectives for end-to-end neural probabilistic reasoning: Rather than using LLMs as code-generators, integrate explicit graphical model reasoning into model architectures or pre-training.
  • Linguistic uncertainty beyond quantitative CPTs: Extend to cases where probabilities are vague or only partially specified.
  • Joint neuro-symbolic training: Explicitly train models to map text to symbolic probabilistic programs (e.g., ProbLog or pgmpy representations), closing the gap between LLMs and symbolic reasoning systems.
  • Scaling and deployment: Systems that integrate LLM-driven number/graph extraction with symbolic solvers are immediately actionable for domains such as clinical decision support, risk assessment, and scientific NLP.

In summary, this work provides an empirical foundation, practical toolkit, and a benchmark for systematically closing the gap between language-based and symbolic probabilistic reasoning. The integration of NE and GG subtasks, symbolic representations, and external execution is a promising template for near-term systems, while highlighting the significant work required to advance LLM-native probabilistic reasoning.
