
Reasoning over Uncertain Text by Generative Large Language Models (2402.09614v3)

Published 14 Feb 2024 in cs.CL and cs.AI

Abstract: This paper considers the challenges LLMs face when reasoning over text that includes information involving uncertainty explicitly quantified via probability values. This type of reasoning is relevant to a variety of contexts ranging from everyday conversations to medical decision-making. Despite improvements in the mathematical reasoning capabilities of LLMs, they still exhibit significant difficulties when it comes to probabilistic reasoning. To deal with this problem, we introduce the Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically designed to test the probabilistic reasoning capabilities of LLMs. We use BLInD to find out the limitations of LLMs for tasks involving probabilistic reasoning. In addition, we present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming. We conclude by providing an evaluation of our methods on BLInD and an adaptation of a causal reasoning question-answering dataset. Our empirical results highlight the effectiveness of our proposed strategies for multiple LLMs.


Summary

  • The paper introduces BLInD and a two-stage approach that decomposes probabilistic reasoning into number extraction and graph generation.
  • It shows that structured prompting with code-generation, Monte Carlo simulation, and ProbLog mapping significantly improves performance on complex Bayesian networks.
  • Empirical results reveal that GPT-4 achieves near-perfect accuracy with neuro-symbolic methods while GPT-3.5 remains challenged by increasing BN complexity.

Reasoning over Uncertain Text by Generative LLMs

This paper addresses the notable limitations of current LLMs in probabilistic reasoning over natural language text that expresses explicit uncertainty, particularly quantitative probabilities. The central contribution is twofold: the introduction of the Bayesian Linguistic Inference Dataset (BLInD), a challenging new resource for evaluating probabilistic reasoning from text, and a systematic empirical analysis of LLM reasoning strategies blending prompt engineering, code generation, and neuro-symbolic mapping. The scope encompasses both model evaluation (GPT-3.5, GPT-4) and the development of prompting/representation recipes that make LLMs more effective at answering probabilistic queries over explicitly uncertain contexts.

Dataset Design and Task Formulation

BLInD is designed to systematically investigate whether LLMs can extract, represent, and manipulate explicit probability information embedded in text. Each instance in BLInD encodes:

  • A Bayesian network (BN) with binary variables (up to 10 variables per instance), where each conditional probability table (CPT) entry is translated into a simple sentence template.
  • A probabilistic query phrased in natural language, e.g., "What is the probability that G is true and P is false given O is false?"
  • The corresponding ground-truth answer, obtained by exact inference with pgmpy, which guarantees correctness and eliminates annotation noise.

The dataset allows fine-grained difficulty scaling by varying the number of nodes and the BN's graph structure (e.g., arborescence, non-tree structures).
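
To make the ground-truth generation concrete, here is a minimal sketch of exact inference with pgmpy on a toy two-variable network; the variable names and CPT values are illustrative, not taken from the dataset:

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Toy network O -> G with binary variables (state 0 = false, 1 = true).
model = BayesianNetwork([("O", "G")])

# P(O = true) = 0.3
cpd_o = TabularCPD("O", 2, [[0.7], [0.3]])
# P(G = true | O = false) = 0.2, P(G = true | O = true) = 0.8
cpd_g = TabularCPD("G", 2,
                   [[0.8, 0.2],   # G = false; columns: O = false, O = true
                    [0.2, 0.8]],  # G = true
                   evidence=["O"], evidence_card=[2])
model.add_cpds(cpd_o, cpd_g)
assert model.check_model()

# Exact inference answers a query such as
# "What is the probability that G is true given O is false?"
inference = VariableElimination(model)
print(inference.query(["G"], evidence={"O": 0}))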

Baseline Evaluation and Model Limitations

Baseline methods using direct question-answering (BQA) and chain-of-thought (CoT) prompting show that both GPT-3.5 and GPT-4 perform dramatically worse here than on standard math and word-problem benchmarks. Without structured guidance, models frequently fail to map textual probability statements to correct symbolic representations, misapply independence/dependence relations, or hallucinate numbers.

  • GPT-3.5 achieves less than 10% accuracy in complex settings (more than 5 variables) with either BQA or CoT.
  • GPT-4 fares better, but accuracy sharply declines as BN size/complexity increases (falling below 40% for cases with more than 6 variables).

Structured Decomposition and Subtasks

To isolate the bottlenecks, the paper decomposes probabilistic reasoning into subtasks:

  • Number Extraction (NE): Extracting all explicit probabilities and CPT entries from the context.
  • Graph Generation (GG): Recovering the BN structure (edges/arcs denoting conditional dependencies).

Empirical analysis shows that NE is relatively robust (GPT-4 achieves 100% accuracy; GPT-3.5 >90% for moderate-scale BNs), but GG performance degrades with graph size and complexity, especially for GPT-3.5.

Combining the two subtasks in a sequenced prompt (first NE, then GG) improves joint accuracy, suggesting the importance of modular decomposition for prompt-based reasoning over complex contexts.
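
As an illustration, a sequenced NE-then-GG chain can be as simple as the following sketch; the prompt wording and the `llm` callable are hypothetical, not the paper's exact prompts:

def decompose(context, llm):
    """Sequenced NE -> GG prompting; `llm` is any text-in/text-out callable."""
    # Stage 1 (NE): pull out every explicit probability statement.
    probabilities = llm(
        "List every probability in the text as lines of the form "
        "P(variable | parents) = value.\n\nText:\n" + context
    )
    # Stage 2 (GG): recover the BN edges, conditioned on the NE output.
    graph = llm(
        "Using the probabilities below, list the directed edges of the "
        "Bayesian network as 'Parent -> Child' pairs.\n\n"
        + probabilities + "\n\nText:\n" + context
    )
    return probabilities, graph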

Symbolic Mapping Techniques

The core practical advancement is mapping the probabilistic reasoning task, post-decomposition, to formal representations that mitigate LLMs' limitations in numerical computation and inference:

  • Program-Aided Language Models (PAL):
    • LLMs are prompted to output Python code encoding the probability computation, using the NE extraction as input. The code is executed separately for the final answer.
    • This method exposes accuracy bottlenecks due to code generation errors and faulty variable mapping, especially as the BN grows in size.
  • Monte Carlo Approximate Inference (MC):
    • LLMs generate Python-based MC simulation code that samples BN variables according to the extracted structure and CPTs, then empirically estimates the answer.
    • MC prompting is notably more robust for larger BNs, particularly when GG is included in the prompt to help establish the variable ordering required for proper sampling (see the sampling sketch after this list).
  • Neuro-Symbolic Probabilistic Logic Mapping (ProbLog):
    • LLMs generate a probabilistic logic program (ProbLog) representation from the text, and inference is executed outside the model.
    • This representation is the most scalable and precise with GPT-4, which can almost perfectly map context/query pairs to correct ProbLog code for non-trivial BNs.
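
Two minimal sketches of these mappings on the toy O -> G network from earlier (P(O = true) = 0.3; P(G = true | O) = 0.8 if O is true, else 0.2) may help; the sampling logic and program text are illustrative of what the LLM is prompted to generate, not the paper's exact outputs. First, MC-style rejection sampling:

import random

def mc_estimate(n_samples=100_000):
    """Rejection sampling for P(G = true | O = false) on the toy O -> G net."""
    hits = kept = 0
    for _ in range(n_samples):
        o = random.random() < 0.3                   # sample O ~ P(O)
        g = random.random() < (0.8 if o else 0.2)   # sample G ~ P(G | O)
        if not o:           # keep only samples consistent with the evidence
            kept += 1
            hits += g
    return hits / kept

print(mc_estimate())  # approximately 0.2

And the ProbLog route, where the emitted probabilistic logic program is evaluated outside the model, here via the problog Python package:

from problog import get_evaluatable
from problog.program import PrologString

program = PrologString("""
0.3::o.
0.8::g :- o.
0.2::g :- \\+o.
evidence(o, false).
query(g).
""")

# External exact inference over the generated program.
print(get_evaluatable().create_from(program).evaluate())  # {g: 0.2}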

Quantitative Results

Numerical results are tabulated for both models and all strategies:

Method           GPT-3.5 (avg, V6–V10)   GPT-4 (avg, V6–V10)
BQA zero-shot     2%                      10%
CoT zero-shot     3%                      15%
PAL + NE          9%                      29%
MC + GG          25%                      90%
ProbLog          46%                      98%

Key observations:

  • MC+GG and ProbLog methods allow GPT-4 to maintain >90% accuracy even for BNs with 6–10 variables.
  • GPT-3.5 still struggles with code generation and logic mapping, but its accuracy improves significantly over prompting-based QA alone.

Results are consistent on an adapted version of the CLADDER causal reasoning dataset, underscoring the generality of the neuro-symbolic approaches.

Implementation Considerations

For practitioners, practical implementation involves:

  1. Subtask Chaining: Design multi-step prompts that force the LLM to extract all necessary symbolic information before reasoning. This includes explicit NE and GG stages to prevent number hallucination and dependency confusion.
  2. External Code Execution: Always execute LLM-generated code externally; do not trust string outputs as final answers due to computation and parsing errors.
  3. Prompt Length Management: As the number of variables increases, model output length increases sharply. There is a trade-off between completeness (including all NE/GG info) and hitting context/window limits.
  4. Model Selection: GPT-4 is substantially more reliable than GPT-3.5 for multi-step, code-generating prompts, especially for unfamiliar domains (e.g., ProbLog).
  5. Error Checking: Run the NE and GG evaluation steps as acceptance tests before executing probabilistic code; missing or incorrect CPTs or edges produce downstream answer errors (see the validation sketch after the pipeline below).

An abstracted pseudocode pipeline for applying these techniques:

def solve_probabilistic_query(context, query, model):
    """Extract symbolic information, map it to a formal program, solve externally."""
    probabilities = model.extract_probabilities(context)  # NE step
    graph = model.extract_graph(context)                  # GG step
    code = model.generate_symbolic_code(probabilities, graph, query)
    answer = execute_external_code(code)  # PAL/MC/ProbLog runner
    return answer
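
As a concrete form of the acceptance test in item 5, a validator can check the extracted CPTs and edges before any inference runs; the data formats assumed here (a mapping from (child, parent-assignment) tuples to floats, and an edge list) are hypothetical:

def validate_extraction(probabilities, graph):
    """Check NE/GG output before inference (formats are illustrative)."""
    children = {child for _, child in graph}
    for child in children:
        parents = [p for p, c in graph if c == child]
        # A binary child needs one CPT entry per assignment of its parents.
        expected = 2 ** len(parents)
        actual = sum(1 for key in probabilities if key[0] == child)
        if actual != expected:
            raise ValueError(f"incomplete CPT for {child}: {actual}/{expected}")
    for value in probabilities.values():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"probability out of range: {value}")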

Theoretical and Practical Implications

The work demonstrates that current LLMs are not natively Bayesian reasoners over text, despite improvements in mathematical reasoning on benchmarks like GSM8K. In probabilistic settings, unstructured prompts and vanilla CoT fail due to:

  • Difficulty mapping natural language uncertainty to joint/marginal distributions.
  • Poor management of variable dependencies and CPTs.
  • Susceptibility to number hallucination and symbolic inconsistency over long chains.

Structured neuro-symbolic decomposition, with external formal representation and code execution, is tractable and enables accurate performance. However, offloading inference this way trades away the end-to-end differentiability and flexibility that make LLMs attractive.

Important empirical findings:

  • Adding more structured symbolic steps improves accuracy more than simply adding more reasoning text (CoT).
  • Accuracy bottlenecks often occur in the information-extraction steps (NE and GG) rather than in the final inference step, particularly with larger or denser Bayesian networks.
  • MC-based approximate inference is often more scalable on large BNs than explicit symbolic computation, but still limited by correct dependency extraction and variable ordering.

Future Directions

Possible directions inferred from the analysis:

  • Architectures and objectives for end-to-end neural probabilistic reasoning: Rather than using LLMs as code-generators, integrate explicit graphical model reasoning into model architectures or pre-training.
  • Linguistic uncertainty beyond quantitative CPTs: Extend to cases where probabilities are vague or only partially specified.
  • Joint neuro-symbolic training: Explicitly train models to map text to symbolic probabilistic programs (e.g., ProbLog or pgmpy representations), closing the gap between LLMs and symbolic reasoning systems.
  • Scaling and deployment: Systems that integrate LLM-driven number/graph extraction with symbolic solvers are immediately actionable for domains such as clinical decision support, risk assessment, and scientific NLP.

In summary, this work provides an empirical foundation, practical toolkit, and a benchmark for systematically closing the gap between language-based and symbolic probabilistic reasoning. The integration of NE and GG subtasks, symbolic representations, and external execution is a promising template for near-term systems, while highlighting the significant work required to advance LLM-native probabilistic reasoning.
