Hypothesis Generation with Large Language Models

Published Apr 5, 2024 in cs.AI, cs.CL, cs.CY, and cs.LG


Effective generation of novel hypotheses is instrumental to scientific progress. So far, researchers have been the main powerhouse behind hypothesis generation by painstaking data analysis and thinking (also known as the Eureka moment). In this paper, we examine the potential of LLMs to generate hypotheses. We focus on hypothesis generation based on data (i.e., labeled examples). To enable LLMs to handle arbitrarily long contexts, we generate initial hypotheses from a small number of examples and then update them iteratively to improve the quality of hypotheses. Inspired by multi-armed bandits, we design a reward function to inform the exploitation-exploration tradeoff in the update process. Our algorithm is able to generate hypotheses that enable much better predictive performance than few-shot prompting in classification tasks, improving accuracy by 31.7% on a synthetic dataset and by 13.9%, 3.3%, and 24.9% on three real-world datasets. We also outperform supervised learning by 12.8% and 11.2% on two challenging real-world datasets. Furthermore, we find that the generated hypotheses not only corroborate human-verified theories but also uncover new insights for the tasks.

Figure: Evaluating the top k hypotheses on new examples, updating rewards, and generating new hypotheses from errors.


  • This paper introduces HypoGeniC, an algorithm that uses LLMs to generate and refine hypotheses from labeled examples. It outperforms few-shot prompting on all reported datasets and supervised learning on two real-world datasets.

  • HypoGeniC iteratively improves hypothesis quality using a reward function that balances exploration and exploitation, and it uses a 'wrong example bank' to guide the generation of new hypotheses.

  • Results show considerable improvements in classification accuracy across various datasets, with generated hypotheses being interpretable and capable of cross-model generalization.

  • The study underscores LLMs' potential to revolutionize scientific hypothesis generation, suggesting future research avenues including multimodal data and domain-specific knowledge integration.

Exploring the Efficacy of LLMs in Hypothesis Generation


Generating novel hypotheses is a cornerstone of scientific progress, yet the process has largely resisted automation. This paper presents an approach that uses LLMs to generate and iteratively refine hypotheses from labeled examples. Drawing on ideas from the multi-armed bandit problem, the authors propose a method whose hypotheses improve predictive performance over few-shot prompting on every reported dataset and over supervised learning on two real-world datasets. The gains extend to real-world tasks involving complex human behavior, such as deception detection and message popularity prediction.


The proposed algorithm, HypoGeniC, begins by generating initial hypotheses from a small subset of examples and then refines them iteratively. Central to this process is a reward function designed to balance the exploration-exploitation trade-off in the hypothesis update step. The process has three parts:

  • Initial Hypothesis Generation: Starting from a small set of examples, generate preliminary hypotheses.
  • Iterative Refinement: Employing a reward function, iteratively refine and generate new hypotheses to address deficiencies in the current hypothesis pool.
  • Evaluation and Selection: Evaluate the top k hypotheses on new examples and collect the examples they mispredict in a "wrong example bank," which captures knowledge gaps and guides the generation of new, more accurate hypotheses.
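
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the reward here is a generic UCB-style formula (observed accuracy plus an exploration bonus), which may differ from the paper's exact reward function. The names `Hypothesis`, `reward`, `update_loop`, `alpha`, `k`, `bank_size`, and `max_pool` are hypothetical, and the `predict` and `generate` callables stand in for LLM calls that the summary does not specify.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, label)


@dataclass
class Hypothesis:
    text: str
    correct: int = 0  # correct predictions so far
    seen: int = 0     # examples this hypothesis has been evaluated on


def reward(h: Hypothesis, t: int, alpha: float = 0.5) -> float:
    """UCB-style reward (an assumption): accuracy plus an exploration bonus."""
    if h.seen == 0:
        return float("inf")  # untested hypotheses are tried first
    return h.correct / h.seen + alpha * math.sqrt(math.log(t) / h.seen)


def update_loop(
    examples: List[Example],
    hypotheses: List[Hypothesis],
    predict: Callable[[str, str], str],             # stand-in for an LLM inference call
    generate: Callable[[List[Example]], List[str]], # stand-in for an LLM generation call
    k: int = 2,
    bank_size: int = 2,
    max_pool: int = 5,
) -> List[Hypothesis]:
    wrong_bank: List[Example] = []
    for t, (text, label) in enumerate(examples, start=1):
        # Evaluate only the k hypotheses with the highest current reward.
        top_k = sorted(hypotheses, key=lambda h: reward(h, t), reverse=True)[:k]
        n_wrong = 0
        for h in top_k:
            h.seen += 1
            if predict(h.text, text) == label:
                h.correct += 1
            else:
                n_wrong += 1
        # Assumed rule: bank the example when every top hypothesis fails on it.
        if n_wrong == len(top_k):
            wrong_bank.append((text, label))
        # Once the bank is full, generate new hypotheses from the errors.
        if len(wrong_bank) >= bank_size:
            hypotheses.extend(Hypothesis(s) for s in generate(wrong_bank))
            wrong_bank.clear()
            # Keep only the highest-reward hypotheses in the pool.
            hypotheses.sort(key=lambda h: reward(h, t), reverse=True)
            del hypotheses[max_pool:]
    return hypotheses
```

In practice, `predict` would prompt an LLM to label an example given a hypothesis, and `generate` would prompt it to propose new hypotheses from the banked errors. The exploration bonus shrinks as a hypothesis is evaluated on more examples, so well-tested hypotheses are ranked mostly by accuracy while rarely tested ones still get a chance to be selected.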


The paper reports higher classification accuracy with the generated hypotheses than with few-shot prompting on every dataset tested, and than with supervised learning on two datasets. Specifically:

  • Improvements Over Baselines: The method achieves a 31.7% improvement over few-shot prompting on a synthetic dataset and improvements of 13.9%, 3.3%, and 24.9% on three real-world datasets.
  • Comparison with Supervised Learning: In comparison to supervised learning models, HypoGeniC demonstrates superior performance on two challenging real-world datasets by margins of 12.8% and 11.2%.
  • Interpretability and Cross-Model Generalization: Beyond quantitative improvements, the generated hypotheses are shown to be interpretable and capable of generalizing across different LLMs and out-of-distribution datasets, corroborating and extending human theory.

Implications and Future Directions

The findings contribute to our understanding of LLMs' potential in scientific hypothesis generation. Practically, this work points toward automating the generation of interpretable, data-driven hypotheses that can exceed few-shot prompting and, on some datasets, supervised learning baselines in predictive accuracy. Theoretically, it adds to the discussion of the exploitation-exploration trade-off in machine learning by showing one way a bandit-inspired reward can steer LLMs toward patterns and relationships in data.

Furthermore, the ability of HypoGeniC to produce hypotheses that generalize across models and out-of-distribution datasets suggests that the hypotheses capture patterns that are not tied to a single model or data distribution. This raises questions about how such knowledge is represented within LLMs and indicates that the generated hypotheses are reasonably robust.

Looking ahead, AI-driven hypothesis generation could help accelerate scientific discovery in domains ranging from the social sciences to the natural sciences. Future research could extend these methods to incorporate multimodal data, draw on existing literature, and explore the generation of hypotheses that require nuanced domain-specific knowledge. As LLMs continue to improve, integrating them into scientific inquiry may enable closer collaboration between AI systems and human researchers.
