- The paper demonstrates that ChatGPT can generate high-precision Boolean queries, though its results may compromise recall in systematic reviews.
- It reveals that detailed prompts, particularly those including example queries or structural guidance such as PICO elements, significantly improve query formulation and refinement.
- The study highlights the potential of guided, iterative interactions to optimize query performance while addressing challenges in reproducibility and MeSH accuracy.
Evaluation of ChatGPT for Boolean Query Formulation in Systematic Reviews
The research paper titled "Can ChatGPT Write a Good Boolean Query for Systematic Review Literature Search?" provides an empirical analysis of ChatGPT's ability to generate Boolean queries for systematic review literature searches. This exploration is relevant given the critical role systematic reviews play in synthesizing evidence for healthcare and medical research. Boolean queries are central to these reviews, serving as filters that determine which studies are included. The challenge lies in crafting Boolean queries that balance precision and recall, capturing the relevant literature exhaustively without admitting excessive noise.
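The precision–recall tradeoff at the heart of this balance can be made concrete with a small computation over a hypothetical retrieval result (the document IDs and sets below are illustrative only, not data from the paper):

```python
# Illustrative precision/recall/F-measure computation for a Boolean query's
# result set. Document IDs are hypothetical stand-ins.
relevant = {"d1", "d2", "d3", "d4", "d5"}   # studies the review should find
retrieved = {"d1", "d2", "d7", "d8"}        # studies the query actually returned

true_positives = len(relevant & retrieved)
precision = true_positives / len(retrieved)  # fraction of retrieved that are relevant
recall = true_positives / len(relevant)      # fraction of relevant that were retrieved
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f_measure:.2f}")
```

A query that returns fewer, tighter results pushes precision up and recall down; a systematic review typically prioritizes recall, which is why high-precision LLM queries can be a liability.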
Key Findings
A comprehensive experimental setup was designed to evaluate the performance of ChatGPT against well-established and manually curated methods. The paper focuses on three main explorations: single prompt query formulation, single prompt query refinement, and guided prompt query formulation based on established procedures.
- Single Prompt Query Formulation:
  - While ChatGPT-generated queries are adept at achieving high precision, they generally sacrifice recall. Queries derived from detailed prompts, particularly those incorporating high-quality examples of Boolean structures, showed improved performance metrics.
  - Prompts incorporating semantically close examples or guidance on query structure, such as PICO elements, helped refine terms more effectively, albeit with mixed effects on recall.
- Single Prompt Query Refinement:
  - ChatGPT demonstrated promising results when refining existing queries, notably those generated by state-of-the-art strategies such as the objective method. Gains in precision and F-measure were evident, suggesting ChatGPT's capacity to optimize pre-existing Boolean formulations.
  - The process revealed ChatGPT's potential to reduce irrelevant retrievals while maintaining satisfactory recall, especially when refining queries initially structured to be comprehensive.
- Guided Prompt Query Formulation:
  - By employing a sequence of prompts that mimics established query-formulation methods, ChatGPT improved both precision and recall compared to its baseline query formulation approaches. This underscores the utility of structured guidance and iterative interaction for complex Boolean query creation.
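The PICO-guided structuring discussed in these findings amounts to a facet-based recipe: OR synonyms together within each PICO facet, then AND the facets. A minimal sketch of that assembly step (all terms below are invented for illustration, not drawn from the paper):

```python
# Sketch: assemble a PubMed-style Boolean query from PICO facets.
# The facets and terms are hypothetical examples only.
pico = {
    "Population":   ["adults", "elderly"],
    "Intervention": ["statins", "HMG-CoA reductase inhibitors"],
    "Outcome":      ["cardiovascular events", "mortality"],
}

# Synonyms within a facet broaden recall (OR); combining facets with AND
# narrows the result set, raising precision.
clauses = ["(" + " OR ".join(f'"{t}"' for t in terms) + ")"
           for terms in pico.values()]
query = " AND ".join(clauses)
print(query)
```

Prompting ChatGPT with this facet structure gives it a scaffold that a free-form single prompt lacks, which is consistent with the guided approach outperforming the baselines.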
Implications and Future Directions
The paper’s findings provide substantial insights into the potential applicability of ChatGPT for systematic review query construction, particularly in contexts demanding rapid review processes where time constraints necessitate higher precision yet tolerate lower recall. ChatGPT’s precision and its ability to refine existing queries make it a potentially useful tool in this domain.
However, the paper identifies several challenges, notably the variability of query effectiveness across interactions: because generative models are not deterministic, the same prompt can yield queries of markedly different quality. These issues invite further research into stability and robustness mechanisms that ensure reproducible, standardized performance across multiple runs.
Furthermore, addressing inaccuracies in MeSH term generation remains critical. Incorrect handling of such terms can cause substantial losses in recall, undermining the comprehensiveness of systematic reviews. Future research could explore integrated approaches that verify and correct MeSH term selections after generation.
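One simple post-generation safeguard along these lines would be to check generated MeSH terms against the controlled vocabulary and flag anything unrecognized. The tiny vocabulary below is a stand-in for the real MeSH thesaurus, and the function name is hypothetical:

```python
# Sketch: validate LLM-generated MeSH terms against a known vocabulary.
# KNOWN_MESH is a tiny illustrative stand-in; a real check would query
# the full MeSH thesaurus.
KNOWN_MESH = {"Hypertension", "Diabetes Mellitus", "Stroke"}

def validate_mesh(generated_terms):
    """Split generated terms into (valid, suspect) against the vocabulary."""
    valid = [t for t in generated_terms if t in KNOWN_MESH]
    suspect = [t for t in generated_terms if t not in KNOWN_MESH]
    return valid, suspect

valid, suspect = validate_mesh(["Hypertension", "Heart Badness"])
print(valid, suspect)  # suspect terms can be mapped or dropped before searching
```

Catching fabricated headings before the query is run prevents silent recall losses from clauses that match nothing in the index.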
Lastly, this paper opens avenues for combining ChatGPT with automated cluster analysis or active-learning-driven refinement to address inefficiencies and improve the retrieval of unassessed documents in batches, thus widening the net for potentially relevant studies.
Conclusion
The utility of ChatGPT in systematic review literature search through Boolean query generation presents an intriguing and promising intersection of artificial intelligence and medical informatics. While certain limitations are acknowledged, particularly regarding reproducibility and variability, preliminary outcomes affirm the potential to enhance precision and facilitate rapid review processes. This work stands as a pioneering step toward integrating powerful LLMs like ChatGPT within evidence synthesis workflows and catalyzes continued exploration into refining AI capabilities in this critical area of research.