Emergent Mind

Abstract

In academic research, systematic literature reviews are foundational and highly relevant, yet tedious to create due to the high volume of publications and labor-intensive processes involved. Systematic selection of relevant papers through conventional means like keyword-based filtering can be inadequate, plagued by semantic ambiguities and inconsistent terminology, which can lead to sub-optimal outcomes. To mitigate the extensive manual filtering required, we explore and evaluate the potential of using LLMs to enhance the efficiency, speed, and precision of literature review filtering, reducing the amount of manual screening needed. By using models as classification agents acting on a structured database only, we prevent common problems inherent in LLMs, such as hallucinations. We evaluate the real-world performance of such a setup during the construction of a recent literature survey paper with initially more than 8.3k potentially relevant articles under consideration and compare this with human performance on the same dataset. Our findings indicate that employing advanced LLMs like GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, or Llama3 with simple prompting can significantly reduce the time required for literature filtering - from the weeks typically needed for manual research to only a few minutes. Simultaneously, we crucially show that false negatives can indeed be controlled through a consensus scheme, achieving recalls >98.8% at or even beyond the typical human error threshold, thereby also yielding a more accurate and relevant selection of articles. Our research not only demonstrates a substantial improvement in the methodology of literature reviews but also sets the stage for further integration and extensive future applications of responsible AI in academic research practices.

Overview

  • The paper addresses the labor-intensive process of conducting systematic literature reviews by proposing the use of LLMs for efficient filtration of academic articles.

  • The authors introduce a method leveraging LLMs to automate the initial filtration process, significantly reducing the time and effort required while maintaining high accuracy and recall rates.

  • Evaluation of several advanced LLMs demonstrates their ability to process large datasets quickly and accurately, with implications for improved resource allocation, scalability, and support for interdisciplinary research.

Efficient Filtration in Systematic Literature Reviews with LLMs

The paper titled "Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews" by Lucas Joos, Daniel A. Keim, and Maximilian T. Fischer from the University of Konstanz engages with a critical issue in academic research: the labor-intensive process of conducting systematic literature reviews (SLRs). The authors propose and evaluate a methodological enhancement using LLMs to streamline and optimize the preliminary steps of literature filtration.

Introduction to SLR Challenges

The creation of systematic literature reviews is a time-consuming and repetitive task, fundamental for synthesizing existing research comprehensively. Manual screening of large volumes of scholarly articles based on titles and abstracts often results in weeks or even months of dedicated effort by researchers. Conventional keyword-based filtering techniques are often inadequate due to semantic ambiguities and inconsistent terminology, leading to suboptimal inclusion and exclusion of relevant literature.

Methodological Innovation

This research introduces a structured approach leveraging LLMs to automate the initial filtration process. By acting as classification agents, LLMs are employed to evaluate a large dataset based on pre-defined criteria set through carefully crafted prompts. The primary objective is to enhance efficiency and reduce the manual workload while maintaining or exceeding human performance accuracy benchmarks. This is achieved by avoiding common pitfalls of LLMs, such as hallucinations, through context-specific prompts and consensus voting schemes.
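The classification-agent setup described above can be sketched as a prompt template applied to each title/abstract record. The exact prompt wording and inclusion criteria used in the paper are not given in this summary, so the template and names below are illustrative assumptions:

```python
# Hypothetical prompt template for using an LLM as a classification agent
# over structured title/abstract records only. The criteria text and the
# INCLUDE/EXCLUDE answer format are assumptions, not the paper's exact prompt.

INCLUSION_PROMPT = """\
You are screening papers for a systematic literature review.
Inclusion criteria: {criteria}

Title: {title}
Abstract: {abstract}

Answer with exactly one word: INCLUDE or EXCLUDE."""


def build_prompt(title: str, abstract: str, criteria: str) -> str:
    """Fill the template for one paper; the result is sent to an LLM."""
    return INCLUSION_PROMPT.format(criteria=criteria, title=title, abstract=abstract)
```

Restricting the model to a one-word answer over supplied text keeps it from generating unsupported claims, which is how the hallucination risk is contained in this kind of setup.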

Evaluation and Results

The study presents a rigorous evaluation using a dataset of 8,323 articles pertinent to a specific research domain, compared against a manually curated ground truth. The authors test several advanced LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Llama3 (8B and 70B). The evaluation metrics focus on true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), alongside accuracy (Acc), precision (P), recall (R), and F₁ score.
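The metrics above follow the standard confusion-matrix definitions; a minimal sketch (function name and example counts are illustrative, not the paper's numbers):

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 for one screening run.

    In SLR filtering, recall matters most: a false negative is a relevant
    paper silently discarded, while a false positive merely costs one
    extra manual check.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```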

Key Findings

  1. Accuracy and Recall: All LLMs demonstrated high accuracy, exceeding 90%, with recall rates surpassing 97%. This is critical as high recall ensures minimal loss of relevant literature during filtration.
  2. Efficiency: The use of LLMs reduced the initial review time from several weeks to minutes. For instance, GPT-4o processed the dataset in under ten minutes, costing approximately $28.81.
  3. Consensus Voting: The application of consensus schemes significantly improved filtration accuracy. A "Consensus (Best)" approach, utilizing the top-performing models, achieved a recall rate of over 98.8%, with only a single paper misclassified as a false negative.
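The consensus idea in finding 3 can be sketched as a vote over several models' include/exclude decisions. The paper's exact aggregation rule is not spelled out in this summary, so the recall-oriented "keep if any model includes" default below is an assumption:

```python
def consensus_include(model_votes: dict[str, bool], min_votes: int = 1) -> bool:
    """Keep a paper if at least `min_votes` models classify it as relevant.

    min_votes=1 is an OR-style, recall-maximizing rule: a paper is only
    dropped when every model rejects it, pushing false negatives toward
    zero at the cost of more false positives to screen manually.
    """
    return sum(model_votes.values()) >= min_votes


# Illustrative use with hypothetical per-model decisions:
votes = {"gpt-4o": True, "claude-3.5-sonnet": False, "gemini-1.5-flash": False}
keep = consensus_include(votes, min_votes=1)  # kept: one model voted include
```

Raising `min_votes` toward a majority trades recall for precision, which is the knob a "Consensus (Best)" style scheme tunes.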

Implications and Future Directions

The demonstrated ability of LLMs to substantially reduce the initial workload and time required for SLRs presents multiple implications:

  • Resource Allocation: Significant reduction in manual efforts allows researchers to focus more on synthesis and analysis, optimizing resource use.
  • Scalability: The methodology's scalability makes it suitable for large-scale reviews across diverse academic fields.
  • Interdisciplinary Research: The improved and efficient filtration may facilitate broader interdisciplinary studies by making initial literature surveys less burdensome.

The study also highlights potential pathways for future research. These include refining prompt engineering techniques to enhance LLM performance further, exploring few-shot or zero-shot learning for more nuanced classification tasks, and extending automation to subsequent stages of the SLR process, such as paper coding and qualitative analysis.

Conclusion

This paper offers a comprehensive evaluation of leveraging LLMs for systematic literature filtration, providing a robust and structured methodology that can significantly enhance efficiency and accuracy. The findings suggest that incorporating LLMs into the academic review process can transform conventional labor-intensive methods into more streamlined and scalable approaches, potentially revolutionizing how systematic literature reviews are conducted in the future. This research underscores the growing potential of responsible AI integration in academic practices, paving the way for broader and more accessible applications in advancing scholarly work.
