Emergent Mind

Abstract

The recent success of Large Language Models (LLMs) in a wide range of Natural Language Processing applications opens the path towards novel Question Answering Systems over Knowledge Graphs (KGs) that leverage LLMs. However, one of the main obstacles to their implementation is the scarcity of training data for the task of translating questions into corresponding SPARQL queries, particularly for domain-specific KGs. To overcome this challenge, in this study, we evaluate several strategies for fine-tuning the OpenLLaMA LLM for question answering over life science knowledge graphs. In particular, we propose an end-to-end data augmentation approach that extends a set of existing queries over a given knowledge graph into a larger dataset of semantically enriched question-to-SPARQL query pairs, enabling fine-tuning even where such pairs are scarce. In this context, we also investigate the role of semantic "clues" in the queries, such as meaningful variable names and inline comments. Finally, we evaluate our approach on the real-world Bgee gene expression knowledge graph and show that semantic clues can improve model performance by up to 33% compared to a baseline with random variable names and no comments.

Overview

  • The study evaluates fine-tuning strategies for OpenLLaMA, a Large Language Model, to improve SPARQL query generation from natural language questions in the life sciences.

  • A dedicated data augmentation approach expands scarce question-to-SPARQL training data for the Bgee gene expression knowledge graph.

  • The method includes augmenting question-to-SPARQL pairs and fine-tuning with techniques like QLoRA and PEFT, focusing on semantic richness without extensive hyperparameter optimizations.

  • The paper's findings reveal the significance of dataset augmentation and strategic fine-tuning for SPARQL generation accuracy, particularly highlighting the balance between domain-general and domain-specific knowledge.

Fine-tuning OpenLLaMA for Enhanced SPARQL Generation in Life Sciences

Introduction

In the rapidly evolving field of Question Answering Systems (QAS) over Knowledge Graphs (KGs), leveraging LLMs offers a promising avenue for facilitating direct natural language interaction with data. This study evaluates various fine-tuning strategies for OpenLLaMA, an open-source LLM, aimed at translating natural language questions into SPARQL queries specifically for the life sciences domain. Through an innovative end-to-end data augmentation approach, this research extends the utility of SPARQL in querying the Bgee gene expression knowledge graph, a pivotal resource in life sciences.

Background and Related Works

The intersection of LLMs and KGs has seen notable advancements, albeit with challenges in SPARQL query generation due to the specificity and complexity involved. Previous works have identified limitations in LLMs' ability to accurately generate semantically correct SPARQL queries. The deployment of models like ChatGPT has indeed demonstrated potential, yet the intricate domain-specific requirements of scientific knowledge bases, such as Bgee, demand highly accurate query translations, underscoring the importance of fine-tuning strategies and dataset augmentations in overcoming these challenges.

Methodology

The core methodology centers around two primary objectives: augmenting the existing set of question-to-SPARQL query pairs and fine-tuning OpenLLaMA to improve SPARQL query generation accuracy. Through a systematic approach, the study extends the range and semantic richness of the training data, incorporating variable names with semantic "clues" and inline comments to explore their impact on model performance. The fine-tuning leverages techniques like QLoRA and PEFT, without extensive hyperparameter optimization, showcasing the practical feasibility of the approach.
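The semantic enrichment described above can be sketched as a small transformation over an existing query. The function and the variable mapping below are illustrative assumptions, not the paper's actual implementation:

```python
import re

# Hypothetical sketch: enrich a SPARQL query with semantic "clues"
# (meaningful variable names plus an inline comment), mirroring the
# kind of augmentation the study applies to its training pairs.
def add_semantic_clues(query: str, var_names: dict, comment: str) -> str:
    """Rename SPARQL variables and prepend an inline comment."""
    for old, new in var_names.items():
        # \b keeps ?v1 from matching inside a longer name like ?v10
        query = re.sub(rf"\?{old}\b", f"?{new}", query)
    return f"# {comment}\n{query}"

raw = "SELECT ?v0 WHERE { ?v1 rdfs:label ?v0 . }"
enriched = add_semantic_clues(
    raw,
    {"v0": "geneName", "v1": "gene"},
    "Retrieve the label of every gene",
)
# enriched now reads:
# # Retrieve the label of every gene
# SELECT ?geneName WHERE { ?gene rdfs:label ?geneName . }
```

The baseline condition in the study corresponds to the `raw` form (random variable names, no comment); the enriched form is what the semantic-clue variants of the dataset would resemble.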

Evaluation and Discussion

The study's evaluation, grounded in the life sciences domain with the Bgee gene expression knowledge graph, employs robust metrics including BLEU, SP-BLEU, METEOR, and ROUGE-L. The findings reveal that incorporating context through meaningful variable names and inline comments significantly enhances model performance across all metrics. Interestingly, pre-fine-tuning the model with a diverse dataset like KQA Pro does not conclusively improve performance, and in some instances may even degrade it when the model is subsequently fine-tuned on a domain-specific dataset such as Bgee. This highlights the nuanced interplay between domain-general and domain-specific knowledge in LLM fine-tuning.
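The core idea behind the BLEU-family metrics used here is n-gram overlap between the generated query and the reference. A minimal sketch (omitting BLEU's multi-order averaging and brevity penalty, and SP-BLEU's SPARQL-aware tokenization, all of which the real metrics include):

```python
from collections import Counter

# Simplified clipped n-gram precision: the fraction of candidate
# n-grams that also appear in the reference, illustrative only.
def ngram_precision(candidate, reference, n=2):
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Clip each candidate n-gram's count by its count in the reference
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

gen = "SELECT ?gene WHERE { ?gene rdfs:label ?name }".split()
gold = "SELECT ?gene WHERE { ?gene rdfs:label ?geneName }".split()
score = ngram_precision(gen, gold, n=2)  # 5 of 7 bigrams match
```

Even this toy score shows why variable naming matters to surface-overlap metrics: a semantically equivalent query with different variable names is penalized, which is part of the motivation for SPARQL-aware variants like SP-BLEU.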

Conclusion

This paper presents a comprehensive approach to fine-tuning LLMs for SPARQL query generation within the life science domain, particularly over the Bgee knowledge graph. The results underscore the value of dataset augmentation and strategic fine-tuning in achieving notable improvements in query generation accuracy. The nuanced findings regarding pre-fine-tuning with general datasets versus direct domain-specific fine-tuning offer valuable insights for future research. Moving forward, expanding the dataset augmentation techniques and exploring their applicability across a broader range of scientific knowledge bases remains a promising direction, with the potential to significantly advance the capabilities of QAS over KGs in the life sciences and beyond.

Acknowledging the complexity and the critical need for accurate data querying in life sciences research, this study marks an important step towards harnessing the full potential of LLMs in bridging natural language questions and SPARQL, thus enhancing the accessibility and utility of invaluable data resources in the domain.
