
Improving Sentence Embeddings with an Automatically Generated NLI Dataset

(2402.15132)
Published Feb 23, 2024 in cs.CL and cs.LG

Abstract

Decoder-based LLMs have shown high performance on many tasks in natural language processing. This is also true for sentence embedding learning, where a decoder-based model, PromptEOL, has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL relies heavily on fine-tuning with a manually annotated natural language inference (NLI) dataset. We aim to improve sentence embeddings learned in an unsupervised setting by automatically generating an NLI dataset with an LLM and using it to fine-tune PromptEOL. In experiments on STS tasks, the proposed method achieved an average Spearman's rank correlation coefficient of 82.21 with human judgments, thus outperforming existing methods without using large, manually annotated datasets.

Overview

  • The study introduces a novel method for enhancing sentence embeddings through automatically generated NLI datasets, reducing reliance on manual annotations.

  • Building on decoder-based LLMs via the PromptEOL method, the research focuses on improving embeddings for Semantic Textual Similarity (STS) tasks.

  • An automatic NLI dataset generation technique is developed, using simple prompts and few-shot demonstrations; with 20-shot prompting, the generated data matches the quality of manually annotated datasets.

  • Empirical evaluation shows that the model, fine-tuned with these automatically generated datasets, achieves strong performance on STS tasks (an average Spearman's rank correlation of 82.21), supporting the methodology's efficacy.

Enhancing Sentence Embeddings via Automatically Generated NLI Datasets

Introduction

The quest for sophisticated sentence embeddings has led to various methodologies, most notably the fine-tuning of pre-trained language models. Historically, encoder-based models such as SentenceBERT and PromptBERT have taken center stage. Lately, however, decoder-based LLMs have shown promising results across the NLP spectrum, including Semantic Textual Similarity (STS). A significant step was the PromptEOL model, which prompts an LLM to express an entire sentence's meaning as a single word. Despite its superior performance on STS tasks, its dependency on large, manually annotated Natural Language Inference (NLI) datasets poses a limitation. Addressing this, the study introduces a method to generate NLI datasets automatically, leveraging LLM capabilities to fine-tune PromptEOL for enhanced sentence embeddings without extensive manual annotation.

PromptEOL: A Focused Analysis

PromptEOL distinguishes itself by using prompts to extract sentence embeddings from decoder-based LLMs. A crafted prompt asks the model to compress a sentence's meaning into a single next word, aligning the embedding task with the next-token-prediction objective that LLMs are pre-trained on. Fine-tuning on NLI datasets further refines the embeddings to emphasize entailment and contradiction relations, which are foundational to semantically rich embeddings.
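For concreteness, here is a minimal sketch of PromptEOL-style embedding extraction with HuggingFace transformers. The prompt template follows the description in the PromptEOL paper; the checkpoint name and the choice of the final-layer hidden state at the last prompt token are assumptions made for illustration, not code released with this study.

```python
# Minimal sketch of PromptEOL-style embedding extraction (assumptions:
# checkpoint name; last-token hidden state used as the embedding).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; the study uses Llama-2-7b
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def prompteol_embedding(sentence: str) -> torch.Tensor:
    # The prompt asks the model to summarize the sentence in one word, so the
    # hidden state at the final prompt position carries the sentence's meaning.
    prompt = f'This sentence : "{sentence}" means in one word:"'
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Final layer, last token position.
    return outputs.hidden_states[-1][0, -1, :]

emb = prompteol_embedding("A man is playing a guitar.")
print(emb.shape)  # torch.Size([4096]) for Llama-2-7b
```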

Automatic NLI Dataset Generation

The cornerstone of this research is its method for automatic NLI dataset generation. Simple prompts instruct an LLM to transform premise sentences into hypotheses labeled as entailment or contradiction, bypassing the extensive manual annotation effort. To raise the quality of the generated hypotheses, the study adds few-shot demonstrations, scaling from 0-shot to 20-shot prompting; at 20 shots, the generated data matches the quality of manually curated datasets.
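A hedged sketch of this generation step appears below. It assumes a generic HuggingFace text-generation pipeline; the instruction wording and demonstration pairs are illustrative placeholders in the style of SNLI examples, not the paper's exact prompts.

```python
# Sketch of automatic NLI generation via few-shot prompting. The instruction
# text and demonstrations are illustrative, not the paper's exact prompts.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")

INSTRUCTIONS = {
    "entailment": "Write a sentence that must be true if the given sentence is true.",
    "contradiction": "Write a sentence that must be false if the given sentence is true.",
}
DEMOS = {  # (premise, hypothesis) demonstration pairs; the paper uses up to 20
    "entailment": [
        ("A soccer game with multiple males playing.",
         "Some men are playing a sport."),
    ],
    "contradiction": [
        ("A soccer game with multiple males playing.",
         "Nobody is playing any sport."),
    ],
}

def build_prompt(premise: str, label: str) -> str:
    # Few-shot prompt: instruction, k demonstrations, then the new premise.
    lines = [INSTRUCTIONS[label]]
    for demo_premise, demo_hypothesis in DEMOS[label]:
        lines.append(f'Sentence: "{demo_premise}"')
        lines.append(f'Answer: "{demo_hypothesis}"')
    lines.append(f'Sentence: "{premise}"')
    lines.append('Answer: "')
    return "\n".join(lines)

out = generator(build_prompt("A man is playing a guitar on stage.", "entailment"),
                max_new_tokens=30)
print(out[0]["generated_text"])
```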

Empirical Evaluation

The empirical investigations show that the generated NLI dataset can train PromptEOL to strong performance on STS tasks. Fine-tuned on data obtained from 20-shot generation, the model rivaled the scores achieved with manually annotated datasets, underscoring the potential of automatically generated NLI data for learning high-quality sentence embeddings. Notably, the model reached an average Spearman's rank correlation coefficient of 82.21 on STS benchmarks, outperforming existing unsupervised approaches and demonstrating a new way to use NLI datasets in sentence embedding learning.
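STS benchmarks are scored by correlating model similarities with human ratings. The sketch below shows this standard protocol, reusing the prompteol_embedding function sketched earlier; the sentence pairs and gold scores are toy stand-ins, not benchmark data.

```python
# Standard STS evaluation protocol: Spearman correlation between cosine
# similarities of embeddings and human similarity ratings (toy data below).
import torch
from scipy.stats import spearmanr

pairs = [
    ("A man is playing a guitar.", "Someone plays an instrument."),
    ("A man is playing a guitar.", "A woman is cooking dinner."),
    ("Two dogs run in a field.", "Dogs are running outside."),
]
gold_scores = [4.2, 0.5, 4.6]  # human ratings on the 0-5 STS scale (toy values)

predicted = []
for s1, s2 in pairs:
    e1, e2 = prompteol_embedding(s1), prompteol_embedding(s2)
    predicted.append(torch.cosine_similarity(e1, e2, dim=0).item())

rho, _ = spearmanr(predicted, gold_scores)
print(f"Spearman's rho on this toy set: {rho:.4f}")
```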

Conclusion and Prospects for Future Work

The proposed framework offers a novel pathway to obtain sentence embeddings by leveraging automatically generated NLI datasets, significantly reducing the dependency on large, manually annotated corpora. While the results on STS tasks are promising, the study also acknowledges limitations, including the exclusive use of the Llama-2-7b model and the focus on English. Future explorations could extend to other LLMs and languages to broaden the applicability and utility of this approach.

This research illuminates the path forward in the development of more efficient sentence embedding methodologies that can potentially adapt to various languages and models, promising an exciting avenue for further exploration in the domain of Natural Language Processing.
