
Abstract

While LLMs show remarkable performance in natural language understanding, their resource-intensive nature makes them less accessible. In contrast, smaller language models such as MiniCPM offer more sustainable scalability, but often underperform without specialized optimization. In this paper, we explore the enhancement of smaller language models through the improvement of their text embeddings. We select three language models, MiniCPM, Phi-2, and Gemma, to conduct contrastive fine-tuning on the NLI dataset. Our results demonstrate that this fine-tuning method enhances the quality of text embeddings for all three models across various benchmarks, with MiniCPM showing the most significant improvement: an average performance gain of 56.33%. The contrastive fine-tuning code is publicly available at https://github.com/trapoom555/Language-Model-STS-CFT.

Figure: Fine-tuning loss progression throughout training steps.

Overview

  • The paper presents a study on enhancing text embeddings for smaller language models like MiniCPM, Phi-2, and Gemma through contrastive fine-tuning, with a significant focus on MiniCPM due to its notable improvements.

  • The methodology involves the use of contrastive fine-tuning with Low-Rank Adaptation (LoRA) and Natural Language Inference (NLI) datasets to refine the models' semantic understanding and distinguish between similar and dissimilar text pairs efficiently.

  • Experimental results show that MiniCPM significantly outperforms other models on various benchmarks, emphasizing the effectiveness of contrastive fine-tuning in making smaller language models viable for resource-constrained NLP applications.

Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning

The paper "Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning" presents a study on enhancing the text embeddings of smaller language models (LMs) through the application of contrastive fine-tuning. This research focuses on three specific models: MiniCPM, Phi-2, and Gemma, with a particular emphasis on MiniCPM due to its demonstrated capacity for substantial improvement.

Introduction

Text embeddings play a crucial role in various NLP tasks, including information retrieval, document classification, and semantic textual similarity (STS). While LLMs such as GPT-4 have shown significant capabilities in natural language understanding, these models are often resource-intensive and less accessible. Smaller models like MiniCPM, Gemma, and Phi-2 offer a more scalable solution but tend to underperform without specific optimizations. This paper addresses the gap by improving the text embedding quality of smaller language models, making them viable alternatives for resource-constrained applications.

Methodology

The core methodology involves contrastive fine-tuning, a technique that enhances the models' ability to distinguish between semantically similar and dissimilar text pairs. The research leverages a parameter-efficient fine-tuning method, Low-Rank Adaptation (LoRA), to ensure the process remains computationally feasible. The training dataset used is a processed version of the Natural Language Inference (NLI) dataset, consisting of approximately 275,000 samples.
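
As a rough illustration of such a parameter-efficient setup, the sketch below wraps a base model with LoRA adapters via the Hugging Face peft library. The checkpoint name, rank, scaling factor, and target modules are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical LoRA setup for parameter-efficient contrastive fine-tuning.
# Checkpoint name, rank, alpha, dropout, and target modules are assumptions
# for illustration, not the paper's reported configuration.
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "openbmb/MiniCPM-2B-dpo-bf16"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_name, trust_remote_code=True)

lora_config = LoraConfig(
    r=8,                      # low-rank dimension (assumed)
    lora_alpha=16,            # scaling factor (assumed)
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="FEATURE_EXTRACTION",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are updated
```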

Contrastive Fine-tuning Approach

The contrastive fine-tuning approach is designed to improve the models' semantic understanding by aligning similar text representations closely in the embedding space while pushing dissimilar ones apart. This is achieved using the InfoNCE loss with in-batch negatives and hard negatives, formulated as:

[ \min - \log \frac{e^{\text{sim}(h_i, h_i^+) / \tau}}{\sum_{j=1}^{N} \left( e^{\text{sim}(h_i, h_j^+) / \tau} + e^{\text{sim}(h_i, h_j^-) / \tau} \right)} ]

Here, ( h_i ) denotes the embedding vector of a premise ( x_i ), ( h_j^+ ) and ( h_j^- ) denote the embeddings of positive and hard-negative examples, respectively, ( \tau ) is a temperature parameter, and ( \text{sim}(\cdot, \cdot) ) computes the cosine similarity between embedding vectors.
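
A minimal PyTorch sketch of this objective follows. The batch layout (one positive and one hard negative per premise) mirrors the formula above, while the temperature value and the use of explicit normalization are illustrative assumptions.

```python
# Sketch of the InfoNCE objective with in-batch negatives and hard negatives.
# h: premise embeddings, h_pos: positive embeddings, h_neg: hard-negative
# embeddings, each of shape (N, d). Temperature and pooling are assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(h, h_pos, h_neg, tau=0.05):
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    h_neg = F.normalize(h_neg, dim=-1)

    sim_pos = h @ h_pos.T / tau  # (N, N): cosine similarity to every positive
    sim_neg = h @ h_neg.T / tau  # (N, N): cosine similarity to every hard negative

    # The denominator sums over all positives and hard negatives in the batch;
    # the numerator is the diagonal entry (the matching premise-positive pair).
    logits = torch.cat([sim_pos, sim_neg], dim=1)      # (N, 2N)
    labels = torch.arange(h.size(0), device=h.device)  # index of h_i^+ for each h_i
    return F.cross_entropy(logits, labels)
```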

Experiments and Results

Benchmark Evaluation

The models were evaluated on nine STS benchmarks: STS12, STS13, STS14, STS15, STS16, STS17, STSBenchmark, BIOSSES, and SICK-R. These benchmarks cover a broad spectrum of sentence pairs, ranging from general news headlines to biomedical text. The evaluation metric was the Spearman correlation between the cosine similarities of model-generated embeddings and the ground-truth similarity scores.
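
A minimal sketch of this evaluation protocol is shown below, assuming a placeholder `embed` function that maps a sentence to its embedding vector.

```python
# Sketch of the STS evaluation: Spearman correlation between cosine
# similarities of sentence embeddings and human-annotated similarity scores.
# `embed` is a placeholder for the model's sentence-embedding function.
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(sent_pairs, gold_scores, embed):
    cos_sims = []
    for s1, s2 in sent_pairs:
        e1, e2 = embed(s1), embed(s2)
        cos = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
        cos_sims.append(cos)
    # Rank correlation against the ground-truth similarity scores
    return spearmanr(cos_sims, gold_scores).correlation
```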

The results, as summarized in the paper's model performance table, demonstrate that MiniCPM significantly outperforms both Gemma and Phi-2 across all benchmarks, with an average performance gain of 56.33%. Specifically, MiniCPM achieved the highest correlations on datasets such as STS12 (76.38%) and STS17 (89.96%), indicating its robust capability in capturing semantic similarities.

Ablation Studies

Several ablation studies were conducted to delve deeper into the model's performance:

  1. Pre-Fine-Tuning Performance: Comparing each model before and after fine-tuning showed that MiniCPM gained the most, emphasizing the effectiveness of the fine-tuning process.
  2. Impact of Learning Rate: It was found that a learning rate of (5 \times 10^{-5}) yielded the best results, whereas higher learning rates led to instability and underfitting.
  3. Prompting Techniques: While the original MiniCPM model benefited from specific prompt designs, the fine-tuned model exhibited marginal gains, suggesting a model-specific preference for sentence structures encountered during training.
  4. Training Data Efficiency: The model showed rapid performance gains within the first 200 training steps, showcasing high training efficiency.
  5. Hard Negatives Penalty: Penalizing hard negatives was generally beneficial, improving performance across most benchmarks.

Conclusion

This research underscores the viability of using contrastive fine-tuning to enhance the text embedding quality of smaller language models. The significant improvements observed, particularly in the MiniCPM model, highlight its potential for deployment in resource-constrained environments. The studies within the paper also offer valuable insights into the configurations that maximize the efficiency and effectiveness of the fine-tuning process.

Implications and Future Directions

The improved performance of smaller LMs through contrastive fine-tuning opens up new avenues for practical applications where computational resources are limited. This research contributes to making low-resource NLP tasks more accessible and efficient. Future developments could explore more advanced fine-tuning techniques, further optimization of learning rates, and broader applications of these smaller models across diverse NLP tasks.

The paper's code and models are publicly available, contributing to further advancements in the community and fostering collaborative improvements in the field of text embeddings.
