
LongEmbed: Extending Embedding Models for Long Context Retrieval

(2404.12096)
Published Apr 18, 2024 in cs.CL and cs.LG

Abstract

Embedding models play a pivotal role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, which bars them from application scenarios requiring long inputs such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Building on this, comprehensive experiments show that training-free context window extension strategies such as position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 or beyond 4k tokens. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.

Figure: arrangement of position ids for scaling APE models, with learnable and frozen position vectors during further tuning.

Overview

  • The paper introduces a method to expand the context windows of embedding models, enhancing their performance for processing long inputs.

  • LongEmbed, a new benchmark, is introduced to assess embedding models on extended contexts, highlighting areas needing improvement.

  • Training-free strategies, including position interpolation, parallel context windows, and RoPE-specific methods such as NTK and SelfExtend, are explored to extend the operational range of these models without retraining, tailored to whether a model uses absolute (APE) or rotary (RoPE) position embeddings.

  • Empirical results show RoPE-based models have significant potential in handling extended contexts, suggesting a shift in future model design preferences.

Extending Context Window in Embedding Models for Enhanced Long Input Processing

Introduction and Motivation

Embedding models are fundamental to various NLP applications, yet they have traditionally been limited by narrow context windows. This paper presents a comprehensive exploration of strategies for extending the context windows of existing embedding models without retraining. The focus is on improving performance in long-input scenarios, such as lengthy documents or detailed contracts, where traditional models falter because of their typical limit of 512 to 8k tokens.

Benchmarking Current Models

The paper introduces LongEmbed, a new benchmark designed to critically assess the performance of embedding models across extended contexts. LongEmbed includes both synthetic tasks (passkey and needle-in-a-haystack retrieval) and real-world retrieval tasks tailored to challenge models with inputs far exceeding traditional lengths. The benchmark results reveal considerable room for improvement, as current models struggle to manage longer contexts effectively.
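To make the synthetic setup concrete, the sketch below scores a passkey-style retrieval check with an arbitrary embedding model. It is a simplified illustration rather than the benchmark's actual data generation: the `embed` callable, the filler sentences, and the document length are placeholders supplied by the reader.

```python
import numpy as np

def passkey_top1_accuracy(embed, keys, doc_words=4000, seed=0):
    """Toy passkey-style check: each long document hides one key, the query
    asks for that key, and the matching document should rank first by cosine
    similarity. `embed` maps a list of strings to an (n, d) float array."""
    rng = np.random.default_rng(seed)
    filler = "The grass is green. The sky is blue. The sun is bright. "
    docs, queries = [], []
    for k in keys:
        words = (filler * (doc_words // 10 + 1)).split()[:doc_words]
        # Bury the passkey sentence at a random depth in the filler text.
        words.insert(int(rng.integers(0, len(words))), f"The pass key for user {k} is {k}.")
        docs.append(" ".join(words))
        queries.append(f"What is the pass key for user {k}?")
    q = np.asarray(embed(queries), dtype=np.float32)
    d = np.asarray(embed(docs), dtype=np.float32)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    hits = (q @ d.T).argmax(axis=1) == np.arange(len(keys))
    return float(hits.mean())
```

A model whose effective window is much shorter than `doc_words` will often miss keys buried late in the document, which is exactly the failure mode the benchmark is meant to expose.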

Strategies for Context Extension

Several methodologies were tested for extending the operational range of these models:

  • Position Interpolation and Reorganization: Methods like parallel context windows and position interpolation proved effective across various models, multiplying the effective context window several-fold.
  • RoPE and APE Comparisons: Distinct strategies were tailored to each model's position encoding, Absolute Position Encoding (APE) or Rotary Position Embedding (RoPE). For APE models, techniques like position interpolation allowed extended context processing without additional training (a minimal interpolation sketch follows this list). RoPE models, in contrast, benefited most from RoPE-specific methods such as NTK (Neural Tangent Kernel-aware interpolation) and SelfExtend, which leverage their relative treatment of positions.
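As a concrete illustration of the APE case, the sketch below stretches a learned absolute position embedding table (for example, the 512 positions of a BERT-style E5 encoder) to a longer length by linear interpolation. This is a generic rendering of position interpolation, not the paper's exact recipe; the tensor layout assumes a HuggingFace-style `(num_positions, hidden_dim)` embedding weight.

```python
import torch
import torch.nn.functional as F

def interpolate_ape(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Stretch a learned absolute position embedding table to `new_len` rows
    by linear interpolation, so extended position i reads the original table
    at fractional index i * (old_len - 1) / (new_len - 1).

    pos_emb: (old_len, dim) weight matrix; returns (new_len, dim)."""
    # F.interpolate expects (batch, channels, length), so move dim to channels.
    stretched = F.interpolate(
        pos_emb.T.unsqueeze(0),      # (1, dim, old_len)
        size=new_len,
        mode="linear",
        align_corners=True,          # keep the first and last positions fixed
    )
    return stretched.squeeze(0).T    # (new_len, dim)

# Example: extend a 512-position table to 4096 positions; the new table could
# then be tuned further while the rest of the model stays frozen.
# table = model.embeddings.position_embeddings.weight.data  # BERT-style model
# new_table = interpolate_ape(table, 4096)
```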

Empirical Findings

The empirical studies yielded clear results:

  • APE-based models handled longer inputs once their position embeddings were extended, and further fine-tuning of the new positions yielded additional gains while strictly preserving behavior on short inputs.
  • RoPE-based models saw substantial improvements from RoPE-specific extensions such as NTK and SelfExtend, demonstrating their capacity to manage even longer inputs effectively; for instance, extending E5-Mistral's context window to 32k tokens yielded significant gains on long-document retrieval. A sketch of the NTK-aware base adjustment appears after this list.
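For reference, the snippet below shows the NTK-aware base adjustment commonly used for such RoPE extensions: instead of interpolating positions directly, the rotary base is enlarged so that extended positions stay within the range of rotation angles seen during training. The exact scaling convention varies across implementations; this is one common form, not necessarily the paper's exact configuration.

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """Inverse frequencies for rotary position embeddings. With scale > 1,
    the base is enlarged following the NTK-aware rule, stretching the low
    frequencies so a window extended by roughly `scale` stays in-distribution."""
    if scale > 1.0:
        base = base * scale ** (head_dim / (head_dim - 2))  # NTK-aware adjustment
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rope_angles(seq_len: int, inv_freq: torch.Tensor) -> torch.Tensor:
    """Rotation angles of shape (seq_len, head_dim // 2) applied to queries/keys."""
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, inv_freq)

# Example: a model trained with a 4k context, extended roughly 8x toward 32k.
inv_freq = rope_inv_freq(head_dim=128, scale=8.0)
angles = rope_angles(32768, inv_freq)
```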

Implications and Future Work

The insights from this paper carry substantial implications for the development of more efficient and capable embedding models. The demonstrated superiority of RoPE in handling extended contexts suggests a shift in design preferences for future embedding models. Moreover, the methodologies and the new benchmark introduced here provide a foundation for further research into embedding model enhancements.

The research also sets the stage for exploring additional strategies in context window extension and fine-tuning, and stresses the benefit of shared benchmarks like LongEmbed for consistent evaluation and comparison of future models.

Conclusion

Overall, this work not only advances our understanding of how embedding models can be adapted to manage longer contexts effectively but also underlines the importance of model and methodology choices in achieving high performance in long-input scenarios. Researchers are encouraged to leverage the findings and tools made available through this paper to propel the capabilities of NLP applications further.
