Repetition Improves Language Model Embeddings (2402.15449v1)

Published 23 Feb 2024 in cs.CL and cs.LG

Abstract: Recent approaches to improving the extraction of text embeddings from autoregressive LLMs have largely focused on improvements to data, backbone pretrained LLMs, or improving task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art compared to prior open source models that do not leverage synthetic fine-tuning data.


Summary

  • The paper introduces "echo embeddings," a technique that improves autoregressive LLM embeddings by letting token representations integrate future context.
  • The input is repeated in context and embeddings are extracted from the second occurrence, yielding over 9% zero-shot improvement over classical embeddings on the MTEB benchmark.
  • The method is a practical, easily adopted solution; with Mistral-7B it achieves state-of-the-art results among open-source models that do not use synthetic fine-tuning data.

Enhancing Autoregressive LLM Embeddings with the Echo Technique

Introduction to Echo Embeddings

Improving text embeddings is central to tasks such as information retrieval, semantic similarity estimation, classification, and clustering. This paper presents a new approach to extracting embeddings from autoregressive LLMs, addressing a core limitation of such models: the embedding of a given token cannot incorporate information from the tokens that follow it. The work introduces "echo embeddings," which bring future context into the embeddings by repeating the input sentence within the model's context and extracting representations from the second occurrence. The paper reports substantial gains on the Massive Text Embedding Benchmark (MTEB), establishing echo embeddings as an effective way to leverage the strengths of autoregressive LLMs for text embeddings.
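To make this limitation concrete, here is a minimal sketch (not from the paper; the small GPT-2 stand-in and the example sentences are illustrative assumptions) showing that a causal LM assigns identical hidden states to a shared prefix regardless of how the sentence continues:

```python
# Demonstrate that a causal LM's token representations ignore everything
# that comes after the token. GPT-2 is a small stand-in for the larger
# models discussed in the paper.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

prefix = "The bank raised"
a = prefix + " interest rates by a quarter point."
b = prefix + " its flood barriers before the storm."

with torch.no_grad():
    ha = model(**tokenizer(a, return_tensors="pt")).last_hidden_state[0]
    hb = model(**tokenizer(b, return_tensors="pt")).last_hidden_state[0]

n = len(tokenizer(prefix)["input_ids"])  # number of tokens in the shared prefix
# The representations of "The bank raised" match in both sentences, even though
# the continuations resolve the meaning of "bank" very differently.
print(torch.allclose(ha[:n], hb[:n], atol=1e-5))  # expected: True
```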

Methodology and Findings

Addressing the Limitation of Autoregressive Models

The paper identifies a key limitation of autoregressive LLMs: because of causal attention, the representation of a token cannot use information from the tokens that follow it, which can hurt performance in applications that require a holistic understanding of the text. To counter this, the authors propose echo embeddings: the input sentence is repeated in context, so that while encoding each token of the second occurrence the model can attend to the entire input. Embeddings extracted from the second occurrence therefore let early tokens carry information about later portions of the text, sidestepping the architectural limitation.
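The mechanics are simple enough to sketch in a few lines. The snippet below is a hedged illustration assuming a HuggingFace causal LM, mean pooling, and a "rewrite the sentence" style prompt; the exact template, pooling strategy, and model in the paper may differ (its headline results use Mistral-7B):

```python
# Illustrative echo-embedding sketch: feed the input twice and pool hidden
# states over the second occurrence only, whose token representations have
# already "seen" the full sentence via the first copy.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "gpt2"  # small stand-in; the paper's results use much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()


def classical_embed(text: str) -> torch.Tensor:
    """Baseline: mean-pool hidden states from a single pass over the input."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    return hidden.mean(dim=0)


def echo_embed(text: str) -> torch.Tensor:
    """Echo: repeat the input in context, pool over the second occurrence."""
    prompt = f"Rewrite the sentence: {text} Rewritten sentence: {text}"
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    # Locate the second copy by counting its tokens from the end; this assumes
    # the repeated text tokenizes the same in isolation, which holds here.
    n = len(tokenizer(" " + text, add_special_tokens=False)["input_ids"])
    return hidden[-n:].mean(dim=0)
```

Because every token of the second copy attends to a complete copy of the sentence, even its earliest positions can reflect how the sentence ends, which a single-pass classical embedding cannot do.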

Empirical Validation

Through extensive experiments, the authors demonstrate the efficacy of echo embeddings. Benchmarked on MTEB, echo embeddings improve over classical embeddings by more than 9% in the zero-shot setting, with consistent gains across tasks. Experiments on synthetic data further confirm that echo embeddings capture bidirectional information, allowing them to outperform classical embeddings in cases where the early tokens of two texts only superficially suggest similarity while their later portions diverge.
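A rough illustration of that synthetic setup: construct pairs whose opening words match while their second halves diverge, then compare similarities under the two pooling schemes. The sentences below are invented for illustration and reuse the hedged `classical_embed` / `echo_embed` sketches from the previous section:

```python
import torch.nn.functional as F

# q and s_pos agree in meaning; s_neg copies q's opening words but diverges
# in its second half. All three sentences are illustrative assumptions.
q     = "The committee approved the budget, and construction begins in May."
s_pos = "The committee approved the budget, so building work starts this spring."
s_neg = "The committee approved the budget, but construction was postponed indefinitely."

def cos(a, b):
    return F.cosine_similarity(a, b, dim=0).item()

for name, embed in [("classical", classical_embed), ("echo", echo_embed)]:
    print(f"{name:9s}  sim(q, s_pos) = {cos(embed(q), embed(s_pos)):.3f}"
          f"  sim(q, s_neg) = {cos(embed(q), embed(s_neg)):.3f}")
# The paper's finding is that echo embeddings separate such pairs more cleanly,
# because early tokens of the second copy already reflect the differing
# endings; with a small stand-in model the gap may be modest.
```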

Practical and Theoretical Implications

Echo embeddings are easy to adopt and can be combined with existing or future improvements to autoregressive LLM embeddings, since they change only how the input is presented and where representations are pooled. Theoretically, the work points to a way of maximizing the informational content of embeddings derived from autoregressive models, potentially paving the way for more sophisticated, contextually aware architectures.

Future Outlook

Looking forward, the concept opens promising avenues for further exploration. While the immediate benefits to information retrieval and related applications are clear, understanding why echo embeddings yield performance gains, especially after fine-tuning, warrants additional research. The method's computational cost is another natural target for optimization, since repeating the input roughly doubles the length of the sequence that must be processed.

Additionally, the conceptual framework of echo embeddings could inspire comparable methodologies across different model architectures, not limited to text data. As autoregressive models continue to evolve, integrating echo embeddings or similar approaches could become a standard practice for generating high-quality embeddings, contributing further to advancements in machine learning and AI.

Conclusion

The introduction of echo embeddings marks a significant development in the field of neural text embeddings, particularly for autoregressive LLMs. By ingeniously addressing a critical limitation of these models, the researchers have not only demonstrated substantial performance improvements but also opened new horizons for future research and applications. As the AI community continues to strive for more contextually rich and informative embeddings, techniques like echo embeddings will likely play a crucial role.
