Repetition Improves Language Model Embeddings (2402.15449v1)
Abstract: Recent approaches to improving the extraction of text embeddings from autoregressive LLMs have largely focused on improvements to the training data, to the backbone pretrained LLM, or to task differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art performance compared to prior open-source models that do not leverage synthetic fine-tuning data.
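The abstract describes the core mechanism: the input is repeated in the prompt and the embedding is pooled only over the second occurrence, so that "early" tokens in that occurrence have already attended to the full input. Below is a minimal sketch of this idea, assuming a Hugging Face `transformers` setup with a Mistral-7B checkpoint; the prompt wording, the mean-pooling choice, and the token-span matching heuristic are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of "echo embeddings": repeat the input, pool over the second copy.
# Model name, prompt template, and pooling are assumptions for illustration only.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # any autoregressive LM could be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()


def echo_embedding(text: str) -> torch.Tensor:
    # Repeat the input so tokens in the second copy can attend to the whole sentence.
    prompt = f"Rewrite the sentence: {text}\nRewritten sentence: {text}"
    enc = tokenizer(prompt, return_tensors="pt")

    # Locate the last occurrence of the input's tokens in the prompt.
    # Note: subword boundaries can differ in context, so this exact-match
    # search is a simplification of any careful span-tracking scheme.
    text_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    start = None
    for i in range(len(ids) - len(text_ids), -1, -1):
        if ids[i : i + len(text_ids)] == text_ids:
            start = i
            break
    if start is None:
        raise ValueError("Could not align the repeated span; adjust the template.")

    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)

    # Mean-pool hidden states over the second occurrence only.
    return hidden[start : start + len(text_ids)].mean(dim=0)
```

For comparison, a "classical" embedding under the same setup would pool over a single, unrepeated copy of the input, in which each token's hidden state can only reflect the tokens before it.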
- MS MARCO: A human generated machine reading comprehension dataset.
- Quora question pairs.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Jeffrey L Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
- ELI5: Long form question answering.
- Scaling deep contrastive learning batch size under memory limited setup. arXiv preprint arXiv:2101.06983.
- SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
- Geoffrey E Hinton. 1984. Distributed representations.
- Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691.
- Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- Scaling sentence embeddings with large language models. arXiv preprint arXiv:2307.16645.
- PromptBERT: Improving BERT sentence embeddings with prompts. arXiv preprint arXiv:2201.04337.
- Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.
- Dense passage retrieval for open-domain question answering.
- Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48.
- Skip-thought vectors. Advances in neural information processing systems, 28.
- Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196. PMLR.
- Xianming Li and Jing Li. 2023. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871.
- Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281.
- Fine-tuning LLaMA for multi-stage text retrieval. arXiv preprint arXiv:2310.08319.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Niklas Muennighoff. 2022. SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
- MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316.
- Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877.
- Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- DuReader-retrieval: A large-scale Chinese benchmark for passage retrieval from web search engine.
- Improving language understanding by generative pre-training.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
- Learning representations by back-propagating errors. Nature, 323(6088):533–536.
- Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.
- Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in neural information processing systems, 24.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
- One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741.
- Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
- FEVER: A large-scale dataset for fact extraction and verification.
- Llama 2: Open foundation and fine-tuned chat models.
- Nearest neighbor search in google correlate. Technical report, Google.
- Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
- Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368.
- Multilingual E5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672.
- CSE: Conceptual sentence embeddings based on attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 505–515.
- Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.
- C-Pack: Packaged resources to advance general Chinese embedding. arXiv preprint arXiv:2309.07597.
- T2Ranking: A large-scale Chinese benchmark for passage ranking.
- HotpotQA: A dataset for diverse, explainable multi-hop question answering.
- Language models are universal embedders. arXiv preprint arXiv:2310.08232.
- Mr. TyDi: A multi-lingual benchmark for dense retrieval.
- MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics, 11:1114–1131.