C-Pack: Packed Resources For General Chinese Embeddings (2309.07597v5)
Abstract: We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperformed all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embeddings, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is twice as large as the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
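As a quick illustration of how a released C-TEM checkpoint might be used, the snippet below encodes two Chinese sentences and scores their similarity. This is a minimal sketch, not taken from the paper: the model identifier BAAI/bge-large-zh-v1.5 and the use of the sentence-transformers library are assumptions for illustration only.

```python
# Minimal sketch (assumed workflow): Chinese sentence embeddings with a C-TEM-style
# checkpoint loaded through sentence-transformers. The model id is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")  # assumed Hugging Face model id

sentences = [
    "如何更换花呗绑定银行卡",  # "How do I change the bank card linked to Huabei?"
    "花呗更改绑定银行卡",      # "Change the bank card bound to Huabei"
]

# Encode and L2-normalize so the dot product equals cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)
```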