C-Pack: Packaged Resources To Advance General Chinese Embedding

arXiv:2309.07597
Published Sep 14, 2023 in cs.CL, cs.AI, and cs.IR

Abstract

We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. At the time of release, our models outperform all prior Chinese text embeddings on C-MTEB by up to +10%. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
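The released C-TEM models are standard Transformer encoders, so a quick way to try them is through Hugging Face transformers. The sketch below is a minimal example; the checkpoint name `BAAI/bge-large-zh` and the CLS-pooling-plus-normalization convention are assumptions drawn from the public FlagEmbedding/BGE release rather than from this abstract.

```python
# Minimal sketch: encode two Chinese sentences with an assumed C-TEM checkpoint
# and compare them by cosine similarity. Pooling strategy and model name are
# assumptions, not details taken from the paper's abstract.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "BAAI/bge-large-zh"  # assumed checkpoint name; other sizes are released too
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = ["样例文档-1", "样例文档-2"]

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch)
    # Take the [CLS] token representation and L2-normalize it so that
    # inner products correspond to cosine similarities.
    embeddings = outputs.last_hidden_state[:, 0]
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

# Cosine similarity between the two example sentences.
print((embeddings[0] @ embeddings[1]).item())
```

The same pattern applies to the other released model sizes; for retrieval use cases, the repository reportedly recommends prepending a query-side instruction, which this sketch omits.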

