
C-Pack: Packed Resources For General Chinese Embeddings (2309.07597v5)

Published 14 Sep 2023 in cs.CL, cs.AI, and cs.IR

Abstract: We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.


Summary

  • The paper presents a comprehensive C-Pack package that integrates a Chinese benchmark, a vast dataset of over 100 million text pairs, and models across multiple scales.
  • The methodology combines pre-training on large corpora with task-specific fine-tuning to enhance retrieval and semantic similarity tasks.
  • The evaluation on 35 datasets across 6 tasks demonstrates up to a 10% improvement over baselines, setting new standards in Chinese NLP embeddings.

C-Pack: Advancing General Chinese Embedding

The paper presents C-Pack, a comprehensive package of resources that advances general Chinese embeddings. The package comprises a benchmark, a training dataset, and a family of embedding models, designed together to support both the development and the evaluation of Chinese text embeddings. The research takes a multi-faceted approach to the challenges of building generalized, robust text embeddings for the Chinese language.

Overview of C-Pack

C-Pack introduces three crucial components: a benchmark, a training dataset, and a family of embedding models.

  1. Benchmark (C-MTEB): This component extends the MTEB framework to evaluate general Chinese embeddings. Spanning 35 datasets across six tasks, such as retrieval and classification, it measures how well an embedding generalizes. It provides a standardized evaluation pipeline and groups datasets by the capability they assess.
  2. Dataset (C-MTP): The dataset, divided into unlabeled and labeled portions, is pivotal for training. At roughly 100 million unlabeled text pairs and 838,000 labeled pairs, it covers diverse semantic structures and application scenarios. Noteworthy sources include Wudao and Amazon Reviews, contributing both the breadth and the quality needed for general-purpose embeddings.
  3. Models (C-TEM): The paper introduces embedding models at three scales (small, base, and large), providing flexibility across computational budgets and performance requirements. These models outperform prior Chinese embeddings by up to 10% on the benchmark; a brief usage sketch follows this list.
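
To make the release concrete, here is a minimal usage sketch with the `sentence-transformers` library. The checkpoint id `BAAI/bge-base-zh` is an assumption about how the C-TEM models are published; the authoritative names are in the FlagEmbedding repository.

```python
# Minimal sketch: encoding Chinese text with a released C-TEM model.
# "BAAI/bge-base-zh" is an assumed checkpoint id; see
# https://github.com/FlagOpen/FlagEmbedding for the actual release names.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-zh")

sentences = [
    "如何办理护照?",      # "How do I apply for a passport?"
    "护照办理流程说明",    # "Passport application procedure"
    "今天天气很好",        # "The weather is nice today"
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity; with normalized embeddings this reduces to a dot product.
print(util.cos_sim(embeddings[0:1], embeddings[1:]))
```

The paraphrase pair should score well above the unrelated sentence, which is exactly the behavior that C-MTEB's STS and retrieval tasks quantify.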

Methodological Insights

The paper details a multi-stage training recipe for the embedding models:

  • Pre-Training: Using large-scale corpora such as Wudao, the models are pre-trained with RetroMAE, a masked auto-encoding objective in which a shallow decoder must reconstruct heavily masked input from the encoder's sentence embedding, pushing the encoder to produce information-rich embeddings.
  • Fine-Tuning: General-purpose contrastive learning on the unlabeled data is followed by task-specific fine-tuning on the labeled data. The authors use very large batch sizes, which matter because each example's negatives are drawn from the rest of the batch.
  • Instructions in Fine-Tuning: Prepending task-specific prompts to the inputs sharpens the models for particular scenarios, notably retrieval and STS tasks. A schematic of this recipe appears after this list.
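
The core of the contrastive stage can be sketched as an InfoNCE loss over in-batch negatives, with an optional instruction prefix on the query side. This is a schematic under assumed settings (the temperature value and the prefix text are illustrative), not the authors' released training code.

```python
# Schematic of contrastive fine-tuning with in-batch negatives (InfoNCE).
# The temperature and the instruction prefix below are illustrative
# assumptions, not the paper's exact hyperparameters.
import torch
import torch.nn.functional as F

def info_nce_loss(q: torch.Tensor, p: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """q, p: [batch, dim] embeddings of queries and their positive passages.

    Every other passage in the batch acts as a negative for each query,
    which is why large batch sizes help: more negatives per update.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                      # [batch, batch]
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)

# Instruction fine-tuning: prepend a task-specific prompt to the query side.
# This prefix is a hypothetical example of a retrieval instruction.
INSTRUCTION = "为这个句子生成表示以用于检索相关文章："
queries = [INSTRUCTION + q for q in ["如何办理护照?"]]
```

In practice the embeddings `q` and `p` come from the pre-trained encoder, and only queries carry the instruction, so passage embeddings can be indexed once and reused across tasks.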

Empirical Evaluation

The models derived from C-Pack are tested against popular baselines, particularly on the C-MTEB benchmark, where they lead in areas such as retrieval and semantic textual similarity. The authors also ablate the contributions of the individual training stages and data sources to account for the variation between models.
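
Because C-MTEB extends the MTEB framework, an evaluation run can be sketched with the open-source `mteb` package. The task name below is a single illustrative choice, and the full C-MTEB task list should be taken from the paper or the FlagEmbedding repository; the exact constructor API also varies across `mteb` versions.

```python
# Sketch: scoring an embedding model on one C-MTEB task via the mteb package.
# "T2Retrieval" is an illustrative Chinese retrieval task; the model id is
# the same assumed checkpoint as above.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-zh")
evaluation = MTEB(tasks=["T2Retrieval"])
results = evaluation.run(model, output_folder="results/bge-base-zh")
print(results)
```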

Implications and Future Directions

C-Pack's public release facilitates wide adoption and encourages future research on Chinese embeddings. Its robust framework sets a high standard for evaluating and developing text embeddings, making substantial contributions to both theoretical research and practical applications in NLP. The release empowers researchers to explore enhanced training methodologies and diverse linguistic applications, potentially influencing cross-linguistic NLP model development.

Conclusion

The C-Pack package, with its exhaustive resources and methodological rigor, represents a significant step in advancing Chinese text embeddings. By integrating comprehensive benchmarks, vast datasets, and state-of-the-art models, the research provides a solid foundation for further exploration in the field of NLP embeddings.
