
C-Pack: Packed Resources For General Chinese Embeddings (2309.07597v5)

Published 14 Sep 2023 in cs.CL, cs.AI, and cs.IR

Abstract: We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.


Summary

  • The paper presents a comprehensive C-Pack package that integrates a Chinese benchmark, a vast dataset of over 100 million text pairs, and models across multiple scales.
  • The methodology combines pre-training on large corpora with task-specific fine-tuning to enhance retrieval and semantic similarity tasks.
  • The evaluation on 35 datasets across 6 tasks demonstrates up to a 10% improvement over baselines, setting new standards in Chinese NLP embeddings.

C-Pack: Advancing General Chinese Embedding

The paper presents C-Pack, a comprehensive package of resources that advances general Chinese embeddings. The package comprises a benchmark, a training dataset, and a family of embedding models, designed together to support both the development and the evaluation of Chinese text embeddings. The research takes a multi-faceted approach to the challenges of building generalized, robust text embeddings for the Chinese language.

Overview of C-Pack

C-Pack introduces three crucial components: a benchmark, a training dataset, and a family of embedding models.

  1. Benchmark (C-MTEB): This component extends the MTEB framework to evaluate general Chinese embeddings. Spanning 35 datasets across six tasks, such as retrieval and classification, it measures how well an embedding generalizes. It provides a standardized evaluation pipeline and groups datasets by the capability they assess.
  2. Dataset (C-MTP): The dataset, divided into unlabeled and labeled portions, is pivotal for training. At roughly 100 million unlabeled text pairs and 838,000 labeled pairs, it covers diverse semantic structures and application scenarios. Noteworthy sources include Wudao and Amazon Reviews, contributing both the breadth and the quality needed for general-purpose embeddings.
  3. Models (C-TEM): The paper introduces embedding models at three scales (small, base, and large), providing flexibility across computational budgets and performance requirements. These models outperform prior Chinese embeddings by up to 10% on the benchmark; a brief usage sketch follows this list.
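
To make the release concrete, here is a minimal usage sketch with the `sentence-transformers` library. The checkpoint id `BAAI/bge-base-zh` is an assumption about how the C-TEM models are published; the authoritative names are in the FlagEmbedding repository.

```python
# Minimal sketch: encoding Chinese text with a released C-TEM model.
# "BAAI/bge-base-zh" is an assumed checkpoint id; see
# https://github.com/FlagOpen/FlagEmbedding for the actual release names.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-zh")

sentences = [
    "如何办理护照?",      # "How do I apply for a passport?"
    "护照办理流程说明",    # "Passport application procedure"
    "今天天气很好",        # "The weather is nice today"
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity; with normalized embeddings this reduces to a dot product.
print(util.cos_sim(embeddings[0:1], embeddings[1:]))
```

The paraphrase pair should score well above the unrelated sentence, which is exactly the behavior that C-MTEB's STS and retrieval tasks quantify.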

Methodological Insights

The paper details a multi-stage training recipe for the embedding models:

  • Pre-Training: Using large-scale corpora such as Wudao, the models are pre-trained with RetroMAE, a masked auto-encoding objective in which a shallow decoder must reconstruct heavily masked input from the encoder's sentence embedding, pushing the encoder to produce information-rich embeddings.
  • Fine-Tuning: General-purpose contrastive learning on the unlabeled data is followed by task-specific fine-tuning on the labeled data. The authors use very large batch sizes, which matter because each example's negatives are drawn from the rest of the batch.
  • Instructions in Fine-Tuning: Prepending task-specific prompts to the inputs sharpens the models for particular scenarios, notably retrieval and STS tasks. A schematic of this recipe appears after this list.
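
The core of the contrastive stage can be sketched as an InfoNCE loss over in-batch negatives, with an optional instruction prefix on the query side. This is a schematic under assumed settings (the temperature value and the prefix text are illustrative), not the authors' released training code.

```python
# Schematic of contrastive fine-tuning with in-batch negatives (InfoNCE).
# The temperature and the instruction prefix below are illustrative
# assumptions, not the paper's exact hyperparameters.
import torch
import torch.nn.functional as F

def info_nce_loss(q: torch.Tensor, p: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """q, p: [batch, dim] embeddings of queries and their positive passages.

    Every other passage in the batch acts as a negative for each query,
    which is why large batch sizes help: more negatives per update.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                      # [batch, batch]
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)

# Instruction fine-tuning: prepend a task-specific prompt to the query side.
# This prefix is a hypothetical example of a retrieval instruction.
INSTRUCTION = "为这个句子生成表示以用于检索相关文章："
queries = [INSTRUCTION + q for q in ["如何办理护照?"]]
```

In practice the embeddings `q` and `p` come from the pre-trained encoder, and only queries carry the instruction, so passage embeddings can be indexed once and reused across tasks.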

Empirical Evaluation

The models derived from C-Pack are tested against popular baselines, particularly on the C-MTEB benchmark, where they lead in areas such as retrieval and semantic textual similarity. The authors also ablate the contributions of the individual training stages and data sources to account for the variation between models.
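
Because C-MTEB extends the MTEB framework, an evaluation run can be sketched with the open-source `mteb` package. The task name below is a single illustrative choice, and the full C-MTEB task list should be taken from the paper or the FlagEmbedding repository; the exact constructor API also varies across `mteb` versions.

```python
# Sketch: scoring an embedding model on one C-MTEB task via the mteb package.
# "T2Retrieval" is an illustrative Chinese retrieval task; the model id is
# the same assumed checkpoint as above.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-zh")
evaluation = MTEB(tasks=["T2Retrieval"])
results = evaluation.run(model, output_folder="results/bge-base-zh")
print(results)
```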

Implications and Future Directions

C-Pack's public release facilitates wide adoption and encourages future research on Chinese embeddings. Its robust framework sets a high standard for evaluating and developing text embeddings, making substantial contributions to both theoretical research and practical applications in NLP. The release empowers researchers to explore enhanced training methodologies and diverse linguistic applications, potentially influencing cross-linguistic NLP model development.

Conclusion

The C-Pack package, with its exhaustive resources and methodological rigor, represents a significant step in advancing Chinese text embeddings. By integrating comprehensive benchmarks, vast datasets, and state-of-the-art models, the research provides a solid foundation for further exploration in the field of NLP embeddings.
