DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine

Published 19 Mar 2022 in cs.CL and cs.IR | (2203.10232v4)

Abstract: In this paper, we present DuReader_retrieval, a large-scale Chinese dataset for passage retrieval. DuReader_retrieval contains more than 90K queries and over 8M unique passages from a commercial search engine. To alleviate the shortcomings of other datasets and ensure the quality of our benchmark, we (1) reduce the false negatives in development and test sets by manually annotating results pooled from multiple retrievers, and (2) remove the training queries that are semantically similar to the development and testing queries. Additionally, we provide two out-of-domain testing sets for cross-domain evaluation, as well as a set of human translated queries for cross-lingual retrieval evaluation. The experiments demonstrate that DuReader_retrieval is challenging and a number of problems remain unsolved, such as the salient phrase mismatch and the syntactic mismatch between queries and paragraphs. These experiments also show that dense retrievers do not generalize well across domains, and cross-lingual retrieval is essentially challenging. DuReader_retrieval is publicly available at https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval.


Authors (8)

Citations (12)


Summary

The paper presents DuReader_retrieval, a human-annotated dataset with more than 90,000 queries and over 8 million passages sourced from a commercial Chinese search engine.
It removes training queries that are semantically similar to development and test queries to limit leakage, and adds two out-of-domain test sets plus human-translated queries for cross-lingual evaluation.
Experimental results show that current dense retrievers struggle with cross-domain and cross-lingual retrieval, highlighting the need for better model generalization.

DuReader_retrieval: A Comprehensive Chinese Dataset for Passage Retrieval

The paper introduces DuReader_retrieval, a large Chinese dataset for evaluating and benchmarking passage retrieval systems. It contains more than 90,000 queries and over 8 million unique passages sourced from a commercial search engine, Baidu. The dataset responds to limitations of existing datasets, particularly those for non-English languages. Unlike other datasets, which are small in scale or rely on machine-generated queries, DuReader_retrieval is human-annotated, providing a more reliable basis for model training and evaluation.

Improvements and Features

DuReader_retrieval distinguishes itself from its predecessors through three key improvements:

Manual Annotation for Quality Assurance: The development and test sets were manually annotated over results pooled from multiple retrievers to reduce false negatives, a prevalent issue in large-scale datasets where only a few passages per query receive human labels.
Exclusion of Overlapping Queries: To prevent test information from leaking into training, training queries that are semantically similar to development or test queries were identified with a query matching model and removed.
Cross-Domain and Cross-Lingual Evaluations: Beyond the primary test set, the dataset includes two out-of-domain test sets for cross-domain evaluation and a set of human-translated queries for assessing cross-lingual retrieval.
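
The pooling and query-exclusion steps can be illustrated with a short sketch. This is not the authors' pipeline: `pool_candidates` merges the top-k results of several retrievers into one candidate list for annotators, and `filter_leaky_queries` drops training queries that look too similar to any evaluation query. Character-bigram Jaccard similarity stands in for the paper's query matching model, and all function names, the value of `k`, and the 0.4 threshold are assumptions chosen for illustration.

```python
from typing import Dict, List, Set


def pool_candidates(runs: Dict[str, List[str]], k: int = 50) -> List[str]:
    """Union of the top-k passage ids from each retriever, in first-seen order."""
    seen: Set[str] = set()
    pooled: List[str] = []
    for ranked in runs.values():
        for pid in ranked[:k]:
            if pid not in seen:
                seen.add(pid)
                pooled.append(pid)
    return pooled


def char_bigrams(text: str) -> Set[str]:
    """Character bigrams after removing whitespace (suits unsegmented Chinese)."""
    text = "".join(text.split())
    if len(text) < 2:
        return {text} if text else set()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams; a crude stand-in for a learned matcher."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)


def filter_leaky_queries(
    train_queries: List[str],
    eval_queries: List[str],
    threshold: float = 0.4,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep only training queries whose similarity to every eval query is below threshold."""
    return [
        q for q in train_queries
        if all(jaccard(q, e) < threshold for e in eval_queries)
    ]


# Toy usage: two hypothetical retriever runs and a tiny query set.
runs = {"sparse": ["p1", "p2", "p3"], "dense": ["p2", "p4", "p1"]}
candidates = pool_candidates(runs, k=2)

train = ["如何办理护照", "北京今天天气"]
dev = ["怎么办理护照"]
clean_train = filter_leaky_queries(train, dev)
```

In practice the similarity function would be replaced by a trained query matching model, and pairwise comparison would need an index or blocking step to scale to tens of thousands of queries.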

Experimental Findings

Experiments on DuReader_retrieval highlight significant challenges for current retrieval approaches, including salient phrase mismatch and syntactic mismatch between queries and passages. Dense retrievers, while effective in-domain, generalize poorly across domains and struggle with cross-lingual retrieval, which shows that building retrievers that transfer well remains an open problem.
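
Comparing in-domain and out-of-domain behavior amounts to computing the same ranking metrics on each test set. This summary does not name the metrics used, so the sketch below assumes two common choices, MRR@k and Recall@k, where Recall@k is defined here as the fraction of queries with at least one relevant passage in the top k. The function names and toy data are hypothetical.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 50) -> float:
    """Fraction of queries whose top k contains at least one relevant passage."""
    if not rankings:
        return 0.0
    hits = sum(
        1 for qid, ranked in rankings.items()
        if qrels.get(qid, set()) & set(ranked[:k])
    )
    return hits / len(rankings)


# Toy usage: q1 has its relevant passage at rank 2, q2 has none retrieved.
rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
qrels = {"q1": {"b"}, "q2": {"x"}}
in_domain_mrr = mrr_at_k(rankings, qrels, k=10)
in_domain_recall = recall_at_k(rankings, qrels, k=3)
```

Running the same two functions on the rankings for an out-of-domain test set gives directly comparable numbers, which is how a generalization gap would show up.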

Comparative Analysis

The dataset's scale and manual refinement make it a strong benchmark for Chinese-language passage retrieval, filling a gap left by prior datasets such as TianGong-PDR and Sogou-QCL, which lack comprehensive human annotation or are limited in size. DuReader_retrieval shares similarities with prominent English datasets such as MS MARCO and Natural Questions, while adding the quality controls described above and targeting Chinese retrieval.

Implications for Future Research

DuReader_retrieval has both practical and theoretical implications. Practically, it offers a foundation for training and evaluating more accurate Chinese retrievers. Theoretically, its cross-domain and cross-lingual results challenge assumptions about how well transfer learning and cross-lingual adaptation work for retrieval, and motivate further work in these areas.

As models evolve, DuReader_retrieval provides a platform for testing the limits of retrieval algorithms and for developing more versatile, domain-robust retrieval systems. It also sets the stage for future work on cross-lingual retrieval and domain adaptation, with potential applications in fields that rely on accurate information retrieval.

In conclusion, DuReader_retrieval is a significant contribution to passage retrieval that encourages deeper inquiry into both the linguistic characteristics of Chinese and the broader challenges of cross-lingual and cross-domain retrieval.