
Types of Out-of-Distribution Texts and How to Detect Them

Published 14 Sep 2021 in cs.CL and cs.LG | (2109.06827v2)

Abstract: Despite agreement on the importance of detecting out-of-distribution (OOD) examples, there is little consensus on the formal definition of OOD examples and how to best detect them. We categorize these examples by whether they exhibit a background shift or a semantic shift, and find that the two major approaches to OOD detection, model calibration and density estimation (language modeling for text), have distinct behavior on these types of OOD data. Across 14 pairs of in-distribution and OOD English natural language understanding datasets, we find that density estimation methods consistently beat calibration methods in background shift settings, while performing worse in semantic shift settings. In addition, we find that both methods generally fail to detect examples from challenge data, highlighting a weak spot for current methods. Since no single method works well across all settings, our results call for an explicit definition of OOD examples when evaluating different detection methods.


Summary

  • The paper categorizes OOD texts into background and semantic shifts, outlining tailored detection strategies.
  • It applies calibration, density estimation, and likelihood ratio techniques to distinguish in-distribution from OOD samples.
  • Feature-based and unsupervised methods are proposed as promising solutions to counter overconfidence and improve detection reliability.

Out-of-Distribution (OOD) detection is crucial for ensuring the robustness and reliability of machine learning models, particularly in scenarios where models encounter data that significantly diverges from the training distribution. Various types of OOD texts and their detection methods have been explored in the literature.

Types of OOD Texts

OOD texts can generally be categorized into two main types:

  1. Background Shift: The domain-level or stylistic features of the text change (e.g., topic, genre, or source) while the content remains relevant to the core task and the label space stays the same.
  2. Semantic Shift: The meaning-bearing features change, typically introducing new or irrelevant classes and concepts that the model was never trained to recognize (2109.06827).

Methods for Detecting OOD Texts

The detection methods can be broadly classified into a few approaches:

  1. Model Calibration and Confidence Scores: Traditional methods use the model's output probabilities to determine if a sample is OOD. However, these methods often suffer from overconfidence and misclassification issues with OOD data. Recent advancements include using energy scores, which align better with probability densities and mitigate overconfidence (Liu et al., 2020).
  2. Density Estimation: These approaches involve modeling the probability distribution of the training data and identifying samples that fall outside this distribution. Density estimation methods are more effective in scenarios involving background shifts but less effective for semantic shifts (2109.06827).
  3. Feature-Based Methods: Methods like SEM (Simple feature-based Semantics score function) combine high-level and low-level features to distinguish between in-distribution (ID) and OOD samples. SEM has proven effective in full-spectrum OOD detection, handling both semantic and covariate shifts (Yang et al., 2022).
  4. Likelihood Ratios: Likelihood-based techniques, such as those involving deep generative models, normalize scores against background statistics. This method can correct for confounding factors and has shown success in various contexts including genomic datasets (Ren et al., 2019).
  5. Unsupervised Techniques: Assuming no access to OOD data during training, some methods leverage unsupervised learning to enhance OOD detection. Techniques like unsupervised dual grouping (UDG) use external unlabeled sets to enrich semantic knowledge and distinguish ID/OOD samples, proving beneficial in more practical settings (Yang et al., 2021).
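The calibration-based scores in (1) can be sketched directly from a model's raw logits. This is a minimal illustration, not code from any of the cited papers: `logits` is a hypothetical vector of class scores, MSP is the maximum softmax probability, and the energy score follows the `T * logsumexp(logits / T)` formulation of Liu et al. (2020), where higher values indicate more in-distribution inputs:

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability: higher suggests more in-distribution."""
    z = logits - logits.max()              # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum()
    return probs.max()

def energy_score(logits, temperature=1.0):
    """Negative free energy (Liu et al., 2020): higher suggests more in-distribution."""
    z = logits / temperature
    # temperature * logsumexp(logits / temperature), computed stably
    return temperature * (z.max() + np.log(np.exp(z - z.max()).sum()))

# A peaked (confident) logit vector vs. a flat (uncertain) one.
confident = np.array([8.0, 0.5, 0.2])
uncertain = np.array([1.1, 1.0, 0.9])
assert msp_score(confident) > msp_score(uncertain)
assert energy_score(confident) > energy_score(uncertain)
```

The energy score avoids the softmax normalization that compresses confident and overconfident predictions into the same narrow range near 1.0, which is one reason it aligns better with the underlying density.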
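The likelihood-ratio idea in (4) can be illustrated with toy unigram "language models": the score is log p_foreground(x) − log p_background(x), so statistics shared with a generic background corpus cancel out. The corpora, tokenization, and add-alpha smoothing below are illustrative assumptions, not the setup of Ren et al. (2019):

```python
import math
from collections import Counter

def unigram_logprob(tokens, counts, total, vocab_size, alpha=1.0):
    """Add-alpha smoothed unigram log-likelihood of a token sequence."""
    return sum(
        math.log((counts[t] + alpha) / (total + alpha * vocab_size))
        for t in tokens
    )

def likelihood_ratio(tokens, fg, bg, vocab):
    """log p_fg(x) - log p_bg(x); higher suggests more in-distribution."""
    fg_counts, bg_counts = Counter(fg), Counter(bg)
    V = len(vocab)
    return (unigram_logprob(tokens, fg_counts, len(fg), V)
            - unigram_logprob(tokens, bg_counts, len(bg), V))

# Foreground: tiny in-domain corpus; background: generic function words (both toy).
fg = "the movie was great the plot was great".split()
bg = "the the a a of of and and".split()
vocab = set(fg) | set(bg) | {"terrible"}

in_dist = "the movie was great".split()
ood = "the the of and".split()
assert likelihood_ratio(in_dist, fg, bg, vocab) > likelihood_ratio(ood, fg, bg, vocab)
```

A plain density score would rank the function-word-heavy sequence highly under both models; dividing out the background model is what exposes it as uninformative for the in-domain task.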

Benchmarking and Evaluation

The literature underscores the lack of a one-size-fits-all solution for OOD detection. Different methods excel under different OOD conditions such as background shifts, semantic shifts, or near/far OOD scenarios. Therefore, a comprehensive evaluation framework considering these nuances is crucial. Benchmarks and open challenges in this field continue to drive innovation and understanding of the trade-offs and limitations of current methods (2109.06827, Yang et al., 2022).
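Detectors in such benchmarks are typically compared by AUROC over their OOD scores, i.e., the probability that a randomly chosen in-distribution example scores above a randomly chosen OOD example. A minimal pairwise sketch, with made-up scores where higher means more in-distribution:

```python
def auroc(id_scores, ood_scores):
    """Probability a random ID example outscores a random OOD example
    (ties count half) -- the AUROC of the detector."""
    pairs = [(i, o) for i in id_scores for o in ood_scores]
    wins = sum(1.0 if i > o else 0.5 if i == o else 0.0 for i, o in pairs)
    return wins / len(pairs)

# Toy detector scores: a perfect detector fully separates the two sets.
id_scores = [0.9, 0.8, 0.95]
ood_scores = [0.1, 0.3, 0.2]
assert auroc(id_scores, ood_scores) == 1.0   # perfect separation
assert auroc(ood_scores, id_scores) == 0.0   # perfectly inverted
```

Because AUROC is threshold-free, it lets calibration-based and density-based scores be compared on the same footing even though their raw score scales are incomparable.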

Collectively, these studies indicate that effective OOD detection often requires a combination of methods tailored to specific types of distribution shifts, and emphasize the need for a nuanced and context-aware approach to developing and evaluating OOD detection strategies.
