An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction (1909.02027v1)

Published 4 Sep 2019 in cs.CL, cs.AI, and cs.LG

Abstract: Task-oriented dialog systems need to know when a query falls outside their range of supported intents, but current text classification corpora only define label sets that cover every example. We introduce a new dataset that includes queries that are out-of-scope---i.e., queries that do not fall into any of the system's supported intents. This poses a new challenge because models cannot assume that every query at inference time belongs to a system-supported intent class. Our dataset also covers 150 intent classes over 10 domains, capturing the breadth that a production task-oriented agent must handle. We evaluate a range of benchmark classifiers on our dataset along with several different out-of-scope identification schemes. We find that while the classifiers perform well on in-scope intent classification, they struggle to identify out-of-scope queries. Our dataset and evaluation fill an important gap in the field, offering a way of more rigorously and realistically benchmarking text classification in task-driven dialog systems.

Authors (11)

Stefan Larson
Anish Mahendran
Joseph J. Peper
Christopher Clarke
Andrew Lee
Parker Hill
Jonathan K. Kummerfeld
Kevin Leach
Michael A. Laurenzano
Lingjia Tang
Jason Mars

Citations (480)


Summary

The paper introduces a new dataset of 23,700 queries, including 1,200 out-of-scope examples, for evaluating intent classifiers in task-oriented dialogue systems.
Classifiers such as BERT exceed 96% in-scope accuracy but struggle with out-of-scope detection: the best method reaches only 66% out-of-scope recall.
The dataset provides a rigorous benchmark for task-oriented dialogue systems and highlights critical areas for future research in robust query handling.

An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction

The paper introduces a new dataset for evaluating intent classification and out-of-scope prediction in task-oriented dialogue systems. Its goal is to measure how well a system handles queries both inside and outside its predefined intent classes.

Motivation

Task-oriented dialogue systems must identify the user's intent to respond accurately. A key challenge arises when a system encounters out-of-scope queries, that is, queries that do not fit any system-supported intent. Existing text classification corpora do not address this case because their label sets cover every example. The introduced dataset closes this gap by including both in-scope and out-of-scope queries.

Dataset Overview

The dataset comprises 23,700 queries: 22,500 in-scope examples spanning 150 intents across ten domains (150 queries per intent), plus 1,200 out-of-scope queries. The data was collected through crowdsourcing, with tasks that prompted workers to write commands and questions as they would phrase them to an AI system. The dataset is divided into training, validation, and test sets.

Variants

In addition to the full dataset, the paper defines several variants:

Small: The training data is reduced to 50 queries per intent.
Imbalanced: The number of training queries varies across intents.
OOS+: Additional out-of-scope training examples are included to assess their effect on out-of-scope detection.
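As an illustration of how a variant such as Small could be derived from a larger training set, the sketch below subsamples a fixed number of queries per intent. It is a minimal sketch, not the authors' procedure: the function name `subsample_per_intent`, the `(text, intent)` tuple layout, the seed, and the toy intent labels are all assumptions made for this example.

```python
import random
from collections import defaultdict


def subsample_per_intent(examples, per_intent, seed=0):
    """Keep at most `per_intent` training queries for each intent label.

    `examples` is a list of (text, intent) pairs; this layout is an
    assumption for the sketch, not the dataset's actual file format.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_intent = defaultdict(list)
    for text, intent in examples:
        by_intent[intent].append((text, intent))

    subset = []
    for intent in sorted(by_intent):  # sorted for a deterministic order
        group = by_intent[intent]
        k = min(per_intent, len(group))
        subset.extend(rng.sample(group, k))
    return subset


# Toy training set: two hypothetical intents with 100 queries each.
train = [
    (f"query {i} about {intent}", intent)
    for intent in ("intent_a", "intent_b")
    for i in range(100)
]
small = subsample_per_intent(train, per_intent=50)
```

The same helper could approximate an imbalanced variant by calling it with a different `per_intent` value for each group of intents, although the exact sampling used in the paper may differ.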

Evaluation and Results

The paper evaluates a range of classifiers on this dataset, including SVM, MLP, FastText, CNN, and BERT, as well as platforms such as DialogFlow and Rasa. BERT consistently achieves the highest in-scope accuracy, surpassing 96%. However, all models struggle with out-of-scope prediction: the best-performing method reaches an out-of-scope recall of only 66%.
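The two figures quoted above correspond to two separate metrics. The sketch below shows one plausible way to compute them, assuming in-scope accuracy is measured only over queries whose gold label is an in-scope intent, and out-of-scope recall is the fraction of gold out-of-scope queries that the model flags as such. The function names, the `"oos"` label string, and the toy labels are assumptions for illustration.

```python
def in_scope_accuracy(gold, pred, oos_label="oos"):
    """Accuracy over the queries whose gold label is an in-scope intent."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g != oos_label]
    if not pairs:
        raise ValueError("no in-scope examples in gold labels")
    return sum(g == p for g, p in pairs) / len(pairs)


def out_of_scope_recall(gold, pred, oos_label="oos"):
    """Fraction of gold out-of-scope queries predicted as out-of-scope."""
    oos_preds = [p for g, p in zip(gold, pred) if g == oos_label]
    if not oos_preds:
        raise ValueError("no out-of-scope examples in gold labels")
    return sum(p == oos_label for p in oos_preds) / len(oos_preds)


# Toy labels (hypothetical intents "a" and "b").
gold = ["a", "b", "oos", "oos", "a"]
pred = ["a", "a", "oos", "b", "a"]

acc = in_scope_accuracy(gold, pred)       # 2 of 3 in-scope queries correct
recall = out_of_scope_recall(gold, pred)  # 1 of 2 out-of-scope queries caught
```

Reporting the two numbers separately matters because out-of-scope queries are a small share of the data, so a single overall accuracy would hide poor out-of-scope detection.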

Different strategies for out-of-scope detection were explored:

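oos-train: Out-of-scope is treated as an additional intent class and trained on out-of-scope examples.
oos-threshold: A query is labeled out-of-scope when the classifier's probability for its top prediction falls below a threshold.
oos-binary: A two-stage approach in which a binary classifier first decides whether a query is in scope, and only in-scope queries are passed on to intent classification.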
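The oos-threshold and oos-binary strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the threshold value of 0.7 is arbitrary, the logits are made up, and `is_in_scope` and `classify_intent` are hypothetical callables standing in for trained models.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(logits, intents, threshold=0.7, oos_label="oos"):
    """oos-threshold: reject the top intent if its probability is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return intents[best] if probs[best] >= threshold else oos_label


def predict_two_stage(query, is_in_scope, classify_intent, oos_label="oos"):
    """oos-binary: decide scope first, then classify in-scope queries only."""
    return classify_intent(query) if is_in_scope(query) else oos_label


# Hypothetical intent labels and made-up logits.
intents = ["intent_a", "intent_b", "intent_c"]
confident = predict_with_threshold([4.0, 0.5, 0.2], intents)  # peaked scores
uncertain = predict_with_threshold([1.0, 0.9, 0.8], intents)  # flat scores

# Stand-in models for the two-stage variant (keyword rules, not real classifiers).
scope_model = lambda q: "flight" in q
intent_model = lambda q: "intent_a"
in_scope_pred = predict_two_stage("book a flight", scope_model, intent_model)
oos_pred = predict_two_stage("tell me a joke", scope_model, intent_model)
```

In practice the threshold would be tuned on held-out data, since a higher value raises out-of-scope recall at the cost of rejecting more in-scope queries.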

Implications

The dataset provides a more realistic benchmark for systems that must distinguish between supported and unsupported queries. The results show that current models handle out-of-scope queries poorly, which marks this as an important area for future research on more robust conversational AI systems.

Prior Work and Contributions

Prior text classification datasets define label sets that cover every example and therefore cannot test out-of-scope handling. This dataset adds explicit out-of-scope queries and covers 150 intents over ten domains, which allows dialogue systems to be evaluated under conditions closer to real user interactions.

Conclusion

The paper contributes a dataset that allows rigorous testing of both intent classification and out-of-scope prediction in dialogue systems. Current methods achieve high in-scope accuracy but perform poorly on out-of-scope queries, underscoring the need for further research on the robustness and reliability of dialogue systems.
