
How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation (1603.08023v2)

Published 25 Mar 2016 in cs.CL, cs.AI, cs.LG, and cs.NE

Abstract: We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

Citations (1,268)

Summary

  • The paper demonstrates that standard unsupervised metrics like BLEU, METEOR, and ROUGE often do not correlate with human evaluations in dialogue contexts.
  • It highlights that both word overlap and embedding-based metrics struggle to capture the dynamic, contextual nuances of conversational responses.
  • The study calls for developing innovative, context-aware evaluation strategies that better reflect the semantic richness of human dialogues.

Evaluating Unsupervised Metrics for Dialogue Response Generation

Overview

The paper "How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation" investigates the effectiveness of various unsupervised metrics for evaluating dialogue response generation. This paper highlights the inadequacies of these metrics by comparing their outputs to human judgments in distinct domains - casual conversations on Twitter and technical dialogues in the Ubuntu Dialogue Corpus. The results expose significant shortcomings in current practices, igniting discourse on the necessity of new, more reliable evaluation methodologies.

Unsupervised Evaluation Metrics

Word Overlap Metrics

The paper scrutinizes standard word-overlap metrics adopted from machine translation and summarization, assessing whether BLEU, METEOR, and ROUGE remain valid for dialogue systems:

  • BLEU: Originally designed for machine translation, BLEU computes modified n-gram precision with a brevity penalty. The paper finds that higher-order variants (BLEU-3, BLEU-4) collapse to near-zero values for most responses, since a generated reply rarely shares long n-grams with the single reference.
  • METEOR: Unlike BLEU, METEOR aligns hypothesis and reference using stems, synonyms, and paraphrases in addition to exact word matches. Despite its robustness in translation tasks, it shows little to no significant correlation with human evaluations in dialogue.
  • ROUGE: Typically used in summarization, ROUGE-L scores the longest common subsequence between the generated and reference responses. Because valid dialogue replies often share little surface form with the single reference, its scores are largely uninformative (Figure 1; a minimal sketch of these overlap metrics follows the figure placeholder below).

Figure 1: Twitter
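
To make concrete why single-reference overlap scores behave this way, here is a minimal, self-contained sketch of sentence-level BLEU (brevity penalty, no smoothing) and a simplified LCS-based ROUGE-L F1. The example sentences are hypothetical, not drawn from the paper's data, and this is an illustrative sketch rather than the evaluation code used in the study.

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=2):
    """Sentence-level BLEU against a single reference, with brevity penalty
    and no smoothing (the single-reference setting studied in the paper)."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hypothesis, n)
        ref_ngrams = ngram_counts(reference, n)
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0:  # one empty n-gram overlap zeroes the whole score
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge_l_f1(hypothesis, reference):
    """Simplified ROUGE-L: F1 over the longest common subsequence."""
    dp = [[0] * (len(reference) + 1) for _ in range(len(hypothesis) + 1)]
    for i, h in enumerate(hypothesis, 1):
        for j, r in enumerate(reference, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if h == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(hypothesis), lcs / len(reference)
    return 2 * p * r / (p + r)

# Hypothetical example: a reasonable reply that shares few n-grams with the
# single ground-truth response, so overlap scores stay low.
generated = "i have no idea what you mean".split()
ground_truth = "sorry i do not know what you are talking about".split()
print(bleu(generated, ground_truth, max_n=2))   # small positive value
print(bleu(generated, ground_truth, max_n=4))   # 0.0: no shared 3- or 4-grams
print(rouge_l_f1(generated, ground_truth))      # modest LCS-based score
```

The BLEU-4 score drops to exactly zero as soon as no trigram or 4-gram is shared, which is the collapse the paper observes for most dialogue responses.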

Embedding-based Metrics

Embedding-based metrics instead aim to capture semantic similarity without relying on direct lexical overlap (a minimal sketch of all three appears after the figure placeholders below):

  • Embedding Average: Averages the word vectors of a response into a single sentence vector and compares it to that of the reference via cosine similarity. Despite moving beyond exact word matches, it shows limited correlation with human assessments because it is insensitive to word order and context.
  • Greedy Matching: Pairs each word in the generated response with its most similar word in the reference (by embedding cosine similarity) and averages the resulting scores, but it overlooks sequence and context, reducing its evaluative quality.
  • Vector Extrema: Takes, for each embedding dimension, the most extreme value across the words in a sentence, which tends to prioritize outlier semantics that may not align with human judgments.

Figure 2: Vector Extrema

Figure 4: BLEU-1
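
A minimal sketch of all three embedding-based metrics follows, assuming pretrained word vectors are available. The paper uses Word2Vec embeddings; a toy random embedding table stands in here, so the printed similarities are illustrative only.

```python
import numpy as np

# Toy stand-in for pretrained word vectors (the paper uses Word2Vec), so the
# numbers printed below are illustrative only.
rng = np.random.default_rng(0)
VOCAB = "sorry i do not know what you are talking about have no idea mean".split()
DIM = 50
EMB = {w: rng.standard_normal(DIM) for w in VOCAB}

def vec(word):
    return EMB.get(word, np.zeros(DIM))  # out-of-vocabulary words map to zeros

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def embedding_average(tokens):
    """Mean of the word vectors for a sentence."""
    return np.mean([vec(t) for t in tokens], axis=0)

def vector_extrema(tokens):
    """Per dimension, keep the value with the largest absolute magnitude."""
    mat = np.stack([vec(t) for t in tokens])   # shape (num_words, DIM)
    idx = np.abs(mat).argmax(axis=0)           # index of the extreme word per dimension
    return mat[idx, np.arange(mat.shape[1])]

def greedy_matching(hyp, ref):
    """Average best cosine match for each word, symmetrised over both directions."""
    def one_way(a, b):
        return np.mean([max(cosine(vec(x), vec(y)) for y in b) for x in a])
    return 0.5 * (one_way(hyp, ref) + one_way(ref, hyp))

generated = "i have no idea what you mean".split()
ground_truth = "sorry i do not know what you are talking about".split()
print(cosine(embedding_average(generated), embedding_average(ground_truth)))
print(cosine(vector_extrema(generated), vector_extrema(ground_truth)))
print(greedy_matching(generated, ground_truth))
```

Because each metric reduces a sentence to a bag of word vectors, none of them accounts for word order or for the dialogue context, which is the limitation the paper emphasizes.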

Evaluation Against Human Judgments

The paper measures correlations between these metrics and human evaluations of responses produced by several models, ranging from TF-IDF retrieval systems to RNN-based generative models. The correlations are consistently weak: BLEU and the embedding-based metrics align only loosely with human judgments on Twitter, and in the technical Ubuntu Dialogue Corpus the agreement is negligible, underscoring the need for a shift in evaluation practice (Figure 5).

Figure 5: ROUGE
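
A sketch of this style of correlation analysis, using hypothetical metric scores and human ratings in place of the paper's data, might look like the following; scipy's pearsonr and spearmanr are standard implementations of the two statistics involved.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: one automatic-metric score and one mean human rating
# (e.g. on a 1-5 appropriateness scale) per response. Correlations between
# such pairs are what the paper reports, computed per domain.
rng = np.random.default_rng(0)
human = rng.uniform(1, 5, size=100)
metric = 0.05 * human + rng.normal(0, 1, size=100)   # deliberately weak signal

pearson_r, pearson_p = pearsonr(metric, human)
spearman_rho, spearman_p = spearmanr(metric, human)
print(f"Pearson  r   = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```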

Metrics Shortcomings and Future Directions

Through qualitative analysis, the study shows where metric scores diverge from human judgments, citing responses that are reasonable in context yet share little overlap with the single ground-truth reply and are therefore penalized. Current metrics fail to account for this contextual and semantic variability, and the paper encourages exploration of context-aware or learned, data-driven evaluation models that better reflect human notions of response appropriateness.

Conclusion

The empirical findings of this paper point to fundamental inadequacies in existing unsupervised metrics for dialogue systems. It calls for a shift toward evaluation strategies that are sensitive to the dynamic, context-dependent nature and semantic richness of human dialogue, and urges the research community to develop metrics that align more closely with nuanced human judgments in order to advance dialogue systems.
