The Second Conversational Intelligence Challenge (ConvAI2) (1902.00098v1)

Published 31 Jan 2019 in cs.AI, cs.CL, and cs.HC

Abstract: We describe the setting and results of the ConvAI2 NeurIPS competition that aims to further the state-of-the-art in open-domain chatbots. Some key takeaways from the competition are: (i) pretrained Transformer variants are currently the best performing models on this task, (ii) but to improve performance on multi-turn conversations with humans, future systems must go beyond single word metrics like perplexity to measure the performance across sequences of utterances (conversations) -- in terms of repetition, consistency and balance of dialogue acts (e.g. how many questions asked vs. answered).

Citations (344)


Summary

The paper introduces the ConvAI2 challenge as a benchmark for evaluating persona-based open-domain conversational agents.
It employs a three-stage evaluation combining automated metrics and human judgements to reveal performance gaps in pretrained Transformer models.
The study highlights the need for improved metrics that assess dialogue coherence and consistency for future conversational AI research.

Overview and Evaluation of the ConvAI2 Challenge

The paper "The Second Conversational Intelligence Challenge (ConvAI2)" describes the setting and results of the ConvAI2 competition, held as part of the NeurIPS conference. The challenge aims to advance the state of the art in open-domain conversational agents, also known as chatbots. Its central requirement is that dialogue systems hold meaningful, coherent multi-turn conversations with humans in a setting that is not goal-directed.

Competition Overview and Methodology

The ConvAI2 challenge builds on its 2017 predecessor with improvements to the dataset provided and to the evaluation metrics. In this edition, the task centers on the Persona-Chat dataset, in which each dialogue partner is given a persona and converses in character. The dataset is used to train models that maintain a consistent conversational personality, addressing the frequent critique that chatbots lack a coherent and engaging persona.

The competition evaluates dialogue systems in three stages: automatic metrics on a held-out test set, human evaluation via Amazon Mechanical Turk, and 'wild' evaluation in which volunteers chat with the systems. Both automatic and human evaluations inform the final assessment, with the grand prize decided by the human evaluation results. Notably, Hugging Face led the automatic metrics evaluation, whereas Lost in Conversation won the grand prize in the human evaluation rounds.
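The abstract names perplexity as a typical word-level automatic metric. As a rough illustration of what such a metric measures, the sketch below computes perplexity as the exponential of the mean negative log-likelihood per token. It assumes per-token natural-log probabilities have already been obtained from some model; the function name `perplexity` is illustrative and is not taken from the competition's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs.

    `token_log_probs` is a list of log p(token | context) values, one per
    token of the reference text. How they are produced is outside the
    scope of this sketch (assumption: any language model can supply them).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives every reference token probability 0.25 is, on
# average, as uncertain as a uniform choice among four words.
uniform_four = [math.log(0.25)] * 4
print(perplexity(uniform_four))
```

Because the score is averaged over individual tokens, it says nothing about how utterances relate to one another across a conversation, which is the limitation the paper draws attention to.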

Results and Analysis

The analysis highlights several takeaways. Pretrained Transformer models perform best on the automatic metrics, in line with broader trends in NLP. However, strong automatic scores do not translate directly into strong human evaluation scores, as the discrepancies between automatic and human judgements show. Problems such as repetition, an imbalance of dialogue acts (for example, how many questions are asked versus answered), and inconsistency within a conversation persist. The successful models are those that mitigated these issues, as demonstrated by Lost in Conversation's balanced engagement style. The paper further argues that evaluation must go beyond single-word metrics such as perplexity and measure properties of whole conversations, including dialogue flow and consistency, in order to better mirror human assessment.
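To make the idea of conversation-level measurement concrete, here is a minimal sketch of two simple diagnostics computed over one speaker's utterances: the share of utterances that contain a question, and the share of word n-grams that repeat earlier material. These are illustrative heuristics under stated assumptions (a question is detected by the presence of "?", tokens are lowercase alphanumeric runs), and the function names are hypothetical. They are not the metrics used in the competition.

```python
import re


def tokenize(text):
    """Lowercase and split into simple alphanumeric tokens (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def question_ratio(utterances):
    """Fraction of utterances that contain a question mark."""
    if not utterances:
        return 0.0
    return sum("?" in u for u in utterances) / len(utterances)


def repeated_ngram_rate(utterances, n=3):
    """Fraction of n-grams already seen earlier in the speaker's turns.

    N-grams never span two utterances. A repeat is counted whenever the
    same n-gram occurred before, in a previous utterance or earlier in
    the current one.
    """
    seen = set()
    total = 0
    repeated = 0
    for utterance in utterances:
        tokens = tokenize(utterance)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            total += 1
            if gram in seen:
                repeated += 1
            seen.add(gram)
    return repeated / total if total else 0.0


# Made-up bot turns, for illustration only.
bot_turns = [
    "Hi! Do you like music?",
    "I like music a lot. Do you like music?",
    "Cool.",
]
print(question_ratio(bot_turns))
print(repeated_ngram_rate(bot_turns))
```

A high question ratio or repetition rate flags the kind of behaviour a token-level score like perplexity cannot see. Any thresholds would need to be calibrated against human judgements.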

Implications and Future Directions

The findings suggest that dialogue evaluation splits into two problems: automatic metrics are valuable for initial filtering and development, but they do not adequately capture conversational nuance. Future research should prioritize evaluation methods that better reflect the criteria human judges applied in the competition, especially dialogue coherence, consistency, and engagement across multiple turns.

Speculatively, future iterations of conversational AI challenges could explore more complex dialogue tasks that evaluate agents on long-term memory use and in-depth, knowledge-grounded interaction. Datasets such as Wizard of Wikipedia offer a structure suited to such evaluations.

In conclusion, the ConvAI2 competition documents the current capabilities and limitations of conversational AI systems. Through its multi-stage evaluation, it provides both a benchmark for progress and a roadmap for future development in dialogue systems. The insights from the competition can help bring AI interactions closer to the nuanced, contextually rich exchanges characteristic of human conversation.
