- The paper introduces a novel dataset and evaluation metric to integrate commonsense reasoning in dialogue response generation.
- It employs ConceptNet and SocialIQA prompts to filter and collect multi-turn dialogues, enhancing commonsense inference in responses.
- Experimental results show improved commonsense plausibility in generated responses, and the proposed metric correlates significantly with human judgments.
Commonsense-Focused Dialogues for Response Generation: An Empirical Study
This paper investigates the integration of commonsense reasoning into dialogue response generation (RG) systems. The authors address the lack of datasets and evaluation metrics specifically designed to assess commonsense in interactive dialogues, and introduce a new dataset and an automatic evaluation metric to mitigate these issues.
Data Collection and Preparation
To address the gap in commonsense-focused dialogue data, the paper employs two primary methods:
- Filtering Existing Datasets: A process is introduced to extract commonsense-focused dialogues from existing datasets such as DailyDialog, EmpatheticDialogues, and MuTual. This involves using ConceptNet, a commonsense knowledge graph (CSKG), to identify dialogues containing commonsense inferences. The filtering process identifies candidate concepts via POS tagging and lemmatization, queries ConceptNet for neighboring entities, and searches for those entities in subsequent dialogue turns (a sketch of this filtering step follows the list).
- New Data Collection Using SocialIQA Prompts: The authors collect a new dataset of 25,000 dialogues based on social contexts derived from the SocialIQA benchmark. Prompts are designed to elicit social commonsense inferences in an interactive setting. The collection process involves prompting crowd workers on Amazon Mechanical Turk (MTurk) with context sentences from SocialIQA, instructing them to create 4-6 turn dialogues from the perspective of a character in the given context.
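To make the filtering step concrete, here is a minimal sketch of how such a ConceptNet-based filter could be implemented. It assumes spaCy for POS tagging and lemmatization and the public ConceptNet REST API; the authors' actual pipeline may use a local CSKG dump and different candidate-selection heuristics.

```python
# Rough sketch of ConceptNet-based dialogue filtering (not the authors' exact pipeline).
# Assumptions: spaCy for POS tagging/lemmatization and the public ConceptNet REST API.
import requests
import spacy

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"NOUN", "VERB", "ADJ"}  # candidate concept part-of-speech tags

def candidate_concepts(turn: str) -> set[str]:
    """Lemmatized content words that may act as commonsense concepts."""
    return {tok.lemma_.lower() for tok in nlp(turn) if tok.pos_ in CONTENT_POS}

def conceptnet_neighbors(concept: str, limit: int = 50) -> set[str]:
    """One-hop neighbors of a concept in ConceptNet (English nodes only)."""
    url = f"https://api.conceptnet.io/c/en/{concept}"
    edges = requests.get(url, params={"limit": limit}).json().get("edges", [])
    neighbors = set()
    for edge in edges:
        for node in (edge["start"], edge["end"]):
            if node.get("language") == "en":
                neighbors.add(node["label"].lower())
    neighbors.discard(concept)
    return neighbors

def has_commonsense_link(dialogue: list[str]) -> bool:
    """Keep a dialogue if a concept in one turn has a ConceptNet neighbor in a later turn."""
    for i, turn in enumerate(dialogue[:-1]):
        for concept in candidate_concepts(turn):
            neighbors = conceptnet_neighbors(concept)
            for later_turn in dialogue[i + 1:]:
                if candidate_concepts(later_turn) & neighbors:
                    return True
    return False
```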
Experimental Setup
The paper evaluates the impact of the collected datasets on RG models. The authors fine-tune GPT-2, a pre-trained language model, on various datasets (a minimal fine-tuning sketch follows the list below). The training data setups include:
- Existing dialogue datasets: DailyDialog (DD), EmpatheticDialogues (ED), and Topical-Chat (TC).
- Filtered existing datasets (FE): Combining dialogues from DD, ED, and MuTual that contain commonsense inferences identified using ConceptNet.
- Combined datasets: FE combined with either all of the newly collected SocialIQA-based dialogues (FE+Crowdsourced) or with the ConceptNet-filtered subset of those dialogues (FE+Filtered Crowdsourced).
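As a rough illustration of the training setup, the following sketch fine-tunes GPT-2 on concatenated dialogue turns with Hugging Face transformers. The turn-separator convention, hyperparameters, and toy dataset are illustrative assumptions, not the paper's exact configuration.

```python
# Generic GPT-2 fine-tuning sketch for dialogue response generation.
import torch
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

def encode_dialogue(turns):
    """Join turns with EOS so the model learns to continue a dialogue context."""
    text = tokenizer.eos_token.join(turns) + tokenizer.eos_token
    enc = tokenizer(text, truncation=True, max_length=256, padding="max_length")
    # Ignore padding positions in the language-modeling loss.
    enc["labels"] = [tid if mask == 1 else -100
                     for tid, mask in zip(enc["input_ids"], enc["attention_mask"])]
    return enc

class DialogueDataset(torch.utils.data.Dataset):
    """Wraps a list of dialogues, each a list of turn strings."""
    def __init__(self, dialogues):
        self.examples = [encode_dialogue(d) for d in dialogues]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, idx):
        return {k: torch.tensor(v) for k, v in self.examples[idx].items()}

# Toy training data; in practice this would be the filtered/collected dialogues.
train_data = DialogueDataset([
    ["I finally passed my driving test!", "Congratulations! How will you celebrate?"],
])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-dialogue", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=train_data,
)
trainer.train()
```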
Models are evaluated on a held-out test set using both automatic metrics (perplexity, METEOR, ROUGE, BERTScore) and human evaluation.
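For the reference-based automatic metrics, off-the-shelf implementations can be used; the snippet below shows BERTScore and ROUGE-L on a toy candidate/reference pair (the paper's exact metric settings may differ).

```python
# Illustrative use of off-the-shelf metric implementations (bert-score and rouge-score).
from bert_score import score as bert_score
from rouge_score import rouge_scorer

candidates = ["Sounds great, I'd love to join you for dinner."]
references = ["That sounds fun, I would love to come to dinner."]

# BERTScore: token-level semantic similarity from contextual embeddings.
P, R, F1 = bert_score(candidates, references, lang="en")
print("BERTScore F1:", F1.mean().item())

# ROUGE-L: longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L F1:", scorer.score(references[0], candidates[0])["rougeL"].fmeasure)
```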
Evaluation Metrics
The authors employ both automatic and human evaluation metrics to assess the generated responses. Traditional automatic metrics are used to evaluate general response quality, while human evaluation is used to directly assess the commonsense plausibility of the responses. The paper also introduces a novel automatic metric specifically designed to evaluate commonsense in RG. This metric is based on a multi-layer perceptron (MLP) regressor trained on human annotation scores. The regressor incorporates both neural and symbolic features, with symbolic features derived from ConceptNet triples and neural features extracted from the DialoGPT model.
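A simplified sketch of such a feature-based regressor is shown below. The two features used here, DialoGPT's length-normalized log-likelihood of the response and a crude word-overlap stand-in for the ConceptNet-derived symbolic features, as well as the toy training pairs, are illustrative assumptions rather than the paper's actual feature set.

```python
# Sketch of an MLP regressor that maps neural + symbolic features to a
# commonsense plausibility score learned from human annotations.
import torch
from sklearn.neural_network import MLPRegressor
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
lm = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

def neural_feature(context: str, response: str) -> float:
    """Length-normalized log-likelihood of the context/response pair under DialoGPT."""
    ids = tok(context + tok.eos_token + response, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item()

def symbolic_feature(context: str, response: str) -> float:
    """Crude stand-in for ConceptNet triple features: shared content words."""
    return float(len(set(context.lower().split()) & set(response.lower().split())))

# Toy supervision: (context, response) pairs with human plausibility scores.
examples = [
    ("I just burned the toast.", "Open a window so the smoke clears.", 5.0),
    ("I just burned the toast.", "Penguins live in Antarctica.", 1.0),
]
X = [[neural_feature(c, r), symbolic_feature(c, r)] for c, r, _ in examples]
y = [score for _, _, score in examples]

regressor = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
regressor.fit(X, y)
print(regressor.predict(X))
```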
Results and Analysis
The experimental results indicate that models trained on the commonsense-focused datasets generate responses with greater commonsense plausibility than those of baseline models. Human evaluation shows that the FE+Crowdsourced and FE+Filtered Crowdsourced setups lead to improved commonsense plausibility in generated responses. The proposed automatic metric correlates reasonably with human annotations, suggesting its potential as an efficient proxy for evaluating commonsense in RG.
The authors report the following key findings:
- Using the filtered existing dialogue data (FE) improves the average commonsense scores.
- Adding the newly collected dialogues (FE+Crowdsourced) further increases the average score and reduces variance.
- Using the ConceptNet-filtered subset of the collected data (FE+Filtered Crowdsourced) yields slightly better performance than using the entire collection.
- The proposed MLP-based regressor achieves the highest Spearman's correlation with human scores (ρ = 0.20789, p = 4.53e-22), significantly outperforming the baseline metrics; an illustrative computation of this correlation follows the list.
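For reference, the reported correlation is a standard Spearman rank correlation between metric outputs and human annotations, which can be computed as below (the arrays are placeholders, not the paper's annotation data).

```python
# Spearman's rank correlation between metric predictions and human ratings.
from scipy.stats import spearmanr

metric_scores = [0.62, 0.41, 0.88, 0.15, 0.73]  # regressor outputs (placeholder)
human_scores = [4.0, 3.0, 5.0, 1.0, 4.0]        # human ratings (placeholder)

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.2e}")
```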
Conclusion
This paper makes a significant contribution to the field of dialogue response generation by addressing the need for commonsense-focused datasets and evaluation metrics. The authors introduce a novel approach for collecting and filtering dialogue data to enhance commonsense reasoning in RG models. The experimental results demonstrate the effectiveness of the proposed datasets and the potential of the automatic metric for evaluating commonsense plausibility. The release of the Commonsense-Dialogues dataset is expected to facilitate further research in this area.