Commonsense-Focused Dialogues for Response Generation: An Empirical Study (2109.06427v2)

Published 14 Sep 2021 in cs.CL

Abstract: Smooth and effective communication requires the ability to perform latent or explicit commonsense inference. Prior commonsense reasoning benchmarks (such as SocialIQA and CommonsenseQA) mainly focus on the discriminative task of choosing the right answer from a set of candidates, and do not involve interactive language generation as in dialogue. Moreover, existing dialogue datasets do not explicitly focus on exhibiting commonsense as a facet. In this paper, we present an empirical study of commonsense in dialogue response generation. We first auto-extract commonsensical dialogues from existing dialogue datasets by leveraging ConceptNet, a commonsense knowledge graph. Furthermore, building on social contexts/situations in SocialIQA, we collect a new dialogue dataset with 25K dialogues aimed at exhibiting social commonsense in an interactive setting. We evaluate response generation models trained using these datasets and find that models trained on both extracted and our collected data produce responses that consistently exhibit more commonsense than baselines. Finally we propose an approach for automatic evaluation of commonsense that relies on features derived from ConceptNet and pre-trained language and dialog models, and show reasonable correlation with human evaluation of responses' commonsense quality. We are releasing a subset of our collected data, Commonsense-Dialogues, containing about 11K dialogs.

Citations (44)

Summary

  • The paper introduces a novel dataset and evaluation metric to integrate commonsense reasoning in dialogue response generation.
  • It employs ConceptNet and SocialIQA prompts to filter and collect multi-turn dialogues, enhancing commonsense inference in responses.
  • Experimental results show improved commonsense plausibility in generated responses, with the new metric showing a statistically significant, though modest, correlation with human scores.

Commonsense-Focused Dialogues for Response Generation: An Empirical Study

This paper investigates the integration of commonsense reasoning into dialogue response generation (RG) systems. The authors address the lack of datasets and evaluation metrics specifically designed to assess commonsense in interactive dialogue, introducing a new dataset and an automatic evaluation metric to fill this gap.

Data Collection and Preparation

To address the gap in commonsense-focused dialogue data, the paper employs two primary methods:

  • Filtering Existing Datasets: A process is introduced to extract commonsense-focused dialogues from existing datasets such as DailyDialog, EmpatheticDialogues, and MuTual. This involves using ConceptNet, a commonsense knowledge graph (CSKG), to identify dialogues containing commonsense inferences. The filtering process identifies candidate concepts using POS tagging and lemmatization, queries ConceptNet for neighboring entities, and searches for these entities in subsequent dialogue turns (see the sketch after this list).
  • New Data Collection Using SocialIQA Prompts: The authors collect a new dataset of 25,000 dialogues based on social contexts derived from the SocialIQA benchmark. Prompts are designed to elicit social commonsense inferences in an interactive setting. The collection process involves prompting crowd workers on Amazon Mechanical Turk (MTurk) with context sentences from SocialIQA, instructing them to create 4-6 turn dialogues from the perspective of a character in the given context.
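
To make the filtering step concrete, below is a minimal sketch of a single-hop ConceptNet filter in the spirit described above. It assumes spaCy for POS tagging and lemmatization and the public ConceptNet 5 REST API for neighbor lookup; the authors' exact concept-extraction rules, relations, and thresholds may differ.

```python
# Minimal sketch of a ConceptNet-based dialogue filter (assumptions: spaCy and
# the public ConceptNet 5 REST API at api.conceptnet.io; not the paper's exact
# heuristics or thresholds).
import requests
import spacy

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"NOUN", "VERB", "ADJ"}  # candidate concept POS tags (assumed)


def candidate_concepts(turn: str) -> set:
    """Lemmatized content words that may act as ConceptNet concepts."""
    return {tok.lemma_.lower() for tok in nlp(turn)
            if tok.pos_ in CONTENT_POS and not tok.is_stop}


def conceptnet_neighbors(concept: str, limit: int = 50) -> set:
    """Labels of entities adjacent to `concept` in ConceptNet (English only)."""
    url = f"https://api.conceptnet.io/c/en/{concept}"
    edges = requests.get(url, params={"limit": limit}, timeout=10).json().get("edges", [])
    neighbors = set()
    for edge in edges:
        for side in ("start", "end"):
            node = edge.get(side, {})
            if node.get("language") == "en":
                neighbors.add(node.get("label", "").lower())
    neighbors.discard(concept)
    return neighbors


def is_commonsense_dialogue(turns: list) -> bool:
    """Keep a dialogue if a concept in one turn has a ConceptNet neighbor
    that appears in a later turn (a single-hop commonsense link)."""
    for i, turn in enumerate(turns[:-1]):
        later_text = " ".join(turns[i + 1:]).lower()
        for concept in candidate_concepts(turn):
            if any(n and n in later_text for n in conceptnet_neighbors(concept)):
                return True
    return False
```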

Experimental Setup

The paper evaluates the impact of the collected datasets on RG models. The authors fine-tune GPT-2, a pre-trained language model, on several training-data configurations:

  • Existing dialogue datasets: DailyDialog (DD), EmpatheticDialogues (ED), and Topical-Chat (TC).
  • Filtered existing datasets (FE): Combining dialogues from DD, ED, and MuTual that contain commonsense inferences identified using ConceptNet.
  • Combined datasets: FE combined with either all of the newly collected SocialIQA-based dialogues (FE+Crowdsourced) or with the ConceptNet-filtered subset of those dialogues (FE+Filtered Crowdsourced).

Models are evaluated on a held-out test set using both automatic metrics (perplexity, METEOR, ROUGE, BERTScore) and human evaluation.
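
For illustration, a minimal fine-tuning sketch along these lines is shown below using the Hugging Face transformers library. The turn-flattening scheme, separator tags, and hyperparameters are illustrative assumptions rather than the paper's exact setup.

```python
# Illustrative GPT-2 fine-tuning sketch for dialogue response generation
# (toy data and hyperparameters; not the paper's exact configuration).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")


def flatten_dialogue(turns):
    """Concatenate alternating turns into one training string ending with EOS."""
    tagged = [f"<speaker{(i % 2) + 1}> {t}" for i, t in enumerate(turns)]
    return " ".join(tagged) + tokenizer.eos_token


dialogues = [["I finally passed my driving test!",
              "Congratulations! You must be so relieved."]]  # toy example

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for turns in dialogues:
    enc = tokenizer(flatten_dialogue(turns), return_tensors="pt", truncation=True)
    # Standard causal-LM objective: the labels are the input ids themselves.
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Perplexity (one of the automatic metrics above) is exp of the mean token loss.
print(torch.exp(loss).item())
```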

Evaluation Metrics

The authors employ both automatic and human evaluation metrics to assess the generated responses. Traditional automatic metrics are used to evaluate general response quality, while human evaluation is used to directly assess the commonsense plausibility of the responses. The paper also introduces a novel automatic metric specifically designed to evaluate commonsense in RG. This metric is based on a multi-layer perceptron (MLP) regressor trained on human annotation scores. The regressor incorporates both neural and symbolic features, with symbolic features derived from ConceptNet triples and neural features extracted from the DialoGPT model.
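
A hypothetical sketch of such a scorer is shown below: one neural feature (DialoGPT's loss on a context and response pair) and one symbolic placeholder feature (a ConceptNet-link count) are fed to a scikit-learn MLP regressor fit on human commonsense ratings. The paper's actual feature set and regressor configuration are richer; everything here is an illustrative assumption.

```python
# Hypothetical commonsense scorer: MLP regression over neural and symbolic
# features, fit to human ratings (toy data; illustrative feature choices).
import numpy as np
import torch
from sklearn.neural_network import MLPRegressor
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
lm = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
lm.eval()


def neural_feature(context: str, response: str) -> float:
    """Language-model loss of the response given the context under DialoGPT."""
    ids = tok.encode(context + tok.eos_token + response + tok.eos_token,
                     return_tensors="pt")
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()


def symbolic_feature(context: str, response: str) -> float:
    """Placeholder for a ConceptNet-derived feature, e.g. the number of
    ConceptNet links between context concepts and response concepts."""
    return 0.0


def features(context: str, response: str) -> list:
    return [neural_feature(context, response), symbolic_feature(context, response)]


# (context, response, human commonsense score) triples; toy data.
data = [("I failed my exam.", "Oh no, you must feel disappointed.", 4.5),
        ("I failed my exam.", "Bananas are yellow.", 1.0)]

X = np.array([features(c, r) for c, r, _ in data])
y = np.array([score for _, _, score in data])
scorer = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X, y)
print(scorer.predict(X))
```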

Results and Analysis

The experimental results indicate that models trained on the commonsense-focused datasets generate responses that exhibit more commonsense than baseline models. Human evaluation scores show that the FE+Crowdsourced and FE+Filtered Crowdsourced datasets lead to improved commonsense plausibility in generated responses. The proposed automatic metric demonstrates reasonable correlation with human annotations, suggesting its potential for efficient evaluation of commonsense in RG.

The authors report the following key findings:

  • Using the filtered existing dialogue data (FE) improves the average commonsense scores.
  • Including the newly collected dialogues further increases the average score (FE+Crowdsourced) and reduces variance.
  • Using the filtered subset of the collected data yields slightly better performance than using the entire data collection.
  • The proposed MLP-based regressor achieves the highest Spearman correlation with human scores (0.20789, p-value 4.53E-22), significantly outperforming the baseline metrics (see the sketch after this list).
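
The correlation check itself is simple to reproduce: the toy sketch below shows how an automatic metric's outputs would be compared against human ratings with SciPy (the numbers are placeholders, not the paper's data).

```python
# Comparing an automatic metric's scores against human ratings (toy numbers).
from scipy.stats import spearmanr

metric_scores = [0.8, 0.2, 0.6, 0.4, 0.9]   # scorer outputs (toy)
human_scores = [5, 1, 4, 2, 5]              # human commonsense ratings (toy)

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```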

Conclusion

This paper makes a significant contribution to the field of dialogue response generation by addressing the need for commonsense-focused datasets and evaluation metrics. The authors introduce a novel approach for collecting and filtering dialogue data to enhance commonsense reasoning in RG models. The experimental results demonstrate the effectiveness of the proposed datasets and the potential of the automatic metric for evaluating commonsense plausibility. The release of the Commonsense-Dialogues dataset is expected to facilitate further research in this area.
