SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization (1911.12237v2)

Published 27 Nov 2019 in cs.CL

Abstract: This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than the model-generated summaries of news -- in contrast with human evaluators' judgement. This suggests that a challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogues corpus, manually annotated with abstractive summarizations, which can be used by the research community for further studies.

Summary

The paper presents a novel human-annotated dialogue dataset specifically designed for abstractive summarization in multi-speaker contexts.
It details the creation of over 16,000 messenger-style chats and evaluates simple baselines such as LONGEST-3 to benchmark performance.
It highlights the challenges of using standard metrics like ROUGE, underscoring the need for specialized evaluation measures for dialogue summarization.

An Expert Overview of the SAMSum Corpus for Abstractive Dialogue Summarization

The paper "SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization" by Bogdan Gliwa et al. introduces a novel dataset focused on the abstractive summarization of dialogues. The authors emphasize that traditional summarization datasets have predominantly concentrated on single-speaker documents like news articles, leaving a gap in resources for multi-speaker dialogue contexts. This work seeks to address this gap by creating a high-quality corpus specifically for dialogue summarization, thereby enabling further advancements in this domain.

Dataset Creation and Structure

The SAMSum Corpus comprises over 16,000 chat dialogues, each annotated with an abstractive summary. The dialogues were written by linguists to reflect the informal, varied style typical of modern messaging applications. The authors highlight the uniqueness of their approach: previous datasets either lacked the conversational nature of chat dialogues or were too technical. The dataset covers dialogues with varying numbers of utterances, giving a diverse representation of conversational dynamics.

In addition to the dataset's creation, the authors describe the validation process that confirmed the linguistic authenticity of the dialogues as messenger-like conversations. The validation further establishes the corpus as a valuable tool for researchers aiming to explore the intricacies of dialogue summarization.

Baseline Models and Experimental Setup

The paper details several simple baselines adapted for dialogue summarization, such as Lead-3 and LONGEST-n. In their empirical evaluation, the authors found LONGEST-3 to be the most effective baseline, although the task calls for more nuanced models that can capture complex dialogue structure.
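To make the baselines concrete, here is a minimal Python sketch. It rests on assumptions that this overview does not spell out: that Lead-3 keeps the first three utterances, that LONGEST-n keeps the n longest utterances (measured here in whitespace-separated words) in their original order, and that a dialogue is a list of (speaker, text) pairs. The function names, the sample dialogue, and the tie-breaking rule are illustrative, not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: one (speaker, text) pair per utterance.
Utterance = Tuple[str, str]


def lead_n(dialogue: List[Utterance], n: int = 3) -> List[Utterance]:
    """Keep the first n utterances of the dialogue."""
    return dialogue[:n]


def longest_n(dialogue: List[Utterance], n: int = 3) -> List[Utterance]:
    """Keep the n longest utterances, restored to dialogue order.

    Length is counted in whitespace-separated words; ties go to the
    earlier utterance. Both choices are assumptions of this sketch.
    """
    ranked = sorted(
        range(len(dialogue)),
        key=lambda i: (-len(dialogue[i][1].split()), i),
    )
    keep = sorted(ranked[:n])
    return [dialogue[i] for i in keep]


def render(utterances: List[Utterance]) -> str:
    """Join the selected utterances into a single summary string."""
    return " ".join(f"{speaker}: {text}" for speaker, text in utterances)


# Invented sample dialogue, for illustration only.
dialogue = [
    ("Amanda", "I baked cookies. Do you want some?"),
    ("Jerry", "Sure!"),
    ("Amanda", "I'll bring you some tomorrow :-)"),
    ("Jerry", "ok"),
]

print(render(lead_n(dialogue, 3)))
print(render(longest_n(dialogue, 2)))
```

Both functions only copy utterances, so neither can produce a third-person summary that paraphrases the conversation, which is one reason such baselines serve as a reference point rather than a solution.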

The research employs various summarization models, including Pointer Generator Networks, Transformer models, and lightweight convolution models, and tests them on both dialogue and news datasets. Evaluating on both domains lets the authors compare how well each model adapts to different kinds of text.

Performance Evaluation

The authors observe that standard evaluation metrics like ROUGE do not reliably capture the quality of abstractive dialogue summaries. Model-generated summaries of dialogues achieved higher ROUGE scores than model-generated summaries of news, in contrast with human evaluators' judgment. This discrepancy indicates that dialogue summarization presents unique challenges, possibly stemming from the dynamic nature of conversational exchanges and the presence of multiple interlocutors.
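For readers unfamiliar with ROUGE, the sketch below shows the core idea behind ROUGE-N as an F1 score over overlapping n-grams. It is a simplified illustration, not the official ROUGE toolkit and not the authors' evaluation code: it tokenizes on whitespace, lowercases, and applies no stemming or stopword handling. The function name `rouge_n_f1` and the example strings are invented for this sketch.

```python
from collections import Counter


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1: harmonic mean of n-gram precision and recall."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example strings, for illustration only.
score = rouge_n_f1("the cat sat", "the cat ran away", n=1)
print(round(score, 4))
```

Because the score only counts shared word sequences, a candidate can overlap heavily with the reference while still misattributing who said or did what, which is the kind of error human evaluators notice in dialogue summaries.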

The analysis reveals that while pretrained embeddings and joint training on news and dialogues improved model performance, the ROUGE metric’s correlation with human judgment was weaker for dialogues than for news. This insight suggests that developing dedicated evaluation metrics for dialogue summarization is essential for future research.
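One way to quantify how well a metric tracks human judgment is a correlation coefficient between per-summary metric scores and human ratings. The sketch below computes a Pearson correlation; this overview does not say which coefficient the authors used, so Pearson is an assumption here. The input lists are toy placeholders chosen only to exercise the function; they are not results from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy placeholder values, NOT data from the paper.
metric_scores = [1.0, 2.0, 3.0]
human_ratings = [2.0, 4.0, 6.0]
print(pearson(metric_scores, human_ratings))
```

A coefficient near 1 would mean the metric ranks summaries much as humans do; a noticeably lower value on dialogues than on news is the pattern the analysis above describes.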

Implications and Future Directions

The introduction of the SAMSum Corpus represents a significant step forward in dialogue summarization research, providing a high-quality benchmark that the research community can use to develop and refine abstractive summarization techniques. The paper underscores the need for dedicated architectures tailored to handle the unique challenges posed by dialogue data, including the integration of speaker information and better context comprehension.

The limitations identified with current evaluation practices signal a critical area for future investigation. The authors advocate for the creation of new, specialized metrics that account for the complexities of dialogue summarization, potentially involving linguistic coherence and information extraction accuracy.

In summary, the work by Gliwa et al. lays the groundwork for advancing the field of dialogue summarization, presenting a comprehensive dataset and highlighting the need for methodological innovations both in model development and evaluation criteria. As digital communication in messenger apps becomes increasingly prevalent, the implications of this research extend to practical applications in conversational AI systems and human-computer interaction.