SemEval-2017 Task 3: Community Question Answering (1912.00730v1)

Published 2 Dec 2019 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: We describe SemEval-2017 Task 3 on Community Question Answering. This year, we reran the four subtasks from SemEval-2016: (A) Question-Comment Similarity, (B) Question-Question Similarity, (C) Question-External Comment Similarity, and (D) Rerank the correct answers for a new question in Arabic, providing all the data from 2015 and 2016 for training, and fresh data for testing. Additionally, we added a new subtask E in order to enable experimentation with Multi-domain Question Duplicate Detection in a larger-scale scenario, using StackExchange subforums. A total of 23 teams participated in the task, and submitted a total of 85 runs (36 primary and 49 contrastive) for subtasks A-D. Unfortunately, no teams participated in subtask E. A variety of approaches and features were used by the participating systems to address the different subtasks. The best systems achieved an official score (MAP) of 88.43, 47.22, 15.46, and 61.16 in subtasks A, B, C, and D, respectively. These scores are better than the baselines, especially for subtasks A-C.

Citations (225)

Summary

  • The task advances CQA research by re-running four ranking and similarity subtasks from 2016 and introducing a new large-scale question duplicate detection subtask.
  • Participating systems employed diverse machine learning models, including SVMs, CNNs, and LSTMs, to capture textual and semantic relationships.
  • Results reveal varied performance, with best-system MAP scores ranging from 15.46 (Subtask C) to 88.43 (Subtask A), highlighting both strengths and open challenges in CQA systems.

An Overview of "SemEval-2017 Task 3: Community Question Answering"

The paper "SemEval-2017 Task 3: Community Question Answering" addresses the development, implementation, and evaluation of the Community Question Answering (CQA) task in the context of the SemEval-2017 competition. This task builds upon previous iterations from 2015 and 2016, with the goal of advancing research and methodologies in the domain of CQA systems, which are critical for organizing and retrieving information on platforms like Stack Exchange and Qatar Living.

Core Objectives and Subtasks

The competition comprised five distinct subtasks, each probing a different aspect of CQA systems:

  1. Subtask A: Question-Comment Similarity. Participants ranked the comments in a question's thread by their relevance to that question (a minimal baseline sketch follows this list).
  2. Subtask B: Question-Question Similarity. This involved identifying similarity between a new question and related questions retrieved by a search engine.
  3. Subtask C: Question-External Comment Similarity. Participants ranked comments retrieved from related questions' threads to determine relevance to a new question.
  4. Subtask D: An Arabic-language task requiring participants to rerank correct answers for a new question using related search results.
  5. Subtask E: A new addition focused on Multi-domain Question Duplicate Detection using StackExchange data.
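
To make the ranking setup concrete, the following is a minimal sketch of a Subtask A-style baseline that orders the comments in a question's thread by TF-IDF cosine similarity with the question. The example question and comments are invented for illustration; this is not one of the participating systems.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_comments(question, comments):
    """Rank comments by TF-IDF cosine similarity to the question (illustrative baseline)."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    matrix = vectorizer.fit_transform([question] + comments)
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sorted(zip(comments, scores), key=lambda pair: pair[1], reverse=True)

# Toy Qatar Living-style question with three candidate comments.
question = "Where can I find good karak tea in Doha?"
comments = [
    "Try the cafeteria near the Corniche, their karak is excellent.",
    "I am also looking for a good dentist, any recommendations?",
    "Tea Time on Al Sadd street serves karak around the clock.",
]
for comment, score in rank_comments(question, comments):
    print(f"{score:.3f}  {comment}")
```

Bag-of-words baselines like this capture only surface overlap, which is one reason participants turned to the richer features and neural models described below.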

Methodologies and Results

A total of 23 teams participated, submitting 85 runs (36 primary and 49 contrastive) across subtasks A-D. The methodologies employed were diverse, with significant emphasis on:

  • Machine Learning: Predominantly, SVMs and neural network architectures, including CNNs and LSTMs, were used to model relationships between questions and answers using textual features and distributed representations.
  • Feature Engineering: Participants leveraged a range of features, from lexical similarities to embedding-based measures capturing semantic nuances (a minimal feature-combination sketch follows this list).
  • Innovative Approaches: Notable approaches included KeLP's use of syntactic tree kernels to capture linguistic patterns and bunji's use of decomposable attention models for semantic similarity.
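
As a rough illustration of how lexical and embedding-based features might be combined in a linear classifier for question-comment relevance, here is a minimal sketch; the feature choices, the source of the sentence embeddings, and the SVM setup are assumptions made for illustration rather than any participant's actual pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC

def lexical_overlap(question, comment):
    """Jaccard overlap of word sets: a simple lexical-similarity feature."""
    q, c = set(question.lower().split()), set(comment.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0

def embedding_similarity(q_vec, c_vec):
    """Cosine similarity of precomputed sentence embeddings (embedding source is assumed)."""
    denom = np.linalg.norm(q_vec) * np.linalg.norm(c_vec)
    return float(q_vec @ c_vec / denom) if denom else 0.0

def build_features(pairs, embeddings):
    """Stack one lexical and one embedding feature per (question, comment) pair."""
    return np.array([
        [lexical_overlap(q, c), embedding_similarity(embeddings[q], embeddings[c])]
        for q, c in pairs
    ])

# Hypothetical usage:
#   X = build_features(train_pairs, sentence_embeddings)
#   clf = LinearSVC().fit(X, y)   # y: 1 for "Good" comments, 0 otherwise
# At test time, comments would be ranked by clf.decision_function(features).
```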

Results varied across subtasks, with Subtask A seeing the highest Mean Average Precision (MAP) of 88.43, while Subtask C demonstrated the complexity of real-world community data with a lower MAP of 15.46. Subtask E, despite its novelty, attracted no participants, suggesting barriers to entry such as data volume and complexity.
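
For reference, the official measure can be sketched as a generic Mean Average Precision over per-question ranked lists; the snippet below is a simplified illustration, not the task's official scorer.

```python
def average_precision(ranked_relevance):
    """AP for one question; ranked_relevance[i] is 1 if the i-th ranked comment is relevant."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """MAP: mean of per-question average precision."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

# Two toy questions with relevant comments at ranks (1, 3) and (2,):
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]]))  # approx. 0.667
```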

Implications and Future Directions

This task series highlights several implications for the broader field of NLP and CQA:

  • Benchmarking and Evaluation: SemEval provides a robust framework for benchmarking CQA systems, fostering developments across multiple dimensions, including multilingual capabilities.
  • Feature Diversity: The results underscore the importance of both traditional feature engineering and deep learning representations in achieving high performance on complex tasks.
  • Real-world Applicability: The tasks emphasize the need for CQA systems to adapt to diverse data regimes, reflecting the variety of user-generated content found on online forums.

The research community could benefit from exploring AutoML techniques to alleviate feature engineering effort and improve adaptability to varying dataset characteristics. Additionally, further exploration of domain adaptation and transfer learning could facilitate better cross-forum performance in scenarios like the one posed by Subtask E.

In summary, the research presented within this paper demonstrates progressive strides in community question answering capabilities, setting a foundation for ongoing research and innovation.