
MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain (2203.14093v2)

Published 26 Mar 2022 in cs.CL, cs.LG, cs.PL, and cs.SE

Abstract: This work proposes a new pipeline for leveraging data collected on the Stack Overflow website to pre-train a multimodal model for finding duplicates on question-answering websites. Our multimodal model is trained on question descriptions and source code in multiple programming languages. We design two new learning objectives to improve duplicate detection capabilities. The result of this work is a mature, fine-tuned Multimodal Question Duplicity Detection (MQDD) model, ready to be integrated into a Stack Overflow search system, where it can help users find answers to questions that have already been answered. Alongside the MQDD model, we release two datasets related to the software engineering domain. The first, the Stack Overflow Dataset (SOD), is a massive corpus of paired questions and answers. The second, the Stack Overflow Duplicity Dataset (SODD), contains data for training duplicate detection models.

Authors (4)
  1. Jan Pašek
  2. Jakub Sido
  3. Miloslav Konopík
  4. Ondřej Pražák
