Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
97 tokens/sec
GPT-4o
53 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

PTM4Tag: Sharpening Tag Recommendation of Stack Overflow Posts with Pre-trained Models (2203.10965v4)

Published 21 Mar 2022 in cs.SE

Abstract: Stack Overflow is often viewed as the most influential Software Question Answer (SQA) website with millions of programming-related questions and answers. Tags play a critical role in efficiently structuring the contents in Stack Overflow and are vital to support a range of site operations, e.g., querying relevant contents. Poorly selected tags often introduce extra noise and redundancy, which leads to tag synonym and tag explosion problems. Thus, an automated tag recommendation technique that can accurately recommend high-quality tags is desired to alleviate the problems mentioned above. Inspired by the recent success of pre-trained LLMs (PTMs) in NLP, we present PTM4Tag, a tag recommendation framework for Stack Overflow posts that utilize PTMs with a triplet architecture, which models the components of a post, i.e., Title, Description, and Code with independent LLMs. To the best of our knowledge, this is the first work that leverages PTMs in the tag recommendation task of SQA sites. We comparatively evaluate the performance of PTM4Tag based on five popular pre-trained models: BERT, RoBERTa, ALBERT, CodeBERT, and BERTOverflow. Our results show that leveraging the software engineering (SE) domain-specific PTM CodeBERT in PTM4Tag achieves the best performance among the five considered PTMs and outperforms the state-of-the-art deep learning (Convolutional Neural Network-based) approach by a large margin in terms of average $Precision@k$, $Recall@k$, and $F1$-$score@k$. We conduct an ablation study to quantify the contribution of a post's constituent components (Title, Description, and Code Snippets) to the performance of PTM4Tag. Our results show that Title is the most important in predicting the most relevant tags, and utilizing all the components achieves the best performance.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (6)
  1. Junda He (24 papers)
  2. Bowen Xu (60 papers)
  3. Zhou Yang (82 papers)
  4. DongGyun Han (17 papers)
  5. Chengran Yang (18 papers)
  6. David Lo (229 papers)
Citations (36)

Summary

We haven't generated a summary for this paper yet.