PTM4Tag: Sharpening Tag Recommendation of Stack Overflow with Pre-trained Models
Stack Overflow is often viewed as the most influential Software Question & Answer (SQA) website, with millions of programming-related questions and answers. Tags play a critical role in efficiently structuring the content on Stack Overflow and are vital to supporting a range of site operations, e.g., querying relevant content. Poorly selected tags often introduce extra noise and redundancy, which leads to the tag synonym and tag explosion problems. Thus, an automated tag recommendation technique that can accurately recommend high-quality tags is desired to alleviate these problems. Inspired by the recent success of pre-trained language models (PTMs) in natural language processing (NLP), we present PTM4Tag, a tag recommendation framework for Stack Overflow posts that utilizes PTMs with a triplet architecture, which models the components of a post, i.e., Title, Description, and Code, with independent language models. To the best of our knowledge, this is the first work that leverages PTMs for the tag recommendation task on SQA sites. We comparatively evaluate the performance of PTM4Tag with five popular pre-trained models: three trained on general-domain textual data, i.e., BERT, RoBERTa, and ALBERT, and two SE domain-specific models, i.e., CodeBERT and BERTOverflow. Our results show that leveraging the SE-specific PTM CodeBERT in PTM4Tag achieves the best performance among the five considered PTMs. Surprisingly, the other SE-specific PTM, BERTOverflow, performs much worse than BERT, RoBERTa, and CodeBERT. Furthermore, PTM4Tag implemented with CodeBERT outperforms the state-of-the-art approach (based on a Convolutional Neural Network) by a large margin in terms of average Precision@k, Recall@k, and F1-score@k; more specifically, the F1-score@5 is boosted by 15.3%. Finally, we conduct an ablation study to quantify the contribution of a post's constituent components (Title, Description, and Code snippets) to the performance of PTM4Tag. Our results show that Title is the most important component in predicting the most relevant tags, and utilizing all the components achieves the best performance.
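To make the triplet architecture concrete, below is a minimal sketch of how such a model could be assembled with the HuggingFace Transformers library: three independent encoders (one each for Title, Description, and Code) whose [CLS] embeddings are concatenated and fed to a multi-label classifier over the tag vocabulary. The class name TripletTagger, the tag vocabulary size, and the concatenation-based fusion are illustrative assumptions, not the authors' exact implementation.

    # Hedged sketch of the triplet design described in the abstract: three
    # independently fine-tuned PTM encoders whose [CLS] embeddings are fused
    # for multi-label tag classification. Names and sizes are assumptions.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    NUM_TAGS = 20000                      # tag vocabulary size (assumed; dataset-specific)
    PTM_NAME = "microsoft/codebert-base"  # CodeBERT performed best per the abstract

    class TripletTagger(nn.Module):
        def __init__(self, ptm_name: str = PTM_NAME, num_tags: int = NUM_TAGS):
            super().__init__()
            # One independent encoder per post component, as in the triplet design.
            self.title_encoder = AutoModel.from_pretrained(ptm_name)
            self.desc_encoder = AutoModel.from_pretrained(ptm_name)
            self.code_encoder = AutoModel.from_pretrained(ptm_name)
            hidden = self.title_encoder.config.hidden_size
            # Fuse the three [CLS] representations and score every candidate tag.
            self.classifier = nn.Linear(3 * hidden, num_tags)

        def forward(self, title, desc, code):
            # Each argument is a dict of tokenizer outputs (input_ids, attention_mask).
            t = self.title_encoder(**title).last_hidden_state[:, 0]  # [CLS] embedding
            d = self.desc_encoder(**desc).last_hidden_state[:, 0]
            c = self.code_encoder(**code).last_hidden_state[:, 0]
            logits = self.classifier(torch.cat([t, d, c], dim=-1))
            return torch.sigmoid(logits)  # independent per-tag probabilities

    tokenizer = AutoTokenizer.from_pretrained(PTM_NAME)
    model = TripletTagger()
    enc = lambda s: tokenizer(s, return_tensors="pt", truncation=True, max_length=128)
    probs = model(enc("How to parse JSON in Python?"),
                  enc("I want to read a JSON file into a dict."),
                  enc("import json\ndata = json.load(open('f.json'))"))
    top5 = probs.topk(5, dim=-1).indices  # recommend the 5 highest-scoring tags

Sigmoid outputs keep the per-tag probabilities independent, which is the standard formulation for multi-label classification; ranking the scores and taking the top k then yields the Precision@k, Recall@k, and F1-score@k figures reported in the abstract.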
Sun 15 May (displayed time zone: Eastern Time, US & Canada)
21:30 - 22:20 | Session 1: Summarization (Research) at ICPC room | Chair(s): Haipeng Cai, Washington State University, USA
21:30 (7m) Talk | PTM4Tag: Sharpening Tag Recommendation of Stack Overflow with Pre-trained Models (Research) | Junda He, Bowen Xu, Zhou Yang, DongGyun Han, Chengran Yang, David Lo (Singapore Management University) | Media Attached
21:37 (7m) Talk | GypSum: Learning Hybrid Representations for Code Summarization (Research) | Yu Wang, Yu Dong, Xuesong Lu (School of Data Science and Engineering, East China Normal University), Aoying Zhou (East China Normal University) | DOI, Pre-print, Media Attached
21:44 (7m) Talk | M2TS: Multi-Scale Multi-Modal Approach Based on Transformer for Source Code Summarization (Research) | Media Attached
21:51 (7m) Talk | Semantic Similarity Metrics for Evaluating Source Code Summarization (Research) | Sakib Haque, Zachary Eberhart, Aakash Bansal, Collin McMillan (University of Notre Dame) | Media Attached
21:58 (7m) Talk | LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition (Research) | Rishab Sharma, Fuxiang Chen, Fatemeh Hendijani Fard (University of British Columbia) | Pre-print, Media Attached
22:05 (15m) Live Q&A | Q&A - Paper Session 1 (Research)