ICSE 2022
Sun 8 - Fri 27 May 2022
Sun 8 May 2022 08:30 - 08:50 at NLBSE room - Paper Session 1 Chair(s): Andrea Di Sorbo, Sebastiano Panichella

Knowing the topics of a software forum post, such as those on StackOverflow, supports analysis, exploration, and ultimately understanding of the large volumes of data these communities produce. Prior research has aimed to automatically predict the topics (or “tags”) of a post, classifying text as one or more labels from a potentially very large label set. Proposed approaches to this extreme multi label classification (XMLC) problem use embedding models: post text and tag names are embedded into the same vector space, so that posts can be classified quickly across many labels by computing vector similarities. Whereas previous work has trained these models on data with explicit text-to-tag information, we evaluate the classification ability of embedding models that have not been trained on such structured data (and are thus “unsupervised”), to gauge their potential applicability to other forums or domains in which tag data is not available.
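As a rough illustration of the similarity-based classification described above, the sketch below ranks candidate tags for a post by cosine similarity in a shared vector space. It is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a pre-trained embedding model, and the names `embed`, `cosine`, and `rank_tags` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a pre-trained embedding model (assumption):
    represents text as a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tags(post, tags):
    """Embed the post and every tag name in the same space, then
    return the tags ordered from most to least similar to the post."""
    post_vec = embed(post)
    scored = [(cosine(post_vec, embed(tag)), tag) for tag in tags]
    # Sort by descending similarity; break ties alphabetically.
    return [tag for _, tag in sorted(scored, key=lambda p: (-p[0], p[1]))]


tags = ["python", "java", "sql"]
ranked = rank_tags("how do i sort a list in python", tags)
```

With a real sentence embedding model in place of `embed`, the same ranking step could in principle run over the full tag set without any text-to-tag training data, which is the setting the abstract calls unsupervised.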

We evaluate 14 unsupervised pre-trained models on 0.1% of all StackOverflow posts against all 61,662 possible StackOverflow tags. We find that an MPNet model trained partially on unlabelled StackExchange data (i.e. without tag data) achieves the highest score overall for this task, with a recall at 1 (R@1) of 0.161. These results indicate which models are most appropriate for XMLC of StackOverflow posts when supervised training is not feasible, and offer insight into their applicability in similar but not identical domains, such as software product forums. They also suggest that training embedding models on in-domain title-body or question-answer pairs can create an effective zero-shot topic classifier for situations where no topic data is available.
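For readers unfamiliar with the R@1 metric, the sketch below shows one common way to compute recall at k for ranked tag predictions. The exact definition used in the paper is not given in this abstract, so this should be read as an assumption; the function names and the example data are hypothetical.

```python
def recall_at_k(ranked_tags, true_tags, k):
    """Fraction of a post's true tags that appear in its top-k predictions."""
    if not true_tags:
        return 0.0
    top_k = set(ranked_tags[:k])
    return len(top_k & set(true_tags)) / len(true_tags)


def mean_recall_at_k(predictions, gold, k):
    """Average recall at k over all posts."""
    scores = [recall_at_k(p, g, k) for p, g in zip(predictions, gold)]
    return sum(scores) / len(scores)


# Made-up example: ranked predictions and true tags for two posts.
predictions = [["python", "list", "java"], ["sql", "java", "python"]]
gold = [{"python", "list"}, {"java"}]

r_at_1 = mean_recall_at_k(predictions, gold, 1)
r_at_2 = mean_recall_at_k(predictions, gold, 2)
```

Under this definition, a post with several true tags can never reach a recall of 1.0 at k = 1, which is one reason R@1 values tend to be low when posts carry multiple tags.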

Sun 8 May

Displayed time zone: Eastern Time (US & Canada)

08:30 - 09:40
Paper Session 1 (NLBSE) at NLBSE room
Chair(s): Andrea Di Sorbo University of Sannio, Sebastiano Panichella Zurich University of Applied Sciences
08:30
20m
Talk
Unsupervised Extreme Multi Label Classification of Stack Overflow Posts
NLBSE
Peter Devine The University of Auckland, Kelly Blincoe University of Auckland
08:50
20m
Talk
Understanding Digits in Identifier Names: An Exploratory Study
NLBSE
Anthony Peruma Rochester Institute of Technology, Christian D. Newman Rochester Institute of Technology
09:10
15m
Talk
From Zero to Hero: Generating Training Data for Question-To-Cypher Models
NLBSE
Dominik Opitz Bonn-Rhein-Sieg University oAS, Nico Hochgeschwender Hochschule Bonn-Rhein-Sieg
09:25
15m
Talk
Automatic Identification of Informative Code in Stack Overflow Posts
NLBSE
Preetha Chatterjee Drexel University, USA
