Unsupervised Extreme Multi Label Classification of Stack Overflow Posts (NLBSE 2022)


Sun 8 - Fri 27 May 2022

Who

Peter Devine, Kelly Blincoe

Track

NLBSE 2022

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).


When

Sun 8 May 2022 08:30 - 08:50 at NLBSE room - Paper Session 1 Chair(s): Andrea Di Sorbo, Sebastiano Panichella

Abstract

Knowing the topics of a software forum post, such as those on StackOverflow, allows for greater analysis, exploration, and ultimately understanding of the large amounts of data that come from these communities. Prior research has aimed to automatically predict the topic (or “tag”) of a post, classifying text as one or more of a potentially very large label set. Proposed approaches for solving this extreme multi label classification (XMLC) problem classify the text using embedding models, where post text and tag name are embedded into the same space to allow for quick classification of posts across many labels using vector similarity calculations. While previous work has trained these models on data that has explicit text-to-tag information, we assess the classification ability of embedding models which have not been trained using such structured data (and are thus “unsupervised”), in order to gauge their potential applicability to other forums or domains in which tag data is not available.
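
The similarity-based classification described above can be illustrated with a minimal sketch. It assumes that post and tag embeddings are already available as plain vectors; the three-dimensional toy vectors, the tag names, and the helper names `cosine` and `rank_tags` are hypothetical and are not taken from the paper, which evaluates pre-trained embedding models rather than hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_u * norm_v)


def rank_tags(post_vec, tag_vecs, k=1):
    """Return the k tag names whose vectors are most similar to the post."""
    scored = sorted(
        tag_vecs.items(),
        key=lambda item: cosine(post_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hypothetical toy embeddings standing in for an embedding model's output.
TAG_VECS = {
    "python": [0.9, 0.1, 0.0],
    "java": [0.1, 0.9, 0.0],
    "sql": [0.0, 0.1, 0.9],
}
POST_VEC = [0.8, 0.3, 0.1]

top_tags = rank_tags(POST_VEC, TAG_VECS, k=2)
print(top_tags)
```

Because the tag vectors are computed once and reused for every post, adding a new tag only requires embedding its name, which is what makes this setup attractive for very large label sets.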

We evaluate 14 unsupervised pre-trained models on 0.1% of all StackOverflow posts against all 61,662 possible StackOverflow tags. We find that an MPNet model trained partially on unlabelled StackExchange data (i.e., without tag data) achieves the highest score overall for this task, with a recall score of 0.161 R@1. These results inform which models are most appropriate for use in XMLC of StackOverflow posts when supervised training is not feasible, and offer insight into their applicability in similar but not identical domains, such as software product forums. They also suggest that training embedding models using in-domain title-body or question-answer pairs can create an effective zero-shot topic classifier for situations where no topic data is available.
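
The abstract reports recall at rank 1 (R@1) but does not spell out the formula. The sketch below uses one common definition, assumed here rather than confirmed by the source: for each post, the fraction of its true tags that appear among the top k predicted tags, averaged over all posts. The function names and the toy data are hypothetical.

```python
def recall_at_k(ranked_tags, gold_tags, k):
    """Fraction of a post's true tags found in its top-k predictions."""
    if not gold_tags:
        return 0.0  # assumption: posts without tags contribute zero
    hits = len(set(ranked_tags[:k]) & set(gold_tags))
    return hits / len(gold_tags)


def mean_recall_at_k(all_ranked, all_gold, k):
    """Average recall@k over a collection of posts."""
    scores = [recall_at_k(r, g, k) for r, g in zip(all_ranked, all_gold)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical predictions (best first) and true tags for two posts.
PREDICTIONS = [
    ["python", "pandas", "java"],
    ["sql", "java", "python"],
]
GOLD = [
    {"python", "pandas"},
    {"java"},
]

r_at_1 = mean_recall_at_k(PREDICTIONS, GOLD, k=1)
r_at_2 = mean_recall_at_k(PREDICTIONS, GOLD, k=2)
print(r_at_1, r_at_2)
```

Under this definition, a post with several true tags can never reach a recall of 1.0 at k=1, which is one reason R@1 values in large multi-label settings tend to look low.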

Peter Devine

The University of Auckland

Kelly Blincoe

University of Auckland

New Zealand


Session Program

Sun 8 May
Displayed time zone: Eastern Time (US & Canada)

08:30 - 09:40	Paper Session 1 (NLBSE) at NLBSE room. Chair(s): Andrea Di Sorbo (University of Sannio), Sebastiano Panichella (Zurich University of Applied Sciences)

08:30	20m	Talk	Unsupervised Extreme Multi Label Classification of Stack Overflow Posts (NLBSE). Peter Devine (The University of Auckland), Kelly Blincoe (University of Auckland)
08:50	20m	Talk	Understanding Digits in Identifier Names: An Exploratory Study (NLBSE). Anthony Peruma (Rochester Institute of Technology), Christian D. Newman (Rochester Institute of Technology)
09:10	15m	Talk	From Zero to Hero: Generating Training Data for Question-To-Cypher Models (NLBSE). Dominik Opitz (Bonn-Rhein-Sieg University oAS), Nico Hochgeschwender (Hochschule Bonn-Rhein-Sieg)
09:25	15m	Talk	Automatic Identification of Informative Code in Stack Overflow Posts (NLBSE). Preetha Chatterjee (Drexel University, USA)
