Unsupervised Extreme Multi Label Classification of Stack Overflow Posts
Knowing the topics of a software forum post, such as those on StackOverflow, allows for greater analysis, exploration, and ultimately understanding of the large amounts of data that come from these communities. Prior research has aimed to automatically predict the topics (or “tags”) of a post, classifying text as one or more of a potentially very large label set. Proposed approaches to this extreme multi label classification (XMLC) problem use embedding models: post text and tag names are embedded into the same space, so that posts can be classified quickly across many labels using vector similarity calculations. While previous work has trained these models on data with explicit text-to-tag information, we assess the classification ability of embedding models that have not been trained on such structured data (and are thus “unsupervised”), to gauge their potential applicability to other forums or domains in which tag data is not available.
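The shared-space idea can be illustrated with a minimal sketch. It assumes that some embedding model has already mapped a post and each tag name to vectors of the same dimension; the three-dimensional vectors, the tag set, and the helper names `cosine` and `rank_tags` are all made up for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_tags(post_vec, tag_vecs, k=3):
    """Return the k tag names whose vectors are most similar to the post vector."""
    ranked = sorted(
        tag_vecs,
        key=lambda tag: cosine(post_vec, tag_vecs[tag]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors standing in for real model output (assumption: invented values).
TAG_VECS = {
    "python": [0.9, 0.1, 0.0],
    "pandas": [0.8, 0.3, 0.1],
    "css": [0.0, 0.2, 0.9],
}
POST_VEC = [0.85, 0.35, 0.05]  # hypothetical embedding of a post about data frames

top = rank_tags(POST_VEC, TAG_VECS, k=2)
print(top)
```

Because tag vectors can be computed once and reused, classifying a new post needs only one embedding call plus similarity lookups, which is what makes the approach practical for very large label sets.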
We evaluate 14 unsupervised pre-trained models on 0.1% of all StackOverflow posts against all 61,662 possible StackOverflow tags. We find that an MPNet model trained partially on unlabelled StackExchange data (i.e. without tag data) achieves the highest score overall for this task, with a recall at 1 (R@1) of 0.161. These results inform which models are most appropriate for XMLC of StackOverflow posts when supervised training is not feasible, and offer insight into their applicability in similar but not identical domains, such as software product forums. They also suggest that training embedding models on in-domain title-body or question-answer pairs can create an effective zero-shot topic classifier for situations where no topic data is available.
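The abstract does not spell out how R@1 is computed. As a hedged illustration, the sketch below uses one common definition of recall at k: the fraction of a post's true tags that appear among the top k predicted tags, averaged over posts. The function names and the toy predictions are assumptions for illustration, and the paper's exact definition may differ.

```python
def recall_at_k(ranked_tags, true_tags, k):
    """Fraction of a post's true tags found among its top-k predicted tags."""
    if not true_tags:
        return 0.0
    hits = len(set(ranked_tags[:k]) & set(true_tags))
    return hits / len(true_tags)

def mean_recall_at_k(predictions, gold, k):
    """Average recall at k over all posts."""
    scores = [recall_at_k(p, g, k) for p, g in zip(predictions, gold)]
    return sum(scores) / len(scores) if scores else 0.0

# Toy data (assumption: invented rankings and gold tags for two posts).
PREDICTIONS = [
    ["pandas", "python", "css"],
    ["css", "html", "python"],
]
GOLD = [
    {"python", "pandas"},
    {"html"},
]

r1 = mean_recall_at_k(PREDICTIONS, GOLD, 1)
r2 = mean_recall_at_k(PREDICTIONS, GOLD, 2)
print(r1, r2)
```

Under this definition a post with several true tags cannot reach a recall of 1.0 at k = 1, which is one reason R@1 values on multi-label data tend to look low.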
Sun 8 May (displayed time zone: Eastern Time, US & Canada)
08:30 - 09:40 | Paper Session 1 (NLBSE), NLBSE room | Chair(s): Andrea Di Sorbo (University of Sannio), Sebastiano Panichella (Zurich University of Applied Sciences)
08:30 | 20m | Talk | Unsupervised Extreme Multi Label Classification of Stack Overflow Posts | NLBSE
08:50 | 20m | Talk | Understanding Digits in Identifier Names: An Exploratory Study | NLBSE | Anthony Peruma (Rochester Institute of Technology), Christian D. Newman (Rochester Institute of Technology)
09:10 | 15m | Talk | From Zero to Hero: Generating Training Data for Question-To-Cypher Models | NLBSE
09:25 | 15m | Talk | Automatic Identification of Informative Code in Stack Overflow Posts | NLBSE | Preetha Chatterjee (Drexel University, USA)