ASE 2021
Sun 14 - Sat 20 November 2021 Australia
Wed 17 Nov 2021 09:20 - 09:40 at Koala - Analysis I

Standard software analytics often requires large amounts of labelled data in order to commission models with acceptable performance. However, prior work has shown that such requirements can be expensive, taking several weeks to label thousands of commits, and that labelled data is not always available when exploring new research problems and domains. Unsupervised learning is a promising direction for learning hidden patterns within unlabelled data, but it has only been extensively studied in defect prediction. Moreover, unsupervised learning can be ineffective by itself and has not been explored in other domains (e.g., static analysis and issue close time).

Motivated by this literature gap and these technical limitations, we explore the performance variations seen in several simple optimization schemes. We present FRUGAL, a tuned semi-supervised method that builds on a simple optimization scheme and does not require sophisticated (e.g., deep learners) or expensive (e.g., 100% manually labelled data) methods. Our method optimizes the unsupervised learner’s configurations via grid search, validating the selected settings on only 10% of the labelled training data before predicting. FRUGAL outperforms the state-of-the-art actionable static code warning recognizer and issue close time predictor with less information, reducing the cost of labelling by 90%.
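
The tuning loop described above can be illustrated with a minimal sketch. It is not FRUGAL's implementation: the abstract does not specify the unsupervised learner, the search grid, or the evaluation metric, so the threshold-on-feature-sum learner, the percentile grid, the F1 score, and all function names below are assumptions chosen purely for illustration. Only the overall shape (grid search over an unsupervised learner's settings, validated on 10% of the labelled training data) comes from the text.

```python
import random


def fit_threshold(rows, percentile):
    """Unsupervised step (assumed learner): pick a cut-off on the feature sum
    at the given percentile of the unlabelled training rows."""
    scores = sorted(sum(r) for r in rows)
    idx = min(len(scores) - 1, int(len(scores) * percentile / 100))
    return scores[idx]


def predict(rows, threshold):
    """Label a row 1 when its feature sum reaches the threshold, else 0."""
    return [1 if sum(r) >= threshold else 0 for r in rows]


def f1(truth, pred):
    """F1 score for the positive class (assumed metric); 0.0 with no true positives."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_on_small_labelled_sample(train_rows, train_labels,
                                  grid=range(5, 100, 5),
                                  label_fraction=0.1, seed=0):
    """Grid-search the learner's setting, scoring each candidate only on a
    small labelled sample (10% by default) of the training data."""
    rng = random.Random(seed)
    n = len(train_rows)
    k = max(1, int(n * label_fraction))
    labelled_idx = rng.sample(range(n), k)  # the only rows whose labels are read
    val_rows = [train_rows[i] for i in labelled_idx]
    val_labels = [train_labels[i] for i in labelled_idx]

    best = None
    for pct in grid:
        threshold = fit_threshold(train_rows, pct)  # uses no labels
        score = f1(val_labels, predict(val_rows, threshold))
        if best is None or score > best[0]:
            best = (score, pct, threshold)
    return best  # (validation F1, chosen percentile, chosen threshold)


if __name__ == "__main__":
    # Synthetic data, for illustration only: larger feature sums are positive.
    rows = [[i, i] for i in range(100)]
    labels = [1 if i >= 70 else 0 for i in range(100)]
    score, pct, threshold = tune_on_small_labelled_sample(rows, labels)
    print(score, pct, threshold)
```

In this sketch, labels are read only for the sampled 10% of rows. The remaining 90% contribute only their unlabelled feature values, which is where the saving in labelling cost would come from.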

Our conclusions are two-fold. First, FRUGAL can save considerable effort in data labelling, especially when validating prior work or researching new problems. Second, proponents of complex and expensive methods should always baseline such methods against simpler and cheaper alternatives. For instance, a semi-supervised learner like FRUGAL can serve as a baseline for state-of-the-art software analytics tools.

