On the Relative Value of Feature Selection Techniques for Code Smell Detection
Machine/deep learning-based code smell detection aims to build a classification model from code smell features that predicts the presence of code smells in new code instances. To ensure accurate detection, it is crucial to eliminate irrelevant or redundant features that may degrade performance. Previous studies have reported inconsistent findings on the impact of feature selection techniques on code smell detection, possibly because they examined only a limited number of techniques. To address this gap, our study provides a comprehensive analysis of feature selection techniques for code smell detection. We investigate 34 feature selection techniques combined with 7 classification models to build code smell detection models on 6 code smell datasets. To assess their effects, we use 3 evaluation metrics, i.e., Precision, Recall, and F-measure, and compare performance differences using the Scott-Knott effect size difference test and McNemar’s test. The results show that (1) not all feature selection techniques significantly improve detection performance; the best-performing techniques are chi-square, probabilistic significance, information gain, and symmetrical uncertainty. (2) In general, probabilistic significance should be used as the “generic” feature selection technique, because detection models built with it identify more of the same smelly instances than models built with the other techniques. (3) The high-frequency features selected by the four best-performing techniques, which are important for identifying the corresponding code smells, differ across datasets.
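To make the kind of pipeline evaluated in this study concrete, the sketch below pairs two of the named filter-based techniques (chi-square, and information gain approximated by scikit-learn’s mutual information estimator) with a classifier and reports Precision, Recall, and F-measure. The synthetic dataset, the random forest classifier, and the number of selected features are illustrative assumptions, not the paper’s experimental setup.

```python
# Minimal sketch of a filter-based feature selection pipeline for smell detection.
# Assumptions (not from the study): synthetic data, random forest, k = 10 features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a code smell dataset
# (rows = code elements, columns = code metrics, label = smelly / not smelly).
X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi-square requires non-negative features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, score_fn in [("chi-square", chi2),
                       ("information gain (mutual info)", mutual_info_classif)]:
    # Keep the 10 top-ranked features according to the filter technique.
    selector = SelectKBest(score_func=score_fn, k=10).fit(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(selector.transform(X_tr), y_tr)
    pred = clf.predict(selector.transform(X_te))
    print(f"{name}: P={precision_score(y_te, pred):.2f} "
          f"R={recall_score(y_te, pred):.2f} F={f1_score(y_te, pred):.2f}")
```

In the study itself, such per-technique results would additionally be compared across datasets and classifiers with the Scott-Knott effect size difference test and McNemar’s test, which the sketch omits.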