Data Quality Matters: A Case Study of Obsolete Comment Detection
Machine learning methods have achieved great success in many software engineering tasks. However, as machine learning is a data-driven paradigm, how data quality impacts the effectiveness of these methods remains largely unexplored. In this paper, we explore this problem in the context of just-in-time obsolete comment detection. Specifically, we first clean the existing benchmark dataset and empirically observe that, with only 0.22% of labels corrected and even 15.0% less data, existing obsolete comment detection approaches achieve up to 10.7% higher accuracy. To further mitigate data quality issues, we propose an adversarial learning framework that simultaneously estimates data quality and makes the final predictions. Experimental evaluations show that this adversarial learning framework further improves accuracy by up to 18.1% over the state-of-the-art method. Although our current results come from the obsolete comment detection problem, we believe the proposed two-phase solution, which addresses data quality issues from both the data and the algorithm perspectives, is also generalizable and applicable to other machine learning based software engineering tasks.