Data Quality Matters: A Case Study of Obsolete Comment Detection
Machine learning methods have achieved great success in many software engineering tasks. However, although these methods are data-driven, how data quality affects their effectiveness remains largely unexplored. In this paper, we explore this problem in the context of just-in-time obsolete comment detection. Specifically, we first clean the existing benchmark dataset and empirically observe that, with only 0.22% of labels corrected and even 15.0% less data, existing obsolete comment detection approaches achieve up to a 10.7% accuracy improvement. To further mitigate data quality issues, we propose an adversarial learning framework that simultaneously estimates data quality and makes the final predictions. Experimental evaluations show that this framework further improves accuracy by up to 18.1% compared to the state-of-the-art method. Although our current results come from the obsolete comment detection problem, we believe that the proposed two-phase solution, which handles data quality issues from both the data side and the algorithm side, also generalizes to other machine learning based software engineering tasks.
Wed 17 May (displayed time zone: Hobart)
15:45 - 17:15 | Documentation | Technical Track / Journal-First Papers at Level G - Plenary Room 1 | Chair(s): Denys Poshyvanyk (College of William and Mary)
15:45 | 15m Talk | Developer-Intent Driven Code Comment Generation | Technical Track | Fangwen Mu (Institute of Software Chinese Academy of Sciences), Xiao Chen (Institute of Software Chinese Academy of Sciences), Lin Shi (ISCAS), Song Wang (York University), Qing Wang (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences) | Pre-print
16:00 | 15m Talk | Data Quality Matters: A Case Study of Obsolete Comment Detection | Technical Track | Shengbin Xu (Nanjing University), Yuan Yao (Nanjing University), Feng Xu (Nanjing University), Tianxiao Gu (TikTok Inc.), Jingwei Xu, Xiaoxing Ma (Nanjing University) | Pre-print
16:15 | 15m Talk | Revisiting Learning-based Commit Message Generation | Technical Track | Jinhao Dong (Peking University), Yiling Lou (Fudan University), Dan Hao (Peking University), Lin Tan (Purdue University) | Pre-print
16:30 | 15m Talk | Commit Message Matters: Investigating Impact and Evolution of Commit Message Quality | Technical Track
16:45 | 7m Talk | On the Significance of Category Prediction for Code-Comment Synchronization | Journal-First Papers | Zhen Yang (City University of Hong Kong, China), Jacky Keung (City University of Hong Kong), Xiao Yu (Wuhan University of Technology), Yan Xiao (National University of Singapore), Zhi Jin (Peking University), Jingyu Zhang (City University of Hong Kong)
16:52 | 7m Talk | Correlating Automated and Human Evaluation of Code Documentation Generation Quality | Journal-First Papers | Xing Hu (Zhejiang University), Qiuyuan Chen (Zhejiang University), Haoye Wang (Hangzhou City University), Xin Xia (Huawei), David Lo (Singapore Management University), Thomas Zimmermann (Microsoft Research)
17:00 | 7m Talk | Predictive Comment Updating with Heuristics and AST-Path-Based Neural Learning: A Two-Phase Approach | Journal-First Papers | Bo Lin (National University of Defense Technology), Shangwen Wang (National University of Defense Technology), Zhongxin Liu (Zhejiang University), Xin Xia (Huawei), Xiaoguang Mao (National University of Defense Technology) | Link to publication, DOI, Pre-print