Towards Filtering Out Deficient Pull Requests Collected through the GitHub API
As pull-based software development has become popular, many empirical studies collect pull requests (PRs). Although researchers can use publicly available datasets, on-demand collection of PR data remains indispensable for filling in missing information or obtaining the latest data. Unfortunately, PR data collected through the GitHub API is sometimes deficient: parts of the data are lost. Such data loss hampers researchers' analyses. To reveal which PR-related data tends to be lost during collection through the GitHub API, we conducted a study of 12,118 pull requests in six repositories of OSS projects on GitHub. In the study, we clarified which PR data needs to be obtained through the GitHub API by defining its entities as features and attributes. We also collected data losses and classified them by checking which attributes were lost, based on exception reports triggered during PR collection. The collected data losses fell into seven categories. Our results show that more than half of the PRs (about 53%) involve some data loss, which may surprise many researchers. The paper also discusses possible causes of the data losses, which can help researchers filter out deficient PRs during collection.
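The filtering idea in the abstract can be illustrated with a minimal sketch. It assumes each collected PR has already been parsed into a Python dictionary, and it treats a PR as deficient when any required attribute is absent or null. The attribute list and the function names `missing_attributes` and `split_deficient` are hypothetical and chosen only for illustration; they are not the features, attributes, or seven loss categories defined in the paper.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical list of attributes a study might require for every PR.
# This is an assumption for illustration, not the paper's attribute set.
REQUIRED_ATTRIBUTES = ("number", "title", "user", "created_at", "head", "base")


def missing_attributes(
    pr: Dict[str, Any], required: Iterable[str] = REQUIRED_ATTRIBUTES
) -> List[str]:
    """Return the required attributes that are absent or null in one PR record."""
    return [name for name in required if pr.get(name) is None]


def split_deficient(
    prs: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Partition PR records into (complete, deficient) lists."""
    complete: List[Dict[str, Any]] = []
    deficient: List[Dict[str, Any]] = []
    for pr in prs:
        if missing_attributes(pr):
            deficient.append(pr)
        else:
            complete.append(pr)
    return complete, deficient
```

Calling `split_deficient` on the collected records yields the PRs that are safe to analyze alongside those to exclude or re-collect; the share of deficient PRs is then `len(deficient) / (len(complete) + len(deficient))`.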
Wed 4 Dec. Displayed time zone: Beijing, Chongqing, Hong Kong, Urumqi
14:00 - 15:30 | Session (4): Technical Track / ERA - Early Research Achievements, Room 4 (Xianglin Ballroom) | Chair(s): Lina Gong (Nanjing University of Aeronautics and Astronautics)
14:00 | 30m Talk | An Empirical Study of Cross-Project Pull Request Recommendation in GitHub | Technical Track | Wenyu Xu (National University of Defense Technology), Yao Lu (National University of Defense Technology), Xunhui Zhang (National University of Defense Technology, China), Tanghaoran Zhang (National University of Defense Technology), Xinjun Mao (National University of Defense Technology), Bo Lin (National University of Defense Technology)
14:30 | 30m Talk | FRELinker: A Novel Issue-Commit Link Recovery Model Based on Feature Refinement and Expansion with Multi-Classifier Fusion | Technical Track | Bangchao Wang (Wuhan Textile University), Xinyu He (School of Computer Science and Artificial Intelligence, Wuhan Textile University), Hongyan Wan (Wuhan Textile University), Xiaoxiao Li (School of Computer Science and Artificial Intelligence, Wuhan Textile University), Jiaxu Zhu (School of Computer Science and Artificial Intelligence, Wuhan Textile University), Yukun Cao (School of Computer Science and Artificial Intelligence, Wuhan Textile University)
15:00 | 20m Talk | Towards Filtering Out Deficient Pull Requests Collected through the GitHub API | ERA - Early Research Achievements | Bowen Tang (Ritsumeikan University), Xiqin Lu (Ritsumeikan University), Katsuhisa Maruyama (Ritsumeikan University)