Smarter Project Selection for Software Engineering Research
Open Source Software (OSS) hosting platforms like GitHub host many non-software projects, which should be excluded from the dataset in most software engineering research studies. However, because no obvious indicators distinguish them, researchers must either spend considerable manual effort finding suitable projects or rely on convenience sampling or heuristics. Moreover, the diversity of OSS projects makes it harder still to select projects aligned with study objectives, especially when selection depends on semantic information such as intended use, which is hard to discern from the project characteristics exposed by search APIs such as GitHub's.
Our goals are to establish a robust method of identifying software projects from the population of repositories hosted in social coding platforms and to categorize the software projects based on who the target users are and how those projects are meant to be used.
Using data from 35,621 projects in the World of Code dataset, we employed a combination of machine learning techniques, including Doc2Vec and Random Forest, to identify the software projects and to categorize them as standalone applications, libraries, or plug-ins.
Furthermore, our findings highlight the risk of selecting projects solely by filtering on commonly used criteria such as the number of contributors, commits, or stars: even after applying such filters, 16.6% of the projects were found to be non-software projects.
Our research should aid software engineering researchers in project selection, benefiting both industry and academia. We also envision our work inspiring further research in this domain.
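The filtering risk described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Project` record, the threshold values, the project names, and the hand-assigned `is_software` labels are hypothetical and are not taken from the study or from the World of Code dataset. The sketch only shows how one might measure the share of non-software projects that survive a typical stars/commits/contributors filter; it does not reproduce the 16.6% figure.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    stars: int
    commits: int
    contributors: int
    is_software: bool  # hypothetical manual label, not from the study


def passes_filter(p, min_stars=10, min_commits=50, min_contributors=2):
    """Typical popularity/activity filter; thresholds are illustrative assumptions."""
    return (
        p.stars >= min_stars
        and p.commits >= min_commits
        and p.contributors >= min_contributors
    )


def non_software_share(projects):
    """Fraction of the given projects labeled as non-software (0.0 if empty)."""
    if not projects:
        return 0.0
    non_software = sum(1 for p in projects if not p.is_software)
    return non_software / len(projects)


# Invented sample records, for illustration only.
sample = [
    Project("web-framework", 1200, 3400, 85, True),
    Project("awesome-reading-list", 5400, 610, 140, False),
    Project("cli-tool", 45, 230, 4, True),
    Project("course-notes", 30, 75, 3, False),
    Project("toy-script", 1, 5, 1, True),
    Project("image-plugin", 64, 120, 6, True),
]

selected = [p for p in sample if passes_filter(p)]
share = non_software_share(selected)
print(f"{len(selected)} projects pass the filter; {share:.1%} are non-software")
```

In this toy sample the filter removes only the small script, while the reading list and the course notes pass because they are popular and active. That is the kind of contamination a metadata-only filter cannot detect.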
Tue 16 Jul (displayed time zone: Brasilia, Distrito Federal, Brazil)

Session: 11:00 - 12:30

| Start | Duration | Type | Title | Venue | Authors |
|---|---|---|---|---|---|
| 11:00 | 60m | Talk | The Ever-Evolving Promises of Data in Software Ecosystems: Models, AI, and Analytics (Keynote) | PROMISE 2024 | Raula Gaikovina Kula (Nara Institute of Science and Technology) |
| 12:00 | 15m | Talk | Smarter Project Selection for Software Engineering Research | PROMISE 2024 | Tapajit Dey (Carnegie Mellon University Software Engineering Institute), Jonathan Loungani (Carnegie Mellon University), James Ivers (Carnegie Mellon University) |
| 12:15 | 15m | Talk | Evaluating the Quality of Open Source Ansible Playbooks: An Executability Perspective | PROMISE 2024 | Pemsith Mendis (Auburn University), Wilson Reaves (Auburn University), Muhammad Ali Babar (School of Computer Science, The University of Adelaide), Yue Zhang (Auburn University), Akond Rahman (Auburn University) |

