ESEIW 2024
Sun 20 - Fri 25 October 2024 Barcelona, Spain

Empirical Software Engineering is the discipline that studies software development tools, technology, practices, and processes by relying on observational data and experimentation. Software repositories have traditionally been one of the main sources of data in empirical software engineering. With the emergence of social coding platforms such as GitHub, researchers now have access to millions of software repositories to use as source data for their studies. With this massive amount of data, sampling techniques are needed to create more manageable datasets before conducting empirical studies. The creation of these datasets is therefore a crucial step, and researchers have to carefully select the repositories to create samples that are representative with respect to a set of variables of interest. However, current sampling methods are often based on random selection or rely on variables that may not be related to the research study (e.g., popularity or activity). In this paper, we present a methodology for creating representative samples of software repositories in empirical software engineering, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study. We illustrate our approach with a case study based on Hugging Face repositories.
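To make the idea of sampling "according to a set of variables of interest" concrete, the following is a minimal sketch of proportional stratified sampling in Python. It is not the methodology presented in the paper, which the abstract does not detail; it only illustrates how a sample can be drawn so that it mirrors the population along one chosen variable. The function `stratified_sample`, the `kind` field, and the synthetic population are all hypothetical and introduced purely for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(repos, key, sample_size, seed=0):
    """Draw a sample whose strata sizes are proportional to the population.

    repos: list of repository records (any objects).
    key: function mapping a record to its stratum label (the variable of interest).
    sample_size: target size; rounding may make the result slightly larger or smaller.
    """
    rng = random.Random(seed)

    # Group the population by the variable of interest.
    strata = defaultdict(list)
    for repo in repos:
        strata[key(repo)].append(repo)

    total = len(repos)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Proportional allocation, keeping at least one record per stratum
        # and never asking for more records than the stratum contains.
        quota = max(1, round(sample_size * len(members) / total))
        quota = min(quota, len(members))
        sample.extend(rng.sample(members, quota))
    return sample


# Synthetic population (assumed for illustration): 60 models, 30 datasets, 10 spaces.
population = (
    [{"id": f"model-{i}", "kind": "model"} for i in range(60)]
    + [{"id": f"dataset-{i}", "kind": "dataset"} for i in range(30)]
    + [{"id": f"space-{i}", "kind": "space"} for i in range(10)]
)

sample = stratified_sample(population, key=lambda r: r["kind"], sample_size=10)
kind_counts = Counter(r["kind"] for r in sample)
```

With this population, the sample keeps the 6:3:1 ratio of the strata, whereas a purely random draw of 10 repositories could easily miss the smallest stratum altogether. A real study would choose the stratification variables from the research question rather than from convenience attributes such as popularity.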