On the Creation of Representative Samples of Software Repositories (ESEIW 2024 - ESEM Emerging Results, Vision and Reflection Papers Track)

Who

June Gorostidi, Adem Ait, Jordi Cabot, Javier Luis Cánovas Izquierdo

Track

ESEIW 2024 ESEM Emerging Results, Vision and Reflection Papers Track

When

Thu 24 Oct 2024 15:00 - 15:15 (GMT+02:00) at Telensenyament (B3 Building - 1st Floor) - Empirical research methods and applications. Chair(s): Valentina Lenarduzzi

Abstract

Software repositories are one of the sources of data in Empirical Software Engineering, primarily in the Mining Software Repositories field, which aims to extract knowledge from the dynamics and practice of software projects. With the emergence of social coding platforms such as GitHub, researchers now have access to millions of software repositories to use as source data for their studies. With this massive amount of data, sampling techniques are needed to create more manageable datasets. The creation of these datasets is a crucial step, and researchers have to carefully select the repositories to create representative samples according to a set of variables of interest. However, current sampling methods are often based on random selection or rely on variables that may not be related to the research study (e.g., popularity or activity). In this paper, we present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study. We illustrate our approach with use cases based on Hugging Face repositories.
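
To make the idea of sampling "according to a set of variables of interest" concrete, the following is a minimal sketch of proportional stratified sampling, a standard textbook technique. It is not taken from the paper and should not be read as the authors' methodology. The function name `stratified_sample`, the `kind` field, and the toy population are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(repos, key, sample_size, seed=0):
    """Draw a proportionally allocated stratified sample (illustrative only).

    repos:       list of repository records (hypothetical dicts here)
    key:         function mapping a record to its stratum label
    sample_size: target sample size; rounding may shift the total slightly
    """
    rng = random.Random(seed)

    # Group the population by the variable of interest.
    strata = defaultdict(list)
    for repo in repos:
        strata[key(repo)].append(repo)

    total = len(repos)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Allocate in proportion to the stratum's share of the population,
        # keeping at least one item per stratum and never more than it holds.
        quota = max(1, round(sample_size * len(members) / total))
        quota = min(quota, len(members))
        sample.extend(rng.sample(members, quota))
    return sample


# Toy population (assumed data): 80 "model" repositories, 20 "dataset" ones.
population = [
    {"id": i, "kind": "model" if i < 80 else "dataset"} for i in range(100)
]
picked = stratified_sample(population, key=lambda r: r["kind"], sample_size=10)
```

With this toy population, a target of 10 yields 8 "model" and 2 "dataset" repositories, mirroring the 80/20 split. Which variable to stratify on is exactly the kind of decision the abstract argues should follow from the study's requirements rather than from convenience.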

Link to Preprint

https://arxiv.org/abs/2410.00639

June Gorostidi

IN3 - UOC

Spain

Adem Ait

University of Luxembourg

Luxembourg

Jordi Cabot

Luxembourg Institute of Science and Technology