Tue 24 Oct 2023 11:00 - 12:30 at Oak Alley - Session 6

Title: Towards escape from convenience sampling


Chair: Audris Mockus


Abstract:

With few exceptions, existing empirical software engineering research employs some form of convenience sampling, e.g.:

a) Developer units:

  • students in a course (threats: behavior different from professional developers)

  • survey respondents on a mailing list (self-selection bias, potential lack of the competence necessary to provide a meaningful response)

  • developers in a (set of) project(s) (only activity within the project’s bounds is considered)

b) Industrial units (companies or products/projects):

  • Case studies of industry projects (bias towards large, unique, or esoteric projects)

c) OSS units (strong bias towards large, active projects):

  • Most publications in IS analyze a list of SourceForge projects from 2000

  • MSR work typically starts from a list of highly starred GitHub projects

  • Apache Software Foundation projects are often used as a very large sample

The key concern is that in all of the listed cases the analyzed sample may not represent the larger population. More importantly, I am not aware of work that attempts to define what the larger population or populations of projects might or should be.

Convenience samples have another insidious bias: in OSS we can sample projects, but not individuals or source code. Since individuals can contribute anywhere and code is frequently copied, this forces an artificial focus on projects as the sampling units and prevents considering developers, code, or APIs as sampling units.

Specifically, the size bias in convenience samples yields knowledge about very large or popular projects that may be unique to those projects or inappropriate for projects of a different scale.

Apart from the obvious bias introduced by the sampling units and some research on scale/popularity, little is known about how the knowledge gathered in empirical work could and should be generalized into practical recommendations.


Session Goals:

The purpose of the session is to brainstorm various definitions of the “population” and ways to capture random samples of the population that counteract the bias of specific sampling units or variations in size, potentially via unit classification techniques, stratified sampling methods, and by utilizing (or curating) large data collections (see the sketch below).
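As one concrete illustration of the stratified sampling idea above, the minimal sketch below groups projects into log-scale size strata and draws an equal number of projects from each stratum, counteracting the usual bias towards very large projects. It assumes project metadata (e.g., exported from a large collection such as World of Code) is available as a CSV; the file name, the column names (num_commits, name), the log10 stratum boundaries, and the per-stratum sample size are hypothetical placeholders, not part of this proposal.

```python
# Sketch: stratified sampling of projects by a size proxy (commit count).
# The CSV file, column names, and stratum boundaries are hypothetical.
import csv
import math
import random

def stratified_sample(rows, size_key="num_commits", per_stratum=50, seed=42):
    """Group projects into log10 size strata (1-9, 10-99, ... commits)
    and draw an equal-sized random sample from each stratum."""
    strata = {}
    for row in rows:
        size = int(row[size_key])
        stratum = int(math.log10(size)) if size > 0 else 0
        strata.setdefault(stratum, []).append(row)

    rng = random.Random(seed)
    sample = []
    for stratum, projects in sorted(strata.items()):
        k = min(per_stratum, len(projects))  # stratum may be smaller than k
        sample.extend(rng.sample(projects, k))
    return sample

if __name__ == "__main__":
    with open("projects.csv", newline="") as f:  # hypothetical metadata export
        rows = list(csv.DictReader(f))
    for project in stratified_sample(rows):
        print(project["name"], project["num_commits"])
```

Equal allocation per stratum is only one option; proportional or variance-based allocation, or strata defined by classification of the units, could equally be discussed in the session.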

Finally, sampling is also critical for supervised and unsupervised training of machine learning methods, which may bias the actual development of software via poor recommendations.


Development of the Session: (How will the session be conducted? How much interaction?)

The session will have four major parts: population definitions, sampling strategies, the role of large collections, and applications to AI. Depending on the size of the group, these will be discussed sequentially or in parallel subgroups.


What means for interaction will be used or required?

A Google document and conversation will be used to brainstorm and complete individual assignments. A brief task will be posed at the beginning of the session, e.g., list all literature related to a specific topic; a general discussion with the aim of merging and justifying the opinions will then ensue.


Background and recommended reading:

The area is not particularly well researched, hence this proposal, but some examples from related areas are below.

Defining a population

Wang, Zijian, et al. “Demographic inference and representative population estimates from multilingual social media data.” The World Wide Web Conference. 2019.

Stratified sampling

Baltes, Sebastian, and Paul Ralph. “Sampling in software engineering research: A critical review and guidelines.” Empirical Software Engineering 27.4 (2022): 94.

Molléri, Jefferson Seide, Kai Petersen, and Emilia Mendes. “An empirically evaluated checklist for surveys in software engineering.” Information and Software Technology 119 (2020): 106240.

Curation, large data collections

Munaiah, Nuthan, et al. “Curating GitHub for engineered software projects.” Empirical Software Engineering 22 (2017): 3219-3253.

Ma, Yuxing, et al. “World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data.” Empirical Software Engineering 26 (2021): 1-42.

AI bias

Mehrabi, Ninareh, et al. “A survey on bias and fairness in machine learning.” ACM Computing Surveys (CSUR) 54.6 (2021): 1-35.

Liu, Chao, et al. “On the reproducibility and replicability of deep learning in software engineering.” ACM Transactions on Software Engineering and Methodology (TOSEM) 31.1 (2021): 1-46.


Expected Outcomes and Plan for Continuing the Work beyond ISERN:

  • A bibliography related to the four topics discussed in the session

  • A brief document expanding upon the introduction to the problem described here

  • A follow-up session at ISERN or a similar venue or, if the findings are sufficiently rich, a publication.
