Tue 24 Oct 2023 11:00 - 12:30 at Oak Alley - Session 6

Title: Towards escape from convenience sampling


Chair: Audris Mockus


Abstract:

With few exceptions, existing empirical software engineering research employs some form of convenience sampling, e.g.:

a) Developer units:

  • students in a course (threats: behavior different from professional developers)

  • survey respondents on a mailing list (self-selection bias, potential lack of the competence necessary to provide a meaningful response)

  • developers in a (set of) project(s) (only activity within the project’s bounds is considered)

b) Industrial units (companies or products/projects):

  • Case studies of industry projects (bias towards large, unique, or esoteric projects)

c) OSS units (strong bias towards large, active projects):

  • Most publications in IS analyze a list of SourceForge projects from 2000

  • MSR work typically starts from a list of highly starred GitHub projects

  • Apache Software Foundation projects are often used as a very large sample

The key concern is that in all of the listed cases the analyzed sample may not represent the larger population. More importantly, I am not aware of work that attempts to define what the larger population or populations of projects might or should be.

Convenience samples have another insidious bias: in OSS we can sample projects, but not individuals or source code. Since individuals can contribute anywhere and code is frequently copied, this forces an artificial focus on projects as the sampling units and prevents considering developers, code, or APIs as sampling units.

Specifically, the size bias in convenience samples yields knowledge about very large or popular projects that may be unique to those projects or inappropriate for projects of a different scale.

Apart from the obvious bias introduced by the sampling units and some research on scale/popularity, little is known about how the knowledge gathered in empirical work could and should be generalized into practical recommendations.


Session Goals:

The purpose of the session is to brainstorm various definitions of the “population” and ways to capture random samples of the population that counteract the bias of specific sampling units or variations in size, potentially via unit classification techniques, stratified sampling methods, and by utilizing (or curating) large data collections (see the sketch below).
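As one concrete illustration of the stratified sampling idea above, the minimal sketch below groups projects into log-scale size strata and draws an equal number of projects from each stratum, counteracting the usual bias towards very large projects. It assumes project metadata (e.g., exported from a large collection such as World of Code) is available as a CSV; the file name, the column names (num_commits, name), the log10 stratum boundaries, and the per-stratum sample size are hypothetical placeholders, not part of this proposal.

```python
# Sketch: stratified sampling of projects by a size proxy (commit count).
# The CSV file, column names, and stratum boundaries are hypothetical.
import csv
import math
import random

def stratified_sample(rows, size_key="num_commits", per_stratum=50, seed=42):
    """Group projects into log10 size strata (1-9, 10-99, ... commits)
    and draw an equal-sized random sample from each stratum."""
    strata = {}
    for row in rows:
        size = int(row[size_key])
        stratum = int(math.log10(size)) if size > 0 else 0
        strata.setdefault(stratum, []).append(row)

    rng = random.Random(seed)
    sample = []
    for stratum, projects in sorted(strata.items()):
        k = min(per_stratum, len(projects))  # stratum may be smaller than k
        sample.extend(rng.sample(projects, k))
    return sample

if __name__ == "__main__":
    with open("projects.csv", newline="") as f:  # hypothetical metadata export
        rows = list(csv.DictReader(f))
    for project in stratified_sample(rows):
        print(project["name"], project["num_commits"])
```

Equal allocation per stratum is only one option; proportional or variance-based allocation, or strata defined by classification of the units, could equally be discussed in the session.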

Finally, sampling is also critical for supervised and unsupervised training of machine learning methods, which may bias the actual development of software via poor recommendations.


Development of the Session: (How will the session be conducted? How much interaction?)

The session will have four major parts: population definitions, sampling strategies, the role of large collections, and applications to AI. Depending on the size of the group, these will be discussed sequentially or in parallel subgroups.


What means for interaction will be used or required?

A Google document and conversation will be used to brainstorm and complete individual assignments. A brief task will be posed at the beginning of the session, e.g., list all literature related to a specific topic; a general discussion with the aim of merging and justifying the opinions will then ensue.


Background and recommended reading:

The area is not particularly well researched, hence this proposal, but some examples from related areas are below.

Defining a population

Wang, Zijian, et al. “Demographic inference and representative population estimates from multilingual social media data.” The World Wide Web Conference. 2019.

Stratified sampling

Baltes, Sebastian, and Paul Ralph. “Sampling in software engineering research: A critical review and guidelines.” Empirical Software Engineering 27.4 (2022): 94.

Molléri, Jefferson Seide, Kai Petersen, and Emilia Mendes. “An empirically evaluated checklist for surveys in software engineering.” Information and Software Technology 119 (2020): 106240.

Curation, large data collections

Munaiah, Nuthan, et al. “Curating GitHub for engineered software projects.” Empirical Software Engineering 22 (2017): 3219-3253.

Ma, Yuxing, et al. “World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data.” Empirical Software Engineering 26 (2021): 1-42.

AI bias

Mehrabi, Ninareh, et al. “A survey on bias and fairness in machine learning.” ACM Computing Surveys (CSUR) 54.6 (2021): 1-35.

Liu, Chao, et al. “On the reproducibility and replicability of deep learning in software engineering.” ACM Transactions on Software Engineering and Methodology (TOSEM) 31.1 (2021): 1-46.


Expected Outcomes and Plan for Continuing the Work beyond ISERN:

  • A bibliography related to the four topics discussed in the session

  • A brief document expanding upon the introduction to the problem described here

  • A follow-up session at ISERN or a similar venue or, if the findings are sufficiently rich, a publication.
