Is Historical Data an Appropriate Benchmark for Reviewer Recommendation Systems? A Case Study of the Gerrit Community (ASE 2021 - Research Papers)

Who

Ian X. Gauthier, Maxime Lamothe, Gunter Mussbacher, Shane McIntosh

Track

ASE 2021 Research Papers

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 16 Nov 2021 11:00 - 11:20 at Koala - Empirical Studies Chair(s): Felipe Fronchetti

Abstract

The discipline of Mining Software Repositories (MSR) transforms the passive archives of data that accrue during software development into active, value-generating solutions, such as recommendation systems. It is customary to evaluate these solutions using held-out historical data. While history-based evaluation makes pragmatic use of available data, historical records may be: (1) overly optimistic, since past recommendations may have been suboptimal choices for the task at hand; or (2) overly pessimistic, since "incorrect" recommendations may have been equal (or better) choices.

In this paper, we empirically evaluate the extent to which historical data is an appropriate benchmark for MSR solutions. As a concrete instance for experimentation, we use reviewer recommendation, which suggests community members to review change requests. We replicate the cHRev and WLRRec approaches and apply them to 9,679 reviews from the Gerrit open source community. We then assess the recommendations with members of the Gerrit reviewing community using quantitative (personalized questionnaires about their comfort level with tasks) and qualitative methods (semi-structured interviews).

We find that history-based evaluation is far more pessimistic than optimistic in the Gerrit context. Indeed, while 86% of those who had been assigned to a review in the past felt that they were well suited to handle the review, 74% of those labelled as incorrect recommendations also felt that they would have been comfortable reviewing the changes. This indicates that, on the one hand, when solutions recommend the past assignee, they should indeed be considered correct. Yet, on the other hand, recommendations labelled as incorrect because they do not match the past assignee may have been correct as well.
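As a rough illustration of the history-based evaluation discussed above, the following Python sketch labels each recommended reviewer as "correct" only if that person appears among the reviewers recorded for the change, and computes a top-k accuracy from those labels. It is a minimal sketch under stated assumptions, not the authors' evaluation code and not an implementation of cHRev or WLRRec; the function names, the dictionary-based data layout, and the sample reviewer names are all hypothetical.

```python
from typing import Dict, List, Set


def label_recommendations(ranked: List[str], past: Set[str]) -> Dict[str, str]:
    """Label each candidate by comparing against the recorded (historical) reviewers.

    A candidate who never reviewed the change is labelled "incorrect" here,
    even though they might have been a perfectly suitable reviewer.
    """
    return {c: ("correct" if c in past else "incorrect") for c in ranked}


def top_k_accuracy(
    recommendations: Dict[str, List[str]],
    historical_reviewers: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of reviews whose top-k list contains at least one past reviewer."""
    if not recommendations:
        return 0.0
    hits = 0
    for review_id, ranked in recommendations.items():
        past = historical_reviewers.get(review_id, set())
        if any(candidate in past for candidate in ranked[:k]):
            hits += 1
    return hits / len(recommendations)


# Hypothetical data: two reviews, each with a ranked recommendation list.
recommendations = {"r1": ["alice", "bob"], "r2": ["carol", "dave"]}
historical_reviewers = {"r1": {"bob"}, "r2": {"erin"}}

labels_r1 = label_recommendations(recommendations["r1"], historical_reviewers["r1"])
acc_at_1 = top_k_accuracy(recommendations, historical_reviewers, k=1)
acc_at_2 = top_k_accuracy(recommendations, historical_reviewers, k=2)
```

In this sketch, every candidate for review r2 is counted as a miss simply because none of them was the recorded reviewer. That is the kind of label the study suggests may be overly pessimistic, since many such candidates reported that they would have been comfortable reviewing the change.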

Our results suggest that current (reviewer) recommendation evaluations do not always model the reality of software development. Future studies may benefit from looking beyond repository data to gain a clearer understanding of the practical value of historical data in repository mining solutions.

Ian X. Gauthier

McGill University

Maxime Lamothe

Polytechnique Montréal

Canada

Gunter Mussbacher

McGill University

Canada

Shane McIntosh

University of Waterloo

Canada

Session Program

Tue 16 Nov
Displayed time zone: Hobart

11:00 - 12:00	Empirical Studies (Industry Showcase / Research Papers / Tool Demonstrations) at Koala. Chair(s): Felipe Fronchetti (Virginia Commonwealth University)

11:00 20m Talk		Is Historical Data an Appropriate Benchmark for Reviewer Recommendation Systems? A Case Study of the Gerrit Community. Research Papers. Ian X. Gauthier (McGill University), Maxime Lamothe (Polytechnique Montréal), Gunter Mussbacher (McGill University), Shane McIntosh (University of Waterloo)
11:20 20m Talk		An Empirical Study of Bugs in WebAssembly Compilers. Research Papers. Alan Romano (University at Buffalo), Xinyue Liu (University at Buffalo, SUNY), Yonghwi Kwon (University of Virginia), Weihang Wang (University at Buffalo, SUNY)
11:40 10m Talk		Improving Configurability of Unit-level Continuous Fuzzing: An Industrial Case Study with SAP HANA. Industry Showcase. Hanyoung Yoo (Handong Global University), Jingun Hong (SAP Labs), Bader Lucas (SAP Labs), Dongwon Hwang (SAP Labs), Shin Hong (Handong Global University)
11:50 5m Talk		IncBL: Incremental Bug Localization. Tool Demonstrations. Zhou Yang (Singapore Management University), Jieke Shi (Singapore Management University), Shaowei Wang (University of Manitoba), David Lo (Singapore Management University)