Methods2Test: A dataset of focal methods mapped to test cases (MSR 2022 - Data and Tool Showcase Track)

Who

Michele Tufano, Shao Kun Deng, Neel Sundaresan, Alexey Svyatkovskiy

Track

MSR 2022 Data and Tool Showcase Track

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 18 May 2022 20:25 - 20:29 at MSR Main room - even hours - Session 6: Maintenance & Testing Chair(s): Ajay Jha, Amjed Tahir

Abstract

Unit testing is an essential part of the software development process: it helps identify issues with source code in the early stages of development and prevents regressions. Machine learning has emerged as a viable approach to help software developers generate automated unit tests. However, using machine learning to generate reliable unit test cases that are semantically correct and capable of catching software bugs or unintended behavior requires large, metadata-rich datasets. In this paper we present Methods2Test: a large, supervised dataset of test cases mapped to their corresponding methods under test (i.e., focal methods). The dataset contains 780,944 pairs of JUnit tests and focal methods, extracted from a total of 91,385 open-source Java projects hosted on GitHub with licenses permitting redistribution. The main challenge in creating Methods2Test was to establish a reliable mapping between a test case and the relevant focal method. To this end, we designed a set of heuristics, based on developers' best practices in software testing, that identify the likely focal method for a given test case. To facilitate further analysis, we store a rich set of metadata for each method-test pair in JSON-formatted files. Additionally, we extract a textual corpus from the dataset at different context levels, which we provide in both raw and tokenized forms, to enable researchers to train and evaluate machine learning models for Automated Test Generation. Methods2Test is publicly available at: https://github.com/microsoft/methods2test
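
The abstract does not spell out the mapping heuristics, so the following is only an illustrative sketch of one kind of naming-convention heuristic that is common in JUnit projects: a test class named StackTest is assumed to exercise the class Stack, and a test method named testPush is assumed to exercise push. The function names focal_class_name and focal_method_name are hypothetical and are not part of the Methods2Test tooling; the actual heuristics used to build the dataset may differ.

```python
from typing import Iterable, Optional


def focal_class_name(test_class: str) -> Optional[str]:
    """Guess the class under test from a test class name.

    Assumption: test classes follow the FooTest / FooTests / TestFoo
    naming conventions. Returns None when no convention applies.
    """
    for suffix in ("Tests", "Test"):
        if test_class.endswith(suffix) and len(test_class) > len(suffix):
            return test_class[: -len(suffix)]
    if test_class.startswith("Test") and len(test_class) > len("Test"):
        return test_class[len("Test"):]
    return None


def focal_method_name(test_method: str, focal_methods: Iterable[str]) -> Optional[str]:
    """Guess the focal method for a test method by name matching.

    Strips an optional 'test' prefix, then prefers an exact
    case-insensitive match; otherwise it falls back to the longest
    candidate whose name is a prefix of the remaining test name.
    """
    name = test_method
    if name.lower().startswith("test"):
        name = name[len("test"):]
    name = name.lstrip("_")
    if not name:
        return None
    lowered = name.lower()
    candidates = list(focal_methods)
    for method in candidates:
        if method.lower() == lowered:
            return method
    prefix_matches = [m for m in candidates if lowered.startswith(m.lower())]
    return max(prefix_matches, key=len) if prefix_matches else None


if __name__ == "__main__":
    methods = ["push", "pop"]
    print(focal_class_name("StackTest"))
    print(focal_method_name("testPopOnEmptyStack", methods))
```

A heuristic like this returns None rather than guessing when no convention applies, which fits the abstract's framing of identifying the likely focal method rather than a guaranteed one.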

Michele Tufano

Microsoft

Shao Kun Deng

Microsoft Corporation

United States

Neel Sundaresan

Microsoft Corporation

Alexey Svyatkovskiy

Session Program

Wed 18 May
Displayed time zone: Eastern Time (US & Canada)

20:00 - 20:50	Session 6: Maintenance & Testing (Data and Tool Showcase Track / Technical Papers) at MSR Main room - even hours. Chair(s): Ajay Jha (University of Alberta), Amjed Tahir (Massey University)

20:00 4m Short-paper		Characterizing High-Quality Test Methods: A First Empirical Study Technical Papers Victor Veloso UFMG, Andre Hora UFMG
20:04 7m Talk		CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning Technical Papers Mohammad Reza Taesiri University of Alberta, Finlay Macklon University of Alberta, Cor-Paul Bezemer University of Alberta
20:11 7m Talk		An Empirical Study on Maintainable Method Size in Java Technical Papers Shaiful Chowdhury University of Alberta, Gias Uddin University of Calgary, Canada, Reid Holmes University of British Columbia
20:18 7m Talk		Complex Python Features in the Wild Technical Papers Yi Yang Rensselaer Polytechnic Institute, Ana Milanova Rensselaer Polytechnic Institute, Martin Hirzel IBM Research
20:25 4m Talk		Methods2Test: A dataset of focal methods mapped to test cases Data and Tool Showcase Track Michele Tufano Microsoft, Shao Kun Deng Microsoft Corporation, Neel Sundaresan Microsoft Corporation, Alexey Svyatkovskiy
20:29 4m Talk		npm-filter: Automating the mining of dynamic information from npm packages Data and Tool Showcase Track Ellen Arteca Northeastern University, Alexi Turcotte Northeastern University
20:33 4m Talk		ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference Data and Tool Showcase Track Kevin Jesse University of California, Davis, Prem Devanbu Department of Computer Science, University of California, Davis
20:37 13m Live Q&A		Discussions and Q&A Technical Papers
