MSR 2022
Mon 23 - Tue 24 May 2022
co-located with ICSE 2022

Call for Papers

The MSR Data/Tool Showcase track aims to actively promote and recognize the creation of reusable datasets and tools that are designed and built not only for a specific research project, but for the MSR community as a whole. These datasets and tools should enable other practitioners and researchers to jumpstart their own research efforts, and also enable the reproducibility of earlier work. The MSR Data/Tool Showcase papers can be descriptions of datasets or tools built by the authors that can be used by other practitioners or researchers, and/or descriptions of the use of tools built by others to obtain specific research results.

Types of MSR’22 Data and Tool Showcase Track Submissions

The MSR’22 Data/Tool Showcase Track will accept two types of submissions: (1) data showcase papers and (2) reusable tool showcase papers.

Authors should prepare submissions of at most 4 pages, plus 1 additional page of references. Papers must be submitted to the HotCRP submission site on or before Thursday 27th January 2022.

The Review Criteria for the Data/Tool Showcase submissions are as follows:

  • The value, usefulness, and reusability of the datasets or tools.
  • The quality of the presentation.
  • The clarity of the relationship to related work and the relevance to mining software repositories.
  • The availability of the datasets or tools.

1. Data Showcase

MSR Data showcase submissions are expected to include:

  • A description of the data source,
  • A description of the methodology used to gather the data (including provenance and the tool used to create/generate/gather the data, if any),
  • A description of the storage mechanism, including a schema if applicable,
  • If the data has been used by the authors or others, a description of how this was done including references to previously published papers,
  • A description of the originality of the data set (that is, even if the data set has been used in a published paper, its complete description must be unpublished) and similar existing datasets (if any),
  • Ideas for future research questions that could be answered using the data set,
  • Ideas for further improvements that could be made to the data set, and
  • Any limitations and/or challenges in creating or using the data set.

2. Reusable Tool Showcase

MSR Reusable Tool showcase submissions are expected to include:

  • A description of the tool, which includes the background, motivation, novelty, overall architecture, detailed design, and preliminary evaluation of the tool, as well as the link to download or access the tool.
  • A description of the design of the tool, and how to use the tool in practice.
  • Clear installation instructions and example data set that allow the reviewers to run the tool.
  • If the tool has been used by the authors or others, a description of how the tool was used, including references to previously published papers.
  • Ideas for future reusability of the tool.
  • Any limitations of using the tool.

The dataset/tool should be made available at the time of submission of the paper for review but will be considered confidential until publication of the paper. The dataset/tool should include detailed instructions about how to set up the environment (e.g., requirements.txt), how to use the datasets/tools (e.g., how to import the data or how to access the data once it has been imported, how to use the tool with a running example).

At a minimum, upon publication of the paper, the authors should archive the data or tool on a persistent repository that can provide a digital object identifier (DOI) such as zenodo.org, figshare.com, Archive.org, or institutional repositories. In addition, the DOI-based citation of the dataset or the tool should be included in the camera-ready version of the paper.
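As an illustration of a DOI-based citation in the camera-ready version, the fragment below is a minimal sketch rather than a required format. The entry key, author, title, file name, and DOI are all placeholders (the `XXXXXXX` suffix is not a real identifier); it assumes the artifact was archived on Zenodo and that the paper uses the ACM template with BibTeX.

```latex
% Placed before \documentclass: writes a hypothetical artifact.bib file.
% Every field value below is a placeholder, not a real record.
\begin{filecontents}{artifact.bib}
@misc{placeholder-artifact,
  author    = {Placeholder Author},
  title     = {Placeholder Dataset Title},
  year      = {2022},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.XXXXXXX}
}
\end{filecontents}

% In the body of the paper: cite the archived artifact.
% The dataset is archived on Zenodo~\cite{placeholder-artifact}.

% Before \end{document}: load the entry with the ACM reference style.
% \bibliographystyle{ACM-Reference-Format}
% \bibliography{artifact}
```

Keeping the artifact entry in the bibliography, rather than only in a footnote, makes the DOI appear in the reference list where indexing services can pick it up.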

Data/Tool showcase submissions are not:

  • Empirical studies.
  • Datasets that are based on poorly explained or untrustworthy heuristics for data collection, or results of trivial application of generic tools.

If custom tools have been used to create the data set, we expect the paper to be accompanied by the source code of the tools, along with clear documentation on how to run the tools to recreate the data set. The tools should be open source, accompanied by an appropriate license; the source code should be citable, i.e., refer to a specific release and have a DOI. GitHub provides an easy way to make source code citable. If you cannot provide the source code or the source code clause is not applicable (e.g., because the data set consists of qualitative data), please provide a short explanation of why this is not possible.

Important Dates

  • Abstract Deadline: Tuesday 25th January 2022
  • Paper Deadline: Thursday 27th January 2022
  • Author Notification: March 8
  • Camera Ready Deadline: Late March


Please submit your data and tool paper(s) (maximum 4 pages, plus 1 additional page of references) via the HotCRP submission site on or before Thursday 27th January 2022.

Submitted papers will undergo single-blind peer review. We opt for single-blind peer review (as opposed to the double-blind peer review of the main track) because of the requirement above to describe how the data has been used in previous studies, including bibliographic references to those studies. Such references are likely to disclose the authors’ identity.

To make research datasets and tools accessible and citable, we further encourage authors to follow the FAIR principles, i.e., datasets and tools should be Findable, Accessible, Interoperable, and Reusable.

All authors should use the official “ACM Primary Article Template”, as can be obtained from the ACM Proceedings Template page. LaTeX users should use the sigconf option, as well as the review option (to produce line numbers for easy reference by the reviewers). To that end, the following LaTeX code can be placed at the start of the LaTeX document:


\documentclass[sigconf,review]{acmart}
\acmConference[MSR 2022]{MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories}{May 23–24, 2022}{Pittsburgh, PA, USA}
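For orientation, a minimal sketch of where these declarations sit in a complete source file is given below. It assumes the `acmart` class from the ACM template; the title, author details, section content, and the `references` bibliography file name are placeholders, not requirements of the track.

```latex
% sigconf selects the conference layout; review adds line numbers.
\documentclass[sigconf,review]{acmart}

\acmConference[MSR 2022]{MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories}{May 23--24, 2022}{Pittsburgh, PA, USA}

\begin{document}

% Placeholder front matter: replace with the real paper details.
\title{Placeholder Title}
\author{Placeholder Author}
\affiliation{%
  \institution{Placeholder Institution}
  \city{Placeholder City}
  \country{Placeholder Country}}
\email{author@example.org}

% In acmart the abstract must appear before \maketitle.
\begin{abstract}
Placeholder abstract.
\end{abstract}

\maketitle

\section{Introduction}
Placeholder text.

% Assumes a references.bib file next to this source file.
\bibliographystyle{ACM-Reference-Format}
\bibliography{references}

\end{document}
```

Because the track is single-blind, author names are shown here; the `anonymous` class option used for double-blind tracks is deliberately left out.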

We encourage authors to upload their paper info early (the PDF can be submitted later). All submissions must adhere to the following requirements:

  • Submissions must not exceed the page limit (4 pages plus 1 additional page of references for short papers). The page limit is strict, and it will not be possible to purchase additional pages at any point in the process (including after acceptance).
  • Submissions must strictly conform to the ACM formatting instructions. Alterations of spacing, font size, and other changes that deviate from the instructions may result in desk rejection without further review.

Any submission that does not comply with these requirements is likely to be desk rejected by the PC Chairs without further review. In addition, by submitting to the MSR Data/Tool Showcase Track, the authors acknowledge that they are aware of and agree to be bound by the following policies:

  • The ACM Policy and Procedures on Plagiarism and the IEEE Plagiarism FAQ. In particular, papers submitted to MSR 2022 must not have been published elsewhere and must not be under review or submitted for review elsewhere whilst under consideration for MSR 2022. Contravention of this concurrent submission policy will be deemed a serious breach of scientific ethics, and appropriate action will be taken in all such cases (including immediate rejection and reporting of the incident to ACM/IEEE). To check for double submission and plagiarism issues, the chairs reserve the right to (1) share the list of submissions with the PC Chairs of other conferences with overlapping review periods and (2) use external plagiarism detection software, under contract to the ACM or IEEE, to detect violations of these policies.
  • The authorship policy of the ACM and the authorship policy of the IEEE.

Upon notification of acceptance, all authors of accepted papers will be asked to fill in a copyright form and will receive further instructions for preparing the camera-ready version of their papers. At least one author of each paper is expected to register and present the paper at the MSR 2022 conference. All accepted contributions will be published in the electronic proceedings of the conference.

For enquiries, please contact the MSR Data/Tool Co-Chairs at chakkrit@monash.edu and xin.xia@acm.org.

Accepted Papers

A Large-scale Dataset of (Open Source) License Text Variants
Data and Tool Showcase Award
Data and Tool Showcase Track
An Alternative Issue Tracking Dataset of Public Jira Repositories
Data and Tool Showcase Track
AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information
Data and Tool Showcase Track
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
Data and Tool Showcase Track
A Time Series-Based Dataset of Open-Source Software Evolution
Data and Tool Showcase Track
A Versatile Dataset of Agile Open Source Software Projects
Data and Tool Showcase Track
Constructing Dataset of Functionally Equivalent Java Methods Using Automated Test Generation Techniques
Data and Tool Showcase Track
DaSEA – A Dataset for Software Ecosystem Analysis
Data and Tool Showcase Track
Dataset: Dependency Networks of Open Source Libraries Available Through CocoaPods, Carthage and Swift PM
Data and Tool Showcase Track
DISCO: A Dataset of Discord Chat Conversations for Software Engineering Research
Data and Tool Showcase Track
ECench: An Energy Bug Benchmark of Ethereum Client Software
Data and Tool Showcase Track
Exploring Apache Incubator Project Trajectories with APEX
Data and Tool Showcase Track
FixJS: A Dataset of Bug-fixing JavaScript Commits
Data and Tool Showcase Track
GitDelver Enterprise Dataset (GDED): An Industrial Closed-source Dataset for Socio-Technical Research
Data and Tool Showcase Track
Inspect4py: A Knowledge Extraction Framework for Python Code Repositories
Data and Tool Showcase Track
LAGOON: An Analysis Tool for Open Source Communities
Data and Tool Showcase Track
Lupa: A Platform for Large Scale Analysis of The Programming Language Usage
Data and Tool Showcase Track
ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference
Data and Tool Showcase Track
Methods2Test: A dataset of focal methods mapped to test cases
Data and Tool Showcase Track
npm-filter: Automating the mining of dynamic information from npm packages
Data and Tool Showcase Track
ReCover: a Curated Dataset for Regression Testing Research
Data and Tool Showcase Track
SLNET: A Redistributable Corpus of 3rd-party Simulink Models
Data and Tool Showcase Track
SniP: An Efficient Stack Tracing Framework for Multi-threaded Programs
Data and Tool Showcase Track
SoCCMiner: A Source Code-Comments and Comment-Context Miner
Data and Tool Showcase Track
SOSum: A Dataset of Stack Overflow Post Summaries
Data and Tool Showcase Track
The General Index of Software Engineering Papers
Data and Tool Showcase Track
The OCEAN mailing list data set: Network analysis spanning mailing lists and code repositories
Data and Tool Showcase Track
The Unexplored Treasure Trove of Phabricator Code Reviews
Data and Tool Showcase Track
The Unsolvable Problem or the Unheard Answer? A Dataset of 24,669 Open-Source Software Conference Talks
Data and Tool Showcase Track
Tooling for Time- and Space-efficient git Repository Mining
Data and Tool Showcase Track
TriggerZoo: A Dataset of Android Applications Automatically Infected with Logic Bombs
Data and Tool Showcase Track
TSSB-3M: Mining single statement bugs at massive scale
Data and Tool Showcase Track
TwinDroid: A Dataset of Android app System call traces and Trace Generation Pipeline
Data and Tool Showcase Track
Vul4J: A Dataset of Reproducible Java Vulnerabilities Geared Towards the Study of Program Repair Techniques
Data and Tool Showcase Award
Data and Tool Showcase Track