MSR 2022
co-located with ICSE 2022

The International Conference on Mining Software Repositories (MSR) has hosted a mining challenge since 2006. With this challenge, we call upon everyone interested to apply their tools to a common dataset. The challenge is for researchers and practitioners to bravely use their mining tools and approaches on a dare.

Call for Mining Challenge Proposals

One of the secret ingredients behind the success of the International Conference on Mining Software Repositories (MSR) is its annual Mining Challenge, in which MSR participants can showcase their techniques, tools and creativity on a common data set. In true MSR fashion, this data set is a real data set contributed by researchers in the community, solicited through an open call.

There are many benefits of sharing a data set for the MSR Mining Challenge. The selected challenge proposal explaining the data set will appear in the MSR 2022 proceedings, and the challenge papers using the data set will be required to cite the challenge proposal or an existing paper of the researchers about the selected data set. Furthermore, the authors of the data set will join the MSR 2022 organizing committee as Mining Challenge (co-)chair(s), who will oversee the reviewing process (e.g., recruiting a Challenge PC, managing submissions and review assignments). Finally, it is not uncommon for challenge data sets to feature in MSR and other publications well after the edition of the conference in which they appear!

If you would like to submit your data set for consideration for the 2022 MSR Mining Challenge, please submit a one-page proposal with up to three pages of appendices at https://msr2022-challenge-proposals.hotcrp.com/, containing the following information:

  1. Title of data set.
  2. What does the data set contain?
  3. How large is it?
  4. How accessible is it and how can the data be obtained?
  5. How representative is it?
  6. Does it require specialized tools to mine it?
  7. What skills, infrastructure, and/or credentials would challenge participants need to work with the data set?
  8. What kinds of questions do you expect challenge participants to answer?
  9. A link to a (sub)sample of the data for the organizing committee to peruse (e.g., via GitHub, Zenodo, Figshare).

Each submission must conform to the ACM formatting instructions. Templates are available here.

The first task of the authors of the selected proposal will be to prepare the Call for Challenge Papers, which outlines the expected content and structure of submissions, as well as the technical details of how to access and analyze the data set. This call will be published on the MSR website on August 5th. By making the challenge data set available by late summer, we hope that many students will be able to use the challenge data set for their graduate class projects.

Important Dates

  • Deadline for proposals: July 1st, 2021
  • Notification: July 19th, 2021
  • Call for Challenge Papers Published: August 5th, 2021
  • Challenge PC formed: TBD
  • Submission Deadline for Challenge Papers: TBD

Call for Mining Challenge Papers

This year, the mining challenge is about the SmartSHARK data, a dataset that combines detailed information from the version control system (commits, code metrics, code clones, PMD warnings, change types, refactorings) with issue tracking data from Jira, pull request data from GitHub and continuous integration data from Travis.

All data is integrated into a single database that also contains links between the different information sources, e.g., commits and referenced issues, or pull requests and the related commits. The data was further extended with commonly used heuristics, most notably SZZ (and variants) to determine bug-fixing and bug-inducing changes, but also heuristics to identify changes to self-admitted technical debt or tests. Parts of the data are even manually validated, e.g., to check whether issues really report bugs, whether links between commits and issues are correct, and even which changed lines contribute to bug fixes.
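To make the SZZ idea concrete, here is a minimal, self-contained sketch of the basic heuristic on a toy single-file history. It is not the implementation used to build SmartSHARK, and the names `Commit`, `blame_history`, and `szz_inducing` are hypothetical. The idea: lines that a bug-fixing commit removes or rewrites are traced back to the commits that last touched them, and only commits made before the bug was reported are kept as bug-inducing candidates.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Commit:
    """Toy commit (assumption): full content of one file after the commit."""
    sha: str
    date: datetime
    lines: tuple


def blame_history(history):
    """For each commit, list the sha that last touched every line of the file."""
    blames, prev_lines, prev_owner = [], [], []
    for commit in history:
        cur_lines = list(commit.lines)
        matcher = difflib.SequenceMatcher(a=prev_lines, b=cur_lines, autojunk=False)
        owner = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                owner.extend(prev_owner[i1:i2])  # unchanged lines keep their owner
            elif tag in ("replace", "insert"):
                owner.extend([commit.sha] * (j2 - j1))  # new or rewritten lines
            # "delete": removed lines leave no trace in the new blame
        blames.append(owner)
        prev_lines, prev_owner = cur_lines, owner
    return blames


def szz_inducing(history, fix_index, bug_reported):
    """Return shas of earlier commits that last touched lines changed by the fix."""
    if fix_index < 1:
        raise ValueError("the fix needs at least one preceding commit")
    owner_before_fix = blame_history(history[:fix_index])[-1]
    before = list(history[fix_index - 1].lines)
    after = list(history[fix_index].lines)
    dates = {commit.sha: commit.date for commit in history}
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    candidates = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            candidates.update(owner_before_fix[i1:i2])
    # Commits made after the bug was reported cannot have introduced it.
    return {sha for sha in candidates if dates[sha] < bug_reported}


history = [
    Commit("c1", datetime(2021, 1, 1), ("a = 1", "b = 2", "print(a + b)")),
    Commit("c2", datetime(2021, 2, 1), ("a = 1", "b = 2", "c = a - b", "print(c)")),
    Commit("c3", datetime(2021, 4, 1), ("a = 1", "b = 2", "c = a + b", "print(c)")),
]
inducing = szz_inducing(history, fix_index=2, bug_reported=datetime(2021, 3, 1))
print(inducing)  # {'c2'}
```

Real SZZ variants work on version control history across many files and add further filters (for example, ignoring whitespace or comment changes), which this sketch leaves out.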

In this challenge, participants can use either of the two versions of dataset v2.1. The small version requires about 17 gigabytes of storage and does not include code metrics, code clone data, and PMD warnings. The full version requires about 450 gigabytes of storage and includes all available data. We plan to release v2.2 with data for more projects in early December.
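As a rough aid for choosing between the two versions, the sketch below compares free disk space with the approximate sizes quoted above. The function name and the `headroom` factor are assumptions made for illustration; importing the dump into a database may need more space than the figures suggest.

```python
import shutil

GB = 10**9
DATASET_SIZES_GB = {"small": 17, "full": 450}  # approximate sizes from the call


def largest_version_that_fits(free_bytes, headroom=1.2):
    """Return 'full', 'small', or None depending on the available space.

    headroom is an assumed safety factor, not an official requirement.
    """
    for name in ("full", "small"):
        if free_bytes >= DATASET_SIZES_GB[name] * GB * headroom:
            return name
    return None


if __name__ == "__main__":
    free = shutil.disk_usage(".").free  # free bytes on the current volume
    print(f"{free / GB:.0f} GB free ->", largest_version_that_fits(free))
```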

The challenge is open-ended: participants can choose the research questions that they find most interesting. Our suggestions include:

  • What are the differences between discussions on mailing lists and in issues?
  • What is the relationship between refactorings and bug fixes?
  • How does manual validation affect results?
  • Are TODOs removed/introduced as part of bug fixes or through other commits?
  • Can we establish links between commits and mailing list discussions (“ML-SZZ”)?
  • Are bugs missed in pull request reviews and, if so, why?
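As an illustration of how the TODO question might be approached, this sketch counts TODO-style comments on the added and removed lines of a unified diff. The marker set and the function name are assumptions, and a real study would take the diffs of the commits of interest as input.

```python
import re

TODO_PATTERN = re.compile(r"\b(TODO|FIXME|XXX)\b")  # assumed marker set


def count_todo_changes(diff_text):
    """Count TODO-style markers on added and removed lines of a unified diff."""
    added = removed = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # treated as file headers, not as content changes
        if line.startswith("+") and TODO_PATTERN.search(line):
            added += 1
        elif line.startswith("-") and TODO_PATTERN.search(line):
            removed += 1
    return added, removed


example_diff = """\
--- a/Calc.java
+++ b/Calc.java
@@ -1,3 +1,3 @@
 int sum(int a, int b) {
-    return a - b; // TODO check sign
+    return a + b;
 }
"""
print(count_todo_changes(example_diff))  # (0, 1)
```

One known limitation of this simple header check is that changed lines whose own content begins with `--` or `++` are skipped as well.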

These are just some of the questions that could be answered using the SmartSHARK dataset. Participants may combine the SmartSHARK data with other data. However, in this case we expect the contribution to include the code used to collect the additional data, together with a reasonable suggestion for how this data could be permanently integrated into the SmartSHARK database. We ask participants to carefully consider any ethical implications that stem from using the SmartSHARK data and other data sources, and we explicitly discourage the use of personally identifiable information.

How to Participate in the Challenge

First, familiarize yourself with the SmartSHARK dataset:

  • Read the arXiv paper about the SmartSHARK data.
  • Study the download page of SmartSHARK, which includes the most recent version and links to download the dataset, the usage example, and the documentation page. Please use at least version 2.1 for this challenge!
  • Create a new issue here in case you have problems with the dataset or want to suggest ideas for improvements.

Finally, use the dataset to answer your research questions and report your findings in a four-page challenge paper that you submit to our challenge (see information below). If your paper is accepted, present your results at MSR 2022 in Pittsburgh, USA!


A challenge paper should describe the results of your work by providing an introduction to the problem you address and why it is worth studying, the version of the dataset you used, the approach and tools you used, your results and their implications, and conclusions. Make sure your report highlights the contributions and the importance of your work. See also our open science policy regarding the publication of software and additional data you used for the challenge.

All authors should use the official “ACM Primary Article Template”, as can be obtained from the ACM Proceedings Template page. LaTeX users should use the sigconf option, as well as the review (to produce line numbers for easy reference by the reviewers) and anonymous (omitting author names) options. To that end, the following LaTeX code can be placed at the start of the LaTeX document:

\documentclass[sigconf,review,anonymous]{acmart}
\acmConference[MSR 2022]{MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories}{May 23--24, 2022}{Pittsburgh, PA, USA}

Submissions to the Challenge Track can be made via the submission site by the submission deadline. We encourage authors to upload their paper info early (the PDF can be submitted later) to properly enter conflicts for anonymous reviewing. All submissions must adhere to the following requirements:

  • Submissions must not exceed the page limit (4 pages plus 1 additional page of references). The page limit is strict, and it will not be possible to purchase additional pages at any point in the process (including after acceptance).
  • Submissions must strictly conform to the ACM formatting instructions. Alterations of spacing, font size, and other changes that deviate from the instructions may result in desk rejection without further review.
  • Submissions must not reveal the authors’ identities. The authors must make every effort to honor the double-anonymous review process. In particular, the authors’ names must be omitted from the submission and references to their prior work should be in the third person. Further advice, guidance, and explanation about the double-anonymous review process can be found in the Q&A page for ICSE 2022.
  • Submissions should consider the ethical implications of the research conducted within a separate section before the conclusion.
  • The official publication date is the date the proceedings are made available in the ACM or IEEE Digital Libraries. This date may be up to two weeks prior to the first day of ICSE 2022. The official publication date affects the deadline for any patent filings related to published work.
  • Purchasing additional pages in the proceedings is not allowed.

Any submission that does not comply with these requirements is likely to be desk rejected by the PC Chairs without further review. In addition, by submitting to the MSR Challenge Track, the authors acknowledge that they are aware of and agree to be bound by the following policies:

  • The ACM Policy and Procedures on Plagiarism and the IEEE Plagiarism FAQ. In particular, papers submitted to MSR 2022 must not have been published elsewhere and must not be under review or submitted for review elsewhere whilst under consideration for MSR 2022. Contravention of this concurrent submission policy will be deemed a serious breach of scientific ethics, and appropriate action will be taken in all such cases (including immediate rejection and reporting of the incident to ACM/IEEE). To check for double submission and plagiarism issues, the chairs reserve the right to (1) share the list of submissions with the PC Chairs of other conferences with overlapping review periods and (2) use external plagiarism detection software, under contract to the ACM or IEEE, to detect violations of these policies.
  • The authorship policy of the ACM and the authorship policy of the IEEE.

Upon notification of acceptance, all authors of accepted papers will be asked to fill out a copyright form and will receive further instructions for preparing the camera-ready version of their papers. At least one author of each paper is expected to register and present the paper at the MSR 2022 conference. All accepted contributions will be published in the electronic proceedings of the conference.

This year’s mining challenge and the data can be cited as follows (the citation key is arbitrary and can be changed freely):

@inproceedings{trautsch2022smartshark,
  title={MSR Mining Challenge: The SmartSHARK Repository Data},
  author={Trautsch, Alexander and Trautsch, Fabian and Herbold, Steffen},
  booktitle={Proceedings of the International Conference on Mining Software Repositories (MSR 2022)},
  year={2022}
}

A preprint is available online.

Submission Site

Papers must be submitted through HotCRP: https://msr2022-technical.hotcrp.com/

Important Dates

  • Abstract Deadline: Jan 31
  • Paper Deadline: Feb 3
  • Author Notification: March 8
  • Camera Ready Deadline: Late March

Open Science Policy

Openness in science is key to fostering progress via transparency, reproducibility and replicability. Our steering principle is that all research output should be accessible to the public and that empirical studies should be reproducible. In particular, we actively support the adoption of open data and open source principles. To increase reproducibility and replicability, we encourage all contributing authors to disclose:

  • the source code of the software they used to retrieve and analyze the data
  • the (anonymized and curated) empirical data they retrieved in addition to the SmartSHARK dataset
  • a document with instructions for other researchers describing how to reproduce or replicate the results

Already upon submission, authors can privately share their anonymized data and software on archives such as Zenodo or Figshare (tutorial available here). Zenodo accepts up to 50GB per dataset (more upon request). There is no need to use Dropbox or Google Drive. After acceptance, data and software should be made public so that they receive a DOI and become citable. Zenodo and Figshare accounts can easily be linked with GitHub repositories to automatically archive software releases. In the unlikely case that authors need to upload terabytes of data, Archive.org may be used.

We recognise that anonymising artifacts such as source code is more difficult than preserving anonymity in a paper. We ask authors to make a best-effort attempt not to reveal their identities. We will also ask reviewers to avoid trying to identify authors by looking at commit histories and other such information that is not easily anonymised. Authors wanting to share GitHub repositories may want to look into using https://anonymous.4open.science/, an open source tool that helps you to quickly double-blind your repository.
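One way to support a best-effort anonymisation is to scan artifact files for identifying strings before uploading them. The sketch below is only an illustration: the function name, the identifier list, and the choice to scan only UTF-8 text files are all assumptions, and it does not replace a manual check.

```python
from pathlib import Path


def find_identity_leaks(root, identifiers):
    """Return (path, line number, identifier) for each case-insensitive hit."""
    needles = [s.lower() for s in identifiers if s]
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            for needle in needles:
                if needle in lowered:
                    hits.append((str(path), number, needle))
    return hits
```

A call such as `find_identity_leaks("artifact/", ["jane.doe", "Example University"])` (both arguments hypothetical) would list every line that still mentions one of the given strings.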

We encourage authors to self-archive pre- and postprints of their papers in open, preserved repositories such as arXiv.org. This is legal and allowed by all major publishers including ACM and IEEE and it lets anybody in the world reach your paper. Note that you are usually not allowed to self-archive the PDF of the published article (that is, the publisher proof or the Digital Library version).

Please note that the success of the open science initiative depends on the willingness (and possibilities) of authors to disclose their data and that all submissions will undergo the same review process independent of whether or not they disclose their analysis code or data. We encourage authors who cannot disclose industrial or otherwise non-public data, for instance due to non-disclosure agreements, to provide an explicit (short) statement in the paper.

Best Mining Challenge Paper Award

As mentioned above, all submissions will undergo the same review process independent of whether or not they disclose their analysis code or data. However, only accepted papers for which code and data are available on preserved archives, as described in the open science policy, will be considered by the program committee for the best mining challenge paper award.

Best Student Presentation Award

As in previous years, there will be a public vote during the conference to select the best mining challenge presentation. This award often goes to authors of compelling work who present an engaging story to the audience. Only students can compete for this award.