Mon 15 May (displayed time zone: Hobart)
09:00 - 10:30 | Opening Session & Award Talks | MSR Awards / MIP Award at Meeting Room 109 | Chair(s): Emad Shihab Concordia University, Bogdan Vasilescu Carnegie Mellon University
09:00 20m | Day opening | Opening Session & Award Announcements | MSR Awards | Emad Shihab Concordia University, Patanamon Thongtanunam The University of Melbourne, Bogdan Vasilescu Carnegie Mellon University
09:20 20m | Talk | MSR 2023 Foundational Contribution Award | MSR Awards
09:40 20m | Talk | MSR 2023 Ric Holt Early Career Achievement Award | MSR Awards | Li Li Beihang University
10:00 30m | Talk | MIP #1: Mining Source Code Repositories at Massive Scale Using Language Modeling | MIP Award
11:00 - 11:45 | SE for ML | Data and Tool Showcase Track / Technical Papers at Meeting Room 110 | Chair(s): Sarah Nadi University of Alberta
11:00 12m | Talk | AutoML from Software Engineering Perspective: Landscapes and Challenges (Distinguished Paper Award) | Technical Papers | Chao Wang Peking University, Zhenpeng Chen University College London, UK, Minghui Zhou Peking University | Pre-print
11:12 12m | Talk | Characterizing and Understanding Software Security Vulnerabilities in Machine Learning Libraries | Technical Papers | Nima Shiri Harzevili York University, Jiho Shin York University, Junjie Wang Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences, Song Wang York University, Nachiappan Nagappan Facebook
11:24 6m | Talk | DeepScenario: An Open Driving Scenario Dataset for Autonomous Driving System Testing | Data and Tool Showcase Track | Chengjie Lu Simula Research Laboratory and University of Oslo, Tao Yue Simula Research Laboratory, Shaukat Ali Simula Research Laboratory | Pre-print
11:30 6m | Talk | NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python | Data and Tool Showcase Track | Ratnadira Widyasari Singapore Management University, Singapore, Zhou Yang Singapore Management University, Ferdian Thung Singapore Management University, Sheng Qin Sim Singapore Management University, Singapore, Fiona Wee Singapore Management University, Singapore, Camellia Lok Singapore Management University, Singapore, Jack Phan Singapore Management University, Singapore, Haodi Qi Singapore Management University, Singapore, Constance Tan Singapore Management University, Singapore, Qijin Tay Singapore Management University, Singapore, David Lo Singapore Management University
11:36 6m | Talk | PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages | Data and Tool Showcase Track | Wenxin Jiang Purdue University, Nicholas Synovic Loyola University Chicago, Purvish Jajal Purdue University, Taylor R. Schorlemmer Purdue University, Arav Tewari Purdue University, Bhavesh Pareek Purdue University, George K. Thiruvathukal Loyola University Chicago and Argonne National Laboratory, James C. Davis Purdue University | Pre-print
11:50 - 12:35 | Documentation + Q&A I | Data and Tool Showcase Track / Technical Papers at Meeting Room 109 | Chair(s): Ahmad Abdellatif Concordia University
11:50 12m | Talk | Evaluating Software Documentation Quality | Technical Papers
12:02 12m | Talk | What Do Users Ask in Open-Source AI Repositories? An Empirical Study of GitHub Issues | Technical Papers | Zhou Yang Singapore Management University, Chenyu Wang Singapore Management University, Jieke Shi Singapore Management University, Thong Hoang CSIRO's Data61, Pavneet Singh Kochhar Microsoft, Qinghua Lu CSIRO's Data61, Zhenchang Xing, David Lo Singapore Management University
12:14 12m | Talk | PICASO: Enhancing API Recommendations with Relevant Stack Overflow Posts | Technical Papers | Ivana Clairine Irsan Singapore Management University, Ting Zhang Singapore Management University, Ferdian Thung Singapore Management University, Kisub Kim Singapore Management University, David Lo Singapore Management University
12:26 6m | Talk | GIRT-Data: Sampling GitHub Issue Report Templates | Data and Tool Showcase Track | Nafiseh Nikehgbal Sharif University of Technology, Amir Hossein Kargaran LMU Munich, Abbas Heydarnoori Bowling Green State University, Hinrich Schütze LMU Munich | Pre-print
13:45 - 14:15
13:45 30m | Talk | MIP #2: The Impact of Tangled Code Changes | MIP Award
14:20 - 15:15 | Language Models | Technical Papers at Meeting Room 109 | Chair(s): Patanamon Thongtanunam University of Melbourne
14:20 12m | Talk | On Codex Prompt Engineering for OCL Generation: An Empirical Study | Technical Papers | Seif Abukhalaf Polytechnique Montreal, Mohammad Hamdaqa Polytechnique Montréal, Foutse Khomh Polytechnique Montréal
14:32 12m | Talk | Cross-Domain Evaluation of a Deep Learning-Based Type Inference System | Technical Papers | Bernd Gruner DLR Institute of Data Science, Tim Sonnekalb German Aerospace Center (DLR), Thomas S. Heinze Cooperative University Gera-Eisenach, Clemens-Alexander Brust German Aerospace Center (DLR)
14:44 12m | Talk | Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study | Technical Papers | Tim van Dam Delft University of Technology, Maliheh Izadi Delft University of Technology, Arie van Deursen Delft University of Technology | Pre-print
14:56 12m | Talk | Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models | Technical Papers | Iman Saberi University of British Columbia Okanagan, Fatemeh Hendijani Fard University of British Columbia
14:20 - 15:15 | Understanding Defects | Registered Reports / Data and Tool Showcase Track / Technical Papers at Meeting Room 110 | Chair(s): Matteo Paltenghi University of Stuttgart, Germany
14:20 12m | Talk | What Happens When We Fuzz? Investigating OSS-Fuzz Bug History | Technical Papers | Brandon Keller Rochester Institute of Technology, Benjamin S. Meyers Rochester Institute of Technology, Andrew Meneely Rochester Institute of Technology
14:32 12m | Talk | An Empirical Study of High Performance Computing (HPC) Performance Bugs | Technical Papers | Md Abul Kalam Azad University of Michigan - Dearborn, Nafees Iqbal University of Michigan - Dearborn, Foyzul Hassan University of Michigan - Dearborn, Probir Roy University of Michigan - Dearborn | Pre-print
14:44 6m | Talk | Semantically-enriched Jira Issue Tracking Data | Data and Tool Showcase Track | Themistoklis Diamantopoulos Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki, Dimitrios-Nikitas Nastos Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki, Andreas Symeonidis Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki | Pre-print
14:50 6m | Talk | An exploratory study of bug introducing changes: what happens when bugs are introduced in open source software? | Registered Reports | Lukas Schulte University of Passau, Anamaria Mojica-Hanke University of Passau and Universidad de los Andes, Mario Linares-Vasquez Universidad de los Andes, Steffen Herbold University of Passau
14:56 6m | Talk | HasBugs - Handpicked Haskell Bugs | Data and Tool Showcase Track
15:02 6m | Talk | An Empirical Study on the Performance of Individual Issue Label Prediction | Technical Papers
15:45 - 16:30
15:45 45m | Talk | Tutorial: Recognizing Developers' Emotions Using Non-invasive Biometrics Sensors | Tutorials | Nicole Novielli University of Bari
16:35 - 17:20 | Ethics & Energy | Technical Papers / Registered Reports at Meeting Room 109 | Chair(s): Arumoy Shome Delft University of Technology
16:35 12m | Talk | Energy Consumption Estimation of API-usage in Mobile Apps via Static Analysis | Technical Papers | Abdul Ali Bangash University of Alberta, Canada, Qasim Jamal FAST National University, Kalvin Eng University of Alberta, Karim Ali University of Alberta, Abram Hindle University of Alberta | Pre-print
16:47 12m | Talk | An Exploratory Study on Energy Consumption of Dataframe Processing Libraries | Technical Papers | Pre-print
16:59 6m | Talk | Understanding issues related to personal data and data protection in open source projects on GitHub | Registered Reports | Anne Hennig Karlsruhe Institute of Technology, Lukas Schulte University of Passau, Steffen Herbold University of Passau, Oksana Kulyk IT University of Copenhagen, Denmark, Peter Mayer University of Southern Denmark
17:05 12m | Talk | Whistleblowing and Tech on Twitter | Technical Papers | Laura Duits Vrije Universiteit Amsterdam, Isha Kashyap Vrije Universiteit Amsterdam, Joey Bekkink Vrije Universiteit Amsterdam, Kousar Aslam Vrije Universiteit Amsterdam, Emitzá Guzmán Vrije Universiteit Amsterdam
16:35 - 17:20 | Security | Technical Papers / Data and Tool Showcase Track at Meeting Room 110 | Chair(s): Chanchal K. Roy University of Saskatchewan
16:35 12m | Talk | UNGOML: Automated Classification of unsafe Usages in Go | Technical Papers | Anna-Katharina Wickert TU Darmstadt, Germany, Clemens Damke University of Munich (LMU), Lars Baumgärtner Technische Universität Darmstadt, Eyke Hüllermeier University of Munich (LMU), Mira Mezini TU Darmstadt | Pre-print | File Attached
16:47 12m | Talk | Connecting the .dotfiles: Checked-In Secret Exposure with Extra (Lateral Movement) Steps | Technical Papers | Gerhard Jungwirth TU Wien, Aakanksha Saha TU Wien, Michael Schröder TU Wien, Tobias Fiebig Max-Planck-Institut für Informatik, Martina Lindorfer TU Wien, Jürgen Cito TU Wien | Pre-print
16:59 12m | Talk | MANDO-HGT: Heterogeneous Graph Transformers for Smart Contract Vulnerability Detection | Technical Papers | Hoang H. Nguyen L3S Research Center, Leibniz Universität Hannover, Hannover, Germany, Nhat-Minh Nguyen Singapore Management University, Singapore, Chunyao Xie L3S Research Center, Leibniz Universität Hannover, Germany, Zahra Ahmadi L3S Research Center, Leibniz Universität Hannover, Hannover, Germany, Daniel Kudenko L3S Research Center, Leibniz Universität Hannover, Germany, Thanh-Nam Doan Independent Researcher, Atlanta, Georgia, USA, Lingxiao Jiang Singapore Management University | Pre-print | Media Attached
17:11 6m | Talk | SecretBench: A Dataset of Software Secrets | Data and Tool Showcase Track | Setu Kumar Basak North Carolina State University, Lorenzo Neil North Carolina State University, Bradley Reaves North Carolina State University, Laurie Williams North Carolina State University | Pre-print
18:00 - 21:00
18:00 3h | Meeting | MSR Dinner at Cargo Hall, South Wharf | Technical Papers
Tue 16 May (displayed time zone: Hobart)
09:50 - 10:30 | Tutorial #2 | Tutorials at Meeting Room 109 | Chair(s): Alexander Serebrenik Eindhoven University of Technology
09:50 40m | Tutorial | Tutorial: Mining and Analysing Collaboration in git Repositories with git2net | Tutorials | Christoph Gote Chair of Systems Design, ETH Zurich
09:50 - 10:30 | Mining Challenge | Mining Challenge at Meeting Room 110 | Chair(s): Audris Mockus The University of Tennessee
09:50 6m | Talk | An Empirical Study to Investigate Collaboration Among Developers in Open Source Software (OSS) | Mining Challenge | Weijie Sun University of Alberta, Samuel Iwuchukwu University of Alberta, Abdul Ali Bangash University of Alberta, Canada, Abram Hindle University of Alberta | Pre-print
09:56 6m | Talk | Insights into Female Contributions in Open-Source Projects | Mining Challenge | Arifa Islam Champa Idaho State University, Md Fazle Rabbi Idaho State University, Minhaz F. Zibran Idaho State University, Md Rakibul Islam University of Wisconsin - Eau Claire | Pre-print
10:02 6m | Talk | The Secret Life of CVEs | Mining Challenge | Piotr Przymus Nicolaus Copernicus University in Toruń, Mikołaj Fejzer Nicolaus Copernicus University in Toruń, Jakub Narębski Nicolaus Copernicus University in Toruń, Krzysztof Stencel University of Warsaw | Pre-print
10:08 6m | Talk | Evolution of the Practice of Software Testing in Java Projects | Mining Challenge | Anisha Islam Department of Computing Science, University of Alberta, Nipuni Tharushika Hewage Department of Computing Science, University of Alberta, Abdul Ali Bangash University of Alberta, Canada, Abram Hindle University of Alberta | Pre-print
10:14 6m | Talk | Keep the Ball Rolling: Analyzing Release Cadence in GitHub Projects | Mining Challenge | Oz Kilic Carleton University, Nathaniel Bowness University of Ottawa, Olga Baysal Carleton University | Pre-print
11:00 - 11:45 | Documentation + Q&A II | Technical Papers / Data and Tool Showcase Track at Meeting Room 109 | Chair(s): Maram Assi Queen's University
11:00 12m | Talk | Understanding the Role of Images on Stack Overflow | Technical Papers | Dong Wang Kyushu University, Japan, Tao Xiao Nara Institute of Science and Technology, Christoph Treude University of Melbourne, Raula Gaikovina Kula Nara Institute of Science and Technology, Hideaki Hata Shinshu University, Yasutaka Kamei Kyushu University | Pre-print
11:12 12m | Talk | Do Subjectivity and Objectivity Always Agree? A Case Study with Stack Overflow Questions | Technical Papers | Saikat Mondal University of Saskatchewan, Masud Rahman Dalhousie University, Chanchal K. Roy University of Saskatchewan | Pre-print
11:24 6m | Talk | GiveMeLabeledIssues: An Open Source Issue Recommendation System | Data and Tool Showcase Track | Joseph Vargovich Northern Arizona University, Fabio Marcos De Abreu Santos Northern Arizona University, USA, Jacob Penney Northern Arizona University, Marco Gerosa Northern Arizona University, Igor Steinmacher Northern Arizona University | Pre-print | Media Attached
11:30 6m | Talk | DocMine: A Software Documentation-Related Dataset of 950 GitHub Repositories | Data and Tool Showcase Track
11:36 6m | Talk | PENTACET data - 23 Million Code Comments and 500,000 SATD comments | Data and Tool Showcase Track | Murali Sridharan University of Oulu, Leevi Rantala University of Oulu, Mika Mäntylä University of Oulu
11:00 - 11:45 | Code Smells | Technical Papers / Industry Track / Data and Tool Showcase Track at Meeting Room 110 | Chair(s): Md Tajmilur Rahman Gannon University
11:00 12m | Talk | Don't Forget the Exception! Considering Robustness Changes to Identify Design Problems | Technical Papers | Anderson Oliveira PUC-Rio, João Lucas Correia Federal University of Alagoas, Leonardo Da Silva Sousa Carnegie Mellon University, USA, Wesley Assunção Johannes Kepler University Linz, Austria & Pontifical Catholic University of Rio de Janeiro, Brazil, Daniel Coutinho PUC-Rio, Alessandro Garcia PUC-Rio, Willian Oizumi GoTo, Caio Barbosa UFAL, Anderson Uchôa Federal University of Ceará, Juliana Alves Pereira PUC-Rio | Pre-print
11:12 12m | Talk | Pre-trained Model Based Feature Envy Detection | Technical Papers | Wenhao Ma Wuhan University, Yaoxiang Yu Wuhan University, Xiaoming Ruan Wuhan University, Bo Cai Wuhan University
11:24 6m | Talk | CLEAN++: Code Smells Extraction for C++ | Data and Tool Showcase Track | Tom Mashiach Ben Gurion University of the Negev, Israel, Bruno Sotto-Mayor Ben Gurion University of the Negev, Israel, Gal Kaminka Bar Ilan University, Israel, Meir Kalech Ben Gurion University of the Negev, Israel
11:30 6m | Talk | DACOS-A Manually Annotated Dataset of Code Smells | Data and Tool Showcase Track | Himesh Nandani Dalhousie University, Mootez Saad Dalhousie University, Tushar Sharma Dalhousie University | Pre-print | File Attached
11:36 6m | Talk | What Warnings Do Engineers Really Fix? The Compiler That Cried Wolf | Industry Track | Gunnar Kudrjavets University of Groningen, Aditya Kumar Snap, Inc., Ayushi Rastogi University of Groningen, The Netherlands | Pre-print
11:50 - 12:35 | Development Tools & Practices II | Data and Tool Showcase Track / Industry Track / Technical Papers / Registered Reports at Meeting Room 109 | Chair(s): Banani Roy University of Saskatchewan
11:50 12m | Talk | Automating Arduino Programming: From Hardware Setups to Sample Source Code Generation | Technical Papers | Imam Nur Bani Yusuf Singapore Management University, Singapore, Diyanah Binte Abdul Jamal Singapore Management University, Lingxiao Jiang Singapore Management University | Pre-print
12:02 6m | Talk | A Dataset of Bot and Human Activities in GitHub | Data and Tool Showcase Track | Natarajan Chidambaram University of Mons, Alexandre Decan University of Mons; F.R.S.-FNRS, Tom Mens University of Mons
12:08 6m | Talk | Mining the Characteristics of Jupyter Notebooks in Data Science Projects | Registered Reports | Morakot Choetkiertikul Mahidol University, Thailand, Apirak Hoonlor Mahidol University, Chaiyong Ragkhitwetsagul Mahidol University, Thailand, Siripen Pongpaichet Mahidol University, Thanwadee Sunetnanta Mahidol University, Tasha Settewong Mahidol University, Raula Gaikovina Kula Nara Institute of Science and Technology
12:14 6m | Talk | Optimizing Duplicate Size Thresholds in IDEs | Industry Track | Konstantin Grotov JetBrains Research, Constructor University, Sergey Titov JetBrains Research, Alexandr Suhinin JetBrains, Yaroslav Golubev JetBrains Research, Timofey Bryksin JetBrains Research | Pre-print
12:20 12m | Talk | Boosting Just-in-Time Defect Prediction with Specific Features of C Programming Languages in Code Changes | Technical Papers | Chao Ni Zhejiang University, Xiaodan Xu College of Computer Science and Technology, Zhejiang University, Kaiwen Yang Zhejiang University, David Lo Singapore Management University
11:50 - 12:35 | Software Libraries & Ecosystems | Technical Papers / Industry Track / Data and Tool Showcase Track at Meeting Room 110 | Chair(s): Mehdi Keshani Delft University of Technology
11:50 12m | Talk | A Large Scale Analysis of Semantic Versioning in NPM | Technical Papers | Donald Pinckney Northeastern University, Federico Cassano Northeastern University, Arjun Guha Northeastern University and Roblox Research, Jonathan Bell Northeastern University | Pre-print
12:02 12m | Talk | Phylogenetic Analysis of Reticulate Software Evolution | Technical Papers | Akira Mori National Institute of Advanced Industrial Science and Technology, Japan, Masatomo Hashimoto Chiba Institute of Technology, Japan
12:14 6m | Talk | PyMigBench: A Benchmark for Python Library Migration | Data and Tool Showcase Track | Mohayeminul Islam University of Alberta, Ajay Jha North Dakota State University, Sarah Nadi University of Alberta, Ildar Akhmetov University of Alberta | Pre-print
12:20 6m | Talk | Determining Open Source Project Boundaries | Industry Track | Sophia Vargas Google
12:26 6m | Talk | Intertwining Communities: Exploring Libraries that Cross Software Ecosystems | Technical Papers | Kanchanok Kannee Nara Institute of Science and Technology, Raula Gaikovina Kula Nara Institute of Science and Technology, Supatsara Wattanakriengkrai Nara Institute of Science and Technology, Kenichi Matsumoto Nara Institute of Science and Technology | Pre-print
13:45 - 14:30 | Tutorial #3 | Tutorials at Meeting Room 109 | Chair(s): Alexander Serebrenik Eindhoven University of Technology
13:45 45m | Tutorial | Tutorial: Beyond the leading edge. What else is out there? | Tutorials | Tim Menzies North Carolina State University | Pre-print
13:45 - 14:30 | Software Quality | Data and Tool Showcase Track / Technical Papers at Meeting Room 110 | Chair(s): Tushar Sharma Dalhousie University
13:45 12m | Talk | Helm Charts for Kubernetes Applications: Evolution, Outdatedness and Security Risks | Technical Papers | Ahmed Zerouali Vrije Universiteit Brussel, Ruben Opdebeeck Vrije Universiteit Brussel, Coen De Roover Vrije Universiteit Brussel | Pre-print
13:57 12m | Talk | Control and Data Flow in Security Smell Detection for Infrastructure as Code: Is It Worth the Effort? | Technical Papers | Ruben Opdebeeck Vrije Universiteit Brussel, Ahmed Zerouali Vrije Universiteit Brussel, Coen De Roover Vrije Universiteit Brussel | Pre-print
14:09 12m | Talk | Method Chaining Redux: An Empirical Study of Method Chaining in Java, Kotlin, and Python | Technical Papers | Pre-print | Media Attached
14:21 6m | Talk | Snapshot Testing Dataset | Data and Tool Showcase Track
14:35 - 15:15 | Defect Prediction | Data and Tool Showcase Track / Technical Papers at Meeting Room 109 | Chair(s): Sarra Habchi Ubisoft
14:35 12m | Talk | Large Language Models and Simple, Stupid Bugs | Technical Papers | Kevin Jesse University of California at Davis, USA, Toufique Ahmed University of California at Davis, Prem Devanbu University of California at Davis, Emily Morgan University of California, Davis | Pre-print
14:47 12m | Talk | The ABLoTS Approach for Bug Localization: is it replicable and generalizable? (Distinguished Paper Award) | Technical Papers | Feifei Niu University of Ottawa, Christoph Mayr-Dorn Johannes Kepler University Linz, Wesley Assunção Johannes Kepler University Linz, Austria & Pontifical Catholic University of Rio de Janeiro, Brazil, Liguo Huang Southern Methodist University, Jidong Ge Nanjing University, Bin Luo Nanjing University, Alexander Egyed Johannes Kepler University Linz | Pre-print | File Attached
14:59 6m | Talk | LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations | Data and Tool Showcase Track | Catherine Tony Hamburg University of Technology, Markus Mutas Hamburg University of Technology, Nicolás E. Díaz Ferreyra Hamburg University of Technology, Riccardo Scandariato Hamburg University of Technology | Pre-print
15:05 6m | Talk | Defectors: A Large, Diverse Python Dataset for Defect Prediction | Data and Tool Showcase Track | Parvez Mahbub Dalhousie University, Ohiduzzaman Shuvo Dalhousie University, Masud Rahman Dalhousie University | Pre-print
14:35 - 15:15 | Human Aspects | Technical Papers / Data and Tool Showcase Track at Meeting Room 110 | Chair(s): Alexander Serebrenik Eindhoven University of Technology
14:35 12m | Talk | A Study of Gender Discussions in Mobile Apps | Technical Papers | Mojtaba Shahin RMIT University, Australia, Mansooreh Zahedi The University of Melbourne, Hourieh Khalajzadeh Deakin University, Australia, Ali Rezaei Nasab Shiraz University | Pre-print
14:47 12m | Talk | Tell Me Who Are You Talking to and I Will Tell You What Issues Need Your Skills | Technical Papers | Fabio Marcos De Abreu Santos Northern Arizona University, USA, Jacob Penney Northern Arizona University, João Felipe Pimentel Northern Arizona University, Igor Wiese Federal University of Technology, Igor Steinmacher Northern Arizona University, Marco Gerosa Northern Arizona University | Pre-print
14:59 6m | Talk | She Elicits Requirements and He Tests: Software Engineering Gender Bias in Large Language Models | Technical Papers | Pre-print | Media Attached
15:05 6m | Talk | GitHub OSS Governance File Dataset | Data and Tool Showcase Track | Yibo Yan University of California, Davis, Seth Frey University of California, Davis, Amy Zhang University of Washington, Seattle, Vladimir Filkov University of California at Davis, USA, Likang Yin University of California at Davis | Pre-print
15:45 - 17:30 | Closing Session | Vision and Reflection / MSR Awards at Meeting Room 109 | Chair(s): Patanamon Thongtanunam The University of Melbourne
15:45 20m | Talk | MSR 2023 Doctoral Research Award | MSR Awards | Eman Abdullah AlOmar Stevens Institute of Technology
16:05 30m | Talk | Open Source Software Digital Sociology: Quantifying and Understanding Large Complex Open Source Ecosystems | Vision and Reflection | Minghui Zhou Peking University
16:35 30m | Talk | Human-Centered AI for SE: Reflection and Vision | Vision and Reflection | David Lo Singapore Management University
17:05 25m | Day closing | Closing | MSR Awards | Emad Shihab Concordia University
Accepted Papers
- An Empirical Study to Investigate Collaboration Among Developers in Open Source Software (OSS) | Mining Challenge | Pre-print
- Evolution of the Practice of Software Testing in Java Projects | Mining Challenge | Pre-print
- Insights into Female Contributions in Open-Source Projects | Mining Challenge | Pre-print
- Keep the Ball Rolling: Analyzing Release Cadence in GitHub Projects | Mining Challenge | Pre-print
- The Secret Life of CVEs | Mining Challenge | Pre-print
Call for Mining Challenge Proposals
The International Conference on Mining Software Repositories (MSR) has hosted a mining challenge since 2006. With this challenge, we call upon everyone interested to apply their tools to a common dataset, daring researchers and practitioners alike to put their mining tools and approaches to the test.
One of the secret ingredients behind the success of the International Conference on Mining Software Repositories (MSR) is its annual Mining Challenge, in which MSR participants can showcase their techniques, tools, and creativity on a common data set. In true MSR fashion, this data set is a real data set contributed by researchers in the community, solicited through an open call. There are many benefits of sharing a data set for the MSR Mining Challenge. The selected challenge proposal explaining the data set will appear in the MSR 2023 proceedings, and the challenge papers using the data set will be required to cite the challenge proposal or an existing paper of the researchers about the selected data set. Furthermore, the authors of the data set will join the MSR 2023 organizing committee as Mining Challenge (co-)chair(s), who will manage the reviewing process (e.g., recruiting a Challenge PC, managing submissions and review assignments). Finally, it is not uncommon for challenge data sets to feature in MSR and other publications well after the edition of the conference in which they appear!
If you would like to submit your data set for consideration for the 2023 MSR Mining Challenge, please submit a short proposal (1-2 pages plus appendices, if needed) at https://msr2023-challenge-proposals.hotcrp.com/, containing the following information:
- Title of data set.
- High-level overview:
- Short description, including what types of artifacts the data set contains.
- Summary statistics (how many artifacts of different types).
- Internal structure:
- How are the data structured and organized?
- (Link to) Schema, if applicable
- How to access:
- How can the data set be obtained?
- What are recommended ways to access it? Include examples of specific tools, shell commands, etc., if applicable.
- What skills, infrastructure, and/or credentials would challenge participants need to effectively work with the data set?
- What kinds of research questions do you expect challenge participants could answer?
- A link to a (sub)sample of the data for the organizing committee to peruse (e.g., via GitHub, Zenodo, Figshare).
Each submission must conform to the IEEE Conference Proceedings Formatting Guidelines: title in 24pt font and full text in 10pt type; LaTeX users must use \documentclass[10pt,conference]{IEEEtran} without including the compsoc or compsocconf options. For more information see: https://www.ieee.org/conferences/publishing/templates.html
The first task of the authors of the selected proposal will be to prepare the Call for Challenge Papers, which outlines the expected content and structure of submissions, as well as the technical details of how to access and analyze the data set. This call will be published on the MSR website on August 15th. By making the challenge data set available by late summer, we hope that many students will be able to use the challenge data set for their graduate class projects in the Fall semester.
Important Dates
- Deadline for proposals: July 18th, 2022
- Notification: July 28th, 2022
- Call for Challenge Papers Published: August 15th, 2022
Live Webinar & Kickoff Session
Welcome to the MSR 2023 mining challenge featuring “World of Code”!
To start things off we would like to invite you to a kick-off session on Thursday October 27 from 2pm to 4pm UTC (2:00 to 4:00 AoE). The session will be held via Zoom:
- Link: https://ut-ee.zoom.us/j/99774559002?pwd=Ym9pakZYaFVKcDhtdFJuWlcxM2NWQT09
- Meeting ID: 997 7455 9002
- Passcode: 585089
The kick-off starts with a webinar during which you will learn about the basic structure of World of Code and start using it. Afterwards, you will have the chance to ask questions, present a project idea or research question you would like to work on for the mining challenge, or find someone with a great idea whom you would like to collaborate with. Please find the detailed schedule below.
To prepare for the kick-off please:
- Check if you can log in to da0 per the instructions at https://github.com/woc-hack/tutorial (a sketch of the login command appears after this list). To log in, please use the user name you requested on the registration form. If you have any issues, please contact Audris (audris@utk.edu).
- Join the World of Code Discord server using the following link. There you can talk to your fellow challenge participants or ask questions related to World of Code: https://discord.gg/fKPFxzWqZX
- For those of you who have a project idea please add ONE slide to the following slide deck by creating a copy of the empty template slide and filling it out. During the kick-off you will then have 1 minute to present your idea and potentially find other participants to join you: https://docs.google.com/presentation/d/1GiJMnF359OFd74pV95h5Y1At0WCYJpJdYbL8-lmrli8/edit?usp=sharing
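For orientation, the login typically looks like the sketch below. This is illustrative only: the tutorial linked above is authoritative, USERNAME is a placeholder, and the tutorial may specify additional options.

```
# Illustrative only -- follow https://github.com/woc-hack/tutorial for the
# authoritative login instructions. USERNAME is the account name you
# requested on the registration form; the tutorial may specify additional
# options (e.g., a non-default port).
ssh USERNAME@da0.eecs.utk.edu
```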
Schedule (all times are in UTC)
02:00 - 02:45pm: webinar
02:45 - 03:00pm: Q&A session
03:00 - 03:10pm: break
03:10 - 03:20pm: project pitches
03:20 - 03:50pm: discussion in breakout groups and team formation
03:50 - 04:00pm: wrap-up
And don’t worry if you cannot make it to the kick-off. We will record the entire session and share it with you afterwards so that you can rewatch it in your own time.
Also, if you have any questions about the kick-off or World of Code, please get in touch with us via email or, even better, via Discord.
We look forward to seeing you soon!
Audris, Alex and Jim
Call for Mining Challenge Papers
The International Conference on Mining Software Repositories (MSR) has hosted a mining challenge since 2006. With this challenge, we call upon everyone interested to apply their tools to a common dataset, daring researchers and practitioners alike to put their mining tools and approaches to the test.
This year, the mining challenge is about the Global Software Supply Chain (GSSC) data, a giant dataset and an accompanying World of Code (WoC) infrastructure that collects, curates, and cross-references data from nearly all public version control systems.
GSSC data version U was collected from updated and new repositories on GitHub, GitLab, Bitbucket, and dozens of other forges identified during Oct 20-30, 2021, with the git objects retrieved by Nov 28. The 173M git repositories contain over 3.1B commits, 12.6B trees, and 12.5B blobs. In this challenge, participants can use version U of the dataset. The entire dataset occupies over 250TB, so copying it in full would be prohibitive. Participants of the mining challenge should, instead, use the Digital Archeology cluster, which provides ample storage and computational resources either to conduct their analysis or to pre-filter a subset of the dataset for their research contribution. The cluster comprises six powerful servers running Red Hat Linux, with RAM ranging from 360GB to 1.4TB.
About WoC Infrastructure
The primary data items in WoC are git objects retrieved from git repositories. These include commits (transactions representing changes to the source code), trees (representing folder structure), blobs (versions of the source files), and tags (specific commits identified as releases).
The remaining data in WoC are derived from these primary (or level 1) objects. Computation of level 2 data involves cross-referencing of (producing a graph over) primary git objects with projects, authors, files, and packages used in individual blobs, calculation of blobs created by a commit, and so forth. Level 3 data involves modeling steps used to defork projects and alias git author IDs (see below). These curated entities are also cross-referenced as in level 2. Finally, at level 4, various statistics concerning projects, authors and APIs (based on import / include / package statements) are calculated.
Many of these computations are extremely complex and CPU and/or memory intensive, thus the results are stored to speed up query and analysis downstream. Cross-referencing, for example, allows immediate access to all object references associated with any specific object (and investigation of the supply chain specific to that object). For example, all commits that have created a specific blob, all repositories where a specific blob or a specific commit resides, all commits for a specified author ID, the child commits of a specific commit, and other links in the supply chain that cannot be computed without complete data. Below, we discuss some of the data levels in more detail.
Level 2. Cross-references or maps are calculated based on the content or origin of git objects and include, among others, forward and backward links between commits and projects, blobs created by the commit, authors of the commits, time, and so on. These maps represent edges of the various software supply chains.
Level 3. Curation provides a map between the original and curated values and includes the same cross-referencing for the curated entities as for the uncurated entities in level 2. The primary areas of curation include deforking of projects, aliasing author IDs, computing additional attributes for blobs by parsing file content to identify dependencies in 17 programming languages, and inferring developer gender using a leading commercial service (Namsor).
Level 4: Summary Level. The summary level dataset focuses on three entities: projects, developers, and APIs, and summarizes each in a MongoDB collection to simplify queries. For example, each project, author, and API has activity date ranges, monthly activities, core teams (developers responsible for 80% of all commits), and numerous other attributes precalculated and stored. The intention of this summary level is to enable natural experiments and representative sampling of projects, authors, and APIs.
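As a purely hypothetical sketch of how such a summary collection might be queried for representative sampling: the database, collection, and field names below (WoC, proj_summary, NumAuthors) are invented for illustration and are not the real WoC schema; consult the WoC documentation for the actual names.

```
# Hypothetical sketch only: database/collection/field names are
# illustrative assumptions, not the real WoC schema.
mongo WoC --quiet --eval '
  db.proj_summary.aggregate([
    { $match:  { NumAuthors: { $gte: 10 } } },  // filter by a precomputed attribute
    { $sample: { size: 100 } }                  // random sample for a natural experiment
  ]).forEach(printjson)
'
```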
For portability, ease of access, and to improve the performance of operations sweeping the entire dataset, all datasets (except the summary and raw levels) are also provided as flat files sorted and partitioned by key, to facilitate the use of common Unix command-line tools such as grep, sed, awk, join, sort, and uniq. Thus, a common Unix tool-chain can be used to calculate transitive or more complicated relationships or to conduct any other downstream analysis of WoC data.
Fast access to an arbitrary value by key for all cross-references is provided by the getValues Unix command created for WoC. This command follows Unix conventions by reading from standard input and writing to standard output, with the type of map, for example author to blob (a2b) or aliased author to blob (A2b), provided as the parameter. Queries involving under one million keys can be done via getValues; larger queries can be computed more efficiently using the Unix join command on the corresponding flat files.
Level 1 data containing git objects also offers fast access to content, keyed by the git sha1 of the object, via the showCnt command. showCnt also reads keys from standard input and writes results to standard output. The getValues and showCnt commands hide the internal structure of the underlying datasets, which are all partitioned into 32 or 128 partitions based on their size.
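To make the stdin/stdout conventions above concrete, here is a minimal sketch of both commands. The map name A2b comes from the description above; the author ID and sha1 are placeholders, not real WoC keys.

```
# Minimal sketch following the conventions described above.
# The author ID and the sha1 are placeholders, not real WoC keys.

# Blobs created by an aliased author (map A2b: aliased author -> blob):
echo 'Jane Doe <jane@example.com>' | getValues A2b | head

# Content of a git object (here, a commit) by its sha1:
echo 'e4af89166a17785c1d741b8b1d5775f3223f510f' | showCnt commit
```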
Challenge
The challenge is open-ended: participants can choose the research questions that they find most interesting. Our suggestion is to consider problems that are not centered on a specific project or a set of projects but, instead, would exploit the completeness, curation, and cross-referencing capabilities of WoC. Some thoughts are below.
- WoC is designed to measure three types of software supply chains: code dependencies, code copying, and author-code knowledge transfer. All three pose unique risks and benefits; investigate them. Software supply chains underlie many topical questions, such as vulnerabilities, code provenance, the bill of materials, and many others.
- Research questions that require constructing, sampling, or analyzing the global network of source code, APIs, people, and projects, and filtering subsets by time or content. For example, you could investigate where a particular piece of code came from, where and when a particular API was introduced, what projects or people use it, and what projects a particular developer worked on (a minimal provenance-tracing sketch follows this list).
- Determine the global context. A traditional MSR analysis tends to focus on a specific set of projects, as only project-specific data needs to be obtained. Often, critical context of the elements in such datasets is lost, such as actions of developers, activities associated with the code, and usage of APIs external to the specific set of projects. WoC allows recovery and quantification of such global context.
- Avoid convenience sampling. Level-4 data provides detailed summaries of projects, APIs, and developers and could serve as a basis for selecting the samples needed to conduct many kinds of natural experiments.
- Exploit curation at global scale. The curation level in WoC solves common MSR headaches by aliasing author IDs and deforking projects based on shared commits.
- Link/enhance the WoC dataset itself. Participants are encouraged to combine WoC with other data and to include the code for collecting and linking the external data, as well as suggestions on how this data could be permanently integrated into WoC.
- Unlike a static database, WoC enables reconstruction of past states of the entire body of open source software. Many contemporary quality, lead time, effort, and task prediction models need the ability to reconstruct past states to avoid the pervasive problem of “data leakage.”
- The dataset provides ready data for questions such as why and how developers decide to reuse pre-existing software, and which type of supply chain they choose (technical dependencies, copy-based reuse, or reuse of ideas).
- The ability to reconstruct past states of OSS allows finding answers to questions such as “how to produce a widely used framework or library?” or “how to reduce the risk from changes in upstream projects, and how to reduce the risk for downstream projects?”
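As a hedged illustration of the provenance question in the second bullet above: assuming maps named b2c (blob to commits) and c2P (commit to deforked project) exist alongside the a2b/A2b maps described earlier (verify the exact map names and output format against the WoC documentation), tracing a blob might look like this:

```
# Hedged sketch: where did this blob come from? The sha1 is a placeholder,
# and the map names b2c / c2P are assumptions following the a2b naming
# convention described above.
BLOB=05fefe9dfa1c8ef0f817e3e8cd2a15d8dad76df9

# All commits that created this blob:
echo "$BLOB" | getValues b2c

# Candidate (deforked) projects containing those commits, assuming the
# usual key;value output format:
echo "$BLOB" | getValues b2c | cut -d';' -f2 | getValues c2P | sort -u
```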
We ask the participants to carefully consider any ethical implications that stem from using the WoC data and other data sources and explicitly discourage the public exposure of personally identifiable information.
How to Participate in the Challenge
First, familiarize yourself with the WoC infrastructure:
- The details about the WoC infrastructure and the data is provided in our EMSE paper (https://mockus.org/papers/WoC_EMSE.pdf)
- Sign up for the account on WoC cluster (https://forms.gle/vixXjmjocBzX3BBB6). Typical turnaround time for account creation is one working day.
- Go over the self-paced tutorial (https://github.com/woc-hack/tutorial)
Finally, use the dataset to answer your research questions and report your findings in a four-page challenge paper submitted to our challenge track (see information below). If your paper is accepted, present your results at MSR 2023 in Melbourne, Australia!
You can also join the WoC community, get support and find others to collaborate with. To do so:
- Indicate your interest in the live webinar and kick-off session in the account signup form (https://forms.gle/vixXjmjocBzX3BBB6).
- Join the live tutorial in late October or early November.
- Join the WoC Discord server (https://discord.gg/22mSc842Wb) and the mining challenge channel to discuss with others.
- Create a new issue in case of problems or suggestions for improvements: https://github.com/woc-hack/mining-challenge-msr-2023-/issues
Submission
A challenge paper should describe the results of your work by providing an introduction to the problem you address and why it is worth studying, the parts of the dataset you used, the approach and tools you used, your results and their implications, and conclusions. Make sure your report highlights the contributions and the importance of your work. See also our open science policy regarding the publication of software and additional data you used for the challenge.
Submissions must conform to the IEEE conference proceedings template, specified in the IEEE Conference Proceedings Formatting Guidelines (title in 24pt font and full text in 10pt type, LaTeX users must use \documentclass[10pt,conference]{IEEEtran} without including the compsoc or compsocconf options).
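For LaTeX users, a minimal skeleton matching these instructions might look as follows; the title, author block, and abstract text are placeholders, not prescribed content.

```latex
% Minimal skeleton matching the instructions above: 10pt conference
% option, no compsoc/compsocconf. Title and author are placeholders.
\documentclass[10pt,conference]{IEEEtran}
\begin{document}
\title{Your Challenge Paper Title}
\author{\IEEEauthorblockN{Anonymized for double-anonymous review}}
\maketitle
\begin{abstract}
One paragraph summarizing the question, the parts of the data used,
the approach, and the findings.
\end{abstract}
% ... body: 4 pages max, plus 1 additional page of references ...
\end{document}
```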
Submissions to the Challenge Track can be made via the submission site by the submission deadline. We encourage authors to upload their paper info early (the PDF can be submitted later) to properly enter conflicts for anonymous reviewing. All submissions must adhere to the following requirements:
- Submissions must not exceed the page limit (4 pages plus 1 additional page of references). The page limit is strict, and it will not be possible to purchase additional pages at any point in the process (including after acceptance).
- Submissions must strictly conform to the IEEE formatting guidelines (see above). Alterations of spacing, font size, and other changes that deviate from the instructions may result in desk rejection without further review.
- Submissions must not reveal the authors’ identities. The authors must make every effort to honor the double-anonymous review process. In particular, the authors’ names must be omitted from the submission and references to their prior work should be in the third person. Further advice, guidance, and explanation about the double-anonymous review process can be found in the Q&A page for ICSE 2023.
- Submissions should consider the ethical implications of the research conducted within a separate section before the conclusion.
- The official publication date is the date the proceedings are made available in the ACM or IEEE Digital Libraries. This date may be up to two weeks prior to the first day of ICSE 2023. The official publication date affects the deadline for any patent filings related to published work.
- Purchase of additional pages in the proceedings is not allowed.
Any submission that does not comply with these requirements is likely to be desk rejected by the PC Chairs without further review. In addition, by submitting to the MSR Challenge Track, the authors acknowledge that they are aware of and agree to be bound by the following policies:
- The ACM Policy and Procedures on Plagiarism and the IEEE Plagiarism FAQ. In particular, papers submitted to MSR 2023 must not have been published elsewhere and must not be under review or submitted for review elsewhere whilst under consideration for MSR 2023. Contravention of this concurrent submission policy will be deemed a serious breach of scientific ethics, and appropriate action will be taken in all such cases (including immediate rejection and reporting of the incident to ACM/IEEE). To check for double submission and plagiarism issues, the chairs reserve the right to (1) share the list of submissions with the PC Chairs of other conferences with overlapping review periods and (2) use external plagiarism detection software, under contract to the ACM or IEEE, to detect violations of these policies.
- The authorship policy of the ACM and the authorship policy of the IEEE.
Upon notification of acceptance, all authors of accepted papers will be asked to fill out a copyright form and will receive further instructions for preparing the camera-ready version of their papers. At least one author of each paper is expected to register and present the paper at the MSR 2023 conference. All accepted contributions will be published in the electronic proceedings of the conference.
This year’s mining challenge and the data can be cited as:
@inproceedings{MockusNolteHerbsleb2023WoC,
  title={MSR Mining Challenge: World of Code},
  author={Mockus, Audris and Nolte, Alexander and Herbsleb, James},
  year={2023},
  booktitle={Proceedings of the International Conference on Mining Software Repositories (MSR 2023)},
}
A preprint is available online.
Submission Site
Papers must be submitted through HotCRP: https://msr2023-challenge.hotcrp.com/
Important Dates
- Live tutorial and Kick-off session: October 2022 (exact date will be announced in September)
- Abstract Deadline: Friday, Feb 3rd, 2023
- Paper Deadline: Sunday, Feb 5th, 2023
- Author Notification: Feb 21, 2023
- Camera Ready Deadline: March 13, 2023
Open Science Policy
Openness in science is key to fostering progress via transparency, reproducibility and replicability. Our steering principle is that all research output should be accessible to the public and that empirical studies should be reproducible. In particular, we actively support the adoption of open data and open source principles. To increase reproducibility and replicability, we encourage all contributing authors to disclose:
- the source code of the software they used to retrieve and analyze the data
- the (anonymized and curated) empirical data they retrieved in addition to the WoC dataset
- a document with instructions for other researchers describing how to reproduce or replicate the results
Already upon submission, authors can privately share their anonymized data and software on archives such as Zenodo or Figshare (tutorial available here). Zenodo accepts up to 50GB per dataset (more upon request). There is no need to use Dropbox or Google Drive. After acceptance, data and software should be made public so that they receive a DOI and become citable. Zenodo and Figshare accounts can easily be linked with GitHub repositories to automatically archive software releases. In the unlikely case that authors need to upload terabytes of data, Archive.org may be used.
We recognise that anonymising artifacts such as source code is more difficult than preserving anonymity in a paper. We ask authors to take a best effort approach to not reveal their identities. We will also ask reviewers to avoid trying to identify authors by looking at commit histories and other such information that is not easily anonymised. Authors wanting to share GitHub repositories may want to look into using https://anonymous.4open.science/ which is an open source tool that helps you to quickly double-blind your repository.
We encourage authors to self-archive pre- and postprints of their papers in open, preserved repositories such as arXiv.org. This is legal and allowed by all major publishers including ACM and IEEE and it lets anybody in the world reach your paper. Note that you are usually not allowed to self-archive the PDF of the published article (that is, the publisher proof or the Digital Library version).
Please note that the success of the open science initiative depends on the willingness (and possibilities) of authors to disclose their data and that all submissions will undergo the same review process independent of whether or not they disclose their analysis code or data. We encourage authors who cannot disclose industrial or otherwise non-public data, for instance due to non-disclosure agreements, to provide an explicit (short) statement in the paper.
Best Mining Challenge Paper Award
As mentioned above, all submissions will undergo the same review process independent of whether or not they disclose their analysis code or data. However, only accepted papers for which code and data are available on preserved archives, as described in the open science policy, will be considered by the program committee for the best mining challenge paper award.
Best Student Presentation Award
Like in the previous years, there will be a public voting during the conference to select the best mining challenge presentation. This award often goes to authors of compelling work who present an engaging story to the audience. Only students can compete for this award.