MSR 2022
Mon 23 - Tue 24 May 2022
co-located with ICSE 2022

Conducting socio-technical software engineering research on closed-source software is difficult, as most organizations do not want to give access to their code repositories. Most experiments and publications therefore focus on open-source projects, which provide only a partial view of software development communities. Yet closing the gap between the open-source and closed-source software industries is essential to increase the validity and applicability of results stemming from socio-technical software engineering research. We contribute to this effort by sharing our work in a large company with 4,800 employees. We mined 101 repositories and produced the GDED dataset, which contains socio-technical information about 106,216 commits, 470,940 file modifications and 3,471,556 method modifications made by 164 developers over the last 13 years, across various programming languages. To do so, we used GitDelver, an open-source tool we developed on top of PyDriller, and anonymized and scrambled the data to comply with legal and corporate requirements. Our dataset can be used for various purposes and provides information about code complexity, self-admitted technical debt and bug fixes, as well as temporal information. We also share our experience regarding the processing of sensitive data to help other organizations make datasets publicly available to the research community.
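As an illustration of the kind of anonymization step the abstract mentions, the sketch below replaces identifying fields of a commit record with stable keyed pseudonyms. It is a minimal example under stated assumptions, not GitDelver's actual procedure: the record layout, the field names (`author`, `file_path`, `lines_added`), the `SECRET_KEY` value and the helper functions are all hypothetical, and HMAC-SHA256 is simply one common way to obtain pseudonyms that are consistent across records.

```python
import hashlib
import hmac

# Hypothetical secret key (assumption): in practice it would be generated
# randomly, kept private, and never published with the dataset.
SECRET_KEY = b"replace-with-a-private-random-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable keyed pseudonym (truncated HMAC-SHA256)."""
    digest = hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def scrub_commit(record: dict) -> dict:
    """Return a copy of a commit record with identifying fields replaced.

    The field names are illustrative assumptions, not the GDED schema.
    """
    cleaned = dict(record)
    cleaned["author"] = "dev_" + pseudonymize(record["author"])
    cleaned["file_path"] = "file_" + pseudonymize(record["file_path"])
    return cleaned


record = {
    "author": "alice@example.com",
    "file_path": "src/billing/Invoice.java",
    "lines_added": 42,
}
print(scrub_commit(record))
```

Because the pseudonym depends only on the key and the input, the same developer or file maps to the same token in every record, so per-developer and per-file analyses remain possible while the original names stay hidden.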

Thu 19 May

Displayed time zone: Eastern Time (US & Canada)

03:00 - 03:50
Session 8: Large-Scale Mining & Software Ecosystems
Technical Papers / Data and Tool Showcase Track at MSR Main room - odd hours
Chair(s): Fiorella Zampetti University of Sannio, Italy, Gregorio Robles Universidad Rey Juan Carlos
03:00
7m
Talk
An Empirical Study on the Survival Rate of GitHub Projects
Technical Papers
Adem Ait-Fonolla IN3 - UOC, Javier Luis Cánovas Izquierdo IN3 - UOC, Jordi Cabot Open University of Catalonia, Spain
03:07
7m
Talk
A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts
Distinguished Paper Award
Technical Papers
Konstantin Grotov JetBrains Research, ITMO University, Sergey Titov JetBrains Research, Vladimir Sotnikov JetBrains Research, Yaroslav Golubev JetBrains Research, Timofey Bryksin JetBrains Research; HSE University
03:14
7m
Talk
Do Customized Android Frameworks Keep Pace with Android?
Technical Papers
Pei Liu Monash University, Mattia Fazzini University of Minnesota, John Grundy Monash University, Li Li Monash University
03:21
4m
Talk
Lupa: A Platform for Large Scale Analysis of The Programming Language Usage
Data and Tool Showcase Track
Anna Vlasova JetBrains Research, Maria Tigina JetBrains Research, ITMO University, Ilya Vlasov Saint Petersburg State University, Anastasiia Birillo JetBrains Research, Yaroslav Golubev JetBrains Research, Timofey Bryksin JetBrains Research; HSE University
03:25
4m
Talk
GitDelver Enterprise Dataset (GDED): An Industrial Closed-source Dataset for Socio-Technical Research
Data and Tool Showcase Track
Nicolas Riquet University of Namur, Xavier Devroey University of Namur, Benoît Vanderose University of Namur
03:29
4m
Talk
DaSEA – A Dataset for Software Ecosystem Analysis
Data and Tool Showcase Track
Petya Buchkova IT University of Copenhagen, Joakim Hey Hinnerskov IT University of Copenhagen, Kasper Olsen IT University of Copenhagen, Rolf-Helge Pfeiffer IT University of Copenhagen
03:33
4m
Talk
Dataset: Dependency Networks of Open Source Libraries Available Through CocoaPods, Carthage and Swift PM
Data and Tool Showcase Track
Kristiina Rahkema University of Tartu, Dietmar Pfahl University of Tartu
03:37
13m
Live Q&A
Discussions and Q&A
Technical Papers

