ICSE 2023
Sun 14 - Sat 20 May 2023 Melbourne, Australia
Thu 18 May 2023 11:00 - 11:15 at Meeting Room 103 - Code review Chair(s): Thomas LaToza

Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support.

To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions.

For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using a first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns.
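As a rough illustration of what estimating first-order transition probabilities from annotated cells can look like, the sketch below counts consecutive step pairs and normalises them per source step. It is not the authors' implementation: the function name `transition_probabilities`, the step names, and the example sequences are hypothetical, and it assumes one step label per cell in notebook order.

```python
from collections import Counter, defaultdict


def transition_probabilities(notebooks):
    """Estimate first-order Markov transition probabilities.

    `notebooks` is a list of step-label sequences, one per notebook,
    given in cell order (assumption: exactly one label per cell).
    Returns {source_step: {target_step: probability}}.
    """
    counts = defaultdict(Counter)
    for steps in notebooks:
        # Count each consecutive pair (current step -> next step).
        for current, nxt in zip(steps, steps[1:]):
            counts[current][nxt] += 1

    probs = {}
    for current, targets in counts.items():
        total = sum(targets.values())
        # Normalise outgoing counts so each row sums to 1.
        probs[current] = {nxt: n / total for nxt, n in targets.items()}
    return probs


# Hypothetical annotations, not taken from the paper's dataset.
example = [
    ["load", "preprocess", "model", "preprocess", "model", "evaluate"],
    ["load", "preprocess", "preprocess", "model"],
]
probs = transition_probabilities(example)
```

In this toy input, a transition from "model" back to "preprocess" is the kind of backward move that would count as evidence of an iterative workflow.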

We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigated the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
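For readers unfamiliar with the metric, the following sketch shows one common way to compute an F1-score across several classes. The abstract does not state which averaging scheme was used, so macro-averaging here is an assumption, as are the single-label setting, the function name `macro_f1`, and the example labels.

```python
def macro_f1(true_labels, predicted_labels):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores.

    Assumption: each cell has exactly one true and one predicted label.
    """
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four code cells, not data from the paper.
expert = ["load", "preprocess", "model", "model"]
predicted = ["load", "model", "model", "model"]
score = macro_f1(expert, predicted)
```

Macro-averaging weights every step equally regardless of how many cells carry it; a weighted or micro average would give a different number on imbalanced data.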

DOI: https://doi.org/10.1007/s10664-022-10229-z

Thu 18 May

Displayed time zone: Hobart

11:00 - 12:30
11:00
15m
Talk
Workflow analysis of data science code in public GitHub repositories
Journal-First Papers
Dhivyabharathi Ramasamy Department of Informatics, University of Zurich, Zurich, Switzerland, Cristina Sarasua Department of Informatics, University of Zurich, Zurich, Switzerland, Alberto Bacchelli University of Zurich, Abraham Bernstein Department of Informatics, University of Zurich, Zurich, Switzerland
11:15
15m
Talk
Quality Evaluation of Modern Code Reviews Through Intelligent Biometric Program Comprehension
Journal-First Papers
Haytham Hijazi CISUC, DEI, University of Coimbra, João Durães CISUC, Polytechnic Institute of Coimbra, Ricardo Couceiro University of Coimbra, Raul Barbosa CISUC, DEI, University of Coimbra, João Castelhano ICNAS, University of Coimbra, Júlio Medeiros CISUC, DEI, University of Coimbra, Miguel Castelo Branco ICNAS/CIBIT, University of Coimbra, Paulo Carvalho University of Coimbra, Henrique Madeira University of Coimbra
11:30
15m
Talk
Code Review of Build System Specifications: Prevalence, Purposes, Patterns, and Perceptions
Technical Track
Mahtab Nejati University of Waterloo, Mahmoud Alfadel University of Waterloo, Shane McIntosh University of Waterloo
11:45
15m
Talk
Please fix this mutant: How do developers resolve mutants surfaced during code review?
SEIP - Software Engineering in Practice
Goran Petrovic Google; Universität Passau, René Just University of Washington, Marko Ivanković Google; Universität Passau, Gordon Fraser University of Passau
12:00
15m
Talk
Using Large-scale Heterogeneous Graph Representation Learning for Code Review Recommendations at Microsoft
SEIP - Software Engineering in Practice
Jiyang Zhang University of Texas at Austin, Chandra Maddila Microsoft Research, Ramakrishna Bairi Microsoft Research, Christian Bird Microsoft Research, Ujjwal Raizada Microsoft Research, Apoorva Agrawal Microsoft Research, Yamini Jhawar Microsoft Research, Kim Herzig Microsoft, Arie van Deursen Delft University of Technology
12:15
7m
Talk
A mixed-methods analysis of micro-collaborative coding practices in OpenStack
Journal-First Papers
Armstrong Foundjem Queen's University, Eleni Constantinou University of Cyprus, Tom Mens University of Mons, Bram Adams Queen's University, Kingston, Ontario