A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts (MSR 2022 - Technical Papers)

Who

Konstantin Grotov, Sergey Titov, Vladimir Sotnikov, Yaroslav Golubev, Timofey Bryksin

Track

MSR 2022 Technical Papers

When

Thu 19 May 2022 03:07 - 03:14 at MSR Main room - odd hours - Session 8: Large-Scale Mining & Software Ecosystems Chair(s): Fiorella Zampetti, Gregorio Robles

Abstract

In recent years, Jupyter notebooks have grown in popularity in several domains of software engineering, such as data science, machine learning, and computer science education. Their popularity stems from their rich features for presenting and visualizing data; however, recent studies show that notebooks also share a number of drawbacks, such as a high number of code clones and low reproducibility.

In this work, we carry out a comparison between Python code written in Jupyter notebooks and in traditional Python scripts. We compare the code from two perspectives: structural and stylistic. In the first part of the analysis, we report the difference in the number of lines, the usage of functions, as well as various complexity metrics. In the second part, we show the difference in the number of stylistic issues and provide an extensive overview of the 15 most frequent stylistic issues in the studied mediums. Overall, we demonstrate that notebooks are characterized by lower code complexity; however, their code could be perceived as more entangled than that in scripts. As for style, notebooks tend to have 1.4 times more stylistic issues, but some of these are caused by coding practices specific to notebooks and should be considered false positives. With this research, we want to pave the way to studying specific problems of notebooks that should be addressed by the development of notebook-specific tools, and to provide various insights that can be useful in this regard.
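
As a rough illustration of the structural perspective described in the abstract, the sketch below counts non-blank lines, function definitions, and branching constructs for a piece of Python source using the standard `ast` module. It is not the authors' tooling: the function name `structural_metrics`, the metric names, and the choice of node types are assumptions made for this example only.

```python
import ast


def structural_metrics(source: str) -> dict:
    """Compute a few simple structural metrics for a Python source string.

    Hypothetical helper for illustration; the paper's actual metrics
    and tooling may differ.
    """
    tree = ast.parse(source)

    # Count lines that contain something other than whitespace.
    non_blank = [line for line in source.splitlines() if line.strip()]

    # Count function definitions (regular and async) anywhere in the tree.
    function_defs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

    # Very rough proxy for complexity: count nodes that introduce branching.
    branch_types = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.IfExp)
    branch_nodes = sum(isinstance(node, branch_types) for node in ast.walk(tree))

    return {
        "non_blank_lines": len(non_blank),
        "function_defs": len(function_defs),
        "branch_nodes": branch_nodes,
    }


# A script-style snippet that wraps its logic in a function.
script_like = """
def mean(xs):
    if not xs:
        return 0.0
    return sum(xs) / len(xs)

print(mean([1, 2, 3]))
"""

# A notebook-cell-style snippet with flat, top-level statements.
cell_like = """
xs = [1, 2, 3]
total = sum(xs)
print(total / len(xs))
"""

script_result = structural_metrics(script_like)
cell_result = structural_metrics(cell_like)
print(script_result)
print(cell_result)
```

Running the same function over code extracted from notebook cells and over script files would give per-file numbers that can then be compared in aggregate, which is the general shape of the comparison the abstract describes.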

Link to Preprint

https://arxiv.org/abs/2203.16718

DOI

https://doi.org/10.1145/3524842.3528447

Konstantin Grotov

JetBrains Research, ITMO University

Sergey Titov

JetBrains Research

Vladimir Sotnikov

JetBrains Research

Yaroslav Golubev

JetBrains Research

Russia

Timofey Bryksin

JetBrains Research; HSE University

Russia

Session Program

Thu 19 May
Displayed time zone: Eastern Time (US & Canada)

03:00 - 03:50	Session 8: Large-Scale Mining & Software Ecosystems (Technical Papers / Data and Tool Showcase Track) at MSR Main room - odd hours Chair(s): Fiorella Zampetti University of Sannio, Italy, Gregorio Robles Universidad Rey Juan Carlos

03:00 7m Talk		An Empirical Study on the Survival Rate of GitHub Projects Technical Papers Adem Ait IN3 - UOC, Javier Luis Cánovas Izquierdo IN3 - UOC, Jordi Cabot Open University of Catalonia, Spain Pre-print
03:07 7m Talk		A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts (Distinguished Paper Award) Technical Papers Konstantin Grotov JetBrains Research, ITMO University, Sergey Titov JetBrains Research, Vladimir Sotnikov JetBrains Research, Yaroslav Golubev JetBrains Research, Timofey Bryksin JetBrains Research; HSE University DOI Pre-print
03:14 7m Talk		Do Customized Android Frameworks Keep Pace with Android? Technical Papers Pei Liu Monash University, Mattia Fazzini University of Minnesota, John Grundy Monash University, Li Li Monash University
03:21 4m Talk		Lupa: A Platform for Large Scale Analysis of The Programming Language Usage Data and Tool Showcase Track Anna Vlasova JetBrains Research, Maria Tigina JetBrains Research, ITMO University, Ilya Vlasov Saint Petersburg State University, Anastasiia Birillo JetBrains Research, Yaroslav Golubev JetBrains Research, Timofey Bryksin JetBrains Research; HSE University DOI Pre-print
03:25 4m Talk		GitDelver Enterprise Dataset (GDED): An Industrial Closed-source Dataset for Socio-Technical Research Data and Tool Showcase Track Nicolas Riquet University of Namur, Xavier Devroey University of Namur, Benoît Vanderose University of Namur Pre-print
03:29 4m Talk		DaSEA – A Dataset for Software Ecosystem Analysis Data and Tool Showcase Track Petya Buchkova IT University of Copenhagen, Joakim Hey Hinnerskov IT University of Copenhagen, Kasper Olsen IT University of Copenhagen, Rolf-Helge Pfeiffer IT University of Copenhagen Pre-print Media Attached
03:33 4m Talk		Dataset: Dependency Networks of Open Source Libraries Available Through CocoaPods, Carthage and Swift PM Data and Tool Showcase Track Kristiina Rahkema University of Tartu, Dietmar Pfahl University of Tartu Pre-print Media Attached
03:37 13m Live Q&A		Discussions and Q&A Technical Papers

Information for Participants

Thu 19 May 2022 03:00 - 03:50 at MSR Main room - odd hours - Session 8: Large-Scale Mining & Software Ecosystems Chair(s): Fiorella Zampetti, Gregorio Robles
