Exploring the Jupyter Ecosystem: An Empirical Study of Bugs and Vulnerabilities
Background. Jupyter notebooks are one of the main tools used by data scientists. Notebooks include features (configuration scripts, markdown, images, etc.) that make them harder to analyze than traditional software. As a result, existing software engineering models, tools, and studies do not capture the unique behavior of notebooks.
Aims. This paper aims to provide a large-scale empirical study of bugs and vulnerabilities in the notebook ecosystem.
Method. Our quantitative analysis of two sources of notebooks (GitHub and Kaggle) indicates that, because notebooks combine configuration scripts, Python code, documentation, and output in a single document, they are subject to many unique types of bugs that make notebook projects hard to maintain. In addition, we propose a new taxonomy of bugs in Jupyter Notebooks, derived from a qualitative analysis.
Results. Our findings highlight that configuration issues are among the most common bugs in notebook documents, followed by incorrect API usage. Finally, we explore common vulnerabilities associated with popular deployment frameworks to better understand the risks associated with notebook development.
Conclusions. This work highlights that notebooks are less well supported than traditional software, which results in more complex code, misconfiguration, and poor maintenance.