Predicting the Understandability of Computational Notebooks through Code Metrics Analysis
Computational notebooks have become the primary coding environment for data scientists. Despite their popularity, research on the code quality of these notebooks is still in its infancy, and the code shared in them is often of poor quality. Given the importance of maintenance and reusability, it is crucial to pay attention to the understandability of notebook code and to identify the notebook metrics that play a significant role in it. The level of code understandability is a qualitative variable closely associated with the user's opinion about the code. Traditional approaches to measuring it either use limited questionnaires to review a few code pieces or rely on metadata such as likes and votes in software repositories. In our approach, we enhanced the measurement of the understandability level of Jupyter notebooks by leveraging user opinions related to code understandability within a software repository. As a case study, we started with 542,051 Kaggle Jupyter notebooks, compiled in a dataset named DistilKaggle, which we introduced in our previous research. To identify user comments associated with code understandability, we utilized a fine-tuned DistilBERT transformer. We established a user-opinion-based criterion for measuring code understandability by considering the number of comments related to code understandability, the upvotes on those comments, and the total number of views the notebook received. We refer to this criterion as User Opinion Code Understandability (UOCU), which proved much more effective than previous approaches. A hybrid approach combining UOCU with total upvotes further improved this criterion. Additionally, we trained machine learning models to classify notebook understandability solely based on notebook metrics. We collected 34 metrics for 132,723 final notebooks using the hybrid-approach criterion. Our predictive model, built using a Random Forest classifier, achieved 89% accuracy in classifying code understandability levels in computational notebooks.
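The abstract names the three inputs to the user-opinion criterion (understandability-related comments, upvotes on those comments, and notebook views) but does not state how they are combined. The following is only a minimal sketch of one way such a score could be formed. The scoring rule, the `NotebookStats` fields, the `opinion_score` function, and the sample values are all assumptions made for illustration and are not the paper's UOCU definition.

```python
from dataclasses import dataclass


@dataclass
class NotebookStats:
    """Hypothetical per-notebook inputs; field names are illustrative only."""
    understandability_comments: int  # comments flagged as understandability-related
    comment_upvotes: int             # upvotes on those flagged comments
    views: int                       # total views of the notebook


def opinion_score(stats: NotebookStats, per_views: int = 1000) -> float:
    """Assumed rule (NOT the paper's formula): flagged comments plus their
    upvotes, expressed per `per_views` views, so that a notebook is not
    rated higher merely because more people saw it."""
    if stats.views <= 0:
        return 0.0
    signal = stats.understandability_comments + stats.comment_upvotes
    return per_views * signal / stats.views


# Made-up sample values, chosen only to show the effect of view normalization.
notebooks = {
    "nb_a": NotebookStats(understandability_comments=4, comment_upvotes=6, views=2000),
    "nb_b": NotebookStats(understandability_comments=4, comment_upvotes=6, views=20000),
    "nb_c": NotebookStats(understandability_comments=0, comment_upvotes=0, views=500),
}

scores = {name: opinion_score(s) for name, s in notebooks.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this sketch, `nb_a` and `nb_b` receive the same raw feedback, but `nb_a` scores higher because that feedback came from a tenth of the audience. Whether the paper normalizes by views in this way is not stated in the abstract.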
Fri 17 Apr. Displayed time zone: Brasilia, Distrito Federal, Brazil
14:00 - 15:30 | Analytics 4 | Research Track / Journal-first Papers at Oceania I | Chair(s): Diomidis Spinellis (AUEB & TU Delft)
14:00 | 15m | Talk | Back to the Roots: Assessing Mining Techniques for Java Vulnerability-Contributing Commits | Journal-first Papers | Torge Hinrichs (Hamburg University of Technology), Emanuele Iannone (Hamburg University of Technology), Tamás Aladics (University of Szeged), Peter Hegedus (University of Szeged), Andrea De Lucia (University of Salerno), Fabio Palomba (University of Salerno), Riccardo Scandariato (Hamburg University of Technology)
14:15 | 15m | Talk | Predicting the Understandability of Computational Notebooks through Code Metrics Analysis | Journal-first Papers | Mojtaba Mostafavi (Sharif University of Technology), Alireza Asadi (Department of Computer Engineering of Sharif University of Technology), Arash Asgari (York University), Bardia Mohammadi (Sharif University of Technology), Abbas Heydarnoori (Bowling Green State University)
14:30 | 15m | Talk | How Configurable is the Linux Kernel? Analyzing Two Decades of Feature-Model History | Journal-first Papers | Elias Kuiter (University of Magdeburg), Chico Sundermann (TU Braunschweig), Thomas Thüm (TU Braunschweig), Tobias Heß (University of Ulm), Sebastian Krieter (TU Braunschweig, Germany), Gunter Saake (University of Magdeburg, Germany)
14:45 | 15m | Talk | Breaking Strong Encapsulation: A Comprehensive Study of Java Module Abuse | Research Track | Yirui He (University of California, Irvine), Yongbo Chen (University of California, Irvine), Jessy Ayala (University of California, Irvine), Yecheng Zhou (University of California, Irvine), Qiran Wang (University of California, Irvine), Joshua Garcia (University of California, Irvine)
15:00 | 15m | Talk | Causal or Correlational? A Cohort Study on the Effects of Code Smells on Class Change- and Fault-Proneness | Research Track | Sabato Nocera (University of Salerno), Sira Vegas (Universidad Politecnica de Madrid), Giuseppe Scanniello (University of Salerno), Massimiliano Di Penta (University of Sannio, Italy), Natalia Juristo (Universidad Politecnica de Madrid)
15:15 | 15m | Talk | Six Million (Suspected) Fake Stars on GitHub: A Growing Spiral of Popularity Contests, Spams, and Malware | Research Track | Hao He (Carnegie Mellon University), Haoqin Yang (Carnegie Mellon University), Philipp Burckhardt (Socket, Inc), Alexandros Kapravelos (NCSU), Bogdan Vasilescu (Carnegie Mellon University), Christian Kästner (Carnegie Mellon University)