An Alternative to Cells for Selective Execution of Data Science Pipelines
Data scientists often use notebooks to develop data science (DS) pipelines, particularly because notebooks allow them to selectively execute parts of a pipeline. However, notebooks for DS have many well-known flaws. In this paper, we focus on the following ones: (1) Notebooks can become littered with code cells that are not part of the main DS pipeline but exist solely to make decisions (e.g., listing the columns of a tabular dataset). (2) While users are allowed to execute cells in any order, not every ordering is correct, because a cell can depend on declarations from other cells. (3) After a cell is changed, that cell and all cells that depend on the changed declarations must be rerun. (4) Changes to external values necessitate partial re-execution of the notebook. (5) Since cells are the smallest unit of execution, code that is unaffected by changes can inadvertently be re-executed.
To solve these issues, we propose replacing cells as the basis for the selective execution of DS pipelines. Instead, we suggest populating a context menu for variables with actions that fit their type (such as listing columns if the variable is a tabular dataset). These actions are executed based on a data-flow analysis to ensure that dependencies between variables are respected and that results are updated properly after changes. Our solution separates pipeline code from decision-making code and automates dependency management, thus reducing clutter and the risk of errors.
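The following is a minimal sketch of how a data-flow analysis could decide what to re-execute after a change. It is not the authors' implementation: Python stands in for the pipeline language, the function names `reads_and_writes` and `statements_to_rerun` are hypothetical, and the analysis only tracks plain variable names at the top level (it does not see in-place mutation through method calls, for example).

```python
import ast


def reads_and_writes(stmt):
    """Return (names read, names written) by one top-level statement."""
    reads, writes = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # Store or Del context
                writes.add(node.id)
        # `x += 1` both reads and writes x, but the AST marks x as Store only.
        if isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            reads.add(node.target.id)
    return reads, writes


def statements_to_rerun(source, changed_index):
    """Indices of top-level statements to re-execute after one was edited.

    A statement is re-executed if it is the edited one, or if it comes later
    and reads a name whose value may have changed (a "stale" name).
    """
    stmts = ast.parse(source).body
    stale = set()
    rerun = []
    for i, stmt in enumerate(stmts):
        reads, writes = reads_and_writes(stmt)
        if i == changed_index or (i > changed_index and reads & stale):
            rerun.append(i)
            stale |= writes
        elif i > changed_index:
            # An unaffected statement overwrites the name with a fresh value.
            stale -= writes
    return rerun


# Hypothetical five-statement pipeline (statement indices 0 to 4).
pipeline = """
raw = [3, 1, 2]
cleaned = sorted(raw)
labels = ["a", "b"]
total = sum(cleaned)
report = (total, labels)
"""

print(statements_to_rerun(pipeline, 1))  # [1, 3, 4]
print(statements_to_rerun(pipeline, 2))  # [2, 4]
```

Editing `cleaned` leaves `labels` untouched, which illustrates flaw (5): with statement-level dependencies, unaffected code does not have to be re-executed just because it shares a cell with affected code.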
Fri 19 May (displayed time zone: Hobart)
13:45 - 15:15 | Software development tools | DEMO - Demonstrations / Technical Track / SEIP - Software Engineering in Practice / NIER - New Ideas and Emerging Results | Meeting Room 104 | Chair(s): Xing Hu (Zhejiang University)
13:45 (15 min talk) | Safe low-level code without overhead is practical | Technical Track
14:00 (15 min talk) | Sibyl: Improving Software Engineering Tools with SMT Selection | Technical Track | Will Leeson (University of Virginia), Matthew B Dwyer (University of Virginia), Antonio Filieri (AWS and Imperial College London)
14:15 (15 min talk) | Make Your Tools Sparkle with Trust: The PICSE Framework for Trust in Software Tools | SEIP - Software Engineering in Practice | Brittany Johnson (George Mason University), Christian Bird (Microsoft Research), Denae Ford (Microsoft Research), Nicole Forsgren (Microsoft Research), Thomas Zimmermann (Microsoft Research)
14:30 (15 min talk) | CoCoSoDa: Effective Contrastive Learning for Code Search | Technical Track | Ensheng Shi (Xi'an Jiaotong University), Wenchao Gu (The Chinese University of Hong Kong), Yanlin Wang (School of Software Engineering, Sun Yat-sen University), Lun Du (Microsoft Research Asia), Hongyu Zhang (The University of Newcastle), Shi Han (Microsoft Research), Dongmei Zhang (Microsoft Research), Hongbin Sun (Xi'an Jiaotong University)
14:45 (7 min talk) | Task Context: A Tool for Predicting Code Context Models for Software Development Tasks | DEMO - Demonstrations | Yifeng Wang (Zhejiang University), Yuhang Lin (Zhejiang University), Zhiyuan Wan (Zhejiang University), Xiaohu Yang (Zhejiang University)
14:52 (7 min talk) | Continuously Accelerating Research | NIER - New Ideas and Emerging Results | Sergey Mechtaev (University College London), Jonathan Bell (Northeastern University), Christopher Steven Timperley (Carnegie Mellon University), Earl T. Barr (University College London), Michael Hilton (Carnegie Mellon University)
15:00 (7 min talk) | An Alternative to Cells for Selective Execution of Data Science Pipelines | NIER - New Ideas and Emerging Results
15:07 (7 min talk) | pytest-inline: An Inline Testing Tool for Python | DEMO - Demonstrations | Yu Liu (University of Texas at Austin), Zachary Thurston (Cornell University), Alan Han (Cornell University), Pengyu Nie (University of Texas at Austin), Milos Gligoric (University of Texas at Austin), Owolabi Legunsen (Cornell University)