An Alternative to Cells for Selective Execution of Data Science Pipelines (ICSE 2023 - NIER - New Ideas and Emerging Results) - ICSE 2023

Sun 14 - Sat 20 May 2023 Melbourne, Australia

Who

Lars Reimann, Günter Kniesel-Wünsche

Track

ICSE 2023 NIER - New Ideas and Emerging Results

Time Zone

The program is currently displayed in (GMT+10:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Fri 19 May 2023 15:00 - 15:07 at Meeting Room 104 - Software development tools Chair(s): Xing Hu

Abstract

Data scientists often use notebooks to develop Data Science (DS) pipelines, particularly since notebooks allow them to selectively execute parts of the pipeline. However, notebooks for DS have many well-known flaws. We focus on the following ones in this paper: (1) Notebooks can become littered with code cells that are not part of the main DS pipeline but exist solely to make decisions (e.g. listing the columns of a tabular dataset). (2) While users are allowed to execute cells in any order, not every ordering is correct, because a cell can depend on declarations from other cells. (3) After making changes to a cell, this cell and all cells that depend on changed declarations must be rerun. (4) Changes to external values necessitate partial re-execution of the notebook. (5) Since cells are the smallest unit of execution, code that is unaffected by changes can inadvertently be re-executed.

To solve these issues, we propose to replace cells as the basis for the selective execution of DS pipelines. Instead, we suggest populating a context menu for variables with actions fitting their type (like listing columns if the variable is a tabular dataset). These actions are executed based on a data-flow analysis to ensure dependencies between variables are respected and results are updated properly after changes. Our solution separates pipeline code from decision-making code and automates dependency management, thus reducing clutter and the risk of making errors.
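
The dependency tracking described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal Python illustration that assumes straight-line, top-level code in which every variable is assigned exactly once. The function names `reads_and_writes` and `statements_to_rerun`, as well as the example pipeline and the functions it calls (`load`, `drop_missing`, `fit`, `summarize`), are hypothetical.

```python
import ast


def reads_and_writes(stmt):
    """Return (names read, names written) by one top-level statement."""
    reads, writes = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            else:
                reads.add(node.id)
    return reads, writes


def statements_to_rerun(source, changed_index):
    """Indices of statements that must run again after one statement changed.

    Assumption: straight-line code where each variable is assigned once,
    so a single forward pass over the statements is sufficient.
    """
    stmts = ast.parse(source).body
    stale = set()  # variables whose stored values may be out of date
    rerun = []
    for i, stmt in enumerate(stmts):
        reads, writes = reads_and_writes(stmt)
        if i == changed_index or reads & stale:
            rerun.append(i)
            stale |= writes  # everything this statement defines is now stale
    return rerun


# Hypothetical pipeline; the source is only parsed, never executed.
PIPELINE = """\
raw = load("data.csv")
clean = drop_missing(raw)
labels = load("labels.csv")
model = fit(clean, labels)
report = summarize(labels)
"""

print(statements_to_rerun(PIPELINE, 1))  # [1, 3]
```

Editing the second statement (`clean = ...`) marks only that statement and the `model = ...` statement for re-execution. The loading steps and `report` are left alone, which is finer-grained than rerunning a whole cell that happens to contain them.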

Link to Preprint

https://arxiv.org/pdf/2302.14556.pdf

Lars Reimann

University of Bonn

Germany

Günter Kniesel-Wünsche

University of Bonn

Germany

Session Program

Fri 19 May
Displayed time zone: Hobart

	13:45 - 15:15	Software development tools: DEMO - Demonstrations / Technical Track / SEIP - Software Engineering in Practice / NIER - New Ideas and Emerging Results at Meeting Room 104 Chair(s): Xing Hu Zhejiang University

	13:45 15m Talk		Safe low-level code without overhead is practical Technical Track Solal Pirelli EPFL, George Candea EPFL Pre-print
	14:00 15m Talk		Sibyl: Improving Software Engineering Tools with SMT Selection Technical Track Will Leeson University of Virginia, Matthew B Dwyer University of Virginia, Antonio Filieri AWS and Imperial College London Pre-print
	14:15 15m Talk		Make Your Tools Sparkle with Trust: The PICSE Framework for Trust in Software Tools SEIP - Software Engineering in Practice Brittany Johnson George Mason University, Christian Bird Microsoft Research, Denae Ford Microsoft Research, Nicole Forsgren Microsoft Research, Thomas Zimmermann Microsoft Research Pre-print
	14:30 15m Talk		CoCoSoDa: Effective Contrastive Learning for Code Search Technical Track Ensheng Shi Xi'an Jiaotong University, Wenchao Gu The Chinese University of Hong Kong, Yanlin Wang School of Software Engineering, Sun Yat-sen University, Lun Du Microsoft Research Asia, Hongyu Zhang The University of Newcastle, Shi Han Microsoft Research, Dongmei Zhang Microsoft Research, Hongbin Sun Xi'an Jiaotong University Pre-print
	14:45 7m Talk		Task Context: A Tool for Predicting Code Context Models for Software Development Tasks DEMO - Demonstrations Yifeng Wang Zhejiang University, Yuhang Lin Zhejiang University, Zhiyuan Wan Zhejiang University, Xiaohu Yang Zhejiang University Pre-print Media Attached
	14:52 7m Talk		Continuously Accelerating Research NIER - New Ideas and Emerging Results Sergey Mechtaev University College London, Jonathan Bell Northeastern University, Christopher Steven Timperley Carnegie Mellon University, Earl T. Barr University College London, Michael Hilton Carnegie Mellon University Pre-print
	15:00 7m Talk		An Alternative to Cells for Selective Execution of Data Science Pipelines NIER - New Ideas and Emerging Results Lars Reimann University of Bonn, Günter Kniesel-Wünsche University of Bonn Pre-print
	15:07 7m Talk		pytest-inline: An Inline Testing Tool for Python DEMO - Demonstrations Yu Liu University of Texas at Austin, Zachary Thurston Cornell University, Alan Han Cornell University, Pengyu Nie University of Texas at Austin, Milos Gligoric University of Texas at Austin, Owolabi Legunsen Cornell University