NB2P: Generating Data Science Pipelines from Computational Notebooks
Computational notebooks empower data scientists to explore data, perform analytics, and share their findings. During data exploration, the scientist uses notebooks to construct and refine data pipelines that process data in multiple stages. Extracting pipelines from a given notebook is useful in understanding the notebook’s semantics and in migrating it to production systems. However, the nature of the data exploration process, and the lack of sufficient documentation in the notebook, present two challenges in extracting the pipelines. First, notebook cells can be executed in any order, making it difficult to capture the data flow between pipeline stages. Second, data transformation operations belonging to a stage may not be cleanly separated, making it difficult to extract cohesive pipeline components.
In this paper, we propose NB2P, a novel system that automatically extracts data science pipelines from notebooks. Given an input notebook, NB2P first parses it into an Abstract Syntax Tree (AST). It then performs analysis on the AST to recover the execution order, thereby addressing the first challenge above. Next, it groups the data operations into pipeline stages based on their semantics. This step is called semantic segmentation, and it addresses the second challenge using a tree-based, learned encoding-decoding algorithm that captures the data flow and fine-grained hierarchical information in the notebook. Finally, NB2P assembles the stages and constructs the final pipeline that can be deployed into production systems.
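The first two steps above (parsing cells into ASTs and recovering data flow between them) can be illustrated with a small sketch. This is not NB2P's algorithm: it is a plain def-use analysis built on Python's standard `ast` module, and the helper names `defs_and_uses` and `cell_dependencies` are hypothetical. It assumes the cells are plain Python (no IPython magics) and are listed in the order they were run.

```python
import ast


def defs_and_uses(source):
    """Return (names defined by a cell, names it reads from earlier cells).

    Statements are scanned top to bottom, so a name only counts as a "use"
    if it is read before any earlier statement of this cell has defined it.
    """
    defs, uses = set(), set()
    for stmt in ast.parse(source).body:
        loads, stores = set(), set()
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Load):
                    loads.add(node.id)
                elif isinstance(node.ctx, ast.Store):
                    stores.add(node.id)
            elif isinstance(node, (ast.Import, ast.ImportFrom)):
                for alias in node.names:
                    # "import a.b" binds "a"; "import a as x" binds "x".
                    stores.add((alias.asname or alias.name).split(".")[0])
            elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                stores.add(node.name)
        uses |= loads - defs   # reads not satisfied within this cell so far
        defs |= stores
    return defs, uses


def cell_dependencies(cells):
    """Return sorted (producer, consumer) index pairs between cells.

    A cell depends on the most recent earlier cell that defined a name it
    reads. Names never defined by any cell (e.g. builtins) produce no edge.
    """
    last_def = {}
    edges = set()
    for i, source in enumerate(cells):
        defs, uses = defs_and_uses(source)
        for name in uses:
            if name in last_def:
                edges.add((last_def[name], i))
        for name in defs:
            last_def[name] = i
    return sorted(edges)


# Hypothetical notebook cells, given as source strings in execution order.
cells = [
    "import pandas as pd",
    "df = pd.read_csv('train.csv')",
    "df = df.dropna()",
    "X = df.drop(columns=['y'])\ny = df['y']",
]
print(cell_dependencies(cells))
```

The resulting edges form a dependency graph over cells. Grouping the operations into semantically meaningful pipeline stages is the part that NB2P handles with its learned, tree-based segmentation, which this sketch does not attempt.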
We train NB2P on a large notebook corpus from Kaggle. We compare NB2P against baselines that use state-of-the-art large language models and other machine learning models for source code segmentation. The experimental results show that NB2P consistently outperforms the baselines while incurring low overhead.
Fri 17 Apr (times shown in the Brasilia, Distrito Federal, Brazil time zone)
14:00 - 15:30 | AI for Software Engineering 25 | Journal-first Papers / Research Track / New Ideas and Emerging Results (NIER) / Demonstrations | at Europa II | Chair(s): Daniel Feitosa, University of Groningen
14:00 | 15m | Talk | ArtifactSync: Automated Repository Synchronization through Hierarchical Change Impact Analysis | Demonstrations | Ebube Alor (Concordia University); João Pedro de Souza Olivo Tardivo (Universidade Estadual do Paraná); SayedHassan Khatoonabadi (Concordia University); Emad Shihab (Concordia University)
14:15 | 15m | Talk | Introducing Phylogenetics in Search-based Software Engineering: Phylogenetics-aware SBSE | Journal-first Papers | Daniel Blasco (SVIT Research Group, Universidad San Jorge); Antonio Iglesias (Universidad San Jorge); Jorge Echeverria (Universidad San Jorge); Francisca Perez (Universitat Politècnica de València); Carlos Cetina
14:30 | 15m | Talk | Automating Terraform Code Migration through Provider Evolution Knowledge | New Ideas and Emerging Results (NIER) | Pranjal Gupta (IBM Research); Pooja Aggarwal (IBM Research); Brent Paulovicks (IBM Research); Prateeti Mohapatra (IBM Research); Rong Lee (IBM Research); Vadim Sheinin (IBM Research)
14:45 | 15m | Talk | Replacing Training with Reasoning: Reinterpreting Classic ML Pipelines with LLMs | New Ideas and Emerging Results (NIER) | Marco Alecci (University of Luxembourg); Jordan Samhi (University of Luxembourg, Luxembourg); Tegawendé F. Bissyandé (University of Luxembourg); Jacques Klein (University of Luxembourg)
15:00 | 15m | Talk | NB2P: Generating Data Science Pipelines from Computational Notebooks | Research Track | Haotian Gao (National University of Singapore, Singapore and NUSRI Chongqing, China); Quang Trung Ta (National University of Singapore); Tien Tuan Anh Dinh (Deakin University, Australia); Nhut Minh Ho (National University of Singapore); Zhiyong Huang (National University of Singapore); Beng Chin Ooi (National University of Singapore, Singapore)
15:15 | 15m | Talk | Multi-Location Software Model Completion | Research Track