Data Leakage in Notebooks: Static Detection and Better Processes (ASE 2022 - Research Papers)

Mon 10 - Fri 14 October 2022 Oakland Center, Michigan, United States

Who

Chenyang Yang, Rachel A Brower-Sinning, Grace Lewis, Christian Kästner

Track

ASE 2022 Research Papers

When

Wed 12 Oct 2022, 13:30 - 13:50 (GMT-04:00, Eastern Time) at Gold A. Technical Session 16: Software Vulnerabilities. Chair(s): Mohamed Wiem Mkaouer

Abstract

Data science pipelines to train and evaluate models with machine learning may contain bugs just like any other code. Leakage between training and test data can lead to overestimating the model’s accuracy during offline evaluations, possibly leading to deployment of low-quality models in production. Such leakage can happen easily by mistake or by following poor practices but may be tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process.
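To make the kind of bug concrete, here is a minimal sketch of one leakage pattern (fitting a preprocessor on the full dataset before splitting it) together with a toy static check for it. The abstract does not describe how the authors' analysis works, so this is not their approach: the function name `find_fit_before_split`, the name sets, and the two sample notebooks are hypothetical, and the check is only a line-order heuristic over the syntax tree with no data-flow tracking. The scikit-learn names inside the sample strings are real, but the strings are only parsed, never executed.

```python
import ast
from typing import List

# Assumption: these call names stand in for "learns from data" and "splits data".
FIT_METHODS = {"fit", "fit_transform"}
SPLIT_FUNCS = {"train_test_split"}


def _call_name(node: ast.Call) -> str:
    """Return the trailing name of a call: f(...) -> 'f', a.b.f(...) -> 'f'."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return ""


def find_fit_before_split(source: str) -> List[int]:
    """Return line numbers of fit-like calls located before the first split call.

    This compares source line numbers only, so it can miss real leaks and
    report false alarms; a real analysis would track which data flows where.
    """
    calls = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Call)]
    split_lines = [c.lineno for c in calls if _call_name(c) in SPLIT_FUNCS]
    if not split_lines:
        return []
    first_split = min(split_lines)
    return sorted(
        c.lineno
        for c in calls
        if _call_name(c) in FIT_METHODS and c.lineno < first_split
    )


# Hypothetical notebook: the scaler sees test rows before the split (leaky).
LEAKY = """\
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
"""

# Hypothetical notebook: split first, then fit the scaler on training rows only.
CLEAN = """\
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)
"""

if __name__ == "__main__":
    print("leaky notebook, flagged lines:", find_fit_before_split(LEAKY))
    print("clean notebook, flagged lines:", find_fit_before_split(CLEAN))
```

In the leaky version the scaler's statistics are computed from rows that later end up in the test set, which is one way an offline evaluation can overestimate accuracy.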

Chenyang Yang

Rachel A Brower-Sinning

Carnegie Mellon Software Engineering Institute

Grace Lewis

Carnegie Mellon Software Engineering Institute

United States

Christian Kästner

Carnegie Mellon University