LeakageDetector 2.0: Analyzing Data Leakage in Jupyter-Driven Machine Learning Pipelines
In software development environments, code quality is crucial. This study aims to assist Machine Learning (ML) engineers in improving their code by identifying and correcting Data Leakage issues in their models. Data Leakage occurs when information from the test dataset is inadvertently included in the training data while preparing a data science model, resulting in misleading performance evaluations. ML developers must carefully separate their data into training, evaluation, and test sets to avoid introducing Data Leakage into their code. In this paper, we present a new Visual Studio Code (VS Code) extension, called LEAKAGEDETECTOR, that detects Data Leakage (mainly overlap, preprocessing, and multi-test leakage) in Jupyter Notebook files. Beyond detection, we include two Quick Fix correction mechanisms: a conventional approach that repairs the leakage directly in the code, and an LLM-driven approach that guides ML developers toward best practices for building ML pipelines. The plugin and its source code are publicly available on GitHub at https://github.com/SE4AIResearch/DataLeakage_Jupyter_Notebook_Fall2024. The demonstration video can be found on YouTube: https://www.youtube.com/watch?v=uyLLaxutzsg&t=6s. The project website can be found at https://leakage-detector.vercel.app/.
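To make the preprocessing-leakage case concrete, here is a minimal sketch (hypothetical data, pure Python; not taken from the LEAKAGEDETECTOR implementation) of how fitting a normalization step on the full dataset lets test-set statistics leak into the features the model trains on, and how fitting on the training split alone avoids it:

```python
# Preprocessing leakage sketch: centering a feature column.
# The last two values belong to the held-out test set.

def mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

data = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
train, test = data[:4], data[4:]

# LEAKY: the centering constant is computed from train + test combined,
# so the test-set outliers (100.0, 200.0) influence the training features.
leaky_center = mean(data)
leaky_train = [x - leaky_center for x in train]

# CORRECT: fit the preprocessing step on the training split only,
# then apply the *same* transformation to the test split.
safe_center = mean(train)
safe_train = [x - safe_center for x in train]
safe_test = [x - safe_center for x in test]

print("leaky center:", leaky_center)  # shifted far from the train mean
print("safe center:", safe_center)    # 2.5, computed from train only
```

The same pattern applies to any fitted transformer (scalers, encoders, imputers): fit on the training split, transform every split with the fitted parameters.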
Fri 12 Sep (times shown in Auckland, Wellington time zone)
13:30 - 15:00 | Session 16 - Security 2 | Research Papers Track / Industry Track / Registered Reports / NIER Track | Case Room 2 260-057 | Chair(s): Gregorio Robles (Universidad Rey Juan Carlos)
13:30 (15m) | Understanding the Faults in Serverless Computing Based Applications: An Empirical Study | Research Papers Track | Changrong Xie (National University of Defense Technology), Yang Zhang (National University of Defense Technology, China), Xinjun Mao (National University of Defense Technology), Kang Yang (National University of Defense Technology), Tanghaoran Zhang (National University of Defense Technology)
13:45 (15m) | Security Vulnerabilities in Docker Images: A Cross-Tag Study of Application Dependencies | Research Papers Track | Hamid Mohayeji Nasrabadi (Eindhoven University of Technology), Eleni Constantinou (University of Cyprus), Alexander Serebrenik (Eindhoven University of Technology)
14:00 (15m) | Trust and Verify: Formally Verified and Upgradable Trusted Functions | Research Papers Track | Marcus Birgersson (KTH Royal Institute of Technology), Cyrille Artho (KTH Royal Institute of Technology, Sweden), Musard Balliu (KTH Royal Institute of Technology)
14:25 (10m) | MalLoc: Towards Fine-grained Android Malicious Payload Localization via LLMs | NIER Track | Tiezhu Sun (University of Luxembourg), Marco Alecci (University of Luxembourg), Aleksandr Pilgun (University of Luxembourg), Yewei Song (University of Luxembourg), Xunzhu Tang (University of Luxembourg), Jordan Samhi (University of Luxembourg, Luxembourg), Tegawendé F. Bissyandé (University of Luxembourg), Jacques Klein (University of Luxembourg) | Pre-print available
14:35 (15m) | Levels of Binary Equivalence for the Comparison of Binaries from Alternative Builds | Industry Track | Jens Dietrich (Victoria University of Wellington), Tim White (Victoria University of Wellington), Behnaz Hassanshahi (Oracle Labs, Australia), Paddy Krishnan (Oracle Labs, Australia)
14:50 (10m) | Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs | Registered Reports | Maria Camporese (University of Trento), Fabio Massacci (University of Trento; Vrije Universiteit Amsterdam)