Data Leakage in Notebooks: Static Detection and Better Processes
Data science pipelines that train and evaluate machine learning models may contain bugs, just like any other code. Leakage between training and test data can lead to overestimating the model’s accuracy during offline evaluation, which may result in low-quality models being deployed to production. Such leakage can happen easily by mistake or by following poor practices, but it may be tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process.
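To make the idea concrete, here is a minimal sketch of one common form of leakage (fitting a preprocessing step on all rows before the train/test split) together with a toy syntactic check for it. This is not the analysis the paper develops. It is a simple line-order heuristic built on Python's standard `ast` module, and the notebook snippets, the `FIT_METHODS` set, and the function names are all hypothetical, chosen only for illustration.

```python
import ast

# Hypothetical notebook code: the scaler is fitted on all rows BEFORE the
# split, so statistics of the future test rows leak into the training data.
LEAKY_SOURCE = """
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
"""

# Hypothetical fixed version: split first, then fit on the training rows only.
CLEAN_SOURCE = """
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
"""

# Assumption: these method names are treated as "learns from data" calls.
FIT_METHODS = {"fit", "fit_transform"}


def called_name(call: ast.Call):
    """Return the bare function or method name of a call, if it has one."""
    func = call.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def find_fit_before_split(source: str):
    """Return line numbers of fit-like calls that precede the first split."""
    calls = [
        (node.lineno, called_name(node))
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
    ]
    split_lines = [line for line, name in calls if name == "train_test_split"]
    if not split_lines:
        return []
    first_split = min(split_lines)
    return sorted(
        line for line, name in calls
        if name in FIT_METHODS and line < first_split
    )


if __name__ == "__main__":
    print("leaky:", find_fit_before_split(LEAKY_SOURCE))
    print("clean:", find_fit_before_split(CLEAN_SOURCE))
```

The source is only parsed, never executed, so the undefined `X` and `y` in the snippets are harmless. A heuristic this simple ignores data flow entirely: it would flag a `fit` call on unrelated data and miss leakage that does not go through `train_test_split`, which is part of why a real detector needs more than line ordering.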
Wed 12 Oct | Displayed time zone: Eastern Time (US & Canada)
13:30 - 15:30 | Technical Session 16 - Software Vulnerabilities | Research Papers / Journal-first Papers | at Gold A | Chair(s): Mohamed Wiem Mkaouer (Rochester Institute of Technology)
13:30 | 20m | Research paper | Data Leakage in Notebooks: Static Detection and Better Processes | Research Papers | Chenyang Yang, Rachel A Brower-Sinning (Carnegie Mellon Software Engineering Institute), Grace Lewis (Carnegie Mellon Software Engineering Institute), Christian Kästner (Carnegie Mellon University)
13:50 | 20m | Research paper | GLITCH: Automated Polyglot Security Smell Detection in Infrastructure as Code | Research Papers (virtual) | Nuno Saavedra (INESC-ID and IST, University of Lisbon), João F. Ferreira (INESC-ID and IST, University of Lisbon)
14:10 | 20m | Paper | SafeDrop: Detecting Memory Deallocation Bugs of Rust Programs via Static Data-Flow Analysis | Journal-first Papers (virtual) | Mohan Cui (Fudan University), Chengjun Chen (Fudan University), Hui Xu (Fudan University), Yangfan Zhou (Fudan University)
14:30 | 20m | Research paper | Precise (Un)Affected Version Analysis for Web Vulnerabilities | Research Papers (virtual) | Shi Youkun (Fudan University), Yuan Zhang (Fudan University), Tianhan Luo (Fudan University), Xiangyu Mao (Fudan University), Min Yang (Fudan University)
14:50 | 20m | Research paper | Leveraging Practitioners' Feedback to Improve a Security Linter | Research Papers (virtual) | Sofia Reis (Instituto Superior Técnico, U. Lisboa & INESC-ID), Rui Abreu (Faculty of Engineering, University of Porto, Portugal), Marcelo d'Amorim (Federal University of Pernambuco), Daniel Fortunato (INESC-ID, University of Porto)
15:10 | 20m | Research paper | Insight: Exploring Cross-Ecosystem Vulnerability Impacts | Research Papers (virtual) | Meiqiu Xu (Northeastern University, China), Ying Wang (Northeastern University, China), Shing-Chi Cheung (Hong Kong University of Science and Technology), Hai Yu (Northeastern University, China), Zhiliang Zhu (Northeastern University, China)