As Deep Learning (DL) systems become increasingly pervasive in safety-critical and high-impact domains, the need for effective techniques to test, localise, and repair faults in Deep Neural Networks (DNNs) has never been greater. Over the past few years, numerous fault localisation (FL) and repair approaches have been proposed, leveraging both static and dynamic analyses, as well as rule-based heuristics. However, a fundamental question remains: how effective and reliable are these techniques in practice?
In this talk, I will present a comprehensive empirical investigation into the current state of fault localisation and repair for DL systems. First, I will discuss a large-scale comparative evaluation of state-of-the-art FL techniques, conducted on a benchmark comprising both real-world faults collected from bug reporting platforms and faults generated via mutation testing. Our findings reveal that current techniques struggle to achieve strong and consistent performance when evaluated against a single human-defined ground truth, raising concerns about how effectiveness is currently assessed. Next, I will examine the broader ecosystem of DL fault localisation and repair techniques, highlighting their strengths and limitations. I will then present an empirical study investigating whether Large Language Models (LLMs) can effectively localise and repair faults in DL systems. Our evaluation shows that LLMs demonstrate strong performance compared to existing approaches, suggesting that they may offer a promising direction for advancing automated DL debugging. Finally, I will address a critical but often overlooked issue: the realism and reproducibility of the existing benchmarks used to evaluate DL fault localisation and repair approaches. Through a manual analysis of hundreds of reported faults across widely used benchmarks, we find that only a limited subset satisfies strong realism criteria, and reproducibility remains a significant challenge. These findings raise important concerns about current evaluation practices and underscore the need for more rigorous assessment methodologies.
Mon 13 Apr (displayed time zone: Brasilia, Distrito Federal, Brazil)
09:00 - 10:30 | Session 1: Opening & Keynote 1 | AST 2026 at Oceania VI
Chair(s): Markus Borg (CodeScene), Breno Miranda (Federal University of Pernambuco), Ana Paiva (INESC TEC, Faculty of Engineering, University of Porto), Andy Zaidman (TU Delft)
9:00 - Opening of AST 2026 by Organizers
9:20 - Keynote by Gunel Jahangirova, King’s College London
Title: Deep Learning Fault Localisation and Repair: Benchmarks, Limitations, and the Role of LLMs
Bio: Gunel Jahangirova is a Lecturer (Assistant Professor) at King’s College London (KCL), United Kingdom. Prior to joining KCL, she was a Postdoctoral Researcher at Università della Svizzera Italiana (USI) in Lugano, Switzerland. She obtained her PhD through a joint programme between Fondazione Bruno Kessler (FBK) in Trento, Italy, and University College London (UCL), UK. Her research focuses on the automatic generation and evaluation of test oracles, error propagation in software systems, testing of deep learning systems, oracle design and quality metrics for autonomous vehicles, and the application of artificial intelligence to software engineering tasks.
09:00 | 45m | Talk | Session 1: Opening & Keynote | AST 2026 | Markus Borg (CodeScene), Breno Miranda (Federal University of Pernambuco), Ana Paiva (INESC TEC, Faculty of Engineering, University of Porto), Andy Zaidman (TU Delft), Gunel Jahangirova (King's College London)