ESEIW 2024
Sun 20 - Fri 25 October 2024 Barcelona, Spain
Thu 24 Oct 2024 11:20 - 11:40 at Sala de graus (C4 Building) - Software testing Chair(s): Marco Torchiano

Background: Deep learning systems, which use Deep Neural Networks as their core, have become an increasingly popular type of software system in recent years. While successful, these systems have been shown to produce biased and unfair outcomes. To mitigate this, various test generators have been proposed to perform fairness testing at the model level, detecting fairness bugs before more serious issues occur.

Aims: However, much prior work assumes a perfect context and ideal conditions in the other parts of the system: the hyperparameters have been well tuned for accuracy, the bias in the sampling process has been rectified, and the label bias introduced by humans has been dealt with. Yet these assumptions are often difficult, if not impossible, to satisfy in practice due to their resource- and labour-intensive nature. We therefore need a more thorough understanding of how varying contexts affect fairness testing outcomes.

Method: To close this gap, we conduct an extensive empirical study covering 10,800 cases to investigate how contexts can change fairness testing results at the model level compared with the existing assumptions.

Results: Our results show that (1) compared with the common assumptions, non-optimized hyperparameters often make existing generators struggle more, while the presence of data bias boosts those generators; the contexts can also affect the rankings among generators. (2) Changing the context settings generally has a significant impact on the testing results. We then go one step further to investigate why these outcomes were observed, and find that: (1) varying the contexts and their settings alters the ruggedness with respect to local optima and/or the search guidance provided by the testing landscape of test adequacy; (2) there is a weak monotonic correlation between test adequacy and the fairness metric under hyperparameter changes, whereas for varying data bias, the correlation can be positive or negative, depending on the category of adequacy metric.

Conclusions: Our findings provide three key insights for practitioners to properly evaluate test generators and hint at future research directions.
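The abstract refers to model-level fairness testing (finding inputs whose prediction changes when only a protected attribute changes) and to correlating test adequacy with a fairness metric across contexts. The Python sketch below is not the authors' tooling: the names (`predict`, `idi_ratio`), the linear stand-in model, and the placeholder adequacy scores are all assumptions introduced purely to illustrate those two ideas.

```python
# Minimal sketch of (1) an individual-fairness metric at the model level and
# (2) a Spearman correlation between adequacy and fairness across "contexts".
# All data here is synthetic; this does not reproduce the study's setup.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def predict(X, w):
    """Stand-in for a trained DNN: a simple linear classifier."""
    return (X @ w > 0.0).astype(int)

def idi_ratio(X, w, protected_idx, protected_values=(0.0, 1.0)):
    """Fraction of inputs whose prediction flips when only the protected
    attribute is swapped -- a common individual-fairness measure."""
    Xa, Xb = X.copy(), X.copy()
    Xa[:, protected_idx] = protected_values[0]
    Xb[:, protected_idx] = protected_values[1]
    return float(np.mean(predict(Xa, w) != predict(Xb, w)))

# Simulate several contexts (e.g. hyperparameter settings): each yields a
# different model and a hypothetical adequacy score for the same test suite.
X_test = rng.normal(size=(500, 5))
adequacy_scores, fairness_scores = [], []
for _ in range(30):
    w = rng.normal(size=5)            # model obtained under one context
    adequacy_scores.append(rng.uniform(0.4, 1.0))  # placeholder adequacy value
    fairness_scores.append(idi_ratio(X_test, w, protected_idx=0))

rho, p = spearmanr(adequacy_scores, fairness_scores)
print(f"Spearman correlation between adequacy and fairness: {rho:.2f} (p={p:.3f})")
```

With real measurements in place of the placeholders, the sign and strength of the reported correlation is what the abstract's second finding examines under hyperparameter and data-bias changes.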

Thu 24 Oct

Displayed time zone: Brussels, Copenhagen, Madrid, Paris

11:00 - 12:30
11:00
20m
Full-paper
Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?
ESEM Technical Papers
Triet Le (The University of Adelaide), Muhammad Ali Babar (School of Computer Science, The University of Adelaide)
11:20
20m
Full-paper
Contexts Matter: An Empirical Study on Contextual Influence in Fairness Testing for Deep Learning Systems
ESEM Technical Papers
Chengwen Du (University of Birmingham), Tao Chen (University of Birmingham)
11:40
20m
Full-paper
Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?
ESEM Technical Papers
Triet Le (The University of Adelaide), Muhammad Ali Babar (School of Computer Science, The University of Adelaide)
12:00
15m
Industry talk
From Literature to Practice: Exploring Fairness Testing Tools for the Software Industry Adoption
ESEM IGC
Thanh Nguyen (University of Calgary), Maria Teresa Baldassarre (Department of Computer Science, University of Bari), Luiz Fernando de Lima, Ronnie de Souza Santos (University of Calgary)
Pre-print
12:15
15m
Vision and Emerging Results
Do Developers Use Static Application Security Testing (SAST) Tools Straight Out of the Box? A large-scale Empirical Study
ESEM Emerging Results, Vision and Reflection Papers Track
Gareth Bennett (Lancaster University), Tracy Hall (Lancaster University), Steve Counsell (Brunel University London), Emily Winter (Lancaster University), Thomas Shippey (LogicMonitor)