ESEIW 2024
Sun 20 - Fri 25 October 2024 Barcelona, Spain

Background: Deep learning systems, which use Deep Neural Networks at their core, have become an increasingly popular type of software system in recent years. While successful, these systems have been shown to produce biased and unfair outcomes. To mitigate such issues, various test generators have been proposed to perform fairness testing at the model level, detecting fairness bugs before more serious problems occur.

Aims: However, much existing work assumes a perfect context and ideal conditions in the other parts of the pipeline: the hyperparameters have been well tuned for accuracy, the bias in the sampling process has been rectified, and the label bias introduced by humans has been dealt with. Yet these assumptions are often difficult, if not impossible, to satisfy because of their resource- and labour-intensive nature, and hence may not hold in practice. We therefore need a more thorough understanding of how varying contexts affect fairness testing outcomes.

Method: To close this gap, we conduct an extensive empirical study covering 10,800 cases to investigate how contexts can change fairness testing results at the model level, in contrast to the existing assumptions.

Results: Our results show that (1) compared with the common assumptions, non-optimized hyperparameters often make existing generators struggle more, while the presence of data bias boosts those generators; the contexts can also change the rankings among generators. (2) Changing the context settings generally has a significant impact on the testing results. We then go one step further to investigate why these outcomes are observed, and find that: (1) varying the contexts and their settings alters the ruggedness with respect to local optima and/or the search guidance provided by the testing landscape of test adequacy; (2) there is a weak monotonic correlation between test adequacy and the fairness metric under hyperparameter changes, while for varying data bias the correlation can be positive or negative, depending on the category of adequacy metric.

Conclusions: Our findings provide three key insights for practitioners to properly evaluate test generators, and hint at future research directions.
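
The following toy sketch (not part of the paper) illustrates the two ingredients the abstract refers to: model-level fairness testing that flips a protected attribute and checks whether predictions change, and a monotonic (Spearman) correlation between a test-adequacy proxy and the fairness metric. Every name, metric, and threshold below is an illustrative assumption, not the paper's generators, benchmarks, or adequacy measures.

# Illustrative sketch only: toy model-level fairness testing plus an
# adequacy-fairness correlation, under assumed data, model, and metrics.
import numpy as np
from scipy.stats import spearmanr
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy tabular data: column 0 is a binary protected attribute (e.g. sex).
X = rng.random((2000, 5))
X[:, 0] = rng.integers(0, 2, size=2000)
# Inject label bias through the protected attribute (assumption for this demo).
y = (X[:, 1] + 0.3 * X[:, 0] > 0.8).astype(int)

model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X, y)

def discrimination_rate(model, inputs, protected_idx=0):
    """Fraction of inputs whose prediction flips when only the protected attribute is flipped."""
    flipped = inputs.copy()
    flipped[:, protected_idx] = 1 - flipped[:, protected_idx]
    return np.mean(model.predict(inputs) != model.predict(flipped))

# Random test generation at the model level (a stand-in for the studied generators).
tests = rng.random((1000, 5))
tests[:, 0] = rng.integers(0, 2, size=1000)
print("individual discrimination rate:", discrimination_rate(model, tests))

# Correlate a crude adequacy proxy (prediction-confidence spread per batch)
# with the fairness metric across batches; rho is only meaningful if the
# metric actually varies across batches (it may not in this toy setup).
adequacy, fairness = [], []
for batch in np.array_split(tests, 20):
    proba = model.predict_proba(batch)[:, 1]
    adequacy.append(proba.std())                  # assumed adequacy proxy
    fairness.append(discrimination_rate(model, batch))
rho, p = spearmanr(adequacy, fairness)
print(f"Spearman rho={rho:.2f} (p={p:.2f})")

In the study itself, the generators, test adequacy metrics, and fairness metrics are those defined in the paper; the sketch above only shows the general shape of such an analysis.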