ESEIW 2024
Sun 20 - Fri 25 October 2024 Barcelona, Spain
Thu 24 Oct 2024 11:20 - 11:40 at Sala de graus (C4 Building) - Software testing Chair(s): Marco Torchiano

Background: Deep learning systems, which use Deep Neural Networks as their core, have become an increasingly popular type of software system in recent years. While successful, these systems have been shown to produce biased and unfair outcomes. To mitigate this, various test generators have been proposed to perform fairness testing at the model level, detecting fairness bugs before more serious issues occur.

Aims: However, much prior work assumes a perfect context and ideal conditions in the other parts of the system: the hyperparameters have been well tuned for accuracy, the bias in the sampling process has been rectified, and the label bias introduced by humans has been dealt with. Yet these assumptions are often difficult, if not impossible, to satisfy in practice due to their resource- and labour-intensive nature. We therefore need a more thorough understanding of how varying contexts affect fairness testing outcomes.

Method: To close this gap, we conduct an extensive empirical study covering 10,800 cases to investigate how contexts can change fairness testing results at the model level compared with the existing assumptions.

Results: Our results show that (1) compared with the common assumptions, non-optimized hyperparameters often make existing generators struggle more, while the presence of data bias boosts those generators; the contexts can also affect the rankings among generators. (2) Changing the context settings generally has a significant impact on the testing results. We then go one step further to investigate why these outcomes were observed, and find that: (1) varying the contexts and their settings alters the ruggedness with respect to local optima and/or the search guidance provided by the testing landscape of test adequacy; (2) there is a weak monotonic correlation between test adequacy and the fairness metric under hyperparameter changes, whereas for varying data bias, the correlation can be positive or negative, depending on the category of adequacy metric.

Conclusions: Our findings provide three key insights for practitioners to properly evaluate test generators and hint at future research directions.
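The abstract refers to model-level fairness testing (finding inputs whose prediction changes when only a protected attribute changes) and to correlating test adequacy with a fairness metric across contexts. The Python sketch below is not the authors' tooling: the names (`predict`, `idi_ratio`), the linear stand-in model, and the placeholder adequacy scores are all assumptions introduced purely to illustrate those two ideas.

```python
# Minimal sketch of (1) an individual-fairness metric at the model level and
# (2) a Spearman correlation between adequacy and fairness across "contexts".
# All data here is synthetic; this does not reproduce the study's setup.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def predict(X, w):
    """Stand-in for a trained DNN: a simple linear classifier."""
    return (X @ w > 0.0).astype(int)

def idi_ratio(X, w, protected_idx, protected_values=(0.0, 1.0)):
    """Fraction of inputs whose prediction flips when only the protected
    attribute is swapped -- a common individual-fairness measure."""
    Xa, Xb = X.copy(), X.copy()
    Xa[:, protected_idx] = protected_values[0]
    Xb[:, protected_idx] = protected_values[1]
    return float(np.mean(predict(Xa, w) != predict(Xb, w)))

# Simulate several contexts (e.g. hyperparameter settings): each yields a
# different model and a hypothetical adequacy score for the same test suite.
X_test = rng.normal(size=(500, 5))
adequacy_scores, fairness_scores = [], []
for _ in range(30):
    w = rng.normal(size=5)            # model obtained under one context
    adequacy_scores.append(rng.uniform(0.4, 1.0))  # placeholder adequacy value
    fairness_scores.append(idi_ratio(X_test, w, protected_idx=0))

rho, p = spearmanr(adequacy_scores, fairness_scores)
print(f"Spearman correlation between adequacy and fairness: {rho:.2f} (p={p:.3f})")
```

With real measurements in place of the placeholders, the sign and strength of the reported correlation is what the abstract's second finding examines under hyperparameter and data-bias changes.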

Thu 24 Oct

Displayed time zone: Brussels, Copenhagen, Madrid, Paris

11:00 - 12:30
11:00
20m
Full-paper
Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?
ESEM Technical Papers
Triet Le (The University of Adelaide), Muhammad Ali Babar (School of Computer Science, The University of Adelaide)
11:20
20m
Full-paper
Contexts Matter: An Empirical Study on Contextual Influence in Fairness Testing for Deep Learning Systems
ESEM Technical Papers
Chengwen Du (University of Birmingham), Tao Chen (University of Birmingham)
11:40
20m
Full-paper
Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?
ESEM Technical Papers
Triet Le (The University of Adelaide), Muhammad Ali Babar (School of Computer Science, The University of Adelaide)
12:00
15m
Industry talk
From Literature to Practice: Exploring Fairness Testing Tools for the Software Industry Adoption
ESEM IGC
Thanh Nguyen (University of Calgary), Maria Teresa Baldassarre (Department of Computer Science, University of Bari), Luiz Fernando de Lima, Ronnie de Souza Santos (University of Calgary)
Pre-print
12:15
15m
Vision and Emerging Results
Do Developers Use Static Application Security Testing (SAST) Tools Straight Out of the Box? A large-scale Empirical Study
ESEM Emerging Results, Vision and Reflection Papers Track
Gareth Bennett (Lancaster University), Tracy Hall (Lancaster University), Steve Counsell (Brunel University London), Emily Winter (Lancaster University), Thomas Shippey (LogicMonitor)