On the Robustness of Fairness Practices: A Causal Framework for Systematic Evaluation
Machine learning (ML) algorithms are increasingly deployed to make critical decisions in socioeconomic applications such as finance, criminal justice, and autonomous driving. However, due to their data-driven, pattern-seeking nature, ML algorithms may develop decision logic that disproportionately distributes opportunities, benefits, resources, or information among different population groups, potentially harming marginalized communities. In response to such fairness concerns, the software engineering and ML communities have made significant efforts to establish best practices for building fair ML software. These include fairness interventions applied when training ML models, such as including sensitive features, selecting non-sensitive attributes, and applying bias mitigators. But how reliably can software professionals tasked with developing data-driven systems depend on these recommendations? And how well do these practices generalize in the presence of faulty labels, missing data, or distribution shifts? These questions form the core theme of this paper.
We present a testing tool and technique based on causality theory to assess the robustness of best practices in fair ML software development. Given a practice, specified as a first-order logic property, and a socio-critical dataset that satisfies the property, our goal is to search over neighborhood datasets to determine whether the property continues to hold. This process is akin to testing the robustness of a neural network for image classification, except that the “image” is an entire dataset and its “neighbors” are datasets in which certain causal hypotheses are altered. Since computing neighborhood datasets while accounting for factors such as noise, faulty labeling, and demographic shifts is challenging, we use causal graph representations of the dataset and a search algorithm that explores equivalent causal graphs to generate these datasets. Our results across a variety of tasks, derived from prevalent fairness-sensitive applications, identify the best practices that remain robust under these varying factors.
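To make the idea concrete, the following minimal sketch (not the authors' implementation) illustrates the workflow under simplifying assumptions: a toy linear structural causal model stands in for the causal graph, neighborhood datasets are generated by perturbing one causal mechanism (a demographic shift or label noise), and the fairness property checked is a demographic parity gap of a trained classifier staying below an illustrative threshold. All variable names, coefficients, and thresholds are hypothetical.

```python
# Sketch: generate "neighborhood" datasets by altering causal hypotheses of a toy
# SCM, then test whether a fairness property of the trained model still holds.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_dataset(n=5000, p_group=0.5, label_flip=0.0):
    """Sample from a toy SCM: group -> income -> label, with optional perturbations."""
    group = rng.binomial(1, p_group, n)                 # sensitive attribute A
    income = 1.0 * group + rng.normal(0, 1, n)          # mediator X caused by A
    label = (0.8 * income + rng.normal(0, 0.5, n) > 0).astype(int)  # outcome Y
    flip = rng.binomial(1, label_flip, n).astype(bool)  # faulty-labeling perturbation
    label[flip] = 1 - label[flip]
    X = np.column_stack([group, income])
    return X, label, group

def demographic_parity_gap(model, X, group):
    """|P(Yhat=1 | A=1) - P(Yhat=1 | A=0)| on the given data."""
    yhat = model.predict(X)
    return abs(yhat[group == 1].mean() - yhat[group == 0].mean())

# Original dataset: train a model and record the fairness property.
X, y, a = sample_dataset()
model = LogisticRegression().fit(X, y)
print("baseline gap:", round(demographic_parity_gap(model, X, a), 3))

# Neighborhood datasets: alter causal hypotheses (demographic shift, label noise)
# and check whether the property (gap below a chosen tolerance) continues to hold.
THRESHOLD = 0.2  # illustrative tolerance for the fairness property
for p_group, label_flip in [(0.3, 0.0), (0.7, 0.0), (0.5, 0.1), (0.5, 0.2)]:
    Xn, yn, an = sample_dataset(p_group=p_group, label_flip=label_flip)
    m = LogisticRegression().fit(Xn, yn)
    gap = demographic_parity_gap(m, Xn, an)
    print(f"p_group={p_group}, label_flip={label_flip}: gap={gap:.3f}, "
          f"holds={gap <= THRESHOLD}")
```

In the full technique, the perturbations are not hand-picked as above but are produced by a search over causal graphs equivalent to the one fitted on the original dataset.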