B-AIS: An Automated Process for Black-box Evaluation of AI-enabled Software Systems against Domain Semantics
AI-enabled software systems (AIS) are prevalent in a wide range of applications, such as the visual tasks of autonomous systems, which are extensively deployed in the automotive, aerial, and naval domains. Hence, it is crucial for humans to evaluate the model’s intelligence before an AIS is deployed to safety-critical environments, such as public roads.
In this paper, we assess AIS visual intelligence by measuring the completeness of its perception of the primary concepts in a domain and of their variants. For instance, is the visual perception of an autonomous detector mature enough to recognize instances of \textit{pedestrian} (a concept of the automotive domain) in Halloween costumes? An AIS becomes more reliable when the model’s ability to perceive a concept is expressed in a human-understandable language. For instance, is a pedestrian in a \textit{wheelchair} mistakenly recognized as a pedestrian on a \textit{bike}, since the domain concepts bike and wheelchair are both associated with a shared feature, \textit{wheel}?
We answer questions of this kind by implementing a generic process within a framework, called B-AIS, which systematically evaluates AIS perception against the semantic specifications of a domain while treating the model as a black box. Semantics is the meaning and understanding of words in a language, and is therefore more comprehensible to humans than the pixel-level visual information of an AIS. B-AIS transforms these heterogeneous artifacts into a comparable form and leverages the results of the comparison to reveal AIS weaknesses in a human-understandable language. Our evaluation of B-AIS on the vision task of pedestrian detection showed that it identified the missing variants of the pedestrian concept with $F_{2}$ measures of 95% in the dataset and 85% in the model.
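For reference, and assuming the $F_{2}$ measure is used here in its standard sense, it is the $F_{\beta}$ score with $\beta = 2$, which weights recall more heavily than precision. The symbols $P$ (precision) and $R$ (recall) are introduced only for this illustration:
\[
F_{\beta} = \frac{(1+\beta^{2})\, P\, R}{\beta^{2} P + R},
\qquad
F_{2} = \frac{5\, P\, R}{4P + R}.
\]
Under this definition, a high $F_{2}$ value indicates that few of the truly missing variants go unreported, at the cost of tolerating some false alarms.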