Improving Generalizability of ML-enabled Software through Domain Specification
While conventional software components implement pre-defined specifications, Machine Learning (ML)-enabled Software Components (MLSC) learn the domain specifications from training samples. The data-driven, inductive reasoning of MLSC therefore relies heavily on the quality of the training dataset, which is often collected in an arbitrary, ad hoc manner. Such random sample collection leads to a significant gap between the actual specification of a real-world concept and the picture the dataset paints of that concept, reducing MLSC generalizability, particularly in perceptual tasks where understanding the environment is essential for accurate prediction.
To fill the gap between the conceptualization of a targeted domain concept and its visual representation in the MLSC training dataset, we propose exploiting a semantic specification of the concept to identify the concept's missing variants in the dataset. To this end, we propose to first semantically specify the hard-to-specify concepts of the MLSC's targeted domain and, second, use the derived specifications to evaluate the diversity and relative completeness of the collected MLSC datasets. Systematically augmenting training datasets with respect to the semantics of the domain improves the quality of an arbitrarily collected dataset and potentially yields more reliable models. As a proof of concept, we automatically acquired existing semantic knowledge to partially specify the automotive-domain concept \textit{``pedestrian.''} Using the derived specifications, we augmented state-of-the-art pedestrian datasets. Our evaluations show that semantic augmentation outperforms brute-force machine learning in satisfying MLSC accuracy requirements.
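To illustrate the second step, the minimal Python sketch below shows one way a derived specification could be used to measure a dataset's relative completeness and list its missing concept variants. It is not the paper's implementation: the dimension names (age group, pose, occlusion), their variant values, and the per-sample annotation format are hypothetical placeholders standing in for whatever the acquired semantic knowledge actually provides.
\begin{verbatim}
from itertools import product
from collections import Counter

# Hypothetical semantic specification of the concept "pedestrian":
# each dimension lists the variants assumed to be derived from
# existing domain knowledge.
SPEC = {
    "age_group": ["child", "adult", "elderly"],
    "pose": ["walking", "standing", "running", "sitting"],
    "occlusion": ["none", "partial", "heavy"],
}

def coverage_report(samples):
    """Compare the variants present in a dataset's annotations against
    the specification and return the missing variant combinations.

    `samples` is assumed to be an iterable of dicts whose keys match the
    specification dimensions, e.g.
    {"age_group": "adult", "pose": "walking", "occlusion": "none"}.
    """
    dims = list(SPEC)
    # Count which variant combinations the dataset actually contains.
    observed = Counter(tuple(s[d] for d in dims) for s in samples)
    # All combinations the specification requires.
    required = set(product(*(SPEC[d] for d in dims)))
    missing = sorted(required - set(observed))
    covered = len(required) - len(missing)
    return {
        "relative_completeness": covered / len(required),
        "missing_variants": [dict(zip(dims, combo)) for combo in missing],
    }

if __name__ == "__main__":
    dataset = [
        {"age_group": "adult", "pose": "walking", "occlusion": "none"},
        {"age_group": "adult", "pose": "standing", "occlusion": "partial"},
    ]
    report = coverage_report(dataset)
    print(f"relative completeness: {report['relative_completeness']:.0%}")
    print("first missing variant:", report["missing_variants"][0])
\end{verbatim}
In this reading, the missing-variant list would drive the augmentation step: each unsatisfied combination points to samples that should be collected or synthesized before the dataset is considered complete with respect to the domain semantics.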