Many organizations seek to ensure that machine learning (ML) and artificial intelligence (AI) systems work as intended in production but currently do not have a cohesive methodology in place to do so. To fill this gap, we propose MLTE (Machine Learning Test and Evaluation, colloquially referred to as “melt”), a framework and implementation to evaluate ML models and systems. The framework compiles state-of-the-art evaluation techniques into an organizational process for interdisciplinary teams, including model developers, software engineers, system owners, and other stakeholders. The MLTE tool supports this process by providing a domain-specific language that teams can use to express model requirements, an infrastructure to define, generate, and collect ML evaluation metrics, and the means to communicate results.
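To make the workflow the abstract describes concrete, the following minimal Python sketch shows how a team might express model requirements and validate collected evaluation metrics against them. It is a hypothetical illustration of the general idea, not MLTE's actual domain-specific language or API; all names (Requirement, evaluate, the example metrics and thresholds) are assumptions introduced here for illustration.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class Requirement:
        """A single model requirement: a named metric plus a pass/fail check."""
        name: str
        description: str
        check: Callable[[float], bool]

    def evaluate(requirements: List[Requirement],
                 measurements: Dict[str, float]) -> Dict[str, dict]:
        """Validate collected measurements against the stated requirements and
        return a result summary that can be communicated to stakeholders."""
        results = {}
        for req in requirements:
            value = measurements.get(req.name)
            passed = value is not None and req.check(value)
            results[req.name] = {
                "description": req.description,
                "measured": value,
                "status": "PASS" if passed else "FAIL",
            }
        return results

    if __name__ == "__main__":
        # Hypothetical requirements a team might state for a production model.
        requirements = [
            Requirement("accuracy",
                        "Test-set accuracy must be at least 0.90",
                        lambda v: v >= 0.90),
            Requirement("p99_latency_ms",
                        "99th-percentile inference latency under 200 ms",
                        lambda v: v <= 200.0),
        ]
        # In practice these values would be produced by the evaluation infrastructure.
        measurements = {"accuracy": 0.93, "p99_latency_ms": 180.0}
        for name, result in evaluate(requirements, measurements).items():
            print(f"{name}: {result['status']} (measured={result['measured']})")

In this sketch, each requirement pairs a metric name with a validation condition, and the evaluation step produces a structured pass/fail summary, mirroring the express-requirements, collect-metrics, communicate-results cycle described above.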