An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications (CAIN 2024 - Research and Experience Papers)

Who

Tajkia Rahman Toma, Cor-Paul Bezemer

Track

CAIN 2024 Research and Experience Papers

Time Zone

All times are shown in the conference time zone, (GMT+01:00) Lisbon.

When

Sun 14 Apr 2024 14:30 - 14:45 at Pequeno Auditório - Data Engineering and Management for AI-Enabled Systems
Chair(s): Marc Zeller

Abstract

Datasets and models are two key artifacts in machine learning (ML) applications. Although tools exist to support dataset and model developers in managing ML artifacts, little is known about how these datasets and models are integrated into ML applications. In this paper, we study how datasets and models in ML applications are managed. In particular, we focus on how these artifacts are stored and versioned alongside the applications. After analyzing 93 repositories, we found that the most common storage location for datasets and models is the file system, which causes availability issues. Notably, large data and model files, exceeding approximately 60 MB, are stored exclusively in remote storage. Most of the datasets and models lack proper integration with the version control system, posing potential traceability and reproducibility issues. Additionally, although datasets and models are likely to evolve during application development, they are rarely updated in application repositories.

Tajkia Rahman Toma

University of Alberta

Cor-Paul Bezemer