DVC in Open Source AI-development: The Action and the Reaction (CAIN 2024 - Research and Experience Papers)

Who

Lorena Barreto Simedo Pacheco, Musfiqur Rahman, Fazle Rabbi, Pouya Fathollahzadeh, Ahmad Abdellatif, Emad Shihab, Tse-Hsun (Peter) Chen, Jinqiu Yang, Ying Zou

Track

CAIN 2024 Research and Experience Papers

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Sun 14 Apr 2024 14:45 - 14:55 at Pequeno Auditório - Data Engineering and Management for AI-Enabled Systems Chair(s): Marc Zeller

Abstract

Artificial Intelligence (AI) systems are gaining popularity, reshaping various domains ranging from customer services to software engineering. The effectiveness of AI systems is intricately linked to the quality of their training data. Therefore, practitioners invest substantial time experimenting with different data, parameters, and models to guarantee the quality of the end system. Prior work highlights unique challenges of developing AI systems, particularly concerning versioning data and models. Recently, various tools such as DVC and MLflow have emerged to aid developers in the storage and tracking of data. Despite their growing popularity, very little is known about their usage patterns and impact on open-source software (OSS) systems. To address this gap, we conducted an empirical study on 56 GitHub OSS projects that use DVC to understand the DVC usage pattern and the impact of using DVC on the software development process. We found that Versioning and tracking is the most adopted DVC feature, being utilized by all 56 projects and being the only adopted feature in 85.7% of them. Furthermore, we find that DVC has a significant impact on the software development process indicators (e.g., number of created PRs, number of bug-fix commits), causing a significant shift in the trend of the most studied indicators.

Lorena Barreto Simedo Pacheco

Concordia University

Musfiqur Rahman

Concordia University

Canada

Fazle Rabbi

Concordia University

Pouya Fathollahzadeh

Queen’s University

Ahmad Abdellatif

University of Calgary

Canada

Emad Shihab

Concordia University

Canada

Tse-Hsun (Peter) Chen

Concordia University

Canada

Jinqiu Yang

Concordia University

Canada

Ying Zou

Queen's University, Kingston, Ontario

Session Program

Sun 14 Apr
Displayed time zone: Lisbon

14:00 - 15:30	Data Engineering and Management for AI-Enabled Systems (Research and Experience Papers / Industry Talks) at Pequeno Auditório. Chair(s): Marc Zeller, Siemens AG

14:00 15m Talk		What About the Data? A Mapping Study on Data Engineering for AI Systems Research and Experience Papers Petra Heck Fontys University of Applied Sciences
14:15 15m Talk		Unmasking Data Secrets: An Empirical Investigation into Data Smells and Their Impact on Data Quality Research and Experience Papers Gilberto Recupito University of Salerno, Raimondo Rapacciuolo University of Salerno, Dario Di Nucci University of Salerno, Fabio Palomba University of Salerno
14:30 15m Talk		An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications (Distinguished Paper Award Candidate) Research and Experience Papers Tajkia Rahman Toma University of Alberta, Cor-Paul Bezemer University of Alberta
14:45 10m Talk		DVC in Open Source AI-development: The Action and the Reaction Research and Experience Papers Lorena Barreto Simedo Pacheco Concordia University, Musfiqur Rahman Concordia University, Fazle Rabbi Concordia University, Pouya Fathollahzadeh Queen’s University, Ahmad Abdellatif University of Calgary, Emad Shihab Concordia University, Tse-Hsun (Peter) Chen Concordia University, Jinqiu Yang Concordia University, Ying Zou Queen's University, Kingston, Ontario
14:55 10m Industry talk		Structuring the world of unstructured text data – Balancing business requirements, training data availability, and model performance. Industry Talks A: Sooji Han, Berinike Tech
15:05 10m Industry talk		Invited: Artificial Intelligence Projects, a quest between meaningful use cases, data, and unfulfilled desires. Industry Talks A: Andreas Jedlitschka Fraunhofer IESE
15:15 15m Live Q&A		Data : Q&A Session Research and Experience Papers