DVC in Open Source AI-development: The Action and the Reaction
Artificial Intelligence (AI) systems are gaining popularity and reshaping domains ranging from customer service to software engineering. The effectiveness of an AI system is closely tied to the quality of its training data, so practitioners invest substantial time experimenting with different data, parameters, and models to ensure the quality of the end system. Prior work highlights challenges unique to developing AI systems, particularly in versioning data and models. Recently, tools such as DVC and MLflow have emerged to help developers store and track data. Although these tools are gaining popularity, very little is known about their usage patterns and their impact on open-source software (OSS) systems. To address this gap, we conducted an empirical study of 56 GitHub OSS projects that use DVC to understand how DVC is used and how it affects the software development process. We find that versioning and tracking is the most widely adopted DVC feature: all 56 projects use it, and it is the only feature adopted in 85.7% of them. Furthermore, we find that DVC has a significant impact on software development process indicators (e.g., number of created PRs, number of bug-fix commits), causing a significant shift in the trend of most of the studied indicators.