CAIN 2024
Sun 14 - Mon 15 April 2024 Lisbon, Portugal
co-located with ICSE 2024

Artificial Intelligence (AI) is rapidly advancing with a data-centered approach suitable for various domains. Nevertheless, AI faces significant challenges, particularly in data quality. Data collected from diverse sources can introduce quality issues that may threaten the development of AI-enabled systems. A growing concern in this context is the emergence of data smells: issues specific to the data used in building AI models, which can have long-term consequences. In this paper, we aim to enlarge the current body of knowledge on data smells through a two-step investigation. First, we update an existing literature review to catalogue the known data smells and the tools that detect them. Second, we assess the prevalence of data smells and their correlation with data quality metrics. We identify a novel set of 12 data smells distributed across three additional categories. We also observe that data smells correlate substantially with data quality, especially for highly diffused data smell instances. This research sheds light on the complex relationship between data smells and data quality, providing valuable insights into the challenges of maintaining AI-enabled systems.

Sun 14 Apr

Displayed time zone: Lisbon

14:00 - 15:30
Data Engineering and Management for AI-Enabled Systems
Research and Experience Papers / Industry Talks at Pequeno Auditório
Chair(s): Marc Zeller Siemens AG
14:00
15m
Talk
What About the Data? A Mapping Study on Data Engineering for AI Systems
Research and Experience Papers
Petra Heck Fontys University of Applied Sciences
Pre-print
14:15
15m
Talk
Unmasking Data Secrets: An Empirical Investigation into Data Smells and Their Impact on Data Quality
Research and Experience Papers
Gilberto Recupito University of Salerno, Raimondo Rapacciuolo University of Salerno, Dario Di Nucci University of Salerno, Fabio Palomba University of Salerno
14:30
15m
Talk
An Exploratory Study of Dataset and Model Management in Open Source Machine Learning Applications
Distinguished Paper Award Candidate
Research and Experience Papers
Tajkia Rahman Toma University of Alberta, Cor-Paul Bezemer University of Alberta
14:45
10m
Talk
DVC in Open Source AI-development: The Action and the Reaction
Research and Experience Papers
Lorena Barreto Simedo Pacheco Concordia University, Musfiqur Rahman Concordia University, Fazle Rabbi Concordia University, Pouya Fathollahzadeh Queen’s University, Ahmad Abdellatif University of Calgary, Emad Shihab Concordia University, Tse-Hsun (Peter) Chen Concordia University, Jinqiu Yang Concordia University, Ying Zou Queen's University, Kingston, Ontario
14:55
10m
Industry talk
Structuring the world of unstructured text data – Balancing business requirements, training data availability, and model performance.
Industry Talks
15:05
10m
Industry talk
Invited: Artificial Intelligence Projects, a quest between meaningful use cases, data, and unfulfilled desires.
Industry Talks
Andreas Jedlitschka Fraunhofer IESE
15:15
15m
Live Q&A
Data: Q&A Session
Research and Experience Papers