Generating and Verifying Synthetic Datasets with Requirements Engineering
With the rise of generative Artificial Intelligence (AI), Machine Learning (ML) developers are becoming less reliant on real data to train their models. Data insufficiency can be addressed by using synthetic data generated by a diffusion model. However, beyond ad hoc interpretation of a generative model’s outputs, there is little assurance that the synthetic data adheres to the data requirement specifications. This adherence is critical because the specifications describe the desired behavior of the downstream model. Without proper verification methods for synthetic data, ML developers cannot be confident in that behavior. This paper presents a method for generating and verifying synthetic data to train downstream ML models: the generative model is prompted using requirement specifications, and elements of the output are traced back to the prompt. The purpose of this research is to embed requirements engineering into the data augmentation process in order to increase the rigor and acceptance of using generative AI models to train downstream ML models. Doing so improves the transparency of the data augmentation process, potentially increasing stakeholders’ trust in the generated data and enabling the use of generative models for data augmentation in a wider range of applications. It also gives ML developers a more traditional approach to synthetic data generation to guide them in augmenting their datasets, thus incorporating a more rigorous engineering process into ML development, i.e., ML Engineering.
Mon 28 Apr | Displayed time zone: Eastern Time (US & Canada)
16:00 - 17:30 | Generative Model Engineering | Research and Experience Papers / Industry Talks at 208 | Chair(s): Manel Abdellatif (École de Technologie Supérieure)
16:00 | 15m | Talk | DDPT: Diffusion Driven Prompt Tuning for Large Language Model Code Generation | Research and Experience Papers | Jinyang Li (The University of Adelaide), Sangwon Hyun (CREST, University of Adelaide), Muhammad Ali Babar (School of Computer Science, The University of Adelaide)
16:15 | 15m | Talk | Engineering LLM Powered Multi-agent Framework for Autonomous CloudOps (Distinguished Paper Award Candidate) | Research and Experience Papers | Kannan Parthasarathy (MontyCloud), Karthik Vaidhyanathan (IIIT Hyderabad), Rudra Dhar (SERC, IIIT Hyderabad, India), Venkat Krishnamachari (MontyCloud), Adyansh Kakran (International Institute of Information Technology, Hyderabad), Sreemaee Akshathala (IIIT Hyderabad), Shrikara Arun (IIIT Hyderabad), Amey Karan (IIIT Hyderabad), Basil Muhammed (MontyCloud), Sumant Dubey (MontyCloud), Mohan Veerubhotla (MontyCloud)
16:30 | 15m | Talk | Generating and Verifying Synthetic Datasets with Requirements Engineering | Research and Experience Papers | Lynn Vonderhaar (Embry-Riddle Aeronautical University), Timothy Elvira (Embry-Riddle Aeronautical University), Omar Ochoa (Embry-Riddle Aeronautical University) | Pre-print
16:45 | 15m | Talk | LLM-Based Safety Case Generation for Baidu Apollo: Are We There Yet? | Research and Experience Papers
17:00 | 12m | Talk | SqPal - text to SQL GenAI tool for PayPal | Industry Talks
17:12 | 18m | Other | Discussion | Research and Experience Papers