Generating and Verifying Synthetic Datasets with Requirements Engineering (CAIN 2025 - Research and Experience Papers)

Who

Lynn Vonderhaar, Timothy Elvira, Omar Ochoa

Track

CAIN 2025 Research and Experience Papers

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Mon 28 Apr 2025 16:30 - 16:45 at 208 - Generative Model Engineering Chair(s): Manel Abdellatif

Abstract

With the rise of generative Artificial Intelligence (AI), Machine Learning (ML) developers are becoming less reliant on real data to train their models. Data insufficiency can be resolved by using synthetic data generated by a diffusion model. However, beyond ad hoc interpretation of a generative model’s outputs, there is little assurance of the synthetic data’s adherence to the data requirement specifications. Adherence of synthetic data to these specifications is critical given that they describe desired downstream model behavior. Therefore, without proper verification methods for this synthetic data, ML developers cannot be confident in the behavior of the downstream model. This paper presents a verification method for generating synthetic data to train downstream ML models by prompting the generative model using requirement specifications, and tracing elements of the output back to the prompt. The purpose of this research is to embed requirements engineering into the data augmentation process to increase the rigor and acceptance of these generative AI models to train downstream ML models. This improves the transparency of the data augmentation process, potentially increasing the trust of stakeholders in the generated data, and the use of generative models for data augmentation in a wider range of applications. This also provides a more traditional approach to synthetic data generation to guide ML developers in augmenting their datasets, thus incorporating a more rigorous engineering process into the ML development, i.e., ML Engineering.

Link to Preprint

https://www.researchgate.net/publication/390555619_XXX-X-XXXX-XXXX-XXXXX00_C20XX_IEEE_Generating_and_Verifying_Synthetic_Datasets_with_Requirements_Engineering

Lynn Vonderhaar

Embry-Riddle Aeronautical University

Timothy Elvira

Embry-Riddle Aeronautical University

Omar Ochoa

Embry-Riddle Aeronautical University

Session Program

Mon 28 Apr
Displayed time zone: Eastern Time (US & Canada)

16:00 - 17:30	Generative Model Engineering, Research and Experience Papers / Industry Talks, at 208. Chair(s): Manel Abdellatif, École de Technologie Supérieure

16:00 15m Talk		DDPT: Diffusion Driven Prompt Tuning for Large Language Model Code Generation Research and Experience Papers Jinyang Li The University of Adelaide, Sangwon Hyun CREST, University of Adelaide, Muhammad Ali Babar School of Computer Science, The University of Adelaide
16:15 15m Talk		Engineering LLM Powered Multi-agent Framework for Autonomous CloudOps (Distinguished Paper Award Candidate) Research and Experience Papers Kannan Parthasarathy MontyCloud, Karthik Vaidhyanathan IIIT Hyderabad, Rudra Dhar SERC, IIIT Hyderabad, India, Venkat Krishnamachari MontyCloud, Adyansh Kakran International Institute of Information Technology, Hyderabad, Sreemaee Akshathala IIIT Hyderabad, Shrikara Arun IIIT Hyderabad, Amey Karan IIIT Hyderabad, Basil Muhammed MontyCloud, Sumant Dubey MontyCloud, Mohan Veerubhotla MontyCloud
16:30 15m Talk		Generating and Verifying Synthetic Datasets with Requirements Engineering Research and Experience Papers Lynn Vonderhaar Embry-Riddle Aeronautical University, Timothy Elvira Embry-Riddle Aeronautical University, Omar Ochoa Embry-Riddle Aeronautical University Pre-print
16:45 15m Talk		LLM-Based Safety Case Generation for Baidu Apollo: Are We There Yet? Research and Experience Papers Oluwafemi Odu York University, Alvine Boaye Belle York University, Song Wang York University
17:00 12m Talk		SqPal - text to SQL GenAI tool for PayPal Industry Talks Dan Liyanage PayPal, Mahshid Moha PayPal, Sandy Suresh PayPal
17:12 18m Other		Discussion Research and Experience Papers