Reality Bites: Assessing the Realism of Driving Scenarios with Large Language Models
Full Paper
Large Language Models (LLMs) are demonstrating outstanding potential for tasks such as text generation, summarization, and classification. Given that such models are trained on a vast amount of online knowledge, we hypothesize that LLMs can assess whether driving scenarios generated by autonomous driving testing techniques are realistic, i.e., aligned with real-world driving conditions. To test this hypothesis, we conducted an empirical evaluation to assess whether LLMs are effective and robust in performing this task. This reality check is an important step towards devising LLM-based autonomous driving testing techniques. For our empirical evaluation, we selected 64 realistic scenarios from DeepScenario, an open driving scenario dataset. Next, by introducing minor changes that preserve their realism, we created 512 additional scenarios, forming an overall dataset of 576 scenarios. With this dataset, we evaluated three LLMs (GPT-3.5, Llama2-13B, and Mistral-7B) to measure their robustness in assessing the realism of driving scenarios. Our results demonstrate that: (1) overall, GPT-3.5 achieved the highest robustness compared to Llama2-13B and Mistral-7B, consistently across almost all scenarios, roads, and weather conditions; (2) Mistral-7B consistently performed the worst; (3) Llama2-13B achieved good results under certain conditions but not under others; and (4) roads and weather conditions do influence the robustness of the LLMs.
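The evaluation setup described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each LLM returns a yes/no realism judgment per scenario and that robustness is approximated by the fraction of scenarios judged realistic within a group (every scenario in the dataset is realistic by construction). The abstract does not define the paper's actual robustness metric, so the function `realistic_rate`, the record fields, and the sample data are all hypothetical.

```python
from collections import defaultdict

# Dataset sizes stated in the abstract.
N_ORIGINAL = 64
N_VARIANTS = 512
N_TOTAL = N_ORIGINAL + N_VARIANTS  # 576 scenarios overall


def realistic_rate(judgments, group_key):
    """Fraction of scenarios each model judged realistic, per group.

    Assumption (not from the paper): each judgment is a dict with a
    'model' name, a grouping field such as 'weather' or 'road', and a
    boolean 'realistic' verdict returned by the LLM.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [realistic count, total count]
    for j in judgments:
        key = (j["model"], j[group_key])
        totals[key][0] += 1 if j["realistic"] else 0
        totals[key][1] += 1
    return {key: hits / count for key, (hits, count) in totals.items()}


# Hypothetical verdicts; real ones would come from prompting each LLM.
sample = [
    {"model": "model-a", "weather": "rain", "realistic": True},
    {"model": "model-a", "weather": "rain", "realistic": False},
    {"model": "model-a", "weather": "clear", "realistic": True},
    {"model": "model-b", "weather": "rain", "realistic": False},
]
rates = realistic_rate(sample, "weather")
```

Grouping by a field such as weather or road type mirrors the abstract's observation that these conditions influence robustness; any other grouping field present in the records could be passed as `group_key`.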
Sun 14 Apr (displayed time zone: Lisbon)
11:00 - 12:30 | Foundation Models for Software Quality Assurance | Research Track at Luis de Freitas Branco | Chair(s): Matteo Ciniselli Università della Svizzera Italiana
11:00 | 14m | Full-paper | Deep Multiple Assertions Generation | Full Paper, Research Track
11:14 | 14m | Full-paper | MeTMaP: Metamorphic Testing for Detecting False Vector Matching Problems in LLM Augmented Generation | Full Paper, Research Track | Guanyu Wang Beijing University of Posts and Telecommunications, Yuekang Li The University of New South Wales, Yi Liu Nanyang Technological University, Gelei Deng Nanyang Technological University, Li Tianlin Nanyang Technological University, Guosheng Xu Beijing University of Posts and Telecommunications, Yang Liu Nanyang Technological University, Haoyu Wang Huazhong University of Science and Technology, Kailong Wang Huazhong University of Science and Technology
11:28 | 14m | Full-paper | Planning to Guide LLM for Code Coverage Prediction | Full Paper, Research Track | Hridya Dhulipala University of Texas at Dallas, Aashish Yadavally University of Texas at Dallas, Tien N. Nguyen University of Texas at Dallas
11:42 | 7m | Short-paper | The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks | New Idea Paper, Research Track | Ashwin Prasad Shivarpatna Venkatesh University of Paderborn, Samkutty Sabu University of Paderborn, Amir Mir Delft University of Technology, Sofia Reis Instituto Superior Técnico, U. Lisboa & INESC-ID, Eric Bodden
11:49 | 14m | Full-paper | Reality Bites: Assessing the Realism of Driving Scenarios with Large Language Models | Full Paper, Research Track | Jiahui Wu Simula Research Laboratory and University of Oslo, Chengjie Lu Simula Research Laboratory and University of Oslo, Aitor Arrieta Mondragon University, Tao Yue Beihang University, Shaukat Ali Simula Research Laboratory and Oslo Metropolitan University
12:03 | 7m | Short-paper | Assessing the Impact of GPT-4 Turbo in Generating Defeaters for Assurance Cases | New Idea Paper, Research Track | Kimya Khakzad Shahandashti York University, Mithila Sivakumar York University, Mohammad Mahdi Mohajer York University, Alvine Boaye Belle York University, Song Wang York University, Timothy Lethbridge University of Ottawa
12:10 | 20m | Other | Discussion | Research Track