Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks (ICSME 2025 - New Ideas and Emerging Results Track) - ICSME 2025 - International Conference on Software Maintenance and Evolution

Who

Emir Bosnak, Sahand Moslemi Yengejeh, Mayasah Lami, Anil Koyuncu

Track

ICSME 2025 NIER Track

Time Zone

The program is currently displayed in (GMT+12:00) Auckland, Wellington.

When

Thu 11 Sep 2025 16:25 - 16:35 at Case Room 2 260-057 - Session 12 - Security 1 Chair(s): Dhanushka Jayasuriya

Abstract

Large Language Models (LLMs) are increasingly used as code assistants, yet their behavior when explicitly asked to generate insecure code remains poorly understood. While prior research has focused on unintended vulnerabilities, this study examines a more direct threat: open-source LLMs generating vulnerable code when prompted. We propose a dual experimental design: (1) Dynamic Prompting, which systematically varies vulnerability type, user persona, and prompt phrasing across structured templates; and (2) Reverse Prompting, which derives natural-language prompts from real vulnerable code samples. We evaluate three open-source 7B-parameter models (Qwen2, Mistral, Gemma) using static analysis to assess both the presence and correctness of generated vulnerabilities. Our results show that all models frequently generate the requested vulnerabilities, though with significant performance differences. Gemma achieves the highest correctness for memory vulnerabilities under Dynamic Prompting (e.g., 98.6% for buffer overflows), while Qwen2 demonstrates the most balanced performance across all tasks. We find that professional personas (e.g., “DevOps Engineer”) consistently elicit higher success rates than student personas, and that the effectiveness of direct versus indirect phrasing is inverted depending on the prompting strategy. Vulnerability reproduction accuracy follows a non-linear pattern with code complexity, peaking in a moderate range. Our findings expose how LLMs’ reliance on pattern recall over semantic reasoning creates significant blind spots in their safety alignments, particularly for requests framed as plausible professional tasks.
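The Dynamic Prompting design described in the abstract varies three factors across structured templates. As a rough illustration only, the sketch below enumerates such a condition grid as a Cartesian product. The factor levels are placeholders: only "buffer overflow", "DevOps Engineer", student personas, and direct versus indirect phrasing are named in the abstract. The `Condition` class and `build_conditions` function are hypothetical names, not taken from the paper, and the authors' actual templates and factor lists may differ.

```python
from dataclasses import dataclass
from itertools import product

# Assumed factor levels for illustration; the paper's real lists are not given here.
VULNERABILITY_TYPES = ["buffer overflow", "other memory vulnerability (placeholder)"]
PERSONAS = ["DevOps Engineer", "student"]
PHRASINGS = ["direct", "indirect"]


@dataclass(frozen=True)
class Condition:
    """One cell of the experimental grid (hypothetical structure)."""
    vulnerability: str
    persona: str
    phrasing: str


def build_conditions():
    """Return every combination of vulnerability type, persona, and phrasing."""
    return [
        Condition(v, p, s)
        for v, p, s in product(VULNERABILITY_TYPES, PERSONAS, PHRASINGS)
    ]


conditions = build_conditions()
print(len(conditions))  # 2 * 2 * 2 = 8 conditions
```

Enumerating the grid up front keeps every factor combination covered exactly once, which makes per-persona or per-phrasing success rates straightforward to compare afterwards.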

Link to Preprint

https://arxiv.org/abs/2507.10054

Emir Bosnak

Bilkent University

Turkey

Sahand Moslemi Yengejeh

Bilkent University

Turkey

Mayasah Lami

Bilkent University

Turkey

Anil Koyuncu

Bilkent University

Turkey

Session Program

Thu 11 Sep
Displayed time zone: Auckland, Wellington

15:30 - 17:00	Session 12 - Security 1 (NIER Track / Research Papers Track / Tool Demonstration Track / Journal First Track) at Case Room 2 260-057 Chair(s): Dhanushka Jayasuriya, University of Auckland

15:30 15m		Retrieve, Refine, or Both? Using Task-Specific Guidelines for Secure Python Code Generation Research Papers Track Catherine Tony Hamburg University of Technology, Emanuele Iannone Hamburg University of Technology, Riccardo Scandariato Hamburg University of Technology
15:45 15m		SAEL: Leveraging Large Language Models with Adaptive Mixture-of-Experts for Smart Contract Vulnerability Detection Research Papers Track Lei Yu Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, China, Shiqi Cheng Institute of Software, Chinese Academy of Sciences, China, Zhirong Huang Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, China, Jingyuan Zhang Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, China, Chenjie Shen Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, China, Junyi Lu Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, China, Li Yang Institute of Software, Chinese Academy of Sciences, Fengjun Zhang Institute of Software, Chinese Academy of Sciences, China, Jiajia Ma Institute of Software, Chinese Academy of Sciences, China
16:00 15m		Evaluating the maintainability of Forward-Porting vulnerabilities in fuzzer benchmarks Research Papers Track Timothée Riom Umeå Universitet, Sabine Houy Umeå Universitet, Bruno Kreyssig Umeå University, Alexandre Bartel Umeå University
16:15 10m		VulGuard: An Unified Tool for Evaluating Just-In-Time Vulnerability Prediction Models Tool Demonstration Track Duong Nguyen Hanoi University of Science and Technology, Manh Tran-Duc Hanoi University of Science and Technology, Le-Cong Thanh The University of Melbourne, Triet Le The University of Adelaide, Muhammad Ali Babar School of Computer Science, The University of Adelaide, Quyet Thang Huynh Hanoi University of Science and Technology
16:25 10m		Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks NIER Track Emir Bosnak Bilkent University, Sahand Moslemi Yengejeh Bilkent University, Mayasah Lami Bilkent University, Anil Koyuncu Bilkent University
16:35 15m		Vulnerabilities in Infrastructure as Code: What, How Many, and Who? Journal First Track Aïcha War University of Luxembourg, Alioune Diallo University of Luxembourg, Andrew Habib ABB Corporate Research, Germany, Jacques Klein University of Luxembourg, Tegawendé F. Bissyandé University of Luxembourg