From Theory to Practice: Code Generation Using LLMs for CAPEC and CWE Frameworks (Virtual Talk)
The increasing complexity and volume of software systems have heightened the importance of identifying and mitigating security vulnerabilities. Existing software vulnerability datasets frequently fall short of providing comprehensive, detailed code snippets explicitly linked to specific vulnerability descriptions, which reduces their utility for advanced research and hinders efforts to develop a deeper understanding of security vulnerabilities. To address this gap, we present a novel dataset of vulnerable code snippets corresponding to Common Attack Pattern Enumeration and Classification (CAPEC) and Common Weakness Enumeration (CWE) descriptions. Employing the capabilities of large language models (LLMs), we have developed a robust methodology for generating these examples. Our approach uses the GPT-4o, Llama, and Claude models to generate code snippets that exhibit specific vulnerabilities as described in the CAPEC and CWE documentation. The dataset not only enhances the understanding of security vulnerabilities in code but also serves as a valuable resource for training machine learning models for automatic vulnerability detection and remediation. Preliminary evaluations suggest that the LLM-generated dataset demonstrates high accuracy and can serve as a reliable reference for vulnerability identification systems. Results were consistent across the three models, with a cosine similarity of 0.98 among the generated code snippets. The final dataset comprises 615 CAPEC code snippets in three programming languages (Java, Python, and JavaScript), making it one of the most extensive and diverse resources in this domain. This research contributes to cybersecurity by introducing a dataset that supports advanced studies of software vulnerabilities and facilitates the development of tools for their prevention and mitigation.
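As a concrete illustration of the generation step, the sketch below shows one plausible way to prompt a model for a weakness-specific snippet. The prompt wording, the generate_vulnerable_snippet helper, and the use of the OpenAI chat completions API are illustrative assumptions; the abstract does not disclose the authors' exact prompts or pipeline.

```python
# Hypothetical sketch of the generation step: ask an LLM for a code snippet
# that exhibits a given CWE weakness. Prompt wording and model choice are
# assumptions for illustration, not the authors' exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_vulnerable_snippet(cwe_id: str, description: str, language: str) -> str:
    """Request a code snippet exhibiting the described weakness."""
    prompt = (
        f"Write a short {language} code snippet that exhibits {cwe_id}: "
        f"{description}. Return only the code; it is for security research "
        f"and model-training purposes."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example: CWE-89 (SQL Injection), one of the weaknesses catalogued by MITRE
snippet = generate_vulnerable_snippet(
    "CWE-89",
    "Improper Neutralization of Special Elements used in an SQL Command",
    "Python",
)
print(snippet)
```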
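The cross-model consistency check can be reproduced in spirit by vectorizing each model's output for the same CAPEC/CWE entry and computing pairwise cosine similarity. The TF-IDF representation and the toy snippets below are stand-ins; the abstract does not specify the vectorization behind the reported 0.98 figure.

```python
# Hypothetical sketch of the agreement check: compare snippets generated by
# different models for the same entry via cosine similarity. TF-IDF vectors
# are an assumed representation for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pairwise_snippet_similarity(snippets: list[str]):
    """Return the cosine-similarity matrix for one entry's generated snippets."""
    vectors = TfidfVectorizer().fit_transform(snippets)
    return cosine_similarity(vectors)


# Toy outputs standing in for the GPT-4o, Llama, and Claude generations
snippets = [
    "cursor.execute(\"SELECT * FROM users WHERE name = '\" + name + \"'\")",
    "cursor.execute(\"SELECT * FROM users WHERE name = '\" + user + \"'\")",
    'db.execute("SELECT * FROM users WHERE id = " + user_id)',
]
sims = pairwise_snippet_similarity(snippets)
print(sims)  # off-diagonal values near 1.0 indicate cross-model agreement
```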