Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks (FSE 2025 - Research Papers)

Mon 23 - Fri 27 June 2025 Trondheim, Norway

co-located with ISSTA 2025

Who

Ali Al-Kaswan, Sebastian Deatc, Begüm Koç, Arie van Deursen, Maliheh Izadi

Track

FSE 2025 Research Papers

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

Use conference time zone: (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, ViennaSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Wed 25 Jun 2025 14:40 - 15:00 at Cosmos Hall - LLM for SE 4 Chair(s): Ting Su

Abstract

Nowadays, developers increasingly rely on solutions powered by Large Language Models (LLM) to assist them with their coding tasks. This makes it crucial to align these tools with human values to prevent malicious misuse. In this paper, we propose a comprehensive framework for assessing the potential harmfulness of LLMs within the software engineering domain. We begin by developing a taxonomy of potentially harmful software engineering scenarios and subsequently, create a dataset of prompts based on this taxonomy. To systematically assess the responses, we design and validate an automatic evaluator that classifies the outputs of a variety of LLMs both open-source and closed-source models, as well as general-purpose and code-specific LLMs. Furthermore, we investigate the impact of models’ size, architecture family, and alignment strategies on their tendency to generate harmful content. The results show significant disparities in the alignment of various LLMs for harmlessness. We find that some models and model families, such as Openhermes, are more harmful than others and that code-specific models do not perform better than their general-purpose counterparts. Notably, some fine-tuned models perform significantly worse than their base-models due to their design choices. On the other side, we find that larger models tend to be more helpful and are less likely to respond with harmful information. These results highlight the importance of targeted alignment strategies tailored to the unique challenges of software engineering tasks and provide a foundation for future work in this critical area.

Link to Preprint

https://arxiv.org/abs/2504.01850

DOI

https://doi.org/10.1145/3729380

Ali Al-Kaswan

Delft University of Technology, Netherlands

Netherlands

Sebastian Deatc

Delft University of Technology

Begüm Koç

Delft University of Technology

Arie van Deursen

TU Delft

Netherlands

Maliheh Izadi

Delft University of Technology

Netherlands

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

Use conference time zone: (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, ViennaSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

Display full programSpecify a time band

Save

Session Program

Wed 25 Jun
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna change

14:00 - 15:20	LLM for SE 4Research Papers / Journal First at Cosmos Hall Chair(s): Ting Su East China Normal University

14:00 20m Talk		Large Language Models for Software Engineering: A Systematic Literature Review Journal First Xinyi Hou Huazhong University of Science and Technology, Yanjie Zhao Huazhong University of Science and Technology, Yue Liu Monash University, Zhou Yang Singapore Management University; University of Alberta, Kailong Wang Huazhong University of Science and Technology, Li Li Beihang University, Xiapu Luo Hong Kong Polytechnic University, David Lo Singapore Management University, John Grundy Monash University, Haoyu Wang Huazhong University of Science and Technology
14:20 20m Talk		Calibration of Large Language Models on Code Summarization Research Papers Yuvraj Virk UC Davis, Prem Devanbu University of California at Davis, Toufique Ahmed IBM Research DOI
14:40 20m Talk		Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks Research Papers Ali Al-Kaswan Delft University of Technology, Netherlands, Sebastian Deatc Delft University of Technology, Begüm Koç Delft University of Technology, Arie van Deursen TU Delft, Maliheh Izadi Delft University of Technology DOI Pre-print
15:00 20m Talk		PATCH: Empowering Large Language Model with Programmer-Intent Guidance and Collaborative-Behavior Simulation for Automatic Bug Fixing Journal First Yuwei Zhang Institute of Software Chinese Academy of Sciences, Zhi Jin Peking University, xingying Beijing University of Posts and Telecommunications, Ge Li Peking University, Fang Liu Beihang University, Jiaxin Zhu Institute of Software at Chinese Academy of Sciences, Wensheng Dou Institute of Software Chinese Academy of Sciences, Jun Wei Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences

Information for Participants

Wed 25 Jun 2025 14:00 - 15:20 at Cosmos Hall - LLM for SE 4 Chair(s): Ting Su

Info for room Cosmos Hall:

This is the main event hall of Clarion Hotel, which will be used to host keynote talks and other plenary sessions. The FSE and ISSTA banquets will also happen in this room.

The room is just in front of the registration desk, on the other side of the main conference area. The large doors with numbers “1” and “2” provide access to the Cosmos Hall.