Attacks and Defenses for Large Language Models on Coding Tasks (ASE 2024 - The New Ideas and Emerging Results (NIER) Track)

Who

Chi Zhang, Zifan Wang, Ruoshi Zhao, Ravi Mangal, Matt Fredrikson, Limin Jia, Corina S. Păsăreanu

Track

ASE 2024 NIER Track

Time Zone

The program is currently displayed in (GMT-07:00) Pacific Time (US & Canada).

When

Tue 29 Oct 2024 11:50 - 12:00 at Magnolia - SE for AI 1 Chair(s): Chengcheng Wan

Abstract

Modern large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities for coding tasks, including writing and reasoning about code. They improve upon previous neural network models of code, such as code2seq or seq2seq, that already demonstrated competitive results when performing tasks such as code summarization and identifying code vulnerabilities. However, these previous code models were shown to be vulnerable to adversarial examples, i.e., small syntactic perturbations designed to “fool” the models. In this paper, we first aim to study the transferability of adversarial examples, generated through white-box attacks on smaller code models, to LLMs. We also propose a new attack using an LLM to generate the perturbations. Further, we propose novel cost-effective techniques to defend LLMs against such adversaries via prompting, without incurring the cost of retraining. These prompt-based defenses involve modifying the prompt to include additional information, such as examples of adversarially perturbed code and explicit instructions for reversing adversarial perturbations. Our preliminary experiments show the effectiveness of the attacks and the proposed defenses on popular LLMs such as GPT-3.5 and GPT-4.
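
The prompt-based defense described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the instruction wording, the `PerturbedExample` type, the `build_defended_prompt` helper, and the kinds of perturbation named in the instruction (misleading identifier names, dead code) are all assumptions made for the example. It only shows how a prompt might be extended with an explicit reversal instruction and examples of adversarially perturbed code before the actual task.

```python
from dataclasses import dataclass


@dataclass
class PerturbedExample:
    """Hypothetical few-shot pair: perturbed code and its restored form."""
    perturbed: str
    restored: str


# Hypothetical instruction text; the paper's actual wording is not given here.
REVERSAL_INSTRUCTION = (
    "The code below may contain adversarial perturbations such as misleading "
    "identifier names or dead code. First undo any such perturbations, then "
    "perform the task on the restored code."
)


def build_defended_prompt(task: str, code: str, examples: list[PerturbedExample]) -> str:
    """Assemble a prompt with a reversal instruction and perturbed-code examples."""
    parts = [REVERSAL_INSTRUCTION, ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i} (perturbed):",
            ex.perturbed,
            f"Example {i} (restored):",
            ex.restored,
            "",
        ]
    # The task and the (possibly perturbed) input code always come last.
    parts += [f"Task: {task}", "Code:", code]
    return "\n".join(parts)
```

A caller would pass the resulting string to whichever LLM client it uses; no retraining is involved, which is the cost advantage the abstract points to.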

Chi Zhang

Zifan Wang

Center for AI Safety

Ruoshi Zhao

Independent Researcher

Ravi Mangal

Colorado State University

United States

Matt Fredrikson

Carnegie Mellon University

Limin Jia

Corina S. Păsăreanu

Carnegie Mellon University; NASA Ames