A Hitchhiker’s Guide to Jailbreaking ChatGPT via Prompt Engineering (PROMISE 2024)

Who

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang

Track

PROMISE 2024

When

Tue 16 Jul 2024 10:15 - 10:30 (GMT-03:00, Brasilia) at Acerola - Morning session 1

Abstract

Natural language prompts serve as an essential interface between users and Large Language Models (LLMs) like GPT-3.5 and GPT-4, which are employed by ChatGPT to produce outputs across various tasks. However, prompts crafted with malicious intent, known as jailbreak prompts, can circumvent the restrictions of LLMs, posing a significant threat to systems integrated with these models. Despite their critical importance, there is a lack of systematic analysis and comprehensive understanding of jailbreak prompts. Our paper aims to address this gap by exploring key research questions to enhance the robustness of LLM systems: 1) What common patterns are present in jailbreak prompts? 2) How effectively can these prompts bypass the restrictions of LLMs? 3) With the evolution of LLMs, how does the effectiveness of jailbreak prompts change?

To address our research questions, we embarked on an empirical study targeting the LLMs underpinning ChatGPT, one of today's most advanced chatbots. Our methodology involved categorizing 78 jailbreak prompts into 10 distinct patterns, further organized into three jailbreak strategy types, and examining their distribution. We assessed the effectiveness of these prompts on GPT-3.5 and GPT-4, using a set of 3,120 questions across 8 scenarios deemed prohibited by OpenAI. Additionally, our study tracked the performance of these prompts over a 3-month period, observing the evolutionary response of ChatGPT to such inputs. Our findings offer a comprehensive view of jailbreak prompts, elucidating their taxonomy, effectiveness, and temporal dynamics. Notably, we discovered that GPT-3.5 and GPT-4 could still generate inappropriate content in response to malicious prompts without the need for jailbreaking. This underscores the critical need for effective prompt management within LLM systems and provides valuable insights and data to spur further research in LLM testing and jailbreak prevention.

DOI

https://doi.org/10.1145/3663530.3665021

Yi Liu, Nanyang Technological University, Singapore
Gelei Deng, Nanyang Technological University, Singapore
Zhengzi Xu, Nanyang Technological University, Singapore
Yuekang Li, The University of New South Wales, Australia
Yaowen Zheng, Institute of Information Engineering at Chinese Academy of Sciences, China
Ying Zhang, Virginia Tech, United States
Lida Zhao, Nanyang Technological University, Singapore
Tianwei Zhang, Nanyang Technological University, Singapore
Kailong Wang, Huazhong University of Science and Technology, China

Session Program

Tue 16 Jul
Displayed time zone: Brasilia, Distrito Federal, Brazil

09:00 - 10:30	Morning session 1, PROMISE 2024, at Acerola

09:00 5m Day opening		Opening (PROMISE 2024)
09:05 55m Keynote		SEA4DQ keynote 1, Denys Poshyvanyk (PROMISE 2024)
10:00 15m Talk		Graph Neural Network vs. Large Language Model: A Comparative Analysis for Bug Report Priority and Severity Prediction (PROMISE 2024). Jagrit Acharya, University of Calgary; Gouri Ginde (Deshpande), University of Calgary
10:15 15m Talk		A Hitchhiker’s Guide to Jailbreaking ChatGPT via Prompt Engineering (PROMISE 2024). Yi Liu, Nanyang Technological University; Gelei Deng, Nanyang Technological University; Zhengzi Xu, Nanyang Technological University; Yuekang Li, The University of New South Wales; Yaowen Zheng, Institute of Information Engineering at Chinese Academy of Sciences; Ying Zhang, Virginia Tech; Lida Zhao, Nanyang Technological University; Tianwei Zhang, Nanyang Technological University; Kailong Wang, Huazhong University of Science and Technology