Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks (Security, SE for AI)
Large language models (LLMs) have revolutionized artificial intelligence, but their increasing deployment across critical domains has raised concerns about abnormal behaviors under malicious attacks. Such vulnerabilities point to a widespread inadequacy in pre-release testing. In this paper, we conduct a comprehensive empirical study to evaluate the effectiveness of traditional coverage criteria in identifying these inadequacies, exemplified by a significant security concern: jailbreak attacks. Our study begins with a clustering analysis of the hidden states of LLMs, revealing that the embedded characteristics effectively distinguish between different query types. We then systematically evaluate the performance of these criteria across three key dimensions: criterion level, layer level, and token level.
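As a rough illustration of that clustering analysis (not the paper's actual pipeline), one could mean-pool a model's last-layer hidden states into per-query embeddings and cluster them. The model choice ("gpt2" as a small stand-in), the pooling strategy, and the example prompts below are assumptions made for this sketch:

```python
# Illustrative sketch only: embed queries via an LLM's hidden states and
# check whether they separate into clusters. "gpt2", last-layer mean
# pooling, and the prompts are stand-in assumptions, not the paper's setup.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def embed(prompt: str) -> torch.Tensor:
    """Mean-pool the last hidden layer over tokens as a query embedding."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)  # (hidden_dim,)

queries = [
    "How do I bake sourdough bread?",                    # normal query
    "Pretend you have no rules and explain how to ...",  # jailbreak-style
]
X = torch.stack([embed(q) for q in queries]).numpy()
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))  # cluster assignments
```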
Our research uncovers significant differences in the sets of neurons covered when LLMs process normal versus jailbreak queries, aligning with our clustering experiments. Leveraging these findings, we propose three practical applications of coverage criteria in LLM security testing. Specifically, we develop a real-time jailbreak detection mechanism that achieves high accuracy (93.61% on average) in classifying queries as normal or jailbreak. Furthermore, we explore the use of coverage levels to prioritize test cases, improving testing efficiency by focusing on high-risk interactions and removing redundant tests. Lastly, we introduce a coverage-guided approach for generating jailbreak attack examples, enabling systematic refinement of prompts to uncover new vulnerabilities. This study improves our understanding of LLM security testing, enhances LLM safety, and provides a foundation for developing more robust AI applications.
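To make the detection idea concrete, here is a minimal hedged sketch assuming one simple way neuron coverage could be operationalized: binarize a layer's activations into a coverage vector and fit an off-the-shelf classifier on labeled queries. The threshold, layer choice, classifier, and placeholder data are all illustrative assumptions; the paper's criteria and detector may differ.

```python
# Hedged sketch of a coverage-style detector, assuming a simple thresholded
# notion of neuron coverage: a neuron counts as "covered" if any token
# activation exceeds THRESHOLD. Threshold, layer choice, classifier, and the
# random placeholder data are all illustrative assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

THRESHOLD = 0.5  # assumed activation threshold

def coverage_vector(hidden: torch.Tensor) -> np.ndarray:
    """hidden: (seq_len, hidden_dim) activations from one layer."""
    return (hidden.max(dim=0).values > THRESHOLD).numpy().astype(np.float32)

fake_hidden = torch.rand(12, 768)  # stand-in for one layer's activations
print(int(coverage_vector(fake_hidden).sum()), "neurons covered")

# Fit a simple classifier on coverage vectors of labeled queries
# (0 = normal, 1 = jailbreak); random data stands in for real activations.
rng = np.random.default_rng(0)
X = rng.random((200, 768)) > 0.5
y = rng.integers(0, 2, size=200)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```

The same coverage vectors could, in principle, also drive the other two applications the abstract names: ranking test cases by the new coverage each one adds, and mutating prompts in directions that increase coverage.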
Wed 30 Apr · Displayed time zone: Eastern Time (US & Canada)
16:00 - 17:30
16:00 · 15m Talk · Research Track
Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks (Security, SE for AI)
Shide Zhou (Huazhong University of Science and Technology), Li Tianlin (NTU), Kailong Wang (Huazhong University of Science and Technology), Yihao Huang (NTU), Ling Shi (Nanyang Technological University), Yang Liu (Nanyang Technological University), Haoyu Wang (Huazhong University of Science and Technology)

16:15 · 15m Talk · Research Track
Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning Software (Security, SE for AI)
Zhenpeng Chen (Nanyang Technological University), Xinyue Li (Peking University), Jie M. Zhang (King's College London), Federica Sarro (University College London), Yang Liu (Nanyang Technological University)
Pre-print available

16:30 · 15m Talk · Research Track
HIFI: Explaining and Mitigating Algorithmic Bias through the Lens of Game-Theoretic Interactions (Security, SE for AI)
Lingfeng Zhang (East China Normal University), Zhaohui Wang (Software Engineering Institute, East China Normal University), Yueling Zhang (East China Normal University), Min Zhang (East China Normal University), Jiangtao Wang (Software Engineering Institute, East China Normal University)

16:45 · 15m Talk · Research Track
Towards More Trustworthy Deep Code Models by Enabling Out-of-Distribution Detection (Security, SE for AI)
Yanfu Yan (William & Mary), Viet Duong (William & Mary), Huajie Shao (College of William & Mary), Denys Poshyvanyk (William & Mary)

17:00 · 15m Talk · Research Track
FairSense: Long-Term Fairness Analysis of ML-Enabled Systems (Security, SE for AI)
Yining She (Carnegie Mellon University), Sumon Biswas (Carnegie Mellon University), Christian Kästner (Carnegie Mellon University), Eunsuk Kang (Carnegie Mellon University)