ICSE 2026
Sun 12 - Sat 18 April 2026 Rio de Janeiro, Brazil

In recent years, the application of Large Language Models~(LLMs) has become increasingly widespread, along with growing concerns about their security. To assess the security of LLMs, researchers have proposed various jailbreak attack algorithms, but only rely on the models’ internal information or face limitations in exploring the unsafe behavior, highlighting the need for a more adaptive and generalizable approach. Inspired by the game of rats escaping a maze, we introduce a novel jailbreak attack approach, MazeBreaker, where attackers dynamically learn to find the exit based on feedback and their accumulated experience to compromise the target LLMs’ security defenses.
Our method is the first to systematically learn from the feedback of attack attempts on target LLMs through a multi-agent reinforcement learning system, enabling strategic exploration of the model’s unsafe boundaries without a reference oracle. We compared our approach with six state-of-the-art jailbreak attack methods, testing it on 13 different architectures of open-source and commercial models. The results show that our method performs exceptionally well in terms of attack effectiveness, especially for the commercial models (GPT-3.5-turbo, GPT-4o-mini, GLM-4-air and Claude-3.5-sonnet) with strong safety alignment. We hope this study will help academia and industry better test the security of large language models and promote adherence to safety and ethical standards. Code and data are available on our repository: https://anonymous.4open.science/r/MazeBreaker.