Understanding Regular Expression Denial of Service (ReDoS): Insights from LLM-Generated Regexes and Developer Forums
ICPC Full paper
Regular expression Denial of Service (ReDoS) is an algorithmic complexity attack that exploits the processing of regular expressions (regexes) to cause a denial of service. The attack manifests when regex evaluation time scales polynomially or exponentially with input length, posing sporadic yet significant challenges for software developers. The advent of Large Language Models (LLMs) has revolutionized the generation of regexes from natural language prompts, but not without risks: prior work has shown that LLMs can generate code with vulnerabilities and security smells. In this paper, we synthesized a large collection of regex patterns from a comprehensive dataset and assessed their correctness and ReDoS vulnerability. We investigated the characteristics of the vulnerable regexes, categorizing them into equivalence classes to understand their weaknesses. We also examined ReDoS patterns in real software projects and aligned them with the corresponding regex classes. LLM-generated regexes mainly exhibit polynomial ReDoS vulnerability patterns, which is consistent with the real-world data. Moreover, we analyzed developer discussions on GitHub and StackOverflow, constructing a taxonomy to investigate developers' experiences and perspectives on ReDoS. We found that GPT-3.5 was the best LLM at generating regexes that are both correct and secure. We also found that developers' main concern is mitigation strategies for removing vulnerable regexes.
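To make the polynomial versus exponential distinction in the abstract concrete, the following minimal sketch counts the work done by a toy backtracking matcher. It is an illustration only and not the paper's methodology: the matcher, the function name `count_steps`, and the three patterns (`a+$`, `a+a+$`, `(a+)+$`, common textbook examples rather than regexes from the paper's dataset) are all assumptions introduced here. Counting steps instead of measuring wall-clock time keeps the result deterministic.

```python
# Toy backtracking regex matcher that counts elementary steps.
# Hypothetical illustration; production engines differ in many details.
# Patterns are small tuples:
#   ("char", c)     match the single character c
#   ("end",)        match only at end of input, like $
#   ("seq", [...])  match the sub-patterns one after another
#   ("plus", p)     greedy one-or-more of p (p must consume input)

def count_steps(pattern, text):
    """Return (matched, steps) for an anchored match starting at index 0."""
    steps = 0

    def match(node, i, k):
        # k is the continuation: "what still has to match after this node".
        nonlocal steps
        kind = node[0]
        if kind == "char":
            steps += 1
            return i < len(text) and text[i] == node[1] and k(i + 1)
        if kind == "end":
            steps += 1
            return i == len(text) and k(i)
        if kind == "seq":
            def run(idx, pos):
                if idx == len(node[1]):
                    return k(pos)
                return match(node[1][idx], pos, lambda p: run(idx + 1, p))
            return run(0, i)
        if kind == "plus":
            # Greedy: after one body match, try another iteration first,
            # then fall back to the continuation (this is the backtracking).
            return match(node[1], i, lambda p: match(node, p, k) or k(p))
        raise ValueError(f"unknown node kind: {kind}")

    matched = match(pattern, 0, lambda p: True)
    return matched, steps


A = ("char", "a")
LINEAR = ("seq", [("plus", A), ("end",)])                  # a+$
POLYNOMIAL = ("seq", [("plus", A), ("plus", A), ("end",)])  # a+a+$
EXPONENTIAL = ("seq", [("plus", ("plus", A)), ("end",)])    # (a+)+$

if __name__ == "__main__":
    for n in (8, 10, 12):
        attack = "a" * n + "b"  # almost matches, then fails at the last char
        row = [count_steps(p, attack)[1]
               for p in (LINEAR, POLYNOMIAL, EXPONENTIAL)]
        print(n, row)
```

On the near-miss input `"a" * n + "b"`, the step count for `a+$` grows linearly with `n`, the count for `a+a+$` grows roughly quadratically, and the count for `(a+)+$` roughly doubles with each extra character. This is the input-length scaling that a ReDoS attack exploits.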
Presentation Slides (ICPC'24.pptx), 5.3 MiB