TestForge: A Benchmarking Framework for LLM-Based Test Case Generation
Test-Driven Development (TDD) is a widely recognized practice for building reliable software, yet its adoption is often constrained by the effort required to write tests before implementing functionality. Recent advances in Large Language Models (LLMs) offer new opportunities to automate test generation directly from natural language specifications. We present TestForge, a benchmarking framework for systematically evaluating LLM-based test case generation, which defines the core evaluation components in line with well-established benchmarking guidelines. As a first instantiation, we apply TestForge to JUnit test generation using a curated subset of the IBM CodeNet dataset, which includes problem statements, limited reference tests, and both correct and faulty solutions. Our study encompasses problems ranging from basic algorithms to more complex scenarios, and it explores three prompting strategies: zero-shot, few-shot, and chain-of-thought. Results show that while LLMs can generate functionally useful tests for clearly specified problems, they continue to struggle with producing comprehensive, semantically rich suites for more complex cases.
Wed 18 Mar (displayed time zone: Athens)
11:00 - 12:30 | Session 1B - LLMs for Testing and Automated Repair | Research Track / Reproducibility Studies and Negative Results (RENE) Track / Short Papers and Posters Track / Early Research Achievement (ERA) Track / Tool Demo Track | at Megaron Beta | Chair(s): Choro Ulan Uulu (Eindhoven University of Technology)
11:00 | 15m | Talk | HieraTest: Hierarchical Dependency–Driven Framework with Multi-Strategy Repair for LLM-based Unit Test Generation | Research Track | Weichang Liu (Zhejiang University), Junwei Zhang (Zhejiang University), Xiaochun Zhu (Insigma Hengtian Software LTD), Bo Zhou (Northeastern University)
11:15 | 15m | Talk | TestForge: A Benchmarking Framework for LLM-Based Test Case Generation | Research Track | Marco Vieira (University of North Carolina at Charlotte), Bhavain Shah (University of North Carolina at Charlotte), Priyam Ashish Shah (University of North Carolina at Charlotte), Vineet Khadloya (Salesforce)
11:30 | 15m | Talk | RM -RF: Reward Model for Run-Free Unit Test Evaluation | Research Track | Elena Bruches (Siberian Neuronets LLC), Daniil Grebenkin (Siberian Neuronets LLC), Mikhail Klementev (Siberian Neuronets LLC), Vadim Alperovich (T-Technologies), Roman Derunets (Siberian Neuronets LLC), Dari Baturova (Siberian Neuronets LLC), Georgiy Mkrtchyan (T-Technologies), Oleg Sedukhin (Siberian Neuronets LLC), Ivan Bondarenko (Novosibirsk State University), Nikolay Bushkov (T-Technologies), Stanislav Moiseev (T-Technologies)
11:45 | 15m | Talk | Can We Classify Flaky Tests Using Only Test Code? An LLM-Based Empirical Study | Reproducibility Studies and Negative Results (RENE) Track | Alexander Berndt, Vekil Bekmyradov (SAP), Rainer Gemulla (University of Mannheim), Marcus Kessel (University of Mannheim), Thomas Bach (SAP), Sebastian Baltes (Heidelberg University)
12:00 | 7m | Talk | Integrating A Large Language Model Into Search-based Automated Program Repair | Short Papers and Posters Track
12:07 | 7m | Talk | RisConFix: LLM-based Automated Repair of Risk-Prone Drone Configurations | Short Papers and Posters Track | Liping Han (Nanjing University of Posts and Telecommunications), Tingting Nie (Nanjing University of Posts and Telecommunications), Le Yu (Nanjing University of Posts and Telecommunications), Mingzhe Hu (Nanjing University of Posts and Telecommunications), Tao Yue (Beihang University)
12:14 | 7m | Talk | Leveraging Mutation Analysis for LLM-based Repair of Quantum Programs | Early Research Achievement (ERA) Track | Chihiro Yoshida (The University of Osaka), Yuta Ishimoto (The University of Osaka), Olivier Nourry (The University of Osaka), Masanari Kondo (Kyushu University), Makoto Matsushita (The University of Osaka), Yasutaka Kamei (Kyushu University), Yoshiki Higo (Osaka University)
12:21 | 7m | Talk | AI-Assisted Semantic Modeling of Languages for Symbolic Execution Driven Unit Test Generation | Tool Demo Track | Mokshith Reddy Tanguturi, Atul Kumar (IBM Research India), Nandakishore S Menon (IBM Research India), Sridhar Chimalakonda (Indian Institute of Technology Tirupati)