Automated Software Test Generation at Industry Scale Using a Multi-Agent Architecture and Workflow Integration
Distinguished Paper Award
Raising test coverage at industry scale is difficult: engineers report a median of four minutes to produce a single covered line of code, making coverage targets expensive in large repositories. We present AutoCover, a production system that automatically generates, validates, and proposes tests using large language models. AutoCover supports three complementary interaction modes: a CLI for local scripting, a Headless mode for generating tests at scale across repository shards and creating merge requests, and an IDE mode for human-in-the-loop generation that captures developer intent and enables rapid fixes. Together, these modes support both legacy code in repositories and newly developed code on developer machines.
AutoCover is implemented as a modular, agentic pipeline built on LangGraph, with subgraphs for preparation, generation, execution, and validation or repair. Preparation includes scaffolding tests and running initial coverage, while generation uses multi-shot prompting informed by prior failures. Repository adapters provide language- and repository-specific actions such as import splicing, coverage commands, and coding conventions, and a code-context retriever supplies only relevant symbols to the LLM to stay within context limits. To reduce low-quality outputs, AutoCover combines intent-aware generation with validation gates such as coverage deltas and mutation or branch checks where available, along with flakiness defenses like multi-run CI simulation when compute permits.
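The control flow described above (prepare, then generate, execute, and validate or repair in a loop) can be illustrated with a minimal plain-Python sketch. It is not AutoCover's implementation and does not use LangGraph. Every name in it (`PipelineState`, `run_pipeline`, `min_coverage_delta`, `max_attempts`, and the `demo_*` stubs) is a hypothetical stand-in, and the coverage numbers are made up for the demonstration. The only validation gate modelled is a coverage delta; mutation checks, branch checks, and flakiness defenses are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PipelineState:
    """State handed from stage to stage (all field names are hypothetical)."""
    baseline_coverage: float = 0.0
    coverage: float = 0.0
    test_source: str = ""
    last_failure: Optional[str] = None
    failure_history: List[str] = field(default_factory=list)
    attempts: int = 0
    accepted: bool = False


Stage = Callable[[PipelineState], None]


def run_pipeline(
    prepare: Stage,
    generate: Stage,
    execute: Stage,
    min_coverage_delta: float = 0.01,
    max_attempts: int = 3,
) -> PipelineState:
    """Run prepare once, then loop generate -> execute -> validate."""
    state = PipelineState()
    prepare(state)  # e.g. scaffold a test file and record baseline coverage
    while state.attempts < max_attempts:
        state.attempts += 1
        generate(state)  # may consult state.failure_history (prior failures)
        execute(state)   # runs the tests; updates coverage and last_failure
        delta = state.coverage - state.baseline_coverage
        # Validation gate: the tests must pass AND raise coverage enough.
        if state.last_failure is None and delta >= min_coverage_delta:
            state.accepted = True
            break
    return state


# --- Stub stages for demonstration only; a real system would call an LLM
# --- and a test runner here.
def demo_prepare(state: PipelineState) -> None:
    state.baseline_coverage = 0.50
    state.coverage = 0.50


def demo_generate(state: PipelineState) -> None:
    known = len(state.failure_history)
    state.test_source = f"# attempt {state.attempts}, known failures: {known}"


def demo_execute(state: PipelineState) -> None:
    if state.attempts == 1:
        # Simulate a first attempt that fails to run.
        state.last_failure = "ImportError: missing import"
        state.failure_history.append(state.last_failure)
    else:
        # Simulate a repaired attempt that passes and adds coverage.
        state.last_failure = None
        state.coverage = 0.56


result = run_pipeline(demo_prepare, demo_generate, demo_execute)
print(result.accepted, result.attempts)
```

In this sketch the first attempt fails, its error is recorded, and the second attempt passes the gate. Treating acceptance as a conjunction of "tests pass" and "coverage increased" is one simple way to keep passing but useless tests from being proposed.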
This paper makes three contributions. First, it describes AutoCover’s end-to-end architecture and evolution, including design considerations around user experience, test quality, and cost. Second, it details interaction patterns across CLI, Headless, and IDE modes, and shows how intent collection and human-in-the-loop repair improve acceptance. Third, it reports both intrinsic and extrinsic evaluation results. AutoCover now generates about 11% of all new tests that are reviewed and added to CompanyX’s codebase.
Wed 15 Apr
Displayed time zone: Brasilia, Distrito Federal, Brazil
11:00 - 12:30 | Testing and Analysis 3 | SE In Practice (SEIP) / Research Track at Oceania I | Chair(s): Yvonne Dittrich IT University of Copenhagen
11:00 | 15m | Talk | TestifAI: Tomography-Based Testing for Deep Learning Systems | Research Track | Arooj Arif Northeastern University London, Tobias Hartung Northeastern University London, Elena Botoeva University of Kent, Alexandros Koliousis Northeastern University London
11:15 | 15m | Talk | Automated Software Test Generation at Industry Scale Using a Multi-Agent Architecture and Workflow Integration | SE In Practice (SEIP) | Matas Rastenis Uber Technologies Inc., Ben Chou Uber Technologies Inc., Shauvik Roy Choudhary Uber Technologies, Inc, René Just University of Washington
11:30 | 15m | Talk | On the Flakiness of LLM-generated Tests for Industrial and Open-Source Database Management Systems | SE In Practice (SEIP) | Alexander Berndt Heidelberg University, Thomas Bach SAP, Rainer Gemulla University of Mannheim, Marcus Kessel University of Mannheim, Sebastian Baltes Heidelberg University
11:45 | 15m | Talk | Enabling Black-box RPC-API Testing with Multi-Agent Reinforcement Learning and LLMs: An Industry Case Study | SE In Practice (SEIP) | Xiaoqing Sun Alibaba Cloud, Zhou Shao Alibaba Cloud, Xiaonan Shi Alibaba Cloud, Shiliang Xiao Alibaba Cloud, Chao Ma Alibaba Cloud, Xiaobo Xue Alibaba Cloud, Jianyuan Lu Alibaba Cloud, Shize Zhang Alibaba Cloud, Enge Song Alibaba Cloud, Song Yang Alibaba Cloud, Xing Li Zhejiang University and Alibaba Cloud, Chongrong Fang Shanghai Jiao Tong University, Chunrong Fang Nanjing University, Biao Lyu Alibaba Cloud, Shunmin Zhu Hangzhou Feitian Cloud and Alibaba Cloud
12:00 | 15m | Talk | Hamster: A Large-Scale Study and Characterization of Developer-Written Tests | SE In Practice (SEIP) | Rangeet Pan IBM Research, Tyler Stennett Georgia Institute of Technology, Raju Pavuluri IBM T.J. Watson Research Center, Nate Levin Georgia Institute of Technology, Alessandro Orso University of Georgia, USA, Saurabh Sinha IBM Research
12:15 | 15m | Talk | AutoOracle: High-Quality C++ Test Oracle Generation via Data Quality-Driven and Filtering-Enabled LLMs | SE In Practice (SEIP) | Cong Li Samsung R&D Institute China Xi'an, Samsung Electronics, Jong-In Jang Samsung Electronics, Yuqi Zhang Samsung R&D Institute China Xi'an, Samsung Electronics, Nakwon Lee Samsung Electronics, Bin Wang, Yinghua Zhang Samsung R&D Institute China Xi'an, Samsung Electronics, Chanwook Kim Samsung Electronics, Jia Zhang Samsung R&D Institute China Xi'an, Samsung Electronics, HyunSeok Kim Samsung Electronics, Xing He Samsung R&D Institute China Xi'an, Samsung Electronics, Kangho Roh Samsung Electronics, Seongjun Ahn Samsung Electronics