ASE 2025
Sun 16 - Thu 20 November 2025 Seoul, South Korea

This program is tentative and subject to change.

Mon 17 Nov 2025 11:30 - 11:40 at Grand Hall 1 - Program Repair 1

Automated program repair (APR) techniques generate patches for fixing software bugs automatically. The aim of APR is to significantly reduce the manual effort required by developers to fix software bugs. However, previous studies have shown that APR techniques suffer from the overfitting problem. Overfitting happens when a patch passes the test suite without revealing any error, yet the patch does not actually fix the underlying bug, or it introduces a new defect that the test suite does not cover. Such test-suite-passing patches are termed "plausible" patches. Therefore, the patches generated by APR tools need to be validated by human programmers, which can be very costly and hinders the adoption of APR tools in practice. Our work aims to minimize the number of plausible patches that programmers have to review, thereby reducing the time and effort required to find a correct patch.

To alleviate these issues, we present a lightweight patch post-processing technique, named XTESTCLUSTER, that aims to reduce the number of generated patches a developer has to assess. Our technique clusters plausible repair patches that exhibit the same behavior (according to a given set of test suites) and presents the developer with fewer patches, one representative per cluster, thus ensuring that the patches shown exhibit different behavior. Our technique can be used not only when a single tool generates multiple plausible patches for a given bug, but also when several APR tools are run (potentially in parallel) to increase the chance of finding a correct patch. In this way, developers only need to examine one representative patch per cluster, rather than all, possibly hundreds, of the patches produced by APR tools.
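To make the clustering step concrete, the following is a minimal sketch (in Python, not the authors' implementation) of grouping plausible patches by the pass/fail signature they produce on a shared set of test cases and keeping one representative per cluster; the names Patch, TestCase, and run_tests are placeholders assumed for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical names for this sketch: a Patch is just an identifier, and
# run_tests(patch, tests) returns one pass/fail outcome per test case
# (assumed deterministic for the patched program).
Patch = str
TestCase = str
RunTests = Callable[[Patch, List[TestCase]], Tuple[bool, ...]]


def cluster_by_test_behavior(patches: List[Patch],
                             tests: List[TestCase],
                             run_tests: RunTests) -> Dict[Tuple[bool, ...], List[Patch]]:
    """Group patches that show identical pass/fail behavior on the given tests."""
    clusters: Dict[Tuple[bool, ...], List[Patch]] = defaultdict(list)
    for patch in patches:
        signature = run_tests(patch, tests)   # e.g. (True, False, True, ...)
        clusters[signature].append(patch)
    return dict(clusters)


def representatives(clusters: Dict[Tuple[bool, ...], List[Patch]]) -> List[Patch]:
    """Pick one patch per behavioral cluster for the reviewer to inspect."""
    return [members[0] for members in clusters.values()]
```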

Our approach presents two main novelties. First, it leverages the behavioral diversity of the generated patches, a diversity that is not exposed by the developer-written test cases used to synthesize the patches. In particular, our clustering approach XTESTCLUSTER exploits automatically generated test cases that exercise diverse behavior, in addition to the existing test suite. Second, our approach has the advantage of requiring neither code instrumentation (aside from patch application), nor an oracle, nor a pre-existing dataset from which to learn fix patterns.
Moreover, XTESTCLUSTER is complementary to previous work on patch overfitting assessment, as different prioritization strategies can be applied to each cluster.
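As a hedged illustration of why the generated tests matter (the functions and inputs below are invented for this sketch, not taken from the paper): two patches that both pass a developer-written test can still behave differently on an automatically generated input, and would therefore land in different clusters.

```python
# Hypothetical example: two plausible patches for a routine that should clamp
# negative values to zero. Both pass the developer-written test, so the original
# test suite cannot tell them apart.
def patch_a(x: int) -> int:
    return max(x, 0)          # correct behavior

def patch_b(x: int) -> int:
    return abs(x)             # overfits: happens to pass the developer test below

def developer_test() -> None:
    assert patch_a(5) == 5
    assert patch_b(5) == 5    # both patches are "plausible" w.r.t. this test

# An automatically generated input (in the real setting, produced by tools such as
# EvoSuite or Randoop for Java programs) exposes the behavioral difference:
if __name__ == "__main__":
    developer_test()
    generated_input = -3
    print(patch_a(generated_input))   # 0
    print(patch_b(generated_input))   # 3 -> different output, hence different clusters
```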

The output of XTESTCLUSTER provides developers and code reviewers with: 1) a way to reduce the number of patches to analyze, since they can focus on a sample of patches from each cluster, and 2) enriched information for each patch, including the newly generated test cases, their outcomes, and the inputs that expose behavioral differences across alternative patches for the same bug. Such information supports reviewers in selecting the most appropriate patch to merge into the codebase.

We evaluate our approach on 902 patches (248 correct and 654 overfitted) generated by 21 different APR tools for bugs from the Defects4J dataset. After removing duplicate patches, we used two automated test-case generation tools, EvoSuite and Randoop, to generate test cases for our patch set. Finally, we clustered the patches based on their test-case results. To our knowledge, XTESTCLUSTER is the first approach to jointly analyze patches generated by multiple program repair approaches to fix a particular bug.
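A rough sketch of how the deduplication and test-generation steps might be orchestrated; the jar names, classpaths, target classes, and time budget are assumptions for illustration, and the exact command-line options depend on the EvoSuite and Randoop versions used. The clustering itself then follows the earlier sketch.

```python
import subprocess
from pathlib import Path
from typing import Dict, List

# All jar names, classpaths, target classes, and the time budget below are
# assumptions for illustration; exact flags can differ across tool versions.
def generate_evosuite_tests(target_class: str, classpath: str) -> None:
    """Generate JUnit tests for one class of the patched program with EvoSuite."""
    subprocess.run(
        ["java", "-jar", "evosuite.jar",
         "-class", target_class, "-projectCP", classpath],
        check=True,
    )


def generate_randoop_tests(target_class: str, classpath: str) -> None:
    """Generate JUnit tests for the same class with Randoop."""
    subprocess.run(
        ["java", "-classpath", f"randoop-all.jar:{classpath}",
         "randoop.main.Main", "gentests",
         f"--testclass={target_class}", "--time-limit=60"],
        check=True,
    )


def deduplicate(patch_files: List[Path]) -> List[Path]:
    """Keep a single copy of textually identical patches before test generation."""
    seen: Dict[str, Path] = {}
    for patch in patch_files:
        seen.setdefault(patch.read_text(), patch)
    return list(seen.values())
```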

Our results show that XTESTCLUSTER creates at least two clusters for almost half of the bugs that have two or more different patches. By clustering patches, XTESTCLUSTER reduces the number of patches to review and analyze by a median of 50%. This reduction could help code reviewers (developers using automated repair tools, or researchers evaluating patches) to reduce the time spent on patch evaluation.
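To read the reported reduction: if a bug has N plausible patches grouped into k behavioral clusters, a reviewer inspects k representatives instead of N. The per-bug numbers in the snippet below are hypothetical, not the study's data.

```python
from statistics import median

def review_reduction(num_patches: int, num_clusters: int) -> float:
    """Fraction of patches a reviewer no longer needs to inspect when only
    one representative per behavioral cluster is examined."""
    return 1 - num_clusters / num_patches

# Hypothetical (patches, clusters) counts per bug -- not the study's data.
bugs = [(10, 5), (4, 2), (6, 3)]
print(median(review_reduction(p, c) for p, c in bugs))  # 0.5, i.e. a 50% median reduction
```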

We also analyze the assessments made by two state-of-the-art patch assessment approaches, ODS and Cache, on the patches clustered by XTESTCLUSTER. The results show that XTESTCLUSTER can be used in a complementary fashion with those approaches and can help detect false positives and false negatives.


Mon 17 Nov

Displayed time zone: Seoul

11:00 - 12:30: Program Repair 1 at Grand Hall 1
11:00
10m
Talk
Defects4C: Benchmarking Large Language Model Repair Capability with C/C++ Bugs
Research Papers
Jian Wang (Nanyang Technological University), Xiaofei Xie (Singapore Management University), Qiang Hu (Tianjin University), Shangqing Liu (Nanjing University), Jiongchi Yu (Singapore Management University), Jiaolong Kong (Singapore Management University), Yi Li (Nanyang Technological University)
11:10
10m
Talk
MORepair: Teaching LLMs to Repair Code via Multi-Objective Fine-Tuning
Journal-First Track
Boyang Yang (Yanshan University; Beijing JudaoYouda Network Technology), Haoye Tian (Aalto University), Jiadong Ren (Yanshan University), Hongyu Zhang (Chongqing University), Jacques Klein (University of Luxembourg), Tegawendé F. Bissyandé (University of Luxembourg), Claire Le Goues (Carnegie Mellon University), Shunfu Jin (Yanshan University)
11:20
10m
Talk
When Fine-Tuning LLMs Meets Data Privacy: An Empirical Study of Federated Learning in LLM-Based Program Repair
Journal-First Track
Wenqiang LUO (City University of Hong Kong), Jacky Keung (City University of Hong Kong), Boyang Yang (Yanshan University; Beijing JudaoYouda Network Technology), He Ye (University College London (UCL)), Claire Le Goues (Carnegie Mellon University), Tegawendé F. Bissyandé (University of Luxembourg), Haoye Tian (Aalto University), Xuan-Bach D. Le (University of Melbourne)
11:30
10m
Talk
Test-based Patch Clustering for Automatically-Generated Patches Assessment
Journal-First Track
Matias Martinez (Universitat Politècnica de Catalunya (UPC)), Maria Kechagia (National and Kapodistrian University of Athens), Anjana Perera (Oracle Labs, Australia), Justyna Petke (University College London), Federica Sarro (University College London), Aldeida Aleti (Monash University)
11:40
10m
Talk
Hierarchical Knowledge Injection for Improving LLM-based Program Repair
Research Papers
Ramtin Ehsani (Drexel University), Esteban Parra Rodriguez (Belmont University), Sonia Haiduc (Florida State University), Preetha Chatterjee (Drexel University, USA)
11:50
10m
Talk
Characterizing Multi-Hunk Patches: Divergence, Proximity, and LLM Repair Challenges
Research Papers
Noor Nashid (University of British Columbia), Daniel Ding (University of British Columbia), Keheliya Gallaba (Centre for Software Excellence), Ahmed E. Hassan (Queen’s University), Ali Mesbah (University of British Columbia)
12:00
10m
Talk
Reinforcement Learning for Mutation Operator Selection in Automated Program Repair
Journal-First Track
Carol Hanna (University College London), Aymeric Blot (University of Rennes, IRISA / INRIA), Justyna Petke (University College London)
12:10
10m
Talk
APRMCTS: Improving LLM-based Automated Program Repair with Iterative Tree Search
Research Papers
Haichuan Hu (Nanjing University of Science and Technology), Congqing He (School of Computer Sciences, Universiti Sains Malaysia), Xiaochen Xie (Department of Management, Zhejiang University, China), Hao Zhang (School of Computer Sciences, Universiti Sains Malaysia), Quanjun Zhang (School of Computer Science and Engineering, Nanjing University of Science and Technology)
12:20
10m
Talk
Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Repair
Research Papers
Kai Huang (Technical University of Munich), Jian Zhang (Nanyang Technological University), Xiaofei Xie (Singapore Management University), Chunyang Chen (TU Munich)