ConflictBench: A Benchmark to Evaluate Software Merge Tools (ASE 2024 - Journal-first Papers)

Who

Bowen Shen, Na Meng

Track

ASE 2024 Journal-first Papers

When

Thu 31 Oct 2024 14:15 - 14:30 at Gardenia - Software Merge
Chair(s): Haiyan Zhao

Abstract

In collaborative software development, programmers create branches to tentatively add features and fix bugs, and then merge those branches to integrate their edits. When branches edit the same text in divergent ways, the edits conflict and cannot both be applied. Tools have been built to automatically merge software branches, detecting and resolving conflicts along the way. However, there is no third-party benchmark or metric to comprehensively evaluate or compare those tools.

In this paper, we introduce ConflictBench, a novel benchmark consisting of 180 merging scenarios extracted from 180 open-source Java projects. For each scenario, we sampled a conflicting chunk reported by git-merge. Because git-merge sometimes wrongly reports conflicts, we manually inspected the chunks and labeled 136 of the 180 as true conflicts and 44 as false conflicts. To facilitate tool evaluation, we also defined a systematic manual-analysis method to examine all program versions involved in each merging scenario and to summarize the root causes as well as developers’ resolution strategies. We further defined three novel metrics to evaluate merge tools. By applying five state-of-the-art tools to ConflictBench, we observed that ConflictBench is effective at characterizing different tools. It helps reveal limitations of existing tools and sheds light on future research.

Before constructing the benchmark, we conducted a literature review for existing merge tools and empirical studies on merge techniques. We observed and discussed how merge tools were evaluated, identifying the following requirements that a good benchmark should satisfy:

Diversity: It should cover a wide range of scenarios where merge happens, so that the dataset is representative.
True Conflicts: It should include true conflicts between branch edits, to assess whether merge tools can identify the conflicts when two branches edit the same text differently.
False Conflicts: It should include false conflicts, to assess whether a merge tool wrongly reports conflicts when the branches do not edit the same text simultaneously.
Conflict Resolutions: It should include developers’ resolutions to reported conflicts, to evaluate whether the tool-generated resolutions match human-crafted ones.

To satisfy all the requirements above, we created our benchmark by crawling 208 popular open-source Java repositories. For each repository, we randomly sampled a commit that attempts to merge software branches via git-merge, and manually inspected the conflicts reported by git-merge to pick one satisfying our selection criteria. The picked conflicts together form our benchmark, ConflictBench. Among the 180 conflicts it contains, there are 136 true conflicts and 44 false ones. To facilitate tool comparison, we also classified conflicts based on the types of branch edits, the types of edited files, and developers’ resolution strategies.
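To make the notion of a "conflicting chunk reported by git-merge" concrete, the following is a minimal Python sketch that extracts such chunks from a file containing git's standard conflict markers. It is only an illustration, not the authors' tooling: the names `ConflictChunk`, `extract_conflict_chunks`, and `SAMPLE` are hypothetical, and the sample text is made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConflictChunk:
    """One conflicting region as left in a file by git-merge (hypothetical type)."""
    ours: List[str] = field(default_factory=list)    # lines from the current branch
    base: List[str] = field(default_factory=list)    # lines from the common ancestor (diff3 style only)
    theirs: List[str] = field(default_factory=list)  # lines from the branch being merged


def extract_conflict_chunks(text: str) -> List[ConflictChunk]:
    """Collect every region delimited by git's conflict markers."""
    chunks: List[ConflictChunk] = []
    current: Optional[ConflictChunk] = None
    side = ""
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            current, side = ConflictChunk(), "ours"
        elif current is not None and line.startswith("|||||||"):
            side = "base"
        elif current is not None and line.startswith("======="):
            side = "theirs"
        elif current is not None and line.startswith(">>>>>>>"):
            chunks.append(current)
            current, side = None, ""
        elif current is not None:
            getattr(current, side).append(line)
    return chunks


# Made-up file content with a single conflicting chunk.
SAMPLE = "\n".join([
    "int a = 1;",
    "<<<<<<< HEAD",
    "int b = 2;",
    "=======",
    "int b = 3;",
    ">>>>>>> feature",
    "int c = 4;",
])

if __name__ == "__main__":
    for chunk in extract_conflict_chunks(SAMPLE):
        print("ours:", chunk.ours, "theirs:", chunk.theirs)
```

Deciding whether an extracted chunk is a true or false conflict still requires the manual inspection described above; the sketch only locates the chunks.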

We applied five state-of-the-art merge tools to ConflictBench, to check whether our benchmark can characterize tools’ effectiveness and reveal differences between tools. The tools include KDiff3, FSTMerge, JDime, IntelliMerge, and AutoMerge. We observed the following interesting phenomena in our experiments. KDiff3 has wider applicability than the other tools. JDime reported conflicts with the highest precision (92%), while AutoMerge reported the fewest conflicts (i.e., 17). KDiff3 achieved the highest resolution desirability (83%), meaning that the majority of merged versions it produces match developers’ hand-crafted versions.

In this paper, we made the following research contributions:

We defined a novel systematic method to classify merge-conflict data, and applied that method to manually create a benchmark of merge-conflict data named ConflictBench. This benchmark includes 180 merging scenarios with labeled true/false conflicts, types of branch edits, types of edited files, and developers’ resolution strategies. No prior work characterizes conflicts in such a comprehensive way.
We defined three novel metrics to evaluate software merge tools: tool applicability, detection precision, and resolution desirability.
We comprehensively evaluated five state-of-the-art software merge tools using ConflictBench, and observed interesting phenomena in terms of tool applicability, conflict-detection precision, and conflict-resolution desirability. No prior work does such an empirical evaluation of these tools or presents the novel findings we have.
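The abstract names the three metrics but does not give their formulas. The Python sketch below shows one plausible way to compute them, under assumed definitions that may differ from the paper's: applicability as the share of scenarios a tool can process, detection precision as the share of reported conflicts that are labeled true, and resolution desirability as the share of tool-produced merged versions that match the developers' versions. The `ToolOutcome` record and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolOutcome:
    """Result of running one tool on one benchmark scenario (hypothetical record)."""
    true_conflict: bool                # benchmark label of the sampled chunk
    produced_output: bool              # the tool ran and produced a result
    reported_conflict: bool            # the tool reported a conflict for the chunk
    matches_developer: Optional[bool]  # None when no merged version was produced


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Return None instead of dividing by zero.
    return numerator / denominator if denominator else None


def applicability(outcomes: List[ToolOutcome]) -> Optional[float]:
    # Assumed definition: share of scenarios the tool could process.
    return _ratio(sum(o.produced_output for o in outcomes), len(outcomes))


def detection_precision(outcomes: List[ToolOutcome]) -> Optional[float]:
    # Assumed definition: share of reported conflicts that are true conflicts.
    reported = [o for o in outcomes if o.produced_output and o.reported_conflict]
    return _ratio(sum(o.true_conflict for o in reported), len(reported))


def resolution_desirability(outcomes: List[ToolOutcome]) -> Optional[float]:
    # Assumed definition: share of merged versions matching the developers' version.
    merged = [o for o in outcomes if o.produced_output and not o.reported_conflict]
    return _ratio(sum(o.matches_developer is True for o in merged), len(merged))


# Made-up outcomes for illustration only; these are not results from the paper.
EXAMPLE = [
    ToolOutcome(True, True, True, None),     # true conflict, correctly reported
    ToolOutcome(False, True, True, None),    # false conflict, wrongly reported
    ToolOutcome(False, True, False, True),   # merged, matches the developers' version
    ToolOutcome(True, False, False, None),   # tool could not process the scenario
]

if __name__ == "__main__":
    print(applicability(EXAMPLE))
    print(detection_precision(EXAMPLE))
    print(resolution_desirability(EXAMPLE))
```

Keeping the three measures separate means a tool that handles few scenarios but handles them well remains distinguishable from one that handles many scenarios poorly.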

This paper was accepted for publication by Journal of Systems and Software (JSS) in April 2024. Our work is not a secondary study but presents entirely new research findings and innovative contributions that have not been previously reported. The paper has not been presented at, nor is it under consideration for, journal-first programs of other conferences. The first author Bowen Shen will give the presentation. If accepted, this paper will be the only paper that Bowen presents at ASE 2024. Therefore, the acceptance will definitely increase Bowen’s opportunity to attend ASE. The paper is available at https://doi.org/10.1016/j.jss.2024.112084.

Bowen Shen

Virginia Tech

Na Meng