ConflictBench: A Benchmark to Evaluate Software Merge Tools
In collaborative software development, programmers create software branches to tentatively add features and fix bugs, and then merge the branches to integrate their edits. When branches divergently edit the same text, the edits conflict and cannot be applied together. Tools have been built to automatically merge software branches, to detect conflicts, and to resolve conflicts along the way. However, there is no third-party benchmark or metric to comprehensively evaluate or compare those tools.
In this paper, we introduce ConflictBench, a novel benchmark consisting of 180 merging scenarios extracted from 180 open-source Java projects. For each scenario, we sampled one conflicting chunk reported by git-merge. Because git-merge sometimes reports conflicts wrongly, we manually inspected the chunks and labeled 136 of the 180 as true conflicts and 44 as false conflicts. To facilitate tool evaluation, we also defined a systematic manual-analysis method to examine all program versions involved in each merging scenario, and to summarize the root causes of conflicts as well as developers' resolution strategies. We further defined three novel metrics to evaluate merge tools. By applying five state-of-the-art tools to ConflictBench, we observed that the benchmark is effective in characterizing different tools: it helps reveal limitations of existing tools and sheds light on future research.
Before constructing the benchmark, we conducted a literature review of existing merge tools and of empirical studies on merge techniques. By examining how merge tools have been evaluated, we identified the following requirements that a good benchmark should satisfy:
- Diversity: It should cover a wide range of scenarios where merging happens, so that the dataset is representative.
- True Conflicts: It should include true conflicts between branch edits, to assess whether merge tools can identify conflicts when two branches edit the same text differently (see the illustration after this list).
- False Conflicts: It should include false conflicts, to assess whether a merge tool wrongly reports a conflict when the branches do not in fact edit the same text (also discussed below).
- Conflict Resolutions: It should include developers' resolutions to reported conflicts, to evaluate whether tool-generated resolutions match human-crafted ones.
To satisfy all of the requirements above, we created our benchmark by crawling 208 popular open-source Java repositories. For each repository, we randomly sampled a commit that merges software branches via git-merge, and manually inspected the conflicts reported by git-merge to pick one that satisfies our selection criteria. Including all picked conflicts in our dataset yielded the benchmark, named ConflictBench. Among its 180 conflicts, 136 are true conflicts and 44 are false ones. To facilitate tool comparison, we also classified the conflicts by the types of branch edits, the types of edited files, and developers' resolution strategies.
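As a side note on mechanics: one plausible way to replay such a merging scenario programmatically is sketched below using the JGit library. The repository path and commit identifiers are placeholders, this is our illustration rather than the paper's tooling, and JGit's built-in merger may behave slightly differently from command-line git-merge.

```java
import java.io.File;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.api.MergeResult;
import org.eclipse.jgit.lib.ObjectId;

public class ReplayMerge {
    public static void main(String[] args) throws Exception {
        // Placeholders: a local clone and the two parents of a sampled merge commit.
        try (Git git = Git.open(new File("/path/to/repo"))) {
            // Check out the first parent, then re-run the merge with the second parent.
            git.checkout().setName("<left-parent-sha>").call();
            ObjectId right = git.getRepository().resolve("<right-parent-sha>");
            MergeResult result = git.merge().include(right).call();
            System.out.println(result.getMergeStatus());
            // When the merge conflicts, list the files containing conflicting chunks.
            if (result.getConflicts() != null) {
                result.getConflicts().keySet().forEach(System.out::println);
            }
        }
    }
}
```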
We applied five state-of-the-art merge tools to ConflictBench, to check whether our benchmark is effective in characterizing tools' effectiveness and in revealing differences between tools. The tools are KDiff3, FSTMerge, JDime, IntelliMerge, and AutoMerge. We observed the following interesting phenomena in our experiments: KDiff3 has wider applicability than the other tools; JDime reported conflicts with the highest precision (92%), while AutoMerge reported the fewest conflicts (17); and KDiff3 achieved the highest resolution desirability (83%), meaning that the majority of the merged versions it produces match developers' hand-crafted versions.
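The paper gives precise definitions for these metrics; as a plausible reading (our assumption, for orientation only), they can be understood as the following per-tool ratios over the benchmark's merging scenarios:

\[
\textit{applicability} = \frac{\#\,\text{scenarios the tool can process}}{\#\,\text{scenarios}},\qquad
\textit{precision} = \frac{\#\,\text{reported conflicts that are true}}{\#\,\text{reported conflicts}},\qquad
\textit{desirability} = \frac{\#\,\text{merged versions matching the developers' version}}{\#\,\text{merged versions produced}}
\]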
In this paper, we made the following research contributions:
- We defined a novel systematic method to classify merge-conflict data, and applied that method to manually create a benchmark of merge-conflict data named ConflictBench. The benchmark includes 180 merging scenarios with labeled true/false conflicts, types of branch edits, types of edited files, and developers' resolution strategies. No prior work characterizes conflicts in such a comprehensive way.
- We defined three novel metrics to evaluate software merge tools: tool applicability, detection precision, and resolution desirability.
- We comprehensively evaluated five state-of-the-art software merge tools using ConflictBench, and observed interesting phenomena in terms of tool applicability, conflict-detection precision, and conflict-resolution desirability. No prior work presents such an empirical evaluation of these tools or the findings we report.
This paper was accepted for publication by the Journal of Systems and Software (JSS) in April 2024. Our work is not a secondary study; it presents new research findings and contributions that have not been previously reported. The paper has not been presented at, nor is it under consideration for, the journal-first programs of other conferences. The first author, Bowen Shen, will give the presentation. If accepted, this will be the only paper Bowen presents at ASE 2024, so the acceptance will increase his opportunity to attend ASE. The paper is available at https://doi.org/10.1016/j.jss.2024.112084.
Thu 31 Oct (Pacific Time, US & Canada)
13:30 - 15:00 | Software Merge (Research Papers / Journal-first Papers) at Gardenia. Chair(s): Haiyan Zhao (Peking University)
- 13:30, 15m talk: Evaluation of Version Control Merge Tools. Research Papers. Benedikt Schesch (ETH Zurich), Ryan Featherman (UW CSE), Ben Roberts (UW CSE), Kenneth J Yang (UW CSE), Michael D. Ernst (University of Washington)
- 13:45, 15m talk: Semistructured Merge with Language-Specific Syntactic Separators. Research Papers. Guilherme Cavalcanti (Federal Institute of Pernambuco, Brazil), Paulo Borba (Federal University of Pernambuco), Leonardo dos Anjos (Federal University of Pernambuco), Jonatas Clementino (Federal University of Pernambuco)
- 14:00, 15m talk: Automatic Prediction of Developers' Resolutions for Software Merge Conflicts. Journal-first Papers. Waad Riadh Aldndni (Virginia Tech, Blacksburg, VA, USA), Na Meng (Virginia Tech), Francisco Servant (ITIS Software, University of Malaga)
- 14:15, 15m talk: ConflictBench: A Benchmark to Evaluate Software Merge Tools. Journal-first Papers.
- 14:30, 15m talk: Revisiting the Conflict-Resolving Problem from a Semantic Perspective. Research Papers. Jinhao Dong (Peking University), Jun Sun (Singapore Management University), Yun Lin (Shanghai Jiao Tong University), Yedi Zhang (National University of Singapore), Murong Ma (National University of Singapore), Jin Song Dong (National University of Singapore), Dan Hao (Peking University)