Configurable Ensembles for Software Similarity: Challenging the Notion of Universal Metrics (SCAM 2025 - Research Track)

Who

Shujun Huang, Sebastian Proksch

Track

SCAM 2025 Research Track

Time Zone

The program is currently displayed in (GMT+12:00) Auckland, Wellington.

When

Tue 9 Sep 2025 13:30 - 13:50 at OGGB5 260-051 - Analysis 3 Chair(s): Coen De Roover

Abstract

Software similarity analysis is crucial in various fields, including code clone detection, security analysis, and software refactoring. While research continues to identify new use cases, numerous similarity detectors have already been proposed for specific contexts. These detectors usually leverage project attributes, such as source code, contributors, documentation, and dependencies. Existing works consistently report that their approaches outperform others in extensive evaluations. In this paper, we challenge the idea of a universally superior similarity model. We argue that similarity is a fluid concept and that the relevant metrics always depend on specific needs. We present a novel framework that enables a flexible aggregation of diverse similarity models, allowing fine-tuned configurations for specific needs and use cases. Our evaluation incorporates multiple existing similarity models and their respective benchmarks to reveal a fundamental dilemma: depending on the configuration, our aggregated model either confirms prior results or exposes significant differences among individual models. We demonstrate, however, that these variations can be explained by the additional information that the aggregation takes into account, which leads to more fine-grained results. Our results illustrate the future of software similarity research: configurable ensembles of much more specialized models.
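
The idea of a configurable aggregation of similarity models can be illustrated with a minimal sketch. It is not the framework presented in the paper: the weighted-average aggregation, the Jaccard-based models, and all names (ensemble_similarity, dependency_similarity, contributor_similarity, the project dictionaries) are assumptions made only for illustration.

```python
from typing import Callable, Dict, Set

# Hypothetical project representation: attribute name -> set of items.
Project = Dict[str, Set[str]]
SimilarityModel = Callable[[Project, Project], float]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dependency_similarity(p: Project, q: Project) -> float:
    """Illustrative model: overlap of declared dependencies."""
    return jaccard(p["dependencies"], q["dependencies"])


def contributor_similarity(p: Project, q: Project) -> float:
    """Illustrative model: overlap of contributors."""
    return jaccard(p["contributors"], q["contributors"])


def ensemble_similarity(
    p: Project,
    q: Project,
    models: Dict[str, SimilarityModel],
    weights: Dict[str, float],
) -> float:
    """Weighted average of the individual model scores.

    The weights are the "configuration": changing them changes which
    notion of similarity the ensemble expresses.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * models[name](p, q) for name, w in weights.items()) / total


MODELS: Dict[str, SimilarityModel] = {
    "dependencies": dependency_similarity,
    "contributors": contributor_similarity,
}

project_a: Project = {"dependencies": {"a", "b"}, "contributors": {"x"}}
project_b: Project = {"dependencies": {"b", "c"}, "contributors": {"x"}}

# Two configurations of the same ensemble give different answers.
by_dependencies = ensemble_similarity(
    project_a, project_b, MODELS, {"dependencies": 1.0, "contributors": 0.0}
)
by_contributors = ensemble_similarity(
    project_a, project_b, MODELS, {"dependencies": 0.0, "contributors": 1.0}
)
balanced = ensemble_similarity(
    project_a, project_b, MODELS, {"dependencies": 1.0, "contributors": 1.0}
)
```

In this toy setup, the two projects share one of three dependencies but have identical contributors, so a dependency-only configuration rates them as fairly dissimilar while a contributor-only configuration rates them as identical. This is the kind of configuration-dependent disagreement the abstract refers to, shown here only on made-up data.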

Link to Preprint

https://drive.google.com/file/d/1UIiyQXOF5nWhYna2eN2pFpFiOEvrooil/view?usp=sharing

Shujun Huang

Software Engineering Research Group (SERG), TU Delft

Netherlands

Sebastian Proksch

Delft University of Technology