Benchmarks are our measure of progress. Or are they?
How do we know how well our tool solves a problem, like bug finding, compared to other state-of-the-art tools? We run a benchmark. We choose a few representative instances of the problem, define a reasonable measure of success, and identify and mitigate various threats to validity. Finally, we implement (or reuse) a benchmarking framework, and compare the results for our tool with those for the state-of-the-art.
For many important software engineering problems, we have seen new sparks of interest and serious progress made whenever a (substantially better) benchmark became available. Benchmarks are our measure of progress. Without them, we have no empirical support for our claims of effectiveness. Yet, time and again, we see practitioners dismiss entire technologies as “paper-ware”—far from solving the problem they set out to solve.
In this keynote, I will discuss our recent efforts to systematically study the degree to which our evaluation methodologies allow us to measure those capabilities that we aim to measure. We shed new light on a long-standing dispute about code coverage as a measure of testing effectiveness, explore the impact of the specific benchmark configuration on the evaluation outcome, and call into question the actual versus measured progress of an entire field (ML4VD, machine learning for vulnerability detection) just as it gains substantial momentum and interest.
Mon 28 Apr (times in Eastern Time, US & Canada)
11:00 - 12:30 (SBFT)

11:00 (60m) Keynote: Keynote by Marcel Böhme
  Marcel Böhme (MPI for Security and Privacy)

12:00 (15m) Research paper: DeepUIFuzz: A Guided Fuzzing Strategy for Testing UI Component Detection Models
  Proma Chowdhury (University of Dhaka), Kazi Sakib (Institute of Information Technology, University of Dhaka)

12:15 (15m) Research paper: On Evaluating Fuzzers with Context-Sensitive Fuzzed Inputs: A Case Study on PKCS#1-v1.5
  S Mahmudul Hasan (Syracuse University), Polina Kozyreva (Syracuse University), Endadul Hoque (Syracuse University)