Benchmarks are our measure of progress. Or are they?
How do we know how well our tool solves a problem, like bug finding, compared to other state-of-the-art tools? We run a benchmark. We choose a few representative instances of the problem, define a reasonable measure of success, and identify and mitigate various threats to validity. Finally, we implement (or reuse) a benchmarking framework, and compare the results for our tool with those for the state-of-the-art.
For many important software engineering problems, we have seen new sparks of interest and serious progress made whenever a (substantially better) benchmark became available. Benchmarks are our measure of progress. Without them, we have no empirical support for our claims of effectiveness. Yet, time and again, we see practitioners dismiss entire technologies as “paper-ware”—far from solving the problem they set out to solve.
In this keynote, I will discuss our recent efforts to systematically study the degree to which our evaluation methodologies allow us to measure those capabilities that we aim to measure. We shed new light on a long-standing dispute about code coverage as a measure of testing effectiveness, explore the impact of the specific benchmark configuration on the evaluation outcome, and call into question the actual versus measured progress of an entire field (ML4VD, machine learning for vulnerability detection) just as it gains substantial momentum and interest.
Mon 28 Apr (times in Eastern Time, US & Canada)
11:00 - 12:30 (SBFT)

11:00 (60m) Keynote: Keynote by Marcel Böhme
  Marcel Böhme (MPI for Security and Privacy)

12:00 (15m) Research paper: DeepUIFuzz: A Guided Fuzzing Strategy for Testing UI Component Detection Models
  Proma Chowdhury (University of Dhaka), Kazi Sakib (Institute of Information Technology, University of Dhaka)

12:15 (15m) Research paper: On Evaluating Fuzzers with Context-Sensitive Fuzzed Inputs: A Case Study on PKCS#1-v1.5
  S Mahmudul Hasan (Syracuse University), Polina Kozyreva (Syracuse University), Endadul Hoque (Syracuse University)