Towards a Data-Curation Platform for Code-Centric Research (BenchWork 2018)

Mon 16 - Tue 17 July 2018 Amsterdam, Netherlands

co-located with ECOOP and ISSTA 2018

Who

Ben Hermann, Lisa Nguyen Quang Do, Eric Bodden

Track

BenchWork 2018

When

Wed 18 Jul 2018 16:30 - 16:50 at Hanoi - Software Engineering & Compilers

Abstract

For many experiments in code-centric research, researchers require real-world code to verify theories and conduct evaluations. Given the abundance of available code on platforms such as Maven Central or GitHub, real-world examples can be collected easily in large quantities. However, for the experiments to be meaningful, the inspected code needs to be representative of the problem under study. For example, to properly evaluate a precise call-graph algorithm, the code must contain complex virtual call sites where the strengths and limitations of the algorithm can be observed. It is labor-intensive to set up these collections of code every time they become necessary. Furthermore, to increase the comparability and repeatability of the experiments, collections of code objects must be well curated so that their construction is traceable and repeatable.

New findings about these collections might invalidate parts of their data, in which case related research should be re-inspected. Static collections (e.g., XCorpus) quickly become outdated, and it is hard to annotate them once they are out in the field. While a collection is in use, other researchers may create or compute interesting data on its items. Algorithms (e.g., Averroes) might be available to compute information relevant to an evaluation on the fly and thereby extend the original dataset. It is not easy to find this information and relate it to the items in the collection.

To address these challenges in benchmark creation and maintenance, we introduce Delphi, an online platform to search for representative candidates to construct datasets of real-world code based on various metrics. It consists of automated data collection, a search engine, and facilities to trace the selection process in order to foster repeatable, traceable, and comparable research. We present the current state of the project as well as our plans to extend the platform with processes for ground truth data uploads, service integration, and curated data invalidation.
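The abstract does not describe Delphi's interface, so the following is only an illustrative sketch of the general idea of metric-based selection with a recorded trace, not Delphi's actual API. The `Artifact` type, the `select` function, the `virtual_call_sites` metric name, and the `org.example` coordinates are all hypothetical names introduced for this example.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """A hypothetical collection item with precomputed metrics."""
    coordinate: str                               # made-up identifier
    metrics: dict = field(default_factory=dict)   # metric name -> numeric value


def select(artifacts, thresholds):
    """Pick artifacts that meet every minimum threshold and record how.

    Returns the sorted list of selected coordinates together with a trace
    record containing the query, the result, and a digest over both, so
    that the same selection can be compared or re-checked later.
    """
    chosen = sorted(
        a.coordinate
        for a in artifacts
        if all(a.metrics.get(name, 0) >= minimum
               for name, minimum in thresholds.items())
    )
    trace = {"thresholds": thresholds, "selected": chosen}
    # Serialize with sorted keys so the digest does not depend on dict order.
    payload = json.dumps(trace, sort_keys=True).encode("utf-8")
    trace["digest"] = hashlib.sha256(payload).hexdigest()
    return chosen, trace


if __name__ == "__main__":
    corpus = [
        Artifact("org.example:a:1.0", {"virtual_call_sites": 120}),
        Artifact("org.example:b:2.1", {"virtual_call_sites": 3}),
    ]
    chosen, trace = select(corpus, {"virtual_call_sites": 50})
    print(chosen)
    print(trace["digest"])
```

Under these assumptions, publishing the trace record alongside an evaluation would let other researchers see which criteria produced the dataset and check that a rerun over the same items yields the same digest.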

File attachments

Towards a Data-Curation Platform for Code-Centric Research (Slides) (Towards a Data-Curation Platform for Code-Centric Research.pdf)	2.67MiB

Ben Hermann

University of Paderborn

Germany

Lisa Nguyen Quang Do

Paderborn University

Germany

Eric Bodden

Heinz Nixdorf Institut, Paderborn University and Fraunhofer IEM