Towards a Data-Curation Platform for Code-Centric Research
For many experiments in code-centric research, researchers require real-world code to verify theories and conduct evaluations. Given the abundance of available code on platforms such as Maven Central or GitHub, real-world examples can be collected easily in large quantities. However, for the experiments to be meaningful, the inspected code needs to be representative of the problem under investigation. For example, to properly evaluate a precise call-graph algorithm, the code must contain complex virtual call sites where the strengths and limitations of the algorithm can be observed. It is labor-intensive to set up these collections of code every time they become necessary. Furthermore, to increase the comparability and repeatability of the experiments, collections of code objects must be well curated so that their construction is traceable and repeatable.
New findings about these collections might invalidate (parts of) their data, in which case related research should be re-inspected. Static collections (e.g., XCorpus) quickly become outdated, and it is hard to annotate them once they are out in the field. While a collection is in use, other researchers might create or compute interesting data on its items. Algorithms (e.g., Averroes) might be available to compute information relevant for an evaluation on the fly and thereby extend the original dataset. It is not easy to find this information and relate it to the items in the collection.
To address these challenges in benchmark creation and maintenance, we introduce Delphi, an online platform to search for representative candidates to construct datasets of real-world code based on various metrics. It consists of an automated data collection, a search engine, and facilities to trace the selection process in order to foster repeatable, tractable, and comparable research. We present the current state of the project as well as our plans to extend the platform with processes for ground truth data uploads, service integration, and curated data invalidation.
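To make the idea of metric-based, traceable selection concrete, the following is a minimal sketch and not Delphi's actual API. The `Artifact` record, the metric names (`virtual_call_sites`, `classes`), the artifact identifiers, and both helper functions are hypothetical, chosen only for illustration. The sketch filters a small in-memory corpus by minimum metric values and stores the query together with its result and a digest, so that the same selection can be re-run and compared later.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Artifact:
    # Hypothetical record: an identifier plus precomputed metric values.
    identifier: str
    metrics: dict


def select(artifacts, thresholds):
    """Return identifiers of artifacts meeting every minimum threshold, sorted for stability."""
    return sorted(
        a.identifier
        for a in artifacts
        if all(a.metrics.get(name, 0) >= minimum for name, minimum in thresholds.items())
    )


def selection_record(thresholds, selected):
    """Bundle the query and its result with a digest so the selection can be re-checked."""
    payload = json.dumps({"thresholds": thresholds, "selected": selected}, sort_keys=True)
    return {
        "thresholds": thresholds,
        "selected": selected,
        "digest": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }


# Illustrative data only; identifiers and metric values are made up.
corpus = [
    Artifact("org.example:alpha:1.0", {"virtual_call_sites": 120, "classes": 40}),
    Artifact("org.example:beta:2.3", {"virtual_call_sites": 3, "classes": 250}),
    Artifact("org.example:gamma:0.9", {"virtual_call_sites": 75, "classes": 12}),
]

# Select candidates suited to a call-graph evaluation: many virtual call sites.
query = {"virtual_call_sites": 50}
record = selection_record(query, select(corpus, query))
print(record["selected"])  # ['org.example:alpha:1.0', 'org.example:gamma:0.9']
```

Publishing such a record alongside an evaluation would let others repeat the query and check that they obtain the same set of items; a real platform would additionally need to pin the version of the underlying data.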
Wed 18 Jul
Ben Hermann (University of Paderborn), Lisa Nguyen Quang Do (Paderborn University), Eric Bodden (Heinz Nixdorf Institut, Paderborn University and Fraunhofer IEM)
Beau Johnston (Australian National University)