A Micro-Benchmark for Dynamic Program Behaviour
We report on intermediate results from a project investigating the unsoundness of static analysis tools. The project is empirical in nature: it aims to capture the status quo of static program analysis, focusing on call graph construction for Java programs.
Static call graph construction has to trade off several, often conflicting, design goals: precision, performance, and soundness. Soundness (or, in quantitative terms, recall) has attracted more attention recently, following the publication of the soundiness manifesto.
Soundness requires that a static analysis models the entire program behaviour for all possible executions. An empirical study of soundness must therefore compare the models obtained via static analysis with actual program behaviour. In the past year, we have explored several techniques to collect such behaviour, including the analysis of exception stack traces from repositories, the use of datasets of real-world programs with synthetic high-coverage drivers, and the study of CVEs. More recently, we have used the insights gained from those studies to construct a micro-benchmark of programs that use dynamic Java language features presenting barriers for static analysis tools. This talk will focus on the construction of this benchmark. Some aspects to be discussed are:
- The design of the benchmark: programs are minimalistic, use convention over configuration to facilitate experiments, are executable, and are designed to have behaviour that is easy to observe.
- What is a Java program, anyway? While there seems to be an obvious answer, we have included some programs with engineered bytecode, reflecting the fact that many static analysis tools take Java bytecode as input, and that bytecode is increasingly generated by non-Java compilers and/or manipulated in post-compilation steps (a sketch of such a program follows this list).
- The selection of the features represented, and their categorisation. The benchmark contains 34 programs from the following categories: 1) reflection, 2) serialization, 3) unsafe, 4) dynamic proxies, 5) dynamic classloading, 6) invokedynamic, and 7) JNI (a minimal reflection example in this spirit is shown below).
- Soundness is defined with respect to a ground truth of possible program behaviour. But what does this mean? The programs in the benchmark are designed to reveal their behaviour easily: by inspecting the code, perhaps also consulting the relevant specifications, and/or executing the program with the included test suite. We will also discuss corner cases where the behaviour of a program is not well-defined by the language/JVM specification and the core API documentation, but depends on a particular implementation of the JVM (a small example of such a corner case follows below).
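To illustrate the flavour of the benchmark programs, here is a minimal sketch of the reflection category. It is our own illustration, not a program taken from the benchmark; the class and method names are made up:

```java
import java.lang.reflect.Method;

public class ReflectionCase {

    public static void target() {
        // Observable effect: an oracle only needs to check stdout.
        System.out.println("target() reached");
    }

    public static void main(String[] args) throws Exception {
        // The method name may depend on program input, so a static
        // analysis cannot, in general, resolve this call site.
        String name = args.length > 0 ? args[0] : "target";
        Method m = ReflectionCase.class.getMethod(name);
        m.invoke(null);
    }
}
```

A sound call graph must contain an edge from main to target for this input; an analysis that does not model reflection will miss it, and the printed message makes the actual behaviour trivially observable, in line with the design goals listed above.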
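The engineered-bytecode point can be illustrated by a program that assembles a class at runtime and loads it dynamically, so that neither the class nor the call target exists in any Java source. This is a minimal sketch of our own, assuming the ASM bytecode engineering library (org.ow2.asm) is on the classpath; the names Generated and hello are hypothetical:

```java
import java.lang.reflect.Method;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class EngineeredBytecode {
    public static void main(String[] args) throws Exception {
        // Assemble a class "Generated" with a single static method
        // hello()V that prints a message.
        ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
        cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, "Generated",
                null, "java/lang/Object", null);
        MethodVisitor mv = cw.visitMethod(
                Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC,
                "hello", "()V", null, null);
        mv.visitCode();
        mv.visitFieldInsn(Opcodes.GETSTATIC,
                "java/lang/System", "out", "Ljava/io/PrintStream;");
        mv.visitLdcInsn("hello from engineered bytecode");
        mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL,
                "java/io/PrintStream", "println",
                "(Ljava/lang/String;)V", false);
        mv.visitInsn(Opcodes.RETURN);
        mv.visitMaxs(0, 0); // sizes computed via COMPUTE_FRAMES
        mv.visitEnd();
        cw.visitEnd();
        byte[] bytes = cw.toByteArray();

        // Dynamic classloading: define the class at runtime and
        // invoke the generated method reflectively.
        Class<?> generated = new ClassLoader() {
            Class<?> define(byte[] b) {
                return defineClass("Generated", b, 0, b.length);
            }
        }.define(bytes);
        Method hello = generated.getMethod("hello");
        hello.invoke(null);
    }
}
```

Programs like this also explain why tools that consume bytecode rather than source code face a broader notion of what a "Java program" is.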
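As a sketch of the corner cases mentioned above (again our own example, not necessarily one of the benchmark programs): the documentation of Class.getDeclaredMethods() explicitly states that the elements of the returned array are not sorted and are not in any particular order, so a program that selects "the first declared method" has behaviour that depends on the JVM implementation:

```java
import java.lang.reflect.Method;

public class UnspecifiedOrder {
    public void a() {}
    public void b() {}

    public static void main(String[] args) {
        // Whether this reports a, b, or main is implementation-defined,
        // so a program that reflectively invokes "the first declared
        // method" has no single well-defined ground truth.
        Method first = UnspecifiedOrder.class.getDeclaredMethods()[0];
        System.out.println("first declared method: " + first.getName());
    }
}
```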
Wed 18 Jul (times in the Amsterdam/Berlin time zone)

14:00 - 15:30: BenchWork session

- 14:00 (30m): Benchmarking WebKit. Saam Barati (Apple)
- 14:30 (20m): Analyzing Duplication in JavaScript. Petr Maj (Czech Technical University), Celeste Hollenbeck (Northeastern University, USA), Shabbir Hussain (Northeastern University), Jan Vitek (Northeastern University)
- 14:50 (20m): Building a Node.js Benchmark: Initial Steps. Petr Maj (Czech Technical University), François Gauthier (Oracle Labs), Celeste Hollenbeck (Northeastern University, USA), Jan Vitek (Northeastern University), Cristina Cifuentes (Oracle Labs)
- 15:10 (20m): A Micro-Benchmark for Dynamic Program Behaviour. Li Sui (Massey University, New Zealand), Jens Dietrich (Massey University), Michael Emery (Massey University), Amjed Tahir (Massey University), Shawn Rasheed (Massey University)