Are Neural Bug Detectors Comparable to Software Developers on Variable Misuse Bugs?
Are neural bug detectors comparable to software developers on variable misuse bugs? While neural bug detectors have become increasingly effective at identifying variable misuses, it is unclear how they compare to software developers on the same code review task. We addressed this question by evaluating the performance of over 100 developers who were tasked with identifying variable misuse bugs in foreign code. We then compared their performance with that of two popular bug detectors based on graph neural networks and transformers. Our evaluation shows that there is a large overlap between the bugs found by software developers and those found by bug detectors, and that developers still outperform bug detectors when considering the trade-off between detected bugs and false alarms.
With this artifact, we provide access to all our evaluation results in machine-readable formats, so that they can be used for comparison in future studies on bug detection. In addition to our results, this artifact contains the evaluation scripts and code necessary to replicate our results or to repeat the developer survey. More precisely, we include (1) the web interface used to survey the developers, (2) the implementation and trained models needed to evaluate the bug detectors, and (3) the scripts (as Jupyter notebooks) to (re-)generate all figures and results of our paper.
The artifact can be used via a pre-installed virtual machine or by following our comprehensive manual installation guide.
Relation to paper: The artifact consists of four components intended for replication of our evaluation and (re-)use in future projects. Here, we describe how they relate to our paper:
- Survey raw data: The artifact includes all our evaluation results, including the answers of over 100 developers on the variable misuse detection task. Beyond what is already included in the paper, the raw data also gives access to further demographic information about the participants and to the participants' experience level with Java programming and code debugging.
- Web interface: The web interface was employed to evaluate the performance of developers in detecting variable misuse bugs. The interface can be used to replicate our study setup with a new participant group.
- Neural bug detectors: We included the implementation (+ trained models) of the two neural bug detectors used for comparison in this artifact. A script is provided for evaluating the performance of the bug detectors on our benchmark. The results of the script were used to compare the performance of the neural bug detectors and the developers. Furthermore, a simple interface makes it easy to use the bug detectors in future projects.
- Evaluation scripts: We provided all scripts needed to replicate our evaluation results and figures in the form of Jupyter notebooks. Together with the included results of our developer study, it is possible to completely replicate the paper results by running the scripts.
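To illustrate the kind of comparison the evaluation scripts support, the following is a minimal sketch of how detected bugs, false alarms, and the overlap between two reviewers could be computed. It is not taken from the artifact: the `Verdict` record layout, the function names, the toy data, and the use of Jaccard similarity as the overlap measure are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One reviewer's answer for one code snippet (hypothetical record layout)."""
    snippet_id: str
    is_buggy: bool   # ground truth: the snippet contains a variable misuse
    flagged: bool    # the reviewer (developer or detector) reported a bug


def detection_and_false_alarm_rate(verdicts):
    """Return (share of buggy snippets flagged, share of correct snippets flagged)."""
    buggy = [v for v in verdicts if v.is_buggy]
    clean = [v for v in verdicts if not v.is_buggy]
    detection = sum(v.flagged for v in buggy) / len(buggy) if buggy else 0.0
    false_alarm = sum(v.flagged for v in clean) / len(clean) if clean else 0.0
    return detection, false_alarm


def found_bugs(verdicts):
    """Ids of the buggy snippets that the reviewer actually flagged."""
    return {v.snippet_id for v in verdicts if v.is_buggy and v.flagged}


def overlap(a, b):
    """Jaccard similarity of the bug sets found by two reviewers (assumed measure)."""
    found_a, found_b = found_bugs(a), found_bugs(b)
    union = found_a | found_b
    return len(found_a & found_b) / len(union) if union else 0.0


# Made-up toy data, not results from the study.
developer = [
    Verdict("s1", True, True), Verdict("s2", True, True), Verdict("s3", True, False),
    Verdict("s4", False, False), Verdict("s5", False, False),
]
detector = [
    Verdict("s1", True, True), Verdict("s2", True, False), Verdict("s3", True, True),
    Verdict("s4", False, True), Verdict("s5", False, False),
]

print("developer:", detection_and_false_alarm_rate(developer))
print("detector: ", detection_and_false_alarm_rate(detector))
print("overlap:  ", overlap(developer, detector))
```

In this toy data both reviewers detect the same share of bugs, but the detector raises a false alarm on a correct snippet. This is the kind of trade-off the comparison above refers to.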
Requirements: Since this artifact also contains the scripts for evaluating the neural bug detectors, a machine with a modern CPU and at least 8 GB of RAM is recommended.
Targeted Badge: Reusable