Is this Change the Answer to that Problem? Correlating Descriptions of Bug and Code Changes for Evaluating Patch Correctness
Patch correctness has been the focus of automated program repair (APR) in recent years due to the propensity of APR tools to generate overfitting patches. Given a generated patch, the oracle (e.g., test suites) is generally weak in establishing correctness. Therefore, the literature has proposed various approaches of leveraging machine learning with engineered and deep learned features, or exploring dynamic execution information, to further explore the correctness of APR-generated patches. In this work, we propose a novel perspective to the problem of patch correctness assessment: a correct patch implements changes that ``answer'' to a problem posed by buggy behaviour. Concretely, we turn the patch correctness assessment into a Question Answering problem. To tackle this problem, our intuition is that natural language processing can provide the necessary representations and models for assessing the semantic correlation between a bug (question) and a patch (answer). Specifically, we consider as inputs the bug reports as well as the natural language description of the generated patches. Our approach, \toolname, first considers state of the art commit message generation models to produce the relevant inputs associated to each generated patch. Then we leverage a neural network architecture to learn the semantic correlation between bug reports and commit messages. Experiments on a large dataset of 9,135 patches generated for three bug datasets (Defects4j, Bugs.jar and Bears) show that \toolname can achieve an AUC of 0.873 on predicting patch correctness, and recalling 92% correct patches while filtering out 64% incorrect patches. Our experimental results further demonstrate the influence of inputs quality on prediction performance. We further perform experiments to highlight that the model indeed learns the relationship between bug reports and code change descriptions for the prediction. Finally, we compare against prior work and discuss the benefits of our approach.