Can neural clone detection generalize to unseen functionalities?
Many recently proposed code clone detectors exploit neural networks to capture the latent semantics of source code, achieving impressive results in detecting semantic clones. These neural clone detectors rely on the availability of large amounts of labeled training data. We identify a key oversight in the current evaluation methodology for neural clone detection: cross-functionality generalization (i.e., detecting semantic clones whose functionalities are unseen in training). Specifically, we focus on this question: do neural clone detectors truly learn the ability to detect semantic clones, or do they just learn to model the specific functionalities in the training data and fail to generalize to realistic unseen functionalities? This paper investigates how this generalizability can be evaluated and improved.
Our contributions are threefold: (1) We propose an evaluation methodology that systematically measures the cross-functionality generalizability of neural clone detection. Based on this methodology, we conduct an empirical study, and the results indicate that current neural clone detectors do not generalize as well as expected. (2) We conduct an empirical analysis to understand the key factors that affect generalizability. We investigate three factors: training data diversity, vocabulary, and locality. Results show that the performance loss on unseen functionalities can be reduced by addressing the out-of-vocabulary problem and increasing training data diversity. (3) We propose a human-in-the-loop mechanism that helps adapt neural clone detectors to new code repositories containing many unseen functionalities. It improves annotation efficiency by combining transfer learning and active learning. Experimental results show that it reduces the number of annotations required by about 88%.
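To make the cross-functionality setting concrete, here is a minimal sketch of a split in which no functionality appears in both the training and the test set. It is an illustration only, not the authors' evaluation protocol: the input format (a list of `(snippet_id, functionality_id)` tuples), the function names `split_by_functionality` and `make_pairs`, and the `test_fraction` parameter are all assumptions made for this example.

```python
import random
from itertools import combinations


def split_by_functionality(snippets, test_fraction=0.3, seed=0):
    """Split snippets so that test functionalities never occur in training.

    `snippets` is a list of (snippet_id, functionality_id) tuples
    (a hypothetical format chosen for this sketch).
    """
    functionalities = sorted({func for _, func in snippets})
    rng = random.Random(seed)
    rng.shuffle(functionalities)

    # Hold out whole functionalities, not individual snippets or pairs.
    n_test = max(1, int(len(functionalities) * test_fraction))
    test_funcs = set(functionalities[:n_test])

    train = [s for s in snippets if s[1] not in test_funcs]
    test = [s for s in snippets if s[1] in test_funcs]
    return train, test


def make_pairs(snippets):
    """Build labeled pairs: 1 if both snippets share a functionality, else 0."""
    return [
        (a[0], b[0], int(a[1] == b[1]))
        for a, b in combinations(snippets, 2)
    ]
```

Pairs are built separately inside each split, so every clone pair in the test set involves a functionality the detector never saw during training. A conventional random split over pairs would not give this guarantee.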
Wed 17 Nov (displayed time zone: Hobart)
22:00 - 23:00
Analysis II, Research Papers at Kangaroo
Chair(s): Annibale Panichella Delft University of Technology
Jihyeok Park KAIST, Seungmin An KAIST, Shin Wonho KAIST, Yusung Sim KAIST, Sukyoung Ryu KAIST
Can neural clone detection generalize to unseen functionalities?
Chenyao Liu School of Software, Tsinghua University, Zeqi Lin Microsoft Research, China, Jian-Guang Lou Microsoft Research, Lijie Wen School of Software, Tsinghua University, Dongmei Zhang Microsoft Research
Characterizing Transaction-Reverting Statements in Ethereum Smart Contracts
Lu Liu Southern University of Science and Technology; The Hong Kong University of Science and Technology, Lili Wei Hong Kong University of Science and Technology, Wuqi Zhang The Hong Kong University of Science and Technology, Ming Wen Huazhong University of Science and Technology, Yepang Liu Southern University of Science and Technology, Shing-Chi Cheung Hong Kong University of Science and Technology