Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings
Tue 10 May 2022 05:05 - 05:10 at ICSE room 2-odd hours - Search-Based Software Engineering 1 Chair(s): Ruchika Malhotra
Neural program embeddings have demonstrated considerable promise in a range of program analysis tasks, including clone identification, program repair, code completion, and program synthesis. However, most existing methods generate neural program embeddings directly from program source code, by learning from features such as tokens, abstract syntax trees, and control flow graphs.
This paper takes a fresh look at how to improve program embeddings by leveraging compiler intermediate representation (IR). We first demonstrate simple yet highly effective methods for enhancing embedding quality by training embedding models on both source code and LLVM IR generated at default optimization levels (e.g., -O2). We then introduce IRGen, a framework based on genetic algorithms (GA), to identify (near-)optimal sequences of optimization flags that can significantly improve embedding quality.
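The search described above can be illustrated with a minimal genetic-algorithm sketch. This is not IRGen's implementation: the flag pool, the population settings, the crossover and mutation operators, and the `toy_fitness` function are all assumptions made for illustration. In a real setting the fitness function would compile code with the candidate flag sequence and score the resulting IR in some way.

```python
import random

# Illustrative pool of LLVM pass names (an assumption; the flags IRGen
# actually searches over are not listed in this abstract).
FLAG_POOL = ["mem2reg", "instcombine", "simplifycfg", "gvn",
             "licm", "sroa", "dce", "loop-unroll"]

def crossover(a, b, rng):
    """One-point crossover of two equal-length flag sequences."""
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def mutate(seq, rng, rate=0.2):
    """Replace each flag with a random one with probability `rate`."""
    return [rng.choice(FLAG_POOL) if rng.random() < rate else f for f in seq]

def genetic_search(fitness, seq_len=6, pop_size=20, generations=30, seed=0):
    """Return the best flag sequence found under `fitness` (higher is better)."""
    rng = random.Random(seed)
    population = [[rng.choice(FLAG_POOL) for _ in range(seq_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged (elitism), refill with offspring.
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return max(population, key=fitness)

def toy_fitness(seq):
    """Hypothetical stand-in fitness: number of distinct flags in the sequence."""
    return len(set(seq))

best = genetic_search(toy_fitness)
print(best, toy_fitness(best))
```

Because the elite half survives each generation unchanged, the best fitness in the population never decreases, which keeps the sketch easy to reason about.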
We use IRGen to find optimal sequences of LLVM optimization flags by running the GA on source code datasets. We then extend a popular code embedding model, CodeCMR, by adding a new objective based on triplet loss to enable joint learning over source code and LLVM IR. When CodeCMR was trained with source code and LLVM IR optimized with the flag sequences found by IRGen, the embedding quality was significantly improved, outperforming the state-of-the-art model, CodeBERT, which was trained only with source code. Our augmented CodeCMR also outperformed CodeCMR trained over source code and IR optimized with default optimization levels. We investigate the properties of optimization flags that increase embedding quality, demonstrate IRGen’s generalization in boosting other embedding models, and establish IRGen’s use in settings with extremely limited training data. Our research and findings demonstrate that a low-cost addition to modern neural code embedding models can provide a universal and highly effective enhancement.
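For readers unfamiliar with triplet loss, the following is a minimal sketch of the standard formulation, max(0, d(anchor, positive) - d(anchor, negative) + margin). How the paper forms its triplets is not stated in this abstract; pairing a source-code embedding (anchor) with the IR embedding of the same program (positive) and of a different program (negative) is an assumption here, as are the example vectors and the margin.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 2-d embeddings, for illustration only.
src = [1.0, 0.0]       # source-code embedding (anchor)
ir_same = [1.0, 0.0]   # IR embedding of the same program (positive)
ir_other = [0.0, 0.0]  # IR embedding of a different program (negative)

print(triplet_loss(src, ir_same, ir_other, margin=0.5))  # 0.0: already well separated
print(triplet_loss(src, ir_other, ir_same, margin=0.5))  # 1.5: wrong pair is closer
```

The loss is zero once the matching pair is closer than the mismatched pair by at least the margin, so only poorly separated triplets contribute to training.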
Mon 9 May (displayed time zone: Eastern Time (US & Canada))
20:00 - 21:00 | Search-Based Software Engineering 2 | NIER - New Ideas and Emerging Results / Technical Track | ICSE room 4-even hours | Chair(s): Ali Ouni (ETS Montreal, University of Quebec)
20:00 5m Talk | A Black Box Technique to Reduce Energy Consumption of Android Apps | NIER - New Ideas and Emerging Results | Abdul Ali Bangash (University of Alberta, Canada), Karim Ali (University of Alberta), Abram Hindle (University of Alberta)
20:05 5m Talk | Fairness-aware Configuration of Machine Learning Libraries | Technical Track | Saeid Tizpaz-Niari (University of Texas at El Paso), Ashish Kumar, Gang (Gary) Tan (Pennsylvania State University), Ashutosh Trivedi (University of Colorado Boulder)
20:10 5m Talk | Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings | Technical Track | Zongjie Li (The Hong Kong University of Science and Technology), Pingchuan Ma (HKUST), Huaijin Wang, Shuai Wang (Hong Kong University of Science and Technology), Qiyi Tang (Tencent Security Keen Lab), Sen Nie (Keen Security Lab, Tencent), Shi Wu (Tencent Security Keen Lab)
20:15 5m Talk | Control Parameters Considered Harmful: Detecting Range Specification Bugs in Drone Configuration Modules via Learning-Guided Search | Technical Track | Ruidong Han (Xidian University), Chao Yang (Xidian University), Siqi Ma (The University of New South Wales Canberra), Jianfeng Ma (Xidian University), Cong Sun (Xidian University), Juanru Li (Shanghai Jiao Tong University), Elisa Bertino (Purdue University)
20:20 5m Talk | Search-based Diverse Sampling from Real-world Software Product Lines | Technical Track | Yi Xiang (South China University of Technology), Han Huang (South China University of Technology), Yuren Zhou (School of Data and Computer Science, Sun Yat-sen University), Sizhe Li (South China University of Technology), Chuan Luo (Beihang University), Qingwei Lin (Microsoft Research), Miqing Li (University of Birmingham), Xiaowei Yang (South China University of Technology)
20:25 5m Talk | Code Search based on Context-aware Code Translation | Technical Track | Weisong Sun (State Key Laboratory for Novel Software Technology, Nanjing University), Chunrong Fang (Nanjing University), Yuchen Chen (Nanjing University), Guanhong Tao (Purdue University, USA), Tingxu Han (Nanjing University), Quanjun Zhang (Nanjing University)
Tue 10 May (displayed time zone: Eastern Time (US & Canada))
05:00 - 06:00 | Search-Based Software Engineering 1 | Technical Track | ICSE room 2-odd hours | Chair(s): Ruchika Malhotra (Delhi Technological University)
05:00 5m Talk | Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and Many-Objective Optimization (Distinguished Paper Award) | Technical Track | Fitash Ul Haq (University of Luxembourg), Donghwan Shin (University of Luxembourg), Lionel Briand (University of Luxembourg; University of Ottawa)
05:05 5m Talk | Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings | Technical Track | Zongjie Li (The Hong Kong University of Science and Technology), Pingchuan Ma (HKUST), Huaijin Wang, Shuai Wang (Hong Kong University of Science and Technology), Qiyi Tang (Tencent Security Keen Lab), Sen Nie (Keen Security Lab, Tencent), Shi Wu (Tencent Security Keen Lab)
05:10 5m Talk | Control Parameters Considered Harmful: Detecting Range Specification Bugs in Drone Configuration Modules via Learning-Guided Search | Technical Track | Ruidong Han (Xidian University), Chao Yang (Xidian University), Siqi Ma (The University of New South Wales Canberra), Jianfeng Ma (Xidian University), Cong Sun (Xidian University), Juanru Li (Shanghai Jiao Tong University), Elisa Bertino (Purdue University)
05:15 5m Talk | Search-based Diverse Sampling from Real-world Software Product Lines | Technical Track | Yi Xiang (South China University of Technology), Han Huang (South China University of Technology), Yuren Zhou (School of Data and Computer Science, Sun Yat-sen University), Sizhe Li (South China University of Technology), Chuan Luo (Beihang University), Qingwei Lin (Microsoft Research), Miqing Li (University of Birmingham), Xiaowei Yang (South China University of Technology)
05:20 5m Talk | PropR: Property-Based Automatic Program Repair | Technical Track | Matthías Páll Gissurarson (Chalmers University of Technology, Sweden), Leonhard Applis (Delft University of Technology), Annibale Panichella (Delft University of Technology), Arie van Deursen (Delft University of Technology, Netherlands), Dave Sands (Chalmers)
05:25 5m Talk | Code Search based on Context-aware Code Translation | Technical Track | Weisong Sun (State Key Laboratory for Novel Software Technology, Nanjing University), Chunrong Fang (Nanjing University), Yuchen Chen (Nanjing University), Guanhong Tao (Purdue University, USA), Tingxu Han (Nanjing University), Quanjun Zhang (Nanjing University)