GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses
Mon 23 May 2022 13:45 - 14:00 at Room 315+316 - Blended Technical Session 2 (Machine Learning and Information Retrieval) Chair(s): Preetha Chatterjee
Code embedding is a keystone in the application of machine learning to several Software Engineering (SE) tasks. To effectively support a plethora of SE tasks, the embedding needs to capture program syntax and semantics in a way that is generic. To this end, we propose the first self-supervised pre-training approach (called GraphCode2Vec) that produces task-agnostic embeddings of lexical and program dependence features. GraphCode2Vec achieves this via a synergistic combination of code analysis and Graph Neural Networks. GraphCode2Vec is generic, allows pre-training, and is applicable to several SE downstream tasks. We evaluate the effectiveness of GraphCode2Vec on four (4) tasks (method name prediction, solution classification, mutation testing and overfitted patch classification), and compare it with four (4) similarly generic code embedding baselines (Code2Seq, Code2Vec, CodeBERT, GraphCodeBERT) and seven (7) task-specific, learning-based methods. GraphCode2Vec is more effective than both generic and task-specific learning-based baselines. It is also complementary and comparable to GraphCodeBERT (a larger and more complex model). We also demonstrate through a probing and ablation study that GraphCode2Vec learns lexical and program dependence features and that self-supervised pre-training improves effectiveness.
Thu 19 May (displayed time zone: Eastern Time (US & Canada))
11:00 - 11:50 | Session 11: Machine Learning & Information Retrieval | Technical Papers | MSR Main room - odd hours | Chair(s): Phuong T. Nguyen (University of L’Aquila)
11:00 | 4m | Short-paper | On the Naturalness of Fuzzer Generated Code | Technical Papers | Rajeswari Hita Kambhamettu (Carnegie Mellon University); John Billos (Wake Forest University); Carolyn "Tomi" Oluwaseun-Apo (Pennsylvania State University); Benjamin Gafford (Carnegie Mellon University); Rohan Padhye (Carnegie Mellon University); Vincent J. Hellendoorn (Carnegie Mellon University)
11:04 | 7m | Talk | Does Configuration Encoding Matter in Learning Software Performance? An Empirical Study on Encoding Schemes | Technical Papers
11:11 | 7m | Talk | Multimodal Recommendation of Messenger Channels | Technical Papers | Ekaterina Koshchenko (JetBrains Research); Egor Klimov (JetBrains Research); Vladimir Kovalenko (JetBrains Research)
11:18 | 7m | Talk | Senatus: A Fast and Accurate Code-to-Code Recommendation Engine | Technical Papers | Fran Silavong (JP Morgan Chase & Co.); Sean Moran (JP Morgan Chase & Co.); Antonios Georgiadis (JP Morgan Chase & Co.); Rohan Saphal (JP Morgan Chase & Co.); Robert Otter (JP Morgan Chase & Co.)
11:25 | 7m | Talk | Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study | Technical Papers | Tatiana Castro Vélez (City University of New York (CUNY) Graduate Center); Raffi Khatchadourian (City University of New York (CUNY) Hunter College); Mehdi Bagherzadeh (Oakland University); Anita Raja (City University of New York (CUNY) Hunter College)
11:32 | 7m | Talk | GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses | Technical Papers | Wei Ma (SnT, University of Luxembourg); Mengjie Zhao (LMU Munich); Ezekiel Soremekun (SnT, University of Luxembourg); Qiang Hu (University of Luxembourg); Jie M. Zhang (King's College London); Mike Papadakis (University of Luxembourg, Luxembourg); Maxime Cordy (University of Luxembourg, Luxembourg); Xiaofei Xie (Singapore Management University, Singapore); Yves Le Traon (University of Luxembourg, Luxembourg)
11:39 | 11m | Live Q&A | Discussions and Q&A | Technical Papers
Mon 23 May (displayed time zone: Eastern Time (US & Canada))
13:30 - 15:00 | Blended Technical Session 2 (Machine Learning and Information Retrieval) | Technical Papers / Data and Tool Showcase Track | Room 315+316 | Chair(s): Preetha Chatterjee (Drexel University, USA)
13:30 | 15m | Talk | Methods for Stabilizing Models across Large Samples of Projects (with case studies on Predicting Defect and Project Health) | Technical Papers | Suvodeep Majumder (North Carolina State University); Tianpei Xia (North Carolina State University); Rahul Krishna (North Carolina State University); Tim Menzies (North Carolina State University)
13:45 | 15m | Talk | GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses | Technical Papers | Wei Ma (SnT, University of Luxembourg); Mengjie Zhao (LMU Munich); Ezekiel Soremekun (SnT, University of Luxembourg); Qiang Hu (University of Luxembourg); Jie M. Zhang (King's College London); Mike Papadakis (University of Luxembourg, Luxembourg); Maxime Cordy (University of Luxembourg, Luxembourg); Xiaofei Xie (Singapore Management University, Singapore); Yves Le Traon (University of Luxembourg, Luxembourg)
14:00 | 15m | Talk | Senatus: A Fast and Accurate Code-to-Code Recommendation Engine | Technical Papers | Fran Silavong (JP Morgan Chase & Co.); Sean Moran (JP Morgan Chase & Co.); Antonios Georgiadis (JP Morgan Chase & Co.); Rohan Saphal (JP Morgan Chase & Co.); Robert Otter (JP Morgan Chase & Co.)
14:15 | 8m | Short-paper | Comments on Comments: Where Code Review and Documentation Meet | Technical Papers | Nikitha Rao (Carnegie Mellon University); Jason Tsay (IBM Research); Martin Hirzel (IBM Research); Vincent J. Hellendoorn (Carnegie Mellon University)
14:23 | 8m | Short-paper | On the Naturalness of Fuzzer Generated Code | Technical Papers | Rajeswari Hita Kambhamettu (Carnegie Mellon University); John Billos (Wake Forest University); Carolyn "Tomi" Oluwaseun-Apo (Pennsylvania State University); Benjamin Gafford (Carnegie Mellon University); Rohan Padhye (Carnegie Mellon University); Vincent J. Hellendoorn (Carnegie Mellon University)
14:31 | 8m | Talk | SOSum: A Dataset of Stack Overflow Post Summaries | Data and Tool Showcase Track | Bonan Kou (Purdue University); Yifeng Di (Purdue University); Muhao Chen (University of Southern California); Tianyi Zhang (Purdue University)
14:39 | 21m | Live Q&A | Discussions and Q&A | Technical Papers