What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code
Thu 12 May 2022 04:20 - 04:25 at ICSE room 1-even hours - Machine Learning with and for SE 3 Chair(s): Antinisca Di Marco
Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. Although current language models of code based on masked pre-training and Transformers have achieved promising results, there has been little progress on the interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing of the word embeddings, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-trained language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code are able to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the pre-training process for better code representations.
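To make the first perspective concrete, the following is a minimal toy sketch (not the paper's actual implementation) of the attention-analysis idea: measure how much of an attention head's mass falls on token pairs that are related in the code's syntax tree, and compare it against a uniform-attention baseline. The token list, the set of syntax-related pairs, and the attention matrix below are all made up for illustration; in a real study the attention weights would come from a model such as CodeBERT and the pairs from a parser.

```python
import numpy as np

# Toy code snippet, tokenized by hand for illustration.
tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]

# Hypothetical syntax-related token pairs (e.g. tokens sharing an AST parent);
# in practice these would be derived from a real parse of the snippet.
syntax_pairs = {(1, 2), (2, 6), (3, 5), (8, 10), (9, 10), (10, 11)}

def syntax_alignment(attn, pairs):
    """Fraction of total attention mass placed on syntax-related pairs."""
    mask = np.zeros_like(attn)
    for i, j in pairs:
        mask[i, j] = mask[j, i] = 1.0
    return float((attn * mask).sum() / attn.sum())

n = len(tokens)
rng = np.random.default_rng(0)

# A made-up attention head that puts extra weight on the syntax pairs,
# standing in for an attention matrix extracted from a pre-trained model.
attn = rng.random((n, n))
for i, j in syntax_pairs:
    attn[i, j] += 2.0
attn /= attn.sum(axis=-1, keepdims=True)  # normalize each row, as softmax would

uniform = np.full((n, n), 1.0 / n)  # baseline: attention spread evenly

print("syntax-aware head:", syntax_alignment(attn, syntax_pairs))
print("uniform baseline: ", syntax_alignment(uniform, syntax_pairs))
```

A head whose alignment score clearly exceeds the uniform baseline is, under this metric, attending preferentially to syntactically related tokens, which is the kind of evidence behind the paper's finding (1).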
04:00 - 05:00 | Machine Learning with and for SE 3 (Technical Track / Journal-First Papers / SEIP - Software Engineering in Practice) at ICSE room 1-even hours. Chair(s): Antinisca Di Marco (University of L'Aquila)

04:00 | 5m Talk | In-IDE Code Generation from Natural Language: Promise and Challenges (Journal-First Papers). Frank Xu (Carnegie Mellon University), Bogdan Vasilescu (Carnegie Mellon University, USA), Graham Neubig (Carnegie Mellon University)

04:05 | 5m Talk | Active Learning of Discriminative Subgraph Patterns for API Misuse Detection (Journal-First Papers)

04:10 | 5m Talk | Dependency Tracking for Risk Mitigation in Machine Learning (ML) Systems (SEIP - Software Engineering in Practice). Xiwei (Sherry) Xu (CSIRO Data61), Chen Wang (CSIRO Data61), Zhen Wang (CSIRO Data61), Qinghua Lu (CSIRO's Data61), Liming Zhu (CSIRO's Data61; UNSW)

04:15 | 5m Talk | DeepFD: Automated Fault Diagnosis and Localization for Deep Learning Programs (Technical Track). Jialun Cao (Department of Computer Science and Engineering, The Hong Kong University of Science and Technology), Meiziniu Li (Department of Computer Science and Engineering, The Hong Kong University of Science and Technology), Xiao Chen (Huazhong University of Science and Technology), Ming Wen (Huazhong University of Science and Technology), Yongqiang Tian (University of Waterloo), Bo Wu (MIT-IBM Watson AI Lab, Cambridge), Shing-Chi Cheung (Hong Kong University of Science and Technology)

04:20 | 5m Talk | What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code (Technical Track). Yao Wan (Huazhong University of Science and Technology), Wei Zhao (Huazhong University of Science and Technology), Hongyu Zhang (University of Newcastle), Yulei Sui (University of Technology Sydney), Guandong Xu (University of Technology Sydney), Hai Jin (Huazhong University of Science and Technology)

04:25 | 5m Talk | A Universal Data Augmentation Approach for Fault Localization (Technical Track). Huan Xie (Chongqing University), Yan Lei (School of Big Data & Software Engineering, Chongqing University), Meng Yan (Chongqing University), Yue Yu (College of Computer, National University of Defense Technology, Changsha 410073, China), Xin Xia (Huawei Software Engineering Application Technology Lab), Xiaoguang Mao (National University of Defense Technology)

04:30 | 5m Talk | DeepState: Selecting Test Suites to Enhance the Robustness of Recurrent Neural Networks (Technical Track). Zixi Liu (Nanjing University), Yang Feng (Nanjing University), Yining Yin (Nanjing University, China), Zhenyu Chen (Nanjing University)