An Exploratory Study on Code Attention in BERT (ICPC 2022 - Research)

Who

Rishab Sharma, Fuxiang Chen, Fatemeh Hendijani Fard, David Lo

Track

ICPC 2022 Research

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Sun 15 May 2022 22:51 - 22:58 at ICPC room - Session 2: Program Representation 1 Chair(s): Fatemeh Hendijani Fard

Abstract

Many recent models in software engineering use deep neural models based on the Transformer architecture, or use pre-trained language models (PLMs) that are based on Transformer and pre-trained on source code. Although these models achieve state-of-the-art results in many tasks, such as code summarization and bug detection, they are based on Transformer and PLMs, which have been studied mainly in the Natural Language Processing (NLP) field. Current studies rely on reasoning and practices from NLP when applying these models to source code, despite the differences between the two languages. However, there is limited literature explaining how code is modeled.

Here, we investigate the attention behavior of PLMs on code and compare it with natural language. We pre-trained BERT, a Transformer-based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We ran several experiments to analyze the attention values of code constructs on each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to [CLS], the most attended token in NLP. This observation motivated us to leverage identifiers to represent the code sequence instead of the [CLS] token when used for code clone detection. Our results show that employing embeddings from identifiers increases the F1-score of BERT by 605% in its lower layers and by 4% in its upper layers. When identifiers’ embeddings are used in CodeBERT, a PLM pre-trained on source code, the F1-score of clone detection improves by 21–24%. The findings can benefit the research community by encouraging code-specific representations instead of the common embeddings used in NLP. They also open new directions for developing smaller models with similar performance.
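
The idea of representing a code sequence by its identifiers rather than by [CLS] can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how identifier embeddings are combined, so mean pooling is an assumption here, as are the function names, the Python-style tokens, and the hand-made two-dimensional vectors standing in for one layer of BERT hidden states.

```python
import keyword


def identifier_positions(tokens):
    """Indices of tokens that look like identifiers (Python rules, keywords excluded)."""
    return [i for i, tok in enumerate(tokens)
            if tok.isidentifier() and not keyword.iskeyword(tok)]


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sequence_embedding(tokens, hidden_states, use_identifiers=True):
    """Return one vector for the whole sequence.

    hidden_states[i] is the vector for tokens[i], and tokens[0] is [CLS].
    With use_identifiers=True the identifier vectors are mean-pooled
    (an assumption of this sketch); otherwise the [CLS] vector is returned.
    """
    if use_identifiers:
        positions = identifier_positions(tokens)
        if positions:
            return mean_pool([hidden_states[i] for i in positions])
    return hidden_states[0]


# Toy input: one made-up vector per token, not real model output.
tokens = ["[CLS]", "def", "add", "(", "a", ",", "b", ")", "[SEP]"]
hidden = [[float(i), 1.0] for i in range(len(tokens))]

print(sequence_embedding(tokens, hidden))                         # [4.0, 1.0]
print(sequence_embedding(tokens, hidden, use_identifiers=False))  # [0.0, 1.0]
```

For clone detection, two snippets would each be reduced to one such vector and then compared or passed to a classifier. With a real model, the vectors would come from a chosen layer's hidden states instead of the toy values above.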

Link to Preprint

https://arxiv.org/abs/2204.10200

Rishab Sharma

University of British Columbia

Fuxiang Chen

University of British Columbia

Fatemeh Hendijani Fard

University of British Columbia

Canada

David Lo

Singapore Management University

Singapore

Session Program

Sun 15 May
Displayed time zone: Eastern Time (US & Canada)

22:30 - 23:20  Session 2: Program Representation 1 (Research) at ICPC room
Chair(s): Fatemeh Hendijani Fard, University of British Columbia

22:30  7m  Talk  Zero-Shot Program Representation Learning (Research)
    Nan Cui (Shanghai Jiao Tong University), Yuze Jiang (Shanghai Jiao Tong University), Xiaodong Gu (Shanghai Jiao Tong University, China), Beijun Shen (School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University)
22:37  7m  Talk  On The Cross-Modal Transfer from Natural Language to Code through Adapter Modules (Research)
    Divyam Goel (Indian Institute of Technology Roorkee), Ramansh Grover (Delhi Technological University), Fatemeh Hendijani Fard (University of British Columbia)
22:44  7m  Talk  Self-Supervised Learning of Smart Contract Representations (Research)
    Shouliang Yang (School of Software, Shanghai Jiao Tong University), Beijun Shen (School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University), Xiaodong Gu (Shanghai Jiao Tong University, China)
22:51  7m  Talk  An Exploratory Study on Code Attention in BERT (Research)
    Rishab Sharma (University of British Columbia), Fuxiang Chen (University of British Columbia), Fatemeh Hendijani Fard (University of British Columbia), David Lo (Singapore Management University)
22:58  7m  Talk  Accurate Generation of Trigger-Action Programs with Domain-Adapted Sequence-to-Sequence Learning (Research)
    Imam Nur Bani Yusuf (Singapore Management University), Lingxiao Jiang (Singapore Management University), David Lo (Singapore Management University)
23:05  15m  Live Q&A  Q&A-Paper Session 2 (Research)
