CoCoSoDa: Effective Contrastive Learning for Code Search (ICSE 2023 - Technical Track)

Who

Ensheng Shi, Wenchao Gu, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun

Track

ICSE 2023 Technical Track

Time Zone

The program is currently displayed in (GMT+10:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Fri 19 May 2023 14:30 - 14:45 at Meeting Room 104 - Software development tools Chair(s): Xing Hu

Abstract

Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved the performance of code search. However, there is still considerable room for improvement in using contrastive learning for code search. In this paper, we propose CoCoSoDa, which uses contrastive learning effectively for code search by addressing two of its key factors: data augmentation and negative samples. Specifically, soft data augmentation dynamically masks some tokens in the input sequences, or replaces them with their types, to generate positive samples. A momentum mechanism generates large and consistent representations of negative samples in a mini-batch by maintaining a queue and a momentum encoder. In addition, multimodal contrastive learning pulls together the representations of code-query pairs and pushes apart unpaired code snippets and queries. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 18 baselines and, in particular, exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% on average MRR scores, respectively. (2) The ablation studies show the effectiveness of each component of our approach. (3) We adapt our techniques to several different pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT and observe a significant boost in their performance in code search. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore the reasons behind the good performance of our model.
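
The three ingredients named in the abstract (soft data augmentation, a momentum encoder with a queue of negatives, and a contrastive objective) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation. The function names, the 50/50 split between masking and type replacement, the InfoNCE-style loss, and the values for the augmentation ratio, momentum coefficient, and temperature are all assumptions chosen for illustration; the abstract does not specify them.

```python
import math
import random
from collections import deque

MASK = "<mask>"  # hypothetical mask symbol


def soft_augment(tokens, types, ratio=0.15, rng=random):
    """Randomly mask a token or replace it with its type (illustrative only).

    Called afresh on every pass, so each call can yield a different
    positive sample for the same input ("dynamic" augmentation).
    """
    out = []
    for tok, typ in zip(tokens, types):
        if rng.random() < ratio:
            # Assumed 50/50 choice between masking and type replacement.
            out.append(MASK if rng.random() < 0.5 else typ)
        else:
            out.append(tok)
    return out


def momentum_update(key_params, query_params, m=0.999):
    """Move the momentum (key) encoder slowly toward the query encoder:
    k <- m * k + (1 - m) * q. Parameters are flat lists of floats here."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: low when the query is closer to its paired
    positive than to every negative, high otherwise."""
    q = normalize(query)
    sims = [
        sum(a * b for a, b in zip(q, normalize(k))) / temperature
        for k in [positive] + list(negatives)
    ]
    mx = max(sims)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = mx + math.log(sum(math.exp(s - mx) for s in sims))
    return log_denominator - sims[0]


# A fixed-size queue holds representations from earlier mini-batches so that
# many negatives are available; the oldest entries drop out automatically.
negative_queue = deque(maxlen=4)
negative_queue.extend([[0.0, 1.0], [-1.0, 0.0]])

query_vec = [1.0, 0.1]   # e.g. an encoded natural language query
code_vec = [0.9, 0.0]    # e.g. the encoded paired code snippet
loss = info_nce(query_vec, code_vec, negative_queue)
negative_queue.append(code_vec)  # enqueue the newest key for later steps
```

In a real system the vectors would come from neural encoders and the loss would be applied in both directions (query to code and code to query) as well as within each modality; the sketch only shows how the pieces fit together.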

Link to Preprint

https://arxiv.org/abs/2204.03293

Ensheng Shi

Xi'an Jiaotong University

China

Wenchao Gu

The Chinese University of Hong Kong

Yanlin Wang

School of Software Engineering, Sun Yat-sen University

Lun Du

Microsoft Research Asia

China

Hongyu Zhang

The University of Newcastle

Australia

Shi Han

Microsoft Research

China

Dongmei Zhang

Microsoft Research

China

Hongbin Sun

Xi'an Jiaotong University

Session Program

Fri 19 May
Displayed time zone: Hobart

13:45 - 15:15	Software development tools (DEMO - Demonstrations / Technical Track / SEIP - Software Engineering in Practice / NIER - New Ideas and Emerging Results) at Meeting Room 104 Chair(s): Xing Hu Zhejiang University

13:45 15m Talk		Safe low-level code without overhead is practical Technical Track Solal Pirelli EPFL, George Candea EPFL Pre-print
14:00 15m Talk		Sibyl: Improving Software Engineering Tools with SMT Selection Technical Track Will Leeson University of Virginia, Matthew B Dwyer University of Virginia, Antonio Filieri AWS and Imperial College London Pre-print
14:15 15m Talk		Make Your Tools Sparkle with Trust: The PICSE Framework for Trust in Software Tools SEIP - Software Engineering in Practice Brittany Johnson George Mason University, Christian Bird Microsoft Research, Denae Ford Microsoft Research, Nicole Forsgren Microsoft Research, Thomas Zimmermann Microsoft Research Pre-print
14:30 15m Talk		CoCoSoDa: Effective Contrastive Learning for Code Search Technical Track Ensheng Shi Xi'an Jiaotong University, Wenchao Gu The Chinese University of Hong Kong, Yanlin Wang School of Software Engineering, Sun Yat-sen University, Lun Du Microsoft Research Asia, Hongyu Zhang The University of Newcastle, Shi Han Microsoft Research, Dongmei Zhang Microsoft Research, Hongbin Sun Xi'an Jiaotong University Pre-print
14:45 7m Talk		Task Context: A Tool for Predicting Code Context Models for Software Development Tasks DEMO - Demonstrations Yifeng Wang Zhejiang University, Yuhang Lin Zhejiang University, Zhiyuan Wan Zhejiang University, Xiaohu Yang Zhejiang University Pre-print Media Attached
14:52 7m Talk		Continuously Accelerating Research NIER - New Ideas and Emerging Results Sergey Mechtaev University College London, Jonathan Bell Northeastern University, Christopher Steven Timperley Carnegie Mellon University, Earl T. Barr University College London, Michael Hilton Carnegie Mellon University Pre-print
15:00 7m Talk		An Alternative to Cells for Selective Execution of Data Science Pipelines NIER - New Ideas and Emerging Results Lars Reimann University of Bonn, Günter Kniesel-Wünsche University of Bonn Pre-print
15:07 7m Talk		pytest-inline: An Inline Testing Tool for Python DEMO - Demonstrations Yu Liu University of Texas at Austin, Zachary Thurston Cornell University, Alan Han Cornell University, Pengyu Nie University of Texas at Austin, Milos Gligoric University of Texas at Austin, Owolabi Legunsen Cornell University