CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences (ICSE 2022 - Technical Track)

Sun 8 - Fri 27 May 2022

Who

Maliheh Izadi, Roberta Gismondi, Georgios Gousios

Track

ICSE 2022 Technical Track

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 10 May 2022 05:25 - 05:30 at ICSE room 1 - Machine Learning with and for SE 1 Chair(s): Gemma Catolino
Wed 11 May 2022 11:05 - 11:10 at ICSE room 3 - Search-Based Software Engineering 3 Chair(s): Mohamed Wiem Mkaouer

Abstract

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single-token completions. Both approaches have significant drawbacks: grammar-based autocompletion is very restricted in dynamically-typed language scenarios, whereas NLP-based autocompletion struggles to understand the semantics of the programming language, giving suggestions that ignore a developer’s context.

In this work, we present CodeFill, a language model for autocompletion that combines structure and naming information. Using a parallel Transformer architecture and Multi-Task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single-token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and significantly outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and data for replication and use.
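
The two metrics quoted in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names and the toy token sequences are hypothetical, and ROUGE-L is computed here as the F1 score over the longest common subsequence (LCS) of predicted and reference tokens, which is one common formulation that may differ from the exact variant used in the paper. MRR averages the reciprocal rank of the first correct suggestion over all completion points.

```python
from typing import Sequence


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]],
                         targets: Sequence[str]) -> float:
    """Average of 1/rank of the first correct suggestion (0 when absent)."""
    if not targets:
        return 0.0
    total = 0.0
    for suggestions, target in zip(ranked, targets):
        for rank, token in enumerate(suggestions, start=1):
            if token == target:
                total += 1.0 / rank
                break
    return total / len(targets)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """LCS-based F1 between a predicted and a reference token sequence."""
    lcs = lcs_length(predicted, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(predicted)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data (assumed for illustration only): three single-token
    # completion points, each with a ranked suggestion list.
    ranked = [["a", "b"], ["x", "y", "z"], ["q"]]
    targets = ["a", "z", "n"]
    print(mean_reciprocal_rank(ranked, targets))

    # Toy multi-token completion: one token differs from the reference.
    predicted = ["x", "=", "foo", "(", ")"]
    reference = ["x", "=", "bar", "(", ")"]
    print(rouge_l_f1(predicted, reference))
```

In this toy setup a suggestion list that ranks the correct token first contributes 1 to the MRR sum, one that ranks it third contributes 1/3, and one that misses it contributes 0. The LCS formulation rewards multi-token predictions that keep the reference tokens in order even when a token in the middle is wrong.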

Link to Preprint

https://arxiv.org/abs/2202.06689

DOI

https://doi.org/10.1145/3510003.3510172

Maliheh Izadi

Delft University of Technology

Netherlands

Roberta Gismondi

Delft University of Technology

Georgios Gousios

Endor Labs & Delft University of Technology

Netherlands

Session Program

Tue 10 May
Displayed time zone: Eastern Time (US & Canada)

05:00 - 06:00	Machine Learning with and for SE 1 (NIER - New Ideas and Emerging Results / Technical Track / Journal-First Papers) at ICSE room 1 Chair(s): Gemma Catolino Tilburg University & Jheronimus Academy of Data Science

5m Talk		SQAPlanner: Generating Data-Informed Software Quality Improvement Plans -- A Journal-First Presentation Journal-First Papers Dilini Rajapaksha Monash University, Kla Tantithamthavorn Monash University, Jirayus Jiarpakdee Monash University, Australia, Christoph Bergmeir Monash University, John Grundy Monash University, Wray Buntine Monash University
5m Talk		Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks Journal-First Papers NIKITA MEHROTRA Indraprastha Institute of Information Technology, NAVDHA AGARWAL Indraprastha Institute of Information Technology, Delhi, PIYUSH GUPTA Indraprastha Institute of Information Technology, Delhi, SAKET ANAND Indraprastha Institute of Information Technology, Delhi, David Lo Singapore Management University, Rahul Purandare IIIT-Delhi
5m Talk		Improving the Learnability of Machine Learning APIs by Semi-Automated API Wrapping NIER - New Ideas and Emerging Results Lars Reimann University of Bonn, Günter Kniesel-Wünsche University of Bonn
5m Talk		Learning to Recommend Method Names with Global Context Technical Track Fang Liu Peking University, Ge Li Peking University, Zhiyi Fu Peking University, Shuai Lu Peking University, Yiyang Hao Silicon Heart Tech Co., Zhi Jin Peking University
5m Talk		On the Importance of Building High-quality Training Datasets for Neural Code Search (Nominated for Distinguished Paper) Technical Track Zhensu Sun The Hong Kong Polytechnic University, Li Li Monash University, Yan Liu Tongji University, Xiaoning Du Monash University, Australia, Li Li Monash University
5m Talk		CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences Technical Track Maliheh Izadi Delft University of Technology, Roberta Gismondi Delft University of Technology, Georgios Gousios Endor Labs & Delft University of Technology

Wed 11 May
Displayed time zone: Eastern Time (US & Canada)

11:00 - 12:00	Search-Based Software Engineering 3 (Technical Track / NIER - New Ideas and Emerging Results) at ICSE room 3 Chair(s): Mohamed Wiem Mkaouer Rochester Institute of Technology

5m Talk		A Black Box Technique to Reduce Energy Consumption of Android Apps NIER - New Ideas and Emerging Results Abdul Ali Bangash University of Alberta, Canada, Karim Ali University of Alberta, Abram Hindle University of Alberta
5m Talk		CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences Technical Track Maliheh Izadi Delft University of Technology, Roberta Gismondi Delft University of Technology, Georgios Gousios Endor Labs & Delft University of Technology
5m Talk		Fairness-aware Configuration of Machine Learning Libraries Technical Track Saeid Tizpaz-Niari University of Texas at El Paso, Ashish Kumar, Gang (Gary) Tan Pennsylvania State University, Ashutosh Trivedi University of Colorado Boulder
5m Talk		Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and Many-Objective Optimization (Distinguished Paper Award) Technical Track Fitash Ul Haq University of Luxembourg, Donghwan Shin University of Luxembourg, Lionel Briand University of Luxembourg; University of Ottawa
5m Talk		PropR: Property-Based Automatic Program Repair Technical Track Matthías Páll Gissurarson Chalmers University of Technology, Sweden, Leonhard Applis Delft University of Technology, Annibale Panichella Delft University of Technology, Arie van Deursen Delft University of Technology, Netherlands, Dave Sands Chalmers
