
This program is tentative and subject to change.

Thu 1 May 2025 11:45 - 12:00 at 211 - AI for Design and Architecture

Large Language Models (LLMs) have been widely used in code completion, and researchers are focusing on scaling up LLMs to improve their accuracy. However, larger LLMs increase the response time of code completion and decrease developers' productivity. In this paper, we propose a lightweight and effective LLM for code completion named aiXcoder-7B. Compared to existing LLMs, aiXcoder-7B achieves higher code completion accuracy at a smaller scale (i.e., 7 billion parameters). We attribute the superiority of aiXcoder-7B to three key factors: (1) Multi-objective training. We employ three training objectives, one of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers the syntax structures in code and effectively improves the performance of LLMs for code. (2) Diverse data sampling strategies. They consider inter-file relationships and enhance the capability of LLMs in understanding cross-file contexts. (3) Extensive high-quality data. We establish a rigorous data collection pipeline and consume a total of 1.2 trillion unique tokens for training aiXcoder-7B. This vast volume of data enables aiXcoder-7B to learn a broad distribution of code. We evaluate aiXcoder-7B on five popular code completion benchmarks and a new benchmark collected in this paper. The results show that aiXcoder-7B outperforms the six latest LLMs of similar size and even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLLaMa-34B), positioning aiXcoder-7B as a lightweight and effective LLM for academia and industry. Finally, we summarize three valuable insights to help practitioners train the next generation of LLMs for code. aiXcoder-7B has been open-sourced and has gained significant attention. As of the submission date, aiXcoder-7B has received 2,193 GitHub Stars.
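To make the SFIM idea concrete, the sketch below shows what a fill-in-the-middle training example looks like when the masked span is chosen to be a syntactically complete unit (here, a whole statement located via Python's `ast` module) rather than a random character span. The node-selection heuristic and the `<fim_*>` sentinel tokens are illustrative assumptions for this sketch, not aiXcoder-7B's actual implementation.

```python
import ast

def structured_fim_split(code: str):
    """Split code into (prefix, middle, suffix), where the middle is a
    syntactically complete statement, in the spirit of a structured FIM
    objective. The choice of node is a simplification for illustration."""
    tree = ast.parse(code)
    lines = code.splitlines(keepends=True)
    # Pick the first statement inside the first function body as the "middle".
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    stmt = func.body[0]
    start, end = stmt.lineno - 1, stmt.end_lineno  # 0-based line span
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])
    return prefix, middle, suffix

def to_fim_prompt(prefix: str, suffix: str) -> str:
    # Placeholder sentinel tokens; real models define their own special tokens.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

code = "def add(a, b):\n    total = a + b\n    return total\n"
p, m, s = structured_fim_split(code)
print(to_fim_prompt(p, s))   # model input: prefix and suffix around the hole
print("target:", repr(m))    # training target: the masked, syntax-complete span
```

Because the masked span always aligns with a complete syntax node, the model learns to generate well-formed code units rather than arbitrary fragments, which is the intuition the abstract attributes to SFIM.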


Thu 1 May

Displayed time zone: Eastern Time (US & Canada)

11:00 - 12:30
AI for Design and Architecture
Demonstrations / SE In Practice (SEIP) / Research Track at 211
11:00
15m
Talk
An LLM-Based Agent-Oriented Approach for Automated Code Design Issue Localization
Research Track
Fraol Batole Tulane University, David OBrien Iowa State University, Tien N. Nguyen University of Texas at Dallas, Robert Dyer University of Nebraska-Lincoln, Hridesh Rajan Tulane University
11:15
15m
Talk
Distilled Lifelong Self-Adaptation for Configurable Systems
Research Track
Yulong Ye University of Birmingham, Tao Chen University of Birmingham, Miqing Li University of Birmingham
11:30
15m
Talk
The Software Librarian: Python Package Insights for Copilot
Demonstrations
Jasmine Latendresse Concordia University, Nawres Day ISSAT Sousse, SayedHassan Khatoonabadi Concordia University, Emad Shihab Concordia University
11:45
15m
Talk
aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing
SE In Practice (SEIP)
Siyuan Jiang, Jia Li Peking University, He Zong aiXcoder, Huanyu Liu Peking University, Hao Zhu Peking University, Shukai Hu aiXcoder, Erlu Li aiXcoder, Jiazheng Ding aiXcoder, Ge Li Peking University
Pre-print
12:00
15m
Talk
Leveraging MLOps: Developing a Sequential Classification System for RFQ Documents in Electrical Engineering
SE In Practice (SEIP)
Claudio Martens Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Hammam Abdelwahab Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Katharina Beckh Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Birgit Kirsch Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Vishwani Gupta Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Dennis Wegener Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Steffen Hoh Schneider Electric
12:15
15m
Talk
On Mitigating Code LLM Hallucinations with API Documentation
SE In Practice (SEIP)
Nihal Jain Amazon Web Services, Robert Kwiatkowski, Baishakhi Ray Columbia University, Murali Krishna Ramanathan AWS AI Labs, Varun Kumar AWS AI Labs