Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data (ASE 2024 - Research Papers)

Who

Ming Zhu, Mohimenul Karim, Ismini Lourentzou, Daphne Yao

Track

ASE 2024 Research Papers

When

Tue 29 Oct 2024 15:45 - 16:00 at Compagno - Program and Code translation Chair(s): Haiyan Zhao

Abstract

Neural code translation is the task of converting source code from one programming language to another. One of the main challenges is the scarcity of parallel code data, which hinders the ability of translation models to learn accurate cross-language alignments. In this paper, we introduce MIRACLE, a semi-supervised approach that improves code translation through synthesizing high-quality parallel code data and curriculum learning on code data with ascending alignment levels. MIRACLE leverages static analysis and compilation to generate synthetic parallel code datasets with enhanced quality and alignment to address the challenge of data scarcity. We evaluate the proposed method along with strong baselines including instruction-tuned Large Language Models (LLMs) for code. Our analysis reveals that LLMs pre-trained on open-source code data, regardless of their size, suffer from the ‘shallow translation’ problem. This issue arises when translated code copies keywords, statements, and even code blocks from the source language, leading to compilation and runtime errors. Extensive experiments demonstrate that our method significantly mitigates this issue, enhancing code translation performance across multiple models in C++, Java, Python, and C. Remarkably, MIRACLE outperforms code LLMs that are ten times larger in size. MIRACLE also achieves up to a 43% improvement in C code translation with fewer than 150 annotated examples.
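The abstract does not describe MIRACLE's pipeline in detail, so the following is only a minimal sketch of the general idea of using compilation as a quality filter for synthetic parallel data. It assumes the target language is Python and uses the built-in `compile()` as a stand-in for a compiler check. The helper names (`compiles_as_python`, `filter_candidates`) and the sample snippets are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def compiles_as_python(code: str) -> bool:
    """Return True if `code` parses and byte-compiles as Python source."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def filter_candidates(source: str, candidates: List[str]) -> List[Tuple[str, str]]:
    """Pair `source` with every candidate translation that compiles."""
    return [(source, cand) for cand in candidates if compiles_as_python(cand)]


# Hypothetical example: one Java function and three candidate Python translations.
java_source = "int add(int a, int b) { return a + b; }"
candidates = [
    "def add(a, b):\n    return a + b\n",        # valid Python
    "int add(int a, int b) { return a + b; }",  # verbatim copy of the Java source
    "def add(a, b) {\n    return a + b;\n}",    # Java-style braces left in the output
]
pairs = filter_candidates(java_source, candidates)
```

In this sketch, the second and third candidates illustrate the kind of output the abstract calls "shallow translation": source-language syntax carried into the target. A compile check rejects both. It only tests syntactic validity, not whether a translation behaves like the source.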

Ming Zhu, Virginia Tech
Mohimenul Karim, Virginia Tech
Ismini Lourentzou, Virginia Tech
Daphne Yao, Virginia Tech, United States

Session Program

Tue 29 Oct
Displayed time zone: Pacific Time (US & Canada)

15:30 - 16:30	Program and Code translation (Research Papers / Tool Demonstrations) at Compagno. Chair(s): Haiyan Zhao, Peking University

15:30 15m Talk	To Tag, or Not to Tag: Translating C’s Unions to Rust’s Tagged Unions (Research Papers). Jaemin Hong (KAIST), Sukyoung Ryu (KAIST)
15:45 15m Talk	Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data (Research Papers). Ming Zhu (Virginia Tech), Mohimenul Karim (Virginia Tech), Ismini Lourentzou (Virginia Tech), Daphne Yao (Virginia Tech)
16:00 15m Talk	A Joint Learning Model with Variational Interaction for Multilingual Program Translation (Research Papers). Yali Du (Nanjing University), Hui Sun (Nanjing University, National Key Laboratory for Novel Software Technology, China; Nanjing University, School of Artificial Intelligence, China), Ming Li (Nanjing University)
16:15 10m Talk	Automated Validation of COBOL to Java Transformation (Tool Demonstrations). Atul Kumar (IBM Research India), Diptikalyan Saha (IBM Research India), Toshiaki Yasue (IBM Research - Tokyo), Kohichi Ono (IBM Research - Tokyo), Saravanan Krishnan (IBM India Research Lab), Sandeep Hans (IBM India Research Lab), Fumiko Satoh (IBM Research - Tokyo), Gerald Mitchell (IBM Software), Sachin Kumar (IBM Software)