Mind the Merge: Evaluating the Effects of Token Merging on Pre-trained Models for Code
This program is tentative and subject to change.
Tokenization is a fundamental component of language models for code. It breaks the input into units that are then passed to the language model stack to learn high-dimensional representations used in various contexts, from classification to generation. However, the token sequences produced by these tokenizers are often longer than those traditionally produced by the lexers of compilers and interpreters, which can lead to undesirable effects such as increased computational overhead. In this work, we explore the effect of merging the hidden representations of subtokens that belong to the same semantic unit, such as subtokens that form a single identifier. We experiment with two strategies: one based on averaging the representations and another that leverages a learning-based approach.
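The averaging strategy described above can be sketched in a few lines: consecutive subtoken hidden states that belong to the same semantic unit (e.g., the pieces of one identifier) are collapsed into their element-wise mean. This is a minimal illustration, not the paper's implementation; the function name, the group-id encoding, and the toy vectors are all assumptions made for the example.

```python
def merge_by_average(hidden_states, group_ids):
    """Average consecutive hidden vectors that share a group id.

    hidden_states: list of vectors (list[float]), one per subtoken.
    group_ids: list of ints of the same length; subtokens that form
        one semantic unit (e.g., one identifier) share a group id.
    Returns one averaged vector per group, shortening the sequence.
    """
    merged = []
    i = 0
    while i < len(hidden_states):
        # Find the end of the run of subtokens sharing this group id.
        j = i
        while j < len(hidden_states) and group_ids[j] == group_ids[i]:
            j += 1
        span = hidden_states[i:j]
        dim = len(span[0])
        # Element-wise mean over the span collapses it into one vector.
        merged.append([sum(v[d] for v in span) / len(span) for d in range(dim)])
        i = j
    return merged

# Toy example: three subtokens of one identifier plus one standalone token.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [10.0, 10.0]]
groups = [0, 0, 0, 1]
print(merge_by_average(states, groups))  # -> [[3.0, 4.0], [10.0, 10.0]]
```

Shortening the hidden-state sequence this way is what reduces the floating-point operations in the layers that follow, since attention cost grows with sequence length.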
We conduct an empirical study using six language models for code: CodeBERT, GraphCodeBERT, UniXCoder, CodeT5, CodeT5+ (220M), and CodeT5+ (770M), across three software engineering tasks: vulnerability detection, code classification, and code translation. Results show that these strategies can reduce the number of floating-point operations by up to 19%. Regarding downstream performance, the most significant degradation occurs in the vulnerability detection task, where the F1 score decreases by 1.82 points compared to the baseline. In contrast, for code translation, we observe an improvement of 2.47 points in CodeBLEU. This work contributes to the broader effort of improving language models for code across multiple dimensions, including both computational efficiency and downstream performance.
Thu 19 Mar (displayed time zone: Athens)
11:00 - 12:30 | Session 4A - Code Representation and Analysis (Research Track / Tool Demo Track)

11:00 (12m) Talk | GDPO: Dual Learning for Self-Supervised Code Summarization in the Era of Large Language Models (Research Track). Chen Xiao; Wang Shuwei, Zhang Weize, Jiang Zhengwei, Wang Qiuyun (Institute of Information Engineering, Chinese Academy of Sciences, and University of Chinese Academy of Sciences)

11:12 (12m) Talk | Mind the Merge: Evaluating the Effects of Token Merging on Pre-trained Models for Code (Research Track). Mootez Saad (Dalhousie University), Hao Li (Queen's University), Tushar Sharma (Dalhousie University), Ahmed E. Hassan (Queen's University)

11:25 (12m) Talk | CONCORD: A DSL for Generating Simplified and Scalable Graph-Based Code Representations (Research Track, pre-print)

11:38 (12m) Talk | Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition (Research Track). Denis Neumüller, Sebastian Boll, David Schüler, Matthias Tichy (Ulm University)

11:51 (12m) Talk | A Multi-Modal Retrieval-Augmented Framework for Compiler Backend Generation with LLMs (Research Track). Ming Zhong (SKLP, Institute of Computing Technology, CAS), Fang Lv (Institute of Computing Technology, Chinese Academy of Sciences), Hongna Geng, Xin Sun, Lulin Wang, Huimin Cui (Institute of Computing Technology, Chinese Academy of Sciences), Xiaobing Feng (ICT, CAS)

12:04 (12m) Talk | AdaptVM: An LLVM-Based Function-Adaptive Code Virtualizer (Tool Demo Track)

12:17 (12m) Talk | Static Analysis assisted Knowledge Graph based Automatic Functionality Discovery for Mainframe Applications (Tool Demo Track). Sasaank Janapati, Atul Kumar (IBM Research India), Nandakishore S Menon (IBM Research India), Sridhar Chimalakonda (Indian Institute of Technology Tirupati)