Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading
This program is tentative and subject to change.
Line-level code completion aims to complete the current line in real time as developers type. Low latency is crucial to maintaining a seamless and uninterrupted coding experience, enabling developers to remain in a productive flow. However, existing approaches face a fundamental trade-off: large language models (LLMs) provide high-quality suggestions but require expensive computational resources to ensure acceptable inference latency. In contrast, static-analysis-based methods and small language models respond quickly but often generate suboptimal completions. To bridge this gap, our idea is to rely on the small model by default and escalate to the large model only when necessary, achieving a better latency-accuracy trade-off. Based on this idea, we propose MCCom (Model-Cascading-based code Completion), a framework that cascades a local small model with a high-performance cloud large model for code completion. Realizing effective model cascading requires answering two non-trivial questions: when to invoke the large model, and how to enable effective collaboration between the small and large models. For the first question, we leverage a valuable but easily overlooked signal, user actions during code completion, to accurately identify failed completions. This deferral decision allows us to invoke the large model only when necessary, reducing both latency and cloud-side computation costs. For the second question, MCCom employs a two-stage speculative decoding strategy and an iterative retrieval mechanism that together accelerate completions and improve their quality. Due to the lack of high-quality small models for code completion, we also train a lightweight model with only 121M parameters to implement MCCom. The small model achieves an average of 73.8% of the performance of the state-of-the-art 7B model. We evaluate MCCom on the RepoEval benchmark and a new benchmark, StmtEval, collected from real-world projects.
Experimental results show that our approach not only reduces inference latency by up to 47.9% and cuts down LLM usage by an average of 46.3%, but also improves the exact match rate of the large model by an average of 8.9%.
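The deferral idea described in the abstract can be illustrated with a minimal sketch. This is not MCCom's implementation: the names `cascade_complete`, `CascadeResult`, and `large_model_usage` are hypothetical, the models are plain callables, and the user-action signal is reduced to a single accept/reject callback, which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable

# All names here are hypothetical illustrations, not part of MCCom.
Completer = Callable[[str], str]        # code prefix -> suggested completion
UserCheck = Callable[[str, str], bool]  # (prefix, suggestion) -> accepted?


@dataclass
class CascadeResult:
    suggestion: str
    used_large_model: bool


def cascade_complete(prefix: str, small: Completer, large: Completer,
                     user_accepts: UserCheck) -> CascadeResult:
    """Try the local small model first; escalate only if the user rejects it."""
    draft = small(prefix)
    if user_accepts(prefix, draft):
        return CascadeResult(draft, used_large_model=False)
    # Assumption: a rejection is treated as the deferral signal,
    # so the request falls through to the cloud-side large model.
    return CascadeResult(large(prefix), used_large_model=True)


def large_model_usage(results: list[CascadeResult]) -> float:
    """Fraction of completion requests that needed the large model."""
    if not results:
        return 0.0
    return sum(r.used_large_model for r in results) / len(results)
```

In this sketch the large model is called only on the rejected requests, which is where the savings in latency and cloud-side computation would come from; the speculative decoding and iterative retrieval components mentioned in the abstract are not modeled here.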
Thu 9 Jul
Displayed time zone: Eastern Time (US & Canada)
10:30 - 12:30
10:30 | 20m | Talk | NES: An Instruction-Free, Low-Latency Next Edit Suggestion Framework Powered by Learned Historical Editing Trajectories | Industry Papers | Xinfang Chen (Ant Group), Siyang Xiao (Ant Group), Xianying Zhu (Ant Group), Junhong Xie (Ant Group), Ming Liang (Ant Group), Dajun Chen (Ant Group), Wei Jiang (Ant Group), Yong Li (Ant Group), Peng Di (Kunlunxin & UNSW Sydney)
10:50 | 20m | Talk | Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading | Research Papers | Lu Hanzhen (Zhejiang University), Lishui Fan (Zhejiang University), Jiachi Chen (Sun Yat-sen University), Qiuyuan Chen (Tencent Technology), Zhao Wei (Tencent), Zhongxin Liu (Zhejiang University)
11:10 | 20m | Talk | Coding in a Bubble? Evaluating LLMs in Resolving Context Adaptation Bugs During Code Adaptation | Research Papers | Tanghaoran Zhang (National University of Defense Technology), Xinjun Mao (National University of Defense Technology), Shangwen Wang (National University of Defense Technology), Yuxin Zhao (Key Laboratory of Software Engineering for Complex Systems, National University of Defense Technology), Yao Lu (National University of Defense Technology), Zezhou Tang (National University of Defense Technology), Wenyu Xu (National University of Defense Technology), Longfei Sun (National University of Defense Technology), Changrong Xie (National University of Defense Technology), Kang Yang (National University of Defense Technology), Yue Yu (PengCheng Lab)
11:30 | 20m | Talk | Hallucinations in LLM-based Code Summarization: Unveiling, Detection, and Mitigation | Research Papers | Guanghua Wan (Huazhong University of Science and Technology), Yuanning Feng (Huazhong University of Science and Technology), Yao Wan (Huazhong University of Science and Technology), Zhaoyang Chu (Huazhong University of Science and Technology), Zhangqian Bi (Huazhong University of Science and Technology), Junxiao Han (Hangzhou City University), Zhou Zhao (Zhejiang University), Hongyu Zhang (Chongqing University), Pingpeng Yuan (Huazhong University of Science and Technology), Xuanhua Shi (Huazhong University of Science and Technology), Hai Jin (Huazhong University of Science and Technology)
11:50 | 20m | Talk | ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction? | Research Papers | Doha Nam (Korea Advanced Institute of Science and Technology), Taehyoun Kim (Korea Advanced Institute of Science and Technology; Agency for Defense Development), Duksan Ryu (Jeonbuk National University), Jongmoon Baik (Korea Advanced Institute of Science and Technology)
12:10 | 20m | Talk | Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing | Research Papers | Chaozheng Wang (The Chinese University of Hong Kong), Zezhou Yang (Hong Kong University), Shuzheng Gao (Chinese University of Hong Kong), Cuiyun Gao (Harbin Institute of Technology, Shenzhen), Li Zongjie (Hong Kong University of Science and Technology), Yichen LI (ByteDance), Ting Peng (Tencent Inc.), Hailiang Huang (Tencent Inc.), Yuetang Deng (Tencent), Michael Lyu (The Chinese University of Hong Kong)