Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute
This program is tentative and subject to change.
Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment challenges in private environments, prompting a critical question: \textit{How can personally deployable open-source LLMs (e.g., 32B models running on a single GPU) achieve comparable code reasoning performance?} To this end, we propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. Internally, we introduce a \textit{development-contextualized trajectory synthesis} method leveraging real-world software repositories to bootstrap multi-stage reasoning processes, such as fault localization and patch generation. We further enhance trajectory quality through rejection sampling, rigorously evaluating trajectories along both accuracy and complexity dimensions. Externally, we propose a novel \textit{development-process-based search} strategy guided by reward models and execution verification. This approach enables targeted computational allocation at critical development decision points, overcoming limitations of existing “end-point only” verification methods.
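The abstract names the two TTC strategies only at a high level; the sketch below illustrates one plausible shape they could take, not the released implementation. All names here (`Trajectory`, `STAGES`, `rejection_sample`, `process_based_search`, and the `propose`, `score`, `run_tests` callables standing in for the policy model, reward model, and test harness) are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# The multi-stage development process the external search walks through (assumed stages).
STAGES = ["repository_understanding", "fault_localization",
          "patch_generation", "patch_verification"]


@dataclass
class Trajectory:
    steps: List[str] = field(default_factory=list)  # one reasoning/action step per stage
    score: float = 0.0                              # reward-model score of the partial trajectory


def rejection_sample(candidates: List[Trajectory],
                     is_correct: Callable[[Trajectory], bool],
                     max_steps: int) -> List[Trajectory]:
    """Internal TTC (training side): keep only synthesized trajectories that pass an
    accuracy check (e.g., the final patch resolves the issue) and stay within a
    complexity budget, here crudely approximated by step count."""
    return [t for t in candidates if is_correct(t) and len(t.steps) <= max_steps]


def process_based_search(issue: str,
                         propose: Callable[[str, List[str], str], List[str]],
                         score: Callable[[List[str]], float],
                         run_tests: Callable[[List[str]], bool],
                         beam_width: int = 4) -> Trajectory:
    """External TTC (inference side): beam search that spends extra compute at every
    development decision point instead of verifying only the end point."""
    beam = [Trajectory()]
    for stage in STAGES:
        expanded: List[Trajectory] = []
        for traj in beam:
            # Sample several candidate continuations for this stage.
            for step in propose(issue, traj.steps, stage):
                cand = Trajectory(steps=traj.steps + [step])
                cand.score = score(cand.steps)  # reward-model guidance
                expanded.append(cand)
        if stage == "patch_verification":
            # Execution verification: prefer candidates whose patch passes the tests.
            passing = [t for t in expanded if run_tests(t.steps)]
            expanded = passing or expanded
        # Keep only the top-scoring partial trajectories.
        beam = sorted(expanded, key=lambda t: t.score, reverse=True)[:beam_width]
    return max(beam, key=lambda t: t.score)
```

The point mirrored here is that reward scoring and execution checks are applied at every development stage, rather than only on the final patch as in “end-point only” verification.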
Evaluations on SWE-bench Verified demonstrate that our \textbf{32B model achieves a 46% issue resolution rate}, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1. Additionally, we provide empirical validation of the test-time scaling phenomenon within SWE agents, revealing that \textbf{models dynamically allocate more tokens to increasingly challenging problems}, effectively enhancing reasoning capabilities. We publicly release all training data, models, and code to facilitate future research.\footnote{Model: \url{https://github.com/yingweima2022/SWE-Reasoner/tree/6627eba7215425ecfef65a40a9c516b2feca1bc7}, Code: \url{https://github.com/yingweima2022/AnonymousSWESynInferpro}} \textit{In fact, our method has been deployed in Tongyi Lingma, an IDE-based coding assistant developed by Alibaba Cloud, where it helps developers solve real-world programming problems.}
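As a reading aid for the scaling claim, the minimal analysis sketch below shows one way to check it, assuming per-instance records that pair a difficulty label (e.g., the SWE-bench Verified difficulty annotations) with the number of tokens the model generated; the record format is hypothetical.

```python
from collections import defaultdict

def mean_tokens_by_difficulty(records):
    """records: iterable of dicts such as
    {"difficulty": "15 min - 1 hour", "output_tokens": 1234} (format assumed)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["difficulty"]].append(r["output_tokens"])
    # Under the test-time scaling claim, harder buckets should show larger means.
    return {d: sum(v) / len(v) for d, v in buckets.items()}
```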
Tue 18 Nov (displayed time zone: Seoul)
16:00 - 17:00
16:00 10m Talk | Adaptive Performance Regression Detection via Semi-Supervised Siamese Learning (Industry Showcase) | Yongqian Sun (Nankai University), Mengyao Li (Nankai University), Xiao Xiong (Nankai University), Lei Tao (Nankai University), Yimin Zuo (Nankai University), Wenwei Gu (The Chinese University of Hong Kong), Shenglin Zhang (Nankai University), Junhua Kuang (Nankai University), Yu Luo (Nankai University), Huandong Zhuang (Huawei Cloud), Bowen Deng (Huawei Cloud), Dan Pei (Tsinghua University)
16:10 10m Talk | Deploying Language Models on Android-Based Edge Devices: A Practical Evaluation Pipeline (Industry Showcase) | Suayder Costa (Venturus - Innovation & Technology), Igor Lima (Venturus - Innovation & Technology), William Harada (Venturus - Innovation & Technology), Mateus Lucena (Venturus - Innovation & Technology), Arthur Alves (Venturus - Innovation & Technology), Ruan Belem (TPV Technology), Agemilson Pimentel (TPV Technology), Rômulo Fabrício (TPV Technology), Alexandre Miranda (Paulo Feitoza Foundation - FPFTech), Daniel Lins (Venturus - Innovation & Technology), Frederico Goncalves (Venturus - Innovation & Technology), Sidney Leal (Venturus - Innovation & Technology)
16:20 10m Talk | How Can Infrastructure as Code Accelerate Data Center Bring-ups? A Case Study at ByteDance (Industry Showcase) | Xianhao Jin (ByteDance), Yifei Feng (ByteDance), Yufei Gao (ByteDance), Yongning Hu (ByteDance), Jie Huang (ByteDance), Kun Xia (ByteDance), Luchuan Guo (ByteDance)
16:30 10m Talk | MobileUPReg: Identifying User-Perceived Performance Regressions in Mobile OS Versions (Industry Showcase) | Wei Liu (Concordia University, Montreal, Canada), Yi Wen HENG (Concordia University), Feng Lin (Concordia University), Tse-Hsun (Peter) Chen (Concordia University), Ahmed E. Hassan (Queen’s University)
16:40 10m Talk | Context-Aware CodeLLM Eviction for AI-assisted Coding (Industry Showcase) | Kishanthan Thangarajah (Centre for Software Excellence, Huawei Canada), Boyuan Chen (Centre for Software Excellence, Huawei Canada), Shi Chang (University of Western Ontario), Ahmed E. Hassan (Queen’s University)
16:50 10m Talk | Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute (Industry Showcase) | Yingwei Ma (Tongyi Lab, Alibaba), Yongbin Li (Tongyi Lab, Alibaba, China), Yihong Dong (Peking University), Xue Jiang, Yanhao Li (Tongyi Lab, Alibaba), Yue Liu (Monash University), Rongyu Cao (Tongyi Lab, Alibaba, China), Jue Chen (Tongyi Lab, Alibaba, China), Fei Huang (Tongyi Lab, Alibaba, China), Binhua Li (Tongyi Lab, Alibaba, China)