ICSE 2025 - Research Track

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

You're viewing the program in a time zone which is different from your device's time zone change time zone

Wed 30 Apr
Displayed time zone: Eastern Time (US & Canada) change

	07:00 - 22:00	Quiet Room WednesdaySocial, Networking and Special Rooms at 202 The Quiet Room (202) is for anyone who wants to get away from noise and be by themself. Please keep this room ‘library quiet’. If you want to listen to a stream or music., please wear headphones.

	07:00 - 19:00	Ready Room WednesdaySocial, Networking and Special Rooms at 209 Visit this room to upload or update your presentation or use a computer or talk to the AV technicians. You do not have to use the ready room as you can also upload your poster on the web

	09:00 - 12:30	Child Care Wednesday AMSocial, Networking and Special Rooms at 102 Child Care Child Care at ICSE is free, but you must have registered for child care when you registered for the conference. If you need to add child care to your registration, please contact the registration desk.

10:30 - 11:00	BreakCatering at Canada Hall 3 plus Foyer

10:30 30m Break		Wednesday Morning Break Catering

11:00 - 12:30	Formal Methods 1Research Track / New Ideas and Emerging Results (NIER) at 103 Chair(s): Cristian Cadar Imperial College London

11:00 15m Talk		SpecGen: Automated Generation of Formal Program Specifications via Large Language ModelsFormal Methods Research Track Lezhi Ma Nanjing University, Shangqing Liu Nanyang Technological University, Yi Li Nanyang Technological University, Xiaofei Xie Singapore Management University, Lei Bu Nanjing University
11:15 15m Talk		Gpass: a Goal-adaptive Neural Theorem Prover based on Coq for Automated Formal VerificationFormal Methods Research Track Yizhou Chen Peking University, Zeyu Sun Institute of Software, Chinese Academy of Sciences, Guoqing Wang Peking University, Dan Hao Peking University
11:30 15m Talk		AI-Assisted Autoformalization of Combinatorics Problems in Proof AssistantsFormal Methods New Ideas and Emerging Results (NIER) Long Doan George Mason University, ThanhVu Nguyen George Mason University
11:45 15m Talk		Formally Verified Binary-level Pointer AnalysisFormal Methods Research Track Freek Verbeek Open Universiteit & Virginia Tech, Ali Shokri Virginia Tech, Daniel Engel Open University Of The Netherlands, Binoy Ravindran Virginia Tech
12:00 15m Talk		EffBT: An Efficient Behavior Tree Reactive Synthesis and Execution FrameworkFormal Methods Research Track ziji wu National University of Defense Technology, yu huang National University of Defense Technology, peishan huang National University of Defense Technology, shanghua wen National University of Defense Technology, minglong li National University of Defense Technology, Ji Wang National University of Defense Technology
12:15 7m Talk		SolSearch: An LLM-Driven Framework for Efficient SAT-Solving Code GenerationFormal Methods New Ideas and Emerging Results (NIER) Junjie Sheng East China Normal University, Yanqiu Lin East China Normal University, Jiehao Wu East China Normal University, Yanhong Huang East China Normal University, Jianqi Shi East China Normal University, Min Zhang East China Normal University, Xiangfeng Wang East China Normal University
12:22 7m Talk		Listening to the Firehose: Sonifying Z3’s BehaviorFormal Methods New Ideas and Emerging Results (NIER) Finn Hackett University of British Columbia, Ivan Beschastnikh University of British Columbia

11:00 - 12:30	Program Comprehension 1Research Track at 204 Chair(s): Wing Lam George Mason University

11:00 15m Talk		An Empirical Study on Package-Level Deprecation in Python Ecosystem Research Track Zhiqing Zhong The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shilin He Microsoft Research, Haoxuan Wang The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), BoXi Yu The Chinese University of Hong Kong, Shenzhen, Haowen Yang The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Pinjia He Chinese University of Hong Kong, Shenzhen
11:15 15m Talk		Datalog-Based Language-Agnostic Change Impact Analysis for Microservices Research Track Qingkai Shi Nanjing University, Xiaoheng Xie Ant Group, Xianjin Fu Ant Group, Peng Di Ant Group & UNSW Sydney, Huawei Li Alibaba Inc., Ang Zhou Ant Group, Gang Fan Ant Group
11:30 15m Talk		GenC2Rust: Towards Generating Generic Rust Code from C Research Track Xiafa Wu University of California, Irvine, Brian Demsky University of California at Irvine
11:45 15m Talk		Instrumentation-Driven Evolution-Aware Runtime Verification Research Track Kevin Guan Cornell University, Owolabi Legunsen Cornell University
12:00 15m Talk		Moye: A Wallbreaker for Monolithic Firmware Research Track Jintao Huang Institute of Information Engineering, Chinese Academy of Science & University of Chinese Academy of Sciences, Beijing, China, Kai Yang School of Computer, Electronics and Information, Guangxi University, Gaosheng Wang Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China, Zhiqiang Shi Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China, Zhiwen Pan Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China, Shichao Lv Institute of Information Engineering, Chinese Academy of Science, Limin Sun Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China
12:15 15m Talk		Understanding and Detecting Peer Dependency Resolving Loop in npm Ecosystem Research Track Xingyu Wang Zhejiang University, MingSen Wang Zhejiang University, Wenbo Shen Zhejiang University, Rui Chang Zhejiang University

11:00 - 12:30	Testing and QA 1Research Track / Journal-first Papers at 205 Chair(s): Jonathan Bell Northeastern University

11:00 15m Talk		Critical Variable State-Aware Directed Greybox Fuzzing Research Track Xu Chen Institute of Information Engineering at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China, Ningning Cui Institute of Information Engineering at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China, Zhe Pan Institute of Information Engineering at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China, Liwei Chen Institute of Information Engineering at Chinese Academy of Sciences; University of Chinese Academy of Sciences, Gang Shi Institute of Information Engineering at Chinese Academy of Sciences; University of Chinese Academy of Sciences, Dan Meng Institute of Information Engineering at Chinese Academy of Sciences; University of Chinese Academy of Sciences
11:15 15m Talk		LWDIFF: An LLM-Assisted Differential Testing Framework for WebAssembly Runtimes Research Track Shiyao Zhou The Hong Kong Polytechnic University, Jincheng Wang Hong Kong Polytechnic University, He Ye University College London (UCL), Hao Zhou The Hong Kong Polytechnic University, Claire Le Goues Carnegie Mellon University, Xiapu Luo Hong Kong Polytechnic University
11:30 15m Talk		No Harness, No Problem: Oracle-guided Harnessing for Auto-generating C API Fuzzing Harnesses Research Track Gabriel Sherman University of Utah, Stefan Nagy University of Utah
11:45 15m Talk		Parametric Falsification of Many Probabilistic Requirements under Flakiness Research Track Matteo Camilli Politecnico di Milano, Raffaela Mirandola Karlsruhe Institute of Technology (KIT)
12:00 15m Talk		REDII: Test Infrastructure to Enable Deterministic Reproduction of Failures for Distributed Systems Research Track Yang Feng Nanjing University, Zheyuan Lin Nanjing University, Dongchen Zhao Nanjing University, Mengbo Zhou Nanjing University, Jia Liu Nanjing University, James Jones University of California at Irvine
12:15 15m Talk		Adopting Automated Bug Assignment in Practice - A Longitudinal Case Study at Ericsson Journal-first Papers Markus Borg CodeScene, Leif Jonsson Ericsson AB, Emelie Engstrom Lund University, Béla Bartalos Verint, Attila Szabo Ericsson

11:00 - 12:30	Gender, Equity and DiversitySE in Society (SEIS) / Research Track at 206 plus 208 Chair(s): Ronnie de Souza Santos University of Calgary

11:00 15m Talk		A Socio-Technical Grounded Theory on the Effect of Cognitive Dysfunctions in the Performance of Software Developers with ADHD and Autism SE in Society (SEIS) Kiev Gama Universidade Federal de Pernambuco, Grischa Liebel Reykjavik University, Miguel Goulao NOVA-LINCS, FCT/UNL, Aline Lacerda Federal University of Pernambuco (UFPE), Cristiana Lacerda Universidade Federal de Pernambuco Pre-print
11:15 15m Talk		Belonging Beyond Code: Queer Software Engineering and Humanities Student Experiences SE in Society (SEIS) Emily Vorderwülbeke University of Passau, Isabella Graßl Technical University of Darmstadt Pre-print
11:30 15m Talk		Breaking the Silos: An Actionable Framework for Recruiting Diverse Participants in SE SE in Society (SEIS) Shandler Mason North Carolina State University, Hank Lenham North Carolina State University, Sandeep Kuttal North Carolina State University
11:45 15m Talk		Enhancing Women's Experiences in Software Engineering SE in Society (SEIS) Júlia Rocha Fortunato University of Brasília, Luana Ribeiro Soares University of Brasília, Gabriela Silva Alves University of Brasília, Edna Dias Canedo University of Brasilia (UnB), Fabiana Freitas Mendes Aalto University
12:00 15m Talk		Investigating the Developer eXperience of LGBTQIAPN+ People in Agile Teams SE in Society (SEIS) Edvaldo R. Wassouf-Jr UFMS, Pedro Fukuda Federal University of Mato Grosso do Sul, Awdren Fontão Federal University of Mato Grosso do Sul (UFMS)
12:15 15m Talk		There's Nothing to See Here: A Study of Deaf and Hearing Developer Use of Stack Overflow SE in Society (SEIS) Steve Counsell Brunel University London, Giuseppe Destefanis Brunel University of London, Rumyana Neykova Brunel University London, Alina Miron Brunel University, Nadine Aburumman Brunel University, Thomas Shippey LogicMonitor

11:00 - 12:30	Human and Social Process 1SE In Practice (SEIP) / New Ideas and Emerging Results (NIER) / Journal-first Papers / Research Track at 207 Chair(s): Hausi Müller University of Victoria

11:00 15m Talk		Toward a Theory on Programmer's Block Inspired by Writer's Block Journal-first Papers Belinda Schantong Chemnitz University of Technology, Norbert Siegmund Leipzig University, Janet Siegmund Chemnitz University of Technology Link to publication
11:15 15m Talk		Digital Twins for Software Engineering Processes New Ideas and Emerging Results (NIER) Robin Kimmel University of Stuttgart, Judith Michael University of Regensburg, Andreas Wortmann University of Stuttgart, Jingxi Zhang University of Stuttgart Pre-print
11:30 15m Talk		Discovering Ideologies of the Open Source Software Movement New Ideas and Emerging Results (NIER) Yang Yue California State University San Marcos, Yi Wang Beijing University of Posts and Telecommunications, David Redmiles University of California, Irvine
11:45 15m Talk		Identifying Factors Contributing to ``Bad Days'' for Software Developers: A Mixed-Methods Study SE In Practice (SEIP) Ike Obi Purdue University, West Lafayette, Jenna L. Butler Microsoft Research, Sankeerti Haniyur Microsoft Corporation, Brian Hassan Microsoft Corporation, Margaret-Anne Storey University of Victoria, Brendan Murphy Microsoft Corporation
12:00 15m Talk		Time Warp: The Gap Between Developers’ Ideal vs Actual Workweeks in an AI-Driven EraAward Winner SE In Practice (SEIP) Sukrit Kumar Georgia Institute of Technology, Drishti Goel Microsoft, Thomas Zimmermann University of California, Irvine, Brian Houck Microsoft Research, B. Ashok Microsoft Research. India, Chetan Bansal Microsoft Research
12:15 15m Talk		Wearables to measure developer experience at work SE In Practice (SEIP) Charlotte Brandebusemeyer Hasso Plattner Institute, University of Potsdam, Tobias Schimmer SAP Labs, Bert Arnrich Hasso Plattner Institute, University of Potsdam

11:00 - 12:30	AI for User ExperienceSE In Practice (SEIP) / Demonstrations / Journal-first Papers / Research Track at 210 Chair(s): Chunyang Chen TU Munich

11:00 15m Talk		Automated Generation of Accessibility Test Reports from Recorded User TranscriptsAward Winner Research Track Syed Fatiul Huq University of California, Irvine, Mahan Tafreshipour University of California at Irvine, Kate Kalcevich Fable Tech Labs Inc., Sam Malek University of California at Irvine
11:15 15m Talk		KuiTest: Leveraging Knowledge in the Wild as GUI Testing Oracle for Mobile Apps SE In Practice (SEIP) Yongxiang Hu Fudan University, Yu Zhang Meituan, Xuan Wang Fudan University, Yingjie Liu School of Computer Science, Fudan University, Shiyu Guo Meituan, Chaoyi Chen Meituan, Xin Wang Fudan University, Yangfan Zhou Fudan University
11:30 15m Talk		GUIWatcher: Automatically Detecting GUI Lags by Analyzing Mobile Application Screencasts SE In Practice (SEIP) Wei Liu Concordia University, Montreal, Canada, Feng Lin Concordia University, Linqiang Guo Concordia University, Tse-Hsun (Peter) Chen Concordia University, Ahmed E. Hassan Queen’s University
11:45 15m Talk		GUIDE: LLM-Driven GUI Generation Decomposition for Automated Prototyping Demonstrations Kristian Kolthoff Institute for Software and Systems Engineering, Clausthal University of Technology, Felix Kretzer human-centered systems Lab (h-lab), Karlsruhe Institute of Technology (KIT) , Christian Bartelt , Alexander Maedche Human-Centered Systems Lab, Karlsruhe Institute of Technology, Simone Paolo Ponzetto Data and Web Science Group, University of Mannheim Pre-print
12:00 15m Talk		Agent for User: Testing Multi-User Interactive Features in TikTok SE In Practice (SEIP) Sidong Feng Monash University, Changhao Du Jilin University, huaxiao liu Jilin University, Qingnan Wang Jilin University, Zhengwei Lv ByteDance, Gang Huo ByteDance, Xu Yang ByteDance, Chunyang Chen TU Munich
12:15 7m Talk		Bug Analysis in Jupyter Notebook Projects: An Empirical Study Journal-first Papers Taijara Santana Federal University of Bahia, Paulo Silveira Neto Federal University Rural of Pernambuco, Eduardo Santana de Almeida Federal University of Bahia, Iftekhar Ahmed University of California at Irvine

11:00 - 12:30	Testing and SecurityResearch Track / Journal-first Papers at 211 Chair(s): Shiyi Wei University of Texas at Dallas

11:00 15m Talk		Fuzzing MLIR Compilers with Custom Mutation Synthesis Research Track Ben Limpanukorn UCLA, Jiyuan Wang University of California at Los Angeles, Hong Jin Kang University of Sydney, Eric Zitong Zhou UCLA, Miryung Kim UCLA and Amazon Web Services Pre-print
11:15 15m Talk		InSVDF: Interface-State-Aware Virtual Device Fuzzing Research Track Zexiang Zhang National University of Defense Technology, Gaoning Pan Hangzhou Dianzi University, Ruipeng Wang National University of Defense Technology, Yiming Tao Zhejiang University, Zulie Pan National University of Defense Technology, Cheng Tu National University of Defense Technology, Min Zhang National University of Defense Technology, Yang Li National University of Defense Technology, Yi Shen National University of Defense Technology, Chunming Wu Zhejiang University
11:30 15m Talk		Reduce Dependence for Sound Concurrency Bug Prediction Research Track Shihao Zhu State Key Laboratory of Computer Science,Institute of Software,Chinese Academy of Sciences,China, Yuqi Guo Institute of Software, Chinese Academy of Sciences, Yan Cai Institute of Software at Chinese Academy of Sciences, Bin Liang Renmin University of China, Long Zhang Institute of Software, Chinese Academy of Sciences, Rui Chen Beijing Institute of Control Engineering; Beijing Sunwise Information Technology, Tingting Yu Beijing Institute of Control Engineering; Beijing Sunwise Information Technology
11:45 15m Talk		SAND: Decoupling Sanitization from Fuzzing for Low Overhead Research Track Ziqiao Kong Nanyang Technological University, Shaohua Li The Chinese University of Hong Kong, Heqing Huang City University of Hong Kong, Zhendong Su ETH Zurich Link to publication Pre-print Media Attached File Attached
12:00 15m Talk		TransferFuzz: Fuzzing with Historical Trace for Verifying Propagated Vulnerability CodeSecurity Research Track Siyuan Li University of Chinese Academy of Sciences & Institute of Information Engineering Chinese Academy of Sciences, China, Yuekang Li UNSW, Zuxin Chen Institute of Information Engineering Chinese Academy of Sciences & University of Chinese Academy of Sciences, China, Chaopeng Dong Institute of Information Engineering Chinese Academy of Sciences & University of Chinese Academy of Sciences, China, Yongpan Wang University of Chinese Academy of Sciences & Institute of Information Engineering Chinese Academy of Sciences, China, Hong Li Institute of Information Engineering at Chinese Academy of Sciences, Yongle Chen Taiyuan University of Technology, China, Hongsong Zhu Institute of Information Engineering at Chinese Academy of Sciences; University of Chinese Academy of Sciences
12:15 15m Talk		Early and Realistic Exploitability Prediction of Just-Disclosed Software Vulnerabilities: How Reliable Can It Be?Security Journal-first Papers Emanuele Iannone Hamburg University of Technology, Giulia Sellitto University of Salerno, Emanuele Iaccarino University of Salerno, Filomena Ferrucci Università di Salerno, Andrea De Lucia University of Salerno, Fabio Palomba University of Salerno Link to publication DOI Authorizer link Pre-print

11:00 - 12:30	AI for Analysis 1Research Track at 212 Chair(s): Denys Poshyvanyk William & Mary

11:00 15m Talk		A Multiple Representation Transformer with Optimized Abstract Syntax Tree for Efficient Code Clone Detection Research Track TianChen Yu School of Software Engineering, South China University of Technology, Li Yuan School of Software Engineering, South China University of Technology, Guangzhou, China, Liannan Lin School of Software Engineering, South China University of Technology, Hongkui He School of Software Engineering, South China University of Technology
11:15 15m Talk		Can an LLM find its way around a Spreadsheet? Research Track Cho-Ting Lee Virginia Tech, Andrew Neeser Virginia Tech, Shengzhe Xu Virginia Tech, Jay Katyan Virginia Tech, Patrick Cross Virginia Tech, Sharanya Pathakota Virginia Tech, Marigold Norman World Forest ID, John C. Simeone Simeone Consulting, LLC, Jaganmohan Chandrasekaran Virginia Tech, Naren Ramakrishnan Virginia Tech
11:30 15m Talk		QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning Research Track Alex Sanchez-Stern University of Massachusetts at Amherst, Abhishek Varghese University of Massachusetts, Zhanna Kaufman University of Massachusetts, Shizhuo Zhang University of Illinois Urbana-Champaign, Talia Lily Ringer University of Illinois Urbana-Champaign, Yuriy Brun University of Massachusetts Link to publication Pre-print
11:45 15m Talk		TIGER: A Generating-Then-Ranking Framework for Practical Python Type Inference Research Track Chong Wang Nanyang Technological University, Jian Zhang Nanyang Technological University, Yiling Lou Fudan University, Mingwei Liu Fudan University, Weisong Sun Nanyang Technological University, Yang Liu Nanyang Technological University, Xin Peng Fudan University
12:00 15m Talk		ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation Research Track Xue Jiang , Yihong Dong Peking University, Yongding Tao University of Electronic Science and Technology of China, Huanyu Liu Xidian University, Zhi Jin Peking University, Ge Li Peking University
12:15 15m Talk		Rango: Adaptive Retrieval-Augmented Proving for Automated Software VerificationAward Winner Research Track Kyle Thompson University of California, San Diego, Nuno Saavedra INESC-ID and IST, University of Lisbon, Pedro Carrott Imperial College London, Kevin Fisher University of California San Diego, Alex Sanchez-Stern University of Massachusetts, Yuriy Brun University of Massachusetts, João F. Ferreira INESC-ID and IST, University of Lisbon, Sorin Lerner University of California at San Diego, Emily First University of California, San Diego Link to publication Pre-print File Attached

11:00 - 12:30	AutonomyResearch Track at 213 Chair(s): Lionel Briand University of Ottawa, Canada; Lero centre, University of Limerick, Ireland

11:00 15m Talk		A Differential Testing Framework to Identify Critical AV Failures Leveraging Arbitrary Inputs Research Track Trey Woodlief University of Virginia, Carl Hildebrandt University of Virginia, Sebastian Elbaum University of Virginia
11:15 15m Talk		Automating a Complete Software Test Process Using LLMs: An Automotive Case Study Research Track Shuai Wang , Yinan Yu Chalmers University of Technology, Robert Feldt Chalmers \| University of Gothenburg, Dhasarathy Parthasarathy Volvo Group Pre-print
11:30 15m Talk		LLM-Agents Driven Automated Simulation Testing and Analysis of small Uncrewed Aerial Systems Research Track Venkata Sai Aswath Duvvuru Saint Louis University, Bohan Zhang Saint Louis University, Missouri, Michael Vierhauser University of Innsbruck, Ankit Agrawal Saint Louis University, Missouri Pre-print Media Attached
11:45 15m Talk		Efficient Domain Augmentation for Autonomous Driving Testing Using Diffusion Models Research Track Luciano Baresi Politecnico di Milano, Davide Yi Xian Hu Politecnico di Milano, Andrea Stocco Technical University of Munich, fortiss, Paolo Tonella USI Lugano Pre-print
12:00 15m Talk		GARL: Genetic Algorithm-Augmented Reinforcement Learning to Detect Violations in Marker-Based Autonomous Landing Systems Research Track Linfeng Liang Macquarie University, Yao Deng Macquarie University, Kye Morton Skyy Network, Valtteri Kallinen Skyy Network, Alice James Macquarie University, Avishkar Seth Macquarie University, Endrowednes Kuantama Macquarie University, Subhas Mukhopadhyay Macquarie University, Richard Han Macquarie University, Xi Zheng Macquarie University
12:15 15m Talk		Decictor: Towards Evaluating the Robustness of Decision-Making in Autonomous Driving Systems Research Track Mingfei Cheng Singapore Management University, Xiaofei Xie Singapore Management University, Yuan Zhou Zhejiang Sci-Tech University, Junjie Wang Tianjin University, Guozhu Meng Institute of Information Engineering, Chinese Academy of Sciences, Kairui Yang DAMO Academy, Alibaba Group, China

11:00 - 12:30	AI for Testing and QA 1Research Track / SE In Practice (SEIP) at 214 Chair(s): Jieshan Chen CSIRO's Data61

11:00 15m Talk		Does GenAI Make Usability Testing Obsolete?Award Winner Research Track Ali Ebrahimi Pourasad , Walid Maalej University of Hamburg Pre-print
11:15 15m Talk		Feature-Driven End-To-End Test Generation Research Track Parsa Alian University of British Columbia, Noor Nashid University of British Columbia, Mobina Shahbandeh University of British Columbia, Taha Shabani University of British Columbia, Ali Mesbah University of British Columbia
11:30 15m Talk		SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI AutomationAward Winner Research Track Dehai Zhao CSIRO's Data61, Zhenchang Xing CSIRO's Data61, Qinghua Lu Data61, CSIRO, Xiwei (Sherry) Xu Data61, CSIRO, Liming Zhu CSIRO’s Data61
11:45 15m Talk		Synthesizing Document Database Queries using Collection Abstractions Research Track Qikang Liu Simon Fraser University, Yang He Simon Fraser University, Yanwen Cai Simon Fraser University, Byeongguk Kwak Simon Fraser University, Yuepeng Wang Simon Fraser University
12:00 15m Talk		The Power of Types: Exploring the Impact of Type Checking on Neural Bug Detection in Dynamically Typed Languages Research Track Boqi Chen McGill University, José Antonio Hernández López Linköping University, Gunter Mussbacher McGill University, Daniel Varro Linköping University / McGill University
12:15 15m Talk		DialogAgent: An Auto-engagement Agent for Code Question Answering Data Production SE In Practice (SEIP) Xiaoyun Liang ByteDance, Jingyi Ren ByteDance, Jiayi Qi ByteDance, Chao Peng ByteDance, Bo Jiang Bytedance Network Technology

11:00 - 12:30	SE for AI 1New Ideas and Emerging Results (NIER) / SE In Practice (SEIP) / Research Track at 215 Chair(s): Houari Sahraoui DIRO, Université de Montréal

11:00 15m Talk		A Test Oracle for Reinforcement Learning Software based on Lyapunov Stability Control TheorySE for AIAward Winner Research Track Shiyu Zhang The Hong Kong Polytechnic University, Haoyang Song The Hong Kong Polytechnic University, Qixin Wang The Hong Kong Polytechnic University, Henghua Shen The Hong Kong Polytechnic University, Yu Pei The Hong Kong Polytechnic University
11:15 15m Talk		CodeImprove: Program Adaptation for Deep Code ModelsSE for AI Research Track Ravishka Shemal Rathnasuriya University of Texas at Dallas, zijie zhao , Wei Yang UT Dallas
11:30 15m Talk		FairQuant: Certifying and Quantifying Fairness of Deep Neural NetworksSE for AI Research Track Brian Hyeongseok Kim University of Southern California, Jingbo Wang University of Southern California, Chao Wang University of Southern California Pre-print
11:45 15m Talk		When in Doubt Throw It out: Building on Confident Learning for Vulnerability DetectionSecuritySE for AI New Ideas and Emerging Results (NIER) Yuanjun Gong Renmin University of China, Fabio Massacci University of Trento; Vrije Universiteit Amsterdam Pre-print File Attached
12:00 15m Talk		Evaluation of Tools and Frameworks for Machine Learning Model ServingSE for AI SE In Practice (SEIP) Niklas Beck Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Benny Stein Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Dennis Wegener T-Systems International GmbH, Lennard Helmer Fraunhofer Institute for Intelligent Analysis and Information Systems
12:15 15m Talk		Real-time Adapting Routing (RAR): Improving Efficiency Through Continuous Learning in Software Powered by Layered Foundation ModelsSE for AI SE In Practice (SEIP) Kirill Vasilevski Huawei Canada, Dayi Lin Centre for Software Excellence, Huawei Canada, Ahmed E. Hassan Queen’s University Pre-print File Attached

11:00 - 12:30	AI for SE 1Research Track at Canada Hall 1 and 2 Chair(s): Tao Chen University of Birmingham

11:00 15m Talk		Calibration and Correctness of Language Models for Code Research Track Claudio Spiess University of California, Davis, David Gros University of California, Davis, Kunal Suresh Pai UC Davis, Michael Pradel University of Stuttgart, Rafiqul Rabin UL Research Institutes, Amin Alipour University of Houston, Susmit Jha SRI, Prem Devanbu University of California at Davis, Toufique Ahmed IBM Research Pre-print
11:15 15m Talk		An Empirical Study on Commit Message Generation using LLMs via In-Context Learning Research Track Yifan Wu Peking University, Yunpeng Wang Ant Group, Ying Li School of Software and Microelectronics, Peking University, Beijing, China, Wei Tao Independent Researcher, Siyu Yu The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Haowen Yang The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Wei Jiang , Jianguo Li Ant Group Pre-print
11:30 15m Talk		Instruct or Interact? Exploring and Eliciting LLMs’ Capability in Code Snippet Adaptation Through Prompt Engineering Research Track Tanghaoran Zhang National University of Defense Technology, Yue Yu PengCheng Lab, Xinjun Mao National University of Defense Technology, Shangwen Wang National University of Defense Technology, Kang Yang National University of Defense Technology, Yao Lu National University of Defense Technology, Zhang Zhang Key Laboratory of Software Engineering for Complex Systems, National University of Defense Technology, Yuxin Zhao Key Laboratory of Software Engineering for Complex Systems, National University of Defense Technology
11:45 15m Talk		Search-Based LLMs for Code OptimizationAward Winner Research Track Shuzheng Gao The Chinese University of Hong Kong, Cuiyun Gao Harbin Institute of Technology, Wenchao Gu The Chinese University of Hong Kong, Michael Lyu The Chinese University of Hong Kong
12:00 15m Talk		Towards Better Answers: Automated Stack Overflow Post Updating Research Track Yubo Mai Zhejiang University, Zhipeng Gao Shanghai Institute for Advanced Study - Zhejiang University, Haoye Wang Hangzhou City University, Tingting Bi The University of Melbourne, Xing Hu Zhejiang University, Xin Xia Huawei, JianLing Sun Zhejiang University
12:15 15m Talk		Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the FamiliarAward Winner Research Track Yuanliang Zhang National University of Defense Technology, Yifan Xie , Shanshan Li National University of Defense Technology, Ke Liu , Chong Wang National University of Defense Technology, Zhouyang Jia National University of Defense Technology, Xiangbing Huang National University of Defense Technology, Jie Song National University of Defense Technology, Chaopeng Luo National University of Defense Technology, Zhizheng Zheng National University of Defense Technology, Rulin Xu National University of Defense Technology, Yitong Liu National University of Defense Technology, Si Zheng National University of Defense Technology, Liao Xiangke National University of Defense Technology

12:30 - 14:00	LunchCatering at Canada Hall 3 plus Foyer

12:30 90m Lunch		Wednesday Lunch Catering

13:30 - 14:00	Wed Lunch Posters 13:30-14:00Research Track / Journal-first Papers / New Ideas and Emerging Results (NIER) / Posters at Canada Hall 3 Poster Area

13:30 30m Poster		Pattern-based Generation and Adaptation of Quantum WorkflowsQuantum Research Track Martin Beisel Institute of Architecture of Application Systems (IAAS), University of Stuttgart, Johanna Barzen University of Stuttgart, Frank Leymann University of Stuttgart, Lavinia Stiliadou Institute of Architecture of Application Systems (IAAS), University of Stuttgart, Daniel Vietz University of Stuttgart, Benjamin Weder Institute of Architecture of Application Systems (IAAS), University of Stuttgart
13:30 30m Talk		Mole: Efficient Crash Reproduction in Android Applications With Enforcing Necessary UI Events Journal-first Papers Maryam Masoudian Sharif University of Technology, Hong Kong University of Science and Technology (HKUST), Heqing Huang City University of Hong Kong, Morteza Amini Sharif University of Technology, Charles Zhang Hong Kong University of Science and Technology
13:30 30m Talk		Automated Testing Linguistic Capabilities of NLP Models Journal-first Papers Jaeseong Lee The University of Texas at Dallas, Simin Chen University of Texas at Dallas, Austin Mordahl University of Illinois Chicago, Cong Liu University of California, Riverside, Wei Yang UT Dallas, Shiyi Wei University of Texas at Dallas
13:30 30m Poster		BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries Research Track Wen Zhang University of Georgia, Botang Xiao University of Georgia, Qingchen Kong University of Georgia, Le Guan University of Georgia, Wenwen Wang University of Georgia
13:30 30m Talk		A Unit Proofing Framework for Code-level Verification: A Research AgendaFormal Methods New Ideas and Emerging Results (NIER) Paschal Amusuo Purdue University, Parth Vinod Patil Purdue University, Owen Cochell Michigan State University, Taylor Le Lievre Purdue University, James C. Davis Purdue University Pre-print
13:30 30m Talk		Listening to the Firehose: Sonifying Z3’s BehaviorFormal Methods New Ideas and Emerging Results (NIER) Finn Hackett University of British Columbia, Ivan Beschastnikh University of British Columbia
13:30 30m Talk		Towards Early Warning and Migration of High-Risk Dormant Open-Source Software DependenciesSecurity New Ideas and Emerging Results (NIER) Zijie Huang Shanghai Key Laboratory of Computer Software Testing and Evaluation, Lizhi Cai Shanghai Key Laboratory of Computer Software Testing & Evaluating, Shanghai Software Center, Xuan Mao Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China, Kang Yang Shanghai Key Laboratory of Computer Software Testing and Evaluating, Shanghai Development Center of Computer Software Technology
13:30 30m Poster		SimClone: Detecting Tabular Data Clones using Value Similarity Journal-first Papers Xu Yang University of Manitoba, Gopi Krishnan Rajbahadur Centre for Software Excellence, Huawei, Canada, Dayi Lin Centre for Software Excellence, Huawei Canada, Shaowei Wang University of Manitoba, Zhen Ming (Jack) Jiang York University
13:30 30m Talk		SolSearch: An LLM-Driven Framework for Efficient SAT-Solving Code GenerationFormal Methods New Ideas and Emerging Results (NIER) Junjie Sheng East China Normal University, Yanqiu Lin East China Normal University, Jiehao Wu East China Normal University, Yanhong Huang East China Normal University, Jianqi Shi East China Normal University, Min Zhang East China Normal University, Xiangfeng Wang East China Normal University

15:30 - 16:00	Wed Afternoon Break Posters 15:30-16:00Journal-first Papers / SE In Practice (SEIP) / Research Track / Posters at Canada Hall 3 Poster Area

15:30 30m Poster		Non-Autoregressive Line-Level Code Completion Journal-first Papers Fang Liu Beihang University, Zhiyi Fu Peking University, Ge Li Peking University, Zhi Jin Peking University, Hui Liu Beijing Institute of Technology, Yiyang Hao Silicon Heart Tech Co., Li Zhang Beihang University
15:30 30m Poster		FlatD: Protecting Deep Neural Network Program from Reversing Attacks SE In Practice (SEIP) Jinquan Zhang The Pennsylvania State University, Zihao Wang Penn State University, Pei Wang Independent Researcher, Rui Zhong Palo Alto Networks, Dinghao Wu Pennsylvania State University
15:30 30m Talk		Building Domain-Specific Machine Learning Workflows: A Conceptual Framework for the State-of-the-PracticeSE for AI Journal-first Papers Bentley Oakes Polytechnique Montréal, Michalis Famelis Université de Montréal, Houari Sahraoui DIRO, Université de Montréal DOI Pre-print File Attached
15:30 30m Poster		Predicting the First Response Latency of Maintainers and Contributors in Pull Requests Journal-first Papers SayedHassan Khatoonabadi Concordia University, Montreal, Ahmad Abdellatif University of Calgary, Diego Elias Costa Concordia University, Canada, Emad Shihab Concordia University, Montreal
15:30 30m Talk		LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation Journal-first Papers Sarah Fakhoury Microsoft Research, Aaditya Naik University of Pennsylvania, Georgios Sakkas University of California at San Diego, Saikat Chakraborty Microsoft Research, Shuvendu K. Lahiri Microsoft Research Link to publication
15:30 30m Poster		RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code Research Track Pantazis Deligiannis Microsoft Research, Akash Lal Microsoft Research, Nikita Mehrotra Microsoft Research, Rishi Poddar Microsoft Research, Aseem Rastogi Microsoft Research
15:30 30m Talk		QuanTest: Entanglement-Guided Testing of Quantum Neural Network SystemsQuantum Journal-first Papers Jinjing Shi Central South University, Zimeng Xiao Central South University, Heyuan Shi Central South University, Yu Jiang Tsinghua University, Xuelong LI China Telecom Link to publication

15:30 - 16:00	BreakCatering at Canada Hall 3 plus Foyer

15:30 30m Break		Wednesday Afternoon Break Catering

16:00 - 17:30	Formal Methods 2Research Track / New Ideas and Emerging Results (NIER) / Journal-first Papers at 103 Chair(s): Yi Li Nanyang Technological University

16:00 15m Talk		ConsCS: Effective and Efficient Verification of Circom CircuitsFormal Methods Research Track Jinan Jiang The Hong Kong Polytechnic University, Xinghao Peng , Jinzhao Chu The Hong Kong Polytechnic University, Xiapu Luo Hong Kong Polytechnic University
16:15 15m Talk		Constrained LTL Specification Learning from ExamplesFormal Methods Research Track Changjian Zhang Carnegie Mellon University, Parv Kapoor Carnegie Mellon University, Ian Dardik Carnegie Mellon University, Leyi Cui Columbia University, Romulo Meira-Goes The Pennsylvania State University, David Garlan Carnegie Mellon University, Eunsuk Kang Carnegie Mellon University
16:30 15m Talk		LLM-aided Automatic Modeling for Security Protocol VerificationSecurityFormal Methods Research Track Ziyu Mao Zhejiang University, Jingyi Wang Zhejiang University, Jun Sun Singapore Management University, Shengchao Qin Xidian University, Jiawen Xiong East China Normal University
16:45 15m Talk		Model Assisted Refinement of Metamorphic Relations for Scientific SoftwareFormal Methods New Ideas and Emerging Results (NIER) Clay Stevens Iowa State University, Katherine Kjeer Iowa State University, Ryan Richard Iowa State University, Edward Valeev Virginia Tech, Myra Cohen Iowa State University
17:00 15m Talk		Precisely Extracting Complex Variable Values from Android AppsFormal Methods Journal-first Papers Marc Miltenberger Fraunhofer SIT; ATHENE, Steven Arzt Fraunhofer SIT; ATHENE
17:15 7m Talk		A Unit Proofing Framework for Code-level Verification: A Research AgendaFormal Methods New Ideas and Emerging Results (NIER) Paschal Amusuo Purdue University, Parth Vinod Patil Purdue University, Owen Cochell Michigan State University, Taylor Le Lievre Purdue University, James C. Davis Purdue University Pre-print
17:22 7m Talk		Automated Testing Linguistic Capabilities of NLP Models Journal-first Papers Jaeseong Lee The University of Texas at Dallas, Simin Chen University of Texas at Dallas, Austin Mordahl University of Illinois Chicago, Cong Liu University of California, Riverside, Wei Yang UT Dallas, Shiyi Wei University of Texas at Dallas

16:00 - 17:30	Databases and BusinessResearch Track / SE In Practice (SEIP) / Demonstrations / Journal-first Papers at 104 Chair(s): Lu Xiao Stevens Institute of Technology

16:00 15m Talk		Optimization of Automated and Manual Software Tests in Industrial Practice: A Survey and Historical Analysis Journal-first Papers Roman Haas Saarland University; CQSE, Raphael Nömmer Saarbr�cken Graduate School of Computer Science, CQSE, Elmar Juergens CQSE GmbH, Sven Apel Saarland University Link to publication Pre-print
16:15 15m Talk		A-COBREX : A Tool for Identifying Business Rules in COBOL Programs Demonstrations Samveg Shah Indian Institute of Technology, Tirupati, Shivali Agarwal IBM, Saravanan Krishnan IBM India Research Lab, Vini Kanvar IBM Research, Sridhar Chimalakonda Indian Institute of Technology Tirupati
16:30 15m Talk		Thanos: DBMS Bug Detection via Storage Engine Rotation Based Differential TestingAward Winner Research Track Ying Fu National University of Defense Technology, Zhiyong Wu Tsinghua University, China, Yuanliang Zhang National University of Defense Technology, Jie Liang , Jingzhou Fu School of Software, Tsinghua University, Yu Jiang Tsinghua University, Shanshan Li National University of Defense Technology, Liao Xiangke National University of Defense Technology
16:45 15m Talk		Coni: Detecting Database Connector Bugs via State-Aware Test Case Generation Research Track Wenqian Deng Tsinghua University, Zhiyong Wu Tsinghua University, China, Jie Liang , Jingzhou Fu School of Software, Tsinghua University, Mingzhe Wang Tsinghua University, Yu Jiang Tsinghua University
17:00 15m Talk		Puppy: Finding Performance Degradation Bugs in DBMSs via Limited-Optimization Plan Construction Research Track Zhiyong Wu Tsinghua University, China, Jie Liang , Jingzhou Fu School of Software, Tsinghua University, Mingzhe Wang Tsinghua University, Yu Jiang Tsinghua University
17:15 15m Talk		Safe Validation of Pricing Agreements SE In Practice (SEIP) John C. Kolesar Yale University, Tancrède Lepoint Amazon, Martin Schäf Amazon Web Services, Willem Visser Amazon Web Services

16:00 - 17:30	Program Comprehension 2Journal-first Papers / Research Track at 204 Chair(s): Xiaoxue Ren Zhejiang University

16:00 15m Talk		Enhancing Fault Localization in Industrial Software Systems via Contrastive Learning Research Track Chun Li Nanjing University, Hui Li Samsung Electronics (China) R&D Centre, Zhong Li , Minxue Pan Nanjing University, Xuandong Li Nanjing University
16:15 15m Talk		On the Understandability of MLOps System Architectures Journal-first Papers Stephen John Warnett University of Vienna, Uwe Zdun University of Vienna Link to publication DOI
16:30 15m Talk		Bridging the Language Gap: An Empirical Study of Bindings for Open Source Machine Learning Libraries Across Software Package Ecosystems Journal-first Papers Hao Li Queen's University, Cor-Paul Bezemer University of Alberta Link to publication DOI Pre-print
16:45 15m Talk		Understanding Code Understandability Improvements in Code Reviews Journal-first Papers Delano Hélio Oliveira , Reydne Bruno dos Santos UFPE, Benedito Fernando Albuquerque de Oliveira Federal University of Pernambuco, Martin Monperrus KTH Royal Institute of Technology, Fernando Castor University of Twente, Fernanda Madeiral Universidade Federal de Pernambuco
17:00 15m Talk		Automatic Commit Message Generation: A Critical Review and Directions for Future Work Journal-first Papers Yuxia Zhang Beijing Institute of Technology, Zhiqing Qiu Beijing Institute of Technology, Klaas-Jan Stol Lero; University College Cork; SINTEF Digital , Wenhui Zhu Beijing Institute of Technology, Jiaxin Zhu Institute of Software at Chinese Academy of Sciences, Yingchen Tian Tmall Technology Co., Hui Liu Beijing Institute of Technology
17:15 7m Talk		Efficient Management of Containers for Software Defined Vehicles Journal-first Papers Anwar Ghammam Oakland University, Rania Khalsi University of Michigan - Flint, Marouane Kessentini University of Michigan - Flint, Foyzul Hassan University of Michigan at Dearborn

16:00 - 17:30	Testing and QA 2Journal-first Papers / Research Track at 205 Chair(s): Andreas Zeller CISPA Helmholtz Center for Information Security

16:00 15m Talk		EpiTESTER: Testing Autonomous Vehicles with Epigenetic Algorithm and Attention Mechanism Journal-first Papers Chengjie Lu Simula Research Laboratory and University of Oslo, Shaukat Ali Simula Research Laboratory and Oslo Metropolitan University, Tao Yue Beihang University
16:15 15m Talk		GenMorph: Automatically Generating Metamorphic Relations via Genetic Programming Journal-first Papers Jon Ayerdi Mondragon University, Valerio Terragni University of Auckland, Gunel Jahangirova King's College London, Aitor Arrieta Mondragon University, Paolo Tonella USI Lugano
16:30 15m Talk		Guess the State: Exploiting Determinism to Improve GUI Exploration Efficiency Journal-first Papers Diego Clerissi University of Milano-Bicocca, Giovanni Denaro University of Milano - Bicocca, Marco Mobilio University of Milano Bicocca, Leonardo Mariani University of Milano-Bicocca
16:45 15m Talk		Runtime Verification and Field-based Testing for ROS-based Robotic Systems Journal-first Papers Ricardo Caldas Gran Sasso Science Institute (GSSI), Juan Antonio Piñera García Gran Sasso Science Institute, Matei Schiopu Chalmers \| Gothenburg University, Patrizio Pelliccione Gran Sasso Science Institute, L'Aquila, Italy, Genaína Nunes Rodrigues University of Brasília, Thorsten Berger Ruhr University Bochum Link to publication DOI
17:00 15m Talk		Towards Effectively Testing Machine Translation Systems from White-Box Perspectives Journal-first Papers Hanying Shao University of Waterloo, Zishuo Ding The Hong Kong University of Science and Technology (Guangzhou), Weiyi Shang University of Waterloo, Jinqiu Yang Concordia University, Nikolaos Tsantalis Concordia University
17:15 15m Talk		Using Knowledge Units of Programming Languages to Recommend Reviewers for Pull Requests: An Empirical Study Journal-first Papers Md Ahasanuzzaman Queen's University, Gustavo A. Oliva Queen's University, Ahmed E. Hassan Queen’s University, Md Ahasanuzzaman Queen's University

16:00 - 17:45	Human and Social 1SE in Society (SEIS) / SE In Practice (SEIP) / Research Track at 206 plus 208 Chair(s): Yvonne Dittrich IT University of Copenhagen, Denmark

16:00 15m Talk		Systematizing Inclusive Design in MOSIP: An Experience Report SE In Practice (SEIP) Soumiki Chattopadhyay Oregon State University, Amreeta Chatterjee Oregon State University, Puja Agarwal Oregon State University, Bianca Trinkenreich Colorado State University, Swarathmika Kumar MOSIP-IIIT Bangalore, Rohit Ranjan Rai MOSIP-IIIT Bangalore, Resham Chugani MOSIP-IIIT Bangalore, Pragya Kumari MOSIP-IIIT Bangalore, Margaret Burnett Oregon State University, Anita Sarma Oregon State University Pre-print
16:15 15m Talk		A Collaborative Framework for Cross-Domain Scientific Experiments for Society 5.0Research Methods SE in Society (SEIS) Muhammad Mainul Hossain University of Saskatchewan, Banani Roy University of Saskatchewan, Chanchal K. Roy University of Saskatchewan, Kevin Schneider University of Saskatchewan
16:30 15m Talk		A First Look at AI Trends in Value-Aligned Software Engineering Publications: Human-LLM Insights SE in Society (SEIS) Ahmad Azarnik Universiti Teknologi Malaysia, Davoud Mougouei , Mahdi Fahmideh University of Southern Queensland, Elahe Mougouei Islamic Azad University Najafabad, Hoa Khanh Dam University of Wollongong, Arif Ali Khan University of Oulu, Saima Rafi Edinburgh Napier University, Javed Ali Khan University of Hertforshire Hertfordshire, UK, Aakash Ahmad School of Computing and Communications, Lancaster University Leipzig, Leipzig, Germany Link to publication
16:45 15m Talk		From Expectation to Habit: Why Do Software Practitioners Adopt Fairness Toolkits? SE in Society (SEIS) Gianmario Voria University of Salerno, Stefano Lambiase Aalborg University in Copenhagen, Maria Concetta Schiavone University of Salerno, Gemma Catolino University of Salerno, Fabio Palomba University of Salerno Pre-print
17:00 15m Talk		Not real or too soft? On the challenges of publishing interdisciplinary software engineering research SE in Society (SEIS) Sonja Hyrynsalmi LUT University, Grischa Liebel Reykjavik University, Ronnie de Souza Santos University of Calgary, Sebastian Baltes University of Bayreuth Pre-print
17:15 15m Talk		What is unethical about software? User perceptions in the Netherlands SE in Society (SEIS) Yagil Elias Vrije Universiteit Amsterdam, Tom P Humbert Vrije Universiteit Amsterdam, Lauren Olson Vrije Universiteit Amsterdam, Emitzá Guzmán Vrije Universiteit Amsterdam Pre-print

16:00 - 17:30	Human and Social Process 2Journal-first Papers / Research Track at 207 Chair(s): Armstrong Foundjem École Polytechnique de Montréal

16:00 15m Talk		An Empirical Study on Developers' Shared Conversations with ChatGPT in GitHub Pull Requests and Issues Journal-first Papers Huizi Hao Queen's University, Canada, Kazi Amit Hasan Queen's University, Canada, Hong Qin Queen's University, Marcos Macedo Queen's University, Yuan Tian Queen's University, Kingston, Ontario, Ding Steven, H., H. Queen’s University at Kingston, Ahmed E. Hassan Queen’s University
16:15 15m Talk		Who’s Pushing the Code: An Exploration of GitHub Impersonation Research Track Yueke Zhang Vanderbilt University, Anda Liang Vanderbilt University, Xiaohan Wang Vanderbilt University, Pamela J. Wisniewski Vanderbilt University, Fengwei Zhang Southern University of Science and Technology, Kevin Leach Vanderbilt University, Yu Huang Vanderbilt University
16:30 15m Talk		Understanding Real-time Collaborative Programming: a Study of Visual Studio Live Share Journal-first Papers Xin Tan Beihang University, Xinyue Lv Beihang University, Jing Jiang Beihang University, Li Zhang Beihang University
16:45 15m Talk		Characterizing the Prevalence, Distribution, and Duration of Stale Reviewer Recommendations Journal-first Papers Farshad Kazemi University of Waterloo, Maxime Lamothe Polytechnique Montreal, Shane McIntosh University of Waterloo
17:00 15m Talk		Diversity's Double-Edged Sword: Analyzing Race's Effect on Remote Pair Programming Interactions Journal-first Papers Shandler Mason North Carolina State University, Sandeep Kuttal North Carolina State University
17:15 7m Talk		Investigating the Impact of Interpersonal Challenges on Feeling Welcome in OSS Research Track Bianca Trinkenreich Colorado State University, Zixuan Feng Oregon State University, USA, Rudrajit Choudhuri Oregon State University, Marco Gerosa Northern Arizona University, Anita Sarma Oregon State University, Igor Steinmacher NAU RESHAPE LAB Pre-print

16:00 - 17:30	SE for AI with SecurityResearch Track at 210 Chair(s): Lina Marsso École Polytechnique de Montréal

16:00 15m Talk		Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak AttacksSecuritySE for AI Research Track shide zhou Huazhong University of Science and Technology, Li Tianlin NTU, Kailong Wang Huazhong University of Science and Technology, Yihao Huang NTU, Ling Shi Nanyang Technological University, Yang Liu Nanyang Technological University, Haoyu Wang Huazhong University of Science and Technology
16:15 15m Talk		Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning SoftwareSecuritySE for AI Research Track Zhenpeng Chen Nanyang Technological University, Xinyue Li Peking University, Jie M. Zhang King's College London, Federica Sarro University College London, Yang Liu Nanyang Technological University Pre-print
16:30 15m Talk		HIFI: Explaining and Mitigating Algorithmic Bias through the Lens of Game-Theoretic InteractionsSecuritySE for AI Research Track Lingfeng Zhang East China Normal University, Zhaohui Wang Software Engineering Institute, East China Normal University, Yueling Zhang East China Normal University, Min Zhang East China Normal University, Jiangtao Wang Software Engineering Institute, East China Normal University
16:45 15m Talk		Towards More Trustworthy Deep Code Models by Enabling Out-of-Distribution DetectionSecuritySE for AI Research Track Yanfu Yan William & Mary, Viet Duong William & Mary, Huajie Shao College of William & Mary, Denys Poshyvanyk William & Mary
17:00 15m Talk		FairSense: Long-Term Fairness Analysis of ML-Enabled SystemsSecuritySE for AI Research Track Yining She Carnegie Mellon University, Sumon Biswas Carnegie Mellon University, Christian Kästner Carnegie Mellon University, Eunsuk Kang Carnegie Mellon University

16:00 - 17:30	RequirementsResearch Track / Demonstrations / New Ideas and Emerging Results (NIER) at 211 Chair(s): Jane Cleland-Huang University of Notre Dame

16:00 15m Talk		A Little Goes a Long Way: Tuning Configuration Selection for Continuous Kernel Fuzzing Research Track Sanan Hasanov University of Central Florida, Stefan Nagy University of Utah, Paul Gazzillo University of Central Florida
16:15 15m Talk		Exploring the Robustness of the Effect of EVO on Intention Valuation through ReplicationAward Winner Research Track Yesugen Baatartogtokh University of Massachusetts Amherst, Kaitlyn Cook Smith College, Alicia M. Grubb Smith College
16:30 15m Talk		Unavoidable Boundary Conditions: A Control Perspective on Goal Conflicts Research Track Sebastian Uchitel Universidad de Buenos Aires / Imperial College, Francisco Cirelli Universidad de Buenos Aires, Dalal Alrajeh Imperial College London
16:45 15m Talk		User Personas Improve Social Sustainability by Encouraging Software Developers to Deprioritize Antisocial Features Research Track Bimpe Ayoola Dalhousie University, Miikka Kuutila Dalhousie University, Rina R. Wehbe Dalhousie University, Paul Ralph Dalhousie University Pre-print
17:00 15m Talk		VReqST: A Requirement Specification Tool for Virtual Reality Software Products Demonstrations Amogha A Halhalli Software Engineering Research Center. IIIT Hyderabad, Raghu Reddy IIIT Hyderabad, Karre Sai Anirudh Phenom Inc.
17:15 15m Talk		What is a Feature, Really? Toward a Unified Understanding Across SE Disciplines New Ideas and Emerging Results (NIER) Nitish Patkar FHNW, Aimen Fahmi Tata Consultancy Services, Timo Kehrer University of Bern, Norbert Seyff University of Applied Sciences and Arts Northwestern Switzerland FHNW

16:00 - 17:30	AI for Analysis 2Research Track / Journal-first Papers at 212 Chair(s): Julia Rubin The University of British Columbia

16:00 15m Talk		Neurosymbolic Modular Refinement Type Inference Research Track Georgios Sakkas UC San Diego, Pratyush Sahu UC San Diego, Kyeling Ong University of California, San Diego, Ranjit Jhala University of California at San Diego
16:15 15m Talk		An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We? Research Track Hyunjae Suh University of California, Irvine, Mahan Tafreshipour University of California at Irvine, Jiawei Li University of California Irvine, Adithya Bhattiprolu University of California, Irvine, Iftekhar Ahmed University of California at Irvine
16:30 15m Talk		Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets Research Track Smit Soneshbhai Patel University of Texas at Dallas, Aashish Yadavally University of Texas at Dallas, Hridya Dhulipala University of Texas at Dallas, Tien N. Nguyen University of Texas at Dallas
16:45 15m Talk		LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion Research Track Chong Wang Nanyang Technological University, Kaifeng Huang Tongji University, Jian Zhang Nanyang Technological University, Yebo Feng Nanyang Technological University, Lyuye Zhang Nanyang Technological University, Yang Liu Nanyang Technological University, Xin Peng Fudan University
17:00 15m Talk		Knowledge-Enhanced Program Repair for Data Science Code Research Track Shuyin Ouyang King's College London, Jie M. Zhang King's College London, Zeyu Sun Institute of Software, Chinese Academy of Sciences, Albert Merono Penuela King's College London
17:15 7m Talk		SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning Journal-first Papers Xueqi Yang North Carolina State University, Mariusz Jakubowski Microsoft, Li Kang Microsoft, Haojie Yu Microsoft, Tim Menzies North Carolina State University Link to publication DOI

16:00 - 17:30	AI for Program Comprehension 1Research Track at 213 Chair(s): Yintong Huo Singapore Management University, Singapore

16:00 15m Talk		ADAMAS: Adaptive Domain-Aware Performance Anomaly Detection in Cloud Service Systems Research Track Wenwei Gu The Chinese University of Hong Kong, Jiazhen Gu Chinese University of Hong Kong, Jinyang Liu Chinese University of Hong Kong, Zhuangbin Chen Sun Yat-sen University, Jianping Zhang The Chinese University of Hong Kong, Jinxi Kuang The Chinese University of Hong Kong, Cong Feng Huawei Cloud Computing Technology, Yongqiang Yang Huawei Cloud Computing Technology, Michael Lyu The Chinese University of Hong Kong
16:15 15m Talk		LibreLog: Accurate and Efficient Unsupervised Log Parsing Using Open-Source Large Language Models Research Track Zeyang Ma Concordia University, Dong Jae Kim DePaul University, Tse-Hsun (Peter) Chen Concordia University
16:30 15m Talk		Model Editing for LLMs4Code: How Far are We? Research Track Xiaopeng Li National University of Defense Technology, Shangwen Wang National University of Defense Technology, Shasha Li National University of Defense Technology, Jun Ma National University of Defense Technology, Jie Yu National University of Defense Technology, Xiaodong Liu National University of Defense Technology, Jing Wang National University of Defense Technology, Bin Ji National University of Defense Technology, Weimin Zhang National University of Defense Technology Pre-print
16:45 15m Talk		Software Model Evolution with Large Language Models: Experiments on Simulated, Public, and Industrial Datasets Research Track Christof Tinnes Saarland University, Alisa Carla Welter Saarland University, Sven Apel Saarland University Pre-print
17:00 15m Talk		SpecRover: Code Intent Extraction via LLMs Research Track Haifeng Ruan National University of Singapore, Yuntong Zhang National University of Singapore, Abhik Roychoudhury National University of Singapore
17:15 15m Talk		Unleashing the True Potential of Semantic-based Log Parsing with Pre-trained Language Models Research Track Van-Hoang Le The University of Newcastle, Yi Xiao , Hongyu Zhang Chongqing University

16:00 - 17:30	AI for Testing and QA 2Research Track / SE In Practice (SEIP) at 214 Chair(s): Michael Pradel University of Stuttgart

16:00 15m Talk		Faster Configuration Performance Bug Testing with Neural Dual-level Prioritization Research Track Youpeng Ma University of Electronic Science and Technology of China, Tao Chen University of Birmingham, Ke Li University of Exeter Pre-print
16:15 15m Talk		Metamorphic-Based Many-Objective Distillation of LLMs for Code-related Tasks Research Track Annibale Panichella Delft University of Technology
16:30 15m Talk		NIODebugger: A Novel Approach to Repair Non-Idempotent-Outcome Tests with LLM-Based Agent Research Track Kaiyao Ke University of Illinois at Urbana-Champaign
16:45 15m Talk		Test Intention Guided LLM-based Unit Test Generation Research Track Zifan Nan Huawei, Zhaoqiang Guo Software Engineering Application Technology Lab, Huawei, China, Kui Liu Huawei, Xin Xia Huawei
17:00 15m Talk		What You See Is What You Get: Attention-based Self-guided Automatic Unit Test Generation Research Track Xin Yin Zhejiang University, Chao Ni Zhejiang University, xiaodanxu College of Computer Science and Technology, Zhejiang university, Xiaohu Yang Zhejiang University Pre-print
17:15 15m Talk		Improving Code Performance Using LLMs in Zero-Shot: RAPGen SE In Practice (SEIP) Spandan Garg Microsoft Corporation, Roshanak Zilouchian Moghaddam Microsoft, Neel Sundaresan Microsoft

16:00 - 17:30	Analysis 1Research Track / SE In Practice (SEIP) / Journal-first Papers at 215 Chair(s): Antonio Filieri AWS and Imperial College London

16:00 15m Talk		SUPERSONIC: Learning to Generate Source Code Optimizations in C/C++ Journal-first Papers Zimin Chen KTH Royal Institute of Technology, Sen Fang North Carolina State University, Martin Monperrus KTH Royal Institute of Technology
16:15 15m Talk		An Extensive Empirical Study of Nondeterministic Behavior in Static Analysis Tools Research Track Miao Miao The University of Texas at Dallas, Austin Mordahl University of Illinois Chicago, Dakota Soles The University of Texas at Dallas, Alice Beideck The University of Texas at Dallas, Shiyi Wei University of Texas at Dallas
16:30 15m Talk		Interactive Cross-Language Pointer Analysis for Resolving Native Code in Java ProgramsAward Winner Research Track Chenxi Zhang Nanjing University, Yufei Liang Nanjing University, Tian Tan Nanjing University, Chang Xu Nanjing University, Shuangxiang Kan UNSW, Yulei Sui University of New South Wales, Yue Li Nanjing University
16:45 15m Talk		Execution Trace Reconstruction Using Diffusion-Based Generative Models Research Track Madeline Janecek Brock University, Naser Ezzati Jivan , Wahab Hamou-Lhadj Concordia University, Montreal, Canada
17:00 15m Talk		Static Analysis of Remote Procedure Call in Java Programs Research Track Baoquan Cui Institute of Software at Chinese Academy of Sciences, China, RongQu State Key Laboratory of Computer Science, Institute of Software Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China, Zhen Tang Key Laboratory of System Software (Chinese Academy of Sciences), State Key Laboratory of Computer Science, Institute of Software Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China, Jian Zhang Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences
17:15 15m Talk		ArkAnalyzer: The Static Analysis Framework for OpenHarmony SE In Practice (SEIP) chenhaonan Beihang University, Daihang Chen Beihang University, Yizhuo Yang Beihang University, Lingyun Xu Huawei, Liang Gao Huawei, Mingyi Zhou Monash University, Chunming Hu Beihang University, Li Li Beihang University

16:00 - 17:30	AI for SE 2Research Track / Journal-first Papers at Canada Hall 1 and 2 Chair(s): Tingting Yu University of Connecticut

16:00 15m Talk		Large Language Models for Safe Minimization Research Track Aashish Yadavally University of Texas at Dallas, Xiaokai Rong The University of Texas at Dallas, Phat Nguyen The University of Texas at Dallas, Tien N. Nguyen University of Texas at Dallas
16:15 15m Talk		LUNA: A Model-Based Universal Analysis Framework for Large Language Models Journal-first Papers Da Song University of Alberta, Xuan Xie University of Alberta, Norman Song , Derui Zhu Technical University of Munich, Yuheng Huang University of Alberta, Canada, Felix Juefei-Xu New York University, Lei Ma The University of Tokyo & University of Alberta, Yuheng Huang University of Alberta, Canada
16:30 15m Talk		Intention is All You Need: Refining Your Code from Your Intention Research Track Qi Guo Tianjin University, Xiaofei Xie Singapore Management University, Shangqing Liu Nanyang Technological University, Ming Hu Nanyang Technological University, Xiaohong Li Tianjin University, Lei Bu Nanjing University
16:45 15m Talk		RLCoder: Reinforcement Learning for Repository-Level Code Completion Research Track Yanlin Wang Sun Yat-sen University, Yanli Wang Sun Yat-sen University, Daya Guo , Jiachi Chen Sun Yat-sen University, Ruikai Zhang Huawei Cloud Computing Technologies, Yuchi Ma Huawei Cloud Computing Technologies, Zibin Zheng Sun Yat-sen University
17:00 15m Talk		InterTrans: Leveraging Transitive Intermediate Translations to Enhance LLM-based Code Translation Research Track Marcos Macedo Queen's University, Yuan Tian Queen's University, Kingston, Ontario, Pengyu Nie University of Waterloo, Filipe Cogo Centre for Software Excellence, Huawei Canada, Bram Adams Queen's University
17:15 15m Talk		Toward a Theory of Causation for Interpreting Neural Code Models Journal-first Papers David Nader Palacio William & Mary, Alejandro Velasco William & Mary, Nathan Cooper William & Mary, Alvaro Rodriguez Universidad Nacional de Colombia, Kevin Moran University of Central Florida, Denys Poshyvanyk William & Mary Link to publication DOI Pre-print

Thu 1 May
Displayed time zone: Eastern Time (US & Canada) change

	07:00 - 22:00	Quiet Room ThursdaySocial, Networking and Special Rooms at 202

	07:00 - 19:00	Ready Room ThursdaySocial, Networking and Special Rooms at 209

10:30 - 11:00	Thu Morning Break Posters 10:30-11Research Track / New Ideas and Emerging Results (NIER) / Demonstrations / Journal-first Papers / Posters at Canada Hall 3 Poster Area

10:30 30m Poster		Pattern-based Generation and Adaptation of Quantum WorkflowsQuantum Research Track Martin Beisel Institute of Architecture of Application Systems (IAAS), University of Stuttgart, Johanna Barzen University of Stuttgart, Frank Leymann University of Stuttgart, Lavinia Stiliadou Institute of Architecture of Application Systems (IAAS), University of Stuttgart, Daniel Vietz University of Stuttgart, Benjamin Weder Institute of Architecture of Application Systems (IAAS), University of Stuttgart
10:30 30m Talk		A Unit Proofing Framework for Code-level Verification: A Research AgendaFormal Methods New Ideas and Emerging Results (NIER) Paschal Amusuo Purdue University, Parth Vinod Patil Purdue University, Owen Cochell Michigan State University, Taylor Le Lievre Purdue University, James C. Davis Purdue University Pre-print
10:30 30m Talk		SolSearch: An LLM-Driven Framework for Efficient SAT-Solving Code GenerationFormal Methods New Ideas and Emerging Results (NIER) Junjie Sheng East China Normal University, Yanqiu Lin East China Normal University, Jiehao Wu East China Normal University, Yanhong Huang East China Normal University, Jianqi Shi East China Normal University, Min Zhang East China Normal University, Xiangfeng Wang East China Normal University
10:30 30m Talk		Listening to the Firehose: Sonifying Z3’s BehaviorFormal Methods New Ideas and Emerging Results (NIER) Finn Hackett University of British Columbia, Ivan Beschastnikh University of British Columbia
10:30 30m Poster		HyperCRX 2.0: A Comprehensive and Automated Tool for Empowering GitHub Insights Demonstrations Yantong Wang East China Normal University, Shengyu Zhao Tongji University, will wang , Fenglin Bi East China Normal University
10:30 30m Talk		Using ML filters to help automated vulnerability repairs: when it helps and when it doesn’tSecurity New Ideas and Emerging Results (NIER) Maria Camporese University of Trento, Fabio Massacci University of Trento; Vrije Universiteit Amsterdam Pre-print
10:30 30m Talk		Automated Testing Linguistic Capabilities of NLP Models Journal-first Papers Jaeseong Lee The University of Texas at Dallas, Simin Chen University of Texas at Dallas, Austin Mordahl University of Illinois Chicago, Cong Liu University of California, Riverside, Wei Yang UT Dallas, Shiyi Wei University of Texas at Dallas
10:30 30m Poster		Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models Research Track Kunpeng Zhang The Hong Kong University of Science and Technology, Shuai Wang Hong Kong University of Science and Technology, Jitao Han Central University of Finance and Economics, Xiaogang Zhu The University of Adelaide, Xian Li Swinburne University of Technology, Shaohua Wang Central University of Finance and Economics, Sheng Wen Swinburne University of Technology

10:30 - 11:00	BreakCatering at Canada Hall 3 plus Foyer

10:30 30m Break		Thursday Morning Break Catering

11:00 - 12:30	Design for AINew Ideas and Emerging Results (NIER) / SE In Practice (SEIP) / Research Track at 203 Chair(s): Chunyang Chen TU Munich

11:00 15m Talk		A Large-Scale Study of Model Integration in ML-Enabled Software SystemsSE for AI Research Track Yorick Sens Ruhr University Bochum, Henriette Knopp Ruhr University Bochum, Sven Peldszus Ruhr University Bochum, Thorsten Berger Ruhr University Bochum Pre-print
11:15 15m Talk		Are LLMs Correctly Integrated into Software Systems?SE for AI Research Track Yuchen Shao East China Normal University, Yuheng Huang the University of Tokyo, Jiawei Shen East China Normal University, Lei Ma The University of Tokyo & University of Alberta, Ting Su East China Normal University, Chengcheng Wan East China Normal University
11:30 15m Talk		Patch Synthesis for Property Repair of Deep Neural NetworksSE for AI Research Track Zhiming Chi Institute of Software, Chinese Academy of Sciences, Jianan Ma Hangzhou Dianzi University, China; Zhejiang University, Hangzhou, China, Pengfei Yang Institute of Software at Chinese Academy of Sciences, China, Cheng-Chao Huang Nanjing Institute of Software Technology, ISCAS, Renjue Li Institute of Software at Chinese Academy of Sciences, China, Jingyi Wang Zhejiang University, Xiaowei Huang University of Liverpool, Lijun Zhang Institute of Software, Chinese Academy of Sciences
11:45 15m Talk		Optimizing Experiment Configurations for LLM Applications Through Exploratory AnalysisSE for AI New Ideas and Emerging Results (NIER) Nimrod Busany Accenture Labs, Israel, Hananel Hadad Accenture Labs, Israel, Zofia Maszlanka Avanade, Poland, Rohit Shelke University of Ottawa, Canada, Gregory Price University of Ottawa, Canada, Okhaide Akhigbe University of Ottawa, Daniel Amyot University of Ottawa
12:00 15m Talk		AI-Assisted SQL Authoring at Industry ScaleSE for AI SE In Practice (SEIP) Chandra Sekhar Maddila Meta Platforms, Inc., Negar Ghorbani Meta Platforms Inc., Kosay Jabre Meta Platforms, Inc., Vijayaraghavan Murali Meta Platforms Inc., Edwin Kim Meta Platforms, Inc., Parth Thakkar Meta Platforms, Inc., Nikolay Pavlovich Laptev Meta Platforms, Inc., Olivia Harman Meta Platforms, Inc., Diana Hsu Meta Platforms, Inc., Rui Abreu Meta, Peter C Rigby Meta / Concordia University
12:15 15m Talk		Automating ML Model Development at ScaleSE for AI SE In Practice (SEIP) Kaiyuan Wang Google, Yang Li Google Inc, Junyang Shen Google Inc, Kaikai Sheng Google Inc, Yiwei You Google Inc, Jiaqi Zhang Google Inc, Srikar Ayyalasomayajula Google Inc, Julian Grady Google Inc, Martin Wicke Google Inc

11:00 - 12:30	Analysis 2SE In Practice (SEIP) / Journal-first Papers / Demonstrations / Research Track at 205 Chair(s): Mahmoud Alfadel University of Calgary

11:00 15m Talk		SIT: An accurate, compliant SBOM generator with incremental construction Demonstrations Changguo Jia Peking University, NIANYU LI ZGC Lab, China, Minghui Zhou Peking University, Kai Yang
11:15 15m Talk		Towards Better Static Analysis Bug Reports in the Clang Static Analyzer SE In Practice (SEIP) Kristóf Umann Eötvös Loránd University, Faculty of Informatics, Dept. of Programming Languages and Compilers, Zoltán Porkoláb Ericsson
11:30 15m Talk		Automatic Identification of Game Stuttering via Gameplay Videos Analysis Journal-first Papers Emanuela Guglielmi University of Molise, Gabriele Bavota Software Institute @ Università della Svizzera Italiana, Rocco Oliveto University of Molise, Simone Scalabrino University of Molise
11:45 15m Talk		LLM Driven Smart Assistant for Data Mapping SE In Practice (SEIP) Arihant Bedagkar Tata Consultancy Services, Sayandeep Mitra Tata Consultancy Services, Raveendra Kumar Medicherla TCS Research, Tata Consultancy Services, Ravindra Naik TCS Research, TRDDC, India, Samiran Pal Tata Consultancy Services
12:00 15m Talk		On the Diagnosis of Flaky Job Failures: Understanding and Prioritizing Failure Categories SE In Practice (SEIP) Henri Aïdasso École de technologie supérieure (ÉTS), Francis Bordeleau École de Technologie Supérieure (ETS), Ali Tizghadam TELUS Pre-print
12:15 7m Talk		AddressWatcher: Sanitizer-Based Localization of Memory Leak Fixes Journal-first Papers Aniruddhan Murali University of Waterloo, Mahmoud Alfadel University of Calgary, Mei Nagappan University of Waterloo, Meng Xu University of Waterloo, Chengnian Sun University of Waterloo

11:00 - 12:30	Human and Social 2Research Track / Journal-first Papers at 206 plus 208 Chair(s): Alexander Serebrenik Eindhoven University of Technology

11:00 15m Talk		Code Today, Deadline Tomorrow: Procrastination Among Software Developers Research Track Zeinabsadat Saghi University of Southern California, Thomas Zimmermann University of California, Irvine, Souti Chattopadhyay University of Southern California
11:15 15m Talk		"Get Me In The Groove": A Mixed Methods Study on Supporting ADHD Professional Programmers Research Track Kaia Newman Carnegie Mellon University, Sarah Snay University of Michigan, Madeline Endres University of Massachusetts Amherst, Manasvi Parikh University of Michigan, Andrew Begel Carnegie Mellon University Pre-print
11:30 15m Talk		Hints Help Finding and Fixing Bugs Differently in Python and Text-based Program Representations Research Track Ruchit Rawal Max Planck Institute for Software Systems, Victor-Alexandru Padurean Max Planck Institute for Software Systems, Sven Apel Saarland University, Adish Singla Max Planck Institute for Software Systems, Mariya Toneva Max Planck Institute for Software Systems Pre-print
11:45 15m Talk		How Scientists Use Jupyter Notebooks: Goals, Quality Attributes, and Opportunities Research Track Ruanqianqian (Lisa) Huang University of California, San Diego, Savitha Ravi UC San Diego, Michael He UCSD, Boyu Tian University of California, San Diego, Sorin Lerner University of California at San Diego, Michael Coblenz University of California, San Diego Pre-print
12:00 15m Talk		Investigating the Online Recruitment and Selection Journey of Novice Software Engineers: Anti-patterns and Recommendations Journal-first Papers Miguel Setúbal Federal University of Ceará, Tayana Conte Universidade Federal do Amazonas, Marcos Kalinowski Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Allysson Allex Araújo Federal University of Cariri Link to publication Pre-print
12:15 15m Talk		Reputation Gaming in Crowd Technical Knowledge Sharing Journal-first Papers Iren Mazloomzadeh École Polytechnique de Montréal, Gias Uddin York University, Canada, Foutse Khomh Polytechnique Montréal, Ashkan Sami Edinburgh Napier University

11:00 - 12:30	Security and Analysis 1Research Track / SE In Practice (SEIP) at 210 Chair(s): Akond Rahman Auburn University

11:00 15m Talk		Accounting for Missing Events in Statistical Information Leakage AnalysisSecurity Research Track Seongmin Lee Max Planck Institute for Security and Privacy (MPI-SP), Shreyas Minocha Georgia Tech, Marcel Böhme MPI for Security and Privacy
11:15 15m Talk		AssetHarvester: A Static Analysis Tool for Detecting Secret-Asset Pairs in Software ArtifactsSecurity Research Track Setu Kumar Basak North Carolina State University, K. Virgil English North Carolina State University, Ken Ogura North Carolina State University, Vitesh Kambara North Carolina State University, Bradley Reaves North Carolina State University, Laurie Williams North Carolina State University
11:30 15m Talk		Enhancing The Open Network: Definition and Automated Detection of Smart Contract DefectsBlockchainSecurityAward Winner Research Track Hao Song , Teng Li University of Electronic Science and Technology of China, Jiachi Chen Sun Yat-sen University, Ting Chen University of Electronic Science and Technology of China, Beibei Li Sichuan University, Zhangyan Lin University of Electronic Science and Technology of China, Yi Lu BitsLab, Pan Li MoveBit, Xihan Zhou TonBit
11:45 15m Talk		Detecting Python Malware in the Software Supply Chain with Program AnalysisSecurity SE In Practice (SEIP) Ridwan Salihin Shariffdeen National University of Singapore, Behnaz Hassanshahi Oracle Labs, Australia, Martin Mirchev National University of Singapore, Ali El Husseini National University of Singapore, Abhik Roychoudhury National University of Singapore
12:00 15m Talk		$ZTD_{JAVA}$: Mitigating Software Supply Chain Vulnerabilities via Zero-Trust DependenciesSecurity Research Track Paschal Amusuo Purdue University, Kyle A. Robinson Purdue University, Tanmay Singla Purdue University, Huiyun Peng Mount Holyoke College, Aravind Machiry Purdue University, Santiago Torres-Arias Purdue University, Laurent Simon Google, James C. Davis Purdue University Pre-print
12:15 15m Talk		FairChecker: Detecting Fund-stealing Bugs in DeFi Protocols via Fairness ValidationBlockchainSecurity Research Track Yi Sun Purdue University, USA, Zhuo Zhang Purdue University, Xiangyu Zhang Purdue University

11:00 - 12:30	AI for Design and ArchitectureDemonstrations / SE In Practice (SEIP) / Research Track at 211 Chair(s): Sarah Nadi New York University Abu Dhabi

11:00 15m Talk		An LLM-Based Agent-Oriented Approach for Automated Code Design Issue Localization Research Track Fraol Batole Tulane University, David OBrien Iowa State University, Tien N. Nguyen University of Texas at Dallas, Robert Dyer University of Nebraska-Lincoln, Hridesh Rajan Tulane University
11:15 15m Talk		Distilled Lifelong Self-Adaptation for Configurable Systems Research Track Yulong Ye University of Birmingham, Tao Chen University of Birmingham, Miqing Li University of Birmingham Pre-print
11:30 15m Talk		The Software Librarian: Python Package Insights for Copilot Demonstrations Jasmine Latendresse Concordia University, Nawres Day ISSAT Sousse, SayedHassan Khatoonabadi Concordia University, Montreal, Emad Shihab Concordia University, Montreal
11:45 15m Talk		aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing SE In Practice (SEIP) Siyuan Jiang , Jia Li Peking University, He Zong aiXcoder, Huanyu Liu Peking University, Hao Zhu Peking University, Shukai Hu aiXcoder, Erlu Li aiXcoder, Jiazheng Ding aiXcoder, Ge Li Peking University Pre-print
12:00 15m Talk		Leveraging MLOps: Developing a Sequential Classification System for RFQ Documents in Electrical Engineering SE In Practice (SEIP) Claudio Martens Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Hammam Abdelwahab Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Katharina Beckh Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Birgit Kirsch Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Vishwani Gupta Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Dennis Wegener Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Steffen Hoh Schneider Electric
12:15 15m Talk		On Mitigating Code LLM Hallucinations with API Documentation SE In Practice (SEIP) Nihal Jain Amazon Web Services, Robert Kwiatkowski , Baishakhi Ray Columbia University, Murali Krishna Ramanathan AWS AI Labs, Varun Kumar AWS AI Labs

11:00 - 12:30	AI for Analysis 3SE In Practice (SEIP) / Research Track at 212 Chair(s): Gias Uddin York University, Canada

11:00 15m Talk		COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge Research Track Yichen LI The Chinese University of Hong Kong, Yulun Wu The Chinese University of Hong Kong, Jinyang Liu Chinese University of Hong Kong, Zhihan Jiang The Chinese University of Hong Kong, Zhuangbin Chen Sun Yat-sen University, Guangba Yu The Chinese University of Hong Kong, Michael Lyu The Chinese University of Hong Kong
11:15 15m Talk		Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding Research Track Yifeng Di Purdue University, Tianyi Zhang Purdue University
11:30 15m Talk		HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation Research Track Dewu Zheng Sun Yat-sen University, Yanlin Wang Sun Yat-sen University, Ensheng Shi Xi’an Jiaotong University, Ruikai Zhang Huawei Cloud Computing Technologies, Yuchi Ma Huawei Cloud Computing Technologies, Hongyu Zhang Chongqing University, Zibin Zheng Sun Yat-sen University
11:45 15m Talk		SEMANTIC CODE FINDER: An Efficient Semantic Search Framework for Large-Scale Codebases SE In Practice (SEIP) daeha ryu Innovation Center, Samsung Electronics, Seokjun Ko Samsung Electronics Co., Eunbi Jang Innovation Center, Samsung Electronics, jinyoung park Innovation Center, Samsung Electronics, myunggwan kim Innovation Center, Samsung Electronics, changseo park Innovation Center, Samsung Electronics
12:00 15m Talk		Time to Retrain? Detecting Concept Drifts in Machine Learning Systems SE In Practice (SEIP) Tri Minh-Triet Pham Concordia University, Karthikeyan Premkumar Ericsson, Mohamed Naili Ericsson, Jinqiu Yang Concordia University
12:15 15m Talk		UML Sequence Diagram Generation: A Multi-Model, Multi-Domain Evaluation SE In Practice (SEIP) Chi Xiao Ericsson AB, Daniel Ståhl Ericsson AB, Jan Bosch Chalmers University of Technology

11:00 - 12:30	AI for RequirementsResearch Track / SE In Practice (SEIP) / Journal-first Papers / New Ideas and Emerging Results (NIER) at 213 Chair(s): Jennifer Horkoff Chalmers and the University of Gothenburg

11:00 15m Talk		From Bugs to Benefits: Improving User Stories by Leveraging Crowd Knowledge with CrUISE-AC Research Track Stefan Schwedt Heriot-Watt University, UK, Thomas Ströder FHDW Mettmann
11:15 15m Talk		LiSSA: Toward Generic Traceability Link Recovery through Retrieval-Augmented Generation Research Track Dominik Fuchß Karlsruhe Institute of Technology (KIT), Tobias Hey Karlsruhe Institute of Technology (KIT), Jan Keim Karlsruhe Institute of Technology (KIT), Haoyu Liu Karlsruhe Institute of Technology (KIT), Niklas Ewald Karlsruhe Institute of Technology (KIT), Tobias Thirolf Karlsruhe Institute of Technology (KIT), Anne Koziolek Karlsruhe Institute of Technology Pre-print Media Attached
11:30 15m Talk		Replication in Requirements Engineering: the NLP for RE Case Journal-first Papers Sallam Abualhaija University of Luxembourg, Fatma Başak Aydemir Utrecht University, Fabiano Dalpiaz Utrecht University, Davide Dell'Anna Utrecht University, Alessio Ferrari CNR-ISTI, Xavier Franch Universitat Politècnica de Catalunya, Davide Fucci Blekinge Institute of Technology
11:45 15m Talk		On the Impact of Requirements Smells in Prompts: The Case of Automated Traceability New Ideas and Emerging Results (NIER) Andreas Vogelsang paluno, University of Duisburg-Essen, Alexander Korn University of Duisburg-Essen, Giovanna Broccia ISTI-CNR, FMT Lab, Alessio Ferrari Consiglio Nazionale delle Ricerche (CNR) and University College Dublin (UCD), Jannik Fischbach Netlight Consulting GmbH and fortiss GmbH, Chetan Arora Monash University
12:00 15m Talk		NICE: Non-Functional Requirements Identification, Classification, and Explanation Using Small Language ModelsAward Winner SE In Practice (SEIP) Gokul Rejithkumar TCS Research, Preethu Rose Anish TCS Research Pre-print

11:00 - 12:30	AI for Testing and QA 3Research Track at 214 Chair(s): Mike Papadakis University of Luxembourg

11:00 15m Talk		A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs Research Track Myeongsoo Kim Georgia Institute of Technology, Tyler Stennett Georgia Institute of Technology, Saurabh Sinha IBM Research, Alessandro Orso Georgia Institute of Technology
11:15 15m Talk		ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs Research Track Hongyan Gao State Key Laboratory for Novel Software Technology, Nanjing University, Yibiao Yang Nanjing University, Maolin Sun Nanjing University, Jiangchang Wu State Key Laboratory for Novel Software Technology, Nanjing University, Yuming Zhou Nanjing University, Baowen Xu State Key Laboratory for Novel Software Technology, Nanjing University
11:30 15m Talk		LLM Based Input Space Partitioning Testing for Library APIs Research Track Jiageng Li Fudan University, Zhen Dong Fudan University, Chong Wang Nanyang Technological University, Haozhen You Fudan University, Cen Zhang Georgia Institute of Technology, Yang Liu Nanyang Technological University, Xin Peng Fudan University
11:45 15m Talk		Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests Research Track Amirhossein Deljouyi Delft University of Technology, Roham Koohestani Delft University of Technology, Maliheh Izadi Delft University of Technology, Andy Zaidman TU Delft DOI Pre-print
12:00 15m Talk		exLong: Generating Exceptional Behavior Tests with Large Language Models Research Track Jiyang Zhang University of Texas at Austin, Yu Liu Meta, Pengyu Nie University of Waterloo, Junyi Jessy Li University of Texas at Austin, USA, Milos Gligoric The University of Texas at Austin
12:15 15m Talk		TOGLL: Correct and Strong Test Oracle Generation with LLMs Research Track Soneya Binta Hossain University of Virginia, Matthew B Dwyer University of Virginia

11:00 - 12:30	SE for AI 2New Ideas and Emerging Results (NIER) / Research Track at 215 Chair(s): Grace Lewis Carnegie Mellon Software Engineering Institute

11:00 15m Talk		Answering User Questions about Machine Learning Models through Standardized Model CardsSE for AI Research Track Tajkia Rahman Toma University of Alberta, Balreet Grewal University of Alberta, Cor-Paul Bezemer University of Alberta Pre-print
11:15 15m Talk		Fairness Testing through Extreme Value TheorySE for AI Research Track Verya Monjezi University of Texas at El Paso, Ashutosh Trivedi University of Colorado Boulder, Vladik Kreinovich University of Texas at El Paso, Saeid Tizpaz-Niari University of Illinois Chicago
11:30 15m Talk		Fixing Large Language Models' Specification Misunderstanding for Better Code GenerationSE for AI Research Track Zhao Tian Tianjin University, Junjie Chen Tianjin University, Xiangyu Zhang Purdue University Pre-print
11:45 15m Talk		SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model AgentsSE for AI Research Track Feng Lin Concordia University, Dong Jae Kim DePaul University, Tse-Hsun (Peter) Chen Concordia University
12:00 15m Talk		The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML ProductsSE for AI Research Track Nadia Nahar Carnegie Mellon University, Haoran Zhang Carnegie Mellon University, Grace Lewis Carnegie Mellon Software Engineering Institute, Shurui Zhou University of Toronto, Christian Kästner Carnegie Mellon University
12:15 15m Talk		Towards Trustworthy LLMs for Code: A Data-Centric Synergistic Auditing FrameworkSE for AI New Ideas and Emerging Results (NIER) Chong Wang Nanyang Technological University, Zhenpeng Chen Nanyang Technological University, Li Tianlin NTU, Yilun Zhang AIXpert, Yang Liu Nanyang Technological University

12:30 - 14:00	LunchCatering at Canada Hall 3 plus Foyer

12:30 90m Lunch		Thursday Lunch Catering

13:00 - 13:30	Thu Lunch Posters 13:00-13:30Research Track / SE in Society (SEIS) / Journal-first Papers / SE In Practice (SEIP) / Posters at Canada Hall 3 Poster Area

13:00 30m Talk		BDefects4NN: A Backdoor Defect Database for Controlled Localization Studies in Neural Networks Research Track Yisong Xiao Beihang University, Aishan Liu Beihang University; Institute of Dataspace, Xinwei Zhang Beihang University, Tianyuan Zhang Beihang University, Li Tianlin NTU, Siyuan Liang National University of Singapore, Xianglong Liu Beihang University; Institute of Dataspace; Zhongguancun Laboratory, Yang Liu Nanyang Technological University, Dacheng Tao Nanyang Technological University
13:00 30m Talk		Ethical Issues in Video Games: Insights from Reddit Discussions SE in Society (SEIS) Yeqian Li Vrije Universiteit Amsterdam, Kousar Aslam Vrije Universiteit Amsterdam
13:00 30m Talk		An Empirical Study on Developers' Shared Conversations with ChatGPT in GitHub Pull Requests and Issues Journal-first Papers Huizi Hao Queen's University, Canada, Kazi Amit Hasan Queen's University, Canada, Hong Qin Queen's University, Marcos Macedo Queen's University, Yuan Tian Queen's University, Kingston, Ontario, Ding Steven, H., H. Queen’s University at Kingston, Ahmed E. Hassan Queen’s University
13:00 30m Talk		QuanTest: Entanglement-Guided Testing of Quantum Neural Network SystemsQuantum Journal-first Papers Jinjing Shi Central South University, Zimeng Xiao Central South University, Heyuan Shi Central South University, Yu Jiang Tsinghua University, Xuelong LI China Telecom Link to publication
13:00 30m Poster		FlatD: Protecting Deep Neural Network Program from Reversing Attacks SE In Practice (SEIP) Jinquan Zhang The Pennsylvania State University, Zihao Wang Penn State University, Pei Wang Independent Researcher, Rui Zhong Palo Alto Networks, Dinghao Wu Pennsylvania State University
13:00 30m Talk		Building Domain-Specific Machine Learning Workflows: A Conceptual Framework for the State-of-the-PracticeSE for AI Journal-first Papers Bentley Oakes Polytechnique Montréal, Michalis Famelis Université de Montréal, Houari Sahraoui DIRO, Université de Montréal DOI Pre-print File Attached
13:00 30m Talk		On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools.Security Journal-first Papers Aurora Papotti Vrije Universiteit Amsterdam, Ranindya Paramitha University of Trento, Fabio Massacci University of Trento; Vrije Universiteit Amsterdam
13:00 30m Talk		Automating Explanation Need Management in App Reviews: A Case Study from the Navigation App Industry SE In Practice (SEIP) Martin Obaidi Leibniz Universität Hannover, Nicolas Voß Graphmasters GmbH, Hannah Deters Leibniz University Hannover, Jakob Droste Leibniz Universität Hannover, Marc Herrmann Leibniz University Hannover, Jannik Fischbach Netlight Consulting GmbH and fortiss GmbH, Kurt Schneider Leibniz Universität Hannover, Software Engineering Group

13:30 - 14:00	Thu Lunch Posters 13:30-14:00Journal-first Papers / New Ideas and Emerging Results (NIER) / Research Track / Posters at Canada Hall 3 Poster Area

13:30 30m Poster		Non-Autoregressive Line-Level Code Completion Journal-first Papers Fang Liu Beihang University, Zhiyi Fu Peking University, Ge Li Peking University, Zhi Jin Peking University, Hui Liu Beijing Institute of Technology, Yiyang Hao Silicon Heart Tech Co., Li Zhang Beihang University
13:30 30m Talk		LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation Journal-first Papers Sarah Fakhoury Microsoft Research, Aaditya Naik University of Pennsylvania, Georgios Sakkas University of California at San Diego, Saikat Chakraborty Microsoft Research, Shuvendu K. Lahiri Microsoft Research Link to publication
13:30 30m Talk		SusDevOps: Promoting Sustainability to a First Principle in Software Delivery New Ideas and Emerging Results (NIER) Istvan David McMaster University / McMaster Centre for Software Certification (McSCert)
13:30 30m Poster		Predicting the First Response Latency of Maintainers and Contributors in Pull Requests Journal-first Papers SayedHassan Khatoonabadi Concordia University, Montreal, Ahmad Abdellatif University of Calgary, Diego Elias Costa Concordia University, Canada, Emad Shihab Concordia University, Montreal
13:30 30m Poster		RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code Research Track Pantazis Deligiannis Microsoft Research, Akash Lal Microsoft Research, Nikita Mehrotra Microsoft Research, Rishi Poddar Microsoft Research, Aseem Rastogi Microsoft Research
13:30 30m Talk		Relevant information in TDD experiment reporting Journal-first Papers Fernando Uyaguari Instituto Superior Tecnológico Wissen, Silvia Teresita Acuña Castillo Universidad Autónoma de Madrid, John W. Castro Universidad de Atacama, Davide Fucci Blekinge Institute of Technology, Oscar Dieste Universidad Politécnica de Madrid, Sira Vegas Universidad Politecnica de Madrid

14:00 - 15:30	Testing and QA 3Research Track / Journal-first Papers at 205 Chair(s): Michael Pradel University of Stuttgart

14:00 15m Talk		Increasing the Effectiveness of Automatically Generated Tests by Improving Class ObservabilityAward Winner Research Track Geraldine Galindo-Gutierrez Centro de Investigación en Ciencias Exactas e Ingenierías, Universidad Católica Boliviana, Juan Pablo Sandoval Alcocer Pontificia Universidad Católica de Chile, Nicolas Jimenez-Fuentes Pontificia Universidad Católica de Chile, Alexandre Bergel University of Chile, Gordon Fraser University of Passau
14:15 15m Talk		Invivo Fuzzing by Amplifying Actual Executions Research Track Octavio Galland Canonical, Marcel Böhme MPI for Security and Privacy
14:30 15m Talk		Towards High-strength Combinatorial Interaction Testing for Highly Configurable Software Systems Research Track Chuan Luo Beihang University, Shuangyu Lyu Beihang University, Wei Wu Central South University; Xiangjiang Laboratory, Hongyu Zhang Chongqing University, Dianhui Chu Harbin Institute of Technology, Chunming Hu Beihang University
14:45 15m Talk		WDD: Weighted Delta Debugging Research Track Xintong Zhou University of Waterloo, Zhenyang Xu University of Waterloo, Mengxiao Zhang University of Waterloo, Yongqiang Tian , Chengnian Sun University of Waterloo
15:00 15m Talk		TopSeed: Learning Seed Selection Strategies for Symbolic Execution from Scratch Research Track Jaehyeok Lee Sungkyunkwan University, Sooyoung Cha Sungkyunkwan University
15:15 15m Talk		Hunting bugs: Towards an automated approach to identifying which change caused a bug through regression testing Journal-first Papers Michel Maes Bermejo Universidad Rey Juan Carlos, Alexander Serebrenik Eindhoven University of Technology, Micael Gallego Universidad Rey Juan Carlos, Francisco Gortázar Universidad Rey Juan Carlos, Gregorio Robles Universidad Rey Juan Carlos, Jesus M. Gonzalez-Barahona Universidad Rey Juan Carlos

14:00 - 15:30	AI for Testing and QA 4Journal-first Papers / Demonstrations / Research Track at 206 plus 208 Chair(s): Andreas Jedlitschka Fraunhofer IESE

14:00 15m Talk		The Seeds of the FUTURE Sprout from History: Fuzzing for Unveiling Vulnerabilities in Prospective Deep-Learning LibrariesSecurityAward Winner Research Track Zhiyuan Li , Jingzheng Wu Institute of Software, The Chinese Academy of Sciences, Xiang Ling Institute of Software, Chinese Academy of Sciences, Tianyue Luo Institute of Software, Chinese Academy of Sciences, ZHIQING RUI Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Yanjun Wu Institute of Software, Chinese Academy of Sciences
14:15 15m Talk		AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL Demonstrations Tyler Stennett Georgia Institute of Technology, Myeongsoo Kim Georgia Institute of Technology, Saurabh Sinha IBM Research, Alessandro Orso Georgia Institute of Technology
14:30 15m Talk		FairBalance: How to Achieve Equalized Odds With Data Pre-processing Journal-first Papers Zhe Yu Rochester Institute of Technology, Joymallya Chakraborty Amazon.com, Tim Menzies North Carolina State University
14:45 15m Talk		RLocator: Reinforcement Learning for Bug Localization Journal-first Papers Partha Chakraborty University of Waterloo, Mahmoud Alfadel University of Calgary, Mei Nagappan University of Waterloo
15:00 15m Talk		Studying the explanations for the automated prediction of bug and non-bug issues using LIME and SHAP Journal-first Papers Lukas Schulte University of Passau, Benjamin Ledel Digital Learning GmbH, Steffen Herbold University of Passau
15:15 15m Talk		Test Generation Strategies for Building Failure Models and Explaining Spurious Failures Journal-first Papers Baharin Aliashrafi Jodat University of Ottawa, Abhishek Chandar University of Ottawa, Shiva Nejati University of Ottawa, Mehrdad Sabetzadeh University of Ottawa Pre-print

14:00 - 15:30	Human and Social using AI 1Research Track at 207 Chair(s): Romain Robbes CNRS, LaBRI, University of Bordeaux

14:00 15m Talk		Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers Research Track Yuling Shi Shanghai Jiao Tong University, Hongyu Zhang Chongqing University, Chengcheng Wan East China Normal University, Xiaodong Gu Shanghai Jiao Tong University
14:15 15m Talk		Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword? Research Track Rosalia Tufano Università della Svizzera Italiana, Alberto Martin-Lopez Software Institute - USI, Lugano, Ahmad Tayeb , Ozren Dabic Software Institute, Università della Svizzera italiana (USI), Switzerland, Sonia Haiduc , Gabriele Bavota Software Institute @ Università della Svizzera Italiana
14:30 15m Talk		An Exploratory Study of ML Sketches and Visual Code Assistants Research Track Luis F. Gomes Carnegie Mellon University, Vincent J. Hellendoorn Carnegie Mellon University, Jonathan Aldrich Carnegie Mellon University, Rui Abreu Faculty of Engineering of the University of Porto, Portugal
14:45 15m Talk		LiCoEval: Evaluating LLMs on License Compliance in Code Generation Research Track Weiwei Xu Peking University, Kai Gao Peking University, Hao He Carnegie Mellon University, Minghui Zhou Peking University Pre-print
15:00 15m Talk		Trust Dynamics in AI-Assisted Development: Definitions, Factors, and Implications Research Track Sadra Sabouri University of Southern California, Philipp Eibl University of Southern California, Xinyi Zhou University of Southern California, Morteza Ziyadi Amazon AGI, Nenad Medvidović University of Southern California, Lars Lindemann University of Southern California, Souti Chattopadhyay University of Southern California Pre-print
15:15 15m Talk		What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI Research Track Rudrajit Choudhuri Oregon State University, Bianca Trinkenreich Colorado State University, Rahul Pandita GitHub, Inc., Eirini Kalliamvakou GitHub, Igor Steinmacher NAU RESHAPE LAB, Marco Gerosa Northern Arizona University, Christopher Sanchez Oregon State University, Anita Sarma Oregon State University Pre-print

14:00 - 15:30	AI for Security 1Research Track at 210 Chair(s): Tao Chen University of Birmingham

14:00 15m Talk		Large Language Models as Configuration ValidatorsSecurity Research Track Xinyu Lian University of Illinois at Urbana-Champaign, Yinfang Chen University of Illinois at Urbana-Champaign, Runxiang Cheng University of Illinois at Urbana-Champaign, Jie Huang University of Illinois at Urbana-Champaign, Parth Thakkar Meta Platforms, Inc., Minjia Zhang UIUC, Tianyin Xu University of Illinois at Urbana-Champaign
14:15 15m Talk		LLM Assistance for Memory SafetySecurity Research Track Nausheen Mohammed Microsoft Research, Akash Lal Microsoft Research, Aseem Rastogi Microsoft Research, Subhajit Roy IIT Kanpur, Rahul Sharma Microsoft Research
14:30 15m Talk		Vulnerability Detection with Code Language Models: How Far Are We?Security Research Track Yangruibo Ding Columbia University, Yanjun Fu University of Maryland, Omniyyah Ibrahim King Abdulaziz City for Science and Technology, Chawin Sitawarin University of California, Berkeley, Xinyun Chen , Basel Alomair King Abdulaziz City for Science and Technology, David Wagner UC Berkeley, Baishakhi Ray Columbia University, Yizheng Chen University of Maryland
14:45 15m Talk		Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with JustificationsBlockchainSecurity Research Track Wei Ma , Daoyuan Wu Hong Kong University of Science and Technology, Yuqiang Sun Nanyang Technological University, Tianwen Wang National University of Singapore, Shangqing Liu Nanyang Technological University, Jian Zhang Nanyang Technological University, Yue Xue , Yang Liu Nanyang Technological University
15:00 15m Talk		Towards Neural Synthesis for SMT-assisted Proof-Oriented ProgrammingSecurityFormal MethodsAward Winner Research Track Saikat Chakraborty Microsoft Research, Gabriel Ebner Microsoft Research, Siddharth Bhat University of Cambridge, Sarah Fakhoury Microsoft Research, Sakina Fatima University of Ottawa, Shuvendu K. Lahiri Microsoft Research, Nikhil Swamy Microsoft Research
15:15 15m Talk		Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and DefensesSecuritySE for AI Research Track Rodrigo Resendes Pedro INESC-ID / IST, Universidade de Lisboa, Miguel E. Coimbra INESC-ID; Instituto Superior Técnico - University of Lisbon, Daniel Castro INESC-ID / IST, Universidade de Lisboa, Paulo Carreira INESC-ID / IST, Universidade de Lisboa, Nuno Santos INESC-ID; Instituto Superior Técnico - University of Lisbon

14:00 - 15:30	Analysis 3Research Track / Journal-first Papers at 212 Chair(s): Shaowei Wang University of Manitoba

14:00 15m Talk		Boosting Path-Sensitive Value Flow Analysis via Removal of Redundant Summaries Research Track Yongchao WANG Hong Kong University of Science and Technology, Yuandao Cai Hong Kong University of Science and Technology, Charles Zhang Hong Kong University of Science and Technology Pre-print
14:15 15m Talk		Dockerfile Flakiness: Characterization and Repair Research Track Taha Shabani University of British Columbia, Noor Nashid University of British Columbia, Parsa Alian University of British Columbia, Ali Mesbah University of British Columbia Pre-print
14:30 15m Talk		Evaluating Garbage Collection Performance Across Managed Language Runtimes Research Track Yicheng Wang Institute of Software Chinese Academy of Sciences, Wensheng Dou Institute of Software Chinese Academy of Sciences, Yu Liang Institute of Software Chinese Academy of Sciences, Yi Wang Institute of Software Chinese Academy of Sciences, Wei Wang Institute of Software at Chinese Academy of Sciences, Jun Wei Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences, Tao Huang Institute of Software Chinese Academy of Sciences
14:45 15m Talk		Module-Aware Context Sensitive Pointer Analysis Research Track Haofeng Li SKLP, Institute of Computing Technology, CAS, Chenghang Shi SKLP, Institute of Computing Technology, CAS, Jie Lu SKLP, Institute of Computing Technology, CAS, Lian Li Institute of Computing Technology at Chinese Academy of Sciences; University of Chinese Academy of Sciences, Zixuan Zhao Huawei Technologies Co. Ltd File Attached
15:00 15m Talk		An Empirical Study on Reproducible Packaging in Open-Source Ecosystems Research Track Giacomo Benedetti University of Genoa, Oreofe Solarin Case Western Reserve University, Courtney Miller Carnegie Mellon University, Greg Tystahl NCSU, William Enck North Carolina State University, Christian Kästner Carnegie Mellon University, Alexandros Kapravelos NCSU, Alessio Merlo CASD - School of Advanced Defense Studies, Luca Verderame University of Genoa
15:15 15m Talk		T-Rec: Fine-Grained Language-Agnostic Program Reduction Guided by Lexical Syntax Journal-first Papers Zhenyang Xu University of Waterloo, Yongqiang Tian , Mengxiao Zhang , Jiarui Zhang University of Waterloo, Puzhuo Liu Ant Group & Tsinghua University, Yu Jiang Tsinghua University, Chengnian Sun University of Waterloo

14:00 - 15:30	AI for Program Comprehension 2Research Track at 213 Chair(s): Oscar Chaparro William & Mary

14:00 15m Talk		Code Comment Inconsistency Detection and Rectification Using a Large Language Model Research Track Guoping Rong Nanjing University, YongdaYu Nanjing University, Song Liu Nanjing University, Xin Tan Nanjing University, Tianyi Zhang Nanjing University, Haifeng Shen Southern Cross University, Jidong Hu Zhongxing Telecom Equipment
14:15 15m Talk		Context Conquers Parameters: Outperforming Proprietary LLM in Commit Message Generation Research Track Aaron Imani University of California, Irvine, Iftekhar Ahmed University of California at Irvine, Mohammad Moshirpour University of California, Irvine
14:30 15m Talk		HedgeCode: A Multi-Task Hedging Contrastive Learning Framework for Code Search Research Track Gong Chen Wuhan University, Xiaoyuan Xie Wuhan University, Xunzhu Tang University of Luxembourg, Qi Xin Wuhan University, Wenjie Liu Wuhan University
14:45 15m Talk		Reasoning Runtime Behavior of a Program with LLM: How Far Are We? Research Track Junkai Chen Zhejiang University, Zhiyuan Pan Zhejiang University, Xing Hu Zhejiang University, Zhenhao Li York University, Ge Li Peking University, Xin Xia Huawei
15:00 15m Talk		Source Code Summarization in the Era of Large Language Models Research Track Weisong Sun Nanjing University, Yun Miao Nanjing University, Yuekang Li UNSW, Hongyu Zhang Chongqing University, Chunrong Fang Nanjing University, Yi Liu Nanyang Technological University, Gelei Deng Nanyang Technological University, Yang Liu Nanyang Technological University, Zhenyu Chen Nanjing University Media Attached
15:15 15m Talk		Template-Guided Program Repair in the Era of Large Language Models Research Track Kai Huang , Jian Zhang Nanyang Technological University, Xiangxin Meng Beihang University, Beijing, China, Yang Liu Nanyang Technological University File Attached

14:00 - 15:30	SE for AI 3Research Track / SE in Society (SEIS) / Journal-first Papers at 215 Chair(s): Lina Marsso École Polytechnique de Montréal

14:00 15m Talk		Dissecting Global Search: A Simple yet Effective Method to Boost Individual Discrimination Testing and RepairSE for AI Research Track Lili Quan Tianjin University, Li Tianlin NTU, Xiaofei Xie Singapore Management University, Zhenpeng Chen Nanyang Technological University, Sen Chen Nankai University, Lingxiao Jiang Singapore Management University, Xiaohong Li Tianjin University Pre-print
14:15 15m Talk		FixDrive: Automatically Repairing Autonomous Vehicle Driving Behaviour for $0.08 per ViolationSE for AI Research Track Yang Sun Singapore Management University, Chris Poskitt Singapore Management University, Kun Wang Zhejiang University, Jun Sun Singapore Management University Link to publication DOI Pre-print File Attached
14:30 15m Talk		MARQ: Engineering Mission-Critical AI-based Software with Automated Result Quality AdaptationSE for AI Research Track Uwe Gropengießer Technical University of Darmstadt, Elias Dietz Technical University of Darmstadt, Florian Brandherm Technical University of Darmstadt, Achref Doula Technical University of Darmstadt, Osama Abboud Munich Research Center, Huawei, Xun Xiao Munich Research Center, Huawei, Max Mühlhäuser Technical University of Darmstadt
14:45 15m Talk		An Empirical Study of Challenges in Machine Learning Asset ManagementSE for AI Journal-first Papers Zhimin Zhao Queen's University, Yihao Chen Queen's University, Abdul Ali Bangash Software Analysis and Intelligence Lab (SAIL), Queen's University, Canada, Bram Adams Queen's University, Ahmed E. Hassan Queen’s University
15:00 15m Talk		A Reference Model for Empirically Comparing LLMs with HumansSE for AI SE in Society (SEIS) Kurt Schneider Leibniz Universität Hannover, Software Engineering Group, Farnaz Fotrousi Chalmers University of Technology and University of Gothenburg, Rebekka Wohlrab Chalmers University of Technology
15:15 7m Talk		Building Domain-Specific Machine Learning Workflows: A Conceptual Framework for the State-of-the-PracticeSE for AI Journal-first Papers Bentley Oakes Polytechnique Montréal, Michalis Famelis Université de Montréal, Houari Sahraoui DIRO, Université de Montréal DOI Pre-print File Attached

15:30 - 16:00	Thu Afternoon Break Posters 15:30-16:00Journal-first Papers / Research Track / New Ideas and Emerging Results (NIER) / SE in Society (SEIS) / Posters at Canada Hall 3 Poster Area

15:30 30m Talk		Mole: Efficient Crash Reproduction in Android Applications With Enforcing Necessary UI Events Journal-first Papers Maryam Masoudian Sharif University of Technology, Hong Kong University of Science and Technology (HKUST), Heqing Huang City University of Hong Kong, Morteza Amini Sharif University of Technology, Charles Zhang Hong Kong University of Science and Technology
15:30 30m Talk		Best ends by the best means: ethical concerns in app reviews Journal-first Papers Neelam Tjikhoeri Vrije Universiteit Amsterdam, Lauren Olson Vrije Universiteit Amsterdam, Emitzá Guzmán Vrije Universiteit Amsterdam
15:30 30m Talk		Shaken, Not Stirred. How Developers Like Their Amplified Tests Journal-first Papers Carolin Brandt TU Delft, Ali Khatami Delft University of Technology, Mairieli Wessel Radboud University, Andy Zaidman TU Delft Pre-print
15:30 30m Poster		BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries Research Track Wen Zhang University of Georgia, Botang Xiao University of Georgia, Qingchen Kong University of Georgia, Le Guan University of Georgia, Wenwen Wang University of Georgia
15:30 30m Talk		Towards Early Warning and Migration of High-Risk Dormant Open-Source Software DependenciesSecurity New Ideas and Emerging Results (NIER) Zijie Huang Shanghai Key Laboratory of Computer Software Testing and Evaluation, Lizhi Cai Shanghai Key Laboratory of Computer Software Testing & Evaluating, Shanghai Software Center, Xuan Mao Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China, Kang Yang Shanghai Key Laboratory of Computer Software Testing and Evaluating, Shanghai Development Center of Computer Software Technology
15:30 30m Talk		Exploring User Privacy Awareness on GitHub: An Empirical Study Journal-first Papers Costanza Alfieri Università degli Studi dell'Aquila, Juri Di Rocco University of L'Aquila, Paola Inverardi Gran Sasso Science Institute, Phuong T. Nguyen University of L’Aquila
15:30 30m Poster		SimClone: Detecting Tabular Data Clones using Value Similarity Journal-first Papers Xu Yang University of Manitoba, Gopi Krishnan Rajbahadur Centre for Software Excellence, Huawei, Canada, Dayi Lin Centre for Software Excellence, Huawei Canada, Shaowei Wang University of Manitoba, Zhen Ming (Jack) Jiang York University
15:30 30m Talk		Strategies to Embed Human Values in Mobile Apps: What do End-Users and Practitioners Think? SE in Society (SEIS) Rifat Ara Shams CSIRO's Data61, Mojtaba Shahin RMIT University, Gillian Oliver Monash University, Jon Whittle CSIRO's Data61 and Monash University, Waqar Hussain Data61, CSIRO, Harsha Perera CSIRO's Data61, Arif Nurwidyantoro Universitas Gadjah Mada

15:30 - 16:00	BreakCatering at Canada Hall 3 plus Foyer

15:30 30m Break		Thursday Afternoon Break Catering

Fri 2 May
Displayed time zone: Eastern Time (US & Canada) change

	07:00 - 15:30	Quiet Room Friday Until 330Social, Networking and Special Rooms at 202

	07:00 - 19:00	Ready Room FridaySocial, Networking and Special Rooms at 209

10:30 - 11:00	Fri Morning Break Posters 10:30-11Journal-first Papers / SE In Practice (SEIP) / Research Track / SE in Society (SEIS) / New Ideas and Emerging Results (NIER) / Posters at Canada Hall 3 Poster Area

10:30 30m Talk		An Empirical Study on Developers' Shared Conversations with ChatGPT in GitHub Pull Requests and Issues Journal-first Papers Huizi Hao Queen's University, Canada, Kazi Amit Hasan Queen's University, Canada, Hong Qin Queen's University, Marcos Macedo Queen's University, Yuan Tian Queen's University, Kingston, Ontario, Ding Steven, H., H. Queen’s University at Kingston, Ahmed E. Hassan Queen’s University
10:30 30m Talk		Automating Explanation Need Management in App Reviews: A Case Study from the Navigation App Industry SE In Practice (SEIP) Martin Obaidi Leibniz Universität Hannover, Nicolas Voß Graphmasters GmbH, Hannah Deters Leibniz University Hannover, Jakob Droste Leibniz Universität Hannover, Marc Herrmann Leibniz University Hannover, Jannik Fischbach Netlight Consulting GmbH and fortiss GmbH, Kurt Schneider Leibniz Universität Hannover, Software Engineering Group
10:30 30m Talk		On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools.Security Journal-first Papers Aurora Papotti Vrije Universiteit Amsterdam, Ranindya Paramitha University of Trento, Fabio Massacci University of Trento; Vrije Universiteit Amsterdam
10:30 30m Talk		Relevant information in TDD experiment reporting Journal-first Papers Fernando Uyaguari Instituto Superior Tecnológico Wissen, Silvia Teresita Acuña Castillo Universidad Autónoma de Madrid, John W. Castro Universidad de Atacama, Davide Fucci Blekinge Institute of Technology, Oscar Dieste Universidad Politécnica de Madrid, Sira Vegas Universidad Politecnica de Madrid
10:30 30m Talk		BDefects4NN: A Backdoor Defect Database for Controlled Localization Studies in Neural Networks Research Track Yisong Xiao Beihang University, Aishan Liu Beihang University; Institute of Dataspace, Xinwei Zhang Beihang University, Tianyuan Zhang Beihang University, Li Tianlin NTU, Siyuan Liang National University of Singapore, Xianglong Liu Beihang University; Institute of Dataspace; Zhongguancun Laboratory, Yang Liu Nanyang Technological University, Dacheng Tao Nanyang Technological University
10:30 30m Talk		Ethical Issues in Video Games: Insights from Reddit Discussions SE in Society (SEIS) Yeqian Li Vrije Universiteit Amsterdam, Kousar Aslam Vrije Universiteit Amsterdam
10:30 30m Talk		SusDevOps: Promoting Sustainability to a First Principle in Software Delivery New Ideas and Emerging Results (NIER) Istvan David McMaster University / McMaster Centre for Software Certification (McSCert)

10:30 - 11:00	BreakCatering at Canada Hall 3 plus Foyer

10:30 30m Break		Friday Morning Break Catering

11:00 - 12:30	Program Comprehension 3Research Track / Journal-first Papers at 204 Chair(s): Arie van Deursen TU Delft

11:00 15m Talk		Automated Test Generation For Smart Contracts via On-Chain Test Case Augmentation and MigrationBlockchain Research Track Jiashuo Zhang Peking University, China, Jiachi Chen Sun Yat-sen University, John Grundy Monash University, Jianbo Gao Peking University, Yanlin Wang Sun Yat-sen University, Ting Chen University of Electronic Science and Technology of China, Zhi Guan Peking University, Zhong Chen Pre-print
11:15 15m Talk		Boosting Code-line-level Defect Prediction with Spectrum Information and Causality Analysis Research Track Shiyu Sun , Yanhui Li Nanjing University, Lin Chen Nanjing University, Yuming Zhou Nanjing University, Jianhua Zhao Nanjing University, China
11:30 15m Talk		BatFix: Repairing language model-based transpilation Journal-first Papers Daniel Ramos Carnegie Mellon University, Ines Lynce INESC-ID/IST, Universidade de Lisboa, Vasco Manquinho INESC-ID; Universidade de Lisboa, Ruben Martins Carnegie Mellon University, Claire Le Goues Carnegie Mellon University
11:45 15m Talk		Tracking the Evolution of Static Code Warnings: The State-of-the-Art and a Better Approach Journal-first Papers Junjie Li , Jinqiu Yang Concordia University
12:00 15m Talk		PACE: A Program Analysis Framework for Continuous Performance Prediction Journal-first Papers Chidera Biringa University of Massachusetts, Gokhan Kul University of Massachusetts Dartmouth
12:15 15m Talk		Mimicking Production Behavior With Generated Mocks Journal-first Papers Deepika Tiwari KTH Royal Institute of Technology, Martin Monperrus KTH Royal Institute of Technology, Benoit Baudry Université de Montréal

11:00 - 12:30	Testing and QA 4Research Track at 205 Chair(s): Matteo Camilli Politecnico di Milano

11:00 15m Talk		DPFuzzer: Discovering Safety Critical Vulnerabilities for Drone Path PlannersSecurity Research Track Yue Wang , Chao Yang Xidian University, Xiaodong Zhang , Yuwanqi Deng Xidian University, Jianfeng Ma Xidian University
11:15 15m Talk		IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation Research Track Yuyang Rong University of California, Davis, Zhanghan Yu University of California, Davis, Zhenkai Weng University of California, Davis, Stephen Neuendorffer Advanced Micro Devices, Inc., Hao Chen University of California at Davis
11:30 15m Talk		Ranking Relevant Tests for Order-Dependent Flaky Tests Research Track Shanto Rahman The University of Texas at Austin, Bala Naren Chanumolu George Mason University, Suzzana Rafi George Mason University, August Shi The University of Texas at Austin, Wing Lam George Mason University
11:45 15m Talk		Selecting Initial Seeds for Better JVM Fuzzing Research Track Tianchang Gao Tianjin University, Junjie Chen Tianjin University, Dong Wang Tianjin University, Yile Guo College of Intelligence and Computing, Tianjin University, Yingquan Zhao Tianjin University, Zan Wang Tianjin University
12:00 15m Talk		Toward a Better Understanding of Probabilistic Delta Debugging Research Track Mengxiao Zhang , Zhenyang Xu University of Waterloo, Yongqiang Tian , Xinru Cheng University of Waterloo, Chengnian Sun University of Waterloo
12:15 15m Talk		Tumbling Down the Rabbit Hole: How do Assisting Exploration Strategies Facilitate Grey-box Fuzzing?Award Winner Research Track Mingyuan Wu Southern University of Science and Technology, Jiahong Xiang Southern University of Science and Technology, Kunqiu Chen Southern University of Science and Technology, Peng Di Ant Group & UNSW Sydney, Shin Hwei Tan Concordia University, Heming Cui University of Hong Kong, Yuqun Zhang Southern University of Science and Technology

11:00 - 12:30	Human and Social 3SE In Practice (SEIP) / Journal-first Papers / Research Track / New Ideas and Emerging Results (NIER) at 206 plus 208 Chair(s): Yuan Tian Queen's University, Kingston, Ontario

11:00 15m Talk		Relationship Status: “It’s complicated” Developer-Security Expert Dynamics in ScrumSecurity Research Track Houda Naji Ruhr University Bochum, Marco Gutfleisch Ruhr University Bochum, Alena Naiakshina Ruhr University Bochum
11:15 15m Talk		Soft Skills in Software Engineering: Insights from the Trenches SE In Practice (SEIP) Sanna Malinen University of Canterbury, Matthias Galster University of Canterbury, Antonija Mitrovic University of Canterbury, New Zealand, Sreedevi Sankara Iyer University of Canterbury, Pasan Peiris University of Canterbury, New Zealand, April Clarke University of Canterbury
11:30 15m Talk		A Unified Browser-Based Consent Management Framework New Ideas and Emerging Results (NIER) Gayatri Priyadarsini Indian Institute of Technology Gandhinagar, Abhishek Bichhawat Indian Institute of Technology Gandhinagar
11:45 15m Talk		Predicting Attrition among Software Professionals: Antecedents and Consequences of Burnout and Engagement Journal-first Papers Bianca Trinkenreich Colorado State University, Fabio Marcos De Abreu Santos Colorado State University, USA, Klaas-Jan Stol Lero; University College Cork; SINTEF Digital
12:00 7m Talk		A Controlled Experiment in Age and Gender Bias When Reading Technical Articles in Software Engineering Journal-first Papers Anda Liang Vanderbilt University, Emerson Murphy-Hill Microsoft, Westley Weimer University of Michigan, Yu Huang Vanderbilt University
12:07 7m Talk		Best ends by the best means: ethical concerns in app reviews Journal-first Papers Neelam Tjikhoeri Vrije Universiteit Amsterdam, Lauren Olson Vrije Universiteit Amsterdam, Emitzá Guzmán Vrije Universiteit Amsterdam
12:14 7m Talk		Shaken, Not Stirred. How Developers Like Their Amplified Tests Journal-first Papers Carolin Brandt TU Delft, Ali Khatami Delft University of Technology, Mairieli Wessel Radboud University, Andy Zaidman TU Delft Pre-print
12:21 7m Talk		Exploring User Privacy Awareness on GitHub: An Empirical Study Journal-first Papers Costanza Alfieri Università degli Studi dell'Aquila, Juri Di Rocco University of L'Aquila, Paola Inverardi Gran Sasso Science Institute, Phuong T. Nguyen University of L’Aquila

11:00 - 12:30	Human and Social using AI 2Research Track / SE In Practice (SEIP) / Demonstrations at 207 Chair(s): Sebastian Baltes University of Bayreuth

11:00 15m Talk		Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models SE In Practice (SEIP) Hao Li Queen's University, Cor-Paul Bezemer University of Alberta, Ahmed E. Hassan Queen’s University Pre-print
11:15 15m Talk		FairLay-ML: Intuitive Debugging of Fairness in Data-Driven Social-Critical Software Demonstrations Normen Yu Penn State, Luciana Carreon University of Texas at El Paso, Gang (Gary) Tan Pennsylvania State University, Saeid Tizpaz-Niari University of Illinois Chicago
11:30 15m Talk		Dear Diary: A randomized controlled trial of Generative AI coding tools in the workplace SE In Practice (SEIP) Jenna L. Butler Microsoft Research, Jina Suh Microsoft Research, Sankeerti Haniyur Microsoft Corporation, Constance Hadley Institute for Work Life
11:45 15m Talk		Exploring GenAI in Software Development: Insights from a Case Study in a Large Brazilian Company SE In Practice (SEIP) Guilherme Vaz Pereira School of Technology, PUCRS, Brazil, Victoria Jackson University of California, Irvine, Rafael Prikladnicki School of Technology at PUCRS University, Andre van der Hoek University of California, Irvine, Luciane Fortes Globo, Carolina Araújo Globo, André Coelho Globo, Ligia Chelli Globo, Diego Ramos Globo Pre-print
12:00 15m Talk		Human-In-the-Loop Software Development Agents SE In Practice (SEIP) Wannita Takerngsaksiri Monash University, Jirat Pasuksmit Atlassian, Patanamon Thongtanunam University of Melbourne, Kla Tantithamthavorn Monash University, Ruixiong Zhang Atlassian, Fan Jiang Atlassian, Jing Li Atlassian, Evan Cook Atlassian, Kun Chen Atlassian, Ming Wu Atlassian
12:15 15m Talk		Measuring the Runtime Performance of C++ Code Written by Humans using GitHub Copilot Research Track Daniel Erhabor University of Waterloo, Sreeharsha Udayashankar University of Waterloo, Mei Nagappan University of Waterloo, Samer Al-Kiswany University of Waterloo DOI Pre-print File Attached

11:00 - 12:30	Security and Analysis 2Research Track at 210 Chair(s): Jordan Samhi University of Luxembourg, Luxembourg

11:00 15m Talk		A Study of Undefined Behavior Across Foreign Function Boundaries in Rust LibrariesSecurity Research Track Ian McCormack Carnegie Mellon University, Joshua Sunshine Carnegie Mellon University, Jonathan Aldrich Carnegie Mellon University Pre-print
11:15 15m Talk		Cooperative Software Verification via Dynamic Program SplittingSecurity Research Track Cedric Richter University of Oldenburg, Marek Chalupa Institute of Science and Technology Austria, Marie-Christine Jakobs LMU Munich, Germany, Heike Wehrheim University of Oldenburg
11:30 15m Talk		Exposing the Hidden Layer: Software Repositories in the Service of SEO ManipulationSecurity Research Track Mengying Wu Fudan University, Geng Hong Fudan University, Wuyuao Mai Fudan University, Xinyi Wu Fudan University, Lei Zhang Fudan University, Yingyuan Pu QI-ANXIN Technology Research Institute, Huajun Chai QI-ANXIN Technology Research Institute, Lingyun Ying Qi An Xin Group Corp., Haixin Duan Institute for Network Science and Cyberspace, Tsinghua University; Qi An Xin Group Corp., Min Yang Fudan University
11:45 15m Talk		Hetrify: Efficient Verification of Heterogeneous Programs on RISC-VSecurityAward Winner Research Track Yiwei Li School of Computer, National Univer sity of Defense Technology, Liangze Yin School of Computer, National Univer sity of Defense Technology, Wei Dong National University of Defense Technology, Jiaxin Liu National University of Defense Technology, Yanfeng Hu School of Computer, National Univer sity of Defense Technology, Shanshan Li National University of Defense Technology
12:00 15m Talk		Hyperion: Unveiling DApp Inconsistencies using LLM and Dataflow-Guided Symbolic ExecutionSecurity Research Track Shuo Yang Sun Yat-sen University, Xingwei Lin Ant Group, Jiachi Chen Sun Yat-sen University, Qingyuan Zhong Sun Yat-sen University, Lei Xiao Sun Yat-sen University, renke huang Sun Yat-sen University, Yanlin Wang Sun Yat-sen University, Zibin Zheng Sun Yat-sen University
12:15 15m Talk		SmartReco: Detecting Read-Only Reentrancy via Fine-Grained Cross-DApp AnalysisSecurity Research Track Jingwen Zhang School of Software Engineering, Sun Yat sen University, Zibin Zheng Sun Yat-sen University, Yuhong Nan Sun Yat-sen University, Mingxi Ye Sun Yat-sen University, Kaiwen Ning Sun Yat-sen University, Yu Zhang Harbin Institute of Technology, Weizhe Zhang Harbin Institute of Technology

11:00 - 12:30	Design and Architecture 1Research Track / SE In Practice (SEIP) / Journal-first Papers at 211 Chair(s): Tushar Sharma Dalhousie University

11:00 15m Talk		A Catalog of Micro Frontends Anti-patterns Research Track Nabson Silva UFAM - Federal University of Amazonas, Eriky Rodrigues UFAM - Federal University of Amazonas Brazil, Tayana Conte Universidade Federal do Amazonas
11:15 15m Talk		PairSmell: A Novel Perspective Inspecting Software Modular StructureAward Winner Research Track Chenxing Zhong Nanjing University, Daniel Feitosa University of Groningen, Paris Avgeriou Univ. of Gronningen , Huang Huang State Grid Nanjing Power Supply Company, Yue Li Nanjing University, He Zhang Nanjing University Pre-print
11:30 15m Talk		Understanding Architectural Complexity, Maintenance Burden, and Developer Sentiment---a Large-Scale Study Research Track Yuanfang Cai Drexel University, Lanting He Google, Yony Kochinski Google, Jun Qian Google, Ciera Jaspan Google, Nan Zhang Google, Antonio Bianco Google
11:45 15m Talk		A Large-Scale Exploratory Study on the Proxy Pattern in EthereumBlockchain Journal-first Papers Amir Ebrahimi Queen's University, Bram Adams Queen's University, Gustavo A. Oliva Queen's University, Ahmed E. Hassan Queen’s University
12:00 15m Talk		Video Game Procedural Content Generation Through Software Transplantation SE In Practice (SEIP) Mar Zamorano López University College London, Daniel Blasco SVIT Research Group. Universidad San Jorge, Carlos Cetina , Federica Sarro University College London

11:00 - 12:30	AI for Analysis 4Research Track / New Ideas and Emerging Results (NIER) / SE In Practice (SEIP) at 212 Chair(s): Maliheh Izadi Delft University of Technology, Ali Al-Kaswan Delft University of Technology, Netherlands, Jonathan Katzy Delft University of Technology

11:00 15m Talk		RepairAgent: An Autonomous, LLM-Based Agent for Program Repair Research Track Islem BOUZENIA University of Stuttgart, Prem Devanbu University of California at Davis, Michael Pradel University of Stuttgart Pre-print
11:15 15m Talk		Evaluating Agent-based Program Repair at Google SE In Practice (SEIP) Patrick Rondon Google, Renyao Wei Google, José Pablo Cambronero Google, USA, Jürgen Cito TU Wien, Aaron Sun Google, Siddhant Sanyam Google, Michele Tufano Google, Satish Chandra Google, Inc
11:30 15m Talk		Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset SE In Practice (SEIP) Mohammad Saiful Islam Toronto Metropolitan University, Toronto, Canada, Mohamed Sami Rakha Toronto Metropolitan University, Toronto, Canada, William Pourmajidi Toronto Metropolitan University, Toronto, Canada, Janakan Sivaloganathan Toronto Metropolitan University, Toronto, Canada, John Steinbacher IBM, Andriy Miranskyy Toronto Metropolitan University (formerly Ryerson University) Pre-print
11:45 15m Talk		Crash Report Prioritization for Large-Scale Scheduled Launches SE In Practice (SEIP) Nimmi Rashinika Weeraddana University of Waterloo, Sarra Habchi Ubisoft Montréal, Shane McIntosh University of Waterloo
12:00 15m Talk		LogLM: From Task-based to Instruction-based Automated Log Analysis SE In Practice (SEIP) Yilun Liu Huawei co. LTD, Yuhe Ji Huawei co. LTD, Shimin Tao University of Science and Technology of China; Huawei co. LTD, Minggui He Huawei co. LTD, Weibin Meng Huawei co. LTD, Shenglin Zhang Nankai University, Yongqian Sun Nankai University, Yuming Xie Huawei co. LTD, Boxing Chen Huawei Canada, Hao Yang Huawei co. LTD Pre-print
12:15 7m Talk		Using ML filters to help automated vulnerability repairs: when it helps and when it doesn’tSecurity New Ideas and Emerging Results (NIER) Maria Camporese University of Trento, Fabio Massacci University of Trento; Vrije Universiteit Amsterdam Pre-print

11:00 - 12:30	AI for Testing and QA 5SE In Practice (SEIP) / Research Track at 214 Chair(s): Chunyang Chen TU Munich

11:00 15m Talk		ASTER: Natural and Multi-language Unit Test Generation with LLMsAward Winner SE In Practice (SEIP) Rangeet Pan IBM Research, Myeongsoo Kim Georgia Institute of Technology, Rahul Krishna IBM Research, Raju Pavuluri IBM T.J. Watson Research Center, Saurabh Sinha IBM Research Pre-print
11:15 15m Talk		Automated Code Review In Practice SE In Practice (SEIP) Umut Cihan Bilkent University, Vahid Haratian Bilkent Univeristy, Arda İçöz Bilkent University, Mert Kaan Gül Beko, Ömercan Devran Beko, Emircan Furkan Bayendur Beko, Baykal Mehmet Ucar Beko, Eray Tüzün Bilkent University Pre-print
11:30 15m Talk		CI at Scale: Lean, Green, and Fast SE In Practice (SEIP) Dhruva Juloori Uber Technologies, Inc, Zhongpeng Lin Uber Technologies Inc., Matthew Williams Uber Technologies, Inc, Eddy Shin Uber Technologies, Inc, Sonal Mahajan Uber Technologies Inc.
11:45 15m Talk		Moving Faster and Reducing Risk: Using LLMs in Release DeploymentAward Winner SE In Practice (SEIP) Rui Abreu Meta, Vijayaraghavan Murali Meta Platforms Inc., Peter C Rigby Meta / Concordia University, Chandra Sekhar Maddila Meta Platforms, Inc., Weiyan Sun Meta Platforms, Inc., Jun Ge Meta Platforms, Inc., Kaavya Chinniah Meta Platforms, Inc., Audris Mockus University of Tennessee, Megh Mehta Meta Platforms, Inc., Nachiappan Nagappan Meta Platforms, Inc.
12:00 15m Talk		Prioritizing Large-scale Natural Language Test Cases at OPPO SE In Practice (SEIP) Haoran Xu , Chen Zhi Zhejiang University, Tianyu Xiang Guangdong Oppo Mobile Telecommunications Corp., Ltd., Zixuan Wu Zhejiang University, Gaorong Zhang Zhejiang University, Xinkui Zhao Zhejiang University, Jianwei Yin Zhejiang University, Shuiguang Deng Zhejiang University; Alibaba-Zhejiang University Joint Institute of Frontier Technologies
12:15 15m Talk		Search+LLM-based Testing for ARM Simulators SE In Practice (SEIP) Bobby Bruce University of California at Davis, USA, Aidan Dakhama King's College London, Karine Even-Mendoza King’s College London, William B. Langdon University College London, Hector Menendez King’s College London, Justyna Petke University College London

11:00 - 12:30	SE for AI with Quality 1Research Track at 215 Chair(s): Chris Poskitt Singapore Management University

11:00 15m Talk		A Tale of Two DL Cities: When Library Tests Meet CompilerSE for AI Research Track Qingchao Shen Tianjin University, Yongqiang Tian , Haoyang Ma Hong Kong University of Science and Technology, Junjie Chen Tianjin University, Lili Huang College of Intelligence and Computing, Tianjin University, Ruifeng Fu Tianjin University, Shing-Chi Cheung Hong Kong University of Science and Technology, Zan Wang Tianjin University
11:15 15m Talk		Iterative Generation of Adversarial Example for Deep Code ModelsSE for AIAward Winner Research Track Li Huang , Weifeng Sun , Meng Yan Chongqing University
11:30 15m Talk		On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning ImplementationsSE for AI Research Track Rajdeep Singh Hundal National University of Singapore, Yan Xiao Sun Yat-sen University, Xiaochun Cao Sun Yat-Sen University, Jin Song Dong National University of Singapore, Manuel Rigger National University of Singapore Pre-print Media Attached File Attached
11:45 15m Talk		µPRL: a Mutation Testing Pipeline for Deep Reinforcement Learning based on Real FaultsSE for AI Research Track Deepak-George Thomas Tulane University, Matteo Biagiola Università della Svizzera italiana, Nargiz Humbatova Università della Svizzera italiana, Mohammad Wardat Oakland University, USA, Gunel Jahangirova King's College London, Hridesh Rajan Tulane University, Paolo Tonella USI Lugano Pre-print
12:00 15m Talk		Testing and Understanding Deviation Behaviors in FHE-hardened Machine Learning ModelsSE for AI Research Track Yiteng Peng Hong Kong University of Science and Technology, Daoyuan Wu Hong Kong University of Science and Technology, Zhibo Liu Hong Kong University of Science and Technology, Dongwei Xiao Hong Kong University of Science and Technology, Zhenlan Ji The Hong Kong University of Science and Technology, Juergen Rahmel HSBC, Shuai Wang Hong Kong University of Science and Technology
12:15 15m Talk		TraceFL: Interpretability-Driven Debugging in Federated Learning via Neuron ProvenanceSE for AI Research Track Waris Gill Virginia Tech, Ali Anwar University of Minnesota, Muhammad Ali Gulzar Virginia Tech Pre-print

11:00 - 12:30	AI for SE 3New Ideas and Emerging Results (NIER) / Journal-first Papers / Research Track / SE In Practice (SEIP) at Canada Hall 1 and 2 Chair(s): Ying Zou Queen's University, Kingston, Ontario

11:00 15m Talk		A First Look at Conventional Commits Classification Research Track Qunhong Zeng Beijing Institute of Technology, Yuxia Zhang Beijing Institute of Technology, Zhiqing Qiu Beijing Institute of Technology, Hui Liu Beijing Institute of Technology
11:15 15m Talk		ChatGPT-Based Test Generation for Refactoring Engines Enhanced by Feature Analysis on Examples Research Track Chunhao Dong Beijing Institute of Technology, Yanjie Jiang Peking University, Yuxia Zhang Beijing Institute of Technology, Yang Zhang Hebei University of Science and Technology, Hui Liu Beijing Institute of Technology
11:30 15m Talk		SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing Research Track Wenchao Gu The Chinese University of Hong Kong, Ensheng Shi Xi’an Jiaotong University, Yanlin Wang Sun Yat-sen University, Lun Du Microsoft Research, Shi Han Microsoft Research, Hongyu Zhang Chongqing University, Dongmei Zhang Microsoft Research, Michael Lyu The Chinese University of Hong Kong
11:45 15m Talk		UniGenCoder: Merging Seq2Seq and Seq2Tree Paradigms for Unified Code Generation New Ideas and Emerging Results (NIER) Liangying Shao School of Informatics, Xiamen University, China, Yanfu Yan William & Mary, Denys Poshyvanyk William & Mary, Jinsong Su School of Informatics, Xiamen University, China
12:00 15m Talk		How is Google using AI for internal code migrations? SE In Practice (SEIP) Stoyan Nikolov Google, Inc., Daniele Codecasa Google, Inc., Anna Sjovall Google, Inc., Maxim Tabachnyk Google, Siddharth Taneja Google, Inc., Celal Ziftci Google, Satish Chandra Google, Inc
12:15 7m Talk		LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation Journal-first Papers Sarah Fakhoury Microsoft Research, Aaditya Naik University of Pennsylvania, Georgios Sakkas University of California at San Diego, Saikat Chakraborty Microsoft Research, Shuvendu K. Lahiri Microsoft Research Link to publication
12:22 7m Talk		The impact of Concept drift and Data leakage on Log Level Prediction Models Journal-first Papers Youssef Esseddiq Ouatiti Queen's university, Mohammed Sayagh ETS Montreal, University of Quebec, Noureddine Kerzazi Ensias-Rabat, Bram Adams Queen's University, Ahmed E. Hassan Queen’s University, Youssef Esseddiq Ouatiti Queen's university

12:30 - 14:00	LunchCatering at Canada Hall 3 plus Foyer

13:15 45m Lunch		Friday Lunch Catering

13:00 - 13:30	Fri Lunch Posters 13:00-13:30SE in Society (SEIS) / Journal-first Papers / Demonstrations / Research Track / New Ideas and Emerging Results (NIER) / Posters at Canada Hall 3 Poster Area

13:00 30m Talk		Strategies to Embed Human Values in Mobile Apps: What do End-Users and Practitioners Think? SE in Society (SEIS) Rifat Ara Shams CSIRO's Data61, Mojtaba Shahin RMIT University, Gillian Oliver Monash University, Jon Whittle CSIRO's Data61 and Monash University, Waqar Hussain Data61, CSIRO, Harsha Perera CSIRO's Data61, Arif Nurwidyantoro Universitas Gadjah Mada
13:00 30m Talk		Best ends by the best means: ethical concerns in app reviews Journal-first Papers Neelam Tjikhoeri Vrije Universiteit Amsterdam, Lauren Olson Vrije Universiteit Amsterdam, Emitzá Guzmán Vrije Universiteit Amsterdam
13:00 30m Poster		HyperCRX 2.0: A Comprehensive and Automated Tool for Empowering GitHub Insights Demonstrations Yantong Wang East China Normal University, Shengyu Zhao Tongji University, will wang , Fenglin Bi East China Normal University
13:00 30m Poster		Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models Research Track Kunpeng Zhang The Hong Kong University of Science and Technology, Shuai Wang Hong Kong University of Science and Technology, Jitao Han Central University of Finance and Economics, Xiaogang Zhu The University of Adelaide, Xian Li Swinburne University of Technology, Shaohua Wang Central University of Finance and Economics, Sheng Wen Swinburne University of Technology
13:00 30m Talk		Using ML filters to help automated vulnerability repairs: when it helps and when it doesn’tSecurity New Ideas and Emerging Results (NIER) Maria Camporese University of Trento, Fabio Massacci University of Trento; Vrije Universiteit Amsterdam Pre-print
13:00 30m Talk		Shaken, Not Stirred. How Developers Like Their Amplified Tests Journal-first Papers Carolin Brandt TU Delft, Ali Khatami Delft University of Technology, Mairieli Wessel Radboud University, Andy Zaidman TU Delft Pre-print
13:00 30m Talk		Exploring User Privacy Awareness on GitHub: An Empirical Study Journal-first Papers Costanza Alfieri Università degli Studi dell'Aquila, Juri Di Rocco University of L'Aquila, Paola Inverardi Gran Sasso Science Institute, Phuong T. Nguyen University of L’Aquila

14:00 - 15:30	Real-Time SESE In Practice (SEIP) / Demonstrations / Journal-first Papers / New Ideas and Emerging Results (NIER) / Research Track at 203 Chair(s): Domenico Bianculli University of Luxembourg

14:00 15m Talk		Closing the Gap between Sensor Inputs and Driving Properties: A Scene Graph Generator for CARLA Demonstrations Trey Woodlief University of Virginia, Felipe Toledo , Sebastian Elbaum University of Virginia, Matthew B Dwyer University of Virginia
14:15 15m Talk		LEGOS-SLEEC: Tool for Formalizing and Analyzing Normative Requirements Demonstrations Kevin Kolyakov University of Toronto, Lina Marsso École Polytechnique de Montréal, Nick Feng University of Toronto, Junwei Quan University of Toronto, Marsha Chechik University of Toronto
14:30 15m Talk		MarMot: Metamorphic Runtime Monitoring of Autonomous Driving Systems Journal-first Papers Jon Ayerdi Mondragon University, Asier Iriarte Mondragon University, Pablo Valle Mondragon University, Ibai Roman Mondragon University, Miren Illarramendi Mondragon University, Aitor Arrieta Mondragon University
14:45 15m Talk		Automatically Generating Content for Testing Autonomous Vehicles from User Descriptions New Ideas and Emerging Results (NIER) Benedikt Steininger IMC FH Krems, Chrysanthi Papamichail BeamNG GmbH, David Stark BeamNG GmbH, Dejan Nickovic Austrian Institute of Technology, Alessio Gambi Austrian Institute of Technology (AIT)
15:00 15m Talk		BSODiag: A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems SE In Practice (SEIP) Tao Duan Xi'an Jiaotong University, Runqing Chen Alibaba, Pinghui Wang Xi'an Jiaotong University, Junzhou Zhao Xi'an Jiaotong University, Jiongzhou Liu Alibaba, Shujie Han Northwestern Polytechnical University, Yi Liu Alibaba, Fan Xu Alibaba
15:15 15m Talk		On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet? SE In Practice (SEIP) Matteo Esposito University of Oulu, Francesco Palagiano Multitel di Lerede Alessandro & C. s.a.s., Valentina Lenarduzzi University of Oulu, Davide Taibi University of Oulu DOI Pre-print

14:00 - 15:30	Program Comprehension 4Research Track at 204 Chair(s): Simone Scalabrino University of Molise

14:00 15m Talk		Decoding the Issue Resolution Process In Practice via Issue Report Analysis: A Case Study of Firefox Research Track Antu Saha William & Mary, Oscar Chaparro William & Mary Pre-print
14:15 15m Talk		Preserving Privacy in Software Composition Analysis: A Study of Technical Solutions and Enhancements Research Track Huaijin Wang Ohio State University, Zhibo Liu Hong Kong University of Science and Technology, Yanbo Dai The Hong Kong University of Science and Technology (Guangzhou), Shuai Wang Hong Kong University of Science and Technology, Qiyi Tang Tencent Security Keen Lab, Sen Nie Tencent Security Keen Lab, Shi Wu Tencent Security Keen Lab
14:30 15m Talk		UML is Back. Or is it? Investigating the Past, Present, and Future of UML in Open Source Software Research Track Joseph Romeo Software Institute - USI, Lugano, Switzerland, Marco Raglianti Software Institute - USI, Lugano, Csaba Nagy , Michele Lanza Software Institute - USI, Lugano Pre-print
14:45 15m Talk		Understanding the Response to Open-Source Dependency Abandonment in the npm EcosystemAward Winner Research Track Courtney Miller Carnegie Mellon University, Mahmoud Jahanshahi University of Tennessee, Audris Mockus University of Tennessee, Bogdan Vasilescu Raj Reddy Associate Professor of Software and Societal Systems, Carnegie Mellon University, USA, Christian Kästner Carnegie Mellon University
15:00 15m Talk		Understanding Compiler Bugs in Real Development Research Track Hao Zhong Shanghai Jiao Tong University
15:15 15m Talk		Studying Programmers Without Programming: Investigating Expertise Using Resting State fMRI Research Track Zachary Karas Vanderbilt University, Benjamin Gold Vanderbilt University, Violet Zhou University of Michigan, Noah Reardon University of Michigan, Thad Polk University of Michigan, Catie Chang Vanderbilt University, Yu Huang Vanderbilt University

14:00 - 15:30	Testing and QA 5Research Track / Journal-first Papers / New Ideas and Emerging Results (NIER) / Demonstrations at 205 Chair(s): Giovanni Denaro University of Milano - Bicocca

14:00 15m Talk		Leveraging Propagated Infection to Crossfire Mutants Research Track Hang Du University of California at Irvine, Vijay Krishna Palepu Microsoft, James Jones University of California at Irvine File Attached
14:15 15m Talk		IFSE: Taming Closed-box Functions in Symbolic Execution via Fuzz Solving Demonstrations Qichang Wang East China Normal University, Chuyang Chen The Ohio State University, Ruiyang Xu East China Normal University, Haiying Sun East China Normal University, Chengcheng Wan East China Normal University, Ting Su East China Normal University, Yueling Zhang East China Normal University, Geguang Pu East China Normal University, China
14:30 15m Talk		Takuan: Using Dynamic Invariants To Debug Order-Dependent Flaky Tests New Ideas and Emerging Results (NIER) Nate Levin Yorktown High School, Chengpeng Li University of Texas at Austin, Yule Zhang George Mason University, August Shi The University of Texas at Austin, Wing Lam George Mason University
14:45 15m Talk		Vision Transformer Inspired Automated Vulnerability RepairSecurity Journal-first Papers Michael Fu The University of Melbourne, Van Nguyen Monash University, Kla Tantithamthavorn Monash University, Dinh Phung Monash University, Australia, Trung Le Monash University, Australia
15:00 15m Talk		ZigZagFuzz: Interleaved Fuzzing of Program Options and Files Journal-first Papers Ahcheong Lee KAIST, Youngseok Choi KAIST, Shin Hong Chungbuk National University, Yunho Kim Hanyang University, Kyutae Cho LIG Nex1 AI R&D, Moonzoo Kim KAIST / VPlusLab Inc.
15:15 15m Talk		Reducing the Length of Field-replay Based Load Testing Journal-first Papers Yuanjie Xia University of Waterloo, Lizhi Liao Memorial University of Newfoundland, Jinfu Chen Wuhan University, Heng Li Polytechnique Montréal, Weiyi Shang University of Waterloo

14:00 - 15:30	Human and Social 4Journal-first Papers / SE in Society (SEIS) / SE In Practice (SEIP) / Research Track at 206 plus 208 Chair(s): Liliana Pasquale University College Dublin & Lero

14:00 15m Talk		Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products SE In Practice (SEIP) Nadia Nahar Carnegie Mellon University, Christian Kästner Carnegie Mellon University, Jenna L. Butler Microsoft Research, Chris Parnin Microsoft, Thomas Zimmermann University of California, Irvine, Christian Bird Microsoft Research
14:15 15m Talk		Follow-Up Attention: An Empirical Study of Developer and Neural Model Code Exploration Journal-first Papers Matteo Paltenghi University of Stuttgart, Rahul Pandita GitHub, Inc., Austin Henley Carnegie Mellon University, Albert Ziegler XBow
14:30 15m Talk		Do Developers Adopt Green Architectural Tactics for ML-Enabled Systems? A Mining Software Repository Study SE in Society (SEIS) Vincenzo De Martino University of Salerno, Silverio Martínez-Fernández UPC-BarcelonaTech, Fabio Palomba University of Salerno Pre-print
14:45 15m Talk		Accessibility Issues in Ad-Driven Web Applications Research Track Abdul Haddi Amjad Virginia Tech, Muhammad Danish Virginia Tech, Bless Jah Virginia Tech, Muhammad Ali Gulzar Virginia Tech
15:00 15m Talk		A Bot-based Approach to Manage Codes of Conduct in Open-Source Projects SE in Society (SEIS) Sergio Cobos IN3 - UOC, Javier Luis Cánovas Izquierdo Universitat Oberta de Catalunya Pre-print
15:15 7m Talk		Toward Effective Secure Code Reviews: An Empirical Study of Security-Related Coding WeaknessesSecurity Journal-first Papers Wachiraphan (Ping) Charoenwet University of Melbourne, Patanamon Thongtanunam University of Melbourne, Thuan Pham University of Melbourne, Christoph Treude Singapore Management University

14:00 - 15:30	User ExperienceJournal-first Papers / Research Track / SE In Practice (SEIP) / SE in Society (SEIS) at 207 Chair(s): Ramiro Liscano Ontario Tech University

14:00 15m Talk		A Tale of Two Comprehensions? Analyzing Student Programmer Attention During Code Summarization Journal-first Papers Zachary Karas Vanderbilt University, Aakash Bansal University of Notre Dame, Yifan Zhang Vanderbilt University, Toby Jia-Jun Li University of Notre Dame, Collin McMillan University of Notre Dame, Yu Huang Vanderbilt University
14:15 15m Talk		Asking and Answering Questions During Memory Profiling Journal-first Papers Alison Fernandez Blanco University of Chile, Araceli Queirolo Cordova ISCLab, Department of Computer Science (DCC), University of Chile, Alexandre Bergel University of Chile, Juan Pablo Sandoval Alcocer Pontificia Universidad Católica de Chile
14:30 15m Talk		Unveiling the Energy Vampires: A Methodology for Debugging Software Energy ConsumptionAward Winner Research Track Enrique Barba Roque TU Delft, Luís Cruz TU Delft, Thomas Durieux TU Delft Pre-print
14:45 15m Talk		Designing a Tool for Evacuation Plan Validation: Multi-Agent Simulations with Persona-Based UI SE in Society (SEIS) Gennaro Zanfardino University of L'Aquila, Antinisca Di Marco University of L'Aquila, Michele Tucci University of L'Aquila
15:00 15m Talk		Testing False Recalls in E-commerce Apps: a User-perspective Blackbox Approach SE In Practice (SEIP) Shengnan Wu School of Computer Science, Fudan University, Yongxiang Hu Fudan University, Jiazhen Gu Fudan University, China, Penglei Mao School of Computer Science, Fudan University, Jin Meng Meituan Inc., Liujie Fan Meituan Inc., Zhongshi Luan Meituan Inc., Xin Wang Fudan University, Yangfan Zhou Fudan University
15:15 7m Talk		On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools.Security Journal-first Papers Aurora Papotti Vrije Universiteit Amsterdam, Ranindya Paramitha University of Trento, Fabio Massacci University of Trento; Vrije Universiteit Amsterdam
15:22 7m Talk		On Effectiveness and Efficiency of Gamified Exploratory GUI Testing Journal-first Papers Riccardo Coppola Politecnico di Torino, Tommaso Fulcini Politecnico di Torino, Luca Ardito Politecnico di Torino, Marco Torchiano Politecnico di Torino, Emil Alégroth Blekinge Institute of Technology

14:00 - 15:30	Security and Analysis 3Research Track / SE In Practice (SEIP) at 210 Chair(s): Adriana Sejfia University of Edinburgh

14:00 15m Talk		Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly DetectionSecurity Research Track Qiaolin Qin Polytechnique Montréal, Heng Li Polytechnique Montréal, Ettore Merlo Polytechnique Montreal, Maxime Lamothe Polytechnique Montreal Pre-print
14:15 15m Talk		On Prescription or Off Prescription? An Empirical Study of Community-prescribed Security Configurations for KubernetesSecurity Research Track Shazibul Islam Shamim Auburn University, Hanyang Hu Company A, Akond Rahman Auburn University Pre-print File Attached
14:30 15m Talk		Similar but Patched Code Considered Harmful -- The Impact of Similar but Patched Code on Recurring Vulnerability Detection and How to Remove ThemSecurity Research Track Zixuan Tan Zhejiang University, Jiayuan Zhou Huawei, Xing Hu Zhejiang University, Shengyi Pan Zhejiang University, Kui Liu Huawei, Xin Xia Huawei Pre-print
14:45 15m Talk		TIVER: Identifying Adaptive Versions of C/C++ Third-Party Open-Source Components Using a Code Clustering TechniqueSecurity Research Track Youngjae Choi Korea University, Seunghoon Woo Korea University
15:00 15m Talk		A scalable, effective and simple Vulnerability Tracking approach for heterogeneous SAST setups based on Scope+OffsetSecurity SE In Practice (SEIP) James Johnson --, Julian Thome GitLab Inc., Lucas Charles GitLab Inc., Hua Yan GitLab Inc., Jason Leasure GitLab Inc. Pre-print
15:15 15m Talk		''ImmediateShortTerm3MthsAfterThatLOL'': Developer Secure-Coding Sentiment, Practice and Culture in OrganisationsSecurity SE In Practice (SEIP) Ita Ryan University College Cork, Utz Roedig University College Cork, Klaas-Jan Stol Lero; University College Cork; SINTEF Digital

14:00 - 15:30	Design and Architecture 2Journal-first Papers / Research Track at 211 Chair(s): Yuanfang Cai Drexel University, Jan Keim Karlsruhe Institute of Technology (KIT)

14:00 15m Talk		An Exploratory Study on the Engineering of Security FeaturesSecurity Research Track Kevin Hermann Ruhr University Bochum, Sven Peldszus Ruhr University Bochum, Jan-Philipp Steghöfer XITASO GmbH IT & Software Solutions, Thorsten Berger Ruhr University Bochum Pre-print
14:15 15m Talk		DesignRepair: Dual-Stream Design Guideline-Aware Frontend Repair with Large Language Models Research Track Mingyue Yuan The university of new South Wales, Jieshan Chen CSIRO's Data61, Zhenchang Xing CSIRO's Data61, Aaron Quigley CSIRO's Data61, Yuyu Luo HKUST (GZ), Tianqi Luo HKUST (GZ), Gelareh Mohammadi The university of new South Wales, Qinghua Lu Data61, CSIRO, Liming Zhu CSIRO’s Data61
14:30 15m Talk		Fidelity of Cloud Emulators: The Imitation Game of Testing Cloud-based Software Research Track Anna Mazhar Cornell University, Saad Sher Alam University of Illinois Urbana-Champaign, William Zheng University of Illinois Urbana-Champaign, Yinfang Chen University of Illinois at Urbana-Champaign, Suman Nath Microsoft Research, Tianyin Xu University of Illinois at Urbana-Champaign
14:45 15m Talk		Formally Verified Cloud-Scale AuthorizationAward Winner Research Track Aleks Chakarov Amazon Web Services, Jaco Geldenhuys Amazon Web Services, Matthew Heck Amazon Web Services, MIchael Hicks Amazon, Samuel Huang Amazon Web Services, Georges-Axel Jaloyan Amazon Web Services, Anjali Joshi Amazon, K. Rustan M. Leino Amazon, Mikael Mayer Automated Reasoning Group, Amazon Web Services, Sean McLaughlin Amazon Web Services, Akhilesh Mritunjai Amazon.com, Clement Pit-Claudel EPFL, Sorawee Porncharoenwase Amazon Web Services, Florian Rabe Amazon Web Services, Marianna Rapoport Amazon Web Services, Giles Reger Amazon Web Services, Cody Roux Amazon Web Services, Neha Rungta Amazon Web Services, Robin Salkeld Amazon Web Services, Matthias Schlaipfer Amazon Web Services, Daniel Schoepe Amazon, Johanna Schwartzentruber Amazon Web Services, Serdar Tasiran Amazon, n.n., Aaron Tomb Amazon, Emina Torlak Amazon Web Services, USA, Jean-Baptiste Tristan Amazon, Lucas Wagner Amazon Web Services, Michael Whalen Amazon Web Services and the University of Minnesota, Remy Willems Amazon, Tongtong Xiang Amazon Web Services, Taejoon Byun University of Minnesota, Joshua M. Cohen Princeton University, Ruijie Fang University of Texas at Austin, Junyoung Jang McGill University, Jakob Rath TU Wien, Hira Taqdees Syeda , Dominik Wagner University of Oxford, Yongwei Yuan Purdue University
15:00 15m Talk		The Same Only Different: On Information Modality for Configuration Performance Analysis Research Track Hongyuan Liang University of Electronic Science and Technology of China, Yue Huang University of Electronic Science and Technology of China, Tao Chen University of Birmingham Pre-print
15:15 7m Talk		Identifying Performance Issues in Cloud Service Systems Based on Relational-Temporal Features Journal-first Papers Wenwei Gu The Chinese University of Hong Kong, Jinyang Liu Chinese University of Hong Kong, Zhuangbin Chen Sun Yat-sen University, Jianping Zhang The Chinese University of Hong Kong, Yuxin Su Sun Yat-sen University, Jiazhen Gu Chinese University of Hong Kong, Cong Feng Huawei Cloud Computing Technology, Zengyin Yang Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Yongqiang Yang Huawei Cloud Computing Technology, Michael Lyu The Chinese University of Hong Kong

14:00 - 15:30	AI for Analysis 5Research Track / New Ideas and Emerging Results (NIER) at 212 Chair(s): Tien N. Nguyen University of Texas at Dallas

14:00 15m Talk		3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers Research Track Sarah Fakhoury Microsoft Research, Markus Kuppe Microsoft Research, Shuvendu K. Lahiri Microsoft Research, Tahina Ramananandro Microsoft Research, Nikhil Swamy Microsoft Research Pre-print
14:15 15m Talk		Aligning the Objective of LLM-based Program Repair Research Track Junjielong Xu The Chinese University of Hong Kong, Shenzhen, Ying Fu Chongqing University, Shin Hwei Tan Concordia University, Pinjia He Chinese University of Hong Kong, Shenzhen Pre-print
14:30 15m Talk		Revisiting Unnaturalness for Automated Program Repair in the Era of Large Language Models Research Track Aidan Z.H. Yang Carnegie Mellon University, Sophia Kolak Carnegie Mellon University, Vincent J. Hellendoorn Carnegie Mellon University, Ruben Martins Carnegie Mellon University, Claire Le Goues Carnegie Mellon University
14:45 15m Talk		The Fact Selection Problem in LLM-Based Program Repair Research Track Nikhil Parasaram Uber Amsterdam, Huijie Yan University College London, Boyu Yang University College London, Zineb Flahy University College London, Abriele Qudsi University College London, Damian Ziaber University College London, Earl T. Barr University College London, Sergey Mechtaev Peking University
15:00 15m Talk		Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models Research Track Zhijie Wang University of Alberta, Zijie Zhou University of Illinois Urbana-Champaign, Da Song University of Alberta, Yuheng Huang University of Alberta, Canada, Shengmai Chen Purdue University, Lei Ma The University of Tokyo & University of Alberta, Tianyi Zhang Purdue University Pre-print
15:15 15m Talk		Beyond Syntax: How Do LLMs Understand Code? New Ideas and Emerging Results (NIER) Marc North Durham University, Amir Atapour-Abarghouei Durham University, Nelly Bencomo Durham University

14:00 - 15:30	AI for Security 2Research Track at 213 Chair(s): Gias Uddin York University, Canada

14:00 15m Talk		Repository-Level Graph Representation Learning for Enhanced Security Patch DetectionSecurity Research Track Xin-Cheng Wen Harbin Institute of Technology, Zirui Lin Harbin Institute of Technology, Shenzhen, Cuiyun Gao Harbin Institute of Technology, Hongyu Zhang Chongqing University, Yong Wang Anhui Polytechnic University, Qing Liao Harbin Institute of Technology
14:15 15m Talk		FAMOS: Fault diagnosis for Microservice Systems through Effective Multi-modal Data FusionSecurity Research Track Chiming Duan Peking University, Yong Yang Peking University, Tong Jia Institute for Artificial Intelligence, Peking University, Beijing, China, Guiyang Liu Alibaba, Jinbu Liu Alibaba, Huxing Zhang Alibaba Group, Qi Zhou Alibaba, Ying Li School of Software and Microelectronics, Peking University, Beijing, China, Gang Huang Peking University
14:30 15m Talk		Leveraging Large Language Models to Detect npm Malicious PackagesSecurity Research Track Nusrat Zahan North Carolina State University, Philipp Burckhardt Socket, Inc, Mikola Lysenko Socket, Inc, Feross Aboukhadijeh Socket, Inc, Laurie Williams North Carolina State University
14:45 15m Talk		Magika: AI-Powered Content-Type DetectionSecurity Research Track Yanick Fratantonio Google, Luca Invernizzi Google, Loua Farah Google, Kurt Thomas Google, Marina Zhang Google, Ange Albertini Google, Francois Galilee Google, Giancarlo Metitieri Google, Julien Cretin Google, Alex Petit-Bianco Google, David Tao Google, Elie Bursztein Google
15:00 15m Talk		Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDESecurity Research Track Benjamin Steenhoek Microsoft, Siva Sivaraman Microsoft, Renata Saldivar Gonzalez Microsoft, Yevhen Mohylevskyy Microsoft, Roshanak Zilouchian Moghaddam Microsoft, Wei Le Iowa State University
15:15 15m Talk		Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code NaturalnessSecurity Research Track Weisong Sun Nanjing University, Yuchen Chen Nanjing University, Mengzhe Yuan Nanjing University, Chunrong Fang Nanjing University, Zhenpeng Chen Nanyang Technological University, Chong Wang Nanyang Technological University, Yang Liu Nanyang Technological University, Baowen Xu State Key Laboratory for Novel Software Technology, Nanjing University, Zhenyu Chen Nanjing University Pre-print Media Attached

14:00 - 15:30	AI for Testing and QA 6Journal-first Papers / Research Track / New Ideas and Emerging Results (NIER) at 214 Chair(s): Ladan Tahvildari University of Waterloo

14:00 15m Talk		Treefix: Enabling Execution with a Tree of Prefixes Research Track Beatriz Souza Universität Stuttgart, Michael Pradel University of Stuttgart Pre-print
14:15 15m Talk		Assessing Evaluation Metrics for Neural Test Oracle Generation Journal-first Papers Jiho Shin York University, Hadi Hemmati York University, Moshi Wei York University, Song Wang York University
14:30 15m Talk		Enhancing Energy-Awareness in Deep Learning through Fine-Grained Energy Measurement Journal-first Papers Saurabhsingh Rajput Dalhousie University, Tim Widmayer University College London (UCL), Ziyuan Shang Nanyang Technological University, Maria Kechagia National and Kapodistrian University of Athens, Federica Sarro University College London, Tushar Sharma Dalhousie University
14:45 15m Talk		Studying the Impact of TensorFlow and PyTorch Bindings on Machine Learning Software Quality Journal-first Papers Hao Li Queen's University, Gopi Krishnan Rajbahadur Centre for Software Excellence, Huawei, Canada, Cor-Paul Bezemer University of Alberta Link to publication DOI Pre-print
15:00 15m Talk		Evaluating the Generalizability of LLMs in Automated Program Repair New Ideas and Emerging Results (NIER) Fengjie Li Tianjin University, Jiajun Jiang Tianjin University, Jiajun Sun Tianjin University, Hongyu Zhang Chongqing University Pre-print
15:15 15m Talk		How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study New Ideas and Emerging Results (NIER) Alejandro Velasco William & Mary, Daniel Rodriguez-Cardenas William & Mary, David Nader Palacio William & Mary, Lutfar Rahman Alif University of Dhaka, Denys Poshyvanyk William & Mary Pre-print

14:00 - 15:30	SE for AI with Quality 2Journal-first Papers / Research Track at 215 Chair(s): Romina Spalazzese Malmö University

14:00 15m Talk		Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep Learning ProjectsSE for AI Journal-first Papers Han Wang Monash University, Sijia Yu Jilin University, Chunyang Chen TU Munich, Burak Turhan University of Oulu, Xiaodong Zhu Jilin University Link to publication DOI Pre-print
14:15 15m Talk		Boundary State Generation for Testing and Improvement of Autonomous Driving SystemsSE for AI Journal-first Papers Matteo Biagiola Università della Svizzera italiana, Paolo Tonella USI Lugano DOI Pre-print
14:30 15m Talk		D3: Differential Testing of Distributed Deep Learning with Model GenerationSE for AI Journal-first Papers Jiannan Wang Purdue University, Hung Viet Pham York University, Qi Li , Lin Tan Purdue University, Yu Guo Meta Inc., Adnan Aziz Meta Inc., Erik Meijer
14:45 15m Talk		Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving SystemsSE for AI Journal-first Papers Mohammad Hossein Amini University of Ottawa, Shervin Naseri University of Ottawa, Shiva Nejati University of Ottawa
15:00 15m Talk		Reinforcement Learning for Online Testing of Autonomous Driving Systems: a Replication and Extension StudySE for AI Journal-first Papers Luca Giamattei Università di Napoli Federico II, Matteo Biagiola Università della Svizzera italiana, Roberto Pietrantuono Università di Napoli Federico II, Stefano Russo Università di Napoli Federico II, Paolo Tonella USI Lugano DOI Pre-print
15:15 15m Talk		Two is Better Than One: Digital Siblings to Improve Autonomous Driving TestingSE for AI Journal-first Papers Matteo Biagiola Università della Svizzera italiana, Andrea Stocco Technical University of Munich, fortiss, Vincenzo Riccio University of Udine, Paolo Tonella USI Lugano DOI Pre-print

15:30 - 16:00	BreakCatering at Canada Hall 3 plus Foyer

15:30 30m Break		Friday Afternoon Break Catering

	16:00 - 22:00	Quiet Room Friday 4PMSocial, Networking and Special Rooms at 202

16:00 - 17:30	ProcessNew Ideas and Emerging Results (NIER) / Journal-first Papers / Research Track / SE In Practice (SEIP) at 203 Chair(s): Luigi Benedicenti University of New Brunswick

16:00 15m Talk		Full Line Code Completion: Bringing AI to Desktop SE In Practice (SEIP) Anton Semenkin JetBrains, Vitaliy Bibaev JetBrains, Yaroslav Sokolov JetBrains, Kirill Krylov JetBrains, Alexey Kalina JetBrains, Anna Khannanova JetBrains, Danila Savenkov JetBrains, Darya Rovdo JetBrains, Igor Davidenko JetBrains, Kirill Karnaukhov JetBrains, Maxim Vakhrushev JetBrains, Mikhail Kostyukov JetBrains, Mikhail Podvitskii JetBrains, Petr Surkov JetBrains, Yaroslav Golubev JetBrains Research, Nikita Povarov JetBrains, Timofey Bryksin JetBrains Research Pre-print
16:15 15m Talk		Automated Accessibility Analysis of Dynamic Content Changes on Mobile Apps Research Track Forough Mehralian University of California at Irvine, Ziyao He University of California, Irvine, Sam Malek University of California at Irvine
16:30 15m Talk		Qualitative Surveys in Software Engineering Research: Definition, Critical Review, and GuidelinesResearch Methods Journal-first Papers Jorge Melegati Free University of Bozen-Bolzano, Kieran Conboy University of Galway, Daniel Graziotin University of Hohenheim Link to publication DOI
16:45 15m Talk		VulNet: Towards improving vulnerability management in the Maven ecosystemSecurity Journal-first Papers Zeyang Ma Concordia University, Shouvick Mondal IIT Gandhinagar, Tse-Hsun (Peter) Chen Concordia University, Haoxiang Zhang Centre for Software Excellence at Huawei Canada, Ahmed E. Hassan Queen’s University, Zeyang Ma Concordia University
17:00 15m Talk		Energy-Aware Software Testing New Ideas and Emerging Results (NIER) Roberto Verdecchia University of Florence, Emilio Cruciani European University of Rome, Antonia Bertolino Gran Sasso Science Institute, Breno Miranda Centro de Informática at Universidade Federal de Pernambuco Pre-print
17:15 7m Talk		SusDevOps: Promoting Sustainability to a First Principle in Software Delivery New Ideas and Emerging Results (NIER) Istvan David McMaster University / McMaster Centre for Software Certification (McSCert)

16:00 - 17:30	Testing and QA 6Journal-first Papers / Research Track / Demonstrations at 205 Chair(s): Majid Babaei McGill University

16:00 15m Talk		Characterizing Timeout Builds in Continuous Integration Journal-first Papers Nimmi Weeraddana University of Waterloo, Mahmoud Alfadel University of Calgary, Shane McIntosh University of Waterloo
16:15 15m Talk		GeMTest: A General Metamorphic Testing Framework Demonstrations Simon Speth Technical University of Munich, Alexander Pretschner TU Munich Pre-print
16:30 15m Talk		Mole: Efficient Crash Reproduction in Android Applications With Enforcing Necessary UI Events Journal-first Papers Maryam Masoudian Sharif University of Technology, Hong Kong University of Science and Technology (HKUST), Heqing Huang City University of Hong Kong, Morteza Amini Sharif University of Technology, Charles Zhang Hong Kong University of Science and Technology
16:45 15m Talk		History-Driven Fuzzing for Deep Learning Libraries Journal-first Papers Nima Shiri Harzevili York University, Mohammad Mahdi Mohajer York University, Moshi Wei York University, Hung Viet Pham York University, Song Wang York University
17:00 15m Talk		Towards a Cognitive Model of Dynamic Debugging: Does Identifier Construction Matter? Journal-first Papers Danniell Hu University of Michigan, Priscila Santiesteban University of Michigan, Madeline Endres University of Massachusetts Amherst, Westley Weimer University of Michigan
17:15 15m Talk		Janus: Detecting Rendering Bugs in Web Browsers via Visual Delta Consistency Research Track Chijin Zhou Tsinghua University, Quan Zhang Tsinghua University, Bingzhou Qian National University of Defense Technology, Yu Jiang Tsinghua University

16:00 - 17:30	Human and Social for AIResearch Track / SE in Society (SEIS) / SE In Practice (SEIP) at 206 plus 208 Chair(s): Ramiro Liscano Ontario Tech University

16:00 15m Talk		ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet? Research Track Salma Begum Tamanna University of Calgary, Canada, Gias Uddin York University, Canada, Song Wang York University, Lan Xia IBM, Canada, Longyu Zhang IBM, Canada
16:15 15m Talk		Navigating the Testing of Evolving Deep Learning Systems: An Exploratory Interview Study Research Track Hanmo You Tianjin University, Zan Wang Tianjin University, Bin Lin Hangzhou Dianzi University, Junjie Chen Tianjin University
16:30 15m Talk		An Empirical Study on Decision-Making Aspects in Responsible Software Engineering for AI SE In Practice (SEIP) Lekshmi Murali Rani Chalmers University of Technology and University of Gothenburg, Sweden, Faezeh Mohammadi Chalmers University of Technology and University of Gothenburg, Sweden, Robert Feldt Chalmers \| University of Gothenburg, Richard Berntsson Svensson Chalmers \| University of Gothenburg Pre-print
16:45 15m Talk		Curious, Critical Thinker, Empathetic, and Ethically Responsible: Essential Soft Skills for Data Scientists in Software Engineering SE in Society (SEIS) Matheus de Morais Leça University of Calgary, Ronnie de Souza Santos University of Calgary
17:00 15m Talk		Multi-Modal LLM-based Fully-Automated Training Dataset Generation Software Platform for Mathematics Education SE in Society (SEIS) Minjoo Kim Sookmyung Women's University, Tae-Hyun Kim Sookmyung Women's University, Jaehyun Chung Korea University, Hyunseok Choi Korea University, Seokhyeon Min Korea University, Joon-Ho Lim Tutorus Labs, Soohyun Park Sookmyung Women's University
17:15 15m Talk		What Does a Software Engineer Look Like? Exploring Societal Stereotypes in LLMs SE in Society (SEIS) Muneera Bano CSIRO's Data61, Hashini Gunatilake Monash University, Rashina Hoda Monash University

16:00 - 17:30	Mobile SoftwareResearch Track at 207 Chair(s): Mattia Fazzini University of Minnesota

16:00 15m Talk		EP-Detector: Automatic Detection of Error-prone Operation Anomalies in Android ApplicationsSecurity Research Track Chenkai Guo Nankai University, China, Qianlu Wang College of Cyber Science, Nankai University, Naipeng Dong The University of Queensland, Australia, Lingling Fan Nankai University, Tianhong Wang College of Computer Science, Nankai University, Weijie Zhang College of Computer Science, Nankai University, EnBao Chen College of Cyber Science, Nankai University, Zheli Liu Nankai University, Lu Yu National University of Defense Technology; Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation
16:15 15m Talk		Mobile Application Coverage: The 30% Curse and Ways Forward Research Track Faridah Akinotcho University of British Columbia, Canada, Lili Wei McGill University, Julia Rubin The University of British Columbia Pre-print
16:30 15m Talk		The Design Smells Breaking the Boundary between Android Variants and AOSP Research Track Wuxia Jin Xi'an Jiaotong University, Jiaowei Shang Xi'an Jiaotong University, Jianguo Zheng Xi'an Jiaotong University, Mengjie Sun Xi’an Jiaotong University, Zhenyu Huang Honor Device Co., Ltd., Ming Fan Xi'an Jiaotong University, Ting Liu Xi'an Jiaotong University
16:45 15m Talk		Scenario-Driven and Context-Aware Automated Accessibility Testing for Android Apps Research Track Yuxin Zhang Tianjin University, Sen Chen Nankai University, Xiaofei Xie Singapore Management University, Zibo Liu College of Intelligence and Computing, Tianjin University, Lingling Fan Nankai University
17:00 15m Talk		TacDroid: Detection of Illicit Apps through Hybrid Analysis of UI-based Transition Graphs Research Track Yanchen Lu Zhejiang University, Hongyu Lin Zhejiang University, Zehua He Zhejiang University, Haitao Xu Zhejiang University, Zhao Li Hangzhou Yugu Technology, Shuai Hao Old Dominion University, Liu Wang Beijing University of Posts and Telecommunications, Haoyu Wang Huazhong University of Science and Technology, Kui Ren Zhejiang University
17:15 15m Talk		PacDroid: A Pointer-Analysis-Centric Framework for Security Vulnerabilities in Android AppsSecurityAward Winner Best Artifact Research Track Menglong Chen Nanjing University, Tian Tan Nanjing University, Minxue Pan Nanjing University, Yue Li Nanjing University

16:00 - 17:30	Security and QAResearch Track / Journal-first Papers / SE In Practice (SEIP) at 210 Chair(s): Nafiseh Kahani Carleton University

16:00 15m Talk		ROSA: Finding Backdoors with FuzzingSecurityAward Winner Best Artifact Research Track Dimitri Kokkonis Université Paris-Saclay, CEA, List, Michaël Marcozzi Université Paris-Saclay, CEA, List, Emilien Decoux Université Paris-Saclay, CEA List, Stefano Zacchiroli LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France Link to publication DOI Pre-print Media Attached File Attached
16:15 15m Talk		Analyzing the Feasibility of Adopting Google's Nonce-Based CSP Solutions on WebsitesSecurity Research Track Mengxia Ren Colorado School of Mines, Anhao Xiang Colorado School of Mines, Chuan Yue Colorado School of Mines
16:30 15m Talk		Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural ModelsSecurityAward Winner Research Track Lizhi Liao Memorial University of Newfoundland, Simon Eismann University of Würzburg, Heng Li Polytechnique Montréal, Cor-Paul Bezemer University of Alberta, Diego Elias Costa Concordia University, Canada, André van Hoorn University of Hamburg, Germany, Weiyi Shang University of Waterloo
16:45 15m Talk		Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic DatasetsSecurity Journal-first Papers Partha Chakraborty University of Waterloo, Krishna Kanth Arumugam University of Waterloo, Mahmoud Alfadel University of Calgary, Mei Nagappan University of Waterloo, Shane McIntosh University of Waterloo
17:00 15m Talk		Sunflower: Enhancing Linux Kernel Fuzzing via Exploit-Driven Seed GenerationSecurity SE In Practice (SEIP) Qiang Zhang Hunan University, Yuheng Shen Tsinghua University, Jianzhong Liu Tsinghua University, Yiru Xu Tsinghua University, Heyuan Shi Central South University, Yu Jiang Tsinghua University, Wanli Chang College of Computer Science and Electronic Engineering, Hunan University
17:15 15m Talk		Practical Object-Level Sanitizer With Aggregated Memory Access and Custom AllocatorSecurity Research Track Xiaolei wang National University of Defense Technology, Ruilin Li National University of Defense Technology, Bin Zhang National University of Defense Technology, Chao Feng National University of Defense Technology, Chaojing Tang National University of Defense Technology

16:00 - 17:30	AI for ProcessSE In Practice (SEIP) / Demonstrations / New Ideas and Emerging Results (NIER) / Research Track at 212 Chair(s): Keheliya Gallaba Centre for Software Excellence, Huawei Canada

16:00 15m Talk		OptCD: Optimizing Continuous Development Demonstrations Talank Baral George Mason University, Emirhan Oğul Middle East Technical University, Shanto Rahman The University of Texas at Austin, August Shi The University of Texas at Austin, Wing Lam George Mason University
16:15 15m Talk		LLMs as Evaluators: A Novel Approach to Commit Message Quality Assessment New Ideas and Emerging Results (NIER) Abhishek Kumar Indian Institute of Technology, Kharagpur, Sandhya Sankar Indian Institute of Technology, Kharagpur, Sonia Haiduc Florida State University, Partha Pratim Das Indian Institute of Technology, Kharagpur, Partha Pratim Chakrabarti Indian Institute of Technology, Kharagpur
16:30 15m Talk		Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings SE In Practice (SEIP) Petr Tsvetkov JetBrains Research, Aleksandra Eliseeva JetBrains Research, Danny Dig University of Colorado Boulder, JetBrains Research, Alexander Bezzubov JetBrains, Yaroslav Golubev JetBrains Research, Timofey Bryksin JetBrains Research, Yaroslav Zharov JetBrains Research Pre-print
16:45 15m Talk		Enhancing Differential Testing: LLM-Powered Automation in Release Engineering SE In Practice (SEIP) Ajay Krishna Vajjala George Mason University, Arun Krishna Vajjala George Mason University, Carmen Badea Microsoft Research, Christian Bird Microsoft Research, Robert DeLine Microsoft Research, Jason Entenmann Microsoft Research, Nicole Forsgren Microsoft Research, Aliaksandr Hramadski Microsoft, Sandeepan Sanyal Microsoft, Oleg Surmachev Microsoft, Thomas Zimmermann University of California, Irvine, Haris Mohammad Microsoft, Jade D'Souza Microsoft, Mikhail Demyanyuk Microsoft
17:00 15m Talk		How much does AI impact development speed? An enterprise-based randomized controlled trial SE In Practice (SEIP) Elise Paradis Google, Inc, Kate Grey Google, Quinn Madison Google, Daye Nam Google, Andrew Macvean Google, Inc., Nan Zhang Google, Ben Ferrari-Church Google, Satish Chandra Google, Inc
17:15 15m Talk		Using Reinforcement Learning to Sustain the Performance of Version Control Repositories New Ideas and Emerging Results (NIER) Shane McIntosh University of Waterloo, Luca Milanesio GerritForge Inc., Antonio Barone GerritForge Inc., Jacek Centkowski GerritForge Inc., Marcin Czech GerritForge Inc., Fabio Ponciroli GerritForge Inc. Pre-print

16:00 - 17:30	AI for Security 3Research Track / New Ideas and Emerging Results (NIER) at 213 Chair(s): Tien N. Nguyen University of Texas at Dallas

16:00 15m Talk		GVI: Guided Vulnerability Imagination for Boosting Deep Vulnerability DetectorsSecurity Research Track Heng Yong Nanjing University, Zhong Li , Minxue Pan Nanjing University, Tian Zhang Nanjing University, Jianhua Zhao Nanjing University, China, Xuandong Li Nanjing University
16:15 15m Talk		Decoding Secret Memorization in Code LLMs Through Token-Level CharacterizationSecurity Research Track Yuqing Nie Beijing University of Posts and Telecommunications, Chong Wang Nanyang Technological University, Kailong Wang Huazhong University of Science and Technology, Guoai Xu Harbin Institute of Technology, Shenzhen, Guosheng Xu Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications, Haoyu Wang Huazhong University of Science and Technology
16:30 15m Talk		Are We Learning the Right Features? A Framework for Evaluating DL-Based Software Vulnerability Detection SolutionsSecurity Research Track Satyaki Das University of Southern California, Syeda Tasnim Fabiha University of Southern California, Saad Shafiq University of Southern California, Nenad Medvidović University of Southern California Pre-print Media Attached File Attached
16:45 15m Talk		Boosting Static Resource Leak Detection via LLM-based Resource-Oriented Intention InferenceSecurity Research Track Chong Wang Nanyang Technological University, Jianan Liu Fudan University, Xin Peng Fudan University, Yang Liu Nanyang Technological University, Yiling Lou Fudan University
17:00 15m Talk		Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance LearningSecurity Research Track Minghua He Peking University, Tong Jia Institute for Artificial Intelligence, Peking University, Beijing, China, Chiming Duan Peking University, Huaqian Cai Peking University, Ying Li School of Software and Microelectronics, Peking University, Beijing, China, Gang Huang Peking University
17:15 7m Talk		Towards Early Warning and Migration of High-Risk Dormant Open-Source Software DependenciesSecurity New Ideas and Emerging Results (NIER) Zijie Huang Shanghai Key Laboratory of Computer Software Testing and Evaluation, Lizhi Cai Shanghai Key Laboratory of Computer Software Testing & Evaluating, Shanghai Software Center, Xuan Mao Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China, Kang Yang Shanghai Key Laboratory of Computer Software Testing and Evaluating, Shanghai Development Center of Computer Software Technology

16:00 - 17:30	Quantum SEJournal-first Papers / Research Track at 214 Chair(s): Dennis Mancl MSWX Software Experts

16:00 15m Talk		QuanTest: Entanglement-Guided Testing of Quantum Neural Network SystemsQuantum Journal-first Papers Jinjing Shi Central South University, Zimeng Xiao Central South University, Heyuan Shi Central South University, Yu Jiang Tsinghua University, Xuelong LI China Telecom Link to publication
16:15 15m Talk		Quantum Approximate Optimization Algorithm for Test Case OptimizationQuantum Journal-first Papers Xinyi Wang Simula Research Laboratory; University of Oslo, Shaukat Ali Simula Research Laboratory and Oslo Metropolitan University, Tao Yue Beihang University, Paolo Arcaini National Institute of Informatics
16:30 15m Talk		Testing Multi-Subroutine Quantum Programs: From Unit Testing to Integration TestingQuantum Journal-first Papers Peixun Long Institute of High Energy Physics, Chinese Academy of Science, Jianjun Zhao Kyushu University
16:45 15m Talk		Mitigating Noise in Quantum Software Testing Using Machine LearningQuantum Journal-first Papers Asmar Muqeet Simula Research Laboratory and University of Oslo, Tao Yue Beihang University, Shaukat Ali Simula Research Laboratory and Oslo Metropolitan University, Paolo Arcaini National Institute of Informatics , Asmar Muqeet Simula Research Laboratory and University of Oslo
17:00 15m Talk		Test Case Minimization with Quantum AnnealingQuantum Journal-first Papers Xinyi Wang Simula Research Laboratory; University of Oslo, Asmar Muqeet Simula Research Laboratory and University of Oslo, Tao Yue Beihang University, Shaukat Ali Simula Research Laboratory and Oslo Metropolitan University, Paolo Arcaini National Institute of Informatics
17:15 7m Talk		When Quantum Meets Classical: Characterizing Hybrid Quantum-Classical Issues Discussed in Developer ForumsQuantum Research Track Jake Zappin William and Mary, Trevor Stalnaker William & Mary, Oscar Chaparro William & Mary, Denys Poshyvanyk William & Mary

16:00 - 17:30	SE for AI with Quality 3Research Track / SE In Practice (SEIP) at 215 Chair(s): Sumon Biswas Case Western Reserve University

16:00 15m Talk		Improved Detection and Diagnosis of Faults in Deep Neural Networks Using Hierarchical and Explainable ClassificationSE for AI Research Track Sigma Jahan Dalhousie University, Mehil Shah Dalhousie University, Parvez Mahbub Dalhousie University, Masud Rahman Dalhousie University Pre-print
16:15 15m Talk		Lightweight Concolic Testing via Path-Condition Synthesis for Deep Learning LibrariesSE for AI Research Track Sehoon Kim , Yonghyeon Kim UNIST, Dahyeon Park UNIST, Yuseok Jeon UNIST, Jooyong Yi UNIST, Mijung Kim UNIST
16:30 15m Talk		Mock Deep Testing: Toward Separate Development of Data and Models for Deep LearningSE for AI Research Track Ruchira Manke Tulane University, USA, Mohammad Wardat Oakland University, USA, Foutse Khomh Polytechnique Montréal, Hridesh Rajan Tulane University
16:45 15m Talk		RUG: Turbo LLM for Rust Unit Test GenerationSE for AI Research Track Xiang Cheng Georgia Institute of Technology, Fan Sang Georgia Institute of Technology, Yizhuo Zhai Georgia Institute of Technology, Xiaokuan Zhang George Mason University, Taesoo Kim Georgia Institute of Technology Pre-print Media Attached File Attached
17:00 15m Talk		Test Input Validation for Vision-based DL Systems: An Active Learning ApproachSE for AI SE In Practice (SEIP) Delaram Ghobari University of Ottawa, Mohammad Hossein Amini University of Ottawa, Dai Quoc Tran SmartInsideAI Company Ltd. and Sungkyunkwan University, Seunghee Park SmartInsideAI Company Ltd. and Sungkyunkwan University, Shiva Nejati University of Ottawa, Mehrdad Sabetzadeh University of Ottawa Pre-print
17:15 15m Talk		SEMANTIC CODE FINDER: An Efficient Semantic Search Framework for Large-Scale Codebases SE In Practice (SEIP) daeha ryu Innovation Center, Samsung Electronics, Seokjun Ko Samsung Electronics Co., Eunbi Jang Innovation Center, Samsung Electronics, jinyoung park Innovation Center, Samsung Electronics, myunggwan kim Innovation Center, Samsung Electronics, changseo park Innovation Center, Samsung Electronics

16:00 - 17:30	BlockchainResearch Track at Canada Hall 1 and 2 Chair(s): Daniel Amyot University of Ottawa

16:00 15m Talk		An Empirical Study of Proxy Smart Contracts at the Ethereum Ecosystem ScaleBlockchain Research Track Mengya Zhang The Ohio State University, Preksha Shukla George Mason University, Wuqi Zhang Mega Labs, Zhuo Zhang Purdue University, Pranav Agrawal George Mason University, Zhiqiang Lin The Ohio State University, Xiangyu Zhang Purdue University, Xiaokuan Zhang George Mason University
16:15 15m Talk		Demystifying and Detecting Cryptographic Defects in Ethereum Smart ContractsBlockchainAward Winner Research Track Jiashuo Zhang Peking University, China, Yiming Shen Sun Yat-sen University, Jiachi Chen Sun Yat-sen University, Jianzhong Su Sun Yat-sen University, Yanlin Wang Sun Yat-sen University, Ting Chen University of Electronic Science and Technology of China, Jianbo Gao Peking University, Zhong Chen
16:30 15m Talk		Chord: Towards a Unified Detection of Blockchain Transaction Parallelism BugsBlockchain Research Track Yuanhang Zhou Tsinghua University, Zhen Yan Tsinghua University, Yuanliang Chen Tsinghua University, Fuchen Ma Tsinghua University, Ting Chen University of Electronic Science and Technology of China, Yu Jiang Tsinghua University
16:45 15m Talk		Definition and Detection of Centralization Defects in Smart ContractsBlockchain Research Track Zewei Lin Sun Yat-sen University, Jiachi Chen Sun Yat-sen University, Jiajing Wu Sun Yat-sen University, Weizhe Zhang Harbin Institute of Technology, Zibin Zheng Sun Yat-sen University
17:00 15m Talk		Fork State-Aware Differential Fuzzing for Blockchain Consensus ImplementationsBlockchain Research Track Won Hoi Kim KAIST, Hocheol Nam KAIST, Muoi Tran ETH Zurich, Amin Jalilov KAIST, Zhenkai Liang National University of Singapore, Sang Kil Cha KAIST, Min Suk Kang KAIST DOI Pre-print
17:15 15m Talk		Code Cloning in Solidity Smart Contracts: Prevalence, Evolution, and Impact on DevelopmentBlockchain Research Track Ran Mo Central China Normal University, Haopeng Song Central China Normal University, Wei Ding Central China Normal University, Chaochao Wu Central China Normal University

Accepted Papers

	Title
	3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers Research Track Sarah Fakhoury, Markus Kuppe, Shuvendu K. Lahiri, Tahina Ramananandro, Nikhil Swamy Pre-print
	A Catalog of Micro Frontends Anti-patterns Research Track Nabson Silva, Eriky Rodrigues, Tayana Conte
	Accessibility Issues in Ad-Driven Web Applications Research Track Abdul Haddi Amjad, Muhammad Danish, Bless Jah, Muhammad Ali Gulzar
	Accounting for Missing Events in Statistical Information Leakage AnalysisSecurity Research Track Seongmin Lee, Shreyas Minocha, Marcel Böhme
	ADAMAS: Adaptive Domain-Aware Performance Anomaly Detection in Cloud Service Systems Research Track Wenwei Gu, Jiazhen Gu, Jinyang Liu, Zhuangbin Chen, Jianping Zhang, Jinxi Kuang, Cong Feng, Yongqiang Yang, Michael Lyu
	A Differential Testing Framework to Identify Critical AV Failures Leveraging Arbitrary Inputs Research Track Trey Woodlief, Carl Hildebrandt, Sebastian Elbaum
	A First Look at Conventional Commits Classification Research Track Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, Hui Liu
	A Large-Scale Study of Model Integration in ML-Enabled Software SystemsSE for AI Research Track Yorick Sens, Henriette Knopp, Sven Peldszus, Thorsten Berger Pre-print
	Aligning the Objective of LLM-based Program Repair Research Track Junjielong Xu, Ying Fu, Shin Hwei Tan, Pinjia He Pre-print
	A Little Goes a Long Way: Tuning Configuration Selection for Continuous Kernel Fuzzing Research Track Sanan Hasanov, Stefan Nagy, Paul Gazzillo
	A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs Research Track Myeongsoo Kim, Tyler Stennett, Saurabh Sinha, Alessandro Orso
	A Multiple Representation Transformer with Optimized Abstract Syntax Tree for Efficient Code Clone Detection Research Track TianChen Yu, Li Yuan, Liannan Lin, Hongkui He
	Analyzing the Feasibility of Adopting Google's Nonce-Based CSP Solutions on WebsitesSecurity Research Track Mengxia Ren, Anhao Xiang, Chuan Yue
	An Empirical Study of Proxy Smart Contracts at the Ethereum Ecosystem ScaleBlockchain Research Track Mengya Zhang, Preksha Shukla, Wuqi Zhang, Zhuo Zhang, Pranav Agrawal, Zhiqiang Lin, Xiangyu Zhang, Xiaokuan Zhang
	An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We? Research Track Hyunjae Suh, Mahan Tafreshipour, Jiawei Li, Adithya Bhattiprolu, Iftekhar Ahmed
	An Empirical Study on Commit Message Generation using LLMs via In-Context Learning Research Track Yifan Wu, Yunpeng Wang, Ying Li, Wei Tao, Siyu Yu, Haowen Yang, Wei Jiang, Jianguo Li Pre-print
	An Empirical Study on Package-Level Deprecation in Python Ecosystem Research Track Zhiqing Zhong, Shilin He, Haoxuan Wang, BoXi Yu, Haowen Yang, Pinjia He
	An Empirical Study on Reproducible Packaging in Open-Source Ecosystems Research Track Giacomo Benedetti, Oreofe Solarin, Courtney Miller, Greg Tystahl, William Enck, Christian Kästner, Alexandros Kapravelos, Alessio Merlo, Luca Verderame
	An Exploratory Study of ML Sketches and Visual Code Assistants Research Track Luis F. Gomes, Vincent J. Hellendoorn, Jonathan Aldrich, Rui Abreu
	An Exploratory Study on the Engineering of Security FeaturesSecurity Research Track Kevin Hermann, Sven Peldszus, Jan-Philipp Steghöfer, Thorsten Berger Pre-print
	An Extensive Empirical Study of Nondeterministic Behavior in Static Analysis Tools Research Track Miao Miao, Austin Mordahl, Dakota Soles, Alice Beideck, Shiyi Wei
	An LLM-Based Agent-Oriented Approach for Automated Code Design Issue Localization Research Track Fraol Batole, David OBrien, Tien N. Nguyen, Robert Dyer, Hridesh Rajan
	Answering User Questions about Machine Learning Models through Standardized Model CardsSE for AI Research Track Tajkia Rahman Toma, Balreet Grewal, Cor-Paul Bezemer Pre-print
	Are LLMs Correctly Integrated into Software Systems?SE for AI Research Track Yuchen Shao, Yuheng Huang, Jiawei Shen, Lei Ma, Ting Su, Chengcheng Wan
	Are We Learning the Right Features? A Framework for Evaluating DL-Based Software Vulnerability Detection SolutionsSecurity Research Track Satyaki Das, Syeda Tasnim Fabiha, Saad Shafiq, Nenad Medvidović Pre-print Media Attached File Attached
	AssetHarvester: A Static Analysis Tool for Detecting Secret-Asset Pairs in Software ArtifactsSecurity Research Track Setu Kumar Basak, K. Virgil English, Ken Ogura, Vitesh Kambara, Bradley Reaves, Laurie Williams
	A Study of Undefined Behavior Across Foreign Function Boundaries in Rust LibrariesSecurity Research Track Ian McCormack, Joshua Sunshine, Jonathan Aldrich Pre-print
	A Tale of Two DL Cities: When Library Tests Meet CompilerSE for AI Research Track Qingchao Shen, Yongqiang Tian, Haoyang Ma, Junjie Chen, Lili Huang, Ruifeng Fu, Shing-Chi Cheung, Zan Wang
	A Test Oracle for Reinforcement Learning Software based on Lyapunov Stability Control TheorySE for AIAward Winner Research Track Shiyu Zhang, Haoyang Song, Qixin Wang, Henghua Shen, Yu Pei
	Automated Accessibility Analysis of Dynamic Content Changes on Mobile Apps Research Track Forough Mehralian, Ziyao He, Sam Malek
	Automated Generation of Accessibility Test Reports from Recorded User TranscriptsAward Winner Research Track Syed Fatiul Huq, Mahan Tafreshipour, Kate Kalcevich, Sam Malek
	Automated Test Generation For Smart Contracts via On-Chain Test Case Augmentation and MigrationBlockchain Research Track Jiashuo Zhang, Jiachi Chen, John Grundy, Jianbo Gao, Yanlin Wang, Ting Chen, Zhi Guan, Zhong Chen Pre-print
	Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly DetectionSecurity Research Track Qiaolin Qin, Heng Li, Ettore Merlo, Maxime Lamothe Pre-print
	Automating a Complete Software Test Process Using LLMs: An Automotive Case Study Research Track Shuai Wang, Yinan Yu, Robert Feldt, Dhasarathy Parthasarathy Pre-print
	BDefects4NN: A Backdoor Defect Database for Controlled Localization Studies in Neural Networks Research Track Yisong Xiao, Aishan Liu, Xinwei Zhang, Tianyuan Zhang, Li Tianlin, Siyuan Liang, Xianglong Liu, Yang Liu, Dacheng Tao
	Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers Research Track Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu
	Boosting Code-line-level Defect Prediction with Spectrum Information and Causality Analysis Research Track Shiyu Sun, Yanhui Li, Lin Chen, Yuming Zhou, Jianhua Zhao
	Boosting Path-Sensitive Value Flow Analysis via Removal of Redundant Summaries Research Track Yongchao WANG, Yuandao Cai, Charles Zhang Pre-print
	Boosting Static Resource Leak Detection via LLM-based Resource-Oriented Intention InferenceSecurity Research Track Chong Wang, Jianan Liu, Xin Peng, Yang Liu, Yiling Lou
	BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries Research Track Wen Zhang, Botang Xiao, Qingchen Kong, Le Guan, Wenwen Wang
	Calibration and Correctness of Language Models for Code Research Track Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Rafiqul Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, Toufique Ahmed Pre-print
	Can an LLM find its way around a Spreadsheet? Research Track Cho-Ting Lee, Andrew Neeser, Shengzhe Xu, Jay Katyan, Patrick Cross, Sharanya Pathakota, Marigold Norman, John C. Simeone, Jaganmohan Chandrasekaran, Naren Ramakrishnan
	ChatGPT-Based Test Generation for Refactoring Engines Enhanced by Feature Analysis on Examples Research Track Chunhao Dong, Yanjie Jiang, Yuxia Zhang, Yang Zhang, Hui Liu
	ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet? Research Track Salma Begum Tamanna, Gias Uddin, Song Wang, Lan Xia, Longyu Zhang
	Chord: Towards a Unified Detection of Blockchain Transaction Parallelism BugsBlockchain Research Track Yuanhang Zhou, Zhen Yan, Yuanliang Chen, Fuchen Ma, Ting Chen, Yu Jiang
	Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDESecurity Research Track Benjamin Steenhoek, Siva Sivaraman, Renata Saldivar Gonzalez, Yevhen Mohylevskyy, Roshanak Zilouchian Moghaddam, Wei Le
	ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs Research Track Hongyan Gao, Yibiao Yang, Maolin Sun, Jiangchang Wu, Yuming Zhou, Baowen Xu
	COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge Research Track Yichen LI, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu, Michael Lyu
	Code Cloning in Solidity Smart Contracts: Prevalence, Evolution, and Impact on DevelopmentBlockchain Research Track Ran Mo, Haopeng Song, Wei Ding, Chaochao Wu
	Code Comment Inconsistency Detection and Rectification Using a Large Language Model Research Track Guoping Rong, YongdaYu , Song Liu, Xin Tan, Tianyi Zhang, Haifeng Shen, Jidong Hu
	CodeImprove: Program Adaptation for Deep Code ModelsSE for AI Research Track Ravishka Shemal Rathnasuriya, zijie zhao, Wei Yang
	Code Today, Deadline Tomorrow: Procrastination Among Software Developers Research Track Zeinabsadat Saghi, Thomas Zimmermann, Souti Chattopadhyay
	Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with JustificationsBlockchainSecurity Research Track Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, Yang Liu
	Coni: Detecting Database Connector Bugs via State-Aware Test Case Generation Research Track Wenqian Deng, Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang
	ConsCS: Effective and Efficient Verification of Circom CircuitsFormal Methods Research Track Jinan Jiang, Xinghao Peng, Jinzhao Chu, Xiapu Luo
	Constrained LTL Specification Learning from ExamplesFormal Methods Research Track Changjian Zhang, Parv Kapoor, Ian Dardik, Leyi Cui, Romulo Meira-Goes, David Garlan, Eunsuk Kang
	Context Conquers Parameters: Outperforming Proprietary LLM in Commit Message Generation Research Track Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour
	Cooperative Software Verification via Dynamic Program SplittingSecurity Research Track Cedric Richter, Marek Chalupa, Marie-Christine Jakobs, Heike Wehrheim
	Critical Variable State-Aware Directed Greybox Fuzzing Research Track Xu Chen, Ningning Cui, Zhe Pan, Liwei Chen, Gang Shi, Dan Meng
	Datalog-Based Language-Agnostic Change Impact Analysis for Microservices Research Track Qingkai Shi, Xiaoheng Xie, Xianjin Fu, Peng Di, Huawei Li, Ang Zhou, Gang Fan
	Decictor: Towards Evaluating the Robustness of Decision-Making in Autonomous Driving Systems Research Track Mingfei Cheng, Xiaofei Xie, Yuan Zhou, Junjie Wang, Guozhu Meng, Kairui Yang
	Decoding Secret Memorization in Code LLMs Through Token-Level CharacterizationSecurity Research Track Yuqing Nie, Chong Wang, Kailong Wang, Guoai Xu, Guosheng Xu, Haoyu Wang
	Decoding the Issue Resolution Process In Practice via Issue Report Analysis: A Case Study of Firefox Research Track Antu Saha, Oscar Chaparro Pre-print
	Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword? Research Track Rosalia Tufano, Alberto Martin-Lopez, Ahmad Tayeb, Ozren Dabic, Sonia Haiduc, Gabriele Bavota
	Definition and Detection of Centralization Defects in Smart ContractsBlockchain Research Track Zewei Lin, Jiachi Chen, Jiajing Wu, Weizhe Zhang, Zibin Zheng
	Demystifying and Detecting Cryptographic Defects in Ethereum Smart ContractsBlockchainAward Winner Research Track Jiashuo Zhang, Yiming Shen, Jiachi Chen, Jianzhong Su, Yanlin Wang, Ting Chen, Jianbo Gao, Zhong Chen
	DesignRepair: Dual-Stream Design Guideline-Aware Frontend Repair with Large Language Models Research Track Mingyue Yuan, Jieshan Chen, Zhenchang Xing, Aaron Quigley, Yuyu Luo, Tianqi Luo, Gelareh Mohammadi, Qinghua Lu, Liming Zhu
	Dissecting Global Search: A Simple yet Effective Method to Boost Individual Discrimination Testing and RepairSE for AI Research Track Lili Quan, Li Tianlin, Xiaofei Xie, Zhenpeng Chen, Sen Chen, Lingxiao Jiang, Xiaohong Li Pre-print
	Distilled Lifelong Self-Adaptation for Configurable Systems Research Track Yulong Ye, Tao Chen, Miqing Li Pre-print
	Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning SoftwareSecuritySE for AI Research Track Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Federica Sarro, Yang Liu Pre-print
	Dockerfile Flakiness: Characterization and Repair Research Track Taha Shabani, Noor Nashid, Parsa Alian, Ali Mesbah Pre-print
	Does GenAI Make Usability Testing Obsolete?Award Winner Research Track Ali Ebrahimi Pourasad, Walid Maalej Pre-print
	DPFuzzer: Discovering Safety Critical Vulnerabilities for Drone Path PlannersSecurity Research Track Yue Wang, Chao Yang, Xiaodong Zhang, Yuwanqi Deng, Jianfeng Ma
	Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural ModelsSecurityAward Winner Research Track Lizhi Liao, Simon Eismann, Heng Li, Cor-Paul Bezemer, Diego Elias Costa, André van Hoorn, Weiyi Shang
	EffBT: An Efficient Behavior Tree Reactive Synthesis and Execution FrameworkFormal Methods Research Track ziji wu, yu huang, peishan huang, shanghua wen, minglong li, Ji Wang
	Efficient Domain Augmentation for Autonomous Driving Testing Using Diffusion Models Research Track Luciano Baresi, Davide Yi Xian Hu, Andrea Stocco, Paolo Tonella Pre-print
	Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding Research Track Yifeng Di, Tianyi Zhang
	Enhancing Fault Localization in Industrial Software Systems via Contrastive Learning Research Track Chun Li, Hui Li, Zhong Li, Minxue Pan, Xuandong Li
	Enhancing The Open Network: Definition and Automated Detection of Smart Contract DefectsBlockchainSecurityAward Winner Research Track Hao Song, Teng Li, Jiachi Chen, Ting Chen, Beibei Li, Zhangyan Lin, Yi Lu, Pan Li, Xihan Zhou
	EP-Detector: Automatic Detection of Error-prone Operation Anomalies in Android ApplicationsSecurity Research Track Chenkai Guo, Qianlu Wang, Naipeng Dong, Lingling Fan, Tianhong Wang, Weijie Zhang, EnBao Chen, Zheli Liu, Lu Yu
	Evaluating Garbage Collection Performance Across Managed Language Runtimes Research Track Yicheng Wang, Wensheng Dou, Yu Liang, Yi Wang, Wei Wang, Jun Wei, Tao Huang
	Execution Trace Reconstruction Using Diffusion-Based Generative Models Research Track Madeline Janecek, Naser Ezzati Jivan, Wahab Hamou-Lhadj
	exLong: Generating Exceptional Behavior Tests with Large Language Models Research Track Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric
	Exploring the Robustness of the Effect of EVO on Intention Valuation through ReplicationAward Winner Research Track Yesugen Baatartogtokh, Kaitlyn Cook, Alicia M. Grubb
	Exposing the Hidden Layer: Software Repositories in the Service of SEO ManipulationSecurity Research Track Mengying Wu, Geng Hong, Wuyuao Mai, Xinyi Wu, Lei Zhang, Yingyuan Pu, Huajun Chai, Lingyun Ying, Haixin Duan, Min Yang
	FairChecker: Detecting Fund-stealing Bugs in DeFi Protocols via Fairness ValidationBlockchainSecurity Research Track Yi Sun, Zhuo Zhang, Xiangyu Zhang
	Fairness Testing through Extreme Value TheorySE for AI Research Track Verya Monjezi, Ashutosh Trivedi, Vladik Kreinovich, Saeid Tizpaz-Niari
	FairQuant: Certifying and Quantifying Fairness of Deep Neural NetworksSE for AI Research Track Brian Hyeongseok Kim, Jingbo Wang, Chao Wang Pre-print
	FairSense: Long-Term Fairness Analysis of ML-Enabled SystemsSecuritySE for AI Research Track Yining She, Sumon Biswas, Christian Kästner, Eunsuk Kang
	FAMOS: Fault diagnosis for Microservice Systems through Effective Multi-modal Data FusionSecurity Research Track Chiming Duan, Yong Yang, Tong Jia, Guiyang Liu, Jinbu Liu, Huxing Zhang, Qi Zhou, Ying Li, Gang Huang
	Faster Configuration Performance Bug Testing with Neural Dual-level Prioritization Research Track Youpeng Ma, Tao Chen, Ke Li Pre-print
	Feature-Driven End-To-End Test Generation Research Track Parsa Alian, Noor Nashid, Mobina Shahbandeh, Taha Shabani, Ali Mesbah
	Fidelity of Cloud Emulators: The Imitation Game of Testing Cloud-based Software Research Track Anna Mazhar, Saad Sher Alam, William Zheng, Yinfang Chen, Suman Nath, Tianyin Xu
	FixDrive: Automatically Repairing Autonomous Vehicle Driving Behaviour for $0.08 per ViolationSE for AI Research Track Yang Sun, Chris Poskitt, Kun Wang, Jun Sun Link to publication DOI Pre-print File Attached
	Fixing Large Language Models' Specification Misunderstanding for Better Code GenerationSE for AI Research Track Zhao Tian, Junjie Chen, Xiangyu Zhang Pre-print
	Fork State-Aware Differential Fuzzing for Blockchain Consensus ImplementationsBlockchain Research Track Won Hoi Kim, Hocheol Nam, Muoi Tran, Amin Jalilov, Zhenkai Liang, Sang Kil Cha, Min Suk Kang DOI Pre-print
	Formally Verified Binary-level Pointer AnalysisFormal Methods Research Track Freek Verbeek, Ali Shokri, Daniel Engel, Binoy Ravindran
	Formally Verified Cloud-Scale AuthorizationAward Winner Research Track Aleks Chakarov, Jaco Geldenhuys, Matthew Heck, MIchael Hicks, Samuel Huang, Georges-Axel Jaloyan, Anjali Joshi, K. Rustan M. Leino, Mikael Mayer, Sean McLaughlin, Akhilesh Mritunjai, Clement Pit-Claudel, Sorawee Porncharoenwase, Florian Rabe, Marianna Rapoport, Giles Reger, Cody Roux, Neha Rungta, Robin Salkeld, Matthias Schlaipfer, Daniel Schoepe, Johanna Schwartzentruber, Serdar Tasiran, Aaron Tomb, Emina Torlak, Jean-Baptiste Tristan, Lucas Wagner, Michael Whalen, Remy Willems, Tongtong Xiang, Taejoon Byun, Joshua M. Cohen, Ruijie Fang, Junyoung Jang, Jakob Rath, Hira Taqdees Syeda, Dominik Wagner, Yongwei Yuan
	From Bugs to Benefits: Improving User Stories by Leveraging Crowd Knowledge with CrUISE-AC Research Track Stefan Schwedt, Thomas Ströder
	Fuzzing MLIR Compilers with Custom Mutation Synthesis Research Track Ben Limpanukorn, Jiyuan Wang, Hong Jin Kang, Eric Zitong Zhou, Miryung Kim Pre-print
	GARL: Genetic Algorithm-Augmented Reinforcement Learning to Detect Violations in Marker-Based Autonomous Landing Systems Research Track Linfeng Liang, Yao Deng, Kye Morton, Valtteri Kallinen, Alice James, Avishkar Seth, Endrowednes Kuantama, Subhas Mukhopadhyay, Richard Han, Xi Zheng
	GenC2Rust: Towards Generating Generic Rust Code from C Research Track Xiafa Wu, Brian Demsky
	"Get Me In The Groove": A Mixed Methods Study on Supporting ADHD Professional Programmers Research Track Kaia Newman, Sarah Snay, Madeline Endres, Manasvi Parikh, Andrew Begel Pre-print
	Gpass: a Goal-adaptive Neural Theorem Prover based on Coq for Automated Formal VerificationFormal Methods Research Track Yizhou Chen, Zeyu Sun, Guoqing Wang, Dan Hao
	GVI: Guided Vulnerability Imagination for Boosting Deep Vulnerability DetectorsSecurity Research Track Heng Yong, Zhong Li, Minxue Pan, Tian Zhang, Jianhua Zhao, Xuandong Li
	HedgeCode: A Multi-Task Hedging Contrastive Learning Framework for Code Search Research Track Gong Chen, Xiaoyuan Xie, Xunzhu Tang, Qi Xin, Wenjie Liu
	Hetrify: Efficient Verification of Heterogeneous Programs on RISC-VSecurityAward Winner Research Track Yiwei Li, Liangze Yin, Wei Dong, Jiaxin Liu, Yanfeng Hu, Shanshan Li
	HIFI: Explaining and Mitigating Algorithmic Bias through the Lens of Game-Theoretic InteractionsSecuritySE for AI Research Track Lingfeng Zhang, Zhaohui Wang, Yueling Zhang, Min Zhang, Jiangtao Wang
	Hints Help Finding and Fixing Bugs Differently in Python and Text-based Program Representations Research Track Ruchit Rawal, Victor-Alexandru Padurean, Sven Apel, Adish Singla, Mariya Toneva Pre-print
	How Scientists Use Jupyter Notebooks: Goals, Quality Attributes, and Opportunities Research Track Ruanqianqian (Lisa) Huang, Savitha Ravi, Michael He, Boyu Tian, Sorin Lerner, Michael Coblenz Pre-print
	HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation Research Track Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, Zibin Zheng
	Hyperion: Unveiling DApp Inconsistencies using LLM and Dataflow-Guided Symbolic ExecutionSecurity Research Track Shuo Yang, Xingwei Lin, Jiachi Chen, Qingyuan Zhong, Lei Xiao, renke huang, Yanlin Wang, Zibin Zheng
	Improved Detection and Diagnosis of Faults in Deep Neural Networks Using Hierarchical and Explainable ClassificationSE for AI Research Track Sigma Jahan, Mehil Shah, Parvez Mahbub, Masud Rahman Pre-print
	Increasing the Effectiveness of Automatically Generated Tests by Improving Class ObservabilityAward Winner Research Track Geraldine Galindo-Gutierrez, Juan Pablo Sandoval Alcocer, Nicolas Jimenez-Fuentes, Alexandre Bergel, Gordon Fraser
	Instruct or Interact? Exploring and Eliciting LLMs’ Capability in Code Snippet Adaptation Through Prompt Engineering Research Track Tanghaoran Zhang, Yue Yu, Xinjun Mao, Shangwen Wang, Kang Yang, Yao Lu, Zhang Zhang, Yuxin Zhao
	Instrumentation-Driven Evolution-Aware Runtime Verification Research Track Kevin Guan, Owolabi Legunsen
	InSVDF: Interface-State-Aware Virtual Device Fuzzing Research Track Zexiang Zhang, Gaoning Pan, Ruipeng Wang, Yiming Tao, Zulie Pan, Cheng Tu, Min Zhang, Yang Li, Yi Shen, Chunming Wu
	Intention is All You Need: Refining Your Code from Your Intention Research Track Qi Guo, Xiaofei Xie, Shangqing Liu, Ming Hu, Xiaohong Li, Lei Bu
	Interactive Cross-Language Pointer Analysis for Resolving Native Code in Java ProgramsAward Winner Research Track Chenxi Zhang, Yufei Liang, Tian Tan, Chang Xu, Shuangxiang Kan, Yulei Sui, Yue Li
	InterTrans: Leveraging Transitive Intermediate Translations to Enhance LLM-based Code Translation Research Track Marcos Macedo, Yuan Tian, Pengyu Nie, Filipe Cogo, Bram Adams
	Investigating the Impact of Interpersonal Challenges on Feeling Welcome in OSS Research Track Bianca Trinkenreich, Zixuan Feng, Rudrajit Choudhuri, Marco Gerosa, Anita Sarma, Igor Steinmacher Pre-print
	Invivo Fuzzing by Amplifying Actual Executions Research Track Octavio Galland, Marcel Böhme
	IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation Research Track Yuyang Rong, Zhanghan Yu, Zhenkai Weng, Stephen Neuendorffer, Hao Chen
	Iterative Generation of Adversarial Example for Deep Code ModelsSE for AIAward Winner Research Track Li Huang, Weifeng Sun, Meng Yan
	Janus: Detecting Rendering Bugs in Web Browsers via Visual Delta Consistency Research Track Chijin Zhou, Quan Zhang, Bingzhou Qian, Yu Jiang
	Knowledge-Enhanced Program Repair for Data Science Code Research Track Shuyin Ouyang, Jie M. Zhang, Zeyu Sun, Albert Merono Penuela
	Large Language Models as Configuration ValidatorsSecurity Research Track Xinyu Lian, Yinfang Chen, Runxiang Cheng, Jie Huang, Parth Thakkar, Minjia Zhang, Tianyin Xu
	Large Language Models for Safe Minimization Research Track Aashish Yadavally, Xiaokai Rong, Phat Nguyen, Tien N. Nguyen
	Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests Research Track Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, Andy Zaidman DOI Pre-print
	Leveraging Large Language Models to Detect npm Malicious PackagesSecurity Research Track Nusrat Zahan, Philipp Burckhardt, Mikola Lysenko, Feross Aboukhadijeh, Laurie Williams
	Leveraging Propagated Infection to Crossfire Mutants Research Track Hang Du, Vijay Krishna Palepu, James Jones File Attached
	LibreLog: Accurate and Efficient Unsupervised Log Parsing Using Open-Source Large Language Models Research Track Zeyang Ma, Dong Jae Kim, Tse-Hsun (Peter) Chen
	LiCoEval: Evaluating LLMs on License Compliance in Code Generation Research Track Weiwei Xu, Kai Gao, Hao He, Minghui Zhou Pre-print
	Lightweight Concolic Testing via Path-Condition Synthesis for Deep Learning LibrariesSE for AI Research Track Sehoon Kim, Yonghyeon Kim, Dahyeon Park, Yuseok Jeon, Jooyong Yi, Mijung Kim
	LiSSA: Toward Generic Traceability Link Recovery through Retrieval-Augmented Generation Research Track Dominik Fuchß, Tobias Hey, Jan Keim, Haoyu Liu, Niklas Ewald, Tobias Thirolf, Anne Koziolek Pre-print Media Attached
	LLM-Agents Driven Automated Simulation Testing and Analysis of small Uncrewed Aerial Systems Research Track Venkata Sai Aswath Duvvuru, Bohan Zhang, Michael Vierhauser, Ankit Agrawal Pre-print Media Attached
	LLM-aided Automatic Modeling for Security Protocol VerificationSecurityFormal Methods Research Track Ziyu Mao, Jingyi Wang, Jun Sun, Shengchao Qin, Jiawen Xiong
	LLM Assistance for Memory SafetySecurity Research Track Nausheen Mohammed, Akash Lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma
	LLM Based Input Space Partitioning Testing for Library APIs Research Track Jiageng Li, Zhen Dong, Chong Wang, Haozhen You, Cen Zhang, Yang Liu, Xin Peng
	LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion Research Track Chong Wang, Kaifeng Huang, Jian Zhang, Yebo Feng, Lyuye Zhang, Yang Liu, Xin Peng
	LWDIFF: An LLM-Assisted Differential Testing Framework for WebAssembly Runtimes Research Track Shiyao Zhou, Jincheng Wang, He Ye, Hao Zhou, Claire Le Goues, Xiapu Luo
	Magika: AI-Powered Content-Type DetectionSecurity Research Track Yanick Fratantonio, Luca Invernizzi, Loua Farah, Kurt Thomas, Marina Zhang, Ange Albertini, Francois Galilee, Giancarlo Metitieri, Julien Cretin, Alex Petit-Bianco, David Tao, Elie Bursztein
	MARQ: Engineering Mission-Critical AI-based Software with Automated Result Quality AdaptationSE for AI Research Track Uwe Gropengießer, Elias Dietz, Florian Brandherm, Achref Doula, Osama Abboud, Xun Xiao, Max Mühlhäuser
	Measuring the Runtime Performance of C++ Code Written by Humans using GitHub Copilot Research Track Daniel Erhabor, Sreeharsha Udayashankar, Mei Nagappan, Samer Al-Kiswany DOI Pre-print File Attached
	Metamorphic-Based Many-Objective Distillation of LLMs for Code-related Tasks Research Track Annibale Panichella
	Mobile Application Coverage: The 30% Curse and Ways Forward Research Track Faridah Akinotcho, Lili Wei, Julia Rubin Pre-print
	Mock Deep Testing: Toward Separate Development of Data and Models for Deep LearningSE for AI Research Track Ruchira Manke, Mohammad Wardat, Foutse Khomh, Hridesh Rajan
	Model Editing for LLMs4Code: How Far are We? Research Track Xiaopeng Li, Shangwen Wang, Shasha Li, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Bin Ji, Weimin Zhang Pre-print
	Module-Aware Context Sensitive Pointer Analysis Research Track Haofeng Li, Chenghang Shi, Jie Lu, Lian Li, Zixuan Zhao File Attached
	Moye: A Wallbreaker for Monolithic Firmware Research Track Jintao Huang, Kai Yang, Gaosheng Wang, Zhiqiang Shi, Zhiwen Pan, Shichao Lv, Limin Sun
	Navigating the Testing of Evolving Deep Learning Systems: An Exploratory Interview Study Research Track Hanmo You, Zan Wang, Bin Lin, Junjie Chen
	Neurosymbolic Modular Refinement Type Inference Research Track Georgios Sakkas, Pratyush Sahu, Kyeling Ong, Ranjit Jhala
	NIODebugger: A Novel Approach to Repair Non-Idempotent-Outcome Tests with LLM-Based Agent Research Track Kaiyao Ke
	No Harness, No Problem: Oracle-guided Harnessing for Auto-generating C API Fuzzing Harnesses Research Track Gabriel Sherman, Stefan Nagy
	On Prescription or Off Prescription? An Empirical Study of Community-prescribed Security Configurations for KubernetesSecurity Research Track Shazibul Islam Shamim, Hanyang Hu, Akond Rahman Pre-print File Attached
	On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning ImplementationsSE for AI Research Track Rajdeep Singh Hundal, Yan Xiao, Xiaochun Cao, Jin Song Dong, Manuel Rigger Pre-print Media Attached File Attached
	PacDroid: A Pointer-Analysis-Centric Framework for Security Vulnerabilities in Android AppsSecurityAward Winner Best Artifact Research Track Menglong Chen, Tian Tan, Minxue Pan, Yue Li
	PairSmell: A Novel Perspective Inspecting Software Modular StructureAward Winner Research Track Chenxing Zhong, Daniel Feitosa, Paris Avgeriou, Huang Huang, Yue Li, He Zhang Pre-print
	Parametric Falsification of Many Probabilistic Requirements under Flakiness Research Track Matteo Camilli, Raffaela Mirandola
	Patch Synthesis for Property Repair of Deep Neural NetworksSE for AI Research Track Zhiming Chi, Jianan Ma, Pengfei Yang, Cheng-Chao Huang, Renjue Li, Jingyi Wang, Xiaowei Huang, Lijun Zhang
	Pattern-based Generation and Adaptation of Quantum WorkflowsQuantum Research Track Martin Beisel, Johanna Barzen, Frank Leymann, Lavinia Stiliadou, Daniel Vietz, Benjamin Weder
	Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets Research Track Smit Soneshbhai Patel, Aashish Yadavally, Hridya Dhulipala, Tien N. Nguyen
	Practical Object-Level Sanitizer With Aggregated Memory Access and Custom AllocatorSecurity Research Track Xiaolei wang, Ruilin Li, Bin Zhang, Chao Feng, Chaojing Tang
	Preserving Privacy in Software Composition Analysis: A Study of Technical Solutions and Enhancements Research Track Huaijin Wang, Zhibo Liu, Yanbo Dai, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu
	µPRL: a Mutation Testing Pipeline for Deep Reinforcement Learning based on Real FaultsSE for AI Research Track Deepak-George Thomas, Matteo Biagiola, Nargiz Humbatova, Mohammad Wardat, Gunel Jahangirova, Hridesh Rajan, Paolo Tonella Pre-print
	Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and DefensesSecuritySE for AI Research Track Rodrigo Resendes Pedro, Miguel E. Coimbra, Daniel Castro, Paulo Carreira, Nuno Santos
	Puppy: Finding Performance Degradation Bugs in DBMSs via Limited-Optimization Plan Construction Research Track Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang
	QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning Research Track Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Shizhuo Zhang, Talia Lily Ringer, Yuriy Brun Link to publication Pre-print
	Rango: Adaptive Retrieval-Augmented Proving for Automated Software VerificationAward Winner Research Track Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner, Emily First Link to publication Pre-print File Attached
	Ranking Relevant Tests for Order-Dependent Flaky Tests Research Track Shanto Rahman, Bala Naren Chanumolu, Suzzana Rafi, August Shi, Wing Lam
	Reasoning Runtime Behavior of a Program with LLM: How Far Are We? Research Track Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia
	REDII: Test Infrastructure to Enable Deterministic Reproduction of Failures for Distributed Systems Research Track Yang Feng, Zheyuan Lin, Dongchen Zhao, Mengbo Zhou, Jia Liu, James Jones
	Reduce Dependence for Sound Concurrency Bug Prediction Research Track Shihao Zhu, Yuqi Guo, Yan Cai, Bin Liang, Long Zhang, Rui Chen, Tingting Yu
	Relationship Status: “It’s complicated” Developer-Security Expert Dynamics in ScrumSecurity Research Track Houda Naji, Marco Gutfleisch, Alena Naiakshina
	RepairAgent: An Autonomous, LLM-Based Agent for Program Repair Research Track Islem BOUZENIA, Prem Devanbu, Michael Pradel Pre-print
	Repository-Level Graph Representation Learning for Enhanced Security Patch DetectionSecurity Research Track Xin-Cheng Wen, Zirui Lin, Cuiyun Gao, Hongyu Zhang, Yong Wang, Qing Liao
	Revisiting Unnaturalness for Automated Program Repair in the Era of Large Language Models Research Track Aidan Z.H. Yang, Sophia Kolak, Vincent J. Hellendoorn, Ruben Martins, Claire Le Goues
	RLCoder: Reinforcement Learning for Repository-Level Code Completion Research Track Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, Zibin Zheng
	ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation Research Track Xue Jiang, Yihong Dong, Yongding Tao, Huanyu Liu, Zhi Jin, Ge Li
	ROSA: Finding Backdoors with FuzzingSecurityAward Winner Best Artifact Research Track Dimitri Kokkonis, Michaël Marcozzi, Emilien Decoux, Stefano Zacchiroli Link to publication DOI Pre-print Media Attached File Attached
	RUG: Turbo LLM for Rust Unit Test GenerationSE for AI Research Track Xiang Cheng, Fan Sang, Yizhuo Zhai, Xiaokuan Zhang, Taesoo Kim Pre-print Media Attached File Attached
	RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code Research Track Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, Rishi Poddar, Aseem Rastogi
	SAND: Decoupling Sanitization from Fuzzing for Low Overhead Research Track Ziqiao Kong, Shaohua Li, Heqing Huang, Zhendong Su Link to publication Pre-print Media Attached File Attached
	Scenario-Driven and Context-Aware Automated Accessibility Testing for Android Apps Research Track Yuxin Zhang, Sen Chen, Xiaofei Xie, Zibo Liu, Lingling Fan
	Search-Based LLMs for Code OptimizationAward Winner Research Track Shuzheng Gao, Cuiyun Gao, Wenchao Gu, Michael Lyu
	SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing Research Track Wenchao Gu, Ensheng Shi, Yanlin Wang, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Michael Lyu
	SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI AutomationAward Winner Research Track Dehai Zhao, Zhenchang Xing, Qinghua Lu, Xiwei (Sherry) Xu, Liming Zhu
	Selecting Initial Seeds for Better JVM Fuzzing Research Track Tianchang Gao, Junjie Chen, Dong Wang, Yile Guo, Yingquan Zhao, Zan Wang
	Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code NaturalnessSecurity Research Track Weisong Sun, Yuchen Chen, Mengzhe Yuan, Chunrong Fang, Zhenpeng Chen, Chong Wang, Yang Liu, Baowen Xu, Zhenyu Chen Pre-print Media Attached
	Similar but Patched Code Considered Harmful -- The Impact of Similar but Patched Code on Recurring Vulnerability Detection and How to Remove ThemSecurity Research Track Zixuan Tan, Jiayuan Zhou, Xing Hu, Shengyi Pan, Kui Liu, Xin Xia Pre-print
	SmartReco: Detecting Read-Only Reentrancy via Fine-Grained Cross-DApp AnalysisSecurity Research Track Jingwen Zhang, Zibin Zheng, Yuhong Nan, Mingxi Ye, Kaiwen Ning, Yu Zhang, Weizhe Zhang
	SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model AgentsSE for AI Research Track Feng Lin, Dong Jae Kim, Tse-Hsun (Peter) Chen
	Software Model Evolution with Large Language Models: Experiments on Simulated, Public, and Industrial Datasets Research Track Christof Tinnes, Alisa Carla Welter, Sven Apel Pre-print
	Source Code Summarization in the Era of Large Language Models Research Track Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, Zhenyu Chen Media Attached
	SpecGen: Automated Generation of Formal Program Specifications via Large Language ModelsFormal Methods Research Track Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, Lei Bu
	SpecRover: Code Intent Extraction via LLMs Research Track Haifeng Ruan, Yuntong Zhang, Abhik Roychoudhury
	Static Analysis of Remote Procedure Call in Java Programs Research Track Baoquan Cui, RongQu , Zhen Tang, Jian Zhang
	Studying Programmers Without Programming: Investigating Expertise Using Resting State fMRI Research Track Zachary Karas, Benjamin Gold, Violet Zhou, Noah Reardon, Thad Polk, Catie Chang, Yu Huang
	Synthesizing Document Database Queries using Collection Abstractions Research Track Qikang Liu, Yang He, Yanwen Cai, Byeongguk Kwak, Yuepeng Wang
	TacDroid: Detection of Illicit Apps through Hybrid Analysis of UI-based Transition Graphs Research Track Yanchen Lu, Hongyu Lin, Zehua He, Haitao Xu, Zhao Li, Shuai Hao, Liu Wang, Haoyu Wang, Kui Ren
	Template-Guided Program Repair in the Era of Large Language Models Research Track Kai Huang, Jian Zhang, Xiangxin Meng, Yang Liu File Attached
	Testing and Understanding Deviation Behaviors in FHE-hardened Machine Learning ModelsSE for AI Research Track Yiteng Peng, Daoyuan Wu, Zhibo Liu, Dongwei Xiao, Zhenlan Ji, Juergen Rahmel, Shuai Wang
	Test Intention Guided LLM-based Unit Test Generation Research Track Zifan Nan, Zhaoqiang Guo, Kui Liu, Xin Xia
	Thanos: DBMS Bug Detection via Storage Engine Rotation Based Differential TestingAward Winner Research Track Ying Fu, Zhiyong Wu, Yuanliang Zhang, Jie Liang, Jingzhou Fu, Yu Jiang, Shanshan Li, Liao Xiangke
	The Design Smells Breaking the Boundary between Android Variants and AOSP Research Track Wuxia Jin, Jiaowei Shang, Jianguo Zheng, Mengjie Sun, Zhenyu Huang, Ming Fan, Ting Liu
	The Fact Selection Problem in LLM-Based Program Repair Research Track Nikhil Parasaram, Huijie Yan, Boyu Yang, Zineb Flahy, Abriele Qudsi, Damian Ziaber, Earl T. Barr, Sergey Mechtaev
	The Power of Types: Exploring the Impact of Type Checking on Neural Bug Detection in Dynamically Typed Languages Research Track Boqi Chen, José Antonio Hernández López, Gunter Mussbacher, Daniel Varro
	The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML ProductsSE for AI Research Track Nadia Nahar, Haoran Zhang, Grace Lewis, Shurui Zhou, Christian Kästner
	The Same Only Different: On Information Modality for Configuration Performance Analysis Research Track Hongyuan Liang, Yue Huang, Tao Chen Pre-print
	The Seeds of the FUTURE Sprout from History: Fuzzing for Unveiling Vulnerabilities in Prospective Deep-Learning LibrariesSecurityAward Winner Research Track Zhiyuan Li, Jingzheng Wu, Xiang Ling, Tianyue Luo, ZHIQING RUI, Yanjun Wu
	TIGER: A Generating-Then-Ranking Framework for Practical Python Type Inference Research Track Chong Wang, Jian Zhang, Yiling Lou, Mingwei Liu, Weisong Sun, Yang Liu, Xin Peng
	TIVER: Identifying Adaptive Versions of C/C++ Third-Party Open-Source Components Using a Code Clustering TechniqueSecurity Research Track Youngjae Choi, Seunghoon Woo
	TOGLL: Correct and Strong Test Oracle Generation with LLMs Research Track Soneya Binta Hossain, Matthew B Dwyer
	TopSeed: Learning Seed Selection Strategies for Symbolic Execution from Scratch Research Track Jaehyeok Lee, Sooyoung Cha
	Toward a Better Understanding of Probabilistic Delta Debugging Research Track Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian, Xinru Cheng, Chengnian Sun
	Towards Better Answers: Automated Stack Overflow Post Updating Research Track Yubo Mai, Zhipeng Gao, Haoye Wang, Tingting Bi, Xing Hu, Xin Xia, JianLing Sun
	Towards High-strength Combinatorial Interaction Testing for Highly Configurable Software Systems Research Track Chuan Luo, Shuangyu Lyu, Wei Wu, Hongyu Zhang, Dianhui Chu, Chunming Hu
	Towards More Trustworthy Deep Code Models by Enabling Out-of-Distribution DetectionSecuritySE for AI Research Track Yanfu Yan, Viet Duong, Huajie Shao, Denys Poshyvanyk
	Towards Neural Synthesis for SMT-assisted Proof-Oriented ProgrammingSecurityFormal MethodsAward Winner Research Track Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu K. Lahiri, Nikhil Swamy
	Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models Research Track Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, Tianyi Zhang Pre-print
	TraceFL: Interpretability-Driven Debugging in Federated Learning via Neuron ProvenanceSE for AI Research Track Waris Gill, Ali Anwar, Muhammad Ali Gulzar Pre-print
	TransferFuzz: Fuzzing with Historical Trace for Verifying Propagated Vulnerability CodeSecurity Research Track Siyuan Li, Yuekang Li, Zuxin Chen, Chaopeng Dong, Yongpan Wang, Hong Li, Yongle Chen, Hongsong Zhu
	Treefix: Enabling Execution with a Tree of Prefixes Research Track Beatriz Souza, Michael Pradel Pre-print
	Trust Dynamics in AI-Assisted Development: Definitions, Factors, and Implications Research Track Sadra Sabouri, Philipp Eibl, Xinyi Zhou, Morteza Ziyadi, Nenad Medvidović, Lars Lindemann, Souti Chattopadhyay Pre-print
	Tumbling Down the Rabbit Hole: How do Assisting Exploration Strategies Facilitate Grey-box Fuzzing?Award Winner Research Track Mingyuan Wu, Jiahong Xiang, Kunqiu Chen, Peng Di, Shin Hwei Tan, Heming Cui, Yuqun Zhang
	UML is Back. Or is it? Investigating the Past, Present, and Future of UML in Open Source Software Research Track Joseph Romeo, Marco Raglianti, Csaba Nagy, Michele Lanza Pre-print
	Unavoidable Boundary Conditions: A Control Perspective on Goal Conflicts Research Track Sebastian Uchitel, Francisco Cirelli, Dalal Alrajeh
	Understanding and Detecting Peer Dependency Resolving Loop in npm Ecosystem Research Track Xingyu Wang, MingSen Wang, Wenbo Shen, Rui Chang
	Understanding Architectural Complexity, Maintenance Burden, and Developer Sentiment---a Large-Scale Study Research Track Yuanfang Cai, Lanting He, Yony Kochinski, Jun Qian, Ciera Jaspan, Nan Zhang, Antonio Bianco
	Understanding Compiler Bugs in Real Development Research Track Hao Zhong
	Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak AttacksSecuritySE for AI Research Track shide zhou, Li Tianlin, Kailong Wang, Yihao Huang, Ling Shi, Yang Liu, Haoyu Wang
	Understanding the Response to Open-Source Dependency Abandonment in the npm EcosystemAward Winner Research Track Courtney Miller, Mahmoud Jahanshahi, Audris Mockus, Bogdan Vasilescu, Christian Kästner
	Unleashing the True Potential of Semantic-based Log Parsing with Pre-trained Language Models Research Track Van-Hoang Le, Yi Xiao, Hongyu Zhang
	Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the FamiliarAward Winner Research Track Yuanliang Zhang, Yifan Xie, Shanshan Li, Ke Liu, Chong Wang, Zhouyang Jia, Xiangbing Huang, Jie Song, Chaopeng Luo, Zhizheng Zheng, Rulin Xu, Yitong Liu, Si Zheng, Liao Xiangke
	Unveiling the Energy Vampires: A Methodology for Debugging Software Energy ConsumptionAward Winner Research Track Enrique Barba Roque, Luís Cruz, Thomas Durieux Pre-print
	User Personas Improve Social Sustainability by Encouraging Software Developers to Deprioritize Antisocial Features Research Track Bimpe Ayoola, Miikka Kuutila, Rina R. Wehbe, Paul Ralph Pre-print
	Vulnerability Detection with Code Language Models: How Far Are We?Security Research Track Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen
	WDD: Weighted Delta Debugging Research Track Xintong Zhou, Zhenyang Xu, Mengxiao Zhang, Yongqiang Tian, Chengnian Sun
	Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance LearningSecurity Research Track Minghua He, Tong Jia, Chiming Duan, Huaqian Cai, Ying Li, Gang Huang
	What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI Research Track Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, Anita Sarma Pre-print
	What You See Is What You Get: Attention-based Self-guided Automatic Unit Test Generation Research Track Xin Yin, Chao Ni, xiaodanxu , Xiaohu Yang Pre-print
	When Quantum Meets Classical: Characterizing Hybrid Quantum-Classical Issues Discussed in Developer ForumsQuantum Research Track Jake Zappin, Trevor Stalnaker, Oscar Chaparro, Denys Poshyvanyk
	Who’s Pushing the Code: An Exploration of GitHub Impersonation Research Track Yueke Zhang, Anda Liang, Xiaohan Wang, Pamela J. Wisniewski, Fengwei Zhang, Kevin Leach, Yu Huang
	Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models Research Track Kunpeng Zhang, Shuai Wang, Jitao Han, Xiaogang Zhu, Xian Li, Shaohua Wang, Sheng Wen
	$ZTD_{JAVA}$: Mitigating Software Supply Chain Vulnerabilities via Zero-Trust DependenciesSecurity Research Track Paschal Amusuo, Kyle A. Robinson, Tanmay Singla, Huiyun Peng, Aravind Machiry, Santiago Torres-Arias, Laurent Simon, James C. Davis Pre-print

Call for Papers

The International Conference on Software Engineering (ICSE) is the premier forum for presenting and discussing the most recent and significant technical research contributions in the field of Software Engineering. In the research track, we invite high-quality submissions of technical research papers describing original and unpublished results of software engineering research.

ICSE 2025 will follow a dual deadline structure introduced in 2024. In other words, submissions will occur in two cycles. Please refer to the section on Dual Submission Cycles in the following for the information.

NEW THIS YEAR #1: Due to the rapid growth of the area of “AI and Software Engineering”, it is now split into two: “AI for Software Engineering” and “Software Engineering for AI”. A new area “Architecture and Design” is introduced. The topics listed under each area have also been revised. Please see the “Research Areas” section below.

NEW THIS YEAR #2: We add back opportunities for “Author Response” in addition to “Revision” so that some potential misunderstandings can be clarified in the review process for papers that otherwise would be rejected. Also, for a paper receiving a “Revision” outcome, authors will be given an additional page of text in the revised paper to accommodate the required changes specified in the reviews.

NEW THIS YEAR #3: Submissions must follow the latest “IEEE Submission and Peer Review Policy” and “ACM Policy on Authorship” (with associated FAQ, which includes a policy regarding the use of generative AI tools and technologies, such as ChatGPT. After checking with the ICSE Steering Committee, we are piloting a human-in-the-loop automated process to identify AI-generated papers. A Review Process Co-Chair has volunteered to design and run this pilot process on submitted papers. To preserve confidentiality, when scanning submitted papers, the scripts will not make use of any third-party services.

NEW THIS YEAR #4: IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology and ICSE 2025, have received approval from the ICSE Steering Committee to launch the Sustainable Community Review Effort (SCRE) program, aimed at reducing community effort in reviewing journal extensions of conference papers and allowing authors to get faster and more consistent feedback. More information is available at: http://tinyurl.com/icse25-scre

NEW THIS YEAR #5: ICSE Steering Committee has recently approved a proposal for streamlining and enhancing the paper bidding and assignment process, aimed at reducing the workload of PC members and resulting in better assignments of papers. Two Review Process Co-Chairs have volunteered to help manage the updated bidding and assignment process. More information is available at: http://tinyurl.com/icse25-streamlining

NEW THIS YEAR #6: ICSE Steering Committee has recently approved a proposal for Shadow PC. Shadow PC is a mentoring program to train early-career researchers (PhD students, postdocs, new faculty members, and industry practitioners) in the review process of the technical track. For Cycle 2, for the first time, authors of ICSE submissions can opt-in for their papers to be considered for review in the Shadow PC track. Shadow reviews for papers that are reviewed by the Shadow PC will be sent out to authors after the end of the actual review process; shadow reviews will not affect the official decision made by the regular PC. More detailed information about the program is available at: http://tinyurl.com/icse25-shadowpc

Research Areas

ICSE welcomes submissions addressing topics across the full spectrum of Software Engineering, being inclusive of quantitative, qualitative, and mixed-methods research. Topics of interest include the following and are grouped into the following nine research areas. Please note that these topics are by no means exhaustive.

Each submission will need to indicate one of these nine areas as the chosen area. Optionally, the authors can consider adding an additional area. A paper may be moved from the chosen area(s) to another focus area at the discretion of the program chairs. Program chairs will ultimately assign a paper to an area chair, considering the authors’ selection, the paper’s content, and other factors such as (if applicable) possible conflicts of interest.

AI for Software Engineering

AI-enabled recommender systems for automated SE (e.g., code generation, program repair, AIOps, software composition analysis, etc.)
Human-centered AI for SE (e.g., how software engineers can synergistically work with AI agents)
Trustworthy AI for SE (e.g., how to provide guarantees, characterize limits, and prevent misuse of AI for SE)
Sustainable AI for SE (e.g., how to reduce energy footprint for greener AI for SE)
Collaborative AI for SE (e.g., how AI agents collaborate for automating SE)
Automating SE tasks with LLM and other foundation models (e.g., large vision model)
Efficacy measurement beyond traditional metrics (e.g., accuracy, BLEU, etc.)
Prompt engineering for SE (e.g., novel prompt design)
AI-assisted software design and model driven engineering (e.g., specification mining, program synthesis, software architectural design)

Analytics

Mining software repositories, including version control systems, issue tracking systems, software ecosystems, configurations, app stores, communication platforms, and novel software engineering data sources, to generate insights through various research methods
Software visualization
Data-driven user experience understanding and improvement
Data driven decision making in software engineering
Software metrics (and measurements)

Architecture and Design

Architecture and design measurement and assessment
Software design methodologies, principles, and strategies
Theory building for/of software design
Architecture quality attributes, such as security, privacy, performance, reliability
Modularity and reusability
Design and architecture modeling and analysis
Architecture recovery
Dependency and complexity analysis
Distributed architectures, such as microservice, SOA, cloud computing
Patterns and anti-patterns
Technical debt in design and architecture
Architecture refactoring
Adaptive architectures
Architecture knowledge management

Dependability and Security

Formal methods and model checking (excluding solutions focusing solely on hardware)
Reliability, availability, and safety
Resilience and antifragility
Confidentiality, integrity, privacy, and fairness
Performance
Design for dependability and security
Vulnerability detection to enhance software security
Dependability and security for embedded and cyber-physical systems

Evolution

Evolution and maintenance
API design and evolution
Release engineering and DevOps
Software reuse
Refactoring and program differencing
Program comprehension
Reverse engineering
Environments and software development tools
Traceability to understand evolution

Human and Social Aspects

Focusing on individuals (from program comprehension, workplace stress to job satisfaction and career progression)
Focusing on teams (e.g., collocated, distributed, global, virtual; communication and collaboration within a team), communities (e.g., open source, communities of practice) and companies (organization, economics)
Focusing on society (e.g., sustainability; diversity and inclusion)
Focusing on programming languages, environments, and tools supporting individuals, teams, communities, and companies.
Focusing on software development processes

Requirements and Modeling

Requirements engineering (incl. non-functional requirements)
Theoretical requirement foundations
Requirements and architecture
Feedback, user and requirements management
Requirements traceability and dependencies
Modeling and model-driven engineering
Variability and product lines
Systems and software traceability
Modeling languages, techniques, and tools
Empirical studies on the application of model-based engineering
Model-based monitoring and analysis

Software Engineering for AI

SE for AI models
SE for systems with AI components
SE for AI code, libraries, and datasets
Engineering autonomic systems and self-healing systems
Automated repair of AI models
Testing and verification of AI-based systems
Validation and user-based evaluation of AI-based systems
Requirements engineering for AI-based systems

Testing and Analysis

Software testing
Automated test generation techniques such as fuzzing, search-based approaches, and symbolic execution
Testing and analysis of non-functional properties
GUI testing
Mobile application testing
Program analysis
Program synthesis (e.g., constraint based techniques)
Program repair
Debugging and fault localization
Runtime analysis and/or error recovery

Scope

Since the authors will choose an area for their submission, the scope of each area becomes important. Some submissions may relate to multiple areas. In such cases, the authors should choose the area for which their paper brings the maximum new insights. Moreover, authors also have the choice of indicating an alternate area for each paper.

Similarly, for certain papers. authors may have a question whether it belongs to any area, or is simply out of scope. For such cases, we recommend the authors to judge whether their paper brings new insights for software engineering. As an example, a formal methods paper with a focus on hardware verification may be deemed out of scope for ICSE. In general, papers that only peripherally concern software engineering and do not give new insights from the software engineering perspective would be less relevant to ICSE. Our goal is, however, to be descriptive, rather than prescriptive, to enable authors to make their own decisions about relevance.

Dual Submission Cycles

Similar to ICSE 2024, we will have two submission cycles as follows:

First submission cycle

(Mandatory) Abstract: Mar 15, 2024
Submission: Mar 22, 2024
Author response period (3 days): Jun 10-13, 2024
Notification: Jul 5, 2024
Revision due: Aug 2, 2024
Camera-ready (of directly accepted papers): Aug 16, 2024
Final decision (of revised papers): Nov 1, 2024
Camera-ready (of accepted revised papers): Dec 13, 2024

Second submission cycle

(Mandatory) Abstract: Jul 26, 2024
Submission: Aug 2, 2024
Author response period (3 days): Oct 7-10, 2024
Notification: Nov 1, 2024
Revision due: Nov 29, 2024
Camera-ready (of directly accepted papers): Dec 13, 2024
Final decision (of revised papers): Jan 22, 2025
Camera-ready (of accepted revised papers): Feb 12, 2025

All dates are 23:59:59 AoE (UTC-12h).

Review Criteria

Each paper submitted to the Research Track will be evaluated based on the following criteria:

i) Novelty: The novelty and innovativeness of contributed solutions, problem formulations, methodologies, theories, and/or evaluations, i.e., the extent to which the paper is sufficiently original with respect to the state-of-the-art.

ii) Rigor: The soundness, clarity, and depth of a technical or theoretical contribution, and the level of thoroughness and completeness of an evaluation.

iii) Relevance: The significance and/or potential impact of the research on the field of software engineering.

iv) Verifiability and Transparency: The extent to which the paper includes sufficient information to understand how an innovation works; to understand how data was obtained, analyzed, and interpreted; and how the paper supports independent verification or replication of the paper’s claimed contributions. Any artifacts attached to or linked from the paper will be checked by one reviewer.

v) Presentation: The clarity of the exposition in the paper.

Reviewers will carefully consider all of the above criteria during the review process, and authors should take great care in clearly addressing them all. The paper should clearly explain and justify the claimed contributions. Each paper will be handled by an area chair who will ensure reviewing consistency among papers submitted within that area.

The outcome of each paper will be one of the following Accept, Revision, Reject. We now elaborate on the Revision outcome in the following.

Revisions

Papers submitted can go through revisions in response to specific revision requests made by the reviewers. Authors of papers receiving a Revision decision are expected to submit the revised papers, as well as the revised papers with changes marked in a different color, such as using LaTeXdiff. The authors also need to submit an “Author Response” document capturing the authors’ response to each reviewer’s comment and how those comments were addressed in the revision. This is similar to the “Summary of Changes and Response” document that is typically submitted by authors for a journal paper’s major revision. Authors may use the revision opportunity to revise and improve the paper, but should not use this to submit a substantially different paper. The reviewers will check the revised paper against the original paper and the suggested changes. Revised papers will be examined by the same set of reviewers. An unsatisfactory revised paper will be rejected. Authors are given 4 weeks to submit the revised papers. This is 1 week less than prior year as we reallocate the time to (i) add authors’ rebuttal, and (ii) provide more time for PC members to complete their reviews (to reduce reviewing fatigue considering the high workload to review increasing number of submissions). Authors are given an additional page of text in a revised paper to accommodate the required changes specified in the reviews.

Re-submissions of Rejected Papers

Authors of papers which receive a REJECT decision in the first submission cycle are strongly discouraged from re-submitting it to the second submission cycle. However, in exceptional cases where the reviewers evidently misunderstood their paper, upon approval of PC Chairs, authors can re-submit their paper to the second submission cycle with a “Clarifications and Summary of Improvements” document stating how they have changed the paper. They should also include the past reviews as part of this document, for completeness. These papers will be treated as new submissions, which may or may not get the same set of reviewers at the discretion of the PC chairs. Authors who try to bypass this guideline (e.g., by changing the paper title without significantly changing paper content, or by making small changes to the paper content) will have their papers desk-rejected by the PC chairs without further consideration.

Submission Process

Submissions must conform to the IEEE conference proceedings template, specified in the IEEE Conference Proceedings Formatting Guidelines (title in 24pt font and full text in 10pt type, LaTeX users must use \documentclass[10pt,conference]{IEEEtran} without including the compsoc or compsocconf options). Note that IEEE format is being used this year, whereas last year it was ACM format, hence the appearance will differ from year to year.

All submissions must not exceed 10 pages for the main text, inclusive of all figures, tables, appendices, etc. Two more pages containing only references are permitted. All submissions must be in PDF. Accepted papers will be allowed one extra page for the main text of the camera-ready version.
Submissions must strictly conform to the IEEE conference proceedings formatting instructions specified above. Alterations of spacing, font size, and other changes that deviate from the instructions may result in desk rejection without further review.
By submitting to the ICSE Technical Track, authors acknowledge that they are aware of and agree to be bound by the ACM Policy and Procedures on Plagiarism and the IEEE Plagiarism FAQ. In particular, papers submitted to ICSE 2025 must not have been published elsewhere and must not be under review or submitted for review elsewhere whilst under consideration for ICSE 2025. Contravention of this concurrent submission policy will be deemed a serious breach of scientific ethics, and appropriate action will be taken in all such cases. To check for double submission and plagiarism issues, the chairs reserve the right to (1) share the list of submissions with the PC Chairs of other conferences with overlapping review periods and (2) use external plagiarism detection software, under contract to the ACM or IEEE, to detect violations of these policies.
If the research involves human participants/subjects, the authors must adhere to the ACM Publications Policy on Research Involving Human Participants and Subjects. Upon submitting, authors will declare their compliance with such a policy. Alleged violations of this policy or any ACM Publications Policy will be investigated by ACM and may result in a full retraction of your paper, in addition to other potential penalties, as per ACM Publications Policy.
Please ensure that you and your co-authors obtain an ORCID ID, so you can complete the publishing process for your accepted paper. ACM and IEEE have been involved in ORCID and may collect ORCID IDs from all published authors. We are committed to improve author discoverability, ensure proper attribution and contribute to ongoing community efforts around name normalization; your ORCID ID will help in these efforts.
The ICSE 2025 Research Track will employ a double-anonymous review process. Thus, no submission may reveal its authors’ identities. The authors must make every effort to honor the double-anonymous review process. In particular:
- Authors’ names must be omitted from the submission.
- All references to the author’s prior work should be in the third person.
- While authors have the right to upload preprints on ArXiV or similar sites, they must avoid specifying that the manuscript was submitted to ICSE 2025.
- All communication with the program committee must go through the program committee chairs. Do not contact individual program committee members regarding your submission.
Further advice, guidance, and explanation about the double-anonymous review process can be found on the Q&A page.
By submitting to the ICSE Research Track, authors acknowledge that they conform to the authorship policy of the IEEE, submission policy of the IEEE, and the authorship policy of the ACM (and associated FAQ. This includes following these points related to the use of Generative AI:
- “Generative AI tools and technologies, such as ChatGPT, may not be listed as authors of an ACM published Work. The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work. For example, the authors could include the following statement in the Acknowledgements section of the Work: ChatGPT was utilized to generate sections of this Work, including text, tables, graphs, code, data, citations, etc.). If you are uncertain about the need to disclose the use of a particular tool, err on the side of caution, and include a disclosure in the acknowledgements section of the Work.” - ACM
- “The use of artificial intelligence (AI)–generated text in an article shall be disclosed in the acknowledgements section of any paper submitted to an IEEE Conference or Periodical. The sections of the paper that use AI-generated text shall have a citation to the AI system used to generate the text.” - IEEE
- “If you are using generative AI software tools to edit and improve the quality of your existing text in much the same way you would use a typing assistant like Grammarly to improve spelling, grammar, punctuation, clarity, engagement or to use a basic word processing system to correct spelling or grammar, it is not necessary to disclose such usage of these tools in your Work.” - ACM

Submissions to the Technical Track that meet the above requirements can be made via the Research Track submission site by the submission deadline. Any submission that does not comply with these requirements may be desk rejected without further review.

Submission site: https://icse2025.hotcrp.com/

We encourage the authors to upload their paper info early (and can submit the PDF later) to properly enter conflicts for double-anonymous reviewing. It is the sole responsibility of the authors to ensure that the formatting guidelines, double anonymous guidelines, and any other submission guidelines are met at the time of paper submission.

Open Science Policy

The research track of ICSE 2025 is governed by the ICSE 2025 Open Science policies. The guiding principle is that all research results should be accessible to the public and, if possible, empirical studies should be reproducible. In particular, we actively support the adoption of open artifacts and open source principles. We encourage all contributing authors to disclose (anonymized and curated) data/artifacts to increase reproducibility and replicability. Note that sharing research artifacts is not mandatory for submission or acceptance. However, sharing is expected to be the default, and non-sharing needs to be justified. We recognize that reproducibility or replicability is not a goal in qualitative research and that, similar to industrial studies, qualitative studies often face challenges in sharing research data. For guidelines on how to report qualitative research to ensure the assessment of the reliability and credibility of research results, see this curated Q&A page.

Upon submission to the research track, authors are asked

to make their artifact available to the program committee (via upload of supplemental material or a link to an anonymous repository) – and provide instructions on how to access this data in the paper; or
to include in the submission an explanation as to why this is not possible or desirable; and
to indicate in the submission why they do not intend to make their data or study materials publicly available upon acceptance, if that is the case. The default understanding is that the data and/or other artifacts will be publicly available upon acceptance of a paper.

Withdrawing a Paper

Authors can withdraw their paper at any moment until the final decision has been made, through the paper submission system. Resubmitting the paper to another venue before the final decision has been made without withdrawing from ICSE 2025 first is considered a violation of the concurrent submission policy, and will lead to automatic rejection from ICSE 2025 as well as any other venue adhering to this policy. Such violations may also be reported to appropriate organizations e.g. ACM and IEEE.

Conference Attendance Expectation

If a submission is accepted, at least one author of the paper is required to register for ICSE 2025 and present the paper. We are assuming that the conference will be in-person, and if it is virtual or hybrid, virtual presentations may be possible. These matters will be discussed with the authors closer to the date of the conference.

The proceedings are published by the IEEE. As of February, papers from the first cycle are available at the IEEE Computer Society Digital Library. Later on the remaining papers will appear there; they will also appear in the IEEE

The following papers have been accepted so far in the ICSE 2025 Research Track. The papers are will be published by the IEEE and appear in the IEEE and ACM digital libraries, subject to an author submitting their camera-ready and copyright forms, and registering to attend the conference. (Authors are required to present the papers at the conference, otherwise they will be withdrawn).

Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Puppy: Finding Performance Degradation Bugs in DBMSs via Limited-Optimization Plan Construction"

Abstract: Database management systems (DBMSs) consistently strive for enhanced performance. For a given query, the optimizer of a DBMS aims to construct an optimal execution plan that incorporates multiple optimization operations. However, the resulting plan may sometimes perform worse than even if no optimizations were applied. This occurs because the interactions between optimizations are complex and some situations might be overlooked in the implementation. We refer to these issues as Performance Degradation Bugs (PDBs). PDBs can result in significant consequences from decreased system efficiency and prolonged query processing times to potential disruptions in critical business operations. In this paper, we present Puppy, an automated approach for detecting PDBs in DBMSs using limited-optimization plan construction. The key idea is to compare the performance with the plan generated with all optimization operations enabled, against the plan generated with only a subset of optimization operations in the same DBMS. If the response time of the plan with the limited optimization set is shorter than that of the fully optimized plan, it indicates a potential PDB. Specifically, Puppy first generates queries that incorporate multiple optimization sequences, guided by optimization operation sequence coverage. Secondly, Puppy analyzes the query plan and selectively disables specific optimizations to construct the limited optimization plan. We evaluate Puppy on five widely-used DBMSs, namely MySQL, Percona, TiDB, PolarDB, and PostgreSQL against the state-of-the-art DBMS performance testing tools APOLLO and AMOEBA. Puppy detected 26 and 25 more performance anomalies, covered 151,201 and 173,798 more branches than APOLLO and AMOEBA in 48 hours, respectively. More importantly, Puppy reports 62 PDBs, with 54 anomalies confirmed as previously unknown bugs.

Tags: "Databases", "Prog Comprehension/Reeng/Maint"

Chun Li, Hui Li, Zhong Li, Minxue Pan, Xuandong Li, "Enhancing Fault Localization in Industrial Software Systems via Contrastive Learning"

Abstract: Engineers utilize logs as a primary resource for fault localization in large-scale software and system testing, a process that is notoriously time-consuming, costly, and labor-intensive. Despite considerable progress in automated fault localization approaches, their applicability remains limited in such settings, due to the unavailability of fine-grained features in logs essential for most existing fault localization methods. In response, we introduce FALCON, a novel log-based fault localization framework. FALCON organizes complex semantic log information into graphical representations and employs contrastive learning to capture the differences between passed and failed logs, enabling the identification of crucial fault-related features. It also incorporates a specifically designed transitive analysis-based adaptive graph augmentation to minimize the influence of fault-unrelated log information on contrastive learning. Through extensive evaluations against 34 spectrum-based and 4 learning-based fault localization methods, FALCON demonstrates superior performance by outperforming all the methods in comparison. In addition, FALCON demonstrated its practical value by successfully identifying 71 out of 90 faults with a file-level Top-1 accuracy rate during a one-month deployment within a global company’s testing system.

Tags: "Prog Comprehension/Reeng/Maint", "Analysis", "Other: Visualization"

Wenqian Deng, Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Coni: Detecting Database Connector Bugs via State-Aware Test Case Generation"

Abstract: Database connectors are widely used in many applications to facilitate flexible and convenient database interactions. Potential vulnerabilities in database connectors can lead to various abnormal behaviors within applications, such as returning incorrect results or experiencing unexpected connection interruption. However, existing fuzzing works cannot be directly applied to testing database connectors as they mainly focus on SQL generation and use a small subset of database connector interfaces to execute SQLs. Due to a lack of domain knowledge, automated test case generation also struggles to generate complex test cases that explore connectors' deep logic. The main challenge in testing database connectors is to generate semantically correct test cases that can trigger a wide range of connector state transitions. To address that, we propose CONI, a framework designed for detecting logic bugs of database connectors with state-aware test case generation. First, we define the database connector state model by analyzing the corresponding specification. Building upon this model, CONI generates interface call sequences within test cases to encompass more connector state transitions. After that, CONI generates suitable parameter values based on the parameter information and contextual information collected during runtime. Then the test cases are executed on a target and a reference database connector. Inconsistent results indicate potential logic bugs. We evaluate CONI on 5 widely-used JDBC database connectors, namely MySQL Connector/J, MariaDB Connector/J, AWS JDBC Driver for MySQL, PGJDBC NG, and PostgreSQL JDBC Driver. In total, CONI successfully detected 44 previously unknown bugs, of which 34 have been confirmed.

Tags: "Testing and Quality", "Databases"

Gong Chen, Xiaoyuan Xie, Daniel Tang, Qi Xin, Wenjie Liu, "HedgeCode: A Multi-Task Hedging Contrastive Learning Framework for Code Search"

Abstract: Code search is a vital activity in software engineering, focused on identifying and retrieving the correct code snippets based on a query provided in natural language. Approaches based on deep learning techniques have been increasingly adopted for this task, enhancing the initial representations of both code and its natural language descriptions. Despite this progress, there remains an unexplored gap in ensuring consistency between the representation spaces of code and its descriptions. Furthermore, existing methods have not fully leveraged the potential relevance between code snippets and their descriptions, presenting a challenge in discerning fine-grained semantic distinctions among similar code snippets. To address these challenges, we introduce a multi-task hedging contrastive Learning framework for Code Search, referred to as HedgeCode. HedgeCode is structured around two primary training phases. The first phase, known as the representation alignment stage, proposes a hedging contrastive learning approach. This method aims to detect subtle differences between code and natural language text, thereby aligning their representation spaces by identifying relevance. The subsequent phase involves multi-task joint learning, wherein the previously trained model serves as the encoder. This stage optimizes the model through a combination of supervised and self-supervised contrastive learning tasks. Our framework’s effectiveness is demonstrated through its performance on the CodeSearchNet benchmark, showcasing HedgeCode’s ability to address the mentioned limitations in code search tasks.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Jiashuo Zhang, Yiming Shen, Jiachi Chen, Jianzhong Su, Yanlin Wang, Ting Chen, Jianbo Gao, Zhong Chen, "Demystifying and Detecting Cryptographic Defects in Ethereum Smart Contracts"

Abstract: To enhance smart contracts with cryptographic capabilities, Ethereum has officially provided a set of system-level cryptographic APIs, such as ecrecover. These APIs have been utilized in over 10% of Ethereum transactions, motivating developers to implement various on-chain cryptographic tasks, such as digital signatures. However, since developers may not always be cryptographic experts, their ad-hoc and potentially defective implementations could compromise the theoretical guarantees of cryptography, leading to real-world security issues. To mitigate this threat, we conducted the first study aimed at demystifying and detecting cryptographic defects in smart contracts. Through the analysis of 2,406 real-world security reports, we defined nine types of cryptographic defects in smart contracts with detailed descriptions and practical detection patterns. Based on this categorization, we proposed CrySol, a fuzzing-based tool to automate the detection of cryptographic defects in smart contracts. It combines transaction replaying and dynamic taint analysis to extract fine-grained crypto-related semantics and employs crypto-specific strategies to guide the test case generation process. urthermore, we collected a large-scale dataset containing 25,745 real-world crypto-related smart contracts and evaluated CrySol's effectiveness on it. The result demonstrated that CrySol achieves an overall precision of 95.4% and a recall of 91.2%. Notably, CrySol revealed that 5,847 (22.7%) out of 25,745 contracts contain at least one cryptographic defect, highlighting the prevalence of these defects.

Tags: "Security", "Blockchain", "Other: Cryptography"

Chijin Zhou, Quan Zhang, Bingzhou Qian, Yu Jiang, "Janus: Detecting Rendering Bugs in Web Browsers via Visual Delta Consistency"

Abstract: Rendering lies at the heart of our modern web experience. However, the correctness of browser rendering is not always guaranteed, often leading to rendering bugs. Traditional differential testing, while successful in various domains, falls short when applied to rendering bug detection because an HTML file is likely yield different rendered outcomes across different browsers. This paper introduces Visual Delta Consistency, a test oracle to detect rendering bugs in web browsers, aiming to make rendered pages across browsers comparable. Our key insight is that any modifications made to an HTML file should uniformly influence rendering outcomes across browsers. Specifically, when presented with two HTML files that differ only by minor modifications, the reaction of all browsers should be consistent, i.e., either all browsers render them identically or all render them differently. Based on this insight, We implemented it as a practical fuzzer named Janus. It constructs pairs of slightly modified HTML files and observes the change statuses of the corresponding rendered pages across browsers for bug detection. We evaluated it on three widely-used browsers, i.e., Chrome, Safari, and Firefox. In total, Janus detected 34 rendering bugs, out of which 26 confirmed with 8 fixed by the developers.

Tags: "Testing and Quality", "Other: Web"

Seongmin Lee, Shreyas Minocha, Marcel Böhme, "Accounting for Missing Events in Statistical Information Leakage Analysis"

Abstract: The leakage of secret information via a public channel is a critical privacy flaw in software systems. The more information is leaked per observation, the less time an attacker needs to learn the secret. Due to the size and complexity of the modern software, and because some empirical facts are not available to a formal analysis of the source code, researchers started investigating statistical methods using program executions as samples. However, current statistical methods require a high sample coverage. Ideally, the sample is large enough to contain every possible combination of secret $\times$ observable value to accurately reflect the joint distribution of $\langle$secret, observable$\rangle$. Otherwise, the information leakage is severely underestimated, which is problematic as it can lead to overconfidence in the security of an otherwise vulnerable program. In this paper, we introduce an improved estimator for information leakage and propose to use methods from applied statistics to improve our estimate of the joint distribution when sample coverage is low. The key idea is to reconstruct the joint distribution by casting our problem as a multinomial estimation problem in the absence of samples for all classes. We suggest two approaches and demonstrate the effectiveness of each approach on a set of benchmark subjects. We also propose novel refinement heuristics, which help to adjust the joint distribution and gain better estimation accuracy. Compared to existing statistical methods for information leakage estimation, our method can safely overestimate the mutual information and provide a more accurate estimate from a limited number of program executions.

Tags: "Security", "Analysis"

Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Premkumar Devanbu, Toufique Ahmed, "Calibration and Correctness of Language Models for Code"

Abstract: Machine learning models are widely used but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. For example, outputs can be associated with a \emph{confidence measure}; if this confidence measure is strongly associated with \emph{likelihood of correctness}, then the model is said to be \emph{well-calibrated}. In this case, the confidence measure can serve as a basis for rational graduated decision making on how much review and care is needed. \emph{Calibration} has so far been studied in mostly non-generative (\emph{e.g.}, classification) settings, especially in Software Engineering. However, generated code can quite often be wrong: Given generated code developers must decide whether to directly use, use after varying intensity of careful review, or discard model-generated code; thus calibration is vital in generative settings. In this paper we make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that by and large generative code models are \textbf{\textit{\underline{not}}} well-calibrated out of the box. We then show how calibration can be improved, using standard methods such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate the applicability and generalizability of Platt scaling in Software Engineering, discuss settings where it has good potential for practical use, and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offers a framework for future research to further improve calibration methods for generative models in Software Engineering.

Tags: "AI for SE"

Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, Rishi Poddar, Aseem Rastogi, "RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code"

Abstract: The Rust programming language, with its safety guarantees, has established itself as a viable choice for low-level systems programming language over the traditional, unsafe alternatives like C/C++. These guarantees come from a strong ownership-based type system, as well as primitive support for features like closures, pattern matching, etc., that make the code more concise and amenable to reasoning. These unique Rust features also pose a steep learning curve for programmers. This paper presents a tool called RustAssistant that leverages the emergent capabilities of Large Language Models (LLMs) to automatically suggest fixes for Rust compilation errors. RustAssistant uses a careful combination of prompting techniques as well as iteration between an LLM and the Rust compiler to deliver high accuracy of fixes. RustAssistant is able to achieve an impressive peak accuracy of roughly 74% on real-world compilation errors in popular open-source Rust repositories. We also contribute a dataset of Rust compilation errors to enable further research.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

guoping rong, Yongda Yu, Song Liu, Xin Tan, Tianyi Zhang, Haifeng Shen, Jidong Hu, "Code Comment Inconsistency Detection and Rectification Using a Large Language Model"

Abstract: Comments are widely used in source code. If a comment is consistent with the code snippet it intends to annotate, it would aid code comprehension. Otherwise, Code Comment Inconsistency (CCI) is not only detrimental to the understanding of code, but more importantly, it would negatively impact the development, testing, and maintenance of software. To tackle this issue, existing research has been primarily focused on detecting inconsistencies with varied performance. It is evident that detection alone does not solve the problem; it merely paves the way for solving it. A complete solution requires detecting inconsistencies and, more importantly, rectifying them by amending comments. However, this type of work is scarce. In this paper, we contribute C4RLLaMA, a fine-tuned large language model based on the open-source CodeLLaMA. It not only has the ability to rectify inconsistencies by correcting relevant comment content but also outperforms state-of-the-art approaches in detecting inconsistencies. Experiments with various datasets confirm that C4RLLaMA consistently surpasses both Post Hoc and Just-in-time CCI detection approaches. More importantly, C4RLLaMA outperforms substantially the only known CCI rectification approach in terms of multiple performance metrics. To further examine C4RLLaMA's efficacy in rectifying inconsistencies, we conducted a manual evaluation, and the results showed that the percentage of correct comment updates by C4RLLaMA was 65.0\% and 55.9\% in Just-in-time and Post Hoc, respectively, implying C4RLLaMA's real potential in practical use.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Kai Huang, Jian Zhang, Xiangxin Meng, Yang Liu, "Template-Guided Program Repair in the Era of Large Language Models"

Abstract: Recent advancements in automated program repair (APR) have been significantly driven by the application of Large Language Models (LLMs). In particular, the integration of LLMs with traditional template-based repair methods has demonstrated effective outcomes. Despite this, the synergy between the strengths of traditional methods and LLMs remains underexploited. This oversight originates from the indiscriminate use of templates and their insufficient coverage. Also, using small-scale LLMs within the zero-shot learning context proves to be suboptimal. To alleviate the limitations, we propose NTR (Neural Template Repair), a two-stage repair framework including template selection and patch generation, both of which are under the fine-tuning paradigm. In the template selection phase, we formulate it as a multiclass classification problem and fine-tune million-level LLMs for better selecting possible templates. During the patch generation phase, we leverage the chosen templates as probable directions (e.g., `Mutate Conditional Expression') to guide the fine-tuning process of LLMs at the billion-level scale for precise patch creation. Moreover, we incorporate a unique template to signify the absence of a suitable template and employ a probability-based prioritization of templates, thereby optimizing patch generation. This framework not only effectively addresses template mismatch issues, but also enables the billion-level LLMs to explore the patch space more efficiently, despite the GPU memory constraints. We evaluate NTR with different foundational models on Defects4J V1.2 and HumanEval-Java, the framework consistently demonstrates significant effectiveness. When utilizing StarCoder as the foundational model for patch generation, NTR fixes 128 and 129 bugs in Defects4J and HumanEval, outperforming the best baseline APR tool by 14 and 59 bugs. With the larger CodeLlama model, the fixed bugs rise to 139 and 136, respectively, exceeding the baseline by 25 and 66 bugs. Notably, the performance stems not only from the foundational models but also benefits greatly from our NTR framework. Specifically, NTR's implementation with StarCoder and CodeLlama leads to 22 and 23 additional fixes, which is beyond what the models achieve on their own. This emphasizes the success of our new perspective on utilizing templates to unlock the bug-fixing potential of LLMs.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Syed Fatiul Huq, Mahan Tafreshipour, Kate Kalcevich, Sam Malek, "Automated Generation of Accessibility Test Reports from Recorded User Transcripts"

Abstract: Testing for accessibility is a significant step when developing software, as it ensures that all users, including those with disabilities, can effectively engage with web and mobile applications. While automated tools exist to detect accessibility issues in software, none are as comprehensive and effective as the process of user testing, where testers with various disabilities evaluate the application for accessibility and usability issues. However, user testing is not popular with software developers as it requires conducting lengthy interviews with users and later parsing through large recordings to derive the issues to fix. In this paper, we explore how large language models (LLMs) like GPT 4.0, which have shown promising results in context comprehension and semantic text generation, can mitigate this issue and streamline the user testing process. Our solution, called Reca11, takes in informal transcripts of test recordings and extracts the accessibility and usability issues mentioned by the tester. Our systematic prompt engineering determines the optimal configuration of input, instruction, context and demonstrations for best results. We evaluate Reca11's effectiveness on 36 user testing sessions across three applications. Based on the findings, we investigate the strengths and weaknesses of using LLMs in this space.

Tags: "AI for SE", "User experience"

Xinyu Lian, Yinfang Chen, Runxiang Cheng, Jie Huang, Parth Thakkar, Minjia Zhang, Tianyin Xu, "Large Language Models as Configuration Validators"

Abstract: Misconfigurations are major causes of software failures. Existing practices rely on developer-written rules or test cases to validate configurations, which are expensive. Machine learning (ML) for configuration validation is considered a promising direction, but has been facing challenges such as the need of large-scale field data and system-specific models. Recent advances in Large Language Models (LLMs) show promise in addressing some of the long-lasting limitations of ML-based configuration validation. We present a first analysis on the feasibility and effectiveness of using LLMs for configuration validation. We empirically evaluate LLMs as configuration validators by developing a generic LLM-based configuration validation framework, named Ciri. Ciri employs effective prompt engineering with few-shot learning based on both valid configuration and misconfiguration data. Ciri checks outputs from LLMs when producing results, addressing hallucination and nondeterminism of LLMs. We evaluate Ciri’s validation effectiveness on eight popular LLMs using configuration data of ten widely deployed open-source systems. Our analysis (1) confirms the potential of using LLMs for configuration validation, (2) explores design space of LLMbased validators like Ciri, and (3) reveals open challenges such as ineffectiveness in detecting certain types of misconfigurations and biases towards popular configuration parameters.

Tags: "AI for SE", "Security"

Wen Zhang, Botang Xiao, Qingchen Kong, Le Guan, Wenwen Wang, "BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries"

Abstract: This paper presents BSan, a practical software-only memory error detector for binary code. Different from state-of-the-art binary-level detectors, which rely on either the shadow memory-based approach or the hardware-specific feature and thus suffer from several fundamental limitations, BSan adopts an identifier-based approach, enabling it to detect deep memory errors missed by existing detectors. Also, BSan does not depend on any specific hardware features. To reduce the high performance overhead caused by identifier propagation, BSan creates a novel hybrid approach, static analysis+dynamic instrumentation, to improve the performance without inheriting the poor reliability of static binary rewriting, distinguishing it from existing detectors that simply refer to static binary rewriting for better performance. The comprehensive evaluation demonstrates that BSan can detect more memory errors than state-of-the-art binary-level detectors. Meanwhile, the performance and memory overheads of BSan are comparable to those of existing detectors.

Tags: "Security", "Analysis"

Ying Fu, Zhiyong Wu, Yuanliang Zhang, Jie Liang, Jingzhou Fu, Yu Jiang, Shanshan Li, Xiangke Liao, "Thanos: DBMS Bug Detection via Storage Engine Rotation Based Differential Testing"

Abstract: Differential testing is a prevalent strategy for establishing test oracles in automated DBMS testing. However, meticulously selecting equivalent DBMSs with diverse implementations and compatible input syntax requires huge manual efforts. In this paper, we propose Thanos, a framework that finds DBMS bugs via storage engine rotation based differential testing. Our key insight is that a DBMS with different storage engines must provide consistent basic storage functionalities. Therefore, it’s feasible to construct equivalent DBMSs based on storage engine rotation, ensuring that the same SQL test cases to these equivalent DBMSs yield consistent results. The framework involves four main steps: 1) select the appropriate storage engines; 2) extract equivalence information among the selected storage engines; 3) synthesize feature-orient test cases that ensure the DBMS equivalence; and 4) send test cases to the DBMSs with selected storage engines and compare the results. We evaluate Thanos on three widely used and extensively tested DBMSs, namely MySQL, MariaDB, and Percona against state-of-the-art fuzzers SQLancer, SQLsmith, and Squirrel. Thanos outperforms them on branch coverage by 24%–116%, and also finds many bugs missed by other fuzzers. More importantly, the vendors have confirmed 32 previously unknown bugs found by Thanos, with 29 verified as Critical.

Tags: "Testing and Quality", "Databases"

Courtney Miller, Mahmoud Jahanshahi, Audris Mockus, Bogdan Vasilescu, Christian Kästner, "Understanding the Response to Open-Source Dependency Abandonment in the npm Ecosystem"

Abstract: Many developers relying on open-source digital infrastructure expect continuous maintenance, but even the most critical packages can become unmaintained. Despite this, there is little understanding of the prevalence of abandonment of widely-used packages, of subsequent exposure, and of reactions to abandonment in practice, or the factors that influence them. We perform a large-scale quantitative analysis of all widely-used npm packages and find that abandonment is common among them, that abandonment exposes many projects which often do not respond, that responses correlate with other dependency management practices, and that removal is significantly faster when a projects end-of-life status is explicitly stated. We end with recommendations to both researchers and practitioners who are facing dependency abandonment or are sunsetting projects, such as opportunities for low-effort transparency mechanisms to help exposed projects make better, more informed decisions.

Tags: "Prog Comprehension/Reeng/Maint", "Human/Social"

Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen, "Vulnerability Detection with Code Language Models: How Far Are We?"

Abstract: In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26\% F1 on BigVul but only 3.09\% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.

Tags: "AI for SE", "Security"

Deepak-George Thomas, Matteo Biagiola, Nargiz Humbatova, Mohammad Wardat, Gunel Jahangirova, Hridesh Rajan, Paolo Tonella, "µPRL: A Mutation Testing Pipeline for Deep Reinforcement Learning based on Real Faults"

Abstract: Reinforcement Learning (RL) is increasingly adopted to train agents that can deal with complex sequential tasks, such as driving an autonomous vehicle or controlling a complex environment. Correspondingly, novel approaches are needed to ensure that RL agents have been tested adequately before going to production. Among them, mutation testing is quite promising, especially under the assumption that the injected faults (mutations) mimic the real ones. In this paper, we first describe a taxonomy of real RL faults obtained by repository mining. Then, we present the mutation operators derived from such real faults and implemented in the tool µPRL. Finally, we discuss the experimental results, which show that µPRL is extremely effective at discriminating strong from weak test generators, hence providing useful feedback to developers about the adequacy of the test scenarios generated and executed so far.

Tags: "SE for AI", "Testing and Quality"

Nadia Nahar, Haoran Zhang, Grace Lewis, Shurui Zhou, Christian Kästner, "The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML Products"

Abstract: Machine learning (ML) components are increasingly incorporated into software products for end-users, but developers face challenges in transitioning from ML prototypes to products. Academics have limited access to the source of commercial ML products, challenging research progress. In this study, first, we contribute a novel process to identify 262 open-source ML products among more than half a million ML-related projects on GitHub. Then, we qualitatively and quantitatively analyze 30 open-source ML products to answer six broad research questions about development practices and system architecture. We find that the majority of the ML products in our sample represent startup-style development reported in past interview studies. We report 21 findings, including limited involvement of data scientists in many ML products, unusually low modularity between ML and non-ML code, diverse architectural choices on incorporating models into products, and limited prevalence of industry best practices such as model testing, pipeline automation, and monitoring. Additionally, we discuss 7 implications of this study on research, development, and education, including the need for tools to assist teams without data scientists, education opportunities, and open-source-specific research for privacy-preserving telemetry.

Tags: "SE for AI", "MSR"

Sanan Hasanov, Stefan Nagy, Paul Gazzillo, "A Little Goes a Long Way: Tuning Configuration Selection for Continuous Kernel Fuzzing"

Abstract: The Linux kernel is actively-developed and widely-used. It supports billions of devices of all classes, from high-performance computing to the Internet-of-Things, in part because of its sophisticated configuration system, which automatically tailors the source code according to thousands of user-provided configuration options. Fuzzing has been highly successful at finding kernel bugs, being among the top bug reporters. Since the kernel receives 100s of patches per day, fuzzers run continuously, stopping regularly to rebuild the kernel with the latest changes before restarting fuzzing. But kernel fuzzers currently use predefined configuration settings that, as we show, exclude the majority of new patches from the kernel binary, nullifying the benefits of continuous fuzzing. Unfortunately, state-of-the-art configuration testing techniques are generally ill-suited to the needs of continuous fuzzing, excluding necessary options or requiring too many configuration files to be tractible. We distill down the needs of continuous testing into six properties with the most impact, systematically analyze the space of configuration selection strategies, and provide actionable recommendations. Through our analysis, we discover that continuous fuzzers can improve configuration variety without sacrificing performance. We empirically evaluate our discovery by modifying the configuration selection strategy for syzkaller, the most popular Linux kernel fuzzer, which subsequently found more than twice as many new bugs (35 vs. 13) than with the original configuration file and 12x more (24 vs. 2) when considering only unique bugs---with one security vulnerability being assigned a CVE.

Tags: "Testing and Quality", "Requirements"

Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, Yuriy Brun, "QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning"

Abstract: Formal verification is a promising method for producing highly reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using machine learning and a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. This makes proofs hard to find, and, when they are found, longer than necessary. Reinforcement learning could help estimate progress, but sparse rewards make this method ineffective. To address this problem, we create QEDCartographer, an novel automated proof-synthesis tool that combines supervised and reinforcement learning. QEDCartographer's key insight is that incorporating the branching structure of proofs into its learning enables reward-free search, mitigating the sparse reward challenge. We evaluate QEDCartographer on the CoqGym benchmark of 68,501 theorems from 124 open-source Coq projects. QEDCartographer proves 186 more theorems than Proverbot9001, a state-of-the-art proof synthesis tool, an increase of 8%. Further, the tools are complementary, together proving 12% more theorems than Proverbot9001 alone. For theorems both can prove, QEDCartographer produces 26% shorter proofs 27% faster.

Tags: "AI for SE", "Analysis"

Ben Limpanukorn, Jiyuan Wang, Hong Jin Kang, Eric Zitong Zhou, Miryung Kim, "Fuzzing MLIR Compilers with Custom Mutation Synthesis"

Abstract: A growing trend in compiler design is to enable modular extensions to intermediate representations (IRs). Multi- Level Intermediate Representation (MLIR) is a new effort to enable faster compiler development by providing an extensible framework for downstream developers to define custom IRs with MLIR dialects. Sets of MLIR dialects define new IRs that are tailored for specific domains. The diversity and rapid evolution of these IRs make it impractical to pre-define custom test generator logic for every available dialect. We design a new approach called SYNTHFUZZ that automatically infers and applies custom mutations from existing tests. The key essence of SYNTHFUZZ is that inferred custom mutations are parameterized and context-dependent such that they can be concretized differently depending on the target context. By doing this, we obviate the need to manually write custom mutations for newly introduced MLIR dialects. Further, SYNTHFUZZ increases the chance of finding effective edit locations and reduces the chance of inserting invalid edit content by performing k-ancestor- prefix and l-sibling-postfix matching. We compare SYNTHFUZZ to three baselines: Grammarinator—a grammar-based fuzzer without custom mutators, MLIRSmith—a custom test generator for MLIR, and NeuRI—a custom test generator with support for parameterized generation. We conduct this comprehensive comparison on 4 different MLIR projects where each project defines a new set of MLIR dialects that would take months of effort to manually write custom input generation and mutation logic. Our evaluation shows that SYNTHFUZZ on average improves input diversity by 1.51×, which increases branch coverage by 1.16×. Further, we show that our context dependent custom mutation increases the proportion of valid tests by up to 1.11×, indicating that SYNTHFUZZ correctly concretizes its parameterized mutations with respect to the target context. Parameterization of the mutations reduces the fraction of tests violating general MLIR constraints by 0.57×, increasing the time spent fuzzing dialect-specific code.

Tags: "Testing and Quality", "Security"

Forough Mehralian, Ziyao He, Sam Malek, "Automated Accessibility Analysis of Dynamic Content Changes on Mobile Apps"

Abstract: With mobile apps playing an increasingly vital role in our daily lives, the importance of ensuring their accessibility for users with disabilities is also growing. Despite this, app developers often overlook the accessibility challenges encountered by users of assistive technologies, such as screen readers. Screen reader users typically navigate content sequentially, focusing on one element at a time, unaware of changes occurring elsewhere in the app. While dynamic changes to content displayed on an app’s user interface may be apparent to sighted users, they pose significant accessibility obstacles for screen reader users. Existing accessibility testing tools are unable to identify challenges faced by blind users resulting from dynamic content changes. In this work, we first conduct a formative user study on dynamic changes in Android apps and their accessibility barriers for screen reader users. We then present TIMESTUMP, an automated framework that leverages our findings in the formative study to detect accessibility issues regarding dynamic changes. Finally, we empirically evaluate TIMESTUMP on real-world apps to assess its effectiveness and efficiency in detecting such accessibility issues.

Tags: "Testing and Quality", "Human/Social"

Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, Zibin Zheng, "RLCoder: Reinforcement Learning for Repository-Level Code Completion"

Abstract: Repository-level code completion aims to generate code for unfinished code snippets within the context of a specified repository. Existing approaches mainly rely on retrieval-augmented generation strategies due to limitations in input sequence length. However, traditional lexical-based retrieval methods like BM25 struggle to capture code semantics, while model-based retrieval methods face challenges due to the lack of labeled data for training. Therefore, we propose RLCoder, a novel reinforcement learning framework, which can enable the retriever to learn to retrieve useful content for code completion without the need for labeled data. Specifically, we iteratively evaluate the usefulness of retrieved content based on the perplexity of the target code when provided with the retrieved content as additional context, and provide feedback to update the retriever parameters. This iterative process enables the retriever to learn from its successes and failures, gradually improving its ability to retrieve relevant and high-quality content. Considering that not all situations require information beyond code files and not all retrieved context is helpful for generation, we also introduce a stop signal mechanism, allowing the retriever to decide when to retrieve and which candidates to retain autonomously. Extensive experimental results demonstrate that RLCoder consistently outperforms state-of-the-art methods on CrossCodeEval and RepoEval, achieving 12.2\% EM improvement over previous methods. Moreover, experiments show that our framework can generalize across different programming languages and further improve previous methods like RepoCoder.

Tags: "AI for SE"

Nausheen Mohammed, Akash lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma, "LLM Assistance for Memory Safety"

Abstract: Memory safety violations in low-level code, written in languages like C, continues to remain one of the major sources of software vulnerabilities. One method of removing such violations by construction is to port C code to a safe C dialect. Such dialects rely on programmer-supplied annotations to guarantee safety with minimal runtime overhead. This porting, however, is a manual process that imposes significant burden on the programmer and, hence, there has been limited adoption of this technique. The task of porting not only requires inferring annotations, but may also need refactoring/rewriting of the code to make it amenable to such annotations. In this paper, we use Large Language Models (LLMs) towards addressing both these concerns. We show how to harness LLM capabilities to do complex code reasoning as well as rewriting of large codebases. We also present a novel framework for whole-program transformations that leverages lightweight static analysis to break the transformation into smaller steps that can be carried out effectively by an LLM. We implement our ideas in a tool called MSA that targets the CheckedC dialect. We evaluate MSA on several micro-benchmarks, as well as real-world code ranging up to 20K lines of code. We showcase superior performance compared to a vanilla LLM baseline, as well as demonstrate improvement over a state-of-the-art symbolic (non-LLM) technique.

Tags: "AI for SE", "Security"

Chenkai Guo, Qianlu Wang, Naipeng Dong, Lingling Fan, Tianhong Wang, Weijie Zhang, EnBao Chen, Zheli Liu, Lu Yu, "EP-Detector: Automatic Detection of Error-prone Operation Anomalies in Android Applications"

Abstract: Android applications are pervasively adopted and heavily relied on in our daily life, leading to the growing demand for enhanced user experiences, such as ease for operation and robustness. Nevertheless, developers continue to prioritize traditional functionality and performance, overlooking the pivotal role of user experience in real-world scenarios. For example, poorly designed page elements can lead to user confusion, resulting in unexpected outcomes, termed as the error-prone operation anomalies (EPAs). In this work, we undertake the first effort to uncover the underlying essence of the EPA problem. To achieve this objective, we investigated the root causes of EPAs from three dimensions, i.e., subject, object and environment. These causes were identified by multi-stage attribute capturing and precise similarity computation. In this process, the causes are categorized into fine-grained classes, namely confusing behaviours, unsuitable layout, and resource overload. Building upon these insights, we propose a dynamic GUI-based testing tool EP-Detector to facilitate detecting the EPAs in real-world apps. The EP-Detector is equipped with widget-exploration based target navigation and automatic test oracle, enabling it to detect error-prone page elements and simulate events with both comprehensiveness and precision. To systematically study the prevalence and severity of real-world EPAs, we conducted experiments on 53 popular Android apps with EP-Detector. The confirmed results not only validate the high precision and completeness of EP-Detector but also highlight that EPAs are prevalent in current apps, with at least one EPA existing in every two page widgets on average, and 28.3% of them may lead to security and functionality issues or risks. The EP-Detector is available at https://github.com/WordDealer/EP-Detector.

Tags: "Testing and Quality", "Mobile SW"

Shuzheng Gao, Cuiyun Gao, Wenchao Gu, Michael Lyu, "Search-Based LLMs for Code Optimization"

Abstract: The code written by developers usually suffers from efficiency problems and contain various performance bugs. These inefficiencies necessitate the research of automated refactoring methods for code optimization. Early research in code optimization employs rule-based methods and focuses on specific inefficiency issues, which are labor-intensive and suffer from the low coverage issue. Recent work regards the task as a sequence generation problem, and resorts to deep learning (DL) techniques such as large language models (LLMs). These methods typically prompt LLMs to directly generate optimized code. Although these methods show state-of-the-art performance, such one-step generation paradigm is hard to achieve an optimal solution. First, complex optimization methods such as combinatorial ones are hard to be captured by LLMs. Second, the one-step generation paradigm poses challenge in precisely infusing the knowledge required for effective code optimization within LLMs, resulting in under-optimized code. To address these problems, we propose to model this task from the search perspective, and propose a search-based LLMs framework named SBLLM that enables iterative refinement and discovery of improved optimization methods. SBLLM synergistically integrate LLMs with evolutionary search and consists of three key components: 1) an execution-based representative sample selection part that evaluates the fitness of each existing optimized code and prioritizes promising ones to pilot the generation of improved code; 2) an adaptive optimization pattern retrieval part that infuses targeted optimization patterns into the model for guiding LLMs towards rectifying and progressively enhancing their optimization methods; and 3) a genetic operator-inspired chain-of-thought prompting part that aids LLMs in combining different optimization methods and generating improved optimization methods. Our evaluation of SBLLM on a dataset of Python and C++ code demonstrates its effectiveness in improving code efficiency. Specifically, the results indicate that SBLLM can improve program execution efficiency by up to 109.59% and consistently outperform all baseline methods by 8.72% ∼ 28.06% and 1.15% ∼ 9.56% with different LLMs in terms of top-5 speedup rate on Python and C++, respectively.

Tags: "AI for SE"

Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, Hui Liu, "A First Look at Conventional Commits Classification"

Abstract: Modern distributed software development relies on commits to control system versions. Commit classification plays a vital role in both industry and academia. The widely-used commit classification framework was proposed in 1976 by Swanson and includes three base classes: perfective, corrective, and adaptive. With the increasing complexity of software development, the industry has shifted towards a more fine-grained commit category, i.e., adopting Conventional Commits Specification (CCS) for delicacy management. The new commit framework requires developers to classify commits into ten distinct categories, such as ``feat'', ``fix'', and ``docs''. However, existing studies mainly focus on the three-category classification, leaving the definition and application of the fine-grained commit categories as knowledge gaps. This paper reports a preliminary study on this mechanism from its application status and problems. We also explore ways to address these identified problems. We find that a growing number of projects on GitHub are adopting CCS. By analyzing 194 issues from GitHub and 100 questions from Stack Overflow about the CCS application, we qualitatively categorized 52 challenges developers encountered. The most common one is CCS-type confusion. To address these challenges, we propose a clear definition of CCS types based on existing variants. Further, we designed an approach to automatically classify commits into CCS types, and the evaluation results demonstrate a promising performance. Our work facilitates a deeper comprehension of the present fine-grained commit categorization and holds the potential to alleviate application challenges significantly.

Tags: "Prog Comprehension/Reeng/Maint", "AI for SE"

Chong Wang, Jian Zhang, Yiling Lou, Mingwei Liu, Weisong Sun, Yang Liu, Xin Peng, "TIGER: A Generating-Then-Ranking Framework for Practical Python Type Inference"

Abstract: Python’s dynamic typing system offers flexibility and expressiveness but can lead to type-related errors, prompting the need for automated type inference despite efforts like Python Enhancement Proposals (PEPs) to enhance type hinting. While existing learning-based approaches show promising inference accuracy, they struggle with practical challenges in comprehensively handling various types, including complex generics and (unseen) user/library-defined types. To address these challenges, we introduce TIGER, employing a two-stage generating-then-ranking (GTR) framework. By fine-tuning pre-trained code models, TIGER trains a generation model with a generative span masking objective and a similarity model with a contrastive training objective. This enables TIGER to execute the GTR inference, generating diverse candidates and then ranking them alongside user/library-defined types. Evaluation on the ManyTypes4Py dataset demonstrates TIGER’s effectiveness across different type categories, particularly excelling in (unseen) user-defined types (with improvements of 11.2% and 20.1% in Top-5 Exact Match). The evaluation results also confirm the robustness and efficiency of TIGER, highlighting the contributions of the employed two stages.

Tags: "AI for SE", "Analysis"

Qingchao Shen, Yongqiang Tian, Haoyang Ma, Junjie Chen, Lili Huang, Ruifeng Fu, Shing-Chi Cheung, Zan Wang, "A Tale of Two DL Cities: When Library Tests Meet Compiler"

Abstract: Deep Learning (DL) compilers typically load a DL model and optimize it with intermediate representation. Existing DL compiler testing techniques mainly focus on model optimization stages, but rarely explore bug detection at the model loading stage. Effectively testing the model loading stage requires covering diverse usages of each DL operator from various DL libraries, which shares a common objective with DL library testing, indicating that the embedded knowledge in DL library tests could potentially be beneficial for testing the model loading stage of DL compilers. Thus, we conducted the first empirical study to investigate the effectiveness and efficiency of migrating the knowledge embedded in DL library tests to test the model loading stage. To support the conduct of this study, we develop a technique, called OPERA, consisting of test migration (regarding effectiveness investigation) and test prioritization (regarding efficiency investigation). We considered three sources of tests in DL libraries for migration and used eight frontends from three DL compilers (e.g., TVM, TensorRT, and OpenVINO) for evaluation. The migrated tests with the aid of OPERA detected 170 previously unknown bugs in total, 90 of which have been confirmed/fixed by developers, demonstrating the effectiveness of such the migration-based idea. The test prioritization strategy in OPERA improves testing efficiency with migrated tests by 11.9%~47.4% on average compared to general test prioritization strategies. Finally, we obtained 7 major findings and provided a set of guidelines for future work from this study.

Tags: "SE for AI", "Testing and Quality"

Rodrigo Pedro, Miguel E. Coimbra, Daniel Castro, Paulo Carreira, Nuno Santos, "Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses"

Abstract: Large Language Models (LLMs) have found widespread applications in various domains, including web applications with chatbot interfaces. Aided by an LLM-integration middleware such as LangChain, user prompts are translated into SQL queries used by the LLM to provide meaningful responses to users. However, unsanitized user prompts can lead to SQL injection attacks, potentially compromising the security of the database. In this paper, we present a comprehensive examination of prompt-to-SQL (P2SQL) injections targeting web applications based on frameworks such as LangChain and LlamaIndex. We characterize P2SQL injections, exploring their variants and impact on application security through multiple concrete examples. We evaluate seven state-of-the-art LLMs, demonstrating the risks of P2SQL attacks across language models. By employing both manual and automated methods, we discovered P2SQL vulnerabilities in five real-world applications. Our findings indicate that LLM-integrated applications are highly susceptible to P2SQL injection attacks, warranting the adoption of robust defenses. To counter these attacks, we propose four effective defense techniques that can be integrated as extensions to the LangChain framework.

Tags: "AI for SE", "Security"

Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia, "Reasoning Runtime Behavior of a Program with LLM: How Far Are We?"

Abstract: Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4\%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). Evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, Yang Liu, "Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications"

Abstract: Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that large language models (LLMs) have potential in auditing smart contracts, but the state-of-the-art indicates that even GPT-4 can achieve only 30% precision (when both decision and justification are correct). This is likely because off-the-shelf LLMs were primarily pre-trained on a general text/code corpus and not fine- tuned on the specific domain of Solidity smart contract auditing. In this paper, we propose iAudit, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, iAudit is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. As such, iAudit employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate causes of vulnerabilities. However, fine-tuning alone faces challenges in accurately identifying the optimal cause of a vulnerability. Therefore, we introduce two LLM-based agents, the Ranker and Critic, to iteratively select and debate the most suitable cause of vulnerability based on the output of the fine-tuned Reasoner model. To evaluate iAudit, we collected a balanced dataset with 1,734 positive and 1,810 negative samples to fine-tune iAudit. We then compared it with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) as well as prompt learning-based LLMs (GPT4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, iAudit achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by iAudit achieved a consistency of about 38% compared to the ground truth causes.

Tags: "AI for SE", "Security"

Shuo Yang, Xingwei Lin, Jiachi Chen, Qingyuan Zhong, Lei Xiao, Renke Huang, Yanlin Wang, Zibin Zheng, "Hyperion: Unveiling DApp Inconsistencies using LLM and Dataflow-Guided Symbolic Execution"

Abstract: The rapid advancement of blockchain platforms has significantly accelerated the growth of decentralized applications (DApps). Similar to traditional applications, DApps integrate front-end descriptions that showcase their features to attract users, and back-end smart contracts for executing their business logic. However, inconsistencies between the features promoted in front-end descriptions and those actually implemented in the contract can confuse users and undermine DApps's trustworthiness. In this paper, we first conducted an empirical study to identify seven types of inconsistencies, each exemplified by a real-world DApp. Furthermore, we introduce Hyperion, an approach designed to automatically identify inconsistencies between front-end descriptions and back-end code implementation in DApps. This method leverages a fine-tuned large language model LLaMA2 to analyze DApp descriptions and employs dataflow-guided symbolic execution for contract bytecode analysis. Finally, Hyperion reports the inconsistency based on predefined detection patterns. The experiment on our ground truth dataset consisting of 54 DApps shows that Hyperion reaches 84.06\% overall recall and 92.06\% overall precision in reporting DApp inconsistencies. We also implement Hyperion to analyze 835 real-world DApps. The experimental results show that Hyperion discovers 459 real-world DApps containing at least one inconsistency.

Tags: "Analysis", "Security"

Zhiqing Zhong, Shilin He, Haoxuan Wang, Boxi Yu, Haowen Yang, Pinjia He, "An Empirical Study on Package-Level Deprecation in Python Ecosystem"

Abstract: Open-source software (OSS) plays a crucial role in modern software development. Utilizing OSS code can greatly accelerate software development, reduce redundancy, and enhance reliability. Python, a widely adopted programming language, is particularly renowned for its extensive and diverse third-party package ecosystem. However, a significant number of OSS packages within the Python ecosystem are in poor maintenance, leading to potential risks in terms of functionality and security. Consequently, it is essential to establish a deprecation mechanism that assists package developers and users in effectively managing these packages. To facilitate the establishment of the package-level deprecation mechanism, this paper presents a mixed-method empirical study, including data analysis and surveys. We investigate the current practices of announcing, receiving, and handling package-level deprecation in the Python ecosystem. We also assess the benefits of having deprecation announcements for inactively maintained packages. Furthermore, we investigate the challenges faced by package developers and users and their expectations for future deprecation practices. Our findings reveal valuable insights. For instance, 75.4\% of inactive package developers have no intention of releasing deprecation declarations for various reasons, while 89.5\% of users express a desire to be notified about the deprecation, highlighting a gap between developers and users; In many cases, no alternative solutions are available when deprecation occurs, emphasizing the need to explore practical approaches that enable seamless package handover and require less maintenance effort. We anticipate that our work will enhance the understanding of existing package-level deprecation patterns within the Python OSS realm and facilitate the development of deprecation practices for the Python community in the future.

Tags: "MSR", "Prog Comprehension/Reeng/Maint", "Open Source"

Lizhi Liao, Simon Eismann, Heng Li, Cor-Paul Bezemer, Diego Elias Costa, André van Hoorn, Weiyi Shang, "Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models"

Abstract: During software development, developers often make numerous modifications to the software to address existing issues or implement new features. However, certain changes may inadvertently have a detrimental impact on the overall system performance. To ensure that the performance of new software releases does not degrade (i.e., absence of performance regressions), existing practices rely on system-level performance testing, such as load testing, or component-level performance testing, such as microbenchmarking, to detect performance regressions. However, performance testing for the entire system is often expensive and time-consuming, posing challenges to adapting to the rapid release cycles common in modern DevOps practices. In addition, system-level performance testing cannot be conducted until the system is fully built and deployed. On the other hand, component-level testing focuses on isolated components, neglecting overall system performance and the impact of system workloads. In this paper, we propose a novel approach to early detection of performance regressions by bridging the local performance data generated by component-level testing and the system-level architectural models. Our approach uses local performance data to identify deviations at the component level, and then propagate these deviations to the architectural model. We then use the architectural model to predict regressions in the performance of the overall system. In an evaluation of our approach on two representative open-source benchmark systems, we show that it can effectively detect end-to-end system performance regressions from local performance deviations with different intensities and under various system workloads. More importantly, our approach can detect regressions as early as in the development phase, in contrast to existing approaches that require the system to be fully built and deployed. Our approach is lightweight and can complement traditional system performance testing when testing resources are scarce.

Tags: "Security", "Testing and Quality"

Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, Andy Zaidman, "Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests"

Abstract: Automated unit test generators, particularly search-based software testing tools like EvoSuite, are capable of generating tests with high coverage. Although these generators alleviate the burden of writing unit tests, they often pose challenges for software engineers in terms of understanding the generated tests. To address this, we introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We achieve this enhancement through contextualizing test data, improving identifier naming, and adding descriptive comments. Through a controlled experiment with 32 participants, we investigate how the understandability of unit tests affects a software engineer's ability to perform bug-fixing tasks. We selected bug-fixing to simulate a real-world scenario that emphasizes the importance of understandable test cases. We observe that participants working on assignments with test cases fix up to 33% more bugs and use up to 20\% less time when compared to baseline test cases. From the post-test questionnaire, we gathered that participants found that enhanced test names, test data, and variable names improved their bug-fixing process.

Tags: "Testing and Quality", "AI for SE"

Sebastian Uchitel, Francisco Cirelli, Dalal Alrajeh, "Unavoidable Boundary Conditions: A Control Perspective on Goal Conflicts"

Abstract: Boundary Conditions (BCs) express situations under which requirements specifications conflict. They are used within a broader conflict management process to produce less idealized specifications. Several approaches have been proposed to identify BCs automatically. Some introduce a prioritization criteria to reduce the number of BCs presented to an engineer. However, identifying the few, relevant boundary conditions remains an open challenge. In this paper, we argue that one of the problems of the state of the art is with the definition of BC itself -- it is too weak. We propose a stronger definition for the few, relevant BCs, which we refer to as Unavoidable Boundary Conditions (UBCs), which utilizes the notion of realizability in reactive synthesis. We show experimentally that UBCs non-trivially reduce the number of conditions produced by existing BC identification techniques. We also relate UBCs to existing concepts in reactive synthesis used to provide feedback for unrealizable specifications (including counter-strategies and unrealizable cores). We then show that UBCs provide a targeted form of feedback for repairing unrealizable specifications.

Tags: "Requirements", "Analysis"

Brian Hyeongseok Kim, Jingbo Wang, Chao Wang, "FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks"

Abstract: We propose a method for formally certifying and quantifying individual fairness of a deep neural network (DNN). Individual fairness guarantees that any two individuals who are identical except for some protected input attribute (e.g., gender or race) receive the same treatment. While there are existing techniques that provide such a guarantee, they suffer from lack of scalability or accuracy as the size and input dimension of the DNN increase. Our method overcomes this limitation by applying abstraction to a symbolic interval based analysis of the DNN followed by iterative refinement guided by the fairness property. Furthermore, our method lifts the interval based analysis from the conventional qualitative certification to quantitative certification, by computing the percentage of individuals whose classification outputs are provably fair, instead of merely deciding if the DNN is fair. We have implemented our method and evaluated it on deep neural networks trained on five popular fairness research datasets. The experimental results show that our method is not only more accurate than state-of-the-art techniques but also several orders-of-magnitude faster.

Tags: "SE for AI", "Analysis"

Mingyuan Wu, Jiahong Xiang, Kunqiu Chen, Peng DI, Shin Hwei Tan, Heming Cui, Yuqun Zhang, "Tumbling Down the Rabbit Hole: How do Assisting Exploration Strategies Facilitate Grey-box Fuzzing?"

Abstract: Many assisting exploration strategies have been proposed to assist grey-box fuzzers in exploring program states guarded by tight and complex branch conditions such as equality constraints. Although they have shown promising results in their original papers, their evaluations seldom follow equivalent protocols, e.g., they are rarely evaluated on identical benchmarks. Moreover, there is a lack of sufficient investigations on the specifics of the program states explored by these strategies which can obfuscate the future application and development of such strategies. Consequently, there is a pressing need for a comprehensive study of assisting exploration strategies on their effectiveness, versatility, and limitations to enlighten their future development. To this end, we perform the first comprehensive study about the assisting exploration strategies for grey-box fuzzers. Specifically, we first collect nine recent fuzzers representing the mainstream assisting exploration strategies as our studied subjects and 21 real-world projects to form our benchmark suite. After evaluating the subjects on the benchmark suite, we then surprisingly find that the dictionary strategy is the most promising since it not only achieves similar or even slightly better performance over the other studied assisting exploration strategies in terms of exploring program states but also is more practical to be enhanced. Accordingly, we propose CDFUZZ, which generates a customized dictionary for each seed upon the baseline fuzzer AFL to improve over the original dictionary strategy. The evaluation results demonstrate that CDFUZZ increases the edge coverage by 16.1% on average for all benchmark projects over the best performer in our study (i.e., AFL++ with the dictionary strategy). CDFUZZ also successfully exposed 37 previously unknown bugs, with nine confirmed and seven fixed by the corresponding developers.

Tags: "Testing and Quality", "Analysis"

Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy, "Towards Neural Synthesis for SMT-assisted Proof-Oriented Programming"

Abstract: Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux, to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem---producing a definition given a formal specification expressed as an F* type. We provide a program-fragment checker that queries F* to check the correctness of candidate solutions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding in that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compare favorably with large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.

Tags: "AI for SE", "Security"

Kunpeng Zhang, Shuai Wang, Jitao Han, Xiaogang Zhu, Xian Li, Shaohua Wang, Sheng Wen, "Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models"

Abstract: Deep learning (DL) libraries are widely used to form the basis of various AI applications in computer vision, natural language processing, and software engineering domains. Despite their popularity, DL libraries are known to have vulnerabilities, such as buffer overflows, use-after-free, and integer overflows, that can be exploited to compromise the security or effectiveness of the underlying libraries. While traditional fuzzing techniques have been used to find bugs in software, they are not well-suited for DL libraries. In general, the complexity of DL libraries and the diversity of their APIs make it challenging to test them thoroughly. To date, mainstream DL libraries like TensorFlow and PyTorch have featured over 1,000 APIs, and the number of APIs is still growing. Fuzzing all these APIs is a daunting task, especially when considering the complexity of the input data and the diversity of the API usage patterns. Recent advances in large language models (LLMs) have illustrated the high potential of LLMs in understanding and synthesizing human-like code. Despite their high potential, we find that emerging LLM-based fuzzers are less optimal for DL library API fuzzing, given their lack of in-depth knowledge on API input edge cases and inefficiency in generating test inputs. In this paper, we propose DFUZZ, a LLM-driven DL library fuzzing approach. We have two key insights: (1) With high reasoning ability, LLMs can replace human experts to reason edge cases (likely error-triggering inputs) from checks in an API's code, and transfer the extracted knowledge to test other (new or rarely-tested) APIs. (2) With high generation ability, LLMs can synthesize initial test programs with high accuracy that automates API testing. DFUZZ provides LLMs with a novel ''white-box view'' of DL library APIs, and therefore, can leverage LLMs' reasoning and generation abilities to achieve comprehensive fuzzing. Our experimental results on popular DL libraries demonstrate that DFUZZ is able to cover more APIs than SOTA (LLM-based) fuzzers on TensorFlow and PyTorch, respectively. Moreover, DFUZZ successfully detected 37 bugs, with 17 already confirmed as previously unknown bugs.

Tags: "AI for SE", "Testing and Quality"

Yubo Mai, Zhipeng Gao, Haoye Wang, Tingting Bi, Xing Hu, Xin Xia, jianling Sun, "Towards Better Answers: Automated Stack Overflow Post Updating"

Abstract: Utilizing code snippets on Stack Overflow (SO) is a common practice among developers for problem-solving. Although SO code snippets serve as valuable resources, it is important to acknowledge their imperfections, reusing problematic code snippets can lead to the introduction of suboptimal or buggy code into software projects. \textit{SO comments} often point out weaknesses of a post and provide valuable insights to improve the quality of answers, while SO comments are usually missed and/or ignored, leaving these problematic code snippets untouched. In this work, we first investigate the task of automatic SO posts updating based on their associated comments. We introduce a novel framework, named \textbf{Soup} (\textbf{\underline{S}}tack \textbf{\underline{O}}verflow \textbf{\underline{U}}pdator for \textbf{\underline{P}}ost) for this task. \textbf{Soup} addresses two key tasks: Valid Comment-Edit Prediction (VCP) and Automatic Post Updating (APU). We fine-tuned a large language model, CodeLlama, using low-rank adaptation techniques to complete the VCP task, and constructed a dataset containing 78k valid comment-edit pairs for the APU task. Subsequently, we tested the performance of multiple large language models on the APU task. Extensive experimental results show the promising performance of our model over a set of benchmarks. Moreover, we also perform an in-the-wild evaluation on Stack Overflow, we submitted 50 edits generated by our approach to Stack Overflow posts and 21 of them have been verified and accepted by SO maintainers, further proving the practical value of \textbf{Soup}.

Tags: "AI for SE"

Tianchang Gao, Junjie Chen, Dong Wang, Yile Guo, Yingquan Zhao, Zan Wang, "Selecting Initial Seeds for Better JVM Fuzzing"

Abstract: JVM fuzzing techniques serve as a cornerstone for guaranteeing the quality of implementations. In typical fuzzing workflows, initial seeds are crucial as they form the basis of the process. Literature in traditional program fuzzing has confirmed that effectiveness is largely impacted by redundancy among initial seeds, thereby proposing a series of seed selection methods. JVM fuzzing, compared to traditional ones, presents unique characteristics, including large-scale and intricate code, and programs with both syntactic and semantic features. However, it remains unclear whether the existing initial seed selection methods are suitable for JVM fuzzing and whether utilizing program features can enhance effectiveness. To address this, we devised a total of 10 initial seed selection methods, comprising coverage-based, prefuzz-based, and program-feature-based methods. We then conducted an empirical study on three JVM implementations to extensively evaluate the performance of the initial seed selection methods within two state-of-the-art fuzzing techniques (JavaTailor and VECT). Specifically, we examine performance from three aspects: (i) effectiveness and efficiency using widely studied initial seeds, (ii) effectiveness using the programs in the wild, and (iii) the ability to detect new bugs. Evaluation results first show that the program-feature-based method that utilizes the control flow graph not only has a significantly lower time overhead (i.e., 30s), but also outperforms other methods, achieving 142% to 269% improvement compared to the full set of initial seeds. Second, results reveal that the initial seed selection greatly improves the quality of wild programs and exhibits complementary effectiveness by detecting new behaviors. Third, results demonstrate that given the same testing period, initial seed selection improves the JVM fuzzing techniques by detecting more unknown bugs. Particularly, 16 out of the 25 detected bugs have been confirmed or fixed by developers. This work takes the first look at initial seed selection in JVM fuzzing, confirming its importance in fuzzing effectiveness and efficiency.

Tags: "Testing and Quality", "Analysis"

Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, Zhenyu Chen, "Source Code Summarization in the Era of Large Language Models"

Abstract: To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top\_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types (e.g., procedural and object-oriented programming languages). Finally, we unexpectedly find that \codellama{} with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu, "Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers"

Abstract: Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues of software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. Thus, its applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose a novel machine-generated code detection method called DetectCodeGPT, which improves DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experiment results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.

Tags: "AI for SE", "Human/Social"

Yue Wang, Chao Yang, Xiaodong Zhang, Yuwanqi Deng, JianFeng Ma, "DPFuzzer: Discovering Safety Critical Vulnerabilities for Drone Path Planners"

Abstract: State-of-the-art drone path planners enable drones to autonomously navigate around obstacles in GPS-denied, uncharted and cluttered environments. However, our investigation shows that path planners fail to maneuver drones correctly in specific scenarios, leading to incidents such as collisions. To minimize such risks, drone path planners should be tested thoroughly against diverse scenarios before deployment. Existing research for testing drones to uncover safety-critical vulnerabilities is only focused on the flight control programs and is limited in the capability to generate diverse obstacle scenarios for testing drone path planners. In this work, we propose \textit{DPFuzzer}, an automated framework for testing drone path planners. \textit{DPFuzzer} is an evolutionary algorithm (EA) based testing framework. It aims to uncover vulnerabilities in drone path planners by generating diverse critical scenarios that can trigger vulnerabilities. To better guide the critical scenario generation, we introduce \textit{Environmental Risk Factor (ERF)}, a metric we propose, to abstract potential safety threats of scenarios. We evaluate \textit{DPFuzzer} on state-of-the-art drone path planners and the experimental result shows that \textit{DPFuzzer} can effectively find diverse vulnerabilities. Additionally, we demonstrate that these vulnerabilities are exploitable in the real world on commercial drones.

Tags: "Testing and Quality", "Analysis"

Shiyu Zhang, Haoyang Song, Qixin Wang, Henghua Shen, Yu Pei, "A Test Oracle for Reinforcement Learning Software based on Lyapunov Stability Control Theory"

Abstract: Reinforcement Learning (RL) has gained significant attention in recent years. As RL software becomes more complex and infiltrates critical application domains, ensuring its quality and correctness becomes increasingly important. An indispensable aspect of software quality/correctness insurance is testing. However, testing RL software faces unique challenges compared to testing traditional software, due to the difficulty on defining the outputs’ correctness. This leads to the RL test oracle problem. Current approaches to testing RL software often rely on human oracles, i.e. convening human experts to judge the correctness of RL software outputs. This heavily depends on the availability and quality (including the experiences, subjective states, etc.) of the human experts, and cannot be fully automated. In this paper, we propose a novel approach to design test oracles for RL software by leveraging the Lyapunov stability control theory. By incorporating Lyapunov stability concepts to guide RL training, we hypothesize that a correctly implemented RL software shall output an agent that respects Lyapunov stability control theories. Based on this heuristics, we propose a Lyapunov stability control theory based oracle, LPEA(ϑ, θ), for testing RL software. We conduct extensive experiments over representative RL algorithms and RL software bugs to evaluate our proposed oracle. The results show that our proposed oracle can outperform the human oracle in most metrics. Particularly, LPEA(ϑ = 100%, θ = 75%) outperforms the human oracle by 53.6%, 50%, 18.4%, 34.8%, 18.4%, 127.8%, 60.5%, 38.9%, and 31.7% respectively on accuracy, precision, recall, F1 score, true positive rate, true negative rate, false positive rate, false negative rate, and ROC curve’s AUC; and LPEA(ϑ = 100%, θ = 50%) outperforms the human oracle by 48.2%, 47.4%, 10.5%, 29.1%, 10.5%, 127.8%, 60.5%, 22.2%, and 26.0% respectively on these metrics.

Tags: "SE for AI", "Analysis"

Yisong Xiao, Aishan Liu, Xinwei Zhang, Tianyuan Zhang, Tianlin Li, Siyuan Liang, Xianglong Liu, Yang Liu, Dacheng Tao, "BDefects4NN: A Backdoor Defect Database for Controlled Localization Studies in Neural Networks"

Abstract: Pre-trained large deep learning models are now serving as the dominant component for downstream middleware users and have revolutionized the learning paradigm, replacing the traditional approach of training from scratch locally. To reduce development costs, developers often integrate third-party pre-trained deep neural networks (DNNs) into their intelligent software systems. However, utilizing untrusted DNNs presents significant security risks, as these models may contain intentional backdoor defects resulting from the black-box training process. These backdoor defects can be activated by hidden triggers, allowing attackers to maliciously control the model and compromise the overall reliability of the intelligent software. To ensure the safe adoption of DNNs in critical software systems, it is crucial to establish a backdoor defect database for localization studies. This paper addresses this research gap by introducing \emph{BDefects4NN}, the first backdoor defect database, which provides labeled backdoor-defected DNNs at the neuron granularity and enables controlled localization studies of defect root causes. In \emph{BDefects4NN}, we define three defect injection rules and employ four representative backdoor attacks across four popular network architectures and three widely adopted datasets, yielding a comprehensive database of 1,654 backdoor-defected DNNs with four defect quantities and varying infected neurons. Based on \emph{BDefects4NN}, we conduct extensive experiments on evaluating six fault localization criteria and two defect repair techniques, which show limited effectiveness for backdoor defects. Additionally, we investigate backdoor-defected models in practical scenarios, specifically in lane detection for autonomous driving and large language models (LLMs), revealing potential threats and highlighting current limitations in precise defect localization. This paper aims to raise awareness of the threats brought by backdoor defects in our community and inspire future advancements in fault localization methods.

Tags: "SE for AI", "Analysis"

Ian McCormack, Joshua Sunshine, Jonathan Aldrich, "A Study of Undefined Behavior Across Foreign Function Boundaries in Rust Libraries"

Abstract: Developers rely on the Rust programming language's static safety guarantees to write secure and performant applications. However, Rust is frequently used to interoperate with other languages which allow design patterns that conflict with Rust's aliasing models. Miri is the only dynamic analysis tool capable of validating applications against these models, but it does not support foreign functions, indicating that there may be a critical correctness gap at the heart of the Rust ecosystem. We conducted a large-scale evaluation of multi-language Rust libraries to determine whether Miri's dynamic analyses remain useful in this context. We used Miri and an LLVM interpreter to jointly execute multi-language applications, where we found 48 instances of undefined or undesired behavior. These include three bugs from libraries that had over 10,000 daily downloads on average during our observation period, and one from a library maintained by the Rust Project. Many of the errors we found involved incompatible aliasing patterns, but Rust's latest Tree Borrows aliasing model was significantly more permissive than the earlier Stacked Borrows model. The Rust community must invest in new, production-ready tooling for multi-language applications to ensure that developers can detect these errors.

Tags: "Analysis", "Security"

Qikang Liu, Yang He, Yanwen Cai, Byeongguk Kwak, Yuepeng Wang, "Synthesizing Document Database Queries using Collection Abstractions"

Abstract: Document databases are increasingly popular in various applications, but their queries are challenging to write due to the flexible and complex data model underlying document databases. This paper presents a synthesis technique that aims to generate document database queries from input-output examples automatically. A new domain-specific language is designed to express a representative set of document database queries in an algebraic style. Furthermore, the synthesis technique leverages a novel abstraction of collections for deduction to efficiently prune the search space and quickly generate the target query. An evaluation of 110 benchmarks from various sources shows that the proposed technique can synthesize 108 benchmarks successfully. On average, the synthesizer can generate document database queries from a small number of input-output examples within tens of seconds.

Tags: "AI for SE", "Testing and Quality"

Yesugen Baatartogtokh, Kaitlyn Cook, Alicia M. Grubb, "Exploring the Robustness of the Effect of EVO on Intention Valuation through Replication"

Abstract: The development of high-quality software depends on precise and comprehensive requirements that meet the objectives of stakeholders. Goal modeling techniques have been developed to fill this gap by capturing and analyzing stakeholders' needs and allowing them to make trade-off decisions; yet, goal modeling analysis is often difficult for stakeholders to interpret. Recent work found that when subjects are given minimal training on goal modeling and access to a color visualization, called EVO, they are able to use EVO to make goal modeling decisions faster without compromising quality. In this paper, we evaluate the robustness of the empirical evidence for EVO and question the underlying color choices made by the initial designers of EVO. We conduct a pseudo-exact replication ($n = 60$) of the original EVO study, varying the experimental site and the study population. Even in our heterogeneous sample with less a priori familiarity with requirements and goal modeling, we find that individuals using EVO answered the goal-modeling questions significantly faster than those using the control, expanding the external validity of the original results. However, we find some evidence that the chosen color scheme is not intuitive and make recommendations for the goal modeling community.

Tags: "Requirements", "Human/Social"

Yanick Fratantonio, Luca Invernizzi, Loua Farah, Kurt Thomas, Marina Zhang, Ange Albertini, Francois Galilee, Giancarlo Metitieri, Julien Cretin, Alex Petit-Bianco, David Tao, Elie Bursztein, "Magika: AI-Powered Content-Type Detection"

Abstract: The task of content-type detection---which entails identifying the data encoded in an arbitrary byte sequence---is critical for operating systems, development, reverse engineering environments, and a variety of security applications. In this paper, we introduce Magika, a novel AI-powered content-type detection tool. Under the hood, Magika employs a deep learning model that can execute on a single CPU with just 1MB of memory to store the model's weights. We show that Magika achieves an average F1 score of 99% across over a hundred content types and a test set of more than 1M files, outperforming all existing content-type detection tools today. In order to foster adoption and improvements, we open source Magika under an Apache 2 license on GitHub and make our model and training pipeline publicly available. Our tool has already seen adoption by a major email provider for attachment scanning, and it has been integrated with VirusTotal to aid malware analysis.

Tags: "AI for SE", "Security"

Xiafa Wu, Brian Demsky, "GenC2Rust: Towards Generating Generic Rust Code from C"

Abstract: Rust provides an exciting combination of strong safety guarantees and high performance. Many new systems are being implemented in Rust. Nevertheless, there is a large body of existing C code that could greatly benefit from Rust's safety guarantees. Unfortunately, the manual effort required to rewrite C code into Rust is often prohibitively expensive. Researchers have explored tools to assist developers in translating legacy C code into Rust code. However, the mismatch between C abstractions and idiomatic Rust abstractions makes it challenging to automatically utilize Rust's language features, resulting in non-idiomatic Rust code that requires extensive manual effort to further refactor. For example, existing tools often fail to map polymorphic uses of void pointers in C to Rust's more idiomatic generic pointers. In this paper, we present a translation tool, GenC2Rust, that translates non-generic C code into generic Rust code. GenC2Rust statically analyzes the use of void pointers in the C program to compute the typing constraints and then retypes the void pointers into generic pointers. We conducted an evaluation of GenC2Rust across 42 C programs that vary in size and span multiple domains to demonstrate its scalability as well as correctness. We also present a detailed analysis of the limiting factors encountered in the translation process.

Tags: "Analysis", "Prog Comprehension/Reeng/Maint"

Hongyan Gao, Yibiao Yang, Maolin Sun, Jiangchang Wu, Yuming Zhou, Baowen Xu, "ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs"

Abstract: Ensuring the reliability of the Rust compiler is of paramount importance, as the Rust language is increasingly being adopted for developing critical systems due to its emphasis on memory and thread safety. However, due to Rust’s complex syntax and strict requirements, generating valid test programs for the Rust compiler poses significant challenges. Currently, with the growing popularity of large language models (LLMs), much research in software testing has explored the use of LLMs to generate test cases. Despite this, directly using LLMs to generate Rust programs often results in a large number of invalid test cases. Existing studies have indicated that test cases triggering historical compiler bugs can assist in software testing. Our investigation into Rust compiler bug issues further supports this observation. Inspired by existing work and our empirical research, we introduce a bracket-based masking and filling strategy called clozeMask. The clozeMask strategy involves extracting test code from historical issue reports, identifying and masking code snippets with specific structures, and utilizing an LLM to fill in the masked portions for synthesizing new test programs. This approach harnesses the generative capabilities of LLMs while retaining the ability to trigger Rust compiler bugs. Ultimately, it enables comprehensive testing of the compiler’s behavior, particularly in exploring corner cases. We implemented our approach as a prototype CLOZEMASTER. CLOZEMASTER has identified 27 confirmed bugs for rustc and mrustc, of which 10 have been fixed by developers. Furthermore, our experimental results indicate that CLOZEMASTER outperforms existing generative fuzzers in terms of code coverage and effectiveness.

Tags: "Testing and Quality", "AI for SE"

Xingyu Wang, MingSen Wang, Wenbo Shen, Rui Chang, "Understanding and Detecting Peer Dependency Resolving Loop in npm Ecosystem"

Abstract: As the default package manager for Node.js, npm has become one of the largest package management systems in the world. To facilitate dependency management for developers, npm supports a special type of dependency, Peer Dependency, whose installation and usage differ from regular dependencies. However, conflicts between peer dependencies can trap the npm client into infinite loops, leading to resource exhaustion and system crashes. We name this problem PeerSpin. Although PeerSpin poses a severe risk to ecosystems, it was overlooked by previous studies, and its impacts have not been explored. To bridge this gap, this paper conducts the first in-depth study to understand and detect PeerSpin in the npm ecosystem. First, by systematically analyzing the npm dependency resolution, we identify the root cause of PeerSpin and characterize two peer dependency patterns to guide detection. Second, we propose a novel technique called Node-Replacement-Conflict based PeerSpin Detection, which leverages the state of the directory tree during dependency resolution to achieve accurate and efficient PeerSpin detection. Based on this technique, we developed a tool called PeerChecker to detect PeerSpin. Finally, we apply PeerChecker to the entire NPM ecosystem and find that 5,662 packages, totaling 72,968 versions, suffer from PeerSpin. Up until now, we confirmed 28 real PeerSpin problems by reporting them to the package maintainer. We also open source all PeerSpin analysis implementations, tools, and data sets to the public to help the community detect PeerSpin issues and enhance the reliability of the npm ecosystem.

Tags: "MSR", "Prog Comprehension/Reeng/Maint"

Jiashuo Zhang, Jiachi Chen, John Grundy, Jianbo Gao, Yanlin Wang, Ting Chen, Zhi Guan, Zhong Chen, "Automated Test Generation For Smart Contracts via On-Chain Test Case Augmentation and Migration"

Abstract: Pre-deployment testing has become essential to ensure the functional correctness of smart contracts. However, since smart contracts are stateful programs integrating many different functionalities, manually writing test cases to cover all potential usages requires significant effort from developers, leading to insufficient testing and increasing risks in practice. Although several testing techniques for smart contracts have been proposed, they primarily focus on detecting common low-level vulnerabilities such as re-entrancy, rather than generating expressive and function-relevant test cases that can reduce manual testing efforts. To bridge the gap, we propose SolMigrator, an automated technique designed to generate expressive and representative test cases for smart contracts. To our knowledge, SolMigrator is the first migration-based test generation technique for smart contracts, which extracts test cases from real-world usages of on-chain contracts and migrates them to test newly developed smart contracts with similar functionalities. Given a target smart contract to be tested and an on-chain similar source smart contract, SolMigrator first transforms the on-chain usage of the source contract into off-chain executable test cases based on on-chain transaction replay and dependency analysis. It then employs fine-grained static analysis to migrate the augmented test cases from the source to the target smart contract. We built a prototype of SolMigrator and have evaluated it on real-world smart contracts within the two most popular categories, ERC20 and ERC721. Our evaluation results demonstrate that SolMigrator effectively extracts test cases from existing on-chain smart contracts and accurately migrates them across different smart contracts, achieving an average precision of 96.3% and accuracy of 93.6%. Furthermore, the results indicate that these migrated test cases effectively cover common key functionalities of the target smart contracts. This provides promising evidence that real-world usages of existing smart contracts can be transformed into effective test cases for other newly developed smart contracts.

Tags: "Testing and Quality", "Prog Comprehension/Reeng/Maint"

Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian, Xinru Cheng, Chengnian Sun, "Toward a Better Understanding of Probabilistic Delta Debugging"

Abstract: Given a list L of elements and a property ψ that L exhibits, ddmin is a classic test input minimization algorithm that aims to automatically remove ψ-irrelevant elements from L. This algorithm has been widely adopted in domains such as test input minimization and software debloating. Recently, ProbDD, a variant of ddmin, has been proposed and achieved stateof- the-art performance. By employing Bayesian optimization, ProbDD estimates the probability of each element in L being relevant to ψ, and statistically decides which and how many elements should be deleted together each time. However, the theoretical probabilistic model of ProbDD is rather intricate, and the underlying details for the superior performance of ProbDD have not been adequately explored. In this paper, we conduct the first in-depth theoretical analysis of ProbDD, clarifying the trends in probability and subset size changes and simplifying the probability model. We complement this analysis with empirical experiments, including success rate analysis, ablation studies, and examinations of trade-offs and limitations, to further comprehend and demystify this state-of- the-art algorithm. Our success rate analysis reveals how ProbDD effectively addresses bottlenecks that slow down ddmin by skipping inefficient queries that attempt to delete complements of subsets and previously tried subsets. The ablation study illustrates that randomness in ProbDD has no significant impact on efficiency. These findings provide valuable insights for future research and applications of test input minimization algorithms. Based on the findings above, we propose CDD, a simplified version of ProbDD, reducing the complexity in both theory and implementation. CDD assists in 1 validating the correctness of our key findings, e.g., that probabilities in ProbDD essentially serve as monotonically increasing counters for each element, and 2 identifying the main factors that truly contribute to ProbDD’s superior performance. Our comprehensive evaluations across 76 benchmarks in test input minimization and software debloating demonstrate that CDD can achieve the same performance as ProbDD, despite being much simplified.

Tags: "Testing and Quality", "Analysis"

Benjamin Steenhoek, Siva Sivaraman, Renata Saldivar Gonzalez, Yevhen Mohylevskyy, Roshanak Zilouchian Moghaddam, Wei Le, "Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE"

Abstract: Security vulnerabilities impose significant costs on users and organizations. Detecting and addressing these vulnera bilities early is crucial to avoid exploits and reduce development costs. Recent studies have shown that deep learning models can effectively detect security vulnerabilities. Yet, little research explores how to adapt these models from benchmark tests to practical applications, and whether they can be useful in practice. This paper presents the first empirical study of a vulnerability detection and fix tool with professional software developers on real projects that they own. We implemented DEEPVULGUARD, an IDE-integrated tool based on state-of-the-art detection and fix models, and show that it has promising performance on bench marks of historic vulnerability data. DEEPVULGUARD scans code for vulnerabilities (including identifying the vulnerability type and vulnerable region of code), suggests fixes, provides natural- language explanations for alerts and fixes, leveraging chat inter faces. We recruited 17 professional software developers, observed their usage of the tool on their code, and conducted interviews to assess the tool’s usefulness, speed, trust, relevance, and workflow integration. We also gathered detailed qualitative feedback on users’ perceptions and their desired features. Study participants scanned a total of 24 projects, 6.9k files, and over 1.7 million lines of source code, and generated 170 alerts and 50 fix suggestions. We find that although state-of-the-art AI-powered detection and fix tools show promise, they are not yet practical for real-world use due to a high rate of false positives and non-applicable fixes. User feedback reveals several actionable pain points, ranging from incomplete context to lack of customization for the user’s codebase. Additionally, we explore how AI features, including confidence scores, explanations, and chat interaction, can apply to vulnerability detection and fixing. Based on these insights, we offer practical recommendations for evaluating and deploying AI detection and fix models. Our code and data are available at this link: https://figshare.com/s/77992badb1e37c09e4eb. We plan to release our tool as open-source to support further user studies for other AI based tools.

Tags: "AI for SE", "Security"

Soneya Binta Hossain, Matthew Dwyer, "TOGLL: Correct and Strong Test Oracle Generation with LLMs"

Abstract: Test oracles play a crucial role in software testing, enabling effective bug detection. Despite initial promise, neural methods for automated test oracle generation often result in a large number of false positives and weaker test oracles. While LLMs have shown impressive effectiveness in various software engineering tasks, including code generation, test case creation, and bug fixing, there remains a notable absence of large-scale studies exploring their effectiveness in test oracle generation. The question of whether LLMs can address the challenges in effective oracle generation is both compelling and requires thorough investigation. In this research, we present the first comprehensive study to investigate the capabilities of LLMs in generating correct, diverse, and strong test oracles capable of effectively identifying a large number of unique bugs. To this end, we fine-tuned seven code LLMs using six distinct prompts on a large dataset consisting of 110 Java projects. Utilizing the most effective fine- tuned LLM and prompt pair, we introduce TOGLL, a novel LLM-based method for test oracle generation. To investigate the generalizability of TOGLL, we conduct studies on 25 unseen large-scale Java projects. Besides assessing the correctness, we also assess the diversity and strength of the generated oracles. We compare the results against EvoSuite and the state-of-the-art neural method, TOGA. Our findings reveal that TOGLL can produce 3.8 times more correct assertion oracles and 4.9 times more exception oracles. Regarding bug detection effectiveness, TOGLL can detect 1,023 unique mutants that EvoSuite cannot, which is ten times more than what the previous SOTA neural-based method, TOGA, can detect. Additionally, TOGLL significantly outperforms TOGA in detecting real bugs from the Defects4J dataset.

Tags: "Testing and Quality", "AI for SE"

Parsa Alian, Noor Nashid, Mobina Shahbandeh, Taha Shabani, Ali Mesbah, "Feature-Driven End-To-End Test Generation"

Abstract: End-to-end (E2E) testing is essential for ensuring web application quality. However, manual test creation is time-consuming and current test generation techniques produce random tests. In this paper, we present AUTOE2E, a novel approach that leverages Large Language Models (LLMs) to automate the generation of semantically meaningful feature-driven E2E test cases for web applications. AUTOE2E intelligently infers potential features within a web application and translates them into executable test scenarios. Furthermore, we address a critical gap in the research community by introducing E2EBENCH, a new benchmark for automatically assessing the feature coverage of E2E test suites. Our evaluation on E2EBENCH demonstrates that AUTOE2E achieves an average feature coverage of 79%, outperforming the best baseline by 558%, highlighting its effectiveness in generating high-quality, comprehensive test cases.

Tags: "AI for SE", "Testing and Quality"

Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour, "Context Conquers Parameters: Outperforming Proprietary LLM in Commit Message Generation"

Abstract: Commit messages provide descriptions of the modifications made in a commit using natural language, making them crucial for software maintenance and evolution. Recent developments in Large Language Models (LLMs) have led to their use in generating high-quality commit messages, such as the Omniscient Message Generator (OMG). This method employs GPT-4 to produce state-of-the-art commit messages. However, the use of proprietary LLMs like GPT-4 in coding tasks raises privacy and sustainability concerns, which may hinder their industrial adoption. Considering that open-source LLMs have achieved competitive performance in developer tasks such as compiler validation, this study investigates whether they can be used to generate commit messages that are comparable with OMG. Our experiments show that an open-source LLM can generate commit messages that are comparable to those produced by OMG. In addition, through a series of contextual refinements, we propose lOcal MessagE GenerAtor (OMEGA) , a CMG approach that uses a 4-bit quantized 8B open-source LLM. OMEGA produces state-of-the-art commit messages, surpassing the performance of GPT-4 in practitioners' preference.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Yuanliang Zhang, Yifan Xie, Shanshan Li, Ke Liu, Chong Wang, Zhouyang Jia, Xiangbing Huang, Jie Song, Chaopeng Luo, Zhizheng Zheng, Rulin Xu, Yitong Liu, Si Zheng, Xiangke Liao, "Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar"

Abstract: Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, there are still gaps before they can be fully applied in actual software development processes. Accurately assessing the code generation capabilities of large language models has become an important basis for evaluating and improving the models. Some existing works have constructed datasets to evaluate the capabilities of these models. However, there are three main gaps to objectively evaluate the real capability of LLMs: the exposure of target code, case timeliness, and dependency availability. The fundamental reason for these gaps is that the code in current datasets may have been exposed during the training phase of LLM, and due to the continuous training and development of LLM, their timeliness has been severely compromised. The key to solve the problem is to, as much as possible, evaluate the LLMs using code that they have not encountered before. Thus, the fundamental idea using in this paper is to draw on the concept of code obfuscation, changing code at different levels while ensuring the functionality and output. To this end, we build a code-obfuscation based benchmark OBFUSEVAL. We first collect 1,354 raw cases from five real-world projects, including function description and code. Then we use three-level strategy (symbol, structure and semantic) to obfuscate descriptions, code and context dependencies. We evaluate four LLMs on OBFUSEVAL and compared the effectiveness of different obfuscation strategy. We use official test suites of these projects to evaluate the generated code. The results show that after obfuscation, the average decrease ratio of test pass rate can up to 62.5\%.

Tags: "AI for SE"

Baoquan Cui, rong qu, Zhen Tang, Jian Zhang, "Static Analysis of Remote Procedure Call in Java Programs"

Abstract: The Remote Procedure Call (RPC) is commonly used for inter-process communications over network, allowing a program to invoke a procedure in another address space even another machine as if it were a local call within the same address space. Its convenience comes from encapsulating network communication. However, for the same reason, it cannot be penetrated by current static analyzers. Since the RPC programs/frameworks play a more important role in various domains, the static analysis of RPC is significant and cannot be ignored. We have observed that many of the existing RPC frameworks/programs written in Java are based on explicit protocols, which makes them possible to be modelled for static analysis. The challenges are how to identify RPC operations in different frameworks/programs and how to automatically establish relationships between clients and servers. In this paper, we propose a novel approach, RPCBridge, which uses an adapter to unify the most basic operations during the RPC process. It models the RPC with logic rules in a straightforward and precise way based on its semantics, performs points-to analysis and constructs RPC edges in the call graph, making it more complete. The evaluation on real-world large-scale Java programs based on 5 common RPC frameworks shows that our approach can effectively capture the operations of the RPC (263 matched protocols and 1,098 RPCs), and construct critical links (2,578 edges in the call graph) between clients and servers, in which 60.1% are the true caller-callee pairs after execution. Our approach is expected to bring significant benefits (+24.3% leakage paths for the taint analyzer) for previously incompletely modelled code with a very little memory and time overhead, and connect the modules in a system, so that it can be statically analyzed more holistically.

Tags: "Analysis"

Shihao Zhu, Yuqi Guo, Yan Cai, Bin Liang, Long Zhang, Rui Chen, Tingting Yu, "Reduce Dependence for Sound Concurrency Bug Prediction"

Abstract: Recently, dynamic concurrency bug predictions have kept making notable progress in improving concurrency coverage while ensuring soundness. Most of them rely solely on dynamic information in traces and overlook the static semantics of the program when predicting bugs. To ensure soundness, they assume that \textbf{any (memory) read can fully affect subsequent program execution via control-flow and data-flow}. However, the assumption over-approximates constraints among (memory) writes and reads and hence limits reordering space over thread interleaving, ultimately leading to false negatives. From program semantics, only a subset of reads actually affect their subsequent executions. Therefore, by refining dependencies between reads and subsequent executions based on static program semantics, one can refine the assumption and eliminate unnecessary constraints while still guaranteeing soundness. This can bring a chance to explore more thread interleaving space and uncover more concurrency bugs. However, refining dependencies can compromise soundness and bring heavy overhead. To tackle these challenges, this paper introduces the concept of Necessary Consistent Read Event (NRE) and a hybrid analysis algorithm. NRE refines dependencies between reads and their subsequent events and is used to identify necessary constraints where a read probably affects the execution of its subsequent events. Next, we design an efficient and accurate hybrid analysis algorithm to calculate NREs for each event in the trace. The hybrid analysis algorithm maps events to program SSA instructions and simulates executions based on the original trace. We focused on data race and developed NRE and the algorithm as a prototype tool ReconP on top of a recent work M2. We conducted a set of comparative experiments on MySQL with M2 and SeqCheck. The results show that ReconP can detect 46.9\% and 22.4\% more data races than M2 and SeqCheck, respectively. And the hybrid algorithm only accounts for 34\% of the total time cost.

Tags: "Testing and Quality", "Security"

Huaijin Wang, Zhibo Liu, Yanbo Dai, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu, "Preserving Privacy in Software Composition Analysis: A Study of Technical Solutions and Enhancements"

Abstract: Software composition analysis (SCA) denotes the process of identifying open-source software components in an input software application. SCA has been extensively developed and adopted by academia and industry. However, we notice that the modern SCA techniques in industry scenarios still need to be improved due to privacy concerns. Overall, SCA requires the users to upload their applications’ source code to a remote SCA server, which then deeply inspects the applications and reports the component usage to users. This process is privacy-sensitive since the applications may contain sensitive information, such as proprietary algorithms, trade secrets, and user data. Moreover, applications' source code is generally deemed proprietary, and users do not want to share it with the SCA vendor. To protect customers' privacy, contemporary SCA vendors often propose to deploy a "lite" version of SCA service on the customer side. To avoid the leakage of SCA vendors' valuable assets (e.g., code, model, and data), the "lite" SCA usually only performs a shallow analysis with limited accuracy. Privacy concerns have prevented the SCA technology from being used in real-world scenarios. Therefore, academia and the industry demand privacy-preserving SCA solutions. For the first time, we analyze the privacy requirements of SCA and provide a landscape depicting possible technical solutions with varying privacy gains and overheads. In particular, given that de facto SCA frameworks are primarily driven by code similarity-based techniques, we explore combining several privacy-preserving protocols to encapsulate the similarity-based SCA framework. Among all viable solutions, we find that multi-party computation (MPC) offers the strongest privacy guarantee and plausible accuracy; it, however, incurs high overhead ($184\times$). We optimize the MPC-based SCA framework by reducing the amount of crypto protocol transactions using program analysis techniques. The evaluation results show that our proposed optimizations can reduce the MPC-based SCA overhead to only 8.5% without sacrificing SCA’s privacy guarantee or accuracy.

Tags: "Prog Comprehension/Reeng/Maint", "Analysis", "Open Source"

Jintao Huang, Kai Yang, Gaosheng Wang, Zhiqiang Shi, Zhiwen Pan, Shichao Lv, Limin Sun, "Moye: A Wallbreaker for Monolithic Firmware"

Abstract: As embedded devices become increasingly popular, monolithic firmware, known for its execution efficiency and simplicity, is widely used in resource-constrained devices. Different from ordinary firmware, the monolithic firmware image is packed without the file that indicates its format, which challenges the reverse engineering of monolithic firmware. Function identification is the prerequisite of monolithic firmware's analysis. Prior works on function identification are less effectiveness when applied to monolithic firmware due to their heavy reliance on file formats. In this paper, we propose Moye, a novel method to identify functions in monolithic firmware. We leverage the important insight that the use of registers must conform to some constraints. In particular, our approach segments the firmware, locate code sections and output the instructions. We uses a masked language model to learn hiding relationships among the instructions to identify the function boundaries. We evaluate Moye using 1,318 monolithic firmware images, including 48 samples collected from widely used devices. The evaluation demonstrates that our approach significantly outperforms current works, achieving a precision greater than 98% and a recall rate greater than 97% across most datasets, showing robustness to complicated compilation options.

Tags: "Analysis", "Prog Comprehension/Reeng/Maint"

Haifeng Ruan, Yuntong Zhang, Abhik Roychoudhury, "SpecRover: Code Intent Extraction via LLMs"

Abstract: Autonomous program improvement typically involves automatically producing bug fixes and feature additions. Such program improvement can be accomplished by a combination of large language model (LLM) and program analysis capabilities, in the form of an LLM agent. Since program repair or program improvement typically requires a specification of intended behavior - specification inference can be useful for producing high quality program patches. In this work, we examine efficient and low-cost workflows for iterative specification inference within an LLM agent. Given a GitHub issue to be resolved in a software project, our goal is to conduct iterative code search accompanied by specification inference - thereby inferring intent from both the project structure and behavior. The intent thus captured is examined by a reviewer agent with the goal of vetting the patches as well as providing a measure of confidence in the vetted patches. Our approach SpecRover is built on the open-source LLM agent AutoCodeRover. In an evaluation on the full SWE-Bench consisting of 2294 GitHub issues, it shows more than 50% improvement in efficacy over AutoCodeRover. Compared to the open-source agents available, our work shows modest cost ($0.65 per issue) in resolving an average GitHub issue in SWE-Bench lite. The production of explanation by SpecRover allows for a better "signal" to be given to the developer, on when the suggested patches can be accepted with confidence. SpecRover also seeks to demonstrate the continued importance of specification inference in automated program repair, even as program repair technologies enter the LLM era.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Zhiyuan Li, Zhiyuan Li, Jingzheng Wu, Jingzheng Wu, Jingzheng Wu, Xiang Ling, Xiang Ling, Xiang Ling, Tianyue Luo, Zhiqing Rui, Zhiqing Rui, Yanjun Wu, Yanjun Wu, Yanjun Wu, "The Seeds of the FUTURE Sprout from History: Fuzzing for Unveiling Vulnerabilities in Prospective Deep-Learning Libraries"

Abstract: The widespread application of Large Language Models (LLMs) underscores the importance of Deep Learning (DL) technologies that rely on foundational DL libraries such as PyTorch and TensorFlow. Despite their robust features, these libraries face challenges with scalability and adaptation to rapid advancements in the LLM community. In response, tech giants like Apple and Huawei are developing their own DL libraries to enhance performance, increase scalability, and safeguard intellectual property. Ensuring the security of these libraries is crucial, with fuzzing being a vital solution. However, existing fuzzing frameworks struggle with target flexibility, effectively testing bug-prone API sequences, and leveraging the limited available information in new libraries. To address these limitations, we propose FUTURE, the first universal DL library fuzzing framework tailored for newly introduced and prospective DL libraries. FUTURE leverages historical bug information from existing libraries and fine-tunes LLMs for specialized code generation. This strategy helps identify vulnerabilities in new libraries and uses insights from these libraries to enhance security in existing ones, creating a cycle from history to future and back. To evaluate FUTURE's effectiveness, we conduct comprehensive evaluations on three newly introduced DL libraries. Results demonstrate that FUTURE significantly outperforms existing fuzzers in bug detection, success rate of bug reproduction, validity rate of code generation, and API coverage. Notably, FUTURE has detected 148 bugs across 452 targeted APIs, including 142 previously unknown bugs. Among these, 10 have been assigned CVE IDs. Additionally, FUTURE detects 7 bugs in PyTorch, demonstrating its ability to enhance security in existing libraries in reverse.

Tags: "Testing and Quality", "AI for SE", "Security"

Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, Anita Sarma, "What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI"

Abstract: Generative AI (genAI) tools, such as ChatGPT or Copilot, are advertised to improve developer productivity and are being integrated into software development. However, misaligned trust, skepticism, and usability concerns can impede the adoption of such tools. Research also indicates that AI can be exclusionary, failing to support diverse users adequately. One such aspect of diversity is cognitive diversity—variations in users’ cognitive styles—that leads to divergence in perspectives and interaction styles. When an individual’s cognitive style is unsupported, it creates barriers to technology adoption. Therefore, to understand how to effectively integrate genAI tools into software development, it is first important to model what factors affect developers’ trust and intentions to adopt genAI tools in practice? We developed a theoretical model to (1) identify factors that influence developers’ trust in genAI tools and (2) examine the relationship between developers’ trust, cognitive styles, and their intentions to use these tools. We surveyed software developers (N=238) at two major global tech organizations and employed Partial Least Squares-Structural Equation Modeling (PLS-SEM) to evaluate our model. Our findings reveal that genAI’s system/output quality, functional value, and goal maintenance significantly influence developers’ trust in these tools. Furthermore, developers’ trust and cognitive styles influence their intentions to use these tools. We offer practical suggestions for designing genAI tools for effective use and inclusive user experience.

Tags: "Human/Social", "AI for SE"

Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, Anita Sarma, "What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI"

Tags: "Human/Social", "AI for SE"

Hao Zhong, "Understanding Compiler Bugs in Real Development"

Abstract: Compilers are critical in development, but compiler bugs can cause hidden and serious bugs in their compiled code. To deepen the understanding of compiler bugs, in the prior empirical studies, researchers read the bug reports and patches of compilers, and analyze their causes, locations, and patterns. Although they derive many interesting findings, their studies are limited. First, as bug reports seldom explain which projects encounter compiler bugs, it is infeasible to understand the outreaching impact. Second, before compiler bugs are fixed, programmers can bypass such bugs. The bug reports of compilers do not introduce such workarounds. Finally, the distribution of compiler bugs can be distorted, since researchers and compiler developers also file bug reports. In this paper, we propose a novel angle to analyze compiler bugs. Instead of compiler bug reports, we collect compiler bugs that are mentioned in real development. When programmers encounter compiler bugs in real development, they can leave traces in their commit messages. By searching such messages, we collected 644 unique commits whose messages explicitly mention the urls of compiler bugs. From this angle, in this paper, we conduct the first empirical study to analyze compiler bugs in the wild. We summarize our results into seven useful findings for users, compiler developers, and researchers. For example, for researchers, we find that some large workarounds of compiler bugs involve repetitive and systematic changes, which indicates a new research opportunity for code migration tools. Furthermore, we attempt to apply our findings in real development, and we obtain positive feedback.

Tags: "Prog Comprehension/Reeng/Maint", "MSR"

Rosalia Tufano, Alberto Martin-Lopez, Ahmad Tayeb, Ozren Dabic, Sonia Haiduc, Gabriele Bavota, "Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?"

Abstract: Several techniques have been proposed to (partially) automate code review. Early support consisted in recommending the most suited reviewer for a given change or in prioritizing the review tasks. With the advent of deep learning in software engineering, the level of automation has been pushed to new heights, with approaches able to provide feedback on source code in natural language as a human reviewer would do. Also, recent work documented open source projects adopting Large Language Models (LLMs) as co-reviewers. Although the research in this field is very active, little is known about the actual impact of including automatically generated code reviews in the code review process. While there are many aspects worth investigating (e.g., is knowledge transfer between developers affected?), in this work we focus on three of them: (i) review quality, i.e., the reviewer's ability to identify issues in the code; (ii) review cost, i.e., the time spent reviewing the code; and (iii) reviewer’s confidence, i.e., how confident is the reviewer about the provided feedback. We run a controlled experiment with 29 professional developers who reviewed different programs with/without the support of an automatically generated code review. During the experiment we monitored the reviewers’ activities, for over 50 hours of recorded code reviews. We show that reviewers consider valid most of the issues automatically identified by the LLM and that the availability of an automated review as a starting point strongly influences their behavior: Reviewers tend to focus on the code locations indicated by the LLM rather than searching for additional issues in other parts of the code. The reviewers who started from an automated review identified a higher number of low-severity issues while, however, not identifying more high- severity issues as compared to a completely manual process. Finally, the automated support did not result in saved time and did not increase the reviewers’ confidence.

Tags: "AI for SE", "Human/Social"

Weiwei Xu, Kai Gao, Hao He, Minghui Zhou, "LiCoEval: Evaluating LLMs on License Compliance in Code Generation"

Abstract: Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose an evaluation benchmark LiCoEval, to evaluate the license compliance capabilities of LLMs. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.

Tags: "Human/Social", "AI for SE", "Open Source"

Dehai Zhao, Zhenchang Xing, Qinghua Lu, Sherry Xiwei Xu, Liming Zhu, "SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI Automation"

Abstract: UI automation is a useful technique for UI testing, bug reproduction and robotic process automation. Recording the user actions with an application assists rapid development of UI automation scripts, but existing recording techniques are intrusive, rely on OS or GUI framework accessibility support or assume specific app implementations. Reversing-engineering user actions from screencasts is non-intrusive, but a key reverse-engineering step is currently missing - recognize human-understandable structured user actions ([command] [widget][location]) from action screencasts. To fill the gap, we propose a deep learning-based computer vision model which can recognize 11 commands and 11 widgets, and generate location phrases from action screencasts, through joint learning and multi-task learning. We label a large dataset with 7260 video-action pairs, which record the user interactions with Word, Zoom, Firefox, Photoshop, and Window 10 Settings. Through extensive experiments, we confirm the effectiveness and generality of our model, and demonstrate the usefulness of a screencast-to-action-script tool built upon our model for bug reproduction.

Tags: "AI for SE", "Testing and Quality"

Zeyang Ma, Dong Jae Kim, Tse-Hsun (Peter) Chen, "LibreLog: Accurate and Efficient Unsupervised Log Parsing Using Open-Source Large Language Models"

Abstract: Log parsing is a critical step that transforms unstructured log data into structured formats, facilitating subsequent log-based analysis. Traditional syntax-based log parsers are efficient and effective, but they often experience decreased accuracy when processing logs that deviate from the predefined rules. Recently, large language models (LLM) based log parsers have shown superior parsing accuracy. However, existing LLM-based parsers face three main challenges: 1) time-consuming and labor-intensive manual labeling for fine-tuning or in-context learning, 2) increased parsing costs due to the vast volume of log data and limited context size of LLMs, and 3) privacy risks from using commercial models like ChatGPT with sensitive log information. To overcome these limitations, this paper introduces LibreLog, an unsupervised log parsing approach that leverages open-source LLMs (i.e., Llama3-8B) to enhance privacy and reduce operational costs while achieving state-of-the-art parsing accuracy. LibreLog first groups logs with similar static text but varying dynamic variables using a fixed-depth grouping tree. It then parses logs within these groups using three components: i) similarity scoring-based retrieval augmented generation: selects diverse logs within each group based on Jaccard similarity, helping the LLM distinguish between static text and dynamic variables; ii) self-reflection: iteratively query LLMs to refine log templates to improve parsing accuracy; and iii) log template memory: stores parsed templates to reduce LLM queries for improved parsing efficiency. Our evaluation on LogHub-2.0 shows that LibreLog achieves 25% higher parsing accuracy and processes logs 2.7 times faster compared to state-of-the-art LLM-based parsers. In short, LibreLog addresses privacy and cost concerns of using commercial LLMs while achieving state-of- the-arts parsing efficiency and accuracy.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Yi Sun, Zhuo Zhang, Xiangyu Zhang, "FairChecker: Detecting Fund-stealing Bugs in DeFi Protocols via Fairness Validation"

Abstract: Decentralized Finance (DeFi) is an emerging paradigm within the blockchain space that aims to revolutionize conventional financial systems through the application of blockchain technology. The substantial value of digital assets managed by DeFi protocols makes it a lucrative target for attacks. Despite the human resources and the application of automated tools, frequent attacks still cause significant fund losses to DeFi participants. Existing tools primarily rely on oracles similar to those used in traditional software analysis, making it challenging for them to detect functional bugs specific to the DeFi domain. Since blockchain functions as a distributed ledger system, the foundation of any DeFi protocol is the accurate maintenance of key state variables representing user funds. If these variables are not properly updated or designed to reflect the intended flow of funds, attackers can exploit these flaws to steal assets. From the study of popular DeFi protocols, we observe that, in DeFi systems, to ensure a transaction does not misappropriate someone's fund, the direction of changes (increase or decrease) of values associated with the amount of asset or debt of a user has to adhere to some fairness properties. We propose a concept called fairness bug which allows attackers to gain profit without cost. We propose an inter-procedural and inter-contract static analysis technique that utilizes symbolic execution and an SMT solver to automatically detect fairness bugs in DeFi smart contracts. We have implemented our fairness-checking approach in our tool, named FairChecker. We evaluate our tool on a benchmark of 113 real-world DeFi protocols with 34 fairness bugs. The results show that our tool can detect 32 bugs with a recall of 94.1\% and a precision of 46.4\%, demonstrating its effectiveness.

Tags: "Analysis", "Security", "Blockchain"

Li Huang, Weifeng Sun, Meng Yan, "Iterative Generation of Adversarial Example for Deep Code Models"

Abstract: Deep code models are vulnerable to adversarial attacks, making it possible for semantically identical inputs to trigger different responses. Current black-box attack methods typically prioritize the impact of identifiers on the model based on custom importance scores or program context and incrementally replace identifiers to generate adversarial examples. However, these methods often fail to fully leverage feedback from failed attacks to guide subsequent attacks, resulting in problems such as local optima bias and efficiency dilemmas. In this paper, we introduce ITGen, a novel black-box adversarial example generation method that iteratively utilizes feedback from failed attacks to refine the generation process. It employs a bitvector-based representation of code variants to mitigate local optima bias. By integrating these bit vectors with feedback from failed attacks, ITGen uses an enhanced Bayesian optimization framework to efficiently predict the most promising code variants, significantly reducing the search space and thus addressing the efficiency dilemma. We conducted experiments on a total of nine deep code models for both understanding and generation tasks, demonstrating ITGen's effectiveness and efficiency, as well as its ability to enhance model robustness through adversarial fine-tuning. For example, on average, ITGen improves the attack success rate by 47.98% and 69.70% over the state-of-the-art techniques (i.e., ALERT and BeamAttack), respectively.

Tags: "SE for AI", "Testing and Quality"

Chenxi Zhang, Yufei Liang, Tian Tan, Chang Xu, Shuangxiang Kan, Yulei Sui, Yue Li, "Interactive Cross-Language Pointer Analysis for Resolving Native Code in Java Programs"

Abstract: Java offers the Java Native Interface (JNI), which allows programs running in the Java Virtual Machine to invoke and be manipulated by native applications and libraries written in other languages, typically C. While JNI mechanism significantly enhances the Java platform's capabilities, it also presents challenges for static analysis of Java programs due to the complex behaviors introduced by native code. Therefore, effectively resolving the interactions between Java and native code is crucial for static analysis. In this paper, we introduce JNIFER, the first interactive cross-language pointer analysis for resolving native code in Java programs. JNIFER integrates both Java and C pointer analyses, equipped with advanced native call and JNI function analyses, enabling the simultaneous analysis of both Java and native code. During the analysis of cross-language interactions, the two analyzers interact with each other, constructing cross-language points-to relations and call graphs, thereby approximating the runtime behavior at the interaction sites. Our evaluation shows that JNIFER outperforms state-of-the-art approaches in terms of soundness while maintaining high precision and comparable efficiency, as evidenced by extensive experiments on OpenJDK and real-world Java applications.

Tags: "Analysis"

Chiming Duan, Yong Yang, Tong Jia, Guiyang Liu, Jinbu Liu, Huxing Zhang, Qi Zhou, Ying Li, Gang Huang, "FAMOS: Fault diagnosis for Microservice Systems through Effective Multi-modal Data Fusion"

Abstract: Accurately diagnosing the fault that causes the failure is crucial for maintaining the reliability of a microservice system after a failure occurs. Mainstream fault diagnosis approaches are data-driven and mainly rely on three modalities of runtime data: traces, logs, and metrics. Diagnosing faults with multiple modalities of data in microservice systems has been an clear trend in recent years because different types of faults and corresponding failures tend to manifest in data of various modalities. Accurately diagnosing faults by fully leveraging multiple modalities of data is confronted with two challenges: 1)how to minimize information loss when extracting features for data of each modality; 2)how to correctly capture andutilize the relationships among data of different modalities. To address these challenges, we propose FAMOS, a Fault diagnosis Approach for MicrOservice Systems through effective multi-modal data fusion. On the one hand, FAMOS employs independent feature extractors to preserve the intrinc features for each modality. On the other hand, FAMOS introduces a new Gaussian-attention mechanism to accurately correlate data of different modalities and then captures the inter-modality relationship with a cross-attention mechanism. We evaluated FAMOS on two datasets constructed by injecting comprehensive and abundant faults into an open-source microservice system and a real-world industrial microservice system. Experimental results demonstrate the FAMOS’s effectiveness in fault diagnosis, achieving significant improvements in F1 scores compared to state-of-the-art (SOTA) methods, with an increase of 20.33%.

Tags: "AI for SE", "Security"

Van-Hoang Le, yi xiao, Hongyu Zhang, "Unleashing the True Potential of Semantic-based Log Parsing with Pre-trained Language Models"

Abstract: Software-intensive systems often produce console logs for troubleshooting purpose. Log parsing, which aims at parsing a log message into a specific log template, typically serves as the first step toward automated log analytics. To better comprehend semantic information of log messages, many semantic-based log parsers have been proposed. These log parsers fine-tune a small pretrained language model (PLM) such as RoBERTa on a few labelled log samples. With the increasing popularity of large language models (LLMs), some recent studies also propose to leverage LLMs such as ChatGPT through in-context learning for automated log parsing, and obtain better results than previous semantic-based log parsers with small PLMs. In this paper, we show that semantic-based log parsers with small PLMs can actually achieve better or comparable performance to state-of-the-art LLM-based log parsing models while being more efficient and cost-effective. We propose UNLEASH, a novel semantic-based log parsing approach, which incorporates three enhancement methods to boost the performance of PLMs for log parsing, including (1) an entropy-based ranking method to select the most informative log samples; (2) a contrastive learning method to enhance the fine-tuning process; and (3) an inference optimization method to improve the log parsing performance. We evaluate UNLEASH on a set of large log datasets and the experimental results show that UNLEASH is effective and efficient, when compared to state-of-the-art log parsers.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Yiteng Peng, Daoyuan Wu, Zhibo Liu, Dongwei Xiao, Zhenlan Ji, Juergen Rahmel, Shuai Wang, "Testing and Understanding Deviation Behaviors in FHE-hardened Machine Learning Models"

Abstract: Fully homomorphic encryption (FHE) is a promising cryptographic primitive that enables secure computation over encrypted data. A primary use of FHE is to support privacy-preserving machine learning (ML) on public cloud infrastructures. Despite the rapid development of FHE-based ML (or HE-ML) in recent years, the community still lacks a systematic understanding of their robustness. In this paper, we aim to systematically test and understand the deviation behaviors of HE-ML models, where the same input causes deviant outputs between FHE-hardened models and their plaintext versions, leading to completely incorrect model predictions. To effectively uncover deviation-triggering inputs under the constraints of expensive FHE computation, we design a novel differential testing tool called HEDiff, which leverages the margin metric on the plaintext model as guidance to drive targeted testing on FHE models. For the identified deviation inputs, we further analyze them to determine whether they exhibit general noise patterns that are transferable. We evaluate HEDiff using three popular HE-ML frameworks, covering 12 different combinations of models and datasets. HEDiff successfully detected hundreds of deviation inputs across almost every tested FHE framework and model. We also quantitatively show that the identified deviation inputs are (visually) meaningful in comparison to regular inputs. Further schematic analysis reveals the root cause of these deviant inputs and allows us to generalize their noise patterns for more directed testing.

Tags: "SE for AI", "Testing and Quality"

Chuan Luo, Shuangyu Lyu, Wei Wu, Hongyu Zhang, Dianhui Chu, Chunming Hu, "Towards High-strength Combinatorial Interaction Testing for Highly Configurable Software Systems"

Abstract: Highly configurable software systems are crucial in practice to satisfy the rising demand for software customization, and combinatorial interaction testing (CIT) is an important methodology for testing such systems. Constrained covering array generation (CCAG), as the core problem in CIT, is to construct a t-wise covering array (CA) of minimum size, where t represents the testing strength. Extensive studies have demonstrated that high-strength CIT (e.g., 4-wise and 5-wise CIT) has stronger fault detection capability than low-strength CIT (i.e., 2-wise and 3-wise CIT), and there exist certain critical faults that can be disclosed through high-strength CIT. Although existing CCAG algorithm has exhibited effectiveness in solving the low-strength CCAG problem, they suffer the severe high-strength challenge when solving 4-wise and 5-wise CCAG, which urgently calls for effective solutions to solving 4-wise and 5-wise CCAG problems. To alleviate the high-strength challenge, we propose a novel and effective local search algorithm dubbed HSCA. Particularly, HSCA incorporates three new and powerful techniques, i.e., multi-round CA generation mechanism, dynamic priority assigning technique, and variable grouping strategy, to improve its performance. Extensive experiments on 35 real-world and synthetic instances demonstrate that HSCA can generate significantly smaller 4-wise and 5-wise CAs than existing state-of-the-art CCAG algorithms. More encouragingly, out of all 35 instances, HSCA successfully constructs 4-wise and 5-wise CAs for 11 and 15 instances, respectively, where existing CCAG algorithms fail. Our results indicate that HSCA can effectively mitigate the high-strength challenge.

Tags: "Testing and Quality"

Mengying Wu, Geng Hong, Wuyuao Mai, Xinyi Wu, Lei Zhang, Yingyuan Pu, Huajun Chai, Lingyun Ying, Haixin Duan, Min Yang, "Exposing the Hidden Layer: Software Repositories in the Service of SEO Manipulation"

Abstract: Distinct from traditional malicious packages, this paper uncovers a novel attack vector named “blackhat Search Engine Optimization through REPositories (RepSEO)”. In this approach, attackers carefully craft packages to manipulate search engine results, exploiting the credibility of software repositories to promote illicit websites. Our research presents a systematic analysis of the underground ecosystem of RepSEO, identifying key players such as account providers, advertisers, and publishers. We developed an effective detection tool, applied to a ten-year large-scale dataset of npm, Docker Hub, and NuGet software repositories. This investigation led to the startling discovery of 3,801,682 abusive packages, highlighting the widespread nature of this attack. Our study also delves into the supply chain tactics of these attacks, revealing strategies like the use of self-hosted email services for account registration, redirection methods to obscure landing pages, and rapid deployment techniques by aggressive attackers. Additionally, we explore the profit motives behind these attacks, identifying two primary types of advertisers: survey-based advertisers and malware distribution advertisers. We reported npm, NuGet, and Docker Hub about the RepSEO packages and the related supply chain vulnerabilities of Google, and received their acknowledgments. Software repositories have started removing the abusive packages as of this paper’s submission. We also open-source our code and data to facilitate future research.

Tags: "MSR", "Security"

Hanmo You, Zan Wang, Bin Lin, Junjie Chen, "Navigating the Testing of Evolving Deep Learning Systems: An Exploratory Interview Study"

Abstract: Deep Learning (DL) systems have been widely adopted across various industrial domains such as autonomous driving and intelligent healthcare. As with traditional software, DL systems also need to constantly evolve to meet ever-changing user requirements. However, ensuring the quality of these continuously evolving systems presents significant challenges, especially in the context of testing. Understanding how industry developers address these challenges and what extra obstacles they are facing could provide valuable insights for further safeguarding the quality of DL systems. To reach this goal, we conducted semi-structured interviews with 22 DL developers from diverse domains and backgrounds. More specifically, our study focuses on exploring the challenges developers encounter in testing evolving DL systems, the practical solutions they employ, and their expectations for extra support. Our results highlight the difficulties in testing evolving DL systems and identify the best practices for DL developers to address them. Additionally, we pinpoint potential future research directions to enhance testing effectiveness in evolving DL systems.

Tags: "SE for AI", "Human/Social"

Chenxing Zhong, Daniel Feitosa, Paris Avgeriou, Huang Huang, Yue Li, He Zhang, "PairSmell: A Novel Perspective Inspecting Software Modular Structure"

Abstract: Enhancing the modular structure of existing systems has attracted substantial research interest, focusing on two main methods: (1) software modularization and (2) identifying design issues (e.g., smells) as refactoring opportunities. However, re-modularization solutions often require extensive modifications to the original modules, and the design issues identified are generally too coarse to guide refactoring strategies. Combining the above two methods, this paper introduces a novel concept, \emph{PairSmell}, which exploits modularization to pinpoint design issues necessitating refactoring. We concentrate on a granular but fundamental aspect of modularity principles---\emph{modular relation (MR)}, i.e., \emph{whether a pair of entities are separated or collocated}. The main assumption is that, if the actual MR of a pair violates its `apt MR', i.e., an MR agreed on by multiple modularization tools (as raters), it can be deemed likely a flawed architectural decision that necessitates further examination. To quantify and evaluate \emph{PairSmell}, we conduct an empirical study on 20 C/C++ and Java projects, using 4 established modularization tools to identify two forms of \emph{PairSmell}: inapt separated pairs $\mathit{InSep}$ and inapt collocated pairs $\mathit{InCol}$. Our study on 260,003 instances reveals that their architectural impacts are substantial: (1) on average, 14.60\% and 20.44\% of software entities are involved in $\mathit{InSep}$ and $\mathit{InCol}$ MRs respectively; (2) $\mathit{InSep}$ pairs are associated with 190\% more co-changes than properly separated pairs, while $\mathit{InCol}$ pairs are associated with 35\% fewer co-changes than properly collocated pairs, both indicating a successful identification of modular structures detrimental to software quality; and (3) both forms of \emph{PairSmell} persist across software evolution. This evidence strongly suggests that \emph{PairSmell} can provide meaningful insights for inspecting modular structure, with the identified issues being both granular and fundamental, making the enhancement of modular design more efficient.

Tags: "Design/Architecture", "Prog Comprehension/Reeng/Maint"

Fraol Batole, David OBrien, Tien Nguyen, Robert Dyer, Hridesh Rajan, "An LLM-Based Agent-Oriented Approach for Automated Code Design Issue Localization"

Abstract: Maintaining software design quality is crucial for the long-term maintainability and evolution of systems. However, design issues such as poor modularity and excessive complexity often emerge as codebases grow. Developers rely on external tools, such as program analysis techniques, to identify such issues. This work investigates an automated approach for analyzing and localizing design issues using Large Language Models (LLMs). Large language models have demonstrated significant performance on coding tasks, but directly leveraging them for design issue localization is challenging. Large codebases exceed typical LLM context windows, and program analysis tool outputs in non-textual modalities (e.g., graphs or interactive visualizations) are incompatible with LLMs’ natural language inputs. To address these challenges, we propose LOCALIZEAGENT, a novel multi-agent framework for effective design issue localization. LOCALIZEAGENT integrates specialized agents that (1) analyze code to identify potential code design issues, (2) transform program analysis outputs into abstraction-aware LLM-friendly natural language summaries, (3) generate context-aware prompts tailored to specific refactoring types, and (4) leverage LLMs to locate and rank the localized issues based on their relevance. Our evaluation using diverse real-world codebases demonstrates significant improvements over baseline approaches, with LOCALIZEAGENT achieving 138%, 166%, and 206% relative improvements in exact match accuracy for localizing information hiding, complexity, and modularity issues, respectively.

Tags: "AI for SE", "Design/Architecture"

Qingkai Shi, Xiaoheng Xie, Xianjin Fu, Peng Di, Huawei Li, Ang Zhou, Gang Fan, "Datalog-Based Language-Agnostic Change Impact Analysis for Microservices"

Abstract: The shift-left principle in the industry requires us to test a software application as early as possible. Particularly, when code changes in a microservice application are committed to the code repository, we have to efficiently identify all public microservice interfaces impacted by the changes, such that the impacted interfaces can be tested as soon as possible. However, developing an efficient change impact analysis is extremely challenging in microservices because of the multilingual problem: microservice applications are often implemented using varying programming languages and involve diverse frameworks and configuration files. To address this issue, this paper presents Microscope, a language-agnostic change impact analysis that uniformly represents the code, configuration files, frameworks, and code changes by relational Datalog rules. Microscope then benefits from an efficient Datalog solver to identify impacted interfaces. Experiments based on the use of Microscope in a leading software company demonstrate that Microscope is both effective and fast as it successfully identifies interfaces impacted by 112 code commits, with moderate time overhead, and could reduce 97% of interfaces to test and save 73% of testing time after code changes.

Tags: "Analysis", "Prog Comprehension/Reeng/Maint"

Robert Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner, Emily First, "Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification"

Abstract: Formal verification using proof assistants, such as Coq, allows for high-quality software. However, the verification process is expensive, requiring significant expertise and manual effort to write proofs. Recent work has explored automating proof synthesis using machine learning, and even more recently, large language models (LLMs), showing that retrieving relevant premises (such as lemmas and definitions) is helpful for these models. We present Rango, a fully automated proof synthesis tool for Coq that uses, not only relevant premises but also similar proofs from the current project. Rango uses retrieval augmentation at every step of the proof to automatically determine which proofs and premises to include in the context of its fine-tuned LLM. In this way, Rango adapts to the project _and_ to the evolving state of the proof. We create a new dataset, CoqStoq, of 2,205 open-source Coq projects from GitHub, which includes both training data and a curated evaluation benchmark of well-maintained projects. On this benchmark, Rango synthesizes 27.7% of the proofs, which is 10% more proofs than prior state-of-the-art tool Tactician. Our evaluation also shows that adding relevant proofs to the context in Rango leads to a 45% increase in the number of theorems proven.

Tags: "AI for SE", "Analysis"

Miao Miao, Austin Mordahl, Dakota Soles, Alice Beideck, Shiyi Wei, "An Extensive Empirical Study of Nondeterministic Behavior in Static Analysis Tools"

Abstract: Recent research has studied the importance and identified causes of nondeterminism in software. Static analysis tools exhibit many risk factors for nondeterministic behavior, but no work has analyzed the occurrence of such behavior in these tools. To bridge this gap, we perform an extensive empirical study aiming to understand past and ongoing nondeterminism in 12 popular, open-source static analysis tools that target 5 types of projects. We first conduct a qualitative study to understand the extent to which nondeterministic behavior has been found and addressed within the tools under study, and find results in 7 tool repositories. After classifying the issues and commits by root cause, we find that the majority of nondeterminisms are caused by concurrency issues, incorrect analysis logic, or assumed orderings of unordered data structures, which have shared patterns. We also perform a quantitative analysis, where we use two strategies and diverse input programs and configurations to detect yet-unknown nondeterministic behaviors. We discover such behavior in 8 out of the 12 tools, including 3 which had no results from the qualitative analysis. We find that nondeterminism often appears in multiple configurations on a variety of input programs. We communicated all identified nondeterminism to the developers, and received confirmation of five tools. Finally, we detail a case study of fixing FlowDroid's nondeterministic behavior.

Tags: "Analysis"

Xintong Zhou, Zhenyang Xu, Mengxiao Zhang, Yongqiang Tian, Chengnian Sun, "WDD: Weighted Delta Debugging"

Abstract: Delta Debugging is a widely used family of algorithms (e.g., ddmin and ProbDD) to automatically minimize bug-triggering test inputs, thus to facilitate debugging. It takes a list of elements with each element representing a fragment of the test input, systematically partitions the list at different granularities, identifies and deletes bug-irrelevant partitions. Prior delta debugging algorithms assume there are no differences among the elements in the list, and thus treat them uniformly during partitioning. However, in practice, this assumption usually does not hold, because the size (referred to as weight) of the fragment represented by each element can vary significantly. For example, a single element representing 50% of the test input is much more likely to be bug-relevant than elements representing only 1%. This assumption inevitably impairs the efficiency or even effectiveness of these delta debugging algorithms. This paper proposes Weighted Delta Debugging (WDD), a novel concept to help prior delta debugging algorithms overcome the limitation mentioned above. The key insight of WDD is to assign each element in the list a weight according to its size, and distinguish different elements based on their weights during partitioning. We designed two new minimization algorithms, Wddmin and WProbDD, by applying WDD to ddmin and ProbDD respectively. We extensively evaluated Wddmin and WProbDD in two representative applications, HDD and Perses, on 62 benchmarks across two languages. On average, with Wddmin, HDD and Perses took 51.31% and 7.47% less time to generate 9.12% and 0.96% smaller results than with ddmin, respectively. With WProbDD, HDD and Perses used 11.98% and 9.72% less time to generate 13.40% and 2.20% smaller results than with ProbDD, respectively. The results strongly demonstrate the value of WDD. We firmly believe that WDD opens up a new dimension to improve test input minimization techniques.

Tags: "Testing and Quality"

Faridah Akinotcho, Lili Wei, Julia Rubin, "Mobile Application Coverage: The 30% Curse and Ways Forward"

Abstract: Testing, security analysis, and other dynamic quality assurance approaches rely on mechanisms that invoke software under test, aiming to achieve high code coverage. A large number of invocation mechanisms proposed in the literature, in particular for Android mobile applications, employ GUI-driven application exploration. However, studies show that even the most advanced GUI exploration techniques can cover only around 30% of a real- world application. This paper aims to investigate “the remaining 70%”. By conducting a large-scale experiment involving two human experts, who thoroughly explored 61 benchmark and 42 popular apps from Google Play, we show that achieving a substantially larger coverage for real-world applications is impractical even if we factor out known GUI-based exploration issues, such as the inability to provide semantic inputs and the right order of events. The main reasons preventing humans from covering the entire application include application dependencies on device configurations and external resources. Thus, future investment into GUI-based exploration strategies is unlikely to lead to substantial improvements in coverage. To chart possible ways forward and explore approaches to satisfy/bypass these dependencies, we thoroughly analyze code-level properties guarding them. Our analysis shows that a large fraction of the dependencies could actually be successfully bypassed with relatively simple beyond- GUI exploration techniques. We hope our study can inspire future work in this area and also provide a realistic benchmark for evaluating this work.

Tags: "Testing and Quality", "Mobile SW"

Hao Song, Teng Li, Jiachi Chen, Ting Chen, Beibei Li, Zhangyan Lin, Yi Lu, Pan Li, Xihan Zhou, "Enhancing The Open Network: Definition and Automated Detection of Smart Contract Defects"

Abstract: The Open Network (TON), designed to support Telegram's extensive user base of hundreds of millions, has garnered considerable attention since its launch in 2022. \textit{FunC} is the most popular programming language for writing smart contracts on TON. It is distinguished by a unique syntax compared to other smart contract languages. Despite growing interest, research on the practical defects of TON smart contracts is still in its early stages. In this paper, we summarize eight smart contract defects identified from TON's official blogs and audit reports, each with detailed definitions and code examples. Furthermore, we propose a static analysis framework called TONScanner to facilitate the detection of these defects. Specifically, TONScanner reuses \textit{FunC} compiler's frontend code to transform the \textit{FunC} contract code into \textit{FunC} intermediate representation (IR) in the form of a directed acyclic graph (DAG). Based on this IR, TONScanner constructs a control flow graph (CFG), then transforms it into a static single assignment (SSA) form to simplify further analysis. TONScanner also integrates Data Dependency, Call Graph, Taint Analysis, and Cell Construct, which are specifically tailored for TON blockchain's unique data structures. These components finally facilitate the identification of the eight defects. We evaluate the effectiveness of TONScanner by applying it to 1,640 smart contracts and find a total of 14,995 defects. Through random sampling and manual labeling, we find that TONScanner achieves an overall precision of 97.49%. The results reveal that current TON contracts contain numerous defects, indicating that developers are prone to making errors. TONScanner has proven its ability to accurately identify these defects, thereby aiding in their correction.

Tags: "Analysis", "Security", "Blockchain"

Kevin Guan, Owolabi Legunsen, "Instrumentation-Driven Evolution-Aware Runtime Verification"

Abstract: Runtime verification (RV) found hundreds of bugs by monitoring passing tests against formal specifications (specs). RV first instruments a program to obtain relevant events, e.g., method calls, to monitor. A hindrance to RV adoption, especially in continuous integration, is its high overhead. So, prior work proposed spec-driven evolution-aware techniques to speed up RV. They use complex analysis to re-monitor a subset of specs related to code impacted by changes. But, these techniques assume that RV overhead is dominated by monitoring time, and their designs often sacrifice safety (ability to find all new violations) for speed. We present iMOP, the first instrumentation-driven evolution-aware RV framework. iMOP leverages a recent observation that RV overhead during testing is often dominated by instrumentation, not monitoring. iMOP embodies a family of 14 techniques that aim to safely speed up RV by simply re-instrumenting only changed code. Instrumentation from the old revision is re-used for unchanged code, and all specs are re-monitored in the new revision. We implement iMOP as a Maven plugin and evaluate it on 1,627 revisions of 48 projects, using 160 specs of correct JDK API usage. iMOP is safe by design. It is up to 29.6x faster than re-running RV from scratch after each change, and 17.8x and 6.7x faster than safe and unsafe spec-driven techniques, respectively. iMOP is faster than just applying regression test selection to RV.

Tags: "Analysis", "Prog Comprehension/Reeng/Maint"

Xiaopeng Li, Shangwen Wang, Shasha Li, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Bin Ji, Weimin Zhang, "Model Editing for LLMs4Code: How Far are We?"

Abstract: Large Language Models for Code (LLMs4Code) have been found to exhibit outstanding performance in the software engineering domain, especially the remarkable performance in coding tasks. However, even the most advanced LLMs4Code can inevitably contain incorrect or outdated code knowledge. Due to the high cost of training LLMs4Code, it is impractical to re-train the models for fixing these problematic code knowledge. Model editing is a new technical field for effectively and efficiently correcting erroneous knowledge in LLMs, where various model editing techniques and benchmarks have been proposed recently. Despite that, a comprehensive study that thoroughly compares and analyzes the effectiveness of all state-of-the-art model editing techniques for adapting the knowledge within LLMs4Code models across various code-related tasks is notably absent. To bridge this gap, we perform the first systematic study on applying state-of-the-art model editing approaches to repair the inaccuracy of LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists of two datasets, i.e., CoNaLa-Edit (CNLE) with 21K+ code generation samples and CodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help of CLMEEval, we evaluate six advanced model editing techniques on three LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings include that the external memorization-based GRACE approach achieves the best knowledge editing effectiveness and specificity (the editing does not influence untargeted knowledge), while generalization (whether the editing can generalize to other semantically-identical inputs) is a universal challenge for existing techniques. Furthermore, building on in-depth case analysis, we introduce an enhanced version of GRACE called A-GRACE, which incorporates contrastive learning to better capture the semantics of the inputs. Results demonstrate that A-GRACE notably enhances generalization while maintaining similar levels of effectiveness and specificity compared to the vanilla GRACE.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

TianChen Yu, Li Yuan, Liannan Lin, Hongkui He, "A Multiple Representation Transformer with Optimized Abstract Syntax Tree for Efficient Code Clone Detection"

Abstract: Over the past decade, the application of deep learning in code clone detection has produced remarkable results. However, the current approaches have two limitations: (a) code representation approaches with low information utilization, such as vanilla Abstract Syntax Tree (AST), leading to information redundancy which results in performance degradation; (b) low efficiency of clone detection on evaluation, resulting in excessive time costs during practical use. In this paper, we propose a Multiple Representation Transformer with Optimized Abstract Syntax Tree (MRT-OAST) to introduce an efficient code representation method while achieving competitive performance. Specifically, MRT-OAST strategically prunes and enhances the AST, utilizing both pre-order and post-order traversals to represent two different representations. To speed up the evaluation process, MRT-OAST utilizes a pure Siamese network and employs cosine similarity to compare the similarity between codes. Our approach effectively reduces AST sequences to 40% and 39% of their original length in Java and C/C++ while preserving structural information. In code clone detection tasks, our model surpasses state-of-the-art approaches on OJClone and Google Code Jam. During the evaluation of BigCloneBench, our model has a 5x speed improvement compared to the state-of-the-art lightweight model and a 563x speed improvement compared to the BERT-based model, with only a 0.3% and 0.9% decrease in $F_1$-score.

Tags: "AI for SE", "Analysis"

Tanghaoran Zhang, Yue Yu, Xinjun Mao, Shangwen Wang, Kang Yang, Yao Lu, Zhang Zhang, Yuxin Zhao, "Instruct or Interact? Exploring and Eliciting LLMs’ Capability in Code Snippet Adaptation Through Prompt Engineering"

Abstract: Code snippet adaptation is a fundamental activity in the software development process. Unlike code generation, code snippet adaptation is not a ``free creation'', which requires developers to tailor a given code snippet in order to fit specific requirements and the code context. Recently, large language models (LLMs) have confirmed their effectiveness in the code generation task with promising results. However, their performance on code snippet adaptation, a reuse-oriented and context-dependent code change prediction task, is still unclear. To bridge this gap, we conduct an empirical study to investigate the performance and issues of LLMs on the adaptation task. We first evaluate the adaptation performances of three popular LLMs and compare them to the code generation task. Our result indicates that their adaptation ability is weaker than generation, with a nearly 15\% decrease on pass@1 and more context-related errors. By manually inspecting 200 cases, we further investigate the causes of LLMs’ sub-optimal performance, which can be classified into three categories, \emph{i.e.,} \textit{Unclear Requirement}, \textit{Requirement Misalignment} and \textit{Context Misapplication}. Based on the above empirical research, we propose an interactive prompting approach to eliciting LLMs' ability on the adaptation task. Specifically, we enhance the prompt by enriching the context and decomposing the task, which alleviates context misapplication and improves requirement understanding. Besides, we enable LLMs' reflection by requiring them to interact with a human or a LLM counselor, compensating for unclear requirement. Our experimental result reveals that our approach greatly improve LLMs' adaptation performance. The best-performing Human-LLM interaction successfully solves 159 out of the 202 identified defects and improves the pass@1 and pass@5 by over 40\% compared to the initial instruction-based prompt. Considering human efforts, we suggest multi-agent interaction as a trade-off, which can achieve comparable performance with excellent generalization ability. We deem that our approach could provide methodological assistance for autonomous code snippet reuse and adaptation with LLMs.

Tags: "AI for SE"

Enrique Barba Roque, Luís Cruz, Thomas Durieux, "Unveiling the Energy Vampires: A Methodology for Debugging Software Energy Consumption"

Abstract: Energy consumption in software systems is becoming increasingly important, especially in large-scale deployments. However, debugging energy-related issues remains challenging due to the lack of specialized tools. This paper presents an energy debugging methodology for identifying and isolating energy consumption hotspots in software systems. We demonstrate the methodology's effectiveness through a case study of Redis, a popular in-memory database. Our analysis reveals significant energy consumption differences between Alpine and Ubuntu distributions, with Alpine consuming up to 20.2% more power in certain operations. We trace this difference to the implementation of the `memcpy` function in different C standard libraries (musl vs. glibc). By isolating and benchmarking `memcpy`, we confirm it as the primary cause of the energy discrepancy. Our findings highlight the importance of considering energy efficiency in software dependencies and demonstrate the capability to assist developers in identifying and addressing energy-related issues. This work contributes to the growing field of sustainable software engineering by providing a systematic approach to energy debugging.

Tags: "User experience", "Human/Social"

Cedric Richter, Marek Chalupa, Marie-Christine Jakobs, Heike Wehrheim, "Cooperative Software Verification via Dynamic Program Splitting"

Abstract: Cooperative software verification divides the task of software verification among several verification tools in order to increase efficiency and effectiveness. The basic approach is to let verifiers work on different parts of a program and at the end join verification results. While this idea is intuitively appealing, cooperative verification is usually hindered by the facts that program decomposition (1) is often static, disregarding strengths and weaknesses of employed verifiers, and (2) often represents the decomposed program parts in a specific proprietary format, thereby making the use of off-the-shelf verifiers in cooperative verification difficult. In this paper, we propose a novel cooperative verification scheme that we call dynamic program splitting (DPS). Splitting decomposes programs into (smaller) programs, and thus directly enables the use of off-the-shelf tools. In DPS, splitting is dynamically applied on demand: Verification starts by giving a verification task (a program plus a correctness specification) to a verifier V1. Whenever V1 finds the current task to be hard to verify, it splits the task (i.e., the program) and restarts verification on subtasks. DPS continues until (1) a violation is found, (2) all subtasks are completed or (3) some user-defined stopping criterion is met. In the latter case, the remaining uncompleted subtasks are merged into a single one and given to a next verifier V2, repeating the same procedure on the still unverified program parts. This way, the decomposition is steered by what is hard to verify for particular verifiers, leveraging their complementary strengths. We have implemented dynamic program splitting and evaluated it on benchmarks of the annual software verification competition SV-COMP. The evaluation shows that cooperative verification with DPS is able to solve verification tasks that none of the constituent verifiers can solve, without any significant overhead.

Tags: "Analysis", "Security"

Yiwei Li, Liangze Yin, Wei Dong, Jiaxin Liu, Yanfeng Hu, Shanshan Li, "Hetrify: Efficient Verification of Heterogeneous Programs on RISC-V"

Abstract: The heterogeneous nature of contemporary software, comprising components like closed-source libraries, embedded assembly snippets, and modules written in multiple programming languages, leads to significant verification challenges. Currently, There are no mature and available methods to effectively address such problems. To bridge this gap, we propose a verification approach capable of effectively verifying heterogeneous programs. This approach is universally applicable. It theoretically supports the verification of any heterogeneous program that can be compiled into binary code, without being constrained by any specific programming language. The approach begins by compiling the entire program or its unverifiable segments into binary format. Under guarantees of semantic equivalence, these binaries are converted into verifiable C code, which can then be verified using existing C verification tools. Based on the RISC-V architecture, we developed the Hetrify tool to implement this verification approach. The tool is supported by rigorous mathematical proofs to ensure operational semantic equivalence between the converted C programs and their original counterparts. To validate our approach, we conducted verification experiments on 130 programs, including 100 assembly programs and 30 large heterogeneous programs with missing critical function source code, demonstrating the effectiveness of our approach.

Tags: "Analysis", "Security"

Xue Jiang, Yihong Dong, Yongding Tao, Huanyu Liu, Zhi Jin, Ge Li, "ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation"

Abstract: Large language models (LLMs) have achieved impressive performance in code generation recently, offering programmers revolutionary assistance in software development. However, due to the auto-regressive nature of LLMs, they are susceptible to error accumulation during code generation. Once an error is produced, LLMs can merely continue to generate the subsequent code conditioned on it, given their inability to adjust previous outputs. This generation process differs from the common practice in human coding, which involves review and adjustment during the coding process according to quality and requirements. Existing LLM-based approaches that typically consider post-revising after code generation fail to resolve errors in time, leading to the challenging resolution of accumulated errors and the significant wastage of resources. Ideally, LLMs should rollback and resolve the occurred error immediately during code generation, rather than proceed on the basis of the error and wait for post-revising after generation. In this paper, we propose \ourapproachbf, which integrates the backtracking mechanism and program analysis into LLMs for code generation. Specifically, we employ program analysis to perform incremental error detection during the generation process. When an error is detected, the backtracking mechanism is triggered to priming rollback strategies and constraint regeneration, thereby avoiding the recurrence of the same error. Experiments on multiple code generation benchmarks show that \ourapproachbf can significantly reduce the errors generated by LLMs, with a compilation pass rate of over 98.9\%. The test pass rate is improved by up to 23.8\% compared to the best baseline approach. Compared to the post-revising baseline, the cost is reduced by 19.3\%. Moreover, our approach is model-agnostic and achieves consistent improvements across six LLMs.

Tags: "AI for SE", "Analysis"

Christof Tinnes, Alisa Welter, Sven Apel, "Software Model Evolution with Large Language Models: Experiments on Simulated, Public, and Industrial Datasets"

Abstract: Modeling structure and behavior of software systems plays a crucial role in the industrial practice of software engineering. As with other software engineering artifacts, software models are subject to evolution. Supporting modelers in evolving software models with recommendations for model completions is still an open problem, though. In this paper, we explore the potential of large language models for this task. In particular, we propose an approach, \textsc{RaMc}, leveraging large language models, model histories of software systems, and retrieval-augmented generation for model completion. Through experiments on three datasets, including an industrial application, one public open-source community dataset, and one controlled collection of simulated model repositories, we evaluate the potential of large language models for model completion. We found that large language models are indeed a promising technology for supporting software model evolution (62.30% semantically correct completions on real-world industrial data and up to 86.19% type-correct completions). Furthermroe, we found that the general inference capabilities of large language models are useful, for example, when dealing with concepts for which there are few, noisy, or no examples at all.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Ran Mo, Haopeng Song, Wei Ding, Chaochao Wu, "Code Cloning in Solidity Smart Contracts: Prevalence, Evolution, and Impact on Development"

Abstract: In recent years, the development of Solidity smart contracts has been increasing rapidly in popularity. Code cloning is a common coding practice, and many prior studies have revealed that code clones could negatively impact software maintenance and quality. However, there is little work systematically analyzing the nature and impacts of code clones in solidity smart contracts. To bridge this gap, we investigate the prevalence, evolution, and bug-proneness of code clones in solidity smart contracts, and further identify the possible reasons for these clones’ occurrences. With our evaluation of 26,294 smart contracts with 97,877 functions, we have found that code clones are highly prevalent in smart contracts. Additionally, on average, 32.01% of clones co-evolve, indicating the need for careful management to avoid consistency issues. Surprisingly, unlike in traditional software development, code clones in smart contracts are rarely involved in bug fixes. Finally, we identify three main factors that affect the occurrences of clones. We believe our study can provide valuable insights for developers to understand and manage code clones in solidity smart contracts.

Tags: "Prog Comprehension/Reeng/Maint", "Blockchain"

Trey Woodlief, Carl Hildebrandt, Sebastian Elbaum, "A Differential Testing Framework to Identify Critical AV Failures Leveraging Arbitrary Inputs"

Abstract: The proliferation of autonomous vehicles (AVs) has made their failures increasingly evident. Testing efforts aimed at identifying the inputs leading to those failures are challenged by the input's long-tail distribution, whose area under the curve is dominated by rare scenarios. We hypothesize that leveraging emerging open-access datasets can accelerate the exploration of long-tail inputs. Having access to diverse inputs, however, is not sufficient to expose failures; an effective test also requires an oracle to distinguish between correct and incorrect behaviors. Current datasets lack such oracles and developing them is notoriously difficult. In response, we propose DiffTest4AV, a differential testing framework designed to address the unique challenges of testing AV systems: 1) for any given input, many outputs may be considered acceptable, 2) the long-tail contains an insurmountable number of inputs to explore, and 3) the AV's continuous execution loop requires for failures to persist in order to affect the system. DiffTest4AV integrates statistical analysis to identify meaningful behavioral variations, judges their importance in terms of the severity of these differences, and incorporates sequential analysis to detect persistent errors indicative of potential system-level failures. Our study on 5 versions of the commercially-available, road-deployed comma.ai OpenPilot system, using 3 available image datasets, demonstrates the capabilities of the framework to detect high-severity, high-confidence, long-running test failures.

Tags: "Testing and Quality", "Autonomy"

Giacomo Benedetti, Oreofe Solarin, Courtney Miller, Greg Tystahl, William Enck, Christian Kästner, Alexandros Kapravelos, Alessio Merlo, Luca Verderame, "An Empirical Study on Reproducible Packaging in Open-Source Ecosystems"

Abstract: The integrity of software builds is fundamental to the security of the software supply chain. While Thompson first raised the potential for attacks on build infrastructure in 1984, limited attention has been given to build integrity in the past 40 years, enabling recent attacks on SolarWinds, event-stream, and xz. The best-known defense against build system attacks is creating \emph{reproducible builds}; however, achieving them can be complex for both technical and social reasons and thus is often viewed as impractical to obtain. In this paper, we analyze reproducibility of builds in a novel context: reusable \emph{components} distributed as \emph{packages} in six popular software ecosystems (npm, Maven, PyPI, Go, RubyGems, and Cargo). Our quantitative study on a representative sample of 4000 packages in each ecosystem raises concerns: Rates of reproducible builds vary widely between ecosystems, with some ecosystems having all packages reproducible whereas others have \issues in nearly every package. However, upon deeper investigation, we identified that with relatively straightforward infrastructure configuration and patching of build tools, we can achieve very high rates of reproducible builds in all studied ecosystems. We conclude that if the ecosystems adopt our suggestions, the build process of published packages can be independently confirmed for nearly all packages without individual developer actions, and doing so will prevent significant future software supply chain attacks.

Tags: "MSR"

Nusrat Zahan, Philipp Burckhardt, Mikola Lysenko, Feross Aboukhadijeh, Laurie Williams, "Leveraging Large Language Models to Detect npm Malicious Packages"

Abstract: Existing malicious code detection techniques can aid the manual review process by predicting which packages are likely to be malicious. However, these techniques often suffer from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to assist security analysts in detecting malicious packages through the empirical study of using Large Language Models (LLMs) to detect malicious code in the npm ecosystem. We present SecurityAI, a malicious code review workflow to detect malicious code using ChatGPT. We leverage a benchmark dataset of 5,115 npm packages, of which 2,180 packages have malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious Javascript code. We compare the effectiveness of static analysis as a pre-screener with SecurityAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious packages detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. We attained precision and F1 scores of 91% and 94% for GPT-3, and 99% & 97% for GPT-4, respectively, with GPT-3 offering a cost-effective balance. Pre-screening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, hidden backdoors, and suspicious domain connection categories as the top detected malicious packages. The lack of diversity in model-generated responses led to hallucinations, resulting in misclassification cases, with GPT-3 hallucinating more frequently.

Tags: "AI for SE", "Security"

Xin-Cheng Wen, Zirui Lin, Cuiyun Gao, Hongyu Zhang, Yong Wang, Qing Liao, "Repository-Level Graph Representation Learning for Enhanced Security Patch Detection"

Abstract: Software vendors often silently release security patches without providing sufficient advisories (e.g., Common Vulnerabilities and Exposures) or delayed updates via resources (e.g., National Vulnerability Database). Therefore, it has become crucial to detect these security patches to ensure secure software maintenance. However, existing methods face the following challenges: (1) They primarily focus on the information within the patches themselves, overlooking the complex dependencies in the repository. (2) Security patches typically involve multiple functions and files, increasing the difficulty in well learning the representations. To alleviate the above challenges, this paper proposes a \textit{Repo}sitory-level Security Patch Detection framework named \textit{RepoSPD}, which comprises three key components: 1) a repository-level graph construction, RepoCPG, which represents software patches by merging pre-patch and post-patch source code at the repository level; 2) a structure-aware patch representation, which fuses the graph and sequence branch and aims at comprehending the relationship among multiple code changes; 3) progressive learning, which facilitates the model in balancing semantic and structural information. To evaluate RepoSPD, we employ two widely-used datasets in security patch detection: SPI-DB and PatchDB. We further extend these datasets to the repository level, incorporating a total of 20,238 and 28,781 versions of repository in C/C++ programming languages, respectively, denoted as SPI-DB* and PatchDB*. We compare RepoSPD with six existing security patch detection methods and five static tools. Our experimental results demonstrate that RepoSPD outperforms the state-of-the-art baseline, with improvements of 11.90\%, and 3.10\% in terms of accuracy on the two datasets, respectively. These results underscore the effectiveness of RepoSPD in detecting security patches. Furthermore, RepoSPD can detect 151 security patches, which outperforms the best-performing baseline by 21.36\% with respect to accuracy.

Tags: "AI for SE", "Security"

Tajkia Rahman Toma, Balreet Grewal, Cor-Paul Bezemer, "Answering User Questions about Machine Learning Models through Standardized Model Cards"

Abstract: Reusing pre-trained machine learning models is becoming very popular due to model hubs such as Hugging Face (HF). However, similar to when reusing software, many issues may arise when reusing an ML model. In many cases, users resort to asking questions on discussion forums such as the HF community forum. In this paper, we study how we can reduce the community's workload in answering these questions and increase the likelihood that questions receive a quick answer. We analyze 11,278 discussions from the HF model community that contain user questions about ML models. We focus on the effort spent handling questions, the high-level topics of discussions, and the potential for standardizing responses in model cards based on a model card template. Our findings indicate that there is not much effort involved in responding to user questions, however, 40.1% of the questions remain open without any response. A topic analysis shows that discussions are more centered around technical details on model development and troubleshooting, indicating that more input from model providers is required. We show that 42.5% of the questions could have been answered if the model provider followed a standard model card template for the model card. Based on our analysis, we recommend that model providers add more development-related details on the model's architecture, algorithm, data preprocessing and training code in existing documentation (sub)sections and add new (sub)sections to the template to address common questions about model usage and hardware requirements.

Tags: "MSR", "SE for AI"

Cho-Ting Lee, Andrew Neeser, Shengzhe Xu, Jay Katyan, Patrick Cross, Sharanya Pathakota, Marigold Norman, John C. Simeone, Jaganmohan Chandrasekaran, Naren Ramakrishnan, "Can an LLM find its way around a Spreadsheet?"

Abstract: Spreadsheets are routinely used in business and scientific contexts, and one of the most vexing challenges is performing data cleaning prior to analysis and evaluation. The ad-hoc and arbitrary nature of data cleaning problems, such as typos, inconsistent formatting, missing values, and a lack of standardization, often creates the need for highly specialized pipelines. We ask whether an LLM can find its way around a spreadsheet and how to support end-users in taking their free-form data processing requests to fruition. Just like RAG retrieves context to answer users’ queries, we demonstrate how we can retrieve elements from a code library to compose data preprocessing pipelines. Through comprehensive experiments, we demonstrate the quality of our system and how it is able to continuously augment its vocabulary by saving new codes and pipelines back to the code library for future retrieval.

Tags: "AI for SE", "Analysis"

Madeline Janecek, Naser Ezzati-Jivan, Abdelwahab Hamou-Lhadj, "Execution Trace Reconstruction Using Diffusion-Based Generative Models"

Abstract: Execution tracing is essential for understanding system and software behaviour, yet lost trace events can significantly compromise data integrity and analysis. Existing solutions for trace reconstruction often fail to fully leverage available data, particularly in complex and high-dimensional contexts. Recent advancements in generative artificial intelligence, particularly diffusion models, have set new benchmarks in image, audio, and natural language generation. This study conducts the first comprehensive evaluation of diffusion models for reconstructing incomplete trace event sequences. Using nine distinct datasets generated from the Phoronix Test Suite, we rigorously test these models on sequences of varying lengths and missing data ratios. Our results indicate that the SSSD$^{S4}$ model, in particular, achieves superior performance, in terms of accuracy, perfect rate, and ROUGE-L score across diverse imputation scenarios. These findings underscore the potential of diffusion-based models to accurately reconstruct missing events, thereby maintaining data integrity and enhancing system monitoring and analysis.

Tags: "Testing and Quality", "Analysis", "AI for SE"

Joseph Romeo, Marco Raglianti, Csaba Nagy, Michele Lanza, "UML is Back. Or is it? Investigating the Past, Present, and Future of UML in Open Source Software"

Abstract: Since its inception, UML, the Unified Modeling Language, has been touted as the way to go when it comes to designing and documenting software systems. While being an integral part of many university software engineering programs, UML has found little consideration among developers, especially in open source software. Reasons for this include that UML shares some shortcomings with other forms of documentation (e.g., limited availability, outdatedness, inadequate level of detail). We present a study to investigate the evolution and the current situation regarding the use of UML in open source projects. We mined and analyzed ~13k GitHub projects, developing strategies and heuristics to identify UML files through their extensions and contents, for a quantitative analysis of two decades of evolution of the usage of UML. We explored the popularity of UML, derived characteristics of projects leveraging UML, and analyzed the authors, creators and maintainers, of UML artifacts. Our study confirms that UML is indeed still under-utilized. At the same time we found evidence of a resurgence coinciding with the popularity of human-readable text-based formats, defined and used by tools like PlantUML and Mermaid. We discuss how identifying and addressing the new challenges implied by this resurgence could impact the future of UML.

Tags: "Prog Comprehension/Reeng/Maint", "Human/Social"

Abdul Haddi Amjad, Muhammad Danish, Bless Jah, Muhammad Ali Gulzar, "Accessibility Issues in Ad-Driven Web Applications"

Abstract: Website accessibility is essential for inclusiveness and regulatory compliance. Although third-party advertisements (ads) are a vital revenue source for free web services, they introduce significant accessibility challenges. Leasing a website’s space to ad-serving technologies like DoubleClick results in developers losing control over ad content accessibility. Even on highly accessible websites, third-party ads can undermine adherence to Web Content Accessibility Guidelines (WCAG). We conduct the first-of-its-kind large-scale investigation of 430K website elements, including nearly 100K ad elements, to understand the accessibility of ads on websites. We seek to understand the prevalence of inaccessible ads and their overall impact on the accessibility of websites. Our findings show that 67% of websites experience increased accessibility violations due to ads, with common violations including Focus Visible (WCAG 2.4.7) and On Input (WCAG 3.2.2). Popular ad-serving technologies like Taboola, DoubleClick, and RevContent often serve ads that fail to comply with WCAG standards. Even when ads are WCAG compliant, 27% of them have alternative text in ad images that misrepresents information, potentially deceiving users. Manual inspection of a sample of these misleading ads revealed that user-identifiable data is collected on 94% of websites through interactions, such as hovering or pressing enter. Since users with disabilities often rely on tools like screen readers that require hover events to access website content, they have no choice but to compromise their privacy in order to navigate website ads. Based on our findings, we further dissect the root cause of these violations and provide design guidelines to both website developers and ad-serving technologies to achieve WCAG-compliant ad integration.

Tags: "Human/Social", "Testing and Quality"

Myeongsoo Kim, Tyler Stennett, Saurabh Sinha, Alessandro Orso, "A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs"

Abstract: As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API specifications such as OpenAPI ones has led to the emergence of many black-box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in detecting faults (i.e., 500 response codes). To address these limitations, we present AutoRestTest, the first black-box framework to adopt a dependency-embedded multi-agent approach for REST API testing, integrating Multi-Agent Reinforcement Learning (MARL) with a Semantic Property Dependency Graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents---API, dependency, parameter, and value---collaborate to optimize API exploration. LLMs handle domain-specific value restrictions, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents' behavior. Evaluated on 12 real-world REST services, AutoRestTest outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which augments realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to identify an internal server error in Spotify. Our ablation study underscores the significant contributions of the agent learning, SPDG, and LLM components.

Tags: "Testing and Quality", "AI for SE"

Geraldine Galindo-Gutierrez, Juan Pablo Sandoval Alcocer, Nicolas Jimenez-Fuentes, Alexandre Bergel, Gordon Fraser, "Increasing the Effectiveness of Automatically Generated Tests by Improving Class Observability"

Abstract: Automated unit test generation consists of two complementary challenges: Finding sequences of API calls that exercise the code of a class under test, and finding assertion statements that validate the behaviour of the class during execution. The former challenge is often addressed using meta-heuristic search algorithms optimising tests for code coverage, which are then annotated with regression assertions to address the latter challenge, i.e., assertions that capture the states observed during test generation. While the resulting tests tend to achieve high coverage, their fault finding potential is often inhibited by poor or difficult observability of the codebase. That is, relevant attributes and properties may either not be exposed adequately at all, or only in ways that the test generator is unable to handle. In this paper, we investigate the influence of observability in the context of the EvoSuite search-based Java test generator, which we extend in two complementary ways to study and improve observability: First, we apply a transformation to code under test to expose encapsulated attributes to the test generator; second, we address EvoSuite's limited capability of asserting the state of complex objects. Our evaluation demonstrates that together these observability improvements lead to significantly increased mutation scores, underscoring the importance of considering the class observability in the test generation process.

Tags: "Testing and Quality"

Nabson Silva, Eriky Rodrigues, Tayana Conte, "A Catalog of Micro Frontends Anti-patterns"

Abstract: Micro frontend (MFE) architectures have gained significant popularity for promoting independence and modularity in development. Despite their widespread adoption, the field remains relatively unexplored, especially concerning identifying problems and documenting best practices. Drawing on both established microservice (MS) anti-patterns and the analysis of real problems faced by software development teams that adopt MFE, this paper presents a catalog of 12 MFE anti-patterns. We composed an initial version of the catalog by recognizing parallels between MS anti-patterns and recurring issues in MFE projects to map and adapt MS anti-patterns to the context of MFE. To validate the identified problems and proposed solutions, we conducted a survey with industry practitioners, collecting valuable feedback to refine the anti-patterns. Additionally, we asked participants if they had encountered these problems in practice and to rate their harmfulness on a 10-point Likert scale. The survey results revealed that participants had encountered all the proposed anti-patterns in real-world MFE architectures, with only one reported by less than 50% of participants. They stated that the catalog can serve as a valuable guide for both new and experienced developers, with the potential to enhance MFE development quality. The collected feedback led to the development of a improved version of the anti-patterns catalog. Furthermore, we developed a web application designed to not only showcase the anti-patterns but also to actively foster collaboration and engagement within the MFE community. The proposed catalog is a valuable resource for identifying and mitigating potential pitfalls in MFE development. It empowers developers of all experience levels to create more robust, maintainable, and well-designed MFE applications.

Tags: "Design/Architecture"

Georgios Sakkas, Pratyush Sahu, Kyeling Ong, Ranjit Jhala, "Neurosymbolic Modular Refinement Type Inference"

Abstract: Refinement types -- a type-based generalization of Floyd-Hoare logics -- are an expressive and modular means of statically ensuring a wide variety of correctness, safety, and security properties of software. However, their expressiveness and modularity means that to use them, a developer must laboriously \emph{annotate} all the functions in their code with potentially complex type specifications that specify the contract for that function. We present XO, a neurosymbolic agent that uses LLMs to automatically generate refinement type annotations for all the functions in an entire package or module, using the refinement type checker LiquidHaskell as an oracle to verify the correctness of the generated specifications. We curate a dataset of three Haskell packages where refinement types are used to enforce a variety of correctness properties from data structure invariants to low-level memory safety and use this dataset to evaluate XO. Previously these packages required expert users several days to weeks to annotate with refinement types. Our evaluation shows that when even using models with relatively smaller models like the 3 billion parameter StarCoder LLM, by using fine-tuning, carefully chosen contexts, our neurosymbolic agent generates refinement types for up to 94\% of the functions across entire libraries automatically in just a few hours, thereby showing that LLMs can drastically shrink the human effort needed to use formal verification.

Tags: "AI for SE", "Analysis"

Luís F. Gomes, Vincent Hellendoorn, Jonathan Aldrich, Rui Abreu, "An Exploratory Study of ML Sketches and Visual Code Assistants"

Abstract: This paper explores the integration of Visual Code Assistants in Integrated Development Environments (IDEs). In Software Engineering, whiteboard sketching is often the initial step before coding, serving as a crucial collaboration tool for developers. Previous studies have investigated patterns in SE sketches and how they are used in practice, yet methods for directly using these sketches for code generation remain limited. The emergence of visually-equipped large language models presents an opportunity to bridge this gap, which is the focus of our research. In this paper, we built a first prototype of a Visual Code Assistant to get user feedback regarding in-IDE sketch-to-code tools. We conduct an experiment with 19 data scientists, most of whom regularly sketch as part of their job. We investigate developers' mental models by analyzing patterns commonly observed in their sketches when developing an ML workflow. Analysis indicates that diagrams were the preferred organizational component (52.6\%), often accompanied by lists (42.1\%) and numbered points (36.8\%). Our tool converts their sketches into a Python notebook by querying an LLM. We use an LLM-as-judge setup to score the quality of the generated code, finding that even brief sketching can effectively generate useful code outlines. We also find a significant, positive correlation between sketch time and the quality of the generated code. We conclude the study by conducting extensive interviews to assess the tool's usefulness, explore potential use cases, and understand developers' needs. As noted by participants, promising applications for these assistants include education, prototyping, and collaborative settings. Our findings signal promise for the next generation of Code Assistants to integrate visual information, both to improve code generation and to better leverage developers' existing sketching practices.

Tags: "AI for SE", "Human/Social"

Waris Gill, Ali Anwar, Muhammad Ali Gulzar, "TraceFL: Interpretability-Driven Debugging in Federated Learning via Neuron Provenance"

Abstract: In Federated Learning, clients train models on local data and send updates to a central server, which aggregates them into a global model using a fusion algorithm. This collaborative yet privacy-preserving training comes at a cost—FL developers face significant challenges in attributing global model predictions to specific clients. Localizing responsible clients is a crucial step towards (a) excluding clients primarily responsible for incorrect predictions and (b) encouraging clients who contributed high quality models to continue participating in the future. Existing ML explainability approaches are inherently inapplicable as they are designed for single-model, centralized training. We introduce TraceFL, a fine-grained neuron provenance capturing mechanism that identifies clients responsible for the global model’s prediction by tracking the flow of information from individual clients to the global model. Since inference on different inputs activates a different set of neurons of the global model, TraceFL dynamically quantifies the significance of the global model’s neurons in a given prediction. It then selectively picks a slice of the most crucial neurons in the global model and maps them to the corresponding neurons in every participating client to determine each client’s contribution, ultimately localizing the responsible client. We evaluate TraceFL on six datasets, including two real-world medical imaging datasets and four neural networks, including advanced models such as GPT. TraceFL achieves 99% accuracy in localizing the responsible client in FL tasks spanning both image and text classification tasks. At a time when state-of-the-art ML debugging approaches are mostly domain-specific (e.g., image classification only), TraceFL is the first technique to enable highly accurate automated reasoning across a wide range of FL applications.

Tags: "SE for AI", "Testing and Quality"

Yuanfang Cai, Lanting He, Yony Kochinski, Jun Qian, Ciera Jaspan, Nan Zhang, Antonio Bianco, "Understanding Architectural Complexity, Maintenance Burden, and Developer Sentiment---a Large-Scale Study"

Abstract: Intuitively, the more complex a software system is, the harder it is to maintain. Statistically, it is not clear which complexity metrics correlate with maintenance effort; in fact, it is not even clear how to objectively measure maintenance burden, so that developers' sentiment and intuition can be supported by numbers. Without effective complexity and maintenance metrics, it remains difficult to objectively monitor maintenance, control complexity, or justify refactoring. In this paper, we report a large-scale study of 1252 projects written in C++ and Java from Company_X. We collected three categories of metrics: (1) architectural complexity, measured using propagation cost (PC), decoupling level (DL), and structural anti-patterns; (2) maintenance activity, measured using the number of changes, lines of code (LOC) written, and active coding time (ACT) spent on feature-addition vs. bug-fixing, and (3) developer sentiment on complexity and productivity, collected from 7200 survey responses. We statistically analyzed the correlations among these metrics and obtained significant evidence of the following findings: 1) the more complex the architecture is (higher propagation cost, more instances of anti-patterns), the more LOC is spent on bug-fixing, rather than adding new features; 2) developers who commit more changes for features, spend more lines of code on features, or spend more time on features also feel that they are less hindered by technical debt and complexity. To the best of our knowledge, this is the first large-scale empirical study establishing the statistical correlation among architectural complexity, maintenance activity, and developer sentiment. The implication is that, instead of solely relying upon developer sentiment and intuition to detect degraded structure or increased burden to evolve, it is possible to objectively and continuously measure and monitor architectural complexity and maintenance difficulty, increasing feature delivery efficiency by reducing architectural complexity and anti-patterns.

Tags: "Design/Architecture", "MSR", "Human/Social", "Prog Comprehension/Reeng/Maint"

Aleks Chakarov, Jaco Geldenhuys, Matthew Heck, Michael Hicks, Samuel Huang, Georges-Axel Jaloyan, Anjali Joshi, K. Rustan M. Leino, Mikael Mayer, Sean McLaughlin, Akhilesh Mritunjai, Clement Pit-Claudel, Sorawee Porncharoenwase, Florian Rabe, Marianna Rapoport, Giles Reger, Cody Roux, Neha Rungta, Robin Salkeld, Matthias Schlaipfer, Daniel Schoepe, Johanna Schwartzentruber, Serdar Tasiran, Aaron Tomb, Emina Torlak, Jean-Baptiste Tristan, Lucas Wagner, Michael W. Whalen, Remy Willems, Tongtong Xiang, Tae Joon Byun, Joshua Cohen, Ruijie Fang, Junyoung Jang, Jakob Rath, Hira Taqdees Syeda, Dominik Wagner, Yongwei Yuan, "Formally Verified Cloud-Scale Authorization"

Abstract: All critical systems must evolve to meet the needs of a growing and diversifying user base. But supporting that evolution is challenging at increasing scale: Maintainers must find a way to ensure that each change does only what is intended, and will not inadvertently change behavior for existing users. This paper presents how we addressed this challenge for a cloud-scale authorization engine, invoked 1 billion times per second, by using formal verification. Over a period of four years, we built a new authorization engine, one that behaves functionally the same as its predecessor, using the verification-aware programming language Dafny. We can now confidently deploy enhancements and optimizations while maintaining the highest assurance of both correctness and backward compatibility. We deployed the new engine in 2024 without incident and customers immediately enjoyed a threefold performance improvement. The methodology we followed to build this new engine was not an off-the-shelf application of an existing verification tool, and this paper presents several key insights: 1) Rather than prove correct the existing engine, written in Java, we found it more effective to \emph{write a new engine} in Dafny, a language built for \emph{verification from the ground up}, and then compile the result to Java. 2) To ensure performance, debuggability, and to gain trust from stakeholders, we needed to generate readable, \emph{idiomatic} Java code, essentially a transliteration of the source Dafny. 3) To ensure that the specification matches the system's actual behavior, we performed \emph{extensive differential and shadow testing} throughout the development process, ultimately comparing against $10^{15}$ production samples prior to deployment. Our approach demonstrates how formal verification can be effectively applied to evolve critical legacy software at scale.

Tags: "Design/Architecture", "Security", "Prog Comprehension/Reeng/Maint", "Human/Social"

Shazibul Islam Shamim, Hanyang Hu, Akond Rahman, "On Prescription or Off Prescription? An Empirical Study of Community-prescribed Security Configurations for Kubernetes"

Abstract: Despite being beneficial for rapid delivery of software, Kubernetes deployments can be susceptible to security attacks, which can cause serious consequences. A systematic characterization of how community-prescribed security configurations, i.e., security configurations that are recommended by security experts, can aid practitioners to secure their Kubernetes deployments. To that end, we conduct an empirical study with 53 security configurations recommended by the Center for Internet Security (CIS), 20 survey respondents, and 356 configuration files obtained from open source software (OSS) repositories and 188 configuration files used by Company-A. From our empirical study, we observe: (i) practitioners can be unaware of prescribed security configurations as 5%~40% of the survey respondents are unfamiliar with 16 prescribed configurations; and (ii) for Company-A and OSS respectively, 18.0% and 17.9% of the configuration files include at least one violation of prescribed configurations. From our evaluation with 5 static application security testing (SAST) tools we find (i) only Kubescape to support all of the prescribed security configurations; (ii) the highest observed precision to be 0.48 and 0.43 respectively, for the Company-A and OSS datasets; and (iii) the highest observed recall to be respectively, 0.53 and 0.65 for the Company-A and OSS datasets. We conclude the paper by providing recommendations for practitioners on how they can use existing SAST tools to secure their Kubernetes deployments.

Tags: "Prog Comprehension/Reeng/Maint", "Security"

Jaehyeok Lee, Sooyoung Cha, "TopSeed: Learning Seed Selection Strategies for Symbolic Execution from Scratch"

Abstract: We present TopSeed, a new approach that automatically selects optimal seeds to enhance symbolic execution. Recently, the performance of symbolic execution has significantly improved through various state-of-the-art techniques, including search strategies and state-pruning heuristics. However, these techniques have typically demonstrated their effectiveness without considering “seeding”, which efficiently initializes program states for exploration. This paper aims to select valuable seeds from candidate inputs generated during interactions with any symbolic execution technique, without the need for a predefined seed corpus, thereby maximizing the technique’s effectiveness. One major challenge is the vast number of candidates, making it difficult to identify promising seeds. To address this, we introduce a customized online learning algorithm that iteratively groups candidate inputs, ranks each group, and selects a seed from the top-ranked group based on data accumulated during symbolic execution. Experimental results on 17 open-source C programs show that TopSeed significantly enhances four distinct cutting-edge techniques, implemented on top of two symbolic executors, in terms of branch coverage and bug-finding abilities.

Tags: "Testing and Quality"

Ali Ebrahimi Pourasad, Walid Maalej, "Does GenAI Make Usability Testing Obsolete?"

Abstract: Ensuring usability is crucial for the success of mobile apps. Usability issues can compromise user experience and negatively impact the perceived app quality. This paper presents UX-LLM, a novel tool powered by a Large Vision-Language Model that predicts usability issues in iOS apps. To evaluate the performance of UX-LLM we predicted usability issues in two open-source apps of a medium complexity and asked usability experts to assess the predictions. We also performed traditional usability testing and expert review for both apps and compared the results to those of UX-LLM. UX-LLM demonstrated precision ranging from 0.61 and 0.66 and recall between 0.35 and 0.38, indicating its ability to identify valid usability issues, yet failing to capture the majority of issues. Finally, we conducted a focus group with an app development team of a capstone project developing a transit app for visually impaired persons. The focus group expressed positive perceptions of UX-LLM as it identified unknown usability issues in their app. However, they also raised concerns about its integration into the development workflow, suggesting potential improvements. Our results show that UX-LLM cannot fully replace traditional usability evaluation methods but serves as a valuable supplement particularly for small teams with limited resources, to identify issues in less common user paths, due to its ability to inspect the source code.

Tags: "AI for SE", "Testing and Quality"

Zhao Tian, Junjie Chen, Xiangyu Zhang, "Fixing Large Language Models' Specification Misunderstanding for Better Code Generation"

Abstract: Code generation is to automatically generate source code conforming to a given programming specification, which has received extensive attention especially with the development of large language models (LLMs). Due to the inherent difficulty of code generation, the code generated by LLMs may not be aligned with the specification. Although thought-eliciting prompting techniques have been proposed to enhance the code generation performance of LLMs, producing correct understanding for complicated programming problems remains challenging, resulting in unsatisfactory performance. Also, some feedback-based prompting techniques have been proposed to fix incorrect code using error messages produced by test execution. However, when the generated code deviates significantly from the ground truth, they encounter difficulties in improving performance based on such coarse-grained information. In this work, we propose a novel prompting technique, called μFiX, to improve the code generation performance of LLMs by devising both sophisticated thought-eliciting prompting and feedback-based prompting and making the first exploration on their synergy. It first exploits test case analysis to obtain specification understanding and enables a self-improvement process to identify and refine the misunderstanding in the thought-eliciting prompting phase. μFiX further fixes the specification understanding towards the direction reducing the gap between the provided understanding (from the first phase) and the actual understanding implicitly utilized by LLMs for code generation in the feedback-based prompting phase. By improving the understanding with μFiX, the code generation performance of LLMs can be largely improved. Our evaluation on two advanced LLMs (ChatGPT and DeepSeek-Coder) with six widely-used benchmarks by comparing with 15 baselines, demonstrates the effectiveness of μFiX. For example, μFiX outperforms the most effective baseline with an average improvement of 35.62% in terms of Pass@1 across all subjects.

Tags: "AI for SE", "SE for AI", "Analysis"

Jingwen Zhang, Zibin Zheng, Yuhong Nan, Mingxi Ye, Kaiwen Ning, Yu Zhang, Weizhe Zhang, "SmartReco: Detecting Read-Only Reentrancy via Fine-Grained Cross-DApp Analysis"

Abstract: Despite the increasing popularity of Decentralized Applications (DApps), they are suffering from various vulnerabilities that can be exploited by adversaries for profits. Among such vulnerabilities, Read-Only Reentrancy (called ROR in this paper), is an emerging type of vulnerability that arises from the complex interactions between DApps. In recent three years, attack incidents of ROR have already caused around 30M USD losses to the DApp ecosystem. Existing techniques for vulnerability detection in smart contracts can hardly detect Read-Only Reentrancy attacks, due to the lack of tracking and analyzing the complex interactions between multiple DApps. In this paper, we propose SmartReco, a new framework for detecting Read-Only Reentrancy vulnerability in DApps through a novel combination of static and dynamic analysis (i.e., fuzzing) over smart contracts. The key design behind SmartReco is threefold: (1) SmartReco identifies the boundary between different DApps from the heavy-coupled cross-contract interactions. (2) SmartReco performs fine-grained static analysis to locate points of interest (i.e., entry functions) that may lead to ROR. (3) SmartReco utilizes the on-chain transaction data and performs multi-function fuzzing (i.e., the entry function and victim function) across different DApps to verify the existence of ROR. Our evaluation of a manual-labeled dataset with 45 RORs shows that SmartReco achieves an accuracy of 88.63% and a recall of 86.36%. In addition, SmartReco successfully detects 43 new RORs from 123 popular DApps. The total assets affected by such RORs reach around 520,000 USD.

Tags: "Security", "Analysis"

Mengya Zhang, Preksha Shukla, Wuqi Zhang, Zhuo Zhang, Pranav Agrawal, Zhiqiang Lin, Xiangyu Zhang, Xiaokuan Zhang, "An Empirical Study of Proxy Smart Contracts at the Ethereum Ecosystem Scale"

Abstract: Proxy has been introduced as a design pattern to separate data and code in an application to two different types of smart contracts, namely proxy and logic contracts, respectively. Data is stored in the proxy contracts, while the code to be executed is fetched from the logic contracts. Proxy patterns facilitate the flexibility of smart contract development by enabling upgradeability, extensibility, code reuse, etc. Despite its popularity and importance, there is currently no systematic study to understand the prevalence, use scenarios, and development pitfalls of proxies. In this work, we conduct the first comprehensive study on Ethereum proxies. To collect a comprehensive dataset of proxies, we propose ProxyEx, the first framework designed to detect proxy directly from bytecode. Our evaluation shows that ProxyEx achieves over 99% accuracy. With ProxyEx, we collect a large-scale dataset of 2,031,422 proxies from all contracts in Ethereum and conduct the first systematic empirical study. We first measure the total number of proxies and their transaction traffic, to obtain an overall understanding of the status quo of proxies on Ethereum. Then, we categorize the design pattern and use scenarios of proxies into four types: upgradeability, extensibility, code-sharing, and code-hiding. We further identify three types of common pitfalls in proxies: proxy-logic storage collision, logic-logic storage collision, and uninitialized contracts. We also design three checkers for these common pitfalls in proxies by replaying historical transactions. Our study leads to many interesting findings. For instance, we find that upgradeability is not the only reason that developers adopt the proxy pattern in developing Decentralized Applications (DApps). We also find that many proxies suffer from bugs such as storage collision and uninitialized contracts. Our study sheds light on the proxies landscape, and provides valuable insights to future smart contract research on the development, usage, quality assurance, and bug detection of proxies.

Tags: "Blockchain", "MSR"

Yifan Wu, Yunpeng Wang, Ying Li, Wei Tao, Siyu Yu, Haowen Yang, Wei Jiang, Jianguo Li, "An Empirical Study on Commit Message Generation using LLMs via In-Context Learning"

Abstract: Commit messages concisely describe code changes in natural language and are important for software maintenance. Several approaches have been proposed to automatically generate commit messages, but they still suffer from critical limitations, such as time-consuming training and poor generalization ability. To tackle these limitations, we propose to borrow the weapon of large language models (LLMs) and in-context learning (ICL). Our intuition is based on the fact that the training corpora of LLMs contain extensive code changes and their pairwise commit messages, which makes LLMs capture the knowledge about commits, while ICL can exploit the knowledge hidden in the LLMs and enable them to perform downstream tasks without model tuning. However, it remains unclear how well LLMs perform on commit message generation via ICL. Therefore, in this paper, we conduct a comprehensive empirical study to investigate the capability of LLMs to generate commit messages via ICL. Specifically, we first explore the impact of different settings on the performance of ICL-based commit message generation. We then compare ICL-based commit message generation with state-of-the-art approaches on a popular multilingual dataset and a new dataset we created to mitigate potential data leakage. The results show that ICL-based commit message generation significantly outperforms state-of-the-art approaches on subjective evaluation and achieves better generalization ability. We further analyze the root causes for LLM’s underperformance and propose several implications, which shed light on future research directions for using LLMs to generate commit messages.

Tags: "AI for SE"

Octavio Galland, Marcel Böhme, "Invivo Fuzzing by Amplifying Actual Executions"

Abstract: A major bottleneck that remains when fuzzing software libraries is the need for _fuzz drivers_, i.e., the glue code between the fuzzer and the library. Despite years of fuzzing, critical security flaws are still found, e.g., by manual auditing, because the fuzz drivers do not cover the complex interactions between the library and the host programs using it. In this work we propose an alternative approach to library fuzzing, which leverages a valid execution context that is set up by a given program using the library (the _host_), and _amplify_ its execution. More specifically, we execute the host until a designated function from a list of _target_ functions has been reached, and then perform coverage-guided function-level fuzzing on it. Once the fuzzing quota is exhausted, we move on to fuzzing the next target from the list. In this way we not only reduce the amount of manual work needed by a developer to incorporate fuzzing into their workflow, but we also allow the fuzzer to explore parts of the library as they are used in real-world programs that may otherwise not have been tested due to the simplicity of most fuzz drivers.

Tags: "Testing and Quality"

Wenwei Gu, Jiazhen Gu, Jinyang Liu, Zhuangbin Chen, Jianping Zhang, Jinxi Kuang, Cong Feng, Yongqiang Yang, Michael Lyu, "ADAMAS: Adaptive Domain-Aware Performance Anomaly Detection in Cloud Service Systems"

Abstract: A common practice in the reliability engineering of cloud services involves the collection of monitoring metrics, followed by comprehensive analysis to identify performance issues. However, existing methods often fall short of detecting diverse and evolving anomalies across different services. Moreover, there exists a significant gap between the technical and business interpretation of anomalies, i.e., a detected anomaly may not have an actual impact on system performance or user experience. To address these challenges, we propose ADAMAS, an adaptive AutoML-based anomaly detection framework aiming to achieve practical anomaly detection in production cloud systems. To improve the ability of detecting cross-service anomalies, we design a novel unsupervised evaluation function to facilitate the automatic searching of the optimal model structure and parameters. ADAMAS also contains a lightweight human-in-the-loop design, which can efficiently incorporate expert knowledge to adapt to the evolving anomaly patterns and bridge the gap between predicted anomalies and actual business exceptions. Furthermore, through monitoring the rate of mispredicted anomalies, ADAMAS proactively re-configures the optimal model, forming a continuous loop of system improvement. Extensive evaluation on one public and two industrial datasets shows that ADAMAS outperforms all baseline models with a 0.891 F1-score. The ablation study also proves the effectiveness of the evaluation function design and the incorporation of expert knowledge.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Zewei Lin, Jiachi Chen, Jiajing Wu, Weizhe Zhang, Zibin Zheng, "Definition and Detection of Centralization Defects in Smart Contracts"

Abstract: In recent years, security incidents stemming from centralization defects in smart contracts have led to substantial financial losses. A centralization defect refers to any error, flaw, or fault in a smart contract’s design or development stage that introduces a single point of failure. Such defects allow a specific account or user to disrupt the normal operations of smart contracts, potentially causing malfunctions or even complete project shutdowns. Despite the significance of this issue, most current smart contract analyses overlook centralization defects, focusing primarily on other types of defects. To address this gap, our paper introduces six types of centralization defects in smart contracts by manually analyzing 597 Stack Exchange posts and 117 audit reports. For each defect, we provide a detailed description and code examples to illustrate its characteristics and potential impacts. Additionally, we introduce a tool named CDRipper (Centralization Defects Ripper) designed to identify the defined centralization defects. Specifically, CDRipper constructs a permission dependency graph (PDG) and extracts the permission dependencies of functions from the source code of smart contracts. It then detects the sensitive operations in functions and identifies centralization defects based on predefined patterns. We conduct a large-scale experiment using CDRipper on 244,424 real-world smart contracts and evaluate the results based on a manually labeled dataset. Our findings reveal that 82,446 contracts contain at least one of the six centralization defects, with our tool achieving an overall precision of 93.7%.

Tags: "Blockchain", "Security", "Testing and Quality"

Changjian Zhang, Parv Kapoor, Ian Dardik, Leyi Cui, Romulo Meira-Goes, David Garlan, Eunsuk Kang, "Constrained LTL Specification Learning from Examples"

Abstract: Temporal logic specifications play an important role in a wide range of software analysis tasks, such as model checking, automated synthesis, program comprehension, and runtime monitoring. Given a set of positive and negative examples, specified as traces, \emph{LTL learning} is the problem of synthesizing a specification, in \emph{linear temporal logic (LTL)}, that evaluates to true over the positive traces and false over the negative ones. In this paper, we propose a new type of LTL learning problem called \emph{constrained LTL learning}, where the user, in addition to positive and negative examples, is given an option to specify one or more \emph{constraints} over the properties of the LTL formula to be learned. We demonstrate that the ability to specify these additional constraints significantly increases the range of applications for LTL learning, and also allows efficient generation of LTL formulas that satisfy certain desirable properties (such as minimality). We propose an approach for solving the constrained LTL learning problem through an encoding in a first-order relational logic and reduction to an instance of the \emph{maximal satisfiability (MaxSAT)} problem. An experimental evaluation demonstrates that ATLAS, an implementation of our proposed approach, is able to solve new types of learning problems while performing better than or competitively with the state-of-the-art tools in LTL learning.

Tags: "Formal methods", "Requirements"

Nikhil Parasaram, Huijie Yan, Boyu Yang, Zineb Flahy, Abriele Qudsi, Damian Ziaber, Earl T. Barr, Sergey Mechtaev, "The Fact Selection Problem in LLM-Based Program Repair"

Abstract: Recent research has shown that incorporating bug- related facts, such as stack traces and GitHub issues, into prompts enhances the bug-fixing capabilities of large language models (LLMs). Considering the ever-increasing context window of these models, a critical question arises: what and how many facts should be included in prompts to maximise the chance of correctly fixing bugs? To answer this question, we conducted a large-scale study, employing over 19K prompts featuring various combinations of seven diverse facts to rectify 314 bugs from open-source Python projects within the BugsInPy benchmark. Our findings revealed that each fact, ranging from simple syntactic details like code context to semantic information previously unexplored in the context of LLMs such as angelic values, is beneficial. Specifically, each fact aids in fixing some bugs that would remain unresolved or only be fixed with a low success rate without it. Importantly, we discovered that the effectiveness of program repair prompts is non-monotonic over the number of used facts; using too many facts leads to subpar outcomes. These insights led us to define the fact selection problem: determining the optimal set of facts for inclusion in a prompt to maximise LLM’s performance on a given task instance. We found that there is no one-size- fits-all set of facts for bug repair. Therefore, we developed a basic statistical model, named MANIPLE, which selects facts specific to a given bug to include in the prompt. This model significantly surpasses the performance of the best generic fact set. To underscore the significance of the fact selection problem, we benchmarked MANIPLE against the state-of-the-art zero-shot, non- conversational LLM-based bug repair methods. On our testing dataset of 157 bugs, MANIPLE repairs 88 bugs, 17% above the best configuration.

Tags: "AI for SE", "Analysis/Repair"

Anna Mazhar, Saad Sher Alam, William Zheng, Yinfang Chen, Suman Nath, Tianyin Xu, "Fidelity of Cloud Emulators: The Imitation Game of Testing Cloud-based Software"

Abstract: Modern software projects have been increasingly using cloud services as important components. The cloud-based programming practice greatly simplifies software development by harvesting cloud benefits (e.g., high availability and elasticity). However, it imposes new challenges for software testing and analysis, due to opaqueness of cloud backends and monetary cost of invoking cloud services for continuous integration and deployment. As a result, cloud emulators are developed for offline development and testing, before online testing and deployment. This paper presents a systematic analysis of cloud emulators from the perspective of cloud-based software testing. Our goal is to (1) understand the discrepancies introduced by cloud emula- tion with regard to software quality assurance and deployment safety and (2) address inevitable gaps between emulated and real cloud services. The analysis results are concerning. Among 255 APIs of five cloud services from Azure and Amazon Web Services (AWS), we detected discrepant behavior between the emulated and real services in 94 (37%) of the APIs. These discrepancies lead to inconsistent testing results, threatening deployment safety, introducing false alarms, and creating debuggability issues. The root causes are diverse, including accidental implementation defects and essential emulation challenges. We discuss potential solutions and develop a practical mitigation technique to address discrepancies of cloud emulators for software testing.

Tags: "Testing and Quality", "Design/Architecture"

Yuxin Zhang, Sen Chen, Xiaofei Xie, Zibo Liu, Lingling Fan, "Scenario-Driven and Context-Aware Automated Accessibility Testing for Android Apps"

Abstract: Mobile accessibility is increasingly important nowadays as it enables people with disabilities to use mobile applications to perform daily tasks. Ensuring mobile accessibility not only benefits those with disabilities but also enhances the user experience for all users, making applications more intuitive and user-friendly. Although numerous tools are available for testing and detecting accessibility issues in Android applications, a large number of false negatives and false positives persist due to limitations in the existing approaches, i.e., low coverage of UI scenarios and lack of consideration of runtime context. To address these problems, in this paper, we propose a scenario-driven exploration method for improving the coverage of UI scenarios, thereby detecting accessibility issues within the application, and ultimately reducing false negatives. Furthermore, to reduce false positives caused by not considering the runtime context, we propose a context-aware detection method that provides a more fine-grained detection capability. Our experimental results reveal that A11yScan can detect 1.7X more issues surpassing current state-of-the-art approaches like Xbot (3,991 vs. 2,321), thereby reducing the false negative rate by 41.84\%. Additionally, it outperforms established UI exploration techniques such as SceneDroid (952 vs. 661 UI scenarios), while achieving comparable activity coverage to recent leading GUI testing tools like GPTDroid on the available dataset (73\% vs. 71\%). Meanwhile, with the context-aware detection method, A11yScan effectively reduces the false positive rate by 21\%, validated with a 90.56\% accuracy rate through a user study.

Tags: "Testing and Quality", "Mobile SW"

Hyunjae Suh, Mahan Tafreshipour, Jiawei Li, Adithya Bhattiprolu, Iftekhar Ahmed, "An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We?"

Abstract: Artificial Intelligence (AI) techniques, especially Large Language Models (LLMs), have started gaining popularity among researchers and software developers for generating source code. However, LLMs have been shown to generate code with quality issues and also incurred copyright/licensing infringements. Therefore, detecting whether a piece of source code is written by humans or AI has become necessary. This study first presents an empirical analysis to investigate the effectiveness of the existing AI detection tools in detecting AI-generated code. The results show that they all perform poorly and lack sufficient generalizability to be practically deployed. Then, to improve the performance of AI-generated code detection, we propose a range of approaches, including fine-tuning the LLMs and machine learning-based classification with static code metrics or code embedding generated from Abstract Syntax Tree (AST). Our best model outperforms state-of-the-art AI-generated code detector (GPTSniffer) and achieves an F1 score of 82.55. We also conduct an ablation study on our best-performing model to investigate the impact of different source code features on its performance.

Tags: "AI for SE", "Analysis"

Yuyang Rong, Zhanghan Yu, Zhenkai Weng, Stephen Neuendorffer, Hao Chen, "IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation"

Abstract: Modern compilers, such as LLVM, are complex. Due to their complexity, manual testing is unlikely to suffice, yet formal verification is difficult to scale. End-to-end fuzzing can be used, but it has difficulties in discovering LLVM backend problems for two reasons. First, frontend preprocessing and middle optimization shield the backend from seeing diverse inputs. Besides, edge coverages cannot provide an effective feedback as LLVM backend contains much reusable code. In this paper, we implement IRFuzzer to investigate the need of specialized fuzzing of the LLVM compiler backend. We focus on two approaches to improve the fuzzer: guaranteed input validity using constrained mutations to improve input diversity and new metrics to improve feedback quality. The mutator in IRFuzzer is capable of generating a wide range of LLVM IR inputs, including structured control flow, vector types, and function definitions. The system instruments coding patterns in the compiler to monitor the execution status of instruction selection. The instrumentation not only provides a new coverage feedback called matcher table coverage, but also provides an architecture specific guidance to the mutator. We show that IRFuzzer is more effective than existing fuzzers by fuzzing on 29 mature LLVM backend targets. In the process, we reported 78 confirmed new bugs in LLVM upstream, out of which 57 have been fixed, five have been back ported to LLVM 15, showing that specialized fuzzing provides useful and actionable insights to LLVM developers.

Tags: "Testing and Quality", "Analysis"

Wuxia Jin, Jiaowei Shang, Jianguo Zheng, Mengjie Sun, Zhenyu Huang, Ming Fan, Ting Liu, "The Design Smells Breaking the Boundary between Android Variants and AOSP"

Abstract: Phone vendors customize their Android variants to enhance system functionalities based on the Android Open Source Project (AOSP). While independent development, Android variants have to periodically evolve with the upstream AOSP and merge code changes from AOSP. Vendors have invested great effort to maintain their variants and resolve merging conflicts. In this paper, we characterize the design smells with recurring patterns that break the design boundary between Android variants and AOSP. These smells are manifested as problematic dependencies across the boundary, hindering Android variants' maintainability and co-evolution with AOSP. We propose the DroidDS for automatically detecting design smells. We collect 22 Android variant versions and 22 corresponding AOSP versions, involving 4 open-source projects and 1 industrial project. Our results demonstrate that: files involved in design smells consume higher maintenance costs than other files; these infected files are not merely the files with large code size, increased complexity, and object-oriented smells; the infected files have been involved in more than half of code conflicts induced by re-applying AOSP's changes to Android variants; a substantial portion of design issues could be mitigable. Practitioners can utilize our DroidDS to pinpoint and prioritize design problems for Android variants. Refactoring these problems will help keep a healthy coupling between diverse variants and AOSP, potentially improving maintainability and reducing conflict risks.

Tags: "Mobile SW", "Design/Architecture", "Prog Comprehension/Reeng/Maint"

Mingfei Cheng, Xiaofei Xie, Yuan Zhou, Junjie Wang, Guozhu Meng, Kairui Yang, "Decictor: Towards Evaluating the Robustness of Decision-Making in Autonomous Driving Systems"

Abstract: Autonomous Driving System (ADS) testing is crucial in ADS development, with the current primary focus being on safety. However, the evaluation of non-safety-critical performance, particularly the ADS’s ability to make optimal decisions and produce optimal paths for autonomous vehicles (AVs), is also vital to ensure the intelligence and reduce risks of AVs. Currently, there is little work dedicated to assessing the robustness of ADSs’ path-planning decisions (PPDs), i.e., whether an ADS can maintain the optimal PPD after an insignificant change in the environment. The key challenges include the lack of clear oracles for assessing PPD optimality and the difficulty in searching for scenarios that lead to non-optimal PPDs. To fill this gap, in this paper, we focus on evaluating the robustness of ADSs’ PPDs and propose the first method, Decictor, for generating non-optimal decision scenarios (NoDSs), where the ADS does not plan optimal paths for AVs. Decictor comprises three main components: Non-invasive Mutation, Consistency Check, and Feedback. To overcome the oracle challenge, Non-invasive Mutation is devised to implement conservative modifications, ensuring the preservation of the original optimal path in the mutated scenarios. Subsequently, the Consistency Check is applied to determine the presence of non-optimal PPDs by comparing the driving paths in the original and mutated scenarios. To deal with the challenge of large environment space, we design Feedback metrics that integrate spatial and temporal dimensions of the AV’s movement. These metrics are crucial for effectively steering the generation of NoDSs. Therefore, Decictor can generate NoDSs by generating new scenarios and then identifying NoDSs in the new scenarios. We evaluate Decictor on Baidu Apollo, an open-source and production-grade ADS. The experimental results validate the effectiveness of Decictor in detecting non-optimal PPDs of ADSs. It generates 63.9 NoDSs in total, while the best-performing baseline only detects 35.4 NoDSs.

Tags: "Autonomy", "SE for AI"

Mingyue Yuan, Jieshan Chen, Zhenchang Xing, Aaron Quigley, Yuyu Luo, Tianqi Luo, Gelareh Mohammadi, Qinghua Lu, Liming Zhu, "DesignRepair: Dual-Stream Design Guideline-Aware Frontend Repair with Large Language Models"

Abstract: The rise of Large Language Models (LLMs) has streamlined frontend interface creation through tools like Vercel's V0, yet surfaced challenges in design quality (e.g., accessibility, and usability). Current solutions, often limited by their focus, generalisability, or data dependency, fall short in addressing these complexities comprehensively. Moreover, none of them examine the quality of LLM-generated UI design. In this work, we introduce DesignRepair, a novel dual-stream design guideline-aware system to examine and repair the UI design quality issues from both code aspect and rendered page aspect. We utilised the mature and popular Material Design as our knowledge base to guide this process. Specifically, we first constructed a comprehensive knowledge base encoding Google's Material Design principles into low-level component knowledge base and high-level system design knowledge base. After that, DesignRepair employs a LLM for the extraction of key components and utilizes the Playwright tool for precise page analysis, aligning these with the established knowledge bases. Finally, we integrate Retrieval-Augmented Generation with state-of-the-art LLMs like GPT-4 to holistically refine and repair frontend code through a strategic divide and conquer approach. Our extensive evaluations validated the efficacy and utility of our approach, demonstrating significant enhancements in adherence to design guidelines, accessibility, and user experience metrics.

Tags: "Analysis/Repair", "Testing and Quality", "Design/Architecture"

Houda Naji, Marco Gutfleisch, Alena Naiakshina, "Relationship Status: “It’s complicated” Developer-Security Expert Dynamics in Scrum"

Abstract: The high number of cyber threats poses significant challenges, with impactful software exploits ranging from data theft to ransomware deployment. Unfortunately, past research highlighted limited security expertise within development teams. Collaboration between developers and security experts, therefore, emerges as one of the few workable means to address this gap. In this paper, we explore the complex interplay between developers and security experts within Scrum, one of the most widely adopted frameworks which actively promotes collaboration, to shed light on their working relationship, challenges, and potential avenues for improvement. To this end, we conducted a qualitative interview study with 14 developers and 13 security experts. Our qualitative results reveal three communication patterns and five shared challenges between the groups affecting the develop-security expert collaboration. Top challenges include consistent interaction difficulties and the lack of workable means to balance business and security needs. As a result, we found that three core Scrum values (openness, respect, courage) are missing from this relationship. Based on our results, we propose recommendations for fostering a healthy collaboration between developers and security experts, both within and beyond Scrum.

Tags: "Human/Social"

Bianca Trinkenreich, Zixuan Feng, Rudrajit Choudhuri, Marco Gerosa, Anita Sarma, Igor Steinmacher, "Investigating the Impact of Interpersonal Challenges on Feeling Welcome in OSS"

Abstract: The sustainability of open source software (OSS) projects hinges on contributor retention. Interpersonal challenges can inhibit a feeling of welcomeness among contributors, particularly from underrepresented groups, which impacts their decision to continue with the project. How much this impact is, varies among individuals, underlining the importance of a thorough understanding of their effects. Here, we investigate the effects of interpersonal challenges on the sense of welcomeness among diverse populations within OSS, through the diversity lenses of gender, race, and (dis)ability. We analyzed the large-scale Linux Foundation Diversity and Inclusion survey (n = 706) to model a theoretical framework linking interpersonal challenges with the sense of welcomeness through Structural Equation Models Partial Least Squares (PLS-SEM). We then examine the model to identify the impact of these challenges on different demographics through Multi-Group Analysis (MGA). Finally, we conducted a regression analysis to investigate how differently people from different demographics experience different types of interpersonal challenges. Our findings confirm the negative association between interpersonal challenges and the feeling of welcomeness in OSS, with this relationship being more pronounced among gender minorities and people with disabilities. We found that different challenges have unique impacts on how people feel welcomed, with variations across gender, race, and disability groups. We also provide evidence that people from gender minorities and with disabilities are more likely to experience interpersonal challenges than their counterparts, especially when we analyze stalking, sexual harassment, and gender. Our insights benefit OSS communities, informing potential strategies to improve the landscape of interpersonal relationships, ultimately fostering more inclusive and welcoming communities.

Tags: "Human/Social", "Open Source"

Feng Lin, Dong Jae Kim, Tse-Hsun (Peter) Chen, "SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents"

Abstract: Software process models are essential to facilitate collaboration and communication among software teams to solve complex development tasks. Inspired by these software engineering practices, we present FlowGen – a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents. We emulate three process models, FlowGen$_{Waterfall}$, FlowGen$_{TDD}$, and FlowGen$_{Scrum}$, by assigning LLM agents to embody roles (i.e., requirement engineer, architect, developer, tester, and scrum master) that correspond to everyday development activities and organize their communication patterns. The agents work collaboratively using chain-of-thought and prompt composition with continuous self-refinement to improve the code quality. We use GPT-3.5 as our underlying LLM and several baselines (RawGPT, CodeT, Reflexion) to evaluate code generation on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Our findings show that FlowGen$_{Scrum}$ excels compared to other process models, achieving a Pass@1 of 75.2, 65.5, 82.5, and 56.7 in HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively (an average of 15% improvement over RawGPT). Compared with other state-of-the-art techniques, FlowGen$_{Scrum}$ achieves a higher Pass@1 in MBPP compared to CodeT, with both outperforming Reflexion. Notably, integrating CodeT into FlowGen$_{Scrum}$ resulted in statistically significant improvements, achieving the highest Pass@1 scores. Our analysis also reveals that the development activities impacted code smell and exception handling differently, with design and code review adding more exception handling and reducing code smells. Finally, FlowGen models maintain stable Pass@1 scores across GPT-3.5 versions and temperature values, highlighting the effectiveness of software process models in enhancing the quality and stability of LLM-generated code.

Tags: "AI for SE", "SE for AI", "MSR"

Linfeng Liang, Yao Deng, Kye Morton, Valtteri Kallinen, Alice James, Avishkar Seth, Endrowednes Kuantama, Subhas Mukhopadhyay, Richard Han, Xi Zheng, "GARL: Genetic Algorithm-Augmented Reinforcement Learning to Detect Violations in Marker-Based Autonomous Landing Systems"

Abstract: Automated Uncrewed Aerial Vehicle (UAV) landing is crucial for autonomous UAV services such as monitoring, surveying, and package delivery. It involves detecting landing targets, perceiving obstacles, planning collision-free paths, and controlling UAV movements for safe landing. Failures can lead to significant losses, necessitating rigorous simulation-based testing for safety. Traditional offline testing methods, limited to static environments and predefined trajectories, may miss violation cases caused by dynamic objects like people and animals. Conversely, online testing methods require extensive training time, which is impractical with limited budgets. To address these issues, we introduce GARL, a framework combining a genetic algorithm (GA) and reinforcement learning (RL) for efficient generation of diverse and real landing system failures within a practical budget. GARL employs GA for exploring various environment setups offline, reducing the complexity of RL's online testing in simulating challenging landing scenarios. Our approach outperforms existing methods by up to 18.35% in violation rate and 58% in diversity metric. We validate most discovered violation types with real-world UAV tests, pioneering the integration of offline and online testing strategies for autonomous systems. This method opens new research directions for online testing, with our code available at https://anonymous.4open.science/r/drone_testing-5CF0/.

Tags: "Testing and Quality", "Autonomy", "AI for SE"

Siyuan Li, Yuekang Li, Zuxin Chen, Chaopeng Dong, Yongpan Wang, Hong Li, Yongle Chen, Hongsong Zhu, "TransferFuzz: Fuzzing with Historical Trace for Verifying Propagated Vulnerability Code"

Abstract: Code reuse in software development frequently facilitates the spread of vulnerabilities, making the scope of affected software in CVE reports imprecise. Traditional methods primarily focus on identifying reused vulnerability code within target software, yet they cannot verify if these vulnerabilities can be triggered in new software contexts. This limitation often results in false positives. In this paper, we introduce TransferFuzz, a novel vulnerability verification framework, to verify whether vulnerabilities propagated through code reuse can be triggered in new software. Innovatively, we collected runtime information during the execution or fuzzing of the basic binary (the vulnerable binary detailed in CVE reports). This process allowed us to extract historical traces, which proved instrumental in guiding the fuzzing process for the target binary (the new binary that reused the vulnerable function). TransferFuzz introduces a unique Key Bytes Guided Mutation strategy and a Nested Simulated Annealing algorithm, which transfers these historical traces to implement trace-guided fuzzing on the target binary, facilitating the accurate and efficient verification of the propagated vulnerability. Our evaluation, conducted on widely recognized datasets, shows that TransferFuzz can quickly validate vulnerabilities previously unverifiable with existing techniques. Its verification speed is 2.5 to 26.2 times faster than existing methods. Moreover, TransferFuzz has proven its effectiveness by expanding the impacted software scope for 15 vulnerabilities listed in CVE reports, increasing the number of affected binaries from 15 to 53. The datasets and source code used in this article are available at https://anonymous.4open.science/r/TransferFuzz-E9B3.

Tags: "Testing and Quality", "Security"

Chong Wang, Jianan Liu, Xin Peng, Yang Liu, Yiling Lou, "Boosting Static Resource Leak Detection via LLM-based Resource-Oriented Intention Inference"

Abstract: Resource leaks, caused by resources not being released after acquisition, often lead to performance issues and system crashes. Existing static detection techniques rely on mechanical matching of predefined resource acquisition/release APIs and null-checking conditions to find unreleased resources, suffering from both (1) false negatives caused by the incompleteness of predefined resource acquisition/release APIs and (2) false positives caused by the incompleteness of resource reachability validation identification. To overcome these challenges, we propose InferROI, a novel approach that leverages the exceptional code comprehension capability of large language models (LLMs) to directly infer resource-oriented intentions (acquisition, release, and reachability validation) in code. InferROI first prompts the LLM to infer involved intentions for a given code snippet, and then incorporates a two-stage static analysis approach to check control-flow paths for resource leak detection based on the inferred intentions. We evaluate the effectiveness of InferROI in both resource-oriented intention inference and resource leak detection. Experimental results on the DroidLeaks and JLeaks datasets demonstrate InferROI achieves promising bug detection rate (59.3% and 62.5%) and false alarm rate (18.6% and 19.5%). Compared to three industrial static detectors, InferROI detects 14~45 and 149~485 more bugs in DroidLeaks and JLeaks, respectively. When applied to real-world open-source projects, InferROI identifies 29 unknown resource leak bugs (verified by authors), with 7 of them being confirmed by developers. In addition, the results of an ablation study underscores the importance of combining LLM-based inference with static analysis. Finally, manual annotation indicated that InferROI achieved a precision of 74.6% and a recall of 81.8% in intention inference, covering more than 60% resource types involved in the datasets.

Tags: "Security", "AI for SE"

Wonhoi Kim, Hocheol Nam, Muoi Tran, Amin Jalilov, Zhenkai Liang, Sang Kil Cha, Min Suk Kang, "Fork State-Aware Differential Fuzzing for Blockchain Consensus Implementations"

Abstract: Blockchain networks allow multiple client implementations of the same consensus algorithm by different developers to coexist in the same system. Ensuring correct implementations among these heterogeneous clients is crucial, as even slight semantic discrepancies in their implementations can lead to safety failures. While existing fuzzing frameworks have discovered implementation flaws in blockchain, they suffer from several challenges in testing them with sequences of conflicting blocks, called forks. Existing tools fail to adequately assess the fork-handling processes in blockchain implementations when relying on traditional code coverage feedback, which lacks the granularity needed to navigate the diverse and complex fork-handling scenarios. This paper introduces Forky, a fork state-aware differential fuzzing framework designed to detect implementation discrepancies within the critical fork-handling process with its novel fork-aware mutation and fork-diversifying feedback mechanisms. We test Forky on the two most influential blockchain projects: Bitcoin and Ethereum, which are the representatives of the two major blockchain consensus algorithm families, Proofof-Work (PoW) and Proof-of-Stake (PoS) consensus algorithms.

Tags: "Testing and Quality", "Blockchain"

Yueke Zhang, Anda Liang, Xiaohan Wang, Pamela J. Wisniewski, Fengwei Zhang, Kevin Leach, Yu Huang, "Who’s Pushing the Code: An Exploration of GitHub Impersonation"

Abstract: GitHub is one of the largest open-source software (OSS) communities for software development and collaboration. Impersonation in the OSS communities refers to the malicious act of assuming another user's identity, often aiming to gain unauthorized access to code, manipulate project outcomes, or spread misinformation. With several recent real-world attacks resulting from impersonation, this issue is becoming and increasingly problematic concern within the OSS community. We present the first exploration of the impact of impersonation in GitHub. Specifically, we conduct structured interviews with 17 real-world OSS contributors about their perception of impersonation and corresponding mitigations. Our study reveals that, in general, GitHub users lack awareness of impersonation and underestimate the severity of its implications. After witnessing the impersonation, they show significant concern for the OSS community. Meanwhile, we also demonstrate that the current best practices (i.e., commit signing) that might mitigate impersonation must be improved to increase widespread acceptance and adoption. We also present and discuss participant perceptions of potential ways to mitigate GitHub impersonation. We collect a dataset comprising 12.5 million commits to investigate the current status of impersonation. Interestingly, we also find out that impersonation is not currently detected. We observe that existing commit histories treat impersonation behavior identically to pull request events, resulting in a lack of detection methods for impersonation.

Tags: "Human/Social", "Open Source"

Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Federica Sarro, Yang Liu, "Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning Software"

Abstract: Intersectional fairness is a critical requirement for Machine Learning (ML) software, demanding fairness across subgroups defined by multiple protected attributes. This paper introduces FairHOME, a novel ensemble approach using higher order mutation of inputs to enhance intersectional fairness of ML software during the inference phase. Inspired by social science theories highlighting the benefits of diversity, FairHOME generates mutants representing diverse subgroups for each input instance, thus broadening the array of perspectives to foster a fairer decision-making process. Unlike conventional ensemble methods that combine predictions made by different models, FairHOME combines predictions for the original input and its mutants, all generated by the same ML model, to reach a final decision. Notably, FairHOME is even applicable to deployed ML software as it bypasses the need for training new models. We extensively evaluate FairHOME against six state-of-the-art fairness improvement methods across 24 decision-making tasks using widely adopted metrics. FairHOME consistently outperforms existing methods across all metrics considered. On average, it enhances intersectional fairness by 47.3%, surpassing the currently best-performing method by 10.1 percentage points.

Tags: "SE for AI", "Security"

Junjielong Xu, Ying Fu, Shin Hwei Tan, Pinjia He, "Aligning the Objective of LLM-based Program Repair"

Abstract: Large language models (LLMs) have achieved decent results on automated program repair (APR). However, the next token prediction training objective of decoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction objective of current infilling-style methods, which impedes LLMs from fully leveraging pre-trained knowledge for program repair. In addition, while some LLMs can locate and repair bugs in certain functions using the related artifacts (e.g., test cases), existing methods still depend on statement-level fault localization methods to provide a list of buggy hunks for repair. This restriction hinders LLMs from exploring potential patches beyond the given locations. In this paper, we investigate a new approach to adapt LLMs to program repair. Our core insight is that LLM’s APR capability can be greatly improved by simply aligning the output to their training objective and allowing them to refine the whole program without first identifying faulty statements. Based on this insight, we designed D4C, a straightforward prompting framework for APR. D4C can repair 180 bugs correctly in Defects4J, with each patch being sampled only 10 times. This surpasses the SOTA APR methods with perfect fault localization by 10% and reduces the patch sampling number by 90%. Our findings reveal that (1) objective alignment is crucial for fully exploiting LLM’s pre-trained capability, and (2) replacing the traditional localize-buggy-hunks-then-repair workflow with direct debugging is more effective for LLM-based APR methods. Thus, we believe this paper introduces a new mindset for harnessing LLMs in APR.

Tags: "AI for SE", "Analysis/Repair"

Aidan Z.H. Yang, Sophia Kolak, Vincent Hellendoorn, Ruben Martins, Claire Le Goues, "Revisiting Unnaturalness for Automated Program Repair in the Era of Large Language Models"

Abstract: Language models have improved by orders of magnitude with the recent emergence of Transformer-based Large Language Models (LLMs). LLMs have demonstrated their ability to generate "natural" code that is highly similar to code written by professional developers. One intermediate value an LLM can emit is entropy, which measures the naturalness of a token of code. We hypothesize that entropy can be used to improve the performance of Automated Program Repair (APR) tasks. While much progress has been made in Automated Program Repair (APR), fault localization techniques suffer from a lack of diversity in ranking scores, patch generation tools tend to be inefficient as all tests need to run before determining if a patch is likely to be correct, and patch ranking often suffers from the test-suite over-fitting problem. However, using an LLM directly for APR introduces concerns for training data leakage. In this work, we introduce a novel way of using the entropy of LLMs in combination with prior APR tools to improve all stages of APR. By using only the prefix and suffix context of a line or block of code to describe naturalness, we can use LLMs to localize faults and rank patches all while eliminating the dependency for test-suites. We show that entropy is highly complementary with prior fault localization tools. Our proposed method achieves a 108% top-1 score improvement over SBFL. When using entropy for patch ranking and classification, our proposed method can rank correct patches more effectively than state-of-the-art machine learning tools with an 49% improvement in top-1. Our work suggests that LLMs can be an effective addition to compliment prior APR tasks while minimizing both the test-suite over-fitting problem and the LLM data leakage problem.

Tags: "Analysis/Repair", "AI for SE"

Boqi Chen, José Antonio Hernández López, Gunter Mussbacher, Dániel Varró, "The Power of Types: Exploring the Impact of Type Checking on Neural Bug Detection in Dynamically Typed Languages"

Abstract: [Motivation] Automated bug detection in dynamically typed languages such as Python is essential for maintaining code quality. The lack of mandatory type annotations in such languages can lead to errors that are challenging to identify early with traditional static analysis tools. Recent progress in deep neural networks has led to increased use of neural bug detectors. In statically typed languages, a type checker is integrated into the compiler and thus taken into consideration when the neural bug detector is designed for these languages. [Problem] However, prior studies overlook this aspect during the training and testing of neural bug detectors for dynamically typed languages. When an optional type checker is used, assessing existing neural bug detectors on bugs easily detectable by type checkers may impact their performance estimation. Moreover, including these bugs in the training set of neural bug detectors can shift their detection focus toward the wrong type of bugs. [Contribution] We explore the impact of type checking on various neural bug detectors for variable misuse bugs, a common type targeted by neural bug detectors. Existing synthetic and real-world datasets are type-checked to evaluate the prevalence of type-related bugs. Then, we investigate how type-related bugs influence the training and testing of the neural bug detectors. [Findings] Our findings indicate that existing bug detection datasets contain a significant proportion of type-related bugs. Building on this insight, we discover integrating the neural bug detector with a type checker can be beneficial, especially when the code is annotated with types. Further investigation reveals neural bug detectors perform better on type-related bugs than other bugs. Moreover, removing type-related bugs from the training data helps improve neural bug detectors’ ability to identify bugs beyond the scope of type checkers.

Tags: "AI for SE", "Testing and Quality"

Verya Monjezi, Ashutosh Trivedi, Vladik Kreinovich, Saeid Tizpaz-Niari, "Fairness Testing through Extreme Value Theory"

Abstract: Data-driven software is increasingly being used as a critical component of automated decision-support systems. Since this class of software learns its logic from historical data, it can encode or amplify discriminatory practices. Previous research on algorithmic fairness has focused on improving “average-case” fairness. On the other hand, fairness at the extreme ends of the spectrum, which often signifies lasting and impactful shifts in societal attitudes, has received significantly less emphasis. Leveraging the statistics of extreme value theory (EVT), we propose a novel fairness criterion called extreme counterfactual discrimination (ECD). This criterion estimates the worst-case amounts of disadvantage in outcomes for individuals solely based on their memberships in a protected group. Utilizing tools from search-based software engineering and generative AI, we present a randomized algorithm that samples a statistically significant set of points from the tail of ML outcome distributions even if the input dataset lacks a sufficient number of relevant samples. We conducted several experiments on four ML models (deep neural networks, logistic regression, and random forests) over 10 socially relevant tasks from the literature on algorithmic fairness. First, we evaluate the generative AI methods and find that they generate sufficient samples to infer valid EVT distribution in 95% of cases. Remarkably, we found that the prevalent bias mitigators reduce the average-case discrimination but increase the worst-case discrimination significantly in 35% of cases. We also observed that even the tail-aware mitigation algorithm MiniMax-Fairness—increased the worst-case discrimination in 30% of cases. We propose a novel ECD-based mitigator that improves fairness in the tail in 90% of cases with no degradation of the average-case discrimination. We hope that the EVT framework serves as a robust tool for evaluating fairness in both average-case and worst-case discrimination.

Tags: "Security", "SE for AI"

Jiageng Li, Zhen Dong, Chong Wang, Haozhen You, Cen Zhang, Yang Liu, Xin Peng, "LLM Based Input Space Partitioning Testing for Library APIs"

Abstract: Automated library APIs testing is difficult as it requires exploring a vast space of parameter inputs that may involve objects with complex data types. Existing search based approaches, with limited knowledge of relations between object states and program branches, often suffer from the low efficiency issue, i.e., tending to generate invalid inputs. Symbolic execution based approaches can effectively identify such relations, but fail to scale to large programs. In this work, we present an LLM-based input space partitioning testing approach, LISP, for library APIs. The approach leverages LLMs to understand the code of a library API under test and perform input space partitioning based on its understanding and rich common knowledge. Specifically, we provide the signature and code of the API under test to LLMs, with the expectation of obtaining a text description of each input space partition of the API under test. Then, the generated text description is employed to guide the input generation process for each partition, ultimately resulting in test suites that systematically explore the program behavior of the API. We evaluate LISP on 10 popular open-source Java libraries (e.g., apache/commons-lang with 2.6k stars, guava with 48.8k stars on GitHub). Our experiment results show that LISP is effective in library API testing. It significantly outperforms state-of-the-art tool EvoSuite in terms of branch coverage. On average, LISP achieves 67.82% branch coverage, surpassing EvoSuite by 1.21 times. In total, LISP triggers 404 exceptions or errors in the experiments, and discovers 13 previously unknown vulnerabilities during evaluation, which have been assigned CVE IDs.

Tags: "Testing and Quality", "AI for SE"

Qiaolin Qin, Heng Li, Ettore Merlo, Maxime Lamothe, "Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection"

Abstract: With the advent of data-centric and machine learning (ML) systems, data quality is playing an increasingly critical role for ensuring the overall quality of software systems. Alas, data preparation, an essential step towards high data quality, is known to be a highly effort-intensive process. Although prior studies have dealt with one of the most impacting issues, data pattern violations, we observe that these studies usually require data-specific configurations (i.e., parameterized) or a certain set of fully curated data as learning examples (i.e., supervised). Both approaches require domain knowledge and depend on users' deep understanding of their data, and are often effort-intensive. In this paper, we introduce RIOLU: Regex Inferencer autO-parameterized Learning with Uncleaned data. RIOLU is fully automated, is automatically parameterized, and does not need labeled samples. We observe that RIOLU can generate precise patterns from datasets in various domains, with a high F1 score of 97.2%, exceeding the state-of-the-art baseline. In addition, according to our experiment on five datasets with anomalies, RIOLU can automatically estimate a data column's error rate, draw normal patterns, and predict anomalies from unlabeled data with higher performance (up to 800.4% improvement in terms of F1) than the state-of-the-art baseline. Furthermore, RIOLU can even outperform ChatGPT in terms of both accuracy (12.3% higher F1) and efficiency (10% less inference time). With user involvement, a variation (a guided version) of RIOLU can further boost its precision (up to 37.4% improvement in terms of F1). Our evaluation in an industrial setting further demonstrates the practical benefits of RIOLU.

Tags: "MSR", "Security", "SE for AI"

Bimpe Ayoola, Miikka Kuutilla, Rina Wehbe, Paul Ralph, "User Personas Improve Social Sustainability by Encouraging Software Developers to Deprioritize Antisocial Features"

Abstract: \textit{Background}: Sustainable software development involves creating software in a manner that meets present goals without undermining our ability to meet future goals. In a software engineering context, sustainability has at least four dimensions: ecological, economic, social, and technical. No interventions for improving social sustainability in software engineering have been tested in rigorous lab-based experiments, and little evidence-based guidance is available. \textit{Objective}: The purpose of this study is to evaluate the effectiveness of two interventions---stakeholder maps and persona models---for improving social sustainability by improving software feature prioritization. \textit{Method}: We conducted a randomized controlled factorial experiment with 79 undergraduate computer science students. Participants were randomly assigned to one of four groups and asked to prioritize a backlog of prosocial, neutral, and antisocial user stories for a shopping mall's digital screen display and facial recognition software. Participants received either persona models, a stakeholder map, both, or neither. We compared the differences in prioritization levels assigned to prosocial and antisocial user stories using Cumulative Link Mixed Model regression. \textit{Results}: Participants who received persona models gave significantly lower priorities to anti-social user stories but no significant difference was evident for pro-social user stories. The effects of the stakeholder map were not significant. The interaction effects were not significant. \textit{Conclusion}: Providing aspiring software professionals with well-crafted persona models causes them to de-prioritize anti-social software features. The impact of persona modelling on sustainable software development therefore warrants further study with more experience professionals. Moreover, the novel methodological strategy of assessing social sustainability behavior through backlog prioritization appears feasible in lab-based settings.

Tags: "Human/Social", "Requirements"

Daniel Erhabor, Sreeharsha Udayashankar, Meiyappan Nagappan, Samer Al-Kiswany, "Measuring the Runtime Performance of C++ Code Written by Humans using GitHub Copilot"

Abstract: GitHub Copilot is an artificially intelligent program- ming assistant used by many developers. While a few studies have evaluated the security risks of using Copilot, there has not been any study to show if it aids developers in producing code with better runtime performance. We evaluate the runtime performance of C++ code produced when developers use GitHub Copilot versus when they do not. To this end, we conducted a user study with 32 participants where each participant solved two C++ programming problems, one with Copilot and the other without it and measured the runtime performance of the participants’ solutions on our test data. Our results suggest that using Copilot may produce C++ code with a statistically significantly slower runtime performance.

Tags: "AI for SE", "Human/Social"

Yulong Ye, Tao Chen, Miqing Li, "Distilled Lifelong Self-Adaptation for Configurable Systems"

Abstract: Modern configurable systems provide tremendous opportunities for engineering future intelligent software systems. A key difficulty thereof is how to effectively self-adapt the configuration of a running system such that its performance (e.g., runtime and throughput) can be optimized under time-varying workloads. This unfortunately remains unaddressed in existing approaches as they either overlook the available past knowledge or rely on static exploitation of past knowledge without reasoning the usefulness of information when planning for self-adaptation. In this paper, we tackle this challenging problem by proposing DLiSA, a framework that self-adapts configurable systems. DLiSA comes with two properties: firstly, it supports lifelong planning, and thereby the planning process runs continuously throughout the lifetime of the system, allowing dynamic exploitation of the accumulated knowledge for rapid adaptation. Secondly, the planning for a newly emerged workload is boosted via distilled knowledge seeding, in which the knowledge is dynamically purified such that only useful past configurations are seeded when necessary, mitigating misleading information. Extensive experiments suggest that the proposed DLiSA significantly outperforms state-of-the-art approaches, demonstrating a performance improvement of up to 255\% and a resource acceleration of up to 2.22$\times$ on generating promising adaptation configurations. All data and sources can be found at our anonymous site: https://github.com/Anonymous-DLiSA/DLiSA.

Tags: "Design/Architecture", "AI for SE"

Zachary Karas, Benjamin Gold, Violet Zhou, Noah Reardon, Thad Polk, Catie Chang, Yu Huang, "Studying Programmers Without Programming: Investigating Expertise Using Resting State fMRI"

Abstract: Expert programmers are more effective at coding activities, but the reasons for this remain elusive. Accordingly, recent research has used neuroimaging such as fMRI to analyze how expert programmers might think as they perform coding activities. Those experiments have all involved specific programming tasks (i.e., comprehension), but have been unable to detect systematic differences based on coding experience. By using tasks, however, those studies may limit the number and type of brain networks involved. In Cognitive Neuroscience, researchers commonly analyze resting-state data, in which participants’ brain activity is recorded as they lay idle in the scanner. The brain’s functional organization is plastic, and can change with experience. These changes can be measured at rest, making this a suitable data type for studying how programming activities affect neural organization over time. In this paper, we analyzed the resting state scans from 150 participants, 96 of whom were programmers. We found increased connectivity in programmers between brain regions involved in language, math, and the temporal attention. Non-programmers demonstrated more connectivity with regions involved in social and emotional cognition. We found that as years of programming experience increases, connectivity decreases between two regions associated with visual processing during reading and articulation, respectively.

Tags: "Prog Comprehension/Reeng/Maint", "Human/Social"

Setu Kumar Basak, K. Virgil English, Ken Ogura, Vitesh Kambara, Bradley Reaves, Laurie Williams, "AssetHarvester: A Static Analysis Tool for Detecting Secret-Asset Pairs in Software Artifacts"

Abstract: GitGuardian monitored secrets exposure in public GitHub repositories and reported that developers leaked over 12 million secrets (database and other credentials) in 2023, indicating a 113\% surge from 2021. Despite the availability of secret detection tools, developers ignore the tools' reported warnings because of false positives (25\%-99\%). However, each secret protects assets of different values accessible through asset identifiers (a DNS name and a public or private IP address). The asset information for a secret can aid developers in filtering false positives and prioritizing secret removal from the source code. However, existing secret detection tools do not provide the asset information, thus presenting difficulty to developers in filtering secrets only by looking at the secret value or finding the assets manually for each reported secret. \textit{The goal of our study is to aid software practitioners in prioritizing secrets removal by providing the assets information protected by the secrets through our novel static analysis tool.} We present AssetHarvester, a static analysis tool to detect secret-asset pairs in a repository. Since the location of the asset can be distant from where the secret is defined, we investigated secret-asset co-location patterns and found four patterns. To identify the secret-asset pairs of the four patterns, we utilized three approaches (pattern matching, data flow analysis, and fast-approximation heuristics). We curated a benchmark of 1,791 secret-asset pairs of four database types extracted from 188 public GitHub repositories to evaluate the performance of AssetHarvester. AssetHarvester demonstrates precision of (97\%), recall (90\%), and F1-score (94\%) in detecting secret-asset pairs. Our findings indicate that data flow analysis employed in AssetHarvester detects secret-asset pairs with 0\% false positives and aids in improving the recall of secret detection tools. Additionally, AssetHarvester shows 43\% increase in precision for database secret detection compared to existing detection tools through the detection of assets, thus reducing developer's alert fatigue.

Tags: "Analysis", "Security"

Yining She, Sumon Biswas, Christian Kästner, Eunsuk Kang, "FairSense: Long-Term Fairness Analysis of ML-Enabled Systems"

Abstract: Algorithmic fairness of machine learning (ML) models has raised significant concern in the recent years. Many testing, verification, and bias mitigation techniques have been proposed to identify and reduce fairness issues in ML models. The existing methods are *model-centric* and designed to detect fairness issues under *static settings*. However, many ML-enabled systems operate in a dynamic environment where the predictive decisions made by the system *impact* the environment, which in turn affects future decision-making. Such a self-reinforcing *feedback loop* can cause fairness violations in the long term, even if the immediate outcomes are fair. In this paper, we propose a simulation-based framework called FairSense to detect and analyze long-term unfairness in ML-enabled systems. In particular, the framework targets systems with an ML model that is trained over tabular data using supervised learning. Given a fairness requirement, FairSense performs *Monte-Carlo simulation* to enumerate evolution traces for each system configuration. Then, FairSense performs *sensitivity analysis* on the space of system parameters to understand the impact of configuration decisions on long-term fairness of the system. We demonstrate FairSense's potential utility through three real-world case studies: Loan lending, opioids risk scoring, and predictive policing.

Tags: "SE for AI", "Security", "Requirements"

Islem BOUZENIA, Premkumar Devanbu, Michael Pradel, "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair"

Abstract: Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent’s effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI’s GPT-3.5 model, translates to 14 cents per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.

Tags: "Analysis/Repair", "AI for SE"

Sarah Fakhoury, Markus Kuppe, Shuvendu Lahiri, Tahina Ramananandro, Nikhil Swamy, "3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers"

Abstract: Improper parsing of attacker-controlled input is a leading source of software security vulnerabilities, especially when programmers transcribe informal format descriptions into efficient parsing logic in low-level, memory unsafe languages. Several researchers have proposed formal specification languages for data formats from which efficient code can be extracted. However, distilling informal requirements into formal specifications is challenging and, despite their benefits, new, formal languages are hard for people to learn and use. In this work, we present 3DGen, a framework that makes use of AI agents to transform mixed informal input, including natural language documents and example inputs into format specifications in a language called 3D. To support humans in understanding and trusting the generated specifications, 3DGen uses symbolic methods to also synthesize test inputs that can be validated against an external oracle. Symbolic test generation also helps in distinguishing multiple plausible solutions. Through a process of repeated refinement, 3DGen produces a 3D specification that conforms to a test suite, and which yields safe, efficient, provably correct, parsing code in C. We have evaluated 3DGen on 20 Internet standard formats, demonstrating the potential for AI-agents to produce formally verified C code at a non-trivial scale. A key enabler is the use of a domain-specific language to limit AI outputs to a class for which automated, symbolic analysis is tractable.

Tags: "AI for SE", "Analysis/Repair"

Paschal Amusuo, Kyle A. Robinson, Tanmay Singla, Huiyun Peng, Aravind Machiry, Santiago Torres-Arias, Laurent Simon, James C Davis, "$ZTD_{JAVA}$: Mitigating Software Supply Chain Vulnerabilities via Zero-Trust Dependencies"

Abstract: Third-party libraries like Log4j accelerate software application development but introduce substantial risk. Vulnerabilities in these libraries have led to Software Supply Chain (SSC) attacks that compromised resources within the host system. These attacks benefit from current application permissions approaches: third-party libraries are implicitly trusted in the application runtime. An application runtime designed with Zero-Trust Architecture (ZTA) principles — secure access to resources, continuous monitoring, and least-privilege enforcement — could mitigate SSC attacks, as it would give zero implicit trust to these libraries. However, no individual security defense incorporates these principles at a low runtime cost. This paper proposes Zero-Trust Dependencies (ZTD) to mitigate SSC vulnerabilities: we apply the NIST ZTA to software applications. First, we assess the expected effectiveness and configuration cost of Zero-Trust Dependencies using a study of third-party software libraries and their vulnerabilities. Then, we present a system design, $ZTD_{sys}$, that enables the application of Zero-Trust Dependencies to software applications and a prototype, $ZTD_{JAVA}$, for Java applications. Finally, with evaluations on recreated vulnerabilities and realistic applications, we show that $ZTD_{JAVA}$ can defend against prevalent vulnerability classes, introduces negligible cost, and is easy to configure and use.

Tags: "Security", "Analysis"

Salma Begum Tamanna, Gias Uddin, Song Wang, Lan Xia, Longyu Zhang, "ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?"

Abstract: Hallucinations, the tendency to produce irrelevant/incorrect responses, are prevalent concerns in generative AI-based tools like ChatGPT. Although hallucinations in ChatGPT are studied for textual responses, it is unknown how ChatGPT hallucinates for technical texts that contain both textual and technical terms. We surveyed 47 software engineers and produced a benchmark of 412 Q&A pairs from the bug reports of two OSS projects. We find that a RAG-based ChatGPT (i.e., ChatGPT tuned with the benchmark issue reports) is 36.4% correct when producing answers to the questions, due to two reasons 1) limitations to understand complex technical contents in code snippets like stack traces, and 2) limitations to integrate contexts denoted in the technical terms and texts. We present CHIME (ChatGPT Inaccuracy Mitigation Engine) whose underlying principle is that if we can preprocess the technical reports better and guide the query validation process in ChatGPT, we can address the observed limitations. CHIME uses context-free grammar (CFG) to parse stack traces in technical reports. CHIME then verifies and fixes ChatGPT responses by applying metamorphic testing and query transformation. In our benchmark, CHIME shows 30.3% more correction over ChatGPT responses. In a user study, we find that the improved responses with CHIME are considered more useful than those generated from ChatGPT without CHIME

Tags: "SE for AI", "Human/Social"

Xin Yin, Chao Ni, Xiaodan Xu, Xiaohu Yang, "What You See Is What You Get: Attention-based Self-guided Automatic Unit Test Generation"

Abstract: Software defects heavily affect software's functionalities and may cause huge losses. Recently, many AI-based approaches have been proposed to detect defects, which can be divided into two categories: software defect prediction and automatic unit test generation. While these approaches have made great progress in software defect detection, they still have several limitations in practical application, including the low confidence of prediction models and the inefficiency of unit testing models. To address these limitations, we propose a WYSIWYG (i.e., What You See Is What You Get) approach: \textbf{A}ttention-based Self-guided Automatic \textbf{U}nit Test \textbf{G}en\textbf{ER}ation (AUGER), which contains two stages: defect detection and error triggering. In the former stage, \toolname first detects the proneness of defects. Then, in the latter stage, it guides to generate unit tests for triggering such an error with the help of critical information obtained by the former stage. To evaluate the effectiveness of \toolname, we conduct a large-scale experiment by comparing with the state-of-the-art (SOTA) approaches on the widely used datasets (i.e., Bears, Bugs.jar, and Defects4J). AUGER makes great improvements by 4.7\% to 35.3\% and 17.7\% to 40.4\% in terms of F1-score and Precision in defect detection, and can trigger 23 to 84 more errors than SOTAs in unit test generation. Besides, we also conduct a further study to verify the generalization in practical usage by collecting a new dataset from real-world projects.

Tags: "AI for SE", "Testing and Quality"

Yizhou Chen, Zeyu Sun, Guoqing Wang, Dan Hao, "Gpass: a Goal-adaptive Neural Theorem Prover based on Coq for Automated Formal Verification"

Abstract: Formal verification is a crucial means to assure software quality. Regrettably, the manual composition of verification scripts proves to be both laborious and time-consuming. In response, researchers have put forth automated theorem prover approaches; however, these approaches still grapple with several limitations. These limitations encompass insufficient handling of lengthy proof steps, difficulty in aligning the various components of a Coq program with the requirements and constraints of the proof goal, and inefficiencies. To surmount these limitations, we present Gpass, a goal-adaptive neural theorem prover based on deep learning technology. Firstly, we design a unique sequence encoder for Gpass that completely scans previous proof tactics through multiple sliding windows and provides information related to the current proof step. Secondly, Gpass incorporates a goal-adaptive feature integration module to align the reasoning process with the requirements of the proof goal. Finally, we devise a parameter selection method based on loss values and loss slopes to procure parameter sets with diverse distributions, thereby facilitating the exploration of various proof tactics. Experimental results demonstrate that Gpass attains better performance on the extensive CoqGym benchmark and proves 11.03\%-96.37\% more theorems than the prior work most closely related to ours. We find that the orthogonality between Gpass and CoqHammer proves their complementary capabilities, and together they prove a total of 3,774 theorems, which is state-of-the-art performance. In addition, we propose an efficiency optimisation approach that allows Gpass to achieve performance beyond Diva at one-sixth of the parameter sets.

Tags: "Formal methods", "AI for SE"

Yanchen Lu, Hongyu Lin, Zehua He, Haitao Xu, Zhao Li, Shuai Hao, Liu Wang, Haoyu Wang, Kui Ren, "TacDroid: Detection of Illicit Apps through Hybrid Analysis of UI-based Transition Graphs"

Abstract: Illicit apps have emerged as a thriving underground industry, driven by their substantial profitability. These apps either offer users restricted services (e.g., porn and gambling) or engage in fraudulent activities like scams. Despite the widespread presence of illicit apps, scant attention has been directed towards this issue, with several existing detection methods predominantly relying on static analysis alone. However, given the burgeoning trend wherein an increasing number of mobile apps achieve their core functionality through dynamic resource loading, depending solely on static analysis proves inadequate. To address this challenge, in this paper, we introduce TacDroid, a novel approach that integrates dynamic analysis for dynamic content retrieval with static analysis to mitigate the limitations inherent in both methods, i.e., the low coverage of dynamic analysis and the low accuracy of static analysis. Specifically, TacDroid conducts both dynamic and static analyses on an Android app to construct dynamic and static User Interface Transition Graphs (UTGs), respectively. These two UTGs are then correlated to create an intermediate UTG. Subsequently, TacDroid embeds graph structure and utilizes an enhanced Graph Autoencoder (GAE) model to predict transitions between nodes. Through link prediction, TacDroid effectively eliminates false positive transition edges stemming from misjudgments in static analysis and supplements false negative transition edges overlooked in the intermediate UTG, thereby generating a comprehensive and accurate UTG. Finally, TacDroid determines the legitimacy of an app and identifies its category based on the app's UTG. Our evaluation results highlight the outstanding accuracy of TacDroid in detecting illicit apps. It significantly surpasses the state-of-the-art work, achieving an F1-score of 96.73%. This work represents a notable advancement in the identification and categorization of illicit apps.

Tags: "Mobile SW", "Analysis"

Ravishka Rathnasuriya, Zijie Zhao, Wei Yang, "CodeImprove: Program Adaptation for Deep Code Models"

Abstract: Leveraging deep learning (DL)-based code analysis tools to solve software engineering tasks is becoming increasingly popular. Code models often suffer performance degradation due to various reasons (e.g., code data shifts). Retraining is often required to address these issues, but frequent model updates are costly in labeling and deployment. In this paper, we explore an alternative solution: Adapting the program inputs to the code models. This can be achieved by two steps: 1) input validation that focuses on identifying whether an input is an out-of-scope input program that are beyond a model’s handling capability, and 2) input adaptation that adapts out-of-scope inputs to become in-scope inputs. Validating program input is challenging, as current techniques focus on continuous inputs such as image data and fail with discrete inputs like code data, which have unique characteristics and are processed differently by deep learning models. Adapting out-of-scope programs is also challenging due to their vast search spaces. Therefore, in this paper, we propose CodeImprove, which distinguishes out-of-scope from normal inputs and converts such out-of-scope inputs back to in-scope inputs through program transformation. In particular, we propose a validity score metric to identify out-of-scope inputs and leverage genetics algorithms to apply semantic preserving program transformation to convert out-of-scope inputs to in-scope inputs. Our experimental results show CodeImprove can enhance upto 8.78% of accuracy, and 51.28% of relative improvements in three code models on two SE tasks. Additionally, our input validation is promising in detecting outof-scope inputs (AUC score of 0.924).

Tags: "SE for AI"

Yanfu Yan, Viet Duong, Huajie Shao, Denys Poshyvanyk, "Towards More Trustworthy Deep Code Models by Enabling Out-of-Distribution Detection"

Abstract: Numerous machine learning (ML) models have been developed, including those for software engineering (SE) tasks, under the assumption that training and testing data come from the same distribution. However, train and test distributions often differ, as training datasets rarely encompass the entire distribution, while test distribution tends to shift over time. Hence, when confronted with out-of-distribution (OOD) instances that differ from the training data, a reliable and trustworthy SE ML model must be capable of detecting them to either abstain from making predictions, or potentially forward these OODs to appropriate models handling other categories or tasks. In this paper, we develop two types of SE-specific OOD detection models, unsupervised and weakly-supervised OOD detection for code. The unsupervised OOD detection approach is trained solely on in-distribution samples while the weakly-supervised approach utilizes a tiny number of OOD samples to further enhance the detection performance in various OOD scenarios. Extensive experimental results demonstrate that our proposed methods significantly outperform the baselines in detecting OOD samples from four different scenarios simultaneously and also positively impact a main code understanding task.

Tags: "SE for AI", "Security"

Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric, "exLong: Generating Exceptional Behavior Tests with Large Language Models"

Abstract: Many popular programming languages, including C#, Java, and Python, support exceptions. Exceptions are thrown during program execution if an unwanted event happens, e.g., a method is invoked with an illegal argument value. Software developers write exceptional behavior tests (EBTs) to check that their code detects unwanted events and throws appropriate exceptions. Prior research studies have shown the importance of EBTs, but those studies also highlighted that developers put most of their efforts on “happy paths”, e.g., paths without unwanted events. To help developers fill the gap, we present the first framework, dubbed EXLÓNG, that automatically generates EBTs. EXLÓNG is a large language model instruction-tuned from CodeLlama and embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. We compare EX LÓNG with the state-of-the-art models for test generation (CAT-LM) and one of the strongest foundation models (GPT3.5), as well as with analysis-based tools for test generation (Randoop and EvoSuite). Our results show that EXLÓNG outperforms existing models and tools. Furthermore, we contributed several pull requests to open-source projects and 23 EBTs generated by EXLÓNG were already accepted.

Tags: "Testing and Quality", "AI for SE"

Jake Zappin, Trevor Stalnaker, Oscar Chaparro, Denys Poshyvanyk, "When Quantum Meets Classical: Characterizing Hybrid Quantum-Classical Issues Discussed in Developer Forums"

Abstract: Recent advances in quantum computing have sparked excitement that this new computing paradigm could solve previously intractable problems. However, due to the faulty nature of current quantum hardware and quantum-intrinsic noise, the full potential of quantum computing is still years away. Hybrid quantum-classical computing has emerged as a possible compromise that achieves the best of both worlds. In this paper, we look at hybrid quantum-classical computing from a software engineering perspective and present the first empirical study focused on characterizing and evaluating recurrent issues faced by developers of hybrid quantum-classical applications. The study comprised a thorough analysis of 531 real-world issues faced by developers -- including software faults, hardware failures, quantum library errors, and developer mistakes -- documented in discussion threads from forums dedicated to quantum computing. By qualitatively analyzing such forum threads, we derive a comprehensive taxonomy of recurring issues in hybrid quantum-classical applications that can be used by both application and platform developers to improve the reliability of hybrid applications. The study considered how these recurring issues manifest and their causes, determining that hybrid applications are crash-dominant (74% of studied issues) and that errors were predominantly introduced by application developers (70% of issues). We conclude by identifying recurring obstacles for developers of hybrid applications and actionable recommendations to overcome them.

Tags: "Quantum", "MSR", "Human/Social"

Xiang Cheng, Fan Sang, Yizhuo Zhai, Xiaokuan Zhang, Taesoo Kim, "RUG: Turbo LLM for Rust Unit Test Generation"

Abstract: Unit testing improves software quality by evaluating isolated sections of the program. This approach alleviates the need for comprehensive program-wide testing and confines the potential error scope within the software. However, unit test development is time-consuming, requiring developers to create appropriate test contexts and determine input values to cover different code regions. This problem is particularly pronounced in Rust due to its intricate type system, making traditional unit test generation tools ineffective in Rust projects. Recently, LLM have demonstrated their proficiency in understanding programming language and completing software engineering tasks. However, merely prompting LLM with a basic prompt like "generate unit test for the following source code" often results in code with compilation errors. In addition, LLM-generated unit tests often have limited test coverage. To bridge this gap and harness the capabilities of LLM, we design and implement RUG, an end-to-end solution to automatically generate the unit test for Rust projects. To help LLM's generated test pass Rust strict compilation checks, RUG designs a semantic-aware bottom-up approach to divide the context construction problem into dependent sub-problems. It solves these sub-problems sequentially using an LLM and merges them to a complete context. To increase test coverage, RUG integrates coverage-guided fuzzing with LLM to prepare fuzzing harnesses. Applying RUG on 17 real-world Rust programs (average 24,937 LoC), we show that RUG can achieve a high code coverage, up to 71.37%, closely comparable to human effort (73.18%). We submitted 113 unit tests generated by RUG covering the new code: 53 of them have been accepted, 17 were rejected, and 43 are pending for review.

Tags: "Testing and Quality", "SE for AI"

Smit Soneshbhai Patel, Aashish Yadavally, Hridya Dhulipala, Tien N. Nguyen, "Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets"

Abstract: Large Language Models (LLMs) have been excellent in generating and reasoning about source code and the textual descriptions. They can recognize patterns, syntax, and semantics in code, making them effective in several software engineering tasks. However, they exhibit weaknesses in reasoning about the program execution. They primarily operate on static code representations, failing to capture the dynamic behavior and state changes that occur during program execution. In this paper, we advance the capabilities of LLMs in reasoning about program execution. We propose ORCA, a novel approach that instructs an LLM to autonomously formulate a plan to navigate through a control flow graph (CFG) for predictive execution of (in)complete code snippets. It acts as a predictive interpreter to ``execute'' the code. As a downstream task, we use ORCA to statically identify any runtime errors for online code snippets. Early detection of runtime errors and defects in these snippets is crucial to prevent costly fixes later in the development cycle after they were adapted into a codebase. In our novel technique, we guide the LLM to pause at the branching point, focusing on the state of the symbol tables for variables' values, thus minimizing error propagation in the LLM's computation. We also instruct the LLM not to stop at each step in its execution plan, resulting the use of only one prompt to the LLM, thus much cost-saving. Our empirical evaluation showed that ORCA is effective and improves over the state-of-the-art approaches in predicting the execution traces and in runtime error detection.

Tags: "AI for SE", "Analysis"

Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, Lei Bu, "SpecGen: Automated Generation of Formal Program Specifications via Large Language Models"

Abstract: In the software development process, formal program specifications play a crucial role in various stages, including requirement analysis, software testing, and verification. However, manually crafting formal program specifications is rather difficult, making the job time-consuming and labor-intensive. Moreover, it is even more challenging to write specifications that correctly and comprehensively describe the semantics of complex programs. To reduce the burden on software developers, automated specification generation methods have emerged. However, existing methods usually rely on predefined templates or grammar, making them struggle to accurately describe the behavior and functionality of complex real-world programs. To tackle this challenge, we introduce SpecGen, a novel technique for formal program specification generation based on Large Language Models (LLMs). Our key insight is to overcome the limitations of existing methods by leveraging the code comprehension capability of LLMs. The process of SpecGen consists of two phases. The first phase employs a conversational approach that guides the LLM to generate appropriate specifications for a given program, aiming to utilize the ability of LLM to generate high-quality specifications. The second phase, designed for where the LLM fails to generate correct specifications, applies four mutation operators to the model-generated specifications and selects verifiable specifications from the mutated ones through a novel heuristic selection strategy by assigning different weights of variants in an efficient manner. We evaluate SpecGen on two datasets, including the SV-COMP Java category benchmark and a manually constructed dataset containing 120 programs. Experimental results demonstrate that SpecGen succeeds in generating verifiable specifications for 279 out of 385 programs, outperforming the existing LLM-based approaches and conventional specification generation tools like Houdini and Daikon. Further investigations on the quality of generated specifications indicate that SpecGen can comprehensively articulate the behaviors of the input program.

Tags: "Formal methods", "AI for SE", "Requirements"

Yifeng Di, Tianyi Zhang, "Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding"

Abstract: Large Language Models (LLMs) have demonstrated unprecedented capability in code generation. However, LLM-generated code is still plagued with a wide range of functional errors, especially for complex programming tasks that LLMs have not seen before. Recent studies have shown that developers often struggle with inspecting and fixing incorrect code generated by LLMs, diminishing their productivity and trust in LLM-based code generation. Inspired by the mutual grounding theory in communication, we propose an interactive approach that leverages code comments as a medium for developers and LLMs to establish a shared understanding. Our approach facilitates iterative grounding by interleaving code generation, inline comment generation, and contextualized user feedback through editable comments to align generated code with developer intent. We evaluated our approach on two popular benchmarks and demonstrated that our approach significantly improved multiple state-of-the-art LLMs, e.g., 16.9\% Pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we conducted a user study with 12 participants in comparison to two baselines: (1) interacting with GitHub Copilot, and (2) interacting with a multi-step code generation paradigm called Multi-Turn Program Synthesis. Participants completed the given programming tasks 16.7\% faster and with 10.5\% improvement in task success rate when using our approach. Both results show that interactively refining code comments enables the collaborative establishment of mutual grounding, leading to more accurate code generation and higher developer confidence.

Tags: "AI for SE", "Analysis", "Human/Social"

Zixuan Tan, Jiayuan Zhou, Xing Hu, Shengyi Pan, Kui Liu, Xin Xia, "Similar but Patched Code Considered Harmful -- The Impact of Similar but Patched Code on Recurring Vulnerability Detection and How to Remove Them"

Abstract: Identifying recurring vulnerabilities is crucial for ensuring software security. Clone-based techniques, while widely used, often generate many false alarms due to the existence of similar but patched (SBP) code, which is similar to vulnerable code but is not vulnerable due to having been patched. Although the SBP code poses a great challenge to the effectiveness of existing approaches, it has not yet been well explored. In this paper, we propose a programming language agnostic framework, Fixed Vulnerability Filter (FVF), to identify and filter such SBP instances in vulnerability detection. Different from existing studies that leverage function signatures, our approach analyzes code change histories to precisely pinpoint SBPs and consequently reduce false alarms. Evaluation under practical scenarios confirms the effectiveness and precision of our approach. Remarkably, FVF identifies and filters 65.1% of false alarms from four vulnerability detection tools (i.e., ReDeBug, VUDDY, MVP, and an elementary hash-based approach) without yielding false positives. We further apply FVF to 1,081 real-world software projects and construct a real-world SBP dataset containing 6,827 SBP functions. Due to the SBP nature, the dataset can act as a strict benchmark to test the sensitivity of the vulnerability detection approach in distinguishing real vulnerabilities and SBPs. Using this dataset, we demonstrate the ineffectiveness of four state-of-the-art deep learning-based vulnerability detection approaches. Our dataset can help developers make a more realistic evaluation of vulnerability detection approaches and also paves the way for further exploration of real-world SBP scenarios.

Tags: "Prog Comprehension/Reeng/Maint", "Security"

Wenchao Gu, Ensheng Shi, Yanlin Wang, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Michael Lyu, "SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing"

Abstract: Code retrieval, which retrieves code snippets based on users’ natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval paradigm from lexical-based matching towards leveraging deep learning models to encode source code and queries into vector representations, facilitating code retrieval according to vector similarity. Despite the effectiveness of these models, managing large-scale code bases presents significant challenges. Previous research propose deep hashing-based methods, which generate hash codes for queries and code snippets and use Hamming distance for rapid recall of code candidates. However, this approach’s reliance on linear scanning of the entire code base limits its scalability. To further improve the efficiency of large scale code retrieval, we propose a novel approach SECRET (Scalable and Efficient Code Retrieval via SegmEnTed deep hashing). SECRET converts long hash codes calculated by existing deep hashing approaches into several short hash code segments through an iterative training strategy. After training, SECRET recalls code candidates by looking up the hash tables for each segment, the time complexity of recall can thus be greatly reduced. Extensive experimental results demonstrate that SECRET can drastically reduce the retrieval time by at least 95% while achieving comparable or even higher performance of existing deep hashing approaches. Besides, SECRET also exhibits superior performance and efficiency compared to the classical hash table-based approach known as LSH under the same number of hash tables.

Tags: "AI for SE"

Shanto Rahman, Bala Naren Chanumolu, Suzzana Rafi, August Shi, Wing Lam, "Ranking Relevant Tests for Order-Dependent Flaky Tests"

Abstract: One major challenge of regression testing is the presence of flaky tests, i.e., tests that may pass in one run but fail in another run for the same version of code. One prominent category of flaky tests are order-dependent (OD) flaky tests, which are tests that can pass or fail depending on the test-order in which the tests are run. To help developers debug and fix OD tests, prior work has attempted to automatically find OD-relevant tests, i.e., tests that will determine whether an OD test passes or fails depending on whether the OD-relevant tests are run before or after the OD test in the test-order. Prior work finds OD-relevant tests by running tests before the OD test, without regards to the tests’ likelihood of being OD-relevant tests. We propose RankF to rank tests in order of likelihood of being OD-relevant tests, so a developer can find the first OD-relevant test more quickly, without running tests as often. We propose two ranking approaches, each requiring different information. Our first approach, RankFL, relies on training a large-language model that analyzes test code. Our second approach, RankFO, relies on the analysis of prior test-order execution information. We evaluate our approaches on 155 OD tests from 34 modules across 24 open-source projects. We compare RankF against prior work baselines in terms of the time for finding the first OD-relevant test for an OD test. RankF on average finds the first OD-relevant test faster than the best of the baselines, providing speedups of 1.9X, 1.7X, and 2.6X for the three different types of OD-relevant tests we evaluate.

Tags: "Testing and Quality", "Analysis"

Hang Du, Vijay Krishna Palepu, James A. Jones, "Leveraging Propagated Infection to Crossfire Mutants"

Abstract: Mutation testing was proposed to identify weaknesses in test suites by repeatedly generating artificially faulty versions of the software (i.e., *mutants*) and determining if the test suite is sufficient to detect them (i.e., *kill* them). When the tests are insufficient, each surviving mutant provides an opportunity to improve the test suite. We conducted a study and found that many such surviving mutants (up to 84% for the subjects of our study) are detectable by simply augmenting existing tests with additional assertions, or *assertion amplification*. Moreover, we find that many of these mutants are detectable by multiple existing tests, giving developers options for how to detect them. To help with these challenges, we created a technique that performs memory-state analysis to identify candidate assertions that developers can use to detect the surviving mutants. Additionally, we build upon prior research that identifies "crossfiring" opportunities -- tests that coincidentally kill multiple mutants. To this end, we developed a theoretical model that describes the varying granularities that crossfiring can occur in the existing test suite, which provide opportunities and options for how to kill surviving mutants. We operationalize this model to an accompanying technique that optimizes the assertion amplification of the existing tests to crossfire multiple mutants with fewer added assertions, optionally concentrated within fewer tests. Our experiments show that we can kill *all* surviving mutants that are detectable with existing test data with only 1.1% of the identified assertion candidates, and increasing by a factor of 6x, on average, the number of killed mutants from amplified tests, over tests that do not crossfire.

Tags: "Testing and Quality", "Analysis"

Yang Sun, Christopher M. Poskitt, Kun Wang, Jun Sun, "FixDrive: Automatically Repairing Autonomous Vehicle Driving Behaviour for $0.08 per Violation"

Abstract: Autonomous Vehicles (AVs) are advancing rapidly, with Level-4 AVs already operating in real-world conditions. Current AVs, however, still lag behind human drivers in adaptability and performance, often exhibiting overly conservative behaviours and occasionally violating traffic laws. Existing solutions, such as runtime enforcement, mitigate this by automatically repairing the AV's planned trajectory at runtime, but such approaches lack transparency and should be a measure of last resort. It would be preferable for AV repairs to generalise beyond specific incidents and to be interpretable for users. In this work, we propose FixDrive, a framework that analyses driving records from near-misses or law violations to generate AV driving strategy repairs that reduce the chance of such incidents occurring again. These repairs are captured in $\mu$Drive, a high-level domain-specific language for specifying driving behaviours according to event-based triggers. Implemented for the state-of-the-art autonomous driving system Apollo, FixDrive identifies and visualises critical moments from driving records, then uses a Multimodal Large Language Model (MLLM) with zero-shot learning to generate $\mu$Drive programs. We tested FixDrive on various benchmark scenarios, and found that the generated repairs improved the AV's performance with respect to following traffic laws, avoiding collisions, and successfully reaching destinations. Furthermore, the direct costs of repairing an AV---15 minutes of offline analysis and \$0.08 per violation---are reasonable in practice.

Tags: "SE for AI", "Autonomy"

ziji wu, yu huang, peishan huang, shanghua wen, minglong li, ji wang, "EffBT: An Efficient Behavior Tree Reactive Synthesis and Execution Framework"

Abstract: Behavior Trees (BTs), originated from the control of Non-Player-Characters (NPCs), have been widely embraced in robotics and software engineering communities due to their modularity, reactivity, and other beneficial characteristics. It is highly desirable to synthesize BTs automatically. The consequent challenges are to ensure the generated BTs semantically correct, well-structured, and efficiently executable. To address these challenges, in this paper, we present a novel reactive synthesis method for BTs, namely EffBT, to generate correct and efficient controllers from formal specifications in GR(1) automatically. The idea is to construct BTs soundly from the intermediate strategies derived during the algorithm of GR(1) realizability check. Additionally, we introduce pruning strategies and use of \textit{Parallel} nodes to improve BT execution, while none of the priors explored before. We prove the soundness of the EffBT method, and experimental results across various scenarios and datasets demonstrate its effectiveness.

Tags: "Formal methods", "Analysis/Synthesis"

Haofeng Li, Chenghang Shi, Jie Lu, Lian Li, Zixuan Zhao, "Module-Aware Context Sensitive Pointer Analysis"

Abstract: The Java Platform Module System (JPMS) has found widespread applications since introduced in Java 9. However, existing pointer analyses fail to leverage the semantics of JPMS. This paper presents a novel module-aware approach to improving the soundness and performance of pointer analysis. For soundness, we model the semantics of keywords provides and uses in JPMS to recover missing points-to relations. For performance, we design a module-aware context-sensitive analysis, which can propagate and apply critical contexts (by exploiting modularity) to balance precision and efficiency better. We have implemented our module-aware pointer analysis named MPA in Tai-e and conducted extensive experiments to compare it with standard object-sensitive approaches. The evaluation results demonstrate that MPA improves soundness and enhances existing context-sensitivity, striking a good balance between efficiency and precision. In terms of soundness, MPA can increase the number of reachable methods up to 90.9 × 90.9× ( lombok lombok) under the same analysis. Performance-wise, MPA is nearly as fast as context-insensitivity for most benchmarks, while its precision is superior to that of 1-object-sensitivity on average.

Tags: "Analysis"

Xu Chen, Ningning Cui, Zhe Pan, Liwei Chen, Gang Shi, Dan Meng, "Critical Variable State-Aware Directed Greybox Fuzzing"

Abstract: Directed fuzzing is an effective software testing method that guides the fuzzing campaign towards user-defined target sites of interest, enabling the discovery of vulnerabilities relevant to those sites. However, even though the generated test cases cover the code near the target sites, complex vulnerabilities remain untriggered. By focusing only on test cases that cover new edges, the program states related to the targets are overlooked, resulting in insufficient testing of the targets and failure to capture complex vulnerabilities. In this paper, we propose a novel directed fuzzing solution named CSFuzz, which considers program states associated with the targets. First, CSFuzz extracts critical variables related to the target sites from the program using static analysis. Then, CSFuzz monitors the runtime values of these critical variables and infers the program states associated with the targets by adaptively partitioning the range of variable values. This allows CSFuzz to store interesting seeds in the state corpus that trigger new states near the target sites. Lastly, CSFuzz employs dynamic scheduling techniques to guide the fuzzing campaign in selecting different corpora and prioritizing seeds. This ensures more adequate testing of the target sites. We have implemented a prototype of CSFuzz and evaluated it on 2 benchmarks and widely fuzzed real-world software. Evaluation results show that CSFuzz outperforms state-of-the-art fuzzers in terms of vulnerability detection capability, achieving a maximum speedup of 219%. Moreover, CSFuzz has discovered 4 new bugs, including 2 CVE IDs assigned.

Tags: "Testing and Quality"

Stefan Schwedt, Thomas Ströder, "From Bugs to Benefits: Improving User Stories by Leveraging Crowd Knowledge with CrUISE-AC"

Abstract: Costs for resolving software defects increase exponentially in late stages. Incomplete or ambiguous requirements are one of the biggest sources for defects, since stakeholders might not be able to communicate their needs or fail to share their domain specific knowledge. Combined with insufficient developer experience, teams are prone to constructing incorrect or incomplete features. To prevent this, requirements engineering has to explore knowledge sources beyond stakeholder interviews. Publicly accessible issue trackers for systems within the same application domain hold essential information on identified weaknesses, edge cases, and potential error sources, all documented by actual users. Our research aims at (1) identifying, and (2) leveraging such issues to improve an agile requirements artifact known as a “user story”. We present CrUISE-AC (Crowd and User Informed Suggestion Engine for Acceptance Criteria) as a fully automated method that investigates issues and generates non-trivial additional acceptance criteria for a given user story by employing NLP techniques and an ensemble of LLMs. CrUISE-AC was evaluated by five independent experts in two distinct business domains. Our findings suggest that issue trackers hold valuable information pertinent to requirements engineering. Our evaluation shows that 80–82% of the generated acceptance criteria add relevant requirements to the user stories. Limitations are the dependence on accessible input issues and the fact that we do not check generated criteria for being conflict-free or non-overlapping with criteria from other user stories.

Tags: "Requirements", "AI for SE"

Lingfeng Zhang, Zhaohui Wang, Yueling Zhang, Min Zhang, Jiangtao Wang, "HIFI: Explaining and Mitigating Algorithmic Bias through the Lens of Game-Theoretic Interactions"

Abstract: Machine Learning (ML) algorithms are increasingly used in decision-making process across various social-critical domains, but they often somewhat inherit and amplify bias from their training data, leading to unfair and unethical outcomes. This issue highlights the urgent need for effective methods to detect, explain, and mitigate bias to ensure the fairness of ML systems. Previous studies are prone to analyze the root causes of algorithmic bias from a statistical perspective. However, to the best of our knowledge, none of them has discussed how sensitive information inducing the final discriminatory decision is encoded by ML models. In this work, we attempt to explain and mitigate algorithmic bias from a game-theoretic view. We mathematically decode an essential and common component of sensitive information implicitly defined by various fairness metrics with Harsanyi interactions, and on this basis, we propose an in-processing method HIFI for bias mitigation. We conduct an extensive evaluation of HIFI with 11 state-of-the-art methods, 5 real-world datasets, 4 fairness criteria, and 5 ML performance metrics, while also considering intersectional fairness for multiple protected attributes. The results show that HIFI surpasses state-of-the-art in-processing methods in terms of fairness improvement and fairness-performance trade-off, and also achieves notable effectiveness in reducing violations of individual fairness simultaneously.

Tags: "SE for AI", "Security"

Xiaolei wang, Ruilin Li, Bin Zhang, Chao Feng, Chaojing Tang, "Practical Object-Level Sanitizer With Aggregated Memory Access and Custom Allocator"

Abstract: To mitigate potential memory safety vulnerabilities, recently there have been significant advances in sanitizers for pre-production bug detection. However, the limited inability to balance performance and detection accuracy still holds. The main reason is due to excessive reliance on shadow memory and a large number of memory access checks at runtime, incurring a significant performance overhead (if fine-grained memory safety detection is performed, the overhead will be even greater). In this paper, we propose a novel Object-Level Address Sanitizer OLASan to reduce performance overhead further while implementing accurate memory violations (including intra-object overflow) detection. Unlike previous sanitizers ignoring the correlation between memory access and objects, OLASan aggregates multiple memory accesses of same object at function level to perform on-demand targeted sanitization, thus avoiding examining most memory accesses at runtime. Specifically, OLASan characterizes various memory access patterns to identify those which can be aggregated, and implements memory safety checks with customized memory tagging. We implement OLASan atop the LLVM framework and evaluate it on SPEC CPU benchmarks. Evaluations show that OLASan outperforms the state-of-the-art methods with 51.18%, 25.20% and 6.52% less runtime overhead than ASan, ASan−− and GiantSan respectively. Moreover, aided by customized memory tagging, OLASan achieves zero false negatives for the first time when testing Juliet suites. Finally, we confirm that OLASan also offers comparable detection capabilities on real bugs.

Tags: "Security", "Testing and Quality"

Dimitri Kokkonis, Michaël Marcozzi, Emilien Decoux, Stefano Zacchiroli, "ROSA: Finding Backdoors with Fuzzing"

Abstract: A code-level backdoor is a hidden access, programmed and concealed within the code of a program. For instance, hard-coded credentials planted in the code of an FTP server would enable maliciously logging into all the deployed instances of this server. Confirmed software supply-chain attacks have led to the injection of backdoors into popular open-source projects, and backdoors have been discovered in various router firmware. Manual code auditing for backdoors is challenging and existing semi-automated approaches can handle only a limited amount of programs and backdoors, while requiring manual reverse-engineering of the audited (binary) program. Graybox fuzzing (automated semi-randomized testing) has grown in popularity due to its success in discovering vulnerabilities and hence stands as a strong candidate for improved backdoor detection. However, current fuzzing knowledge does not offer any means to detect the triggering of a backdoor at runtime. In this work we introduce ROSA, a novel approach (and tool) which combines a state-of-the-art fuzzer (AFL++) with a new metamorphic test oracle, capable of detecting runtime backdoor triggers. To facilitate the evaluation of ROSA, we have created ROSARUM, the first openly available benchmark for assessing the detection of various backdoors in diverse programs. Experimental evaluation shows that ROSA has a level of robustness, speed and automation similar to classical fuzzing. Compared to existing detection tools, it can handle a diversity of backdoors and programs and it does not rely on manually reverse-engineering the fuzzed binary code.

Tags: "Security", "Testing and Quality"

Uwe Gropengießer, Elias Dietz, Florian Brandherm, Achref Doula, Osama Abboud, Xun Xiao, Max Mühlhäuser, "MARQ: Engineering Mission-Critical AI-based Software with Automated Result Quality Adaptation"

Abstract: AI-based mission-critical software exposes a blessing and a curse: its inherent statistical nature allows for flexibility in result quality, yet the mission-critical importance demands adherence to stringent constraints such as execution deadlines. This creates a space for trade-offs between the Quality of Result (QoR)—a metric that quantifies the quality of a computational outcome—and other application attributes like execution time and energy, particularly in real-time scenarios. Fluctuating resource constraints, such as data transfer to a remote server over unstable network connections, are prevalent in mobile and edge computing environments—encompassing use cases like Vehicle-to-Everything, drone swarms, or social-VR scenarios. We introduce a novel approach that enables software engineers to easily specify alternative AI service chains—sequences of AI services encapsulated in microservices aiming to achieve a predefined goal—with varying QoR and resource requirements. Our methodology facilitates dynamic optimization at runtime, which is automatically driven by the MARQ framework. Our evaluations show that MARQ can be used effectively for the dynamic selection of AI service chains in real-time while maintaining the required application constraints of mission-critical AI software. Notably, our approach achieves a 100x acceleration in service chain selection and an average 10% improvement in QoR compared to existing methods.

Tags: "SE for AI", "Autonomy"

Yuchen Shao, Yuheng Huang, Jiawei Shen, Lei Ma, Ting Su, Chengcheng Wan, "Are LLMs Correctly Integrated into Software Systems?"

Abstract: Large language models (LLMs) provide effective solutions in various application scenarios, with the support of retrieval-augmented generation (RAG). However, developers face challenges in integrating LLM and RAG into software systems, due to lacking interface specifications, various requirements from software context, and complicated system management. In this paper, we have conducted a comprehensive study of 100 open-source applications that incorporate LLMs with RAG support, and identified 18 defect patterns. Our study reveals that 77% of these applications contain more than three types of integration defects that degrade software functionality, efficiency, and security. Guided by our study, we propose systematic guidelines for resolving these defects in software life cycle. We also construct an open-source defect library Hydrangea.

Tags: "SE for AI", "Design/Architecture"

Jinan Jiang, Xinghao Peng, Jinzhao Chu, Xiapu Luo, "ConsCS: Effective and Efficient Verification of Circom Circuits"

Abstract: Circom is a popular programming language for writing arithmetic circuits that can be used to generate zero-knowledge proofs (ZKPs) like zk-SNARKS. ZKPs have received tremendous attention in protocols like zkRollups. The Circom circuits are compiled to Rank-1 Constraint Systems (R1CS) circuits, based on which zk-SNARK proofs are generated. However, one major challenge associated with R1CS circuits is the problem of under-constrained circuits, which are susceptible to allowing incorrect computations to pass verification due to insufficient constraints, potentially leading to security vulnerabilities. In this paper, we propose a novel framework ConsCS to automatically verify Circom circuits. Our contributions are threefold: 1) we propose novel circuit inference rules to help reduce the size of circuits and to extract more comprehensive information than existing works; 2) we introduce the novel Binary Property Graph (BPG) as a highly efficient reasoning engine, outperforming all existing tools in effectiveness and efficiency; 3) we leverage fine-grained domain-specific information to guide the SMT solving to address non-linear constraints, increasing the success rate of SMT queries of existing works from 2.68% to 48.84%. We conduct experiments to show that ConsCS enhances the solved rate of existing works from around 50-60% to above 80%.

Tags: "Formal methods", "Security"

Youngjae Choi, Seunghoon Woo, "TIVER: Identifying Adaptive Versions of C/C++ Third-Party Open-Source Components Using a Code Clustering Technique"

Abstract: Reusing third-party open-source software (OSS) provides many benefits but can expose the entire system to risks owing to propagated vulnerabilities. While tracking the versions of OSS components can help prevent threats, existing approaches typically map a single version to a reused OSS codebase. This coarse-grained method fails to address multiple versions of code that coexist within the codebase, resulting in ineffective OSS management. Additionally, effectively identifying component versions is challenging owing to noise codes, such as algorithmic codes that coexist across different OSS, as well as duplicate components arising from the redundant reuse of OSS. In this paper, we introduce the concept of the adaptive version, a one-stop solution to represent the version diversity of reused OSS. We present TIVER, an effective approach for identifying adaptive versions of OSS components. TIVER employs two key techniques: (1) fine-grained function-level versioning to uncover detailed versions, and (2) OSS code clustering to identify duplicate components and remove noise. This enables precise identification of OSS reuse locations and adaptive versions, effectively mitigating threats related to OSS reuse. Evaluation of popular C/C++ software on GitHub revealed that OSS components with a single version accounted for only 33%, while the remaining 67% of the components contained more than three versions on average. Nonetheless, TIVER effectively identified adaptive versions of OSS components with 88.46% precision and 91.63% recall in duplicate component distinction, and 86% precision and 86.84% recall in eliminating noise, while existing approaches barely achieved 42% recall in distinguishing duplicates and did not address noise. Further experiments showed that TIVER could enhance vulnerability management and be applied to Software Bills of Materials (SBOM) to improve supply chain security.

Tags: "Prog Comprehension/Reeng/Maint", "Security"

Yuanhang Zhou, Zhen Yan, Yuanliang Chen, Fuchen Ma, Ting Chen, Yu Jiang, "Chord: Towards a Unified Detection of Blockchain Transaction Parallelism Bugs"

Abstract: Blockchain systems have implemented various transaction parallelism mechanisms to improve the system throughput and reduce the latency. However, they inevitably introduce bugs. Such bugs can result in severe consequences such as asset loss, double spending, consensus failure, and DDoS. Unfortunately, they have been little analyzed about their symptoms and root causes, leading to a lack of effective detection methods. In this work, we conduct a thorough analysis of historical transaction parallelism bugs in four commercial blockchains. Results show that most of them arise from mishandling conflicting transactions and manifest without obvious phenomena. However, given the heterogeneity of blockchains, it is challenging to trigger conflict handling in a unified way. Effectively identifying these bugs is also hard. Inspired by the findings, we propose Chord, aiming at detecting blockchain transaction parallelism bugs. Chord proposes a unified conflict transaction model to generate various conflict transactions. Chord also dynamically adjust the transaction submission and inserts proactive reverts during transaction execution to conduct thorough testing. Besides, Chord incorporates a local-remote differential oracle and a TPS oracle to capture the bugs. Our evaluation shows that Chord successfully detects 54 transaction parallelism bugs. Besides, Chord outperforms the existing methods by decreasing the TPS by 49.7% and increasing the latency by 388.0%, showing its effectiveness in triggering various conflict scenarios and exposing the bugs.

Tags: "Testing and Quality", "Blockchain"

Dominik Fuchß, Tobias Hey, Jan Keim, Haoyu Liu, Niklas Ewald, Tobias Thirolf, Anne Koziolek, "LiSSA: Toward Generic Traceability Link Recovery through Retrieval-Augmented Generation"

Abstract: There are a multitude of software artifacts which need to be handled during the development and maintenance of a software system. These artifacts interrelate in multiple, complex ways. Therefore, many software engineering tasks are enabled — and even empowered — by a clear understanding of artifact interrelationships and also by the continued advancement of techniques for automated artifact linking. However, current approaches in automatic Traceability Link Recovery (TLR) target mostly the links between specific sets of artifacts, such as those between requirements and code. Fortunately, recent advancements in Large Language Models (LLMs) can enable TLR approaches to achieve broad applicability. Still, it is a nontrivial problem how to provide the LLMs with the specific information needed to perform TLR. In this paper, we present LiSSA, a framework that harnesses LLM performance and enhances them through Retrieval-Augmented Generation (RAG). We empirically evaluate LiSSA on three different TLR tasks, requirements to code, documentation to code, and architecture documentation to architecture models, and we compare our approach to state-of-the-art approaches. Our results show that the RAG-based approach can significantly outperform the state-of-the-art on the code-related tasks. However, further research is required to improve the performance of RAG-based approaches to be applicable in practice.

Tags: "Requirements", "AI for SE"

Weisong Sun, Yuchen Chen, Mengzhe Yuan, Chunrong Fang, Zhenpeng Chen, Chong Wang, Yang Liu, Baowen Xu, Zhenyu Chen, "Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code Naturalness"

Abstract: Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCMs-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, providing attackers with the opportunity to manipulate them by inserting crafted samples into the data. This type of attack is called a code poisoning attack (also known as a backdoor attack). It allows attackers to implant backdoors in NCMs and thus control model behavior, which poses a significant security threat. However, there is still a lack of effective techniques for detecting various complex code poisoning attacks. In this paper, we propose an innovative and lightweight technique for code poisoning detection named KillBadCode. KillBadCode is designed based on our insight that code poisoning disrupts the naturalness of code. Specifically, KillBadCode first builds a code language model (CodeLM) on a lightweight n-gram language model and trains it on a few clean code snippets. Then, given poisoned data, KillBadCode utilizes CodeLM to identify those tokens in (poisoned) code snippets that will make the code snippets more natural after being deleted as trigger tokens. Considering that the removal of some normal tokens in a single sample might also enhance code naturalness, leading to a high false positive rate (FPR), we aggregate the cumulative improvement of each token across all samples. Finally, KillBadCode purifies the poisoned data by removing all poisoned samples containing the identified trigger tokens. We conduct extensive experiments to evaluate the effectiveness and efficiency of KillBadCode, involving two types of advanced code poisoning attacks (a total of five poisoning strategies) and datasets from four representative code intelligence tasks. The experimental results demonstrate that across 20 code poisoning detection scenarios, KillBadCode achieves an average FPR of 8.30% and an average Recall of 100%, significantly outperforming four baselines. More importantly, KillBadCode is very efficient, with a minimum time consumption of only 5 minutes, and is 25 times faster than the best baseline on average. These highlight the great potential of KillBadCode in efficiently killing various code poisoning attacks.

Tags: "AI for SE", "Security"

Freek Verbeek, Ali Shokri, Daniel Engel, Binoy Ravindran, "Formally Verified Binary-level Pointer Analysis"

Abstract: Binary-level pointer analysis can be of use in symbolic execution, testing, verification, and decompilation of software binaries. In various such contexts, it is crucial that the result is trustworthy, i.e., it can be formally established that the pointer designations are overapproximative. This paper presents an approach to formally proven correct binary-level pointer analysis. A salient property of our approach is that it first generically considers what proof obligations a generic abstract domain for pointer analysis must satisfy. This allows easy instantiation of different domains, varying in precision, while preserving the correctness of the analysis. In the trade-off between scalability and precision, such customization allows ``meaningful'' precision (sufficiently precise to ensure basic sanity properties, such as that relevant parts of the stack frame are not overwritten during function execution) while also allowing coarse analysis when pointer computations have become too obfuscated during compilation for sound and accurate bounds analysis. We experiment with three different abstract domains with high, medium, and low precision. Evaluation shows that our approach is able to derive designations for memory writes soundly in COTS binaries, in a context-sensitive interprocedural fashion.

Tags: "Formal methods", "Analysis"

Yuqing Nie, Chong Wang, Kailong Wang, Guoai Xu, Guosheng Xu, Haoyu Wang, "Decoding Secret Memorization in Code LLMs Through Token-Level Characterization"

Abstract: Code Large Language Models (LLMs) have demonstrated remarkable capabilities in generating, understanding, and manipulating programming code. However, their training process inadvertently leads to the memorization of sensitive information, posing severe privacy risks. Existing studies on memorization in LLMs primarily rely on prompt engineering techniques, which suffer from limitations such as widespread hallucination and inefficient extraction of the target sensitive information. In this paper, we present a novel approach to characterize real and fake secrets generated by Code LLMs based on token probabilities. We identify four key characteristics that differentiate genuine secrets from hallucinated ones, providing insights into distinguishing real and fake secrets. To overcome the limitations of existing works, we propose DESEC, a two-stage method that leverages token-level features derived from the identified characteristics to guide the token decoding process. DESEC consists of constructing an offline token scoring model using a proxy Code LLM and employing the scoring model to guide the decoding process by reassigning token likelihoods. Through extensive experiments on four state-of-the-art Code LLMs using a diverse dataset, we demonstrate the superior performance of DESEC in achieving a higher plausible rate and extracting more real secrets compared to existing baselines. Our findings highlight the effectiveness of our token-level approach in enabling an extensive assessment of the privacy leakage risks associated with Code LLMs.

Tags: "AI for SE", "Security"

Rajdeep Singh Hundal, Yan Xiao, Xiaochun Cao, Jin Song Dong, Manuel Rigger, "On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations"

Abstract: \emph{Deep Reinforcement Learning} (DRL) is a paradigm of artificial intelligence where an \emph{agent} uses a neural network to learn which actions to take in a given \emph{environment}. DRL has recently gained traction from being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous \emph{implementations} of the state-of-the-art algorithms responsible for training these agents, like the \emph{Deep Q-Network} (DQN) and \emph{Proximal Policy Optimization} (PPO) algorithms, currently exist. However, studies make the mistake of assuming implementations of the same algorithm to be consistent and thus, \emph{interchangeable}. In this paper, through a \emph{differential testing} lens, we present the results of studying the extent of implementation inconsistencies, their effect on the implementations' performance, as well as their impact on the conclusions of prior studies under the assumption of interchangeable implementations. The outcome of our differential tests showed significant discrepancies between the tested algorithm implementations, indicating that they are \textit{not} interchangeable. In particular, out of the five PPO implementations tested on 56 games, three implementations achieved superhuman performance for 50\% of their total trials while the other two implementations only achieved superhuman performance for less than 15\% of their total trials. Furthermore, the performance among the high-performing PPO implementations was found to differ significantly in nine games. As part of a meticulous manual analysis of the implementations' source code, we analyzed implementation discrepancies and determined that code-level inconsistencies primarily caused these discrepancies. Lastly, we replicated a study and showed that this assumption of implementation interchangeability was sufficient to \emph{flip} experiment outcomes. Therefore, this calls for a shift in how implementations are being used. In addition, we recommend for (1) replicability studies for studies mistakenly assuming implementation interchangeability, (2) DRL researchers and practitioners to adopt the differential testing methodology proposed in this paper to combat implementation inconsistencies, and (3) the use of large environment suites.

Tags: "SE for AI", "Testing and Quality"

shide zhou, Tianlin Li, WANG KAILONG, Yihao Huang, Ling Shi, Yang Liu, Haoyu Wang, "Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks"

Abstract: Large language models (LLMs) have revolutionized artificial intelligence, but their increasing deployment across critical domains has raised concerns about their abnormal behaviors when faced with malicious attacks. Such vulnerability alerts the widespread inadequacy of pre-release testing. In this paper, we conduct a comprehensive empirical study to evaluate the effectiveness of traditional coverage criteria in identifying such inadequacies, exemplified by the significant security concern of jailbreak attacks. Our study begins with a clustering analysis of the hidden states of LLMs, revealing that the embedded characteristics effectively distinguish between different query types. We then systematically evaluate the performance of these criteria across three key dimensions: criterion level, layer level, and token level. Our research uncovers significant differences in the sets of neurons covered when LLMs process normal versus jailbreak queries, aligning with our clustering experiments. Leveraging these findings, we propose three practical applications of coverage criteria in the context of LLM security testing. Specifically, we develop a real-time jailbreak detection mechanism that achieves high accuracy (93.61% on average) in classifying queries as normal or jailbreak. Furthermore, we explore the use of coverage levels to prioritize test cases, improving testing efficiency by focusing on high-risk interactions and removing redundant tests. Lastly, we introduce a coverage-guided approach for generating jailbreak attack examples, enabling systematic refinement of prompts to uncover new vulnerabilities. This study improves our understanding of LLM security testing, enhances their safety, and provides a foundation for developing more robust AI applications.

Tags: "SE for AI", "Security"

Chong Wang, Kaifeng Huang, Jian Zhang, Yebo Feng, Lyuye Zhang, Yang Liu, Xin Peng, "LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion"

Abstract: Large language models (LLMs), pre-trained or fine-tuned on large code corpora, have shown effectiveness in generating code completions. However, in LLM-based code completion, LLMs may struggle to use correct and up-to-date Application Programming Interfaces (APIs) due to the rapid and continuous evolution of libraries. While existing studies have highlighted issues with predicting incorrect APIs, the specific problem of deprecated API usage in LLM-based code completion has not been thoroughly investigated. To address this gap, we conducted the first evaluation study on deprecated API usage in LLM-based code completion. This study involved seven advanced LLMs, 145 API mappings from eight popular Python libraries, and 28,125 completion prompts. The study results reveal the status quo (i.e., API usage plausibility and deprecated usage rate) of deprecated API and replacing API usage in LLM-based code completion from the perspectives of model, prompt, and library, and indicate the root causes behind. Based on these findings, we propose two lightweight fixing approaches, ReplaceAPI and InsertPrompt, which can serve as baseline approaches for future research on mitigating deprecated API usage in LLM-based completion. Additionally, we provide implications for future research on integrating library evolution with LLM-driven software development.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Matteo Camilli, Raffaela Mirandola, "Parametric Falsification of Many Probabilistic Requirements under Flakiness"

Abstract: Falsification is a popular simulation-based testing method for Cyber-Physical Systems to find inputs that violate a formal requirement. It employs optimization algorithms to minimize a robustness metric that defines the satisfaction of a given property over an execution trace. Despite falsification representing an established approach, detecting violations considering many, possibly independent, requirements simultaneously, under flaky simulations is an open problem. We address this problem by proposing a novel approach that combines parametric model checking and many-objective optimization. We use parametric model checking to shift part of the complexity of the problem offline. We pre-compute numeric constraints for the satisfaction of all requirements on a parametric specification of the testing scenario. Flaky violations are then detected using many-objective optimization to explore the space of changing factors in the scenario and push the parameters out of all precomputed constraints. The results of our empirical evaluation using four open-source evaluation subjects with increasing complexity (number of requirements) show that our approach can falsify many requirements simultaneously, without hiding their individual contribution. The effectiveness, in terms of quantity and severity of violations, is significantly higher than random search as well as two selected state-of-the-art baseline approaches. Furthermore, the extra offline computation yields a negligible cost.

Tags: "Testing and Quality"

Kaia Newman, Sarah Snay, Madeline Endres, Manasvi Parikh, Andrew Begel, ""Get Me In The Groove": A Mixed Methods Study on Supporting ADHD Professional Programmers"

Abstract: Understanding the work styles of diverse programmers can help build inclusive workplaces, enabling all software engineers to excel. An estimated 10.6\% of programmers have \textit{Attention Deficit Hyperactivity Disorder} (ADHD), a condition characterized by differences in attention and working memory. Prior work has just begun to explore the impact of ADHD on software development, finding that inadequate support may negatively impact team productivity and employment. This prevents software development organizations from benefiting from ADHD-related strengths. To investigate these impacts, we conducted a two-phase mixed methods study. First, we qualitatively analyzed 99 threads (1,658 posts and comments) from \texttt{r/ADHD\_Programmers}, the largest public forum dedicated to the ADHD programmer community. We constructed a mapping that reveals how ADHD programmers apply personal strategies and organizational accommodations to address software task-specific challenges. Second, we conducted a large-scale survey of 239 ADHD and 254 non-ADHD professional programmers to validate how our qualitative data generalize to the worldwide developer population. Our results show that ADHD programmers are 1.8 to 4.4 times more likely to struggle more frequently than neurotypical developers with all challenges we consider, but especially with time management and design. Our findings have implications for inclusive and effective tool- and policy-building in software workplaces and motivate further research into the experiences of ADHD programmers.

Tags: "Human/Social"

Ruanqianqian (Lisa) Huang, Savitha Ravi, Michael He, Boyu Tian, Sorin Lerner, Michael Coblenz, "How Scientists Use Jupyter Notebooks: Goals, Quality Attributes, and Opportunities"

Abstract: Computational notebooks are intended to prioritize the needs of scientists, but little is known about how scientists interact with notebooks, what requirements drive scientists' software development processes, or what tactics scientists use to meet their requirements. We conducted an observational study of 20 scientists using Jupyter notebooks for their day-to-day tasks, finding that scientists prioritize different quality attributes depending on their goals. A qualitative analysis of their usage shows (1) a collection of goals scientists pursue with Jupyter notebooks, (2) a set of quality attributes that scientists value when they write software, and (3) tactics that scientists leverage to promote quality. In addition, we identify ways scientists incorporated AI tools into their notebook work. From our observations, we derive design recommendations for improving computational notebooks and future programming systems for scientists. Key opportunities pertain to helping scientists create and manage state, dependencies, and abstractions in their software, enabling more effective reuse of clearly-defined components.

Tags: "Human/Social"

Gabriel Sherman, Stefan Nagy, "No Harness, No Problem: Oracle-guided Harnessing for Auto-generating C API Fuzzing Harnesses"

Abstract: Library APIs are used by virtually every modern application and system, making them among today’s most security-critical software. In recent years, library bug-finding efforts have overwhelmingly adopted the powerful testing strategy of coverage-guided fuzzing. At its core, API fuzzing operates on harnesses: wrapper programs that initialize an API before feeding random inputs to its functions. Successful fuzzing demands correct and thorough harnesses, making manual harnessing challenging without sufficient domain expertise. To overcome this, recent strategies propose “learning” libraries’ intended usage to automatically generate their fuzzing harnesses. Yet, despite their high code coverage, resulting harnesses frequently miss key API semantics—bringing with them invalid, unrealistic, or otherwise-impossible data and call sequences—derailing fuzzing with false-positive crashes. Thus, without a precise, semantically-correct harnessing, many critical APIs will remain beyond fuzzing’s reach—leaving their hidden vulnerabilities ripe for attackers. This paper introduces Oracle-guided Harnessing: a technique for fully-automatic, semantics-aware API fuzzing harness synthesis. At a high level, Oracle-guided Harnessing mimics the trial-and-error process of manual harness creation—yet automates it via fuzzing. Specifically, we leverage information from API headers to mutationally stitch-together candidate harnesses; and evaluate their validity via a set of Correctness Oracles: compilation, execution, and changes in coverage. By keeping— and further mutating—only correct candidates, our approach produces a diverse set of semantically-correct harnesses for complex, real-world libraries in as little as one hour. We integrate Oracle-guided Harnessing as a prototype, OGHARN; and evaluate it alongside today’s leading fully-automatic harnessing approach, Hopper, and a plethora of developer-written harnesses from OSS-Fuzz. Across 20 real-world APIs, OGHARN outperforms developer-written harnesses by a median 14% code coverage, while uncovering 31 and 30 more vulnerabilities than both Hopper and developer-written harnesses, respectively—with zero false-positive crashes. Of the 41 new vulnerabilities found by OGHARN, all 41 are confirmed by developers—40 of which are since fixed—with many found in APIs that, until now, lacked harnesses whatsoever.

Tags: "Testing and Quality"

Ruchit Rawal, Victor-Alexandru Pădurean, Sven Apel, Adish Singla, Mariya Toneva, "Hints Help Finding and Fixing Bugs Differently in Python and Text-based Program Representations"

Abstract: With the recent advances in AI programming assistants such as GitHub Copilot, programming is not limited to classical programming languages anymore--programming tasks can also be expressed and solved by end-users in natural text. Despite the availability of this new programming modality, users still face difficulties with algorithmic understanding and program debugging. One promising approach to support end-users is to provide hints to help them find and fix bugs while forming and improving their programming capabilities. While it is plausible that hints can help, it is unclear which type of hint is helpful and how this depends on program representations (classic source code or a textual representation) and the user's capability of understanding the algorithmic task. To understand the role of hints in this space, we conduct a large-scale crowd-sourced study involving 753 participants investigating the effect of three types of hints (test cases, conceptual, and detailed), across two program representations (Python and text-based), and two groups of users (with clear understanding or confusion about the algorithmic task). We find that the program representation (Python vs. text) has a significant influence on the users' accuracy at finding and fixing bugs. Surprisingly, users are more accurate at finding and fixing bugs when they see the program in natural text. Hints are generally helpful in improving accuracy, but different hints help differently depending on the program representation and the user's understanding of the algorithmic task. These findings have implications for designing next-generation programming tools that provide personalized support to users, for example, by adapting the programming modality and providing hints with respect to the user's skill level and understanding.

Tags: "Human/Social"

Satyaki Das, Syeda Tasnim Fabiha, Saad Shafiq, Nenad Medvidovic, "Are We Learning the Right Features? A Framework for Evaluating DL-Based Software Vulnerability Detection Solutions"

Abstract: Recent research has revealed that the reported results of an emerging body of deep learning-based techniques for detecting software vulnerabilities are not reproducible, either across different datasets or on unseen samples. This paper aims to provide the foundation for properly evaluating the research in this domain. We do so by analyzing prior work and existing vulnerability datasets for the syntactic and semantic features of code that contribute to vulnerability, as well as features that falsely correlate with vulnerability. We provide a novel, uniform representation to capture both sets of features, and use this representation to detect the presence of both vulnerability and spurious features in code. To this end, we design two types of code perturbations: feature preserving perturbations (FPP) ensure that the vulnerability feature remains in a given code sample, while feature eliminating perturbations (FEP) eliminate the feature from the code sample. These perturbations aim to measure the influence of spurious and vulnerability features on the predictions of a given vulnerability detection solution. To evaluate how the two classes of perturbations influence predictions, we conducted a large-scale empirical study on five state-of-the-art DL-based vulnerability detectors. Our study shows that, for vulnerability features, only ~2% of FPPs yield the undesirable effect of a prediction changing among the five detectors on average. However, on average, ~84% of FEPs yield the undesirable effect of retaining the vulnerability predictions. For spurious features, we observed that FPPs yielded a drop in recall up to 29% for graph-based detectors. We present the reasons underlying these results and suggest strategies for improving DNN-based vulnerability detectors. We provide our perturbation-based evaluation framework as a public resource to enable independent future evaluation of vulnerability detectors.

Tags: "Security", "AI for SE"

Aashish Yadavally, Xiaokai Rong, Phat Nguyen, Tien Nguyen, "Large Language Models for Safe Minimization"

Abstract: Several tasks in program analysis, verification, and testing are modeled as constraint solving problems, utilizing SMT solvers as the reasoning engine. In this work, we aim to investigate the reasoning capabilities of large language models (LLMs) toward reducing the size of an infeasible string constraint system by exploiting inter-constraint interactions such that the remaining ones are still unsatisfiable. We term this safe minimization. Motivated by preliminary observations of hallucination and error propagation in LLMs, we design SafeMin, a framework leveraging an LLM and SMT solver in tandem to ensure a safe and correct minimization. We test the applicability of our approach on string benchmarks from LeetCode in the computation of minimal unsatisfiable subsets (MUSes). We observed that SafeMin helps safely minimize 94.3% of these constraints, with an average minimization ratio of 98% relative to the MUSes. In addition, we assess SafeMin's capabilities in partially enumerating non-unique MUSes, which is baked into our approach via a "sample-and-enumerate'" decoding strategy. Overall, we captured 42.1% more non-unique MUSes than without such LLM-based macro-reasoning. Finally, we demonstrate SafeMin's usefulness in detecting infeasible paths in programs.

Tags: "AI for SE"

Annibale Panichella, "Metamorphic-Based Many-Objective Distillation of LLMs for Code-related Tasks"

Abstract: Knowledge distillation compresses large language models (LLMs) into more compact and efficient versions that achieve similar accuracy on code-related tasks. However, as we demonstrate in this study, compressed models are four times less robust than the original LLMs when evaluated with metamorphic code. They have a 440% higher probability of misclassifying code clones due to minor changes in the code fragment under analysis, such as replacing parameter names with synonyms. To address this issue, we propose MORPH, a method that combines metamorphic testing with many-objective optimization for a robust distillation of LLMs for code. MORPH efficiently explores the models’ configuration space and generates Paretooptimal models that effectively balance accuracy, efficiency, and robustness to metamorphic code. Metamorphic testing measures robustness as the number of code fragments for which a model incorrectly makes different predictions between the original and their equivalent metamorphic variants (prediction flips). We evaluate MORPH on two tasks—code clone and vulnerability detection—targeting CodeBERT and GraphCodeBERT for distillation. Our comparison includes MORPH, the state-of-theart distillation method AVATAR, and the fine-tuned non-distilled LLMs. Compared to AVATAR, MORPH produces compressed models that are (i) 47% more robust, (ii) 25% more efficient (fewer FLOPs), while maintaining (iii) equal or higher accuracy (up to +6%), and (iv) similar model size.

Tags: "AI for SE", "Testing and Quality"

Mengxia Ren, Anhao Xiang, Chuan Yue, "Analyzing the Feasibility of Adopting Google's Nonce-Based CSP Solutions on Websites"

Abstract: Content Security Policy (CSP) is a leading security mechanism for mitigating content injection attacks such as Cross-Site Scripting (XSS). Nevertheless, despite efforts from academia and industry, CSP policies (in short, CSPs) are not widely deployed on websites, and deployed CSPs often have security issues or errors. Such low and insecure CSP deployment problems are mainly due to the complexity of the CSP mechanism. Google recently proposed four nonce-based CSP solutions which are simpler and more secure compared to traditional whitelisting-based CSP solutions. Google successfully deployed their nonce- based CSP solutions on over 160 services, covering 62% of all outgoing Google traffic. These nonce-based CSP solutions use simple CSPs but provide fine-grained control of web resources; therefore, if widely adopted on many other websites, they can be very helpful on addressing the low and insecure CSP deployment problems. In this paper, we evaluate the feasibility of adopting Google's nonce-based CSP solutions on the Tranco top 10K websites. We construct a crawling tool to automatically visit websites, simulate user interactions, and insert four CSPs to collect the CSP violations triggered under them. We investigate the adoptability of the nonce-based CSP solutions, adoption issues, and the stability of adopting them on websites by analyzing the CSP violations triggered under the inserted CSPs. We found that most websites can adopt the nonce-based CSP solutions on all their webpages visited in our study. For websites that cannot, usually the adoption is hard on around 40% of their webpages. Overall, our results are very encouraging and can be helpful in promoting the proper deployment of CSPs on many websites.

Tags: "Security", "Testing and Quality"

Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, Tianyi Zhang, "Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models"

Abstract: Large Language Models (LLMs) have demonstrated unprecedented capabilities in code generation. However, there remains a limited understanding of code generation errors that LLMs can produce. To bridge the gap, we conducted an in-depth analysis of code generation errors across six representative LLMs on the HumanEval dataset. Specifically, we first employed open coding and thematic analysis to distill a comprehensive taxonomy of code generation errors. We analyzed two dimensions of error characteristics---semantic characteristics and syntactic characteristics. Our analysis revealed that LLMs often made non-trivial, multi-line code generation errors in various locations and with various root causes. We further analyzed the correlation between these errors and task complexity as well as test pass rate. Our findings highlight several challenges in locating and fixing code generation errors made by LLMs. In the end, we discussed several future directions to address these challenges.

Tags: "AI for SE", "MSR"

Zexiang Zhang, Gaoning Pan, Ruipeng Wang, Yiming Tao, Zulie Pan, Unknown, Min Zhang, Yang Li, Yi Shen, Chunming Wu, "InSVDF: Interface-State-Aware Virtual Device Fuzzing"

Abstract: Hypervisor is the core technology of virtualization for emulating independent hardware resources for each virtual machine. Virtual devices serve as the main interface of the hypervisor, making the security of virtual devices crucial, as any vulnerabilities can impact the entire virtualization environment and pose a threat to the host machine's security. Direct Memory Access (DMA) is the interface of virtual devices, enabling communication with the host machine. Recently, many efforts have focused on fuzzing against DMA to discover the hypervisor's vulnerabilities. However, the lack of sensitivity to the DMA state causes these efforts to be hindered in efficiency during fuzzing. Specifically, there are two main issues: the uncertain interaction moment and the unclear interaction depth. In this paper, we introduce InSVDF, a DMA interface state-aware fuzzing engine. InSVDF first models the intra-interface state of the DMA interface and incorporates an asynchrony-aware state snapshot mechanism along with a depth-aware seed preservation mechanism. To validate our approach, we compare InSVDF with a state-of-the-art fuzzer. The results demonstrate that InSVDF significantly enhances vulnerability discovery speed, with improvements of up to 24.2x in the best case. Furthermore, InSVDF has identified 2 new vulnerabilities, one of which has been assigned a CVE ID.

Tags: "Testing and Quality", "Security"

Heng Yong, Zhong Li, Minxue Pan, Tian Zhang, Jianhua Zhao, Xuandong Li, "GVI: Guided Vulnerability Imagination for Boosting Deep Vulnerability Detectors"

Abstract: The use of deep learning to achieve automated software vulnerability detection has been a longstanding interest within the software security community. These deep vulnerability detectors are mostly trained in a supervised manner, which heavily relies on large-scale, high-quality vulnerability datasets. However, the vulnerability datasets used to train deep vulnerability detectors frequently exhibit class imbalance due to the inherent nature of vulnerability data, where vulnerable cases are significantly rarer than non-vulnerable cases. This imbalance adversely affects the effectiveness of these detectors. A promising solution to address the class imbalance problem is to artificially generate vulnerable samples to enhance vulnerability datasets, yet existing vulnerability generation techniques are not satisfactory due to their inadequate representation of real-world vulnerabilities or their reliance on large-scale vulnerable samples for training the generation model. This paper proposes GVI, a novel approach aimed at generating vulnerable samples to boost deep vulnerability detectors. GVI takes inspiration from human learning with imagination and proposes exploring LLMs to imagine and create new, informative vulnerable samples from given seed vulnerabilities. Specifically, we design a Chain-of-Thought inspired prompt in GVI that instructs the LLMs to first analyze the seed to retrieve attributes related to vulnerabilities and then generate a set of vulnerabilities based on the seed’s attributes. Our extensive experiments on three vulnerability datasets (i.e., Devign, ReVeal, and BigVul) and across three deep vulnerability detectors (i.e., Devign, ReVeal, and LineVul) demonstrate that the vulnerable samples generated by GVI are not only more accurate but also more effective in enhancing the performance of deep vulnerability detectors.

Tags: "AI for SE", "Security"

Menglong Chen, Tian Tan, Minxue Pan, Yue Li, "PacDroid: A Pointer-Analysis-Centric Framework for Security Vulnerabilities in Android Apps"

Abstract: General frameworks such as FlowDroid, IccTA, P/Taint, Amandroid, and DroidSafe have significantly advanced the development of static analysis tools for Android security by providing fundamental facilities for them. However, while these frameworks have been instrumental in fostering progress, they often operate with inherent inefficiencies, such as redundant computations, reliance on separate tools, and unnecessary complexity, which are rarely scrutinized by the analysis tools that depend on them. This paper introduces PacDroid, a new static analysis framework for detecting security vulnerabilities in Android apps. PacDroid employs a simple yet effective pointer-analysis-centric approach that naturally manages alias information, interprocedural value propagation, and all Android features it supports (including ICC, lifecycles, and miscs), in a unified manner. Our extensive evaluation reveals that PacDroid not only outperforms state-of-the-art frameworks in achieving a superior trade-off between soundness and precision (F-measure) but also surpasses them in both analysis speed and robustness; moreover, PacDroid successfully identifies 77 real security vulnerability flows across 23 real-world Android apps that were missed by all other frameworks. With its ease of extension and provision of essential facilities, PacDroid is expected to serve as a foundational framework for various future analysis applications for Android.

Tags: "Security", "Mobile SW"

Chunhao Dong, Yanjie Jiang, Yuxia Zhang, Yang Zhang, Hui Liu, "ChatGPT-Based Test Generation for Refactoring Engines Enhanced by Feature Analysis on Examples"

Abstract: Software refactoring is widely employed to improve software quality. However, conducting refactorings manually is tedious, time-consuming, and error-prone. Consequently, automated and semi-automated tool support is highly desirable for software refactoring in the industry, and most of the main-stream IDEs provide powerful tool support for refactoring. However, complex refactoring engines are prone to errors, which in turn may result in imperfect and incorrect refactorings. To this end, in this paper, we propose a ChatGPT-based approach to testing refactoring engines. We first manually analyze bug reports and test cases associated with refactoring engines, and construct a feature library containing fine-grained features that may trigger defects in refactoring engines. The approach automatically generates prompts according to both predefined prompt templates and features randomly selected from the feature library, requesting ChatGPT to generate test programs with the requested features. Test programs generated by ChatGPT are then forwarded to multiple refactoring engines for differential testing. To the best of our knowledge, it is the first approach in testing refactoring engines that guides test program generation with features derived from existing bugs. It is also the first approach in this line that exploits LLMs in the generation of test programs. Our initial evaluation of four main-stream refactoring engines suggests that the proposed approach is effective. It identified a total of 115 previously unknown bugs besides 28 inconsistent refactoring behaviors among different engines. Among the 115 bugs, 78 have been manually confirmed by the original developers of the tested engines, i.e., IntelliJ IDEA, Eclipse, VScode-Java, and NetBeans.

Tags: "Prog Comprehension/Reeng/Maint", "AI for SE"

Minghua He, Tong Jia, Chiming Duan, Huaqian Cai, Ying Li, Gang Huang, "Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance Learning"

Abstract: Log-based anomaly detection is essential for maintaining software availability. However, existing log-based anomaly detection approaches heavily rely on fine-grained exact labels of log entries which are very hard to obtain in real-world systems. This brings a key problem that anomaly detection models require supervision signals while labeled log entries are unavailable. Facing this problem, we propose a new labeling strategy called inexact labeling that instead of labeling an log entry, system experts can label a bag of log entries in a time span. Furthermore, we propose MIDLog, a weakly supervised log-based anomaly detection approach with inexact labels. We leverage the multi-instance learning paradigm to achieve explicit separation of anomalous log entries from the inexact labeled anomalous log set so as to deduce exact anomalous log labels from inexact labeled log sets. Extensive evaluation on three public datasets shows that our approach achieves an F1 score of over 85\% with inexact labels.

Tags: "AI for SE", "Security"

Shuai Wang, Yinan Yu, Robert Feldt, Dhasarathy Parthasarathy, "Automating a Complete Software Test Process Using LLMs: An Automotive Case Study"

Abstract: Vehicle API testing verifies whether the interactions between a vehicle's internal systems and external applications meet expectations, ensuring that users can access and control various vehicle functions and data. However, this task is inherently complex, requiring the alignment and coordination of API systems, communication protocols, and even vehicle simulation systems to develop valid test cases. In practical industrial scenarios, inconsistencies, ambiguities, and interdependencies across various documents and system specifications pose significant challenges. This paper presents a system designed for the automated testing of in-vehicle APIs. By clearly defining and segmenting the testing process, we enable Large Language Models (LLMs) to focus on specific tasks, ensuring a stable and controlled testing workflow. Experiments conducted on over 100 APIs demonstrate that our system effectively automates vehicle API testing. The results also confirm that LLMs can efficiently handle mundane tasks requiring human judgment, making them suitable for complete automation in similar industrial contexts.

Tags: "Testing and Quality", "Autonomy"

Yicheng Wang, Wensheng Dou, Yu Liang, Yi Wang, Wei Wang, Jun Wei, Tao Huang, "Evaluating Garbage Collection Performance Across Managed Language Runtimes"

Abstract: Modern managed language runtimes (e.g., Java, Go and C#) rely on garbage collection (GC) mechanisms to automatically allocate and reclaim in-memory objects. The efficiency of GC implementations can greatly impact the overall performance of runtime-based applications. To improve GC performance, the academic and industrial communities have proposed several approaches to evaluate the GC implementations in an individual runtime. However, these approaches target a specific managed language (e.g., Java), and cannot be used to compare the GC implementations in different runtimes. In this paper, we propose GEAR, an automated approach to construct consistent GC workloads for different managed language runtimes, which can further be used to evaluate GC implementations across different runtimes. Specifically, we design a group of runtime-agnostic Memory Operation Primitives (MOP), which can portray the memory usage information that influences GC. GEAR can further automatically convert a MOP program into runtime-specific programs for the target runtimes, which serve as a consistent GC workload for different runtimes. To build MOP programs with real-world GC workloads, we instrument the commonly-used runtime Java Virtual Machine (JVM) to collect the memory operation trace during a Java application’s execution, and then transform the memory operation trace into a MOP program. The experimental result on three widely-used runtimes (i.e., Java, Go and C#) shows that GEAR can generate consistent GC workloads for different runtimes. We further conduct a comprehensive study on these three runtimes, and reveal some interesting findings about their GC performance, providing useful guidance for improving their GC implementations.

Tags: "Analysis"

Lili Quan, Tianlin Li, Xiaofei Xie, Zhenpeng Chen, Sen Chen, Lingxiao Jiang, Xiaohong Li, "Dissecting Global Search: A Simple yet Effective Method to Boost Individual Discrimination Testing and Repair"

Abstract: Deep Learning (DL) has achieved significant success in socially critical decision-making applications but often exhibits unfair behaviors, raising social concerns. Among these unfair behaviors, individual discrimination—examining inequalities between instance pairs with identical profiles differing only in sensitive attributes such as gender, race, and age—is extremely socially impactful. Existing methods have made significant and commendable efforts in testing individual discrimination before deployment. However, their efficiency and effectiveness remain limited, particularly when evaluating relatively fairer models. It remains unclear which phase of the existing testing framework (global or local) is the primary bottleneck limiting performance. Facing the above issues, we first identify that enhancing the global phase consistently improves overall testing effectiveness compared to enhancing the local phase. This motivates us to propose Genetic-Random Fairness Testing (GRFT), an effective and efficient method. In the global phase, we use a genetic algorithm to guide the search for more global discriminatory instances. In the local phase, we apply a light random search to explore the neighbors of these instances, avoiding time-consuming computations. Additionally, based on the fitness score, we also propose a straightforward yet effective repair approach. For a thorough evaluation, we conduct extensive experiments involving 6 testing methods, 5 datasets, 261 models (including 5 naively trained, 64 repaired, and 192 quantized for on-device deployment), and sixteen combinations of sensitive attributes, showing the superior performance of GRFT and our repair method.

Tags: "SE for AI"

Youpeng Ma, Tao Chen, Ke Li, "Faster Configuration Performance Bug Testing with Neural Dual-level Prioritization"

Abstract: As software systems become more complex and configurable, more performance problems tend to arise from the configuration designs. This has caused some configuration options to unexpectedly degrade performance which deviates from their original expectations designed by the developers. Such discrepancies, namely configuration performance bugs (CPBugs), are devastating and can be deeply hidden in the source code. Yet, efficiently testing CPBugs is difficult, not only due to the test oracle is hard to set, but also because the configuration measurement is expensive and there are simply too many possible configurations to test. As such, existing testing tools suffer from lengthy runtime or have been ineffective in detecting CPBugs when the budget is limited, compounded by inaccurate test oracle. In this paper, we seek to achieve significantly faster CPBug testing by neurally prioritizing the testing at both the configuration option and value range levels with automated oracle estimation. Our proposed tool, dubbed NDP, is a general framework that works with different heuristic generators. The idea is to leverage two neural language models: one to estimate the CPBug types that serve as the oracle while, more vitally, the other to infer the probabilities of an option being CPBug-related, based on which the options and the value ranges to be searched can be prioritized. Experiments on several widely-used systems of different versions reveal that NDP can, in general, better predict CPBug type in 87% cases and find more CPBugs with up to 88.88$\times$ testing efficiency speedup over the state-of-the-art tools.

Tags: "AI for SE", "Testing and Quality"

Kevin Hermann, Sven Peldszus, Jan-Philipp Steghöfer, Thorsten Berger, "An Exploratory Study on the Engineering of Security Features"

Abstract: Software security is of utmost importance for most software systems. Developers must systematically select, plan, design, implement, and especially maintain and evolve security features - functionalities to mitigate attacks or protect personal data such as cryptography or access control, to ensure the security of their software. While security features are usually available in libraries, additional code needs to be written and maintained to integrate security features and not all desired features can be reused this way. While there have been studies on the use of such libraries, surprisingly little is known about how developers engineer security features, how they select what security features to implement, and the implications on maintenance. We therefore currently rely on assumptions that are largely based on common sense or individual examples. However, researchers require hard empirical data to understand what practitioners need and how they view security, which we currently lack to provide them with effective solutions. We contribute an exploratory study with 26 knowledgeable industrial participants. We study how security features of software systems are selected and engineered in practice, what their code-level characteristics are, and the challenges practitioners face. Based on the empirical data gathered, we validate four common assumptions and gain insights into engineering practices.

Tags: "Security", "Design/Architecture"

Martin Beisel, Johanna Barzen, Frank Leymann, Lavinia Stiliadou, Daniel Vietz, Benjamin Weder, "Pattern-based Generation and Adaptation of Quantum Workflows"

Abstract: Building quantum applications requires deep knowledge of quantum computing and software engineering. Hence, an abstraction layer reducing the complexity for non-experts is needed. Patterns are an established concept for the abstract description of proven solutions to recurring problems. Therefore, the quantum computing patterns, a pattern language for the quantum computing domain, can be used to define the building blocks and the structure of hybrid quantum applications. Furthermore, concrete software artifacts can be associated with patterns to solve the corresponding problem. However, these software artifacts are usually heterogeneous, e.g., using different data formats. Quantum workflows enable a robust and scalable orchestration of these heterogeneous software artifacts. However, manually modeling and configuring such quantum workflows is a complex, error-prone, and time-consuming task. To overcome this issue, we present an approach that automates the generation and adaptation of quantum workflows using the quantum computing patterns. We provide an architecture realizing our approach, a corresponding prototype, as well as an evaluation comprising different use cases, a runtime comparison, and a user study.

Tags: "Design/Architecture", "Prog Comprehension/Reeng/Maint"

Beatriz Souza, Michael Pradel, "Treefix: Enabling Execution with a Tree of Prefixes"

Abstract: The ability to execute code is a prerequisite for various dynamic program analyses. Learning-guided execution has been proposed as an approach to enable the execution of arbitrary code snippets by letting a neural model predict likely values for any missing variables. Although state-of-the-art learning-guided execution approaches, such as LExecutor, can enable the execution of a relative high amount of code, they are limited to predicting a restricted set of possible values and do not use any feedback from previous executions to execute even more code. This paper presents Treefix, a novel learning-guided execution approach that leverages LLMs to iteratively create code prefixes that enable the execution of a given code snippet. The approach addresses the problem in a multi-step fashion, where each step uses feedback about the code snippet and its execution to instruct an LLM to improve a previously generated prefix. This process iteratively creates a tree of prefixes, a subset of which is returned to the user as prefixes that maximize the number of executed lines in the code snippet. In our experiments with two datasets of Python code snippets, Treefix achieves 25% and 7% more coverage relative to the current state of the art in learning- guided execution, covering a total of 84% and 82% of all lines in the code snippets.

Tags: "Testing and Quality", "AI for SE"

Venkata Sai Aswath Duvvuru, Bohan Zhang, Michael Vierhauser, Ankit Agrawal, "LLM-Agents Driven Automated Simulation Testing and Analysis of small Uncrewed Aerial Systems"

Abstract: Thorough simulation testing is crucial for validating the correct behavior of small Uncrewed Aerial Systems (sUAS) across multiple scenarios, including adverse weather conditions (such as wind, and fog), diverse settings (hilly terrain, or urban areas), and varying mission profiles (surveillance, tracking). While various sUAS simulation tools exist to support developers, the entire process of creating, executing, and analyzing simulation tests remains a largely manual and cumbersome task. Developers must identify test scenarios, set up the simulation environment, integrate the System under Test (SuT) with simulation tools, formulate mission plans, and collect and analyze results. These labor-intensive tasks limit the ability of developers to conduct exhaustive testing across a wide range of scenarios. To alleviate this problem, in this paper, we propose AUTOSIMTEST, a Large Language Model (LLM)-driven framework, where multiple LLM agents collaborate to support the sUAS simulation testing process. This includes: (1) creating test scenarios that subject the SuT to unique environmental contexts; (2) preparing the simulation environment as per the test scenario; (3) generating diverse sUAS missions for the SuT to execute; and (4) automatically analyzing simulation results and providing an interactive analytics interface. Further, the design of the framework is flexible for creating and testing scenarios for a variety of sUAS use cases, simulation tools, and SuT input requirements. We evaluated our approach by (a) conducting simulation testing of PX4 and ArduPilot flight-controller-based SuTs, (b) analyzing the performance of each agent, and (c) gathering feedback from sUAS developers. Our findings indicate that AUTOSIMTEST significantly improves the efficiency and scope of the sUAS testing process, allowing for more comprehensive and varied scenario evaluations while reducing the manual effort.

Tags: "AI for SE", "Autonomy"

Hongyuan Liang, Yue Huang, Tao Chen, "The Same Only Different: On Information Modality for Configuration Performance Analysis"

Abstract: Configuration in software systems helps to ensure efficient operation and meet diverse user needs. Yet, some, if not all, configuration options have profound implications for the system's performance. Configuration performance analysis, wherein the key is to understand (or infer) the configuration options' relations and their impacts on performance, is crucial. Two major modalities exist that serve as the source information in the analysis: either the manual or source code. However, it remains unclear what roles they play in configuration performance analysis. Much work that relies on manuals claims their benefits of information richness and naturalness; while work that trusts the source code more prefers the structural information provided therein and criticizes the timeliness of manuals. To fill such a gap, in this paper, we conduct an extensive empirical study over 10 systems, covering 1,694 options, 106,798 words in the manual, and 22,859,552 lines-of-code for investigating the usefulness of manual and code in two important tasks of configuration performance analysis, namely performance-sensitive options identification and the associated dependencies extraction. We reveal several new findings and insights, such as it is beneficial to fuse the manual and code modalities for both tasks; the current automated tools that rely on a single modality are far from being practically useful and generally remain incomparable to human analysis. All those pave the way for further advancing configuration performance analysis.

Tags: "Design/Architecture", "Testing and Quality"

Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu, Michael Lyu, "COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge"

Abstract: Runtime failures are commonplace in modern distributed systems. When such issues arise, users often turn to platforms such as Github or JIRA to report them and request assistance. Automatically identifying the root cause of these failures is critical for ensuring high reliability and availability. However, prevailing automatic root cause analysis (RCA) approaches rely significantly on comprehensive runtime monitoring data, which is often not fully available in issue platforms. Recent methods leverage large language models (LLMs) to analyze issue reports, but their effectiveness is limited by incomplete or ambiguous user-provided information. To obtain more accurate and comprehensive RCA results, the core idea of this work is to extract additional diagnostic clues from code to supplement data-limited issue reports. Specifically, we propose COCA, a code knowledge enhanced root cause analysis approach for issue reports. Based on the data within issue reports, COCA intelligently extracts relevant code snippets and reconstructs execution paths, providing a comprehensive execution context for further RCA. Subsequently, COCA construct a prompt combining historical issue reports along with profiled code knowledge, enabling the LLMs to generate detailed root cause summaries and localize responsible components. Our evaluation on datasets from five real-world distributed systems demonstrates that COCA significantly outperforms existing methods, achieving a 28.3% improvement in root cause localization and a 22.0% improvement in root cause summarization. Furthermore, COCA's performance consistency across various LLMs underscores its robust generalizability.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Luciano Baresi, Davide Yi Xian Hu, Andrea Stocco, Paolo Tonella, "Efficient Domain Augmentation for Autonomous Driving Testing Using Diffusion Models"

Abstract: Simulation-based testing is widely used to assess the reliability of Autonomous Driving Systems (ADS), but its effectiveness is limited by the operational design domain (ODD) conditions available in such simulators. To address this limitation, in this work, we explore the integration of generative artificial intelligence techniques with physics-based simulators to enhance ADS system-level testing. Our study evaluates the effectiveness and computational overhead of three generative strategies based on diffusion models, namely instruction-editing, inpainting, and inpainting with refinement. Specifically, we assess these techniques' capabilities to produce augmented simulator-generated images of driving scenarios representing new ODDs. We employ a novel automated detector for invalid inputs based on semantic segmentation to ensure semantic preservation and realism of the neural generated images. We then perform system-level testing to evaluate the ADS's generalization ability to newly synthesized ODDs. Our findings show that diffusion models help increase the ODD coverage for system-level testing of ADS. Our automated semantic validator achieved a percentage of false positives as low as 3\%, retaining the correctness and quality of the generated images for testing. Our approach successfully identified new ADS system failures before real-world testing.

Tags: "Testing and Quality", "Autonomy"

Shiyao Zhou, Jincheng Wang, He Ye, Hao Zhou, Claire Le Goues, Xiapu Luo, "LWDIFF: An LLM-Assisted Differential Testing Framework for WebAssembly Runtimes"

Abstract: WebAssembly (Wasm) runtimes execute Wasm programs, a popular low-level language for efficiently executing high-level languages in browsers, with broad applications across diverse domains. The correctness of those runtimes is critical for both functionality and security of Wasm execution, motivating testing approaches that target Wasm runtimes specifically. However, existing Wasm testing frameworks fail to generate test cases that effectively test all three phases of runtime, i.e., decoding, validation, and execution. To address this research gap, we propose a new differential testing framework for Wasm runtimes, which leverages knowledge from the Wasm language specification that prior techniques overlooked, enhancing comprehensive testing of runtime functionality. Specifically, we first use a large language model to extract that knowledge from the specification. We use that knowledge in the context of multiple novel mutation operators that generate test cases with diverse features to test all three runtime phases. We evaluate LWDIFF by applying it to eight Wasm runtimes. Compared with the state-of-the-art Wasm testers, LWDIFF achieves the highest branch coverage and identifies the largest number of bugs. In total, LWDIFF discovers 31 bugs across eight runtimes, all of which are confirmed, with 25 of them previously undiscovered.

Tags: "Testing and Quality"

Taha Shabani, Noor Nashid, Parsa Alian, Ali Mesbah, "Dockerfile Flakiness: Characterization and Repair"

Abstract: Dockerfile flakiness—unpredictable temporal build failures caused by external dependencies and evolving environments—undermines deployment reliability and increases debugging overhead. Unlike traditional Dockerfile issues, flakiness occurs without modifications to the Dockerfile itself, complicating its resolution. In this work, we present the first comprehensive study of Dockerfile flakiness, featuring a nine-month analysis of 8,132 Dockerized projects, revealing that around 10% exhibit flaky behavior. We propose a taxonomy categorizing common flakiness causes, including dependency errors and server connectivity issues. Existing tools fail to effectively address these challenges due to their reliance on pre-defined rules and limited generalizability. To overcome these limitations, we introduce FLAKIDOCK, a novel repair framework combining static and dynamic analysis, similarity retrieval, and an iterative feedback loop powered by Large Language Models (LLMs). Our evaluation demonstrates that FLAKIDOCK achieves a repair accuracy of 73.55%, significantly surpassing state-of-the-art tools and baselines.

Tags: "Analysis"

Ziqiao Kong, Shaohua Li, Heqing Huang, Zhendong Su, "SAND: Decoupling Sanitization from Fuzzing for Low Overhead"

Abstract: Sanitizers provide robust test oracles for various vulnerabilities. Fuzzing on sanitizer-enabled programs has been the best practice to find software bugs. Since sanitizers require heavy program instrumentation to insert run-time checks, sanitizer-enabled programs have much higher overhead compared to normally built programs. In this paper, we present SAND, a new fuzzing framework that decouples sanitization from the fuzzing loop. SAND performs fuzzing on a normally built program and only invokes the sanitizer-enabled program when input is shown to be interesting. Since most of the generated inputs are not interesting, i.e., not bug-triggering, SAND allows most of the fuzzing time to be spent on the normally built program. We further introduce execution pattern to practically and effectively identify interesting inputs. We implement SAND on top of AFL++ and evaluate it on 20 real-world programs. Our extensive evaluation highlights its effectiveness: in 24 hours, compared to all the baseline fuzzers, SAND significantly discovers more bugs while not missing any.

Tags: "Testing and Quality", "Security"

Shuyin Ouyang, Jie Zhang, Zeyu Sun, Albert Merono Penuela, "Knowledge-Enhanced Program Repair for Data Science Code"

Abstract: This paper introduces DSrepair, a knowledge-enhanced program repair method designed to repair the buggy code generated by LLMs in the data science domain. DSrepair uses knowledge graph based RAG for API knowledge retrieval as well as bug knowledge enrichment to construct repair prompts for LLMs. Specifically, to enable knowledge graph based API retrieval, we construct DS-KG (Data Science Knowledge Graph) for widely used data science libraries. For bug knowledge enrichment, we employ an abstract syntax tree (AST) to localize errors at the AST node level. DSrepair's effectiveness is evaluated against five state-of-the-art LLM-based repair baselines using four advanced LLMs on the DS-1000 dataset. The results show that DSrepair surpasses all five baselines. Specifically, when compared to the second-best baseline, DSrepair demonstrates significant improvements, fixing 44.4%, 14.2%, 20.6%, and 32.1% more buggy code snippets for each of the four evaluated LLMs, respectively. Additionally, it achieves greater efficiency, reducing the number of tokens required per code task by 17.49%, 34.24%, 24.71%, and 17.59%, respectively.

Tags: "AI for SE", "Analysis"

Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, Zibin Zheng, "HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation"

Abstract: To evaluate the repository-level code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation methods have been developed. These methods typically leverage contextual code from the latest version of a project to assist LLMs in accurately generating the desired function. However, such evaluation methods fail to consider the dynamic evolution of software projects over time, which we refer to as evolution-ignored settings. This in turn results in inaccurate evaluation of LLMs' performance. In this paper, we conduct an empirical study to deeply understand LLMs' code generation performance within settings that reflect the evolution nature of software development. To achieve this, we first construct an evolution-aware repository-level code generation dataset, namely HumanEvo, equipped with an automated execution-based evaluation tool. Second, we manually categorize HumanEvo according to dependency levels to more comprehensively analyze the model's performance in generating functions with different dependency levels. Third, we conduct extensive experiments on HumanEvo with seven representative and diverse LLMs to verify the effectiveness of the proposed benchmark. We obtain several important findings through our experimental study. For example, we find that previous evolution-ignored evaluation methods result in inflated performance of LLMs, with performance overestimations ranging from 10.0% to 61.1% under different context acquisition methods, compared to the evolution-aware evaluation approach. Based on the findings, we give actionable suggestions for more realistic evaluation of LLMs on code generation. We also build a shared evolution-aware code generation toolbox to facilitate future research. The replication package including source code and datasets is anonymously available at https://anonymous.4open.science/r/HumanEvo/.

Tags: "AI for SE", "Prog Comprehension/Reeng/Maint"

Ziyu Mao, Jingyi Wang, Jun Sun, Shengchao Qin, Jiawen Xiong, "LLM-aided Automatic Modeling for Security Protocol Verification"

Abstract: Symbolic protocol analysis serves as a pivotal technique for protocol design, security analysis, and the safeguarding of information assets. Several modern tools such as Tamarin and ProVerif have been proven successful in modeling and verifying real-world protocols, including complex protocols like TLS 1.3 and 5G AKA. However, developing formal models for protocol verification is a non-trivial task, which hinders the wide adoption of these powerful tools in practical protocol analysis. In this work, we aim to bridge the gap by developing an automatic method for generating symbolic protocol models using Large Language Models (LLMs) from protocol descriptions in natural language document. Although LLMs are powerful in various code generation tasks, it is shown to be ineffective in generating symbolic models (according to our empirical study). Therefore, rather than applying LLMs naively, we carefully decompose the symbolic protocol modelling task into several stages so that a series of formal models are incrementally developed towards generating the final correct symbolic model. Specifically, we apply LLMs for semantic parsing, enable lightweight manual interaction for disambiguation, and develop algorithms to transform the intermediate models for final symbolic model generation. To ensure the correctness of the generated symbolic model, each stage is designed based on a formal execution model and the model transformations are proven sound. To the best of our knowledge, this is the first work aiming to generate symbolic models for protocol verification from natural language documents. We also introduce a benchmark for symbolic protocol model generation, with 18 real-world security protocol's text description and their corresponding symbolic models. We then demonstrate the potential of our tool, which successfully generated correct models of moderate scale in 10 out of 18 cases.

Tags: "Security", "Formal methods"

Sadra Sabouri, Philipp Eibl, Xinyi Zhou, Morteza Ziyadi, Nenad Medvidovic, Lars Lindemann, Souti Chattopadhyay, "Trust Dynamics in AI-Assisted Development: Definitions, Factors, and Implications"

Abstract: Software developers increasingly rely on AI code generation utilities. To ensure that "good" code is accepted into the code base and "bad" code is rejected, developers must know when to trust an AI suggestion. Understanding how developers build this intuition is crucial to enhancing developer-AI collaborative programming. In this paper, we seek to understand how developers (1) define and (2) evaluate the trustworthiness of a code suggestion and (3) how trust evolves when using AI code assistants. To answer these questions, we conducted a mixed-method study consisting of an in-depth exploratory survey with (n=29) developers followed by an observation study (n=10). We found that comprehensibility and perceived correctness were the most frequently used factors to evaluate code suggestion trustworthiness. However, the gap in developers' definition and evaluation of trust points to a lack of support for evaluating trustworthy code in real-time. We also found that developers often alter their trust decisions, keeping only 52% of original suggestions. Based on these findings, we extracted four guidelines to enhance developer-AI interactions. We validated the guidelines through a survey with (n=7) domain experts and survey members (n=8). We discuss the validated guidelines, how to apply them, and tools to help adopt them.

Tags: "Human/Social", "AI for SE"

Sigma Jahan, Mehil B Shah, Parvez Mahbub, Masud Rahman, "Improved Detection and Diagnosis of Faults in Deep Neural Networks Using Hierarchical and Explainable Classification"

Abstract: Deep Neural Networks (DNN) have found numerous applications in various domains, including fraud detection, medical diagnosis, facial recognition, and autonomous driving. However, DNN-based systems often suffer from reliability issues due to their inherent complexity and the stochastic nature of their underlying models. Unfortunately, existing techniques to detect faults in DNN programs are either limited by the types of faults (e.g., hyperparameter or layer) they support or the kind of information (e.g., dynamic or static) they use. As a result, they might fall short of comprehensively detecting and diagnosing the faults. In this paper, we present DEFault (Detect and Explain Fault) -- a novel technique to detect and diagnose faults in DNN programs. It first captures dynamic (i.e., runtime) features during model training and leverages a hierarchical classification approach to detect all major fault categories from the literature. Then, it captures static features (e.g., layer types) from DNN programs and leverages explainable AI methods (e.g., SHAP) to narrow down the root cause of the fault. We train and evaluate DEFault on a large, diverse dataset of ~14.5K DNN programs and further validate our technique using a benchmark dataset of 52 real-life faulty DNN programs. Our approach achieves ~94% recall in detecting real-world faulty DNN programs and ~63% recall in diagnosing the root causes of the faults, demonstrating 3.92%--11.54% higher performance than that of state-of-the-art techniques. Thus, DEFault has the potential to significantly improve the reliability of DNN programs by effectively detecting and diagnosing the faults.

Tags: "SE for AI", "Testing and Quality"

Yang Feng, Zheyuan Lin, Dongchen Zhao, Mengbo Zhou, Jia Liu, James A. Jones, "REDII: Test Infrastructure to Enable Deterministic Reproduction of Failures for Distributed Systems"

Abstract: Despite the fact that distributed systems have become a crucial aspect of modern technology and support many of the software systems that enable modern life, developers experience challenges in performing regression testing of these systems. Existing solutions for testing distributed systems are often either: (1) specialized testing environments that are created specifically for each system by its development team, which requires substantial effort for each team, with little-to-no sharing of this effort across teams; or (2) randomized injection tools that are often computationally expensive and offer no guarantees of preventing regressions, due to their randomness. The challenge of providing a generalized and practical solution to trigger bugs for reproducing and demonstrating failures, as well as to guard against regressions, is largely unaddressed. In this work, we present REDII, an infrastructure for supporting regression testing of distributed systems. REDII contains a dataset of real bugs on common distributed systems, along with a generalizable testing framework REDIT that enables developers to write tests that can reproduce failures by providing ways to deterministically control distributed execution. In addition to the real failures in REDII from multiple distributed systems, REDIT provides a reusable, programmable, platform-agnostic, deterministic regression-testing framework for developers of distributed systems. It can help automate the running of such tests, for both practitioners and researchers. We demonstrate REDIT with 63 bugs that we selected in JIRA on 7 large and widely used distributed systems. Our case studies show that REDII can be used to allow developers to write tests that effectively reproduce bugs on distributed systems and generate specific scenarios for regression testing, as well as providing deterministic failure injection that can help developers and researchers to better understand deterministic failures that may occur in distributed systems in the future. Additionally, our studies show that REDII is efficient for real-world system regression testing, providing a powerful tool for all participants in this area.

Tags: "Testing and Quality"

Marcos Macedo, Yuan Tian, Pengyu Nie, Filipe Cogo, Bram Adams, "InterTrans: Leveraging Transitive Intermediate Translations to Enhance LLM-based Code Translation"

Abstract: Code translation aims to convert a program from one programming language (PL) to another. This long-standing software engineering task is crucial for modernizing legacy systems, ensuring cross-platform compatibility, enhancing performance, and more. However, automating this process remains challenging due to many syntactic and semantic differences between PLs. Recent studies show that even advanced techniques such as large language models (LLMs), especially open-source LLMs, still struggle with the task. Currently, code LLMs are trained with source code from multiple programming languages, thus presenting multilingual capabilities. In this paper, we investigate whether such capabilities can be harnessed to enhance code translation. To achieve this goal, we introduce InterTrans, an LLM-based automated code translation approach that, in contrast to existing approaches, leverages intermediate translations to bridge the syntactic and semantic gaps between source and target PLs. InterTrans contains two stages. It first utilizes a novel Tree of Code Translation (ToCT) algorithm to plan transitive intermediate translation sequences between a given source and target PL, then validates them in a specific order. We evaluate InterTrans with three open LLMs on three benchmarks (i.e., CodeNet, HumanEval-X, and TransCoder) involving six PLs. Results show an absolute improvement of 18.3% to 43.3% in Computation Accuracy (CA) for InterTrans over Direct Translation with 10 attempts. The best-performing variant of InterTrans(with the Magicoder LLM) achieved an average CA of 87.3%-95.4% on three benchmarks.

Tags: "AI for SE"

Shiyu Sun, Yanhui Li, Lin Chen, Yuming Zhou, Jianhua Zhao, "Boosting Code-line-level Defect Prediction with Spectrum Information and Causality Analysis"

Abstract: Code-line-level defect prediction (CLDP) is an effective technique to incorporate comprehensive measures for buggy line identification to optimize efforts in Software Quality Assurance activities. Most CLDP methods either consider the textual information of the code or rely merely on file-level label information, which have not fully leveraged the essential information in the CLDP context, with historical \textit{code-line-level labels} being incredibly overlooked in their application. Due to the vast number of code lines and the sparsity of the tokens they contain, leveraging historical code-line-level label information remains a significant challenge. To address this issue, we propose a novel CLDP method, \textbf{S}pectrum inf\textbf{O}rmation and ca\textbf{U}sality a\textbf{N}alysis based co\textbf{D}e-line-level defect prediction ($\mathsf{SOUND}$). $\mathsf{SOUND}$ incorporates two key ideas: (a) it introduces a spectrum information perspective, utilizing labels from historical defective lines to quantify the contribution of tokens to line-level defects, and (b) it applies causal analysis to obtain a more systematic and comprehensive understanding of the causal relationships between tokens and defects. After conducting a comprehensive study involving 142 releases across 19 software projects, the experimental results show that our method significantly outperforms existing state-of-the-art (SOTA) CLDP baseline methods in terms of its ability to rank defective lines under three indicators, IFA, Recall@Top20\%LOC, and Effort@Top20\%Recall. Notably, in terms of IFA, our method achieves a score of 0 in most cases, indicating that the first line in the ranking list generated by our method is actually defective, significantly enhancing its practicality.

Tags: "Prog Comprehension/Reeng/Maint", "Testing and Quality"

Ruchira Manke, Mohammad Wardat, Foutse Khomh, Hridesh Rajan, "Mock Deep Testing: Toward Separate Development of Data and Models for Deep Learning"

Abstract: While deep learning (DL) has permeated, and become an integral component of many critical software systems, today software engineering research hasn’t explored how to separately test data and models that are integral for DL approaches to work effectively. The main challenge in independently testing these components arises from the tight dependency between data and models. This research explores this gap, introducing our methodology of mock deep testing for unit testing of DL applications. To enable unit testing, we introduce a design paradigm that decomposes the workflow into distinct, manageable components, minimizes sequential dependencies, and modularizes key stages of the DL, including data preparation and model design. For unit testing these components, we propose modeling their dependencies using mocks. In the context of DL, mocks refer to mock data and mock model that mimic the behavior of the original data and model, respectively. This modular approach facilitates independent development and testing of the components, ensuring comprehensive quality assurance throughout the development process. We have developed KUnit, a framework for enabling mock deep testing for the Keras library, a popular library for developing DL applications. We empirically evaluated KUnit to determine the effectiveness of mocks in independently testing data and models. Our assessment of 50 DL programs obtained from Stack Overflow and GitHub shows that mocks effectively identified 10 issues in the data preparation stage and 53 issues in the model design stage. We also conducted a user study with 36 participants using KUnit to perceive the effectiveness of our approach. Participants using KUnit successfully resolved 25 issues in the data preparation stage and 38 issues in the model design stage. We also found that mock objects provide a lightweight emulation of the dependencies for unit testing, facilitating early bug detection. Lastly, to evaluate the usability of KUnit, we conducted a post-study survey. The results reveal that KUnit is helpful to DL application developers, enabling them to independently test each component (data and model) and resolve issues effectively in different stages.

Tags: "SE for AI", "Testing and Quality"

Yorick Sens, Henriette Knopp, Sven Peldszus, Thorsten Berger, "A Large-Scale Study of Model Integration in ML-Enabled Software Systems"

Abstract: The rise of machine learning (ML) and its embedding in software-intensive systems has drastically changed the engineering of such systems. Traditionally, software engineering focuses on manually created artifacts, such as source code, and the process of creating them, as well as best practices for integrating them, i.e., software architectures. In contrast, the development of ML artifacts, i.e., ML models, comes from data science and focuses on the ML models and their training data. However, to deliver value to end users, these ML models must be integrated with traditional software components, often forming complex topologies. In fact, ML-enabled software can easily incorporate many different ML models. While the challenges and practices of building ML-enabled systems have been studied, little is known about the characteristics of real-world ML-enabled systems, beyond isolated examples. Properly embedding ML models in systems so that they can be easily maintained or reused is far from trivial. To improve development processes and architectures for ML-enabled systems, we need to improve our empirical understanding of these systems. We present the first large-scale study of real-world open-source ML-enabled software systems, covering over 2,928 systems on GitHub. We classified and analyzed them to determine their characteristics, as well as their practices for reusing ML models and related code, and the architecture of these systems. Practitioners and researchers benefit from insights into practices for embedding and integrating ML models, bringing data science and software engineering closer together.

Tags: "SE for AI", "Design/Architecture"

Qi Guo, Xiaofei Xie, Shangqing Liu, Ming Hu, Xiaohong Li, Lei Bu, "Intention is All You Need: Refining Your Code from Your Intention"

Abstract: This paper proposes an intention-based code refinement technique, transforming the conventional code refinement process from comment to code to intention to code. The process is decomposed into two phases: Intention Extraction and Intention Guided Code Modification Generation. Intention Extraction categorizes comments using predefined templates, while the latter employs large language models (LLMs) to generate revised code based on these defined intentions. Three categories with eight subcategories are designed for comment transformation, followed by a hybrid approach that combines rule-based and LLM-based classifiers for accurate classification. Extensive experiments with five LLMs (GPT4o, GPT3.5, DeepSeekV2, DeepSeek7B, CodeQwen7B) under different prompting settings demonstrate that our approach achieves 79% accuracy in intention extraction and up to 66% in code refinement generation. Our results underscore the potential of this approach in enhancing data quality and improving code refinement processes.

Tags: "AI for SE"

Yongchao Wang, Yuandao Ryan Cai, Charles Zhang, "Boosting Path-Sensitive Value Flow Analysis via Removal of Redundant Summaries"

Abstract: Value flow analysis that tracks the flow of values via data dependence is a widely used technique for detecting a broad spectrum of software bugs. However, the scalability issue often deteriorates when high precision (i.e., path-sensitivity) is required, as the instantiation of function summaries becomes excessively time- and memory-intensive. The primary culprit, as we observe, is the existence of redundant computations resulting from blindly computing summaries for a function, irrespective of whether they are related to bugs being checked. To address this problem, we present the first approach that can effectively identify and eliminate redundant summaries, thereby reducing the size of collected summaries from callee functions without compromising soundness or efficiency. Our evaluation on large programs demonstrates that our identification algorithm can significantly reduce the time and memory overhead of the state-of-the-art value flow analysis by 45\% and 27\%, respectively. Furthermore, the identification algorithm demonstrates remarkable efficiency by identifying nearly 80\% of redundant summaries while incurring a minimal additional overhead. In the largest \textit{mysqld} project, the identification algorithm reduces the time by 8107 seconds (2.25 hours) with a mere 17.31 seconds of additional overhead, leading to a ratio of time savings to paid overhead (i.e., performance gain) of 468.48 $\times$. In total, our method attains an average performance gain of 632.1 $\times$.

Tags: "Analysis"

Zeinadsadat Saghi, Thomas Zimmermann, Souti Chattopadhyay, "Code Today, Deadline Tomorrow: Procrastination Among Software Developers"

Abstract: Procrastination, the action of delaying or postponing something, is a well-known phenomenon that is relatable to all. While it has been studied in academic settings, little is known about why software developers procrastinate. How does it affect their work? How can developers manage procrastination? This paper presents the first investigation of procrastination among developers. We conduct an interview study with (n=15) developers across different industries to understand the process of procrastination. Using qualitative coding, we report the positive and negative effects of procrastination and factors that triggered procrastination, as perceived by participants. We validate our findings using member checking. Our results reveal 14 negative effects of procrastination on developer productivity. However, participants also reported eight positive effects, four impacting their satisfaction. We also found that participants reported three categories of factors that trigger procrastination: task-related, personal, and external. Finally, we present 19 techniques reported by our participants and studies in other domains that can help developers mitigate the impacts of procrastination. These techniques focus on raising awareness and task focus, help with task planning, and provide pathways to generate team support as a mitigation means. Based on these findings, we discuss interventions for developers and recommendations for tool building to reduce procrastination. Our paper shows that procrastination has unique effects and factors among developers compared to other populations.

Tags: "Human/Social"

Kaiyao Ke, "NIODebugger: A Novel Approach to Repair Non-Idempotent-Outcome Tests with LLM-Based Agent"

Abstract: Flaky tests, characterized by inconsistent results across repeated executions, present significant challenges in software testing, especially during regression testing. Recently, there has been emerging research interest in non-idempotent-outcome (NIO) flaky tests—tests that pass on the initial run but fail on subsequent executions within the same environment. Despite progress in utilizing Large Language Models (LLMs) to address flaky tests, existing methods have not tackled NIO flaky tests. The limited context window of LLMs restricts their ability to incorporate relevant source code beyond the test method itself, often overlooking crucial information needed to address state pollution, which is the root cause of NIO flakiness. This paper introduces NIODebugger, the first framework to utilize an LLM-based agent for fixing flaky tests. NIODebugger features a three-phase design: detection, exploration, and fixing. In the detection phase, dynamic analysis provides critical information (such as stack traces and custom test execution logs) from multiple test runs, which helps in understanding accumulative state pollution. During the exploration phase, the LLM-based agent identifies and provides instructions for extracting relevant source code associated with test flakiness. In the fixing phase, NIODebugger repairs the tests using the information gathered from the previous phases. NIODebugger can be integrated with multiple LLMs, achieving patching success rates ranging from 11.63% to 58.72%. Its best-performing variant, NIODebugger-GPT-4, successfully generated correct patches for 101 out of 172 previously unknown NIO tests across 20 large-scale open-source projects. We submitted pull requests for all generated patches; 58 have been merged, only 1 was rejected, and the remaining 42 are pending. The implementation of NIODebugger is provided as a Maven plugin accessible at https://github.com/NIOTester/NIODebugger.

Tags: "AI for SE", "Testing and Quality"

Zifan Nan, Zhaoqiang Guo, Kui Liu, Xin Xia, "Test Intention Guided LLM-based Unit Test Generation"

Abstract: The emergence of Large Language Models (LLMs) has accelerated the progress of intelligent software engineering technologies, which brings promising possibility for unit test generation. However, existing approaches on unit tests directly generated from Large Language Models (LLMs) often prove impractical due to their low coverage and insufficient mocking capabilities. This paper proposes IntUT, a novel approach that utilizes explicit test intentions (e.g. test inputs, mock behaviors, and expected results) to effectively guide the LLM to generate high-quality test cases. Our experimental results on three industry Java projects and live study demonstrate that prompting LLM with test intention can generate high-quality test cases for developers. Specifically, it achieves the improvements on branch coverage by 94% and line coverage by 49%. Eventually, we obtain developers' feedback on using IntUT to generate cases for 3 newly Java projects with over 80% line coverage and 30% efficiency improvement on writing unit test cases.

Tags: "AI for SE", "Testing and Quality"

Zhiming Chi, Jianan Ma, Pengfei Yang, Cheng-Chao Huang, Renjue Li, Jingyi Wang, Xiaowei Huang, Lijung Zhang, "Patch Synthesis for Property Repair of Deep Neural Networks"

Abstract: Deep neural networks (DNNs) are prone to various dependability issues, such as adversarial attacks, which hinder their adoption in safety-critical domains. Recently, NN repair techniques have been proposed to address these issues while preserving original performance by locating and modifying guilty neurons and their parameters. However, existing repair approaches are often limited to specific data sets and do not provide theoretical guarantees for the effectiveness of the repairs. To address these limitations, we introduce PatchPro, a novel patch-based approach for property-level repair of DNNs, focusing on local robustness. The key idea behind PatchPro is to construct patch modules that, when integrated with the original network, provide specialized repairs for all samples within the robustness neighborhood while maintaining the network's original performance. Our method incorporates formal verification and a heuristic mechanism for allocating patch modules, enabling it to defend against adversarial attacks and generalize to other inputs. PatchPro demonstrates superior efficiency, scalability, and repair success rates compared to existing DNN repair methods, i.e., realizing provable property-level repair for 100% cases across multiple high-dimensional datasets.

Tags: "SE for AI", "Analysis/Synthesis"

Antu Saha, Oscar Chaparro, "Decoding the Issue Resolution Process In Practice via Issue Report Analysis: A Case Study of Firefox"

Abstract: Effectively managing and resolving software issues is critical for maintaining and evolving software systems. Development teams often rely on issue trackers and issue reports to track and manage the work needed during issue resolution, ranging from issue reproduction and analysis to solution design, implementation, verification, and deployment. Despite the issue resolution process being generally known in the software engineering community as a sequential list of activities, it is unknown how developers implement this process in practice and how they discuss it in issue reports. This paper aims to enhance our understanding of the issue resolution process implemented in practice by analyzing the issue reports of Mozilla Firefox. We qualitatively and quantitatively analyzed the discussions found in 356 Firefox issue reports, to identify the sequences of stages that developers go through to address various software problems. We analyzed the sequences to identify the overall resolution process at Firefox and derived a catalog of 47 patterns that represent instances of the process. We analyzed the process and patterns across multiple dimensions, including pattern complexity, issue report types, problem categories, and issue resolution times, resulting in various insights about Mozilla's issue resolution process. We discuss these findings and their implications for different stakeholders on how to better assess and improve the issue resolution process.

Tags: "Prog Comprehension/Reeng/Maint", "Human/Social"

Sehoon Kim, Yonghyeon Kim, Dahyeon Park, Yuseok Jeon, Jooyong Yi, Mijung Kim, "Lightweight Concolic Testing via Path-Condition Synthesis for Deep Learning Libraries"

Abstract: Many techniques have been recently developed for testing deep learning (DL) libraries, recently. Although these techniques have effectively improved API and code coverage and detected unknown bugs, they rely on black-box fuzzing for input generation. Concolic testing (also known as dynamic symbolic execution) can be more effective in exploring diverse execution paths, but applying it to DL libraries is extremely challenging due to their inherent complexity. In this paper, we introduce the first concolic testing technique for DL libraries. Our technique offers a lightweight approach that significantly reduces the heavy overhead associated with traditional concolic testing. While symbolic execution maintains symbolic expressions for every variable with non-concrete values to build a path condition, our technique computes approximate path conditions by inferring branch conditions via inductive program synthesis. Despite potential imprecision from approximation, our method's light overhead allows for effective exploration of diverse execution paths within the complex implementations of DL libraries. We have implemented our tool, PathFinder, and evaluated it on PyTorch and TensorFlow. Our results show that PathFinder outperforms existing API-level DL library fuzzers by achieving 57\% more branch coverage on average; up to 58\% higher than TitanFuzz and 125\% higher than FreeFuzz. PathFinder is also effective in bug detection, uncovering 61 crash bugs, 59 of which were confirmed by developers as previously unknown, with 32 already fixed.

Tags: "SE for AI", "Testing and Quality"

Research TrackICSE 2025

Program Display Configuration

Wed 30 AprDisplayed time zone: Eastern Time (US & Canada) change

Thu 1 MayDisplayed time zone: Eastern Time (US & Canada) change

Fri 2 MayDisplayed time zone: Eastern Time (US & Canada) change

Information for Participants

Information for Participants

Information for Participants

Information for Participants

Accepted Papers

Call for Papers

Accepted papers

Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Puppy: Finding Performance Degradation Bugs in DBMSs via Limited-Optimization Plan Construction"

Chun Li, Hui Li, Zhong Li, Minxue Pan, Xuandong Li, "Enhancing Fault Localization in Industrial Software Systems via Contrastive Learning"

Wenqian Deng, Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Coni: Detecting Database Connector Bugs via State-Aware Test Case Generation"

Gong Chen, Xiaoyuan Xie, Daniel Tang, Qi Xin, Wenjie Liu, "HedgeCode: A Multi-Task Hedging Contrastive Learning Framework for Code Search"

Jiashuo Zhang, Yiming Shen, Jiachi Chen, Jianzhong Su, Yanlin Wang, Ting Chen, Jianbo Gao, Zhong Chen, "Demystifying and Detecting Cryptographic Defects in Ethereum Smart Contracts"

Chijin Zhou, Quan Zhang, Bingzhou Qian, Yu Jiang, "Janus: Detecting Rendering Bugs in Web Browsers via Visual Delta Consistency"

Seongmin Lee, Shreyas Minocha, Marcel Böhme, "Accounting for Missing Events in Statistical Information Leakage Analysis"

Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Premkumar Devanbu, Toufique Ahmed, "Calibration and Correctness of Language Models for Code"

Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, Rishi Poddar, Aseem Rastogi, "RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code"

guoping rong, Yongda Yu, Song Liu, Xin Tan, Tianyi Zhang, Haifeng Shen, Jidong Hu, "Code Comment Inconsistency Detection and Rectification Using a Large Language Model"

Kai Huang, Jian Zhang, Xiangxin Meng, Yang Liu, "Template-Guided Program Repair in the Era of Large Language Models"

Syed Fatiul Huq, Mahan Tafreshipour, Kate Kalcevich, Sam Malek, "Automated Generation of Accessibility Test Reports from Recorded User Transcripts"

Xinyu Lian, Yinfang Chen, Runxiang Cheng, Jie Huang, Parth Thakkar, Minjia Zhang, Tianyin Xu, "Large Language Models as Configuration Validators"

Wen Zhang, Botang Xiao, Qingchen Kong, Le Guan, Wenwen Wang, "BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries"

Ying Fu, Zhiyong Wu, Yuanliang Zhang, Jie Liang, Jingzhou Fu, Yu Jiang, Shanshan Li, Xiangke Liao, "Thanos: DBMS Bug Detection via Storage Engine Rotation Based Differential Testing"

Courtney Miller, Mahmoud Jahanshahi, Audris Mockus, Bogdan Vasilescu, Christian Kästner, "Understanding the Response to Open-Source Dependency Abandonment in the npm Ecosystem"

Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen, "Vulnerability Detection with Code Language Models: How Far Are We?"

Deepak-George Thomas, Matteo Biagiola, Nargiz Humbatova, Mohammad Wardat, Gunel Jahangirova, Hridesh Rajan, Paolo Tonella, "µPRL: A Mutation Testing Pipeline for Deep Reinforcement Learning based on Real Faults"

Nadia Nahar, Haoran Zhang, Grace Lewis, Shurui Zhou, Christian Kästner, "The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML Products"

Sanan Hasanov, Stefan Nagy, Paul Gazzillo, "A Little Goes a Long Way: Tuning Configuration Selection for Continuous Kernel Fuzzing"

Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, Yuriy Brun, "QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning"

Ben Limpanukorn, Jiyuan Wang, Hong Jin Kang, Eric Zitong Zhou, Miryung Kim, "Fuzzing MLIR Compilers with Custom Mutation Synthesis"

Forough Mehralian, Ziyao He, Sam Malek, "Automated Accessibility Analysis of Dynamic Content Changes on Mobile Apps"

Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, Zibin Zheng, "RLCoder: Reinforcement Learning for Repository-Level Code Completion"

Nausheen Mohammed, Akash lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma, "LLM Assistance for Memory Safety"

Chenkai Guo, Qianlu Wang, Naipeng Dong, Lingling Fan, Tianhong Wang, Weijie Zhang, EnBao Chen, Zheli Liu, Lu Yu, "EP-Detector: Automatic Detection of Error-prone Operation Anomalies in Android Applications"

Shuzheng Gao, Cuiyun Gao, Wenchao Gu, Michael Lyu, "Search-Based LLMs for Code Optimization"

Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, Hui Liu, "A First Look at Conventional Commits Classification"

Chong Wang, Jian Zhang, Yiling Lou, Mingwei Liu, Weisong Sun, Yang Liu, Xin Peng, "TIGER: A Generating-Then-Ranking Framework for Practical Python Type Inference"

Qingchao Shen, Yongqiang Tian, Haoyang Ma, Junjie Chen, Lili Huang, Ruifeng Fu, Shing-Chi Cheung, Zan Wang, "A Tale of Two DL Cities: When Library Tests Meet Compiler"

Rodrigo Pedro, Miguel E. Coimbra, Daniel Castro, Paulo Carreira, Nuno Santos, "Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses"

Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia, "Reasoning Runtime Behavior of a Program with LLM: How Far Are We?"

Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, Yang Liu, "Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications"

Shuo Yang, Xingwei Lin, Jiachi Chen, Qingyuan Zhong, Lei Xiao, Renke Huang, Yanlin Wang, Zibin Zheng, "Hyperion: Unveiling DApp Inconsistencies using LLM and Dataflow-Guided Symbolic Execution"

Zhiqing Zhong, Shilin He, Haoxuan Wang, Boxi Yu, Haowen Yang, Pinjia He, "An Empirical Study on Package-Level Deprecation in Python Ecosystem"

Lizhi Liao, Simon Eismann, Heng Li, Cor-Paul Bezemer, Diego Elias Costa, André van Hoorn, Weiyi Shang, "Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models"

Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, Andy Zaidman, "Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests"

Sebastian Uchitel, Francisco Cirelli, Dalal Alrajeh, "Unavoidable Boundary Conditions: A Control Perspective on Goal Conflicts"

Brian Hyeongseok Kim, Jingbo Wang, Chao Wang, "FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks"

Mingyuan Wu, Jiahong Xiang, Kunqiu Chen, Peng DI, Shin Hwei Tan, Heming Cui, Yuqun Zhang, "Tumbling Down the Rabbit Hole: How do Assisting Exploration Strategies Facilitate Grey-box Fuzzing?"

Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy, "Towards Neural Synthesis for SMT-assisted Proof-Oriented Programming"

Kunpeng Zhang, Shuai Wang, Jitao Han, Xiaogang Zhu, Xian Li, Shaohua Wang, Sheng Wen, "Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models"

Yubo Mai, Zhipeng Gao, Haoye Wang, Tingting Bi, Xing Hu, Xin Xia, jianling Sun, "Towards Better Answers: Automated Stack Overflow Post Updating"

Tianchang Gao, Junjie Chen, Dong Wang, Yile Guo, Yingquan Zhao, Zan Wang, "Selecting Initial Seeds for Better JVM Fuzzing"

Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, Zhenyu Chen, "Source Code Summarization in the Era of Large Language Models"

Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu, "Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers"

Yue Wang, Chao Yang, Xiaodong Zhang, Yuwanqi Deng, JianFeng Ma, "DPFuzzer: Discovering Safety Critical Vulnerabilities for Drone Path Planners"

Shiyu Zhang, Haoyang Song, Qixin Wang, Henghua Shen, Yu Pei, "A Test Oracle for Reinforcement Learning Software based on Lyapunov Stability Control Theory"

Yisong Xiao, Aishan Liu, Xinwei Zhang, Tianyuan Zhang, Tianlin Li, Siyuan Liang, Xianglong Liu, Yang Liu, Dacheng Tao, "BDefects4NN: A Backdoor Defect Database for Controlled Localization Studies in Neural Networks"

Ian McCormack, Joshua Sunshine, Jonathan Aldrich, "A Study of Undefined Behavior Across Foreign Function Boundaries in Rust Libraries"

Qikang Liu, Yang He, Yanwen Cai, Byeongguk Kwak, Yuepeng Wang, "Synthesizing Document Database Queries using Collection Abstractions"

Yesugen Baatartogtokh, Kaitlyn Cook, Alicia M. Grubb, "Exploring the Robustness of the Effect of EVO on Intention Valuation through Replication"

Yanick Fratantonio, Luca Invernizzi, Loua Farah, Kurt Thomas, Marina Zhang, Ange Albertini, Francois Galilee, Giancarlo Metitieri, Julien Cretin, Alex Petit-Bianco, David Tao, Elie Bursztein, "Magika: AI-Powered Content-Type Detection"

Xiafa Wu, Brian Demsky, "GenC2Rust: Towards Generating Generic Rust Code from C"

Hongyan Gao, Yibiao Yang, Maolin Sun, Jiangchang Wu, Yuming Zhou, Baowen Xu, "ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs"

Xingyu Wang, MingSen Wang, Wenbo Shen, Rui Chang, "Understanding and Detecting Peer Dependency Resolving Loop in npm Ecosystem"

Jiashuo Zhang, Jiachi Chen, John Grundy, Jianbo Gao, Yanlin Wang, Ting Chen, Zhi Guan, Zhong Chen, "Automated Test Generation For Smart Contracts via On-Chain Test Case Augmentation and Migration"

Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian, Xinru Cheng, Chengnian Sun, "Toward a Better Understanding of Probabilistic Delta Debugging"

Benjamin Steenhoek, Siva Sivaraman, Renata Saldivar Gonzalez, Yevhen Mohylevskyy, Roshanak Zilouchian Moghaddam, Wei Le, "Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE"

Soneya Binta Hossain, Matthew Dwyer, "TOGLL: Correct and Strong Test Oracle Generation with LLMs"

Parsa Alian, Noor Nashid, Mobina Shahbandeh, Taha Shabani, Ali Mesbah, "Feature-Driven End-To-End Test Generation"

Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour, "Context Conquers Parameters: Outperforming Proprietary LLM in Commit Message Generation"

Yuanliang Zhang, Yifan Xie, Shanshan Li, Ke Liu, Chong Wang, Zhouyang Jia, Xiangbing Huang, Jie Song, Chaopeng Luo, Zhizheng Zheng, Rulin Xu, Yitong Liu, Si Zheng, Xiangke Liao, "Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar"

Baoquan Cui, rong qu, Zhen Tang, Jian Zhang, "Static Analysis of Remote Procedure Call in Java Programs"

Shihao Zhu, Yuqi Guo, Yan Cai, Bin Liang, Long Zhang, Rui Chen, Tingting Yu, "Reduce Dependence for Sound Concurrency Bug Prediction"

Huaijin Wang, Zhibo Liu, Yanbo Dai, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu, "Preserving Privacy in Software Composition Analysis: A Study of Technical Solutions and Enhancements"

Jintao Huang, Kai Yang, Gaosheng Wang, Zhiqiang Shi, Zhiwen Pan, Shichao Lv, Limin Sun, "Moye: A Wallbreaker for Monolithic Firmware"

Haifeng Ruan, Yuntong Zhang, Abhik Roychoudhury, "SpecRover: Code Intent Extraction via LLMs"

Wed 30 Apr
Displayed time zone: Eastern Time (US & Canada) change

Thu 1 May
Displayed time zone: Eastern Time (US & Canada) change

Fri 2 May
Displayed time zone: Eastern Time (US & Canada) change