Call for Papers
The New Ideas and Emerging Results (NIER) track at ICSE provides a vibrant forum for forward-looking, innovative research in software engineering. Our aim is to accelerate the exposure of the software engineering community to early yet potentially ground-breaking research results, and to techniques and perspectives that challenge the status quo in the discipline.
Scope
NIER invites innovative, groundbreaking new ideas supported by promising initial results, such as:
- Forward-looking ideas: exciting new directions or techniques that may not yet be supported by solid experimental results but are nonetheless backed by strong, well-argued scientific intuitions or preliminary results, as well as concrete plans going forward.
- Thought-provoking reflections: bold and unexpected results and reflections that can help us look at current research directions in a new light, calling for new directions for future research.
A NIER track paper is not just a scaled-down version of an ICSE full research track paper. The NIER track is reserved for first-class, top-quality technical contributions. Therefore, a NIER submission is neither an ICSE full research track submission with weaker or no evaluation nor an op-ed piece advertising existing and already published results. Authors of such submissions should instead consider submitting to one of the many satellite events of ICSE.
Evaluation Criteria
Each submission will be reviewed and evaluated in terms of the following quality criteria:
- Impact: The significance and potential impact of the research. The potential of the research to disrupt the current practice.
- Novelty: The novelty and innovativeness of contributed solutions, problem formulations, methodologies, and/or theories, i.e., the extent to which the paper is sufficiently original with respect to the state of the art.
- Relevance: The relevance of the research to the field of software engineering.
- Clarity: The soundness, clarity, and depth of a technical or theoretical contribution, as well as the level of thoroughness and completeness in defining future plans for completing the research.
- Presentation: The quality of the exposition in the paper.
Reviewers will carefully consider all of the above criteria during the review process, and authors should take great care in clearly addressing them all.
Submission Instructions
All submissions to the ICSE 2025 NIER track must conform to the following instructions:
- The submissions must not exceed 4 pages for the main text, inclusive of all figures, tables, appendices, etc. An extra page is allowed for references only. The page limit is strict, and it will not be possible to purchase additional pages at any point in the process (including after the paper is accepted).
- Each submission to the ICSE 2025 NIER track needs to include a section titled “Future Plans” where the authors outline the work they plan on doing to turn their new ideas and emerging results into a full-length paper in the future.
- All submissions must be in PDF.
- Submissions must strictly conform to the IEEE conference proceedings template, specified in the IEEE Conference Proceedings Formatting Guidelines (title in 24pt font and full text in 10pt type; LaTeX users must use \documentclass[10pt,conference]{IEEEtran} without including the compsoc or compsocconf options); a minimal LaTeX preamble sketch is given after this list. Alterations of spacing, font size, and other changes that deviate from the instructions may result in desk rejection without further review.
- The ICSE 2025 NIER track will employ a double-anonymous review process. Thus, no submission may reveal its authors’ identities. The authors must make every effort to honor the double-anonymous review process. In particular:
- Authors’ names must be omitted from the submission.
- All references to the authors’ own prior work should be in the third person.
- While authors have the right to upload preprints on arXiv or similar sites, they must avoid specifying that the manuscript was submitted to ICSE 2025.
Further advice, guidance, and explanation about the double-anonymous review process can be found on the Q&A page.
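For LaTeX users, the following minimal preamble sketch is consistent with the formatting instructions above; the extra packages, placeholder title, and section names are illustrative only and are not mandated by the track:

% Minimal sketch of an IEEEtran setup for a NIER submission (illustrative only).
% The [10pt,conference] options are required; do not add compsoc or compsocconf.
\documentclass[10pt,conference]{IEEEtran}
\usepackage{cite}        % illustrative; include only the packages your paper needs
\usepackage{graphicx}
\begin{document}
\title{Your Paper Title}         % the class typesets the title at the required size
\author{Anonymous Author(s)}     % double-anonymous: omit real names and affiliations
\maketitle
\begin{abstract}
Abstract text goes here.
\end{abstract}
\section{Introduction}
Main text, at most 4 pages including all figures, tables, and appendices.
\section{Future Plans}
Required section outlining how the emerging results will grow into a full-length paper.
\bibliographystyle{IEEEtran}
\bibliography{references}        % references may use the extra page; references.bib is a placeholder
\end{document}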
Submission Policies
- By submitting to the ICSE NIER track, authors acknowledge that they are aware of and agree to be bound by the ACM Policy and Procedures on Plagiarism and the IEEE Plagiarism FAQ. In particular, papers submitted to ICSE 2025 must not have been published elsewhere and must not be under review or submitted for review elsewhere whilst under consideration for ICSE 2025. Contravention of this concurrent submission policy will be deemed a serious breach of scientific ethics, and appropriate action will be taken in all such cases. To check for double submission and plagiarism issues, the chairs reserve the right to (1) share the list of submissions with the PC Chairs of other conferences with overlapping review periods and (2) use external plagiarism detection software, under contract to the ACM or IEEE, to detect violations of these policies.
- If the research involves human participants/subjects, the authors must adhere to the ACM Publications Policy on Research Involving Human Participants and Subjects. Upon submitting, authors will declare their compliance with such a policy. Alleged violations of this policy or any ACM Publications Policy will be investigated by ACM and may result in a full retraction of your paper, in addition to other potential penalties, as per ACM Publications Policy.
- Submissions must follow the latest IEEE Submission and Peer Review Policy and the ACM Policy on Authorship (with its associated FAQ), which includes a policy regarding the use of generative AI tools and technologies, such as ChatGPT.
- The ICSE 2025 NIER track is aligned with the ICSE 2025 Open Science policies. The guiding principle is that, wherever relevant, all research results, artifacts, and data should be accessible to the public. For additional guidelines, see the Open Science Policy of the Research Track and the Q&A page.
Submission Process
Submissions shall be made through the NIER submission site https://icse2025-nier.hotcrp.com/ by the submission deadline. Any submission that does not comply with the submission instructions may be desk rejected without further review.
Please ensure that you and your co-authors obtain an ORCID ID, so you can complete the publishing process for your accepted paper. ACM and IEEE have been involved in ORCID and may collect ORCID IDs from all published authors. We are committed to improving author discoverability, ensuring proper attribution and contributing to ongoing community efforts around name normalization; your ORCID ID will help in these efforts.
Important Dates
- Submissions Deadline: October 10, 2024
- Acceptance Notification: December 11, 2024
- Camera Ready: January 15, 2025
All dates are 23:59:59 AoE (UTC-12h).
Conference Attendance Expectation
If a submission is accepted, at least one author of the paper is required to register for ICSE 2025 and present the paper. We assume the conference will be held in person; if it is virtual or hybrid, virtual presentations may be possible. These matters will be discussed with the authors closer to the date of the conference.
Accepted Papers
The following papers have been accepted in the ICSE 2025 NIER Track. The papers will be published by IEEE and appear in the IEEE and ACM digital libraries, subject to the authors submitting their camera-ready and copyright forms and registering to attend the conference. (Authors are required to present the papers at the conference; otherwise, the papers will be withdrawn.)
Finn Hackett, Ivan Beschastnikh, "Listening to the Firehose: Sonifying Z3’s Behavior"
Abstract: Modern formal methods rely heavily on Satisfiability Modulo Theory (SMT) solvers like Z3. Unfortunately, these solvers are complex, have unpredictable runtime behavior, and are highly sensitive to the structure of the input query. As a result, when a Z3 query runs for hours and times out, there is little that an end-user can do to figure out what went wrong. They can attempt to inspect the gigabytes of logged information that these tools produce every minute. But, no existing tool provides a broad understanding of Z3 behavior. We propose Z3Hydrant, a scalable approach that converts Z3 logs into sound. By relying on the innate abilities of the human ear to pick out patterns, Z3Hydrant encodes raw Z3 logs into an audio stream. The result is accessible to anyone who can hear and helps to provide a general flavor of what occurred during a particular run. We describe our approach and include several example audio files that capture complex Z3 runs.
Tags: "Formal methods", "Analysis"Fengjie Li, Jiajun Jiang, Jiajun Sun, Hongyu Zhang, "Evaluating the Generalizability of LLMs in Automated Program Repair"
Abstract: LLM-based automated program repair methods have attracted significant attention for their state-of-the-art performance. However, they were primarily evaluated on a few well-known datasets like Defects4J, raising questions about their effectiveness on new datasets. In this study, we evaluate 11 top-performing LLMs on DEFECTS4J-TRANS, a new dataset derived from transforming Defects4J while maintaining fault semantics. Results from experiments on both Defects4J and DEFECTS4J-TRANS show that all studied LLMs have limited generalizability in APR tasks, with average correct and plausible patches decreasing by 49.48% and 42.90%, respectively, on DEFECTS4J-TRANS. Further investigation into incorporating additional repair-relevant information in repair prompts reveals that, although this information significantly enhances the LLMs’ capabilities (increasing correct and plausible patches by up to 136.67% and 121.82%, respectively), performance still falls short of their original dataset results. This indicates that prompt engineering alone is insufficient to substantially enhance LLMs’ repair capabilities. According to our study, we also offer several recommendations for future research.
Tags: "Testing and Quality", "AI for SE", "Analysis/Repair"Paschal C. Amusuo, Parth V. Patil, Owen Cochell, Taylor Le Lievre, James C. Davis, "A Unit Proofing Framework for Code-level Verification: A Research Agenda"
Abstract: Formal verification provides mathematical guarantees that software is correct. Design-level verification tools ensure software specifications are correct, but they do not expose defects in actual implementations. For this purpose, engineers use code-level tools. However, such tools struggle to scale to large software. The process of "Unit Proofing" mitigates this by decomposing the software and verifying each unit independently. We examined AWS's use of unit proofing and observed that current approaches are manual and prone to faults that mask severe defects. We propose a research agenda for a unit proofing framework, both methods and tools, to support software engineers in applying unit proofing effectively and efficiently. This will enable engineers to discover code-level defects early.
Tags: "Formal methods", "Testing and Quality", "Design/Architecture"Zijie Huang, Lizhi Cai, Xuan Mao, Kang Yang, "Towards Early Warning and Migration of High-Risk Dormant Open-Source Software Dependencies"
Abstract: Dormant open-source software (OSS) dependencies are no longer maintained or actively developed; their related code components are more vulnerable and error-prone since they can hardly keep up with evolving software dependents. Presently, their migration remains costly and challenging for practitioners. To tackle such a challenge, we intend to characterize, predict, and automatically migrate high-risk dormant OSS dependencies. Our pilot study of 4,945 Maven dependencies reveals over half of them are dormant, and 12.15% pose a high security risk. These high-risk dependencies can be predicted early based on their version release and usage characteristics. They are rarely migrated by developers, and simple one-to-one API migrations can be achieved with little context using Large Language Models (LLMs). Future research will be conducted on a more complete dataset, incorporate socio-technical features for improved high-risk prediction, and fine-tune a migration code generator.
Tags: "Security", "AI for SE", "MSR", "Open Source"Marc North, Amir Atapour-Abarghouei, Nelly Bencomo, "Beyond Syntax: How Do LLMs Understand Code?"
Abstract: Within software engineering research, Large Language Models (LLMs) are often treated as 'black boxes', with only their inputs and outputs being considered. In this paper, we take a machine interpretability approach to examine how LLMs internally represent and process code. We focus on variable declaration and function scope, training classifier probes on the residual streams of LLMs as they process code written in different programming languages to explore how LLMs internally represent these concepts across different programming languages. We also look for specific attention heads that support these representations and examine how they behave for inputs of different languages. Our results show that LLMs have an understanding — and internal representation — of language-independent coding semantics that goes beyond the syntax of any specific programming language, using the same internal components to process code, regardless of the programming language that the code is written in. Furthermore, we find evidence that these language-independent semantic components exist in the middle layers of LLMs and are supported by language-specific components in the earlier layers that parse the syntax of specific languages and feed into these later semantic components. Finally, we discuss the broader implications of our work, particularly in relation to concerns that AI, with its reliance on large datasets to learn new programming languages, might limit innovation in programming language design. By demonstrating that LLMs have a language-independent representation of code, we argue that LLMs may be able to flexibly learn the syntax of new programming languages while retaining their semantic understanding of universal coding concepts. In doing so, LLMs could promote creativity in future programming language design, providing tools that augment rather than constrain the future of software engineering.
Tags: "AI for SE", "MSR"Andreas Vogelsang, Alexander Korn, Giovanna Broccia, Alessio Ferrari, Jannik Fischbach, Chetan Arora, "On the Impact of Requirements Smells in Prompts: The Case of Automated Traceability"
Abstract: Large language models (LLMs) are increasingly used to generate software artifacts, such as source code, tests, and trace links. Requirements play a central role as they are often used as part of the prompts to synthesize the artifacts. However, the impact of requirements formulation on LLM performance remains unclear. In this paper, we investigate the role of requirements smells (indicators of potential issues like ambiguity and inconsistency) when used in prompts for LLMs. We conducted experiments using two LLMs focusing on automated trace link generation between requirements and code. Our results show mixed outcomes: while requirements smells had a small but significant effect when predicting whether a requirement was implemented in a piece of code (i.e., a trace link exists), no significant effect was observed when tracing the requirements with the associated lines of code. These findings suggest that requirements smells can affect LLM performance in certain SE tasks but may not uniformly impact all tasks. We highlight the need for further research to understand these nuances and propose future work toward developing guidelines for mitigating the negative effects of requirements smells in AI-driven SE processes.
Tags: "Requirements", "AI for SE"Andreas Vogelsang, Alexander Korn, Giovanna Broccia, Alessio Ferrari, Jannik Fischbach, Chetan Arora, "On the Impact of Requirements Smells in Prompts: The Case of Automated Traceability"
Abstract: Large language models (LLMs) are increasingly used to generate software artifacts, such as source code, tests, and trace links. Requirements play a central role as they are often used as part of the prompts to synthesize the artifacts. However, the impact of requirements formulation on LLM performance remains unclear. In this paper, we investigate the role of requirements smells-indicators of potential issues like ambiguity and inconsistency-when used in prompts for LLMs. We conducted experiments using two LLMs focusing on automated trace link generation between requirements and code. Our results show mixed outcomes: while requirements smells had a small but significant effect when predicting whether a requirement was implemented in a piece of code (i.e., a trace link exists), no significant effect was observed when tracing the requirements with the associated lines of code. These findings suggest that requirements smells can affect LLM performance in certain SE tasks but may not uniformly impact all tasks. We highlight the need for further research to understand these nuances and propose future work toward developing guidelines for mitigating the negative effects of requirements smells in AI-driven SE processes.
Tags: "Requirements", "AI for SE"Istvan David, "SusDevOps: Promoting Sustainability to a First Principle in Software Delivery"
Abstract: Sustainability is becoming a key property of modern software systems. While there is a substantial and growing body of knowledge on engineering sustainable software, end-to-end frameworks that situate sustainability-related activities within the software delivery lifecycle are missing. In this article, we propose the SusDevOps framework that promotes sustainability to a first principle within a DevOps context. We demonstrate the lifecycle phases and techniques of SusDevOps through the case of a software development startup company.
Tags: "Process", "Sustainability", "DevOps"Liangying Shao, Yanfu Yan, Denys Poshyvanyk, Jinsong Su, "UniGenCoder: Merging Seq2Seq and Seq2Tree Paradigms for Unified Code Generation"
Abstract: Deep learning-based code generation has completely transformed the way developers write programs today. Existing approaches to code generation have focused either on the Sequence-to-Sequence paradigm, which generates target code as a sequence of tokens, or the Sequence-to-Tree paradigm, which outputs code as a sequence of actions. While these two paradigms are intuitively complementary, their combination has not been previously explored. By comparing the code generated under these two paradigms, we find that integrating them holds significant potential. In this paper, we propose UniGenCoder for code-related generation tasks, which consists of a shared encoder, a shared decoder with a minimal set of additional parameters to unify the two paradigms, and a selector that dynamically chooses the optimal paradigm for each instance. Also, during model training, we first apply multi-task learning and distillation strategies to facilitate knowledge transfer between the two paradigms, and then leverage contrastive learning to train the selector. Experimental results on the text-to-code and code-to-code generation tasks demonstrate the effectiveness of our proposed model. We will release our code upon acceptance.
Tags: "AI for SE"Junjie Sheng, Yanqiu Lin, Jiehao Wu, Yanhong Huang, Jianqi Shi, Min Zhang, Xiangfeng Wang, "SolSearch: An LLM-Driven Framework for Efficient SAT-Solving Code Generation"
Abstract: The Satisfiability (SAT) problem is a core challenge with significant applications in software engineering, including automated testing, configuration management, and program verification. This paper presents SolSearch, a novel framework that harnesses large language models (LLMs) to automatically discover and optimize SAT-solving strategies. Leveraging a curriculum-based, trial-and-error process, SolSearch enables the LLM to iteratively modify and generate SAT solver code, thereby improving solving efficiency and performance. This automated SAT solving paradigm has the advantage of being plug-and-play, allowing integration with any SAT solver and accelerating the development or design process of new SAT solvers (new methods). Our preliminary experimental results are encouraging, demonstrating that the LLM-powered paradigm not only improves state-of-the-art SAT solvers on general SAT benchmarks but also significantly enhances the performance of the widely used Z3 solver (11% on PAR-2 score). These results highlight the potential for using LLM-driven methods to advance solver adaptability and effectiveness in real-world software engineering challenges. Future research directions are discussed to further refine and validate this approach, offering a promising avenue for integrating AI with traditional software engineering tasks.
Tags: "Formal methods", "AI for SE"Alejandro Velasco, Daniel Rodriguez-Cardenas, David N. Palacio, Lutfar Rahman Alif, Denys Poshyvanyk, "How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study"
Abstract: Large Language Models (LLMs) have shown significant potential in automating software engineering tasks, particularly in code generation. However, current evaluation benchmarks, which primarily focus on accuracy, fall short in assessing the quality of the code generated by these models, specifically their tendency to produce code smells. To address this limitation, we introduce CodeSmellEval, a benchmark designed to evaluate the propensity of LLMs for generating code smells. Our benchmark includes a novel metric: Propensity Smelly Score (PSC), and a curated dataset of method-level code smells: CodeSmellData. To demonstrate the use of CodeSmellEval, we conducted a case study with two state-of-the-art LLMs, CodeLlama and Mistral. The results reveal that both models tend to generate code smells, such as simplifiable-condition and consider-merging-isinstance. These findings highlight the effectiveness of our benchmark in evaluating LLMs, providing valuable insights into their reliability and their propensity to introduce code smells in code generation tasks.
Tags: "AI for SE", "Testing and Quality"Long Doan, ThanhVu (Vu) Nguyen, "AI-Assisted Autoformalization of Combinatorics Problems in Proof Assistants"
Abstract: Proof assistants such as Coq and LEAN have been increasingly used by renowned mathematicians to formalize and prove mathematical theorems. Despite their growing use, writing formal proofs is challenging, as it requires a deep understanding of these systems' languages. Recent advancements in AI, especially LLMs, have shown promise in automating this formalization task. However, domains such as combinatorics pose significant challenges for AI-assisted proof assistant systems due to their cryptic nature and the lack of existing data to train AI models. We introduce AutoForm4Lean, a system designed to leverage LLMs to aid in formalizing combinatorics problems for LEAN. By combining LLM power with SE/FM techniques such as synthesis and validation, AutoForm4Lean generates formalizations of combinatorics problems more effectively than current state-of-the-art LLMs. Moreover, this project seeks to provide a comprehensive collection of formalized combinatorics problems, theorems and lemmas, which would enrich the LEAN library and provide valuable training data for LLMs. Preliminary results demonstrate the effectiveness of AutoForm4Lean in formalizing combinatorics problems in LEAN, making a step forward in AI-based theorem proving.
Tags: "Formal methods", "AI for SE"Benedikt Steininger, Chrysanthi Papamichail, David Stark, Dejan Nickovic, Alessio Gambi, "Automatically Generating Content for Testing Autonomous Vehicles from User Descriptions"
Abstract: Testing autonomous vehicle (AV) software, which is currently done using simulations, requires the availability of various content, such as terrains and maps, to instantiate relevant scenarios. Manually generating such content is time-consuming, and current approaches for procedural content generation struggle to handle user requirements. Consequently, the limited availability of content strongly affects AV testing effectiveness. To address this problem, we present RoadGPT, the first generative AI approach that generates focused scenarios by translating user requirements in natural language into three-dimensional road models. RoadGPT leverages OpenAI's foundational large language model (LLM) ChatGPT to interpret user descriptions and the physically accurate driving simulation BeamNG.tech to generate the corresponding driving simulations. Our initial evaluation, which includes a focused user study with experts in the AV testing domain, confirmed the ability of RoadGPT to generate roads matching user-defined descriptions and highlighted avenues for future improvement. We believe that RoadGPT can become an essential component in AV testing and can be extended to create other relevant testing environments, such as parking spaces.
Tags: "AI for SE", "Real-Time", "Testing and Quality"Shane McIntosh, Luca Milanesio, Antonio Barone, Jacek Centkowski, Marcin Czech, Fabio Ponciroli, "Using Reinforcement Learning to Sustain the Performance of Version Control Repositories"
Abstract: Although decentralized Version Control Systems (VCSs) like Git support several organizational structures, a central copy of the repository is typically where development activity is coalesced and where official software releases are produced. Due to growth in team size and the popularity of monolithic repositories (a.k.a., "monorepos") that span entire organizations, central repositories are being strained. Remedial actions that devops engineers take, such as performing garbage collection routines, can backfire because they are computationally expensive and, if run at an inopportune moment, may degrade repository performance or even cause the host to crash. To sustain the performance of VCSs under production workloads, we propose a reinforcement learning agent that can take remedial actions. Since a large quantity of VCS activity is needed to train the agent, we first augment the VCS to enable a greater throughput, observing that the augmented VCS outperforms the stock VCS to a large, statistically significant degree. Then, we compare the performance that a central VCS can sustain when the agent is applied against a schedule-based garbage collection policy and a no-action baseline, observing 64- to 82-fold improvements in the Area Under the Curve (AUC) that plots repository performance over time. This paper takes a promising first step towards automatically sustaining the performance of VCSs under heavy production workloads.
Tags: "Process", "AI for SE"Nimrod Busany, Hananel Hadad, Zofia Maszlanka, Rohit Shelke, Gregory Price, Okhaide Akhigbe, Daniel Amyot, "Optimizing Experiment Configurations for LLM Applications Through Exploratory Analysis"
Abstract: The integration of Large Language Models (LLMs) into software applications necessitates informed design choices across various configurations, including LLM selection, prompting techniques and their parameters, and prompt templates. Many of these choices are arbitrary, and developers often lack guidance on optimizing configurations. In this work, we define the Experiment Configuration Optimization Problem and illustrate it with a real-world Text-to-SQL application we developed. Our results show that most configurations are sub-optimal, with only a few offering a favorable trade-off between accuracy and cost. Highlighting the critical need for systematic exploration, we show that extensive experimentation is expensive, underscoring the importance of cost-effective methods to navigate the configuration space. Our findings motivate further research into methodologies that effectively optimize LLM application configurations.
Tags: "SE for AI", "Design/Architecture"Roberto Verdecchia, Emilio Cruciani, Antonia Bertolino, Breno Miranda, "Energy-Aware Software Testing"
Abstract: Our planet calls for a more responsible use of its resources, and since information technology contributes substantially to global energy consumption, software engineering research has promptly embraced this request and is actively working towards more sustainable processes. An indispensable activity in software development is testing, which is known to be very costly in terms of time and effort. On top of this, a recent study by Zaidman has shown that software testing can be a voracious energy consumer as well. In this work we introduce the very concept of energy-aware testing as the adoption of ad hoc strategies that can help reduce the energy consumption of existing practices. We discuss some possible strategies and, as an example, we conduct a first study of an energy-aware variant of a simple similarity-based test prioritization approach, which provides evidence of perceptible savings. We encourage future research in energy-aware software testing to conduct further studies and to devise more strategies.
Tags: "Testing and Quality", "Green / Environmental SE"Yang Yue, Yi Wang, David Redmiles, "Discovering Ideologies of the Open Source Software Movement"
Abstract: Encompassing a diverse population of developers, non-technical users, and other stakeholders, open source software (OSS) development has expanded from its initial product development aims into a broader social movement. Ideology, as a coherent system of ideas, offers value commitments and normative implications for any social movement, as do OSS ideologies for the open source movement. However, the SE literature on OSS ideology is often fragmented or lacks empirical evidence. We sought to develop a comprehensive empirical framework of OSS ideology. Following a grounded theory procedure, we collected and analyzed data from 22 OSS practitioners and 41 video recordings of Open Source Initiative (OSI) board members' public narratives. A framework of OSS ideology emerged in our analysis, with six key categories: membership, norms/values, goals, activities, resources, and positions/group relations; each consists of several themes. With this ideological lens, we discussed the implications and insights into the research and practice of open source development.
Tags: "Human/Social", "Process", "Open Source"Nate Levin, Chengpeng Li, Yule Zhang, August Shi, Wing Lam, "Takuan: Using Dynamic Invariants To Debug Order-Dependent Flaky Tests"
Abstract: Automated regression testing is critical to effective software development, but it suffers from flaky tests, i.e., tests that can nondeterministically pass or fail when run on the same version of code. Conceptually, a flaky test depends on a component not controlled by the code, where the test's outcome depends on the state of that component. For example, one prominent category of flaky tests is order-dependent (OD) tests, whose outcomes depend on the order in which they are run (where the order is not guaranteed), as a result of some other test “polluting” shared state. We propose the use of dynamic invariants to help debug flaky tests. By capturing the dynamic invariants that hold true during a passing execution of the flaky test and comparing them against those captured during a failing execution, we can isolate the reason for the flaky behavior. To illustrate the potential of using dynamic invariants for this task, we implement Takuan, a technique for debugging OD tests by analyzing differences in dynamic invariants collected between passing and failing runs for the OD tests. The invariants that hold true in a passing order but not in a failing order indicate the “clean” value of the shared state that makes the test pass. We further illustrate how these invariants can be used to even repair OD tests by developing automated approaches that use the invariants as inputs to then search for methods that can reset the shared state back to the desired “clean” state. Takuan's ability to analyze polluted shared state that is external to the program (e.g., in the file system) allows it to handle cases that prior work could not. We conduct a preliminary study of Takuan on existing OD tests and find that our approach has promising results.
Tags: "Testing and Quality", "Analysis"Chong Wang, Zhenpeng Chen, Tianlin Li, Yilun Zhang, Yang Liu, "Towards Trustworthy LLMs for Code: A Data-Centric Synergistic Auditing Framework"
Abstract: LLM-powered coding and development assistants have become prevalent in programmers’ workflows. However, concerns about the trustworthiness of LLMs for code persist despite their widespread use. Much of the existing research has focused on either training or evaluation, raising questions about whether stakeholders in training and evaluation align in their understanding of model trustworthiness and whether they can move toward a unified direction. In this paper, we propose a vision for a unified trustworthiness auditing framework, DataTrust, which adopts a data-centric approach that synergistically emphasizes both training and evaluation data and their correlations. DataTrust aims to connect model trustworthiness indicators in evaluation with data quality indicators in training. It autonomously inspects training data and evaluates model trustworthiness using synthesized data, attributing potential causes from specific evaluation data to corresponding training data and refining indicator connections. Additionally, a trustworthiness arena powered by DataTrust will engage crowdsourced input and deliver quantitative outcomes. We outline the benefits that various stakeholders can gain from DataTrust and discuss the challenges and opportunities it presents.
Tags: "AI for SE", "SE for AI"Yuanjun Gong, Fabio Massacci, "When in Doubt Throw It out: Building on Confident Learning for Vulnerability Detection"
Abstract: [Context:] Confident learning's intuition is that a good model can be used to identify mislabelled data. By swapping mislabeled samples that are not confidently predicted, the performance of the model can be further improved. [Problem:] Unfortunately, vulnerability detectors are generally under-performing models, and confident learning would conclude that the bulk of the dataset is mislabelled. [New Idea:] We extend confident learning by identifying a type of training sample that appears in the presence of under-performing models: confusing samples. [Emerging Result:] We analyze the formal constraints for confusing samples and perform preliminary experiments that show that the model's performance is effectively improved after deleting confusing samples entirely from the training set.
Tags: "SE for AI"Abhishek Kumar, Sandhya Sankar, Sonia Haiduc, Partha Pratim Das, Partha Pratim Chakrabarti, "LLMs as Evaluators: A Novel Approach to Commit Message Quality Assessment"
Abstract: Evaluating the quality of commit messages is a challenging task in software engineering. Existing evaluation approaches, such as automatic metrics like BLEU, ROUGE, and METEOR, as well as manual human assessments, have notable limitations. Automatic metrics often overlook semantic relevance and context, while human evaluations are time-consuming and costly. To address these challenges, we explore the potential of using Large Language Models (LLMs) as an alternative method for commit message evaluation. We conducted two tasks using state-of-the-art LLMs, GPT-4o, LLaMA 3.1 (70B and 8B), and Mistral Large, to assess their capability in evaluating commit messages. Our findings show that LLMs can effectively identify relevant commit messages and align well with human judgment, demonstrating their potential to serve as reliable automated evaluators. This study provides a new perspective on utilizing LLMs for commit message assessment, paving the way for scalable and consistent evaluation methodologies in software engineering.
Tags: "AI for SE", "Process"Clay Stevens, Katherine Kjeer, Ryan Richard, Edward Valeev, Myra B. Cohen, "Model Assisted Refinement of Metamorphic Relations for Scientific Software"
Abstract: Ensuring the correctness of scientific software is challenging due to the need to represent and model complex phenomena in a discrete form. Many dynamic approaches for correctness have been developed for numerical overflow or imprecision, which may manifest as program crashes or hangs. Less effort has been spent on functional correctness, where one of the most widely proposed techniques is metamorphic testing. Metamorphic testing often requires deep domain expertise to design meaningful relations. In this vision paper, we ask whether we can utilize the process of abstraction and refinement, a traditionally formal approach, to guide the development of metamorphic relations. We have built an iterative approach we call Model Assisted Refinements (or MARS). It starts with domain-agnostic relations and a set of input-output relations created via a dynamic analysis. We then use a model checker to identify missing input/output patterns and potential passing and failing relations. We augment our dynamic analysis, and obtain domain expertise to verify and refine our relations. At the end we have a set of domain-specific metamorphic relations and test cases. We demonstrate our approach on a high-performance chemistry library. Within three refinements we discover several domain-specific relations, and increase our behavioral coverage.
Tags: "Scientific SW", "Formal methods", "Testing and Quality"Robin Kimmel, Judith Michael, Andreas Wortmann, Jingxi Zhang, "Digital Twins for Software Engineering Processes"
Abstract: Digital twins promise a better understanding and use of complex systems. To this end, they represent these systems at their runtime and may interact with them to control their processes. Software engineering is a wicked challenge in which stakeholders from many domains collaborate to produce software artifacts together. In the presence of a shortage of skilled software engineers, our vision is to leverage digital twins (DTs) as a means for better representing, understanding, and optimizing software engineering processes to (i) enable software experts to make the best use of their time and (ii) support domain experts in producing high-quality software. This short manuscript lays out why this would be beneficial, what such a digital twin could look like, and what is missing towards realizing and deploying software engineering digital twins.
Tags: "Human/Social", "Process"Maria Camporese, Fabio Massacci, "Using ML filters to help automated vulnerability repairs: when it helps and when it doesn’t"
Abstract: [Context:] The acceptance of candidate patches in automated program repair has typically been based on testing oracles. Testing typically requires a costly process of building the application, while ML models can be used to quickly classify patches, thus allowing more candidate patches to be generated in a positive feedback loop. [Problem:] If the model predictions are unreliable (as in vulnerability detection), they can hardly replace the more reliable oracles based on testing. [New Idea:] We propose to use an ML model as a preliminary filter of candidate patches which is put in front of a traditional filter based on testing. [Preliminary Results:] We identify some theoretical bounds on the precision and recall of the ML algorithm that make such an operation meaningful in practice. With these bounds and the results published in the literature, we calculate how fast some of the state-of-the-art vulnerability detectors must be to be more effective than a traditional AVR pipeline such as APR4Vuln based just on testing.
Tags: "Analysis/Repair", "AI for SE", "Testing and Quality"Nitish Patkar, Aimen Fahmi, Timo Kehrer, Norbert Seyff, "What is a Feature, Really? Toward a Unified Understanding Across SE Disciplines"
Abstract: In software engineering, the concept of a "feature" is frequently used, yet inconsistently defined across disciplines like requirements engineering (RE) and software product lines (SPL). This inconsistency often leads to communication gaps, rework, and project inefficiencies. To address these challenges, this paper presents an empirical, data-driven approach to explore how features are described, implemented, and managed across real-world projects, starting with open-source software (OSS). By analyzing feature-related branches in OSS repositories, we identify patterns in contributor behavior, feature implementation, and project management activities. Our findings reveal distinct patterns in feature branch activity, offering actionable insights into improving project planning, resource allocation, and coordination across teams. We propose a roadmap for advancing feature-related research, focusing on key research questions that aim to unify the understanding of features across software engineering disciplines. This research has the potential to inform both academic inquiry and practical strategies for improving feature planning, resource allocation, and development workflows in diverse project environments.
Tags: "Requirements", "Design/Architecture", "Process", "Open Source"