Call for Papers
The International Conference on Software Engineering (ICSE) is the premier forum for presenting and discussing the most recent and significant technical research contributions in the field of Software Engineering. In the research track, we invite high-quality submissions of technical research papers describing original and unpublished results of software engineering research.
ICSE 2025 will follow the dual deadline structure introduced in 2024; in other words, submissions will occur in two cycles. Please refer to the Dual Submission Cycles section below for details.
NEW THIS YEAR #1: Due to its rapid growth, the area of “AI and Software Engineering” has been split into two areas: “AI for Software Engineering” and “Software Engineering for AI”. A new area, “Architecture and Design”, has been introduced. The topics listed under each area have also been revised. Please see the “Research Areas” section below.
NEW THIS YEAR #2: We have added back an opportunity for “Author Response”, in addition to “Revision”, so that potential misunderstandings can be clarified during the review process for papers that would otherwise be rejected. Also, for a paper receiving a “Revision” outcome, the authors will be given an additional page of text in the revised paper to accommodate the required changes specified in the reviews.
NEW THIS YEAR #3: Submissions must follow the latest “IEEE Submission and Peer Review Policy” and “ACM Policy on Authorship” (with its associated FAQ), which include a policy regarding the use of generative AI tools and technologies, such as ChatGPT. After checking with the ICSE Steering Committee, we are piloting a human-in-the-loop automated process to identify AI-generated papers. A Review Process Co-Chair has volunteered to design and run this pilot process on submitted papers. To preserve confidentiality, the scripts will not make use of any third-party services when scanning submitted papers.
NEW THIS YEAR #4: IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology, and ICSE 2025 have received approval from the ICSE Steering Committee to launch the Sustainable Community Review Effort (SCRE) program, aimed at reducing the community effort spent reviewing journal extensions of conference papers and allowing authors to get faster and more consistent feedback. More information is available at: http://tinyurl.com/icse25-scre
NEW THIS YEAR #5: The ICSE Steering Committee has recently approved a proposal for streamlining and enhancing the paper bidding and assignment process, aimed at reducing the workload of PC members and producing better assignments of papers. Two Review Process Co-Chairs have volunteered to help manage the updated bidding and assignment process. More information is available at: http://tinyurl.com/icse25-streamlining
NEW THIS YEAR #6: The ICSE Steering Committee has recently approved a proposal for a Shadow PC, a mentoring program to train early-career researchers (PhD students, postdocs, new faculty members, and industry practitioners) in the review process of the technical track. For Cycle 2, for the first time, authors of ICSE submissions can opt in to have their papers considered for review in the Shadow PC track. Shadow reviews for papers reviewed by the Shadow PC will be sent to authors after the end of the actual review process; shadow reviews will not affect the official decisions made by the regular PC. More detailed information about the program is available at: http://tinyurl.com/icse25-shadowpc
Research Areas
ICSE welcomes submissions addressing topics across the full spectrum of Software Engineering, and is inclusive of quantitative, qualitative, and mixed-methods research. Topics of interest are grouped into the following nine research areas. Please note that these topics are by no means exhaustive.
Each submission will need to indicate one of these nine areas as its chosen area. Optionally, the authors may indicate an additional area. A paper may be moved from the chosen area(s) to another area at the discretion of the program chairs. Program chairs will ultimately assign a paper to an area chair, considering the authors’ selection, the paper’s content, and other factors such as (if applicable) possible conflicts of interest.
AI for Software Engineering
- AI-enabled recommender systems for automated SE (e.g., code generation, program repair, AIOps, software composition analysis, etc.)
- Human-centered AI for SE (e.g., how software engineers can synergistically work with AI agents)
- Trustworthy AI for SE (e.g., how to provide guarantees, characterize limits, and prevent misuse of AI for SE)
- Sustainable AI for SE (e.g., how to reduce energy footprint for greener AI for SE)
- Collaborative AI for SE (e.g., how AI agents collaborate for automating SE)
- Automating SE tasks with LLMs and other foundation models (e.g., large vision models)
- Efficacy measurement beyond traditional metrics (e.g., accuracy, BLEU, etc.)
- Prompt engineering for SE (e.g., novel prompt design)
- AI-assisted software design and model driven engineering (e.g., specification mining, program synthesis, software architectural design)
Analytics
- Mining software repositories, including version control systems, issue tracking systems, software ecosystems, configurations, app stores, communication platforms, and novel software engineering data sources, to generate insights through various research methods
- Software visualization
- Data-driven user experience understanding and improvement
- Data-driven decision making in software engineering
- Software metrics (and measurements)
Architecture and Design
- Architecture and design measurement and assessment
- Software design methodologies, principles, and strategies
- Theory building for/of software design
- Architecture quality attributes, such as security, privacy, performance, reliability
- Modularity and reusability
- Design and architecture modeling and analysis
- Architecture recovery
- Dependency and complexity analysis
- Distributed architectures, such as microservice, SOA, cloud computing
- Patterns and anti-patterns
- Technical debt in design and architecture
- Architecture refactoring
- Adaptive architectures
- Architecture knowledge management
Dependability and Security
- Formal methods and model checking (excluding solutions focusing solely on hardware)
- Reliability, availability, and safety
- Resilience and antifragility
- Confidentiality, integrity, privacy, and fairness
- Performance
- Design for dependability and security
- Vulnerability detection to enhance software security
- Dependability and security for embedded and cyber-physical systems
Evolution
- Evolution and maintenance
- API design and evolution
- Release engineering and DevOps
- Software reuse
- Refactoring and program differencing
- Program comprehension
- Reverse engineering
- Environments and software development tools
- Traceability to understand evolution
Human and Social Aspects
- Focusing on individuals (e.g., from program comprehension and workplace stress to job satisfaction and career progression)
- Focusing on teams (e.g., collocated, distributed, global, virtual; communication and collaboration within a team), communities (e.g., open source, communities of practice) and companies (organization, economics)
- Focusing on society (e.g., sustainability; diversity and inclusion)
- Focusing on programming languages, environments, and tools supporting individuals, teams, communities, and companies
- Focusing on software development processes
Requirements and Modeling
- Requirements engineering (incl. non-functional requirements)
- Theoretical requirement foundations
- Requirements and architecture
- Feedback, user and requirements management
- Requirements traceability and dependencies
- Modeling and model-driven engineering
- Variability and product lines
- Systems and software traceability
- Modeling languages, techniques, and tools
- Empirical studies on the application of model-based engineering
- Model-based monitoring and analysis
Software Engineering for AI
- SE for AI models
- SE for systems with AI components
- SE for AI code, libraries, and datasets
- Engineering autonomic systems and self-healing systems
- Automated repair of AI models
- Testing and verification of AI-based systems
- Validation and user-based evaluation of AI-based systems
- Requirements engineering for AI-based systems
Testing and Analysis
- Software testing
- Automated test generation techniques such as fuzzing, search-based approaches, and symbolic execution
- Testing and analysis of non-functional properties
- GUI testing
- Mobile application testing
- Program analysis
- Program synthesis (e.g., constraint-based techniques)
- Program repair
- Debugging and fault localization
- Runtime analysis and/or error recovery
Scope
Since the authors will choose an area for their submission, the scope of each area becomes important. Some submissions may relate to multiple areas. In such cases, the authors should choose the area for which their paper brings the most new insights. Moreover, authors also have the option of indicating an alternate area for each paper.
Similarly, for certain papers, authors may wonder whether the paper belongs to any area or is simply out of scope. In such cases, we recommend that authors judge whether their paper brings new insights for software engineering. As an example, a formal methods paper with a focus on hardware verification may be deemed out of scope for ICSE. In general, papers that only peripherally concern software engineering and do not offer new insights from the software engineering perspective are less relevant to ICSE. Our goal is, however, to be descriptive, rather than prescriptive, to enable authors to make their own decisions about relevance.
Dual Submission Cycles
Similar to ICSE 2024, we will have two submission cycles as follows:
First submission cycle
- (Mandatory) Abstract: Mar 15, 2024
- Submission: Mar 22, 2024
- Author response period (3 days): Jun 10-13, 2024
- Notification: Jul 5, 2024
- Revision due: Aug 2, 2024
- Camera-ready (of directly accepted papers): Aug 16, 2024
- Final decision (of revised papers): Nov 1, 2024
- Camera-ready (of accepted revised papers): Dec 13, 2024
Second submission cycle
- (Mandatory) Abstract: Jul 26, 2024
- Submission: Aug 2, 2024
- Author response period (3 days): Oct 7-10, 2024
- Notification: Nov 1, 2024
- Revision due: Nov 29, 2024
- Camera-ready (of directly accepted papers): Dec 13, 2024
- Final decision (of revised papers): Jan 22, 2025
- Camera-ready (of accepted revised papers): Feb 12, 2025
All dates are 23:59:59 AoE (UTC-12h).
Review Criteria
Each paper submitted to the Research Track will be evaluated based on the following criteria:
i) Novelty: The novelty and innovativeness of contributed solutions, problem formulations, methodologies, theories, and/or evaluations, i.e., the extent to which the paper is sufficiently original with respect to the state-of-the-art.
ii) Rigor: The soundness, clarity, and depth of a technical or theoretical contribution, and the level of thoroughness and completeness of an evaluation.
iii) Relevance: The significance and/or potential impact of the research on the field of software engineering.
iv) Verifiability and Transparency: The extent to which the paper includes sufficient information to understand how an innovation works; to understand how data was obtained, analyzed, and interpreted; and how the paper supports independent verification or replication of the paper’s claimed contributions. Any artifacts attached to or linked from the paper will be checked by one reviewer.
v) Presentation: The clarity of the exposition in the paper.
Reviewers will carefully consider all of the above criteria during the review process, and authors should take great care in clearly addressing them all. The paper should clearly explain and justify the claimed contributions. Each paper will be handled by an area chair who will ensure reviewing consistency among papers submitted within that area.
The outcome for each paper will be one of the following: Accept, Revision, or Reject. We elaborate on the Revision outcome below.
Revisions
Papers submitted can go through revisions in response to specific revision requests made by the reviewers. Authors of papers receiving a Revision decision are expected to submit the revised paper, as well as a copy of the revised paper with the changes marked in a different color (e.g., using latexdiff). The authors also need to submit an “Author Response” document capturing their response to each reviewer’s comment and explaining how those comments were addressed in the revision. This is similar to the “Summary of Changes and Response” document that is typically submitted for a journal paper’s major revision. Authors may use the revision opportunity to revise and improve the paper, but should not use it to submit a substantially different paper. The reviewers will check the revised paper against the original paper and the requested changes. Revised papers will be examined by the same set of reviewers. An unsatisfactory revised paper will be rejected. Authors are given 4 weeks to submit the revised paper. This is 1 week less than in the prior year, as we have reallocated that time to (i) add the author response period and (ii) provide more time for PC members to complete their reviews (to reduce reviewing fatigue, given the high workload of reviewing an increasing number of submissions). Authors are given an additional page of text in a revised paper to accommodate the required changes specified in the reviews.
Re-submissions of Rejected Papers
Authors of papers that receive a Reject decision in the first submission cycle are strongly discouraged from re-submitting them to the second submission cycle. However, in exceptional cases where the reviewers evidently misunderstood the paper, and upon approval of the PC Chairs, authors can re-submit their paper to the second submission cycle with a “Clarifications and Summary of Improvements” document stating how they have changed the paper. They should also include the past reviews as part of this document, for completeness. These papers will be treated as new submissions, which may or may not get the same set of reviewers at the discretion of the PC chairs. Authors who try to bypass this guideline (e.g., by changing the paper title without significantly changing the paper content, or by making only small changes to the paper content) will have their papers desk-rejected by the PC chairs without further consideration.
Submission Process
Submissions must conform to the IEEE conference proceedings template, specified in the IEEE Conference Proceedings Formatting Guidelines (title in 24pt font and full text in 10pt type; LaTeX users must use \documentclass[10pt,conference]{IEEEtran} without including the compsoc or compsocconf options). Note that the IEEE format is used this year, whereas last year the ACM format was used, so the appearance will differ from year to year.
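For LaTeX users, a minimal skeleton consistent with these instructions might look like the following. This is only an illustrative sketch with placeholder content; please consult the official IEEE Conference Proceedings Formatting Guidelines and template for the authoritative settings.

    \documentclass[10pt,conference]{IEEEtran}
    % Note: do not add the compsoc or compsocconf options.
    \begin{document}
    \title{Your Paper Title}
    % Author names and affiliations must be omitted for double-anonymous review.
    \author{Anonymous Author(s)}
    \maketitle
    \begin{abstract}
    Abstract text goes here.
    \end{abstract}
    Main text goes here (at most 10 pages including figures, tables, and appendices, plus up to 2 pages of references).
    \end{document}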
- All submissions must not exceed 10 pages for the main text, inclusive of all figures, tables, appendices, etc. Two more pages containing only references are permitted. All submissions must be in PDF. Accepted papers will be allowed one extra page for the main text of the camera-ready version.
- Submissions must strictly conform to the IEEE conference proceedings formatting instructions specified above. Alterations of spacing, font size, and other changes that deviate from the instructions may result in desk rejection without further review.
- By submitting to the ICSE Technical Track, authors acknowledge that they are aware of and agree to be bound by the ACM Policy and Procedures on Plagiarism and the IEEE Plagiarism FAQ. In particular, papers submitted to ICSE 2025 must not have been published elsewhere and must not be under review or submitted for review elsewhere whilst under consideration for ICSE 2025. Contravention of this concurrent submission policy will be deemed a serious breach of scientific ethics, and appropriate action will be taken in all such cases. To check for double submission and plagiarism issues, the chairs reserve the right to (1) share the list of submissions with the PC Chairs of other conferences with overlapping review periods and (2) use external plagiarism detection software, under contract to the ACM or IEEE, to detect violations of these policies.
- If the research involves human participants/subjects, the authors must adhere to the ACM Publications Policy on Research Involving Human Participants and Subjects. Upon submitting, authors will declare their compliance with such a policy. Alleged violations of this policy or any ACM Publications Policy will be investigated by ACM and may result in a full retraction of your paper, in addition to other potential penalties, as per ACM Publications Policy.
- Please ensure that you and your co-authors obtain an ORCID ID, so you can complete the publishing process for your accepted paper. ACM and IEEE have been involved in ORCID and may collect ORCID IDs from all published authors. We are committed to improving author discoverability, ensuring proper attribution, and contributing to ongoing community efforts around name normalization; your ORCID ID will help in these efforts.
- The ICSE 2025 Research Track will employ a double-anonymous review process. Thus, no submission may reveal its authors’ identities. The authors must make every effort to honor the double-anonymous review process. In particular:
- Authors’ names must be omitted from the submission.
- All references to the author’s prior work should be in the third person.
- While authors have the right to upload preprints on arXiv or similar sites, they must avoid specifying that the manuscript was submitted to ICSE 2025.
- All communication with the program committee must go through the program committee chairs. Do not contact individual program committee members regarding your submission.
Further advice, guidance, and explanation about the double-anonymous review process can be found on the Q&A page.
- By submitting to the ICSE Research Track, authors acknowledge that they conform to the authorship policy of the IEEE, the submission policy of the IEEE, and the authorship policy of the ACM (and its associated FAQ). This includes following these points related to the use of Generative AI:
- “Generative AI tools and technologies, such as ChatGPT, may not be listed as authors of an ACM published Work. The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work. For example, the authors could include the following statement in the Acknowledgements section of the Work: ChatGPT was utilized to generate sections of this Work (including text, tables, graphs, code, data, citations, etc.). If you are uncertain about the need to disclose the use of a particular tool, err on the side of caution, and include a disclosure in the acknowledgements section of the Work.” - ACM
- “The use of artificial intelligence (AI)–generated text in an article shall be disclosed in the acknowledgements section of any paper submitted to an IEEE Conference or Periodical. The sections of the paper that use AI-generated text shall have a citation to the AI system used to generate the text.” - IEEE
- “If you are using generative AI software tools to edit and improve the quality of your existing text in much the same way you would use a typing assistant like Grammarly to improve spelling, grammar, punctuation, clarity, engagement or to use a basic word processing system to correct spelling or grammar, it is not necessary to disclose such usage of these tools in your Work.” - ACM
Submissions to the Technical Track that meet the above requirements can be made via the Research Track submission site by the submission deadline. Any submission that does not comply with these requirements may be desk rejected without further review.
Submission site: https://icse2025.hotcrp.com/
We encourage authors to upload their paper information early (the PDF can be submitted later) so that conflicts can be properly entered for double-anonymous reviewing. It is the sole responsibility of the authors to ensure that the formatting guidelines, double-anonymous guidelines, and any other submission guidelines are met at the time of paper submission.
Open Science Policy
The research track of ICSE 2025 is governed by the ICSE 2025 Open Science policies. The guiding principle is that all research results should be accessible to the public and, if possible, empirical studies should be reproducible. In particular, we actively support the adoption of open artifacts and open source principles. We encourage all contributing authors to disclose (anonymized and curated) data/artifacts to increase reproducibility and replicability. Note that sharing research artifacts is not mandatory for submission or acceptance. However, sharing is expected to be the default, and non-sharing needs to be justified. We recognize that reproducibility or replicability is not a goal in qualitative research and that, similar to industrial studies, qualitative studies often face challenges in sharing research data. For guidelines on how to report qualitative research to ensure the assessment of the reliability and credibility of research results, see this curated Q&A page.
Upon submission to the research track, authors are asked
- to make their artifact available to the program committee (via upload of supplemental material or a link to an anonymous repository) – and provide instructions on how to access this data in the paper; or
- to include in the submission an explanation as to why this is not possible or desirable; and
- to indicate in the submission why they do not intend to make their data or study materials publicly available upon acceptance, if that is the case. The default understanding is that the data and/or other artifacts will be publicly available upon acceptance of a paper.
Withdrawing a Paper
Authors can withdraw their paper at any moment until the final decision has been made, through the paper submission system. Resubmitting the paper to another venue before the final decision has been made, without withdrawing it from ICSE 2025 first, is considered a violation of the concurrent submission policy, and will lead to automatic rejection from ICSE 2025 as well as from any other venue adhering to this policy. Such violations may also be reported to the appropriate organizations, e.g., ACM and IEEE.
Conference Attendance Expectation
If a submission is accepted, at least one author of the paper is required to register for ICSE 2025 and present the paper. We currently assume that the conference will be held in person; if it turns out to be virtual or hybrid, virtual presentations may be possible. These matters will be discussed with the authors closer to the date of the conference.
Accepted papers: First and Second Cycle
The following papers have been accepted so far in the ICSE 2025 Research Track first cycle. The papers will be published by IEEE and will appear in the IEEE and ACM digital libraries, subject to an author submitting the camera-ready and copyright forms and registering to attend the conference. (Authors are required to present their papers at the conference; otherwise, the papers will be withdrawn.)
Many additional papers will appear later, from first-cycle papers for which major revisions were requested (if the revisions are approved) and from the second cycle.
From first cycle
Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Puppy: Finding Performance Degradation Bugs in DBMSs via Limited-Optimization Plan Construction"
Abstract: Database management systems (DBMSs) consistently strive for enhanced performance. For a given query, the optimizer of a DBMS aims to construct an optimal execution plan that incorporates multiple optimization operations. However, the resulting plan may sometimes perform worse than if no optimizations were applied at all. This occurs because the interactions between optimizations are complex and some situations might be overlooked in the implementation. We refer to these issues as Performance Degradation Bugs (PDBs). PDBs can result in significant consequences, from decreased system efficiency and prolonged query processing times to potential disruptions in critical business operations. In this paper, we present Puppy, an automated approach for detecting PDBs in DBMSs using limited-optimization plan construction. The key idea is to compare the performance of the plan generated with all optimization operations enabled against that of the plan generated with only a subset of optimization operations in the same DBMS. If the response time of the plan with the limited optimization set is shorter than that of the fully optimized plan, it indicates a potential PDB. Specifically, Puppy first generates queries that incorporate multiple optimization sequences, guided by optimization operation sequence coverage. Second, Puppy analyzes the query plan and selectively disables specific optimizations to construct the limited-optimization plan. We evaluate Puppy on five widely used DBMSs, namely MySQL, Percona, TiDB, PolarDB, and PostgreSQL, against the state-of-the-art DBMS performance testing tools APOLLO and AMOEBA. Puppy detected 26 and 25 more performance anomalies and covered 151,201 and 173,798 more branches than APOLLO and AMOEBA in 48 hours, respectively. More importantly, Puppy reports 62 PDBs, with 54 anomalies confirmed as previously unknown bugs.
Chun Li, Hui Li, Zhong Li, Minxue Pan, Xuandong Li, "Enhancing Fault Localization in Industrial Software Systems via Contrastive Learning"
Abstract: Engineers utilize logs as a primary resource for fault localization in large-scale software and system testing, a process that is notoriously time-consuming, costly, and labor-intensive. Despite considerable progress in automated fault localization approaches, their applicability remains limited in such settings, due to the unavailability of fine-grained features in logs essential for most existing fault localization methods. In response, we introduce FALCON, a novel log-based fault localization framework. FALCON organizes complex semantic log information into graphical representations and employs contrastive learning to capture the differences between passed and failed logs, enabling the identification of crucial fault-related features. It also incorporates a specifically designed transitive analysis-based adaptive graph augmentation to minimize the influence of fault-unrelated log information on contrastive learning. Through extensive evaluations against 34 spectrum-based and 4 learning-based fault localization methods, FALCON demonstrates superior performance by outperforming all the methods in comparison. In addition, FALCON demonstrated its practical value by successfully identifying 71 out of 90 faults with a file-level Top-1 accuracy rate during a one-month deployment within a global company’s testing system.
Wenqian Deng, Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Coni: Detecting Database Connector Bugs via State-Aware Test Case Generation"
Abstract: Database connectors are widely used in many applications to facilitate flexible and convenient database interactions. Potential vulnerabilities in database connectors can lead to various abnormal behaviors within applications, such as returning incorrect results or experiencing unexpected connection interruption. However, existing fuzzing works cannot be directly applied to testing database connectors as they mainly focus on SQL generation and use a small subset of database connector interfaces to execute SQLs. Due to a lack of domain knowledge, automated test case generation also struggles to generate complex test cases that explore connectors' deep logic. The main challenge in testing database connectors is to generate semantically correct test cases that can trigger a wide range of connector state transitions. To address that, we propose CONI, a framework designed for detecting logic bugs of database connectors with state-aware test case generation. First, we define the database connector state model by analyzing the corresponding specification. Building upon this model, CONI generates interface call sequences within test cases to encompass more connector state transitions. After that, CONI generates suitable parameter values based on the parameter information and contextual information collected during runtime. Then the test cases are executed on a target and a reference database connector. Inconsistent results indicate potential logic bugs. We evaluate CONI on 5 widely-used JDBC database connectors, namely MySQL Connector/J, MariaDB Connector/J, AWS JDBC Driver for MySQL, PGJDBC NG, and PostgreSQL JDBC Driver. In total, CONI successfully detected 44 previously unknown bugs, of which 34 have been confirmed.
Gong Chen, Xiaoyuan Xie, Daniel Tang, Qi Xin, Wenjie Liu, "HedgeCode: A Multi-Task Hedging Contrastive Learning Framework for Code Search"
Abstract: Code search is a vital activity in software engineering, focused on identifying and retrieving the correct code snippets based on a query provided in natural language. Approaches based on deep learning techniques have been increasingly adopted for this task, enhancing the initial representations of both code and its natural language descriptions. Despite this progress, there remains an unexplored gap in ensuring consistency between the representation spaces of code and its descriptions. Furthermore, existing methods have not fully leveraged the potential relevance between code snippets and their descriptions, presenting a challenge in discerning fine-grained semantic distinctions among similar code snippets. To address these challenges, we introduce a multi-task hedging contrastive Learning framework for Code Search, referred to as HedgeCode. HedgeCode is structured around two primary training phases. The first phase, known as the representation alignment stage, proposes a hedging contrastive learning approach. This method aims to detect subtle differences between code and natural language text, thereby aligning their representation spaces by identifying relevance. The subsequent phase involves multi-task joint learning, wherein the previously trained model serves as the encoder. This stage optimizes the model through a combination of supervised and self-supervised contrastive learning tasks. Our framework’s effectiveness is demonstrated through its performance on the CodeSearchNet benchmark, showcasing HedgeCode’s ability to address the mentioned limitations in code search tasks.
Jiashuo Zhang, Yiming Shen, Jiachi Chen, Jianzhong Su, Yanlin Wang, Ting Chen, Jianbo Gao, Zhong Chen, "Demystifying and Detecting Cryptographic Defects in Ethereum Smart Contracts"
Abstract: To enhance smart contracts with cryptographic capabilities, Ethereum has officially provided a set of system-level cryptographic APIs, such as ecrecover. These APIs have been utilized in over 10% of Ethereum transactions, motivating developers to implement various on-chain cryptographic tasks, such as digital signatures. However, since developers may not always be cryptographic experts, their ad-hoc and potentially defective implementations could compromise the theoretical guarantees of cryptography, leading to real-world security issues. To mitigate this threat, we conducted the first study aimed at demystifying and detecting cryptographic defects in smart contracts. Through the analysis of 2,406 real-world security reports, we defined nine types of cryptographic defects in smart contracts with detailed descriptions and practical detection patterns. Based on this categorization, we proposed CrySol, a fuzzing-based tool to automate the detection of cryptographic defects in smart contracts. It combines transaction replaying and dynamic taint analysis to extract fine-grained crypto-related semantics and employs crypto-specific strategies to guide the test case generation process. Furthermore, we collected a large-scale dataset containing 25,745 real-world crypto-related smart contracts and evaluated CrySol's effectiveness on it. The result demonstrated that CrySol achieves an overall precision of 95.4% and a recall of 91.2%. Notably, CrySol revealed that 5,847 (22.7%) out of 25,745 contracts contain at least one cryptographic defect, highlighting the prevalence of these defects.
Chijin Zhou, Quan Zhang, Bingzhou Qian, Yu Jiang, "Janus: Detecting Rendering Bugs in Web Browsers via Visual Delta Consistency"
Abstract: Rendering lies at the heart of our modern web experience. However, the correctness of browser rendering is not always guaranteed, often leading to rendering bugs. Traditional differential testing, while successful in various domains, falls short when applied to rendering bug detection because an HTML file is likely to yield different rendered outcomes across different browsers. This paper introduces Visual Delta Consistency, a test oracle to detect rendering bugs in web browsers, aiming to make rendered pages across browsers comparable. Our key insight is that any modifications made to an HTML file should uniformly influence rendering outcomes across browsers. Specifically, when presented with two HTML files that differ only by minor modifications, the reaction of all browsers should be consistent, i.e., either all browsers render them identically or all render them differently. Based on this insight, we implemented it as a practical fuzzer named Janus. It constructs pairs of slightly modified HTML files and observes the change statuses of the corresponding rendered pages across browsers for bug detection. We evaluated it on three widely-used browsers, i.e., Chrome, Safari, and Firefox. In total, Janus detected 34 rendering bugs, out of which 26 were confirmed, with 8 fixed by the developers.
Seongmin Lee, Shreyas Minocha, Marcel Böhme, "Accounting for Missing Events in Statistical Information Leakage Analysis"
Abstract: The leakage of secret information via a public channel is a critical privacy flaw in software systems. The more information is leaked per observation, the less time an attacker needs to learn the secret. Due to the size and complexity of modern software, and because some empirical facts are not available to a formal analysis of the source code, researchers started investigating statistical methods using program executions as samples. However, current statistical methods require a high sample coverage. Ideally, the sample is large enough to contain every possible combination of secret × observable value to accurately reflect the joint distribution of ⟨secret, observable⟩. Otherwise, the information leakage is severely underestimated, which is problematic as it can lead to overconfidence in the security of an otherwise vulnerable program. In this paper, we introduce an improved estimator for information leakage and propose to use methods from applied statistics to improve our estimate of the joint distribution when sample coverage is low. The key idea is to reconstruct the joint distribution by casting our problem as a multinomial estimation problem in the absence of samples for all classes. We suggest two approaches and demonstrate the effectiveness of each approach on a set of benchmark subjects. We also propose novel refinement heuristics, which help to adjust the joint distribution and gain better estimation accuracy. Compared to existing statistical methods for information leakage estimation, our method can safely overestimate the mutual information and provide a more accurate estimate from a limited number of program executions.
Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Premkumar Devanbu, Toufique Ahmed, "Calibration and Correctness of Language Models for Code"
Abstract: Machine learning models are widely used but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with likelihood of correctness, then the model is said to be well-calibrated. In this case, the confidence measure can serve as a basis for rational, graduated decision making on how much review and care is needed. Calibration has so far been studied in mostly non-generative (e.g., classification) settings, especially in Software Engineering. However, generated code can quite often be wrong: given generated code, developers must decide whether to directly use it, use it after varying intensity of careful review, or discard it; thus calibration is vital in generative settings. In this paper we make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that by and large generative code models are not well-calibrated out of the box. We then show how calibration can be improved, using standard methods such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate the applicability and generalizability of Platt scaling in Software Engineering, discuss settings where it has good potential for practical use, and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offer a framework for future research to further improve calibration methods for generative models in Software Engineering.
Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, Rishi Poddar, Aseem Rastogi, "RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code"
Abstract: The Rust programming language, with its safety guarantees, has established itself as a viable choice for low-level systems programming over traditional, unsafe alternatives like C/C++. These guarantees come from a strong ownership-based type system, as well as primitive support for features like closures, pattern matching, etc., that make the code more concise and amenable to reasoning. These unique Rust features also pose a steep learning curve for programmers. This paper presents a tool called RustAssistant that leverages the emergent capabilities of Large Language Models (LLMs) to automatically suggest fixes for Rust compilation errors. RustAssistant uses a careful combination of prompting techniques as well as iteration between an LLM and the Rust compiler to deliver high accuracy of fixes. RustAssistant is able to achieve an impressive peak accuracy of roughly 74% on real-world compilation errors in popular open-source Rust repositories. We also contribute a dataset of Rust compilation errors to enable further research.
Guoping Rong, Yongda Yu, Song Liu, Xin Tan, Tianyi Zhang, Haifeng Shen, Jidong Hu, "Code Comment Inconsistency Detection and Rectification Using a Large Language Model"
Abstract: Comments are widely used in source code. If a comment is consistent with the code snippet it intends to annotate, it would aid code comprehension. Otherwise, Code Comment Inconsistency (CCI) is not only detrimental to the understanding of code, but more importantly, it would negatively impact the development, testing, and maintenance of software. To tackle this issue, existing research has been primarily focused on detecting inconsistencies with varied performance. It is evident that detection alone does not solve the problem; it merely paves the way for solving it. A complete solution requires detecting inconsistencies and, more importantly, rectifying them by amending comments. However, this type of work is scarce. In this paper, we contribute C4RLLaMA, a fine-tuned large language model based on the open-source CodeLLaMA. It not only has the ability to rectify inconsistencies by correcting relevant comment content but also outperforms state-of-the-art approaches in detecting inconsistencies. Experiments with various datasets confirm that C4RLLaMA consistently surpasses both Post Hoc and Just-in-time CCI detection approaches. More importantly, C4RLLaMA substantially outperforms the only known CCI rectification approach in terms of multiple performance metrics. To further examine C4RLLaMA's efficacy in rectifying inconsistencies, we conducted a manual evaluation, and the results showed that the percentage of correct comment updates by C4RLLaMA was 65.0% and 55.9% in the Just-in-time and Post Hoc settings, respectively, implying C4RLLaMA's real potential in practical use.
Kai Huang, Jian Zhang, Xiangxin Meng, Yang Liu, "Template-Guided Program Repair in the Era of Large Language Models"
Abstract: Recent advancements in automated program repair (APR) have been significantly driven by the application of Large Language Models (LLMs). In particular, the integration of LLMs with traditional template-based repair methods has demonstrated effective outcomes. Despite this, the synergy between the strengths of traditional methods and LLMs remains underexploited. This oversight originates from the indiscriminate use of templates and their insufficient coverage. Also, using small-scale LLMs within the zero-shot learning context proves to be suboptimal. To alleviate the limitations, we propose NTR (Neural Template Repair), a two-stage repair framework including template selection and patch generation, both of which are under the fine-tuning paradigm. In the template selection phase, we formulate it as a multiclass classification problem and fine-tune million-level LLMs to better select possible templates. During the patch generation phase, we leverage the chosen templates as probable directions (e.g., `Mutate Conditional Expression') to guide the fine-tuning process of LLMs at the billion-level scale for precise patch creation. Moreover, we incorporate a unique template to signify the absence of a suitable template and employ a probability-based prioritization of templates, thereby optimizing patch generation. This framework not only effectively addresses template mismatch issues, but also enables the billion-level LLMs to explore the patch space more efficiently, despite the GPU memory constraints. We evaluate NTR with different foundational models on Defects4J V1.2 and HumanEval-Java; the framework consistently demonstrates significant effectiveness. When utilizing StarCoder as the foundational model for patch generation, NTR fixes 128 and 129 bugs in Defects4J and HumanEval, outperforming the best baseline APR tool by 14 and 59 bugs. With the larger CodeLlama model, the number of fixed bugs rises to 139 and 136, respectively, exceeding the baseline by 25 and 66 bugs. Notably, the performance stems not only from the foundational models but also benefits greatly from our NTR framework. Specifically, NTR's implementation with StarCoder and CodeLlama leads to 22 and 23 additional fixes, which is beyond what the models achieve on their own. This emphasizes the success of our new perspective on utilizing templates to unlock the bug-fixing potential of LLMs.
Syed Fatiul Huq, Mahan Tafreshipour, Kate Kalcevich, Sam Malek, "Automated Generation of Accessibility Test Reports from Recorded User Transcripts"
Abstract: Testing for accessibility is a significant step when developing software, as it ensures that all users, including those with disabilities, can effectively engage with web and mobile applications. While automated tools exist to detect accessibility issues in software, none are as comprehensive and effective as the process of user testing, where testers with various disabilities evaluate the application for accessibility and usability issues. However, user testing is not popular with software developers as it requires conducting lengthy interviews with users and later parsing through large recordings to derive the issues to fix. In this paper, we explore how large language models (LLMs) like GPT 4.0, which have shown promising results in context comprehension and semantic text generation, can mitigate this issue and streamline the user testing process. Our solution, called Reca11, takes in informal transcripts of test recordings and extracts the accessibility and usability issues mentioned by the tester. Our systematic prompt engineering determines the optimal configuration of input, instruction, context and demonstrations for best results. We evaluate Reca11's effectiveness on 36 user testing sessions across three applications. Based on the findings, we investigate the strengths and weaknesses of using LLMs in this space.
Xinyu Lian, Yinfang Chen, Runxiang Cheng, Jie Huang, Parth Thakkar, Minjia Zhang, Tianyin Xu, "Large Language Models as Configuration Validators"
Abstract: Misconfigurations are major causes of software failures. Existing practices rely on developer-written rules or test cases to validate configurations, which are expensive. Machine learning (ML) for configuration validation is considered a promising direction, but has been facing challenges such as the need for large-scale field data and system-specific models. Recent advances in Large Language Models (LLMs) show promise in addressing some of the long-lasting limitations of ML-based configuration validation. We present a first analysis on the feasibility and effectiveness of using LLMs for configuration validation. We empirically evaluate LLMs as configuration validators by developing a generic LLM-based configuration validation framework, named Ciri. Ciri employs effective prompt engineering with few-shot learning based on both valid configuration and misconfiguration data. Ciri checks outputs from LLMs when producing results, addressing hallucination and nondeterminism of LLMs. We evaluate Ciri’s validation effectiveness on eight popular LLMs using configuration data of ten widely deployed open-source systems. Our analysis (1) confirms the potential of using LLMs for configuration validation, (2) explores the design space of LLM-based validators like Ciri, and (3) reveals open challenges such as ineffectiveness in detecting certain types of misconfigurations and biases towards popular configuration parameters.
Wen Zhang, Botang Xiao, Qingchen Kong, Le Guan, Wenwen Wang, "BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries"
Abstract: This paper presents BSan, a practical software-only memory error detector for binary code. Different from state-of-the-art binary-level detectors, which rely on either the shadow memory-based approach or the hardware-specific feature and thus suffer from several fundamental limitations, BSan adopts an identifier-based approach, enabling it to detect deep memory errors missed by existing detectors. Also, BSan does not depend on any specific hardware features. To reduce the high performance overhead caused by identifier propagation, BSan creates a novel hybrid approach, static analysis+dynamic instrumentation, to improve the performance without inheriting the poor reliability of static binary rewriting, distinguishing it from existing detectors that simply refer to static binary rewriting for better performance. The comprehensive evaluation demonstrates that BSan can detect more memory errors than state-of-the-art binary-level detectors. Meanwhile, the performance and memory overheads of BSan are comparable to those of existing detectors.
Ying Fu, Zhiyong Wu, Yuanliang Zhang, Jie Liang, Jingzhou Fu, Yu Jiang, Shanshan Li, Xiangke Liao, "Thanos: DBMS Bug Detection via Storage Engine Rotation Based Differential Testing"
Abstract: Differential testing is a prevalent strategy for establishing test oracles in automated DBMS testing. However, meticulously selecting equivalent DBMSs with diverse implementations and compatible input syntax requires huge manual effort. In this paper, we propose Thanos, a framework that finds DBMS bugs via storage engine rotation based differential testing. Our key insight is that a DBMS with different storage engines must provide consistent basic storage functionalities. Therefore, it is feasible to construct equivalent DBMSs based on storage engine rotation, ensuring that the same SQL test cases sent to these equivalent DBMSs yield consistent results. The framework involves four main steps: 1) select the appropriate storage engines; 2) extract equivalence information among the selected storage engines; 3) synthesize feature-oriented test cases that ensure the DBMS equivalence; and 4) send test cases to the DBMSs with the selected storage engines and compare the results. We evaluate Thanos on three widely used and extensively tested DBMSs, namely MySQL, MariaDB, and Percona, against the state-of-the-art fuzzers SQLancer, SQLsmith, and Squirrel. Thanos outperforms them on branch coverage by 24%–116%, and also finds many bugs missed by the other fuzzers. More importantly, the vendors have confirmed 32 previously unknown bugs found by Thanos, with 29 verified as Critical.
Courtney Miller, Mahmoud Jahanshahi, Audris Mockus, Bogdan Vasilescu, Christian Kästner, "Understanding the Response to Open-Source Dependency Abandonment in the npm Ecosystem"
Abstract: Many developers relying on open-source digital infrastructure expect continuous maintenance, but even the most critical packages can become unmaintained. Despite this, there is little understanding of the prevalence of abandonment of widely-used packages, of subsequent exposure, and of reactions to abandonment in practice, or the factors that influence them. We perform a large-scale quantitative analysis of all widely-used npm packages and find that abandonment is common among them, that abandonment exposes many projects which often do not respond, that responses correlate with other dependency management practices, and that removal is significantly faster when a project's end-of-life status is explicitly stated. We end with recommendations to both researchers and practitioners who are facing dependency abandonment or are sunsetting projects, such as opportunities for low-effort transparency mechanisms to help exposed projects make better, more informed decisions.
Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen, "Vulnerability Detection with Code Language Models: How Far Are We?"
Abstract: In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
Deepak-George Thomas, Matteo Biagiola, Nargiz Humbatova, Mohammad Wardat, Gunel Jahangirova, Hridesh Rajan, Paolo Tonella, "µPRL: a Mutation Testing Pipeline for Deep Reinforcement Learning based on Real Faults"
Abstract: Reinforcement Learning (RL) is increasingly adopted to train agents that can deal with complex sequential tasks, such as driving an autonomous vehicle or controlling a complex environment. Correspondingly, novel approaches are needed to ensure that RL agents have been tested adequately before going to production. Among them, mutation testing is quite promising, especially under the assumption that the injected faults (mutations) mimic the real ones. In this paper, we first describe a taxonomy of real RL faults obtained by repository mining. Then, we present the mutation operators derived from such real faults and implemented in the tool µPRL. Finally, we discuss the experimental results, which show that µPRL is extremely effective at discriminating strong from weak test generators, hence providing useful feedback to developers about the adequacy of the test scenarios generated and executed so far.
Nadia Nahar, Haoran Zhang, Grace Lewis, Shurui Zhou, Christian Kästner, "The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML Products"
Abstract: Machine learning (ML) components are increasingly incorporated into software products for end-users, but developers face challenges in transitioning from ML prototypes to products. Academics have limited access to the source of commercial ML products, challenging research progress. In this study, first, we contribute a novel process to identify 262 open-source ML products among more than half a million ML-related projects on GitHub. Then, we qualitatively and quantitatively analyze 30 open-source ML products to answer six broad research questions about development practices and system architecture. We find that the majority of the ML products in our sample represent startup-style development reported in past interview studies. We report 21 findings, including limited involvement of data scientists in many ML products, unusually low modularity between ML and non-ML code, diverse architectural choices on incorporating models into products, and limited prevalence of industry best practices such as model testing, pipeline automation, and monitoring. Additionally, we discuss 7 implications of this study on research, development, and education, including the need for tools to assist teams without data scientists, education opportunities, and open-source-specific research for privacy-preserving telemetry.
Sanan Hasanov, Stefan Nagy, Paul Gazzillo, "A Little Goes a Long Way: Tuning Configuration Selection for Continuous Kernel Fuzzing"
Abstract: The Linux kernel is actively developed and widely used. It supports billions of devices of all classes, from high-performance computing to the Internet-of-Things, in part because of its sophisticated configuration system, which automatically tailors the source code according to thousands of user-provided configuration options. Fuzzing has been highly successful at finding kernel bugs, being among the top bug reporters. Since the kernel receives hundreds of patches per day, fuzzers run continuously, stopping regularly to rebuild the kernel with the latest changes before restarting fuzzing. But kernel fuzzers currently use predefined configuration settings that, as we show, exclude the majority of new patches from the kernel binary, nullifying the benefits of continuous fuzzing. Unfortunately, state-of-the-art configuration testing techniques are generally ill-suited to the needs of continuous fuzzing, excluding necessary options or requiring too many configuration files to be tractable. We distill the needs of continuous testing down to six properties with the most impact, systematically analyze the space of configuration selection strategies, and provide actionable recommendations. Through our analysis, we discover that continuous fuzzers can improve configuration variety without sacrificing performance. We empirically evaluate our discovery by modifying the configuration selection strategy for syzkaller, the most popular Linux kernel fuzzer, which subsequently found more than twice as many new bugs (35 vs. 13) than with the original configuration file and 12x more (24 vs. 2) when considering only unique bugs, with one security vulnerability being assigned a CVE.
Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, Yuriy Brun, "QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning"
Abstract: Formal verification is a promising method for producing highly reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using machine learning and a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. This makes proofs hard to find, and, when they are found, longer than necessary. Reinforcement learning could help estimate progress, but sparse rewards make this method ineffective. To address this problem, we create QEDCartographer, a novel automated proof-synthesis tool that combines supervised and reinforcement learning. QEDCartographer's key insight is that incorporating the branching structure of proofs into its learning enables reward-free search, mitigating the sparse reward challenge. We evaluate QEDCartographer on the CoqGym benchmark of 68,501 theorems from 124 open-source Coq projects. QEDCartographer proves 186 more theorems than Proverbot9001, a state-of-the-art proof synthesis tool, an increase of 8%. Further, the tools are complementary, together proving 12% more theorems than Proverbot9001 alone. For theorems both can prove, QEDCartographer produces 26% shorter proofs 27% faster.
Ben Limpanukorn, Jiyuan Wang, Hong Jin Kang, Eric Zitong Zhou, Miryung Kim, "Fuzzing MLIR Compilers with Custom Mutation Synthesis"
Abstract: A growing trend in compiler design is to enable modular extensions to intermediate representations (IRs). Multi-Level Intermediate Representation (MLIR) is a new effort to enable faster compiler development by providing an extensible framework for downstream developers to define custom IRs with MLIR dialects. Sets of MLIR dialects define new IRs that are tailored for specific domains. The diversity and rapid evolution of these IRs make it impractical to pre-define custom test generator logic for every available dialect. We design a new approach called SYNTHFUZZ that automatically infers and applies custom mutations from existing tests. The key essence of SYNTHFUZZ is that inferred custom mutations are parameterized and context-dependent such that they can be concretized differently depending on the target context. By doing this, we obviate the need to manually write custom mutations for newly introduced MLIR dialects. Further, SYNTHFUZZ increases the chance of finding effective edit locations and reduces the chance of inserting invalid edit content by performing k-ancestor-prefix and l-sibling-postfix matching. We compare SYNTHFUZZ to three baselines: Grammarinator—a grammar-based fuzzer without custom mutators, MLIRSmith—a custom test generator for MLIR, and NeuRI—a custom test generator with support for parameterized generation. We conduct this comprehensive comparison on 4 different MLIR projects where each project defines a new set of MLIR dialects that would take months of effort to manually write custom input generation and mutation logic. Our evaluation shows that SYNTHFUZZ on average improves input diversity by 1.51×, which increases branch coverage by 1.16×. Further, we show that our context dependent custom mutation increases the proportion of valid tests by up to 1.11×, indicating that SYNTHFUZZ correctly concretizes its parameterized mutations with respect to the target context. Parameterization of the mutations reduces the fraction of tests violating general MLIR constraints by 0.57×, increasing the time spent fuzzing dialect-specific code.
Forough Mehralian, Ziyao He, Sam Malek, "Automated Accessibility Analysis of Dynamic Content Changes on Mobile Apps"
Abstract: With mobile apps playing an increasingly vital role in our daily lives, the importance of ensuring their accessibility for users with disabilities is also growing. Despite this, app developers often overlook the accessibility challenges encountered by users of assistive technologies, such as screen readers. Screen reader users typically navigate content sequentially, focusing on one element at a time, unaware of changes occurring elsewhere in the app. While dynamic changes to content displayed on an app’s user interface may be apparent to sighted users, they pose significant accessibility obstacles for screen reader users. Existing accessibility testing tools are unable to identify challenges faced by blind users resulting from dynamic content changes. In this work, we first conduct a formative user study on dynamic changes in Android apps and their accessibility barriers for screen reader users. We then present TIMESTUMP, an automated framework that leverages our findings in the formative study to detect accessibility issues regarding dynamic changes. Finally, we empirically evaluate TIMESTUMP on real-world apps to assess its effectiveness and efficiency in detecting such accessibility issues.
Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, Zibin Zheng, "RLCoder: Reinforcement Learning for Repository-Level Code Completion"
Abstract: Repository-level code completion aims to generate code for unfinished code snippets within the context of a specified repository. Existing approaches mainly rely on retrieval-augmented generation strategies due to limitations in input sequence length. However, traditional lexical-based retrieval methods like BM25 struggle to capture code semantics, while model-based retrieval methods face challenges due to the lack of labeled data for training. Therefore, we propose RLCoder, a novel reinforcement learning framework, which can enable the retriever to learn to retrieve useful content for code completion without the need for labeled data. Specifically, we iteratively evaluate the usefulness of retrieved content based on the perplexity of the target code when provided with the retrieved content as additional context, and provide feedback to update the retriever parameters. This iterative process enables the retriever to learn from its successes and failures, gradually improving its ability to retrieve relevant and high-quality content. Considering that not all situations require information beyond code files and not all retrieved context is helpful for generation, we also introduce a stop signal mechanism, allowing the retriever to decide when to retrieve and which candidates to retain autonomously. Extensive experimental results demonstrate that RLCoder consistently outperforms state-of-the-art methods on CrossCodeEval and RepoEval, achieving 12.2% EM improvement over previous methods. Moreover, experiments show that our framework can generalize across different programming languages and further improve previous methods like RepoCoder.
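As a rough illustration of the perplexity-based feedback loop described above (not the authors' implementation), the sketch below rewards each retrieved candidate by how much it lowers the perplexity of the target code; log_prob_of_target is a toy stand-in for a real code LLM, and the reward normalization is an assumption.

    import math
    from typing import List

    def log_prob_of_target(context: str, target_code: str) -> float:
        """Toy stand-in for a code LLM: target tokens that also appear in the
        retrieved context are treated as more probable. A real implementation
        would query the generator model for token log-probabilities."""
        ctx_tokens = set(context.split())
        probs = [0.9 if tok in ctx_tokens else 0.1 for tok in target_code.split()]
        return sum(math.log(p) for p in probs)

    def perplexity(context: str, target_code: str) -> float:
        n = max(len(target_code.split()), 1)
        return math.exp(-log_prob_of_target(context, target_code) / n)

    def reward_candidates(candidates: List[str], target_code: str) -> List[float]:
        # Lower perplexity means the context was more useful; the retriever would
        # then be updated (e.g., via a policy-gradient step) toward high-reward candidates.
        ppls = [perplexity(c, target_code) for c in candidates]
        best = min(ppls)
        return [best / p for p in ppls]

    target = "return math.pi * radius * radius"
    contexts = ["import math  # geometry helpers rely on math.pi and radius",
                "def greet(): return 'hi'"]
    print(reward_candidates(contexts, target))  # the math-related context earns the higher reward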
Nausheen Mohammed, Akash Lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma, "LLM Assistance for Memory Safety"
Abstract: Memory safety violations in low-level code, written in languages like C, continue to be one of the major sources of software vulnerabilities. One method of removing such violations by construction is to port C code to a safe C dialect. Such dialects rely on programmer-supplied annotations to guarantee safety with minimal runtime overhead. This porting, however, is a manual process that imposes a significant burden on the programmer and, hence, there has been limited adoption of this technique. The task of porting not only requires inferring annotations, but may also need refactoring/rewriting of the code to make it amenable to such annotations. In this paper, we use Large Language Models (LLMs) towards addressing both these concerns. We show how to harness LLM capabilities to do complex code reasoning as well as rewriting of large codebases. We also present a novel framework for whole-program transformations that leverages lightweight static analysis to break the transformation into smaller steps that can be carried out effectively by an LLM. We implement our ideas in a tool called MSA that targets the CheckedC dialect. We evaluate MSA on several micro-benchmarks, as well as real-world code of up to 20K lines. We showcase superior performance compared to a vanilla LLM baseline, as well as demonstrate improvement over a state-of-the-art symbolic (non-LLM) technique.
Chenkai Guo, Qianlu Wang, Naipeng Dong, Lingling Fan, Tianhong Wang, Weijie Zhang, EnBao Chen, Zheli Liu, Lu Yu, "EP-Detector: Automatic Detection of Error-prone Operation Anomalies in Android Applications"
Abstract: Android applications are pervasively adopted and heavily relied on in our daily life, leading to the growing demand for enhanced user experiences, such as ease of operation and robustness. Nevertheless, developers continue to prioritize traditional functionality and performance, overlooking the pivotal role of user experience in real-world scenarios. For example, poorly designed page elements can lead to user confusion, resulting in unexpected outcomes, termed error-prone operation anomalies (EPAs). In this work, we undertake the first effort to uncover the underlying essence of the EPA problem. To achieve this objective, we investigated the root causes of EPAs from three dimensions, i.e., subject, object and environment. These causes were identified by multi-stage attribute capturing and precise similarity computation. In this process, the causes are categorized into fine-grained classes, namely confusing behaviours, unsuitable layout, and resource overload. Building upon these insights, we propose a dynamic GUI-based testing tool EP-Detector to facilitate detecting EPAs in real-world apps. EP-Detector is equipped with widget-exploration based target navigation and an automatic test oracle, enabling it to detect error-prone page elements and simulate events with both comprehensiveness and precision. To systematically study the prevalence and severity of real-world EPAs, we conducted experiments on 53 popular Android apps with EP-Detector. The confirmed results not only validate the high precision and completeness of EP-Detector but also highlight that EPAs are prevalent in current apps, with at least one EPA existing in every two page widgets on average, and 28.3% of them may lead to security and functionality issues or risks. EP-Detector is available at https://github.com/WordDealer/EP-Detector.
Shuzheng Gao, Cuiyun Gao, Wenchao Gu, Michael Lyu, "Search-Based LLMs for Code Optimization"
Abstract: The code written by developers usually suffers from efficiency problems and contains various performance bugs. These inefficiencies necessitate the research of automated refactoring methods for code optimization. Early research in code optimization employs rule-based methods and focuses on specific inefficiency issues, which are labor-intensive and suffer from the low coverage issue. Recent work regards the task as a sequence generation problem, and resorts to deep learning (DL) techniques such as large language models (LLMs). These methods typically prompt LLMs to directly generate optimized code. Although these methods show state-of-the-art performance, such a one-step generation paradigm struggles to achieve an optimal solution. First, complex optimization methods such as combinatorial ones are hard for LLMs to capture. Second, the one-step generation paradigm poses challenges in precisely infusing the knowledge required for effective code optimization within LLMs, resulting in under-optimized code. To address these problems, we propose to model this task from the search perspective, and propose a search-based LLMs framework named SBLLM that enables iterative refinement and discovery of improved optimization methods. SBLLM synergistically integrates LLMs with evolutionary search and consists of three key components: 1) an execution-based representative sample selection part that evaluates the fitness of each existing optimized code and prioritizes promising ones to pilot the generation of improved code; 2) an adaptive optimization pattern retrieval part that infuses targeted optimization patterns into the model for guiding LLMs towards rectifying and progressively enhancing their optimization methods; and 3) a genetic operator-inspired chain-of-thought prompting part that aids LLMs in combining different optimization methods and generating improved optimization methods. Our evaluation of SBLLM on a dataset of Python and C++ code demonstrates its effectiveness in improving code efficiency. Specifically, the results indicate that SBLLM can improve program execution efficiency by up to 109.59% and consistently outperform all baseline methods by 8.72% ∼ 28.06% and 1.15% ∼ 9.56% with different LLMs in terms of top-5 speedup rate on Python and C++, respectively.
Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, Hui Liu, "A First Look at Conventional Commits Classification"
Abstract: Modern distributed software development relies on commits to control system versions. Commit classification plays a vital role in both industry and academia. The widely-used commit classification framework was proposed in 1976 by Swanson and includes three base classes: perfective, corrective, and adaptive. With the increasing complexity of software development, the industry has shifted towards a more fine-grained commit category, i.e., adopting the Conventional Commits Specification (CCS) for finer-grained management. The new commit framework requires developers to classify commits into ten distinct categories, such as "feat", "fix", and "docs". However, existing studies mainly focus on the three-category classification, leaving the definition and application of the fine-grained commit categories as knowledge gaps. This paper reports a preliminary study of this mechanism, covering its application status and problems. We also explore ways to address these identified problems. We find that a growing number of projects on GitHub are adopting CCS. By analyzing 194 issues from GitHub and 100 questions from Stack Overflow about the CCS application, we qualitatively categorized 52 challenges developers encountered. The most common one is CCS-type confusion. To address these challenges, we propose a clear definition of CCS types based on existing variants. Further, we designed an approach to automatically classify commits into CCS types, and the evaluation results demonstrate a promising performance. Our work facilitates a deeper comprehension of the present fine-grained commit categorization and holds the potential to alleviate application challenges significantly.
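For readers unfamiliar with the format, the sketch below parses a commit header against the Conventional Commits types mentioned above; it only illustrates the specification itself and is not the automatic classification approach proposed in the paper.

    import re

    CCS_TYPES = {"feat", "fix", "docs", "style", "refactor", "perf", "test",
                 "build", "ci", "chore"}

    # type, optional (scope), optional "!" for breaking changes, then ": description"
    HEADER = re.compile(r"^(?P<type>\w+)(?:\((?P<scope>[^)]+)\))?(?P<bang>!)?: (?P<desc>.+)$")

    def classify(commit_header: str):
        m = HEADER.match(commit_header.strip())
        if not m or m.group("type") not in CCS_TYPES:
            return None  # not a well-formed conventional commit
        return {"type": m.group("type"), "scope": m.group("scope"),
                "breaking": bool(m.group("bang")), "description": m.group("desc")}

    print(classify("feat(parser): add support for scoped packages"))
    print(classify("fixed a typo"))  # None: header does not follow the specification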
Chong Wang, Jian Zhang, Yiling Lou, Mingwei Liu, Weisong Sun, Yang Liu, Xin Peng, "TIGER: A Generating-Then-Ranking Framework for Practical Python Type Inference"
Abstract: Python’s dynamic typing system offers flexibility and expressiveness but can lead to type-related errors, prompting the need for automated type inference despite efforts like Python Enhancement Proposals (PEPs) to enhance type hinting. While existing learning-based approaches show promising inference accuracy, they struggle with practical challenges in comprehensively handling various types, including complex generics and (unseen) user/library-defined types. To address these challenges, we introduce TIGER, employing a two-stage generating-then-ranking (GTR) framework. By fine-tuning pre-trained code models, TIGER trains a generation model with a generative span masking objective and a similarity model with a contrastive training objective. This enables TIGER to execute the GTR inference, generating diverse candidates and then ranking them alongside user/library-defined types. Evaluation on the ManyTypes4Py dataset demonstrates TIGER’s effectiveness across different type categories, particularly excelling in (unseen) user-defined types (with improvements of 11.2% and 20.1% in Top-5 Exact Match). The evaluation results also confirm the robustness and efficiency of TIGER, highlighting the contributions of the employed two stages.
Qingchao Shen, Yongqiang Tian, Haoyang Ma, Junjie Chen, Lili Huang, Ruifeng Fu, Shing-Chi Cheung, Zan Wang, "A Tale of Two DL Cities: When Library Tests Meet Compiler"
Abstract: Deep Learning (DL) compilers typically load a DL model and optimize it with intermediate representation. Existing DL compiler testing techniques mainly focus on model optimization stages, but rarely explore bug detection at the model loading stage. Effectively testing the model loading stage requires covering diverse usages of each DL operator from various DL libraries, which shares a common objective with DL library testing, indicating that the embedded knowledge in DL library tests could potentially be beneficial for testing the model loading stage of DL compilers. Thus, we conducted the first empirical study to investigate the effectiveness and efficiency of migrating the knowledge embedded in DL library tests to test the model loading stage. To support the conduct of this study, we develop a technique, called OPERA, consisting of test migration (regarding effectiveness investigation) and test prioritization (regarding efficiency investigation). We considered three sources of tests in DL libraries for migration and used eight frontends from three DL compilers (i.e., TVM, TensorRT, and OpenVINO) for evaluation. The migrated tests with the aid of OPERA detected 170 previously unknown bugs in total, 90 of which have been confirmed/fixed by developers, demonstrating the effectiveness of the migration-based idea. The test prioritization strategy in OPERA improves testing efficiency with migrated tests by 11.9% to 47.4% on average compared to general test prioritization strategies. Finally, we obtained 7 major findings and provided a set of guidelines for future work from this study.
Rodrigo Pedro, Miguel E. Coimbra, Daniel Castro, Paulo Carreira, Nuno Santos, "Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses"
Abstract: Large Language Models (LLMs) have found widespread applications in various domains, including web applications with chatbot interfaces. Aided by an LLM-integration middleware such as LangChain, user prompts are translated into SQL queries used by the LLM to provide meaningful responses to users. However, unsanitized user prompts can lead to SQL injection attacks, potentially compromising the security of the database. In this paper, we present a comprehensive examination of prompt-to-SQL (P2SQL) injections targeting web applications based on frameworks such as LangChain and LlamaIndex. We characterize P2SQL injections, exploring their variants and impact on application security through multiple concrete examples. We evaluate seven state-of-the-art LLMs, demonstrating the risks of P2SQL attacks across language models. By employing both manual and automated methods, we discovered P2SQL vulnerabilities in five real-world applications. Our findings indicate that LLM-integrated applications are highly susceptible to P2SQL injection attacks, warranting the adoption of robust defenses. To counter these attacks, we propose four effective defense techniques that can be integrated as extensions to the LangChain framework.
Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia, "Reasoning Runtime Behavior of a Program with LLM: How Far Are We?"
Abstract: Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). Evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs.
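As a rough, assumed illustration of the logical-consistency idea above (not the benchmark's actual IC metric), the sketch below only credits a prediction prefix when every earlier runtime prediction for the same program is also correct, so a correct final output that follows a wrong execution-path prediction earns no credit.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class StepPrediction:
        name: str       # e.g., "execution path", "intermediate state", "final output"
        correct: bool

    def incremental_consistency(chain: List[StepPrediction]) -> float:
        """Fraction of prefixes of the prediction chain that are entirely correct."""
        consistent, all_ok = 0, True
        for step in chain:
            all_ok = all_ok and step.correct
            consistent += int(all_ok)
        return consistent / len(chain)

    chain = [StepPrediction("execution path", False),
             StepPrediction("intermediate state", True),
             StepPrediction("final output", True)]
    print(incremental_consistency(chain))  # 0.0: a correct output after a wrong path is not consistent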
Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, Yang Liu, "Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications"
Abstract: Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that large language models (LLMs) have potential in auditing smart contracts, but the state-of-the-art indicates that even GPT-4 can achieve only 30% precision (when both decision and justification are correct). This is likely because off-the-shelf LLMs were primarily pre-trained on a general text/code corpus and not fine-tuned on the specific domain of Solidity smart contract auditing. In this paper, we propose iAudit, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, iAudit is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. As such, iAudit employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate causes of vulnerabilities. However, fine-tuning alone faces challenges in accurately identifying the optimal cause of a vulnerability. Therefore, we introduce two LLM-based agents, the Ranker and Critic, to iteratively select and debate the most suitable cause of vulnerability based on the output of the fine-tuned Reasoner model. To evaluate iAudit, we collected a balanced dataset with 1,734 positive and 1,810 negative samples to fine-tune iAudit. We then compared it with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) as well as prompt learning-based LLMs (GPT-4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, iAudit achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by iAudit achieved a consistency of about 38% compared to the ground truth causes.
Shuo Yang, Xingwei Lin, Jiachi Chen, Qingyuan Zhong, Lei Xiao, Renke Huang, Yanlin Wang, Zibin Zheng, "Hyperion: Unveiling DApp Inconsistencies using LLM and Dataflow-Guided Symbolic Execution"
Abstract: The rapid advancement of blockchain platforms has significantly accelerated the growth of decentralized applications (DApps). Similar to traditional applications, DApps integrate front-end descriptions that showcase their features to attract users, and back-end smart contracts for executing their business logic. However, inconsistencies between the features promoted in front-end descriptions and those actually implemented in the contract can confuse users and undermine DApps' trustworthiness. In this paper, we first conducted an empirical study to identify seven types of inconsistencies, each exemplified by a real-world DApp. Furthermore, we introduce Hyperion, an approach designed to automatically identify inconsistencies between front-end descriptions and back-end code implementation in DApps. This method leverages a fine-tuned large language model LLaMA2 to analyze DApp descriptions and employs dataflow-guided symbolic execution for contract bytecode analysis. Finally, Hyperion reports the inconsistency based on predefined detection patterns. The experiment on our ground truth dataset consisting of 54 DApps shows that Hyperion reaches 84.06% overall recall and 92.06% overall precision in reporting DApp inconsistencies. We also apply Hyperion to analyze 835 real-world DApps. The experimental results show that Hyperion discovers 459 real-world DApps containing at least one inconsistency.
Zhiqing Zhong, Shilin He, Haoxuan Wang, Boxi Yu, Haowen Yang, Pinjia He, "An Empirical Study on Package-Level Deprecation in Python Ecosystem"
Abstract: Open-source software (OSS) plays a crucial role in modern software development. Utilizing OSS code can greatly accelerate software development, reduce redundancy, and enhance reliability. Python, a widely adopted programming language, is particularly renowned for its extensive and diverse third-party package ecosystem. However, a significant number of OSS packages within the Python ecosystem are poorly maintained, leading to potential risks in terms of functionality and security. Consequently, it is essential to establish a deprecation mechanism that assists package developers and users in effectively managing these packages. To facilitate the establishment of the package-level deprecation mechanism, this paper presents a mixed-method empirical study, including data analysis and surveys. We investigate the current practices of announcing, receiving, and handling package-level deprecation in the Python ecosystem. We also assess the benefits of having deprecation announcements for inactively maintained packages. Furthermore, we investigate the challenges faced by package developers and users and their expectations for future deprecation practices. Our findings reveal valuable insights. For instance, 75.4% of inactive package developers have no intention of releasing deprecation declarations for various reasons, while 89.5% of users express a desire to be notified about the deprecation, highlighting a gap between developers and users. In many cases, no alternative solutions are available when deprecation occurs, emphasizing the need to explore practical approaches that enable seamless package handover and require less maintenance effort. We anticipate that our work will enhance the understanding of existing package-level deprecation patterns within the Python OSS realm and facilitate the development of deprecation practices for the Python community in the future.
Lizhi Liao, Simon Eismann, Heng Li, Cor-Paul Bezemer, Diego Elias Costa, André van Hoorn, Weiyi Shang, "Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models"
Abstract: During software development, developers often make numerous modifications to the software to address existing issues or implement new features. However, certain changes may inadvertently have a detrimental impact on the overall system performance. To ensure that the performance of new software releases does not degrade (i.e., absence of performance regressions), existing practices rely on system-level performance testing, such as load testing, or component-level performance testing, such as microbenchmarking, to detect performance regressions. However, performance testing for the entire system is often expensive and time-consuming, posing challenges to adapting to the rapid release cycles common in modern DevOps practices. In addition, system-level performance testing cannot be conducted until the system is fully built and deployed. On the other hand, component-level testing focuses on isolated components, neglecting overall system performance and the impact of system workloads. In this paper, we propose a novel approach to early detection of performance regressions by bridging the local performance data generated by component-level testing and the system-level architectural models. Our approach uses local performance data to identify deviations at the component level, and then propagates these deviations to the architectural model. We then use the architectural model to predict regressions in the performance of the overall system. In an evaluation of our approach on two representative open-source benchmark systems, we show that it can effectively detect end-to-end system performance regressions from local performance deviations with different intensities and under various system workloads. More importantly, our approach can detect regressions as early as in the development phase, in contrast to existing approaches that require the system to be fully built and deployed. Our approach is lightweight and can complement traditional system performance testing when testing resources are scarce.
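A deliberately simplified sketch of the propagation idea follows; the component names, call counts, and the 40% database slowdown are invented for illustration and are not taken from the paper, which uses richer architectural models.

    from typing import Dict, List

    def end_to_end_time(path: List[str], service_time_ms: Dict[str, float],
                        calls_per_request: Dict[str, float]) -> float:
        # Architectural model: a request visits each component a known number of times.
        return sum(service_time_ms[c] * calls_per_request.get(c, 1.0) for c in path)

    baseline = {"auth": 2.0, "catalog": 5.0, "db": 8.0}          # component-level benchmark times
    local_deviation = {"auth": 1.0, "catalog": 1.0, "db": 1.4}   # microbenchmarks show "db" slowed by 40%
    new_times = {c: t * local_deviation[c] for c, t in baseline.items()}

    path = ["auth", "catalog", "db"]
    calls = {"db": 3.0}  # the catalog issues three database calls per request
    before = end_to_end_time(path, baseline, calls)
    after = end_to_end_time(path, new_times, calls)
    print(f"{before:.1f} ms -> {after:.1f} ms ({(after / before - 1) * 100:.0f}% regression)")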
Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, Andy Zaidman, "Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests"
Abstract: Automated unit test generators, particularly search-based software testing tools like EvoSuite, are capable of generating tests with high coverage. Although these generators alleviate the burden of writing unit tests, they often pose challenges for software engineers in terms of understanding the generated tests. To address this, we introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We achieve this enhancement through contextualizing test data, improving identifier naming, and adding descriptive comments. Through a controlled experiment with 32 participants, we investigate how the understandability of unit tests affects a software engineer's ability to perform bug-fixing tasks. We selected bug-fixing to simulate a real-world scenario that emphasizes the importance of understandable test cases. We observe that participants working on assignments with test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases. From the post-test questionnaire, we gathered that participants found enhanced test names, test data, and variable names improved their bug-fixing process.
Sebastian Uchitel, Francisco Cirelli, Dalal Alrajeh, "Unavoidable Boundary Conditions: A Control Perspective on Goal Conflicts"
Abstract: Boundary Conditions (BCs) express situations under which requirements specifications conflict. They are used within a broader conflict management process to produce less idealized specifications. Several approaches have been proposed to identify BCs automatically. Some introduce prioritization criteria to reduce the number of BCs presented to an engineer. However, identifying the few, relevant boundary conditions remains an open challenge. In this paper, we argue that one of the problems of the state of the art lies in the definition of BC itself: it is too weak. We propose a stronger definition that captures the few, relevant BCs, which we refer to as Unavoidable Boundary Conditions (UBCs); it builds on the notion of realizability in reactive synthesis. We show experimentally that UBCs non-trivially reduce the number of conditions produced by existing BC identification techniques. We also relate UBCs to existing concepts in reactive synthesis used to provide feedback for unrealizable specifications (including counter-strategies and unrealizable cores). We then show that UBCs provide a targeted form of feedback for repairing unrealizable specifications.
Brian Hyeongseok Kim, Jingbo Wang, Chao Wang, "FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks"
Abstract: We propose a method for formally certifying and quantifying individual fairness of a deep neural network (DNN). Individual fairness guarantees that any two individuals who are identical except for some protected input attribute (e.g., gender or race) receive the same treatment. While there are existing techniques that provide such a guarantee, they suffer from a lack of scalability or accuracy as the size and input dimension of the DNN increase. Our method overcomes this limitation by applying abstraction to a symbolic interval-based analysis of the DNN followed by iterative refinement guided by the fairness property. Furthermore, our method lifts the interval-based analysis from the conventional qualitative certification to quantitative certification, by computing the percentage of individuals whose classification outputs are provably fair, instead of merely deciding if the DNN is fair. We have implemented our method and evaluated it on deep neural networks trained on five popular fairness research datasets. The experimental results show that our method is not only more accurate than state-of-the-art techniques but also several orders of magnitude faster.
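A minimal sketch of interval-based fairness checking in the spirit of the abstract above (not the tool itself, and without its abstraction-refinement or quantification steps): intervals over the non-protected inputs are pushed through a tiny, randomly initialized ReLU network once for each value of a binary protected attribute, and the input box is certified only when both output intervals provably fall on the same side of the decision threshold.

    import numpy as np

    def interval_affine(lo, hi, W, b):
        # Exact interval image of an affine layer: split weights by sign.
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

    def output_interval(lo, hi, layers):
        for i, (W, b) in enumerate(layers):
            lo, hi = interval_affine(lo, hi, W, b)
            if i < len(layers) - 1:                      # ReLU on hidden layers
                lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
        return lo, hi

    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(4, 3)), rng.normal(size=4)),   # toy 3-input, 1-output network
              (rng.normal(size=(1, 4)), rng.normal(size=1))]

    box_lo, box_hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])  # non-protected features
    verdicts = []
    for protected in (0.0, 1.0):                                 # both values of the protected attribute
        lo, hi = np.append(box_lo, protected), np.append(box_hi, protected)
        out_lo, out_hi = output_interval(lo, hi, layers)
        verdicts.append("pos" if out_lo[0] > 0 else "neg" if out_hi[0] < 0 else "unknown")

    print("certified fair on this box" if verdicts[0] == verdicts[1] != "unknown"
          else "no certificate: refine the box")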
Mingyuan Wu, Jiahong Xiang, Kunqiu Chen, Peng DI, Shin Hwei Tan, Heming Cui, Yuqun Zhang, "Tumbling Down the Rabbit Hole: How do Assisting Exploration Strategies Facilitate Grey-box Fuzzing?"
Abstract: Many assisting exploration strategies have been proposed to assist grey-box fuzzers in exploring program states guarded by tight and complex branch conditions such as equality constraints. Although they have shown promising results in their original papers, their evaluations seldom follow equivalent protocols, e.g., they are rarely evaluated on identical benchmarks. Moreover, there has been insufficient investigation of the specifics of the program states explored by these strategies, which can obscure the future application and development of such strategies. Consequently, there is a pressing need for a comprehensive study of assisting exploration strategies on their effectiveness, versatility, and limitations to inform their future development. To this end, we perform the first comprehensive study about the assisting exploration strategies for grey-box fuzzers. Specifically, we first collect nine recent fuzzers representing the mainstream assisting exploration strategies as our studied subjects and 21 real-world projects to form our benchmark suite. After evaluating the subjects on the benchmark suite, we then surprisingly find that the dictionary strategy is the most promising since it not only achieves similar or even slightly better performance compared to the other studied assisting exploration strategies in terms of exploring program states but is also more practical to enhance. Accordingly, we propose CDFUZZ, which generates a customized dictionary for each seed upon the baseline fuzzer AFL to improve over the original dictionary strategy. The evaluation results demonstrate that CDFUZZ increases the edge coverage by 16.1% on average for all benchmark projects over the best performer in our study (i.e., AFL++ with the dictionary strategy). CDFUZZ also successfully exposed 37 previously unknown bugs, with nine confirmed and seven fixed by the corresponding developers.
Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy, "Towards Neural Synthesis for SMT-assisted Proof-Oriented Programming"
Abstract: Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem: producing a definition given a formal specification expressed as an F* type. We provide a program-fragment checker that queries F* to check the correctness of candidate solutions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding is that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compares favorably with that of large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.
Kunpeng Zhang, Shuai Wang, Jitao Han, Xiaogang Zhu, Xian Li, Shaohua Wang, Sheng Wen, "Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models"
Abstract: Deep learning (DL) libraries are widely used to form the basis of various AI applications in computer vision, natural language processing, and software engineering domains. Despite their popularity, DL libraries are known to have vulnerabilities, such as buffer overflows, use-after-free, and integer overflows, that can be exploited to compromise the security or effectiveness of the underlying libraries. While traditional fuzzing techniques have been used to find bugs in software, they are not well-suited for DL libraries. In general, the complexity of DL libraries and the diversity of their APIs make it challenging to test them thoroughly. To date, mainstream DL libraries like TensorFlow and PyTorch have featured over 1,000 APIs, and the number of APIs is still growing. Fuzzing all these APIs is a daunting task, especially when considering the complexity of the input data and the diversity of the API usage patterns. Recent advances in large language models (LLMs) have illustrated the high potential of LLMs in understanding and synthesizing human-like code. Despite their high potential, we find that emerging LLM-based fuzzers are less optimal for DL library API fuzzing, given their lack of in-depth knowledge on API input edge cases and inefficiency in generating test inputs. In this paper, we propose DFUZZ, an LLM-driven DL library fuzzing approach. We have two key insights: (1) With high reasoning ability, LLMs can replace human experts to reason about edge cases (likely error-triggering inputs) from checks in an API's code, and transfer the extracted knowledge to test other (new or rarely-tested) APIs. (2) With high generation ability, LLMs can synthesize initial test programs with high accuracy, automating API testing. DFUZZ provides LLMs with a novel "white-box view" of DL library APIs, and therefore, can leverage LLMs' reasoning and generation abilities to achieve comprehensive fuzzing. Our experimental results on popular DL libraries demonstrate that DFUZZ is able to cover more APIs than SOTA (LLM-based) fuzzers on both TensorFlow and PyTorch. Moreover, DFUZZ successfully detected 37 bugs, with 17 already confirmed as previously unknown bugs.
Yubo Mai, Zhipeng Gao, Haoye Wang, Tingting Bi, Xing Hu, Xin Xia, Jianling Sun, "Towards Better Answers: Automated Stack Overflow Post Updating"
Abstract: Utilizing code snippets on Stack Overflow (SO) is a common practice among developers for problem-solving. Although SO code snippets serve as valuable resources, it is important to acknowledge their imperfections: reusing problematic code snippets can lead to the introduction of suboptimal or buggy code into software projects. SO comments often point out weaknesses of a post and provide valuable insights to improve the quality of answers; however, SO comments are often missed and/or ignored, leaving these problematic code snippets untouched. In this work, we first investigate the task of automatic SO post updating based on their associated comments. We introduce a novel framework, named Soup (Stack Overflow Updator for Post), for this task. Soup addresses two key tasks: Valid Comment-Edit Prediction (VCP) and Automatic Post Updating (APU). We fine-tuned a large language model, CodeLlama, using low-rank adaptation techniques to complete the VCP task, and constructed a dataset containing 78k valid comment-edit pairs for the APU task. Subsequently, we tested the performance of multiple large language models on the APU task. Extensive experimental results show the promising performance of our model over a set of benchmarks. Moreover, we also performed an in-the-wild evaluation on Stack Overflow: we submitted 50 edits generated by our approach to Stack Overflow posts, and 21 of them have been verified and accepted by SO maintainers, further proving the practical value of Soup.
Tianchang Gao, Junjie Chen, Dong Wang, Yile Guo, Yingquan Zhao, Zan Wang, "Selecting Initial Seeds for Better JVM Fuzzing"
Abstract: JVM fuzzing techniques serve as a cornerstone for guaranteeing the quality of JVM implementations. In typical fuzzing workflows, initial seeds are crucial as they form the basis of the process. Literature in traditional program fuzzing has confirmed that effectiveness is largely impacted by redundancy among initial seeds, and has proposed a series of seed selection methods in response. JVM fuzzing, compared to traditional fuzzing, presents unique characteristics, including large-scale and intricate code, and programs with both syntactic and semantic features. However, it remains unclear whether the existing initial seed selection methods are suitable for JVM fuzzing and whether utilizing program features can enhance effectiveness. To address this, we devised a total of 10 initial seed selection methods, comprising coverage-based, prefuzz-based, and program-feature-based methods. We then conducted an empirical study on three JVM implementations to extensively evaluate the performance of the initial seed selection methods within two state-of-the-art fuzzing techniques (JavaTailor and VECT). Specifically, we examine performance from three aspects: (i) effectiveness and efficiency using widely studied initial seeds, (ii) effectiveness using the programs in the wild, and (iii) the ability to detect new bugs. Evaluation results first show that the program-feature-based method that utilizes the control flow graph not only has a significantly lower time overhead (i.e., 30s), but also outperforms other methods, achieving 142% to 269% improvement compared to the full set of initial seeds. Second, results reveal that the initial seed selection greatly improves the quality of wild programs and exhibits complementary effectiveness by detecting new behaviors. Third, results demonstrate that given the same testing period, initial seed selection improves the JVM fuzzing techniques by detecting more unknown bugs. Particularly, 16 out of the 25 detected bugs have been confirmed or fixed by developers. This work takes the first look at initial seed selection in JVM fuzzing, confirming its importance in fuzzing effectiveness and efficiency.
Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, Zhenyu Chen, "Source Code Summarization in the Era of Large Language Models"
Abstract: To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types (e.g., procedural and object-oriented programming languages). Finally, we unexpectedly find that CodeLlama with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.
Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu, "Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers"
Abstract: Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues of software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. Thus, their applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose a novel machine-generated code detection method called DetectCodeGPT, which improves DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experiment results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
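The whitespace-perturbation step can be sketched as follows; model_score here is a toy stand-in for a code LLM's log-likelihood (the real detector scores code with a language model), so the numbers are purely illustrative. The intuition is that machine-generated code tends to sit at a sharper likelihood peak, so perturbations cause a larger score drop.

    import random

    def perturb(code: str, n_insertions: int = 3, seed: int = 0) -> str:
        """Randomly insert spaces and newlines into the code."""
        rng = random.Random(seed)
        chars = list(code)
        for _ in range(n_insertions):
            pos = rng.randrange(len(chars) + 1)
            chars.insert(pos, rng.choice([" ", "\n"]))
        return "".join(chars)

    def model_score(code: str) -> float:
        """Toy stand-in for a language model score: penalizes irregular whitespace."""
        return -sum(code.count(ws) for ws in ("  ", " \n", "\n\n"))

    def detection_score(code: str, n_perturbations: int = 20) -> float:
        original = model_score(code)
        perturbed = [model_score(perturb(code, seed=i)) for i in range(n_perturbations)]
        return original - sum(perturbed) / len(perturbed)  # larger drop -> more likely machine-generated

    print(detection_score("def add(a, b):\n    return a + b\n"))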
From the Second Cycle
Yue Wang, Chao Yang, Xiaodong Zhang, Yuwanqi Deng, JianFeng Ma, "DPFuzzer: Discovering Safety Critical Vulnerabilities for Drone Path Planners"
Abstract: State-of-the-art drone path planners enable drones to autonomously navigate around obstacles in GPS-denied, uncharted and cluttered environments. However, our investigation shows that path planners fail to maneuver drones correctly in specific scenarios, leading to incidents such as collisions. To minimize such risks, drone path planners should be tested thoroughly against diverse scenarios before deployment. Existing research on testing drones to uncover safety-critical vulnerabilities focuses only on flight control programs and has limited capability to generate diverse obstacle scenarios for testing drone path planners. In this work, we propose DPFuzzer, an automated framework for testing drone path planners. DPFuzzer is an evolutionary algorithm (EA) based testing framework. It aims to uncover vulnerabilities in drone path planners by generating diverse critical scenarios that can trigger vulnerabilities. To better guide the critical scenario generation, we introduce the Environmental Risk Factor (ERF), a metric that abstracts the potential safety threats of scenarios. We evaluate DPFuzzer on state-of-the-art drone path planners and the experimental results show that DPFuzzer can effectively find diverse vulnerabilities. Additionally, we demonstrate that these vulnerabilities are exploitable in the real world on commercial drones.
"
Shiyu Zhang, Haoyang Song, Qixin Wang, Henghua Shen, Yu Pei, "A Test Oracle for Reinforcement Learning Software based on Lyapunov Stability Control Theory"
Abstract: Reinforcement Learning (RL) has gained significant attention in recent years. As RL software becomes more complex and infiltrates critical application domains, ensuring its quality and correctness becomes increasingly important. An indispensable aspect of software quality/correctness assurance is testing. However, testing RL software faces unique challenges compared to testing traditional software, due to the difficulty of defining the outputs' correctness. This leads to the RL test oracle problem. Current approaches to testing RL software often rely on human oracles, i.e., convening human experts to judge the correctness of RL software outputs. This heavily depends on the availability and quality (including the experiences, subjective states, etc.) of the human experts, and cannot be fully automated. In this paper, we propose a novel approach to design test oracles for RL software by leveraging Lyapunov stability control theory. By incorporating Lyapunov stability concepts to guide RL training, we hypothesize that correctly implemented RL software should output an agent that respects Lyapunov stability control theory. Based on this heuristic, we propose a Lyapunov stability control theory based oracle, LPEA(ϑ, θ), for testing RL software. We conduct extensive experiments over representative RL algorithms and RL software bugs to evaluate our proposed oracle. The results show that our proposed oracle can outperform the human oracle in most metrics. Particularly, LPEA(ϑ = 100%, θ = 75%) outperforms the human oracle by 53.6%, 50%, 18.4%, 34.8%, 18.4%, 127.8%, 60.5%, 38.9%, and 31.7% respectively on accuracy, precision, recall, F1 score, true positive rate, true negative rate, false positive rate, false negative rate, and ROC curve's AUC; and LPEA(ϑ = 100%, θ = 50%) outperforms the human oracle by 48.2%, 47.4%, 10.5%, 29.1%, 10.5%, 127.8%, 60.5%, 22.2%, and 26.0% respectively on these metrics.
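A minimal sketch of a Lyapunov-style oracle under assumed thresholds (an illustration of the general idea, not the paper's LPEA(ϑ, θ) definition): trajectories produced by the trained agent pass only if a candidate Lyapunov function decreases on a sufficiently large fraction of consecutive state pairs.

    from typing import Callable, List, Sequence

    def lyapunov_oracle(trajectories: List[List[Sequence[float]]],
                        V: Callable[[Sequence[float]], float],
                        min_decrease_ratio: float = 0.75) -> bool:
        """Pass iff, in every trajectory, V decreases on at least `min_decrease_ratio`
        of the consecutive state pairs (the threshold is an assumption)."""
        for states in trajectories:
            decreases = sum(V(s_next) < V(s) for s, s_next in zip(states, states[1:]))
            if decreases / max(len(states) - 1, 1) < min_decrease_ratio:
                return False
        return True

    V = lambda s: s[0] ** 2 + s[1] ** 2        # distance-like Lyapunov candidate
    converging = [[(1.0, 1.0), (0.6, 0.7), (0.3, 0.2), (0.1, 0.0)]]
    diverging = [[(1.0, 1.0), (1.5, 1.2), (2.0, 1.9), (2.5, 2.4)]]
    print(lyapunov_oracle(converging, V), lyapunov_oracle(diverging, V))  # True False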
"
Yisong Xiao, Aishan Liu, Xinwei Zhang, Tianyuan Zhang, Tianlin Li, Siyuan Liang, Xianglong Liu, Yang Liu, Dacheng Tao, "BDefects4NN: A Backdoor Defect Database for Controlled Localization Studies in Neural Networks"
Abstract: Pre-trained large deep learning models are now serving as the dominant component for downstream middleware users and have revolutionized the learning paradigm, replacing the traditional approach of training from scratch locally. To reduce development costs, developers often integrate third-party pre-trained deep neural networks (DNNs) into their intelligent software systems. However, utilizing untrusted DNNs presents significant security risks, as these models may contain intentional backdoor defects resulting from the black-box training process. These backdoor defects can be activated by hidden triggers, allowing attackers to maliciously control the model and compromise the overall reliability of the intelligent software. To ensure the safe adoption of DNNs in critical software systems, it is crucial to establish a backdoor defect database for localization studies. This paper addresses this research gap by introducing BDefects4NN, the first backdoor defect database, which provides labeled backdoor-defected DNNs at the neuron granularity and enables controlled localization studies of defect root causes. In BDefects4NN, we define three defect injection rules and employ four representative backdoor attacks across four popular network architectures and three widely adopted datasets, yielding a comprehensive database of 1,654 backdoor-defected DNNs with four defect quantities and varying infected neurons. Based on BDefects4NN, we conduct extensive experiments on evaluating six fault localization criteria and two defect repair techniques, which show limited effectiveness for backdoor defects. Additionally, we investigate backdoor-defected models in practical scenarios, specifically in lane detection for autonomous driving and large language models (LLMs), revealing potential threats and highlighting current limitations in precise defect localization. This paper aims to raise awareness of the threats brought by backdoor defects in our community and inspire future advancements in fault localization methods.
"
Ian McCormack, Joshua Sunshine, Jonathan Aldrich, "A Study of Undefined Behavior Across Foreign Function Boundaries in Rust Libraries"
Abstract: Developers rely on the Rust programming language's static safety guarantees to write secure and performant applications. However, Rust is frequently used to interoperate with other languages which allow design patterns that conflict with Rust's aliasing models. Miri is the only dynamic analysis tool capable of validating applications against these models, but it does not support foreign functions, indicating that there may be a critical correctness gap at the heart of the Rust ecosystem. We conducted a large-scale evaluation of multi-language Rust libraries to determine whether Miri's dynamic analyses remain useful in this context. We used Miri and an LLVM interpreter to jointly execute multi-language applications, where we found 48 instances of undefined or undesired behavior. These include three bugs from libraries that had over 10,000 daily downloads on average during our observation period, and one from a library maintained by the Rust Project. Many of the errors we found involved incompatible aliasing patterns, but Rust's latest Tree Borrows aliasing model was significantly more permissive than the earlier Stacked Borrows model. The Rust community must invest in new, production-ready tooling for multi-language applications to ensure that developers can detect these errors.
Qikang Liu, Yang He, Yanwen Cai, Byeongguk Kwak, Yuepeng Wang, "Synthesizing Document Database Queries using Collection Abstractions"
Abstract: Document databases are increasingly popular in various applications, but their queries are challenging to write due to the flexible and complex data model underlying document databases. This paper presents a synthesis technique that aims to generate document database queries from input-output examples automatically. A new domain-specific language is designed to express a representative set of document database queries in an algebraic style. Furthermore, the synthesis technique leverages a novel abstraction of collections for deduction to efficiently prune the search space and quickly generate the target query. An evaluation of 110 benchmarks from various sources shows that the proposed technique can synthesize 108 benchmarks successfully. On average, the synthesizer can generate document database queries from a small number of input-output examples within tens of seconds.
Yesugen Baatartogtokh, Kaitlyn Cook, Alicia M. Grubb, "Exploring the Robustness of the Effect of EVO on Intention Valuation through Replication"
Abstract: The development of high-quality software depends on precise and comprehensive requirements that meet the objectives of stakeholders. Goal modeling techniques have been developed to address this need by capturing and analyzing stakeholders' needs and allowing them to make trade-off decisions; yet, goal modeling analysis is often difficult for stakeholders to interpret. Recent work found that when subjects are given minimal training on goal modeling and access to a color visualization, called EVO, they are able to use EVO to make goal modeling decisions faster without compromising quality. In this paper, we evaluate the robustness of the empirical evidence for EVO and question the underlying color choices made by the initial designers of EVO. We conduct a pseudo-exact replication (n = 60) of the original EVO study, varying the experimental site and the study population. Even in our heterogeneous sample with less a priori familiarity with requirements and goal modeling, we find that individuals using EVO answered the goal-modeling questions significantly faster than those using the control, expanding the external validity of the original results. However, we find some evidence that the chosen color scheme is not intuitive and make recommendations for the goal modeling community.
"
Yanick Fratantonio, Luca Invernizzi, Loua Farah, Kurt Thomas, Marina Zhang, Ange Albertini, Francois Galilee, Giancarlo Metitieri, Julien Cretin, Alex Petit-Bianco, David Tao, Elie Bursztein, "Magika: AI-Powered Content-Type Detection"
Abstract: The task of content-type detection---which entails identifying the data encoded in an arbitrary byte sequence---is critical for operating systems, development, reverse engineering environments, and a variety of security applications. In this paper, we introduce Magika, a novel AI-powered content-type detection tool. Under the hood, Magika employs a deep learning model that can execute on a single CPU with just 1MB of memory to store the model's weights. We show that Magika achieves an average F1 score of 99% across over a hundred content types and a test set of more than 1M files, outperforming all existing content-type detection tools today. In order to foster adoption and improvements, we open source Magika under an Apache 2 license on GitHub and make our model and training pipeline publicly available. Our tool has already seen adoption by a major email provider for attachment scanning, and it has been integrated with VirusTotal to aid malware analysis.
"
Xiafa Wu, Brian Demsky, "GenC2Rust: Towards Generating Generic Rust Code from C"
Abstract: Rust provides an exciting combination of strong safety guarantees and high performance. Many new systems are being implemented in Rust. Nevertheless, there is a large body of existing C code that could greatly benefit from Rust's safety guarantees. Unfortunately, the manual effort required to rewrite C code into Rust is often prohibitively expensive. Researchers have explored tools to assist developers in translating legacy C code into Rust code. However, the mismatch between C abstractions and idiomatic Rust abstractions makes it challenging to automatically utilize Rust's language features, resulting in non-idiomatic Rust code that requires extensive manual effort to further refactor. For example, existing tools often fail to map polymorphic uses of void pointers in C to Rust's more idiomatic generic pointers. In this paper, we present a translation tool, GenC2Rust, that translates non-generic C code into generic Rust code. GenC2Rust statically analyzes the use of void pointers in the C program to compute the typing constraints and then retypes the void pointers into generic pointers. We conducted an evaluation of GenC2Rust across 42 C programs that vary in size and span multiple domains to demonstrate its scalability as well as correctness. We also present a detailed analysis of the limiting factors encountered in the translation process.
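As a loose illustration of how usage constraints on a void pointer might be solved (not GenC2Rust's actual analysis), the Python sketch below unifies hypothetical typing constraints with a union-find: a void pointer whose equivalence class contains a concrete type could be retyped to it, while a class without one would correspond to a generic parameter. All names and constraints are invented for illustration.

```python
# Minimal union-find over type variables, as a stand-in for solving the
# typing constraints that uses of a void* induce (hypothetical example).
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

# Hypothetical constraints collected from a C program:
#   void *buf is passed to a parameter of type variable T, and T is later cast to int*.
constraints = [
    ("typeof(buf)", "T"),   # buf flows into a parameter of type variable T
    ("T", "int*"),          # T is dereferenced as an int pointer
    ("typeof(out)", "T"),   # another void* shares the same type variable
]
for a, b in constraints:
    union(a, b)

# Every void* in the same equivalence class as a concrete type can be retyped;
# classes with no concrete member would become a generic parameter in Rust.
for var in ("typeof(buf)", "typeof(out)"):
    print(var, "->", find(var))
```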
"
Hongyan Gao, Yibiao Yang, Maolin Sun, Jiangchang Wu, Yuming Zhou, Baowen Xu, "ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs"
Abstract: Ensuring the reliability of the Rust compiler is of paramount importance, as the Rust language is increasingly being adopted for developing critical systems due to its emphasis on memory and thread safety. However, due to Rust’s complex syntax and strict requirements, generating valid test programs for the Rust compiler poses significant challenges. Currently, with the growing popularity of large language models (LLMs), much research in software testing has explored the use of LLMs to generate test cases. Despite this, directly using LLMs to generate Rust programs often results in a large number of invalid test cases. Existing studies have indicated that test cases triggering historical compiler bugs can assist in software testing. Our investigation into Rust compiler bug issues further supports this observation. Inspired by existing work and our empirical research, we introduce a bracket-based masking and filling strategy called clozeMask. The clozeMask strategy involves extracting test code from historical issue reports, identifying and masking code snippets with specific structures, and utilizing an LLM to fill in the masked portions for synthesizing new test programs. This approach harnesses the generative capabilities of LLMs while retaining the ability to trigger Rust compiler bugs. Ultimately, it enables comprehensive testing of the compiler’s behavior, particularly in exploring corner cases. We implemented our approach as a prototype CLOZEMASTER. CLOZEMASTER has identified 27 confirmed bugs for rustc and mrustc, of which 10 have been fixed by developers. Furthermore, our experimental results indicate that CLOZEMASTER outperforms existing generative fuzzers in terms of code coverage and effectiveness.
Xingyu Wang, MingSen Wang, Wenbo Shen, Rui Chang, "Understanding and Detecting Peer Dependency Resolving Loop in npm Ecosystem"
Abstract: As the default package manager for Node.js, npm has become one of the largest package management systems in the world. To facilitate dependency management for developers, npm supports a special type of dependency, Peer Dependency, whose installation and usage differ from regular dependencies. However, conflicts between peer dependencies can trap the npm client into infinite loops, leading to resource exhaustion and system crashes. We name this problem PeerSpin. Although PeerSpin poses a severe risk to ecosystems, it was overlooked by previous studies, and its impacts have not been explored. To bridge this gap, this paper conducts the first in-depth study to understand and detect PeerSpin in the npm ecosystem. First, by systematically analyzing the npm dependency resolution, we identify the root cause of PeerSpin and characterize two peer dependency patterns to guide detection. Second, we propose a novel technique called Node-Replacement-Conflict based PeerSpin Detection, which leverages the state of the directory tree during dependency resolution to achieve accurate and efficient PeerSpin detection. Based on this technique, we developed a tool called PeerChecker to detect PeerSpin. Finally, we apply PeerChecker to the entire npm ecosystem and find that 5,662 packages, totaling 72,968 versions, suffer from PeerSpin. To date, we have confirmed 28 real PeerSpin problems by reporting them to the package maintainers. We also open-source all PeerSpin analysis implementations, tools, and data sets to the public to help the community detect PeerSpin issues and enhance the reliability of the npm ecosystem.
"
Jiashuo Zhang, Jiachi Chen, John Grundy, Jianbo Gao, Yanlin Wang, Ting Chen, Zhi Guan, Zhong Chen, "Automated Test Generation For Smart Contracts via On-Chain Test Case Augmentation and Migration"
Abstract: Pre-deployment testing has become essential to ensure the functional correctness of smart contracts. However, since smart contracts are stateful programs integrating many different functionalities, manually writing test cases to cover all potential usages requires significant effort from developers, leading to insufficient testing and increasing risks in practice. Although several testing techniques for smart contracts have been proposed, they primarily focus on detecting common low-level vulnerabilities such as re-entrancy, rather than generating expressive and function-relevant test cases that can reduce manual testing efforts. To bridge the gap, we propose SolMigrator, an automated technique designed to generate expressive and representative test cases for smart contracts. To our knowledge, SolMigrator is the first migration-based test generation technique for smart contracts, which extracts test cases from real-world usages of on-chain contracts and migrates them to test newly developed smart contracts with similar functionalities. Given a target smart contract to be tested and an on-chain similar source smart contract, SolMigrator first transforms the on-chain usage of the source contract into off-chain executable test cases based on on-chain transaction replay and dependency analysis. It then employs fine-grained static analysis to migrate the augmented test cases from the source to the target smart contract. We built a prototype of SolMigrator and have evaluated it on real-world smart contracts within the two most popular categories, ERC20 and ERC721. Our evaluation results demonstrate that SolMigrator effectively extracts test cases from existing on-chain smart contracts and accurately migrates them across different smart contracts, achieving an average precision of 96.3% and accuracy of 93.6%. Furthermore, the results indicate that these migrated test cases effectively cover common key functionalities of the target smart contracts. This provides promising evidence that real-world usages of existing smart contracts can be transformed into effective test cases for other newly developed smart contracts.
Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian, Xinru Cheng, Chengnian Sun, "Toward a Better Understanding of Probabilistic Delta Debugging"
Abstract: Given a list L of elements and a property ψ that L exhibits, ddmin is a classic test input minimization algorithm that aims to automatically remove ψ-irrelevant elements from L. This algorithm has been widely adopted in domains such as test input minimization and software debloating. Recently, ProbDD, a variant of ddmin, has been proposed and achieved state-of-the-art performance. By employing Bayesian optimization, ProbDD estimates the probability of each element in L being relevant to ψ, and statistically decides which and how many elements should be deleted together each time. However, the theoretical probabilistic model of ProbDD is rather intricate, and the underlying details for the superior performance of ProbDD have not been adequately explored. In this paper, we conduct the first in-depth theoretical analysis of ProbDD, clarifying the trends in probability and subset size changes and simplifying the probability model. We complement this analysis with empirical experiments, including success rate analysis, ablation studies, and examinations of trade-offs and limitations, to further comprehend and demystify this state-of-the-art algorithm. Our success rate analysis reveals how ProbDD effectively addresses bottlenecks that slow down ddmin by skipping inefficient queries that attempt to delete complements of subsets and previously tried subsets. The ablation study illustrates that randomness in ProbDD has no significant impact on efficiency. These findings provide valuable insights for future research and applications of test input minimization algorithms. Based on the findings above, we propose CDD, a simplified version of ProbDD, reducing the complexity in both theory and implementation. CDD assists in (1) validating the correctness of our key findings, e.g., that probabilities in ProbDD essentially serve as monotonically increasing counters for each element, and (2) identifying the main factors that truly contribute to ProbDD’s superior performance. Our comprehensive evaluations across 76 benchmarks in test input minimization and software debloating demonstrate that CDD can achieve the same performance as ProbDD, despite being much simplified.
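For readers unfamiliar with the baseline being analyzed, the following Python sketch shows a simplified ddmin-style reduction loop (complement deletion only); the property check is a hypothetical placeholder, and this is neither ProbDD nor CDD.

```python
def ddmin(elements, has_property):
    """Simplified ddmin: shrink `elements` while `has_property` still holds."""
    n = 2
    while len(elements) >= 2:
        chunk = max(1, len(elements) // n)
        subsets = [elements[i:i + chunk] for i in range(0, len(elements), chunk)]
        reduced = False
        for subset in subsets:
            complement = [e for e in elements if e not in subset]
            if has_property(complement):          # try deleting this subset
                elements, n = complement, max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(elements):                # already at finest granularity
                break
            n = min(n * 2, len(elements))         # refine granularity and retry
    return elements

# Hypothetical property: the "failure" persists as long as 3 and 7 are both present.
failing_input = list(range(10))
print(ddmin(failing_input, lambda xs: 3 in xs and 7 in xs))  # -> [3, 7]
```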
"
Benjamin Steenhoek, Siva Sivaraman, Renata Saldivar Gonzalez, Yevhen Mohylevskyy, Roshanak Zilouchian Moghaddam, Wei Le, "Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE"
Abstract: Security vulnerabilities impose significant costs on users and organizations. Detecting and addressing these vulnerabilities early is crucial to avoid exploits and reduce development costs. Recent studies have shown that deep learning models can effectively detect security vulnerabilities. Yet, little research explores how to adapt these models from benchmark tests to practical applications, and whether they can be useful in practice. This paper presents the first empirical study of a vulnerability detection and fix tool with professional software developers on real projects that they own. We implemented DEEPVULGUARD, an IDE-integrated tool based on state-of-the-art detection and fix models, and show that it has promising performance on benchmarks of historic vulnerability data. DEEPVULGUARD scans code for vulnerabilities (including identifying the vulnerability type and vulnerable region of code), suggests fixes, and provides natural-language explanations for alerts and fixes, leveraging chat interfaces. We recruited 17 professional software developers, observed their usage of the tool on their code, and conducted interviews to assess the tool’s usefulness, speed, trust, relevance, and workflow integration. We also gathered detailed qualitative feedback on users’ perceptions and their desired features. Study participants scanned a total of 24 projects, 6.9k files, and over 1.7 million lines of source code, and generated 170 alerts and 50 fix suggestions. We find that although state-of-the-art AI-powered detection and fix tools show promise, they are not yet practical for real-world use due to a high rate of false positives and non-applicable fixes. User feedback reveals several actionable pain points, ranging from incomplete context to lack of customization for the user’s codebase. Additionally, we explore how AI features, including confidence scores, explanations, and chat interaction, can apply to vulnerability detection and fixing. Based on these insights, we offer practical recommendations for evaluating and deploying AI detection and fix models. Our code and data are available at this link: https://figshare.com/s/77992badb1e37c09e4eb. We plan to release our tool as open-source to support further user studies for other AI-based tools.
"
Soneya Binta Hossain, Matthew Dwyer, "TOGLL: Correct and Strong Test Oracle Generation with LLMs"
Abstract: Test oracles play a crucial role in software testing, enabling effective bug detection. Despite initial promise, neural methods for automated test oracle generation often result in a large number of false positives and weaker test oracles. While LLMs have shown impressive effectiveness in various software engineering tasks, including code generation, test case creation, and bug fixing, there remains a notable absence of large-scale studies exploring their effectiveness in test oracle generation. The question of whether LLMs can address the challenges in effective oracle generation is compelling and requires thorough investigation. In this research, we present the first comprehensive study to investigate the capabilities of LLMs in generating correct, diverse, and strong test oracles capable of effectively identifying a large number of unique bugs. To this end, we fine-tuned seven code LLMs using six distinct prompts on a large dataset consisting of 110 Java projects. Utilizing the most effective fine-tuned LLM and prompt pair, we introduce TOGLL, a novel LLM-based method for test oracle generation. To investigate the generalizability of TOGLL, we conduct studies on 25 unseen large-scale Java projects. Besides assessing the correctness, we also assess the diversity and strength of the generated oracles. We compare the results against EvoSuite and the state-of-the-art neural method, TOGA. Our findings reveal that TOGLL can produce 3.8 times more correct assertion oracles and 4.9 times more exception oracles. Regarding bug detection effectiveness, TOGLL can detect 1,023 unique mutants that EvoSuite cannot, which is ten times more than what the previous SOTA neural-based method, TOGA, can detect. Additionally, TOGLL significantly outperforms TOGA in detecting real bugs from the Defects4J dataset.
"
Parsa Alian, Noor Nashid, Mobina Shahbandeh, Taha Shabani, Ali Mesbah, "Feature-Driven End-To-End Test Generation"
Abstract: End-to-end (E2E) testing is essential for ensuring web application quality. However, manual test creation is time-consuming and current test generation techniques produce random tests. In this paper, we present AUTOE2E, a novel approach that leverages Large Language Models (LLMs) to automate the generation of semantically meaningful feature-driven E2E test cases for web applications. AUTOE2E intelligently infers potential features within a web application and translates them into executable test scenarios. Furthermore, we address a critical gap in the research community by introducing E2EBENCH, a new benchmark for automatically assessing the feature coverage of E2E test suites. Our evaluation on E2EBENCH demonstrates that AUTOE2E achieves an average feature coverage of 79%, outperforming the best baseline by 558%, highlighting its effectiveness in generating high-quality, comprehensive test cases.
Aaron Imani, Iftekhar Ahmed, Mohammad Moshirpour, "Context Conquers Parameters: Outperforming Proprietary LLM in Commit Message Generation"
Abstract: Commit messages provide descriptions of the modifications made in a commit using natural language, making them crucial for software maintenance and evolution. Recent developments in Large Language Models (LLMs) have led to their use in generating high-quality commit messages, such as the Omniscient Message Generator (OMG). This method employs GPT-4 to produce state-of-the-art commit messages. However, the use of proprietary LLMs like GPT-4 in coding tasks raises privacy and sustainability concerns, which may hinder their industrial adoption. Considering that open-source LLMs have achieved competitive performance in developer tasks such as compiler validation, this study investigates whether they can be used to generate commit messages that are comparable with OMG. Our experiments show that an open-source LLM can generate commit messages that are comparable to those produced by OMG. In addition, through a series of contextual refinements, we propose lOcal MessagE GenerAtor (OMEGA), a commit message generation (CMG) approach that uses a 4-bit quantized 8B open-source LLM. OMEGA produces state-of-the-art commit messages, surpassing the performance of GPT-4 in practitioners' preference.
Yuanliang Zhang, Yifan Xie, Shanshan Li, Ke Liu, Chong Wang, Zhouyang Jia, Xiangbing Huang, Jie Song, Chaopeng Luo, Zhizheng Zheng, Rulin Xu, Yitong Liu, Si Zheng, Xiangke Liao, "Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar"
Abstract: Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, there are still gaps before they can be fully applied in actual software development processes. Accurately assessing the code generation capabilities of large language models has become an important basis for evaluating and improving the models. Some existing works have constructed datasets to evaluate the capabilities of these models. However, there are three main gaps that hinder an objective evaluation of the real capability of LLMs: the exposure of target code, case timeliness, and dependency availability. The fundamental reason for these gaps is that the code in current datasets may have been exposed during the training phase of LLMs, and due to the continuous training and development of LLMs, their timeliness has been severely compromised. The key to solving the problem is to evaluate the LLMs, as much as possible, using code that they have not encountered before. Thus, the fundamental idea of this paper is to draw on the concept of code obfuscation, changing code at different levels while preserving its functionality and output. To this end, we build a code-obfuscation based benchmark, OBFUSEVAL. We first collect 1,354 raw cases from five real-world projects, including function descriptions and code. Then we use a three-level strategy (symbol, structure, and semantic) to obfuscate descriptions, code, and context dependencies. We evaluate four LLMs on OBFUSEVAL and compare the effectiveness of different obfuscation strategies. We use the official test suites of these projects to evaluate the generated code. The results show that after obfuscation, the average decrease in test pass rate can reach up to 62.5%.
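To make symbol-level obfuscation concrete (the structural and semantic levels are not shown, and this is not OBFUSEVAL's implementation), a minimal Python AST pass that renames identifiers while preserving behavior might look like this; it assumes Python 3.9+ for ast.unparse, and all names are hypothetical.

```python
import ast

SOURCE = """
def average(numbers):
    total = 0
    for value in numbers:
        total += value
    return total / len(numbers)
"""

class RenameSymbols(ast.NodeTransformer):
    """Rename the function, its arguments, and local variables to opaque names."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        for arg in node.args.args:
            arg.arg = self.mapping.get(arg.arg, arg.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

mapping = {"average": "f0", "numbers": "v0", "total": "v1", "value": "v2"}
tree = RenameSymbols(mapping).visit(ast.parse(SOURCE))
obfuscated = ast.unparse(tree)
print(obfuscated)

# Functionality and output are preserved: both versions compute the same result.
ns_orig, ns_obf = {}, {}
exec(SOURCE, ns_orig)
exec(obfuscated, ns_obf)
assert ns_orig["average"]([1, 2, 3]) == ns_obf["f0"]([1, 2, 3])
```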
"
Baoquan Cui, Rong Qu, Zhen Tang, Jian Zhang, "Static Analysis of Remote Procedure Call in Java Programs"
Abstract: The Remote Procedure Call (RPC) is commonly used for inter-process communication over a network, allowing a program to invoke a procedure in another address space, or even on another machine, as if it were a local call within the same address space. Its convenience comes from encapsulating network communication. However, for the same reason, it cannot be penetrated by current static analyzers. Since RPC programs/frameworks play an increasingly important role in various domains, the static analysis of RPC is significant and cannot be ignored. We have observed that many of the existing RPC frameworks/programs written in Java are based on explicit protocols, which makes it possible to model them for static analysis. The challenges are how to identify RPC operations in different frameworks/programs and how to automatically establish relationships between clients and servers. In this paper, we propose a novel approach, RPCBridge, which uses an adapter to unify the most basic operations during the RPC process. It models the RPC with logic rules in a straightforward and precise way based on its semantics, performs points-to analysis, and constructs RPC edges in the call graph, making it more complete. The evaluation on real-world large-scale Java programs based on 5 common RPC frameworks shows that our approach can effectively capture the operations of the RPC (263 matched protocols and 1,098 RPCs) and construct critical links (2,578 edges in the call graph) between clients and servers, of which 60.1% are true caller-callee pairs after execution. Our approach is expected to bring significant benefits (+24.3% leakage paths for the taint analyzer) for previously incompletely modelled code with very little memory and time overhead, and to connect the modules in a system so that it can be statically analyzed more holistically.
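In greatly simplified form (and not RPCBridge's logic-rule model), connecting clients and servers amounts to matching the explicit protocol keys that both sides carry; the hypothetical sketch below adds a call-graph edge for each match, with all facts and names invented for illustration.

```python
# Hypothetical facts extracted by a static analyzer from client and server code:
# each client-side stub invocation and each server-side handler registration
# carries an explicit (service, method) protocol key.
client_calls = [
    {"caller": "OrderService.checkout", "service": "payment", "method": "charge"},
    {"caller": "OrderService.checkout", "service": "inventory", "method": "reserve"},
]
server_handlers = [
    {"service": "payment", "method": "charge", "impl": "PaymentServer.charge"},
    {"service": "inventory", "method": "reserve", "impl": "InventoryServer.reserve"},
    {"service": "inventory", "method": "release", "impl": "InventoryServer.release"},
]

def build_rpc_edges(calls, handlers):
    """Connect each client invocation to the handlers sharing its protocol key."""
    by_key = {}
    for h in handlers:
        by_key.setdefault((h["service"], h["method"]), []).append(h["impl"])
    edges = []
    for c in calls:
        for impl in by_key.get((c["service"], c["method"]), []):
            edges.append((c["caller"], impl))  # new call-graph edge across the RPC
    return edges

for caller, callee in build_rpc_edges(client_calls, server_handlers):
    print(f"{caller} --rpc--> {callee}")
```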
"
Shihao Zhu, Yuqi Guo, Yan Cai, Bin Liang, Long Zhang, Rui Chen, Tingting Yu, "Reduce Dependence for Sound Concurrency Bug Prediction"
Abstract: Recently, dynamic concurrency bug predictions have kept making notable progress in improving concurrency coverage while ensuring soundness. Most of them rely solely on dynamic information in traces and overlook the static semantics of the program when predicting bugs. To ensure soundness, they assume that any (memory) read can fully affect subsequent program execution via control-flow and data-flow. However, this assumption over-approximates constraints among (memory) writes and reads and hence limits the reordering space over thread interleavings, ultimately leading to false negatives. From the perspective of program semantics, only a subset of reads actually affect their subsequent executions. Therefore, by refining dependencies between reads and subsequent executions based on static program semantics, one can refine the assumption and eliminate unnecessary constraints while still guaranteeing soundness. This brings a chance to explore more of the thread interleaving space and uncover more concurrency bugs. However, refining dependencies can compromise soundness and bring heavy overhead. To tackle these challenges, this paper introduces the concept of the Necessary Consistent Read Event (NRE) and a hybrid analysis algorithm. NRE refines dependencies between reads and their subsequent events and is used to identify necessary constraints where a read probably affects the execution of its subsequent events. Next, we design an efficient and accurate hybrid analysis algorithm to calculate NREs for each event in the trace. The hybrid analysis algorithm maps events to program SSA instructions and simulates executions based on the original trace. We focus on data races and developed NRE and the algorithm as a prototype tool ReconP on top of a recent work, M2. We conducted a set of comparative experiments on MySQL with M2 and SeqCheck. The results show that ReconP can detect 46.9% and 22.4% more data races than M2 and SeqCheck, respectively. The hybrid algorithm accounts for only 34% of the total time cost.
"
Huaijin Wang, Zhibo Liu, Yanbo Dai, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu, "Preserving Privacy in Software Composition Analysis: A Study of Technical Solutions and Enhancements"
Abstract: Software composition analysis (SCA) denotes the process of identifying open-source software components in an input software application. SCA has been extensively developed and adopted by academia and industry. However, we notice that the modern SCA techniques in industry scenarios still need to be improved due to privacy concerns. Overall, SCA requires the users to upload their applications’ source code to a remote SCA server, which then deeply inspects the applications and reports the component usage to users. This process is privacy-sensitive since the applications may contain sensitive information, such as proprietary algorithms, trade secrets, and user data. Moreover, applications' source code is generally deemed proprietary, and users do not want to share it with the SCA vendor. To protect customers' privacy, contemporary SCA vendors often propose to deploy a "lite" version of SCA service on the customer side. To avoid the leakage of SCA vendors' valuable assets (e.g., code, model, and data), the "lite" SCA usually only performs a shallow analysis with limited accuracy. Privacy concerns have prevented the SCA technology from being used in real-world scenarios. Therefore, academia and the industry demand privacy-preserving SCA solutions. For the first time, we analyze the privacy requirements of SCA and provide a landscape depicting possible technical solutions with varying privacy gains and overheads. In particular, given that de facto SCA frameworks are primarily driven by code similarity-based techniques, we explore combining several privacy-preserving protocols to encapsulate the similarity-based SCA framework. Among all viable solutions, we find that multi-party computation (MPC) offers the strongest privacy guarantee and plausible accuracy; it, however, incurs high overhead (184×). We optimize the MPC-based SCA framework by reducing the amount of crypto protocol transactions using program analysis techniques. The evaluation results show that our proposed optimizations can reduce the MPC-based SCA overhead to only 8.5% without sacrificing SCA’s privacy guarantee or accuracy.
"
Jintao Huang, Kai Yang, Gaosheng Wang, Zhiqiang Shi, Zhiwen Pan, Shichao Lv, Limin Sun, "Moye: A Wallbreaker for Monolithic Firmware"
Abstract: As embedded devices become increasingly popular, monolithic firmware, known for its execution efficiency and simplicity, is widely used in resource-constrained devices. Different from ordinary firmware, a monolithic firmware image is packed without a file that indicates its format, which challenges the reverse engineering of monolithic firmware. Function identification is a prerequisite for analyzing monolithic firmware. Prior works on function identification are less effective when applied to monolithic firmware due to their heavy reliance on file formats. In this paper, we propose Moye, a novel method to identify functions in monolithic firmware. We leverage the important insight that the use of registers must conform to some constraints. In particular, our approach segments the firmware, locates code sections, and outputs the instructions. We use a masked language model to learn hidden relationships among the instructions to identify function boundaries. We evaluate Moye using 1,318 monolithic firmware images, including 48 samples collected from widely used devices. The evaluation demonstrates that our approach significantly outperforms current works, achieving a precision greater than 98% and a recall rate greater than 97% across most datasets, showing robustness to complicated compilation options.
Haifeng Ruan, Yuntong Zhang, Abhik Roychoudhury, "SpecRover: Code Intent Extraction via LLMs"
Abstract: Autonomous program improvement typically involves automatically producing bug fixes and feature additions. Such program improvement can be accomplished by a combination of large language model (LLM) and program analysis capabilities, in the form of an LLM agent. Since program repair or program improvement typically requires a specification of intended behavior - specification inference can be useful for producing high quality program patches. In this work, we examine efficient and low-cost workflows for iterative specification inference within an LLM agent. Given a GitHub issue to be resolved in a software project, our goal is to conduct iterative code search accompanied by specification inference - thereby inferring intent from both the project structure and behavior. The intent thus captured is examined by a reviewer agent with the goal of vetting the patches as well as providing a measure of confidence in the vetted patches. Our approach SpecRover is built on the open-source LLM agent AutoCodeRover. In an evaluation on the full SWE-Bench consisting of 2294 GitHub issues, it shows more than 50% improvement in efficacy over AutoCodeRover. Compared to the open-source agents available, our work shows modest cost ($0.65 per issue) in resolving an average GitHub issue in SWE-Bench lite. The production of explanation by SpecRover allows for a better "signal" to be given to the developer, on when the suggested patches can be accepted with confidence. SpecRover also seeks to demonstrate the continued importance of specification inference in automated program repair, even as program repair technologies enter the LLM era.
Zhiyuan Li, Jingzheng Wu, Xiang Ling, Tianyue Luo, Zhiqing Rui, Yanjun Wu, "The Seeds of the FUTURE Sprout from History: Fuzzing for Unveiling Vulnerabilities in Prospective Deep-Learning Libraries"
Abstract: The widespread application of Large Language Models (LLMs) underscores the importance of Deep Learning (DL) technologies that rely on foundational DL libraries such as PyTorch and TensorFlow. Despite their robust features, these libraries face challenges with scalability and adaptation to rapid advancements in the LLM community. In response, tech giants like Apple and Huawei are developing their own DL libraries to enhance performance, increase scalability, and safeguard intellectual property. Ensuring the security of these libraries is crucial, with fuzzing being a vital solution. However, existing fuzzing frameworks struggle with target flexibility, effectively testing bug-prone API sequences, and leveraging the limited available information in new libraries. To address these limitations, we propose FUTURE, the first universal DL library fuzzing framework tailored for newly introduced and prospective DL libraries. FUTURE leverages historical bug information from existing libraries and fine-tunes LLMs for specialized code generation. This strategy helps identify vulnerabilities in new libraries and uses insights from these libraries to enhance security in existing ones, creating a cycle from history to future and back. To evaluate FUTURE's effectiveness, we conduct comprehensive evaluations on three newly introduced DL libraries. Results demonstrate that FUTURE significantly outperforms existing fuzzers in bug detection, success rate of bug reproduction, validity rate of code generation, and API coverage. Notably, FUTURE has detected 148 bugs across 452 targeted APIs, including 142 previously unknown bugs. Among these, 10 have been assigned CVE IDs. Additionally, FUTURE detects 7 bugs in PyTorch, demonstrating its ability to enhance security in existing libraries in reverse.
Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, Anita Sarma, "What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI"
Abstract: Generative AI (genAI) tools, such as ChatGPT or Copilot, are advertised to improve developer productivity and are being integrated into software development. However, misaligned trust, skepticism, and usability concerns can impede the adoption of such tools. Research also indicates that AI can be exclusionary, failing to support diverse users adequately. One such aspect of diversity is cognitive diversity, i.e., variations in users’ cognitive styles, which leads to divergence in perspectives and interaction styles. When an individual’s cognitive style is unsupported, it creates barriers to technology adoption. Therefore, to understand how to effectively integrate genAI tools into software development, it is first important to model which factors affect developers’ trust and intentions to adopt genAI tools in practice. We developed a theoretical model to (1) identify factors that influence developers’ trust in genAI tools and (2) examine the relationship between developers’ trust, cognitive styles, and their intentions to use these tools. We surveyed software developers (N=238) at two major global tech organizations and employed Partial Least Squares-Structural Equation Modeling (PLS-SEM) to evaluate our model. Our findings reveal that genAI’s system/output quality, functional value, and goal maintenance significantly influence developers’ trust in these tools. Furthermore, developers’ trust and cognitive styles influence their intentions to use these tools. We offer practical suggestions for designing genAI tools for effective use and inclusive user experience.
" "
Hao Zhong, "Understanding Compiler Bugs in Real Development"
Abstract: Compilers are critical in development, but compiler bugs can cause hidden and serious bugs in their compiled code. To deepen the understanding of compiler bugs, in prior empirical studies, researchers read the bug reports and patches of compilers, and analyze their causes, locations, and patterns. Although they derive many interesting findings, their studies are limited. First, as bug reports seldom explain which projects encounter compiler bugs, it is infeasible to understand their broader impact. Second, before compiler bugs are fixed, programmers can bypass such bugs. The bug reports of compilers do not introduce such workarounds. Finally, the distribution of compiler bugs can be distorted, since researchers and compiler developers also file bug reports. In this paper, we propose a novel angle to analyze compiler bugs. Instead of compiler bug reports, we collect compiler bugs that are mentioned in real development. When programmers encounter compiler bugs in real development, they can leave traces in their commit messages. By searching such messages, we collected 644 unique commits whose messages explicitly mention the URLs of compiler bugs. From this angle, in this paper, we conduct the first empirical study to analyze compiler bugs in the wild. We summarize our results into seven useful findings for users, compiler developers, and researchers. For example, for researchers, we find that some large workarounds of compiler bugs involve repetitive and systematic changes, which indicates a new research opportunity for code migration tools. Furthermore, we attempt to apply our findings in real development, and we obtain positive feedback.
"
Rosalia Tufano, Alberto Martin-Lopez, Ahmad Tayeb, Ozren Dabic, Sonia Haiduc, Gabriele Bavota, "Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?"
Abstract: Several techniques have been proposed to (partially) automate code review. Early support consisted of recommending the most suited reviewer for a given change or prioritizing the review tasks. With the advent of deep learning in software engineering, the level of automation has been pushed to new heights, with approaches able to provide feedback on source code in natural language as a human reviewer would do. Also, recent work documented open source projects adopting Large Language Models (LLMs) as co-reviewers. Although the research in this field is very active, little is known about the actual impact of including automatically generated code reviews in the code review process. While there are many aspects worth investigating (e.g., is knowledge transfer between developers affected?), in this work we focus on three of them: (i) review quality, i.e., the reviewer's ability to identify issues in the code; (ii) review cost, i.e., the time spent reviewing the code; and (iii) reviewer’s confidence, i.e., how confident the reviewer is about the provided feedback. We run a controlled experiment with 29 professional developers who reviewed different programs with/without the support of an automatically generated code review. During the experiment we monitored the reviewers’ activities, for over 50 hours of recorded code reviews. We show that reviewers consider valid most of the issues automatically identified by the LLM and that the availability of an automated review as a starting point strongly influences their behavior: Reviewers tend to focus on the code locations indicated by the LLM rather than searching for additional issues in other parts of the code. The reviewers who started from an automated review identified a higher number of low-severity issues while, however, not identifying more high-severity issues as compared to a completely manual process. Finally, the automated support did not result in saved time and did not increase the reviewers’ confidence.
Weiwei Xu, Kai Gao, Hao He, Minghui Zhou, "LiCoEval: Evaluating LLMs on License Compliance in Code Generation"
Abstract: Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose an evaluation benchmark, LiCoEval, to evaluate the license compliance capabilities of LLMs. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.
"
Dehai Zhao, Zhenchang Xing, Qinghua Lu, Sherry Xiwei Xu, Liming Zhu, "SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI Automation"
Abstract: UI automation is a useful technique for UI testing, bug reproduction and robotic process automation. Recording the user actions with an application assists rapid development of UI automation scripts, but existing recording techniques are intrusive, rely on OS or GUI framework accessibility support, or assume specific app implementations. Reverse-engineering user actions from screencasts is non-intrusive, but a key reverse-engineering step is currently missing: recognizing human-understandable structured user actions ([command] [widget] [location]) from action screencasts. To fill the gap, we propose a deep learning-based computer vision model which can recognize 11 commands and 11 widgets, and generate location phrases from action screencasts, through joint learning and multi-task learning. We label a large dataset with 7260 video-action pairs, which record the user interactions with Word, Zoom, Firefox, Photoshop, and Windows 10 Settings. Through extensive experiments, we confirm the effectiveness and generality of our model, and demonstrate the usefulness of a screencast-to-action-script tool built upon our model for bug reproduction.
Zeyang Ma, Dong Jae Kim, Tse-Hsun (Peter) Chen, "LibreLog: Accurate and Efficient Unsupervised Log Parsing Using Open-Source Large Language Models"
Abstract: Log parsing is a critical step that transforms unstructured log data into structured formats, facilitating subsequent log-based analysis. Traditional syntax-based log parsers are efficient and effective, but they often experience decreased accuracy when processing logs that deviate from the predefined rules. Recently, large language model (LLM) based log parsers have shown superior parsing accuracy. However, existing LLM-based parsers face three main challenges: 1) time-consuming and labor-intensive manual labeling for fine-tuning or in-context learning, 2) increased parsing costs due to the vast volume of log data and limited context size of LLMs, and 3) privacy risks from using commercial models like ChatGPT with sensitive log information. To overcome these limitations, this paper introduces LibreLog, an unsupervised log parsing approach that leverages open-source LLMs (i.e., Llama3-8B) to enhance privacy and reduce operational costs while achieving state-of-the-art parsing accuracy. LibreLog first groups logs with similar static text but varying dynamic variables using a fixed-depth grouping tree. It then parses logs within these groups using three components: i) similarity scoring-based retrieval augmented generation: selects diverse logs within each group based on Jaccard similarity, helping the LLM distinguish between static text and dynamic variables; ii) self-reflection: iteratively queries the LLM to refine log templates, improving parsing accuracy; and iii) log template memory: stores parsed templates to reduce LLM queries for improved parsing efficiency. Our evaluation on LogHub-2.0 shows that LibreLog achieves 25% higher parsing accuracy and processes logs 2.7 times faster compared to state-of-the-art LLM-based parsers. In short, LibreLog addresses privacy and cost concerns of using commercial LLMs while achieving state-of-the-art parsing efficiency and accuracy.
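A minimal sketch of the grouping-then-templating idea, assuming a crude grouping key instead of LibreLog's fixed-depth grouping tree and omitting the LLM components entirely; the log lines are hypothetical.

```python
from collections import defaultdict

logs = [
    "Connection from 10.0.0.1 closed after 32 ms",
    "Connection from 10.0.0.7 closed after 180 ms",
    "Disk /dev/sda usage at 91 percent",
    "Disk /dev/sdb usage at 47 percent",
]

# Group messages by token count and first token (a crude stand-in for a grouping tree).
groups = defaultdict(list)
for line in logs:
    tokens = line.split()
    groups[(len(tokens), tokens[0])].append(tokens)

def template(group):
    """Keep tokens shared by all messages; mark varying positions as <*>."""
    columns = zip(*group)
    return " ".join(col[0] if len(set(col)) == 1 else "<*>" for col in columns)

for key, group in groups.items():
    print(template(group))
# Connection from <*> closed after <*> ms
# Disk <*> usage at <*> percent
```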
Yi Sun, Zhuo Zhang, Xiangyu Zhang, "FairChecker: Detecting Fund-stealing Bugs in DeFi Protocols via Fairness Validation"
Abstract: Decentralized Finance (DeFi) is an emerging paradigm within the blockchain space that aims to revolutionize conventional financial systems through the application of blockchain technology. The substantial value of digital assets managed by DeFi protocols makes it a lucrative target for attacks. Despite the human resources and the application of automated tools, frequent attacks still cause significant fund losses to DeFi participants. Existing tools primarily rely on oracles similar to those used in traditional software analysis, making it challenging for them to detect functional bugs specific to the DeFi domain. Since blockchain functions as a distributed ledger system, the foundation of any DeFi protocol is the accurate maintenance of key state variables representing user funds. If these variables are not properly updated or designed to reflect the intended flow of funds, attackers can exploit these flaws to steal assets. From the study of popular DeFi protocols, we observe that, in DeFi systems, to ensure a transaction does not misappropriate someone's funds, the direction of changes (increase or decrease) in values associated with a user's assets or debts has to adhere to certain fairness properties. We propose the concept of a fairness bug, which allows attackers to gain profit without cost. We propose an inter-procedural and inter-contract static analysis technique that utilizes symbolic execution and an SMT solver to automatically detect fairness bugs in DeFi smart contracts. We have implemented our fairness-checking approach in our tool, named FairChecker. We evaluate our tool on a benchmark of 113 real-world DeFi protocols with 34 fairness bugs. The results show that our tool can detect 32 bugs with a recall of 94.1% and a precision of 46.4%, demonstrating its effectiveness.
"
Li Huang, Weifeng Sun, Meng Yan, "Iterative Generation of Adversarial Example for Deep Code Models"
Abstract: Deep code models are vulnerable to adversarial attacks, making it possible for semantically identical inputs to trigger different responses. Current black-box attack methods typically prioritize the impact of identifiers on the model based on custom importance scores or program context and incrementally replace identifiers to generate adversarial examples. However, these methods often fail to fully leverage feedback from failed attacks to guide subsequent attacks, resulting in problems such as local optima bias and efficiency dilemmas. In this paper, we introduce ITGen, a novel black-box adversarial example generation method that iteratively utilizes feedback from failed attacks to refine the generation process. It employs a bitvector-based representation of code variants to mitigate local optima bias. By integrating these bit vectors with feedback from failed attacks, ITGen uses an enhanced Bayesian optimization framework to efficiently predict the most promising code variants, significantly reducing the search space and thus addressing the efficiency dilemma. We conducted experiments on a total of nine deep code models for both understanding and generation tasks, demonstrating ITGen's effectiveness and efficiency, as well as its ability to enhance model robustness through adversarial fine-tuning. For example, on average, ITGen improves the attack success rate by 47.98% and 69.70% over the state-of-the-art techniques (i.e., ALERT and BeamAttack), respectively.
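The bit-vector encoding of code variants can be illustrated with a toy example, where each bit toggles one hypothetical identifier substitution; the Bayesian optimization that selects which vectors to query, and any interaction with a victim model, are not shown.

```python
from itertools import product

# Hypothetical renaming candidates for identifiers in a code snippet under attack.
substitutions = [
    ("total", "acc"),
    ("value", "item"),
    ("numbers", "xs"),
]
snippet = "total = sum(value for value in numbers)"

def apply_variant(bits):
    """Each bit decides whether the corresponding identifier is replaced."""
    code = snippet
    for bit, (old, new) in zip(bits, substitutions):
        if bit:
            code = code.replace(old, new)
    return code

# An attack would query the victim model only on promising bit vectors; here we
# simply enumerate the 2^3 variants that the representation encodes.
for bits in product([0, 1], repeat=len(substitutions)):
    print(bits, "->", apply_variant(bits))
```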
Chenxi Zhang, Yufei Liang, Tian Tan, Chang Xu, Shuangxiang Kan, Yue Li, "Interactive Cross-Language Pointer Analysis for Resolving Native Code in Java Programs"
Abstract: Java offers the Java Native Interface (JNI), which allows programs running in the Java Virtual Machine to invoke and be manipulated by native applications and libraries written in other languages, typically C. While the JNI mechanism significantly enhances the Java platform's capabilities, it also presents challenges for static analysis of Java programs due to the complex behaviors introduced by native code. Therefore, effectively resolving the interactions between Java and native code is crucial for static analysis. In this paper, we introduce JNIFER, the first interactive cross-language pointer analysis for resolving native code in Java programs. JNIFER integrates both Java and C pointer analyses, equipped with advanced native call and JNI function analyses, enabling the simultaneous analysis of both Java and native code. During the analysis of cross-language interactions, the two analyzers interact with each other, constructing cross-language points-to relations and call graphs, thereby approximating the runtime behavior at the interaction sites. Our evaluation shows that JNIFER outperforms state-of-the-art approaches in terms of soundness while maintaining high precision and comparable efficiency, as evidenced by extensive experiments on OpenJDK and real-world Java applications.
Chiming Duan, Yong Yang, Tong Jia, Guiyang Liu, Jinbu Liu, Huxing Zhang, Qi Zhou, Ying Li, Gang Huang, "FAMOS: Fault diagnosis for Microservice Systems through Effective Multi-modal Data Fusion"
Abstract: Accurately diagnosing the fault that causes a failure is crucial for maintaining the reliability of a microservice system after the failure occurs. Mainstream fault diagnosis approaches are data-driven and mainly rely on three modalities of runtime data: traces, logs, and metrics. Diagnosing faults with multiple modalities of data in microservice systems has been a clear trend in recent years because different types of faults and corresponding failures tend to manifest in data of various modalities. Accurately diagnosing faults by fully leveraging multiple modalities of data is confronted with two challenges: (1) how to minimize information loss when extracting features from data of each modality; and (2) how to correctly capture and utilize the relationships among data of different modalities. To address these challenges, we propose FAMOS, a Fault diagnosis Approach for MicrOservice Systems through effective multi-modal data fusion. On the one hand, FAMOS employs independent feature extractors to preserve the intrinsic features of each modality. On the other hand, FAMOS introduces a new Gaussian-attention mechanism to accurately correlate data of different modalities and then captures the inter-modality relationship with a cross-attention mechanism. We evaluated FAMOS on two datasets constructed by injecting comprehensive and abundant faults into an open-source microservice system and a real-world industrial microservice system. Experimental results demonstrate FAMOS’s effectiveness in fault diagnosis, achieving significant improvements in F1 scores compared to state-of-the-art (SOTA) methods, with an increase of 20.33%.
Van-Hoang Le, Yi Xiao, Hongyu Zhang, "Unleashing the True Potential of Semantic-based Log Parsing with Pre-trained Language Models"
Abstract: Software-intensive systems often produce console logs for troubleshooting purposes. Log parsing, which aims at parsing a log message into a specific log template, typically serves as the first step toward automated log analytics. To better comprehend the semantic information of log messages, many semantic-based log parsers have been proposed. These log parsers fine-tune a small pretrained language model (PLM) such as RoBERTa on a few labelled log samples. With the increasing popularity of large language models (LLMs), some recent studies also propose to leverage LLMs such as ChatGPT through in-context learning for automated log parsing, and obtain better results than previous semantic-based log parsers with small PLMs. In this paper, we show that semantic-based log parsers with small PLMs can actually achieve better or comparable performance to state-of-the-art LLM-based log parsing models while being more efficient and cost-effective. We propose UNLEASH, a novel semantic-based log parsing approach, which incorporates three enhancement methods to boost the performance of PLMs for log parsing, including (1) an entropy-based ranking method to select the most informative log samples; (2) a contrastive learning method to enhance the fine-tuning process; and (3) an inference optimization method to improve the log parsing performance. We evaluate UNLEASH on a set of large log datasets and the experimental results show that UNLEASH is effective and efficient when compared to state-of-the-art log parsers.
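As a rough sketch of entropy-based sample ranking (not UNLEASH's exact scoring), one could rank candidate log messages by the Shannon entropy of their token distribution and label the most informative ones first; the messages below are hypothetical.

```python
import math
from collections import Counter

candidates = [
    "User 4021 logged in from 10.0.0.5",
    "User 4021 logged in from 10.0.0.5",   # duplicate, adds little information
    "Cache miss for key session:9f2c, falling back to database",
    "Heartbeat ok",
]

def token_entropy(message):
    """Shannon entropy (bits) of the message's token frequency distribution."""
    counts = Counter(message.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Rank candidates so the most informative messages are labelled/fine-tuned on first.
ranked = sorted(candidates, key=token_entropy, reverse=True)
for msg in ranked:
    print(f"{token_entropy(msg):5.2f}  {msg}")
```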
Yiteng Peng, Daoyuan Wu, Zhibo Liu, Dongwei Xiao, Zhenlan Ji, Juergen Rahmel, Shuai Wang, "Testing and Understanding Deviation Behaviors in FHE-hardened Machine Learning Models"
Abstract: Fully homomorphic encryption (FHE) is a promising cryptographic primitive that enables secure computation over encrypted data. A primary use of FHE is to support privacy-preserving machine learning (ML) on public cloud infrastructures. Despite the rapid development of FHE-based ML (or HE-ML) in recent years, the community still lacks a systematic understanding of their robustness. In this paper, we aim to systematically test and understand the deviation behaviors of HE-ML models, where the same input causes deviant outputs between FHE-hardened models and their plaintext versions, leading to completely incorrect model predictions. To effectively uncover deviation-triggering inputs under the constraints of expensive FHE computation, we design a novel differential testing tool called HEDiff, which leverages the margin metric on the plaintext model as guidance to drive targeted testing on FHE models. For the identified deviation inputs, we further analyze them to determine whether they exhibit general noise patterns that are transferable. We evaluate HEDiff using three popular HE-ML frameworks, covering 12 different combinations of models and datasets. HEDiff successfully detected hundreds of deviation inputs across almost every tested FHE framework and model. We also quantitatively show that the identified deviation inputs are (visually) meaningful in comparison to regular inputs. Further schematic analysis reveals the root cause of these deviant inputs and allows us to generalize their noise patterns for more directed testing.
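A toy version of margin-guided differential testing, assuming a mock plaintext classifier and a noisy stand-in for the FHE-hardened model (both hypothetical): low-margin inputs, i.e., those near the decision boundary, are prioritized for the expensive encrypted runs, and any class disagreement is reported as a deviation.

```python
import random

random.seed(0)

def plaintext_model(x):
    """Toy binary classifier returning class scores."""
    s = 1 / (1 + pow(2.718281828, -(x - 0.5)))
    return [1 - s, s]

def fhe_model(x):
    """Stand-in for the FHE-hardened model: same logic plus approximation noise."""
    noisy = x + random.gauss(0, 0.05)
    return plaintext_model(noisy)

inputs = [random.uniform(0, 1) for _ in range(200)]

def margin(x):
    scores = plaintext_model(x)
    return abs(scores[0] - scores[1])   # small margin = near the decision boundary

# Spend the expensive "encrypted" runs on the lowest-margin inputs first.
for x in sorted(inputs, key=margin)[:20]:
    plain_scores = plaintext_model(x)
    fhe_scores = fhe_model(x)
    plain_cls = plain_scores.index(max(plain_scores))
    fhe_cls = fhe_scores.index(max(fhe_scores))
    if plain_cls != fhe_cls:
        print(f"deviation at x={x:.4f}: plaintext={plain_cls}, fhe={fhe_cls}")
```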
"
Chuan Luo, Shuangyu Lyu, Wei Wu, Hongyu Zhang, Dianhui Chu, Chunming Hu, "Towards High-strength Combinatorial Interaction Testing for Highly Configurable Software Systems"
Abstract: Highly configurable software systems are crucial in practice to satisfy the rising demand for software customization, and combinatorial interaction testing (CIT) is an important methodology for testing such systems. Constrained covering array generation (CCAG), as the core problem in CIT, is to construct a t-wise covering array (CA) of minimum size, where t represents the testing strength. Extensive studies have demonstrated that high-strength CIT (e.g., 4-wise and 5-wise CIT) has stronger fault detection capability than low-strength CIT (i.e., 2-wise and 3-wise CIT), and there exist certain critical faults that can be disclosed through high-strength CIT. Although existing CCAG algorithms have exhibited effectiveness in solving the low-strength CCAG problem, they suffer from a severe high-strength challenge when solving 4-wise and 5-wise CCAG, which urgently calls for effective solutions to the 4-wise and 5-wise CCAG problems. To alleviate the high-strength challenge, we propose a novel and effective local search algorithm dubbed HSCA. Particularly, HSCA incorporates three new and powerful techniques, i.e., a multi-round CA generation mechanism, a dynamic priority assigning technique, and a variable grouping strategy, to improve its performance. Extensive experiments on 35 real-world and synthetic instances demonstrate that HSCA can generate significantly smaller 4-wise and 5-wise CAs than existing state-of-the-art CCAG algorithms. More encouragingly, out of all 35 instances, HSCA successfully constructs 4-wise and 5-wise CAs for 11 and 15 instances, respectively, where existing CCAG algorithms fail. Our results indicate that HSCA can effectively mitigate the high-strength challenge.
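To make t-wise coverage concrete, the following sketch computes the fraction of 2-way value combinations covered by a small hypothetical test suite; constraints are ignored, and this is a coverage check, not HSCA's local search.

```python
from itertools import combinations, product

# Hypothetical configurable system: three options, each with two values.
options = {"compression": ["on", "off"], "cache": ["lru", "none"], "tls": ["1.2", "1.3"]}

tests = [  # each row is one configuration (a candidate covering-array row)
    {"compression": "on",  "cache": "lru",  "tls": "1.2"},
    {"compression": "off", "cache": "none", "tls": "1.3"},
    {"compression": "on",  "cache": "none", "tls": "1.3"},
    {"compression": "off", "cache": "lru",  "tls": "1.2"},
]

def twise_coverage(tests, options, t=2):
    """Fraction of all t-way value combinations covered by the test suite."""
    required = set()
    for opts in combinations(sorted(options), t):
        for values in product(*(options[o] for o in opts)):
            required.add(tuple(zip(opts, values)))
    covered = set()
    for row in tests:
        for opts in combinations(sorted(options), t):
            covered.add(tuple((o, row[o]) for o in opts))
    return len(covered & required) / len(required)

print(f"2-wise coverage: {twise_coverage(tests, options):.0%}")  # 83% for this suite
```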
Mengying Wu, Geng Hong, Wuyuao Mai, Xinyi Wu, Lei Zhang, Yingyuan Pu, Huajun Chai, Lingyun Ying, Haixin Duan, Min Yang, "Exposing the Hidden Layer: Software Repositories in the Service of SEO Manipulation"
Abstract: Distinct from traditional malicious packages, this paper uncovers a novel attack vector named “blackhat Search Engine Optimization through REPositories (RepSEO)”. In this approach, attackers carefully craft packages to manipulate search engine results, exploiting the credibility of software repositories to promote illicit websites. Our research presents a systematic analysis of the underground ecosystem of RepSEO, identifying key players such as account providers, advertisers, and publishers. We developed an effective detection tool, applied to a ten-year large-scale dataset of npm, Docker Hub, and NuGet software repositories. This investigation led to the startling discovery of 3,801,682 abusive packages, highlighting the widespread nature of this attack. Our study also delves into the supply chain tactics of these attacks, revealing strategies like the use of self-hosted email services for account registration, redirection methods to obscure landing pages, and rapid deployment techniques by aggressive attackers. Additionally, we explore the profit motives behind these attacks, identifying two primary types of advertisers: survey-based advertisers and malware distribution advertisers. We reported the RepSEO packages to npm, NuGet, and Docker Hub, together with the related supply chain vulnerabilities to Google, and received their acknowledgments. Software repositories have started removing the abusive packages as of this paper’s submission. We also open-source our code and data to facilitate future research.
"
Hanmo You, Zan Wang, Bin Lin, Junjie Chen, "Navigating the Testing of Evolving Deep Learning Systems: An Exploratory Interview Study"
Abstract: Deep Learning (DL) systems have been widely adopted across various industrial domains such as autonomous driving and intelligent healthcare. As with traditional software, DL systems also need to constantly evolve to meet ever-changing user requirements. However, ensuring the quality of these continuously evolving systems presents significant challenges, especially in the context of testing. Understanding how industry developers address these challenges and what extra obstacles they are facing could provide valuable insights for further safeguarding the quality of DL systems. To reach this goal, we conducted semi-structured interviews with 22 DL developers from diverse domains and backgrounds. More specifically, our study focuses on exploring the challenges developers encounter in testing evolving DL systems, the practical solutions they employ, and their expectations for extra support. Our results highlight the difficulties in testing evolving DL systems and identify the best practices for DL developers to address them. Additionally, we pinpoint potential future research directions to enhance testing effectiveness in evolving DL systems.
Chenxing Zhong, Daniel Feitosa, Paris Avgeriou, Huang Huang, Yue Li, He Zhang, "PairSmell: A Novel Perspective Inspecting Software Modular Structure"
Abstract: Enhancing the modular structure of existing systems has attracted substantial research interest, focusing on two main methods: (1) software modularization and (2) identifying design issues (e.g., smells) as refactoring opportunities. However, re-modularization solutions often require extensive modifications to the original modules, and the design issues identified are generally too coarse to guide refactoring strategies. Combining the above two methods, this paper introduces a novel concept, PairSmell, which exploits modularization to pinpoint design issues necessitating refactoring. We concentrate on a granular but fundamental aspect of modularity principles: the modular relation (MR), i.e., whether a pair of entities are separated or collocated. The main assumption is that, if the actual MR of a pair violates its 'apt MR', i.e., an MR agreed on by multiple modularization tools (as raters), it can be deemed likely a flawed architectural decision that necessitates further examination. To quantify and evaluate PairSmell, we conduct an empirical study on 20 C/C++ and Java projects, using 4 established modularization tools to identify two forms of PairSmell: inapt separated pairs (InSep) and inapt collocated pairs (InCol). Our study on 260,003 instances reveals that their architectural impacts are substantial: (1) on average, 14.60% and 20.44% of software entities are involved in InSep and InCol MRs respectively; (2) InSep pairs are associated with 190% more co-changes than properly separated pairs, while InCol pairs are associated with 35% fewer co-changes than properly collocated pairs, both indicating a successful identification of modular structures detrimental to software quality; and (3) both forms of PairSmell persist across software evolution. This evidence strongly suggests that PairSmell can provide meaningful insights for inspecting modular structure, with the identified issues being both granular and fundamental, making the enhancement of modular design more efficient.
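To make the modular-relation idea concrete, here is a hedged Python sketch (illustrative only, not the paper's implementation): the 'apt' relation of a pair is the relation all rater partitions agree on, and a pair is flagged as InSep or InCol when the system's actual partition contradicts it. All names and the toy data are assumptions.

```python
from itertools import combinations

def pair_smells(entities, actual_modules, rater_partitions):
    """For each entity pair, the apt modular relation is the one all rater
    partitions agree on; a pair whose actual relation contradicts the apt
    relation is flagged as InSep (inapt separated) or InCol (inapt collocated)."""
    def collocated(partition, a, b):
        return partition[a] == partition[b]

    insep, incol = [], []
    for a, b in combinations(entities, 2):
        votes = {collocated(p, a, b) for p in rater_partitions}
        if len(votes) != 1:
            continue                      # raters disagree: no apt relation for this pair
        apt_collocated = votes.pop()
        actually_collocated = collocated(actual_modules, a, b)
        if apt_collocated and not actually_collocated:
            insep.append((a, b))          # inapt separated pair
        elif not apt_collocated and actually_collocated:
            incol.append((a, b))          # inapt collocated pair
    return insep, incol

entities = ["A", "B", "C"]
actual = {"A": "m1", "B": "m2", "C": "m2"}
raters = [{"A": "x", "B": "x", "C": "y"}, {"A": "u", "B": "u", "C": "v"}]
print(pair_smells(entities, actual, raters))   # ([('A', 'B')], [('B', 'C')])
```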
"
Fraol Batole, David OBrien, Tien Nguyen, Robert Dyer, Hridesh Rajan, "An LLM-Based Agent-Oriented Approach for Automated Code Design Issue Localization"
Abstract: Maintaining software design quality is crucial for the long-term maintainability and evolution of systems. However, design issues such as poor modularity and excessive complexity often emerge as codebases grow. Developers rely on external tools, such as program analysis techniques, to identify such issues. This work investigates an automated approach for analyzing and localizing design issues using Large Language Models (LLMs). Large language models have demonstrated significant performance on coding tasks, but directly leveraging them for design issue localization is challenging. Large codebases exceed typical LLM context windows, and program analysis tool outputs in non-textual modalities (e.g., graphs or interactive visualizations) are incompatible with LLMs’ natural language inputs. To address these challenges, we propose LOCALIZEAGENT, a novel multi-agent framework for effective design issue localization. LOCALIZEAGENT integrates specialized agents that (1) analyze code to identify potential code design issues, (2) transform program analysis outputs into abstraction-aware LLM-friendly natural language summaries, (3) generate context-aware prompts tailored to specific refactoring types, and (4) leverage LLMs to locate and rank the localized issues based on their relevance. Our evaluation using diverse real-world codebases demonstrates significant improvements over baseline approaches, with LOCALIZEAGENT achieving 138%, 166%, and 206% relative improvements in exact match accuracy for localizing information hiding, complexity, and modularity issues, respectively.
Qingkai Shi, Xiaoheng Xie, Xianjin Fu, Peng Di, Huawei Li, Ang Zhou, Gang Fan, "Datalog-Based Language-Agnostic Change Impact Analysis for Microservices"
Abstract: The shift-left principle in the industry requires us to test a software application as early as possible. Particularly, when code changes in a microservice application are committed to the code repository, we have to efficiently identify all public microservice interfaces impacted by the changes, such that the impacted interfaces can be tested as soon as possible. However, developing an efficient change impact analysis is extremely challenging in microservices because of the multilingual problem: microservice applications are often implemented using varying programming languages and involve diverse frameworks and configuration files. To address this issue, this paper presents Microscope, a language-agnostic change impact analysis that uniformly represents the code, configuration files, frameworks, and code changes by relational Datalog rules. Microscope then benefits from an efficient Datalog solver to identify impacted interfaces. Experiments based on the use of Microscope in a leading software company demonstrate that Microscope is both effective and fast as it successfully identifies interfaces impacted by 112 code commits, with moderate time overhead, and could reduce 97% of interfaces to test and save 73% of testing time after code changes.
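For intuition about the Datalog formulation, a minimal Python stand-in (not Microscope itself) is sketched below: call and configuration edges act as facts, and a naive bottom-up fixpoint plays the role of the Datalog solver to mark every public interface that can reach a changed element. The edge names are invented for illustration.

```python
def impacted_interfaces(edges, changed, interfaces):
    """Toy stand-in for Datalog-style change impact analysis.
    Roughly:  impacted(X) :- changed(X).
              impacted(X) :- calls(X, Y), impacted(Y)."""
    impacted = set(changed)
    fixpoint = False
    while not fixpoint:                     # naive bottom-up fixpoint evaluation
        fixpoint = True
        for caller, callee in edges:
            if callee in impacted and caller not in impacted:
                impacted.add(caller)
                fixpoint = False
    return [i for i in interfaces if i in impacted]

# Facts spanning code and configuration: (caller, callee) edges
edges = [("GET /orders", "OrderSvc.list"), ("OrderSvc.list", "db.query"),
         ("GET /users", "UserSvc.list")]
print(impacted_interfaces(edges, changed={"db.query"},
                          interfaces=["GET /orders", "GET /users"]))
# ['GET /orders']
```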
"
Robert Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner, Emily First, "Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification"
Abstract: Formal verification using proof assistants, such as Coq, allows for high-quality software. However, the verification process is expensive, requiring significant expertise and manual effort to write proofs. Recent work has explored automating proof synthesis using machine learning, and even more recently, large language models (LLMs), showing that retrieving relevant premises (such as lemmas and definitions) is helpful for these models. We present Rango, a fully automated proof synthesis tool for Coq that uses, not only relevant premises but also similar proofs from the current project. Rango uses retrieval augmentation at every step of the proof to automatically determine which proofs and premises to include in the context of its fine-tuned LLM. In this way, Rango adapts to the project _and_ to the evolving state of the proof. We create a new dataset, CoqStoq, of 2,205 open-source Coq projects from GitHub, which includes both training data and a curated evaluation benchmark of well-maintained projects. On this benchmark, Rango synthesizes 27.7% of the proofs, which is 10% more proofs than prior state-of-the-art tool Tactician. Our evaluation also shows that adding relevant proofs to the context in Rango leads to a 45% increase in the number of theorems proven.
Miao Miao, Austin Mordahl, Dakota Soles, Alice Beideck, Shiyi Wei, "An Extensive Empirical Study of Nondeterministic Behavior in Static Analysis Tools"
Abstract: Recent research has studied the importance and identified causes of nondeterminism in software. Static analysis tools exhibit many risk factors for nondeterministic behavior, but no work has analyzed the occurrence of such behavior in these tools. To bridge this gap, we perform an extensive empirical study aiming to understand past and ongoing nondeterminism in 12 popular, open-source static analysis tools that target 5 types of projects. We first conduct a qualitative study to understand the extent to which nondeterministic behavior has been found and addressed within the tools under study, and find results in 7 tool repositories. After classifying the issues and commits by root cause, we find that the majority of nondeterminisms are caused by concurrency issues, incorrect analysis logic, or assumed orderings of unordered data structures, which have shared patterns. We also perform a quantitative analysis, where we use two strategies and diverse input programs and configurations to detect yet-unknown nondeterministic behaviors. We discover such behavior in 8 out of the 12 tools, including 3 which had no results from the qualitative analysis. We find that nondeterminism often appears in multiple configurations on a variety of input programs. We communicated all identified nondeterminism to the developers and received confirmations from the developers of five tools. Finally, we detail a case study of fixing FlowDroid's nondeterministic behavior.
"
Xintong Zhou, Zhenyang Xu, Mengxiao Zhang, Yongqiang Tian, Chengnian Sun, "WDD: Weighted Delta Debugging"
Abstract: Delta Debugging is a widely used family of algorithms (e.g., ddmin and ProbDD) to automatically minimize bug-triggering test inputs, thus to facilitate debugging. It takes a list of elements with each element representing a fragment of the test input, systematically partitions the list at different granularities, identifies and deletes bug-irrelevant partitions. Prior delta debugging algorithms assume there are no differences among the elements in the list, and thus treat them uniformly during partitioning. However, in practice, this assumption usually does not hold, because the size (referred to as weight) of the fragment represented by each element can vary significantly. For example, a single element representing 50% of the test input is much more likely to be bug-relevant than elements representing only 1%. This assumption inevitably impairs the efficiency or even effectiveness of these delta debugging algorithms. This paper proposes Weighted Delta Debugging (WDD), a novel concept to help prior delta debugging algorithms overcome the limitation mentioned above. The key insight of WDD is to assign each element in the list a weight according to its size, and distinguish different elements based on their weights during partitioning. We designed two new minimization algorithms, Wddmin and WProbDD, by applying WDD to ddmin and ProbDD respectively. We extensively evaluated Wddmin and WProbDD in two representative applications, HDD and Perses, on 62 benchmarks across two languages. On average, with Wddmin, HDD and Perses took 51.31% and 7.47% less time to generate 9.12% and 0.96% smaller results than with ddmin, respectively. With WProbDD, HDD and Perses used 11.98% and 9.72% less time to generate 13.40% and 2.20% smaller results than with ProbDD, respectively. The results strongly demonstrate the value of WDD. We firmly believe that WDD opens up a new dimension to improve test input minimization techniques.
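A hedged sketch of the weighting idea (not the paper's Wddmin or WProbDD): instead of splitting the element list into partitions of equal element count as classic ddmin does, partitions are formed to have roughly equal total weight, so a single huge fragment no longer ends up grouped with many tiny ones. The fragment names and weights below are illustrative.

```python
def weighted_partitions(elements, weight, n):
    """Split `elements` into n partitions of roughly equal total weight,
    instead of equal element count as in classic ddmin (illustrative sketch)."""
    target = sum(weight(e) for e in elements) / n
    parts, current, acc = [], [], 0.0
    for e in elements:
        current.append(e)
        acc += weight(e)
        if acc >= target and len(parts) < n - 1:
            parts.append(current)          # close this partition once the weight target is met
            current, acc = [], 0.0
    parts.append(current)
    return parts

# Fragments of a test input with very uneven sizes (weights in bytes)
fragments = [("header", 10), ("big_function", 500), ("comment", 5), ("main", 300)]
print(weighted_partitions(fragments, weight=lambda e: e[1], n=2))
# [[('header', 10), ('big_function', 500)], [('comment', 5), ('main', 300)]]
```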
"
Faridah Akinotcho, Lili Wei, Julia Rubin, "Mobile Application Coverage: The 30% Curse and Ways Forward"
Abstract: Testing, security analysis, and other dynamic quality assurance approaches rely on mechanisms that invoke software under test, aiming to achieve high code coverage. A large number of invocation mechanisms proposed in the literature, in particular for Android mobile applications, employ GUI-driven application exploration. However, studies show that even the most advanced GUI exploration techniques can cover only around 30% of a real-world application. This paper aims to investigate “the remaining 70%”. By conducting a large-scale experiment involving two human experts, who thoroughly explored 61 benchmark and 42 popular apps from Google Play, we show that achieving a substantially larger coverage for real-world applications is impractical even if we factor out known GUI-based exploration issues, such as the inability to provide semantic inputs and the right order of events. The main reasons preventing humans from covering the entire application include application dependencies on device configurations and external resources. Thus, future investment into GUI-based exploration strategies is unlikely to lead to substantial improvements in coverage. To chart possible ways forward and explore approaches to satisfy/bypass these dependencies, we thoroughly analyze code-level properties guarding them. Our analysis shows that a large fraction of the dependencies could actually be successfully bypassed with relatively simple beyond-GUI exploration techniques. We hope our study can inspire future work in this area and also provide a realistic benchmark for evaluating this work.
Hao Song, Teng Li, Jiachi Chen, Ting Chen, Beibei Li, Zhangyan Lin, Yi Lu, Pan Li, Xihan Zhou, "Enhancing The Open Network: Definition and Automated Detection of Smart Contract Defects"
Abstract: The Open Network (TON), designed to support Telegram's extensive user base of hundreds of millions, has garnered considerable attention since its launch in 2022. FunC is the most popular programming language for writing smart contracts on TON. It is distinguished by a unique syntax compared to other smart contract languages. Despite growing interest, research on the practical defects of TON smart contracts is still in its early stages. In this paper, we summarize eight smart contract defects identified from TON's official blogs and audit reports, each with detailed definitions and code examples. Furthermore, we propose a static analysis framework called TONScanner to facilitate the detection of these defects. Specifically, TONScanner reuses FunC compiler's frontend code to transform the FunC contract code into FunC intermediate representation (IR) in the form of a directed acyclic graph (DAG). Based on this IR, TONScanner constructs a control flow graph (CFG), then transforms it into a static single assignment (SSA) form to simplify further analysis. TONScanner also integrates Data Dependency, Call Graph, Taint Analysis, and Cell Construct, which are specifically tailored for TON blockchain's unique data structures. These components finally facilitate the identification of the eight defects. We evaluate the effectiveness of TONScanner by applying it to 1,640 smart contracts and find a total of 14,995 defects. Through random sampling and manual labeling, we find that TONScanner achieves an overall precision of 97.49%. The results reveal that current TON contracts contain numerous defects, indicating that developers are prone to making errors. TONScanner has proven its ability to accurately identify these defects, thereby aiding in their correction.
"
Kevin Guan, Owolabi Legunsen, "Instrumentation-Driven Evolution-Aware Runtime Verification"
Abstract: Runtime verification (RV) found hundreds of bugs by monitoring passing tests against formal specifications (specs). RV first instruments a program to obtain relevant events, e.g., method calls, to monitor. A hindrance to RV adoption, especially in continuous integration, is its high overhead. So, prior work proposed spec-driven evolution-aware techniques to speed up RV. They use complex analysis to re-monitor a subset of specs related to code impacted by changes. But, these techniques assume that RV overhead is dominated by monitoring time, and their designs often sacrifice safety (ability to find all new violations) for speed. We present iMOP, the first instrumentation-driven evolution-aware RV framework. iMOP leverages a recent observation that RV overhead during testing is often dominated by instrumentation, not monitoring. iMOP embodies a family of 14 techniques that aim to safely speed up RV by simply re-instrumenting only changed code. Instrumentation from the old revision is re-used for unchanged code, and all specs are re-monitored in the new revision. We implement iMOP as a Maven plugin and evaluate it on 1,627 revisions of 48 projects, using 160 specs of correct JDK API usage. iMOP is safe by design. It is up to 29.6x faster than re-running RV from scratch after each change, and 17.8x and 6.7x faster than safe and unsafe spec-driven techniques, respectively. iMOP is faster than just applying regression test selection to RV.
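A minimal sketch of the underlying idea, under the assumption of file-level granularity (the actual iMOP techniques are finer-grained and Maven-integrated): only files whose content hash changed since the previous revision are marked for re-instrumentation, and cached instrumented code is reused for the rest.

```python
import hashlib
import pathlib
import tempfile

def files_to_reinstrument(source_files, cache):
    """Return the files whose content hash changed since the last revision;
    everything else can keep its previously instrumented version (sketch)."""
    changed = []
    for path in source_files:
        digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if cache.get(path) != digest:
            changed.append(path)
            cache[path] = digest           # remember this revision's hash
    return changed

# Demo: a file is re-instrumented only when its content actually changes.
f = pathlib.Path(tempfile.mkdtemp()) / "Foo.java"
f.write_text("class Foo {}")
cache = {}
print(files_to_reinstrument([f], cache))   # first run: the file is new, so it is listed
print(files_to_reinstrument([f], cache))   # unchanged: nothing to re-instrument
```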
"
Xiaopeng Li, Shangwen Wang, Shasha Li, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Bin Ji, Weimin Zhang, "Model Editing for LLMs4Code: How Far are We?"
Abstract: Large Language Models for Code (LLMs4Code) have been found to exhibit outstanding performance in the software engineering domain, especially the remarkable performance in coding tasks. However, even the most advanced LLMs4Code can inevitably contain incorrect or outdated code knowledge. Due to the high cost of training LLMs4Code, it is impractical to re-train the models to fix such problematic code knowledge. Model editing is a new technical field for effectively and efficiently correcting erroneous knowledge in LLMs, where various model editing techniques and benchmarks have been proposed recently. Despite that, a comprehensive study that thoroughly compares and analyzes the effectiveness of all state-of-the-art model editing techniques for adapting the knowledge within LLMs4Code models across various code-related tasks is notably absent. To bridge this gap, we perform the first systematic study on applying state-of-the-art model editing approaches to repair the inaccuracy of LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists of two datasets, i.e., CoNaLa-Edit (CNLE) with 21K+ code generation samples and CodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help of CLMEEval, we evaluate six advanced model editing techniques on three LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings include that the external memorization-based GRACE approach achieves the best knowledge editing effectiveness and specificity (the editing does not influence untargeted knowledge), while generalization (whether the editing can generalize to other semantically-identical inputs) is a universal challenge for existing techniques. Furthermore, building on in-depth case analysis, we introduce an enhanced version of GRACE called A-GRACE, which incorporates contrastive learning to better capture the semantics of the inputs. Results demonstrate that A-GRACE notably enhances generalization while maintaining similar levels of effectiveness and specificity compared to the vanilla GRACE.
"
TianChen Yu, Li Yuan, Liannan Lin, Hongkui He, "A Multiple Representation Transformer with Optimized Abstract Syntax Tree for Efficient Code Clone Detection"
Abstract: Over the past decade, the application of deep learning in code clone detection has produced remarkable results. However, the current approaches have two limitations: (a) code representation approaches with low information utilization, such as vanilla Abstract Syntax Tree (AST), leading to information redundancy which results in performance degradation; (b) low efficiency of clone detection on evaluation, resulting in excessive time costs during practical use. In this paper, we propose a Multiple Representation Transformer with Optimized Abstract Syntax Tree (MRT-OAST) to introduce an efficient code representation method while achieving competitive performance. Specifically, MRT-OAST strategically prunes and enhances the AST, utilizing both pre-order and post-order traversals to produce two complementary representations. To speed up the evaluation process, MRT-OAST utilizes a pure Siamese network and employs cosine similarity to compare the similarity between codes. Our approach effectively reduces AST sequences to 40% and 39% of their original length in Java and C/C++ while preserving structural information. In code clone detection tasks, our model surpasses state-of-the-art approaches on OJClone and Google Code Jam. During the evaluation of BigCloneBench, our model has a 5x speed improvement compared to the state-of-the-art lightweight model and a 563x speed improvement compared to the BERT-based model, with only a 0.3% and 0.9% decrease in F1-score.
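As an illustrative aside (not the MRT-OAST model), the sketch below uses Python's ast module as a stand-in to show the two ingredients mentioned above: deriving a pre-order and a post-order sequence from an AST and comparing two code fragments with cosine similarity over those sequences.

```python
import ast
from collections import Counter
from math import sqrt

def node_types(tree, order="pre"):
    """Pre- or post-order sequence of AST node type names (a crude stand-in
    for the paper's optimized AST representations)."""
    seq = []
    def walk(node):
        if order == "pre":
            seq.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            walk(child)
        if order == "post":
            seq.append(type(node).__name__)
    walk(tree)
    return seq

def cosine(a, b):
    """Cosine similarity between two token sequences via bag-of-tokens counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca)
    return dot / (sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values())))

code1 = "def add(a, b):\n    return a + b"
code2 = "def plus(x, y):\n    return x + y"
t1, t2 = ast.parse(code1), ast.parse(code2)
print(round(cosine(node_types(t1) + node_types(t1, "post"),
                   node_types(t2) + node_types(t2, "post")), 2))   # 1.0: structural clones
```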
Tanghaoran Zhang, Yue Yu, Xinjun Mao, Shangwen Wang, Kang Yang, Yao Lu, Zhang Zhang, Yuxin Zhao, "Instruct or Interact? Exploring and Eliciting LLMs’ Capability in Code Snippet Adaptation Through Prompt Engineering"
Abstract: Code snippet adaptation is a fundamental activity in the software development process. Unlike code generation, code snippet adaptation is not a “free creation”: it requires developers to tailor a given code snippet in order to fit specific requirements and the code context. Recently, large language models (LLMs) have confirmed their effectiveness in the code generation task with promising results. However, their performance on code snippet adaptation, a reuse-oriented and context-dependent code change prediction task, is still unclear. To bridge this gap, we conduct an empirical study to investigate the performance and issues of LLMs on the adaptation task. We first evaluate the adaptation performances of three popular LLMs and compare them to the code generation task. Our result indicates that their adaptation ability is weaker than generation, with a nearly 15% decrease on pass@1 and more context-related errors. By manually inspecting 200 cases, we further investigate the causes of LLMs’ sub-optimal performance, which can be classified into three categories, i.e., Unclear Requirement, Requirement Misalignment, and Context Misapplication. Based on the above empirical research, we propose an interactive prompting approach to eliciting LLMs' ability on the adaptation task. Specifically, we enhance the prompt by enriching the context and decomposing the task, which alleviates context misapplication and improves requirement understanding. Besides, we enable LLMs' reflection by requiring them to interact with a human or an LLM counselor, compensating for unclear requirements. Our experimental result reveals that our approach greatly improves LLMs' adaptation performance. The best-performing Human-LLM interaction successfully solves 159 out of the 202 identified defects and improves the pass@1 and pass@5 by over 40% compared to the initial instruction-based prompt. Considering human efforts, we suggest multi-agent interaction as a trade-off, which can achieve comparable performance with excellent generalization ability. We deem that our approach could provide methodological assistance for autonomous code snippet reuse and adaptation with LLMs.
"
Enrique Barba Roque, Luís Cruz, Thomas Durieux, "Unveiling the Energy Vampires: A Methodology for Debugging Software Energy Consumption"
Abstract: Energy consumption in software systems is becoming increasingly important, especially in large-scale deployments. However, debugging energy-related issues remains challenging due to the lack of specialized tools. This paper presents an energy debugging methodology for identifying and isolating energy consumption hotspots in software systems. We demonstrate the methodology's effectiveness through a case study of Redis, a popular in-memory database. Our analysis reveals significant energy consumption differences between Alpine and Ubuntu distributions, with Alpine consuming up to 20.2% more power in certain operations. We trace this difference to the implementation of the `memcpy` function in different C standard libraries (musl vs. glibc). By isolating and benchmarking `memcpy`, we confirm it as the primary cause of the energy discrepancy. Our findings highlight the importance of considering energy efficiency in software dependencies and demonstrate the capability to assist developers in identifying and addressing energy-related issues. This work contributes to the growing field of sustainable software engineering by providing a systematic approach to energy debugging.
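As a rough illustration of energy debugging on Linux (an assumption; this is not the paper's tooling), one can sample Intel RAPL energy counters around a suspect workload and compare readings across implementations. The sketch below assumes a host that exposes /sys/class/powercap/intel-rapl:0/energy_uj and ignores counter wrap-around.

```python
RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"   # package-level energy counter (microjoules)

def energy_joules(workload, *args):
    """Rough energy measurement: read the RAPL counter before and after running
    the workload (requires a Linux host with RAPL; wrap-around is ignored)."""
    with open(RAPL) as f:
        before = int(f.read())
    workload(*args)
    with open(RAPL) as f:
        after = int(f.read())
    return (after - before) / 1e6

def copy_workload(n):
    """Stand-in for a memcpy-heavy operation whose energy we want to compare."""
    src = bytearray(n)
    for _ in range(100):
        dst = bytes(src)   # repeated bulk copies

# On a RAPL-capable machine one could run, e.g.:
# print(energy_joules(copy_workload, 10_000_000), "J")
```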
Cedric Richter, Marek Chalupa, Marie-Christine Jakobs, Heike Wehrheim, "Cooperative Software Verification via Dynamic Program Splitting"
Abstract: Cooperative software verification divides the task of software verification among several verification tools in order to increase efficiency and effectiveness. The basic approach is to let verifiers work on different parts of a program and at the end join verification results. While this idea is intuitively appealing, cooperative verification is usually hindered by the facts that program decomposition (1) is often static, disregarding strengths and weaknesses of employed verifiers, and (2) often represents the decomposed program parts in a specific proprietary format, thereby making the use of off-the-shelf verifiers in cooperative verification difficult. In this paper, we propose a novel cooperative verification scheme that we call dynamic program splitting (DPS). Splitting decomposes programs into (smaller) programs, and thus directly enables the use of off-the-shelf tools. In DPS, splitting is dynamically applied on demand: Verification starts by giving a verification task (a program plus a correctness specification) to a verifier V1. Whenever V1 finds the current task to be hard to verify, it splits the task (i.e., the program) and restarts verification on subtasks. DPS continues until (1) a violation is found, (2) all subtasks are completed or (3) some user-defined stopping criterion is met. In the latter case, the remaining uncompleted subtasks are merged into a single one and given to a next verifier V2, repeating the same procedure on the still unverified program parts. This way, the decomposition is steered by what is hard to verify for particular verifiers, leveraging their complementary strengths. We have implemented dynamic program splitting and evaluated it on benchmarks of the annual software verification competition SV-COMP. The evaluation shows that cooperative verification with DPS is able to solve verification tasks that none of the constituent verifiers can solve, without any significant overhead.
"
Yiwei Li, Liangze Yin, Wei Dong, Jiaxin Liu, Yanfeng Hu, Shanshan Li, "Hetrify: Efficient Verification of Heterogeneous Programs on RISC-V"
Abstract: The heterogeneous nature of contemporary software, comprising components like closed-source libraries, embedded assembly snippets, and modules written in multiple programming languages, leads to significant verification challenges. Currently, there are no mature, readily available methods to effectively address such problems. To bridge this gap, we propose a verification approach capable of effectively verifying heterogeneous programs. This approach is universally applicable: it theoretically supports the verification of any heterogeneous program that can be compiled into binary code, without being constrained by any specific programming language. The approach begins by compiling the entire program or its unverifiable segments into binary format. Under guarantees of semantic equivalence, these binaries are converted into verifiable C code, which can then be verified using existing C verification tools. Based on the RISC-V architecture, we developed the Hetrify tool to implement this verification approach. The tool is supported by rigorous mathematical proofs to ensure operational semantic equivalence between the converted C programs and their original counterparts. To validate our approach, we conducted verification experiments on 130 programs, including 100 assembly programs and 30 large heterogeneous programs with missing critical function source code, demonstrating the effectiveness of our approach.
Xue Jiang, Yihong Dong, Yongding Tao, Huanyu Liu, Zhi Jin, Ge Li, "ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation"
Abstract: Large language models (LLMs) have achieved impressive performance in code generation recently, offering programmers revolutionary assistance in software development. However, due to the auto-regressive nature of LLMs, they are susceptible to error accumulation during code generation. Once an error is produced, LLMs can merely continue to generate the subsequent code conditioned on it, given their inability to adjust previous outputs. This generation process differs from the common practice in human coding, which involves review and adjustment during the coding process according to quality and requirements. Existing LLM-based approaches that typically consider post-revising after code generation fail to resolve errors in time, leading to the challenging resolution of accumulated errors and the significant wastage of resources. Ideally, LLMs should roll back and resolve the error immediately during code generation, rather than proceed on the basis of the error and wait for post-revising after generation. In this paper, we propose ROCODE, which integrates the backtracking mechanism and program analysis into LLMs for code generation. Specifically, we employ program analysis to perform incremental error detection during the generation process. When an error is detected, the backtracking mechanism is triggered to prime rollback strategies and constraint regeneration, thereby avoiding the recurrence of the same error. Experiments on multiple code generation benchmarks show that ROCODE can significantly reduce the errors generated by LLMs, with a compilation pass rate of over 98.9%. The test pass rate is improved by up to 23.8% compared to the best baseline approach. Compared to the post-revising baseline, the cost is reduced by 19.3%. Moreover, our approach is model-agnostic and achieves consistent improvements across six LLMs.
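To illustrate the control flow of such a backtracking loop (not ROCODE itself; real incremental analysis must tolerate incomplete code, which the compile() stand-in below does not), consider this hedged Python sketch, where propose_next_line is an assumed wrapper around an LLM.

```python
def generate_with_backtracking(propose_next_line, max_lines=20, max_retries=3):
    """Grow a program line by line, check each partial program, and roll back
    the last line when it introduces an error (compile() stands in for real
    program analysis; `propose_next_line(lines, banned)` is an assumed LLM hook)."""
    lines, banned = [], set()
    while len(lines) < max_lines:
        candidate = propose_next_line(lines, banned)
        if candidate is None:
            break
        trial = lines + [candidate]
        try:
            compile("\n".join(trial), "<generated>", "exec")   # incremental error detection
        except SyntaxError:
            banned.add(candidate)          # constrain regeneration away from this error
            if len(banned) >= max_retries:
                break
            continue                       # backtrack: do not commit the faulty line
        lines = trial
        banned.clear()
    return "\n".join(lines)

# Toy proposer that first suggests a broken line, then correct ones.
script = iter(["x = (1 +", "x = 1 + 1", "print(x)"])
print(generate_with_backtracking(lambda lines, banned: next(script, None)))
```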
"
Christof Tinnes, Alisa Welter, Sven Apel, "Software Model Evolution with Large Language Models: Experiments on Simulated, Public, and Industrial Datasets"
Abstract: Modeling structure and behavior of software systems plays a crucial role in the industrial practice of software engineering. As with other software engineering artifacts, software models are subject to evolution. Supporting modelers in evolving software models with recommendations for model completions is still an open problem, though. In this paper, we explore the potential of large language models for this task. In particular, we propose an approach, RaMc, leveraging large language models, model histories of software systems, and retrieval-augmented generation for model completion. Through experiments on three datasets, including an industrial application, one public open-source community dataset, and one controlled collection of simulated model repositories, we evaluate the potential of large language models for model completion. We found that large language models are indeed a promising technology for supporting software model evolution (62.30% semantically correct completions on real-world industrial data and up to 86.19% type-correct completions). Furthermore, we found that the general inference capabilities of large language models are useful, for example, when dealing with concepts for which there are few, noisy, or no examples at all.
Ran Mo, Haopeng Song, Wei Ding, Chaochao Wu, "Code Cloning in Solidity Smart Contracts: Prevalence, Evolution, and Impact on Development"
Abstract: In recent years, the development of Solidity smart contracts has been increasing rapidly in popularity. Code cloning is a common coding practice, and many prior studies have revealed that code clones could negatively impact software maintenance and quality. However, there is little work systematically analyzing the nature and impacts of code clones in Solidity smart contracts. To bridge this gap, we investigate the prevalence, evolution, and bug-proneness of code clones in Solidity smart contracts, and further identify the possible reasons for these clones’ occurrences. With our evaluation of 26,294 smart contracts with 97,877 functions, we have found that code clones are highly prevalent in smart contracts. Additionally, on average, 32.01% of clones co-evolve, indicating the need for careful management to avoid consistency issues. Surprisingly, unlike in traditional software development, code clones in smart contracts are rarely involved in bug fixes. Finally, we identify three main factors that affect the occurrences of clones. We believe our study can provide valuable insights for developers to understand and manage code clones in Solidity smart contracts.
"
Trey Woodlief, Carl Hildebrandt, Sebastian Elbaum, "A Differential Testing Framework to Identify Critical AV Failures Leveraging Arbitrary Inputs"
Abstract: The proliferation of autonomous vehicles (AVs) has made their failures increasingly evident. Testing efforts aimed at identifying the inputs leading to those failures are challenged by the input's long-tail distribution, whose area under the curve is dominated by rare scenarios. We hypothesize that leveraging emerging open-access datasets can accelerate the exploration of long-tail inputs. Having access to diverse inputs, however, is not sufficient to expose failures; an effective test also requires an oracle to distinguish between correct and incorrect behaviors. Current datasets lack such oracles and developing them is notoriously difficult. In response, we propose DiffTest4AV, a differential testing framework designed to address the unique challenges of testing AV systems: 1) for any given input, many outputs may be considered acceptable, 2) the long-tail contains an insurmountable number of inputs to explore, and 3) the AV's continuous execution loop requires failures to persist in order to affect the system. DiffTest4AV integrates statistical analysis to identify meaningful behavioral variations, judges their importance in terms of the severity of these differences, and incorporates sequential analysis to detect persistent errors indicative of potential system-level failures. Our study on 5 versions of the commercially-available, road-deployed comma.ai OpenPilot system, using 3 available image datasets, demonstrates the capabilities of the framework to detect high-severity, high-confidence, long-running test failures.
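A toy rendition of the persistence signal (thresholds and data are invented; this is not DiffTest4AV): compare the outputs of several system versions frame by frame, use their spread as a severity proxy, and report only disagreements that last for several consecutive frames.

```python
from statistics import pstdev

def persistent_disagreements(outputs_per_version, severity_threshold=2.0, min_run=3):
    """Flag frame indices where the versions' outputs disagree severely for at
    least `min_run` consecutive frames (illustrative thresholds)."""
    frames = list(zip(*outputs_per_version))        # one tuple of outputs per frame
    flagged, run = [], 0
    for i, outs in enumerate(frames):
        severity = pstdev(outs)                     # spread across versions as severity proxy
        run = run + 1 if severity > severity_threshold else 0
        if run >= min_run:
            flagged.append(i)
    return flagged

# Steering angles from three hypothetical model versions over seven frames
v1 = [0.0, 0.1, 0.1, 0.2, 0.1, 0.0, 0.1]
v2 = [0.1, 0.0, 5.0, 5.2, 5.1, 0.1, 0.0]
v3 = [0.0, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1]
print(persistent_disagreements([v1, v2, v3]))       # [4]: the disagreement persisted 3 frames
```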
"
Giacomo Benedetti, Oreofe Solarin, Courtney Miller, Greg Tystahl, William Enck, Christian Kästner, Alexandros Kapravelos, Alessio Merlo, Luca Verderame, "An Empirical Study on Reproducible Packaging in Open-Source Ecosystems"
Abstract: The integrity of software builds is fundamental to the security of the software supply chain. While Thompson first raised the potential for attacks on build infrastructure in 1984, limited attention has been given to build integrity in the past 40 years, enabling recent attacks on SolarWinds, event-stream, and xz. The best-known defense against build system attacks is creating reproducible builds; however, achieving them can be complex for both technical and social reasons and thus is often viewed as impractical to obtain. In this paper, we analyze reproducibility of builds in a novel context: reusable components distributed as packages in six popular software ecosystems (npm, Maven, PyPI, Go, RubyGems, and Cargo). Our quantitative study on a representative sample of 4000 packages in each ecosystem raises concerns: Rates of reproducible builds vary widely between ecosystems, with some ecosystems having all packages reproducible whereas others have issues in nearly every package. However, upon deeper investigation, we identified that with relatively straightforward infrastructure configuration and patching of build tools, we can achieve very high rates of reproducible builds in all studied ecosystems. We conclude that if the ecosystems adopt our suggestions, the build process of published packages can be independently confirmed for nearly all packages without individual developer actions, and doing so will prevent significant future software supply chain attacks.
"
Nusrat Zahan, Philipp Burckhardt, Mikola Lysenko, Feross Aboukhadijeh, Laurie Williams, "Leveraging Large Language Models to Detect npm Malicious Packages"
Abstract: Existing malicious code detection techniques can aid the manual review process by predicting which packages are likely to be malicious. However, these techniques often suffer from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to assist security analysts in detecting malicious packages through the empirical study of using Large Language Models (LLMs) to detect malicious code in the npm ecosystem. We present SecurityAI, a malicious code review workflow to detect malicious code using ChatGPT. We leverage a benchmark dataset of 5,115 npm packages, of which 2,180 packages have malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious Javascript code. We compare the effectiveness of static analysis as a pre-screener with SecurityAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious packages detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. We attained precision and F1 scores of 91% and 94% for GPT-3, and 99% & 97% for GPT-4, respectively, with GPT-3 offering a cost-effective balance. Pre-screening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, hidden backdoors, and suspicious domain connection categories as the top detected malicious packages. The lack of diversity in model-generated responses led to hallucinations, resulting in misclassification cases, with GPT-3 hallucinating more frequently.
Xin-Cheng Wen, Zirui Lin, Cuiyun Gao, Hongyu Zhang, Yong Wang, Qing Liao, "Repository-Level Graph Representation Learning for Enhanced Security Patch Detection"
Abstract: Software vendors often silently release security patches without providing sufficient advisories (e.g., Common Vulnerabilities and Exposures) or delayed updates via resources (e.g., National Vulnerability Database). Therefore, it has become crucial to detect these security patches to ensure secure software maintenance. However, existing methods face the following challenges: (1) They primarily focus on the information within the patches themselves, overlooking the complex dependencies in the repository. (2) Security patches typically involve multiple functions and files, increasing the difficulty of learning good representations. To alleviate the above challenges, this paper proposes a Repository-level Security Patch Detection framework named RepoSPD, which comprises three key components: 1) a repository-level graph construction, RepoCPG, which represents software patches by merging pre-patch and post-patch source code at the repository level; 2) a structure-aware patch representation, which fuses the graph and sequence branch and aims at comprehending the relationship among multiple code changes; 3) progressive learning, which facilitates the model in balancing semantic and structural information. To evaluate RepoSPD, we employ two widely-used datasets in security patch detection: SPI-DB and PatchDB. We further extend these datasets to the repository level, incorporating a total of 20,238 and 28,781 versions of repository in C/C++ programming languages, respectively, denoted as SPI-DB* and PatchDB*. We compare RepoSPD with six existing security patch detection methods and five static tools. Our experimental results demonstrate that RepoSPD outperforms the state-of-the-art baseline, with improvements of 11.90% and 3.10% in terms of accuracy on the two datasets, respectively. These results underscore the effectiveness of RepoSPD in detecting security patches. Furthermore, RepoSPD can detect 151 security patches, which outperforms the best-performing baseline by 21.36% with respect to accuracy.
"
Tajkia Rahman Toma, Balreet Grewal, Cor-Paul Bezemer, "Answering User Questions about Machine Learning Models through Standardized Model Cards"
Abstract: Reusing pre-trained machine learning models is becoming very popular due to model hubs such as Hugging Face (HF). However, similar to when reusing software, many issues may arise when reusing an ML model. In many cases, users resort to asking questions on discussion forums such as the HF community forum. In this paper, we study how we can reduce the community's workload in answering these questions and increase the likelihood that questions receive a quick answer. We analyze 11,278 discussions from the HF model community that contain user questions about ML models. We focus on the effort spent handling questions, the high-level topics of discussions, and the potential for standardizing responses in model cards based on a model card template. Our findings indicate that there is not much effort involved in responding to user questions; however, 40.1% of the questions remain open without any response. A topic analysis shows that discussions are more centered around technical details on model development and troubleshooting, indicating that more input from model providers is required. We show that 42.5% of the questions could have been answered if the model provider followed a standard model card template for the model card. Based on our analysis, we recommend that model providers add more development-related details on the model's architecture, algorithm, data preprocessing and training code in existing documentation (sub)sections and add new (sub)sections to the template to address common questions about model usage and hardware requirements.
Cho-Ting Lee, Andrew Neeser, Shengzhe Xu, Jay Katyan, Patrick Cross, Sharanya Pathakota, Marigold Norman, John C. Simeone, Jaganmohan Chandrasekaran, Naren Ramakrishnan, "Can an LLM find its way around a Spreadsheet?"
Abstract: Spreadsheets are routinely used in business and scientific contexts, and one of the most vexing challenges is performing data cleaning prior to analysis and evaluation. The ad-hoc and arbitrary nature of data cleaning problems, such as typos, inconsistent formatting, missing values, and a lack of standardization, often creates the need for highly specialized pipelines. We ask whether an LLM can find its way around a spreadsheet and how to support end-users in taking their free-form data processing requests to fruition. Just like RAG retrieves context to answer users’ queries, we demonstrate how we can retrieve elements from a code library to compose data preprocessing pipelines. Through comprehensive experiments, we demonstrate the quality of our system and how it is able to continuously augment its vocabulary by saving new codes and pipelines back to the code library for future retrieval.
Madeline Janecek, Naser Ezzati-Jivan, Abdelwahab Hamou-Lhadj, "Execution Trace Reconstruction Using Diffusion-Based Generative Models"
Abstract: Execution tracing is essential for understanding system and software behaviour, yet lost trace events can significantly compromise data integrity and analysis. Existing solutions for trace reconstruction often fail to fully leverage available data, particularly in complex and high-dimensional contexts. Recent advancements in generative artificial intelligence, particularly diffusion models, have set new benchmarks in image, audio, and natural language generation. This study conducts the first comprehensive evaluation of diffusion models for reconstructing incomplete trace event sequences. Using nine distinct datasets generated from the Phoronix Test Suite, we rigorously test these models on sequences of varying lengths and missing data ratios. Our results indicate that the SSSD$^{S4}$ model, in particular, achieves superior performance, in terms of accuracy, perfect rate, and ROUGE-L score across diverse imputation scenarios. These findings underscore the potential of diffusion-based models to accurately reconstruct missing events, thereby maintaining data integrity and enhancing system monitoring and analysis.
Joseph Romeo, Marco Raglianti, Csaba Nagy, Michele Lanza, "UML is Back. Or is it? Investigating the Past, Present, and Future of UML in Open Source Software"
Abstract: Since its inception, UML, the Unified Modeling Language, has been touted as the way to go when it comes to designing and documenting software systems. While being an integral part of many university software engineering programs, UML has found little consideration among developers, especially in open source software. Reasons for this include that UML shares some shortcomings with other forms of documentation (e.g., limited availability, outdatedness, inadequate level of detail). We present a study to investigate the evolution and the current situation regarding the use of UML in open source projects. We mined and analyzed ~13k GitHub projects, developing strategies and heuristics to identify UML files through their extensions and contents, for a quantitative analysis of two decades of evolution of the usage of UML. We explored the popularity of UML, derived characteristics of projects leveraging UML, and analyzed the authors, creators and maintainers, of UML artifacts. Our study confirms that UML is indeed still under-utilized. At the same time we found evidence of a resurgence coinciding with the popularity of human-readable text-based formats, defined and used by tools like PlantUML and Mermaid. We discuss how identifying and addressing the new challenges implied by this resurgence could impact the future of UML.
"
Abdul Haddi Amjad, Muhammad Danish, Bless Jah, Muhammad Ali Gulzar, "Accessibility Issues in Ad-Driven Web Applications"
Abstract: Website accessibility is essential for inclusiveness and regulatory compliance. Although third-party advertisements (ads) are a vital revenue source for free web services, they introduce significant accessibility challenges. Leasing a website’s space to ad-serving technologies like DoubleClick results in developers losing control over ad content accessibility. Even on highly accessible websites, third-party ads can undermine adherence to Web Content Accessibility Guidelines (WCAG). We conduct the first-of-its-kind large-scale investigation of 430K website elements, including nearly 100K ad elements, to understand the accessibility of ads on websites. We seek to understand the prevalence of inaccessible ads and their overall impact on the accessibility of websites. Our findings show that 67% of websites experience increased accessibility violations due to ads, with common violations including Focus Visible (WCAG 2.4.7) and On Input (WCAG 3.2.2). Popular ad-serving technologies like Taboola, DoubleClick, and RevContent often serve ads that fail to comply with WCAG standards. Even when ads are WCAG compliant, 27% of them have alternative text in ad images that misrepresents information, potentially deceiving users. Manual inspection of a sample of these misleading ads revealed that user-identifiable data is collected on 94% of websites through interactions, such as hovering or pressing enter. Since users with disabilities often rely on tools like screen readers that require hover events to access website content, they have no choice but to compromise their privacy in order to navigate website ads. Based on our findings, we further dissect the root cause of these violations and provide design guidelines to both website developers and ad-serving technologies to achieve WCAG-compliant ad integration.
Myeongsoo Kim, Tyler Stennett, Saurabh Sinha, Alessandro Orso, "A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs"
Abstract: As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API specifications such as OpenAPI ones has led to the emergence of many black-box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in detecting faults (i.e., 500 response codes). To address these limitations, we present AutoRestTest, the first black-box framework to adopt a dependency-embedded multi-agent approach for REST API testing, integrating Multi-Agent Reinforcement Learning (MARL) with a Semantic Property Dependency Graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents---API, dependency, parameter, and value---collaborate to optimize API exploration. LLMs handle domain-specific value restrictions, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents' behavior. Evaluated on 12 real-world REST services, AutoRestTest outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which augments realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to identify an internal server error in Spotify. Our ablation study underscores the significant contributions of the agent learning, SPDG, and LLM components.
Geraldine Galindo-Gutierrez, Juan Pablo Sandoval Alcocer, Nicolas Jimenez-Fuentes, Alexandre Bergel, Gordon Fraser, "Increasing the Effectiveness of Automatically Generated Tests by Improving Class Observability"
Abstract: Automated unit test generation consists of two complementary challenges: Finding sequences of API calls that exercise the code of a class under test, and finding assertion statements that validate the behaviour of the class during execution. The former challenge is often addressed using meta-heuristic search algorithms optimising tests for code coverage, which are then annotated with regression assertions to address the latter challenge, i.e., assertions that capture the states observed during test generation. While the resulting tests tend to achieve high coverage, their fault finding potential is often inhibited by poor or difficult observability of the codebase. That is, relevant attributes and properties may either not be exposed adequately at all, or only in ways that the test generator is unable to handle. In this paper, we investigate the influence of observability in the context of the EvoSuite search-based Java test generator, which we extend in two complementary ways to study and improve observability: First, we apply a transformation to code under test to expose encapsulated attributes to the test generator; second, we address EvoSuite's limited capability of asserting the state of complex objects. Our evaluation demonstrates that together these observability improvements lead to significantly increased mutation scores, underscoring the importance of considering the class observability in the test generation process.
"
Nabson Silva, Unknown, Tayana Conte, "A Catalog of Micro Frontends Anti-patterns"
Abstract: Micro frontend (MFE) architectures have gained significant popularity for promoting independence and modularity in development. Despite their widespread adoption, the field remains relatively unexplored, especially concerning identifying problems and documenting best practices. Drawing on both established microservice (MS) anti-patterns and the analysis of real problems faced by software development teams that adopt MFE, this paper presents a catalog of 12 MFE anti-patterns. We composed an initial version of the catalog by recognizing parallels between MS anti-patterns and recurring issues in MFE projects to map and adapt MS anti-patterns to the context of MFE. To validate the identified problems and proposed solutions, we conducted a survey with industry practitioners, collecting valuable feedback to refine the anti-patterns. Additionally, we asked participants if they had encountered these problems in practice and to rate their harmfulness on a 10-point Likert scale. The survey results revealed that participants had encountered all the proposed anti-patterns in real-world MFE architectures, with only one reported by less than 50% of participants. They stated that the catalog can serve as a valuable guide for both new and experienced developers, with the potential to enhance MFE development quality. The collected feedback led to the development of an improved version of the anti-patterns catalog. Furthermore, we developed a web application designed to not only showcase the anti-patterns but also to actively foster collaboration and engagement within the MFE community. The proposed catalog is a valuable resource for identifying and mitigating potential pitfalls in MFE development. It empowers developers of all experience levels to create more robust, maintainable, and well-designed MFE applications.
Georgios Sakkas, Pratyush Sahu, Kyeling Ong, Ranjit Jhala, "Neurosymbolic Modular Refinement Type Inference"
Abstract: Refinement types -- a type-based generalization of Floyd-Hoare logics -- are an expressive and modular means of statically ensuring a wide variety of correctness, safety, and security properties of software. However, their expressiveness and modularity means that to use them, a developer must laboriously annotate all the functions in their code with potentially complex type specifications that specify the contract for that function. We present XO, a neurosymbolic agent that uses LLMs to automatically generate refinement type annotations for all the functions in an entire package or module, using the refinement type checker LiquidHaskell as an oracle to verify the correctness of the generated specifications. We curate a dataset of three Haskell packages where refinement types are used to enforce a variety of correctness properties from data structure invariants to low-level memory safety and use this dataset to evaluate XO. Previously these packages required expert users several days to weeks to annotate with refinement types. Our evaluation shows that, even when using relatively small models like the 3-billion-parameter StarCoder LLM, with fine-tuning and carefully chosen contexts our neurosymbolic agent generates refinement types for up to 94% of the functions across entire libraries automatically in just a few hours, thereby showing that LLMs can drastically shrink the human effort needed to use formal verification.
"
Luís F. Gomes, Vincent Hellendoorn, Jonathan Aldrich, Rui Abreu, "An Exploratory Study of ML Sketches and Visual Code Assistants"
Abstract: This paper explores the integration of Visual Code Assistants in Integrated Development Environments (IDEs). In Software Engineering, whiteboard sketching is often the initial step before coding, serving as a crucial collaboration tool for developers. Previous studies have investigated patterns in SE sketches and how they are used in practice, yet methods for directly using these sketches for code generation remain limited. The emergence of visually-equipped large language models presents an opportunity to bridge this gap, which is the focus of our research. In this paper, we built a first prototype of a Visual Code Assistant to get user feedback regarding in-IDE sketch-to-code tools. We conduct an experiment with 19 data scientists, most of whom regularly sketch as part of their job. We investigate developers' mental models by analyzing patterns commonly observed in their sketches when developing an ML workflow. Analysis indicates that diagrams were the preferred organizational component (52.6%), often accompanied by lists (42.1%) and numbered points (36.8%). Our tool converts their sketches into a Python notebook by querying an LLM. We use an LLM-as-judge setup to score the quality of the generated code, finding that even brief sketching can effectively generate useful code outlines. We also find a significant, positive correlation between sketch time and the quality of the generated code. We conclude the study by conducting extensive interviews to assess the tool's usefulness, explore potential use cases, and understand developers' needs. As noted by participants, promising applications for these assistants include education, prototyping, and collaborative settings. Our findings signal promise for the next generation of Code Assistants to integrate visual information, both to improve code generation and to better leverage developers' existing sketching practices.
"
Waris Gill, Ali Anwar, Muhammad Ali Gulzar, "TraceFL: Interpretability-Driven Debugging in Federated Learning via Neuron Provenance"
Abstract: In Federated Learning, clients train models on local data and send updates to a central server, which aggregates them into a global model using a fusion algorithm. This collaborative yet privacy-preserving training comes at a cost—FL developers face significant challenges in attributing global model predictions to specific clients. Localizing responsible clients is a crucial step towards (a) excluding clients primarily responsible for incorrect predictions and (b) encouraging clients who contributed high-quality models to continue participating in the future. Existing ML explainability approaches are inherently inapplicable as they are designed for single-model, centralized training. We introduce TraceFL, a fine-grained neuron provenance capturing mechanism that identifies clients responsible for the global model’s prediction by tracking the flow of information from individual clients to the global model. Since inference on different inputs activates a different set of neurons of the global model, TraceFL dynamically quantifies the significance of the global model’s neurons in a given prediction. It then selectively picks a slice of the most crucial neurons in the global model and maps them to the corresponding neurons in every participating client to determine each client’s contribution, ultimately localizing the responsible client. We evaluate TraceFL on six datasets, including two real-world medical imaging datasets and four neural networks, including advanced models such as GPT. TraceFL achieves 99% accuracy in localizing the responsible client in FL tasks spanning both image and text classification tasks. At a time when state-of-the-art ML debugging approaches are mostly domain-specific (e.g., image classification only), TraceFL is the first technique to enable highly accurate automated reasoning across a wide range of FL applications.
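As an illustrative toy rendition (not TraceFL's actual neuron provenance), the sketch below ranks global-model neurons by their activation for the current prediction and credits each top neuron to the client whose update changed it most; all numbers and names are invented.

```python
def responsible_client(global_activations, client_updates, top_k=3):
    """Take the neurons most active for the current prediction, credit each one
    to the client whose update moved that neuron the most, and localize the
    client with the largest total credit (illustrative sketch)."""
    top_neurons = sorted(global_activations, key=global_activations.get, reverse=True)[:top_k]
    credit = {client: 0.0 for client in client_updates}
    for neuron in top_neurons:
        best = max(client_updates, key=lambda c: abs(client_updates[c].get(neuron, 0.0)))
        credit[best] += global_activations[neuron]
    return max(credit, key=credit.get), credit

activations = {"n1": 0.9, "n2": 0.7, "n3": 0.2, "n4": 0.1}
updates = {"client_A": {"n1": 0.5, "n3": 0.4}, "client_B": {"n2": 0.6, "n4": 0.1}}
print(responsible_client(activations, updates))
# ('client_A', {'client_A': 1.1, 'client_B': 0.7})
```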
"
Yuanfang Cai, Lanting He, Yony Kochinski, Jun Qian, Ciera Jaspan, Nan Zhang, Antonio Bianco, "Understanding Architectural Complexity, Maintenance Burden, and Developer Sentiment---a Large-Scale Study"
Abstract: Intuitively, the more complex a software system is, the harder it is to maintain. Statistically, it is not clear which complexity metrics correlate with maintenance effort; in fact, it is not even clear how to objectively measure maintenance burden, so that developers' sentiment and intuition can be supported by numbers. Without effective complexity and maintenance metrics, it remains difficult to objectively monitor maintenance, control complexity, or justify refactoring. In this paper, we report a large-scale study of 1252 projects written in C++ and Java from Company_X. We collected three categories of metrics: (1) architectural complexity, measured using propagation cost (PC), decoupling level (DL), and structural anti-patterns; (2) maintenance activity, measured using the number of changes, lines of code (LOC) written, and active coding time (ACT) spent on feature-addition vs. bug-fixing, and (3) developer sentiment on complexity and productivity, collected from 7200 survey responses. We statistically analyzed the correlations among these metrics and obtained significant evidence of the following findings: 1) the more complex the architecture is (higher propagation cost, more instances of anti-patterns), the more LOC is spent on bug-fixing, rather than adding new features; 2) developers who commit more changes for features, spend more lines of code on features, or spend more time on features also feel that they are less hindered by technical debt and complexity. To the best of our knowledge, this is the first large-scale empirical study establishing the statistical correlation among architectural complexity, maintenance activity, and developer sentiment. The implication is that, instead of solely relying upon developer sentiment and intuition to detect degraded structure or increased burden to evolve, it is possible to objectively and continuously measure and monitor architectural complexity and maintenance difficulty, increasing feature delivery efficiency by reducing architectural complexity and anti-patterns.
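For readers unfamiliar with propagation cost (PC), the sketch below computes it in its commonly used form, the density of the transitive closure of a file-dependency matrix; this is an illustration, not the study's tooling:

    # Propagation cost: fraction of (file, file) pairs through which a change can
    # propagate, computed from the transitive closure of the dependency matrix.
    import numpy as np

    def propagation_cost(dep):
        n = dep.shape[0]
        reach = np.eye(n, dtype=bool) | dep.astype(bool)
        for k in range(n):                      # Warshall-style transitive closure
            reach |= np.outer(reach[:, k], reach[k, :])
        return reach.sum() / (n * n)

    # 4 files: 0 depends on 1, 1 depends on 2, 3 is isolated
    dep = np.array([[0, 1, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 0],
                    [0, 0, 0, 0]])
    print(propagation_cost(dep))  # 7/16 = 0.4375 reachable pairs, including self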
Aleks Chakarov, Jaco Geldenhuys, Matthew Heck, Michael Hicks, Samuel Huang, Georges-Axel Jaloyan, Anjali Joshi, K. Rustan M. Leino, Mikael Mayer, Sean McLaughlin, Akhilesh Mritunjai, Clement Pit-Claudel, Sorawee Porncharoenwase, Florian Rabe, Marianna Rapoport, Giles Reger, Cody Roux, Neha Rungta, Robin Salkeld, Matthias Schlaipfer, Daniel Schoepe, Johanna Schwartzentruber, Serdar Tasiran, Aaron Tomb, Emina Torlak, Jean-Baptiste Tristan, Lucas Wagner, Michael W. Whalen, Remy Willems, Tongtong Xiang, Tae Joon Byun, Joshua Cohen, Ruijie Fang, Junyoung Jang, Jakob Rath, Hira Taqdees Syeda, Dominik Wagner, Yongwei Yuan, "Formally Verified Cloud-Scale Authorization"
Abstract:
Shazibul Islam Shamim, Hanyang Hu, Akond Rahman, "On Prescription or Off Prescription? An Empirical Study of Community-prescribed Security Configurations for Kubernetes"
Abstract: Despite being beneficial for rapid delivery of software, Kubernetes deployments can be susceptible to security attacks, which can cause serious consequences. A systematic characterization of community-prescribed security configurations, i.e., security configurations recommended by security experts, can aid practitioners in securing their Kubernetes deployments. To that end, we conduct an empirical study with 53 security configurations recommended by the Center for Internet Security (CIS), 20 survey respondents, and 356 configuration files obtained from open source software (OSS) repositories and 188 configuration files used by Company-A. From our empirical study, we observe: (i) practitioners can be unaware of prescribed security configurations, as 5% to 40% of the survey respondents are unfamiliar with 16 prescribed configurations; and (ii) for Company-A and OSS respectively, 18.0% and 17.9% of the configuration files include at least one violation of prescribed configurations. From our evaluation with 5 static application security testing (SAST) tools, we find that (i) only Kubescape supports all of the prescribed security configurations; (ii) the highest observed precision is 0.48 and 0.43, respectively, for the Company-A and OSS datasets; and (iii) the highest observed recall is 0.53 and 0.65, respectively, for the Company-A and OSS datasets. We conclude the paper by providing recommendations for practitioners on how they can use existing SAST tools to secure their Kubernetes deployments.
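To make "violation of a prescribed configuration" concrete, here is a minimal sketch that flags two CIS-style misconfigurations in a parsed Pod spec; the fields checked are real Kubernetes fields, but the checker itself is a toy and far simpler than the 53 configurations studied:

    # Flag two example misconfigurations (privileged container, hostNetwork).
    pod_spec = {
        "hostNetwork": True,
        "containers": [
            {"name": "app", "image": "nginx",
             "securityContext": {"privileged": True}},
        ],
    }

    violations = []
    if pod_spec.get("hostNetwork"):
        violations.append("hostNetwork enabled")
    for c in pod_spec.get("containers", []):
        if c.get("securityContext", {}).get("privileged"):
            violations.append(f"container '{c['name']}' runs privileged")
    print(violations)  # ['hostNetwork enabled', "container 'app' runs privileged"]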
"
Jaehyeok Lee, Sooyoung Cha, "TopSeed: Learning Seed Selection Strategies for Symbolic Execution from Scratch"
Abstract: We present TopSeed, a new approach that automatically selects optimal seeds to enhance symbolic execution. Recently, the performance of symbolic execution has significantly improved through various state-of-the-art techniques, including search strategies and state-pruning heuristics. However, these techniques have typically demonstrated their effectiveness without considering “seeding”, which efficiently initializes program states for exploration. This paper aims to select valuable seeds from candidate inputs generated during interactions with any symbolic execution technique, without the need for a predefined seed corpus, thereby maximizing the technique’s effectiveness. One major challenge is the vast number of candidates, making it difficult to identify promising seeds. To address this, we introduce a customized online learning algorithm that iteratively groups candidate inputs, ranks each group, and selects a seed from the top-ranked group based on data accumulated during symbolic execution. Experimental results on 17 open-source C programs show that TopSeed significantly enhances four distinct cutting-edge techniques, implemented on top of two symbolic executors, in terms of branch coverage and bug-finding abilities.
Ali Ebrahimi Pourasad, Walid Maalej, "Does GenAI Make Usability Testing Obsolete?"
Abstract: Ensuring usability is crucial for the success of mobile apps. Usability issues can compromise user experience and negatively impact the perceived app quality. This paper presents UX-LLM, a novel tool powered by a Large Vision-Language Model that predicts usability issues in iOS apps. To evaluate the performance of UX-LLM, we predicted usability issues in two open-source apps of medium complexity and asked usability experts to assess the predictions. We also performed traditional usability testing and expert review for both apps and compared the results to those of UX-LLM. UX-LLM demonstrated precision ranging from 0.61 to 0.66 and recall between 0.35 and 0.38, indicating its ability to identify valid usability issues, yet failing to capture the majority of issues. Finally, we conducted a focus group with an app development team of a capstone project developing a transit app for visually impaired persons. The focus group expressed positive perceptions of UX-LLM as it identified unknown usability issues in their app. However, they also raised concerns about its integration into the development workflow, suggesting potential improvements. Our results show that UX-LLM cannot fully replace traditional usability evaluation methods but serves as a valuable supplement, particularly for small teams with limited resources, to identify issues in less common user paths, due to its ability to inspect the source code.
"
Zhao Tian, Junjie Chen, Xiangyu Zhang, "Fixing Large Language Models' Specification Misunderstanding for Better Code Generation"
Abstract: Code generation is to automatically generate source code conforming to a given programming specification, which has received extensive attention especially with the development of large language models (LLMs). Due to the inherent difficulty of code generation, the code generated by LLMs may not be aligned with the specification. Although thought-eliciting prompting techniques have been proposed to enhance the code generation performance of LLMs, producing correct understanding for complicated programming problems remains challenging, resulting in unsatisfactory performance. Also, some feedback-based prompting techniques have been proposed to fix incorrect code using error messages produced by test execution. However, when the generated code deviates significantly from the ground truth, they encounter difficulties in improving performance based on such coarse-grained information. In this work, we propose a novel prompting technique, called μFiX, to improve the code generation performance of LLMs by devising both sophisticated thought-eliciting prompting and feedback-based prompting and making the first exploration on their synergy. It first exploits test case analysis to obtain specification understanding and enables a self-improvement process to identify and refine the misunderstanding in the thought-eliciting prompting phase. μFiX further fixes the specification understanding towards the direction reducing the gap between the provided understanding (from the first phase) and the actual understanding implicitly utilized by LLMs for code generation in the feedback-based prompting phase. By improving the understanding with μFiX, the code generation performance of LLMs can be largely improved. Our evaluation on two advanced LLMs (ChatGPT and DeepSeek-Coder) with six widely-used benchmarks by comparing with 15 baselines, demonstrates the effectiveness of μFiX. For example, μFiX outperforms the most effective baseline with an average improvement of 35.62% in terms of Pass@1 across all subjects.
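Pass@1, the metric quoted above, is usually estimated with the standard unbiased Pass@k formula from the code-generation literature; a minimal sketch (not μFiX's code):

    # Unbiased Pass@k estimator: n samples per problem, c of them correct.
    from math import comb

    def pass_at_k(n, c, k):
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=10, c=3, k=1))  # 0.3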
"
Jingwen Zhang, Zibin Zheng, Yuhong Nan, Mingxi Ye, Kaiwen Ning, Yu Zhang, Weizhe Zhang, "SmartReco: Detecting Read-Only Reentrancy via Fine-Grained Cross-DApp Analysis"
Abstract: Despite the increasing popularity of Decentralized Applications (DApps), they are suffering from various vulnerabilities that can be exploited by adversaries for profits. Among such vulnerabilities, Read-Only Reentrancy (called ROR in this paper) is an emerging type of vulnerability that arises from the complex interactions between DApps. In the last three years, ROR attack incidents have already caused around 30M USD in losses to the DApp ecosystem. Existing techniques for vulnerability detection in smart contracts can hardly detect Read-Only Reentrancy attacks, as they do not track and analyze the complex interactions between multiple DApps. In this paper, we propose SmartReco, a new framework for detecting Read-Only Reentrancy vulnerability in DApps through a novel combination of static and dynamic analysis (i.e., fuzzing) over smart contracts. The key design behind SmartReco is threefold: (1) SmartReco identifies the boundary between different DApps from the heavy-coupled cross-contract interactions. (2) SmartReco performs fine-grained static analysis to locate points of interest (i.e., entry functions) that may lead to ROR. (3) SmartReco utilizes the on-chain transaction data and performs multi-function fuzzing (i.e., the entry function and victim function) across different DApps to verify the existence of ROR. Our evaluation on a manually labeled dataset with 45 RORs shows that SmartReco achieves an accuracy of 88.63% and a recall of 86.36%. In addition, SmartReco successfully detects 43 new RORs from 123 popular DApps. The total assets affected by such RORs reach around 520,000 USD.
"
Mengya Zhang, Preksha Shukla, Wuqi Zhang, Zhuo Zhang, Pranav Agrawal, Zhiqiang Lin, Xiangyu Zhang, Xiaokuan Zhang, "An Empirical Study of Proxy Smart Contracts at the Ethereum Ecosystem Scale"
Abstract: Proxy has been introduced as a design pattern to separate data and code in an application to two different types of smart contracts, namely proxy and logic contracts, respectively. Data is stored in the proxy contracts, while the code to be executed is fetched from the logic contracts. Proxy patterns facilitate the flexibility of smart contract development by enabling upgradeability, extensibility, code reuse, etc. Despite its popularity and importance, there is currently no systematic study to understand the prevalence, use scenarios, and development pitfalls of proxies. In this work, we conduct the first comprehensive study on Ethereum proxies. To collect a comprehensive dataset of proxies, we propose ProxyEx, the first framework designed to detect proxy directly from bytecode. Our evaluation shows that ProxyEx achieves over 99% accuracy. With ProxyEx, we collect a large-scale dataset of 2,031,422 proxies from all contracts in Ethereum and conduct the first systematic empirical study. We first measure the total number of proxies and their transaction traffic, to obtain an overall understanding of the status quo of proxies on Ethereum. Then, we categorize the design pattern and use scenarios of proxies into four types: upgradeability, extensibility, code-sharing, and code-hiding. We further identify three types of common pitfalls in proxies: proxy-logic storage collision, logic-logic storage collision, and uninitialized contracts. We also design three checkers for these common pitfalls in proxies by replaying historical transactions. Our study leads to many interesting findings. For instance, we find that upgradeability is not the only reason that developers adopt the proxy pattern in developing Decentralized Applications (DApps). We also find that many proxies suffer from bugs such as storage collision and uninitialized contracts. Our study sheds light on the proxies landscape, and provides valuable insights to future smart contract research on the development, usage, quality assurance, and bug detection of proxies.
"
Yifan Wu, Yunpeng Wang, Ying Li, Wei Tao, Siyu Yu, Haowen Yang, Wei Jiang, Jianguo Li, "An Empirical Study on Commit Message Generation using LLMs via In-Context Learning"
Abstract: Commit messages concisely describe code changes in natural language and are important for software maintenance. Several approaches have been proposed to automatically generate commit messages, but they still suffer from critical limitations, such as time-consuming training and poor generalization ability. To tackle these limitations, we propose to borrow the weapon of large language models (LLMs) and in-context learning (ICL). Our intuition is based on the fact that the training corpora of LLMs contain extensive code changes and their pairwise commit messages, which makes LLMs capture the knowledge about commits, while ICL can exploit the knowledge hidden in the LLMs and enable them to perform downstream tasks without model tuning. However, it remains unclear how well LLMs perform on commit message generation via ICL. Therefore, in this paper, we conduct a comprehensive empirical study to investigate the capability of LLMs to generate commit messages via ICL. Specifically, we first explore the impact of different settings on the performance of ICL-based commit message generation. We then compare ICL-based commit message generation with state-of-the-art approaches on a popular multilingual dataset and a new dataset we created to mitigate potential data leakage. The results show that ICL-based commit message generation significantly outperforms state-of-the-art approaches on subjective evaluation and achieves better generalization ability. We further analyze the root causes for LLM’s underperformance and propose several implications, which shed light on future research directions for using LLMs to generate commit messages.
"
Octavio Galland, Marcel Böhme, "Invivo Fuzzing by Amplifying Actual Executions"
Abstract: A major bottleneck that remains when fuzzing software libraries is the need for fuzz drivers, i.e., the glue code between the fuzzer and the library. Despite years of fuzzing, critical security flaws are still found, e.g., by manual auditing, because the fuzz drivers do not cover the complex interactions between the library and the host programs using it. In this work, we propose an alternative approach to library fuzzing, which leverages a valid execution context that is set up by a given program using the library (the host), and amplifies its execution. More specifically, we execute the host until a designated function from a list of target functions has been reached, and then perform coverage-guided function-level fuzzing on it. Once the fuzzing quota is exhausted, we move on to fuzzing the next target from the list. In this way we not only reduce the amount of manual work needed by a developer to incorporate fuzzing into their workflow, but we also allow the fuzzer to explore parts of the library as they are used in real-world programs that may otherwise not have been tested due to the simplicity of most fuzz drivers.
"
Wenwei Gu, Jiazhen Gu, Jinyang Liu, Zhuangbin Chen, Jianping Zhang, Jinxi Kuang, Cong Feng, Yongqiang Yang, Michael Lyu, "ADAMAS: Adaptive Domain-Aware Performance Anomaly Detection in Cloud Service Systems"
Abstract: A common practice in the reliability engineering of cloud services involves the collection of monitoring metrics, followed by comprehensive analysis to identify performance issues. However, existing methods often fall short of detecting diverse and evolving anomalies across different services. Moreover, there exists a significant gap between the technical and business interpretation of anomalies, i.e., a detected anomaly may not have an actual impact on system performance or user experience. To address these challenges, we propose ADAMAS, an adaptive AutoML-based anomaly detection framework aiming to achieve practical anomaly detection in production cloud systems. To improve the ability of detecting cross-service anomalies, we design a novel unsupervised evaluation function to facilitate the automatic searching of the optimal model structure and parameters. ADAMAS also contains a lightweight human-in-the-loop design, which can efficiently incorporate expert knowledge to adapt to the evolving anomaly patterns and bridge the gap between predicted anomalies and actual business exceptions. Furthermore, through monitoring the rate of mispredicted anomalies, ADAMAS proactively re-configures the optimal model, forming a continuous loop of system improvement. Extensive evaluation on one public and two industrial datasets shows that ADAMAS outperforms all baseline models with a 0.891 F1-score. The ablation study also proves the effectiveness of the evaluation function design and the incorporation of expert knowledge.
Zewei Lin, Jiachi Chen, Jiajing Wu, Weizhe Zhang, Zibin Zheng, "Definition and Detection of Centralization Defects in Smart Contracts"
Abstract: In recent years, security incidents stemming from centralization defects in smart contracts have led to substantial financial losses. A centralization defect refers to any error, flaw, or fault in a smart contract’s design or development stage that introduces a single point of failure. Such defects allow a specific account or user to disrupt the normal operations of smart contracts, potentially causing malfunctions or even complete project shutdowns. Despite the significance of this issue, most current smart contract analyses overlook centralization defects, focusing primarily on other types of defects. To address this gap, our paper introduces six types of centralization defects in smart contracts by manually analyzing 597 Stack Exchange posts and 117 audit reports. For each defect, we provide a detailed description and code examples to illustrate its characteristics and potential impacts. Additionally, we introduce a tool named CDRipper (Centralization Defects Ripper) designed to identify the defined centralization defects. Specifically, CDRipper constructs a permission dependency graph (PDG) and extracts the permission dependencies of functions from the source code of smart contracts. It then detects the sensitive operations in functions and identifies centralization defects based on predefined patterns. We conduct a large-scale experiment using CDRipper on 244,424 real-world smart contracts and evaluate the results based on a manually labeled dataset. Our findings reveal that 82,446 contracts contain at least one of the six centralization defects, with our tool achieving an overall precision of 93.7%.
Changjian Zhang, Parv Kapoor, Ian Dardik, Leyi Cui, Romulo Meira-Goes, David Garlan, Eunsuk Kang, "Constrained LTL Specification Learning from Examples"
Abstract: Temporal logic specifications play an important role in a wide range of software analysis tasks, such as model checking, automated synthesis, program comprehension, and runtime monitoring. Given a set of positive and negative examples, specified as traces, LTL learning is the problem of synthesizing a specification, in linear temporal logic (LTL), that evaluates to true over the positive traces and false over the negative ones. In this paper, we propose a new type of LTL learning problem called constrained LTL learning, where the user, in addition to positive and negative examples, is given an option to specify one or more constraints over the properties of the LTL formula to be learned. We demonstrate that the ability to specify these additional constraints significantly increases the range of applications for LTL learning, and also allows efficient generation of LTL formulas that satisfy certain desirable properties (such as minimality). We propose an approach for solving the constrained LTL learning problem through an encoding in a first-order relational logic and reduction to an instance of the maximal satisfiability (MaxSAT) problem. An experimental evaluation demonstrates that ATLAS, an implementation of our proposed approach, is able to solve new types of learning problems while performing better than or competitively with the state-of-the-art tools in LTL learning.
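As a toy illustration of the learning target (not the ATLAS encoding), a formula such as G(request -> F grant) must hold on every positive trace and fail on every negative one; checking it over finite traces is straightforward:

    # Finite-trace check of G(p -> F q): every step containing p must be followed
    # (at or after that step) by a step containing q. Toy example only.
    def g_implies_f(trace, p, q):
        return all(any(q in later for later in trace[i:])
                   for i, step in enumerate(trace) if p in step)

    positive = [{"request"}, set(), {"grant"}, set()]
    negative = [{"request"}, set(), set(), set()]
    print(g_implies_f(positive, "request", "grant"))  # True
    print(g_implies_f(negative, "request", "grant"))  # False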
Nikhil Parasaram, Huijie Yan, Boyu Yang, Zineb Flahy, Abriele Qudsi, Damian Ziaber, Earl T. Barr, Sergey Mechtaev, "The Fact Selection Problem in LLM-Based Program Repair"
Abstract: Recent research has shown that incorporating bug-related facts, such as stack traces and GitHub issues, into prompts enhances the bug-fixing capabilities of large language models (LLMs). Considering the ever-increasing context window of these models, a critical question arises: what and how many facts should be included in prompts to maximise the chance of correctly fixing bugs? To answer this question, we conducted a large-scale study, employing over 19K prompts featuring various combinations of seven diverse facts to rectify 314 bugs from open-source Python projects within the BugsInPy benchmark. Our findings revealed that each fact, ranging from simple syntactic details like code context to semantic information previously unexplored in the context of LLMs such as angelic values, is beneficial. Specifically, each fact aids in fixing some bugs that would remain unresolved or only be fixed with a low success rate without it. Importantly, we discovered that the effectiveness of program repair prompts is non-monotonic over the number of used facts; using too many facts leads to subpar outcomes. These insights led us to define the fact selection problem: determining the optimal set of facts for inclusion in a prompt to maximise LLM’s performance on a given task instance. We found that there is no one-size-fits-all set of facts for bug repair. Therefore, we developed a basic statistical model, named MANIPLE, which selects facts specific to a given bug to include in the prompt. This model significantly surpasses the performance of the best generic fact set. To underscore the significance of the fact selection problem, we benchmarked MANIPLE against the state-of-the-art zero-shot, non-conversational LLM-based bug repair methods. On our testing dataset of 157 bugs, MANIPLE repairs 88 bugs, 17% above the best configuration.
"
Anna Mazhar, Saad Sher Alam, William Zheng, Yinfang Chen, Suman Nath, Tianyin Xu, "Fidelity of Cloud Emulators: The Imitation Game of Testing Cloud-based Software"
Abstract: Modern software projects have been increasingly using cloud services as important components. The cloud-based programming practice greatly simplifies software development by harvesting cloud benefits (e.g., high availability and elasticity). However, it imposes new challenges for software testing and analysis, due to opaqueness of cloud backends and monetary cost of invoking cloud services for continuous integration and deployment. As a result, cloud emulators are developed for offline development and testing, before online testing and deployment. This paper presents a systematic analysis of cloud emulators from the perspective of cloud-based software testing. Our goal is to (1) understand the discrepancies introduced by cloud emulation with regard to software quality assurance and deployment safety and (2) address inevitable gaps between emulated and real cloud services. The analysis results are concerning. Among 255 APIs of five cloud services from Azure and Amazon Web Services (AWS), we detected discrepant behavior between the emulated and real services in 94 (37%) of the APIs. These discrepancies lead to inconsistent testing results, threatening deployment safety, introducing false alarms, and creating debuggability issues. The root causes are diverse, including accidental implementation defects and essential emulation challenges. We discuss potential solutions and develop a practical mitigation technique to address discrepancies of cloud emulators for software testing.
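A differential test in the spirit of the study might replay the same call against an emulated endpoint and the real service and compare responses; in this sketch the bucket name, the compared fields, and the choice of LocalStack as emulator are assumptions, not details from the paper:

    # Compare one S3 call between an emulator and the real service. Illustration only.
    import boto3

    def probe(endpoint_url=None):
        s3 = boto3.client("s3", region_name="us-east-1", endpoint_url=endpoint_url)
        resp = s3.list_objects_v2(Bucket="demo-bucket")
        return resp.get("KeyCount"), resp.get("IsTruncated")

    emulated = probe(endpoint_url="http://localhost:4566")  # e.g., LocalStack
    real = probe()                                          # real AWS
    if emulated != real:
        print(f"Discrepancy: emulator={emulated} vs. real service={real}")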
"
Yuxin Zhang, Sen Chen, Xiaofei Xie, Zibo Liu, Lingling Fan, "Scenario-Driven and Context-Aware Automated Accessibility Testing for Android Apps"
Abstract: Mobile accessibility is increasingly important nowadays as it enables people with disabilities to use mobile applications to perform daily tasks. Ensuring mobile accessibility not only benefits those with disabilities but also enhances the user experience for all users, making applications more intuitive and user-friendly. Although numerous tools are available for testing and detecting accessibility issues in Android applications, a large number of false negatives and false positives persist due to limitations in the existing approaches, i.e., low coverage of UI scenarios and lack of consideration of runtime context. To address these problems, in this paper, we propose a scenario-driven exploration method for improving the coverage of UI scenarios, thereby detecting accessibility issues within the application, and ultimately reducing false negatives. Furthermore, to reduce false positives caused by not considering the runtime context, we propose a context-aware detection method that provides a more fine-grained detection capability. Our experimental results reveal that A11yScan can detect 1.7X more issues surpassing current state-of-the-art approaches like Xbot (3,991 vs. 2,321), thereby reducing the false negative rate by 41.84%. Additionally, it outperforms established UI exploration techniques such as SceneDroid (952 vs. 661 UI scenarios), while achieving comparable activity coverage to recent leading GUI testing tools like GPTDroid on the available dataset (73% vs. 71%). Meanwhile, with the context-aware detection method, A11yScan effectively reduces the false positive rate by 21%, validated with a 90.56% accuracy rate through a user study.
"
Hyunjae Suh, Mahan Tafreshipour, Jiawei Li, Adithya Bhattiprolu, Iftekhar Ahmed, "An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We?"
Abstract: Artificial Intelligence (AI) techniques, especially Large Language Models (LLMs), have started gaining popularity among researchers and software developers for generating source code. However, LLMs have been shown to generate code with quality issues and also incurred copyright/licensing infringements. Therefore, detecting whether a piece of source code is written by humans or AI has become necessary. This study first presents an empirical analysis to investigate the effectiveness of the existing AI detection tools in detecting AI-generated code. The results show that they all perform poorly and lack sufficient generalizability to be practically deployed. Then, to improve the performance of AI-generated code detection, we propose a range of approaches, including fine-tuning the LLMs and machine learning-based classification with static code metrics or code embedding generated from Abstract Syntax Tree (AST). Our best model outperforms state-of-the-art AI-generated code detector (GPTSniffer) and achieves an F1 score of 82.55. We also conduct an ablation study on our best-performing model to investigate the impact of different source code features on its performance.
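The "classification with static code metrics" idea can be pictured as a standard classifier over per-snippet features; the features, labels, and tiny dataset below are placeholders, not the paper's setup:

    # Toy classifier over static code metrics (lines of code, #identifiers,
    # comment ratio) to label snippets as AI- or human-written. Illustration only.
    from sklearn.ensemble import RandomForestClassifier

    X = [[12, 9, 0.05], [40, 31, 0.20], [8, 6, 0.00], [55, 42, 0.25]]
    y = [1, 0, 1, 0]  # 1 = AI-generated, 0 = human-written (made-up labels)

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(clf.predict([[10, 7, 0.02]]))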
Yuyang Rong, Zhanghan Yu, Zhenkai Weng, Stephen Neuendorffer, Hao Chen, "IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation"
Abstract: Modern compilers, such as LLVM, are complex. Due to their complexity, manual testing is unlikely to suffice, yet formal verification is difficult to scale. End-to-end fuzzing can be used, but it has difficulties in discovering LLVM backend problems for two reasons. First, frontend preprocessing and middle optimization shield the backend from seeing diverse inputs. Second, edge coverage cannot provide effective feedback, as the LLVM backend contains much reusable code. In this paper, we implement IRFuzzer to investigate the need for specialized fuzzing of the LLVM compiler backend. We focus on two approaches to improve the fuzzer: guaranteed input validity using constrained mutations to improve input diversity and new metrics to improve feedback quality. The mutator in IRFuzzer is capable of generating a wide range of LLVM IR inputs, including structured control flow, vector types, and function definitions. The system instruments coding patterns in the compiler to monitor the execution status of instruction selection. The instrumentation not only provides a new coverage feedback called matcher table coverage, but also provides an architecture specific guidance to the mutator. We show that IRFuzzer is more effective than existing fuzzers by fuzzing on 29 mature LLVM backend targets. In the process, we reported 78 confirmed new bugs in LLVM upstream, out of which 57 have been fixed and five have been backported to LLVM 15, showing that specialized fuzzing provides useful and actionable insights to LLVM developers.
"
Wuxia Jin, Jiaowei Shang, Jianguo Zheng, Mengjie Sun, Zhenyu Huang, Ming Fan, Ting Liu, "The Design Smells Breaking the Boundary between Android Variants and AOSP"
Abstract: Phone vendors customize their Android variants to enhance system functionalities based on the Android Open Source Project (AOSP). While developed independently, Android variants have to periodically evolve with the upstream AOSP and merge code changes from it. Vendors have invested great effort to maintain their variants and resolve merging conflicts. In this paper, we characterize the design smells with recurring patterns that break the design boundary between Android variants and AOSP. These smells are manifested as problematic dependencies across the boundary, hindering Android variants' maintainability and co-evolution with AOSP. We propose DroidDS for automatically detecting these design smells. We collect 22 Android variant versions and 22 corresponding AOSP versions, involving 4 open-source projects and 1 industrial project. Our results demonstrate that: files involved in design smells consume higher maintenance costs than other files; these infected files are not merely the files with large code size, increased complexity, and object-oriented smells; the infected files have been involved in more than half of code conflicts induced by re-applying AOSP's changes to Android variants; a substantial portion of design issues could be mitigated. Practitioners can utilize our DroidDS to pinpoint and prioritize design problems for Android variants. Refactoring these problems will help keep a healthy coupling between diverse variants and AOSP, potentially improving maintainability and reducing conflict risks.
Mingfei Cheng, Xiaofei Xie, Yuan Zhou, Junjie Wang, Guozhu Meng, Kairui Yang, "Decictor: Towards Evaluating the Robustness of Decision-Making in Autonomous Driving Systems"
Abstract: Autonomous Driving System (ADS) testing is crucial in ADS development, with the current primary focus being on safety. However, the evaluation of non-safety-critical performance, particularly the ADS’s ability to make optimal decisions and produce optimal paths for autonomous vehicles (AVs), is also vital to ensure the intelligence and reduce risks of AVs. Currently, there is little work dedicated to assessing the robustness of ADSs’ path-planning decisions (PPDs), i.e., whether an ADS can maintain the optimal PPD after an insignificant change in the environment. The key challenges include the lack of clear oracles for assessing PPD optimality and the difficulty in searching for scenarios that lead to non-optimal PPDs. To fill this gap, in this paper, we focus on evaluating the robustness of ADSs’ PPDs and propose the first method, Decictor, for generating non-optimal decision scenarios (NoDSs), where the ADS does not plan optimal paths for AVs. Decictor comprises three main components: Non-invasive Mutation, Consistency Check, and Feedback. To overcome the oracle challenge, Non-invasive Mutation is devised to implement conservative modifications, ensuring the preservation of the original optimal path in the mutated scenarios. Subsequently, the Consistency Check is applied to determine the presence of non-optimal PPDs by comparing the driving paths in the original and mutated scenarios. To deal with the challenge of large environment space, we design Feedback metrics that integrate spatial and temporal dimensions of the AV’s movement. These metrics are crucial for effectively steering the generation of NoDSs. Therefore, Decictor can generate NoDSs by generating new scenarios and then identifying NoDSs in the new scenarios. We evaluate Decictor on Baidu Apollo, an open-source and production-grade ADS. The experimental results validate the effectiveness of Decictor in detecting non-optimal PPDs of ADSs. It generates 63.9 NoDSs in total, while the best-performing baseline only detects 35.4 NoDSs.
Mingyue Yuan, Jieshan Chen, Zhenchang Xing, Aaron Quigley, Yuyu Luo, Tianqi Luo, Gelareh Mohammadi, Qinghua Lu, Liming Zhu, "DesignRepair: Dual-Stream Design Guideline-Aware Frontend Repair with Large Language Models"
Abstract: The rise of Large Language Models (LLMs) has streamlined frontend interface creation through tools like Vercel's V0, yet surfaced challenges in design quality (e.g., accessibility, and usability). Current solutions, often limited by their focus, generalisability, or data dependency, fall short in addressing these complexities comprehensively. Moreover, none of them examine the quality of LLM-generated UI design. In this work, we introduce DesignRepair, a novel dual-stream design guideline-aware system to examine and repair the UI design quality issues from both code aspect and rendered page aspect. We utilised the mature and popular Material Design as our knowledge base to guide this process. Specifically, we first constructed a comprehensive knowledge base encoding Google's Material Design principles into low-level component knowledge base and high-level system design knowledge base. After that, DesignRepair employs a LLM for the extraction of key components and utilizes the Playwright tool for precise page analysis, aligning these with the established knowledge bases. Finally, we integrate Retrieval-Augmented Generation with state-of-the-art LLMs like GPT-4 to holistically refine and repair frontend code through a strategic divide and conquer approach. Our extensive evaluations validated the efficacy and utility of our approach, demonstrating significant enhancements in adherence to design guidelines, accessibility, and user experience metrics.
Houda Naji, Marco Gutfleisch, Alena Naiakshina, "Relationship Status: “It’s complicated” Developer-Security Expert Dynamics in Scrum"
Abstract: The high number of cyber threats poses significant challenges, with impactful software exploits ranging from data theft to ransomware deployment. Unfortunately, past research highlighted limited security expertise within development teams. Collaboration between developers and security experts, therefore, emerges as one of the few workable means to address this gap. In this paper, we explore the complex interplay between developers and security experts within Scrum, one of the most widely adopted frameworks which actively promotes collaboration, to shed light on their working relationship, challenges, and potential avenues for improvement. To this end, we conducted a qualitative interview study with 14 developers and 13 security experts. Our qualitative results reveal three communication patterns and five shared challenges between the groups affecting the developer-security expert collaboration. Top challenges include consistent interaction difficulties and the lack of workable means to balance business and security needs. As a result, we found that three core Scrum values (openness, respect, courage) are missing from this relationship. Based on our results, we propose recommendations for fostering a healthy collaboration between developers and security experts, both within and beyond Scrum.
Bianca Trinkenreich, Zixuan Feng, Rudrajit Choudhuri, Marco Gerosa, Anita Sarma, Igor Steinmacher, "Investigating the Impact of Interpersonal Challenges on Feeling Welcome in OSS"
Abstract: The sustainability of open source software (OSS) projects hinges on contributor retention. Interpersonal challenges can inhibit a feeling of welcomeness among contributors, particularly from underrepresented groups, which impacts their decision to continue with the project. The magnitude of this impact varies among individuals, underlining the importance of a thorough understanding of these effects. Here, we investigate the effects of interpersonal challenges on the sense of welcomeness among diverse populations within OSS, through the diversity lenses of gender, race, and (dis)ability. We analyzed the large-scale Linux Foundation Diversity and Inclusion survey (n = 706) to model a theoretical framework linking interpersonal challenges with the sense of welcomeness through Partial Least Squares Structural Equation Modeling (PLS-SEM). We then examine the model to identify the impact of these challenges on different demographics through Multi-Group Analysis (MGA). Finally, we conducted a regression analysis to investigate how people from different demographics experience different types of interpersonal challenges. Our findings confirm the negative association between interpersonal challenges and the feeling of welcomeness in OSS, with this relationship being more pronounced among gender minorities and people with disabilities. We found that different challenges have unique impacts on how people feel welcomed, with variations across gender, race, and disability groups. We also provide evidence that people from gender minorities and with disabilities are more likely to experience interpersonal challenges than their counterparts, especially when we analyze stalking, sexual harassment, and gender. Our insights benefit OSS communities, informing potential strategies to improve the landscape of interpersonal relationships, ultimately fostering more inclusive and welcoming communities.
"
Feng Lin, Dong Jae Kim, Tse-Hsun (Peter) Chen, "SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents"
Abstract: Software process models are essential to facilitate collaboration and communication among software teams to solve complex development tasks. Inspired by these software engineering practices, we present FlowGen – a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents. We emulate three process models, FlowGen_Waterfall, FlowGen_TDD, and FlowGen_Scrum, by assigning LLM agents to embody roles (i.e., requirement engineer, architect, developer, tester, and scrum master) that correspond to everyday development activities and organize their communication patterns. The agents work collaboratively using chain-of-thought and prompt composition with continuous self-refinement to improve the code quality. We use GPT-3.5 as our underlying LLM and several baselines (RawGPT, CodeT, Reflexion) to evaluate code generation on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Our findings show that FlowGen_Scrum excels compared to other process models, achieving a Pass@1 of 75.2, 65.5, 82.5, and 56.7 in HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively (an average of 15% improvement over RawGPT). Compared with other state-of-the-art techniques, FlowGen_Scrum achieves a higher Pass@1 in MBPP compared to CodeT, with both outperforming Reflexion. Notably, integrating CodeT into FlowGen_Scrum resulted in statistically significant improvements, achieving the highest Pass@1 scores. Our analysis also reveals that the development activities impacted code smell and exception handling differently, with design and code review adding more exception handling and reducing code smells. Finally, FlowGen models maintain stable Pass@1 scores across GPT-3.5 versions and temperature values, highlighting the effectiveness of software process models in enhancing the quality and stability of LLM-generated code.
Linfeng Liang, Yao Deng, Kye Morton, Valtteri Kallinen, Alice James, Avishkar Seth, Endrowednes Kuantama, Subhas Mukhopadhyay, Richard Han, Xi Zheng, "GARL: Genetic Algorithm-Augmented Reinforcement Learning to Detect Violations in Marker-Based Autonomous Landing Systems"
Abstract: Automated Uncrewed Aerial Vehicle (UAV) landing is crucial for autonomous UAV services such as monitoring, surveying, and package delivery. It involves detecting landing targets, perceiving obstacles, planning collision-free paths, and controlling UAV movements for safe landing. Failures can lead to significant losses, necessitating rigorous simulation-based testing for safety. Traditional offline testing methods, limited to static environments and predefined trajectories, may miss violation cases caused by dynamic objects like people and animals. Conversely, online testing methods require extensive training time, which is impractical with limited budgets. To address these issues, we introduce GARL, a framework combining a genetic algorithm (GA) and reinforcement learning (RL) for efficient generation of diverse and real landing system failures within a practical budget. GARL employs GA for exploring various environment setups offline, reducing the complexity of RL's online testing in simulating challenging landing scenarios. Our approach outperforms existing methods by up to 18.35% in violation rate and 58% in diversity metric. We validate most discovered violation types with real-world UAV tests, pioneering the integration of offline and online testing strategies for autonomous systems. This method opens new research directions for online testing, with our code available at https://anonymous.4open.science/r/drone_testing-5CF0/.
Siyuan Li, Yuekang Li, Zuxin Chen, Chaopeng Dong, Yongpan Wang, Hong Li, Yongle Chen, Hongsong Zhu, "TransferFuzz: Fuzzing with Historical Trace for Verifying Propagated Vulnerability Code"
Abstract: Code reuse in software development frequently facilitates the spread of vulnerabilities, making the scope of affected software in CVE reports imprecise. Traditional methods primarily focus on identifying reused vulnerability code within target software, yet they cannot verify if these vulnerabilities can be triggered in new software contexts. This limitation often results in false positives. In this paper, we introduce TransferFuzz, a novel vulnerability verification framework, to verify whether vulnerabilities propagated through code reuse can be triggered in new software. Innovatively, we collected runtime information during the execution or fuzzing of the basic binary (the vulnerable binary detailed in CVE reports). This process allowed us to extract historical traces, which proved instrumental in guiding the fuzzing process for the target binary (the new binary that reused the vulnerable function). TransferFuzz introduces a unique Key Bytes Guided Mutation strategy and a Nested Simulated Annealing algorithm, which transfers these historical traces to implement trace-guided fuzzing on the target binary, facilitating the accurate and efficient verification of the propagated vulnerability. Our evaluation, conducted on widely recognized datasets, shows that TransferFuzz can quickly validate vulnerabilities previously unverifiable with existing techniques. Its verification speed is 2.5 to 26.2 times faster than existing methods. Moreover, TransferFuzz has proven its effectiveness by expanding the impacted software scope for 15 vulnerabilities listed in CVE reports, increasing the number of affected binaries from 15 to 53. The datasets and source code used in this article are available at https://anonymous.4open.science/r/TransferFuzz-E9B3.
"
Chong Wang, Jianan Liu, Xin Peng, Yang Liu, Yiling Lou, "Boosting Static Resource Leak Detection via LLM-based Resource-Oriented Intention Inference"
Abstract: Resource leaks, caused by resources not being released after acquisition, often lead to performance issues and system crashes. Existing static detection techniques rely on mechanical matching of predefined resource acquisition/release APIs and null-checking conditions to find unreleased resources, suffering from both (1) false negatives caused by the incompleteness of predefined resource acquisition/release APIs and (2) false positives caused by the incompleteness of resource reachability validation identification. To overcome these challenges, we propose InferROI, a novel approach that leverages the exceptional code comprehension capability of large language models (LLMs) to directly infer resource-oriented intentions (acquisition, release, and reachability validation) in code. InferROI first prompts the LLM to infer involved intentions for a given code snippet, and then incorporates a two-stage static analysis approach to check control-flow paths for resource leak detection based on the inferred intentions. We evaluate the effectiveness of InferROI in both resource-oriented intention inference and resource leak detection. Experimental results on the DroidLeaks and JLeaks datasets demonstrate InferROI achieves promising bug detection rates (59.3% and 62.5%) and false alarm rates (18.6% and 19.5%). Compared to three industrial static detectors, InferROI detects 14~45 and 149~485 more bugs in DroidLeaks and JLeaks, respectively. When applied to real-world open-source projects, InferROI identifies 29 unknown resource leak bugs (verified by authors), with 7 of them being confirmed by developers. In addition, the results of an ablation study underscore the importance of combining LLM-based inference with static analysis. Finally, manual annotation indicated that InferROI achieved a precision of 74.6% and a recall of 81.8% in intention inference, covering more than 60% of the resource types involved in the datasets.
"
Wonhoi Kim, Hocheol Nam, Muoi Tran, Amin Jalilov, Zhenkai Liang, Sang Kil Cha, Min Suk Kang, "Fork State-Aware Differential Fuzzing for Blockchain Consensus Implementations"
Abstract: Blockchain networks allow multiple client implementations of the same consensus algorithm by different developers to coexist in the same system. Ensuring correct implementations among these heterogeneous clients is crucial, as even slight semantic discrepancies in their implementations can lead to safety failures. While existing fuzzing frameworks have discovered implementation flaws in blockchain, they suffer from several challenges in testing them with sequences of conflicting blocks, called forks. Existing tools fail to adequately assess the fork-handling processes in blockchain implementations when relying on traditional code coverage feedback, which lacks the granularity needed to navigate the diverse and complex fork-handling scenarios. This paper introduces Forky, a fork state-aware differential fuzzing framework designed to detect implementation discrepancies within the critical fork-handling process with its novel fork-aware mutation and fork-diversifying feedback mechanisms. We test Forky on the two most influential blockchain projects: Bitcoin and Ethereum, which are representative of the two major blockchain consensus algorithm families, Proof-of-Work (PoW) and Proof-of-Stake (PoS).
Yueke Zhang, Anda Liang, Xiaohan Wang, Pamela J. Wisniewski, Fengwei Zhang, Kevin Leach, Yu Huang, "Who’s Pushing the Code: An Exploration of GitHub Impersonation"
Abstract: GitHub is one of the largest open-source software (OSS) communities for software development and collaboration. Impersonation in the OSS communities refers to the malicious act of assuming another user's identity, often aiming to gain unauthorized access to code, manipulate project outcomes, or spread misinformation. With several recent real-world attacks resulting from impersonation, this issue is becoming an increasingly problematic concern within the OSS community. We present the first exploration of the impact of impersonation in GitHub. Specifically, we conduct structured interviews with 17 real-world OSS contributors about their perception of impersonation and corresponding mitigations. Our study reveals that, in general, GitHub users lack awareness of impersonation and underestimate the severity of its implications. After witnessing the impersonation, they show significant concern for the OSS community. Meanwhile, we also demonstrate that the current best practices (i.e., commit signing) that might mitigate impersonation must be improved to increase widespread acceptance and adoption. We also present and discuss participant perceptions of potential ways to mitigate GitHub impersonation. We collect a dataset comprising 12.5 million commits to investigate the current status of impersonation. Interestingly, we also find that impersonation is currently not being detected. We observe that existing commit histories treat impersonation behavior identically to pull request events, resulting in a lack of detection methods for impersonation.
"
Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Federica Sarro, Yang Liu, "Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning Software"
Abstract: Intersectional fairness is a critical requirement for Machine Learning (ML) software, demanding fairness across subgroups defined by multiple protected attributes. This paper introduces FairHOME, a novel ensemble approach using higher order mutation of inputs to enhance intersectional fairness of ML software during the inference phase. Inspired by social science theories highlighting the benefits of diversity, FairHOME generates mutants representing diverse subgroups for each input instance, thus broadening the array of perspectives to foster a fairer decision-making process. Unlike conventional ensemble methods that combine predictions made by different models, FairHOME combines predictions for the original input and its mutants, all generated by the same ML model, to reach a final decision. Notably, FairHOME is even applicable to deployed ML software as it bypasses the need for training new models. We extensively evaluate FairHOME against six state-of-the-art fairness improvement methods across 24 decision-making tasks using widely adopted metrics. FairHOME consistently outperforms existing methods across all metrics considered. On average, it enhances intersectional fairness by 47.3%, surpassing the currently best-performing method by 10.1 percentage points.
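A minimal sketch of the ensemble idea described above: the same model predicts on the original input and on mutants with flipped protected attributes, and a majority vote decides; the attribute names, value sets, and model interface below are hypothetical placeholders:

    # Majority vote over predictions for the original input and its higher-order
    # mutants (all combinations of protected-attribute values). Illustration only.
    from itertools import product
    from collections import Counter

    def fair_predict(model, x, protected_values):
        # protected_values, e.g. {"sex": [0, 1], "race": [0, 1, 2]}  (hypothetical)
        preds = []
        for combo in product(*protected_values.values()):
            mutant = dict(x)
            mutant.update(zip(protected_values.keys(), combo))
            preds.append(model.predict(mutant))  # the original input is among the variants
        return Counter(preds).most_common(1)[0][0]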
Junjielong Xu, Ying Fu, Shin Hwei Tan, Pinjia He, "Aligning the Objective of LLM-based Program Repair"
Abstract: Large language models (LLMs) have achieved decent results on automated program repair (APR). However, the next token prediction training objective of decoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction objective of current infilling-style methods, which impedes LLMs from fully leveraging pre-trained knowledge for program repair. In addition, while some LLMs can locate and repair bugs in certain functions using the related artifacts (e.g., test cases), existing methods still depend on statement-level fault localization methods to provide a list of buggy hunks for repair. This restriction hinders LLMs from exploring potential patches beyond the given locations. In this paper, we investigate a new approach to adapt LLMs to program repair. Our core insight is that LLM’s APR capability can be greatly improved by simply aligning the output to their training objective and allowing them to refine the whole program without first identifying faulty statements. Based on this insight, we designed D4C, a straightforward prompting framework for APR. D4C can repair 180 bugs correctly in Defects4J, with each patch being sampled only 10 times. This surpasses the SOTA APR methods with perfect fault localization by 10% and reduces the patch sampling number by 90%. Our findings reveal that (1) objective alignment is crucial for fully exploiting LLM’s pre-trained capability, and (2) replacing the traditional localize-buggy-hunks-then-repair workflow with direct debugging is more effective for LLM-based APR methods. Thus, we believe this paper introduces a new mindset for harnessing LLMs in APR.
"
Aidan Z.H. Yang, Sophia Kolak, Vincent Hellendoorn, Ruben Martins, Claire Le Goues, "Revisiting Unnaturalness for Automated Program Repair in the Era of Large Language Models"
Abstract: Language models have improved by orders of magnitude with the recent emergence of Transformer-based Large Language Models (LLMs). LLMs have demonstrated their ability to generate "natural" code that is highly similar to code written by professional developers. One intermediate value an LLM can emit is entropy, which measures the naturalness of a token of code. We hypothesize that entropy can be used to improve the performance of Automated Program Repair (APR) tasks. While much progress has been made in Automated Program Repair (APR), fault localization techniques suffer from a lack of diversity in ranking scores, patch generation tools tend to be inefficient as all tests need to run before determining if a patch is likely to be correct, and patch ranking often suffers from the test-suite over-fitting problem. However, using an LLM directly for APR introduces concerns for training data leakage. In this work, we introduce a novel way of using the entropy of LLMs in combination with prior APR tools to improve all stages of APR. By using only the prefix and suffix context of a line or block of code to describe naturalness, we can use LLMs to localize faults and rank patches all while eliminating the dependency for test-suites. We show that entropy is highly complementary with prior fault localization tools. Our proposed method achieves a 108% top-1 score improvement over SBFL. When using entropy for patch ranking and classification, our proposed method can rank correct patches more effectively than state-of-the-art machine learning tools with a 49% improvement in top-1. Our work suggests that LLMs can be an effective addition to complement prior APR techniques while minimizing both the test-suite over-fitting problem and the LLM data leakage problem.
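To make the entropy signal concrete: a line's "unnaturalness" can be summarized as the mean negative log-probability of its tokens under the LLM; the token log-probs below are invented for illustration, whereas the paper obtains them from an actual model:

    # Rank code lines by mean negative log-probability (higher = less natural).
    line_token_logprobs = {
        "total = price * qty":  [-0.3, -0.5, -0.4, -0.6, -0.2],
        "total = price ** qty": [-0.3, -0.5, -2.9, -3.4, -1.8],  # suspicious line
    }

    def entropy(logprobs):
        return -sum(logprobs) / len(logprobs)

    for line, lps in sorted(line_token_logprobs.items(),
                            key=lambda kv: entropy(kv[1]), reverse=True):
        print(f"{entropy(lps):.2f}  {line}")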
Boqi Chen, José Antonio Hernández López, Gunter Mussbacher, Dániel Varró, "The Power of Types: Exploring the Impact of Type Checking on Neural Bug Detection in Dynamically Typed Languages"
Abstract: [Motivation] Automated bug detection in dynamically typed languages such as Python is essential for maintaining code quality. The lack of mandatory type annotations in such languages can lead to errors that are challenging to identify early with traditional static analysis tools. Recent progress in deep neural networks has led to increased use of neural bug detectors. In statically typed languages, a type checker is integrated into the compiler and thus taken into consideration when the neural bug detector is designed for these languages. [Problem] However, prior studies overlook this aspect during the training and testing of neural bug detectors for dynamically typed languages. When an optional type checker is used, assessing existing neural bug detectors on bugs easily detectable by type checkers may impact their performance estimation. Moreover, including these bugs in the training set of neural bug detectors can shift their detection focus toward the wrong type of bugs. [Contribution] We explore the impact of type checking on various neural bug detectors for variable misuse bugs, a common type targeted by neural bug detectors. Existing synthetic and real-world datasets are type-checked to evaluate the prevalence of type-related bugs. Then, we investigate how type-related bugs influence the training and testing of the neural bug detectors. [Findings] Our findings indicate that existing bug detection datasets contain a significant proportion of type-related bugs. Building on this insight, we discover integrating the neural bug detector with a type checker can be beneficial, especially when the code is annotated with types. Further investigation reveals neural bug detectors perform better on type-related bugs than other bugs. Moreover, removing type-related bugs from the training data helps improve neural bug detectors’ ability to identify bugs beyond the scope of type checkers.
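The filtering step implied above can be pictured as running an off-the-shelf type checker over each sample and setting aside anything it already flags; this sketch uses mypy's programmatic API and a toy snippet, and is not the paper's exact pipeline:

    # Run mypy over a snippet; a non-zero exit status means the bug is "type-related"
    # and already caught by the type checker. Illustration only; requires mypy.
    import tempfile
    from mypy import api

    snippet = "def area(r: float) -> float:\n    return r * 'pi'\n"  # misuse-style bug

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(snippet)
        path = f.name

    report, errors, exit_status = api.run([path])
    print("type-related" if exit_status != 0 else "not caught by the type checker")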
"
Verya Monjezi, Ashutosh Trivedi, Vladik Kreinovich, Saeid Tizpaz-Niari, "Fairness Testing through Extreme Value Theory"
Abstract: Data-driven software is increasingly being used as a critical component of automated decision-support systems. Since this class of software learns its logic from historical data, it can encode or amplify discriminatory practices. Previous research on algorithmic fairness has focused on improving "average-case" fairness. On the other hand, fairness at the extreme ends of the spectrum, which often signifies lasting and impactful shifts in societal attitudes, has received significantly less emphasis. Leveraging the statistics of extreme value theory (EVT), we propose a novel fairness criterion called extreme counterfactual discrimination (ECD). This criterion estimates the worst-case amounts of disadvantage in outcomes for individuals solely based on their memberships in a protected group. Utilizing tools from search-based software engineering and generative AI, we present a randomized algorithm that samples a statistically significant set of points from the tail of ML outcome distributions even if the input dataset lacks a sufficient number of relevant samples. We conducted several experiments on four ML models (deep neural networks, logistic regression, and random forests) over 10 socially relevant tasks from the literature on algorithmic fairness. First, we evaluate the generative AI methods and find that they generate sufficient samples to infer valid EVT distributions in 95% of cases. Remarkably, we found that the prevalent bias mitigators reduce the average-case discrimination but increase the worst-case discrimination significantly in 35% of cases. We also observed that even the tail-aware mitigation algorithm MiniMax-Fairness increased the worst-case discrimination in 30% of cases. We propose a novel ECD-based mitigator that improves fairness in the tail in 90% of cases with no degradation of the average-case discrimination. We hope that the EVT framework serves as a robust tool for evaluating fairness in both average-case and worst-case discrimination.
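As a rough illustration of the EVT machinery involved (not the paper's ECD algorithm), a peaks-over-threshold fit with scipy can extrapolate a worst-case tail quantile from sampled discrimination scores; the function name and parameters below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

def worst_case_tail_estimate(discrimination_samples, threshold_quantile=0.95,
                             return_quantile=0.999):
    """Peaks-over-threshold sketch: model exceedances over a high threshold with a
    Generalized Pareto Distribution and extrapolate an extreme quantile."""
    x = np.asarray(discrimination_samples)
    u = np.quantile(x, threshold_quantile)      # high threshold
    exceedances = x[x > u] - u                  # tail observations
    c, _, scale = genpareto.fit(exceedances, floc=0.0)
    # Unconditional quantile: P(X <= x) = (1 - p_exceed) + p_exceed * F_GPD(x - u)
    p_exceed = len(exceedances) / len(x)
    q = (return_quantile - (1 - p_exceed)) / p_exceed
    return u + genpareto.ppf(q, c, loc=0.0, scale=scale)
```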
Jiageng Li, Zhen Dong, Chong Wang, Haozhen You, Cen Zhang, Yang Liu, Xin Peng, "LLM Based Input Space Partitioning Testing for Library APIs"
Abstract: Automated library API testing is difficult as it requires exploring a vast space of parameter inputs that may involve objects with complex data types. Existing search-based approaches, with limited knowledge of the relations between object states and program branches, often suffer from low efficiency, i.e., they tend to generate invalid inputs. Symbolic execution-based approaches can effectively identify such relations, but fail to scale to large programs. In this work, we present an LLM-based input space partitioning testing approach, LISP, for library APIs. The approach leverages LLMs to understand the code of a library API under test and perform input space partitioning based on its understanding and rich common knowledge. Specifically, we provide the signature and code of the API under test to LLMs, with the expectation of obtaining a text description of each input space partition of the API under test. Then, the generated text description is employed to guide the input generation process for each partition, ultimately resulting in test suites that systematically explore the program behavior of the API. We evaluate LISP on 10 popular open-source Java libraries (e.g., apache/commons-lang with 2.6k stars and guava with 48.8k stars on GitHub). Our experimental results show that LISP is effective in library API testing. It significantly outperforms the state-of-the-art tool EvoSuite in terms of branch coverage. On average, LISP achieves 67.82% branch coverage, surpassing EvoSuite by 1.21 times. In total, LISP triggers 404 exceptions or errors in the experiments and discovers 13 previously unknown vulnerabilities during evaluation, which have been assigned CVE IDs.
"
Qiaolin Qin, Heng Li, Ettore Merlo, Maxime Lamothe, "Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection"
Abstract: With the advent of data-centric and machine learning (ML) systems, data quality is playing an increasingly critical role in ensuring the overall quality of software systems. Alas, data preparation, an essential step towards high data quality, is known to be a highly effort-intensive process. Although prior studies have dealt with one of the most impactful issues, data pattern violations, we observe that these studies usually require data-specific configurations (i.e., they are parameterized) or a certain set of fully curated data as learning examples (i.e., they are supervised). Both approaches require domain knowledge and depend on users' deep understanding of their data, and are often effort-intensive. In this paper, we introduce RIOLU: Regex Inferencer autO-parameterized Learning with Uncleaned data. RIOLU is fully automated, is automatically parameterized, and does not need labeled samples. We observe that RIOLU can generate precise patterns from datasets in various domains, with a high F1 score of 97.2%, exceeding the state-of-the-art baseline. In addition, according to our experiment on five datasets with anomalies, RIOLU can automatically estimate a data column's error rate, draw normal patterns, and predict anomalies from unlabeled data with higher performance (up to 800.4% improvement in terms of F1) than the state-of-the-art baseline. Furthermore, RIOLU can even outperform ChatGPT in terms of both accuracy (12.3% higher F1) and efficiency (10% less inference time). With user involvement, a variation (a guided version) of RIOLU can further boost its precision (up to 37.4% improvement in terms of F1). Our evaluation in an industrial setting further demonstrates the practical benefits of RIOLU.
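To convey the flavor of unsupervised pattern inference (a toy sketch, not RIOLU itself), one can generalize each column value into a character-class template, adopt the dominant template as the column's pattern, and flag non-matching values as anomalies; the function names and coverage threshold below are illustrative assumptions.

```python
import re
from collections import Counter

def generalize(value: str) -> str:
    """Map a literal value to a simple regex template:
    runs of digits become \\d+ and runs of letters become [A-Za-z]+."""
    out, i = [], 0
    while i < len(value):
        ch = value[i]
        if ch.isdigit():
            cls, pred = r"\d+", str.isdigit
        elif ch.isalpha():
            cls, pred = "[A-Za-z]+", str.isalpha
        else:
            out.append(re.escape(ch)); i += 1; continue
        while i < len(value) and pred(value[i]):
            i += 1
        out.append(cls)
    return "^" + "".join(out) + "$"

def infer_pattern_and_anomalies(column: list[str], coverage: float = 0.9):
    """Pick the most frequent template; values not matching it are flagged as anomalies."""
    templates = Counter(generalize(v) for v in column)
    pattern, count = templates.most_common(1)[0]
    if count / len(column) < coverage:   # column too heterogeneous to summarize
        return None, []
    rx = re.compile(pattern)
    return pattern, [v for v in column if not rx.match(v)]
```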
"
Bimpe Ayoola, Miikka Kuutilla, Rina Wehbe, Paul Ralph, "User Personas Improve Social Sustainability by Encouraging Software Developers to Deprioritize Antisocial Features"
Abstract: Background: Sustainable software development involves creating software in a manner that meets present goals without undermining our ability to meet future goals. In a software engineering context, sustainability has at least four dimensions: ecological, economic, social, and technical. No interventions for improving social sustainability in software engineering have been tested in rigorous lab-based experiments, and little evidence-based guidance is available. Objective: The purpose of this study is to evaluate the effectiveness of two interventions, stakeholder maps and persona models, for improving social sustainability through better software feature prioritization. Method: We conducted a randomized controlled factorial experiment with 79 undergraduate computer science students. Participants were randomly assigned to one of four groups and asked to prioritize a backlog of prosocial, neutral, and antisocial user stories for a shopping mall's digital screen display and facial recognition software. Participants received either persona models, a stakeholder map, both, or neither. We compared the differences in prioritization levels assigned to prosocial and antisocial user stories using Cumulative Link Mixed Model regression. Results: Participants who received persona models gave significantly lower priorities to antisocial user stories, but no significant difference was evident for prosocial user stories. The effects of the stakeholder map were not significant. The interaction effects were not significant. Conclusion: Providing aspiring software professionals with well-crafted persona models causes them to deprioritize antisocial software features. The impact of persona modelling on sustainable software development therefore warrants further study with more experienced professionals. Moreover, the novel methodological strategy of assessing social sustainability behavior through backlog prioritization appears feasible in lab-based settings.
"
Daniel Erhabor, Sreeharsha Udayashankar, Meiyappan Nagappan, Samer Al-Kiswany, "Measuring the Runtime Performance of C++ Code Written by Humans using GitHub Copilot"
Abstract: GitHub Copilot is an artificially intelligent programming assistant used by many developers. While a few studies have evaluated the security risks of using Copilot, there has not been any study showing whether it aids developers in producing code with better runtime performance. We evaluate the runtime performance of C++ code produced when developers use GitHub Copilot versus when they do not. To this end, we conducted a user study with 32 participants where each participant solved two C++ programming problems, one with Copilot and the other without it, and measured the runtime performance of the participants' solutions on our test data. Our results suggest that using Copilot may produce C++ code with statistically significantly slower runtime performance.
Yulong Ye, Tao Chen, Miqing Li, "Distilled Lifelong Self-Adaptation for Configurable Systems"
Abstract: Modern configurable systems provide tremendous opportunities for engineering future intelligent software systems. A key difficulty thereof is how to effectively self-adapt the configuration of a running system such that its performance (e.g., runtime and throughput) can be optimized under time-varying workloads. This unfortunately remains unaddressed in existing approaches as they either overlook the available past knowledge or rely on static exploitation of past knowledge without reasoning about the usefulness of information when planning for self-adaptation. In this paper, we tackle this challenging problem by proposing DLiSA, a framework that self-adapts configurable systems. DLiSA comes with two properties: firstly, it supports lifelong planning, whereby the planning process runs continuously throughout the lifetime of the system, allowing dynamic exploitation of the accumulated knowledge for rapid adaptation. Secondly, planning for a newly emerged workload is boosted via distilled knowledge seeding, in which the knowledge is dynamically purified such that only useful past configurations are seeded when necessary, mitigating misleading information. Extensive experiments suggest that the proposed DLiSA significantly outperforms state-of-the-art approaches, demonstrating a performance improvement of up to 255% and a resource acceleration of up to 2.22x on generating promising adaptation configurations. All data and sources can be found at our anonymous site: https://github.com/Anonymous-DLiSA/DLiSA.
"
Zachary Karas, Benjamin Gold, Violet Zhou, Noah Reardon, Thad Polk, Catie Chang, Yu Huang, "Studying Programmers Without Programming: Investigating Expertise Using Resting State fMRI"
Abstract: Expert programmers are more effective at coding activities, but the reasons for this remain elusive. Accordingly, recent research has used neuroimaging such as fMRI to analyze how expert programmers might think as they perform coding activities. Those experiments have all involved specific programming tasks (i.e., comprehension), but have been unable to detect systematic differences based on coding experience. By using tasks, however, those studies may limit the number and type of brain networks involved. In Cognitive Neuroscience, researchers commonly analyze resting-state data, in which participants' brain activity is recorded as they lie idle in the scanner. The brain's functional organization is plastic, and can change with experience. These changes can be measured at rest, making this a suitable data type for studying how programming activities affect neural organization over time. In this paper, we analyzed the resting-state scans of 150 participants, 96 of whom were programmers. We found increased connectivity in programmers between brain regions involved in language, math, and temporal attention. Non-programmers demonstrated more connectivity with regions involved in social and emotional cognition. We found that as years of programming experience increase, connectivity decreases between two regions associated with visual processing during reading and articulation, respectively.
Setu Kumar Basak, K. Virgil English, Ken Ogura, Vitesh Kambara, Bradley Reaves, Laurie Williams, "AssetHarvester: A Static Analysis Tool for Detecting Secret-Asset Pairs in Software Artifacts"
Abstract: GitGuardian monitored secrets exposure in public GitHub repositories and reported that developers leaked over 12 million secrets (database and other credentials) in 2023, indicating a 113% surge from 2021. Despite the availability of secret detection tools, developers ignore the tools' reported warnings because of false positives (25%-99%). However, each secret protects assets of different values accessible through asset identifiers (a DNS name and a public or private IP address). The asset information for a secret can aid developers in filtering false positives and prioritizing secret removal from the source code. However, existing secret detection tools do not provide the asset information, making it difficult for developers to filter secrets by looking only at the secret value or to find the assets manually for each reported secret. The goal of our study is to aid software practitioners in prioritizing secrets removal by providing the asset information protected by the secrets through our novel static analysis tool. We present AssetHarvester, a static analysis tool to detect secret-asset pairs in a repository. Since the location of the asset can be distant from where the secret is defined, we investigated secret-asset co-location patterns and found four patterns. To identify the secret-asset pairs of the four patterns, we utilized three approaches (pattern matching, data flow analysis, and fast-approximation heuristics). We curated a benchmark of 1,791 secret-asset pairs of four database types extracted from 188 public GitHub repositories to evaluate the performance of AssetHarvester. AssetHarvester demonstrates a precision of 97%, recall of 90%, and F1-score of 94% in detecting secret-asset pairs. Our findings indicate that the data flow analysis employed in AssetHarvester detects secret-asset pairs with 0% false positives and aids in improving the recall of secret detection tools. Additionally, AssetHarvester shows a 43% increase in precision for database secret detection compared to existing detection tools through the detection of assets, thus reducing developers' alert fatigue.
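The simplest kind of co-location, a secret and its asset identifier inside the same connection-string literal, can be pictured with a small regex sketch (a hypothetical pattern for illustration, not AssetHarvester's actual rules):

```python
import re

# Illustrative co-location pattern: a database URL that carries both the
# credential (secret) and the asset identifier (host) in a single literal.
DB_URL = re.compile(
    r"(?P<scheme>postgres|mysql|mongodb)://"
    r"(?P<user>[^:\s]+):(?P<password>[^@\s]+)@"
    r"(?P<host>[A-Za-z0-9.\-]+)(?::(?P<port>\d+))?"
)

def find_secret_asset_pairs(source: str):
    """Return (secret, asset) pairs found in the same string literal."""
    return [(m.group("password"), m.group("host")) for m in DB_URL.finditer(source)]

# Example:
# find_secret_asset_pairs('conn = "postgres://app:s3cr3t@db.internal:5432/orders"')
# -> [('s3cr3t', 'db.internal')]
```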
Yining She, Sumon Biswas, Christian Kästner, Eunsuk Kang, "FairSense: Long-Term Fairness Analysis of ML-Enabled Systems"
Abstract: Algorithmic fairness of machine learning (ML) models has raised significant concern in recent years. Many testing, verification, and bias mitigation techniques have been proposed to identify and reduce fairness issues in ML models. The existing methods are *model-centric* and designed to detect fairness issues under *static settings*. However, many ML-enabled systems operate in a dynamic environment where the predictive decisions made by the system *impact* the environment, which in turn affects future decision-making. Such a self-reinforcing *feedback loop* can cause fairness violations in the long term, even if the immediate outcomes are fair. In this paper, we propose a simulation-based framework called FairSense to detect and analyze long-term unfairness in ML-enabled systems. In particular, the framework targets systems with an ML model that is trained over tabular data using supervised learning. Given a fairness requirement, FairSense performs *Monte-Carlo simulation* to enumerate evolution traces for each system configuration. Then, FairSense performs *sensitivity analysis* on the space of system parameters to understand the impact of configuration decisions on the long-term fairness of the system. We demonstrate FairSense's potential utility through three real-world case studies: loan lending, opioid risk scoring, and predictive policing.
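A toy version of the simulation idea (purely illustrative; FairSense's actual system models and fairness metrics differ) shows how a decision threshold, treated as a configuration parameter, can be swept while a feedback loop updates the environment each round:

```python
import random

def simulate_long_term_gap(threshold: float, steps: int = 50, seed: int = 0) -> float:
    """Toy feedback loop: approvals nudge a group's mean score up, rejections down.
    Returns the approval-rate gap between two groups after `steps` rounds."""
    rng = random.Random(seed)
    mean = {"A": 0.55, "B": 0.45}          # initial score distributions per group
    rates = {"A": 0.0, "B": 0.0}
    for _ in range(steps):
        for g in mean:
            scores = [min(1.0, max(0.0, rng.gauss(mean[g], 0.1))) for _ in range(200)]
            rates[g] = sum(s >= threshold for s in scores) / len(scores)
            # Decisions feed back into the environment for the next round.
            mean[g] += 0.01 * (rates[g] - 0.5)
    return abs(rates["A"] - rates["B"])

# Sensitivity analysis over the configuration space (here, only the threshold):
# gaps = {t: simulate_long_term_gap(t) for t in (0.4, 0.5, 0.6)}
```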
Islem BOUZENIA, Premkumar Devanbu, Michael Pradel, "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair"
Abstract: Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent’s effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI’s GPT-3.5 model, translates to 14 cents per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.
Sarah Fakhoury, Markus Kuppe, Shuvendu Lahiri, Tahina Ramananandro, Nikhil Swamy, "3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers"
Abstract: Improper parsing of attacker-controlled input is a leading source of software security vulnerabilities, especially when programmers transcribe informal format descriptions into efficient parsing logic in low-level, memory-unsafe languages. Several researchers have proposed formal specification languages for data formats from which efficient code can be extracted. However, distilling informal requirements into formal specifications is challenging and, despite their benefits, new formal languages are hard for people to learn and use. In this work, we present 3DGen, a framework that uses AI agents to transform mixed informal input, including natural language documents and example inputs, into format specifications in a language called 3D. To support humans in understanding and trusting the generated specifications, 3DGen uses symbolic methods to also synthesize test inputs that can be validated against an external oracle. Symbolic test generation also helps in distinguishing multiple plausible solutions. Through a process of repeated refinement, 3DGen produces a 3D specification that conforms to a test suite and yields safe, efficient, provably correct parsing code in C. We have evaluated 3DGen on 20 Internet standard formats, demonstrating the potential for AI agents to produce formally verified C code at a non-trivial scale. A key enabler is the use of a domain-specific language to limit AI outputs to a class for which automated, symbolic analysis is tractable.
"
Paschal Amusuo, Kyle A. Robinson, Tanmay Singla, Huiyun Peng, Aravind Machiry, Santiago Torres-Arias, Laurent Simon, James C Davis, "ZTD_JAVA: Mitigating Software Supply Chain Vulnerabilities via Zero-Trust Dependencies"
Abstract: Third-party libraries like Log4j accelerate software application development but introduce substantial risk. Vulnerabilities in these libraries have led to Software Supply Chain (SSC) attacks that compromised resources within the host system. These attacks benefit from current application permission approaches: third-party libraries are implicitly trusted in the application runtime. An application runtime designed with Zero-Trust Architecture (ZTA) principles (secure access to resources, continuous monitoring, and least-privilege enforcement) could mitigate SSC attacks, as it would give zero implicit trust to these libraries. However, no individual security defense incorporates these principles at a low runtime cost. This paper proposes Zero-Trust Dependencies (ZTD) to mitigate SSC vulnerabilities: we apply the NIST ZTA to software applications. First, we assess the expected effectiveness and configuration cost of Zero-Trust Dependencies through a study of third-party software libraries and their vulnerabilities. Then, we present a system design, ZTD_sys, that enables the application of Zero-Trust Dependencies to software applications, and a prototype, ZTD_JAVA, for Java applications. Finally, with evaluations on recreated vulnerabilities and realistic applications, we show that ZTD_JAVA can defend against prevalent vulnerability classes, introduces negligible cost, and is easy to configure and use.
"
Salma Begum Tamanna, Gias Uddin, Song Wang, Lan Xia, Longyu Zhang, "ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?"
Abstract: Hallucinations, the tendency to produce irrelevant or incorrect responses, are a prevalent concern in generative AI-based tools like ChatGPT. Although hallucinations in ChatGPT have been studied for textual responses, it is unknown how ChatGPT hallucinates for technical texts that contain both textual and technical terms. We surveyed 47 software engineers and produced a benchmark of 412 Q&A pairs from the bug reports of two OSS projects. We find that a RAG-based ChatGPT (i.e., ChatGPT tuned with the benchmark issue reports) is 36.4% correct when producing answers to the questions, due to two reasons: 1) limitations in understanding complex technical content in code snippets like stack traces, and 2) limitations in integrating contexts denoted in the technical terms and texts. We present CHIME (ChatGPT Inaccuracy Mitigation Engine), whose underlying principle is that if we can preprocess the technical reports better and guide the query validation process in ChatGPT, we can address the observed limitations. CHIME uses a context-free grammar (CFG) to parse stack traces in technical reports. CHIME then verifies and fixes ChatGPT responses by applying metamorphic testing and query transformation. On our benchmark, CHIME shows a 30.3% improvement in correctness over ChatGPT responses. In a user study, we find that the improved responses with CHIME are considered more useful than those generated by ChatGPT without CHIME.
"
Xin Yin, Chao Ni, Xiaodan Xu, Xiaohu Yang, "What You See Is What You Get: Attention-based Self-guided Automatic Unit Test Generation"
Abstract: Software defects heavily affect software's functionalities and may cause huge losses. Recently, many AI-based approaches have been proposed to detect defects, which can be divided into two categories: software defect prediction and automatic unit test generation. While these approaches have made great progress in software defect detection, they still have several limitations in practical application, including the low confidence of prediction models and the inefficiency of unit testing models. To address these limitations, we propose a WYSIWYG (i.e., What You See Is What You Get) approach: Attention-based Self-guided Automatic Unit Test Generation (AUGER), which contains two stages: defect detection and error triggering. In the former stage, AUGER first detects the proneness of defects. Then, in the latter stage, it guides the generation of unit tests to trigger such errors with the help of critical information obtained in the former stage. To evaluate the effectiveness of AUGER, we conduct a large-scale experiment by comparing with state-of-the-art (SOTA) approaches on widely used datasets (i.e., Bears, Bugs.jar, and Defects4J). AUGER makes great improvements of 4.7% to 35.3% and 17.7% to 40.4% in terms of F1-score and Precision in defect detection, respectively, and can trigger 23 to 84 more errors than SOTAs in unit test generation. Besides, we also conduct a further study to verify its generalization in practical usage by collecting a new dataset from real-world projects.
"
Yizhou Chen, Zeyu Sun, Guoqing Wang, Dan Hao, "Gpass: a Goal-adaptive Neural Theorem Prover based on Coq for Automated Formal Verification"
Abstract: Formal verification is a crucial means to assure software quality. Regrettably, the manual composition of verification scripts proves to be both laborious and time-consuming. In response, researchers have put forth automated theorem prover approaches; however, these approaches still grapple with several limitations. These limitations encompass insufficient handling of lengthy proof steps, difficulty in aligning the various components of a Coq program with the requirements and constraints of the proof goal, and inefficiencies. To surmount these limitations, we present Gpass, a goal-adaptive neural theorem prover based on deep learning technology. Firstly, we design a unique sequence encoder for Gpass that completely scans previous proof tactics through multiple sliding windows and provides information related to the current proof step. Secondly, Gpass incorporates a goal-adaptive feature integration module to align the reasoning process with the requirements of the proof goal. Finally, we devise a parameter selection method based on loss values and loss slopes to procure parameter sets with diverse distributions, thereby facilitating the exploration of various proof tactics. Experimental results demonstrate that Gpass attains better performance on the extensive CoqGym benchmark and proves 11.03%-96.37% more theorems than the prior work most closely related to ours. We find that the orthogonality between Gpass and CoqHammer proves their complementary capabilities, and together they prove a total of 3,774 theorems, which is state-of-the-art performance. In addition, we propose an efficiency optimisation approach that allows Gpass to achieve performance beyond Diva at one-sixth of the parameter sets.
"
Yanchen Lu, Hongyu Lin, Zehua He, Haitao Xu, Zhao Li, Shuai Hao, Liu Wang, Haoyu Wang, Kui Ren, "TacDroid: Detection of Illicit Apps through Hybrid Analysis of UI-based Transition Graphs"
Abstract: Illicit apps have emerged as a thriving underground industry, driven by their substantial profitability. These apps either offer users restricted services (e.g., porn and gambling) or engage in fraudulent activities like scams. Despite the widespread presence of illicit apps, scant attention has been directed towards this issue, with several existing detection methods predominantly relying on static analysis alone. However, given the burgeoning trend wherein an increasing number of mobile apps achieve their core functionality through dynamic resource loading, depending solely on static analysis proves inadequate. To address this challenge, in this paper, we introduce TacDroid, a novel approach that integrates dynamic analysis for dynamic content retrieval with static analysis to mitigate the limitations inherent in both methods, i.e., the low coverage of dynamic analysis and the low accuracy of static analysis. Specifically, TacDroid conducts both dynamic and static analyses on an Android app to construct dynamic and static User Interface Transition Graphs (UTGs), respectively. These two UTGs are then correlated to create an intermediate UTG. Subsequently, TacDroid embeds graph structure and utilizes an enhanced Graph Autoencoder (GAE) model to predict transitions between nodes. Through link prediction, TacDroid effectively eliminates false positive transition edges stemming from misjudgments in static analysis and supplements false negative transition edges overlooked in the intermediate UTG, thereby generating a comprehensive and accurate UTG. Finally, TacDroid determines the legitimacy of an app and identifies its category based on the app's UTG. Our evaluation results highlight the outstanding accuracy of TacDroid in detecting illicit apps. It significantly surpasses the state-of-the-art work, achieving an F1-score of 96.73%. This work represents a notable advancement in the identification and categorization of illicit apps.
"
Ravishka Rathnasuriya, Zijie Zhao, Wei Yang, "CodeImprove: Program Adaptation for Deep Code Models"
Abstract: Leveraging deep learning (DL)-based code analysis tools to solve software engineering tasks is becoming increasingly popular. Code models often suffer performance degradation due to various reasons (e.g., code data shifts). Retraining is often required to address these issues, but frequent model updates are costly in labeling and deployment. In this paper, we explore an alternative solution: adapting the program inputs to the code models. This can be achieved in two steps: 1) input validation, which focuses on identifying whether an input is an out-of-scope input program that is beyond a model's handling capability, and 2) input adaptation, which adapts out-of-scope inputs to become in-scope inputs. Validating program input is challenging, as current techniques focus on continuous inputs such as image data and fail with discrete inputs like code data, which have unique characteristics and are processed differently by deep learning models. Adapting out-of-scope programs is also challenging due to their vast search spaces. Therefore, in this paper, we propose CodeImprove, which distinguishes out-of-scope from normal inputs and converts such out-of-scope inputs back to in-scope inputs through program transformation. In particular, we propose a validity score metric to identify out-of-scope inputs and leverage genetic algorithms to apply semantic-preserving program transformations to convert out-of-scope inputs to in-scope inputs. Our experimental results show CodeImprove can improve accuracy by up to 8.78%, with relative improvements of up to 51.28%, across three code models on two SE tasks. Additionally, our input validation is promising in detecting out-of-scope inputs (AUC score of 0.924).
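A minimal sketch of the adaptation loop, assuming a hypothetical `validity_score` hook and a naive rename-based transformation (a real tool would transform the AST with proper scoping), illustrates the genetic search over semantics-preserving variants:

```python
import keyword
import random
import re

def validity_score(program: str) -> float:
    """Hypothetical hook: the code model's confidence that `program` is in-scope."""
    raise NotImplementedError

def rename_one_identifier(program: str, rng: random.Random) -> str:
    """Naively rename one identifier; a stand-in for semantics-preserving transforms."""
    names = sorted(set(re.findall(r"\b[a-z_][a-z0-9_]*\b", program)) - set(keyword.kwlist))
    if not names:
        return program
    old = rng.choice(names)
    return re.sub(rf"\b{re.escape(old)}\b", old + "_r", program)

def adapt(program: str, generations: int = 20, population_size: int = 8, seed: int = 0) -> str:
    """Genetic search for a transformed variant that maximizes the validity score."""
    rng = random.Random(seed)
    population = [program]
    for _ in range(generations):
        population += [rename_one_identifier(rng.choice(population), rng)
                       for _ in range(population_size)]
        population.sort(key=validity_score, reverse=True)  # keep the fittest variants
        population = population[:population_size]
    return population[0]
```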
Yanfu Yan, Viet Duong, Huajie Shao, Denys Poshyvanyk, "Towards More Trustworthy Deep Code Models by Enabling Out-of-Distribution Detection"
Abstract: Numerous machine learning (ML) models have been developed, including those for software engineering (SE) tasks, under the assumption that training and testing data come from the same distribution. However, train and test distributions often differ, as training datasets rarely encompass the entire distribution, while test distribution tends to shift over time. Hence, when confronted with out-of-distribution (OOD) instances that differ from the training data, a reliable and trustworthy SE ML model must be capable of detecting them to either abstain from making predictions, or potentially forward these OODs to appropriate models handling other categories or tasks. In this paper, we develop two types of SE-specific OOD detection models, unsupervised and weakly-supervised OOD detection for code. The unsupervised OOD detection approach is trained solely on in-distribution samples while the weakly-supervised approach utilizes a tiny number of OOD samples to further enhance the detection performance in various OOD scenarios. Extensive experimental results demonstrate that our proposed methods significantly outperform the baselines in detecting OOD samples from four different scenarios simultaneously and also positively impact a main code understanding task.
"
Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric, "exLong: Generating Exceptional Behavior Tests with Large Language Models"
Abstract: Many popular programming languages, including C#, Java, and Python, support exceptions. Exceptions are thrown during program execution if an unwanted event happens, e.g., a method is invoked with an illegal argument value. Software developers write exceptional behavior tests (EBTs) to check that their code detects unwanted events and throws appropriate exceptions. Prior research studies have shown the importance of EBTs, but those studies also highlighted that developers put most of their effort into "happy paths", e.g., paths without unwanted events. To help developers fill the gap, we present the first framework, dubbed exLong, that automatically generates EBTs. exLong is a large language model instruction-tuned from CodeLlama and embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. We compare exLong with the state-of-the-art models for test generation (CAT-LM) and one of the strongest foundation models (GPT-3.5), as well as with analysis-based tools for test generation (Randoop and EvoSuite). Our results show that exLong outperforms existing models and tools. Furthermore, we contributed several pull requests to open-source projects, and 23 EBTs generated by exLong have already been accepted.
Jake Zappin, Trevor Stalnaker, Oscar Chaparro, Denys Poshyvanyk, "When Quantum Meets Classical: Characterizing Hybrid Quantum-Classical Issues Discussed in Developer Forums"
Abstract: Recent advances in quantum computing have sparked excitement that this new computing paradigm could solve previously intractable problems. However, due to the faulty nature of current quantum hardware and quantum-intrinsic noise, the full potential of quantum computing is still years away. Hybrid quantum-classical computing has emerged as a possible compromise that achieves the best of both worlds. In this paper, we look at hybrid quantum-classical computing from a software engineering perspective and present the first empirical study focused on characterizing and evaluating recurrent issues faced by developers of hybrid quantum-classical applications. The study comprised a thorough analysis of 531 real-world issues faced by developers -- including software faults, hardware failures, quantum library errors, and developer mistakes -- documented in discussion threads from forums dedicated to quantum computing. By qualitatively analyzing such forum threads, we derive a comprehensive taxonomy of recurring issues in hybrid quantum-classical applications that can be used by both application and platform developers to improve the reliability of hybrid applications. The study considered how these recurring issues manifest and their causes, determining that hybrid applications are crash-dominant (74% of studied issues) and that errors were predominantly introduced by application developers (70% of issues). We conclude by identifying recurring obstacles for developers of hybrid applications and actionable recommendations to overcome them.
Xiang Cheng, Fan Sang, Yizhuo Zhai, Xiaokuan Zhang, Taesoo Kim, "RUG: Turbo LLM for Rust Unit Test Generation"
Abstract: Unit testing improves software quality by evaluating isolated sections of the program. This approach alleviates the need for comprehensive program-wide testing and confines the potential error scope within the software. However, unit test development is time-consuming, requiring developers to create appropriate test contexts and determine input values to cover different code regions. This problem is particularly pronounced in Rust due to its intricate type system, making traditional unit test generation tools ineffective in Rust projects. Recently, LLMs have demonstrated their proficiency in understanding programming languages and completing software engineering tasks. However, merely prompting an LLM with a basic prompt like "generate unit test for the following source code" often results in code with compilation errors. In addition, LLM-generated unit tests often have limited test coverage. To bridge this gap and harness the capabilities of LLMs, we design and implement RUG, an end-to-end solution to automatically generate unit tests for Rust projects. To help LLM-generated tests pass Rust's strict compilation checks, RUG uses a semantic-aware, bottom-up approach to divide the context construction problem into dependent sub-problems. It solves these sub-problems sequentially using an LLM and merges them into a complete context. To increase test coverage, RUG integrates coverage-guided fuzzing with the LLM to prepare fuzzing harnesses. Applying RUG to 17 real-world Rust programs (averaging 24,937 LoC), we show that RUG can achieve high code coverage, up to 71.37%, closely comparable to human effort (73.18%). We submitted 113 unit tests generated by RUG covering new code: 53 of them have been accepted, 17 were rejected, and 43 are pending review.
"
Smit Soneshbhai Patel, Aashish Yadavally, Hridya Dhulipala, Tien N. Nguyen, "Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets"
Abstract: Large Language Models (LLMs) have been excellent at generating and reasoning about source code and textual descriptions. They can recognize patterns, syntax, and semantics in code, making them effective in several software engineering tasks. However, they exhibit weaknesses in reasoning about program execution. They primarily operate on static code representations, failing to capture the dynamic behavior and state changes that occur during program execution. In this paper, we advance the capabilities of LLMs in reasoning about program execution. We propose ORCA, a novel approach that instructs an LLM to autonomously formulate a plan to navigate through a control flow graph (CFG) for predictive execution of (in)complete code snippets. It acts as a predictive interpreter to "execute" the code. As a downstream task, we use ORCA to statically identify runtime errors in online code snippets. Early detection of runtime errors and defects in these snippets is crucial to prevent costly fixes later in the development cycle after they are adapted into a codebase. In our novel technique, we guide the LLM to pause at branching points, focusing on the state of the symbol tables for variables' values, thus minimizing error propagation in the LLM's computation. We also instruct the LLM not to stop at each step in its execution plan, resulting in the use of only one prompt to the LLM and thus substantial cost savings. Our empirical evaluation showed that ORCA is effective and improves over the state-of-the-art approaches in predicting execution traces and in runtime error detection.
Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, Lei Bu, "SpecGen: Automated Generation of Formal Program Specifications via Large Language Models"
Abstract: In the software development process, formal program specifications play a crucial role in various stages, including requirement analysis, software testing, and verification. However, manually crafting formal program specifications is rather difficult, making the job time-consuming and labor-intensive. Moreover, it is even more challenging to write specifications that correctly and comprehensively describe the semantics of complex programs. To reduce the burden on software developers, automated specification generation methods have emerged. However, existing methods usually rely on predefined templates or grammars, making them struggle to accurately describe the behavior and functionality of complex real-world programs. To tackle this challenge, we introduce SpecGen, a novel technique for formal program specification generation based on Large Language Models (LLMs). Our key insight is to overcome the limitations of existing methods by leveraging the code comprehension capability of LLMs. The process of SpecGen consists of two phases. The first phase employs a conversational approach that guides the LLM to generate appropriate specifications for a given program, aiming to utilize the LLM's ability to generate high-quality specifications. The second phase, designed for cases where the LLM fails to generate correct specifications, applies four mutation operators to the model-generated specifications and selects verifiable specifications from the mutated ones through a novel heuristic selection strategy that assigns different weights to variants in an efficient manner. We evaluate SpecGen on two datasets, including the SV-COMP Java category benchmark and a manually constructed dataset containing 120 programs. Experimental results demonstrate that SpecGen succeeds in generating verifiable specifications for 279 out of 385 programs, outperforming existing LLM-based approaches and conventional specification generation tools like Houdini and Daikon. Further investigation of the quality of the generated specifications indicates that SpecGen can comprehensively articulate the behaviors of the input program.
"
Yifeng Di, Tianyi Zhang, "Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding"
Abstract: Large Language Models (LLMs) have demonstrated unprecedented capability in code generation. However, LLM-generated code is still plagued with a wide range of functional errors, especially for complex programming tasks that LLMs have not seen before. Recent studies have shown that developers often struggle with inspecting and fixing incorrect code generated by LLMs, diminishing their productivity and trust in LLM-based code generation. Inspired by the mutual grounding theory in communication, we propose an interactive approach that leverages code comments as a medium for developers and LLMs to establish a shared understanding. Our approach facilitates iterative grounding by interleaving code generation, inline comment generation, and contextualized user feedback through editable comments to align generated code with developer intent. We evaluated our approach on two popular benchmarks and demonstrated that our approach significantly improved multiple state-of-the-art LLMs, e.g., 16.9% Pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we conducted a user study with 12 participants in comparison to two baselines: (1) interacting with GitHub Copilot, and (2) interacting with a multi-step code generation paradigm called Multi-Turn Program Synthesis. Participants completed the given programming tasks 16.7% faster and with a 10.5% improvement in task success rate when using our approach. Both results show that interactively refining code comments enables the collaborative establishment of mutual grounding, leading to more accurate code generation and higher developer confidence.
Zixuan Tan, Jiayuan Zhou, Xing Hu, Shengyi Pan, Kui Liu, Xin Xia, "Similar but Patched Code Considered Harmful -- The Impact of Similar but Patched Code on Recurring Vulnerability Detection and How to Remove Them"
Abstract: Identifying recurring vulnerabilities is crucial for ensuring software security. Clone-based techniques, while widely used, often generate many false alarms due to the existence of similar but patched (SBP) code, which is similar to vulnerable code but is not vulnerable because it has been patched. Although SBP code poses a great challenge to the effectiveness of existing approaches, it has not yet been well explored. In this paper, we propose a programming-language-agnostic framework, Fixed Vulnerability Filter (FVF), to identify and filter such SBP instances in vulnerability detection. Different from existing studies that leverage function signatures, our approach analyzes code change histories to precisely pinpoint SBPs and consequently reduce false alarms. Evaluation under practical scenarios confirms the effectiveness and precision of our approach. Remarkably, FVF identifies and filters 65.1% of false alarms from four vulnerability detection tools (i.e., ReDeBug, VUDDY, MVP, and an elementary hash-based approach) without yielding false positives. We further apply FVF to 1,081 real-world software projects and construct a real-world SBP dataset containing 6,827 SBP functions. Due to the nature of SBP code, the dataset can act as a strict benchmark to test the sensitivity of vulnerability detection approaches in distinguishing real vulnerabilities from SBPs. Using this dataset, we demonstrate the ineffectiveness of four state-of-the-art deep learning-based vulnerability detection approaches. Our dataset can help developers make a more realistic evaluation of vulnerability detection approaches and also paves the way for further exploration of real-world SBP scenarios.
"
Wenchao Gu, Ensheng Shi, Yanlin Wang, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Michael Lyu, "SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing"
Abstract: Code retrieval, which retrieves code snippets based on users' natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval paradigm from lexical-based matching towards leveraging deep learning models to encode source code and queries into vector representations, facilitating code retrieval according to vector similarity. Despite the effectiveness of these models, managing large-scale code bases presents significant challenges. Previous research proposes deep hashing-based methods, which generate hash codes for queries and code snippets and use Hamming distance for rapid recall of code candidates. However, this approach's reliance on linear scanning of the entire code base limits its scalability. To further improve the efficiency of large-scale code retrieval, we propose a novel approach, SECRET (Scalable and Efficient Code Retrieval via SegmEnTed deep hashing). SECRET converts long hash codes calculated by existing deep hashing approaches into several short hash code segments through an iterative training strategy. After training, SECRET recalls code candidates by looking up the hash tables for each segment, so the time complexity of recall can be greatly reduced. Extensive experimental results demonstrate that SECRET can drastically reduce retrieval time by at least 95% while achieving comparable or even higher performance than existing deep hashing approaches. Besides, SECRET also exhibits superior performance and efficiency compared to the classical hash-table-based approach known as LSH under the same number of hash tables.
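The recall step can be pictured with a toy index (illustrative only; SECRET's trained segmentation and ranking are more involved): split each long binary code into short segments, index each segment in its own hash table, recall any snippet whose segment matches the query exactly, then re-rank by full Hamming distance.

```python
from collections import defaultdict

class SegmentedHashIndex:
    """Toy index in the spirit of segmented hashing; codes are '0'/'1' strings."""

    def __init__(self, code_bits: int = 128, segments: int = 4):
        assert code_bits % segments == 0
        self.seg_len = code_bits // segments
        self.tables = [defaultdict(set) for _ in range(segments)]
        self.codes = {}

    def _segments(self, code: str):
        return [code[i * self.seg_len:(i + 1) * self.seg_len]
                for i in range(len(self.tables))]

    def add(self, snippet_id: str, code: str):
        self.codes[snippet_id] = code
        for table, seg in zip(self.tables, self._segments(code)):
            table[seg].add(snippet_id)

    def recall(self, query_code: str, top_k: int = 10):
        candidates = set()
        for table, seg in zip(self.tables, self._segments(query_code)):
            candidates |= table[seg]            # constant-time lookups, no linear scan
        # Re-rank recalled candidates by full Hamming distance to the query.
        dist = lambda a, b: sum(x != y for x, y in zip(a, b))
        return sorted(candidates, key=lambda i: dist(self.codes[i], query_code))[:top_k]
```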
Shanto Rahman, Bala Naren Chanumolu, Suzzana Rafi, August Shi, Wing Lam, "Ranking Relevant Tests for Order-Dependent Flaky Tests"
Abstract: One major challenge of regression testing is the presence of flaky tests, i.e., tests that may pass in one run but fail in another run for the same version of code. One prominent category of flaky tests is order-dependent (OD) flaky tests, which are tests that can pass or fail depending on the test-order in which the tests are run. To help developers debug and fix OD tests, prior work has attempted to automatically find OD-relevant tests, i.e., tests that will determine whether an OD test passes or fails depending on whether the OD-relevant tests are run before or after the OD test in the test-order. Prior work finds OD-relevant tests by running tests before the OD test, without regard to the tests' likelihood of being OD-relevant tests. We propose RankF to rank tests in order of their likelihood of being OD-relevant tests, so a developer can find the first OD-relevant test more quickly, without running tests as often. We propose two ranking approaches, each requiring different information. Our first approach, RankFL, relies on training a large language model that analyzes test code. Our second approach, RankFO, relies on the analysis of prior test-order execution information. We evaluate our approaches on 155 OD tests from 34 modules across 24 open-source projects. We compare RankF against prior-work baselines in terms of the time to find the first OD-relevant test for an OD test. RankF on average finds the first OD-relevant test faster than the best of the baselines, providing speedups of 1.9X, 1.7X, and 2.6X for the three different types of OD-relevant tests we evaluate.
"
Hang Du, Vijay Krishna Palepu, James A. Jones, "Leveraging Propagated Infection to Crossfire Mutants"
Abstract: Mutation testing was proposed to identify weaknesses in test suites by repeatedly generating artificially faulty versions of the software (i.e., *mutants*) and determining whether the test suite is sufficient to detect them (i.e., *kill* them). When the tests are insufficient, each surviving mutant provides an opportunity to improve the test suite. We conducted a study and found that many such surviving mutants (up to 84% for the subjects of our study) are detectable by simply augmenting existing tests with additional assertions, or *assertion amplification*. Moreover, we find that many of these mutants are detectable by multiple existing tests, giving developers options for how to detect them. To help with these challenges, we created a technique that performs memory-state analysis to identify candidate assertions that developers can use to detect the surviving mutants. Additionally, we build upon prior research that identifies "crossfiring" opportunities -- tests that coincidentally kill multiple mutants. To this end, we developed a theoretical model that describes the varying granularities at which crossfiring can occur in the existing test suite, which provide opportunities and options for how to kill surviving mutants. We operationalize this model in an accompanying technique that optimizes the assertion amplification of the existing tests to crossfire multiple mutants with fewer added assertions, optionally concentrated within fewer tests. Our experiments show that we can kill *all* surviving mutants that are detectable with existing test data using only 1.1% of the identified assertion candidates, while increasing by a factor of 6x, on average, the number of mutants killed by amplified tests over tests that do not crossfire.
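The crossfiring optimization can be approximated as a set-cover problem; a greedy sketch (illustrative, not the paper's exact technique, and with hypothetical assertion strings as data) picks, at each step, the candidate assertion that kills the most not-yet-killed surviving mutants.

```python
def select_crossfiring_assertions(kills: dict[str, set[str]]) -> list[str]:
    """Greedy set cover: `kills` maps a candidate assertion to the surviving
    mutants it would kill; pick assertions that kill the most uncovered mutants."""
    remaining = set().union(*kills.values())
    chosen = []
    while remaining:
        best = max(kills, key=lambda a: len(kills[a] & remaining))
        gained = kills[best] & remaining
        if not gained:
            break                      # remaining mutants are not killable by any candidate
        chosen.append(best)
        remaining -= gained
    return chosen

# Example: two assertions suffice to crossfire four mutants.
# select_crossfiring_assertions({
#     "assertEquals(size, 3)": {"m1", "m2"},
#     "assertTrue(cache.isEmpty())": {"m2", "m3", "m4"},
#     "assertNotNull(root)": {"m4"},
# })  # -> ['assertTrue(cache.isEmpty())', 'assertEquals(size, 3)']
```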