Call for Papers
The International Conference on Software Engineering (ICSE) is the premier forum for presenting and discussing the most recent and significant technical research contributions in the field of Software Engineering. In the research track, we invite high-quality submissions of technical research papers describing original and unpublished results of software engineering research.
ICSE 2025 will follow a dual deadline structure introduced in 2024. In other words, submissions will occur in two cycles. Please refer to the section on Dual Submission Cycles in the following for the information.
NEW THIS YEAR #1: Due to the rapid growth of the area of “AI and Software Engineering”, it is now split into two: “AI for Software Engineering” and “Software Engineering for AI”. A new area “Architecture and Design” is introduced. The topics listed under each area have also been revised. Please see the “Research Areas” section below.
NEW THIS YEAR #2: We add back opportunities for “Author Response” in addition to “Revision” so that some potential misunderstandings can be clarified in the review process for papers that otherwise would be rejected. Also, for a paper receiving a “Revision” outcome, authors will be given an additional page of text in the revised paper to accommodate the required changes specified in the reviews.
NEW THIS YEAR #3: Submissions must follow the latest “IEEE Submission and Peer Review Policy” and “ACM Policy on Authorship” (with associated FAQ, which includes a policy regarding the use of generative AI tools and technologies, such as ChatGPT. After checking with the ICSE Steering Committee, we are piloting a human-in-the-loop automated process to identify AI-generated papers. A Review Process Co-Chair has volunteered to design and run this pilot process on submitted papers. To preserve confidentiality, when scanning submitted papers, the scripts will not make use of any third-party services.
NEW THIS YEAR #4: IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology and ICSE 2025, have received approval from the ICSE Steering Committee to launch the Sustainable Community Review Effort (SCRE) program, aimed at reducing community effort in reviewing journal extensions of conference papers and allowing authors to get faster and more consistent feedback. More information is available at: http://tinyurl.com/icse25-scre
NEW THIS YEAR #5: ICSE Steering Committee has recently approved a proposal for streamlining and enhancing the paper bidding and assignment process, aimed at reducing the workload of PC members and resulting in better assignments of papers. Two Review Process Co-Chairs have volunteered to help manage the updated bidding and assignment process. More information is available at: http://tinyurl.com/icse25-streamlining
NEW THIS YEAR #6: ICSE Steering Committee has recently approved a proposal for Shadow PC. Shadow PC is a mentoring program to train early-career researchers (PhD students, postdocs, new faculty members, and industry practitioners) in the review process of the technical track. For Cycle 2, for the first time, authors of ICSE submissions can opt-in for their papers to be considered for review in the Shadow PC track. Shadow reviews for papers that are reviewed by the Shadow PC will be sent out to authors after the end of the actual review process; shadow reviews will not affect the official decision made by the regular PC. More detailed information about the program is available at: http://tinyurl.com/icse25-shadowpc
Research Areas
ICSE welcomes submissions addressing topics across the full spectrum of Software Engineering, being inclusive of quantitative, qualitative, and mixed-methods research. Topics of interest include the following and are grouped into the following nine research areas. Please note that these topics are by no means exhaustive.
Each submission will need to indicate one of these nine areas as the chosen area. Optionally, the authors can consider adding an additional area. A paper may be moved from the chosen area(s) to another focus area at the discretion of the program chairs. Program chairs will ultimately assign a paper to an area chair, considering the authors’ selection, the paper’s content, and other factors such as (if applicable) possible conflicts of interest.
AI for Software Engineering
- AI-enabled recommender systems for automated SE (e.g., code generation, program repair, AIOps, software composition analysis, etc.)
- Human-centered AI for SE (e.g., how software engineers can synergistically work with AI agents)
- Trustworthy AI for SE (e.g., how to provide guarantees, characterize limits, and prevent misuse of AI for SE)
- Sustainable AI for SE (e.g., how to reduce energy footprint for greener AI for SE)
- Collaborative AI for SE (e.g., how AI agents collaborate for automating SE)
- Automating SE tasks with LLM and other foundation models (e.g., large vision model)
- Efficacy measurement beyond traditional metrics (e.g., accuracy, BLEU, etc.)
- Prompt engineering for SE (e.g., novel prompt design)
- AI-assisted software design and model driven engineering (e.g., specification mining, program synthesis, software architectural design)
Analytics
- Mining software repositories, including version control systems, issue tracking systems, software ecosystems, configurations, app stores, communication platforms, and novel software engineering data sources, to generate insights through various research methods
- Software visualization
- Data-driven user experience understanding and improvement
- Data driven decision making in software engineering
- Software metrics (and measurements)
Architecture and Design
- Architecture and design measurement and assessment
- Software design methodologies, principles, and strategies
- Theory building for/of software design
- Architecture quality attributes, such as security, privacy, performance, reliability
- Modularity and reusability
- Design and architecture modeling and analysis
- Architecture recovery
- Dependency and complexity analysis
- Distributed architectures, such as microservice, SOA, cloud computing
- Patterns and anti-patterns
- Technical debt in design and architecture
- Architecture refactoring
- Adaptive architectures
- Architecture knowledge management
Dependability and Security
- Formal methods and model checking (excluding solutions focusing solely on hardware)
- Reliability, availability, and safety
- Resilience and antifragility
- Confidentiality, integrity, privacy, and fairness
- Performance
- Design for dependability and security
- Vulnerability detection to enhance software security
- Dependability and security for embedded and cyber-physical systems
Evolution
- Evolution and maintenance
- API design and evolution
- Release engineering and DevOps
- Software reuse
- Refactoring and program differencing
- Program comprehension
- Reverse engineering
- Environments and software development tools
- Traceability to understand evolution
Human and Social Aspects
- Focusing on individuals (from program comprehension, workplace stress to job satisfaction and career progression)
- Focusing on teams (e.g., collocated, distributed, global, virtual; communication and collaboration within a team), communities (e.g., open source, communities of practice) and companies (organization, economics)
- Focusing on society (e.g., sustainability; diversity and inclusion)
- Focusing on programming languages, environments, and tools supporting individuals, teams, communities, and companies.
- Focusing on software development processes
Requirements and Modeling
- Requirements engineering (incl. non-functional requirements)
- Theoretical requirement foundations
- Requirements and architecture
- Feedback, user and requirements management
- Requirements traceability and dependencies
- Modeling and model-driven engineering
- Variability and product lines
- Systems and software traceability
- Modeling languages, techniques, and tools
- Empirical studies on the application of model-based engineering
- Model-based monitoring and analysis
Software Engineering for AI
- SE for AI models
- SE for systems with AI components
- SE for AI code, libraries, and datasets
- Engineering autonomic systems and self-healing systems
- Automated repair of AI models
- Testing and verification of AI-based systems
- Validation and user-based evaluation of AI-based systems
- Requirements engineering for AI-based systems
Testing and Analysis
- Software testing
- Automated test generation techniques such as fuzzing, search-based approaches, and symbolic execution
- Testing and analysis of non-functional properties
- GUI testing
- Mobile application testing
- Program analysis
- Program synthesis (e.g., constraint based techniques)
- Program repair
- Debugging and fault localization
- Runtime analysis and/or error recovery
Scope
Since the authors will choose an area for their submission, the scope of each area becomes important. Some submissions may relate to multiple areas. In such cases, the authors should choose the area for which their paper brings the maximum new insights. Moreover, authors also have the choice of indicating an alternate area for each paper.
Similarly, for certain papers. authors may have a question whether it belongs to any area, or is simply out of scope. For such cases, we recommend the authors to judge whether their paper brings new insights for software engineering. As an example, a formal methods paper with a focus on hardware verification may be deemed out of scope for ICSE. In general, papers that only peripherally concern software engineering and do not give new insights from the software engineering perspective would be less relevant to ICSE. Our goal is, however, to be descriptive, rather than prescriptive, to enable authors to make their own decisions about relevance.
Dual Submission Cycles
Similar to ICSE 2024, we will have two submission cycles as follows:
First submission cycle
- (Mandatory) Abstract: Mar 15, 2024
- Submission: Mar 22, 2024
- Author response period (3 days): Jun 10-13, 2024
- Notification: Jul 5, 2024
- Revision due: Aug 2, 2024
- Camera-ready (of directly accepted papers): Aug 16, 2024
- Final decision (of revised papers): Nov 1, 2024
- Camera-ready (of accepted revised papers): Dec 13, 2024
Second submission cycle
- (Mandatory) Abstract: Jul 26, 2024
- Submission: Aug 2, 2024
- Author response period (3 days): Oct 7-10, 2024
- Notification: Nov 1, 2024
- Revision due: Nov 29, 2024
- Camera-ready (of directly accepted papers): Dec 13, 2024
- Final decision (of revised papers): Jan 22, 2025
- Camera-ready (of accepted revised papers): Feb 12, 2025
All dates are 23:59:59 AoE (UTC-12h).
Review Criteria
Each paper submitted to the Research Track will be evaluated based on the following criteria:
i) Novelty: The novelty and innovativeness of contributed solutions, problem formulations, methodologies, theories, and/or evaluations, i.e., the extent to which the paper is sufficiently original with respect to the state-of-the-art.
ii) Rigor: The soundness, clarity, and depth of a technical or theoretical contribution, and the level of thoroughness and completeness of an evaluation.
iii) Relevance: The significance and/or potential impact of the research on the field of software engineering.
iv) Verifiability and Transparency: The extent to which the paper includes sufficient information to understand how an innovation works; to understand how data was obtained, analyzed, and interpreted; and how the paper supports independent verification or replication of the paper’s claimed contributions. Any artifacts attached to or linked from the paper will be checked by one reviewer.
v) Presentation: The clarity of the exposition in the paper.
Reviewers will carefully consider all of the above criteria during the review process, and authors should take great care in clearly addressing them all. The paper should clearly explain and justify the claimed contributions. Each paper will be handled by an area chair who will ensure reviewing consistency among papers submitted within that area.
The outcome of each paper will be one of the following Accept, Revision, Reject. We now elaborate on the Revision outcome in the following.
Revisions
Papers submitted can go through revisions in response to specific revision requests made by the reviewers. Authors of papers receiving a Revision decision are expected to submit the revised papers, as well as the revised papers with changes marked in a different color, such as using LaTeXdiff. The authors also need to submit an “Author Response” document capturing the authors’ response to each reviewer’s comment and how those comments were addressed in the revision. This is similar to the “Summary of Changes and Response” document that is typically submitted by authors for a journal paper’s major revision. Authors may use the revision opportunity to revise and improve the paper, but should not use this to submit a substantially different paper. The reviewers will check the revised paper against the original paper and the suggested changes. Revised papers will be examined by the same set of reviewers. An unsatisfactory revised paper will be rejected. Authors are given 4 weeks to submit the revised papers. This is 1 week less than prior year as we reallocate the time to (i) add authors’ rebuttal, and (ii) provide more time for PC members to complete their reviews (to reduce reviewing fatigue considering the high workload to review increasing number of submissions). Authors are given an additional page of text in a revised paper to accommodate the required changes specified in the reviews.
Re-submissions of Rejected Papers
Authors of papers which receive a REJECT decision in the first submission cycle are strongly discouraged from re-submitting it to the second submission cycle. However, in exceptional cases where the reviewers evidently misunderstood their paper, upon approval of PC Chairs, authors can re-submit their paper to the second submission cycle with a “Clarifications and Summary of Improvements” document stating how they have changed the paper. They should also include the past reviews as part of this document, for completeness. These papers will be treated as new submissions, which may or may not get the same set of reviewers at the discretion of the PC chairs. Authors who try to bypass this guideline (e.g., by changing the paper title without significantly changing paper content, or by making small changes to the paper content) will have their papers desk-rejected by the PC chairs without further consideration.
Submission Process
Submissions must conform to the IEEE conference proceedings template, specified in the IEEE Conference Proceedings Formatting Guidelines (title in 24pt font and full text in 10pt type, LaTeX users must use \documentclass[10pt,conference]{IEEEtran} without including the compsoc or compsocconf options). Note that IEEE format is being used this year, whereas last year it was ACM format, hence the appearance will differ from year to year.
- All submissions must not exceed 10 pages for the main text, inclusive of all figures, tables, appendices, etc. Two more pages containing only references are permitted. All submissions must be in PDF. Accepted papers will be allowed one extra page for the main text of the camera-ready version.
- Submissions must strictly conform to the IEEE conference proceedings formatting instructions specified above. Alterations of spacing, font size, and other changes that deviate from the instructions may result in desk rejection without further review.
- By submitting to the ICSE Technical Track, authors acknowledge that they are aware of and agree to be bound by the ACM Policy and Procedures on Plagiarism and the IEEE Plagiarism FAQ. In particular, papers submitted to ICSE 2025 must not have been published elsewhere and must not be under review or submitted for review elsewhere whilst under consideration for ICSE 2025. Contravention of this concurrent submission policy will be deemed a serious breach of scientific ethics, and appropriate action will be taken in all such cases. To check for double submission and plagiarism issues, the chairs reserve the right to (1) share the list of submissions with the PC Chairs of other conferences with overlapping review periods and (2) use external plagiarism detection software, under contract to the ACM or IEEE, to detect violations of these policies.
- If the research involves human participants/subjects, the authors must adhere to the ACM Publications Policy on Research Involving Human Participants and Subjects. Upon submitting, authors will declare their compliance with such a policy. Alleged violations of this policy or any ACM Publications Policy will be investigated by ACM and may result in a full retraction of your paper, in addition to other potential penalties, as per ACM Publications Policy.
- Please ensure that you and your co-authors obtain an ORCID ID, so you can complete the publishing process for your accepted paper. ACM and IEEE have been involved in ORCID and may collect ORCID IDs from all published authors. We are committed to improve author discoverability, ensure proper attribution and contribute to ongoing community efforts around name normalization; your ORCID ID will help in these efforts.
- The ICSE 2025 Research Track will employ a double-anonymous review process. Thus, no submission may reveal its authors’ identities. The authors must make every effort to honor the double-anonymous review process. In particular:
- Authors’ names must be omitted from the submission.
- All references to the author’s prior work should be in the third person.
- While authors have the right to upload preprints on ArXiV or similar sites, they must avoid specifying that the manuscript was submitted to ICSE 2025.
- All communication with the program committee must go through the program committee chairs. Do not contact individual program committee members regarding your submission.
-
Further advice, guidance, and explanation about the double-anonymous review process can be found on the Q&A page.
- By submitting to the ICSE Research Track, authors acknowledge that they conform to the authorship policy of the IEEE, submission policy of the IEEE, and the authorship policy of the ACM (and associated FAQ. This includes following these points related to the use of Generative AI:
- “Generative AI tools and technologies, such as ChatGPT, may not be listed as authors of an ACM published Work. The use of generative AI tools and technologies to create content is permitted but must be fully disclosed in the Work. For example, the authors could include the following statement in the Acknowledgements section of the Work: ChatGPT was utilized to generate sections of this Work, including text, tables, graphs, code, data, citations, etc.). If you are uncertain about the need to disclose the use of a particular tool, err on the side of caution, and include a disclosure in the acknowledgements section of the Work.” - ACM
- “The use of artificial intelligence (AI)–generated text in an article shall be disclosed in the acknowledgements section of any paper submitted to an IEEE Conference or Periodical. The sections of the paper that use AI-generated text shall have a citation to the AI system used to generate the text.” - IEEE
- “If you are using generative AI software tools to edit and improve the quality of your existing text in much the same way you would use a typing assistant like Grammarly to improve spelling, grammar, punctuation, clarity, engagement or to use a basic word processing system to correct spelling or grammar, it is not necessary to disclose such usage of these tools in your Work.” - ACM
Submissions to the Technical Track that meet the above requirements can be made via the Research Track submission site by the submission deadline. Any submission that does not comply with these requirements may be desk rejected without further review.
Submission site: https://icse2025.hotcrp.com/
We encourage the authors to upload their paper info early (and can submit the PDF later) to properly enter conflicts for double-anonymous reviewing. It is the sole responsibility of the authors to ensure that the formatting guidelines, double anonymous guidelines, and any other submission guidelines are met at the time of paper submission.
Open Science Policy
The research track of ICSE 2025 is governed by the ICSE 2025 Open Science policies. The guiding principle is that all research results should be accessible to the public and, if possible, empirical studies should be reproducible. In particular, we actively support the adoption of open artifacts and open source principles. We encourage all contributing authors to disclose (anonymized and curated) data/artifacts to increase reproducibility and replicability. Note that sharing research artifacts is not mandatory for submission or acceptance. However, sharing is expected to be the default, and non-sharing needs to be justified. We recognize that reproducibility or replicability is not a goal in qualitative research and that, similar to industrial studies, qualitative studies often face challenges in sharing research data. For guidelines on how to report qualitative research to ensure the assessment of the reliability and credibility of research results, see this curated Q&A page.
Upon submission to the research track, authors are asked
- to make their artifact available to the program committee (via upload of supplemental material or a link to an anonymous repository) – and provide instructions on how to access this data in the paper; or
- to include in the submission an explanation as to why this is not possible or desirable; and
- to indicate in the submission why they do not intend to make their data or study materials publicly available upon acceptance, if that is the case. The default understanding is that the data and/or other artifacts will be publicly available upon acceptance of a paper.
Withdrawing a Paper
Authors can withdraw their paper at any moment until the final decision has been made, through the paper submission system. Resubmitting the paper to another venue before the final decision has been made without withdrawing from ICSE 2025 first is considered a violation of the concurrent submission policy, and will lead to automatic rejection from ICSE 2025 as well as any other venue adhering to this policy. Such violations may also be reported to appropriate organizations e.g. ACM and IEEE.
Conference Attendance Expectation
If a submission is accepted, at least one author of the paper is required to register for ICSE 2025 and present the paper. We are assuming that the conference will be in-person, and if it is virtual or hybrid, virtual presentations may be possible. These matters will be discussed with the authors closer to the date of the conference.
Accepted papers: First Cycle
The following papers have been accepted so far in the ICSE 2025 Research Track first cycle. The papers are will be published by the IEEE and appear in the IEEE and ACM digital libraries, subject to an author submitting their camera-ready and copyright forms, and registering to attend the conference. (Authors are required to present the papers at the conference, otherwise they will be withdrawn).
Many additional papers will appear later from the first-cycle papers where major revisions were requested (if the revisions are approved), and from the second cycle.
Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Puppy: Finding Performance Degradation Bugs in DBMSs via Limited-Optimization Plan Construction"
Abstract: Database management systems (DBMSs) consistently strive for enhanced performance. For a given query, the optimizer of a DBMS aims to construct an optimal execution plan that incorporates multiple optimization operations. However, the resulting plan may sometimes perform worse than even if no optimizations were applied. This occurs because the interactions between optimizations are complex and some situations might be overlooked in the implementation. We refer to these issues as Performance Degradation Bugs (PDBs). PDBs can result in significant consequences from decreased system efficiency and prolonged query processing times to potential disruptions in critical business operations. In this paper, we present Puppy, an automated approach for detecting PDBs in DBMSs using limited-optimization plan construction. The key idea is to compare the performance with the plan generated with all optimization operations enabled, against the plan generated with only a subset of optimization operations in the same DBMS. If the response time of the plan with the limited optimization set is shorter than that of the fully optimized plan, it indicates a potential PDB. Specifically, Puppy first generates queries that incorporate multiple optimization sequences, guided by optimization operation sequence coverage. Secondly, Puppy analyzes the query plan and selectively disables specific optimizations to construct the limited optimization plan. We evaluate Puppy on five widely-used DBMSs, namely MySQL, Percona, TiDB, PolarDB, and PostgreSQL against the state-of-the-art DBMS performance testing tools APOLLO and AMOEBA. Puppy detected 26 and 25 more performance anomalies, covered 151,201 and 173,798 more branches than APOLLO and AMOEBA in 48 hours, respectively. More importantly, Puppy reports 62 PDBs, with 54 anomalies confirmed as previously unknown bugs.
Chun Li, Hui Li, Zhong Li, Minxue Pan, Xuandong Li, "Enhancing Fault Localization in Industrial Software Systems via Contrastive Learning"
Abstract: Engineers utilize logs as a primary resource for fault localization in large-scale software and system testing, a process that is notoriously time-consuming, costly, and labor-intensive. Despite considerable progress in automated fault localization approaches, their applicability remains limited in such settings, due to the unavailability of fine-grained features in logs essential for most existing fault localization methods. In response, we introduce FALCON, a novel log-based fault localization framework. FALCON organizes complex semantic log information into graphical representations and employs contrastive learning to capture the differences between passed and failed logs, enabling the identification of crucial fault-related features. It also incorporates a specifically designed transitive analysis-based adaptive graph augmentation to minimize the influence of fault-unrelated log information on contrastive learning. Through extensive evaluations against 34 spectrum-based and 4 learning-based fault localization methods, FALCON demonstrates superior performance by outperforming all the methods in comparison. In addition, FALCON demonstrated its practical value by successfully identifying 71 out of 90 faults with a file-level Top-1 accuracy rate during a one-month deployment within a global company’s testing system.
Wenqian Deng, Zhiyong Wu, Jie Liang, Jingzhou Fu, Mingzhe Wang, Yu Jiang, "Coni: Detecting Database Connector Bugs via State-Aware Test Case Generation"
Abstract: Database connectors are widely used in many applications to facilitate flexible and convenient database interactions. Potential vulnerabilities in database connectors can lead to various abnormal behaviors within applications, such as returning incorrect results or experiencing unexpected connection interruption. However, existing fuzzing works cannot be directly applied to testing database connectors as they mainly focus on SQL generation and use a small subset of database connector interfaces to execute SQLs. Due to a lack of domain knowledge, automated test case generation also struggles to generate complex test cases that explore connectors' deep logic. The main challenge in testing database connectors is to generate semantically correct test cases that can trigger a wide range of connector state transitions. To address that, we propose CONI, a framework designed for detecting logic bugs of database connectors with state-aware test case generation. First, we define the database connector state model by analyzing the corresponding specification. Building upon this model, CONI generates interface call sequences within test cases to encompass more connector state transitions. After that, CONI generates suitable parameter values based on the parameter information and contextual information collected during runtime. Then the test cases are executed on a target and a reference database connector. Inconsistent results indicate potential logic bugs. We evaluate CONI on 5 widely-used JDBC database connectors, namely MySQL Connector/J, MariaDB Connector/J, AWS JDBC Driver for MySQL, PGJDBC NG, and PostgreSQL JDBC Driver. In total, CONI successfully detected 44 previously unknown bugs, of which 34 have been confirmed.
Gong Chen, Xiaoyuan Xie, Daniel Tang, Qi Xin, Wenjie Liu, "HedgeCode: A Multi-Task Hedging Contrastive Learning Framework for Code Search"
Abstract: Code search is a vital activity in software engineering, focused on identifying and retrieving the correct code snippets based on a query provided in natural language. Approaches based on deep learning techniques have been increasingly adopted for this task, enhancing the initial representations of both code and its natural language descriptions. Despite this progress, there remains an unexplored gap in ensuring consistency between the representation spaces of code and its descriptions. Furthermore, existing methods have not fully leveraged the potential relevance between code snippets and their descriptions, presenting a challenge in discerning fine-grained semantic distinctions among similar code snippets. To address these challenges, we introduce a multi-task hedging contrastive Learning framework for Code Search, referred to as HedgeCode. HedgeCode is structured around two primary training phases. The first phase, known as the representation alignment stage, proposes a hedging contrastive learning approach. This method aims to detect subtle differences between code and natural language text, thereby aligning their representation spaces by identifying relevance. The subsequent phase involves multi-task joint learning, wherein the previously trained model serves as the encoder. This stage optimizes the model through a combination of supervised and self-supervised contrastive learning tasks. Our framework’s effectiveness is demonstrated through its performance on the CodeSearchNet benchmark, showcasing HedgeCode’s ability to address the mentioned limitations in code search tasks.
Jiashuo Zhang, Yiming Shen, Jiachi Chen, Jianzhong Su, Yanlin Wang, Ting Chen, Jianbo Gao, Zhong Chen, "Demystifying and Detecting Cryptographic Defects in Ethereum Smart Contracts"
Abstract: To enhance smart contracts with cryptographic capabilities, Ethereum has officially provided a set of system-level cryptographic APIs, such as ecrecover. These APIs have been utilized in over 10% of Ethereum transactions, motivating developers to implement various on-chain cryptographic tasks, such as digital signatures. However, since developers may not always be cryptographic experts, their ad-hoc and potentially defective implementations could compromise the theoretical guarantees of cryptography, leading to real-world security issues. To mitigate this threat, we conducted the first study aimed at demystifying and detecting cryptographic defects in smart contracts. Through the analysis of 2,406 real-world security reports, we defined nine types of cryptographic defects in smart contracts with detailed descriptions and practical detection patterns. Based on this categorization, we proposed CrySol, a fuzzing-based tool to automate the detection of cryptographic defects in smart contracts. It combines transaction replaying and dynamic taint analysis to extract fine-grained crypto-related semantics and employs crypto-specific strategies to guide the test case generation process. urthermore, we collected a large-scale dataset containing 25,745 real-world crypto-related smart contracts and evaluated CrySol's effectiveness on it. The result demonstrated that CrySol achieves an overall precision of 95.4% and a recall of 91.2%. Notably, CrySol revealed that 5,847 (22.7%) out of 25,745 contracts contain at least one cryptographic defect, highlighting the prevalence of these defects.
Chijin Zhou, Quan Zhang, Bingzhou Qian, Yu Jiang, "Janus: Detecting Rendering Bugs in Web Browsers via Visual Delta Consistency"
Abstract: Rendering lies at the heart of our modern web experience. However, the correctness of browser rendering is not always guaranteed, often leading to rendering bugs. Traditional differential testing, while successful in various domains, falls short when applied to rendering bug detection because an HTML file is likely yield different rendered outcomes across different browsers. This paper introduces Visual Delta Consistency, a test oracle to detect rendering bugs in web browsers, aiming to make rendered pages across browsers comparable. Our key insight is that any modifications made to an HTML file should uniformly influence rendering outcomes across browsers. Specifically, when presented with two HTML files that differ only by minor modifications, the reaction of all browsers should be consistent, i.e., either all browsers render them identically or all render them differently. Based on this insight, We implemented it as a practical fuzzer named Janus. It constructs pairs of slightly modified HTML files and observes the change statuses of the corresponding rendered pages across browsers for bug detection. We evaluated it on three widely-used browsers, i.e., Chrome, Safari, and Firefox. In total, Janus detected 34 rendering bugs, out of which 26 confirmed with 8 fixed by the developers.
Seongmin Lee, Shreyas Minocha, Marcel Böhme, "Accounting for Missing Events in Statistical Information Leakage Analysis"
Abstract: The leakage of secret information via a public channel is a critical privacy flaw in software systems. The more information is leaked per observation, the less time an attacker needs to learn the secret. Due to the size and complexity of the modern software, and because some empirical facts are not available to a formal analysis of the source code, researchers started investigating statistical methods using program executions as samples. However, current statistical methods require a high sample coverage. Ideally, the sample is large enough to contain every possible combination of secret $\times$ observable value to accurately reflect the joint distribution of $\langle$secret, observable$\rangle$. Otherwise, the information leakage is severely underestimated, which is problematic as it can lead to overconfidence in the security of an otherwise vulnerable program. In this paper, we introduce an improved estimator for information leakage and propose to use methods from applied statistics to improve our estimate of the joint distribution when sample coverage is low. The key idea is to reconstruct the joint distribution by casting our problem as a multinomial estimation problem in the absence of samples for all classes. We suggest two approaches and demonstrate the effectiveness of each approach on a set of benchmark subjects. We also propose novel refinement heuristics, which help to adjust the joint distribution and gain better estimation accuracy. Compared to existing statistical methods for information leakage estimation, our method can safely overestimate the mutual information and provide a more accurate estimate from a limited number of program executions.
Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Premkumar Devanbu, Toufique Ahmed, "Calibration and Correctness of Language Models for Code"
Abstract: Machine learning models are widely used but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. For example, outputs can be associated with a \emph{confidence measure}; if this confidence measure is strongly associated with \emph{likelihood of correctness}, then the model is said to be \emph{well-calibrated}. In this case, the confidence measure can serve as a basis for rational graduated decision making on how much review and care is needed. \emph{Calibration} has so far been studied in mostly non-generative (\emph{e.g.}, classification) settings, especially in Software Engineering. However, generated code can quite often be wrong: Given generated code developers must decide whether to directly use, use after varying intensity of careful review, or discard model-generated code; thus calibration is vital in generative settings. In this paper we make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that by and large generative code models are \textbf{\textit{\underline{not}}} well-calibrated out of the box. We then show how calibration can be improved, using standard methods such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate the applicability and generalizability of Platt scaling in Software Engineering, discuss settings where it has good potential for practical use, and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offers a framework for future research to further improve calibration methods for generative models in Software Engineering.
Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, Rishi Poddar, Aseem Rastogi, "RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code"
Abstract: The Rust programming language, with its safety guarantees, has established itself as a viable choice for low-level systems programming language over the traditional, unsafe alternatives like C/C++. These guarantees come from a strong ownership-based type system, as well as primitive support for features like closures, pattern matching, etc., that make the code more concise and amenable to reasoning. These unique Rust features also pose a steep learning curve for programmers. This paper presents a tool called RustAssistant that leverages the emergent capabilities of Large Language Models (LLMs) to automatically suggest fixes for Rust compilation errors. RustAssistant uses a careful combination of prompting techniques as well as iteration between an LLM and the Rust compiler to deliver high accuracy of fixes. RustAssistant is able to achieve an impressive peak accuracy of roughly 74% on real-world compilation errors in popular open-source Rust repositories. We also contribute a dataset of Rust compilation errors to enable further research.
Guoping rong, Yongda Yu, Song Liu, Xin Tan, Tianyi Zhang, Haifeng Shen, Jidong Hu, "Code Comment Inconsistency Detection and Rectification Using a Large Language Model"
Abstract: Comments are widely used in source code. If a comment is consistent with the code snippet it intends to annotate, it would aid code comprehension. Otherwise, Code Comment Inconsistency (CCI) is not only detrimental to the understanding of code, but more importantly, it would negatively impact the development, testing, and maintenance of software. To tackle this issue, existing research has been primarily focused on detecting inconsistencies with varied performance. It is evident that detection alone does not solve the problem; it merely paves the way for solving it. A complete solution requires detecting inconsistencies and, more importantly, rectifying them by amending comments. However, this type of work is scarce. In this paper, we contribute C4RLLaMA, a fine-tuned large language model based on the open-source CodeLLaMA. It not only has the ability to rectify inconsistencies by correcting relevant comment content but also outperforms state-of-the-art approaches in detecting inconsistencies. Experiments with various datasets confirm that C4RLLaMA consistently surpasses both Post Hoc and Just-in-time CCI detection approaches. More importantly, C4RLLaMA outperforms substantially the only known CCI rectification approach in terms of multiple performance metrics. To further examine C4RLLaMA's efficacy in rectifying inconsistencies, we conducted a manual evaluation, and the results showed that the percentage of correct comment updates by C4RLLaMA was 65.0\% and 55.9\% in Just-in-time and Post Hoc, respectively, implying C4RLLaMA's real potential in practical use.
Kai Huang, Jian Zhang, Xiangxin Meng, Yang Liu, "Template-Guided Program Repair in the Era of Large Language Models"
Abstract: Recent advancements in automated program repair (APR) have been significantly driven by the application of Large Language Models (LLMs). In particular, the integration of LLMs with traditional template-based repair methods has demonstrated effective outcomes. Despite this, the synergy between the strengths of traditional methods and LLMs remains underexploited. This oversight originates from the indiscriminate use of templates and their insufficient coverage. Also, using small-scale LLMs within the zero-shot learning context proves to be suboptimal. To alleviate the limitations, we propose NTR (Neural Template Repair), a two-stage repair framework including template selection and patch generation, both of which are under the fine-tuning paradigm. In the template selection phase, we formulate it as a multiclass classification problem and fine-tune million-level LLMs for better selecting possible templates. During the patch generation phase, we leverage the chosen templates as probable directions (e.g., `Mutate Conditional Expression') to guide the fine-tuning process of LLMs at the billion-level scale for precise patch creation. Moreover, we incorporate a unique template to signify the absence of a suitable template and employ a probability-based prioritization of templates, thereby optimizing patch generation. This framework not only effectively addresses template mismatch issues, but also enables the billion-level LLMs to explore the patch space more efficiently, despite the GPU memory constraints. We evaluate NTR with different foundational models on Defects4J V1.2 and HumanEval-Java, the framework consistently demonstrates significant effectiveness. When utilizing StarCoder as the foundational model for patch generation, NTR fixes 128 and 129 bugs in Defects4J and HumanEval, outperforming the best baseline APR tool by 14 and 59 bugs. With the larger CodeLlama model, the fixed bugs rise to 139 and 136, respectively, exceeding the baseline by 25 and 66 bugs. Notably, the performance stems not only from the foundational models but also benefits greatly from our NTR framework. Specifically, NTR's implementation with StarCoder and CodeLlama leads to 22 and 23 additional fixes, which is beyond what the models achieve on their own. This emphasizes the success of our new perspective on utilizing templates to unlock the bug-fixing potential of LLMs.
Syed Fatiul Huq, Mahan Tafreshipour, Kate Kalcevich, Sam Malek, "Automated Generation of Accessibility Test Reports from Recorded User Transcripts"
Abstract: Testing for accessibility is a significant step when developing software, as it ensures that all users, including those with disabilities, can effectively engage with web and mobile applications. While automated tools exist to detect accessibility issues in software, none are as comprehensive and effective as the process of user testing, where testers with various disabilities evaluate the application for accessibility and usability issues. However, user testing is not popular with software developers as it requires conducting lengthy interviews with users and later parsing through large recordings to derive the issues to fix. In this paper, we explore how large language models (LLMs) like GPT 4.0, which have shown promising results in context comprehension and semantic text generation, can mitigate this issue and streamline the user testing process. Our solution, called Reca11, takes in informal transcripts of test recordings and extracts the accessibility and usability issues mentioned by the tester. Our systematic prompt engineering determines the optimal configuration of input, instruction, context and demonstrations for best results. We evaluate Reca11's effectiveness on 36 user testing sessions across three applications. Based on the findings, we investigate the strengths and weaknesses of using LLMs in this space.
Xinyu Lian, Yinfang Chen, Runxiang Cheng, Jie Huang, Parth Thakkar, Minjia Zhang, Tianyin Xu, "Large Language Models as Configuration Validators"
Abstract: Misconfigurations are major causes of software failures. Existing practices rely on developer-written rules or test cases to validate configurations, which are expensive. Machine learning (ML) for configuration validation is considered a promising direction, but has been facing challenges such as the need of large-scale field data and system-specific models. Recent advances in Large Language Models (LLMs) show promise in addressing some of the long-lasting limitations of ML-based configuration validation. We present a first analysis on the feasibility and effectiveness of using LLMs for configuration validation. We empirically evaluate LLMs as configuration validators by developing a generic LLM-based configuration validation framework, named Ciri. Ciri employs effective prompt engineering with few-shot learning based on both valid configuration and misconfiguration data. Ciri checks outputs from LLMs when producing results, addressing hallucination and nondeterminism of LLMs. We evaluate Ciri’s validation effectiveness on eight popular LLMs using configuration data of ten widely deployed open-source systems. Our analysis (1) confirms the potential of using LLMs for configuration validation, (2) explores design space of LLMbased validators like Ciri, and (3) reveals open challenges such as ineffectiveness in detecting certain types of misconfigurations and biases towards popular configuration parameters.
Wen Zhang, Botang Xiao, Qingchen Kong, Le Guan, Wenwen Wang, "BSan: A Powerful Identifier-Based Hardware-Independent Memory Error Detector for COTS Binaries"
Abstract: This paper presents BSan, a practical software-only memory error detector for binary code. Different from state-of-the-art binary-level detectors, which rely on either the shadow memory-based approach or the hardware-specific feature and thus suffer from several fundamental limitations, BSan adopts an identifier-based approach, enabling it to detect deep memory errors missed by existing detectors. Also, BSan does not depend on any specific hardware features. To reduce the high performance overhead caused by identifier propagation, BSan creates a novel hybrid approach, static analysis+dynamic instrumentation, to improve the performance without inheriting the poor reliability of static binary rewriting, distinguishing it from existing detectors that simply refer to static binary rewriting for better performance. The comprehensive evaluation demonstrates that BSan can detect more memory errors than state-of-the-art binary-level detectors. Meanwhile, the performance and memory overheads of BSan are comparable to those of existing detectors.
Ying Fu, Zhiyong Wu, Yuanliang Zhang, Jie Liang, Jingzhou Fu, Yu Jiang, Shanshan Li, Xiangke Liao, "Thanos: DBMS Bug Detection via Storage Engine Rotation Based Differential Testing"
Abstract: Differential testing is a prevalent strategy for establishing test oracles in automated DBMS testing. However, meticulously selecting equivalent DBMSs with diverse implementations and compatible input syntax requires huge manual efforts. In this paper, we propose Thanos, a framework that finds DBMS bugs via storage engine rotation based differential testing. Our key insight is that a DBMS with different storage engines must provide consistent basic storage functionalities. Therefore, it’s feasible to construct equivalent DBMSs based on storage engine rotation, ensuring that the same SQL test cases to these equivalent DBMSs yield consistent results. The framework involves four main steps: 1) select the appropriate storage engines; 2) extract equivalence information among the selected storage engines; 3) synthesize feature-orient test cases that ensure the DBMS equivalence; and 4) send test cases to the DBMSs with selected storage engines and compare the results. We evaluate Thanos on three widely used and extensively tested DBMSs, namely MySQL, MariaDB, and Percona against state-of-the-art fuzzers SQLancer, SQLsmith, and Squirrel. Thanos outperforms them on branch coverage by 24%–116%, and also finds many bugs missed by other fuzzers. More importantly, the vendors have confirmed 32 previously unknown bugs found by Thanos, with 29 verified as Critical.
Courtney Miller, Mahmoud Jahanshahi, Audris Mockus, Bogdan Vasilescu, Christian Kästner, "Understanding the Response to Open-Source Dependency Abandonment in the npm Ecosystem"
Abstract: Many developers relying on open-source digital infrastructure expect continuous maintenance, but even the most critical packages can become unmaintained. Despite this, there is little understanding of the prevalence of abandonment of widely-used packages, of subsequent exposure, and of reactions to abandonment in practice, or the factors that influence them. We perform a large-scale quantitative analysis of all widely-used npm packages and find that abandonment is common among them, that abandonment exposes many projects which often do not respond, that responses correlate with other dependency management practices, and that removal is significantly faster when a projects end-of-life status is explicitly stated. We end with recommendations to both researchers and practitioners who are facing dependency abandonment or are sunsetting projects, such as opportunities for low-effort transparency mechanisms to help exposed projects make better, more informed decisions.
Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen, "Vulnerability Detection with Code Language Models: How Far Are We?"
Abstract: In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26\% F1 on BigVul but only 3.09\% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
Deepak-George Thomas, Matteo Biagiola, Nargiz Humbatova, Mohammad Wardat, Gunel Jahangirova, Hridesh Rajan, Paolo Tonella, "µPRL: a Mutation Testing Pipeline for Deep Reinforcement Learning based on Real Faults"
Abstract: Reinforcement Learning (RL) is increasingly adopted to train agents that can deal with complex sequential tasks, such as driving an autonomous vehicle or controlling a complex environment. Correspondingly, novel approaches are needed to ensure that RL agents have been tested adequately before going to production. Among them, mutation testing is quite promising, especially under the assumption that the injected faults (mutations) mimic the real ones. In this paper, we first describe a taxonomy of real RL faults obtained by repository mining. Then, we present the mutation operators derived from such real faults and implemented in the tool µPRL. Finally, we discuss the experimental results, which show that µPRL is extremely effective at discriminating strong from weak test generators, hence providing useful feedback to developers about the adequacy of the test scenarios generated and executed so far.
Nadia Nahar, Haoran Zhang, Grace Lewis, Shurui Zhou, Christian Kästner, "The Product Beyond the Model -- An Empirical Study of Repositories of Open-Source ML Products"
Abstract: Machine learning (ML) components are increasingly incorporated into software products for end-users, but developers face challenges in transitioning from ML prototypes to products. Academics have limited access to the source of commercial ML products, challenging research progress. In this study, first, we contribute a novel process to identify 262 open-source ML products among more than half a million ML-related projects on GitHub. Then, we qualitatively and quantitatively analyze 30 open-source ML products to answer six broad research questions about development practices and system architecture. We find that the majority of the ML products in our sample represent startup-style development reported in past interview studies. We report 21 findings, including limited involvement of data scientists in many ML products, unusually low modularity between ML and non-ML code, diverse architectural choices on incorporating models into products, and limited prevalence of industry best practices such as model testing, pipeline automation, and monitoring. Additionally, we discuss 7 implications of this study on research, development, and education, including the need for tools to assist teams without data scientists, education opportunities, and open-source-specific research for privacy-preserving telemetry.
Sanan Hasanov, Stefan Nagy, Paul Gazzillo, "A Little Goes a Long Way: Tuning Configuration Selection for Continuous Kernel Fuzzing"
Abstract:The Linux kernel is actively-developed and widely-used. It supports billions of devices of all classes, from high-performance computing to the Internet-of-Things, in part because of its sophisticated configuration system, which automatically tailors the source code according to thousands of user-provided configuration options. Fuzzing has been highly successful at finding kernel bugs, being among the top bug reporters. Since the kernel receives 100s of patches per day, fuzzers run continuously, stopping regularly to rebuild the kernel with the latest changes before restarting fuzzing. But kernel fuzzers currently use predefined configuration settings that, as we show, exclude the majority of new patches from the kernel binary, nullifying the benefits of continuous fuzzing. Unfortunately, state-of-the-art configuration testing techniques are generally ill-suited to the needs of continuous fuzzing, excluding necessary options or requiring too many configuration files to be tractible. We distill down the needs of continuous testing into six properties with the most impact, systematically analyze the space of configuration selection strategies, and provide actionable recommendations. Through our analysis, we discover that continuous fuzzers can improve configuration variety without sacrificing performance. We empirically evaluate our discovery by modifying the configuration selection strategy for syzkaller, the most popular Linux kernel fuzzer, which subsequently found more than twice as many new bugs (35 vs. 13) than with the original configuration file and 12x more (24 vs. 2) when considering only unique bugs---with one security vulnerability being assigned a CVE.
Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, Yuriy Brun, "QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning"
Abstract: Formal verification is a promising method for producing highly reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using machine learning and a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. This makes proofs hard to find, and, when they are found, longer than necessary. Reinforcement learning could help estimate progress, but sparse rewards make this method ineffective. To address this problem, we create QEDCartographer, an novel automated proof-synthesis tool that combines supervised and reinforcement learning. QEDCartographer's key insight is that incorporating the branching structure of proofs into its learning enables reward-free search, mitigating the sparse reward challenge. We evaluate QEDCartographer on the CoqGym benchmark of 68,501 theorems from 124 open-source Coq projects. QEDCartographer proves 186 more theorems than Proverbot9001, a state-of-the-art proof synthesis tool, an increase of 8%. Further, the tools are complementary, together proving 12% more theorems than Proverbot9001 alone. For theorems both can prove, QEDCartographer produces 26% shorter proofs 27% faster.
Ben Limpanukorn, Jiyuan Wang, Hong Jin Kang, Eric Zitong Zhou, Miryung Kim, "Fuzzing MLIR Compilers with Custom Mutation Synthesis"
Abstract: A growing trend in compiler design is to enable modular extensions to intermediate representations (IRs). Multi- Level Intermediate Representation (MLIR) is a new effort to enable faster compiler development by providing an extensible framework for downstream developers to define custom IRs with MLIR dialects. Sets of MLIR dialects define new IRs that are tailored for specific domains. The diversity and rapid evolution of these IRs make it impractical to pre-define custom test generator logic for every available dialect. We design a new approach called SYNTHFUZZ that automatically infers and applies custom mutations from existing tests. The key essence of SYNTHFUZZ is that inferred custom mutations are parameterized and context-dependent such that they can be concretized differently depending on the target context. By doing this, we obviate the need to manually write custom mutations for newly introduced MLIR dialects. Further, SYNTHFUZZ increases the chance of finding effective edit locations and reduces the chance of inserting invalid edit content by performing k-ancestor- prefix and l-sibling-postfix matching. We compare SYNTHFUZZ to three baselines: Grammarinator—a grammar-based fuzzer without custom mutators, MLIRSmith—a custom test generator for MLIR, and NeuRI—a custom test generator with support for parameterized generation. We conduct this comprehensive comparison on 4 different MLIR projects where each project defines a new set of MLIR dialects that would take months of effort to manually write custom input generation and mutation logic. Our evaluation shows that SYNTHFUZZ on average improves input diversity by 1.51×, which increases branch coverage by 1.16×. Further, we show that our context dependent custom mutation increases the proportion of valid tests by up to 1.11×, indicating that SYNTHFUZZ correctly concretizes its parameterized mutations with respect to the target context. Parameterization of the mutations reduces the fraction of tests violating general MLIR constraints by 0.57×, increasing the time spent fuzzing dialect-specific code.
Forough Mehralian, Ziyao He, Sam Malek, "Automated Accessibility Analysis of Dynamic Content Changes on Mobile Apps"
Abstract: With mobile apps playing an increasingly vital role in our daily lives, the importance of ensuring their accessibility for users with disabilities is also growing. Despite this, app developers often overlook the accessibility challenges encountered by users of assistive technologies, such as screen readers. Screen reader users typically navigate content sequentially, focusing on one element at a time, unaware of changes occurring elsewhere in the app. While dynamic changes to content displayed on an app’s user interface may be apparent to sighted users, they pose significant accessibility obstacles for screen reader users. Existing accessibility testing tools are unable to identify challenges faced by blind users resulting from dynamic content changes. In this work, we first conduct a formative user study on dynamic changes in Android apps and their accessibility barriers for screen reader users. We then present TIMESTUMP, an automated framework that leverages our findings in the formative study to detect accessibility issues regarding dynamic changes. Finally, we empirically evaluate TIMESTUMP on real-world apps to assess its effectiveness and efficiency in detecting such accessibility issues.
Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, Zibin Zheng, "RLCoder: Reinforcement Learning for Repository-Level Code Completion"
Abstract: Repository-level code completion aims to generate code for unfinished code snippets within the context of a specified repository. Existing approaches mainly rely on retrieval-augmented generation strategies due to limitations in input sequence length. However, traditional lexical-based retrieval methods like BM25 struggle to capture code semantics, while model-based retrieval methods face challenges due to the lack of labeled data for training. Therefore, we propose RLCoder, a novel reinforcement learning framework, which can enable the retriever to learn to retrieve useful content for code completion without the need for labeled data. Specifically, we iteratively evaluate the usefulness of retrieved content based on the perplexity of the target code when provided with the retrieved content as additional context, and provide feedback to update the retriever parameters. This iterative process enables the retriever to learn from its successes and failures, gradually improving its ability to retrieve relevant and high-quality content. Considering that not all situations require information beyond code files and not all retrieved context is helpful for generation, we also introduce a stop signal mechanism, allowing the retriever to decide when to retrieve and which candidates to retain autonomously. Extensive experimental results demonstrate that RLCoder consistently outperforms state-of-the-art methods on CrossCodeEval and RepoEval, achieving 12.2\% EM improvement over previous methods. Moreover, experiments show that our framework can generalize across different programming languages and further improve previous methods like RepoCoder.
Nausheen Mohammed, Akash lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma, "LLM Assistance for Memory Safety"
Abstract: Memory safety violations in low-level code, written in languages like C, continues to remain one of the major sources of software vulnerabilities. One method of removing such violations by construction is to port C code to a safe C dialect. Such dialects rely on programmer-supplied annotations to guarantee safety with minimal runtime overhead. This porting, however, is a manual process that imposes significant burden on the programmer and, hence, there has been limited adoption of this technique. The task of porting not only requires inferring annotations, but may also need refactoring/rewriting of the code to make it amenable to such annotations. In this paper, we use Large Language Models (LLMs) towards addressing both these concerns. We show how to harness LLM capabilities to do complex code reasoning as well as rewriting of large codebases. We also present a novel framework for whole-program transformations that leverages lightweight static analysis to break the transformation into smaller steps that can be carried out effectively by an LLM. We implement our ideas in a tool called MSA that targets the CheckedC dialect. We evaluate MSA on several micro-benchmarks, as well as real-world code ranging up to 20K lines of code. We showcase superior performance compared to a vanilla LLM baseline, as well as demonstrate improvement over a state-of-the-art symbolic (non-LLM) technique.
Chenkai Guo, Qianlu Wang, Naipeng Dong, Lingling Fan, Tianhong Wang, Weijie Zhang, EnBao Chen, Zheli Liu, Lu Yu, "EP-Detector: Automatic Detection of Error-prone Operation Anomalies in Android Applications"
Abstract: Android applications are pervasively adopted and heavily relied on in our daily life, leading to the growing demand for enhanced user experiences, such as ease for operation and robustness. Nevertheless, developers continue to prioritize traditional functionality and performance, overlooking the pivotal role of user experience in real-world scenarios. For example, poorly designed page elements can lead to user confusion, resulting in unexpected outcomes, termed as the error-prone operation anomalies (EPAs). In this work, we undertake the first effort to uncover the underlying essence of the EPA problem. To achieve this objective, we investigated the root causes of EPAs from three dimensions, i.e., subject, object and environment. These causes were identified by multi-stage attribute capturing and precise similarity computation. In this process, the causes are categorized into fine-grained classes, namely confusing behaviours, unsuitable layout, and resource overload. Building upon these insights, we propose a dynamic GUI-based testing tool EP-Detector to facilitate detecting the EPAs in real-world apps. The EP-Detector is equipped with widget-exploration based target navigation and automatic test oracle, enabling it to detect error-prone page elements and simulate events with both comprehensiveness and precision. To systematically study the prevalence and severity of real-world EPAs, we conducted experiments on 53 popular Android apps with EP-Detector. The confirmed results not only validate the high precision and completeness of EP-Detector but also highlight that EPAs are prevalent in current apps, with at least one EPA existing in every two page widgets on average, and 28.3% of them may lead to security and functionality issues or risks. The EP-Detector is available at https://github.com/WordDealer/EP-Detector.
Shuzheng Gao, Cuiyun Gao, Wenchao Gu, Michael Lyu, "Search-Based LLMs for Code Optimization"
Abstract: The code written by developers usually suffers from efficiency problems and contain various performance bugs. These inefficiencies necessitate the research of automated refactoring methods for code optimization. Early research in code optimization employs rule-based methods and focuses on specific inefficiency issues, which are labor-intensive and suffer from the low coverage issue. Recent work regards the task as a sequence generation problem, and resorts to deep learning (DL) techniques such as large language models (LLMs). These methods typically prompt LLMs to directly generate optimized code. Although these methods show state-of-the-art performance, such one-step generation paradigm is hard to achieve an optimal solution. First, complex optimization methods such as combinatorial ones are hard to be captured by LLMs. Second, the one-step generation paradigm poses challenge in precisely infusing the knowledge required for effective code optimization within LLMs, resulting in under-optimized code. To address these problems, we propose to model this task from the search perspective, and propose a search-based LLMs framework named SBLLM that enables iterative refinement and discovery of improved optimization methods. SBLLM synergistically integrate LLMs with evolutionary search and consists of three key components: 1) an execution-based representative sample selection part that evaluates the fitness of each existing optimized code and prioritizes promising ones to pilot the generation of improved code; 2) an adaptive optimization pattern retrieval part that infuses targeted optimization patterns into the model for guiding LLMs towards rectifying and progressively enhancing their optimization methods; and 3) a genetic operator-inspired chain-of-thought prompting part that aids LLMs in combining different optimization methods and generating improved optimization methods. Our evaluation of SBLLM on a dataset of Python and C++ code demonstrates its effectiveness in improving code efficiency. Specifically, the results indicate that SBLLM can improve program execution efficiency by up to 109.59% and consistently outperform all baseline methods by 8.72% ∼ 28.06% and 1.15% ∼ 9.56% with different LLMs in terms of top-5 speedup rate on Python and C++, respectively.
Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, Hui Liu, "A First Look at Conventional Commits Classification"
Abstract: Modern distributed software development relies on commits to control system versions. Commit classification plays a vital role in both industry and academia. The widely-used commit classification framework was proposed in 1976 by Swanson and includes three base classes: perfective, corrective, and adaptive. With the increasing complexity of software development, the industry has shifted towards a more fine-grained commit category, i.e., adopting Conventional Commits Specification (CCS) for delicacy management. The new commit framework requires developers to classify commits into ten distinct categories, such as ``feat'', ``fix'', and ``docs''. However, existing studies mainly focus on the three-category classification, leaving the definition and application of the fine-grained commit categories as knowledge gaps. This paper reports a preliminary study on this mechanism from its application status and problems. We also explore ways to address these identified problems. We find that a growing number of projects on GitHub are adopting CCS. By analyzing 194 issues from GitHub and 100 questions from Stack Overflow about the CCS application, we qualitatively categorized 52 challenges developers encountered. The most common one is CCS-type confusion. To address these challenges, we propose a clear definition of CCS types based on existing variants. Further, we designed an approach to automatically classify commits into CCS types, and the evaluation results demonstrate a promising performance. Our work facilitates a deeper comprehension of the present fine-grained commit categorization and holds the potential to alleviate application challenges significantly.
Chong Wang, Jian Zhang, Yiling Lou, Mingwei Liu, Weisong Sun, Yang Liu, Xin Peng, "TIGER: A Generating-Then-Ranking Framework for Practical Python Type Inference"
Abstract: Python’s dynamic typing system offers flexibility and expressiveness but can lead to type-related errors, prompting the need for automated type inference despite efforts like Python Enhancement Proposals (PEPs) to enhance type hinting. While existing learning-based approaches show promising inference accuracy, they struggle with practical challenges in comprehensively handling various types, including complex generics and (unseen) user/library-defined types. To address these challenges, we introduce TIGER, employing a two-stage generating-then-ranking (GTR) framework. By fine-tuning pre-trained code models, TIGER trains a generation model with a generative span masking objective and a similarity model with a contrastive training objective. This enables TIGER to execute the GTR inference, generating diverse candidates and then ranking them alongside user/library-defined types. Evaluation on the ManyTypes4Py dataset demonstrates TIGER’s effectiveness across different type categories, particularly excelling in (unseen) user-defined types (with improvements of 11.2% and 20.1% in Top-5 Exact Match). The evaluation results also confirm the robustness and efficiency of TIGER, highlighting the contributions of the employed two stages.
Qingchao Shen, Yongqiang Tian, Haoyang Ma, Junjie Chen, Lili Huang, Ruifeng Fu, Shing-Chi Cheung, Zan Wang, "A Tale of Two DL Cities: When Library Tests Meet Compiler"
Abstract: Deep Learning (DL) compilers typically load a DL model and optimize it with intermediate representation. Existing DL compiler testing techniques mainly focus on model optimization stages, but rarely explore bug detection at the model loading stage. Effectively testing the model loading stage requires covering diverse usages of each DL operator from various DL libraries, which shares a common objective with DL library testing, indicating that the embedded knowledge in DL library tests could potentially be beneficial for testing the model loading stage of DL compilers. Thus, we conducted the first empirical study to investigate the effectiveness and efficiency of migrating the knowledge embedded in DL library tests to test the model loading stage. To support the conduct of this study, we develop a technique, called OPERA, consisting of test migration (regarding effectiveness investigation) and test prioritization (regarding efficiency investigation). We considered three sources of tests in DL libraries for migration and used eight frontends from three DL compilers (e.g., TVM, TensorRT, and OpenVINO) for evaluation. The migrated tests with the aid of OPERA detected 170 previously unknown bugs in total, 90 of which have been confirmed/fixed by developers, demonstrating the effectiveness of such the migration-based idea. The test prioritization strategy in OPERA improves testing efficiency with migrated tests by 11.9%~47.4% on average compared to general test prioritization strategies. Finally, we obtained 7 major findings and provided a set of guidelines for future work from this study.
Rodrigo Pedro, Miguel E. Coimbra, Daniel Castro, Paulo Carreira, Nuno Santos, "Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses"
Abstract: Large Language Models (LLMs) have found widespread applications in various domains, including web applications with chatbot interfaces. Aided by an LLM-integration middleware such as LangChain, user prompts are translated into SQL queries used by the LLM to provide meaningful responses to users. However, unsanitized user prompts can lead to SQL injection attacks, potentially compromising the security of the database. In this paper, we present a comprehensive examination of prompt-to-SQL (P2SQL) injections targeting web applications based on frameworks such as LangChain and LlamaIndex. We characterize P2SQL injections, exploring their variants and impact on application security through multiple concrete examples. We evaluate seven state-of-the-art LLMs, demonstrating the risks of P2SQL attacks across language models. By employing both manual and automated methods, we discovered P2SQL vulnerabilities in five real-world applications. Our findings indicate that LLM-integrated applications are highly susceptible to P2SQL injection attacks, warranting the adoption of robust defenses. To counter these attacks, we propose four effective defense techniques that can be integrated as extensions to the LangChain framework.
Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia, "Reasoning Runtime Behavior of a Program with LLM: How Far Are We?"
Abstract: Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution. We utilize existing code benchmarks and adapt them to new benchmarks within our framework. A large-scale empirical study is conducted and most LLMs show unsatisfactory performance on both Runtime Behavior Reasoning (i.e., an average accuracy of 44.4\%) and Incremental Consistency Evaluation (i.e., an average IC score of 10.3). Evaluation results of current code LLMs reflect the urgent need for the community to strengthen the code reasoning capability of code LLMs.
Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, Yang Liu, "Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications"
Abstract: Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that large language models (LLMs) have potential in auditing smart contracts, but the state-of-the-art indicates that even GPT-4 can achieve only 30% precision (when both decision and justification are correct). This is likely because off-the-shelf LLMs were primarily pre-trained on a general text/code corpus and not fine- tuned on the specific domain of Solidity smart contract auditing. In this paper, we propose iAudit, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, iAudit is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. As such, iAudit employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate causes of vulnerabilities. However, fine-tuning alone faces challenges in accurately identifying the optimal cause of a vulnerability. Therefore, we introduce two LLM-based agents, the Ranker and Critic, to iteratively select and debate the most suitable cause of vulnerability based on the output of the fine-tuned Reasoner model. To evaluate iAudit, we collected a balanced dataset with 1,734 positive and 1,810 negative samples to fine-tune iAudit. We then compared it with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) as well as prompt learning-based LLMs (GPT4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, iAudit achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by iAudit achieved a consistency of about 38% compared to the ground truth causes.
Shuo Yang, Xingwei Lin, Jiachi Chen, Qingyuan Zhong, Lei Xiao, Renke Huang, Yanlin Wang, Zibin Zheng, "Hyperion: Unveiling DApp Inconsistencies using LLM and Dataflow-Guided Symbolic Execution"
Abstract: The rapid advancement of blockchain platforms has significantly accelerated the growth of decentralized applications (DApps). Similar to traditional applications, DApps integrate front-end descriptions that showcase their features to attract users, and back-end smart contracts for executing their business logic. However, inconsistencies between the features promoted in front-end descriptions and those actually implemented in the contract can confuse users and undermine DApps's trustworthiness. In this paper, we first conducted an empirical study to identify seven types of inconsistencies, each exemplified by a real-world DApp. Furthermore, we introduce Hyperion, an approach designed to automatically identify inconsistencies between front-end descriptions and back-end code implementation in DApps. This method leverages a fine-tuned large language model LLaMA2 to analyze DApp descriptions and employs dataflow-guided symbolic execution for contract bytecode analysis. Finally, Hyperion reports the inconsistency based on predefined detection patterns. The experiment on our ground truth dataset consisting of 54 DApps shows that Hyperion reaches 84.06\% overall recall and 92.06\% overall precision in reporting DApp inconsistencies. We also implement Hyperion to analyze 835 real-world DApps. The experimental results show that Hyperion discovers 459 real-world DApps containing at least one inconsistency.
Zhiqing Zhong, Shilin He, Haoxuan Wang, Boxi Yu, Haowen Yang, Pinjia He, "An Empirical Study on Package-Level Deprecation in Python Ecosystem"
Abstract: Open-source software (OSS) plays a crucial role in modern software development. Utilizing OSS code can greatly accelerate software development, reduce redundancy, and enhance reliability. Python, a widely adopted programming language, is particularly renowned for its extensive and diverse third-party package ecosystem. However, a significant number of OSS packages within the Python ecosystem are in poor maintenance, leading to potential risks in terms of functionality and security. Consequently, it is essential to establish a deprecation mechanism that assists package developers and users in effectively managing these packages. To facilitate the establishment of the package-level deprecation mechanism, this paper presents a mixed-method empirical study, including data analysis and surveys. We investigate the current practices of announcing, receiving, and handling package-level deprecation in the Python ecosystem. We also assess the benefits of having deprecation announcements for inactively maintained packages. Furthermore, we investigate the challenges faced by package developers and users and their expectations for future deprecation practices. Our findings reveal valuable insights. For instance, 75.4\% of inactive package developers have no intention of releasing deprecation declarations for various reasons, while 89.5\% of users express a desire to be notified about the deprecation, highlighting a gap between developers and users; In many cases, no alternative solutions are available when deprecation occurs, emphasizing the need to explore practical approaches that enable seamless package handover and require less maintenance effort. We anticipate that our work will enhance the understanding of existing package-level deprecation patterns within the Python OSS realm and facilitate the development of deprecation practices for the Python community in the future.
Lizhi Liao, Simon Eismann, Heng Li, Cor-Paul Bezemer, Diego Elias Costa, André van Hoorn, Weiyi Shang, "Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models"
Abstract: During software development, developers often make numerous modifications to the software to address existing issues or implement new features. However, certain changes may inadvertently have a detrimental impact on the overall system performance. To ensure that the performance of new software releases does not degrade (i.e., absence of performance regressions), existing practices rely on system-level performance testing, such as load testing, or component-level performance testing, such as microbenchmarking, to detect performance regressions. However, performance testing for the entire system is often expensive and time-consuming, posing challenges to adapting to the rapid release cycles common in modern DevOps practices. In addition, system-level performance testing cannot be conducted until the system is fully built and deployed. On the other hand, component-level testing focuses on isolated components, neglecting overall system performance and the impact of system workloads. In this paper, we propose a novel approach to early detection of performance regressions by bridging the local performance data generated by component-level testing and the system-level architectural models. Our approach uses local performance data to identify deviations at the component level, and then propagate these deviations to the architectural model. We then use the architectural model to predict regressions in the performance of the overall system. In an evaluation of our approach on two representative open-source benchmark systems, we show that it can effectively detect end-to-end system performance regressions from local performance deviations with different intensities and under various system workloads. More importantly, our approach can detect regressions as early as in the development phase, in contrast to existing approaches that require the system to be fully built and deployed. Our approach is lightweight and can complement traditional system performance testing when testing resources are scarce.
Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, Andy Zaidman, "Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests"
Abstract: Automated unit test generators, particularly search-based software testing tools like EvoSuite, are capable of generating tests with high coverage. Although these generators alleviate the burden of writing unit tests, they often pose challenges for software engineers in terms of understanding the generated tests. To address this, we introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We achieve this enhancement through contextualizing test data, improving identifier naming, and adding descriptive comments. Through a controlled experiment with 32 participants, we investigate how the understandability of unit tests affects a software engineer's ability to perform bug-fixing tasks. We selected bug-fixing to simulate a real-world scenario that emphasizes the importance of understandable test cases. We observe that participants working on assignments with test cases fix up to 33% more bugs and use up to 20\% less time when compared to baseline test cases. From the post-test questionnaire, we gathered that participants found that enhanced test names, test data, and variable names improved their bug-fixing process.
Sebastian Uchitel, Francisco Cirelli, Dalal Alrajeh, "Unavoidable Boundary Conditions: A Control Perspective on Goal Conflicts"
Abstract: Boundary Conditions (BCs) express situations under which requirements specifications conflict. They are used within a broader conflict management process to produce less idealized specifications. Several approaches have been proposed to identify BCs automatically. Some introduce a prioritization criteria to reduce the number of BCs presented to an engineer. However, identifying the few, relevant boundary conditions remains an open challenge. In this paper, we argue that one of the problems of the state of the art is with the definition of BC itself -- it is too weak. We propose a stronger definition for the few, relevant BCs, which we refer to as Unavoidable Boundary Conditions (UBCs), which utilizes the notion of realizability in reactive synthesis. We show experimentally that UBCs non-trivially reduce the number of conditions produced by existing BC identification techniques. We also relate UBCs to existing concepts in reactive synthesis used to provide feedback for unrealizable specifications (including counter-strategies and unrealizable cores). We then show that UBCs provide a targeted form of feedback for repairing unrealizable specifications.
Brian Hyeongseok Kim, Jingbo Wang, Chao Wang, "FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks"
Abstract: We propose a method for formally certifying and quantifying individual fairness of a deep neural network (DNN). Individual fairness guarantees that any two individuals who are identical except for some protected input attribute (e.g., gender or race) receive the same treatment. While there are existing techniques that provide such a guarantee, they suffer from lack of scalability or accuracy as the size and input dimension of the DNN increase. Our method overcomes this limitation by applying abstraction to a symbolic interval based analysis of the DNN followed by iterative refinement guided by the fairness property. Furthermore, our method lifts the interval based analysis from the conventional qualitative certification to quantitative certification, by computing the percentage of individuals whose classification outputs are provably fair, instead of merely deciding if the DNN is fair. We have implemented our method and evaluated it on deep neural networks trained on five popular fairness research datasets. The experimental results show that our method is not only more accurate than state-of-the-art techniques but also several orders-of-magnitude faster.
Mingyuan Wu, Jiahong Xiang, Kunqiu Chen, Peng DI, Shin Hwei Tan, Heming Cui, Yuqun Zhang, "Tumbling Down the Rabbit Hole: How do Assisting Exploration Strategies Facilitate Grey-box Fuzzing?"
Abstract: Many assisting exploration strategies have been proposed to assist grey-box fuzzers in exploring program states guarded by tight and complex branch conditions such as equality constraints. Although they have shown promising results in their original papers, their evaluations seldom follow equivalent protocols, e.g., they are rarely evaluated on identical benchmarks. Moreover, there is a lack of sufficient investigations on the specifics of the program states explored by these strategies which can obfuscate the future application and development of such strategies. Consequently, there is a pressing need for a comprehensive study of assisting exploration strategies on their effectiveness, versatility, and limitations to enlighten their future development. To this end, we perform the first comprehensive study about the assisting exploration strategies for grey-box fuzzers. Specifically, we first collect nine recent fuzzers representing the mainstream assisting exploration strategies as our studied subjects and 21 real-world projects to form our benchmark suite. After evaluating the subjects on the benchmark suite, we then surprisingly find that the dictionary strategy is the most promising since it not only achieves similar or even slightly better performance over the other studied assisting exploration strategies in terms of exploring program states but also is more practical to be enhanced. Accordingly, we propose CDFUZZ, which generates a customized dictionary for each seed upon the baseline fuzzer AFL to improve over the original dictionary strategy. The evaluation results demonstrate that CDFUZZ increases the edge coverage by 16.1% on average for all benchmark projects over the best performer in our study (i.e., AFL++ with the dictionary strategy). CDFUZZ also successfully exposed 37 previously unknown bugs, with nine confirmed and seven fixed by the corresponding developers.
Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy, "Towards Neural Synthesis for SMT-assisted Proof-Oriented Programming"
Abstract: Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux, to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem---producing a definition given a formal specification expressed as an F* type. We provide a program-fragment checker that queries F* to check the correctness of candidate solutions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding in that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compare favorably with large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.
Kunpeng Zhang, Shuai Wang, Jitao Han, Xiaogang Zhu, Xian Li, Shaohua Wang, Sheng Wen, "Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models"
Abstract: Deep learning (DL) libraries are widely used to form the basis of various AI applications in computer vision, natural language processing, and software engineering domains. Despite their popularity, DL libraries are known to have vulnerabilities, such as buffer overflows, use-after-free, and integer overflows, that can be exploited to compromise the security or effectiveness of the underlying libraries. While traditional fuzzing techniques have been used to find bugs in software, they are not well-suited for DL libraries. In general, the complexity of DL libraries and the diversity of their APIs make it challenging to test them thoroughly. To date, mainstream DL libraries like TensorFlow and PyTorch have featured over 1,000 APIs, and the number of APIs is still growing. Fuzzing all these APIs is a daunting task, especially when considering the complexity of the input data and the diversity of the API usage patterns. Recent advances in large language models (LLMs) have illustrated the high potential of LLMs in understanding and synthesizing human-like code. Despite their high potential, we find that emerging LLM-based fuzzers are less optimal for DL library API fuzzing, given their lack of in-depth knowledge on API input edge cases and inefficiency in generating test inputs. In this paper, we propose DFUZZ, a LLM-driven DL library fuzzing approach. We have two key insights: (1) With high reasoning ability, LLMs can replace human experts to reason edge cases (likely error-triggering inputs) from checks in an API's code, and transfer the extracted knowledge to test other (new or rarely-tested) APIs. (2) With high generation ability, LLMs can synthesize initial test programs with high accuracy that automates API testing. DFUZZ provides LLMs with a novel ''white-box view'' of DL library APIs, and therefore, can leverage LLMs' reasoning and generation abilities to achieve comprehensive fuzzing. Our experimental results on popular DL libraries demonstrate that DFUZZ is able to cover more APIs than SOTA (LLM-based) fuzzers on TensorFlow and PyTorch, respectively. Moreover, DFUZZ successfully detected 37 bugs, with 17 already confirmed as previously unknown bugs.
Yubo Mai, Zhipeng Gao, Haoye Wang, Tingting Bi, Xing Hu, Xin Xia, jianling Sun, "Towards Better Answers: Automated Stack Overflow Post Updating"
Abstract: Utilizing code snippets on Stack Overflow (SO) is a common practice among developers for problem-solving. Although SO code snippets serve as valuable resources, it is important to acknowledge their imperfections, reusing problematic code snippets can lead to the introduction of suboptimal or buggy code into software projects. \textit{SO comments} often point out weaknesses of a post and provide valuable insights to improve the quality of answers, while SO comments are usually missed and/or ignored, leaving these problematic code snippets untouched. In this work, we first investigate the task of automatic SO posts updating based on their associated comments. We introduce a novel framework, named \textbf{Soup} (\textbf{\underline{S}}tack \textbf{\underline{O}}verflow \textbf{\underline{U}}pdator for \textbf{\underline{P}}ost) for this task. \textbf{Soup} addresses two key tasks: Valid Comment-Edit Prediction (VCP) and Automatic Post Updating (APU). We fine-tuned a large language model, CodeLlama, using low-rank adaptation techniques to complete the VCP task, and constructed a dataset containing 78k valid comment-edit pairs for the APU task. Subsequently, we tested the performance of multiple large language models on the APU task. Extensive experimental results show the promising performance of our model over a set of benchmarks. Moreover, we also perform an in-the-wild evaluation on Stack Overflow, we submitted 50 edits generated by our approach to Stack Overflow posts and 21 of them have been verified and accepted by SO maintainers, further proving the practical value of \textbf{Soup}.
Tianchang Gao, Junjie Chen, Dong Wang, Yile Guo, Yingquan Zhao, Zan Wang, "Selecting Initial Seeds for Better JVM Fuzzing"
Abstract: JVM fuzzing techniques serve as a cornerstone for guaranteeing the quality of implementations. In typical fuzzing workflows, initial seeds are crucial as they form the basis of the process. Literature in traditional program fuzzing has confirmed that effectiveness is largely impacted by redundancy among initial seeds, thereby proposing a series of seed selection methods. JVM fuzzing, compared to traditional ones, presents unique characteristics, including large-scale and intricate code, and programs with both syntactic and semantic features. However, it remains unclear whether the existing initial seed selection methods are suitable for JVM fuzzing and whether utilizing program features can enhance effectiveness. To address this, we devised a total of 10 initial seed selection methods, comprising coverage-based, prefuzz-based, and program-feature-based methods. We then conducted an empirical study on three JVM implementations to extensively evaluate the performance of the initial seed selection methods within two state-of-the-art fuzzing techniques (JavaTailor and VECT). Specifically, we examine performance from three aspects: (i) effectiveness and efficiency using widely studied initial seeds, (ii) effectiveness using the programs in the wild, and (iii) the ability to detect new bugs. Evaluation results first show that the program-feature-based method that utilizes the control flow graph not only has a significantly lower time overhead (i.e., 30s), but also outperforms other methods, achieving 142% to 269% improvement compared to the full set of initial seeds. Second, results reveal that the initial seed selection greatly improves the quality of wild programs and exhibits complementary effectiveness by detecting new behaviors. Third, results demonstrate that given the same testing period, initial seed selection improves the JVM fuzzing techniques by detecting more unknown bugs. Particularly, 16 out of the 25 detected bugs have been confirmed or fixed by developers. This work takes the first look at initial seed selection in JVM fuzzing, confirming its importance in fuzzing effectiveness and efficiency.
Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, Zhenyu Chen, "Source Code Summarization in the Era of Large Language Models"
Abstract: To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top\_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types (e.g., procedural and object-oriented programming languages). Finally, we unexpectedly find that \codellama{} with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.
Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu, "Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers"
Abstract: Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues of software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. Thus, its applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose a novel machine-generated code detection method called DetectCodeGPT, which improves DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experiment results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.