ICSE 2023
Sun 14 - Sat 20 May 2023 Melbourne, Australia
Thu 18 May 2023 14:45 - 14:52 at Meeting Room 110 - Test quality and improvement Chair(s): Guowei Yang

Software testing assures that code changes do not adversely affect existing functionality. However, a test case can be flaky, i.e., it can pass and fail across executions of the same version of the source code. Flaky test cases introduce overhead to software development as they can lead to unnecessary attempts to debug production or testing code. Besides rerunning test cases multiple times, which is time-consuming and computationally expensive, flaky test cases can be predicted using machine learning (ML) models, thus reducing the cost wasted on re-running and debugging them. However, state-of-the-art ML-based flaky test case predictors rely on pre-defined sets of features that are either project-specific, i.e., inapplicable to other projects, or require access to production code, which is not always available to software test engineers. Moreover, given the non-deterministic behavior of flaky test cases, it is challenging to determine a complete set of features that could potentially be associated with test flakiness. Therefore, in this paper, we propose Flakify, a black-box, language model-based predictor for flaky test cases. Flakify relies exclusively on the source code of test cases and thus requires no (a) access to production code (black-box), (b) rerunning of test cases, or (c) pre-defined features. To this end, we employed CodeBERT, a pre-trained language model, and fine-tuned it to predict flaky test cases using the source code of test cases. We evaluated Flakify on two publicly available flaky test case datasets (FlakeFlagger and IDoFT) and compared our technique with FlakeFlagger, the best state-of-the-art ML-based, white-box predictor for flaky test cases, using two evaluation procedures: (1) cross-validation and (2) per-project validation, i.e., prediction on new projects. Flakify achieved F1-scores of 79% and 73% on the FlakeFlagger dataset using cross-validation and per-project validation, respectively. Similarly, Flakify achieved F1-scores of 98% and 89% on the IDoFT dataset using the two validation procedures, respectively. Further, Flakify surpassed FlakeFlagger by 10 and 18 percentage points (pp) in precision and recall, respectively, when evaluated on the FlakeFlagger dataset, thus reducing by the same margins the cost otherwise wasted on unnecessarily debugging test cases and production code (corresponding to reduction rates of 25% and 64%). Flakify also achieved significantly better prediction results on new projects, suggesting better generalizability than FlakeFlagger. Our results further show that a black-box version of FlakeFlagger is not a viable option for predicting flaky test cases.
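
The abstract describes fine-tuning CodeBERT as a binary classifier over the raw source code of test cases. Below is a minimal sketch of such a setup, assuming the HuggingFace transformers and PyTorch libraries; the microsoft/codebert-base checkpoint is the public pre-trained model, while the toy test cases, labels, and hyperparameters are illustrative and not the authors' exact configuration.

# Minimal sketch: fine-tuning CodeBERT as a binary flaky-test classifier.
# Assumes HuggingFace transformers + PyTorch; data and hyperparameters
# below are illustrative, not the authors' exact setup.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2  # 0 = non-flaky, 1 = flaky
)

# Toy inputs: only the source code of test cases (black-box),
# no production-code or pre-defined features.
test_sources = [
    "public void testFoo() { assertEquals(1, foo()); }",
    "public void testBar() { Thread.sleep(100); assertTrue(bar()); }",
]
labels = torch.tensor([0, 1])

enc = tokenizer(test_sources, truncation=True, max_length=512,
                padding="max_length", return_tensors="pt")
loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
    batch_size=2,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # illustrative epoch count
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference: predict whether an unseen test case is flaky.
model.eval()
with torch.no_grad():
    new = tokenizer(
        ["public void testBaz() { assertNotNull(remoteCall()); }"],
        truncation=True, max_length=512, return_tensors="pt",
    )
    pred = model(**new).logits.argmax(dim=-1)
print("flaky" if pred.item() == 1 else "non-flaky")

In practice the classification head would be trained on a labeled dataset such as FlakeFlagger or IDoFT and evaluated with cross-validation or per-project validation, as the paper reports.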

Thu 18 May

Displayed time zone: Hobart

13:45 - 15:15
Test quality and improvement (Technical Track / Journal-First Papers / DEMO - Demonstrations) at Meeting Room 110
Chair(s): Guowei Yang University of Queensland
13:45
15m
Talk
Test Selection for Unified Regression Testing
Technical Track
Shuai Wang University of Illinois at Urbana-Champaign, Xinyu Lian University of Illinois at Urbana-Champaign, Darko Marinov University of Illinois at Urbana-Champaign, Tianyin Xu University of Illinois at Urbana-Champaign
Pre-print
14:00
15m
Talk
ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolutionary Search
Technical Track
Rongqi Pan University of Ottawa, Taher A Ghaleb University of Ottawa, Lionel Briand University of Luxembourg; University of Ottawa
14:15
15m
Talk
Measuring and Mitigating Gaps in Structural Testing
Technical Track
Soneya Binta Hossain University of Virginia, Matthew B Dwyer University of Virginia, Sebastian Elbaum University of Virginia, Anh Nguyen-Tuong University of Virginia
Pre-print
14:30
7m
Talk
FlaPy: Mining Flaky Python Tests at Scale
DEMO - Demonstrations
Martin Gruber BMW Group, University of Passau, Gordon Fraser University of Passau
Pre-print
14:37
7m
Talk
Scalable and Accurate Test Case Prioritization in Continuous Integration Contexts
Journal-First Papers
Ahmadreza Saboor Yaraghi University of Ottawa, Mojtaba Bagherzadeh University of Ottawa, Nafiseh Kahani Carleton University, Lionel Briand University of Luxembourg; University of Ottawa
14:45
7m
Talk
Flakify: A Black-Box, Language Model-based Predictor for Flaky Tests
Journal-First Papers
Sakina Fatima University of Ottawa, Taher A Ghaleb University of Ottawa, Lionel Briand University of Luxembourg; University of Ottawa
14:52
7m
Talk
Developer-centric test amplification
Journal-First Papers
Carolin Brandt Delft University of Technology, Andy Zaidman Delft University of Technology
Pre-print
15:00
7m
Talk
How Developers Engineer Test Cases: An Observational Study
Journal-First Papers
Maurício Aniche Delft University of Technology, Christoph Treude University of Melbourne, Andy Zaidman Delft University of Technology
Pre-print