In an ideal Continuous Integration (CI) workflow, every potentially impacted build and test would be run before submission for every proposed change. In large-scale environments such as Google’s mono-repository, this is infeasible in terms of both latency and compute cost, given the frequency of change requests and the overall size of the codebase. Instead, the compromise is to run more comprehensive testing at later stages of the development lifecycle, batched after submission.
TAP (Test Automation Platform) – the main CI system at Google – is responsible for building and testing hermetic build/test “targets” (libraries, binaries, tests, etc.) across the mono-repository. Targets are built and run both during individual pull requests before submission (“presubmit”) and over ranges of submitted commits (“postsubmit”). At presubmit time, TAP runs the builds/tests that are directly relevant to the team or project corresponding to the change. Changes whose presubmit runs all pass are merged sequentially into the mainline branch. At postsubmit time, we periodically run, at a change close to HEAD, all builds/tests that were potentially affected since the last test cycle.
Traditionally, developers are simply notified of breakages, which they then need to root-cause and fix. Given a range of commits over which a target has broken, “culprit finding” refers to the task of finding the commit that introduced the breakage (the “culprit”). The task is further complicated by the prevalence of nondeterminism (“flakiness”), which requires multiple runs to confirm real breakages. Over the past several years, efficient automated culprit finders have been deployed to determine the culprit changes for all postsubmit breakages.
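The bisection-style search underlying culprit finding, together with retries to absorb flakiness, can be sketched as follows. This is an illustrative assumption about the approach, not TAP's actual implementation; `passes_with_retries` and the deterministic stub in the usage example are hypothetical.

```python
def find_culprit(commits, passes):
    """Binary-search a commit range for the first failing commit (the culprit).

    Assumes commits[0] passes (the last known-green change) and commits[-1]
    fails. `passes(commit)` should already absorb flakiness, e.g. by retrying.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] passes, commits[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(commits[mid]):
            lo = mid  # breakage was introduced after mid
        else:
            hi = mid  # breakage was introduced at or before mid
    return commits[hi]


def passes_with_retries(commit, run_once, attempts=3):
    """Treat a target as broken only if it fails on every attempt,
    guarding against flaky (nondeterministic) failures."""
    return any(run_once(commit) for _ in range(attempts))
```

With a hypothetical deterministic runner where commits numbered 6 and later are broken, `find_culprit(list(range(10)), lambda c: passes_with_retries(c, lambda x: x < 6))` identifies commit 6 as the culprit in O(log n) test runs rather than n.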
The availability of this labeled dataset of postsubmit breakages, each attributed to its culprit, opens the possibility of predictive test selection at postsubmit time and motivates our exploratory analysis of which features of submitted code commits are predictive of changes that introduce code defects. To make testing more compute-efficient and decrease the latency in discovering breakages introduced into the codebase, TAP Postsubmit is developing a new scheduling algorithm that utilizes bug-prediction metrics, features from the change, and historical information about the targets to predict the likelihood of a target being broken by a change. Using these predictions, small subsets of targets at increased risk may be scheduled more frequently to uncover breakages more quickly. This work examines the association between our selected features and culprits in the Google codebase.
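The scheduling idea described above — running the riskiest subset of targets more often — can be sketched as a simple budgeted ranking. The function below is a minimal illustration, assuming a learned model is available behind the hypothetical `predict_risk` callable; it is not the actual TAP scheduler.

```python
def schedule_highest_risk(pairs, predict_risk, budget):
    """Pick the `budget` (change, target) pairs with the highest predicted
    breakage probability to run in this test cycle; the remainder wait for
    a later, more comprehensive cycle.

    `predict_risk` stands in for a learned model: any callable mapping a
    pair to a probability-like score.
    """
    return sorted(pairs, key=predict_risk, reverse=True)[:budget]
```

For example, with scores `{"a": 0.9, "b": 0.1, "c": 0.5}` and a budget of 2, the scheduler selects `["a", "c"]` and defers `"b"`.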
Previous work at Google and elsewhere has found test execution history and coarse-grained code metrics to be useful for test selection. Culprit finding now allows us to perform this analysis at the individual-commit level instead of at postsubmit-cycle granularity. We have looked more closely at other features readily accessible in real time, such as tokens within the change description and the build-graph distance between targets and the files within the commit, and found that many are fairly predictive of changes that introduce breakages. For example, the presence of individual tokens in the change description plus simple metrics such as LOC can capture 98% of culprit changes while filtering out 30% of safe changes. In this paper, we present our results for these features and others, along with the implied resource savings if used for test selection at Google.
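A feature-based filter of the kind described above — flagging changes for full testing based on description tokens plus simple size metrics — might look like the sketch below. The token list and LOC threshold are made up for illustration; they are not the features or thresholds reported in the paper.

```python
# Illustrative token list; NOT the actual predictive tokens found in the study.
RISKY_TOKENS = {"refactor", "rollback", "migrate"}


def is_potential_culprit(description, loc_changed, loc_threshold=50):
    """Flag a change for full testing if its description contains a risky
    token or it touches many lines; otherwise it may be a candidate to skip.

    A toy stand-in for a learned token+LOC model: both the tokens and the
    threshold here are hypothetical.
    """
    words = {w.strip(".,:;!?").lower() for w in description.split()}
    return bool(words & RISKY_TOKENS) or loc_changed > loc_threshold
```

Under these toy assumptions, a 10-line change described as "Fix typo in comment" would be filtered out as safe, while "Refactor storage layer" or any change over 50 lines would be scheduled for testing.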
Thu 20 Apr, 11:00 - 12:30 (Dublin time zone)
- How we use Hermetic, Ephemeral Test Environments at Google to reduce Test Flakiness — Carlos Arguelles, Google LLC
- Enabling Pre-Merge CI on your TV
- What Breaks Google?