  • Background

Google’s development environment features a branchless, centralized repository with sequential versioning [1]. Code compilation and dependencies are managed by a declarative build system (Blaze/Bazel). To balance testing comprehensiveness, cost, and release speed, Google uses a four-stage testing strategy [2][3]:

i. Development: Developers interactively test changes locally, running existing tests they think might be relevant and writing new tests as they iterate through a task.

ii. Presubmit: As changes near completion, pre-submission testing is performed. Tests at this stage are chosen mostly via static analysis over the build dependency graph, along with some dynamic selection [4] (a minimal sketch of the static approach follows this list). Developers iterate on their code until all tests pass, at which point the code is submitted immediately; there is no submit queue, so test and compile breakages can slip into the codebase.

iii. Postsubmit: Google then uses post-submission testing to detect these breakages, running all tests that could have been affected, with minimal test selection given package-level dependencies. This provides groups of tests (projects) with statuses at regular intervals.

iv. Release: When a group of code changes is ready for promotion (dev → qa → prod), we evaluate project statuses from post-submission testing. If all projects within the release unit are passing (green), this triggers a final round of comprehensive testing, which ensures compliance with production code requirements.
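To make the dependency-based presubmit selection concrete, here is a minimal Python sketch, assuming a toy reverse-dependency graph with hypothetical target names; this is not Blaze’s actual API, only the idea of walking the build graph from the changed targets to every test that could be affected.

from collections import deque

# Hypothetical build graph: target -> targets that directly depend on it.
REVERSE_DEPS = {
    "//base:strings": ["//app:server", "//base:strings_test"],
    "//app:server": ["//app:server_test", "//app:integration_test"],
    "//base:strings_test": [],
    "//app:server_test": [],
    "//app:integration_test": [],
}

def affected_tests(changed_targets):
    """Collect every test target reachable from the changed targets."""
    seen = set(changed_targets)
    queue = deque(changed_targets)
    tests = set()
    while queue:
        target = queue.popleft()
        if target.endswith("_test"):
            tests.add(target)
        for dependent in REVERSE_DEPS.get(target, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(tests)

print(affected_tests(["//base:strings"]))
# -> ['//app:integration_test', '//app:server_test', '//base:strings_test']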

• Paradigm Shift

– Resource Scarcity

As a company with a growing codebase and developer pool, Google has seen this approach to testing incur super-linear growth (with respect to the number of tests and the size of the codebase) in compute cost and in the latency of breakage signal, which in turn limits release frequency. For example, if both the rate of changes and the number of tests grow linearly with the number of developers, the total number of test executions grows quadratically. To scale effectively, we must decrease the number of builds and tests performed at earlier stages while maintaining signal, reducing testing costs without negatively impacting developer productivity.

– Shifting Load

We approach the problem by “shifting left” and “shifting right” simultaneously. Shifting left means avoiding or preventing as many defects as possible, as early as is reasonable. This involves running some tests earlier in the software development lifecycle, and often more frequently. These are tests with a high failure-probability-to-cost ratio, i.e., they are cheap to run, fail fast (and require developers to do something to fix them), and let us maintain confidence in the signal, i.e., they are deterministic (low flakiness) and hermetic (independent of network, machine, test order, etc.). Shifting right means pushing the execution of tests that are going to continue passing later in the development cycle, to save resources.
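The “failure-probability to cost ratio” criterion can be made concrete with a small sketch. Everything here is a hypothetical illustration (the Test fields, probabilities, costs, and the select_for_early_execution helper), not a production model: tests are greedily ranked by predicted failure probability per unit cost until a compute budget is spent.

from dataclasses import dataclass

@dataclass
class Test:
    name: str
    p_fail: float        # predicted probability the test fails on this change
    cost_cpu_sec: float  # estimated execution cost

def select_for_early_execution(tests, budget_cpu_sec):
    """Greedily pick tests with the best failure-probability/cost ratio
    until the compute budget is exhausted."""
    ranked = sorted(tests, key=lambda t: t.p_fail / t.cost_cpu_sec, reverse=True)
    chosen, spent = [], 0.0
    for t in ranked:
        if spent + t.cost_cpu_sec <= budget_cpu_sec:
            chosen.append(t)
            spent += t.cost_cpu_sec
    return chosen

tests = [
    Test("unit_parser", p_fail=0.02, cost_cpu_sec=5),
    Test("integration_e2e", p_fail=0.05, cost_cpu_sec=600),
    Test("unit_cache", p_fail=0.01, cost_cpu_sec=2),
]
for t in select_for_early_execution(tests, budget_cpu_sec=60):
    print(t.name)  # -> unit_cache, unit_parser; the slow e2e test is deferred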

As our main effort to shift left, we introduce another, more frequent “speculative” testing stage to supplement comprehensive postsubmit testing. This stage uses ML to speculatively identify a small subset of the tests in scope during postsubmit testing and run them proactively, much more frequently, in order to give developers breakage signal faster.
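A minimal sketch of how such a speculative stage could be scheduled, assuming a hypothetical model_score stand-in for the ML model and arbitrary cadences: a small, model-chosen subset runs on every tick, while the comprehensive suite keeps a coarser cadence.

import random

ALL_TESTS = [f"//proj:test_{i}" for i in range(1000)]
COMPREHENSIVE_PERIOD = 8  # ticks between full postsubmit runs (assumed)
SUBSET_FRACTION = 0.05    # fraction of the suite the model may select (assumed)

def model_score(test):
    # Stand-in for an ML model estimating per-test breakage probability.
    return random.random()

def run(tests):
    print(f"running {len(tests)} tests")

for tick in range(24):
    if tick % COMPREHENSIVE_PERIOD == 0:
        run(ALL_TESTS)  # comprehensive postsubmit run
    else:
        k = int(len(ALL_TESTS) * SUBSET_FRACTION)
        speculative = sorted(ALL_TESTS, key=model_score, reverse=True)[:k]
        run(speculative)  # fast, cheap breakage signal between full runs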

As we increase confidence in finding breakages earlier, we can start to “shift right” comprehensive testing for releases: optimistically cutting release candidates (RCs), promoting RCs to pre-production environments, and performing comprehensive testing before promoting to production / end users. This leverages the accuracy of the speculative testing approach above to trade off a rare chance of tests failing at the final release-promotion stage for significantly less testing load and lower release latency.
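Sketched as a workflow, with all stage names and helper functions as hypothetical stubs: the RC is cut and promoted to pre-production optimistically, and comprehensive testing gates only the final promotion to production.

def postsubmit_green(release_unit):
    return True  # stub: speculative + postsubmit signal for the unit

def run_comprehensive_tests(rc):
    return True  # stub: full release-qualification suite

def deploy(rc, env):
    print(f"deploying {rc} to {env}")

def promote(release_unit):
    if not postsubmit_green(release_unit):
        return "blocked: fix breakages first"
    rc = f"{release_unit}@rc"        # optimistically cut the RC...
    deploy(rc, "qa")                 # ...and promote it to pre-production
    if run_comprehensive_tests(rc):  # comprehensive testing moves here
        deploy(rc, "prod")
        return "released"
    return "rolled back: comprehensive tests failed in pre-production"

print(promote("//app"))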

We additionally shift presubmit testing load right via aggressive ML-based test filtering at presubmit time, both skipping tests deemed unlikely to break and ignoring test failures deemed highly likely to be the result of flakes.
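A minimal sketch of these two presubmit filters, with both model calls (p_break, p_flake) and both thresholds as hypothetical stand-ins: tests deemed very unlikely to break are skipped entirely, and failures deemed very likely to be flakes do not block the submit.

SKIP_THRESHOLD = 0.001   # assumed: below this break probability, skip the test
FLAKE_THRESHOLD = 0.95   # assumed: above this, treat a failure as a flake

def p_break(test, change):
    return 0.5           # stub for an ML model predicting breakage

def p_flake(test, failure):
    return 0.2           # stub for an ML model predicting flakiness

def presubmit_verdict(tests, change, execute):
    blocking_failures = []
    for test in tests:
        if p_break(test, change) < SKIP_THRESHOLD:
            continue                 # shifted right: left for postsubmit
        failure = execute(test)      # run the test; None means it passed
        if failure is not None and p_flake(test, failure) < FLAKE_THRESHOLD:
            blocking_failures.append(test)  # real signal; block the submit
    return "pass" if not blocking_failures else f"fail: {blocking_failures}"

print(presubmit_verdict(["//app:server_test"], change="cl/123",
                        execute=lambda test: None))  # -> pass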

References

[1] R. Potvin and J. Levenberg, “Why Google stores billions of lines of code in a single repository,” Communications of the ACM, vol. 59, no. 7, pp. 78–87, 2016.
[2] J. Micco, “Continuous Integration at Google Scale,” EclipseCon 2013, Mar. 2013. [Online].
[3] P. Gupta, M. Ivey, and J. Penix, “Testing at the speed and scale of Google,” 2011. [Online].
[4] A. Memon, Z. Gao, B. Nguyen, S. Dhanda, E. Nickell, R. Siemborski, and J. Micco, “Taming Google-scale continuous testing,” in 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). Piscataway, NJ, USA: IEEE, May 2017, pp. 233–242.