How we use Hermetic, Ephemeral Test Environments at Google to reduce Test Flakiness
The traditional industry approach of having long-lived Systems Under Test (SUTs) tends to create flakiness in your integration tests in a number of ways. One, these SUTs make calls to dependencies that you don’t own, and these dependencies can be flaky (particularly if your SUT is using other teams’ test stacks for its dependencies instead of their prod stacks). Two, communication between your SUT and its dependencies generally happens over the network, which comes with its own unavoidable hardware flakiness. Three, if it is a long-lived SUT, a test can neglect to clean up after itself, leaving your system data in an inconsistent or unexpected state that causes another, independent test to fail. Four, if the SUT is shared by many tests, two tests can try to write the same data at the same time and end up in a non-deterministic state, again creating flakiness in your test runs.
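To make the last two failure modes concrete, here is a simplified, hypothetical sketch (the order-handling functions and the module-level dict standing in for the shared SUT's datastore are invented for illustration): the second test assumes a clean datastore, so it fails whenever the first test has run before it or runs concurrently with it.

```python
# Simplified sketch: the shared, long-lived SUT's storage is modeled as a
# module-level dict that survives across tests (and across test runs).
shared_datastore = {}

def create_order(order_id, amount):
    shared_datastore[order_id] = amount

def apply_discount(order_id, percent):
    shared_datastore[order_id] *= (100 - percent) / 100

def test_discount_applied():
    create_order("order-1", 100)
    apply_discount("order-1", percent=10)
    assert shared_datastore["order-1"] == 90
    # Note: no cleanup -- "order-1" is left behind in the shared datastore.

def test_single_order_exists():
    create_order("order-2", 50)
    # Passes only against a clean datastore; fails whenever test_discount_applied
    # ran first and left its data behind, or writes concurrently while this
    # test is running.
    assert len(shared_datastore) == 1
```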
There are other problems besides flakiness. Since your calls to dependencies happen over the network, they can not only be flaky but also slow, and a lot of your test run time can be wasted simply going back and forth across the network. Sometimes your dependencies can’t handle some of the calls you are making, particularly if your calls are non-idempotent. And for load testing, sometimes your dependencies can’t handle the load your test will pass on to them. Debugging failures can be non-trivial, because your log files are highly distributed: some on the machine running your test, some on the machine running your service, and some on the machines running your dependencies.
To address these problems, at Google we’ve invested in ephemeral, hermetic SUTs as a best practice, integrating them into our CI/CD infrastructure. First, we created a universal framework for defining, configuring and running SUTs. As more of the company uses the framework, we benefit from the network effect: if your dependencies are already modeled as SUTs, including them in your SUT is simple. Your dependencies are SUT components blessed by the team that owns each dependency, which reduces the flakiness encountered with traditional shared test dependencies. Our infrastructure starts these components in sandboxed containers, and provided you have sufficient hardware resources, they can all be started on the same machine, removing the flakiness and slowness of making calls across a physical network. Because the infrastructure understands where your dependencies are, it can provide a unified debugging experience with all the logs from your test, your service and all its dependencies.
Since the SUTs can be spawned per test, we eliminate the problems introduced by multiple tests running concurrently and writing the same data, or a previous test leaving the datastore in an inconsistent state: every test starts from a predictable, clean state. It’s also simpler: your tests are not responsible for restoring data to its original state after running, because the SUT is simply torn down after the test. And because you have your own copy of each of your dependencies, your tests can safely make non-idempotent calls, and you can load-test without overwhelming your actual dependencies.
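Google's SUT framework is internal, but the core per-test pattern can be approximated with open-source tooling. Below is a minimal sketch using the testcontainers library and SQLAlchemy (an analogy, not the framework described above): the test spawns its own sandboxed datastore, uses it, and tears it down, so no state survives the test and no traffic leaves the machine.

```python
import sqlalchemy
from testcontainers.postgres import PostgresContainer

def test_order_insert():
    # Each test spins up its own ephemeral Postgres container and tears it
    # down on exit: a clean, private datastore with no cross-test interference
    # and no calls to a shared, long-lived test stack.
    with PostgresContainer("postgres:16-alpine") as postgres:
        engine = sqlalchemy.create_engine(postgres.get_connection_url())
        with engine.begin() as conn:
            conn.execute(sqlalchemy.text(
                "CREATE TABLE orders (id TEXT PRIMARY KEY, amount INT)"))
            conn.execute(sqlalchemy.text(
                "INSERT INTO orders VALUES ('order-1', 100)"))
            amount = conn.execute(sqlalchemy.text(
                "SELECT amount FROM orders WHERE id = 'order-1'")).scalar()
        assert amount == 100
```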
Ephemeral, hermetic SUTs have significantly reduced flakiness in integration tests across Google, but this approach has introduced some complexity that we needed to find solutions for.
The most obvious one is that spawning SUTs can be expensive, both in terms of time and hardware resources. It can take ten, twenty, even thirty minutes to start a complex SUT with many dependencies, increasing time to production, which is antithetical to the principles of CI/CD. We have taken a multi-pronged approach to solving this problem. First, we have invested in telemetry and exposing startup data to our engineers, so that they can understand why starting their SUTs takes a long time and optimize it (for example, it could be because of a particular dependency that can easily be replaced by a fake, mock or stub). Second, for applications where acquiring an SUT in seconds is critical, we have created infrastructure to keep a pool of pre-warmed SUTs that can be leased (with the caveat that keeping this pool introduces cost). Third, for some cases where it makes sense, we can relax the ephemeral or hermetic principles, by doing things like reusing a pre-existing SUT across multiple tests, or letting your SUT make calls to long-lived test or prod stacks rather than spinning up its own. And lastly, we have infrastructure that can [1] start your SUT with dependencies augmented with interceptors in record mode (so that all calls from your service to its dependencies are recorded), or [2] start your SUT with dependencies replaced by mocks that replay the recorded responses. This is an elegant solution that removes the need to spawn full SUTs on every single test run, although it does come with additional complexity (deciding how often to run your tests in record mode so that your data isn’t stale, and dealing with non-determinism such as an RPC request or response incorporating date/time information or randomness).
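As an illustration of the record/replay idea in [1] and [2], here is a minimal, self-contained sketch (the class and its API are invented for this example, not Google's infrastructure): in record mode, calls are forwarded to the real dependency and captured; in replay mode, the recording is served back so the dependency never needs to be started.

```python
import json

class RecordReplayInterceptor:
    """Invented sketch of a dependency interceptor with record and replay modes."""

    def __init__(self, delegate=None, mode="record", recording_file="recording.json"):
        self.delegate = delegate          # the real dependency client (record mode only)
        self.mode = mode
        self.recording_file = recording_file
        if mode == "replay":
            with open(recording_file) as f:
                self.recording = json.load(f)
        else:
            self.recording = {}

    def call(self, method, request):
        key = f"{method}:{json.dumps(request, sort_keys=True)}"
        if self.mode == "replay":
            # Non-determinism (timestamps, random IDs) in requests or responses
            # must be normalized, or keys will not match the recording.
            return self.recording[key]
        response = self.delegate.call(method, request)
        self.recording[key] = response
        with open(self.recording_file, "w") as f:
            json.dump(self.recording, f)
        return response

# Record mode, run occasionally against a real dependency SUT:
#   dep = RecordReplayInterceptor(delegate=real_payments_client, mode="record")
# Replay mode, used on every regular test run (no dependency SUT spawned):
#   dep = RecordReplayInterceptor(mode="replay")
```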
One more problem is that seeding the right data into your ephemeral, hermetic SUT can be time-consuming. Imagine you want to create an SUT for Google Earth and need to copy terabytes of map data just to have a testable system. We have invested a fair bit into infrastructure to efficiently seed data into SUTs, and are currently researching ways to more intelligently ensure that smaller data sets can achieve the same coverage and representation.
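As a small illustration of the seeding step (the schema and fixture path below are invented, not Google's actual tooling), the idea is to load a compact, representative fixture into the freshly spawned SUT's datastore rather than copying the full production dataset:

```python
import json
import sqlalchemy

def seed_sut(engine, fixture_path="testdata/places_small.json"):
    """Seed a freshly spawned SUT's datastore from a small, checked-in fixture."""
    with open(fixture_path) as f:
        places = json.load(f)  # e.g. a few hundred representative rows, not terabytes
    with engine.begin() as conn:
        conn.execute(sqlalchemy.text(
            "CREATE TABLE IF NOT EXISTS places "
            "(id TEXT PRIMARY KEY, name TEXT, lat REAL, lng REAL)"))
        for place in places:
            conn.execute(
                sqlalchemy.text("INSERT INTO places VALUES (:id, :name, :lat, :lng)"),
                place,
            )
```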
Having a common SUT infrastructure has also allowed us to build a lot of tooling on top of it, such as infrastructure so that Googlers can easily and consistently create and run Functional, Performance and Diff tests once they have defined their SUT.
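For instance, a diff test built on that common abstraction can be as simple as replaying a set of golden requests against a baseline SUT and a candidate SUT and comparing the responses. The sketch below uses the requests library; the endpoints and request set are invented for illustration.

```python
import requests

# Hypothetical golden request set used to compare two SUTs.
GOLDEN_REQUESTS = [
    {"path": "/search", "params": {"q": "dublin"}},
    {"path": "/search", "params": {"q": "seattle"}},
]

def diff_test(baseline_url, candidate_url):
    """Return every request whose response differs between the two SUTs."""
    mismatches = []
    for req in GOLDEN_REQUESTS:
        baseline = requests.get(baseline_url + req["path"], params=req["params"]).json()
        candidate = requests.get(candidate_url + req["path"], params=req["params"]).json()
        if baseline != candidate:
            mismatches.append((req, baseline, candidate))
    return mismatches
```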
I’m a Senior Staff Engineer at Google, working on Developer Infrastructure for Integration Testing. I’ve been at Google for about 3 years. Before that, I was a Principal Engineer at Amazon for 11 years, working on Developer Tools. And before that, I spent 11 years at Microsoft as a Lead Engineer. I have a Master’s in Computer Science from the University of Washington.
My passion for those two and a half decades in the industry has been centered on Engineering Productivity and Core Infrastructure for large software companies. I have deep expertise in software development practices at Google, Amazon and Microsoft: how hundreds of thousands of developers write code, review code, test code and deploy code at large scale.
What’s most interesting to me is that at these large companies, little inefficiencies can aggregate to millions, even hundreds of millions, of dollars of lost productivity or wasted hardware resources. I obsess about how to make engineers’ lives better, remove toil, improve efficiency, and raise the bar in engineering and operational excellence.
Thu 20 Apr, 11:00 - 12:30 (Dublin time zone)
11:00 (30m Talk): How we use Hermetic, Ephemeral Test Environments at Google to reduce Test Flakiness. Carlos Arguelles, Google LLC (CCIW)
11:30 (30m Talk): Enabling Pre-Merge CI on your TV (CCIW)
12:00 (30m Talk): What Breaks Google? (CCIW)