ICST 2023
Sun 16 - Thu 20 April 2023 Dublin, Ireland
Thu 20 Apr 2023 11:30 - 12:00 at Macken - Session 2

RokuOS is a proprietary “distroless” Linux incorporating a C++20 userspace, C drivers and middleware, and C++ or BrightScript channel code. RokuOS supports dozens of different Roku devices across a variety of television, player, and speaker hardware platforms. As a result of our scale, we face unique challenges in terms of the complexity of our build system, the varieties of builds we offer, and the need for a high degree of correctness and verification in a short time frame. To address these challenges, we have implemented pre-merge and post-merge continuous integration (CI) checks. Pre-merge CI keeps incoming changes out of the software repository unless and until those changes have achieved a base level of trust by passing some tests, ensuring a working build for all committers. We present the changes that we have made. The data shows that the frequency of build breakages was nearly halved, from 8.70 percent to 4.85 percent.

Trying to make changes to the RokuOS build infrastructure is often likened to changing the wings of an aircraft while it is in flight. We’ve made some major changes to the way we build firmware in order to increase developer and tester productivity. This talk will discuss those changes, the motivations behind them, the risks, the rewards, and our experiences.

Historically, RokuOS source code resided in a single, monolithic repository (“monorepo”) hosted in Perforce. RokuOS is a long-lived code base that has evolved over the past 15 years. We focused exclusively on post-submit verification to report when a committer’s change caused a new failure. With the rate of incoming changes, the breadth of supported device platforms, and the turnaround time for breakages, development would sometimes come to a standstill as we backed out changes or resolved conflicts manually. We took this as a call to action to improve the developer experience.

In chronological order, these are the major changes we’ve made:

  • Moving our software configuration management and version control system from Perforce to git, a necessity for some later steps.
  • Migrating the RokuOS code base from a giant monorepo to a multirepo with submodules for separate software components.
  • Restructuring RokuOS to move vendor code into separate git submodules, with strong APIs between them.
  • Enabling merge commits, to allow changes from multiple contributors to be merged without forced synchronization points between them.
  • Adding pre-merge continuous integration (CI) checks to merges.
  • Making a passing run of pre-merge CI mandatory.
  • Increasing the scope of pre-merge CI testing, and managing the scope and run time of those tests.
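The last three steps can be illustrated with a minimal sketch of a mandatory pre-merge gate that also keeps its run time bounded. This is not Roku's implementation: the check names, durations, time budget, and selection policy (cheapest optional checks first) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One pre-merge check. Names and durations below are hypothetical."""
    name: str
    minutes: int      # estimated run time
    required: bool    # must run on every merge request


def select_checks(checks, budget_minutes):
    """Pick all required checks, then add optional ones while the budget allows."""
    selected = [c for c in checks if c.required]
    used = sum(c.minutes for c in selected)
    # Cheapest optional checks first, so the budget covers as many as possible.
    optional = sorted((c for c in checks if not c.required), key=lambda c: c.minutes)
    for c in optional:
        if used + c.minutes <= budget_minutes:
            selected.append(c)
            used += c.minutes
    return selected


def may_merge(selected, results):
    """Allow the merge only if every selected check reported a pass."""
    return all(results.get(c.name) is True for c in selected)


# Hypothetical check suite with a 60-minute pre-merge budget.
checks = [
    Check("build-all-platforms", 20, True),
    Check("unit-tests", 10, True),
    Check("emulated-smoke-test", 15, False),
    Check("full-regression", 120, False),
]
selected = select_checks(checks, budget_minutes=60)
print([c.name for c in selected])

# A missing or failed result blocks the merge.
results = {"build-all-platforms": True, "unit-tests": True, "emulated-smoke-test": False}
print(may_merge(selected, results))
```

In this sketch the long-running regression suite does not fit the pre-merge budget, so it would be left to post-merge CI, while a single failing or missing pre-merge result is enough to block the merge.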

As an embedded firmware build, our multitude of supported platforms requires us to generate - and verify - many different flavors of builds. A final build for a particular hardware platform will contain a combination of general and specific code, packaged in a manner that is appropriate for that platform, its capabilities, and its limitations. Performing such a multitude of builds requires leveraging common code and build artifacts to avoid repeated compilation.
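One common way to avoid repeated compilation across build flavors is a content-addressed artifact cache. The sketch below assumes that approach; the abstract does not say which mechanism RokuOS builds use, and the component names, platform names, toolchain string, and flags are hypothetical.

```python
import hashlib


def artifact_key(component, sources, toolchain, flags):
    """Derive a cache key from everything that affects the compiled output.

    `sources` maps a file path to its contents as bytes.
    """
    h = hashlib.sha256()
    for part in (component, toolchain, *sorted(flags)):
        h.update(part.encode())
        h.update(b"\0")
    for path in sorted(sources):
        h.update(path.encode())
        h.update(b"\0")
        h.update(sources[path])
        h.update(b"\0")
    return h.hexdigest()


class ArtifactCache:
    """In-memory stand-in for a shared build artifact store."""

    def __init__(self):
        self._store = {}
        self.compiles = 0

    def get_or_build(self, key, build):
        # Only invoke the (expensive) build step on a cache miss.
        if key not in self._store:
            self._store[key] = build()
            self.compiles += 1
        return self._store[key]


def build_flavor(cache, platform, toolchain):
    """Combine a shared component with a platform-specific one (both hypothetical)."""
    common_src = {"core/app.cpp": b"int main() { return 0; }"}
    specific_src = {f"bsp/{platform}.c": platform.encode()}
    common = cache.get_or_build(
        artifact_key("core", common_src, toolchain, ["-O2"]),
        lambda: f"core.o[{toolchain}]",
    )
    specific = cache.get_or_build(
        artifact_key("bsp", specific_src, toolchain, ["-O2"]),
        lambda: f"bsp-{platform}.o",
    )
    return (common, specific)


cache = ArtifactCache()
for platform in ("tv-a", "player-b", "speaker-c"):
    build_flavor(cache, platform, toolchain="arm-gcc")
print(cache.compiles)  # 4: one shared core build plus three platform-specific builds
```

Because the key covers sources, toolchain, and flags, the general code is compiled once and reused by every flavor that shares those inputs, while any change to an input produces a new key and a fresh build.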

To shorten the turnaround time of our integration checks, we implemented an emulation layer to quickly vet builds for cross-platform correctness. The layer involves execution of the core RokuOS natively on an x86 or ARM host within a virtualized environment. This allows us to perform immediate testing without the need to deploy builds to hardware devices.

In addition to the aforementioned technical challenges, there were organizational challenges. Developers and workflows were entrenched in Perforce. Moving to submodules required software architecture changes. Merge commits required training engineers to read through git commits in a new way. Forcing passing pre-merge CI checks resulted in additional process for committers when merges failed. Implementing both pre-merge and post-merge CI has required us to take a fresh look at how we design and perform testing on our code base to build trust in the code, from “it compiles” to “we are ready to deploy this code to tens of millions of devices”.

We were able to justify the sweeping changes we’ve made thus far with data. By our metrics, the frequency of build breakages was reduced from 8.70 percent to 4.85 percent over a period of three months.
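As a quick check of what these two figures imply, the short calculation below derives the absolute and relative reduction. Only the 8.70 and 4.85 percent values come from the text; reading the improvement as a relative drop in breakage frequency is an interpretation.

```python
# Build breakage frequency, in percent of builds, as reported above.
before = 8.70
after = 4.85

absolute_drop = before - after            # in percentage points
relative_drop = absolute_drop / before    # fraction of the original breakage rate

print(f"{absolute_drop:.2f} percentage points")  # 3.85 percentage points
print(f"{relative_drop:.1%}")                    # 44.3%
```

In other words, the breakage rate fell by a little under half relative to where it started.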
