ICST 2024
Mon 27 - Fri 31 May 2024 Canada

Continuous Integration (CI) systems are widely used in the software industry to validate and integrate code changes into central repositories. At a high level, they work by building projects and running automated tests. If all tests pass, the change can be safely integrated; if any test fails, the change is blocked and the author is asked to revise and resubmit it.

For this process to work as intended, automated tests should fail if and only if the change contains one or more regressions. If a test behaves non-deterministically and fails in the absence of any regression, the change may be incorrectly blocked, creating unnecessary work for the author. Integration tests that depend on external services are particularly prone to this problem. If a dependency is down or its performance is degraded, many tests can fail within a short period and block many developers. Meta Platforms deployed a system to detect and mitigate such events.

When a CI job fails, a workflow is started to assess whether the job was affected by a widespread error. First, the error text is extracted from the logs, using either predefined heuristics or automated methods that find differences between the logs of failing and passing executions. Then, this text is fuzzy matched against a database of recently observed errors. If a near-identical match exists, its statistics are updated; otherwise, a new error is added to the database. Finally, the statistics are checked against thresholds, and errors that occur unusually often are marked as widespread. Once an error is marked as widespread, it is either demoted to a warning or enriched with information about the incident and the ongoing investigation.
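The workflow above can be illustrated with a minimal sketch. It is not the system deployed at Meta Platforms: the function and class names, the normalization rules, the use of Python's difflib for fuzzy matching, and all threshold values are assumptions chosen only to make the steps concrete.

```python
import re
from dataclasses import dataclass, field
from difflib import SequenceMatcher

# All three values are hypothetical; the abstract does not state real thresholds.
SIMILARITY_THRESHOLD = 0.9   # minimum ratio to count as a near-identical match
WIDESPREAD_MIN_COUNT = 3     # occurrences within the window to mark as widespread
WINDOW_SECONDS = 3600.0      # how far back "recently observed" reaches


def normalize(text: str) -> str:
    """Mask volatile tokens (hex ids, numbers) and collapse whitespace."""
    text = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", text)
    text = re.sub(r"\d+", "<N>", text)
    return " ".join(text.lower().split())


def extract_error(failing_log: list[str], passing_log: list[str]) -> str:
    """Crude log diff: keep failing-log lines that never occur in the passing log."""
    seen = {normalize(line) for line in passing_log}
    return "\n".join(line for line in failing_log if normalize(line) not in seen)


@dataclass
class ErrorRecord:
    text: str                                   # normalized error text
    timestamps: list[float] = field(default_factory=list)


class ErrorDatabase:
    """In-memory stand-in for the database of recently observed errors."""

    def __init__(self) -> None:
        self.records: list[ErrorRecord] = []

    def observe(self, error_text: str, now: float) -> ErrorRecord:
        """Fuzzy match against known errors; update the match or add a new record."""
        key = normalize(error_text)
        best, best_score = None, 0.0
        for record in self.records:
            score = SequenceMatcher(None, key, record.text).ratio()
            if score > best_score:
                best, best_score = record, score
        if best is None or best_score < SIMILARITY_THRESHOLD:
            best = ErrorRecord(text=key)
            self.records.append(best)
        best.timestamps.append(now)
        return best

    def is_widespread(self, record: ErrorRecord, now: float) -> bool:
        """Check the record's recent occurrence count against the threshold."""
        recent = [t for t in record.timestamps if now - t <= WINDOW_SECONDS]
        return len(recent) >= WIDESPREAD_MIN_COUNT
```

In this sketch, a caller would pass the output of extract_error to observe for every failed job and then consult is_widespread to decide whether to demote the failure to a warning or attach incident information. A linear scan with SequenceMatcher is only reasonable for small databases; a system operating at scale would presumably need an indexed or hashed form of similarity search.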

The system described above has been running at Meta Platforms since 2021. During the presentation, we plan to describe it in more detail and share the lessons learned from developing, extending, and operating it at scale over the last few years.