Widespread Error Detection in Large Scale Continuous Integration Systems (CCIW 2024)

Mon 27 - Fri 31 May 2024 Canada

Who

Stanislaw Swierc, James Lu, Thomas Yi

Track

CCIW 2024

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

When

Tue 28 May 2024 09:15 - 09:40 at Room 5 - CCIW Session 1 Chair(s): Tim A. D. Henderson

Abstract

Continuous Integration (CI) systems are widely used in the software industry to validate and integrate code changes into central repositories. At a high level, they work by building projects and running automated tests. If all tests pass, the change can be safely integrated; if any test fails, the change is blocked and the author is asked to revise and resubmit it.

For this process to work as intended, automated tests should fail if and only if the change contains one or more regressions. If a test behaves non-deterministically and fails in the absence of any regression, the change might be incorrectly blocked, creating unnecessary work for the author. Integration tests that depend on external services are particularly prone to this problem. If any dependency is down or has degraded performance, many tests can fail in a short period and block many developers. Meta Platforms deployed a system to detect and mitigate such events.

When a CI job fails, a workflow starts to assess whether the job was affected by a widespread error. First, the error text is extracted from logs using predefined heuristics or automated methods that find differences between the logs of failing and passing executions. Then, this text is fuzzy-matched against a database of recently observed errors. If a near-identical match exists, its statistics are updated; otherwise, a new error is added to the database. Finally, the statistics are checked against thresholds, and errors that occur unusually often are marked as widespread. Once an error is marked, it is either demoted to a warning or enriched with information about the incident and the ongoing investigation.
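The workflow in the paragraph above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `normalize`, `extract_error`, `ErrorRecord`, and `WidespreadErrorDetector` are hypothetical, the similarity cutoff, count threshold, and time window are made-up values, the log diff is a simple set difference, and the standard library's `difflib.SequenceMatcher` stands in for whatever fuzzy matching the real system uses.

```python
import difflib
import re
import time
from dataclasses import dataclass, field
from typing import List, Optional


def normalize(text: str) -> str:
    """Mask volatile tokens (hex values, numbers) so similar errors compare alike."""
    text = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", text)
    text = re.sub(r"\d+", "<N>", text)
    return " ".join(text.split()).lower()


def extract_error(failing_log: List[str], passing_log: List[str]) -> str:
    """Crude log diff: keep failing-log lines that never appear in the passing log."""
    seen_in_passing = {normalize(line) for line in passing_log}
    return "\n".join(
        line for line in failing_log if normalize(line) not in seen_in_passing
    )


@dataclass
class ErrorRecord:
    text: str  # normalized error text
    timestamps: List[float] = field(default_factory=list)  # recent occurrences


class WidespreadErrorDetector:
    """In-memory stand-in for the database of recently observed errors."""

    def __init__(self, similarity: float = 0.9, threshold: int = 3,
                 window_seconds: float = 600.0) -> None:
        # All three defaults are illustrative assumptions, not values from the talk.
        self.similarity = similarity
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.records: List[ErrorRecord] = []

    def _find(self, key: str) -> Optional[ErrorRecord]:
        """Return the most similar stored error at or above the cutoff, if any."""
        best, best_ratio = None, 0.0
        for record in self.records:
            ratio = difflib.SequenceMatcher(None, key, record.text).ratio()
            if ratio >= self.similarity and ratio > best_ratio:
                best, best_ratio = record, ratio
        return best

    def observe(self, error_text: str, now: Optional[float] = None) -> bool:
        """Record one occurrence and report whether the error is now widespread."""
        now = time.time() if now is None else now
        key = normalize(error_text)
        record = self._find(key)
        if record is None:
            # No near-identical match: add a new error to the store.
            record = ErrorRecord(text=key)
            self.records.append(record)
        # Update statistics: drop occurrences outside the window, add this one.
        record.timestamps = [
            t for t in record.timestamps if now - t <= self.window_seconds
        ]
        record.timestamps.append(now)
        return len(record.timestamps) >= self.threshold
```

In this sketch, a caller would pass the output of `extract_error` to `observe` for every failed job and, when `observe` returns `True`, demote the failure to a warning or attach incident information. A production system would need persistent storage and an indexed similarity search rather than a linear scan.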

The system described in the previous paragraph has been running at Meta Platforms since 2021. During the presentation, we plan to describe it in more detail and share what we learned from developing, extending, and operating it at scale over the past few years.

Link to Publication

https://github.com/StanislawSwierc/CCIW2024-Widespread-Error-Detection

Stanislaw Swierc

Meta Platforms, Inc.

United States

James Lu

Meta Platforms, Inc.

Thomas Yi

Meta Platforms, Inc.

Session Program

Tue 28 May
Displayed time zone: Eastern Time (US & Canada)

08:30 - 10:30	CCIW Session 1 (CCIW) at Room 5, Chair(s): Tim A. D. Henderson (Google)

08:30	20m	Day opening	Welcome to CCIW	Tim A. D. Henderson (Google)
08:50	25m	Talk	Thinktank: Leveraging LLM Reasoning for Advanced Task Execution in CI/CD	Tim Keller (SAP SE)
09:15	25m	Talk	Widespread Error Detection in Large Scale Continuous Integration Systems	Stanislaw Swierc, James Lu, Thomas Yi (Meta Platforms, Inc.)
09:40	25m	Talk	Scalable Continuous Integration using Remote Execution	Ola Rozenfeld, Ulf Adams (EngFlow Inc.)
10:05	25m	Talk	Replay-Based Continual Learning for Test Case Prioritization	Asma Fariha, Akramul Azim, Ramiro Liscano (Ontario Tech University)