A Deep Technical Review of nZDC Fault Tolerance (CC 2025 - Main Conference)

Who

Minli Liao, Sam Ainsworth, Lev Mukhanov, Timothy M. Jones

Track

CC 2025 Main Conference

Time Zone

Times are shown in (GMT-08:00) Pacific Time (US & Canada).

When

Sat 1 Mar 2025 17:30 - 18:00 at Acacia A, in the session "Binary Analysis and Hardware I". Chair(s): Sara Achour

Abstract

Faults within CPU circuits, which generate incorrect results and thus silent data corruption, have become endemic at scale. The only generic techniques to detect one-time or intermittent soft errors, such as those caused by particle strikes or voltage spikes, require redundant execution, where each instruction in a program is executed twice and the two results are compared.
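The idea of redundant execution can be illustrated with a minimal sketch. This is not nZDC's implementation: real instruction duplication is applied by a compiler to the instructions of a program, not to Python functions. The names `redundant`, `FaultDetected`, and `flip_mask` are hypothetical, and `flip_mask` only simulates a bit flip in the primary result so that the comparison has something to catch.

```python
import operator


class FaultDetected(Exception):
    """Raised when the primary and shadow results disagree."""


def redundant(op, *operands, flip_mask=0):
    """Run `op` twice on the same operands and compare before committing.

    `flip_mask` is a hypothetical fault-injection knob: its set bits are
    flipped in the primary result to mimic a soft error.
    """
    primary = op(*operands)   # original computation
    shadow = op(*operands)    # duplicated computation
    primary ^= flip_mask      # simulated fault (no effect when the mask is 0)
    if primary != shadow:
        raise FaultDetected(f"mismatch: {primary:#x} != {shadow:#x}")
    return primary


# Fault-free run: both copies agree, so the value is returned.
result = redundant(operator.add, 40, 2)

# Faulty run: one flipped bit makes the copies disagree.
try:
    redundant(operator.add, 40, 2, flip_mask=1 << 3)
    detected = False
except FaultDetected:
    detected = True
```

In this sketch a fault is caught only if it affects one copy and not the other, which is why the approach targets one-time or intermittent errors rather than permanent ones that would corrupt both copies identically.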

The only software solution for this task that is open source and available for use today is nZDC, which aims to achieve "near-zero silent data corruption" through control- and data-flow redundancy. However, when we tried to apply it to large-scale workloads, we found it suffered from a wide range of false positives, false negatives, compiler bugs and run-time crashes, which made it impossible to benchmark against. This document details the fixes and workarounds we had to put in place to make nZDC work across full suites. We provide many new insights into the edge cases that make such instruction duplication tricky under complex ISAs such as AArch64 and their similarly complex ABIs. Evaluation across SPECint 2006 and PARSEC with our extensions takes us from no workloads executing to all bar four, with 2x and 1.6x geomean overhead respectively relative to execution with no fault tolerance.
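For readers unfamiliar with the "geomean overhead" figure, the following sketch shows how a geometric mean of per-benchmark slowdowns is typically computed. The function name `geomean` and the input ratios are illustrative assumptions; the ratios are made-up values, not measurements from the paper.

```python
import math


def geomean(ratios):
    """Geometric mean of positive slowdown ratios (protected time / baseline time)."""
    ratios = list(ratios)
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty collection of positive numbers")
    # Average in log space, then exponentiate back.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Made-up example: one benchmark with no slowdown and one running 4x slower.
overhead = geomean([1.0, 4.0])  # approximately 2.0
```

The geometric mean is the usual choice for summarising ratios across a benchmark suite because a single very slow workload influences it less than it would an arithmetic mean.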

Link to Preprint

https://www.cl.cam.ac.uk/~tmj32/papers/docs/liao25-cc.pdf

Minli Liao

University of Cambridge

Sam Ainsworth

University of Edinburgh

United Kingdom

Lev Mukhanov

Queen Mary University of London

Timothy M. Jones