Faults within CPU circuits, which generate incorrect results and thus silent data corruption, have become endemic at scale. The only generic techniques to detect one-time or intermittent soft errors, such as those caused by particle strikes or voltage spikes, require redundant execution, where each instruction in a program is executed twice and the results are compared.
The only open-source software solution for this task that is available for use today is nZDC, which aims to achieve ``near-zero silent data corruption'' through control- and data-flow redundancy. However, when we tried to apply it to large-scale workloads, we found it suffered from a wide range of false positives, false negatives, compiler bugs and run-time crashes, which made it impossible to benchmark against. This document details the fixes and workarounds we had to put in place to make nZDC work across full benchmark suites. We provide many new insights into the edge cases that make such instruction duplication tricky under complex ISAs such as AArch64 and their similarly complex ABIs. With our extensions, evaluation across SPECint 2006 and Parsec goes from no workloads executing to all bar four, with 2x and 1.6x geomean overhead respectively relative to execution with no fault tolerance.
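The redundant-execution idea described above can be illustrated with a minimal sketch. This is a toy model, not nZDC itself: the three-operand instruction format, the register names and the `fault_at` injection parameter are all hypothetical, and comparing after every instruction is a simplifying assumption made for clarity.

```python
import operator

# Hypothetical toy ISA: each instruction is (op, dest, src1, src2).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class DetectedFault(Exception):
    """Raised when the primary and shadow copies of a result disagree."""


def run_duplicated(program, inputs, fault_at=None):
    """Execute every instruction twice and compare the two results.

    program:  list of (op, dest, src1, src2) tuples.
    inputs:   initial register values, e.g. {"r0": 2, "r1": 3}.
    fault_at: optional (instruction_index, bit) pair; flips that bit in the
              primary result to model a one-time soft error (an assumption
              of this sketch, used only for demonstration).
    """
    primary = dict(inputs)  # registers written by the original instructions
    shadow = dict(inputs)   # registers written by the duplicated instructions

    for index, (op, dest, src1, src2) in enumerate(program):
        # Original instruction, reading only primary registers.
        result = OPS[op](primary[src1], primary[src2])
        if fault_at is not None and fault_at[0] == index:
            result ^= 1 << fault_at[1]  # inject a single bit flip
        primary[dest] = result

        # Duplicate instruction, reading only shadow registers.
        shadow[dest] = OPS[op](shadow[src1], shadow[src2])

        # Compare the two copies; a mismatch means a fault was detected.
        if primary[dest] != shadow[dest]:
            raise DetectedFault(f"mismatch in {dest} at instruction {index}")

    return primary
```

With `run_duplicated([("add", "r2", "r0", "r1")], {"r0": 2, "r1": 3})` the two copies agree and the registers are returned; passing `fault_at=(0, 0)` makes the primary and shadow results differ, so `DetectedFault` is raised instead of letting the corrupted value propagate silently.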
Sat 1 Mar (displayed time zone: Pacific Time, US & Canada)
16:00 - 18:00
16:00 | 30m | Talk | A Comparative Study on the Accuracy and the Speed of Static and Dynamic Program Classifiers | Main Conference | Anderson Faustino da Silva (State University of Maringá), Jeronimo Castrillon (TU Dresden, Germany), Fernando Magno Quintão Pereira (Federal University of Minas Gerais)
16:30 | 30m | Talk | Biotite: A High-Performance Static Binary Translator using Source-Level Information | Main Conference | Changbin Chen (The University of Tokyo), Shu Sugita (University of Tokyo), Yotaro Nada (The University of Tokyo), Hidetsugu Irie (University of Tokyo), Shuichi Sakai (University of Tokyo), Ryota Shioya (University of Tokyo)
17:00 | 30m | Talk | Post-Link Outlining for Code Size Reduction | Main Conference | Shaobai Yuan (Hunan University), Jihong He (Hunan University), Yihui Xie (Hunan University), Feng Wang (Hunan University), Jie Zhao (Hunan University)
17:30 | 30m | Talk | A Deep Technical Review of nZDC Fault Tolerance | Main Conference | Minli Liao (University of Cambridge), Sam Ainsworth (University of Edinburgh), Lev Mukhanov (Queen Mary University London), Timothy M. Jones (University of Cambridge)