Enabling Software Resilience in GPGPU Applications via Partial Thread Protection
Graphics Processing Units (GPUs) are widely used to accelerate computation in many fields, but they remain susceptible to soft errors that can easily compromise application output. By taking advantage of the hierarchical organization of GPU applications into threads, warps, and cooperative thread arrays, we propose a framework that identifies the resilience of individual threads and aims to map threads with the same resilience characteristics to the same warp. This allows replication mechanisms for error detection/correction to be engaged at the warp level. By exploring 12 benchmarks (17 kernels) from 4 benchmark suites, we illustrate that threads can be remapped into reliable or unreliable warps with an average overhead of only 1.63%, so that only those groups of threads that truly need protection are selectively protected. Furthermore, we show that remapping threads to different warps does not sacrifice application performance; surprisingly, it even improves execution in some cases. In addition, we show how this remapping facilitates warp replication for error detection and/or correction, achieving average savings of 20.61% and 27.15% of execution cycles compared to standard duplication and triplication, respectively.
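To make the remapping idea concrete, the following is a minimal host-side sketch in Python, not the framework's actual implementation. It assumes a per-thread boolean label (`vulnerable`) is already available; how such labels are derived is the task of the resilience analysis and is not modeled here. The function names `remap_threads` and `warps_to_protect` are hypothetical, and a warp size of 32 is assumed, as on NVIDIA GPUs.

```python
WARP_SIZE = 32  # assumed warp width (NVIDIA GPUs)


def remap_threads(vulnerable):
    """Return a permutation: order[slot] is the original thread placed in that slot.

    Threads labeled vulnerable are packed first, so they share the leading warps.
    The sort is stable, so the relative order inside each group is preserved.
    """
    return sorted(range(len(vulnerable)), key=lambda t: not vulnerable[t])


def warps_to_protect(vulnerable, order, warp_size=WARP_SIZE):
    """List the warp indices that contain at least one vulnerable thread."""
    protected = []
    for start in range(0, len(order), warp_size):
        if any(vulnerable[t] for t in order[start:start + warp_size]):
            protected.append(start // warp_size)
    return protected


# Hypothetical labels: every even-numbered thread of a 64-thread block is vulnerable.
vulnerable = [t % 2 == 0 for t in range(64)]
identity = list(range(64))

before = warps_to_protect(vulnerable, identity)                  # [0, 1]
after = warps_to_protect(vulnerable, remap_threads(vulnerable))  # [0]
```

In this toy case the original layout spreads vulnerable threads across both warps, so warp-level replication would have to cover both; after remapping, they fit in a single warp and only that warp needs to be replicated.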