Award Winner 2019
Understanding and modeling error propagation in programs
by Guanpeng Li
Press Release
Abstract
Hardware errors are projected to increase in modern computer systems due to shrinking feature sizes and increasing manufacturing variations. The impact of hardware faults on programs can be catastrophic, and can lead to substantial financial and societal consequences. Error propagation is often the leading cause of catastrophic system failures, and hence must be mitigated. Traditional hardware only techniques to avoid error propagation are energy hungry, and hence not suitable for modern computer systems (i.e., commodity systems). Researchers have proposed selective software-based protection techniques to prevent error propagation at lower costs. However, these techniques use expensive fault injection simulations to determine which parts of a program must be protected. Fault injection simulation artificially introduces a fault to program execution and observe failures (if any) upon the completion of the program execution. Thousands of such simulations need to be performed in order to achieve statistical significance. It is time-consuming as even a single program execution of a common application may take a long time. In this dissertation, I first characterize error propagation in programs that lead to different types of failures, proposed both empirical and analytical approaches to identify and mitigate error propagation without expensive fault injections. The key observation is that only a small fraction of states are responsible for almost all error propagation in programs, and the propagation falls into identifiable patterns which can be modeled efficiently. The proposed techniques are nearly as close as fault injection approaches in measuring failure rates of programs, and orders of magnitude faster than fault injections. This allows developers to build low-cost fault-tolerant applications in an extremely efficient manner.