Silent Data Corruptions at Scale
Silent Data Corruption (SDC) can have a negative impact on large-scale
infrastructure services. SDCs are not captured by error reporting mechanisms
within a Central Processing Unit (CPU) and hence are not traceable at the
hardware level. However, the data corruptions propagate across the stack and
manifest as application-level problems. These types of errors can result in
data loss and can require months of debug engineering time. In this paper, we
describe common defect types observed in silicon manufacturing that lead to
SDCs. We discuss a real-world example of silent data corruption within a
datacenter application. We provide the debug flow followed to root-cause and
triage faulty instructions within a CPU using a case study, as an illustration
of how to debug this class of errors. We provide a high-level overview of the
mitigations to reduce the risk of silent data corruptions within a large
production fleet. In our large-scale infrastructure, we have run a vast library
of silent error test scenarios across hundreds of thousands of machines in our
fleet. This testing has identified hundreds of CPUs exhibiting these errors, showing
that SDCs are a systemic issue across generations. We have monitored SDCs for a
period longer than 18 months. Based on this experience, we determine that
reducing silent data corruptions requires not only hardware resiliency and
production detection mechanisms, but also robust fault-tolerant software
architectures.
Comment: 8 pages, 3 figures, 33 references