The effects of soft errors in processor cores have been widely studied.
However, little has been published about soft errors in uncore components, such
as memory subsystem and I/O controllers, of a System-on-a-Chip (SoC). In this
work, we study how soft errors in uncore components affect system-level
behaviors. We have created a new mixed-mode simulation platform that combines
simulators at two different levels of abstraction, and achieves 20,000x speedup
over RTL-only simulation. Using this platform, we present the first study of
the system-level impact of soft errors inside various uncore components of a
large-scale, multi-core SoC using the industrial-grade, open-source OpenSPARC
T2 SoC design. Our results show that soft errors in uncore components can
significantly impact system-level reliability. We also demonstrate that uncore
soft errors can create major challenges for traditional system-level checkpoint
recovery techniques. To overcome such recovery challenges, we present a new
replay recovery technique for uncore components belonging to the memory
subsystem. For the L2 cache controller and the DRAM controller components of
OpenSPARC T2, our new technique reduces the probability that an application run
fails to produce correct results due to soft errors by more than 100x with
3.32% and 6.09% chip-level area and power impact, respectively.Comment: to be published in Proceedings of the 52nd Annual Design Automation
Conferenc