2 research outputs found

    Understanding Soft Errors in Uncore Components

    Full text link
    The effects of soft errors in processor cores have been widely studied. However, little has been published about soft errors in uncore components, such as memory subsystem and I/O controllers, of a System-on-a-Chip (SoC). In this work, we study how soft errors in uncore components affect system-level behaviors. We have created a new mixed-mode simulation platform that combines simulators at two different levels of abstraction, and achieves 20,000x speedup over RTL-only simulation. Using this platform, we present the first study of the system-level impact of soft errors inside various uncore components of a large-scale, multi-core SoC using the industrial-grade, open-source OpenSPARC T2 SoC design. Our results show that soft errors in uncore components can significantly impact system-level reliability. We also demonstrate that uncore soft errors can create major challenges for traditional system-level checkpoint recovery techniques. To overcome such recovery challenges, we present a new replay recovery technique for uncore components belonging to the memory subsystem. For the L2 cache controller and the DRAM controller components of OpenSPARC T2, our new technique reduces the probability that an application run fails to produce correct results due to soft errors by more than 100x with 3.32% and 6.09% chip-level area and power impact, respectively.Comment: to be published in Proceedings of the 52nd Annual Design Automation Conferenc

    Transaction Level Error Susceptibility Model for Bus Based SoC Architectures

    No full text
    System on Chip architectures have traditionally relied upon bus based interconnect for their communication needs. However, increasing bus frequencies and the load on the bus calls for focus on reliability issues in such bus based systems. In this paper, we provide a detailed analysis of different kinds of errors and the susceptibility of such systems to such errors on various components that the bus comprises of. With elaborate experiments we determine the effect of a single bit error on the bus system during the course of different transactions. The work demonstrates the fact that only a few signals in a bus system are really critical and need to be guarded. Such transaction based analysis helps us to develop an effective prediction methodology to predict the effect of a single bit error on any application running on a bus based architecture. We demonstrate that our transaction based prediction scheme works with an average accuracy of 92% over all the benchmarks when compared with the actual simulation results
    corecore