Search CORE

63 research outputs found

Approaches to multiprocessor error recovery using an on-chip interconnect subsystem

Author: Vadlamani Ramakrishna P
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2010

For future multicores, a dedicated interconnect subsystem for on-chip monitors was found to be highly beneficial in terms of scalability, performance and area. In this thesis, such a monitor network (MNoC) is used for multicores to support selective error identification and recovery and to maintain target chip reliability in the context of dynamic voltage and frequency scaling (DVFS). A selective shared-memory multiprocessor recovery is performed using MNoC in which, when an error is detected, only the group of processors sharing an application with the affected processors is recovered. Although the use of DVFS in contemporary multicores provides significant protection from unpredictable thermal events, a potential side effect can be an increased processor exposure to soft errors. To address this issue, a flexible fault prevention and recovery mechanism has been developed to selectively enable a small amount of per-core dual modular redundancy (DMR) in response to increased vulnerability, as measured by the processor architectural vulnerability factor (AVF). Our new algorithm for DMR deployment aims to provide a stable effective soft error rate (SER) by using DMR in response to DVFS caused by thermal events. The algorithm is implemented in real time on the multicore using MNoC and a controller that evaluates thermal information and multicore performance statistics in addition to error information. DVFS experiments with a multicore simulator using standard benchmarks show an average 6% improvement in overall power consumption and a stable SER by using selective DMR versus continuous DMR deployment.
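The selective recovery idea in this abstract (recover only the processors that share an application with the affected ones) can be illustrated with a minimal sketch. The core-to-application map, the application names, and the function `cores_to_recover` are hypothetical and are not taken from the thesis; the sketch says nothing about how MNoC actually carries the error information.

```python
from collections import defaultdict

def cores_to_recover(core_app, faulty_cores):
    """Return every core that shares an application with a faulty core.

    core_app     -- hypothetical map: core id -> application name
    faulty_cores -- iterable of core ids on which an error was detected
    """
    # Group cores by the application they are running.
    by_app = defaultdict(set)
    for core, app in core_app.items():
        by_app[app].add(core)

    # Recover only the groups that contain at least one faulty core.
    affected = set()
    for core in faulty_cores:
        affected |= by_app[core_app[core]]
    return affected

# Illustrative mapping (assumed, not from the source).
core_app = {0: "fft", 1: "fft", 2: "lu", 3: "lu", 4: "radix"}
print(sorted(cores_to_recover(core_app, {1})))  # [0, 1]
```

Under these assumptions an error on core 1 triggers recovery of cores 0 and 1 only, while the cores running the other applications continue undisturbed.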

CiteSeerX

ScholarWorks@UMass Amherst

Understanding Soft Errors in Uncore Components

Author: DeHon A.
Lilja K.
Loveless T. D.
Mitra S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 08/05/2015

The effects of soft errors in processor cores have been widely studied. However, little has been published about soft errors in uncore components, such as the memory subsystem and I/O controllers, of a System-on-a-Chip (SoC). In this work, we study how soft errors in uncore components affect system-level behaviors. We have created a new mixed-mode simulation platform that combines simulators at two different levels of abstraction, and achieves 20,000x speedup over RTL-only simulation. Using this platform, we present the first study of the system-level impact of soft errors inside various uncore components of a large-scale, multi-core SoC using the industrial-grade, open-source OpenSPARC T2 SoC design. Our results show that soft errors in uncore components can significantly impact system-level reliability. We also demonstrate that uncore soft errors can create major challenges for traditional system-level checkpoint recovery techniques. To overcome such recovery challenges, we present a new replay recovery technique for uncore components belonging to the memory subsystem. For the L2 cache controller and the DRAM controller components of OpenSPARC T2, our new technique reduces the probability that an application run fails to produce correct results due to soft errors by more than 100x with 3.32% and 6.09% chip-level area and power impact, respectively. Comment: to be published in Proceedings of the 52nd Annual Design Automation Conference

arXiv.org e-Print Archive

Crossref

Balancing soft error coverage with lifetime reliability in redundantly multithreaded processors

Author: Sudhanva Gurumurthi
Taniya Siddiqua
Publication date: 01/01/2009

Silicon reliability is a key challenge facing the microprocessor industry. Processors need to be designed such that they are resilient against both soft errors and lifetime reliability phenomena. However, techniques developed to address one class of reliability problems may impact other aspects of silicon reliability. In this paper, we show that Redundant Multi-Threading (RMT), which provides soft error protection, exacerbates lifetime reliability problems. We then explore two different architectural approaches to tackle this problem, namely, Dynamic Voltage Scaling (DVS) and partial RMT. We show that each approach has certain strengths and weaknesses with respect to performance, soft error coverage, and lifetime reliability. We then propose and evaluate a hybrid approach that combines DVS and partial RMT. We show that this approach provides better improvement in lifetime reliability than DVS or partial RMT alone, buys back a significant amount of performance that is lost due to DVS, and provides nearly complete soft error coverage.

CiteSeerX

Crossref

Tolerating Radiation-Induced Transient Faults in Modern Processors

Author: A. Silberschatz
A.V. Aho
J.L. Hennessy
Jean-Luc Gaudiot
Xiaobin Li
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Adaptive execution assistance for multiplexed fault-tolerant chip multiprocessors

Author: Larsson Erik
Saluja Kewal
Singh Virendra
Subramanyan Pramod
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011

Relentless scaling of CMOS fabrication technology has made contemporary integrated circuits increasingly susceptible to transient faults, wearout-related permanent faults, intermittent faults and process variations. Therefore, mechanisms to mitigate the effects of decreased reliability are expected to become essential components of future general-purpose microprocessors. In this paper, we introduce a new throughput-efficient architecture for multiplexed fault-tolerant chip multiprocessors (CMPs). Our proposal relies on the new technique of adaptive execution assistance, which dynamically varies instruction outcomes forwarded from the leading core to the trailing core based on measures of trailing core performance. We identify policies and design low-overhead hardware mechanisms to achieve this. Our work also introduces a new priority-based thread-scheduling algorithm for multiplexed architectures that improves multiplexed fault-tolerant CMP throughput by prioritizing stalled threads. Through simulation-based evaluation, we find that our proposal delivers 17.2% higher throughput than perfect dual modular redundant (DMR) execution and outperforms previous proposals for throughput-efficient CMP architectures.

Crossref

Lund University Publications

Autonomous Fault-Tolerant Avionics for Small COTS Satellites: to Reality and Prototype

Author: Fuchs Christian M.
Murillo Nadia M.
Publication venue: DigitalCommons@USU
Publication date: 07/08/2021

In this contribution we present practical experiences from realizing a prototype of the first truly fault-tolerant and autonomously operating avionics suite for miniaturized satellites down to the size of a 2U CubeSat. Our initial demonstrator setup consists of a mix of COTS parts and FPGA development boards, which we gradually expanded in scope and capabilities. After four iterations of PCB development and manufacturing, we have condensed this design to a fully integrated custom PCB-based prototype. Our fourth architecture iteration is stackable and is designed to fit on an 80×80 mm PCB footprint. It is furthermore capable of operating as a generic satellite subsystem node, functioning in a distributed, fault-tolerant, interconnected manner together with other subsystems. Each node is fully replaceable by two or more neighboring subsystem nodes. In consequence, we achieve a satellite bus setup which is in spirit similar to integrated modular avionics and modern fault-tolerant avionics network architectures used in other fields. We realize this setup through a high-speed chip-to-chip network in a compact CubeSat form factor.

DigitalCommons@USU

Two-Layer Error Control Codes Combining Rectangular and Hamming Product Codes for Cache Error

Author: Ampadu Paul
Zhang Meilin
Publication venue: 'MDPI AG'
Publication date: 01/01/2014

We propose a novel two-layer error control code that efficiently combines the error detection capability of rectangular codes with the error correction capability of Hamming product codes, in order to increase cache error resilience for many-core systems while maintaining low power, area and latency overhead. Exploiting the low latency and overhead of rectangular codes and the high error control capability of Hamming product codes, two-layer error control codes employ simple rectangular codes for each cache line to detect cache errors, and load the extra Hamming product code check bits only when an error is detected, thus enabling reliable large-scale cache operations. Analysis and experiments are conducted to evaluate the cache fault-tolerant capability of various existing solutions and the proposed approach. The results show that the proposed approach can significantly increase Mean-Error-To-Failure (METF) and Mean-Time-To-Failure (MTTF) up to 2.8×, reduce storage overhead by over 57%, and increase instructions per cycle (IPC) up to 7%, compared to complex four-way 4EC5ED; and it increases METF and MTTF up to 133×, reduces storage overhead by over 11%, and achieves a similar IPC compared to simple eight-way single-error correcting double-error detecting (SECDED). The cost of the proposed approach is no more than 4% external memory access overhead.
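The detection layer described in this abstract can be illustrated with a generic rectangular (row and column parity) code. This is a minimal sketch, assuming a 4×4 bit layout chosen only for illustration; the function names `encode` and `check` are hypothetical, the paper's actual cache-line organization is not given here, and the second layer (the Hamming product code that would be loaded after a detection) is not implemented.

```python
def encode(bits, rows, cols):
    """Compute row and column parity bits for a rows x cols bit matrix."""
    assert len(bits) == rows * cols
    grid = [bits[r * cols:(r + 1) * cols] for r in range(rows)]
    row_par = [sum(row) % 2 for row in grid]
    col_par = [sum(grid[r][c] for r in range(rows)) % 2 for c in range(cols)]
    return row_par, col_par

def check(bits, rows, cols, row_par, col_par):
    """Return (bad_rows, bad_cols): indices whose stored parity no longer matches."""
    new_row, new_col = encode(bits, rows, cols)
    bad_rows = [r for r in range(rows) if new_row[r] != row_par[r]]
    bad_cols = [c for c in range(cols) if new_col[c] != col_par[c]]
    return bad_rows, bad_cols

# Illustrative 16-bit "cache line" (assumed layout: 4 rows of 4 bits).
line = [1, 0, 1, 1,
        0, 0, 1, 0,
        1, 1, 0, 1,
        0, 1, 0, 0]
row_par, col_par = encode(line, 4, 4)

corrupted = list(line)
corrupted[6] ^= 1  # flip the bit at row 1, column 2

print(check(line, 4, 4, row_par, col_par))       # ([], [])
print(check(corrupted, 4, 4, row_par, col_par))  # ([1], [2])
```

A non-empty result is the event that, in a two-layer scheme of the kind the abstract describes, would trigger fetching the stronger check bits; in the error-free case only the cheap parity computation is paid.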

Multidisciplinary Digital Publishing Institute

DSpace@MIT

Directory of Open Access Journals

Soft Error Vulnerability of Iterative Linear Algebra Methods

Author: Bronevetsky G
de Supinski B
Publication venue: Lawrence Livermore National Laboratory
Publication date: 01/01/2008

Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft error rates were significant primarily in space and high-atmospheric computing. Modern architectures now use features so small at sufficiently low voltages that soft errors are becoming important even at terrestrial altitudes. Due to their large number of components, supercomputers are particularly susceptible to soft errors. Since many large-scale parallel scientific applications use iterative linear algebra methods, the soft error vulnerability of these methods constitutes a large fraction of the application's overall vulnerability. Many users consider these methods invulnerable to most soft errors since they converge from an imprecise solution to a precise one. However, we show in this paper that iterative methods are vulnerable to soft errors, exhibiting both silent data corruptions and poor ability to detect errors. Further, we evaluate a variety of soft error detection and tolerance techniques, including checkpointing, linear matrix encodings, and residual tracking techniques.
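Residual tracking, one of the detection techniques named in this abstract, can be sketched on a small Jacobi solver. This is an illustration under stated assumptions, not the authors' implementation: the 3×3 system, the emulated fault (adding 1e6 to one solution entry), the `growth_limit` threshold and all function names are hypothetical. A large corruption is used on purpose; small perturbations may well slip past a check like this, which is consistent with the abstract's remark on poor error detection.

```python
def residual_norm(A, x, b):
    """Infinity norm of the residual b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))

def jacobi_step(A, x, b):
    """One Jacobi sweep: solve each equation for its diagonal unknown."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def solve_with_residual_tracking(A, b, iters, fault_at=None, growth_limit=10.0):
    """Run Jacobi and record iterations where the residual jumps sharply."""
    x = [0.0] * len(b)
    prev = residual_norm(A, x, b)
    alarms = []
    for k in range(iters):
        x = jacobi_step(A, x, b)
        if k == fault_at:
            x[0] += 1e6  # emulated soft error: large corruption of one entry
        r = residual_norm(A, x, b)
        # Flag a suspicious jump; the floor avoids alarms on rounding noise.
        if r > growth_limit * max(prev, 1e-12):
            alarms.append(k)
        prev = r
    return x, alarms

# Diagonally dominant test system with exact solution [1, 1, 1] (assumed example).
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]

x_clean, alarms_clean = solve_with_residual_tracking(A, b, iters=60)
x_fault, alarms_fault = solve_with_residual_tracking(A, b, iters=60, fault_at=5)
print(alarms_clean, alarms_fault)  # [] [5]
```

In this sketch the solver is simply allowed to keep iterating after the alarm, and it converges again because Jacobi is self-correcting on this system; a real deployment might instead roll back to a checkpoint, which is another of the techniques the abstract lists.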

Crossref

UNT Digital Library