5 research outputs found

    A Method to Construct Low Delay Single Error Correction Codes for Protecting Data Bits Only

    Full text link
    Abstract—Error correction codes (ECCs) have been used for decades to protect memories from soft errors. Single error correction (SEC) codes that can correct 1-bit error per word are a common option for memory protection. In some cases, SEC codes are extended to also provide double error detection and are known as SEC-DED codes. As technology scales, soft errors on registers also became a concern and, therefore, SEC codes are used to protect registers. The use of an ECC impacts the circuit design in terms of both delay and area. Traditional SEC or SEC-DED codes developed for memories have focused on minimizing the number of redundant bits added by the code. This is important in a memory as those bits are added to each word in the memory. However, for registers used in circuits, minimizing the delay or area introduced by the ECC can be more important. In this paper, a method to construct low delay SEC or SEC-DED codes that correct errors only on the data bits is proposed. The method is evaluated for several data block sizes, showing that the new codes offer significant delay reductions when compared with traditional SEC or SEC-DED codes. The results for the area of the encoder and decoder also show substantial savings compared to existing codes. Index Terms—Double error detection, error correction codes (ECCs), single error correction (SEC), soft errors. I

    Cross-Layer Optimization for Power-Efficient and Robust Digital Circuits and Systems

    Full text link
    With the increasing digital services demand, performance and power-efficiency become vital requirements for digital circuits and systems. However, the enabling CMOS technology scaling has been facing significant challenges of device uncertainties, such as process, voltage, and temperature variations. To ensure system reliability, worst-case corner assumptions are usually made in each design level. However, the over-pessimistic worst-case margin leads to unnecessary power waste and performance loss as high as 2.2x. Since optimizations are traditionally confined to each specific level, those safe margins can hardly be properly exploited. To tackle the challenge, it is therefore advised in this Ph.D. thesis to perform a cross-layer optimization for digital signal processing circuits and systems, to achieve a global balance of power consumption and output quality. To conclude, the traditional over-pessimistic worst-case approach leads to huge power waste. In contrast, the adaptive voltage scaling approach saves power (25% for the CORDIC application) by providing a just-needed supply voltage. The power saving is maximized (46% for CORDIC) when a more aggressive voltage over-scaling scheme is applied. These sparsely occurred circuit errors produced by aggressive voltage over-scaling are mitigated by higher level error resilient designs. For functions like FFT and CORDIC, smart error mitigation schemes were proposed to enhance reliability (soft-errors and timing-errors, respectively). Applications like Massive MIMO systems are robust against lower level errors, thanks to the intrinsically redundant antennas. This property makes it applicable to embrace digital hardware that trades quality for power savings.Comment: 190 page

    Entwurfsmethodologie fĂŒr höchst zuverlĂ€ssige digitale ASIC-Designs angewandt auf Network-Centric System Middleware Switch Prozessor

    Get PDF
    The sensitivity of application-specific integrated circuits (ASICs) to single event effects (SEE) can lead to failures of subsystems which are exposed to increased radiation levels in space and on the ground. The work described in this thesis presents a design methodology for a fully fault-tolerant ASIC that is immune to single event upset effects (SEU) in sequential logic, single event transient effects (SET) in combinatorial logic, and single event latchup effects (SEL). Redundant circuits combined with SEL power switches (SPS) are the basis for a design methodology which achieves this goal. Within the standard ASIC design flow enhancements were made in order to incorporate redundancy and SPS cells and, consequently, enable protection against SEU, SET, and SEL. In order to validate the resulting fault-tolerant circuits a fault-injection environment with carefully designed fault models was developed. The moments of fault occurrence and their durations are modeled according to the real effects in actual hardware. The proposed design methodology was applied to an innovative space craft area network (SCAN) central processor unit, known as middleware switch processor. The measurement results presented in this thesis prove the correct functionality of DMR and SPS circuits, as well as the high fault-tolerance of the implemented ASICs along with moderate overhead with respect to power consumption and occupied silicon area. Irradiation measurements demonstrated the correct design and successful implementation of the SPS cell.Die Empfindlichkeit von anwendungsspezifischen integrierten Schaltungen (ASICs) zu den einzelnen Ereigniseffekten (SEE), kann zu AusfĂ€llen von Subsystemen fĂŒhren, die erhöhten Strahlungspegeln im Raum und auf dem Boden ausgesetzt werden. Die Arbeit, die in dieser Thesis beschrieben wird, stellt eine Entwurfsmethodologie vor um fehlertolerante ASICs zu entwerfen, welche die immun gegen singulĂ€re Störung Effekte (SEU) in sequentielle Logik ist, einzelne Ereignis vorĂŒbergehende Effekte (SET) in der kombinatorischen Logik und einzelnes Ereignis Latchup-Effekte (SEL). Modulare Redundanz und SEL-Schalter (SPS) sind die Basis fĂŒr eine Design-Methodik, die volle fehlertolerante ASIC liefert. Der Standard ASIC-Designflow ist erweitert worden, um Redundanz mit SPS-Schalter zu enthalten und Schutz gegen SEU, SET und SEL zu ermöglichen. Um die fehlertoleranten Stromkreise zu validieren ist eine Fault-Injektion Umgebung mit Fault Modellen entwickelt worden. Die Momente des Auftretens und der Dauer der injizierten Fehler werden entsprechend den realen Effekten in die Hardware modelliert. Die Methodologie des vorgeschlagenen Entwurfs ist an einem innovativen Space Craft Area Network (SCAN) Schaltkreis angewendet worden, bekannt als Middleware Switch Prozessor. Die Messergebnisse, die in dieser These dargestellt werden, haben die korrekte FunktionalitĂ€t von Redundanz- und SPS-Stromkreisen sowie die hohe Fehler-Toleranz der resultierende ASICs zusammen mit mĂ€ĂŸigen Unkosten in Bezug auf Leistungsaufnahme und besetzten SilikonflĂ€che nachgewiesen. Die Strahlungsmessungen haben das korrekte Design und die erfolgreiche Umsetzung der SPS-Zelle bewiesen

    Fault Tolerant Integer Data Computations: Algorithms and Applications

    Get PDF
    As computing units move to higher transistor integration densities and computing clusters become highly heterogeneous, studies begin to predict that, rather than being exceptions, data corruptions in memory and processor failures are likely to become more prevalent. It has therefore become imperative to improve the reliability of systems in the face of increasing soft error probabilities in memory and computing logic units of silicon CMOS integrated chips. This thesis introduces a new class of algorithms for fault tolerance in compute-intensive linear and sesquilinear (“one-and-half-linear”) data computations on integer data inputs within high-performance computing systems. The key difference between the proposed algorithms and existing fault tolerance methods is the elimination of the traditional requirement for additional hardware resources for system reliability. The first contribution of this thesis is in the detection of hardware-induced errors in integer matrix products. The proposed method of numerical packing for detecting a single error within a quadruple of matrix outputs is described in Chapter 2. The chapter includes analytic calculations of the proposed method’s computational complexity and reliability. Experimental results show that the proposed algorithm incurs comparable execution time overhead to existing algorithms for the detection and correction of a limited number of errors within generic matrix multiplication (GEMM) outputs. On the other hand, numerical packing becomes substantially more efficient in the mitigation of multiple errors. The achieved execution time gain of numerical packing is further analyzed with respect to its energy saving equivalent, thus paving the way for a new class of silent data corruption (SDC) mitigation method for integer matrix products that are fast, energy efficient, and highly reliable. A further advancement of the proposed numerical packing approach for the mitigation of core/processor failures in computing clusters (a.k.a., failstop failures) is described in Chapter 3 . The key advantage of this new packing approach is the ability to tolerate processor failures for all classes of sum-of-product computations. Because multimedia applications running on cloud computing platforms are now required to mitigate an increasing number of failures and outages at runtime, we analyze the efficiency of numerical packing within an image retrieval framework deployed over a cluster of AWS EC2 spot (i.e., low-cost albeit terminable) instances. Our results show that more than 70% reduction of cost can be achieved in comparison to conventional failure-intolerant processing based on AWS EC2 on-demand (i.e., higher-cost albeit guaranteed) instances. Finally, beyond numerical packing, we present a second approach for reliability in the case of linear and sesquilinear integer data computations by generalizing the recently-proposed concept of numerical entanglement. The proposed approach is capable of recovering from multiple fail-stop failures in a parallel/distributed computing environment. We present theoretical analysis of the computational and bit-width requirements of the proposed method in comparison to existing methods of checksum generation and processing. Our experiments with integer matrix products show that the proposed approach incurs 1.72% − 37.23% reduction in processing throughput in comparison to failure-intolerant processing while allowing for the mitigation of multiple fail-stop failures without the use of additional computing resources
    corecore