Search CORE

2 research outputs found

On-chip NBTI and Gate-Oxide-Degradation Sensing and Dynamic Management in VLSI Circuits.

Author: Singh Prashant
Publication venue
Publication date: 01/01/2011
Field of study

The VLSI industry has achieved advancement in technology by continuous process scaling which has resulted in large scale integration. However, scaling also poses new reliability challenges. Currently the industry ensures the reliability of chips by limiting the supply voltage and temperature, but these constraints limit the benefits that are obtained from new process nodes. This method of managing reliability during design time is called Static Reliability Management (SRM). While SRM ensures that all the chips meet the reliability specifications, it introduces extreme pessimism in the chips as it margins for worst process, voltage, temperature and circuit state (PVTS), which will not be required for the majority of chips. To reduce the pessimism of SRM, the system needs to be made aware of its reliability by employing degradation sensors or degradation detection techniques. Using the degradation measurements, the system can estimate its lifetime and can adjust its operating points (supply voltage and temperature limits) dynamically and trade excess reliability slack with performance. This method of reliability management is called Dynamic Reliability Management (DRM). In this work we investigate different methods of DRM. We focus on two critical degradation mechanisms: Negative Bias Temperature Instability (NBTI) and Gate-oxide degradation. We propose NBTI and Gate-oxide degradation sensors with low area and power overhead, which allows them to be deployed in large numbers on the chip enabling collection of degradation statistics. The sensors were designed in 130nm and 45nm process nodes and tested on two test-chips. We then used the sensors to perform DRM in a silicon test for the first time. We demonstrate that DRM eliminates excess reliability slack which allows for a boost in supply voltage and performance. We then propose in situ Bias Temperature Instability (BTI) and Gate-oxide wear-out detection techniques. The in situ technique measures the degradation in the actual devices in the core and removes all the layers of uncertainty which arise because of the statistical nature of degradation and its dependence on PVTS. We implemented and tested these techniques on two test chips in a 65nm process node. We then use the BTI sensing technique to perform DRM.Ph.D.Electrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/86281/1/prsingh_1.pd

Deep Blue Documents at the University of Michigan

Decompose and Conquer: Addressing Evasive Errors in Systems on Chip

Author: Lee Doowon
Publication venue
Publication date
Field of study

Modern computer chips comprise many components, including microprocessor cores, memory modules, on-chip networks, and accelerators. Such system-on-chip (SoC) designs are deployed in a variety of computing devices: from internet-of-things, to smartphones, to personal computers, to data centers. In this dissertation, we discuss evasive errors in SoC designs and how these errors can be addressed efficiently. In particular, we focus on two types of errors: design bugs and permanent faults. Design bugs originate from the limited amount of time allowed for design verification and validation. Thus, they are often found in functional features that are rarely activated. Complete functional verification, which can eliminate design bugs, is extremely time-consuming, thus impractical in modern complex SoC designs. Permanent faults are caused by failures of fragile transistors in nano-scale semiconductor manufacturing processes. Indeed, weak transistors may wear out unexpectedly within the lifespan of the design. Hardware structures that reduce the occurrence of permanent faults incur significant silicon area or performance overheads, thus they are infeasible for most cost-sensitive SoC designs. To tackle and overcome these evasive errors efficiently, we propose to leverage the principle of decomposition to lower the complexity of the software analysis or the hardware structures involved. To this end, we present several decomposition techniques, specific to major SoC components. We first focus on microprocessor cores, by presenting a lightweight bug-masking analysis that decomposes a program into individual instructions to identify if a design bug would be masked by the program's execution. We then move to memory subsystems: there, we offer an efficient memory consistency testing framework to detect buggy memory-ordering behaviors, which decomposes the memory-ordering graph into small components based on incremental differences. We also propose a microarchitectural patching solution for memory subsystem bugs, which augments each core node with a small distributed programmable logic, instead of including a global patching module. In the context of on-chip networks, we propose two routing reconfiguration algorithms that bypass faulty network resources. The first computes short-term routes in a distributed fashion, localized to the fault region. The second decomposes application-aware routing computation into simple routing rules so to quickly find deadlock-free, application-optimized routes in a fault-ridden network. Finally, we consider general accelerator modules in SoC designs. When a system includes many accelerators, there are a variety of interactions among them that must be verified to catch buggy interactions. To this end, we decompose such inter-module communication into basic interaction elements, which can be reassembled into new, interesting tests. Overall, we show that the decomposition of complex software algorithms and hardware structures can significantly reduce overheads: up to three orders of magnitude in the bug-masking analysis and the application-aware routing, approximately 50 times in the routing reconfiguration latency, and 5 times on average in the memory-ordering graph checking. These overhead reductions come with losses in error coverage: 23% undetected bug-masking incidents, 39% non-patchable memory bugs, and occasionally we overlook rare patterns of multiple faults. In this dissertation, we discuss the ideas and their trade-offs, and present future research directions.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147637/1/doowon_1.pd

Deep Blue Documents at the University of Michigan