
    Timing speculation and adaptive reliable overclocking techniques for aggressive computer systems

    Computers have changed our lives beyond our own imagination in the past several decades. The continued and progressive advancements in VLSI technology and numerous micro-architectural innovations have played a key role in the design of spectacular low-cost high performance computing systems that have become omnipresent in today's technology driven world. Performance and dependability have become key concerns as these ubiquitous computing machines continue to drive our everyday life. Every application has unique demands, as applications run in diverse operating environments. Dependable, aggressive and adaptive systems improve efficiency in terms of speed, reliability and energy consumption. Traditional computing systems run at a fixed clock frequency, which is determined by taking into account the worst-case timing paths, operating conditions, and process variations. Timing speculation based reliable overclocking advocates going beyond worst-case limits to achieve best performance while not avoiding, but detecting and correcting a modest number of timing errors. The success of this design methodology relies on the fact that timing critical paths are rarely exercised in a design, and typical execution happens much faster than the timing requirements dictated by worst-case design methodology. Better-than-worst-case design methodology is advocated by several recent research pursuits, which exploit dependability techniques to enhance computer system performance. In this dissertation, we address different aspects of timing speculation based adaptive reliable overclocking schemes, and evaluate their role in the design of low-cost, high performance, energy efficient and dependable systems. We visualize various control knobs in the design that can be favorably controlled to ensure different design targets. As part of this research, we extend the SPRIT3E, or Superscalar PeRformance Improvement Through Tolerating Timing Errors, framework, and characterize the extent of application dependent performance acceleration achievable in superscalar processors by scrutinizing the various parameters that impact operation beyond worst-case limits. We study the limitations imposed by short-path constraints on our technique, and present ways to exploit them to maximize performance gains. We analyze the sensitivity of our technique's adaptiveness by exploring the necessary hardware requirements for dynamic overclocking schemes. An experimental analysis based on SPEC2000 benchmarks running on a SimpleScalar Alpha processor simulator, augmented with error rate data obtained from hardware simulations of a superscalar processor, is presented. Even though reliable overclocking guarantees functional correctness, it leads to higher power consumption. As a consequence, reliable overclocking without considering on-chip temperatures will bring down the lifetime reliability of the chip. In this thesis, we analyze how reliable overclocking impacts the on-chip temperature of a microprocessor and evaluate the effects of overheating, due to such reliable dynamic frequency tuning mechanisms, on the lifetime reliability of these systems. We then evaluate the effect of performing thermal throttling, a technique that clamps the on-chip temperature below a predefined value, on system performance and reliability. Our study shows that a reliably overclocked system with dynamic thermal management achieves a 25% performance improvement, while lasting for 14 years when operated below 353 K.
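    To make the throughput argument concrete, the following minimal Python sketch models reliably overclocked execution as a frequency gain discounted by the recovery cost of detected timing errors; the frequencies, error rates, and recovery penalty are illustrative assumptions, not figures from the dissertation.

```python
# Illustrative first-order model of timing-speculative (reliably overclocked)
# execution. The frequencies, error rates, and recovery penalty below are
# assumed values for illustration, not figures from the dissertation.

def effective_speedup(f_wc_ghz, f_oc_ghz, error_rate, recovery_cycles):
    """Speedup over worst-case-clocked operation when a fraction `error_rate`
    of cycles triggers a timing error costing `recovery_cycles` to repair."""
    cycles_per_useful_cycle = 1.0 + error_rate * recovery_cycles
    return (f_oc_ghz / f_wc_ghz) / cycles_per_useful_cycle

if __name__ == "__main__":
    # Hypothetical operating points: 2.0 GHz worst-case clock, overclocked to 2.8 GHz.
    for err in (0.0, 0.001, 0.01, 0.05):
        s = effective_speedup(2.0, 2.8, err, recovery_cycles=5)
        print(f"error rate {err:6.3f} -> effective speedup {s:.2f}x")
```

    As long as critical paths are rarely exercised, the error rate stays small and the frequency gain dominates the recovery overhead, which is the intuition behind better-than-worst-case operation.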
Over the past five decades, technology scaling, as predicted by Moore's law, has been the bedrock of semiconductor technology evolution. The continued downscaling of CMOS technology to deep sub-micron gate lengths has been the primary reason for its dominance in today's omnipresent silicon microchips. Even as the transition to the next technology node is indispensable, the initial cost and time associated with doing so present a non-level playing field for the competitors in the semiconductor business. As part of this thesis, we evaluate the capability of speculative reliable overclocking mechanisms to maximize performance at a given technology level. We evaluate its competitiveness when compared to technology scaling, in terms of performance, power consumption, energy and energy delay product. We present a comprehensive comparison for integer and floating point SPEC2000 benchmarks running on a simulated Alpha processor at three different technology nodes in normal and enhanced modes. Our results suggest that adopting reliable overclocking strategies will help skip a technology node altogether, or remain competitive in the market while porting to the next technology node. Reliability has become a serious concern as systems embrace nanometer technologies. In this dissertation, we propose a novel fault tolerant aggressive system that combines soft error protection and timing error tolerance. We replicate both the pipeline registers and the pipeline stage combinational logic. The replicated logic receives its inputs from the primary pipeline registers while writing its output to the replicated pipeline registers. The organization of redundancy in the proposed Conjoined Pipeline system supports overclocking, provides concurrent error detection and recovery capability for soft errors, intermittent faults and timing errors, and flags permanent silicon defects. The fast recovery process requires no checkpointing and takes three cycles. Back-annotated post-layout gate-level timing simulations, using 45nm technology, of a conjoined two-stage arithmetic pipeline and a conjoined five-stage DLX pipeline processor, with forwarding logic, show that our approach, even under a severe fault injection campaign, achieves near 100% fault coverage and an average performance improvement of about 20% when dynamically overclocked.
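    A comparison along the metrics named above (performance, power, energy, energy delay product) reduces to simple bookkeeping; the short sketch below shows that arithmetic with placeholder numbers rather than the measurements reported in the study.

```python
# Bookkeeping for a performance/power/energy/EDP comparison; the power and
# runtime numbers are placeholders, not measurements from this study.

def energy_and_edp(power_w, runtime_s):
    energy_j = power_w * runtime_s
    return energy_j, energy_j * runtime_s          # (energy, energy-delay product)

baseline    = energy_and_edp(power_w=20.0, runtime_s=10.0)  # assumed baseline node
overclocked = energy_and_edp(power_w=26.0, runtime_s=8.0)   # assumed reliably overclocked run

for name, (e, edp) in (("baseline", baseline), ("overclocked", overclocked)):
    print(f"{name:12s} energy = {e:6.1f} J   EDP = {edp:7.1f} J*s")
```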

    A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

    Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each utilized optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing down research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the design choices for each building block separately. Reported optimizations range from up to 10,000x memory savings to 33x energy reductions, providing chip designers with an overview of design choices for implementing efficient low power neural network accelerators.
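    As a rough illustration of how such a construction kit might be consulted, the sketch below composes per-technique gains for a hypothetical accelerator; the technique names and factors are invented for the example, and treating the optimizations as multiplicative and independent is a simplifying assumption.

```python
# Hypothetical use of a "construction kit" of optimizations: compose the
# per-technique gains of a candidate design. Technique names and factors are
# invented, and multiplicative independence is a simplifying assumption.

candidate_design = {
    "8-bit quantization": {"memory_x": 4.0, "energy_x": 3.0},
    "weight pruning":     {"memory_x": 5.0, "energy_x": 2.0},
    "clock gating":       {"memory_x": 1.0, "energy_x": 1.3},
}

def composed_factor(design, metric):
    """Combined reduction factor over all building blocks for one metric."""
    factor = 1.0
    for gains in design.values():
        factor *= gains[metric]
    return factor

print(f"combined memory reduction: {composed_factor(candidate_design, 'memory_x'):.1f}x")
print(f"combined energy reduction: {composed_factor(candidate_design, 'energy_x'):.1f}x")
```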

    Power Reductions with Energy Recovery Using Resonant Topologies

    The problem of power densities in system-on-chips (SoCs) and processors has become more exacerbated recently, resulting in high cooling costs and reliability issues. One of the largest components of power consumption is the low skew clock distribution network (CDN), driving large load capacitance. This can consume as much as 70% of the total dynamic power, which is lost as heat, needing elaborate sensing and cooling mechanisms. To mitigate this, resonant clocking has been utilized in several applications over the past decade. An improved energy recovering reconfigurable generalized series resonance (GSR) solution with all the critical support circuitry is developed in this work. This LC resonant clock driver is shown to save about 50% driver power (>40% overall) on a 22nm process node and has 50% less skew than a non-resonant driver at 2GHz. It can operate down to 0.2GHz to support other energy savings techniques like dynamic voltage and frequency scaling (DVFS). As an example, GSR can be configured for the simpler pulse series resonance (PSR) operation to enable further power saving for double data rate (DDR) applications, by using de-skewing latches instead of flip-flop banks. A PSR based subsystem with 40% savings in clocking power and a 40% reduction in driver active area is demonstrated. This new resonant driver generates tracking pulses at each transition of the clock for dual edge operation across DVFS. PSR clocking is designed to drive explicit-pulsed latches with negative setup time. Simulations using 45nm IBM/PTM device and interconnect technology models, clocking 1024 flip-flops, show the reductions compared to non-resonant clocking. A DVFS range from 2GHz/1.3V to 200MHz/0.5V is obtained. The PSR frequency is set >3× the clock rate, needing only 1/10th the inductance of prior-art LC resonance schemes. The skew reductions are achieved without needing to increase the interconnect widths, owing to the negative setup times. Applications in data circuits are shown as well, with a 90nm example. Parallel resonant and split-driver non-resonant configurations are also derived from GSR. Tradeoffs in timing performance versus power, based on theoretical analysis, are compared for the first time and verified. This enables synthesis of an optimal topology for a given application from the GSR.
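    The tank sizing behind such schemes follows the standard LC resonance relation f = 1/(2π√(LC)). The sketch below applies it with an assumed clock-network load to show why placing the pulse resonance above 3× the clock rate needs roughly an order of magnitude less inductance than resonating at the clock rate itself; the load capacitance and frequencies are assumptions for illustration only.

```python
import math

# Back-of-the-envelope tank sizing for a resonant clock driver. The load
# capacitance and frequencies are assumptions chosen only for illustration.

def tank_inductance(f_res_hz, c_load_f):
    """Inductance that resonates with c_load_f at f_res_hz, from f = 1/(2*pi*sqrt(L*C))."""
    return 1.0 / ((2.0 * math.pi * f_res_hz) ** 2 * c_load_f)

c_load = 10e-12      # assumed per-sector clock load of 10 pF
f_clk = 2e9          # 2 GHz clock

l_series = tank_inductance(f_clk, c_load)        # series resonance at the clock rate
l_pulse  = tank_inductance(3.0 * f_clk, c_load)  # pulse resonance set >3x the clock rate

print(f"series-resonant inductance: {l_series * 1e9:.3f} nH")
print(f"pulse-resonant inductance : {l_pulse * 1e9:.3f} nH (~{l_series / l_pulse:.0f}x smaller)")
```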

    Enabling System-Level Modeling of Variation-Induced Faults in Networks-on-Chip

    Process Variation (PV) is increasingly threatening the reliability of Networks-on-Chip (NoCs). Thus, various resilient router designs have been recently proposed and evaluated. However, these evaluations assume random fault distributions, which results in 52%-81% inaccuracy. We propose an accurate circuit-level fault-modeling tool that can be plugged into any system-level NoC simulator to quantify the system-level impact of PV-induced faults at runtime, pinpoint fault-prone router components that should be protected, and accurately evaluate alternative resilient multi-core designs.

    Design for Reliability and Low Power in Emerging Technologies

    The continued downscaling of transistor feature sizes has been one of the most important drivers of growth in the semiconductor industry. For decades, both the integration density and the complexity of circuits have kept increasing, a trend that spans all modern process nodes. Until recently, shrinking transistors went hand in hand with reducing the supply voltage, which lowered power consumption and thus kept the power density constant. With the advent of nanometer feature sizes, however, this scaling has slowed down. Numerous difficulties, such as physical limits in manufacturing and non-idealities in supply-voltage scaling, have led to rising power densities and, with them, to harder reliability problems. These include, among others, transistor aging and excessive heating, not least due to the increased occurrence of self-heating effects within the transistors. To keep such problems from endangering circuit reliability, internal signal delays are usually calculated very pessimistically. The resulting timing guardband ensures correct circuit functionality, but at the cost of performance. Alternatively, circuit reliability can be increased with other techniques, for example zero-temperature-coefficient operation or approximate computing. Although these techniques can recover a large part of the usual timing guardband, they come with further consequences and trade-offs. Persistent challenges in scaling CMOS technology are also shifting attention toward promising emerging technologies. One example is the Negative Capacitance Field-Effect Transistor (NCFET), which offers a remarkable performance gain over conventional FinFET transistors and may replace them in the future. Furthermore, circuit designers increasingly rely on complex, parallel structures rather than higher clock frequencies. These complex designs require modern power-management techniques in all aspects of the design. With the arrival of novel transistor technologies (such as NCFET), these power-management techniques must be re-evaluated, because the underlying dependencies and trade-offs change. This work presents new approaches to both the analysis and the modeling of circuit reliability in order to address the aforementioned challenges at multiple design levels. These approaches are divided into conventional techniques ((a), (b), (c), and (d)) and unconventional techniques ((e) and (f)), as follows: (a) Analysis of the performance gains associated with maximizing power efficiency when operating near the transistor threshold voltage, in particular at the optimal power point. Determining this optimal power point precisely is a particular challenge in multicore designs, because it shifts with the respective optimization objectives and the workload. (b) Revealing hidden interdependencies between transistor aging and supply-voltage fluctuations caused by IR drops. A novel technique is presented that avoids both over- and underestimation when determining the timing guardband and thus finds the smallest still-sufficient guardband. (c) Mitigating transistor aging through "graceful approximation", a technique for raising the clock frequency on demand. The aging-induced timing guardband is replaced by approximate-computing techniques, and quantization is used to guarantee sufficient accuracy in the computations. (d) Mitigating temperature-dependent delay degradation by operating near the zero-temperature coefficient (N-ZTC). Operation at N-ZTC minimizes temperature-induced variations in performance and power consumption; qualitative and quantitative comparisons against the traditional timing guardband are presented. (e) Modeling power-management techniques for NCFET-based processors. NCFET technology has unique properties under which conventional dynamic voltage and frequency scaling (DVS/DVFS) schemes yield suboptimal results, which calls for the NCFET-specific power-management techniques presented in this work. (f) Presentation of a novel heterogeneous multicore design in NCFET technology. The design contains identical cores; heterogeneity arises from applying each core's individual optimal configuration. Amdahl's law is extended to cover new system- and application-specific parameters and to demonstrate the benefits of the new design. The presented techniques are evaluated using gate-level implementations and simulations; in addition, system-level simulators are used to implement and simulate multicore designs. Analytical, gate-level, and system-level simulations of both synthetic and real applications are used to validate the techniques and to assess their effectiveness against the state of the art.
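    For context, item (f) starts from the classical Amdahl's-law speedup bound sketched below; the NCFET- and application-specific extensions from the dissertation are not reproduced here, and the parallel fraction and core counts are assumed values.

```python
# Classical Amdahl's-law bound that the heterogeneous NCFET analysis extends;
# the parallel fraction and core counts are assumed for illustration.

def amdahl_speedup(parallel_fraction, n_cores):
    """Speedup of a workload whose parallelizable share is parallel_fraction,
    run on n_cores identical cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

for cores in (2, 4, 8, 16):
    print(f"{cores:2d} cores -> speedup {amdahl_speedup(0.9, cores):.2f}x")
```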

    CAD methodologies for low power and reliable 3D ICs

    The main objective of this dissertation is to explore and develop computer-aided design (CAD) methodologies and optimization techniques for the reliability, timing performance, and power consumption of through-silicon-via (TSV)-based and monolithic 3D IC designs. The 3D IC technology is a promising answer to the device scaling and interconnect problems that industry faces today. Yet, since multiple dies are stacked vertically in 3D ICs, new problems arise, such as thermal and power-delivery issues. New physical design methodologies and optimization techniques should be developed to address these problems and exploit the design freedom in 3D ICs. Towards this objective, this dissertation includes four research projects. The first project is on the co-optimization of traditional design metrics and reliability metrics for 3D ICs. It is well known that heat removal and power delivery are two major reliability concerns in 3D ICs. To alleviate the thermal problem, two possible solutions have been proposed: thermal through-silicon vias (T-TSVs) and micro-fluidic-channel (MFC) based cooling. For power delivery, a complex power distribution network is required to deliver currents reliably to all parts of the 3D IC while suppressing the power supply noise to an acceptable level. However, these thermal and power networks pose major challenges in signal routability and congestion. In this project, a co-optimization methodology for signal, power, and thermal interconnects in 3D ICs is presented. The goal of the proposed approach is to improve signal, thermal, and power noise metrics and to provide fast and accurate design space explorations for early design stages. The second project is a study on 3D IC partitioning. For a 3D IC, the target circuit needs to be partitioned into multiple parts and then mapped onto the dies. The partitioning style impacts design quality metrics such as footprint, wirelength, and timing. In this project, the design methodologies of 3D ICs with different partitioning styles are demonstrated. For the LEON3 multi-core microprocessor, three partitioning styles are compared: core-level, block-level, and gate-level. The design methodologies for these partitioning styles and their implications on the physical layout are discussed. Then, to perform timing optimizations for 3D ICs, two timing constraint generation methods that lead to different design quality are demonstrated. The third project is on buffer insertion for timing optimization of 3D ICs. For high performance 3D ICs, it is crucial to perform thorough timing optimizations. Among timing optimization techniques, buffer insertion is known to be the most effective. TSVs have a large parasitic capacitance that increases the signal slew and the delay downstream. In this project, a slew-aware buffer insertion algorithm is developed that handles full 3D nets and considers TSV parasitics and slew effects on delay. Compared with the well-known van Ginneken algorithm and a commercial tool, the proposed algorithm finds buffering solutions with lower delay values and acceptable runtime overhead. The last project is on ultra-high-density logic designs for monolithic 3D ICs. The nano-scale 3D interconnects available in monolithic 3D IC technology enable ultra-high-density device integration at the individual transistor level. The benefits and challenges of monolithic 3D integration technology for logic designs are investigated.
First, a 3D standard cell library for transistor-level monolithic 3D ICs is built, and its timing and power behavior are characterized. Then, various interconnect options for monolithic 3D ICs that improve design quality are explored. Next, timing-closed, full-chip GDSII layouts are built, and iso-performance power comparisons with 2D IC designs are performed. Important design metrics such as area, wirelength, timing, and power consumption are compared among transistor-level monolithic 3D, gate-level monolithic 3D, TSV-based 3D, and traditional 2D designs.
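    A minimal way to see why TSV parasitics motivate buffering is the Elmore delay of an RC chain, sketched below with assumed resistance and capacitance values rather than library data.

```python
# Elmore delay of a simple RC chain: each resistance charges every capacitance
# downstream of it. All values below are assumptions for illustration only.

def elmore_delay(stages):
    """stages: list of (resistance_ohm, node_capacitance_f) pairs from driver to sink."""
    caps = [c for _, c in stages]
    return sum(r * sum(caps[i:]) for i, (r, _) in enumerate(stages))

driver = (200.0, 0.0)     # assumed driver output resistance
wire   = (150.0, 50e-15)  # intra-die wire segment
tsv    = (0.5, 40e-15)    # TSV: tiny resistance, comparatively large capacitance
sink   = (0.0, 20e-15)    # receiver input capacitance

without_tsv = elmore_delay([driver, wire, sink])
with_tsv    = elmore_delay([driver, wire, tsv, sink])
print(f"Elmore delay without TSV: {without_tsv * 1e12:5.1f} ps")
print(f"Elmore delay with a TSV : {with_tsv * 1e12:5.1f} ps")
```

    Inserting a buffer just upstream of the TSV shields the earlier resistances from the TSV capacitance, which is essentially the lever a buffer-insertion algorithm exploits.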

    A sensor-less NBTI mitigation methodology for NoC architectures

    CMOS technology improvements allow more cores to be integrated on a single chip and make Networks-on-Chip (NoCs) a key component from the performance and reliability standpoints. Unfortunately, continuous scaling of CMOS technology poses severe concerns regarding failure mechanisms such as NBTI and stress migration, which are crucial in achieving acceptable component lifetimes. Process variation complicates the scenario, decreasing device lifetime and performance predictability during chip fabrication. This paper presents a novel sensor-less methodology to reduce NBTI degradation in the on-chip network virtual channel buffers, considering process variation effects as well. Experimental validation is obtained using a cycle accurate simulator considering both real and synthetic traffic patterns. We compare our methodology to the best sensor-wise approach, used as a reference golden model. The proposed sensor-less strategy achieves results within 25% of the optimal sensor-wise methodology, and this gap is reduced to around 10% when the number of virtual channels per input port is decreased. Moreover, our proposal can mitigate the NBTI impact in both the short and the long run, since we recover both the most degraded VC (short term) and all the other VCs (long term).
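    One way to picture a sensor-less policy is a fixed rotation of which virtual channel is idled so that stress, and hence NBTI degradation, is spread over all buffers without any degradation sensors. The sketch below is only a schematic illustration under that assumption; the VC count, rotation period, and policy details are not taken from the paper.

```python
from collections import deque

# Schematic sensor-less recovery rotation: one virtual channel per port is
# periodically taken out of service so its devices can recover from NBTI
# stress. VC count and rotation period are assumptions, not the paper's policy.

class VcRotator:
    def __init__(self, num_vcs, rotation_period_cycles):
        self.vcs = deque(range(num_vcs))
        self.period = rotation_period_cycles
        self.cycle = 0

    def resting_vc(self):
        """VC currently held idle (recovering); the others carry traffic."""
        return self.vcs[0]

    def tick(self):
        self.cycle += 1
        if self.cycle % self.period == 0:
            self.vcs.rotate(-1)      # hand the recovery slot to the next VC

router_port = VcRotator(num_vcs=4, rotation_period_cycles=1000)
for _ in range(4000):
    router_port.tick()
print("VC resting after 4000 cycles:", router_port.resting_vc())
```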

    RESISTIVE RAM BASED MAIN-MEMORY SYSTEMS: UNDERSTANDING THE OPPORTUNITIES, LIMITATIONS, AND TRADEOFFS

    As DRAM faces scaling issues as a high-density memory, emerging technologies are being explored as alternatives. One promising candidate is Resistive RAM (ReRAM), which is scalable, vertically stackable, and, because it can be integrated with a standard logic process, able to deliver higher density as a main-memory solution. The key differentiator with this approach involves a ReRAM memory array that integrates directly with a logic processor underneath. In this research work, I explore ReRAM as a main-memory alternative at three levels of detail: the device level, the physical-design level, and finally the architecture level. I begin with an overview of ReRAM and compare it with alternative technologies. I look at the physical design of the solution and present the results of area studies on integrating a VSCALE processor at the 45nm technology node with a ReRAM bit-cell array. The area study was performed based on parameters specified by my collaborators at Crossbar Inc. The results showed that the optimum operating point is at 50% array efficiency with a VSCALE processor, and that this configuration incurs an area penalty of 18%. Two of the key challenges for ReRAM with respect to DRAM performance are the higher write latency requirement (typically on the order of 1us) and the lower write endurance (typically less than 10^8 cycles). This compares with DRAM write latencies of less than 30ns (depending on technology node and generation) and write endurance of more than 10^15 write cycles. In this research work, I explore the possibility of utilizing the ReRAM cell in an intermediate state between the non-volatile state and the threshold state, intentionally trading off write energy for much lower data retention. This allows the chip to more easily replace existing DRAM-like main memory applications, without requiring a higher write programming current or accommodating a longer write latency. I performed this evaluation both at the device level and at the architecture level. At the device level, I used UMD's Nano-fab lab to construct metal-oxide based ReRAM bitcells, on which I characterized the relationship between data retention and the applied write current. My fabricated ReRAM was composed of titanium oxide and aluminum oxide. I also confirmed the behavior of a mixed-volatility state, where a formed filament relaxes over time and moves to a high-resistance level. Based on my experimental measurements, operating in the mixed-volatility state would reduce write energy by 10 to 100x, and thereby improve the write endurance. Finally, at the architecture level, I used the Structural Simulation Toolkit (SST) to characterize a ReRAM-based main-memory system and compare it with a DRAM-based one using our research group's DRAMSIM3 tool. I also characterized the sensitivity of system performance to various architectural parameters (core-to-memory-controller ratio, queue depth, NoC topology) on STREAM and GUPS-based graph benchmarks, which indicated that the torus topology provides reasonable performance. A study of the impact of the number of parallel processors indicated that at low processor counts, DRAM outperforms ReRAM due to its faster memory latency; at high processor counts, however, ReRAM, with its higher number of parallel connections, is able to deliver higher system performance than DRAM.
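    The crossover described above can be captured with a crude bandwidth model in which each technology serves requests with a fixed latency over a fixed number of parallel connections; the latencies and channel counts below are illustrative guesses, not the simulated configurations.

```python
# Crude throughput model behind the DRAM-vs-ReRAM crossover: fixed per-request
# latency over a fixed number of parallel connections. All parameter values
# are illustrative assumptions, not the configurations simulated in this work.

def sustained_requests_per_us(latency_ns, parallel_channels, outstanding_requests):
    """Requests completed per microsecond when requests queue behind the channels."""
    in_flight = min(parallel_channels, outstanding_requests)
    return in_flight * (1000.0 / latency_ns)

dram  = dict(latency_ns=30.0,   parallel_channels=4)    # fast but few channels (assumed)
reram = dict(latency_ns=1000.0, parallel_channels=512)  # slow but massively parallel (assumed)

for processors in (1, 8, 64, 512):
    d = sustained_requests_per_us(outstanding_requests=processors, **dram)
    r = sustained_requests_per_us(outstanding_requests=processors, **reram)
    print(f"{processors:4d} outstanding requests: DRAM {d:8.1f}/us  ReRAM {r:8.1f}/us")
```

    With few outstanding requests the lower DRAM latency wins, while with many requests the larger ReRAM channel count dominates, matching the trend reported above.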