7 research outputs found

    An SRAM Compiler for Monolithic 3D Integrated Circuit

    Get PDF
    This article presents monolithic-3-D (M3D) SRAM arrays using multiple tiers of carbon nanotube (CNT) transistors. The compiler automatically generates single-tier 2-D SRAM subarrays and multitier 3-D SRAM subarrays with different tiers for cells and peripheral logic. Moreover, the compiler can integrate multiple subarrays of different dimensions to generate larger capacity SRAM arrays. The compiler is demonstrated in a commercial-grade M3D process design kit (PDK) with two tiers of carbon nanotube transistors (CNFETs). Simulations show that the M3D CNT SRAM design can improve the properties of memory compared to the 2-D CNT SRAM design. In a 32-kB memory implementation, the M3D design can reduce footprint, latency, and energy by 33%, 10%, and 19%, respectively. The compiler is used to show the feasibility of fine-grain logic and SRAM stacking in M3D technology.M.S

    Building Efficient and Reliable Emerging Technology Systems

    Full text link
    The semiconductor industry has been reaping the benefits of Moore’s law powered by Dennard’s voltage scaling for the past fifty years. However, with the end of Dennard scaling, silicon chip manufacturers are facing a widespread plateau in performance improvements. While the architecture community has focused its effort on exploring parallelism, such as with multi-core, many-core and accelerator-based systems, chip manufacturers have been forced to explore beyond-Moore technologies to improve performance while maintaining power density. Examples of such technologies include monolithic 3D integration, carbon nanotube transistors, tunneling-based transistors, spintronics and quantum computing. However, the infancy of the manufacturing process of these new technologies impedes their usage in commercial products. The goal of this dissertation is to combine both architectural and device-level efforts to provide solutions across the computing stack that can overcome the reliability concerns of emerging technologies. This allows for beyond-Moore systems to compete with highly optimized silicon-based processors, thus, enabling faster commercialization of such systems. This dissertation proposes the following key steps: (i) Multifaceted understanding and modeling of variation and yield issues that occur in emerging technologies, such as carbon nanotube transistors (CNFETs). (ii) Design of systems using suitable logic families such as pass transistor logic that provide high performance. (iii) Design of a multi-granular fault-tolerant reconfigurable architecture that enhances yield and performance. (iv) Design of a multi-technology, multi-accelerator heterogeneous system (v) Development of real-time constrained efficient workload scheduling mechanism for heterogeneous systems. This dissertation first presents the use of pass transistor logic family as an alternate to the CMOS logic family for CNFETs to improve performance. It explores various architectural design choices for CNFETs using pass transistor logic (PTL) to create an energy-efficient RISC-V processor. Our results show that while a CNFET RISC-V processor using CMOS logic achieves a 2.9x energy-delay product (EDP) improvement over a silicon design, using PTL along the critical path components of the processor can boost EDP improvement by 5x as well as reduce area by 17% over 16 nm silicon CMOS. This document further builds on providing fault-tolerant and yield enhancing solutions for emerging 3D integration compatible technologies in the context of CNFETs. The proposed framework can efficiently support high-variation technologies by providing protection against manufacturing defects at multiple granularities: module and pipeline-stage levels. Based on the variation observed in a synthesized design, a reliable CNFET-based 3D multi-granular reconfigurable architecture, 3DTUBE, is presented to overcome the manufacturing difficulties. For 0.4-0.7 V, 3DTUBE provides up to 6.0x higher throughput and 3.1x lower EDP compared to a silicon-based multi-core design evaluated at 1 part per billion transistor failure rate, which is 10,000x lower in comparison to CNFET’s failure rate. This dissertation then ventures into building multi-accelerator heterogeneous systems and real-time schedulers that cater to the requirements of the applications while taking advantage of the underlying heterogeneous system. We introduce optimizations like task pruning, hierarchical hetero-ranking and rank update built upon two scheduler policies (MS-static and MS-dynamic), that result in a performance improvement of 3.5x (average) for real-world autonomous vehicle applications, when compared against state-of-the-art schedulers. Adopting insights from the above work, this thesis presents a multi-accelerator, multi-technology heterogeneous system powered by a multi-constrained scheduler that optimizes for varying task requirements to achieve up to 6.1x better energy over a baseline silicon-based system.PHDElectrical and Computer EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169699/1/aporvaa_1.pd

    Interconnects for future technology generations - conventional CMOS with copper/low-k and beyond

    Get PDF
    The limitations of the conventional Cu/low-k interconnect technology for use in future ultra-scaled integrated circuits down to 7 nm in the year 2020 are investigated from the power/performance point of view. Compact models are used to demonstrate the impacts of various interconnect process parameters, for instance, the interconnect barrier/liner bilayer thickness and aspect ratio, on the design and optimization of a multilevel interconnect network. A framework to perform a sensitivity analysis for the circuit behavior to interconnect process parameters is created for future FinFET CMOS technology nodes. Multiple predictive cell libraries down to the 7‒nm technology node are constructed to enable early investigation of the electronic chip performance using commercial electronic design automation (EDA) tools with real chip information. Findings indicated new opportunities that arise for emerging novel interconnect technologies from the materials and process perspectives. These opportunities are evaluated based on potential benefits that are quantified with rigorous circuit-level simulations and requirements for key parameters are underlined. The impacts of various emerging interconnect technologies on the performances of emerging devices are analyzed to quantify the realistic circuit- and system-level benefits that these new switches can offer.Ph.D

    A Holistic Solution for Reliability of 3D Parallel Systems

    Full text link
    As device scaling slows down, emerging technologies such as 3D integration and carbon nanotube field-effect transistors are among the most promising solutions to increase device density and performance. These emerging technologies offer shorter interconnects, higher performance, and lower power. However, higher levels of operating temperatures and current densities project significantly higher failure rates. Moreover, due to the infancy of the manufacturing process, high variation, and defect densities, chip designers are not encouraged to consider these emerging technologies as a stand-alone replacement for Silicon-based transistors. The goal of this dissertation is to introduce new architectural and circuit techniques that can work around high-fault rates in the emerging 3D technologies, improving performance and reliability comparable to Silicon. We propose a new holistic approach to the reliability problem that addresses the necessary aspects of an effective solution such as detection, diagnosis, repair, and prevention synergically for a practical solution. By leveraging 3D fabric layouts, it proposes the underlying architecture to efficiently repair the system in the presence of faults. This thesis presents a fault detection scheme by re-executing instructions on idle identical units that distinguishes between transient and permanent faults while localizing it to the granularity of a pipeline stage. Furthermore, with the use of a dynamic and adaptive reconfiguration policy based on activity factors and temperature variation, we propose a framework that delivers a significant improvement in lifetime management to prevent faults due to aging. Finally, a design framework that can be used for large-scale chip production while mitigating yield and variation failures to bring up Carbon Nano Tube-based technology is presented. The proposed framework is capable of efficiently supporting high-variation technologies by providing protection against manufacturing defects at different granularities: module and pipeline-stage levels.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/168118/1/javadb_1.pd

    Monolithically Integrated SRAM-ReRAM Cache-Main Memory System

    Get PDF
    Emerging non-volatile memories are dense and potentially compatible with standard CMOS processes, enabling a monolithically integrated CPU-main memory chip. However, area constraints impact the feasibility of fitting the entirety of a multi-core CPU and main memory system into a single die. ReRAM presents a unique opportunity in that it can be fabricated in crosspoint subarrays which leave the bulk of transistors beneath them available for other logic. However, ReRAM also poses a performance challenge; the latency is generally much higher than that of DRAM. Compensating for this through the increased bandwidth afforded from being on-die poses an architectural problem. The access circuitry for ReRAM subarrays requires only a small percentage of the area beneath the array. Still, this dense circuitry and wiring disrupts the layouts of irregular logic like CPUs. Caches are very regular and composed of smaller subarrays, making them a better candidate to place beneath crosspoint subarrays. By co-designing the cache subarrays and ReRAM crosspoint subarrays, minimal disruption to the cache logic can be achieved while still covering the bulk of the last-level cache area in ReRAM. This work explores the design space when co-designing the last-level cache and ReRAM crosspoint subarrays. Using a modified version of Cacti, we are able to explore the design trade-offs when integrating ReRAM and cache and quantify the impact the ReRAM has on the last-level cache. This design space exploration gives us a first order approximation of the memory capacity of a monolithic computer and informs architectural simulations of such a machine. We also examine how the physical integration presents opportunities for logical integration of the last-level cache and main memory. The interconnects and controllers can be combined, and the addressing can be such that data movement between the main memory and cache is primarily vertical. These optimizations can result in area and energy savings with minor impacts on performance. The second section of this work explores one architectural style which can balance the monolithic memory system and a general-purpose compute system---a tiled multicore with wide SIMD and multi-threading. We develop a simulator for this architecture capable of simulating a wide variety of system parameters. Through a design space exploration of many of the parameters across sparse, irregular graph kernels and dense, streaming computations, we find monolithic ReRAM exceeds the performance of a state-of-the-art DRAM system for memory intensive workloads given enough parallelism. We further develop an analytic model to describe our system and highlight the important performance characteristics for a monolithic CPU-main memory chip. The analytic model is validated against our simulation data. Using this model, we examine the architectural balance of the systems we simulated. Finally, we develop an RTL model of the combined cache--main memory interface. This gives a more accurate model for the increase in resources required for the combined controller. We additionally develop a system-on-a-chip with an RTL model that alters requests to the FPGA's main memory to be at the speed of ReRAM requests. This model is used to show the performance of more computationally intensive benchmarks. It also is the first step toward creating a test chip for a monolithically integrated ReRAM main memory

    Cost-Efficient Soft-Error Resiliency for ASIP-based Embedded Systems

    Full text link
    Recent decades have witnessed the rapid growth of embedded systems. At present, embedded systems are widely applied in a broad range of critical applications including automotive electronics, telecommunication, healthcare, industrial electronics, consumer electronics military and aerospace. Human society will continue to be greatly transformed by the pervasive deployment of embedded systems. Consequently, substantial amount of efforts from both industry and academic communities have contributed to the research and development of embedded systems. Application-specific instruction-set processor (ASIP) is one of the key advances in embedded processor technology, and a crucial component in some embedded systems. Soft errors have been directly observed since the 1970s. As devices scale, the exponential increase in the integration of computing systems occurs, which leads to correspondingly decrease in the reliability of computing systems. Today, major research forums state that soft errors are one of the major design technology challenges at and beyond the 22 nm technology node. Therefore, a large number of soft-error solutions, including error detection and recovery, have been proposed from differing perspectives. Nonetheless, most of the existing solutions are designed for general or high-performance systems which are different to embedded systems. For embedded systems, the soft-error solutions must be cost-efficient, which requires the tailoring of the processor architecture with respect to the feature of the target application. This thesis embodies a series of explorations for cost-efficient soft-error solutions for ASIP-based embedded systems. In this exploration, five major solutions are proposed. The first proposed solution realizes checkpoint recovery in ASIPs. By generating customized instructions, ASIP-implemented checkpoint recovery can perform at a finer granularity than what was previously possible. The fault-free performance overhead of this solution is only 1.45% on average. The recovery delay is only 62 cycles at the worst case. The area and leakage power overheads are 44.4% and 45.6% on average. The second solution explores utilizing two primitive error recovery techniques jointly. This solution includes three application-specific optimization methodologies. This solution generates the optimized error-resilient ASIPs, based on the characteristics of primitive error recovery techniques, static reliability analysis and design constraints. The resultant ASIP can be configured to perform at runtime according to the optimized recovery scheme. This solution can strategically enhance cost-efficiency for error recovery. In order to guarantee cost-efficiency in unpredictable runtime situations, the third solution explores runtime adaptation for error recovery. This solution aims to budget and adapt the error recovery operations, so as to spend the resources intelligently and to tolerate adverse influences of runtime variations. The resultant ASIP can make runtime decisions to determine the activation of spatial and temporal redundancies, according to the runtime situations. At the best case, this solution can achieve almost 50x reliability gain over the state of the art solutions. Given the increasing demand for multi-core computing systems, the last two proposed solutions target error recovery in multi-core ASIPs. The first solution of these two explores ASIP-implemented fine-grained process migration. This solution is a key infrastructure, which allows cost-efficient task management, for realizing cost-efficient soft-error recovery in multi-core ASIPs. The average time cost is only 289 machine cycles to perform process migration. The last solution explores using dynamic and adaptive mapping to assign heterogeneous recovery operations to the tasks in the multi-core context. This solution allows each individual ASIP-based processing core to dynamically adapt its specific error recovery functionality according to the corresponding task's characteristics, in terms of soft error vulnerability and execution time deadline. This solution can significantly improve the reliability of the system by almost two times, with graceful constraint penalty, in comparison to the state-of-the-art counterparts

    Heterogeneous Reconfigurable Fabrics for In-circuit Training and Evaluation of Neuromorphic Architectures

    Get PDF
    A heterogeneous device technology reconfigurable logic fabric is proposed which leverages the cooperating advantages of distinct magnetic random access memory (MRAM)-based look-up tables (LUTs) to realize sequential logic circuits, along with conventional SRAM-based LUTs to realize combinational logic paths. The resulting Hybrid Spin/Charge FPGA (HSC-FPGA) using magnetic tunnel junction (MTJ) devices within this topology demonstrates commensurate reductions in area and power consumption over fabrics having LUTs constructed with either individual technology alone. Herein, a hierarchical top-down design approach is used to develop the HSCFPGA starting from the configurable logic block (CLB) and slice structures down to LUT circuits and the corresponding device fabrication paradigms. This facilitates a novel architectural approach to reduce leakage energy, minimize communication occurrence and energy cost by eliminating unnecessary data transfer, and support auto-tuning for resilience. Furthermore, HSC-FPGA enables new advantages of technology co-design which trades off alternative mappings between emerging devices and transistors at runtime by allowing dynamic remapping to adaptively leverage the intrinsic computing features of each device technology. HSC-FPGA offers a platform for fine-grained Logic-In-Memory architectures and runtime adaptive hardware. An orthogonal dimension of fabric heterogeneity is also non-determinism enabled by either low-voltage CMOS or probabilistic emerging devices. It can be realized using probabilistic devices within a reconfigurable network to blend deterministic and probabilistic computational models. Herein, consider the probabilistic spin logic p-bit device as a fabric element comprising a crossbar-structured weighted array. The Programmability of the resistive network interconnecting p-bit devices can be achieved by modifying the resistive states of the array\u27s weighted connections. Thus, the programmable weighted array forms a CLB-scale macro co-processing element with bitstream programmability. This allows field programmability for a wide range of classification problems and recognition tasks to allow fluid mappings of probabilistic and deterministic computing approaches. In particular, a Deep Belief Network (DBN) is implemented in the field using recurrent layers of co-processing elements to form an n x m1 x m2 x ::: x mi weighted array as a configurable hardware circuit with an n-input layer followed by i ≥ 1 hidden layers. As neuromorphic architectures using post-CMOS devices increase in capability and network size, the utility and benefits of reconfigurable fabrics of neuromorphic modules can be anticipated to continue to accelerate
    corecore