97 research outputs found

    Improving Phase Change Memory (PCM) and Spin-Torque-Transfer Magnetic-RAM (STT-MRAM) as Next-Generation Memories: A Circuit Perspective

    Get PDF
    In the memory hierarchy of computer systems, the traditional semiconductor memories Static RAM (SRAM) and Dynamic RAM (DRAM) have already served for several decades as cache and main memory. With technology scaling, they face increasingly intractable challenges like power, density, reliability and scalability. As a result, they become less appealing in the multi/many-core era with ever increasing size and memory-intensity of working sets. Recently, there is an increasing interest in using emerging non-volatile memory technologies in replacement of SRAM and DRAM, due to their advantages like non-volatility, high device density, near-zero cell leakage and resilience to soft errors. Among several new memory technologies, Phase Change Memory (PCM) and Spin-Torque-Transfer Magnetic-RAM (STT-MRAM) are most promising candidates in building main memory and cache, respectively. However, both of them possess unique limitations that preventing them from being effectively adopted. In this dissertation, I present my circuit design work on tackling the limitations of PCM and STT-MRAM. At bit level, both PCM and STT-MRAM suffer from excessive write energy, and PCM has very limited write endurance. For PCM, I implement Differential Write to remove large number of unnecessary bit-writes that do not alter the stored data. It is then extended to STT-MRAM as Early Write Termination, with specific optimizations to eliminate the overhead of pre-write read. At array level, PCM enjoys high density but could not provide competitive throughput due to its long write latency and limited number of read/write circuits. I propose a Pseudo-Multi-Port Bank design to exploit intra-bank parallelism by recycling and reusing shared peripheral circuits between accesses in a time-multiplexed manner. On the other hand, although STT-MRAM features satisfactory throughput, its conventional array architecture is constrained on density and scalability by the pitch of the per-column bitline pair. I propose a Common-Source-Line Array architecture which uses a shared source-line along the row, essentially leaving only one bitline per column. For these techniques, I provide circuit level analyses as well as architecture/system level and/or process/device level discussions. In addition, relevant background and work are thoroughly surveyed and potential future research topics are discussed, offering insights and prospects of these next-generation memories

    Ultra-low-power SRAM design in high variability advanced CMOS

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.Cataloged from PDF version of thesis.Includes bibliographical references (p. 163-181).Embedded SRAMs are a critical component in modern digital systems, and their role is preferentially increasing. As a result, SRAMs strongly impact the overall power, performance, and area, and, in order to manage these severely constrained trade-offs, they must be specially designed for target applications. Highly energy-constrained systems (e.g. implantable biomedical devices, multimedia handsets, etc.) are an important class of applications driving ultra-low-power SRAMs. This thesis analyzes the energy of an SRAM sub-array. Since supply- and threshold-voltage have a strong effect, targets for these are established in order to optimize energy. Despite the heavy emphasis on leakage-energy, analysis of a high-density 256x256 sub-array in 45nm LP CMOS points to two necessary optimizations: (1) aggressive supply-voltage reduction (in addition to Vt elevation), and (2) performance enhancement. Important SRAM metrics, including read/write/hold-margin and read-current, are also investigated to identify trade-offs of these optimizations. Based on the need to lower supply-voltage, a 0.35V 256kb SRAM is demonstrated in 65nm LP CMOS. It uses an 8T bit-cell with peripheral circuit-assists to improve write-margin and bit-line leakage. Additionally, redundancy, to manage the increasing impact of variability in the periphery, is proposed to improve the area-offset trade-off of sense-amplifiers, demonstrating promise for highly advanced technology nodes. Based on the need to improve performance, which is limited by density constraints, a 64kb SRAM, using an offset-compensating sense-amplifier, is demonstrated in 45nm LP CMOS with high-density 0.25[mu]m2 bit-cells.(cont.) The sense-amplifier is regenerative, but non -strobed, overcoming timing uncertainties limiting performance, and it is single-ended, for compatibility with 8T cells. Compared to a conventional strobed sense-amplifier, it achieves 34% improvement in worst-case access-time and 4x improvement in the standard deviation of the access-time.by Naveen Verma.Ph.D

    Circuit and Architecture Co-Design of STT-RAM for High Performance and Low Energy

    Get PDF
    Spin-Transfer Torque Random Access Memory (STT-RAM) has been proved a promising emerging nonvolatile memory technology suitable for many applications such as cache mem- ory of CPU. Compared with other conventional memory technology, STT-RAM offers many attractive features such as nonvolatility, fast random access speed and extreme low leakage power. However, STT-RAM is still facing many challenges. First of all, programming STT-RAM is a stochastic process due to random thermal fluctuations, so the write errors are hard to avoid. Secondly, the existing STT-RAM cell designs can be used for only single-port accesses, which limits the memory access bandwidth and constraints the system performance. Finally, while other memory technology supports multi-level cell (MLC) design to boost the storage density, adopting MLC to STT-RAM brings many disadvantages such as requirement for large transistor and low access speed. In this work, we proposed solutions on both circuit and architecture level to address these challenges. For the write error issues, we proposed two probabilistic methods, namely write-verify- rewrite with adaptive period (WRAP) and verify-one-while-writing (VOW), for performance improvement and write failure reduction. For dual-port solution, we propose the design methods to support dual-port accesses for STT-RAM. The area increment by introducing an additional port is reduced by leveraging the shared source-line structure. Detailed analysis on the performance/reliability degrada- tion caused by dual-port accesses is performed, and the corresponding design optimization is provided. To unleash the potential of MLC STT-RAM cache, we proposed a new design through a cross-layer co-optimization. The memory cell structure integrated the reversed stacking of magnetic junction tunneling (MTJ) for a more balanced device and design trade-off. In architecture development, we presented an adaptive mode switching mechanism: based on application’s memory access behavior, the MLC STT-RAM cache can dynamically change between low latency SLC mode and high capacity MLC mode. Finally, we present a 4Kb test chip design which can support different types and sizes of MTJs. A configurable sensing solution is used in the test chip so that it can support wide range of MTJ resistance. Such test chip design can help to evaluate various type of MTJs in the future

    Reliability in the face of variability in nanometer embedded memories

    Get PDF
    In this thesis, we have investigated the impact of parametric variations on the behaviour of one performance-critical processor structure - embedded memories. As variations manifest as a spread in power and performance, as a first step, we propose a novel modeling methodology that helps evaluate the impact of circuit-level optimizations on architecture-level design choices. Choices made at the design-stage ensure conflicting requirements from higher-levels are decoupled. We then complement such design-time optimizations with a runtime mechanism that takes advantage of adaptive body-biasing to lower power whilst improving performance in the presence of variability. Our proposal uses a novel fully-digital variation tracking hardware using embedded DRAM (eDRAM) cells to monitor run-time changes in cache latency and leakage. A special fine-grain body-bias generator uses the measurements to generate an optimal body-bias that is needed to meet the required yield targets. A novel variation-tolerant and soft-error hardened eDRAM cell is also proposed as an alternate candidate for replacing existing SRAM-based designs in latency critical memory structures. In the ultra low-power domain where reliable operation is limited by the minimum voltage of operation (Vddmin), we analyse the impact of failures on cache functional margin and functional yield. Towards this end, we have developed a fully automated tool (INFORMER) capable of estimating memory-wide metrics such as power, performance and yield accurately and rapidly. Using the developed tool, we then evaluate the #effectiveness of a new class of hybrid techniques in improving cache yield through failure prevention and correction. Having a holistic perspective of memory-wide metrics helps us arrive at design-choices optimized simultaneously for multiple metrics needed for maintaining lifetime requirements

    Design of Variation-Tolerant Circuits for Nanometer CMOS Technology: Circuits and Architecture Co-Design

    Get PDF
    Aggressive scaling of CMOS technology in sub-90nm nodes has created huge challenges. Variations due to fundamental physical limits, such as random dopants fluctuation (RDF) and line edge roughness (LER) are increasing significantly with technology scaling. In addition, manufacturing tolerances in process technology are not scaling at the same pace as transistor's channel length due to process control limitations (e.g., sub-wavelength lithography). Therefore, within-die process variations worsen with successive technology generations. These variations have a strong impact on the maximum clock frequency and leakage power for any digital circuit, and can also result in functional yield losses in variation-sensitive digital circuits (such as SRAM). Moreover, in nanometer technologies, digital circuits show an increased sensitivity to process variations due to low-voltage operation requirements, which are aggravated by the strong demand for lower power consumption and cost while achieving higher performance and density. It is therefore not surprising that the International Technology Roadmap for Semiconductors (ITRS) lists variability as one of the most challenging obstacles for IC design in nanometer regime. To facilitate variation-tolerant design, we study the impact of random variations on the delay variability of a logic gate and derive simple and scalable statistical models to evaluate delay variations in the presence of within-die variations. This work provides new design insight and highlights the importance of accounting for the effect of input slew on delay variations, especially at lower supply voltages. The derived models are simple, scalable, bias dependent and only require the knowledge of easily measurable parameters. This makes them useful in early design exploration, circuit/architecture optimization as well as technology prediction (especially in low-power and low-voltage operation). The derived models are verified using Monte Carlo SPICE simulations using industrial 90nm technology. Random variations in nanometer technologies are considered one of the largest design considerations. This is especially true for SRAM, due to the large variations in bitcell characteristics. Typically, SRAM bitcells have the smallest device sizes on a chip. Therefore, they show the largest sensitivity to different sources of variations. With the drastic increase in memory densities, lower supply voltages and higher variations, statistical simulation methodologies become imperative to estimate memory yield and optimize performance and power. In this research, we present a methodology for statistical simulation of SRAM read access yield, which is tightly related to SRAM performance and power consumption. The proposed flow accounts for the impact of bitcell read current variation, sense amplifier offset distribution, timing window variation and leakage variation on functional yield. The methodology overcomes the pessimism existing in conventional worst-case design techniques that are used in SRAM design. The proposed statistical yield estimation methodology allows early yield prediction in the design cycle, which can be used to trade off performance and power requirements for SRAM. The methodology is verified using measured silicon yield data from a 1Mb memory fabricated in an industrial 45nm technology. Embedded SRAM dominates modern SoCs and there is a strong demand for SRAM with lower power consumption while achieving high performance and high density. However, in the presence of large process variations, SRAMs are expected to consume larger power to ensure correct read operation and meet yield targets. We propose a new architecture that significantly reduces array switching power for SRAM. The proposed architecture combines built-in self-test (BIST) and digitally controlled delay elements to reduce the wordline pulse width for memories while ensuring correct read operation; hence, reducing switching power. A new statistical simulation flow was developed to evaluate the power savings for the proposed architecture. Monte Carlo simulations using a 1Mb SRAM macro from an industrial 45nm technology was used to examine the power reduction achieved by the system. The proposed architecture can reduce the array switching power significantly and shows large power saving - especially as the chip level memory density increases. For a 48Mb memory density, a 27% reduction in array switching power can be achieved for a read access yield target of 95%. In addition, the proposed system can provide larger power saving as process variations increase, which makes it a very attractive solution for 45nm and below technologies. In addition to its impact on bitcell read current, the increase of local variations in nanometer technologies strongly affect SRAM cell stability. In this research, we propose a novel single supply voltage read assist technique to improve SRAM static noise margin (SNM). The proposed technique allows precharging different parts of the bitlines to VDD and GND and uses charge sharing to precisely control the bitline voltage, which improves the bitcell stability. In addition to improving SNM, the proposed technique also reduces memory access time. Moreover, it only requires one supply voltage, hence, eliminates the need of large area voltage shifters. The proposed technique has been implemented in the design of a 512kb memory fabricated in 45nm technology. Results show improvements in SNM and read operation window which confirms the effectiveness and robustness of this technique

    Technology Implications for Large Last-Level Caches

    Get PDF
    Large last-level cache (L3C) is efficient for bridging the performance and power gap between processor and memory. Several memory technologies, including SRAM, STT-RAM (MRAM), and embedded DRAM (eDRAM), have been used or considered as the technology to implement L3Cs. However, each of them has inherent weaknesses: SRAM is relatively low density and dissipates high leakage; STT-RAM has long write latency and requires high write energy; eDRAM requires refresh. As future processors are expected to have larger last-level caches, the objective of this dissertation is to study the tradeoffs associated with using each of these technologies to implement L3Cs. In order to make useful comparisons between L3Cs built with SRAM, STT-RAM, and eDRAM, we consider and implement several levels of details. First, to obtain unbiased cache performance and power properties (i.e., read/write access latency, read/write access energy, leakage power, refresh power, area), we prototype caches based on realistic memory and device models. Second, we present simplistic analytical models that enable us to quickly examine different memory technologies under various scenarios. Third, we review power-optimization techniques for each of the technologies, and propose using a low-cost dead-line prediction scheme for eDRAM-based L3Cs to eliminate unnecessary refreshes. Finally, the highlight of this dissertation is the comparison and analysis of low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM. We report system performance, last-level cache energy breakdown, and memory hierarchy energy breakdown, using an augmented full-system simulator with the execution of a range of workloads and input sets. From the insights gained through simulation results, STT-RAM has the highest potential to save energy in future L3C designs. For contemporary processors, SRAM-based L3C results in the fastest system performance, whereas eDRAM consumes the lowest energy

    先端プロセス技術における混載SRAMの高信頼・低電力化に関する研究

    Get PDF
    13301甲第4843号博士(工学)金沢大学博士論文本文Ful

    Computing with Spintronics: Circuits and architectures

    Get PDF
    This thesis makes the following contributions towards the design of computing platforms with spintronic devices. 1) It explores the use of spintronic memories in the design of a domain-specific processor for an emerging class of data-intensive applications, namely recognition, mining and synthesis (RMS). Two different spintronic memory technologies — Domain Wall Memory (DWM) and STT-MRAM — are utilized to realize the different levels in the memory hierarchy of the domain-specific processor, based on their respective access characteristics. Architectural tradeoffs created by the use of spintronic memories are analyzed. The proposed design achieves 1.5X-4X improvements in energy-delay product compared to a CMOS baseline. 2) It describes the first attempt to use DWM in the cache hierarchy of general-purpose processors. DWM promises unparalleled density by packing several bits of data into each bit-cell. TapeCache, the proposed DWM-based cache architecture, utilizes suitable circuit and architectural optimizations to address two key challenges (i) the high energy and latency requirement of write operations and (ii) the need for shift operations to access the data stored in each DWM bit-cell. At the circuit level, DWM bit-cells that are tailored to the distinct design requirements of different levels in the cache hierarchy are proposed. At the architecture level, TapeCache proposes suitable cache organization and management policies to alleviate the performance impact of shift operations required to access data stored in DWM bit-cells. TapeCache achieves more than 7X improvements in both cache area and energy with virtually identical performance compared to an SRAM-based cache hierarchy. 3) It investigates the design of the on-chip memory hierarchy of general-purpose graphics processing units (GPGPUs)—massively parallel processors that are optimized for data-intensive high-throughput workloads—using DWM. STAG, a high density, energy-efficient Spintronic- Tape Architecture for GPGPU cache hierarchies is described. STAG utilizes different DWM bit-cells to realize different memory arrays in the GPGPU cache hierarchy. To address the challenge of high access latencies due to shifts, STAG predicts upcoming cache accesses by leveraging unique characteristics of GPGPU architectures and workloads, and prefetches data that are both likely to be accessed and require large numbers of shift operations. STAG achieves 3.3X energy reduction and 12.1% performance improvement over CMOS SRAM under iso-area conditions. 4) While the potential of spintronic devices for memories is widely recognized, their utility in realizing logic is much less clear. The thesis presents Spintastic, a new paradigm that utilizes Stochastic Computing (SC) to realize spintronic logic. In SC, data is encoded in the form of pseudo-random bitstreams, such that the probability of a \u271\u27 in a bitstream corresponds to the numerical value that it represents. SC can enable compact, low-complexity logic implementations of various arithmetic functions. Spintastic establishes the synergy between stochastic computing and spin-based logic by demonstrating that they mutually alleviate each other\u27s limitations. On the one hand, various building blocks of SC, which incur significant overheads in CMOS implementations, can be efficiently realized by exploiting the physical characteristics of spin devices. On the other hand, the reduced logic complexity and low logic depth of SC circuits alleviates the shortcomings of spintronic logic. Based on this insight, the design of spin-based stochastic arithmetic circuits, bitstream generators, bitstream permuters and stochastic-to-binary converter circuits are presented. Spintastic achieves 7.1X energy reduction over CMOS implementations for a wide range of benchmarks from the image processing, signal processing, and RMS application domains. 5) In order to evaluate the proposed spintronic designs, the thesis describes various device-to-architecture modeling frameworks. Starting with devices models that are calibrated to measurements, the characteristics of spintronic devices are successively abstracted into circuit-level and architectural models, which are incorporated into suitable simulation frameworks. (Abstract shortened by UMI.

    myCACTI: A new cache design tool for pipelined nanometer caches

    Get PDF
    TThe presence of caches in microprocessors has always been one of the most important techniques in bridging the memory wall, or the speed gap between the microprocessor and main memory. This importance is continuously increasing especially as we enter the regime of nanometer process technologies (i.e. 90nm and below), as industry has favored investing a larger and larger fraction of a chip.s transistor budget to improving the on-chip cache. This is the case in practice, as it has proven to be an efficient way to utilize the increasing number of transistors available with each succeeding technology. Consequently, it becomes even more important to have cache design tools that give accurate representations of designs that exist in actual microprocessors. The prevalent cache design tools that are the most widely used in academe are CACTI [Wilton1996] and eCACTI [Mamidipaka2004], and these have proven to be very useful tools not just for cache designers, but also for computer architects. This dissertation will show that both CACTI and eCACTI still contain major limitations and even flaws in their design, making them unsuitable for use in very-deep submicron and nanometer caches, especially pipelined designs. These limitations and flaws will be discussed in detail. This dissertation then introduces a new tool, called myCACTI, that addresses all these limitations and, in addition, introduces major enhancements to the simulation framework. This dissertation then demonstrates the use of myCACTI in the cache design process. Detailed design space explorations are done on multiple cache configurations to produce pareto optimal curves of the caches to show optimal implementations. Detailed studies are also performed to characterize the delay and power dissipation of different cache configurations and implementations. Finally, future directions to the development of myCACTI are identified to show possible ways that the tool can be improved in such a way as to allow even more different kinds of studies to be performed

    Power Management and SRAM for Energy-Autonomous and Low-Power Systems

    Full text link
    We demonstrate the two first-known, complete, self-powered millimeter-scale computer systems. These microsystems achieve zero-net-energy operation using solar energy harvesting and ultra-low-power circuits. A medical implant for monitoring intraocular pressure (IOP) is presented as part of a treatment for glaucoma. The 1.5mm3 IOP monitor is easily implantable because of its small size and measures IOP with 0.5mmHg accuracy. It wirelessly transmits data to an external wand while consuming 4.7nJ/bit. This provides rapid feedback about treatment efficacies to decrease physician response time and potentially prevent unnecessary vision loss. A nearly-perpetual temperature sensor is presented that processes data using a 2.1μW near-threshold ARM°R Cortex- M3TM μP that provides a widely-used and trusted programming platform. Energy harvesting and power management techniques for these two microsystems enable energy-autonomous operation. The IOP monitor harvests 80nW of solar power while consuming only 5.3nW, extending lifetime indefinitely. This allows the device to provide medical information for extended periods of time, giving doctors time to converge upon the best glaucoma treatment. The temperature sensor uses on-demand power delivery to improve low-load dc-dc voltage conversion efficiency by 4.75x. It also performs linear regulation to deliver power with low noise, improved load regulation, and tight line regulation. Low-power high-throughput SRAM techniques help millimeter-scale microsystems meet stringent power budgets. VDD scaling in memory decreases energy per access, but also decreases stability margins. These margins can be improved using sizing, VTH selection, and assist circuits, as well as new bitcell designs. Adaptive Crosshairs modulation of SRAM power supplies fixes 70% of parametric failures. Half-differential SRAM design improves stability, reducing VMIN by 72mV. The circuit techniques for energy autonomy presented in this dissertation enable millimeter-scale microsystems for medical implants, such as blood pressure and glucose sensors, as well as non-medical applications, such as supply chain and infrastructure monitoring. These pervasive sensors represent the continuation of Bell’s Law, which accurately traces the evolution of computers as they become smaller, more numerous, and more powerful. The development of millimeter-scale massively-deployed ubiquitous computers ensures the continued expansion and profitability of the semiconductor industry. NanoWatt circuit techniques will allow us to meet this next frontier in IC design.Ph.D.Electrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/86387/1/grgkchen_1.pd
    corecore