Reliable Low-Power High Performance Spintronic Memories
Following Moore's law, the chip industry has achieved explosive growth over the past five decades. This has also led to an exponential increase in demand for memory components, which in turn has resulted in memory-dominated chips in today's computer systems. However, traditional on-chip memory technologies such as Static Random Access Memories (SRAMs), Dynamic Random Access Memories (DRAMs), and flip-flops pose challenges in terms of scalability, leakage power, and reliability. These challenges, together with the overwhelming demand for higher performance and integration density of on-chip memory, motivate researchers to search for new non-volatile memory technologies. Emerging spintronic memory technologies such as Spin Orbit Torque (SOT) and Spin Transfer Torque (STT) have received considerable attention in recent years because they offer a number of advantages. These include non-volatility, scalability, high endurance, CMOS compatibility, and immunity to soft errors. In spintronics, the spin of an electron represents its information. The datum is stored as a resistance level, which can be changed by applying a polarized current to the storage medium. These memory devices address the static power problem through both their zero-leakage property and their normally-off/instant-on behavior. Nevertheless, other problems, such as high access latency and energy consumption, remain to be solved before they can find widespread use. Addressing these problems requires new computing paradigms, architectures, and design philosophies.
The high access latency of spintronic technology stems from a comparatively long switching time, which exceeds that of conventional SRAM. Furthermore, because of the stochastic switching process of the memory cell and the influence of process variation, a non-negligible period is required for switching. During this period, a constant write current is driven through the bit cell to guarantee the switching event. This causes high energy consumption. The read operation likewise requires a considerable time window, again because of process variation. Set against this are various reliability problems. These include read disturb and other degradation problems such as Time Dependent Dielectric Breakdown (TDDB). These reliability problems are in turn caused by the longer switching times required, which imply that the read or write current is applied for a longer time. It is therefore necessary to reduce both energy and latency in order to increase reliability and make the technology a potential candidate for an on-chip memory system.
In this dissertation, we present design strategies aimed at addressing the challenges of cache, register, and flip-flop design. We achieve this by means of a cross-layer approach. For caches, we developed various circuit-level approaches, which are evaluated at both the memory-architecture level and the system level with respect to energy consumption, performance improvement, and reliability improvement. We develop a self-termination technique for both the read and the write operation of caches. It is able to determine the completion of the respective operation dynamically. Once completion has been detected, the read or write operation is stopped immediately to save energy. In addition, the self-termination technique limits the duration of current flow through the memory cell, which in turn reduces the occurrence of TDDB during write operations and of read disturb during read operations. To improve write latency, we raise the write current at the bit cell to accelerate the magnetic switching process. To account for register-file-specific requirements, we additionally designed a multiport memory architecture that exploits a unique property of the SOT cell to perform read and write operations simultaneously. It is therefore possible to resolve read/write conflicts at the bit-cell level, which in turn results in a much simpler multiport register file architecture.
In addition to the memory approaches, we also presented two flip-flop architectures. The first is a non-volatile non-shadow flip-flop architecture that uses the memory cell as an active component. This enables the supply voltage to be switched on and off instantly, making it particularly well suited to aggressive power gating. Overall, the presented flip-flop design exhibits timing characteristics similar to those of conventional CMOS flip-flops. At the same time, it allows a significant reduction in static power consumption compared with non-volatile shadow flip-flops. The second is a fault-tolerant flip-flop architecture that is resilient to various defects and faults. The effectiveness of all presented techniques is demonstrated by extensive circuit-level simulations, which are further supported by detailed system-level evaluations. Overall, we were able to develop various techniques that achieve substantial improvements in the performance, energy, and reliability of spintronic on-chip memories such as caches, registers, and flip-flops.
Embracing Low-Power Systems with Improvement in Security and Energy-Efficiency
As the economies around the world are aligning more towards usage of computing systems, the global energy demand for computing is increasing rapidly. Additionally, the boom in AI based applications and services has already invited the pervasion of specialized computing hardware architectures for AI (accelerators). A big chunk of research in the industry and academia is being focused on providing energy efficiency to all kinds of power hungry computing architectures. This dissertation adds to these efforts.
Aggressive voltage underscaling of chips is one of the effective low-power paradigms for providing energy efficiency. This dissertation identifies and deals with the reliability and performance problems associated with this paradigm and develops novel energy-efficient approaches. Specifically, the properties of a low-power security primitive have been improved, and higher performance has been unlocked in an AI accelerator (Google TPU) in an aggressively voltage-underscaled environment. In addition, novel power-saving opportunities have been unlocked by characterizing the usage pattern of a baseline TPU with rigorous mathematical analysis.
Harnessing noise to enhance robustness vs. efficiency trade-off in machine learning
While deep nets have achieved human-comparable accuracy in various classification tasks, they fall short significantly in terms of robustness and cost. For example, tiny engineered corruptions in deep net inputs can reduce their accuracy to zero. Furthermore, deep nets also require millions of trainable parameters, resulting in significant training and inference costs. These robustness and cost challenges are well recognized today. In response, there has been a plethora of works focusing on improving either the accuracy vs. robustness trade-off or the accuracy vs. cost trade-off. However, simultaneous consideration of accuracy, robustness, and cost metrics is largely absent today, in part because far fewer works have explored the robustness vs. cost trade-off. This dissertation aims to fill this gap by focusing explicitly on the robustness vs. cost trade-off in the presence of data noise as well as hardware noise. Specifically, we explore how to harness the noise in order to enhance this trade-off. We characterize and improve robustness vs. cost trade-offs across diverse problem settings, ranging from beyond-CMOS hardware implementations of machine learning (ML) classifiers to efficient training of deep nets that are robust to multiple types of corruptions in their inputs. This dissertation can be roughly divided into two parts, one focusing on hardware noise and the other on data noise.
In the first part, we start by focusing on harnessing noise in spintronic hardware implementations, where the logic gates become error-prone when operated at lower switching energy/delay. We propose techniques to shape the resulting hardware noise distribution and to efficiently compensate for it at the system-level output. As a result, we observe a 1000x improvement in tolerance to gate-level switching error rates, while keeping the area/energy overhead of compensation circuits to as low as 15%. These robustness enhancements further enable a 3x reduction in iso-throughput energy consumption of a binary ML classifier employed for EEG-based seizure detection. Building on this work, we propose spintronic channel networks, which exploit the exponential decay of spin current to efficiently realize multi-bit dot product computation. We employ error-prone nanomagnets as efficient stochastic slicers biased by spin currents proportional to the likelihood of the classification decision. We achieve 112x-to-22.5x and 14x-to-2.5x higher energy efficiency over conventional spin-based and 20 nm CMOS designs, respectively, when realizing 10-to-100-dimensional binary classifiers. Furthermore, we also consider the impact of hardware noise originating from process variations and readout circuits in in-memory computing implementations employing non-volatile resistive crossbar arrays. Based on our analysis, we identify design configurations achieving the highest signal-to-noise ratio (SNR), and further estimate how such robustness trades off with the array energy consumption.
In the second part, we switch gears to improve the robustness vs. cost trade-off for deep nets in the presence of data noise. Specifically, we focus on the impact of adversarial perturbations in the deep net inputs. We propose and validate hypotheses about the orientations of dominant subspaces of adversarial perturbations. We demonstrate how changes in the curvature of the decision boundary of the deep nets affect the orientations of the adversarial perturbations. Based on these insights, we demonstrate how shaped noise can be introduced as a feature to enhance the robustness vs. cost trade-off in deep nets. Specifically, we propose shaped noise augmented processing (SNAP), a method to efficiently train deep nets that are robust to multiple types of adversarial perturbations simultaneously. SNAP prepends a deep net with a shaped noise augmentation layer whose distribution is learned along with the network parameters using any established robust training framework. Based on extensive comparisons with nine state-of-the-art (SOTA) robust training frameworks, we show that SNAP achieves the best robustness vs. training cost trade-off. In particular, it enables a 4x reduction in the training cost compared to the SOTA approach published just this last year. Furthermore, thanks to the computational simplicity of SNAP, it is the first technique of its kind that is scalable to large datasets, such as ImageNet.
Integrated Circuits/Microchips
With the world marching inexorably towards the fourth industrial revolution (IR 4.0), one now lives with artificial intelligence (AI), the Internet of Things (IoT), virtual reality (VR) and 5G technology. Wherever we are, whatever we are doing, there are electronic devices that we rely on. While some of these technologies, such as those fueled by smart, autonomous systems, are seemingly precocious, others have existed for quite a while. These devices range from simple home appliances and entertainment media to complex aeronautical instruments. Clearly, the daily lives of mankind today are interwoven seamlessly with electronics. Surprising as it may seem, the cornerstone that empowers these electronic devices is nothing more than a diminutive semiconductor cube block. More colloquially referred to as the Very-Large-Scale-Integration (VLSI) chip, an integrated circuit (IC) chip, or simply a microchip, this semiconductor cube block, approximately the size of a grain of rice, is composed of millions to billions of transistors. The transistors are interconnected in a way that allows electrical circuitry for certain applications to be realized. Some of these chips serve specific permanent applications and are known as Application Specific Integrated Circuits (ASICs), while others are computing processors which can be programmed for diverse applications. The computer processor, together with its supporting hardware and user interfaces, is known as an embedded system. In this book, a variety of topics related to microchips are extensively illustrated. The topics encompass the physics of the microchip device, as well as its design methods and applications.
Architectural Techniques for Multi-Level Cell Phase Change Memory Based Main Memory
Phase change memory (PCM) recently has emerged as a promising technology to meet the fast growing demand for large capacity main memory in modern computing systems. Multi-level cell (MLC) PCM storing multiple bits in a single cell offers high density with low per-byte fabrication cost. However, PCM suffers from long write latency, short cell endurance, limited write throughput and high peak power, which makes it challenging to be integrated in the memory hierarchy.
To address the long write latency, I propose write truncation to reduce the number of write iterations with the assistance of an extra error correction code (ECC). I also propose form switch (FS) to reduce the storage overhead of the ECC. By storing highly compressible lines in single level cell (SLC) form, FS improves read latency as well.
To attack the short cell endurance and large peak power, I propose elastic RESET (ER) to construct triple-level cell PCM. By reducing RESET energy, ER significantly reduces peak power and prolongs PCM lifetime.
To improve the write concurrency, I propose fine-grained write power budgeting (FPB), which observes a global power budget and regulates power across write iterations according to the step-down power demand of each iteration. A global charge pump is also integrated onto a DIMM to boost power for hot PCM chips while staying within the global power budget.
To further reduce the peak power, I propose intra-write RESET scheduling, which distributes cell RESET initializations over the whole duration of the write operation, so that the on-chip charge pump size can also be reduced.
Probabilistic design for emerging memory and nanometer-scale logic
As semiconductor technology has scaled down, the impact of stochastic behavior in very large scale integrated circuits (VLSI) has become an ever-more important concern. This dissertation investigates two distinct classes of problems that require the use of probabilistic methods and models: (1) Modeling and exploiting stochastic behavior in advanced memory technologies; (2) Probabilistic modeling of faults due to on-chip voltage variation.
This dissertation first investigates the unique physics-level stochasticity of spin-transfer torque magnetic RAM (STT-RAM). The write process of STT-RAM is stochastic: specifically, the write time of a bitcell varies significantly. The worst-case approach, which uses the longest write pulse duration, guarantees a successful write; however, it introduces significant energy overhead due to excessive margins, since the average write pulse duration is far shorter than the worst-case pulse duration. This dissertation develops novel circuit techniques to exploit the stochastic properties of the STT-RAM write operation for energy savings by moving away from the worst-case approach to dynamic strategies while maintaining the required low error rate. The first contribution is a variable energy write (VEW) architecture that effectively exploits the wide distribution of write time to greatly reduce energy via a mechanism that checks the instantaneous state of the bitcell and deactivates the write current once the correct value has registered. The second contribution is a multiple attempt write (MAW) strategy that utilizes the asymptotic temporal stochastic independence of repeated switching events to achieve a dramatic reduction in energy. The proposed architectures are evaluated using a compact STT-RAM cell model. Analysis indicates that VEW reduces the write energy by 94.7% with approximately 1% relative area overhead under an efficient design methodology, compared with conventional designs relying on the worst-case approach. MAW reduces the overall write energy by 94.6% with approximately 0.05% relative area overhead.
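The contrast between a worst-case write pulse and the two dynamic strategies can be illustrated with a toy calculation. The sketch below is not the dissertation's cell model: it assumes, purely for illustration, that the bitcell switching time is exponentially distributed with mean `tau`, that energy is proportional to the time the write current is on, and that the verification reads used by MAW are free. All function names are hypothetical.

```python
import math

def worst_case_pulse(tau, eps):
    """Pulse length needed so a single fixed pulse fails with probability eps.

    Toy assumption: P(cell not yet switched after t) = exp(-t / tau).
    """
    return tau * math.log(1.0 / eps)

def vew_expected_on_time(tau, eps):
    """Expected current-on time when the write stops as soon as the cell switches.

    The pulse is still capped at the worst-case length, so this is
    E[min(T, t_wc)] = tau * (1 - exp(-t_wc / tau)) for exponential T.
    """
    t_wc = worst_case_pulse(tau, eps)
    return tau * (1.0 - math.exp(-t_wc / tau))

def maw_expected_on_time(tau, eps, t_short):
    """Expected current-on time for repeated short pulses with a check after each.

    Attempts are assumed independent; n is the smallest attempt count whose
    combined failure probability q**n does not exceed eps.
    """
    q = math.exp(-t_short / tau)                 # failure probability of one attempt
    n = math.ceil(math.log(eps) / math.log(q))   # attempts needed to reach eps
    expected_attempts = (1.0 - q ** n) / (1.0 - q)
    return n, t_short * expected_attempts

if __name__ == "__main__":
    tau, eps = 1.0, 1e-9
    t_wc = worst_case_pulse(tau, eps)
    vew = vew_expected_on_time(tau, eps)
    n, maw = maw_expected_on_time(tau, eps, t_short=tau)
    print(f"worst-case on-time: {t_wc:.2f}")
    print(f"VEW expected on-time: {vew:.2f}")
    print(f"MAW ({n} attempts max) expected on-time: {maw:.2f}")
```

Under these assumptions both dynamic strategies spend far less time driving current than the fixed worst-case pulse, because most writes complete early. The savings printed by the sketch belong to the toy model only and should not be read as reproducing the figures reported above.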
This dissertation then addresses the problem of probabilistic modeling of faults due to on-chip voltage variations. Power supply voltage variation can increase gate delay, resulting in timing faults on near-critical paths. These low-level faults ultimately propagate to architecture and application levels, often leading to critical system failures. Developing an accurate fault model and injection tool that generates and propagates faults from circuit- to gate-level is important for accurately predicting the resulting system failures. This is challenging since the model needs to accurately capture the physical characteristics at the circuit level that define the likelihood of a fault and use that information to guide the injection with the proper probability. At the same time, the analysis and fault injections need to be computationally manageable to allow analysis of realistic systems under realistic workloads. Conventional fault models rely on either Monte Carlo sampling or time-consuming runtime simulation using the worst-case voltage drop. To overcome the overhead of runtime circuit-level simulation, a novel two-phase approach is proposed. The main idea is that circuit characterization can be done before simulation. The result of pre-characterization is used at runtime via a form of look-up to enable gate-level efficiency. The two-phase methodology is time-efficient but may require high memory unless the look-up tables are carefully optimized. This dissertation also develops fault probability estimation based on a workload-specific voltage distribution, rather than a fixed worst-case voltage. The proposed methodology is implemented on an OpenSPARC design targeting a 32 nm technology node. Analysis indicates that the proposed fault modeling and injection flow reduces runtime overhead by 24X compared to the previously best-known gate-level fault simulator while retaining circuit-level accuracy. Electrical and Computer Engineerin
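The two-phase idea (characterize once, look up at runtime) and the use of a workload-specific voltage distribution can be sketched as follows. This is a minimal illustration, not the dissertation's tool: the look-up table contents, the voltage bins in millivolts, and the function names are all hypothetical.

```python
def expected_fault_probability(lut, voltage_histogram):
    """Weight pre-characterized fault probabilities by how often each voltage occurs.

    lut: {voltage_mV: fault probability}, assumed to come from an offline
         circuit-level characterization phase.
    voltage_histogram: {voltage_mV: number of cycles observed at that voltage}
         for one workload.
    """
    total_cycles = sum(voltage_histogram.values())
    return sum(
        (cycles / total_cycles) * lut[v]
        for v, cycles in voltage_histogram.items()
    )

def worst_case_fault_probability(lut, voltage_histogram):
    """Fault probability if every cycle is assumed to run at the lowest observed voltage."""
    return lut[min(voltage_histogram)]

if __name__ == "__main__":
    # Hypothetical characterization data: lower supply voltage, higher fault probability.
    lut = {900: 1e-3, 950: 1e-5, 1000: 1e-7}
    # Hypothetical workload: the deepest droop occurs in only 1% of cycles.
    histogram = {900: 1, 950: 9, 1000: 90}
    print(expected_fault_probability(lut, histogram))
    print(worst_case_fault_probability(lut, histogram))
```

In this made-up example the workload-weighted estimate is much lower than the fixed worst-case figure, which is the kind of pessimism a distribution-based estimate is meant to remove.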
A Construction Kit for Efficient Low Power Neural Network Accelerator Designs
Implementing embedded neural network processing at the edge requires
efficient hardware acceleration that couples high computational performance
with low power consumption. Driven by the rapid evolution of network
architectures and their algorithmic features, accelerator designs are
constantly updated and improved. To evaluate and compare hardware design
choices, designers can refer to a myriad of accelerator implementations in the
literature. Surveys provide an overview of these works but are often limited to
system-level and benchmark-specific performance metrics, making it difficult to
quantitatively compare the individual effect of each utilized optimization
technique. This complicates the evaluation of optimizations for new accelerator
designs, slowing down research progress. This work provides a survey of
neural network accelerator optimization approaches that have been used in
recent works and reports their individual effects on edge processing
performance. It presents the list of optimizations and their quantitative
effects as a construction kit, allowing designers to assess the design choices
for each building block separately. Reported optimizations range from up to
10,000x memory savings to 33x energy reductions, providing chip designers an
overview of design choices for implementing efficient low power neural network
accelerators.
Low Power Memory/Memristor Devices and Systems
This reprint focusses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques starting from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within
Shared Resource Management for Non-Volatile Asymmetric Memory
Non-volatile memory (NVM), such as Phase-Change Memory (PCM), is a promising energy-efficient candidate to replace DRAM. It is desirable because of its non-volatility, good scalability and low idle power. NVM, nevertheless, faces important challenges. The main problems are: writes are much slower and more power hungry than reads and write bandwidth is much lower than read bandwidth. Hybrid main memory architecture, which consists of a large NVM and a small DRAM, may become a solution for architecting NVM as main memory. Adding an extra layer of cache mitigates the drawbacks of NVM writes. However, writebacks from the last-level cache (LLC) might still (a) overwhelm the limited NVM write bandwidth and stall the application, (b) shorten lifetime and (c) increase energy consumption.
Effectively utilizing shared resources, such as the last-level cache and the memory bandwidth, is crucial to achieving high performance for multi-core systems. No existing cache and bandwidth allocation scheme exploits the read/write asymmetry property, which is fundamental in NVM. This thesis tries to consider the asymmetry property in partitioning the cache and memory bandwidth for NVM systems.
The thesis proposes three writeback-aware schemes to manage the resources in NVM systems. First, a runtime mechanism, Writeback-aware Cache Partitioning (WCP), is proposed to partition the shared LLC among multiple applications. Unlike past partitioning schemes, WCP considers the reduction in cache misses as well as writebacks. Second, a new runtime mechanism, Writeback-aware Bandwidth Partitioning (WBP), partitions NVM service cycles among applications. WBP uses a bandwidth partitioning weight to reflect the importance of writebacks (in addition to LLC misses) to bandwidth allocation. A companion Dynamic Weight Adjustment scheme dynamically selects the cache partitioning weight to maximize system performance. Third, Unified Writeback-aware Partitioning (UWP) partitions the last-level cache and the memory bandwidth cooperatively. UWP can further improve the system performance by considering the interaction of cache partitioning and bandwidth partitioning. The three proposed schemes improve system performance by considering the unique read/write asymmetry property of NVM
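The idea of counting writebacks alongside misses when partitioning the cache can be sketched with a simple greedy way allocator. This is an illustrative simplification, not the thesis's WCP mechanism: the per-application miss and writeback curves, the `wb_weight` parameter, and the greedy policy are all assumptions made for the example.

```python
def partition_ways(misses, writebacks, total_ways, wb_weight):
    """Greedily hand out cache ways to whichever application benefits most.

    misses[a][w] and writebacks[a][w] are assumed (hypothetical) counts for
    application a when it owns w ways. The benefit of one more way is the
    drop in misses plus wb_weight times the drop in writebacks, so
    wb_weight = 0 gives a miss-only partitioner.
    """
    n_apps = len(misses)
    alloc = [1] * n_apps  # assumption: every application keeps at least one way

    def cost(app, ways):
        return misses[app][ways] + wb_weight * writebacks[app][ways]

    for _ in range(total_ways - n_apps):
        # Pick the application with the largest marginal cost reduction.
        best = max(
            range(n_apps),
            key=lambda a: cost(a, alloc[a]) - cost(a, alloc[a] + 1),
        )
        alloc[best] += 1
    return alloc

if __name__ == "__main__":
    # Hypothetical curves for 0..4 ways; application 1 is write-heavy.
    misses = [[100, 60, 40, 30, 25], [100, 70, 55, 45, 40]]
    writebacks = [[10, 8, 7, 6, 6], [80, 50, 30, 15, 10]]
    print(partition_ways(misses, writebacks, total_ways=4, wb_weight=0.0))
    print(partition_ways(misses, writebacks, total_ways=4, wb_weight=2.0))
```

With the made-up curves above, the miss-only setting splits the four ways evenly, while weighting writebacks shifts a way to the write-heavy application, reflecting that avoiding an NVM write is worth more than avoiding a read.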