
    DeepNVM++: Cross-Layer Modeling and Optimization Framework of Non-Volatile Memories for Deep Learning

    Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM++, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analyses for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2x and 2.3x EDP reduction and accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications. (12 pages, 10 figures)
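
    The energy-delay product (EDP) comparisons above are easy to reproduce once per-technology energy and delay figures are available. As a minimal, hedged illustration, the Python sketch below computes iso-capacity EDP reduction factors from placeholder numbers; the values are invented for demonstration and are not taken from the paper.

        # Illustrative EDP comparison for last-level cache candidates.
        # Energy/delay figures are placeholders, NOT the paper's data.
        techs = {
            #            (energy per access in nJ, access delay in ns)
            "SRAM":      (1.00, 2.00),
            "STT-MRAM":  (0.70, 0.75),
            "SOT-MRAM":  (0.55, 0.78),
        }

        edp = {name: e * d for name, (e, d) in techs.items()}
        baseline = edp["SRAM"]

        for name, value in edp.items():
            print(f"{name:9s} EDP = {value:.3f} nJ*ns "
                  f"({baseline / value:.2f}x reduction vs. SRAM)")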

    Gestión de jerarquías de memoria híbridas a nivel de sistema (System-Level Management of Hybrid Memory Hierarchies)

    Unpublished doctoral thesis of the Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, and KU Leuven, Arenberg Doctoral School, Faculty of Engineering Science, defended on 11/05/2017. In electronics and computer science, the term 'memory' generally refers to devices used to store information for appliances ranging from PCs to hand-held devices, smart appliances, etc. Primary/main memory is used for storage systems that function at high speed (i.e., RAM). Primary memory is often associated with addressable semiconductor memory, i.e., integrated circuits consisting of silicon-based transistors, used for example as primary memory but also for other purposes in computers and other digital electronic devices. Secondary/auxiliary memory, in comparison, provides program and data storage that is slower to access but offers larger capacity. Examples include external hard drives, portable flash drives, CDs, and DVDs. These devices and media must be either plugged in or inserted into a computer in order to be accessed by the system. Since secondary storage technology is not always connected to the computer, it is commonly used for backing up data; the term storage is often used to describe secondary memory. Secondary memory stores a large amount of data at a lower cost per byte than primary memory, making secondary storage roughly two orders of magnitude less expensive than primary storage. There are two main types of semiconductor memory: volatile and non-volatile. Examples of non-volatile memory are Flash memory (sometimes used as secondary, sometimes as primary computer memory) and ROM/PROM/EPROM/EEPROM memory (used for firmware such as boot programs). Examples of volatile memory are primary memory (typically dynamic RAM, DRAM) and fast CPU cache memory (typically static RAM, SRAM, which is fast but energy-consuming and offers lower memory capacity per area unit than DRAM). Non-volatile memory technologies in Si-based electronics date back to the 1990s. Flash memory is widely used in consumer electronic products such as cellphones and music players, and NAND Flash-based solid-state disks (SSDs) are increasingly displacing hard disk drives as the primary storage device in laptops, desktops, and even data centers. The integration limit of Flash memories is approaching, and many new types of memory to replace conventional Flash memories have been proposed. The rapid increase of leakage currents in silicon CMOS transistors with scaling poses a big challenge for the integration of SRAM memories, which are also increasingly susceptible to read/write failures under low-power schemes. As a result, over the past decade there has been an extensive pooling of time, resources, and effort towards developing emerging memory technologies like Resistive RAM (ReRAM/RRAM), STT-MRAM, Domain Wall Memory, and Phase Change Memory (PRAM). Emerging non-volatile memory technologies promise new memories that store more data at less cost than the expensive-to-build silicon chips used by popular consumer gadgets including digital cameras, cell phones, and portable music players. These new memory technologies combine the speed of static random-access memory (SRAM), the density of dynamic random-access memory (DRAM), and the non-volatility of Flash memory, and so have become very attractive as another possibility for future memory hierarchies.
The research and information on these Non-Volatile Memory (NVM) technologies has matured over the last decade. These NVMs are now being explored thoroughly as viable replacements for conventional SRAM-based memories, even for the higher levels of the memory hierarchy. Many other new classes of emerging memory technologies, such as transparent and plastic, three-dimensional (3-D), and quantum-dot memory technologies, have also gained tremendous popularity in recent years... In the field of computing, the term 'memory' generally refers to devices used to store information that will later be used in a variety of devices, from personal computers (PCs) to mobile phones, smart devices, etc. The system's main memory is used to store the data and instructions of the processes currently in execution, so it must operate at high speed (for example, DRAM). Main memory is usually implemented with addressable semiconductor memories, DRAM and SRAM being the main exponents. Auxiliary or secondary memory, on the other hand, provides storage (for files, for example); it is slower but offers larger capacity. Typical examples of secondary memory are hard disks, portable flash drives, CDs, and DVDs. Because these devices do not need to be permanently connected to the computer, they are widely used to store backup copies. Secondary memory stores a large amount of data at a lower cost per bit than main memory, typically being two orders of magnitude cheaper than primary memory. There are two types of semiconductor memory: volatile and non-volatile. Examples of non-volatile memories are Flash memories (sometimes used as secondary memory and sometimes as main memory) and ROM/PROM/EPROM/EEPROM memories (used for firmware such as boot programs). Examples of volatile memory are DRAM (dynamic RAM), currently the predominant option for implementing main memory, and SRAM (static RAM), which is faster and more expensive and is used for the different cache levels. Non-volatile memory technologies based on silicon electronics date back to the 1990s. A charge-storage memory variant known as Flash memory is used worldwide in consumer electronics such as mobile phones and music players, while NAND Flash solid-state disks (SSDs) are progressively displacing hard disk drives as the main storage unit in laptops, desktops, and even data centers. Today, several factors threaten the current predominance of charge-based (capacitive) semiconductor memories. On the one hand, the integration limit of Flash memories is being reached, which compromises their scaling in the medium term. On the other hand, the sharp increase in the leakage currents of current silicon CMOS transistors poses an enormous challenge for the integration of SRAM memories. Likewise, these memories are increasingly susceptible to read/write failures in low-power designs. As a result of these problems, which worsen with every new technology generation, efforts to develop new technologies that replace, or at least complement, the current ones have intensified in recent years.
Ferroelectric field-effect transistors (FeFETs) are considered one of the most promising alternatives to replace both Flash (thanks to their higher density) and DRAM (thanks to their higher speed), but they are still at a very early stage of development. There are other, somewhat more mature technologies in the field of resistive RAM, among which ReRAM (or RRAM), STT-RAM, Domain Wall Memory, and Phase Change Memory (PRAM) stand out...

    System and Design Technology Co-optimization of SOT-MRAM for High-Performance AI Accelerator Memory System

    SoCs are now designed with their own AI accelerator segment to accommodate the ever-increasing demand of Deep Learning (DL) applications. With powerful MAC engines for matrix multiplications, these accelerators show high computing performance. However, because of limited memory resources (i.e., bandwidth and capacity), they fail to achieve optimum system performance during large batch training and inference. In this work, we propose a memory system with high on-chip capacity and bandwidth to shift AI accelerators from being memory-bound to achieving system-level peak performance. We develop the memory system with DTCO-enabled customized SOT-MRAM as large on-chip memory through STCO and detailed characterization of the DL workloads. We evaluate our workload-aware memory system on CV and NLP benchmarks and observe significant PPA improvement compared to an SRAM-based memory system in both inference and training modes. Our workload-aware memory system achieves 8X energy and 9X latency improvement on Computer Vision (CV) benchmarks in training and 8X energy and 4.5X latency improvement on Natural Language Processing (NLP) benchmarks in training, while consuming only around 50% of the SRAM area at iso-capacity.
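
    Whether an accelerator is memory-bound, as argued above, can be estimated with a simple roofline-style check: if a workload's arithmetic intensity falls below the ratio of peak compute throughput to memory bandwidth, memory is the limiter. The Python sketch below uses hypothetical accelerator figures, not numbers from this work.

        # Roofline-style check: memory-bound or compute-bound?
        # All figures are hypothetical examples.
        peak_tflops = 100.0     # peak MAC throughput, TFLOP/s
        bandwidth_tbps = 2.0    # on-chip memory bandwidth, TB/s

        ridge = peak_tflops / bandwidth_tbps  # FLOP/byte needed to saturate compute

        def attainable_tflops(arithmetic_intensity):
            """Attainable throughput for a given arithmetic intensity (FLOP/byte)."""
            return min(peak_tflops, bandwidth_tbps * arithmetic_intensity)

        for ai in (5, 25, 50, 100):
            bound = "memory-bound" if ai < ridge else "compute-bound"
            print(f"AI={ai:>3} FLOP/B -> {attainable_tflops(ai):6.1f} TFLOP/s ({bound})")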

    Accelerating Time Series Analysis via Processing using Non-Volatile Memories

    Time Series Analysis (TSA) is a critical workload for consumer-facing devices. Accelerating TSA is vital for many domains as it enables the extraction of valuable information and the prediction of future events. The state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping (sDTW) algorithm. However, sDTW's computation complexity increases quadratically with the time series' length, resulting in two performance implications. First, the amount of data parallelism available is significantly higher than the small number of processing units enabled by commodity systems (e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low arithmetic intensity and 2) incurs a large memory footprint. To tackle these two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ computation where data resides, using the memory cells. PuM provides a promising solution to alleviate data movement bottlenecks and exposes immense parallelism. In this work, we present MATSA, the first MRAM-based Accelerator for Time Series Analysis. The key idea is to exploit magneto-resistive memory crossbars to enable energy-efficient and fast time series computation in memory. MATSA provides the following key benefits: 1) it leverages high levels of parallelism in the memory substrate by exploiting column-wise arithmetic operations, and 2) it significantly reduces data movement costs by performing computation using the memory cells. We evaluate three versions of MATSA to match the requirements of different environments (e.g., embedded, desktop, or HPC computing) based on MRAM technology trends. We perform a design space exploration and demonstrate that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU, and PNM architectures, respectively.
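
    For readers unfamiliar with the kernel MATSA accelerates, the Python sketch below shows a textbook formulation of the subsequence DTW (sDTW) dynamic-programming recurrence; the quadratic number of cells and their regular, column-by-column dependencies are what make the algorithm amenable to in-memory, column-wise computation. This is a generic reference implementation, not MATSA's design.

        import numpy as np

        def sdtw(query, reference):
            """Subsequence DTW: best cost of aligning `query` against any
            window of `reference`. O(len(query) * len(reference)) cells."""
            q, r = len(query), len(reference)
            D = np.full((q + 1, r + 1), np.inf)
            D[0, :] = 0.0  # the query may start at any reference position
            for i in range(1, q + 1):
                for j in range(1, r + 1):
                    cost = abs(query[i - 1] - reference[j - 1])
                    D[i, j] = cost + min(D[i - 1, j],      # insertion
                                         D[i, j - 1],      # deletion
                                         D[i - 1, j - 1])  # match
            return D[q, 1:].min()  # best alignment may end anywhere

        print(sdtw(np.array([1.0, 2.0, 3.0]),
                   np.array([0.0, 1.0, 2.0, 3.0, 0.0])))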

    Reliable Low-Power High Performance Spintronic Memories

    Following Moore's law, the chip industry has managed to achieve explosive growth over the past five decades. This has likewise led to an exponential increase in the demand for memory components, which in turn results in memory-dominated chips in today's computer systems. However, traditional on-chip memory technologies such as Static Random Access Memories (SRAMs), Dynamic Random Access Memories (DRAMs), and flip-flops pose challenges in terms of scalability, power dissipation, and reliability. Precisely these challenges, together with the overwhelming demand for higher performance and integration density of on-chip memory, motivate researchers to look for new non-volatile memory technologies. Emerging spintronic memory technologies such as Spin Orbit Torque (SOT) and Spin Transfer Torque (STT) have received a great deal of attention in recent years because they offer a number of advantages. These include non-volatility, scalability, high endurance, CMOS compatibility, and immunity to soft errors. In spintronics, the spin of an electron represents its information. The data is stored as a resistance level, which can be changed by applying a polarized current to the storage medium. These memory devices tackle the static power problem both through their leakage-free storage and through their normally-off/instant-on behavior. Nevertheless, other problems such as high access latency and energy consumption still have to be solved before they can find widespread adoption. To address these problems, new computing paradigms, architectures, and design philosophies are necessary. The high access latency of spintronic technology is due to a comparatively long switching time, which exceeds that of conventional SRAM. Furthermore, because of the stochastic switching behavior of the memory cell and the influence of process variation, a non-negligible time window is required for switching. During this window, a constant write current is driven through the bit-cell to guarantee that switching completes, and this process causes high energy consumption. The read operation likewise requires a considerable time window, again due to the influence of process variation. On top of this come various reliability problems, including read disturbance and degradation issues such as Time Dependent Dielectric Breakdown (TDDB). These reliability problems can in turn be traced back to the longer switching times required, which consequently also imply a read or write current that is applied for a longer time. It is therefore necessary to reduce both energy and latency, and thereby improve reliability, in order to turn this technology into a viable candidate for on-chip memory systems. In this dissertation, we present design strategies that aim to tackle the challenges of cache, register file, and flip-flop design. We achieve this with the help of a cross-layer approach. For caches, we developed several circuit-level approaches, which are evaluated at both the memory-architecture level and the system level with respect to energy consumption, performance gains, and reliability improvements.
We develop a self-termination technique for both the read and the write operation of caches, which is able to detect the completion of the respective operation dynamically. Once completion is detected, the read or write operation is stopped immediately to save energy. In addition, the self-termination technique limits the duration of current flow through the memory cell, which in turn reduces the occurrence of TDDB during write operations and of read disturbance during read operations. To improve write latency, we boost the write current at the bit-cell in order to accelerate the magnetic switching process. To account for register-file-specific requirements, we additionally designed a multi-port memory architecture that exploits a unique property of the SOT cell to perform read and write operations simultaneously. Read/write conflicts can therefore be resolved at the bit-cell level, which in turn translates into a much simpler multi-port register file architecture. In addition to the memory approaches, we also present two flip-flop architectures. The first is a non-volatile, non-shadow flip-flop architecture that uses the memory cell as an active component. This allows the supply voltage to be switched on and off instantly and is therefore particularly well suited for aggressive power gating. Overall, the proposed flip-flop design exhibits timing characteristics similar to those of conventional CMOS flip-flops while at the same time enabling a significant reduction in static power consumption compared to non-volatile shadow flip-flops. The second is a fault-tolerant flip-flop architecture that is robust against various defects and faults. The effectiveness of all proposed techniques is demonstrated through extensive circuit-level simulations, further substantiated by detailed system-level evaluations. Overall, we were able to develop several techniques that achieve substantial improvements in the performance, energy, and reliability of spintronic on-chip memories such as caches, register files, and flip-flops.
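
The energy argument behind the self-termination technique can be illustrated with a small Monte Carlo sketch: when the switching time varies from cell to cell but a fixed write pulse must cover the slowest cells, cutting the current as soon as switching is sensed saves the difference. The distribution and circuit parameters below are illustrative assumptions, not measured device data.

    import random

    random.seed(0)

    # Illustrative assumptions, not measured device data:
    PULSE_NS = 10.0           # fixed worst-case write pulse width
    WRITE_CURRENT_UA = 100.0  # constant write current while the pulse is on
    SUPPLY_V = 1.0

    def write_energy_fj(duration_ns):
        # 1 V * 1 uA * 1 ns = 1 fJ
        return SUPPLY_V * WRITE_CURRENT_UA * duration_ns

    fixed = terminated = 0.0
    for _ in range(10_000):
        # Stochastic switching time, clipped to the worst-case pulse.
        t_switch = min(max(random.gauss(4.0, 1.5), 0.5), PULSE_NS)
        fixed += write_energy_fj(PULSE_NS)        # current flows for the full pulse
        terminated += write_energy_fj(t_switch)   # current stops once switching is sensed

    print(f"average write-energy saving: {100 * (1 - terminated / fixed):.1f}%")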

    The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview

    In today's data-centric world, where data fuels numerous application domains with machine learning at the forefront, handling the enormous volume of data efficiently in terms of time and energy presents a formidable challenge. Conventional computing systems and accelerators are continually being pushed to their limits to stay competitive. In this context, compute-near-memory (CNM) and compute-in-memory (CIM) have emerged as potentially game-changing paradigms. This survey introduces the basics of CNM and CIM architectures, including their underlying technologies and working principles. We focus particularly on CIM and CNM architectures that have either been prototyped or commercialized. While surveying the evolving CIM and CNM landscape in academia and industry, we discuss the potential benefits in terms of performance, energy, and cost, along with the challenges associated with these cutting-edge computing paradigms.

    High-Performance Energy-Efficient and Reliable Design of Spin-Transfer Torque Magnetic Memory

    In this dissertation, new computing paradigms, architectures, and design philosophies are proposed and evaluated for adopting STT-MRAM technology as a highly reliable, energy-efficient, and fast memory. For this purpose, a novel cross-layer framework from the cell level all the way up to the system and application level has been developed. In this framework, the reliability issues are modeled accurately with appropriate fault models at different abstraction levels in order to analyze the overall failure rate of the entire memory and its Mean Time To Failure (MTTF), while also considering temperature and process-variation effects. Design-time, compile-time, and run-time solutions have been provided to address the challenges associated with STT-MRAM. The effectiveness of the proposed solutions is demonstrated in extensive experiments that show significant improvements over state-of-the-art solutions, i.e., lower-power, higher-performance, and more reliable STT-MRAM designs.
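
    As a simple illustration of the kind of bottom-up reliability roll-up such a framework performs, the sketch below combines an assumed per-bit failure rate into an array-level failure rate and an MTTF under a constant-failure-rate (exponential) model. The numbers and the any-bit-fails assumption are illustrative, not the dissertation's actual fault models.

        # Bottom-up reliability roll-up under an exponential (constant-rate) model.
        # The per-bit FIT value and array size are illustrative assumptions.
        FIT_PER_BIT = 1e-4        # failures per 10^9 device-hours, per bit (assumed)
        BITS = 8 * 2**20 * 8      # an 8 MiB array

        array_fit = FIT_PER_BIT * BITS            # rates add when any bit failure fails the array
        failure_rate_per_hour = array_fit / 1e9
        mttf_hours = 1.0 / failure_rate_per_hour  # MTTF = 1 / lambda for an exponential model

        print(f"array failure rate: {array_fit:.1f} FIT")
        print(f"MTTF: {mttf_hours:,.0f} hours ({mttf_hours / 8760:.1f} years)")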

    Spin-Transfer-Torque (STT) Devices for On-chip Memory and Their Applications to Low-standby Power Systems

    With the scaling of CMOS technology, the proportion of leakage power in total power consumption increases; leakage may account for almost half of total power consumption in high-performance processors. In order to reduce leakage power, there is increasing interest in using non-volatile storage devices for memory applications. Among various promising non-volatile memory elements, spin-transfer torque magnetic RAM (STT-MRAM) is identified as one of the most attractive alternatives to conventional SRAM. However, several design challenges of STT-MRAM, such as shared read and write current paths, single-ended sensing, and high dynamic power, must be overcome to make it suitable for on-chip memories. To mitigate these problems, we propose a domain-wall-coupling-based spin-transfer torque (DWCSTT) device for on-chip caches. Our proposed DWCSTT bit-cell decouples the read and write current paths through an electrically insulating magnetic coupling layer, so that we can separately optimize the read operation without impacting write-ability. In addition, the complementary polarizer structure in the read path of the DWCSTT device enables self-referenced differential sensing. DWCSTT bit-cells also reduce write power consumption due to the low electrical resistance of the write current path. Furthermore, we present three bit-cell-level design techniques for Spin-Orbit Torque MRAM (SOT-MRAM) that alleviate some of the inefficiencies of conventional magnetic memories while maintaining the advantages of the spin-orbit torque (SOT) switching mechanism, such as its low write-current requirement and decoupled read and write current paths. First, our proposed SOT-MRAM supporting dual read/write ports (1R/1W) addresses the high write latency of STT-MRAM through simultaneous 1R/1W accesses. Second, we propose a new type of SOT-MRAM that uses only one access transistor along with a Schottky diode in order to mitigate the area overhead caused by the two access transistors in conventional SOT-MRAM. Finally, a new SOT-MRAM design technique is presented that improves integration density by utilizing a shared bit-line structure.
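
    The standby-power motivation in this abstract can be made concrete with a back-of-the-envelope comparison: an SRAM array leaks whenever it is powered, whereas a non-volatile array can be power-gated between accesses and still retain its contents. The leakage and duty-cycle figures in the sketch below are assumed purely for illustration.

        # Back-of-the-envelope standby-energy comparison (all figures assumed).
        SRAM_LEAKAGE_W = 0.050   # always-on SRAM cache leakage, 50 mW
        NVM_LEAKAGE_W = 0.010    # NVM array + peripherals leakage while powered, 10 mW
        ACTIVE_FRACTION = 0.10   # fraction of time the cache is actually powered/accessed
        SECONDS = 24 * 3600      # one day

        sram_j = SRAM_LEAKAGE_W * SECONDS                    # leaks the whole time
        nvm_j = NVM_LEAKAGE_W * ACTIVE_FRACTION * SECONDS    # power-gated (and retentive) otherwise

        print(f"SRAM static energy per day: {sram_j:.0f} J")
        print(f"NVM  static energy per day: {nvm_j:.0f} J ({sram_j / nvm_j:.1f}x lower)")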