Search CORE

3 research outputs found

Fusión de los niveles L1 y L2 de la jerarquía de memoria cache utilizando DWM

Author: Tárrega Sánchez Hugo
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 12/11/2021
Field of study

[ES] El presente trabajo aborda la necesidad cada vez mayor por parte de la industria de los semiconductores de contar con memorias cache más densas y con un menor consumo energético que las actuales. Debido a que la tecnología actual más utilizada, SRAM, no puede ofrecer estas mejoras, este trabajo propone el uso de las memorias magnéticas DWM (Domain Wall Memory) como tecnología emergente sustitutiva. El presente trabajo aborda principalmente uno de los mayores inconvenientes de DWM como es la latencia variable por acceso a los datos almacenados en una cinta magnética. Este hecho es especialmente crítico en las caches de primer nivel (L1) al encontrarse en el pipeline del procesador y conllevar un impacto directo en el rendimiento del sistema. Para superar este problema, se propone un diseño de cache de datos L1 ascendente. En primer lugar, se diseña una celda de memoria DWM capaz de almacenar múltiples bits y de reducir el impacto de la latencia variable de acceso mediante el uso de múltiples puertos de acceso sobre la cinta, entre otras características. A continuación, se diseña un módulo de cache que integra múltiples celdas DWM, de manera que los conjuntos se organizan de manera entrelazada entre los puertos, favoreciendo la localidad espacial que exhiben las aplicaciones en L1 y por tanto reduciendo la problemática de la latencia variable de acceso. Finalmente, el uso de módulos DWM permite implementar el vector de datos completo de una memoria cache de datos asociativa por conjuntos de

n

vías. La alta densidad de DWM permite fusionar los niveles L1 y L2 en un sólo nivel DWM con el objetivo de aumentar el rendimiento frente a un diseño de jerarquía de cache convencional SRAM. La propuesta de cache DWM se implementa y evalúa en el simulador ciclo-a-ciclo Multi2Sim, ampliamente utilizado tanto en la industria como en la academia. Los resultados experimentales muestran que la cache DWM reduce significativamente la penalización media por acceso a memoria, los fallos por kilo-instrucción y los ciclos de parada en el \emph{reorder buffer} frente a un diseño de cache convencional. Todo ello conlleva a una mejora en el rendimiento del sistema de un 10% en la media no sólo frente a un diseño convencional basado en SRAM sino también frente al diseño DWM del estado-del-arte, referido como TapeCache e implementado como parte del presente trabajo.[CA] El present treball aborda la necessitat cada volta més apressant per part de la indús- tria dels semiconductors de trobar memòries cau més denses i amb un menor consum energètic que les actuals. Com que la tecnologia actual més utilitzada, SRAM, no pot oferir aquestes millores, aquest treball propose l’ús de les memòries magnètiques DWM (Domain Wall Memory com a tecnologia de substitució.) Aquest treball tracta principalment un dels majors inconvenients de les DWM, com és la latència variable per accés a les dades emmagatzemades en una cinta magnètica. Aquest fet es especialment crític en les memòries cau de primer nivell (L1) al trovar-se en el pipeline del processador i implicar un impacte directe en el rendiment del sistema. Per a superar aquest problema, es propose un disseny de memòria L1 de dades ascendent. En primer lloc, es dissenya una cel·la de memòria DWM capaç d’emmagatzemar múlti- ples bits i de reduir l’impacte de la latència variable d’accés per mitjà de l’ús de múltiples ports d’accés sobre la cinta, entre altres característiques. A continuació, es dissenya un mòdul de memòria cau que integre múltiples cel·les DWM, de manera que els conjunts s’organitzen de manera entrellaçada entre els ports, afavorint la localitat espacial que ex- hibeixen les aplicacions en L1 i per tant reduint la problemàtica de la latència variable d’accés. Finalment, l’ús de mòduls DWM permet implementar el vector de dades com- pletes d’una memòria cau de dades associatives per conjunts de n vies. L’alta densitat de DWM permet fusionar els nivells L1 i L2 en un només nivell DWM amb l’objectiu d’augmentar el rendiment enfront d’un disseny de jerarquia de memòria cau convencio- nal SRAM. La proposta de memòria cau amb DWM s’implementa i avalue en el simulador cicle- a-cicle Multi2Sim, àmpliament utilitzat tant en la indústria com en l’acadèmia. Els re- sultats experimentals mostren que la memòria cau DWM redueix significativament la penalització mitjana per accés a memòria, les fallades per quilo-instrucció i els cicles de parada en el reorder buffer enfront d’un disseny de memòria convencional. Tot això comporta a una millora en el rendiment del sistema d’un 10% en la mitjana no sols en- front d’un disseny convencional basat en SRAM sinó també enfront del disseny DWM de l’estat-del-art, referit com TapeCache i implementat com a part del present treball.[EN] The present work addresses the growing need of the semiconductor industry for denser cache memories with lower power consumption than the current ones. Since the most widely used current technology, SRAM, cannot offer these improvements, this work proposes the use of DWM (Domain Wall Memory) magnetic memories as a substitute emerging technology. This work mainly addresses one of the major drawbacks of DWM, which is the variable latency for accessing data stored on a magnetic tape. This fact is especially critical in first-level (L1) caches as they are located in the processor pipeline and have a direct impact on the system performance. To overcome this problem, a bottom-up L1 data cache design is proposed. First, it is designed a DWM memory cell capable of storing multiple bits and reducing the impact of the variable access latency by using multiple access ports on the tape, among other features. Next, it is designed a cache module that integrates multiple DWM cells, such that the sets are organized in an interleaved structure between ports, favoring the spatial locality exhibited by applications on L1 and thus reducing the variable access latency issue. Finally, the use of DWM modules allows implementing the complete data array of an associative data cache with n-way sets. The high density of DWM allows merging the L1 and L2 levels into a single DWM level with the goal of increasing performance over a conventional SRAM cache hierarchy design. The proposed DWM cache is implemented and evaluated on the Multi2Sim cycle-accurate simulator, which is widely used in both industry and academia. Experimental results show that the DWM cache significantly reduces the average memory access penalty, misses per kilo-instruction, and stall cycles in the reorder buffer compared to a conventional cache design. This leads to a 10% improvement in the average system performance not only over a conventional SRAM-based design but also over the state-of-the-art DWM design, referred to as TapeCache and implemented as part of this work.Tárrega Sánchez, H. (2021). Fusión de los niveles L1 y L2 de la jerarquía de memoria cache utilizando DWM. Universitat Politècnica de València. http://hdl.handle.net/10251/17701

RiuNet

Computing with Spintronics: Circuits and architectures

Author: Venkatesan Rangharajan
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2014
Field of study

This thesis makes the following contributions towards the design of computing platforms with spintronic devices. 1) It explores the use of spintronic memories in the design of a domain-specific processor for an emerging class of data-intensive applications, namely recognition, mining and synthesis (RMS). Two different spintronic memory technologies — Domain Wall Memory (DWM) and STT-MRAM — are utilized to realize the different levels in the memory hierarchy of the domain-specific processor, based on their respective access characteristics. Architectural tradeoffs created by the use of spintronic memories are analyzed. The proposed design achieves 1.5X-4X improvements in energy-delay product compared to a CMOS baseline. 2) It describes the first attempt to use DWM in the cache hierarchy of general-purpose processors. DWM promises unparalleled density by packing several bits of data into each bit-cell. TapeCache, the proposed DWM-based cache architecture, utilizes suitable circuit and architectural optimizations to address two key challenges (i) the high energy and latency requirement of write operations and (ii) the need for shift operations to access the data stored in each DWM bit-cell. At the circuit level, DWM bit-cells that are tailored to the distinct design requirements of different levels in the cache hierarchy are proposed. At the architecture level, TapeCache proposes suitable cache organization and management policies to alleviate the performance impact of shift operations required to access data stored in DWM bit-cells. TapeCache achieves more than 7X improvements in both cache area and energy with virtually identical performance compared to an SRAM-based cache hierarchy. 3) It investigates the design of the on-chip memory hierarchy of general-purpose graphics processing units (GPGPUs)—massively parallel processors that are optimized for data-intensive high-throughput workloads—using DWM. STAG, a high density, energy-efficient Spintronic- Tape Architecture for GPGPU cache hierarchies is described. STAG utilizes different DWM bit-cells to realize different memory arrays in the GPGPU cache hierarchy. To address the challenge of high access latencies due to shifts, STAG predicts upcoming cache accesses by leveraging unique characteristics of GPGPU architectures and workloads, and prefetches data that are both likely to be accessed and require large numbers of shift operations. STAG achieves 3.3X energy reduction and 12.1% performance improvement over CMOS SRAM under iso-area conditions. 4) While the potential of spintronic devices for memories is widely recognized, their utility in realizing logic is much less clear. The thesis presents Spintastic, a new paradigm that utilizes Stochastic Computing (SC) to realize spintronic logic. In SC, data is encoded in the form of pseudo-random bitstreams, such that the probability of a \u271\u27 in a bitstream corresponds to the numerical value that it represents. SC can enable compact, low-complexity logic implementations of various arithmetic functions. Spintastic establishes the synergy between stochastic computing and spin-based logic by demonstrating that they mutually alleviate each other\u27s limitations. On the one hand, various building blocks of SC, which incur significant overheads in CMOS implementations, can be efficiently realized by exploiting the physical characteristics of spin devices. On the other hand, the reduced logic complexity and low logic depth of SC circuits alleviates the shortcomings of spintronic logic. Based on this insight, the design of spin-based stochastic arithmetic circuits, bitstream generators, bitstream permuters and stochastic-to-binary converter circuits are presented. Spintastic achieves 7.1X energy reduction over CMOS implementations for a wide range of benchmarks from the image processing, signal processing, and RMS application domains. 5) In order to evaluate the proposed spintronic designs, the thesis describes various device-to-architecture modeling frameworks. Starting with devices models that are calibrated to measurements, the characteristics of spintronic devices are successively abstracted into circuit-level and architectural models, which are incorporated into suitable simulation frameworks. (Abstract shortened by UMI.

Purdue E-Pubs

Design and Code Optimization for Systems with Next-generation Racetrack Memories

Author: Khan Asif Ali
Publication venue
Publication date: 16/06/2022
Field of study

With the rise of computationally expensive application domains such as machine learning, genomics, and fluids simulation, the quest for performance and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for the next-generation system is also questionable. This has led to the emergence and rise of nonvolatile memory ( NVM ) technologies. Today, in different development stages, several NVM technologies are competing for their rapid access to the market. Racetrack memory ( RTM ) is one such nonvolatile memory technology that promises SRAM -comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, racetrack memory ( RTM ) is sequential in nature, i.e., data in an RTM cell needs to be shifted to an access port before it can be accessed. These shift operations incur performance and energy penalties. An ideal RTM , requiring at most one shift per access, can easily outperform SRAM . However, in the worst-cast shifting scenario, RTM can be an order of magnitude slower than SRAM . This thesis presents an overview of the RTM device physics, its evolution, strengths and challenges, and its application in the memory subsystem. We develop tools that allow the programmability and modeling of RTM -based systems. For shifts minimization, we propose a set of techniques including optimal, near-optimal, and evolutionary algorithms for efficient scalar and instruction placement in RTMs . For array accesses, we explore schedule and layout transformations that eliminate the longer overhead shifts in RTMs . We present an automatic compilation framework that analyzes static control flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural level simulation. Finally, to demonstrate the RTM potential in non-Von-Neumann in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use-case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional Von-Neumann models and state-of-the-art accelerators

Technische Universität Dresden: Qucosa