142 research outputs found

    Energy Saving Techniques for Phase Change Memory (PCM)

    Full text link
    In recent years, the energy consumption of computing systems has increased and a large fraction of this energy is consumed in main memory. Towards this, researchers have proposed use of non-volatile memory, such as phase change memory (PCM), which has low read latency and power; and nearly zero leakage power. However, the write latency and power of PCM are very high and this, along with limited write endurance of PCM present significant challenges in enabling wide-spread adoption of PCM. To address this, several architecture-level techniques have been proposed. In this report, we review several techniques to manage power consumption of PCM. We also classify these techniques based on their characteristics to provide insights into them. The aim of this work is encourage researchers to propose even better techniques for improving energy efficiency of PCM based main memory.Comment: Survey, phase change RAM (PCRAM

    Power-Efficient and Low-Latency Memory Access for CMP Systems with Heterogeneous Scratchpad On-Chip Memory

    Get PDF
    The gradually widening speed disparity of between CPU and memory has become an overwhelming bottleneck for the development of Chip Multiprocessor (CMP) systems. In addition, increasing penalties caused by frequent on-chip memory accesses have raised critical challenges in delivering high memory access performance with tight power and latency budgets. To overcome the daunting memory wall and energy wall issues, this thesis focuses on proposing a new heterogeneous scratchpad memory architecture which is configured from SRAM, MRAM, and Z-RAM. Based on this architecture, we propose two algorithms, a dynamic programming and a genetic algorithm, to perform data allocation to different memory units, therefore reducing memory access cost in terms of power consumption and latency. Extensive and intensive experiments are performed to show the merits of the heterogeneous scratchpad architecture over the traditional pure memory system and the effectiveness of the proposed algorithms

    Gestión de jerarquías de memoria híbridas a nivel de sistema

    Get PDF
    Tesis inédita de la Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadoras y Automática y de Ku Leuven, Arenberg Doctoral School, Faculty of Engineering Science, leída el 11/05/2017.In electronics and computer science, the term ‘memory’ generally refers to devices that are used to store information that we use in various appliances ranging from our PCs to all hand-held devices, smart appliances etc. Primary/main memory is used for storage systems that function at a high speed (i.e. RAM). The primary memory is often associated with addressable semiconductor memory, i.e. integrated circuits consisting of silicon-based transistors, used for example as primary memory but also other purposes in computers and other digital electronic devices. The secondary/auxiliary memory, in comparison provides program and data storage that is slower to access but offers larger capacity. Examples include external hard drives, portable flash drives, CDs, and DVDs. These devices and media must be either plugged in or inserted into a computer in order to be accessed by the system. Since secondary storage technology is not always connected to the computer, it is commonly used for backing up data. The term storage is often used to describe secondary memory. Secondary memory stores a large amount of data at lesser cost per byte than primary memory; this makes secondary storage about two orders of magnitude less expensive than primary storage. There are two main types of semiconductor memory: volatile and nonvolatile. Examples of non-volatile memory are ‘Flash’ memory (sometimes used as secondary, sometimes primary computer memory) and ROM/PROM/EPROM/EEPROM memory (used for firmware such as boot programs). Examples of volatile memory are primary memory (typically dynamic RAM, DRAM), and fast CPU cache memory (typically static RAM, SRAM, which is fast but energy-consuming and offer lower memory capacity per are a unit than DRAM). Non-volatile memory technologies in Si-based electronics date back to the 1990s. Flash memory is widely used in consumer electronic products such as cellphones and music players and NAND Flash-based solid-state disks (SSDs) are increasingly displacing hard disk drives as the primary storage device in laptops, desktops, and even data centers. The integration limit of Flash memories is approaching, and many new types of memory to replace conventional Flash memories have been proposed. The rapid increase of leakage currents in Silicon CMOS transistors with scaling poses a big challenge for the integration of SRAM memories. There is also the case of susceptibility to read/write failure with low power schemes. As a result of this, over the past decade, there has been an extensive pooling of time, resources and effort towards developing emerging memory technologies like Resistive RAM (ReRAM/RRAM), STT-MRAM, Domain Wall Memory and Phase Change Memory(PRAM). Emerging non-volatile memory technologies promise new memories to store more data at less cost than the expensive-to build silicon chips used by popular consumer gadgets including digital cameras, cell phones and portable music players. These new memory technologies combine the speed of static random-access memory (SRAM), the density of dynamic random-access memory (DRAM), and the non-volatility of Flash memory and so become very attractive as another possibility for future memory hierarchies. The research and information on these Non-Volatile Memory (NVM) technologies has matured over the last decade. These NVMs are now being explored thoroughly nowadays as viable replacements for conventional SRAM based memories even for the higher levels of the memory hierarchy. Many other new classes of emerging memory technologies such as transparent and plastic, three-dimensional(3-D), and quantum dot memory technologies have also gained tremendous popularity in recent years...En el campo de la informática, el término ‘memoria’ se refiere generalmente a dispositivos que son usados para almacenar información que posteriormente será usada en diversos dispositivos, desde computadoras personales (PC), móviles, dispositivos inteligentes, etc. La memoria principal del sistema se utiliza para almacenar los datos e instrucciones de los procesos que se encuentre en ejecución, por lo que se requiere que funcionen a alta velocidad (por ejemplo, DRAM). La memoria principal está implementada habitualmente mediante memorias semiconductoras direccionables, siendo DRAM y SRAM los principales exponentes. Por otro lado, la memoria auxiliar o secundaria proporciona almacenaje(para ficheros, por ejemplo); es más lenta pero ofrece una mayor capacidad. Ejemplos típicos de memoria secundaria son discos duros, memorias flash portables, CDs y DVDs. Debido a que estos dispositivos no necesitan estar conectados a la computadora de forma permanente, son muy utilizados para almacenar copias de seguridad. La memoria secundaria almacena una gran cantidad de datos aun coste menor por bit que la memoria principal, siendo habitualmente dos órdenes de magnitud más barata que la memoria primaria. Existen dos tipos de memorias de tipo semiconductor: volátiles y no volátiles. Ejemplos de memorias no volátiles son las memorias Flash (algunas veces usadas como memoria secundaria y otras veces como memoria principal) y memorias ROM/PROM/EPROM/EEPROM (usadas para firmware como programas de arranque). Ejemplos de memoria volátil son las memorias DRAM (RAM dinámica), actualmente la opción predominante a la hora de implementar la memoria principal, y las memorias SRAM (RAM estática) más rápida y costosa, utilizada para los diferentes niveles de cache. Las tecnologías de memorias no volátiles basadas en electrónica de silicio se remontan a la década de1990. Una variante de memoria de almacenaje por carga denominada como memoria Flash es mundialmente usada en productos electrónicos de consumo como telefonía móvil y reproductores de música mientras NAND Flash solid state disks(SSDs) están progresivamente desplazando a los dispositivos de disco duro como principal unidad de almacenamiento en computadoras portátiles, de escritorio e incluso en centros de datos. En la actualidad, hay varios factores que amenazan la actual predominancia de memorias semiconductoras basadas en cargas (capacitivas). Por un lado, se está alcanzando el límite de integración de las memorias Flash, lo que compromete su escalado en el medio plazo. Por otra parte, el fuerte incremento de las corrientes de fuga de los transistores de silicio CMOS actuales, supone un enorme desafío para la integración de memorias SRAM. Asimismo, estas memorias son cada vez más susceptibles a fallos de lectura/escritura en diseños de bajo consumo. Como resultado de estos problemas, que se agravan con cada nueva generación tecnológica, en los últimos años se han intensificado los esfuerzos para desarrollar nuevas tecnologías que reemplacen o al menos complementen a las actuales. Los transistores de efecto campo eléctrico ferroso (FeFET en sus siglas en inglés) se consideran una de las alternativas más prometedores para sustituir tanto a Flash (por su mayor densidad) como a DRAM (por su mayor velocidad), pero aún está en una fase muy inicial de su desarrollo. Hay otras tecnologías algo más maduras, en el ámbito de las memorias RAM resistivas, entre las que cabe destacar ReRAM (o RRAM), STT-RAM, Domain Wall Memory y Phase Change Memory (PRAM)...Depto. de Arquitectura de Computadores y AutomáticaFac. de InformáticaTRUEunpu

    Volatile STT-RAM Scratchpad Design and Data Allocation for Low Energy

    Get PDF
    [Abstract] On-chip power consumption is one of the fundamental challenges of current technology scaling. Cache memories consume a sizable part of this power, particularly due to leakage energy. STT-RAM is one of several new memory technologies that have been proposed in order to improve power while preserving performance. It features high density and low leakage, but at the expense of write energy and performance. This article explores the use of STT-RAM--based scratchpad memories that trade nonvolatility in exchange for faster and less energetically expensive accesses, making them feasible for on-chip implementation in embedded systems. A novel multiretention scratchpad partitioning is proposed, featuring multiple storage spaces with different retention, energy, and performance characteristics. A customized compiler-based allocation algorithm suitable for use with such a scratchpad organization is described. Our experiments indicate that a multiretention STT-RAM scratchpad can provide energy savings of 53% with respect to an iso-area, hardware-managed SRAM cache

    Transparent management of scratchpad memories in shared memory programming models

    Get PDF
    Cache-coherent shared memory has traditionally been the favorite memory organization for chip multiprocessors thanks to its high programmability. In this organization the cache hierarchy is in charge of moving the data and keeping it coherent between all the caches, enabling the usage of shared memory programming models where the programmer does not need to carry out any data management operation. Unfortunately, performing all the data management operations in hardware causes severe problems, being the primary concerns the power consumption originated in the caches and the amount of coherence traffic in the interconnection network. A good solution is to introduce ScratchPad Memories (SPMs) alongside the cache hierarchy, forming a hybrid memory hierarchy. SPMs are more power-efficient than caches and do not generate coherence traffic, but they degrade programmability. In particular, SPMs require the programmer to partition the data, to program data transfers, and to keep coherence between different copies of the data. A promising solution to exploit the benefits of the SPMs without harming programmability is to allow programmers to use shared memory programming models and to automatically generate code that manages the SPMs. Unfortunately, current compilers and runtime systems encounter serious limitations to automatically generate code for hybrid memory hierarchies from shared memory programming models. This thesis proposes to transparently manage the SPMs of hybrid memory hierarchies in shared memory programming models. In order to achieve this goal this thesis proposes a combination of hardware and compiler techniques to manage the SPMs in fork-join programming models and a set of runtime system techniques to manage the SPMs in task programming models. The proposed techniques allow to program hybrid memory hierarchies with these two well-known and easy-to-use forms of shared memory programming models, capitalizing on the benefits of hybrid memory hierarchies in power consumption and network traffic without harming programmability. The first contribution of this thesis is a hardware/software co-designed coherence protocol to transparently manage the SPMs of hybrid memory hierarchies in fork-join programming models. The solution allows the compiler to always generate code to manage the SPMs with tiling software caches, even in the presence of unknown memory aliasing hazards between memory references to the SPMs and to the cache hierarchy. On the software side, the compiler generates a special form of memory instruction for memory references with possible aliasing hazards. On the hardware side, the special memory instructions are diverted to the correct copy of the data using a set of directories that track what data is mapped to the SPMs. The second contribution of this thesis is a set of runtime system techniques to manage the SPMs of hybrid memory hierarchies in task programming models. The proposed runtime system techniques exploit the characteristics of these programming models to map the data specified in the task dependences to the SPMs. Different policies are proposed to mitigate the communication costs of the data transfers, overlapping them with other execution phases such as the task scheduling phase or the execution of the previous task. The runtime system can also reduce the number of data transfers by using a task scheduler that exploits data locality in the SPMs. In addition, the proposed techniques are combined with mechanisms that reduce the impact of fine-grained tasks, such as hardware runtime systems or large SPM sizes. The accomplishment of this thesis is that hybrid memory hierarchies can be programmed with fork-join and task programming models. Consequently, architectures with hybrid memory hierarchies can be exposed to the programmer as a shared memory multiprocessor, taking advantage of the benefits of the SPMs while maintaining the programming simplicity of shared memory programming models.La memoria compartida con coherencia de caches es la jerarquía de memoria más utilizada en multiprocesadores gracias a su programabilidad. En esta solución la jerarquía de caches se encarga de mover los datos y mantener la coherencia entre las caches, habilitando el uso de modelos de programación de memoria compartida donde el programador no tiene que realizar ninguna operación para gestionar las memorias. Desafortunadamente, realizar estas operaciones en la arquitectura causa problemas severos, siendo especialmente relevantes el consumo de energía de las caches y la cantidad de tráfico de coherencia en la red de interconexión. Una buena solución es añadir Memorias ScratchPad (SPMs) acompañando la jerarquía de caches, formando una jerarquía de memoria híbrida. Las SPMs son más eficientes en energía y tráfico de coherencia, pero dificultan la programabilidad ya que requieren que el programador particione los datos, programe transferencias de datos y mantenga la coherencia entre diferentes copias de datos. Una solución prometedora para beneficiarse de las ventajas de las SPMs sin dificultar la programabilidad es permitir que el programador use modelos de programación de memoria compartida y generar código para gestionar las SPMs automáticamente. El problema es que los compiladores y los entornos de ejecución actuales sufren graves limitaciones al gestionar automáticamente una jerarquía de memoria híbrida en modelos de programación de memoria compartida. Esta tesis propone gestionar automáticamente una jerarquía de memoria híbrida en modelos de programación de memoria compartida. Para conseguir este objetivo esta tesis propone una combinación de técnicas hardware y de compilador para gestionar las SPMs en modelos de programación fork-join, y técnicas en entornos de ejecución para gestionar las SPMs en modelos de programación basados en tareas. Las técnicas propuestas hacen que las jerarquías de memoria híbridas puedan programarse con estos dos modelos de programación de memoria compartida, de tal forma que las ventajas en energía y tráfico de coherencia se puedan explotar sin dificultar la programabilidad. La primera contribución de esta tesis en un protocolo de coherencia hardware/software para gestionar SPMs en modelos de programación fork-join. La propuesta consigue que el compilador siempre pueda generar código para gestionar las SPMs, incluso cuando hay posibles alias de memoria entre referencias a memoria a las SPMs y a la jerarquía de caches. En la solución el compilador genera instrucciones especiales para las referencias a memoria con posibles alias, y el hardware sirve las instrucciones especiales con la copia válida de los datos usando directorios que guardan información sobre qué datos están mapeados en las SPMs. La segunda contribución de esta tesis son una serie de técnicas para gestionar SPMs en modelos de programación basados en tareas. Las técnicas aprovechan las características de estos modelos de programación para mapear las dependencias de las tareas en las SPMs y se complementan con políticas para minimizar los costes de las transferencias de datos, como solaparlas con fases del entorno de ejecución o la ejecución de tareas anteriores. El número de transferencias también se puede reducir utilizando un planificador que tenga en cuenta la localidad de datos y, además, las técnicas se pueden combinar con mecanismos para reducir los efectos negativos de tener tareas pequeñas, como entornos de ejecución en hardware o SPMs de más capacidad. Las propuestas de esta tesis consiguen que las jerarquías de memoria híbridas se puedan programar con modelos de programación fork-join y basados en tareas. En consecuencia, las arquitecturas con jerarquías de memoria híbridas se pueden exponer al programador como multiprocesadores de memoria compartida, beneficiándose de las ventajas de las SPMs en energía y tráfico de coherencia y manteniendo la simplicidad de uso de los modelos de programación de memoria compartida

    MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory

    Full text link
    Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80V/25{\deg}C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less than 2% of execution stalls.Comment: 14 pages, 17 figures, 2 table

    Novel online data allocation for hybrid memories on tele-health systems

    Full text link
    [EN] The developments of wearable devices such as Body Sensor Networks (BSNs) have greatly improved the capability of tele-health industry. Large amount of data will be collected from every local BSN in real-time. These data is processed by embedded systems including smart phones and tablets. After that, the data will be transferred to distributed storage systems for further processing. Traditional on-chip SRAMs cause critical power leakage issues and occupy relatively large chip areas. Therefore, hybrid memories, which combine volatile memories with non-volatile memories, are widely adopted in reducing the latency and energy cost on multi-core systems. However, most of the current works are about static data allocation for hybrid memories. Those mechanisms cannot achieve better data placement in real-time. Hence, we propose online data allocation for hybrid memories on embedded tele-health systems. In this paper, we present dynamic programming and heuristic approaches. Considering the difference between profiled data access and actual data access, the proposed algorithms use a feedback mechanism to improve the accuracy of data allocation during runtime. Experimental results demonstrate that, compared to greedy approaches, the proposed algorithms achieve 20%-40% performance improvement based on different benchmarks. (C) 2016 Elsevier B.V. All rights reserved.This work is supported by NSF CNS-1457506 and NSF CNS-1359557.Chen, L.; Qiu, M.; Dai, W.; Hassan Mohamed, H. (2017). Novel online data allocation for hybrid memories on tele-health systems. Microprocessors and Microsystems. 52:391-400. https://doi.org/10.1016/j.micpro.2016.08.003S3914005

    Domain-specific Architectures for Data-intensive Applications

    Full text link
    Graphs' versatile ability to represent diverse relationships, make them effective for a wide range of applications. For instance, search engines use graph-based applications to provide high-quality search results. Medical centers use them to aid in patient diagnosis. Most recently, graphs are also being employed to support the management of viral pandemics. Looking forward, they are showing promise of being critical in unlocking several other opportunities, including combating the spread of fake content in social networks, detecting and preventing fraudulent online transactions in a timely fashion, and in ensuring collision avoidance in autonomous vehicle navigation, to name a few. Unfortunately, all these applications require more computational power than what can be provided by conventional computing systems. The key reason is that graph applications present large working sets that fail to fit in the small on-chip storage of existing computing systems, while at the same time they access data in seemingly unpredictable patterns, thus cannot draw benefit from traditional on-chip storage. In this dissertation, we set out to address the performance limitations of existing computing systems so to enable emerging graph applications like those described above. To achieve this, we identified three key strategies: 1) specializing memory architecture, 2) processing data near its storage, and 3) message coalescing in the network. Based on these strategies, this dissertation develops several solutions: OMEGA, which employs specialized on-chip storage units, with co-located specialized compute engines to accelerate the computation; MessageFusion, which coalesces messages in the interconnect; and Centaur, providing an architecture that optimizes the processing of infrequently-accessed data. Overall, these solutions provide 2x in performance improvements, with negligible hardware overheads, across a wide range of applications. Finally, we demonstrate the applicability of our strategies to other data-intensive domains, by exploring an acceleration solution for MapReduce applications, which achieves a 4x performance speedup, also with negligible area and power overheads.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/163186/1/abrahad_1.pd

    GraphR: Accelerating Graph Processing Using ReRAM

    Full text link
    This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suit- able for graph processing because: 1) The algorithms are iterative and could inherently tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed in sparse matrix vector multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars). The vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize the massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain of performing parallel operations overshadows the wastes due to sparsity. The experiment results show that GRAPHR achieves a 16.01x (up to 132.67x) speedup and a 33.82x energy saving on geometric mean compared to a CPU baseline system. Com- pared to GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes 4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x, and is 3.67x to 10.96x more energy efficiency compared to PIM-based architecture.Comment: Accepted to HPCA 201

    Design Space Exploration of Next-Generation HPC Machines

    Get PDF
    The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. With the goal of improving the efficiency of next-generation large HPC systems, designers require tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. We simulate five hybrid (MPI+OpenMP) applications over 864 architectural proposals based on stateof-the-art and emerging HPC technologies, relevant both in industry and research. This paper significantly extends our previous work with MUltiscale Simulation Approach (MUSA) enabling accurate performance and power estimations of largescale HPC systems. We reveal that several applications present critical scalability issues mostly due to the software parallelization approach. Looking at speedup and energy consumption exploring the design space (i.e., changing memory bandwidth, number of cores, and type of cores), we provide evidence-based architectural recommendations that will serve as hardware and software codesign guidelines.Preprin
    corecore