30 research outputs found

    Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks

    Get PDF
    Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurate characterization of such a system, through predictive pre-silicon modeling and/or diagnostic postsilicon measurement based analysis are increasingly cumbersome and error prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks generated with particular objectives in mind hold the key to obtaining accurate energy-related characterization. As such, we first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results centered around an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How to project application-specific (and if needed, phase-specific) power consumption with component-wise breakdowns? (b) How to measure energy-per-instruction (EPI) values for the target machine? (c) How to bound the worst-case (maximum) power consumption in order to determine safe, but practical (i.e. affordable) packaging or cooling solutions? The solution approaches to the above problems are all new. Hardware measurement based analysis shows superior power projection accuracy (with error margins of less than 2.3% across SPEC CPU2006) as well as max-power stressing capability (with 10.7% increase in processor power over the very worst-case power seen during the execution of SPEC CPU2006 applications).Peer ReviewedPostprint (author’s final draft

    Revisiting LP-NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping

    Get PDF
    Cache working-set adaptation is key as embedded systems move to multiprocessor and Simultaneous Multithreaded Architectures (SMT) because interthread pollution harms system performance and battery life. Light-Power NUCA (LP-NUCA) is a working-set adaptive cache that depends on temporal-locality to save energy. This work identifies the sources of energy waste in LP-NUCAs: parallel access to the tag and data arrays of the tiles and low locality phases with useless block migration. To counteract both issues, we prove that switching to serial access reduces energy without harming performance and propose a machine learning Adaptive Drop Rate (ADR) controller that minimizes the amount of replacement and migration when locality is low. This work demonstrates that these techniques efficiently adapt the cache drop and access policies to save energy. They reduce LP-NUCA consumption 22.7% for 1SMT. With interthread cache contention in 2SMT, the savings rise to 29%. Versus a conventional organization, energy--delay improves 20.8% and 25% for 1- and 2SMT benchmarks, and, in 65% of the 2SMT mixes, gains are larger than 20%

    Language independent modelling of parallelism

    Get PDF
    To make programs work in parallel contexts without any hazards, programming languages require changes to their structures and compilers. One of the most complicated parts is memory models and how programming languages deal with memory interactions. Different processors provide a different level of safety guarantees (i.e. ARM provides relaxed whereas Intel provides strong guarantees). On the other hand, different programming languages provide different structures for parallel computation and have individual protocols for communicating with parallel processes. Unfortunately, no specific choice is best in all situations. This thesis focuses on memory models of various programming languages and processors highlighting some positive and negative features from the point of view of programmability, performance and portability. In order to give some evidence of problems and performance bottlenecks, some small programs have been developed. This thesis also concentrates on incorrect behaviors, especially on data race conditions in programs, providing suggestions on how to avoid them. Also, some litmus tests on systems featuring different vendors' processors were performed to observe data races on each system. Nowadays programming paradigms also became a big issue. Some of the programming styles support observable non-determinism which is the main reason for incorrect behavior in programs. In this thesis, different programming models are also discussed based on the current state of the available research. Also, the imperative and functional paradigms in different contexts are compared. Finally, a mathematical problem was solved using two different paradigms to provide some practical evidence of the theory

    A Modern Primer on Processing in Memory

    Full text link
    Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated.Comment: arXiv admin note: substantial text overlap with arXiv:1903.0398

    Memory architectures for exaflop computing systems

    Get PDF
    Most computing systems are heavily dependent on their main memories, as their primary storage, or as an intermediate cache for slower storage systems (HDDs). The capacity of memory systems, as well as their performance, have a direct impact on overall computing capabilities of the system, and are also major contributors to its initial and operating costs. Dynamic Random Access Memory (DRAM) technology has been dominating the main memory landscape since its beginnings in 1970s until today. However, due to DRAM's inherent limitations, its steady rate of development has saturated over the past decade, creating a disparity between CPU and main memory performance, known as the memory wall. Modern parallel architectures, such as High-Performance Computing (HPC) clusters and manycore solutions, create even more stress on their memory systems. It is not trivial to estimate memory requirements that these systems will have in the future, and if DRAM technology would be able to meet them, or we would need to look for a novel memory solution. This thesis attempts to give insight in the most important technological challenges that future memory systems need to address, in order to meet the ever growing requirements of users and their applications, in manycore and HPC context. We try to describe the limitations of DRAM, as the dominant technology in today's main memory systems, that may impede performance or increase cost of future systems. We discuss some of the emerging memory technologies, and by comparing them with DRAM, we try to estimate their potential usage in future memory systems. The thesis evaluates the requirements of manycore scientific applications, in terms of memory bandwidth and footprint, and estimates how these requirements may change in the future. With this evaulation in mind, we propose a hybrid memory solution that employs DRAM and PCM, as well as several page placement and page migration policies, to bridge the gap between fast and small DRAM and larger but slower non-volatile memory. As the aforementioned evaluations required custom software solutions, we present tools we produced over the course of this PhD, which continue to be used in Heterogeneous Computer Architectures group in Barcelona Supercomputing Center. First, Limpio - a LIghtweight MPI instrumentatiOn framework, that provides an interface for low-overhead instrumentation and profiling of MPI applications with user-defined routines. Second, MemTraceMPI, a Valgrind tool, used to produce memory access traces of MPI applications, with several innovative concepts included (filter-cache, iteration tracing, compressed trace files).La mayoría de los sistemas de computación dependen en gran medida de sus principales recuerdos, como su almacenamiento primario, o como un caché intermedio para sistemas de almacenamiento más lentos (discos duros). La capacidad de los sistemas de memoria, así como su rendimiento, tienen un impacto directo en las capacidades globales de computación del sistema, y también son los principales contribuyentes a sus costos iniciales y de operación. Tecnología Dynamic Random Access memoria (DRAM) ha estado dominando el principal paisaje de memoria desde sus inicios en 1970 hasta la actualidad. Sin embargo, debido a las limitaciones inherentes de DRAM, su tasa constante de desarrollo ha saturado durante la última década, creando una disparidad entre la CPU y el rendimiento de la memoria principal, conocido como el muro de la memoria. Arquitecturas modernas paralelas, como la computación (HPC) de alto rendimiento y soluciones manycore, crear aún más presión sobre sus sistemas de memoria. No es trivial para estimar los requisitos de memoria que estos sistemas tendrán en el futuro, y si la tecnología DRAM sería capaz de cumplir con ellas, o que tendría que buscar una solución de memoria novela. En esta tesis se intenta dar una idea de los más importantes retos tecnológicos que los sistemas de memoria futuras deben abordar, con el fin de satisfacer las necesidades cada vez mayores de los usuarios y sus aplicaciones, en Manycore y HPC contexto. Intentamos describir las limitaciones de memoria DRAM, como la tecnología dominante en los sistemas de memoria principal de hoy en día, que pueden impedir el rendimiento o el aumento de los costos de los sistemas futuros. Se discuten algunas de las tecnologías de memoria emergentes, y comparándolos con DRAM, tratamos de estimar su uso potencial en sistemas de memoria futuras. La tesis evalúa los requisitos de las aplicaciones científicas manycore, en términos de ancho de banda de memoria y huella, y estima cómo estos requisitos pueden cambiar en el futuro. Con esta evaulation en mente, proponemos una solución de memoria híbrida que emplea DRAM y PCM, así como varias políticas de colocación de la página y la página de la migración, para cerrar la brecha entre la DRAM rápido y pequeño y más grande pero la memoria más lenta no volátil. Como las evaluaciones mencionadas necesarias soluciones de software personalizadas, se presentan las herramientas que hemos producido en el transcurso de esta tesis doctoral, que se siguen utilizando en el grupo heterogéneo de computadoras Arquitecturas en Barcelona Supercomputing Center. En primer lugar, Limpio - un marco MPI Instrumentación ligero, que proporciona una interfaz para la instrumentación de baja sobrecarga y perfilado de aplicaciones MPI con rutinas definidas por el usuario. En segundo lugar, MemTraceMPI, una herramienta Valgrind, utilizado para producir los rastros de acceso a memoria de aplicaciones MPI, con varios conceptos innovadores incluido (filtro-cache, trazado iteración, archivos de seguimiento comprimido)
    corecore