    Memory disambiguation hardware: a review

    One of the main challenges of modern processor design is the implementation of scalable and efficient mechanisms to detect memory access order violations that result from out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry. This fact has generated a large body of work on optimizing the address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions.
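
    As a rough illustration of the checks this address-based disambiguation logic performs, the following Python sketch models a simplified, age-ordered load-store queue; the class names, the linear search and the recovery policy are illustrative assumptions, not a design taken from any of the reviewed proposals.

```python
# Minimal sketch of address-based memory disambiguation in a load-store queue.
# Real hardware uses CAM-based, age-ordered structures; this linear model only
# illustrates the two checks involved: store-to-load forwarding and violation
# detection when a store address resolves late.

class LSQEntry:
    def __init__(self, seq, is_store, addr=None):
        self.seq = seq            # program-order age
        self.is_store = is_store
        self.addr = addr          # None until the address is computed
        self.executed = False

class LoadStoreQueue:
    def __init__(self):
        self.entries = []         # kept in program order

    def execute_load(self, load):
        """Search older stores for a matching address (store-to-load forwarding)."""
        load.executed = True
        older_stores = [e for e in self.entries if e.is_store and e.seq < load.seq]
        for store in reversed(older_stores):       # youngest matching store wins
            if store.addr is not None and store.addr == load.addr:
                return "forwarded_from_store"      # data comes from the in-flight store
        return "read_from_cache"                   # no conflict found so far

    def resolve_store_address(self, store, addr):
        """When a store address resolves, flag younger loads that ran too early."""
        store.addr = addr
        return [e for e in self.entries
                if not e.is_store and e.executed
                and e.seq > store.seq and e.addr == addr]   # loads to squash and replay
```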

    Reducing the LSQ and L1 Data Cache Power Consumption

    In most modern processor designs, the hardware devoted to storing data and instructions (the memory hierarchy) has become a major consumer of power. To reduce this power consumption, we propose in this paper two techniques: one that filters accesses to the LSQ (Load-Store Queue) based on both timing and address information, and another that filters accesses to the first-level data cache based on a forwarding predictor. Our simulation results show that power consumption decreases by 30-40% in each structure, with a negligible performance penalty of less than 0.1%.
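
    As a sketch of the kind of forwarding predictor such a filter could be built on, the code below uses a small PC-indexed table of saturating counters to decide whether a load should skip the L1D access and rely on the LSQ; the table size, counter width, threshold and update rule are assumptions made for illustration, not the configuration evaluated in the paper.

```python
# Sketch of a PC-indexed forwarding predictor used to filter L1D accesses.
# Table size, threshold and update policy are illustrative assumptions.

class ForwardingPredictor:
    def __init__(self, entries=1024, threshold=2):
        self.counters = [0] * entries      # 2-bit saturating counters
        self.entries = entries
        self.threshold = threshold

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def predict_forwarding(self, pc):
        """True: expect store-to-load forwarding, so the L1D access can be skipped."""
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, was_forwarded):
        """Train the counter with the actual outcome of the load."""
        i = self._index(pc)
        if was_forwarded:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# Usage: a load predicted as "forwarding" probes only the LSQ; a misprediction
# is detected when the LSQ probe finds no matching store, and the L1D is then
# accessed late, which is the source of the small performance penalty.
predictor = ForwardingPredictor()
if predictor.predict_forwarding(pc=0x400A10):
    pass  # probe the LSQ only, skip the L1D access
```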

    Reducing power consumption in the level-1 data cache using a forwarding predictor

    In most current processor designs, access to the level-1 data cache (L1D) has become one of the major sources of power consumption due to its increasing size and high access frequency. To reduce this consumption, we propose a simple filtering technique. Our idea relies on a highly accurate forwarding predictor that determines whether a load instruction will obtain its data via forwarding through the LSQ, thus avoiding the access to the L1D, or whether it must fetch it from the data cache. Our simulation results show that we can save on average 35% of the L1D power consumption, with a negligible performance degradation.

    Reducing cache hierarchy energy consumption by predicting forwarding and disabling associative sets

    The first-level data cache in modern processors has become a major consumer of energy due to its increasing size and high access frequency. In order to reduce this high energy consumption, we propose in this paper a straightforward filtering technique based on a highly accurate forwarding predictor. Specifically, a simple structure predicts whether a load instruction will obtain its corresponding data via forwarding from the load-store structure, thus avoiding the data cache access, or whether it will be provided by the data cache. This mechanism reduces the data cache energy consumption by an average of 21.5% with a negligible performance penalty of less than 0.1%. Furthermore, we also target static energy consumption by disabling a portion of the sets of the L2 associative cache. Overall, when both proposals are combined, the total L1 and L2 energy consumption is reduced by an average of 29.2% with a performance penalty of just 0.25%.
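
    The set-disabling side of the proposal can be pictured with the sketch below, which power-gates cache sets that saw little traffic during the last interval; the utilization metric, the interval-based policy and the threshold are assumptions made for the example rather than the exact mechanism used in the paper.

```python
# Illustrative sketch of disabling underused sets of an associative cache to cut
# static (leakage) energy. Interval length and threshold are assumed values.

class SetDisableController:
    def __init__(self, num_sets, min_accesses=16):
        self.num_sets = num_sets
        self.min_accesses = min_accesses
        self.access_count = [0] * num_sets
        self.enabled = [True] * num_sets

    def record_access(self, set_index):
        self.access_count[set_index] += 1

    def end_of_interval(self):
        """Power-gate sets with little traffic; keep or re-enable the rest."""
        for s in range(self.num_sets):
            # Dirty lines of a set being turned off must be written back first.
            self.enabled[s] = self.access_count[s] >= self.min_accesses
            self.access_count[s] = 0

    def is_enabled(self, set_index):
        # An access that maps to a disabled set is simply treated as a miss.
        return self.enabled[set_index]
```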

    Reuse Detector: Improving the management of STT-RAM SLLCs

    Various constraints of Static Random Access Memory (SRAM) are leading to the consideration of new memory technologies as candidates for building on-chip shared last-level caches (SLLCs). Spin-Transfer Torque RAM (STT-RAM) is currently postulated as the prime contender due to its better energy efficiency, smaller die footprint and higher scalability. However, STT-RAM also exhibits some drawbacks, like slow and energy-hungry write operations, that need to be mitigated before it can be used in SLLCs for the next generation of computers. In this work, we address these shortcomings with a new management mechanism for STT-RAM SLLCs. This approach is based on the previous observation that, although the stream of references arriving at the SLLC of a Chip MultiProcessor (CMP) exhibits limited temporal locality, it does exhibit reuse locality, i.e. blocks referenced several times have a high probability of forthcoming reuse. As a consequence, conventional STT-RAM SLLC management mechanisms, mainly focused on exploiting temporal locality, behave inefficiently. In this paper, we employ a cache management mechanism that selects the contents of the SLLC so as to exploit reuse locality instead of temporal locality. Specifically, our proposal consists of placing a Reuse Detector (RD) between the private cache levels and the STT-RAM SLLC. Its mission is to detect blocks that do not exhibit reuse, in order to avoid their insertion in the SLLC, hence reducing the number of write operations and the energy consumption in the STT-RAM. Our evaluation, using multiprogrammed workloads on quad-core, eight-core and 16-core systems, reveals that our scheme delivers, on average, energy reductions in the SLLC in the range of 37–30%, additional energy savings in the main memory in the range of 6–8%, and performance improvements of 3% (quad-core), 7% (eight-core) and 14% (16-core) compared with an STT-RAM SLLC baseline where no RD is employed. More importantly, our approach outperforms DASCA, the state-of-the-art STT-RAM SLLC management scheme, reporting, depending on the specific scenario and the kind of applications used, SLLC energy savings in the range of 4–11% higher than those of DASCA, performance in the range of 1.5–14% higher, and additional improvements in DRAM energy consumption in the range of 2–9% higher than DASCA.
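
    A minimal sketch of the bypass decision such a Reuse Detector could implement is shown below; modelling the detector as a small table of recently seen block addresses is an assumption made for illustration, not the exact organization described in the paper.

```python
# Sketch of a Reuse Detector sitting between the private caches and an STT-RAM
# SLLC: blocks with no observed reuse bypass the SLLC, avoiding costly writes.
# The table organization and its capacity are illustrative assumptions.

from collections import OrderedDict

class ReuseDetector:
    def __init__(self, capacity=4096):
        self.seen = OrderedDict()   # block address -> number of times observed
        self.capacity = capacity

    def should_insert_in_sllc(self, block_addr):
        """Insert into the SLLC only if the block has already shown reuse."""
        count = self.seen.get(block_addr, 0)
        self.seen[block_addr] = count + 1
        self.seen.move_to_end(block_addr)
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)       # forget the oldest tracked block
        return count >= 1                       # seen before -> exhibits reuse

# Blocks evicted from the private levels consult should_insert_in_sllc(); when it
# returns False the block goes straight to main memory (if dirty) instead of
# being written into the STT-RAM SLLC, reducing write energy.
```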

    Using the MIPSfpga v2.0 teaching infrastructure in the Integrated Systems Architecture course

    This paper is based on a previous article entitled "Practical experiences based on MIPSfpga", published in the Workshop on Computer Architecture Education (held at the ISCA-2017 conference), with some modifications: (1) Section II-D (which corresponds to Section 2.4 in this paper) and Section III-A (which corresponds to Section 3) have been extended; (2) Sections III-B and III-C have been removed. The article describes the use of the MIPSfpga v2.0 teaching infrastructure in the lab assignments of the course Integrated Systems Architecture (Arquitectura de Sistemas Integrados), a compulsory fourth-year subject in the Electronic Engineering of Communications degree offered at Universidad Complutense de Madrid.

    A methodology for internationalizing teaching materials based on Markdown and Pandoc

    The internationalization of teaching offers great opportunities for the University, but it also poses significant challenges for students and instructors. In particular, creating and effectively maintaining the teaching material for a course taught simultaneously in several languages, with a high degree of coordination among its groups (e.g., a common final exam or common lab assignments for all students), can be a major challenge for instructors. To address this problem, we have designed a specific strategy for creating and managing dual-language teaching material (e.g., English-Spanish), and developed a set of cross-platform tools to put it into practice. The general idea is to keep, in a single text file, the content of the document to be built in both languages, providing right after each paragraph and heading in one language its translation into the other, using special delimiters. These dual documents are written in Markdown, a lightweight markup language that, given its simplicity and versatility, is being rapidly adopted by a wide range of professionals: from novelists and journalists to website administrators. From the dual documents written in Markdown, the final document for each language can be generated automatically in the desired format and made available to students. For this task we rely on Pandoc, a tool that converts Markdown documents into a large number of formats, such as PDF, docx (Microsoft Word), EPUB (e-book) or HTML. As part of our project, we have created Pandoc extensions that allow the creation of dual documents in Markdown and extend the expressiveness of this language with constructs commonly used in teaching documents.
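
    To make the workflow concrete, the sketch below splits a dual-language Markdown source into one file per language and then converts each with Pandoc; the "%%en"/"%%es" block markers are a made-up delimiter syntax for this example, not the notation defined by the authors' tools, and Pandoc must be installed for the final conversion step.

```python
# Sketch: split a dual-language Markdown file into per-language files and run
# Pandoc on each one. "%%en" / "%%es" are hypothetical block delimiters.

import subprocess

def split_dual_markdown(path):
    texts = {"en": [], "es": []}
    current = None                        # language of the current tagged block
    with open(path, encoding="utf-8") as f:
        for line in f:
            tag = line.strip()
            if tag in ("%%en", "%%es"):
                current = tag[2:]         # the following block belongs to this language
            elif tag == "":
                current = None            # a blank line closes the tagged block
                for lines in texts.values():
                    lines.append(line)
            elif current is None:
                for lines in texts.values():
                    lines.append(line)    # untagged text is shared by both versions
            else:
                texts[current].append(line)
    outputs = {}
    stem = path[:-3] if path.endswith(".md") else path
    for lang, lines in texts.items():
        out_path = f"{stem}.{lang}.md"    # e.g. lecture01.en.md
        with open(out_path, "w", encoding="utf-8") as f:
            f.writelines(lines)
        outputs[lang] = out_path
    return outputs

if __name__ == "__main__":
    # 'lecture01.md' is a placeholder input file name.
    for lang, md in split_dual_markdown("lecture01.md").items():
        pdf = md[:-3] + ".pdf"            # e.g. lecture01.en.pdf
        subprocess.run(["pandoc", md, "-o", pdf], check=True)
```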