
    Cooperative cache scrubbing

    Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multi-threaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10 to 60% of DRAM writes are useless, because the data on these lines are dead: the program is guaranteed to never read them again. Furthermore, reading memory only to immediately zero-initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache coherence protocol invariants and demonstrate them in a Java Virtual Machine and multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing reduces memory bandwidth and improves energy efficiency, two critical problems in modern systems.
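    The scrubbing idea is concrete enough to sketch. Below is a minimal C++ illustration of how a garbage collector might issue the two kinds of hints during its sweep and allocation phases. Note that scrub_line and zero_line are invented placeholders for the proposed ISA instructions (the abstract does not name them), and the 64-byte line size is an assumption.

```cpp
#include <cstddef>

constexpr std::size_t kLineBytes = 64;  // assumed cache-line size

// Invented placeholders: in a real system these would compile down to
// the proposed scrub/zero instructions; here they are no-ops.
inline void scrub_line(void* /*line*/) { /* drop the line's dirty data */ }
inline void zero_line(void* /*line*/)  { /* install a zeroed line, no DRAM read */ }

// Once the collector proves the region [start, start + len) is dead,
// tell the cache so its dirty lines are never written back to DRAM.
void scrub_dead_region(std::byte* start, std::size_t len) {
    for (std::size_t off = 0; off < len; off += kLineBytes)
        scrub_line(start + off);
}

// When handing out a fresh region that must be zero-initialized,
// zero it line by line in the cache instead of fetching it from DRAM first.
void zero_fresh_region(std::byte* start, std::size_t len) {
    for (std::size_t off = 0; off < len; off += kLineBytes)
        zero_line(start + off);
}
```

    In a real JVM these loops would run over regions the collector has just reclaimed or is about to hand to the allocator, which is exactly when the liveness guarantee the abstract relies on is available.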

    Cooperative caching with keep-me and evict-me

    Cooperative caching seeks to improve memory system performance by using compiler locality hints to assist hardware cache decisions. In this paper, the compiler suggests cache lines to keep or evict in set-associative caches. A compiler analysis predicts which data will and will not be reused, and annotates the corresponding memory operations with a keep-me or evict-me hint. The architecture maintains these hints on a cache line and only acts on them on a cache miss. Evict-me caching prefers to evict lines marked evict-me. Keep-me caching retains keep-me lines if possible. Otherwise, the default replacement algorithm evicts the least-recently-used (LRU) line in the set. This paper introduces the keep-me hint, the associated compiler analysis, and architectural support. The keep-me architecture includes very modest ISA support, replacement algorithms, and decay mechanisms that avoid retaining keep-me lines indefinitely. Our results are mixed for our implementation of keep-me, but show it has potential. We combine keep-me and evict-me from previous work, but find few additive benefits due to limitations in our compiler algorithm, which only applies each independently rather than performing a combined analysis.
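    The replacement rule described here is simple enough to sketch directly. The C++ fragment below is a hedged illustration of victim selection with keep-me and evict-me hint bits; the Line structure and its field names are invented, and the paper's decay mechanism is reduced to a plain LRU fallback when every line in the set is keep-me.

```cpp
#include <cstdint>
#include <vector>

struct Line {
    bool     valid    = false;
    bool     keep_me  = false;  // compiler hint: likely to be reused
    bool     evict_me = false;  // compiler hint: not reused
    uint64_t last_use = 0;      // timestamp for LRU ordering
};

// Pick a victim way in one set. Only consulted on a cache miss,
// matching the paper's policy of acting on hints only at miss time.
int pick_victim(const std::vector<Line>& set) {
    // 1. An invalid line is always the cheapest victim.
    for (int w = 0; w < (int)set.size(); ++w)
        if (!set[w].valid) return w;
    // 2. Prefer any line the compiler marked evict-me.
    for (int w = 0; w < (int)set.size(); ++w)
        if (set[w].evict_me) return w;
    // 3. Otherwise take the LRU line among lines not marked keep-me.
    int lru = -1;
    for (int w = 0; w < (int)set.size(); ++w)
        if (!set[w].keep_me && (lru < 0 || set[w].last_use < set[lru].last_use))
            lru = w;
    if (lru >= 0) return lru;
    // 4. Every line is keep-me: fall back to plain LRU so keep-me
    //    lines are not retained indefinitely.
    int victim = 0;
    for (int w = 1; w < (int)set.size(); ++w)
        if (set[w].last_use < set[victim].last_use) victim = w;
    return victim;
}
```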

    Runtime-assisted optimizations in the on-chip memory hierarchy

    Following Moore's Law, the number of transistors on chip has been increasing exponentially, which has led to the increasing complexity of modern processors. As a result, the efficient programming of such systems has become more difficult. Many programming models have been developed to address this issue. Of particular interest are task-based programming models that employ simple annotations to define parallel work in an application. The information available at the level of the runtime systems associated with these programming models offers great potential for improving hardware design. Moreover, due to technological limitations, Moore's Law is predicted to eventually come to an end, so novel paradigms are necessary to maintain the current performance improvement trends.
    The main goal of this thesis is to exploit the knowledge about a parallel application available at the runtime system level to improve the design of the on-chip memory hierarchy. The coupling of the runtime system and the microprocessor enables a better hardware design without hurting programmability.
    The first contribution is a set of insertion policies for shared last-level caches that exploit information about tasks and task data dependencies. The intuition behind this proposal revolves around the observation that parallel threads exhibit different memory access patterns; even within the same thread, accesses to different variables often follow distinct patterns. The proposed policies insert cache lines into different logical positions depending on the dependency type and task type to which the corresponding memory request belongs.
    The second proposal optimizes the execution of reductions, defined as a programming pattern that combines input data to form the resulting reduction variable. This is achieved with a runtime-assisted technique for performing reductions in the processor's cache hierarchy. The goal is a universally applicable solution regardless of the reduction variable's type, size, and access pattern. On the software level, the programming model is extended to let a programmer specify the reduction variables for tasks, as well as the desired cache level where a certain reduction will be performed. The source-to-source compiler and the runtime system are extended to translate and forward this information to the underlying hardware. On the hardware level, private and shared caches are equipped with functional units and the accompanying logic to perform reductions at the cache level. This design avoids unnecessary data movement to the core and back, since the data is operated on where it resides.
    The third contribution is a runtime-assisted prioritization scheme for memory requests inside the on-chip memory hierarchy. The proposal is based on the notion of a critical path in the context of parallel codes and the known fact that accelerating critical tasks reduces the execution time of the whole application. In this work, task criticality is tracked at the level of the task type, which enables simple annotation by the programmer. Critical tasks are accelerated by prioritizing their memory requests in the microprocessor.
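    As a concrete, purely illustrative reading of the first contribution, the sketch below maps a request's dependency type to an insertion position in the logical LRU stack of a shared last-level cache. The dependency categories and the particular positions chosen are assumptions made for illustration, not the thesis's actual policy.

```cpp
// Dependency type of the task data a memory request belongs to,
// as a task-based runtime system could tag it (categories assumed).
enum class DepType { In, Out, InOut, NonTask };

// Return the logical LRU-stack position (0 = MRU, assoc - 1 = LRU) at
// which a newly inserted line is placed in the shared last-level cache.
int insertion_position(DepType dep, int assoc) {
    switch (dep) {
        case DepType::Out:   return 0;          // output data: a successor
                                                // task will likely consume it soon
        case DepType::InOut: return assoc / 4;  // read and written: keep near MRU
        case DepType::In:    return assoc / 2;  // often read once per task
        case DepType::NonTask:
        default:             return assoc - 1;  // no runtime info: insert at LRU
    }
}
```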

    Software-assisted cache mechanisms for embedded systems

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (leaves 120-135).
    Embedded systems are increasingly using on-chip caches as part of their on-chip memory system. This thesis presents cache mechanisms to improve cache performance and provide opportunities to improve data availability that can lead to more predictable cache performance. The first cache mechanism presented is an intelligent cache replacement policy that utilizes information about dead data and data that is very frequently used. This mechanism is analyzed theoretically to show that the number of misses using intelligent cache replacement is guaranteed to be no more than the number of misses using traditional LRU replacement. Hardware and software-assisted mechanisms to implement intelligent cache replacement are presented and evaluated. The second cache mechanism presented is cache partitioning, which exploits disjoint access sequences that do not overlap in the memory space. A theoretical result is proven showing that modifying an access sequence into a concatenation of disjoint access sequences is guaranteed to improve the cache hit rate. Partitioning mechanisms inspired by the concept of disjoint sequences are designed and evaluated. A profit-based analysis, annotation, and simulation framework has been implemented to evaluate the cache mechanisms. This framework takes a compiled benchmark program and a set of program inputs and evaluates various cache mechanisms to provide a range of possible performance improvement scenarios. The proposed cache mechanisms have been evaluated using this framework by measuring cache miss rates and Instructions Per Clock (IPC). The results show that the proposed cache mechanisms show promise in improving cache performance and predictability with a modest increase in silicon area.
    by Prabhat Jain, Ph.D.
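    The disjoint-sequence idea admits a small sketch: if two access streams never touch the same addresses, giving each its own subset of cache ways means they cannot evict one another's lines, which is one way the concatenation result can be exploited in hardware. The C++ fragment below is a hypothetical way-partitioning scheme; the address-based stream classifier and the 6/2 way split are invented for illustration and are not the thesis's actual design.

```cpp
#include <cstdint>

struct WayRange { int first; int last; };  // inclusive interval of ways

// Classify an address into one of two disjoint streams. In a real
// system this would come from compiler or profiler annotations rather
// than a fixed address boundary.
int stream_of(uint64_t addr, uint64_t boundary) {
    return addr < boundary ? 0 : 1;
}

// Partition an 8-way set-associative cache: stream 0 may use ways 0-5,
// stream 1 ways 6-7. The replacement policy then selects a victim only
// within the requesting stream's way range, keeping the streams disjoint
// in the cache just as they are in the address space.
WayRange ways_for(uint64_t addr, uint64_t boundary) {
    return stream_of(addr, boundary) == 0 ? WayRange{0, 5}
                                          : WayRange{6, 7};
}
```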