50 research outputs found

    Latency reduction techniques in chip multiprocessor cache systems

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.Includes bibliographical references (p. 117-122).Single-chip multiprocessors (CMPs) solve several bottlenecks facing chip designers today. Compared to traditional superscalars, CMPs deliver higher performance at lower power for thread-parallel workloads. In this thesis, we consider tiled CMPs, a class of CMPs where each tile contains a slice of the total on-chip L2 cache storage, and tiles are connected by an on-chip network. Two basic schemes are currently used to manage L2 slices. First, each slice can be used as a private L2 for the tile. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity because each tile creates a local copy of any block it touches. Second, all slices are aggregated to form a single large L2 shared by all tiles. A shared L2 cache increases the effective cache capacity for shared data, but incurs longer hit latencies when L2 data is on a remote tile. In practice, either private or shared works better for a given workload. We present two new policies, victim replication and victim migration, both of which combine the advantages of private and shared designs. They are variants of the shared scheme which attempt to keep copies of local L1 cache victims within the local L2 cache slice.(cont.) Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of single-threaded, multi-threaded, and multi-programmed workloads running on an eight-processor tiled CMP. We show that both techniques achieve significant performance improvement over baseline private and shared schemes for these workloads.by Michael Zhang.Ph.D

    Improving prefetching mechanisms for tiled CMP platforms

    Get PDF
    Recently, high performance processor designs have evolved toward Chip-Multiprocessor (CMP) architectures to deal with instruction level parallelism limitations and, more important, to manage the power consumption that is becoming unaffordable due to the increased transistor count and clock frequency. At the present moment, this architecture, which implements multiple processing cores on a single die, is commercially available with up to twenty four processors on a single chip and there are roadmaps and research trends that suggest that number of cores will increase in the near future. The increasing on number of cores has converted the interconnection network in a key issue that will have significant impact on performance. Moreover, as the number of cores increases, tiled architectures are foreseen to provide a scalable solution to handle design complexity. Network-on-Chip (NoC) emerges as a solution to deal with growing on-chip wire delays. On the other hand, CMP designs are likely to be equipped with latency hiding techniques like prefetching in order to reduce the negative impact on performance that, otherwise, high cache miss rates would lead to. Unfortunately, the extra number of network messages that prefetching entails can drastically increase power consumption and the latency in the NoC. In this thesis, we do not develop a new prefetching technique for CMPs but propose improvements applicable to any of them. Specifically, we analyze the behavior of the prefetching in the CMPs and its impact to the interconnect. We propose several dynamic management techniques to improve the performance of the prefetching mechanism in the system. Furthermore, we identify the main problems when implementing prefetching in distributed memory systems like tiled architectures and propose directions to solve them. Finally, we propose several research lines to continue the work done in this thesis.Recentment l'arquitectura dels processadors d'altes prestacions ha evolucionat cap a processadors amb diversos nuclis per a concordar amb les limitacions del paral·lelisme a nivell d'instrucció i, mes important encara, per tractar el consum d'energia que ha esdevingut insostenible degut a l'increment de transistors i la freqüència de rellotge. Ara mateix, aquestes arquitectures, que implementes varis nuclis en un sol xip, estan a la venta amb mes de vint-i-quatre processadors en un sol xip i hi ha previsions que suggereixen que aquest nombre de nuclis creixerà en un futur pròxim. Aquest increment del nombre de nuclis, ha convertit la xarxa que els connecta en un punt clau que tindrà un impacte important en el seu rendiment. Una topologia de xarxa que sembla que serà capaç de proveir una solució escalable per aquestes arquitectures ha estat la topologia tile. Les xarxes en el xip (NoC) es presenten com la solució del increment de la latència dels cables del xip. Per altre banda, els dissenys de multiprocessadors seguiran disposant de tècniques de reducció de latència de memòria com el prefetch per tal de reduir l'impacte negatiu en rendiment que, altrament, tindríem degut als elevats temps de latència en fallades a memòria cache. Desafortunadament, el gran nombre de peticions destinades a prefetch, pot augmentar dràsticament la congestió a la xarxa i el consum d'energia. En aquesta tesi, no desenvolupem cap tècnica nova de prefetching, però proposem millores aplicables a qualsevol d'ells. Concretament analitzem el comportament del prefetching en multiprocessadors i el seu impacte a la xarxa. Proposem diverses tècniques de control dinàmic per millor el rendiment del prefetcher al sistema. A més, identifiquem els problemes principals d'implementar el prefetching en els sistemes de memòria distribuïts com els de les arquitectures tile i proposem línies d'investigació per solucionar-los. Finalment, també proposem diverses línies d'investigació per continuar amb el treball fet en aquesta tesi.Postprint (published version

    Improving Energy and Area Scalability of the Cache Hierarchy in CMPs

    Full text link
    As the core counts increase in each chip multiprocessor generation, CMPs should improve scalability in performance, area, and energy consumption to meet the demands of larger core counts. Directory-based protocols constitute the most scalable alternative. A conventional directory, however, suffers from an inefficient use of storage and energy. First, the large, non-scalable, sharer vectors consume unnecessary area and leakage, especially considering that most of the blocks tracked in a directory are cached by a single core. Second, although increasing directory size and associativity could boost system performance by reducing the coverage misses, it would come at the expense of area and energy consumption. This thesis focuses and exploits the important differences of behavior between private and shared blocks from the directory point of view. These differences claim for a separate management of both types of blocks at the directory. First, we propose the PS-Directory, a two-level directory cache that keeps the reduced number of frequently accessed shared entries in a small and fast first-level cache, namely Shared Directory Cache, and uses a larger and slower second-level Private Directory Cache to track the large amount of private blocks. Experimental results show that, compared to a conventional directory, the PS-Directory improves performance while also reducing silicon area and energy consumption. In this thesis we also show that the shared/private ratio of entries in the directory varies across applications and across different execution phases within the applications, which encourages us to propose Dynamic Way Partitioning (DWP) Directory. DWP-Directory reduces the number of ways with storage for shared blocks and it allows this storage to be powered off or on at run-time according to the dynamic requirements of the applications following a repartitioning algorithm. Results show similar performance as a traditional directory with high associativity, and similar area requirements as recent state-of-the-art schemes. In addition, DWP-Directory achieves notable static and dynamic power consumption savings. This dissertation also deals with the scalability issues in terms of power found in processor caches. A significant fraction of the total power budget is consumed by on-chip caches which are usually deployed with a high associativity degree (even L1 caches are being implemented with eight ways) to enhance the system performance. On a cache access, each way in the corresponding set is accessed in parallel, which is costly in terms of energy. This thesis presents the PS-Cache architecture, an energy-efficient cache design that reduces the number of accessed ways without hurting the performance. The PS-Cache takes advantage of the private-shared knowledge of the referenced block to reduce energy by accessing only those ways holding the kind of block looked up. Results show significant dynamic power consumption savings. Finally, we propose an energy-efficient architectural design that can be effectively applied to any kind of set-associative cache memory, not only to processor caches. The proposed approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, and just a few ways are searched in the tag and data arrays. This allows the approach to reduce the dynamic energy consumption of caches without hurting their access time. For this purpose, the proposed architecture holds the X least significant bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter the ways where the least significant bits of the tag do not match with the bits in the X-bit array. Experimental results show that this filtering mechanism achieves energy consumption in set-associative caches similar to direct mapped ones. Experimental results show that the proposals presented in this thesis offer a good tradeoff among these three major design axes.Conforme se incrementa el número de núcleos en las nuevas generaciones de multiprocesadores en chip, los CMPs deben de escalar en prestaciones, área y consumo energético para cumplir con las demandas de un número núcleos mayor. Los protocolos basados en directorio constituyen la alternativa más escalable. Un directorio convencional, no obstante, sufre de una utilización ineficiente de almacenamiento y energía. En primer lugar, los grandes y poco escalables vectores de compartidores consumen una cantidad de energía de fuga y de área innecesaria, especialmente si se tiene en consideración que la mayoría de los bloques en un directorio solo se encuentran en la cache de un único núcleo. En segundo lugar, aunque incrementar el tamaño y la asociatividad del directorio aumentaría las prestaciones del sistema, esto supondría un incremento notable en el consumo energético. Esta tesis estudia las diferencias significativas entre el comportamiento de bloques privados y compartidos en el directorio, lo que nos lleva hacia una gestión separada para cada uno de los tipos de bloque. Proponemos el PS-Directory, una cache de directorio de dos niveles que mantiene el reducido número de las entradas compartidas, que son los que se acceden con más frecuencia, en una estructura pequeña de primer nivel (concretamente, la Shared Directory Cache) y que utiliza una estructura más grande y lenta en el segundo nivel (Private Directory Cache) para poder mantener la información de los bloques privados. Los resultados experimentales muestran que, comparado con un directorio convencional, el PS-Directory consigue mejorar las prestaciones a la vez que reduce el área de silicio y el consumo energético. Ya que el ratio compartido/privado de las entradas en el directorio varia entre aplicaciones y entre las diferentes fases de ejecución dentro de las aplicaciones, proponemos el Dynamic Way Partitioning (DWP) Directory. El DWP-Directory reduce el número de vías que almacenan entradas compartidas y permite que éstas se enciendan o apaguen en tiempo de ejecución según los requisitos dinámicos de las aplicaciones según un algoritmo de reparticionado. Los resultados muestran unas prestaciones similares a un directorio tradicional de alta asociatividad y un área similar a otros esquemas recientes del estado del arte. Adicionalmente, el DWP-Directory obtiene importantes reducciones de consumo estático y dinámico. Esta disertación también se enfrenta a los problemas de escalabilidad que se pueden encontrar en las memorias cache. En un acceso a la cache, se accede a cada vía del conjunto en paralelo, siendo así un acción costosa en energía. Esta tesis presenta la arquitectura PS-Cache, un diseño energéticamente eficiente que reduce el número de vías accedidas sin perjudicar las prestaciones. La PS-Cache utiliza la información del estado privado-compartido del bloque referenciado para reducir la energía, ya que tan solo accedemos a un subconjunto de las vías que mantienen los bloques del tipo solicitado. Los resultados muestran unos importantes ahorros de energía dinámica. Finalmente, proponemos otro diseño de arquitectura energéticamente eficiente que se puede aplicar a cualquier tipo de memoria cache asociativa por conjuntos. La propuesta, la Tag Filter (TF) Architecture, filtra las vías accedidas en el conjunto de la cache, de manera que solo se mira un número reducido de vías tanto en el array de etiquetas como en el de datos. Esto permite que nuestra propuesta reduzca el consumo de energía dinámico de las caches sin perjudicar su tiempo de acceso. Los resultados experimentales muestran que este mecanismo de filtrado es capaz de obtener un consumo energético en caches asociativas por conjunto similar de las caches de mapeado directo. Los resultados experimentales muestran que las propuestas presentadas en esta tesis consiguen un buen compromiso entre estos tres importantes pilares de diseño.Conforme s'incrementen el nombre de nuclis en les noves generacions de multiprocessadors en xip, els CMPs han d'escalar en prestacions, àrea i consum energètic per complir en les demandes d'un nombre de nuclis major. El protocols basats en directori són l'alternativa més escalable. Un directori convencional, no obstant, pateix una utilització ineficient d'emmagatzematge i energia. En primer lloc, els grans i poc escalables vectors de compartidors consumeixen una quantitat d'energia estàtica i d'àrea innecessària, especialment si es considera que la majoria dels blocs en un directori només es troben en la cache d'un sol nucli. En segon lloc, tot i que incrementar la grandària i l'associativitat del directori augmentaria les prestacions del sistema, això suposaria un increment notable en el consum d'energia. Aquesta tesis estudia les diferències significatives entre el comportament de blocs privats i compartits dins del directori, la qual cosa ens guia cap a una gestió separada per a cada un dels tipus de bloc. Proposem el PS-Directory, una cache de directori de dos nivells que manté el reduït nombre de les entrades de blocs compartits, que són els que s'accedeixen amb més freqüència, en una estructura menuda de primer nivell (concretament, la Shared Directory Cache) i que empra una estructura més gran i lenta en el segon nivell (Private Directory Cache) per poder mantenir la informació dels blocs privats. Els resultats experimentals mostren que, comparat amb un directori convencional, el PS-Directory aconsegueix millorar les prestacions a la vegada que redueix l'àrea de silici i el consum energètic. Ja que la ràtio compartit/privat de les entrades en el directori varia entre aplicacions i entre les diferents fases d'execució dins de les aplicacions, proposem el Dynamic Way Partitioning (DWP) Directory. DWP-Directory redueix el nombre de vies que emmagatzemen entrades compartides i permeten que aquest s'encengui o apagui en temps d'execució segons els requeriments dinàmics de les aplicacions seguint un algoritme de reparticionat. Els resultats mostren unes prestacions similars a un directori tradicional d'alta associativitat i una àrea similar a altres esquemes recents de l'estat de l'art. Adicionalment, el DWP-Directory obté importants reduccions de consum estàtic i dinàmic. Aquesta dissertació també s'enfronta als problemes d'escalabilitat que es poden tro- bar en les memòries cache. Les caches on-chip consumeixen una part significativa del consum total del sistema. Aquestes caches implementen un alt nivell d'associativitat. En un accés a la cache, s'accedeix a cada via del conjunt en paral·lel, essent així una acció costosa en energia. Aquesta tesis presenta l'arquitectura PS-Cache, un disseny energèticament eficient que redueix el nombre de vies accedides sense perjudicar les prestacions. La PS-Cache utilitza la informació de l'estat privat-compartit del bloc referenciat per a reduir energia, ja que només accedim al subconjunt de vies que mantenen blocs del tipus sol·licitat. Els resultats mostren uns importants estalvis d'energia dinàmica. Finalment, proposem un altre disseny d'arquitectura energèticament eficient que es pot aplicar a qualsevol tipus de memòria cache associativa per conjunts. La proposta, la Tag Filter (TF) Architecture, filtra les vies accedides en el conjunt de la cache, de manera que només un reduït nombre de vies es miren tant en el array d'etiquetes com en el de dades. Això permet que la nostra proposta redueixi el consum dinàmic energètic de les caches sense perjudicar el seu temps d'accés. Els resultats experimentals mostren que aquest mecanisme de filtre és capaç d'obtenir un consum energètic en caches associatives per conjunt similar al de les caches de mapejada directa. Els resultats experimentals mostren que les propostes presentades en aquesta tesis conseguixen un bon compromís entre aquestros tres importants pilars de diseny.Valls Mompó, JJ. (2017). Improving Energy and Area Scalability of the Cache Hierarchy in CMPs [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79551TESI

    Software-Oriented Data Access Characterization for Chip Multiprocessor Architecture Optimizations

    Get PDF
    The integration of an increasing amount of on-chip hardware in Chip-Multiprocessors (CMPs) poses a challenge of efficiently utilizing the on-chip resources to maximize performance. Prior research proposals largely rely on additional hardware support to achieve desirable tradeoffs. However, these purely hardware-oriented mechanisms typically result in more generic but less efficient approaches. A new trend is designing adaptive systems by exploiting and leveraging application-level information. In this work a wide range of applications are analyzed and remarkable data access behaviors/patterns are recognized to be useful for architectural and system optimizations. In particular, this dissertation work introduces software-based techniques that can be used to extract data access characteristics for cross-layer optimizations on performance and scalability. The collected information is utilized to guide cache data placement, network configuration, coherence operations, address translation, memory configuration, etc. In particular, an approach is proposed to classify data blocks into different categories to optimize an on-chip coherent cache organization. For applications with compile-time deterministic data access localities, a compiler technique is proposed to determine data partitions that guide the last level cache data placement and communication patterns for network configuration. A page-level data classification is also demonstrated to improve address translation performance. The successful utilization of data access characteristics on traditional CMP architectures demonstrates that the proposed approach is promising and generic and can be potentially applied to future CMP architectures with emerging technologies such as the Spin-transfer torque RAM (STT-RAM)

    A survey of system level power management schemes in the dark-silicon era for many-core architectures

    Get PDF
    Power consumption in Complementary Metal Oxide Semiconductor (CMOS) technology has escalated to a point that only a fractional part of many-core chips can be powered-on at a time. Fortunately, this fraction can be increased at the expense of performance through the dark-silicon solution. However, with many-core integration set to be heading towards its thousands, power consumption and temperature increases per time, meaning the number of active nodes must be reduced drastically. Therefore, optimized techniques are demanded for continuous advancement in technology. Existing efforts try to overcome this challenge by activating nodes from different parts of the chip at the expense of communication latency. Other efforts on the other hand employ run-time power management techniques to manage the power performance of the cores trading-off performance for power. We found out that, for a significant amount of power to saved and high temperature to be avoided, focus should be on reducing the power consumption of all the on-chip components. Especially, the memory hierarchy and the interconnect. Power consumption can be minimized by, reducing the size of high leakage power dissipating elements, turning-off idle resources and integrating power saving materials

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    PS-Architecture: A scalable and energy-efficient architecture for CMP NUCAs

    Full text link
    As the number of cores increases in both incoming and future shared-memory chip--multiprocessor (CMP) generations, coherence protocols and all elements in the cache hierarchy must scale to sustain performance. In this work we attack the scalability problem in the CMPs by studying and proposing some improvements for two of those elements, namely the directory and data caches. Each of these two structures have its particular issues which we try to solve employing some mechanisms involving the different type of blocks that can be found in parallel workloads. We introduce the PS directory, a directory cache that uses two different cache structures, each one tailored to one of these types of blocks (i.e., private and shared). The Shared directory cache, which tracks shared blocks is small, with low associativity and fast. The Private directory cache is aimed at tracking private blocks, which are highly dominant in current workloads. This structure does not store the sharer vector, is larger than the shared cache, and it has higher associativity. We also introduce the PS cache, an energy-efficient cache design which only accesses a subset of the set ways without hurting performance.Valls Mompó, JJ. (2013). PS-Architecture: A scalable and energy-efficient architecture for CMP NUCAs. http://hdl.handle.net/10251/44383Archivo delegad

    Fault- and Yield-Aware On-Chip Memory Design and Management

    Get PDF
    Ever decreasing device size causes more frequent hard faults, which becomes a serious burden to processor design and yield management. This problem is particularly pronounced in the on-chip memory which consumes up to 70% of a processor' s total chip area. Traditional circuit-level techniques, such as redundancy and error correction code, become less effective in error-prevalent environments because of their large area overhead. In this work, we suggest an architectural solution to building reliable on-chip memory in the future processor environment. Our approaches have two parts, a design framework and architectural techniques for on-chip memory structures. Our design framework provides important architectural evaluation metrics such as yield, area, and performance based on low level defects and process variations parameters. Processor architects can quickly evaluate their designs' characteristics in terms of yield, area, and performance. With the framework, we develop architectural yield enhancement solutions for on-chip memory structures including L1 cache, L2 cache and directory memory. Our proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model of compute cores and L2 caches in CMPs, which show that there will be many more L2 caches than compute cores in a chip. We propose efficient utilization techniques for excess caches. Evaluation results show that excess caches significantly improve overall performance of CMPs

    Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors

    Full text link
    [ES] Los procesadores multinúcleo actuales cuentan con recursos compartidos entre los diferentes núcleos. Dos de estos recursos compartidos, la cache de último nivel y el ancho de banda de memoria principal, pueden convertirse en cuellos de botella para el rendimiento. Además, con el crecimiento del número de núcleos que implementan los diseños más recientes, la red dentro del chip también se convierte en un cuello de botella que puede afectar negativamente al rendimiento, ya que las redes tradicionales pueden encontrar limitaciones a su escalabilidad en el futuro cercano. Prácticamente la totalidad de los diseños actuales implementan jerarquías de memoria que se comunican mediante rápidas redes de interconexión. Esta organización es eficaz dado que permite reducir el número de accesos que se realizan a memoria principal y la latencia media de acceso a memoria. Las caches, la red de interconexión y la memoria principal, conjuntamente con otras técnicas conocidas como la prebúsqueda, permiten reducir las enormes latencias de acceso a memoria principal, limitando así el impacto negativo ocasionado por la diferencia de rendimiento existente entre los núcleos de cómputo y la memoria. Sin embargo, compartir los recursos mencionados es fuente de diferentes problemas y retos, siendo uno de los principales el manejo de la interferencia entre aplicaciones. Hacer un uso eficiente de la jerarquía de memoria y las caches, así como contar con una red de interconexión apropiada, es necesario para sostener el crecimiento del rendimiento en los diseños tanto actuales como futuros. Esta tesis analiza y estudia los principales problemas e inconvenientes observados en estos dos recursos: la cache de último nivel y la red dentro del chip. En primer lugar, se estudia la escalabilidad de las tradicionales redes dentro del chip con topología de malla, así como esta puede verse comprometida en próximos diseños que cuenten con mayor número de núcleos. Los resultados de este estudio muestran que, a mayor número de núcleos, el impacto negativo de la distancia entre núcleos en la latencia puede afectar seriamente al rendimiento del procesador. Como solución a este problema, en esta tesis proponemos una de red de interconexión óptica modelada en un entorno de simulación detallado, que supone una solución viable a los problemas de escalabilidad observados en los diseños tradicionales. A continuación, esta tesis dedica un esfuerzo importante a identificar y proponer soluciones a los principales problemas de diseño de las jerarquías de memoria actuales como son, por ejemplo, el sobredimensionado del espacio de cache privado, la existencia de réplicas de datos y rigidez e incapacidad de adaptación de las estructuras de cache. Aunque bien conocidos, estos problemas y sus efectos adversos en el rendimiento pueden ser evitados en procesadores de alto rendimiento gracias a la enorme capacidad de la cache de último nivel que este tipo de procesadores típicamente implementan. Sin embargo, en procesadores de bajo consumo, no existe la posibilidad de contar con tales capacidades y hacer un uso eficiente del espacio disponible es crítico para mantener el rendimiento. Como solución a estos problemas en procesadores de bajo consumo, proponemos una novedosa organización de jerarquía de dos niveles cache que utiliza una red de interconexión óptica. Los resultados obtenidos muestran que, comparado con diseños convencionales, el consumo de energía estática en la arquitectura propuesta es un 60% menor, pese a que los resultados de rendimiento presentan valores similares. Por último, hemos extendido la arquitectura propuesta para dar soporte tanto a aplicaciones paralelas como secuenciales. Los resultados obtenidos con la esta nueva arquitectura muestran un ahorro de hasta el 78 % de energía estática en la ejecución de aplicaciones paralelas.[CA] Els processadors multinucli actuals compten amb recursos compartits entre els diferents nuclis. Dos d'aquests recursos compartits, la memòria d’últim nivell i l'ample de banda de memòria principal, poden convertir-se en colls d'ampolla per al rendiment. A mes, amb el creixement del nombre de nuclis que implementen els dissenys mes recents, la xarxa dins del xip també es converteix en un coll d'ampolla que pot afectar negativament el rendiment, ja que les xarxes tradicionals poden trobar limitacions a la seva escalabilitat en el futur proper. Pràcticament la totalitat dels dissenys actuals implementen jerarquies de memòria que es comuniquen mitjançant rapides xarxes d’interconnexió. Aquesta organització es eficaç ates que permet reduir el nombre d'accessos que es realitzen a memòria principal i la latència mitjana d’accés a memòria. Les caches, la xarxa d’interconnexió i la memòria principal, conjuntament amb altres tècniques conegudes com la prebúsqueda, permeten reduir les enormes latències d’accés a memòria principal, limitant així l'impacte negatiu ocasionat per la diferencia de rendiment existent entre els nuclis de còmput i la memòria. No obstant això, compartir els recursos esmentats és font de diversos problemes i reptes, sent un dels principals la gestió de la interferència entre aplicacions. Fer un us eficient de la jerarquia de memòria i les caches, així com comptar amb una xarxa d’interconnexió apropiada, es necessari per sostenir el creixement del rendiment en els dissenys tant actuals com futurs. Aquesta tesi analitza i estudia els principals problemes i inconvenients observats en aquests dos recursos: la memòria cache d’últim nivell i la xarxa dins del xip. En primer lloc, s'estudia l'escalabilitat de les xarxes tradicionals dins del xip amb topologia de malla, així com aquesta es pot veure compromesa en propers dissenys que compten amb major nombre de nuclis. Els resultats d'aquest estudi mostren que, a major nombre de nuclis, l'impacte negatiu de la distància entre nuclis en la latència pot afectar seriosament al rendiment del processador. Com a solució' a aquest problema, en aquesta tesi proposem una xarxa d’interconnexió' òptica modelada en un entorn de simulació detallat, que suposa una solució viable als problemes d'escalabilitat observats en els dissenys tradicionals. A continuació, aquesta tesi dedica un esforç important a identificar i proposar solucions als principals problemes de disseny de les jerarquies de memòria actuals com son, per exemple, el sobredimensionat de l'espai de memòria cache privat, l’existència de repliques de dades i la rigidesa i incapacitat d’adaptació' de les estructures de memòria cache. Encara que ben coneguts, aquests problemes i els seus efectes adversos en el rendiment poden ser evitats en processadors d'alt rendiment gracies a l'enorme capacitat de la memòria cache d’últim nivell que aquest tipus de processadors típicament implementen. No obstant això, en processadors de baix consum, no hi ha la possibilitat de comptar amb aquestes capacitats, i fer un us eficient de l'espai disponible es torna crític per mantenir el rendiment. Com a solució a aquests problemes en processadors de baix consum, proposem una nova organització de jerarquia de dos nivells de memòria cache que utilitza una xarxa d’interconnexió òptica. Els resultats obtinguts mostren que, comparat amb dissenys convencionals, el consum d'energia estàtica en l'arquitectura proposada és un 60% menor, malgrat que els resultats de rendiment presenten valors similars. Per últim, hem estes l'arquitectura proposada per donar suport tant a aplicacions paral·leles com seqüencials. Els resultats obtinguts amb aquesta nova arquitectura mostren un estalvi de fins al 78 % d'energia estàtica en l’execució d'aplicacions paral·leles.[EN] Current multicores face the challenge of sharing resources among the different processor cores. Two main shared resources act as major performance bottlenecks in current designs: the off-chip main memory bandwidth and the last level cache. Additionally, as the core count grows, the network on-chip is also becoming a potential performance bottleneck, since traditional designs may find scalability issues in the near future. Memory hierarchies communicated through fast interconnects are implemented in almost every current design as they reduce the number of off-chip accesses and the overall latency, respectively. Main memory, caches, and interconnection resources, together with other widely-used techniques like prefetching, help alleviate the huge memory access latencies and limit the impact of the core-memory speed gap. However, sharing these resources brings several concerns, being one of the most challenging the management of the inter-application interference. Since almost every running application needs to access to main memory, all of them are exposed to interference from other co-runners in their way to the memory controller. For this reason, making an efficient use of the available cache space, together with achieving fast and scalable interconnects, is critical to sustain the performance in current and future designs. This dissertation analyzes and addresses the most important shortcomings of two major shared resources: the Last Level Cache (LLC) and the Network on Chip (NoC). First, we study the scalability of both electrical and optical NoCs for future multicoresand many-cores. To perform this study, we model optical interconnects in a cycle-accurate multicore simulation framework. A proper model is required; otherwise, important performance deviations may be observed otherwise in the evaluation results. The study reveals that, as the core count grows, the effect of distance on the end-to-end latency can negatively impact on the processor performance. In contrast, the study also shows that silicon nanophotonics are a viable solution to solve the mentioned latency problems. This dissertation is also motivated by important design concerns related to current memory hierarchies, like the oversizing of private cache space, data replication overheads, and lack of flexibility regarding sharing of cache structures. These issues, which can be overcome in high performance processors by virtue of huge LLCs, can compromise performance in low power processors. To address these issues we propose a more efficient cache hierarchy organization that leverages optical interconnects. The proposed architecture is conceived as an optically interconnected two-level cache hierarchy composed of multiple cache modules that can be dynamically turned on and off independently. Experimental results show that, compared to conventional designs, static energy consumption is improved by up to 60% while achieving similar performance results. Finally, we extend the proposal to support both sequential and parallel applications. This extension is required since the proposal adapts to the dynamic cache space needs of the running applications, and multithreaded applications's behaviors widely differ from those of single threaded programs. In addition, coherence management is also addressed, which is challenging since each cache module can be assigned to any core at a given time in the proposed approach. For parallel applications, the evaluation shows that the proposal achieves up to 78% static energy savings. In summary, this thesis tackles major challenges originated by the sharing of on-chip caches and communication resources in current multicores, and proposes new cache hierarchy organizations leveraging optical interconnects to address them. The proposed organizations reduce both static and dynamic energy consumption compared to conventional approaches while achieving similar performance; which results in better energy efficiency.Puche Lara, J. (2021). Novel Cache Hierarchies with Photonic Interconnects for Chip Multiprocessors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/165254TESI

    FOS: a low-power cache organization for multicores

    Get PDF
    [EN] The cache hierarchy of current multicore processors typically consists of one or two levels of private caches per core and a large shared last-level cache. This approach incurs area and energy wasting due to oversizing the private cache space, data replication through the inclusive cache levels, as well as the use of highly set-associative caches. In this paper, we claim that although this is the commonly adopted approach, it presents important design issues that can be addressed by a more energy efficient organization. This work proposes Flat On-chip Storage (FOS), a novel cache organization that, aimed at addressing energy and area on low-power processors, resolves the mentioned issues. For this purpose, FOS combines L2 and L3 cache levels into a single one, organized as a flat space, and composed of a pool of private small cache slices. These slices are initially powered off to save energy, and they are powered on and assigned to cores provided that the system performance is expected to improve. To provide fast and uniform access from the private L1 caches to the FOS's cache slices, multiple architectural challenges are overcome, which entails the design of a custom optical network-on-chip. Experimental results show that FOS achieves significant energy savings on both static and dynamic energy over conventional cache organizations with the same storage capacity. FOS static energy savings are as much as 60% over an electrically connected shared cache; these savings grow up to 75% compared to optically connected baselines. Moreover, despite deactivating part of the cache space, FOS achieves similar performance values as those achieved by conventional approaches.Puche-Lara, J.; Petit Martí, SV.; Sahuquillo Borrás, J.; Gómez Requena, ME. (2019). FOS: a low-power cache organization for multicores. The Journal of Supercomputing (Online). 75(10):6542-6573. https://doi.org/10.1007/s11227-019-02858-xS654265737510Awasthi M, Sudan K, Balasubramonian R, Carter J (2009) Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. In: 2009 IEEE 15th International Symposium on High Performance Computer Architecture, pp 250–261. https://doi.org/10.1109/HPCA.2009.4798260Baer J, Low D, Crowley P, Sidhwaney N (2003) Memory hierarchy design for a multiprocessor look-up engine. In: 12th International Conference on Parallel Architectures and Compilation Techniques (PACT 2003)Bahirat S, Pasricha S (2014) Meteor: hybrid photonic ring-mesh network-on-chip for multicore architectures. ACM Trans Embed Comput Syst 13(3s):116:1–116:33. https://doi.org/10.1145/2567940Bartolini S, Grani P (2012) A simple on-chip optical interconnection for improving performance of coherency traffic in CMPS. In: 15th Euromicro Conference on Digital System Design, pp 312–318. https://doi.org/10.1109/DSD.2012.13Beckmann BM, Marty MR, Wood DA (2006) ASR: adaptive selective replication for CMP caches. In: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39. IEEE Computer Society, Washington, DC, USA, pp 443–454. https://doi.org/10.1109/MICRO.2006.10Beckmann N, Sanchez D (2013) Jigsaw: scalable software-defined caches. In: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13. IEEE Press, Piscataway, NJ, USA, pp 213–224. https://doi.org/10.1109/PACT.2013.6618818Bergman K, Carloni LP, Bibermani AC, Hendry G (2014) Photonic network-on-chip design, vol 68. Springer, New YorkChang J, Sohi GS (2006) Cooperative caching for chip multiprocessors. In: Proceedings 33rd Annual International Symposium on Computer Architecture, pp 264–276. https://doi.org/10.1109/ISCA.2006.17Chen G, Chen H, Haurylau M, Nelson N, Fauchet PM, Friedman EG, Albonesi D (2005) Predictions of CMOS compatible on-chip optical interconnect. In: Proceedings of International Workshop on System Level Interconnect Prediction, SLIP ’05, pp 13–20Chishti Z, Powell MD, Vijaykumar TN (2005) Optimizing replication, communication, and capacity allocation in cmps. SIGARCH Comput Archit News 33(2):357–368. https://doi.org/10.1145/1080695.1070001Cho S, Jin L (2006) Managing distributed, shared l2 caches through os-level page allocation. In: 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06), pp 455–468. https://doi.org/10.1109/MICRO.2006.31Cianchetti MJ, Kerekes JC, Albonesi DH (2009) Phastlane: a rapid transit optical routing network. In: Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA’09, pp 441–450. https://doi.org/10.1145/1555754.1555809Demir Y, Hardavellas N (2015) Parka: thermally insulated nanophotonic interconnects. In: NOCS ’15, pp 1:1–1:8. https://doi.org/10.1145/2786572.2786597Duan GH, Fedeli JM, Keyvaninia S, Thomson D (2012) 10 gb/s integrated tunable hybrid iii-v/si laser and silicon mach-zehnder modulator. In: European Conference and Exhibition on Optical Communication. https://doi.org/10.1364/ECEOC.2012.Tu.4.E.2Dybdahl H, Stenstrom P (2007) An adaptive shared/private NUCA cache partitioning scheme for chip multiprocessors. In: 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pp 2–12. https://doi.org/10.1109/HPCA.2007.346180García A, Fernández R, Garca JM, Bartolini S (2014) Managing resources dynamically in hybrid photonic-electronic networks-on-chip. Concurr Comput Pract Exp 26(15):2530–2550. https://doi.org/10.1002/cpe.3332Hardavellas N, Ferdman M, Falsafi B, Ailamaki A (2009) Reactive NUCA: near-optimal block placement and replication in distributed caches. SIGARCH Comput Archit News 37(3):184–195. https://doi.org/10.1145/1555815.1555779Herrero E, González J, Canal R (2008) Distributed cooperative caching. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, pp 134–143. https://doi.org/10.1145/1454115.1454136Herrero E, González J, Canal R (2010) Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, pp 419–428. https://doi.org/10.1145/1815961.1816018Huh J, Kim C, Shafi H, Zhang L, Burger D, Keckler SW (2005) A NUCA substrate for flexible CMP cache sharing. In: Proceedings of the 19th Annual International Conference on Supercomputing, ICS ’05. ACM, pp 31–40. https://doi.org/10.1145/1088149.1088154Kahng AB, Li B, Peh LS, Samadi K (2009) Orion 2.0: a fast and accurate NoC power and area model for early-stage design space exploration. In: DATE. European Design and Automation Association, pp 423–428Kaxiras S, Hu Z, Martonosi M (2001) Cache decay: exploiting generational behavior to reduce cache leakage power. In: Proceedings of the 28th Annual International Symposium on Computer Architecture, ISCA’01, pp 240–251Kim S, Chandra D, Solihin D (2004) Fair cache sharing and partitioning in a chip multiprocessor architecture. In: PACT, pp 111–122Merino J, Puente V, Gregorio JA (2010) ESP-NUCA: a low-cost adaptive non-uniform cache architecture. In: HPCA-16 2010 the Sixteenth International Symposium on High-performance Computer Architecture, pp 1–10. https://doi.org/10.1109/HPCA.2010.5416641Morris R, Kodi AK, Louri A (2012) Dynamic reconfiguration of 3d photonic networks-on-chip for maximizing performance and improving fault tolerance. In: 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp 282–293. https://doi.org/10.1109/MICRO.2012.34Muralimanohar N, Balasubramonian R, Jouppi NP (2009) Cacti 6.0: a tool to model large caches. In: HP LaboratoriesPang J, Dwyer C, Lebeck AR (2013) Exploiting emerging technologies for nanoscale photonic networks-on-chip. In: Proceedings of 6th International Workshop on NoC Architectures, NoCArc ’13, pp 53–58Petit S, Sahuquillo J, Such JM, Kaeli DR (2005) Exploiting temporal locality in drowsy cache policies. In: Proceedings of the Second Conference on Computing Frontiers, Ischia, Italy, 4–6 May 2005, pp 371–377Pons L, Selfa V, Sahuquillo J, Petit S, Pons J (2018) Improving system turnaround time with intel CAT by identifying LLC critical applications. In: Euro-Par 2018—Parallel Processing—24th International Conference on Parallel and Distributed Computing, Turin, Italy, 27–31 Aug 2018, Proceedings, pp 603–615. https://doi.org/10.1007/978-3-319-96983-1_43Qureshi M, Patt Y (2006) Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches. In: MICRO, pp 423–432Rivers JA, Tam ES, Tyson GS, Davidson ES, Farrens MK (1998) Utilizing reuse information in data cache management. In: Proceedings of the 12th International Conference on Supercomputing, ICS 1998, Melbourne, Australia, 13–17 July 1998, pp 449–456. https://doi.org/10.1145/277830.277941Rosenfeld P, Cooper-Balis E, Jacob B (2011) Dramsim2: a cycle accurate memory system simulator. IEEE Comput Archit Lett 10:16–19. https://doi.org/10.1109/L-CA.2011.4Sahuquillo J, Pont A (1999) The filter cache: a run-time cache management approach1. In: 25th EUROMICRO ’99 Conference, Informatics: Theory and Practice for the New Millenium, 8–10 Sept 1999, Milan, Italy, pp 1424–1431. https://doi.org/10.1109/EURMIC.1999.794504Sahuquillo J, Pont A (2000) Splitting the data cache: a survey. IEEE Concurr 8(3):30–35. https://doi.org/10.1109/4434.865890Selfa V, Sahuquillo J, Eeckhout L, Petit S, Gómez ME (2017) Application clustering policies to address system fairness with intel’s cache allocation technology. In: 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017, Portland, OR, USA, 9–13 Sept 2017, pp 194–205. https://doi.org/10.1109/PACT.2017.19Shacham A, Bergman K, Carloni L (2007) On the design of a photonic network-on-chip. In: Networks-on-Chip, NOCS 2007, pp 53–64Soref R, Bennett B (1987) Electrooptical effects in silicon. IEEE J Quantum Electron 23(1):123–129. https://doi.org/10.1109/JQE.1987.1073206Henning JL (2006) SPEC CPU2006 benchmark descriptions. SIGARCH Comput Archit News 34(4):1–17. https://doi.org/10.1145/1186736.1186737Tsai PA, Beckmann N, Sanchez D (2017) Jenga: software-defined cache hierarchies. SIGARCH Comput Archit News 45(2):652–665. https://doi.org/10.1145/3140659.3080214Ubal R, Sahuquillo J, Petit S, Lopez P (2007) Multi2sim: a simulation framework to evaluate multicore-multithreaded processors. In: International Symposium on Computer Architecture and High Performance Computing, pp 62–68. https://doi.org/10.1109/SBAC-PAD.2007.17Valero A, Sahuquillo J, Petit S, López P, Duato J (2012) Combining recency of information with selective random and a victim cache in last-level caches. ACM Trans Archit Code Optim 9(3):16:1–16:20. https://doi.org/10.1145/2355585.2355589Vantrease D, Binkert N, Schreiber R, Lipasti M (2009) Light speed arbitration and flow control for nanophotonic interconnects. In: Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium, pp 304–315Werner S, Navaridas J, Lujan M (2017) Designing low-power, low-latency networks-on-chip by optimally combining electrical and optical links. In: 2017 IEEE International Symposium of High Performance Computer Architectur
    corecore