810 research outputs found

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    Get PDF
    abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201

    Leeway: Addressing Variability in Dead-Block Prediction for Last-Level Caches

    Get PDF

    Reducing Cache Contention On GPUs

    Get PDF
    The usage of Graphics Processing Units (GPUs) as an application accelerator has become increasingly popular because, compared to traditional CPUs, they are more cost-effective, their highly parallel nature complements a CPU, and they are more energy efficient. With the popularity of GPUs, many GPU-based compute-intensive applications (a.k.a., GPGPUs) present significant performance improvement over traditional CPU-based implementations. Caches, which significantly improve CPU performance, are introduced to GPUs to further enhance application performance. However, the effect of caches is not significant for many cases in GPUs and even detrimental for some cases. The massive parallelism of the GPU execution model and the resulting memory accesses cause the GPU memory hierarchy to suffer from significant memory resource contention among threads. One cause of cache contention arises from column-strided memory access patterns that GPU applications commonly generate in many data-intensive applications. When such access patterns are mapped to hardware thread groups, they become memory-divergent instructions whose memory requests are not GPU hardware friendly, resulting in serialized access and performance degradation. Cache contention also arises from cache pollution caused by lines with low reuse. For the cache to be effective, a cached line must be reused before its eviction. Unfortunately, the streaming characteristic of GPGPU workloads and the massively parallel GPU execution model increase the reuse distance, or equivalently reduce reuse frequency of data. In a GPU, the pollution caused by a large reuse distance data is significant. Memory request stall is another contention factor. A stalled Load/Store (LDST) unit does not execute memory requests from any ready warps in the issue stage. This stall prevents the potential hit chances for the ready warps. This dissertation proposes three novel architectural modifications to reduce the contention: 1) contention-aware selective caching detects the memory-divergent instructions caused by the column-strided access patterns, calculates the contending cache sets and locality information and then selectively caches; 2) locality-aware selective caching dynamically calculates the reuse frequency with efficient hardware and caches based on the reuse frequency; and 3) memory request scheduling queues the memory requests from a warp issuing stage, frees the LDST unit stall and schedules items from the queue to the LDST unit by multiple probing of the cache. Through systematic experiments and comprehensive comparisons with existing state-of-the-art techniques, this dissertation demonstrates the effectiveness of our aforementioned techniques and the viability of reducing cache contention through architectural support. Finally, this dissertation suggests other promising opportunities for future research on GPU architecture

    Partially shared cache and adaptive replacement algorithm for NoC-based many-core systems

    Get PDF
    The Network-on-Chip(NoC) is a promising alternative to traditional bus-based architectures that has been widely applied to interconnect multi/many-core systems due to its scalable and modular design. Undoubtedly, the memory wall problem is one of the most important challenges; however, this problem can now be somewhat be alleviated by cache subsystems. In this paper, to overcome the high resource consumption and low data-sharing rate problems of the private cache scheme, we propose a partially shared cache structure and a corresponding replacement algorithm based on a mesh NoC. In this scheme, the L2 cache is shared by each group of four cores that connected as a cluster to a given node by the local bus. To maximize the performance of this partially shared cache structure, we propose a core-aware re-reference interval prediction (CA-RRIP) replacement algorithm. The algorithm performs dynamic virtual partitioning on the partially shared cache; the core that initiated the cache access request will be given top priority when a cache area needs to be replaced or inserted. This approach guarantees cache exclusivity and can mitigate interactions among cores using different access patterns. We implement the traditional private, the proposed partially shared and the row-shared cache subsystems in our experiments. The comparisons indicate that the overall system resource occupation can be reduced by 20% with the same number of cores, and the instructions per cycle(IPC) of the system could increase by up to 49.2%. Moreover, the system throughput(STP) increased by an average of 5.89%. Our experimental results showed that the proposed CA-RRIP algorithm also reduces the average cache miss rate of the system under various cache access patterns

    Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures

    Get PDF
    abstract: With the massive multithreading execution feature, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPUs). However, using GPUs to accelerate computation does not always gain good performance improvement. This is mainly due to three inefficiencies in modern GPU and system architectures. First, not all parallel threads have a uniform amount of workload to fully utilize GPU’s computation ability, leading to a sub-optimal performance problem, called warp criticality. To mitigate the degree of warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts and accelerates the critical warp execution by allocating larger execution time slices and additional cache resources to the critical warp. The evaluation result shows that with CAWA, GPUs can achieve an average of 1.23x speedup. Second, the shared cache storage in GPUs is often insufficient to accommodate demands of the large number of concurrent threads. As a result, cache thrashing is commonly experienced in GPU’s cache memories, particularly in the L1 data caches. To alleviate the cache contention and thrashing problem, I develop an instruction aware Control Loop Based Adaptive Bypassing algorithm, called Ctrl-C. Ctrl-C learns the cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation result shows that Ctrl-C can effectively improve cache utilization in GPUs and achieve an average of 1.42x speedup for cache sensitive GPGPU workloads. Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system setup can contend for memory resources in multiple levels, resulting in significant performance degradation. To maximize the system throughput and balance the performance degradation of all co-located applications, I design a scalable performance degradation predictor specifically for heterogeneous systems, called HeteroPDP. HeteroPDP predicts the application execution time and schedules OpenCL workloads to run on different devices based on the optimization goal. The evaluation result shows HeteroPDP can improve the system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gain an additional 50% speedup compared with always offloading the OpenCL workload to GPUs. In summary, this dissertation aims to provide insights for the future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.Dissertation/ThesisDoctoral Dissertation Computer Engineering 201

    Distribuição de conteúdos over-the-top multimédia em redes sem fios

    Get PDF
    mestrado em Engenharia Eletrónica e TelecomunicaçõesHoje em dia a Internet é considerada um bem essencial devido ao facto de haver uma constante necessidade de comunicar, mas também de aceder e partilhar conteúdos. Com a crescente utilização da Internet, aliada ao aumento da largura de banda fornecida pelos operadores de telecomunicações, criaram-se assim excelentes condições para o aumento dos serviços multimédia Over-The-Top (OTT), demonstrado pelo o sucesso apresentado pelos os serviços Netflix e Youtube. O serviço OTT engloba a entrega de vídeo e áudio através da Internet sem um controlo direto dos operadores de telecomunicações, apresentando uma proposta atractiva de baixo custo e lucrativa. Embora a entrega OTT seja cativante, esta padece de algumas limitações. Para que a proposta se mantenha em crescimento e com elevados padrões de Qualidade-de-Experiência (QoE) para os consumidores, é necessário investir na arquitetura da rede de distribuição de conteúdos, para que esta seja capaz de se adaptar aos diversos tipos de conteúdo e obter um modelo otimizado com um uso cauteloso dos recursos, tendo como objectivo fornecer serviços OTT com uma boa qualidade para o utilizador, de uma forma eficiente e escalável indo de encontro aos requisitos impostos pelas redes móveis atuais e futuras. Esta dissertação foca-se na distribuição de conteúdos em redes sem fios, através de um modelo de cache distribuída entre os diferentes pontos de acesso, aumentando assim o tamanho da cache e diminuindo o tráfego necessário para os servidores ou caches da camada de agregação acima. Assim, permite-se uma maior escalabilidade e aumento da largura de banda disponível para os servidores de camada de agregação acima. Testou-se o modelo de cache distribuída em três cenários: o consumidor está em casa em que se considera que tem um acesso fixo, o consumidor tem um comportamento móvel entre vários pontos de acesso na rua, e o consumidor está dentro de um comboio em alta velocidade. Testaram-se várias soluções como Redis2, Cachelot e Memcached para servir de cache, bem como se avaliaram vários proxies para ir de encontro ás características necessárias. Mais ainda, na distribuição de conteúdos testaram-se dois algoritmos, nomeadamente o Consistent e o Rendezvouz Hashing. Ainda nesta dissertação utilizou-se uma proposta já existente baseada na previsão de conteúdos (prefetching ), que consiste em colocar o conteúdo nas caches antes de este ser requerido pelos consumidores. No final, verificou-se que o modelo distribuído com a integração com prefecthing melhorou a qualidade de experiência dos consumidores, bem como reduziu a carga nos servidores de camada de agregação acima.Nowadays, the Internet is considered an essential good, due to the fact that there is a need to communicate, but also to access and share information. With the increasing use of the Internet, allied with the increased bandwidth provided by telecommunication operators, it has created conditions for the increase of Over-the-Top (OTT) Multimedia Services, demonstrated by the huge success of Net ix and Youtube. The OTT service encompasses the delivery of video and audio through the Internet without direct control of telecommunication operators, presenting an attractive low-cost and pro table proposal. Although the OTT delivery is captivating, it has some limitations. In order to increase the number of clients and keep the high Quality of Experience (QoE) standards, an enhanced architecture for content distribution network is needed. Thus, the enhanced architecture needs to provide a good quality for the user, in an e cient and scalable way, supporting the requirements imposed by future mobile networks. This dissertation aims to approach the content distribution in wireless networks, through a distributed cache model among the several access points, thus increasing the cache size and decreasing the load on the upstream servers. The proposed architecture was tested in three di erent scenarios: the consumer is at home and it is considered that it has a xed access, the consumer is mobile between several access points in the street, the consumer is in a high speed train. Several solutions were evaluated, such as Redis2, Cachelot and Memcached to serve as caches, along with the evaluation of several proxies server in order to ful ll the required features. Also, it was tested two distributed algorithms, namely the Consistent and Rendezvous Hashing. Moreover, in this dissertation it was integrated a prefetching mechanism, which consists of inserting the content in caches before being requested by the consumers. At the end, it was veri ed that the distributed model with prefetching improved the consumers QoE as well as it reduced the load on the upstream servers

    Runtime-assisted optimizations in the on-chip memory hierarchy

    Get PDF
    Following Moore's Law, the number of transistors on chip has been increasing exponentially, which has led to the increasing complexity of modern processors. As a result, the efficient programming of such systems has become more difficult. Many programming models have been developed to answer this issue. Of particular interest are task-based programming models that employ simple annotations to define parallel work in an application. The information available at the level of the runtime systems associated with these programming models offers great potential for improving hardware design. Moreover, due to technological limitations, Moore's Law is predicted to eventually come to an end, so novel paradigms are necessary to maintain the current performance improvement trends. The main goal of this thesis is to exploit the knowledge about a parallel application available at the runtime system level to improve the design of the on-chip memory hierarchy. The coupling of the runtime system and the microprocessor enables a better hardware design without hurting the programmability. The first contribution is a set of insertion policies for shared last-level caches that exploit information about tasks and task data dependencies. The intuition behind this proposal revolves around the observation that parallel threads exhibit different memory access patterns. Even within the same thread, accesses to different variables often follow distinct patterns. The proposed policies insert cache lines into different logical positions depending on the dependency type and task type to which the corresponding memory request belongs. The second proposal optimizes the execution of reductions, defined as a programming pattern that combines input data to form the resulting reduction variable. This is achieved with a runtime-assisted technique for performing reductions in the processor's cache hierarchy. The proposal's goal is to be a universally applicable solution regardless of the reduction variable type, size and access pattern. On the software level, the programming model is extended to let a programmer specify the reduction variables for tasks, as well as the desired cache level where a certain reduction will be performed. The source-to-source compiler and the runtime system are extended to translate and forward this information to the underlying hardware. On the hardware level, private and shared caches are equipped with functional units and the accompanying logic to perform reductions at the cache level. This design avoids unnecessary data movements to the core and back as the data is operated at the place where it resides. The third contribution is a runtime-assisted prioritization scheme for memory requests inside the on-chip memory hierarchy. The proposal is based on the notion of a critical path in the context of parallel codes and a known fact that accelerating critical tasks reduces the execution time of the whole application. In the context of this work, task criticality is observed at a level of a task type as it enables simple annotation by the programmer. The acceleration of critical tasks is achieved by the prioritization of corresponding memory requests in the microprocessor.Siguiendo la ley de Moore, el número de transistores en los chips ha crecido exponencialmente, lo que ha comportado una mayor complejidad en los procesadores modernos y, como resultado, de la dificultad de la programación eficiente de estos sistemas. Se han desarrollado muchos modelos de programación para resolver este problema; un ejemplo particular son los modelos de programación basados en tareas, que emplean anotaciones sencillas para definir los Trabajos paralelos de una aplicación. La información de que disponen los sistemas en tiempo de ejecución (runtime systems) asociada con estos modelos de programación ofrece un enorme potencial para la mejora del diseño del hardware. Por otro lado, las limitaciones tecnológicas hacen que la ley de Moore pueda dejar de cumplirse próximamente, por lo que se necesitan paradigmas nuevos para mantener las tendencias actuales de mejora de rendimiento. El objetivo principal de esta tesis es aprovechar el conocimiento de las aplicaciones paral·leles de que dispone el runtime system para mejorar el diseño de la jerarquía de memoria del chip. El acoplamiento del runtime system junto con el microprocesador permite realizar mejores diseños hardware sin afectar Negativamente en la programabilidad de dichos sistemas. La primera contribución de esta tesis consiste en un conjunto de políticas de inserción para las memorias caché compartidas de último nivel que aprovecha la información de las tareas y las dependencias de datos entre estas. La intuición tras esta propuesta se basa en la observación de que los hilos de ejecución paralelos muestran distintos patrones de acceso a memoria e, incluso dentro del mismo hilo, los accesos a diferentes variables a menudo siguen patrones distintos. Las políticas que se proponen insertan líneas de caché en posiciones lógicas diferentes en función de los tipos de dependencia y tarea a los que corresponde la petición de memoria. La segunda propuesta optimiza la ejecución de las reducciones, que se definen como un patrón de programación que combina datos de entrada para conseguir la variable de reducción como resultado. Esto se consigue mediante una técnica asistida por el runtime system para la realización de reducciones en la jerarquía de la caché del procesador, con el objetivo de ser una solución aplicable de forma universal sin depender del tipo de la variable de la reducción, su tamaño o el patrón de acceso. A nivel de software, el modelo de programación se extiende para que el programador especifique las variables de reducción de las tareas, así como el nivel de caché escogido para que se realice una determinada reducción. El compilador fuente a Fuente (compilador source-to-source) y el runtime ssytem se modifican para que traduzcan y pasen esta información al hardware subyacente, evitando así movimientos de datos innecesarios hacia y desde el núcleo del procesador, al realizarse la operación donde se encuentran los datos de la misma. La tercera contribución proporciona un esquema de priorización asistido por el runtime system para peticiones de memoria dentro de la jerarquía de memoria del chip. La propuesta se basa en la noción de camino crítico en el contexto de los códigos paralelos y en el hecho conocido de que acelerar tareas críticas reduce el tiempo de ejecución de la aplicación completa. En el contexto de este trabajo, la criticidad de las tareas se considera a nivel del tipo de tarea ya que permite que el programador las indique mediante anotaciones sencillas. La aceleración de las tareas críticas se consigue priorizando las correspondientes peticiones de memoria en el microprocesador.Seguint la llei de Moore, el nombre de transistors que contenen els xips ha patit un creixement exponencial, fet que ha provocat un augment de la complexitat dels processadors moderns i, per tant, de la dificultat de la programació eficient d’aquests sistemes. Per intentar solucionar-ho, s’han desenvolupat diversos models de programació; un exemple particular en són els models basats en tasques, que fan servir anotacions senzilles per definir treballs paral·lels dins d’una aplicació. La informació que hi ha al nivell dels sistemes en temps d’execució (runtime systems) associada amb aquests models de programació ofereix un gran potencial a l’hora de millorar el disseny del maquinari. D’altra banda, les limitacions tecnològiques fan que la llei de Moore pugui deixar de complir-se properament, per la qual cosa calen nous paradigmes per mantenir les tendències actuals en la millora de rendiment. L’objectiu principal d’aquesta tesi és aprofitar els coneixements que el runtime System té d’una aplicació paral·lela per millorar el disseny de la jerarquia de memòria dins el xip. L’acoblament del runtime system i el microprocessador permet millorar el disseny del maquinari sense malmetre la programabilitat d’aquests sistemes. La primera contribució d’aquesta tesi consisteix en un conjunt de polítiques d’inserció a les memòries cau (cache memories) compartides d’últim nivell que aprofita informació sobre tasques i les dependències de dades entre aquestes. La intuïció que hi ha al darrere d’aquesta proposta es basa en el fet que els fils d’execució paral·lels mostren diferents patrons d’accés a la memòria; fins i tot dins el mateix fil, els accessos a variables diferents sovint segueixen patrons diferents. Les polítiques que s’hi proposen insereixen línies de la memòria cau a diferents ubicacions lògiques en funció dels tipus de dependència i de tasca als quals correspon la petició de memòria. La segona proposta optimitza l’execució de les reduccions, que es defineixen com un patró de programació que combina dades d’entrada per aconseguir la variable de reducció com a resultat. Això s’aconsegueix mitjançant una tècnica assistida pel runtime system per dur a terme reduccions en la jerarquia de la memòria cau del processador, amb l’objectiu que la proposta sigui aplicable de manera universal, sense dependre del tipus de la variable a la qual es realitza la reducció, la seva mida o el patró d’accés. A nivell de programari, es realitza una extensió del model de programació per facilitar que el programador especifiqui les variables de les reduccions que usaran les tasques, així com el nivell de memòria cau desitjat on s’hauria de realitzar una certa reducció. El compilador font a font (compilador source-to-source) i el runtime system s’amplien per traduir i passar aquesta informació al maquinari subjacent. A nivell de maquinari, les memòries cau privades i compartides s’equipen amb unitats funcionals i la lògica corresponent per poder dur a terme les reduccions a la pròpia memòria cau, evitant així moviments de dades innecessaris entre el nucli del processador i la jerarquia de memòria. La tercera contribució proporciona un esquema de priorització assistit pel runtime System per peticions de memòria dins de la jerarquia de memòria del xip. La proposta es basa en la noció de camí crític en el context dels codis paral·lels i en el fet conegut que l’acceleració de les tasques que formen part del camí crític redueix el temps d’execució de l’aplicació sencera. En el context d’aquest treball, la criticitat de les tasques s’observa al nivell del seu tipus ja que permet que el programador les indiqui mitjançant anotacions senzilles. L’acceleració de les tasques crítiques s’aconsegueix prioritzant les corresponents peticions de memòria dins el microprocessador
    corecore