
    Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures

    With their massively multithreaded execution model, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPU). However, offloading computation to a GPU does not always yield a good performance improvement, largely because of three inefficiencies in modern GPU and system architectures. First, not all parallel threads have a uniform amount of work with which to fully utilize the GPU's compute capability, leading to a sub-optimal performance problem known as warp criticality. To mitigate warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts the critical warp and accelerates its execution by allocating it larger execution time slices and additional cache resources. The evaluation shows that with CAWA, GPUs achieve an average speedup of 1.23x. Second, the shared cache storage in GPUs is often insufficient to accommodate the demands of the large number of concurrent threads. As a result, cache thrashing is common in GPU cache memories, particularly in the L1 data caches. To alleviate cache contention and thrashing, I develop an instruction-aware Control-Loop-Based Adaptive Bypassing algorithm, called Ctrl-C. Ctrl-C learns cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation shows that Ctrl-C effectively improves cache utilization in GPUs and achieves an average speedup of 1.42x for cache-sensitive GPGPU workloads. Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system can contend for memory resources at multiple levels, resulting in significant performance degradation. To maximize system throughput and balance the performance degradation across all co-located applications, I design a scalable performance degradation predictor for heterogeneous systems, called HeteroPDP. HeteroPDP predicts application execution time and schedules OpenCL workloads onto different devices according to the optimization goal. The evaluation shows that HeteroPDP improves system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gains an additional 50% speedup compared with always offloading the OpenCL workload to the GPU. In summary, this dissertation aims to provide insights for future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.
    Doctoral Dissertation, Computer Engineering, 201
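    To make the warp-criticality idea concrete, here is a minimal C++ sketch of a criticality-aware scheduling decision in the spirit of CAWA; the warp fields, the criticality score, and the time-slice sizes are illustrative assumptions rather than the dissertation's actual mechanism.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Minimal sketch of a criticality-aware warp scheduler (CAWA-style idea).
// The Warp fields, the criticality score, and the slice sizes are assumptions
// made for this example, not the dissertation's hardware design.
struct Warp {
    int id;
    long remaining_insts;  // proxy for remaining work
    long stall_cycles;     // accumulated memory-stall cycles
    bool ready;
};

// Higher score = more critical: more work left, more time lost to stalls.
long criticality(const Warp& w) { return w.remaining_insts + 4 * w.stall_cycles; }

int main() {
    std::vector<Warp> warps = {
        {0, 1200, 300, true}, {1, 400, 50, true}, {2, 900, 10, true}};
    const long kBaseSlice = 2, kBoostSlice = 6;  // issue slots per scheduling turn

    // Predict the critical warp: the ready warp with the highest score.
    auto critical = std::max_element(
        warps.begin(), warps.end(),
        [](const Warp& a, const Warp& b) { return criticality(a) < criticality(b); });

    // Issue each ready warp, but grant the predicted critical warp a larger
    // execution time slice so it does not fall further behind its thread block.
    for (auto& w : warps) {
        if (!w.ready) continue;
        long slice = (w.id == critical->id) ? kBoostSlice : kBaseSlice;
        w.remaining_insts = std::max(0L, w.remaining_insts - slice);
        std::printf("warp %d issued %ld slots, %ld insts left\n",
                    w.id, slice, w.remaining_insts);
    }
    return 0;
}
```

    A real GPU scheduler would derive such an estimate from per-warp instruction counts and stall statistics collected in hardware and would re-evaluate it every scheduling cycle; the fixed boost here only illustrates the decision itself.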

    Reducing Cache Contention On GPUs

    The use of graphics processing units (GPUs) as application accelerators has become increasingly popular because, compared to traditional CPUs, they are more cost-effective and energy-efficient, and their highly parallel nature complements a CPU. With this popularity, many GPU-based compute-intensive applications (GPGPU workloads) achieve significant performance improvements over traditional CPU-based implementations. Caches, which significantly improve CPU performance, have been introduced to GPUs to further enhance application performance. However, in many cases the effect of caches in GPUs is insignificant, and in some cases it is even detrimental. The massive parallelism of the GPU execution model and the memory accesses it generates cause the GPU memory hierarchy to suffer from significant memory resource contention among threads. One cause of cache contention is the column-strided memory access patterns that many data-intensive GPU applications generate. When such access patterns are mapped to hardware thread groups, they become memory-divergent instructions whose memory requests are not hardware friendly, resulting in serialized accesses and performance degradation. Cache contention also arises from cache pollution caused by lines with low reuse. For a cache to be effective, a cached line must be reused before its eviction. Unfortunately, the streaming characteristics of GPGPU workloads and the massively parallel GPU execution model increase the reuse distance, or equivalently reduce the reuse frequency, of data. In a GPU, the pollution caused by data with large reuse distances is significant. Memory request stalls are another contention factor: a stalled load/store (LDST) unit cannot execute memory requests from any ready warp in the issue stage, forfeiting potential cache hits for those ready warps. This dissertation proposes three novel architectural modifications to reduce this contention: 1) contention-aware selective caching detects the memory-divergent instructions caused by column-strided access patterns, computes the contending cache sets and locality information, and then caches selectively; 2) locality-aware selective caching dynamically computes the reuse frequency with efficient hardware and caches based on that frequency; and 3) memory request scheduling queues the memory requests from the warp issue stage, freeing the LDST unit from stalls, and schedules items from the queue to the LDST unit by probing the cache multiple times. Through systematic experiments and comprehensive comparisons with existing state-of-the-art techniques, this dissertation demonstrates the effectiveness of the aforementioned techniques and the viability of reducing cache contention through architectural support. Finally, this dissertation suggests other promising opportunities for future research on GPU architecture.
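    As a rough illustration of the locality-aware selective caching idea, the sketch below tracks, per load PC, how often cached lines are actually reused and bypasses the L1 for PCs whose reuse frequency stays low; the table organization, warm-up period, and threshold are assumptions made for the example, not the dissertation's hardware design.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Sketch of reuse-frequency-based selective caching: learn, per load PC,
// how often the lines it brings in are reused, and bypass the L1 for PCs
// with low observed reuse. Sizes and thresholds are illustrative only.
struct ReuseStats {
    uint32_t accesses = 0;
    uint32_t reuses = 0;  // hits attributed to lines brought in by this PC
};

class SelectiveCache {
  public:
    // Decide whether a request from `pc` should be cached in L1 or bypassed.
    bool should_cache(uint64_t pc) const {
        auto it = stats_.find(pc);
        if (it == stats_.end() || it->second.accesses < kWarmup) return true;
        double freq = double(it->second.reuses) / it->second.accesses;
        return freq >= kReuseThreshold;
    }

    // Called by the cache model on every access attributed to `pc`.
    void record(uint64_t pc, bool hit) {
        auto& s = stats_[pc];
        s.accesses++;
        if (hit) s.reuses++;
    }

  private:
    static constexpr uint32_t kWarmup = 32;         // learn before bypassing
    static constexpr double kReuseThreshold = 0.1;  // minimum reuse frequency
    std::unordered_map<uint64_t, ReuseStats> stats_;
};

int main() {
    SelectiveCache policy;
    // A streaming load PC with no reuse, and a load PC with frequent reuse.
    for (int i = 0; i < 64; ++i) policy.record(0x400a10, /*hit=*/false);
    for (int i = 0; i < 64; ++i) policy.record(0x400b20, /*hit=*/i % 2 == 0);
    std::printf("cache 0x400a10? %d\n", policy.should_cache(0x400a10));  // 0: bypass
    std::printf("cache 0x400b20? %d\n", policy.should_cache(0x400b20));  // 1: cache
    return 0;
}
```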

    Design and optimization of a portable LQCD Monte Carlo code using OpenACC

    The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPUs, which support a wide class of applications but deliver moderate computing performance, to many-core GPUs, which exploit aggressive data parallelism and deliver higher performance for streaming computations. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are frequent and keeping different code versions aligned is tedious and error-prone. In this work we present the design and optimization of a state-of-the-art, production-level lattice QCD (LQCD) Monte Carlo application using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures and comparing it with GPU-specific implementations, showing that a good level of performance portability can be reached.
    Comment: 26 pages, 2 png figures; preprint of an article submitted for consideration in International Journal of Modern Physics
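    The kernel below is a minimal OpenACC example in the style described here, not code taken from the LQCD application itself: a single directive-annotated loop (a simple axpy over illustrative arrays x, y, z) that an OpenACC compiler can map to either a multi-core CPU or a GPU without source changes.

```cpp
#include <cstdio>
#include <vector>

// Illustrative OpenACC kernel: the same directive-annotated loop compiles for
// multi-core CPUs or GPUs. Array names and sizes are invented for the example.
int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0), z(n, 0.0);
    const double a = 0.5;

    double* xp = x.data();
    double* yp = y.data();
    double* zp = z.data();

    // The compiler maps this loop onto the target device (GPU gangs/vectors or
    // CPU threads); data clauses describe what must move to device memory.
    #pragma acc parallel loop copyin(xp[0:n], yp[0:n]) copyout(zp[0:n])
    for (int i = 0; i < n; ++i)
        zp[i] = a * xp[i] + yp[i];

    std::printf("z[0] = %f\n", zp[0]);
    return 0;
}
```

    Without an OpenACC-capable compiler the pragma is simply ignored and the loop runs serially, which is one reason directive-based annotation preserves a single maintainable source.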

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    General-purpose processors propel the advances and innovations behind many of humanity's endeavors. Catering to this demand, chip multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical microarchitecture and system-architecture solutions. Addressing the important last-level cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that to maximize system performance, it is essential to manage the LLC while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by 6.5% on average. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache, which performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC, which exploits the GPU's latency tolerance and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system-architecture innovations to sustain the performance scalability of GPGPUs in the face of a slowing Moore's Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic-die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs and demonstrate that the inherent non-uniform memory access side effects form the key energy-efficiency bottleneck going forward. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.
    Doctoral Dissertation, Computer Science, 201
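    To illustrate the kind of memory-cost-aware LLC victim selection that ReMAP argues for, here is a small sketch under assumed structures: it prefers to evict lines with no observed reuse and, among those, lines predicted to be cheap to refetch from DRAM. The line metadata and the cost model are invented for the example and are not the dissertation's actual policy.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Sketch of memory-cost-aware LLC replacement: evict dead lines first, then
// lines that would be cheap to fetch back (e.g., expected row-buffer hits).
struct LlcLine {
    uint64_t tag;
    bool reused;            // has the line been hit since insertion?
    uint32_t refetch_cost;  // predicted DRAM cost if evicted and fetched again
};

// Choose a victim within one cache set.
std::size_t pick_victim(const std::vector<LlcLine>& set) {
    auto better = [](const LlcLine& a, const LlcLine& b) {
        if (a.reused != b.reused) return !a.reused;  // no-reuse lines win
        return a.refetch_cost < b.refetch_cost;      // then cheaper refetch
    };
    std::size_t victim = 0;
    for (std::size_t i = 1; i < set.size(); ++i)
        if (better(set[i], set[victim])) victim = i;
    return victim;
}

int main() {
    std::vector<LlcLine> set = {
        {0xa0, true, 200},   // reused, expensive to refetch (row-buffer miss)
        {0xb0, true, 80},    // reused, cheap to refetch (row-buffer hit)
        {0xc0, false, 300},  // never reused: best candidate regardless of cost
    };
    std::printf("evict way %zu\n", pick_victim(set));  // prints 2
    return 0;
}
```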

    Memory hierarchies for future HPC architectures

    Efficiently managing the memory subsystem of modern multi-/many-core architectures is an increasing challenge as systems grow in complexity and heterogeneity. In the field of high-performance computing (HPC) in particular, where massively parallel architectures are used and input sets of several terabytes are common, careful management of the memory hierarchy is crucial to exploit the full computing power of these systems. The goal of this thesis is to provide computer architects with valuable information to guide the design of future systems, in particular of those most widely used in HPC, i.e., symmetric multicore processors (SMPs) and GPUs. With that aim, we present an analysis of some of the inefficiencies and shortcomings of current memory management techniques and propose two novel schemes that leverage the opportunities arising from new and emerging programming models and computing paradigms. The first contribution of this thesis is a block prefetching mechanism for task-based programming models. Using a task-based programming model simplifies parallel programming and allows for better resource utilization in the supercomputers used in HPC, while enabling sophisticated memory management techniques. The proposed scheme relies on a memory-aware runtime system to guide prefetching while avoiding the main drawbacks of traditional prefetching mechanisms, i.e., cache pollution and lack of timeliness. It leverages the information provided by the user about tasks' input data to prefetch contiguous blocks of memory that are certain to be useful. The proposed scheme targets SMPs with large cache hierarchies and uses heuristics to dynamically decide the best cache level to prefetch into without evicting useful data. The focus of the thesis then turns to heterogeneous architectures combining GPUs and traditional multicore processors. The current trend towards tighter coupling of GPU and CPU enables new collaborative computations that tax the memory subsystem in a different manner than previous heterogeneous computations did, and requires careful analysis to understand the trade-offs to be expected when designing future memory organizations. The second contribution is an in-depth analysis of the impact of sharing the last-level cache between GPU and CPU cores in a system where the GPU is integrated on the same die as the CPU. The analysis focuses on the effect that a shared cache can have on collaborative computations where GPU and CPU threads concurrently work on a problem and share data at fine granularities. The results presented here show that sharing the last-level cache is largely beneficial, as it allows for better resource utilization. In addition, the evaluation shows that collaborative computations benefit significantly from the faster CPU-GPU communication and higher cache hit rates that a shared cache level provides. The final contribution of this thesis analyzes the inefficiencies and drawbacks of demand paging as currently implemented in discrete GPUs by NVIDIA, and proposes a novel memory organization and dynamic migration scheme that allows for efficient data sharing between GPU and CPU, especially when executing collaborative computations where data is migrated back and forth between the two separate memories. This scheme migrates data at cache-line granularity, transparently to the user and operating system, avoiding false sharing and the unnecessary data transfers that occur with demand paging. The results show that the proposed scheme outperforms the baseline system by reducing the migration latency of data that is copied multiple times between the two memories. In addition, an analysis of different interconnect latencies shows that fine-grained data sharing between GPU and CPU is feasible as long as future interconnect technologies achieve four to five times lower round-trip times than PCI-Express 3.0.
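    The following sketch illustrates the runtime-guided block prefetching idea under stated assumptions: a hypothetical Task descriptor exposes its input region to the runtime, which prefetches the contiguous block before the task executes, here using the GCC/Clang __builtin_prefetch intrinsic with a fixed locality hint standing in for the heuristically chosen cache level.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Sketch of runtime-guided block prefetching for a task-based model: the
// runtime knows a task's input region before the task runs, so it can prefetch
// that contiguous block ahead of execution instead of relying on a hardware
// stride predictor. The Task descriptor and prefetch loop are illustrative.
struct Task {
    const double* input;  // declared input region of the task
    std::size_t count;
    void (*run)(const double*, std::size_t);
};

// Prefetch the task's input block, one cache line at a time.
void prefetch_inputs(const Task& t) {
    const char* p = reinterpret_cast<const char*>(t.input);
    const std::size_t bytes = t.count * sizeof(double);
    for (std::size_t off = 0; off < bytes; off += 64)
        // locality hint 2 roughly means "keep in a mid-level cache";
        // a real runtime would pick the level heuristically.
        __builtin_prefetch(p + off, /*rw=*/0, /*locality=*/2);
}

void sum_task(const double* in, std::size_t n) {
    double s = 0;
    for (std::size_t i = 0; i < n; ++i) s += in[i];
    std::printf("sum = %f\n", s);
}

int main() {
    std::vector<double> data(1 << 16, 1.0);
    Task t{data.data(), data.size(), sum_task};
    prefetch_inputs(t);      // issued by the runtime before the task starts
    t.run(t.input, t.count);
    return 0;
}
```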

    Multicore architecture optimizations for HPC applications

    From single-core CPUs to detachable compute accelerators, supercomputers have made tremendous progress by using the available transistors on chip and specializing hardware for a given type of computation. Today, compute nodes used in HPC employ multi-core CPUs tailored for serial execution and multiple accelerators (many-core devices or GPUs) for throughput computing. However, designing next-generation HPC systems requires not only performance improvements but also better energy efficiency. The current push toward exascale computation calls for at least an order-of-magnitude increase in both of these metrics. This thesis explores HPC-specific optimizations that make better use of the available transistors and that improve performance by transparently executing parallel code across multiple GPU accelerators. First, we analyze several HPC benchmark suites, compare them against typical desktop applications, and identify the differences that advocate for proper core tailoring. Moreover, within the HPC applications, we evaluate serial and parallel code sections separately, resulting in an Asymmetric Chip Multiprocessor (ACMP) design with one core optimized for single-thread performance and many lean cores for parallel execution. Our results suggest that downsizing the core front-end structures yields an HPC-tailored lean core that saves 16% of the core area and 7% of power, without performance loss. Further improving the ACMP design, we observe that multiple lean cores run the same code during parallel regions. This motivated us to evaluate a design in which the lean cores share the I-cache, with the intent of benefiting from mutual prefetching without increasing the average access latency. Our exploration of the design parameters finds a sweet spot using a wide interconnect to access the shared I-cache and a few line buffers to provide the bandwidth and latency required to sustain performance. The projections presented in this thesis show an additional 11% area saving and a 5% energy reduction at no performance cost. These area and power savings can be attractive for many-core accelerators, either for increasing the performance per unit of area and power, or for adding cores and thus improving performance within the same hardware budget. Finally, we study the effects of future NUMA accelerators composed of multiple GPU devices. Reaching the limits of single-GPU die size, next-generation GPU compute accelerators will likely embrace multi-socket designs to increase the core count and memory bandwidth. However, maintaining the UMA behavior of a single GPU in multi-GPU systems without code rewriting remains a challenge. We investigate multi-socket NUMA GPU designs and show that significant changes are needed to both the GPU interconnect and its cache architecture to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies and minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2×, while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs, respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for performance scaling of future compute nodes used in HPC.
