1,847 research outputs found

    A Library for Pattern-based Sparse Matrix Vector Multiply

    Get PDF
    Pattern-based Representation (PBR) is a novel approach to improving the performance of Sparse Matrix-Vector Multiply (SMVM) numerical kernels. Motivated by our observation that many matrices can be divided into blocks that share a small number of distinct patterns, we generate custom multiplication kernels for frequently recurring block patterns. The resulting reduction in index overhead significantly reduces memory bandwidth requirements and improves performance. Unlike existing methods, PBR requires neither detection of dense blocks nor zero filling, making it particularly advantageous for matrices that lack dense nonzero concentrations. SMVM kernels for PBR can benefit from explicit prefetching and vectorization, and are amenable to parallelization. The analysis and format conversion to PBR is implemented as a library, making it suitable for applications that generate matrices dynamically at runtime. We present sequential and parallel performance results for PBR on two current multicore architectures, which show that PBR outperforms available alternatives for the matrices to which it is applicable, and that the analysis and conversion overhead is amortized in realistic application scenarios

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Get PDF
    The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    Multicore architecture optimizations for HPC applications

    Get PDF
    From single-core CPUs to detachable compute accelerators, supercomputers made a tremendous progress by using available transistors on chip and specializing hardware for a given type of computation. Today, compute nodes used in HPC employ multi-core CPUs tailored for serial execution and multiple accelerators (many-core devices or GPUs) for throughput computing. However, designing next-generation HPC system requires not only the performance improvement but also better energy efficiency. Current trend of reaching exascale level of computation asks for at least an order of magnitude increase in both of these metrics. This thesis explores HPC-specific optimizations in order to make better utilization of the available transistors and to improve performance by transparently executing parallel code across multiple GPU accelerators. First, we analyze several HPC benchmark suites, compare them against typical desktop applications, and identify the differences which advocate for proper core tailoring. Moreover, within the HPC applications, we evaluate serial and parallel code sections separately, resulting in an Asymmetric Chip Multiprocessor (ACMP) design with one core optimized for single-thread performance and many lean cores for parallel execution. Our results presented here suggests downsizing of core front-end structures providing an HPC-tailored lean core which saves 16% of the core area and 7% of power, without performance loss. Further improving an ACMP design, we identify that multiple lean cores run the same code during parallel regions. This motivated us to evaluate the idea where lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections presented in this thesis show additional 11% area savings with a 5% energy reduction at no performance cost. These area and power savings might be attractive for many-core accelerators either for increasing the performance per area and power unit, or adding additional cores and thus improving the performance for the same hardware budget. Finally, in this thesis we study the effects of future NUMA accelerators comprised of multiple GPU devices. Reaching the limits of a single-GPU die size, next-generation GPU compute accelerators will likely embrace multi-socket designs increasing the core count and memory bandwidth. However, maintaining the UMA behavior of a single-GPU in multi-GPU systems without code rewriting stands as a challenge. We investigate multi-socket NUMA GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 sockets designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for performance scaling of future compute nodes used in HPC.Empezando por CPUs de un solo procesador, y pasando por aceleradores discretos, los supercomputadores han avanzado enormemente utilizando todos los transistores disponibles en el chip, y especializando los diseños para cada tipo de cálculo. Actualmente, los nodos de cálculo de un sistema de Computación de Altas Prestaciones (CAP) utilizan CPUs de múltiples procesadores, optimizados para el cálculo serial de instrucciones, y múltiples aceleradores (aceleradores gráficos, o many-core), optimizados para el cálculo paralelo. El diseño de un sistema CAP de nueva generación requiere no solo mejorar el rendimiento de cálculo, sino también mejorar la eficiencia energética. La siguiente generación de sistemas requiere mejorar un orden de magnitud en ambas métricas simultáneamente. Esta tesis doctoral explora optimizaciones específicas para sistemas CAP para hacer un mejor uso de los transistores, y para mejorar las prestaciones de forma transparente ejecutando las aplicaciones en múltiples aceleradores en paralelo. Primero, analizamos varios conjuntos de aplicaciones CAP, y las comparamos con aplicaciones para servidores y escritorio, identificando las principales diferencias que nos indican cómo ajustar la arquitectura para CAP. En las aplicaciones CAP, también analizamos la parte secuencial del código y la parte paralela de forma separada, . El resultado de este análisis nos lleva a proponer una arquitectura multiprocesador asimétrica (ACMP) , con un procesador optimizado para el código secuencial, y múltiples procesadores, más pequeños, optimizados para el procesamiento paralelo. Nuestros resultados muestran que reducir el tamaño de las estructuras del front-end (fetch, y predicción de saltos) en los procesadores paralelos nos proporciona un 16% extra de área en el chip, y una reducción de consumo del 7%. Como mejora a nuestra arquitectura ACMP, proponemos explotar el hecho de que todos los procesadores paralelos ejecutan el mismo código al mismo tiempo. Evaluamos una propuesta en que los procesadores paralelos comparten la caché de instrucciones, con la intención de que uno de ellos precargue las instrucciones para los demás procesadores (prefetching), sin aumentar la latencia media de acceso. Nuestra exploración de los distintos parámetros determina que el punto óptimo requiere una interconexión de alto ancho de banda para acceder a la caché compartida, y el uso de unos pocos line buffers para mantener el ancho de banda y la latencia necesarios. Nuestras proyecciones muestran un ahorro adicional del 11% en área y el 5% en energía, sin impacto en el rendimiento. Estos ahorros de área y energía permiten a un multiprocesador incrementar la eficiencia energética, o aumentar el rendimiento añadiendo procesador adicionales. Por último, estudiamos el efecto de usar múltiples aceleradores (GPU) en una arquitectura con tiempo de acceso a memoria no uniforme (NUMA). Una vez alcanzado el límite de número de transistores y tamaño máximo por chip, la siguiente generación de aceleradores deberá utilizar múltiples chips para aumentar el número de procesadores y el ancho de banda de acceso a memoria. Sin embargo, es muy difícil mantener la ilusión de un tiempo de acceso a memoria uniforme en un sistema multi-GPU sin reescribir el código de la aplicación. Nuestra investigación sobre sistemas multi-GPU muestra retos significativos en el diseño de la interconexión entre las GPU y la jerarquía de memorias cache. Nuestros resultados muestran que se puede explotar el comportamiento en fases de las aplicaciones para optimizar la configuración de la interconexión y las cachés de forma dinámica, minimizando el impacto de la arquitectura NUMA. Nuestro diseño mejora el rendimiento de un sistema con una única GPU en 1.5x, 2.3x y 3.2x (el 89%, 84%, y 76% del máximo teórico) usando 2, 4, y 8 GPUs en paralelo. Siendo su implementación posible hoy en dia, los nodos de cálculo con múltiples aceleradores son una alternativa atractiva para futuros sistemas CAP

    Multicore architecture optimizations for HPC applications

    Get PDF
    From single-core CPUs to detachable compute accelerators, supercomputers made a tremendous progress by using available transistors on chip and specializing hardware for a given type of computation. Today, compute nodes used in HPC employ multi-core CPUs tailored for serial execution and multiple accelerators (many-core devices or GPUs) for throughput computing. However, designing next-generation HPC system requires not only the performance improvement but also better energy efficiency. Current trend of reaching exascale level of computation asks for at least an order of magnitude increase in both of these metrics. This thesis explores HPC-specific optimizations in order to make better utilization of the available transistors and to improve performance by transparently executing parallel code across multiple GPU accelerators. First, we analyze several HPC benchmark suites, compare them against typical desktop applications, and identify the differences which advocate for proper core tailoring. Moreover, within the HPC applications, we evaluate serial and parallel code sections separately, resulting in an Asymmetric Chip Multiprocessor (ACMP) design with one core optimized for single-thread performance and many lean cores for parallel execution. Our results presented here suggests downsizing of core front-end structures providing an HPC-tailored lean core which saves 16% of the core area and 7% of power, without performance loss. Further improving an ACMP design, we identify that multiple lean cores run the same code during parallel regions. This motivated us to evaluate the idea where lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections presented in this thesis show additional 11% area savings with a 5% energy reduction at no performance cost. These area and power savings might be attractive for many-core accelerators either for increasing the performance per area and power unit, or adding additional cores and thus improving the performance for the same hardware budget. Finally, in this thesis we study the effects of future NUMA accelerators comprised of multiple GPU devices. Reaching the limits of a single-GPU die size, next-generation GPU compute accelerators will likely embrace multi-socket designs increasing the core count and memory bandwidth. However, maintaining the UMA behavior of a single-GPU in multi-GPU systems without code rewriting stands as a challenge. We investigate multi-socket NUMA GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 sockets designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for performance scaling of future compute nodes used in HPC.Empezando por CPUs de un solo procesador, y pasando por aceleradores discretos, los supercomputadores han avanzado enormemente utilizando todos los transistores disponibles en el chip, y especializando los diseños para cada tipo de cálculo. Actualmente, los nodos de cálculo de un sistema de Computación de Altas Prestaciones (CAP) utilizan CPUs de múltiples procesadores, optimizados para el cálculo serial de instrucciones, y múltiples aceleradores (aceleradores gráficos, o many-core), optimizados para el cálculo paralelo. El diseño de un sistema CAP de nueva generación requiere no solo mejorar el rendimiento de cálculo, sino también mejorar la eficiencia energética. La siguiente generación de sistemas requiere mejorar un orden de magnitud en ambas métricas simultáneamente. Esta tesis doctoral explora optimizaciones específicas para sistemas CAP para hacer un mejor uso de los transistores, y para mejorar las prestaciones de forma transparente ejecutando las aplicaciones en múltiples aceleradores en paralelo. Primero, analizamos varios conjuntos de aplicaciones CAP, y las comparamos con aplicaciones para servidores y escritorio, identificando las principales diferencias que nos indican cómo ajustar la arquitectura para CAP. En las aplicaciones CAP, también analizamos la parte secuencial del código y la parte paralela de forma separada, . El resultado de este análisis nos lleva a proponer una arquitectura multiprocesador asimétrica (ACMP) , con un procesador optimizado para el código secuencial, y múltiples procesadores, más pequeños, optimizados para el procesamiento paralelo. Nuestros resultados muestran que reducir el tamaño de las estructuras del front-end (fetch, y predicción de saltos) en los procesadores paralelos nos proporciona un 16% extra de área en el chip, y una reducción de consumo del 7%. Como mejora a nuestra arquitectura ACMP, proponemos explotar el hecho de que todos los procesadores paralelos ejecutan el mismo código al mismo tiempo. Evaluamos una propuesta en que los procesadores paralelos comparten la caché de instrucciones, con la intención de que uno de ellos precargue las instrucciones para los demás procesadores (prefetching), sin aumentar la latencia media de acceso. Nuestra exploración de los distintos parámetros determina que el punto óptimo requiere una interconexión de alto ancho de banda para acceder a la caché compartida, y el uso de unos pocos line buffers para mantener el ancho de banda y la latencia necesarios. Nuestras proyecciones muestran un ahorro adicional del 11% en área y el 5% en energía, sin impacto en el rendimiento. Estos ahorros de área y energía permiten a un multiprocesador incrementar la eficiencia energética, o aumentar el rendimiento añadiendo procesador adicionales. Por último, estudiamos el efecto de usar múltiples aceleradores (GPU) en una arquitectura con tiempo de acceso a memoria no uniforme (NUMA). Una vez alcanzado el límite de número de transistores y tamaño máximo por chip, la siguiente generación de aceleradores deberá utilizar múltiples chips para aumentar el número de procesadores y el ancho de banda de acceso a memoria. Sin embargo, es muy difícil mantener la ilusión de un tiempo de acceso a memoria uniforme en un sistema multi-GPU sin reescribir el código de la aplicación. Nuestra investigación sobre sistemas multi-GPU muestra retos significativos en el diseño de la interconexión entre las GPU y la jerarquía de memorias cache. Nuestros resultados muestran que se puede explotar el comportamiento en fases de las aplicaciones para optimizar la configuración de la interconexión y las cachés de forma dinámica, minimizando el impacto de la arquitectura NUMA. Nuestro diseño mejora el rendimiento de un sistema con una única GPU en 1.5x, 2.3x y 3.2x (el 89%, 84%, y 76% del máximo teórico) usando 2, 4, y 8 GPUs en paralelo. Siendo su implementación posible hoy en dia, los nodos de cálculo con múltiples aceleradores son una alternativa atractiva para futuros sistemas CAP.Postprint (published version

    A Survey on Thread-Level Speculation Techniques

    Get PDF
    Producción CientíficaThread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior, compile-time-dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field.MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H5 network (TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)

    Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures

    Get PDF
    Efficient parallel implementations of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. This requires - exploiting the data parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly-structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of these applications employ irregular algorithms which exhibit data-dependent control-flow and irregular memory accesses. Furthermore, these applications are often iterative with dependency between steps, and thus making it hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged particles beam dynamics is one such application where the distribution of work and memory access pattern at each time step is irregular. Applications with these properties tend to present significant branch and memory divergence, load imbalance between different processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications have been focused around optimizing the irregular, data-dependent memory accesses and control-flow during a single step of the application independent of the other steps, with the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-flow divergence and irregular memory accesses in one step is similar to that in the next step. It is possible to predict this structure in the current step by observing the computation structure of previous steps. In this dissertation, we present novel machine learning based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-flow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged particles beam dynamics simulation as a motivating example throughout the dissertation to present our new approach, though they should be equally applicable to a wide range of irregular applications. The machine learning approach presented here use predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation to anticipate the future memory access pattern. Access pattern forecasts can then be used to formulate optimization decisions during application execution which improves the performance of the application at a future time step based on the observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver a good aggregate performance. We used these optimization techniques and anticipation strategy to design a cache-aware, memory efficient parallel algorithm to address the irregularities in the parallel implementation of charged particles beam dynamics simulation on different HPC architectures. Experimental result using a diverse mix of HPC architectures shows that our approach in using anticipation strategy is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and in improving resource utilization
    corecore