351 research outputs found

    Modeling Energy Consumption of High-Performance Applications on Heterogeneous Computing Platforms

    Achieving Exascale computing is one of the current leading challenges in High Performance Computing (HPC). Obtaining this next level of performance will allow more complex simulations to be run on larger datasets and offer researchers better tools for data processing and analysis. At the dawn of Big Data, the need for supercomputers will only increase. However, these systems are costly to maintain because power is expensive. Thus, a better understanding of power and energy consumption is required so that future hardware can benefit. Available power models accurately capture how power relates to the number of cores and the clock rate; however, the relationship between workload and power is less understood. Investigation and analysis of power measurements has therefore been a focal point of this work, with the aim of improving the general understanding of energy consumption in the context of HPC. This dissertation investigates the power and energy consumption of many different parallel applications on several hardware platforms while varying a number of execution characteristics. Multicore and manycore hardware devices are investigated in homogeneous and heterogeneous computing environments. Further, common techniques for reducing power and energy consumption are applied to each of these devices. Well-known power and performance models have been combined to form the Execution-Phase model, which may be used to quantify energy contributions based on execution phase and has been used to predict energy consumption to within 10%. However, due to limitations in the measurement procedure, a less intrusive approach is required. The Empirical Mode Decomposition (EMD) and Hilbert-Huang Transform analysis technique has been applied in innovative ways to model, analyze, and visualize power and energy measurements.
EMD is widely used in other research areas, including earthquake, brain-wave, speech recognition, and sea-level rise analysis, and this is the first time it has been applied to power traces to analyze the complex interactions occurring within HPC systems. Probability distributions may be used to represent power and energy traces, thereby providing an alternative means of predicting energy consumption while retaining the fact that power is not constant over time. Further, these distributions may be used to define the cost of a workload for a given computing platform.
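The abstract stresses that power is not constant over time, so energy has to be obtained by integrating a measured power trace rather than multiplying one average value by the runtime. A minimal sketch of that step follows, assuming a trace of timestamped power samples. The function name `energy_joules` and the sample values are hypothetical, and this is neither the Execution-Phase model nor the EMD analysis described in the dissertation.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate a sampled power trace with the trapezoidal rule.

    timestamps_s: sample times in seconds (increasing).
    power_w:      power readings in watts, one per timestamp.
    Returns the energy in joules.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    energy = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Area of the trapezoid between two consecutive samples.
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


# Hypothetical trace: four samples taken one second apart.
trace_t = [0.0, 1.0, 2.0, 3.0]
trace_p = [100.0, 120.0, 120.0, 100.0]
total = energy_joules(trace_t, trace_p)
```

Unevenly spaced samples are handled because each interval uses its own `dt`.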

    Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

    This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors, together with structural diversity among different sparse matrices, leads to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this end, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform, either through profiling or by matrix property inspection, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) sparse matrix storage format and has low preprocessing overheads, making our overall approach practical even in cases where fast decision making and optimization setup are required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library. Comment: 10 pages, 7 figures, ICPP 201
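For readers unfamiliar with the CSR format the abstract builds on, the following is a minimal sketch of the baseline SpMV kernel over CSR arrays. It is only the unoptimized reference loop, not the paper's optimizer. The names `row_ptr`, `col_idx` and `values` are conventional but assumed here, and the small matrix is made up for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr: len(rows) + 1 offsets; row r owns entries row_ptr[r]:row_ptr[r + 1].
    col_idx: column index of each stored nonzero.
    values:  value of each stored nonzero.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access into x: the source of irregular memory traffic.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

The indirect load `x[col_idx[k]]` and the varying row lengths are typical places where matrix structure affects performance, which is why a matrix-adaptive optimizer can pay off.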

    Proceedings, MSVSCC 2015

    The Virginia Modeling, Analysis and Simulation Center (VMASC) of Old Dominion University hosted the 2015 Modeling, Simulation, & Visualization Student Capstone Conference on April 16th. The Capstone Conference features students in Modeling and Simulation undergraduate and graduate degree programs, and in related fields, from many colleges and universities. Students present their research to an audience of fellow students, faculty, judges, and other distinguished guests. For the students, these presentations afford the opportunity to share their innovative research with members of the M&S community from academic, industry, and government backgrounds. Also participating in the conference are faculty and judges who have volunteered their time to directly support their students' research, facilitate the various conference tracks, serve as judges for each of the tracks, and provide overall assistance to the conference. 2015 marks the ninth year of the VMASC Capstone Conference for Modeling, Simulation and Visualization. This year our conference attracted a number of fine student-written papers and presentations, resulting in a total of 51 research works presented. This year's conference had record attendance thanks to the support of various departments at Old Dominion University, other local universities, and the United States Military Academy at West Point. We greatly appreciate all of the work and energy that went into this year's conference; it truly was a highly collaborative effort that resulted in a very successful symposium for the M&S community and all of those involved. Below you will find a brief summary of the best papers and best presentations, with some simple statistics on the overall conference contribution. Following that is a table of contents, broken down by conference track category, with a copy of each included body of work.
Thank you again for your time and your contribution, as this conference is designed to continuously evolve and adapt to better suit the authors and M&S supporters. Dr. Yuzhong Shen, Graduate Program Director, MSVE, Capstone Conference Chair; John Shull, Graduate Student, MSVE, Capstone Conference Student Chai

    Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures

    Efficient parallel implementation of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. It requires exploiting the data-parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of them employ irregular algorithms that exhibit data-dependent control flow and irregular memory accesses. Furthermore, these applications are often iterative with dependencies between steps, making it hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged-particle beam dynamics is one such application, where the distribution of work and the memory access pattern at each time step are irregular. Applications with these properties tend to present significant branch and memory divergence, load imbalance between different processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications has focused on optimizing the irregular, data-dependent memory accesses and control flow during a single step of the application, independently of the other steps, under the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-flow divergence and irregular memory accesses in one step is similar to that in the next step. It is possible to predict this structure in the current step by observing the computation structure of previous steps.
In this dissertation, we present novel machine-learning-based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-flow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged-particle beam dynamics as a motivating example throughout the dissertation to present our new approach, though the techniques should be equally applicable to a wide range of irregular applications. The machine learning approach presented here uses predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation in order to anticipate the future memory access pattern. Access pattern forecasts can then be used to formulate optimization decisions during application execution that improve the performance of the application at a future time step based on observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver good aggregate performance. We used these optimization techniques and this anticipation strategy to design a cache-aware, memory-efficient parallel algorithm that addresses the irregularities in the parallel implementation of charged-particle beam dynamics simulation on different HPC architectures. Experimental results using a diverse mix of HPC architectures show that our anticipation-based approach is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and improving resource utilization.

    Leveraging performance of 3D finite difference schemes in large scientific computing simulations

    Gone are the days when engineers and scientists conducted most of their experiments empirically. During those decades, actual tests were carried out in order to assess the robustness and reliability of forthcoming product designs and to prove theoretical models. With the advent of the computational era, scientific computing has definitely become a feasible solution compared with empirical methods, in terms of effort, cost and reliability. Large and massively parallel computational resources have reduced simulation execution times and have improved numerical results thanks to the refinement of the sampled domain. Several numerical methods coexist for solving Partial Differential Equations (PDEs). Methods such as the Finite Element (FE) and the Finite Volume (FV) are especially well suited to problems where unstructured meshes are frequent. Unfortunately, this flexibility does not come for free. These schemes entail higher memory latencies due to the handling of irregular data accesses. Conversely, the Finite Difference (FD) scheme has been shown to be an efficient solution for problems where structured meshes suit the domain requirements. Many scientific areas use this scheme due to its higher performance. This thesis focuses on improving FD schemes to leverage the performance of large scientific computing simulations. Different techniques are proposed, such as the Semi-stencil, a novel algorithm that increases the FLOP/Byte ratio for medium- and high-order stencil operators by reducing accesses and promoting data reuse. The algorithm is orthogonal and can be combined with techniques such as spatial- or time-blocking, adding further improvement. New trends in Symmetric Multi-Processing (SMP) systems, where tens of cores are replicated on the same die, pose new challenges due to the exacerbation of the memory wall problem.
In order to alleviate this issue, our research focuses on different strategies to reduce pressure on the cache hierarchy, particularly when different threads share resources due to Simultaneous Multi-Threading (SMT). Several domain decomposition schedulers for workload balance are introduced, ensuring quasi-optimal results without jeopardizing overall performance. We combine these schedulers with spatial-blocking and auto-tuning techniques, exploring the parametric space and reducing misses in the last-level cache. As an alternative to the brute-force methods used in auto-tuning, where a huge parametric space must be traversed to find a suboptimal candidate, performance models are a feasible solution. Performance models can predict performance on different architectures, selecting suboptimal parameters almost instantly. In this thesis, we devise a flexible and extensible performance model for stencils. The proposed model is capable of supporting multi- and many-core architectures, including complex features such as hardware prefetchers, SMT context and algorithmic optimizations. Our model can be used not only to forecast execution time, but also to make decisions about the best algorithmic parameters. Moreover, it can be included in run-time optimizers to decide the best SMT configuration based on the execution environment. Some industries rely heavily on FD-based techniques for their codes. Nevertheless, many cumbersome aspects arising in industry are still scarcely considered in academic research. In this regard, we have collaborated in the implementation of an FD framework that covers the most important features that an HPC industrial application must include. Some of the node-level optimization techniques devised in this thesis have been included in the framework in order to contribute to the overall application performance.
We show results for a couple of strategic applications in industry: an atmospheric transport model that simulates the dispersal of volcanic ash and a seismic imaging model used in the oil and gas industry to identify hydrocarbon-rich reservoirs. Gone are the days when engineers and scientists carried out their experiments empirically. During those decades, real tests were conducted to verify the robustness and reliability of forthcoming products and to prove theoretical models. With the arrival of the computational era, scientific computing has become a feasible solution compared with empirical methods, in terms of effort, cost and reliability. Supercomputers have reduced simulation times and improved numerical results thanks to the refinement of the domain. Several numerical methods coexist for solving Partial Differential Equations (PDEs). Methods such as Finite Elements (FE) and Finite Volumes (FV) are well suited to problems where unstructured meshes are frequent. Unfortunately, this flexibility does not come for free. These schemes entail higher latencies due to irregular data accesses. In contrast, the Finite Difference (FD) scheme has proven to be an efficient solution when structured meshes fit the requirements. This thesis focuses on improving FD schemes to boost the performance of simulations in scientific computing. Different techniques are proposed, such as the Semi-stencil, a new algorithm that increases the FLOP/Byte ratio for medium- and high-order stencil operators by reducing accesses and promoting data reuse. The algorithm is orthogonal and can be combined with techniques such as spatial- or time-blocking, adding further improvements.
New trends toward Symmetric Multi-Processing (SMP) systems, where tens of cores are replicated on the same processor, pose new challenges due to the exacerbation of the memory bandwidth problem. To mitigate this problem, our research focuses on strategies to reduce pressure on the cache hierarchy, particularly when several threads share resources due to Simultaneous Multi-Threading (SMT). We introduce several domain decomposition schedulers to balance the load, ensuring near-optimal results without jeopardizing overall performance. We combine these schedulers with spatial-blocking and auto-tuning techniques, exploring the parametric space and reducing misses in the last-level cache. As an alternative to the brute-force methods used in auto-tuning, where a parametric space must be traversed to find a candidate, performance models are a feasible solution. Performance models can predict performance on different architectures, selecting suboptimal parameters almost instantly. In this thesis, we devise a flexible and extensible performance model for stencils. The model is able to support multi-core architectures, including complex features such as prefetchers, SMT and algorithmic optimizations. Our model can be used not only to predict execution times, but also to make decisions about the best algorithmic parameters. Moreover, it can be included in run-time optimizers to decide the best SMT configuration. Some industries rely on FD techniques for their codes. However, not all aspects that arise in industry have been subjected to research. In this regard, we have designed and implemented from scratch an FD infrastructure that covers the most important features that an industrial application must include.
Some of the optimization techniques proposed in this thesis have been included to contribute to overall performance at the industrial level. We show results for a couple of strategic applications for industry: an atmospheric transport model that simulates the dispersal of volcanic ash and a seismic imaging model used in the oil and gas industry to identify hydrocarbon-rich reserves.
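To make the notion of a finite-difference stencil concrete, here is a minimal sketch of a classical 1D three-point stencil for the second derivative on a uniform grid. It is a textbook baseline, not the Semi-stencil algorithm proposed in the thesis, which targets 3D medium- and high-order operators. The function name `second_derivative` and the sample grid are hypothetical.

```python
def second_derivative(u, h):
    """Apply the 3-point central stencil (u[i-1] - 2*u[i] + u[i+1]) / h**2.

    u: samples of a function on a uniform grid with spacing h.
    Returns the approximation at interior points only (len(u) - 2 values).
    """
    if len(u) < 3:
        raise ValueError("need at least three grid points")
    inv_h2 = 1.0 / (h * h)
    return [
        (u[i - 1] - 2.0 * u[i] + u[i + 1]) * inv_h2
        for i in range(1, len(u) - 1)
    ]


# Hypothetical input: u(x) = x**2 sampled at x = 0, 1, 2, 3, 4 (h = 1).
# Its exact second derivative is 2 everywhere.
grid_u = [0.0, 1.0, 4.0, 9.0, 16.0]
d2u = second_derivative(grid_u, 1.0)
```

Each output point reads three neighbouring inputs. In higher-order 3D operators that neighbourhood grows along every axis, which is where memory traffic, and therefore the FLOP/Byte ratio mentioned in the abstract, becomes the limiting factor.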