12 research outputs found

    An Analysis of Variation Between Cores For Intel Xeon Phi Knights Corner And Xeon Phi Knights Landing

    As we move towards exascale computing, the efficiency of application performance and energy utilization must be optimized by redefining architectural features and application performance analysis. This research analyzes the per-core performance of 8 applications on Intel Xeon Phi Knights Corner (KNC) and Knights Landing (KNL) to determine whether performance variation between cores can lead to performance and energy improvements. Our results show that the KNC architecture's cores vary in performance, with inner cores performing faster as a result of memory characteristics and core utilization. They also show that cores 17, 34, and 51 on the KNL architecture perform consistently slower than other cores, while core 0 performs faster than, slower than, or within the average performance time of all the cores. A power-performance study was then done using different core configurations on the KNC. The results show that by targeting inner cores for applications that exhibit better inner-core performance, a maximum energy reduction of 16.4% compared to a configuration using all cores was possible with its optimal thread configuration. This energy reduction was achieved along with a 2% reduction in the fastest execution time of the same application. Our results also show how application characteristics lead to different core performance variation on the KNC and KNL Xeon Phi architectures.

    Many-core Branch-and-Bound for GPU accelerators and MIC coprocessors

    Coprocessors are increasingly becoming key building blocks of High Performance Computing platforms. These many-core, energy-efficient devices boost the performance of traditional processors. Branch-and-Bound (B&B) algorithms, meanwhile, are tree-based exact methods for solving combinatorial optimization problems (COPs) to optimality. Solving large COPs results in the generation of a very large pool of subproblems and the evaluation of their associated lower bounds. Generating and evaluating those subproblems on coprocessors raises several issues, including processor-coprocessor data transfer optimization, vectorization, and thread divergence. In this paper, we investigate the offload-based parallel design and implementation of B&B algorithms for coprocessors, addressing these issues. Two major many-core architectures are considered and compared: Nvidia GPU and Intel MIC. The proposed approaches have been evaluated using the Flow-Shop scheduling problem and two hardware configurations that are equivalent in terms of energy consumption: Nvidia Tesla K40 and Intel Xeon Phi 5110P. The reported results show that the GPU-accelerated approach outperforms the MIC offload-based one, even in its vectorized version. Moreover, vectorization improves the efficiency of the MIC offload-based approach by a factor of two.
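As a rough illustration of the subproblem generation and lower-bound evaluation this abstract refers to, here is a minimal sequential sketch of Branch-and-Bound for the permutation Flow-Shop problem. It is not the authors' implementation: it runs on a single CPU thread with no offloading or vectorization, and the one-machine lower bound, the function names (`machine_finish`, `lower_bound`, `solve`), and the `times[job][machine]` layout are all assumptions chosen for brevity. It assumes at least one job.

```python
from typing import FrozenSet, List, Sequence, Tuple

Times = Sequence[Sequence[int]]  # times[job][machine], hypothetical layout


def machine_finish(order: Sequence[int], times: Times) -> List[int]:
    """Finish time of each machine after scheduling `order` in sequence."""
    finish = [0] * len(times[0])
    for job in order:
        finish[0] += times[job][0]
        for k in range(1, len(finish)):
            # A job starts on machine k once machine k is free and the
            # job has left machine k-1.
            finish[k] = max(finish[k], finish[k - 1]) + times[job][k]
    return finish


def lower_bound(order: Sequence[int], remaining: FrozenSet[int], times: Times) -> int:
    """Simple one-machine bound (an assumed bound, not the paper's)."""
    finish = machine_finish(order, times)
    if not remaining:
        return finish[-1]  # complete schedule: the bound is the makespan
    bound = 0
    for k in range(len(finish)):
        load = sum(times[j][k] for j in remaining)  # work left on machine k
        tail = min(sum(times[j][k + 1:]) for j in remaining)  # shortest exit path
        bound = max(bound, finish[k] + load + tail)
    return bound


def solve(times: Times) -> Tuple[int, Tuple[int, ...]]:
    """Depth-first B&B; returns (best makespan, best job order)."""
    best_cost = float("inf")
    best_order: Tuple[int, ...] = ()
    stack = [((), frozenset(range(len(times))))]
    while stack:
        order, remaining = stack.pop()
        bound = lower_bound(order, remaining, times)
        if bound >= best_cost:
            continue  # prune: this subtree cannot improve on the incumbent
        if not remaining:
            best_cost, best_order = bound, order
            continue
        for job in remaining:  # branch: append each unscheduled job
            stack.append((order + (job,), remaining - {job}))
    return best_cost, best_order
```

In a coprocessor setting, the bounding step is the natural candidate for offloading, since many subproblems can be bounded independently; the sketch only shows the control flow that such a design would parallelize.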

    Modeling Energy Consumption of High-Performance Applications on Heterogeneous Computing Platforms

    Achieving Exascale computing is one of the current leading challenges in High Performance Computing (HPC). Obtaining this next level of performance will allow more complex simulations to be run on larger datasets and offer researchers better tools for data processing and analysis. In the dawn of Big Data, the need for supercomputers will only increase. However, these systems are costly to maintain because power is expensive. Thus, a better understanding of power and energy consumption is required so that future hardware can benefit. Available power models accurately capture the relationship of power to the number of cores and clock rate; however, the relationship between workload and power is less well understood. Thus, investigation and analysis of power measurements has been a focal point in this work, with the aim of improving the general understanding of energy consumption in the context of HPC. This dissertation investigates power and energy consumption of many different parallel applications on several hardware platforms while varying a number of execution characteristics. Multicore and manycore hardware devices are investigated in homogeneous and heterogeneous computing environments. Further, common techniques for reducing power and energy consumption are applied to each of these devices. Well-known power and performance models have been combined to form the Execution-Phase model, which may be used to quantify energy contributions based on execution phase and has been used to predict energy consumption to within 10%. However, due to limitations in the measurement procedure, a less intrusive approach is required. The Empirical Mode Decomposition (EMD) and Hilbert-Huang Transform analysis technique has been applied in innovative ways to model, analyze, and visualize power and energy measurements. 
EMD is widely used in other research areas, including earthquake, brain-wave, speech recognition, and sea-level rise analysis, and this is the first time it has been applied to power traces to analyze the complex interactions occurring within HPC systems. Probability distributions may be used to represent power and energy traces, thereby providing an alternative means of predicting energy consumption while retaining the fact that power is not constant over time. Further, these distributions may be used to define the cost of a workload for a given computing platform.

    Evaluating the performance of legacy applications on emerging parallel architectures

    The gap between a supercomputer's theoretical maximum ("peak") floating-point performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5–20% of any given machine's peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern "accelerator" architectures: collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant change from the low number of high-clocked cores found in traditional CPUs, and effective utilisation of accelerators typically requires extensive code and algorithmic changes. In many cases, the best way in which to map a parallel workload to these new architectures is unclear. The principal focus of the work presented in this thesis is the evaluation of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel MIC) for two benchmark codes, the LU benchmark from the NAS Parallel Benchmark Suite and Sandia's miniMD benchmark, which exhibit complex parallel behaviours that are representative of many scientific applications. Using combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we demonstrate performance improvements of up to 7x for these workloads. We also detail a code development methodology that permits application developers to target multiple architecture types without maintaining completely separate implementations for each platform. Using OpenCL, we develop performance-portable implementations of the LU and miniMD benchmarks that are faster than the original codes, and at most 2x slower than versions highly tuned for particular hardware. 
Finally, we demonstrate the importance of evaluating architectures at scale (as opposed to on single nodes) through performance modelling techniques, highlighting the problems associated with strong-scaling on emerging accelerator architectures

    The readying of applications for heterogeneous computing

    High performance computing is approaching a potentially significant change in architectural design. With pressures on the cost and sheer amount of power, additional architectural features are emerging which require a rethink of the programming models deployed over the last two decades. Today's emerging high performance computing (HPC) systems are maximising performance per unit of power consumed, with the result that the constituent parts of the system are made up of a range of different specialised building blocks, each with its own purpose. This heterogeneity is not limited to the hardware components but extends to the mechanisms that exploit them. These multiple levels of parallelism, instruction sets and memory hierarchies result in truly heterogeneous computing in all aspects of the global system. These emerging architectural solutions will require the software to exploit tremendous amounts of on-node parallelism, and programming models to address this are indeed emerging. In theory, the application developer can design new software using these models to exploit emerging low power architectures. However, in practice, real industrial scale applications last the lifetimes of many architectural generations and therefore require a migration path to these next generation supercomputing platforms. Identifying that migration path is non-trivial: with applications spanning many decades, consisting of many millions of lines of code and multiple scientific algorithms, any changes to the programming model will be extensive and invasive and may turn out to be the incorrect model for the application in question. This makes exploration of these emerging architectures and programming models using the applications themselves problematic. Additionally, the source code of many industrial applications is not available, due to either commercial or security sensitivity constraints. 
This thesis highlights this problem by assessing current and emerging hardware with an industrial strength code, and demonstrating the issues described. In turn, it looks at the methodology of using proxy applications in place of real industry applications to assess their suitability on the next generation of low power HPC offerings. It shows there are significant benefits to be realised in using proxy applications, in that fundamental issues inhibiting exploration of a particular architecture are easier to identify and hence address. The maturity and performance portability of a number of alternative programming methodologies are evaluated on a number of architectures, and the broader adoption of these proxy applications is highlighted, both within the author's own organisation and across the industry as a whole.

    Leveraging performance of 3D finite difference schemes in large scientific computing simulations

    Gone are the days when engineers and scientists conducted most of their experiments empirically. During those decades, actual tests were carried out in order to assess the robustness and reliability of forthcoming product designs and prove theoretical models. With the advent of the computational era, scientific computing has definitely become a feasible alternative to empirical methods in terms of effort, cost and reliability. Large and massively parallel computational resources have reduced simulation execution times and have improved numerical results through refinement of the sampled domain. Several numerical methods coexist for solving Partial Differential Equations (PDEs). Methods such as the Finite Element (FE) and the Finite Volume (FV) are especially well suited for dealing with problems where unstructured meshes are frequent. Unfortunately, this flexibility does not come for free. These schemes entail higher memory latencies due to the handling of irregular data accesses. Conversely, the Finite Difference (FD) scheme has been shown to be an efficient solution for problems where structured meshes suit the domain requirements. Many scientific areas use this scheme due to its higher performance. This thesis focuses on improving FD schemes to leverage the performance of large scientific computing simulations. Different techniques are proposed, such as the Semi-stencil, a novel algorithm that increases the FLOP/Byte ratio for medium- and high-order stencil operators by reducing the accesses and promoting data reuse. The algorithm is orthogonal and can be combined with techniques such as spatial- or time-blocking, adding further improvement. New trends in Symmetric Multi-Processing (SMP) systems, where tens of cores are replicated on the same die, pose new challenges due to the exacerbation of the memory wall problem. 
In order to alleviate this issue, our research is focused on different strategies to reduce pressure on the cache hierarchy, particularly when different threads are sharing resources due to Simultaneous Multi-Threading (SMT). Several domain decomposition schedulers for workload balance are introduced, ensuring quasi-optimal results without jeopardizing the overall performance. We combine these schedulers with spatial-blocking and auto-tuning techniques, exploring the parametric space and reducing misses in the last-level cache. As an alternative to the brute-force methods used in auto-tuning, where a huge parametric space must be traversed to find a suboptimal candidate, performance models are a feasible solution. Performance models can predict the performance on different architectures, selecting suboptimal parameters almost instantly. In this thesis, we devise a flexible and extensible performance model for stencils. The proposed model is capable of supporting multi- and many-core architectures, including complex features such as hardware prefetchers, SMT context and algorithmic optimizations. Our model can be used not only to forecast execution time, but also to make decisions about the best algorithmic parameters. Moreover, it can be included in run-time optimizers to decide the best SMT configuration based on the execution environment. Some industries rely heavily on FD-based techniques for their codes. Nevertheless, many cumbersome aspects arising in industry are still scarcely considered in academic research. In this regard, we have collaborated in the implementation of an FD framework which covers the most important features that an HPC industrial application must include. Some of the node-level optimization techniques devised in this thesis have been included in the framework in order to contribute to the overall application performance. 
We show results for a couple of strategic applications in industry: an atmospheric transport model that simulates the dispersal of volcanic ash and a seismic imaging model used in the Oil & Gas industry to identify hydrocarbon-rich reservoirs.
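To make the spatial-blocking technique named in this abstract more concrete, the following is a minimal sketch of a finite-difference sweep with and without loop tiling. It is only an illustration under stated assumptions: it uses a 2D 5-point averaging stencil rather than the 3D high-order operators the thesis targets, it does not implement the Semi-stencil algorithm, and the function names and default block sizes are hypothetical. Pure Python will not show the cache benefit; the point is the loop structure, which produces the same result in a different traversal order.

```python
from typing import List

Grid = List[List[float]]


def stencil_step(grid: Grid) -> Grid:
    """One sweep of a 5-point averaging stencil over the interior points."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def stencil_step_blocked(grid: Grid, bi: int = 4, bj: int = 4) -> Grid:
    """Same sweep, traversed tile by tile (spatial blocking).

    Each bi x bj tile is finished before moving on, so in a compiled
    implementation the neighbouring rows it touches are more likely to
    still be in cache when they are reused.
    """
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for ii in range(1, n - 1, bi):          # tile origin, rows
        for jj in range(1, m - 1, bj):      # tile origin, columns
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, m - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out
```

The block sizes `bi` and `bj` are exactly the kind of parameters that an auto-tuner or a performance model would have to choose for a given cache hierarchy.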

    Aceleración de una herramienta para la predicción de energía solar mediante arquitecturas masivamente paralelas (Accelerating a solar energy prediction tool using massively parallel architectures)

    Over the last decade, Uruguay has begun to incorporate wind and solar energy strongly into its energy matrix. Including this type of energy source to supply the power grid poses a significant challenge when managing its use, mainly because of its fluctuating output. Considering this situation, and in order to simplify the load dispatch task (which efficiently manages the energy resources in the matrix), the Facultad de Ingeniería has developed a tool capable of predicting the generation of photovoltaic solar energy in the country over a 96-hour time horizon. One of the main drawbacks of this tool is its high computational cost, which results in restrictive runtimes. This thesis studies the tool, focusing especially on its most resource- and time-consuming component, the numerical weather prediction model Weather Research and Forecasting (WRF). In the first stage of this work, the runtime of the model is analyzed, concluding that one of the most expensive steps is the solar radiation computation, due, among other things, to the numerical precision required in these calculations. Starting from this situation, this work proposes a new asynchronous software architecture that decouples the solar radiation computation and calculates it in parallel with the remaining atmospheric properties in WRF, following a pipeline parallelism pattern. Additionally, a portion of the radiation calculation is ported to a massively parallel coprocessor, specifically a GPU (Graphics Processing Unit) and/or a Xeon Phi processor, in order to decrease the computational load on the CPU. Experimental evaluation of this proposal in a scenario of twelve photovoltaic plants in Uruguayan territory shows that the asynchronous architecture decreases the runtime of the original model by approximately 10% on multicore machines with a large number of cores. Furthermore, extending this architecture makes it possible to successfully incorporate the computing capacity of a coprocessor (GPU or Xeon Phi), reaching improvements of between 25% and 30% in the total execution time of the model when both strategies are combined (asynchronism and use of secondary computing devices).