14 research outputs found

    Performance Projections of HPC Applications on Chip Multiprocessor (CMP) Based Systems

    Performance projections of High Performance Computing (HPC) applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement and application refinement. In this dissertation, we present an efficient method to project the performance of HPC applications onto Chip Multiprocessor (CMP) based systems using widely available standard benchmark data. The main advantage of this method is its use of published data about the target machine; the target machine need not be available. With the current trend in HPC platforms shifting towards cluster systems with chip multiprocessors (CMPs), efficient and accurate performance projection becomes a challenging task. Typically, CMP-based systems are configured hierarchically, which significantly impacts the performance of HPC applications. The goal of this research is to develop an efficient method to project the performance of HPC applications onto systems that utilize CMPs. To provide for efficiency, our projection methodology is automated (projections are done using a tool) and fast (with small overhead). Our method, called the surrogate-based workload application projection method, utilizes surrogate benchmarks to project an HPC application's performance on target systems; the computation component of an HPC application is projected separately from the communication component. Our methodology was validated with high accuracy and efficiency on a variety of systems utilizing different processor and interconnect architectures. The average projection error on three target systems was 11.22 percent, with a standard deviation of 1.18 percent, for twelve HPC workloads.
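The separate projection of computation and communication described above can be sketched as follows (a minimal illustration; the function name and all numbers are hypothetical, not taken from the dissertation):

```python
# Illustrative sketch of surrogate-based projection (all figures hypothetical).
# Each component of the application's runtime on the base system is scaled
# by the ratio of surrogate benchmark times on the target vs. base machine,
# using only published benchmark data for the target.

def project_runtime(comp_base, comm_base,
                    surrogate_comp_base, surrogate_comp_target,
                    surrogate_comm_base, surrogate_comm_target):
    """Project an application's runtime on a target system from its measured
    runtime on a base system plus published surrogate benchmark times."""
    comp_scale = surrogate_comp_target / surrogate_comp_base  # time ratio
    comm_scale = surrogate_comm_target / surrogate_comm_base
    return comp_base * comp_scale + comm_base * comm_scale

# Base system: 80 s computation, 20 s communication.
# Surrogate benchmark times (seconds) on base and target systems.
projected = project_runtime(80.0, 20.0, 10.0, 6.0, 4.0, 3.0)
print(f"projected runtime: {projected:.1f} s")  # 80*0.6 + 20*0.75 = 63.0 s
```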

    Benchmarking the Amazon Elastic Compute Cloud (EC2)

    This project sought to determine whether Amazon EC2 is an economically viable environment for scientific research-oriented activities within a university setting. The methodology involved benchmarking the performance of the cloud servers against the best systems available at the WPI ECE department. Results indicate that the newest ECE server outperformed the best EC2 instance by approximately 25% in most cases. A comprehensive cost analysis suggests that EC2 instances can achieve up to 60% better cost-to-performance ratios in the short term when compared against the ECE servers. However, a long-term projected cost analysis determined that the overall cost of owning a large set of reserved instances comes out to almost 60% more than the cost of comparable in-house servers.
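A cost-to-performance comparison of this kind can be sketched in a few lines (all prices and benchmark scores below are hypothetical, not the actual WPI or EC2 figures from the study):

```python
# Hypothetical cost/performance comparison. Lower ratio = better value.

def cost_per_performance(hourly_cost, benchmark_score):
    """Dollars spent per unit of benchmark throughput (lower is better)."""
    return hourly_cost / benchmark_score

# Illustrative numbers: an EC2 instance vs. an amortised in-house server.
ec2      = cost_per_performance(hourly_cost=0.50, benchmark_score=80.0)
in_house = cost_per_performance(hourly_cost=1.00, benchmark_score=100.0)
print(f"EC2 is {in_house / ec2:.1f}x better on cost/performance")  # 1.6x here
```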

    Selecting multicore compute nodes for a shared-memory parallel application

    With the arrival of a wide variety of multicore architectures (NUMA, UMA), selecting the best compute-node configuration for a given shared-memory parallel application has become a major challenge. Our work addresses this issue by characterising both the compute nodes and the applications. Nodes are characterised by running small programs (microbenchmarks, μB) containing kernels of structures representative of the behaviour of shared-memory parallel programs. The μBs executed on each node provide performance profiles, i.e. measured behaviour data, which are stored in a database and used to estimate the behaviour of new applications. The application is executed on a base node to identify its representative phases. For each phase, performance information comparable to that of the μBs is extracted in order to characterise the phase. The database of performance profiles is then searched for μBs whose behaviour is similar to each application phase on the base node. Finally, the selected profiles, as executed on the other candidate nodes, are used to compare the performance of the compute nodes and to select the appropriate compute node for the application. Presented at the X Workshop Procesamiento Distribuido y Paralelo (WPDP), Red de Universidades con Carreras en Informática (RedUNCI).
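The profile-matching step described above can be sketched as a nearest-neighbour lookup (microbenchmark names, metric choices, and all values below are hypothetical illustrations, not the authors' actual data):

```python
# Illustrative sketch of matching an application phase to the most similar
# microbenchmark profile. Each profile is a small vector of performance
# metrics measured on the base node, e.g. (instructions per cycle,
# cache miss rate, fraction of peak memory bandwidth).
import math

microbenchmarks = {
    "stream_like":  (0.8, 0.15, 0.90),
    "compute_like": (2.1, 0.02, 0.10),
    "mixed":        (1.4, 0.08, 0.45),
}

def closest_microbenchmark(phase_profile):
    """Return the microbenchmark whose base-node profile is nearest
    (Euclidean distance) to the application phase's profile."""
    return min(microbenchmarks,
               key=lambda name: math.dist(microbenchmarks[name], phase_profile))

# A bandwidth-heavy phase maps onto the STREAM-like microbenchmark;
# its measured times on candidate nodes then stand in for this phase.
print(closest_microbenchmark((0.9, 0.12, 0.85)))  # stream_like
```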

    An Unstructured CFD Mini-Application for the Performance Prediction of a Production CFD Code

    Maintaining the performance of large scientific codes is a difficult task. To aid in this task, a number of mini-applications have been developed that are more tractable to analyze than large-scale production codes while retaining their performance characteristics. These “mini-apps” also enable faster hardware evaluation and, for sensitive commercial codes, allow evaluation of code and system changes outside of access-approval processes. In this paper, we develop MG-CFD, a mini-application that represents a geometric multigrid, unstructured computational fluid dynamics (CFD) code, designed to exhibit similar performance characteristics without sharing commercially sensitive code. We detail our experiences of developing this application using guidelines detailed in existing research, and contribute further to these. Our application is validated against the inviscid flux routine of HYDRA, a CFD code developed by Rolls-Royce plc for turbomachinery design. This paper (1) documents the development of MG-CFD, (2) introduces an associated performance model with which it is possible to assess the performance of HYDRA on new HPC architectures, and (3) demonstrates that it is possible to use MG-CFD and the performance models to predict the performance of HYDRA with a mean error of 9.2% for strong-scaling studies.

    Enabling the use of embedded and mobile technologies for high-performance computing

    In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transformation has been so effective that the November 2016 TOP500 list is still dominated by the x86 architecture. In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based Systems on Chips (SoCs). This suggests that once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC. This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through the development of real system prototypes and their performance analysis, we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of future mobile SoCs, we identify the missing features and suggest improvements that would enable the use of future mobile SoCs in an HPC environment. Thus, we present design guidelines for future generations of mobile SoCs, and for HPC systems built around them, enabling a new class of cheap supercomputers.

    ADEPT Runtime/Scalability Predictor in support of Adaptive Scheduling

    A job scheduler determines the order and duration of the allocation of resources, e.g. CPU, to the tasks waiting to run on a computer. Round-Robin and First-Come-First-Serve are examples of algorithms for making such resource-allocation decisions. Parallel job schedulers make resource-allocation decisions for applications that need multiple CPU cores, on computers consisting of many CPU cores connected by different interconnects. An adaptive parallel scheduler is one that can adjust its resource-allocation decisions based on current resource usage and demand. Adaptive parallel schedulers that decide the number of CPU cores to allocate to a parallel job provide more flexibility, and can significantly improve performance for both local and grid job scheduling, compared to non-adaptive schedulers. A major reason why adaptive schedulers are not yet used in practice is the lack of knowledge of applications' scalability curves and the high cost of existing white-box approaches to scalability prediction. We show that a runtime and scalability prediction tool can be developed to meet three requirements: accuracy comparable to white-box methods, applicability, and robustness. Applicability depends only on knowledge that is feasible to gain in a production environment. Robustness addresses anomalous behaviour and unreliable predictions. We present ADEPT, a speedup and runtime prediction tool that satisfies all three criteria, both for a single problem size and across different problem sizes of a parallel application. ADEPT is also capable of handling anomalies and judging the reliability of its predictions. We demonstrate these capabilities using experiments with MPI and OpenMP implementations of the NAS benchmarks and seven real applications.
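The kind of black-box scalability prediction described above can be sketched with a simple analytical model. The sketch below uses Amdahl's law as the speedup model purely for illustration (ADEPT's actual models and anomaly handling are more sophisticated, and all runtimes are hypothetical):

```python
# Minimal black-box runtime prediction sketch using Amdahl's law:
# T(p) = T1 * (s + (1 - s) / p), where s is the serial fraction.

def estimate_serial_fraction(t1, tp, p):
    """Solve T(p) = T1*(s + (1-s)/p) for s, given two measured runtimes."""
    return (tp / t1 - 1.0 / p) / (1.0 - 1.0 / p)

def predict_runtime(t1, s, p):
    """Predict the runtime on p cores from T1 and the serial fraction."""
    return t1 * (s + (1.0 - s) / p)

t1, t4 = 100.0, 32.5                       # measured on 1 and 4 cores
s = estimate_serial_fraction(t1, t4, 4)    # s ≈ 0.1
print(f"estimated serial fraction: {s:.2f}")
print(f"predicted runtime on 16 cores: {predict_runtime(t1, s, 16):.2f} s")
```

A scheduler armed with such a curve can judge whether giving a job more cores is worthwhile; here the predicted 16-core runtime (about 15.6 s) shows diminishing returns beyond a point.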

    Towards the use of mini-applications in performance prediction and optimisation of production codes

    Maintaining the performance of large scientific codes is a difficult task. To aid in this task, a number of mini-applications have been developed that are more tractable to analyse than large-scale production codes, while retaining their performance characteristics. These “mini-apps” also enable faster hardware evaluation and, for sensitive commercial codes, allow evaluation of code and system changes outside of access-approval processes. Techniques for validating the representativeness of a mini-application with respect to a target code are ultimately qualitative, requiring the researcher to decide whether the similarity is strong enough for the mini-application to be trusted to provide accurate predictions of the target's performance. Little consideration is given to the sensitivity of those predictions to the few differences between the mini-application and its target, or to how those potentially minor static differences may lead to each code responding very differently to a change in the computing environment. An existing mini-application, ‘Mini-HYDRA’, of a production CFD simulation code is reviewed. Arithmetic differences lead to divergence in intra-node performance scaling, so the developers had removed some arithmetic from Mini-HYDRA; but this breaks the simulation and so limits numerical research. This work restores the arithmetic and repeats the validation, achieving similar intra-node scaling performance whilst neither code is memory-bound. MPI strong-scaling functionality is also added, achieving very similar multi-node scaling performance. The arithmetic restoration inevitably leads to different memory bounds, and also to different and varied responses to changes in processor architecture or instruction set. A performance model is developed that predicts this difference in response in terms of the arithmetic differences. It is supplemented by a new benchmark that measures the memory bound of CFD loops. Together, they predict the strong-scaling performance of a production ‘target’ code with a mean error of 8.8% (s = 5.2%). Finally, the model is used to investigate limited speedup from vectorisation despite the code not being memory-bound. It identifies that instruction throughput is significantly reduced relative to serial counterparts, independent of data ordering in memory, indicating a bottleneck within the processor core.
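The memory-bound question at the heart of this analysis is commonly framed in roofline-model terms: a loop is memory-bound when its arithmetic intensity falls below the machine's balance point. The sketch below illustrates that classification (the flop counts, byte counts, and machine peaks are hypothetical, not values from the thesis):

```python
# Roofline-style classification sketch (all machine numbers hypothetical).
# A loop is memory-bound when its arithmetic intensity (flops per byte
# of memory traffic) is below the machine balance: peak_flops / peak_bw.

def is_memory_bound(flops, bytes_moved, peak_gflops, peak_gbs):
    """Classify a loop as memory-bound on a machine with the given peaks."""
    intensity = flops / bytes_moved           # flops per byte
    machine_balance = peak_gflops / peak_gbs  # flops per byte at the ridge
    return intensity < machine_balance

# A CFD flux loop doing 48 flops per 96 bytes of traffic, on a node with
# 500 GFLOP/s peak compute and 100 GB/s peak memory bandwidth:
print(is_memory_bound(48, 96, peak_gflops=500.0, peak_gbs=100.0))  # True
```

A loop that is *not* memory-bound yet fails to vectorise well, as in the final finding above, points instead to an in-core bottleneck such as reduced instruction throughput.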