21 research outputs found

    Evaluation of low-power architectures in a scientific computing environment

    HPC (High Performance Computing) represents, together with theory and experiments, the third pillar of science. Through HPC, scientists can simulate phenomena otherwise impossible to study. The need to perform larger and more accurate simulations requires HPC to improve every day, and HPC is constantly looking for new computational platforms that can improve cost and power efficiency. The Mont-Blanc project is an EU-funded research project that studies new hardware and software solutions that can improve the efficiency of HPC systems. The vision of the project is to leverage the fast-growing market of mobile devices to develop the next generation of supercomputers. In this work we contribute to the objectives of the Mont-Blanc project by evaluating the performance of production scientific applications on innovative low-power architectures. To do so, we describe our experience porting and evaluating state-of-the-art scientific applications on the Mont-Blanc prototype, the first HPC system built with commodity low-power embedded technology. We then extend our study to compare off-the-shelf ARMv8 platforms. We finally discuss the issues with the greatest impact encountered during the development of the Mont-Blanc prototype system.

    Filling the gap between education and industry: evidence-based methods for introducing undergraduate students to HPC

    Educational institutions in most cases provide a basic theoretical background covering several computational science topics; however, the High Performance Computing (HPC) and Parallel and Distributed Computing (PDC) markets require specialized technical profiles. Even the most skilled students are often not prepared to face production HPC applications of thousands of lines, complex computational frameworks from other disciplines, or heterogeneous multi-node machines accessed by hundreds of users. In this paper, we offer an educational package for filling this gap. Leveraging four years of experience with the Student Cluster Competition, we present our educational journey together with the lessons learned and the outcomes of our methodology. We show how, in the span of a semester and on an affordable budget, a university can implement an educational package that prepares students to start competitive professional careers. Our findings also highlight that 78% of the students exposed to our methods remain within HPC higher education, research, or industry. The authors of this paper and the participants in the SCC have been supported by the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects, grant agreements n. 288777, 610402 and 671697; the HPC Advisory Council; the Facultat d’Informàtica de Barcelona – Universitat Politècnica de Catalunya; Arm Ltd.; Cavium Inc.; and E4 Computer Engineering. We warmly thank Luna Backes Drault for her unconditional dedication to the SCC cause in the early days, and the pizzeria 7bello in Frankfurt for always having a table and a smile for us. Preprint.

    Development of an oceanographic application in HPC

    High Performance Computing (HPC) is used for running advanced application programs efficiently, reliably, and quickly. In earlier decades, performance analysis of HPC applications was based on speed, thread scalability, and the memory hierarchy. Now, it is essential to also consider the energy and power consumed by the system while executing an application. In fact, High Power Consumption (HPC) is one of the biggest problems for the High Performance Computing (HPC) community and one of the major obstacles to the design of exascale systems. The new generations of HPC systems intend to achieve exaflop performance and will demand even more energy for processing and cooling; nowadays, the growth of HPC systems is limited by energy issues. Recently, many research centers have focused their attention on automatic tuning of HPC applications, which requires a broad study of HPC applications in terms of power efficiency. In this context, this work studies an oceanographic application, named OceanVar, that implements a Domain Decomposition based 4D Variational model (DD-4DVar), one of the most commonly used HPC applications, evaluating not only the classic aspects of performance but also aspects related to power efficiency in different case studies. This work was carried out at BSC (Barcelona Supercomputing Center), Spain, within the Mont-Blanc project, performing the tests first on an HCA server with Intel technology and then on the Thunder mini-cluster with ARM technology. The thesis initially explains the concept of data assimilation, the context in which it is developed, and gives a brief description of the 4DVAR mathematical model. After a close examination of the problem, the Matlab description of the data-assimilation problem was ported to a sequential version in C. Secondly, after identifying the most time-consuming computational kernels, a parallel multiprocessor version of the application was developed using the MPI (Message Passing Interface) protocol. The experimental results show that, when running on the HCA server (an Intel architecture), the efficiency of the two most expensive functions remains approximately 80% as the number of processes grows. When running on the ARM architecture, specifically on the Thunder mini-cluster, the trend obtained is instead a "superlinear speedup", which in our case can be explained by a more efficient use of resources (cache memory accesses) compared with the sequential case. The second part of the work presents an analysis of some aspects of the application that affect energy efficiency. After a brief discussion of the energy-consumption characteristics of the Thunder chip in the technological landscape, energy-consumption values of the Thunder mini-cluster were measured with a power consumption detector, the Yokogawa Power Meter, in order to build an overview of the power-to-solution profile of the application, to be used as a baseline for subsequent analyses with other parallel programming styles. Finally, a comprehensive performance evaluation, aimed at assessing the quality of the MPI parallelization, is conducted using Paraver, a performance tool developed by BSC.
    Paraver is a performance analysis and visualisation tool that can be used to analyse MPI, threaded, or mixed-mode programmes, and it is key to profiling parallel executions and optimising code for High Performance Computing. A set of graphical representations of these statistics makes it easy for a developer to identify performance problems. Some of the problems that can be easily identified are load-imbalanced decompositions, excessive communication overheads, and a poor average number of floating-point operations per second. Paraver can also report statistics based on hardware counters, which are provided by the underlying hardware. This project used Paraver configuration files to analyse selected metrics for this application. To explain the performance trend observed on the Thunder mini-cluster, traces were extracted from various case studies, and the results matched expectations: a drastic drop in cache misses from the ppn (processes per node) = 1 case to the ppn = 16 case. This explains, at least in part, the more efficient use of cluster resources as the number of processes increases.
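    For reference, the speedup and parallel-efficiency figures quoted above follow the standard definitions (stated here for clarity; the formulas are not reproduced from the thesis itself), where T(1) is the sequential run time and T(p) the run time on p MPI processes:

        S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}

    An efficiency of roughly 80% therefore corresponds to S(p) ≈ 0.8p, while a superlinear speedup means S(p) > p, which typically occurs when the per-process working set shrinks enough to fit in cache, consistent with the drop in cache misses reported above.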

    Characterization of HPC applications for ARM SIMD instructions

    Nowadays, most Instruction Set Architectures (ISAs) include Single Instruction Multiple Data (SIMD) instructions to speed up High Performance Computing (HPC) applications. The first part of this work characterizes HPC applications optimized using NEON, the SIMD extension currently supported by ARMv8 processors. For this purpose, we use two high-end ARMv8 processors, ThunderX and ThunderX2, and two mainstream commercial ARMv8 compilers, GCC and the Arm HPC Compiler. With this setup we have characterized a collection of benchmarks extracted from the RAJAPerf suite as well as the HACCKernels and HPCG applications. The characterization includes experiments to obtain speed-up, scalability, energy-efficiency, and power-efficiency measurements for all benchmarks. Moreover, we have examined the assembly code to identify which optimizations each compiler applies and which characteristics make some experiments faster than others. The second part of this work focuses on the novel Scalable Vector Extension (SVE) specified in the ARMv8.2 ISA. This SIMD specification introduces a Vector-Length Agnostic (VLA) programming model, which lets processor designers choose vector lengths that scale from 128 to 2048 bits for their microarchitectures. To this day, no real processor implements this new ISA, so we have used the Arm Instruction Emulator (ArmIE), an emulation tool developed by Arm that allows SVE-compiled binaries to run on an ARMv8 processor. Our work analyzes how the compilers that support SVE (GCC and the Arm HPC Compiler) vectorize the benchmarks and assesses the quality of the generated assembly code. We also propose some low-level optimizations to improve code generation.
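    To make the Vector-Length Agnostic model concrete, the following is a minimal daxpy-style loop written with ACLE SVE intrinsics; it is an illustrative sketch, not a kernel taken from this work, and the function name daxpy_sve is ours. The loop never hard-codes the vector width: svcntd() reports how many 64-bit lanes the hardware provides and the svwhilelt predicate handles the loop tail, so the same code is valid for any SVE implementation from 128 to 2048 bits.

        /* Illustrative VLA sketch (not from the thesis): y[i] += a * x[i] */
        #include <arm_sve.h>
        #include <stdint.h>

        void daxpy_sve(int64_t n, double a, const double *x, double *y)
        {
            for (int64_t i = 0; i < n; i += svcntd()) {       /* 64-bit lanes per vector */
                svbool_t pg = svwhilelt_b64(i, n);            /* predicate covering the tail */
                svfloat64_t vx = svld1_f64(pg, &x[i]);
                svfloat64_t vy = svld1_f64(pg, &y[i]);
                vy = svmla_f64_x(pg, vy, vx, svdup_n_f64(a)); /* vy + vx * a */
                svst1_f64(pg, &y[i], vy);
            }
        }

    Built with GCC or the Arm HPC Compiler with SVE enabled (e.g. -march=armv8-a+sve) and executed under ArmIE, this is the kind of code whose generated assembly the study compares.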

    Performance and energy consumption of HPC workloads on a cluster based on Arm ThunderX2 CPU

    In this paper, we analyze the performance and energy consumption of an Arm-based high-performance computing (HPC) system developed within the European project Mont-Blanc 3. This system, called Dibona, has been integrated by ATOS/Bull, and it is powered by Marvell's latest CPU, the ThunderX2. This is the same CPU that powers the Astra supercomputer, the first Arm-based supercomputer to enter the Top500, in November 2018. We study workloads ranging from micro-benchmarks up to large production codes. We include an interdisciplinary evaluation of three scientific applications (a finite-element fluid dynamics code, a smoothed particle hydrodynamics code, and a lattice Boltzmann code) and the Graph 500 benchmark, focusing on parallel and energy efficiency as well as studying their scalability up to thousands of Armv8 cores. For comparison, we run the same tests on the state-of-the-art x86 nodes included in Dibona and on the Tier-0 supercomputer MareNostrum4. Our experiments show that the ThunderX2 has 25% lower performance on average, mainly due to its smaller vector unit, somewhat compensated by its 30% wider links between the CPU and the main memory. We found that the software ecosystem of the Armv8 architecture is comparable to the one available for Intel. Our results also show that the ThunderX2 delivers similar or better energy-to-solution and scalability, proving that Arm-based chips are legitimate contenders in the market of next-generation HPC systems.
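    For clarity, energy-to-solution as used in such comparisons is the energy drawn over the whole run, i.e. the integral of power over time, often approximated as average power times time-to-solution (a standard definition, not a formula from the paper):

        E_{\text{to-solution}} = \int_0^{T_{\text{sol}}} P(t)\,dt \;\approx\; \bar{P} \cdot T_{\text{sol}}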

    Runtime Mechanisms to Survive New HPC Architectures: A Use-Case in Human Respiratory Simulations

    Computational Fluid and Particle Dynamics (CFPD) simulations are of paramount importance for studying and improving drug effectiveness. The computational requirements of CFPD codes demand high-performance computing (HPC) resources. For these reasons, we introduce and evaluate in this paper system software techniques for improving performance and tolerating load imbalance in a state-of-the-art production CFPD code. We demonstrate the benefits of these techniques on Intel-, IBM-, and Arm-based HPC technologies ranked among the Top500 supercomputers, showing the importance of mechanisms applied at runtime to improve performance independently of the underlying architecture. We run a real CFPD simulation of particle tracking in the human respiratory system, showing performance improvements of up to 2x across different architectures when applying the runtime techniques, while keeping the computational resources constant. This work is partially supported by the Spanish Government (SEV-2015-0493), by the Spanish Ministry of Science and Technology project (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European Mont-Blanc projects (288777, 610402 and 671697). Peer reviewed. Preprint.
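    As a generic illustration of the kind of runtime-level load balancing discussed above (this is not the paper's actual mechanism or code; the type and function names below are hypothetical), a particle-tracking loop can tolerate per-particle cost variation by letting the runtime hand out work in small chunks rather than as a fixed static partition:

        /* Hypothetical sketch: dynamic scheduling absorbs load imbalance at runtime */
        #include <omp.h>

        typedef struct { double x, y, z, vx, vy, vz; } particle_t;   /* hypothetical type */

        static void advance_particle(particle_t *p, double dt)       /* hypothetical kernel */
        {
            p->x += p->vx * dt;
            p->y += p->vy * dt;
            p->z += p->vz * dt;
        }

        void track_particles(particle_t *particles, int n, double dt)
        {
            /* chunks of 64 particles are handed to threads as they become free,
               so expensive particles (e.g. near complex geometry) do not stall the rest */
            #pragma omp parallel for schedule(dynamic, 64)
            for (int i = 0; i < n; i++)
                advance_particle(&particles[i], dt);
        }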

    Computational Fluid and Particle Dynamics Simulations for Respiratory System: Runtime Optimization on an Arm Cluster

    Computational fluid and particle dynamics (CFPD) simulations are of paramount importance for studying and improving drug effectiveness. The computational requirements of CFPD codes involve high-performance computing (HPC) resources. For these reasons, we introduce and evaluate in this paper system software techniques for improving performance and tolerating load imbalance in a state-of-the-art production CFPD code. We demonstrate the benefits of these techniques on both Intel- and Arm-based HPC clusters, showing the importance of mechanisms applied at runtime to improve performance independently of the underlying architecture. We run a real CFPD simulation of particle tracking in the human respiratory system, showing performance improvements of up to 2x while keeping the computational resources constant. This work is partially supported by the Spanish Government (SEV-2015-0493), by the Spanish Ministry of Science and Technology project (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European Mont-Blanc projects (288777, 610402 and 671697). Peer reviewed. Postprint (author's final draft).

    Xar-Trek: Run-Time Execution Migration among FPGAs and Heterogeneous-ISA CPUs

    Datacenter servers are increasingly heterogeneous: from x86 host CPUs, to ARM or RISC-V CPUs in NICs/SSDs, to FPGAs. Previous works have demonstrated that migrating application execution at run-time across heterogeneous-ISA CPUs can yield significant performance and energy gains, with relatively little programmer effort. However, FPGAs have often been overlooked in that context: hardware acceleration using FPGAs involves statically implementing select application functions, which prohibits dynamic and transparent migration. We present Xar-Trek, a new compiler and run-time software framework that overcomes this limitation. Xar-Trek compiles an application for several CPU ISAs and select application functions for acceleration on an FPGA, allowing execution migration between heterogeneous-ISA CPUs and FPGAs at run-time. Xar-Trek's run-time system monitors server workloads and migrates application functions to an FPGA or to heterogeneous-ISA CPUs based on a scheduling policy. We develop a heuristic policy that uses application workload profiles to make scheduling decisions. Our evaluations, conducted on a system with x86-64 server CPUs, ARM64 server CPUs, and an Alveo accelerator card, reveal performance gains ranging from 1% to 88% over no-migration baselines.
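    The abstract does not detail the heuristic, so the following is only a hypothetical sketch of what a profile-driven placement decision could look like; the names, fields, and thresholds are invented for illustration and are not Xar-Trek's actual policy:

        /* Hypothetical illustration of profile-driven target selection */
        #include <stdbool.h>

        typedef enum { TARGET_X86_64, TARGET_ARM64, TARGET_FPGA } target_t;

        typedef struct {
            double fpga_speedup;   /* profiled speedup of the FPGA version vs. x86-64 */
            double arm_speedup;    /* profiled speedup of the ARM64 version vs. x86-64 */
        } profile_t;

        target_t choose_target(const profile_t *p, bool fpga_busy,
                               double x86_load, double arm_load)
        {
            if (!fpga_busy && p->fpga_speedup > 1.0)
                return TARGET_FPGA;                        /* offload when profitable and free */
            if (p->arm_speedup * (1.0 - arm_load) > (1.0 - x86_load))
                return TARGET_ARM64;                       /* otherwise pick the less loaded ISA */
            return TARGET_X86_64;
        }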

    Porting a Lattice Boltzmann Simulation to FPGAs Using OmpSs

    Reconfigurable computing, exploiting Field Programmable Gate Arrays (FPGAs), has become of great interest for both academia and industry thanks to the possibility of greatly accelerating a variety of applications. Interest has been further boosted by recent developments in FPGA programming frameworks which allow applications to be designed at a higher level of abstraction, for example using directive-based approaches. In this work we describe our first experiences in porting to FPGAs an HPC application used to simulate the Rayleigh-Taylor instability of fluids with different density and temperature using Lattice Boltzmann Methods. This activity is done in the context of the FET HPC H2020 EuroEXA project, which is developing an energy-efficient HPC system at exascale level based on Arm processors and FPGAs. In this work we use the OmpSs directive-based programming model, one of the models available within the EuroEXA project. OmpSs is developed by the Barcelona Supercomputing Center (BSC) and allows targeting FPGA devices as accelerators, as well as commodity CPUs and GPUs, enabling code portability across different architectures. In particular, we describe the initial porting of this application, evaluating the programming effort required and assessing the preliminary performance on a Trenz development board hosting a Xilinx Zynq UltraScale+ MPSoC, which embeds 16nm FinFET+ programmable logic and a multi-core Arm CPU.
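    As a flavour of the directive-based approach, the sketch below shows an OmpSs-style task annotated for FPGA offload; the directive spelling follows public OmpSs@FPGA documentation, but the kernel, array sizes, and clause details are simplified placeholders rather than code from the EuroEXA port:

        /* Simplified OmpSs-style sketch (not the actual EuroEXA port) */
        #define NCELLS 16384

        #pragma omp target device(fpga) copy_deps
        #pragma omp task in([NCELLS] f_in) out([NCELLS] f_out)
        void propagate(const double *f_in, double *f_out)
        {
            for (int i = 0; i < NCELLS; i++)    /* placeholder for the LBM streaming step */
                f_out[i] = f_in[i];
        }

        int main(void)
        {
            static double f_in[NCELLS], f_out[NCELLS];
            propagate(f_in, f_out);             /* the runtime offloads to the FPGA or runs on the CPU */
            #pragma omp taskwait
            return 0;
        }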