Evaluation of low-power architectures in a scientific computing environment
HPC (High Performance Computing) represents, together with theory and experiments,
the third pillar of science. Through HPC, scientists can simulate phenomena
otherwise impossible to study. The need to perform larger and more accurate
simulations requires HPC to improve continuously.
HPC is constantly looking for new computational platforms that can improve cost
and power efficiency. The Mont-Blanc project is an EU-funded research project that
aims to study new hardware and software solutions that can improve the efficiency of
HPC systems. The vision of the project is to leverage the fast-growing market of
mobile devices to develop the next generation of supercomputers.
In this work we contribute to the objectives of the Mont-Blanc project by evaluating
the performance of production scientific applications on innovative low-power architectures.
To do so, we describe our experiences porting and evaluating state-of-the-art
scientific applications on the Mont-Blanc prototype, the first HPC system
built with commodity low-power embedded technology. We then extend our study
to compare off-the-shelf ARMv8 platforms. Finally, we discuss the most significant
issues encountered during the development of the Mont-Blanc prototype system.
Filling the gap between education and industry: evidence-based methods for introducing undergraduate students to HPC
Educational institutions in most cases provide a basic theoretical background covering several computational science topics; however, the High Performance Computing (HPC) and Parallel and Distributed Computing (PDC) markets require specialized technical profiles. Even the most skilled students are often not prepared to face production HPC applications of thousands of lines, complex computational frameworks from other disciplines, or heterogeneous multi-node machines accessed by hundreds of users. In this paper, we offer an educational package for filling this gap. Leveraging four years of experience with the Student Cluster Competition (SCC), we present our educational journey together with the lessons learned and the outcomes of our methodology. We show how, within a semester and on an affordable budget, a university can implement an educational package preparing students to start competitive professional careers. Our findings also highlight that 78% of the students exposed to our methods remain within HPC higher education, research, or industry.
The authors of this paper and the participants in the SCC have been supported by the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects, grant agreements n. 288777, 610402 and 671697; the HPC Advisory Council; the Facultat d’Informàtica de Barcelona – Universitat Politècnica de Catalunya; Arm Ltd.; Cavium Inc.; and E4 Computer Engineering. We warmly thank Luna Backes Drault for her unconditional dedication to the SCC cause in the early days, and the pizzeria 7bello in Frankfurt for always having a table and a smile for us.
Development of an oceanographic application in HPC
High Performance Computing (HPC) is used for running advanced application programs
efficiently, reliably, and quickly.
In earlier decades, performance analysis of HPC applications was based on speed,
thread scalability, and the memory hierarchy. Now, it is essential to also consider
the energy or power consumed by the system while executing an application.
In fact, High Power Consumption (HPC) is one of the biggest problems for the High
Performance Computing (HPC) community and one of the major obstacles to exascale
systems design.
The new generations of HPC systems aim to achieve exaflop performance and will
demand even more energy for processing and cooling. Nowadays, the growth of HPC
systems is limited by energy issues.
Recently, many research centers have focused on the automatic tuning of HPC
applications, which requires a broad study of HPC applications in terms of power
efficiency.
In this context, this work studies an oceanographic application, named OceanVar,
that implements the Domain Decomposition based 4D Variational model (DD-4DVar),
one of the most widely used HPC applications, evaluating not only the classic
aspects of performance but also aspects related to power efficiency across
different case studies.
This work was carried out at BSC (Barcelona Supercomputing Center), Spain, within the
Mont-Blanc project, performing the tests first on an HCA server with Intel technology and then on the Thunder mini-cluster with ARM technology.
This thesis initially explains the concept of data assimilation, the context in
which it is developed, and gives a brief description of the 4DVAR mathematical
model.
After this close examination of the problem, the data-assimilation problem was
ported from its Matlab description to a sequential version in the C language.
Secondly, after identifying the most time-consuming computational kernels, a
parallel version of the application was developed in a multiprocess programming
style, using the MPI (Message Passing Interface) standard.
The experimental results, in terms of performance, show that, when running on the
HCA server (an Intel architecture), the efficiency of the two most expensive
functions remains approximately 80% as the number of processes grows.
When running on the ARM architecture, specifically on the Thunder mini-cluster,
the trend obtained is instead a superlinear speedup which, in our case, can be
explained by a more efficient use of resources (cache memory accesses) compared
with the sequential case.
The second part of this work presents an analysis of some characteristics of this
application that impact energy efficiency.
After a brief discussion of the energy consumption characteristics of the Thunder
chip within the technological landscape, the energy consumption of the Thunder
mini-cluster was measured with a power consumption detector, the Yokogawa Power
Meter, in order to build an overview of the power-to-solution of this application,
to be used as a baseline for subsequent analyses with other parallel styles.
Finally, a comprehensive performance evaluation, aimed at estimating the quality of
the MPI parallelization, is conducted using Paraver, a performance analysis and
visualisation tool developed by BSC.
Paraver can be used to analyse MPI, threaded, or mixed-mode programmes and is key
to profiling parallel code and optimising it for High Performance Computing.
A set of graphical representations of its statistics makes it easy for a developer to
identify performance problems. Some of the problems that can be easily identified are
load-imbalanced decompositions, excessive communication overheads, and a poor
average rate of floating-point operations per second.
Paraver can also report statistics based on hardware counters provided by the
underlying hardware.
This project used Paraver configuration files to allow certain metrics to be
analysed for this application.
To explain the performance trend obtained on the Thunder mini-cluster, traces were
extracted from various case studies, and the results match expectations: a drastic
drop in cache misses from the ppn (processes per node) = 1 case to the ppn = 16
case.
This explains, at least in part, the more efficient use of cluster resources as the
number of processes increases.
Characterization of HPC applications for ARM SIMD instructions
Nowadays, most Instruction Set Architectures (ISAs) include Single Instruction, Multiple Data (SIMD) instructions to speed up High Performance Computing (HPC) applications. The first part of this work aims to characterize HPC applications optimized using the NEON extension, which is the current SIMD extension supported by ARMv8 processors. For this purpose, we have two high-end ARMv8 processors, ThunderX and ThunderX2, and two mainstream commercial ARMv8 compilers, GCC and the Arm HPC Compiler. With this setup we have characterized a collection of benchmarks extracted from the RAJAPerf suite together with the HACCKernels and HPCG applications. The characterization includes experimental work to obtain speed-up, scalability, energy-efficiency, and power-efficiency measurements for all benchmarks. Moreover, we have examined the assembly code to identify which optimizations are applied by each compiler and which characteristics make some benchmarks run faster or slower. The second part of this work focuses on the novel Scalable Vector Extension (SVE) specified in the ARMv8.2 ISA. This SIMD specification introduces a Vector-Length Agnostic (VLA) programming model, which lets processor vendors choose vector lengths scaling from 128 to 2048 bits when implementing their microarchitectures. To this day, no real processor implements this new ISA, so we have used the Arm Instruction Emulator (ArmIE), an emulation tool developed by Arm that allows SVE-compiled binaries to run on an ARMv8 processor. Our work analyzes how the compilers that support SVE (GCC and the Arm HPC Compiler) vectorize the benchmarks and assesses the quality of the generated assembly code. We also propose some low-level optimizations to improve code generation.
Performance and energy consumption of HPC workloads on a cluster based on Arm ThunderX2 CPU
In this paper, we analyze the performance and energy consumption of an
Arm-based high-performance computing (HPC) system developed within the European
project Mont-Blanc 3. This system, called Dibona, has been integrated by
ATOS/Bull, and it is powered by Marvell's latest CPU, the ThunderX2. This CPU
is the same one that powers the Astra supercomputer, the first Arm-based
supercomputer to enter the Top500, in November 2018. We study workloads ranging
from micro-benchmarks up to large production codes. We include an interdisciplinary
evaluation of three scientific applications (a finite-element fluid dynamics
code, a smoothed particle hydrodynamics code, and a lattice Boltzmann code) and
the Graph 500 benchmark, focusing on parallel and energy efficiency as well as
studying their scalability up to thousands of Armv8 cores. For comparison, we
run the same tests on state-of-the-art x86 nodes included in Dibona and the
Tier-0 supercomputer MareNostrum4. Our experiments show that the ThunderX2 has
25% lower performance on average, mainly due to its smaller vector unit, yet this
is somewhat compensated by its 30% wider links between the CPU and the main
memory. We found that the software ecosystem of the Armv8 architecture is
comparable to the one available for Intel. Our results also show that ThunderX2
delivers similar or better energy-to-solution and scalability, proving that
Arm-based chips are legitimate contenders in the market of next-generation HPC
systems.
Runtime Mechanisms to Survive New HPC Architectures: A Use-Case in Human Respiratory Simulations
Computational Fluid and Particle Dynamics (CFPD) simulations are of paramount importance for studying and improving drug effectiveness. The computational requirements of CFPD codes demand high-performance computing (HPC) resources. For these reasons, in this paper we introduce and evaluate system software techniques for improving performance and tolerating load imbalance in a state-of-the-art production CFPD code. We demonstrate the benefits of these techniques on Intel-, IBM-, and Arm-based HPC technologies ranked in the Top500 supercomputers, showing the importance of using mechanisms applied at runtime to improve performance independently of the underlying architecture. We run a real CFPD simulation of particle tracking in the human respiratory system, showing performance improvements of up to 2x across different architectures while applying runtime techniques and keeping the computational resources constant.
This work is partially supported by the Spanish Government (SEV-2015-0493), by the Spanish Ministry of Science and Technology project (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European Mont-Blanc projects (288777, 610402 and 671697).
Computational Fluid and Particle Dynamics Simulations for Respiratory System: Runtime Optimization on an Arm Cluster
Computational fluid and particle dynamics (CFPD) simulations are of paramount importance for studying and improving drug effectiveness. The computational requirements of CFPD codes involve high-performance computing (HPC) resources. For these reasons, in this paper we introduce and evaluate system software techniques for improving performance and tolerating load imbalance in a state-of-the-art production CFPD code. We demonstrate the benefits of these techniques on both Intel- and Arm-based HPC clusters, showing the importance of using mechanisms applied at runtime to improve performance independently of the underlying architecture. We run a real CFPD simulation of particle tracking in the human respiratory system, showing performance improvements of up to 2x while keeping the computational resources constant.
This work is partially supported by the Spanish Government (SEV-2015-0493), by the Spanish Ministry of Science and Technology project (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European Mont-Blanc projects (288777, 610402 and 671697).
Xar-Trek: Run-Time Execution Migration among FPGAs and Heterogeneous-ISA CPUs
Datacenter servers are increasingly heterogeneous: from x86 host CPUs, to ARM
or RISC-V CPUs in NICs/SSDs, to FPGAs. Previous works have demonstrated that
migrating application execution at run-time across heterogeneous-ISA CPUs can
yield significant performance and energy gains, with relatively little
programmer effort. However, FPGAs have often been overlooked in that context:
hardware acceleration using FPGAs involves statically implementing select
application functions, which prohibits dynamic and transparent migration. We
present Xar-Trek, a new compiler and run-time software framework that overcomes
this limitation. Xar-Trek compiles an application for several CPU ISAs and
select application functions for acceleration on an FPGA, allowing execution
migration between heterogeneous-ISA CPUs and FPGAs at run-time. Xar-Trek's
run-time monitors server workloads and migrates application functions to an
FPGA or to heterogeneous-ISA CPUs based on a scheduling policy. We develop a
heuristic policy that uses application workload profiles to make scheduling
decisions. Our evaluations conducted on a system with x86-64 server CPUs, ARM64
server CPUs, and an Alveo accelerator card reveal 88%-1% performance gains over
no-migration baselines.
Porting a Lattice Boltzmann Simulation to FPGAs Using OmpSs
Reconfigurable computing, exploiting Field Programmable Gate Arrays (FPGAs), has become of great interest for both academia and industry thanks to the possibility of greatly accelerating a variety of applications. Interest has been further boosted by recent developments in FPGA programming frameworks which allow applications to be designed at a higher level of abstraction, for example using directive-based approaches.
In this work we describe our first experiences in porting to FPGAs an HPC application used to simulate the Rayleigh-Taylor instability of fluids with different densities and temperatures using Lattice Boltzmann Methods. This activity is done in the context of the FET HPC H2020 EuroEXA project, which is developing an energy-efficient HPC system, at exascale level, based on Arm processors and FPGAs. In this work we use the OmpSs directive-based programming model, one of the models available within the EuroEXA project. OmpSs is developed by the Barcelona Supercomputing Center (BSC) and can target FPGA devices as accelerators, as well as commodity CPUs and GPUs, enabling code portability across different architectures. In particular, we describe the initial porting of this application, evaluating the programming effort required and assessing the preliminary performance on a Trenz development board hosting a Xilinx Zynq UltraScale+ MPSoC that embeds 16nm FinFET+ programmable logic and a multi-core Arm CPU.