Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing
We investigate the implementation of lattice Quantum Chromodynamics (QCD) code on
the Intel Xeon Phi Knights Landing (KNL). The most time-consuming part of
numerical lattice QCD simulations is the solution of linear equations involving a large
sparse matrix that represents the strong interaction among quarks. To establish
widely applicable prescriptions, we apply rather general methods suited to the SIMD
architecture of KNL, such as intrinsics and manual prefetching, to the
matrix multiplication and iterative solver algorithms. Based on the performance
measured on the Oakforest-PACS system, we discuss the performance tuning on KNL
as well as the code design for facilitating such tuning on SIMD architecture
and massively parallel machines.
Comment: 8 pages, 12 figures. Talk given at LHAM'17 "5th International
Workshop on Legacy HPC Application Migration" in CANDAR'17 "The Fifth
International Symposium on Computing and Networking", and to appear in the
proceedings.
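The solver referred to above is typically a Krylov-subspace method applied to the lattice Dirac operator. As an illustration of the algorithmic core only, and not of the paper's actual KNL-optimized implementation, a minimal conjugate-gradient sketch on a toy sparse symmetric positive-definite system might look like:

```python
def spmv(A, x):
    # Sparse matrix-vector product; A maps row index -> {column: value}.
    y = [0.0] * len(x)
    for i, row in A.items():
        y[i] = sum(v * x[j] for j, v in row.items())
    return y

def cg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient for a symmetric positive-definite sparse A."""
    x = [0.0] * len(b)
    r = list(b)                      # residual r = b - A x with x = 0
    p = list(r)                      # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = spmv(A, p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Toy system: 1-D Laplacian (tridiagonal 2, -1); the exact solution is [1, 1, 1].
A = {0: {0: 2.0, 1: -1.0},
     1: {0: -1.0, 1: 2.0, 2: -1.0},
     2: {1: -1.0, 2: 2.0}}
x = cg(A, [1.0, 0.0, 1.0])
```

In production lattice QCD code the matrix-vector product (`spmv` here) is the hot spot where the SIMD intrinsics and manual prefetching discussed in the paper would be applied.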
Porting and tuning of the Mont-Blanc benchmarks to the multicore ARM 64bit architecture
This project is about porting and tuning the Mont-Blanc benchmarks to the multicore ARM
64 bits architecture. The Mont-Blanc benchmarks are part of the Mont-Blanc European
project and they have been developed internally in the BSC (Barcelona Supercomputing
Center).
The project will explore the possibilities that an ARM architecture can offer in an
HPC (High Performance Computing) setup; this includes learning how to tune and adapt a
parallelized computer program and how to analyze its execution behavior.
As part of the project, we will analyze the performance of each benchmark using instrumentation
tools such as Extrae and Paraver. Each benchmark will be adapted, tuned, and
executed mainly on the three new Mont-Blanc mini-clusters, Thunder (ARMv8 custom),
Merlin (ARMv8 custom), and Jetson TX (ARMv8 Cortex-A57), using the OmpSs programming
model. The evolution of the performance obtained will be shown, followed by a brief analysis
of the results after each optimization.
The Algorithms for FPGA Implementation of Sparse Matrices Multiplication
Compared with dense matrix multiplication, the real performance of sparse matrix multiplication on CPUs is roughly 5--100 times lower when expressed in GFLOPs. For sparse matrices, microprocessors spend most of the time comparing matrix indices rather than performing floating-point multiply and add operations. For 16-bit integer operations, such as index comparisons, the computational power of an FPGA significantly surpasses that of a CPU. Consequently, this paper presents a novel theoretical study of how the matrix sparsity factor influences the ratio of index-comparison to floating-point-operation workload. As a result, a novel FPGA architecture for sparse matrix-matrix multiplication is presented in which index comparisons and floating-point operations are separated. We also verified our idea in practice, and the initial implementation results are very promising. To further decrease the hardware resources required by the floating-point multiplier, a reduced-width multiplication is proposed for the case when IEEE-754 standard compliance is not required.
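The comparison-to-FLOP workload ratio the paper studies can be made concrete with a toy sparse dot product over sorted index lists. This is a sketch for illustration only (the paper's FPGA architecture is not reproduced here):

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as sorted (index, value)
    lists. Returns (value, comparisons, flops): comparisons counts merge
    steps, each of which compares a pair of small integer indices, while
    flops counts floating-point multiply-adds."""
    i = j = comparisons = flops = 0
    acc = 0.0
    while i < len(a) and j < len(b):
        comparisons += 1
        ia, ib = a[i][0], b[j][0]
        if ia == ib:
            acc += a[i][1] * b[j][1]   # one multiply-add
            flops += 1
            i += 1
            j += 1
        elif ia < ib:
            i += 1
        else:
            j += 1
    return acc, comparisons, flops

value, comparisons, flops = sparse_dot(
    [(0, 1.0), (3, 2.0), (7, 1.0)],
    [(3, 4.0), (5, 1.0), (7, 2.0)])
```

Here the merge loop performs four comparison steps for only two multiply-adds; the sparser and more disjoint the operands, the higher that ratio grows, which is why offloading index comparisons to cheap 16-bit integer logic on an FPGA pays off.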
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
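One class of transformations discussed for HLS codes is increasing on-chip data reuse through buffering. As an illustrative sketch only (written in Python rather than HLS C++, on a hypothetically chosen 3-point stencil), the idea of replacing repeated memory reads with a sliding shift register looks like:

```python
def stencil_naive(x):
    # Each output re-reads three inputs from memory: x[i-1], x[i], x[i+1].
    return [x[i - 1] + x[i] + x[i + 1] for i in range(1, len(x) - 1)]

def stencil_shift_register(x):
    """Same 3-point stencil with a sliding window: every input element is
    read exactly once, and the window shifts like a hardware shift register
    that an HLS tool can keep in on-chip registers."""
    out = []
    a, b = x[0], x[1]          # preload the first two window slots
    for i in range(2, len(x)):
        c = x[i]               # the single new memory read this iteration
        out.append(a + b + c)
        a, b = b, c            # shift the window forward
    return out

xs = [float(v) for v in range(1, 11)]
```

Because the transformed loop needs only one memory read per iteration, an HLS tool can pipeline it with an initiation interval of one, which is the kind of scalable architecture the transformations above target.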
Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures
Feltor is a modular and free scientific software package. It allows
developing platform independent code that runs on a variety of parallel
computer architectures ranging from laptop CPUs to multi-GPU distributed memory
systems. Feltor consists of both a numerical library and a collection of
application codes built on top of the library. Its main targets are two- and
three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin
methods as the main numerical discretization technique. We observe that
numerical simulations of a recently developed gyro-fluid model produce
non-deterministic results in parallel computations. First, we show how we
restore accuracy and bitwise reproducibility algorithmically and
programmatically. In particular, we adopt an implementation of the exactly
rounded dot product based on long accumulators, which avoids accuracy losses
especially in parallel applications. However, reproducibility and accuracy
alone fail to indicate correct simulation behaviour. In fact, in the physical
model slightly different initial conditions lead to vastly different end
states. This behaviour translates to its numerical representation. Pointwise
convergence, even in principle, becomes impossible for long simulation times.
In a second part, we explore important performance tuning considerations. We
identify latency and memory bandwidth as the main performance indicators of our
routines. Based on these, we propose a parallel performance model that predicts
the execution time of algorithms implemented in Feltor and test our model on a
selection of parallel hardware architectures. We are able to predict the
execution time with a relative error of less than 25% for problem sizes between
0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth
gives a minimum array size per compute node to achieve a scaling efficiency
above 50% (both strong and weak).
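Feltor's exactly rounded dot product relies on long accumulators. As a lighter-weight illustration of the accuracy problem being solved (not the long-accumulator technique itself), a Neumaier-compensated dot product recovers digits that plain left-to-right accumulation loses:

```python
def dot_naive(x, y):
    # Plain left-to-right accumulation in double precision.
    s = 0.0
    for a, b in zip(x, y):
        s += a * b
    return s

def dot_compensated(x, y):
    """Neumaier-compensated dot product: the rounding error of every
    addition is captured in a running correction term and added back at
    the end, recovering digits lost to cancellation."""
    s = 0.0
    comp = 0.0
    for a, b in zip(x, y):
        p = a * b
        t = s + p
        if abs(s) >= abs(p):
            comp += (s - t) + p    # low-order bits of p were lost
        else:
            comp += (p - t) + s    # low-order bits of s were lost
        s = t
    return s + comp

x = [1.0, 1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0, 1.0]      # exact dot product is 2.0
```

Unlike a long accumulator, compensated summation still depends on summation order, so it is not bitwise reproducible across parallel reductions; that is why the stronger exactly rounded technique is adopted in Feltor.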
High Performance Reconfigurable Computing for Linear Algebra: Design and Performance Analysis
Field Programmable Gate Arrays (FPGAs) enable powerful performance acceleration for scientific computations because of their intrinsic parallelism, pipelining ability, and flexible architecture. This dissertation explores the computational power of FPGAs for an important scientific application: linear algebra. First, optimized linear algebra subroutines are presented based on enhancements to both algorithms and hardware architectures. Compared to microprocessors, these routines achieve significant speedups. Second, computing with mixed-precision data on FPGAs is proposed for higher performance. Experimental analysis shows that mixed-precision algorithms on FPGAs can achieve the high performance of lower-precision arithmetic while keeping higher-precision accuracy when solving systems of linear equations. Third, an execution time model is built for reconfigurable computers (RC), which plays an important role in performance analysis and optimal resource utilization of FPGAs. The accuracy and efficiency of parallel computing performance models often depend on mean maximum computations. Despite significant prior work, adequate mathematical tools for this important calculation have been lacking. This work presents an Effective Mean Maximum Approximation method, which is more general, accurate, and efficient than previous methods. Together, these research results help address how to make linear algebra applications perform better on high performance reconfigurable computing architectures.
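The mixed-precision idea in the second contribution is commonly realized as iterative refinement: solve cheaply in low precision, then correct using residuals computed in high precision. The following is a generic sketch of that pattern (emulating binary32 via `struct` on a hypothetical 2x2 system), not the dissertation's FPGA implementation:

```python
import struct

def to_f32(v):
    # Round a Python float (binary64) through binary32 to emulate
    # cheap low-precision arithmetic.
    return struct.unpack('f', struct.pack('f', v))[0]

def solve2x2_lowprec(A, b):
    """Cramer's rule for a 2x2 system with every operation rounded to
    binary32: the fast, resource-cheap solve."""
    a, b01, c, d = (to_f32(v) for v in (A[0][0], A[0][1], A[1][0], A[1][1]))
    r0, r1 = to_f32(b[0]), to_f32(b[1])
    det = to_f32(to_f32(a * d) - to_f32(b01 * c))
    x0 = to_f32(to_f32(to_f32(d * r0) - to_f32(b01 * r1)) / det)
    x1 = to_f32(to_f32(to_f32(a * r1) - to_f32(c * r0)) / det)
    return [x0, x1]

def refine(A, b, iters=3):
    """Mixed-precision iterative refinement: low-precision solves plus
    double-precision residuals recover near double-precision accuracy."""
    x = solve2x2_lowprec(A, b)
    for _ in range(iters):
        # Residual computed in full double precision.
        r = [b[i] - (A[i][0] * x[0] + A[i][1] * x[1]) for i in range(2)]
        d = solve2x2_lowprec(A, r)   # cheap correction solve
        x = [x[0] + d[0], x[1] + d[1]]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]       # hypothetical well-conditioned system
b = [1.0, 2.0]                     # exact solution: [1/11, 7/11]
x = refine(A, b)
```

The speed of the low-precision solve combined with the accuracy of the high-precision residual is precisely the trade-off the dissertation exploits on FPGA hardware.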