Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs
General matrix-matrix multiplications with double-precision real and complex
entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized
for square matrices but often show bad performance for tall & skinny matrices,
which are much taller than wide. NVIDIA's current CUBLAS implementation
delivers only a fraction of the potential performance as indicated by the
roofline model in this case. We describe the challenges and key characteristics
of an implementation that can achieve close to optimal performance. We further
evaluate different strategies of parallelization and thread distribution, and
devise a flexible, configurable mapping scheme. To ensure flexibility and allow
for highly tailored implementations we use code generation combined with
autotuning. For a large range of matrix sizes in the domain of interest we
achieve at least 2/3 of the roofline performance and often substantially
outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
Comment: 12 pages, 22 figures. Extended version of arXiv:1905.03136v1 for
journal submission
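The roofline argument in this abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes the product has the form C = A^T B with tall & skinny A (K x m) and B (K x n), K much larger than m and n, that A and B are each read from memory exactly once, and that traffic for the small result C is negligible. The function names, and the peak and bandwidth figures in the usage note, are hypothetical placeholders rather than measured values.

```python
def tsmm_intensity(m, n, complex_entries=False):
    """Arithmetic intensity (flop/byte) of C = A^T B per row of A and B.

    Assumes A is K x m and B is K x n with K >> m, n, double precision,
    each input element loaded once, and negligible traffic for C.
    """
    # Real multiply-add: 2 flops. Complex multiply-add: 8 flops
    # (a common counting convention, assumed here).
    flops_per_row = (8 if complex_entries else 2) * m * n
    # 8 bytes per double, 16 bytes per double-complex element.
    bytes_per_row = (16 if complex_entries else 8) * (m + n)
    return flops_per_row / bytes_per_row


def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute and memory limits."""
    return min(peak_gflops, intensity * bandwidth_gbs)
```

For example, with placeholder machine values of 1000 GFlop/s peak and 100 GB/s memory bandwidth, `roofline_gflops(tsmm_intensity(4, 4), 1000.0, 100.0)` gives a memory-bound limit, while `m = n = 128` reaches the compute ceiling. Under these assumptions the intensity grows with the matrix width, which is why narrow matrices are bandwidth-limited.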
Design of a parallel hybrid direct/iterative solver for CFD problems
We discuss the parallel implementation of a hybrid direct/iterative solver for a special class of saddle point matrices arising from the discretization of the steady Navier-Stokes equations on an Arakawa C-grid, the F-matrices. The two-level method described here has the following properties: (i) it is very robust, even at comparatively high Reynolds numbers; (ii) a single parameter controls fill and convergence, making the method straightforward to use; (iii) the convergence rate is independent of the number of unknowns; (iv) it can be implemented on distributed memory machines in a natural way; (v) the matrix on the second level has the same structure and numerical properties as the original problem, so the method can be applied recursively. The implementation focusses on generality, modularity, code reuse and recursiveness. The solver is implemented using building blocks of the Trilinos libraries. We show its performance on a parallel computer for the Navier-Stokes equations.
Exascale Sparse Eigensolver Developments for Quantum Physics Applications
In the German Research Foundation (DFG) project ESSEX (Equipping Sparse Solvers for Exascale), we develop scalable sparse eigensolver libraries for large quantum physics problems. Partners in ESSEX are the Universities of Erlangen, Greifswald, Wuppertal, Tokyo and Tsukuba as well as DLR. The project pursues a coherent co-design of all software layers where a holistic performance engineering process guides code development across the classic boundaries of application, numerical method and basic kernel library. The basic building block library supports an elaborate MPI+X approach that is able to fully exploit hardware heterogeneity while exposing functional parallelism and data parallelism to all other software layers in a flexible way.
The advanced building blocks were defined and employed by the developments at the algorithms layer. Here, ESSEX provides state-of-the-art library implementations of classic linear sparse eigenvalue solvers including block Jacobi-Davidson, Kernel Polynomial Method (KPM), and Chebyshev filter diagonalization (ChebFD) that are ready to use for production on modern heterogeneous compute nodes with best performance and numerical accuracy. Research in this direction included the development of appropriate parallel adaptive AMG software for the block Jacobi-Davidson method. Contour integral-based approaches were also covered in ESSEX and were extended in two directions: the FEAST method was further developed for improved scalability, and the Sakurai-Sugiura method (SSM) was extended to nonlinear sparse eigenvalue problems. These developments were strongly supported by Japanese project partners from University of Tokyo, Computer Science, and University of Tsukuba, Applied Mathematics.
The applications layer delivers scalable solutions for conservative (Hermitian) and dissipative (non-Hermitian) quantum systems with strong links to optics and biology and to novel materials such as graphene and topological insulators.
Wind-assisted, electric, and pure wind propulsion - the path towards zero-emission RoRo ships
Electrical and wind propulsion, together with energy stored in batteries and renewable energies harnessed onboard, can lead the way towards zero-emission ships. This study compares wind propulsion solutions and battery storage possibilities for a RoRo ship operating in the Baltic Sea. The ship energy systems simulation model ShipCLEAN is used to predict the performance of the zero-emission ship in real-life operating conditions. The study shows how a conventional, diesel-powered ship can be converted into a zero-emission ship. For the zero-emission ship, all energy needed for auxiliaries and propulsion is taken from renewable sources onboard or from batteries. Challenges and opportunities, as well as necessary adaptations of the route and logistics, are discussed. The results identify which wind propulsion technology is most suitable for the example RoRo ship and show how installing suitably sized battery packs for zero-emission operation affects the ship's cargo capacity.
Retrofitting WASP to a RoPax vessel—design, performance and uncertainties
Wind-assisted propulsion (WASP) is one of the most promising ship propulsion alternatives that radically reduce greenhouse gas emissions and are available today. Using the example of a RoPax ferry, this study presents the performance potential of WASP systems under realistic weather conditions. Different design alternatives and system layouts are discussed. Further, uncertainties in the performance prediction of WASP systems are analyzed. Included in the analysis are the sail forces as well as the aero- and hydrodynamic interaction effects, i.e., the sail–sail and sail–deck interaction as well as the drift and yaw of the ship. As a result, this study provides guidelines on the most important parameters when designing and modeling a WASP ship. Finally, the study presents an analysis of the expected accuracy of the employed empirical/analytical performance prediction model ShipCLEAN.
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault
tolerance and power consumption are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has
been and still is the most widely used technique to deal with hard failures.
Application-level CR is the most effective CR technique in terms of overhead
efficiency but it takes a lot of implementation effort. This work presents the
implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic
Fault Tolerance), which serves two purposes. First, it provides an extendable
library that significantly eases the implementation of application-level
checkpointing. The most basic and frequently used checkpoint data types are
already part of CRAFT and can be directly used out of the box. The library can
be easily extended to add more data types. As a means of overhead reduction, the
library offers a built-in asynchronous checkpointing mechanism and also
supports the Scalable Checkpoint/Restart (SCR) library for node level
checkpointing. Second, CRAFT provides an easier interface for User-Level
Failure Mitigation (ULFM) based dynamic process recovery, which significantly
reduces the complexity and effort of failure detection and communication
recovery mechanisms. By utilizing both functionalities together, applications
can write application-level checkpoints and recover dynamically from process
failures with very limited programming effort. This work presents the design
and use of our library in detail. The associated overheads are thoroughly
analyzed using several benchmarks.
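The idea of application-level checkpoint/restart described in this abstract can be sketched in a few lines. The example below is a generic illustration in Python and does not use or reproduce the CRAFT API: the function names, the JSON file format, and the `fail_at` parameter used to simulate a failure are all assumptions made for this sketch. The application decides which data make up its state, writes them out periodically, and resumes from the last checkpoint on restart.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place,
    so a crash during the write does not corrupt the last checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def read_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)


def run(path, n_iters, every=10, fail_at=None):
    """Sum the integers 0..n_iters-1, checkpointing every `every` iterations.

    `fail_at` (hypothetical, for demonstration only) raises an error at the
    given iteration to mimic a hard failure.
    """
    state = read_checkpoint(path, {"iteration": 0, "total": 0})
    for it in range(state["iteration"], n_iters):
        if fail_at is not None and it == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += it
        state["iteration"] = it + 1
        if state["iteration"] % every == 0:
            write_checkpoint(path, state)
    return state["total"]
```

If a first call to `run` fails partway through, calling it again with the same path continues from the most recent checkpoint instead of from iteration 0. Work done after that checkpoint is repeated, which is the trade-off that the checkpoint interval controls.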
Development of a ship performance model for power estimation of inland waterway vessels
A ship performance model is an important factor in energy-efficient navigation. It formulates a speed–power relationship that can be used to adjust the engine loads for dynamic energy optimisation. However, currently available models have been developed for sea-going vessels, where the environmental conditions are significantly different from those experienced on inland waterways. Inland waterway shipping has great potential to become a mode of transport that can both improve safety and reduce emissions. Therefore, this paper presents the development of an energy performance model specifically for inland waterway vessels (IWVs). The holistic ship energy system model is based on empirical methods, from resistance to engine performance prediction, established in a modular code architecture. The resistance and propulsion prediction in confined waterways is captured by a newly developed method that considers a superposition of shallow-water and bank effects. Verification against model tests and high-fidelity simulations indicates that the selected empirical methods achieve good accuracy in predicting ship performance. The resistance prediction error was 5.2% for single vessels and 8% for pusher-barge convoys based on empirical methods. The results of a case study investigating the performance of a self-propelled vessel under dynamic waterway data indicate that the developed model could be used for onboard power monitoring and energy optimisation during operation.
Software and Performance Engineering for Iterative Eigensolvers
The complexity of the latest HPC architectures increasingly limits the productivity of researchers in numerical
algorithms and the 'time to market' for parallel algorithms. Implementing a new method on a supercomputer today
involves at least three levels of parallelism and typically several programming models like MPI, OpenMP and CUDA.
Frameworks like Trilinos and PETSc have been useful for many years for testing new ideas in parallel algorithms. But
when it comes to, e.g., CPU/GPU clusters, they have so far failed to deliver convincing performance.
We look at sparse solvers from a software engineer's point of view and advocate a programming model we call 'SPMD+OK',
introducing performance models in the test-driven development process and a strategy to facilitate the integration of
algorithmic developments into existing applications.
Harnessing the peta-scale with the block Jacobi-Davidson method is used as a running example in this talk, and the
libraries GHOST and PHIST are presented (https://bitbucket.org/essex/).