143 research outputs found

    Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs

    General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show poor performance for tall & skinny matrices, which are much taller than wide. In this case, NVIDIA's current CUBLAS implementation delivers only a fraction of the potential performance indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU. Comment: 12 pages, 22 figures. Extended version of arXiv:1905.03136v1 for journal submission.
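    The roofline bound mentioned in this abstract can be illustrated with a minimal sketch. It assumes one particular tall & skinny product, C = A^T B with A of size K x m and B of size K x n (K much larger than m and n), real double-precision entries, and each input element loaded from memory exactly once. The function name and the hardware numbers in the usage note are illustrative assumptions, not values taken from the paper.

```python
def roofline_gflops(m, n, peak_gflops, bandwidth_gbs, bytes_per_element=8):
    """Roofline bound (GFlop/s) for C = A^T B with A (K x m), B (K x n), K >> m, n.

    Assumptions of this sketch: real double-precision entries, every element
    of A and B is read from memory exactly once, and traffic for the small
    m x n result C is neglected.
    """
    # Per row index k: m*n multiply-adds = 2*m*n flops, (m + n) elements loaded.
    intensity = 2.0 * m * n / (bytes_per_element * (m + n))  # flop/byte
    # Performance is capped by either memory bandwidth or peak compute.
    return min(peak_gflops, intensity * bandwidth_gbs)


if __name__ == "__main__":
    # Hypothetical round numbers for peak and bandwidth, not measurements.
    for width in (8, 64, 128):
        print(width, roofline_gflops(width, width, 7000.0, 800.0))
```

    With these assumed machine parameters, narrow matrices are limited by memory bandwidth and only wider ones reach the compute ceiling, which is why the attainable performance depends so strongly on the matrix width.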

    Design of a parallel hybrid direct/iterative solver for CFD problems

    We discuss the parallel implementation of a hybrid direct/iterative solver for a special class of saddle point matrices arising from the discretization of the steady Navier-Stokes equations on an Arakawa C-grid, the F-matrices. The two-level method described here has the following properties: (i) it is very robust, even at comparatively high Reynolds numbers; (ii) a single parameter controls fill and convergence, making the method straightforward to use; (iii) the convergence rate is independent of the number of unknowns; (iv) it can be implemented on distributed memory machines in a natural way; (v) the matrix on the second level has the same structure and numerical properties as the original problem, so the method can be applied recursively. The implementation focusses on generality, modularity, code reuse and recursiveness. The solver is implemented using building blocks of the Trilinos libraries. We show its performance on a parallel computer for the Navier-Stokes equations.

    Exascale Sparse Eigensolver Developments for Quantum Physics Applications

    In the German Research Foundation (DFG) project ESSEX (Equipping Sparse Solvers for Exascale), we develop scalable sparse eigensolver libraries for large quantum physics problems. Partners in ESSEX are the Universities of Erlangen, Greifswald, Wuppertal, Tokyo and Tsukuba as well as DLR. The project pursues a coherent co-design of all software layers where a holistic performance engineering process guides code development across the classic boundaries of application, numerical method and basic kernel library. The basic building block library supports an elaborate MPI+X approach that is able to fully exploit hardware heterogeneity while exposing functional parallelism and data parallelism to all other software layers in a flexible way. The advanced building blocks were defined and employed by the developments at the algorithms layer. Here, ESSEX provides state-of-the-art library implementations of classic linear sparse eigenvalue solvers including block Jacobi-Davidson, Kernel Polynomial Method (KPM), and Chebyshev filter diagonalization (ChebFD) that are ready to use for production on modern heterogeneous compute nodes with best performance and numerical accuracy. Research in this direction included the development of appropriate parallel adaptive AMG software for the block Jacobi-Davidson method. Contour integral-based approaches were also covered in ESSEX and were extended in two directions: the FEAST method was further developed for improved scalability, and the Sakurai-Sugiura method (SSM) was extended to nonlinear sparse eigenvalue problems. These developments were strongly supported by Japanese project partners from University of Tokyo, Computer Science, and University of Tsukuba, Applied Mathematics. The applications layer delivers scalable solutions for conservative (Hermitian) and dissipative (non-Hermitian) quantum systems with strong links to optics and biology and to novel materials such as graphene and topological insulators.

    Wind-assisted, electric, and pure wind propulsion - the path towards zero-emission RoRo ships

    Electrical and wind propulsion, together with energy stored in batteries and renewable energies harnessed onboard, can lead the way towards zero-emission ships. This study compares wind propulsion solutions and battery storage possibilities for a RoRo ship operating in the Baltic Sea. The ship energy systems simulation model ShipCLEAN is used to predict the performance of the zero-emission ship in real-life operating conditions. The study showcases how a ship can be converted from a conventional, diesel-powered vessel to a zero-emission ship. For the zero-emission ship, all energy needed for auxiliaries and propulsion is taken from renewable sources onboard or from batteries. Challenges and opportunities, as well as necessary adaptions of the route and logistics, are discussed. The results show which wind propulsion technology is the most suitable for the example RoRo ship, and how the installation of suitably sized battery packs for zero-emission operation affects the cargo capacity of the ship.

    Retrofitting WASP to a RoPax vessel—design, performance and uncertainties

    Wind-assisted propulsion (WASP) is one of the most promising ship propulsion alternatives that radically reduce greenhouse gas emissions and are available today. Using the example of a RoPax ferry, this study presents the performance potential of WASP systems under realistic weather conditions. Different design alternatives and system layouts are discussed. Further, uncertainties in the performance prediction of WASP systems are analyzed. Included in the analysis are the sail forces as well as the aero- and hydrodynamic interaction effects, i.e., the sail–sail and sail–deck interaction as well as the drift and yaw of the ship. As a result, this study provides guidelines on the most important parameters when designing and modeling a WASP ship. Finally, the study presents an analysis of the expected accuracy of the employed empirical/analytical performance prediction model ShipCLEAN.

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node-level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanisms. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks.
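    As a rough illustration of what application-level checkpoint/restart means, the following generic sketch saves only the state the application itself needs (here a step counter and a running sum) and resumes from the last saved state after a failure. It is not CRAFT's API: CRAFT is a C++ library, and all function names, the pickle-based file format, and the simulated failure below are assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # a crash mid-write never corrupts the old checkpoint


def load_checkpoint(path, default):
    """Return the saved state if a checkpoint exists, otherwise the default."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum the step indices 0..n_steps-1, checkpointing every few steps.

    fail_at is a hypothetical knob that simulates a hard failure at a step.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

    Calling `run` again with the same path after a failure resumes from the most recent checkpoint instead of step 0, so only the work done since that checkpoint is repeated.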

    Development of a ship performance model for power estimation of inland waterway vessels

    A ship performance model is an important factor in energy-efficient navigation. It formulates a speed–power relationship that can be used to adjust the engine loads for dynamic energy optimisation. However, currently available models have been developed for sea-going vessels, where the environmental conditions are significantly different from those experienced on inland waterways. Inland waterway shipping has great potential to become a mode of transport that can both improve safety and reduce emissions. Therefore, this paper presents the development of an energy performance model specifically for inland waterway vessels (IWVs). The holistic ship energy system model is based on empirical methods, from resistance to engine performance prediction, established in a modular code architecture. The resistance and propulsion prediction in confined waterways is captured by a newly developed method that considers a superposition of shallow-water and bank effects. Verification against model tests and high-fidelity simulations indicates that the selected empirical methods achieved good accuracy for predicting ship performance. The resistance prediction error was 5.2% for single vessels and 8% for pusher-barge convoys based on empirical methods. The results of a case study investigating the performance of a self-propelled vessel under dynamic waterway data indicate that the developed model could be used for onboard power monitoring and energy optimisation during operation.

    Software and Performance Engineering for Iterative Eigensolvers

    The complexity of the latest HPC architectures increasingly limits the productivity of researchers in numerical algorithms and the 'time to market' for parallel algorithms. Implementing a new method on a supercomputer today involves at least three levels of parallelism and typically several programming models like MPI, OpenMP and CUDA. Frameworks like Trilinos and PETSc have for many years been useful for testing new ideas in parallel algorithms. But when it comes to, e.g., CPU/GPU clusters, they have so far failed to deliver convincing performance. We look at sparse solvers from a software engineer's point of view and advocate a programming model we call 'SPMD+OK', introducing performance models in the test-driven development process and a strategy to facilitate the integration of algorithmic developments into existing applications. Harnessing the peta-scale with the block Jacobi-Davidson method is used as a running example in this talk, and the libraries GHOST and PHIST are presented (https://bitbucket.org/essex/).