1,792 research outputs found
nSharma: Numerical Simulation Heterogeneity Aware Runtime Manager for OpenFOAM
CFD simulations are a fundamental engineering application, implying huge workloads, often with dynamic behaviour due to run-time mesh refinement. Parallel processing over heterogeneous distributed memory clusters is often used to process such workloads. The execution of dynamic workloads over a set of heterogeneous resources leads to load imbalances that severely impact execution time when static uniform load distribution is used. This paper proposes applying dynamic, heterogeneity aware, load balancing techniques within CFD simulations. nSharma, a software package that fully integrates with OpenFOAM, is presented and assessed. Performance gains are demonstrated, achieved by reducing the standard deviation of busy times among resources, i.e. heterogeneous computing resources are kept busy with useful work due to an effective workload distribution. To the best of the authors’ knowledge, nSharma is the first implementation and integration of heterogeneity aware load balancing in OpenFOAM and will be made publicly available in order to foster its adoption by the large community of OpenFOAM users. The authors would like to thank FEDER for financial funding through the COMPETE 2020 Program, and the National Funds through FCT under the projects UID/CTM/50025/2013. The first author was partially funded by the PT-FLAD Chair on Smart Cities & Smart Governance and also by the School of Engineering, University of Minho within project Performance Portability on Scalable Heterogeneous Computing Systems. The authors also wish to thank Kyle Mooney for making available his code supporting migration of dynamically refined meshes, as well as acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources.
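The idea of reducing the standard deviation of busy times can be illustrated with a minimal sketch. It is not nSharma's code or API: the function names, the throughput figures and the cell count are all hypothetical, and the sketch simply assigns mesh cells in proportion to an assumed, already measured per-rank throughput.

```python
from statistics import pstdev


def proportional_shares(total_cells, throughputs):
    """Split total_cells among ranks in proportion to throughput (cells/s)."""
    total = sum(throughputs)
    shares = [int(total_cells * t / total) for t in throughputs]
    # Hand out cells lost to rounding down, fastest ranks first.
    leftover = total_cells - sum(shares)
    fastest_first = sorted(range(len(throughputs)),
                           key=lambda i: throughputs[i], reverse=True)
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


def busy_times(shares, throughputs):
    """Time each rank spends computing its share of the cells."""
    return [s / t for s, t in zip(shares, throughputs)]


# Hypothetical heterogeneous cluster: two fast ranks, two slower ones.
throughputs = [100.0, 100.0, 50.0, 25.0]
total_cells = 110_000

uniform = [total_cells // len(throughputs)] * len(throughputs)
balanced = proportional_shares(total_cells, throughputs)

uniform_busy = busy_times(uniform, throughputs)
balanced_busy = busy_times(balanced, throughputs)

print("uniform  busy-time std dev:", pstdev(uniform_busy))
print("balanced busy-time std dev:", pstdev(balanced_busy))
```

With a uniform split the slowest rank dictates the step time while the fast ranks idle; the proportional split evens out the busy times. A dynamic balancer would repeat this as throughput measurements and mesh size change during the run.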
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR
Comment: 22 pages, 8 figures, Published at Parallel Computing (PARCO
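For readers unfamiliar with the segmented sum view of CSR-based SpMV, the following sequential sketch shows the basic formulation only. It does not implement the speculative GPU execution or the CPU re-arrangement step described in the abstract; the function name and the small test matrix are hypothetical.

```python
def spmv_csr_segmented(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a CSR matrix as products plus a segmented sum."""
    n_rows = len(row_ptr) - 1
    # Step 1: one product per stored nonzero (trivially parallel).
    products = [v * x[c] for v, c in zip(values, col_idx)]
    # Step 2: segmented sum. Row i owns products[row_ptr[i]:row_ptr[i + 1]];
    # an empty row is an empty segment and sums to 0.0.
    y = [0.0] * n_rows
    for i in range(n_rows):
        y[i] = sum(products[row_ptr[i]:row_ptr[i + 1]], 0.0)
    return y


# Hypothetical 3x3 matrix with an empty middle row:
# [[1, 0, 2],
#  [0, 0, 0],
#  [0, 3, 4]]
row_ptr = [0, 2, 2, 4]
col_idx = [0, 2, 1, 2]
values = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr_segmented(row_ptr, col_idx, values, x)
print(y)
```

Empty rows such as the middle one are what make a parallel segmented sum awkward, since the segment boundaries in `row_ptr` no longer map one to one onto segments of the product array.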
Dataflow Programming Paradigms for Computational Chemistry Methods
The transition to multicore and heterogeneous architectures has shaped the High Performance Computing (HPC) landscape over the past decades. With the increase in scale, complexity, and heterogeneity of modern HPC platforms, one of the grim challenges for traditional programming models is to sustain the expected performance at scale. By contrast, dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. This work introduces dataflow programming models for computational chemistry methods, and compares different dataflow executions in terms of programmability, resource utilization, and scalability.
This effort is driven by computational chemistry applications, considering that they comprise one of the driving forces of HPC. In particular, many-body methods, such as Coupled Cluster methods (CC), which are the gold standard to compute energies in quantum chemistry, are of particular interest for the applied chemistry community. On that account, the latest development for CC methods is used as the primary vehicle for this research, but our effort is not limited to CC and can be applied across other application domains.
Two programming paradigms for expressing CC methods in a dataflow form, in order to make them capable of utilizing task scheduling systems, are presented. Explicit dataflow, the programming model in which the dataflow is explicitly specified by the developer, is contrasted with implicit dataflow, in which a task scheduling runtime derives the dataflow. An abstract model is derived to explore the limits of the different dataflow programming paradigms.
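The contrast between the two paradigms can be sketched as follows. This is an illustration under stated assumptions, not the runtimes or the CC code used in the work: the task names, data names and the `derive_dependencies` helper are hypothetical. In the explicit form the developer writes down the predecessor sets; in the implicit form tasks are submitted in program order with their read and write sets, and the dependencies are derived from them.

```python
def derive_dependencies(tasks):
    """Derive predecessor sets from (name, reads, writes) in program order."""
    last_writer = {}          # data item -> task that last wrote it
    readers_since_write = {}  # data item -> tasks that read it since then
    deps = {}
    for name, reads, writes in tasks:
        preds = set()
        for item in reads:
            if item in last_writer:
                preds.add(last_writer[item])             # read after write
        for item in writes:
            if item in last_writer:
                preds.add(last_writer[item])             # write after write
            preds |= readers_since_write.get(item, set())  # write after read
        preds.discard(name)
        deps[name] = preds
        for item in reads:
            readers_since_write.setdefault(item, set()).add(name)
        for item in writes:
            last_writer[item] = name
            readers_since_write[item] = set()
    return deps


# Implicit dataflow: tasks listed sequentially with their data accesses.
tasks = [
    ("contract1", {"A", "B"}, {"C1"}),
    ("contract2", {"A", "D"}, {"C2"}),
    ("accumulate", {"C1", "C2"}, {"E"}),
]
implicit = derive_dependencies(tasks)

# Explicit dataflow: the developer states the same graph directly.
explicit = {
    "contract1": set(),
    "contract2": set(),
    "accumulate": {"contract1", "contract2"},
}

print(implicit == explicit)
```

Both forms describe the same graph here; the difference lies in who is responsible for discovering it, and therefore in how much of the graph is known before execution starts.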
SignalPU: A programming model for DSP applications on parallel and heterogeneous clusters
Biomedical imaging, digital communications, acoustic signal processing and many other digital signal processing (DSP) applications are increasingly present in the digital world. They process growing data volumes, represented with increasing accuracy, using complex algorithms that must satisfy time constraints. Consequently, they require high computing power. To satisfy this need, it is now unavoidable to use parallel and heterogeneous architectures to speed up the processing; the best examples are supercomputers such as "Tianhe-2" and "Titan" in the Top500 ranking. These architectures, with their multi-core nodes supported by many-core accelerators, offer a good response to this problem, but they are still hard to program for performance because of issues such as synchronization, memory management and hardware specifications. In the present work, we propose a high-level programming model to implement digital signal processing applications easily and efficiently on heterogeneous clusters.
Dynamic load balancing of parallel road traffic simulation
The objective of this research was to investigate, develop and evaluate dynamic
load-balancing strategies for parallel execution of microscopic road traffic simulations. Urban road traffic simulation presents an irregular and dynamically varying
distributed computational load for a parallel processor system. The dynamic
nature of road traffic simulation systems leads to uneven load distribution during simulation, even for a system that starts off with an even load distribution. Load balancing is a potential way of achieving improved performance by reallocating
work from highly loaded processors to lightly loaded processors leading to
a reduction in the overall computational time. In dynamic load balancing,
workloads are adjusted continually or periodically throughout the computation.
In this thesis load balancing strategies were evaluated and some load balancing
policies developed. A load index and a profitability determination algorithm
were developed. These were used to enhance two load balancing algorithms. One
of the algorithms exhibits local communications and distributed load evaluation
between the neighbour partitions (diffusion algorithm) and the other algorithm
exhibits both local and global communications while the decision making is
centralized (MaS algorithm). The enhanced algorithms were implemented and
integrated into a research parallel traffic simulator. The performance of the
research parallel traffic simulator, optimized with the two modified dynamic load balancing strategies, was studied.
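A textbook first-order diffusion scheme gives a feel for the neighbour-to-neighbour variant mentioned above. It is only a sketch: the thesis's load index, profitability test and MaS algorithm are not reproduced, and the partition chain, the initial loads and the diffusion coefficient `alpha` are hypothetical values.

```python
def diffusion_step(loads, neighbours, alpha=0.25):
    """One diffusion step: each pair of neighbours exchanges a fraction
    alpha of their load difference, using only local information."""
    new_loads = list(loads)
    for i, nbrs in neighbours.items():
        for j in nbrs:
            if i < j:  # visit each undirected edge once
                flow = alpha * (loads[i] - loads[j])
                new_loads[i] -= flow
                new_loads[j] += flow
    return new_loads


# Hypothetical chain of four road-network partitions: 0 - 1 - 2 - 3.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
initial = [100.0, 20.0, 20.0, 20.0]  # partition 0 is congested

loads = initial
for _ in range(50):
    loads = diffusion_step(loads, neighbours)

print([round(v, 2) for v in loads])
```

Total load is conserved at every step, and the loads drift towards the average without any partition needing global knowledge. The price is that an imbalance far from lightly loaded partitions takes several steps to spread, which is where a centralized decision maker can react faster.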