867 research outputs found
Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures
A new solver featuring time-space adaptation and error control has been
recently introduced to tackle the numerical solution of stiff
reaction-diffusion systems. Based on operator splitting, finite volume adaptive
multiresolution and high order time integrators with specific stability
properties for each operator, this strategy yields high computational
efficiency for large multidimensional computations on standard architectures
such as powerful workstations. However, the data structure of the original
implementation, based on trees of pointers, provides limited opportunities for
efficiency enhancements, while posing serious challenges in terms of parallel
programming and load balancing. The present contribution proposes a new
implementation of the whole set of numerical methods including Radau5 and
ROCK4, relying on a fully different data structure together with the use of a
specific library, TBB, for shared-memory, task-based parallelism with
work-stealing. The performance of our implementation is assessed in a series of
test-cases of increasing difficulty in two and three dimensions on multi-core
and many-core architectures, demonstrating high scalability
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging
from traditional multi-core CPU processors, supporting a wide class of
applications but delivering moderate computing performance, to many-core GPUs,
exploiting aggressive data-parallelism and delivering higher performances for
streaming computing applications. In this scenario, code portability (and
performance portability) become necessary for easy maintainability of
applications; this is very relevant in scientific computing where code changes
are very frequent, making it tedious and prone to error to keep different code
versions aligned. In this work we present the design and optimization of a
state-of-the-art production-level LQCD Monte Carlo application, using the
directive-based OpenACC programming model. OpenACC abstracts parallel
programming to a descriptive level, relieving programmers from specifying how
codes should be mapped onto the target architecture. We describe the
implementation of a code fully written in OpenACC, and show that we are able to
target several different architectures, including state-of-the-art traditional
CPUs and GPUs, with the same code. We also measure performance, evaluating the
computing efficiency of our OpenACC code on several architectures, comparing
with GPU-specific implementations and showing that a good level of
performance-portability can be reached.Comment: 26 pages, 2 png figures, preprint of an article submitted for
consideration in International Journal of Modern Physics
OpenACC Based GPU Parallelization of Plane Sweep Algorithm for Geometric Intersection
Line segment intersection is one of the elementary operations in computational geometry. Complex problems in Geographic Information Systems (GIS) like finding map overlays or spatial joins using polygonal data require solving segment intersections. Plane sweep paradigm is used for finding geometric intersection in an efficient manner. However, it is difficult to parallelize due to its in-order processing of spatial events. We present a new fine-grained parallel algorithm for geometric intersection and its CPU and GPU implementation using OpenMP and OpenACC. To the best of our knowledge, this is the first work demonstrating an effective parallelization of plane sweep on GPUs.
We chose compiler directive based approach for implementation because of its simplicity to parallelize sequential code. Using Nvidia Tesla P100 GPU, our implementation achieves around 40X speedup for line segment intersection problem on 40K and 80K data sets compared to sequential CGAL library
Design and Analysis of a Task-based Parallelization over a Runtime System of an Explicit Finite-Volume CFD Code with Adaptive Time Stepping
FLUSEPA (Registered trademark in France No. 134009261) is an advanced
simulation tool which performs a large panel of aerodynamic studies. It is the
unstructured finite-volume solver developed by Airbus Safran Launchers company
to calculate compressible, multidimensional, unsteady, viscous and reactive
flows around bodies in relative motion. The time integration in FLUSEPA is done
using an explicit temporal adaptive method. The current production version of
the code is based on MPI and OpenMP. This implementation leads to important
synchronizations that must be reduced. To tackle this problem, we present the
study of a task-based parallelization of the aerodynamic solver of FLUSEPA
using the runtime system StarPU and combining up to three levels of
parallelism. We validate our solution by the simulation (using a finite-volume
mesh with 80 million cells) of a take-off blast wave propagation for Ariane 5
launcher.Comment: Accepted manuscript of a paper in Journal of Computational Scienc
Exploiting nested task-parallelism in the H-LU factorization
[EN] We address the parallelization of the LU factorization of hierarchical matrices (H-matrices) arising from boundary element methods. Our approach exploits task-parallelism via the OmpSs programming model and runtime, which discovers the data-flow parallelism intrinsic to the operation at execution time, via the analysis of data dependencies based on the memory addresses of the tasks' operands. This is especially challenging for H-matrices, as the structures containing the data vary in dimension during the execution. We tackle this issue by decoupling the data structure from that used to detect dependencies. Furthermore, we leverage the support for weak operands and early release of dependencies, recently introduced in OmpSs-2, to accelerate the execution of parallel codes with nested task-parallelism and fine-grain tasks. As a result, we obtain a significant improvement in the parallel performance with respect to our previous work.The researchers from Universidad Jaume I (UJI) were supported by projects CICYT TIN2014-53495-R and TIN2017-82972-R of MINECO and FEDER; project UJI-B2017-46 of UJI; and the FPU program of MECD.Carratalá-Sáez, R.; Christophersen, S.; Aliaga, JI.; Beltrán, V.; Börm, S.; Quintana OrtĂ, ES. (2019). Exploiting nested task-parallelism in the H-LU factorization. Journal of Computational Science. 33:20-33. https://doi.org/10.1016/j.jocs.2019.02.004S203333Hackbusch, W. (1999). A Sparse Matrix Arithmetic Based on \Cal H -Matrices. Part I: Introduction to {\Cal H} -Matrices. Computing, 62(2), 89-108. doi:10.1007/s006070050015Grasedyck, L., & Hackbusch, W. (2003). Construction and Arithmetics of H -Matrices. Computing, 70(4), 295-334. doi:10.1007/s00607-003-0019-1Dongarra, J. J., Du Croz, J., Hammarling, S., & Duff, I. S. (1990). A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1), 1-17. doi:10.1145/77626.79170Buttari, A., Langou, J., Kurzak, J., & Dongarra, J. (2009). A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing, 35(1), 38-53. doi:10.1016/j.parco.2008.10.002Quintana-OrtĂ, G., Quintana-OrtĂ, E. S., Geijn, R. A. V. D., Zee, F. G. V., & Chan, E. (2009). Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3), 1-26. doi:10.1145/1527286.1527288Badia, R. M., Herrero, J. R., Labarta, J., PĂ©rez, J. M., Quintana-OrtĂ, E. S., & Quintana-OrtĂ, G. (2009). Parallelizing dense and banded linear algebra libraries using SMPSs. Concurrency and Computation: Practice and Experience, 21(18), 2438-2456. doi:10.1002/cpe.1463Aliaga, J. I., Badia, R. M., Barreda, M., Bollhofer, M., & Quintana-Orti, E. S. (2014). Leveraging Task-Parallelism with OmpSs in ILUPACK’s Preconditioned CG Method. 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing. doi:10.1109/sbac-pad.2014.24Agullo, E., Buttari, A., Guermouche, A., & Lopez, F. (2016). Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems. ACM Transactions on Mathematical Software, 43(2), 1-22. doi:10.1145/2898348Aliaga, J. I., Carratala-Saez, R., Kriemann, R., & Quintana-Orti, E. S. (2017). Task-Parallel LU Factorization of Hierarchical Matrices Using OmpSs. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). doi:10.1109/ipdpsw.2017.124The OpenMP API specification for parallel programming, http://www.openmp.org/.OmpSs project home page, http://pm.bsc.es/ompss.Perez, J. M., Beltran, V., Labarta, J., & Ayguade, E. (2017). Improving the Integration of Task Nesting and Dependencies in OpenMP. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). doi:10.1109/ipdps.2017.69HLIBpro library home page, https://www.hlibpro.com/.Bempp library home page, https://bempp.com/.HACApK library github repository, https://github.com/hoshino-UTokyo/hacapk-gpu.hmglib library github repository, https://github.com/zaspel/hmglib.HiCMA library github repository, https://github.com/ecrc/hicma.Hackbusch, W., & Börm, S. (2002). -matrix approximation of integral operators by interpolation. Applied Numerical Mathematics, 43(1-2), 129-143. doi:10.1016/s0168-9274(02)00121-
- …