Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework
We describe a set of lower-level abstractions to improve performance on
modern large scale heterogeneous systems. These provide portable access to
system- and hardware-dependent features, automatically apply dynamic
optimizations at run time, and target stencil-based codes used in finite
differencing, finite volume, or block-structured adaptive mesh refinement
codes.
These abstractions include a novel data structure to manage refinement
information for block-structured adaptive mesh refinement, an iterator
mechanism to efficiently traverse multi-dimensional arrays in stencil-based
codes, and a portable API and implementation for explicit SIMD vectorization.
These abstractions can be employed manually, targeted by automated code
generation, or used via support libraries by compilers during code
generation. The implementations described below are available in the
Cactus framework, and are used, e.g., in the Einstein Toolkit for relativistic
astrophysics simulations.
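As a rough illustration of what an iterator for stencil codes can look like, the sketch below tiles the interior of a 2D array (skipping ghost cells) and applies a five-point Laplacian inside each tile. It is not the Cactus API: the names `tiled_interior` and `laplacian_2d`, the tile size, and the ghost width are all assumptions made for this example, and no SIMD vectorization is attempted.

```python
from itertools import product


def tiled_interior(shape, ghost, tile):
    """Yield (lo, hi) index boxes that tile the interior of an array.

    `shape` is the full array shape, `ghost` the number of ghost cells
    skipped on every face, and `tile` the maximum tile extent per
    dimension. All three names are hypothetical.
    """
    per_dim = []
    for n, t in zip(shape, tile):
        starts = range(ghost, n - ghost, t)
        per_dim.append([(s, min(s + t, n - ghost)) for s in starts])
    for box in product(*per_dim):
        lo = tuple(b[0] for b in box)
        hi = tuple(b[1] for b in box)
        yield lo, hi


def laplacian_2d(u, ghost=1, tile=(4, 4)):
    """Apply a five-point Laplacian (unit spacing) to the interior of u."""
    ny, nx = len(u), len(u[0])
    out = [[0.0] * nx for _ in range(ny)]
    for (j0, i0), (j1, i1) in tiled_interior((ny, nx), ghost, tile):
        # Each tile is traversed with plain nested loops; ghost cells
        # are only read, never written.
        for j in range(j0, j1):
            for i in range(i0, i1):
                out[j][i] = (u[j - 1][i] + u[j + 1][i]
                             + u[j][i - 1] + u[j][i + 1]
                             - 4.0 * u[j][i])
    return out
```

Separating the traversal (which boxes to visit) from the kernel (what to compute at each point) is the design choice this sketch tries to convey: the tiling can be changed without touching the stencil.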
Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures
A new solver featuring time-space adaptation and error control has been
recently introduced to tackle the numerical solution of stiff
reaction-diffusion systems. Based on operator splitting, finite volume adaptive
multiresolution and high order time integrators with specific stability
properties for each operator, this strategy yields high computational
efficiency for large multidimensional computations on standard architectures
such as powerful workstations. However, the data structure of the original
implementation, based on trees of pointers, provides limited opportunities for
efficiency enhancements, while posing serious challenges in terms of parallel
programming and load balancing. The present contribution proposes a new
implementation of the whole set of numerical methods including Radau5 and
ROCK4, relying on a fully different data structure together with the use of a
specific library, TBB, for shared-memory, task-based parallelism with
work-stealing. The performance of our implementation is assessed in a series of
test-cases of increasing difficulty in two and three dimensions on multi-core
and many-core architectures, demonstrating high scalability.
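The operator-splitting idea mentioned above can be sketched on a toy problem. The example below applies Strang splitting to the 1D linear reaction-diffusion equation u_t = d u_xx - k u on a periodic grid. It is only an illustration under simplifying assumptions: the reaction half-steps use the exact exponential decay and the diffusion step uses explicit Euler, standing in for the Radau5 and ROCK4 integrators named in the abstract, and there is no adaptive multiresolution or task parallelism. All function names are hypothetical.

```python
import math


def diffuse_explicit(u, d, dx, dt):
    """One explicit Euler step of u_t = d * u_xx on a periodic grid.

    Stable and non-negativity preserving when d * dt / dx**2 <= 0.5.
    """
    n = len(u)
    c = d * dt / (dx * dx)
    return [u[i] + c * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n])
            for i in range(n)]


def react_exact(u, k, dt):
    """Exact solution of the linear reaction u_t = -k * u over dt."""
    decay = math.exp(-k * dt)
    return [decay * v for v in u]


def strang_step(u, d, k, dx, dt):
    """Half reaction step, full diffusion step, half reaction step."""
    u = react_exact(u, k, 0.5 * dt)
    u = diffuse_explicit(u, d, dx, dt)
    return react_exact(u, k, 0.5 * dt)


def integrate(u, d, k, dx, dt, steps):
    """Advance u by `steps` Strang steps of size dt."""
    for _ in range(steps):
        u = strang_step(u, d, k, dx, dt)
    return u
```

Because the periodic diffusion step conserves the sum of u and the reaction step scales it by a known factor, the total mass after time t should equal the initial mass times exp(-k t), which gives a simple sanity check for the splitting.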
Recursive Algorithms for Distributed Forests of Octrees
The forest-of-octrees approach to parallel adaptive mesh refinement and
coarsening (AMR) has recently been demonstrated in the context of a number of
large-scale PDE-based applications. Although linear octrees, which store only
leaf octants, have an underlying tree structure by definition, it is not often
exploited in previously published mesh-related algorithms. This is because the
branches are not explicitly stored, and because the topological relationships
in meshes, such as the adjacency between cells, introduce dependencies that do
not respect the octree hierarchy. In this work we combine hierarchical and
topological relationships between octree branches to design efficient recursive
algorithms.
We present three important algorithms with recursive implementations. The
first is a parallel search for leaves matching any of a set of multiple search
criteria. The second is a ghost layer construction algorithm that handles
arbitrarily refined octrees that are not covered by previous algorithms, which
require a 2:1 condition between neighboring leaves. The third is a universal
mesh topology iterator. This iterator visits every cell in a domain partition,
as well as every interface (face, edge and corner) between these cells. The
iterator calculates the local topological information for every interface that
it visits, taking into account the nonconforming interfaces that increase the
complexity of describing the local topology. To demonstrate the utility of the
topology iterator, we use it to compute the numbering and encoding of
higher-order nodal basis functions.
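To make the idea of a recursive search over a linear tree concrete, here is a minimal 2D (quadtree) sketch. Only leaves are stored, sorted by the Morton index of their lower-left corner; the recursion reconstructs each branch on the fly, narrows the leaf range with a binary search, and prunes any branch that contains none of the query points. This is an illustration under stated assumptions and not the algorithm or API from the paper: it is serial, it assumes a single complete tree, and `MAX_LEVEL`, `morton`, `search`, and `find_leaves` are hypothetical names.

```python
from bisect import bisect_left

MAX_LEVEL = 4  # assumed maximum refinement level; domain is 16 x 16 cells


def morton(x, y):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    key = 0
    for b in range(MAX_LEVEL):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key


def side(level):
    """Side length of a quadrant at `level`, in finest-level cells."""
    return 1 << (MAX_LEVEL - level)


def contains(quad, p):
    level, x, y = quad
    s = side(level)
    return x <= p[0] < x + s and y <= p[1] < y + s


def search(leaves, keys, lo, hi, quad, points, found):
    """Recursively match `points` against leaves[lo:hi] below `quad`."""
    inside = [p for p in points if contains(quad, p)]
    if not inside or lo == hi:
        return  # prune: no query point here, or no leaves in this branch
    if hi - lo == 1 and leaves[lo] == quad:
        for p in inside:
            found[p] = quad  # reached a leaf
        return
    level, x, y = quad
    h = side(level + 1)
    for dy in (0, 1):
        for dx in (0, 1):  # children visited in Morton order
            child = (level + 1, x + dx * h, y + dy * h)
            first = morton(child[1], child[2])
            # A child of side h covers h*h consecutive Morton keys.
            c_lo = bisect_left(keys, first, lo, hi)
            c_hi = bisect_left(keys, first + h * h, lo, hi)
            search(leaves, keys, c_lo, c_hi, child, inside, found)


def find_leaves(leaves, points):
    """Return a dict mapping each located point to its (level, x, y) leaf."""
    leaves = sorted(leaves, key=lambda q: morton(q[1], q[2]))
    keys = [morton(q[1], q[2]) for q in leaves]
    found = {}
    search(leaves, keys, 0, len(leaves), (0, 0, 0), points, found)
    return found


# Hypothetical nonuniform tree: leaves given as (level, x, y).
EXAMPLE_LEAVES = [
    (2, 0, 0), (2, 4, 0), (2, 0, 4),
    (3, 4, 4), (3, 6, 4), (3, 4, 6), (3, 6, 6),
    (1, 8, 0), (1, 0, 8), (1, 8, 8),
]
```

Calling `find_leaves(EXAMPLE_LEAVES, [(5, 5), (12, 3)])` locates both points in one traversal; points outside the domain are simply absent from the result.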
We analyze the complexity of the new recursive algorithms theoretically, and
assess their performance, both in terms of single-processor efficiency and in
terms of parallel scalability, demonstrating good weak and strong scaling up to
458k cores of the JUQUEEN supercomputer.
Comment: 35 pages, 15 figures, 3 tables