2 research outputs found
JArena: Partitioned Shared Memory for NUMA-awareness in Multi-threaded Scientific Applications
The distributed shared memory (DSM) architecture is widely used in today's
computer design to mitigate the ever-widening processing-memory gap, and
inevitably exhibits non-uniform memory access (NUMA) to shared-memory parallel
applications. Failure to achieve full NUMA-awareness can significantly
degrade application performance, especially on today's manycore platforms
with tens to hundreds of cores. Yet traditional approaches such as first-touch
and memory policy fall short in terms of false page-sharing, fragmentation, or
ease of use. In this paper, we propose a partitioned shared memory approach
which allows multi-threaded applications to achieve full NUMA-awareness with
only minor code changes, and develop an accompanying NUMA-aware heap manager which
eliminates false page-sharing and minimizes fragmentation. Experiments on a
256-core cc-NUMA computing node show that the proposed approach achieves true
NUMA-awareness and improves the performance of typical multi-threaded
scientific applications by up to 4.3-fold as more cores are used.
Comment: 12 pages, 3 figures, submitted to Euro-Par 201
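The false page-sharing problem named in this abstract can be illustrated with a toy model. The sketch below is not JArena's implementation or API; the function names, the 4096-byte page size, and the bump-pointer allocators are all assumptions made for illustration. It compares a single shared heap, where blocks from different threads are packed onto the same pages, with a heap split into one page-aligned partition per thread.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # bytes; a common page size, assumed for this sketch


def pages_touched(offset, size):
    """Return the page indices covered by [offset, offset + size)."""
    first = offset // PAGE_SIZE
    last = (offset + size - 1) // PAGE_SIZE
    return range(first, last + 1)


def count_shared_pages(allocations):
    """Count pages holding data owned by more than one thread.

    allocations is a list of (thread_id, offset, size) tuples.
    """
    owners = defaultdict(set)
    for thread_id, offset, size in allocations:
        for page in pages_touched(offset, size):
            owners[page].add(thread_id)
    return sum(1 for threads in owners.values() if len(threads) > 1)


def shared_heap(requests):
    """One bump pointer for all threads: blocks are packed back to back."""
    allocations, top = [], 0
    for thread_id, size in requests:
        allocations.append((thread_id, top, size))
        top += size
    return allocations


def partitioned_heap(requests, partition_bytes):
    """One bump pointer per thread, each inside its own page-aligned partition."""
    assert partition_bytes % PAGE_SIZE == 0
    tops = {}
    allocations = []
    for thread_id, size in requests:
        base = thread_id * partition_bytes
        used = tops.get(thread_id, 0)
        assert used + size <= partition_bytes, "partition exhausted"
        allocations.append((thread_id, base + used, size))
        tops[thread_id] = used + size
    return allocations


# Four hypothetical threads interleave 1 KiB requests, 8 requests each.
requests = [(t, 1024) for _ in range(8) for t in range(4)]
shared = count_shared_pages(shared_heap(requests))
partitioned = count_shared_pages(partitioned_heap(requests, 4 * PAGE_SIZE))
print(shared, partitioned)  # prints "8 0"
```

In the shared heap every page ends up holding one block from each thread, so no page can be placed on a single NUMA node that suits all of its users. With page-aligned partitions, each page has exactly one owning thread.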
JSweep: A Patch-centric Data-driven Approach for Parallel Sweeps on Large-scale Meshes
In mesh-based numerical simulations, sweep is an important computation
pattern. When sweeping a mesh, computations on cells are strictly ordered by
data dependencies in given directions. Due to such a serial order,
parallelizing sweep is challenging, especially for unstructured and deforming
structured meshes. Meanwhile, recent high-fidelity multi-physics simulations of
particle transport, including nuclear reactor and inertial confinement fusion,
require sweeps on large-scale meshes with billions of cells and hundreds
of directions.
In this paper, we present JSweep, a parallel data-driven computational
framework integrated in the JAxMIN infrastructure. The essence of JSweep is a
general patch-centric data-driven abstraction, coupled with a high-performance
runtime system leveraging hybrid parallelism of MPI+threads and achieving
dynamic communication on contemporary multi-core clusters. Built on JSweep, we
implement a representative data-driven algorithm, Sn transport, featuring
optimizations of vertex clustering, multi-level priority strategy and
patch-angle parallelism. Experimental evaluation with two real-world
applications, on structured and unstructured meshes respectively, demonstrates
that JSweep can scale to tens of thousands of processor cores with reasonable
parallel efficiency.
Comment: 10 pages, 17 figure
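The ordering constraint described in this abstract, where a cell can be computed only after its upstream neighbours in the sweep direction, can be sketched as a data-driven traversal. The example below is a minimal illustration on a small 2D structured grid, not JSweep's patch-centric abstraction or API; the function name `sweep_order` and the per-cell (rather than per-patch) granularity are assumptions. A cell becomes ready once all of its upwind neighbours have been processed, which amounts to a topological ordering of the dependency graph.

```python
from collections import deque


def sweep_order(nx, ny, direction):
    """Return the cells of an nx-by-ny grid in a valid sweep order.

    direction is (sx, sy) with each component +1 or -1. A cell depends on
    its upwind neighbours, i.e. those at (i - sx, j) and (i, j - sy).
    """
    sx, sy = direction

    def inside(cells):
        return [(a, b) for a, b in cells if 0 <= a < nx and 0 <= b < ny]

    def upwind(i, j):
        return inside([(i - sx, j), (i, j - sy)])

    def downwind(i, j):
        return inside([(i + sx, j), (i, j + sy)])

    # Number of upwind neighbours each cell is still waiting for.
    pending = {(i, j): len(upwind(i, j)) for i in range(nx) for j in range(ny)}
    ready = deque(cell for cell, count in pending.items() if count == 0)
    order = []
    while ready:
        cell = ready.popleft()
        order.append(cell)  # the per-cell computation would happen here
        for nxt in downwind(*cell):
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    return order


order = sweep_order(3, 3, (1, 1))
print(order[0], order[-1])  # prints "(0, 0) (2, 2)"
```

Cells that sit in the ready queue at the same time have no dependencies on each other for that direction, so they are the ones a parallel runtime could process concurrently. Different directions start from different corners, for example `sweep_order(3, 3, (-1, 1))` begins at `(2, 0)`.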