10,192 research outputs found
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The tasks graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler, PaRSEC, and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014
Design and Analysis of a Task-based Parallelization over a Runtime System of an Explicit Finite-Volume CFD Code with Adaptive Time Stepping
FLUSEPA (Registered trademark in France No. 134009261) is an advanced
simulation tool which performs a large panel of aerodynamic studies. It is the
unstructured finite-volume solver developed by Airbus Safran Launchers company
to calculate compressible, multidimensional, unsteady, viscous and reactive
flows around bodies in relative motion. The time integration in FLUSEPA is done
using an explicit temporal adaptive method. The current production version of
the code is based on MPI and OpenMP. This implementation leads to important
synchronizations that must be reduced. To tackle this problem, we present the
study of a task-based parallelization of the aerodynamic solver of FLUSEPA
using the runtime system StarPU and combining up to three levels of
parallelism. We validate our solution by the simulation (using a finite-volume
mesh with 80 million cells) of a take-off blast wave propagation for Ariane 5
launcher.Comment: Accepted manuscript of a paper in Journal of Computational Scienc
Efficient Generation of Parallel Spin-images Using Dynamic Loop Scheduling
High performance computing (HPC) systems underwent a significant increase in
their processing capabilities. Modern HPC systems combine large numbers of
homogeneous and heterogeneous computing resources. Scalability is, therefore,
an essential aspect of scientific applications to efficiently exploit the
massive parallelism of modern HPC systems. This work introduces an efficient
version of the parallel spin-image algorithm (PSIA), called EPSIA. The PSIA is
a parallel version of the spin-image algorithm (SIA). The (P)SIA is used in
various domains, such as 3D object recognition, categorization, and 3D face
recognition. EPSIA refers to the extended version of the PSIA that integrates
various well-known dynamic loop scheduling (DLS) techniques. The present work:
(1) Proposes EPSIA, a novel flexible version of PSIA; (2) Showcases the
benefits of applying DLS techniques for optimizing the performance of the PSIA;
(3) Assesses the performance of the proposed EPSIA by conducting several
scalability experiments. The performance results are promising and show that
using well-known DLS techniques, the performance of the EPSIA outperforms the
performance of the PSIA by a factor of 1.2 and 2 for homogeneous and
heterogeneous computing resources, respectively
SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
Going deeper and wider in neural architectures improves the accuracy, while
the limited GPU DRAM places an undesired restriction on the network design
domain. Deep Learning (DL) practitioners either need change to less desired
network architectures, or nontrivially dissect a network across multiGPUs.
These distract DL practitioners from concentrating on their original machine
learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling
runtime to enable the network training far beyond the GPU DRAM capacity.
SuperNeurons features 3 memory optimizations, \textit{Liveness Analysis},
\textit{Unified Tensor Pool}, and \textit{Cost-Aware Recomputation}, all
together they effectively reduce the network-wide peak memory usage down to the
maximal memory usage among layers. We also address the performance issues in
those memory saving techniques. Given the limited GPU DRAM, SuperNeurons not
only provisions the necessary memory for the training, but also dynamically
allocates the memory for convolution workspaces to achieve the high
performance. Evaluations against Caffe, Torch, MXNet and TensorFlow have
demonstrated that SuperNeurons trains at least 3.2432 deeper network than
current ones with the leading performance. Particularly, SuperNeurons can train
ResNet2500 that has basic network layers on a 12GB K40c.Comment: PPoPP '2018: 23nd ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programmin
A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines
Parallel architectures are essential in order to take advantage of the
parallelism inherent in streaming applications. One particular branch of these
employ hardware SIMD pipelines. In this paper, we analyse several scheduling
techniques, namely ad hoc overlapped execution, modulo scheduling and modulo
scheduling with unrolling, all of which aim to efficiently utilize the special
architecture design. Our investigation focuses on improving throughput while
analysing other metrics that are important for streaming applications, such as
register pressure, buffer sizes and code size. Through experiments conducted on
several media benchmarks, we present and discuss trade-offs involved when
selecting any one of these scheduling techniques.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and
Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241
- …