Workload-aware Automatic Parallelization for Multi-GPU DNN Training
Deep neural networks (DNNs) have emerged as successful solutions for a variety
of artificial intelligence applications, but their very large and deep models
impose high computational requirements during training. Multi-GPU
parallelization is a popular option to accelerate demanding computations in DNN
training, but most state-of-the-art multi-GPU deep learning frameworks not only
require users to have an in-depth understanding of the implementation of the
frameworks themselves, but also apply parallelization in a straightforward way
without optimizing GPU utilization. In this work, we propose a workload-aware
auto-parallelization framework (WAP) for DNN training, where the work is
automatically distributed to multiple GPUs based on the workload
characteristics. We evaluate WAP using TensorFlow with popular DNN benchmarks
(AlexNet and VGG-16), and show competitive training throughput compared with
the state-of-the-art frameworks, and also demonstrate that WAP automatically
optimizes GPU assignment based on the workload's compute requirements, thereby
improving energy efficiency.
Comment: This paper is accepted in ICASSP201
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighbor searching, and we discuss the present and future challenges we see for
exascale simulation - in particular a very fine-grained task parallelism. We
also discuss the software management, code peer review and continuous
integration testing required for a project of this complexity.
Comment: EASC 2014 conference proceedin
Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation
Sparse matrix-vector multiplication (spMVM) is the dominant operation in many
sparse solvers. We investigate performance properties of spMVM with matrices of
various sparsity patterns on the NVIDIA "Fermi" class of GPGPUs. A new "padded
jagged diagonals storage" (pJDS) format is proposed which may substantially
reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our
test scenarios the pJDS format cuts the overall spMVM memory footprint on the
GPGPU by up to 70%, and achieves 95% to 130% of the ELLPACK-R performance.
Using a suitable performance model we identify performance bottlenecks on the
node level that invalidate some types of matrix structures for efficient
multi-GPGPU parallelization. For appropriate sparsity patterns we extend
previous work on distributed-memory parallel spMVM to demonstrate a scalable
hybrid MPI-GPGPU code, achieving efficient overlap of communication and
computation.
Comment: 10 pages, 5 figures. Added reference to other recent sparse matrix
format
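
The memory saving described in this abstract comes from how much zero padding each layout needs. The following is a minimal sketch of that idea only, not the authors' GPGPU implementation: it counts the storage slots an ELLPACK-style layout needs (every row padded to the longest row) against a pJDS-style layout (rows sorted by length, then padded per block of consecutive rows). The function names, the `block` parameter, and the toy matrix are illustrative assumptions.

```python
# Sketch only: counts padded storage slots; it does not build GPU data structures.
# All names below (ellpack_slots, pjds_slots, block, matrix) are hypothetical.

def ellpack_slots(rows):
    """ELLPACK-style layout: every row is padded to the longest row."""
    if not rows:
        return 0
    return len(rows) * max(len(r) for r in rows)


def pjds_slots(rows, block=2):
    """pJDS-style layout: sort rows by nonzero count (descending), then pad
    each block of `block` consecutive rows to that block's longest row."""
    lengths = sorted((len(r) for r in rows), reverse=True)
    total = 0
    for start in range(0, len(lengths), block):
        chunk = lengths[start:start + block]
        total += len(chunk) * chunk[0]  # chunk[0] is the longest row in the block
    return total


def spmv(rows, x):
    """Reference sparse matrix-vector product on rows of (column, value) pairs."""
    return [sum(v * x[c] for c, v in row) for row in rows]


# Toy 4x4 matrix, one list of (column, value) pairs per row.
matrix = [
    [(0, 2.0)],
    [(0, 1.0), (1, 3.0), (2, 4.0), (3, 5.0)],
    [(2, 6.0)],
    [(1, 7.0), (3, 8.0)],
]
x = [1.0, 1.0, 1.0, 1.0]

print(ellpack_slots(matrix), pjds_slots(matrix, block=2), spmv(matrix, x))
```

On a GPU the block size would typically be tied to the hardware thread-group width; the value 2 here is chosen only to keep the toy example readable.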
Parallel symbolic state-space exploration is difficult, but what is the alternative?
State-space exploration is an essential step in many modeling and analysis
problems. Its goal is to find the states reachable from the initial state of a
discrete-state model. The state space can be used to answer important
questions, e.g., "Is there a dead state?" and "Can N become negative?", or as a
starting point for sophisticated investigations expressed in temporal logic.
Unfortunately, the state space is often so large that ordinary explicit data
structures and sequential algorithms cannot cope, prompting the exploration of
(1) parallel approaches using multiple processors, from simple workstation
networks to shared-memory supercomputers, to satisfy large memory and runtime
requirements and (2) symbolic approaches using decision diagrams to encode the
large structured sets and relations manipulated during state-space generation.
Both approaches have merits and limitations. Parallel explicit state-space
generation is challenging, but almost linear speedup can be achieved; however,
the analysis is ultimately limited by the memory and processors available.
Symbolic methods are a heuristic that can efficiently encode many, but not all,
functions over a structured and exponentially large domain; here the pitfalls
are subtler: their performance varies widely depending on the class of decision
diagram chosen, the state variable order, and obscure algorithmic parameters.
As symbolic approaches are often much more efficient than explicit ones for
many practical models, we argue for the need to parallelize symbolic
state-space generation algorithms, so that we can realize the advantage of both
approaches. This is a challenging endeavor, as the most efficient symbolic
algorithm, Saturation, is inherently sequential. We conclude by discussing
challenges, efforts, and promising directions toward this goal.
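
For readers unfamiliar with the explicit approach that this abstract contrasts with symbolic methods, here is a minimal sketch of sequential explicit state-space generation by breadth-first search. It is not the Saturation algorithm and uses no decision diagrams. The toy model (`toy_successors`, with a state written as a pair `(n, m)`) is a hypothetical example invented for illustration; it is used to answer the two sample questions from the abstract.

```python
from collections import deque


def explore(initial, successors):
    """Explicit breadth-first state-space generation.

    Returns the set of reachable states and the list of dead states
    (reachable states with no successor)."""
    seen = {initial}
    frontier = deque([initial])
    dead = []
    while frontier:
        state = frontier.popleft()
        nexts = successors(state)
        if not nexts:
            dead.append(state)
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen, dead


def toy_successors(state):
    """Hypothetical model: one event moves a unit from n to m while n > 0,
    another event consumes two units of m while m >= 2."""
    n, m = state
    result = []
    if n > 0:
        result.append((n - 1, m + 1))
    if m >= 2:
        result.append((n, m - 2))
    return result


states, dead_states = explore((3, 0), toy_successors)

# "Is there a dead state?"
has_dead_state = bool(dead_states)
# "Can N become negative?"
n_can_be_negative = any(n < 0 for n, _ in states)

print(len(states), dead_states, has_dead_state, n_can_be_negative)
```

The `seen` set is the explicit data structure whose size becomes the limiting factor for large models, which is the motivation the abstract gives for parallel and symbolic alternatives.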
Parallel load balancing strategy for Volume-of-Fluid methods on 3-D unstructured meshes
© 2016. This version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/
Volume-of-Fluid (VOF) is one of the methods of choice to reproduce the interface motion in the simulation of multi-fluid flows. One of its main strengths is its accuracy in capturing sharp interface geometries, although this requires a number of geometric calculations. Under these circumstances, achieving parallel performance on current supercomputers is a must. The main obstacle to parallelization is that the computing costs are concentrated only in the discrete elements that lie on the interface between fluids. Consequently, if the interface is not homogeneously distributed throughout the domain, standard domain decomposition (DD) strategies lead to imbalanced workload distributions. In this paper, we present a new parallelization strategy for general unstructured VOF solvers, based on a dynamic load balancing process complementary to the underlying DD. Its parallel efficiency has been analyzed and compared to that of the DD using up to 1024 CPU cores on an Intel SandyBridge based supercomputer. The results obtained on the solution of several artificially generated test cases show a speedup of up to ~12x with respect to the standard DD, depending on the interface size, the initial distribution and the number of parallel processes engaged. Moreover, the new parallelization strategy is general purpose; it could therefore be used to parallelize any VOF solver without requiring changes to the coupled flow solver. Finally, note that although designed for the VOF method, our approach could be easily adapted to other interface-capturing methods, such as the Level-Set, which may present similar workload imbalances.
(C) 2014 Elsevier Inc. All rights reserved.
Peer Reviewed. Postprint (author's final draft
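
The imbalance described in this abstract can be illustrated with a small sketch. It assumes that the cost per process is proportional to the number of interface cells that process owns, and it uses a simple greedy plan to hand interface work from overloaded to underloaded processes while the domain decomposition itself stays unchanged. This is not the authors' algorithm, and it ignores communication cost; `imbalance`, `plan_transfers`, and the example loads are hypothetical.

```python
def imbalance(loads):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


def plan_transfers(loads):
    """Greedy plan moving interface-cell work between processes.

    `loads[r]` is the number of interface cells owned by process r.
    Returns (transfers, new_loads), where each transfer is
    (source_rank, destination_rank, cell_count)."""
    target, extra = divmod(sum(loads), len(loads))
    # The first `extra` ranks take one additional cell so the goals sum exactly.
    goals = [target + (1 if r < extra else 0) for r in range(len(loads))]
    surplus = [[r, loads[r] - goals[r]] for r in range(len(loads)) if loads[r] > goals[r]]
    deficit = [[r, goals[r] - loads[r]] for r in range(len(loads)) if loads[r] < goals[r]]

    new_loads = list(loads)
    transfers = []
    i = j = 0
    while i < len(surplus) and j < len(deficit):
        amount = min(surplus[i][1], deficit[j][1])
        src, dst = surplus[i][0], deficit[j][0]
        transfers.append((src, dst, amount))
        new_loads[src] -= amount
        new_loads[dst] += amount
        surplus[i][1] -= amount
        deficit[j][1] -= amount
        if surplus[i][1] == 0:
            i += 1
        if deficit[j][1] == 0:
            j += 1
    return transfers, new_loads


# Hypothetical case: the whole interface lies in the subdomain of process 0.
loads = [120, 0, 0, 0]
transfers, balanced = plan_transfers(loads)
print(imbalance(loads), transfers, balanced, imbalance(balanced))
```

In this extreme case one process does all the interface work under plain DD, which is the situation where a complementary balancing step has the most to gain.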