1,355 research outputs found
A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence
A hybrid scheme that utilizes MPI for distributed memory parallelism and
OpenMP for shared memory parallelism is presented. The work is motivated by the
desire to achieve exceptionally high Reynolds numbers in pseudospectral
computations of fluid turbulence on emerging petascale, high core-count,
massively parallel processing systems. The hybrid implementation derives from
and augments a well-tested scalable MPI-parallelized pseudospectral code. The
hybrid paradigm leads to a new picture for the domain decomposition of the
pseudospectral grids, which is helpful in understanding, among other things,
the 3D transpose of the global data that is necessary for the parallel fast
Fourier transforms that are the central component of the numerical
discretizations. Details of the hybrid implementation are provided, and
performance tests illustrate the utility of the method. It is shown that the
hybrid scheme achieves near ideal scalability up to ~20000 compute cores with a
maximum mean efficiency of 83%. Data are presented that demonstrate how to
choose the optimal number of MPI processes and OpenMP threads in order to
optimize code performance on two different platforms.Comment: Submitted to Parallel Computin
Large scale ab initio calculations based on three levels of parallelization
We suggest and implement a parallelization scheme based on an efficient
multiband eigenvalue solver, called the locally optimal block preconditioned
conjugate gradient LOBPCG method, and using an optimized three-dimensional (3D)
fast Fourier transform (FFT) in the ab initio}plane-wave code ABINIT. In
addition to the standard data partitioning over processors corresponding to
different k-points, we introduce data partitioning with respect to blocks of
bands as well as spatial partitioning in the Fourier space of coefficients over
the plane waves basis set used in ABINIT. This k-points-multiband-FFT
parallelization avoids any collective communications on the whole set of
processors relying instead on one-dimensional communications only. For a single
k-point, super-linear scaling is achieved for up to 100 processors due to an
extensive use of hardware optimized BLAS, LAPACK, and SCALAPACK routines,
mainly in the LOBPCG routine. We observe good performance up to 200 processors.
With 10 k-points our three-way data partitioning results in linear scaling up
to 1000 processors for a practical system used for testing.Comment: 8 pages, 5 figures. Accepted to Computational Material Scienc
COLAB:A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors
Funding: Partially funded by the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Many-core Systems (EP/P020631/1) and ABC: Adaptive Brokerage for Cloud (EP/R010528/1); Royal Academy of Engineering under the Research Fellowship scheme.Increasingly prevalent asymmetric multicore processors (AMP) are necessary for delivering performance in the era of limited power budget and dark silicon. However, the software fails to use them efficiently. OS schedulers, in particular, handle asymmetry only under restricted scenarios. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient asymmetric schedulers for single program workloads. What we do not have is a scheduler that can handle all runtime factors affecting AMP for multi-threaded multi-programmed workloads. This paper introduces the first general purpose asymmetry-aware scheduler for multi-threaded multi-programmed workloads. It estimates the performance of each thread on each type of core and identifies communication patterns and bottleneck threads. The scheduler then makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor's time. We evaluate our approach using the GEM5 simulator on four distinct big.LITTLE configurations and 26 mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25% and 5% to 15% on average depending on the hardware setup.Postprin
Write-Avoiding Algorithms
Short version of the technical report available at http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-163.pdf as Technical Report No. UCB/EECS-2015-163International audienc
- …