842 research outputs found
An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor
Modern OpenMP threading techniques are used to convert the MPI-only
Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm. Two
separate implementations that differ by the sharing or replication of key data
structures among threads are considered, density and Fock matrices. All
implementations are benchmarked on a super-computer of 3,000 Intel Xeon Phi
processors. With 64 cores per processor, scaling numbers are reported on up to
192,000 cores. The hybrid MPI/OpenMP implementation reduces the memory
footprint by approximately 200 times compared to the legacy code. The
MPI/OpenMP code was shown to run up to six times faster than the original for a
range of molecular system sizes.Comment: SC17 conference paper, 12 pages, 7 figure
DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives
We present a new parallel algorithm for probabilistic graphical model
optimization. The algorithm relies on data-parallel primitives (DPPs), which
provide portable performance over hardware architecture. We evaluate results on
CPUs and GPUs for an image segmentation problem. Compared to a serial baseline,
we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare
our performance to a reference, OpenMP-based algorithm, and find speedups of up
to 7X (CPU).Comment: LDAV 2018, October 201
Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework
We describe a set of lower-level abstractions to improve performance on
modern large scale heterogeneous systems. These provide portable access to
system- and hardware-dependent features, automatically apply dynamic
optimizations at run time, and target stencil-based codes used in finite
differencing, finite volume, or block-structured adaptive mesh refinement
codes.
These abstractions include a novel data structure to manage refinement
information for block-structured adaptive mesh refinement, an iterator
mechanism to efficiently traverse multi-dimensional arrays in stencil-based
codes, and a portable API and implementation for explicit SIMD vectorization.
These abstractions can either be employed manually, or be targeted by
automated code generation, or be used via support libraries by compilers during
code generation. The implementations described below are available in the
Cactus framework, and are used e.g. in the Einstein Toolkit for relativistic
astrophysics simulations
Recommended from our members
A High-Performance Domain-Specific Language and Code Generator for General N-body Problems
General N-body problems are a set of problems in which an update to a single element in the system depends on every other element. N-body problems are ubiquitous, with applications in various domains ranging from scientific computing simulations in molecular dynamics, astrophysics, acoustics, and fluid dynamics all the way to computer vision, data mining and machine learning problems. Different N-body algorithms have been designed and implemented in these various fields. However, there is a big gap between the algorithm one designs on paper and the code that runs efficiently on a parallel system. It is time-consuming to write fast, parallel, and scalable code for these problems. On the other hand, the sheer scale and growth of modern scientific datasets necessitate exploiting the power of both parallel and approximation algorithms where there is a potential to trade-off accuracy for performance. The main problem that we are tackling in this thesis is how to automatically generate asymptotically optimal N-body algorithms from the high-level specification of the problem. We combine the body of work in performance optimizations, compilers and the domain of N-body problems to build a unified system where domain scientists can write programs at the high level while attaining performance of code written by an expert at the low level.In order to generate a high-performance, scalable code for this group of problems, we take the following steps in this thesis; first, we propose a unified algorithmic framework named PASCAL in order to address the challenge of designing a general algorithmic template to represent the class of N-body problems. PASCAL utilizes space-partitioning trees and user-controlled pruning/approximations to reduce the asymptotic runtime complexity from linear to logarithmic in the number of data points. In PASCAL, we design an algorithm that automatically generates conditions for pruning or approximation of an N-body problem considering the problem's definition. In order to evaluate PASCAL, we developed tree-based algorithms for six well-known problems: k-nearest neighbors, range search, minimum spanning tree, kernel density estimation, expectation maximization, and Hausdorff distance. We show that applying domain-specific optimizations and parallelization to the algorithms written in PASCAL achieves 10x to 230x speedup compared to state-of-the-art libraries on a dual-socket Intel Xeon processor with 16 cores on real-world datasets. Second, we extend the PASCAL framework to build PASCAL-X that adds support for NUMA-aware parallelization. PASCAL-X also presents insights on the influence of tuning parameters. Tuning parameters such as leaf size (influences the shape of the tree) and cut-off level (controls the granularity of tasks) of the space-partitioning trees result in performance improvement of up to 4.6x. A key goal is to generate scalable and high-performance code automatically without sacrificing productivity. That implies minimizing the effort the users have to put in to generate the desired high-performance code. Another critical factor is the adaptivity, which indicates the amount of effort that is required to extend the high-performance code generation to new N-body problems. Finally, we consider these factors and develop a domain-specific language and code generator named Portal, which is built on top of PASCAL-X. Portal's language design is inspired by the mathematical representation of N-body problems, resulting in an intuitive language for rapid implementation of a variety of problems. Portal's back-end is designed and implemented to generate optimized, parallel, and scalable implementations for multi-core systems. We demonstrate that the performance achieved by using Portal is comparable to that of expert hand-optimized code while providing productivity for domain scientists. For instance, using Portal for the k-nearest neighbors problem gains performance that is similar to the hand-optimized code, while reducing the lines of code by 68x. To the best of our knowledge, there are no known libraries or frameworks that implement parallel asymptotically optimal algorithms for the class of general N-body problems and this thesis primarily aims to fill this gap. Finally, we present a case study of Portal for the real-world problem of face clustering. In this case study, we show that Portal not only provides a fast solution for the face clustering problem with similar accuracy as the state-of-the-art algorithm, but also it provides productivity by implementing the face clustering algorithm in only 14 lines of Portal code
Angpow: a software for the fast computation of accurate tomographic power spectra
The statistical distribution of galaxies is a powerful probe to constrain
cosmological models and gravity. In particular the matter power spectrum
brings information about the cosmological distance evolution and the galaxy
clustering together. However the building of from galaxy catalogues
needs a cosmological model to convert angles on the sky and redshifts into
distances, which leads to difficulties when comparing data with predicted
from other cosmological models, and for photometric surveys like LSST.
The angular power spectrum between two bins located at
redshift and contains the same information than the matter power
spectrum, is free from any cosmological assumption, but the prediction of
from is a costly computation when performed exactly.
The Angpow software aims at computing quickly and accurately the auto
() and cross () angular power spectra between redshift
bins. We describe the developed algorithm, based on developments on the
Chebyshev polynomial basis and on the Clenshaw-Curtis quadrature method. We
validate the results with other codes, and benchmark the performance. Angpow is
flexible and can handle any user defined power spectra, transfer functions, and
redshift selection windows. The code is fast enough to be embedded inside
programs exploring large cosmological parameter spaces through the
comparison with data. We emphasize that the Limber's
approximation, often used to fasten the computation, gives wrong
values for cross-correlations.Comment: Published in Astronomy & Astrophysic
A Sparse SCF algorithm and its parallel implementation: Application to DFTB
We present an algorithm and its parallel implementation for solving a self
consistent problem as encountered in Hartree Fock or Density Functional Theory.
The algorithm takes advantage of the sparsity of matrices through the use of
local molecular orbitals. The implementation allows to exploit efficiently
modern symmetric multiprocessing (SMP) computer architectures. As a first
application, the algorithm is used within the density functional based tight
binding method, for which most of the computational time is spent in the linear
algebra routines (diagonalization of the Fock/Kohn-Sham matrix). We show that
with this algorithm (i) single point calculations on very large systems
(millions of atoms) can be performed on large SMP machines (ii) calculations
involving intermediate size systems (1~000--100~000 atoms) are also strongly
accelerated and can run efficiently on standard servers (iii) the error on the
total energy due to the use of a cut-off in the molecular orbital coefficients
can be controlled such that it remains smaller than the SCF convergence
criterion.Comment: 13 pages, 11 figure
AMRA: An Adaptive Mesh Refinement Hydrodynamic Code for Astrophysics
Implementation details and test cases of a newly developed hydrodynamic code,
AMRA, are presented. The numerical scheme exploits the adaptive mesh refinement
technique coupled to modern high-resolution schemes which are suitable for
relativistic and non-relativistic flows. Various physical processes are
incorporated using the operator splitting approach, and include self-gravity,
nuclear burning, physical viscosity, implicit and explicit schemes for
conductive transport, simplified photoionization, and radiative losses from an
optically thin plasma. Several aspects related to the accuracy and stability of
the scheme are discussed in the context of hydrodynamic and astrophysical
flows.Comment: 41 pages, 21 figures (9 low-resolution), LaTeX, requires elsart.cls,
submitted to Comp. Phys. Comm.; additional documentation and high-resolution
figures available from http://www.camk.edu.pl/~tomek/AMRA/index.htm
- âŠ