842 research outputs found

An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor

Author: Alexeev Yuri
D'mello Michael
Gordon Mark S.
Keipert Kristopher
Mironov Vladimir
Moskovsky Alexander
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/08/2017

Modern OpenMP threading techniques are used to convert the MPI-only Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm. Two separate implementations are considered, which differ in whether key data structures (the density and Fock matrices) are shared or replicated among threads. All implementations are benchmarked on a supercomputer of 3,000 Intel Xeon Phi processors. With 64 cores per processor, scaling numbers are reported on up to 192,000 cores. The hybrid MPI/OpenMP implementation reduces the memory footprint by approximately 200 times compared to the legacy code. The MPI/OpenMP code was shown to run up to six times faster than the original for a range of molecular system sizes. Comment: SC17 conference paper, 12 pages, 7 figures
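The difference between replicating and sharing an accumulation matrix among threads can be illustrated with a minimal Python sketch. This is not the GAMESS implementation: the function names, the toy list of `(i, j, value)` contributions, and the use of Python threads with a single lock are assumptions made purely for illustration.

```python
import threading
import numpy as np

def _run(work, n_threads):
    """Start one thread per id and wait for all of them."""
    threads = [threading.Thread(target=work, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def accumulate_replicated(contribs, n, n_threads):
    """Each thread owns a private n x n buffer; buffers are summed at the end."""
    private = [np.zeros((n, n)) for _ in range(n_threads)]

    def work(tid):
        # Round-robin distribution of contributions over threads.
        for i, j, v in contribs[tid::n_threads]:
            private[tid][i, j] += v

    _run(work, n_threads)
    # Return the reduced matrix and the number of matrix elements stored.
    return sum(private), n_threads * n * n

def accumulate_shared(contribs, n, n_threads):
    """All threads update one n x n buffer, guarded by a lock."""
    shared = np.zeros((n, n))
    lock = threading.Lock()

    def work(tid):
        for i, j, v in contribs[tid::n_threads]:
            with lock:
                shared[i, j] += v

    _run(work, n_threads)
    return shared, n * n
```

Both variants produce the same matrix; the replicated one stores `n_threads` copies and needs a final reduction, while the shared one stores a single copy and pays for synchronization on every update.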

arXiv.org e-Print Archive

Crossref

DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives

Author: Bethel E. Wes
Camp David
Childs Hank
Heinemann Colleen
Lessley Brenton
Perciano Talita
Publication date: 13/09/2018

We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance across hardware architectures. We evaluate results on CPUs and GPUs for an image segmentation problem. Compared to a serial baseline, we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare our performance to a reference OpenMP-based algorithm and find speedups of up to 7X (CPU). Comment: LDAV 2018, October 2018
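To give a flavor of what composing data-parallel primitives looks like, here is a minimal NumPy sketch that builds a reduce-by-key from a sort and a segmented reduction, then uses it to compute a mean intensity per segment label. It is not taken from DPP-PMRF; the function names and the toy task are assumptions, and the inputs are assumed to be non-empty 1D arrays.

```python
import numpy as np

def reduce_by_key(keys, values):
    """Sum `values` per distinct key: sort-by-key, find segment starts, segmented reduce."""
    order = np.argsort(keys, kind="stable")                  # sort-by-key
    k, v = keys[order], values[order]
    starts = np.flatnonzero(np.r_[True, k[1:] != k[:-1]])    # first index of each key run
    return k[starts], np.add.reduceat(v, starts)             # one sum per run

def mean_intensity_per_label(labels, pixels):
    """Mean pixel value for each label, expressed only with primitives."""
    uniq, sums = reduce_by_key(labels, pixels)
    _, counts = reduce_by_key(labels, np.ones_like(pixels))  # count = sum of ones
    return uniq, sums / counts

labels = np.array([2, 0, 2, 1, 0])
pixels = np.array([10.0, 1.0, 20.0, 5.0, 3.0])
uniq, means = mean_intensity_per_label(labels, pixels)
```

Because every step is a whole-array operation, the same formulation can in principle be mapped to different back ends without changing the algorithm's structure.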

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework

Author: Schnetter Erik
Publication date: 01/01/2013

We describe a set of lower-level abstractions to improve performance on modern large scale heterogeneous systems. These provide portable access to system- and hardware-dependent features, automatically apply dynamic optimizations at run time, and target stencil-based codes used in finite differencing, finite volume, or block-structured adaptive mesh refinement codes. These abstractions include a novel data structure to manage refinement information for block-structured adaptive mesh refinement, an iterator mechanism to efficiently traverse multi-dimensional arrays in stencil-based codes, and a portable API and implementation for explicit SIMD vectorization. These abstractions can either be employed manually, or be targeted by automated code generation, or be used via support libraries by compilers during code generation. The implementations described below are available in the Cactus framework, and are used e.g. in the Einstein Toolkit for relativistic astrophysics simulations
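As a rough illustration of traversing a multi-dimensional array in a stencil code, the following Python sketch applies a 5-point finite-difference Laplacian twice: once through a simple interior-point iterator, and once through whole-array slicing, which stands in for a vectorized kernel. None of this is the Cactus or Chemora API; `interior_points` and both Laplacian functions are hypothetical names, and a uniform grid spacing `h` with one layer of boundary points is assumed.

```python
import numpy as np

def interior_points(shape, ghost=1):
    """Yield (i, j) over the interior of a 2D grid, skipping `ghost` boundary layers."""
    for i in range(ghost, shape[0] - ghost):
        for j in range(ghost, shape[1] - ghost):
            yield i, j

def laplacian_loop(u, h):
    """5-point Laplacian evaluated point by point via the iterator."""
    out = np.zeros_like(u)
    for i, j in interior_points(u.shape):
        out[i, j] = (u[i - 1, j] + u[i + 1, j] + u[i, j - 1] + u[i, j + 1]
                     - 4.0 * u[i, j]) / h**2
    return out

def laplacian_vectorized(u, h):
    """Same stencil written as shifted array slices (boundary left at zero)."""
    out = np.zeros_like(u)
    out[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
                       - 4.0 * u[1:-1, 1:-1]) / h**2
    return out
```

For a quadratic test function such as u = x^2 + y^2, both versions return 4 at every interior point up to rounding, since the second-order stencil is exact for quadratics.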

arXiv.org e-Print Archive

CiteSeerX

A High-Performance Domain-Specific Language and Code Generator for General N-body Problems

Author: Aghababaie Beni Laleh
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

General N-body problems are a set of problems in which an update to a single element in the system depends on every other element. N-body problems are ubiquitous, with applications in various domains ranging from scientific computing simulations in molecular dynamics, astrophysics, acoustics, and fluid dynamics all the way to computer vision, data mining and machine learning problems. Different N-body algorithms have been designed and implemented in these various fields. However, there is a big gap between the algorithm one designs on paper and the code that runs efficiently on a parallel system. It is time-consuming to write fast, parallel, and scalable code for these problems. On the other hand, the sheer scale and growth of modern scientific datasets necessitate exploiting the power of both parallel and approximation algorithms where there is a potential to trade off accuracy for performance. The main problem that we are tackling in this thesis is how to automatically generate asymptotically optimal N-body algorithms from the high-level specification of the problem. We combine the body of work in performance optimizations, compilers and the domain of N-body problems to build a unified system where domain scientists can write programs at the high level while attaining performance of code written by an expert at the low level. In order to generate high-performance, scalable code for this group of problems, we take the following steps in this thesis. First, we propose a unified algorithmic framework named PASCAL in order to address the challenge of designing a general algorithmic template to represent the class of N-body problems. PASCAL utilizes space-partitioning trees and user-controlled pruning/approximations to reduce the asymptotic runtime complexity from linear to logarithmic in the number of data points. 
In PASCAL, we design an algorithm that automatically generates conditions for pruning or approximation of an N-body problem considering the problem's definition. In order to evaluate PASCAL, we developed tree-based algorithms for six well-known problems: k-nearest neighbors, range search, minimum spanning tree, kernel density estimation, expectation maximization, and Hausdorff distance. We show that applying domain-specific optimizations and parallelization to the algorithms written in PASCAL achieves 10x to 230x speedup compared to state-of-the-art libraries on a dual-socket Intel Xeon processor with 16 cores on real-world datasets. Second, we extend the PASCAL framework to build PASCAL-X that adds support for NUMA-aware parallelization. PASCAL-X also presents insights on the influence of tuning parameters. Tuning parameters such as leaf size (influences the shape of the tree) and cut-off level (controls the granularity of tasks) of the space-partitioning trees result in performance improvement of up to 4.6x. A key goal is to generate scalable and high-performance code automatically without sacrificing productivity. That implies minimizing the effort the users have to put in to generate the desired high-performance code. Another critical factor is the adaptivity, which indicates the amount of effort that is required to extend the high-performance code generation to new N-body problems. Finally, we consider these factors and develop a domain-specific language and code generator named Portal, which is built on top of PASCAL-X. Portal's language design is inspired by the mathematical representation of N-body problems, resulting in an intuitive language for rapid implementation of a variety of problems. Portal's back-end is designed and implemented to generate optimized, parallel, and scalable implementations for multi-core systems. 
We demonstrate that the performance achieved by using Portal is comparable to that of expert hand-optimized code while providing productivity for domain scientists. For instance, using Portal for the k-nearest neighbors problem gains performance that is similar to the hand-optimized code, while reducing the lines of code by 68x. To the best of our knowledge, there are no known libraries or frameworks that implement parallel asymptotically optimal algorithms for the class of general N-body problems, and this thesis primarily aims to fill this gap. Finally, we present a case study of Portal for the real-world problem of face clustering. In this case study, we show that Portal not only provides a fast solution for the face clustering problem with accuracy similar to that of the state-of-the-art algorithm, but also provides productivity by implementing the face clustering algorithm in only 14 lines of Portal code.
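The idea of pruning with a space-partitioning tree can be sketched with a textbook k-d tree for the single nearest-neighbor query. This is a generic illustration, not PASCAL, PASCAL-X, or Portal code; the dictionary-based node layout, the function names, and the default `leaf_size` are assumptions, and the point set is assumed to be a non-empty 2D NumPy array.

```python
import numpy as np

def build(points, idx=None, leaf_size=8, depth=0):
    """Build a k-d tree as nested dicts; leaves hold arrays of point indices."""
    if idx is None:
        idx = np.arange(len(points))
    if len(idx) <= leaf_size:
        return {"leaf": idx}
    axis = depth % points.shape[1]                 # cycle through coordinates
    order = idx[np.argsort(points[idx, axis])]
    mid = len(order) // 2
    return {
        "axis": axis,
        "split": points[order[mid], axis],         # left <= split <= right
        "left": build(points, order[:mid], leaf_size, depth + 1),
        "right": build(points, order[mid:], leaf_size, depth + 1),
    }

def nearest(node, points, q, best=(np.inf, -1)):
    """Return (squared distance, index) of the point closest to query q."""
    if "leaf" in node:
        d2 = np.sum((points[node["leaf"]] - q) ** 2, axis=1)
        j = int(np.argmin(d2))
        return min(best, (float(d2[j]), int(node["leaf"][j])))
    diff = q[node["axis"]] - node["split"]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, points, q, best)
    # Prune: every point on the far side is at least |diff| away, so skip
    # that subtree unless the splitting plane is closer than the current best.
    if diff * diff < best[0]:
        best = nearest(far, points, q, best)
    return best
```

The `leaf_size` parameter plays the role the abstract attributes to leaf size: it changes the shape of the tree and the balance between tree traversal and brute-force work inside the leaves.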

eScholarship - University of California

Angpow: a software for the fast computation of accurate tomographic power spectra

Author: Campagne J. -E.
Neveu J.
Plaszczynski S.
Publication venue: 'EDP Sciences'
Publication date: 01/01/2017

The statistical distribution of galaxies is a powerful probe to constrain cosmological models and gravity. In particular, the matter power spectrum $P(k)$ brings together information about the cosmological distance evolution and the galaxy clustering. However, building $P(k)$ from galaxy catalogues requires a cosmological model to convert angles on the sky and redshifts into distances, which leads to difficulties when comparing data with $P(k)$ predicted from other cosmological models, and for photometric surveys like LSST. The angular power spectrum $C_\ell(z_1,z_2)$ between two bins located at redshifts $z_1$ and $z_2$ contains the same information as the matter power spectrum and is free from any cosmological assumption, but the prediction of $C_\ell(z_1,z_2)$ from $P(k)$ is a costly computation when performed exactly. The Angpow software aims at computing quickly and accurately the auto ($z_1=z_2$) and cross ($z_1 \neq z_2$) angular power spectra between redshift bins. We describe the developed algorithm, based on expansions on the Chebyshev polynomial basis and on the Clenshaw-Curtis quadrature method. We validate the results with other codes, and benchmark the performance. Angpow is flexible and can handle any user-defined power spectra, transfer functions, and redshift selection windows. The code is fast enough to be embedded inside programs exploring large cosmological parameter spaces through the $C_\ell(z_1,z_2)$ comparison with data. We emphasize that Limber's approximation, often used to speed up the computation, gives wrong $C_\ell$ values for cross-correlations. Comment: Published in Astronomy & Astrophysics
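For readers unfamiliar with the Clenshaw-Curtis quadrature named in the abstract, the following Python sketch implements the textbook rule with explicit weights on Chebyshev extreme points. It is not Angpow's code and says nothing about how Angpow organizes its integrals; the function names and the default order `n=32` are assumptions, and `n >= 1` is required.

```python
import numpy as np

def clenshaw_curtis(n):
    """Nodes and weights of the (n+1)-point Clenshaw-Curtis rule on [-1, 1]."""
    j = np.arange(n + 1)
    x = np.cos(np.pi * j / n)                       # Chebyshev extreme points
    c = np.where((j == 0) | (j == n), 1.0, 2.0)     # endpoints get half weight
    w = np.ones(n + 1)
    for k in range(1, n // 2 + 1):
        b = 1.0 if 2 * k == n else 2.0
        w -= b / (4 * k * k - 1) * np.cos(2 * k * j * np.pi / n)
    return x, c / n * w

def integrate(f, a, b, n=32):
    """Approximate the integral of f over [a, b]; f must accept NumPy arrays."""
    x, w = clenshaw_curtis(n)
    t = 0.5 * (b - a) * x + 0.5 * (b + a)           # map [-1, 1] to [a, b]
    return 0.5 * (b - a) * np.dot(w, f(t))
```

For `n = 2` the rule reduces to Simpson's rule (weights 1/3, 4/3, 1/3), and for smooth integrands such as `np.exp` the error falls off rapidly as `n` grows.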

arXiv.org e-Print Archive

A Sparse SCF algorithm and its parallel implementation: Application to DFTB

Author: Rapacioli Mathias
Renon Nicolas
Scemama Anthony
Publication venue: 'American Chemical Society (ACS)'
Publication date: 01/01/2014

We present an algorithm and its parallel implementation for solving a self-consistent problem as encountered in Hartree-Fock or Density Functional Theory. The algorithm takes advantage of the sparsity of matrices through the use of local molecular orbitals. The implementation makes efficient use of modern symmetric multiprocessing (SMP) computer architectures. As a first application, the algorithm is used within the density functional based tight binding method, for which most of the computational time is spent in the linear algebra routines (diagonalization of the Fock/Kohn-Sham matrix). We show that with this algorithm (i) single point calculations on very large systems (millions of atoms) can be performed on large SMP machines, (ii) calculations involving intermediate size systems (1,000 to 100,000 atoms) are also strongly accelerated and can run efficiently on standard servers, and (iii) the error on the total energy due to the use of a cut-off in the molecular orbital coefficients can be controlled such that it remains smaller than the SCF convergence criterion. Comment: 13 pages, 11 figures

arXiv.org e-Print Archive

HAL-INSA Toulouse

AMRA: An Adaptive Mesh Refinement Hydrodynamic Code for Astrophysics

Author: Plewa T.
Müller E.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2000

Implementation details and test cases of a newly developed hydrodynamic code, AMRA, are presented. The numerical scheme exploits the adaptive mesh refinement technique coupled to modern high-resolution schemes which are suitable for relativistic and non-relativistic flows. Various physical processes are incorporated using the operator splitting approach, and include self-gravity, nuclear burning, physical viscosity, implicit and explicit schemes for conductive transport, simplified photoionization, and radiative losses from an optically thin plasma. Several aspects related to the accuracy and stability of the scheme are discussed in the context of hydrodynamic and astrophysical flows. Comment: 41 pages, 21 figures (9 low-resolution), LaTeX, requires elsart.cls, submitted to Comp. Phys. Comm.; additional documentation and high-resolution figures available from http://www.camk.edu.pl/~tomek/AMRA/index.htm
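The operator splitting approach mentioned in the abstract can be illustrated on a toy problem. The Python sketch below splits the scalar equation du/dt = u - u^2 into two sub-problems that are each solved exactly, and combines them with first-order (Lie) and second-order (Strang) splitting. It has nothing to do with AMRA's actual physics modules; the equation, the function names, and the step counts are assumptions chosen so that the exact solution is available for comparison.

```python
import math

def step_growth(u, dt):
    """Exact solution of du/dt = u over a step dt."""
    return u * math.exp(dt)

def step_decay(u, dt):
    """Exact solution of du/dt = -u**2 over a step dt."""
    return u / (1.0 + u * dt)

def lie(u, dt):
    """First-order splitting: full growth step, then full decay step."""
    return step_decay(step_growth(u, dt), dt)

def strang(u, dt):
    """Second-order splitting: half growth, full decay, half growth."""
    return step_growth(step_decay(step_growth(u, dt / 2), dt), dt / 2)

def evolve(stepper, u0, t_end, n_steps):
    """Advance u0 to t_end with n_steps applications of `stepper`."""
    u, dt = u0, t_end / n_steps
    for _ in range(n_steps):
        u = stepper(u, dt)
    return u

def exact(u0, t):
    """Closed-form solution of du/dt = u - u**2 (logistic equation)."""
    e = math.exp(t)
    return u0 * e / (1.0 + u0 * (e - 1.0))
```

Comparing `evolve(lie, ...)` and `evolve(strang, ...)` against `exact` shows the expected behavior: the splitting error shrinks with the step size, and the symmetric Strang variant is more accurate for the same number of steps.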

arXiv.org e-Print Archive

CiteSeerX

Crossref

CERN Document Server