918 research outputs found

A Vectorized Algorithm for Molecular Dynamics of Short Range Interacting Particles

Author: Buchholtz V.
Poeschel T.
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 01/01/1993

We report on a lattice-based algorithm, completely vectorized, for molecular dynamics simulations. Its algorithmic complexity is of the order O(N), where N is the number of particles. The algorithm works very effectively when the particles have short-range interactions, but it is applicable to any kind of interaction. The code was tested on a Cray Y-MP EL in a simulation of flowing granular material. Comment: 9 pages, 6 figures, Late
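To illustrate why a lattice of cells gives O(N) cost for short-range interactions, here is a minimal cell-list sketch in Python. It is a generic textbook technique, not the vectorized algorithm of the paper; the function name `cell_list_pairs`, the 2D periodic box, and the cutoff `r_cut` are assumptions made for the example.

```python
import numpy as np
from collections import defaultdict
from itertools import product

def cell_list_pairs(pos, box, r_cut):
    """Return sorted pairs (i, j), i < j, closer than r_cut in a periodic 2D box.

    Hypothetical helper: particles are binned into square cells at least
    r_cut wide, so each particle is only compared with the 3x3 block of
    cells around it instead of with all N - 1 other particles.
    """
    n_cells = max(1, int(box // r_cut))      # cells per side
    cell_size = box / n_cells
    cells = defaultdict(list)
    for i, p in enumerate(pos):
        cx, cy = (int(c) % n_cells for c in p // cell_size)
        cells[(cx, cy)].append(i)

    pairs = set()                            # a set absorbs repeats when n_cells < 3
    for (cx, cy), members in cells.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            neighbour = ((cx + dx) % n_cells, (cy + dy) % n_cells)
            for i in members:
                for j in cells.get(neighbour, ()):
                    if i < j:
                        d = pos[i] - pos[j]
                        d -= box * np.round(d / box)   # minimum-image convention
                        if d @ d < r_cut ** 2:
                            pairs.add((i, j))
    return sorted(pairs)
```

At fixed density the number of particles per cell is bounded, so the total work grows linearly with N rather than quadratically.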

arXiv.org e-Print Archive

CiteSeerX

Crossref

A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

Author: Bergman K
Chandramowlishwaran A
Hamada T
Lorena A Barba
Rahimian A
Rio Yokota
Warren M
Yokota R
Publication venue: 'SAGE Publications'
Publication date: 16/10/2011

Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our recent work showed scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a campaign of performance tuning and scalability studies using multi-core CPUs on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in a 4x speed-up of the overall algorithm on single-core tests with 10^3 - 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2x faster). The weak scaling test used 10^6 particles per process, and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape.
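The scaling figures in this abstract can be checked with the usual efficiency definitions. The sketch below is illustrative only: the function names are hypothetical, and it assumes each efficiency is measured against that code's own single-process run, which the abstract does not state.

```python
def strong_scaling_efficiency(t_one, t_p, p):
    """Fixed total problem size: ideal time on p processes is t_one / p."""
    return t_one / (p * t_p)

def weak_scaling_efficiency(t_one, t_p):
    """Fixed work per process: ideal time stays equal to t_one."""
    return t_one / t_p

# Weak-scaling problem size quoted in the abstract:
# 10^6 particles per process on 32,768 processes.
total_unknowns = 10**6 * 32_768          # 32,768,000,000, i.e. "more than 32 billion"

# Under the stated assumption, T_p = T_1 / (p * E). For the SIMD code to be
# 2x faster at 54% efficiency than the non-SIMD code at 93%, the implied
# single-process speed-up from SIMD is:
implied_single_process_speedup = 2.0 * 0.93 / 0.54   # about 3.4
```

A lower parallel efficiency can therefore coexist with a shorter wall-clock time, because the SIMD code starts from a much faster single-process baseline.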

arXiv.org e-Print Archive

Crossref

Simulation of 1+1 dimensional surface growth and lattices gases using GPUs

Author: Amir
Barabási
Barma
Bernaschi
Castro
Chowdhury
Edwards
Facsko
Family
Forster
Gergely Ódor
Gross
Géza Ódor
Halpin-Healy
Harris
Henrik Schulz
Hinrichsen
Hwa
Janssen
Juhász
Kardar
Katz
Klumpp
Krug
Ligget
Meakin
Máté Ferenc Nagy
Plischke
Preis
Prähofer
Sasamoto
Schittmann
Schliwa
van Beijeren
van Meel
Weigel
Ódor
Publication venue: 'Elsevier BV'
Publication date: 01/01/2011

Restricted solid-on-solid surface growth models can be mapped onto binary lattice gases. We show that efficient simulation algorithms can be realized on GPUs either by CUDA or by OpenCL programming. We consider a deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimensions related to the Asymmetric Simple Exclusion Process and show that, for sizes that fit into the shared memory of GPUs, one can achieve the maximum parallelization speedup (~100x for a Quadro FX 5800 graphics card with respect to a single 2.67 GHz CPU). This permits us to study the effect of quenched columnar disorder, requiring extremely long simulation times. We compare the CUDA realization with an OpenCL implementation designed for processor clusters via MPI. A two-lane traffic model with randomized turning points is also realized and the dynamical behavior has been investigated. Comment: 20 pages, 12 figures, 1 table, to appear in Comp. Phys. Com
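The surface-to-lattice-gas mapping mentioned in this abstract can be sketched as follows. This is a serial toy illustration, not the paper's GPU code. It assumes the variant in which neighbouring heights differ by exactly ±1 on a periodic ring, and it adopts the convention (an assumption here) that a down-slope is a particle and an up-slope is a hole; the function names are hypothetical.

```python
import numpy as np

def heights_to_gas(h):
    """Map a periodic height profile with slopes of +/-1 to site occupancies.

    Bond i joins sites i and i+1; slope -1 -> particle (1), slope +1 -> hole (0).
    """
    slopes = np.roll(h, -1) - h
    assert np.all(np.abs(slopes) == 1), "heights must differ by exactly 1"
    return ((1 - slopes) // 2).astype(int)

def deposit(h, i):
    """Raise site i by two units if it is a local minimum; else return h unchanged."""
    n = len(h)
    if h[i - 1] == h[i] + 1 and h[(i + 1) % n] == h[i] + 1:
        h = h.copy()
        h[i] += 2
    return h

h = np.array([0, 1, 0, 1, 2, 1])
before = heights_to_gas(h)             # [0, 1, 0, 0, 1, 1]
after = heights_to_gas(deposit(h, 2))  # [0, 0, 1, 0, 1, 1]
```

In this convention a deposition at a local minimum moves one particle one bond to the right, so the growth process becomes an exclusion process with a conserved particle number; evaporation at a local maximum would be the corresponding hop to the left.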

arXiv.org e-Print Archive

Crossref

ELTE Digital Institutional Repository (EDIT)

Direct $N$-body code on low-power embedded ARM GPUs

Author: AR Brodtkorb
E Bortolas
F Perez
J Hunter
K Nitadori
M Katevenis
M Spera
R Capuzzo-Dolcetta
S Harfst
S Konstantinidis
S Walt van der
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 24/01/2019

This work arises in the context of the ExaNeSt project, which aims at the design and development of an exascale-ready supercomputer with a low energy consumption profile that can still support the most demanding scientific and technical applications. The ExaNeSt compute unit consists of densely packed low-power 64-bit ARM processors, embedded within Xilinx FPGA SoCs. SoC boards are heterogeneous architectures where computing power is supplied by both CPUs and GPUs, and they are emerging as a possible low-power and low-cost alternative to clusters based on traditional CPUs. A state-of-the-art direct $N$-body code suitable for astrophysical simulations has been re-engineered in order to exploit SoC heterogeneous platforms based on ARM CPUs and embedded GPUs. Performance tests show that embedded GPUs can be effectively used to accelerate real-life scientific calculations, and that they are also promising because of their energy efficiency, which is a crucial design consideration in future exascale platforms. Comment: 16 pages, 7 figures, 1 table, accepted for publication in the Computing Conference 2019 proceeding
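For readers unfamiliar with the term, a "direct" $N$-body code evaluates every pairwise gravitational interaction explicitly, at O(N^2) cost per force evaluation. The NumPy sketch below shows only that direct summation; it is not the re-engineered code described in the abstract, and the function name `direct_accelerations`, the softening length `eps`, and the unit choice `G = 1` are assumptions for illustration.

```python
import numpy as np

def direct_accelerations(pos, mass, G=1.0, eps=0.0):
    """Gravitational acceleration on each body from all others (O(N^2)).

    pos:  (N, 3) array of positions
    mass: (N,) array of masses
    eps:  optional Plummer softening length (hypothetical parameter)
    """
    d = pos[None, :, :] - pos[:, None, :]     # d[i, j] = r_j - r_i
    r2 = (d ** 2).sum(axis=-1) + eps ** 2     # squared separations
    np.fill_diagonal(r2, np.inf)              # exclude self-interaction
    inv_r3 = r2 ** -1.5                       # 1 / r^3, zero on the diagonal
    return G * (d * (mass[None, :, None] * inv_r3[:, :, None])).sum(axis=1)

# Two unit masses one length unit apart attract each other with |a| = 1.
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
mass = np.array([1.0, 1.0])
acc = direct_accelerations(pos, mass)
```

Each body's sum is independent of the others, which is the property that makes direct summation a natural fit for GPU acceleration.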

arXiv.org e-Print Archive

Crossref

Parallel processors and nonlinear structural dynamics algorithms and software

Author: Belytschko Ted
Gilbertsen Noreen D.
Neal Mark O.
Plaskacz Edward J.

The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14

NASA Technical Reports Server