
    Large scale ab initio calculations based on three levels of parallelization

    We suggest and implement a parallelization scheme based on an efficient multiband eigenvalue solver, the locally optimal block preconditioned conjugate gradient (LOBPCG) method, and an optimized three-dimensional (3D) fast Fourier transform (FFT) in the ab initio plane-wave code ABINIT. In addition to the standard data partitioning over processors corresponding to different k-points, we introduce data partitioning with respect to blocks of bands as well as spatial partitioning, in Fourier space, of the coefficients over the plane-wave basis set used in ABINIT. This k-points-multiband-FFT parallelization avoids any collective communications on the whole set of processors, relying instead on one-dimensional communications only. For a single k-point, super-linear scaling is achieved for up to 100 processors due to an extensive use of hardware-optimized BLAS, LAPACK, and ScaLAPACK routines, mainly in the LOBPCG routine. We observe good performance up to 200 processors. With 10 k-points, our three-way data partitioning results in linear scaling up to 1000 processors for a practical test system. Comment: 8 pages, 5 figures. Accepted to Computational Materials Science.
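
    The three-level partitioning can be pictured as mapping MPI ranks onto a (k-point, band block, FFT slice) grid in which every collective involves only one coordinate at a time. The Python/mpi4py fragment below is a minimal sketch of such a layout under assumed grid sizes; the names and dimensions are illustrative and are not taken from ABINIT.

```python
# Illustrative sketch only: a 3D process grid over k-points, band blocks,
# and FFT (plane-wave) slices, with one-dimensional sub-communicators.
# Grid sizes are assumed, not ABINIT's actual configuration.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

NKPT, NBAND, NFFT = 10, 10, 10               # assumed: 1000 processes in total
assert comm.Get_size() == NKPT * NBAND * NFFT

# Map the flat rank onto (k-point, band block, FFT slice) coordinates.
kpt = rank // (NBAND * NFFT)
band = (rank // NFFT) % NBAND
fft = rank % NFFT

# One-dimensional communicators: each collective (e.g. the reductions inside
# LOBPCG or the FFT transposes) spans a single line of the grid, never the
# whole set of processes.
kpt_comm = comm.Split(color=band * NFFT + fft, key=kpt)
band_comm = comm.Split(color=kpt * NFFT + fft, key=band)
fft_comm = comm.Split(color=kpt * NBAND + band, key=fft)
```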

    Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning

    Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with the potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on an iPSC/860 to demonstrate the usefulness of our methods.
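
    The data-reuse idea can be illustrated with a small cache keyed on the set of off-processor indices a loop touches: as long as the remote values are known to be unmodified, later loops reuse the buffered copy instead of re-fetching it. The Python sketch below only illustrates that bookkeeping; all names are hypothetical and it is not the authors' compiler machinery.

```python
# Hypothetical sketch of reusing copies of off-processor data: a buffered
# gather is reused by later loops until the remote values may have changed.
import numpy as np

class OffProcessorCache:
    def __init__(self):
        self._copies = {}            # (array name, index signature) -> local copy

    def gather(self, fetch_remote, array_name, off_proc_idx, possibly_modified=False):
        key = (array_name, off_proc_idx.tobytes())
        if possibly_modified or key not in self._copies:
            # First access, or the remote values may have been written to:
            # perform the (expensive) communication again.
            self._copies[key] = fetch_remote(off_proc_idx)
        return self._copies[key]

# Usage: two loops indexing the same remote entries of `x` share one copy.
remote_x = np.arange(100, dtype=float)              # stand-in for remote memory
fetch = lambda idx: remote_x[idx].copy()            # stand-in for message passing

cache = OffProcessorCache()
idx = np.array([3, 17, 42])
vals_loop1 = cache.gather(fetch, "x", idx)          # fetched once
vals_loop2 = cache.gather(fetch, "x", idx)          # reused, no communication
```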

    Efficient Implementations of Molecular Dynamics Simulations for Lennard-Jones Systems

    Efficient implementations of the classical molecular dynamics (MD) method for Lennard-Jones particle systems are considered. Not only general algorithms but also techniques that are efficient for specific CPU architectures are explained. A simple spatial-decomposition-based strategy is adopted for parallelization. Using the developed code, benchmark simulations are performed on a HITACHI SR16000/J2 system consisting of 4.7 GHz IBM POWER6 processors at the National Institute for Fusion Science (NIFS) and on an SGI Altix ICE 8400EX system consisting of 2.93 GHz Intel Xeon processors at the Institute for Solid State Physics (ISSP), the University of Tokyo. The parallelization efficiency of the largest run, consisting of 4.1 billion particles with 8192 MPI processes, is about 73% relative to that of the smallest run with 128 MPI processes at NIFS, and about 66% relative to that of the smallest run with 4 MPI processes at ISSP. The factors causing the parallel overhead are investigated. It is found that fluctuations in the execution time of each process degrade the parallel efficiency. These fluctuations may be due to interference from the operating system, known as OS jitter. Comment: 33 pages, 19 figures; added references and revised figures.
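
    As a point of reference for the kind of kernel being parallelized, the following Python fragment is a minimal serial sketch of a truncated Lennard-Jones force calculation with the minimum-image convention; the spatial decomposition, neighbour lists, and MPI layers discussed in the paper are omitted, and the parameters are illustrative.

```python
# Minimal serial sketch of a truncated Lennard-Jones force kernel (reduced
# units, illustrative cutoff); the parallel spatial decomposition is omitted.
import numpy as np

def lj_forces(pos, box, rcut=2.5, eps=1.0, sigma=1.0):
    """Pairwise LJ forces with the minimum-image convention and a cutoff."""
    forces = np.zeros_like(pos)
    rcut2 = rcut * rcut
    for i in range(len(pos) - 1):
        d = pos[i + 1:] - pos[i]                 # vectors from i to later particles
        d -= box * np.round(d / box)             # minimum-image convention
        r2 = np.einsum("ij,ij->i", d, d)
        mask = r2 < rcut2
        sr6 = (sigma * sigma / r2[mask]) ** 3
        fmag = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r2[mask]
        fij = d[mask] * fmag[:, None]            # force on j due to i
        forces[i] -= fij.sum(axis=0)             # Newton's third law
        forces[i + 1:][mask] += fij
    return forces

# Example: a few hundred particles in a cubic box of side 10 (reduced units).
rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 10.0, size=(256, 3))
f = lj_forces(positions, box=np.array([10.0, 10.0, 10.0]))
```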

    Programming distributed memory architectures using Kali

    Programming nonshared memory systems is more difficult than programming shared memory systems, in part because of the relatively low level of current programming environments for such machines. A new programming environment, Kali, is presented which provides a global name space and allows direct access to remote data values. In order to retain efficiency, Kali provides a system of annotations, allowing the user to control those aspects of the program critical to performance, such as data distribution and load balancing. The primitives and constructs provided by the language are described, and some of the issues raised in translating a Kali program for execution on distributed memory systems are also discussed.
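
    The flavour of such annotations can be conveyed with a toy block-distribution helper: the program is written against a global index space, a separate declaration states how that space is split over processors, and the runtime resolves each global reference to an owner and a local offset. The Python sketch below is purely illustrative and does not reproduce Kali's actual syntax.

```python
# Toy illustration (not Kali syntax): a declarative block distribution of a
# global index space, plus the owner/offset lookup a runtime would perform.
def block_distribution(global_size, num_procs):
    """Contiguous (lo, hi) index ranges, one block per processor."""
    base, rem = divmod(global_size, num_procs)
    bounds, start = [], 0
    for p in range(num_procs):
        size = base + (1 if p < rem else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def owner_of(i, bounds):
    """Resolve a global index to (owning processor, local offset)."""
    for p, (lo, hi) in enumerate(bounds):
        if lo <= i < hi:
            return p, i - lo
    raise IndexError(i)

# "Annotation": distribute 10_000 global elements blockwise over 8 processors.
dist = block_distribution(10_000, 8)
print(owner_of(4321, dist))          # -> (3, 571)
```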

    Compiling global name-space programs for distributed execution

    Distributed memory machines do not provide hardware support for a global address space. Thus programmers are forced to partition the data across the memories of the architecture and use explicit message passing to communicate data between processors. The compiler support required to allow programmers to express their algorithms using a global name space is examined. A general method is presented for the analysis of a high-level source program and its translation to a set of independently executing tasks communicating via messages. If the compiler has enough information, this translation can be carried out at compile time. Otherwise, run-time code is generated to implement the required data movement. The analysis required in both situations is described, and the performance of the generated code on the Intel iPSC/2 is presented.
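
    One way to picture the translation is the common owner-computes pattern: a loop written over a global array becomes a local loop plus explicit movement of the boundary values each task needs, generated either at compile time or, when the access pattern is unknown, by run-time code. The mpi4py fragment below is a hedged illustration of that idea with assumed names; it is not the paper's generated code.

```python
# Illustrative translation of a global-name-space stencil loop,
#     for i in 1..N-2: b[i] = a[i-1] + a[i+1]
# into a local loop plus explicit exchange of boundary ("ghost") values.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N_LOCAL = 1000                             # each task owns a contiguous block
a = np.random.rand(N_LOCAL + 2)            # one ghost cell at either end
b = np.zeros_like(a)

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Generated data movement: fill ghost cells from the neighbouring tasks.
comm.Sendrecv(a[1:2], dest=left, recvbuf=a[N_LOCAL + 1:], source=right)
comm.Sendrecv(a[N_LOCAL:N_LOCAL + 1], dest=right, recvbuf=a[0:1], source=left)

# Each task executes only the iterations it owns (owner-computes rule).
b[1:N_LOCAL + 1] = a[0:N_LOCAL] + a[2:N_LOCAL + 2]
```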