
    Distributed-memory parallelization of an explicit time-domain volume integral equation solver on Blue Gene/P

    Two distributed-memory schemes for efficiently parallelizing the explicit marching-on-in-time-based solution of the time-domain volume integral equation on the IBM Blue Gene/P platform are presented. In the first scheme, each processor stores the time history of all source fields, and only the computationally dominant step of the tested-field computations is distributed among processors. This scheme requires all-to-all global communications to update the time history of the source fields from the tested fields. In the second scheme, the source fields as well as all steps of the tested-field computations are distributed among processors. This scheme requires sequential global communications to update the time history of the distributed source fields from the tested fields. Numerical results demonstrate that both schemes scale well on the IBM Blue Gene/P platform and that the memory-efficient second scheme allows for the characterization of transient wave interactions on composite structures discretized using three million spatial elements without an acceleration algorithm.
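
    As a rough illustration of the first scheme's communication pattern (a generic sketch, not the authors' code), the snippet below assumes each rank computes a contiguous block of tested fields and uses MPI_Allgatherv to replicate the updated fields into every rank's copy of the source-field time history; all array names and sizes are hypothetical.

        /* Illustrative sketch (not the paper's implementation): scheme 1 keeps the
         * full source-field time history on every rank and distributes only the
         * tested-field computation; an all-to-all gather replicates the update. */
        #include <mpi.h>
        #include <stdlib.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, nprocs;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            const int n_total = 1 << 20;              /* hypothetical number of spatial unknowns */
            int base = n_total / nprocs, rem = n_total % nprocs;
            int *counts = malloc(nprocs * sizeof *counts);
            int *displs = malloc(nprocs * sizeof *displs);
            for (int p = 0, off = 0; p < nprocs; ++p) {
                counts[p] = base + (p < rem ? 1 : 0);
                displs[p] = off;
                off += counts[p];
            }

            double *tested_local = calloc(counts[rank], sizeof *tested_local);
            double *source_step  = malloc(n_total * sizeof *source_step);  /* current step, replicated */

            /* ... each rank would compute tested_local from its block of unknowns ... */

            /* All-to-all global communication: every rank receives the full updated
             * field for this time step and appends it to its local time history. */
            MPI_Allgatherv(tested_local, counts[rank], MPI_DOUBLE,
                           source_step, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

            free(counts); free(displs); free(tested_local); free(source_step);
            MPI_Finalize();
            return 0;
        }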

    Scaling of a Fast Fourier Transform and a pseudo-spectral fluid solver up to 196608 cores

    In this paper we present scaling results of an FFT library, FFTK, and a pseudospectral code, Tarang, on grid resolutions up to $8192^3$ using 65536 cores of Blue Gene/P and 196608 cores of Cray XC40 supercomputers. We observe that communication dominates computation, more so on the Cray XC40. The computation time scales as $T_\mathrm{comp} \sim p^{-1}$, and the communication time as $T_\mathrm{comm} \sim n^{-\gamma_2}$, with $\gamma_2$ ranging from 0.7 to 0.9 for Blue Gene/P and from 0.43 to 0.73 for Cray XC40. FFTK and the fluid and convection solvers of Tarang exhibit weak as well as strong scaling nearly up to 196608 cores of Cray XC40. We perform a comparative study of the performance on the Blue Gene/P and Cray XC40 clusters.
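
    A quick way to read the reported exponents: if $T_\mathrm{comm} \sim n^{-\gamma_2}$, two timings $T_1$, $T_2$ measured at core counts $n_1$, $n_2$ give $\gamma_2 = \ln(T_1/T_2)/\ln(n_2/n_1)$. The sketch below simply evaluates that fit; the timing values and core counts are made up for illustration and are not taken from the paper.

        /* Minimal sketch: estimate the communication-scaling exponent gamma_2 from
         * two measured timings, assuming T_comm ~ n^{-gamma_2}. Values are illustrative. */
        #include <math.h>
        #include <stdio.h>

        static double exponent(double n1, double t1, double n2, double t2) {
            /* T1/T2 = (n2/n1)^gamma_2  =>  gamma_2 = ln(T1/T2) / ln(n2/n1) */
            return log(t1 / t2) / log(n2 / n1);
        }

        int main(void) {
            /* hypothetical timings (seconds per transform) at two core counts */
            double g_bgp  = exponent(16384.0, 1.00, 65536.0, 0.35);
            double g_xc40 = exponent(49152.0, 1.00, 196608.0, 0.55);
            printf("gamma_2 (Blue Gene/P-like): %.2f\n", g_bgp);
            printf("gamma_2 (Cray XC40-like):   %.2f\n", g_xc40);
            return 0;
        }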

    GPAW optimized for Blue Gene/P using hybrid programming


    Large-Scale MP2 Calculations on the Blue Gene Architecture Using the Fragment Molecular Orbital Method

    Benchmark timings are presented for the fragment molecular orbital method on a Blue Gene/P computer. Algorithmic modifications that lead to enhanced performance on the Blue Gene/P architecture include strategies for the storage of fragment density matrices by process subgroups in the global address space. The computation of the atomic forces for a system with more than 3000 atoms and 44 000 basis functions, using second-order perturbation theory and an augmented and polarized double-ζ basis set, takes ∼7 min on 131 072 cores.
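
    The process-subgroup idea can be illustrated generically with MPI communicator splitting, where each subgroup could own one fragment's density matrix and work on it independently. This is only a sketch of that pattern, not the actual FMO implementation; the subgroup size and all names are assumptions.

        /* Generic sketch of process subgroups: split MPI_COMM_WORLD into fixed-size
         * groups, each of which could own one fragment's density matrix. Not the
         * actual FMO code; sizes and names are illustrative. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int world_rank, world_size;
            MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
            MPI_Comm_size(MPI_COMM_WORLD, &world_size);

            const int group_size = 256;                 /* hypothetical cores per fragment group */
            int color = world_rank / group_size;        /* which subgroup this rank joins */

            MPI_Comm group_comm;
            MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &group_comm);

            int group_rank, group_nprocs;
            MPI_Comm_rank(group_comm, &group_rank);
            MPI_Comm_size(group_comm, &group_nprocs);

            if (group_rank == 0)
                printf("subgroup %d has %d ranks (world rank %d is its root)\n",
                       color, group_nprocs, world_rank);

            /* ... each subgroup would store and update its fragment density matrix ... */

            MPI_Comm_free(&group_comm);
            MPI_Finalize();
            return 0;
        }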

    Comparison of neuronal spike exchange methods on a Blue Gene/P supercomputer

    For neural network simulations on parallel machines, interprocessor spike communication can be a significant portion of the total simulation time. The performance of several spike exchange methods using a Blue Gene/P (BG/P) supercomputer has been tested with 8–128 K cores using randomly connected networks of up to 32 M cells with 1 k connections per cell and 4 M cells with 10 k connections per cell, i.e., on the order of 4·10¹⁰ connections (K is 1024, M is 1024², and k is 1000). The spike exchange methods used are the standard Message Passing Interface (MPI) collective, MPI_Allgather, and several variants of the non-blocking Multisend method, implemented either via non-blocking MPI_Isend or by exploiting the very low overhead direct memory access (DMA) communication available on the BG/P. In all cases, the worst performing method was the one using MPI_Isend, due to the high overhead of initiating a spike communication. The two best performing methods had similar performance, with very low overhead for the initiation of spike communication: the persistent Multisend method using the Record-Replay feature of the Deep Computing Messaging Framework (DCMF_Multicast), and a two-phase multisend in which a DCMF_Multicast first sends to a subset of phase-one destination cores, which then pass the spikes on to their subset of phase-two destination cores. Departure from ideal scaling for the Multisend methods is almost completely due to load imbalance caused by the large variation in the number of cells that fire on each processor in the interval between synchronizations. Spike exchange time itself is negligible, since transmission overlaps with computation and is handled by a DMA controller. We conclude that ideal performance scaling will ultimately be limited by the imbalance of incoming spikes across processors between synchronization intervals. Thus, counterintuitively, maximizing load balance requires that the distribution of cells on processors not reflect the neural net architecture but be random, so that sets of cells that burst-fire together are placed on different processors, with their targets spread over as large a set of processors as possible.
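
    For context, the baseline in the comparison, MPI_Allgather, can be sketched generically as follows: every rank contributes a fixed-size buffer of (cell id, spike time) records generated since the last synchronization, and every rank receives all buffers. This is only an illustration of the collective pattern, not the simulator's code; the record layout and buffer sizes are assumptions.

        /* Generic sketch of spike exchange with MPI_Allgather: each rank sends the
         * spikes it generated in the last interval (padded to a fixed maximum) and
         * receives every other rank's spikes. Not the simulator's actual code. */
        #include <mpi.h>
        #include <stdlib.h>

        typedef struct { int gid; double t; } spike_t;   /* hypothetical spike record */

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, nprocs;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            /* Describe spike_t to MPI as a contiguous block of bytes (simplest option). */
            MPI_Datatype spike_type;
            MPI_Type_contiguous((int)sizeof(spike_t), MPI_BYTE, &spike_type);
            MPI_Type_commit(&spike_type);

            const int max_spikes = 64;                   /* assumed per-rank cap per interval */
            spike_t *send = calloc(max_spikes, sizeof *send);
            spike_t *recv = calloc((size_t)max_spikes * nprocs, sizeof *recv);

            /* ... fill `send` with spikes fired on this rank since the last exchange ... */

            MPI_Allgather(send, max_spikes, spike_type,
                          recv, max_spikes, spike_type, MPI_COMM_WORLD);

            /* ... each rank scans `recv` and delivers spikes to its target synapses ... */

            free(send); free(recv);
            MPI_Type_free(&spike_type);
            MPI_Finalize();
            return 0;
        }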

    Towards Loosely-Coupled Programming on Petascale Systems

    We have extended the Falkon lightweight task execution framework to make loosely coupled programming on petascale systems a practical and useful programming model. This work studies and measures the performance factors involved in applying this approach to enable the use of petascale systems by a broader user community, and with greater ease. Our work enables the execution of highly parallel computations composed of loosely coupled serial jobs with no modifications to the respective applications. This approach allows a new, and potentially far larger, class of applications to leverage petascale systems, such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O performance encountered in making this model practical, and show results using both microbenchmarks and real applications from two domains: economic energy modeling and molecular dynamics. Our benchmarks show that we can scale up to 160K processor-cores with high efficiency, and can achieve sustained execution rates of thousands of tasks per second. Comment: IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SuperComputing/SC) 200
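
    A minimal picture of the loosely coupled model, many independent serial tasks handed out to workers as they become free, is a plain master/worker dispatch loop. The sketch below is generic and is not the Falkon framework; the task count and message tags are arbitrary.

        /* Generic master/worker sketch of many-task execution: rank 0 hands out
         * independent task ids, workers "run" them and ask for more. Not Falkon. */
        #include <mpi.h>
        #include <stdio.h>

        enum { TAG_TASK = 1, TAG_DONE = 2, NO_MORE_TASKS = -1 };

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, nprocs;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            const int n_tasks = 10000;                    /* hypothetical task count */

            if (rank == 0) {                              /* dispatcher */
                int next = 0, active = nprocs - 1, dummy;
                MPI_Status st;
                while (active > 0) {
                    MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE, MPI_COMM_WORLD, &st);
                    int task = (next < n_tasks) ? next++ : NO_MORE_TASKS;
                    if (task == NO_MORE_TASKS) --active;
                    MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
                }
            } else {                                      /* worker */
                int task, request = 0;
                for (;;) {
                    MPI_Send(&request, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
                    MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    if (task == NO_MORE_TASKS) break;
                    /* ... invoke the unmodified serial application for `task` here ... */
                }
            }

            MPI_Finalize();
            return 0;
        }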

    Design and Experimentation of a Large Scale Distributed Stochastic Control Algorithm Applied to Energy Management Problems

    The Stochastic Dynamic Programming method often used to solve stochastic optimization problems is only usable in low dimension, being plagued by the curse of dimensionality. In this article, we explain how to push back this limit by using High Performance Computing: the design of parallel and distributed algorithms, optimized implementations, and the use of large-scale distributed architectures (PC clusters and Blue Gene/P).
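
    To make the parallelization idea concrete in a generic form (this is not the authors' algorithm), backward dynamic programming over a discretized state space can distribute the states of each stage across ranks and gather the full value function before stepping back one stage. The toy model below (inventory-like dynamics with uniform demand) and all constants are made up for illustration.

        /* Illustrative sketch only: backward stochastic dynamic programming with the
         * state grid partitioned across MPI ranks; each stage ends with an
         * MPI_Allgatherv so every rank holds the full next-stage value function. */
        #include <mpi.h>
        #include <float.h>
        #include <stdlib.h>

        #define NX 1000          /* discretized states  */
        #define NU 10            /* candidate controls  */
        #define NW 5             /* noise scenarios     */
        #define NT 24            /* time stages         */

        static double stage_cost(int x, int u) { return 0.1 * x + 1.0 * u; }
        static int    next_state(int x, int u, int w) {
            int xn = x + u - w;                  /* toy dynamics */
            if (xn < 0) xn = 0;
            if (xn >= NX) xn = NX - 1;
            return xn;
        }

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, nprocs;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            int *counts = malloc(nprocs * sizeof *counts);
            int *displs = malloc(nprocs * sizeof *displs);
            for (int p = 0, off = 0; p < nprocs; ++p) {
                counts[p] = NX / nprocs + (p < NX % nprocs ? 1 : 0);
                displs[p] = off;
                off += counts[p];
            }

            double *V_next  = calloc(NX, sizeof *V_next);             /* full value function, stage t+1 */
            double *V_local = malloc(counts[rank] * sizeof *V_local); /* this rank's slice, stage t     */

            for (int t = NT - 1; t >= 0; --t) {
                for (int i = 0; i < counts[rank]; ++i) {
                    int x = displs[rank] + i;
                    double best = DBL_MAX;
                    for (int u = 0; u < NU; ++u) {                  /* minimize over controls */
                        double exp_cost = 0.0;
                        for (int w = 0; w < NW; ++w)                /* expectation over noise */
                            exp_cost += (stage_cost(x, u) + V_next[next_state(x, u, w)]) / NW;
                        if (exp_cost < best) best = exp_cost;
                    }
                    V_local[i] = best;
                }
                /* Share the new stage's value function with every rank. */
                MPI_Allgatherv(V_local, counts[rank], MPI_DOUBLE,
                               V_next, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
            }

            free(counts); free(displs); free(V_next); free(V_local);
            MPI_Finalize();
            return 0;
        }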