
    Partitioned Global Address Space Languages

    The Partitioned Global Address Space (PGAS) model is a parallel programming model that aims to improve programmer productivity while at the same time aiming for high performance. The main premise of PGAS is that a globally shared address space improves productivity, but that a distinction between local and remote data accesses is required to allow performance optimizations and to support scalability on large-scale parallel architectures. To this end, PGAS preserves the global address space while embracing awareness of non-uniform communication costs. Today, about a dozen languages exist that adhere to the PGAS model. This survey proposes a definition and a taxonomy along four axes: how parallelism is introduced, how the address space is partitioned, how data is distributed among the partitions, and finally how data is accessed across partitions. Our taxonomy reveals that today's PGAS languages focus on distributing regular data and distinguish only between local and remote data access cost, whereas the distribution of irregular data and the adoption of richer data access cost models remain open challenges.
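    As a concrete illustration of the local/remote distinction at the heart of PGAS, here is a minimal sketch written against the Global Arrays library (a PGAS-style library that also appears later in this listing). The array name, its size, and the choice of library are illustrative and not taken from the survey: each process writes its own partition through a direct pointer, while an element owned by another process requires an explicit one-sided get.

```c
/* A minimal PGAS-style sketch using the Global Arrays library.  Local
 * elements are touched through a direct pointer; a remote element needs
 * an explicit one-sided get. */
#include <stdio.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 100000, 100000);      /* local memory pool used by GA */

    int dims[1] = {1024};                /* global array size (illustrative) */
    int chunk[1] = {-1};                 /* let the library pick the distribution */
    int g_a = NGA_Create(C_DBL, 1, dims, "x", chunk);

    /* each process learns which slice of the global index space it owns */
    int me = GA_Nodeid(), lo, hi, ld = 1;
    NGA_Distribution(g_a, me, &lo, &hi);

    /* local access: a plain pointer into the owned partition, no communication */
    double *local;
    NGA_Access(g_a, &lo, &hi, &local, &ld);
    for (int i = 0; i <= hi - lo; i++) local[i] = (double)(lo + i);
    NGA_Release_update(g_a, &lo, &hi);
    GA_Sync();

    /* remote access: same global index space, but an explicit one-sided get */
    double x;
    int rlo = (hi + 1) % dims[0], rhi = rlo;   /* first element of the next slice */
    NGA_Get(g_a, &rlo, &rhi, &x, &ld);
    printf("process %d read x[%d] = %g\n", me, rlo, x);

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```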

    Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters

    This paper describes a novel methodology for implementing a common set of collective communication operations on clusters based on symmetric multiprocessor (SMP) nodes. Called Shared-Remote-Memory collectives, or SRM, our approach replaces the point-to-point message passing traditionally used to implement collective message-passing operations with a combination of shared and remote memory access (RMA) protocols that implement the semantics of the collective operations directly. Appropriate embedding of the communication graphs in a cluster maximizes the use of shared memory and reduces network communication. Substantial performance improvements are achieved over the highly optimized commercial IBM implementation and the open-source MPICH implementation of MPI across a wide range of message sizes on the IBM SP. For example, depending on the message size and number of processors, the SRM implementations of broadcast, reduce, and barrier outperform IBM's MPI_Bcast by 27-84%, MPI_Reduce by 24-79%, and MPI_Barrier by 73% on 256 processors, respectively.
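    The sketch below is not the SRM library described in the paper; it only illustrates the two-level idea behind it, shared memory inside an SMP node plus one-sided puts between node leaders, using MPI-3 shared-memory and RMA windows (which postdate the paper). The payload size and the choice of global rank 0 as the broadcast root are assumptions made for the example.

```c
/* Two-level broadcast sketch: shared memory inside a node, one-sided puts
 * between node leaders.  Not the paper's SRM library; MPI-3 windows are
 * used only to illustrate the shared + remote memory combination. */
#include <string.h>
#include <mpi.h>

#define NBYTES 4096                  /* payload size (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char msg[NBYTES];
    if (rank == 0) memset(msg, 'x', NBYTES);   /* rank 0 is the broadcast root */

    /* ranks that can share memory form one "node" communicator */
    MPI_Comm node, leaders;
    int node_rank;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    MPI_Comm_rank(node, &node_rank);

    /* one leader per node (lowest rank on each node; global rank 0 is a leader) */
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leaders);

    /* a node-wide shared buffer owned by the leader */
    char *seg, *base;
    MPI_Win shm_win;
    MPI_Aint sz; int disp;
    MPI_Win_allocate_shared(node_rank == 0 ? NBYTES : 0, 1, MPI_INFO_NULL,
                            node, &seg, &shm_win);
    MPI_Win_shared_query(shm_win, 0, &sz, &disp, &base);  /* leader's segment */

    MPI_Win_fence(0, shm_win);
    if (rank == 0) memcpy(base, msg, NBYTES);

    /* inter-node stage: the root leader puts the payload into every other
     * leader's shared segment with one-sided RMA instead of send/recv */
    if (leaders != MPI_COMM_NULL) {
        int nl, lr;
        MPI_Comm_size(leaders, &nl);
        MPI_Comm_rank(leaders, &lr);
        MPI_Win rma_win;
        MPI_Win_create(base, NBYTES, 1, MPI_INFO_NULL, leaders, &rma_win);
        MPI_Win_fence(0, rma_win);
        if (rank == 0)
            for (int p = 0; p < nl; p++)
                if (p != lr)
                    MPI_Put(base, NBYTES, MPI_BYTE, p, 0, NBYTES, MPI_BYTE, rma_win);
        MPI_Win_fence(0, rma_win);
        MPI_Win_free(&rma_win);
    }

    /* intra-node stage: every rank copies out of its node's shared segment;
     * the fence doubles as a node barrier and memory synchronisation */
    MPI_Win_fence(0, shm_win);
    if (rank != 0) memcpy(msg, base, NBYTES);

    MPI_Win_free(&shm_win);
    if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```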

    Data and computation abstractions for dynamic and irregular computations

    Effective data distribution and parallelization of computations involving irregular data structures are challenging tasks. We address these twin problems in the context of computations involving block-sparse matrices. The programming model provides a global view of a distributed block-sparse matrix. Abstractions are provided for the user to express the parallel tasks in the computation. The tasks are mapped onto processors to ensure load balance and locality. The abstractions are based on the Aggregate Remote Memory Copy Interface, and are interoperable with the Global Arrays programming suite and MPI. Results are presented that demonstrate the utility of the approach.
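    The paper's data and computation abstractions are not reproduced here; the sketch below only shows a common dynamic load-balancing idiom in the Global Arrays / ARMCI style the paper builds on: a shared counter, advanced with a one-sided read-and-increment, hands out block indices, and each assigned block is then fetched with a one-sided get. The block count, block size, and dense block layout are illustrative simplifications.

```c
/* Dynamic load-balancing sketch in the Global Arrays style (not the
 * paper's abstractions).  A shared counter assigns block indices; each
 * process pulls its assigned block with a one-sided get. */
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

#define NBLOCKS 64                    /* number of non-zero blocks (illustrative) */
#define BLOCK   32                    /* each block is BLOCK x BLOCK doubles */

static void process_block(double *blk) { (void)blk; /* stand-in for real work */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 100000, 100000);   /* local memory pool used by GA */

    /* global task counter: a one-element integer global array */
    int cdims[1] = {1}, cchunk[1] = {-1}, zero = 0;
    int g_cnt = NGA_Create(C_LONG, 1, cdims, "task counter", cchunk);
    GA_Zero(g_cnt);

    /* the non-zero blocks, stored here as one dense stack of blocks */
    int dims[2] = {NBLOCKS * BLOCK, BLOCK}, chunk[2] = {-1, -1};
    int g_blocks = NGA_Create(C_DBL, 2, dims, "blocks", chunk);
    GA_Zero(g_blocks);
    GA_Sync();

    double buf[BLOCK * BLOCK];
    int ld[1] = {BLOCK};
    long t;
    /* keep grabbing the next unclaimed block index until none are left */
    while ((t = NGA_Read_inc(g_cnt, &zero, 1)) < NBLOCKS) {
        int lo[2] = {(int)t * BLOCK, 0};
        int hi[2] = {(int)t * BLOCK + BLOCK - 1, BLOCK - 1};
        NGA_Get(g_blocks, lo, hi, buf, ld);   /* one-sided fetch of the block */
        process_block(buf);
    }

    GA_Sync();
    GA_Destroy(g_blocks);
    GA_Destroy(g_cnt);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```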

    SRUMMA: a matrix multiplication algorithm suitable for clusters and scalable shared memory systems

    This paper describes a novel parallel algorithm that implements a dense matrix multiplication operation with algorithmic efficiency equivalent to that of Cannon’s algorithm. It is suitable for clusters and scalable shared memory systems. The approach differs from other parallel matrix multiplication algorithms in its explicit use of shared memory and remote memory access (RMA) communication rather than message passing. Experimental results on clusters (IBM SP, Linux-Myrinet) and shared memory systems (SGI Altix, Cray X1) demonstrate consistent performance advantages over pdgemm from the ScaLAPACK/PBBLAS suite, the leading implementation of parallel matrix multiplication in use today. In the best case on the SGI Altix, the new algorithm performs 20 times better than pdgemm for a matrix size of 1000 on 128 processors. The impact of zero-copy non-blocking RMA and shared memory communication on matrix multiplication performance on clusters is also investigated.
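    The following is a simplified one-dimensional rendition of the central idea, not the published SRUMMA algorithm: remote block rows of B are pulled with non-blocking one-sided gets, and the transfer of block k+1 is overlapped with the multiplication that consumes block k. The matrix size, the block-row distribution, and the naive local multiply are assumptions made for the example.

```c
/* Simplified 1-D sketch of the SRUMMA idea: prefetch the next remote
 * block row of B with a non-blocking one-sided get while multiplying
 * with the block row that has already arrived. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

#define N 512                           /* global matrix order (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);   /* local memory pool used by GA */

    int me = GA_Nodeid(), np = GA_Nnodes();
    if (N % np) {                       /* keep the example simple: np must divide N */
        if (me == 0) fprintf(stderr, "np must divide N\n");
        GA_Terminate(); MPI_Finalize(); return 1;
    }
    int nb = N / np;                    /* block rows per process */

    /* B lives in a global array distributed by block rows */
    int dims[2] = {N, N}, chunk[2] = {nb, N};
    int g_b = NGA_Create(C_DBL, 2, dims, "B", chunk);
    GA_Zero(g_b);
    GA_Sync();

    /* each process keeps its block row of A and C in ordinary local memory */
    double *a = calloc((size_t)nb * N, sizeof *a);
    double *c = calloc((size_t)nb * N, sizeof *c);
    double *bbuf[2] = { malloc((size_t)nb * N * sizeof(double)),
                        malloc((size_t)nb * N * sizeof(double)) };
    int ld[1] = {N};
    ga_nbhdl_t h[2];

    /* prefetch block row 0 of B */
    int lo[2] = {0, 0}, hi[2] = {nb - 1, N - 1};
    NGA_NbGet(g_b, lo, hi, bbuf[0], ld, &h[0]);

    for (int k = 0; k < np; k++) {
        int cur = k & 1, nxt = cur ^ 1;
        if (k + 1 < np) {               /* start fetching the next block row early */
            int nlo[2] = {(k + 1) * nb, 0}, nhi[2] = {(k + 2) * nb - 1, N - 1};
            NGA_NbGet(g_b, nlo, nhi, bbuf[nxt], ld, &h[nxt]);
        }
        NGA_NbWait(&h[cur]);            /* block row k has arrived */

        /* C(me,:) += A(me, block k) * B(block k, :), naive local multiply */
        for (int i = 0; i < nb; i++)
            for (int kk = 0; kk < nb; kk++) {
                double aik = a[i * N + k * nb + kk];
                for (int j = 0; j < N; j++)
                    c[i * N + j] += aik * bbuf[cur][kk * N + j];
            }
    }

    GA_Sync();
    free(a); free(c); free(bbuf[0]); free(bbuf[1]);
    GA_Destroy(g_b);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```

    Two buffers and two non-blocking handles are enough here because at most one prefetch is in flight while the previously fetched block row is being consumed.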

    Exploiting non-blocking remote memory access communication in scientific benchmarks

    This paper describes a comparative performance study of the MPI and Remote Memory Access (RMA) communication models in the context of four scientific benchmarks: NAS MG, NAS CG, SUMMA matrix multiplication, and Lennard-Jones molecular dynamics on clusters with the Myrinet network. It is shown that RMA communication delivers a consistent performance advantage over MPI; in some cases an improvement of as much as 50% was achieved. The benefits of using non-blocking RMA to overlap computation and communication are also discussed.
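    The benchmarks in the paper were written against ARMCI; the sketch below only shows the overlap pattern being measured, namely start the transfer, compute on data already held locally, then wait, expressed here with portable MPI-3 request-based RMA for a one-dimensional halo exchange. The problem size and the stencil are illustrative.

```c
/* Overlap pattern: start non-blocking one-sided gets for the halo cells,
 * compute the interior (which needs no remote data) while they complete,
 * then wait and finish the boundary points. */
#include <stdlib.h>
#include <mpi.h>

#define NLOC 1024                     /* interior points per process (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int me, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* u[1..NLOC] is local data; u[0] and u[NLOC+1] are halo cells */
    double *u, *unew = malloc((NLOC + 2) * sizeof *unew);
    MPI_Win win;
    MPI_Win_allocate((NLOC + 2) * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &u, &win);
    for (int i = 0; i <= NLOC + 1; i++) u[i] = me;

    MPI_Win_lock_all(0, win);         /* passive-target epoch for the whole run */
    MPI_Win_sync(win);                /* make the local initialisation visible */
    MPI_Barrier(MPI_COMM_WORLD);      /* neighbours have initialised their windows */

    int left = (me - 1 + np) % np, right = (me + 1) % np;
    MPI_Request req[2];

    /* 1. start pulling the halo values from both neighbours */
    MPI_Rget(&u[0],        1, MPI_DOUBLE, left,  NLOC, 1, MPI_DOUBLE, win, &req[0]);
    MPI_Rget(&u[NLOC + 1], 1, MPI_DOUBLE, right, 1,    1, MPI_DOUBLE, win, &req[1]);

    /* 2. compute the interior, which needs no remote data, while the gets fly */
    for (int i = 2; i <= NLOC - 1; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* 3. finish the transfers, then compute the two boundary points */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    unew[1]    = 0.5 * (u[0] + u[2]);
    unew[NLOC] = 0.5 * (u[NLOC - 1] + u[NLOC + 1]);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    free(unew);
    MPI_Finalize();
    return 0;
}
```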

    Disk Resident Arrays: An Array-Oriented I/O Library for Out-of-Core Computations

    In out-of-core computations, disk storage is treated as another level in the memory hierarchy, below cache, local memory, and (in a parallel computer) remote memories. However, the tools used to manage this storage are typically quite different from those used to manage access to local and remote memory. This disparity complicates the implementation of out-of-core algorithms and hinders portability. We describe a programming model that addresses this problem. This model allows parallel programs to use essentially the same mechanisms to manage the movement of data between any two adjacent levels in a hierarchical memory system. We take as our starting point the Global Arrays shared-memory model and library, which support a variety of operations on distributed arrays, including transfer between local and remote memories. We show how this model can be extended to support explicit transfer between global memory and secondary storage, and we define a Disk Resident Arrays library that supports …
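    The toy below is not the Disk Resident Arrays library; it only illustrates the idea the abstract describes, moving rectangular array sections between memory and disk with the same put/get style used between local and remote memory, here with plain POSIX I/O on a file holding a row-major 2-D array. The file name, array shape, and section bounds are illustrative.

```c
/* Toy "disk-resident array": rectangular sections of a row-major 2-D
 * array stored in a file are written and read back section by section,
 * mirroring put/get between adjacent levels of the memory hierarchy. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define COLS 1024                     /* columns of the disk-resident array */

/* write an in-memory patch rows [rlo,rhi] x cols [clo,chi] to the file,
 * mirroring a "put" into a lower level of the memory hierarchy */
static int disk_write_section(int fd, const double *buf, int ldbuf,
                              int rlo, int rhi, int clo, int chi)
{
    for (int r = rlo; r <= rhi; r++) {
        off_t off = ((off_t)r * COLS + clo) * sizeof(double);
        size_t n = (size_t)(chi - clo + 1) * sizeof(double);
        if (pwrite(fd, buf + (r - rlo) * ldbuf, n, off) != (ssize_t)n)
            return -1;
    }
    return 0;
}

/* read the same kind of patch back, mirroring a "get" */
static int disk_read_section(int fd, double *buf, int ldbuf,
                             int rlo, int rhi, int clo, int chi)
{
    for (int r = rlo; r <= rhi; r++) {
        off_t off = ((off_t)r * COLS + clo) * sizeof(double);
        size_t n = (size_t)(chi - clo + 1) * sizeof(double);
        if (pread(fd, buf + (r - rlo) * ldbuf, n, off) != (ssize_t)n)
            return -1;
    }
    return 0;
}

int main(void)
{
    int fd = open("disk_array.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* stage a 4 x 8 patch out to disk and bring it back */
    double out[4][8], in[4][8];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 8; j++) out[i][j] = i * 8 + j;

    disk_write_section(fd, &out[0][0], 8, 10, 13, 20, 27);
    disk_read_section(fd, &in[0][0], 8, 10, 13, 20, 27);

    printf("round trip ok: %s\n", in[3][7] == out[3][7] ? "yes" : "no");
    close(fd);
    return 0;
}
```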