123 research outputs found

CMSSL: A scalable scientific software library

Author: Johnsson S. Lennart
Publication venue
Publication date: 21/03/2018
Field of study

Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges for software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of operations and data motion, and through the automatic selection of algorithms at run-time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications.

Engineering and Applied Science

Harvard University - DASH

QCD on the connection machine: beyond *LISP

Author: Baillie Clive F.
Brickner Ralph G.
Johnsson S. Lennart
Publication venue: 'Elsevier BV'
Publication date: 02/04/1991
Field of study

We report on the status of code development for a simulation of quantum chromodynamics (QCD) with dynamical Wilson fermions on the Connection Machine model CM-2. Our original code, written in *Lisp, gave performance in the near-GFLOPS range. We have rewritten the most time-consuming parts of the code, including the matrix multiply and the communication, in the low-level programming system CMIS. Current versions of the code run at approximately 3.6 GFLOPS for the fermion matrix inversion, and we expect the next version to reach or exceed 5 GFLOPS.

Caltech Authors

Communication and I/O Libraries

Author: Johnsson S. Lennart
Worley Patrick
Publication venue
Publication date: 09/11/2015
Field of study

Engineering and Applied Science

Harvard University - DASH

Generalized Shuffle Permutations on Boolean Cubes

Author: Ho Ching-Tien
Johnsson S. Lennart
Publication venue
Publication date: 20/11/2015
Field of study

In a generalized shuffle permutation an address (a[q-1]a[q-2]...a[0]) receives its content from an address obtained through a cyclic shift on a subset of the q dimensions used for the encoding of the addresses. Bit-complementation may be combined with the shift. We give an algorithm that requires (K/2)+2 exchanges for K elements per processor when storage dimensions are part of the permutation and concurrent communication on all ports of every processor is possible. The number of element exchanges in sequence is independent of the number of processor dimensions sigma(r) in the permutation. With no storage dimensions in the permutation our best algorithm requires (sigma(r)+1)(K/(2 sigma(r))) element exchanges. We also give an algorithm for the case where sigma(r)=2, or where the real shuffle consists of a number of cycles of length two, that requires (K/2)+1 element exchanges in sequence when there is no bit-complementation. The lower bound is (K/2) for both real and mixed shuffles with no bit-complementation. The minimum number of communication start-ups is sigma(r) for both cases, which is also the lower bound. The data transfer time for communication restricted to one port per processor is sigma(r)(K/2), and the minimum number of start-ups is sigma(r). The analysis is verified by experimental results on the Intel iPSC/1, and for one case also on the Connection Machine.

Engineering and Applied Science

Harvard University - DASH
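
The address mapping in the abstract above can be illustrated with a minimal Python sketch. It assumes a cyclic shift by one position toward the more significant dimensions; the abstract does not fix the direction or the shift amount, so both are assumptions. The names `shuffle_source`, `apply_shuffle`, `dims` and `complement_mask` are hypothetical, and the sketch shows only the permutation itself, not the communication algorithms the paper analyzes.

```python
def shuffle_source(addr, dims, complement_mask=0):
    """Return the address whose content moves to `addr`.

    dims: the bit positions (dimensions) that take part in the cyclic
          shift, listed from least to most significant (an assumption).
    complement_mask: bits to complement after the shift (optional).
    """
    bits = [(addr >> d) & 1 for d in dims]
    # Rotate by one: the bit at dims[i-1] moves to dims[i], and the
    # bit at the last listed dimension wraps around to the first.
    rotated = bits[-1:] + bits[:-1]
    src = addr
    for d, b in zip(dims, rotated):
        src = (src & ~(1 << d)) | (b << d)
    return src ^ complement_mask


def apply_shuffle(data, dims, complement_mask=0):
    """Permute `data` (length 2**q) so that each address receives the
    content of its source address."""
    return [data[shuffle_source(a, dims, complement_mask)]
            for a in range(len(data))]


if __name__ == "__main__":
    # Shift over all three dimensions of a 3-bit address space.
    print(apply_shuffle(list(range(8)), [0, 1, 2]))
    # Shift over the subset {0, 2}: a set of cycles of length two.
    print(apply_shuffle(list(range(8)), [0, 2]))
```

Restricting `dims` to a proper subset leaves the remaining address bits untouched, which is what distinguishes the generalized shuffle from the ordinary one.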

On the Accuracy of Poisson's Formula Based N-Body Algorithms

Author: Hu Y. Charlie
Johnsson S. Lennart
Publication venue
Publication date: 04/01/2016
Field of study

We study the accuracy-cost tradeoffs of a Poisson's formula based hierarchical N-body method. The parameters that control the degree of approximation of the computational elements and the separateness of interacting elements govern both the arithmetic complexity and the accuracy of the method. Empirical models for predicting the execution time and the accuracy of the potential and force evaluations for three-dimensional problems are presented. We demonstrate how these models can be used to minimize the execution time for a prescribed error and verify the predictions through simulations on particle systems with up to one million particles. An interesting observation is that, for a given error, defining the near-field to consist of only nearest-neighbor elements yields a lower computational complexity than the two-element separation recommended in the literature. We also show that the particle distribution may have a significant impact on the error.

Engineering and Applied Science

Harvard University - DASH

Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer

Author: Johnsson S. Lennart
Mathur Kapil K.
Publication venue
Publication date: 06/10/2015
Field of study

Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been implemented on the Connection Machine system CM-200 are described. No assumption is made on the shape or size of the operands. For matrix-matrix multiplication, both the nonsystolic and the systolic algorithms are outlined. A systolic algorithm that computes the product matrix in-place is described in detail. We show that a level-3 DBLAS yields better performance than a level-2 DBLAS. On the Connection Machine system CM-200, blocking yields a performance improvement by a factor of up to three over level-2 DBLAS. For certain matrix shapes the systolic algorithms offer both improved performance and significantly reduced temporary storage requirements compared to the nonsystolic block algorithms. We show that, in order to minimize the communication time, an algorithm that leaves the largest operand matrix stationary should be chosen for matrix-matrix multiplication. Furthermore, it is shown both analytically and experimentally that the optimum shape of the processor array yields square stationary submatrices in each processor, i.e., the ratio between the lengths of the axes of the processing array must be the same as the ratio between the corresponding axes of the stationary matrix. The optimum processor array shape may yield a factor of square matrices. For rectangular matrices a factor of 30 improvement was observed for an optimum processor array shape compared to a poorly chosen processor array shape.

Engineering and Applied Science

Harvard University - DASH
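
The shape rule stated in the abstract above (the processor array axes should have the same ratio as the axes of the stationary matrix, giving square local submatrices) can be sketched as a small search. This is an illustration only: the function name `best_grid` is hypothetical, the processor count is assumed to be a power of two, and the sketch ignores the communication cost model used in the paper.

```python
import math


def best_grid(rows, cols, n_procs):
    """Pick a processor grid (pr, pc) with pr * pc == n_procs whose local
    blocks, of size rows/pr by cols/pc, are as close to square as possible.

    Assumes n_procs is a power of two, so only power-of-two pr is tried.
    """
    best = None
    pr = 1
    while pr <= n_procs:
        pc = n_procs // pr
        # skew == 0.0 means the local block is exactly square.
        skew = abs(math.log2((rows / pr) / (cols / pc)))
        if best is None or skew < best[0]:
            best = (skew, pr, pc)
        pr *= 2
    return best[1], best[2]


if __name__ == "__main__":
    print(best_grid(1024, 1024, 64))  # square stationary matrix
    print(best_grid(4096, 256, 64))   # tall stationary matrix
```

For the tall 4096 x 256 example, the selected grid has a 16:1 axis ratio, matching the 16:1 ratio of the matrix axes.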

All-to-All Communication on the Connection Machine CM-200

Author: Johnsson S. Lennart
Mathur Kapil K.
Publication venue
Publication date: 09/11/2015
Field of study

Detailed algorithms for all-to-all broadcast and reduction are given for arrays mapped by binary or binary-reflected Gray code encoding to the processing nodes of binary cube networks. Algorithms are also given for the local computation of the array indices for the communicated data, thereby reducing the demand for communications bandwidth. For the Connection Machine system CM-200, Hamiltonian cycle based all-to-all communication algorithms yield a performance that is a factor of two to ten higher than the performance offered by algorithms based on trees, butterfly networks, or the Connection Machine router. The peak data rate achieved for all-to-all broadcast on a 2048 node Connection Machine system CM-200 is 5.4 Gbytes/sec when no reordering is required. If the time for data reordering is included, then the effective peak data rate is reduced to 2.5 Gbytes/sec.

Engineering and Applied Science

Harvard University - DASH

Index Transformation Algorithms in a Linear Algebra Framework

Author: Edelman Alan
Heller Steve
Johnsson S. Lennart
Publication venue
Publication date: 21/01/2016
Field of study

We present a linear algebraic formulation for a class of index transformations such as Gray code encoding and decoding, matrix transpose, bit reversal, vector reversal, shuffles, and other index or dimension permutations. This formulation unifies, simplifies, and can be used to derive algorithms for hypercube multiprocessors. We show how all the widely known properties of Gray codes, and some not so well-known properties as well, can be derived using this framework. Using this framework, we relate hypercube communications algorithms to Gauss-Jordan elimination on a matrix of 0's and 1's.

Engineering and Applied Science

Harvard University - DASH
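
As an illustration of the linear algebraic view described in the abstract above, the standard binary-reflected Gray code can be written as a matrix-vector product over GF(2). The sketch below uses the textbook encoding g = b XOR (b >> 1); it is not claimed to reproduce the paper's own notation, and all function names are hypothetical.

```python
def to_bits(x, q):
    """Bit vector of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(q)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


def matvec_gf2(M, v):
    """Matrix-vector product with arithmetic modulo 2."""
    return [sum(m & x for m, x in zip(row, v)) % 2 for row in M]


def gray_matrix(q):
    """Encoding matrix: g_i = b_i + b_(i+1) (mod 2)."""
    return [[1 if j in (i, i + 1) else 0 for j in range(q)]
            for i in range(q)]


def gray_inverse_matrix(q):
    """Decoding matrix: b_i = g_i + g_(i+1) + ... + g_(q-1) (mod 2)."""
    return [[1 if j >= i else 0 for j in range(q)] for i in range(q)]


def gray_encode(x, q):
    return from_bits(matvec_gf2(gray_matrix(q), to_bits(x, q)))


def gray_decode(g, q):
    return from_bits(matvec_gf2(gray_inverse_matrix(q), to_bits(g, q)))


if __name__ == "__main__":
    q = 4
    for x in range(2 ** q):
        g = gray_encode(x, q)
        print(x, format(g, "04b"), gray_decode(g, q))
```

Because encoding and decoding are both invertible 0/1 matrices, composing index transformations reduces to multiplying such matrices modulo 2.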

Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage

Author: Edelman Alan
Ho Ching-Tien
Johnsson S. Lennart
Publication venue
Publication date: 11/03/2016
Field of study

For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place, a processor must receive about P^2/√N elements of each input operand, with operands of size P x P distributed evenly over N processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to P^2/(√N log N) for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity 2P^3, and parallel arithmetic complexity 2P^3/N. The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine.

Engineering and Applied Science

Harvard University - DASH
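
The quantities quoted in the abstract above can be checked with a short calculation. The sketch assumes perfectly linear scaling from 8K to 64K processors, which is how the "about 13 Gflops" figure appears to be extrapolated; the function names and the sample values P = 1024 and N = 64 are hypothetical.

```python
import math


def elements_received_per_operand(P, N):
    """About P^2 / sqrt(N) elements of each input operand per processor."""
    return P * P / math.sqrt(N)


def parallel_arithmetic(P, N):
    """Parallel arithmetic complexity 2 P^3 / N."""
    return 2 * P ** 3 / N


def scaled_rate(measured_gflops, measured_procs, target_procs):
    """Extrapolate a measured rate, assuming perfectly linear scaling."""
    return measured_gflops * target_procs / measured_procs


if __name__ == "__main__":
    print(elements_received_per_operand(1024, 64))
    print(parallel_arithmetic(1024, 64))
    # 1.6 Gflops on 8K processors, extrapolated to a 64K machine.
    print(round(scaled_rate(1.6, 8 * 1024, 64 * 1024), 1))
```

An eightfold increase in processors gives roughly 12.8 Gflops, consistent with the rounded figure in the abstract.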

Communication efficient multi-processor FFT

Author: Jacquemin Michel
Johnsson S. Lennart
Krawitz Robert L.
Publication venue
Publication date: 21/03/2018
Field of study

Computing the Fast Fourier Transform on a distributed memory architecture by a direct pipelined radix-2 algorithm, a bi-section algorithm, or a multi-section algorithm yields the same communications requirement in each case, provided that communication for all FFT stages can be performed concurrently, the input data is in normal order, and the data allocation is consecutive. With a cyclic data allocation, or bit-reversed input data and a consecutive allocation, multi-sectioning reduces the communications requirement by approximately a factor of two. For a consecutive data allocation and normal input order, a decimation-in-time FFT requires that (P/N) + d - 2 twiddle factors be stored for P elements distributed evenly over N processors, with the axis subject to transformation distributed over 2^d processors. No communication of twiddle factors is required. The same storage requirements hold for a decimation-in-frequency FFT, bit-reversed input order, and consecutive data allocation. The opposite combination of FFT type and data ordering requires a factor of log_2 N more storage for N processors. The peak performance for a Connection Machine system CM-200 implementation is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision for unordered transforms local to each processor. The corresponding execution rates for ordered transforms are 11.1 Gflops/s and 8.5 Gflops/s, respectively. For distributed one- and two-dimensional transforms the peak performance for unordered transforms exceeds 5 Gflops/s in 32-bit precision, and 3 Gflops/s in 64-bit precision. Three-dimensional transforms execute at a slightly lower rate. Distributed ordered transforms execute at a rate of about 1/2 to 2/3 of the unordered transforms.

Engineering and Applied Science

Harvard University - DASH