Efficient iteration in data-parallel programs with irregular and dynamically distributed data structures
To implement an efficient data-parallel program on a non-shared memory MIMD multicomputer, data and computations must be properly partitioned to achieve good load balance and locality of reference. Programs with irregular data reference patterns often require irregular partitions. Although good partitions may be easy to determine, they can be difficult or impossible to implement in programming languages that provide only regular data distributions, such as blocked or cyclic arrays. We are developing Onyx, a programming system that provides a shared memory model of distributed data structures and extends the concept of data distribution to include irregular and dynamic distributions. This provides a powerful means to specify irregular partitions. Perhaps surprisingly, programs using this model can also execute efficiently. In this paper, we describe and evaluate the Onyx implementation of a model problem that repeatedly executes an irregular but fixed data reference pattern. On an NCUBE hypercube, the speed of the Onyx implementation is comparable to that of carefully handwritten message-passing code.
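
The abstract does not show Onyx's interface, but the distinction it draws can be sketched directly: a regular distribution computes an element's owner from a closed-form rule, while an irregular distribution stores an explicit owner map that can express whatever partition a load balancer produces. The C sketch below is a hypothetical illustration of that contrast, not Onyx code; the names and the example map are assumptions.

    /* Sketch of regular vs. irregular data distributions (hypothetical,
     * not Onyx's interface).  Regular distributions compute an element's
     * owner from a formula; an irregular distribution stores an explicit
     * owner map, so any partition a load balancer produces can be used. */
    #include <stdio.h>

    #define N 12   /* number of array elements */
    #define P 4    /* number of processors     */

    int block_owner(int i)  { return i / ((N + P - 1) / P); }  /* blocked */
    int cyclic_owner(int i) { return i % P; }                  /* cyclic  */

    int main(void) {
        /* Irregular distribution: an explicit map, e.g. produced by a
         * graph partitioner for an irregular mesh (values assumed). */
        int owner[N] = {0, 0, 2, 1, 3, 2, 2, 0, 1, 3, 3, 1};

        for (int i = 0; i < N; i++)
            printf("elem %2d: block->P%d cyclic->P%d irregular->P%d\n",
                   i, block_owner(i), cyclic_owner(i), owner[i]);
        return 0;
    }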
Modeling node bandwidth limits and their effects on vector combining algorithms
Each node in a message-passing multicomputer typically has several communication links. However, the maximum aggregate communication speed of a node is often less than the sum of its individual link speeds. Such computers are called node bandwidth limited (NBL). The NBL constraint is important when choosing algorithms because it can change the relative performance of different algorithms that accomplish the same task. This paper introduces a model of communication performance for NBL computers and uses the model to analyze the overall performance of three algorithms for vector combining (global sum) on the Intel Touchstone DELTA computer. Each of the three algorithms is found to be at least 33% faster than the other two for some combinations of machine size and vector length. The NBL constraint is shown to significantly affect the conditions under which each algorithm is fastest.
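
For concreteness, the sketch below shows one classic vector-combining algorithm, recursive doubling, in which after log2(P) pairwise exchange rounds every process holds the global sum of the vector. It is a hedged illustration written in modern MPI (the paper's experiments used the DELTA's native messaging), assumes a power-of-two process count, and is not claimed to be one of the paper's three algorithms.

    /* Recursive-doubling vector combining (global sum): one classic
     * algorithm of the kind the paper compares.  Illustrative sketch,
     * not the paper's DELTA implementation; assumes P is a power of 2. */
    #include <mpi.h>
    #include <stdlib.h>

    void vector_sum(double *v, int len, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        double *recv = malloc(len * sizeof *recv);

        /* In log2(P) rounds, exchange the full vector with a partner
         * and accumulate; every process ends with the global sum. */
        for (int mask = 1; mask < size; mask <<= 1) {
            int partner = rank ^ mask;
            MPI_Sendrecv(v, len, MPI_DOUBLE, partner, 0,
                         recv, len, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < len; i++) v[i] += recv[i];
        }
        free(recv);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        double v[4] = {1, 2, 3, 4};        /* each rank contributes this */
        vector_sum(v, 4, MPI_COMM_WORLD);  /* v[i] becomes P * (i + 1)   */
        MPI_Finalize();
        return 0;
    }

Under an NBL model, the cost of each round depends on how many of a node's links are driven at once rather than on per-link speed alone, which is what can reorder the relative performance of algorithms like this one.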
Global arrays: A portable "shared-memory" programming model for distributed memory computers
Portability, efficiency, and ease of coding are all important considerations in choosing the programming model for a scalable parallel application. The message-passing programming model is widely used because of its portability, yet some applications are too complex to code in it while also trying to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. This paper describes a new approach, called Global Arrays (GA), that combines the better features of both other models, leading to both simple coding and efficient execution. The key concept of GA is that it provides a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. The authors have implemented GA libraries on a variety of computer systems, including the Intel DELTA and Paragon, the IBM SP-1 (all message-passers), the Kendall Square KSR-2 (a nonuniform access shared-memory machine), and networks of Unix workstations. They discuss the design and implementation of these libraries, report their performance, illustrate the use of GA in the context of computational chemistry applications, and describe the use of a GA performance visualization tool.
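
The idiom the paper describes (asynchronous, one-sided access to blocks of a distributed matrix) looks roughly like the following. This is a minimal sketch written against the present-day GA C bindings, which postdate this paper; take the exact calls and constants as assumptions rather than a record of the original interface.

    /* Minimal sketch of the Global Arrays get/compute/put idiom, using
     * the modern GA C bindings (which postdate this paper).  Any process
     * can read or write a block of the distributed matrix without the
     * owning process's explicit cooperation. */
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int dims[2] = {100, 100}, chunk[2] = {-1, -1};  /* let GA choose */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);
        GA_Zero(g_a);

        /* One-sided access: fetch a 4x4 block, whichever process
         * physically holds it. */
        int lo[2] = {10, 10}, hi[2] = {13, 13}, ld[1] = {4};
        double buf[16];
        NGA_Get(g_a, lo, hi, buf, ld);  /* no action needed by the owner */

        GA_Sync();                      /* collective synchronization    */
        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }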
Parallel inverse iteration with reorthogonalization
A parallel method for finding orthogonal eigenvectors of real symmetric tridiagonal matrices is described. The method uses inverse iteration with repeated Modified Gram-Schmidt (MGS) reorthogonalization of the unconverged iterates for clustered eigenvalues. This approach is more parallelizable than reorthogonalizing against fully converged eigenvectors, as is done by LAPACK's current DSTEIN routine. The new method is found to provide accuracy and speed comparable to DSTEIN's and to have good parallel scalability even for matrices with large clusters of eigenvalues. We present numerical results for residual and orthogonality tests, plus timings on IBM RS/6000 (sequential) and Intel Touchstone DELTA (parallel) computers.
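
The core kernel is easy to state: within a cluster, each unconverged iterate is orthogonalized against the other iterates by Modified Gram-Schmidt and then renormalized. A serial C sketch of that step follows; the paper's contribution is distributing this work across processors, which is not shown, and the function name here is ours.

    /* Modified Gram-Schmidt reorthogonalization of one inverse-iteration
     * iterate v (length n) against k other vectors q[0..k-1], in place,
     * followed by renormalization.  Serial illustration only. */
    #include <math.h>
    #include <stdio.h>

    void mgs_reorthogonalize(double *v, double **q, int k, int n) {
        for (int j = 0; j < k; j++) {   /* MGS: project against the      */
            double proj = 0.0;          /* already-updated v each pass   */
            for (int i = 0; i < n; i++) proj += q[j][i] * v[i];
            for (int i = 0; i < n; i++) v[i] -= proj * q[j][i];
        }
        double nrm = 0.0;
        for (int i = 0; i < n; i++) nrm += v[i] * v[i];
        nrm = sqrt(nrm);
        for (int i = 0; i < n; i++) v[i] /= nrm;
    }

    int main(void) {
        double q0[2] = {1.0, 0.0};
        double *q[1] = {q0};
        double v[2]  = {3.0, 4.0};
        mgs_reorthogonalize(v, q, 1, 2);
        printf("%g %g\n", v[0], v[1]);  /* prints 0 1 */
        return 0;
    }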
Performance of a fully parallel dense real symmetric eigensolver in quantum chemistry applications
The parallel performance of a dense real symmetric eigensolver for standard and generalized problems, based on bisection for eigenvalues and repeated inverse iteration with reorthogonalization for eigenvectors, is described. The performance of this solver, called PeIGS, is given for two test problems and for three "real-world" quantum chemistry applications: SCF-Hartree-Fock, density functional theory, and Moeller-Plesset theory. The distinguishing feature of the repeated inverse iteration and reorthogonalization method used by PeIGS is that reorthogonalization may be performed across multiple processors as dictated by the spectrum. For each problem we describe the spectrum and the clustering of the eigenvalues, the most important factor in determining the execution time. For a spectrum that is well spaced, there is essentially no reorthogonalization time; most of the time is consumed in the Householder reduction to tridiagonal form. For large clusters, almost all of the time is consumed in the Householder reduction and in reorthogonalization. Performance results from the Intel Paragon and Kendall Square Research KSR-2 are reported.
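
The bisection phase rests on a standard Sturm-count fact: the number of negative pivots in the LDL^T factorization of T - xI equals the number of eigenvalues of the symmetric tridiagonal T that lie below x, so individual eigenvalues can be located by repeatedly halving intervals. A serial sketch of that counting kernel follows; PeIGS's actual parallel partitioning of the bisection work is not shown.

    /* Sturm-count kernel behind bisection: count eigenvalues of a
     * symmetric tridiagonal T below x via the pivots of T - xI.
     * Serial sketch of the standard technique, not PeIGS source. */
    #include <stdio.h>
    #include <float.h>

    /* a[0..n-1]: diagonal, b[0..n-2]: off-diagonal of T */
    int eigenvalues_below(const double *a, const double *b, int n, double x) {
        int count = 0;
        double d = 1.0;
        for (int i = 0; i < n; i++) {
            double off = (i == 0) ? 0.0 : b[i-1] * b[i-1] / d;
            d = a[i] - x - off;              /* next LDL^T pivot        */
            if (d == 0.0) d = DBL_EPSILON;   /* guard against division
                                                by an exactly zero pivot */
            if (d < 0.0) count++;
        }
        return count;
    }

    int main(void) {
        /* T = [[2,-1,0],[-1,2,-1],[0,-1,2]];
         * eigenvalues are 2-sqrt(2), 2, 2+sqrt(2). */
        double a[3] = {2, 2, 2}, b[2] = {-1, -1};
        printf("below 2.5: %d\n", eigenvalues_below(a, b, 3, 2.5)); /* 2 */
        return 0;
    }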