
    Expanded delta networks for very large parallel computers

    In this paper we analyze a generalization of the traditional delta network, introduced by Patel [21], dubbed the Expanded Delta Network (EDN). These networks generally provide multiple paths that can be exploited to reduce contention in the network, resulting in increased performance. The crossbar and the traditional delta network are limiting cases of this class of networks: the delta network does not provide the multiple paths that the more general expanded delta networks do, and crossbars are too costly to use for large networks. The EDNs are analyzed with respect to their routing capabilities in the MIMD and SIMD models of computation. The concepts of capacity and clustering are also addressed. In massively parallel SIMD computers, the trend is to put a larger number of processors on a chip, but due to I/O constraints only a subset of the total number of processors may have access to the network. This is introduced as a Restricted Access Expanded Delta Network, of which the MasPar MP-1 router network is an example.
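    The self-routing property that underlies this family of networks can be illustrated with a small sketch (a hypothetical example, not taken from the paper): in a delta network built from n stages of b x b switches, each stage selects its output port from one base-b digit of the destination address, so a single destination tag routes a message regardless of the input it entered on.

    # Minimal sketch (hypothetical, not from the paper): digit-controlled
    # self-routing in a delta network built from n stages of b x b switches.
    # Each stage consumes one base-b digit of the destination address.
    def delta_route(dest: int, b: int, n: int) -> list[int]:
        """Return the output port chosen at each of the n stages for dest."""
        ports = []
        for stage in range(n):
            shift = b ** (n - 1 - stage)          # most significant digit first
            ports.append((dest // shift) % b)     # port selected at this stage
        return ports

    if __name__ == "__main__":
        # Destination 13 in a 16 x 16 delta network of 2 x 2 switches.
        print(delta_route(13, b=2, n=4))          # -> [1, 1, 0, 1], the binary digits of 13

    Loosely speaking, an expanded delta network offers more than one valid port sequence per destination, and that redundancy is the source of the contention reduction described above.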

    The "MIND" Scalable PIM Architecture

    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real-time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture.
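    As a loose illustration of message-driven, multithreaded, split-transaction execution (a toy sketch under our own simplifying assumptions, not the actual MIND mechanisms), a node can be modeled as a queue of incoming parcels, each naming a handler to run against local memory; a request parcel does not block its sender, and the matching reply arrives later as another parcel.

    # Toy sketch of message-driven, split-transaction execution (our own
    # simplification, not the MIND design): each "node" owns a slice of memory
    # and processes parcels; a remote load is issued as a request parcel and
    # completes later when the reply parcel is handled.
    from collections import deque

    class Node:
        def __init__(self, nid, memory):
            self.nid, self.memory = nid, memory
            self.inbox = deque()
            self.results = {}

        def handle(self, nodes):
            """Run one pending parcel to completion; never block on a reply."""
            if not self.inbox:
                return
            kind, payload = self.inbox.popleft()
            if kind == "load_req":                      # remote read request
                requester, addr, tag = payload
                nodes[requester].inbox.append(("load_reply", (tag, self.memory[addr])))
            elif kind == "load_reply":                  # continuation of the request
                tag, value = payload
                self.results[tag] = value

    if __name__ == "__main__":
        nodes = [Node(0, [10, 20]), Node(1, [30, 40])]
        # Node 1 requests word 1 of node 0's memory and keeps working meanwhile.
        nodes[0].inbox.append(("load_req", (1, 1, "x")))
        for _ in range(2):                              # drain: request, then reply
            for n in nodes:
                n.handle(nodes)
        print(nodes[1].results)                         # {'x': 20}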

    A Massively Parallel MIMD Implemented by SIMD Hardware?

    Both conventional wisdom and engineering practice hold that a massively parallel MIMD machine should be constructed using a large number of independent processors and an asynchronous interconnection network. In this paper, we suggest that it may be beneficial to implement a massively parallel MIMD using microcode on a massively parallel SIMD microengine; the synchronous nature of the system allows much higher performance to be obtained with simpler hardware. The primary disadvantage is simply that the SIMD microengine must serialize execution of different types of instructions - but again the static nature of the machine allows various optimizations that can minimize this detrimental effect. In addition to presenting the theory behind construction of efficient MIMD machines using SIMD microengines, this paper discusses how the techniques were applied to create a 16,384-processor shared memory barrier MIMD using a SIMD MasPar MP-1. Both the MIMD structure and benchmark results are presented. Even though the MasPar hardware is not ideal for implementing a MIMD and our microinterpreter was written in a high-level language (MPL), peak MIMD performance was 280 MFLOPS as compared to 1.2 GFLOPS for the native SIMD instruction set. Of course, comparing peak speeds is of dubious value; hence, we have also included a number of more realistic benchmark results.
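    The core of such a microinterpreter can be sketched briefly (a Python paraphrase with a made-up two-instruction set; the actual interpreter was written in MPL): every PE keeps a private program counter and accumulator, and on each step the controller loops over the opcodes currently pending, enabling only the PEs whose next instruction matches, so distinct opcodes are serialized while all PEs wanting the same opcode execute together.

    # Sketch of a MIMD-on-SIMD interpreter loop (simplified instruction set of
    # our own invention). Each PE has a private program, program counter, and
    # accumulator; the controller serializes by opcode but lets every PE whose
    # current instruction matches execute in the same pass.
    def run(programs, steps):
        pcs = [0] * len(programs)                 # per-PE program counters
        accs = [0] * len(programs)                # per-PE accumulators
        for _ in range(steps):
            current = [programs[p][pcs[p]] for p in range(len(programs))]
            for opcode in sorted({op for op, _ in current}):
                for p, (op, arg) in enumerate(current):
                    if op != opcode:
                        continue                  # this PE sits out the pass
                    if op == "addi":
                        accs[p] += arg
                    elif op == "muli":
                        accs[p] *= arg
                    pcs[p] = (pcs[p] + 1) % len(programs[p])
        return accs

    if __name__ == "__main__":
        pe0 = [("addi", 1), ("muli", 2)]          # the PEs run different code,
        pe1 = [("muli", 3), ("addi", 5)]          # which is the MIMD behavior
        print(run([pe0, pe1], steps=4))           # -> [6, 20]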

    Parametric micro-level performance models for parallel computing and parallel implementation of hydrostatic MM5

    This dissertation presents parametric micro-level performance models and a parallel implementation of the hydrostatic version of MM5. Parametric micro-level (PM) performance models are introduced to address the important issue of how to realistically model parallel performance. These models can be used to predict execution times and identify performance bottlenecks. The accurate prediction and analysis of execution times is achieved by incorporating precise details of interprocessor communication, memory operations, auxiliary instructions, and the effects of communication and computation schedules. The parameters provide the flexibility to study various algorithmic and architectural issues. The development and verification process, the parameters, and the scope of applicability of these models are discussed. A coherent view of performance is obtained from the execution profiles generated by PM models. The models are targeted at a large class of numerical algorithms commonly implemented on both SIMD and MIMD machines. Specific models are presented for matrix multiplication, LU decomposition, and FFT on a 2-D processor array with distributed memory. A case study includes comparison of parallel machines and parallel algorithms. In the comparison of parallel machines, PM models are used to analyze execution times so as to relate the performance to architectural attributes of a machine. In the comparison of parallel algorithms, PM models are used to study the performance of two LU decomposition algorithms: non-blocked and blocked. The two algorithms are compared to identify the tradeoffs between them. This analysis is useful for determining an optimum block size for the blocked algorithm. The case study is done on the MasPar MP-1 and MP-2 machines. The dissertation also describes the parallel implementation of the hydrostatic version of MM5 (the fifth-generation Mesoscale Model), which has been widely used for climate studies. The model was parallelized in a machine-independent manner using the Runtime System Library (RSL), a runtime library for handling message passing and index transformation. The dissertation discusses validation of the parallel implementation of MM5 using field data and presents performance results. The parallel model was tested on the IBM SP1, a distributed memory parallel computer.
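    As an illustration of what a micro-level parametric cost model looks like (the parameter names and default values below are placeholders of our own, not the calibrated PM models of the dissertation), the predicted time of a block-distributed matrix multiplication on a sqrt(p) x sqrt(p) processor grid can be written as a sum of per-element arithmetic, local memory, and per-message communication terms.

    # Illustrative parametric performance model for C = A * B on a sqrt(p) x
    # sqrt(p) grid with block-distributed N x N matrices. The parameters
    # (t_flop, t_mem, t_startup, t_word) and the cost expression are our own
    # placeholders, not the dissertation's calibrated models.
    def matmul_time(N, p, t_flop=1e-8, t_mem=5e-9, t_startup=1e-4, t_word=1e-7):
        q = int(round(p ** 0.5))          # grid dimension
        nb = N // q                       # block size held by each processor
        flops = 2 * nb * nb * N           # multiply-adds per processor
        mem_ops = 3 * nb * nb * q         # rough count of loads/stores
        # q - 1 block exchanges along each grid dimension, nb*nb words each
        comm = 2 * (q - 1) * (t_startup + nb * nb * t_word)
        return flops * t_flop + mem_ops * t_mem + comm

    if __name__ == "__main__":
        for p in (16, 64, 256):
            print(p, "processors:", round(matmul_time(1024, p), 3), "s")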

    A New Method for Efficient Parallel Solution of Large Linear Systems on a SIMD Processor.

    This dissertation proposes a new technique for efficient parallel solution of very large linear systems of equations on a SIMD processor. The model problem used to investigate both the efficiency and applicability of the technique was of a regular structure with semi-bandwidth β, and resulted from approximation of a second order, two-dimensional elliptic equation on a regular domain under Dirichlet and periodic boundary conditions. With only slight modifications, chiefly to properly account for the mathematical effects of varying bandwidths, the technique can be extended to encompass the solution of any regular, banded system. The computational model used was the MasPar MP-X (model 1208B), a massively parallel processor hostnamed hurricane and housed in the Concurrent Computing Laboratory of the Physics/Astronomy department, Louisiana State University. The maximum bandwidth for which the problem's size fits the nyproc × nxproc machine array exactly was determined. This size, as well as smaller ones, was used in four experiments to evaluate the efficiency of the new technique. Four benchmark algorithms, two direct (Gauss elimination (GE) and orthogonal factorization) and two iterative (symmetric over-relaxation (SOR) with ω = 2, and the conjugate gradient method (CG)), were used to test the efficiency of the new approach based upon three evaluation metrics: deviations of the computed results from the exact solution, measured as average absolute errors; the CPU times; and the megaflop rates of the executions. All the benchmarks except GE were implemented in parallel. In all evaluation categories the new approach outperformed the benchmarks, and markedly so when N ≫ p, where p is the number of processors and N the problem size. At the maximum system size, the new method was about 2.19 times as accurate and about 1.7 times faster than the benchmarks. But when the system size was much smaller than the machine size, the new approach's performance deteriorated precipitously; in this circumstance its performance was worse than that of GE, the serial code. Hence, this technique is recommended for the solution of linear systems with regular structures on array processors when the problem size is large in relation to the processor array size.
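    For context, one of the iterative benchmarks, point SOR for a 2-D Dirichlet problem, can be written in a few lines (a plain serial sketch with our own grid size, relaxation factor, and tolerance; neither the dissertation's parallel benchmark codes nor the new method itself is reproduced here).

    # Generic point-SOR sketch for a 2-D Poisson problem with homogeneous
    # Dirichlet boundaries (our own grid size, omega, and tolerance).
    import numpy as np

    def sor(f, omega=1.5, tol=1e-6, max_iter=10_000):
        u = np.zeros_like(f)              # boundary rows/columns stay at zero
        n, m = f.shape
        for _ in range(max_iter):
            change = 0.0
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    gs = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - f[i, j])
                    new = (1 - omega) * u[i, j] + omega * gs
                    change = max(change, abs(new - u[i, j]))
                    u[i, j] = new
            if change < tol:
                break
        return u

    if __name__ == "__main__":
        rhs = np.full((32, 32), -1.0)     # h**2 times the source term, folded in
        print(sor(rhs).max())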

    Quantum wave modeling on highly parallel distributed memory machines

    Parallel computers are finding major applications in almost all scientific and engineering disciplines. An interesting area that has received attention is quantum scattering. Algorithms for studying quantum scattering are computation intensive and hence suitable for parallel machines. The state-of-the-art methods developed for uniprocessors require the computation of two Fast Fourier Transforms (FFTs) at each time step. However, the communication overhead of implementing FFTs makes them an expensive operation on distributed memory parallel machines. The focus of this dissertation is the development of efficient parallel methods for studying the phenomenon of time-dependent quantum-wave scattering. The methods described belong to the class of integral equation methods, which involve the application of a repeated sequence of very short time-step propagations. Free propagation of a wavepacket is most easily handled in the so-called momentum representation, whereas the effect of the potential is most easily obtained in the coordinate representation. The two representations are Fourier transforms of each other. The algorithm presented eliminates the computation of FFTs by performing the propagation entirely within the coordinate representation. The required communication is only with the nearest neighbors and is load balanced, thus making the algorithm suitable for distributed memory parallel machines. Implementation results on the nCUBE hypercube and a comparison with standard FFT methods are also presented.
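    The communication pattern this buys can be sketched as follows (a toy one-dimensional example of our own in which the short-time step is a nearest-neighbor finite-difference stencil rather than the dissertation's integral-equation propagator): the wavefunction is block-distributed, each step applies a local potential term plus a 3-point kinetic stencil, and the only data exchanged are single ghost points with the two neighboring blocks.

    # Sketch of the nearest-neighbor, load-balanced communication pattern only
    # (toy 1-D explicit step; not the integral-equation method itself).
    import numpy as np

    def step(blocks, V_blocks, dt, dx):
        new_blocks = []
        for k, psi in enumerate(blocks):
            left = blocks[k - 1][-1] if k > 0 else 0.0                 # ghost point from left neighbor
            right = blocks[k + 1][0] if k < len(blocks) - 1 else 0.0   # ghost point from right neighbor
            padded = np.concatenate(([left], psi, [right]))
            lap = (padded[:-2] - 2 * psi + padded[2:]) / dx**2         # 3-point stencil
            dpsi = 1j * dt * (0.5 * lap - V_blocks[k] * psi)           # hbar = m = 1
            new_blocks.append(psi + dpsi)                              # first-order explicit step
        return new_blocks

    if __name__ == "__main__":
        x = np.linspace(-10, 10, 256)
        psi = np.exp(-x**2 + 2j * x).astype(complex)                   # Gaussian wavepacket
        V = 0.5 * x**2
        blocks, V_blocks = np.array_split(psi, 8), np.array_split(V, 8)  # 8 "processors"
        for _ in range(10):
            blocks = step(blocks, V_blocks, dt=1e-4, dx=x[1] - x[0])
        print(abs(np.concatenate(blocks)).max())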

    NASA high performance computing and communications program

    The National Aeronautics and Space Administration's HPCC program is part of a new Presidential initiative aimed at producing a 1000-fold increase in supercomputing speed and a 100-fold improvement in available communications capability by 1997. As more advanced technologies are developed under the HPCC program, they will be used to solve NASA's 'Grand Challenge' problems, which include improving the design and simulation of advanced aerospace vehicles, allowing people at remote locations to communicate more effectively and share information, increasing scientists' abilities to model the Earth's climate and forecast global environmental trends, and improving the development of advanced spacecraft. NASA's HPCC program is organized into three projects which are unique to the agency's mission: the Computational Aerosciences (CAS) project, the Earth and Space Sciences (ESS) project, and the Remote Exploration and Experimentation (REE) project. An additional project, the Basic Research and Human Resources (BRHR) project, exists to promote long-term research in computer science and engineering and to increase the pool of trained personnel in a variety of scientific disciplines. This document presents an overview of the objectives and organization of these projects as well as summaries of individual research and development programs within each project.

    Parallelization techniques for scientific and engineering applications and implementation of the boundary element method (BEM)

    This dissertation reports the implementation of a boundary element method (BEM) application on the massively parallel MasPar MP-1 and MP-2 computers. That implementation provides a case study to demonstrate several techniques for parallelization of sequential algorithms and for optimization of parallel programs. An existing formal technique for transforming a sequential algorithm into a systolic architecture is presented. The dissertation then discusses how a parallel systolic algorithm on a mesh-connected computer can be derived from such a systolic architecture; the matrix multiplication algorithm used in the BEM implementation is derived in this way. As part of the BEM implementation, the dissertation covers a novel method of solving a system of linear equations using matrix inversion and LU decomposition. This method is shown to be less expensive than LU decomposition alone. Several parallelizations of matrix inversion are considered. Finally, the dissertation presents techniques for transforming parallel program source code to increase performance. The transformation improves performance by decreasing processor local memory access cost and by increasing processor utilization.
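    To give a flavor of a systolic matrix multiplication on a mesh (the sketch below is Cannon's algorithm simulated serially in Python, a standard mesh algorithm rather than the formally derived one used in the dissertation): blocks of A and B are skewed onto a q x q mesh, and each step every node multiplies its local blocks and then shifts its A block left and its B block up by one position.

    # Cannon-style systolic mesh matrix multiplication, simulated serially
    # (a standard illustration, not the dissertation's derived algorithm).
    import numpy as np

    def cannon_matmul(A, B, q):
        nb = A.shape[0] // q
        blk = lambda M, i, j: M[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy()
        # Initial skew: row i of A shifted left by i, column j of B shifted up by j.
        a = [[blk(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
        b = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
        c = [[np.zeros((nb, nb)) for _ in range(q)] for _ in range(q)]
        for _ in range(q):
            for i in range(q):
                for j in range(q):
                    c[i][j] += a[i][j] @ b[i][j]                           # local block multiply
            a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift A left
            b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B up
        return np.block(c)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
        print(np.allclose(cannon_matmul(A, B, q=4), A @ B))                # True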

    Medical image tomography: A statistically tailored neural network approach

    In medical computed tomography (CT) the tomographic images are reconstructed from planar information collected 180° to 360° around the patient. In clinical applications, the reconstructions are typically produced using a filtered backprojection algorithm. Filtered backprojection methods have limitations that create a high percentage of statistical uncertainty in the reconstructed images. Many techniques have been developed that produce better reconstructions, but they tend to be computationally expensive and thus impractical for clinical use. Artificial neural networks (ANNs) have been shown to be adept at learning and then simulating complex functional relationships. For medical tomography, a neural network can be trained to produce a reconstructed medical image given the planar data as input. Once trained, an ANN can produce an accurate reconstruction very quickly. A backpropagation ANN with statistically derived activation functions has been developed to improve the trainability and generalization ability of a network that produces accurate reconstructions. The tailored activation functions are derived from the estimated probability density functions (p.d.f.s) of the ANN training data set. A set of sigmoid derivative functions is fitted to the p.d.f.s and then integrated to produce the ANN activation functions, which are also estimates of the cumulative distribution functions (c.d.f.s) of the training data. The statistically tailored activation functions and their derivatives are substituted for the logistic function and its derivative that are typically used in backpropagation ANNs. A set of geometric images was derived for training an ANN for cardiac SPECT image reconstruction. The planar projections for the geometric images were simulated using the Monte Carlo method to produce sixty-four 64-quadrant planar views taken 180° about each image. A 4096 x 629 x 4096 architecture ANN was simulated on the MasPar MP-2, a massively parallel single-instruction multiple-data (SIMD) computer. The ANN was trained on the set of geometric tomographic images. Trained on the geometric images, the ANN was able to generalize the planar-data-to-tomogram mapping and accurately reconstruct actual cardiac SPECT images.
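    The tailoring procedure can be approximated in a few lines (a deliberate simplification: a single logistic c.d.f. fitted to the training data by moment matching, rather than the sum of fitted sigmoid derivative terms described in the dissertation); the fitted c.d.f. becomes the activation function and its analytic derivative is used in backpropagation.

    # Simplified sketch of a statistically tailored activation: fit one
    # logistic c.d.f. to the training data by moment matching (the dissertation
    # instead fits and integrates a set of sigmoid derivative functions).
    import numpy as np

    def fit_tailored_activation(samples):
        mu = samples.mean()
        s = samples.std() * np.sqrt(3) / np.pi   # logistic scale from the sample std
        def activation(x):                       # estimate of the data's c.d.f.
            return 1.0 / (1.0 + np.exp(-(x - mu) / s))
        def derivative(x):                       # used in the backpropagation update
            a = activation(x)
            return a * (1.0 - a) / s
        return activation, derivative

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        data = rng.normal(loc=0.3, scale=0.1, size=10_000)   # stand-in training values
        act, dact = fit_tailored_activation(data)
        print(act(0.3), dact(0.3))               # activation is ~0.5 at the data median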

    Probabilistic structural mechanics research for parallel processing computers

    Aerospace structures and spacecraft are a complex assemblage of structural components that are subjected to a variety of complex, cyclic, and transient loading conditions. Significant modeling uncertainties are present in these structures, in addition to the inherent randomness of material properties and loads. To properly account for these uncertainties in evaluating and assessing the reliability of these components and structures, probabilistic structural mechanics (PSM) procedures must be used. Much research has focused on basic theory development and on the development of approximate analytic solution methods in random vibrations and structural reliability. Practical application of PSM methods has been hampered by their computationally intensive nature: solution of PSM problems requires repeated analyses of structures that are often large and that exhibit nonlinear and/or dynamic response behavior. These methods are all inherently parallel and ideally suited to implementation on parallel processing computers. New hardware architectures and innovative control software and solution methodologies are needed to make the solution of large-scale PSM problems practical.
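    The inherently parallel character of these methods is easiest to see in the Monte Carlo case (a generic sketch with a toy limit-state function of our own, not one of the program's structural models): every sample is an independent structural analysis, so samples can be farmed out to processors with no communication until the final tally.

    # Generic Monte Carlo reliability sketch (toy load/capacity limit state of
    # our own, not an actual structural model). Each chunk of samples is an
    # independent unit of work; only the failure counts are combined.
    import numpy as np
    from multiprocessing import Pool

    def failures_in_chunk(args):
        seed, n = args
        rng = np.random.default_rng(seed)
        load = rng.normal(100.0, 15.0, n)                    # random demand
        capacity = rng.lognormal(np.log(150.0), 0.1, n)      # random resistance
        return int(np.sum(load > capacity))                  # failures: demand exceeds resistance

    if __name__ == "__main__":
        chunks = [(seed, 250_000) for seed in range(8)]      # 8 independent work units
        with Pool(4) as pool:
            fails = sum(pool.map(failures_in_chunk, chunks))
        total = sum(n for _, n in chunks)
        print("estimated probability of failure:", fails / total)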