
    A Massively Parallel MIMD Implemented by SIMD Hardware?

    Both conventional wisdom and engineering practice hold that a massively parallel MIMD machine should be constructed using a large number of independent processors and an asynchronous interconnection network. In this paper, we suggest that it may be beneficial to implement a massively parallel MIMD using microcode on a massively parallel SIMD microengine; the synchronous nature of the system allows much higher performance to be obtained with simpler hardware. The primary disadvantage is simply that the SIMD microengine must serialize execution of different types of instructions, but again the static nature of the machine allows various optimizations that can minimize this detrimental effect. In addition to presenting the theory behind construction of efficient MIMD machines using SIMD microengines, this paper discusses how the techniques were applied to create a 16,384-processor shared memory barrier MIMD using a SIMD MasPar MP-1. Both the MIMD structure and benchmark results are presented. Even though the MasPar hardware is not ideal for implementing a MIMD and our microinterpreter was written in a high-level language (MPL), peak MIMD performance was 280 MFLOPS as compared to 1.2 GFLOPS for the native SIMD instruction set. Of course, comparing peak speeds is of dubious value; hence, we have also included a number of more realistic benchmark results.
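
    The serialization cost mentioned in this abstract can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the paper's MPL microinterpreter: the three-opcode instruction set, the accumulator-only processing elements (PEs), and the name `simd_interpret` are all hypothetical. Each PE holds its own program and program counter; on every cycle the control unit broadcasts each opcode type in turn, and only the PEs whose current instruction matches are enabled.

```python
def simd_interpret(programs, max_cycles=1000):
    """Toy SIMD interpreter for per-PE (MIMD-style) programs.

    programs[i] is a list of (opcode, operand) pairs for PE i.
    Opcodes ("add", "mul", "neg") are hypothetical, chosen for illustration.
    Returns the final accumulators and the number of opcode broadcasts.
    """
    n = len(programs)
    pc = [0] * n    # one program counter per PE
    acc = [0] * n   # one accumulator per PE
    broadcasts = 0
    for _ in range(max_cycles):
        # Fetch: every PE reads its own next instruction.
        current = [
            programs[i][pc[i]] if pc[i] < len(programs[i]) else ("halt", 0)
            for i in range(n)
        ]
        if all(op == "halt" for op, _ in current):
            break
        # Execute: the single control unit serializes over opcode types.
        for opcode in ("add", "mul", "neg"):
            enabled = [i for i in range(n) if current[i][0] == opcode]
            if not enabled:
                continue  # no PE needs this opcode, so skip the broadcast
            broadcasts += 1
            for i in enabled:  # models one masked SIMD step
                operand = current[i][1]
                if opcode == "add":
                    acc[i] += operand
                elif opcode == "mul":
                    acc[i] *= operand
                else:
                    acc[i] = -acc[i]
                pc[i] += 1
    return acc, broadcasts
```

    In this model the cost of a cycle grows with the number of distinct opcodes the PEs want at that moment, not with the number of PEs, which is the trade-off the abstract describes.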

    Computer vision algorithms on reconfigurable logic arrays


    Quantum wave modeling on highly parallel distributed memory machines

    Parallel computers are finding major applications in almost all scientific and engineering disciplines. An interesting area that has received attention is quantum scattering. Algorithms for studying quantum scattering are computation intensive and hence suitable for parallel machines. The state-of-the-art methods developed for uniprocessors require the computation of two Fast Fourier Transforms (FFTs) at each time step. However, the communication overhead of implementing FFTs makes them an expensive operation on distributed memory parallel machines. The focus of this dissertation is the development of efficient parallel methods for studying the phenomenon of time-dependent quantum-wave scattering. The methods described belong to the class of integral equation methods, which involve the application of a repeated sequence of very short time step propagations. Free propagation of a wavepacket is most easily handled in the so-called momentum representation, whereas the effect of the potential is most easily obtained in the coordinate representation. The two representations are Fourier Transforms of each other. The algorithm presented eliminates the computation of FFTs by performing the propagation totally within the coordinate representation. The communication required is only with the nearest neighbors and is load balanced, thus making the algorithm suitable for distributed memory parallel machines. Implementation results on the nCUBE hypercube and comparison with standard FFT methods are also presented.
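
    To make the "two transforms per time step" baseline concrete, here is a minimal sketch of one step of the standard split-operator scheme that the abstract contrasts against. It is not the dissertation's coordinate-space algorithm. It assumes units with hbar = m = 1, a periodic one-dimensional grid, and a naive O(n^2) discrete Fourier transform in place of an FFT so that the block needs only the standard library; the names `dft` and `split_step` are hypothetical.

```python
import cmath
import math


def dft(x, sign):
    """Naive discrete Fourier transform (sign=-1 forward, +1 unnormalized inverse)."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def split_step(psi, potential, dx, dt):
    """Advance psi by one time step dt (assumes hbar = m = 1, periodic grid)."""
    n = len(psi)
    # Half step of the potential, applied pointwise in coordinate space.
    half = [cmath.exp(-0.5j * v * dt) for v in potential]
    psi = [h * p for h, p in zip(half, psi)]
    # Transform 1: go to the momentum representation.
    phi = dft(psi, -1)
    for k in range(n):
        freq = k if k < n // 2 else k - n          # signed frequency index
        kval = 2 * math.pi * freq / (n * dx)       # wavenumber on the grid
        phi[k] *= cmath.exp(-0.5j * kval * kval * dt)  # free propagation
    # Transform 2: return to the coordinate representation.
    psi = [value / n for value in dft(phi, +1)]
    # Second half step of the potential.
    return [h * p for h, p in zip(half, psi)]
```

    Every factor applied here has unit modulus, so the discrete norm of the wavepacket is conserved up to rounding. On a distributed memory machine the two transforms are the steps that need global communication, which is the cost the dissertation's method is designed to avoid.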

    The "MIND" Scalable PIM Architecture

    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture

    A GPU-accelerated Branch-and-Bound Algorithm for the Flow-Shop Scheduling Problem

    Branch-and-Bound (B&B) algorithms are time-intensive tree-based exploration methods for solving combinatorial optimization problems to optimality. In this paper, we investigate the use of GPU computing as a major complementary way to speed up those methods. The focus is put on the bounding mechanism of B&B algorithms, which is the most time-consuming part of their exploration process. We propose a parallel B&B algorithm based on a GPU-accelerated bounding model. The proposed approach concentrates on optimizing data access management to further improve the performance of the bounding mechanism, which uses large and intermediate data sets that do not completely fit in GPU memory. Extensive experiments have been carried out on well-known flow-shop scheduling problem (FSP) benchmarks using an Nvidia Tesla C2050 GPU card. We compared the obtained performance to single-threaded and multithreaded CPU-based executions. Speedups of up to 100x are achieved for large problem instances.
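
    The role of the bounding step can be shown with a small sequential sketch. Everything here is an assumption made for illustration: the simple machine-load lower bound is not the bounding function used in the paper, nothing runs on a GPU, and the names `completion_times`, `lower_bound`, and `branch_and_bound` are hypothetical. The matrix `p[j][k]` holds the processing time of job `j` on machine `k`.

```python
def completion_times(p, order):
    """Completion time on each machine after scheduling `order` in sequence."""
    machines = len(p[0])
    c = [0] * machines
    for job in order:
        c[0] += p[job][0]
        for k in range(1, machines):
            # A job starts on machine k once machine k is free and the job
            # has finished on machine k-1.
            c[k] = max(c[k], c[k - 1]) + p[job][k]
    return c


def lower_bound(p, prefix, remaining):
    """Simple bound: each machine must still process every remaining job."""
    c = completion_times(p, prefix)
    return max(c[k] + sum(p[j][k] for j in remaining) for k in range(len(c)))


def branch_and_bound(p):
    """Return (best makespan, best job order) for a permutation flow shop."""
    best = [float("inf"), None]

    def explore(prefix, remaining):
        if not remaining:
            cost = completion_times(p, prefix)[-1]
            if cost < best[0]:
                best[0], best[1] = cost, list(prefix)
            return
        for job in sorted(remaining):
            rest = remaining - {job}
            # Bounding: prune subtrees that cannot beat the incumbent.
            if lower_bound(p, prefix + [job], rest) < best[0]:
                explore(prefix + [job], rest)

    explore([], frozenset(range(len(p))))
    return best[0], best[1]
```

    The call to `lower_bound` is evaluated once per generated node, which is why bounding dominates the run time and why the paper targets that step for GPU offloading.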

    Data-parallel concurrent constraint programming.

    by Bo-ming Tong. Thesis (M.Phil.)--Chinese University of Hong Kong, 1994. Includes bibliographical references (leaves 104-[110]).
    Contents:
    Chapter 1. Introduction: 1.1 Concurrent Constraint Programming; 1.2 Finite Domain Constraints
    Chapter 2. The Firebird Language: 2.1 Finite Domain Constraints; 2.2 The Firebird Computation Model; 2.3 Miscellaneous Features; 2.4 Clause-Based Nondeterminism; 2.5 Programming Examples; 2.5.1 Magic Series; 2.5.2 Weak Queens
    Chapter 3. Operational Semantics: 3.1 The Firebird Computation Model; 3.2 The Firebird Commit Law; 3.3 Derivation; 3.4 Correctness of Firebird Computation Model
    Chapter 4. Exploitation of Data-Parallelism in Firebird: 4.1 An Illustrative Example; 4.2 Mapping Partitions to Processor Elements; 4.3 Masks; 4.4 Control Strategy; 4.4.1 A Control Strategy Suitable for Linear Equations
    Chapter 5. Data-Parallel Abstract Machine: 5.1 Basic DPAM; 5.1.1 Hardware Requirements; 5.1.2 Procedure Calling Convention And Process Creation; 5.1.3 Memory Model; 5.1.4 Registers; 5.1.5 Process Management; 5.1.6 Unification; 5.1.7 Variable Table; 5.2 DPAM with Backtracking; 5.2.1 Choice Point; 5.2.2 Trailing; 5.2.3 Recovering the Process Queues
    Chapter 6. Implementation: 6.1 The DECmpp Massively Parallel Computer; 6.2 Implementation Overview; 6.3 Constraints; 6.3.1 Breaking Down Equality Constraints; 6.3.2 Processing the Constraint 'As Is'; 6.4 The Wide-Tag Architecture; 6.5 Register Window; 6.6 Dereferencing; 6.7 Output; 6.7.1 Collecting the Solutions; 6.7.2 Decoding the solution
    Chapter 7. Performance: 7.1 Uniprocessor Performance; 7.2 Solitary Mode; 7.3 Bit Vectors of Domain Variables; 7.4 Heap Consumption of the Heap Frame Scheme; 7.5 Eager Nondeterministic Derivation vs Lazy Nondeterministic Derivation; 7.6 Priority Scheduling; 7.7 Execution Profile; 7.8 Effect of the Number of Processor Elements on Performance; 7.9 Change of the Degree of Parallelism During Execution
    Chapter 8. Related Work: 8.1 Vectorization of Prolog; 8.2 Parallel Clause Matching; 8.3 Parallel Interpreter; 8.4 Bounded Quantifications; 8.5 SIMD MultiLog
    Chapter 9. Conclusion: 9.1 Limitations; 9.1.1 Data-Parallel Firebird is Specialized; 9.1.2 Limitations of the Implementation Scheme; 9.2 Future Work; 9.2.1 Extending Firebird; 9.2.2 Improvements Specific to DECmpp; 9.2.3 Labeling; 9.2.4 Parallel Domain Consistency; 9.2.5 Branch and Bound Algorithm; 9.2.6 Other Possible Future Work
    Bibliography

    Automatic visual recognition using parallel machines

    Invariant features and quick matching algorithms are two major concerns in the area of automatic visual recognition. The former reduce the size of an established model database, and the latter shorten the computation time. This dissertation discusses both line invariants under perspective projection and the parallel implementation of a dynamic programming technique for shape recognition. The feasibility of using parallel machines is demonstrated through the dramatically reduced time complexity. In this dissertation, our algorithms are implemented on the AP1000 MIMD parallel machine. For processing an object with n features, the time complexity of the proposed parallel algorithm is O(n), while that of a uniprocessor is O(n^2). Two applications, one for shape matching and the other for chain-code extraction, are used to demonstrate the usefulness of our methods. Invariants from four general lines under perspective projection are also discussed. In contrast to the approach that uses epipolar geometry, we investigate the invariants under isotropy subgroups. Theoretically speaking, two independent invariants can be found for four general lines in 3D space. In practice, we show how to obtain these two invariants from the projective images of four general lines without the need for camera calibration. A projective invariant recognition system based on a hypothesis-generation-testing scheme is run on the hypercube parallel architecture. Object recognition is achieved by matching the scene projective invariants to the model projective invariants, a step called transfer.

    Limits to parallelism in scientific computing

    The goal of our research is to decrease the execution time of scientific computing applications. We exploit the application's inherent parallelism to achieve this goal. This exploitation is expensive as we analyze sequential applications and port them to parallel computers. Many scientific computing problems appear to have considerable exploitable parallelism; however, upon implementing a parallel solution on a parallel computer, limits to the parallelism are encountered. Unfortunately, many of these limits are characteristic of a specific parallel computer. This thesis explores these limits. We study the feasibility of exploiting the inherent parallelism of four NASA scientific computing applications. We use simple models to predict each application's degree of parallelism at several levels of granularity. From this analysis, we conclude that it is infeasible to exploit the inherent parallelism of two of the four applications. The interprocessor communication of one application is too expensive relative to its computation cost. The input and output costs of the other application are too expensive relative to its computation cost. We exploit the parallelism of the remaining two applications and measure their performance on an Intel iPSC/2 parallel computer. We parallelize an Optimal Control Boundary Value Problem. This guidance control problem determines an optimal trajectory of a boat in a river. We parallelize the Carbon Dioxide Slicing technique, which is a macrophysical cloud property retrieval algorithm. This technique computes the height at the top of a cloud using cloud imager measurements. We consider the feasibility of exploiting its massive parallelism on a MasPar MP-2 parallel computer. We conclude that many limits to parallelism are surmountable while other limits are inescapable. From these limits, we elucidate some fundamental issues that must be considered when porting similar problems to yet-to-be-designed computers. We conclude that technological improvements that reduce the isolation of computational units free a programmer from many current concerns about the granularity of the work. We also conclude that technological improvements that relax the regimented guidance of the computational units allow a programmer to exploit the inherent heterogeneous parallelism of many applications.