Search CORE

734 research outputs found

Parallel alogorithms for MIMD parallel computers

Author: Hilal M. Yousif (7169864)
Publication venue
Publication date: 01/01/1986
Field of study

This thesis mainly covers the design and analysis of asynchronous parallel algorithms that can be run on MIMD (Multiple Instruction Multiple Data) parallel computers, in particular the NEPTUNE system at Loughborough University. Initially the fundamentals of parallel computer architectures are introduced with different parallel architectures being described and compared. The principles of parallel programming and the design of parallel algorithms are also outlined. Also the main characteristics of the 4 processor MIMD NEPTUNE system are presented, and performance indicators, i.e. the speed-up and the efficiency factors are defined for the measurement of parallelism in a given system. Both numerical and non-numerical algorithms are covered in the thesis. In the numerical solution of partial differential equations, a new parallel 9-point block iterative method is developed. Here, the organization of the blocks is done in such a way that each process contains its own group of 9 points on the network, therefore, they can be run in parallel. The parallel implementation of both 9-point and 4- point block iterative methods were programmed using natural and redblack ordering with synchronous and asynchronous approaches. The results obtained for these different implementations were compared and analysed. Next the parallel version of the A.G.E. (Alternating Group Explicit) method is developed in which the explicit nature of the difference equation is revealed and exploited when applied to derive the solution of both linear and non-linear 2-point boundary value problems. Two strategies have been used in the implementation of the parallel A.G.E. method using the synchronous and asynchronous approaches. The results from these implementations were compared. Also for comparison reasons the results obtained from the parallel A.G.E. were compared with the ~ corresponding results obtained from the parallel versions of the Jacobi, Gauss-Seidel and S.O.R. methods. Finally, a computational complexity analysis of the parallel A.G.E. algorithms is included. In the area of non-numeric algorithms, the problems of sorting and searching were studied. The sorting methods which were investigated was the shell and the digit sort methods. with each method different parallel strategies and approaches were used and compared to find the best results which can be obtained on the parallel machine. In the searching methods, the sequential search algorithm in an unordered table and the binary search algorithms were investigated and implemented in parallel with a presentation of the results. Finally, a complexity analysis of these methods is presented. The thesis concludes with a chapter summarizing the main results

Inflated speedups in parallel simulations via malloc()

Author: Nicol David M.
Publication venue
Publication date
Field of study

Discrete-event simulation programs make heavy use of dynamic memory allocation in order to support simulation's very dynamic space requirements. When programming in C one is likely to use the malloc() routine. However, a parallel simulation which uses the standard Unix System V malloc() implementation may achieve an overly optimistic speedup, possibly superlinear. An alternate implementation provided on some (but not all systems) can avoid the speedup anomaly, but at the price of significantly reduced available free space. This is especially severe on most parallel architectures, which tend not to support virtual memory. It is shown how a simply implemented user-constructed interface to malloc() can both avoid artificially inflated speedups, and make efficient use of the dynamic memory space. The interface simply catches blocks on the basis of their size. The problem is demonstrated empirically, and the effectiveness of the solution is shown both empirically and analytically

SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors (SMPs) (Preliminary Version)

Author: Bader D.A.
Publication venue: UNM Digital Repository
Publication date: 01/11/1998
Field of study

Dynamic Systolization for Developing Multiprocessor Supercomputers

Author: Hwang Kai
Xu Zhiwei
Publication venue: 'Purdue University (bepress)'
Publication date: 01/10/1984
Field of study

A dynamic network approach is introduced for developing reconfigurable, systolic arrays or wavefront processors; This allows one to design very powerful and flexible processors to be used in a general-purpose, reconfigurable, and fault-tolerant, multiprocessor computer system. The concepts of macro-dataflow and multitasking can be integrated to handle variable-resolution granularities in computationally intensive algorithms. A multiprocessor architecture, Remps, is proposed based on these design methodologies. The Remps architecture is generalized from the Cedar, HEP, Cray X- MP, Trac, NYU ultracomputer, S-l, Pumps, Chip, and SAM projects. Our goal is to provide a multiprocessor research model for developing design methodologies, multiprocessing and multitasking supports, dynamic systolic/wavefront array processors, interconnection networks, reconfiguration techniques, and performance analysis tools. These system design and operational techniques should be useful to those who are developing or evaluating multiprocessor supercomputers

Simulation Of Multi-core Systems And Interconnections And Evaluation Of Fat-Mesh Networks

Author: Zhang Yu
Publication venue
Publication date: 28/01/2009
Field of study

Simulators are very important in computer architecture research as they enable the exploration of new architectures to obtain detailed performance evaluation without building costly physical hardware. Simulation is even more critical to study future many-core architectures as it provides the opportunity to assess currently non-existing computer systems. In this thesis, a multiprocessor simulator is presented based on a cycle accurate architecture simulator called SESC. The shared L2 cache system is extended into a distributed shared cache (DSC) with a directory-based cache coherency protocol. A mesh network module is extended and integrated into SESC to replace the bus for scalable inter-processor communication. While these efforts complete an extended multiprocessor simulation infrastructure, two interconnection enhancements are proposed and evaluated. A novel non-uniform fat-mesh network structure similar to the idea of fat-tree is proposed. This non-uniform mesh network takes advantage of the average traffic pattern, typically all-to-all in DSC, to dedicate additional links for connections with heavy traffic (e.g., near the center) and fewer links for lighter traffic (e.g., near the periphery). Two fat-mesh schemes are implemented based on different routing algorithms. Analytical fat-mesh models are constructed by presenting the expressions for the traffic requirements of personalized all-to-all traffic. Performance improvements over the uniform mesh are demonstrated in the results from the simulator. A hybrid network consisting of one packet switching plane and multiple circuit switching planes is constructed as the second enhancement. The circuit switching planes provide fast paths between neighbors with heavy communication traffic. A compiler technique that abstracts the symbolic expressions of benchmarks' communication patterns can be used to help facilitate the circuit establishment

Three Highly Parallel Computer Architectures and Their Suitability for Three Representative Artificial Intelligence Problems

Author: Katriel Ron
Publication venue: ScholarlyCommons
Publication date: 28/09/1987
Field of study

Virtually all current Artificial Intelligence (AI) applications are designed to run on sequential (von Neumann) computer architectures. As a result, current systems do not scale up. As knowledge is added to these systems, a point is reached where their performance quickly degrades. The performance of a von Neumann machine is limited by the bandwidth between memory and processor (the von Neumann bottleneck). The bottleneck is avoided by distributing the processing power across the memory of the computer. In this scheme the memory becomes the processor (a smart memory ). This paper highlights the relationship between three representative AI application domains, namely knowledge representation, rule-based expert systems, and vision, and their parallel hardware realizations. Three machines, covering a wide range of fundamental properties of parallel processors, namely module granularity, concurrency control, and communication geometry, are reviewed: the Connection Machine (a fine-grained SIMD hypercube), DADO (a medium-grained MIMD/SIMD/MSIMD tree-machine), and the Butterfly (a coarse-grained MIMD Butterflyswitch machine)