210 research outputs found

    A Practical Hierarchial Model of Parallel Computation: The Model

    Get PDF
    We introduce a model of parallel computation that retains the ideal properties of the PRAM by using it as a sub-model, while simultaneously being more reflective of realistic parallel architectures by accounting for and providing abstract control over communication and synchronization costs. The Hierarchical PRAM (H-PRAM) model controls conceptual complexity in the face of asynchrony in two ways. First, by providing the simplifying assumption of synchronization to the design of algorithms, but allowing the algorithms to work asynchronously with each other; and organizing this control asynchrony via an implicit hierarchy relation. Second, by allowing the restriction of communication asynchrony in order to obtain determinate algorithms (thus greatly simplifying proofs of correctness). It is shown that the model is reflective of a variety of existing and proposed parallel architectures, particularly ones that can support massive parallelism. Relationships to programming languages are discussed. Since the PRAM is a sub-model, we can use PRAM algorithms as sub-algorithms in algorithms for the H-PRAM; thus results that have been established with respect to the PRAM are potentially transferable to this new model. The H-PRAM can be used as a flexible tool to investigate general degrees of locality (“neighborhoods of activity) in problems, considering communication and synchronization simultaneously. This gives the potential of obtaining algorithms that map more efficiently to architectures, and of increasing the number of processors that can efficiently be used on a problem (in comparison to a PRAM that charges for communication and synchronization). The model presents a framework in which to study the extent that general locality can be exploited in parallel computing. A companion paper demonstrates the usage of the H-PRAM via the design and analysis of various algorithms for computing the complete binary tree and the FFT/butterfly graph

    Optimal processor assignment for pipeline computations

    Get PDF
    The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual responses times for different processor sizes, find an assignment of processor to tasks. Two objectives are of interest: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p processor system and a series parallel precedence graph with n constituent tasks, an O(np2) algorithm is provided that finds the optimal assignment for the response time optimization problem; it was found that the assignment optimizing the constrained throughput in O(np2log p) time. Special cases of linear, independent, and tree graphs are also considered

    Reading list of selected PASM-related publications

    Get PDF
    Prepared for a chapter to be published in the forthcoming Encyclopedia of Parallel Computing by Springer Publishing Company. The Encyclopedia will contain a broad coverage of the field and will include entries on machine organization, programming, algorithms, and applications. The broad coverage, together with extensive pointers to the literature for in-depth study, is expected to make the Encyclopedia a useful reference tool in parallel computing

    An efficient processor allocation strategy that maintains a high degree of contiguity among processors in 2D mesh connected multicomputers

    Get PDF
    Two strategies are used for the allocation of jobs to processors connected by mesh topologies: contiguous allocation and non-contiguous allocation. In non-contiguous allocation, a job request can be split into smaller parts that are allocated to non-adjacent free sub-meshes rather than always waiting until a single sub-mesh of the requested size and shape is available. Lifting the contiguity condition is expected to reduce processor fragmentation and increase system utilization. However, the distances traversed by messages can be long, and as a result the communication overhead, especially contention, is increased. The extra communication overhead depends on how the allocation request is partitioned and assigned to free sub-meshes. This paper presents a new Non-contiguous allocation algorithm, referred to as Greedy-Available-Busy-List (GABL for short), which can decrease the communication overhead among processors allocated to a given job. The simulation results show that the new strategy can reduce the communication overhead and substantially improve performance in terms of parameters such as job turnaround time and system utilization. Moreover, the results reveal that the Shortest-Service-Demand-First (SSD) scheduling strategy is much better than the First-Come-First-Served (FCFS) scheduling strategy

    Parallel Architectures and Parallel Algorithms for Integrated Vision Systems

    Get PDF
    Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems

    Experimental Benchmarks and Initial Evaluation of the Performance of the PASM System Prototype

    Get PDF
    The work reported here represents experiences with the PASM parallel processing system prototype during its first operational year. Most of the experiments were performed by students in the Fall semester of 1987. The first programming, and the first timing measurements, were made during the summer of 1987 by Sam Fineberg. The goal of the collection of experiments presented here was to undertake an Application-driven Architecture Study of the PASM system as a paradigm for parallel architecture evaluation in general. PASM was an excellent vehicle for experimenting with this evaluation technique due to its unique architectural features. Among these are: 1. A reconfigurable, partitionable multistage circuit-switched network. 2. Support for both SIMD and MIMD programs. 3. Ability to execute hybrid SIMD/MIMD programs. 4. An instruction queue which allows overlap of control-flow and data manipulation between micro-control (MC) units and processing elements (PE). It had been hypothesized that superlinear speed-up over the number of PEs could be attained with this feature, and experimental results verified this. 5. Support for barrier synchronization of MIMD tasks. This feature was exploited in some non-standard ways to show the ability to decouple variant length SIMD instructions into multiple MIMD streams for an overall performance benefit. This type of study is expected to continue in the future on PASM and other parallel machines at Purdue. This report should serve as a guide for this future work as well

    Model for an Intelligent Operating System for Executing Tasks on a Reconfigurable Parallel Architecture

    Get PDF
    Parallel processing is one approach to achieve the large computational processing capabilities required by many real-time computing tasks. One of the problems that must be addressed in the use of reconfigurable multiprocessor systems is matching the architecture configuration to the algorithms to be executed. This paper presents a conceptual model that explores the potential of artificial intelligence tools, specifically expert systems, to design an Intelligent Operating System for multiprocessor systems. The target task is the implementation of image understanding systems on multiprocessor architectures. PASM is used as an example multiprocessor. The Intelligent Operating System concepts developed here could also be used to address other problems requiring real-time processing. An example image understanding task is presented to illustrate the concept of intelligent scheduling by the Intelligent Operating System. Also considered is the use of the conceptual model when developing an image understanding system in order to test different strategies for choosing algorithms, imposing execution order constraints, and integrating results from various algorithms

    Reconfiguration for Fault Tolerance and Performance Analysis

    Get PDF
    Architecture reconfiguration, the ability of a system to alter the active interconnection among modules, has a history of different purposes and strategies. Its purposes develop from the relatively simple desire to formalize procedures that all processes have in common to reconfiguration for the improvement of fault-tolerance, to reconfiguration for performance enhancement, either through the simple maximizing of system use or by sophisticated notions of wedding topology to the specific needs of a given process. Strategies range from straightforward redundancy by means of an identical backup system to intricate structures employing multistage interconnection networks. The present discussion surveys the more important contributions to developments in reconfigurable architecture. The strategy here is in a sense to approach the field from an historical perspective, with the goal of developing a more coherent theory of reconfiguration. First, the Turing and von Neumann machines are discussed from the perspective of system reconfiguration, and it is seen that this early important theoretical work contains little that anticipates reconfiguration. Then some early developments in reconfiguration are analyzed, including the work of Estrin and associates on the fixed plus variable restructurable computer system, the attempt to theorize about configurable computers by Miller and Cocke, and the work of Reddi and Feustel on their restructable computer system. The discussion then focuses on the most sustained systems for fault tolerance and performance enhancement that have been proposed. An attempt will be made to define fault tolerance and to investigate some of the strategies used to achieve it. By investigating four different systems, the Tandern computer, the C.vmp system, the Extra Stage Cube, and the Gamma network, the move from dynamic redundancy to reconfiguration is observed. Then reconfiguration for performance enhancement is discussed. A survey of some proposals is attempted, then the discussion focuses on the most sustained systems that have been proposed: PASM, the DC architecture, the Star local network, and the NYU Ultracomputer. The discussion is organized around a comparison of control, scheduling, communication, and network topology. Finally, comparisons are drawn between fault tolerance and performance enhancement, in order to clarify the notion of reconfiguration and to reveal the common ground of fault tolerance and performance enhancement as well as the areas in which they diverge. An attempt is made in the conclusion to derive from this survey and analysis some observations on the nature of reconfiguration, as well as some remarks on necessary further areas of research

    Three Highly Parallel Computer Architectures and Their Suitability for Three Representative Artificial Intelligence Problems

    Get PDF
    Virtually all current Artificial Intelligence (AI) applications are designed to run on sequential (von Neumann) computer architectures. As a result, current systems do not scale up. As knowledge is added to these systems, a point is reached where their performance quickly degrades. The performance of a von Neumann machine is limited by the bandwidth between memory and processor (the von Neumann bottleneck). The bottleneck is avoided by distributing the processing power across the memory of the computer. In this scheme the memory becomes the processor (a smart memory ). This paper highlights the relationship between three representative AI application domains, namely knowledge representation, rule-based expert systems, and vision, and their parallel hardware realizations. Three machines, covering a wide range of fundamental properties of parallel processors, namely module granularity, concurrency control, and communication geometry, are reviewed: the Connection Machine (a fine-grained SIMD hypercube), DADO (a medium-grained MIMD/SIMD/MSIMD tree-machine), and the Butterfly (a coarse-grained MIMD Butterflyswitch machine)
    • …
    corecore