392 research outputs found

    Efficient parallel computation on multiprocessors with optical interconnection networks

    This dissertation studies optical interconnection networks: their architecture, address schemes, and computation and communication capabilities. We focus on a simple but powerful optical interconnection network model, the Linear Array with Reconfigurable Pipelined Bus System (LARPBS). We extend the LARPBS model to a simplified higher-dimensional LARPBS and provide a set of basic computation operations. We then study two groups of parallel computation problems on both one-dimensional and multi-dimensional LARPBSs: parallel comparison problems, including sorting, merging, and selection; and Boolean matrix multiplication, transitive closure, and their applications to connected component problems. We implement an optimal sorting algorithm on an n-processor LARPBS. With this optimal sorting algorithm at our disposal, we study the sorting problem for higher-dimensional LARPBSs and obtain the following results:
    • An optimal basic Columnsort algorithm on a 2D LARPBS.
    • Two optimal two-way merge sort algorithms on a 2D LARPBS.
    • An optimal multi-way merge sorting algorithm on a 2D LARPBS.
    • An optimal generalized column sort algorithm on a 2D LARPBS.
    • An optimal generalized column sort algorithm on a 3D LARPBS.
    • An optimal 5-phase sorting algorithm on a 3D LARPBS.
    Results for selection problems are as follows:
    • A constant time maximum-finding algorithm on an LARPBS.
    • An optimal maximum-finding algorithm on an LARPBS.
    • An O((log log n)^2) time parallel selection algorithm on an LARPBS.
    • An O(k (log log n)^2) time parallel multi-selection algorithm on an LARPBS.
    While studying the computation and communication properties of the LARPBS model, we find that Boolean matrix multiplication and its applications to graphs form another set of problems that can be solved efficiently on the LARPBS. The results we have obtained in this area are:
    • A constant time Boolean matrix multiplication algorithm.
    • An O(log n)-time transitive closure algorithm.
    • An O(log n)-time connected components algorithm.
    • An O(log n)-time strongly connected components algorithm.
    The results provided in this dissertation show the strong computation and communication power of optical interconnection networks.
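One standard way a constant time Boolean matrix multiplication can lead to an O(log n)-time transitive closure is repeated squaring: about log2(n) squarings of the adjacency matrix (with self-loops added) cover every path of length up to n - 1. The sketch below is a sequential Python illustration of that idea only. It does not model the LARPBS, the function names are hypothetical, and whether the dissertation uses exactly this construction is an assumption.

```python
def bool_matmul(a, b):
    """Boolean product of two n x n matrices given as lists of lists."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(adj):
    """Reflexive-transitive closure by repeated squaring (hypothetical helper).

    Self-loops are added first, so after squaring the matrix t times it
    records all paths of length at most 2**t.
    """
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    covered = 1  # longest path length currently captured by `reach`
    while covered < n - 1:
        reach = bool_matmul(reach, reach)
        covered *= 2
    return reach


# Directed path 0 -> 1 -> 2 -> 3
adj = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
closure = transitive_closure(adj)
```

Here the loop runs twice for n = 4, so the number of matrix products grows logarithmically with n; on a machine where each product takes constant time, that gives a logarithmic overall bound.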

    Scaling Simulations of Reconfigurable Meshes.

    This dissertation deals with reconfigurable bus-based models, a new type of parallel machine that uses dynamically alterable connections between processors to allow efficient communication and to perform fast computations. We focus this work on the Reconfigurable Mesh (R-Mesh), one of the most widely studied reconfigurable models. We study the ability of the R-Mesh to adapt an algorithm instance of an arbitrary size to run on a given smaller model size without significant loss of efficiency. A scaling simulation achieves this adaptation, and the simulation overhead expresses the efficiency of the simulation. We construct a scaling simulation for the Fusing-Restricted Reconfigurable Mesh (FR-Mesh), an important restriction of the R-Mesh. The overhead of this simulation depends only on the simulating machine size and not on the simulated machine size. The results of this scaling simulation extend to a variety of concurrent write rules and also translate to an improved scaling simulation of the R-Mesh itself. We present a bus linearization procedure that transforms an arbitrary non-linear bus configuration of an R-Mesh into an equivalent acyclic linear bus configuration implementable on a Linear Reconfigurable Mesh (LR-Mesh), a weaker version of the R-Mesh. This procedure gives the algorithm designer the liberty of using buses of arbitrary shape, while automatically translating the algorithm to run on a simpler platform. We illustrate our bus linearization method through two important applications. The first leads to a faster scaling simulation of the R-Mesh. The second application adapts algorithms designed for R-Meshes to run on models with pipelined optical buses. We also present a simulation of a Directional Reconfigurable Mesh (DR-Mesh) on an LR-Mesh. This simulation is much more efficient than previous work. In addition to the LR-Mesh, this simulation also runs on models that use pipelined optical buses.

    Design of an FPGA-based parallel SIMD machine for power flow analysis

    Power flow analysis consists of computationally intensive calculations on large matrices, consumes several hours of computational time, and has shown the need for application-specific parallel machines. The potential of Single-Instruction stream Multiple-Data stream (SIMD) parallel architectures for efficient operations on large matrices has been demonstrated by many existing supercomputers. The unsuitability of existing parallel machines for low-cost power system applications, their long design cycles, and the difficulty of using them show the need for application-specific SIMD machines. Advances in VLSI technology and Field-Programmable Gate Arrays (FPGAs) enable the implementation of Custom Computing Machines (CCMs), which can yield better performance for specific applications. The advent of Soft-Core processors made it possible to integrate reconfigurable logic as a slave to a peripheral bus and has proven useful for the rapid prototyping of complete systems on programmable chips. This thesis aims at designing and implementing an FPGA-based SIMD machine for power flow analysis. It presents the architecture of an SIMD machine that consists of an array of processing elements with mesh interconnection and a Soft-Core processor; the latter is used as the host. The FPGA-based SIMD machine is implemented on the Annapolis Microsystems Wildstar-II board that contains multiple Virtex-II FPGAs. The Soft-Core processor used is the Xilinx Microblaze and the application targeted is matrix multiplication.

    Simulations and Algorithms on Reconfigurable Meshes With Pipelined Optical Buses.

    Recently, many models using reconfigurable optically pipelined buses have been proposed in the literature. A system with an optically pipelined bus uses optical waveguides, with unidirectional propagation and predictable delays, instead of electrical buses to transfer information among processors. These two properties enable synchronized concurrent access to an optical bus in a pipelined fashion. Combined with the abilities of the bus structure to broadcast and multicast, this architecture suits many communication-intensive applications. We establish the equivalence of three such one-dimensional optical models, namely the LARPBS, LPB, and POB. This implies an automatic translation of algorithms (without loss of speed or efficiency) among these models. In particular, since the LPB is the same as an LARPBS without the ability to segment its buses, their equivalence establishes reconfigurable delays (rather than segmenting ability) as the key to the power of optically pipelined models. We also present simulations for a number of two-dimensional optical models and establish that they possess the same complexity, so that any of these models can simulate a step of one of the other models in constant time with a polynomial increase in size. Specifically, we determine the complexity of three two-dimensional optical models (the PR-Mesh, APPBS, and AROB) to be the same as the well-known LR-Mesh and the cycle-free LR-Mesh. We develop algorithms for the LARPBS and PR-Mesh that are more efficient than existing algorithms in part by exploiting the pipelining, segmenting, and multicasting characteristics of these models. We also consider the implications of certain physical constraints placed on the system by restricting the distance over which two processors are able to communicate. All algorithms developed for these models assume that a healthy system is available. We present some fundamental algorithms that are able to tolerate up to N/2 faults on an N-processor LARPBS. We then extend these results to apply to other algorithms in the areas of image processing and matrix operations.

    Reconfigurable Systems for Cryptography and Multimedia Applications


    An efficient parallel algorithm for the all pairs shortest path problem using processor arrays with reconfigurable bus systems

    The all pairs shortest path problem is an instance of the algebraic path problem. Many parallel algorithms for the solution of this problem appear in the literature. One of the efficient parallel algorithms on the W-RAM model is given by Kucera [17]. Though efficient, algorithms written for the W-RAM model of parallel computation are too idealistic to be implemented on current hardware. In this report we present an efficient parallel algorithm for the solution of this problem using a relatively new model of parallel computing, Processor Arrays with Reconfigurable Bus Systems. The parallel time complexity of this algorithm is O(log^2 n) and its processor complexity is n^2 × n × n.
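Viewing all pairs shortest paths as an algebraic path problem means replacing ordinary matrix multiplication with a (min, +) product and squaring the distance matrix about log2(n) times. The sketch below is a sequential Python illustration of that formulation only. It assumes no negative cycles, the function names are hypothetical, and it does not reproduce the report's reconfigurable-bus algorithm, whose details are not given in the abstract.

```python
import math

INF = math.inf  # marks "no edge"


def min_plus(a, b):
    """(min, +) product of two n x n matrices given as lists of lists."""
    n = len(a)
    return [[min(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def all_pairs_shortest_paths(weights):
    """Shortest path lengths by repeated (min, +) squaring (hypothetical helper).

    Assumes there are no negative cycles. After t squarings the matrix
    holds shortest paths that use at most 2**t edges.
    """
    n = len(weights)
    dist = [[0 if i == j else weights[i][j] for j in range(n)]
            for i in range(n)]
    covered = 1  # maximum number of edges accounted for so far
    while covered < n - 1:
        dist = min_plus(dist, dist)
        covered *= 2
    return dist


# Small directed example: 0 -> 1 (3), 1 -> 2 (1), 2 -> 0 (7)
weights = [
    [0, 3, INF],
    [INF, 0, 1],
    [7, INF, 0],
]
dist = all_pairs_shortest_paths(weights)
```

Each (min, +) product is independent across output entries, which is what makes the formulation attractive for processor arrays: the number of products is logarithmic in n, and the per-product cost depends on the machine model.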

    FUNCTIONAL ENHANCEMENT AND APPLICATIONS DEVELOPMENT FOR A HYBRID, HETEROGENEOUS SINGLE-CHIP MULTIPROCESSOR ARCHITECTURE

    Reconfigurable and dynamic computer architecture is a rapidly expanding area of research, driven by the requirements of compute-intensive real-time and non-real-time applications in areas such as cryptography and signal/radar processing. To meet the demands of such applications, a parallel single-chip heterogeneous Hybrid Data/Command Architecture (HDCA) has been proposed. This single-chip multiprocessor architecture is reconfigurable at three levels: application, node, and processor. It is currently being developed and experimentally verified via a three-phase prototyping process. A first-phase prototype with very limited functionality has been developed. This initial prototype was used as a base for enhancements to functionality and performance, resulting in a second-phase virtual prototype, which is the subject of this thesis. The major contributions of the work reported here are adding processors, making the system reconfigurable at the node level, enhancing the ability of the system to fork to more than two processes, and designing more complex real-time and non-real-time applications that use, and can be used to test and evaluate, the enhanced and new functionality added to the architecture. A working proof of concept of the architecture is achieved by Hardware Description Language (HDL) based development and use of a Virtual Prototype of the architecture. The Virtual Prototype was used to evaluate the architecture's functionality and performance in executing several newly developed example applications. Recommendations are made to further improve the system functionality.

    Compiling dataflow graphs into hardware

    Department Head: L. Darrell Whitley. 2005 Fall. Includes bibliographical references (pages 121-126). Conventional computers are programmed by supplying a sequence of instructions that perform the desired task. A reconfigurable processor is "programmed" by specifying the interconnections between hardware components, thereby creating a "hardwired" system to do the particular task. For some applications such as image processing, reconfigurable processors can produce dramatic execution speedups. However, programming a reconfigurable processor is essentially a hardware design discipline, making programming difficult for application programmers who are only familiar with software design techniques. To bridge this gap, a programming language called SA-C (Single Assignment C, pronounced "sassy") has been designed for programming reconfigurable processors. The process involves two main steps. First, the SA-C compiler analyzes the input source code and produces a hardware-independent intermediate representation of the program, called a dataflow graph (DFG). Second, this DFG is combined with hardware-specific information to create the final configuration. This dissertation describes the design and implementation of a system that performs the DFG to hardware translation. The DFG is broken up into three sections: the data generators, the inner loop body, and the data collectors. The second of these, the inner loop body, is used to create a computational structure that is unique for each program. The other two sections are implemented by using prebuilt modules, parameterized for the particular problem. Finally, a "glue module" is created to connect the various pieces into a complete interconnection specification. The dissertation also explores optimizations that can be applied while processing the DFG to improve performance. A technique for pipelining the inner loop body is described that uses an estimation tool for the propagation delay of the nodes within the dataflow graph. A scheme is also described that identifies subgraphs within the dataflow graph that can be replaced with lookup tables. The lookup tables provide a faster implementation than random logic in some instances.
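To make the idea of delay-driven pipelining concrete, the sketch below assigns each node of a small dataflow graph to a pipeline stage so that the estimated combinational delay accumulated within any one stage stays under a clock period budget. It is a minimal greedy illustration, not the dissertation's technique: the function `assign_stages`, the node names, and the delay figures are all hypothetical, and each node's delay is assumed to fit within the period.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def assign_stages(preds, delay, period):
    """Greedy stage assignment for a dataflow graph (hypothetical helper).

    preds  : dict mapping node -> list of predecessor nodes
    delay  : dict mapping node -> estimated propagation delay
    period : delay budget per pipeline stage
    Assumes every single node's delay is at most `period`.
    """
    stage, finish = {}, {}
    for node in TopologicalSorter(preds).static_order():
        ps = preds.get(node, ())
        # A node can be no earlier than its latest predecessor's stage.
        s = max((stage[p] for p in ps), default=0)
        # Only predecessors in that same stage add combinational delay;
        # values from earlier stages arrive through a register at time 0.
        start = max((finish[p] for p in ps if stage[p] == s), default=0.0)
        if start + delay[node] > period:
            s, start = s + 1, 0.0  # insert a register and open a new stage
        stage[node] = s
        finish[node] = start + delay[node]
    return stage


# out = (a * b) + c, followed by an output node; delays are made-up estimates
preds = {"mul": ["a", "b"], "add": ["mul", "c"], "out": ["add"]}
delay = {"a": 0, "b": 0, "c": 0, "mul": 6, "add": 3, "out": 2}
stages = assign_stages(preds, delay, period=8)
```

With these made-up numbers the multiplier fills most of the first stage, so the adder is pushed into a second stage. A real flow would also have to add balancing registers on edges that skip stages, which this sketch omits.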