231 research outputs found

    Scaling Simulations of Reconfigurable Meshes.

    Get PDF
    This dissertation deals with reconfigurable bus-based models, a new type of parallel machine that uses dynamically alterable connections between processors to allow efficient communication and to perform fast computations. We focus this work on the Reconfigurable Mesh (R-Mesh), one of the most widely studied reconfigurable models. We study the ability of the R-Mesh to adapt an algorithm instance of an arbitrary size to run on a given smaller model size without significant loss of efficiency. A scaling simulation achieves this adaptation, and the simulation overhead expresses the efficiency of the simulation. We construct a scaling simulation for the Fusing-Restricted Reconfigurable Mesh (FR-Mesh), an important restriction of the R-Mesh. The overhead of this simulation depends only on the simulating machine size and not on the simulated machine size. The results of this scaling simulation extend to a variety of concurrent write rules and also translate to an improved scaling simulation of the R-Mesh itself. We present a bus linearization procedure that transforms an arbitrary non-linear bus configuration of an R-Mesh into an equivalent acyclic linear bus configuration implementable on an Linear Reconfigurable Mesh (LR-Mesh), a weaker version of the R-Mesh. This procedure gives the algorithm designer the liberty of using buses of arbitrary shape, while automatically translating the algorithm to run on a simpler platform. We illustrate our bus linearization method through two important applications. The first leads to a faster scaling simulation of the R-Mesh. The second application adapts algorithms designed for R-Meshes to run on models with pipelined optical buses. We also present a simulation of a Directional Reconfigurable Mesh (DR-Mesh) on an LR-Mesh. This simulation has a much better efficiency compared to previous work. In addition to the LR-Mesh, this simulation also runs on models that use pipelined optical buses

    Solution of partial differential equations on vector and parallel computers

    Get PDF
    The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed

    A bibliography on parallel and vector numerical algorithms

    Get PDF
    This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming language, and other topics of interest to scientific computing. Certain conference proceedings and anthologies which have been published in book form are listed also

    Computer vision algorithms on reconfigurable logic arrays

    Full text link

    Research in Applied Mathematics, Fluid Mechanics and Computer Science

    Get PDF
    This report summarizes research conducted at the Institute for Computer Applications in Science and Engineering in applied mathematics, fluid mechanics, and computer science during the period October 1, 1998 through March 31, 1999

    PaSE : Parallel Speedup Estimation Framework for Network-on-Chip Based Multi-core Systems

    Get PDF
    The massive integration of cores in multi-core system has enabled chip designer to design systems while meeting the power-performance demands of the applications. However, full-system simulations traditionally used to evaluate the speedup with these systems are computationally expensive and time consuming. On the other hand, analytical speedup models such as Amdahl’s law are powerful and fast ways to calculate the achievable speedup of these systems. However, Amdahl’s Law disregards the communication among the cores that play a vital role in defining the achievable speedup with the multi-core systems. To bridge this gap, in this work, we present PaSE a parallel speedup estimation framework for multi-core systems that considers the latency of the Network-on-Chip (NoC). To accurately capture the latency of the NoC, we propose a queuing theory based analytical model. Using our proposed PaSE framework, We conduct a speedup analysis for multicore system with real application based traffic i.g. Matrix Multiplication. the multiplication of two [32x32] matrices is considered in NoC based multi-core system with respect to three different cases ideal-core case, integer case (i.g. matrix elements are integer number), and denormal case (i.g. matrix elements are denormal number, NaNs, or infinity). From this analysis, we show how the system size, Network-on-Chip NoC architecture, and the computation to communication (C-to-C) ratio effect the achievable speedup. To sum up, instead of the simulation based performance estimation, our PaSE framework can be utilized as a design guideline i.g. it is possible to use it to understand the optimal multi-core system-size for certain applications. Thus, this model can reduce the design time and effort of such NoC based multi-core systems

    The instruction of systolic array (ISA) and simulation of parallel algorithms

    Get PDF
    Systolic arrays have proved to be well suited for Very Large Scale Integrated technology (VLSI) since they: -Consist of a regular network of simple processing cells, -Use local communication between the processing cells only, -Exploit a maximal degree of parallelism. However, systolic arrays have one main disadvantage compared with other parallel computer architectures: they are special purpose architectures only capable of executing one algorithm, e.g., a systolic array designed for sorting cannot be used to form matrix multiplication. Several approaches have been made to make systolic arrays more flexible, in order to be able to handle different problems on a single systolic array. In this thesis an alternative concept to a VLSI-architecture the Soft-Systolic Simulation System (SSSS), is introduced and developed as a working model of virtual machine with the power to simulate hard systolic arrays and more general forms of concurrency such as the SIMD and MIMD models of computation. The virtual machine includes a processing element consisting of a soft-systolic processor implemented in the virtual.machine language. The processing element considered here was a very general element which allows the choice of a wide range of arithmetic and logical operators and allows the simulation of a wide class of algorithms but in principle extra processing cells can be added making a library and this library be tailored to individual needs. The virtual machine chosen for this implementation is the Instruction Systolic Array (ISA). The ISA has a number of interesting features, firstly it has been used to simulate all SIMD algorithms and many MIMD algorithms by a simple program transformation technique, further, the ISA can also simulate the so-called wavefront processor algorithms, as well as many hard systolic algorithms. The ISA removes the need for the broadcasting of data which is a feature of SIMD algorithms (limiting the size of the machine and its cycle time) and also presents a fairly simple communication structure for MIMD algorithms. The model of systolic computation developed from the VLSI approach to systolic arrays is such that the processing surface is fixed, as are the processing elements or cells by virtue of their being embedded in the processing surface. The VLSI approach therefore freezes instructions and hardware relative to the movement of data with the virtual machine and softsystolic programming retaining the constructions of VLSI for array design features such as regularity, simplicity and local communication, allowing the movement of instructions with respect to data. Data can be frozen into the structure with instructions moving systolically. Alternatively both the data and instructions can move systolically around the virtual processors, (which are deemed fixed relative to the underlying architecture). The ISA is implemented in OCCAM programs whose execution and output implicitly confirm the correctness of the design. The soft-systolic preparation comprises of the usual operating system facilities for the creation and modification of files during the development of new programs and ISA processor elements. We allow any concurrent high level language to be used to model the softsystolic program. Consequently the Replicating Instruction Systolic Array Language (RI SAL) was devised to provide a very primitive program environment to the ISA but adequate for testing. RI SAL accepts instructions in an assembler-like form, but is fairly permissive about the format of statements, subject of course to syntax. The RI SAL compiler is adopted to transform the soft-systolic program description (RISAL) into a form suitable for the virtual machine (simulating the algorithm) to run. Finally we conclude that the principles mentioned here can form the basis for a soft-systolic simulator using an orthogonally connected mesh of processors. The wide range of algorithms which the ISA can simulate make it suitable for a virtual simulating grid

    Multistage Packet-Switching Fabrics for Data Center Networks

    Get PDF
    Recent applications have imposed stringent requirements within the Data Center Network (DCN) switches in terms of scalability, throughput and latency. In this thesis, the architectural design of the packet-switches is tackled in different ways to enable the expansion in both the number of connected endpoints and traffic volume. A cost-effective Clos-network switch with partially buffered units is proposed and two packet scheduling algorithms are described. The first algorithm adopts many simple and distributed arbiters, while the second approach relies on a central arbiter to guarantee an ordered packet delivery. For an improved scalability, the Clos switch is build using a Network-on-Chip (NoC) fabric instead of the common crossbar units. The Clos-UDN architecture made with Input-Queued (IQ) Uni-Directional NoC modules (UDNs) simplifies the input line cards and obviates the need for the costly Virtual Output Queues (VOQs). It also avoids the need for complex, and synchronized scheduling processes, and offers speedup, load balancing, and good path diversity. Under skewed traffic, a reliable micro load-balancing contributes to boosting the overall network performance. Taking advantage of the NoC paradigm, a wrapped-around multistage switch with fully interconnected Central Modules (CMs) is proposed. The architecture operates with a congestion-aware routing algorithm that proactively distributes the traffic load across the switching modules, and enhances the switch performance under critical packet arrivals. The implementation of small on-chip buffers has been made perfectly feasible using the current technology. This motivated the implementation of a large switching architecture with an Output-Queued (OQ) NoC fabric. The design merges assets of the output queuing, and NoCs to provide high throughput, and smooth latency variations. An approximate analytical model of the switch performance is also proposed. To further exploit the potential of the NoC fabrics and their modularity features, a high capacity Clos switch with Multi-Directional NoC (MDN) modules is presented. The Clos-MDN switching architecture exhibits a more compact layout than the Clos-UDN switch. It scales better and faster in port count and traffic load. Results achieved in this thesis demonstrate the high performance, expandability and programmability features of the proposed packet-switches which makes them promising candidates for the next-generation data center networking infrastructure

    Hypergraph-Based Interconnection Networks for Large Multicomputers

    Get PDF
    This thesis deals with issues pertaining to multicomputer interconnection networks namely topology, technology, switching method, and routing algorithm. It argues that a new class of regular low-dimensional hypergraph networks, the distributed crossbar switch hypermesh (DCSH), represents a promising alternative high-performance interconnection network for future large multicomputers to graph networks such as meshes, tori, and binary n-cubes, which have been widely used in current multicomputers. Channels in existing hypergraph and graph structures suffer from bandwidth limitations imposed by implementation technology. The first part of the thesis shows how the low-dimensional DCSH can use an innovative implementation scheme to alleviate this problem. It relies on the separation of processing and communication functions by physical layering in order to accommodate high wiring density and necessary message buffering, improving performance considerably. Various mathematical models of the DCSH, validated through discrete-event simulation, are then introduced. Effects of different switching methods (e.g., wormhole routing, virtual cut-through, and message switching), routing algorithms (e.g., restricted and random), and different switching element designs are investigated. Further, the impact on performance of different communication patterns, such as those including locality and hot-spots, are assessed. The remainder of the thesis compares the DCSH to other common hypergraph and graph networks assuming different implementation technologies, such as VLSI, multiple-chip technology, and the new layered implementation scheme. More realistic assumptions are introduced such as pipeline-bit transmission and non-zero delays through switching elements. The results show that the proposed structure has superior characteristics assuming equal implementation cost in both VLSI and multiple-chip technology. Furthermore, optimal performance is offered by the new layered implementation

    Scheduling and reconfiguration of interconnection network switches

    Get PDF
    Interconnection networks are important parts of modern computing systems, facilitating communication between a system\u27s components. Switches connecting various nodes of an interconnection network serve to move data in the network. The switch\u27s delay and throughput impact the overall performance of the network and thus the system. Scheduling efficient movement of data through a switch and configuring the switch to realize a schedule are the main themes of this research. We consider various interconnection network switches including (i) crossbar-based switches, (ii) circuit-switched tree switches, and (iii) fat-tree switches. For crossbar-based input-queued switches, a recent result established that logarithmic packet delay is possible. However, this result assumes that packet transmission time through the switch is no less than schedule-generation time. We prove that without this assumption (as is the case in practice) packet delay becomes linear. We also report results of simulations that bear out our result for practical switch sizes and indicate that a fast scheduling algorithm reduces not only packet delay but also buffer size. We also propose a fast mesh-of-trees based distributed switch scheduling (maximal-matching based) algorithm that has polylog complexity. A circuit-switched tree (CST) can serve as an interconnect structure for various computing architectures and models such as the self-reconfigurable gate array and the reconfigurable mesh. A CST is a tree structure with source and destination processing elements as leaves and switches as internal nodes. We design several scheduling and configuration algorithms that distributedly partition a given set of communications into non-conflicting subsets and then establish switch settings and paths on the CST corresponding to the communications. A fat-tree is another widely used interconnection structure in many of today\u27s high-performance clusters. We embed a reconfigurable mesh inside a fat-tree switch to generate efficient connections. We present an R-Mesh-based algorithm for a fat-tree switch that creates buses connecting input and output ports corresponding to various communications using that switch
    • …
    corecore