4,615 research outputs found

    On implementing dynamically reconfigurable architectures

    Get PDF
    Dynamically reconfigurable architectures have the ability to change their structure at each step of a computation. This dissertation studies various aspects of implementing dynamic reconfiguration, ranging from hardware building blocks and low-level architectures to modeling issues and high-level algorithm design. First we derive conditions under which classes of communication sets can be optimally scheduled on the circuit-switched tree (CST). Then we present a method to configure the CST to perform in constant time all communications scheduled for a step. This results in a constant time implementation of a step of a segmentable bus, a fundamental dynamically reconfigurable structure. We introduce a new bus delay measure (bends-cost) and define the bends-cost LR-Mesh; the LR-Mesh is a widely used reconfigurable model. Unlike the (idealized) LR-Mesh, which ignores bus delay, the bends-cost LR-Mesh uses the number of bends in a bus to estimate its delay. We present an implementation for which the bends-cost is an accurate estimate of the actual delay. We present algorithms to simulate various LR-Mesh configuration classes on the bends-cost LR-Mesh. For semimonotonic configurations, a Θ(N)*Θ(N) bends-cost LR-Mesh with bus delay at most D can simulate a step of the idealized N*N LR-Mesh in O((log N/(log D-log Δ))2) time (where Δ is the delay of an N-element segmentable bus), while employing about the same number of processors. For some special cases this time reduces to O(log N/(log D-log Δ)). If D=Nε, for an arbitrarily small constant ε \u3e 0, then the running times of bends-cost LR-Mesh algorithms are within a constant of their idealized counterparts. We also prove that with a polynomial blowup in the number of processors and D=Nε, the bends-cost LR-Mesh can simulate any step of an idealized LR-Mesh in constant time, thereby establishing that these models have the same power. We present an implementation (in VHDL) of the Enhanced Self Reconfigurable Gate Array (E-SRGA) architecture and perform a cost-benefit study for different dynamic reconfiguration features. This study shows our approach to be feasible

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    A general framework for efficient FPGA implementation of matrix product

    Get PDF
    Original article can be found at: http://www.medjcn.com/ Copyright Softmotor LimitedHigh performance systems are required by the developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Filed-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks in the construction of high performance systems at an economical price. Given the importance and the use of matrix algorithms in scientific computing applications, they seem ideal candidates to harness and exploit the advantages offered by FPGAs. In this paper, a system for matrix algorithm cores generation is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, ranging in three different matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general purpose or a specific application core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles while the second one is a parallel floating-point matrix multiplier designed for 3D affine transformations.Peer reviewe

    Achieving High Speed CFD simulations: Optimization, Parallelization, and FPGA Acceleration for the unstructured DLR TAU Code

    Get PDF
    Today, large scale parallel simulations are fundamental tools to handle complex problems. The number of processors in current computation platforms has been recently increased and therefore it is necessary to optimize the application performance and to enhance the scalability of massively-parallel systems. In addition, new heterogeneous architectures, combining conventional processors with specific hardware, like FPGAs, to accelerate the most time consuming functions are considered as a strong alternative to boost the performance. In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code efficiency is addressed through three key activities: Optimization, parallelization and hardware acceleration. At first, a profiling analysis of the most time-consuming processes of the Reynolds Averaged Navier Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, a study of the code scalability with new partitioning algorithms are tested to show the most suitable partitioning algorithms for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented

    Unifying mesh- and tree-based programmable interconnect

    Get PDF
    We examine the traditional, symmetric, Manhattan mesh design for field-programmable gate-array (FPGA) routing along with tree-of-meshes (ToM) and mesh-of-trees (MoT) based designs. All three networks can provide general routing for limited bisection designs (Rent's rule with p<1) and allow locality exploitation. They differ in their detailed topology and use of hierarchy. We show that all three have the same asymptotic wiring requirements. We bound this tightly by providing constructive mappings between routes in one network and routes in another. For example, we show that a (c,p) MoT design can be mapped to a (2c,p) linear population ToM and introduce a corner turn scheme which will make it possible to perform the reverse mapping from any (c,p) linear population ToM to a (2c,p) MoT augmented with a particular set of corner turn switches. One consequence of this latter mapping is a multilayer layout strategy for N-node, linear population ToM designs that requires only /spl Theta/(N) two-dimensional area for any p when given sufficient wiring layers. We further show upper and lower bounds for global mesh routes based on recursive bisection width and show these are within a constant factor of each other and within a constant factor of MoT and ToM layout area. In the process we identify the parameters and characteristics which make the networks different, making it clear there is a unified design continuum in which these networks are simply particular regions

    Simulation of ultrasonic lamb wave generation, propagation and detection for an air coupled robotic scanner

    Get PDF
    A computer simulator, to facilitate the design and assessment of a reconfigurable, air-coupled ultrasonic scanner is described and evaluated. The specific scanning system comprises a team of remote sensing agents, in the form of miniature robotic platforms that can reposition non-contact Lamb wave transducers over a plate type of structure, for the purpose of non-destructive evaluation (NDE). The overall objective is to implement reconfigurable array scanning, where transmission and reception are facilitated by different sensing agents which can be organised in a variety of pulse-echo and pitch-catch configurations, with guided waves used to generate data in the form of 2-D and 3-D images. The ability to reconfigure the scanner adaptively requires an understanding of the ultrasonic wave generation, its propagation and interaction with potential defects and boundaries. Transducer behaviour has been simulated using a linear systems approximation, with wave propagation in the structure modelled using the local interaction simulation approach (LISA). Integration of the linear systems and LISA approaches are validated for use in Lamb wave scanning by comparison with both analytic techniques and more computationally intensive commercial finite element/difference codes. Starting with fundamental dispersion data, the paper goes on to describe the simulation of wave propagation and the subsequent interaction with artificial defects and plate boundaries, before presenting a theoretical image obtained from a team of sensing agents based on the current generation of sensors and instrumentation

    Scaling Simulations of Reconfigurable Meshes.

    Get PDF
    This dissertation deals with reconfigurable bus-based models, a new type of parallel machine that uses dynamically alterable connections between processors to allow efficient communication and to perform fast computations. We focus this work on the Reconfigurable Mesh (R-Mesh), one of the most widely studied reconfigurable models. We study the ability of the R-Mesh to adapt an algorithm instance of an arbitrary size to run on a given smaller model size without significant loss of efficiency. A scaling simulation achieves this adaptation, and the simulation overhead expresses the efficiency of the simulation. We construct a scaling simulation for the Fusing-Restricted Reconfigurable Mesh (FR-Mesh), an important restriction of the R-Mesh. The overhead of this simulation depends only on the simulating machine size and not on the simulated machine size. The results of this scaling simulation extend to a variety of concurrent write rules and also translate to an improved scaling simulation of the R-Mesh itself. We present a bus linearization procedure that transforms an arbitrary non-linear bus configuration of an R-Mesh into an equivalent acyclic linear bus configuration implementable on an Linear Reconfigurable Mesh (LR-Mesh), a weaker version of the R-Mesh. This procedure gives the algorithm designer the liberty of using buses of arbitrary shape, while automatically translating the algorithm to run on a simpler platform. We illustrate our bus linearization method through two important applications. The first leads to a faster scaling simulation of the R-Mesh. The second application adapts algorithms designed for R-Meshes to run on models with pipelined optical buses. We also present a simulation of a Directional Reconfigurable Mesh (DR-Mesh) on an LR-Mesh. This simulation has a much better efficiency compared to previous work. In addition to the LR-Mesh, this simulation also runs on models that use pipelined optical buses
    corecore