    High Performance Issues in Image Processing and Computer Vision

    Typical image processing and computer vision tasks found in industrial, medical, and military applications require real-time solutions. These requirements have motivated the design of many parallel architectures and algorithms. Recently, a new architecture called the reconfigurable mesh has been proposed. This thesis addresses a number of problems in image processing and computer vision on reconfigurable meshes. We first show that a number of low-level descriptors of a digitized image, such as the perimeter, area, histogram, and median row, can be reduced to computing the sum of all the integers in a matrix, which in turn can be reduced to computing the prefix sums of a binary sequence and the prefix sums of an integer sequence. We then propose a new computational paradigm for reconfigurable meshes: identifying an entity by a bus and performing computations on the bus to obtain properties of the entity. Using the new paradigm, we solve a number of mid-level vision tasks including the Hough transform and component labeling. Finally, a VLSI-optimal constant-time algorithm for computing the convex hull of a set of planar points is presented, based on a VLSI-optimal constant-time sorting algorithm. As by-products, two basic data movement techniques, computing the prefix sums of a binary sequence and computing the prefix maxima of a sequence of real numbers, and a VLSI-optimal constant-time sorting algorithm have been developed. These by-products are interesting in their own right. In addition, they can be exploited to obtain efficient algorithms for a number of computational problems.
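
    The reduction chain above (image descriptor to matrix sum to prefix sums of a binary or integer sequence) can be pictured with a small sequential sketch. The code below only illustrates the reductions; it is not the thesis's constant-time reconfigurable-mesh algorithm, and the example image is purely hypothetical.

```python
# Minimal sequential sketch (illustration only, not the reconfigurable-mesh
# algorithm) of the reduction: low-level descriptor -> matrix sum -> prefix sums.

def prefix_sums(seq):
    """Prefix sums of a sequence: out[i] = seq[0] + ... + seq[i]."""
    out, running = [], 0
    for x in seq:
        running += x
        out.append(running)
    return out

def matrix_sum(matrix):
    """Sum of all entries: prefix-sum each row, then prefix-sum the row totals."""
    row_totals = [prefix_sums(row)[-1] for row in matrix]
    return prefix_sums(row_totals)[-1]

# Area of a binary image = number of 1-pixels = sum of the 0/1 matrix.
image = [[0, 1, 1],
         [1, 1, 0],
         [0, 0, 1]]
print(matrix_sum(image))  # 5
```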

    High Performance Issues on Parallel Architectures

    In an effort to reduce communication latency in mesh-type architectures, these architectures have been augmented with various types of global and reconfigurable bus structures. Static bus structures provide excellent performance in many areas of computation, especially structured numerical computations, but they lack the flexibility required by many large numerical and non-numerical applications. Reconfigurable bus systems have the dynamic adaptability to handle a much wider range of applications. While reconfigurable meshes can often yield constant-time results for many problems, the cost of this performance is paid in the number of processors required; in actuality, the majority of these processors are employed as switching elements for the bus system and often do little actual computation. In an effort to reduce the processor cost while maintaining performance and communication flexibility, we present a new hybrid parallel array architecture that combines the best features of arrays with global buses and arrays with reconfigurable bus systems. The result is an architecture of n processing elements and a bus interconnection network that requires very basic circuitry to construct and control. This architecture allows prefix computations, such as prefix sum and prefix maximum (minimum), to be accomplished in O(log n) time. These functions then form the building blocks for more complex procedures, which more fully exploit the communication flexibility of the architecture. Applying the architecture to graph theory produces optimal algorithms for graph properties such as spanning forests, bipartiteness, fundamental cycles, bridges, and biconnected components. Optimal algorithms for the more complex least common ancestor and connected component problems are also presented. By design, all algorithms maintain optimality for very large sparse graphs. We further examine the architecture's ability to handle basic image processing tasks, as well as its potential to simulate other parallel architectures and theoretic models.
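
    The O(log n) prefix computations mentioned above can be pictured with the standard doubling scheme, in which each of the ceil(log2 n) parallel rounds lets element i combine with the element "step" positions to its left. The sketch below simulates those rounds sequentially; it is a generic illustration, not the paper's bus-based algorithm.

```python
# Sketch (generic doubling scan, not the paper's specific bus algorithm) of a
# prefix computation that finishes in ceil(log2 n) parallel rounds; each round
# is simulated sequentially here.

def parallel_prefix(values, op):
    """Inclusive prefix computation using the doubling (Hillis-Steele) scheme."""
    a = list(values)
    n = len(a)
    step = 1
    while step < n:
        # In one parallel round, element i reads a[i - step] and combines with it.
        a = [op(a[i - step], a[i]) if i >= step else a[i] for i in range(n)]
        step *= 2
    return a

print(parallel_prefix([3, 1, 4, 1, 5, 9, 2, 6], lambda x, y: x + y))  # prefix sums
print(parallel_prefix([3, 1, 4, 1, 5, 9, 2, 6], max))                 # prefix maxima
```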

    A fast parallel algorithm for special linear systems of equations using processor arrays with reconfigurable bus systems

    A parallel algorithm using processor arrays with reconfigurable bus systems (PARBS) has been designed to solve dense symmetric positive definite (SPD) systems of equations Ax = b. The key content of this report is the parallelisation of the algorithm by Delosme & Ipsen [8]. In order to design a parallel algorithm for PARBS, many procedures involved in [8] are handled in a slightly different way. The parallel time and processor complexity of each step of the algorithm is calculated. The overall parallel time complexity is O(n) using 2n × 2n × 5n processing elements.
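
    For reference, the sketch below solves a small dense SPD system sequentially via Cholesky factorisation and two triangular solves. This is only a baseline illustration of the problem being parallelised, under the assumption that a direct factorisation is an acceptable stand-in; it is not the algorithm of [8] or its PARBS implementation.

```python
# Minimal sequential reference (assumption: plain Cholesky + triangular solves,
# not the method of [8] or its PARBS parallelisation) for a dense SPD system A x = b.
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T (A must be SPD)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def solve_spd(A, b):
    """Solve A x = b by forward then backward substitution on the Cholesky factor."""
    n, L = len(A), cholesky(A)
    y = [0.0] * n
    for i in range(n):                      # L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):            # L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

print(solve_spd([[4.0, 2.0], [2.0, 3.0]], [10.0, 8.0]))  # [1.75, 1.5]
```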

    Unifying mesh- and tree-based programmable interconnect

    We examine the traditional, symmetric, Manhattan mesh design for field-programmable gate-array (FPGA) routing, along with tree-of-meshes (ToM) and mesh-of-trees (MoT) based designs. All three networks can provide general routing for limited-bisection designs (Rent's rule with p < 1) and allow locality exploitation. They differ in their detailed topology and use of hierarchy. We show that all three have the same asymptotic wiring requirements. We bound this tightly by providing constructive mappings between routes in one network and routes in another. For example, we show that a (c,p) MoT design can be mapped to a (2c,p) linear population ToM, and we introduce a corner turn scheme that makes it possible to perform the reverse mapping from any (c,p) linear population ToM to a (2c,p) MoT augmented with a particular set of corner turn switches. One consequence of this latter mapping is a multilayer layout strategy for N-node, linear population ToM designs that requires only Θ(N) two-dimensional area for any p when given sufficient wiring layers. We further show upper and lower bounds for global mesh routes based on recursive bisection width and show that these are within a constant factor of each other and within a constant factor of the MoT and ToM layout area. In the process we identify the parameters and characteristics that make the networks different, making it clear that there is a unified design continuum in which these networks are simply particular regions.
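
    The limited-bisection assumption can be made concrete with Rent's rule: a region of n nodes exposes roughly c·n^p external wires, and summing cut widths over recursive bisections gives a rough estimate of total wiring. The sketch below uses this textbook accounting with purely illustrative constants; it is not the paper's bounds or proofs.

```python
# Rough sketch (textbook recursive-bisection accounting under Rent's rule
# IO = c * n^p; constants are illustrative, not the paper's results) of how
# the wiring of an N-node design with p < 1 can be estimated.

def bisection_width(c, p, n):
    """Rent's rule estimate of wires crossing a cut of an n-node region."""
    return c * n ** p

def total_wiring(c, p, n):
    """Recursively halve the design and sum the cut widths at every level."""
    if n <= 1:
        return 0.0
    return bisection_width(c, p, n) + 2 * total_wiring(c, p, n // 2)

for n in (64, 256, 1024):
    print(n, round(bisection_width(6, 0.7, n), 1), round(total_wiring(6, 0.7, n), 1))
```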

    Achieving High Speed CFD simulations: Optimization, Parallelization, and FPGA Acceleration for the unstructured DLR TAU Code

    Today, large-scale parallel simulations are fundamental tools for handling complex problems. The number of processors in current computing platforms has recently increased, so it is necessary to optimize application performance and enhance the scalability of massively parallel systems. In addition, new heterogeneous architectures that combine conventional processors with specific hardware, such as FPGAs, to accelerate the most time-consuming functions are considered a strong alternative for boosting performance. In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code's efficiency is addressed through three key activities: optimization, parallelization, and hardware acceleration. First, a profiling analysis of the most time-consuming processes of the Reynolds-averaged Navier-Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, the scalability of the code is studied and new partitioning algorithms are tested to identify the most suitable ones for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented.

    Geometric modeling for computer aided design

    The primary goal of this grant has been the design and implementation of software to be used in the conceptual design of aerospace vehicles, focused particularly on the elements of geometric design, graphical user interfaces, and the interaction of the multitude of software typically used in this engineering environment. This has resulted in the development of several analysis packages and design studies, including two major software systems currently used in the conceptual-level design of aerospace vehicles: SMART, the Solid Modeling Aerospace Research Tool, and EASIE, the Environment for Software Integration and Execution. Additional software tools were designed and implemented to address the needs of the engineer working in the conceptual design environment. SMART provides conceptual designers with a rapid prototyping capability and several engineering analysis capabilities. In addition, SMART has a carefully engineered user interface that makes it easy to learn and use. Finally, a number of specialty characteristics have been built into SMART which allow it to be used efficiently as a front-end geometry processor for other analysis packages. EASIE provides a set of interactive utilities that simplify the task of building and executing computer-aided design systems consisting of diverse, stand-alone analysis codes. This streamlines the exchange of data between programs, reducing errors and improving efficiency. EASIE provides both a methodology and a collection of software tools to ease the task of coordinating engineering design and analysis codes.

    An improved generalization of mesh-connected computers with multiple buses

    Mesh-connected computers (MCCs) are an important class of parallel architectures due to their simple and regular interconnections. However, their performance is restricted by their large diameters. Various augmenting mechanisms have been proposed to enhance the communication efficiency of MCCs. One major approach is to add nonconfigurable buses for improved broadcasting. A typical example is the mesh-connected computer with multiple buses (MMB). We propose a new class of generalized MMBs, the improved generalized MMBs (IMMBs). We compare IMMBs with MMBs and with a class of previously proposed generalized MMBs (GMMBs). We show the power of IMMBs by considering semigroup and prefix computations. Specifically, as our main result we show that for any constant 0 < ε < 1, one can construct an N^(1/2) × N^(1/2) square IMMB on which semigroup and prefix computations on N operands can be carried out in O(N^ε) time, while maintaining O(1) broadcasting time. Compared with the previous best complexities of O(N^(1/8)) and O(N^(1/16)), achieved on a rectangular MMB and GMMB, respectively, for the same computations, our results show that IMMBs are more powerful than MMBs and GMMBs.
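
    A semigroup computation simply combines N operands with an associative operator. The sketch below illustrates the generic block-then-combine structure (reduce each block locally, then combine the block results); it is only an illustration of the computation being solved and does not model the IMMB, its buses, or the O(N^ε) time bound.

```python
# Illustrative sketch (generic block-then-combine scheme, not the IMMB
# algorithm itself) of a semigroup computation on N operands.
from functools import reduce

def semigroup_compute(values, op, block_size):
    """Reduce each block locally, then combine the block results."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    block_results = [reduce(op, blk) for blk in blocks]   # local, parallelisable
    return reduce(op, block_results)                       # global combine

data = list(range(1, 17))                                  # N = 16 operands
print(semigroup_compute(data, lambda x, y: x + y, block_size=4))  # 136
print(semigroup_compute(data, max, block_size=4))                 # 16
```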

    An efficient parallel algorithm for the all pairs shortest path problem using processor arrays with reconfigurable bus systems

    The all pairs shortest path problem is an instance of the algebraic path problem. Many parallel algorithms for its solution appear in the literature. One of the efficient parallel algorithms on the W-RAM model is given by Kucera [17]. Though efficient, algorithms written for the W-RAM model of parallel computation are too idealistic to be implemented on current hardware. In this report we present an efficient parallel algorithm for the solution of this problem using a relatively new model of parallel computing, Processor Arrays with Reconfigurable Bus Systems (PARBS). The parallel time complexity of this algorithm is O(log^2 n) and the processor complexity is n^2 × n × n.
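
    Many parallel all pairs shortest path algorithms build on repeated (min, +) matrix squaring, since ceil(log2 n) squarings of the weight matrix yield all shortest paths. The sequential sketch below shows that underlying technique as an assumption about the general approach; the report's mapping onto a PARBS is not reproduced here.

```python
# Sequential sketch (standard repeated (min,+) squaring, shown only as the
# underlying technique; the PARBS mapping is not reproduced) for all pairs
# shortest paths. ceil(log2 n) squarings suffice.
import math

INF = math.inf

def min_plus_square(D):
    n = len(D)
    return [[min(D[i][k] + D[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_shortest_paths(W):
    """W[i][j]: edge weight (INF if absent, 0 on the diagonal)."""
    n, D = len(W), [row[:] for row in W]
    for _ in range(max(1, math.ceil(math.log2(n)))):
        D = min_plus_square(D)
    return D

W = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
print(all_pairs_shortest_paths(W))
```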

    Fast Inner Product Computation on Short Buses

    We propose a VLSI inner product processor architecture involving broadcasting only over short buses (containing fewer than 64 switches). The architecture leads to an efficient algorithm for the inner product computation. Specifically, it takes 13 broadcasts, each over fewer than 64 switches, plus 2 carry-save additions (time t_csa each) and 2 carry-lookahead additions (time t_cla each) to compute the inner product of two arrays of N = 2^9 elements, each consisting of m = 64 bits. Using the same order of VLSI area, our algorithm runs faster than the best known fast inner product algorithm of Smith and Torng [Design of a fast inner product processor, Proceedings of the IEEE 7th Symposium on Computer Arithmetic (1985)], which takes about 28 t_csa + t_cla for the computation.
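
    The cost model above counts broadcasts plus carry-save (t_csa) and carry-lookahead (t_cla) addition times. A carry-save step is cheap because it compresses three operands into a sum word and a carry word with no carry propagation, so its cost does not grow with the word length m. The sketch below shows that textbook behaviour; it is not the paper's VLSI design, and the operand values are purely illustrative.

```python
# Sketch (textbook carry-save adder behaviour, not the paper's circuit):
# three m-bit operands are compressed to a sum word and a carry word
# without any carry propagation.

def carry_save_add(a, b, c, m=64):
    """Return (sum_word, carry_word) with a + b + c == sum_word + carry_word."""
    mask = (1 << m) - 1
    sum_word = (a ^ b ^ c) & mask                              # bitwise sum, no carries
    carry_word = (((a & b) | (a & c) | (b & c)) << 1) & mask   # majority carries, shifted
    return sum_word, carry_word

a, b, c = 0x1234, 0x5678, 0x9ABC
s, cy = carry_save_add(a, b, c)
assert s + cy == a + b + c        # one final fast (carry-lookahead) adder finishes the job
print(hex(s), hex(cy))
```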

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This simultaneous shrinking of feature size and boost in performance demands lower supply voltages and effective management of power dissipation in chips with millions of transistors. It has triggered a substantial amount of research into power reduction techniques in almost every aspect of the chip, particularly the processor cores it contains. This paper presents an overview of techniques for achieving power efficiency, mainly at the processor core level, but also visits related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, cache, and pipelining, can be optimized to reduce the power consumption of the processor, and this paper discusses the ways in which they can be optimized. Emerging power-efficient processor architectures are also surveyed and current research activities are discussed, which should help the reader identify how these factors contribute to a processor's power consumption. Some of these concepts are already established, whereas others are still active research areas.
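
    The influence of supply voltage and clock frequency on power is usually captured by the textbook CMOS dynamic-power relation P_dyn ≈ α·C·V²·f. The example below plugs in purely illustrative numbers (not figures from the surveyed papers) to show why scaling voltage and frequency together yields roughly cubic power savings.

```python
# Worked example (textbook CMOS dynamic-power model; all numbers are
# illustrative assumptions): P_dyn ~ alpha * C * V^2 * f.

def dynamic_power(alpha, capacitance, voltage, frequency):
    """Approximate switching power of a CMOS circuit."""
    return alpha * capacitance * voltage ** 2 * frequency

nominal = dynamic_power(alpha=0.2, capacitance=1e-9, voltage=1.2, frequency=2.0e9)
scaled  = dynamic_power(alpha=0.2, capacitance=1e-9, voltage=0.9, frequency=1.5e9)
print(f"nominal: {nominal:.3f} W, scaled: {scaled:.3f} W, "
      f"saving: {100 * (1 - scaled / nominal):.0f}%")
```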