
    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
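
    To make these transformations concrete, here is a minimal sketch, not taken from the paper, of two of the techniques it names: pipelining and on-chip data reuse via a shift register. The pragma spellings follow Xilinx Vitis HLS and differ between vendors; the kernel and all names in it are illustrative.

        // 3-point moving-average kernel restructured for HLS: a fully
        // partitioned shift register keeps the last three inputs in
        // registers, so each iteration reads exactly one new element from
        // the memory interface (resolving interface contention) and the
        // loop can be pipelined at an initiation interval (II) of 1.
        #define N 1024

        void stencil(const float in[N], float out[N]) {
            float window[3] = {0.0f, 0.0f, 0.0f};
            #pragma HLS ARRAY_PARTITION variable=window complete

            for (int i = 0; i < N; ++i) {
                #pragma HLS PIPELINE II=1
                window[0] = window[1];
                window[1] = window[2];
                window[2] = in[i];
                if (i >= 2)  // boundary cells at the array edges are skipped
                    out[i - 1] = (window[0] + window[1] + window[2]) / 3.0f;
            }
        }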

    Effective network grid synthesis and optimization for high performance very large scale integration system design

    Degree system: new; Ministry of Education report number: Kō No. 2642; Degree: Doctor of Engineering; Date conferred: 2008/3/15; Waseda University degree number: Shin 480

    Parallelization of dynamic programming recurrences in computational biology

    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are 15x to 130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3-GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
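
    As context for the recurrence-equation abstraction, the following is a plain sequential sketch of the Nussinov recurrence in its standard textbook form, not the dissertation's code. Cells along an anti-diagonal of the table are mutually independent, which is exactly the fine-grained parallelism a systolic FPGA array can exploit.

        // Sequential reference for Nussinov RNA folding: N[i][j] holds the
        // maximum number of base pairs in the subsequence s[i..j].
        #include <algorithm>
        #include <string>
        #include <vector>

        static bool pairs(char a, char b) {
            // Watson-Crick pairs plus the G-U wobble pair.
            return (a == 'A' && b == 'U') || (a == 'U' && b == 'A') ||
                   (a == 'C' && b == 'G') || (a == 'G' && b == 'C') ||
                   (a == 'G' && b == 'U') || (a == 'U' && b == 'G');
        }

        int nussinov(const std::string& s) {
            const int n = static_cast<int>(s.size());
            std::vector<std::vector<int>> N(n, std::vector<int>(n, 0));
            for (int len = 2; len <= n; ++len) {         // subsequence length
                for (int i = 0; i + len - 1 < n; ++i) {
                    const int j = i + len - 1;
                    int best = std::max(N[i + 1][j], N[i][j - 1]);
                    if (pairs(s[i], s[j]))               // pair the two ends
                        best = std::max(best,
                            (i + 1 <= j - 1 ? N[i + 1][j - 1] : 0) + 1);
                    for (int k = i + 1; k < j; ++k)      // bifurcation
                        best = std::max(best, N[i][k] + N[k + 1][j]);
                    N[i][j] = best;
                }
            }
            return n > 0 ? N[0][n - 1] : 0;
        }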

    Vector quantization

    During the past ten years, Vector Quantization (VQ) has developed from a theoretical possibility promised by Shannon's source coding theorems into a powerful and competitive technique for speech and image coding and compression at medium to low bit rates. In this survey, the basic ideas behind the design of vector quantizers are sketched, and comments are made on the state of the art and current research efforts.
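
    As a minimal illustration of the basic idea (not code from the survey), the encoding step of a vector quantizer is an exhaustive nearest-codeword search under squared-error distortion; codebook design methods such as the LBG/generalized Lloyd algorithm alternate this assignment step with centroid updates. All names below are illustrative.

        // Encode one source vector as the index of its nearest codeword.
        // Only the index is transmitted, costing log2(codebook size) bits.
        #include <cstddef>
        #include <limits>
        #include <vector>

        using Vec = std::vector<double>;

        std::size_t encode(const Vec& x, const std::vector<Vec>& codebook) {
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < codebook.size(); ++c) {
                double d = 0.0;
                for (std::size_t k = 0; k < x.size(); ++k) {
                    const double diff = x[k] - codebook[c][k];
                    d += diff * diff;  // squared-error distortion
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            return best;
        }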

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining, and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems, and computer vision. Matrix decompositions are computationally expensive, and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
    • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
    • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrarily sized matrices.
    • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
    • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
    • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
    • We present a configurable architecture to accelerate Homotopy l1-minimization, in which a modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
    Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application- and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.
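
    For reference, here is a minimal sequential sketch, not the dissertation's design, of QR decomposition by Givens rotations, one of the two schemes the configurable QRD architecture switches between. Each rotation zeroes one subdiagonal entry, and rotations acting on disjoint row pairs are independent, which is what a deeply pipelined FPGA implementation exploits.

        // In-place QR by Givens rotations: on return, A holds the upper
        // triangular factor R (Q is not accumulated in this sketch).
        #include <cmath>
        #include <vector>

        using Matrix = std::vector<std::vector<double>>;

        void givensQR(Matrix& A) {
            const int m = static_cast<int>(A.size());
            const int n = static_cast<int>(A[0].size());
            for (int j = 0; j < n; ++j) {
                for (int i = m - 1; i > j; --i) {   // zero A[i][j], bottom-up
                    const double a = A[i - 1][j], b = A[i][j];
                    const double r = std::hypot(a, b);
                    if (r == 0.0) continue;         // entry already zero
                    const double c = a / r, s = b / r;
                    for (int k = j; k < n; ++k) {   // rotate rows i-1 and i
                        const double t1 = A[i - 1][k], t2 = A[i][k];
                        A[i - 1][k] = c * t1 + s * t2;
                        A[i][k]     = -s * t1 + c * t2;
                    }
                }
            }
        }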

    Low-Complexity Uncertainty-Set-Based Robust Adaptive Beamforming for Passive Sonar

    Recent work has highlighted the potential benefits of exploiting ellipsoidal uncertainty-set-based robust Capon beamformer (RCB) techniques in passive sonar. Regrettably, the computational complexity required to form RCB weights is cubic in the number of adaptive degrees of freedom, which is often prohibitive in practice. For this reason, several low-complexity techniques for computing RCB weights, or equivalent worst-case robust adaptive beamformer weights, have recently been developed. These techniques, whose complexities are only quadratic in the number of adaptive degrees of freedom, use gradient-based, reduced-dimension Krylov-subspace or Kalman-filtering methods. In this work, we review these techniques for passive sonar, analyzing their complexities and evaluating them initially on simulated data. The best performing methods are then evaluated on two in-water recorded passive sonar data sets. One set, containing a strong controlled acoustic source, demonstrates the ability of the algorithms to protect against signal cancellation when pointing at the source, and their ability to reject the source when pointing away from it. The other data set, recorded during a period when the boat was accelerating, demonstrates the ability of the algorithms to operate in the presence of speed-induced noise.
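
    For context, the quantities involved can be stated compactly. The following is the standard formulation from the RCB literature rather than an excerpt from this paper: given the sample covariance R and the nominal steering vector \bar{a} (with \lVert\bar{a}\rVert^{2} > \varepsilon so that a = 0 is infeasible), the RCB first estimates the steering vector over an ellipsoidal uncertainty set and then forms Capon weights from it,

        \hat{a} = \arg\min_{a}\; a^{H} R^{-1} a
        \quad \text{subject to} \quad \lVert a - \bar{a} \rVert^{2} \le \varepsilon,
        \qquad
        w_{\mathrm{RCB}} = \frac{R^{-1}\hat{a}}{\hat{a}^{H} R^{-1} \hat{a}}
        \;\propto\; \Bigl(R + \tfrac{1}{\lambda} I\Bigr)^{-1} \bar{a},

    where \lambda > 0 is the Lagrange multiplier of the active constraint, so the RCB amounts to Capon beamforming with a data-dependent diagonal loading. Solving for \lambda via an eigendecomposition of R is what incurs the cubic cost that the reviewed low-complexity methods avoid.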

    Digital Filter Design Using Improved Teaching-Learning-Based Optimization

    Digital filters are an important part of digital signal processing systems. Digital filters are divided into finite impulse response (FIR) digital filters and infinite impulse response (IIR) digital filters according to the length of their impulse responses. An FIR digital filter is easier to implement than an IIR digital filter because of its linear phase and stability properties. For an IIR digital filter to be stable, the poles of its transfer function must satisfy stability constraints. In addition, a digital filter can be categorized as one-dimensional or multi-dimensional according to the dimensions of the signal to be processed. However, traditional design methods for IIR digital filters suffer from slow convergence and easily become trapped in local optima. The Teaching-Learning-Based Optimization (TLBO) algorithm has been proven beneficial in a wide range of engineering applications. To this end, this dissertation focuses on using TLBO and its improved variants to design five types of digital filters: linear phase FIR digital filters, multiobjective general FIR digital filters, multiobjective IIR digital filters, two-dimensional (2-D) linear phase FIR digital filters, and 2-D nonlinear phase FIR digital filters. Among them, the linear phase FIR, 2-D linear phase FIR, and 2-D nonlinear phase FIR digital filters are optimized with single-objective TLBO algorithms; the multiobjective general FIR digital filters are optimized with the multiobjective non-dominated TLBO (MOTLBO) algorithm; and the multiobjective IIR digital filters are optimized with MOTLBO using Euclidean distance. The design results for the five types of filters are compared to those obtained by other state-of-the-art design methods. In this dissertation, two major improvements are proposed to enhance the performance of the standard TLBO algorithm. The first is a gradient-based learning that replaces the TLBO learner phase to reduce approximation error and CPU time without sacrificing design accuracy for linear phase FIR digital filter design. The second incorporates Manhattan distance to simplify the procedure of the MOTLBO algorithm for general FIR digital filter design. The design results obtained with these two improvements demonstrate their efficiency and effectiveness.
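
    As a reference point for the two improvements, here is a minimal sketch of one generation of the standard TLBO algorithm, teacher phase followed by learner phase, minimizing a filter-design error f over coefficient vectors. It is a generic skeleton, not the dissertation's implementation; the gradient-based learning replaces the learner phase below, and the MOTLBO variants generalize the greedy acceptance to non-dominated selection. All names are illustrative.

        // One TLBO generation over a population of candidate filter
        // coefficient vectors; f returns the design error to minimize.
        #include <cstddef>
        #include <functional>
        #include <random>
        #include <vector>

        using Vec = std::vector<double>;

        void tlboStep(std::vector<Vec>& pop, std::vector<double>& fit,
                      const std::function<double(const Vec&)>& f,
                      std::mt19937& rng) {
            std::uniform_real_distribution<double> U(0.0, 1.0);
            const std::size_t n = pop.size(), d = pop[0].size();

            // Teacher phase: move each learner toward the best solution,
            // away from TF times the class mean (TF is randomly 1 or 2).
            std::size_t best = 0;
            for (std::size_t i = 1; i < n; ++i)
                if (fit[i] < fit[best]) best = i;
            Vec mean(d, 0.0);
            for (const Vec& x : pop)
                for (std::size_t k = 0; k < d; ++k) mean[k] += x[k] / n;
            for (std::size_t i = 0; i < n; ++i) {
                const int TF = 1 + static_cast<int>(rng() & 1u);
                Vec cand = pop[i];
                for (std::size_t k = 0; k < d; ++k)
                    cand[k] += U(rng) * (pop[best][k] - TF * mean[k]);
                const double fc = f(cand);
                if (fc < fit[i]) { pop[i] = cand; fit[i] = fc; }  // greedy
            }

            // Learner phase: each learner moves toward a random peer that
            // is better, and away from one that is worse.
            std::uniform_int_distribution<std::size_t> pick(0, n - 1);
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t j = pick(rng);
                if (j == i) continue;
                const double sign = (fit[i] < fit[j]) ? 1.0 : -1.0;
                Vec cand = pop[i];
                for (std::size_t k = 0; k < d; ++k)
                    cand[k] += sign * U(rng) * (pop[i][k] - pop[j][k]);
                const double fc = f(cand);
                if (fc < fit[i]) { pop[i] = cand; fit[i] = fc; }
            }
        }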