20 research outputs found

    High Performance Reconfigurable Computing for Linear Algebra: Design and Performance Analysis

    Field Programmable Gate Arrays (FPGAs) enable powerful performance acceleration for scientific computations because of their intrinsic parallelism, pipelining capability, and flexible architecture. This dissertation explores the computational power of FPGAs for an important scientific application: linear algebra. First, optimized linear algebra subroutines are presented based on enhancements to both algorithms and hardware architectures. Compared to microprocessors, these routines achieve significant speedup. Second, computing with mixed-precision data on FPGAs is proposed for higher performance. Experimental analysis shows that mixed-precision algorithms on FPGAs can achieve the high performance of lower-precision data while keeping higher-precision accuracy when solving systems of linear equations. Third, an execution time model is built for reconfigurable computers (RC), which plays an important role in performance analysis and optimal resource utilization of FPGAs. The accuracy and efficiency of parallel computing performance models often depend on mean maximum computations. Despite significant prior work, there have been no sufficient mathematical tools for this important calculation. This work presents an Effective Mean Maximum Approximation method, which is more general, accurate, and efficient than previous methods. Together, these research results help address how to make linear algebra applications perform better on high performance reconfigurable computing architectures.
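    The mixed-precision idea in the abstract above is commonly realized as iterative refinement: solve in low precision, then correct the solution with residuals computed in high precision. The following is a minimal software sketch of that general technique, assuming NumPy; it is not the dissertation's FPGA design, and the function name `mixed_precision_solve` and its `iters` parameter are hypothetical.

```python
import numpy as np


def mixed_precision_solve(A, b, iters=5):
    """Solve A x = b with float32 solves and float64 residual correction.

    Illustrative sketch only: a practical implementation would factor A
    once (LU) and reuse the factors instead of calling solve repeatedly.
    """
    A32 = A.astype(np.float32)
    # Initial low-precision solution, promoted to float64 for refinement.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                   # residual in float64
        d = np.linalg.solve(A32, r.astype(np.float32))  # correction in float32
        x += d.astype(np.float64)
    return x
```

    For a well-conditioned matrix, a few refinement steps typically bring the result close to what a full double-precision solve would give, while the expensive solve step runs in single precision.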

    A Many-Core Overlay for High-Performance Embedded Computing on FPGAs

    In this work, we propose a configurable many-core overlay for high-performance embedded computing. The size of internal memory, supported operations and number of ports can be configured independently for each core of the overlay. The overlay was evaluated with matrix multiplication, LU decomposition and Fast Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform. The results show that using a system-level many-core overlay avoids complex hardware design and still provides good performance results. Comment: Presented at First International Workshop on FPGAs for Software Programmers (FSP 2014) (arXiv:1408.4423)

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems and computer vision. Matrix decompositions are computationally expensive, and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
    • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
    • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrarily sized matrices.
    • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
    • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
    • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
    • We present a configurable architecture to accelerate Homotopy l1-minimization, in which a modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
    Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application- and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.

    Design of FPGA-Implemented Reed-Solomon Erasure Code (RS-EC) Decoders With Fault Detection and Location on User Memory

    Reed–Solomon erasure codes (RS-ECs) are widely used in packet communication and storage systems to recover erasures. When the RS-EC decoder is implemented on a field-programmable gate array (FPGA) in a space platform, it will suffer single-event upsets (SEUs) that can cause failures. In this article, the reliability of an RS-EC decoder implemented on an FPGA when there are errors in the user memory is first studied. Then, a fault detection and location scheme is proposed based on partial reencoding for the faults in the user memory of the RS-EC decoder. Furthermore, check bits are added in the generator matrix to improve the fault location performance. The theoretical analysis shows that the scheme could detect most faults with small missing and false detection probability. Experimental results on a case study show that more than 90% of the faults on user memory could be tolerated by the decoder, and all the other faults can be detected by the fault detection scheme when the number of erasures is smaller than the correction capability of the code. Although false alarms exist (with probability smaller than 4%), they can be used to avoid fault accumulation. Finally, the fault location scheme could accurately locate all the faults. The theoretical estimates are very close to the experimental results, which verifies the correctness of the analysis. This work was supported in part by the National Natural Science Foundation of China under Grant 61501321, in part by the China Postdoctoral Science Foundation and Luoyang Newvid Technology Company, Ltd., and in part by the ACHILLES Project PID2019-104207RB-I00 funded by the Spanish Ministry of Science and Innovation.

    Kernel solver design of FPGA-based real-time simulator for active distribution networks

    The field-programmable gate array (FPGA)-based real-time simulator takes advantage of many merits of FPGAs, such as a small time-step, high simulation precision, rich I/O interface resources, and low cost. The sparse linear equations formed by the node conductance matrix need to be solved repeatedly within each time-step, which poses great challenges to the performance of the real-time simulator. In this paper, a fine-grained solver of the FPGA-based real-time simulator for active distribution networks is designed to meet the computational demand. The framework of the solver, the offline process design on a PC, and the online process design on the FPGA are described in detail. The modified IEEE 33-node system with photovoltaics is simulated on a 4-FPGA-based real-time simulator. Simulation results are compared with PSCAD/EMTDC under the same conditions to validate the solver design.

    RTL implementation of one-sided jacobi algorithm for singular value decomposition

    Multi-dimensional digital signal processing, such as image processing and image reconstruction, involves manipulating matrix data. Better-quality images involve large amounts of data, which results in unacceptably slow computation. A parallel processing scheme is a possible solution to this problem. This project presents an analysis and comparison of various algorithms for widely used matrix decomposition techniques and of various computer architectures. As a result, a parallel implementation of the one-sided Jacobi algorithm for computing the singular value decomposition (SVD) of a 2×2 matrix on field programmable gate arrays (FPGAs) is developed. The proposed SVD design is based on a pipelined-datapath architecture. The design process starts with evaluating the algorithm in Matlab, followed by designing the datapath unit and control unit, coding in SystemVerilog HDL, verification and synthesis using Quartus II, and simulation in ModelSim-Altera. Original matrices of size 4×4 and 8×8 are used with the SVD processing element (PE). The results are compared with the Matlab version of the algorithm to evaluate the PE. The SVD computation can be sped up by a factor of more than 2 by increasing the number of PEs, at the cost of increased circuit area.
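    The one-sided (Hestenes) Jacobi method named in the abstract above orthogonalizes pairs of columns with plane rotations until all columns are mutually orthogonal. The following is a minimal software sketch of the textbook algorithm, assuming NumPy and a full-column-rank input; it is not the RTL design from the listed work, and the name `one_sided_jacobi_svd` and the parameters `sweeps` and `tol` are hypothetical.

```python
import numpy as np


def one_sided_jacobi_svd(A, sweeps=30, tol=1e-12):
    """Return (U, sigma, V) with A ~= U @ diag(sigma) @ V.T.

    Singular values are returned unsorted. Assumes A has full column
    rank, so that no column norm is zero at normalization time.
    """
    U = np.array(A, dtype=float)  # working copy; columns become orthogonal
    n = U.shape[1]
    V = np.eye(n)                 # accumulates the applied rotations
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                # Skip pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of columns p, q.
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0 else -1.0
                t = sign / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                for M in (U, V):
                    mp = M[:, p].copy()
                    M[:, p] = c * mp - s * M[:, q]
                    M[:, q] = s * mp + c * M[:, q]
        if not rotated:
            break  # a full sweep without rotations means convergence
    sigma = np.linalg.norm(U, axis=0)  # column norms are the singular values
    return U / sigma, sigma, V
```

    Each column pair is processed with only inner products and a rotation, and disjoint pairs are independent within a sweep, which is one reason this formulation is often chosen for parallel hardware.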

    Decomposition tool targeting FPGA architectures

    The growing interest in logic synthesis targeting Field Programmable Gate Arrays (FPGAs) and the active research carried out by a number of groups in the area of functional decomposition are the prime motivation for this thesis. Logic synthesis has been an area of interest in many universities all over the world. The work involves the study and implementation of techniques and methods in logic synthesis. In this work, a logic synthesis tool has been developed that implements aspects of the general and complete decomposition method based on functional decomposition techniques [4]. The tool is aimed at producing outputs faster and more efficiently than the available software. The C++ Standard Template Library is used to develop this tool. The output of this tool is designed to be compatible with the available vendor software. The tool has been tested on MCNC benchmarks and on benchmarks created with industry requirements in mind.

    Pipelined Numerical Integration on Reduced Accuracy Architectures for Power System Transient Simulations

    This work concerns a dedicated mixed-signal power system dynamic simulator. The equations that describe the behavior of a power system can be decoupled into a large linear system, which is handled by the analog part of the hardware, and a set of differential equations. The latter are solved using numerical integration algorithms implemented in dedicated pipelines on a field programmable gate array (FPGA). This data path operates in a precision-starved environment, since it is synthesized using fixed-point arithmetic and relies on low-precision solutions that come from the analog linear solver. In this paper, the pipelined integration scheme is presented, and different numerical integration algorithms are assessed based on their effect on the final results. It is concluded that in low-precision environments higher-order integration algorithms should be preferred when the time step is large, since simpler algorithms result in unacceptable artifacts (extraneous instabilities).

    Entropy Based Robust Watermarking Algorithm

    With the growth of digital media distributed over the Internet, concerns about security and piracy have emerged. The amount of digital media reproduction and tampering has created a need for content watermarking. In this work, multiple robust watermarking algorithms are introduced. They embed the watermark image into the singular values of host-image blocks with low entropy. In the proposed algorithms, the host image is divided into blocks, and the entropy of each block is calculated. The average of all block entropies is chosen as the threshold for selecting the blocks in which the watermark image is embedded. All blocks with entropy lower than this threshold are decomposed into frequency subbands using the discrete wavelet transform (DWT). Subsequently, the chirp z-transform (CZT) is applied to the low-frequency subband, followed by an appropriate matrix decomposition such as lower-upper decomposition (LUD) or orthogonal-triangular decomposition (QR decomposition). By applying singular value decomposition (SVD) to the diagonal matrices obtained from these decompositions, the singular values of each block are calculated. The watermark image is embedded by adding its singular values to the singular values of the low-entropy blocks. The proposed algorithms are tested on many host and watermark images and compared with conventional and other state-of-the-art watermarking techniques. The quantitative and qualitative experimental results indicate that the proposed algorithms are imperceptible and robust against many signal processing attacks.
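    The block-selection step described in the abstract above (split the host image into blocks, compute each block's entropy, and keep blocks below the mean entropy) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes NumPy, an 8-bit grayscale image, Shannon entropy over the pixel histogram, and a hypothetical block size of 8; the function names are invented for the example. The later DWT, CZT, LU/QR, and SVD embedding stages are not covered.

```python
import numpy as np


def block_entropy(block, levels=256):
    """Shannon entropy (bits) of the pixel-value histogram of one block."""
    hist = np.bincount(block.ravel(), minlength=levels).astype(float)
    p = hist[hist > 0] / block.size  # empirical probabilities, zeros dropped
    return float(-(p * np.log2(p)).sum())


def select_low_entropy_blocks(image, block=8):
    """Return (positions, threshold) for blocks below the mean entropy.

    `image` is assumed to be a 2-D uint8 array; partial blocks at the
    right and bottom edges are ignored in this sketch.
    """
    h, w = image.shape
    entropies = {}
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            entropies[(r, c)] = block_entropy(image[r:r + block, c:c + block])
    threshold = sum(entropies.values()) / len(entropies)  # mean entropy
    selected = [pos for pos, e in entropies.items() if e < threshold]
    return selected, threshold
```

    The returned positions are the top-left corners of the candidate blocks; a full pipeline would pass only those blocks on to the transform and embedding stages.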