26 research outputs found

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems and computer vision. Matrix decompositions are computationally expensive, and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
    • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
    • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrarily sized matrices.
    • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
    • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
    • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
    • We present a configurable architecture to accelerate Homotopy l1-minimization, in which a modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
    Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application- and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.
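The QRD architecture above can be configured for either Householder transformations or Givens rotations. As a plain-software point of reference (not the dissertation's FPGA design; `qr_givens` and its interface are illustrative names), a QR factorization by Givens rotations can be sketched as:

```python
import math

def givens(a, b):
    """Return c, s with [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def qr_givens(A):
    """QR decomposition of an m x n list-of-lists matrix (m >= n) by
    zeroing subdiagonal entries with Givens rotations."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for j in range(n):
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            for k in range(n):          # rotate rows i-1 and i of R
                t1, t2 = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * t1 + s * t2
                R[i][k] = -s * t1 + c * t2
            for k in range(m):          # accumulate Q = G1^T G2^T ...
                t1, t2 = Q[k][i - 1], Q[k][i]
                Q[k][i - 1] = c * t1 + s * t2
                Q[k][i] = -s * t1 + c * t2
    return Q, R
```

Each rotation touches only two rows, which is what makes the Givens variant attractive for deeply pipelined hardware.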

    Mimo Systems Low complexity SVD Implementation Analysis

    This paper analyses the implementation of the singular value decomposition (SVD) using approximations to the exact computation for MIMO systems in the case of modulation-mode and power-assignment set-up. The study focuses on low-complexity algorithms with low computational load, oriented to devices with limited resources such as FPGAs, highlighting some of their advantages and drawbacks against more sophisticated devices. The implementation of the SVD is analyzed through the algorithms that efficiently perform the required computations, seeking computationally efficient solutions that provide parallelism and low complexity. The CORDIC algorithm is a good candidate for this task since it can efficiently compute the singular value decomposition. It is shown that this algorithm provides an efficient tool for SVD computation with appropriate accuracy, and the computational complexity obtained and the required resources make it feasible to implement on an FPGA device. System performance degradation is analyzed in comparison with the conventional, exact method for SVD, yielding some key conclusions.
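As background on why CORDIC suits SVD hardware: in vectoring mode, CORDIC rotates a vector onto the x-axis with shift-and-add style updates, recovering its magnitude and angle (the core operations behind Jacobi rotations). The sketch below is a floating-point illustration under assumed naming; a real FPGA datapath would use fixed-point arithmetic with hard-wired shifts and a precomputed gain constant:

```python
import math

def cordic_vectoring(x, y, iterations=24):
    """Vectoring-mode CORDIC: drive y toward 0 using only shifts and adds,
    returning (magnitude, angle). Assumes x > 0 (|angle| < ~99 degrees)."""
    K = 1.0
    angle = 0.0
    for i in range(iterations):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))   # gain compensation
        d = -1.0 if y > 0 else 1.0                     # rotate toward y = 0
        x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
        angle -= d * math.atan(2.0 ** (-i))            # table value in hardware
    return x * K, angle
```

Each iteration adds roughly one bit of accuracy, so the iteration count is chosen to match the datapath word length.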

    RTL implementation of one-sided jacobi algorithm for singular value decomposition

    Multi-dimensional digital signal processing tasks such as image processing and image reconstruction involve manipulating matrix data. Better-quality images involve large amounts of data, which result in unacceptably slow computation. A parallel processing scheme is a possible solution to this problem. This project presents an analysis and comparison of various algorithms for widely used matrix decomposition techniques on various computer architectures. As a result, a parallel implementation of the one-sided Jacobi algorithm for computing the singular value decomposition (SVD) of a 2×2 matrix on field programmable gate arrays (FPGAs) is developed. The proposed SVD design is based on a pipelined-datapath architecture. The design process starts by evaluating the algorithm in Matlab, designing the datapath unit and control unit, coding in SystemVerilog HDL, and performing verification and synthesis using Quartus II with simulation on ModelSim-Altera. Original matrices of size 4×4 and 8×8 are used with the SVD processing element (PE). The results are compared with the Matlab version of the algorithm to evaluate the PE. The computation of the SVD can be sped up by a factor of more than 2 by increasing the number of PEs, at the cost of increased circuit area.
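The 2×2 building block of the one-sided Jacobi method can be illustrated in software: a single right-hand rotation orthogonalizes the two columns, after which the singular values are simply the column norms. This is a hedged sketch (the function name and the full-rank assumption are mine, not the thesis's RTL):

```python
import math

def svd2x2_one_sided(A):
    """One-sided Jacobi SVD of a full-rank 2x2 matrix A (list of lists):
    find a right rotation V so the columns of A @ V are orthogonal; the
    singular values are the column norms and U holds the normalized columns."""
    a = A[0][0] ** 2 + A[1][0] ** 2            # ||col 1||^2
    b = A[0][1] ** 2 + A[1][1] ** 2            # ||col 2||^2
    g = A[0][0] * A[0][1] + A[1][0] * A[1][1]  # col 1 . col 2
    theta = 0.0 if g == 0.0 else 0.5 * math.atan2(2.0 * g, a - b)
    c, s = math.cos(theta), math.sin(theta)
    # B = A @ V with V = [[c, -s], [s, c]]; its columns are orthogonal
    B = [[A[i][0] * c + A[i][1] * s, -A[i][0] * s + A[i][1] * c]
         for i in range(2)]
    s1 = math.hypot(B[0][0], B[1][0])
    s2 = math.hypot(B[0][1], B[1][1])
    U = [[B[0][0] / s1, B[0][1] / s2], [B[1][0] / s1, B[1][1] / s2]]
    V = [[c, -s], [s, c]]
    return U, (s1, s2), V                      # A = U @ diag(s1, s2) @ V^T
```

In the one-sided scheme for larger matrices, this column-pair step is swept over all pairs until convergence, which is what the PE array parallelizes.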

    HIGH PERFORMANCE, LOW COST SUBSPACE DECOMPOSITION AND POLYNOMIAL ROOTING FOR REAL TIME DIRECTION OF ARRIVAL ESTIMATION: ANALYSIS AND IMPLEMENTATION

    This thesis develops high performance real-time signal processing modules for direction of arrival (DOA) estimation for localization systems. It proposes highly parallel algorithms for performing subspace decomposition and polynomial rooting, which are otherwise traditionally implemented using sequential algorithms. The proposed algorithms address the emerging need for real-time localization for a wide range of applications. As the antenna array size increases, the complexity of signal processing algorithms increases, making it increasingly difficult to satisfy the real-time constraints. This thesis addresses real-time implementation by proposing parallel algorithms that maintain considerable improvement over traditional algorithms, especially for systems with a larger number of antenna array elements. Singular value decomposition (SVD) and polynomial rooting are two computationally complex steps and act as the bottleneck to achieving real-time performance. The proposed algorithms are suitable for implementation on field programmable gate arrays (FPGAs), single instruction multiple data (SIMD) hardware or application specific integrated circuits (ASICs), which offer a large number of processing elements that can be exploited for parallel processing. The designs proposed in this thesis are modular, easily expandable and easy to implement. Firstly, this thesis proposes a fast-converging SVD algorithm. The proposed method reduces the number of iterations it takes to converge to the correct singular values, thus achieving closer to real-time performance. A general algorithm and a modular system design are provided, making it easy for designers to replicate and extend the design to larger matrix sizes. Moreover, the method is highly parallel, which can be exploited on the various hardware platforms mentioned earlier. A fixed-point implementation of the proposed SVD algorithm is presented. 
The FPGA design is pipelined to the maximum extent to increase the maximum achievable frequency of operation. The system was developed with the objective of achieving high throughput. Various modern cores available in FPGAs were used to maximize performance, and these modules are described in detail. Finally, a parallel polynomial rooting technique based on Newton's method, applicable exclusively to root-MUSIC polynomials, is proposed. Unique characteristics of the root-MUSIC polynomial's complex dynamics were exploited to derive this polynomial rooting method. The technique exhibits parallelism and converges to the desired root within a fixed number of iterations, making it suitable for rooting polynomials of large degree. We believe this is the first time that the complex dynamics of the root-MUSIC polynomial have been analyzed to propose an algorithm. In all, the thesis addresses two major bottlenecks in a direction of arrival estimation system by providing simple, high-throughput, parallel algorithms.
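The rooting module builds on Newton's method. The thesis's root-MUSIC-specific parallel variant is not reproduced here, but the underlying scalar Newton iteration, with the polynomial and its derivative evaluated together by Horner's rule, can be sketched as follows (illustrative naming):

```python
def newton_poly_root(coeffs, z0, iters=50, tol=1e-12):
    """Newton's method for one root of
    p(z) = coeffs[0]*z^n + ... + coeffs[n]; z may be complex.
    p(z) and p'(z) are evaluated in a single Horner pass."""
    z = z0
    for _ in range(iters):
        p, dp = coeffs[0], 0.0
        for c in coeffs[1:]:
            dp = dp * z + p      # Horner recurrence for p'(z)
            p = p * z + c        # Horner recurrence for p(z)
        if dp == 0.0:
            break                # flat spot: cannot step
        step = p / dp
        z -= step
        if abs(step) < tol:
            break
    return z
```

In a parallel rooting scheme, many such iterations are run concurrently from different starting points, one per sought root.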

    Fast Algorithm Development for SVD: Applications in Pattern Matching and Fault Diagnosis

    The project aims for fast detection and diagnosis of faults occurring in process plants by designing a low-cost FPGA module for the computation. Fast detection and diagnosis while the process is still operating in a controllable region helps avoid the further advancement of the fault and reduces the productivity loss. Model-based methods are not popular in the domain of process control, as obtaining an accurate model is expensive and requires expertise. Data-driven methods like Principal Component Analysis (PCA) are quite popular diagnostic methods for process plants as they do not require any model. PCA is a widely used tool for dimensionality reduction, thus reducing the computational effort. The trends are captured in principal components, as it is difficult to have the same amount of disturbance as simulated in the historical database. The historical database has multiple instances of various kinds of faults and disturbances along with normal operation. A moving window approach has been employed to detect similar instances in the historical database based on the Standard PCA similarity factor. The measurements of variables of interest over a certain period of time form the snapshot dataset, S. At each instant, a window of the same size as the snapshot dataset is picked from the historical database and forms the historical window, H. The two datasets are then compared using similarity factors like the Standard PCA similarity factor, which signifies the angular difference between the principal components of the two datasets. Since many of the operating conditions are quite similar to each other and a significant number of mis-classifications have been observed, a candidate pool which orders the historical data windows by the values of the similarity factor is formed. Based on the most-detected operation among the top-most windows, the operating personnel take necessary action. The Tennessee Eastman Challenge process has been chosen as an initial case study for evaluating the performance. 
The measurements are sampled every minute and the fault having the smallest maximum duration is 8 hours. Hence the snapshot window size, m, has been chosen to consist of 500 samples, i.e., 8.33 hours of the most recent data of all 52 variables. Ideally, the moving window should replace the oldest sample with a new one. It would then take approximately the same number of comparisons as the size of the historical database. The size of the historical database is 4.32 million measurements (the past 8 years of data) for each of the 52 variables. With software simulation in Matlab, it takes around 80-100 minutes to sweep through the whole 4.32-million-sample historical database. Since most of the computation is spent finding the principal components of the two datasets using SVD, a hardware design has to be incorporated to accelerate the pattern matching approach. The thesis is organized as follows: Chapter 1 describes the moving window approach and the various similarity factors and metrics used for pattern matching. The previous work proposed by Ashish Singhal is based on skipping a few samples to reduce the computational effort and also employs windows as large as 5761 samples, which is four days of snapshot. Instead, a new method is proposed which skips samples when the similarity factor is quite low. A simplified form of the Standard PCA similarity factor has been proposed without any trade-off in accuracy. Pre-computation of the historical database can also be done, as the data is available a priori, but this entails a large memory requirement, as most of the time is spent in read/write operations. The large memory requirement is due to the fact that every sample gives rise to a 52×35 matrix, assuming the top 35 PCs are sufficient to capture the variance of the dataset. Chapter 2 describes various popular algorithms for SVD. Algorithms apart from Jacobi methods, like the Golub-Kahan and divide-and-conquer SVD algorithms, are briefly discussed. 
While bi-diagonal methods are very accurate, they suffer from large latency and are computationally intensive. On the other hand, Jacobi methods are computationally inexpensive and parallelizable, thus reducing the latency. We also evaluated the performance of the proposed hybrid Golub-Kahan Jacobi algorithm for our application. Chapter 3 describes the basic building block, CORDIC, which is used for performing the rotations required for Jacobi methods or for the n-D Householder reflections of Golub-Kahan SVD. CORDIC is widely employed in hardware design for computing trigonometric, exponential or logarithmic functions, as it makes use of simple shift and add/subtract operations. Two modes of CORDIC, namely rotation mode and vectoring mode, are discussed, which are used in the derivation of the two-sided Jacobi SVD. Chapter 4 describes the Jacobi methods of SVD, which are quite popular in hardware implementation as they are quite amenable to parallel computation. Two variants of Jacobi methods, namely the one-sided and two-sided Jacobi methods, are briefly discussed. The two-sided Jacobi method making use of CORDIC has been derived. The systolic array implementation, which has been quite popular in hardware implementation for the past three decades, is also discussed. Chapter 5 deals with the hardware implementation of pattern matching and reports a literature survey of various architectures developed for computing SVD. The Xilinx ZC7020 has been chosen as the target device for the FPGA implementation as it is an inexpensive device with many built-in peripherals. The latency reports with both Vivado HLS and Vivado SDSoC are also given for the application of interest. 
Evaluation of other case studies and other data-driven methods similar to PCA, like Correspondence Analysis (CA) and Independent Component Analysis (ICA), development of an efficient hybrid method for computing SVD in hardware and a highly discriminating similarity factor, and extending CORDIC to n dimensions for Householder reflections have been considered for future research.
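For reference, one standard definition of the PCA similarity factor used in such moving-window pattern matching (Krzanowski's form: the average squared cosine of the principal angles between the two top-k principal-component subspaces) can be sketched as follows; this is a software illustration, not the thesis's simplified or hardware version:

```python
import numpy as np

def pca_similarity(X, Y, k):
    """Standard PCA similarity factor between two data windows X and Y
    (rows = samples, columns = variables): (1/k) * trace(L^T M M^T L),
    where L and M hold the top-k principal components of each window."""
    # Principal directions = right singular vectors of the centered data
    _, _, Lt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    _, _, Mt = np.linalg.svd(Y - Y.mean(axis=0), full_matrices=False)
    L, M = Lt[:k].T, Mt[:k].T        # columns are principal components
    C = L.T @ M                      # k x k matrix of subspace inner products
    return np.trace(C @ C.T) / k     # = (1/k) * sum of squared cosines
```

The value lies in [0, 1], reaching 1 when the two windows span the same top-k subspace, which is why it can rank candidate historical windows.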

    Study of CORDIC based processing element for digital signal processing algorithms

    There is a high demand for the efficient implementation of complex arithmetic operations in many Digital Signal Processing (DSP) algorithms. The COordinate Rotation DIgital Computer (CORDIC) algorithm is suitable for implementing DSP algorithms since its calculation of complex arithmetic is simple and elegant. Besides, since it avoids multiplications, adopting the CORDIC algorithm can reduce the complexity. In this project, a CORDIC-based processing element for the construction of digital signal processing algorithms is implemented. This is a flexible device that can be used in the implementation of functions such as Singular Value Decomposition (SVD), the Discrete Cosine Transform (DCT) and many other important functions. It uses a CORDIC module to perform arithmetic operations, and the result is a flexible computational processing element (PE) for digital signal processing algorithms. To implement CORDIC-based architectures for functions like SVD and DCT, it is required to decompose their computations in terms of CORDIC operations. SVD is widely used in digital signal processing applications such as direction estimation, recursive least squares (RLS) filtering and system identification. Two different Jacobi-type methods for parallel SVD computation are usually considered, namely the Kogbetliantz (two-sided rotation) and the Hestenes (one-sided rotation) methods. Kogbetliantz's method has been considered because it maps well onto a CORDIC array architecture and is highly suitable for parallel computation. In this implementation, the CORDIC algorithm provides the arithmetic units required in the processing elements, as these enable the efficient implementation of plane rotation and phase computation. Many fundamental aspects of linear algebra rely on determining the rank of a matrix, making the SVD an important and widely used technique. 
The DCT is one of the most widely used transform techniques in digital signal processing, and its computation involves many multiplications and additions. The DCT based on the CORDIC algorithm does not need multipliers. Moreover, it has a regular and simple architecture, and it is used to compress a wide variety of images by transferring data into the frequency domain. These digital signal-processing algorithms are used in many applications. The purpose of this thesis is to describe a solution in which a conventional CORDIC system is used to implement SVD and DCT processing elements. The approach presented combines low circuit complexity with high performance.
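The plane rotations such a PE performs can be illustrated with rotation-mode CORDIC, which applies a target angle through a fixed sequence of shift-and-add micro-rotations; starting from (1, 0) it yields the cosine and sine directly, with no multiplier. A floating-point sketch (illustrative naming; hardware would use fixed-point shifts and a stored angle table):

```python
import math

def cordic_rotate(angle, iterations=24):
    """Rotation-mode CORDIC: rotate (1, 0) by `angle` (radians,
    |angle| < ~1.74 rad) using shift-and-add micro-rotations, returning
    (cos(angle), sin(angle)) after gain compensation."""
    x, y, z = 1.0, 0.0, angle
    K = 1.0
    for i in range(iterations):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))   # gain compensation
        d = 1.0 if z >= 0 else -1.0                    # steer residual angle to 0
        x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
        z -= d * math.atan(2.0 ** (-i))
    return x * K, y * K
```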

    Singular value decomposition based pipeline architecture for MIMO communication systems

    This thesis presents a design, implementation and performance benchmark of custom hardware for computing the Singular Value Decomposition (SVD) of the radio communication channel characteristic matrix. Software Defined Radio (SDR) is a concept in which the radio transceiver is implemented by software programs running on a processor. SVD of the channel characteristic matrix is used in pre-coding, equalization and beamforming for Multiple Input Multiple Output (MIMO) and Orthogonal Frequency Division Multiplexing (OFDM) communication systems (e.g., IEEE 802.11n). Since SVD is computationally intensive, it may require custom hardware to reduce the computing time. The pipeline processor developed in this thesis is suitable for computing the SVD of a sequence of 2×2 matrices. A stream of 2×2 matrices is sent to the custom hardware, which returns the corresponding streams of singular values and unitary matrices. The architecture is based on the two-sided Jacobi method utilizing Coordinate Rotation Digital Computer (CORDIC) algorithms. A 2×2 SVD prototype was implemented on a Field-Programmable Gate Array (FPGA) for SDR applications. The 2×2 SVD prototype design can output the singular values and the corresponding unitary matrices in pipeline while operating at a data rate of 324 MHz on a Virtex 6 (xc6vlx240t-lff1156) FPGA. The prototype design consists of fifty-five CORDIC cores, which take 32 percent of the available logic on the FPGA. It achieves the optimal pipeline rate, equal to the maximum hardware clock rate. The depth of the pipeline (latency) is 173 clock cycles for 16-bit data hardware. The proposed architecture provides performance gains over standard software libraries, such as the ZGESVD function of the Linear Algebra PACKage (LAPACK) library, which is based on the Golub-Kahan-Reinsch SVD algorithm, when running on standard processors. 
The ZGESVD function of LAPACK implemented in Intel's Math Kernel Library (MKL) achieves a projected data rate of 40 MHz on a 2.50 GHz Intel Quad (Q9300) CPU. The pipeline SVD hardware bandwidth equals the clock frequency, and the data rate can reach 324 MHz on the ML605 board (Virtex 6 xc6vlx240t). The proposed architecture also has the potential to be easily extended to solve 4×4 SVD problems used in pre-coding and equalization schemes. The proposed algorithm and design have better performance for small matrices, even though the general timing complexity is n² compared to the n log(n) complexity of the Brent-Luk-Van Loan (BLV) systolic array using non-pipelined 2×2 processors. The performance gain of the proposed design comes at the cost of increased circuit area.
M.S., Computer Engineering -- Drexel University, 201
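A closed-form model of the 2×2 two-sided SVD is often useful as a golden reference when verifying such a pipeline. The sketch below is my own derivation from the standard rotate-scale-rotate decomposition, not the thesis's CORDIC datapath; it computes U, Σ, V with A = U·diag(sx, sy)·Vᵀ:

```python
import math

def svd2x2_two_sided(a, b, c, d):
    """Closed-form two-sided SVD of A = [[a, b], [c, d]]:
    A = U @ diag(sx, sy) @ V^T with U, V plane rotations.
    sy is signed: it comes out negative when det(A) < 0, and the sign
    can then be folded into the second column of V."""
    e, f = (a + d) / 2.0, (a - d) / 2.0
    g, h = (c + b) / 2.0, (c - b) / 2.0
    q, r = math.hypot(e, h), math.hypot(f, g)
    sx, sy = q + r, q - r                         # singular values
    a1, a2 = math.atan2(g, f), math.atan2(h, e)   # a1 = phi+theta, a2 = phi-theta
    phi, theta = (a2 + a1) / 2.0, (a1 - a2) / 2.0
    U = [[math.cos(phi), -math.sin(phi)], [math.sin(phi), math.cos(phi)]]
    V = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
    return U, (sx, sy), V
```

The two atan2 calls are exactly the angle computations a vectoring-mode CORDIC core performs, which is why this formulation maps naturally onto CORDIC hardware.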

    Embedded Machine Learning: Emphasis on Hardware Accelerators and Approximate Computing for Tactile Data Processing

    Machine Learning (ML), a subset of Artificial Intelligence (AI), is driving the industrial and technological revolution of the present and future. We envision a world with smart devices that are able to mimic human behavior (sense, process, and act) and perform tasks that at one time we thought could only be carried out by humans. The vision is to achieve such a level of intelligence with affordable, power-efficient, and fast hardware platforms. However, embedding machine learning algorithms in many application domains such as the internet of things (IoT), prostheses, robotics, and wearable devices is an ongoing challenge, one governed by the computational complexity of ML algorithms, the performance/availability of hardware platforms, and the application's budget (power constraint, real-time operation, etc.). In this dissertation, we focus on the design and implementation of efficient ML algorithms to handle the aforementioned challenges. First, we apply Approximate Computing Techniques (ACTs) to reduce the computational complexity of ML algorithms. Then, we design custom Hardware Accelerators to improve the performance of the implementation within a specified budget. Finally, a tactile data processing application is adopted for the validation of the proposed exact and approximate embedded machine learning accelerators. The dissertation starts with the introduction of the various ML algorithms used for tactile data processing. These algorithms are assessed in terms of their computational complexity and the available hardware platforms which could be used for implementation. Afterward, a survey of the existing approximate computing techniques and hardware accelerator design methodologies is presented. Based on the findings of the survey, an approach for applying algorithmic-level ACTs to machine learning algorithms is provided. 
Then three novel hardware accelerators are proposed: (1) k-Nearest Neighbor (kNN) based on a selection-based sorter, (2) Tensorial Support Vector Machine (TSVM) based on Shallow Neural Networks, and (3) Hybrid Precision Binary Convolution Neural Network (BCNN). The three accelerators offer real-time classification with substantial reductions in hardware resources and power consumption compared to existing implementations targeting the same tactile data processing application on FPGA. Moreover, the approximate accelerators maintain high classification accuracy, with a loss of at most 5%.
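The kNN accelerator's selection-based sorter avoids fully sorting all distances; the analogous idea in software is a partial selection of only the k smallest distances. A hedged sketch (the function name and interface are illustrative, not the dissertation's hardware design):

```python
import heapq
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """k-nearest-neighbor majority vote. The k smallest squared Euclidean
    distances are selected with a partial sort (heapq.nsmallest) rather
    than a full sort, mirroring the selection-based-sorter idea."""
    dists = ((sum((a - b) ** 2 for a, b in zip(x, query)), y)
             for x, y in zip(train, labels))
    nearest = heapq.nsmallest(k, dists, key=lambda t: t[0])
    vote = Counter(y for _, y in nearest)
    return vote.most_common(1)[0][0]
```

Selecting k-of-n instead of sorting all n reduces the comparison count, which is the same saving a hardware selection network exploits.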

    Electronic systems for the restoration of the sense of touch in upper limb prosthetics

    In the last few years, research on active prosthetics for upper limbs has focused on improving human functionality and control. New methods have been proposed for measuring the user's muscle activity and translating it into prosthesis control commands. Developing the feed-forward interface so that the prosthesis better follows the intention of the user is an important step towards improving the quality of life of people with limb amputation. However, prosthesis users can neither feel if something or someone is touching them over the prosthesis nor perceive the temperature or roughness of objects. Prosthesis users are helped by looking at an object, but they cannot detect anything otherwise; their sight gives them most of the information. Therefore, to foster prosthesis embodiment and utility, it is necessary to have a prosthetic system that not only responds to the control signals provided by the user, but also transmits back to the user information about the current state of the prosthesis. This thesis presents an electronic skin system to close the loop in prostheses towards the restoration of the sense of touch in prosthesis users. The proposed electronic skin system includes advanced distributed sensing (the electronic skin); a system for (i) signal conditioning, (ii) data acquisition, and (iii) data processing; and a stimulation system. The idea is to integrate all these components into a myoelectric prosthesis. Embedding the electronic system and the sensing materials is a critical issue in the development of new prostheses. In particular, processing the data originating from the electronic skin into low- or high-level information is the key issue to be addressed by the embedded electronic system. Recently, it has been shown that Machine Learning is a promising approach for processing tactile sensor information. 
Many studies have shown the effectiveness of Machine Learning in the classification of input touch modalities. More specifically, this thesis is focused on the stimulation system, allowing the communication of a mechanical interaction from the electronic skin to prosthesis users, and on the dedicated implementation of algorithms for processing tactile data originating from the electronic skin. On the system level, the thesis provides the design of the experimental setup, the experimental protocol, and the algorithms to process tactile data. On the architectural level, the thesis proposes a design flow for the implementation of digital circuits for both FPGAs and integrated circuits, and techniques for the power management of embedded systems for Machine Learning algorithms.

    FPGA Acceleration of Domain-specific Kernels via High-Level Synthesis

    The abstract is in the attachment.