334 research outputs found

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve the dense or sparse linear system of equations in bioinformatics, power system and computer vision. Matrix decompositions are computationally expensive and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
    • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
    • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrary sized matrices.
    • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
    • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
    • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
    • We present a configurable architecture to accelerate Homotopy l1-minimization, in which the modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
    Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.

    A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units

    We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of GPU's memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies on GPU's shared address space, but applicable also to inter-GPU communication. Unlike common hybrid approaches, our algorithm in a single GPU setting needs a CPU for the controlling purposes only, while utilizing GPU's resources to the fullest extent permitted by the hardware. When required by the problem size, the algorithm, in principle, scales to an arbitrary number of GPU nodes. The scalability is demonstrated by more than twofold speedup for sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single Fermi card. Comment: Accepted for publication in SIAM Journal on Scientific Computing
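To make the underlying iteration concrete, here is a minimal sketch of the plain (non-blocked, sequential) one-sided Jacobi method that this abstract builds on. It is not the hierarchically blocked GPU algorithm described above: the function name, the row-cyclic pivot order, and the tolerance are assumptions chosen for illustration, and only singular values are returned.

```python
import math

def one_sided_jacobi_singular_values(a, tol=1e-12, max_sweeps=60):
    """Singular values of a (list of rows) via plain one-sided Jacobi sweeps."""
    m, n = len(a), len(a[0])
    # Work on columns: cols[j] holds column j of the matrix.
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        # Row-cyclic pivot order (an illustrative choice, not a parallel strategy).
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                # Skip pairs that are already orthogonal to working accuracy.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Plane rotation that makes columns p and q orthogonal.
                new_p = [c * x - s * y for x, y in zip(cols[p], cols[q])]
                new_q = [s * x + c * y for x, y in zip(cols[p], cols[q])]
                cols[p], cols[q] = new_p, new_q
        if not rotated:
            break
    # After convergence the column norms are the singular values.
    return sorted((math.sqrt(sum(x * x for x in col)) for col in cols), reverse=True)

sigma = one_sided_jacobi_singular_values([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Because each rotation touches only two columns, disjoint column pairs can be processed independently, which is the property that blocked and parallel pivot strategies exploit.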

    Three-Level Parallel J-Jacobi Algorithms for Hermitian Matrices

    The paper describes several efficient parallel implementations of the one-sided hyperbolic Jacobi-type algorithm for computing eigenvalues and eigenvectors of Hermitian matrices. By appropriate blocking of the algorithms an almost ideal load balancing between all available processors/cores is obtained. A similar blocking technique can be used to exploit local cache memory of each processor to further speed up the process. Due to diversity of modern computer architectures, each of the algorithms described here may be the method of choice for a particular hardware and a given matrix size. All proposed block algorithms compute the eigenvalues with relative accuracy similar to the original non-blocked Jacobi algorithm. Comment: Submitted for publication

    Mimo Systems Low complexity SVD Implementation Analysis

    This paper analyses the implementation of the singular value decomposition (SVD) using an approximation to the exact computation for MIMO systems in the case of modulation-mode and power assignment set-up. The study focuses on low-complexity algorithms with a low computational load, aimed at devices with limited resources such as FPGAs, and highlights some of the advantages and drawbacks compared with more sophisticated devices. The implementation of the SVD is analyzed through the algorithms that efficiently perform the required computations, seeking computationally efficient solutions that provide parallelism and low complexity. The CORDIC algorithm seems to be a good candidate for this task since it can efficiently compute the singular value decomposition. It is shown that this algorithm provides an efficient tool for SVD computation with appropriate accuracy, and that its computational complexity and resource requirements make it feasible to implement on an FPGA device. System performance degradation is analyzed in comparison with the conventional, exact method for SVD, yielding some key conclusions.
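As a rough illustration of why CORDIC suits Jacobi-style SVD hardware, the sketch below shows its two basic modes: vectoring (find the angle and magnitude of a 2-vector) and rotation (apply a plane rotation). It is an assumption-laden software model, not the paper's design: it uses floating point where hardware would use fixed-point shifts and adds, and the iteration count and function names are illustrative.

```python
import math

ITERATIONS = 32
# Elementary rotation angles atan(2^-i); hardware would store these in a small table.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Every micro-rotation stretches the vector by sqrt(1 + 2^-2i); GAIN undoes that.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))

def cordic_vectoring(x, y):
    """Drive y to zero; return (magnitude, angle) of (x, y). Assumes x > 0."""
    angle = 0.0
    for i in range(ITERATIONS):
        shift = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        if y >= 0:
            x, y, angle = x + y * shift, y - x * shift, angle + ANGLES[i]
        else:
            x, y, angle = x - y * shift, y + x * shift, angle - ANGLES[i]
    return x / GAIN, angle

def cordic_rotation(x, y, angle):
    """Rotate (x, y) counter-clockwise by angle (about |angle| <= 1.74 rad)."""
    for i in range(ITERATIONS):
        shift = 2.0 ** -i
        if angle >= 0:
            x, y, angle = x - y * shift, y + x * shift, angle - ANGLES[i]
        else:
            x, y, angle = x + y * shift, y - x * shift, angle + ANGLES[i]
    return x / GAIN, y / GAIN

magnitude, theta = cordic_vectoring(3.0, 4.0)
```

In a Jacobi-type SVD step, a vectoring pass can supply the rotation angle and rotation passes can then apply it to the affected rows or columns, so the same shift-and-add datapath serves both roles.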

    Error Analysis of the Cholesky QR-Based Block Orthogonalization Process for the One-Sided Block Jacobi SVD Algorithm

    The one-sided block Jacobi method (OSBJ) has attracted attention as a fast and accurate algorithm for the singular value decomposition (SVD). The computational kernel of OSBJ is orthogonalization of a column block pair, which amounts to computing the SVD of this block pair. Hari proposes three methods for this partial SVD, and we found through numerical experiments that the variant named "V2", which is based on the Cholesky QR method, is the fastest variant and achieves satisfactory accuracy. While this is good news from a practical viewpoint, it seems strange considering the well-known instability of the Cholesky QR method. In this paper, we perform a detailed error analysis of the V2 variant and explain why and when it can be used to compute the partial SVD accurately. Thus, our results provide theoretical support for using the V2 variant safely in the OSBJ method.
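For readers unfamiliar with the Cholesky QR method mentioned here, the following is a minimal sketch of the textbook version: form the Gram matrix, take its Cholesky factor as R, and recover Q by a triangular solve. It is not Hari's V2 variant, whose details are not given in this abstract; the function name and the dense list-of-rows layout are assumptions.

```python
import math

def cholesky_qr(a):
    """Return (Q, R) with A = Q R, using the Cholesky factor of A^T A."""
    m, n = len(a), len(a[0])
    # Gram matrix G = A^T A; forming it squares the condition number of A,
    # which is the source of the instability the abstract refers to.
    g = [[sum(a[k][i] * a[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
    # Upper-triangular Cholesky factor R with G = R^T R.
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        d = g[i][i] - sum(r[k][i] ** 2 for k in range(i))
        r[i][i] = math.sqrt(d)  # raises ValueError if G is numerically indefinite
        for j in range(i + 1, n):
            r[i][j] = (g[i][j] - sum(r[k][i] * r[k][j] for k in range(i))) / r[i][i]
    # Q = A R^{-1}: solve q_row * R = a_row by substitution, one row at a time.
    q = [[0.0] * n for _ in range(m)]
    for row in range(m):
        for j in range(n):
            acc = a[row][j] - sum(q[row][k] * r[k][j] for k in range(j))
            q[row][j] = acc / r[j][j]
    return q, r

q_factor, r_factor = cholesky_qr([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

The method is attractive because it consists almost entirely of matrix products, but the loss of orthogonality in Q grows with the square of the condition number of A, so it is only safe when the column block is well conditioned.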

    A mixed precision Jacobi method for the symmetric eigenvalue problem

    The eigenvalue problem is a fundamental problem in scientific computing. In this paper, we propose a mixed precision Jacobi method for the symmetric eigenvalue problem. We first compute the eigenvalue decomposition of a real symmetric matrix by an eigensolver at low precision and we obtain a low-precision matrix of eigenvectors. Then, by applying the modified Gram-Schmidt orthogonalization process to the low-precision eigenvector matrix in high precision, a high-precision orthogonal matrix is obtained, which is used as an initial guess for the Jacobi method. We give a rounding error analysis for the proposed method, and its quadratic convergence is established under some sufficient conditions. We also present a mixed precision one-sided Jacobi method for the singular value problem, and the corresponding rounding error analysis and quadratic convergence are discussed. Numerical experiments on CPUs and GPUs are conducted to illustrate the efficiency of the proposed mixed precision Jacobi method over the original Jacobi method. Comment: 31 pages, 2 figures
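The three-step pipeline in this abstract (low-precision eigenvectors, high-precision re-orthogonalization, Jacobi refinement) can be sketched as follows. This is only a toy model under stated assumptions: the low-precision eigensolver is imitated by rounding an accurate eigenvector matrix to single precision with the hypothetical helper `to_float32`, and all function names and tolerances are illustrative rather than taken from the paper.

```python
import math
import struct

def to_float32(x):
    """Round a Python float to the nearest IEEE single (stand-in for low precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mgs(v):
    """Modified Gram-Schmidt on the columns of a square matrix v (list of rows)."""
    n = len(v)
    cols = [[v[i][j] for i in range(n)] for j in range(n)]
    for j in range(n):
        norm = math.sqrt(sum(x * x for x in cols[j]))
        cols[j] = [x / norm for x in cols[j]]
        for k in range(j + 1, n):
            proj = sum(x * y for x, y in zip(cols[j], cols[k]))
            cols[k] = [y - proj * x for x, y in zip(cols[j], cols[k])]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]

def jacobi_refine(a, q, tol=1e-12, max_sweeps=20):
    """Cyclic two-sided Jacobi on B = Q^T A Q, started from the orthogonal guess q."""
    n = len(a)
    qt = [list(col) for col in zip(*q)]
    b = matmul(qt, matmul(a, q))
    q = [row[:] for row in q]
    sweeps = 0
    while sweeps < max_sweeps:
        off = math.sqrt(sum(b[i][j] ** 2 for i in range(n) for j in range(n) if i != j))
        if off <= tol:
            break
        sweeps += 1
        for p in range(n - 1):
            for r in range(p + 1, n):
                if b[p][r] == 0.0:
                    continue
                zeta = (b[r][r] - b[p][p]) / (2.0 * b[p][r])
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # B <- J^T B J and Q <- Q J for the rotation J in the (p, r) plane.
                for k in range(n):
                    bkp, bkr = b[k][p], b[k][r]
                    b[k][p], b[k][r] = c * bkp - s * bkr, s * bkp + c * bkr
                    qkp, qkr = q[k][p], q[k][r]
                    q[k][p], q[k][r] = c * qkp - s * qkr, s * qkp + c * qkr
                for k in range(n):
                    bpk, brk = b[p][k], b[r][k]
                    b[p][k], b[r][k] = c * bpk - s * brk, s * bpk + c * brk
    return [b[i][i] for i in range(n)], q, sweeps

a = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
cold_values, v, cold_sweeps = jacobi_refine(a, identity)       # no initial guess
v_low = [[to_float32(x) for x in row] for row in v]            # imitated low-precision result
warm_values, _, warm_sweeps = jacobi_refine(a, mgs(v_low))     # refined from the guess
```

Comparing `warm_sweeps` with `cold_sweeps` gives a feel for the benefit: a good orthogonal starting guess leaves only tiny off-diagonal entries, so the Jacobi iteration starts in its fast-converging regime.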

    Polynomial matrix decomposition techniques for frequency selective MIMO channels

    For a narrowband, instantaneous mixing multi-input, multi-output (MIMO) communications system, the channel is represented as a scalar matrix. In this scenario, singular value decomposition (SVD) provides a number of independent spatial subchannels which can be used to enhance data rates or to increase diversity. Alternatively, a QR decomposition can be used to reduce the MIMO channel equalization problem to a set of single channel equalization problems. In the case of a frequency selective MIMO system, the multipath channel is represented as a polynomial matrix. Thus conventional matrix decomposition techniques can no longer be applied. The traditional solution to this broadband problem is to reduce it to narrowband form by using a discrete Fourier transform (DFT) to split the broadband channel into N narrow uniformly spaced frequency bands and applying scalar decomposition techniques within each band. This describes an orthogonal frequency division multiplexing (OFDM) based system. However, a novel algorithm has been developed for calculating the eigenvalue decomposition of a para-Hermitian polynomial matrix, known as the sequential best rotation (SBR2) algorithm. SBR2 and its QR based derivatives allow a true polynomial singular value and QR decomposition to be formulated. The application of these algorithms within frequency selective MIMO systems results in a fundamentally new approach to exploiting spatial diversity. Polynomial matrix decomposition and OFDM based solutions are compared for a wide variety of broadband MIMO communication systems. SVD is used to create a robust, high gain communications channel for ultra low signal-to-noise ratio (SNR) environments. Due to the frequency selective nature of the channels produced by polynomial matrix decomposition, additional processing is required at the receiver resulting in two distinct equalization techniques based around turbo and Viterbi equalization. 
The proposed approach is found to provide identical performance to that of an existing OFDM scheme while supporting a wider range of access schemes. This work is then extended to QR decomposition based communications systems, where the proposed polynomial approach is found to not only provide superior bit-error-rate (BER) performance but significantly reduce the complexity of transmitter design. Finally both techniques are combined to create a multi-user MIMO system that provides superior BER performance over an OFDM based scheme. Throughout the work the robustness of the proposed scheme to channel state information (CSI) error is considered, resulting in a rigorous demonstration of the capabilities of the polynomial approach.
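The "traditional" narrowband reduction described in this abstract can be sketched in a few lines: evaluate the polynomial channel matrix on N uniformly spaced frequency bins with a DFT, then apply a scalar-matrix decomposition in each bin. The sketch below is an illustration only and does not implement SBR2; the function names are assumptions, and the singular-value helper is restricted to 2x2 matrices to keep it short.

```python
import cmath
import math

def frequency_bins(taps, n_bins):
    """Evaluate H(z) = sum_k taps[k] z^-k at z = exp(2j*pi*f/n_bins), f = 0..n_bins-1."""
    rows, cols = len(taps[0]), len(taps[0][0])
    bins = []
    for f in range(n_bins):
        h = [[0j] * cols for _ in range(rows)]
        for k, tap in enumerate(taps):
            w = cmath.exp(-2j * cmath.pi * f * k / n_bins)  # DFT twiddle factor
            for i in range(rows):
                for j in range(cols):
                    h[i][j] += tap[i][j] * w
        bins.append(h)
    return bins

def singular_values_2x2(h):
    """Singular values of a 2x2 complex matrix from the trace and determinant of H^H H."""
    t = sum(abs(h[i][j]) ** 2 for i in range(2) for j in range(2))  # s1^2 + s2^2
    d = abs(h[0][0] * h[1][1] - h[0][1] * h[1][0]) ** 2             # (s1 * s2)^2
    disc = math.sqrt(max(t * t - 4.0 * d, 0.0))
    return math.sqrt((t + disc) / 2.0), math.sqrt(max((t - disc) / 2.0, 0.0))

# Hypothetical two-tap 2x2 channel: a direct path plus one delayed cross-coupled path.
taps = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.5], [0.5, 0.0]]]
per_bin_gains = [singular_values_2x2(h) for h in frequency_bins(taps, 4)]
```

Each bin is treated as an independent narrowband channel, so the subchannel gains vary from bin to bin; a polynomial decomposition instead works on the whole matrix polynomial at once.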