
    Performance Analysis of Hardware/Software Co-Design of Matrix Solvers

    Solving systems of linear and nonlinear equations lies at the heart of many scientific and engineering applications, such as circuit simulation, electric power network analysis, and structural analysis. The exponentially increasing complexity of these computing applications and the high cost of supercomputing force us to explore affordable high-performance computing platforms. The ultimate goal of this research is to develop hardware-friendly parallel processing algorithms and build cost-effective high-performance parallel systems in hardware to enable the solution of large linear systems. In this thesis, FPGA-based general hardware architectures of selected iterative and direct methods are discussed, along with Xilinx Embedded Development Kit (EDK) hardware/software (HW/SW) codesigns of these methods. For iterative methods, FPGA-based hardware architectures of Jacobi, combined Jacobi and Gauss-Seidel, and conjugate gradient (CG) are proposed. A convergence analysis of the LNS-based Jacobi processor demonstrates to what extent the hardware resource constraints and the additional conversion error affect the convergence of the Jacobi iterative method. Matlab simulations were performed to compare the performance of the three iterative methods in three ways: number of iterations for a given tolerance, number of iterations for different matrix sizes, and computation time for different matrix sizes. The simulation results indicate that the key to a fast implementation of all three methods is a fast implementation of matrix multiplication. They also show that the CG method requires fewer iterations for a given tolerance but more computation time as the matrix size increases than the other two methods, since matrix-vector multiplication is a more dominant factor in CG than in the other two. By implementing the matrix multiplications of the three methods in hardware with a Xilinx EDK HW/SW codesign, performance is significantly improved over a pure-software PowerPC (PPC) implementation. The EDK implementation results show that, in HW/SW codesign, CG takes the least computation time for any matrix size, because matrix multiplications dominate the computation time of all three methods while CG requires the fewest iterations to converge. For direct methods, an FPGA-based general hardware architecture and a Xilinx EDK HW/SW codesign of WZ factorization are presented. Single-unit and scalable hardware architectures of WZ factorization are proposed and analyzed under different constraints. Matlab simulation results show that WZ runs faster than LU on parallel processors but slower on a single processor, and that the most time-consuming part of WZ factorization is the matrix update. By implementing the matrix update of WZ factorization in hardware with a Xilinx EDK HW/SW codesign, performance is likewise noticeably improved over the PPC-based pure-software implementation.
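
    The per-iteration structure that makes matrix-vector multiplication the bottleneck can be sketched in a few lines of NumPy (a minimal software illustration of the Jacobi method only; the function name and the toy system below are ours, not the thesis's FPGA/EDK design):

        # Jacobi iteration x_{k+1} = D^{-1} (b - R x_k), where A = D + R.
        # The product R @ x is the per-iteration cost the thesis moves into hardware.
        import numpy as np

        def jacobi(A, b, tol=1e-8, max_iter=1000):
            D = np.diag(A)                    # diagonal of A
            R = A - np.diagflat(D)            # off-diagonal remainder
            x = np.zeros_like(b, dtype=float)
            for k in range(max_iter):
                x_new = (b - R @ x) / D       # matrix-vector product dominates
                if np.linalg.norm(x_new - x, ord=np.inf) < tol:
                    return x_new, k + 1       # converged in k+1 iterations
                x = x_new
            return x, max_iter

        # Toy diagonally dominant system, for which Jacobi is guaranteed to converge.
        A = np.array([[4.0, 1.0, 0.0],
                      [1.0, 5.0, 2.0],
                      [0.0, 2.0, 6.0]])
        b = np.array([1.0, 2.0, 3.0])
        x, iters = jacobi(A, b)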

    Lecture 11: The Road to Exascale and Legacy Software for Dense Linear Algebra

    In this talk, we will survey the current state of high-performance computing and look ahead to the next stage of extreme-scale computing, which will bring fundamental changes in the character of floating-point arithmetic and data movement. We will also examine how extreme-scale computing has caused algorithm and software developers to change their way of thinking about implementing and programming specific applications.

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining, and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems, and computer vision. Matrix decompositions are computationally expensive, and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, this dissertation makes the following contributions:
    • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
    • We implement a floating-point Hestenes-Jacobi architecture for SVD, capable of analyzing arbitrarily sized matrices.
    • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that exploits the strengths of each.
    • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
    • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
    • We present a configurable architecture to accelerate Homotopy l1-minimization, in which a modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
    Our experimental results using an FPGA-based acceleration system demonstrate the efficiency of the proposed architectures, with application- and dimension-dependent speedups over an optimized software implementation ranging from 1.5× to 43.6× in terms of computation time.
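
    As a concrete illustration of the rotation-based kernel such architectures pipeline, QR decomposition via Givens rotations can be written compactly in NumPy (a software sketch only; it does not reflect the dissertation's actual datapath, and the function name is ours):

        # QR via Givens rotations: each 2x2 rotation zeroes one subdiagonal entry.
        # The fine-grained, local structure of these rotations is what makes them
        # attractive for a deeply pipelined hardware implementation.
        import numpy as np

        def givens_qr(A):
            m, n = A.shape
            Q = np.eye(m)
            R = A.astype(float)
            for j in range(n):                       # column by column
                for i in range(m - 1, j, -1):        # zero entries below the diagonal
                    a, b = R[i - 1, j], R[i, j]
                    r = np.hypot(a, b)
                    if r == 0.0:
                        continue
                    c, s = a / r, b / r
                    G = np.array([[c, s], [-s, c]])  # rotation acting on rows i-1, i
                    R[[i - 1, i], :] = G @ R[[i - 1, i], :]
                    Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
            return Q, R

        A = np.random.rand(5, 3)
        Q, R = givens_qr(A)
        assert np.allclose(Q @ R, A)                 # Q orthogonal, R upper triangular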

    ARCHITECT: Arbitrary-precision Constant-hardware Iterative Compute

    Many algorithms feature an iterative loop that converges to the result of interest. The numerical operations in such algorithms are generally implemented using finite-precision arithmetic, either fixed or floating point, most of which operates least-significant digit first. This creates a fundamental problem: if, after some time, the result has not converged, is it because we have not run the algorithm for enough iterations, or because the arithmetic in some iterations was insufficiently precise? There is no easy way to answer this question, so users often over-budget precision in the hope that the answer will always be to run for a few more iterations. We propose a fundamentally new approach: armed with arithmetic able to generate results most-significant digit first, we show that fixed compute-area hardware can be used to calculate an arbitrary number of algorithmic iterations to arbitrary precision, with both precision and iteration index increasing in lockstep. Thus, datapaths constructed following our principles demonstrate efficiency over their traditional-arithmetic equivalents when the latter's precisions are either under- or over-budgeted for the computation of a result to a particular accuracy. For the execution of 100 iterations of the Jacobi method, we obtain a 1.60x increase in frequency, along with 15.7x LUT and 50.2x flip-flop reductions over a 2048-bit parallel-in, serial-out traditional-arithmetic equivalent, and 46.2x LUT and 83.3x flip-flop decreases versus the state-of-the-art online arithmetic implementation.
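
    The lockstep growth of precision and iteration index can be loosely emulated in software with Python's decimal module (our illustration only; the paper's datapath uses MSD-first online arithmetic in hardware, not decimal software arithmetic, and all names and parameter values below are ours):

        # Each Jacobi sweep is evaluated with a few more decimal digits than the
        # last, so precision grows with the iteration index rather than being
        # budgeted up front.
        from decimal import Decimal, getcontext

        def jacobi_growing_precision(A, b, iters=20, base_digits=4, digits_per_iter=1):
            n = len(b)
            x = [Decimal(0)] * n
            for k in range(iters):
                getcontext().prec = base_digits + digits_per_iter * k  # widen precision
                x_new = []
                for i in range(n):
                    s = sum(Decimal(A[i][j]) * x[j] for j in range(n) if j != i)
                    x_new.append((Decimal(b[i]) - s) / Decimal(A[i][i]))
                x = x_new
            return x

        A = [[4, 1, 0], [1, 5, 2], [0, 2, 6]]
        b = [1, 2, 3]
        print(jacobi_growing_precision(A, b))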

    Jacobi load flow accelerator using FPGA

    Paper presented at the 37th Annual North American Power Symposium, Ames, IA. Full-AC load flow is a crucial task in power system analysis. Solving full-AC load flow relies on iterative numerical methods such as Jacobi, Gauss-Seidel, or Newton-Raphson. Newton-Raphson is currently the preferred solver in industrial applications such as PowerWorld and PSS/E due to its faster convergence than either Jacobi or Gauss-Seidel. In this paper, we reexamine the Jacobi method for use in a fully pipelined hardware implementation on a Field Programmable Gate Array (FPGA) as an alternative to Newton-Raphson. Using benchmark data from representative power systems, we compare the operation counts of Newton-Raphson software to those of the proposed Jacobi FPGA hardware. Our studies show that the Jacobi method implemented on an FPGA has the potential, for sufficiently large power systems, to be a state-of-the-art full-AC load flow engine.
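
    For reference, the Jacobi-style load-flow update such hardware pipelines can be sketched as follows (a toy NumPy example under simplifying assumptions: PQ buses only, one slack bus, and illustrative per-unit values and names that are not taken from the paper's benchmark systems):

        # Jacobi load-flow update for PQ bus i, derived from S_i = V_i * conj(sum_j Y_ij V_j):
        #   V_i <- (1/Y_ii) * ( conj(S_i)/conj(V_i) - sum_{j != i} Y_ij V_j )
        # All buses are updated from the same previous state, which is what makes
        # the method fully parallel and pipeline-friendly.
        import numpy as np

        def jacobi_load_flow(Y, S, V, slack=0, iters=50):
            n = len(V)
            for _ in range(iters):
                V_new = V.copy()
                for i in range(n):
                    if i == slack:
                        continue                         # slack voltage stays fixed
                    coupling = Y[i] @ V - Y[i, i] * V[i]  # sum over j != i of Y_ij V_j
                    V_new[i] = (np.conj(S[i]) / np.conj(V[i]) - coupling) / Y[i, i]
                V = V_new
            return V

        # Slack bus at 1.0 p.u.; one PQ bus drawing S = 0.5 + j0.2 p.u. over a line
        # of impedance 0.01 + j0.1 p.u. (load convention: injected power is negative).
        y = 1.0 / (0.01 + 0.1j)
        Y = np.array([[y, -y], [-y, y]])
        S = np.array([0.0 + 0.0j, -(0.5 + 0.2j)])
        V = np.array([1.0 + 0.0j, 1.0 + 0.0j])
        V = jacobi_load_flow(Y, S, V, slack=0)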