
    Performance Analysis of Hardware/Software Co-Design of Matrix Solvers

    Solving systems of linear and nonlinear equations lies at the heart of many scientific and engineering applications such as circuit simulation, electric power networks, and structural analysis. The exponentially increasing complexity of these applications and the high cost of supercomputing force us to explore affordable high-performance computing platforms. The ultimate goal of this research is to develop hardware-friendly parallel processing algorithms and to build cost-effective high-performance parallel systems in hardware, enabling the solution of large linear systems. This thesis discusses FPGA-based general hardware architectures of selected iterative and direct methods, along with Xilinx Embedded Development Kit (EDK) hardware/software (HW/SW) co-designs of these methods.

    For iterative methods, FPGA-based hardware architectures of Jacobi, combined Jacobi and Gauss-Seidel, and conjugate gradient (CG) are proposed. The convergence analysis of the logarithmic number system (LNS)-based Jacobi processor demonstrates to what extent the hardware resource constraints and the additional conversion error affect the convergence of the Jacobi iterative method. MATLAB simulations were performed to compare the three iterative methods in three ways: number of iterations for a given tolerance, number of iterations for different matrix sizes, and computation time for different matrix sizes. The results indicate that the key to a fast implementation of all three methods is a fast implementation of matrix multiplication. They also show that the CG method requires fewer iterations for a given tolerance, but more computation time as the matrix size grows, because matrix-vector multiplication is a more dominant factor in CG than in the other two methods. By implementing the matrix multiplications of the three methods in hardware with a Xilinx EDK HW/SW co-design, performance is significantly improved over a pure-software PowerPC (PPC) implementation. The EDK results show that, in the HW/SW co-design, CG takes the least computation time for any matrix size: matrix multiplications dominate the computation time of all three methods, and CG converges in the fewest iterations.

    For direct methods, an FPGA-based general hardware architecture and a Xilinx EDK HW/SW co-design of WZ factorization are presented. Single-unit and scalable hardware architectures of WZ factorization are proposed and analyzed under different constraints. MATLAB simulations show that WZ factorization runs faster than LU factorization on parallel processors but slower on a single processor, and that the most time-consuming part of WZ factorization is the matrix update. By implementing the matrix update of WZ factorization in hardware with a Xilinx EDK HW/SW co-design, performance is likewise significantly improved over a PPC-based pure-software implementation.
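    To make the iteration-count comparison concrete, here is a minimal software sketch, not the thesis's FPGA/EDK implementation, of Jacobi and CG in Python/NumPy. The test matrix (an assumption of this sketch) is symmetric positive definite and strictly diagonally dominant so both methods converge; the sketch illustrates the two findings above: each iteration of either method is dominated by a matrix-vector multiplication, and CG converges in far fewer iterations.

```python
import numpy as np

def jacobi(A, b, tol=1e-8, max_iter=10_000):
    """Jacobi iteration x_{k+1} = D^{-1}(b - (A - D) x_k); the
    matrix-vector product R @ x dominates the cost of each sweep."""
    D = np.diag(A)                      # diagonal entries of A
    R = A - np.diagflat(D)              # off-diagonal part
    x = np.zeros_like(b)
    for k in range(max_iter):
        x_new = (b - R @ x) / D
        if np.linalg.norm(x_new - x) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

def cg(A, b, tol=1e-8, max_iter=10_000):
    """Textbook conjugate gradient for SPD A; one A @ p per iteration."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for k in range(max_iter):
        Ap = A @ p                      # dominant cost: matrix-vector multiply
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            return x, k + 1
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

# SPD and strictly diagonally dominant, so both methods converge.
rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
A = A @ A.T
A += np.diagflat(np.abs(A).sum(axis=1))
b = rng.standard_normal(n)
print("Jacobi iterations:", jacobi(A, b)[1])
print("CG iterations:    ", cg(A, b)[1])
```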
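    For reference, a compact sketch of dense WZ (quadrant interlocking) factorization, assuming an even matrix order for simplicity; it illustrates the method itself, not the proposed hardware architecture. At step k, pivot rows k and n-1-k eliminate two columns from every interior row by solving 2x2 systems, and the trailing matrix update in the inner loop, the part the simulations identify as most time-consuming, is what gets moved into hardware.

```python
import numpy as np

def wz_factorize(A):
    """Dense WZ factorization sketch: returns W (unit diagonal, butterfly
    pattern of multipliers) and Z (Z-shaped) with W @ Z == A."""
    Z = A.astype(float).copy()
    n = Z.shape[0]
    assert n % 2 == 0, "sketch assumes an even order"
    W = np.eye(n)
    for k in range(n // 2):
        m = n - 1 - k
        # 2x2 pivot block formed from rows k and m, columns k and m.
        piv = np.array([[Z[k, k], Z[m, k]],
                        [Z[k, m], Z[m, m]]])
        for i in range(k + 1, m):
            # Solve for multipliers that zero Z[i, k] and Z[i, m].
            w = np.linalg.solve(piv, np.array([Z[i, k], Z[i, m]]))
            W[i, k], W[i, m] = w
            # Matrix update: the most time-consuming part of WZ factorization.
            Z[i, k:m + 1] -= w[0] * Z[k, k:m + 1] + w[1] * Z[m, k:m + 1]
    return W, Z

# Quick check: W @ Z should reconstruct A.
rng = np.random.default_rng(1)
n = 8
A = rng.standard_normal((n, n)) + n * np.eye(n)
W, Z = wz_factorize(A)
print("max reconstruction error:", np.abs(W @ Z - A).max())
```

    Because the two eliminations at each step are independent of those at other row pairs, the interior-row updates parallelize naturally, which is the intuition behind WZ outperforming LU on parallel processors.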

    Computational methods and software systems for dynamics and control of large space structures

    This final report on computational methods and software systems for dynamics and control of large space structures covers progress to date, projected developments in the final months of the grant, and conclusions. Pertinent reports and papers that have not appeared in scientific journals (or have not yet appeared in final form) are enclosed. The grant has supported research in two areas of crucial importance to the computer-based simulation of large space structures. The first involves the multibody dynamics (MBD) of flexible space structures, with applications directed to deployment, construction, and maneuvering. The second deals with advanced software systems, with emphasis on parallel processing. The latest research thrust in the second area, as reported here, involves massively parallel computers.

    Performance and Memory Space Optimizations for Embedded Systems

    Embedded systems share three common requirements: real-time performance, low power consumption, and low price (limited hardware). Embedded computers increasingly use chip multiprocessors (CMPs) to meet these expectations, but a major problem is the lack of efficient software support for CMPs; in particular, automated code parallelizers are needed. The aim of this study is to explore ways to increase performance and to reduce resource usage and energy consumption in embedded systems, using code restructuring, loop scheduling, data transformation, code and data placement, and scratch-pad memory (SPM) management in different embedded-system scenarios. The majority of the work focuses on loop scheduling. The main contributions are:

    - A memory-saving strategy that exploits the value locality in array data by storing arrays in a compressed form. Based on the compressed forms of the input arrays, the approach automatically determines the compressed forms of the output arrays and restructures the code accordingly.
    - A compiler-directed code scheduling scheme that considers both parallelism and data locality. It analyzes the code using a locality-parallelism graph representation and assigns the nodes of this graph to processors; an Integer Linear Programming based formulation of the scheduling problem is also introduced.
    - A compiler-based, SPM-conscious loop scheduling strategy for array/loop-based embedded applications. The compiler identifies potential SPM hits and misses and distributes loop iterations across parallel processors so that the processors have close execution times (see the sketch after this list).
    - An SPM management technique based on Markov-chain modeling of data accesses.
    - A compiler-directed integrated code and data placement scheme for 2-D mesh-based CMP architectures. Using a Code-Data Affinity Graph (CDAG) to represent the relationship between loop iterations and array data, it assigns sets of loop iterations to processing cores and sets of data blocks to on-chip memories.
    - A memory-bank-aware dynamic loop scheduling scheme for array-intensive applications, whose goal is to minimize the number of memory banks needed to execute a group of loop iterations.
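    As a purely hypothetical illustration of the SPM-conscious distribution idea (the thesis's actual compiler analysis is not reproduced here), the following sketch assumes each iteration has already been classified as an SPM hit or miss, with invented relative costs, and greedily assigns iteration chunks to the least-loaded processor so that estimated finish times stay close.

```python
# Hypothetical sketch: names, chunk size, and the HIT/MISS cost ratio are
# invented for illustration, not taken from the thesis.
import heapq

HIT_COST, MISS_COST = 1, 10   # assumed relative latencies: SPM vs off-chip

def schedule(iteration_is_hit, num_procs, chunk=8):
    """Return per-processor iteration lists with balanced estimated cost."""
    heap = [(0, p) for p in range(num_procs)]   # (estimated_load, proc_id)
    assignment = [[] for _ in range(num_procs)]
    n = len(iteration_is_hit)
    for start in range(0, n, chunk):
        iters = range(start, min(start + chunk, n))
        cost = sum(HIT_COST if iteration_is_hit[i] else MISS_COST
                   for i in iters)
        load, p = heapq.heappop(heap)           # least-loaded processor
        assignment[p].extend(iters)
        heapq.heappush(heap, (load + cost, p))
    return assignment, sorted(load for load, _ in heap)

# Example: the first half of the iterations hit the SPM, the rest miss.
hits = [True] * 64 + [False] * 64
plan, loads = schedule(hits, num_procs=4)
print("per-processor estimated loads:", loads)
```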

    LU Factorization of Sparse Matrices on Multiprocessors (Factorización LU de matrices dispersas en multiprocesadores)
