
    High Performance Reconfigurable Computing for Linear Algebra: Design and Performance Analysis

    Field Programmable Gate Arrays (FPGAs) enable powerful performance acceleration for scientific computations because of their intrinsic parallelism, pipelining capability, and flexible architecture. This dissertation explores the computational power of FPGAs for an important scientific application: linear algebra. First, optimized linear algebra subroutines are presented based on enhancements to both algorithms and hardware architectures. Compared to microprocessors, these routines achieve significant speedup. Second, computing with mixed-precision data on FPGAs is proposed for higher performance. Experimental analysis shows that mixed-precision algorithms on FPGAs can achieve the high performance of lower-precision arithmetic while preserving higher-precision accuracy when solving systems of linear equations. Third, an execution time model is built for reconfigurable computers (RC), which plays an important role in performance analysis and optimal resource utilization of FPGAs. The accuracy and efficiency of parallel computing performance models often depend on mean maximum computations. Despite significant prior work, adequate mathematical tools for this important calculation have been lacking. This work presents an Effective Mean Maximum Approximation method, which is more general, accurate, and efficient than previous methods. Together, these research results help address how to make linear algebra applications perform better on high performance reconfigurable computing architectures.
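    The abstract does not spell out the mixed-precision scheme. A common approach consistent with its description is iterative refinement, in which the expensive solve runs in a lower precision and the residual correction is accumulated in a higher precision. The NumPy sketch below only illustrates that general idea under assumed names and precisions (float32 as the "fast" format, float64 as the accurate one); it is not the dissertation's FPGA implementation, which would reuse a low-precision factorization rather than calling a full solve in every iteration.

        import numpy as np

        def mixed_precision_solve(A, b, max_iters=20, tol=1e-12):
            # Low-precision solve stands in for a fast low-precision kernel; a real
            # implementation would factor A once in low precision and reuse the factors.
            A32 = A.astype(np.float32)
            x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
            for _ in range(max_iters):
                r = b - A @ x                                # residual in float64
                if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                    break
                d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
                x += d                                       # correction applied in float64
            return x

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            n = 200
            A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
            b = rng.standard_normal(n)
            x = mixed_precision_solve(A, b)
            print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))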

    Performance of synchronous parallel algorithms with regular structures

    The ability to model the execution and predict the performance of parallel algorithms is necessary if parallel computer systems are to be designed and utilized effectively. One factor which is inherent to the structure of the algorithm is the delay due to synchronization requirements when one task has to await the completion of another task. This thesis is concerned with the problem of evaluating the performance of parallel algorithms on MIMD computer systems when tasks in the algorithms have non-deterministic execution times. It addresses the following questions: How does non-determinism affect the synchronization delays and ultimately the performance of parallel algorithms? Is it possible to predict this effect independent of implementation on a specific architecture? How can such predictions be validated? The approach adopted is based on the view of a parallel algorithm as a collection of tasks with constraints on their relative order of execution. Important classes of parallel algorithms which exhibit regularity in their task graph structures are identified and the performance of these structures is predicted using results from order statistics and extreme value theory. These algorithm classes are those whose task graphs have multiphase, partitioning, and pipeline structures. The Rice Parallel Processing Testbed (RPPT), a program-driven simulation tool for the performance evaluation of parallel algorithms on parallel architectures, is used along with distribution-driven simulation to validate the results. Variations in the execution times of tasks generally increase the length of synchronization delays and adversely affect the performance of a parallel algorithm. In the case of an algorithm with a regular task graph structure it is possible to quantify the performance degradation due to non-determinism under certain assumptions about the task execution times and the number of processors available to execute the tasks. In particular, it is possible to place bounds on and approximate the mean execution time of the multiphase, partitioning, and pipeline structures. Distribution-driven simulations and results from sorting algorithms run on the RPPT indicate that these bounds and approximations, which are based on independence assumptions about task execution time distributions, are robust even in the presence of dependencies among task execution times. (Abstract shortened with permission of author.)
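    The thesis's bounds and approximations are not reproduced in this abstract. As a purely illustrative companion (assumed exponential task times, a barrier at the end of every phase, and hypothetical function names, none taken from the thesis), the sketch below shows the kind of mean-maximum computation involved: the expected length of a phase in which P processors each run one task and then synchronize is the expected maximum of P task times, which for i.i.d. exponential times equals the mean task time multiplied by the P-th harmonic number.

        import numpy as np

        def simulated_multiphase_time(num_tasks, num_phases, mean_task_time=1.0,
                                      trials=10_000, seed=0):
            # Each phase runs `num_tasks` i.i.d. exponential tasks in parallel and ends
            # at a barrier when the slowest finishes, so the phase time is the maximum
            # of the task times; the total time is the sum over phases.
            rng = np.random.default_rng(seed)
            t = rng.exponential(mean_task_time, size=(trials, num_phases, num_tasks))
            return t.max(axis=2).sum(axis=1).mean()

        def mean_max_exponential(num_tasks, mean_task_time=1.0):
            # Exact mean of the maximum of n i.i.d. exponential variables: mean * H_n.
            return mean_task_time * sum(1.0 / k for k in range(1, num_tasks + 1))

        if __name__ == "__main__":
            p, phases = 16, 10
            print("simulated mean completion time:", simulated_multiphase_time(p, phases))
            print("order-statistics prediction   :", phases * mean_max_exponential(p))
            # With deterministic unit task times the total would be 10.0; the excess
            # over that is the synchronization delay caused by task-time variability.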