    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes, so fast and efficient codes for reconfigurable platforms remain challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, in which we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increased data reuse or resource consumption), and the objectives each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
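
    As a concrete illustration of the kind of transformation the abstract describes, the sketch below breaks a loop-carried dependency by interleaving partial accumulators so that a reduction loop can pipeline with an initiation interval of one. The kernel and the interleaving factor K are hypothetical, and the pragma syntax follows Xilinx Vitis HLS conventions; other HLS tools express the same directives differently.

        #include <cstddef>

        // Hypothetical reduction kernel: the loop-carried dependency on `acc`
        // prevents an initiation interval (II) of 1 whenever the floating-point
        // adder takes more than one cycle.
        float reduce_naive(const float *in, std::size_t n) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < n; ++i) {
                #pragma HLS PIPELINE II=1  // unachievable: acc feeds back each iteration
                acc += in[i];
            }
            return acc;
        }

        // Transformed kernel: K interleaved partial sums widen the dependency
        // distance from 1 to K, so consecutive iterations write different
        // registers. K should cover the adder latency.
        constexpr std::size_t K = 8;

        float reduce_interleaved(const float *in, std::size_t n) {
            float partial[K] = {0.0f};
            #pragma HLS ARRAY_PARTITION variable=partial complete
            for (std::size_t i = 0; i < n; ++i) {
                #pragma HLS PIPELINE II=1  // achievable: K additions in flight
                partial[i % K] += in[i];
            }
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)  // short cleanup reduction
                acc += partial[k];
            return acc;
        }

    Widening the dependency distance from 1 to K is what lets the tool schedule one loop iteration per cycle; the same idea applies to any reduction whose operator latency exceeds one cycle.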

    Formal synthesis of control signals for systolic arrays

    Parallelization of dynamic programming recurrences in computational biology

    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15x and 130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 Intel Core 2 Duo processors running at 3 GHz. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
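
    For reference, here is the Nussinov recurrence mentioned above in a minimal single-threaded C++ form. The anti-diagonal loop order makes the wavefront parallelism explicit: all cells of one anti-diagonal are mutually independent, which is the fine-grained parallelism a polyhedral mapping exploits in hardware. The scoring is deliberately simplified (base-pair counting, no minimum hairpin length), so treat this as a sketch of the recurrence, not the accelerated kernel.

        #include <algorithm>
        #include <string>
        #include <vector>

        // 1 if bases a and b can pair (Watson-Crick or G-U wobble), else 0.
        int can_pair(char a, char b) {
            auto is = [&](char x, char y) {
                return (a == x && b == y) || (a == y && b == x);
            };
            return (is('A', 'U') || is('C', 'G') || is('G', 'U')) ? 1 : 0;
        }

        // Nussinov maximum base-pairing score:
        //   N(i,j) = max( N(i+1,j-1) + pair(i,j),
        //                 max over i <= k < j of N(i,k) + N(k+1,j) )
        // All cells with the same j - i (one anti-diagonal) are independent,
        // so each wavefront step can be computed in parallel.
        int nussinov(const std::string &s) {
            const int n = static_cast<int>(s.size());
            std::vector<std::vector<int>> N(n, std::vector<int>(n, 0));
            for (int d = 1; d < n; ++d) {          // wavefront: anti-diagonal d
                for (int i = 0; i + d < n; ++i) {  // cells here are independent
                    const int j = i + d;
                    int best = N[i + 1][j - 1] + can_pair(s[i], s[j]);
                    for (int k = i; k < j; ++k)    // bifurcation at position k
                        best = std::max(best, N[i][k] + N[k + 1][j]);
                    N[i][j] = best;
                }
            }
            return n > 0 ? N[0][n - 1] : 0;
        }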

    Time-Optimal and Conflict-Free Mappings of Uniform Dependence Algorithms into Lower Dimensional Processor Arrays

    Most existing methods of mapping algorithms into processor arrays are restricted to the case where n-dimensional algorithms, or algorithms with n nested loops, are mapped into (n-1)-dimensional arrays. In practice, however, it is of interest to map n-dimensional algorithms into (k-1)-dimensional arrays where k < n. For example, many algorithms at the bit level are at least 4-dimensional (matrix multiplication, convolution, LU decomposition, etc.), while most existing bit-level processor arrays are 2-dimensional. A computational conflict occurs if two or more computations of an algorithm are mapped into the same processor at the same execution time. In this paper, necessary and sufficient conditions are derived to identify all mappings without computational conflicts, based on the Hermite normal form of the mapping matrix. These conditions are used to propose methods of mapping any n-dimensional algorithm into a (k-1)-dimensional array, k ≤ n; for k ≥ n-3, optimality of the mapping is guaranteed.
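
    To make the conflict notion concrete, the sketch below is a brute-force executable restatement: a mapping matrix T, whose first row is the schedule and whose remaining rows are the processor allocation, is conflict-free exactly when it is injective on the index set. The paper's contribution is closed-form necessary and sufficient conditions based on the Hermite normal form of T that avoid any such exhaustive check; the function and its names here are purely illustrative.

        #include <cstddef>
        #include <set>
        #include <vector>

        // Brute-force statement of the conflict-free property. T is a k x n
        // integer matrix: row 0 is the schedule (time), rows 1..k-1 place each
        // index point on the (k-1)-dimensional array. A computational conflict
        // is two distinct index points mapped to the same processor at the
        // same time, i.e. T failing to be injective on the index set.
        bool conflict_free(const std::vector<std::vector<int>> &T,
                           const std::vector<std::vector<int>> &index_set) {
            std::set<std::vector<int>> images;
            for (const auto &idx : index_set) {
                std::vector<int> img;
                for (const auto &row : T) {  // img = T * idx
                    int s = 0;
                    for (std::size_t c = 0; c < idx.size(); ++c)
                        s += row[c] * idx[c];
                    img.push_back(s);
                }
                if (!images.insert(img).second)  // duplicate image: conflict
                    return false;
            }
            return true;
        }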

    Partitioning of Uniform Dependency Algorithms for Parallel Execution on MIMD/ Systolic Systems

    An algorithm can be modeled as an index set and a set of dependence vectors. Each index vector in the index set indexes a computation of the algorithm. If the execution of a computation depends on the execution of another computation, this dependency is represented as the difference between the index vectors of the two computations. The dependence matrix is the matrix whose columns are the dependence vectors. An independent partition of the index set is one in which there are no dependencies between computations that belong to different blocks of the partition. This report considers uniform dependence algorithms with arbitrary index sets and proposes two very simple methods to find independent partitions of the index set. Each method has advantages over the other for certain kinds of applications, and both outperform previously proposed approaches in terms of computational complexity and/or optimality. Lower and upper bounds on the cardinality of maximal independent partitions are also given. For some algorithms, it is shown that the cardinality of the maximal partition is equal to the greatest common divisor of some subdeterminants of the dependence matrix. In an MIMD/multiple systolic array computation environment, if different blocks of an independent partition are assigned to different processors/arrays, the communication between processors/arrays is reduced to zero. This is significant because communication usually dominates the overhead in MIMD machines. Some issues of mapping partitioned algorithms into MIMD/systolic systems are addressed. Based on the theory of partitioning, a new method is proposed to test whether a system of linear Diophantine equations has integer solutions.
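
    As a minimal 1-D illustration of the partitioning idea: the report's general result uses greatest common divisors of subdeterminants of the dependence matrix, and for a single loop with constant dependence distances this reduces to the gcd of the distances. The hypothetical sketch below enumerates the independent blocks for that special case.

        #include <numeric>
        #include <vector>

        // 1-D special case: for a recurrence whose dependence distances are
        // d_1..d_m, indices i and j can only interact when i - j is an integer
        // combination of the distances, i.e. a multiple of g = gcd(d_1,...,d_m).
        // The g residue classes mod g are therefore independent blocks with no
        // inter-block communication, matching the gcd characterization above.
        std::vector<std::vector<int>> independent_blocks(
                const std::vector<int> &distances, int n) {
            int g = 0;
            for (int d : distances) g = std::gcd(g, d);
            if (g == 0) g = 1;  // no dependencies: everything in one block
            std::vector<std::vector<int>> blocks(g);
            for (int i = 0; i < n; ++i)
                blocks[i % g].push_back(i);  // block = residue class mod g
            return blocks;
        }

        // Example: x[i] = f(x[i-2], x[i-4]) has distances {2, 4} and g = 2, so
        // even and odd indices form two blocks that can run on different
        // processors with zero communication between them.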

    Automatic parallelisation for a class of URE problems

    PhD thesis. This thesis deals with the methodology and software of automatic parallelisation for numerical supercomputing and supercomputers. We focus on the problem of Uniform Recurrence Equations (UREs), which arise widely in numerical computations, and propose a complete methodology for the automatic generation of parallel programs for regular array designs. The methodology starts with the introduction of a set of canonical dependencies which yields a general model of the various URE problems. Based on these canonical dependencies, partitioning and mapping methods are developed which form the foundation of the overall design process. Using these theoretical results, we propose structures for the parallel programs and automatically generate parallel code that runs correctly and efficiently on a transputer array. The achievements presented in this thesis represent significant progress in the area of automatic generation of parallel codes and regular (systolic) array design; the methodology is integrated and self-contained, and may be the only practical working package in this area. Funded by the Research Committee of the University of Newcastle upon Tyne under the CVCP Overseas Research Students Awards Scheme.
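
    To make the URE class concrete, a standard example follows: matrix multiplication written as a uniform recurrence, the canonical input form for the partitioning and mapping methods described above. The plain C++ rendering assumes square matrices and is illustrative only.

        #include <cstddef>
        #include <vector>

        using Matrix = std::vector<std::vector<float>>;

        // Matrix multiplication as a uniform recurrence equation (URE):
        //   c(i,j,k) = c(i,j,k-1) + a(i,k) * b(k,j),  with c(i,j,-1) = 0.
        // The single dependence vector (0,0,1) is constant across the whole
        // index set -- the uniformity that lets partitioning and space-time
        // mapping derive a regular (systolic) array from the loop nest.
        Matrix matmul_ure(const Matrix &a, const Matrix &b) {
            const std::size_t n = a.size();  // assumes n x n inputs
            Matrix c(n, std::vector<float>(n, 0.0f));
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < n; ++j)
                    for (std::size_t k = 0; k < n; ++k)
                        // c(i,j,k) = c(i,j,k-1) + a(i,k) * b(k,j)
                        c[i][j] += a[i][k] * b[k][j];
            return c;
        }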