140 research outputs found

    Towards Lattice Quantum Chromodynamics on FPGA devices

    Get PDF
    In this paper we describe a single-node, double precision Field Programmable Gate Array (FPGA) implementation of the Conjugate Gradient algorithm in the context of Lattice Quantum Chromodynamics. As a benchmark of our proposal we invert numerically the Dirac-Wilson operator on a 4-dimensional grid on three Xilinx hardware solutions: Zynq Ultrascale+ evaluation board, the Alveo U250 accelerator and the largest device available on the market, the VU13P device. In our implementation we separate software/hardware parts in such a way that the entire multiplication by the Dirac operator is performed in hardware, and the rest of the algorithm runs on the host. We find out that the FPGA implementation can offer a performance comparable with that obtained using current CPU or Intel's many core Xeon Phi accelerators. A possible multiple node FPGA-based system is discussed and we argue that power-efficient High Performance Computing (HPC) systems can be implemented using FPGA devices only.Comment: 17 pages, 4 figure

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Full text link
    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS

    Type-driven automated program transformations and cost modelling for optimising streaming programs on FPGAs

    Get PDF
    In this paper we present a novel approach to program optimisation based on compiler-based type-driven program transformations and a fast and accurate cost/performance model for the target architecture. We target streaming programs for the problem domain of scientific computing, such as numerical weather prediction. We present our theoretical framework for type-driven program transformation, our target high-level language and intermediate representation languages and the cost model and demonstrate the effectiveness of our approach by comparison with a commercial toolchain

    Efficient Hardware Design Of Iterative Stencil Loops

    Get PDF
    A large number of algorithms for multidimensional signals processing and scientific computation come in the form of iterative stencil loops (ISLs), whose data dependencies span across multiple iterations. Because of their complex inner structure, automatic hardware acceleration of such algorithms is traditionally considered as a difficult task. In this paper, we introduce an automatic design flow that identifies, in a wide family of bidimensional data processing algorithms, sub-portions that exhibit a kind of parallelism close to that of ISLs; these are mapped onto a space of highly optimized ad-hoc architectures, which is efficiently explored to identify the best implementations with respect to both area and throughput. Experimental results show that the proposed methodology generates circuits whose performance is comparable to that of manually-optimized solutions, and orders of magnitude higher than those generated by commercial HLS tools

    AMC: Advanced Multi-accelerator Controller

    Get PDF
    The rapid advancement, use of diverse architectural features and introduction of High Level Synthesis (HLS) tools in FPGA technology have enhanced the capacity of data-level parallelism on a chip. A generic FPGA based HLS multi-accelerator system requires a microprocessor (master core) that manages memory and schedules accelerators. In a real environment, such HLS multi-accelerator systems do not give a perfect performance due to memory bandwidth issues. Thus, a system demands a memory manager and a scheduler that improves performance by managing and scheduling the multi-accelerator’s memory access patterns efficiently. In this article, we propose the integration of an intelligent memory system and efficient scheduler in the HLS-based multi-accelerator environment called Advanced Multi-accelerator Controller (AMC). The AMC system is evaluated with memory intensive accelerators, High Performance Computing (HPC) applications and implemented and tested on a Xilinx Virtex-5 ML505 evaluation FPGA board. The performance of the system is compared against the microprocessor-based systems that have been integrated with the operating system. Results show that the AMC based HLS multi-accelerator system achieves 10.4x and 7x of speedup compared to the MicroBlaze and Intel Core based HLS multi-accelerator systems.Peer ReviewedPostprint (author’s final draft

    Automated optimization of reconfigurable designs

    Get PDF
    Currently, the optimization of reconfigurable design parameters is typically done manually and often involves substantial amount effort. The main focus of this thesis is to reduce this effort. The designer can focus on the implementation and design correctness, leaving the tools to carry out optimization. To address this, this thesis makes three main contributions. First, we present initial investigation of reconfigurable design optimization with the Machine Learning Optimizer (MLO) algorithm. The algorithm is based on surrogate model technology and particle swarm optimization. By using surrogate models the long hardware generation time is mitigated and automatic optimization is possible. For the first time, to the best of our knowledge, we show how those models can both predict when hardware generation will fail and how well will the design perform. Second, we introduce a new algorithm called Automatic Reconfigurable Design Efficient Global Optimization (ARDEGO), which is based on the Efficient Global Optimization (EGO) algorithm. Compared to MLO, it supports parallelism and uses a simpler optimization loop. As the ARDEGO algorithm uses multiple optimization compute nodes, its optimization speed is greatly improved relative to MLO. Hardware generation time is random in nature, two similar configurations can take vastly different amount of time to generate making parallelization complicated. The novelty is efficient use of the optimization compute nodes achieved through extension of the asynchronous parallel EGO algorithm to constrained problems. Third, we show how results of design synthesis and benchmarking can be reused when a design is ported to a different platform or when its code is revised. This is achieved through the new Auto-Transfer algorithm. A methodology to make the best use of available synthesis and benchmarking results is a novel contribution to design automation of reconfigurable systems.Open Acces

    FPGA Acceleration of Domain-specific Kernels via High-Level Synthesis

    Get PDF
    L'abstract è presente nell'allegato / the abstract is in the attachmen

    並列計算アクセラレータへの効率的なアプリケーションマッピングに関する研究

    Get PDF
    長崎大学学位論文 学位記番号:博(工)甲第3号 学位授与年月日:平成26年3月20日Nagasaki University (長崎大学)課程博
    corecore