14,422 research outputs found

    Empowering parallel computing with field programmable gate arrays

    Get PDF
    After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Get PDF
    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric--static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to its architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency

    High-level synthesis optimization for blocked floating-point matrix multiplication

    Get PDF
    In the last decade floating-point matrix multiplication on FPGAs has been studied extensively and efficient architectures as well as detailed performance models have been developed. By design these IP cores take a fixed footprint which not necessarily optimizes the use of all available resources. Moreover, the low-level architectures are not easily amenable to a parameterized synthesis. In this paper high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An\ exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs on the logic fabric simultaneously. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme which maximizes the data reuse for consecutive block matrix product calculations. Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices

    Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing

    Full text link
    We investigate implementation of lattice Quantum Chromodynamics (QCD) code on the Intel Xeon Phi Knights Landing (KNL). The most time consuming part of the numerical simulations of lattice QCD is a solver of linear equation for a large sparse matrix that represents the strong interaction among quarks. To establish widely applicable prescriptions, we examine rather general methods for the SIMD architecture of KNL, such as using intrinsics and manual prefetching, to the matrix multiplication and iterative solver algorithms. Based on the performance measured on the Oakforest-PACS system, we discuss the performance tuning on KNL as well as the code design for facilitating such tuning on SIMD architecture and massively parallel machines.Comment: 8 pages, 12 figures. Talk given at LHAM'17 "5th International Workshop on Legacy HPC Application Migration" in CANDAR'17 "The Fifth International Symposium on Computing and Networking" and to appear in the proceeding
    corecore