109 research outputs found

    FPGA Processor In Memory Architectures (PIMs): Overlay or Overhaul ?

    Full text link
    The dominance of machine learning and the ending of Moore's law have renewed interests in Processor in Memory (PIM) architectures. This interest has produced several recent proposals to modify an FPGA's BRAM architecture to form a next-generation PIM reconfigurable fabric. PIM architectures can also be realized within today's FPGAs as overlays without the need to modify the underlying FPGA architecture. To date, there has been no study to understand the comparative advantages of the two approaches. In this paper, we present a study that explores the comparative advantages between two proposed custom architectures and a PIM overlay running on a commodity FPGA. We created PiCaSO, a Processor in/near Memory Scalable and Fast Overlay architecture as a representative PIM overlay. The results of this study show that the PiCaSO overlay achieves up to 80% of the peak throughput of the custom designs with 2.56x shorter latency and 25% - 43% better BRAM memory utilization efficiency. We then show how several key features of the PiCaSO overlay can be integrated into the custom PIM designs to further improve their throughput by 18%, latency by 19.5%, and memory efficiency by 6.2%.Comment: Accepted in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL

    Rethinking FPGA Architectures for Deep Neural Network applications

    Get PDF
    The prominence of machine learning-powered solutions instituted an unprecedented trend of integration into virtually all applications with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the edges of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where the latency and energy consumption are strictly limited, their pre-machine learning optimised architectures remain a barrier to the overall efficiency and performance. Realizing this shortcoming, this thesis demonstrates an architectural study aiming at solutions that enable hidden potentials in the FPGA technology, primarily for machine learning algorithms. Particularly, it shows how slight alterations to the state-of-the-art architectures could significantly enhance the FPGAs toward becoming more machine learning-friendly while maintaining the near-promised performance for the rest of the applications. Eventually, it presents a novel systematic approach to deriving new block architectures guided by designing limitations and machine learning algorithm characteristics through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Eventually, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested together with a family of new embedded blocks, called MLBlocks

    Characterization and Acceleration of High Performance Compute Workloads

    Get PDF

    Characterization and Acceleration of High Performance Compute Workloads

    Get PDF

    A single-source C++20 HLS flow for function evaluation on FPGA and beyond

    Get PDF
    International audienceThis paper presents a framework to reuse the intelligence of RTL generators in a single-source HLS setting. This framework is illustrated by a C++ fixed-point library to generate mathematical function evaluator. A compiler flow from C++20 to Vivado IPs has been developed to make the library usable with Vitis HLS. This flow is demonstrated on two applications: an adder for the logarithmic number system, and additive sound synthesis. These experiments show that the approach allows to easily tune the precision of the types used in the application. They also demonstrate the ability to generate arbitrary function evaluator at the required precision

    Domain-Specific Optimisations for Image Processing on FPGAs

    Get PDF
    Image processing algorithms on FPGAs have increasingly become more pervasive in real-time vision applications. Such algorithms are computationally complex and memory intensive, which can be severely limited by available hardware resources. Optimisations are therefore necessary to achieve better performance and efficiency. We hypothesise that, unlike generic computing optimisations, domain-specific image processing optimisations can improve performance significantly. In this paper, we propose three domain-specific optimisation strategies that can be applied to many image processing algorithms. The optimisations are tested on popular image-processing algorithms and convolution neural networks on CPU/GPU/FPGA and the impact on performance, accuracy and power are measured. Experimental results show major improvements over the baseline non-optimised versions for both convolution neural networks (MobileNetV2 & ResNet50), Scale-Invariant Feature Transform (SIFT) and filter algorithms. Additionally, the optimised FPGA version of SIFT significantly outperformed an optimised GPU implementation when energy consumption statistics are taken into account

    Domain-Specific Optimisations for Image Processing on FPGAs

    Get PDF
    Image processing algorithms on FPGAs have increasingly become more pervasive in real-time vision applications. Such algorithms are computationally complex and memory intensive, which can be severely limited by available hardware resources. Optimisations are therefore necessary to achieve better performance and efficiency. We hypothesise that, unlike generic computing optimisations, domain-specific image processing optimisations can improve performance significantly. In this paper, we propose three domain-specific optimisation strategies that can be applied to many image processing algorithms. The optimisations are tested on popular image-processing algorithms and convolution neural networks on CPU/GPU/FPGA and the impact on performance, accuracy and power are measured. Experimental results show major improvements over the baseline non-optimised versions for both convolution neural networks (MobileNetV2 & ResNet50), Scale-Invariant Feature Transform (SIFT) and filter algorithms. Additionally, the optimised FPGA version of SIFT significantly outperformed an optimised GPU implementation when energy consumption statistics are taken into account
    corecore