189 research outputs found

    A time-multiplexed FPGA overlay with linear interconnect

    Get PDF
    Coarse-grained overlays improve FPGA design pro- ductivity by providing fast compilation and software like pro- grammability. Soft processor based overlays with well-defined ISAs are attractive to application developers due to their ease of use. However, these overlays have significant FPGA resource overheads. Time multiplexed (TM) CGRA-like overlays represent an interesting alternative as they are able to change their behavior on a cycle by cycle basis while the compute kernel executes. This reduces the FPGA resource needed, but at the cost of a higher initiation interval (II) and hence reduced throughput. The fully flexible routing network of current CGRA-like overlays results in high FPGA resource usage. However, many application kernels are acyclic and can be implemented using a much simpler linear feed-forward routing network. This paper examines a DSP block based TM overlay with linear interconnect where the overlay architecture takes account of the application kernels’ characteristics and the underlying FPGA architecture, so as to minimize the II and the FPGA resource usage. We examine a number of architectural extensions to the DSP block based functional unit to improve the II, throughput and latency. The results show an average 70% reduction in II, with corresponding improvements in throughput and latency

    High throughput accelerator interface framework for a linear time-multiplexed FPGA overlay

    Get PDF
    Coarse-grained FPGA overlays improve design productivity through software-like programmability and fast compilation. However, the effectiveness of overlays as accelerators is dependent on suitable interface and programming integration into a typically processor-based computing system, an aspect which has often been neglected in evaluations of overlays. We explore the integration of a time-multiplexed FPGA overlay over a server-class PCI Express interface. We show how this integration can be optimised to maximise performance, and evaluate the area overhead. We also propose a user-friendly programming model for such an overlay accelerator system

    Are coarse-grained overlays ready for general purpose application acceleration on FPGAs?

    Get PDF
    Combining processors with hardware accelerators has become a norm with systems-on-chip (SoCs) ever present in modern compute devices. Heterogeneous programmable system on chip platforms sometimes referred to as hybrid FPGAs, tightly couple general purpose processors with high performance reconfigurable fabrics, providing a more flexible alternative. We can now think of a software application with hardware accelerated portions that are reconfigured at runtime. While such ideas have been explored in the past, modern hybrid FPGAs are the first commercial platforms to enable this move to a more software oriented view, where reconfiguration enables hardware resources to be shared by multiple tasks in a bigger application. However, while the rapidly increasing logic density and more capable hard resources found in modern hybrid FPGA devices should make them widely deployable, they remain constrained within specialist application domains. This is due to both design productivity issues and a lack of suitable hardware abstraction to eliminate the need for working with platform-specific details, as server and desktop virtualization has done in a more general sense. To allow mainstream adoption of FPGA based accelerators in general purpose computing, there is a need to virtualize FPGAs and make them more accessible to application developers who are accustomed to software API abstractions and fast development cycles. In this paper, we discuss the role of overlay architectures in enabling general purpose FPGA application acceleration

    Accelerating SPICE Model-Evaluation using FPGAs

    Get PDF
    Single-FPGA spatial implementations can provide an order of magnitude speedup over sequential microprocessor implementations for data-parallel, floating-point computation in SPICE model-evaluation. Model-evaluation is a key component of the SPICE circuit simulator and it is characterized by large irregular floating-point compute graphs. We show how to exploit the parallelism available in these graphs on single-FPGA designs with a low-overhead VLIW-scheduled architecture. Our architecture uses spatial floating-point operators coupled to local high-bandwidth memories and interconnected by a time-shared network. We retime operation inputs in the model-evaluation to allow independent scheduling of computation and communication. With this approach, we demonstrate speedups of 2–18× over a dual-core 3GHz Intel Xeon 5160 when using a Xilinx Virtex 5 LX330T for a variety of SPICE device models

    A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment

    Get PDF
    Modern cyber-physical systems (CPS) are increasingly adopting heterogeneous systems-on-chip (HeSoCs) as a computing platform to satisfy the demands of their sophisticated workloads. FPGA-based HeSoCs can reach high performance and energy efficiency at the cost of increased design complexity. High-Level Synthesis (HLS) can ease IP design, but automated tools still lack the maturity to efficiently and easily tackle system-level integration of the many hardware and software blocks included in a modern CPS. We present an innovative hardware overlay offering plug-and-play integration of HLS-compiled or handcrafted acceleration IPs thanks to a customizable wrapper attached to the overlay interconnect and providing shared-memory communication to the overlay cores. The latter are based on the open RISC-V ISA and offer simplified software management of the acceleration IP. Deploying the proposed overlay on a Xilinx ZU9EG shows ≈ 20% LUT usage and ≈ 4× speedup compared to program execution on the ARM host core

    Mocarabe: High-Performance Time-Multiplexed Overlays for FPGAs

    Get PDF
    Coarse-grained reconfigurable array (CGRA) overlays can improve dataflow kernel throughput by an order of magnitude over Vivado HLS on Xilinx Alveo U280. This is possible with a combination of carefully floorplanned high-frequency (645 - 768 MHz Torus, 788 - 856 MHz Mesh, 583 - 746 MHz BFT) design and a scalable, communication-aware compiler. Our CGRA architecture supports configurable Processing Element (PE) functionality supported by a configurable number of communication channels to match application demands. Compared to recent FPGA overlays like 4×4 ADRES and HyCUBE implementations in CGRA-ME, our design operates at a faster clock frequency by up to 3.4×, while scaling to an orders-of-magnitude larger array size of 19×69 on Xilinx Alveo U280. We propose a novel topology agnostic ILP placer that formulates the CGRA placement problem into an ILP problem. Our ILP placer optimizes placement regardless of topology and even for non-linear objective functions by using pre-computed placement costs as inputs to the ILP problem formulation. Using the ILP placer reduces placement quadratic wirelength up to 37% compared to the commonly used simulated annealing approach but increases runtime from less than a minute to hours. Our communication-aware compiler targets HLS objectives such as initiation interval (II) and minimizes communication cost using an integer linear programming (ILP) formulation. Unlike SDC schedulers in FPGA HLS tools, we treat data movement as a first-class citizen by encoding the space and time resources of the communication network in the ILP formulation. Given the same constraints on operational resources as Vivado HLS, we can retain our target II and achieve up to 9.2× higher frequency. We compare Torus and Mesh topologies, and show Mesh has less latency per area compared to Torus for the same benchmarks

    Throughput oriented FPGA overlays using DSP blocks

    Get PDF
    Design productivity is a major concern preventing the mainstream adoption of FPGAs. Overlay architectures have emerged as one possible solution to this challenge, offering fast compilation and software-like programmability. However, overlays typically suffer from area and performance overheads due to limited consideration for the underlying FPGA architecture. These overlays have often been of limited size, supporting only relatively small compute kernels. This paper examines the possibility of developing larger, more efficient, overlays using multiple DSP blocks and then maximising utilisation by mapping multiple instances of kernels simultaneously onto the overlay to exploit kernel level parallelism. We show a significant improvement in achievable overlay size and overlay utilisation, with a reduction of almost 70% in the overlay tile requirement compared to existing overlay architectures, an operating frequency in excess of 300 MHz, and kernel throughputs of almost 60 GOPS

    GraphStep: A System Architecture for Sparse-Graph Algorithms

    Get PDF
    Many important applications are organized around long-lived, irregular sparse graphs (e.g., data and knowledge bases, CAD optimization, numerical problems, simulations). The graph structures are large, and the applications need regular access to a large, data-dependent portion of the graph for each operation (e.g., the algorithm may need to walk the graph, visiting all nodes, or propagate changes through many nodes in the graph). On conventional microprocessors, the graph structures exceed on-chip cache capacities, making main-memory bandwidth and latency the key performance limiters. To avoid this “memory wall,” we introduce a concurrent system architecture for sparse graph algorithms that places graph nodes in small distributed memories paired with specialized graph processing nodes interconnected by a lightweight network. This gives us a scalable way to map these applications so that they can exploit the high-bandwidth and low-latency capabilities of embedded memories (e.g., FPGA Block RAMs). On typical spreading activation queries on the ConceptNet Knowledge Base, a sample application, this translates into an order of magnitude speedup per FPGA compared to a state-of-the-art Pentium processor

    Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

    Get PDF
    Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3- 182times for a Xilinx Virtex5 LX 330T, 1.3-33times for an IBM Cell, and 3-131times for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models

    FPGA dynamic and partial reconfiguration : a survey of architectures, methods, and applications

    Get PDF
    Dynamic and partial reconfiguration are key differentiating capabilities of field programmable gate arrays (FPGAs). While they have been studied extensively in academic literature, they find limited use in deployed systems. We review FPGA reconfiguration, looking at architectures built for the purpose, and the properties of modern commercial architectures. We then investigate design flows, and identify the key challenges in making reconfigurable FPGA systems easier to design. Finally, we look at applications where reconfiguration has found use, as well as proposing new areas where this capability places FPGAs in a unique position for adoption
    corecore