39,809 research outputs found

    The synthesis of a hardware scheduler for Non-Manifest Loops

    Get PDF
    This paper addresses the hardware implementation of a dynamic scheduler for non-manifest data dependent periodic loops. Static scheduling techniques which are known to give near optimal scheduling-solutions for manifest loops, fail at scheduling non-manifest loops, since they lack the run time information needed which makes a static schedule feasible. In this paper a dynamic scheduling approach was chosen to circumvent this problem. We present a case study using VHDL where the focus lies on implementations with minimal memory usage and low communication overhead between various components of the architecture. This has resulted in an efficient and synthesisable system

    High-level synthesis under I/O Timing and Memory constraints

    Full text link
    The design of complex Systems-on-Chips implies to take into account communication and memory access constraints for the integration of dedicated hardware accelerator. In this paper, we present a methodology and a tool that allow the High-Level Synthesis of DSP algorithm, under both I/O timing and memory constraints. Based on formal models and a generic architecture, this tool helps the designer to find a reasonable trade-off between both the required I/O timing behavior and the internal memory access parallelism of the circuit. The interest of our approach is demonstrated on the case study of a FFT algorithm

    P4-compatible High-level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs

    Full text link
    Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support the evolving network protocols and the increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture for arbitrary packet parsing to be used in SDN networks. We generate low latency and high-speed streaming packet parsers directly from a packet processing program. Our architecture is pipelined and entirely modeled using templated C++ classes. The pipeline layout is derived from a parser graph that corresponds a P4 code after a series of graph transformation rounds. The RTL code is generated from the C++ description using Xilinx Vivado HLS and synthesized with Xilinx Vivado. Our architecture achieves 100 Gb/s data rate in a Xilinx Virtex-7 FPGA while reducing the latency by 45% and the LUT usage by 40% compared to the state-of-the-art.Comment: Accepted for publication at the 26th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays February 25 - 27, 2018 Monterey Marriott Hotel, Monterey, California, 7 pages, 7 figures, 1 tabl

    Scheduling and Allocation of Non-Manifest Loops on Hardware Graph-Models

    Get PDF
    We address the problem of scheduling non-manifest data dependant periodic loops for high throughput DSP-applications based on a streaming data model. In contrast to manifest loops, non-manifest data dependant loops are loops where the number of iterations needed in order to perform a calculation is data dependant and hence not known at compile time. For the case of manifest loops, static scheduling techniques have been devised which produce near optimal schedules. Due to the lack of exact run-time execution knowledge of non-manifest loops, these static scheduling techniques are not suitable for tackling scheduling problems of DSP-algorithms with non-manifest loops embedded in them. We consider the case where (a) a-priori knowledge of the data distribution, and (b) worst case execution time of the non-manifest loop are known and a constraint on the total execution time has been given. Under these conditions dynamic schedules of the non-manifest data dependant loops within the DSP-algorithm are possible. We show how to construct hardware which dynamically schedules these non-manifest loops. The sliding window execution, which is the execution of a non-manifest loop when the data streams through it, of the constructed hardware will guarantee real time performance for the worst case situation. This is the situation when each non-manifest loop requires its maximum number of iterations

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Full text link
    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS

    A Novel SAT-Based Approach to the Task Graph Cost-Optimal Scheduling Problem

    Get PDF
    The Task Graph Cost-Optimal Scheduling Problem consists in scheduling a certain number of interdependent tasks onto a set of heterogeneous processors (characterized by idle and running rates per time unit), minimizing the cost of the entire process. This paper provides a novel formulation for this scheduling puzzle, in which an optimal solution is computed through a sequence of Binate Covering Problems, hinged within a Bounded Model Checking paradigm. In this approach, each covering instance, providing a min-cost trace for a given schedule depth, can be solved with several strategies, resorting to Minimum-Cost Satisfiability solvers or Pseudo-Boolean Optimization tools. Unfortunately, all direct resolution methods show very low efficiency and scalability. As a consequence, we introduce a specialized method to solve the same sequence of problems, based on a traditional all-solution SAT solver. This approach follows the "circuit cofactoring" strategy, as it exploits a powerful technique to capture a large set of solutions for any new SAT counter-example. The overall method is completed with a branch-and-bound heuristic which evaluates lower and upper bounds of the schedule length, to reduce the state space that has to be visited. Our results show that the proposed strategy significantly improves the blind binate covering schema, and it outperforms general purpose state-of-the-art tool

    Efficiency analysis methodology of FPGAs based on lost frequencies, area and cycles

    Get PDF
    We propose a methodology to study and to quantify efficiency and the impact of overheads on runtime performance. Most work on High-Performance Computing (HPC) for FPGAs only studies runtime performance or cost, while we are interested in how far we are from peak performance and, more importantly, why. The efficiency of runtime performance is defined with respect to the ideal computational runtime in absence of inefficiencies. The analysis of the difference between actual and ideal runtime reveals the overheads and bottlenecks. A formal approach is proposed to decompose the efficiency into three components: frequency, area and cycles. After quantification of the efficiencies, a detailed analysis has to reveal the reasons for the lost frequencies, lost area and lost cycles. We propose a taxonomy of possible causes and practical methods to identify and quantify the overheads. The proposed methodology is applied on a number of use cases to illustrate the methodology. We show the interaction between the three components of efficiency and show how bottlenecks are revealed

    A Multi-objective Perspective for Operator Scheduling using Fine-grained DVS Architecture

    Full text link
    The stringent power budget of fine grained power managed digital integrated circuits have driven chip designers to optimize power at the cost of area and delay, which were the traditional cost criteria for circuit optimization. The emerging scenario motivates us to revisit the classical operator scheduling problem under the availability of DVFS enabled functional units that can trade-off cycles with power. We study the design space defined due to this trade-off and present a branch-and-bound(B/B) algorithm to explore this state space and report the pareto-optimal front with respect to area and power. The scheduling also aims at maximum resource sharing and is able to attain sufficient area and power gains for complex benchmarks when timing constraints are relaxed by sufficient amount. Experimental results show that the algorithm that operates without any user constraint(area/power) is able to solve the problem for most available benchmarks, and the use of power budget or area budget constraints leads to significant performance gain.Comment: 18 pages, 6 figures, International journal of VLSI design & Communication Systems (VLSICS
    • 

    corecore