
    Polyhedral-based dynamic loop pipelining for high-level synthesis

    Loop pipelining is one of the most important optimization methods in high-level synthesis (HLS) for increasing loop parallelism. There has been considerable work on improving loop pipelining, which mainly focuses on optimizing static operation scheduling and parallel memory accesses. Nonetheless, when loops contain complex memory dependencies, current techniques cannot generate high-performance pipelines. In this paper, we extend the capability of loop pipelining in HLS to handle loops with uncertain dependencies (i.e., parameterized by an undetermined variable) and/or nonuniform dependencies (i.e., varying between loop iterations). Our optimization allows a pipeline to be statically scheduled without the aforementioned memory dependencies, while an associated controller changes the execution speed of loop iterations at runtime. This allows the augmented pipeline to process each loop iteration as fast as possible without violating memory dependencies. We use a parametric polyhedral analysis to generate the control logic that decides when it is safe to run all loop iterations in the pipeline and when to break the pipeline execution to resolve memory conflicts. Our techniques have been prototyped in an automated source-to-source code transformation framework, with Xilinx Vivado HLS, a leading HLS tool, as the RTL generation backend. Over a suite of benchmarks, experiments show that our optimization can implement optimized pipelines at almost the same clock speed as without our transformations, running approximately 3.7-10× faster, with a reasonable resource overhead.
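
    To make the targeted pattern concrete, the following is a minimal, hypothetical HLS-C++ sketch (ours, not the paper's framework) of a loop with an uncertain dependency: the distance d is a runtime parameter, so a purely static scheduler must assume the worst-case carried dependency, whereas a runtime controller of the kind described would only slow the pipeline when d actually causes a conflict.

        // Hypothetical example: a loop whose carried dependency distance `d`
        // is only known at runtime (an "uncertain" dependency in the paper's terms).
        void kernel(float *a, int n, int d) {
            // a[i] reads a[i - d]; with a large d, iterations could safely overlap,
            // but a static scheduler must assume d could be 1 and raise the II.
            for (int i = d; i < n; ++i) {
        #pragma HLS pipeline II=1
                a[i] = 0.5f * a[i - d] + 1.0f;
            }
        }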

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increased data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
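
    As a flavour of the transformation classes discussed, here is a small illustrative Vivado-HLS-style sketch (not taken from the paper's toolbox) that combines loop unrolling with cyclic array partitioning so the unrolled body is not limited by a single memory port; the pragma spellings vary between HLS tool versions and are assumptions here.

        constexpr int N = 1024;

        void vec_scale(const float in[N], float out[N], float alpha) {
        // Partition the on-chip arrays so 8 elements are accessible per cycle.
        #pragma HLS array_partition variable=in  cyclic factor=8
        #pragma HLS array_partition variable=out cyclic factor=8
            for (int i = 0; i < N; ++i) {
        #pragma HLS pipeline II=1
        #pragma HLS unroll factor=8
                // With partitioned memories, the unrolled iterations do not
                // contend for one BRAM port, so the loop scales in parallelism.
                out[i] = alpha * in[i];
            }
        }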

    ParaFPGA 2013: Harnessing Programs, Power and Performance in Parallel FPGA applications

    Future computing systems will require dedicated accelerators to achieve high performance. The mini-symposium ParaFPGA explores parallel computing with FPGAs as an interesting avenue to reduce the gap between the architecture and the application. Topics discussed are the power of functional and dataflow languages, the performance of high-level synthesis tools, the automatic creation of hardware multi-cores using C-slow retiming, dynamic power management to control energy consumption, real-time reconfiguration of streaming image processing filters, and memory-optimized event image segmentation.

    Combining dynamic and static scheduling in high-level synthesis

    Field Programmable Gate Arrays (FPGAs) are starting to become mainstream devices for custom computing, particularly deployed in data centres. However, using these FPGA devices requires familiarity with digital design at a low abstraction level. In order to enable software engineers without a hardware background to design custom hardware, high-level synthesis (HLS) tools automatically transform a high-level program, for example in C/C++, into a low-level hardware description. A central task in HLS is scheduling: the allocation of operations to clock cycles. The classic approach to scheduling is static, in which each operation is mapped to a clock cycle at compile time, but recent years have seen the emergence of dynamic scheduling, in which an operation's clock cycle is only determined at run time. Both approaches have their merits: static scheduling can lead to simpler circuitry and more resource sharing, while dynamic scheduling can lead to faster hardware when the computation has a non-trivial control flow. This thesis proposes a scheduling approach that combines the best of both worlds. My idea is to use existing program analysis techniques from software design, such as probabilistic analysis and formal verification, to optimize the HLS hardware. First, this thesis proposes a tool named DASS that uses a heuristic-based approach to identify the code regions in the input program that are amenable to static scheduling and synthesises them into statically scheduled components, also known as static islands, leaving the top-level hardware dynamically scheduled. Second, this thesis addresses a problem with this approach: the analysis of static islands and of their dynamically scheduled surroundings is separate, with each treating the other as a black box. We apply static analysis, including dependence analysis between static islands and their dynamically scheduled surroundings, to optimize the offsets of static islands for high performance. We also apply probabilistic analysis to estimate the performance of the dynamically scheduled part and use this information to optimize the static islands for high area efficiency. Finally, this thesis addresses the problem of conservatism in using sequential control flow designs, which can limit the throughput of the hardware. We show this challenge can be solved by formally proving that certain control flows can be safely parallelised for high performance. This thesis demonstrates how to use automated formal verification to find out-of-order loop pipelining solutions and multi-threading solutions for a sequential program.
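
    As an illustration of the kind of code this hybrid scheduling targets, the following sketch (ours, not the thesis' tool interface) wraps a fixed-latency inner computation, a natural static island, inside an outer loop whose data-dependent branch makes it a candidate for dynamic scheduling:

        // Hypothetical example: the outer loop's behaviour depends on runtime data
        // (how often the branch is taken), so dynamic scheduling keeps it efficient;
        // the inner dot product has a fixed trip count and no hazards, so it could be
        // synthesised as a statically scheduled "static island".
        // Assumes a holds 8*n elements and b holds 8.
        float hybrid(const float *a, const float *b, const int *flag, int n) {
            float acc = 0.0f;
            for (int i = 0; i < n; ++i) {
                if (flag[i]) {                      // unpredictable control flow
                    float s = 0.0f;
                    for (int k = 0; k < 8; ++k)     // fixed-latency static island
                        s += a[8 * i + k] * b[k];
                    acc += s;
                }
            }
            return acc;
        }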

    Parallelization of dynamic programming recurrences in computational biology

    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130× faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 Intel Core 2 Duo processors running at 3 GHz. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
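
    For reference, the Nussinov recurrence the accelerators implement can be written as a straightforward O(n^3) software dynamic program; the sketch below is a plain CPU version (simplified, ignoring the minimum hairpin-loop length), not the FPGA array design itself.

        #include <algorithm>
        #include <string>
        #include <vector>

        // True if the two bases can pair (Watson-Crick pairs only, for brevity).
        static bool can_pair(char a, char b) {
            return (a == 'A' && b == 'U') || (a == 'U' && b == 'A') ||
                   (a == 'G' && b == 'C') || (a == 'C' && b == 'G');
        }

        // N[i][j] = maximum number of base pairs in the subsequence s[i..j].
        int nussinov(const std::string &s) {
            const int n = static_cast<int>(s.size());
            std::vector<std::vector<int>> N(n, std::vector<int>(n, 0));
            for (int len = 1; len < n; ++len) {            // increasing interval length
                for (int i = 0; i + len < n; ++i) {
                    const int j = i + len;
                    int best = std::max(N[i + 1][j], N[i][j - 1]);     // i or j unpaired
                    best = std::max(best, N[i + 1][j - 1] + can_pair(s[i], s[j]));
                    for (int k = i + 1; k < j; ++k)                    // bifurcation
                        best = std::max(best, N[i][k] + N[k + 1][j]);
                    N[i][j] = best;
                }
            }
            return n > 0 ? N[0][n - 1] : 0;
        }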

    Tools for efficient Deep Learning

    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, resources, and power consumption. We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) stabler by layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have up to 1% higher accuracy than prior work. This thesis also addresses the challenges lying in the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist can speed up sequential and parallel software execution by 2.53 and 9.47 times, respectively, on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvements of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
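
    To make the structured-pruning idea concrete, the sketch below (our illustration, not SPGC itself, which additionally permutes channels before grouping) shows why replacing a dense 1x1 convolution with a group convolution removes weights in a regular, hardware-friendly pattern:

        #include <vector>

        // A dense 1x1 conv mixes all C_in inputs into every output channel; a group
        // convolution with G groups only mixes within each block, so (G-1)/G of the
        // weight matrix is structurally zero rather than irregularly sparse.
        // x: [C_in], w: [C_out][C_in / groups], result: [C_out].
        std::vector<float> group_conv1x1(const std::vector<float> &x,
                                         const std::vector<std::vector<float>> &w,
                                         int groups) {
            const int c_in = static_cast<int>(x.size());
            const int c_out = static_cast<int>(w.size());
            const int in_per_group = c_in / groups;
            const int out_per_group = c_out / groups;
            std::vector<float> y(c_out, 0.0f);
            for (int g = 0; g < groups; ++g)        // each group sees only its slice
                for (int o = 0; o < out_per_group; ++o)
                    for (int i = 0; i < in_per_group; ++i)
                        y[g * out_per_group + o] +=
                            w[g * out_per_group + o][i] * x[g * in_per_group + i];
            return y;
        }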

    A High-Frequency Load-Store Queue with Speculative Allocations for High-Level Synthesis

    Dynamically scheduled high-level synthesis (HLS) enables the use of load-store queues (LSQs), which can disambiguate data hazards at circuit runtime, increasing throughput in codes with unpredictable memory accesses. However, the increased throughput comes at the price of lower clock frequency and higher resource usage compared to statically scheduled circuits without LSQs. The lower frequency often nullifies any throughput improvements over static scheduling, while the resource usage becomes prohibitively expensive with large queue sizes. This paper presents a method for achieving dynamically scheduled memory operations in HLS without a significant clock period and resource usage increase. We present a novel LSQ based on shift registers, enabled by the opportunity to specialize queue sizes to a target code in HLS. We show a method to speculatively allocate addresses to our LSQ, significantly increasing pipeline parallelism in codes that could not benefit from an LSQ before. In stark contrast to traditional load value speculation, we do not require pipeline replays and have no overhead on misspeculation. On a set of benchmarks with data hazards, our approach achieves an average speedup of 11× against static HLS and 5× against dynamic HLS that uses a state-of-the-art LSQ from previous work. Our LSQ also uses several times fewer resources, scaling to queues with hundreds of entries, and supports both on-chip and off-chip memory. Comment: To appear in the International Conference on Field Programmable Technology (FPT'23), Yokohama, Japan, 11-14 December 2023.
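
    The data hazards in question typically look like the following sketch (our illustration, not the paper's benchmark set or LSQ interface): whether two iterations touch the same element of hist depends entirely on the input indices, so a static schedule must assume they always conflict, while an LSQ can disambiguate the addresses at runtime.

        // Histogram-style read-modify-write through a data-dependent address.
        // If idx[i] == idx[i-1], the iterations truly conflict; otherwise they
        // could overlap in the pipeline, which only runtime disambiguation can discover.
        void histogram(const int *idx, const float *weight, float *hist, int n) {
            for (int i = 0; i < n; ++i) {
                hist[idx[i]] += weight[i];
            }
        }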

    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs, whose programming is conventionally an RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, they still leave programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template of accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels and, more importantly, drastically reduces the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as an input and performs a series of code transformations, based on the result of the analytical-model-based design space exploration, to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels.
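
    As a rough picture of a coarse-grained composable/parallel/pipeline structure, the following HLS-C++ sketch (our illustration with assumed pragma spellings, not AutoAccel's generated code) splits a kernel into load, compute, and store stages that a dataflow directive can overlap across tiles:

        constexpr int TILE = 256;

        static void load_stage(const float *in, float buf[TILE], int t) {
            for (int i = 0; i < TILE; ++i) buf[i] = in[t * TILE + i];
        }
        static void compute_stage(const float in_buf[TILE], float out_buf[TILE]) {
            for (int i = 0; i < TILE; ++i) {
        #pragma HLS pipeline II=1
                out_buf[i] = in_buf[i] * in_buf[i];   // fine-grained pipelined body
            }
        }
        static void store_stage(const float buf[TILE], float *out, int t) {
            for (int i = 0; i < TILE; ++i) out[t * TILE + i] = buf[i];
        }

        void accel(const float *in, float *out, int tiles) {
            for (int t = 0; t < tiles; ++t) {
        #pragma HLS dataflow            // overlap load, compute, and store across tiles
                float a[TILE], b[TILE];
                load_stage(in, a, t);
                compute_stage(a, b);
                store_stage(b, out, t);
            }
        }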

    Compiler Discovered Dynamic Scheduling of Irregular Code in High-Level Synthesis

    Dynamically scheduled high-level synthesis (HLS) achieves higher throughput than static HLS for codes with unpredictable memory accesses and control flow. However, excessive dataflow scheduling results in circuits that use more resources and have a slower critical path, even when only a part of the circuit exhibits dynamic behavior. Recent work has shown that marking parts of a dataflow circuit for static scheduling can save resources and improve performance (hybrid scheduling), but the dynamic part of the circuit still bottlenecks the critical path. We propose instead to selectively introduce dynamic scheduling into static HLS. This paper presents an algorithm for identifying code regions amenable to dynamic scheduling and shows a methodology for introducing dynamically scheduled basic blocks, loops, and memory operations into static HLS. Our algorithm is informed by modulo scheduling and can be integrated into any modulo-scheduled HLS tool. On a set of ten benchmarks, we show that our approach achieves, on average, up to 3.7× and 3× speedups against dynamic and hybrid scheduling, respectively, with an area overhead of 1.3× and frequency degradation of 0.74× when compared to static HLS. Comment: To appear in the 33rd International Conference on Field-Programmable Logic and Applications (2023).
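
    A minimal example of the kind of region such an algorithm would flag (ours, not from the paper's benchmark suite): the store through c[i] may alias later loads through b[i], and the branch is data-dependent, so a modulo scheduler can only guarantee a conservative initiation interval, making the region a candidate for selective dynamic scheduling.

        // Irregular loop: data-dependent addresses and control flow.
        // A modulo scheduler must assume a[c[i]] can feed a[b[i+1]], forcing a
        // large static II, whereas dynamic scheduling resolves it per iteration.
        void irregular(float *a, const int *b, const int *c, int n) {
            for (int i = 0; i < n; ++i) {
                float v = a[b[i]];          // load through an unpredictable index
                if (v > 0.0f)               // data-dependent branch
                    a[c[i]] = 1.5f * v;     // store that may alias later loads
            }
        }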