593 research outputs found

    Area-energy aware dataflow optimisation of visual tracking systems

    Get PDF
    This paper presents an orderly dataflow-optimisation approach suitable for area-energy aware computer vision applications on FPGAs. Vision systems are increasingly being deployed in power constrained scenarios, where the dataflow model of computation has become popular for describing complex algorithms. Dataflow model allows processing datapaths comprised of several independent and well defined computations. However, compilers are often unsuccessful in identifying domain-specific optimisation opportunities resulting in wasted resources and power consumption. We present a methodology for the optimisation of dataflow networks, according to patterns often found in computer vision systems, focusing on identifying optimisations which are not discovered automatically by an optimising compiler. Code transformation using profiling and refactoring provides opportunities to optimise the design, targeting FPGA implementations and focusing on area and power abatement. Our refactoring methodology, applying transformations to a complex algorithm for visual tracking resulted in significant reduction in power consumption and resource usage

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    Get PDF
    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    Profile Guided Dataflow Transformation for FPGAs and CPUs

    Get PDF
    This paper proposes a new high-level approach for optimising field programmable gate array (FPGA) designs. FPGA designs are commonly implemented in low-level hardware description languages (HDLs), which lack the abstractions necessary for identifying opportunities for significant performance improvements. Using a computer vision case study, we show that modelling computation with dataflow abstractions enables substantial restructuring of FPGA designs before lowering to the HDL level, and also improve CPU performance. Using the CPU transformations, runtime is reduced by 43 %. Using the FPGA transformations, clock frequency is increased from 67MHz to 110MHz. Our results outperform commercial low-level HDL optimisations, showcasing dataflow program abstraction as an amenable computation model for highly effective FPGA optimisation

    Combining dynamic and static scheduling in high-level synthesis

    Get PDF
    Field Programmable Gate Arrays (FPGAs) are starting to become mainstream devices for custom computing, particularly deployed in data centres. However, using these FPGA devices requires familiarity with digital design at a low abstraction level. In order to enable software engineers without a hardware background to design custom hardware, high-level synthesis (HLS) tools automatically transform a high-level program, for example in C/C++, into a low-level hardware description. A central task in HLS is scheduling: the allocation of operations to clock cycles. The classic approach to scheduling is static, in which each operation is mapped to a clock cycle at compile time, but recent years have seen the emergence of dynamic scheduling, in which an operation’s clock cycle is only determined at run-time. Both approaches have their merits: static scheduling can lead to simpler circuitry and more resource sharing, while dynamic scheduling can lead to faster hardware when the computation has a non-trivial control flow. This thesis proposes a scheduling approach that combines the best of both worlds. My idea is to use existing program analysis techniques in software designs, such as probabilistic analysis and formal verification, to optimize the HLS hardware. First, this thesis proposes a tool named DASS that uses a heuristic-based approach to identify the code regions in the input program that are amenable to static scheduling and synthesises them into statically scheduled components, also known as static islands, leaving the top-level hardware dynamically scheduled. Second, this thesis addresses a problem of this approach: that the analysis of static islands and their dynamically scheduled surroundings are separate, where one treats the other as black boxes. We apply static analysis including dependence analysis between static islands and their dynamically scheduled surroundings to optimize the offsets of static islands for high performance. We also apply probabilistic analysis to estimate the performance of the dynamically scheduled part and use this information to optimize the static islands for high area efficiency. Finally, this thesis addresses the problem of conservatism in using sequential control flow designs which can limit the throughput of the hardware. We show this challenge can be solved by formally proving that certain control flows can be safely parallelised for high performance. This thesis demonstrates how to use automated formal verification to find out-of-order loop pipelining solutions and multi-threading solutions from a sequential program.Open Acces

    DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge

    Full text link
    The Data Activated Liu Graph Engine - DALiuGE - is an execution framework for processing large astronomical datasets at a scale required by the Square Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex data reduction pipelines consisting of both data sets and algorithmic components and an implementation run-time to execute such pipelines on distributed resources. By mapping the logical view of a pipeline to its physical realisation, DALiuGE separates the concerns of multiple stakeholders, allowing them to collectively optimise large-scale data processing solutions in a coherent manner. The execution in DALiuGE is data-activated, where each individual data item autonomously triggers the processing on itself. Such decentralisation also makes the execution framework very scalable and flexible, supporting pipeline sizes ranging from less than ten tasks running on a laptop to tens of millions of concurrent tasks on the second fastest supercomputer in the world. DALiuGE has been used in production for reducing interferometry data sets from the Karl E. Jansky Very Large Array and the Mingantu Ultrawide Spectral Radioheliograph; and is being developed as the execution framework prototype for the Science Data Processor (SDP) consortium of the Square Kilometre Array (SKA) telescope. This paper presents a technical overview of DALiuGE and discusses case studies from the CHILES and MUSER projects that use DALiuGE to execute production pipelines. In a companion paper, we provide in-depth analysis of DALiuGE's scalability to very large numbers of tasks on two supercomputing facilities.Comment: 31 pages, 12 figures, currently under review by Astronomy and Computin

    Type-driven automated program transformations and cost modelling for optimising streaming programs on FPGAs

    Get PDF
    In this paper we present a novel approach to program optimisation based on compiler-based type-driven program transformations and a fast and accurate cost/performance model for the target architecture. We target streaming programs for the problem domain of scientific computing, such as numerical weather prediction. We present our theoretical framework for type-driven program transformation, our target high-level language and intermediate representation languages and the cost model and demonstrate the effectiveness of our approach by comparison with a commercial toolchain
    • …
    corecore