
    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increased data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
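
    As a rough illustration only, the sketch below shows two of the transformation classes named above, on-chip data reuse and loop pipelining, written in Vivado/Vitis-HLS-style C++. The kernel, pragmas, and sizes are assumptions for illustration, not code from the paper.

    ```cpp
    #include <cstddef>

    constexpr std::size_t N = 1024;

    // 3-point stencil with explicit data reuse: each input element is read from
    // the external interface exactly once and kept in a small shift register,
    // so the loop can be pipelined at an initiation interval of 1 without
    // contending for the memory interface.
    void stencil_1d(const float in[N], float out[N]) {
        float window[3] = {0.0f, 0.0f, 0.0f};
        for (std::size_t i = 0; i < N; ++i) {
    #pragma HLS PIPELINE II=1
            // Shift the reuse window and bring in one new element per iteration.
            window[0] = window[1];
            window[1] = window[2];
            window[2] = in[i];
            if (i >= 2) {
                out[i - 1] = 0.25f * window[0] + 0.5f * window[1] + 0.25f * window[2];
            }
        }
        // Boundary elements are simply copied (a simplification for the sketch).
        out[0] = in[0];
        out[N - 1] = in[N - 1];
    }
    ```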

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows. Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201
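
    As a toy illustration of the kind of design space exploration such toolflows automate, the sketch below enumerates a processing-element count under a DSP budget and keeps the configuration with the best estimated throughput. The cost model and all constants are assumptions for illustration, not taken from any surveyed tool.

    ```cpp
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Layer { std::uint64_t macs; };        // multiply-accumulates per inference (assumed)

    int main() {
        const std::vector<Layer> net = {{3211264}, {231211008}, {25690112}};
        const int dsp_budget = 2048;             // assumed device DSP count
        const int dsp_per_pe = 2;                // assumed DSPs per processing element

        double best_fps = 0.0;
        int best_pe = 0;
        for (int pe = 1; pe * dsp_per_pe <= dsp_budget; pe *= 2) {
            // Toy latency model: cycles = MACs / PEs, at an assumed 200 MHz clock.
            std::uint64_t cycles = 0;
            for (const Layer& l : net) cycles += l.macs / pe;
            double fps = 200e6 / static_cast<double>(cycles);
            if (fps > best_fps) { best_fps = fps; best_pe = pe; }
        }
        std::cout << "best PE count " << best_pe << " -> ~" << best_fps << " fps\n";
    }
    ```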

    FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

    Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 µs latency on the MNIST dataset with 95.8% accuracy, and 21,906 image classifications per second with 283 µs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks. Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 201
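
    A minimal sketch of why binarized layers map so efficiently to FPGAs (an illustration, not FINN code): with weights and activations in {-1, +1} packed one bit per value, a dot product reduces to an XNOR followed by a popcount.

    ```cpp
    #include <bitset>
    #include <cstdint>
    #include <iostream>

    // Dot product of two 64-element binarized vectors packed into one word each
    // (bit 1 encodes +1, bit 0 encodes -1). For each lane, XNOR is 1 when the
    // signs agree (+1 contribution) and 0 when they disagree (-1 contribution),
    // so dot = matches - mismatches = 2 * matches - 64.
    int binary_dot(std::uint64_t a, std::uint64_t w) {
        std::uint64_t xnor = ~(a ^ w);
        int matches = static_cast<int>(std::bitset<64>(xnor).count());
        return 2 * matches - 64;
    }

    int main() {
        std::uint64_t act = 0xF0F0F0F0F0F0F0F0ULL;  // example activations
        std::uint64_t wgt = 0xFF00FF00FF00FF00ULL;  // example weights
        std::cout << "binarized dot product = " << binary_dot(act, wgt) << "\n";
    }
    ```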

    Steerable optical tweezers for ultracold atom studies

    We report on the implementation of an optical tweezer system for controlled transport of ultracold atoms along a narrow, static confinement channel. The tweezer system is based on high-efficiency acousto-optical deflectors and offers two-dimensional control over beam position. This opens up the possibility for tracking the transport channel when shuttling atomic clouds along the guide, forestalling atom spilling. Multiple clouds can be tracked independently by time-shared tweezer beams addressing individual sites in the channel. The deflectors are controlled using a multichannel direct digital synthesizer, which receives instructions on a sub-microsecond time scale from a field-programmable gate array. Using the tweezer system, we demonstrate sequential binary splitting of an ultracold ⁸⁷Rb cloud into 2⁵ clouds. Comment: 4 pages, 5 figures, 1 movie link
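
    A rough sketch of the bookkeeping behind time-shared addressing during binary splitting, assuming an 80 MHz AOD centre frequency and a fixed tone spacing (both assumptions, not values from the paper, and not the authors' control code):

    ```cpp
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        const double f_centre_MHz = 80.0;   // assumed AOD centre frequency
        const double spacing_MHz  = 0.5;    // assumed tone spacing between sites
        const int    stages       = 5;      // 2^5 = 32 clouds, as in the abstract

        // After k splitting stages there are 2^k clouds, each addressed in its
        // own time slot by one deflector tone.
        const std::size_t sites = std::size_t(1) << stages;
        std::vector<double> tones(sites);
        for (std::size_t s = 0; s < sites; ++s) {
            // Spread the tones symmetrically around the centre frequency.
            tones[s] = f_centre_MHz + (double(s) - double(sites - 1) / 2.0) * spacing_MHz;
        }

        // Time-sharing: the deflector hops through all tones each cycle, so every
        // cloud sees its tweezer for 1/sites of the time.
        std::cout << "addressing " << sites << " sites, tones "
                  << tones.front() << " MHz .. " << tones.back() << " MHz\n";
    }
    ```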

    A Novel Frequency Based Current-to-Digital Converter with Programmable Dynamic Range

    This work describes a novel frequency-based current-to-digital converter, which would be fully realizable on a single chip. Biological systems make use of delay line techniques to compute many things critical to the life of an animal. Seeking to build up such a system, we are adapting the auditory localization circuit found in barn owls to detect and compute the magnitude of an input current. The increasing drive to produce ultra-low-power circuits necessitates the use of very small currents. Frequently these currents need to be accurately measured, but existing solutions typically involve off-chip measurements, which are usually slow; moving a current off chip also adds noise to the system. Moving a system such as this completely on chip will allow for precise measurement and control of bias currents, and it will allow for better compensation of some common transistor mismatch issues. This project affords an extremely low power (hundreds of nW) converter technology that is also very space efficient. The converter is completely asynchronous, which yields ultra-low-power standby operation [1].
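
    A minimal sketch of the general frequency-based conversion principle, assuming an idealized integrate-and-fire relation f = I / (C·V_th); the component values are assumptions, and this is not the circuit described above.

    ```cpp
    #include <cstdint>
    #include <iostream>

    int main() {
        const double I_in   = 5e-9;    // input current, 5 nA (assumed)
        const double C      = 100e-15; // integration capacitance, 100 fF (assumed)
        const double V_th   = 0.5;     // firing threshold in volts (assumed)
        const double T_gate = 1e-3;    // counting window, 1 ms (assumed)

        // Idealized current-to-frequency relation: the input current charges the
        // capacitor to the threshold, the cell fires and resets, and the firing
        // rate is f = I / (C * V_th).
        const double f_osc = I_in / (C * V_th);

        // The digital output is the number of firings counted in the gate time.
        const std::uint64_t code = static_cast<std::uint64_t>(f_osc * T_gate);
        std::cout << "oscillation frequency ~ " << f_osc << " Hz, output code "
                  << code << "\n";
    }
    ```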

    Parallelization of dynamic programming recurrences in computational biology

    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15x and 130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 Intel Core 2 Duo processors running at 3 GHz. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
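
    For reference, the serial form of the Nussinov recurrence that these accelerators parallelize is sketched below (the textbook O(n³) CPU version, not the FPGA array design); cells on the same anti-diagonal are mutually independent, which is the fine-grained parallelism the polyhedral designs exploit.

    ```cpp
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // 1 if the two bases can pair (Watson-Crick or GU wobble), else 0.
    int pairs(char a, char b) {
        std::string p{a, b};
        return (p == "AU" || p == "UA" || p == "CG" || p == "GC" ||
                p == "GU" || p == "UG") ? 1 : 0;
    }

    // Nussinov recurrence:
    // N[i][j] = max( N[i+1][j-1] + pairs(s[i], s[j]),
    //                max over i <= k < j of N[i][k] + N[k+1][j] )
    int nussinov(const std::string& s) {
        const int n = static_cast<int>(s.size());
        std::vector<std::vector<int>> N(n, std::vector<int>(n, 0));
        // Fill by increasing subsequence length; all cells of one length
        // (one anti-diagonal) are independent of each other.
        for (int len = 2; len <= n; ++len) {
            for (int i = 0; i + len - 1 < n; ++i) {
                int j = i + len - 1;
                int best = N[i + 1][j - 1] + pairs(s[i], s[j]);
                for (int k = i; k < j; ++k)
                    best = std::max(best, N[i][k] + N[k + 1][j]);
                N[i][j] = best;
            }
        }
        return N[0][n - 1];
    }

    int main() {
        std::cout << "max base pairs: " << nussinov("GGGAAAUCC") << "\n";
    }
    ```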

    The UTMOST: A hybrid digital signal processor transforms the MOST

    The Molonglo Observatory Synthesis Telescope (MOST) is an 18,000 square meter radio telescope situated some 40 km from the city of Canberra, Australia. Its operating band (820-850 MHz) is now partly allocated to mobile phone communications, making radio astronomy challenging. We describe how the deployment of new digital receivers (RX boxes), Field Programmable Gate Array (FPGA) based filterbanks and server-class computers equipped with 43 GPUs (Graphics Processing Units) has transformed MOST into a versatile new instrument (the UTMOST) for studying the dynamic radio sky on millisecond timescales, ideal for work on pulsars and Fast Radio Bursts (FRBs). The filterbanks, servers and their high-speed, low-latency network form part of a hybrid solution to the observatory's signal processing requirements. The emphasis on software and commodity off-the-shelf hardware has enabled rapid deployment through the re-use of proven 'software backends' for its signal processing. The new receivers have ten times the bandwidth of the original MOST and double the sampling of the line feed, which doubles the field of view. The UTMOST can simultaneously excise interference, make maps, coherently dedisperse pulsars, and perform real-time searches of coherent fan beams for dispersed single pulses. Although system performance is still sub-optimal, a pulsar timing and FRB search programme has commenced and the first UTMOST maps have been made. The telescope operates as a robotic facility, deciding how to efficiently target pulsars and how long to stay on source, via feedback from real-time pulsar folding. The regular timing of over 300 pulsars has resulted in the discovery of 7 pulsar glitches and 3 FRBs. The UTMOST demonstrates that if sufficient signal processing can be applied to the voltage streams it is possible to perform innovative radio science in hostile radio frequency environments. Comment: 12 pages, 6 figures
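
    As a small illustration of the dedispersion step such pulsar and FRB searches rely on, the sketch below computes per-channel delays across the 820-850 MHz band from the standard cold-plasma dispersion law; the channel count and trial dispersion measure are assumptions, and this is not the UTMOST pipeline.

    ```cpp
    #include <iostream>
    #include <vector>

    int main() {
        const double k_dm  = 4.148808e3;   // dispersion constant, s MHz^2 pc^-1 cm^3
        const double dm    = 50.0;         // trial dispersion measure (assumed)
        const double f_lo  = 820.0;        // MHz, bottom of the band in the abstract
        const double f_hi  = 850.0;        // MHz, top of the band
        const int    nchan = 16;           // assumed number of filterbank channels

        std::vector<double> delay_ms(nchan);
        for (int c = 0; c < nchan; ++c) {
            // Centre frequency of channel c, counting down from the top of the band.
            double f = f_hi - (c + 0.5) * (f_hi - f_lo) / nchan;
            // Delay of this channel relative to the highest frequency:
            // dt = k_dm * DM * (f^-2 - f_hi^-2), converted to milliseconds.
            delay_ms[c] = 1e3 * k_dm * dm * (1.0 / (f * f) - 1.0 / (f_hi * f_hi));
        }
        std::cout << "delay across band at DM=" << dm << ": "
                  << delay_ms.back() << " ms\n";
    }
    ```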