Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
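As a concrete illustration of the kind of transformation catalogued above (a sketch of one standard HLS technique, not code from the paper): a naive floating-point reduction carries a loop-carried dependency that blocks pipelining, and interleaving the accumulation removes it. The function name and the constant K are illustrative choices.

```cpp
#include <array>
#include <cstddef>

// Sketch of a classic HLS transformation (illustrative, not from the paper):
// a naive reduction updates `sum` every iteration, so a multi-cycle adder
// prevents pipelining the loop at an initiation interval of 1. Interleaving
// the accumulation across K independent partial sums widens the dependency
// distance to K, letting the tool pipeline the loop; the partials are
// combined in a short epilogue. (Tool directives such as
// `#pragma HLS PIPELINE II=1` are shown as comments.)
constexpr std::size_t K = 4;  // chosen to cover the adder pipeline depth

float reduce_interleaved(const float* data, std::size_t n) {
    std::array<float, K> partial{};      // K independent accumulators
    for (std::size_t i = 0; i < n; ++i) {
        // #pragma HLS PIPELINE II=1
        partial[i % K] += data[i];       // dependency distance is now K
    }
    float sum = 0.0f;
    for (float p : partial) sum += p;    // short combine loop off the hot path
    return sum;
}
```

In real HLS code the `i % K` index would typically be a rotating counter, since a hardware modulo is wasteful; the structure of the transformation is the same.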
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
Research has shown that convolutional neural networks contain significant
redundancy, and high classification accuracy can be obtained even when weights
and activations are reduced from floating point to binary values. In this
paper, we present FINN, a framework for building fast and flexible FPGA
accelerators using a flexible heterogeneous streaming architecture. By
utilizing a novel set of optimizations that enable efficient mapping of
binarized neural networks to hardware, we implement fully connected,
convolutional and pooling layers, with per-layer compute resources being
tailored to user-provided throughput requirements. On a ZC706 embedded FPGA
platform drawing less than 25 W total system power, we demonstrate up to 12.3
million image classifications per second with 0.31 µs latency on the MNIST
dataset with 95.8% accuracy, and 21,906 image classifications per second with
283 µs latency on the CIFAR-10 and SVHN datasets with 80.1% and 94.9%
accuracy, respectively. To the best of our knowledge, ours are the fastest
classification rates reported to date on these benchmarks.
Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 201
Steerable optical tweezers for ultracold atom studies
We report on the implementation of an optical tweezer system for controlled
transport of ultracold atoms along a narrow, static confinement channel. The
tweezer system is based on high-efficiency acousto-optical deflectors and
offers two-dimensional control over beam position. This opens up the
possibility for tracking the transport channel when shuttling atomic clouds
along the guide, forestalling atom spilling. Multiple clouds can be tracked
independently by time-shared tweezer beams addressing individual sites in the
channel. The deflectors are controlled using a multichannel direct digital
synthesizer, which receives instructions on a sub-microsecond time scale from a
field-programmable gate array. Using the tweezer system, we demonstrate
sequential binary splitting of an ultracold cloud into multiple clouds.
Comment: 4 pages, 5 figures, 1 movie link
A Novel Frequency Based Current-to-Digital Converter with Programmable Dynamic Range
This work describes a novel frequency-based current-to-digital converter, which would be fully realizable on a single chip.
Biological systems make use of delay line techniques to compute many things critical to the life of an animal. Seeking to build up such a system, we are adapting the auditory localization circuit found in barn owls to detect and compute the magnitude of an input current.
The increasing drive to produce ultra-low-power circuits necessitates the use of very small currents. Frequently these currents need to be accurately measured, but current solutions typically involve off-chip measurements, which are usually slow; moreover, moving a current off chip adds noise to the system. Moving a system such as this completely on chip will allow for precise measurement and control of bias currents, and it will allow for better compensation of some common transistor mismatch issues.
This project affords an extremely low-power (hundreds of nW) converter technology that is also very space efficient. The converter is completely asynchronous, which yields ultra-low-power standby operation [1].
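To make the frequency-based conversion principle concrete, here is a generic behavioral model (an illustration of the general idea only; the delay-line circuit in this work differs). All names and component values below are hypothetical.

```cpp
#include <cmath>

// Generic behavioral model of frequency-based current digitization
// (illustrative; not the delay-line circuit described in the abstract):
// the input current charges a capacitor until a threshold Vth is reached,
// the cell fires and resets, so the firing frequency is f = I / (C * Vth).
// Counting firings over a fixed gate time yields a digital code that is
// proportional to the input current.
double firing_frequency_hz(double i_amps, double c_farads, double vth_volts) {
    return i_amps / (c_farads * vth_volts);  // charge per cycle: Q = C * Vth
}

long digital_code(double i_amps, double c_farads, double vth_volts,
                  double gate_time_s) {
    double f = firing_frequency_hz(i_amps, c_farads, vth_volts);
    return std::lround(f * gate_time_s);     // counts in one gate window
}
```

With 100 nA into 1 pF and a 1 V threshold, the cell fires at 100 kHz and a 10 ms gate yields a code of 1000; adjusting the threshold or the gate time is one generic way a dynamic range can be made programmable.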
Parallelization of dynamic programming recurrences in computational biology
The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15x and 130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 Intel Core 2 Duo processors running at 3 GHz. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors.
Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
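The dependency structure that the polyhedral analysis above exploits can be seen in a minimal software version of the Nussinov recurrence (a simplified sketch, not the paper's accelerator): every cell with span length d = j - i depends only on cells with shorter spans, so all cells on one anti-diagonal are independent and can be computed in parallel, one processing element per cell in hardware.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Minimal Nussinov RNA-folding kernel (simplified sketch, not the paper's
// design). N[i][j] holds the maximum number of base pairs in s[i..j].
// The outer loop over span length d is inherently sequential, but every
// cell in the inner loop (one anti-diagonal) is independent: this is the
// fine-grained parallelism a systolic FPGA array can exploit.
static bool pairs(char a, char b) {
    return (a == 'A' && b == 'U') || (a == 'U' && b == 'A') ||
           (a == 'G' && b == 'C') || (a == 'C' && b == 'G');
}

int nussinov(const std::string& s) {
    int n = static_cast<int>(s.size());
    std::vector<std::vector<int>> N(n, std::vector<int>(n, 0));
    for (int d = 1; d < n; ++d) {          // span length: sequential
        for (int i = 0; i + d < n; ++i) {  // cells on a diagonal: parallel
            int j = i + d;
            int best = N[i + 1][j - 1] + (pairs(s[i], s[j]) ? 1 : 0);
            best = std::max(best, N[i + 1][j]);  // leave s[i] unpaired
            best = std::max(best, N[i][j - 1]);  // leave s[j] unpaired
            for (int k = i + 1; k < j; ++k)      // bifurcation split
                best = std::max(best, N[i][k] + N[k + 1][j]);
            N[i][j] = best;
        }
    }
    return n > 0 ? N[0][n - 1] : 0;  // max base pairs over whole sequence
}
```

The throughput-oriented designs in the abstract go further than this wavefront view, pipelining independent problem instances through the array, but the recurrence itself is the one shown here.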
The UTMOST: A hybrid digital signal processor transforms the MOST
The Molonglo Observatory Synthesis Telescope (MOST) is an 18,000 square meter
radio telescope situated some 40 km from the city of Canberra, Australia. Its
operating band (820-850 MHz) is now partly allocated to mobile phone
communications, making radio astronomy challenging. We describe how the
deployment of new digital receivers (RX boxes), Field Programmable Gate Array
(FPGA) based filterbanks and server-class computers equipped with 43 GPUs
(Graphics Processing Units) has transformed MOST into a versatile new
instrument (the UTMOST) for studying the dynamic radio sky on millisecond
timescales, ideal for work on pulsars and Fast Radio Bursts (FRBs). The
filterbanks, servers and their high-speed, low-latency network form part of a
hybrid solution to the observatory's signal processing requirements. The
emphasis on software and commodity off-the-shelf hardware has enabled rapid
deployment through the re-use of proven 'software backends' for its signal
processing. The new receivers have ten times the bandwidth of the original MOST
and double the sampling of the line feed, which doubles the field of view. The
UTMOST can simultaneously excise interference, make maps, coherently dedisperse
pulsars, and perform real-time searches of coherent fan beams for dispersed
single pulses. Although system performance is still sub-optimal, a pulsar
timing and FRB search programme has commenced and the first UTMOST maps have
been made. The telescope operates as a robotic facility, deciding how to
efficiently target pulsars and how long to stay on source, via feedback from
real-time pulsar folding. The regular timing of over 300 pulsars has resulted
in the discovery of 7 pulsar glitches and 3 FRBs. The UTMOST demonstrates that
if sufficient signal processing can be applied to the voltage streams it is
possible to perform innovative radio science in hostile radio frequency
environments.
Comment: 12 pages, 6 figures
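The dedispersion workload mentioned above is driven by a standard radio-astronomy relation, sketched here for scale (a textbook formula, not code from the UTMOST pipeline; the helper name is illustrative):

```cpp
// Illustrative calculation (a standard formula, not UTMOST pipeline code):
// interstellar dispersion delays a pulse at frequency f relative to a
// reference frequency f_ref by approximately
//   dt [ms] = 4.149 * DM * (f^-2 - f_ref^-2),  f in GHz, DM in pc cm^-3.
// Dedispersion (coherent or incoherent) removes this delay before folding
// pulsars or searching the fan beams for dispersed single pulses.
double dispersion_delay_ms(double dm, double f_ghz, double f_ref_ghz) {
    const double k_dm = 4.149;  // dispersion constant, ms GHz^2 cm^3 pc^-1
    return k_dm * dm * (1.0 / (f_ghz * f_ghz) - 1.0 / (f_ref_ghz * f_ref_ghz));
}
```

Across the 820-850 MHz band, a dispersion measure of 500 pc cm^-3 smears a pulse by roughly 0.2 s, which is why the FPGA filterbanks and GPU servers must remove the sweep in real time to see millisecond-scale events.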