LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations
We propose two tiers of modifications to FPGA logic cell architecture to
deliver a variety of performance and utilization benefits with only minor area
overheads. In the first tier, we augment existing commercial logic cell
datapaths with a 6-input XOR gate in order to improve the expressiveness of
each element, while maintaining backward compatibility. This new architecture
is vendor-agnostic, and we refer to it as LUXOR. We also consider a secondary
tier of vendor-specific modifications to both Xilinx and Intel FPGAs, which we
refer to as X-LUXOR+ and I-LUXOR+ respectively. We demonstrate that compressor
tree synthesis using generalized parallel counters (GPCs) is further improved
with the proposed modifications. Using both the Intel adaptive logic module and
the Xilinx slice at the 65nm technology node for a comparative study, it is
shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for
LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We
demonstrate that LUXOR can deliver an average reduction of 13-19% in logic
utilization on micro-benchmarks from a variety of domains. BNN benchmarks
benefit the most with an average reduction of 37-47% in logic utilization,
which is due to the highly-efficient mapping of the XnorPopcount operation on
our proposed LUXOR+ logic cells.
Comment: In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20), February 23-25, 2020, Seaside, CA, US.
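To make the GPC-based compressor tree synthesis mentioned above concrete, the following Python sketch shows, at a purely behavioral level, what a (6;3) generalized parallel counter computes and how a column of bits can be reduced with such counters. The (6;3) shape and the compression threshold are illustrative assumptions; they are not the exact counters or mapping used by LUXOR.

```python
# Behavioral sketch of a generalized parallel counter (GPC) and a small
# compressor tree, in the spirit of the GPC-based synthesis the abstract
# describes. The (6;3) counter shape and the >3-bit threshold are assumed
# for illustration only.

def gpc_6_3(bits):
    """Compress up to 6 same-weight bits into a 3-bit binary count (LSB first)."""
    assert len(bits) <= 6
    s = sum(bits)
    return [(s >> i) & 1 for i in range(3)]

def compressor_tree_sum(bits):
    """Sum many weight-1 bits by repeatedly applying (6;3) GPCs per column."""
    columns = {0: list(bits)}          # weight -> pending bits of that weight
    total, weight, max_weight = 0, 0, 0
    while weight <= max_weight:
        col = columns.get(weight, [])
        while len(col) > 3:            # compress until the column is small enough
            chunk, col = col[:6], col[6:]
            c0, c1, c2 = gpc_6_3(chunk)
            col.append(c0)                                    # weight 2^w
            columns.setdefault(weight + 1, []).append(c1)     # weight 2^(w+1)
            columns.setdefault(weight + 2, []).append(c2)     # weight 2^(w+2)
            max_weight = max(max_weight, weight + 2)
        total += sum(col) << weight    # small leftover columns added directly
        weight += 1
    return total

assert compressor_tree_sum([1] * 13) == 13   # popcount of thirteen 1-bits
```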
ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks
The deployment of AI models on low-power, real-time edge devices requires
accelerators for which energy, latency, and area are all first-order concerns.
There are many approaches to enabling deep neural networks (DNNs) in this
domain, including pruning, quantization, compression, and binary neural
networks (BNNs), but with the emergence of the "extreme edge", there is now a
demand for even more efficient models. In order to meet the constraints of
ultra-low-energy devices, we propose ULEEN, a model architecture based on
weightless neural networks. Weightless neural networks (WNNs) are a class of
neural model which use table lookups, not arithmetic, to perform computation.
The elimination of energy-intensive arithmetic operations makes WNNs
theoretically well suited for edge inference; however, they have historically
suffered from poor accuracy and excessive memory usage. ULEEN incorporates
algorithmic improvements and a novel training strategy inspired by BNNs to make
significant strides in improving accuracy and reducing model size. We compare
FPGA and ASIC implementations of an inference accelerator for ULEEN against
edge-optimized DNN and BNN devices. On a Xilinx Zynq Z-7045 FPGA, we
demonstrate classification on the MNIST dataset at 14.3 million inferences per
second (13 million inferences/Joule) with 0.21 μs latency and 96.2%
accuracy, while Xilinx FINN achieves 12.3 million inferences per second (1.69
million inferences/Joule) with 0.31 μs latency and 95.83% accuracy. In a
45nm ASIC, we achieve 5.1 million inferences/Joule and 38.5 million
inferences/second at 98.46% accuracy, while a quantized Bit Fusion model
achieves 9230 inferences/Joule and 19,100 inferences/second at 99.35% accuracy.
In our search for ever more efficient edge devices, ULEEN shows that WNNs are
deserving of consideration.
Comment: 14 pages, 14 figures. Portions of this article draw heavily from arXiv:2203.01479, most notably Sections 5E and 5F.
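As a rough illustration of the table-lookup computation that weightless neural networks perform, the sketch below follows the classic WiSARD scheme: input bits are grouped into small tuples that address per-class lookup tables, and a class score is simply the number of table hits. The tuple size, random bit mapping, and set-based tables are assumptions for illustration; ULEEN's BNN-inspired training and model-size reductions are not modeled here.

```python
# WiSARD-style sketch of lookup-table inference, as weightless neural
# networks use. ULEEN's specific refinements are not modeled; tuple size
# and layout are illustrative assumptions.

import random

TUPLE_SIZE = 4              # bits per lookup-table address (assumed)
NUM_INPUT_BITS = 28 * 28    # e.g., a binarized MNIST image

random.seed(0)
mapping = random.sample(range(NUM_INPUT_BITS), NUM_INPUT_BITS)  # fixed random bit order

def addresses(bits):
    """Group the (shuffled) input bits into lookup-table addresses."""
    out = []
    for i in range(0, NUM_INPUT_BITS, TUPLE_SIZE):
        addr = 0
        for j in range(TUPLE_SIZE):
            addr = (addr << 1) | bits[mapping[i + j]]
        out.append(addr)
    return out

class Discriminator:
    """One class model: a set of 1-bit lookup tables, no arithmetic weights."""
    def __init__(self):
        self.tables = [set() for _ in range(NUM_INPUT_BITS // TUPLE_SIZE)]

    def train(self, bits):
        for table, addr in zip(self.tables, addresses(bits)):
            table.add(addr)        # mark the seen pattern

    def score(self, bits):
        return sum(addr in table for table, addr in zip(self.tables, addresses(bits)))

# One discriminator per class; prediction is the argmax of the scores.
demo = Discriminator()
example = [random.randint(0, 1) for _ in range(NUM_INPUT_BITS)]
demo.train(example)
assert demo.score(example) == NUM_INPUT_BITS // TUPLE_SIZE  # perfect match on the training pattern
```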
FPGA acceleration of a quantized neural network for remote-sensed cloud detection
The capture and transmission of remote-sensed imagery for Earth observation is both computationally and bandwidth expensive. In the analysis of remote-sensed imagery in the visual band, atmospheric cloud cover can obstruct up to two-thirds of observations, resulting in costly imagery being discarded. Mission objectives and satellite operational details vary; however, assuming a cloud-free observation requirement, a doubling of useful data downlinked with an associated halving of delivery cost is possible through effective cloud detection. A minimal-resource, real-time inference neural network is ideally suited to perform automatic cloud detection, both for pre-processing captured images prior to transmission and preventing unnecessary images being taken by larger payload sensors. Much of the hardware complexity of modern neural network implementations resides in high-precision floating-point calculation pipelines. In recent years, research has been conducted in identifying quantized, or low-integer-precision, equivalents to known deep learning models, which do not require the extensive resources of their floating-point, full-precision counterparts. Our work leverages existing research on binary and quantized neural networks to develop a real-time, remote-sensed cloud detection solution using a commodity field-programmable gate array. This follows on developments of the Forwards Looking Imager for predictive cloud detection developed by Craft Prospect, a space engineering practice based in Glasgow, UK. The synthesized cloud detection accelerator achieved an inference throughput of 358.1 images per second with a maximum power consumption of 2.4 W. This throughput is an order of magnitude faster than alternate algorithmic options for the Forwards Looking Imager at around a one-third reduction in classification accuracy, and approximately two orders of magnitude faster than the CloudScout deep neural network, deployed with HyperScout 2 on the European Space Agency PhiSat-1 mission. Strategies for incorporating fault tolerance mechanisms are expounded.
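The following minimal NumPy sketch illustrates the kind of low-integer quantization the abstract alludes to: float weights are mapped to signed n-bit integers with a shared scale, trading a small reconstruction error for far cheaper arithmetic and storage. The 4-bit symmetric scheme is an assumption for illustration, not the scheme used in this work.

```python
# Hedged illustration of low-integer quantization: symmetric, per-tensor,
# with an assumed 4-bit width. Not the paper's exact scheme.

import numpy as np

def quantize(weights, bits=4):
    """Symmetric uniform quantization of a float tensor to signed n-bit ints."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.randn(8, 8).astype(np.float32)
q, scale = quantize(w, bits=4)
reconstruction_error = np.abs(w - q * scale).max()   # error introduced by quantization
print(q.dtype, float(reconstruction_error))
```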
Mercury: An Automated Remote Side-channel Attack to Nvidia Deep Learning Accelerator
DNN accelerators have been widely deployed in many scenarios to speed up the
inference process and reduce the energy consumption. One big concern about the
usage of the accelerators is the confidentiality of the deployed models: model
inference execution on the accelerators could leak side-channel information,
which enables an adversary to precisely recover the model details. Such model
extraction attacks can not only compromise the intellectual property of DNN
models, but also facilitate some adversarial attacks.
Although previous works have demonstrated a number of side-channel techniques
to extract models from DNN accelerators, they are not practical for two
reasons. (1) They only target simplified accelerator implementations, which
have limited practicality in the real world. (2) They require heavy human
analysis and domain knowledge. To overcome these limitations, this paper
presents Mercury, the first automated remote side-channel attack against the
off-the-shelf Nvidia DNN accelerator. The key insight of Mercury is to model
the side-channel extraction process as a sequence-to-sequence problem. The
adversary can leverage a time-to-digital converter (TDC) to remotely collect
the power trace of the target model's inference. Then he uses a learning model
to automatically recover the architecture details of the victim model from the
power trace without any prior knowledge. The adversary can further use the
attention mechanism to localize the leakage points that contribute most to the
attack. Evaluation results indicate that Mercury can keep the error rate of
model extraction below 1%.
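A minimal sketch of the sequence-to-sequence framing described above is given below: an encoder consumes power-trace samples, a decoder emits architecture tokens, and dot-product attention scores indicate which trace positions contribute most to each prediction. The model sizes, token vocabulary, and use of PyTorch LSTMs are assumptions; this is a generic encoder-decoder skeleton, not Mercury's actual model.

```python
# Generic encoder-decoder with dot-product attention mapping a power trace
# to architecture tokens. All sizes and the vocabulary are assumptions.

import torch
import torch.nn as nn

VOCAB = ["<sos>", "<eos>", "conv", "relu", "pool", "fc"]  # assumed token set

class TraceToArch(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.embed = nn.Embedding(len(VOCAB), hidden)
        self.decoder = nn.LSTM(input_size=hidden, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, len(VOCAB))

    def forward(self, trace, tokens):
        # trace: (batch, T, 1) power samples; tokens: (batch, L) target prefix
        enc_out, state = self.encoder(trace)
        dec_out, _ = self.decoder(self.embed(tokens), state)
        # Dot-product attention over trace positions for each output step.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))   # (batch, L, T)
        attn = torch.softmax(scores, dim=-1)
        context = torch.bmm(attn, enc_out)                     # (batch, L, hidden)
        logits = self.out(torch.cat([dec_out, context], dim=-1))
        return logits, attn   # attention peaks hint at the most leaky trace samples

model = TraceToArch()
trace = torch.randn(2, 500, 1)                 # two captured traces, 500 samples each
tokens = torch.randint(0, len(VOCAB), (2, 7))  # teacher-forced target prefixes
logits, attn = model(trace, tokens)
print(logits.shape, attn.shape)                # (2, 7, 6) and (2, 7, 500)
```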
Binary Neural Networks in FPGAs: Architectures, Tool Flows and Hardware Comparisons.
Binary neural networks (BNNs) are variations of artificial/deep neural network (ANN/DNN) architectures that constrain the real values of weights to the binary set {-1,1}. By using binary values, BNNs can convert matrix multiplications into bitwise operations, which accelerates both training and inference and reduces hardware complexity and model sizes for implementation. Compared to traditional deep learning architectures, BNNs are a good choice for implementation in resource-constrained devices like FPGAs and ASICs. However, BNNs suffer reduced performance and accuracy as a trade-off of binarization. Over the years, this has motivated the research community to close the performance gap of BNNs, and several architectures have been proposed. In this paper, we provide a comprehensive review of BNNs for implementation in FPGA hardware. The survey covers different aspects, such as BNN architectures and variants, design and tool flows for FPGAs, and various applications for BNNs. The final part of the paper presents benchmark works and design tools for implementing BNNs in FPGAs, based on established datasets used by the research community.
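The bitwise replacement of matrix multiplication that the survey refers to rests on the XNOR-popcount identity: for {-1,+1} vectors, the dot product equals twice the number of sign agreements minus the vector length. The sketch below demonstrates this identity in Python; the packing scheme and vector length are illustrative assumptions.

```python
# Minimal sketch of the XNOR-popcount trick: a dot product of {-1,+1}
# vectors becomes bitwise XNOR plus a population count. Names and the
# vector length are illustrative assumptions.

N = 8  # vector length (assumed)

def pack(signs):
    """Encode +1 as bit 1 and -1 as bit 0 in an N-bit word."""
    word = 0
    for i, s in enumerate(signs):
        if s == +1:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word):
    """Dot product of two {-1,+1} vectors from their packed encodings."""
    xnor = ~(a_word ^ b_word) & ((1 << N) - 1)   # 1 where signs agree
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - N                       # agreements minus disagreements

a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, -1, +1, -1, +1, +1, -1]
assert binary_dot(pack(a), pack(b)) == sum(x * y for x, y in zip(a, b))
```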
Stochastic Configuration Machines: FPGA Implementation
Neural networks for industrial applications generally have additional
constraints such as response speed, memory size and power usage. Randomized
learners can address some of these issues. However, hardware solutions can
provide better resource reduction whilst maintaining the model's performance.
Stochastic configuration networks (SCNs) are a prime choice in industrial
applications due to their merits and feasibility for data modelling. Stochastic
Configuration Machines (SCMs) extend this to focus on reducing the memory
constraints by limiting the randomized weights to a binary value with a scalar
for each node and using a mechanism model to improve the learning performance
and result interpretability. This paper aims to implement SCM models on a field
programmable gate array (FPGA) and introduce binary-coded inputs to the
algorithm. Results are reported for two benchmark and two industrial datasets,
including SCM with single-layer and deep architectures.
Comment: 19 pages, 9 figures, 8 tables.
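As a loose illustration of the binary-weight-plus-scalar idea described in the abstract, the sketch below forms a hidden layer whose weights are drawn from {-1,+1} with a single scale factor per node, so the dot products reduce to additions and subtractions. The layer width, scale range, and sigmoid activation are assumptions; SCM's supervisory mechanism and incremental training are not modeled.

```python
# Loose sketch only: random {-1,+1} hidden weights with one scalar per node.
# Everything here is an illustrative assumption, not the SCM algorithm.

import numpy as np

rng = np.random.default_rng(0)

def scm_like_hidden_layer(x, num_nodes=16, scale_range=(0.5, 2.0)):
    """Random binary weights plus one scalar per node, sigmoid activation."""
    dim = x.shape[-1]
    w_binary = rng.choice([-1.0, 1.0], size=(num_nodes, dim))   # {-1,+1} weights
    scales = rng.uniform(*scale_range, size=num_nodes)          # one scalar per node
    pre = scales * (x @ w_binary.T)      # additions/subtractions, then one scale
    return 1.0 / (1.0 + np.exp(-pre))    # sigmoid activation

x = rng.standard_normal((4, 10))         # a small batch of inputs
h = scm_like_hidden_layer(x)
print(h.shape)                           # (4, 16)
```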
Rethinking FPGA Architectures for Deep Neural Network applications
The prominence of machine learning-powered solutions has instituted an unprecedented trend of integration into virtually all applications, spanning a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the advantages of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where latency and energy consumption are strictly limited, their architectures, optimised before the machine learning era, remain a barrier to overall efficiency and performance.
Recognizing this shortcoming, this thesis presents an architectural study aimed at unlocking hidden potential in FPGA technology, primarily for machine learning algorithms. In particular, it shows how slight alterations to state-of-the-art architectures can make FPGAs significantly more machine learning-friendly while maintaining near-promised performance for the rest of the applications. Finally, it presents a novel systematic approach to deriving new block architectures, guided by design limitations and machine learning algorithm characteristics, through benchmarking.
First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to the FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Finally, with the goal of exploring this new design space in a methodical manner, a problem formulation involving nested loops of multiply-accumulate (MAC) operations is proposed, followed by a quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks, together with a family of new embedded blocks called MLBlocks.
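For reference, the nested multiply-accumulate loop formulation mentioned above can be written as the familiar triple loop below, whose innermost iteration performs one MAC; it is this kind of loop nest that the proposed methodology maps onto candidate coarse-grained blocks. The matrix-multiply shape and loop ordering are illustrative assumptions.

```python
# Plain sketch of the nested MAC-loop formulation: C[m][n] = sum_k A[m][k] * B[k][n].
# Loop bounds and ordering are illustrative assumptions.

def mac_nested_loops(A, B, M, N, K):
    C = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0
            for k in range(K):           # innermost loop: one MAC per iteration
                acc += A[m][k] * B[k][n]
            C[m][n] = acc
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(mac_nested_loops(A, B, 2, 2, 2))   # [[19, 22], [43, 50]]
```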
Rapid SoC Design: On Architectures, Methodologies and Frameworks
Modern applications like machine learning, autonomous vehicles, and 5G networking require an order of magnitude boost in processing capability. For several decades, chip designers have relied on Moore’s Law - the doubling of transistor count every two years to deliver improved performance, higher energy efficiency, and an increase in transistor density. With the end of Dennard’s scaling and a slowdown in Moore’s Law, system architects have developed several techniques to deliver on the traditional performance and power improvements we have come to expect. More recently, chip designers have turned towards heterogeneous systems comprised of more specialized processing units to buttress the traditional processing units. These specialized units improve the overall performance, power, and area (PPA) metrics across a wide variety of workloads and applications. While the GPU serves as a classical example, accelerators for machine learning, approximate computing, graph processing, and database applications have become commonplace. This has led to an exponential growth in the variety (and count) of these compute units found in modern embedded and high-performance computing platforms.
The various techniques adopted to combat the slowing of Moore’s Law directly translates to an increase in complexity for modern system-on-chips (SoCs). This increase in complexity in turn leads to an increase in design effort and validation time for hardware and the accompanying software stacks. This is further aggravated by fabrication challenges (photo-lithography, tooling, and yield) faced at advanced technology nodes (below 28nm). The inherent complexity in modern SoCs translates into increased costs and time-to-market delays. This holds true across the spectrum, from mobile/handheld processors to high-performance data-center appliances.
This dissertation presents several techniques to address the challenges of rapidly realizing complex SoCs. The first part of this dissertation focuses on foundations and architectures that aid in rapid SoC design. It presents a variety of architectural techniques that were developed and leveraged to rapidly construct complex SoCs at advanced process nodes. The next part of the dissertation focuses on the gap between a completed design model (in RTL form) and its physical manifestation (a GDS file that will be sent to the foundry for fabrication). It presents methodologies and a workflow for rapidly walking a design through to completion at arbitrary technology nodes. It also presents progress on creating tools and a flow that is entirely dependent on open-source tools. The last part presents a framework that not only speeds up the integration of a hardware accelerator into an SoC ecosystem, but also emphasizes software adoption and usability.
PhD dissertation, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/168119/1/ajayi_1.pd