CodeX: Bit-Flexible Encoding for Streaming-based FPGA Acceleration of DNNs
This paper proposes CodeX, an end-to-end framework that facilitates encoding,
bitwidth customization, fine-tuning, and implementation of neural networks on
FPGA platforms. CodeX incorporates nonlinear encoding into the computation flow
of neural networks to save memory. The encoded features demand significantly
lower storage compared to the raw full-precision activation values; therefore,
the execution flow of the CodeX hardware engine is performed entirely within the
FPGA using on-chip streaming buffers with no access to the off-chip DRAM. We
further propose a fully-automated algorithm inspired by reinforcement learning
which determines the customized encoding bitwidth across network layers. The
CodeX full-stack framework comprises a compiler that takes a high-level Python
description of an arbitrary neural network architecture. The compiler then
instantiates the corresponding elements from CodeX Hardware library for FPGA
implementation. Proof-of-concept evaluations on MNIST, SVHN, and CIFAR-10
datasets demonstrate an average of 4.65x throughput improvement compared to
stand-alone weight encoding. We further compare CodeX with six existing
full-precision DNN accelerators on ImageNet, showing an average of 3.6x and
2.54x improvement in throughput and performance-per-watt, respectively.
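The abstract does not spell out the encoder itself. As a rough, hypothetical illustration of nonlinear activation encoding, the sketch below learns a small k-means codebook so that each activation is stored as a low-bitwidth index instead of a 32-bit float; the codebook approach and all names here are assumptions, not CodeX's actual algorithm.

```python
# Hypothetical stand-in for "nonlinear encoding" of activations: a 1-D k-means
# codebook with nonuniformly spaced centroids, so low-bitwidth indices (not
# floats) are what the on-chip streaming buffers would need to hold.
import numpy as np

def build_codebook(activations, bitwidth, iters=20):
    k = 2 ** bitwidth
    centroids = np.quantile(activations, np.linspace(0, 1, k))  # nonuniform init
    for _ in range(iters):
        codes = np.abs(activations[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = activations[codes == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids

def encode(activations, centroids):
    return np.abs(activations[:, None] - centroids[None, :]).argmin(axis=1)

acts = np.random.randn(4096).astype(np.float32)
cb = build_codebook(acts, bitwidth=3)        # 8 entries: 3 bits/value vs. 32
codes = encode(acts, cb)
print("reconstruction MSE:", np.mean((cb[codes] - acts) ** 2))
```

A per-layer bitwidth (3 bits here) is exactly the knob the reinforcement-learning-inspired search described in the abstract would tune.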
BENCHIP: Benchmarking Intelligence Processors
The increasing attention on deep learning has tremendously spurred the design
of intelligence processing hardware. The variety of emerging intelligence
processors requires standard benchmarks for fair comparison and system
optimization (in both software and hardware). However, existing benchmarks are
unsuitable for benchmarking intelligence processors because they lack diversity
and representativeness. Moreover, the lack of a standard benchmarking
methodology further exacerbates this problem. In this paper, we propose
BENCHIP, a benchmark suite and benchmarking methodology for intelligence
processors. The benchmark suite in BENCHIP consists of two sets of benchmarks:
microbenchmarks and macrobenchmarks. The microbenchmarks consist of
single-layer networks. They are mainly designed for bottleneck analysis and
system optimization. The macrobenchmarks contain state-of-the-art industrial
networks, so as to offer a realistic comparison of different platforms. We also
propose a standard benchmarking methodology built upon an industrial software
stack and evaluation metrics that comprehensively reflect the various
characteristics of the evaluated intelligence processors. BENCHIP is utilized
for evaluating various hardware platforms, including CPUs, GPUs, and
accelerators. BENCHIP will be open-sourced soon.
Comment: 37 pages, 14 figures
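As a hedged sketch of what a single-layer microbenchmark in the spirit of BENCHIP might look like (the actual suite and its industrial software stack are not detailed in the abstract), here is a minimal timing harness around one convolution layer:

```python
# Hypothetical single-layer microbenchmark: time one convolution layer and
# report achieved GOP/s, the kind of number used for bottleneck analysis.
import time
import numpy as np

def conv_layer(x, w):
    """Naive conv (NCHW, stride 1, no padding) as the unit under test."""
    n, c, h, ww = x.shape
    k, _, r, s = w.shape
    out = np.zeros((n, k, h - r + 1, ww - s + 1), dtype=x.dtype)
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = x[:, :, i:i + r, j:j + s].reshape(n, -1)
            out[:, :, i, j] = patch @ w.reshape(k, -1).T
    return out

x = np.random.randn(1, 64, 56, 56).astype(np.float32)
w = np.random.randn(64, 64, 3, 3).astype(np.float32)
conv_layer(x, w)                              # warm-up run
t0 = time.perf_counter()
conv_layer(x, w)
dt = time.perf_counter() - t0
ops = 2 * 64 * 64 * 3 * 3 * 54 * 54           # one MAC counted as 2 ops
print(f"{dt * 1e3:.1f} ms, {ops / dt / 1e9:.2f} GOP/s")
```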
FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
Convolutional Neural Networks have rapidly become the most successful machine
learning algorithm, enabling ubiquitous machine vision and intelligent
decisions even on embedded computing systems. While the underlying arithmetic
is structurally simple, compute and memory requirements are challenging. One of
the promising opportunities is leveraging reduced-precision representations for
inputs, activations and model parameters. The resulting scalability in
performance, power efficiency and storage footprint provides interesting design
compromises in exchange for a small reduction in accuracy. FPGAs are ideal for
exploiting low-precision inference engines leveraging custom precisions to
achieve the required numerical accuracy for a given application. In this
article, we describe the second generation of the FINN framework, an end-to-end
tool which enables design space exploration and automates the creation of fully
customized inference engines on FPGAs. Given a neural network description, the
tool optimizes for given platforms, design targets and a specific precision. We
introduce formalizations of resource cost functions and performance
predictions, and elaborate on the optimization algorithms. Finally, we evaluate
a selection of reduced precision neural networks ranging from CIFAR-10
classifiers to YOLO-based object detection on a range of platforms including
PYNQ and AWS F1, demonstrating unprecedented measured throughput of
50 TOp/s on AWS F1 and 5 TOp/s on embedded devices.
Comment: to be published in ACM TRETS Special Edition on Deep Learning
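The abstract refers to formalized resource cost functions and performance predictions without reproducing them. The sketch below shows the general shape such a dataflow predictor can take; the bottleneck-layer model and every parameter name are assumptions, not FINN-R's published formulas.

```python
# Assumed roofline-style model: each layer completes pe*simd MACs per cycle,
# and pipeline throughput is set by the slowest (bottleneck) layer.
def layer_latency_cycles(macs, pe, simd):
    return -(-macs // (pe * simd))            # ceiling division

def predict_fps(layers, clock_hz):
    bottleneck = max(layer_latency_cycles(m, p, s) for m, p, s in layers)
    return clock_hz / bottleneck

# (MACs per frame, PE count, SIMD lanes) per layer -- made-up example values
layers = [(57_802_752, 32, 16), (115_605_504, 64, 32), (28_901_376, 16, 16)]
print(f"predicted {predict_fps(layers, 200e6):.0f} frames/s at 200 MHz")
```

Raising pe or simd on the bottleneck layer versus its resource cost is then the design-space-exploration trade such a tool automates.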
Synergy: A HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC
Convolutional Neural Networks (CNN) have been widely deployed in diverse
application domains. There has been significant progress in accelerating both
their training and inference using high-performance GPUs, FPGAs, and custom
ASICs for datacenter-scale environments. The recent proliferation of mobile and
IoT devices has necessitated real-time, energy-efficient deep neural network
inference on embedded-class, resource-constrained platforms. In this context,
we present Synergy, an automated, hardware-software co-designed,
pipelined, high-throughput CNN inference framework on embedded heterogeneous
system-on-chip (SoC) architectures (Xilinx Zynq). Synergy leverages, through
multi-threading, all the available on-chip resources, which include the
dual-core ARM processor along with the FPGA and the NEON SIMD engines as
accelerators. Moreover, Synergy provides a unified abstraction of the
heterogeneous accelerators (FPGA and NEON) and can adapt to different network
configurations at runtime without changing the underlying hardware accelerator
architecture by balancing workload across accelerators through work-stealing.
Synergy achieves a 7.3x speedup, averaged across seven CNN models, over a
well-optimized software-only solution, and demonstrates substantially better
throughput and energy efficiency than contemporary CNN implementations on the
same SoC architecture.
Comment: 34 pages, submitted to ACM Transactions on Embedded Computing Systems (TECS)
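A minimal sketch of the work-stealing idea used to balance load across accelerators; this is a generic illustration with Python threads, not Synergy's FPGA/NEON scheduler.

```python
# Each "accelerator" worker drains its own deque and steals from peers when
# idle, so a fast accelerator absorbs work queued on a slow one.
import threading
from collections import deque

class Worker:
    def __init__(self, name, tasks):
        self.name, self.deque, self.lock = name, deque(tasks), threading.Lock()

    def pop_own(self):
        with self.lock:
            return self.deque.pop() if self.deque else None

    def steal(self):
        with self.lock:
            return self.deque.popleft() if self.deque else None

def run(worker, peers, results):
    while True:
        task = worker.pop_own()
        if task is None:                        # own queue empty: try stealing
            task = next((t for t in (p.steal() for p in peers) if t), None)
        if task is None:
            return                              # nothing left anywhere
        results.append((worker.name, task()))   # "execute" the work item

tiles = [lambda i=i: i * i for i in range(16)]  # stand-in convolution tiles
workers = [Worker("fpga", tiles[:12]), Worker("neon", tiles[12:])]
results = []
threads = [threading.Thread(target=run, args=(w, [p for p in workers if p is not w], results))
           for w in workers]
for t in threads: t.start()
for t in threads: t.join()
print(len(results), "tasks completed")
```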
DeepFense: Online Accelerated Defense Against Adversarial Deep Learning
Recent advances in adversarial Deep Learning (DL) have opened up a largely
unexplored surface for malicious attacks jeopardizing the integrity of
autonomous DL systems. With the widespread usage of DL in critical and
time-sensitive applications, including unmanned vehicles, drones, and video
surveillance systems, online detection of malicious inputs is of utmost
importance. We propose DeepFense, the first end-to-end automated framework that
simultaneously enables efficient and safe execution of DL models. DeepFense
formalizes the goal of thwarting adversarial attacks as an optimization problem
that minimizes the rarely observed regions in the latent feature space spanned
by a DL network. To solve the aforementioned minimization problem, a set of
complementary but disjoint modular redundancies are trained to validate the
legitimacy of the input samples in parallel with the victim DL model. DeepFense
leverages hardware/software/algorithm co-design and customized acceleration to
achieve just-in-time performance in resource-constrained settings. The proposed
countermeasure is unsupervised, meaning that no adversarial sample is leveraged
to train modular redundancies. We further provide an accompanying API to reduce
the non-recurring engineering cost and ensure automated adaptation to various
platforms. Extensive evaluations on FPGAs and GPUs demonstrate up to two orders
of magnitude performance improvement while enabling online adversarial sample
detection.
Comment: Adding hardware acceleration for real-time execution of the defender module
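A hedged sketch of the unsupervised latent-space idea: fit a density model on clean latent features only (no adversarial samples) and flag inputs that land in rarely observed regions. The single-Gaussian model and threshold rule are illustrative assumptions, not DeepFense's modular redundancies.

```python
# Flag inputs whose latent features have unusually large Mahalanobis distance
# from the clean-data distribution; trained without any adversarial samples.
import numpy as np

class LatentDefender:
    def fit(self, latents, reject_rate=0.01):
        self.mu = latents.mean(axis=0)
        cov = np.cov(latents, rowvar=False) + 1e-6 * np.eye(latents.shape[1])
        self.prec = np.linalg.inv(cov)
        d = self._mahalanobis2(latents)
        self.threshold = np.quantile(d, 1 - reject_rate)  # rarest 1% rejected
        return self

    def _mahalanobis2(self, z):
        diff = z - self.mu
        return np.einsum("ij,jk,ik->i", diff, self.prec, diff)

    def is_adversarial(self, z):
        return self._mahalanobis2(z) > self.threshold

clean = np.random.randn(2000, 64)              # stand-in latent features
defender = LatentDefender().fit(clean)
suspicious = np.random.randn(5, 64) * 4.0      # far from the clean manifold
print(defender.is_adversarial(suspicious))     # expected: mostly True
```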
A Hardware-Software Blueprint for Flexible Deep Learning Specialization
Specialized Deep Learning (DL) acceleration stacks, designed for a specific
set of frameworks, model architectures, operators, and data types, offer the
allure of high performance while sacrificing flexibility. Changes in
algorithms, models, operators, or numerical systems threaten the viability of
specialized hardware accelerators. We propose VTA, a programmable deep learning
architecture template designed to be extensible in the face of evolving
workloads. VTA achieves this flexibility via a parametrizable architecture,
two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA
that explicitly orchestrates concurrent compute and memory tasks and (2) a
microcode-ISA which implements a wide variety of operators with single-cycle
tensor-tensor operations. Next, we propose a runtime system equipped with a JIT
compiler for flexible code-generation and heterogeneous execution that enables
effective use of the VTA architecture. VTA is open-sourced and integrated into
Apache TVM, a state-of-the-art deep learning compilation stack that provides
flexibility for diverse models and divergent hardware backends. We propose a
flow that performs design space exploration to generate a customized hardware
architecture and software operator library that can be leveraged by mainstream
learning frameworks. We demonstrate our approach by deploying optimized deep
learning models used for object classification and style transfer on edge-class
FPGAs.
Comment: 6 pages plus references, 8 figures
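A conceptual sketch of the two-level ISA described above; the field names are illustrative, and VTA's actual instruction encoding lives in Apache TVM.

```python
# Task-ISA level: coarse LOAD/GEMM/STORE instructions carry dependence tokens
# so memory and compute stages overlap; the microcode-ISA level (not shown)
# would expand each GEMM into single-cycle tensor-tensor operations.
from dataclasses import dataclass

@dataclass
class TaskInsn:
    opcode: str        # "LOAD", "GEMM", or "STORE"
    buffer: str        # on-chip tile this task touches
    pop_prev: bool     # wait for a token from the upstream stage
    push_next: bool    # hand a token to the downstream stage

program = [
    TaskInsn("LOAD",  "inp[0]", pop_prev=False, push_next=True),  # fetch tile 0
    TaskInsn("LOAD",  "inp[1]", pop_prev=False, push_next=True),  # prefetch tile 1
    TaskInsn("GEMM",  "acc[0]", pop_prev=True,  push_next=True),  # compute tile 0
    TaskInsn("GEMM",  "acc[1]", pop_prev=True,  push_next=True),  # compute tile 1
    TaskInsn("STORE", "out[0]", pop_prev=True,  push_next=False),
    TaskInsn("STORE", "out[1]", pop_prev=True,  push_next=False),
]
# The tokens let the LOAD of tile 1 run while the GEMM stage processes tile 0,
# which is what keeps the pipeline (and the JIT-generated schedule) busy.
for insn in program:
    print(insn)
```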
A Systematic Approach to Blocking Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are the state of the art solution for
many computer vision problems, and many researchers have explored optimized
implementations. Most implementations heuristically block the computation to
deal with the large data sizes and high data reuse of CNNs. This paper explores
how to block CNN computations for memory locality by creating an analytical
model for CNN-like loop nests. Using this model we automatically derive
optimized blockings for common networks that improve the energy efficiency of
custom hardware implementations by up to an order of magnitude. Compared to
traditional CNN CPU implementations based on highly-tuned, hand-optimized BLAS
libraries, our x86 programs implementing the optimal blocking reduce the number
of memory accesses by up to 90%.
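A simplified version of such an analytical model, assuming perfect reuse within a block and ignoring halo regions; the cost formula below is a stand-in for the paper's model, not a reproduction of it.

```python
# Enumerate candidate blockings of a conv layer, keep those whose working set
# fits the on-chip buffer, and pick the one with the least main-memory traffic.
import itertools

def traffic_bytes(H, W, C, K, bh, bw, bc, bk, buffer_bytes, elem=4):
    tile = (bh * bw * bc + bc * bk + bh * bw * bk) * elem  # in + weights + out
    if tile > buffer_bytes:
        return None                                        # blocking doesn't fit
    n_tiles = (H // bh) * (W // bw) * (C // bc) * (K // bk)
    return n_tiles * tile                                  # bytes moved per pass

def best_blocking(H, W, C, K, buffer_bytes):
    best = None
    for bh, bw, bc, bk in itertools.product([7, 14, 28], [7, 14, 28],
                                            [16, 32, 64], [16, 32, 64]):
        t = traffic_bytes(H, W, C, K, bh, bw, bc, bk, buffer_bytes)
        if t is not None and (best is None or t < best[0]):
            best = (t, (bh, bw, bc, bk))
    return best

traffic, blocking = best_blocking(H=56, W=56, C=64, K=64, buffer_bytes=128 * 1024)
print(f"blocking {blocking}: ~{traffic / 1e6:.1f} MB of DRAM traffic")
```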
Scaling Neural Network Performance through Customized Hardware Architectures on Reconfigurable Logic
Convolutional Neural Networks have dramatically improved in recent years,
surpassing human accuracy on certain problems and exceeding the performance of
traditional computer vision algorithms. While the compute pattern in itself is
relatively simple, significant compute and memory challenges remain as CNNs may
contain millions of floating-point parameters and require billions of
floating-point operations to process a single image. These computational
requirements, combined with storage footprints that exceed typical cache sizes,
pose a significant performance and power challenge for modern compute
architectures. One of the promising opportunities to scale performance and
power efficiency is leveraging reduced precision representations for all
activations and weights, as this allows scaling compute capabilities, reducing
weight and feature map buffering requirements, and lowering energy consumption.
While a small reduction in accuracy is encountered, these Quantized Neural
Networks have been shown to achieve state-of-the-art accuracy on standard
benchmark datasets, such as MNIST, CIFAR-10, SVHN and even ImageNet, and thus
provide highly attractive design trade-offs. Current research has focused
mainly on the implementation of extreme variants with full binarization of
weights and/or activations, typically with smaller input images. Within this
paper, we investigate the scalability of dataflow architectures with respect to
supporting various precisions for both weights and activations, larger image
dimensions, and increasing numbers of feature map channels. Key contributions
are a formalized approach to understanding the scalability of the existing
hardware architecture with cost models and a performance prediction as a
function of the target device size. We provide validating experimental results
for ImageNet classification on a server-class platform, namely the AWS F1
node.
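As a hedged sketch of a cost model parameterized by precision and device size, in the spirit of the formalized approach above; the per-MAC LUT figures and device sizes are made-up placeholders, not measured cost functions.

```python
# Estimate the largest PE array whose MAC fabric fits a device at a given
# bitwidth, then its peak throughput. All constants are illustrative.
LUTS_PER_MAC = {1: 3, 2: 10, 4: 35, 8: 120}   # placeholder cost per bitwidth

def max_pes(device_luts, bitwidth, overhead=0.3):
    usable = device_luts * (1 - overhead)      # reserve control + buffering
    return int(usable // LUTS_PER_MAC[bitwidth])

def peak_gops(pes, clock_hz):
    return 2 * pes * clock_hz / 1e9            # one MAC counted as 2 ops

for dev, luts in [("embedded-class part", 71_000), ("server-class part", 1_182_000)]:
    pes = max_pes(luts, bitwidth=2)
    print(f"{dev}: ~{pes} PEs, ~{peak_gops(pes, 250e6):.0f} GOP/s peak")
```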
Computing-in-Memory for Performance and Energy Efficient Homomorphic Encryption
Homomorphic encryption (HE) allows direct computations on encrypted data.
Despite numerous research efforts, the practicality of HE schemes remains to be
demonstrated. In this regard, the enormous size of ciphertexts involved in HE
computations degrades computational efficiency. Near-memory Processing (NMP)
and Computing-in-memory (CiM) - paradigms where computation is done within the
memory boundaries - represent architectural solutions for reducing latency and
energy associated with data transfers in data-intensive applications such as
HE. This paper introduces CiM-HE, a Computing-in-memory (CiM) architecture that
can support operations for the B/FV scheme, a somewhat homomorphic encryption
scheme for general computation. CiM-HE hardware consists of customized
peripherals such as sense amplifiers, adders, bit-shifters, and sequencing
circuits. The peripherals are based on CMOS technology, and could support
computations with memory cells of different technologies. Circuit-level
simulations are used to evaluate our CiM-HE framework assuming a 6T-SRAM
memory. We compare our CiM-HE implementation against (i) two optimized CPU HE
implementations, and (ii) an FPGA-based HE accelerator implementation. When
compared to a CPU solution, CiM-HE obtains speedups between 4.6x and 9.1x, and
energy savings between 266.4x and 532.8x for homomorphic multiplications (the
most expensive HE operation). Also, four end-to-end tasks (mean, variance,
linear regression, and inference) are up to 1.1x, 7.7x, 7.1x, and 7.5x
faster (and 301.1x, 404.6x, 532.3x, and 532.8x more energy efficient). Compared
to CPU-based HE in a previous work, CiM-HE obtains a 14.3x speedup and >2600x
energy savings. Finally, our design offers a 2.2x speedup with 88.1x energy
savings compared to a state-of-the-art FPGA-based accelerator.
Comment: 14 pages
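To make the computing-in-memory idea concrete, below is a behavioral sketch of a bit-serial ripple-carry addition performed over two "memory rows", the style of operation CiM peripherals (sense amplifiers plus adders) perform on bitlines. This is a functional model only; CiM-HE's circuits and its B/FV datapath are not reproduced here.

```python
# Behavioral model: operands live as rows of bit cells (LSB first) and a
# full-adder is applied bit-serially, as bitline peripherals would do it.
import numpy as np

def to_bits(x, width):
    return np.array([(x >> i) & 1 for i in range(width)], dtype=np.uint8)

def cim_add(row_a, row_b):
    out, carry = np.zeros_like(row_a), 0
    for i in range(row_a.size):
        out[i] = row_a[i] ^ row_b[i] ^ carry                       # sum bit
        carry = (row_a[i] & row_b[i]) | (carry & (row_a[i] ^ row_b[i]))
    return out

a, b, width = 13, 29, 8
bits = cim_add(to_bits(a, width), to_bits(b, width))
result = sum(int(bit) << i for i, bit in enumerate(bits))
assert result == a + b
print(f"{a} + {b} = {result}")
```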
Optimizing Temporal Convolutional Network inference on FPGA-based accelerators
Convolutional Neural Networks are extensively used in a wide range of
applications, commonly including computer vision tasks like image and video
classification, recognition, and segmentation. Recent research results
demonstrate that multilayer (deep) networks involving one-dimensional
convolutions and dilation can be used effectively for time series and sequence
classification and segmentation, as well as in tasks involving sequence
modelling. These structures, commonly referred to as Temporal Convolutional
Networks (TCNs), have been demonstrated to consistently outperform Recurrent
Neural Networks in terms of accuracy and training time [1]. While FPGA-based
inference accelerators for classic CNNs are widespread, the literature lacks a
quantitative evaluation of their usability for inference on TCN models. In
this paper we present such an evaluation, considering a CNN accelerator with
specific features supporting TCN kernels as a reference and a set of
state-of-the-art TCNs as a benchmark. Experimental results show that, during
TCN execution, operational intensity can be critical for the overall
performance. We propose a convolution scheduling based on batch processing that
can boost efficiency up to 96% of the theoretical peak performance. Overall, we
achieve up to 111.8 GOPS and a power efficiency of 33.9 GOPS/W on an
UltraScale+ ZU3EG (up to 10x speedup and 3x power-efficiency improvement with
respect to a pure software implementation).
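For reference, a sketch of the dilated, causal one-dimensional convolution at the core of a TCN; the left-padding convention is standard TCN practice, not a specific of the evaluated accelerator.

```python
# Causal dilated 1-D convolution: output at time t sees only inputs <= t,
# and the dilation factor stretches the receptive field without extra taps.
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """x: (c_in, T), w: (c_out, c_in, K) -> (c_out, T)."""
    c_out, c_in, k = w.shape
    pad = (k - 1) * dilation
    xp = np.pad(x, ((0, 0), (pad, 0)))         # left-pad keeps causality
    out = np.zeros((c_out, x.shape[1]), dtype=x.dtype)
    for tap in range(k):
        seg = xp[:, tap * dilation: tap * dilation + x.shape[1]]
        out += np.einsum("oi,it->ot", w[:, :, tap], seg)
    return out

x = np.random.randn(16, 128).astype(np.float32)   # 16 channels, 128 time steps
w = np.random.randn(32, 16, 3).astype(np.float32) # kernel size 3
y = dilated_causal_conv1d(x, w, dilation=4)       # looks back (3-1)*4 steps
print(y.shape)                                    # (32, 128)
```

The batch-scheduling result above is about keeping such kernels compute-bound: reusing weights across many time steps raises operational intensity.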