Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the emergence of big data applications such as Machine Learning, Speech
Recognition, Artificial Intelligence, and DNA Sequencing in recent years, the
computer architecture research community faces an explosive growth in data
scale. To achieve high efficiency in data-intensive computing, studies of
heterogeneous accelerators targeting these emerging applications have become a
central topic in computer architecture. At present, heterogeneous accelerators
are mainly implemented with computing units such as Application-Specific
Integrated Circuits (ASICs), Graphics Processing Units (GPUs), and Field
Programmable Gate Arrays (FPGAs). Among these heterogeneous architectures,
FPGA-based reconfigurable accelerators offer two merits. First, the FPGA fabric
contains a large number of reconfigurable circuits, which can satisfy the
high-performance and low-power requirements of specific applications. Second,
FPGA-based reconfigurable architectures enable rapid prototyping and feature
excellent customizability and reconfigurability. A growing body of acceleration
work based on FPGAs and other reconfigurable architectures has recently
appeared at top-tier computer architecture conferences. To review this recent
work, this survey takes the latest research on reconfigurable accelerator
architectures and their algorithmic applications as its basis. We compare the
main research issues and application domains, and analyze the advantages,
disadvantages, and challenges of reconfigurable accelerators. Finally, we
discuss likely future directions for accelerator architectures, hoping to
provide a reference for computer architecture researchers.
An Efficient Graph Accelerator with Parallel Data Conflict Management
Graph-specific computing supported by dedicated accelerators has greatly
boosted graph processing in both efficiency and energy. Nevertheless, their
data conflict management is still sequential in essence when a vertex requires
a large number of conflicting updates at the same time, leading to prohibitive
performance degradation. This is particularly true for processing natural
graphs.
In this paper, we have the insight that the atomic operations for the vertex
updates of many graph algorithms (e.g., BFS, PageRank, and WCC) are typically
incremental and simplex. This allows us to parallelize the conflicting vertex
updates in an accumulative manner. We architect a novel graph-specific
accelerator that can simultaneously process atomic vertex updates for massive
parallelism on conflicting data accesses while ensuring correctness. A parallel
accumulator is designed to remove the serialization in atomic protection for
conflicting vertex updates by merging their results in parallel. Our
implementation on a Xilinx Virtex UltraScale+ XCVU9P with a wide variety of
typical graph algorithms shows that our accelerator achieves an average
throughput of 2.36 GTEPS as well as up to 3.14x performance speedup in
comparison with the state-of-the-art ForeGraph (single-chip version).
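A minimal software sketch of the accumulative conflict handling described
above, assuming vertex updates are commutative increments (as in PageRank-style
accumulation); the grouping and pairwise tree reduction stand in for the
hardware parallel accumulator and are illustrative only, not the paper's
design.

    # Illustrative sketch: merging conflicting vertex updates by accumulation
    # instead of serializing them behind atomic operations. Assumes updates are
    # commutative/associative increments (e.g., PageRank partial sums).
    from collections import defaultdict

    def merge_conflicting_updates(updates):
        """updates: iterable of (vertex_id, delta) pairs, possibly many per vertex."""
        grouped = defaultdict(list)
        for vid, delta in updates:
            grouped[vid].append(delta)
        merged = {}
        for vid, deltas in grouped.items():
            # Pairwise tree reduction, mimicking a parallel accumulator that
            # combines n conflicting updates in log2(n) steps rather than n atomics.
            while len(deltas) > 1:
                tail = deltas[-1:] if len(deltas) % 2 else []
                deltas = [deltas[i] + deltas[i + 1]
                          for i in range(0, len(deltas) - 1, 2)] + tail
            merged[vid] = deltas[0]
        return merged

    # Three conflicting updates to vertex 7 collapse into a single write.
    print(merge_conflicting_updates([(7, 0.2), (7, 0.5), (7, 0.1), (3, 1.0)]))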
FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies and the availability of credible
data, deep learning, an area of artificial intelligence, has emerged and
demonstrated its ability and effectiveness in solving complex learning problems
not previously possible. In particular, convolutional neural networks (CNNs)
have demonstrated their effectiveness in image detection and recognition
applications. However, they require intensive computation and memory bandwidth
that general-purpose CPUs cannot provide at the desired performance levels.
Consequently, hardware accelerators that use application-specific integrated
circuits (ASICs), field-programmable gate arrays (FPGAs), and graphics
processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have recently been adopted for accelerating the
implementation of deep learning networks due to their ability to maximize
parallelism and their energy efficiency. In this paper, we review recent
techniques for accelerating deep learning networks on FPGAs. We highlight the
key features employed by the various techniques to improve acceleration
performance. In addition, we provide recommendations for enhancing the
utilization of FPGAs for CNN acceleration. The techniques investigated in this
paper represent the recent trends in FPGA-based accelerators of deep learning
networks. Thus, this review is expected to direct future advances in efficient
hardware accelerators and to be useful for deep learning researchers.
Comment: This article has been accepted for publication in IEEE Access
(December 2018).
DNNVM : End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators
The convolutional neural network (CNN) has become a state-of-the-art method
for several artificial intelligence domains in recent years. The increasingly
complex CNN models are both computation-bound and I/O-bound. FPGA-based
accelerators driven by a custom instruction set architecture (ISA) achieve a
balance between generality and efficiency, but much about them remains to be
optimized. We propose the full-stack compiler DNNVM, which integrates
optimizers for graphs, loops, and data layouts with an assembler, a runtime
support module, and a validation environment. DNNVM works in the context of
deep learning frameworks and transforms CNN models into a directed acyclic
graph, XGraph. Based on XGraph, we transform the optimization challenges for both the
data layout and pipeline into graph-level problems. DNNVM enumerates all
potentially profitable fusion opportunities by a heuristic subgraph isomorphism
algorithm to leverage pipeline and data layout optimizations, and searches for
the best choice of execution strategies of the whole computing graph. On the
Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve state-of-the-art performance
on our benchmarks even with naïve implementations without optimizations, and
throughput is further improved by up to 1.26x by leveraging heterogeneous
optimizations in DNNVM. Finally, with the ZU9 @330 MHz, we achieve
state-of-the-art performance for VGG and ResNet50: a throughput of 2.82 TOPs/s
and an energy efficiency of 123.7 GOPs/s/W for VGG, as well as 1.38 TOPs/s for
ResNet50 and 1.41 TOPs/s for GoogleNet.
Comment: 18 pages, 9 figures, 5 tables.
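A minimal sketch of the fusion enumeration idea described above: matching small
operator patterns against a CNN computation DAG and preferring longer fused
chains. The node structure, the pattern list, and the scoring rule are
hypothetical illustrations, not DNNVM's actual XGraph IR or its heuristic
subgraph isomorphism algorithm.

    # Illustrative sketch of fusion-opportunity enumeration on an operator DAG.
    # The graph encoding, fusible patterns, and preference for longer chains are
    # assumptions for illustration; DNNVM's real search is more involved.
    FUSIBLE_PATTERNS = [("conv", "relu"), ("conv", "relu", "pool")]

    def enumerate_fusions(dag):
        """dag: dict mapping node name -> (op_type, [successor names])."""
        candidates = []
        for name, (op, _succs) in dag.items():
            for pattern in FUSIBLE_PATTERNS:
                chain, ok = [name], op == pattern[0]
                for want in pattern[1:]:
                    nexts = dag[chain[-1]][1]
                    if ok and len(nexts) == 1 and dag[nexts[0]][0] == want:
                        chain.append(nexts[0])
                    else:
                        ok = False
                if ok:
                    candidates.append(tuple(chain))
        # Prefer longer chains: fused ops avoid off-chip round trips between them.
        return sorted(candidates, key=len, reverse=True)

    dag = {"c1": ("conv", ["r1"]), "r1": ("relu", ["p1"]), "p1": ("pool", [])}
    print(enumerate_fusions(dag))  # [('c1', 'r1', 'p1'), ('c1', 'r1')]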
FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
Convolutional Neural Networks have rapidly become the most successful machine
learning algorithm, enabling ubiquitous machine vision and intelligent
decisions on even embedded computing systems. While the underlying arithmetic
is structurally simple, compute and memory requirements are challenging. One of
the promising opportunities is leveraging reduced-precision representations for
inputs, activations and model parameters. The resulting scalability in
performance, power efficiency and storage footprint provides interesting design
compromises in exchange for a small reduction in accuracy. FPGAs are ideal for
exploiting low-precision inference engines leveraging custom precisions to
achieve the required numerical accuracy for a given application. In this
article, we describe the second generation of the FINN framework, an end-to-end
tool which enables design space exploration and automates the creation of fully
customized inference engines on FPGAs. Given a neural network description, the
tool optimizes for given platforms, design targets and a specific precision. We
introduce formalizations of resource cost functions and performance
predictions, and elaborate on the optimization algorithms. Finally, we evaluate
a selection of reduced precision neural networks ranging from CIFAR-10
classifiers to YOLO-based object detection on a range of platforms including
PYNQ and AWS F1, demonstrating unprecedented measured throughput of 50 TOp/s on
AWS F1 and 5 TOp/s on embedded devices.
Comment: to be published in ACM TRETS Special Edition on Deep Learning.
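A minimal sketch of the reduced-precision arithmetic that such frameworks
exploit: symmetric linear quantization of weights and activations to a few
bits, followed by an integer dot product with a single floating-point rescale.
The bit widths and scaling scheme below are illustrative assumptions, not
FINN-R's actual quantization flow.

    # Illustrative sketch of reduced-precision inference arithmetic. The chosen
    # bit widths and symmetric scaling are assumptions, not FINN-R's quantizers.
    import numpy as np

    def quantize(x, bits):
        qmax = 2 ** (bits - 1) - 1                      # e.g., 3 bits -> [-3, 3]
        scale = max(np.max(np.abs(x)) / qmax, 1e-12)
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
        return q, scale

    def quantized_dot(w, a, w_bits=3, a_bits=4):
        wq, w_scale = quantize(w, w_bits)
        aq, a_scale = quantize(a, a_bits)
        return int(wq @ aq) * w_scale * a_scale         # integer MACs, one rescale

    w, a = np.random.randn(64), np.random.rand(64)
    print(quantized_dot(w, a), float(w @ a))            # low- vs full-precision result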
C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs
Recently, significant accuracy improvement has been achieved for acoustic
recognition systems by increasing the model size of Long Short-Term Memory
(LSTM) networks. Unfortunately, the ever-increasing size of LSTM models leads
to inefficient designs on FPGAs due to the limited on-chip resources. Previous
work proposes a pruning-based compression technique to reduce the model size
and thus speed up inference on FPGAs. However, the random nature of the pruning
technique transforms the dense matrices of the model into highly unstructured
sparse ones, which leads to unbalanced computation and irregular memory
accesses and thus hurts the overall performance and energy efficiency.
In contrast, we propose to use a structured compression technique which could
not only reduce the LSTM model size but also eliminate the irregularities of
computation and memory accesses. This approach employs block-circulant instead
of sparse matrices to compress weight matrices and reduces the storage
requirement from to . Fast Fourier Transform
algorithm is utilized to further accelerate the inference by reducing the
computational complexity from to
. The datapath and activation functions are
quantized as 16-bit to improve the resource utilization. More importantly, we
propose a comprehensive framework called C-LSTM to automatically optimize and
implement a wide range of LSTM variants on FPGAs. According to the experimental
results, C-LSTM achieves up to 18.8X and 33.5X gains for performance and energy
efficiency compared with the state-of-the-art LSTM implementation under the
same experimental setup, with very small accuracy degradation.
Comment: Proceedings of the 2018 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays.
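A minimal numpy sketch of the block-circulant identity the abstract relies on:
a k x k circulant block is fully described by its first column, and multiplying
it with a vector becomes elementwise multiplication in the FFT domain, which is
where the O(k log k) cost comes from. This is the generic mathematical
identity, not C-LSTM's FPGA datapath.

    # Circulant-block matrix-vector product via FFT: C @ x == IFFT(FFT(c) * FFT(x)),
    # where c is the first column of the k x k circulant block.
    import numpy as np

    def circulant_matvec_fft(c, x):
        return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

    k = 8
    c = np.random.randn(k)                   # first column defines the whole block
    x = np.random.randn(k)

    # Reference: build the dense circulant matrix explicitly (O(k^2) multiply).
    C = np.stack([np.roll(c, j) for j in range(k)], axis=1)
    print(np.allclose(C @ x, circulant_matvec_fft(c, x)))   # True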
Evolutionary Cell Aided Design for Neural Network Architectures
Mathematical theory shows that multilayer feedforward Artificial Neural
Networks (ANNs) are universal function approximators, capable of approximating
any measurable function to any desired degree of accuracy. In practice,
designing practical and efficient neural network architectures requires
significant effort and expertise. We present a novel software framework called
Evolutionary Cell Aided Design (ECAD), meant to aid in the exploration and
design of efficient Neural Network Architectures (NNAs) for reconfigurable
hardware. Given a general neural network structure and a set of constraints and
fitness functions, the framework will explore both the space of possible NNAs and the
space of possible hardware designs, using evolutionary algorithms, and attempt
to find the fittest co-design solutions according to a predefined set of goals.
We test the framework on an image classification task using the MNIST data set
of handwritten digits, with an Intel Arria 10 GX 1150 device as our target
platform. We design and implement a modular and scalable 2D systolic array with
enhancements for machine learning that can be used by the framework for the
hardware search space. Our results demonstrate the ability to pair neural
network design and hardware development together using an evolutionary
algorithm and removing traditional human-in-the-loop development tasks. By
running various experiments of the fittest solutions for neural network and
hardware searches, we demonstrate the full end-to-end capabilities of the ECAD
framework.
Comment: Text and image edits.
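A minimal sketch of the evolutionary co-design loop the abstract describes:
each genome pairs network hyperparameters with hardware parameters, and fitness
trades an (assumed, stubbed) accuracy estimate against a resource cost. The
genome fields, fitness weighting, and operators are hypothetical, not ECAD's
actual encoding or evaluation flow.

    # Illustrative evolutionary co-design loop; all fields and scores are stubs.
    import random

    def random_genome():
        return {"layers": random.randint(2, 6),
                "width": random.choice([32, 64, 128]),
                "pe_array": random.choice([(8, 8), (16, 16), (32, 32)])}

    def fitness(g):
        est_accuracy = 0.80 + 0.02 * g["layers"] + 0.0002 * g["width"]  # stub model
        dsp_cost = g["pe_array"][0] * g["pe_array"][1] / 1024.0         # stub cost
        return est_accuracy - 0.1 * dsp_cost                            # co-design trade-off

    def evolve(pop_size=16, generations=20):
        pop = [random_genome() for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            survivors = pop[: pop_size // 2]                            # elitist selection
            children = []
            while len(survivors) + len(children) < pop_size:
                a, b = random.sample(survivors, 2)
                child = {k: random.choice([a[k], b[k]]) for k in a}     # crossover
                if random.random() < 0.2:                               # mutation
                    child["width"] = random.choice([32, 64, 128])
                children.append(child)
            pop = survivors + children
        return max(pop, key=fitness)

    print(evolve())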
Moving Processing to Data: On the Influence of Processing in Memory on Data Management
Near-Data Processing refers to an architectural hardware and software
paradigm based on the co-location of storage and compute units. Ideally, it
allows application-defined data- or compute-intensive operations to be executed
in-situ, i.e., within (or close to) the physical data storage. Thus, Near-Data
Processing seeks to minimize expensive data movement, improving performance,
scalability, and resource efficiency. Processing-in-Memory is a sub-class of
Near-Data Processing that targets data processing directly within memory (DRAM)
chips. The effective use of Near-Data Processing mandates new architectures,
algorithms, interfaces, and development toolchains.
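To make the data-movement argument concrete, here is a toy contrast between
shipping all rows to the host and pushing a predicate down to a storage-side
operator; the NearDataDevice class and its push_down() method are invented for
illustration, since real NDP/PIM stacks expose vendor- or research-specific
interfaces.

    # Toy contrast: host-side filtering (move everything) vs. an in-situ,
    # near-data filter that returns only matching rows. Interfaces are invented.
    class NearDataDevice:
        def __init__(self, rows):
            self._rows = rows                                # data resident on the device

        def read_all(self):
            return list(self._rows)                          # conventional path: move all rows

        def push_down(self, predicate):
            return [r for r in self._rows if predicate(r)]   # filter where the data lives

    dev = NearDataDevice([{"id": i, "temp": 20 + i % 15} for i in range(1000)])
    host_side = [r for r in dev.read_all() if r["temp"] > 30]   # moves 1000 rows
    near_data = dev.push_down(lambda r: r["temp"] > 30)         # moves only matches
    print(len(host_side) == len(near_data))                     # True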
WLAN Specific IoT Enable Power Efficient RAM Design on 40nm FPGA
Increasing the speed of the computer is one of the important goals of Random
Access Memory (RAM) design, and for better and faster processing the RAM should
be efficient. In this work, the main focus is to design an energy-efficient RAM
that can also be accessed through the internet. A 128-bit IPv6 address is added
to the RAM in order to control it via the internet. Four different Low Voltage
CMOS (LVCMOS) IO standards are used to lower power consumption, evaluated under
five different WLAN frequencies. At the WLAN frequency of 2.4 GHz, a maximum
power reduction of 85% is achieved when LVCMOS12 is used in place of LVCMOS25.
This design is implemented on a Virtex-6 FPGA, device xc6vlx75t, package FF48
Full-stack Optimization for Accelerating CNNs with FPGA Validation
We present a full-stack optimization framework for accelerating inference of
CNNs (Convolutional Neural Networks) and validate the approach with
field-programmable gate arrays (FPGA) implementations. By jointly optimizing
CNN models, computing architectures, and hardware implementations, our
full-stack approach achieves unprecedented performance in the trade-off space
characterized by inference latency, energy efficiency, hardware utilization and
inference accuracy. As a validation vehicle, we have implemented a 170 MHz FPGA
inference chip achieving 2.28 ms latency for the ImageNet benchmark. The
achieved latency is among the lowest reported in the literature while
maintaining comparable accuracy. Moreover, our chip achieves 9x higher energy
efficiency than other implementations with comparable latency. A highlight of
our full-stack approach that contributes to this high energy efficiency is an
efficient Selector-Accumulator (SAC) architecture for implementing the
multiplier-accumulator (MAC) operation present in any digital CNN hardware. For
instance, compared to an FPGA implementation of a traditional 8-bit MAC, SAC
substantially reduces the required hardware resources (4.85x fewer Look-up
Tables) and power consumption (2.48x).
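The abstract does not spell out the SAC microarchitecture. A common way
multiplier-free MAC units achieve such savings is by constraining weights to
signed powers of two, so each multiply reduces to selecting a shifted copy of
the activation; the sketch below illustrates that generic shift-and-accumulate
idea under that assumption and is not necessarily the paper's exact SAC design.

    # Generic shift-and-accumulate MAC sketch (assumption: weights are signed
    # powers of two, so multiplication becomes a shift selected by the exponent).
    def shift_accumulate(activations, weight_exponents, weight_signs):
        """Each weight is sign * 2**exponent; activations are integers."""
        acc = 0
        for a, e, s in zip(activations, weight_exponents, weight_signs):
            acc += s * (a << e)          # shift replaces the hardware multiplier
        return acc

    # Weight vector (+4, -2, +1) applied to activations (3, 5, 7):
    print(shift_accumulate([3, 5, 7], [2, 1, 0], [+1, -1, +1]))  # 3*4 - 5*2 + 7 = 9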