F-E3D: FPGA-based acceleration of an efficient 3D convolutional neural network for human action recognition
Three-dimensional convolutional neural networks (3D CNNs) have demonstrated outstanding classification accuracy for human action recognition (HAR). However, the large number of computations and parameters in 3D CNNs limits their deployability in real-life applications. To address this challenge, this paper adopts an algorithm-hardware co-design method, proposing an efficient 3D CNN building unit called the 3D-1 bottleneck residual block (3D-1 BRB) at the algorithm level, and a corresponding FPGA-based hardware architecture called F-E3D at the hardware level. Based on the 3D-1 BRB, a novel 3D CNN model called E3DNet is developed, which achieves nearly a 37-times reduction in model size and a 5% improvement in accuracy compared to standard 3D CNNs on the UCF101 dataset. Together with several hardware optimizations, including the 3D fused BRB, online blocking and kernel reuse, the proposed F-E3D is nearly 13 times faster than a previous FPGA design for 3D CNNs, with performance and accuracy comparable to other state-of-the-art 3D CNN models on GPU platforms while requiring only 7% of their energy consumption.
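The parameter savings of a bottleneck residual block follow from simple arithmetic: two cheap 1×1×1 convolutions bracket a full 3×3×3 convolution that runs at reduced channel width. The sketch below works through that arithmetic; the channel width and reduction factor are illustrative assumptions, not E3DNet's actual configuration.

```python
# Parameter-count comparison: a standard full-width 3x3x3 convolution versus
# a bottleneck block (1x1x1 reduce -> 3x3x3 at reduced width -> 1x1x1 expand).
# Channel widths below are hypothetical, chosen only to show the trend.

def conv3d_params(c_in, c_out, k):
    """Weight count of a 3D convolution with a cubic k x k x k kernel (bias ignored)."""
    return c_in * c_out * k ** 3

def standard_block(c, k=3):
    # One full-width 3x3x3 convolution.
    return conv3d_params(c, c, k)

def bottleneck_block(c, reduction=4, k=3):
    # Reduce channels, convolve at the reduced width, then expand back.
    mid = c // reduction
    return (conv3d_params(c, mid, 1)      # 1x1x1 reduce
            + conv3d_params(mid, mid, k)  # 3x3x3 at reduced width
            + conv3d_params(mid, c, 1))   # 1x1x1 expand

c = 256
print(standard_block(c))     # 1769472
print(bottleneck_block(c))   # 143360
print(standard_block(c) / bottleneck_block(c))  # ~12.34x fewer parameters
```

The savings compound across the network, which is why a block-level redesign can shrink the whole model by an order of magnitude.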
Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators
We show that DNN accelerator micro-architectures and their program mappings
represent specific choices of loop order and hardware parallelism for computing
the seven nested loops of DNNs, which enables us to create a formal taxonomy of
all existing dense DNN accelerators. Surprisingly, the loop transformations
needed to create these hardware variants can be precisely and concisely
represented by Halide's scheduling language. By modifying the Halide compiler
to generate hardware, we create a system that can fairly compare these prior
accelerators. As long as proper loop blocking schemes are used, and the
hardware can support mapping replicated loops, many different hardware
dataflows yield similar energy efficiency with good performance. This is
because the loop blocking can ensure that most data references stay on-chip
with good locality and the processing units have high resource utilization. How
resources are allocated, especially in the memory system, has a large impact on
energy and performance. By optimizing hardware resource allocation while
keeping throughput constant, we achieve up to 4.2X energy improvement for
Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long
Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
Comment: Published as a conference paper at ASPLOS 2020
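The seven-loop view of a dense convolution that the taxonomy builds on can be written out directly. Below, a blocked variant splits the output-channel loop into tiles, the kind of transformation a Halide schedule expresses with split/reorder; sizes are tiny and purely illustrative.

```python
# The seven nested loops of a dense convolutional layer: batch (n), output
# channels (k), input channels (c), output rows/cols (oy, ox), filter rows/cols
# (fy, fx). Accelerator dataflows correspond to reorderings, blockings, and
# parallelisations of this same loop nest.

def conv7(inp, w, N, K, C, OY, OX, FY, FX):
    out = [[[[0.0] * OX for _ in range(OY)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k in range(K):
            for c in range(C):
                for oy in range(OY):
                    for ox in range(OX):
                        for fy in range(FY):
                            for fx in range(FX):
                                out[n][k][oy][ox] += inp[n][c][oy + fy][ox + fx] * w[k][c][fy][fx]
    return out

def conv7_blocked(inp, w, N, K, C, OY, OX, FY, FX, kb=2):
    # Same computation with the output-channel loop split into blocks of kb:
    # an outer loop over tiles and an inner loop within each tile.
    out = [[[[0.0] * OX for _ in range(OY)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k0 in range(0, K, kb):
            for c in range(C):
                for oy in range(OY):
                    for ox in range(OX):
                        for k in range(k0, min(k0 + kb, K)):
                            for fy in range(FY):
                                for fx in range(FX):
                                    out[n][k][oy][ox] += inp[n][c][oy + fy][ox + fx] * w[k][c][fy][fx]
    return out

# Sanity check: both loop orderings compute the same result.
inp = [[[[c + y + x for x in range(4)] for y in range(4)] for c in range(2)]]
w = [[[[k + c + fy + fx for fx in range(2)] for fy in range(2)] for c in range(2)] for k in range(3)]
assert conv7(inp, w, 1, 3, 2, 3, 3, 2, 2) == conv7_blocked(inp, w, 1, 3, 2, 3, 3, 2, 2)
```

Blocking changes which data stays resident in on-chip buffers, not what is computed, which is why different dataflows can reach similar energy efficiency once blocking is chosen well.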
HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on FPGA Devices
For Human Action Recognition tasks (HAR), 3D Convolutional Neural Networks
have proven to be highly effective, achieving state-of-the-art results. This
study introduces a novel streaming architecture based toolflow for mapping such
models onto FPGAs considering the model's inherent characteristics and the
features of the targeted FPGA device. The HARFLOW3D toolflow takes as input a
3D CNN in ONNX format and a description of the FPGA characteristics, generating
a design that minimizes the latency of the computation. The toolflow is
comprised of a number of parts, including i) a 3D CNN parser, ii) a performance
and resource model, iii) a scheduling algorithm for executing 3D models on the
generated hardware, iv) a resource-aware optimization engine tailored for 3D
models, v) an automated mapping to synthesizable code for FPGAs. The ability of
the toolflow to support a broad range of models and devices is shown through a
number of experiments on various 3D CNN and FPGA system pairs. Furthermore, the
toolflow has produced high-performing results for 3D CNN models that have not
been mapped to FPGAs before, demonstrating the potential of FPGA-based systems
in this space. Overall, HARFLOW3D has demonstrated its ability to deliver
competitive latency compared to a range of state-of-the-art hand-tuned
approaches, achieving up to 5× better performance than some of the existing works.
Comment: 11 pages, 8 figures, 6 tables
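A toolflow's performance and resource model typically estimates each layer's latency as the slower of its compute and its off-chip memory transfer, then sums over sequentially executed layers. The roofline-style sketch below is a generic illustration of that idea, not HARFLOW3D's actual model, and every number in it is made up.

```python
# Hedged sketch of a per-layer latency estimate: a layer is bounded by either
# compute throughput or memory bandwidth, whichever is slower. The MAC counts,
# byte counts, and hardware rates below are hypothetical.

def layer_latency(macs, bytes_moved, macs_per_cycle, bytes_per_cycle):
    compute_cycles = macs / macs_per_cycle
    memory_cycles = bytes_moved / bytes_per_cycle
    return max(compute_cycles, memory_cycles)  # the slower side dominates

layers = [
    # (MAC operations, bytes transferred off-chip) -- hypothetical 3D-CNN layers
    (4_000_000, 600_000),   # memory-bound
    (9_000_000, 250_000),   # compute-bound
    (2_000_000, 800_000),   # memory-bound
]

total = sum(layer_latency(m, b, macs_per_cycle=512, bytes_per_cycle=16)
            for m, b in layers)
print(f"estimated cycles: {total:.0f}")
```

A model like this lets an optimization engine compare candidate designs cheaply, reserving synthesis for the most promising points.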
FMM-X3D: FPGA-based modeling and mapping of X3D for Human Action Recognition
3D Convolutional Neural Networks are gaining increasing attention from
researchers and practitioners and have found applications in many domains, such
as surveillance systems, autonomous vehicles, human monitoring systems, and
video retrieval. However, their widespread adoption is hindered by their high
computational and memory requirements, especially when resource-constrained
systems are targeted. This paper addresses the problem of mapping X3D, a
state-of-the-art model in Human Action Recognition that achieves 95.5%
accuracy on the UCF101 benchmark, onto any FPGA device. The proposed toolflow
generates an optimised stream-based hardware system, taking into account the
available resources and off-chip memory characteristics of the FPGA device. The
generated designs push further the current performance-accuracy Pareto front,
and enable for the first time the targeting of such complex model architectures
for the Human Action Recognition task.
Comment: 8 pages, 6 figures, 2 tables
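The performance-accuracy Pareto front referred to above is simply the set of non-dominated design points: those for which no other design is at least as fast and at least as accurate. A minimal sketch, with made-up design points:

```python
# Compute the Pareto front over (latency, accuracy) design points.
# Lower latency and higher accuracy are both better. The designs listed
# here are hypothetical, not results from the paper.

def pareto_front(points):
    """Return the non-dominated (latency_ms, accuracy) points, sorted by latency."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

designs = [(12.0, 0.91), (20.0, 0.955), (15.0, 0.90), (25.0, 0.94), (18.0, 0.93)]
print(pareto_front(designs))  # [(12.0, 0.91), (18.0, 0.93), (20.0, 0.955)]
```

"Pushing the front" means producing new designs that dominate previously published points, so the old front shrinks when the new designs are added.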
Optimising algorithm and hardware for deep neural networks on FPGAs
This thesis proposes novel algorithm and hardware optimisation approaches to accelerate Deep Neural Networks (DNNs), including both Convolutional Neural Networks (CNNs) and Bayesian Neural Networks (BayesNNs).
The first contribution of this thesis is to propose an adaptable and reconfigurable hardware design to accelerate CNNs. By analysing the computational patterns of different CNNs, a unified hardware architecture is proposed for both 2-dimensional (2D) and 3-dimensional (3D) CNNs. The accelerator is also designed with runtime adaptability, adopting different parallelism strategies for different convolutional layers at runtime.
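One way to see why a single architecture can serve both 2D and 3D CNNs: a 2D convolution is a 3D convolution whose temporal kernel depth is 1, so one loop nest (or one datapath) covers both cases. The single-channel sketch below illustrates the idea only; it is not the thesis's actual hardware design.

```python
# A 2D convolution expressed as a special case of a 3D convolution:
# wrap both operands with a depth-1 axis and reuse the 3D path.

def conv3d(inp, ker):
    """inp: D x H x W nested lists, ker: KD x KH x KW; 'valid' convolution."""
    D, H, W = len(inp), len(inp[0]), len(inp[0][0])
    KD, KH, KW = len(ker), len(ker[0]), len(ker[0][0])
    return [[[sum(inp[d + kd][h + kh][w + kw] * ker[kd][kh][kw]
                  for kd in range(KD) for kh in range(KH) for kw in range(KW))
              for w in range(W - KW + 1)]
             for h in range(H - KH + 1)]
            for d in range(D - KD + 1)]

def conv2d(img, ker2d):
    # Route the 2D case through the 3D kernel path with a depth-1 axis.
    return conv3d([img], [ker2d])[0]
```

Because `conv2d` is just the depth-1 instance of `conv3d`, the same multiply-accumulate machinery handles both layer types; only the loop bounds change.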
The second contribution of this thesis is to propose a novel neural network architecture and hardware design co-optimisation approach, which improves the performance of CNNs at both the algorithm and hardware levels. Our proposed three-phase co-design framework decouples network training from design space exploration, which significantly reduces the time cost of the co-optimisation process.
The third contribution of this thesis is to propose an algorithmic and hardware co-optimisation framework for accelerating BayesNNs. At the algorithmic level, three categories of structured sparsity are explored to reduce the computational complexity of BayesNNs. At the hardware level, we propose a novel hardware architecture that exploits the structured sparsity of BayesNNs. Both algorithmic and hardware optimisations are jointly applied to push the performance limit.
Open Access
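Structured sparsity of the kind explored here can be illustrated with row-wise (channel-wise) pruning: whole rows of a weight matrix are zeroed, so hardware can skip entire dot products instead of tracking scattered individual zeros. The matrix and the magnitude-based criterion below are illustrative assumptions, not the thesis's actual pruning scheme.

```python
# Structured (row-wise) pruning sketch: keep the rows with the largest L1
# norm and zero out the rest entirely, so each removed row corresponds to a
# whole unit of computation the hardware can skip.

def prune_rows(weights, keep_ratio):
    """Keep the ceil-free int(len * keep_ratio) highest-L1-norm rows; zero the rest."""
    norms = [(sum(abs(x) for x in row), i) for i, row in enumerate(weights)]
    keep = {i for _, i in sorted(norms, reverse=True)[:int(len(weights) * keep_ratio)]}
    return [row if i in keep else [0.0] * len(row) for i, row in enumerate(weights)]

W = [[0.9, -0.8, 0.7],
     [0.1,  0.0, 0.1],   # low-magnitude row: removed as a whole unit
     [-0.6, 0.5, 0.4]]
print(prune_rows(W, keep_ratio=0.7))  # rows 0 and 2 kept, row 1 zeroed
```

Unstructured pruning can reach higher sparsity at the same accuracy, but structured patterns like this map far more directly onto a hardware datapath, which is the trade-off the co-optimisation exploits.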