Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition
Neural network designs are quite diverse, from VGG-style to ResNet-style, and from Convolutional Neural Networks to Transformers. Towards the design of efficient accelerators, many works have adopted a dataflow-based, inter-layer pipelined architecture with customised hardware for each layer, achieving ultra-high throughput and low latency. The deployment of neural networks on such dataflow architecture accelerators is usually constrained by the available on-chip memory, as it is desirable to preload the network weights on-chip to maximise system performance. To address this, networks are usually compressed before deployment through methods such as pruning, quantization and tensor decomposition. In this paper, a framework for mapping CNNs onto FPGAs based on a novel tensor decomposition method called Mixed-TD is proposed. The proposed method applies layer-specific Singular Value Decomposition (SVD) and Canonical Polyadic Decomposition (CPD) in a mixed manner, achieving 1.73x to 10.29x higher throughput per DSP compared to state-of-the-art CNN accelerators.
Our work is open-sourced: https://github.com/Yu-Zhewen/Mixed-TD
Comment: accepted by FPL202
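As a rough illustration of the idea behind layer-specific mixed decomposition (a sketch of the general technique, not the authors' implementation), the snippet below compresses a convolutional weight tensor with either a truncated SVD of its matricised form or a CPD, and picks whichever gives the lower reconstruction error at a given rank. The rank value and the error-based selection heuristic are illustrative assumptions.

```python
# Minimal sketch of layer-specific mixed SVD/CPD compression; the rank
# and the selection rule are illustrative, not the paper's algorithm.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def svd_compress(weight, rank):
    """Truncated SVD on the matricised 4D conv weight (out, in*kh*kw)."""
    mat = weight.reshape(weight.shape[0], -1)
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return approx.reshape(weight.shape)

def cpd_compress(weight, rank):
    """Rank-R Canonical Polyadic Decomposition of the full 4D tensor."""
    cp = parafac(tl.tensor(weight), rank=rank, init="random", random_state=0)
    return tl.to_numpy(tl.cp_to_tensor(cp))

def mixed_td(weight, rank):
    """Pick SVD or CPD per layer by relative reconstruction error."""
    candidates = {"svd": svd_compress(weight, rank),
                  "cpd": cpd_compress(weight, rank)}
    errors = {k: np.linalg.norm(weight - v) / np.linalg.norm(weight)
              for k, v in candidates.items()}
    choice = min(errors, key=errors.get)
    return choice, candidates[choice]

# Example: a toy 3x3 conv layer with 16 output and 8 input channels.
w = np.random.randn(16, 8, 3, 3).astype(np.float32)
method, w_approx = mixed_td(w, rank=4)
print(method, np.linalg.norm(w - w_approx) / np.linalg.norm(w))
```

In a real mapping flow, the per-layer choice would also account for how each decomposition maps onto the accelerator's DSP and memory resources, not just the reconstruction error.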
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated into the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, including the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201
SPRING: A Sparsity-Aware Reduced-Precision Monolithic 3D CNN Accelerator Architecture for Training and Inference
CNNs outperform traditional machine learning algorithms across a wide range
of applications. However, their computational complexity makes it necessary to
design efficient hardware accelerators. Most CNN accelerators focus on
exploring dataflow styles that exploit computational parallelism. However, the potential performance speedup from sparsity has not been adequately addressed.
The computation and memory footprint of CNNs can be significantly reduced if
sparsity is exploited in network evaluations. To take advantage of sparsity,
some accelerator designs explore sparsity encoding and evaluation on CNN
accelerators. However, sparsity encoding is typically applied only to activations or weights, and only for inference. It has been shown that activations and weights also exhibit high sparsity levels during training. Hence, sparsity-aware computation
should also be considered in training. To further improve performance and
energy efficiency, some accelerators evaluate CNNs with limited precision.
However, this is limited to inference, since reduced precision sacrifices network accuracy if used in training. In addition, CNN evaluation is usually
memory-intensive, especially in training. In this paper, we propose SPRING, a
SParsity-aware Reduced-precision Monolithic 3D CNN accelerator for trainING and
inference. SPRING supports both CNN training and inference. It uses a binary
mask scheme to encode sparsity in activations and weights. It uses the
stochastic rounding algorithm to train CNNs with reduced precision without
accuracy loss. To alleviate the memory bottleneck in CNN evaluation, especially
in training, SPRING uses an efficient monolithic 3D NVM interface to increase
memory bandwidth. Compared to GTX 1080 Ti, SPRING achieves 15.6X, 4.2X and
66.0X improvements in performance, power reduction, and energy efficiency,
respectively, for CNN training, and 15.5X, 4.5X and 69.1X improvements for
inference.
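To make the two key mechanisms concrete, here is a small NumPy sketch of (a) binary-mask sparsity encoding, which stores only the nonzero values plus a one-bit-per-element mask, and (b) stochastic rounding to a fixed-point grid, where a value is rounded up with probability equal to its fractional part so the result is unbiased in expectation. This illustrates the general techniques, not SPRING's hardware design.

```python
# Illustrative software versions of binary-mask sparsity encoding and
# stochastic rounding; not SPRING's actual hardware implementation.
import numpy as np

def mask_encode(x):
    """Store a tensor as (1-bit mask, packed nonzero values)."""
    mask = x != 0
    return mask, x[mask]

def mask_decode(mask, values):
    """Reconstruct the dense tensor from mask + nonzero values."""
    x = np.zeros(mask.shape, dtype=values.dtype)
    x[mask] = values
    return x

def stochastic_round(x, frac_bits=8, rng=np.random.default_rng(0)):
    """Round to a fixed-point grid with step 2**-frac_bits.

    Each value is rounded up with probability equal to its fractional
    part, so the expectation of the result equals the input (unbiased).
    """
    scale = 2.0 ** frac_bits
    scaled = x * scale
    floor = np.floor(scaled)
    prob_up = scaled - floor          # fractional part in [0, 1)
    rounded = floor + (rng.random(x.shape) < prob_up)
    return rounded / scale

# Example: encode a sparse activation tensor and round small gradients.
act = np.array([0.0, 1.5, 0.0, -0.25], dtype=np.float32)
mask, vals = mask_encode(act)
assert np.allclose(mask_decode(mask, vals), act)
print(stochastic_round(np.array([0.00123, -0.00123])))
```

The unbiasedness of stochastic rounding is what lets reduced-precision training avoid the systematic error accumulation that round-to-nearest introduces for small gradient updates.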
Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design
Sparse training is one of the promising techniques to reduce the
computational cost of DNNs while retaining high accuracy. In particular, N:M fine-grained structured sparsity, where only N out of every M consecutive elements can be nonzero, has attracted attention due to its hardware-friendly pattern
and capability of achieving a high sparse ratio. However, the potential to
accelerate N:M sparse DNN training has not been fully exploited, and there is a
lack of efficient hardware supporting N:M sparse training. To tackle these
challenges, this paper presents a computation-efficient training scheme for N:M
sparse DNNs using algorithm, architecture, and dataflow co-design. At the
algorithm level, a bidirectional weight pruning method, dubbed BDWP, is
proposed to leverage the N:M sparsity of weights during both forward and
backward passes of DNN training, which can significantly reduce the
computational cost while maintaining model accuracy. At the architecture level,
a sparse accelerator for DNN training, namely SAT, is developed to neatly
support both the regular dense operations and the computation-efficient N:M
sparse operations. At the dataflow level, multiple optimization methods, including interleave mapping, pre-generation of N:M sparse weights, and offline scheduling, are proposed to boost the computational efficiency of SAT. Finally,
the effectiveness of our training scheme is evaluated on a Xilinx VCU1525 FPGA
card using various DNN models and datasets. Experimental results show the SAT
accelerator with the BDWP sparse training method under a 2:8 sparse ratio achieves an average speedup of 1.75x over dense training, accompanied by a negligible accuracy loss of 0.56% on average. Furthermore, our
proposed training scheme significantly improves the training throughput by
2.97~25.22x and the energy efficiency by 1.36~3.58x over prior FPGA-based
accelerators.
Comment: To appear in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)
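To illustrate the N:M pattern itself (the generic pruning pattern, not the paper's BDWP training method), the sketch below prunes a weight matrix so that within every group of M consecutive elements along the last axis only the N largest-magnitude entries survive:

```python
# Minimal sketch of generic N:M fine-grained structured pruning: within
# each group of M consecutive elements, keep the N largest-magnitude
# entries and zero the rest. Shows the sparsity pattern only; it is not
# the paper's bidirectional BDWP training algorithm.
import numpy as np

def nm_prune(weights, n=2, m=8):
    """Apply N:M sparsity along the last axis (size must be divisible by M)."""
    groups = weights.reshape(-1, m)                      # one row per group
    # Indices of the (m - n) smallest-magnitude elements in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(weights.shape)

w = np.random.randn(4, 16).astype(np.float32)
w_sparse = nm_prune(w, n=2, m=8)
# Each group of 8 consecutive elements now has exactly 2 nonzeros.
print((w_sparse.reshape(-1, 8) != 0).sum(axis=1))
```

The fixed per-group nonzero count is what makes the pattern hardware-friendly: a decoder knows it will fetch exactly N values and N indices per group, so the sparse datapath can be statically provisioned.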