ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data
The residual block is a very common component in recent state-of-the-art CNNs
such as EfficientNet or EfficientDet. Shortcut data accounts for nearly 40% of
the feature-map accesses in ResNet152 [8], yet most previous DNN compilers and
accelerators ignore shortcut data optimization. This paper presents
ShortcutFusion, an optimization tool for FPGA-based accelerators with
reuse-aware static memory allocation for shortcut data, which maximizes on-chip
data reuse under the given resource constraints. From a TensorFlow DNN model,
the proposed design generates instruction sets for groups of nodes, using an
optimized data-reuse scheme for each residual block. The accelerator design
implemented on the
Xilinx KCU1500 FPGA card significantly outperforms NVIDIA RTX 2080 Ti, Titan
Xp, and GTX 1080 Ti for the EfficientNet inference. Compared to RTX 2080 Ti,
the proposed design is 1.35-2.33x faster and 6.7-7.9x more power efficient.
Compared to the baseline result, in which the weights, inputs, and outputs
are accessed from off-chip memory exactly once per layer, ShortcutFusion
reduces DRAM accesses by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and
EfficientNet. Given a buffer size similar to that of ShortcutMining [8], which
also mines shortcut data in hardware, the proposed work reduces off-chip
feature-map accesses by 5.27x while accessing the weights from off-chip memory
exactly once.
Comment: 12 pages
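The core of the optimization described above is a static, per-block decision about whether a residual block's shortcut feature map can stay in the on-chip buffer until the element-wise add, so it never makes a round trip to DRAM. The following is a minimal sketch of such a decision pass; it is an illustration under assumed names and sizes, not the paper's actual tool, and it assumes shortcut lifetimes do not overlap because residual blocks execute sequentially.

from dataclasses import dataclass

@dataclass
class ResidualBlock:
    name: str
    shortcut_bytes: int  # size of the shortcut tensor held between the fork and the add

def allocate_shortcuts(blocks, shortcut_budget_bytes):
    # Reuse-aware static allocation: pin a shortcut on-chip if it fits the
    # buffer budget reserved for shortcut data; otherwise spill it to DRAM.
    plan, dram_bytes_saved = {}, 0
    for blk in blocks:
        if blk.shortcut_bytes <= shortcut_budget_bytes:
            plan[blk.name] = "on-chip"                   # kept until the add
            dram_bytes_saved += 2 * blk.shortcut_bytes   # skips one write and one read-back
        else:
            plan[blk.name] = "DRAM"                      # spilled and re-fetched
    return plan, dram_bytes_saved

# Example with illustrative sizes and a 6 MiB shortcut budget.
blocks = [ResidualBlock("res2a", 4 << 20), ResidualBlock("res5a", 8 << 20)]
plan, saved = allocate_shortcuts(blocks, shortcut_budget_bytes=6 << 20)
print(plan, f"{saved / (1 << 20):.1f} MiB of DRAM traffic avoided")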
Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition
Recently, advances in deep learning have brought considerable improvements to
end-to-end speech recognition, simplifying the traditional pipeline while
producing promising results. Among the end-to-end models, the connectionist
temporal classification (CTC)-based model has attracted research interest due
to its non-autoregressive nature. However, such CTC models require heavy
computation to achieve outstanding performance. To mitigate the computational
burden, we propose a simple yet effective knowledge distillation (KD) method
for the CTC framework, named Inter-KD, which additionally transfers the
teacher's knowledge to the intermediate CTC layers of the student network.
Experimental results on LibriSpeech verify that Inter-KD outperforms
conventional KD methods. Without using any language model (LM) or data
augmentation, Inter-KD improves the word error rate (WER) from 8.85% to 6.30%
on the test-clean set.
Comment: Accepted by 2022 SLT Workshop
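The abstract above describes attaching auxiliary CTC heads to intermediate layers of the student and distilling the teacher's output into them. Below is a minimal sketch of one plausible loss formulation (PyTorch-style, with assumed tensor names and a single teacher output); it is an assumption for illustration, not the authors' released code.

import torch.nn.functional as F

def inter_kd_loss(student_final_logits,       # (T, B, V): student's final CTC head
                  student_inter_logits_list,  # list of (T, B, V): intermediate CTC heads
                  teacher_logits,             # (T, B, V): teacher's CTC output
                  targets, input_lengths, target_lengths,
                  kd_weight=1.0, temperature=1.0):
    # Standard CTC loss on the student's final layer.
    log_probs = F.log_softmax(student_final_logits, dim=-1)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)

    # Distill the teacher's frame-level distribution into each intermediate head.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1).detach()
    kd = 0.0
    for inter_logits in student_inter_logits_list:
        inter_log_probs = F.log_softmax(inter_logits / temperature, dim=-1)
        kd = kd + F.kl_div(inter_log_probs, teacher_probs, reduction="batchmean")
    kd = kd / max(len(student_inter_logits_list), 1)

    return ctc + kd_weight * kd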
EM-Network: Oracle Guided Self-distillation for Sequence Learning
We introduce EM-Network, a novel self-distillation approach that effectively
leverages target information for supervised sequence-to-sequence (seq2seq)
learning. In contrast to conventional methods, it is trained with oracle
guidance, which is derived from the target sequence. Since the oracle guidance
compactly represents the target-side context that can assist the sequence model
in solving the task, the EM-Network achieves better predictions than a model
using only the source input. To allow the sequence model to inherit the
promising capability of the EM-Network, we propose a new self-distillation
strategy, where the original sequence model can benefit from the knowledge of
the EM-Network in a one-stage manner. We conduct comprehensive experiments on
two types of seq2seq models: connectionist temporal classification (CTC) for
speech recognition and attention-based encoder-decoder (AED) for machine
translation. Experimental results demonstrate that the EM-Network significantly
advances the current state-of-the-art approaches, improving over the best prior
work on speech recognition and establishing state-of-the-art performance on
WMT'14 and IWSLT'14.
Comment: ICML 202
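As described above, the EM-Network runs a guided pass that sees oracle (target-derived) context alongside a plain source-only pass, and the plain model is distilled from the guided one within a single training stage. The sketch below is a hedged illustration of that one-stage setup; `model`, its `oracle` argument, and `task_loss_fn` are hypothetical placeholders, not the paper's interface.

import torch.nn.functional as F

def em_network_step(model, src, tgt, oracle, task_loss_fn, kd_weight=1.0):
    # Guided pass: source input plus oracle guidance derived from the target.
    guided_logits = model(src, oracle=oracle)            # (B, T, V)
    guided_loss = task_loss_fn(guided_logits, tgt)

    # Plain pass: source only, matching the inference-time setting.
    plain_logits = model(src, oracle=None)
    plain_loss = task_loss_fn(plain_logits, tgt)

    # One-stage self-distillation: the plain pass matches the guided pass.
    soft_targets = F.softmax(guided_logits, dim=-1).detach()
    kd_loss = F.kl_div(F.log_softmax(plain_logits, dim=-1),
                       soft_targets, reduction="batchmean")

    return guided_loss + plain_loss + kd_weight * kd_loss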