Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAs
To process sensor data in the Internet of Things (IoT), embedded deep
learning for 1-dimensional data is an important technique. In the past, CNNs
were frequently used because they are simple to optimise for specialised embedded
hardware such as FPGAs. This work proposes a novel LSTM cell optimisation aimed
at energy-efficient inference on end devices. Using the traffic speed
prediction as a case study, a vanilla LSTM model with the optimised LSTM cell
achieves 17534 inferences per second while consuming only 3.8 µJ per
inference on the FPGA XC7S15 from the Spartan-7 family. It achieves at least
5.4× higher throughput and is at least 1.37× more energy-efficient than
existing approaches.
Comment: 12 pages, 7 figures
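The abstract does not describe the optimised cell itself, but the vanilla LSTM recurrence whose gate computations create the throughput bottleneck can be sketched as follows (a minimal NumPy sketch of a standard LSTM step, not the paper's optimised design; all names are illustrative):

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One vanilla LSTM cell step. W, U, b stack the four gates
    (input, forget, candidate, output) along the first axis, so the
    two matrix-vector products below dominate the per-step workload --
    the part FPGA-level optimisations typically target."""
    H = h.shape[0]
    z = W @ x + U @ h + b                 # 4H pre-activations
    i = 1.0 / (1.0 + np.exp(-z[:H]))      # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2*H]))   # forget gate
    g = np.tanh(z[2*H:3*H])               # candidate cell state
    o = 1.0 / (1.0 + np.exp(-z[3*H:]))    # output gate
    c_new = f * c + i * g                 # cell state update
    h_new = o * np.tanh(c_new)            # hidden state output
    return h_new, c_new
```

Because `h_new` feeds back into the next step's matrix-vector product, each step must wait on the previous one, which is the recurrent dependency that makes throughput hard to scale on embedded hardware.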
Accelerating LSTM-based High-Rate Dynamic System Models
In this paper, we evaluate the use of a trained Long Short-Term Memory (LSTM)
network as a surrogate for a Euler-Bernoulli beam model, and then we describe
and characterize an FPGA-based deployment of the model for use in real-time
structural health monitoring applications. The focus of our efforts is the
DROPBEAR (Dynamic Reproduction of Projectiles in Ballistic Environments for
Advanced Research) dataset, which was generated as a benchmark for the study of
real-time structural modeling applications. The purpose of DROPBEAR is to
evaluate models that take vibration data as input and give the initial
conditions of the cantilever beam on which the measurements were taken as
output. DROPBEAR is meant to serve as an exemplar for emerging high-rate "active
structures" that can be actively controlled with feedback latencies of less
than one microsecond. Although the Euler-Bernoulli beam model is a well-known
solution to this modeling problem, its computational cost is prohibitive for
the time scales of interest. It has been previously shown that a properly
structured LSTM network can achieve comparable accuracy with less workload, but
achieving sub-microsecond model latency remains a challenge. Our approach is to
deploy an LSTM optimized specifically for latency on an FPGA. We designed the
model using both high-level synthesis (HLS) and hardware description language
(HDL). The lowest latency of 1.42 µs and the highest throughput of 7.87
Gops/s were achieved on the Alveo U55C platform for the HDL design.
Comment: Accepted at the 33rd International Conference on Field-Programmable Logic
and Applications (FPL)
Reconfigurable acceleration of Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences such as natural language processing, speech recognition and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature and data dependencies in RNN computations can lead to system stall, resulting in low throughput and high latency.
This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, which considers the data dependencies and high computational costs of RNNs.
The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy which allows large dimensions of weight matrices, while supporting element-based parallelism and vector-based parallelism. The presented reconfigurable RNN designs show significant speedup over CPU, GPU, and other FPGA designs.
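The latency-hiding idea behind the column-wise operation can be pictured as follows (an illustrative NumPy sketch, not the thesis's hardware design): instead of computing each output element as a complete row dot product that depends on the whole input vector, partial sums for every output are accumulated one input element at a time, so useful work starts as soon as the first input element is available.

```python
import numpy as np

def mvm_column_wise(W, x):
    """Column-wise matrix-vector multiply: each input element x[j]
    scales one whole column of W and is accumulated into every output
    partial sum at once, mirroring the latency-hiding dataflow."""
    y = np.zeros(W.shape[0])
    for j in range(W.shape[1]):
        y += W[:, j] * x[j]   # consume input element j immediately
    return y
```

The result is numerically identical to the conventional row-wise product `W @ x`; only the order of accumulation, and hence the dependency structure exposed to a pipelined datapath, changes.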
The second contribution of this thesis is a weight reuse approach for large RNN models with weights stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. Performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption and performance. Promising power efficiency improvement has been achieved in addition to speeding up over CPU and GPU designs.
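One way to picture the weight-reuse idea (a simplified NumPy sketch under assumed names, not the thesis's actual blocking-batching scheme) is that each block of weights is fetched from off-chip memory once and then applied to every pending input before the next block is loaded, amortising the memory traffic:

```python
import numpy as np

def blocked_mvm_reuse(W, xs, block_rows):
    """Sketch of weight reuse: each row block of W stands in for one
    off-chip fetch, and is reused across all inputs in xs before the
    next block is 'fetched', reducing weight-memory traffic."""
    n_out = W.shape[0]
    ys = [np.zeros(n_out) for _ in xs]
    for r0 in range(0, n_out, block_rows):
        Wb = W[r0:r0 + block_rows]       # one weight block, fetched once
        for t, x in enumerate(xs):       # reused across every input
            ys[t][r0:r0 + block_rows] = Wb @ x
    return ys
```

With a batch size of one and a single input, each weight is still read exactly once per inference; the saving appears when several inferences (or time steps) can share each fetched block, which is the trade-off the performance analysis in the thesis navigates.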
The third contribution of this thesis is a low-latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances the initiation intervals of multi-layer RNN inferences to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection.
To facilitate the use of this approach, we open-source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.
Open Access