Recurrent Neural Networks: An Embedded Computing Perspective by Rezk, Nesma M. et al.
Recurrent Neural Networks: An Embedded Computing Perspective
NESMA M. REZK, Halmstad University
MADHURA PURNAPRAJNA, Amrita Vishwa Vidyapeetham
TOMAS NORDSTRÖM, Umeå University
ZAIN UL-ABDIN, Halmstad University
Recurrent Neural Networks (RNNs) are a class of machine learning algorithms used for applications with time-series and sequential data.
Recently, a strong interest has emerged to execute RNNs on embedded devices. However, the RNN requirements of high computational
capability and large memory space are difficult to meet. In this paper, we review the existing implementations of RNN models on
embedded platforms and discuss the methods adopted to overcome the limitations of embedded systems.
We define the objectives of mapping RNN algorithms on embedded platforms and the challenges facing their realization. Then, we
explain the components of RNN models from an implementation perspective. Furthermore, we discuss the optimizations applied
to RNNs to run them efficiently on embedded platforms. Additionally, we compare the defined objectives with the implementations and
highlight some open research questions and aspects currently not addressed for embedded RNNs.
The paper concludes that applying algorithmic optimizations on RNN models is vital while designing an embedded solution. In
addition, using the on-chip memory to store the weights or having an efficient compute-load overlap is essential to overcome the high
memory access overhead. Nevertheless, the survey concludes that high performance has been targeted by many implementations,
while flexibility has been attempted less often.
Additional Key Words and Phrases: Recurrent Neural Networks (RNNs), Long Short Term Memory (LSTM), Embedded computing,
Compression, Quantization
ACM Reference format:
Nesma M. Rezk, Madhura Purnaprajna, Tomas Nordström, and Zain Ul-Abdin. 2019. Recurrent Neural Networks: An Embedded
Computing Perspective. 1, 1, Article 1 (July 2019), 36 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Recurrent Neural Networks (RNNs) are a class of Neural Networks (NNs) that deal with applications that have sequential
data inputs or outputs. RNNs capture the temporal relationship between input/output sequences by introducing feedback
to FeedForward (FF) neural networks. Thus, many applications with sequential data such as speech recognition [26],
language translation [71], and human activity recognition [20] can benefit from RNNs.
In contrast to cloud computing, edge computing can guarantee better response time and enhance security for the running
application. Augmenting edge devices with RNNs grants them the intelligence to process and respond to sequential
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 ACM. Manuscript submitted to ACM
arXiv:1908.07062v1 [cs.NE] 23 Jul 2019
problems. In this paper, we study RNN models and specifically focus on RNN optimizations and implementations on
embedded platforms at edge devices.
Fig. 1. Structure of the survey article. RNN models should run on an embedded platform at an edge device. Section 2 discusses the
objectives of such implementation and the challenges facing it. Section 3 describes the RNN models and their details. Algorithmic
optimizations (Section 4.1) are applied to RNN models and platform-specific optimizations (Section 4.2) are applied to embedded
platforms. The resulting implementations are discussed in Section 5 and compared to the objectives in Section 6.
1.1 Survey scope
This survey article focuses on embedded solutions for RNN models. The article compares the recent implementations of
RNN models on embedded systems in the literature. For a research paper to be included in the comparison, it should
satisfy the following conditions:
• It discusses the implementation of an RNN model or the recurrent layer of an RNN model.
• Its target platform is an embedded platform, such as an FPGA or an ASIC.
To provide a complete study, the survey studies the methods used for optimizing the RNN models and realizing them
on embedded systems as well.
There are other surveys that focus on one or two of the aspects covered in this article. Some
articles look at NN applications from the algorithmic point of view and treat RNN applications as one of these
NN applications [19], or study RNNs only from an algorithmic point of view [44, 62]. Another group of survey
articles looks at hardware implementations. For instance, a survey on the efficient processing of neural networks [72]
studied CNNs, CNN optimizations, and CNN implementations, and another CNN survey [75] studied CNN mappings on
FPGAs. For the purpose of hardware implementations, some articles specialized in algorithmic optimizations such
as quantization [28] and compression [11]. All algorithmic optimizations for both CNNs and RNNs were surveyed in
one article that discussed their implementations as well [76]. However, the main scope of that article was optimizations. Thus, RNN
models and their components were not studied. Furthermore, the RNN implementations under study were limited to
speech recognition applications. Our survey distinguishes itself from related work in that none of these survey
articles grouped RNN models with their optimizations and implementations in one study.
1.2 Contributions
This survey article provides the following:
• A detailed comparison of the components of RNN models from a computer architecture perspective that looks into
the computation and memory requirements.
• A study of the optimizations applied to RNNs for the purpose of executing them on embedded platforms.
• An application-independent comparison of the recent implementations of RNNs on embedded platforms.
• An identification of possible opportunities for future research.
1.3 Survey Structure
This survey article is organized as shown in Fig. 1. Section 2 defines the objectives of realizing RNN models on embedded
platforms and the challenges that make this difficult. Following that, we define a general model for RNN applications
and discuss different variations of the recurrent layers in RNN models in Section 3. However, it is difficult to run
RNN models in their original form efficiently on embedded platforms. Therefore, researchers have applied optimizations
to both the RNN model and the target platform. The optimizations applied to the RNN model are called algorithmic
optimizations and are discussed in Section 4.1, and the optimizations applied to the hardware platform are called
platform-specific optimizations and are discussed in Section 4.2. Then, in Section 5, we present the hardware implementations
of RNNs suggested in the literature. In Section 6, we compare the implementations analyzed in Section 5 with the
objectives defined in Section 2 to identify the gap between them. Finally, in Section 7, we summarize our survey.
2 OBJECTIVES AND CHALLENGES
Implementation efficiency is the primary objective in implementing RNN applications on embedded systems. Implementation
efficiency requires the implementation to have high throughput, low energy consumption, and meet real-time
requirements. A secondary objective for the implementation is flexibility. Flexibility requires the implementation
to support variations in the RNN model, allow for online training, and meet the requirements of different applications.
Meeting these objectives poses some challenges in mapping these applications onto embedded systems, such as the
large number of computations to be performed within the limited available memory. These objectives and challenges
are discussed in detail as follows.
2.1 Objectives of realizing RNNs on embedded platforms
To realize RNN models on embedded platforms, we define some objectives that will influence the solution. These
objectives are divided into implementation efficiency objectives and flexibility objectives.
2.1.1 Implementation Efficiency. Since we target embedded platforms, we consider the online execution of the
application. To satisfy the implementation efficiency objective, the implementation should have a high throughput, low
energy consumption, and meet the real-time requirements of the application. The real-time requirements of the application
pose additional demands on the throughput, energy consumption, and accuracy of the implementation. Accuracy
indicates how correct the model is in performing recognition, classification, translation, etc.
• High throughput. Throughput is a measure of performance. It measures the number of processed input/output
samples per second. Application inputs and outputs are diverse. For some applications, the input can be a frame
and the throughput can be the number of consumed frames per second, which depends on the frame size as
well. For another application, it can be the number of predicted words per second. Thus, for different input and
output types and sizes, throughput can have different units, and the throughput value can be interpreted
differently. To compare the throughput of different applications, we choose to use the number of operations per
second as the unit for throughput.
• Low energy consumption. For an implementation to be considered efficient, the energy consumption of the
implementation should meet the energy constraints of embedded platforms. To compare the energy consumption of
different implementations, we use the number of operations per second per watt as the unit for energy efficiency.
• Real-time requirements. In real-time operation, a response cannot be delayed beyond a predefined deadline, and energy
consumption cannot exceed a predefined limit. The deadline is defined by the application and is affected by
the frequency of sensor inputs and the system response time. Normally, the RNN execution should meet the
predefined deadline.
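As a minimal illustration of the two units chosen above, the following Python sketch converts a measured latency and power into operations per second and operations per second per watt, and checks a deadline. The function names and the example figures (latency, power, and deadline) are hypothetical and are not taken from any of the surveyed implementations.

```python
def throughput_gops(ops_per_inference, latency_s):
    """Throughput in giga-operations per second (GOP/s)."""
    return ops_per_inference / latency_s / 1e9


def energy_efficiency_gops_per_watt(gops, power_w):
    """Energy efficiency in GOP/s per watt."""
    return gops / power_w


def meets_deadline(latency_s, deadline_s):
    """Real-time check: the response must not be later than the deadline."""
    return latency_s <= deadline_s


# Hypothetical example: 10.5 MOP per time-step, 1 ms latency, 2 W power budget.
gops = throughput_gops(10.5e6, 1e-3)
efficiency = energy_efficiency_gops_per_watt(gops, 2.0)
on_time = meets_deadline(1e-3, 10e-3)
print(f"{gops:.2f} GOP/s, {efficiency:.2f} GOP/s/W, deadline met: {on_time}")
```

Expressing both metrics per operation, rather than per frame or per word, is what makes implementations of different applications comparable.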
2.1.2 Flexibility. The flexibility of the solution in this context is the ability of the solution to run different models
under different constraints without being restricted to one model or one configuration. For an implementation to be
flexible, we define the following requirements that should be satisfied:
• Supporting variations in RNN layer. The recurrent layers of RNN models can vary in the type of the layer
(different types of the recurrent layer are discussed in Section 3.1), the number of hidden cells, and the number
of recurrent layers.
• Supporting other NN layers. An RNN model has other types of NN layers as well. A solution that supports more
NN layers is considered a complete solution for RNN models, not only a flexible solution. Convolution
layers, fully connected layers, and pooling layers might be required in an RNN model.
• Supporting algorithmic optimization variations. Different algorithmic optimizations are applied to RNN
models to implement them efficiently on embedded systems (Section 4). Supporting at least one algorithmic
optimization in the hardware solution is in many cases mandatory for a feasible execution of RNN models
on an embedded system. Supporting more optimizations would make the hardware solution both efficient and
flexible, as it gives the algorithmic designer more choices while optimizing the model for embedded execution.
• Online training. Training is the process of setting the parameter values of the neural network. For
embedded platforms, training is typically done offline, and only inference runs on the platform at run-time. For
real-life problems, however, it is not enough to run only inference on the embedded platform; some level of
training is required at run-time as well. Online training allows the neural network to adapt to new data that
was not present in the training data and to changes in the environment. For instance, online training
is required for object recognition in autonomous cars, which achieve lifelong learning by continuously receiving new
training data from fleets of robots and updating the model parameters [73]. Another example is automated
visual monitoring systems that receive new labelled data continuously [35].
• Meeting the requirements of different application domains. One aspect of flexibility is to support the requirements
of different application domains. This is an attractive property of the implementation, as the solution can
support a wider range of applications. However, different application domains can have different performance
criteria. Some application domains, such as autonomous vehicles [78], might require very high throughput with
moderate power consumption. In contrast, other application domains, such as mobile applications [38, 88], might
require extremely low power consumption and be less strict on the throughput.
2.2 Challenges in mapping RNNs on embedded platforms
Let us now take a look at the challenges faced by hardware solutions to meet all the objectives discussed earlier in this
section.
2.2.1 Computation challenge. The main computation bottleneck in RNNs is the matrix to vector multiplications.
An LSTM layer (explained in Section 3.1) has four computation blocks, each of which has one matrix to vector multiplication.
For instance, if the size of the vector is 1280 and the size of the matrices is 1280 × 1024, each matrix to vector
multiplication requires 1280 × 1024 MAC (Multiply And Accumulate) operations. The total number of MAC operations
in the LSTM would be 4 × 1280 × 1024 = 5.24 Mega MAC, which is approximately equivalent to 10.5 MOP, since each
MAC counts as one multiplication and one addition. The high number of computations affects the throughput of the
implementation and the energy consumption as well.
Another problem in RNNs is the recurrent structure of the RNN. In RNNs, the output is fed back as an input in
such a way that the computations of each time-step need to wait for the completion of the previous time-step computations. This
temporal dependency makes it difficult to parallelize the implementation over time-steps.
2.2.2 Memory challenge. The memory required for the matrix to vector multiplications can be very large. The
size and the access time of these matrices become a memory bottleneck. Using the previous example of the LSTM layer,
it requires four matrices, each of size 1280 × 1024. Considering 32-bit floating point operations, the size of the required
memory for weights would be 32 × 4 × 1280 × 1024 bits ≈ 21 MB. Moreover, the high number of memory accesses affects
the throughput and energy consumption of the implementation [10].
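The arithmetic behind the two examples above can be reproduced with a few lines of Python. This is only a sketch; the function names are illustrative, and each MAC is counted as two operations (one multiplication and one addition).

```python
def lstm_matvec_macs(rows, cols, blocks=4):
    """MAC operations for `blocks` matrix to vector multiplications of size rows x cols."""
    return blocks * rows * cols


def weight_memory_bytes(num_weights, bits_per_weight=32):
    """Storage needed for the weights, in bytes."""
    return num_weights * bits_per_weight // 8


macs = lstm_matvec_macs(1280, 1024)   # 5,242,880 MACs, about 5.24 Mega MAC
ops = 2 * macs                        # 10,485,760 operations, about 10.5 MOP
mem = weight_memory_bytes(macs)       # 20,971,520 bytes, about 21 MB
print(macs, ops, mem)
```

Because every weight is used exactly once per time-step, the number of weights equals the number of MACs, which is why both the computation and the memory challenge grow with the same matrix sizes.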
2.2.3 Accuracy challenge. To overcome the previous two challenges (computation and memory challenges), some
optimizations can be applied to RNN models as discussed in Section 4. These optimizations may affect accuracy. The
accepted decrease in accuracy varies from one application domain to the other. For instance, in aircraft anomaly
detection, the accepted range of data fluctuation is only 5% [69].
3 RECURRENT NEURAL NETWORKS
The intelligence of humans, as well as most animals, depends on having a memory of the past. This memory can be short-term,
as when combining sounds into words, or long-term, as when the word “she” refers back to “Anne” mentioned hundreds
of words earlier. That is exactly what an RNN provides in neural networks. It adds feedback that enables using the previous
time-step outputs while processing the current time-step input. Moreover, it adds memory cells that function like
human long-term and short-term memories.
RNNs add recurrent layers to the NN (Neural Network) model. Fig. 2 presents the generic model for RNN
models, which consists of three sets of layers (input, recurrent, and output). Input layers take the sensor output and
convert it into a vector that carries the features of the input. Input layers are followed by the recurrent layers, which
are the layers with feedback. In most of the recent recurrent layers, memory cells exist as well. Afterwards,
the model is completed, like most NN models, with Fully Connected (FC) layers and an output layer that can be a
softmax layer. The FC layers and the output layer are grouped into the set of output layers in Fig. 2. In this section, we show the
different types of recurrent layers. In Appendix A, we discuss the input layers, output layers, modes of operation, and
RNN applications and their corresponding datasets.
Fig. 2. RNNs generic model.
3.1 Recurrent layers
In this section, we cover the types of recurrent layers. For each layer, we discuss the structure of the layer and the gate
equations. The most popular recurrent layer is the Long Short Term Memory (LSTM) [33]. Changes have been
proposed to the LSTM to enhance the algorithmic efficiency or to reduce the computational complexity. Enhancing
algorithmic efficiency means improving the accuracy the RNN model can achieve, as in the LSTM with peepholes and
ConvLSTM discussed in Sections 3.1.2 and 3.1.3, respectively. Reducing computational complexity means
decreasing the number of computations and the size of memory required by an LSTM so that it runs efficiently on hardware
platforms, as in the LSTM with projection, GRU, and QRNN/SRU discussed in Sections 3.1.4, 3.1.5, and 3.1.6, respectively.
These changes can be applied to the gate equations, interconnections, or even the number of gates. Finally, we compare
all the different layers in terms of the number of operations and the number of parameters in Table 1.
3.1.1 LSTM. First, we explain the LSTM (Long Short Term Memory) layer. Looking at the LSTM as a black box, the
input to the LSTM is a vector combined from the input vector $x_t$ and the previous time-step output vector $h_{t-1}$. The
output vector at time $t$ is denoted as $h_t$. Looking at the structure of an LSTM, it has a memory cell state $C_t$ and three
gates. These gates control what is to be forgotten and memorized by the memory state (forget and input gates). They
also control the part of the memory state that will be used as an output (output gate). Our description of the LSTM unit
is based on its relationship with hardware implementations. Thus, in Fig. 3a, we show the LSTM as four blocks instead
of three gates. The reason is that the LSTM is composed of four similar computation blocks.
The dominant computation in each block is the matrix to vector multiplication of the combination of $x_t$ and $h_{t-1}$ with one of the
weight matrices $\{W_f, W_i, W_c, W_o\}$. Each block is composed of
a matrix to vector multiplication followed by the addition of a bias vector $\{b_f, b_i, b_c, b_o\}$ and the application of a nonlinear
function. Each block might have element-wise multiplication operations as well. The nonlinear functions used in the
LSTM are the tanh and sigmoid functions. The four computation blocks are as follows:
• Forget gate. The role of the forget gate is to decide the information to be forgotten. The forget gate output $f_t$ is
calculated as
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \tag{1}$$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $W_f$ is the weight matrix, $b_f$ is the bias
vector, and $\sigma$ is the sigmoid function.
• Input gate. The role of the input gate is to decide which information is to be memorized. The input gate output $i_t$ is
computed similarly to the forget gate output as
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \tag{2}$$
using the weight matrix $W_i$ and the bias vector $b_i$.
• State computation. The role of this computation is to compute the new memory state $C_t$ of the LSTM cell.
First, it computes the possible values for the new state
$$\tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), \tag{3}$$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $W_c$ is the weight matrix, and $b_c$ is the bias
vector. Then, the new state vector $C_t$ is calculated by the addition of the previous state vector $C_{t-1}$ element-wise
multiplied with the forget gate output vector $f_t$ and the new state candidate vector $\tilde{C}_t$ element-wise multiplied
with the input gate output vector $i_t$ as
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \tag{4}$$
where $\odot$ is used to denote the element-wise multiplication.
• Output gate. The role of the output gate is to compute the LSTM output. First, the output gate vector $o_t$ is
computed as
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \tag{5}$$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $W_o$ is the weight matrix, $b_o$ is the bias vector,
and $\sigma$ is the sigmoid function. Then, the hidden state output $h_t$ is computed by applying the element-wise
multiplication of the output gate vector $o_t$ (which holds the decision of which part of the state is the output) to
the tanh of the state vector $C_t$ as
$$h_t = o_t \odot \tanh(C_t). \tag{6}$$
The number of computations and parameters for the LSTM are shown in Table 1. The matrix to vector multiplications
dominate the number of computations and parameters. For each matrix to vector multiplication, the input vector $x_t$ of
size $m$ and the hidden state output vector $h_{t-1}$ of size $n$ are multiplied with a weight matrix of size $(m + n) \times n$. That
requires $n(m + n)$ MAC operations, which is equivalent to $nm + n^2$ multiplications and $nm + n^2$ additions. The number
of parameters in each weight matrix is $nm + n^2$ as well. Since this computation is repeated four times within the LSTM
computation, these numbers are multiplied by four in the total number of operations and parameters for an LSTM. For
the models in the studied papers, $n$ is larger than $m$. Thus, $n$ has a dominating effect on the computational complexity
of the LSTM.
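To make the four computation blocks concrete, the following sketch implements one LSTM time-step according to Eqs. (1) to (6). It uses plain Python lists so that it runs without dependencies. The helper names (`block`, `lstm_step`, `lstm_weight_count`) and the dictionary layout of the weights are assumptions made for illustration; each weight matrix is stored as $n$ rows of length $n + m$ so that a row produces one output element.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def block(W, b, v, act):
    """One computation block: act(W v + b), with W stored as n rows of length n + m."""
    return [act(sum(w * x for w, x in zip(row, v)) + b_j) for row, b_j in zip(W, b)]


def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time-step; W and b are dicts keyed by 'f', 'i', 'c', 'o' (assumed layout)."""
    hx = h_prev + x_t                                   # concatenation [h_{t-1}, x_t]
    f_t = block(W["f"], b["f"], hx, sigmoid)            # Eq. (1), forget gate
    i_t = block(W["i"], b["i"], hx, sigmoid)            # Eq. (2), input gate
    C_cand = block(W["c"], b["c"], hx, math.tanh)       # Eq. (3), state candidate
    C_t = [f * c + i * g for f, c, i, g in zip(f_t, C_prev, i_t, C_cand)]  # Eq. (4)
    o_t = block(W["o"], b["o"], hx, sigmoid)            # Eq. (5), output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]  # Eq. (6)
    return h_t, C_t


def lstm_weight_count(n, m):
    """Weights in the four matrices, as listed in Table 1: 4n^2 + 4nm."""
    return 4 * (n * n + n * m)
```

With $n = 1024$ and $m = 256$, `lstm_weight_count` returns 5,242,880, which matches the four 1280 × 1024 matrices used as an example in Section 2.2.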
3.1.2 LSTM with peepholes. Peephole connections were added to LSTMs to make them able to count and
measure the time between events [23]. As seen in Fig. 3b, the output from the state computation is used as input for the
three gates. The LSTM gate equations change to
$$f_t = \sigma(W_f [h_{t-1}, x_t, C_{t-1}] + b_f), \tag{7}$$
$$i_t = \sigma(W_i [h_{t-1}, x_t, C_{t-1}] + b_i), \tag{8}$$
and
$$o_t = \sigma(W_o [h_{t-1}, x_t, C_t] + b_o), \tag{9}$$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $C_{t-1}$ is the state vector at time $t - 1$, $W_f$, $W_i$, and
$W_o$ are the weight matrices, and $b_f$, $b_i$, and $b_o$ are the bias vectors.
(a) Long Short Term Memory (LSTM). (b) LSTM with peepholes.
(c) LSTM with projection layer. (d) Gated Recurrent Unit (GRU).
(e) Quasi-RNN (QRNN). (f) Simple Recurrent Unit (SRU).
Fig. 3. Different variations of an RNN layer.
The number of operations and parameters for an LSTM with peepholes are shown in Table 1. There are two rows
for an LSTM with peepholes. The first one considers the multiplication with the cell state in the three gates as a matrix
to vector multiplication. In this case, the number of multiplications, additions, and weights increases by $3n^2$. However, the
weight matrices multiplied with the cell state can be diagonal matrices [25]. Thus, the matrix to vector multiplication
can be considered as an element-wise vector multiplication, which was widely used for LSTMs with peepholes later on. In
this case, the number of multiplications, additions, and weights increases by $3n$ only.
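The difference between the two peephole variants can be sketched as follows. With a diagonal peephole matrix, the term that multiplies the cell state reduces to an element-wise product with a vector, called `p` here. The names `peephole_extra_weights`, `peephole_gate`, and `p` are hypothetical and chosen only for this illustration.

```python
import math


def peephole_extra_weights(n, diagonal):
    """Extra weights for the three peephole connections: 3n^2 (full) or 3n (diagonal)."""
    return 3 * n if diagonal else 3 * n * n


def peephole_gate(W, p, b, hx, C):
    """Gate with diagonal peepholes: sigmoid(W [h, x] + p * C + b), p * C element-wise."""
    out = []
    for row, p_j, b_j, c_j in zip(W, p, b, C):
        s = sum(w * v for w, v in zip(row, hx)) + p_j * c_j + b_j
        out.append(1.0 / (1.0 + math.exp(-s)))
    return out


# For n = 1024 hidden cells, the diagonal form needs 3,072 extra weights
# instead of 3,145,728 for full peephole matrices.
print(peephole_extra_weights(1024, False), peephole_extra_weights(1024, True))
```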
3.1.3 ConvLSTM. ConvLSTM is an LSTM with all matrix to vector multiplications replaced with 2D convolutions
[66]. The idea is that if the input to the LSTM is data that holds spatial relations, like visual frames, it is better to apply
2D convolutions than matrix to vector multiplications. Convolution is capable of extracting the spatial information
from the data. The vectors $x_t$, $h_t$, and $C_t$ are replaced with 3-D tensors. One can think of each element in the LSTM
vectors as a 2D frame in the ConvLSTM tensors. Convolution weights need less memory than the weight matrices of
matrix to vector multiplications. However, they need more computations.
The number of operations and parameters required for a ConvLSTM are shown in Table 1. The calculated numbers
are for a ConvLSTM without peepholes. If peepholes are added, the number of multiplications, additions, and weights
will increase by $3n$. Since the main change from an LSTM is the replacement of the matrix to vector multiplications
with convolutions, the change in the number of operations and parameters applies to the $nm + n^2$ factor that appears
in the equations for the number of multiplications, additions, and weights. The number of multiplications and additions (MACs)
in the convolutions of the input $x_t$ and the hidden state output $h_{t-1}$ is $rcnmk_i^2 + rcn^2k_s^2$, where $r$ is the number
of rows and $c$ is the number of columns in the frames, $m$ is the number of frames in the input $x_t$, $n$ is the number of frames
in the output $h_t$ (or the number of hidden cells), $k_i$ is the size of the filter used with $x_t$, and $k_s$ is the size of the filter used
with $h_{t-1}$. The number of weights, in contrast, is the size of the filters used for the convolutions.
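The following sketch evaluates the ConvLSTM counts for one example, using the Table 1 convention in which $m$ is the number of input frames and $n$ is the number of hidden cells. The function names and the example sizes (64 × 64 frames, 3 input frames, 32 hidden cells, 3 × 3 filters) are hypothetical.

```python
def convlstm_macs(r, c, n, m, k_i, k_s, blocks=4):
    """MACs of the convolutions in all blocks: 4 (r c n m k_i^2 + r c n^2 k_s^2)."""
    return blocks * (r * c * n * m * k_i ** 2 + r * c * n * n * k_s ** 2)


def convlstm_weights(n, m, k_i, k_s, blocks=4):
    """Filter weights in all blocks: 4 (n m k_i^2 + n^2 k_s^2)."""
    return blocks * (n * m * k_i ** 2 + n * n * k_s ** 2)


macs = convlstm_macs(r=64, c=64, n=32, m=3, k_i=3, k_s=3)
weights = convlstm_weights(n=32, m=3, k_i=3, k_s=3)
# Each filter weight is reused at every one of the r * c frame positions.
print(macs, weights, macs // weights)
```

The ratio of MACs to weights equals $rc$, which illustrates why convolutions need less memory but more computations than matrix to vector multiplications, where each weight is used only once per time-step.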
3.1.4 LSTM with projection layer. The LSTM is changed by adding one extra step after the last gate [61]. This
step is called a projection layer. The output of the projection layer is the output of the LSTM and the feedback input to
the LSTM in the next time-step, as shown in Fig. 3c. Simply, a projection layer is like an FC layer. The purpose of this
layer is to allow an increase in the number of hidden cells while controlling the total number of parameters. This is
achieved by using a projection layer that has a number of units $p$ less than the number of hidden cells. The dominating
factor in the number of computations and the number of weights will be $4pn$ instead of $4n^2$, where $n$ is the number of
hidden cells and $p$ is the size of the projection layer. Since $p < n$, $n$ can increase with a smaller effect on the size of the
model and the number of computations.
In Table 1, we show the number of operations and parameters required for an LSTM with a projection layer. In
the original paper proposing the projection layer, the authors considered the output layer of the RNN as a part of the
LSTM [61]. The output layer was an FC layer that changes the size of the output vector from $n$ to $o$, where $o$ is the
output size. Thus, there is an extra $no$ term in the number of multiplications, additions, and weights. After adding the
projection layer, the extra term will be $po$, as the LSTM output vector is of size $p$ now. We put the extra terms between
curly brackets to show that they are optional terms. The projection layer can be applied to an LSTM with peepholes as
well. In Table 1, we show the number of operations and parameters for an LSTM with peepholes and a projection layer.
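The effect of the projection layer on the model size can be checked with the weight expressions from Table 1. The sketch below is only illustrative; the function names and the example sizes ($n = 1024$, $m = 256$, $p = 256$) are assumptions, and the optional output layer terms $\{no\}$ and $\{po\}$ are included only when an output size `o` is given.

```python
def lstm_weights(n, m, o=0):
    """Plain LSTM weights: 4n^2 + 4nm, plus the optional output layer term n*o."""
    return 4 * n * n + 4 * n * m + n * o


def lstm_proj_weights(n, m, p, o=0):
    """LSTM with projection: 4np + 4nm + np, plus the optional output layer term p*o."""
    return 4 * n * p + 4 * n * m + n * p + p * o


plain = lstm_weights(1024, 256)               # 5,242,880 weights
projected = lstm_proj_weights(1024, 256, 256)  # 2,359,296 weights
print(plain, projected, projected / plain)
```

In this example, a projection size of a quarter of the hidden size reduces the number of weights to 45% of the plain LSTM, because the recurrent term shrinks from $4n^2$ to $4np$ while the projection itself adds only $np$ weights.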
3.1.5 GRU. The Gated Recurrent Unit (GRU) was proposed in 2014 [13]. The main purpose was to make the recurrent
layer able to capture the dependencies of different time scales in an adaptive manner [14]. However, the fact that the
GRU has only two gates (three computational blocks) instead of three (four computational blocks) like the LSTM makes
it more computationally efficient and more promising for high-performance hardware implementations. The three
computational blocks are as follows:
• Reset gate. The reset gate is used to decide whether to use the previously computed output or treat the input
as the first symbol in a sequence. The reset gate output vector $r_t$ is computed as
$$r_t = \sigma(W_r [h_{t-1}, x_t]), \tag{10}$$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $W_r$ is the weight matrix, and $\sigma$ is the
sigmoid function.
• Update gate. The update gate decides how much of the output is updated. The output of the update gate
$z_t$ is computed in the same way as the reset gate output $r_t$, using the weight matrix $W_z$, as
$$z_t = \sigma(W_z [h_{t-1}, x_t]). \tag{11}$$
• Output computation. The role of this computation is to compute the hidden state vector $h_t$. First, it computes
the possible values for the hidden state vector $\tilde{h}_t$
$$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t]), \tag{12}$$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, and $W$ is the weight matrix. The reset gate
output vector $r_t$ decides how much of $h_{t-1}$ can contribute to the computation of $\tilde{h}_t$. Then, the hidden state
vector $h_t$ is computed from the old output $h_{t-1}$ and the new possible output $\tilde{h}_t$, relying on the update gate
output vector $z_t$ (which decides how much of the output will be updated), as
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t. \tag{13}$$
Similar to the LSTM, we visualize a GRU in Fig. 3d as three blocks, not two gates, as it has three blocks of matrix to
vector multiplications. In Table 1, we show the number of operations and parameters required for a GRU. The number
of operations and parameters is approximately 0.75 times the number of operations and parameters in the LSTM.
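One GRU time-step following Eqs. (10) to (13) can be sketched as below, again with plain Python lists. The function names are illustrative, and each weight matrix is assumed to be stored as $n$ rows of length $n + m$. The weight-count helper uses the Table 1 expression $3n^2 + 3nm$.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x_t, h_prev, W_r, W_z, W):
    """One GRU time-step; no bias vectors, as in Eqs. (10)-(13)."""
    hx = h_prev + x_t                                          # [h_{t-1}, x_t]
    r_t = [sigmoid(s) for s in matvec(W_r, hx)]                # Eq. (10), reset gate
    z_t = [sigmoid(s) for s in matvec(W_z, hx)]                # Eq. (11), update gate
    rhx = [r * h for r, h in zip(r_t, h_prev)] + x_t           # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(s) for s in matvec(W, rhx)]            # Eq. (12), candidate
    return [(1.0 - z) * h + z * g
            for z, h, g in zip(z_t, h_prev, h_cand)]           # Eq. (13)


def gru_weight_count(n, m):
    """Weights in the three matrices, as listed in Table 1: 3n^2 + 3nm."""
    return 3 * (n * n + n * m)
```

For the weights, the ratio to the LSTM count $4n^2 + 4nm$ is exactly 0.75; for the multiplications and additions it is only approximate because of the lower-order terms in $n$.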
3.1.6 QRNN and SRU. The purpose of the Quasi-RNN (QRNN) [5] and the Simple Recurrent Unit (SRU) [41] is to make
the recurrent unit friendlier for computation and parallelization. The bottleneck in the LSTM/GRU is the matrix to vector
multiplications. It is difficult to parallelize this part because it depends on the previous time-step output $h_{t-1}$ and the
previous time-step state $C_{t-1}$. In the QRNN/SRU, $h_{t-1}$ and $C_{t-1}$ are removed from all matrix to vector multiplications and
appear only in element-wise operations. The QRNN has two gates and a memory state. It has three heavy computational
blocks. In these blocks, only the input vector $x_t$ is used as input. The QRNN replaces the matrix to vector multiplications with
1D convolutions with inputs along the time-step dimension. For instance, if the filter dimension is two, the convolution is
applied on $x_t$ and $x_{t-1}$. The first computation block computes the candidate for the new state $z_t$
$$z_t = \tanh(W_z * x_t), \tag{14}$$
where $W_z$ is the convolutional filter bank and “$*$” denotes the convolution operation. The second computation
block computes the forget gate vector $f_t$, which decides what to forget from the old state, using the equation
$$f_t = \sigma(W_f * x_t), \tag{15}$$
Table 1. Comparing LSTM and its variations.

| RNN layer | Operations: Multiplications | Operations: Additions | Operations: Nonlinear | Parameters: Weights | Parameters: Biases |
|---|---|---|---|---|---|
| LSTM | $4n^2 + 4nm + 3n$ ($= \text{LSTM}_{mul}$) | $4n^2 + 4nm + 5n$ ($= \text{LSTM}_{add}$) | $5n$ ($= \text{LSTM}_{nonlinear}$) | $4n^2 + 4nm$ ($= \text{LSTM}_{weights}$) | $4n$ ($= \text{LSTM}_{biases}$) |
| LSTM + peepholes | $7n^2 + 4nm + 3n$ ($= \text{LSTM}_{mul} + 3n^2$) | $7n^2 + 4nm + 5n$ ($= \text{LSTM}_{add} + 3n^2$) | $5n$ ($= \text{LSTM}_{nonlinear}$) | $7n^2 + 4nm$ ($= \text{LSTM}_{weights} + 3n^2$) | $4n$ ($= \text{LSTM}_{biases}$) |
| LSTM + peepholes (diagonalized) | $4n^2 + 4nm + 6n$ ($= \text{LSTM}_{mul} + 3n$) | $4n^2 + 4nm + 8n$ ($= \text{LSTM}_{add} + 3n$) | $5n$ ($= \text{LSTM}_{nonlinear}$) | $4n^2 + 4nm + 3n$ ($= \text{LSTM}_{weights} + 3n$) | $4n$ ($= \text{LSTM}_{biases}$) |
| LSTM + projection | $4np + 4nm + 3n + np + \{po\}$ ($= \text{LSTMProj}_{mul}$) | $4np + 4nm + 5n$ ($= \text{LSTMProj}_{add}$) | $5n$ ($= \text{LSTM}_{nonlinear}$) | $4np + 4nm + np + \{po\}$ ($= \text{LSTMProj}_{weights}$) | $4n$ ($= \text{LSTM}_{biases}$) |
| LSTM + peepholes (diagonalized) + projection | $4np + 4nm + 6n + np + \{po\}$ ($= \text{LSTMProj}_{mul} + 3n$) | $4np + 4nm + 8n$ ($= \text{LSTMProj}_{add} + 3n$) | $5n$ ($= \text{LSTM}_{nonlinear}$) | $4np + 4nm + 3n + np + \{po\}$ ($= \text{LSTMProj}_{weights} + 3n$) | $4n$ ($= \text{LSTM}_{biases}$) |
| ConvLSTM | $4rcnmk_i^2 + 4rcn^2k_s^2 + 3n$ | $4rcnmk_i^2 + 4rcn^2k_s^2 + 5n$ | $5n$ | $4nmk_i^2 + 4n^2k_s^2$ | $4n$ |
| GRU | $3n^2 + 3nm + 3n$ ($\approx 0.75\,\text{LSTM}_{mul}$) | $3n^2 + 3nm + 5n$ ($\approx 0.75\,\text{LSTM}_{add}$) | $3n$ ($= 0.6\,\text{LSTM}_{nonlinear}$) | $3n^2 + 3nm$ ($= 0.75\,\text{LSTM}_{weights}$) | - |
| QRNN | $3knm + 3n$ | $3knm + 2n$ | $3n$ | $3knm$ | - |
| SRU | $3nm + 6n$ | $3nm + 8n$ | $2n$ | $3nm + 2n$ | $2n$ |

In the table, we use the following symbols: $m$ is the size of the input vector $x_t$, $n$ is the number of hidden cells in $h_t$, $p$ is the size of the projection
layer, $o$ is the size of the output layer, $r$ is the number of rows in a frame, $c$ is the number of columns in a frame, $k_i$ is the size of the 2D filter applied to $x_t$,
$k_s$ is the size of the 2D filter applied to $h_{t-1}$, and $k$ is the size of the 1D convolution filter. The term $\{po\}$ is an optional term, as discussed in Section 3.1.4.
where $W_f$ is the convolutional filter bank. The last computational block is the output gate vector $o_t$, which decides what information from the current state $C_t$ will be used in the output, using the equation
$$o_t = \sigma(W_o * x_t), \tag{16}$$
where $W_o$ is the convolutional filter bank. After these three blocks are computed, $C_t$ and $h_t$ are obtained through element-wise multiplications. $C_t$ is computed from the old state $C_{t-1}$ and the candidate for the new state $z_t$, controlled by the forget gate vector $f_t$ (which decides what is forgotten and what is new), as
$$C_t = f_t \odot C_{t-1} + (1 - f_t) \odot z_t. \tag{17}$$
The QRNN output $h_t$ is computed by the element-wise multiplication of the current state $C_t$ and the output gate vector $o_t$ (which decides what information from the state appears in the output) as
$$h_t = o_t \odot C_t. \tag{18}$$
Fig. 3e visualizes the QRNN layer. The number of operations and parameters required for a QRNN is shown in Table 1, where k is the size of the convolution filter.
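As a minimal sketch of Eqs. (17) and (18), the element-wise part of one QRNN time-step could be written as follows in plain Python, assuming the three gate vectors have already been produced by the convolutional blocks. The function name `qrnn_pool_step` and the list-based vectors are illustrative choices, not taken from any of the reviewed implementations.

```python
def qrnn_pool_step(z_t, f_t, o_t, c_prev):
    """One QRNN step after the convolutions: Eq. (17) followed by Eq. (18)."""
    # Eq. (17): blend the old state with the candidate, weighted by the forget gate.
    c_t = [f * c + (1.0 - f) * z for f, c, z in zip(f_t, c_prev, z_t)]
    # Eq. (18): the output gate selects which parts of the state are exposed.
    h_t = [o * c for o, c in zip(o_t, c_t)]
    return c_t, h_t


# Hypothetical gate values for a layer with two hidden cells.
c_t, h_t = qrnn_pool_step(z_t=[0.5, -1.0], f_t=[1.0, 0.0],
                          o_t=[0.5, 1.0], c_prev=[2.0, 2.0])
print(c_t, h_t)
```

Only the previous state enters this step, so the heavy convolutions for all time-steps can, in principle, be computed before the recurrent part runs.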
The SRU has two gates and a memory state as well. Its three heavy computational blocks are matrix to vector multiplications, not convolutions. The two gates (forget and update gates) are computed using the equations
$$f_t = \sigma(W_f x_t + v_f \odot C_{t-1} + b_f) \tag{19}$$
and
$$r_t = \sigma(W_r x_t + v_r \odot C_{t-1} + b_r), \tag{20}$$
respectively. In both gate calculations, $C_{t-1}$ is used, but it is only consumed by element-wise multiplications. The parameter vectors $v_f$ and $v_r$ are learned together with the weight matrices and biases during training.
The third computational block is the computation of the state $C_t$,
$$C_t = f_t \odot C_{t-1} + (1 - f_t) \odot (W x_t), \tag{21}$$
where $C_{t-1}$ is the old state vector and $x_t$ is the input vector. The computation is controlled by the forget gate output vector $f_t$ (which decides what is forgotten and what is new). Finally, the SRU output $h_t$ is computed from the new state $C_t$ and the input vector $x_t$, controlled by the update gate (which decides which parts of the output are taken from the state and which are taken from the input), using the equation
$$h_t = r_t \odot C_t + (1 - r_t) \odot x_t. \tag{22}$$
Fig. 3f visualizes the SRU. The output computation is done in the same block as the update gate. It is worth observing that in both QRNN and SRU, $h_{t-1}$ is not used in the equations; only the old state $C_{t-1}$ is used. The number of operations and parameters for an SRU is shown in Table 1.
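Eqs. (19) to (22) can be sketched as a single time-step in plain Python. This is an illustrative sketch only: the helper names (`sigmoid`, `matvec`, `sru_step`) are hypothetical, and Eq. (22) is assumed to require the input and state vectors to have the same length.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(matrix, vector):
    """Dense matrix to vector multiplication, the heavy block of the SRU."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sru_step(x_t, c_prev, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """One SRU time-step following Eqs. (19)-(22); assumes len(x_t) == len(c_prev)."""
    wx, wfx, wrx = matvec(W, x_t), matvec(W_f, x_t), matvec(W_r, x_t)
    # Eqs. (19) and (20): the old state only enters through element-wise products.
    f_t = [sigmoid(a + v * c + b) for a, v, c, b in zip(wfx, v_f, c_prev, b_f)]
    r_t = [sigmoid(a + v * c + b) for a, v, c, b in zip(wrx, v_r, c_prev, b_r)]
    # Eq. (21): new state.
    c_t = [f * c + (1.0 - f) * u for f, c, u in zip(f_t, c_prev, wx)]
    # Eq. (22): output mixes the state and the input.
    h_t = [r * c + (1.0 - r) * x for r, c, x in zip(r_t, c_t, x_t)]
    return c_t, h_t


# With all-zero parameters both gates equal 0.5, which makes the step easy to check by hand.
zeros_m = [[0.0, 0.0], [0.0, 0.0]]
zeros_v = [0.0, 0.0]
c_t, h_t = sru_step([2.0, 4.0], [1.0, -1.0], zeros_m, zeros_m, zeros_m,
                    zeros_v, zeros_v, zeros_v, zeros_v)
print(c_t, h_t)
```

Because the three matrix to vector products depend only on $x_t$, they could be batched over time-steps, leaving just the element-wise recurrence in the sequential path.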
In Table 1, we compare the LSTM and all of its variations in terms of the memory required for the weights and the number of computations per time-step. This comparison helps in understanding the hardware platform needed for each of them. To make the difference between the LSTM and the other variants easier to see, we also express the operations and parameters of each variant in terms of the LSTM operations and parameters where they are comparable.
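To make the formulas of Table 1 concrete, the sketch below evaluates three of its rows for a given input size m and hidden size n. The function names and the dictionary layout are illustrative assumptions; the expressions themselves are copied from the table.

```python
def lstm_cost(n, m):
    """LSTM row of Table 1 (per time-step operations and weight count)."""
    return {"mul": 4 * n * n + 4 * n * m + 3 * n,
            "add": 4 * n * n + 4 * n * m + 5 * n,
            "nonlinear": 5 * n,
            "weights": 4 * n * n + 4 * n * m}


def gru_cost(n, m):
    """GRU row of Table 1."""
    return {"mul": 3 * n * n + 3 * n * m + 3 * n,
            "add": 3 * n * n + 3 * n * m + 5 * n,
            "nonlinear": 3 * n,
            "weights": 3 * n * n + 3 * n * m}


def sru_cost(n, m):
    """SRU row of Table 1."""
    return {"mul": 3 * n * m + 6 * n,
            "add": 3 * n * m + 8 * n,
            "nonlinear": 2 * n,
            "weights": 3 * n * m + 2 * n}


n = m = 100  # hypothetical layer size
for name, cost in (("LSTM", lstm_cost(n, m)), ("GRU", gru_cost(n, m)), ("SRU", sru_cost(n, m))):
    # Weight memory assuming 32-bit (4-byte) values.
    print(name, cost, "weight bytes:", 4 * cost["weights"])
```

For n = m the GRU needs exactly three quarters of the LSTM weights, which matches the 0.75 LSTM_weights entry in the table.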
4 OPTIMIZATIONS FOR RNNS
RNN applications, like all neural network applications, rely on intensive operations between high-precision values. Thus, they demand high computational power and large memory bandwidth, and they consume considerable energy. Due to the resource constraints of embedded platforms, there is a need to decrease the computation and memory requirements of RNN applications. Researchers have been working on two types of optimizations. The first type is related to the RNN algorithms themselves, where RNN algorithms are modified to decrease the computation and memory requirements with no effect, or only a limited effect, on accuracy. The second type is related to the embedded platform, where hardware improvements are applied to increase the parallelization of the application and decrease the overhead of memory accesses. Fig. 4 illustrates the two types of optimizations.
Fig. 4. Optimizations applied to RNN applications, with section numbers indicated, and a comparison of the effect of different algorithmic optimizations on memory and computation requirements.
4.1 Algorithmic optimizations
In this section, we discuss the different algorithmic optimizations performed on the recurrent layer of an RNN application to decrease its computation and memory needs. We discuss how these optimizations are carried out and how accuracy is affected. Applying optimizations directly at inference time may degrade the accuracy to an unacceptable degree. Thus, training the network is often required to recover the accuracy: optimizations may be applied before training, or they may be applied after the model is trained, in which case the model is retrained for some epochs (training cycles).
Different datasets use different units for measuring accuracy. For some units, higher values are better, and for others, lower values are better. To provide a unified measure of the change in accuracy, we calculate the percentage of change in accuracy from the original value to the value after applying the optimization method as
$$a_\Delta = (-1)^{\alpha} \, \frac{V_a - V_b}{V_b} \times 100, \tag{23}$$
where $a_\Delta$ is the effect of the optimization method on accuracy as a percentage of the original accuracy value, $V_b$ is the value of accuracy before optimization, $V_a$ is the value of accuracy after optimization, and $\alpha$ is an indicator that has a value of 0 if higher accuracy values are better and 1 if lower accuracy values are better. Thus, if the baseline accuracy achieved by the original model without optimizations is 96% and the accuracy after optimization is 94%, the effect of optimization on accuracy is −2.1%. If the accuracy after optimization is 98%, the effect of optimization on accuracy is +2.1%. If the optimization has no effect on accuracy, then the effect on accuracy is 0%.
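Eq. (23) and the worked numbers above can be reproduced with a few lines of Python. The function name `accuracy_change` and the boolean flag that stands in for the indicator α are illustrative choices.

```python
def accuracy_change(before, after, lower_is_better=False):
    """Percentage change in accuracy according to Eq. (23)."""
    alpha = 1 if lower_is_better else 0
    return (-1) ** alpha * (after - before) / before * 100.0


print(round(accuracy_change(96.0, 94.0), 1))   # accuracy dropped from 96% to 94%
print(round(accuracy_change(96.0, 98.0), 1))   # accuracy rose from 96% to 98%
# For an error-rate style metric, a rise from 10 to 12 counts as a degradation.
print(round(accuracy_change(10.0, 12.0, lower_is_better=True), 1))
```

With this sign convention a positive result always means the optimized model is better, regardless of the unit used by the dataset.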
As shown in Fig. 4, the algorithmic optimizations are quantization, compression, DeltaRNN, and nonlinear function approximation. The first three optimizations are applied to the matrix to vector multiplications, and the last one is applied to the computation of nonlinear functions. The table in Fig. 4 compares quantization, compression, and DeltaRNN with respect to their effect on memory requirements, number of memory accesses, number of computations, and MAC operation cost. The MAC operation cost can be decreased by decreasing the precision of the operands.
4.1.1 Quantization. Quantization is decreasing the precision of the operands. Quantization can be applied to the network parameters only, or to the activations and inputs as well. When discussing quantization, there are three important factors to consider. First, the number of bits used for weights, biases, activations, and inputs. Second, the quantization method, which defines how to store the full-precision values in fewer bits. Third, whether quantization was applied during training from the beginning or the model was retrained after applying quantization. These three factors affect accuracy, but they are not the only ones: accuracy is also affected by the model architecture, the dataset, and other factors. However, these three factors are the ones most closely related to applying quantization to the RNN model.
In discussing quantization methods, we cover fixed-point quantization, multiple binary codes quantization, and exponential quantization. We also study whether the selection of the quantized value is deterministic or stochastic. In deterministic methods, the selection is based on static thresholds. In contrast, selection in stochastic methods can rely on probabilities and random numbers. Relying on random numbers is more difficult to implement in hardware.
Quantized values representation. There are different methods for representing quantized values. Next, we explain three commonly used methods.
(1) Fixed-point quantization. In this quantization method, the 32-bit floating-point values are quantized into a fixed-point representation denoted $Q_{m.f}$, where m is the number of integer bits and f is the number of fractional bits. The total number of bits required is k. The sign bit may be included in the number of integer bits [52] or added as an extra bit on top of m and f [59]. For instance, in the first case [52], Q1.1 is used to represent a 2-bit fixed-point format that has three values {−0.5, 0, 0.5}. This quantization method is also called Pow2-ternarization [68]. Usually, fixed-point quantization is deterministic: for each floating-point value there is one quantized fixed-point value defined by an equation (i.e., it is rule-based). Fixed-point quantization is done by clipping the floating-point value between the minimum and the maximum boundaries and rounding it.
(2) Exponential quantization. Exponential quantization quantizes a value into an integer power of two. Exponential quantization is very beneficial for the hardware, as multiplying by an exponentially quantized value is equivalent to a shift operation if the second operand is a fixed-point value and to an addition to the exponent if the second operand is a floating-point value [52, 80]. Exponential quantization can be both deterministic and stochastic.
(3) Binary and multi-bit codes quantization. The lowest precision in RNNs is the binary precision [34]. Each full-precision value is quantized into one of two values. The most common two values are {−1, +1}. It can also
be {0, +1}, {−0.5, 0}, {−0.5, +0.5}, or any combination of two values [52]. Binarization can be deterministic or stochastic. For deterministic binarization, the sign function can be used. For stochastic binarization, the selection depends on probabilities to compute the quantized value
$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma_h(x), \\ -1 & \text{with probability } 1 - p, \end{cases} \tag{24}$$
where $\sigma_h$ is the "hard sigmoid" function defined as
$$\sigma_h(x) = \mathrm{clip}\left(\frac{x + 1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right). \tag{25}$$
Binarization has great value for hardware computation, as it turns multiplication into addition and subtraction. The greatest value comes with full binarization, where both the weights and the activations have binary precision. In this case, it is possible to concatenate weights and activations into 32-bit operands and perform multiple MAC operations using XNOR and bit-count operations. Full binarization can reduce the memory requirement by 32× and decrease the computation time considerably [57].
Adding one more value to binary precision is called ternarization. Weights in a ternarized NN are restricted to three values. These three values can be {−1, 0, 1} [42]. Power-two ternarization, discussed earlier under fixed-point quantization, is an example of ternarization with three different values {−0.5, 0, 0.5}. Both deterministic and stochastic ternarization have been applied to RNNs [52]. Having four possible values for quantization is called quaternarization. In quaternarization, the possible values can be {−1, −0.5, +0.5, +1} [2]. In order to retain the computational benefit of binary weights and activations while using more bits, multiple binary codes {−1, +1} have been used for quantization [82]. For instance, two-bit quantization has four possible values {{−1, −1}, {−1, 1}, {1, −1}, {1, 1}}.
The most common method for deterministic quantization is uniform quantization. Uniform quantization may not be the best quantization method, as it may change the distribution of the original data, especially for non-uniform data, which can affect the accuracy. One solution is balanced quantization [91]. In balanced quantization, the data is divided into groups containing the same amount of data before quantization to ensure a balanced distribution of data after quantization. Other suggested solutions treat quantization as an optimization problem, such as greedy quantization, refined greedy quantization, and alternating multi-bit quantization [29, 82].
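The three representations described above can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the method of any cited work: the fixed-point routine counts the sign bit in the integer bits and uses a symmetric range (so that Q1.1 yields {−0.5, 0, 0.5}), the exponential routine rounds in the log domain, and the packing convention for the XNOR and bit-count product (bit 1 encodes +1) is a hypothetical choice. All function names are illustrative.

```python
import math
import random


def fixed_point_quantize(x, int_bits, frac_bits):
    """Clip x to the representable range, then round to the nearest grid point."""
    step = 2.0 ** -frac_bits
    limit = 2.0 ** (int_bits - 1) - step       # symmetric range (assumption)
    clipped = max(-limit, min(limit, x))
    return round(clipped / step) * step


def exponential_quantize(x):
    """Map x to a signed integer power of two (rounding the exponent)."""
    if x == 0.0:
        return 0.0
    exponent = round(math.log2(abs(x)))
    return math.copysign(2.0 ** exponent, x)


def hard_sigmoid(x):
    """Eq. (25)."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def binarize(x, rng=None):
    """Sign binarization, or stochastic binarization (Eq. 24) when rng is given."""
    if rng is None:
        return 1 if x >= 0 else -1
    return 1 if rng.random() < hard_sigmoid(x) else -1


def pack(values):
    """Pack a {-1, +1} vector into an integer, bit i set when values[i] is +1."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n                      # matches minus mismatches


print([fixed_point_quantize(v, 1, 1) for v in (0.3, 0.1, -0.9)])
print(exponential_quantize(0.3), exponential_quantize(-3.0))
print(binarize(0.2), binarize(-0.2), binarize(2.0, random.Random(0)))
print(xnor_popcount_dot(pack([1, -1, 1, 1]), pack([1, 1, -1, 1]), 4))
```

The last function shows why full binarization is attractive: the whole dot product reduces to one XNOR and one bit count per packed word.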
Training/Retraining. As mentioned earlier, there are three options for limiting the accuracy loss caused by quantization. The first is to apply quantization during training [16], where quantized weights are used during forward and backward propagation only, while full-precision weights are used for the parameter update step of Stochastic Gradient Descent (SGD). Copies of both the quantized and the full-precision weights are kept, so that it can be decided at inference time which one to use [52]. In the second approach, quantization is applied to pre-trained parameters and the RNN model is retrained to decrease the accuracy loss. The authors of one RNN implementation [59] adopted a mix of the training and retraining approaches, where only the activations were not quantized from the beginning. Activations were quantized after training, and then the model was retrained for 40 epochs. The third approach is to use quantized parameters without training or retraining. This is very common with 16-bit fixed-point quantization. Usually, training happens on training servers and quantization is applied on the inference platform without the opportunity to retrain the model. It is also very common to combine 16-bit fixed-point quantization with other optimization techniques such as circulant matrices compression [77], pruning [8], and DeltaRNN (discussed later in Section 4.1.3) [22].
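The first option, quantization during training, can be sketched on a toy linear model. The sketch assumes binary weights obtained with the sign function and passes the gradient computed with the quantized weights straight to the full-precision copy; this pass-through choice, the squared-error loss, and all names are assumptions made for illustration.

```python
def sign(w):
    return 1.0 if w >= 0 else -1.0


def train_quantized(samples, w_full, lr=0.1, epochs=5):
    """Train y = w . x with binarized weights in the forward and backward pass."""
    for _ in range(epochs):
        for x, target in samples:
            w_q = [sign(w) for w in w_full]                 # quantized copy
            y = sum(wq * xi for wq, xi in zip(w_q, x))      # forward pass uses w_q
            err = y - target                                # gradient of 0.5 * err^2 w.r.t. y
            grads = [err * xi for xi in x]                  # backward pass uses w_q
            # The update is applied to the full-precision weights only.
            w_full = [w - lr * g for w, g in zip(w_full, grads)]
    return w_full, [sign(w) for w in w_full]


# Toy data generated by the binary weights [+1, -1].
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0)]
w_full, w_q = train_quantized(samples, [-0.3, 0.3])
print(w_full, w_q)
```

Small gradient steps accumulate in the full-precision copy until a weight crosses zero and its quantized value flips, which is why both copies are kept during training.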
Table 2. Effect of quantization on accuracy.

| Method | W/A | RNN type | Dataset | Training | Accuracy | Paper |
| --- | --- | --- | --- | --- | --- | --- |
| Fixed point | 2/2 | 1*BiLSTM*128 | OCR dataset | With training | +0.7% | [59] |
| Fixed point | P2T/real | 4*BiLSTM*250 | WSJ | With training | +6% | [52] |
| Exponential | EQ/real | 1*GRU*200 | TIDIGITS | With training | +1% | [52] |
| Mixed | EQ+fixed6/8 | 3*BiLSTM*512 | AN4 | Retraining | +10.7%¹ | [80] |
| Binary | B/real | 1*GRU*128 | IMDB | With training | −5.3% | [2] |
| Binary | B/real | ConvLSTM | Moving MNIST | With training | −100%² | [2] |
| Binary | B/1 | 1*BiLSTM*128 | OCR dataset | With training | −3.7% | [59] |
| Binary | B/4 | 1*BiLSTM*128 | OCR dataset | With training | +1% | [59] |
| Binary | B/real | 1*GRU*200/400 | TIDIGITS | With training | −80.9% | [52] |
| Ternary | T/real | 1*GRU*128 | IMDB | With training | −4% | [2] |
| Ternary | T/real | ConvLSTM | Moving MNIST | With training | −50%² | [2] |
| Ternary | T/real | 1*GRU*200 | TIDIGITS | With training | −1.6% | [52] |
| Quaternary | Q/real | 1*GRU*128 | IMDB | With training | −1.7% | [2] |
| Quaternary | Q/real | ConvLSTM | Moving MNIST | With training | −75%² | [2] |
| Multi-binary | 3/3 | 1*LSTM*512 | WikiText2 | Retraining | +1.4% | [82] |
| Multi-binary | 2/2 | 1*LSTM*512 | WikiText2 | Retraining | −6% | [82] |
| Multi-binary | 1/4 | 2*LSTM*256 | PTB | With training | −7.8% | [85] |

¹ Accuracy is also affected by the compression scheme and the nonlinear function approximation used in this work.
² We calculate the error at the tenth frame (third predicted frame).
In the table, we use the following symbols: W/A for number of bits for weights/number of bits for activations, P2T for power-two ternarization, EQ for exponential quantization, B for binary quantization, T for ternary quantization, and Q for quaternary quantization.
Effect on accuracy. In Table 2, we gather the research work that experimented with the quantization of RNN models. Not all of the studied works have a hardware implementation, as their purpose was to show that quantization can be done while keeping accuracy high. In the table, we list the three factors affecting accuracy discussed earlier (number of bits, quantization method, and training), together with the type of recurrent layer (LSTM, GRU, ...) and the dataset. Then, we show the effect of quantization on accuracy, computed with respect to the accuracy achieved with full-precision parameters and activations using Eq. (23). For the number of bits, we use W/A, where W is the number of bits used for weights and A is the number of bits used for activations. For the RNN type, we list the recurrent layers used in the experiments. All recurrent layers are explained in Section 3.1, except the BiLSTM (Bidirectional LSTM), which is explained in Appendix A. We use x*y*z, where x is the number of layers, y is the type of the layers, and z is the number of hidden cells in each layer. For training, if quantization was applied during training from the beginning, we write "With training". If quantization was applied after training and the model was later retrained, we write "Retraining". A positive accuracy value means that quantization enhanced the accuracy, and a negative value means that quantization caused the model to be less accurate.
Each experiment in Table 2 is applied to a different model and a different dataset, and might have used different training methods. Thus, conclusions from Table 2 cannot be generalized. Still, we can discuss some observations. In the listed experiments, fixed-point quantization, exponential quantization, and mixed quantization had no negative effect on accuracy; accuracy actually increased after applying these quantization methods. Regarding binary quantization, the negative effect on accuracy varied within small ranges in some experiments [2, 59]. Experiments showed that using more bits for activations can
enhance the accuracy [59]. Using binary weights with ConvLSTM is not solely responsible for the poor accuracy reached: ternary and quaternary quantization reached poor accuracy with ConvLSTM as well [2]. Nevertheless, the quantization methods applied to ConvLSTM were successful when applied to LSTM and GRU in the same work [2].
4.1.2 Compression. Compression is decreasing the model size by decreasing the number of parameters/connections. As the number of parameters decreases, the memory requirement and the number of computations decrease. Table 3 compares different compression methods. The compression ratio is the ratio between the number of parameters of the model before and after applying the compression method. Accuracy degradation is computed using Eq. (23).
(1) Pruning. Pruning is the process of eliminating redundancy. Computations in RNNs are mainly dense matrix operations. To improve computation time, dense matrices are transformed into sparse matrices, which affects accuracy. However, carefully choosing the method for transforming a dense matrix into a sparse matrix may result in only a limited impact on accuracy while providing significant gains in computation time. In the memory domain especially, a reduction in memory footprint along with computation optimization is essential to making RNNs viable. However, pruning has two undesirable effects. The first is a loss in the regularity of memory organization due to sparsification of the dense matrix, and the second is a loss in accuracy on account of the removal of weights and nodes from the model under consideration. The transformation from a regular matrix computation to an irregular one often requires additional hardware and computation time to manage data. To compensate for the loss in accuracy caused by pruning, methods such as retraining have been applied. The following paragraphs describe the pruning methods and the compensation techniques found in the literature. Table 3 summarizes the pruning methods and their impact on sparsity and accuracy. Sparsity in this context refers to the share of empty entries in the matrices; in Table 3, it indicates the share of entries eliminated by the pruning method used. Within RNNs, pruning can be classified as magnitude pruning for weight matrix sparsification and structure-based pruning.
Magnitude pruning. Magnitude pruning relies on eliminating all weight values below a certain threshold. In this method, the choice of the right threshold is crucial for minimizing the negative impact on accuracy, so the methods below differ mainly in how they identify this threshold.
• Weight sub-groups. For weight matrix sparsification, the RNN model is trained to eliminate redundant weights and retain only the weights that are necessary. There are three schemes for creating weight sub-groups when selecting the pruning threshold [64]: class-blind, class-uniform, and class-distribution. In class-blind, the x% of weights with the lowest magnitude are pruned, irrespective of (blind to) their class. In class-uniform, the x% of weights with the lowest magnitude are pruned uniformly in every class. In class-distribution, weights within the standard deviation of their class are pruned.
• Hard thresholding. Hard thresholding [30, 53] identifies the threshold value that keeps accuracy unaffected. ESE [30] uses hard thresholding during training to learn which weights contribute to prediction accuracy.
• Gradual thresholding. This method [48] uses a set of weight masks and a monotonically increasing threshold. Each weight is multiplied by its corresponding mask. The process is iterative: the masks are updated by setting to zero all parameters that are lower than the threshold. As a result, in contrast to hard thresholding, this technique prunes weights gradually within the training process.
• Block pruning. In block pruning [49], magnitude thresholding is applied to blocks of a matrix instead of individual weights during training. The weight with the maximum magnitude is used as a representative for the entire block. If the representative weight is below the current threshold, all the elements in the
block are set to zero. As a result, block sparsification mitigates the indexing overhead, irregular memory accesses, and incompatibility with array data-paths present in unstructured random pruning.
• Grow and prune. Grow and prune [18] combines gradient-based growth [17] and magnitude-based pruning [30] of connections. The training starts with a randomly initialized seed architecture. Next, in the growth phase, new connections, neurons, and feature maps are added based on the average gradient over the entire training set. Once the required accuracy has been reached, redundant connections and neurons are eliminated based on magnitude pruning.
Structure pruning. Modifying the structure of the network by eliminating nodes or connections is termed structure pruning. Connections that may be important are learned in the training phase or pruned using probability-based techniques.
• Network sparsification. Pruning through network sparsification [58] introduces sparsity for the connections at every neuron output, such that each output has the same number of inputs. Further, an optimization strategy is formulated that replaces the non-zero elements in each row with those of the highest absolute value. This step avoids any retraining, which may be compute-intensive and difficult in privacy-critical applications. However, the impact of this pruning method on accuracy is not directly measured. Instead, design space exploration over different levels of sparsity measures the quality of the output and gives an indication of the relationship between the level of approximation and the application-level accuracy.
• Drop-out. DeepIoT [84] compresses neural network structures into smaller dense matrices by finding the minimum number of non-redundant hidden elements without affecting the performance of the network. For LSTM networks, Bernoulli random probabilities are used for dropping out hidden dimensions used within the LSTM blocks.
Retaining accuracy levels. Pruning alongside training, as well as retraining, has been employed to retain the accuracy levels of the pruned models. Retraining works on the pruned weights and/or the pruned model until convergence to a specified level of accuracy is achieved.
Handling irregularity in pruned matrices. Pruning to maximize sparsity results in a loss in the regularity (or structure) of memory organization due to sparsification of the original dense matrix. Pruning techniques that are architecture-agnostic mainly result in unstructured, irregular sparse matrices. Methods such as load balancing-aware pruning [30] and block pruning (explained earlier under magnitude pruning) [49] have been applied to minimize these effects. Load balancing-aware pruning [30] works towards ensuring the same sparsity ratio among all the pruned sub-matrices, thereby achieving an even distribution of non-zero weights. These techniques introduce regularity into the sparse matrix to improve performance and avoid index tracking.
(2) Structured matrices.
Circulant matrices. A circulant matrix is a matrix in which each column (row) is a cyclic shift of the preceding column (row) [80]. It is considered a special case of Toeplitz-like matrices. The weight matrices are reorganized into circulant matrices. The redundancy of values in the matrices reduces the space complexity of the weight matrices. Circulant matrices can reduce the memory space required for large matrices by nearly 4×.
Block-circulant matrices. Instead of transforming the whole weight matrix into a single circulant matrix, the matrix is transformed into a set of circulant sub-matrices [43, 77]. Fig. 5 shows a weight matrix that has 32 parameters. The block size of the circulant sub-matrices is 4. The weight matrix is transformed into two circulant sub-matrices with 8 parameters in total (4 parameters each). The compression ratio is 4×, where 4 is the block size. Thus, having larger
block sizes will result in a higher reduction in model size. However, a high compression ratio may degrade the prediction accuracy. In addition, the matrix to vector multiplications can be replaced by DFT and IDFT operations, which reduces the computational complexity to $O(k \log k)$.
Fig. 5. Regular weight matrix transformed into block-circulant sub-matrices of block size 4 [77].
(3) Tensor decomposition. Tensors are multidimensional arrays. A vector is a tensor of rank one, a 2-D matrix is a tensor of rank two, and so on. Tensors can be decomposed into lower-rank tensors, and tensor operations can be approximated using these decompositions in order to decrease the number of parameters in the NN model. Canonical polyadic (CP) decomposition, Tucker decomposition, and tensor train decomposition are different techniques used to apply tensor decomposition [74]. Tensor decomposition techniques can be applied to FC layers [51], convolution layers [39], and recurrent layers [74]. In Table 3, we show an example of applying tensor decomposition to a GRU layer using the CP technique. Tensor decomposition techniques can achieve a high compression ratio compared to other compression methods.
(4) Weight sharing. Weight sharing replaces each weight with an approximation obtained through k-means clustering. For instance, deep compression [32] uses Huffman coding with weight sharing to reduce the length of the weight indices. Huffman coding relies on the occurrence probability of the weights used: more common symbols are encoded with fewer bits. However, we did not find any work applying weight sharing to RNNs.
(5) Knowledge distillation. Knowledge distillation is a method that replaces a large model with a smaller model that should behave like the large model. Starting from a large model (teacher) with trained parameters and a dataset, the small model (student) is trained to behave like the large model [36]. In addition to knowledge distillation, pruning can be applied to the resulting model to increase the compression ratio, as shown in Table 3.
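Three of the compression methods above lend themselves to a compact illustration: class-blind magnitude pruning, block pruning, and multiplication by a block-circulant matrix. The Python sketch below is written under simplifying assumptions: pruning is applied once to a fixed matrix rather than during training, and the circulant product uses the direct $O(k^2)$ form per block instead of the DFT/IDFT formulation. All function names are hypothetical.

```python
def magnitude_prune(weights, fraction):
    """Class-blind style: zero the given fraction of entries with the smallest magnitude."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    count = int(len(magnitudes) * fraction)
    if count == 0:
        return [list(row) for row in weights]
    threshold = magnitudes[count - 1]
    # Ties at the threshold are pruned too, so slightly more entries may be removed.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def block_prune(weights, block, threshold):
    """Zero every block x block tile whose largest magnitude is below the threshold."""
    pruned = [list(row) for row in weights]
    for r0 in range(0, len(weights), block):
        for c0 in range(0, len(weights[0]), block):
            rows = range(r0, min(r0 + block, len(weights)))
            cols = range(c0, min(c0 + block, len(weights[0])))
            representative = max(abs(weights[r][c]) for r in rows for c in cols)
            if representative < threshold:
                for r in rows:
                    for c in cols:
                        pruned[r][c] = 0.0
    return pruned


def block_circulant_matvec(blocks, x, k):
    """Multiply a block-circulant matrix by x while storing k values per k x k block.

    blocks[r][c] holds the first column of block (r, c); entry (i, j) of that
    block is blocks[r][c][(i - j) % k].
    """
    y = []
    for block_row in blocks:
        for i in range(k):
            acc = 0.0
            for c, first_col in enumerate(block_row):
                for j in range(k):
                    acc += first_col[(i - j) % k] * x[c * k + j]
            y.append(acc)
    return y


print(magnitude_prune([[0.1, -2.0], [0.5, -0.05]], 0.5))
print(block_prune([[0.1, -0.2, 3.0, 0.1], [0.05, 0.1, 0.2, -0.1]], 2, 0.5))
# A 4 x 8 matrix (32 parameters) stored as two circulant blocks of 4 values each.
blocks = [[[1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 0.0, 0.0]]]
print(block_circulant_matvec(blocks, [1.0, 0.0, 0.0, 0.0, 5.0, 6.0, 7.0, 8.0], 4))
```

The last example mirrors the 32-parameter case of Fig. 5: only 8 values are stored, giving the 4× compression ratio mentioned above, while block pruning keeps the zeros grouped in regular tiles.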
4.1.3 DeltaRNN. Delta Recurrent Neural Networks (DeltaRNN) [50] exploit the temporal relation between input sequences. For two consecutive input vectors $x_t$ and $x_{t-1}$, the difference between corresponding values in the two vectors may be zero or close to zero. The same holds for the hidden state output vector. The idea is to skip the computations for input/hidden state values whose difference from the corresponding values of the last time step is less than a pre-defined threshold called delta (ϕ). The gain is a decrease in the number of computations and the number of memory accesses required by the recurrent unit. However, the memory requirement does not decrease: all the weights must still be stored, because we cannot predict which computations will be skipped.
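The skipping idea can be sketched for a single matrix to vector product. In this hypothetical Python sketch, a running result is kept and only the weight columns whose input changed by at least the threshold are fetched and accumulated; the reference value of an input is updated only when its change is propagated. The class name, its attributes, and this particular bookkeeping are assumptions made for illustration.

```python
class DeltaMatVec:
    """Incrementally maintained product W @ x that skips small input changes."""

    def __init__(self, weights, threshold):
        self.weights = weights
        self.threshold = threshold
        self.x_ref = [0.0] * len(weights[0])   # last input values that were propagated
        self.y = [0.0] * len(weights)          # running value of W @ x_ref

    def step(self, x):
        used_columns = 0
        for j, (xj, ref) in enumerate(zip(x, self.x_ref)):
            delta = xj - ref
            if abs(delta) < self.threshold:
                continue                       # skip: no weight fetch, no MACs
            for i, row in enumerate(self.weights):
                self.y[i] += row[j] * delta
            self.x_ref[j] = xj
            used_columns += 1
        return list(self.y), used_columns


unit = DeltaMatVec([[1.0, 2.0], [3.0, 4.0]], threshold=0.1)
print(unit.step([1.0, 0.05]))   # only the first input changed enough
print(unit.step([1.05, 1.0]))   # only the second input changed enough
```

In both calls a single weight column is used instead of two, but the full weight matrix still has to be stored, as noted above.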
Table 3. Effect of compression techniques on accuracy.

| Method | Technique | RNN type | Dataset | Compression ratio (sparsity for pruning) | Training | Accuracy | Paper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Magnitude pruning | Weight subgroups | 4*LSTM*1024 + 4*LSTM*1024 | WMT'14 | 5× (80%) to 10× (90%) | Retraining | +2.1% to −1.7% | [64] |
| Magnitude pruning | Hard thresholding | 2*LSTM*512 | TIMIT | 1.1× (10%) to 1.3× (24%) | None | 0% | [53] |
| Magnitude pruning | Gradual pruning | 2*LSTM*1500 | PTB | 20× (90%) | With training | −2.3% | [92] |
| Magnitude pruning | Block pruning | 7*BiLSTM*2560 | Speech Data² | 12.5× (92%) | With training | −12% | [48] |
| Magnitude pruning | Grow&Prune | 1*H-LSTM*512¹ | COCO | 8× (87.5%) to 19× (95%) | With training | 0% to −2.2% | [18] |
| Structured pruning | Network sparsification | 2*LSTM*512 | COCO | 2× (50%) | None | 0% | [58] |
| Structured pruning | Drop-out | 5*BiLSTM*512 | LibriSpeech ASR corpus | 10× (90%) | None | 0% | [84] |
| Structured matrices | Circulant | 3*BiLSTM*512 | AN4 | nearly 4× | With training | +10.7%³ | [80] |
| Structured matrices | Block-circulant | 2*LSTM*1024 | TIMIT | 15.9× | With training | −5.5% | [77] |
| Tensor decomp. | CP | 1*GRU*512 | Nottingham | 101× to 481× | With training | −1% to −5% | [74] |
| Knowledge distillation | Plain | 4*LSTM*1000 | WMT'14 | 3× | With training | −1% | [36] |
| Knowledge distillation | +Pruning | 4*LSTM*1000 | WMT'14 | 26× | With training + Retraining | −5.1% | [36] |

¹ H-LSTM is hidden LSTM. Non-linear layers are added in the gate computations (explained in Appendix A).
² The dataset name is not mentioned in the paper.
³ Accuracy is also affected by quantization (Table 2) and the nonlinear function approximation used in this work.
Table 4. DeltaRNN effect on accuracy.

| RNN model | Dataset | Training | Accuracy | Speedup | Paper |
| --- | --- | --- | --- | --- | --- |
| 1*GRU*512 | TIDIGITS | With training | −1.6% | 5.7× | [22] |
| CNN + 1*GRU*512 | Open-driving | With training | 0% | 100× | [50] |
The value of the delta threshold affects both accuracy and speedup. In Table 4, we summarize the effect of DeltaRNN on accuracy for two different datasets. On some occasions, it was necessary to train the RNN using the delta algorithm before inference to get better accuracy at inference time. Furthermore, the speedup gained by the delta algorithm at a given delta value is not static; it depends on the relation between the input sequences. The highest speedup was reached with video frames (open driving dataset) as input data, as seen in Table 4. However, the time-consuming CNN placed before the recurrent layer masks the speedup gained by DeltaRNN. Thus, the 100× speedup in GRU execution drops to a non-significant speedup for the whole model. On the other hand, CNN-Delta [7] applied a similar delta algorithm to CNNs. Applying delta algorithms to both the recurrent layers and the CNN layers might be beneficial.
4.1.4 Nonlinear function approximation. Nonlinear functions are the second most used operations in the RNN after matrix to vector multiplications, as can be observed in Table 1. The nonlinear functions used in the recurrent layers are tanh and sigmoid. Both functions require floating-point division and exponential operations, which are expensive in terms of hardware resources. In order to implement an RNN efficiently, nonlinear function approximations are implemented in hardware. The approximation should strike a balance between high accuracy and low hardware cost. Next, we present the approximations used in the implementations under study.
Look-up tables (LUTs): Replacing the computation of nonlinear functions with look-up tables is the fastest method [55]. The input range is divided into segments with constant output values. However, achieving high accuracy requires large LUTs, which consume a large silicon area and are therefore not practical. Several methods have been proposed to decrease the size of the LUTs while preserving high accuracy.
Piece-wise linear approximation: This approximation method divides the nonlinear function curve into a number of line segments. Any line segment can be represented by only two values: the slope and the bias. Thus, for each segment, only two values are stored in the LUTs. The number of segments affects both the accuracy and the size of the LUTs, so it is chosen to keep the accuracy high while keeping the LUTs as small as possible. The computation of the nonlinear function reduces to a single comparison, multiplication, and addition, which might be implemented using shifts and additions. Compared to the look-up tables method, piece-wise linear approximation requires fewer LUTs and more computations.
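A minimal Python sketch of the piece-wise linear scheme is shown below for the sigmoid function. The choices of 16 uniform segments over the range [−8, 8], saturation outside that range, and the function names are assumptions made for illustration; a hardware design would pick these to match its own accuracy and area budget.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_pwl_table(func, lo, hi, segments):
    """Store one (slope, bias) pair per uniform segment of [lo, hi]."""
    width = (hi - lo) / segments
    table = []
    for i in range(segments):
        x0 = lo + i * width
        x1 = x0 + width
        slope = (func(x1) - func(x0)) / width
        bias = func(x0) - slope * x0
        table.append((slope, bias))
    return table


def pwl_eval(x, table, lo, hi, sat_lo, sat_hi):
    """Select the segment by comparison, then do one multiplication and one addition."""
    if x <= lo:
        return sat_lo
    if x >= hi:
        return sat_hi
    width = (hi - lo) / len(table)
    index = min(int((x - lo) / width), len(table) - 1)
    slope, bias = table[index]
    return slope * x + bias


TABLE = build_pwl_table(sigmoid, -8.0, 8.0, 16)
points = [i / 10.0 for i in range(-100, 101)]
max_err = max(abs(pwl_eval(p, TABLE, -8.0, 8.0, 0.0, 1.0) - sigmoid(p)) for p in points)
print(len(TABLE), max_err)
```

The table holds 32 numbers in total, and halving the segment width roughly quarters the interpolation error, which is the accuracy versus LUT size trade-off described above.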
Hard tanh / Hard sigmoid: Hard tanh and hard sigmoid are two examples of piece-wise linear approximation with
three segments. The first segment is saturation to zero or −1 (zero in the case of sigmoid and −1 in the case of tanh), the last
segment is saturation to one, and the middle segment is a line segment that joins the two horizontal lines.
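A minimal sketch of the two three-segment functions is given below. The slope of the middle segment of the hard sigmoid (0.25 here, the slope of the exact sigmoid at zero) is an assumption; implementations differ in this choice, and the text above does not fix it.

```python
def hard_sigmoid(x, slope=0.25):
    """Three segments: 0 on the left, slope * x + 0.5 in the middle, 1 on the right.

    The default slope is an illustrative assumption.
    """
    return min(1.0, max(0.0, slope * x + 0.5))

def hard_tanh(x):
    """Three segments: -1 on the left, the identity in the middle, 1 on the right."""
    return min(1.0, max(-1.0, x))
```

With a power-of-two slope such as 0.25, the multiplication in the middle segment can be replaced by a shift, which is one reason these functions are cheap in hardware.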
There is a variation of piece-wise linear approximation called piece-wise non-linear approximation. The line segments
are replaced by nonlinear segments, so the use of multipliers cannot be avoided as it can in the linear version. This makes the
linear approximation preferable in hardware design.
RALUT: Another method to reduce the size of the LUTs is to use RALUTs (Range Addressable Look-Up Tables) [47].
In RALUTs, each group of inputs is mapped into a single output.
4.2 Platform specific optimizations
In this section, we discuss the optimizations performed at the hardware level to run an RNN model efficiently. These
optimizations may be related to computation or memory. Computation-related optimizations
speed up the computations to achieve higher throughput, while memory-related optimizations
improve memory usage and accesses to reduce the memory overhead.
Manuscript submied to ACM
22 Rezk, et al.
4.2.1 Compute-specific. The bottleneck in RNN computations is the matrix to vector multiplications. Furthermore,
it is difficult to fully parallelize matrix to vector multiplications over time-steps as the RNN model has a feedback
part. Each time-step computation has to wait for the preceding time-step computation to complete, because it uses the
hidden state output of that time-step as an input.
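The dependency described above can be illustrated with a small sketch. For brevity it uses a plain tanh recurrent unit rather than an LSTM or GRU, and the names matvec and rnn_forward are our own. The input projections do not depend on the hidden state and could be computed for all time-steps at once, while the recurrent part has to be evaluated step by step.

```python
import math

def matvec(matrix, vector):
    """Plain matrix to vector multiplication on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rnn_forward(w_in, w_rec, inputs, h0):
    """Run h_t = tanh(W_in x_t + W_rec h_{t-1}) over a sequence of inputs."""
    # No dependence on the hidden state: every entry of this list could be
    # computed in parallel (or as one matrix to matrix multiplication).
    input_proj = [matvec(w_in, x) for x in inputs]

    h = h0
    outputs = []
    for proj in input_proj:
        # Depends on the previous hidden state: strictly sequential.
        rec = matvec(w_rec, h)
        h = [math.tanh(p + r) for p, r in zip(proj, rec)]
        outputs.append(h)
    return outputs
```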
• Loop unrolling: Loop unrolling is used to allow pipelining of loop computations. There are two kinds of loop
unrolling used in RNN implementations. The first is inner loop unrolling, where the inner loop of the matrix
to vector multiplication is unrolled [27, 89]. The second kind is unrolling over time-steps. An RNN needs to run
for multiple time-steps for each task to be completed, and the computation of the recurrent unit can be unrolled
over time-steps [59, 60]. However, this cannot be fully parallelized, as discussed earlier. Only computations
that rely on inputs can be parallelized, while computations relying on hidden state outputs are performed in
sequence. One solution to this problem is to use QRNN or SRU, as discussed in Section 3.1. In QRNN and
SRU, the matrix to vector multiplications do not operate on the hidden state output and thus can be fully
parallelized over unrolled time-steps [70].
• Tiling: Tiling divides one matrix to vector multiplication into multiple smaller matrix to vector multiplications.
Usually, tiling is used when a hardware solution has built-in support for a matrix to vector multiplication of a
specific size in one clock cycle. When the input vector or the weight matrix is larger than the
vector or the matrix supported by the hardware, tiling is used to divide the matrix to vector multiplication
so that it is done on the hardware in multiple cycles [27, 80]. For further understanding of the tiling concept, see the
visual illustration of tiling in Appendix B.
• Hardware sharing: In the GRU recurrent layer, r_t and h̃_t have to be executed in sequence, as the computation
of h̃_t depends on r_t, as shown in Eq. (12). Thus, the computation of r_t and h̃_t is the critical path in the
GRU computation, while z_t can be computed in parallel as it is independent of h̃_t and r_t. The same hardware
can be shared for computing r_t and z_t to save hardware resources [9].
• Analog computing: Analog signal processing (ASP) is a good candidate for neural network accelerators [90]. Analog
neural networks [46] and analog CNNs [4] have been studied recently. Interestingly, RNN implementations
using ASP have started to attract research attention [90]. RNNs, and NNs in general, can benefit from ASP because a single
signal in ASP can replace multiple bits in DSP (Digital Signal Processing), matrix to vector multiplication is faster in ASP,
and non-linear functions are easier to implement. However, the interfacing through DACs (Digital to Analog
Converters) and ADCs (Analog to Digital Converters) can obstruct achieving energy and area efficiency.
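As an illustration of the tiling idea from the list above, the following sketch splits a matrix to vector multiplication into fixed-size tiles and accumulates the partial results. The function name and the tile sizes are hypothetical; on real hardware each tile would correspond to one invocation of the fixed-size multiplier rather than a software loop.

```python
def tiled_matvec(matrix, vector, tile_rows, tile_cols):
    """Matrix to vector multiplication computed tile by tile."""
    rows, cols = len(matrix), len(vector)
    result = [0.0] * rows
    for r0 in range(0, rows, tile_rows):
        for c0 in range(0, cols, tile_cols):
            # One tile: at most tile_rows x tile_cols multiply-accumulates,
            # i.e. the size assumed to be supported natively by the hardware.
            for r in range(r0, min(r0 + tile_rows, rows)):
                partial = 0.0
                for c in range(c0, min(c0 + tile_cols, cols)):
                    partial += matrix[r][c] * vector[c]
                result[r] += partial      # accumulate partial sums across tiles
    return result
```

The result is independent of the tile size; only the number of hardware invocations and the order in which the weights are consumed change.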
4.2.2 Memory specific. For the processing of an RNN algorithm, memory is needed to store weight matrices,
biases, inputs, and activations, where the weight matrices have the highest memory requirement. The first decision
related to memory is where to store the weights. If all the weights are stored in the off-chip memory, accessing the
weights will be of the highest cost with respect to both latency and energy [27, 31].
On-chip memory: After applying the algorithmic optimizations introduced in Section 4.1, the memory requirement
of the RNN layer decreases, which increases the possibility of storing the weights in the on-chip memory. However,
this restricts the model size that can run on the embedded platform. On-chip memory has been
used for storing the weights by many implementations [22, 40, 59, 77, 80].
Hybrid memory: Storing all the weights in the on-chip memory restricts the size of the model executed on the
embedded solution. Storing part of the weights in the on-chip memory and the rest in the off-chip
memory might be the solution [8].
Manuscript submied to ACM
Recurrent Neural Networks: An Embedded Computing Perspective 23
In addition to maximizing the use of on-chip memory and using algorithmic optimizations, some researchers use
techniques to decrease the number and the cost of memory accesses.
• Multi time-step parallelization: The fact that QRNN and SRU remove the hidden state output from the matrix to
vector multiplications can be exploited to allow multi time-step parallelization [70]. Multi time-step
parallelization converts multiple matrix to vector multiplications into fewer matrix to matrix multiplications.
This method decreases the number of memory accesses by reusing the weights for the computations of multiple
time-steps.
• Reordering weights: Storing the weights in memory in the same order as they are used in the computation decreases
the memory access time [27]. The parameters are reordered in memory in a way that ensures that the
memory accesses are sequential.
• Compute/load overlap: To compute matrix to vector multiplications, weights need to be
loaded from memory and then used for computations. The total time is the sum of the access time and the
computation time. To decrease this time, memory access and computation can be overlapped. This overlap can
be done by fetching the weights for the next time-step while doing the computation of the current time-step.
The overlap requires extra buffers for storing the weights of the next time-step while
the weights of the current time-step are in use [30].
• Doubling memory fetching: In this method, twice the weights required for the computation are fetched [24].
Half of the weights are consumed by the computations of the current time-step t and the rest are buffered for
the next time-step t + 1. Doubling memory fetching can halve the required memory bandwidth.
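The benefit of the compute/load overlap from the list above can be estimated with a simple timing model. The sketch below assumes two weight buffers, a constant load time and a constant compute time per time-step, and arbitrary time units; the function names are ours and the model ignores all other overheads.

```python
def sequential_time(load, compute, steps):
    """Each time-step first loads its weights and then computes."""
    return steps * (load + compute)

def overlapped_time(load, compute, steps):
    """The load for step t + 1 runs while step t is being computed.

    Only the first load is fully exposed; afterwards each step costs the
    larger of the two durations, and the last compute finishes the run.
    """
    return load + (steps - 1) * max(load, compute) + compute

# Example with hypothetical durations: load = 3, compute = 5, 4 time-steps.
t_seq = sequential_time(3, 5, 4)   # 32 time units
t_ovl = overlapped_time(3, 5, 4)   # 23 time units
```

When loading is slower than computing, the overlapped schedule becomes memory bound and the total time approaches steps * load, which is why the overlap is usually combined with optimizations that reduce the amount of data to be fetched.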
Domain-wall memory (DWM): DWM is a new technology for non-volatile memories proposed by Parkin et al.
from IBM in 2008 [54]. DWM technology is based on magnetic spin [15, 63, 79, 86]. Information is stored by setting
the spin orientation of magnetic domains in a nanoscopic permalloy wire. Multiple magnetic domains can occupy one
wire, which is called race-tracking. Race-tracking allows the representation of up to 64 bits. The density of DWM is
expected to exceed that of SRAM by 30x and that of DRAM by 10x [3]. Using DWM in an RNN accelerator can achieve better
performance and lower energy consumption [63].
Processing In Memory (PIM): PIM removes the data fetching problem by performing the computation in memory.
Thus, the memory access overhead is eliminated. In such an architecture, a memory bank is divided into three kinds of
sub-arrays: memory sub-arrays, buffer sub-arrays, and processing sub-arrays, which are used as conventional
memory, data buffers, and processing units, respectively. ReRAM-based PIM has been used to accelerate
CNNs [12, 67, 87] and RNNs [45]. ReRAM that supports only XNOR and bit counting operations is sufficient for an
RNN implementation if binary or multi-bit code quantization (Section 4.1.1) has been applied [85].
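To see why XNOR and bit counting suffice for binary quantization, consider the following sketch of a dot product between two vectors with elements in {-1, +1}. The encoding (bit 1 for +1, bit 0 for -1) and the function names are assumptions made for illustration; with multi-bit codes, several such binary dot products would be combined with scaling factors.

```python
def pack_bits(values):
    """Encode a vector over {-1, +1} as an integer: bit i is 1 when values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    XNOR marks the positions where the signs agree (product +1); the
    remaining positions contribute -1, so dot = matches - (n - matches).
    """
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask     # XNOR restricted to n bits
    matches = bin(agree).count("1")       # bit counting (popcount)
    return 2 * matches - n
```

For example, for a = [1, -1, 1, 1, -1, -1, 1, -1] and b = [1, 1, -1, 1, -1, 1, 1, -1] the signs agree in five of the eight positions, so the dot product is 2 * 5 - 8 = 2, the same value obtained with multiply-accumulate operations.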
5 RNN IMPLEMENTATIONS ON HARDWARE
In the previous section, we discussed the optimizations applied to decrease the computation and
memory requirements of RNN models. In this section, we study the recent implementations of RNN applications on
embedded platforms. The implementations are divided into FPGA, ASIC, and other implementations. In the study, the
optimizations applied in each implementation are presented. However, the effect of each optimization is not shown
separately. Instead, the outcomes of applying the mix of optimizations are discussed with respect to the objectives
presented in Section 2.
Manuscript submied to ACM
24 Rezk, et al.
First, for the implementation efficiency objective, the implementations are compared in terms of throughput, energy
consumption, and meeting the real-time requirements. Then, for the flexibility objective, implementations that support
variations in the models, online training, or different application domains are discussed.
In Appendix C, we show information about the implementations under study: the authors' names, the name of
the architecture (if named), the affiliation, and the year of publication. Table 5 and Table 6 present the implementations
under study. Table 5 shows implementations performed on FPGAs, while Table 6 shows implementations performed
on other platforms. Each implementation has an index. The index starts with “F” for FPGA implementations, “A” for
ASIC implementations, and “C” for other implementations. For each implementation, the tables show the platform,
the RNN model, the applied optimizations, and the runtime performance. For the RNN model, in most cases
only the recurrent layers are shown, as most of the implementation papers provided the implementation for these
layers only. The recurrent layers are written in the format x*y*z, where x is the number of recurrent layers, y is
the type of recurrent layers (e.g., LSTM, GRU), and z is the number of hidden cells in each layer. If the model has
different modules (e.g., two different LSTM models or LSTM + CNN), we mention the number of executed time-steps of
the RNN model. Both algorithmic and platform optimizations are shown in the tables. All the optimizations found in
the tables are explained in Section 4 under the same keywords used in the tables. For quantized models,
“Quantization X” is written in the optimizations column, where X is the number of bits used to store the weights. The
effective throughput and the energy efficiency given in the tables are discussed in detail in the next sub-section.
5.1 Implementation eiciency
To study the eciency of the implementations understudy, we focus on three aspects. e rst is the throughput, the
second is energy consumption, and the third is meeting the real-time requirements.
5.1.1 Eective Throughput. To compare the throughput of dierent implementations, we use the number of
operations per second (OP/s) as a measure. Some of the papers surveyed did not directly state the throughput. For
these papers, we have tried to deduce the throughput from other information given. e method used to deduce the
throughput values is explained in Appendix D. One other aspect to consider is that compression optimization results
in decreasing the number of operations in the model before running it. Consequently, the number of operations per
second is not a fair indicator for the implementation eciency. For this case, the throughput is calculated using the
number of operations in the dense RNN model, not the compressed model. us, we call it Eective roughput.
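The idea of effective throughput can be made concrete with a small calculation. The sketch below is only an illustration: the operation-counting convention (two operations per multiply-accumulate, and only the four gate matrix to vector products of an LSTM layer) and the example sizes and timings are assumptions, and the method actually used for the surveyed papers is the one described in Appendix D.

```python
def lstm_dense_ops(input_size, hidden_size):
    """Operations per time-step of one dense LSTM layer (assumed convention).

    Four gates, each multiplying a hidden_size x (input_size + hidden_size)
    matrix by a vector; one multiply-accumulate is counted as two operations.
    """
    macs = 4 * hidden_size * (input_size + hidden_size)
    return 2 * macs

def effective_throughput(dense_ops_per_step, seconds_per_step):
    """OP/s credited to the implementation, based on the dense model."""
    return dense_ops_per_step / seconds_per_step

# Hypothetical example: a 1*LSTM*1024 layer with 512 inputs, 10 us per time-step.
dense_ops = lstm_dense_ops(512, 1024)
q_eff = effective_throughput(dense_ops, 10e-6)

# If compression removed 7/8 of the operations, the hardware executes only
# dense_ops / 8 operations per step, yet it is credited with the dense count.
q_executed = effective_throughput(dense_ops / 8, 10e-6)
```

In this example the effective throughput is eight times the executed throughput, which reflects the work that the compressed model avoids while producing the output of the original model.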
For a fair comparison of the ASIC implementations, we have scaled their results to 65nm technology at 1.1 volt using
the general scaling equations in Rabaey's book [56]. If the voltage value is not mentioned in the paper, we assume the
standard voltage for the implementation technology. For instance, since A6 was implemented in 65nm, we assume the
voltage value to be 1.1 volt.
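As a worked example of this scaling, the sketch below uses one common reading of general scaling: with a dimension scaling factor S and a voltage scaling factor U, the delay scales with 1/S and the power with 1/U^2, so the throughput scales with S and the energy efficiency with S * U^2. This reading is our assumption; it reproduces the scaled A1 entries of Table 6 to within rounding, but the authoritative equations are those in [56]. The function name is hypothetical.

```python
def scale_to_target(q_gops, e_gops_per_watt, node_nm, vdd,
                    target_nm=65.0, target_vdd=1.1):
    """Scale throughput and energy efficiency to a target technology node.

    Assumes general scaling: delay ~ 1/S and power ~ 1/U^2, where
    S = node_nm / target_nm and U = vdd / target_vdd.
    """
    s = node_nm / target_nm
    u = vdd / target_vdd
    q_scaled = q_gops * s                  # faster by the dimension factor
    e_scaled = e_gops_per_watt * s * u * u # throughput up, power down
    return q_scaled, e_scaled

# A1 in Table 6: 90nm at 1 volt, 2460 GOP/s and 2436 GOP/s/watt before scaling.
q_a1, e_a1 = scale_to_target(2460.0, 2436.0, node_nm=90.0, vdd=1.0)
```

The computed values for A1 are close to the scaled figures of 3406 GOP/s and 2787 GOP/s/watt listed in Table 6.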
To analyze Table 5 and Table 6 and understand the effect of different optimizations on throughput, the entries of the
tables are sorted in descending order, starting from the implementation with the highest throughput. Two
groups of optimizations appear more frequently in the high-throughput implementations. The first group
is related to decreasing memory access time. Memory access time is decreased either by using on-chip memory for all
weights or by overlapping the computation time with the weight loading time. The second group is related to algorithmic
optimizations. Algorithmic optimizations present in all high-throughput implementations are compression (pruning,
block-circulant matrices, etc.), deltaRNN, and low precision quantization. Quantization using 16 bits and non-linear
function approximations are not within the groups of high-effect optimizations. Quantization with 16 bits is present in
many implementations that do not go for lower precision, and it does not have a great effect on computation cost. Thus,
it is not a differentiating factor. Non-linear function approximations do not contribute to the most used operations
(matrix to vector multiplications).
Finally, the throughput values are plotted against the implementations in Fig. 6. The scaled effective throughput values
are used for the ASIC implementations. Implementations that have memory access optimizations and/or algorithmic
optimizations are highlighted by putting them inside an extra square and/or circle. It can be observed from Fig. 6 that
all of the implementations with high throughput applied both an algorithmic optimization and a memory access
optimization. For instance, F3 [59] applied low precision quantization and placed all the weights in the on-chip memory.
F1 [43], F2 [77], F5 [22], and A1 [80] all applied both on-chip memory optimization and algorithmic optimizations. In
F4 [30], the architecture had a scheduler that overlaps computation with memory accesses. All the weights required for
a computation are fetched before the computation starts. Thus, they managed to eliminate the off-chip memory access
overhead by having an efficient compute/load overlap. One implementation that stands out is A5 [24], which has a
very low throughput for an ASIC implementation while applying on-chip and algorithmic optimizations. The reason
is that this particular implementation was meant to meet a latency deadline of 16 ms while consuming power in the
microwatt range. Thus, high throughput was not the objective from the beginning. Another implementation that needs
inspection is F8. Despite applying the two mentioned optimizations, it could not reach as high performance as expected.
The conclusion here is that applying memory access optimization and algorithmic optimization is necessary but not
sufficient for high performance.
Furthermore, Fig. 6 shows that the ASIC implementations did not exceed the FPGA implementations in terms of
throughput. We think the reason is that the ASIC implementations under study did not use the latest ASIC technologies,
as shown in Table 6. Both F1 and F2 (the implementations with the highest throughput) applied the block-circulant
matrices optimization. In addition, A1 (the ASIC implementation with the highest throughput) applied the circulant
matrices optimization. This indicates that restructuring weight matrices into circulant matrices and sub-matrices is
one of the most fruitful optimizations. The reason can be that the circulant matrices optimization does not cause
irregularity in the weight matrices, as pruning does [77]. Moreover, circulant matrices can be accompanied by low
precision quantization without a harsh effect on accuracy, as in A1 (6-bit) and F1 (12-bit). It is observed in Table 5 that
the optimizations of F1 and F2 are almost identical but their performance is different. F1 and F2 differ in the hardware
architecture, and F1 applied lower precision than F2, but the most important reason is that F1 used a better approach for
training the compressed RNN model. F1 was able to reach with block size 16 the same accuracy level that F2 reached with
block size 8. Thus, the RNN model size in F1 is approximately 2x smaller than in F2. For the low-throughput
implementations, Fig. 6 shows that some implementations did not apply either of the two optimizations (memory access
and algorithmic), such as C3 [6] and F9 [69], which had a strict accuracy constraint limiting the use of algorithmic
optimizations. Other implementations applied only one of the two optimizations, such as F11 [58] and F12 [8].
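The relation between block size and model size mentioned above can be checked with a short sketch. It assumes the usual block-circulant storage, in which each block x block sub-matrix is defined by a single vector of block values; the 1024 x 1024 matrix size and the function names are illustrative assumptions rather than figures from F1 or F2.

```python
def block_circulant_params(rows, cols, block):
    """Number of stored weights when every block x block tile is circulant."""
    assert rows % block == 0 and cols % block == 0
    return (rows // block) * (cols // block) * block   # = rows * cols / block

def circulant_matvec(first_row, x):
    """Multiply a circulant block by a vector using only its first row.

    Row i of the block is the first row rotated right by i positions,
    so entry (i, j) equals first_row[(j - i) mod k].
    """
    k = len(first_row)
    return [sum(first_row[(j - i) % k] * x[j] for j in range(k)) for i in range(k)]

dense = 1024 * 1024
size_b8 = block_circulant_params(1024, 1024, 8)     # 8x fewer weights than dense
size_b16 = block_circulant_params(1024, 1024, 16)   # 16x fewer weights than dense
```

Doubling the block size from 8 to 16 halves the number of stored weights, which matches the approximately 2x difference in model size between F2 and F1.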
5.1.2 Energy eiciency. To compare the implementations understudy from the energy consumption perspective,
we use the number of operations per second per wa as a measure. e last columns in Table 5 and Table 6 show
the energy eciency. Energy eciency is calculated based on the dense model, not the sparse model as for eective
throughput. However, it was not possible to get values for energy eciency in all implementations. In some cases, the
power consumption was not mentioned in the paper. While, in other cases, the consumed power was not provided in a
precise manner. For instance, the power of the whole FPGA board may be provided, which does not indicate how much
power is used by the implementation with respect to the peripherals [27, 89].
Manuscript submied to ACM
26
Rezk,etal.
Table 5. Comparing the techniques of the papers in FPGA implementations.
Index | Platform | Model | Algorithmic optimizations | Platform optimizations | Qeff (1) GOP/s | Eeff (2) GOP/s/watt
F1 [43] | Alpha Data ADM-7V3 T @200MHz | 1*LSTM*1024 | Block-circulant 16, Piecewise approx., Quantization 12 | On-chip, pipelining | <q3> 79560 | <e1> 3182
F2 [77] | Alpha Data ADM-7V3 T @200MHz | 1*LSTM*1024 | Block-circulant 8, Piecewise approx., Quantization 16 | On-chip, pipelining | <q3> 37375 | <e1> 1699
F3 [59] | Zynq XCZU7EV @266 MHz | 1*BiLSTM*128 | Quantization 1-8 | On-chip, Unrolling-timesteps | <q1> 3435 | <e4> -
F4 [30] | XCKU060 @200 MHz | 1*LSTM*1024 | Pruning, Quantization 12 | Compute/load overlap | <q1> 2515 | <e2> 61.4
F5 [22] | Zynq 7100 @125 MHz | 1*GRU*256 | DeltaRNN, RALUT, Quantization 16 | On-chip | <q1> 1198.3 | <e2> 218
F6 [60] | Zynq-7000 XC7Z045 @142 MHz | 1*BiLSTM*100 | Quantization 5 | On-chip, Loop-unrolling | <q1> 308 | <e3> 44
F7 [89] | Virtex-7 VC709 @100 MHz | AlexNet + 15steps:1*LSTM*256 | Quantization 16 | Loop-unrolling, Reordering weights | <q2> 36.25 (3) | <e4> -
F8 [40] | XC7Z045 @100 MHz | 100steps:3*LSTM*256, 3840steps:2*LSTM*256 | Quantization 6 | On-chip | <q3> 30 (4) | <e2> 5.4
F9 [69] | VC707 @150 MHz | 1*LSTM*10 + FC | Hard Sigmoid | Loop-unrolling | <q1> 13.5 | <e4> -
F10 [27] | VC707 @150 MHz | 3*LSTM*250 | Piecewise approx. | Tiling, Loop-unrolling, Compute/load overlap, Reordering weights | <q1> 7.3 | <e4> -
F11 [58] | Zynq ZC706 @100 MHz | 2*LSTM*512 | Pruning | Tiling | <q3> 1.55 | <e4> -
F12 [8] | Zynq-7000 XC7Z045 @142MHz | 2*LSTM*128 | Quantization 16, Piecewise approx. | Hybrid memory, Compute/load overlap | <q4> 0.2 | <e1> 0.11
(1) The cases q1-q4 are explained in Appendix D.
(2) The cases e1-e4 are explained in Appendix D.
(3) The throughput is for running CNN and LSTM combined together.
(4) The number of time steps the model should run per second to reach real-time behavior is given. We computed the number of operations in the model
and multiplied by the number of time steps in one second, then multiplied by the speedup gained over the real-time threshold to get the implementation
throughput.
Table 6. Comparing the techniques of the papers in ASIC and other implementations.
Category | Index | Platform | Model | Algorithmic optimizations | Platform optimizations | Qeff (1) GOP/s (original/scaled) (3) | Eeff (2) GOP/s/watt (original/scaled) (3)
ASIC | A1 [80] | TSMC 90nm @600MHz & 1v | 1*LSTM*512 | Quantization 6, Circulant matrices, Piecewise approx. | On-chip, Tiling | <q1> 2460/3406 | <e2> 2436/2787
ASIC | A2 [90] | CMOS 180nm & 1.8v | 1*LSTM*16 | Quantization 4 | ASP, On-chip | <q4> 473.3/1211 | <e1> 950/7044
ASIC | A3 [9] | CMOS 65nm @400 MHz & 1.2v | GRU | Quantization 16, Piecewise approx. | On-chip, Hardware sharing | <q1> 311.6 | <e1> 2000/2380
ASIC | A4 [53] | CMOS 65nm @200MHz | 2*LSTM*512 | Pruning | Load balancing | <q3> 295 | <e3> 122.9
ASIC | A5 [24] | CMOS 65nm @239 KHz & 0.575v | 2*LSTM*32 | Quantization 4, Piecewise approx. | On-chip, Doubling memory fetching | <q2> 0.002 (5) | <e2> 469.3/128
ASIC | A6 [85] | CMOS 65nm | 1*LSTM*256 | Quantization (7) | ReRAM PIM | <q5> - | <e1> 27000
Others | C1 [70] | ARMv8 @2GHz | 1*SRU*1024 | SRU | Multi time-step parallelization | <q3> 22.3 | <e4> -
Others | C2 [70] | Intel Core i7 @3.2GHz | 1*SRU*1024 | SRU | Multi time-step parallelization | <q3> 19.2 | <e4> -
Others | C3 [6] | Adreno 330 GPU @450 MHz | 2*LSTM*32 | - | RenderScript (6) | <q3> 0.0011 | <e4> -
(1) The cases q1-q4 are explained in Appendix D.
(2) The cases e1-e4 are explained in Appendix D.
(3) Scaled to 65nm at 1.1 volt using general scaling [56].
(4) The throughput is not high as the purpose was to reach very low power consumption while doing inference within 16ms.
(4) The shown numbers are for running FC layers of a CNN as it reproduces throughput numbers for the LSTM layer experimented in the paper.
(6) RenderScript is a mobile-specific parallelization framework [1].
(7) Quantization used 1 bit for weights and 4 bits for activations.
Fig. 6. Eective throughput of dierent implementations along with the key aecting optimizations.
Fig. 7. Energy eiciency of dierent implementations along with the key aecting optimizations.
Fig. 7 plots the energy efficiency found or deduced (the methods used for deduction are presented in Appendix D) for
the implementations under study against the implementation index. The implementations are sorted according to energy
efficiency, and the scaled values are used for the ASIC implementations. Again, to show the effect of optimizations,
we chose the two most effective optimizations from Table 5 and Table 6 and put them in the figure. They are the
same as those used in Fig. 6: memory access optimization (on-chip memory usage for weights) and algorithmic optimizations.
The observations from Fig. 7 agree with the observations from Fig. 6. Algorithmic optimizations are applied in most
of the efficient implementations, and on-chip memory has been used for weight storage in most of the efficient
implementations. Comparing the effective throughput and energy efficiency of FPGA and ASIC implementations, it is
observed that FPGA and ASIC have close values for effective throughput, while ASIC implementations are more energy
efficient. The credit can go to ASIC technology.
It can be observed that the highest energy efficiency was achieved by A6. A6 managed to save the memory access
time by computing in memory. The quantization method used is multi-bit code quantization (1 bit for weights and
4 bits for activations). Multi-bit code quantization enables replacing the MAC operation with XNOR and bit-counting
operations, as discussed in Section 4.1.1. It was sufficient to use an XNOR-RRAM based architecture to implement the
RNN. Therefore, in Fig. 7, we consider PIM as a memory access optimization.
5.1.3 Meeting real-time requirements. In some of the implementations under study, real-time requirements
for throughput and power have been determined. For instance, in F8 [40], the speech recognition system had two
RNN models: one model for acoustic modelling and the other for character-level language modelling. The real-time
requirement was to run the first model 100 times per second and the second model 3,840 times per second. In
A5 [24], an LSTM accelerator for an always-on Keyword Spotting System (KWS), the real-time response demanded that
a new input vector be consumed every 16 ms and that the power consumption not exceed 10 µwatt.
5.2 Flexibility
Flexibility, as defined in Section 2, is the ability of the solution to support different models and configurations. The
flexibility of the solution can be met by supporting variations in the model. Models can vary in the number of layers,
the number of hidden units per layer, the optimizations applied to the model, and more. In addition, flexibility can be
met by supporting online training or meeting different application domain requirements.
To quantify how far flexibility is met by the implementations under study, Fig. 8 shows the percentage of
implementations supporting each flexibility aspect. The details of the flexibility aspects covered by each implementation
are in Appendix E. Flexibility is visualized as levels. Level 0 is used to indicate no flexibility. Level 0 requires the
implementation to support only one recurrent layer configuration. All papers meet the level 0 requirement and then they
vary in meeting other flexibility aspects. The flexibility aspects and how they can be met are discussed in the following:
Supporting variations in RNN layers (level 1): Recurrent layers can vary in the type of layers, the number of
cells in each layer, and the number of layers (the depth of an RNN model). One optimization that might have a side
effect on the flexibility of the solution is the choice of using the on-chip or off-chip memory to store the weights. Being
able to store all the weights in the on-chip memory is very beneficial. It leads to better performance and less energy
consumption by decreasing the cost of memory accesses. However, the solution may be infeasible for larger problems.
For instance, in F8 [40], the number of weights in the model and their precision are restricted by the on-chip memory
size. It is not possible to run a model with an increased number of hidden cells or increased precision. A possible
solution is to use an adaptable approach, where the location for storing the weights depends on the
model size, so that a wide range of models can be supported. Another solution was adopted in F12 [8], where part of the
weights is stored in the internal memory and the rest is stored in the off-chip memory (Hybrid memory).
Manuscript submied to ACM
30 Rezk, et al.
Supporting other NN layers (level 2): Supporting other NN layers would allow the solution to run a broader range
of NN applications. Moreover, other NN layers may exist in the RNN model, such as convolutions used as a feature
extractor. Thus, supporting convolution in the implementation increases the flexibility of the solution, as it can run
RNN models with visual inputs and run CNN independent applications.
Supporting algorithmic optimization variations (level 3): Variations in the optimizations applied are considered
as variations in the model as well. For instance, the variation due to applying or not applying pruning is the presence
of sparse or dense matrices in the matrix to vector multiplications. The design in A8 [37] employed a
configurable interconnection network topology to increase the flexibility of the accelerator. The accelerator in A8 [37]
supported both LSTM and CNN layers, and it supported both sparse and dense matrices. Another
variation is the variation in the precision of weights and activations. The design in A9 [65] supported varying precision
models by allowing dynamic precision per layer for both CNN and RNN models. Similarly, the Microsoft NPU Brainwave
architecture [21] supported varying precision using a narrow precision block floating-point format [81].
Online training (level 4): Incremental online training was supported in A3 [9] to allow retraining pre-trained
networks to enhance accuracy. Changes in the hardware design have been applied to support both training and inference
without affecting the quality of inference. For instance, three modes of data transfer were applied. The first is to load
new weights, the second is to load input sequences, and the third is to update certain weights. In addition, extra
precision was used only during training.
Meeting different application domain constraints (level 5): We did not find an RNN implementation that
targets variations in the constraints of different application domains. NetAdapt is a good example of an implementation
that can adapt to different metric budgets [83]. However, it only targets CNNs.
Fig. 8. Percentage of implementations meeting flexibility aspects for different flexibility levels and the definition of flexibility levels.
6 DISCUSSIONS AND OPPORTUNITIES
In the previous section, we studied the implementations of RNNs on embedded platforms. Furthermore, in Section 2,
we defined the objectives of realizing RNN models on embedded platforms. Let us now take a look at how the
objectives are being met by the implementations.
Manuscript submied to ACM
Recurrent Neural Networks: An Embedded Computing Perspective 31
Throughput: It is clear that throughput was the main objective for most of the implementations. The TOP/s threshold
has been surpassed by about a third of the implementations. Many algorithmic and platform-specific optimizations have
been studied and applied to achieve high throughput.
Energy efficiency was not studied by all implementations. For many of the studied implementations, we could not
derive a value for the energy efficiency from the implementation papers. For the remaining implementations,
energy efficiency was affected by the optimizations applied and could reach high values in terms of GOP/s/watt.
Meeting real-time requirements was not used as an objective for many implementations. In a few of the
implementations under study, real-time deadlines were mentioned and followed in the design of the solution.
Flexibility: In Section 2.1, flexibility is defined as a secondary objective. Thus, we do not expect flexibility to be fully
met by the implementations. Variation in the RNN model was partially fulfilled by many implementations. However, the
number of variations covered by each implementation is quite low. Fewer implementations addressed
other NN layers and variations in algorithmic optimizations. Online training was targeted by only one implementation.
Online training is not popular in embedded implementations, while on the algorithmic side researchers are doing
interesting work based on online/continuous training [35, 73]. Supporting different applications was not met by any
of the RNN implementations. It has been met by the CNN solution in [83]. Following a similar method for RNNs, in
addition to supporting model variations, can lead to an interesting solution.
The opportunities for future research can be summarized as:
QRNN and SRU: QRNN and SRU (Section 3.1.6) are two alternatives to LSTM where the matrix to vector
computations at the current time-step are independent of the previous time-step computations. Thus, using them in RNN
models can make the parallelization more efficient and consequently lead to better performance.
DeltaRNN [50] and DeltaCNN [7]: We believe that applying the delta algorithm to both recurrent and convolution
layers is a logical step due to the temporal relation between the input sequences. Adding the delta step to other algorithmic
optimizations such as pruning and quantization would decrease the memory access and computation requirements.
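The general idea behind the delta algorithm can be sketched as follows: the result of the previous matrix to vector multiplication is kept, and only the inputs whose change exceeds a threshold trigger new multiplications and weight fetches. This is a simplified illustration of the principle, not the DeltaRNN implementation of [50]; the function name, its signature, and the bookkeeping of reference inputs are our assumptions.

```python
def delta_matvec_step(weights, x, x_ref, y_prev, threshold):
    """One time-step of a delta-style matrix to vector multiplication.

    weights: matrix as a list of rows
    x:       current input vector
    x_ref:   input values last used to update the result
    y_prev:  result accumulated so far (weights times x_ref)
    Returns (y, new_x_ref, columns_used).
    """
    y = list(y_prev)
    new_ref = list(x_ref)
    columns_used = 0
    for j, (xj, rj) in enumerate(zip(x, x_ref)):
        delta = xj - rj
        if abs(delta) > threshold:
            # Only this weight column is fetched and multiplied.
            for i in range(len(y)):
                y[i] += weights[i][j] * delta
            new_ref[j] = xj
            columns_used += 1
    return y, new_ref, columns_used
```

With a threshold of zero the result is the same as that of the full multiplication, while columns whose input did not change are skipped. A larger threshold skips more columns at the cost of an approximation error, which is the accuracy versus cost trade-off that the delta parameter controls.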
Block-circulant matrices Using block-circulant matrices as an algorithmic optimization decreases the RNN size
without causing irregularity of computation as pruning [77]. Applying circulant matrices can be accompanied by low
precision parameters and activations with a small eect on accuracy [80]. With the addition to applying the delta
algorithm as mentioned earlier, RNN inference can achieve a promising throughput and energy eciency.
Hybrid optimizations: It has been shown that a mix of algorithmic optimizations can be applied to an RNN model
with an acceptable loss in accuracy [80]. Applying a mix of optimizations would enable the implementations to benefit
from each optimization. For an RNN implementation, three classes of optimizations can be mixed together with tuning.
The first optimization, as mentioned earlier, is the delta algorithm, and the corresponding parameter is delta. The second is
quantization, and the corresponding parameters are the number of bits and the quantization method. The third optimization
is compression. If the applied compression technique is block-circulant matrices, the parameter will be the block size.
By tuning the four parameters (delta, number of bits, quantization method, and block size), the designer can reach the
highest performance while keeping the accuracy within an acceptable range (the range is dependent on the application).
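One simple way to picture this tuning is an exhaustive search over the parameter space, as sketched below. The function name tune, the evaluate callable (which would have to train or simulate the model and report accuracy and throughput), and the dictionary keys are all hypothetical. A real design flow may use a smarter search strategy.

```python
from itertools import product

def tune(deltas, bit_widths, methods, block_sizes, evaluate, min_accuracy):
    """Return the fastest configuration whose accuracy stays acceptable.

    evaluate(config) is a user-supplied (hypothetical) callable returning
    (accuracy, throughput) for one configuration.
    """
    best = None
    for delta, bits, method, block in product(deltas, bit_widths, methods, block_sizes):
        config = {"delta": delta, "bits": bits, "method": method, "block_size": block}
        accuracy, throughput = evaluate(config)
        if accuracy < min_accuracy:
            continue                      # outside the acceptable accuracy range
        if best is None or throughput > best[1]:
            best = (config, throughput, accuracy)
    return best
```

The function returns None when no configuration stays within the acceptable accuracy range, which signals that the constraints need to be relaxed.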
ASP and PIM: Analog processing [90] and processing in memory [85] have shown promising preliminary performance,
especially in energy efficiency. Proposing solutions that can run larger RNNs than those used in the literature
will expand the usage of such hardware solutions for RNN applications.
Flexible solutions: Flexible hardware solutions that support all NN layers and as many optimizations as possible are
an in-demand area for research. A8 [37] and A9 [65] have shown good architectures in this context. However, more
Manuscript submied to ACM
32 Rezk, et al.
algorithmic optimizations need to be supported in both architectures to reach a higher level of flexibility. Furthermore,
online training is an interesting area of research that has not been explored enough by embedded systems researchers.
7 SUMMARY
Today we see a trend towards more intelligent mobile devices that process applications with streaming data
in the form of text, voice, and video. To process these applications, RNNs are important due to their efficiency in
processing sequential data. In this paper, we perform a thorough review of the literature dealing with recurrent neural
network implementations from the embedded systems perspective. We study all the aspects required for the efficient
implementation of an RNN model on embedded platforms. To do so, we study the different components of RNN models
from an implementation point of view rather than an algorithmic point of view. We also define the objectives
that are required to be met by the hardware solutions for RNN applications and the challenges that make them difficult to meet.
For an RNN model to run efficiently on an embedded platform, some optimizations need to be applied. Thus, we study
both algorithmic and platform-specific optimizations. Then, we analyze the implementations that applied the studied
optimizations to propose solutions for RNN models on embedded systems. Finally, we discuss how the objectives
defined earlier in the article have been met and highlight possible directions for future research in this field.
We conclude from the analysis of the implementations that there exist two mandatory optimizations for efficient
implementations. The first is algorithmic optimization. The second is to decrease the memory access time for
weight retrieval by relying on the on-chip memory for storing the weights, applying an efficient overlap between
weight loading and computation, or computing in memory. The study of the implementations in the literature shows
sufficient performance for streaming applications and a lack of flexibility.
8 ACKNOWLEDGMENT
This research is performed in the NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and
Reconfigurability) Indo-Swedish project, funded by VINNOVA Strategic Innovation grant and the Department of
Science and Technology (INT/SWD/VINN/p-10/2015), Government of India.
REFERENCES
[1] [n. d.]. Android RenderScript kernel description. https://developer.android.com/guide/topics/renderscript/compute.html. ([n. d.]). Accessed:
2019-01-05.
[2] Md. Zahangir Alom, Adam T. Moody, Naoya Maruyama, Brian C. Van Essen, and Tarek M. Taha. 2018. Effective quantization approaches for
recurrent neural networks. CoRR abs/1802.02615 (2018).
[3] A. J. Annunziata, M. C. Gaidis, L. Thomas, C. W. Chien, C. C. Hung, P. Chevalier, E. J. O’Sullivan, J. P. Hummel, E. A. Joseph, Y. Zhu, T. Topuria, E.
Delenia, P. M. Rice, S. S. P. Parkin, and W. J. Gallagher. 2011. Racetrack memory cell array with integrated magnetic tunnel junction readout. In
2011 International Electron Devices Meeting. 24.3.1–24.3.4. https://doi.org/10.1109/IEDM.2011.6131604
[4] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H. Yoo. 2017. 14.6 A 0.62mW ultra-low-power convolutional-neural-network face-recognition
processor and a CIS integrated with always-on haar-like face detector. In 2017 IEEE International Solid-State Circuits Conference (ISSCC). 248–249.
hps://doi.org/10.1109/ISSCC.2017.7870354
[5] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. asi-recurrent neural networks. CoRR abs/1611.01576 (2016).
arXiv:1611.01576
[6] Qingqing Cao, Niranjan Balasubramanian, and Aruna Balasubramanian. 2017. MobiRNN: efficient recurrent neural network execution on mobile
GPU. CoRR abs/1706.00878 (2017).
[7] Lukas Cavigelli, Philippe Degen, and Luca Benini. 2017. CBinfer: change-based inference for convolutional neural networks on video data. CoRR
abs/1704.04313 (2017).
[8] A. X. M. Chang and E. Culurciello. 2017. Hardware accelerators for recurrent neural networks on FPGA. In 2017 IEEE International Symposium on
Circuits and Systems (ISCAS). 1–4. https://doi.org/10.1109/ISCAS.2017.8050816
Manuscript submied to ACM
Recurrent Neural Networks: An Embedded Computing Perspective 33
[9] C. Chen, H. Ding, H. Peng, H. Zhu, R. Ma, P. Zhang, X. Yan, Y. Wang, M. Wang, H. Min, and R. C. . Shi. 2017. OCEAN: an on-chip incremental-learning
enhanced processor with gated recurrent neural network accelerators. In ESSCIRC 2017 - 43rd IEEE European Solid State Circuits Conference. 259–262.
hps://doi.org/10.1109/ESSCIRC.2017.8094575
[10] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: a spatial architecture for energy-ecient dataow for convolutional neural networks. In
Proceedings of the 43rd International Symposium on Computer Architecture (ISCA ’16). 367–379.
[11] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. CoRR
abs/1710.09282 (2017).
[12] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie. 2016. PRIME: A Novel Processing-in-Memory Architecture for Neural Network
Computation in ReRAM-Based Main Memory. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 27–39.
[13] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: encoder-
decoder approaches. CoRR abs/1409.1259 (2014).
[14] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on
sequence Modeling. CoRR abs/1412.3555 (2014).
[15] Jinil Chung, Jongsun Park, and Swaroop Ghosh. 2016. Domain Wall Memory Based Convolutional Neural Networks for Bit-width Extendability and
Energy-Eciency. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design (ISLPED ’16). ACM, New York, NY, USA,
332–337. hps://doi.org/10.1145/2934583.2934602
[16] Mahieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. BinaryConnect: training deep neural networks with binary weights during
propagations. CoRR abs/1511.00363 (2015).
[17] Xiaoliang Dai, Hongxu Yin, and Niraj K. Jha. 2017. NeST: a neural network synthesis tool based on a grow-and-prune paradigm. CoRR abs/1711.02017
(2017).
[18] Xiaoliang Dai, Hongxu Yin, and Niraj K. Jha. 2018. Grow and prune compact, fast, and accurate LSTMs. CoRR abs/1805.11797 (2018).
[19] Li Deng. 2014. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information
Processing 3 (2014), e2. https://doi.org/10.1017/atsip.2013.9
[20] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2014.
Long-term recurrent convolutional networks for visual recognition and Description. CoRR abs/1411.4389 (2014).
[21] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G.
Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger. 2018. A Configurable Cloud-Scale DNN Processor for Real-Time
AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). 1–14. https://doi.org/10.1109/ISCA.2018.00012
[22] Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, and Tobi Delbrück. 2018. DeltaRNN: a power-efficient recurrent neural network accelerator. In
FPGA.
[23] F. A. Gers and J. Schmidhuber. 2000. Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference
on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, Vol. 3. 189–194 vol.3.
https://doi.org/10.1109/IJCNN.2000.861302
[24] J. S. P. Giraldo and M. Verhelst. 2018. Laika: a 5uW programmable LSTM accelerator for always-on keyword spotting in 65nm CMOS. In ESSCIRC
2018 - IEEE 44th European Solid State Circuits Conference (ESSCIRC). 166–169. https://doi.org/10.1109/ESSCIRC.2018.8494342
[25] Alex Graves. 2013. Generating sequences With recurrent neural networks. CoRR abs/1308.0850 (2013).
[26] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. 2013. Speech recognition with deep recurrent neural networks. CoRR abs/1303.5778
(2013).
[27] Y. Guan, Z. Yuan, G. Sun, and J. Cong. 2017. FPGA-based accelerator for long short-term memory recurrent neural networks. In 2017 22nd Asia and
South Pacic Design Automation Conference (ASP-DAC). 629–634. hps://doi.org/10.1109/ASPDAC.2017.7858394
[28] Yunhui Guo. 2018. A survey on methods and theories of quantized neural networks. (2018). arXiv:cs.LG/1808.04752
[29] Yiwen Guo, Anbang Yao, Hao Zhao, and Yurong Chen. 2017. Network sketching: exploiting binary structure in deep CNNs. CoRR abs/1706.02021
(2017).
[30] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William J.
Dally. 2016. ESE: efficient speech recognition engine with compressed LSTM on FPGA. CoRR abs/1612.00694 (2016).
[31] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: efficient inference engine on
compressed deep neural network. CoRR abs/1602.01528 (2016).
[32] Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: compressing deep neural network with pruning, trained quantization and
human coding. CoRR abs/1510.00149 (2015).
[33] Sepp Hochreiter and Ju¨rgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780.
[34] Itay Hubara, Mahieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks. In Proceedings of the 30th
International Conference on Neural Information Processing Systems (NIPS’16). 4114–4122.
[35] Christoph Ka¨ding, Erik Rodner, Alexander Freytag, and Joachim Denzler. 2016. Fine-tuning deep neural networks in continuous learning scenarios.
In ACCV Workshops.
Manuscript submied to ACM
34 Rezk, et al.
[36] Yoon Kim and Alexander M. Rush. 2016. Sequence-Level Knowledge Distillation. CoRR abs/1606.07947 (2016). arXiv:1606.07947
http://arxiv.org/abs/1606.07947
[37] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via
Reconfigurable Interconnects. SIGPLAN Not. 53, 2 (March 2018), 461–475. https://doi.org/10.1145/3296957.3173176
[38] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar. 2016. DeepX: a software accelerator for low-power deep
learning inference on mobile devices. In 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). 1–12.
https://doi.org/10.1109/IPSN.2016.7460664
[39] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. 2014. Speeding-up Convolutional Neural Networks Using
Fine-tuned CP-Decomposition. (12 2014).
[40] Minjae Lee, Kyuyeon Hwang, Jinhwan Park, Sungwook Choi, Sungho Shin, and Wonyong Sung. 2016. FPGA-based low-power speech recognition
with recurrent neural networks. In Signal Processing Systems (SiPS), 2016 IEEE International Workshop on. IEEE, 230–235.
[41] Tao Lei, Yu Zhang, and Yoav Artzi. 2017. Training RNNs as fast as CNNs. CoRR abs/1709.02755 (2017).
[42] Fengfu Li and Bin Liu. 2016. Ternary weight networks. CoRR abs/1605.04711 (2016).
[43] Zhe Li, Caiwen Ding, Siyue Wang, Wujie Wen, Youwei Zhuo, Chang Liu, Qinru Qiu, Wenyao Xu, Xue Lin, Xuehai Qian, and Yanzhi Wang.
2018. E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs. CoRR abs/1812.07106 (2018). arXiv:1812.07106
http://arxiv.org/abs/1812.07106
[44] Zachary Chase Lipton. 2015. A critical review of recurrent neural networks for sequence learning. CoRR abs/1506.00019 (2015).
[45] Y. Long, T. Na, and S. Mukhopadhyay. 2018. ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems 26, 12 (Dec 2018), 2781–2794. https://doi.org/10.1109/TVLSI.2018.2819190
[46] D. Maliuk and Y. Makris. 2015. An Experimentation Platform for On-Chip Integration of Analog Neural Networks: A Pathway to Trusted and Robust
Analog/RF ICs. IEEE Transactions on Neural Networks and Learning Systems 26, 8 (Aug 2015), 1721–1734. https://doi.org/10.1109/TNNLS.2014.2354406
[47] R. Muscedere, V. Dimitrov, G. A. Jullien, and W. C. Miller. 2005. Efficient techniques for binary-to-multidigit multidimensional logarithmic number
system conversion using range-addressable look-up tables. IEEE Trans. Comput. 54, 3 (March 2005), 257–271. https://doi.org/10.1109/TC.2005.48
[48] Sharan Narang, Gregory F. Diamos, Shubho Sengupta, and Erich Elsen. 2017. Exploring sparsity in recurrent neural networks. CoRR abs/1704.05119
(2017).
[49] Sharan Narang, Eric Undersander, and Gregory Diamos. 2017. Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782 (2017).
[50] Daniel Neil, Junhaeng Lee, Tobi Delbrück, and Shih-Chii Liu. 2016. Delta networks for optimized recurrent network computation. CoRR
abs/1612.05571 (2016).
[51] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. 2015. Tensorizing Neural Networks. CoRR abs/1509.06569 (2015).
arXiv:1509.06569 hp://arxiv.org/abs/1509.06569
[52] Joachim O, Zhouhan Lin, Ying Zhang, Shih-Chii Liu, and Yoshua Bengio. 2016. Recurrent neural networks With limited numerical precision.
CoRR abs/1608.06902 (2016).
[53] J. Park, J. Kung, W. Yi, and J. J. Kim. 2018. Maximizing system performance by balancing computation loads in LSTM accelerators. In 2018 Design,
Automation Test in Europe Conference Exhibition (DATE). 7–12. https://doi.org/10.23919/DATE.2018.8341971
[54] S. S. P. Parkin, M. Hayashi, and L. Thomas. 2008. Magnetic domain-wall racetrack memory. Science 320, 5873 (2008), 190–194.
[55] F. Piazza, A. Uncini, and M. Zenobi. 1993. Neural networks with digital LUT activation functions. In Proceedings of 1993 International Conference on
Neural Networks (IJCNN-93-Nagoya, Japan), Vol. 2. 1401–1404 vol.2. https://doi.org/10.1109/IJCNN.1993.716806
[56] Jan M. Rabaey. 1996. Digital Integrated Circuits: A Design Perspective. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
[57] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet classification using binary convolutional
neural networks. CoRR abs/1603.05279 (2016).
[58] Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018. Approximate FPGA-based LSTMs under computation
time constraints. CoRR abs/1801.02190 (2018).
[59] Vladimir Rybalkin, Alessandro Pappalardo, Muhammad Mohsin Ghaffar, Giulio Gambardella, Norbert Wehn, and Michaela Blott. 2018. FINN-L:
library extensions and design trade-off analysis for variable precision LSTM networks on FPGAs. CoRR abs/1807.04093 (2018).
[60] Vladimir Rybalkin, Norbert Wehn, Mohammad Reza Yousefi, and Didier Stricker. 2017. Hardware architecture of bidirectional long short-term
memory neural network for optical character recognition. In Proceedings of the Conference on Design, Automation & Test in Europe. European Design
and Automation Association, 1394–1399.
[61] Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic
modeling. In Fifteenth annual conference of the international speech communication association.
[62] Hojjat Salehinejad, Julianne Baarbe, Sharan Sankar, Joseph Barfett, Errol Colak, and Shahrokh Valaee. 2018. Recent advances in recurrent neural
networks. CoRR abs/1801.01078 (2018).
[63] Mohammad Hossein Samavatian, Anys Bacha, Li Zhou, and Radu Teodorescu. 2018. RNNFast: An Accelerator for Recurrent Neural Networks
Using Domain Wall Memory. CoRR abs/1812.07609 (2018).
Manuscript submied to ACM
Recurrent Neural Networks: An Embedded Computing Perspective 35
[64] Abigail See, Minh-Thang Luong, and Christopher D. Manning. 2016. Compression of neural machine translation models via pruning. CoRR
abs/1606.09274 (2016).
[65] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Joon Kyung Kim, Vikas Chandra, and Hadi Esmaeilzadeh. 2017. Bit
Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks. CoRR abs/1712.01507 (2017). arXiv:1712.01507
hp://arxiv.org/abs/1712.01507
[66] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: a machine
learning approach for precipitation nowcasting. CoRR abs/1506.04214 (2015).
[67] L. Song, X. Qian, H. Li, and Y. Chen. 2017. PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. In 2017 IEEE International
Symposium on High Performance Computer Architecture (HPCA). 541–552. https://doi.org/10.1109/HPCA.2017.55
[68] Evangelos Stromatias, Daniel Neil, Michael Pfeiffer, Francesco Galluppi, Steve B. Furber, and Shih-Chii Liu. 2015. Robustness of spiking deep belief
networks to noise and reduced bit precision of neuro-inspired hardware platforms. Frontiers in Neuroscience 9 (2015), 222.
[69] Z. Sun, Y. Zhu, Y. Zheng, H. Wu, Z. Cao, P. Xiong, J. Hou, T. Huang, and Z. Que. 2018. FPGA acceleration of LSTM based on data for test flight. In
2018 IEEE International Conference on Smart Cloud (SmartCloud). 1–6. https://doi.org/10.1109/SmartCloud.2018.00009
[70] Wonyong Sung and Jinhwan Park. 2018. Single stream parallelization of recurrent neural networks for Low power and fast inference. CoRR
abs/1803.11389 (2018).
[71] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR abs/1409.3215 (2014).
[72] V. Sze, Y. Chen, T. Yang, and J. S. Emer. 2017. Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105, 12 (Dec 2017),
2295–2329. https://doi.org/10.1109/JPROC.2017.2761740
[73] A. Teichman and S. Thrun. 2011. Practical object recognition in autonomous driving and beyond. In Advanced Robotics and its Social Impacts. 35–38.
https://doi.org/10.1109/ARSO.2011.6301978
[74] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2018. Tensor Decomposition for Compressing Recurrent Neural Network. CoRR
abs/1802.10410 (2018). arXiv:1802.10410 http://arxiv.org/abs/1802.10410
[75] Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018. Toolflows for mapping convolutional neural networks on FPGAs: a
survey and future directions. CoRR abs/1803.05900 (2018).
[76] Erwei Wang, James J. Davis, Ruizhe Zhao, Ho-Cheung Ng, Xinyu Niu, Wayne Luk, Peter Y. K. Cheung, and George A. Constantinides. 2019. Deep
Neural Network Approximation for Custom Hardware: Where We’ve Been, Where We’re Going. CoRR abs/1901.06955 (2019). arXiv:1901.06955
hp://arxiv.org/abs/1901.06955
[77] Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Yanzhi Wang, Qinru Qiu, and Yun Liang. 2018. C-LSTM: enabling ecient LSTM using structured
compression techniques on fPGAs. CoRR abs/1803.06305 (2018).
[78] Yu Wang, Shuang Liang, Song Yao, Yi Shan, Song Han, J. Peng, and Hong Luo. 2017. Recongurable processor for deep learning in autonomous
vehicles.
[79] Y. Wang, H. Yu, L. Ni, G. Huang, M. Yan, C. Weng, W. Yang, and J. Zhao. 2015. An Energy-Efficient Nonvolatile In-Memory Computing
Architecture for Extreme Learning Machine by Domain-Wall Nanowire Devices. IEEE Transactions on Nanotechnology 14, 6 (Nov 2015), 998–1012.
https://doi.org/10.1109/TNANO.2015.2447531
[80] Z. Wang, J. Lin, and Z. Wang. 2017. Accelerating recurrent neural networks: a memory-efficient approach. IEEE Transactions on Very Large Scale
Integration (VLSI) Systems 25, 10 (Oct 2017), 2763–2775. https://doi.org/10.1109/TVLSI.2017.2717950
[81] James H. Wilkinson. 1994. Rounding Errors in Algebraic Processes. Dover Publications, Inc., New York, NY, USA.
[82] Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha. 2018. Alternating multi-bit quantization for
recurrent neural networks. CoRR abs/1802.00150 (2018).
[83] Tien-Ju Yang, Andrew G. Howard, Bo Chen, Xiao Zhang, Alec Go, Vivienne Sze, and Hartwig Adam. 2018. NetAdapt: platform-aware neural
network adaptation for mobile applications. CoRR abs/1804.03230 (2018).
[84] Shuochao Yao, Yiran Zhao, Aston Zhang, Lu Su, and Tarek Abdelzaher. 2017. DeepIoT: compressing deep neural network structures for sensing
systems with a compressor-critic framework. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems (SenSys ’17). Article 4,
14 pages.
[85] S. Yin, X. Sun, S. Yu, J. Seo, and C. Chakrabarti. 2018. A Parallel RRAM Synaptic Array Architecture for Energy-Efficient Recurrent Neural Networks.
In 2018 IEEE International Workshop on Signal Processing Systems (SiPS). 13–18. https://doi.org/10.1109/SiPS.2018.8598445
[86] H. Yu, Y. Wang, S. Chen, W. Fei, C. Weng, J. Zhao, and Z. Wei. 2014. Energy efficient in-memory machine learning for data intensive image-
processing by non-volatile domain-wall memory. In 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC). 191–196.
https://doi.org/10.1109/ASPDAC.2014.6742888
[87] S. Yu, Z. Li, P. Chen, H. Wu, B. Gao, D. Wang, W. Wu, and H. Qian. 2016. Binary neural network with 16 Mb RRAM macro chip for classification and
online training. In 2016 IEEE International Electron Devices Meeting (IEDM). 16.2.1–16.2.4. https://doi.org/10.1109/IEDM.2016.7838429
[88] Chaoyun Zhang, Paul Patras, and Hamed Haddadi. 2018. Deep learning in mobile and wireless networking: a survey. CoRR abs/1803.04311 (2018).
[89] X. Zhang, X. Liu, A. Ramachandran, C. Zhuge, S. Tang, P. Ouyang, Z. Cheng, K. Rupnow, and D. Chen. 2017. High-performance video content
recognition with long-term recurrent convolutional network for FPGA. In 2017 27th International Conference on Field Programmable Logic and
Applications (FPL). 1–4. https://doi.org/10.23919/FPL.2017.8056833
Manuscript submied to ACM
36 Rezk, et al.
[90] Zhou Zhao, Ashok Srivastava, Lu Peng, and Qing Chen. 2019. Long Short-Term Memory Network Design for Analog Computing. J. Emerg. Technol.
Comput. Syst. 15, 1, Article 13 (Jan. 2019), 27 pages. https://doi.org/10.1145/3289393
[91] Shuchang Zhou, Yuzhi Wang, He Wen, Qinyao He, and Yuheng Zou. 2017. Balanced quantization: an effective and efficient approach to quantized
neural networks. CoRR abs/1706.07145 (2017).
[92] M. Zhu and S. Gupta. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. ArXiv e-prints (Oct. 2017).
arXiv:stat.ML/1710.01878
Manuscript submied to ACM
Online Appendix to: "Recurrent Neural Networks: An Embedded Computing
Perspective"
NESMA M. REZK, Halmstad University
MADHURA PURNAPRAJNA, Amrita Vishwa Vidyapeetham
TOMAS NORDSTRÖM, Umeå University
ZAIN UL-ABDIN, Halmstad University
ACM Reference format:
Nesma M. Rezk, Madhura Purnaprajna, Tomas Nordström, and Zain Ul-Abdin. 2019. Online Appendix to: "Recurrent Neural Networks:
An Embedded Computing Perspective". 1, 1, Article 1 (July 2019), 10 pages.
A RNN MODELS
In this section, we describe the components of RNN models other than the recurrent layer. Recurrent layers are explained in the
main article.
A.1 Input layers (features extractor)
As discussed earlier, input layers (also called feature extraction layers) are needed by many implementations to prepare
the sensor output for processing. These layers extract the features in the input, put them in a vector, and forward
this vector to the recurrent layers. Different kinds of feature extraction layers are used for different input types. In
this section, we only mention some examples of feature extractors. These examples are for the three most frequent
applications found in the literature: speech, text, and visual applications.
A.1.1 Sound features extractors. Sound feature extractors translate sound signals into feature vectors. There
are different kinds of features that can be extracted from a sound signal, such as Mel Frequency Cepstral Coefficients
(MFCC) and Linear Predictive Coding (LPC) coefficients.
A.1.2 Convolutional Neural Networks. Convolutional Neural Networks (CNNs) are used when the RNN model
has spatial input data. Camera image output is the most common example; here the RNN model is designed to deal
with images for activity recognition and image description [12, 20] or video description [52]. The CNN is used as a feature
extractor that takes images as inputs and generates feature vectors as outputs.
arXiv:1908.07062v1 [cs.NE] 23 Jul 2019
Another example is speech recognition applications. Combining the feature vectors generated for speech inputs by the
sound feature extractors described earlier composes a spectrogram. Spectrograms hold information about frequency
and amplitude against time in a visual representation. Thus, CNNs can be used to extract features from spectrograms
[3, 49].
A.1.3 Word embedding. Word embedding is used when the input is in the form of text [51]. The word embedding
layer extracts the features of each word in relation to the rest of the vocabulary. The output of the word embedding
is a vector. The distance between the vectors of two words that have a similar context is short, and the distance between
the vectors of two words that have different contexts is large.
Sentiment analysis (or emotional AI) is a form of natural language processing that goes deeper into the meaning of
the language. Thus, it involves not only understanding the meaning of the words but also capturing the feelings behind them.
Since it deals with text as input, it relies on word embedding for feature extraction as well [25].
A.2 Output layers
The output layers in the RNN model are the FC layers and the output function.
A.2.1 FC (Fully Connected) Layers. An RNN model might have one or more FC layers after the recurrent layers.
Non-linear functions may be applied between FC layers as well. The layer is called fully connected because each neuron in the
input is connected to each neuron of the output. Computationally, it is done by matrix-to-vector multiplication using a
weight matrix of size Inputsize × Outputsize, where Inputsize is the size of the input vector and Outputsize is the size
of the output vector. One purpose of the FC layer in RNN models can be to change the dimension of the hidden
state output vector ht to the dimension of the RNN model output to prepare it for the output function. In this case, the
FC layer might be replaced by adding a projection layer in the recurrent layer.
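A minimal sketch of the FC computation is given below. Storing the weight matrix with one row per output neuron, the optional bias vector b, and the function name are assumptions made for this example.

```python
def fully_connected(W, h, b=None):
    """Compute y = W h (+ b), where W has Outputsize rows and Inputsize columns."""
    y = [sum(w * x for w, x in zip(row, h)) for row in W]
    if b is not None:
        y = [yi + bi for yi, bi in zip(y, b)]
    return y

# Map a hidden state of size 3 to an output of size 2.
W = [[1, 0, 2],
     [0, 1, 1]]
y = fully_connected(W, [1, 2, 3])
```

Each output element is one dot product between a row of weights and the whole input vector, which is why every input is connected to every output.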
A.2.2 Output function. The output function is the final step in neural network inference. It generates the
output of the neural network model. This output can be a prediction, classification, recognition, etc. For instance, in a
text prediction problem, the softmax function is used as the output function. The output is a vector of probabilities
that sum to one. Each probability corresponds to one word. The word with the highest probability is the prediction
of the neural network [5].
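The softmax output step can be sketched as below. The three-word vocabulary and the score values are hypothetical, and subtracting the maximum score before exponentiation is a common numerical-stability measure that we add here.

```python
import math

def softmax(z):
    """Turn a vector of scores into probabilities that sum to one."""
    m = max(z)                            # subtract the maximum for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict_word(scores, vocabulary):
    """Return the most probable word together with all probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best], probs

word, probs = predict_word([1.0, 3.0, 2.0], ["cat", "dog", "bird"])
```

Here word is "dog", the entry with the largest score and therefore the largest probability.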
A.3 Processing of data in RNN models
Processing of data in RNN models can vary in different ways. The first is to vary through time steps. This is affected
by the nature of the application, as the application may have inputs with temporal relations, outputs with temporal
relations, or both. The second is related to bidirectional RNNs. We discuss how an RNN can process inputs forward and
backward in time in the bidirectional RNN. Furthermore, we discuss how an RNN model can be a deep RNN model.
A.3.1 RNN unfolding variations through time-steps. RNN unfolding/unrolling is done to show the repetition
in the recurrent layer and the number of time steps required to complete a task. Unfolding the RNN shows the
different types of RNN models one can meet.
• One to many: A one to many model generates a sequence of outputs for every single input, as
shown in Fig. 1a. Image captioning is one example [12]. The model takes one image as input and generates a
Manuscript submied to ACM
Online Appendix to: ”Recurrent Neural Networks: An Embedded Computing Perspective” 3
sentence as an output. The words of the sentence compose a sequence of temporally related data. Thus, the
sequence, in this case, is only in the output.
Fig. 1. Unfolding RNN model through time steps: (a) One to Many RNN, (b) Many to One RNN, (c) Many to Many RNN.
• Many to one: A many to one model generates one output for a sequence of inputs, as shown
in Fig. 1b. Activity recognition [12] and sentiment analysis [46] are two examples. In activity recognition
applications, the model takes a sequence of images to decide on the activity happening in the images. In
sentiment analysis, the model takes a sequence of words (sentence) as input and generates one feeling at the
end. Thus, the sequence, in this case, is only in the input.
• Many to many: A many to many model has a sequence in the input and a sequence in the
output, as shown in Fig. 1c. Language translation [45] and video description [12] are two examples. In language
translation, the model has a sequence of words (sentence) as an input and a sequence of words (sentence) as an
output. In video description applications, the model has a sequence of image frames as input and a sequence of
words (sentence) as output.
• One to one: There is no RNN model with one to one unrolling. One to one simply means that there is no
temporal relation within inputs or outputs (feedforward neural network).
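The unfolding variants listed above can be sketched with one shared recurrent step. The scalar hidden state, the fixed weights, feeding a zero input after the first step in the one to many case, and emitting one output per input in the many to many case are all simplifying assumptions. Real models use vectors and may arrange inputs and outputs differently.

```python
import math

def step(x, h, w_x=0.5, w_h=0.8):
    """One recurrent step with a scalar state: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    return math.tanh(w_x * x + w_h * h)

def one_to_many(x, n_steps):
    """Single input, sequence of outputs (e.g. image captioning)."""
    h, outputs = 0.0, []
    for t in range(n_steps):
        h = step(x if t == 0 else 0.0, h)  # the input is fed at the first step only
        outputs.append(h)
    return outputs

def many_to_one(xs):
    """Sequence of inputs, one output taken at the last step (e.g. sentiment analysis)."""
    h = 0.0
    for x in xs:
        h = step(x, h)
    return h

def many_to_many(xs):
    """Sequence of inputs and a sequence of outputs, one per step."""
    h, outputs = 0.0, []
    for x in xs:
        h = step(x, h)
        outputs.append(h)
    return outputs
```

All three functions reuse the same step, which reflects that unfolding changes only where inputs enter and where outputs are read, not the recurrent computation itself.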
A.3.2 Bi-directional RNN. In a bidirectional RNN, input can be fed into the recurrent layer from two directions:
past to future and future to past. That requires the duplication of the recurrent layer to have two recurrent layers
working simultaneously, each processing input in a different direction. That would help the network to understand the
Manuscript submied to ACM
4 Rezk, et al.
context beer by geing data from past and future at the same time. is concept can be applied to dierent variations
of recurrent layers such as BiLSTM [24] and BiGRU [47].
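A bidirectional layer can be sketched as two independent recurrent passes over the same sequence. The scalar state, the example weights, and pairing the two hidden states per time step (rather than, for instance, summing them) are assumptions made for illustration.

```python
import math

def run_rnn(xs, w_x, w_h):
    """Scalar-state RNN that returns the hidden state at every time step."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        hs.append(h)
    return hs

def bidirectional_rnn(xs, fwd=(0.5, 0.8), bwd=(0.3, 0.6)):
    """Pair the forward and backward hidden states for every time step."""
    forward = run_rnn(xs, *fwd)                   # past to future
    backward = run_rnn(xs[::-1], *bwd)[::-1]      # future to past, re-aligned in time
    return list(zip(forward, backward))
```

The two passes share no state, so they can run simultaneously on duplicated hardware, at the cost of needing the whole input sequence before the backward pass can start.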
A.4 Deep Recurrent Neural Networks (DRNN)
A neural network is made deep by adding non-linear layers between the input layer and the
output layer [4]. This is straightforward in feedforward NNs. However, in RNNs, there are different approaches that can
be taken. Similar to feedforward NNs, we can have a stack of recurrent layers (stacked RNN) [17] as shown in Fig. 2,
where we have a stack of two recurrent layers. The output of the first layer is considered as the input for the second
layer. Alternatively, the extra non-linear layers can be within the recurrent layer computations [36]. Extra non-linear
layers can be embedded within the hidden layer vector ht calculation, where the xt and ht−1 vectors used to calculate ht
pass through extra non-linear layers. This model is called a deep transition RNN model. The extra non-linear
layers can also be added in computing the output from the hidden state vector; this model is called a deep output RNN model.
It is possible to have an RNN model that is both a deep transition and a deep output RNN model [10]. One other way to
have extra non-linear functions within the recurrent layer is to have them within the gate calculations. The latter model
is called H-LSTM (Hidden LSTM).
Fig. 2. Stacked RNN. The first layer output is $h^1_t$ and the second layer output is $h^2_t$.
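The difference between stacking recurrent layers and deepening the transition can be sketched in Python as follows. This is a simplified illustration with a scalar hidden state; the function names, the weights, and the single extra tanh layer in the deep transition step are assumptions and not the formulation used in the cited works.

```python
import math

def stacked_rnn(xs, layers):
    """Stacked RNN: each layer's output sequence is the next layer's input.

    layers is a list of (w_x, w_h) weight pairs, one per recurrent layer.
    """
    seq = list(xs)
    for w_x, w_h in layers:
        h, out = 0.0, []
        for x in seq:
            h = math.tanh(w_x * x + w_h * h)
            out.append(h)
        seq = out
    return seq

def deep_transition_step(x, h_prev, w_x, w_h, w_extra):
    """Deep transition: x_t and h_{t-1} pass through an extra non-linear layer."""
    inner = math.tanh(w_x * x + w_h * h_prev)
    return math.tanh(w_extra * inner)
```

In the stacked case the depth is added between layers, while in the deep transition case it is added inside the computation of a single hidden state.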
A.5 Applications and their corresponding datasets
In the main article, we study different optimizations applied to different models, with different effects on accuracy. To
fully understand these optimizations, it is important to know to which application the RNN model was applied and
on which dataset. Researchers use datasets to evaluate their methods and modifications and to demonstrate their success.
Each application has its own corresponding datasets. These datasets differ in the size of the data samples, the values of the
data samples, and the total size of the dataset. The success of NN models is measured by accuracy. Accuracy indicates how
correct the model is in doing the recognition, classification, translation, etc. Different datasets use different metrics to
measure the accuracy of the model. In Table 1, we summarize the application domains and their corresponding datasets
and accuracy metrics. The application domains are as follows.
Table 1. Application domains and their corresponding datasets.

| Application domain | Dataset | Accuracy measure metric |
|---|---|---|
| Speech recognition | TIDIGITS [23], AN4 [1], TIMIT [15], Wall Street Journal (WSJ) [14], LibriSpeech ASR corpus [33] | Word Error Rate (WER) (lower is better) & Phone Error Rate (PER) (lower is better) |
| Text generation | Penn Treebank (PTB) [31], wikitext [32], Text8 [30], WMT’14 [2] | Perplexity per word (PPW) (lower is better) & Bilingual Evaluation Understudy (BLEU) (higher is better) |
| Sentiment analysis | IMDB [29] | Testing accuracy (higher is better) |
| Image/video applications | COCO [27] | BLEU (higher is better) |
| | Moving MNIST [42] | Cross entropy loss (lower is better) |
| | comma.ai driving dataset [40] | RMS prediction error (lower is better) |
| Music generation | Nottingham [6] | Testing accuracy (higher is better) |

• Speech recognition Speech recognition applications receive audio as input, understand it, and translate it into words. Speech
recognition can be used for phonetic recognition, voice search, conversational speech recognition, and speech
to text processing [11].
• Text generation RNN models can be used for language-related applications like text generation. An RNN model
can predict the next words after taking the previous words as inputs.
• Sentiment analysis Sentiment analysis is the task of understanding the opinion behind words [34]. Since the
input words compose a sequence, sentiment analysis is a suitable problem for RNNs.
• Image/Video applications Image/video applications cover any application that takes images as input, for
instance, image captioning, activity recognition, and video description.
B ILLUSTRATIONS FOR SOME OPTIMIZATIONS
Fig. 3 shows a vector that is broken into three vectors and a matrix that is broken into nine matrices. Thus, one matrix
to vector multiplication is broken into nine matrix to vector multiplications. Each vector is multiplied with the matrices
having a similar colour. The output vector is built from three vectors, each of which is formed by accumulating three
partial output vectors. This computation requires nine cycles to complete, assuming that new
weights can be loaded into the hardware multiplication unit within the cycle time.
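The tiling described above can be sketched in Python as follows. This is a minimal software illustration, not the hardware mapping used by any of the surveyed implementations; the function name `tiled_matvec` and the square tile shape are assumptions.

```python
def tiled_matvec(W, x, tile):
    """Compute y = W x block by block.

    W is a list of rows, x is a list, and tile is the block edge length.
    Returns the output vector and the number of block multiplications.
    """
    n_rows, n_cols = len(W), len(x)
    y = [0.0] * n_rows
    block_mults = 0
    for r0 in range(0, n_rows, tile):        # row block -> one output segment
        for c0 in range(0, n_cols, tile):    # column block -> one input segment
            for r in range(r0, min(r0 + tile, n_rows)):
                for c in range(c0, min(c0 + tile, n_cols)):
                    # Partial results of the same output segment accumulate.
                    y[r] += W[r][c] * x[c]
            block_mults += 1
    return y, block_mults
```

For a 6 × 6 matrix and a tile size of 2, the vector is split into three segments and the matrix into nine blocks, so the function reports nine block multiplications, matching the nine cycles mentioned for Fig. 3.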
C IMPLEMENTATIONS UNDER STUDY
Table 2 shows the details of the implementations under study.
D CASES FOR THROUGHPUT AND ENERGY EFFICIENCY COMPUTATIONS
• Case q1: Effective throughput is given in the paper.
• Case q2: The number of operations in the dense model and the computation time are given. By dividing the number of
operations $n_{op}$ by the computation time $t_{comp}$, we get the effective throughput $Q_{eff}$ as shown in Eq. (1). In some
papers, the number of operations and the computation time were given for multiple time steps (multiple inputs), which
would require running the LSTM $n_{steps}$ times.
\[ Q_{eff} = \frac{n_{op} \times n_{steps}}{t_{comp}} \tag{1} \]
Fig. 3. Tiling of matrix to vector multiplication.
• Case q3: The implemented RNN model information is provided in the paper. Thus, we calculate the number
of operations from the model information and then divide it by the computation time to get the throughput as
shown in Eq. (1). To compute the number of operations, only the operations in the matrix to vector
multiplications are counted, as they have the dominant effect on the performance. For instance, if the model is
built using LSTM layers, we use the equation
\[ n_{op} = 2 \times 4 \times (n \times m + n^2), \tag{2} \]
where $n_{op}$ is the number of operations in an LSTM layer and the term in brackets is the number of MAC operations
in the matrix to vector multiplications of one gate ($m$ is the input vector size and $n$ is the number of hidden cells).
This term is multiplied by four as the LSTM has four matrix to vector multiplications and multiplied by two to
convert the MAC operations into operations, as each MAC operation in the matrix to vector
multiplication is a multiply followed by an add (two operations).
If the LSTM has a projection layer, the number of operations is calculated as
\[ n_{op} = 2 \times 4 \times (n \times m + n \times p), \tag{3} \]
where the term $n^2$ is replaced by the term $n \times p$ ($p$ is the size of the projection layer). In the worst case, if the
paper does not give enough information to calculate the number of operations, the number of operations can be
approximated by multiplying the number of parameters by two. Furthermore, if the recurrent layer
is bidirectional, the number of operations is multiplied by two.
• Case q4: The energy efficiency is given in OP/s/watt and the power consumption is given in watt. By
multiplying the two values, the throughput is calculated.
• Case q5: Effective throughput could not be computed.
• Case e1: The energy efficiency $E_{eff}$ is given in the paper.
• Case e2: The power consumption is given in the paper. To compute the energy efficiency $E_{eff}$, the effective
throughput $Q_{eff}$ (OP/s) is divided by the power $P$ (watt) as
\[ E_{eff} = \frac{Q_{eff}}{P}. \]
• Case e3: Energy and computation time are provided. First, we divide energy by time to get power. Next, we
divide the effective throughput $Q_{eff}$ by the power to get the energy efficiency, as we did in case e2.
• Case e4: Energy efficiency could not be computed.

Table 2. Detailed information about papers under study

| Index | Authors | Name | Affiliation | Year |
|---|---|---|---|---|
| F1 [26] | Li et al. | E-RNN | Syracuse University, Northeastern University, Florida International University, Carnegie Mellon University, University of Southern California, SUNY University | 2019 |
| F2 [48] | Wang et al. | C-LSTM | Peking University, Syracuse University, City University of New York | 2018 |
| F3 [38] | Rybalkin et al. | FINN-L | University of Kaiserslautern, Xilinx Research Lab | 2018 |
| F4 [19] | Han et al. | ESE | Stanford University, DeePhi Tech, Tsinghua University, NVIDIA | 2017 |
| F5 [13] | Gao et al. | DeltaRNN | University of Zurich & ETH Zurich | 2018 |
| F6 [39] | Rybalkin et al. | - | University of Kaiserslautern, German Research Center for Artificial Intelligence | 2017 |
| F7 [52] | Zhang et al. | - | University of Illinois, Inspirit IoT Inc, Tsinghua University, Beihang University | 2017 |
| F8 [22] | Lee et al. | - | Seoul National University | 2016 |
| F9 [43] | Sun et al. | - | Shanghai Jiao Tong University, Chinese Academy of Sciences, University of Cambridge, Imperial College | 2018 |
| F10 [18] | Guan et al. | - | Peking University, University of California, PKU/UCLA Joint Research Institute in Science and Engineering | 2017 |
| F11 [37] | Rizakis et al. | - | Imperial College London | 2018 |
| F12 [8] | Chang et al. | DeepRnn | Purdue University | 2017 |
| A1 [49] | Wang et al. | - | Nanjing University | 2017 |
| A2 [53] | Zhao et al. | - | Louisiana State University | 2019 |
| A7 [28] | Long et al. | - | Georgia Institute of Technology, Atlanta | 2018 |
| A3 [9] | Chen et al. | Ocean | Fudan University, Zhejiang University, University of Washington | 2017 |
| A4 [35] | Park et al. | - | Pohang University of Science and Technology | 2018 |
| A5 [16] | Giraldo et al. | Laika | KU Leuven | 2018 |
| A6 [50] | Yin et al. | - | Arizona State University | 2018 |
| A8 [21] | Kwon et al. | MAERI | Georgia Institute of Technology | 2018 |
| A9 [41] | Sharma et al. | Bit Fusion | Georgia Institute of Technology, Arm Inc., University of California (San Diego) | 2018 |
| C1 [44], C2 [44] | Sung et al. | - | Seoul National University | 2018 |
| C3 [7] | Cao et al. | MobiRNN | Stony Brook University | 2017 |
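The throughput and energy-efficiency cases of this section can be collected in a short Python sketch. The function and parameter names are hypothetical; the formulas follow Eqs. (1) to (3) and cases q4, e2, and e3 as stated in the text, with time in seconds, power in watt, and energy in joule assumed as units.

```python
def lstm_ops(n, m, p=None, bidirectional=False):
    """Eq. (2), or Eq. (3) when a projection size p is given."""
    recurrent = n * (p if p is not None else n)
    ops = 2 * 4 * (n * m + recurrent)
    # A bidirectional layer doubles the number of operations.
    return 2 * ops if bidirectional else ops

def effective_throughput(n_op, t_comp, n_steps=1):
    """Eq. (1): effective throughput Q_eff in OP/s."""
    return n_op * n_steps / t_comp

def throughput_from_efficiency(e_eff, power):
    """Case q4: energy efficiency (OP/s/watt) times power (watt)."""
    return e_eff * power

def energy_efficiency(q_eff, power=None, energy=None, t_comp=None):
    """Case e2 (power given) or case e3 (energy and time given)."""
    if power is None:
        power = energy / t_comp  # case e3: derive power first
    return q_eff / power
```

For example, with n = 4 hidden cells and an input size of m = 3, `lstm_ops(4, 3)` gives 2 × 4 × (12 + 16) = 224 operations per time step.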
Manuscript submied to ACM
8 Rezk, et al.
E DETAILED FLEXIBILITY ASPECTS
Flexibility is not quantitative like throughput. Thus, we use a subjective measure to reach a flexibility
score for each implementation. Table 3 shows the flexibility aspects supported by each implementation as discussed
in the papers and the flexibility score for each implementation. Papers that do not discuss any flexibility aspects are
omitted from Table 3. In A3 [9], the architecture should support various models. However, the number of cells and layers the
architecture can support is not mentioned in the paper. Hence, we cannot deduce how the implementation can support
variations in the RNN model. Note that variations should be supported on the hardware platform itself and not only
by the design method before fabrication. In A1 [49], the design method can support two different RNN layers. However, the
fabricated chip will support only one of them. Thus, we do not consider A1 [49] as meeting the flexibility objective.
Table 3. Flexibility score of implementations under study.

| Index | Flexibility aspects in papers | Score |
|---|---|---|
| F1 [26] | Varying layer (LSTM/GRU), Varying number of cells, Varying block size (block circulant matrices) | XXX |
| F2 [48] | Varying layer (LSTM/BiLSTM), Varying number of layers, Varying number of cells | XXX |
| F3 [38] | Varying layer (LSTM/BiLSTM), Varying precision, FC supported | XXX |
| F6 [39] | Varying layer (LSTM/BiLSTM), FC supported | XX |
| F7 [52] | Convolution supported, FC supported | XX |
| F8 [22] | Varying number of layers, Varying number of cells | XX |
| F10 [18] | Varying number of layers, Varying number of cells | XX |
| A3 [9] | Online training | X |
| A4 [35] | Varying number of cells, FC supported | XX |
| A5 [16] | Varying number of layers, Varying number of cells, Linear/nonlinear quantization, FC supported | XXXX |
| A7 [28] | Varying type of layer (LSTM/GRU), Convolution supported, FC supported | XXX |
| A8 [21] | Varying number of cells, Varying number of layers, Dense/Sparse, Convolution supported | XXXX |
| A9 [41] | Varying number of cells, Varying number of layers, Convolution supported, Varying precision | XXXX |
| C2 [44] | Varying layer (LSTM/SRU/QRNN), Varying number of cells | XX |
| C3 [7] | Varying number of layers, Varying number of cells | XX |
REFERENCES
[1] 1991 (accessed October 30, 2018). AN4 dataset. http://www.speech.cs.cmu.edu/databases/an4/.
[2] (accessed January 7, 2019). WMT’14 dataset. https://nlp.stanford.edu/projects/nmt/.
[3] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg
Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan
Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian
Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. 2015. Deep speech 2: end-to-end speech recognition in english and
mandarin. CoRR abs/1512.02595 (2015).
[4] Yoshua Bengio. 2009. Learning deep architectures for AI. Found. Trends Mach. Learn. 2, 1 (Jan. 2009), 1–127.
[5] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural probabilistic language model. J. Mach. Learn. Res. 3 (March
2003), 1137–1155.
Manuscript submied to ACM
Online Appendix to: ”Recurrent Neural Networks: An Embedded Computing Perspective” 9
[6] Nicolas Boulanger-Lewandowski, Y Bengio, and Pascal Vincent. 2012. Modeling Temporal Dependencies in High-Dimensional Sequences:
Application to Polyphonic Music Generation and Transcription. Proceedings of the 29th International Conference on Machine Learning, ICML 2012 2
(06 2012).
[7] Qingqing Cao, Niranjan Balasubramanian, and Aruna Balasubramanian. 2017. MobiRNN: efficient recurrent neural network execution on mobile
GPU. CoRR abs/1706.00878 (2017).
[8] A. X. M. Chang and E. Culurciello. 2017. Hardware accelerators for recurrent neural networks on FPGA. In 2017 IEEE International Symposium on
Circuits and Systems (ISCAS). 1–4. https://doi.org/10.1109/ISCAS.2017.8050816
[9] C. Chen, H. Ding, H. Peng, H. Zhu, R. Ma, P. Zhang, X. Yan, Y. Wang, M. Wang, H. Min, and R. C. Shi. 2017. OCEAN: an on-chip incremental-learning
enhanced processor with gated recurrent neural network accelerators. In ESSCIRC 2017 - 43rd IEEE European Solid State Circuits Conference. 259–262.
https://doi.org/10.1109/ESSCIRC.2017.8094575
[10] Xiaoliang Dai, Hongxu Yin, and Niraj K. Jha. 2018. Grow and prune compact, fast, and accurate LSTMs. CoRR abs/1805.11797 (2018).
[11] Li Deng. 2014. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information
Processing 3 (2014), e2. https://doi.org/10.1017/atsip.2013.9
[12] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2014.
Long-term recurrent convolutional networks for visual recognition and description. CoRR abs/1411.4389 (2014).
[13] Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, and Tobi Delbrück. 2018. DeltaRNN: a power-efficient recurrent neural network accelerator. In
FPGA.
[14] John Garofalo, D Graff, D Paul, and D Pallet. [n. d.]. Csr-i (wsjo) sennheiser. ([n. d.]). Linguistic Data Consortium, Philadelphia.
[15] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. 1993. DARPA TIMIT acoustic phonetic continuous speech
corpus CDROM. (1993).
[16] J. S. P. Giraldo and M. Verhelst. 2018. Laika: a 5uW programmable LSTM accelerator for always-on keyword spotting in 65nm CMOS. In ESSCIRC
2018 - IEEE 44th European Solid State Circuits Conference (ESSCIRC). 166–169. https://doi.org/10.1109/ESSCIRC.2018.8494342
[17] Alex Graves. 2013. Generating Sequences with recurrent neural networks. CoRR abs/1308.0850 (2013).
[18] Y. Guan, Z. Yuan, G. Sun, and J. Cong. 2017. FPGA-based accelerator for long short-term memory recurrent neural networks. In 2017 22nd Asia and
South Pacic Design Automation Conference (ASP-DAC). 629–634. hps://doi.org/10.1109/ASPDAC.2017.7858394
[19] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William J.
Dally. 2016. ESE: ecient speech recognition engine with compressed LSTM on FPGA. CoRR abs/1612.00694 (2016).
[20] A. Karpathy and L. Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In 2015 IEEE Conference on Computer Vision
and Paern Recognition (CVPR). 3128–3137. hps://doi.org/10.1109/CVPR.2015.7298932
[21] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling Flexible Dataow Mapping over DNN Accelerators via Recong-
urable Interconnects. SIGPLAN Not. 53, 2 (March 2018), 461–475. hps://doi.org/10.1145/3296957.3173176
[22] Minjae Lee, Kyuyeon Hwang, Jinhwan Park, Sungwook Choi, Sungho Shin, and Wonyong Sung. 2016. FPGA-based low-power speech recognition
with recurrent neural networks. In Signal Processing Systems (SiPS), 2016 IEEE International Workshop on. IEEE, 230–235.
[23] R. Leonard. 1984. A database for speaker-independent digit recognition. In ICASSP ’84. IEEE International Conference on Acoustics, Speech, and Signal
Processing, Vol. 9. 328–331.
[24] J. Li and Y. Shen. 2017. Image describing based on bidirectional LSTM and improved sequence sampling. In 2017 IEEE 2nd International Conference
on Big Data Analysis (ICBDA). 735–739. https://doi.org/10.1109/ICBDA.2017.8078733
[25] Yang Li, Quan Pan, Tao Yang, Suhang Wang, Jiliang Tang, and Erik Cambria. 2017. Learning word representations for sentiment analysis. Cognitive
Computation 9, 6 (01 Dec 2017), 843–851. https://doi.org/10.1007/s12559-017-9492-2
[26] Zhe Li, Caiwen Ding, Siyue Wang, Wujie Wen, Youwei Zhuo, Chang Liu, Qinru Qiu, Wenyao Xu, Xue Lin, Xuehai Qian, and Yanzhi Wang.
2018. E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs. CoRR abs/1812.07106 (2018). arXiv:1812.07106
http://arxiv.org/abs/1812.07106
[27] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
C. Lawrence Zitnick. 2014. Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014).
[28] Y. Long, T. Na, and S. Mukhopadhyay. 2018. ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems 26, 12 (Dec 2018), 2781–2794. https://doi.org/10.1109/TVLSI.2018.2819190
[29] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment
analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for
Computational Linguistics, Portland, Oregon, USA, 142–150.
[30] Matt Mahoney. 2006 (accessed October 30, 2018). About the test data. http://mattmahoney.net/dc/textdata.
[31] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn treebank.
Comput. Linguist. 19, 2 (June 1993), 313–330.
[32] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. CoRR abs/1609.07843 (2016).
[33] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. 2015. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). 5206–5210. https://doi.org/10.1109/ICASSP.2015.7178964
Manuscript submied to ACM
10 Rezk, et al.
[34] Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2, 1-2 (Jan. 2008), 1–135.
[35] J. Park, J. Kung, W. Yi, and J. J. Kim. 2018. Maximizing system performance by balancing computation loads in LSTM accelerators. In 2018 Design,
Automation Test in Europe Conference Exhibition (DATE). 7–12. https://doi.org/10.23919/DATE.2018.8341971
[36] Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. CoRR abs/1312.6026
(2013).
[37] Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018. Approximate FPGA-based LSTMs under computation
time constraints. CoRR abs/1801.02190 (2018).
[38] Vladimir Rybalkin, Alessandro Pappalardo, Muhammad Mohsin Ghaffar, Giulio Gambardella, Norbert Wehn, and Michaela Blott. 2018. FINN-L:
library extensions and design trade-off analysis for variable precision LSTM networks on FPGAs. CoRR abs/1807.04093 (2018).
[39] Vladimir Rybalkin, Norbert Wehn, Mohammad Reza Yousefi, and Didier Stricker. 2017. Hardware architecture of bidirectional long short-term
memory neural network for optical character recognition. In Proceedings of the Conference on Design, Automation & Test in Europe. European Design
and Automation Association, 1394–1399.
[40] Eder Santana and George Hotz. 2016. Learning a driving simulator. CoRR abs/1608.01230 (2016).
[41] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Joon Kyung Kim, Vikas Chandra, and Hadi Esmaeilzadeh. 2017. Bit
Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks. CoRR abs/1712.01507 (2017). arXiv:1712.01507
hp://arxiv.org/abs/1712.01507
[42] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. 2015. Unsupervised learning of video representations using LSTMs. CoRR
abs/1502.04681 (2015).
[43] Z. Sun, Y. Zhu, Y. Zheng, H. Wu, Z. Cao, P. Xiong, J. Hou, T. Huang, and Z. Que. 2018. FPGA acceleration of LSTM based on data for test flight. In
2018 IEEE International Conference on Smart Cloud (SmartCloud). 1–6. https://doi.org/10.1109/SmartCloud.2018.00009
[44] Wonyong Sung and Jinhwan Park. 2018. Single stream parallelization of recurrent neural networks for Low power and fast inference. CoRR
abs/1803.11389 (2018).
[45] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR abs/1409.3215 (2014).
[46] Aditya Srinivas Timmaraju. 2015. Sentiment analysis on movie reviews using recursive and recurrent neural network architectures.
[47] Vedran Vukotic, Christian Raymond, and Guillaume Gravier. [n. d.]. A step beyond local observations with a dialog aware bidirectional GRU
network for Spoken Language Understanding.
[48] Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Yanzhi Wang, Qinru Qiu, and Yun Liang. 2018. C-LSTM: enabling efficient LSTM using structured
compression techniques on FPGAs. CoRR abs/1803.06305 (2018).
[49] Z. Wang, J. Lin, and Z. Wang. 2017. Accelerating recurrent neural networks: a memory-efficient approach. IEEE Transactions on Very Large Scale
Integration (VLSI) Systems 25, 10 (Oct 2017), 2763–2775. https://doi.org/10.1109/TVLSI.2017.2717950
[50] S. Yin, X. Sun, S. Yu, J. Seo, and C. Chakrabarti. 2018. A Parallel RRAM Synaptic Array Architecture for Energy-Efficient Recurrent Neural Networks.
In 2018 IEEE International Workshop on Signal Processing Systems (SiPS). 13–18. https://doi.org/10.1109/SiPS.2018.8598445
[51] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2017. Recent trends in deep learning based natural language processing.
CoRR abs/1708.02709 (2017).
[52] X. Zhang, X. Liu, A. Ramachandran, C. Zhuge, S. Tang, P. Ouyang, Z. Cheng, K. Rupnow, and D. Chen. 2017. High-performance video content
recognition with long-term recurrent convolutional network for FPGA. In 2017 27th International Conference on Field Programmable Logic and
Applications (FPL). 1–4. https://doi.org/10.23919/FPL.2017.8056833
[53] Zhou Zhao, Ashok Srivastava, Lu Peng, and Qing Chen. 2019. Long Short-Term Memory Network Design for Analog Computing. J. Emerg. Technol.
Comput. Syst. 15, 1, Article 13 (Jan. 2019), 27 pages. https://doi.org/10.1145/3289393
Manuscript submied to ACM
