ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA by Han, Song et al.
ESE: Efficient Speech Recognition Engine
with Sparse LSTM on FPGA
Song Han1,2, Junlong Kang2, Huizi Mao1,2, Yiming Hu2,3, Xin Li2, Yubin Li2, Dongliang Xie2
Hong Luo2, Song Yao2, Yu Wang2,3, Huazhong Yang3 and William J. Dally1,4
1 Stanford University, 2 DeePhi Tech, 3 Tsinghua University, 4 NVIDIA
1 {songhan,dally}@stanford.edu, 2 song.yao@deephi.tech, 3 yu-wang@mail.tsinghua.edu.cn
ABSTRACT
Long Short-Term Memory (LSTM) is widely used in speech
recognition. In order to achieve higher prediction accuracy,
machine learning scientists have built increasingly larger mod-
els. Such large models are both computation and mem-
ory intensive. Deploying such bulky models results in high
power consumption and leads to a high total cost of owner-
ship (TCO) for a data center.
To speed up prediction and make it energy efficient, we
first propose a load-balance-aware pruning method that can
compress the LSTM model size by 20× (10× from pruning
and 2× from quantization) with negligible loss of prediction
accuracy, while keeping hardware utilization high. Next, we
propose a scheduler that encodes and partitions the compressed model to
multiple PEs for parallelism and schedules the complicated
LSTM data flow. Finally, we design a hardware architecture
named ESE that works directly on the sparse LSTM model.
Implemented on Xilinx XCKU060 FPGA running at 200MHz,
ESE has a performance of 282 GOPS working directly on the
sparse LSTM network, corresponding to 2.52 TOPS on the
dense one, and processes a full LSTM for speech recogni-
tion with a power dissipation of 41 Watts. Evaluated on the
LSTM for speech recognition benchmark, ESE is 43× and
3× faster than Core i7 5930k CPU and Pascal Titan X GPU
implementations. It achieves 40× and 11.5× higher energy
efficiency compared with the CPU and GPU respectively.
Keywords
Deep Learning; Speech Recognition; Model Compression;
Hardware Acceleration; Software-Hardware Co-Design; FPGA
1. INTRODUCTION
Deep neural networks are widely used for speech recognition
[6, 13]. Long Short-Term Memory (LSTM) and Gated
Recurrent Unit (GRU) are two popular types of recurrent
neural networks (RNNs) used for speech recognition. In this
work, we evaluated the most complex one: LSTM [14]. A
similar methodology could be easily applied to other types
of recurrent neural networks.
FPGA ’17, February 22 - 24, 2017, Monterey, CA, USA
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4354-1/17/02. . . $15.00
DOI: http://dx.doi.org/10.1145/3020078.3021745
Figure 1: Proposed efficient DNN deployment flow:
model compression + accelerated inference. (Conventional flow: Training → Inference. Proposed flow, this work: Training → Compression (pruning, quantization) → Accelerated Inference.)
Figure 2: ESE optimizes LSTM computation across
algorithm, software and hardware stack. (Algorithm: LSTM model compression with load-balance-aware pruning, 20x smaller with similar accuracy. Software: scheduling and compiling into a relative-indexed blocked CSC format. Hardware: FPGA acceleration, 3x speedup and 11.5x lower energy.)
Despite its high prediction accuracy, LSTM is hard to deploy
because of its high computation complexity and memory
footprint, which lead to high power consumption. A memory
reference consumes more than two orders of magnitude more
energy than an ALU operation, so we focus on optimizing
the memory footprint.
To reduce the memory footprint, we design a novel method
to optimize across the algorithm, software and hardware
stack: we first optimize the algorithm by compressing the
LSTM model to 5% of its original size (10% density and
2× narrower weights) while retaining similar accuracy; then
we develop a software mapping strategy to represent the
compressed model in a hardware-friendly way; finally we design
specialized hardware to work directly on the compressed
LSTM model.
The proposed flow for efficient deep learning inference
is illustrated in Fig. 1. It shows a new paradigm for efficient
deep learning inference, moving from Training=>Inference
to Training=>Compression=>Accelerated Inference, which
has the advantage of faster inference and higher energy efficiency
compared with the conventional method. Using LSTM as a
case study for the proposed paradigm, the design flow is
illustrated in Fig. 2.
arXiv:1612.00694v2 [cs.CL] 20 Feb 2017
The main contributions of this work are:
1. We present an effective model compression algorithm
for LSTM, which is composed of pruning and quanti-
zation. We highlight our load-balance-aware pruning
and automatic flow for dynamic-precision data quan-
tization.
2. The recurrent nature of RNN and LSTM produces
complicated data dependency, which is more challeng-
ing than feedforward neural nets. We design a sched-
uler that can efficiently schedule the complex LSTM
operations with memory reference overlapped with com-
putation.
3. The irregular computation pattern after compression
poses a challenge to hardware. We design a hardware
architecture that can work directly on the sparse
model. ESE achieves high efficiency by load balancing
and partitioning both the computation and storage.
ESE also supports processing multiple users’ speech
data concurrently.
4. We present an in-depth study of the LSTM and speech
recognition system and optimize across the algorithm,
software, and hardware boundaries. We jointly analyze the
trade-off between prediction accuracy and prediction
latency.
2. BACKGROUND
Speech recognition is the process of converting speech
signals to a sequence of words. As shown in Fig. 3, the
speech recognition system contains the front-end and back-
end units, where the front-end unit is used for extracting fea-
tures from speech signals, and the back-end unit processes
the features and converts speech to text. The back-end in-
cludes an acoustic model (AM), language model (LM), and
decoder. Here, the Long Short-Term Memory (LSTM) re-
current neural network is used in the acoustic model.
The feature vectors extracted from the front-end unit are
processed by the acoustic model; then the decoder uses both
acoustic and language models to generate the sequence of
words by maximum a posteriori probability (MAP) estima-
tion, which can be described as
$$\hat{W} = \arg\max_{W} P(W|X) = \arg\max_{W} \frac{P(X|W)\,P(W)}{P(X)}$$
where for the given feature vector $X = X_1 X_2 \ldots X_n$, the
goal of speech recognition is to find the word sequence $\hat{W} = W_1 W_2 \ldots W_m$
with maximum posterior probability $P(W|X)$.
Because $X$ is fixed, the above equation can be rewritten as
$$\hat{W} = \arg\max_{W} P(X|W)\,P(W)$$
where P(X|W) and P(W) are the probabilities computed
by the acoustic and language models, respectively, as shown in Fig. 3
[20].
In modern speech recognition systems, the LSTM architecture
is often used in large-scale acoustic modeling and for computing
acoustic output probabilities. LSTM is the most
computation and memory intensive part of the speech recognition
pipeline. Thus we focus on accelerating the LSTM.
Figure 3: The speech recognition pipeline. LSTM
takes more than 90% of the total execution time in
the whole computation pipeline. (Speech → front-end feature extraction → back-end acoustic model (LSTM), language model and decoder → text.)
Figure 4: Data flow of the LSTM model. (Input → LSTM → LSTM → FC → Softmax → Output; the dimensions labeled along the flow are 153, 512, 512, 6294 and 6294.)
The LSTM architecture is shown in Fig. 4, which is the
same as the standard LSTM implementation [19]. LSTM is
one type of RNN, where the input at time T depends on
the output at T − 1. Compared to the traditional RNN,
LSTM contains special memory blocks in the recurrent hid-
den layer. The memory cells with self-connections in mem-
ory blocks can store the temporal state of the network.
The memory blocks also contain special multiplicative units
called gates: input gate, output gate and forget gate. As in
Fig. 4, the input gate i controls the flow of input activations
into the memory cell. The output gate o controls the output
flow into the rest of the network. The forget gate f scales
the internal state of the cell before adding it as input to the
cell, which can adaptively forget the cell’s memory.
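An LSTM network accepts an input sequence $x = (x_1; \ldots; x_T)$,
and computes an output sequence $y = (y_1; \ldots; y_T)$ by using
the following equations iteratively from $t = 1$ to $T$:
\begin{align}
i_t &= \sigma(W_{ix} x_t + W_{ir} y_{t-1} + W_{ic} c_{t-1} + b_i) \tag{1} \\
f_t &= \sigma(W_{fx} x_t + W_{fr} y_{t-1} + W_{fc} c_{t-1} + b_f) \tag{2} \\
g_t &= \sigma(W_{cx} x_t + W_{cr} y_{t-1} + b_c) \tag{3} \\
c_t &= f_t \odot c_{t-1} + g_t \odot i_t \tag{4} \\
o_t &= \sigma(W_{ox} x_t + W_{or} y_{t-1} + W_{oc} c_t + b_o) \tag{5} \\
m_t &= o_t \odot h(c_t) \tag{6} \\
y_t &= W_{ym} m_t \tag{7}
\end{align}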
Here ⊙ denotes element-wise multiplication,
the W terms denote weight matrices (e.g., Wix is the
matrix of weights from the input to the input gate), and
Wic, Wfc, Woc are diagonal weight matrices for peephole
connections. The b terms denote bias vectors, while σ is the
logistic sigmoid function. The symbols i, f, o, c and m are
respectively the input gate, forget gate, output gate, cell
activation vectors and cell output activation vectors,
all of which are the same size. The symbols g and h are the cell
input and cell output activation functions.
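Equations (1) to (7) can be sketched as a single time step in NumPy. This is a minimal illustration under stated assumptions, not the Kaldi or ESE implementation: the function name `lstm_step` and the parameter dictionary layout are hypothetical, h is assumed to be tanh, g_t uses σ exactly as written in Eq. (3), and the diagonal peephole matrices are stored as vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, p):
    """One LSTM time step following Eqs. (1)-(7).

    p is a dict of parameters (hypothetical layout): W_*x and W_*r are
    dense matrices; w_ic, w_fc, w_oc hold the diagonals of the peephole
    matrices, so the peephole products are element-wise.
    """
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ir"] @ y_prev + p["w_ic"] * c_prev + p["b_i"])  # Eq. (1)
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fr"] @ y_prev + p["w_fc"] * c_prev + p["b_f"])  # Eq. (2)
    g_t = sigmoid(p["W_cx"] @ x_t + p["W_cr"] @ y_prev + p["b_c"])                       # Eq. (3)
    c_t = f_t * c_prev + g_t * i_t                                                       # Eq. (4)
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_or"] @ y_prev + p["w_oc"] * c_t + p["b_o"])     # Eq. (5)
    m_t = o_t * np.tanh(c_t)                                                             # Eq. (6), h assumed tanh
    y_t = p["W_ym"] @ m_t                                                                # Eq. (7)
    return y_t, c_t
```

Running a sequence amounts to calling `lstm_step` for t = 1 to T and feeding each returned `y_t` and `c_t` into the next call, which is the recurrent dependency that the scheduler in later sections has to respect.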
3. MODEL COMPRESSION
It has been widely observed that deep neural networks
usually have a lot of redundancy [11, 12]. Getting rid of
the redundancy doesn’t hurt prediction accuracy. From the
hardware perspective, model compression is critical for sav-
ing the computation as well as the memory footprint, which
means lower latency and better energy efficiency. We discuss
the two steps of model compression, pruning
and quantization, in the next three subsections.
3.1 Pruning
In the pruning phase we first train the model to learn
which weights are necessary, then prune away weights that
are not contributing to the prediction accuracy; finally, we
retrain the model given the sparsity constraint. The process
is the same as [12]. In step two, the saliency of the weight
is determined by the weight’s absolute value: if the weight’s
absolute value is smaller than a threshold, then we prune
it away. The pruning threshold is empirical: pruning too
much will hurt the accuracy while pruning at the right level
won’t.
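The prune step described above can be sketched as magnitude thresholding. This is an illustrative sketch, not the authors' Kaldi code: the function name `magnitude_prune` is hypothetical, and the threshold is derived here from a target sparsity rather than chosen empirically.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    Returns the pruned weights and the boolean mask of surviving weights.
    """
    k = int(round(sparsity * weights.size))  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # Threshold is the k-th smallest absolute value; everything at or
    # below it is pruned.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask
```

During retraining, the same mask would be reapplied after every weight update so that pruned connections stay at zero, which is one common way to impose the sparsity constraint.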
Our pruning experiments are performed on the Kaldi speech
recognition toolkit [17]. The trade-off curve between the percentage
of parameters pruned away and the phone error rate (PER)
is shown in Fig. 6. The LSTM is evaluated on the TIMIT
dataset [8]. The PER does not begin to increase dramatically
until more than 93% of the parameters are pruned away. We
further experimented on a proprietary dataset that is much
larger: it has 1000 hours of training speech data, 100 hours
of validation speech data, and 10 hours of test speech data.
We find that we can prune away 90% of the parameters with-
out hurting word error rate (WER), which aligns with our
result on the TIMIT dataset. In our later discussions, we
use 10% density (90% sparsity).
3.2 Load Balance-Aware Pruning
On top of the basic deep compression method, we highlight
our practical design considerations for hardware efficiency.
To execute sparse matrix multiplication in parallel,
we propose the load-balance-aware pruning method, which
is critical for better load balancing and higher utilization
on the hardware.
Pruning could lead to an unbalanced distribution of
non-zero weights. The resulting workload imbalance over
PEs may cause a gap between the real performance and the peak
performance. This problem is further addressed in Section 4.
Load-balance-aware pruning is designed to solve this problem
and obtain a hardware-friendly sparse network, which has
the same sparsity ratio among all the submatrices.
During pruning, we avoid the scenario in which
the density of one submatrix is 5% while that of another is 15%.
Although the overall density is about 10%, the PE holding the submatrix
with a density of 5% has to wait for the one with more
computation, which leads to idle cycles. Load-balance-aware
pruning assigns the same sparsity quota to every submatrix, thus
ensuring an even distribution of non-zero weights.
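A minimal sketch of the idea follows, assuming the submatrices are formed by interleaving rows over the PEs (row r belongs to PE r mod n, as the encoding section describes). The function name `load_balanced_prune` and the per-submatrix thresholding are assumptions for illustration, not the authors' training code.

```python
import numpy as np

def load_balanced_prune(weights, sparsity, num_pes):
    """Prune each PE's submatrix to the same sparsity quota.

    Row r is assumed to belong to PE (r % num_pes). Within each
    submatrix the largest-magnitude weights are kept, so every PE ends
    up with the same number of non-zero weights.
    """
    mask = np.zeros(weights.shape, dtype=bool)
    for pe in range(num_pes):
        sub = np.abs(weights[pe::num_pes])           # rows owned by this PE
        keep = sub.size - int(round(sparsity * sub.size))
        if keep > 0:
            # Value of the keep-th largest magnitude in this submatrix.
            threshold = np.partition(sub.ravel(), sub.size - keep)[sub.size - keep]
            mask[pe::num_pes] = sub >= threshold
    return weights * mask, mask
```

Compared with a single global threshold, the only change is that the threshold is computed per submatrix, which trades a small amount of freedom in choosing which weights survive for equal work per PE.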
Figure 5: Load-balance-aware pruning and its benefit
for parallel processing. (The figure shows an 8×4 sparse weight matrix whose rows are distributed over PE0 to PE3. Unbalanced: the PEs need 5, 2, 4 and 1 cycles, so the overall time is 5 cycles. Balanced: every PE needs 3 cycles, so the overall time is 3 cycles.)
Figure 6: Accuracy curve of load-balance-aware
pruning and original pruning. (Axes: phone error rate, 19.0% to 28.0%, versus parameters pruned away, 0% to 100%; curves: with load balance and without load balance; the sweet spot is marked.)
In Fig. 5, the matrix is divided into four colors, and each
color belongs to a PE for parallel processing. With conventional
pruning, PE0 might have five non-zero weights
while PE3 may have only one. The total processing time is
determined by the slowest PE, which needs five cycles. With
load-balance-aware pruning, all PEs have three non-zero weights;
thus only three cycles are necessary to carry out the operation.
Both cases have the same total number of non-zero weights,
but load-balance-aware pruning needs fewer cycles. The difference
in prediction accuracy with and without load-balance-aware
pruning is very small, as shown in Fig. 6. There is
some noise around 70% sparsity, so we focused our experiments
around 90% sparsity, which is the sweet spot, and found
the accuracy of the two methods to be very similar.
To show that load-balance-aware pruning still obtains comparable
prediction accuracy, we compare it with original
pruning on the TIMIT dataset. As demonstrated in Fig. 6,
the accuracy margin between the two methods is within the variance
of the pruning process itself.
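The cycle counts quoted for Fig. 5 can be reproduced with a few lines. The sketch below assumes one multiply-accumulate per PE per cycle and that row r is handled by PE r mod 4; the two sparsity patterns are transcribed from the figure (1 marks a non-zero weight), and the balanced pattern is a reconstruction where the extracted figure text is incomplete.

```python
import numpy as np

def cycles_per_pe(pattern, num_pes):
    """Non-zero weights handled by each PE when row r goes to PE r % num_pes."""
    nnz_per_row = np.count_nonzero(pattern, axis=1)
    return [int(nnz_per_row[pe::num_pes].sum()) for pe in range(num_pes)]

# Sparsity patterns from Fig. 5 (1 = non-zero weight).
unbalanced = np.array([
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
])
balanced = np.array([
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
])

for name, pattern in (("unbalanced", unbalanced), ("balanced", balanced)):
    per_pe = cycles_per_pe(pattern, num_pes=4)
    # PEs run in lock-step, so the slowest PE sets the overall time.
    print(name, per_pe, "overall:", max(per_pe))
```

Both patterns contain 12 non-zero weights; only their distribution over the PEs differs.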
3.3 Weight and Activation Quantization
We further compressed the model by quantizing 32-bit
floating-point weights into 12-bit integers. We used a linear
quantization strategy on both the weights and activations.
In the weight quantization phase, the dynamic ranges of
the weights for all matrices in each LSTM layer are analyzed
first, then the length of the fractional part is initialized to
avoid data overflow.
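Table 1: Weight quantization under different bit widths.
| Layer | Weight matrix (1) | Min | Max | Integer bits | Decimal bits, 16bit | Decimal bits, 12bit | Decimal bits, 8bit |
|---|---|---|---|---|---|---|---|
| LSTM1 | W_gifo_x (2) | -4.9285 | 5.7196 | 4 | 8 | 4 | 0 |
| LSTM1 | W_gifo_r (2) | -0.6909 | 0.7140 | 1 | 11 | 7 | 3 |
| LSTM1 | bias | -3.0143 | 2.1120 | 3 | 13 | 9 | 5 |
| LSTM1 | W_ic | -0.6884 | 0.9584 | 1 | 15 | 11 | 7 |
| LSTM1 | W_fc | -0.6597 | 0.7204 | 1 | 15 | 11 | 7 |
| LSTM1 | W_oc | -1.5550 | 1.3325 | 2 | 14 | 10 | 6 |
| LSTM1 | W_ym | -0.9373 | 0.8676 | 1 | 11 | 7 | 3 |
| LSTM2 | W_gifo_x | -1.0541 | 1.0413 | 2 | 10 | 6 | 2 |
| LSTM2 | W_gifo_r | -0.6313 | 0.6400 | 1 | 11 | 7 | 3 |
| LSTM2 | bias | -1.5833 | 1.8009 | 2 | 14 | 10 | 6 |
| LSTM2 | W_ic | -0.9428 | 0.5158 | 1 | 15 | 11 | 7 |
| LSTM2 | W_fc | -0.5762 | 0.6202 | 1 | 15 | 11 | 7 |
| LSTM2 | W_oc | -1.0619 | 1.4650 | 2 | 14 | 10 | 6 |
| LSTM2 | W_ym | -1.0947 | 1.0170 | 2 | 10 | 6 | 2 |
(1) Only weights in LSTM layers are quantized.
(2) In Kaldi, Wcx, Wix, Wfx, Wox are saved together as W_gifo_x,
and W_gifo_r is defined likewise.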
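The "Integer" column of Table 1 is consistent with choosing the smallest signed integer width that covers the observed dynamic range. The sketch below illustrates that rule and a plain linear fixed-point rounding; the function names `integer_bits` and `quantize` are hypothetical, and the rule is an inference from the table values, not a procedure stated by the authors.

```python
import math
import numpy as np

def integer_bits(min_val, max_val):
    """Bits for the integer part (sign bit included) that avoid overflow."""
    max_abs = max(abs(min_val), abs(max_val))
    if max_abs == 0:
        return 1
    # floor(log2) + 1 magnitude bits cover values up to and including max_abs.
    magnitude_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return 1 + magnitude_bits

def quantize(values, total_bits, frac_bits):
    """Linear fixed-point quantization with saturation, returned as floats."""
    scale = 2.0 ** frac_bits
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(np.asarray(values, dtype=float) * scale), lo, hi)
    return q / scale

# Example with the LSTM1 bias range from Table 1: 3 integer bits,
# leaving 16 - 3 = 13 fractional bits in a 16-bit word.
int_bits = integer_bits(-3.0143, 2.1120)
print(int_bits, 16 - int_bits)
```

For rows such as bias and W_ic, the decimal counts in the table equal the total width minus the integer bits. The large sparse matrices show four fewer decimal bits, which may reflect the 4-bit index that shares the 16-bit word, although the table does not say so.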
Table 2: Activation function lookup table.
| Activation | Min | Max | Sampling range | Sampling points |
|---|---|---|---|---|
| Sigmoid input | -51.32 | 59.16 | [-64, 64] | 2048 |
| Tanh input | -104.7 | 107.4 | [-128, 128] | 2048 |
The activation quantization phase aims to find a suitable
representation for the activation functions and the intermediate
results. We build lookup tables and use linear interpolation
for the activation functions, such as sigmoid and
tanh, and analyze the dynamic range of their inputs to decide
the sampling strategies. We also investigate the minimum
number of bits needed to maintain the accuracy.
We explored different data quantization strategies with
an LSTM trained on the TIMIT corpus. By quantizing both the weights
and the activations, we can achieve 12-bit quantization
without any accuracy loss. The data quantization
strategies are shown in Tables 1, 2, and 3. For the lookup tables
of the activation functions sigmoid and tanh, the sampling
ranges are [-64, 64] and [-128, 128] respectively. Both tables
use 2048 sampling points, and the outputs are 16 bits wide with
15 fractional bits. All the results are obtained using the Kaldi
framework.
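A lookup table with linear interpolation can be sketched as follows, using the sampling ranges and the 2048 points given above. Uniform spacing of the sample points, the function names, and the use of `numpy.interp` are assumptions for illustration; the hardware table layout is not described in this section.

```python
import numpy as np

def build_lut(fn, lo, hi, points=2048, out_frac_bits=15):
    """Sample fn uniformly on [lo, hi] and round outputs to fixed point."""
    xs = np.linspace(lo, hi, points)
    scale = 2.0 ** out_frac_bits
    # 16-bit signed output with 15 fractional bits: saturate to the
    # representable range before scaling back to floats.
    ys = np.clip(np.round(fn(xs) * scale), -(2 ** 15), 2 ** 15 - 1) / scale
    return xs, ys

def lut_eval(x, xs, ys):
    """Linear interpolation between neighbouring table entries."""
    return np.interp(x, xs, ys)  # inputs outside [lo, hi] clamp to the end values

sigmoid_lut = build_lut(lambda z: 1.0 / (1.0 + np.exp(-z)), -64.0, 64.0)
tanh_lut = build_lut(np.tanh, -128.0, 128.0)

x = np.linspace(-10.0, 10.0, 10001)
sigmoid_err = np.max(np.abs(lut_eval(x, *sigmoid_lut) - 1.0 / (1.0 + np.exp(-x))))
tanh_err = np.max(np.abs(lut_eval(x, *tanh_lut) - np.tanh(x)))
print(sigmoid_err, tanh_err)
```

Because the tanh table covers twice the range with the same number of points, its sample spacing is twice as coarse, so its interpolation error is expected to be larger than that of the sigmoid table.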
For TIMIT, as shown in Table 4, the PER is 20.4% for
the original network and changes to 20.7% after the pruning
and fine-tuning procedure when 32-bit floating-point numbers
are used. The PER remains at 20.7%, without any further accuracy
loss, under 16-bit and 12-bit quantization, and deteriorates to 84.5%
when 8-bit quantization is employed.
4. ENCODING AND COMPILING
The LSTM computation includes sparse matrix multiplication,
element-wise multiplication, and memory reference.
We designed a data flow scheduler to make full use of the
hardware accelerator.
Figure 7: Encoding in CSC format and data alignment
using zero-padding. (The figure shows the weight sequence 0 0 w0,2 w0,3 0 w0,5 stored as the non-zero weights w0,2 w0,3 w0,5 with relative indices 2 0 1. Each encoded weight is 16 bits wide, and the encoded weight column is zero-padded to align with the DDR and PCIE data widths; the widths labeled in the figure are 128-bit and 512-bit.)
Table 3: Other activation quantization.
| Activation | Min | Max | Width (bits) | Decimal bits |
|---|---|---|---|---|
| LSTM input | -7.611 | 8.166 | 16 | 11 |
| Intermediate results | -107.8 | 109.4 | 16 | 8 |
Table 4: PER before and after compression.
| Quantization scheme | Phone error rate |
|---|---|
| 32bit floating original network | 20.4% |
| 32bit floating pruned network | 20.7% |
| 16bit fixed pruned network | 20.7% |
| 12bit fixed pruned network | 20.7% |
| 8bit fixed pruned network | 84.5% |
The data is divided into n blocks by row, where n is the number
of PEs in one channel of our hardware accelerator. The first
n rows are put in n different PEs, and the (n+1)-th row is put in
the first PE again. This ensures that the first part of the
matrix will be read in the first reading cycle and can be used
immediately in the next computation step.
Because the pruned matrices are sparse, we store only
the non-zero weights to avoid wasting
memory. We use a relative row index and a column pointer
to store the sparse matrix. The relative row index of
each weight gives its position relative to the previous non-zero
weight. The column pointer indicates where each new column
begins in the matrix. The accelerator reads the weights
according to the column pointer.
Considering the byte-aligned bit width limitation of DDR,
we use 16bit data to store the weight. The quantized weight
and relative row index are put together (i.e. 12bit for quan-
tized weight and 4bit for relative row index).
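The storage scheme described above can be sketched in a few functions. This is an illustration under assumptions, not the ESE compiler: the relative index is taken to be the number of zeros since the previous stored entry (which matches the 2, 0, 1 example in Fig. 7), the index is placed in the top 4 bits of the 16-bit word (the bit order is not given in the text), and the per-PE row partitioning is omitted for brevity. All function names are hypothetical.

```python
def encode_column(column, max_run=15):
    """Encode one column as (weight, relative_index) pairs.

    relative_index counts the zeros since the previous stored entry.
    When a run of zeros no longer fits in 4 bits, a padding zero with
    index max_run is stored and the count restarts.
    """
    entries, run = [], 0
    for w in column:
        if w == 0:
            run += 1
            if run > max_run:
                entries.append((0, max_run))  # padding zero
                run = 0
        else:
            entries.append((w, run))
            run = 0
    return entries

def pack(weight_q, rel_index):
    """Pack a signed 12-bit weight and a 4-bit index into one 16-bit word."""
    assert -2048 <= weight_q <= 2047 and 0 <= rel_index <= 15
    return (rel_index << 12) | (weight_q & 0xFFF)

def encode_matrix(matrix):
    """Column-by-column encoding plus the column pointer array."""
    words, col_ptr = [], [0]
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        words += [pack(w, r) for w, r in encode_column(column)]
        col_ptr.append(len(words))  # where the next column begins
    return words, col_ptr
```

For instance, `encode_column([0, 0, 7, 9, 0, 5])` yields the relative indices 2, 0 and 1, and the entries of column j occupy `words[col_ptr[j]:col_ptr[j + 1]]`.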
Fig.7 shows an example for the compressed sparse column
(CSC) storage format and zero-padding method. We locate
one column in the weight matrix through a pointer and cal-
culate the absolute address of weights by accumulating rel-
ative indexes. In Fig. 8, we demonstrate the computation
pattern using a simple example where the input vector has
6 elements {a0,a1,a2,a3,a4,a5}, and the weight matrix con-
tains 8×6 elements. There are 2 PEs calculating a3×w[3],
where a3 is the fourth element in the input vector and w[3]
represents the fourth column in the weight matrix.
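The computation pattern of this example can be sketched as a column-wise matrix-vector product in which row r is handled by PE r mod 2. The sparsity pattern below follows the 8×6 matrix of Fig. 8; the weight values are arbitrary placeholders and the function name `spmv_interleaved` is hypothetical.

```python
import numpy as np

def spmv_interleaved(weights, a, num_pes):
    """Column-wise sparse matrix-vector product with rows interleaved over PEs.

    Returns the result vector and work[j][pe], the number of
    multiply-accumulates PE `pe` performs for input element a[j].
    """
    rows, cols = weights.shape
    b = np.zeros(rows)
    work = np.zeros((cols, num_pes), dtype=int)
    for j in range(cols):                          # a[j] is shared by all PEs
        for r in np.flatnonzero(weights[:, j]):    # non-zeros of column j
            b[r] += weights[r, j] * a[j]           # accumulated by PE r % num_pes
            work[j, r % num_pes] += 1
    return b, work

# Sparsity pattern of the 8x6 example (1 = non-zero weight).
pattern = np.array([
    [0, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 1],
])
weights = pattern * np.arange(1, 49).reshape(8, 6)   # placeholder values
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

b, work = spmv_interleaved(weights, a, num_pes=2)
print(work[3])   # work of PE 0 and PE 1 for a3 x w[3]
```

For column 3, PE 0 holds the non-zeros in rows 0, 2 and 4 and PE 1 holds the one in row 7, which is the split drawn in the figure.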
5. HARDWARE IMPLEMENTATION
In this section, we first present challenges in hardware
design and then propose the Efficient Speech Recognition
Engine (ESE) accelerator system and detail how ESE accel-
erates the sparse LSTM.
5.1 Motivation
Figure 8: The computation pattern: non-zero
weights in a column are assigned to 2 PEs, and every
PE multiply-adds its weights with the same
element from the shared vector. (The figure shows the input vector a0 to a5 and an 8×6 sparse weight matrix whose rows alternate between PE 0 and PE 1; for a3 × w[3], PE 0 processes w0,3, w2,3 and w4,3 while PE 1 processes w7,3.)
Figure 9: Imbalanced workload results in more waiting
time. (The figure plots computation time and wait time for PE 0 to PE N across Task 1 and Task 2.)
Although pruning and quantization can reduce the memory
footprint, three new challenges are introduced. General
purpose processors cannot address these challenges
efficiently.
First, irregular computation is introduced by compression.
After pruning, dense computation becomes sparse computation;
after quantization, the weight and index are not
byte-aligned and must be grouped. We group the 4-bit relative index
and the 12-bit weight into 2 bytes.
Second, load imbalance introduced by sparsity will reduce
the hardware efficiency. In the sparse LSTM, a single ele-
ment in the voice vector will be consumed by multiple PEs.
As a result, operations of all PEs have to be synchronized.
It will create a long waiting period if some PEs have fewer
non-zero weights, as shown in Fig.9.
Third, general-purpose processors cannot fully exploit
the parallelism in the compressed LSTM network. In a
custom design, however, we have the freedom to take advantage
of the parallelism both across SpMV operations
and within a single SpMV operation.
Many challenges exist in the specialized hardware accelerator
design on FPGA. First, customized decoding circuits
are needed to recover the original weight matrix. The index
is relative, so accumulation is needed to recover the absolute
index. We use only 4 bits to represent the relative offset. If a
real offset is more than 15, the largest offset that 4 bits can
represent, a padding zero is introduced.
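The accumulation step performed by the decoding circuit can be sketched in software as follows. The sketch assumes that the relative index counts the zeros skipped since the previous stored entry and that a padding entry is a stored zero with the maximum index; the function name `decode_column` is hypothetical.

```python
def decode_column(entries, num_rows):
    """Recover a dense column from (weight, relative_index) pairs.

    The absolute row is obtained by accumulating the relative indices;
    padding zeros only advance the position.
    """
    column = [0] * num_rows
    pos = -1
    for weight, rel in entries:
        pos += rel + 1          # skip `rel` zeros and land on this entry
        if weight != 0:
            column[pos] = weight
    return column

# Three non-zeros with relative indices 2, 0, 1 land on rows 2, 3 and 5.
print(decode_column([(7, 2), (9, 0), (5, 1)], 6))
```

A padding entry such as `(0, 15)` followed by `(3, 0)` places the weight 3 on row 16, which is how a gap longer than 15 zeros is bridged.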
Second, data representation should be carefully designed.
The data width of the PCIE interface, external DDR3 mem-
ory interface, and data itself are not aligned. Moreover, the
dynamic-precision quantization makes hardware computa-
tion on different data more complex and irregular. Bit shifts
are necessary for different layers.
Third, a carefully designed scheduler/controller is needed.
The LSTM network involves a complicated data flow and
many different types of weights. Computations in the LSTM
network have dependency on each other. Some computation
can be executed concurrently, while other computation has
to be executed sequentially. Moreover, the hardware design
should support input vector sharing in the multi-channel sys-
tem, which aims to perform multiple LSTM networks with
different voice vectors concurrently. Therefore, a carefully
designed scheduler is necessary for a highly pipelined design,
which can overlap the data communication and computa-
tion.
5.2 System Overview
Fig. 10 (a) shows the overall architecture of the ESE
system. It is a CPU+FPGA heterogeneous architecture to
accelerate LSTM. The whole system can be divided into
three parts: the hardware accelerator on an FPGA chip, the
software program on the CPU, and the external memory on the
FPGA board.
The software part consists of a CPU and host memory. It
communicates with the FPGA via the PCI-Express bus. In the
initialization procedure, it sends the parameters of the LSTM
model to the FPGA. It can transmit voice vectors and receive the
corresponding results once the hardware accelerator on the FPGA
is ready.
Table 5: Two types of LSTM operations: matrix-vector
multiplication and element-wise multiplication.
| Target | SpMV Group | ElemMul Group |
|---|---|---|
| it | Wix xt, Wir yt−1 | Wic ct−1 |
| ft | Wfx xt, Wfr yt−1 | Wfc ct−1 |
| ct | Wcx xt, Wcr yt−1 | ft ⊙ ct−1, it ⊙ gt |
| ot | Wox xt, Wor yt−1 | Woc ct |
| mt | N/A | ot ⊙ ht |
| yt | Wym mt | N/A |
The external memory together with the FPGA chip on
one development board stores all the parameters and voice
vectors. The on-chip BRAM is limited while the amount of
data in the LSTM model is larger than it can hold. The
accelerator accesses the DRAM through memory controller
(MEM Controller), which is built using the memory interface
generator (MIG) IP.
On the FPGA chip, we put the ESE Accelerator, ESE
Controller, PCIE Controller, MEM Controller, and On-chip
Buffers. The ESE Accelerator consists of Processing Ele-
ments (PEs) which take charge of the majority of compu-
tation tasks in the LSTM model. PE is the basic computa-
tion unit for a slice of voice vectors with partial weight ma-
trix. Each ESE channel implements the LSTM network for
one voice vector sequence independently. On-chip buffers,
including input buffer and output buffer, prepare data to
be consumed by PEs and store the generated results. The
ESE Controller determines the behavior of other circuits on
the FPGA chip. It schedules the PCIE/MEM Controller
for data-fetch and the LSTM computation pipeline flow of
the ESE Accelerator. The accelerator reads parameters and
voice vectors from, and writes computation results to, the
DRAM memory. When the MEM Controller is in the idle
state, the accelerator can read results currently stored in the
memory and feed them to the software part.
5.3 ESE Controller (Scheduler)
The most expensive operations are sparse matrix-vector
multiplication (SpMV) and element-wise multiplication (ElemMul).
We partition the operations involved in the LSTM
network, described by equations (1) to (6), into these two
kinds of operations, as shown in Table 5.
The LSTM has a complicated dataflow. We want to satisfy the
data dependencies and expose as much parallelism as possible at the same
time. Fig. 11 shows the state machine in the ESE scheduler.
It overlaps computation and memory reference. From state
INITIAL to STATE 6, the ESE accelerator completes the
computation of an LSTM. The operations in the first three lines
fetch weights, pointers, and vectors/diagonal matrices/biases
respectively to prepare for the next computation. Operations
in the fourth line are matrix-vector multiplications,
and those in the fifth line are element-wise multiplications (indigo
blocks) or accumulations (orange blocks). Operations in the
horizontal direction have to be executed sequentially, while
those in the vertical direction can be executed concurrently.
For example, we can calculate W_fr y_{t−1} and i_t concurrently,
because the two operations are not dependent on each other
in the LSTM network, and they can be executed by two independent
computation units. W_ir y_{t−1}/W_ic c_{t−1} and i_t have
to be executed sequentially, because i_t is dependent on the
former operations in the LSTM network.
Figure 10: The Efficient Speech Recognition Engine (ESE) accelerator system: (a) the overall ESE system
architecture; (b) one channel with multiple processing elements (PEs). (Panel (a): the CPU running the software program, the external memory, and the FPGA, which contains the PCIE Controller, MEM Controller, ESE Controller, input and output buffers, and the ESE Accelerator with channels 0 to N of PEs. Panel (b): each PE contains an ActQueue FIFO, an SpMV unit, an Act Buffer, a Weight Buffer with SpmatRead, a Pointer Buffer with PtrRead, and an accumulator; the channel also contains ElemMul units, an Adder Tree, Sigmoid/Tanh units, and the ct and Ht buffers.)
Figure 11: The state flow of the ESE accelerator system: operations in the horizontal direction and vertical
direction are executed sequentially and concurrently respectively. (States run from INITIAL through STATE_1 to STATE_6. The data fetch rows load weights, pointers P, and vectors, diagonal matrices or biases for the next operation; the computation rows show sparse matrix-vector multiplications by SpMV, element-wise multiplications by ElemMul, accumulations by the Adder Tree, and idle slots marked N/A. Computation per state: STATE_1: Wix xt, Wfx xt, Wcx xt, Wic ct−1; STATE_2: Wir yt−1, Wfr yt−1, Wcr yt−1, Wfc ct−1, it, ft; STATE_3: Wox xt, gt; STATE_4: Wor yt−1, ct, Woc ct, ht; STATE_5: ot, mt; STATE_6: yt.)
Wixxt and Wfxxt are not dependent on each other in the
LSTM network, but they cannot be calculated concurrently
because they have a resource conflict. Weights are stored in
one piece of DDR3 memory because, even after compression,
a real-world network cannot fit in the limited block RAM
(4.25MB). Other parameters and the input vector are stored in
the other piece of DDR3 memory. Pointers are required for
the same computations as weights, because we use pointers
to look up weights in the compressed LSTM network,
but the memory overhead necessary to store the pointers is
small. Note that x, bias b and diagonal matrix Wc are not
accessed at the same time, and all these parameters are
relatively small. Therefore, pointers, vectors, diagonal
matrices and biases can be stored in the same external
memory and prepared accordingly during the weight-fetching
period.
The latency of the element-wise operations and non-linear
functions is not on the critical path. These operations are
executed in parallel with the matrix-vector multiplication
and weights-fetching.
5.4 ESE Channel Architecture
Fig. 10 (b) shows the architecture of one ESE channel with
multiple PEs. It is composed of an Activation Queue (ActQueue),
Sparse Matrix-vector Multiplier (SpMV), Accumulator,
Element-wise Multiplier (ElemMul), Adder Tree,
Sigmoid/Tanh Units and local buffers.
Activation Vector Queue (ActQueue). ActQueue
consists of several FIFOs. Each FIFO stores some elements
of the input voice vector a_j for one PE. ActQueue is shared
by all the PEs in one channel, while each FIFO is owned by
one PE independently.
ActQueue decouples the imbalanced workload among
different PEs. Load imbalance arises when the
number of multiply-accumulate operations differs from PE
to PE because of imbalanced sparsity. Without a queue, PEs
with fewer computation tasks have to wait until the
PE with the most computation tasks finishes. With a FIFO,
a fast PE can fetch a new element from its
FIFO and is not blocked by slow PEs. The data
width of each FIFO is 16 bits. The depth is varied from 1 to 16
to investigate its effect on latency; the results are
discussed in the experiment section. These FIFOs are built
from on-chip distributed RAM.
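To make the decoupling effect concrete, the following toy model estimates utilization for a given FIFO depth. It is a sketch under stated assumptions, not the ESE hardware logic: each PE is assumed to need `work[k][j]` cycles for activation `j`, and activation `j` is assumed to be released to all PEs only once every PE has finished activation `j - depth`, so that `depth = 1` reduces to lock-step execution. The function name and the release rule are illustrative.

```python
def simulate_utilization(work, depth):
    """Return (utilization, total_cycles) for a toy FIFO model.

    work[k][j] is the number of cycles PE k spends on activation j
    (for example, its non-zero count in column j).
    Assumption: activation j becomes visible to the PEs once every PE
    has finished activation j - depth, so depth=1 means lock-step.
    """
    num_pes = len(work)
    num_acts = len(work[0])
    finish = [[0] * num_acts for _ in range(num_pes)]
    for j in range(num_acts):
        # Earliest cycle at which activation j can be consumed.
        if j >= depth:
            release = max(finish[k][j - depth] for k in range(num_pes))
        else:
            release = 0
        for k in range(num_pes):
            prev_done = finish[k][j - 1] if j > 0 else 0
            finish[k][j] = max(release, prev_done) + work[k][j]
    total_cycles = max(row[-1] for row in finish)
    busy_cycles = sum(sum(row) for row in work)
    # Utilization = busy cycles / total cycles, averaged over all PEs.
    return busy_cycles / (num_pes * total_cycles), total_cycles
```

For `work = [[3, 1], [1, 3]]`, lock-step execution (`depth=1`) takes 6 cycles, whereas `depth=2` lets each PE run ahead and finishes in 4 cycles with full utilization.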
Sparse Matrix Read (SpmatRead). The Pointer Read
Unit (PtrRead) and Sparse Matrix Read (SpmatRead) manage
the storage and output of the encoded weight matrix. The
start and end pointers p_j and p_{j+1} for column j determine
the start location and length of the elements in one encoded
weight column that should be fetched for each element of
a voice vector. SpmatRead uses pointers p_j and p_{j+1} to
look up the non-zero elements in weight column j. Both
PtrRead and SpmatRead consist of ping-pong buffers. Each
buffer can store 512 16-bit values and is implemented with
block RAMs. Each 16-bit value in the SpmatRead buffers consists
of a 4-bit index and a 12-bit weight. The four basic
computing components are described below.
Sparse Matrix-vector Multiplication (SpMV). Each
element in the voice vector is multiplied by its corresponding
weight column. Multiplication results in the same row
of all the resulting vectors are summed to generate an element
in the result vector, which is a local reduction. In ESE, SpMV
multiplies an element from the input activation by a column
of weights, and the current partial result is written into the
partial result buffer ActBuffer. Accumulator Accu sums
the new output of SpMV and the previous data stored in
ActBuffer. The multipliers instantiated in the design perform
16-bit × 12-bit multiplications.
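The pointer lookup and column-wise accumulation described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the ESE datapath: the 4-bit index is assumed to occupy the upper bits of each 16-bit word and to count the zero rows skipped since the previous non-zero in the column, and weights are treated as unsigned 12-bit integer codes. The names `pack`, `unpack` and `spmv` are hypothetical.

```python
def pack(rel_index, weight_code):
    """Pack a 4-bit index and a 12-bit weight code into one 16-bit word."""
    assert 0 <= rel_index < 16 and 0 <= weight_code < 4096
    return (rel_index << 12) | weight_code


def unpack(word):
    """Split a 16-bit word into (4-bit index, 12-bit weight code)."""
    return word >> 12, word & 0x0FFF


def spmv(ptr, spmat, activations, num_rows):
    """Column-wise sparse matrix-vector product y = W a.

    ptr[j] and ptr[j + 1] delimit the encoded non-zeros of column j.
    """
    act_buffer = [0] * num_rows  # partial sums, in the role of ActBuffer
    for j, a_j in enumerate(activations):
        row = -1
        for word in spmat[ptr[j]:ptr[j + 1]]:
            rel_index, weight = unpack(word)
            row += rel_index + 1             # skip rel_index zero rows
            act_buffer[row] += weight * a_j  # accumulate, in the role of Accu
    return act_buffer
```

As a usage example, the 3×2 matrix with columns (1, 0, 3) and (0, 2, 0) is encoded as `spmat = [pack(0, 1), pack(1, 3), pack(1, 2)]` with `ptr = [0, 2, 3]`; multiplying by the activations `[10, 100]` gives `[10, 200, 30]`.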
Element-wise Multiplication (ElemMul). ElemMul
in Fig.10 (b) generates one vector by consuming two vectors.
Each element in the output vector is the element-wise mul-
tiplication of two input vectors. There are 16 multipliers
instantiated for element-wise multiplications per channel.
Adder Tree. AdderTree performs summation by con-
suming the intermediate data produced by other units or
bias data from input buffer.
Sigmoid/Tanh. Sigmoid and Tanh units are the non-linear
modules applied as activation functions to some
intermediate summation results.
Here we explain how ESE computes i_t. In the initial
state, the PE receives weight Wix, pointers P and voice vector
x. Then SpMV calculates Wix x_t in the first phase of
STATE 1. Wir y_{t−1} and Wic c_{t−1} are generated by SpMV
and ElemMul respectively in the first phase of STATE 2. In
the second phase of STATE 2, Adder Tree accumulates these
outputs and the bias data from the input buffer, and then the
following non-linear activation function unit Sigmoid/Tanh
produces the intermediate data i_t. The PE fetches the
parameters required by each phase during the preceding phase,
so that data fetching overlaps with computation. The other
LSTM network operations are similar. In
Fig.11, either SpMV or ElemMul is in the idle state at some
phases. This is because both matrix-vector multiplication
and element-wise multiplication consume weight data, and the
PE cannot pre-fetch enough weight data for both computations
within one phase.
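The phases above can be mirrored in a short Python sketch. It assumes the input gate has the form i_t = sigmoid(Wix x_t + Wir y_{t−1} + Wic ⊙ c_{t−1} + b_i), with the diagonal matrix Wic stored as a vector of its diagonal entries. Dense lists stand in for the sparse hardware units, and all function and variable names are illustrative.

```python
import math


def matvec(W, v):
    """Dense stand-in for the SpMV unit."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def input_gate(W_ix, W_ir, w_ic, b_i, x_t, y_prev, c_prev):
    """Compute i_t following the state flow described in the text."""
    # STATE 1, first phase: SpMV computes Wix x_t.
    wx = matvec(W_ix, x_t)
    # STATE 2, first phase: SpMV computes Wir y_{t-1} while
    # ElemMul computes Wic * c_{t-1} element by element.
    wy = matvec(W_ir, y_prev)
    wc = [w * c for w, c in zip(w_ic, c_prev)]
    # STATE 2, second phase: the adder tree sums the partial results
    # and the bias, then the sigmoid unit produces i_t.
    pre = [a + b + c + d for a, b, c, d in zip(wx, wy, wc, b_i)]
    return [1.0 / (1.0 + math.exp(-p)) for p in pre]
```

In this sketch the two STATE 2 products are computed one after the other; in the accelerator they run concurrently on separate units.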
5.5 Memory System
In the hardware design, on-chip buffers are built upon the
basic idea of double-buffering, in which double buffers are
operated in a ping-pong manner to overlap data transfer
with computation. We use two 4GB DDR3 DRAMs
as the off-chip memory, named DDR_1 and DDR_2 in Fig.12,
and design a memory controller (MEM Controller). Fig.12
shows the MEM Controller architecture. On the one hand,
it receives instructions from the ESE Controller and schedules
the data flow among the ESE accelerator, PCIE interface,
and DDR3 interface. On the other hand, it rearranges
received data into the structures required by the destination
interface. We take the data flow of result y as an example.
Data y at the output port of the PE is 16-bit wide, while the
PCIE interface is 128-bit wide. In order to increase the data
transmission speed, we assemble eight 16-bit values into one
128-bit value with the Y_ASSEMBLE unit. The value is then
stored in DDR_1 temporarily and fed back to the software
via the PCIE interface when both PCIE and DDR_1 are
idle. This behavior is shown as the
green arrow line in Fig.12. Similarly, vector x is split from
a 512-bit value into 32 16-bit values through asynchronous
FIFOs. Moreover, the asynchronous FIFOs FIFO_WR_XX and
FIFO_RD_XX also play an important role in isolating
asynchronous clock domains.
Figure 12: Memory management unit. (Blocks: DDR_1 and DDR_2 with DDR_1_Controller and DDR_2_Controller; Y_ASSEMBLE (16BIT–128BIT); FIFOs FIFO_WR_W128_R512_D256, FIFO_RD_W512_R512_D256, FIFO_WR_W128_R128_D256, FIFO_W512_R64_D512 and FIFO_W64_R16_D512; data streams P/B/Wc, X, Y and W; PCIE BUS and PE DATA BUS.)
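The two width conversions mentioned here (eight 16-bit values into one 128-bit word, and one 512-bit word into 32 16-bit values) can be illustrated with Python integers. The lane order, with element 0 in the least significant bits, is an assumption, and the helper names are hypothetical.

```python
def assemble(values, width=16):
    """Concatenate fixed-width values into one wide word (lane 0 = LSBs)."""
    word = 0
    for lane, value in enumerate(values):
        assert 0 <= value < (1 << width)
        word |= value << (lane * width)
    return word


def split(word, count, width=16):
    """Split a wide word back into `count` fixed-width values."""
    mask = (1 << width) - 1
    return [(word >> (lane * width)) & mask for lane in range(count)]


# Eight 16-bit results y form one 128-bit PCIE word.
y_word = assemble([1, 2, 3, 4, 5, 6, 7, 8])
# One 512-bit memory word holds 32 16-bit elements of x.
x_values = split(assemble(list(range(32))), 32)
```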
6. EXPERIMENTAL RESULTS
In this section, the performance of the hardware system
is evaluated. First, we introduce the environment setup of
our experiments. Then, hardware resource utilization and
comprehensive experimental results are provided.
6.1 Experimental Setup
The proposed ESE hardware system is built on XCKU060
FPGA running at 200 MHz. Two external 4GB DDR3
DRAMs are used. Our host program is responsible for send-
ing parameters and vectors into the programmable logic
part, and collecting corresponding results.
We use the TIMIT dataset to evaluate the performance
of model compression. TIMIT is an acoustic-phonetic con-
tinuous speech corpus. It contains broadband recordings of
630 speakers of eight major dialects of American English,
each reading ten phonetically rich sentences. We also use
a proprietary, much larger speech recognition dataset that
contains 1000 hours of training data, 100 hours of validation
data and 10 hours of test data.
Our baseline software program runs on i7-5930k CPU and
Pascal Titan X GPU. We use MKL BLAS / cuBLAS on
CPU / GPU for dense matrix operation implementations,
and MKL SPARSE / cuSPARSE on CPU / GPU for sparse
matrix implementations.
6.2 Resource Utilization
Table 6 shows the resource utilization of our ESE design
configured with 32 channels, each with 32 PEs, on the
XCKU060 FPGA. The ESE accelerator design almost fully
utilizes the FPGA’s hardware resources.
Table 6: ESE Resource Utilization.
        LUT      LUTRAM¹  FF       BRAM¹  DSP
Avail.  331,680  146,880  663,360  1,080  2,760
Used    293,920  69,939   453,068  947    1,504
Utili.  88.6%    47.6%    68.3%    87.7%  54.5%
¹ LUTRAM is 64b each, BRAM is 36Kb each.
Figure 13: FIFO improves load balancing and decreases
latency. The ALU utilization is more than
90% when FIFO depth is 8 for load balancing.
(Y-axis: utilization, busy cycles / total cycles, 40% to 100%; x-axis: Wix, Wfx, Wcx, Wir, Wfr, Wcr, Wox, Wor, Wym; series: No FIFO, FIFO = 4, FIFO = 8, FIFO = 16.)
We configured each channel with 32 PEs, a number determined
by balancing computation and data transfer. The
speed of data transfer must be no less than that
of computation in order not to starve the DSPs. As a result,
we get Equation 8. The expression to the left of the inequality
sign is the amount of computation divided by
the computation speed. The factor of 2 in the numerator
means that each piece of data needs one multiplication and one
accumulation, and the factor of 2 in the denominator
indicates that each PE performs these two operations per cycle
on one 2-byte (16-bit) weight, since ESE implements the
multiply-accumulate operation in
a pipelined manner. The expression to the right represents
the time ESE takes to fetch the required amount of data from
external memory. In our hardware implementation, the
frequencies of both the PE and the memory interface controller are
200MHz, and the width of the external DRAM interface is
512-bit. Therefore, the proper number of PEs per channel is 32.
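$$
\frac{\mathrm{data\_size} \times \mathrm{compress\_ratio} \times 2}{\mathrm{PE\_num} \times 2 \times \mathrm{freq\_PE}} \;\geq\; \frac{\mathrm{data\_size} \times \mathrm{compress\_ratio} \times 16\,\mathrm{bit}}{\mathrm{ddr\_width} \times \mathrm{freq\_ddr}} \tag{8}
$$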
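Cancelling the common factors in Equation 8 gives PE_num ≤ (ddr_width × freq_ddr) / (16 bit × freq_PE). The small helper below, whose name and parameters are illustrative, evaluates this bound for the stated parameters.

```python
def max_pes_per_channel(ddr_width_bits, freq_ddr_hz, freq_pe_hz, bits_per_weight=16):
    """Largest PE count for which data transfer keeps up with computation.

    Derived from Equation 8 after cancelling data_size, compress_ratio
    and the factor 2 on the left-hand side:
        1 / (PE_num * freq_PE) >= bits_per_weight / (ddr_width * freq_ddr)
    """
    return (ddr_width_bits * freq_ddr_hz) // (bits_per_weight * freq_pe_hz)


# 512-bit DRAM interface, PE and memory controller both at 200 MHz.
pes = max_pes_per_channel(512, 200_000_000, 200_000_000)
```

With equal clock frequencies the bound reduces to 512 / 16 = 32, which matches the chosen number of PEs per channel.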
FIFO Depth. ESE uses FIFOs to decouple the PEs and
solve the load imbalance problem. Load imbalance here
means that the number of non-zero weights assigned to each PE
is different. The FIFO for each PE reduces the waiting time
of PEs with fewer computations. We vary the FIFO depth
to investigate its effect. The FIFO width is 16-bit, and its
depth is set to 1, 4, 8 and 16. In Fig.13, when the FIFO depth is
one (no FIFO), the utilization, defined as busy cycles
divided by total cycles, is low (80%) due to load imbalance.
When the FIFO depth is 4, the utilization is above 90%.
When the FIFO depth is increased to 8 and 16, the utilization
increases further, but only marginally. Thus we choose a
FIFO depth of 8. Note that even when the FIFO depth
is 8, the last matrix (Wym) still has low utilization. This is
because that matrix has very few rows and each PE has few
elements, so the FIFO cannot fully solve the problem
for this matrix.
6.3 Accuracy, Speed and Energy Efficiency
We evaluate the trade-off between accuracy and speedup
of ESE in Fig.15. The speedup increases as more parameters
get pruned away. The sparse model, which is pruned to 10%,
achieves a 6.2× speedup over the dense baseline model. Comparing
the red and green lines, we find that load-balance-aware
pruning improves the speedup from 5.5× to 6.2×.
Table 7: Power consumption of different platforms.
Platform  CPU Dense  CPU Sparse  GPU Dense  GPU Sparse  ESE
Power     111W       38W         202W       136W        41W
Figure 14: Measured at the socket, the total power
consumption of the machine with FPGA fully loaded
is 132W. Without FPGA the idle machine consumes
91W. Subtracting the two, ESE consumes 41W.
We measured the power consumption of the CPU, GPU and ESE.
CPU power is measured with the pcm-power utility, and GPU
power with the nvidia-smi utility. We measure
the power consumption of ESE by taking the difference in
machine power with and without the FPGA board installed.
ESE takes 41 watts; the CPU takes 111 watts (38 watts when
using MKL SPARSE) and the GPU takes 202 watts (136 watts
when using cuSPARSE).
The performance comparison of LSTM on ESE, CPU,
and GPU is shown in Table 8. The CPU implementation
uses MKL BLAS and MKL SPARSE for the dense and sparse
implementations, and the GPU implementation uses cuBLAS
and cuSPARSE. We optimized the CPU/GPU speed by combining
the four matrices of the i, f, o, c gates, which have
no dependency, into one large matrix. Both the MKL SPARSE and
cuSPARSE implementations achieve significantly lower utilization
of peak CPU/GPU performance for the matrix sizes of interest
(relatively small) and sparsity (around 10% non-zeros).
We implemented the whole LSTM on ESE. The
model was pruned to 10% non-zeros; there are 11.2% non-zeros
when padding zeros are taken into account. On ESE, the total
throughput is 282 GOPS with the sparse LSTM, which
corresponds to 2.52 TOPS on the dense LSTM. Processing
the LSTM with 1024 hidden elements, ESE takes 82.7 µs,
CPU takes 6017.3/3569.9 µs (dense/sparse), and GPU takes
240.3/287.4 µs (dense/sparse). With batch=32, CPU sparse
is faster than dense because the CPU is good at serial processing,
while GPU sparse is slower than dense because the GPU is
throughput oriented. With no batching, we observed that both
CPU and GPU are faster for the sparse LSTM because the
saving of memory bandwidth is more salient.
Performance-wise, ESE is 43× faster than the CPU and 3× faster
than the GPU. Considering both performance and power consumption,
ESE is 197.0×/40.0× (dense/sparse) more energy
efficient than the CPU, and 14.3×/11.5× (dense/sparse) more
energy efficient than the GPU. The sparse LSTM makes both the CPU
and the GPU more energy efficient as well, which shows the
advantage of our pruning technique.
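The reported ratios appear to follow from treating energy per inference as power × latency. The sketch below uses the power values from Table 7 and the total times from Table 8 and reproduces the stated figures under that assumption; the dictionary and function names are illustrative.

```python
POWER_W = {"ESE": 41, "CPU dense": 111, "CPU sparse": 38,
           "GPU dense": 202, "GPU sparse": 136}
TIME_US = {"ESE": 82.7, "CPU dense": 6017.3, "CPU sparse": 3569.9,
           "GPU dense": 240.3, "GPU sparse": 287.4}


def energy_uj(platform):
    """Energy per inference in microjoules (watts x microseconds)."""
    return POWER_W[platform] * TIME_US[platform]


def efficiency_gain(platform):
    """How many times more energy the platform uses than ESE."""
    return energy_uj(platform) / energy_uj("ESE")


def speedup(platform):
    """How many times longer the platform takes than ESE."""
    return TIME_US[platform] / TIME_US["ESE"]


gains = {name: round(efficiency_gain(name), 1) for name in POWER_W if name != "ESE"}
```

Under this reading, the 43× speedup corresponds to the sparse CPU run and the 3× speedup to the dense GPU run, which are the faster configuration on each platform.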
Table 8: Performance comparison of running LSTM on ESE, CPU and GPU.
Columns Size through Equ. perform. refer to ESE on FPGA (ours). The CPU and GPU columns give the real computation time, reported once per group of concatenated matrices (see notes 3 and 4).
Matrix  Size      Sparsity  Compres.  Theoreti.  Real       Total    Real      Equ.     Equ.      CPU      CPU     GPU    GPU
                  (%)¹      matrix    comput.    comput.    operat.  perform.  operat.  perform.  dense    sparse  dense  sparse
                            (Bytes)²  time (µs)  time (µs)  (GOP)    (GOP/s)   (GOP)    (GOP/s)   (µs)     (µs)    (µs)   (µs)
Wix     1024×153  11.7      36608     2.9        5.36       0.0012   218.6     0.010    1870.7    1518.4³  670.4   34.2   58.0
Wfx     1024×153  11.7      36544     2.9        5.36       0.0012   218.2     0.010    1870.7
Wcx     1024×153  11.8      37120     2.9        5.36       0.0012   221.6     0.010    1870.7
Wox     1024×153  11.5      35968     2.8        5.36       0.0012   214.7     0.010    1870.7
Wir     1024×512  11.3      118720    9.3        10.31      0.0038   368.5     0.034    3254.6    3225.0⁴  2288.0  81.3   166.0
Wfr     1024×512  11.5      120832    9.4        10.01      0.0039   386.3     0.034    3352.1
Wcr     1024×512  11.2      117760    9.2        9.89       0.0038   381.2     0.034    3394.5
Wor     1024×512  11.5      120256    9.4        10.04      0.0038   383.5     0.034    3343.7
Wym     512×1024  10.0      104832    8.2        15.66      0.0034   214.2     0.034    2142.7    1273.9   611.5   124.8  63.4
Total   3248128   11.2      728640    57.0       82.7       0.0233   282.2     0.208    2515.7    6017.3   3569.9  240.3  287.4
¹ Pruned with 10% sparsity, but padding zeros incurred about 1% more non-zero weights.
² Sparse matrix index is included; the weight takes 12 bits and the index takes 4 bits, giving 2 Bytes per weight in total.
³ Concatenating Wix, Wfx, Wcx and Wox into one large matrix Wifoc_x, whose size is 4096×153.
⁴ Concatenating Wir, Wfr, Wcr and Wor into one large matrix Wifoc_r, whose size is 4096×512. These matrices have no dependency,
and combining them can achieve a 2× speedup on GPU due to better utilization.
7. RELATED WORK
Deep Compression. Deep Compression [11] is a method
that can compress convolutional neural network models by
35×-49× without hurting the accuracy. It comprises
pruning, weight sharing and Huffman coding. However, it
targets CNNs and image recognition. In
this work we target LSTM and speech recognition. Our
method also differs from the previously proposed ‘Deep
Compression’ in that we propose load-balance-aware pruning.
During pruning, we constrain each row to have the same number
of weights to ensure hardware load balance. During quantization,
we use linear quantization instead of non-linear
quantization, which is simpler but gives a smaller compression
ratio. We also eliminate the Huffman coding step, which
introduces extra decoding overhead for only marginal gain.
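As a minimal illustration of the per-row constraint described in this paragraph, the sketch below keeps the same number of weights in every row. Selecting the largest-magnitude weights is an assumption made for the example, and the function name is hypothetical; it is not presented as the exact pruning procedure used for ESE.

```python
def prune_rows_balanced(matrix, keep_per_row):
    """Zero all but the keep_per_row largest-magnitude weights in each row."""
    pruned = []
    for row in matrix:
        # Column indices of this row, ordered by decreasing magnitude.
        order = sorted(range(len(row)), key=lambda c: abs(row[c]), reverse=True)
        keep = set(order[:keep_per_row])
        pruned.append([w if c in keep else 0 for c, w in enumerate(row)])
    return pruned


dense = [[5, -1, 2, 0.5],
         [0.1, -3, 0.2, 4]]
sparse = prune_rows_balanced(dense, keep_per_row=2)
```

Every row of `sparse` then holds exactly two non-zero weights, so units that process different rows receive the same amount of work.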
CNN Accelerators. Many custom accelerators have been
proposed for CNNs. DianNao [2] implements an array of
multiply-add units to map large DNNs onto its core architecture.
Due to limited SRAM resources, the off-chip DRAM
traffic dominates the energy consumption. DaDianNao [3]
and ShiDianNao [5] eliminate the DRAM access by keeping
all weights on-chip (eDRAM or SRAM). However, these
DianNao-series architectures are proposed to accelerate CNNs,
and the weights are uncompressed and stored in a dense
format. In this work, we target LSTM neural networks and
speech recognition, and data compression is also supported
in our ESE architecture. Our work also differs
from the Angel-Eye architecture [9,18], which likewise covers
compression, compilation and acceleration, but accelerates
CNNs, not LSTMs.
EIE Accelerator. The EIE architecture proposed by Han
et al. [10] can perform inference on a compressed network
model and accelerates the resulting sparse matrix-vector
multiplication with weight sharing. With only 600mW power
consumption, EIE can achieve 102 GOPS processing power
on a compressed network, corresponding to 3 TOPS on an
uncompressed network, which is 24000× and 3400× more
energy efficient than a CPU and GPU respectively. EIE is a
general building block for deep neural networks, not specially
designed for LSTM and speech recognition; ESE in this paper
targets LSTM. ESE has different design constraints on
FPGA while EIE is for ASIC, which leads to different design
considerations. Besides, EIE uses codebook-based quantization,
which has a better compression ratio; ESE uses linear
quantization, which is easier to implement.
Figure 15: Computation latency decreases as the
sparsity increases. Running the sparse model is 4.2×
faster over the dense model, both run on ESE. Load
balance aware pruning helps speedup.
(Y-axis: speedup, 0× to 7×; x-axis: parameters pruned away, 0% to 90%; series: with load balance, 6.2× speedup over dense; without load balance, 5.5× speedup over dense.)
Sparse Matrix-Vector Multiplication Accelerators.
To pursue better computational efficiency in machine learning
and deep learning, several recent works focus on using
FPGAs as accelerators for Sparse Matrix-Vector Multiplication
(SpMV). Zhuo and Prasanna [21] proposed an FPGA-based
design on Virtex-II Pro for SpMV. Their design outperforms
general-purpose processors. Fowers et al. [7] proposed a
novel sparse matrix encoding and an FPGA-optimized architecture
for SpMV. It achieves 2.6×
and 2.3× higher power efficiency than CPU and GPU respectively,
while having lower performance due to lower memory
bandwidth. Dorrance et al. [4] proposed a scalable SMVM
kernel on Virtex-5 FPGA. It outperforms CPU and GPU
counterparts with >300× computational efficiency and has
38-50× improvement in energy efficiency. For compressed
deep networks, previously proposed SpMV accelerators can
only exploit the static weight sparsity. In this paper, we use
the relative indexed compressed sparse column (CSC) format
for data storage, and we develop a scheduler which can
map a complicated LSTM network onto the ESE accelerator.
GRU on FPGA. Nurvitadhi et al. presented a hardware
accelerator for the Gated Recurrent Unit (GRU) on
Stratix V and Arria 10 FPGAs [16]. This work shows that
FPGAs can provide superior performance/Watt over CPU
and GPU. In our work, we present an FPGA accelerator
for LSTM networks. It also demonstrates that an FPGA
achieves higher efficiency than CPU and GPU. Different from theirs,
ESE is specifically designed for accelerating sparse LSTM
models.
LSTM on FPGA. In order to explore the parallelism of
RNN/LSTM, Chang et al. presented a hardware implementation
of an LSTM network on the Xilinx Zynq 7020 FPGA with
2 layers and 128 hidden units in hardware [1]. The implementation
is 21 times faster than the ARM Cortex-A9 CPU
embedded on the Zynq 7020 FPGA. Lee et al. accelerated RNNs
using massively parallel processing elements (PEs) for low
latency and high throughput on FPGA [15]. These implementations
do not support sparse LSTM networks, while
ESE achieves further speedup by supporting sparse LSTM.
8. CONCLUSION
In this paper, we present the Efficient Speech Recognition Engine
(ESE), which works directly on a compressed sparse LSTM
model. ESE is optimized across the algorithm-hardware
boundary: we first propose a method to compress the LSTM
model by 20× without sacrificing prediction accuracy,
which greatly reduces the memory bandwidth required by the
FPGA implementation. Then we design a scheduler that maps
the complex LSTM operations onto the FPGA and achieves
parallelism. Finally, we propose a hardware architecture that
efficiently deals with the irregularity caused by compression.
Working directly on the compressed model enables
ESE to achieve 282 GOPS (equivalent to 2.52 TOPS for
the dense LSTM) on a Xilinx XCKU060 FPGA board. ESE outperforms
the Core i7 CPU and Pascal Titan X GPU by factors
of 43× and 3× in speed, and it is 40× and 11.5× more
energy efficient than the CPU and GPU respectively.
9. ACKNOWLEDGMENT
This work was supported by National Natural Science
Foundation of China (No.61373026, 61622403, 61261160501).
We would like to thank Wei Chen, Zhongliang Liu, Guanzhe
Huang, Yong Liu, Yanfeng Wang, Xiaochuan Wang and
other researchers from Sogou for their suggestions and pro-
viding real-world speech data for model compression perfor-
mance test.
10. REFERENCES
[1] A. X. M. Chang, B. Martini, and E. Culurciello.
Recurrent neural networks hardware implementation
on FPGA. CoRR, abs/1511.05552, 2015.
[2] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen,
and O. Temam. DianNao: a small-footprint
high-throughput accelerator for ubiquitous
machine-learning. In ASPLOS, 2014.
[3] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang,
L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam.
DaDianNao: A machine-learning supercomputer. In
MICRO, December 2014.
[4] R. Dorrance, F. Ren, et al. A scalable sparse
matrix-vector multiplication kernel for energy-efficient
sparse-BLAS on FPGAs. In FPGA, 2014.
[5] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo,
X. Feng, Y. Chen, and O. Temam. ShiDianNao:
shifting vision processing closer to the sensor. In
ISCA, pages 92–104. ACM, 2015.
[6] D. Amodei et al. Deep Speech 2: End-to-end speech
recognition in English and Mandarin. arXiv preprint
arXiv:1512.02595, 2015.
[7] J. Fowers, K. Ovtcharov, K. Strauss, et al. A high
memory bandwidth FPGA accelerator for sparse
matrix-vector multiplication. In FCCM, 2014.
[8] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G.
Fiscus, and D. S. Pallett. DARPA TIMIT
acoustic-phonetic continuous speech corpus CD-ROM. NIST
speech disc 1-1.1. NASA STI/Recon Technical Report N,
93, 1993.
[9] K. Guo, L. Sui, et al. Angel-Eye: A complete design
flow for mapping CNN onto customized hardware. In
ISVLSI, 2016.
[10] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A.
Horowitz, and W. J. Dally. EIE: efficient inference
engine on compressed deep neural network. arXiv
preprint arXiv:1602.01528, 2016.
[11] S. Han, H. Mao, and W. J. Dally. Deep Compression:
Compressing deep neural networks with pruning,
trained quantization and Huffman coding. ICLR, 2016.
[12] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning
both weights and connections for efficient neural
networks. In Proceedings of Advances in Neural
Information Processing Systems, 2015.
[13] A. Hannun, C. Case, J. Casper, B. Catanzaro,
G. Diamos, E. Elsen, R. Prenger, S. Satheesh,
S. Sengupta, A. Coates, and A. Ng. Deep Speech:
Scaling up end-to-end speech recognition. arXiv
preprint arXiv:1412.5567, 2014.
[14] S. Hochreiter and J. Schmidhuber. Long short-term
memory. Neural computation, 1997.
[15] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and
W. Sung. FPGA-based low-power speech recognition
with recurrent neural networks. arXiv preprint
arXiv:1610.00552, 2016.
[16] E. Nurvitadhi, J. Sim, D. Sheffield, A. Mishra,
S. Krishnan, and D. Marr. Accelerating recurrent
neural networks in analytics servers: Comparison of
FPGA, CPU, GPU, and ASIC. In Field Programmable Logic
and Applications (FPL), 2016 26th International
Conference on, pages 1–4. EPFL, 2016.
[17] D. Povey, A. Ghoshal, G. Boulianne, L. Burget,
O. Glembek, N. Goel, M. Hannemann, P. Motlicek,
Y. Qian, P. Schwarz, et al. The Kaldi speech
recognition toolkit. In IEEE 2011 workshop on
automatic speech recognition and understanding, 2011.
[18] J. Qiu, J. Wang, et al. Going deeper with embedded
FPGA platform for convolutional neural network. In
FPGA, 2016.
[19] H. Sak et al. Long short-term memory recurrent
neural network architectures for large scale acoustic
modeling. In INTERSPEECH, pages 338–342, 2014.
[20] X. Huang and L. Deng. An Overview of Modern Speech
Recognition, pages 339–366. Chapman & Hall/CRC,
January 2010.
[21] L. Zhuo and V. K. Prasanna. Sparse matrix-vector
multiplication on FPGAs. In FPGA, 2005.
