In this paper, a neural network based real-time speech recognition (SR) system is developed using an FPGA for very low-power operation. The implemented system employs two recurrent neural networks (RNNs): one is a speech-to-character RNN for acoustic modeling (AM) and the other is for character-level language modeling (LM). The system also employs a statistical word-level LM to improve the recognition accuracy. The results of the AM, the character-level LM, and the word-level LM are combined using a simple N-best search algorithm instead of a hidden Markov model (HMM) based network. The RNNs are implemented using massively parallel processing elements (PEs) for low latency and high throughput. The weights are quantized to 6 bits so that all of them can be stored in the on-chip memory of the FPGA. The proposed algorithm is implemented on a Xilinx XC7Z045, and the system operates at about 4.12 times real-time speed.
I. INTRODUCTION
Speech recognition has long been studied, and most of the algorithms employ hidden Markov models (HMMs) or their variants as inference and information combining tools [1], [2]. Recently, deep neural networks have been employed for acoustic modeling (AM) in state-of-the-art speech recognition systems, which, however, are not free from the HMM [3]. HMM modeling for speech recognition demands a vast number of memory access operations on a large network, whose size usually exceeds a few hundred megabytes [4]. Thus, speech recognition algorithms are usually implemented on GPUs or multi-core systems equipped with large DRAM-based memory, which are hardly power efficient.
Recently, speech recognition algorithms based entirely on recurrent neural networks have been actively investigated [5], [6]. The RNN is end-to-end trained with connectionist temporal classification (CTC) [7] to directly transcribe the input utterance to characters. The RNN has also been used for language modeling (LM), where it shows much better capability than tri-gram based statistical algorithms [8]. Complete speech recognition algorithms have been developed by combining the CTC RNN and the RNN LM [5], [6]. These RNN based algorithms do not employ a conventional HMM that needs a large search space. However, neural network algorithms, including RNNs, demand a very large number of arithmetic operations, and thus they are mostly implemented using GPUs [9], [10].
In this work, a low-power real-time speech recognition (SR) system is developed using an FPGA. The developed system employs two long short-term memory (LSTM) RNNs [11]: one for acoustic modeling and the other for character-level language modeling. A statistical word-level LM is also used to further improve the recognition performance. The overall algorithm is shown in Fig. 1. The information generated from the RNNs and the word-level LM is combined using a tree-structured N-best beam search algorithm. With a beam width of 128, the beam search requires only about 197 KB for its data structures, while the conventional HMM based network demands a few hundred megabytes of memory. The SR system employs a unidirectional RNN based acoustic model, which causes a slight disadvantage in recognition performance compared to a bidirectional one, but is more appropriate for online real-time applications where an immediate reaction to the utterance is desired.
The RNNs for acoustic modeling and character-level LM are implemented on a mid-sized FPGA, the Xilinx XC7Z045, which contains 2.18 MB of on-chip memory. To store all the weights of the RNNs in the on-chip memory, the weights are quantized to 6 bits using the retraining based fixed-point optimization algorithm [12]. The RNN for the character-level LM stores 128 contexts in the on-chip memory, where one context is assigned to each beam in the N-best search. All of the weights and the contexts are stored in the on-chip memory of the FPGA, and thus the RNNs do not need DRAM accesses, which require a large amount of energy [13], [14]. As a result, this speech recognition system uses DRAM accesses only for tri-gram based language modeling, and consumes very little power compared to GPU based systems or other off-chip memory based architectures. The RNNs in the FPGA are implemented using highly parallel arithmetic arrays.
The paper is organized as follows. In Section II, recent related works are revisited. Section III describes the implemented SR algorithm. The FPGA based implementation of the algorithm is shown in Section IV. The system is evaluated in Section V. Concluding remarks are in Section VI.
II. RELATED WORKS
A. Large Vocabulary Continuous Speech Recognition
Most state-of-the-art large vocabulary continuous speech recognition (LVCSR) systems employ a DNN-HMM hybrid acoustic model [3] or a weighted finite state transducer (WFST) decoder [2]. The WFST network is composed by integrating the HMM acoustic model, a pronunciation lexicon model, and a word-level n-gram back-off language model. Therefore, the resulting decoding network becomes huge, usually over a few hundred megabytes [4], which hinders small-footprint low-power implementations.
A traditional LVCSR performs Viterbi decoding [15] on the WFST network using senone-level likelihoods computed by the acoustic model. Efficient hardware based implementation of the LVCSR [16] is difficult because of the large number of search operations needed for Viterbi decoding. Specifically, the network cannot be embedded in the on-chip memory due to its size and is usually stored on an off-chip DRAM module. The energy cost of a DRAM access is large since static power is required to keep the I/O active and data must travel a long distance [13]. Therefore, the decoding procedure on WFST using DRAM consumes a large amount of power.
Recently, several RNN based end-to-end speech recognizers have been developed [17], [9], [10]. A phoneme-level CTC-trained RNN for acoustic modeling can reduce the size of a WFST network to about half of that needed for DNN-HMM hybrid models [10]. Also, character-level RNN language models and prefix beam search decoding greatly reduce the complexity of the decoding stage [5], [6]. In particular, a tree-based online decoding algorithm has been proposed for low-latency speech recognition [6].
B. FPGA-Based Neural Network Implementation
Neural networks demand many multiply and add operations, but they are hardware-friendly in nature due to their massive parallelism. However, many previous implementations store the network parameters on an external DRAM, since the networks usually demand millions of parameters or more. Note that the weights for fully connected layers or recurrent neural networks are used only once when fetched, and thus their accesses show very low temporal locality. There have been efforts to reduce the size of the parameters by quantization. The bit-width of DNN weights can be reduced to only two bits by retraining the quantized parameters with a modified backpropagation algorithm [12]. This approach was successfully applied to CNNs and RNNs [18], [19]. RNNs also demand a large number of parameters, so it is helpful to quantize them to a small number of bits. A study on weight quantization of RNNs was presented in [19]. The retraining based quantization method led to an efficient VLSI implementation of DNNs that stores all the quantized parameters in on-chip SRAM [20]. A similar architecture was also employed for a DNN implementation on an FPGA [21].
III. SPEECH RECOGNITION WITHOUT HMM

A. Algorithm Overview
The speech recognition algorithm implemented in this paper consists of an RNN for acoustic modeling (AM), an RNN for character-level LM, and a statistical word-level LM, as illustrated in Fig. 1. The RNN AM employs the online CTC algorithm [22] and generates the probabilities of characters by analyzing each frame of the input utterance. The character-level RNN LM outputs the probabilities of the following characters, while the statistical word-level tri-gram back-off LM gives those of the following words. The information generated from these three modules is integrated to find the best hypothesis using an N-best search algorithm.
The acoustic model has a deep LSTM network structure and is end-to-end trained with the online CTC algorithm [22]. Although some recent RNN-based end-to-end speech recognition algorithms [17], [9], [10] employ a bidirectional structure to improve recognition performance, we use a unidirectional structure for real-time operation, where future contexts cannot be accessed.
The proposed SR system also employs a deep unidirectional LSTM RNN for character-level LM [23]. Since the character-level LM does not utilize any lexicon information, it can dictate out-of-vocabulary (OOV) words but is slightly disadvantaged in recognizing words that are in the dictionary. When compared to widely used HMM or RNN based speech recognition algorithms, the implemented one has the capability of low-latency decoding and OOV dictation, but these characteristics also mean a slight weakness in recognition accuracy. The structures of the RNNs for the AM and the character-level LM are described in [6].
In our work, a conventional statistical tri-gram back-off model is also employed for the word-level LM to complement the RNN based character-level LM. For better backing-off, we use improved Kneser-Ney smoothing [24]. The word-level LM is integrated into the N-best beam search in a similar manner to the character-level LM [6], except that the rescoring is performed on the fly, only when the active node represents a blank or the end of sentence (EOS) symbol. Also, a word insertion bonus is considered when the word-level LM is applied. Note that the number of DRAM accesses for the word-level LM is not very large.
B. Beam Search Algorithm
In this work, the beam search decoding is conducted with a simple prefix tree structure. The N-best hypotheses are generated using the RNN AM and the RNN for character-level LM, and rescored by the statistical word-level LM on the fly.
Let L be the set of all output labels in the RNN AM except for the CTC blank. The input feature vector from time 1 to t is denoted as x_{1:t}. Given x_{1:t}, the goal of the beam search decoding is to find the label sequence with the maximum posterior probability generated by the RNN AM.
The hypotheses are represented by a simple tree, where each node in the tree represents a label in L. To deal with CTC state transitions, state-based networks represented with the CTC states, L′ = L ∪ {CTC blank}, are employed at the low level by decomposing a tree node into two CTC states: a state that corresponds to a label in L and a following state that represents the CTC blank label. Since the tree grows indefinitely as the beam search proceeds, it is necessary to prune the search tree periodically. The tree is pruned both in depth and width as explained in [6].
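The width pruning mentioned above can be illustrated with a minimal Python sketch. The `Hypothesis` class, its fields, and the `prune_width` function are hypothetical names introduced only for illustration; the actual tree structure, the score combination, and the depth pruning are those of [6] and are not reproduced here.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Hypothesis:
    labels: str       # label sequence on the path from the tree root (assumed layout)
    log_prob: float   # accumulated log score of the path (assumed scoring)


def prune_width(hypotheses, beam_width=128):
    """Keep only the beam_width highest-scoring hypotheses, best first."""
    return heapq.nlargest(beam_width, hypotheses, key=lambda h: h.log_prob)


beam = [Hypothesis("A", -1.2), Hypothesis("B", -0.3), Hypothesis("C", -2.5)]
kept = prune_width(beam, beam_width=2)
print([h.labels for h in kept])
```

With a beam width of 128, as used in this work, at most 128 active paths survive each pruning step, which is what bounds the number of character-level LM contexts that must be stored.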
C. Retraining Based Fixed-Point Optimization
Since the LSTM RNN contains millions of weights, an FPGA based implementation demands a large on-chip memory space to store the parameters. It is not efficient to store the weights in external DRAM because the fetched weights are used only once for each output computation. In our implementation, the retraining based method [12], [19] is applied to reduce the word-length of the weights. The algorithm groups the weights and signals by layer, applies direct quantization to each group, and retrains the whole network in the quantized domain. In our work, the weights and the internal signals are quantized to 6 and 8 bits, respectively. We find that the internal LSTM cells demand high precision, and thus they are represented in 16 bits.
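The direct quantization step can be sketched as follows for one group of weights. This is only an illustration under stated assumptions: a symmetric uniform quantizer is assumed, the step size is chosen here by simple max-abs scaling (the step-size selection of [12] is not reproduced), and the retraining stage is omitted entirely. The function name `quantize_uniform` is hypothetical.

```python
def quantize_uniform(weights, bits=6, step=None):
    """Symmetric uniform quantization of one weight group (illustrative only)."""
    max_level = 2 ** (bits - 1) - 1            # 31 integer levels per side for 6 bits
    if step is None:
        # Assumption: scale so that the largest magnitude maps to the top level.
        largest = max((abs(w) for w in weights), default=0.0)
        step = largest / max_level if largest > 0 else 1.0
    quantized = []
    for w in weights:
        level = round(w / step)                          # nearest integer level
        level = max(-max_level, min(max_level, level))   # saturate to the range
        quantized.append(level * step)
    return quantized, step


layer_weights = [0.9, -1.0, 0.1, 0.0]
q, step = quantize_uniform(layer_weights, bits=6)
print(q, step)
```

In a retraining based scheme, values such as `q` would be used in the forward pass while the network is trained further, so that the remaining weights adapt to the quantization error.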
IV. FPGA-BASED IMPLEMENTATION
A. Overview of the FPGA System
The proposed algorithm is implemented on a Xilinx ZC706 evaluation board equipped with an XC7Z045 FPGA. The FPGA embeds an ARM CPU in addition to configurable logic circuits. Fig. 2 shows the hardware architecture for implementing RNNs. Although the SR algorithm employs two RNN algorithms, our FPGA design implements only one LSTM tile and one output tile, which operate intermittently when the control signal is given. Note that the RNN operation for the acoustic model is needed only once for each input speech frame, whose length is normally 10 ms, but the character-level LM operates much more frequently to generate N-best hypotheses for different search paths.
B. Architecture and Algorithm
The standard LSTM with peephole connections is described in Algorithm 1. The equations show that one LSTM RNN layer requires eight matrix-vector multiplications in each time step.
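For reference, the standard peephole LSTM equations can be written as below. This is a reconstruction of the commonly used formulation, not a copy of Algorithm 1: the subscripted weight names, the peephole vectors $w_{ci}$, $w_{cf}$, $w_{co}$ (the diagonal entries of the peephole matrices), and the symbol $\odot$ for element-wise multiplication are notational assumptions.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + w_{ci} \odot c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + w_{cf} \odot c_{t-1} + b_f\right) \\
\tilde{c}_t &= \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + w_{co} \odot c_t + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align}
```

The eight matrix-vector multiplications are the four products with $x_t$ and the four products with $h_{t-1}$; the peephole terms are element-wise and therefore do not count as matrix-vector multiplications.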
Algorithm 1. LSTM equations with peephole connections: x is the input vector of the input layer and h is the output vector of the layer. The vectors i, f, and o are the activations of the input gate, the forget gate, and the output gate, respectively, each processed by the logistic sigmoid function σ. c represents the activation of the cell and c̃_t is the candidate memory cell. The vector b stands for the bias. The subscript t denotes the current time step and t − 1 denotes the previous time step. The model parameters comprise full weight matrices W and diagonal matrices for the peephole connections. The cell and output updates use element-wise multiplication and the hyperbolic tangent, tanh.
The LSTM tile in Fig. 3 consists of two main processing modules: the processing element (PE) array calculates the matrix-vector multiplications, and the LSTM extra processing unit (LSTM EPU) conducts the rest of the calculations, such as applying element-wise products for peephole connections and evaluating activation functions. As shown in Fig. 4, the PE array consists of 512 PEs. The PE in Fig. 5 multiplies the input D_in with the weight W and adds the result to the partial sum stored in the accumulator, where the bias values are preloaded [21]. The results of the eight matrix-vector multiplications are stacked in the PE output buffer. We use four PE buffers, PE_i, PE_f, PE_o, and PE_c.
The LSTM EPU shown in Fig. 6 is implemented to manage the rest of the LSTM operations. The input c_{t−1} represents the cell activation of the previous time step. To implement the peephole connections in the LSTM, c_{t−1} is multiplied with the peephole weights and added to PE_i and PE_f, while c_t is multiplied with the peephole weights and added to PE_o. Since the matrix-vector multiplication results are already stored in the PE buffers, the LSTM EPU and the PE array can operate independently. The activation functions in the LSTM EPU are implemented using lookup tables. In the proposed system, only one LSTM EPU is used because one output data item is transmitted in each clock cycle and all the operations in the LSTM EPU are element-wise.
The output vector of the LSTM EPU is stored in the context memory. The stored contexts are used in the following operations and the beam search decoding. The number of stored contexts is the same as that of hypotheses in the beam search. The output tile is a fully connected layer that employs the same structure as in [21]. The input of the output tile is the data stored in the context memory.
Fig. 4. Structure of the processing element array.
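The behavior of the PE array can be mimicked in software as a rough sketch. It assumes that each PE owns one output row, that one input element is broadcast to all PEs per clock, and that the accumulators start from the bias values; the function name `pe_array_matvec` and the list-based data layout are illustrative choices, not the hardware interface.

```python
def pe_array_matvec(weights, x, bias):
    """Matrix-vector product accumulated one input element (one clock) at a time.

    weights: list of rows, one row per PE; x: input vector; bias: one value per PE.
    Returns the accumulator contents and the number of modeled clock cycles.
    """
    acc = list(bias)                      # accumulators preloaded with the bias
    cycles = 0
    for k, x_k in enumerate(x):           # one input element D_in per clock
        for j in range(len(acc)):         # in hardware, all PEs update in parallel
            acc[j] += weights[j][k] * x_k
        cycles += 1
    return acc, cycles


W = [[1.0, 2.0], [3.0, 4.0]]
y, cycles = pe_array_matvec(W, x=[1.0, 1.0], bias=[0.5, -0.5])
print(y, cycles)
```

In this model the cycle count equals the length of the input vector and does not depend on the number of output rows, as long as there are enough PEs to cover them.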
C. Throughput of the LSTM tile
As shown in Fig. 4, there are two PE arrays in the PE array block. Since there are eight matrix-vector multiplications, one RNN layer demands four matrix-vector multiplication cycles. Each PE array has 256 PEs and conducts a matrix-vector multiplication using the outer product method. The processing time of the LSTM depends on the dimension of the input vector because the outer product method supplies one input element at each clock. The input size of the first layer of the RNN AM is 123 and that of the following layers is 256. Thus, the first layer of the RNN AM requires 246 (= 123×4÷2) and 512 (= 256×4÷2) clock cycles to conduct the matrix-vector multiplications related to x_t and h_{t−1}, respectively, for a total of 758. The number of clock cycles for each of the following layers is 1,024. Note that there is a small overhead to synchronize the system. The number of clock cycles required to process the RNN AM with three LSTM layers is 2,806 (= 758 + 1,024 + 1,024), and that of the RNN LM containing two LSTM layers is 1,596 (= 572 + 1,024).
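The cycle counts above can be reproduced with a short calculation. The sketch assumes a hidden size of 256 for every layer and a 30-dimensional input for the RNN LM (the one-hot character input described in Section V), and it ignores the small synchronization overhead; the function names are illustrative.

```python
NUM_GATE_MATRICES = 4   # input gate, forget gate, output gate, candidate cell
NUM_PE_ARRAYS = 2       # two arrays of 256 PEs work side by side


def matvec_cycles(vector_size):
    """Clock cycles to multiply the four gate matrices by one vector."""
    return vector_size * NUM_GATE_MATRICES // NUM_PE_ARRAYS


def layer_cycles(input_size, hidden_size=256):
    """Cycles for the x_t products plus the h_{t-1} products of one layer."""
    return matvec_cycles(input_size) + matvec_cycles(hidden_size)


def network_cycles(first_input_size, num_layers, hidden_size=256):
    """Cycles for a stack of LSTM layers, ignoring synchronization overhead."""
    total = layer_cycles(first_input_size, hidden_size)
    total += (num_layers - 1) * layer_cycles(hidden_size, hidden_size)
    return total


am_cycles = network_cycles(first_input_size=123, num_layers=3)
lm_cycles = network_cycles(first_input_size=30, num_layers=2)
print(am_cycles, lm_cycles)   # 2806 1596
```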
V. EXPERIMENTAL RESULTS
A. Recognition Performance
To train the RNN AM, we use the standard WSJ SI-284 training set. The utterances with verbalized punctuations are removed and odd transcriptions are filtered out. The final size of the training set is roughly 71 hours. For evaluation, the WSJ eval92 (Nov'92 20k evaluation set) is used. The utterances in the evaluation set are sequentially concatenated to generate a single 42-minute input speech stream. The RNN AM is trained using stochastic gradient descent (SGD) with 8 parallel input streams on a GPU [25]. The RNN AM uses a 40-dimensional log mel-frequency filterbank feature with energy and their delta and double-delta, resulting in a 123-dimensional vector. The feature vector is computed every 10 ms over a 25 ms Hamming window and normalized element-wise based on the statistics obtained from the training set. A centered sliding window of 300 frames is used to reduce the amplitude distortion effect from silence intervals. The RNN AM outputs a 31-dimensional vector representing the probabilities of 26 upper case alphabet characters, 3 special characters for punctuation marks, the end of sentence symbol, and the CTC blank label. The RNN LM is trained with a text stream generated by concatenating randomly selected sentences in the WSJ non-verbalized punctuation text corpus, where the EOS label is inserted between the sentences. The RNN LM is trained with AdaDelta [26] based SGD. The RNN LM uses a 30-dimensional input vector in which the current character label is one-hot encoded, and outputs a 30-dimensional vector that represents the probabilities of the following character labels.
The statistical tri-gram LM is generated with the IRSTLM [27] toolkit included in the KALDI speech recognition tool [28]. build-lm.sh and compile-lm in the IRSTLM toolkit are used to generate a standard Advanced Research Projects Agency (ARPA) file while applying the improved Kneser-Ney method [24] for higher performance. We use the WSJ non-verbalized punctuation text corpus that contains 165 K words to build the LM. The generated 578-MB ARPA file is stored in the off-chip DRAM.
The word error rate (WER) and character error rate (CER) performances of the proposed system with respect to the size of the RNNs and the beam width are shown in TABLE I.
TABLE II. THE WER AND THE CER PERFORMANCE (%) WITH RESPECT TO THE WEIGHT PRECISION OF THE SMALL-MODEL WHEN THE BEAM IS 128. (Column headings: WORD LM, WEIGHT PRECISION.)
[…] [10], but ours supports delay-free real-time SR. Of course, the main advantage we expect is energy efficiency, since we do not employ a WFST network, which demands a large amount of computation and memory accesses. Note that the algorithm in [10] is not intended for real-time speech recognition, and employs a bidirectional structure that shows better performance than the unidirectional structure. The algorithm also uses the WFST decoding network to combine the results of acoustic modeling, the lexicon, and the word-level LM. Note that the compared system does not use a character-level RNN because the WFST network embeds the lexicon. However, WFST-based decoding demands a large memory space to search, and thus the algorithm is difficult to make power efficient. On the other hand, our algorithm employs the character-level LM in addition to the word-level LM, and uses a simple beam search in decoding that requires far less memory. The RNNs of the proposed algorithm are implemented using only on-chip memory for energy efficiency. Note that the recognition performance of our system can be further improved by employing larger RNNs or increasing the beam width.
The SR algorithm is implemented on an XC7Z045 FPGA that has 2.18 MB of on-chip memory. In our experiment, the number of parameters of the small-model is 2.3 M while that of the large-model is 15.1 M. The retraining based fixed-point optimization is applied to reduce the precision of the weights. TABLE II shows the performance of the systems that employ fixed-point weights, where the precisions of the signals and the LSTM cells are fixed to 8 and 16 bits, respectively. The table shows that rescoring with the word-level LM is also effective for the systems that employ fixed-point weights. The FPGA can only accommodate weights of up to 6 bits, which demand only about 1/5 of the memory space required for floating-point implementations, at the cost of about a 1.5% increase in WER. The size of the parameters with 6-bit precision is about 1.1 MB, which can be stored in the on-chip memory of the Xilinx XC7Z045.
B. FPGA Implementation Performance
The FPGA implements the small-model with a beam width of 128. Note that the large-model based system can be implemented using an UltraScale FPGA [29]. In our implementation, the programmable hardware operates at 100 MHz and the CPU runs with an 800-MHz clock to conduct the N-best search. The FPGA resource utilization result is shown in TABLE III.
The implemented system requires one RNN AM operation for each 10 ms speech frame (100 times per second). However, the RNN for the character-level LM is needed only when a character transition occurs, whose frequency is usually no more than 30 times per second in our experiments. Assuming 128 beams, this translates to about 3,840 RNN LM operations per second. Thus, a conservative estimate of the number of clock cycles needed for real-time operation is about 6.4 M (= 100 × 2,806 + 3,840 × 1,596) per second. Note that a silence period does not generate any transition, and thus no RNN LM operation is needed.
TABLE IV shows the power consumption measured by the Xilinx simulation tool. The actual power consumption of the small-model based SR measured on the evaluation board is 9.24 W, including that of the DRAM and peripherals, while achieving ×4.12 real-time speed. Our implementation consumes some extra cycles for communication.
We compare our FPGA implementation with that of a high-end GPU, the NVIDIA GeForce Titan X. In the GPU based implementation, the time to evaluate the 42-minute WSJ eval92 evaluation set is 12.5 minutes, which means ×3.36 real-time speed, while utilizing about 30% of the GPU resources. Note that the throughput of the GPU can be increased by processing multiple input speech utterances. However, our FPGA based system shows better recognition speed by efficiently utilizing hardware resources even when processing a single speech stream. The power consumption of the GPU based system is about 80 W, which is much higher than ours.
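The estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (100 frames per second, at most 30 character transitions per second, 128 beams, and the per-inference cycle counts) together with the 100 MHz clock of the programmable logic; the fraction it reports covers RNN computation only and does not model communication or the CPU-side search.

```python
CLOCK_HZ = 100_000_000        # programmable logic clock (100 MHz)
AM_CYCLES = 2806              # cycles per RNN AM inference
LM_CYCLES = 1596              # cycles per RNN LM inference

frames_per_second = 100       # one AM inference per 10 ms frame
transitions_per_second = 30   # upper bound observed for character transitions
beam_width = 128

lm_ops_per_second = transitions_per_second * beam_width
cycles_per_second = frames_per_second * AM_CYCLES + lm_ops_per_second * LM_CYCLES
busy_fraction = cycles_per_second / CLOCK_HZ

print(lm_ops_per_second)        # 3840
print(cycles_per_second)        # 6409240
print(round(busy_fraction, 4))  # 0.0641
```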
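One way to put the two systems on a common footing is to compare the energy spent per second of processed audio. This metric is an assumption made here for illustration, and it treats the quoted power figures (9.24 W at ×4.12 for the FPGA board, about 80 W at ×3.36 for the GPU) as constant during processing.

```python
def energy_per_audio_second(power_watts, realtime_factor):
    """Joules consumed to process one second of speech at the given speed-up."""
    return power_watts / realtime_factor


gpu_realtime_factor = 42.0 / 12.5                     # 42-minute set in 12.5 minutes
fpga_energy = energy_per_audio_second(9.24, 4.12)     # about 2.24 J
gpu_energy = energy_per_audio_second(80.0, gpu_realtime_factor)  # about 23.8 J
ratio = gpu_energy / fpga_energy

print(round(gpu_realtime_factor, 2))   # 3.36
print(round(fpga_energy, 2), round(gpu_energy, 2), round(ratio, 1))
```

Under these assumptions the ratio comes out slightly above 10, which is in line with an efficiency gain of roughly one order of magnitude.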
VI. CONCLUDING REMARKS
In this paper, an RNN based real-time speech recognition system is implemented on an FPGA. The algorithm employs RNNs for acoustic modeling and character-level language modeling, and is optimized for real-time operation using unidirectional RNNs. The vocabulary size of the speech recognition is unlimited since the character-level RNN can dictate out-of-vocabulary words. A statistical word-level language model is also employed to improve the recognition performance. The models are integrated using a simple tree-based search algorithm without employing a hidden Markov model or weighted finite state transducers. The weights of the RNNs are quantized to 6 bits. The RNNs are implemented using an array of processing elements for high-throughput matrix-vector multiplications. The RNNs implemented on the FPGA use only on-chip memory. The speech recognition system implemented on the Xilinx XC7Z045 achieves approximately 4.12 times real-time speed with a 100 MHz clock while consuming only 9.24 W of power. When compared to a high-end GPU based system, the power efficiency is estimated to be about 10 times higher.
