ABSTRACT Long Short-Term Memory network (LSTM) is the most widely used recurrent neural network architecture. It plays an important role in a number of research areas, such as language modeling, machine translation and image captioning. However, owing to its recurrent nature, general-purpose processors like CPUs and GPGPUs can achieve limited parallelism while consuming high power energy. FPGA accelerators can outperform general-purpose processors with flexibility, energy-efficiency and more delicate optimization capabilities for the recurrent based algorithms. In this paper, we present the design and implementation of a cloud-oriented FPGA accelerator for LSTM. Different from most of previous works designed for embedded systems, our FPGA accelerator transfers data sequences from and to the host server through PCIe and performs multiple time series predictions in parallel. We optimize both the on-chip computation and the communication between the host server and the FPGA board. We perform experiments to evaluate the overall performance as well as the computation and the PCIe communication efforts. The results show that the performance of our implementation is better than the CPU-based and other hardware-based implementations.
I. INTRODUCTION
In recent years, deep learning is becoming a hot topic in computer science because of its great progress of inference accuracy. Deep Neural Networks (DNN) provides a practical method for computers to learn knowledges from massive amount of labeled data. Recurrent Neural Network (RNN) is a class of Deep Neural Networks that processes sequential data and performs time series predictions. It has been proved to be useful in various time series prediction related applications like language modeling [1] , [2] , machine translation [3] and image captioning [4] - [6] . A number of novel RNN models are created to promote the performance of the traditional RNN. Among these models, Long Short-Term Memory network (LSTM), which implements a learned memory controller to achieve excellent prediction accuracy, is the most popular recurrent neural network model.
From the model aspect, LSTM focuses on how to improve the prediction accuracy while paying little attention to the throughput, response time or the power consumption. As the The associate editor coordinating the review of this article and approving it for publication was Junxiu Liu.
LSTM applications are becoming more and more popular, they are widely deployed in cloud environments. In this kind of situation, high throughput, short response time and low power consumption are very important. However, owing to the recurrent nature, LSTM implementation on general-purpose processors like CPUs and GPGPUs can only achieve limited parallelism and requires high power consumption.
To solve this problem, a number of works have been made to utilize FPGA to accelerate LSTM [7] - [13] . FPGA accelerator is an efficient way to optimize applications delicately by pipelining computation process and customized hardware structure. In addition, FPGA accelerators are widely used because of lower power consumption and higher computing capability compared with general purpose processors. However, most of previous FPGA accelerators for DNNs are only designed for embedded systems. In recent years, efficiency of running DNN in data centers calls for utilizing FPGA in public and private cloud environments [14] , [15] . As for cloud environment, the data transmission between the host server and the FPGA board are very critical. At the same time, the host server usually receives multiple input streams in parallel. However, traditional LSTM algorithms are largely composed of vector-based manipulations. It requires to read vectors sequentially, which decreases the overall throughput of the entire system in the cloud environment.
In this paper, we design and implement a cloud server oriented FPGA accelerator for LSTM. The design and implementation are highly optimized for the LSTM applications in cloud environments. We aim to improve the performance of the whole system that is composed by the host server and the FPGA board. Towards this end, we make a number of optimization on the host side, the FPGA side and in the transmission phase between them. We implement the accelerator by OpenCL (Open Computing Language), which is an open, cross-platform parallel programming language that can be used for CPU, GPU and FPGA developments [16] . This paper is an elaborated and extended version of a conference paper accepted by 2019 3rd International Conference on Simulation, Modeling and Optimization [17] . The key additions of this version includes: (1) we take the host side optimization into consideration and illustrate the complete data flow of our accelerator in Section III-B, (2) we describe how to use overlapping to optimize the transmission between the host server and the FPGA in Section III-G, (3) we perform an additional experiment to verify our host side optimization in Section IV-B. In summary, the main contributions of this paper are as follows:
• To improve the matrix computation performance on FPGA, we transform the vector manipulations into matrix operations. In addition, our design can process multiple input streams in parallel by this way. It ensures that we can achieve high throughput of the whole system. Given that we can predict multiple data sequences in a deep pipeline way instead of creating copies of processing hardware. Increasing the number of input streams makes no impact on the resource usage.
• For cloud environments, our design focuses on not only the optimization of FPGA side computation but also the improvement of communication efficiency between host server and FPGA board. Based on the full-duplex PCIe bus, we utilize the ping-pong buffers in the host server memory and the on-board DDR to make the computation and the data transmission to and from the FPGA simultaneously. It can achieve close to zero transmission time when the the number of input streams is large enough. By this way, we can keep the FPGA always working to take full advantage of its parallelizing capability.
• Based on above design, our implementation achieves high performance in cloud environment with optimization for specific FPGA features like LUTRAM and approximate linear function. We measure the throughput by the number of predictions per second. The peak throughput of our implementation is 7,573,462.59 predictions per second. It means that the average execution time of one prediction per input stream is 132.04ns. The rest of this paper is structured as follows. In Section II, we introduce related works of FPGA accelerator for LSTM. Section III discusses the details of our design and implementation. Section IV presents and analyzes the experiment results of our work. In Section V, we conclude and point out future works.
II. RELATED WORK
LSTM was firstly designed in [18] . After that, it has a number of updated variations [2] , [19] , [20] . The LSTM with peephole introduced by [21] is a famous variation in which the cell memory influences the input, forget and output gates. Greff et al. studied a number of commonly used variations of LSTM in [22] . It shows that the changes to the standard architecture are very small. These variations have similar performance, and their effects on overall prediction accuracy are small enough to be ignored. In this paper, we implement the LSTM version of [23] .
Owing to its good performance on classifying and prediction of sequence data, LSTM network has been widely used in various areas, such as language modeling, machine translation and image captioning. Most software-implemented LSTM applications are paying too much attention on prediction accuracy. However, in a certain measure, they neglect the throughput and power consumption problem, which are vital in real-world production environments. Therefore, there are some efforts that have been made to implement hardware accelerators. Because training of LSTM model is a compute-intensive task, there is little work on implementing LSTM training on FPGA [7] . Most of related works focus on how to design and implement LSTM inference accelerators. In [8] , Chang et al. implemented the regarded earliest actual hardware implementation for LSTM, which is a two layer LSTM network with 128 neurons each. However, it does not achieve a full level of parallelism. Meanwhile, its performance is mediocre compared to some latest implementations. Work in [9] focuses on a delicated optimization for LSTM. They compared their implementation with [8] directly. The results proved that their work in [9] has a significant speed-up over [8] and other software implementations. Work in [10] optimizes the LSTM network both in computation and communication phase. The authors use pipelining and unrolling to improve the computation performance. Besides, they make more effort on data accessing method, such as reshaping the weight parameters and implementing ping-pong buffers. In [11] , Han et al. proposed a method for compressing the LSTM model and allocating computation tasks to processing elements (PE) to achieve balanced workload and improve the performance. Similarly, researchers in [12] focused on compressing LSTM model and proposed a method named efficient FFT based fast circulant convolution to accelerate the computation. In [13] , Chen et al. implemented a compact LSTM inference kernel named CLINK to decrease the required resource and power of computation for improving the energy efficiency of embedded devices. However, most of previous works [8] - [13] aim to improve performance of LSTM applications deployed in embedded systems. There is rare LSTM hardware implementation that is optimized for cloud environments. In cloud environments, host servers VOLUME 7, 2019 usually receive multiple input streams in parallel and transfer data between host servers and on-board DDRs. We make some modifications on traditional LSTM algorithms and implement it based on FPGA hardware for cloud environment. The design and implementation optimize both computation and communication phase, especially the transmission between host server and on-board DDR. Since our work is similar to [9] , we also compare our results to it. The experimental results prove that the on-chip computation performance of our work outperforms the software implementation and hardware implementation in [9] .
In the process of LSTM inference, matrix computation occupies most of resources and time. There are a number of works of optimizing matrix computation on FPGA. Dou et al. proposed a block matrix multiplication algorithm and implemented a linear array of processing elements (PE) to improve the performance of matrix multiplication on FPGA [24] . However, it is also designed for embedded system. If there are too many data streams to be processed, which is common in the cloud environment, the performance will be downgraded because of a large number of inputed rows. In [25] , Suda et al. transformed the convolution operation to matrix multiplex operation to accelerate computing CNN on FPGA. They divided a matrix to a set of blocks. Each block is allocated to a PE to be computed. However, they did not implement a pipeline among computation tasks of blocks. If the workloads of PEs are unbalanced, the performance will be downgraded significantly because of unwanted idle of some PEs. Google Tensor Processing Unit (TPU) [14] unfolded matrix multiplication in both row and column dimensions and achieved extra high performance with fully pipelined workflow on PEs. However, it requires excessive resources that can only be satisfied by custom ASICs. Compared with above works, the proposed design and implementation in this paper achieves a balance between improved performance and required resources, which makes it more suitable to be used for the cloud server oriented FPGA accelerator.
III. DESIGN AND IMPLEMENTATION
In this section, we will firstly introduce the overall system architecture of the proposed FPGA accelerator. Secondly, we will illustrate the data-flow between each component of our system and describe how they interact with each other. Then, we will describe the core principles of our optimization and discuss our design in detail with a general matrix optimization model. Finally, we will present the implementation of the proposed FPGA accelerator.
A. ARCHITECTURE Figure 1 shows the architecture design of the proposed FGPA accelerator. Since the accelerator is designed for running in cloud environments, the system architecture is quite different from that for embedded systems. The target system includes not only the FPGA board but also the host server, which is a universal server equipped with CPUs to receive requests from remote clients and send response to them. The host server controls the data transmission and schedules the computation-intensive tasks for the FPGA board. It transfers data with the FPGA board through PCIe bus. The DDR4 DRAMs persisted on the FPGA board receive input vectors from PCIe bus and save output vectors generated by the FPGA chip. The DRAMs function as a connector between FPGA chip and host server. The host server manages two input buffers and two output buffers in the on-board DDR4 DRAMs, which work in a ping-pong way. The data transmissions between the DDR4 DRAMs and the FPGA chip are done on the AXI4 bus or the AXI4Lite bus. There are three core computation modules in FPGA chip, matrix-matrix multiplication module, gate module and element-wise computation module. These three modules are responsible for three different phases of the LSTM inference. They are also connected to AXI4 bus or AXI4Lite bus. The following sections will introduce the design principles and implementation details of these three on-chip modules. Figure 2 illustrates the data-flow of our design. The whole system consists of five components, which are host server, on-board DDR, matrix-matrix multiplication module, gate module and element-wise computation module. The input FIGURE 2. Data-flow of the proposed system. vectors are streamed from the host server through PCIe bus. Multiple input vectors are grouped in the input buffers and persisted on the on-board DDR. Since LSTM network is a recurrent neural network that needs multiple time steps to generate final output vectors, the computation of three on-chip modules and the data transmission among them need to loop through several time steps. In each time step, the on-board DDR transfers data to the matrix-matrix multiplication module. The output vectors generated by the element-wise computation module from the previous time step are regarded as recurrent state. These vectors are also fed to the matrix-matrix multiplication module as input data. After performing the matrix-matrix computation, intermediate output matrices are generated in the on-chip memory. When the gate module is ready, the intermediate matrices are fed to the gate module as input matrices. Similarly, after the gate module finished its computation, another intermediate matrices are generated as output matrices and transferred to the element-wise computation module. Then, the elementwise computation module performs its calculations. When a loop is done, the final output vectors from the element-wise computation module are transferred back to the on-board DDR. Finally, the on-board DDR transfers the output data back to the host server.
B. DATA-FLOW
We can see that the input buffers persisted on the on-board DDR save input vectors of all time steps. These vectors is considered too large to be totally stored in the FPGA chip. That means we must transfer data from the on-board DDR to FPGA chip in each time step. As for the recurrent state, we can perform in-place modification recurrently in each time step. Therefore, we only need to transfer data to the on-board DDR once in the final time step. Here we put the weight matrix aside in the data-flow because it is small enough to be stored in on-chip memory and does not need to transmit among different components.
C. MATRIX OPTIMIZATION MODEL
The LSTM algorithm can be described by the following equations:
where σ , , x, W , b, c and h are sigmoid function, element-wise multiplication, input vector, corresponding weight matrix, corresponding bias vector, memory cell activation and layer output vector, respectively. i, f ,c, o represent the input gate, forget gate, candidate memory cell gate and output gate, respectively. The subscript t denotes the corresponding time step.
Equations (1)- (6) indicate that the traditional LSTM algorithm is based on vector manipulation. Previous LSTM hardware accelerators like [8] - [10] focus on optimizations for these vector-based computations, such as the matrix-vector multiplication and some kinds of element-wise vector computations. However, the defective vector-based model mainly lies in its low throughput because the LSTM network can only process one input stream at a time. In cloud environments, we must handle multiple input streams. Therefore, the traditional vector-based model is no longer suitable for our accelerator. To solve this problem, we change the vector-based model to a matrix-based model that groups multiple input vectors from various input streams into an input matrix. The LSTM computation of our design can be characterized by the following equations:
From Equations (7)- (12), we can see that our design is largely based on matrix computations, such as matrix-matrix multiplication, matrix addition, element-wise matrix multiplication, element-wise matrix sigmoid function and element-wise matrix tanh function. In order to optimize the computation phase of our system, we put forward a general matrix optimization model that optimizes all these matrix computations by using unrolling and pipelining manners. All of three on-chip modules presented in Figure 1 use this model to implement the LSTM network.
We notice that structures of these different types of matrix computations are similar, while the major differences are matrix operators. It means that each element in two input matrices would perform a certain type of basic matrix operation in matrix computations process. Figure 3 shows how five matrix operators work. They are MAC, +, * , sigmoid and tanh, referring to the matrix-matrix multiplication, the matrix addition, the element-wise multiplication, the element-wise matrix sigmoid function and the element-wise matrix tanh function, respectively. Figure 4 is the expanded view of Figure 3 . It presents the details of our matrix optimization model. Input Stream Count denotes the number of input streams that need to be processed. LSTM Size denotes the scale of the LSTM model, which is determined by the size of the weight matrix. We assume that the size of the input vector is less than the LSTM Size. Matrix A represents the input matrix in on-chip computation modules. Matrix B represents the weight matrix in the matrix-matrix multiplication module or the input matrix in other two modules. When performing matrix computation, we always unroll the loop of iterating the columns of input matrix or the rows of the weight matrix. The Processing Element (PE) in Figure 4 shows the unrolling logic. The number of PEs is LSTM Size. Every PE fetches one value from the column of the input matrix or the corresponding row of the weight matrix to its inside BRAM. By this way, the matrix computation with specific operator are processed in parallel. If the matrix B is the weight matrix in the matrix-matrix multiplication module, the previous PE also has to write the output value to the next PE for MAC operator.
All PEs are running in a deep pipeline way. Different PEs are processing different input streams simultaneously. A single PE is also processing multiple input streams. For example, it can read one value of the third input stream from Matrix A as well as read one value of the second input stream from Matrix B. At the same time, it may perform the computation of the matrix operator of the first input stream in parallel. In conclusion, all PEs are processing different phases of the computation for different input streams. This pipeline mechanism keeps the hardware working at all time and improves performance significantly.
In order to maximize the pipeline performance, we must ensure the PEs carry no dependencies. Considering that we have LSTM Size PEs, they could access at most LSTM Size columns of the input matrix and at most LSTM Size rows of the weight matrix in parallel. Therefore, we store the matrices in LUTRAM. The input matrix is partitioned by columns. The weight matrix is partitioned by rows. It constructs N -ports memory that can provide N parallel accesses. After the partition, all PEs can work with no dependencies and the initiation interval of the pipeline is 1.
After the introduction of the matrix optimization model, we will present the implementation details of the three on-chip computation modules based on this model in the following subsections.
D. MATRIX-MATRIX MULTIPLICATION
From Equations (7)- (10) in Section III-C, we can see that matrix-matrix computation is the first step of LSTM network inference. The pseudocode of the matrix-matrix for i = 1 to stream_count do 4: Initialize Temp to zero 5: #pragma unroll by factor of MAX_LSTM_SIZE 6: for k = 1 to MAX _LSTM _SIZE do 7: Temp = a i,k * w k,j + Temp . We define MAX _STREAMS as an upper bound for the number of rows of the input matrix and the output matrix. It decides the maximum input streams that we can process together. We also define MAX _LSTM _SIZE as an upper bound for the number of rows of the weight matrix and columns of the input matrix, the weight matrix and the output matrix. LSTM applications can use this model if weight scale is no more than MAX _LSTM _SIZE. We assume that the size of input vector is less than the LSTM weight size. The matrix-matrix multiplication is performed by three nested loops. real_lstm_size denotes the actual weight size of the LSTM layer. stream_count denotes the number of input streams. The innermost loop generates one element of the output matrix. The rest two outer loop traverse the output matrix in row-major order. We completely unroll the innermost loop and pipeline the second loop to achieve high parallelism. We can see that Algorithm 1 is based on the matrix optimization model. The matrix operator is MAC. The two inner loops implement the matrix optimization model, while the outermost loop performs the calculation sequentially. In this way, we can predict multiple data sequences in parallel to achieve high throughput.
E. GATE MODULE
After the matrix-matrix multiplication, the rest of LSTM inference in Equations (7)- (10) are matrix addition, element-wise sigmoid function and element-wise tanh function. Given that we can construct the data in the form of matrices rather than vectors. Therefore, we can keep benefiting from the deep pipeline way. The sigmoid function and the end for 8: end for tanh function are non-linear that are hard to be implemented in FPGA. Therefore, we use the Polynomial Approximations presented in [26] to implement approximate linear functions. We adopt the same polynomial coefficients that are used in [9] . The maximum approximation are 1.408 × 10 −3 for the sigmoid function and 1.21 × 10 −2 for the tanh function. 
F. ELEMENT-WISE COMPUTATION
The rest work of our accelerator is implementing the last two Equations (11) and (12), which mainly consist of element-wise multiplication, matrix addition and element-wise tanh function. The optimization methods are similar to Algorithm 2. The only difference is the matrix operators. This module is also based on the matrix optimization model that uses the * , + and tanh as matrix operators.
G. HOST SIDE OPTIMIZATION
Most of previous works only focus on FPGA-side optimization for deploying LSTM model in embedded systems. As for cloud environments, host servers often receive multiple input streams and transfer the data to FPGA boards. In this kind of situation, the transmission time could be larger than that of embedded systems. Therefore, we must optimize both the control and transmission phases of the host side to maximize the overall throughput. Towards this end, we group multiple input vectors, which are matched with our memory models in the FPGA chip, into a matrix. These vectors are transferred 10: host_input_buffer += sizeof(input_buffer[flag]) 11: host_output_buffer += sizeof(output_buffer[flag]) 12 : end for together between the host server and the FPGA board. It obviously can utilize more bandwidth compared to the single vector transferring way. Figure 1 shows the data transmission between the host server and the FPGA board. Because PCIe is a full-duplex bus, we can perform two-way data transmission with FPGA simultaneously. FPGAs can also perform computations during data transmission process. Performing all these three operations concurrently allows us to keep the FPGA always working and take full advantage of all hardware resource in the proposed system. In Figure 1 , we have two pairs of buffers persisted on the on-board DDR. They function as two ping-pong buffers. For example, the host server can write to the first input buffer and read from the second output buffer. At the same time, the FPGA chip can write to the first output buffer and read from the second input buffer.
Algorithm 3 presents the pseudo code of the ping-pong buffer process. There are three types of events, writing data into FPGA, computing in FPGA and reading data from FPGA. We implement three functions to manage events by utilizing build-in functions of OpenCL like clEnqueueMigrateMemObjects() and clEnqueueNDRangeKernel(): (1) schedule() is to put an event into execution queue of OpenCL. An event could be scheduled depending on another event by schedule() function. (2) bind() is to bind an event with input and output buffers. (3) wait() means waiting until an event finish. Figure 5 shows how to use two pairs of ping-pong buffers for making write, read and compute operations to overlap with other. There are two tasks running in parallel, task 1 and 2. The number of task is denoted as the first number after the word of operation. For example, write1 : input_buffer1 means that task 1 is writing data into input buffer 1. At the beginning, task 1 transfers data from the host server to input buffer 1. When task 1 has been computed, it may read from input buffer 1 and write to output buffer 1. At the same time, task 2 can transfer data from the host server to input buffer 2. After the computation of task 1 has been finished, the host server can read from output buffer 1 and write to input buffer 2. The computation of task 2 would not be blocked. Since the computation time is usually longer than the transmission time, the read and write processes are completely running as background work. After optimizing the transmission process, the transmission time between the host server and on-board DDR can be nearly ignored.
IV. EXPERIMENTAL EVALUATION
In this section, we first introduce our experiment environment and method. Then, we analyze the results of different-sized synthesized LSTM networks with various numbers of input streams.
A. EXPERIMENTAL ENVIRONMENT
We implement an OpenCL-based FPGA accelerator. Our hardware system consists of a Xilinx Kintex UltraScale KCU1500 Acceleration Development Board and a general host server with Intel Pentium CPU G2030. The FPGA board is plugged in the host server on a PCIe Gen2(5GT/s) bus with 8 lanes. OpenCL is a cross-platform framework for parallel programming on CPU, GPU and FPGA. We use Xilinx SDAccel development environment to compile OpenCL programs to execute on the Xilinx FPGA board. Xilinx SDAccel is a high-level synthesis tool that abstracts low level details of the FPGA platform. It allows us to utilize the power of parallel capabilities of FPGA without extensive knowledge of the underlying hardware detail.
We test our system with different-sized synthesized networks and various numbers of input streams to examine the performance. We define the size of synthesized network as N , which represents for the scale of the weight matrix ranged from 4, 8, 16 to 32. The number of input stream is defined as I ∈ {4, 8, 32, 64, 128, 512, 1024} . The different results of various N and I are presented and analyzed in the next subsection. We run our application 1000 times for each combination of N and I .
We use Keras with TensorFlow backend to train and test the software-based LSTM network. We take the same weight parameters and input vectors as the input of our hardware implementation. The output vectors of the hardware implementation are compared with the software-based LSTM network. It ensures that two implementations have the same results within a certain error range because of the non-linear activation functions explained in Section III-E. Errors in the setting of N = 4 and I = 4, 8, 16, 32, 64, 128, 256, 512, 1024 are 2.38%, 1.85%, 1.56%, 2.01%, 0.78%, 1.48%, 1.14%, 1.59% and 0.63%, respectively. The average value is 1.49%. Figure 6 shows that the throughput improves when the I grows. As Section III-C explains, we process multiple input streams in a highly pipeline way to predict multiple data sequences in parallel. When I is small, the pipeline queue is always empty. It makes the hardware always be idle. For example, if one PE can process four input streams in parallel and there are totally 16 PEs (N is 16), the hardware would be fully worked if the number of input streams is bigger than 64. In Figure 6 , we can see that we achieve the highest throughput when I = 1024. In our setting, when the I keeps growing after 1024, its impact on throughput becomes small. In the setting of N = 4 and I = 1024, we get the peak throughput of 7.57 millions inferences per second (IPS). For each inference, there are 1,137 floating-point arithmetic operations. That means the peak performance of our accelerator is 8.6GFLOPS.
B. PERFORMANCE EVALUATION
To demonstrate the performance improvement of our design, we compare the results to a similar FPGA accelerator presented in [9] . The implementation in [9] mainly focuses on reducing the time from the moment of applying a new input vector to the moment of producing an output vector by the LSTM network. This time reflects the on-chip computation performance. The hardware version of [9] is implemented on a Xilinx XC7Z020 SoC board. It also implements a software LSTM model in python that runs on a Linux server with i7-3770k 4.2GHz Intel Processor and 8GB of RAM. In addition, we implement a GPU-version that performs the inference of the same LSTM model on a Linux server with a GeForce RTX 2080 Ti GPU, 10,989MB video memory and 64GB DDR memory. We compare the average computation time per input stream (I = 1024) of our accelerator with that of hardware-based and software-based implementation in [9] and GPU-version. The result is illustrated in Table 1 . We can see that the performance of our implementation is better than that of implementations in [9] and the GPU-version for all of the different-sized synthesized networks. In addition, to show the performance improvement of our proposed method, we compare two existing FPGA-based LSTM-RNN accelerators in [8] , [10] with ours. The results are shown in Table 2 . We listed the type and frequency of FPGA chip, total number of operations (MOP) of each model, the data precision mode and the overall performance (GOP/s and GFLOP/s) in Table 2 . From the results, we can see that the performance of our implementation is the best one. As we mentioned before, most of previous FPGA accelerators for LSTM are designed for embedded systems. They would definitely face a significant performance degradation in the cloud environments because of the transmission time between the host server and the FPGA board. In our design, we aim to take the most execution time spending on computation and make the transmission time as short as possible. In Section III-G, we describe the hardware overlapping method based on ping-pong buffers. In this way, we can reduce the transmission time to nearly zero. To present the benefits of this design, we perform a set of experiments to evaluate the effect of performance optimization with the ping-pong buffer mechanism. Table 3 presents the transmission time, which is calculated by subtracting the on-board computing time of FPGA from the total execution time, before and after we use the ping-pong buffer. To make the comparison more clear, we separately present the transmission time of N = 32 in Figure 7 . The results show that the host side optimization can significantly reduce the transmission time. When I increases from 512 to 1024 for N = 4, the transmission time decreases from 1.85ns to 0.003ns because the FPGA can perform computations during the transmission process without waiting for data. In addition, Figure 8 shows the ratio of transmission time between the host server and the FPGA board to the total execution time. It is obvious that the proportion decreases when I grows. If I is large enough, such as 512 and 1024, the proportion of the transmission time is close to zero. It means that the transmission between the host server and the FPGA board is almost totally overlapped. The performance loss on transmission time is only caused by the first and the last transmission step that cannot be overlapped. When I is small, the transferring is similar to the vector-based model that transfers vectors sequentially. That means small I cannot fully utilize the bandwidth and would lead to low throughput. When I becomes larger, the utilization of bandwidth is increased because of larger bulk data transmission. Table 4 presents different clock frequencies when N changes. When N grows, the clock frequency decreases to avoid timing violations and ensures that the accelerator functions properly. We can see that the clock frequencies of our implementation outperform the previous implementation in [9] . We do not present the results of different I because the size of I has little impact on the clock frequencies compared to N . For each size of N , we synthesize the LSTM network that can process at most 1024 input streams. Then, we change the size of I from 4 to 1024 to evaluate the performance impact of various I . The results with little difference tells us that we can ignore the impact of I on clock frequencies.
C. RESOURCE UTILIZATION AND POWER CONSUMPTION
In Table 5 , we can see that BRAM, DSP, FF and LUT all scale up to a 2× factor when N changes from 4 to 32. Our accelerator needs LUT most because of the memory partition as Section III-C explains. To achieve high performance, we fully utilize the FPGA resource. It limits the size of synthesized LSTM network to be N = 64. Given that we can process multiple input streams in a deep pipeline way instead of unrolling. The increment of I would not consume hardware resource. Therefore, our accelerator can process plenty of input streams together. Table 5 also presents the power consumption of each N . It clearly shows that the smaller the network is, the less power consumed.
V. CONCLUSION AND FUTURE WORK
We have presented an OpenCL-based FPGA accelerator for long short-term memory recurrent neural networks.
We optimize the design in cloud environments by processing multiple input streams in parallel. In computation phase, we provide a general matrix optimization model that transforms the vector-based manipulations to matrix-based operations to improve throughput. When the number of input streams is large enough, we obtain high throughput while still retaining the same resource consumption. We implemented two input buffers and two output buffers working in a ping-pong manner to utilize the parallel nature of FPGA hardware. It performs reading, writing and computing at the same time to keep the FPGA busy and takes full advantage of all of the hardware resource on our system. When the number of input streams is large enough, the transmission time between the host server and the FPGA board is close to zero. The experimental evaluation results showed the effectiveness and performance improvement of the design and implementation compared with existing solutions.
Additionally, it is also worthwhile to mention that our work still have potential for further research and improvements. Firstly, we used 32 bit floating point operations instead of fixed point operations. It would bring negative impact on performance and resource consumption. We plan to implement a fixed point mechanism to further optimize our work. Secondly, compared to the transmission phase between the host server and the on-board DDR, the one between the on-board DDR and the on-chip memory need further optimization. At last, although we utilize the bandwidth by transferring large bulk data, the data accessing manners are not efficient enough. We plan to add matrix tiling to make our work more scalable. 
