The effectiveness of LSTM neural networks for popular tasks such as Automatic Speech Recognition has fostered an increasing interest in LSTM inference acceleration. Due to the recurrent nature and data dependencies of LSTM computations, designing a customized architecture specifically tailored to its computation pattern is crucial for efficiency. Since LSTMs are used for a variety of tasks, generalizing this efficiency to diverse configurations, i.e., adaptiveness, is another key feature of these accelerators.
I. INTRODUCTION
Recurrent Neural Networks (RNN) represent a well-known Deep Learning (DL) model [1] - [3] , with increasing popularity for applications that are based on sequence-to-sequence processing [4] - [7] , such as speech recognition [8] and machine translation [9] . A key attribute of this class of neural networks is that they use past information to improve model accuracy. Long-Short-Term-Memory (LSTM) [1] is the most commonly used RNN. It can potentially remember useful information over a long period of time, providing high accuracy. LSTM networks have shown great effectiveness in many sequence processing problems and have fostered state-of-the-art research innovations, such as in natural language processing tasks, e.g., machine reading comprehension [10] - [12] and language modeling [13] - [15] , and speech recognition [16] , [17] .
Keeping track of historical information introduces intrinsic challenges to LSTM computation, such as relying on many recurrent steps in order to obtain an accurate inference mechanism. This results in many dependencies and much serialization, limiting the amount of parallelism that multicore CPUs or GPUs can exploit [21], [22]. Figure 1 shows the FLOP efficiency, i.e., the achieved FLOP rate relative to the theoretical peak, for a high-end GPU (Titan V) when running different applications using the most recent cuDNN library [20]. Note that the evaluation is measured by enabling the tensor-core units (TCUs) in the cuDNN library, as explained in [23]. As seen, the GPU is extremely under-utilized when serving requests with a batch size of 1. Furthermore, even when using a larger batch size of 64, the GPU achieves only moderate utilization, between 4% and 28% of peak performance. The reason is that the GPU only operates efficiently when a high level of parallelism is available, such as for training. However, for LSTM inference, even though the amount of computation (matrix-vector multiplications) increases for long sequences, as in speech recognition or machine translation tasks, the parallelism is limited due to many data dependencies. Furthermore, inference engines are commonly required to operate on small batches of a few requests at a time in order to meet Service-Level Agreements (SLAs) or safety requirements [24], [25]. This requirement further reduces the available parallelism in LSTM inference. Recently, there have been several efforts on either CPU [26] or GPU [22], [24] to improve the efficiency of LSTM inference. However, they show poor scalability for either small or large models with different sequence lengths. For instance, our evaluation of GRNN [24] for the hidden time-steps of applications such as speech recognition shows poor performance for long sequences.
Fig. 1. FLOP efficiency (%) of Titan V GPU for batch sizes 1 and 64, performing different real-world sequence processing applications, i.e., Machine Comprehension (MC) [10], Speech Recognition (SR) [8], Language Modeling (LM) [18], and Machine Translation (MT) [19], using the cuDNN library [20] and enabling TCUs with mixed precision.
In order to answer the needs for high-performance and efficient LSTM computation, accelerating LSTM networks through either customized architectures [21], [27] or neural processing units (NPU) [25], [28] has been recently explored. These systems are implemented on either ASICs or FPGAs. FPGAs are attractive for their cost and reconfigurability, whereas ASICs are more energy-efficient. In this work, we focus on ASIC implementations. We address most of the issues and challenges observed due to the variety of LSTM models, improving scalability through an efficient scheduling scheme together with a reconfigurable architecture.
arXiv:1911.01258v1 [cs.LG] 4 Nov 2019
Although previous accelerators have achieved good performance improvement over CPUs and GPUs, they suffer from an important issue, which is low resource utilization. Two state-of-the-art implementations, Microsoft's BrainWave [25] and Google's TPU [28], achieve average utilization of 18% and 3.5% for LSTMs, respectively. We elaborate on the scaling issues of two state-of-the-art ASIC- and FPGA-based LSTM accelerators in Section III.
In this paper, we propose LSTM-Sharp, an adaptable architecture for LSTM acceleration, which addresses the resource utilization challenges of the previous designs by efficiently handling the data dependencies. By analyzing the unique challenges and special characteristics of LSTM, we introduce the unfolded scheduling that targets its main requirements and maximizes resource utilization by strictly overlapping the computations and significantly reducing the length of the LSTM critical path. We furthermore introduce dynamic reconfigurability at LSTM-Sharp's compute-engine, in order to improve the accelerator's adaptability. Through dynamic reconfiguration, LSTM-Sharp adapts to the particular characteristics of different LSTM models, including padding caused by matrix-vector multiplication. We show that by combining the system-level scheduling scheme and the reconfigurability, our design yields the most efficient workload-dispatching configuration for each model and therefore boosts performance.
Overall, we sum up the paper's contributions as follows:
• We propose LSTM-Sharp, a Scalable, High-performance LSTM Accelerator, which removes pipeline stalls and idle resources through Reconfiguration and better handles Padding in matrix-vector multiplication.
• We analyze LSTM's critical-path delay, considering its data dependencies, and identify opportunities to hide the latency of sequential parts. We then develop a new scheduling scheme to efficiently resolve all dependencies.
• To increase the adaptability of our system running different models, we implement a reconfigurable compute-engine that delivers the most efficient resource mapping.
• We conduct a thorough evaluation and demonstrate average speedups of 1.5x, 2.86x, and 82x with respect to the state-of-the-art ASIC, FPGA, and GPU implementations, respectively.
The remainder of this paper is organized as follows. Section II provides some background on LSTM networks. Section III introduces some challenges and opportunities regarding LSTM acceleration design. Next, we describe LSTM-Sharp's design in Section IV. Afterwards, we introduce our proposed scheduling technique in Section V. Then, Section VI discusses the architecture's reconfigurability. We describe the evaluation methodology in Section VII and show the experimental results in Section VIII. Finally, Section IX reviews some related work, and Section X sums up the main conclusions of this work.
Fig. 3. LSTM computation overview, showing the intra-sequence and across-sequence dependencies.
II. LSTM BACKGROUND
LSTM networks [1] can capture both short- and long-term dependencies of an input sequence through their recurrent links. This results in a lot of serial processing, and such inherent dependencies of LSTM computation make it the most challenging of all types of neural networks to parallelize. In this section, we first give an overview of LSTMs and then elaborate on their computation style. Finally, we discuss the most common scheduling schemes used in the previous LSTM accelerators.
An LSTM network is composed of a chain of LSTM cells, and each cell processes two vectors at each time step, x_t and h_{t-1}, corresponding to the input and hidden vectors of the forward and recurrent connections, respectively. As Figure 2 illustrates, an LSTM cell [1] employs four gates in order to recurrently update the cell state and also compute the output. At each recurrent phase, the gates carry out the following actions: the input gate (i_t) decides how the current input affects the cell-state, while the forget gate (f_t) removes useless information from the current cell-state; the cell-update gate (g_t) modulates the amount of input information that is considered as a candidate to update the cell-state; finally, the output gate (o_t) decides what information to emit from the cell. Figure 3 formulates the detailed computation of LSTM at each time step. As depicted, each LSTM gate performs two matrix-vector multiplications (MVMs), which finally decide how to update the cell-state (c_t) and how to generate the hidden output vector (h_t) that is recurrently sent to the following time step. Two kinds of dependencies exist in these computations: intra-sequence, since all the gates' activations must be ready before updating the cell or generating the output; and across-sequence, meaning that each step has to wait until its recurrent input is received from the previous time step. Later, we will discuss the parallelism constraints due to the sequential behavior imposed by these dependencies.
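Since Figure 3 is not reproduced in this text, the per-step computation it refers to can be written out as follows. This is a sketch assuming the common LSTM formulation without peephole connections; the names W (input weights), R (recurrent weights) and b (biases) are notational assumptions rather than symbols taken from the paper.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + R_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + R_f h_{t-1} + b_f\right) \\
g_t &= \tanh\left(W_g x_t + R_g h_{t-1} + b_g\right) \\
o_t &= \sigma\left(W_o x_t + R_o h_{t-1} + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

The first four lines each contain two MVMs (eight in total per step). The last two lines carry the intra-sequence dependency, because they need all four gates, while the terms in h_{t-1} and c_{t-1} carry the across-sequence dependency.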
III. CHALLENGES AND OPPORTUNITIES
There are two main challenges with the existing approaches for LSTM acceleration regarding their adaptiveness: (1) they are efficient only for certain configurations, but they become inefficient when the model configurations start to change; (2) they do not achieve a well-balanced execution pipeline as the available hardware resources increase. Most of the previous accelerators are designed for a variety of neural networks rather than being tightly optimized for LSTMs. For instance, NPUs [25], [28] have the parallel multiply-accumulation (MAC) stage at the heart of their pipeline and are not optimized for the case in which the serial part becomes the performance bottleneck for some models. On the other hand, customized accelerators [21], [29] use a relatively small resource budget, which causes a large MVM delay that in turn overlaps the remaining LSTM computation that needs to run sequentially. However, when using more MACs, the issue of efficiently handling LSTM's dependencies still remains.
Figure 4 shows the latency and utilization of BrainWave for different LSTM sizes. As the size of the hidden layers decreases, utilization drops drastically, whereas the latency remains the same. However, an efficient design should operate faster as the LSTM workload reduces. The reason for such performance inefficiency is that BrainWave's pipeline is mainly optimized for some particular large models rather than being adaptable to various models. As stated in [25], for small LSTMs, BrainWave's utilization drops due to two main reasons: (i) the large tile dimension chosen for the multiplication units, resulting in wasteful work and resource underutilization; (ii) the deep pipeline, which delays the write-back of the dependent data. In other words, the pipeline is not well-balanced in assigning resources to different stages based on the models' requirements, which causes many stalls in case one stage is slower.
On the other hand, some designs obtain good performance efficiency for a specific model only when resources are limited, but are inefficient with larger resource budgets. For instance, we thoroughly evaluate E-PUR [21], the state-of-the-art dense LSTM accelerator, by experimenting across different numbers of multiply-add units. Figure 5 shows the performance improvement obtained by increasing resources when accelerating EESEN [8], one of the benchmarks used in [21]. As seen, when raising the number of multiply-add units above 4K, the achieved speedup is no longer commensurate with the increase in the number of resources.
An LSTM cell computes eight different MVMs which can run in parallel at each time step. Previous proposals mainly use vector-vector (VV) [21], vector-matrix (VM) [25] and matrix-matrix (MM) [28] primitives, which are the simplest and most straightforward hardware approaches. However, they are less flexible since their vector and matrix dimensions are set to a fixed size, which results in resource under-utilization in many cases. In this work, we use vector-scalar (VS) as the basic primitive and implement VV and VM by merging VSs in different dimensions. This way, we can implement resizable primitives by using VSs of different sizes.
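To make the vector-scalar idea concrete, the following minimal sketch builds a full matrix-vector product out of nothing but VS operations and accumulation. It is an illustration under stated assumptions, not the paper's hardware description: the function names `vs_multiply` and `mvm_from_vs` and the toy matrix are hypothetical.

```python
def vs_multiply(column, scalar):
    """Vector-scalar (VS) primitive: scale one weight column by one input element."""
    return [w * scalar for w in column]


def mvm_from_vs(W, x):
    """Matrix-vector product assembled from VS operations plus accumulation."""
    rows, cols = len(W), len(x)
    acc = [0.0] * rows
    for j in range(cols):
        column = [W[i][j] for i in range(rows)]   # j-th column of the weight matrix
        partial = vs_multiply(column, x[j])       # one VS operation
        acc = [a + p for a, p in zip(acc, partial)]
    return acc


# Toy example (values chosen only for illustration).
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 0.5, -1.0]
y = mvm_from_vs(W, x)
print(y)
```

Because each VS operation touches a single input element, several of them can be grouped along rows or along columns, which is what lets the tile shape be resized.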
To address the aforementioned challenges, we propose LSTM-Sharp in the next section, as a reconfigurable accelerator design with an efficient pipelining mechanism combined with a new scheduling.
IV. LSTM-SHARP ARCHITECTURE
In this section, we present the architecture of LSTM-Sharp. First, we describe the accelerator's pipeline considering the LSTM processing flow. Next, we elaborate on each pipeline stage and the part of the LSTM computation associated with it. Finally, we discuss the balance between the latencies of the different components in order to keep the pipeline fully utilized.
A. Overview
Figure 6 illustrates the architecture of LSTM-Sharp, which consists of a pipeline with four stages. The pipeline performs the following tasks at each time step of LSTM evaluation. First, the two pipeline components, Compute Unit and Add Reduce, multiply the weight matrix and the input/hidden vector in a tile-based fashion, fetching them from the weight and I/H buffers, respectively. Then, after finishing each gate's matrix-vector multiplication (MVM) for both input and hidden vectors, Activation MFU runs the activation function, i.e., sigmoid or hyperbolic tangent, on the MVM's result. Finally, Cell Updater uses all four gates' activated results to update the cell-state and produce the hidden outputs for the next step.
In addition to the functional units, LSTM-Sharp includes two memory components in order to store the synaptic weight matrices and the input and hidden vectors necessary for one LSTM layer's evaluation. This way we can avoid most of the expensive off-chip memory accesses, which is identified as a key feature in most of the state-of-the-art hardware implementations [21], [25], [28]. Regarding the I/H buffer, we use SRAM in order to reduce the access latency and because its contents are often modified during sequence processing. On the other hand, the weight buffer is designed as a multi-banked embedded-DRAM (eDRAM) memory, which provides a read latency similar to that of SRAM. As this buffer is written once for the execution of each layer of the LSTM network, we can simply overlap the write latency with the computation. Furthermore, since performing the layer's computation normally takes in the order of a hundred microseconds, eDRAM refreshes are skipped, as they have an interval in the order of milliseconds [30]. Due to the predictable pattern of LSTM computation, we can easily interleave the weight matrices across different memory banks, fully utilizing the multiply units without collisions in accessing the same memory line. Additionally, we use two on-chip SRAM buffers for storing the cell-state and the intermediate results produced between the recurrent time steps.
B. Resizable MVM Tile-Engine
Figure 7 (left) shows the Compute Unit structure plus the weight and input/hidden buffers. Compute Unit is equipped with N K-width vector-scalar (VS) multipliers, each multiplying an input/hidden element by K row elements of the weight matrix, producing N × K partial results in total. Most of the previous proposals use the Dot Product Unit (DPU) that operates on two vectors, by dispatching the weight matrix column-wise. However, we consider row-wise selection for the basic vector operation. The former scheme has to reduce the result-vectors into several outputs, which may require 2 reduction levels (such as in [25]), whereas ours generates one or multiple vectors of partial sums by accumulating the result-vectors, requiring 1-level reduction. As depicted by Figure 8, we can allocate the N VS units either row-wise or column-wise, generating resizable MVM tile-engines to go over each gate's computation in a tiled fashion. In the next section, we will show that different choices of K and VS mapping impact the performance and utilization of LSTM-Sharp running different LSTM models.
We design Add Reduce using a tree-adder that sums all the K-wide result vectors into K partial sums (Figure 7). This way, we can reduce the results in the case that all the VS multipliers are dispatched column-wise, as shown in Config4 of Figure 8. Therefore, Add Reduce can have a latency of log(N) when going through all the levels of the tree-adder. In order to hide this delay, we pipeline all the levels of the tree, resulting in a 1-cycle add-reduction if the pipeline is full. By choosing among the different configurations shown in Figure 8, we can update 1 to 8 K-accumulators as we reach the last 4 levels of the tree. Upon completion of each MVM, the accumulators are released for the next phase of LSTM computation. As Figure 3 shows, both the input and hidden vectors must be processed before Add Reduce sends the result out to Activation MFU.
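The interaction between VS placement and the tap point of the tree-adder can be sketched in software. The code below is a minimal model under assumptions of our own: adjacent VS units are paired at each tree level, VS unit `u` is mapped to row group `u // cols` and tile column `u % cols`, and the names `tree_reduce` and `tile_mvm` are hypothetical. It is not a description of the actual datapath.

```python
def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


def tree_reduce(vectors, num_outputs):
    """Add adjacent K-wide vectors level by level and stop once
    `num_outputs` vectors remain (i.e. tap the tree before its root)."""
    assert is_power_of_two(len(vectors)) and is_power_of_two(num_outputs)
    assert num_outputs <= len(vectors)
    level = vectors
    while len(level) > num_outputs:
        level = [[a + b for a, b in zip(level[i], level[i + 1])]
                 for i in range(0, len(level), 2)]
    return level


def tile_mvm(w_tile, x_tile, n_units, k, num_outputs):
    """One tile of an MVM: the tile has num_outputs*k rows and
    n_units//num_outputs columns; each VS unit handles one K-row column slice."""
    cols = n_units // num_outputs
    products = []
    for unit in range(n_units):
        group, col = divmod(unit, cols)          # assumed unit-to-tile mapping
        products.append([w_tile[group * k + r][col] * x_tile[col]
                         for r in range(k)])     # one VS multiplication
    partial = tree_reduce(products, num_outputs)
    return [value for vec in partial for value in vec]


# Toy sizes: 8 VS units of width 2, tapped so that 2 vectors leave the tree.
w_tile = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4 x 4 integers
x_tile = [1, 2, 3, 4]
result = tile_mvm(w_tile, x_tile, n_units=8, k=2, num_outputs=2)
print(result)
```

With `num_outputs=1` every unit sits in the same row group (fully column-wise), while `num_outputs=8` places the same eight units in eight row groups and bypasses the last three adder levels.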
C. Gate Activation and Cell-Update
Activation MFU is a configurable multi-functional component, composed of several arithmetic units for floating-point operations such as addition, division and exponentiation. By combining these units, we implement the two activation functions (sigmoid and hyperbolic tangent) applied to each gate's outputs. We use the same approach proposed in [21] in order to configure the MFU data transfer based on each activation function. For instance, the MFU carries out the following actions to get the sigmoid of X:
$$\sigma(X) = \frac{1}{1 + e^{-X}} \qquad (1)$$
Based on our synthesis evaluation using Synopsys Design Compiler [31] combined with a 32nm technology library, we calculate the MFU's critical-path delay as 29.14 ns for the hyperbolic tangent function. We then partition the operations in a way to efficiently pipeline them, achieving a 1-cycle latency for performing the activation function on each gate's output.
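As an illustration of how the two activation functions decompose into the arithmetic units listed above (addition, division, exponentiation), consider the sketch below. It only shows one possible decomposition; the actual MFU datapath follows [21] and may differ, and the function names are hypothetical.

```python
import math


def sigmoid_from_units(x):
    """Sigmoid using one negation, one exponentiation, one addition, one division."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_units(x):
    """tanh(x) = (e^{2x} - 1) / (e^{2x} + 1), with 2x formed by an addition.
    Note: this naive form overflows for very large x; a real design would clamp."""
    e2x = math.exp(x + x)
    return (e2x - 1.0) / (e2x + 1.0)


print(sigmoid_from_units(0.0), tanh_from_units(0.5))
```

Both functions share the exponentiation and division steps, which is why a single configurable unit can serve the sigmoid gates and the hyperbolic-tangent gate.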
As soon as all the gates' activation results are ready, Cell Updater starts the following two sequential tasks: updating the cell-state and generating the hidden outputs. Regarding the calculation of the cell-state (c_t), Cell Updater uses the outputs of the input, forget and cell-update gates, plus the previous cell-state (see Figure 3). Then, to compute the hidden outputs (h_t), a hyperbolic tangent is applied to the new cell state, and the result is multiplied by the mask generated from the output gate. Therefore, Cell Updater is also a multi-functional unit that includes an additional point-wise vector-multiplication unit compared to the Activation MFU. We pipeline all the operations in order to ensure that the calculation of K/4 elements of the hidden outputs (combining the 4 gates' outputs) finishes at each cycle, provided that the pipeline is always full.
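The two sequential tasks of the Cell Updater correspond to the following element-wise computation. This is a plain software sketch of the standard LSTM update described in the paragraph above; the function name `cell_update` and the list-based representation are assumptions made for illustration.

```python
import math


def cell_update(i_t, f_t, g_t, o_t, c_prev):
    """Return (c_t, h_t) from the four activated gate vectors and the previous cell-state."""
    # Task 1: new cell-state from the input, forget and cell-update gates.
    c_t = [f * c + i * g for i, f, g, c in zip(i_t, f_t, g_t, c_prev)]
    # Task 2: hidden output, i.e. tanh of the new cell-state masked by the output gate.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return c_t, h_t


c_t, h_t = cell_update(i_t=[1.0], f_t=[0.5], g_t=[0.0], o_t=[1.0], c_prev=[2.0])
print(c_t, h_t)
```

Task 2 cannot start for an element until task 1 has produced that element's cell-state, which is the serial tail that the scheduling scheme later tries to hide.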
D. Pipeline Efficiency
LSTM computation involves several data dependencies, which makes it very challenging to design an efficient pipeline that overlaps the independent and the sequential parts. We employ the Unfolded scheduler (Section V) on top of our pipeline design in order to improve performance under different resource configurations. As the MVM operations account for an important share of LSTM processing, the main focus of previous proposals is to provide more parallelism by increasing MAC resources and thereby reduce LSTM latency. However, we observe that for several applications LSTM computation can cause a lot of stalls due to its serialization, which limits parallelization. LSTM models with small dimensions that process long sequences are the most tangible examples that require dealing with lots of dependencies besides the parallel task of MVMs. Thus, we cannot achieve a reasonable performance improvement by only providing a high amount of parallelism.
In order to achieve the best throughput, all the pipeline stages should be kept fully utilized. This means that the different components must have similar latency in order for the pipeline to flow without any stalling. Otherwise, it happens that one stage operates faster or slower than others, causing stalls or idleness. This results in under-utilization because of uneven distribution of the amount of work and the number of resources between different stages. In our design, we divide the workload of LSTM based on their types of operation, whether they are dependent or independent, and then explore various number of resources for each part. Our experiments show that there is a high correlation between the scheduling scheme and the way pipeline resources are allocated to each part of LSTM computation. Furthermore, there is not just one best resource mapping (tiling dimension) to evaluate all the LSTM models, since each model has different requirements based on the ratio between parallel and serial tasks. Therefore, by adding some level of reconfigurability at the Compute Unit and Add Reduce components, we increase our design's adaptability by tailoring the best configuration for each model. We elaborate on the reconfigurability technique in Section VI.
V. LSTM-SHARP SCHEDULE
There have been several scheduling approaches proposed in the previous LSTM implementations. These schemes mainly focus on the different processing order of the gates [21], [25] and the input and hidden vectors [26], [29]. However, they result in sub-optimal resource utilization because they handle data dependencies inefficiently and do not pipeline the whole LSTM. Existing works mainly focus on speeding up MVMs with a high level of parallelism. But according to Amdahl's Law [32], overall performance is bounded by the serial execution of the cell-state and hidden units. To overcome this challenge, we propose the Unfolded schedule, which removes all the parallelism inefficiencies by strictly overlapping the dependent and independent parts of the computation.
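The Amdahl's Law bound invoked above can be stated explicitly. As a sketch, let p denote the fraction of per-step execution time spent in the parallelizable MVMs and n the factor by which additional MAC resources speed up that fraction; both symbols are introduced here for illustration and do not appear in the paper.

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

However large n grows, the serial cell-state and hidden-unit updates (the 1 - p term) cap the achievable speedup, so adding MAC units alone gives diminishing returns unless the serial part is overlapped with useful parallel work.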
There have been two basic schedulers employed in the previous proposals: Sequential [25], [28], [29], [33] and Intergate [21], [27]. Figures 9.a and 9.c show the two schemes graphically. Sequential scheduling computes the gates in a sequential manner, one gate after another, whereas Intergate scheduling runs all gates' multiplications together by sharing MAC resources. Although the two techniques have equal latency in processing the gates, which includes the MVM and gate activation, there is a slight difference between them. Intergate scheduling can better hide the latency of updating the cell-state and computing the hidden units, by pipelining them with the gates' computation (output-based tiling). On the other hand, Sequential scheduling has to wait until reaching the last gate (Output) before continuing with the cell-state update and producing the hidden units. We will show that the latter schedule outperforms the former in cases where the MVM is highly parallelized and operates too fast. However, the challenge is that it makes the serial portion of LSTM computation the main bottleneck.
Figure 9.d depicts the Unfolded technique graphically. As illustrated, we first process the input MVM of each time step and save its result in an intermediate buffer. In other words, we unfold the MVMs of the input and hidden vectors in order to hide the serialization delay of the recurrent step t behind the input MVM of step t + 1 (across-sequence dependency), as there is no dependency between the input sequence vectors. Then, after accumulating the hidden MVM's output with the buffered input results, we apply the activation function, update the cell-state and generate the hidden outputs. By processing all gates simultaneously, we can overlap the computation of the cell-state and hidden units (intra-sequence dependency), by pipelining them in the output-based tiling manner.
By using such a computation order, we completely overlap the critical-path delay for evaluating LSTM, significantly increasing the utilization rate of the parallel MAC resources by hiding the two types of LSTM dependencies shown in Figure 3.
Figure 9 shows the LSTM critical-path time-line, including the gates' MVMs plus the recurrent serial computation, for the different scheduling methods. In order to go over a gate's MVM, we divide the weight matrix into several blocks (shown as red boxes) to dispatch to the MVM tile-engine. The critical path is considered to be the longest part of the LSTM computation between recurrent steps. Due to the dependency on the hidden vector, we cannot completely dispatch the next step's MVM while we are processing the last sequential portion of the current step. The Sequential scheme pipelines each gate's computation, including the MVM and the activation function. Since all four gates' outputs must be ready before issuing the cell and hidden update, the update runs serially after the last gate's MVM. Batch scheduling is a variant of the previous one, with the difference that it only processes a batch of each gate at a time, which allows pipelining the whole LSTM computation. By issuing all the gates' MVMs at the same time in the Intergate approach, we can better overlap the computation and decrease the latency of the cell and hidden update by four times. Our proposal, Unfolded scheduling, not only leverages the advantage of the Intergate technique for hiding the intra-sequence dependency, but also handles the across-sequence dependency. As shown in Figure 9.d, while processing the last sequential computation of the cell-state and hidden outputs for the current time step, the MAC units can still be busy calculating the next step's input MVM.
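The data-flow property that Unfolded scheduling relies on can be checked in a few lines of software: the input MVM of step t + 1 does not depend on h_t, so it can be issued one step early and buffered without changing the result. The sketch below uses a deliberately simplified single-gate recurrence, h_t = tanh(Wx x_t + Wh h_{t-1}), as a stand-in for the four-gate LSTM; this simplification, the function names and the toy weights are all assumptions for illustration. Sequential Python cannot show the overlap itself, only that the reordering is legal.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def run_reference(Wx, Wh, xs, h0):
    """Input and hidden MVMs computed together at every step."""
    h, outputs = h0, []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(Wx, x), matvec(Wh, h))]
        h = [math.tanh(p) for p in pre]
        outputs.append(h)
    return outputs


def run_unfolded(Wx, Wh, xs, h0):
    """Input MVM of step t+1 is issued before the serial tail of step t."""
    if not xs:
        return []
    h, outputs = h0, []
    buffered = matvec(Wx, xs[0])              # input MVM of the first step
    for t in range(len(xs)):
        current = buffered
        if t + 1 < len(xs):
            # Independent of h_t: hardware could overlap this with the
            # activation and update of step t.
            buffered = matvec(Wx, xs[t + 1])
        pre = [a + b for a, b in zip(current, matvec(Wh, h))]
        h = [math.tanh(p) for p in pre]       # serial part of step t
        outputs.append(h)
    return outputs


Wx = [[0.5, -0.25], [0.1, 0.2]]
Wh = [[0.3, 0.0], [-0.1, 0.4]]
xs = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.25]]
h0 = [0.0, 0.0]
same = run_unfolded(Wx, Wh, xs, h0) == run_reference(Wx, Wh, xs, h0)
print(same)
```

Only one step of input results needs to be buffered at a time, in contrast to schemes that precompute the input MVMs of the entire sequence.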
We evaluate the Unfolded schedule on a GPU (NVIDIA GTX1080) using the cuBLAS [34] library, and achieve around 20% performance improvement with respect to cuDNN's LSTM implementation [22]. However, the benefit of our scheme on the GPU is limited due to the synchronization overhead between thread-blocks and streams. In contrast, the approach maps more directly onto LSTM-Sharp's dedicated accelerator design. In order to obtain the performance benefits of the Unfolded scheme, we define a new memory layout, which divides the weight matrix into two partitions, input and hidden. Moreover, as we process all the gates together, we put their weights into consecutive parts of the memory based on the tiling dimension selected for MVM processing. We go over the different configurations of the MVM tile-engine in the following section.
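One way to picture such a layout is sketched below. It assumes that each K-row tile holds K/4 rows from each of the four gates, so that one tile yields everything needed for K/4 hidden outputs; this grouping, the gate order and the function name `interleave_gates` are assumptions for illustration, since the text does not spell out the exact arrangement. The same routine would be applied separately to the input partition and to the hidden partition.

```python
def interleave_gates(gates, k):
    """Reorder weight rows so that every k consecutive rows contain
    k // 4 rows of each gate, for the same block of hidden units.

    gates: dict mapping "i", "f", "g", "o" to a list of rows.
    """
    per_gate = k // 4
    order = ["i", "f", "g", "o"]          # assumed gate order
    hidden = len(gates["i"])
    layout = []
    for start in range(0, hidden, per_gate):
        for name in order:
            layout.extend(gates[name][start:start + per_gate])
    return layout


# Rows are represented by labels so the resulting order is easy to read.
gates = {name: [f"{name}{r}" for r in range(4)] for name in "ifgo"}
layout = interleave_gates(gates, k=8)
print(layout)
```

With this ordering, a tile read from consecutive addresses already contains matching rows of all gates, so the Cell Updater can start on a block of hidden units as soon as that tile's accumulation completes.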
A similar approach, introduced by [21] , [26] , also tries to partition the input and hidden evaluations of LSTM. Their scheme separates the whole input MVMs across all the sequence time-steps, focusing on either improving data locality for accessing weight matrices [21] or optimizing LSTM execution through faster scheduling of GEMMs [26] . In contrast, Unfolded scheduling unfolds the work of each time step individually, and by doing so, we overlap the data-dependency between recurrent serial processing. Therefore, our mechanism introduces a more efficient pipelining for LSTM in order to maximize resource utilization.
VI. IMPROVING ADAPTABILITY VIA RECONFIGURABILITY
An efficient LSTM acceleration design is able to adapt and scale performance across the space of applications (with different model characteristics) and resource budgets. This means two things: (i) for a fixed resource budget, achieve high-performance execution for different applications (with different model characteristics), and (ii) for a fixed application, scale performance proportionally with resource budget.
Fig. 10. Exploration on the K-width for the VS units of Compute Unit. The speedups are normalized to the 1K-MAC design. In most cases, for different resource budgets, there is not just one configuration providing the best performance for the various LSTM dimensions.
In this section, we evaluate LSTM-Sharp under several configurations for different model characteristics. Then, we show how the design's parameters impact the performance of the system for various LSTM topologies. Finally, to improve the adaptability and scalability of our system, we define some level of reconfigurability in order to tailor the tile-engine of LSTM-Sharp's architecture to each model.
A. Adaptability Issue
As explained in Section IV-D, MVM operations are the main part of LSTM pipeline. Thus, the way they are assigned to the MAC units defines the pipeline efficiency. However, as we observed, there is not just one fixed configuration to dispatch weight matrices to the MVM tile-engine. This is due to two reasons: first, there is always some padding when tiling the matrix MVM; second, there is high performance difference when choosing various tile dimensions for an LSTM model. Here, we elaborate more on these adaptability issues and then propose reconfiguration at the MVM tile's architecture, in order to flexibly adapt to each model's requirements and achieve the highest performance and utilization.
1) MVM Padding: By using one multiplication tile-engine to go through the whole MVM of the weight matrix, we incur padding whenever the last portion of the rows or columns of the matrix does not fill a tile. This results in some resource under-utilization, because the multipliers that fall outside the matrix dimensions are not occupied. Furthermore, this padding will continue to exist until multiplying the last column of the matrix. Note that the only case in which padding does not exist is when the size of the matrix is a multiple of the tile dimension. However, practically speaking, we cannot have as many tile-engines as there are different LSTM models. In order to handle such inefficiency, we apply reconfiguration in a way that flexibly changes the tile dimension when reaching the last row segment.
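The cost of padding can be estimated with simple ceiling arithmetic. The sketch below assumes one tile is issued per step and counts how many multiplier slots do useful work; the function name `tiling_stats` and the example dimensions are hypothetical and are not measurements from the paper.

```python
import math


def tiling_stats(rows, cols, tile_rows, tile_cols):
    """Return (number of tiles issued, fraction of multiplier slots doing useful work)."""
    row_tiles = math.ceil(rows / tile_rows)
    col_tiles = math.ceil(cols / tile_cols)
    tiles = row_tiles * col_tiles
    issued = tiles * tile_rows * tile_cols    # multiplier slots occupied, padding included
    useful = rows * cols                      # multiplications the MVM actually needs
    return tiles, useful / issued


# Hypothetical example: a 256 x 4 tile (1K multipliers) on two matrix sizes.
print(tiling_stats(512, 512, 256, 4))   # dimensions are multiples of the tile
print(tiling_stats(300, 300, 256, 4))   # the last row segment is mostly padding
```

In the second call the final row segment covers only 44 of 256 tile rows, and that waste repeats for every column tile, which is why shrinking the tile height for the last segment pays off.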
2) Model Diversity: As specified in Section IV-B, Compute Unit is constructed based on a tile with K-row and N-column multipliers. Considering the same resources, by choosing different K widths, the tile dimension (rows and columns) varies, which results in different latencies to complete the K partial results of the MVM. For instance, if K is too small (Config4 in Figure 8), we place the multipliers more column-wise, producing partial results faster than in the case that K is too large (Config1 in Figure 8). Figure 10 shows the K-width exploration results in four charts corresponding to 1K, 4K, 16K and 64K multiply-adders, respectively. Each chart illustrates the performance evaluation of choosing several K widths from 32 to 512 for the different LSTM hidden dimensions. Note that we assume equal size for both the hidden and input vectors in our experiments. Moreover, we run all the models for the same sequence length of 25 time steps. As seen, there is not just one best configuration for tiling the MVM operations. For instance, in the case of 4K MAC units, there are different optimal K-widths (128, 256, and 512) for each LSTM dimension that result in the highest performance speedup.
B. Reconfigurability
In order to increase the adaptability of our design, we modify the Compute Unit and Add Reduce components of the pipeline in a way to configure the MVM tile-engine based on each LSTM model dimension. Initially, we consider 32 as the K-width of each VS unit. Then, by mapping the VS units at either rows or columns of the weight matrix, we can generate different MVM tiles. Figure 8 depicts the four possible tile configurations of LSTM-Sharp. Even though the number of multiplications does not differ at each configuration, the data dispatching pattern should match to the row and column selection scheme. This also affects the number of partial results generated by the Add Reduce stage. Therefore, we rearrange the memory organization of the weight matrix by interleaving them based on the configured tile dimension. We also interleave the weights in a way to keep all the VS units uniformly busy.
After delivering the multiplication results to the Add Reduce stage, the tree-adder uses a reconfigurable routing mechanism in order to emit the correct partial results corresponding to the MVM tile configuration. By employing four multiplexers, we support the four configurations shown in Figure 8, by selecting between the outputs of the last four levels of the tree-adder. Figure 7 illustrates the configurations in different colors and also the 4 multiplexers we use to reroute the adders' outputs to match the tiling topology. For example, in the case of Config1 of Figure 8, we select eight partial sums from the fourth last level (log N − 4) of the tree-adder to send to the accumulators. Then, upon reaching the end of the input and hidden vectors, we will have 8 × K MVM results. LSTM-Sharp's controller drives the multiplexers based on each model's specification, by configuring the bit-selects from a table that stores them for the different LSTM dimensions.
1) Impact on Padding: In order to measure the effectiveness of our scheme, we have evaluated padding reconfiguration for the different LSTM models and a similar range of resources as for Figure 10. Regarding the MVM tile-engine, we configure K_opt for each combination of LSTM dimension and MAC resources. Then, we compare the accelerator's performance for the two cases, applying fixed or reconfigurable configurations. Note that the controller reconfigures the tile-engine dynamically, in a way that K gets as close as possible to the remaining number of rows. Figure 11 shows the speedups achieved for LSTM-Sharp by using reconfigurability when running various LSTM models, considering different MAC units. As seen, we improve performance in almost all the cases, except for a hidden dimension of 512. The reason is that, as 512 is a multiple of K_opt, it causes no padding during MVM tiling, and hence there is no benefit from reconfiguration. In total, we can get up to 1.22x speedup by applying our approach to alleviate padding.
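The rule that K should get as close as possible to the remaining number of rows can be sketched as a small selection routine. The policy shown here (keep the default K while enough rows remain, then pick the smallest supported K that still covers the remainder) is an assumption made for illustration, as are the names `pick_k` and `row_schedule`; the supported widths of 32 to 256 follow the four options mentioned in the text.

```python
K_OPTIONS = (32, 64, 128, 256)   # widths reachable by grouping 32-wide VS units


def pick_k(remaining_rows, k_default):
    """Choose the tile height for the next row segment (assumed policy)."""
    if remaining_rows >= k_default:
        return k_default
    for k in K_OPTIONS:              # smallest supported K covering the remainder
        if k >= remaining_rows:
            return k
    return K_OPTIONS[-1]


def row_schedule(rows, k_default):
    """List the tile heights used to sweep all rows of a weight matrix."""
    schedule, remaining = [], rows
    while remaining > 0:
        k = pick_k(remaining, k_default)
        schedule.append(k)
        remaining -= min(k, remaining)
    return schedule


# Hypothetical example: 300 rows with a default tile height of 256.
print(row_schedule(300, 256))
print(row_schedule(512, 256))
```

In the first case the last segment pads 20 rows instead of 212, and the VS units freed by the smaller K can be placed column-wise so that the segment finishes in fewer steps. In the second case nothing changes, matching the observation that a dimension that is a multiple of K gains nothing from reconfiguration.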
2) Impact on Adaptability: By using reconfigurability, we can generate almost all the K_opt widths of the MVM tile-engine for the LSTM models shown in Figure 10 by combining the basic 32-width VS units. We can select between the four options from 32 to 256 for K, achieving most of the performance benefits across a variety of model dimensions. Note that we explore the configurations offline in order to determine the parameters that reach the best performance for each application. This generates a table with several entries, each storing the optimal configuration for one LSTM hidden dimension, including the multiplexing and control logic applied to the tree-adder. The table is preloaded in an on-chip memory in LSTM-Sharp, minimizing the cost of reconfiguration both performance- and energy-wise.
Reconfiguration in LSTM-Sharp has negligible runtime cost and it works as follows. Prior to the execution of each LSTM layer, its optimized architecture's configuration is fetched from the aforementioned table. Then, LSTM-Sharp sets the control signals of the multiplexers for the tree-adder accordingly, which has negligible performance overhead. Note that the expensive operations such as experimentally finding the optimal configuration or changing the memory layout are performed offline. Runtime operations only include an access to a small table and setting the control signals of several multiplexers.
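The runtime step described above (one table access followed by setting the multiplexer selects of the tree-adder) can be sketched in software as follows. This is a behavioral illustration only: the table contents are placeholders rather than values from the paper, the names `CONFIG_TABLE`, `mux_select`, `tree_adder`, and `configure_layer` are hypothetical, and scalar values stand in for the outputs of the VS units. The number of inputs is assumed to be a power of two.

```python
# Hypothetical offline-generated table: hidden dimension -> row groups.
# The entries are placeholders, not values from the paper.
CONFIG_TABLE = {200: 1, 340: 2, 512: 4, 1500: 8}


def mux_select(row_groups):
    """Map a row-group count (1, 2, 4, 8) to a tree-adder tap index.

    Tap 0 is the root (one sum); tap t exposes the level with 2**t sums.
    """
    if row_groups not in (1, 2, 4, 8):
        raise ValueError("unsupported configuration")
    return row_groups.bit_length() - 1


def tree_adder(products, tap):
    """Reduce pairwise until 2**tap partial sums remain, then emit them."""
    level = list(products)
    while len(level) > 2 ** tap:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level


def configure_layer(hidden_dim):
    """Runtime step: one table lookup, then derive the multiplexer select."""
    return mux_select(CONFIG_TABLE[hidden_dim])


products = list(range(16))  # stand-in for 16 VS-unit outputs
print(tree_adder(products, configure_layer(1500)))  # eight partial sums
print(tree_adder(products, configure_layer(200)))   # a single sum
```

Tapping the tree three levels before the root yields eight partial sums, each covering a contiguous group of inputs, which mirrors the eight-partial-sum case described for Config1.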
For every LSTM network, its weights are rearranged and interleaved offline according to the access-pattern of the optimal configuration. Then, we fetch the memory blocks likewise in order to fill in the on-chip buffers, the eDRAM's banks corresponding to the VS units. Except for the initial delay to fetch the memory requests (this delay is proportional to the model's size and the LSTM dimension), we can overlap the rest with the computation of MVM tile-engine. Furthermore, after having the weight matrices reside on-chip, there will be no off-chip memory bottleneck restricting LSTM-Sharp's performance. Regarding the input sequences, the I/H buffer (see Figure IV) works in a ping-pong manner. While the MVM tile-engines are processing the current batch of data, LSTM-Sharp prefetches the next part of input data.
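The ping-pong operation of the I/H buffer can be sketched as two buffers that swap roles at every step. The sketch below is a sequential software model under our own assumptions (the function name `run_ping_pong` and the batch representation are hypothetical); in hardware the prefetch and the computation would proceed concurrently.

```python
def run_ping_pong(batches, compute):
    """Process batches with two buffers that swap roles every step.

    While `compute` consumes the active buffer, the next batch is loaded
    into the idle one. The overlap is only modeled sequentially here.
    """
    buffers = [None, None]
    results = []
    if not batches:
        return results
    active = 0
    buffers[active] = batches[0]  # initial fill, not overlapped
    for step in range(len(batches)):
        idle = 1 - active
        if step + 1 < len(batches):
            buffers[idle] = batches[step + 1]  # prefetch the next batch
        results.append(compute(buffers[active]))
        active = idle  # swap roles
    return results


print(run_ping_pong([[1, 2], [3, 4], [5, 6]], sum))
```

Only the first fill is exposed as latency in this model; every later load targets the buffer that is not being read, which is the property that lets the fetches overlap with the MVM computation.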
In order to measure the effectiveness of reconfigurability, we compare the configured K against an ideal case of using a hardened K for each model. Our numbers show very similar performance evaluation as all the multiplexing latency is covered by the Add Reduce slack time, imposing no extra cycle to send out the partial sums. The only overhead is for the reconfiguration logic at the controller to select the best configuration, which only happens at the beginning of each LSTM layer's computation.
By combining reconfigurability with our scheduling technique, we improve the scalability of our architecture, increasing performance efficiency through high utilization of the available resources. More specifically, by carefully pipelining the LSTM computation and balancing the latency of the pipeline stages, we can overlap most of the data dependencies that limit the parallelism of the MVM operations. Furthermore, compared with previous methods that scale poorly as the number of resources increases (see Figure 5), we significantly improve utilization by handling the serialization of the LSTM computation more efficiently. For instance, as seen in Figure 10, the speedups are relatively higher for larger resource budgets such as 16K and 64K MACs.
VII. EVALUATION METHODOLOGY
In order to evaluate LSTM-Sharp, we first developed a cycle-accurate simulator in C++ that models all the pipeline stages described in Section IV. Second, we implemented all the logic components in Verilog using the DesignWare library and synthesized them with Synopsys Design Compiler, using the 32 nm technology library from Synopsys [31]. For the memory components, we used CACTI-P [35] with the same 32 nm technology parameters.
In all our experiments, the accelerator's architecture is configured using the parameters shown in Table I. Because we consider different resource budgets from 1K to 64K MACs, we obtain a range of peak throughput between 0.46 and 29.8 TFLOPS. However, as we increase the resources, we require higher peak bandwidth from the on-chip memory components, up to 561 GB/s for the 64K-MAC configuration. To achieve this bandwidth, we increase the number of eDRAM banks in proportion to the number of VS units in LSTM-Sharp's architecture.
Regarding the power and energy evaluation, we use the same synthesis results for each pipeline stage, which provide estimates of the design's static power and of its dynamic power for an average activity factor on the internal signals of each module. Furthermore, we characterize the memory components of the accelerator by obtaining the energy per access using CACTI.
By combining the results of the synthesis and simulation (activity factors and cycles), we estimate the execution time plus the dynamic and static energy-consumption. To set the frequency of the system, we consider the critical path delay and access times reported by Design Compiler and CACTI, respectively. We take the maximum delay among the different components, which is 1.94 ns for the half-precision (16-bit) multiplication, resulting in nearly 500 MHz frequency.
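The way cycle counts and synthesis results combine into time and energy estimates can be sketched as below. The formulas are a generic first-order model that we assume for illustration (static energy as static power times execution time, dynamic energy as dynamic power times execution time plus per-access memory energy); the function names and all numeric inputs except the 1.94 ns critical path are hypothetical.

```python
def max_frequency_mhz(critical_path_ns):
    """Upper bound on the clock frequency implied by the critical path."""
    return 1000.0 / critical_path_ns


def estimate_energy(cycles, freq_hz, static_power_w, dynamic_power_w,
                    mem_accesses, energy_per_access_j):
    """Return (execution time in s, static energy in J, dynamic energy in J).

    Assumes `dynamic_power_w` already reflects the average activity factor.
    """
    time_s = cycles / freq_hz
    static_j = static_power_w * time_s
    dynamic_j = dynamic_power_w * time_s + mem_accesses * energy_per_access_j
    return time_s, static_j, dynamic_j


# A 1.94 ns critical path allows a clock slightly above 500 MHz.
print(round(max_frequency_mhz(1.94), 2))

# Placeholder workload: 1M cycles at 500 MHz, 1 W static, 2 W dynamic,
# 1M memory accesses at 1 nJ each.
print(estimate_energy(1_000_000, 500e6, 1.0, 2.0, 1_000_000, 1e-9))
```

The bound of roughly 515 MHz from the 1.94 ns path is consistent with operating the design at nearly 500 MHz.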
Regarding the previous implementations, we implemented E-PUR scheduling [21] by modifying LSTM-Sharp's architecture in order to enable a thorough comparison of our design with the state-of-the-art ASIC-based LSTM acceleration. Moreover, since BrainWave is not open sourced, we developed a cycle-accurate performance model for the BrainWave FPGA implementation [25] . We validated the correctness of our model, by comparing against the number of cycles reported in [25] , using the Structurally-Constrained Model Critical-Path analysis. In order to have a fair comparison, our BrainWave implementation does not account for the network latency.
The LSTM hidden dimensions are selected from the LSTM networks of popular applications such as machine comprehension [10] , language modeling [18] , speech recognition [8] , or machine translation [19] .
VIII. EXPERIMENTAL RESULTS
This section presents an experimental evaluation of LSTM-Sharp, measuring the impact of our proposed techniques in terms of: (1) reducing execution time and increasing resource utilization, and (2) reducing energy consumption. First, we compare the execution time of the LSTM computation under the different scheduling schemes, with and without reconfigurability. Next, we show the latency and utilization of the different LSTM-Sharp configurations for the several hidden dimensions. Then, we compare our numbers with different state-of-the-art systems and show the energy consumption of LSTM-Sharp for various scheduling approaches. Finally, we report results on the area and power analysis.
Figure 12 shows the performance comparison of the schedulers discussed in Section V. Each set of four bars shows the speedups normalized to the first bar (Sequential). We evaluate the numbers for all resource budgets and the LSTM models used for the design experimentation. The MVM tile-engine (N × k-width VS units) is configured with a k-width (number of rows) of 32, mapping all the VS units to the columns of the weight matrix.
As Figure 12 depicts, the Unfolded scheme always obtains the best performance, since it removes most of the data-dependencies and highly utilizes the parallel MACs. However, the benefit diminishes as the LSTM dimension increases or the number of MACs decreases. The reason is that MVMs become the main performance bottleneck under those conditions, and hence the order in which we schedule the LSTM computation cannot have the expected impact. Intergate scheduling outperforms Sequential and Batch scheduling but, compared to our proposal, provides less benefit, as it only removes the intra-sequence dependency whereas the across-sequence dependency still remains. Batch and Sequential schedules show nearly identical execution times, because neither handles all the data-dependencies efficiently.
After comparing the different scheduling methods, we analyze the impact of reconfigurability by considering the best k-width of the MVM tile-engine for every combination of LSTM dimension and resource budget. Figure 13 shows the speedup of all the schedules compared to the Sequential technique. Note that we do not use reconfiguration in the case of Sequential scheduling: since the gates are processed serially, it has very low impact on balancing the pipeline stages (except for handling the padding). As illustrated, by applying reconfigurability, we achieve significant performance improvement with respect to the numbers shown in Figure 12. The main reason is that by tailoring the dispatching pattern to the MVM tile-engine based on each LSTM model, we efficiently distribute the workload among different pipeline stages. Thus, we can better overlap the computation and remove all the stalls caused by the data-dependencies. Moreover, by increasing the number of MAC units, we obtain higher speedup for the large LSTM dimensions, such as 512 or 1500. This is due to the fact that, once the MVM operations are highly parallelized, the serial portion becomes the main bottleneck. Consequently, as we manage the resources more effectively, we can hide most of the sequential parts and maximize the parallel tasks.
Fig. 13. Reconfigurability impact on the performance improvements. We configure K_opt for the MVM tile-engine at each combination of hidden dimension and resource budget. Regarding Sequential scheduling, we use 32 as the k-width. We use the same setup as in Figure 14.
After comparing the different scheduling schemes and the impact of reconfigurability, we choose the Unfolded scheduler as the best one. We then analyze LSTM-Sharp's scalability across different resource budgets, as well as its adaptability to various models. Figure 14 shows the execution time and resource utilization for the different resource budgets of LSTM-Sharp and the LSTM models used for the design experimentation. The MVM tile-engine (N × k-width VS units) is configured based on the exploration result shown in Figure 10. Moreover, the tile is dynamically reconfigured to reduce the padding impact. As depicted, LSTM-Sharp scales well, as it linearly reduces the execution time (AVG case) when increasing the number of MACs from 1K to 64K. Furthermore, we achieve relatively high utilization in all the cases, ranging from 50% for the 64K-MAC budget to 98% for the 1K-MAC budget. These performance benefits come from both the efficient scheduling and the reconfigurability of LSTM-Sharp's architecture. With scheduling, we efficiently distribute the workload among different pipeline stages, whereas with reconfigurable MVM tiling, we decide on the amount of work assigned to each stage. Consequently, by relaxing the two main dependencies of the LSTM computation, we are able to manage the resources more effectively.
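The observation that the serial portion becomes the main bottleneck once the MVM operations are highly parallelized follows the general pattern of Amdahl's law. The sketch below is a textbook illustration of that law, not the authors' performance model; the serial fraction of 1% is a hypothetical value chosen only to show the trend.

```python
def amdahl_speedup(serial_fraction, parallel_units):
    """Amdahl's law: speedup when only the parallel part scales."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_units)


# With a hypothetical 1% serial fraction, the speedup saturates below
# 100x no matter how many MAC units are added.
for units in (1024, 4096, 16384, 65536):
    print(units, amdahl_speedup(0.01, units))
```

Shrinking the serial fraction, for example by overlapping dependent steps through scheduling, raises this ceiling, which is why hiding the sequential parts matters more as the MAC count grows.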
We compare LSTM-Sharp against state-of-the-art GPU, FPGA and ASIC implementations, i.e. cuDNN [20], GRNN [24], BrainWave [25] and E-PUR [21]. Figure 15 shows the speedup achieved by LSTM-Sharp compared to the most recent scalable, efficient implementations on GPU. We use an NVIDIA Titan V for the GPU evaluation, which has a theoretical peak throughput of 29.8 TFLOPS (FP16). Speedup numbers are reported for all the different configurations, executing the various LSTM models that we have tested. As seen, our ASIC implementation outperforms the GPU implementations in almost all the cases, by one to two orders of magnitude. Considering the 64K-MAC configuration, which has the same peak throughput as the Titan V, LSTM-Sharp obtains 172-625x and 72-93x faster LSTM inference than the cuDNN and GRNN GPU implementations, respectively.
Table II shows the performance speedup achieved by our accelerator compared to the Stratix-10 version of BrainWave. Note that we reduce LSTM-Sharp's frequency from 500 MHz to 250 MHz, matching BrainWave's design, so as to have a fair comparison. In addition, we increase our MACs to 96K to have an equal resource budget. As seen, we achieve more than 1.65x speedup for all the LSTM models, and the speedups are significantly larger for the smaller dimensions. This shows that we alleviate the adaptability issue of BrainWave (see Figure 4).
Table III shows the speedup of LSTM-Sharp running the EESEN [8] benchmark compared to E-PUR for different numbers of MACs. The results show that we obtain relatively higher speedups as we increase the number of resources. As a result, we improve the scalability of LSTM acceleration for a range of resource budgets and different models.
Furthermore, we measure the energy consumption of both LSTM-Sharp and E-PUR for different numbers of resources. Figure 16 shows the energy consumption, normalized to E-PUR using 1K MACs, for the different LSTM dimensions. As it illustrates, LSTM-Sharp obtains better energy efficiency for the smaller models when fewer resources are available (1K and 4K MACs). This is due to the effectiveness of both Unfolded scheduling and flexible tiling. Moreover, reconfiguration handles the padding of the LSTM's MVM, which is critical for the performance of smaller models and when there are fewer MACs (Figure 11). We also achieve higher energy reduction for larger LSTM-Sharp designs, as we achieve better scalability than E-PUR (Table III). Overall, compared to E-PUR, LSTM-Sharp reduces energy consumption on average by 8%, 24%, 71%, and 87% when using 1K, 4K, 16K, and 64K MACs, respectively.
Regarding the power and area analysis, Table IV shows the numbers related to each configuration of LSTM-Sharp. As we increase the number of resources from 1K to 64K MACs, the area usage gets significantly larger. Note that multiply-add units take the main portion of the area breakdown. Even though memory components occupy more than 74% and 59% of the area in the 1K- and 4K-MAC configurations, the share decreases to around 34% and 14.5% for 16K and 64K MACs. Compared to Google's TPU [28], LSTM-Sharp consumes 1.7x more area and almost the same power considering 64K MACs. However, we use half of the TPU's MAC resources, and the multiply-add operations are performed in 16-bit floating-point rather than 8-bit integer. Therefore, we can significantly decrease both the power dissipation and the area of LSTM-Sharp by changing the operations from floating-point to fixed-point or by using lower-precision numbers through quantization techniques such as K-Means [36].
IX. RELATED WORK
The LSTM network is a powerful RNN model that has attracted a lot of attention across different AI applications [4], [7], [9]. Therefore, a plethora of customized architectures [21], [27], [29], [33], [37], [38] and neural-processing designs [25], [28], [39] have recently been proposed in order to obtain high-performance and energy-efficient execution for such networks. Most of the previous accelerator implementations are FPGA-based [25], [40]-[42], whereas only a few works have explored ASIC designs [21], [28]. Furthermore, these designs are targeted either at mobiles and wearables [21], or at data centers and cloud networks [25], [28]. The former normally restrict themselves to a relatively small resource budget, whereas the latter employ a large amount of parallelism. LSTM-Sharp aims to optimize LSTM inference for a variety of design points, considering different model characteristics.
We have discussed some of the previous works [21], [25], [28] throughout the paper and compared them against LSTM-Sharp's evaluation results in Section VIII. The remaining implementations focus on two main problems of LSTM acceleration: optimizing the inference computation [37]-[39], and reducing memory requirements through compression and pruning techniques [27], [29]. While most of these architectures concentrate on accelerating dense networks, EIE [33] and ESE [8] are two closely related works that optimize LSTM through execution engines for sparse networks. C-LSTM [27] is another alternative that works on compressed LSTM models using FFT-based block-circulant matrices.
In this paper, we mainly target dense and uncompressed LSTM models. Furthermore, we consider that all the synaptic weights fit on-chip for one layer execution, similar to E-PUR [21] and BrainWave [25] . Other designs, such as ESE [8] and C-LSTM [27] , take another approach that tries to pipeline the memory fetches with the LSTM computations in order to hide the latency of accessing off-chip data. However, we observe that such design schemes cannot provide good scalability and performance efficiency with high amount of parallelism, resulting in low utilization. We build LSTM-Sharp in a way to optimize for various purposes from wearables to mobile systems and cloud networks. Also, we ensure that it achieves a scalable performance for a range of models. On the other hand, NPUs are mainly designed for performing a range of neural networks, and therefore are not customized to the needs of each one.
Apart from FPGAs and ASICs, researchers and practitioners are also looking into using GPUs for LSTM [22], [43], [44]. These implementations are mainly designed for maximizing training throughput; when the model size and batch size are small, GPU utilization is limited by insufficient parallelism, which makes it difficult to fully use the large number of GPU cores.
X. CONCLUSIONS
In this paper, we propose LSTM-Sharp, a scalable, high-performance LSTM accelerator, which consists of a novel scheduling scheme, Unfolded, that strictly overlaps the LSTM computation by hiding almost all the dependencies, and a dynamically reconfigurable architecture that improves adaptability. LSTM-Sharp tailors the resource allocation to the requirements of a particular model. It achieves significant performance speedup, energy savings and utilization improvement for all the design points and various LSTM models. Compared to the state-of-the-art ASIC, FPGA, and GPU implementations, we improve the performance by 1.5x, 2.86x, and 82x, respectively, considering 64K MACs. Our scheme can also be applied to FPGAs, e.g. BrainWave, besides being targeted for ASIC design. Finally, we obtain an average utilization of 50% for a peak throughput of 30 TFLOPS, resulting in 0.38 TFLOPS/Watt.
