Recurrent Neural Networks (RNNs) are a key technology for applications such as automatic speech recognition or machine translation. Unlike conventional feed-forward DNNs, RNNs remember past information to improve the accuracy of future predictions and, therefore, they are very effective for sequence processing problems.
INTRODUCTION
Recurrent Neural Networks (RNNs) represent the state-of-the-art solution for many sequence processing problems such as speech recognition [15], machine translation [34] or automatic caption generation [32]. Not surprisingly, data recently published in [20] show that around 30% of machine learning workloads in Google's datacenters are RNNs, whereas Convolutional Neural Networks (CNNs) only represent 5% of the applications. Unlike CNNs, RNNs use information from previously processed inputs to improve the accuracy of the output, and they can process variable-length input/output sequences.
Although RNN training can be performed efficiently on GPUs [7], RNN inference is more challenging. The small batch size (just one input sequence per batch) and the data dependencies in recurrent layers severely constrain the amount of parallelism. Hardware acceleration is key for achieving high-performance and energy-efficient RNN inference and, to this end, several RNN accelerators have been recently proposed [17, 18, 22, 23].
Neurons in an RNN are recurrently executed for processing the elements in an input sequence. An analysis of the output results reveals that many neurons produce very similar outputs for consecutive elements in the input sequence. On average, the relative difference between the current and previous output of a neuron is smaller than 23% in our set of RNNs, and previous work in [28] has reported similar results. Since RNNs are inherently error tolerant [36], we propose to exploit this property to save computations by using a neuron-level fuzzy memoization scheme. With this approach, the outputs of a neuron are dynamically cached in a local memoization buffer. When the next output is predicted to be extremely similar to the previously computed result, the neuron's output is read from the memoization buffer rather than being recalculated, avoiding all the corresponding computations and memory accesses.

Figure 1: Accuracy loss of different RNNs versus the relative output error threshold using an oracle predictor. If the predicted difference between the previous and current output is smaller than the threshold, the memoized output is employed instead of calculating the new one.

Figure 1 shows the potential benefits of this memoization scheme by using an oracle that accurately predicts the relative difference between the next output of the neuron and the previous output stored in the memoization buffer. The memoized value is used when this difference is smaller than a given threshold, shown on the x-axis of Figure 1. As can be seen, the RNNs can tolerate relative errors in the outputs of a neuron in the range of 30-50% with a negligible impact on accuracy. With these thresholds, a memoization scheme with an oracle predictor can save more than 30% of the computations.
A key challenge for our memoization scheme is how to predict the difference between the current output and the previous output stored in the memoization buffer without performing all the corresponding neuron computations. To this end, we propose to extend each recurrent layer with a Bitwise Neural Network (BNN) [21]. We do this by reducing each input and weight to one bit that represents its sign, as described in [11]. We found that BNN outputs are highly correlated with the outputs of the original recurrent layer, i.e. similar BNN outputs indicate a high likelihood of similar RNN outputs (although the BNN outputs themselves are very different from the RNN outputs). The BNN is extremely small, hardware-friendly and very effective at predicting when memoization can be safely applied.
Note that simply looking at the inputs, i.e. predicting that similar inputs will produce similar outputs, might not be accurate. A small change in an input that is multiplied by a large weight will introduce a significant change in the output of the neuron. Our BNN approach takes into account both the inputs and the weights.
In short, we propose a neuron-level hardware-based fuzzy memoization scheme that works as follows. The output of a neuron in the last execution is dynamically cached in a memoization table, together with the output of the corresponding BNN. For every new input in the sequence, the BNN is first computed and the result is compared with the BNN output stored in the memoization table. If the difference between the new BNN output and the cached output is smaller than a threshold, the neuron's cached output is used as the current output, avoiding all the associated computations and memory accesses in the RNN. Otherwise, the neuron is evaluated and the memoization table is updated.
Note that using only the BNN would result in a large accuracy loss, as reported elsewhere [27]. In this paper, we take a completely different approach and use the BNN to predict when memoization can be safely applied with negligible impact on accuracy. The inexpensive BNN is computed for every element of the sequence and every neuron, whereas the large RNN is evaluated on demand as indicated by the BNN. By doing so, we maintain high accuracy while saving more than 24.2% of RNN computations.
In this paper we make the following contributions:
• We provide an evaluation of the outputs of neurons in recurrent layers, and show that they exhibit small changes in consecutive executions.
• We propose a fuzzy memoization scheme that avoids more than 24.2% of neuron evaluations by reusing previously computed results stored in a memoization buffer.
• We propose the use of a BNN to determine when memoization can be applied with small impact on accuracy. We show that BNN and RNN outputs are highly correlated.
• We implement our neuron-level memoization scheme on top of a state-of-the-art RNN accelerator. The required hardware introduces a negligible area overhead, while it provides 1.35x speedup and 18.5% energy savings on average for several RNNs.
BACKGROUND
Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a state-of-the-art machine learning approach that has achieved tremendous success in applications such as machine translation or video description. The key characteristic of RNNs is that they include loops, a.k.a. recurrent connections, that allow information to persist from one time-step of execution to the next and, hence, they have the potential to use unbounded context information (i.e. past or future) to make predictions. Another important feature is that RNNs are recurrently executed for every element of the input sequence and, thus, they are able to handle inputs and outputs of variable length. Because of these characteristics, RNNs provide an effective framework for sequence-to-sequence applications (e.g. machine translation), where they outperform feed-forward Deep Neural Networks (DNNs) [16, 29].
Basic RNN architectures can capture and exploit short term dependencies in the input sequence. However, capturing long term dependencies is challenging since useful information tends to dilute over time. In order to exploit long term dependencies, Long Short Term Memory (LSTM) [19] and Gated Recurrent Units (GRU) [10] networks were proposed. These types of RNNs represent the most successful and widely used RNN architectures. They have achieved tremendous success for a variety of applications such as speech recognition [5, 24], machine translation [9] and video description [32]. The next subsections provide further details on the structure and behavior of these networks.
Deep RNNs.
RNNs are composed of multiple layers that are stacked together to create deep RNNs. Each of these layers consists of an LSTM or a GRU cell. In addition, these layers can be unidirectional or bidirectional. Unidirectional layers only use past information to make predictions, whereas bidirectional LSTM or GRU networks use both past and future context.
The input sequence (X) is composed of N elements, i.e. X = [x_1, x_2, ..., x_N], which are processed by an LSTM or GRU cell in the forward direction, i.e. from x_1 to x_N. For backward layers in bidirectional RNNs, the input sequence is evaluated in the backward direction, i.e. from x_N to x_1.
LSTM Cell.
Figure 2 shows the structure of an LSTM cell. The key component is the cell state (c_t), which is stored in the cell memory. The cell state is updated by using three fully connected single-layer neural networks, a.k.a. gates. The input gate (i_t, whose computations are shown in Equation 1) decides how much of the input information, x_t, will be added to the cell state. The forget gate (f_t, shown in Equation 2) determines how much information will be erased from the cell state (c_{t-1}). The updater gate (g_t, Equation 3) controls the amount of input information that is being considered a candidate to update the cell state (c_t). Once these three gates are executed, the cell state is updated according to Equation 4. Finally, the output gate (o_t, Equation 5) decides the amount of information that will be emitted from the cell to create the output (h_t). Figure 4 shows the computations carried out by an LSTM cell. As can be seen, a neuron in each gate has two types of connections: forward connections that operate on x_t and recurrent connections that take h_{t-1} as input. The evaluation of a neuron in one of these gates requires a dot product between the weights in the forward connections and x_t, and another dot product between the weights in the recurrent connections and h_{t-1}. Next, a peephole connection [13] and a bias are also applied, followed by the computation of an activation function, typically a sigmoid or hyperbolic tangent.
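The gate equations referenced in the LSTM description (Equations 1 to 5) are not reproduced in this text. For orientation only, a standard peephole LSTM formulation that matches the description is sketched below. The weight names (W for forward connections, U for recurrent connections, p for peepholes, b for biases), the placement of the peephole terms, and the final output equation numbered (6) are assumptions taken from the common formulation, not from the original equations; σ denotes the sigmoid and φ the hyperbolic tangent.

```latex
\begin{align}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + p_i \odot c_{t-1} + b_i\right) \tag{1}\\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + p_f \odot c_{t-1} + b_f\right) \tag{2}\\
g_t &= \phi\left(W_g x_t + U_g h_{t-1} + b_g\right) \tag{3}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \tag{4}\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + p_o \odot c_t + b_o\right) \tag{5}\\
h_t &= o_t \odot \phi\left(c_t\right) \tag{6}
\end{align}
```

Each row of W and U in these expressions corresponds to one neuron, which is the unit at which memoization is applied later in the paper.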

GRU Cell.
Analogous to an LSTM cell, a GRU cell includes gates to control the flow of information inside the cell. However, GRU cells do not have an independent memory cell (i.e. cell state). As can be seen in Figure 3, in a GRU cell the update gate (z_t) controls how much of the candidate information (g_t) is used to update the cell activation. On the other hand, the reset gate (r_t) modulates the amount of information that is removed from the previously computed state. Note that GRUs do not include an output gate and, hence, the whole state of the cell is exposed at each timestep. The computations carried out by each gate in a GRU cell are very similar to those in Equations 1, 2 and 3. We omit them for the sake of brevity; the exact details are provided in [10]. For the rest of the paper, we use the term RNN cell to refer to both LSTM and GRU cells.
Binarized Neural Networks
State-of-the-art DNNs typically consist of millions of parameters (a.k.a. weights) represented as floating point numbers using 32 or 16 bits and, hence, their storage requirements are quite large. Linear quantization may be used to reduce memory footprint and improve performance [20, 34]. In addition, real-time evaluation of DNNs incurs a high energy cost. To improve the energy efficiency of DNNs, Binarized Neural Networks (BNNs) [11] or Bitwise Neural Networks [21] have been proposed as a promising alternative to conventional DNNs. BNNs use one-bit weights and inputs that are constrained to +1 or -1. Typically, the binarization is done using the following function:
where x is either a weight or an input and x^b is the binarized value, which is stored as 0 or 1. Regarding the output of a given neuron, its computation is analogous to conventional DNNs, but employing the binarized version of weights and inputs, as shown in Equation 8:
where w^b and x^b_t are the binarized weight and input vectors respectively. Note that evaluating the neuron output (y^b_t) only involves multiplications and additions that, with binarized operands, can be computed with XNORs and integer adders. BNN evaluation is orders of magnitude more efficient, in terms of both performance and energy, than conventional DNNs [11]. Nonetheless, DNNs and RNNs still deliver significantly higher accuracy than BNNs [27].
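The binarized dot product can be illustrated with a minimal Python sketch. It assumes that Equation 7 is the usual sign function (x^b = +1 if x >= 0, else -1) and that Equation 8 is the plain dot product of w^b and x^b_t; neither equation survives in this text, so both forms are assumptions. The helper names (`binarize`, `pack_bits`, `binary_dot`) are hypothetical and chosen for illustration.

```python
def binarize(x: float) -> int:
    """Sign binarization (assumed form of Equation 7): +1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1


def pack_bits(values) -> int:
    """Pack a vector into an integer: bit i is 1 when values[i] binarizes to +1."""
    mask = 0
    for i, v in enumerate(values):
        if binarize(v) == 1:
            mask |= 1 << i
    return mask


def binary_dot(w_bits: int, x_bits: int, n: int) -> int:
    """Dot product of two {-1,+1} vectors of length n using XNOR and a popcount."""
    agree = ~(w_bits ^ x_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
    matches = bin(agree).count("1")              # popcount of agreeing positions
    return 2 * matches - n                       # matches minus mismatches


weights = [0.7, -1.2, 0.05, -0.3]
inputs = [-0.4, -0.9, 1.5, 0.2]

# Reference: multiply and add the +1/-1 values directly.
reference = sum(binarize(w) * binarize(x) for w, x in zip(weights, inputs))
# Same result from the bit-level formulation.
fast = binary_dot(pack_bits(weights), pack_bits(inputs), len(weights))
```

Both paths give the same integer, which is why the multiplications can be replaced by a bitwise XNOR followed by an integer reduction.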
Fuzzy Memoization
Memoization is a well-known optimization technique used to improve performance and energy consumption that has been used both in software [2] and hardware [14]. In some applications, a given function is executed many times, but the inputs of different executions are not always different. Memoization exploits this fact to avoid these redundant computations by reusing the result of a previous evaluation. In general, the first time an input is evaluated, the result is cached in a memoization table. Subsequent evaluations probe the memoization table and reuse previously cached results if the current input matches a previous execution. In a classical memoization scheme, a memoized value is only reused when it is known to be equal to the real output of the computation. However, for some applications such as multimedia [4], graphics [8], and neural networks [36], this scheme can be extended to tolerate a small loss in accuracy with negligible impact on the quality of the results; this extension is normally referred to as fuzzy memoization.
NEURON LEVEL MEMOIZATION
In this section, we propose a novel memoization scheme to reduce computations and memory accesses in RNNs. First, we discuss the main performance and energy bottlenecks on state-of-the-art hardware accelerators for RNN inference. Next, we introduce the key idea for our neuron-level fuzzy memoization technique. Finally, we describe the hardware implementation of our technique.
Motivation
As shown in Figure 4, RNN inference involves the evaluation of multiple single-layer feed-forward neural networks or gates that, from a computational point of view, consist of multiplying a weight matrix by an input vector (x_t for forward connections and h_{t-1} for recurrent connections). Typically, the number of elements in the weight matrices ranges from a few thousand to millions and, thus, fetching them from on-chip buffers or main memory is one of the major sources of energy consumption. Not surprisingly, it accounts for up to 80% of the total energy consumption in state-of-the-art accelerators [30]. For this reason, a very effective way of saving energy in RNNs is to avoid fetching the synaptic weights. In addition, avoiding the corresponding computations also increases the energy savings. In this work, we leverage fuzzy memoization to selectively avoid neuron evaluations and, hence, to avoid their corresponding memory accesses and computations. For fuzzy memoization to be effective, applications must be tolerant to small errors and its hardware implementation must be simple. In the next sections, we show that RNNs are resilient to small errors in the outputs of the neurons, and we provide an efficient implementation of the memoization scheme that requires simple hardware support.
RNNs Redundancy.
Memoization schemes rely on a high degree of redundancy in the computations. For RNNs, a key observation is that the output of a given neuron tends to change slightly between consecutive input elements. Note that RNNs are used in sequence processing problems such as speech recognition or video processing, where RNN inputs in consecutive time steps tend to be extremely similar. Prior work in [28] reports high similarity across consecutive frames of audio or video. Not surprisingly, our own numbers for our set of RNNs also support this claim. Figure 5 shows the relative difference between consecutive outputs of a neuron in our set of RNNs. As can be seen, a neuron's output exhibits small changes (less than 10%) for 25% of consecutive input elements. On average, consecutive outputs change by 23%.

Furthermore, RNNs can tolerate small errors in the neuron output [36]. This observation is supported by the data shown in Figure 1, where the accuracy curve shows the accuracy loss when the output of a neuron is reused using fuzzy memoization, for different thresholds (x-axis) that control the aggressiveness of the memoization scheme. For this study, the relative error (δ) between a predicted neuron output (y^p_t) and a previously cached neuron output (y_m) is used as the discriminating factor to decide whether the previous output is reused, as shown in Figure 6. To evaluate the potential benefits of a memoization scheme, the predicted value is provided by an Oracle predictor which is 100% accurate, i.e. its prediction is always equal to the neuron output (y^p_t = y_t). As shown in Figure 1, neurons can tolerate a relative output error between 0.3 and 0.5 without significantly affecting the overall network accuracy (i.e. accuracy loss smaller than 1%). On the other hand, the reuse curve shows the percentage of neuron computations that could be avoided through this memoization with an Oracle predictor. Note that by allowing neurons to have an output error between 0.3 and 0.5, at least 30% of the total network computations could be avoided.
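The oracle experiment described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' tooling: the relative error is assumed to be normalized by the current output, |y_t - y_m| / |y_t| (Figure 6 is not available in this text), and the function name `oracle_memoize` is hypothetical.

```python
def oracle_memoize(outputs, threshold):
    """Replay a neuron's true outputs, reusing the cached output whenever the
    relative difference is smaller than `threshold`.

    Returns (emitted outputs, fraction of timesteps served from the cache)."""
    emitted, reused = [], 0
    cached = None
    for y in outputs:
        if cached is not None and y != 0:
            delta = abs(y - cached) / abs(y)  # assumed definition of the relative error
            if delta < threshold:
                emitted.append(cached)        # reuse the memoized value
                reused += 1
                continue
        cached = y                            # evaluate the neuron, refresh the cache
        emitted.append(y)
    reuse = reused / len(outputs) if outputs else 0.0
    return emitted, reuse


# A neuron whose output drifts slowly, jumps once, then drifts again.
emitted, reuse = oracle_memoize([1.0, 1.1, 1.2, 2.0, 2.1], threshold=0.3)
```

In this toy sequence three of the five evaluations are served from the cache; sweeping `threshold` over real neuron traces is what would produce a reuse curve of the kind shown in Figure 1.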
To achieve significant savings, the memoization scheme must add a small overhead to the system. The key challenge is how to approximate the behavior of the Oracle predictor with simple hardware, to decide when memoization can be safely applied with small impact on overall RNN accuracy. We describe an effective solution in the next section.
Binary Network Correlation.
A key challenge for an effective fuzzy memoization scheme is to identify when the next neuron output will be similar to a previously computed (and cached) output. Note that having similar inputs does not necessarily result in similar outputs, as inputs with small changes might be multiplied by large weights. Our proposed approach is based on a Bitwise Neural Network (BNN). In particular, each fully-connected neural network (NN) is extended to an equivalent BNN, as described in Section 3.2. We use BNNs for two reasons. First, the outputs of a BNN and its corresponding original NN are highly correlated [6], i.e. a small change in a BNN output indicates that the neuron's output in the original NN is likely to be similar. Second, BNNs can be implemented with extremely low hardware cost.
Regarding the correlation between BNN and RNN, Anderson et al. [6] show that binarization approximately preserves the dot products that a neural network performs. Therefore, there should be a high correlation between the outputs of the full-precision neuron and the outputs of the corresponding binarized neuron. We have empirically validated the dot product preservation property for our set of RNNs. Figure 7 shows the linear correlation between RNN outputs and the corresponding BNN outputs for the EESEN network. Although the ranges of the outputs of the full-precision (RNN) and binarized (BNN) dot products are significantly different, their values exhibit a strong linear correlation (correlation coefficient of 0.96). On the other hand, Figure 8 shows the histogram of the correlation coefficients for the neurons in four different RNNs. As can be seen, the correlation between binarized and full-precision neurons tends to be high for all the RNNs. More specifically, for the networks EESEN, IMDB SENTIMENT, and DEEPSPEECH, 85% of the neurons have a linear correlation factor greater than 0.8, and for the Machine Translation network most of them have a correlation factor greater than 0.5. These results indicate that if the output of a binarized neuron shows very small changes with respect to a previously computed output, it is very likely that the full-precision neuron will also show small changes and, hence, memoization can be safely applied.

As shown in Equation 8, the output of a given neuron in a BNN can be computed with an N-bit XNOR operation for bit multiplication and an integer adder to sum the resulting bits. These two operations are orders of magnitude cheaper than those required by the traditional data representation (i.e. FP16). Therefore, a BNN represents a low-overhead and accurate way to infer when the output of a neuron is likely to exhibit significant changes with respect to its recently computed outputs.
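A per-neuron correlation coefficient of the kind plotted in Figures 7 and 8 could be measured along the following lines. This is a sketch with hypothetical helper names (`pearson`, `neuron_correlation`), and it uses synthetic Gaussian weights and inputs as a stand-in for real activations, so the value it produces says nothing about the networks evaluated in the paper.

```python
import math
import random


def pearson(a, b):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def sign(v):
    """Sign binarization to +1/-1 (zero maps to +1, an assumption)."""
    return 1 if v >= 0 else -1


def neuron_correlation(weights, input_vectors):
    """Correlate a neuron's full-precision dot products with its binarized ones."""
    full = [sum(w * x for w, x in zip(weights, xs)) for xs in input_vectors]
    binary = [sum(sign(w) * sign(x) for w, x in zip(weights, xs))
              for xs in input_vectors]
    return pearson(full, binary)


rng = random.Random(0)  # fixed seed; synthetic data only
weights = [rng.gauss(0, 1) for _ in range(64)]
input_vectors = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(500)]
rho = neuron_correlation(weights, input_vectors)
```

Running the same measurement over every neuron of a trained gate, with inputs taken from real sequences, would yield a histogram comparable to Figure 8.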
Overview
The target of our memoization scheme is to reuse a recently computed neuron output, y_m, as the output for the current timestep, y_t, provided that they are very similar. Reusing the cached neuron output avoids performing all the corresponding computations and memory accesses. To determine whether y_t will be similar to y_m, we use a BNN as a predictor.
In our memoization scheme, we extend the RNN with a much simpler BNN. The BNN model is created by mirroring the full precision trained model of an LSTM or GRU gate, as illustrated in Figure 9. More specifically, each neuron is binarized by applying the binarization function shown in Equation 7 to its corresponding set of weights. Therefore, in a gate, every neuron n with weight vector w is mirrored to a neuron n^b with weight vector w^b, corresponding to the element-wise binarization of w. Our scheme stores recently computed outputs for the binary neuron n^b and its associated full precision neuron n. We refer to these memoized values as y^b_m and y_m respectively. On every timestep t, the binarized version of the neuron, n^b, is evaluated first, obtaining y^b_t. Next, we compute the relative difference, ε^b_t, between y^b_t and y^b_m, i.e. the current and memoized outputs of the BNN, as shown in Equation 12. If ε^b_t is small, i.e. if the BNN outputs are similar, the outputs of the full precision neuron are likely to be similar as well since, as discussed in Section 3.1.2, there is a high correlation between BNN and RNN outputs. In this case, we can reuse the memoized output y_m as the output of neuron n for the current timestep, avoiding all the corresponding computations. If the relative difference ε^b_t is significant, we compute the full precision neuron output, y_t, and update our memoization buffer, as shown in Equations 15, 16 and 17, so that these values can be reused in subsequent timesteps.
We have observed that applying memoization to the same neuron in a large number of successive timesteps may negatively impact accuracy, even though the relative difference ε^b_t in each individual timestep is small. We found that a simple throttling mechanism avoids this problem. More specifically, we accumulate the relative differences over successive timesteps where memoization is applied, as shown in Equation 13. We use the summation of relative differences, δ^b_t, to decide whether the memoized value is reused. As illustrated in Equation 14, the memoized value is only reused when δ^b_t is smaller than or equal to a threshold θ. Otherwise, the full precision neuron is computed. This throttling mechanism avoids long sequences of timesteps where memoization is applied to the same neuron, since δ^b_t includes the differences accumulated in the entire sequence of reuses. Figure 11 shows that the throttling mechanism provides higher computation reuse for the same accuracy loss.

Figure 12 summarizes the overall memoization scheme, which is applied to the gates in an RNN cell as follows. For the first input element (x_0), i.e. the first timestep, the output values y^b_0 (binarized version) and y_0 (in full precision) are computed for each neuron and stored in a memoization buffer, and δ^b_0 is set to zero. In the next timestep, with input x_1, the value y^b_1 is computed first by the BNN. Then, the relative error (ε^b_1) between y^b_1 and the previously cached value, y^b_0, is computed and added to δ^b_0 to obtain δ^b_1. Then, δ^b_1 is compared with the threshold θ. If δ^b_1 is smaller than or equal to θ, the cached value y_0 is reused, i.e. y_1 is assumed to be equal to y_0, and δ^b_1 is stored in the memoization buffer. On the contrary, if δ^b_1 is larger than θ, the full precision neuron output y_1 is computed and its value is cached in the memoization buffer. In addition, y^b_1 is also cached and δ^b_1 is set to zero. This process is repeated for the remaining timesteps and for all the neurons in each gate.
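The per-neuron procedure above can be sketched in Python as follows. Several details are assumptions made for illustration: the relative difference is taken as |y^b_t - y^b_m| / |y^b_t| (Equation 12 is not reproduced in this text), a zero BNN output that differs from the cached one is treated as "not similar" to avoid a division by zero, and the neuron output is the raw dot product with bias, peephole and activation omitted. The class name `MemoizedNeuron` is hypothetical.

```python
class MemoizedNeuron:
    """One neuron with a BNN predictor and an accumulated-error throttle."""

    def __init__(self, weights, theta):
        self.weights = weights
        self.signs = [1 if w >= 0 else -1 for w in weights]  # mirrored binary neuron
        self.theta = theta
        self.y_m = None      # memoized full precision output
        self.yb_m = None     # memoized binary output
        self.delta = 0.0     # accumulated relative difference
        self.full_evals = 0  # number of full precision evaluations performed

    def _full(self, x):
        self.full_evals += 1
        return sum(w * v for w, v in zip(self.weights, x))

    def _binary(self, x):
        return sum(s * (1 if v >= 0 else -1) for s, v in zip(self.signs, x))

    def step(self, x):
        yb = self._binary(x)  # the cheap BNN is always evaluated
        if self.y_m is not None:
            diff = abs(yb - self.yb_m)
            if diff == 0:
                eps = 0.0
            elif yb != 0:
                eps = diff / abs(yb)   # assumed normalization
            else:
                eps = float("inf")     # assumed handling of a zero output
            delta = self.delta + eps
            if delta <= self.theta:
                self.delta = delta     # keep accumulating across reuses
                return self.y_m        # reuse: no full precision work
        # First timestep, or predicted change too large: evaluate and refresh.
        self.y_m, self.yb_m, self.delta = self._full(x), yb, 0.0
        return self.y_m


neuron = MemoizedNeuron([1.0] * 8, theta=0.5)
same = [1.0] * 8
flipped = [1.0] * 7 + [-1.0]
outputs = [neuron.step(x) for x in (same, flipped, flipped)]
```

In this run the second timestep reuses the cached output (accumulated difference 1/3), while the third triggers a full evaluation because the accumulated difference reaches 2/3 and exceeds θ, which is the throttling behaviour described above.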
Finding the threshold value.
One of the key parameters in our scheme is the threshold θ. We explore different values of θ for each RNN model by using the training set, obtaining the accuracy and degree of computation reuse for each threshold value and RNN. We then select the value that achieves the highest computation reuse within the target accuracy loss (i.e. less than 1%) for each RNN model. This process is done just once for each model; once θ is determined, it can be used for inference on the test dataset.
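The threshold exploration amounts to a simple constrained search, sketched below. The callback `evaluate(theta)` is hypothetical: it stands for running the model with memoization on the training set and returning the accuracy loss and the computation reuse, both in percent. The numbers in the example table are made up for illustration and are not results from the paper.

```python
def select_threshold(candidates, evaluate, max_accuracy_loss=1.0):
    """Return the candidate threshold with the highest reuse whose accuracy
    loss stays below `max_accuracy_loss`, or None if no candidate qualifies.

    `evaluate(theta)` must return (accuracy_loss, reuse_percentage)."""
    best_theta, best_reuse = None, -1.0
    for theta in candidates:
        loss, reuse = evaluate(theta)
        if loss < max_accuracy_loss and reuse > best_reuse:
            best_theta, best_reuse = theta, reuse
    return best_theta


# Illustrative (made-up) measurements: theta -> (accuracy loss %, reuse %).
measurements = {
    0.1: (0.1, 5.0),
    0.3: (0.6, 20.0),
    0.5: (0.9, 31.0),
    0.7: (2.5, 45.0),
}
theta = select_threshold(sorted(measurements), measurements.__getitem__)
```

Because the search is performed once per model and offline, an exhaustive sweep over a small candidate list is sufficient.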
Hardware Implementation
We implement the proposed memoization scheme on top of E-PUR, a state-of-the-art RNN accelerator for low power mobile applications [30]. A high-level block diagram of this accelerator is shown in Figure 13. E-PUR is composed of four computational units that are tailored to the evaluation of each gate in an RNN cell, and a dedicated on-chip memory used to store intermediate results. In the next subsections, we outline the main components of the E-PUR architecture and detail the hardware modifications required to support our fuzzy memoization scheme.

Figure 13: Overview of the E-PUR architecture, which consists of 4 Computation Units (CU) and an on-chip memory (OM).

Hardware Baseline.
In E-PUR, each of the Computation Units (CUs), shown in Figure 14, is composed of a dot product unit (DPU), a Multi-functional Unit (MU) and buffers to store the weights and inputs. The DPU is used to evaluate the matrix-vector multiplications between the weights and inputs (i.e. x_t and h_{t-1}), whereas the MU is used to compute activation functions and scalar operations. Note that in E-PUR computations can be performed using 32-bit or 16-bit floating point operations.
In E-PUR, while evaluating an RNN cell, all the gates are computed in parallel for each input element. On the contrary, the neurons in each gate are evaluated sequentially for the forward and recurrent connections. The following steps are executed in order to compute the output value (y_t) for a given neuron (i.e. n_i). First, the input and weight vectors formed by the recurrent and forward connections (i.e. x_t and h_{t-1}) are split into K sub-vectors of size N. Then, two sub-vectors of size N are loaded from the input and weight buffers respectively, and the dot product between them is computed by the DPU, which also accumulates the result. Next, these steps are repeated for the next sub-vector and its result is added to the previously accumulated dot products. This process is repeated until all K sub-vectors are evaluated and added together. Once the output value y_t is computed, the DPU sends it to the MU, where the bias and peephole calculations are performed. Finally, the MU computes the activation function and stores the result in the on-chip memory for intermediate results. Note that once the DPU sends a value to the MU, it continues with the evaluation of the next neuron output, hence overlapping the computations executed by the MU and the DPU, since they are independent. These steps are repeated until all the neurons in the gate (for all cells) are evaluated for the current input element.
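The sub-vector accumulation performed by the DPU, followed by the MU step, can be modelled functionally as below. This is only a behavioural sketch: the function names are hypothetical, the peephole term is omitted, a sigmoid is assumed as the activation, and nothing here reflects the pipelining or timing of the real hardware.

```python
import math


def tiled_dot(weights, inputs, tile):
    """Accumulate a dot product one sub-vector of `tile` elements at a time."""
    assert len(weights) == len(inputs)
    acc = 0.0
    for start in range(0, len(weights), tile):
        w_sub = weights[start:start + tile]  # loaded from the weight buffer
        x_sub = inputs[start:start + tile]   # loaded from the input buffer
        acc += sum(w * x for w, x in zip(w_sub, x_sub))  # one DPU pass
    return acc


def evaluate_neuron(weights, x_t, h_prev, bias, tile):
    """DPU dot product over the concatenated inputs, then the MU adds the
    bias and applies a sigmoid (peephole omitted in this sketch)."""
    pre_activation = tiled_dot(weights, x_t + h_prev, tile) + bias
    return 1.0 / (1.0 + math.exp(-pre_activation))


total = tiled_dot([1.0, 2.0, 3.0, 4.0, 5.0], [1.0] * 5, tile=2)
y = evaluate_neuron([0.0, 0.0], [1.0], [1.0], bias=0.0, tile=1)
```

The result does not depend on the tile size, which only determines how many elements are fetched and multiplied per pass.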
Support for Fuzzy Memoization.
In order to perform fuzzy memoization through a BNN, two modifications are made to each CU in E-PUR. First, the weight buffer is split into two buffers: one buffer is used to store the weight signs (sign buffer) and the other is used to store the remaining bits of the weights. Note that the sign buffer is always accessed to compute the output of the binary network (y^b_t), whereas the remaining bits are only accessed if the memoized value (y_m) is not reused. The binarized weights are stored in a small memory which has a low energy cost but, as a consequence of splitting the weight buffer, its area increases slightly (by less than 1%).
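The idea of splitting a weight into a sign bit and its remaining bits can be illustrated for 16-bit floating point values with Python's `struct` module, whose `"e"` format encodes IEEE 754 half precision. The function names are hypothetical, and the mapping between the stored sign bit and the binarized value is an assumption: in IEEE 754 a sign bit of 0 denotes a non-negative number, which would correspond to the binarized value +1.

```python
import struct


def split_fp16(weight: float):
    """Return (sign_bit, remaining_15_bits) of `weight` in half precision."""
    (bits,) = struct.unpack("<H", struct.pack("<e", weight))
    return bits >> 15, bits & 0x7FFF


def join_fp16(sign_bit: int, rest: int) -> float:
    """Rebuild the half precision value from the two buffers."""
    (value,) = struct.unpack("<e", struct.pack("<H", (sign_bit << 15) | rest))
    return value


sign_bit, rest = split_fp16(-1.25)    # the sign goes to the sign buffer
restored = join_fp16(sign_bit, rest)  # both parts are needed only on a full evaluation
```

Only `sign_bit` is read when evaluating the binary network; `rest` is fetched when the memoized value cannot be reused and the full precision weight is required.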
The second modification to the CUs is the addition of the fuzzy memoization unit (FMU), which is used to evaluate the binary network and to perform fuzzy memoization. This unit takes as input two vectors of size T (i.e. the number of neurons in an RNN cell). The first is a weight vector loaded from the sign buffer, whereas the other is created as the concatenation of the forward (x_t) and recurrent (h_{t-1}) connections.
As shown in Figure 15, the main components of the FMU are the BDPU, which computes the binary dot product, and the comparison unit (CMP), which decides when to reuse a memoized value. In addition, the FMU includes a buffer (memoization buffer) which stores δ^b_t for every neuron and the latest evaluation of the neurons by the full precision and binary networks. BNN neurons (i.e. binary dot products) are evaluated using a bitwise XNOR operation and an adder reduction tree to gather the resulting bit vector. In the CMP unit, the accumulated relative error (δ^b_t) is computed using integer and fixed-point arithmetic.
The steps to evaluate the RNN cell, described in Section 3.3.1, are executed in a slightly different manner to include the fuzzy memoization scheme. First, the binarized input and weight vectors for a given neuron in a gate are loaded into an FMU from the input and sign buffers respectively. Next, the BDPU computes the dot product and sends the result (y^b_t) to the comparison unit (CMP). Then, the CMP loads the previously cached values y^b_m and δ^b_{t-1} from the memoization buffer and uses them to compute the relative error (ε^b_t) and δ^b_t. Once δ^b_t is computed, it is compared with the threshold (θ) to determine whether the full precision neuron needs to be evaluated or the previously cached value is reused instead. If δ^b_t is greater than θ, an evaluation in full precision is triggered: the DPU is signaled to start the full precision evaluation, which follows the steps described in Section 3.3.1. After the full precision evaluation, the values y_t, y^b_t, and 0.0 are cached in the memoization table as y_m, y^b_m, and δ^b_t respectively. On the other hand, if memoization can be applied (i.e. δ^b_t does not exceed θ), δ^b_t is updated in the memoization table and the memoized value (y_m) is sent directly to the MU (bypassing the DPU), so the full precision evaluation of the neuron is avoided. Finally, these steps are repeated until all the neurons in a gate are evaluated for the current input element. Since LSTM or GRU gates are processed by independent CUs, the above process is executed concurrently by all gates.
EVALUATION METHODOLOGY
We use a cycle-level simulator of E-PUR customized to model our scheme as described in Section 3.3.2. This simulator estimates the total energy consumption (static and dynamic) and execution time of the LSTM networks. The different pipeline components were implemented in Verilog and synthesized using the Synopsys Design Compiler to obtain their delay and energy consumption. Furthermore, we used a typical process corner with a voltage of 0.78 V. We employed CACTI [26] to estimate the delay and energy consumption (static and dynamic) of on-chip memories. Finally, to estimate the timing and energy consumption of main memory we used MICRON's memory model [25]. We model 4 GB of LPDDR4 DRAM.
In order to set the clock frequency, the delays reported by Synopsys Design Compiler and CACTI are used. We set a clock frequency that allows most hardware structures to operate in one clock cycle. In addition, we evaluated alternative frequency values in order to minimize energy consumption.
Regarding the memoization unit, the configuration parameters are shown in Table 2 . Since E-PUR supports large LSTM networks, the memoization unit is designed to match the largest models supported by E-PUR. This unit has a latency of 5 clock cycles for the largest supported LSTM networks. In this unit, integer and fixed point operations are used to perform most computations. The memoization buffer is modeled as 8KiB scratch-pad eDRAM.
The remaining configuration parameters of the accelerator used in our experiments are shown in Table 2 . We strive to select an energy-efficient configuration for all the neural networks in Table 1 . Because the baseline accelerator is designed to accommodate large LSTM networks, some of its on-chip storage and functional units might be oversized for some of our RNNs. In this case, unused on-chip memories and functional units are power gated when not needed.
As for benchmarks, we use four modern LSTM networks, which are described in Table 1. Our selection includes RNNs for popular applications such as speech recognition, machine translation and image description. These networks have different numbers of internal layers and neurons. We include both bidirectional (EESEN) and unidirectional networks (the other three). The length of the input sequence also differs for each RNN, ranging from 20 to a few thousand input elements.
The software implementation of the networks was done in TensorFlow [1]. We used the network models and the test sets provided in [9, 12, 24, 33] for each RNN. The original accuracy for each RNN is listed in Table 1, and the accuracy loss is later reported as the absolute loss with respect to this baseline accuracy.
EXPERIMENTAL RESULTS
This section presents the evaluation of the proposed fuzzy memoization technique for RNNs, implemented on top of E-PUR [30]. We refer to it as E-PUR+BM. First, we present the percentage of computation reuse and the accuracy achieved. Second, we show the performance and energy improvements, followed by an analysis of the area overheads of our technique.
Figure 16 shows the percentage of computation reuse achieved by the BNN and the Oracle predictors. The percentage of computation reuse indicates the percentage of neuron evaluations avoided due to fuzzy memoization. For accuracy losses smaller than 2%, the BNN obtains a percentage of computation reuse extremely similar to the Oracle. The networks EESEN and IMDB are highly tolerant to errors in neuron outputs; thus, for these networks, our memoization scheme achieves reuse percentages of up to 40% while having an accuracy loss smaller than 3%. Note that, for classification problems, BNNs achieve an accuracy close to the state-of-the-art [27] and, hence, it is not surprising that the BNN predictor is highly accurate for approximating the neuron output.
For DeepSpeech (speech recognition) the reuse percentage is up to 20% for accuracy losses smaller than 2%. In this network, the input sequence tends to be long (i.e., 900 elements on average). As the reuse is increased, the error introduced to the output sequence of a neuron persists for a larger number of elements. Therefore, the introduced error has a bigger impact both on the evaluation of the current layer, due to the recurrent connections, and on the following layers. As a result, the overall accuracy of the network decreases faster.
For MNMT (machine translation) the BNN predictor and the Oracle achieve a similar reuse versus accuracy trade-off for up to 23% of computation reuse. Note that, for this network, the linear correlation between the BNN and the full precision neuron output is typically lower than for the other networks in the benchmark set.
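As an illustration of the reuse metric, the following sketch counts the evaluations that an oracle predictor would avoid for a single neuron, given the sequence of its full precision outputs. The function name `oracle_reuse_percentage` is hypothetical. The sketch assumes that the relative difference is computed with respect to the current output and that the first element of the sequence is always computed. It does not model the effect of reused values on later outputs through the recurrent connections.

```python
from typing import Sequence


def oracle_reuse_percentage(outputs: Sequence[float], threshold: float) -> float:
    """Percentage of evaluations an oracle would skip for one neuron."""
    if not outputs:
        return 0.0

    memoized = outputs[0]  # the first output is always computed
    reused = 0
    for y in outputs[1:]:
        # ASSUMPTION: relative difference with respect to the current output.
        denom = abs(y) if y != 0 else 1.0  # zero guard, also an assumption
        if abs(y - memoized) / denom < threshold:
            reused += 1      # the memoized value is used instead of y
        else:
            memoized = y     # the neuron is evaluated and the cache refreshed
    return 100.0 * reused / len(outputs)
```

For example, with outputs `[1.0, 1.05, 1.1, 2.0, 2.1]` and a threshold of 0.3, only the first and fourth elements are evaluated, so three of the five evaluations are avoided.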
Figure 17 shows the energy savings and computation reuse achieved by our scheme for different thresholds of accuracy loss. For a conservative loss of 2%, the average energy saving is 25.5%, whereas the reuse percentage is 31%. In this case, the networks DeepSpeech and MNMT have similar energy savings, whereas the networks IMDB and EESEN exhibit the largest savings since they are more tolerant to errors in the neuron output. For an extremely conservative 1% of accuracy loss, the computation reuse and energy saving are 24.2% and 18.5% on average, respectively. EESEN and DeepSpeech achieve 25.32% and 12.23% energy savings, respectively, for a 1% accuracy loss. Regarding the machine translation network (MNMT), the energy savings for 1% and 2% accuracy loss are 15.17% and 23.46%, respectively.
Figure 17: Energy savings and computation reuse of E-PUR+BM with respect to the baseline.
Regarding the sources of energy savings, Figure 18 reports the energy breakdown, including static and dynamic energy, for the baseline accelerator and E-PUR+BM, for an accuracy loss of 1%. The sources of energy consumption are grouped into on-chip memories ("scratch-pad" memories), pipeline components ("operations", i.e., multipliers), main memory (LPDDR4) and the energy consumed by our FMU component. Note that most of the energy consumption is due to the scratch-pad memories and the pipeline components and, as can be seen, both are reduced when using our memoization scheme. In E-PUR+BM, each time a value from the memoization buffer is reused, we avoid accessing all the neuron's weights and the input buffers, achieving significant energy savings. In addition, since the extra buffers used by E-PUR+BM are fairly small (i.e., 8 KB), the energy overhead due to the memoization scheme is negligible. The energy consumption due to the operations is also reduced, as the memoization scheme avoids neuron computations. Furthermore, the leakage of the scratch-pad memories and operations is also reduced due to the speedups achieved by the memoization scheme. Finally, the energy consumption due to accessing main memory is not affected by our technique, since both E-PUR and E-PUR+BM must access main memory to load all the weights once for each input sequence.
Figure 19 shows the performance improvements for the different RNNs. On average, a speedup of 1.35x is obtained for a 1% accuracy loss, whereas accuracy losses of 2% and 3% achieve improvements of 1.5x and 1.67x, respectively. The performance improvement comes from avoiding the dot product computations for the memoized neurons. Therefore, the larger the degree of computation reuse, the bigger the performance improvement. Note that the memoization scheme introduces an overhead of 5 cycles per neuron (see Table 2), mainly due to the evaluation of the binarized neuron. When the full precision neuron evaluation can be avoided, our scheme saves between 16 and 80 cycles depending on the RNN. Therefore, configurations with a low degree of computation reuse, like DeepSpeech at 1% accuracy loss, exhibit smaller speedups due to the overhead of the memoization scheme. On the other hand, RNNs that exhibit higher computation reuse, such as EESEN at 2% accuracy loss, achieve a speedup of 1.55x.
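The interaction between the fixed overhead and the degree of reuse can be illustrated with a simple first-order model. This is only a sketch under strong assumptions, not the cycle-level simulator, and it is not expected to reproduce the reported speedups. It assumes that every neuron pays the full memoization overhead (no overlap with other work), that the cycles saved on a reuse equal the cost of the full precision evaluation, and that all neurons have the same cost. The function names are hypothetical.

```python
def modeled_speedup(reuse: float, full_cycles: float,
                    overhead_cycles: float = 5.0) -> float:
    """First-order per-neuron speedup for a given fraction of reused neurons."""
    if not 0.0 <= reuse <= 1.0:
        raise ValueError("reuse must be a fraction between 0 and 1")
    # Baseline: every neuron is evaluated in full precision.
    baseline = full_cycles
    # ASSUMPTION: with memoization, every neuron pays the overhead and only
    # the non-reused fraction pays the full precision evaluation.
    with_memo = overhead_cycles + (1.0 - reuse) * full_cycles
    return baseline / with_memo


def break_even_reuse(full_cycles: float, overhead_cycles: float = 5.0) -> float:
    """Reuse fraction at which the modeled speedup equals 1.0."""
    return overhead_cycles / full_cycles


if __name__ == "__main__":
    for cycles in (16, 80):  # range of cycles saved per reused neuron
        print(cycles, break_even_reuse(cycles), modeled_speedup(0.3, cycles))
```

Under these assumptions, the reuse needed to amortize the overhead is inversely proportional to the cost of a full precision evaluation, which is consistent with the observation that configurations with low reuse benefit less.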
E-PUR has an area of 64.6 mm², whereas E-PUR+BM requires 66.8 mm² (4% area overhead). The largest overhead contribution (3%) is in the extra scratch-pad memory required by the memoization unit.
RELATED WORK
Increasing the energy efficiency and performance of LSTM networks has attracted the attention of the architecture community in recent years [17, 18, 22, 23]. Most of these works employ pruning and compression techniques to improve performance and reduce energy consumption. Furthermore, linear quantization is employed to decrease the memory footprint. In contrast, our technique improves energy efficiency by relying solely on computation reuse at the neuron level. To the best of our knowledge, this is the first work using a BNN as a predictor for a fuzzy memoization scheme. BNNs have been used previously [11, 21, 27] as standalone networks, whereas we employ BNNs in conjunction with the LSTM network to evaluate neurons on demand.
Fuzzy memoization has been extensively researched in the past and has been implemented both in hardware and software. Hardware schemes to reuse instructions have been proposed in [3, 8, 14, 31] . Alvarez et al. [4] presented a fuzzy memoization scheme to improve performance of floating point operations in multimedia applications. In their scheme floating point operations are memoized using a hash of the source operands, whereas in our technique, a whole function (neuron inference) is memoized based on the values predicted by a BNN.
Finally, software schemes to memoize entire functions have been presented in the past [2, 35]. These schemes are tailored to general purpose programs, whereas our scheme is solely focused on LSTM networks, since it exploits their intrinsic error tolerance.
CONCLUSIONS
In this paper, we have shown that 25% of neurons in an LSTM network change their output value by less than 10%. This motivated us to propose a fuzzy memoization scheme to save energy and time. A major challenge in performing neuron-level fuzzy memoization is to predict, in a simple and accurate manner, whether the output of a given neuron will be similar to a previously computed and cached value. To this end, we propose to use a Binarized Neural Network (BNN) as a predictor, based on the observation that the full precision output of a neuron is highly correlated with the output of the corresponding BNN. We show that a BNN predictor achieves 24.2% computation reuse on average, which is very similar to the results obtained with an Oracle predictor. We have implemented our technique on top of E-PUR, a state-of-the-art accelerator for LSTM networks. Results show that our memoization scheme achieves significant time and energy savings with minimal impact on the accuracy of the RNNs. When compared with the E-PUR accelerator, our scheme achieves 18.5% energy savings on average, while providing a 1.35x speedup at the expense of a minor accuracy loss.
