Multidimensional Long Short-Term Memory (MD-LSTM) neural network is an extension of one-dimensional LSTM for data with more than one dimension that allows MD-LSTM to show state-ofthe-art results in various applications including handwritten text recognition, medical imaging, and many more. However, efficient implementation suffers from very sequential execution that tremendously slows down both training and inference compared to other neural networks. This is the primary reason that prevents intensive research involving MD-LSTM in the recent years, despite large progress in microelectronics and architectures.
INTRODUCTION
One-dimensional Long Short-Term Memory (1D-LSTM) neural networks have shown superior performance in many applications including speech recognition [6] and image caption generation [31] . However, they are less suitable for handling multidimensional data because 1D-LSTM has recurrence only along a single dimension, time-axes in speech recognition or writing direction in Optical Character Recognition (OCR).
Multidimensional Long Short-Term Memory (MD-LSTM) neural network was firstly proposed in [10] to extend the applicability of unidimensional counterpart for data with more than one spatiotemporal dimension like in image and video processing or volumetric medical imaging. Having recurrent connections along multiple axes, typically the x-axis and y-axis in image processing, allows MD-LSTM to capture long-term dependencies along all dimensions of the data that makes MD-LSTM very robust to local distortions along any combination of input dimensions.
MD-LSTM provides state-of-the-art performance on most of benchmarks for offline Handwritten Text Recognition [32] , [19] . They show advanced results in semantic segmentation [5] and medical imaging [29] , [7] . However, MD-LSTM networks are very computationally expensive in comparison to CNNs, for example. Interestingly, the expense does not come from the complexity of a single step, but from the long sequential execution. It has been shown that a size of a model based on MD-LSTM can be 3-11 times smaller than a CNN model with similar accuracy [24] , [19] . However, the best parallel implementation of a single layer of MD-LSTM has a computational complexity that is typically by one or two orders of magnitude higher than of a naive implementation of a convolutional layer [24] .
Multidimensional recurrences significantly reduce potential for parallelization that makes execution of MD-LSTM very sequential that significantly slows down inference and training. Additionally, training is complicated due to high memory requirements to perform back-propagation through recurrent connections. No substantial progress in efficient architectures and implementations compared to other networks e.g. like CNNs prevents intensive research involving MD-LSTM in the recent years. Probably, MD-LSTM is one of the most controversial neural networks. While having high theoretical potential for learning, [10] , [24] , [19] , [33] acknowledged that MD-LSTMs are in principle more powerful and more robust to distortions than CNNs, they suffer from lack of investigation due to slow training. Recently there were a lot of discussions if MD-LSTM layers are needed at all. It was shown that in some scenarios, MD-LSTM layers can be replaced with CNN layers that can provide similar or higher accuracy at lower computational cost [24] . However, the most recent research on handwritten text recognition [19] proofs that MD-LSTM interleaved with CNN layers and 1D-LSTM on the top provides a superior performance than pure CNN based topology or CNNs combined with 1D-LSTM in the case of more complex and challenging real-world datasets. That leaves no doubts in justification of MD-LSTM related research.
There is a number of works that proposed different techniques how to accelerate inference or training of MD-LSTM by using new parallelization approaches or modifying memory cells (see Section 2.2). Nevertheless, the sequential execution is still a bottleneck that significantly constrains the efficiency of using a massive parallelism of modern GPUs. We claim and show that high flexibility of FPGAs that allows for custom pipelined datapath, and natural support of low bit-width operations provides an efficient solution to complex recurrent dataflow of MD-LSTM.
In this paper, we elaborate on two vanilla MD-LSTM [10] based typologies, where one was trained to perform binarization of historical documents from DIBCO 2017 competition [23] , and second to perform handwritten digits recognition task on MNIST dataset [16] . The former task was selected because it is a challenging problem that shows that the selected topology has a representative complexity and deals with a real world problem. Additionally, MNIST was chosen as a benchmark to have a fair comparison with existing FPGA-based implementations with respect to throughput and energy efficiency.
In this paper, by presenting a novel hardware architecture, we propose an orthogonal approach that potentially can be combined with earlier presented optimizations (see Section 2.2). We present an efficient implementation for inference, and open a door for efficient training that can boost application of MD-LSTM. The novel contributions of this paper are:
• A novel hardware architecture of MD-LSTM neural network is presented. • A systematic exploration of precision vs. accuracy trade-off for MD-LSTM using challenging DIBCO 2017 and MNIST datasets is conducted. We investigate the effects of quantization of weights, input, output, recurrent, and in-memory cell activations on accuracy during training. In particular, we explore binarized weights and activation functions. The accelerators outperform NVIDIA Tesla K80 GPU implementation in terms of runtime and energy efficiency. At the same time, our design demonstrates higher classification accuracy and a promising throughput with respect to stateof-the-art FPGA-based implementations for MNIST dataset. The paper is structured as follows: in Section 2 we review existing research on MD-LSTM. The MD-LSTM algorithm is overviewed in Section 3. We describe the training setup and quantization techniques in Section 4 and Section 5, respectively. Section 6 provides the details of the novel hardware architecture. Finally, Section 7 presents the results and Section 8 concludes the paper.
RELATED WORK 2.1 Application
Afzal et al. [3] has proposed treating the image binarization as a sequence learning problem and modeled the image as a twodimensional sequence. The MD-LSTM is used to learn spatial dependencies in the two-dimensional image, the same as 1D-LSTM is used in temporal domain to learn temporal dependencies. In general the task was formulated as a semantic segmentation problem, where each pixel was classified either as text or non-text. The approach was evaluated on DIBCO 2013 dataset achieving comparable results with the state-of-the-art. Similar idea was applied by one of the participants at DIBCO 2017 [23] competition using Grid LSTM cells [14] . We use similar topology to Afzal et al. but with more memory cells to cope with increased complexity of DIBCO 2017 dataset. Graves et al. [10] could achieve accuracy comparable to state-of-the-art applying MD-LSTM on MNIST dataset without using any pre-processing or augmentation. They formulated the task as a semantic segmentation problem, where each pixel was classified according to the digit it belonged to. Our approach is different as we do not make prediction for each pixel but for the complete image based on all outputs from hidden layers taken together. We selected an alternative topology in order to explore more on architectural level, otherwise the architecture would be identical to the semantic segmentation one.
Optimization
There is a number of papers proposing to accelerate inference and training of MD-LSTM by new parallelization approaches or improving stability of the cells for faster convergence. Leifert et al. [17] has shown that MD-LSTM has a severe conceptual problem related to computational instability of a value of the memory state that grows drastically over time and can even become infinite. They proposed an improved version of a cell called Leaky − LP that overcomes the instability, while preserves the state over long period of time, however at the expense of more complex computational graph. In our research we use vanilla MD-LSTM cells, so to be able to evaluate all optimizations with respect to the original model.
MD-LSTM has constraints on computing order of the steps (pixels) due to recurrent dependencies between the pixels, see Fig. 1 (a). Each pixel is considered as one time step, position here. Each position requires an input from two neighboring ancestor positions, one located on the previous column, another on the previous row. The straightforward implementation follows a pixel-by-pixel processing order, one pixel is processed after another as it is shown in Fig. 1 (b) . Voigtlaender et al. [32] and Oord et al. [20] proposed conceptually similar parallelization approaches. They have noticed that all pixels on the same diagonal can be computed in parallel because they are computationally independent from each other, depicted in Fig. 1 (c) . The diagonal-wise approach reduces the time complexity to process an image from O(W · H ) to O(W + H ) because there are (W + H − 1) diagonals in an image with H rows and W columns. Our architecture is based on the pixel-by-pixel order of processing. In the case of GPU, we have implemented both approaches. Additionally, Wenniger et al. [33] proposed an efficient approach of packing samples in batches with low computational and memory overhead. The approach could achieve 6.6 times speedup on word-based handwriting recognition by avoiding wasteful computation on padding. In the case of MNIST dataset, the padding is not necessary because all images are of the same size. For DIBCO dataset, the padding is negligible because we process big size images (up to 4M pixels) in small chunks (64 by 64 pixels). Our approach is orthogonal to all previously proposed, as we present a hardware acceleration on alternative computing platform, namely FPGA. Our approach potentially can be combined with previously presented ones.
Architecture
To the best of our knowledge, we are proposing the first hardware architecture of MD-LSTM. However, it is not the first architecture of Recurrent Neural Networks (RNNs). In [27] Rybalkin et al. gives a good overview of FPGA-based implementations. The most remarkable architectures providing the highest throughput are [27] and [11] . Where the later one operates on partially stored on-chip sparse model. However, in this paper we investigate influence of low precision on accuracy rather than sparsity, and keep our model dense. Furthermore, as it was mentioned in Section 1, the MD-LSTM models are much more compact in comparison to their counterparts that makes them very appealing for on-chip implementation. In the following, we primary compare new architecture with [27] as it operates on a dense model stored completely on-chip. The new architecture has additional datapaths for recurrent dependencies along extra dimensions. We exploit a concept of interleaving data corresponding to different processing directions proposed in [28] , and extend it to four directions (see Section 3). Interleaving in the case of RNNs enables to keep pipelined datapath always busy (see Section 6) . We show that in the case of MNIST datatset, it is not necessary to align corresponding outputs from four directions that significantly reduces the required on-chip memory for intermediate computations. However, in the case of semantic segmentation problem, like in binarization, the alignment is necessary. We propose an architecture of an output layer operating on multidimensional buffer.
LSTM THEORY
MD-LSTM is a generalization of a recurrent neural network, which can deal with higher-dimensional data. Here we restrict ourselves to the two-dimensional case, which is commonly used for handwriting and similar tasks. A 2D Long Short-Term Memory (2D-LSTM) scans the input image along horizontal and vertical axes and produces an output image of the same size following equations (1) for each coordinate (i, j) in the input image x.
It is common to use parallel 2D-LSTM layers processing an input image in four directions (top-left origin, top-right origin, bottomleft origin and bottom-right origin) as it is depicted in Fig. 1 for top-left origin. The outputs from four directions are later combined (concatenated) so that a spatio-temporal context from all four directions is available at each position (pixel).
For the sake of clarity, we review the basic 2D-LSTM algorithm. Eq. (1) summarizes formulas for the network forward pass.
The 2D-LSTM architecture is composed of a number of recurrently connected memory cells. Each memory cell is composed of four multiplicative gating connections, namely input, forget for the y-coordinate and x-coordinate, and output gates (k, f , д, and o); the function of each gate can be interpreted as write, reset, and read operations, with respect to the cell internal state (c). The gate units in a memory cell facilitate the preservation and access of the cell internal state over long periods of time. There are recurrent connections from the cell output (y) to the cell input (a) and the four gates. Rectangular input weight matrices are shown by W , and square weight matrices for the y-coordinate by U and for xcoordinate by V . And b refers to bias vectors for each of the gates. Activation functions are point-wise non-linear functions, that is logistic sigmoid ( 1 1+e −x ) for the gates (σ ) and hyperbolic tangent for input to and output from the cell. Point-wise vector multiplication is shown by ⊙.
TRAINING
We conducted training on several benchmarking datasets, namely MNIST and DIBCO, experimenting with a different number of MD-LSTM cells, precision and different hyper parameters. The quantized versions of the networks were trained de novo without full-precision pre-training. We trained the networks with quantized weights, biases, input, output, recurrent activatuions, and full-precision in-cell activations. A quantized LUT-based version Session: Deep Learning I FPGA '20, February 23-25, 2020, Seaside, CA, USA of the in-cell activations was used during the test. We found that gradient clipping can be necessary to achieve stable learning that is reasonable in the case of vanilla cells in particular [33] . We applied dropout between the 2D-LSTM layers and output layer for better generalization.
MNIST
We used a well-known MNIST dataset [16] of isolated handwritten digits to have a common benchmark to compare our hardware implementation to existing FPGA-based designs with respect to throughput and energy efficiency. The database consists of grayscale images, each of which is 28 pixels high and 28 pixels wide depicting a single handwritten digit like in Fig. 2 . MNIST has a training set with 60k images and a test set with 10k. We used 5% of the training set for validation. We report the best test accuracy of a model with the highest validation accuracy after at most 1000 epochs. We did not perform any pre-processing or augmentation on MNIST dataset. We used a threshold-based binarization on the input pixels in the case of quantized network and real-value pixels in the case of full-precision configuration. The architecture of 2D-LSTM network depicted in Fig. 3 has a single hidden layer that is separated into four independent Long Short-Term Memorys (LSTMs) processing the input in four directions with N H memory cells each. The red arrows in Fig. 3 indicate the direction of propagation during the forward pass. The input layer is of size (C) 1. The outputs from all hidden layers and from all pixels are concatenated and fed into the common output layer of size (N O ) 10 that is a number of classes (digits from 0 to 9). The LogSoftmax activation function was used at the output layer. We used a negative log likelihood loss criterion with Adam optimizer and a mini-batch size of 512 or 256 depending on a size of a model. We used larger batch size for training smaller models to fully utilize the GPU resources. The learning rate starts from 0.01 and reduced by a factor of 0.8 every 20 epochs. We used gradient clipping with a value of 100.
DIBCO 2017
Document image binarization is a process of separating text from images, where text consists of pixels of interest, thus we train a model to label every pixel as belonging to text of background. Binarization of document images is a common pre-processing step for the document image processing that enhances quality of the images. Document images, especially of historical documents, usually contain different types of degradation and noise, i.e. different artifacts, uneven background, and bleedthrough text like it is shown in Fig. 4 . We considered a dataset that is used in DIBCO 2017 document image binarization contest to proof representability and correctness of our implementation. For training we used a combination of datatsets from DIBCO 2009 -2016 competitions that gave us 86 images, where 10 of them were randomly chosen for validation.
We report the highest test accuracy associated with the highest validation accuracy after at most 300 epochs. For testing we used a dataset from DIBCO 2017 [23] as the most challenging among all of them. The test datasets contains 10 printed and 10 handwritten documents. All datasets contain both color and gray-scale images with up to 4M pixels per image. We treated all images as 3-channel color images (C = 3). We did not perform any pre-processing or augmentation. The images are divided into patches of size n by n and are fed into the 2D-LSTM. We could achieve the best results with n = 64. The architecture of 2D-LSTM network depicted in Fig. 5 has a single hidden layer that is separated into four independent LSTMs processing the patches from all the four possible directions. Each LSTM consists of N H memory cells. The values from each LSTM are concatenated with respect to corresponding pixels, and fed into a common output layer of size (N O ) 2. Now, the output has a context from four different directions of the image at each pixel. Each output value indicates the probability of each pixel to be either text or background. We used a negative log likelihood loss criterion with Adam optimizer with a mini-batch size of 128 or 96 patches depending on a size of a model. We used larger batch size for training smaller models to fully utilize the GPU resources. The learning rate starts from 0.01 and reduced by a factor of 0.7 every 10 epochs. We used gradient clip with a value of 10.
QUANTIZATION
Our quantization approaches are based on those presented in [25] , [13] , [35] . For multi-bit weights and activations, we adopt Eq. 2a as a quantization function, where a is a full-precision activation or weight, a q is the quantized version, k is a chosen bit-width, and f is a number of fraction bits. For multi-bit weights, the choice of parameters is shown in Eq. 2b. For multi-bit activations, we assume that they have passed through a bounded activation function before the quantization step, which ensures that the input to the quantization function is either x ∈ [0, 1) after passing through a logistic sigmoid function, or x ∈ [−1, 1) after passing through a tanh function. As for the quantization parameters, we use Eq. 2b after the tanh and Eq. 2c after the logistic sigmoid.
In the case of binarized activations a b , we apply a siдn function as the quantization function [25] , as shown in Eq. 3. In the case of binarized weights w b , on top of the sign function we scale the result by a constant factor that is a standard deviation calculated according to Glorot et al. [9] . Our motivation to use this scaling coefficient based on that we use Glorot initialization for the weights in the case of full-precision model. In the case of binarization, the weights are constrained to -1 and 1, thus it makes sense to scale them to the boundary limits of the full-precision weights.
w b = siдn(a) · scalinд_f actor (4)
ARCHITECTURE
The main challenge to design an efficient architecture of an algorithm with feed-back loop comes from recurrency that becomes a throughput bottleneck. A next time step (pixel) can be calculated only if the previous step has been finished due to precedence constraint that makes pipeline part-time idle. An efficient solution is to make pipeline busy with calculations that do not have recurrent dependencies between each other. In the case of 2D-LSTM, the processing of the different directions (top-left origin, top-right origin, bottom-left origin and bottom-right origin) are independent.
Being inspired with the idea presented in [28] , we propose to interleave calculations corresponding to different directions. First, the architecture processes first pixel from each direction in a sequence, then second pixel, and so on. This approach keeps pipeline always busy without throughput penalties, if the number of sequentially processed LSTM cells is enough. The architecture is parametrizable with respect to performance scaling. The parametrization allows to apply coarse-grained parallelization on a level of 2D-LSTM cells, and fine-grained parallelization on a level of dot-products. The former, indicated as PE_LSTM unrolling, while the latter, indicated as SIMD_INPUT and SIMD_RECURRENT unrolling factors for input data and recurrent paths, respectively. This flexibility allows for tailoring of parallelism according to the available resources on the target device. When PE unrolling is performed, the LSTM cell will be duplicated PE_LSTM times, generating for each clock cycle PE_LSTM output activations. When SIMD unrolling is applied, each gate will receive a portion of the input data, namely SIMD_INPUT and a portion of the recurrent state SIMD_RECURRENT in each clock cycle. The same is applied to the fully-connected layer with PE_FC and SIMD_FC unrolling factors.
In the following, we explain the functionality of each component of the 2D-LSTM accelerator depicted in Fig. 6 .
Session: Deep Learning I FPGA '20, February 23-25, 2020, Seaside, CA, USA • Mem2Stream core is responsible for DMA reads of input data from DRAM using AXI memory-mapped non-coherent interface, and converting each transaction into AXI stream. The input data (pixels) are concatenated into 64-or 128-bit words that correspond to a single bus transaction depending on a bus width. For simplicity, the pixels are quantized so that the bus width is a multiple of pixel width, so the pixels do not spread over several transactions, and no padding is required. • The Data Width Converter splits each transaction into separate pixels and concatenates them according to SIMD_INPUT. • The 2D-LSTM Hidden Layer core depicted in Fig. 7 implements all functionality according to Eq. (1) except the recurrent connections. All memory cells and corresponding dot-products are instantiated with respect to PE_LSTM, and SIMD_INPUT, SIMD_RECURRENT respectively. Computations corresponding to different memory cells and dot-products can be calculated with various degree of parallelization, but always in a sequence with respect to different directions. Sequential execution of computations corresponding to different directions instead of parallel execution allows to save hardware resources as we do not quadruple datapath and memory ports. Although it increases latency per image, it keeps pipeline always busy that results in a higher computational efficiency. • The Recurrent Path X/Y-axis Data With Converter converts the output from the hidden layer that has a width that is a multiple of PE_LSTM into input with a width that is a multiple of SIMD_RECURRENT. The buffers are required to match the inputs and outputs from the corresponding directions. The X-axis buffer contains 4 × N H recurrent values (y i−1, j ) along x-axis, while the Y-axis buffer contains W × 4 × N H recurrent values (y i, j−1 ) along y-axis. • The Output Layer implements the output fully-connected layer with PE_FC = FULL and SIMD_FC equal to PE_LSTM of the previous hidden layer, so to match the throughput. The architecture of the core is depicted in Fig. 8 . In the case of MNIST, the outputs from different directions do not need to be matched for the same pixels, thus there is no need for a large matching buffer. The results are accumulated in a single-value buffer instantiated for each output unit. Opposite for DIBCO, outputs from hidden layer corresponding to the same pixel processed from different directions have to be matched, thus there is a matching buffer of size (H × W ) for each output unit shown in Fig. 8 . The buffers are implemented with dual-port read-write memories that allow to read and write partial sums in a pipelined manner at each clock cycle. Due to pixel-by-pixel processing order and sequential computations corresponding to different directions, the read and write patterns never risk to address the same address (partial sum) in the buffer in the same clock cycle. The processing directions follow distinct access sequence that is encoded in Row offset and Column index buffers. • The SoftMax core implements a simplified version of Softmax function. It finds a label with the maximum value among all outputs from the output layer in the case of MNIST, or among all outputs corresponding to the same pixel in the case of DIBCO. • The Data Width Converter concatenates the labels into 64or 128-bit words depending on bus width. • The Stream2Mem writes the labels to DRAM using AXI memory-mapped non-coherent interface. Table 1 presents classification accuracy achieved with the 2D-LSTM on MNIST dataset for full-precision versions and quantized counterparts. Our full-precision version is approaching accuracy of the state-of-the-art network [15] without using any pre-processing or augmentation, and confidently outperforms the results of a similar approach presented in the original paper on MD-LSTM [10] (see Section 2.1). The network presented in [15] is based on three deep learning architectures working in parallel, namely fully-connected deep network, deep CNN and deep Recurrent Neural Network (RNN). The top result was achieved using a combination of 30 deep learning models that by far exceeds the size and complexity of our network. The full-precision versions have achieved higher test accuracy with a higher value of dropout than quantized versions (0.8 vs. 0.7, respectively) that confirms observations that quantization has a regularization effect on neural network similar to dropout that prevents model overfitting [12] , [34] . Furthermore, high dropout values mean that MD-LSTM has a strong tendency to overfitting that is also in accordance with earlier observations presented in [22] and [19] . The configuration №8 with 2-bit weights has shown bad convergence that is probably a result of imbalanced representation of negative and positive values in the case of 2-bit quantization, namely (-1.0, -0.5, 0, 0.5). We expect that ternarisation of the model will result in higher accuracy. Configuration №11 demonstrates that binarization of activation causes stronger degradation of accuracy than binarization of the weights that can be explained by that we did not use scaling factor for binarized activation but weights and biases only. Some of the configurations with lower precision have higher test classification accuracy than some with higher precision that can mean that regularization that is brought by the quantization outperforms the negative effects of the lower precision. This statement is also confirmed by observing the train accuracy that mostly monotonously degrades with lower precision, meaning quantization as regularization improves generalization, i.e. test accuracy. We consider configuration №10 for hardware implementation because it provides higher accuracy than the reference designs see Table 3 , and has the lowest implementation complexity among other quantized versions. Moderate degradation in accuracy even for the configuration with binary weights mean that the 2D-LSTM is an overkill for the MNIST problem. Table 1 presents accuracy (F-measure) achieved with the 2D-LSTM on DIBCO 2017 dataset for full-precision versions and quantized counterparts. Our implementation with the highest accuracy outperforms the similar approach (see Section 2.1) and loses only 2.8% in accuracy to the winner of the competition presented in [23] without using any pre-processing or augmentation of the dataset. We consider the achieved accuracy as comparable with the state-ofthe-art result. It is difficult to qualitatively categorize the achieved accuracy as low or high with respect to the state-of-the-art as it is difficult to interpret pixel-level accuracy of the document binarization. The binarization step is just a one step out of many in a conventional document recognition chain that ends with OCR step that can be characterized with a Character or Word Error Rate (CER/WER). We can expect that degradation in accuracy of 2.8% in Table 2 : Resource utilization for a baseline (PE_LSTM = 1), unrolled (PE_LSTM > 1), multiple-baseline M × (PE_LSTM = 1) and multiple-unrolled M × (PE_LSTM > 1) instances of the accelerator implemented on XCZU9EG for MNIST and DIBCO 2017 datasets. the binarization step can result in negligible or even no accuracy degradation in the last OCR step. The values of CER/WER are easier for interpretation, however, it is out of the scope of the current paper. The approach presented in [23] is based on U-Net [26] type of a fully convolutional neural network and data augmentation during training. However, publication [23] lacks details of the topology besides mentioning U-Net as classification architecture. Assuming that the authors used the same architecture as in [26] , a size of the model is estimated to be of more than 31M parameters that is 48× larger than the 2D-LSTM model. In this regard, 2D-LSTM brings a very reasonable trade-off between complexity and accuracy.
RESULTS

Configuration
The results show a strong correlation between the number of cells, precision of the operands and accuracy. Comparing configurations №4 and №5 or №5 and №9, we see that precision of a bias has a noticeable influence on accuracy. Keeping higher precision for bias, while reducing precision for weights or activation can be an efficient approach for maintaining accuracy at low hardware cost. The noticeable influence of biases on accuracy can be explained by higher contribution of biases to the final result in the case of smaller models. The full-precision versions could achieve higher accuracy with higher value of dropout than quantized versions (0.6 vs. 0.5, respectively). There is still a noticeable gap between train and test accuracy that can mean an overfitting of the model, however higher dropout values did not improve the accuracy. The gap in accuracy between training and test can be also explained by a different complexity between training and test datasets. The DIBCO 2017 dataset is different from the datasets from the previous years because documents with more complex artifacts are added every year to challenge the performance of the current state-of-theart. We chose configuration №6 for hardware implementation that shows a degradation in accuracy of less then 1% with respect to full-precision version. 1% degradation is an empirical value for fair trade-off between complexity and accuracy. Configuration №9 is rejected, although it provides high accuracy (within 1% from the full-precision version) with lower precision of the weights than configuration №6 that results in lower BRAM usage, it needs higher precision for activation that brings to higher LUT utilization that is the implementation bottleneck, see Table 2 .
All hardware prototypes have been implemented on Xilinx Zynq UltraScale+ MPSoC ZCU102 board featuring a XCZU9EG device that contains (1) Quad-core ARM Cortex-A53 and (2) FPGA fabric. The HLS synthesis and place and route have been performed with Vivado Design Suite 2017.4. The resource utilization for different configurations of the accelerator for MNIST and DIBCO 2017 datasets are presented in Table 2 . The resource utilization of a baseline configuration (PE_LSTM = 1, SIMD_LSTM = FULL, PE_FC = FULL, SIMD_FC = PE_LSTM) allows to place more instances. The increase in a number of PE_LSTM will reduce the number of sequentially executed cells that will make the pipeline partially idle. According to Vivado HLS report, the depth (D) of the datapath of the hidden layer is 35 and 37 clock cycles at the corresponding frequency for MNIST and DIBCO respectively. The maximum PE_LSTM that still keeps pipeline busy is 4 × N H /D that is 2 and 4 for MNIST and DIBCO respectively. To further increase throughput and to avoid the negative side effect, we instead of further instantiating multiple cells, instantiate multiple (M) instances of the complete accelerator. The configurations with the highest throughput are used for comparison with GPU and other FPGA implementations in Table 3 . Table 2 clearly shows that instantiating multiple cells rather than multiple instances of the complete accelerator is more efficient because we avoid unnecessary replication of conversion cores and weights' memory. However, in the case of instantiating multiple cells, although the design operates on the same model, the is still an increase of memory (BRAMs) utilization that comes from more parallel ports required.
The host code runs on Cortex-A53 running Ubuntu 16.04.6 LTS. It reads test images from the SD card, stores them in the shared DRAM, starts the accelerator execution, measures runtime, and computes accuracy against the ground truth classifications. During processing, we perform two types of power measurements: P boar d that is a power consumption of the complete board, in the case of FPGA measured using digital multimeter Voltcraft VC-870, and P chip that has been acquired using Zynq UltraScale+ MPSoC Power Advantage Tool [2] for FPGA, and nvidia-msi for GPU.
According to Eq. (1), a number of learnable parameters per single 2D-LSTM cell is 5 × (C + 2 × N H + 1), where five comes from a number of gates including input to a memory cell. In the case of semantic segmentation problem, a single neuron of the output layer has (4 × N H + 1) parameters, where 4 × N H gives a number of concatenated outputs from the hidden layers. In the case of MNIST, a single unit of the output layer has (4 × N H × H × W + 1) parameters. The number of operations of the 2D-LSTM layer per image is (2×5×(C +2×N H )+11)×N H ×H ×W ×4, where two comes from multiplication and addition counted as separate operations, eleven is a number of point-wise operations, and four is a number of possible processing directions. The same applies to the output layer: (2 × (4 × N H ) + 1) × N O × H × W . We neglect a complexity of the rest of the network. As it is shown in Table 3 , 2D-LSTM model provides the highest accuracy with respect to other FPGAbased implementations, while having one of the smallest models that confirms finding in the earlier works [24] , [19] that MD-LSTM layers require less trainable parameters to achieve similar accuracy in comparison to other types of neural networks. Although, it has one of the smallest models, the number of operations is the highest that is also in alignment with the previous works, and this is the primary reason why other types of networks are preferred over MD-LSTM in the recent years. Table 3 presents comparison with earlier FPGA implementations of Multilayer Perceptron (MLP) processing MNIST datsaset, and our 2D-LSTM implementations for pixel-by-pixel (P) and diagonal-wise (D) approaches running on GPU for MNIST and DIBCO datasets. The implementations are compared from different perspectives including throughput and energy efficiency. We considered throughput (kF PS) and energy efficiency (kF PS/W att) expressed on the application level as more informative than expressed on the architectural level (TOPS and TOPS/W att, respectively). The networks have different dataflow and need a very different number of operations in order to process the same amount of information. TOPS and and TOPS/W att can be misleading. Our FPGA-based implementation processing MNIST dataset outperforms all reference FPGA implementations in accuracy, however it loses to most of them in terms of throughput and energy efficiency due to several reasons. First of all, all other implementations use MLP models that have much lower implementation complexity due to only feed-forward dataflow and low-complexity activation functions like ReLu or just a step function in the case of binarized models [30] . Furthermore, 2D-LSTM has two recurrent paths and multiple non-linear activation functions, and has to process image in four directions. Additionally, the implemented architecture follows the pixel-by-pixel order of execution that has the highest time complexity. Nevertheless, it shows very promising results taking into account that we did not apply any of the orthogonal approaches for accelerating 2D-LSTM (see Section 2.2).
In the case of MNIST dataset, 2D-LSTM shows high accuracy that is not achievable by simple classifiers, however, at a cost of higher complexity that is significantly lower than of state-of-theart approaches [15] . In this regard, 2D-LSTM brings a trade-off between complexity and accuracy.It also worth mentioning that 2D-LSTM appeared to be more accurate than other referenced hardware implementations with higher precision. It seems that multi-dimensional recurrences and multi-directional processing play more significant role than just precision.
Our FPGA-based implementation outperforms pixel-by-pixel and diagonal-wise approaches running on GPU in terms of throughput by 30-50× and 9-21×, and in terms of energy efficiency by 365-746× and 153-455×, respectively. However, a difference in throughput between GPU implementations is less than expected (H ×W )/(H +W ), due to overheard of picking up pixels from diagonals for diagonalwise processing. Additionally, in the case of MNIST dataset, the batch size for pixel-by-pixel approach was larger than for diagonalwise approach (2560 vs. 2048, respectively) due to lower memory overhead that also reduces the speedup of the later approach. In the case of DIBCO dataset, the batch size was the same for both implementations. Clearly there is a big room for optimization. However, even if we consider perfect speedup with respect to pixel-by-pixel approach, our FPGA-based implementation still outperforms the perfect diagonal-wise GPU implementation in terms of throughput and energy efficiency.
CONCLUSION
We presented a novel hardware architecture of MD-LSTM neural network. A systematic exploration of accuracy vs. precision trade-off for MD-LSTM using challenging DIBCO 2017 and MNIST datasets has been conducted. We have shown that MD-LSTM can achieve comparable with state-of-the-art results for MNIST dataset even with binary weights and 2-bit activation. It also approaches best results for DIBCO 2017 dataset with low bit-width weights and activation. Based on the new architecture we implemented an FPGA-based accelerator that outperforms NVIDIA K80 GPU implementation in terms of runtime and energy efficiency. At the same time, our accelerator demonstrates higher accuracy and comparable throughput in comparison with state-of-the-art FPGA-based implementations of multilayer perceptron for MNIST dataset.
By this research we have presented an efficient implementation for inference, and open a door for efficient training that can boost application of MD-LSTM. We have shown that FPGA is an alternative platform for deep learning that can offer a solution in cases when a massive parallelism of GPUs does not provide the necessary performance required by the application. Recent research has demonstrated a promising results on in-datacenter FPGA-based distributed training of deep neural network [8] . In combination with such systems, we can boost research in the field of demanding deep learning algorithms.
