We consider the use of look-up tables (LUT) to speed up and simplify the hardware implementation of a deep learning network for inferencing after weights have been successfully trained. The use of LUT replaces the matrix multiply and add operations with a small number of LUTs and addition operations resulting in a multiplier-less implementation. We compare the different tradeoffs of this approach in terms of accuracy versus LUT size and the number of operations.
Introduction
In recent years, neural networks (NN) [6] have re-emerged as a powerful tool in many nonlinear regression and classification applications, especially in the visual [10] and audio [1] processing domains. The general form of a neural network can be expressed as a sequence of linear and nonlinear mappings. In particular, we consider the following $N$-layer feed-forward network formulation:
$$y_{i+1} = f_i(W_i y_i + b_i), \quad i = 1, \dots, N, \qquad (1)$$
where $f_i$ is a nonlinear function mapping vectors of length $m_i$ to vectors of length $n_{i+1}$. The $f_i$ can be vectorized activation functions (ReLu, sigmoid, tanh, etc.), but can also be other functions such as pooling, softmax and even linear functions such as the identity function. Each $W_i$ is a matrix of size $m_i \times n_i$ and $y_i$ is a vector of length $n_i$, although for easier visualization and interpretation, they can be represented as (or reshaped into) a multidimensional tensor with the same number of elements. The vectors $b_i$ are of length $m_i$. The vectors $y_1$ and $y_{N+1}$ are the input and the output vectors of the entire network respectively and thus the network in Eq. (1) describes a nonlinear mapping between input and output. The functions $f_i$ are assumed to be fixed and given. The goal in training a network for classification and regression is the following: given $K$ pairs of vectors $(x_k, z_k)$, find the set of weights $W_i$ and biases $b_i$ such that $\sum_k d(\hat{z}_k, z_k)$ is minimized, where $\hat{z}_k$ is the output from the neural network when $x_k$ is fed as input to the neural network and $d(\cdot, \cdot)$ is an error or loss metric.
After the network is trained, Eq. (1) is used in the inference phase to compute $y_{N+1}$ given a new input data element $y_1$. Since this may be done in a low-power environment or at high speed, there is a need to speed up the computation of Eq. (1) using low-power high-speed hardware. The evaluation of Eq. (1) consists of many multiply-and-add operations and nonlinear operations such as sigmoids and logistic functions which also require many arithmetic operations. Modern GPU architectures have been developed to perform these computations with a large number of arithmetic units and digital multipliers. The purpose of this paper is to use look-up tables (LUT) to construct a truly multiplierless implementation that is easily parallelizable.
LUT framework and notation
A LUT is a memory array that stores precomputed values of a typically complicated function. LUTs have been found useful in high performance image processing to map between different color spaces where this mapping is generally nonlinear. In particular, current GPUs support the use of 2.5D and 3D LUTs for color conversion at high speed [8].
We can consider a LUT as a function $f : I \to O$ that maps elements from an input set $I$ to elements of an output set $O$. The number of bits required to describe an element $x \in I$ is denoted $\beta(I)$ (we will also refer to this as the resolution of $I$) and is given by $\beta(I) = \log_2(|I|)$ where $|I|$ is the cardinality of $I$. We assume that the bits used to describe elements of $I$ are in a number format so that arithmetic operations can be performed directly on the bits. For instance if $I$ are integers from 0 to 255, then $\beta(I) = 8$. If $I$ are IEEE 754 single precision floats (as used in the C programming language), then $\beta(I) = 32$. If $I$ are 8-bit minifloats [11], then $\beta(I) = 8$. Thus the LUT is indexed by $\beta(I)$ bits and outputs $\beta(O)$ bits, and the size of a LUT is $2^{\beta(I)} \beta(O)$ bits.
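To make the indexing and the size formula concrete, the following is a minimal sketch rather than the implementation used in this paper. It builds a LUT with $\beta(I) = \beta(O) = 16$ by enumerating every IEEE 754 binary16 bit pattern. The sigmoid is chosen purely as an illustrative function, and the helper names (`lut_size_bits`, `half_from_bits`, `half_to_bits`, `lut_sigmoid`) are hypothetical.

```python
import math
import struct
from array import array

def lut_size_bits(beta_in: int, beta_out: int) -> int:
    """Size in bits of a LUT indexed by beta_in bits with beta_out-bit entries."""
    return (1 << beta_in) * beta_out

def half_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def half_to_bits(value: float) -> int:
    """Encode a float as the nearest binary16 bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

def sigmoid(x: float) -> float:
    # Numerically stable form; NaN inputs simply propagate to NaN outputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# One 16-bit entry per possible input bit pattern (illustrative function only).
table = array("H", (half_to_bits(sigmoid(half_from_bits(i)))
                    for i in range(1 << 16)))

def lut_sigmoid(x: float) -> float:
    """Evaluate the function with a single table access and no arithmetic."""
    return half_from_bits(table[half_to_bits(x)])

if __name__ == "__main__":
    print(lut_size_bits(16, 16) // 8 // 1024, "KiB")  # table footprint
    print(lut_sigmoid(0.0), lut_sigmoid(1.0))
```

The input bits are used directly as the table index, so no comparison or arithmetic is needed at lookup time; all of the cost is moved into the precomputation and the memory footprint.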
Partitioning the input bits
Note that there is an asymmetry between $I$ and $O$ in determining the size of the LUT: the size grows exponentially with $\beta(I)$ but only linearly with $\beta(O)$. This suggests that $\beta(I)$ should be as small as possible and typically $\beta(O)$ is larger than $\beta(I)$. If $\beta(I)$ is too large, one way to reduce the resulting LUT is to partition the bits in $\beta(I)$. We assume that the bits in an element of $I$ are additive in the sense that if $b_i$ are the bits in $x \in I$, then $x = \sum_i b_i \alpha_i$ for some fixed objects $\alpha_i$. This is certainly the case when $x$ is a vector, matrix or tensor of numbers in fixed point or floating point formats, in which case $\alpha_i$ are also vectors, matrices or tensors. If we partition the bits into $k$ chunks of $m_i$ bits such that
$$\sum_{i=1}^{k} m_i = \beta(I),$$
then we can construct $k$ LUTs of size $2^{m_i} \beta(O)$ bits each. The bits in $x \in I$ are partitioned into $k$ chunks, applied to these LUTs and their results added (according to $x = \sum_i b_i \alpha_i$). Note that generally $m_i \geq 2$; if we split a 2-bit chunk into two 1-bit chunks, we do not reduce the total LUT size as
$$2^2 \beta(O) = 2 \cdot 2^1 \beta(O).$$
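The partitioning described above can be sketched as follows. This is an illustrative example under simple assumptions, not the paper's code. The function is the scalar map $x \mapsto wx$ on an unsigned integer $x$, which is additive in the bits of $x$, and the names `build_chunk_luts`, `evaluate` and `total_entries` are hypothetical.

```python
def build_chunk_luts(w, chunk_bits):
    """Build one LUT per chunk of input bits.

    Entry v of the chunk starting at bit position `lo` stores w * (v << lo),
    i.e. the contribution of those bits to w * x.
    """
    luts, lo = [], 0
    for m in chunk_bits:
        luts.append((lo, m, [w * (v << lo) for v in range(1 << m)]))
        lo += m
    return luts

def evaluate(luts, x):
    """Look up each chunk of x and add the results; no multiplication here."""
    total = 0
    for lo, m, table in luts:
        total += table[(x >> lo) & ((1 << m) - 1)]
    return total

def total_entries(chunk_bits):
    """Total number of LUT entries, i.e. the sum of 2^(m_i) over all chunks."""
    return sum(1 << m for m in chunk_bits)

if __name__ == "__main__":
    luts = build_chunk_luts(3, [4, 4])           # 8-bit input, two 4-bit chunks
    print(evaluate(luts, 200), 3 * 200)
    print(total_entries([8]), total_entries([4, 4]))
    print(total_entries([2]), total_entries([1, 1]))  # no gain from 2 -> 1+1
```

Splitting an 8-bit index into two 4-bit chunks reduces the number of entries from 256 to 32 at the cost of one extra addition, whereas splitting a 2-bit chunk into two 1-bit chunks leaves the total unchanged.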
From the previous section, a NN can be decomposed into 2 types of modules, the affine operation W x + b and the nonlinear function f . Each of these modules can be replaced with LUTs. We describe each of these in turn.
Computing a nonlinear function f with LUT
Replacing a general nonlinear function $f : I \to O$ with a LUT is generally feasible only if $\beta(I)$ is small. This means that using a single LUT is more suitable for scalar functions $f : \mathbb{R} \to \mathbb{R}$ such as sigmoids and activation functions or pooling layers in later stages of a NN where the information is more compressed into features. For instance, a scalar function that maps 32-bit floats to 32-bit floats can be implemented with a LUT of size $2^{37}$ bits or 16 Gibibytes, which is quite unwieldy. However, reducing the input and output to a 16-bit half-precision float reduces the LUT size to 128 Kibibytes. It is possible to search for a nonlinear way to combine the output of multiple LUTs to approximate $f$ in order to reduce further the total size of the LUT, but that is not the focus of this paper as it is a much more difficult problem. In many recent NN architectures, the activation function is a rectified linear function (ReLu) which can simply be implemented with a compare and branch instruction (either in software or hardware) and does not need the use of a LUT.
Computing the affine operation W x + b and exploiting linearity
The most computation-intensive part of a NN is the affine operation $Wx + b$. We can exploit linearity to compute $Wx + b$ efficiently using LUTs. Let $W$ be a $p \times q$ matrix and $b$ a $p$-vector. Let $x \in I$ be a $q$-vector where each element is represented with $r_I$ bits. Thus $\beta(I) = q r_I$. The output $Wx + b$ is a $p$-vector whose elements are represented with $r_O$ bits each, i.e. $\beta(O) = p r_O$. We first partition the vector $x$ into $k$ segments $x_i$ of size $m_i$ such that
$$\sum_{i=1}^{k} m_i = q.$$
Then each of these segments $x_i$ is used to build a LUT which outputs $W x_i + \frac{1}{k} b$, where $x_i$ is understood as zero-padded to a $q$-vector. The outputs of these table lookup operations are then added to obtain the final result
$$Wx + b = \sum_{i=1}^{k} \left( W x_i + \frac{1}{k} b \right).$$
The total size of the LUTs is $p \sum_{i=1}^{k} 2^{m_i r_I} r_O$ bits and a total of $k - 1$ additions of $p$-vectors is needed. This is in contrast with $pq$ multiply and add operations for a standard implementation of $Wx + b$. In particular, if we choose $k = q$, $m_i = 1$, we will have $q$ LUTs with a total size of $2^{r_I} p q r_O$ bits and $q - 1$ additions of $p$-vectors. Thus the number of additions is the same as the standard implementation, but all the $pq$ $r_I$-bit multiplications are replaced with $q$ LUT operations.
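The segment-wise construction can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: a dictionary keyed by the segment values stands in for a memory array indexed by the segment's $m_i r_I$ bits, exact fractions are used so that the $\frac{1}{k} b$ terms sum back to $b$ without rounding, and the names `build_affine_luts` and `affine_via_luts` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def build_affine_luts(W, b, seg_sizes, r_in):
    """Build one LUT per input segment; each entry is the p-vector W x_i + b/k."""
    p, k = len(W), len(seg_sizes)
    luts, start = [], 0
    for m in seg_sizes:
        table = {}
        # Enumerate all 2^(m * r_in) possible values of this segment.
        for seg in product(range(1 << r_in), repeat=m):
            table[seg] = [
                sum(W[row][start + j] * seg[j] for j in range(m))
                + Fraction(b[row], k)
                for row in range(p)
            ]
        luts.append((start, m, table))
        start += m
    return luts

def affine_via_luts(luts, x):
    """k table lookups followed by k - 1 additions of p-vectors."""
    vectors = [table[tuple(x[start:start + m])] for start, m, table in luts]
    result = vectors[0]
    for v in vectors[1:]:
        result = [a + c for a, c in zip(result, v)]
    return result

if __name__ == "__main__":
    W = [[1, -2, 3, 4], [0, 5, -1, 2]]   # p = 2, q = 4
    b = [1, -3]
    luts = build_affine_luts(W, b, seg_sizes=[2, 2], r_in=2)
    x = [3, 0, 1, 2]                      # each element uses r_I = 2 bits
    print(affine_via_luts(luts, x))
```

All multiplications happen once, when the tables are built; inference uses only lookups and vector additions.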
Fixed point formats
If $x$ is stored in a fixed point format, additional simplification and efficiency can be obtained by exploiting linearity in the fixed point representation. Consider a $q$-vector $x$ where each element is denoted $x_i$. Let $r_I = n$, i.e. each number $x_i$ is represented as an $n$-bit number
$$x_i = \sum_{j=0}^{n-1} a_{ij} 2^j,$$
where $a_{ij}$ are the bits representing $x_i$. The linear combination $y = \sum_i w_i x_i$ can then be written as
$$y = \sum_i w_i \sum_{j=0}^{n-1} a_{ij} 2^j.$$
Swapping the order of summation we get
$$y = \sum_{j=0}^{n-1} 2^j \sum_i w_i a_{ij}.$$
For a fixed $j$, the bits $a_{ij}$ correspond to the $j$-th bitplane of the numbers $x_i$. This implies that we can use the same LUT for each bitplane and do $n$ shift-and-add operations to compute $y$.
Again $q$ is partitioned into $k$ segments of size $m_i$ such that $\sum_{i=1}^{k} m_i = q$. Instead of using a single bitplane in the input of a LUT, a subset of bitplanes can be used to index the LUT. In this case, the set of bitplanes is partitioned into blocks that can be shifted into each other (e.g. pairs of adjacent bits).
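The bitplane decomposition can be illustrated with a small sketch, given here under the assumption of unsigned $n$-bit integer inputs and a single LUT covering all $q$ elements (i.e. $k = 1$). The function names are hypothetical.

```python
def build_bitplane_lut(w):
    """Entry idx holds sum_i w[i] * bit_i(idx): the dot product with one bitplane."""
    q = len(w)
    return [sum(w[i] for i in range(q) if (idx >> i) & 1)
            for idx in range(1 << q)]

def dot_via_bitplanes(lut, x, n):
    """Compute sum_i w_i x_i with n lookups in the same LUT and n shift-and-adds."""
    y = 0
    for j in range(n):
        idx = 0
        for i, xi in enumerate(x):
            idx |= ((xi >> j) & 1) << i   # gather bit j of every element
        y += lut[idx] << j                # weight the bitplane result by 2^j
    return y

if __name__ == "__main__":
    w = [3, -1, 4, 2]
    x = [5, 7, 0, 6]                      # unsigned 3-bit inputs
    lut = build_bitplane_lut(w)
    print(dot_via_bitplanes(lut, x, n=3), sum(wi * xi for wi, xi in zip(w, x)))
```

The LUT here has $2^q$ entries regardless of $n$, whereas indexing by all input bits at once would require $2^{qn}$ entries.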
Floating point formats
When numbers are represented in floating point format, similar to the fixed point format, we can also split the mantissa into bitplanes (or groups of bitplanes). However, the entire exponent needs to be part of the input to each LUT. For instance, consider a floating point representation with $r_I$ bits where $r_I = n + t$, with $n$ bits reserved for the mantissa and $t$ bits reserved for the exponent. The same LUT is indexed by a single bitplane of the mantissa together with the entire $t$ bits of the exponent. This is illustrated in Fig. 1. With $k$ and $m_i$ defined as above, the total number of bits for the LUTs is $\sum_i 2^{m_i(1+t)} \beta(O)$, and $nk$ LUT evaluations and bit-shift-and-add operations are needed to compute $Wx + b$. This suggests that in order to obtain a small total LUT size, the number of bits allocated to the exponent should be small.
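As a rough illustration of indexing by one mantissa bit plus the whole exponent, consider the following sketch. It assumes a deliberately simplified, hypothetical number format in which a value is `mantissa * 2**exponent` with an unsigned integer mantissa of $n$ bits and an unsigned exponent of $t$ bits (no sign, no bias, no hidden bit); it is not IEEE 754 and is not the format used in the experiments. One LUT per input element is used ($m_i = 1$, $k = q$).

```python
def build_float_luts(w, t):
    """One LUT per element, indexed by (mantissa bit << t) | exponent.

    Each LUT has 2^(1 + t) entries holding w_i * bit * 2^exponent.
    """
    return [[w_i * bit * (1 << e) for bit in (0, 1) for e in range(1 << t)]
            for w_i in w]

def dot_via_float_bitplanes(luts, mantissas, exponents, n, t):
    """n * k lookups; the same LUTs are reused for every mantissa bitplane."""
    y = 0
    for j in range(n):
        plane = 0
        for lut, m, e in zip(luts, mantissas, exponents):
            plane += lut[(((m >> j) & 1) << t) | e]
        y += plane << j                   # shift-and-add over mantissa bitplanes
    return y

if __name__ == "__main__":
    w = [2, -3, 5]
    mantissas, exponents = [5, 3, 7], [0, 2, 1]   # values 5, 12, 14
    luts = build_float_luts(w, t=2)
    print(dot_via_float_bitplanes(luts, mantissas, exponents, n=3, t=2))
    print(sum(len(lut) for lut in luts))          # q * 2^(1 + t) entries
```

Each additional exponent bit doubles every table, which is why a short exponent keeps the total LUT size small.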
Convolutional layers using LUT
In convolutional layers the $W$ matrix has a specific structured form and many of the matrix coefficients are duplicated. For instance, a 1-D convolutional layer has a circulant matrix $W$ whereas in 2-D convolutional layers, the matrix $W$ is a block-circulant matrix of circulant blocks. To implement this in a LUT, first the input $q$-vector is partitioned into $k$ chunks whose supports are shifted versions of each other. These could be either contiguous chunks or alternating elements. The output will be a vector that is larger than the chunk by an amount related to the filter size (Fig. 2a) and its support is equal to the dilation of the input support with the convolutional filter structural element [2]. To minimize the size of the output support it is better to have the partition be in square contiguous blocks. The shift-invariance in the linear convolution operation is analogous to the invariance of the linear operations on the bitplanes and we can similarly reuse the same LUT. In particular, the same LUT can be used for each of the chunks and the output shifted (in space here rather than in binary base (Sect. 4.1)) and added. Suppose that due to the convolution an $a$-element vector is mapped to an output vector of size $c$. Then, following the size formula $2^{\beta(I)} \beta(O)$, the LUT size is $2^{a r_I} c\, r_O$ bits when each input element uses $r_I$ bits and each output element uses $r_O$ bits.
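The reuse of one LUT across spatially shifted chunks can be sketched for the 1-D case. The example below is an illustration under simplifying assumptions: it uses a full linear convolution (not the circulant form), a single bitplane as input (binary values), contiguous chunks, and an input length that is a multiple of the chunk length $a$. The names are hypothetical.

```python
def build_conv_lut(h, a):
    """Map every a-element binary chunk to its full convolution with filter h.

    Each entry is a vector of length c = a + len(h) - 1.
    """
    c = a + len(h) - 1
    lut = []
    for idx in range(1 << a):
        out = [0] * c
        for pos in range(a):
            if (idx >> pos) & 1:
                for tap, coeff in enumerate(h):
                    out[pos + tap] += coeff
        lut.append(out)
    return lut

def conv_via_lut(lut, bits, a, f):
    """Apply the same LUT to each chunk, shift its output in space, and add."""
    y = [0] * (len(bits) + f - 1)
    for start in range(0, len(bits), a):
        idx = sum(bit << pos for pos, bit in enumerate(bits[start:start + a]))
        for offset, value in enumerate(lut[idx]):
            y[start + offset] += value
    return y

if __name__ == "__main__":
    h = [1, -2, 3]
    bits = [1, 0, 1, 1, 0, 1, 1, 1]
    lut = build_conv_lut(h, a=4)
    print(conv_via_lut(lut, bits, a=4, f=len(h)))
```

Outputs of adjacent chunks overlap by `len(h) - 1` positions, which is where the additions occur; multi-bit inputs would combine this with the bitplane shift-and-add.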
Dealing with signed numbers
So far the discussion deals only with using unsigned numbers in the input set $I$ to index the LUT. Dealing with signed numbers requires a slight modification of the architecture. We discuss here the case of fixed point formats as the floating point format is similarly handled. Consider an $n$-bit bitstring $x$ encoding a number in 2's complement format. The most significant bit (MSB) is a sign bit. If this bit is 1, then the value represented by $x$ is $x - 2^n = x_b - 2^{n-1}$ where $x_b$ is the bitstring $x$ with the MSB removed. Depending on whether the MSB is 1 or not, there is an additional offset of $-2^{n-1}$. Thus we can partition the set of $x_b$'s as before, apply the LUTs (this can be applied to the entire bitstring $x_b$ or by bitplanes as described above), and add the results. The MSBs of all the elements in the vectors are similarly partitioned and applied to the same LUTs, and the result is shifted to the left by $n - 1$ bits and subtracted from the previous result. This is shown schematically in Fig. 2b.
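A minimal sketch of this sign handling, assuming $n$-bit 2's complement integer inputs processed by bitplanes with a single LUT over all elements, is given below. The function names are hypothetical.

```python
def build_bitplane_lut(w):
    """Entry idx holds sum_i w[i] * bit_i(idx)."""
    q = len(w)
    return [sum(w[i] for i in range(q) if (idx >> i) & 1)
            for idx in range(1 << q)]

def gather_plane(codes, j):
    """Pack bit j of every n-bit code into one LUT index."""
    idx = 0
    for i, code in enumerate(codes):
        idx |= ((code >> j) & 1) << i
    return idx

def signed_dot_via_lut(lut, codes, n):
    """Dot product with 2's complement inputs given as n-bit codes."""
    y = 0
    for j in range(n - 1):                           # the bits of x_b
        y += lut[gather_plane(codes, j)] << j
    y -= lut[gather_plane(codes, n - 1)] << (n - 1)  # MSB plane is subtracted
    return y

if __name__ == "__main__":
    w = [3, -1, 4]
    xs = [-3, 2, -4]                     # representable with n = 3 bits
    codes = [x & 0b111 for x in xs]      # 2's complement bit patterns
    lut = build_bitplane_lut(w)
    print(signed_dot_via_lut(lut, codes, n=3),
          sum(wi * xi for wi, xi in zip(w, xs)))
```

The same LUT serves both the magnitude bitplanes and the sign bitplane; only the final combination changes from an addition to a subtraction.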
In many NN architectures, there is no need to deal with signed numbers, since the input to a linear layer is generally the output of an activation layer or a pooling layer. Many activation functions are either nonnegative (e.g. ReLu) or can be made nonnegative by the addition of a fixed positive constant. This means that the output of a pooling layer that follows an activation layer will also be nonnegative.
Stochastic rounding
A LUT can also be used to implement stochastic rounding, which has been found useful in ML algorithms using limited precision [4]. The rounding function is augmented with an additional input as a counter. Let $r(i)$ be a sequence of $R$ (pseudo)random numbers between 0 and 1, then the function that is implemented by the LUT is:
The index $i$ is incremented (modulo $R$) each time the LUT is accessed. The size of the LUT is $R\, 2^{\beta(I)} \beta(O)$ bits.
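The counter-augmented table can be sketched as follows. Since the exact rounding function is not reproduced in this text, the sketch assumes the common definition of stochastic rounding (round down, and add 1 when the random number is smaller than the fractional part); this choice, the fixed point input with `frac_bits` fractional bits, and the class name `StochasticRoundingLUT` are all assumptions made for illustration.

```python
import random

class StochasticRoundingLUT:
    """LUT with R * 2^beta_in entries, indexed by a counter i and the input bits."""

    def __init__(self, beta_in, frac_bits, R, seed=0):
        rng = random.Random(seed)
        r = [rng.random() for _ in range(R)]   # R pseudorandom numbers in [0, 1)
        scale = 1 << frac_bits
        self.R = R
        self.i = 0
        # Assumed rule: floor(x) + 1 if r(i) < frac(x), else floor(x).
        self.table = [
            [(x >> frac_bits) + (1 if r_i < (x & (scale - 1)) / scale else 0)
             for x in range(1 << beta_in)]
            for r_i in r
        ]

    def lookup(self, x):
        y = self.table[self.i][x]
        self.i = (self.i + 1) % self.R         # counter incremented modulo R
        return y

if __name__ == "__main__":
    lut = StochasticRoundingLUT(beta_in=6, frac_bits=3, R=16)
    x = 44                                     # represents 44 / 8 = 5.5
    print([lut.lookup(x) for _ in range(16)])  # a mix of 5 and 6
```

All randomness is fixed when the table is built, so a lookup remains a single memory access followed by a counter increment.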
Example implementations
We consider multiple neural network architectures for the tasks of classifying the MNIST dataset [7] and the Fashion MNIST dataset [13]. We insert quantization operations before the input to a CNN or dense linear layer to mimic the quantization required to map it to the desired input set $I$ for the input of a LUT. We trained this modified network using TensorFlow and SGD with dropout. The ReLu activation layers, the pooling layers, and the argmax layer that determines the label from the one-hot encoding do not involve any multiplication and use only comparison operations. We will omit these layers from our comparison since they are the same for the LUT approach and the traditional approach. Because of the ReLu activation layers, the sign bit in the input $x \in I$ to the LUT will always be 0, so we can reduce the LUT size by half when using a floating point format as the input.
Linear classifier
Consider a linear classifier with a single dense layer $(W, b)$ of sizes $784 \times 10$ and $10 \times 1$ respectively. The total storage for the weight matrices in single precision floating point format is 30.7 Kibibytes. We ran the training for 50000 episodes with a minibatch size of 100 and averaged the results over 20 trials.
MNIST
The reference model achieves an average accuracy of 92.4% on the test data set. For the LUT-based implementation, the LUT implements the operation $Wx + b$ and accepts $x$ as an input in fixed point format. The accuracy versus the number of bits in the input is shown in Fig. 3. With the input quantized to about 3 bits, we were able to achieve similar accuracy, and increasing the precision of the input further does not increase the accuracy noticeably. This is not surprising since the original NIST digits images [3] are bilevel and the few grey levels were introduced into MNIST due to anti-aliasing. Quantizing the input to 3 bits implies that the totality of bits in the input of the LUT is $3 \times 28 \times 28 = 2352$. The output is a one-hot encoding of the classified label ranging from 1 to 10, i.e. it can be represented with 10 16-bit half-precision float numbers. Partitioning these input bits in various ways results in a tradeoff of LUT size versus number of shift-and-add operations (Fig. 4). For instance, we can perform each inference on MNIST using 56 LUTs with a total combined size of 17.5 Mebibytes, 168 LUT evaluations and 1650 shift-and-add operations compared to 7840 multiply and add operations in the reference model. In fact, using 784 LUTs totaling about 30.6 Kibibytes, the number of shift-and-add operations is 23520; this configuration has the same memory footprint as the reference model but without any multiplications involved.
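The LUT sizes and evaluation counts quoted above can be checked with a short calculation. The sketch below rests on the following assumptions, which are inferred from the figures rather than stated explicitly: the 784 pixels are split evenly among the $k$ LUTs, each LUT is indexed by one bitplane of its pixels, and each entry holds 10 outputs of 16 bits. The function name is hypothetical.

```python
def bitplane_lut_config(q, k, n_bits, p, out_bits):
    """Return (total LUT size in bytes, number of LUT evaluations).

    Assumes k divides q, each LUT is indexed by one bitplane of q // k inputs,
    and the same LUT is reused for all n_bits bitplanes.
    """
    m = q // k                                    # inputs per LUT
    size_bytes = k * (1 << m) * p * out_bits // 8
    evaluations = k * n_bits
    return size_bytes, evaluations

if __name__ == "__main__":
    size, evals = bitplane_lut_config(q=784, k=56, n_bits=3, p=10, out_bits=16)
    print(size / 2**20, "MiB,", evals, "LUT evaluations")
    size, evals = bitplane_lut_config(q=784, k=784, n_bits=3, p=10, out_bits=16)
    print(size / 2**10, "KiB,", evals, "LUT evaluations")
    print((784 * 10 + 10) * 4 / 2**10, "KiB for float32 weights and biases")
```

Under these assumptions the two configurations come out at 17.5 Mebibytes with 168 evaluations and about 30.6 Kibibytes respectively, matching the figures in the text.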
Fashion MNIST
After training, the reference model using 32-bit single precision floating point arithmetic achieves an average accuracy of 81.4% on the test dataset. The tradeoff in accuracy versus the number of bits in the input is shown in Fig. 5. Similarly, we see that quantizing the input to 3 bits per pixel suffices to reach similar accuracy. Interestingly, we see that the accuracy can decrease slightly as the number of bits increases. We believe this is because the loss of information in quantization counteracts the decrease in accuracy on the testing tasks due to overfitting on the training tasks. Similar to the MNIST case, we can perform each inference on Fashion MNIST using 56 LUTs with a total combined size of 17.5 Mebibytes, 168 LUT evaluations and 1650 shift-and-add operations.
Multilayer Perceptron (MLP)
We consider a 3-layer neural network with 3 dense layers of sizes $(784 \times 1024, 1024 \times 1)$, $(1024 \times 512, 512 \times 1)$ and $(512 \times 10, 10 \times 1)$ respectively. These weights require about 5.1 Mebibytes in storage. For the MNIST dataset, the reference model achieves 98.2% accuracy. We use an 8-bit fixed point format to encode the input image pixels for the first dense layer. Using a fixed point format for the input to the second and third dense layers results in a reduced accuracy.
On the other hand, using IEEE 754 binary16 16-bit floating point format for the output of the first layer and the second layer and the input to the second and third layer, we obtain an accuracy of 98.4% which is comparable to the reference model.
There are two ways to implement the floating point format architecture. If all 16 bits are used to index the LUT, we can achieve similar performance as the reference model, with 2320 LUTs with a combined size of 32.7 Gibibytes and 1330678 addition operations compared with 1332224 multiply-and-add operations. This LUT size is not practical in current implementations.
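The counts for this first configuration can be reproduced with a short calculation. The sketch below is a consistency check under assumptions inferred from the numbers rather than stated in the text: one LUT per input element, 8 index bits for the first layer, 15 index bits for the second and third layers (the sign bit being always 0), and 16-bit outputs. The variable names are hypothetical.

```python
# (inputs q, outputs p, index bits per LUT) for each dense layer.
# The index widths are assumptions made for this check.
layers = [(784, 1024, 8), (1024, 512, 15), (512, 10, 15)]
OUT_BITS = 16

num_luts = sum(q for q, p, bits in layers)
multiply_adds = sum(q * p for q, p, bits in layers)
# Adding q looked-up p-vectors takes (q - 1) * p scalar additions.
additions = sum((q - 1) * p for q, p, bits in layers)
total_bytes = sum(q * (1 << bits) * p * OUT_BITS // 8 for q, p, bits in layers)

if __name__ == "__main__":
    print(num_luts, "LUTs")
    print(additions, "additions vs", multiply_adds, "multiply-and-adds")
    print(round(total_bytes / 2**30, 1), "GiB")
```

Almost all of the memory is consumed by the second layer, whose 1024 LUTs of $2^{15}$ entries with 512 outputs each account for 32 Gibibytes on their own.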
The second way is to split the mantissa into bitplanes, apply the same LUT to each bitplane and then apply a shift-and-add as described in Sect. 4.2. The precision in the mantissa of the IEEE 754 binary16 format is 11 bits. The sign bit is always 0 since we are only dealing with nonnegative numbers due to the use of ReLu activation. If the mantissa is separated into these 11 bitplanes but the entire 5-bit exponent is still used to index the LUT, the tradeoffs are as shown in Fig. 6. Thus we can achieve similar performance as the reference model, with 2320 LUTs with a combined size of 162.6 Mebibytes and 14652918 shift-and-add operations compared to 1332224 multiply-and-add operations in the reference model. For the Fashion MNIST dataset, the reference model achieves 89.7% accuracy and, similar to MNIST, using binary16 in the input we obtained a similar accuracy.
Deep CNN
Consider next a LeNet implementation for classifying MNIST that was described in a TensorFlow tutorial. The weight matrices for the dense and convolutional linear layers (with pooling and dropout layers in between) are: Again, we use 8 bits in fixed point format to encode the input images to the network. We find that with a fixed point format indexing the input to layers 2 through 4, we can only get an accuracy of about 95.6%. On the other hand, using the IEEE 754 binary16 floating point format to encode the input to layers 2 through 4, we achieved similar accuracy as the reference model. The smallest total LUT size is achieved for the configuration where the mantissa is partitioned into 11 bitplanes and the spatial partition is into single elements. In this case, the total LUT size is 400 Mebibytes. The number of shift-and-add operations is 37.4M. A more detailed tradeoff of total LUT size versus number of operations is shown in Fig. 7. For instance, another partitioning configuration results in a total LUT size of 12.26 Gibibytes and 12.9M shift-and-add operations (comparable to the number of multiply-and-add operations in the reference model).
Comparison with low precision NN
While the philosophy of trading off accuracy versus performance by leveraging low precision data is similar to approaches using low-precision data formats and arithmetic [4], there are some significant differences between the two approaches. First of all, because of the asymmetry between the input and output in a LUT, the main reduction in precision is in the input $I$ of a LUT. The weights, the nonlinearity, the output $O$ and the computations needed to produce the elements in $O$ that are stored in the LUT can all be done in full precision. This is in contrast to limited precision approaches where the entire computation pipeline is done in low precision. Secondly, because the LUT is indexed by the totality of bits in $I$, where $I$ can be an ordered set of multiple elements, the precision of different elements in $I$ can be different. This is in contrast to using low precision arithmetic hardware where all the computation is done at the same precision. For instance, Ref. [12] shows an application in image halftoning using LUT where the precision of the input depends on the location of the input pixel. In terms of parallelism, the LUT approach can be parallelized as the LUT operations can be done in parallel with multiple LUTs, whereas the parallelism in multiplier-based NN relies on using a parallel multiplier (rather than an iterative shift-and-add multiplier) and having several hardware multipliers running in parallel.
Concluding remarks
We explore the feasibility of using LUTs to speed up the inference of large scale neural networks by eliminating multiply and add operations, especially in CNNs where the majority of computations during inference is computing the convolutions. This approach can be implemented in software using standard computing hardware. However, custom hardware for partitioning the input data (whether in floating point or fixed point format) into chunks that can be used directly as indices to the LUT and for bit-shuffling operations applied to both the input and the output of the LUT can result in an even more efficient architecture. Such custom hardware would simply reroute the bits appropriately to access memory locations of the LUT and reroute the output from the LUT appropriately to the adder. Our experimental results indicate that a practical tradeoff between total size of LUT, number of arithmetic operations and accuracy exists. Furthermore, floating point formats in intermediate layers provide better results for the same number of bits than fixed point formats. Future research includes determining what the optimal architecture should be to balance the LUT size and the number of operations for each inference.
