Introduction
Use of neural networks for computer vision, speech recognition, and other applications has exploded in recent years, in part due to their unprecedented performance on a variety of benchmarks. Nonetheless, high-throughput and energy-efficient evaluation of such networks, and in particular of convolutional neural networks (CNNs), remains an active field of research. Network evaluation is both memory and compute intensive, with the bottleneck depending on the network topology and layer types (convolutional or fully-connected [FC]).
Significant effort has been made to reduce the memory footprint of neural networks, motivated by the fact that off-chip memory accesses can dominate energy consumption during evaluation (Zhou et al., 2016; Han et al., 2015; Courbariaux & Bengio). This work explores the less-studied objective of optimizing the multiply-and-accumulates executed during evaluation of the network. In particular, we propose using the Residue Number System (RNS) as the internal number representation across all layer evaluations, allowing us to exploit more power-efficient RNS multipliers and adders. We motivate our optimization with Table 1, which summarizes the large number of multiply-and-accumulates (MACs) required to evaluate popular networks. Small improvements to the efficiency of the core multiply-and-accumulate block can yield large improvements in overall network evaluation. Prior work applying RNS for efficient computation has largely focused on cryptographic applications and general-purpose ALUs. In the machine learning domain, various optimizations such as Winograd and FFT transforms have been proposed to speed up network inference. As far as we are aware, this is the first attempt at applying RNS to neural network inference.
In section 2, we describe RNS and the operations and methodology required for RNS-based inference. Sections 3, 4, 5, and 6.1 detail our implementation of core RNS hardware modules in Verilog, along with simulation results giving estimated power, area, and latency. In section 6.2, we describe accuracy estimates at our chosen RNS precision. In section 6.3, we tie together our results with an analysis of the break-even points at which RNS-based inference becomes beneficial for low-power network evaluation. We conclude with a critique of our contributions.
Our source code, including the Tensorflow models, the Bluespec/Verilog hardware, and the software model of the digital hardware, is available for re-use at https://github.mit.edu/mrhamid/6888_Project.
Preliminaries

Residue Number System
At its core, the Residue Number System relies on the Chinese Remainder Theorem (CRT) to represent a large integer as a tuple of smaller integers, allowing for smaller and more parallel arithmetic blocks. In RNS with moduli set {m_1, m_2, . . . , m_n}, an integer X mod M is represented as (x_1, . . . , x_n), where x_i = X mod m_i.
The set of moduli is carefully chosen. In particular, the moduli are usually co-prime to reduce the number of distinct numbers that share the same RNS representation, and in our case they are also chosen for the hardware efficiency of the resulting operations. In this work, we fix our moduli to the structured 4-tuple {2^n ± 1, 2^{n+1} ± 1}. This set supports efficient modulo arithmetic and magnitude comparison (Sousa, 2007). We choose the n = 7 moduli set in this work. Every RNS number is stored using 7 + 8 + 8 + 9 = 32 bits, and can take values in the range [0, M), where M is the dynamic range determined by the moduli (Hiasat, 2003).
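As a concrete illustration of the representation, the following is a minimal pure-Python sketch (not our hardware model; the helper names are ours, and the moduli tuple is our reading of the n = 7 instantiation above):

```python
# Pure-Python sketch of the RNS representation: an integer is stored as its
# residues with respect to the assumed n = 7 moduli set {2^7-1, 2^7+1, 2^8-1, 2^8+1}.
MODULI = [2**7 - 1, 2**7 + 1, 2**8 - 1, 2**8 + 1]   # 7 + 8 + 8 + 9 = 32 bits in total

def to_rns(x):
    """Encode a non-negative integer as its residue tuple, x_i = x mod m_i."""
    return tuple(x % m for m in MODULI)

def rns_mac(acc, a, b):
    """Residue-wise multiply-and-accumulate: every channel is independent,
    so no carries propagate between the four small arithmetic blocks."""
    return tuple((acc_i + a_i * b_i) % m
                 for acc_i, a_i, b_i, m in zip(acc, a, b, MODULI))

# The residues of the true integer result match the residue-wise computation.
a, b, acc = 12345, 678, 1000
assert rns_mac(to_rns(acc), to_rns(a), to_rns(b)) == to_rns(acc + a * b)
```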
Piecing Together RNS Inference
At its core, complete end-to-end inference in RNS requires support for two operations: multiply-and-accumulate and ReLU. Our choice of moduli allows us to build on prior work for both: (Zimmermann, 1999) proposes digital architectures for multiplication and addition mod 2^n ± 1, and (Sousa, 2007) demonstrates comparison of RNS numbers with moduli 2^n ± 1 in VHDL. We use a comparison module to implement the ReLU nonlinearity.
We assume a discrete output for our network, e.g., 10-class image recognition with CIFAR-10. This allows us to avoid the overhead of converting out of RNS at the end of the network; instead, using our comparator, we compute a maximum over the final-layer outputs (whose argmax matches the softmax argmax), returning the index with the largest value. All RNS operations occur in the realm of positive integers with a fixed modulus. We explore this constraint in Section 6.2.
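The following behavioral sketch (a symbolic dynamic range M and helper names of our own, not the actual implementation) illustrates the wrap-around convention for negative values and the corresponding ReLU threshold at M/2:

```python
# Behavioral sketch of the wrap-around convention: a negative value v is stored
# as M + v, so any stored value >= M/2 is interpreted as negative by the ReLU.
def to_wraparound(v, M):
    return v % M                       # e.g. -3 maps to M - 3

def relu_mod(x, M):
    # Half-comparator behavior: values in [M/2, M) decode to negatives -> clamp to 0.
    return x if x < M // 2 else 0

M = 2**30                              # stand-in dynamic range, for illustration only
assert relu_mod(to_wraparound(5, M), M) == 5
assert relu_mod(to_wraparound(-5, M), M) == 0
```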
Comparison in RNS
Though RNS multiplication and addition are operationally intuitive, comparing two RNS numbers is non-trivial. We follow the procedure given by (Sousa, 2007). The crux of the algorithm is reducing comparison to parity (even/odd) checking. To compare two unsigned integers A, B mod M, we compute the difference C = (A − B) mod M, which takes one of two values:

C = A − B (if A ≥ B) or C = A − B + M (if A < B).
Because our chosen modulus M is odd, these two values have different parities. As such, we can perform a comparison given a formula for the mod-2 parity X_P of an RNS number X = (x_1, x_1*, x_2, x_2*); the set of equations for computing this parity directly from the residues, along with its proof, is given in (Sousa, 2007). This is the full comparator we implement in combinational logic to execute the final-layer maximum. In the ReLU, we are able to trim this combinational logic, because we compare with a fixed threshold of M/2 (the equivalent of comparing with 0 in our wrap-around modulus world). We call this trimmed comparator a half comparator. In particular, the parity of B = M/2 is fixed and pre-computed, as is the value of the additive inverse −B = −M/2 that we feed into the modulo adder. The parity-checking combinational logic implemented in Verilog is shown in Figure 1; it is used in both the full- and half-comparator designs. We modify the circuit given in (Sousa, 2007), which, based on our testing, we suspect does not evaluate correctly across an exhaustive sweep of all ∼3 billion inputs.
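To make the parity argument concrete, the sketch below checks the comparison rule on plain integers rather than on residues (the real comparator of Figure 1 derives the parities directly from the residue channels, which we do not reproduce here). The function name and test values are illustrative only:

```python
# Since M is odd, (A - B) mod M keeps the parity of A xor B when A >= B and
# flips it when the subtraction wraps around -- which is exactly what we test.
def geq_by_parity(a, b, M):
    c = (a - b) % M
    return (c & 1) == ((a ^ b) & 1)    # parity(C) == parity(A) xor parity(B)  <=>  A >= B

M = 127 * 129 * 255 * 257              # odd product of the assumed n = 7 moduli
for a, b in [(10, 3), (3, 10), (7, 7), (0, M - 1)]:
    assert geq_by_parity(a, b, M) == (a >= b)
```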
We include various optimizations. For example, our choice of modulus allows the additive inverse to be found with a simple inverter. It also allows the remainder of a 16-bit number to be computed with a single addition (bottom right of Figure 1). Additionally, we implement multiplication by 2^n mod (2^{n+1} − 1) as a right rotate.
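Both tricks are easy to check in software. The sketch below (our own helper names, assuming n = 7) verifies the complement-based additive inverse and the rotate-based multiplication by 2^n against their arithmetic definitions:

```python
n = 7

def neg_mod_2n_minus_1(x):
    # Additive inverse mod (2^n - 1): since 2^n - 1 is all ones, (2^n - 1) - x
    # is simply the n-bit bitwise complement of x (a row of inverters in hardware).
    return (~x) & (2**n - 1)

def mul_2n_mod_2n1_minus_1(x):
    # Multiplying by 2^n mod (2^{n+1} - 1) is a right rotate by one bit of the
    # (n+1)-bit word, because 2^{n+1} is congruent to 1 for this modulus.
    w = n + 1
    return (x >> 1) | ((x & 1) << (w - 1))

M1, M2 = 2**n - 1, 2**(n + 1) - 1
for x in range(M2):
    if x < M1:
        assert (x + neg_mod_2n_minus_1(x)) % M1 == 0
    assert mul_2n_mod_2n1_minus_1(x) % M2 == (x * 2**n) % M2
```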
Converting Input into RNS
For the chosen moduli set, residue generation relies on the periodicity of the binary weights 2^i in the modulus domain. As pointed out by (Piestrak, 1994), the residue of a binary number X = Σ_i b_i 2^i is given by

X mod m = ( Σ_i b_i (2^i mod m) ) mod m;

taking the modulus causes the binary weights 2^i mod m to repeat themselves once i reaches the modulus width (n_1 or n_2, respectively). Therefore, the residue is calculated by periodically folding the higher-weight bits back and adding them to the lower-weight bits using a tree of modulo adders.
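A software sketch of the folding idea for the 2^n − 1 modulus follows (a hypothetical helper, not the Verilog residue generator; for the 2^n + 1 modulus the folded chunks would additionally alternate in sign, which we omit here):

```python
# Because 2^(i+n) is congruent to 2^i modulo (2^n - 1), an input word can be cut
# into n-bit chunks that are folded back and summed with modulo adders.
def residue_2n_minus_1(x, n, width=32):
    m = 2**n - 1
    acc = 0
    for lo in range(0, width, n):
        acc = (acc + ((x >> lo) & m)) % m    # one modulo adder per folded chunk
    return acc

n = 7
for x in [0, 1, 12345, 2**31 - 1]:
    assert residue_2n_minus_1(x, n) == x % (2**n - 1)
```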
RNS Arithmetic
CNNs perform a large number of multiply and accumulate operations throughout both the convolutional and the fully-connected layers. To fully exploit the advantages of transforming the network into RNS, efficient modulo arithmetic circuits must be designed. This section presents efficient modulo multiplication and addition implementations using the same moduli set as the comparison, namely {2^{n_1} ± 1, 2^{n_2} ± 1}.
Modulo Addition
The main advantage of RNS arithmetic is that each residue is operated on separately and in parallel, with no carry propagation from one residue to another. Our conjugate moduli set requires modulo (2^n − 1) as well as modulo (2^n + 1) addition and multiplication.
First, modulo (2^n − 1) addition reduces to a conventional n-bit addition if the sum is less than the modulus, while a correction is added if the sum overflows the modulus:

(A + B) mod (2^n − 1) = A + B, if A + B < 2^n − 1; A + B − (2^n − 1), otherwise.
The output carry (c_out) of an n-bit adder detects this overflow condition and determines whether the sum should be incremented; this carry can therefore be fed back into the adder, as proposed by (Zimmermann, 1999), to perform (2^n − 1) addition:
(A + B) mod (2^n − 1) = (A + B + c_out) mod 2^n

Second, a similar analysis can be done for modulo (2^n + 1) addition to show that

(A + B + 1) mod (2^n + 1) = (A + B + c̄_out) mod 2^n,

where diminished-1 numbers can be used for the inputs, or a correction circuit can be added at the output to account for the extra '1'.
Since addition in both moduli depends on the output carry, fast parallel-prefix adders are the most suitable implementation for the modulo adders. Figure 2 shows a modulo parallel-prefix adder in which the inputs are first preprocessed into carry-generate and carry-propagate signals, (g, p) = (ab, a ⊕ b), and a tree of prefix operators then propagates the carry in only 4 levels. Each solid circle represents a dot operation that combines the group carry-generate and carry-propagate bits. Finally, a modulo end-around carry correction feeds the output carry (c_out), or its complement (c̄_out), back into the adder according to the designated modulus.

Figure 2. Parallel prefix adder for mod 2^7 − 1 addition.
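The end-around-carry behavior is simple to model in software. The sketch below (our own naming, n = 7 assumed) implements the (A + B + c_out) mod 2^n identity and checks it exhaustively, keeping in mind that 0 and 2^n − 1 denote the same residue:

```python
# Behavioral model of modulo (2^n - 1) addition with an end-around carry:
# the carry-out of the plain n-bit addition is fed back in as a +1 correction.
def add_mod_2n_minus_1(a, b, n=7):
    s = a + b
    c_out = s >> n                      # overflow of the n-bit addition
    return (s + c_out) & (2**n - 1)     # (A + B + c_out) mod 2^n

n, m = 7, 2**7 - 1
for a in range(m):
    for b in range(m):
        # 2^n - 1 and 0 are the same residue class, hence the final "% m".
        assert add_mod_2n_minus_1(a, b, n) % m == (a + b) % m
```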
Modulo Multiplication
N-bit binary multiplication relies on generating N partial products and accumulating them to produce the final product. Modulo multiplication relies on the same concept, together with the periodicity of the binary weights, which causes the higher-order partial products to rotate and fold back into n-bit weights. Therefore, similar to (Zimmermann, 1999), the partial products for the (2^n − 1) modulus can be generated as

PP_i = b_i · (A << i),

where the << operator represents a circular shift by i bit positions. Similarly, an expression for the partial products of the (2^n + 1) modulus can be derived; there, the wrapped-around bits are complemented rather than simply rotated, since 2^n ≡ −1 (mod 2^n + 1).
Such multipliers can be designed in a modular way, where one block generates the required partial products according to the selected modulus. A modulo carry-save adder tree, shown in Figure 3, then reduces the partial products to a redundant sum output (P_C, P_S), which is finally added with a modulo adder to produce the product.
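For the 2^n − 1 case, the rotated-partial-product scheme can be modeled directly in software, as in the sketch below (illustrative helper names; the carry-save tree is replaced by a simple loop of modulo additions):

```python
# Modulo (2^n - 1) multiplication from circularly shifted partial products:
# PP_i = b_i * (A rotated left by i), accumulated with end-around-carry adders.
def add_mod(a, b, n):
    s = a + b
    return (s + (s >> n)) & (2**n - 1)

def rotl(x, i, n):
    i %= n
    return ((x << i) | (x >> (n - i))) & (2**n - 1)

def mul_mod_2n_minus_1(a, b, n=7):
    prod = 0
    for i in range(n):
        if (b >> i) & 1:                 # select the i-th partial product
            prod = add_mod(prod, rotl(a, i, n), n)
    return prod

n, m = 7, 2**7 - 1
for a in range(0, m, 5):
    for b in range(0, m, 7):
        assert mul_mod_2n_minus_1(a, b, n) % m == (a * b) % m
```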
RNS Power Consumption
We built several building blocks for RNS-based network inference using Bluespec SystemVerilog and synthesized them in a commercial LP65nm CMOS process. Table 2 shows the power and operating frequency of the RNS blocks as well as their 32-bit counterparts. It is worth noting that the RNS multiplier consumes almost half the power of the 32-bit multiplier, with positive slack allowing a higher operating frequency.
Maintaining a Modulus Integer Network
A limitation of using RNS is the need to maintain positive integer weights and activations within a given modulus M. To demonstrate the feasibility of this, and to obtain a rough estimate of the accuracy degradation incurred by imposing such constraints, we train different flavors of an 8-layer (7 convolutional / 1 fully-connected) network on the Street View House Numbers (SVHN) dataset (Netzer et al., 2011). We denote a (W, A)-FP/INT network as one with W-bit weights and A-bit activations in either floating point or integer, respectively. Note that negative integers are interpreted as their respective positive values under the wrap-around modulus.
We first trained (32, 32)-FP. We then used a set of shadow floating-point weights, initialized from (32, 32)-FP, and truncated these shadow weights in the forward pass to generate our (6, 6)-FP network (gradients are passed back to the shadow weights). In our (32, 32)-INT and (6, 6)-INT networks, we modify this truncation operation into a suitable affine transformation that fits our bit width and desired range. Note that the activation function in the integer networks changes to a comparison with M/2. Networks were trained for 15 epochs with data augmentation and 50% last-layer dropout, selecting the model checkpoint with the highest validation accuracy. Note that a 6-bit integer fits within each modulus of our RNS representation. As expected, reducing bit width increases error, and moving to integer networks appears to slightly decrease accuracy. The precise reason for the latter is unclear; perhaps the gradient flow through the integer truncation is adversely affected.
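A minimal TensorFlow sketch of the shadow-weight idea is shown below. This is our reading of the scheme rather than the actual training code: the bit width, range mapping, and all names are illustrative, and the loss is a stand-in.

```python
import tensorflow as tf

def to_int_codes(w, bits=6):
    """Affine-map float weights onto integer codes in [0, 2^bits), with a
    straight-through gradient so updates reach the full-precision shadow copy."""
    levels = 2**bits - 1
    lo, hi = tf.reduce_min(w), tf.reduce_max(w)
    q = tf.round((w - lo) / (hi - lo + 1e-8) * levels)
    # Forward pass sees q; backward pass treats the mapping as the identity.
    return w + tf.stop_gradient(q - w)

shadow_w = tf.Variable(tf.random.normal([3, 3, 16, 16]))   # shadow FP weights
with tf.GradientTape() as tape:
    w_int = to_int_codes(shadow_w)           # integer codes used in the forward pass
    loss = tf.reduce_sum(tf.square(w_int))   # stand-in for the network loss
grads = tape.gradient(loss, shadow_w)        # gradients flow to the shadow weights
```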
Break-Even Analysis

E_X denotes the energy per operation of hardware block X. Comparing the per-layer energy of the RNS blocks (including the ReLU comparator overhead) against their 32-bit counterparts, and substituting our simulation results, yields the break-even condition discussed below.
This suggests that energy savings through RNS may be achievable on FC layers of any size, because of the ratio of our ReLU overhead to per-MAC savings. It is demonstrable that the same result applies to a convolutional layer, where the layer-size term is replaced by C_in · K_X · K_Y, the size of one output channel's filter. Note that this estimate ignores the cost of memory accesses; however, because of our choice of moduli, the amount of data moved is similar in both systems.
Conclusion
In this work, we outlined the use of the Residue Number System to perform inference on neural networks. Using power estimates from our single-block implementations, we presented a theoretical analysis of the advantages of RNS for an end-to-end system.
Critique and Future Improvements
As a next step, we aim to demonstrate end-to-end inference by stringing together our RNS blocks and comparing the resulting system with a non-RNS baseline. It would also be useful to explore other methods of translating networks into the integer domain and to measure accuracy degradation on more networks and datasets. By experimenting with the choice of adder design, we suspect we can further reduce the power of our RNS multipliers and comparator.
