In this brief, we present a novel inner product (IP) design for stochastic computing (SC). SC is an emerging computing technique that encodes a number in the probability of observing a one in a random bit stream. This leads to reduced hardware costs and high error tolerance. The proposed IP design is based on a two-line bipolar encoding format and applies sequential processing of the input in a central accumulation unit. Sequential processing significantly increases the computation accuracy, since it allows for preliminary cancelation of carry bits. Moreover, the central accumulation unit scales much better than conventional adder tree approaches. We show that the proposed IP design outperforms a state-of-the-art design in terms of hardware costs for high accuracy requirements and in terms of fault tolerance.
I. INTRODUCTION
STOCHASTIC computing (SC) is an emerging computing technique that encodes a real-valued number into a random bit stream [1], representing the number as the probability of observing the bit one. This representation allows for a low-complexity implementation of basic arithmetic operations, using only a few logic gates. For instance, the complex multiplier used in conventional binary computing can be replaced by an AND gate in SC. Moreover, compared to the binary radix representation, the stochastic representation has a high degree of error tolerance [2]. SC has been successfully applied in many areas, including decoding of error detection codes, control systems, image processing, filter design, and neural networks (e.g., [2], [3] and references therein). In many of these applications the inner product (IP) is a main building block and, thus, an implementation with low hardware effort and high accuracy is desired. In particular, in neural networks IPs are used to model the operation of the neurons [4]. Moreover, the FIR filter operation [5] and the DFT/FFT computation [6] are based on the IP of two vectors.
A straightforward implementation of the IP using an adder tree with multiplexer-based scaled adders scales down the result, causing severe accuracy loss, especially for large vectors. To overcome the scaling problem, stochastic to binary conversion was applied in [7] and an integer form of SC was proposed in [8]. The former approach suffers from additional hardware costs, and the result cannot be used by subsequent blocks without binary to stochastic conversion. The latter approach uses integer stochastic streams, which reduces the stream lengths but puts less emphasis on fault tolerance and hardware costs (integer stream generation, binary radix adder/multiplier). Recently, two approaches have been proposed that address the scaling issue within the binary stochastic domain [5], [6].
In [5], an adder tree implementation with multiplexer-based adders using uneven weights is presented. This reduces the downscaling factor or even scales up the result for certain input values. Unfortunately, the computation of the weights is very complex, and thus they are often precalculated, requiring at least one constant input vector. Moreover, for large vectors the accuracy of the result still degrades due to the growing scaling factor. In [6], the scaled adders are replaced by counter-based non-scaled adders using the two-line signed magnitude format [9]. It was shown that, when applied to a DFT/FFT implementation, this approach achieves a significantly higher accuracy than the approach in [5]. Although the approach in [6] seems very promising, there are still some shortcomings. The hardware effort is significantly higher compared to [5], since the non-scaled adder requires more logic gates than a simple scaled adder. Moreover, the accuracy of non-scaled adders is based on the preservation of carry bits, which can be improved by increasing the counter length. Hence, to prevent overflow errors in an adder tree, all counter lengths must be increased, leading to poor scalability in terms of hardware effort.
In this brief, we present a novel stochastic IP design. We employ the two-line bipolar format [1], enabling a simpler design and achieving higher accuracy compared to [6]. In addition, we propose simple conversion circuits between the two-line bipolar format and the signed magnitude format used in [6], making the proposed implementation also applicable to the signed magnitude format. Instead of an adder tree, we use sequential processing of the input in a central accumulation unit, which is realized by a shift-register-based non-scaled adder. The use of a central accumulation unit significantly increases the scalability compared to [6]. Moreover, sequential processing together with the two-line bipolar representation allows for preliminary cancelation of carry bits. This approach reduces the probability of an overflow in the carry register and gives high-accuracy results.
II. STOCHASTIC COMPUTING FORMATS
In this section, we provide an overview of the single- and two-line encoding formats used in SC.
A. Single-Line Encoding Formats
Unipolar Format: In the unipolar format, the value of a deterministic number x ∈ [0, 1] is encoded in a stochastic bit stream X of length L by $x = \frac{1}{L} \sum_{l=1}^{L} X[l]$ [1], where X[l] ∈ {0, 1} denotes the lth bit of the bit stream X. Based on this format, basic arithmetic operations can be implemented using simple logic gates (e.g., AND gate for multiplication) [1].
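The unipolar encoding and the AND-gate multiplication can be illustrated with a minimal software sketch. The function names, the stream length, and the use of a seeded pseudo-random generator are illustrative assumptions and not part of the original design, which generates streams in hardware.

```python
import random

def encode_unipolar(x, length, rng):
    """Return a bit stream whose bits are 1 with probability x (0 <= x <= 1)."""
    return [1 if x > rng.random() else 0 for _ in range(length)]

def decode_unipolar(stream):
    """Estimate the encoded value as the fraction of ones in the stream."""
    return sum(stream) / len(stream)

def multiply_unipolar(a, b):
    """Bitwise AND of two independent unipolar streams encodes the product."""
    return [p & q for p, q in zip(a, b)]

rng = random.Random(0)   # fixed seed, only for reproducibility (assumption)
L = 10_000               # stream length chosen for illustration
X = encode_unipolar(0.6, L, rng)
Y = encode_unipolar(0.5, L, rng)
# The two printed estimates should be close to 0.6 and 0.6 * 0.5 = 0.3.
print(decode_unipolar(X), decode_unipolar(multiply_unipolar(X, Y)))
```

The estimates fluctuate around the exact values, and the fluctuation shrinks as the stream length grows.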
Bipolar Format: In contrast to the unipolar format, the bipolar format can also represent negative numbers. This is accomplished through a different interpretation of the stochastic stream. In this case a number x ∈ [−1, 1] can be represented by a bit stream X of length L by $x = \frac{2}{L} \sum_{l=1}^{L} X[l] - 1$. Similar to the unipolar format, the circuits for basic arithmetic operations are very simple [1].
B. Two-Line Encoding Formats
Signed Magnitude Format: In the signed magnitude (SM) format, the sign and magnitude information of a number x ∈ [−1, 1] is carried by the bit streams $X_s$ and $X_m$, respectively. Hence, x can be represented as $x = \frac{1}{L} \sum_{l=1}^{L} (1 - 2 X_s[l]) X_m[l]$ [9]. The lth bits of the bit streams $X_m$ and $X_s$ are denoted by $X_m[l]$ and $X_s[l]$, respectively. Although the hardware effort for basic arithmetic operations is higher compared to the unipolar and bipolar formats [9], the SM format enables an efficient implementation of a non-scaled adder [6]. Non-scaled adders are very important if multiple successive additions are required (e.g., in an adder tree), since they avoid downscaling. Thus, non-scaled adders are crucial building blocks for an IP design.
Two-Line Bipolar Format: The two-line bipolar (TLB) format uses a different interpretation of the bit streams compared to the SM format. In particular, a number x ∈ [−1, 1] is interpreted as the difference between the numbers $x_p \in [0, 1]$ and $x_n \in [0, 1]$, which are encoded as unipolar bit streams $X_p$ and $X_n$. Hence, x can be represented as $x = x_p - x_n = \frac{1}{L} \sum_{l=1}^{L} (X_p[l] - X_n[l])$ [1]. The lth bits of the bit streams $X_p$ and $X_n$ are denoted by $X_p[l]$ and $X_n[l]$, respectively. Similar to the SM format, the circuits for the basic arithmetic operations are slightly more complex than for the unipolar and bipolar formats [1], but the TLB format also enables an efficient non-scaled adder implementation (see Fig. 3).
It is important to note that the TLB and SM formats can easily be converted into each other. The corresponding conversion circuits are shown in Fig. 1.
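Since the conversion circuits of Fig. 1 are not reproduced in the text, the following is only one plausible bit-level sketch of such a conversion. It assumes that a sign bit of 1 marks a negative value and that a TLB pair with both lines set represents zero; the function names are hypothetical.

```python
def tlb_to_sm(xp, xn):
    """Map one TLB bit pair (xp, xn) to an SM pair (sign, magnitude)."""
    mag = xp ^ xn            # (1, 1) carries the value 0, so the magnitude is 0
    sign = xn & (1 - xp)     # assumed convention: sign bit 1 means negative
    return sign, mag

def sm_to_tlb(sign, mag):
    """Map one SM pair (sign, magnitude) back to a TLB bit pair (xp, xn)."""
    xp = mag & (1 - sign)    # positive unit goes to the p line
    xn = mag & sign          # negative unit goes to the n line
    return xp, xn

# Both conversions preserve the signed value of every bit pair.
for xp in (0, 1):
    for xn in (0, 1):
        s, m = tlb_to_sm(xp, xn)
        print((xp, xn), "->", (s, m), "value", (1 - 2 * s) * m)
```

Under these assumptions each direction needs only a few gates per bit, which is in line with the remark that the conversion is simple.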
III. TLB BUILDING BLOCKS
In this section, we present the realization of a multiplier and a non-scaled adder for the TLB format. These are the crucial components for the stochastic IP implementation presented in the next section.
A. Multiplier
A first multiplier circuit for the TLB format has been proposed in [1]. In Fig. 2 we present an alternative multiplier circuit. The core circuit corresponds to the multiplier for the SM format (XOR and AND gate) [6], and the interface corresponds to the conversion circuit between the TLB and SM formats shown in Fig. 1. It is important to note that the presented circuit is only used for illustration purposes and that a simpler design can be obtained through logic optimization.
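The structure described above (TLB to SM interface, XOR/AND core, SM to TLB interface) can be sketched at the bit level as follows. The gate-level form of the interface is an assumption, since Fig. 1 and Fig. 2 are not reproduced here, and the function name is hypothetical.

```python
def tlb_multiply_bit(xp, xn, yp, yn):
    """Multiply two TLB bit pairs and return the TLB product pair (vp, vn)."""
    # TLB -> SM interface (assumed form; sign bit 1 means negative)
    xm, xs = xp ^ xn, xn & (1 - xp)
    ym, ys = yp ^ yn, yn & (1 - yp)
    # SM core: AND gate for the magnitudes, XOR gate for the signs
    vm = xm & ym
    vs = xs ^ ys
    # SM -> TLB interface
    return vm & (1 - vs), vm & vs

# Exhaustive check that the bit-level product matches (xp - xn) * (yp - yn).
ok = all(
    tlb_multiply_bit(xp, xn, yp, yn)[0] - tlb_multiply_bit(xp, xn, yp, yn)[1]
    == (xp - xn) * (yp - yn)
    for xp in (0, 1) for xn in (0, 1) for yp in (0, 1) for yn in (0, 1)
)
print(ok)  # True
```

The exhaustive check covers all 16 input combinations, so the sketch is a correct signed multiplier for single bit pairs under the stated conventions.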
B. Non-Scaled Adder
To the best of our knowledge, only scaled adders have been proposed for the TLB format (see [1]). Thus, we present a novel shift-register-based non-scaled adder, as shown in Fig. 3. The circuit consists of an update logic and carry shift registers $p_c$ and $n_c$, each of size M. The update logic must consider many different cases, including the preservation and cancellation of carry bits in the carry shift registers. For example, if X[l] and Y[l] are both 1 or both −1, Z[l] is 1 or −1, respectively, and a carry 1 ($p_c$ shift in) or −1 ($n_c$ shift in) should be stored in the carry shift registers for the next calculation. However, it is also possible that the current carry bit cancels a stored carry bit from a previous calculation, e.g., a generated carry 1 cancels a stored carry −1 ($n_c[1] = 1$). The update logic given in Algorithm 1 takes all these scenarios into account. Please note that the shift in operation denotes that a one (carry bit) is shifted into the register on one side, while the shift out operation denotes that a zero is shifted into the register on the other side, i.e., a carry bit is shifted out of the register.
Fig. 4. Architecture of the novel stochastic inner product design.
Algorithm 1 Update Logic for Non-Scaled Adder
Input: X, Y
Initialization: $n_c = 0$, $p_c = 0$
1: for i = 1 to L do
2: if … [1]; $p_c$ and $n_c$ shift out
4: … [1]; $n_c$ shift out
6: …
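The behavior of a non-scaled adder with carry preservation and cancellation can be approximated by the following behavioral sketch. It is not a reconstruction of Algorithm 1: the two carry shift registers are modeled by a single signed count limited to ±M (an assumption), the inputs are given as signed bit values in {−1, 0, 1}, and the function name is hypothetical.

```python
def nonscaled_add(X, Y, M):
    """Add two signed bit streams without scaling, storing at most M carries."""
    carry = 0   # net stored carries: positive stands for p_c, negative for n_c
    Z = []
    for x, y in zip(X, Y):
        total = x + y + carry               # a new carry may cancel a stored one
        z = max(-1, min(1, total))          # at most one unit leaves per cycle
        carry = max(-M, min(M, total - z))  # saturation models register overflow
        Z.append(z)
    return Z

print(nonscaled_add([1, 1, 0, 0], [1, 1, 0, 0], 4))   # [1, 1, 1, 1]
print(nonscaled_add([1, 1, 0, 0], [1, 1, 0, 0], 1))   # [1, 1, 1, 0]
print(nonscaled_add([1, -1], [1, -1], 4))             # [1, -1]
```

In the first call the stored carries are emitted in later cycles, so the output sum equals the input sum. In the second call the register is too short and one carry is lost, which illustrates why the carry register length controls the accuracy. The third call shows a stored positive carry being canceled by a later negative contribution.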
IV. STOCHASTIC INNER PRODUCT DESIGN
In this section, we present the stochastic IP implementation. The architecture is shown in Fig. 4, including a multiplier stage, input shift registers with carry canceling, and an accumulation stage. For the following description we consider the computation of the IP between the vectors $\mathbf{x} = [x_1, \ldots, x_K]^T$ and $\mathbf{y} = [y_1, \ldots, y_K]^T$ given by
$$z = \mathbf{x}^T \mathbf{y} = \sum_{k=1}^{K} x_k y_k \qquad (1)$$
with $x_k, y_k \in [-1, 1]$. The numbers $x_k$ and $y_k$ are encoded in the stochastic bit streams $(X_{p,k}, X_{n,k})$ and $(Y_{p,k}, Y_{n,k})$ using the TLB format.
A. Multiplier Stage
This stage performs the multiplication of the individual entries of the input vectors, i.e., $v_k = x_k y_k$, using K stochastic multipliers as shown in Fig. 2. Each multiplier has the streams $(X_{p,k}, X_{n,k})$ and $(Y_{p,k}, Y_{n,k})$ at its input and generates the streams $(V_{p,k}, V_{n,k})$. The individual bits of the output streams are stored for one clock cycle of the main clock in the input hold registers $p_h$ and $n_h$, respectively. These registers prevent intermediate results from propagating from the main clock domain (multiplier stage) into the higher clock domain (input shift registers, accumulation stage).
B. Input Shift Registers With Carry Canceling
Upon a rising edge of the main clock, the elements of the input hold registers $p_h$ and $n_h$ are copied into the input shift registers $p_s$ and $n_s$, following the mapping $p_h[1] \to p_s[1]$, $p_h[2] \to p_s[2]$, etc., and $n_h[1] \to n_s[K]$, $n_h[2] \to n_s[K-1]$, etc. This type of mapping increases the probability that ones are canceled by the so-called carry canceler (CC). The aim of the CC is to reduce the number of ones that are shifted towards the accumulation stage, which reduces the probability of an overflow of the carry shift registers. Hence, this improves the accuracy of the IP calculation. The CC circuit is shown in Fig. 4, where the outputs are zero if both inputs are one and otherwise the outputs follow the inputs. The diagonal elements of the input shift registers are connected by the CC (see Fig. 4) and, thus, the value of the kth register element after the shifting operation is given by
where $\overline{(\cdot)}$ denotes the negation operator. Please note that (2) corresponds to the Boolean function of the CC. In particular, the canceling procedure is as follows: Upon a rising edge of the higher clock, the CC output is written into the next register element. This corresponds either to shifting the value of the previous element to the next element (normal shift operation) or writing zeros (carry canceling).
Fig. 5. Average probability that ones are shifted towards the accumulation stage $P_p$, $P_n$ versus the input shift register length K, assuming that the probability that ones are copied from the input hold registers to the input shift registers is 0.5.
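The Boolean behavior of the CC stated above (both outputs are zero if both inputs are one, otherwise the outputs follow the inputs) can be written as a small sketch. The function and argument names are hypothetical, and the wiring of the CC between specific register elements is not modeled.

```python
def carry_canceler(p_in, n_in):
    """Return (p_out, n_out): a simultaneous +1 and -1 cancel to (0, 0)."""
    p_out = p_in & (1 - n_in)   # p passes only if n is not set
    n_out = n_in & (1 - p_in)   # n passes only if p is not set
    return p_out, n_out

for p in (0, 1):
    for n in (0, 1):
        print((p, n), "->", carry_canceler(p, n))
```

Canceling a pair of ones leaves the net value p − n unchanged, so the CC reduces the number of ones reaching the accumulation stage without altering the represented sum.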
The elements $p_s[1]$ and $n_s[1]$ are sequentially shifted to the accumulation stage using a higher clock compared to the main clock. Please note that the input shift registers $p_s$ and $n_s$ are shifted in opposite directions (see Fig. 4), which reduces the probability that ones are shifted to the accumulation stage compared to shifting in the same direction. This is because, when shifting in the same direction, the CC only has an effect after the first shifting operation. We validated the impact of the shifting direction through bit-true simulations. To this end, we evaluated the average probability that ones are shifted towards the accumulation stage during sequential processing of the entire input shift registers $p_s$ and $n_s$. This average probability, denoted $P_x$ with $x \in \{p, n\}$ and $x_s \in \{p_s, n_s\}$, is the average of $\Pr(x_s^{(j)}[1] = 1)$ over the shift operations j, where $\Pr(x_s^{(j)}[1] = 1)$ denotes the probability that a one is shifted towards the accumulation stage after the jth shift operation. The results are shown in Fig. 5, confirming that for K ≥ 2 shifting in the opposite direction should be preferred to shifting in the same direction.
C. Accumulation Stage
The accumulation stage corresponds to a shift-register-based non-scaled adder (see Section III-B), which accumulates the output of the input shift registers in the carry shift registers. Similar to the non-scaled adder, the accumulation stage considers many different scenarios, including the preservation and cancellation of carry bits. The corresponding algorithm is given in Algorithm 2. 
It is important to note that the sequential processing of the input shift registers must be finished upon the next rising edge of the main clock. Then, the input shift registers are loaded with the next inputs from the input hold registers. Moreover, the entries $p_c[1]$ and $n_c[1]$ of the carry shift registers are shifted to the output flip-flops corresponding to the lth bit in the output stochastic streams, i.e., $(Z_p[l], Z_n[l])$.
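The interplay of the stages can be approximated by a purely behavioral simulation. The sketch below makes several simplifying assumptions that go beyond the text: the carry shift registers are modeled by one signed count limited to ±M, the carry canceler and the register-level details of Algorithm 2 are omitted, each number is encoded with only one active TLB line, and all names are hypothetical.

```python
import random

def encode_tlb(x, length, rng):
    """Encode x in [-1, 1] as TLB streams (Xp, Xn) with one active line."""
    p, n = max(x, 0.0), max(-x, 0.0)
    Xp = [1 if p > rng.random() else 0 for _ in range(length)]
    Xn = [1 if n > rng.random() else 0 for _ in range(length)]
    return Xp, Xn

def stochastic_inner_product(xs, ys, length, M, rng):
    """Behavioral estimate of sum(x_k * y_k); M bounds the stored carries."""
    sx = [encode_tlb(x, length, rng) for x in xs]
    sy = [encode_tlb(y, length, rng) for y in ys]
    carry = 0    # net content of the carry registers, kept in [-M, M]
    total = 0    # running sum of the signed output bits Z[l]
    for l in range(length):                        # one main clock cycle per bit
        for (Xp, Xn), (Yp, Yn) in zip(sx, sy):     # K cycles of the higher clock
            v = (Xp[l] - Xn[l]) * (Yp[l] - Yn[l])  # product bit in {-1, 0, 1}
            carry = max(-M, min(M, carry + v))     # saturation models overflow
        z = (carry > 0) - (0 > carry)              # one output bit per main cycle
        carry -= z
        total += z
    return total / length

xs = [0.5, -0.25, 0.3, 0.1]
ys = [0.4, 0.6, -0.5, 0.2]
exact = sum(x * y for x, y in zip(xs, ys))
estimate = stochastic_inner_product(xs, ys, 10_000, 6, random.Random(1))
print(exact, estimate)
```

Because only one signed output bit is produced per main clock cycle, the model can represent the result only if the IP lies in [−1, 1], and contributions are lost whenever the stored carries exceed M.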
V. PERFORMANCE ANALYSIS
In this section, we compare the proposed IP design with the state-of-the-art design presented in [6] in terms of resource utilization and fault tolerance for different accuracy requirements. For the comparison, we only consider the IP calculation and omit the costs for the stochastic stream generation and the back conversion, since they are similar for both approaches.
We define the computation accuracy by the root mean square error (RMSE) given by $\mathrm{RMSE} = \sqrt{\mathrm{mean}(|\hat{z} - z|^2)}$, where z denotes the true IP result (double-precision floating point) and $\hat{z}$ corresponds to the result of the particular stochastic implementation. The accuracy is controlled by the carry shift register length and the counter length for the novel and the state-of-the-art design, respectively. For all investigations we fixed the length of the stochastic stream to $L = 10^4$.
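As a small illustration of the accuracy metric, the following sketch computes the RMSE in its standard form, i.e., the square root of the mean squared difference between estimates and true values. The function name and the sample numbers are hypothetical and do not stem from the evaluation.

```python
import math

def rmse(estimates, truths):
    """Square root of the mean squared error between two equal-length lists."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical stochastic results versus their floating-point references.
z_hat = [0.21, -0.09, 0.48]
z_ref = [0.20, -0.08, 0.50]
print(rmse(z_hat, z_ref))
```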
We determined the resource utilization for both implementations through synthesis for an Altera Cyclone IV EP4CE115 FPGA. Figs. 6 and 7 show the minimum number of logic elements (combinational logic) and registers that are required to achieve a certain computation accuracy. We observe from Fig. 6 that the logic element utilization of the proposed design is much lower than that of the state-of-the-art design, especially for large input vectors and if high computation accuracy is required. Moreover, we observe from Fig. 7 that if low accuracy is sufficient, the state-of-the-art approach outperforms the novel design in terms of register utilization. This is because the approach in [6] requires no hold circuit at the input (input hold registers) or sequential processing storage (input shift registers). Interestingly, for the novel design the logic element and register utilization is almost independent of the accuracy requirements, while it increases for the state-of-the-art implementation. This means that for the proposed design the additional hardware effort (larger carry shift registers) to achieve a better accuracy is insignificant.
Fig. 8 compares the fault tolerance of the novel and the state-of-the-art IP design. To this end, we randomly flipped a bit in the carry shift registers or the counters in the adder tree with probability $P_{\mathrm{flip}}$. This approach gives a good approximation of the fault tolerance for the entire design, since failures in the storage can also be interpreted as bit flips coming from the combinational logic. We used input vectors of length K = 16 and started with the computation accuracy RMSE = 0.02. This requires a carry shift register length of 6 and a counter length of 4. We observe that the proposed design is much more robust against bit flips than the state-of-the-art implementation. This is mainly because we use a shift-register-based approach rather than a counter-based approach.
The high clock domain of the novel design can operate at 315.36 MHz, which is near the maximum platform speed, but the main clock is reduced by the input vector length K, i.e., to 19.7 MHz for K = 16. The design in [6] has an operating frequency of 143.62 MHz when synthesized for the FPGA mentioned above. For the stream length $L = 10^4$, this results in a latency of 0.51 ms for the proposed design and 0.07 ms for the design in [6]. Based on these results it is important to note that although the proposed IP design is much slower than the design in [6], it has a significantly higher fault tolerance, provides high-accuracy results, and has lower hardware costs. Thus, it is suitable for applications where these features are more important than a high throughput. However, please note that the throughput can be increased through parallelization of the sequential processing step, using multiple IP cores.
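The reported clock and latency figures can be reproduced with a short calculation. It assumes that each design processes one stream bit per cycle of its main (respectively operating) clock and that the latency is the time needed for a full stream of length 10^4; the function and variable names are hypothetical.

```python
def stream_latency_ms(stream_length, bit_clock_hz):
    """Time in milliseconds to process the stream at one bit per clock cycle."""
    return stream_length / bit_clock_hz * 1e3

L = 10**4                  # stochastic stream length used in the evaluation
K = 16                     # input vector length
high_clock_hz = 315.36e6   # higher clock of the proposed design
main_clock_hz = high_clock_hz / K          # K sequential steps per stream bit
latency_proposed_ms = stream_latency_ms(L, main_clock_hz)
latency_reference_ms = stream_latency_ms(L, 143.62e6)
print(f"{main_clock_hz / 1e6:.2f} MHz, "
      f"{latency_proposed_ms:.2f} ms, {latency_reference_ms:.2f} ms")
# prints "19.71 MHz, 0.51 ms, 0.07 ms"
```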
VI. CONCLUSION
In this brief, we proposed a novel stochastic inner product design. In contrast to state-of-the-art adder tree implementations, we performed the addition in a central accumulation unit by applying sequential processing of the input. The central accumulation unit increases the scalability, and sequential processing enables preliminary carry canceling, which improves the computation accuracy. Performance analysis revealed that the proposed design significantly reduces the hardware costs for high accuracy requirements and provides a high fault tolerance compared to a state-of-the-art design. Please note that the presented inner product design has been proposed as a crucial building block for a novel stochastic computing estimation architecture [11].
