In stochastic computing (SC), a real-valued number is represented by a stochastic bit stream that encodes its value in the probability of obtaining a one. This leads to a significantly lower hardware effort for various functions and provides a higher tolerance to errors (e.g., bit flips) compared to binary radix representation. The implementation of a stochastic max/min function is important for many areas where SC has been successfully applied, such as image processing or machine learning (e.g., max pooling in neural networks). In this work, we propose a novel shift-register-based architecture for a stochastic max/min function. We show that the proposed circuit has significantly higher accuracy than state-of-the-art architectures for uncorrelated bit streams at comparable hardware costs. Moreover, we analytically prove the correctness of the proposed circuit and provide a new error analysis based on the individual bits of the stochastic streams. Interestingly, the analysis reveals that for a given practical bit stream length a finite optimal shift register length exists, and it allows this optimal length to be determined.
Index Terms: Stochastic computing, sequential logic, finite state machine, stochastic max/min function
M. Lunglmayr and D. Wiesinger are with the Institute of Signal Processing, Johannes
INTRODUCTION
STOCHASTIC computing (SC) is a promising computing paradigm that represents a real-valued number by a stochastic stream [1], [2], [3]. The value is encoded by the probability of obtaining a one in the stream. Compared to binary radix representation, the stochastic representation leads to low hardware cost and high fault tolerance to circuit noise and bit flips [4], [5].
In SC, basic arithmetic operations can be realized with simple combinational logic [1]. Moreover, combinational logic can be synthesized to implement arbitrary polynomial functions by rewriting them as Bernstein polynomials with coefficients in the unit interval [6], [7], [8]. However, in order to implement more complex (non-linear) functions, sequential logic is required [4], [9]. In particular, linear finite-state machines (FSMs) have been proposed to implement complex functions; they can be realized with either saturating up/down counters or shift registers [10]. The stochastic exponentiation and the tanh function were presented in [9], and the absolute value as well as exponentiation based on an absolute value were proposed in [4]. Recently, a new synthesis method that allows arbitrary functions to be implemented using FSM-based elements was introduced in [11].
In recent years, SC has been successfully applied to a variety of applications such as decoding of modern error correcting codes [12], [13], [14], control systems [15], [16], image processing [17], [18], [19], filter design [20], [21], and neural networks [9], [22], [23], [24]. Most of these applications exploit the low-complexity circuitry of SC in algorithms that do not require a high numerical precision of the final result.
In many of the aforementioned application domains, the efficient implementation of a stochastic max/min (SMax/SMin) function is very important, especially for neural networks, where such functions are the key element in the max pooling layer [22], [23]. Two architectures for SMax functions¹ have been proposed in the literature [17], [23]. The implementation in [17] is based on a stochastic comparator, which requires a stochastic number generator (SNG) that is usually realized by a linear feedback shift register (LFSR). In order to reduce the overhead of the SNG, an optimized SMax function was proposed in [23]. However, both approaches have only been validated empirically.
Recently, a different methodology to calculate max/min functions for bit streams, relying on the correlation between the bits, has been proposed [25], [26]. Although correlated bit streams allow single logic gates to be used for min/max calculation, other operations such as multiplication rely on uncorrelated streams for a low-complexity implementation. A promising approach to change the correlation of bit streams has been proposed in [27]. This approach first introduces correlation in order to use single gates for the max/min operations and finally decorrelates the bit streams for further usage. The last step is essential since the majority of SC arithmetic units have been proposed for uncorrelated bit streams. The correlation and decorrelation of the bit streams significantly increase the complexity and reduce the complexity benefits of this approach. Thus, we propose a novel shift-register-based max/min circuit for uncorrelated bit streams, which is also applicable to correlated bit streams. In the case of fully correlated streams, the proposed circuit reduces to a single OR gate. The proposed max/min circuit was successfully used in [28], where high precision and error tolerance were required.
The first goal of this paper is to analytically prove the correctness of the SMax functions in [17], [23]. Then, we propose a novel shift-register-based architecture for an SMax/SMin function and analytically prove its correctness. We show that the novel architecture provides a higher accuracy than [17], [23] at comparable hardware cost. We provide a new error analysis of the proposed circuit, considering the individual bits of the stochastic streams. To the best of our knowledge, no such analysis has been done before for an FSM-based stochastic computing element. Based on the error analysis we show that for practical bit stream lengths a finite optimal shift register length exists. Moreover, we determine the optimal shift register size for certain bit stream lengths.
STOCHASTIC COMPUTING BASICS
In this section, we briefly review the main principles of SC and introduce the basic computing elements used in this work. A comprehensive overview of SC can be found in [2], and recent challenges and potential solutions are discussed in [3].
Unipolar Coding Format
In the unipolar coding format,² the value of a deterministic number $x \in [0, 1]$ is encoded in a stochastic bit stream $X$ of length $N$. The individual bits in the stochastic stream are denoted by $X[i] \in \{0, 1\}$. The probability for each bit in the stream to be one is given by $x = P_X = P(X[i] = 1)$. In practical realizations, the rate of ones in the stochastic bit stream is used to represent the number $x$:

$$r(X) = \frac{o(X)}{N}, \tag{1}$$

where $o(X) = \sum_{i=1}^{N} X[i]$ denotes the number of ones in the stream. The precision (representation resolution) of the unipolar format is given by $1/N$. Thus, $r(X) = x$ only if $N \to \infty$; otherwise $r(X)$ is only an approximation of $x$.

1. Both circuits can be easily converted to realize an SMin function.
2. It is important to note that the circuits proposed in this work are also valid for the bipolar format, which enables the representation of negative values [1].
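The unipolar encoding and the rate of ones can be illustrated with a minimal Python sketch. The function names `encode_unipolar` and `rate_of_ones` are hypothetical helpers introduced here for illustration, and a software pseudo-random generator is assumed in place of a hardware stochastic number generator.

```python
import random

def encode_unipolar(x, n, rng):
    """Generate a stochastic bit stream X of length n with P(X[i] = 1) = x."""
    return [1 if rng.random() < x else 0 for _ in range(n)]

def rate_of_ones(stream):
    """Rate of ones r(X) = o(X) / N, where o(X) counts the ones in the stream."""
    return sum(stream) / len(stream)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility of the demo
    x = 0.3
    for n in (16, 256, 4096, 65536):
        r = rate_of_ones(encode_unipolar(x, n, rng))
        # The deviation |r(X) - x| tends to shrink as the stream gets longer.
        print(f"N = {n:6d}  r(X) = {r:.4f}  |r(X) - x| = {abs(r - x):.4f}")
```

The printed deviations depend on the random seed; the sketch only shows that $r(X)$ is an estimate of $x$ whose quality improves with the stream length $N$.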
Combinational Logic-Based SC Elements
Certain arithmetic operations in SC can be implemented by single combinational elements; for example, scaled addition and multiplication can be realized using a multiplexer and an AND gate, respectively. Moreover, an XOR gate with the input streams $A$ and $B$ implements the function $A + B - 2AB$, which involves addition and subtraction. In the following, we briefly explain the principles of the aforementioned operations, as they are the main building blocks of the SMax/SMin functions presented in Sections 3 and 4.
Multiplication
The stochastic multiplication can be implemented using a simple AND gate. If we assume that the input stochastic streams $A$ and $B$ are uncorrelated and have the probabilities $P_A$ and $P_B$, then we have at the output

$$P_C = P_A P_B. \tag{2}$$

For the unipolar format the values encoded by the stochastic streams $A$, $B$ and $C$ are $a = P_A$, $b = P_B$ and $c = P_C$, and thus we obtain $c = ab$.
Scaled Addition
The stochastic circuit for scaled addition is a multiplexer. If we assume uncorrelated stochastic input streams $A$ and $B$ that are also uncorrelated with the stochastic selection stream $S$, and $P_A$, $P_B$ and $P_S$ are their corresponding probabilities, the output can be expressed as

$$P_C = P_S P_A + (1 - P_S) P_B. \tag{3}$$

According to the unipolar format, we substitute $P_A$, $P_B$, $P_C$ and $P_S$ by $a$, $b$, $c$ and $s$, and obtain $c = sa + (1 - s)b$. In order to perform unbiased addition, $s$ is set to $1/2$.
Non-Scaled Addition and Subtraction
If we assume uncorrelated input stochastic streams $A$ and $B$, with the corresponding probabilities $P_A$ and $P_B$, then according to the Boolean function of the XOR gate we have at the output

$$P_C = P_A (1 - P_B) + (1 - P_A) P_B = P_A + P_B - 2 P_A P_B. \tag{4}$$

For the unipolar coding format (i.e., $a = P_A$, $b = P_B$ and $c = P_C$) we can rewrite Equation (4) as $c = a + b - 2ab$.
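The three combinational elements can be checked with a short bit-wise simulation. The following Python sketch is an illustration under stated assumptions: the helper names (`sng`, `rate`, `sc_multiply`, `sc_scaled_add`, `sc_xor`) are hypothetical, the streams are generated by a software pseudo-random generator, and the multiplexer is assumed to pass $A$ when the select bit is one.

```python
import random

def sng(p, n, rng):
    """Software stand-in for a stochastic number generator: n bits with P(1) = p."""
    return [int(rng.random() < p) for _ in range(n)]

def rate(stream):
    """Rate of ones of a bit stream."""
    return sum(stream) / len(stream)

def sc_multiply(A, B):
    """AND gate: for uncorrelated inputs the output encodes c = a * b."""
    return [a & b for a, b in zip(A, B)]

def sc_scaled_add(A, B, S):
    """Multiplexer: passes A when S = 1 and B otherwise, so c = s*a + (1 - s)*b."""
    return [a if s else b for a, b, s in zip(A, B, S)]

def sc_xor(A, B):
    """XOR gate: for uncorrelated inputs the output encodes c = a + b - 2ab."""
    return [a ^ b for a, b in zip(A, B)]

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, only for reproducibility of the demo
    n, a, b, s = 200_000, 0.6, 0.3, 0.5
    A, B, S = sng(a, n, rng), sng(b, n, rng), sng(s, n, rng)
    print("AND:", rate(sc_multiply(A, B)), "expected", a * b)
    print("MUX:", rate(sc_scaled_add(A, B, S)), "expected", s * a + (1 - s) * b)
    print("XOR:", rate(sc_xor(A, B)), "expected", a + b - 2 * a * b)
```

The simulated rates only approximate the expected values, because the streams have finite length.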
FSM-Based SC Elements
Combinational logic can be used to realize polynomial functions of a specific form [6] and to approximate non-polynomial functions, for example using the Maclaurin expansion [8]. However, highly non-linear functions such as the exponential or the tanh function cannot be realized in this way. Hence, FSM-based SC elements have been introduced [4], [9]. Here, we briefly review the stochastic tanh function³ (STanh), which forms the basis for the state-of-the-art SMax/SMin functions presented in Section 3.
Stochastic Tanh Function
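Let us consider a linear FSM with $M$ states $S_0, \ldots, S_{M-1}$ that are arranged in a linear form. The state transition process of the FSM can be modeled as a time-homogeneous, irreducible and aperiodic Markov chain, which has a single steady state. The steady state probability of state $S_i$ is given by [4]

$$P_i = \frac{\left(\frac{P_X}{1 - P_X}\right)^{i}}{\sum_{j=0}^{M-1} \left(\frac{P_X}{1 - P_X}\right)^{j}}, \tag{5}$$

where $P_X$ denotes the transition probability from state $S_i$ to $S_{i+1}$ (state is incremented) and $(1 - P_X)$ is the probability of the transition from state $S_i$ to $S_{i-1}$ (state is decremented). In order to realize the STanh function, the FSM output can be expressed as [9]

$$P_Z = \sum_{i=M/2}^{M-1} P_i. \tag{6}$$

Substituting $P_i$ given in Equation (5) into Equation (6) results in [4]

$$P_Z = \frac{\sum_{i=M/2}^{M-1} \left(\frac{P_X}{1 - P_X}\right)^{i}}{\sum_{j=0}^{M-1} \left(\frac{P_X}{1 - P_X}\right)^{j}}. \tag{7}$$

If we substitute $P_X$ and $P_Z$ by $x$ and $z$ (unipolar coding format) we obtain [4]

$$z = \frac{1}{1 + \left(\frac{1 - x}{x}\right)^{M/2}}, \tag{8}$$

which corresponds to a scaled and shifted tanh function.⁴ For a large number of states $M$, Equation (8) approaches a step function at $x = 1/2$:

$$z \approx \begin{cases} 0, & x < 1/2 \\ 1/2, & x = 1/2 \\ 1, & x > 1/2. \end{cases} \tag{9}$$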
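The relation between the steady state probabilities and the STanh output can be checked numerically. The sketch below assumes the usual STanh convention that the output bit is one whenever the FSM is in the upper half of its states, an even number of states $M$, and $0 < x < 1$; the function names are hypothetical. It compares the sum over the upper-half state probabilities with the closed form obtained by summing the geometric series.

```python
def steady_state(p_up, M):
    """Steady-state probabilities of a saturating linear FSM with M states.

    p_up is the probability of incrementing the state; 1 - p_up is the
    probability of decrementing it. Requires 0 < p_up < 1.
    """
    t = p_up / (1.0 - p_up)
    weights = [t ** i for i in range(M)]
    total = sum(weights)
    return [w / total for w in weights]

def stanh_from_states(x, M):
    """STanh output probability: sum over the upper half of the states (M even)."""
    return sum(steady_state(x, M)[M // 2:])

def stanh_closed_form(x, M):
    """Closed form of the same sum: 1 / (1 + ((1 - x) / x)^(M/2))."""
    return 1.0 / (1.0 + ((1.0 - x) / x) ** (M / 2))

if __name__ == "__main__":
    for M in (4, 8, 32):
        row = [round(stanh_closed_form(x, M), 4) for x in (0.3, 0.45, 0.5, 0.55, 0.7)]
        # With growing M the values approach a step at x = 1/2.
        print(f"M = {M:2d}:", row)
```

Both functions agree up to floating-point rounding, which is a quick sanity check on the geometric-series simplification.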
STATE-OF-THE-ART STOCHASTIC MAX/MIN FUNCTIONS
In this section, we discuss two recently proposed architectures of SMax/SMin functions [17], [23] for uncorrelated bit streams. Moreover, we provide analytical proofs of their correctness, since [17], [23] only provide empirical validations. For the sake of clarity, we focus our analysis on the SMax function, since the presented propositions and proofs can be easily applied to the SMin function.
Stochastic Max/Min Function in [17]
Fig. 1 shows the architecture of the SMax function proposed in [17], which is based on the stochastic comparator (input multiplexer and STanh function). The SMin function is obtained by swapping the input streams at the final multiplexer. The following proposition validates the correctness of the circuit shown in Fig. 1.
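Proposition 1. For uncorrelated input bit streams $A$ and $B$, encoding the values $a = P_A$ and $b = P_B$ (unipolar coding format), the output of the circuit shown in Fig. 1 can be expressed as

$$c = b + \frac{a - b}{1 + \left(\frac{1 - a + b}{1 + a - b}\right)^{M/2}}, \tag{10}$$

where $c = P_C$ denotes the value encoded in the output stream $C$. For $M \to \infty$ the expression in Equation (10) can be written as

$$c = \begin{cases} a, & a > b \\ b, & a \le b, \end{cases} \tag{11}$$

which validates the functionality of the SMax function.

Proof. The output of the first multiplexer, which combines the stream $A$ with the inverted stream $B$, is given by (cf. Equation (3))

$$P_X = P_{S_1} P_A + (1 - P_{S_1})(1 - P_B) \tag{12}$$

with $P_{S_1} = 1/2$. According to Equation (7), the output of the STanh function can be expressed as

$$P_{S_2} = \frac{1}{1 + \left(\frac{1 - P_X}{P_X}\right)^{M/2}} = \frac{1}{1 + \left(\frac{1 - P_A + P_B}{1 + P_A - P_B}\right)^{M/2}}. \tag{13}$$

Finally, the output of the second multiplexer is given by (cf. Equation (3))

$$P_C = P_{S_2} P_A + (1 - P_{S_2}) P_B. \tag{14}$$

When substituting $P_A$, $P_B$ and $P_C$ with their corresponding unipolar coding format values $a$, $b$ and $c$ we obtain

$$c = b + \frac{a - b}{1 + \left(\frac{1 - a + b}{1 + a - b}\right)^{M/2}}. \tag{15}$$

If $M \to \infty$, the denominator in Equation (10) becomes one or infinity depending on whether $a > b$ or $a < b$, respectively (for $a = b$ the numerator vanishes). Thus, $c = a$ or $c = b$, which proves the correctness of the SMax circuit shown in Fig. 1. □

Fig. 2 illustrates the theoretical expression in Equation (10), the bit-wise simulation results of the circuit shown in Fig. 1 and the exact max function. We observe a good match between the theoretical and simulation results (for the simulated stream lengths). Moreover, we observe that already a moderate number of states $M$ provides a good approximation of the max function.

3. In [4], [9] various other FSM-based SC elements are presented as well.
4. Substituting $P_X$ and $P_Z$ in Equation (7) with their bipolar coding format results in $z = \tanh(\tfrac{M}{2} x)$ [4], representing the signum function for $M \to \infty$.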
Stochastic Max/Min Function in [23]
Fig. 3 shows the architecture of the SMax function proposed in [23]. Similar to Section 3.1, the SMin function is obtained by swapping the input streams at the final multiplexer. The following proposition validates the correctness of the circuit shown in Fig. 3.
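Proposition 2. For uncorrelated input bit streams $A$ and $B$, encoding the values $a = P_A$ and $b = P_B$ (unipolar coding format), the output of the circuit shown in Fig. 3 can be expressed as

$$c = b + \frac{a - b}{1 + \left(\frac{b(1 - a)}{a(1 - b)}\right)^{M/2}}, \tag{17}$$

where $c = P_C$ denotes the value encoded in the output stream $C$. For $M \to \infty$ the expression in Equation (17) can be written as

$$c = \begin{cases} a, & a > b \\ b, & a \le b. \end{cases} \tag{18}$$

Proof. The output of the XOR gate can be expressed as (cf. Equation (4))

$$P_D = P_A + P_B - 2 P_A P_B. \tag{19}$$

In contrast to the STanh function presented in Section 2.3.1, the STanh function shown in Fig. 3 has two inputs $A$ and $D$. The stochastic stream $D$ is used to enable the FSM state update and the stream $A$ updates the state according to its value. In particular, if $D[i] = 1$ then the state increases if $A[i] = 1$ and decreases if $A[i] = 0$. Since $D$ is the XOR of $A$ and $B$, these two events correspond to the input patterns $A[i] = 1, B[i] = 0$ and $A[i] = 0, B[i] = 1$. Thus, the probability that the state increases or decreases is given by $P_A (1 - P_B)$ and $(1 - P_A) P_B$, respectively. According to Equation (5), the steady state probability is given by

$$P_i = \frac{t^{i}}{\sum_{j=0}^{M-1} t^{j}}, \qquad t = \frac{P_A (1 - P_B)}{(1 - P_A) P_B}. \tag{20}$$

According to Equation (7), the output of the STanh function can be expressed as

$$P_S = \frac{1}{1 + \left(\frac{(1 - P_A) P_B}{P_A (1 - P_B)}\right)^{M/2}}. \tag{21}$$

Finally, the output at the multiplexer is given by (cf. Equation (3))

$$P_C = P_S P_A + (1 - P_S) P_B. \tag{22}$$

For the unipolar encoding format (i.e., $a = P_A$, $b = P_B$ and $c = P_C$) we have

$$c = b + \frac{a - b}{1 + \left(\frac{b(1 - a)}{a(1 - b)}\right)^{M/2}}. \tag{23}$$

If $M \to \infty$, the denominator in Equation (17) becomes one or infinity depending on whether $a > b$ or $a < b$, respectively (for $a = b$ the numerator vanishes). Thus, $c = a$ or $c = b$, which proves the correctness of the SMax circuit shown in Fig. 3. □

[Fig. 2: SMax function of [17]. Solid lines: theoretical results, Equation (10); markers (+): bit-wise simulation for $N = 10^6$; dotted line: exact max function.]
[Fig. 3: Implementation of the SMax function proposed in [23].]

Fig. 4 illustrates the theoretical expression in Equation (17), the bit-wise simulation results of the circuit shown in Fig. 3, and the exact max function. We observe a good match between the theoretical and simulation results (for the simulated stream lengths). Similar to Fig. 2, we observe that already a moderate number of states $M$ provides a good approximation of the max function. However, in Fig. 4 one can already see a closer match of this approach compared to the SMax function of [17].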
NOVEL STOCHASTIC MAX/MIN FUNCTION: ARCHITECTURE
In this section, we propose a novel architecture for the SMax function as shown in Fig. 5 . This circuit can be easily converted to realize the SMin function by inverting its inputs A and B as well as its output C. Hence, we only consider the SMax function in the following description.
In contrast to the state-of-the-art SMax functions [17], [23], the FSM-based SC element used in the proposed architecture does not implement the STanh function. However, similar to the architecture in [23], it has two inputs $A$ and $D$. Input $D$ enables the FSM state update and input $A$ updates the state according to its value (cf. Section 3.2). FSM-based elements can be implemented using either up/down counters or shift registers. When using a shift register, its length $L$ is equal to the index of the last state of the FSM, i.e., $L = M - 1$. For the novel SMax function we use a shift register, since it has some distinct advantages compared to a counter-based implementation [10]. One advantage is that the values in a shift register are of equal significance, in contrast to a binary counter, where the bits are weighted by different powers of two. This allows more fault-tolerant SC circuits to be designed when using shift registers. Furthermore, as the following description demonstrates, shift registers are naturally suited to implement the described functionality.
Depending on the actual value of the input streams $A$ and $B$, the functionality of the SMax function (cf. Fig. 5) can be described as follows:
$A[i] = B[i]$: Since $D[i] = 0$, the content of the shift register remains unchanged (state is not updated) and the output of the circuit is given by $C[i] = B[i]$.
$A[i] = 0$, $B[i] = 1$: Since $D[i] = 1$ and $A[i] = 0$, a zero is shifted from the right into the shift register (state is decremented) and the output of the circuit is given by $C[i] = B[i]$.
$A[i] = 1$, $B[i] = 0$: Since $D[i] = 1$ and $A[i] = 1$, a one is shifted from the left into the shift register (state is incremented). In this case the rightmost value of the shift register is output by the circuit, i.e., $C[i] = U[i]$.
According to the description above it is important to note that the ones in the stream B also appear in the output stream C.
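The three input cases can be turned into a small bit-wise model. The following Python sketch is an illustration, not the circuit of Fig. 5 itself: it assumes that the shift register content can be tracked by a single integer `state` (the number of ones currently stored, between 0 and $L$), that the register starts empty, and that the bit shifted out on the right is one only when the register is full. The function name `smax_shift_register` is hypothetical.

```python
import random

def smax_shift_register(A, B, L):
    """Bit-wise model of the shift-register-based SMax function.

    state counts the ones stored in a shift register of length L
    (assumption: all-zero initialisation, ones enter from the left).
    """
    state = 0
    C = []
    for a, b in zip(A, B):
        if a == b:
            # D[i] = 0: register unchanged, output follows B.
            C.append(b)
        elif b == 1:
            # A[i] = 0, B[i] = 1: a zero is shifted in (decrement), output is B[i] = 1.
            state = max(state - 1, 0)
            C.append(1)
        else:
            # A[i] = 1, B[i] = 0: a one is shifted in; the rightmost bit is output.
            if state == L:
                C.append(1)   # register was full, so a one leaves on the right
            else:
                state += 1
                C.append(0)   # a zero leaves on the right
    return C

if __name__ == "__main__":
    rng = random.Random(2)  # fixed seed, only for reproducibility of the demo
    n = 100_000
    A = [int(rng.random() < 0.8) for _ in range(n)]
    B = [int(rng.random() < 0.3) for _ in range(n)]
    C = smax_shift_register(A, B, L=8)
    print("r(A) =", sum(A) / n, " r(B) =", sum(B) / n, " r(C) =", sum(C) / n)
```

With $L = 0$ the model degenerates to a bit-wise OR of the two inputs, since every pattern $A[i] = 1$, $B[i] = 0$ then produces a one at the output.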
In the following, we provide two approaches for analyzing the functionality of the proposed SMax function. First, we describe the functionality by considering the individual bits in the stochastic bit stream. Then, similar to Section 3, we prove the correctness of the circuit assuming very long stochastic bit streams. We denote these two methods as deterministic and probabilistic analysis, respectively.
Deterministic Analysis
For the deterministic analysis of the novel SMax function we distinguish the two cases $r(A) \le r(B)$ and $r(A) > r(B)$.
SMax Circuit Behavior for r(A) ≤ r(B)
If $r(A) \le r(B)$, the stream $B$ has at least as many ones as the bit stream $A$. For correct functionality it is desired that the number of ones $o(B)$ in the input stream $B$ and the number of ones $o(C)$ in the output stream $C$ are equal. As discussed above, all ones of stream $B$ are included in the output stream $C$. However, if a subsequence of $A$ has more ones than the corresponding subsequence of $B$, ones of stream $A$ might additionally be injected into the output stream $C$. This occurs if the excess of ones in this subsequence is larger than the shift register length $L$. We refer to such an event as a right overflow of the shift register. Thus, the number of ones in the output stream $C$ can be expressed as

$$o(C) = o(B) + o_R, \tag{25}$$

where $o_R$ denotes the additional number of ones due to the right overflows. If the shift register is sufficiently long, no right overflows occur, i.e., $o(C) = o(B)$.
SMax Circuit Behavior for r(A) > r(B)
If $r(A) > r(B)$, the bit stream $A$ has more ones than the bit stream $B$. For correct functionality, it is desired that $o(A)$, the number of ones in the input stream $A$, and $o(C)$, the number of ones in the output stream $C$, are identical. As above, all ones of $B$ are included in the output stream $C$. In addition, the excess of ones in stream $A$ is shifted into the shift register and, once the shift register is filled, the ones are injected into the output stream $C$ when $B[i] = 0$ and $A[i] = 1$ occurs. However, at the end there might be ones left in the shift register, which are missing in the output stream⁵ $C$. We denote the number of missing ones by $o_S$. Moreover, if a subsequence of $B$ has more ones than the corresponding subsequence of $A$, ones of stream $B$ might additionally be injected into the output stream $C$. This happens if the excess of ones leads to an empty (all-zero) shift register and, thus, an input pattern $B[i] = 1$ and $A[i] = 0$ (assuming a later pattern $B[i] = 0$ and $A[i] = 1$) injects an additional one into the output stream $C$. We refer to this effect as a left overflow of the shift register and denote the number of additional ones due to left overflows by $o_L$. This allows expressing the number of ones in the output stream $C$ as

$$o(C) = o(B) + (o(A) - o(B)) - o_S + o_L, \tag{26}$$

where $(o(A) - o(B))$ denotes the excess of ones in stream $A$ compared to stream $B$. Equation (26) shows the two opposite error effects. On the one hand, the number of left overflows $o_L$ becomes smaller for long shift registers. On the other hand, the error due to the remaining ones in the shift register $o_S$ becomes smaller for short shift registers. In Section 5 we determine the optimal shift register length for a given bit stream length.

This deterministic description also gives insight into how the proposed architecture behaves if correlated bit streams are used. Assuming, for example, perfectly correlated bit streams (i.e., a correlation of 1, using the definition from [26]) and $r(B) \ge r(A)$, stream $B$ is output by the architecture without any errors. This is because all ones of $B$ are output by the circuit and, due to perfect correlation, no zeros of $B$ are present at one-positions of $A$. If the bit streams are perfectly correlated and $r(B) < r(A)$, all bits of stream $A$ in excess of stream $B$ propagate through the shift register, leaving only the bits remaining in the shift register at the end of the bit streams as errors. Hence, to obtain no errors, the optimal memory length is zero for perfectly correlated streams. In particular, by replacing the FSM element in Fig. 5 by a single AND gate (representing the enable functionality without memory), it can be easily verified that the resulting circuit performs an OR gate functionality. This is the maximum circuit for perfectly correlated bit streams, as described in [26]. That is, when processing perfectly correlated bit streams, our design (without memory) would be implemented directly as an OR gate.

5. We assume the same length for all stochastic streams, which is a typical assumption in SC.
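The bookkeeping behind Equations (25) and (26) can be checked on simulated streams. The sketch below makes the following accounting assumptions, which are not spelled out in this form in the text: the shift register content is tracked as a counter that starts at zero, a right overflow is counted whenever a one is injected from a full register, a left overflow is counted whenever a decrement is requested on an empty register, and $o_S$ is the counter value at the end of the stream. The function name `smax_counts` is hypothetical.

```python
import random

def smax_counts(A, B, L):
    """Return (o_C, o_R, o_L, o_S) for the shift-register-based SMax model."""
    state = 0          # number of ones currently stored in the shift register
    o_C = o_R = o_L = 0
    for a, b in zip(A, B):
        if a == b:
            o_C += b               # register unchanged, output follows B
        elif b == 1:
            o_C += 1               # A = 0, B = 1: the one of B is output
            if state == 0:
                o_L += 1           # left overflow: decrement on an empty register
            else:
                state -= 1
        else:
            if state == L:
                o_R += 1           # right overflow: a one of A is injected
                o_C += 1
            else:
                state += 1
    return o_C, o_R, o_L, state    # final state = ones left in the register (o_S)

if __name__ == "__main__":
    rng = random.Random(4)  # fixed seed, only for reproducibility of the demo
    n = 10_000
    A = [int(rng.random() < 0.7) for _ in range(n)]
    B = [int(rng.random() < 0.4) for _ in range(n)]
    for L in (0, 2, 8, 32):
        o_C, o_R, o_L, o_S = smax_counts(A, B, L)
        print(f"L = {L:2d}: o(A) = {sum(A)}, o(C) = {o_C}, o_L = {o_L}, o_S = {o_S}")
```

Under these accounting assumptions both identities hold exactly for any pair of equally long streams, so the difference between $o(C)$ and $o(A)$ is fully explained by $o_L$ and $o_S$.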
Probabilistic Analysis
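The following proposition validates the correctness of the circuit shown in Fig. 5.

Proposition 3. For uncorrelated input bit streams $A$ and $B$, encoding the values $a = P_A$ and $b = P_B$ (unipolar coding format), the output of the circuit shown in Fig. 5 can be expressed as

$$c = b + \frac{a - b}{1 - \left(\frac{b(1 - a)}{a(1 - b)}\right)^{M}}, \qquad a \ne b, \tag{27}$$

where $c = P_C$ denotes the value encoded in the output stream $C$. For $M \to \infty$ the expression in Equation (27) can be written as

$$c = \begin{cases} a, & a > b \\ b, & a \le b, \end{cases} \tag{28}$$

which validates the functionality of the SMax function.

Proof. Similar to Section 3.2, the FSM has two inputs $A$ and $D$ and, thus, the steady state probability can be written as (cf. Equation (20))

$$P_i = \frac{t^{i}}{\sum_{j=0}^{M-1} t^{j}}, \qquad t = \frac{P_A (1 - P_B)}{(1 - P_A) P_B}. \tag{29}$$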
According to Fig. 5 , the FSM outputs can be expressed as
The output of the AND gate can be calculated as (cf. Equation (3))
Finally, the output of the multiplexer is given by
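For the unipolar encoding format (i.e., $a = P_A$, $b = P_B$ and $c = P_C$), and since all ones of $B$ appear at the output while a one is additionally injected whenever the input pattern $A[i] = 1$, $B[i] = 0$ meets a full shift register (state $S_{M-1}$ with steady state probability $P_{M-1}$), we have

$$c = b + a(1 - b) P_{M-1} = b + \frac{a - b}{1 - \left(\frac{b(1 - a)}{a(1 - b)}\right)^{M}}, \qquad a \ne b. \tag{34}$$

If $M \to \infty$, the denominator in Equation (27) becomes one or grows without bound in magnitude depending on whether $a > b$ or $a < b$, respectively; for $a = b$ we have $P_{M-1} = 1/M$, which also vanishes. Thus, $c = a$ if $a > b$ and $c = b$ if $a \le b$, which proves the correctness of the SMax circuit shown in Fig. 5. □

Fig. 6 illustrates the theoretical expression in Equation (27), the bit-wise simulation results of the circuit shown in Fig. 5 and the exact max function. We observe a good match between the theoretical and simulation results (for the simulated stream lengths). Moreover, we observe that already a low number of states $M$ provides a good approximation of the max function. In contrast to the state-of-the-art SMax functions [17], [23], the proposed function does not coincide with the exact max function at $a = b$ for finite $M$, but provides a better approximation for $a \ne b$.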
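The steady state view of the proposed circuit can be evaluated numerically. The sketch below assumes $0 < a, b < 1$, an FSM with $M$ states whose state is incremented with probability $a(1-b)$ and decremented with probability $(1-a)b$, and an output that equals $B$ plus a one injected whenever the pattern $A[i] = 1$, $B[i] = 0$ meets the last state. The function names are hypothetical.

```python
def last_state_probability(a, b, M):
    """Steady-state probability P_{M-1} of the last FSM state, for 0 < a, b < 1."""
    up = a * (1.0 - b)      # P(A[i] = 1, B[i] = 0): state is incremented
    down = (1.0 - a) * b    # P(A[i] = 0, B[i] = 1): state is decremented
    if up <= down:
        t = up / down
        return t ** (M - 1) / sum(t ** j for j in range(M))
    u = down / up           # work with 1/t to avoid very large powers
    return 1.0 / sum(u ** j for j in range(M))

def smax_expected(a, b, M):
    """Value encoded at the output: c = b + a(1 - b) * P_{M-1}."""
    return b + a * (1.0 - b) * last_state_probability(a, b, M)

if __name__ == "__main__":
    for M in (2, 4, 8, 16):
        print(f"M = {M:2d}:",
              "c(0.7, 0.4) =", round(smax_expected(0.7, 0.4, M), 6),
              "c(0.4, 0.7) =", round(smax_expected(0.4, 0.7, M), 6),
              "c(0.5, 0.5) =", round(smax_expected(0.5, 0.5, M), 6))
```

For $a = b = 0.5$ the ratio of increment to decrement probability is one, all states are equally likely, and the output exceeds the exact maximum by $a(1-b)/M$. This is the finite-$M$ deviation at $a = b$ mentioned above.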
Next, we compare the approximation error of the state-of-the-art SMax functions and the novel SMax function. For this we calculate the expected value of the absolute error, assuming a uniform distribution of $a$ and $b$ over the interval $[0, 1]$, respectively. We define the absolute error $e$ by $e = |c_{\mathrm{exact}} - c|$, with the exact max function $c_{\mathrm{exact}} = \max(a, b)$ and the FSM-based approximations $c$ given in Equations (10), (17) and (27), respectively. Then, the expected absolute error can be calculated as

$$E(e) = \int_0^1 \int_0^1 |c_{\mathrm{exact}} - c| \, da \, db. \tag{35}$$
The absolute value of the error in Equation (35) makes it possible to consider both erroneously added ones (i.e., $c > c_{\mathrm{exact}}$) and erroneously removed ones (i.e., $c < c_{\mathrm{exact}}$) in the bit stream representing $c$. When considering a stochastic bit stream of $c_{\mathrm{exact}}$, the absolute difference $e$ can be interpreted as a bit error probability of $c$ compared to such a bit stream of $c_{\mathrm{exact}}$. This is crucial in order to enable a comparison with the analysis results presented in Section 5. We observe from Fig. 7 that the novel SMax function has a significantly lower approximation error than the state-of-the-art FSM-based SMax functions. This is because the power of $M$ in the denominator of Equation (27) converges faster to zero or infinity as the number of states $M$ increases than the corresponding terms in the denominators of Equations (10) and (17) (power of $M/2$). Moreover, we observe that the SMax function proposed in [17] has the highest approximation error among the three approaches.
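The expected absolute error can be approximated numerically. The sketch below evaluates it only for the proposed function, using a midpoint rule on a $K \times K$ grid (an assumption made here; the text does not state which integration method was used). The output model $c = b + a(1-b)P_{M-1}$ and all function names are illustrative assumptions.

```python
def smax_expected(a, b, M):
    """Output value of the proposed SMax model for 0 < a, b < 1 and M FSM states."""
    up = a * (1.0 - b)      # probability of incrementing the state
    down = (1.0 - a) * b    # probability of decrementing the state
    if up <= down:
        t = up / down
        p_last = t ** (M - 1) / sum(t ** j for j in range(M))
    else:
        u = down / up       # use 1/t to avoid very large powers
        p_last = 1.0 / sum(u ** j for j in range(M))
    return b + a * (1.0 - b) * p_last

def expected_abs_error(M, K=200):
    """Midpoint-rule approximation of E(e) = integral of |max(a, b) - c| over [0, 1]^2."""
    h = 1.0 / K
    total = 0.0
    for i in range(K):
        a = (i + 0.5) * h   # midpoints never hit 0 or 1, so no division by zero
        for j in range(K):
            b = (j + 0.5) * h
            total += abs(max(a, b) - smax_expected(a, b, M))
    return total * h * h

if __name__ == "__main__":
    for M in (2, 4, 8, 16, 32):
        print(f"M = {M:2d}  E(e) ~ {expected_abs_error(M):.6f}")
```

In this model the error shrinks at every grid point as $M$ grows, so the approximated $E(e)$ decreases with the number of states.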
NOVEL STOCHASTIC MAX/MIN FUNCTION: ERROR ANALYSIS
We observed from the deterministic analysis in Section 4.1 that long shift registers reduce the errors due to right and left overflows. However, a large shift register length increases the error caused by the remaining bits in the shift register. It is important to note that especially the last observation cannot be inferred from the probabilistic analysis presented in the previous section. The probabilistic analysis becomes exact if and only if the bit stream length goes to infinity. In such a case, a finite number of remaining bits in the shift register does not matter. However, when using a finite bit stream length, these remaining bits matter. In the following analysis, we assume a finite bit stream length that is significantly larger than the shift register length (a typical scenario in SC). This allows using the probabilistic FSM description in Section 4.2 for modelling the behavior of the shift register for finite bit stream lengths. Moreover, having sufficiently long bit streams justifies using the probabilities of ones instead of the rate of ones in the stream (cf. Section 2.1) for the following error analysis. In the following, we derive the expected error probability, based on the deterministic analysis in Section 4.1. With this expression we determine the optimal shift register length $L_{\mathrm{opt}}$ with respect to the bit stream length $N$. Similar to Section 4.1 we distinguish two cases: $a \le b$ and $a > b$.
Error Probability for a ≤ b
In this case, an error occurs due to right overflows. In particular, the shift register is filled with ones, i.e., the FSM is in the last state $S_{M-1}$, and the input pattern $A[i] = 1$ and $B[i] = 0$ is applied. The probability for this error event can be described as

$$P_{e, a \le b} = P_{M-1} P_A (1 - P_B), \tag{36}$$

where $P_{M-1}$ describes the probability of the FSM being in the last state (cf. Equation (31)).
Error Probability for a > b
In this case, errors can originate from two sources: left overflows and remaining ones in the shift register. At a left overflow, the shift register is empty (all zeros), i.e., the FSM is in state $S_0$, and the input pattern $A[i] = 0$ and $B[i] = 1$ occurs. The probability for this error event can be expressed as

$$P_{e,0} = P_0 P_B (1 - P_A), \tag{37}$$

where $P_0$ denotes the probability of the zero state of the FSM, i.e., an all-zero shift register (cf. Equation (30)). The corresponding expected number of erroneously added ones in an output stream of length $N$ can be calculated by

$$E(o_L) = N P_{e,0}. \tag{38}$$

For the error caused by the remaining ones in the shift register we compute the expected value of the shift register state, i.e., the expected number of ones in the shift register, as

$$E(o_S) = \sum_{i=0}^{M-1} i P_i, \tag{39}$$
where P i denotes the probability of the FSM to be in the ith state. Note that the expected number of ones in the shift register corresponds to the number of ones that are missing on average in the output stream. Combining Equations (38) and (39) and considering that left overflows add ones to the output stream, while the remaining bits in the shift register are the missing ones in the output stream, the expected number of erroneous ones is given by
Finally, the error probability can be expressed as
Interestingly, the second term in Equation (41) depends on the stream length $N$ and goes to zero for $N \to \infty$. This is because a finite number of missing bits has a higher impact on the error for shorter streams than for longer streams.
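The two error sources for $a > b$ can be evaluated from the steady state distribution. The sketch below assumes $0 < a, b < 1$ and the increment and decrement probabilities $a(1-b)$ and $(1-a)b$; it computes the left overflow probability $P_0 P_B (1 - P_A)$ and the expected number of ones remaining in the shift register, and reports the latter divided by $N$ to show how its weight depends on the stream length. How the two terms are combined into a single error probability is not assumed here. All function names are hypothetical.

```python
def steady_state(a, b, M):
    """Steady-state probabilities P_0, ..., P_{M-1} of the FSM, for 0 < a, b < 1."""
    up = a * (1.0 - b)      # probability of incrementing the state
    down = (1.0 - a) * b    # probability of decrementing the state
    if up <= down:
        t = up / down
        weights = [t ** i for i in range(M)]
    else:
        u = down / up       # weights proportional to t^i, scaled to avoid overflow
        weights = [u ** (M - 1 - i) for i in range(M)]
    total = sum(weights)
    return [w / total for w in weights]

def left_overflow_probability(a, b, M):
    """P_{e,0} = P_0 * P_B * (1 - P_A): pattern A = 0, B = 1 on an empty register."""
    return steady_state(a, b, M)[0] * b * (1.0 - a)

def expected_remaining_ones(a, b, M):
    """Expected shift register state, i.e. the sum over i of i * P_i."""
    return sum(i * p for i, p in enumerate(steady_state(a, b, M)))

if __name__ == "__main__":
    a, b = 0.7, 0.4
    for L in (1, 3, 7, 15, 31):
        M = L + 1           # shift register length L corresponds to M = L + 1 states
        rem = expected_remaining_ones(a, b, M)
        print(f"L = {L:2d}  P_e0 = {left_overflow_probability(a, b, M):.3e}  "
              f"remaining/N (N = 1e3) = {rem / 1e3:.3e}  (N = 1e5) = {rem / 1e5:.3e}")
```

For this example the left overflow probability falls as the register gets longer, while the expected number of remaining ones grows, which is the trade-off behind a finite optimal register length.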
Expected Error Probability
Assuming a uniform distribution for $a$ and $b$ over $[0, 1]$ and using Equations (36) and (41), the expected error probability can be derived as follows:

$$E(P_{e,N}) = \int_0^1 \int_0^a P_{e, a > b, N} \, db \, da + \int_0^1 \int_a^1 P_{e, a \le b} \, db \, da. \tag{42}$$

The two integrals in Equation (42) each cover exactly one half of the two-dimensional space $[0, 1] \times [0, 1]$. Thus, together they form the expected value over the whole space. Unfortunately, to the best of our knowledge, no closed-form solution exists for this integral. Thus, we calculated it through numerical integration. Fig. 8 shows the error probabilities obtained through numerical integration and the simulation results. For each simulated point, the empirical error probability was averaged over 10000 test cases. For each simulation, the shift register was initialized with the all-zero vector. We used the rand() function of Matlab to generate the random bit streams. Simulations using the LFSR-based weighted binary generator of [29] produced comparable results. We observe a good match between the theoretical and the simulation results. However, the analytical results are obtained much faster than the simulation results. Moreover, it can be seen that for a given stream length $N$ there exists an optimal shift register length $L_{\mathrm{opt}}$. For example, for $N = 10^4$ the error probability decreases until length 15 and then increases due to the remaining bits in the shift register. Thus, for $N = 10^4$ the optimal shift register length is given by $L_{\mathrm{opt}} = 15$. Although a reasonable performance can also be observed for short bit stream lengths, the performance increases for long bit streams. However, we want to point out that the errors at the optimal shift register lengths of the proposed SC maximum are only a small factor larger than the precision ($1/N$) of the corresponding bit stream lengths. In Fig. 8, we marked the optimal shift register lengths with triangles and added extra ticks at the x-axis. The lower bound curve in Fig. 8 corresponds to the performance limit if the bit stream length $N$ goes to infinity. It can be obtained by numerically integrating either Equation (35) or Equation (42), the latter with $N \to \infty$.
For the latter approach, the second term on the right-hand side of Equation (41) goes to zero, resulting in $P_{e, a > b, N \to \infty} = P_{e,0}$. This is because when the stream length goes to infinity, a finite number of remaining bits in the shift register does not matter. Fig. 9 shows simulation-based comparisons of the empirical bit error probability for the three discussed SC maximum architectures for moderate bit stream lengths. We observe that the proposed approach not only has the best error performance among the compared architectures, but also achieves this performance with the smallest number of states in the FSM (i.e., it has the lowest memory requirement).
It is important to note that the proposed architecture can be directly used in larger architectures using SC computational blocks for uncorrelated streams (e.g., [28]). Recently, a hybrid approach has been proposed where correlation manipulating circuits are introduced, enabling the use of low-complexity SC blocks for various operations [27], such as AND gates for multiplication (uncorrelated streams) or OR gates for the max operation (correlated streams). Nevertheless, for application scenarios relying on uncorrelated streams, our proposed approach is still competitive in terms of complexity compared to the hybrid approach. This can be seen when considering examples such as the following: the results of two maximum operations, realized by two OR gates and correlated bit streams, are further processed by a computational unit that requires uncorrelated streams as input (e.g., an AND gate for multiplication). This requires a decorrelator that includes the generation of random bit streams, typically generated using an LFSR [27]. For bit stream lengths below $N = 10000$, the memory requirements of the LFSR (i.e., approximately $\log_2(N)$) are already significantly higher than the optimal shift register length for our proposed design (see Fig. 8). For example, $N = 256$ requires an 8-bit shift register for decorrelation, but our design requires only a 4-bit memory.
CONCLUSION
In this work, we investigated the SMax/SMin function, which is an important building block in many applications (e.g., max pooling in neural networks). Prior works have proposed circuits for the SMax/SMin function and provided empirical validation. In this paper, we analytically proved the correctness of these architectures. Moreover, we proposed a novel shift-register-based SMax/SMin function, which outperforms the state-of-the-art FSM-based architectures in terms of accuracy, while having comparable hardware cost. We provided a new error analysis of the proposed circuit, considering the value of the individual bits in the stochastic stream. This analysis revealed that for practical bit stream lengths a finite optimal shift register length exists. Moreover, we showed that increasing the shift register length beyond the optimal value deteriorates the accuracy. Although this work can be applied to various use cases of SC, it is most useful for cases relying on high precision (long bit streams) as well as on SC's high error tolerance (e.g., [28]). This is due to the error caused by the remaining bits in the shift register. Hence, finding strategies to empty the shift register might be an interesting future extension of this work.
