Low-Cost Stochastic Number Generators for Stochastic Computing
Sayed Ahmad Salehi, Member, IEEE
Abstract-Stochastic unary computing provides low-area circuits. However, the required area-consuming stochastic number generators (SNGs) in these circuits can diminish their overall gain in area, particularly if several SNGs are required. We propose area-efficient SNGs by sharing the permuted output of one linear feedback shift register (LFSR) among several SNGs. With no hardware overhead, the proposed architecture generates stochastic bit streams with minimum stochastic computing correlation (SCC). Compared to the circular shifting approach presented in prior work, our approach produces stochastic bit streams with 67% less average SCC when a 10-bit LFSR is shared between two SNGs. To generalize our approach, we propose an algorithm to find a set of m permutations (n > m > 2) with a minimum pairwise SCC, for an n-bit LFSR. The search space for finding permutations with an exact minimum SCC grows rapidly when n increases, and it is intractable to perform a search algorithm using accurately calculated pairwise SCC values for n > 9. We propose a similarity function that can be used in the proposed search algorithm to quickly find a set of permutations with SCC values close to the minimum one. We evaluate our approach for several applications. The results show that, compared to prior work, it achieves lower mean-squared error (MSE) with the same (or even lower) area. Additionally, based on simulation results, we show that replacing the comparator component of an SNG circuit with a weighted binary generator can reduce SCC.
Index Terms-Linear feedback shift register (LFSR), permutation, stochastic computing (SC), stochastic number generator (SNG).
I. INTRODUCTION
Stochastic computing (SC) has emerged as an unconventional technique for performing computations with logic circuits [1]. Rather than performing computation on deterministic binary numbers, SC circuits are designed to process random bit streams. The input and output are represented by bit streams, and their values are encoded as the probabilities of seeing 1's in the bit streams. Evidently, the values are confined to the unit interval [0, 1], since probabilities cannot lie outside it. Compared to deterministic binary computing, SC provides several advantages, including reduced hardware complexity and fault-tolerant computing. Because of these advantages, SC has been considered as an appropriate alternative to binary computing in different applications such as low-density parity check (LDPC) decoding [2], image processing [3], neural networks [4], [5], and digital filters [6], [7].
Manuscript received July 20, 2019; revised October 18, 2019 and December 7, 2019; accepted December 28, 2019. Date of publication January 22, 2020; date of current version March 20, 2020. The author is with the Electrical and Computer Engineering Department, University of Kentucky, Lexington, KY 40508 USA (e-mail: sayedsalehi@uky.edu).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2019.2963678
One main advantage of SC is its very low hardware complexity, which could result in cost-efficient computing circuits. The most common way to demonstrate the low hardware cost of SC is its implementation of basic operations, that is, multiplication and addition. Fig. 1(a) shows a simple AND gate implementing multiplication in SC. For the AND gate, the output is 1 only when both inputs A and B are 1. Therefore, provided that the two input bit streams are uncorrelated, the probability of having 1 in the output bit stream is the product of the probabilities of having 1 in each of the input bit streams, that is,
$$P(C) = P(A \wedge B) = P(A)\,P(B).$$
Fig. 1(b) shows a 2-input multiplexer computing scaled addition. For the multiplexer, output C is 1 when S is 0 and A is 1 or when S is 1 and B is 1. Therefore,
$$P(C) = (1 - P(S))\,P(A) + P(S)\,P(B).$$
A stochastic number generator (SNG) is an essential part of any SC circuit. An SC circuit uses SNGs to convert binary numbers into their corresponding random bit streams. They generate random bit streams whose probabilities of producing 1's equal their corresponding binary numbers. SNGs play a central role in the efficiency of an SC circuit for two reasons. First, in SC circuits, the SNG part is large relative to the computing part. This problem becomes more critical for applications with SC circuits that require many SNGs, such as high-degree digital filtering and image-processing algorithms. In fact, for several SC designs, SNG circuits consume around 80% or even 90% of the total area [8], [9]. Second, the quality of the random numbers generated by SNGs can significantly affect the computational accuracy of the SC designs, and correlation among random bit streams is a source of inaccuracy in SC circuits. Therefore, obtaining area-efficient and low-correlated SNGs is a major design challenge for SC circuits.
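The two basic operations and the effect of correlation can be illustrated with a short simulation. The following is a minimal sketch, assuming idealized SNGs modeled by Python's pseudo-random generator rather than by an LFSR; the stream length, the seed, and all function names are hypothetical choices made only for this illustration.

```python
import random

def bernoulli_stream(p, length, rng):
    """Return a list of 0/1 bits, each equal to 1 with probability p (an idealized SNG)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def value(stream):
    """Decode a bit stream: the fraction of 1's."""
    return sum(stream) / len(stream)

def sc_multiply(a, b):
    """AND gate: P(C) = P(A) * P(B) when a and b are uncorrelated."""
    return [x & y for x, y in zip(a, b)]

def sc_scaled_add(a, b, s):
    """2-input multiplexer: output follows a when s = 0 and b when s = 1."""
    return [y if sel else x for x, y, sel in zip(a, b, s)]

rng = random.Random(0)   # fixed seed so that the run is repeatable
N = 100_000              # stream length (hypothetical choice)
A = bernoulli_stream(0.5, N, rng)
B = bernoulli_stream(0.25, N, rng)
S = bernoulli_stream(0.5, N, rng)

product = value(sc_multiply(A, B))           # estimates 0.5 * 0.25
scaled_sum = value(sc_scaled_add(A, B, S))   # estimates 0.5 * 0.5 + 0.5 * 0.25
# Feeding the same stream to both AND inputs (fully correlated inputs)
# returns P(A) instead of P(A)^2: the gate no longer multiplies.
square_shared = value(sc_multiply(A, A))
print(product, scaled_sum, square_shared)
```

The last line of the sketch shows why correlation matters: an AND gate driven twice by the same stream reproduces its input value rather than the square of it.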
In response to this challenge, the contributions of this article are the following.
1) Introducing a new permutation-based design space for sharing a random number source (RNS) among several SNGs. The design space yields low-cost and low-correlated SNGs. Compared to SNGs with the same hardware complexity, the proposed SNGs generate random bit streams with lower cross correlation.
2) Modeling the variation of SC correlation for the proposed design space and presenting a search algorithm for finding the permutations with minimum correlation. In addition, we present a similarity function that can be used to speed up the search algorithm at the cost of some accuracy in obtaining the permutations with the exact minimum SC correlation. Even the fast version of the proposed algorithm achieves permutations with lower SC correlation compared to prior work with the same hardware complexity.
3) Using simulation results to demonstrate a reduction in SC correlation achievable by replacing the comparator (CMP) component of SNG circuits with a weighted binary generator (WBG).
In Section II, we explain the general structure of SNGs, a measure to evaluate their performance in SC circuits, and related prior work. Section III describes the proposed design technique for two SNGs sharing an RNS. Section IV presents a low computational complexity model for the correlation variation of the proposed design, and Section V generalizes the design approach for more than two SNGs. In Section VI, we evaluate our technique for some applications, and Section VII concludes this article.
II. PRELIMINARIES AND PRIOR WORK

A. Stochastic Number Generators
Generally speaking, an SNG is composed of two parts: an RNS and a probability conversion circuit (PCC). An RNS is used to generate a sequence of uniformly distributed random numbers, whereas a PCC is designed to convert the generated random numbers into a random bit stream with the desired probability of generating 1's. Fig. 2(a) shows an SNG circuit.
A linear feedback shift register (LFSR) or a cellular automaton (CA) can be used as a digital RNS. A CA is made up of cascaded modules, each called a cell or site [10]. Each cell is composed of a flip-flop and a combinational circuit. In its simplest form, each cell is connected to only two neighbor cells on its left and right. The next value of a cell is defined by its current value and that of the connected neighbor cells. Although a CA provides modularity and can generate good-quality random numbers, it is not commonly used in SC circuits because of its higher hardware complexity compared to an LFSR. Owing to its low hardware complexity and high speed, an LFSR is employed as the RNS part in most SC circuits, including the circuits proposed in this article. The advantage of using an LFSR is more crucial for computationally intensive applications such as deep neural networks [5] and energy-limited applications such as embedded systems and mobile Internet of Things (IoT) devices. Note that CA and LFSR circuits cannot be designed to generate true random numbers; however, their output sequences pass some of the random number tests, and if the period of the sequences is large enough, they resemble an ideal RNS for SC [11].
An n-bit LFSR is composed of an n-bit shift register and one or more XOR gates. Fig. 2(b) shows a 4-bit LFSR initialized by 0001. Normally, an LFSR is designed to have the maximum sequence length of $2^n - 1$. That is, the output sequence of a maximal-length LFSR repeats after a period of $2^n - 1$ binary numbers, and each number in the range $[1, 2^n - 1]$ is generated once in the period.
Considering the sequence of bits produced in each single flip-flop of the LFSR in Fig. 2(b), four random bit streams are generated, as shown in Table I by $L_4$, $L_3$, $L_2$, and $L_1$. Each bit stream has a period of 15 bits including eight 1's and seven 0's.
In general, for an n-bit maximal-length LFSR, the bit pattern in each bit stream repeats every $2^n - 1$ bits. Since $2^{n-1}$ of the bits in the pattern are 1's, each bit stream has a probability of nearly 0.5.
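The period and the number of 1's per flip-flop can be checked with a few lines of code. This is a minimal sketch, assuming a right-shifting Fibonacci LFSR with tap positions (4, 3); the actual feedback connection of Fig. 2(b) may differ, and the mapping of flip-flop $L_i$ to bit $i - 1$ of the state integer is likewise an assumption of this example.

```python
def lfsr_states(n, taps, seed=1):
    """Return one full period of states of an n-bit Fibonacci LFSR.

    taps lists tap positions in the usual (n, ...) notation, e.g. (4, 3).
    This tap choice is assumed; other maximal-length choices exist.
    """
    state, states = seed, []
    while True:
        states.append(state)
        fb = 0
        for t in taps:                      # XOR of the tapped flip-flops
            fb ^= (state >> (n - t)) & 1
        state = (state >> 1) | (fb << (n - 1))
        if state == seed:                   # back at the seed: period complete
            return states

def bit_stream(states, i):
    """Bit stream seen at flip-flop L_i (L_1 is taken as the least significant bit)."""
    return [(s >> (i - 1)) & 1 for s in states]

states = lfsr_states(4, (4, 3), seed=0b0001)
print("period:", len(states))
for i in (4, 3, 2, 1):
    print(f"L{i}:", "".join(map(str, bit_stream(states, i))))
```

Each printed row is one of the four bit streams of a 4-bit LFSR; the bit order within a row depends on the assumed taps and need not match Table I.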
Because all the bit streams produced by an LFSR have a probability of about 0.5, a PCC is required in order to generate a bit stream with a desired probability other than 0.5. A PCC is a combinational circuit with two n-bit inputs. One is connected to a deterministic binary number $x$ and the other to the sequence of numbers generated by an LFSR. It produces a bit stream with the probability $P_x = x \times 2^{-n}$, or more accurately $P_x = x/(2^n - 1)$. In each clock cycle, one n-bit number from the output sequence of the LFSR is converted to one bit. The output bit stream is generated such that, in each period, a total of $x$ bits are 1 and the remaining bits are 0. In the SC literature, two types of PCC circuits have been proposed: the digital CMP and the WBG. A CMP is an n-bit digital comparator circuit that produces a 1 if the random number from the LFSR is less than the binary number $x$, and a 0 otherwise. A WBG circuit works differently. First, it converts the output sequence of an LFSR into a sequence of weighted binary numbers, and then it generates the output bit stream using the weighted sequence and the input binary number $x$ [12]. For a 4-bit CMP and WBG, the internal circuits are illustrated in Fig. 3(a) and (b), respectively. Although both circuits generate bit streams with the desired probability for every input $x$, their internal logic circuits and generated bit streams are different. For $x = 1011$, Table I tabulates the output bits generated by a CMP, $S_{\mathrm{CMP}}$, and a WBG, $S_{\mathrm{WBG}}$, for an LFSR's output, $L_4 L_3 L_2 L_1$. In this article, we examine both CMP and WBG circuits as the PCC part of the proposed SNG circuits.
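The two PCC types can be compared in software. The sketch below is illustrative only: the state list is one period of an assumed 4-bit maximal-length LFSR (tap positions (4, 3), seed 0001), and the WBG logic follows the structure commonly attributed to [12], in which weight $w_i$ is 1 when $r_i$ is the most significant 1 of the random number. The exact gate-level circuits of Fig. 3 are not reproduced here. Note that, with the strict comparison stated above and an LFSR that never outputs 0, this comparator yields $x - 1$ ones per period, whereas the WBG yields exactly $x$.

```python
# One period of a 4-bit maximal-length LFSR (assumed taps and seed); any order
# that visits 1..15 exactly once gives the same number of 1's per period.
STATES = [1, 8, 4, 2, 9, 12, 6, 11, 5, 10, 13, 14, 15, 7, 3]
N_BITS = 4

def cmp_bit(r, x):
    """Comparator PCC: 1 if the random number r is less than x."""
    return 1 if r < x else 0

def wbg_bit(r, x, n=N_BITS):
    """Weighted binary generator PCC (assumed structure).

    w_i = r_i AND NOT r_(i+1) AND ... AND NOT r_n, so at most one w_i is 1.
    The output is the OR over i of (x_i AND w_i).
    """
    for i in range(n, 0, -1):          # scan from the most significant bit
        if (r >> (i - 1)) & 1:         # highest set bit of r selects w_i
            return (x >> (i - 1)) & 1
    return 0                           # r = 0, which an LFSR never produces

x = 0b1011
s_cmp = [cmp_bit(r, x) for r in STATES]
s_wbg = [wbg_bit(r, x) for r in STATES]
print("S_CMP:", s_cmp, "ones:", sum(s_cmp))
print("S_WBG:", s_wbg, "ones:", sum(s_wbg))
```

Both streams encode nearly the same value, but their 1's fall in different clock cycles, which is why the choice of PCC can change the correlation between SNGs.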
B. SC Correlation (SCC)
When two (or more) random bit streams are used as inputs for an SC circuit, the cross correlation between them can affect the computational accuracy of the circuit. Assume $s_x$ is a random bit stream generated for binary number $x$ and $s_y$ is generated for binary number $y$. In order to quantitatively evaluate the correlation between $s_x$ and $s_y$, one commonly used measure is the SCC computed by [13]
$$\mathrm{SCC}(s_x, s_y) = \begin{cases} \dfrac{\delta(s_x, s_y)}{\min(P(s_x), P(s_y)) - P(s_x)P(s_y)}, & \text{if } \delta(s_x, s_y) > 0 \\[2ex] \dfrac{\delta(s_x, s_y)}{P(s_x)P(s_y) - \max(P(s_x) + P(s_y) - 1,\, 0)}, & \text{otherwise} \end{cases} \tag{1}$$
where $P(s_x)$ and $P(s_y)$ are, respectively, the probabilities for bit streams $s_x$ and $s_y$ to have 1's and $\delta(s_x, s_y) = P(s_x \wedge s_y) - P(s_x)P(s_y)$, with $\wedge$ denoting the bitwise AND of $s_x$ and $s_y$. The SCC can have values between −1 and +1, where ±1 indicates maximum correlation and 0 means no correlation. When comparing the corresponding bits of the two bit streams, the SCC is positive if most 1's and 0's are aligned. However, if most corresponding bits are complements of each other, the SCC is negative. Since lower absolute SCC values yield more accurate results in SC circuits, researchers seek designs that generate bit streams with low SCCs. In general, the absolute values of the SCC among the bit streams generated in the flip-flops of an LFSR (before connecting them to a PCC) are low. For example, the cross correlation between each pair of bit streams generated by a maximal-length 4-bit LFSR, for example, $(L_2, L_1)$ in Table I, is −0.0816 and becomes smaller as $n$ increases. As suggested in [6], we evaluate the correlation between two SNGs by finding the average SCC among their generated bit streams for all possible input values and represent it as $\mathrm{SCC}_{\mathrm{avg}}$. We can calculate the $\mathrm{SCC}_{\mathrm{avg}}$ for SNG1 and SNG2 by
$$\mathrm{SCC}_{\mathrm{avg}} = \frac{1}{(2^n - 1)^2} \sum_{i=1}^{2^n - 1} \sum_{j=1}^{2^n - 1} \left| \mathrm{SCC}(s_i, s'_j) \right| \tag{2}$$
where $s_i$ and $s'_j$ are bit streams generated by SNG1 and SNG2, respectively. To calculate the $\mathrm{SCC}_{\mathrm{avg}}$, first, for both SNGs, the bit streams of all possible inputs, that is, $s_k$ and $s'_k$ for $k = 1, 2, \ldots, 2^n - 1$, are generated. Then, the SCC values between each bit stream of SNG1 and the bit streams of SNG2 are calculated by (1). Finally, the $\mathrm{SCC}_{\mathrm{avg}}$ is calculated by computing and normalizing the total sum of the absolute SCCs. Hence, $\mathrm{SCC}_{\mathrm{avg}}$ is a nonnegative number between 0 and 1, and a lower value means less correlation between the two SNGs.
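The SCC value quoted for the pair $(L_2, L_1)$ can be reproduced numerically. The following is a minimal sketch, assuming the SCC definition of [13] and using one period of an assumed 4-bit maximal-length LFSR; returning 0 for the degenerate 0/0 case of constant streams is a convention chosen for this example. The counts are kept as integers so that the result does not depend on floating-point rounding until the final division.

```python
# One period of a 4-bit maximal-length LFSR (assumed taps (4, 3), seed 0001).
STATES = [1, 8, 4, 2, 9, 12, 6, 11, 5, 10, 13, 14, 15, 7, 3]

def scc(sx, sy):
    """SC correlation of two equal-length 0/1 bit streams."""
    n, cx, cy = len(sx), sum(sx), sum(sy)
    cxy = sum(a & b for a, b in zip(sx, sy))    # number of aligned 1's
    num = n * cxy - cx * cy                     # delta scaled by n^2
    if num > 0:
        den = n * min(cx, cy) - cx * cy
    else:
        den = cx * cy - n * max(cx + cy - n, 0)
    return num / den if den else 0.0            # 0/0 treated as 0 (assumption)

L1 = [s & 1 for s in STATES]                    # least significant flip-flop
L2 = [(s >> 1) & 1 for s in STATES]
not_L1 = [1 - b for b in L1]

print(round(scc(L2, L1), 4))    # pair of different flip-flops
print(scc(L1, L1))              # identical streams
print(scc(L1, not_L1))          # complementary streams
```

For the pair of flip-flops, each stream has eight 1's and four of them coincide, so the scaled numerator is 15·4 − 8·8 = −4 and the denominator is 8·8 − 15·1 = 49, which gives −4/49 ≈ −0.0816.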
C. Prior Work
When several bit streams are required in an SC circuit, the straightforward implementation uses a separate SNG to generate each bit stream. Anderson et al. [14] have shown that careful seeding, scrambling, and feedback polynomials for the LFSR parts of these SNGs can improve computational accuracy. However, rather than using a separate LFSR for each SNG, a common approach to designing compact SNGs is based on sharing an LFSR among them. Although sharing an LFSR reduces the hardware cost, it significantly raises the cross correlation between each pair of the generated random bit streams, thus leading to computational inaccuracy. It is worth mentioning that there are a limited number of applications for SC circuits where computational accuracy is not affected by the correlation between bit streams. For those applications, an LFSR can be directly shared among different SNGs [15]. However, many applications require the mutual correlation among random number sequences to be reduced before they are shared [5]. Neugebauer et al. [16] have suggested using an extra S-Box circuit to generate low-correlated copies of an LFSR's output to be shared with different SNGs. Although this method generates bit streams with low auto- and cross correlation, the S-Box is a combinational circuit that can increase the hardware complexity significantly for large values of $n$. Recent work [8] has suggested a circular shifting approach in order to obtain bit streams with low cross correlation from a shared LFSR without hardware overhead. However, the approach does not provide bit streams with the minimum cross correlation. In fact, circular shifts are a small portion of an unexplored design space that can produce low-correlated bit streams from a shared LFSR with no hardware overhead. This research investigates the whole design space to find designs with minimum $\mathrm{SCC}_{\mathrm{avg}}$.
Note that some SNGs [17] generate multi-bit-width (parallel) bit streams for an input binary number, but this article focuses on SNGs generating single-bit-width (serial) bit stream.
III. SNG COST REDUCTION WITH PERMUTATION-BASED SHARED RNS
This section presents the proposed approach for the design of efficient SNGs. We share one LFSR between two SNGs to reduce the area cost. However, unlike prior work, we reduce the correlation among the generated sequences without adding any extra hardware. In this section, we explain the method for two SNGs and in Section V, we generalize the idea for designing more than two SNGs.
In order to generate low-correlated random bit streams, the cross correlation among the sequences of random numbers fed to the PCCs of different SNGs should be low. While using one LFSR generates one sequence of random numbers, we can feed different sequences of random numbers to different PCCs by permuting the connection between the LFSR's output and the inputs of the PCCs. Consider a simple example of generating two random bit streams from a 4-bit LFSR. Fig. 4(a) shows two 4-bit SNGs sharing one LFSR based on the proposed approach. We connect the $L_4$, $L_3$, $L_2$, and $L_1$ outputs of the LFSR to the $r_4$, $r_3$, $r_2$, and $r_1$ inputs of the first PCC, respectively. For the second SNG, however, we permute the output bits of the LFSR before connecting them to the inputs of the SNG's PCC.
The permutations of an LFSR's output can provide low-correlated random number sequences with the required feature for two reasons. First, in general, there is a low correlation among bits in the flip-flops of an LFSR at a given time. Thus, permuted versions of the LFSR's output sequences have low cross correlation, and they feed low-correlated sequences of random numbers to different PCCs. Second, all permutations of a maximal-length LFSR generate uniformly distributed random numbers such that, in the repeating cycle (period), every integer binary number between 1 and $2^n - 1$ appears exactly once. Hence, the permutation of an LFSR's output bits does not affect the functionality of the SNGs connected to them. That is, in each period, the connected PCCs generate the desired number of 1's and 0's but in a permuted order.
The approach can be extended to any n-bit maximal-length LFSR: the first SNG is built by a direct connection of the LFSR's output to a PCC's input, whereas the second SNG is built by connecting the permuted output of the LFSR to another PCC's input. The permuted output should be chosen such that the SCC between the generated bit streams of the first and second SNG is minimum. However, other than the direct connection, there are $(n! - 1)$ different permutations for an n-bit LFSR output; which one achieves the minimum $\mathrm{SCC}_{\mathrm{avg}}$? To answer this question for different values of $n$, we examine all possible permuted connections of the LFSR and search for those with minimum $\mathrm{SCC}_{\mathrm{avg}}$. We start with the case of $n = 4$. Assume that vector $L = [L_1, L_2, L_3, L_4]$ is the output of a 4-bit LFSR. There are 24 possible permutations of $L$. Also, assume that the first SNG is formed by connecting $L_4$, $L_3$, $L_2$, and $L_1$, respectively, to $r_4$, $r_3$, $r_2$, and $r_1$ of a PCC. Among the other 23 possible permutations of $L$, the SNG that results in the minimum $\mathrm{SCC}_{\mathrm{avg}}$ with the first SNG is formed by connecting $L_1$, $L_2$, $L_3$, and $L_4$, respectively, to $r_4$, $r_3$, $r_2$, and $r_1$ of another PCC. Similar results are observed by investigating the proposed approach for other values of $n$. The following conclusion generalizes the approach: for $i = 1, 2, \ldots, n$, if the first SNG is formed by connecting $L_i$, that is, the $i$th flip-flop of an LFSR, to $r_i$, that is, the $i$th input of a PCC, then the second SNG, resulting in the minimum $\mathrm{SCC}_{\mathrm{avg}}$ with the first SNG, is formed by connecting the $L_{n-i+1}$ output of the LFSR to the $r_i$ input of another PCC. Ronald [18] has proved that a permutation with reversed ordering provides the maximum deviation distance, which agrees with our results.
For example, to share an 8-bit LFSR, $L_1$ to $L_8$ are, respectively, connected to $r_1$ to $r_8$ of a PCC to build the first SNG, and $L_8$ to $L_1$ are, respectively, connected to $r_1$ to $r_8$ of another PCC to build the second SNG. Fig. 4 shows the proposed LFSR-sharing approach based on permutation for $n = 4$ and 8.
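The exhaustive comparison of permuted connections for $n = 4$ can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce the reported results: the LFSR period below comes from assumed taps (4, 3), the PCC is a comparator with a strict less-than test, the average is taken over the absolute SCC of all input pairs, and the 0/0 case of a constant stream is counted as 0. A tuple `perm` means that LFSR output `L_perm[i-1]` drives PCC input `r_i`.

```python
from itertools import permutations

STATES = [1, 8, 4, 2, 9, 12, 6, 11, 5, 10, 13, 14, 15, 7, 3]  # assumed LFSR period
N_BITS = 4

def permute(r, perm):
    """Route LFSR output L_perm[i-1] to PCC input r_i (bit i-1 of the result)."""
    out = 0
    for i, src in enumerate(perm, start=1):
        out |= ((r >> (src - 1)) & 1) << (i - 1)
    return out

def sng_streams(perm):
    """Comparator-based SNG: one bit stream per input x = 1 .. 2^n - 1."""
    rs = [permute(r, perm) for r in STATES]
    return [[1 if r < x else 0 for r in rs] for x in range(1, 2 ** N_BITS)]

def scc(sx, sy):
    """SC correlation with integer counts; 0/0 is treated as 0 (assumption)."""
    n, cx, cy = len(sx), sum(sx), sum(sy)
    num = n * sum(a & b for a, b in zip(sx, sy)) - cx * cy
    if num > 0:
        den = n * min(cx, cy) - cx * cy
    else:
        den = cx * cy - n * max(cx + cy - n, 0)
    return num / den if den else 0.0

def scc_avg(perm_a, perm_b):
    """Average absolute SCC over all pairs of inputs of the two SNGs."""
    sa, sb = sng_streams(perm_a), sng_streams(perm_b)
    return sum(abs(scc(a, b)) for a in sa for b in sb) / (len(sa) * len(sb))

identity = tuple(range(1, N_BITS + 1))
table = {p: scc_avg(identity, p) for p in permutations(identity)}
best = min(table, key=table.get)
print("direct  :", table[identity])
print("reversed:", table[identity[::-1]])
print("minimum :", best, table[best])
```

Because the comparator convention and the handling of constant streams are assumptions, the printed values need not coincide with those reported for the original experiments; the sketch is meant to show the mechanics of the search.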
To illustrate how the $\mathrm{SCC}_{\mathrm{avg}}$ varies with permutation, Fig. 5(a)-(d) shows the $\mathrm{SCC}_{\mathrm{avg}}$ values between the first SNG and the permuted ones for $n = 4$ to 7. For better readability, we do not include the graphs of $\mathrm{SCC}_{\mathrm{avg}}$ for higher values of $n$; however, they have a similar pattern. The horizontal axis ranges from 1 to $n!$ and represents the index of the permutation in reverse lexicographic order [19] (the same order produced by MATLAB's function perms [20]). In reverse lexicographic order, the permutation of a vector is performed based on the positional index of its elements. That is, the permuted versions of a vector are formed by rearranging its elements from left to right, starting from the greater positional indices. For example, for $n = 4$ and original vector [1, 2, 3, 4], the permutations in reverse lexicographic order are listed as [4, 3, 2, 1], [4, 3, 1, 2], [4, 2, 3, 1], and so on. Note that in the reverse lexicographic order, the last permuted vector is the same as the original vector.
For all values of $n$ in Fig. 5, the $\mathrm{SCC}_{\mathrm{avg}}$ corresponding to the first permutation is the minimum $\mathrm{SCC}_{\mathrm{avg}}$. This permutation represents the connection of $L_n, L_{n-1}, \ldots, L_1$ to $r_1, r_2, \ldots, r_n$. On the other hand, since the last permutation, that is, vector $[1, 2, \ldots, n]$, is the same as the original SNG connection, it has the maximum $\mathrm{SCC}_{\mathrm{avg}}$. Let $k$, where $1 \le k \le n$, denote the number of shifts in the circular shifting approach [8]. The red dots in Fig. 5 mark the values of $\mathrm{SCC}_{\mathrm{avg}}$ related to the circular shifts with a $k$-bit shift. As explained in [8], compared to the other values of $k$, the circular shift with maximum gap, that is, $k = n/2$, yields the lowest $\mathrm{SCC}_{\mathrm{avg}}$ values achievable by the circular shifting approach. Yet, our proposed permutation-based approach can find $\mathrm{SCC}_{\mathrm{avg}}$ values lower than those produced by the circular shifting approach. The green stars in Fig. 5(a) mark these points for $n = 4$. The minimum $\mathrm{SCC}_{\mathrm{avg}}$ calculated for $n = 4$ to 10 is listed in Table II. The first and second columns compare the minimum $\mathrm{SCC}_{\mathrm{avg}}$ achievable by the circular shift and our permutation approaches. For both methods, an n-bit CMP is used as the PCC part. The third column is for using an n-bit WBG as the PCC part in our method. As the table shows, using a WBG as the PCC part and increasing the value of $n$ further reduce the $\mathrm{SCC}_{\mathrm{avg}}$.
IV. MODELING AND ANALYSIS
To find the permutation with minimum $\mathrm{SCC}_{\mathrm{avg}}$, we need to calculate the $\mathrm{SCC}_{\mathrm{avg}}$ between the original SNG and $(n! - 1)$ other SNGs formed from the permutations of an n-bit LFSR's output. As the length of the LFSR increases, the search space and the required resources to find the solution rapidly become much larger. For example, for $n = 11$, each copy of all permutations of $L$ requires more than 3 GB of RAM [20], and it grows quickly. In fact, for $n > 9$, an exhaustive search for finding the minimum $\mathrm{SCC}_{\mathrm{avg}}$ is intractable. In order to reduce the computational complexity of this problem, we model the behavior of the $\mathrm{SCC}_{\mathrm{avg}}$ for different permutations of an LFSR by introducing a new function that we call the similarity function. Assuming the original (nonpermuted) output vector of an n-bit LFSR is $L = [L_1, L_2, \ldots, L_n]$, the positional index for $L_1$ is 1, for $L_2$ is 2, and so on. Also, let $PL_k$, for $1 \le k \le n!$, denote the $k$th permuted vector of $L$ in the reverse lexicographic order. The similarity function calculates an approximation of the $\mathrm{SCC}_{\mathrm{avg}}$ between $L$ and its permutations, $PL_k$, and is defined as
where $\mathrm{ind}(PL_k(i))$ is the index of the $i$th element of vector $PL_k$ in the original vector $L$. For example, if $k = 1$, then $S(1)$ calculates the similarity function for $PL_1$, the first permutation of $L$. Since $PL_1 = [L_n, L_{n-1}, \ldots, L_1]$, the $i$th element of $PL_1$ is $L_{n-i+1}$. That is to say, the index for the $i$th element of $PL_1$ is $(n - i + 1)$ in the original vector $L$. Therefore, $S(1)$ is calculated as
For the last permutation of $L$, that is, $PL_{n!}$, since it is the same as the original vector
The value of $S(k)$ is smaller if more elements of the corresponding permuted vector, $PL_k$, change their positional index with respect to the original vector $L$. In other words, the similarity function is smaller if, in the connection of the LFSR's output to the PCC unit, more bits are permuted with respect to the direct connection. Therefore, the similarity function provides an estimate of the correlation among bit streams generated by the permutations of an LFSR. Fig. 6 illustrates the graph of normalized $S(k)$ for $n = 4$ to 7. Fig. 6 includes the graph of the $\mathrm{SCC}_{\mathrm{avg}}$ values to make comparison easier. Although the similarity function does not calculate the exact values of the $\mathrm{SCC}_{\mathrm{avg}}$, comparison of the two graphs shows that this function approximately models the behavior of the $\mathrm{SCC}_{\mathrm{avg}}$ and provides an approximation of the indices of permutations with the minimum $\mathrm{SCC}_{\mathrm{avg}}$. Other measures exist for the closeness between permutations of a sequence [21]. Among them, the squared deviation distance [22] achieves the same pattern as $S(k)$, but with different calculations, and can be used for our model. Compared to the other measures, $S(k)$ appropriately approximates the $\mathrm{SCC}_{\mathrm{avg}}$ with lower computational complexity.
In fact, the similarity function forms a low-cost heuristic for estimating the $\mathrm{SCC}_{\mathrm{avg}}$. If, for any reason such as limitations in circuit-level implementation, a designer decides to use permutations other than the permutation with the minimum correlation, then the similarity function can provide a guiding estimate for choosing other permutations with low correlation.
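A permutation score with the properties described in this section can be sketched in a few lines. The form used below, the sum over positions $i$ of $i$ times the original index of the element at position $i$, is an assumption made for illustration; it is chosen because it is smallest for the reversed ordering, largest for the original ordering, and follows the same pattern as the squared deviation distance, but it is not claimed to be identical to the similarity function defined in this article. The function names are hypothetical.

```python
from itertools import permutations

def similarity(perm):
    """Assumed score: sum over positions i of i * (original index at position i)."""
    return sum(i * src for i, src in enumerate(perm, start=1))

def squared_deviation(perm):
    """Squared deviation distance between perm and the original ordering."""
    return sum((i - src) ** 2 for i, src in enumerate(perm, start=1))

n = 4
# Reverse lexicographic order: (4, 3, 2, 1) comes first and (1, 2, 3, 4) last.
perms = sorted(permutations(range(1, n + 1)), reverse=True)
scores = [similarity(p) for p in perms]

print(perms[0], scores[0])      # reversed ordering
print(perms[-1], scores[-1])    # original ordering
for p in perms[:5]:
    print(p, similarity(p), squared_deviation(p))
```

Under this assumed form, the score and the squared deviation distance are tied by a fixed identity: twice the score plus the distance equals twice the sum of squares of 1 to $n$, so ranking permutations by increasing score is the same as ranking them by decreasing distance, while the score needs only one multiplication per element.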
Notice that for any specific $n$, as long as each number between 1 and $2^n - 1$ appears exactly once in each period of the random number sequence, the variation in the $\mathrm{SCC}_{\mathrm{avg}}$ with respect to permutations remains the same. That is, for all possible structures of an n-bit LFSR, if it is a maximal-length LFSR, the values of the $\mathrm{SCC}_{\mathrm{avg}}$ for permutations are the same.
V. GENERALIZATION
So far, we have discussed the permutation-based sharing of an LFSR between two SNGs. However, the idea can be extended to sharing an n-bit LFSR among more than two SNGs. Let us assume that we want to share an LFSR among $m$ SNGs, where $n > m > 2$. The goal is to find a set of $m$ different permutations of the LFSR's output such that the maximum of all pairwise $\mathrm{SCC}_{\mathrm{avg}}$ values for this set is minimum among all possible sets. For example, assume $m = 3$ and $P1$, $P2$, and $P3$ are indices in reverse lexicographic order for the permutations of an n-bit LFSR that build three SNGs with the minimum mutual values of $\mathrm{SCC}_{\mathrm{avg}}$. Then, permutations $PL_{P1}$, $PL_{P2}$, and $PL_{P3}$ are the ones that minimize $MP_3(PL_{Pi}, PL_{Pj}, PL_{Pk})$, where $MP_3(PL_{Pi}, PL_{Pj}, PL_{Pk}) = \max\{\mathrm{SCC}_{\mathrm{avg}}(PL_{Pi}, PL_{Pj}), \mathrm{SCC}_{\mathrm{avg}}(PL_{Pi}, PL_{Pk}), \mathrm{SCC}_{\mathrm{avg}}(PL_{Pj}, PL_{Pk})\}$ for $1 \le i, j, k \le n!$.
The pseudo-code in Algorithm 1 represents the proposed algorithm for finding a set of $m$ permutations that can provide an RNS for $m$ SNGs with minimum $\mathrm{SCC}_{\mathrm{avg}}$ values. For each permutation, the algorithm examines whether it can be part of a set of $m$ permutations with minimum $\mathrm{SCC}_{\mathrm{avg}}$ values. It starts with $SM = 1$, the greatest possible value for $\mathrm{SCC}_{\mathrm{avg}}$. If the maximum value of $\mathrm{SCC}_{\mathrm{avg}}$ among a set of $m$ permutations is less than $SM$, then the algorithm updates $SM$ with this maximum value and saves the indices of the permutations in $P1, P2, \ldots, Pm$ as the current set with the minimum $\mathrm{SCC}_{\mathrm{avg}}$ value. This process repeats for all possible sets of permutations, and after examining the last set, the indices of the best set are saved in $P1, P2, \ldots, Pm$. As an example, Algorithm 2 represents the pseudo-code for $m = 3$. The pairwise $\mathrm{SCC}_{\mathrm{avg}}$ values and indices of three permutations with minimum mutual $\mathrm{SCC}_{\mathrm{avg}}$ values for $n = 4$ to 7 are listed in Table III. We extend the circular shift approach to obtain a set of three shifts with minimum $\mathrm{SCC}_{\mathrm{avg}}$ values and list the results in Table IV. Comparison of the results in Tables III and IV shows that the proposed method can achieve triple sets with lower $\mathrm{SCC}_{\mathrm{avg}}$ values. Notice that for both methods there is more than one set with minimum $\mathrm{SCC}_{\mathrm{avg}}$ values. Running the algorithm with the CMP replaced by a WBG in the permutation-based SNGs reduces the obtained $\mathrm{SCC}_{\mathrm{avg}}$ values even more, as listed in Table V. Here, we assume that the cross correlations between all elements in a set of $m$ permutations are equally important. However, if it is required for particular applications, we can change the algorithm to give priority to $\mathrm{SCC}_{\mathrm{avg}}$ values for some pairs over others.
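The min-max selection described above can be sketched as a plain exhaustive search. This is not a transcription of Algorithm 1 or Algorithm 2; it only illustrates the stated goal (the set of $m$ candidates whose largest pairwise cost is smallest), starting from the threshold 1 as in the text. The function `best_set`, the toy cost table, and its values are hypothetical; in practice the pairwise cost would be the $\mathrm{SCC}_{\mathrm{avg}}$ or the similarity function.

```python
from itertools import combinations

def best_set(items, pair_cost, m, start=1.0):
    """Return the m-subset whose largest pairwise cost is smallest, and that cost."""
    best, best_max = None, start          # start = 1, the largest possible SCC_avg
    for subset in combinations(items, m):
        worst = max(pair_cost(a, b) for a, b in combinations(subset, 2))
        if worst < best_max:              # strict '<' keeps the first of tied sets
            best, best_max = subset, worst
    return best, best_max

# Hypothetical pairwise correlation values for four candidate permutations.
COST = {(0, 1): 0.90, (0, 2): 0.20, (0, 3): 0.30,
        (1, 2): 0.50, (1, 3): 0.40, (2, 3): 0.25}

def toy_cost(a, b):
    """Symmetric lookup into the toy table above."""
    return COST[(min(a, b), max(a, b))]

chosen, worst_pair = best_set(range(4), toy_cost, 3)
print(chosen, worst_pair)
```

Giving priority to some pairs over others, as mentioned above, would amount to replacing the `max` over pairs with a weighted criterion; that variation is not shown here.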
Due to the inequalities in the if statements of Algorithm 1, part of its pseudocode is executed conditionally. That is, different permutations may require different amounts of time to complete their pass in the algorithm. To estimate the worst-case time complexity of finding the best set of $m$ permutations, we break down the process into four steps: 1) calculating all possible permutations for an n-bit LFSR; the computational complexity is $O(n! \times n)$; 2) calculating bit streams for each permutation; the computational complexity is $O(2^{2n})$; Finally, finding the minimum of the maximum values requires n! m − 1 comparisons. Thus, the total number of required comparisons by the algorithm is
The total computational complexity for the worst-case runtime is the sum of the above four steps. Note that when $n$ is increased, in addition to the required computational complexity, the required memory grows exponentially and becomes a challenge. To show the actual runtime of Algorithm 1, Table VI lists the runtime of the algorithm for different values of $n$ and $m$, implemented in MATLAB on a computer with a 2.11-GHz Intel Core i7 processor and 16 GB of RAM.
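The growth of the search space can be made concrete with simple arithmetic. The sketch below assumes that every permutation is stored as a row of 8-byte values (the default double-precision storage in MATLAB), which is an assumption about the storage format rather than a statement from this article; it also counts the number of distinct sets of $m$ permutations, which is the binomial coefficient of $n!$ and $m$.

```python
import math

BYTES_PER_ENTRY = 8   # assumption: each entry stored as a double-precision value

def perms_matrix_bytes(n):
    """Memory for an n!-by-n matrix holding every permutation of n elements."""
    return math.factorial(n) * n * BYTES_PER_ENTRY

def num_sets(n, m):
    """Number of distinct sets of m permutations drawn from all n! permutations."""
    return math.comb(math.factorial(n), m)

for n in range(4, 12):
    gigabytes = perms_matrix_bytes(n) / 1e9
    print(n, math.factorial(n), f"{gigabytes:.3f} GB", num_sets(n, 3))
```

For $n = 11$, the matrix has 39,916,800 rows of 11 entries, that is, 3,512,678,400 bytes under the 8-byte assumption, which is consistent with the figure of more than 3 GB quoted in Section IV.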
Rather than using $\mathrm{SCC}_{\mathrm{avg}}$, we can use the similarity function in Algorithm 1 to reduce its runtime. We replace steps 1-3 by calculating $S(k)$ using (3). First, we use $S(k)$ to find the indices of the best $m$ permutations and then compute the exact value of $\mathrm{SCC}_{\mathrm{avg}}$ for these indices. Table VI shows the average runtime using $S(k)$ for different values of $n$ and $m$. As Table VI shows, using $S(k)$ significantly reduces the computational time of the algorithm. The reduction becomes more significant when $n$ increases. Table VII shows the indices and values of the best three permutations obtained using $S(k)$ in Algorithm 1. These results show that by using the similarity function, we can find permutations with $\mathrm{SCC}_{\mathrm{avg}}$ values very close to those listed in Table III. In fact, the $\mathrm{SCC}_{\mathrm{avg}}$ values of the achieved permutations for $n = 5$ are the same in both Tables III and VII. Although using the similarity function in Algorithm 1 does not necessarily achieve the minimum correlations, it achieves correlation values lower than the circular shifting results listed in Table IV.
VI. EVALUATION WITH APPLICATIONS
In this section, we evaluate the proposed design approach using applications with different levels of complexity, ranging from simple multiplication to image segmentation. For all experiments, we use 8-bit maximal-length LFSRs and represent variables by 255-bit random bit streams. To make a fair comparison with prior work regarding hardware implementation, we use the synthesis results obtained by Synopsys Design Compiler with the 45-nm NanGate library [23]. We compare the results for six different methods: deterministic (conventional binary), no-share LFSR (separate LFSR for each SNG), simple-share (one LFSR with the same output connection for all SNGs), SBoNG [16], circular shift [8], and our proposed method. Fig. 7 shows the circuit area (in μm²), and Table VIII lists the mean-squared error (MSE) for each application.
As the first application, we implement a simple 2-input multiplier. For the binary multiplication, we use a conventional 8 × 8 Wallace tree multiplier [24]. Because the MSE varies for some SC-based circuits due to the use of different LFSRs, we calculate the average MSE over 1000 trials with different LFSRs. While the proposed circuit has the same size as the simple-share and circular shift circuits, it achieves higher accuracy. Interestingly, the multiplier implemented using the proposed method has a lower MSE than the no-share and SBoNG methods. We can justify this by considering that in our method we design SNGs based on minimum SCC values, and the definition of SCC in (1) directly favors the multiplication of two bit streams. Thus, we find a pair of permutations of an LFSR's output that can yield better results for multiplication than two separate LFSRs or the randomized output of an LFSR.
For more complex examples, we compare different implementations of 31- and 267-tap FIR filters in the form of a MUX tree explained in [6] and [7]. For both filters, we use MATLAB to generate the coefficients of low-pass filters. For the SBoNG method, we use the SNG circuit described in [16] with 8-bit LFSRs, and for the circular shifting method, we use circuits similar to the architectures described in [6]. Table VIII lists the number of LFSRs required for each application when a separate LFSR is used for each data and selection input. As Table VIII shows, for the 267-tap filter, the proposed approach provides better accuracy than the circular shift method. For this filter, the no-share LFSR and SBoNG methods can achieve lower MSE values, but their hardware complexity is higher.
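To make the multiplier experiment concrete, the following sketch models two comparator-based SNGs that share one 8-bit LFSR, with the second SNG seeing a rewired copy of the LFSR output, and multiplies the two bit streams with an AND gate. Several details are assumptions rather than the paper's setup: the tap set (8, 6, 5, 4), the seed, the "random number below x" comparator convention, the one-position rotation used for the circular shift, the reading of the proposed permutation as a reversal of the bit order, and the scc function, which uses the commonly cited form of the SCC metric and is assumed to match (1). All function names are illustrative.

```python
N_BITS = 8
PERIOD = (1 << N_BITS) - 1  # 255 states for a maximal-length 8-bit LFSR

IDENTITY = tuple(range(N_BITS))                           # simple-share wiring
ROTATED = tuple((i + 1) % N_BITS for i in range(N_BITS))  # circular shift by one
REVERSED = tuple(reversed(range(N_BITS)))                 # reversed bit order


def lfsr_states(seed=1, taps=(8, 6, 5, 4)):
    """One full period of a Fibonacci LFSR (the tap set is an assumed choice)."""
    states, state = [], seed
    for _ in range(PERIOD):
        states.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & PERIOD
    return states


def permute_bits(value, perm):
    """Output bit i takes input bit perm[i]: pure rewiring, no extra gates."""
    return sum(((value >> src) & 1) << dst for dst, src in enumerate(perm))


def sng(x, randoms):
    """Comparator SNG: emit 1 whenever the random number is below x."""
    return [1 if r < x else 0 for r in randoms]


def scc(xs, ys):
    """SCC of two equal-length bit streams (commonly used form, assumed to match (1))."""
    n = len(xs)
    a = sum(1 for p, q in zip(xs, ys) if p and q)  # both streams are 1
    b = sum(xs) - a                                # X = 1, Y = 0
    c = sum(ys) - a                                # X = 0, Y = 1
    d = n - a - b - c                              # both streams are 0
    num = a * d - b * c
    if num > 0:
        den = n * min(a + b, a + c) - (a + b) * (a + c)
    else:
        den = (a + b) * (a + c) - n * max(a - d, 0)
    return num / den if den else 0.0


def multiplier_mse(perm, inputs=range(0, 256, 8)):
    """MSE of an AND-gate multiplier whose second SNG sees the rewired LFSR output."""
    states = lfsr_states()
    rewired = [permute_bits(s, perm) for s in states]
    total = 0.0
    for x in inputs:
        sx = sng(x, states)
        for y in inputs:
            sy = sng(y, rewired)
            product = sum(p & q for p, q in zip(sx, sy)) / PERIOD
            # Reference: product of the values the two streams actually encode.
            target = (sum(sx) / PERIOD) * (sum(sy) / PERIOD)
            total += (product - target) ** 2
    return total / (len(inputs) ** 2)


for name, perm in (("simple-share", IDENTITY), ("rotated", ROTATED), ("reversed", REVERSED)):
    print(name, multiplier_mse(perm))
```

In this model, using the same connection for both SNGs makes one stream a subset of the other, so the AND gate computes the minimum of the two values instead of their product. Rewiring the shared output changes only the connections, which is why it adds no hardware.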
Finally, we apply our technique to the implementation of two image-processing applications, namely edge detection and image segmentation, and compare their stochastic computation using different circuits. We evaluate our method using the Roberts cross edge detection algorithm implemented in SC [25] and the kernel density estimation (KDE)-based image segmentation [3]. In the circuits related to the no-share LFSR method, each data and selection input of the multiplexers uses a separate LFSR. We obtained MSE values using five normalized grayscale still images with 256 levels from black to white for the edge detection algorithm, and four grayscale movies with 33 frames for the image segmentation algorithm. Table VIII lists the MSE values for each algorithm, calculated by averaging the MSE values of all trials for a design. For edge detection, our method results in an MSE value close to that of the no-share LFSR and SBoNG methods, but with lower hardware complexity. For the KDE-based image segmentation, our proposed circuit leads to an MSE value nearly half that of the circular shift circuit with the same hardware complexity.
VII. CONCLUSION
In this article, we investigated the design of low-cost and low-correlation SNG circuits using LFSR sharing. To reduce the correlation among the generated bit streams, we permuted the output of a shared LFSR before using it as input for different SNGs. We modeled the behavior of SCC_avg for all permutations, and our results show that for an LFSR's output, its first permutation in the reverse lexicographic order provides the minimum cross correlation. Compared to prior work with the same hardware complexity, that is, the circular shift [6], [8], our method results in SNGs with lower cross correlation values. We also proposed an algorithm for finding a set of m permutations that can be shared among m SNGs with minimum cross correlation. We used the proposed SNGs in the SC-based implementation of several applications, and the results show that, with low hardware complexity, we obtain better computational accuracy compared to prior methods.
