Stochastic computing (SC) is a promising candidate for fault-tolerant computing in digital circuits. We present a novel stochastic computing estimation architecture allowing to solve a large group of estimation problems including least squares estimation as well as sparse estimation. This allows utilizing the high fault tolerance of stochastic computing for implementing estimation algorithms. The presented architecture is based on the recently proposed linearized-Bregman-based sparse Kaczmarz algorithm. To realize this architecture, we develop a shrink function in stochastic computing and analytically describe its error probability. We compare the stochastic computing architecture to a fixed-point binary implementation and present bit-true simulation results as well as synthesis results demonstrating the feasibility of the proposed architecture for practical implementation.
I. INTRODUCTION
S TOCHASTIC computing is a promising candidate to increase the fault tolerance of digital circuits. 1 Distributing the weights of a digital number evenly over a long bitstream allows for a high robustness against errors. Stochastic computing circuits have been successfully designed for various use cases, such as decoding of LDPC codes [2] , [3] , neural networks [4] , digital filter implementations [5] , and DFT/FFT computation [6] .
Estimation algorithms represent an important class of signal processing methods. Their applications range from Radar and communications engineering, over speech and image processing to bio-medical applications [7] . Because estimation algorithms are typically used in applications with noisy measurements, estimation results are naturally afflicted with errors. Due to these inevitable errors, approximate algorithms are often used in this field, allowing for a trade-off between precision and computational complexity. For this reason, approximate estimation algorithms seem to be perfectly fitted to the stochastic computing world, as both are based on similar maxims, putting algorithmic robustness before exact solutions. However, the number of estimation algorithms realized using stochastic computing has been limited so far. According to our opinion, this is due to several reasons: Manuscript Lunglmayr.) The authors are with the Institutes of Signal Processing and Communications Engineering, Johannes Kepler University Linz, 4040 Linz, Austria (e-mail: michael.lunglmayr@jku.at; werner.haselmayr@jku.at).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSII.2019.2916305 1 Please refer to [1] for a comprehensive survey on stochastic computing.
• Many estimation algorithms are not stream-based and thus not very suited for SC implementation. Examples are algorithms that require access to full matrices such as the recursive least squares (RLS) [7] or that use branching operations such as active-set based LASSO algorithms for sparse estimation [8] . • Unfavorable properties of some stochastic representation formats for the requirements of estimation algorithms. When considering, e.g., sparse estimation, most elements of an estimation result are zero. Using, e.g., the singleline bipolar SC format, zeros are translated into bitstreams with a one-probability of 0.5. For finite bitstreams this potentially introduces errors (randomly generated bitstreams might not have exactly half of its bits equal to one) and leads to frequent switching of the computation elements resulting in high energy consumption. • Estimation algorithms often require arithmetic operations that have non-beneficial properties for SC. For example, estimation algorithms often require the calculation of scalar products with a large number of inputs. When implementing such scalar products by, e.g., using a tree of scaled adders this would result in very large computation errors for practical bitstream lengths, often rendering the result unusable. In this brief, we consider all of these three aspects, and present an architecture realizing a linearized Bregman based Sparse Kaczmarz algorithm allowing performing a large class of estimation algorithms with stochastic computing: combined l 1 /l 2 -norm estimation. This class of algorithms allows solving least squares estimation problems as well as sparse estimation problems for estimation scenarios based on blocks of measurement data, or when used as adaptive filters, it allows performing least mean squares (LMS) filters or Sparse LMS filters [9] .
A realization of an LMS filter using stochastic computing has already been presented in [10] . There, the authors used a parallel counter in the scalar product followed by a converter for the binary number output by the parallel counter to a stochastic bitstream. Although this represents an interesting design, it uses binary counters in the core algorithm. Our aim, however, was to develop a fully stochastic design (except for the necessary storage between the iterations), avoiding any conversions to binary numbers in the core algorithm. This, e.g., allows ensuring the high fault tolerance of stochastic computing throughout the whole algorithmic core.
II. KACZMARZ TYPE ALGORITHMS FOR ESTIMATION
The algorithms discussed in this brief are based on a linear model of the form y = Ax + w.
(1) Here, y is a measurement vector, A a known matrix, often called system matrix, w an unknown noise vector and x the 1549-7747 c 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. for j = 1..q do 5: x
Tx(k) 9: end for unknown vector that is to be estimated. The dimensions of all vectors in (1) are defined by the dimension of the matrix A : m×q. Algorithms for estimating x from the measurements y often require matrix-vector operations resulting in a large number of memory access operations (e.g., as in the SLS algorithm). For a stochastic computing implementation, memory access operations should be avoided as much as possible, due to the required conversion effort for bitstream generation and for conversion into the storage format. One would prefer algorithms that work in a stream-based fashion, avoiding excessive use of memory access operations.
A promising type of algorithms allowing performing estimation of x are Kaczmarz-type algorithms [11] , [12] . These algorithms do not require branching operations and use only vector operations avoiding the more costly matrix-vector operations. Traditionally, these algorithms have been used for l 2 -norm estimation problems using cost functions of the form arg min
The Kaczmarz algorithm is used to approximately solve this problem [11] , [12] . Recently, an extension based on linearized Bregman iterations has been proposed [13] allowing performing sparse estimation as well. This is done using the combined l 1 /l 2 cost function arg min
x:Ax=y
The value λ is used to control the sparsity of the estimation result. The higher this value, the lower the number of non-zero values in the estimation result will be. This allows to formulate the following Sparse Kaczmarz algorithm [13] , shown in Alg. 1 for solving (3) .
Here,
As it is shown in [14] , by appropriate scaling of λ as well as the measurements y, one can perform the whole algorithm in fractional precision fixed-point using only values out of the interval [−1, 1) (for arbitrary lambda values that are then scaled down as well). Assuming one wants to perform an estimation with a value λ , one typically sets the algorithm's λ value to the fixed value of 0.5 and scales the measurement values by 0.5/λ . The algorithm will then automatically output the scaled estimation resultx (N) · 0.5/λ [14] . This scaling method using λ = 0.5 in the algorithm was also utilized for the sparse estimations of this brief.
If one sets the parameter λ to zero, the output of shrink is the same as its input shrink(v, 0) = v. In this case, Alg. 1 becomes the ordinary Kaczmarz algorithm (for the estimation problem of (2)) using x (k) = v (k) . As one can notice from Alg. 1, Kaczmarz algorithms re-use measurement values as well as rows from the matrix A, in a cyclic way. This is ensured by the modulo operation in line 7 of the algorithm. However, if one assumes A to be a convolution matrix and uses N = m, Alg. 1 performs the operations of a normalized LMS (NLMS) [7] . If one furthermore uses a λ value larger than zero, the algorithm becomes the Sparse LMS from [9] . Table I shows the manifoldness of the basic algorithm.
In addition to the versatility of the algorithm, its simple algorithmic structure makes the algorithm perfectly suited to be implemented in stochastic computing. Using no branching operations and only a small amount of storage operations fosters the stochastic computing implementation described below.
III. STOCHASTIC IMPLEMENTATION OF KACZMARZ TYPE ALGORITHMS A. Stochastic Computing Formats
For stochastic computing, often single-line formats such as the unipolar or the bipolar format [15] are used. For the unipolar format, many arithmetic operations have been designed (see, e.g., [16] and the references therein). Using a unipolar format only allows for non-negative numbers. This is often resolved by using the single-line bipolar format. However, compared to the unipolar format this format changes and sometimes even complicates the arithmetic operations. As an example, to the best of the authors' knowledge, there does not exist an architecture for a non-scaled adder for this format. This lack of a practical architecture for a non-scaled adder prevents using this format for many signal processing applications, especially for estimation purposes.
The drawbacks of single-line formats can be relieved when introducing a second line. One representation is the so-called signed magnitude format [17] . It represents a deterministic number x ∈ [−1, 1] by two stochastic bitstreams, a sign stream X s and a magnitude stream X m . While the magnitude stream can be interpreted as a number in the unipolar format, the sign stream represents the corresponding sign of a bit in the magnitude stream. The value of a number is defined as
, where L is the length of the bitstreams, and X s [l] and X m [l] are the l th bits of the sign and magnitude streams, respectively.
The two-line bipolar format (TLB) [15] represents a deterministic number x ∈ [−1, 1] by two unipolar stochastic bitstreams, the positive stream X p (its ones are counted positive) and the negative stream X n (its ones are counted negative). A value in the TLB format is defined as
, where X p [l] and X n [l] are the l th bits of the positive and negative streams, respectively. This two-line representation efficiently allows using shift register based arithmetic units enabling a high precision implementation of, e.g., an SC scalar product [18] . Furthermore, the signed magnitude and the twoline bipolar format can be easily converted into each other using only two logic gates [18] . Fig. 1 shows a cancelation circuit, described in [15] , [18] allowing a minimum variance Fig. 1 . cancelation circuit eliminating a 1 in X p and X n at the same time. representation [15] of a TLB represented number at its output. It simply works by canceling two ones appearing in the same position in the p and the n stream. The TLB representation enables the efficient implementation of the shrink function for the Sparse Kaczmarz algorithm, as we describe below.
B. Stochastic Computing Shrink Function
The implementation of the Sparse Kaczmarz algorithm requires the following building blocks: non-scaled adders, multipliers, a scalar product, and shrink function blocks. Except for the shrink function, all of these building blocks have been already described in the literature. We used the non-scaled adders, multipliers as well as the sequential-shift scalar product from [18] in our implementation of the Sparse Kaczmarz algorithm.
Shrink is a non-smooth function. It sets its output to zero if the absolute value of its input is smaller than λ and else reduces the magnitude of the input by λ. Although, at first glance, it might seem that this would require a branching operation, it can be performed in a stream-based manner as well.
The shrink operation will be done right after the iteration update. For this, the stochastic bitstreams are first converted to a storage representation, e.g., as described in [19] . Then new bitstreams are then re-generated for the next iteration. We assume that the new bitstreams are generated such that one of the two lines is all-zero. This can be, e.g., performed by using Memristors. As the authors of [19] describe, with Memristors one can convert an analog voltage to a stochastic bitstream and vice versa. For a two-line format, one could subtract the voltages from the p and n streams in the analog domain and then generate the corresponding bitstream out of the difference voltage. Fig. 2 shows the stochastic computing implementation of the shrink function using two stochastic maximum circuits.
In the literature, several min/max circuits for stochastic computing have been proposed [20] - [22] . The approach of [22] , as depicted on the right in Fig. 2 , not only provides a higher precision compared to the other proposed circuits, but also shows an important property for implementing the shrink function: the ones of input stream B are always output by the circuit. This can be easily confirmed by analyzing the circuit of Fig. 2 . The circuit has a bi-directional shift register to prevent ones of the input stream A to be output if the stream of A represents a smaller number than B. The shift register acts as a buffer for the excess of one-bits in A. A one at the enable (EN) input shifts in a one in from the left into the shift register, a zero at EN shifts a zero in from the right. Only if A contains a large enough number of ones (i.e., larger than B), ones can reach the right end of the shift register and is output. This way, if A represents a larger number than B, ones of A in excess to B are output as well, together representing the number of bitstream A (then the maximum) in the output stream.
The property that the ones of B are always output by the circuit enables efficient implementation of the shrink function. Assuming large enough shift registers, by connecting the bitstream X λ (e.g., for the typical value λ = 0.5) to both B inputs of the two stochastic maximum blocks, both blocks should output the X λ stream if the magnitude of the input value is smaller than lambda (one of the numbers represented by X p or X n then being smaller than λ and the other one being zero). If one of the values represented by the inputs X p or X n is larger than λ, meaning that the magnitude of the TLB-represented value is larger than λ, one output of the stochastic maximum blocks will represent this larger value, the other one (due to having a zero bitstream at its input A) will output the stream X λ . Then, the TLB representation will naturally provide the subtraction of the values represented by X p and X n . For the case of both maximum circuits outputting the stream X λ , the following cancelation block from Fig. 1 will produce two all-zero streams.
When using this stochastic shrink function, especially the error for cases when the shrink function should output a zero value is most relevant. This is because in sparse estimation it is most relevant which elements of the estimated vector are non-zero, thus an error at zero positions (leading to a new non-zero element) typically has a more severe effect than an error at non-zero positions. For the maximum block with the all-zero stream at input A, the output will always be exactly X λ . For the other block, the output might deviate from X λ . The probability of an error of a bit in this output stream can be analytically described. 2 Due to the implicit subtraction of the TLB format, the bit error probability of this second stream deviating from X λ is equal to the bit error probability of shrink's output compared to the optimal zero streams. We will call this error probability P !0 .
For the stochastic maximum, an analytic description of the error probability of the output stream for the case that the value of bitstream of input A is smaller than of B is given as [22] :
with M as the length of the shift register. P A and P B are the probabilities of having bit one in the bitstreams of A and B, respectively, and thus specify the (unipolar) values of these bitstreams.
For use in the shrink function, this equation can be simplified by setting P B = 0.5 (the typically used λ value as described in Section II), resulting in P !0 = P e,a≤0.5 . Fig. 3 shows the evaluation of (5) for values P A up to 0.5 using different values M. As one can see, the error probabilities always have their maximum for P A values of 0.5. This maximum error probability can be used as an upper bound of the actual error probability for all values P A . Via the limit P A → 0.5 for P B = 0.5 in (5), one can analytically find this maximum as P !0,max = P e,a→0.5≤0.5 = 1 M 0.5 2 . The expected value of the error probability can be found by integrating P !0 = P e,a≤0.5 over all values of a (assuming a uniform distribution of the values of a). To the best of our knowledge, this integration can only be performed numerically. Fig. 4 shows the expected error probabilities as well as the maximum value (6) for different shift register lengths M. We also plotted empirical results validating the theoretical results. The simulation times to obtain the empirical results was approximately a factor 4·10 4 larger than the time required for numerical integration. One can see from these figures, that a good error performance can be already achieved for moderate shift register lengths M.
C. Stochastic Computing Sparse Kaczmarz Architecture
Combining the building blocks described in the previous sections, we developed the SC architecture depicted in Fig. 5 . This architecture can also be used to realize the ordinary Kaczmarz algorithm using λ = 0, i.e., by bypassing the shrink function (see Table I ). Using this structure in an adaptive filter setting, e.g., assuming that the vector a i has the structure of the row of a convolution matrix, one can realize an LMS or a Sparse LMS with this structure.
The operation of the architecture is as follows. The conversion between the deterministic memory and the stochastic domain is accomplished via the blocks D/S . Before the first iteration, the variablesx (0) andˆv (0) are set to zero (see Line 1, 2 in Alg. 1), and, thus, the corresponding bitstreams will be all-zero streams.
An iteration starts by passing the q two-line streams ofv (k) through the shrink and cancelation block. This block consists of q shrink blocks in parallel, each followed by a cancelation circuit shown in Fig. 1 . The output of the shrink and cancelation blocks,x (k) , are used in the scalar product with the bitstreams of the matrix row a i (see Line 8 in Alg. 1). After the scalar product, only scalar (single two-line) operations are performed: the addition with y i and multiplication by 1/||a i || 2 (see Line 8 in Alg. 1). The values of 1/||a i || 2 2 have been precalculated offline and are stored in a separate memory. The result (y i − a T i x (k) ) is finally multiplied with the bitstreams of a i (see Line 8 in Alg. 1). To prevent correlations between the bitstreams (the streams of a i are used twice) we included a delay consisting of a 10 flip-flops shift register. After q nonscaled adders, the storage elements ofv (k) are updated for the next iteration. After all N iterations have been performed, the output streams after the shrink and cancelation blocks are converted to the memory ofx (N) .
To prevent correlations, the bit streams of all variables used within an iteration (i.e.,v (k) , 1/||a i || 2 2 , . . . ) use different (pseudo-random) start states in the generating LFSRs as well as different feedback polynomials. We observed that, when using the same seeds over different iterations, the calculation error increased significantly. This can be explained as follows. When using different LFSR seeds for each iteration, the representation errors over the iterations (on average) cancel each other. Using the same seeds over the iterations, however, leads to similar representation errors that accumulate over the iterations, significantly degrading the performance. Using pseudo-random generators for generating different start seeds over the iterations would increase the complexity of the design. To avoid this complexity increase, we use stream lengths less than a full cycle of the LFSR (for the simulation results shown in the next section we used one less than a full period). At the end of the bitstream, we flip the least significant bit (LSB) of each LFSR's memory vector, respectively (as long as this would not result in an all-zero memory vector). This strategy selects new and different start states for all LFSRs over different iterations without using additional random number generators and, thus, enables performing iterative estimation in stochastic computing. Fig. 6 shows hardware validated bit-true simulations of the stochastic computing implementation described above as well as of the fixed-point design from [14] for different bit-widths. The stochastic computing design was implemented for a compressive sampling example estimating x vectors with q = 16 with z non-zero elements at random positions and randomly selected out of [−1, 1]. The algorithms used m = 10 measurements per estimation test case. The root mean squared errors (RMSE): mean x −x 2 2 have been averaged over 1000 simulated test cases. The measurements are modeled using Ax+w, with the entries of A selected uniformly at random from [−1, 1] and w as white Gaussian noise, scaled to obtain a signal to noise power ratio (SNR) of 30dB. This SNR value has been chosen to allow a clear distinction of the different accuracies due to the chosen precisions of the estimation algorithms. For lower SNR values, the results are even closer together as the errors due to noise dominate the errors due to calculation precision. The bitstreams for the test cases have been generated via maximum length linear feedback shift registers (LFSR) with L = 2 16 − 2. The scalar product as well as the non-scaled adder (both taken from [18] ) of the design in Fig. 5 use two times shift registers of length 20 (one for positive and one for negative carries), respectively. Based on the evaluations above, we used a shift register length of 10 as a trade-off between complexity and precision for the SC maximum blocks in the shrink functions. As one can see from Fig. 6 , the SC error performance in terms of root mean square errors is in between the performance of the fixed-point implementations of 9 and 10 bit respectively. When comparing the synthesis results of the stochastic implementation shown in Table II , one can see that the stochastic computing implementation requires about 20% more combinatorial functions and flip-flops, than the corresponding fixed-point binary implementation of [14] (here synthesized for a bit width of 10), but require no binary multipliers. However, due to the bitstream representation, the stochastic computing implementation requires significantly more clock cycles (L clocks per iteration) than the fixed-point binary implementation (2q + 1 clocks per iteration). On the other hand, as it is exemplarily described for the scalar product in [18] , the robustness in terms of calculation errors of the stochastic computing implementation is significantly higher than the error tolerance of the binary implementation.
IV. PERFORMANCE RESULTS

V. CONCLUSION
In this brief, we presented a fully stochastic computing architecture for performing iterative estimation based on linearized-Bregman-based Sparse Kaczmarz. In order to realize this estimation algorithm, we proposed a novel stochastic implementation of the non-linear shrink function and analytically characterized its error performance. We presented bit true simulation results as well as synthesis results comparing the stochastic computing implementation to a fixedpoint binary implementation demonstrating the feasibility of the stochastic computing estimation architecture for practical implementation.
