Abstract: Designing reliable systems, while eschewing the high overheads of conventional fault tolerance techniques, is a critical challenge in the deeply scaled CMOS and post-CMOS era. To address this challenge, we leverage the intrinsic resilience of application domains such as multimedia, recognition, mining, search, and analytics, where acceptable outputs are produced despite occasional approximate computations. We propose stochastic checkers (checkers designed using stochastic logic) as a new approach to performing error checking in an approximate manner at greatly reduced overheads. Stochastic checkers are inherently inaccurate and require long latencies for computation. To limit the loss in error coverage, as well as false positives (correct outputs flagged as erroneous), caused by the approximate nature of stochastic checkers, we propose input permuted partial replicas of stochastic logic, which improve their accuracy with minimal increase in overheads. To address the challenge of long error detection latency, we propose progressive checking policies that provide an early decision based on a prefix of the checker's output bitstream. This technique is further enhanced by employing progressively accurate binary-to-stochastic converters. Across a suite of error-resilient applications, we observe that stochastic checkers lead to greatly reduced overheads (29.5% area and 21.5% power, on average) compared with traditional fault tolerance techniques while maintaining high coverage and very low false positives.
We address this challenge by leveraging a key property of many prevalent and emerging application domains such as multimedia (audio, video, and image) processing, machine learning, data mining, search, and analytics: their computations may be executed approximately without significantly impacting the quality of results [6]-[8]. We suggest that these applications may be designed with simplified error checkers that compute an approximation of the correct output, potentially resulting in a small probability of undetected faults, while still maintaining acceptable output quality.
We propose StoCK, an approach to design approximate error checkers using stochastic logic. In stochastic computing (SC) [9], [10], numbers are represented as signal probabilities of pseudorandom bitstreams. The key advantage of SC is that various arithmetic operations can be implemented in a highly power and area efficient manner (e.g., a multiplier is implemented using just a single AND gate since the signal probability at its output is the product of the input probabilities). Another interesting property of SC is that the precision of the computation progressively increases as the computation proceeds. Thus, an approximation of the final output can be found from the initial bits of the output bitstream. In addition, stochastic circuits are themselves quite fault tolerant; a few bit flips in a bitstream do not impact the output significantly. These features make SC promising for the design of low-overhead error checkers.
The use of stochastic error checkers leads to two key challenges. First, the inherent approximate nature of SC can lead to either some errors being undetected (missed coverage) or some correct outputs being wrongly identified as erroneous (false positives). Second, since stochastic circuits operate on bitstreams, they may require longer time to complete, leading to an error detection latency. We propose design techniques to optimize stochastic checkers for the key metrics of fault coverage, false positives, and detection latency.
While the field of SC has seen considerable interest in recent years, we are unaware of any other effort to explore stochastic circuits as error checkers. It also bears mentioning that, while the philosophy of approximate error checking is shared with algorithmic noise tolerance [11], the key distinction is the use of stochastic logic for the design of error checkers, which brings unique benefits and challenges.
Our contributions can be summarized as follows.
1) We propose StoCK, an approach to design low-overhead checkers using stochastic logic for applications that can tolerate approximate computations.
2) We propose input permuted partial replicas (IPPRs) of stochastic circuits to improve the accuracy of stochastic checkers, thereby improving both their coverage and false positive rate.
3) We propose progressive checking policies, wherein the stochastic checker performs checks using a prefix of the output bitstream, improving error detection latency.
4) We propose progressively accurate binary-to-stochastic (PA-BTS) converters that reduce conversion errors at lower latencies, thus further improving the effectiveness of progressive checking.
5) We implement and evaluate stochastic error checkers for a suite of error-resilient applications and demonstrate that they can lead to very high fault coverage and very low false positives, at considerably reduced overheads (∼29.5% area and ∼21.5% power) compared with conventional techniques.
The rest of this paper is organized as follows. Section II provides preliminaries on SC and motivates its use for the design of error checkers. An overview of stochastic checkers and a detailed description of their design are provided in Section III. The experimental methodology and results are presented in Sections IV and V, respectively. Section VI discusses related previous work, and Section VII concludes this paper.
II. PRELIMINARIES AND MOTIVATION
In SC [9], data are represented and processed in the form of pseudorandom bitstreams, such that the probability of 1 in the bitstream corresponds to the magnitude of the number being represented. Fig. 1 shows the key components of a stochastic circuit. The designs of BTS and stochastic-to-binary (STB) converters are shown in Fig. 1(a) and (d), respectively. More details about the components of Fig. 1 and implementations of stochastic logic can be found in [10]. To convert a binary number to a stochastic bitstream, a linear feedback shift register (LFSR) is first used to generate a pseudorandom sequence of numbers. An M-bit maximal polynomial LFSR traverses all $2^M$ states except zero in a pseudorandom order. Each M-bit random number is then compared with the M-bit binary number to generate a "1" if the random number is less than or equal to the binary number, else a "0."
This scheme ensures that the number of ones in the final bitstream (i.e., at the end of $2^M$ cycles) is equal to the value of the binary number. An STB converter is realized using a counter, which accumulates the 1s in the stochastic bitstream to produce its binary equivalent.
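The BTS and STB conversions described above can be sketched in software as follows. This is a minimal illustration rather than the hardware design: the helper names (`lfsr_states`, `bts`, `stb`), the tap positions (8, 6, 5, 4), and the seed are assumptions chosen for the example, and the sketch runs for one full LFSR period of $2^M - 1$ cycles so that each nonzero state is compared exactly once.

```python
from itertools import islice


def lfsr_states(nbits, taps, seed=1):
    """Yield successive states of a Fibonacci LFSR (hypothetical helper).

    taps are 1-indexed register positions XORed to form the feedback bit.
    """
    state, mask = seed, (1 << nbits) - 1
    while True:
        yield state
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & mask


def bts(value, nbits=8, taps=(8, 6, 5, 4)):
    """Binary-to-stochastic: emit 1 whenever the LFSR state is <= value."""
    period = (1 << nbits) - 1  # assumed run length: one full LFSR period
    return [int(r <= value) for r in islice(lfsr_states(nbits, taps), period)]


def stb(bits):
    """Stochastic-to-binary: a counter that accumulates the 1s."""
    return sum(bits)


if __name__ == "__main__":
    for v in (0, 37, 128, 255):
        print(v, stb(bts(v)))
```

Under these assumptions, `stb(bts(v))` recovers `v` for every 8-bit input, since exactly `v` of the nonzero LFSR states are less than or equal to `v`.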
One of the key advantages of SC is that arithmetic operations, such as multiplication, addition, and others, can be realized in a highly compact manner. For example, as shown in Fig. 1(b), a single AND gate can be used as a multiplier in the stochastic domain, as it computes the product of the probabilities of the input stochastic bitstreams. Similarly, a MUX performs scaled addition [Fig. 1(c)] of input bitstreams. Thus, utilizing stochastic circuits as error checkers has the potential to dramatically reduce the overheads of error detection.
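The AND-gate multiplier and the MUX-based scaled adder can be illustrated with a short simulation. The sketch below is only an approximation of the hardware: it assumes independent input bitstreams drawn from a seeded software random number generator instead of LFSRs, and the function names and the bitstream length of 4096 are chosen for the example.

```python
import random


def bitstream(p, length, rng):
    """Random bitstream whose probability of 1 is p (software stand-in for a BTS)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def prob(bits):
    """Estimate the value encoded by a bitstream."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Bitwise AND: output probability is approximately P(a) * P(b)."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def sc_scaled_add(a_bits, b_bits, sel_bits):
    """2:1 MUX: with P(sel) = 0.5 the output is approximately (P(a) + P(b)) / 2."""
    return [a if s else b for a, b, s in zip(a_bits, b_bits, sel_bits)]


rng = random.Random(0)
n = 4096
a = bitstream(0.5, n, rng)
b = bitstream(0.25, n, rng)
sel = bitstream(0.5, n, rng)

product = prob(sc_multiply(a, b))            # close to 0.5 * 0.25
scaled_sum = prob(sc_scaled_add(a, b, sel))  # close to (0.5 + 0.25) / 2

if __name__ == "__main__":
    print(product, scaled_sum)
```

The results are approximate by construction; the deviation from the exact product and scaled sum shrinks as the bitstream length grows.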
While stochastic circuits are highly area and power efficient, the key disadvantage of SC is the high latency of circuit operation. To represent an n-bit binary number precisely in the stochastic domain, we require a bitstream of length $2^n$, which dictates the number of cycles to compute the circuit output. While techniques such as vector processing and segmented stochastic representation [12], [13] have been proposed to improve performance, matching the throughput of binary circuits while retaining the compactness of SC remains a significant challenge. Acknowledging this drawback, we instead explore a different use for stochastic circuits: as error checkers, wherein their speed will impact only error detection latency and not the performance of the circuit itself. The latency of stochastic checkers fundamentally limits the types of faults that can be detected using them. For example, stochastic circuits are not suited for detecting highly transient faults such as those caused by soft errors. On the other hand, faults that are persistent or slow changing in nature due to aging, process and temperature variations, or deliberately caused by voltage and clock overscaling, can be identified with a significantly lower overhead. In this paper, we restrict our fault model to such faults, which, once detected, can be corrected by regulating system-level parameters such as voltage or clock frequency.
Another key characteristic of SC is that, since the input bitstreams are probabilistic, the output of the stochastic circuit is intrinsically approximate. Thus, when a stochastic circuit is used as a checker, either a small number of errors may be missed or some outputs may be falsely deemed erroneous. In this paper, we target application domains that are intrinsically error resilient, i.e., produce outputs of acceptable quality even if some of their computations are performed in an approximate manner. We propose design techniques to limit the loss in fault coverage and the false positives introduced due to the approximate nature of stochastic checkers.
It is also worth noting that SC possesses high fault tolerance compared with binary representations. Since all the bits in the stochastic bitstream carry the same significance, a single bit-flip changes the output by only 1. On the other hand, in the case of binary circuits, a bit-flip in the $i$th bit will result in an error magnitude of $2^i$ at the output.
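The difference in bit-flip sensitivity can be checked with a small worked example. The helper names and the chosen value (100, flipped at bit 7 in binary and at position 3 in the bitstream) are illustrative assumptions.

```python
def flip_binary_bit(value, i):
    """Flip bit i of a binary-encoded value."""
    return value ^ (1 << i)


def flip_stream_bit(bits, j):
    """Flip position j of a stochastic bitstream (returns a copy)."""
    out = list(bits)
    out[j] ^= 1
    return out


value = 100
# Any ordering of 100 ones in 256 bits encodes the same value.
stream = [1] * value + [0] * (256 - value)

binary_error = abs(flip_binary_bit(value, 7) - value)        # magnitude 2**7
stream_error = abs(sum(flip_stream_bit(stream, 3)) - value)  # magnitude 1

if __name__ == "__main__":
    print(binary_error, stream_error)
```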
In summary, SC enables compact implementations of various arithmetic operations, making it an attractive candidate for designing low-overhead error checkers. In the next section, we explore this approach and present techniques to overcome the key shortcomings of stochastic logic, namely, their inaccuracy and latency, in the context of error checking.
III. STOCHASTIC CHECKERS
In this section, we first provide an overview of stochastic checkers and discuss the key issues involved in their design. We then propose strategies to improve the accuracy of stochastic checkers in order to achieve high fault coverage and low false positives. We also outline progressive checking policies that significantly lower the fault detection latency of stochastic checkers.
A. Overview
Fig. 2 shows the block diagram of a circuit with a stochastic checker. A stochastic circuit implementing the same function as the original (binary) circuit, along with the necessary BTS and STB converter logic, is placed alongside the original circuit. Since the latency of the original circuit and the stochastic checker may differ, the checker samples the original circuit's inputs and outputs. Once the stochastic checker completes evaluation, the binary and stochastic outputs are compared to determine whether there is an error.
We next discuss how the comparison of the binary and stochastic outputs is performed. Since stochastic circuits are intrinsically approximate, there is a small error in the output of the stochastic circuit. Therefore, an exact (equal-to) comparison of the checker's output with the original circuit output will frequently declare an error even when the original circuit output is correct (we refer to this phenomenon as a false positive). To avoid false positives, we use error bands (EBs) around the stochastic output. The stochastic checker's output is converted into an interval $[O_{\mathrm{stock}} - \mathrm{EB},\ O_{\mathrm{stock}} + \mathrm{EB}]$ and the error flag is raised only if the binary output falls outside this interval. The approximate nature of the stochastic output and the use of EBs have another important implication: errors in the original circuit's outputs that are small in magnitude cannot be discerned, leading to a potential loss in coverage.
To illustrate the coverage of stochastic checkers, we designed a stochastic checker for a four-tap finite-impulse response (FIR) filter. The FIR filter was subject to aggressive timing optimization and then subject to clock overscaling such that timing errors are introduced in its output. We plot the normalized histogram of the difference between the accurate and stochastic outputs (blue lines), and accurate and faulty outputs from the overscaled circuit (red and green lines, corresponding to different levels of overscaling) in Fig. 3. The x-axis of Fig. 3 represents the amount of error in the output, i.e., golden output minus faulty output. For the chosen FIR filter (8-bit inputs and 18-bit output), the error can be an integer in the range $[-2.55 \times 10^5,\ 2.55 \times 10^5]$. The y-axis, on the other hand, shows the normalized number of faulty outputs, i.e., number of faults/$10^6$ ($10^6$ random inputs were used to simulate the designs).
We observe that, when the clock frequency is moderately scaled, there is no overlap with the stochastic error histogram (green and blue lines). This is because timing errors predominantly affect the MSBs of the output, and are hence very large in magnitude compared with errors in the stochastic output. Therefore, the stochastic checker yields 100% fault coverage. When the clock frequency is more aggressively scaled, timing errors start to appear at the LSBs, and we observe a very small overlap (<0.1%) between the corresponding error histograms (red and blue lines). In this case, the fault coverage slightly decreases to 99.9%. Note that the wide spread in errors is characteristic of aggressively timing optimized designs where a large number of paths (of varying bit significance) are near-critical. For relaxed timing constraints, the spread in the error distribution will be smaller, leading to even higher coverage for the stochastic checker.
An interesting property holds for errors that are not detected by stochastic checkers: they must lie within a certain range of the accurate output. In other words, the undetected errors lead to approximations of the output rather than arbitrary values. The following equation formalizes this observation:
$$|O_{\mathrm{accurate}} - O_{\mathrm{binary}}| \le |O_{\mathrm{accurate}} - O_{\mathrm{stock}}| + |O_{\mathrm{stock}} - O_{\mathrm{binary}}|$$
where $O_{\mathrm{accurate}}$ is the accurate binary output, $O_{\mathrm{stock}}$ is the stochastic checker's output, and $O_{\mathrm{binary}}$ is the binary output (which may be potentially faulty). We infer from the above equation that if the stochastic circuit can be designed to produce an output that differs by no more than $\epsilon$ from the golden output and the EB used for comparison is EB, the approximation in the output due to undetected errors is bounded by $\mathrm{EB} + \epsilon$.
In effect, the performance of the stochastic checker can be quantified using two metrics: 1) coverage, which denotes the fraction of errors that are detected by the checker and 2) false positive rate, which quantifies the fraction of correct outputs that are incorrectly detected as errors. The value of EB plays a key role in determining the tradeoff between these two metrics. When EB is increased, coverage decreases, as errors that are smaller in magnitude than EB will be missed by the checker. However, the probability of incorrectly deeming an output as faulty will also be lower, as the error introduced due to the approximate nature of the stochastic checker is likely to lie within the EB. On the other hand, a smaller value of EB yields better coverage, while increasing the false positive rate. Therefore, it is key to select EB such that we achieve a good tradeoff between fault coverage and false positives.
We demonstrate the tradeoff between coverage and false positive rate for the four-tap FIR filter. As before, faults were introduced in the circuit by overscaling its clock frequency. Fig. 4 shows the coverage and false positive rate as the EB is varied. We observe that when EB = 0 (i.e., the stochastic output is compared exactly with the binary output), we achieve a fault coverage of nearly 100%, but the false positive rate is very high at 90%. As the value of EB is increased, the fault coverage decreases slightly (from 100% to 98%), whereas the false positive rate reduces drastically (from 90% to 0%). This is a favorable tradeoff since we significantly lower the number of false positives while only slightly degrading fault coverage.
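The EB comparison and the resulting coverage versus false positive tradeoff can be reproduced qualitatively with a toy model. Everything numerical in the sketch below is an assumption made for illustration and is not taken from the FIR experiment: the checker error is modeled as Gaussian noise with a standard deviation of 200, half of the trials are faulty, and fault magnitudes are drawn uniformly.

```python
import random


def flags_error(o_binary, o_stock, eb):
    """Raise the error flag only if o_binary lies outside [o_stock - eb, o_stock + eb]."""
    return abs(o_binary - o_stock) > eb


def sweep(error_bands, trials=2000, seed=0):
    """Return (EB, coverage, false positive rate) for each EB under a toy fault model."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        golden = rng.randrange(1 << 16)
        o_stock = golden + rng.gauss(0, 200)  # assumed checker inaccuracy
        faulty = rng.random() < 0.5           # assumed fault rate
        delta = rng.choice((-1, 1)) * rng.randrange(1, 1 << 15) if faulty else 0
        samples.append((golden + delta, o_stock, faulty))

    n_faulty = sum(f for _, _, f in samples)
    n_clean = len(samples) - n_faulty
    results = []
    for eb in error_bands:
        detected = sum(f and flags_error(b, s, eb) for b, s, f in samples)
        false_pos = sum((not f) and flags_error(b, s, eb) for b, s, f in samples)
        results.append((eb, detected / n_faulty, false_pos / n_clean))
    return results


if __name__ == "__main__":
    for eb, coverage, fp_rate in sweep([0, 100, 400, 1600]):
        print(f"EB={eb:5d}  coverage={coverage:.3f}  false positives={fp_rate:.3f}")
```

Because the flag condition becomes strictly harder to satisfy as EB grows, both coverage and false positive rate can only decrease along the sweep; how quickly each one falls depends on the assumed error distributions.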
In scenarios where the binary circuits operate at a larger bit-width, it is not necessary to design the stochastic circuit to the same bit-width. In fact, the stochastic circuit can be of a lower bit-width and its outputs can be scaled appropriately before comparison with the binary output. While this approach does reduce the high latency overheads of employing larger bit-width stochastic circuits, it also adversely affects the fault coverage and false positives of the checker. For example, consider the above case of an FIR filter where the stochastic circuit is designed to 8-bit precision while the binary circuit has a higher bit-width. We take three variants of this design: FIR-9 (binary with 9-bit inputs), FIR-10 (binary with 10-bit inputs), and FIR-12 (binary with 12-bit inputs). For each of these designs, the stochastic circuit receives the 8 MSBs of the inputs (i.e., the lower input bits are truncated). Similarly, the output of the stochastic circuit is scaled and left shifted by 2, 4, and 8 bits for FIR-9, FIR-10, and FIR-12, respectively, so that its MSBs are close to the correct binary output. In order to quantify the impact on fault coverage and false positives of such a design methodology, errors are injected into these designs through clock overscaling (such that the error rate remains the same in all the designs). As previously mentioned, clock overscaling affects the MSBs of the binary circuit first, causing high magnitude errors. Fig. 5 captures the impact on fault coverage and false positives for FIR-9, FIR-10, and FIR-12. One can observe that as the bit-width increases from 9 to 12, the maximum fault coverage at the 0% false positive point decreases. This occurs because, in order to attain a 0% false positive rate, the EB is increased significantly, leading to several high magnitude errors going undetected.
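The truncate-then-shift scaling can be illustrated on a single product term. The sketch below is an assumption-laden simplification: it models only the ideal value that an 8-bit checker would target for one multiplication (ignoring stochastic noise), and the function name is hypothetical. Because each product has two truncated operands, the result is shifted left by twice the number of dropped input bits, which is consistent with the shifts of 2, 4, and 8 bits quoted for 9-, 10-, and 12-bit inputs.

```python
def scaled_checker_product(a, b, in_bits, checker_bits=8):
    """Ideal reduced-precision product, rescaled to the binary circuit's range."""
    drop = in_bits - checker_bits          # low-order input bits that are truncated
    approx = (a >> drop) * (b >> drop)     # product of the 8 MSBs of each input
    return approx << (2 * drop)            # left shift restores the magnitude


if __name__ == "__main__":
    a, b = 1023, 1001                      # example 10-bit operands
    print(a * b, scaled_checker_product(a, b, in_bits=10))
```

The gap between the exact and rescaled products is a systematic error that adds to the checker's stochastic error, which is one way to see why a wider EB is needed as the binary bit-width grows.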
The above discussions indicate that the accuracy of the stochastic checker's output plays a critical role in determining the coverage and false positive rate. Therefore, designing highly accurate checkers while retaining the area and power benefits is a key challenge. Another important factor that limits the applicability of stochastic checkers is the latency incurred in computing the stochastic output. As the latency determines the rate at which the stochastic checker can sample inputs, it directly influences the duration for which a fault should persist in the circuit in order to be detected. Therefore, we propose techniques to improve the accuracy and latency of stochastic checkers, which are described in the following sections.
B. Improving the Accuracy of Stochastic Checkers
A simple way to improve the accuracy of stochastic circuits is to increase the length of the bitstream. However, this incurs an exponential increase in the latency of the circuit, and is hence not desirable. Another approach to enhance accuracy is to improve the randomness of the bitstreams by utilizing more sophisticated random number generators [12], [14]. However, these techniques incur substantial area and power overheads with only modest accuracy improvements.
We propose a new low-overhead approach, called IPPR, in which the stochastic logic in the checker is partially replicated and the results from the replicas are averaged to produce the final checker output. Since errors in stochastic circuits are largely random in nature, such averaging results in considerable improvements in accuracy. The key idea is to generate stochastic bitstreams for the different replicas without replicating the pseudorandom number generators (PRNGs) in the BTS converters, which account for a major portion of the stochastic checker's area and power.
To elaborate, let us consider the checker shown in Fig. 6, which takes two inputs. A stochastic checker without IPPRs requires two PRNGs and two comparators, one for each input. When we introduce the second replica, we double the number of comparators, but reuse the PRNGs, connecting them to different inputs in each replica. This significantly reduces the overhead of replication, as PRNGs are the single largest component in stochastic circuits (consuming ∼50% area for the four-tap FIR filter example). We note that sharing PRNGs across the partial replicas does not introduce additional correlation errors since the replicas operate independently, and their outputs are converted to the binary domain and then averaged to compute the checker's output. The major contribution of the IPPR scheme is toward reducing errors induced due to the randomness properties of the PRNGs (such as LFSRs).
Fig. 5. Impact of using a low bit-width stochastic checker for a binary circuit operating at a higher bit-width. The experiments were performed on four circuits where the stochastic checker is always operating on 8-bit inputs, while the binary circuit operates on (a) 8, (b) 9, (c) 10, and (d) 12 bits.
The STB converter is enhanced to count multiple bitstreams in a cycle, i.e., we use a k-up counter (instead of a 1-up counter) in the case of IPPRs with k partial replicas. We also restrict the number of replicas to powers of two, eliminating the need for a divider to compute the average (in our experiments, two or four partial replicas sufficed). Thus, we significantly improve the accuracy, and consequently the fault coverage and false positive rate of stochastic checkers, with relatively low overheads, using IPPRs. We quantify the benefits of IPPRs in Section V-B.
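The IPPR idea for a two-input multiplier checker can be sketched as follows. This is a simplified software model under stated assumptions: each PRNG is replaced by a seeded random permutation of the nonzero 8-bit states (mimicking a maximal LFSR), the stochastic logic is a single AND gate, and the average is taken in floating point where hardware would use a 2-up counter followed by a 1-bit shift. All names are hypothetical.

```python
import random

N_BITS = 8
PERIOD = (1 << N_BITS) - 1  # number of nonzero states of a maximal 8-bit LFSR


def prng_sequence(seed):
    """Stand-in for an LFSR: a seeded permutation of the states 1..255."""
    return random.Random(seed).sample(range(1, PERIOD + 1), PERIOD)


def replica_count(x, y, seq_x, seq_y):
    """One replica: two comparators feed an AND gate; a counter sums the 1s."""
    return sum((rx <= x) & (ry <= y) for rx, ry in zip(seq_x, seq_y))


def ippr_count(x, y, r1, r2):
    """Two partial replicas share the PRNGs r1 and r2 with permuted inputs."""
    c1 = replica_count(x, y, r1, r2)  # replica 1: x uses R1, y uses R2
    c2 = replica_count(x, y, r2, r1)  # replica 2: x uses R2, y uses R1
    return (c1 + c2) / 2              # average of the two replica outputs


def mean_abs_error(estimator, pairs):
    """Mean deviation from the ideal count x * y / PERIOD."""
    return sum(abs(estimator(x, y) - x * y / PERIOD) for x, y in pairs) / len(pairs)


r1, r2 = prng_sequence(1), prng_sequence(2)
pairs = [(x, y) for x in range(0, 256, 17) for y in range(0, 256, 17)]
mae_replica1 = mean_abs_error(lambda x, y: replica_count(x, y, r1, r2), pairs)
mae_replica2 = mean_abs_error(lambda x, y: replica_count(x, y, r2, r1), pairs)
mae_ippr = mean_abs_error(lambda x, y: ippr_count(x, y, r1, r2), pairs)

if __name__ == "__main__":
    print(mae_replica1, mae_replica2, mae_ippr)
```

By the triangle inequality, the averaged error can never exceed the mean of the two replicas' errors; the size of the actual improvement depends on how weakly correlated the two replicas' errors are.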
C. Chaining Stochastic Computations
The granularity at which error checking is to be performed is a key choice in the design of stochastic checkers. Since stochastic logic intrinsically has a low logic complexity and logic depth, it is possible to pack the equivalent of several cycles of binary computation into a single checker unit. Such chaining of stochastic computations can partially offset the high latency of stochastic checkers.
For example, consider a function F that is computed by evaluating a combinational kernel f iteratively over k cycles with varying inputs. If F is a dot product of vectors, f could be a scalar multiply-accumulate (MAC) operation, in which case k would be the vector length. In this case, the output of the binary circuit, $O_{\mathrm{binary}}$, will be available at the end of k cycles. Suppose we design the error checker to perform the equivalent of a single cycle of binary computation, i.e., f. Computing f in the stochastic domain will take $2^N$ cycles, where N is the bit-width of the inputs. Thus, the output of the stochastic checker, $O_{\mathrm{stock}}$, will be available at the end of $k \times 2^N$ cycles. Therefore, only every $2^N$th binary output will be sampled for error detection, which may be unacceptable for higher values of N. This scenario is depicted in Fig. 7(a).
In order to reduce error detection latency, we propose to chain multiple kernel computations within a single stochastic computation. This is possible because stochastic circuits, being highly compact, can easily accommodate such chaining without violating the imposed constraints of area, speed, and power. Chaining stochastic computations reduces the time for the stochastic checker to output the result and thus improves error detection latency. For example, suppose we chain p computations of the kernel f into the stochastic circuit. The output from the stochastic checker will be available at the end of $(k/p) \times 2^N$ cycles. In other words, one in every $2^N/p$ binary outputs will be sampled for error detection. This scenario is presented in Fig. 7(b).
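The latency expressions for unchained and chained checking can be evaluated directly. In the sketch below, the function names are hypothetical, and the example values (k = 16 kernel evaluations, N = 8 bits, p = 4 chained kernels) are assumptions chosen so that p divides k.

```python
def checker_latency(k, n_bits, p=1):
    """Cycles until the stochastic checker output is available: (k / p) * 2^N."""
    if k % p:
        raise ValueError("this sketch assumes p divides k")
    return (k // p) * (1 << n_bits)


def sampled_one_in(n_bits, p=1):
    """One in every 2^N / p binary outputs is sampled for error detection."""
    return (1 << n_bits) // p


if __name__ == "__main__":
    k, n_bits = 16, 8
    for p in (1, 2, 4, 8):
        print(p, checker_latency(k, n_bits, p), sampled_one_in(n_bits, p))
```

Chaining improves both quantities by a factor of p, but the dependence on N remains exponential.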
While the benefits of chaining functions in the stochastic domain are clear from the above discussion, they come at the price of extra hardware. The stochastic circuit now computes more kernels per cycle than the binary circuit. It also requires the inputs for these kernels to be available, which increases the number of BTS converters and hence the hardware overhead. We demonstrate this tradeoff between error detection latency and hardware overhead through an example. We consider the function F to be a 128-tap FIR filter, which requires 128 MAC operations to be performed. This function is broken down into a kernel f computing eight MAC operations in one cycle, which leads to k = 16. For this binary circuit, the benefits and overheads of computing multiple kernels p in the stochastic domain are shown in Fig. 8. The area overheads here refer to the extra cost of adding StoCK with the necessary configuration over and above the conventional binary circuit.
It is evident that, as p increases, the error detection latency reduces with a corresponding increase in area of the stochastic checker. A suitable point on this tradeoff curve needs to be chosen such that significant benefits in error detection latency are obtained while the costs of extra hardware are within tolerable limits.
While chaining computations considerably reduces the checker's latency, the resulting latency is still exponential in N. Thus, it may be desirable to further reduce the latency; techniques to achieve this are proposed in the next section.
D. Progressive Stochastic Checking
The latency of stochastic checkers is primarily dictated by the length of the bitstream used, and fundamentally determines the rate at which inputs can be sampled by the checker. We reiterate that stochastic checkers are targeted at detecting persistent errors due to voltage or clock overscaling, aging, and temperature or voltage variations, which last for significant durations (often, until they are corrected by system-level mechanisms such as voltage/frequency regulation or thermal management that require significant response time). Nevertheless, it is desirable to reduce the checker's latency to ensure timely detection. Toward that end, we propose progressive checking policies, which are described below.
The progressive checking policy concept is motivated by two key observations about stochastic checkers. First, there exists a direct correlation between the length of the bitstream and the accuracy of the stochastic output, i.e., with larger bitstream length, the stochastic output becomes probabilistically more accurate. The second observation relates to the tradeoff between the accuracy of the stochastic output and the magnitude of the EB required to achieve a desired fault coverage. When the stochastic circuit can have large errors, then we need to conservatively set the EB to be smaller to achieve a desired coverage. This is because smaller EBs lead to errors being flagged more aggressively, and it is unlikely that the erroneous output lies very close to the stochastic output, notwithstanding the inaccuracy of the stochastic output itself. However, as shown in Fig. 4 , smaller EBs are undesirable from a false positive rate perspective. In summary, to achieve a given coverage, less accurate stochastic checkers require us to use smaller EBs, which consequently increases the false positive rate of the checker.
Keeping the above observations in mind, the progressive checker policy is designed as follows. First, given a binary circuit, a target fault coverage, and a false positive rate, we determine the maximum bitstream length needed by the stochastic checker. Next, we split the bitstream into multiple intervals, and at each interval determine the maximum value of EB that yields the target fault coverage. Note that the intermediate intervals would have a higher false positive rate, as the EB is smaller at those points. Fig. 9 depicts this trend in the context of a four-tap FIR filter. In this case, we set the desired fault coverage to 99%. For this example, we find that a bitstream of length 256 satisfies these requirements. We split the bitstream into five intermediate intervals of 8, 16, 32, 64, and 128. Fig. 9 shows the normalized magnitude of the EB and the false positive rate at each of these points. We observe that, as the bitstream length is increased, thereby improving the accuracy of the stochastic checker, the EB magnitude increases, while decreasing the false positive rate as a consequence.
During execution, we progressively compare the outputs of the stochastic checker and the binary circuit after each interval with the appropriate EB, and take the following actions.
1) If the checker indicates an error, we make no decision and simply continue to the next interval. This is because, since the false positive rate is high at the intermediate points, an error may be flagged due to an actual fault or due to an incorrect checker output.
2) If the checker does not indicate an error, we terminate and deem the binary output as correct. This is because all intermediate checker intervals and EBs are selected to achieve the same fault coverage.
The above policy allows for early termination of the checker when the binary output is found to be correct at any interval; on the other hand, the binary output can be flagged as erroneous only after the entire bitstream is processed. The effectiveness of the progressive checking policy depends on the actual error rate in the circuit. If the error rate is low, then a decision is made on most inputs quite early, significantly lowering the average checker latency. As the error rate increases, a proportionately larger fraction of the inputs are resolved only after the final interval, lowering the benefits of progressive checking. This design choice is deliberate: the error rate in practice is typically low, and hence, the progressive checking policy has the potential to significantly lower the latency of stochastic checkers.
To incorporate the concept of progressive checking in StoCK, we replace the comparator of Fig. 2 with a slightly more complex logic that operates based on the flowchart shown in Fig. 10 . For each progressive checking interval, the stochastic output is scaled appropriately and the respective EB (identified during precharacterization) is chosen for comparison with the binary output. The final error flag is raised based on the outcome of the above-mentioned policies for a given latched binary output.
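The progressive checking policy can be sketched as a simple loop over the checking intervals. This is an illustrative software model, not the comparator logic of Fig. 10: the function name, the interval list, and the EB values used in the example are assumptions, and the checker output is represented as a count of 1s scaled to the full bitstream length.

```python
def progressive_check(o_binary, stock_bits, intervals, error_bands):
    """Return (error_flag, cycles_used) for one latched binary output.

    intervals are increasing prefix lengths ending with the full bitstream
    length; error_bands holds the precharacterized EB for each interval.
    """
    full_length = intervals[-1]
    ones, pos = 0, 0
    for length, eb in zip(intervals, error_bands):
        ones += sum(stock_bits[pos:length])        # keep counting 1s in the prefix
        pos = length
        estimate = ones * full_length / length     # scale prefix count to full length
        if abs(o_binary - estimate) <= eb:
            return False, length                   # no error: terminate early
        # error indicated: make no decision and continue to the next interval
    return True, full_length                       # still flagged after final interval


INTERVALS = [8, 16, 32, 64, 128, 256]
ERROR_BANDS = [4, 8, 12, 16, 20, 24]               # hypothetical, growing with length

if __name__ == "__main__":
    bits = [1, 0] * 128                            # encodes 128 out of 256
    print(progressive_check(128, bits, INTERVALS, ERROR_BANDS))
    print(progressive_check(200, bits, INTERVALS, ERROR_BANDS))
```

In this model a correct output is typically resolved after a short prefix, whereas a faulty output consumes the entire bitstream before the flag is raised, which mirrors the asymmetry of the policy.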
E. Progressively Accurate Binary-to-Stochastic Converters
The progressive checking technique described in the previous section is beneficial if the checker terminates more often at lower latency intervals. This is possible only if the stochastic checker's outputs at lower latency intervals provide a good estimate of the final output. In other words, if the probability of 1s in each output bitstream is relatively invariant across intervals, then the output at each interval can be appropriately scaled to estimate the final output. Since the stochastic logic itself is combinational, this largely translates into a requirement on the input bitstreams. In particular, it is desirable that each BTS converter generates a bitstream that reflects the input binary number as closely as possible not only when considered in its entirety, but also when prefixes of the bitstream are considered.
Consider a BTS circuit employing an 8-bit maximal polynomial LFSR for converting an 8-bit binary number to a stochastic bitstream. As outlined above, the probability of prefixes of the output bitstream (substreams of length 8, 16, 32, etc., which correspond to the lower latency intervals of stochastic checking) must be close to the probability of the entire bitstream (256). Fig. 11(a) shows the absolute error in bitstream probability at five lower latency intervals, across all possible 8-bit binary input numbers.
It can be seen that for a latency interval of eight, the error in probability is very high for the majority of the input numbers. This error in probability arises due to the fact that the LFSR within the BTS is designed for the highest bitstream length (256 in this case), and thus the random numbers generated at lower latencies may not cover the required states to produce an accurate bitstream.
To address this issue, we propose to design BTS converters using multiple maximal polynomials that are designed to produce bitstreams of different lengths. To elaborate, let us consider a scenario where we use a separate maximal polynomial LFSR for each latency interval, i.e., a 3-bit, 4-bit, 5-bit, 6-bit, 7-bit, and an 8-bit LFSR all working independently. For a given binary input, for the first eight cycles, we compare the upper 3 bits of the binary number with the output from the 3-bit LFSR and generate the required bitstream. From the 9th cycle until the 16th cycle, we compare the upper 4 bits of the binary number with the 4-bit LFSR. Similarly the 5-bit, 6-bit, 7-bit, and the 8-bit LFSRs will be used from the 17th, 33rd, 65th, and 129th cycles, respectively. Fig. 11(b) shows the error in probability of the above configuration for all 8-bit binary numbers. It can be seen that there is a large reduction in error at lower latencies using the multiple LFSRs compared with using just a single maximal polynomial LFSR of the highest bit length [ Fig. 11(a) ]. However, employing different LFSRs for each substream length leads to a large area overhead for the BTS circuit owing to the distinct LFSR and comparator circuits.
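A behavioral sketch of the multiple-LFSR scheme described above is given here in Python. It follows the stated schedule (the 3-bit LFSR for cycles 1 to 8, the 4-bit LFSR for cycles 9 to 16, and so on up to the 8-bit LFSR from cycle 129) and compares the upper bits of the input with the selected LFSR. The tap sets, the common seed of 1, the `state <= upper bits` comparator convention, and the choice to clock every LFSR on every cycle are assumptions made for illustration.

```python
# Assumed maximal-length tap sets (1-indexed bit positions), one per width.
MAXIMAL_TAPS = {3: (3, 2), 4: (4, 3), 5: (5, 3),
                6: (6, 5), 7: (7, 6), 8: (8, 6, 5, 4)}

# (first cycle, LFSR width) pairs taken from the schedule in the text.
SCHEDULE = ((1, 3), (9, 4), (17, 5), (33, 6), (65, 7), (129, 8))

def step(state, width):
    """Advance a Fibonacci (XOR-feedback) LFSR of the given width by one cycle."""
    feedback = 0
    for tap in MAXIMAL_TAPS[width]:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)

def period(width, seed=1):
    """Number of steps before the LFSR returns to its seed state."""
    state, steps = step(seed, width), 1
    while state != seed and steps <= (1 << width):
        state, steps = step(state, width), steps + 1
    return steps

def multi_lfsr_bitstream(value, length=256):
    """Generate a bitstream for an 8-bit value using one LFSR per interval."""
    states = {width: 1 for width in MAXIMAL_TAPS}   # all LFSRs start at seed 1
    bits = []
    for cycle in range(1, length + 1):
        # Select the widest LFSR whose interval has already started.
        width = max(w for start, w in SCHEDULE if cycle >= start)
        upper = value >> (8 - width)                # upper `width` bits of the input
        bits.append(1 if states[width] <= upper else 0)
        # Assumption: every LFSR is clocked on every cycle, used or not.
        for w in states:
            states[w] = step(states[w], w)
    return bits

if __name__ == "__main__":
    stream = multi_lfsr_bitstream(128)
    for n in (8, 16, 32, 64, 128, 256):
        print(f"prefix {n:3d}: probability of 1s = {sum(stream[:n]) / n:.3f}")
```

Each width needs its own register and its own comparator input in this arrangement, which is the source of the area overhead noted above.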
In view of the above issues, we present a PA-BTS converter that reduces this error in probabilities of the BTS circuit at lower latency intervals with minimal area overheads. In a PA-BTS, we use a reconfigurable LFSR that contains just one register of the highest bitwidth, and additional XOR gates and MUXes to provide the necessary switching at different latency intervals. Fig. 12 shows such an 8-bit LFSR configured to use the upper 4 bits as a maximal 4-bit polynomial LFSR for the first 16 cycles and switch to the 8-bit LFSR polynomial after the 16th cycle. This technique can be easily extended to include additional latency intervals. Since the lower latency polynomials are incorporated within the same register, only one comparator is required at the output for comparing with the binary number. This drastically reduces the overhead compared with using multiple LFSRs. Fig. 11(c) shows the error in probability of an 8-bit PA-BTS with five switching modes (one each for latency intervals of 8, 16, 32, 64, and 128). It should be noted here that the error at each latency interval (except for interval 8) is slightly higher than that observed in Fig. 11(b). This is because the bitstream generated by the lower latency intervals is reused by the higher latency intervals. This may lead to overlap of certain LFSR states/sequences, thus leading to errors at the BTS circuit output. This error may also affect the final bitstream (i.e., at the end of 256 cycles). However, in our context, this is a desirable tradeoff. In summary, the proposed PA-BTS greatly improves the performance of StoCK with progressive checking compared with a conventional LFSR, at the cost of modest area overheads. Section V-D discusses the area overheads and reductions in fault checking latency across various benchmark circuits.
IV. EXPERIMENTAL METHODOLOGY
To evaluate stochastic checkers, we utilized a suite of applications, listed in Table I , from the signal processing, image processing, and recognition application domains. Table I also shows the dominant computational kernel for each application and the number of gates in its baseline binary hardware implementation.
The benchmarks (both stochastic and binary versions) were implemented at the register-transfer level (RTL) using Bluespec System Verilog and synthesized to a 65-nm technology library using Cadence RTL Compiler. Cadence Encounter was then used to estimate power consumption. We utilized timing errors introduced by clock overscaling as the source of faults in the binary circuit. To model these errors, we performed SDF-based timing simulations of the synthesized netlists at the overscaled frequency using Cadence Incisive for 1 million random input vectors. The actual fault rate injected in the binary circuit was around 1%, i.e., the clock frequency was scaled to an extent where roughly 1% of the inputs resulted in erroneous outputs.
V. RESULTS

A. StoCK: Area and Power Overheads
First, we quantify in Table II the area and power overheads of using stochastic checkers for the different benchmarks. The overheads are normalized to the area and power of the respective binary circuits. The stochastic checkers were implemented with four IPPRs (4-IPPR), and incur a maximum latency of 256 cycles. We observe from Table II that, across all benchmarks, stochastic checkers result in average area overheads of 29.5% and average power overheads of 21.5%. This translates to 3.4× and 4.5× reductions in area and power overheads, respectively, compared with employing dual-modular redundancy (DMR). Table II also lists the coverage and false positive rate achieved using StoCK. We find that the coverage is 99.5% on average (100% for three of the five benchmarks, and >98% for all of them), with an average false positive rate of ∼0.1%. While we consider the area and power overheads to be reasonable, we note that they can be lowered using fewer replicas (2-IPPR), at the cost of slight degradation in coverage and false positives.
B. Effect of IPPR on Accuracy
In this section, we demonstrate the improvement in accuracy achieved using IPPRs to design the stochastic checker. We quantify accuracy as the absolute relative error between the stochastic output and correct binary output, as shown in (1).
\[ \text{Error} = \frac{|\text{stochastic output} - \text{binary output}|}{\text{binary output}} \qquad (1) \]
Fig. 13 shows the improvement in accuracy achieved using two replicas (2-IPPR) and four replicas (4-IPPR) for various bitstream lengths in the case of a four-tap FIR filter. We find that, across all bitstream lengths, increasing the number of replicas consistently improves the accuracy of the stochastic checker. For example, when the bitstream length is 8, we observe a ∼25% improvement in absolute relative error between the conventional and 4-IPPR configurations.
The improvement in accuracy of the stochastic checker directly translates to higher coverage and lower false positive rate, as illustrated in Fig. 14 for a four-tap FIR filter. We find that the coverage is best for the 4-IPPR configuration, remaining close to 100% for a wider range of EB magnitudes, followed by the 2-IPPR and conventional cases. Similarly, the false positive rate is the lowest for the 4-IPPR configuration, followed by the 2-IPPR and conventional cases, underscoring the effectiveness of the IPPR technique. Finally, we quantify the area overheads of IPPR for a four-tap FIR filter in Table III. Since only a relatively small fraction of the stochastic checker is replicated, we find that the overheads increase only modestly when the number of replicas is increased: ∼4% and ∼13% for 2-IPPR and 4-IPPR, respectively.
C. Effect of Progressive Checking
Fig. 16 demonstrates the reductions in latency achieved using the progressive checking policy described in Section III-D. In all cases, the maximum length of the bitstream used was 256. We divided the bitstream into six intervals of lengths 8, 16, 32, 64, 128, and 256. Fig. 16 shows the fraction of outputs on which a decision was made at each interval. We find that in all circuits, except sum of absolute difference (SAD), a significant fraction of the outputs are resolved quite early. SAD requires computing the difference between each pair of inputs and adding these differences. In the stochastic domain, subtraction is implemented using XOR gates and addition is implemented using MUXes (since the inputs to the stochastic circuits are in unipolar format) as shown in Fig. 15. XOR gates, however, are not very accurate at subtraction (e.g., compared with MUXes for addition). Thus, the error at the output of the XOR is very high, even for uncorrelated input bitstreams. An array of such XORs feeding into another array of MUXes further deteriorates the quality of the output of SAD. For this reason, the scaling at intermediate latency intervals is also significantly affected. Scaling these intermediate values leads to large quantization errors and lowers the fault coverage. Hence, the checker has to frequently resort to the highest latency output (i.e., 256 cycles), which results in higher overall checking latency. Fig. 16 also shows the average detection latency for the benchmarks. The reduction in average latency ranges from 1.25× for SAD to 8.5× for FIR.
The exact hardware overheads to implement just the control circuit of Fig. 10 vary between 1.1% and 4.5% of the total circuit (area and power) across the benchmarks.
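The inaccuracy of XOR-based subtraction noted for the SAD benchmark can be illustrated with a small Python sketch. For two independent unipolar bitstreams with probabilities p1 and p2, an XOR gate outputs a 1 with probability p1 + p2 − 2·p1·p2, which differs from the desired |p1 − p2|, whereas a MUX with a 0.5 select probability yields the scaled sum (p1 + p2)/2. The function names, the bitstream length of 4096, and the use of a seeded software random number generator in place of LFSR-based converters are assumptions made for illustration.

```python
import random

def xor_expected(p1, p2):
    """Expected output probability of an XOR gate with independent inputs."""
    return p1 + p2 - 2 * p1 * p2

def mux_expected(p1, p2, select=0.5):
    """Expected output probability of a 2:1 MUX (scaled addition)."""
    return select * p1 + (1 - select) * p2

def bernoulli_stream(p, length, rng):
    """Unipolar bitstream: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def simulate_xor(p1, p2, length=4096, seed=0):
    """Empirical probability of 1s at the XOR output for independent streams."""
    rng = random.Random(seed)
    a = bernoulli_stream(p1, length, rng)
    b = bernoulli_stream(p2, length, rng)
    return sum(x ^ y for x, y in zip(a, b)) / length

if __name__ == "__main__":
    for p1, p2 in ((0.5, 0.5), (0.8, 0.6), (0.9, 0.1)):
        target = abs(p1 - p2)                  # what a subtractor should produce
        print(f"p1={p1}, p2={p2}: |p1-p2|={target:.2f}, "
              f"XOR expected={xor_expected(p1, p2):.2f}, "
              f"XOR simulated={simulate_xor(p1, p2):.2f}, "
              f"MUX expected={mux_expected(p1, p2):.2f}")
```

The gap is largest when the two inputs are close in value: for p1 = p2 = 0.5 the desired difference is 0, but the expected XOR output is 0.5.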
D. Effect of PA-BTS
We implemented a PA-BTS with a reconfigurable LFSR that supports two modes: one for a 4-bit polynomial and the other for an 8-bit polynomial. An analysis of area, power, fault coverage, false positives, and checking latency using PA-BTS is presented in Table IV. While the area and power overheads are minimal for almost all circuits, there is a significant reduction in the average fault checking latency. There is also a slight reduction in fault coverage and false positives while employing PA-BTS. This happens because of the error introduced by the repetition of LFSR states across the different modes. This effect is more prominent in the case of the SAD benchmark, and leads to an increase in the fault checking latency.
E. Application-Level Case Study
We previously evaluated StoCK in terms of area, power, and latency for algorithmic kernels commonly used in error-resilient applications. In this section, we examine the benefits of employing StoCK for quality-aware adaptive voltage scaling at the application level. For this purpose, we consider handwritten-digit recognition using the K-nearest neighbor (K-NN) algorithm. An optical digit recognition data set [15] is used to evaluate accuracy. For our experiments, we train the circuit using the training set of 3823 samples, and use 500 test samples from the validation set (different from the training set) to evaluate energy and accuracy. Fig. 17 shows a hardware implementation of the K-NN algorithm that uses StoCK to enable adaptive voltage scaling. The K-NN algorithm uses SAD as the metric to calculate the distance between a test sample and a training sample. The SAD unit is capable of summing differences between up to eight pairs of scalar elements at a time (each scalar element being 5 bits wide). Since each input sample consists of 64 elements, a total of eight cycles would be required to calculate the distance between a test sample and a training sample. Thus, to classify a test sample to a digit class, it is compared with 3823 training samples (which requires a total of 3823 × 8 = 30 584 SAD computations) and the K closest training samples are stored. The majority class among the K closest samples is assigned as the class of the test sample. In this paper, we have chosen K = 7. The memory control block generates the addresses to access the training and testing samples from memories. It also contains a buffer to hold the seven closest training samples and chooses the majority class as the final output. The conventional design achieves a classification accuracy of 97.4% for the 500 test samples used. Fig. 17 also shows the stochastic SAD (St-SAD) unit. 
The error of the St-SAD unit (with respect to the binary SAD unit) was reduced from 9.64% to 5.53% using the 4-IPPR configuration. The latency of fault checking was reduced from 32 cycles to an average of 11 cycles using the progressive fault checking policy described in Section III-D.
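The K-NN classification flow described above can be sketched in Python as follows. The sketch computes the SAD distance in groups of eight element pairs (mirroring the eight-lane SAD unit), keeps the K = 7 closest training samples, and returns the majority class. The function names, the toy data, and the tie-breaking behavior (ties go to the class encountered first among the nearest samples) are assumptions; the sketch models neither the memory control block nor the stochastic checker.

```python
from collections import Counter

LANES = 8          # element pairs processed by the SAD unit per cycle
K = 7              # number of nearest neighbors used in the paper

def sad_cycles(num_elements, lanes=LANES):
    """Cycles needed for one distance computation (ceiling division)."""
    return -(-num_elements // lanes)

def sad(sample_a, sample_b, lanes=LANES):
    """Sum of absolute differences, accumulated one group of lanes at a time."""
    total = 0
    for start in range(0, len(sample_a), lanes):
        group_a = sample_a[start:start + lanes]
        group_b = sample_b[start:start + lanes]
        total += sum(abs(x - y) for x, y in zip(group_a, group_b))
    return total

def knn_classify(test_sample, training_set, k=K):
    """training_set is a list of (sample, label) pairs; returns the majority label."""
    nearest = sorted(training_set, key=lambda item: sad(test_sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Cycle counts quoted in the text: 64 elements per sample, 3823 training samples.
    cycles_per_distance = sad_cycles(64)
    print("cycles per distance:", cycles_per_distance)
    print("SAD computations per test sample:", 3823 * cycles_per_distance)

    # Toy training set (hypothetical data): 5-bit elements in the range 0..31.
    training = [([0] * 64, 0)] * 5 + [([31] * 64, 1)] * 5
    print("predicted class:", knn_classify([1] * 64, training))
```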
The voltage controller shown in Fig. 17 is responsible for controlling the supply voltage to the binary SAD unit. A reduction in the supply voltage from the nominal voltage leads to errors in the SAD unit and thus a reduction in the classification Table V refer to the power consumed in the SAD unit relative to the power of the unit at the nominal voltage. Fig. 18 shows the variation in the supply voltage versus time for each configuration in Table V . It should be noted that in Fig. 18 , V 8 is the nominal voltage and
Each of the simulations was run for the entire 500 test samples (≈ 15.2 million cycles). We can observe from Fig. 18 that for the first few inputs, the accuracy is low, since the voltage controller begins its operation at the lowest voltage. However, the voltage controller reacts and increases the supply voltage, eventually stabilizing the accuracy near the target level.
F. Comparison With Other Error Detection Schemes
One of the major factors that sets StoCK apart from conventional error detection methods such as N-modular redundancy (NMR) and temporal redundancy is the nature of the faults that can be detected. StoCK is primarily focused on detecting persistent or slowly changing faults, caused either by aging and process and temperature variations or deliberately by voltage and clock overscaling. The fact that StoCK samples only a subset of the outputs (due to its high checking latency) limits its ability to detect single-event upsets or faults that exist for a very short period of time. The DMR technique, on the other hand, is capable of sampling every output and thus detects both long and short duration faults, albeit with a significant area and power overhead compared with StoCK. For the benchmarks presented in Table I, StoCK presents an average of 24.3% area overhead while DMR leads to area overheads in the range of 101%-103% (accounting for the comparators and other necessary control logic). In addition, the applications targeted in this paper are approximate in nature (i.e., they can tolerate errors to a certain degree). In this context, StoCK is a much more suitable candidate than conventional DMR. It should also be noted that the stochastic checker operates in parallel and does not affect the throughput of the binary circuit output.
Temporal redundancy, on the other hand, operates by simply recomputing the same operations multiple times (typically two or three) on the same hardware. This technique lacks the ability to detect long duration or permanent faults. Although temporal redundancy provides a better solution in terms of area and power overheads, it provides a very poor solution in terms of energy efficiency (due to repeated computations).
VI. RELATED WORK
Ensuring reliability at low overheads is an important problem that has been addressed in various fields of research. However, our focus is on dedicated hardware implementations rather than general-purpose processors. Design techniques such as BISER [16] can be used to detect and correct transient soft errors, which are not the target of our work. Timing speculation [17], [18] is a well-known technique for detecting and correcting timing errors based on the use of double-sampling latches or FFs. Unlike StoCK, it cannot detect permanent failures, and requires strict timing constraints to be satisfied on the network that aggregates error signals from the FFs. The effectiveness of the technique in [17] is known to be limited in designs with a large number of critical or near-critical paths (due to the rapid increase in error rates leading to high recovery time/energy), or a large number of short paths (due to the need to insert buffers to eliminate paths shorter than the double-sampling window), both of which are common in practice [19]-[21]. In contrast, the proposed approach is effective on circuits with such timing characteristics. Finally, unlike Razor, our approach does not require any modifications to the cell library or design flow.
Although NMR-based designs are able to detect both persistent and transient errors, the area overheads incurred are very large. A more area and energy efficient solution is temporal redundancy [22] , which leads to degraded performance due to rollbacks and recoveries. An adaptive and a confidence driven approach for error resilient designs is proposed in [23] .
Several works have exploited the approximate nature of applications at various levels for fault/error tolerance. ERSA [24] exploits resilience at the system level by utilizing low reliability cores to execute error-resilient computations. Low-power designs based on voltage overscaling that leverage error correction at the algorithmic level have been proposed in [11] . These efforts are largely orthogonal to our work, which focuses on the design of stochastic checkers.
Previous efforts in SC such as [13] , [25] , and [26] have explored implementing various computations in a compact and efficient manner in the stochastic domain. Despite significant benefits in area and power, these implementations are usually inferior in performance and energy, limiting their potential as replacements for binary implementations. Complementary to these efforts and recognizing the above limitation, we explore the use of stochastic circuits as approximate error checkers.
VII. CONCLUSION
To overcome the reliability challenges posed by deeply scaled CMOS and post-CMOS devices in the context of approximate applications, we proposed and explored stochastic error checkers. We addressed various challenges involved in the design of stochastic checkers. We proposed the use of IPPRs to improve the accuracy of the stochastic output, thereby improving coverage and reducing false positives. The latency of stochastic checkers was reduced through progressive checking policies. Our experiments demonstrated that stochastic checkers require ∼29.5% area and ∼21.5% power overheads, while achieving a coverage of ∼99.5% and false positive rate of ∼0.1%.
