We consider the problem of computing a binary linear transformation when all circuit components are unreliable. Two models of unreliable components are considered: probabilistic errors and permanent errors. We introduce the "ENCODED" technique that ensures that the error probability of the computation of the linear transformation is kept bounded below a small constant independent of the size of the linear transformation even when all logic gates in the computation are noisy. By deriving a lower bound, we show that in some cases, the computational complexity of the ENCODED technique achieves the optimal scaling in error probability. Further, we examine the gain in energy efficiency from the use of a "voltage-scaling" scheme, where gate energy is reduced by lowering the supply voltage. We use a gate energy-reliability model to show that tuning gate energy appropriately at different stages of the computation ("dynamic" voltage scaling), in conjunction with ENCODED, can lead to orders-of-magnitude energy savings over the classical "uncoded" approach. Finally, we also examine the problem of computing a linear transformation when noiseless decoders can be used, providing upper and lower bounds to the problem.
principles, such as "voltage-scaling" (which is commonly used in modern circuits), reduce energy consumption [7], but often at a reliability cost: when the supply voltage is reduced below the transistor's threshold voltage, component variability results in reduced control of component reliability. From an area viewpoint, as transistors become smaller and clock frequencies become higher, the noise margin of semiconductor devices is reduced [8]. In fact, voltage variation, crosstalk, timing jitter, thermal noise caused by increased power density, and quantum effects can all jeopardize reliability. 1 Beyond CMOS, circuits for many emerging technologies, such as those built out of carbon nanotubes [10], suffer from reliability problems, such as wire misalignment and metallic carbon nanotubes [11]. Thus, owing to a host of factors, circuit reliability is becoming an increasingly important issue.
While most modern implementations use overwhelmingly reliable transistors, an appealing idea is to deliberately allow errors in computation, and design circuits and systems that "embrace randomness and statistics, treating them as opportunities rather than problems" [3]. Inspired by the triumph of Shannon theory in dealing with noise in communication channels [12], von Neumann initiated the study of noise in circuits [13]. He showed that even when circuit components are noisy, it is possible to bias the output towards the correct output using repetition-based schemes. Repeated computations, followed by majority voting, have been used in some applications to make circuits error-tolerant [14] [15] [16]. In fact, many functional-block-level or algorithmic error-tolerant designs have been tested on real systems [14], [17], [18]. However, in the absence of a comprehensive understanding of the fundamental tradeoffs between redundancy and reliability, these designs have no guarantees on the gap from optimality.
Can the use of sophisticated codes help? For storage, which can be viewed as computing the identity function, Low-Density Parity-Check (LDPC) codes [19] and Expander codes [20] have been used to correct errors [21] [22] [23] [24]. Closer in spirit to computation with noisy elements, in [21] [22] [23], the decoders (though not the encoders) for storage are assumed to be noisy as well. In [25], adaptive coding is used to correct memory faults for fault-tolerant approximate computing. Decoding with noisy elements has become an area of active research [26] [27] [28] [29] [30]. In [26] [27] [28] [29] [30], noisy decoders performing message-passing algorithms are analyzed using the density evolution technique [31], [32]. The idea of using noisy decoders, and yet achieving reliable performance, is further extended to noisy discrete-time error-resilient linear systems in [33], where LDPC decoding is utilized to correct state errors after each state transition. Error control coding is also used in fault-tolerant parallel computing [34] and AND-type one-step computing with unreliable components [35], and has been applied to error-resilient systems on chips (SoCs) [36]. Unfortunately, all of the above works on using sophisticated codes in noisy computing have one major intellectual and practical shortcoming: while they use noisy gates to perform some computations, they all assume absolute reliability in either the encoding part, or the decoding part, or both.
The perspective of allowing some noiseless gates in noisy computing problems has permeated the investigation of fundamental limits as well (e.g. [37] [38] [39]), where, assuming that encoding and/or decoding are free, the authors derive fundamental limits on required resources for computation with noisy elements with no assumptions on the computation strategy. Can one choose to ignore costs associated with encoding or decoding? While ignoring these costs is reasonable in long-range noisy communication problems [40], where the required transmit energy tends to dominate encoding/decoding computation energy, recent work shows this can yield unrealistically optimistic results in short-range communication [40] [41] [42] [43] [44] and noisy computing [45], especially in the context of energy. These works derive fundamental limits for simplistic implementation models that account for total energy consumption, including that of encoding and decoding, in communication [41] [42] [43] [44] and computing [45].
In this paper, we investigate the problem of reliable 2 computation of binary linear transformations using circuits built entirely out of unreliable components, including the circuitry for introducing redundancy and correcting errors.
In Section III, we study the problem of computing linear transformations using homogeneous noisy gates, all of which are drawn from the same faulty gate model. We consider both probabilistic error models (transient gate errors) [46] and permanent-error models (defective gates) [28]. The problem formulation and reliability models are detailed in Section II.
The key to our construction is the "ENCODED" technique (Encoded Computation with Decoders EmbeddeD), in which noisy decoders are embedded inside the noisy encoder to repeatedly suppress errors (Section IV). The entire computation process is partitioned into multiple stages by utilizing the properties of an encoded form of the linear transformation matrix (see Section III-A for details). In each stage, errors are introduced due to gate failures, and then suppressed by embedded noisy decoders [27] , preventing them from accumulating. Intuition on why embedded decoders are useful is provided in Section III-B.
In Sections III and IV, we show that using ENCODED with LDPC decoders, an L × K binary linear transformation can be computed with O(L) operations per output bit, while the output bit error probability is maintained below a small constant that is independent of L and K. In Section IV-C, we use expander LDPC codes to achieve worst-case error tolerance, while still using error-prone decoding circuitry. We show that ENCODED can tolerate defective-gate errors as long as the fraction of defective gates is below a small constant. We also obtain a stronger result on the computational complexity when the block error probability, instead of bit error probability, is specified: by deriving a fundamental lower bound (when the linear transform has full row rank), we show that the computational complexity per bit matches the lower bound in the scaling of the target error probability, as the required block error probability approaches zero. Interestingly, in the derivation of this lower bound, we allow the circuit to use noiseless gates to perform decoding operations. In Section IV-D, we use simulations to show that using exactly the same types of noisy gates (even with the same fan-in), the achieved bit error ratio and the number of iterations of ENCODED are both smaller than those of repetition-based schemes. Since computing energy is closely related to the number of operations, this shows an energy advantage of our ENCODED technique as well.
In Section V, we go a step further and systematically study the effect of tunable supply voltage ("dynamic" voltage scaling) on the total energy consumption by modeling energy-reliability tradeoffs at the gate level. For dynamic scaling, the gates are no longer homogeneous. We introduce a two-phase algorithm in which the first phase is similar to ENCODED with homogeneous gates, but in the second phase, the voltage (and hence gate energy) is tuned appropriately, which leads to orders-of-magnitude energy savings when compared with "static" voltage scaling (where the supply voltage is kept constant through the entire computation process). For example, when the required output bit error probability is $p_{tar}$, for polynomial decay of the gate error probability $\epsilon$ with gate energy $E$ (i.e., $\epsilon = 1/E^{c}$), the energy consumption per output bit is $O\left(\frac{N}{K}\max\left\{L, \left(\frac{1}{p_{tar}}\right)^{1/c}\right\}\right)$ with dynamic voltage scaling, while it is $\Theta\left(\frac{NL}{K}\left(\frac{1}{p_{tar}}\right)^{1/c}\right)$ for the static case (we note that the energy for ENCODED with static voltage scaling is still smaller than that of "uncoded" with static voltage scaling, which is $\Theta\left(L\left(\frac{L}{p_{tar}}\right)^{1/c}\right)$). Finally, in Section VI, for deriving a lower bound as well as to connect with much of the existing literature, we allow the circuit to use noiseless gates for decoding. We derive (asymptotically) matching upper and lower bounds on the required number of gates to attain a target error probability.
A. Related Work
In spirit, our scheme is similar to von Neumann's repetition-based construction [13], where an error-correction stage follows each computation stage to keep errors suppressed. Subsequent works [47] [48] [49] focus on minimizing the number of redundant gates while keeping the error probability below a small constant. The difference from our work here is that these works do not allow (noiseless) precomputation based on the knowledge of the required function, which our scheme (ENCODED) explicitly relies on. Therefore, our results are applicable when the same function needs to be computed multiple times for (possibly) different inputs, so that the one-time cost of a precomputation is worth paying. Thus, we do not include the preprocessing costs of the linear transformation matrix in the computational complexity calculation, and we assume all preprocessing can be done offline in a noise-free fashion. 3 We note that the algorithm introduced by Hadjicostis in [50], which is applied to finite-state linear systems, is similar to ours in that he also uses a matrix encoding scheme. However, [50] assumes that encoding and decoding procedures are noiseless, which we do not assume. Laurenciu et al. [51] designed a fault-tolerant computing scheme that embeds the encoding of an error control code into the logical functionality of the circuit. Decoding units are allowed to be noisy as well. However, the computation size is small, and theoretical guarantees are not provided. Pippenger [48, Th. 4.4] designed an algorithm to compute a binary linear transformation with noisy gates. The algorithm requires gates with fan-in $2^{23}$ and a gate-error probability of $35 \cdot 2^{-50}$. While the fan-in values are unrealistically high, the gate-error probability is also low enough that most practical computations can be executed correctly using "uncoded" strategies, which is possibly the reason why it has not received significant attention within the circuits community.
At a technical level, unlike the multi-stage computing scheme used in our work, Pippenger uses exhaustive enumeration of all linear combinations with length $\frac{1}{3}\log L$ for computing, where L is the number of rows in the binary linear transformation. Lastly, we note here that Pippenger's scheme only works for the case when the number of columns K in the binary linear transformation matrix and the code length N of the utilized LDPC code satisfy $K = \Theta(N^{3})$, while our algorithm works in a more practical scenario where $K = \Theta(N)$.
This work builds on our earlier work [52] , in which the problem of reliable communication with a noisy encoder is studied. In [52] , noisy decoders are embedded in the noisy encoder to repeatedly suppress errors. The noisy encoding problem is a special case of computing noisy linear transformation when the linear transformation matrix is the generator matrix of an error-correcting code. In [53] , an augmented encoding approach was introduced to protect the encoder from hardware faults using extra parity bits. In [54] , which considers a similar problem, errors are modelled as erasures on the encoding Tanner graph.
Outside information theory, fault-tolerant linear transformations and related matrix operations have been studied extensively in algorithm-based fault tolerance [55] [56] [57] [58] [59]. The main difference in our model is that faults happen at the circuit level, e.g., in AND gates and XOR gates. In contrast, in [55] [56] [57] [58] [59], each functional block, e.g. a vector inner product, fails with a constant probability. If errors are considered at the gate level, the error probability of a vector inner product will approach 1/2 [33] as the vector size grows, and one may not be able to use these schemes. Fault-detection algorithms on circuits and systems with unreliable computation units have also been studied extensively [14], [60] [61] [62] [63]. However, these algorithms assume that the detection units are reliable, which we do not assume. Moreover, using error control coding, we can combine error detection and correction in the same processing unit.
II. SYSTEM MODEL AND PROBLEM FORMULATION

A. Circuit Model
We first introduce unreliable gate models and circuit models that we will use in this paper. We consider two types of unreliable gates: probabilistic gates and defective gates.
Definition 1 (Gate Model I $(D, \epsilon)$): The gates in this model are probabilistically unreliable in that they compute a deterministic boolean function $g$ with additional noise $z_g$:
$$y = g(u_1, u_2, \ldots, u_{d_g}) \oplus z_g,$$
where $u_1, \ldots, u_{d_g}$ are the gate inputs, $d_g$ denotes the number of inputs and is bounded above by a constant $D > 3$, $\oplus$ denotes the XOR-operation and $z_g$ is a boolean random variable which takes the value 1 with probability $\epsilon$, which is assumed to be smaller than $\frac{1}{2}$. The event $z_g = 1$ means the gate $g$ fails and flips the correct output. Furthermore, in this model, all gates fail independently of each other and the failure events during multiple uses of a single gate are also independent of each other. We allow different kinds of gates (e.g. XOR, majority, etc.) to fail with different probabilities. However, different gates of the same kind are assumed to fail with the same error probability. 4 This model is similar to the one studied in [49] and the failure event is often referred to as a transient fault. Our next model abstracts defective gates that suffer from permanent failures.
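Gate Model I can be simulated directly, which may help when experimenting with the schemes discussed later. The following is a minimal sketch, assuming a pseudo-random generator stands in for the noise source; the function names (`noisy_gate`, `xor_gate`, `and_gate`, `majority_gate`, `empirical_failure_rate`) are hypothetical and introduced only for illustration.

```python
import random

def xor_gate(bits):
    """Noiseless XOR of all input bits."""
    out = 0
    for b in bits:
        out ^= b
    return out

def and_gate(bits):
    """Noiseless AND of all input bits."""
    return int(all(bits))

def majority_gate(bits):
    """Noiseless majority vote (1 only if strictly more than half the inputs are 1)."""
    return int(2 * sum(bits) > len(bits))

def noisy_gate(g, inputs, eps, rng):
    """Gate Model I: return g(inputs) XOR z_g, where z_g = 1 with probability eps."""
    z = 1 if rng.random() < eps else 0  # fresh, independent failure per activation
    return g(inputs) ^ z

def empirical_failure_rate(g, inputs, eps, trials=20000, seed=0):
    """Fraction of activations whose output differs from the noiseless output."""
    rng = random.Random(seed)
    correct = g(inputs)
    fails = sum(noisy_gate(g, inputs, eps, rng) != correct for _ in range(trials))
    return fails / trials
```

Calling `empirical_failure_rate(majority_gate, [1, 1, 0], 0.1)` should return a value close to 0.1, reflecting that in this model the failure probability does not depend on the gate input.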
Definition 2 (Gate Model II $(D, n, \alpha)$): In a set of $n$ gates, each gate is either perfect or defective. A perfect gate always yields a correct output function
$$y = g(u_1, u_2, \ldots, u_{d_g}),$$
where $u_1, \ldots, u_{d_g}$ are the gate inputs and $d_g$ denotes the number of inputs and is bounded above by a constant $D > 3$. A defective gate outputs a deterministic boolean function of the correct output, $\tilde{y} = f(g(\cdot))$. This function can be either $f(x) = \bar{x}$ (NOT function), $f(x) = 0$ or $f(x) = 1$. The fraction of defective gates in the set of $n$ gates is denoted by $\alpha$. We assume that measurement techniques cannot be used to distinguish between defective gates and perfect gates. 5 From the definition, a defective gate may repeatedly output the value 1 no matter what the input is, which is often referred to as a "stuck-at error". This might happen, for example, when a circuit wire gets shorted.

4 A weaker assumption is that different gates fail independently, but with different probabilities all smaller than $\epsilon$, which is called $\epsilon$-approximate [48]. The ENCODED technique also works for this model. Also note that our model is limited in the sense that the error probability $\epsilon$ does not depend on the gate input. This may not be realistic because the gate error probability can also depend on the input and even the previous gate outputs, which is also noted in [64]. However, the assumption that $\epsilon$ does not depend on the gate input can be relaxed by assuming that $\epsilon$ is the maximum error probability over all different input instances.

5 Defective gates may result from component aging after being sold, and examining each gate in circuitry is in practice extremely hard. The storage units are easier to examine [65], but replacing faulty memory cells requires replacing an entire row or column in the memory cell array [66].

Remark 1: In fact, we can generalize all the results in this paper on the permanent error model (Gate Model II) to an arbitrary but bounded error model, in which errors occur in a worst-case fashion but in no more than a fixed fraction of gates. The latter error model has been used in worst-case analyses in coding theory and arbitrarily varying channels [67]. However, for consistency with the existing literature on error-prone decoding with permanent errors [23], [28], we limit our exposition to these errors.
The computation in a noisy circuit is assumed to proceed in discrete steps for which it is helpful to have circuits that have storage components.
Definition 3 (Register):
A register is an error-free storage unit that outputs the stored binary value. A register has one input. At the end of a time slot, the stored value in a register is changed to its input value if this register is chosen to be updated.
Remark 2: We assume that registers are noise-free only for clarity of exposition. It is relatively straightforward to incorporate in our analysis the case when registers fail probabilistically. A small increase in error probability of gates can absorb the error probability of registers. A similar change allows us to incorporate permanent errors in the registers as well.
Definition 4 (Noisy Circuit Model $(G, R)$): A noisy circuit is a network of binary inputs $s = (s_1, s_2, \ldots, s_L)$, unreliable gates $G = \{g_1, g_2, \ldots, g_S\}$ and registers $R = \{r_1, r_2, \ldots, r_T\}$. Each unreliable gate $g \in G$ can have inputs that are elements of $s$, outputs of other gates, or outputs of registers. That is, the inputs to an unreliable gate $g$ are $s_{i_1}, \ldots, s_{i_a}, y_{j_1}, \ldots, y_{j_b}, r_{k_1}, \ldots, r_{k_c}$, where $a + b + c = d_g$, the total number of inputs to this gate. Each register $r \in R$ can have its single input from the circuit inputs $s$, outputs of unreliable gates or outputs of other registers. For simplicity, wires in a noisy circuit are assumed to be noiseless.
Definition 5 (Noisy Computation Model $(L, K, N_{comp})$): A computing scheme $F$ employs a noisy circuit to compute a set of binary outputs $r = (r_1, r_2, \ldots, r_K)$ according to a set of binary inputs $s = (s_1, s_2, \ldots, s_L)$ in multiple stages. At each stage, a subset of all unreliable gates $G$ are activated to perform a computation and a subset of all registers $R$ are updated. At the completion of the final stage, the computation outputs are stored in a subset of $R$. The number of activated unreliable gates in the $t$-th stage is denoted by $N^{t}_{comp}$. Denote by $N_{comp}$ the total number of unreliable operations (one unreliable operation means one activation of a single unreliable gate) executed in the noisy computation scheme, which is obtained by
$$N_{comp} = \sum_{t=1}^{T} N^{t}_{comp},$$
where T is the total number of stages, which is predetermined. The noisy computation model is the same as a sequential circuit with a clock. The number of stages T is the number of time slots that we use to compute the linear transform.
In each time slot t, the circuit computes an intermediate function f t (x) using the computation units on the circuit, and the result f t (x) is stored in the registers for the computation in the next time slot t + 1. The overall number of stages T is predetermined (fixed before the computation starts).
Remark 3:
A computing scheme should be feasible, that is, in each time slot, all the gates that provide inputs to an activated gate, or a register to be updated, should be activated.
In this paper, we will only consider noisy circuits that are either composed entirely of probabilistic gates defined in Gate Model I or entirely of unreliable gates in Gate Model II. Note that if we consider probabilistic gates, the noisy circuit can be transformed into an equivalent circuit that does not have registers. This is because, since the probabilistic gate failures are (assumed to be) independent over operations, we can replicate each gate in the original circuit multiple times such that each gate in the equivalent circuit is only activated once. This circuit transformation is used in the proof of Theorem 4.
B. Problem Statement
The problem considered in this paper is that of computing a binary linear transformation $r = s \cdot A$ using a noisy circuit, where the input vector $s = (s_1, s_2, \ldots, s_L)$, the output vector $r = (r_1, r_2, \ldots, r_K)$ and the L-by-K (linear transformation) matrix $A$ are all composed of binary entries. We consider the problem of designing a feasible (see Remark 3) computing scheme $F$ for computing $r = s \cdot A$ with respect to Definition 5. Suppose the correct output is $r$. Denote by $\hat{r} = (\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_K)$ the (random) output vector of the designed computing scheme $F$. Note that the number of operations $N_{comp}$ has been defined in Definition 5. The computational complexity per bit $N_{per-bit}$ is defined as the total number of operations per output bit in the computing scheme. That is,
$$N_{per-bit} = \frac{N_{comp}}{K}.$$
For gates from Gate Model I (Definition 1), we are interested in the usual metrics of bit-error probability $P_e^{bit} = \frac{1}{K}\sum_{k=1}^{K}\Pr(\hat{r}_k \neq r_k)$ and block-error probability $P_e^{blk} = \Pr(\hat{r} \neq r)$, averaged over uniformly distributed inputs $s$ and noise realizations. In addition, in the spirit of the "excess distortion" formulation in information theory [68], we are also interested in keeping the fraction of (output) errors bounded with high probability. This could be of interest, e.g., in approximate computing problems. To that end, we define another metric, $\delta_e^{frac}$, the "bit-error fraction," which is simply the Hamming distortion between the computed output and the correct output (per output bit). That is,
$$\delta_e^{frac} = \max_{s} \frac{1}{K}\sum_{k=1}^{K} \mathbb{1}_{\{\hat{r}_k \neq r_k\}},$$
where $\mathbb{1}_{\{\cdot\}}$ is the indicator function. The bit-error fraction depends on the noise, which is random in Gate Model I. Thus, we will constrain it probabilistically (see Problem 2). 6 The resulting problems are stated as follows:
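To make the two output metrics concrete, the sketch below evaluates the per-bit Hamming distortion and the block-error indicator for one fixed input and one noise realization. It does not perform the maximization over inputs $s$ or the averaging over noise; the function names are hypothetical and used only for illustration.

```python
def bit_error_fraction(r_hat, r):
    """Hamming distortion per output bit between computed and correct outputs."""
    if len(r_hat) != len(r):
        raise ValueError("output vectors must have the same length")
    return sum(a != b for a, b in zip(r_hat, r)) / len(r)

def block_error(r_hat, r):
    """1 if the computed output differs from the correct output anywhere, else 0."""
    return int(list(r_hat) != list(r))

# Example: two of the four output bits are wrong.
example_fraction = bit_error_fraction([1, 0, 1, 1], [1, 1, 1, 0])
example_block = block_error([1, 0, 1, 1], [1, 1, 1, 0])
```

Averaging `block_error` over many simulated runs would give an empirical estimate of the block-error probability, while the largest `bit_error_fraction` over a set of inputs gives a lower estimate of the worst-case fraction.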
Problem 1:
$$\min_{F} N_{per-bit}, \quad \text{s.t. } P_e < p_{tar},$$
where $p_{tar} > 0$ is the target bit error probability, and $P_e$ could be $P_e^{bit}$ or $P_e^{blk}$.

Problem 2:
$$\min_{F} N_{per-bit}, \quad \text{s.t. } \Pr\left(\delta_e^{frac} < p_{tar}\right) > 1 - \delta,$$
where $p_{tar} > 0$ is the target bit-error fraction and $\delta$ is a small constant. When we consider Gate Model II (Definition 2), since all gates are deterministic functions, we are interested in the worst-case fraction of errors $\delta_e^{frac}$. Thus, the optimization problem can be stated as follows:

Problem 3:
$$\min_{F} N_{per-bit}, \quad \text{s.t. } \max_{s,\; S_i^{def}:\, |S_i^{def}| < \alpha_i n_{F,i},\, \forall i \in W} \delta_e^{frac} < p_{tar},$$
where $s$ is the input vector, $S_i^{def}$ is the set of defective gates of type $i$, $W$ is the set of indices of different types of noisy gates (such as AND gates, XOR gates and majority gates), $\alpha_i$ is the error fraction of the gates of type $i$, $n_{F,i}$ is the total number of gates of type $i$ in the implementation of $F$, and $p_{tar} > 0$ is the target fraction of errors. Note that $n_{F,i}$ is chosen by the designer as a part of choosing $F$, while the error fraction $\alpha_i$ is assumed to be known to the designer in advance.
Throughout this paper, we rely on the family of Bachmann-Landau notation [69] (i.e. "big-O" notation). For any two functions f (x) and g(x) defined on some subset of R, asymp-
C. Technical Preliminaries
First we state a lemma that we will use frequently.

Lemma 1 [20, p. 41, Lemma 4.1]: Suppose $X_i$, $i = 1, \ldots, L$, are independent Bernoulli random variables and $\Pr(X_i = 1) = p_i$, $\forall i$. Then
$$\Pr\left[\sum_{i=1}^{L} X_i = 1\right] = \frac{1}{2}\left[1 - \prod_{i=1}^{L}(1 - 2p_i)\right],$$
where the summation is over $\mathbb{F}_2$, i.e., 1 + 1 = 0.

We will use error control coding to facilitate the computation of the binary linear transformation. Here, we introduce some notation related to the codes that we will use. We will use a regular LDPC code [19], [31] with code length $N$, dimension $K$ and a $K \times N$ generator matrix $G$ written as
$$G = [g_1; g_2; \ldots; g_K],$$
where each row $g_k$ is a length-$N$ codeword. In the LDPC Tanner graph, denote the degree of a variable node $v$ by $d_v$ and the degree of a parity check node $c$ by $d_c$. The embedded decoders use either the Gallager-B decoding algorithm, which is a 1-bit hard-decision decoding algorithm proposed in [19] and is included for completeness in Appendix A, or the parallel bit flipping (PBF) algorithm, which is also a hard-decision algorithm, proposed in [20]. In particular, we use the modified parallel bit flipping algorithm defined in [70].

6 We will show that the bit-error fraction is constrained probabilistically (see Problem 2) for all input vectors s.
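Lemma 1 concerns the probability that the modulo-2 sum of independent Bernoulli variables equals 1, for which the standard closed form is $\frac{1}{2}[1 - \prod_i (1 - 2p_i)]$. As a sanity check, the sketch below compares this closed form with brute-force enumeration for a small number of variables; the function names and the example probabilities are arbitrary choices for illustration.

```python
from itertools import product

def xor_one_probability_closed_form(p):
    """Closed form: Pr(X_1 + ... + X_L = 1 over F_2) = (1 - prod(1 - 2 p_i)) / 2."""
    prod = 1.0
    for p_i in p:
        prod *= 1.0 - 2.0 * p_i
    return 0.5 * (1.0 - prod)

def xor_one_probability_enumerated(p):
    """Sum the probability of every outcome containing an odd number of ones."""
    total = 0.0
    for outcome in product((0, 1), repeat=len(p)):
        if sum(outcome) % 2 == 1:
            prob = 1.0
            for x, p_i in zip(outcome, p):
                prob *= p_i if x else 1.0 - p_i
            total += prob
    return total

example_p = [0.1, 0.2, 0.05, 0.3]  # arbitrary illustrative failure probabilities
closed = xor_one_probability_closed_form(example_p)
brute = xor_one_probability_enumerated(example_p)
```

The two values agree up to floating-point rounding. The enumeration cost grows as $2^L$, so it is only practical as a check for small $L$.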
The PBF algorithm is defined as follows:
• Flip each variable node that is connected to more than $d_v/2$ unsatisfied parity-check nodes;
• Set the value of each variable node connected to exactly $d_v/2$ unsatisfied parity-check nodes to 0 (or 1) with probability 1/2;
• Update all parity-check nodes;
• Repeat the first three steps $c_e \log N$ times, where $c_e$ is a constant.
The PBF algorithm can be used to correct a constant fraction of errors after $\Theta(\log N)$ decoding iterations when the computing components in the decoder are noiseless and the error fraction is small enough. However, since we will consider noisy decoders, we will build on a more refined result, which concerns a single decoding iteration of the algorithm (see the following requirement (A.3) and Lemma 8).
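The PBF steps listed above can be sketched in a few lines. This is a minimal, noiseless illustration of a single iteration, assuming the code is given as a list of parity checks, each a list of variable indices; the function name `pbf_iteration` and the toy code built from the edges of a 3-dimensional cube (a length-8 repetition code with $d_v = 3$, $d_c = 2$) are assumptions for illustration and are not the codes used in the paper.

```python
import random

def pbf_iteration(word, checks, rng):
    """One noiseless parallel bit-flipping iteration on a hard-decision word."""
    n = len(word)
    # Update all parity-check nodes: 1 means the check is unsatisfied.
    unsat = [sum(word[v] for v in chk) % 2 for chk in checks]
    degree = [0] * n
    unsat_count = [0] * n
    for chk, u in zip(checks, unsat):
        for v in chk:
            degree[v] += 1
            unsat_count[v] += u
    new_word = list(word)
    for v in range(n):
        if 2 * unsat_count[v] > degree[v]:
            new_word[v] ^= 1                      # more than d_v/2 unsatisfied: flip
        elif degree[v] > 0 and 2 * unsat_count[v] == degree[v]:
            new_word[v] = rng.randint(0, 1)       # exactly d_v/2: set to 0 or 1 at random
    return new_word

# Toy code: one check per edge of the 3-cube, so every variable has degree 3.
cube_checks = [[v, v ^ b] for v in range(8) for b in (1, 2, 4) if v < (v ^ b)]
received = [0] * 8
received[5] = 1                                   # a single bit error
decoded = pbf_iteration(received, cube_checks, random.Random(0))
```

Because every variable in the toy code has odd degree, the tie-breaking branch is never taken here and the result is deterministic: the erroneous bit sees three unsatisfied checks and is flipped, while its neighbours see only one each and are left unchanged.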
In our main results, we may require the utilized LDPC code to satisfy some of (not all) the following conditions.
• (A.1) Degree Bound: The variable node degree $d_v$ and the parity check node degree $d_c$ are both less than or equal to $D$, so that each majority or XOR-operation (in the Gallager-B decoding algorithm) can be carried out by a single unreliable gate. Moreover, we assume that the variable node degree $d_v \geq 4$, $\forall v$.
• (A.2) Large Girth: The girth $l_g = \Theta(\log N)$. An LDPC code with the following girth lower bound is obtained in [20] and [33]:
where
) is a constant that does not depend on N.
• (A.3) Worst-case Error Correcting: One iteration of the PBF algorithm using a noiseless decoder can bring down the number of errors in the codeword from $\alpha_0 N$ to $(1 - \theta)\alpha_0 N$ for two constants $\alpha_0, \theta \in (0, 1)$, for any possible pattern of $\alpha_0 N$ errors.
The requirement in (A.2) can be met by using codes introduced in [32] or using the PEG construction proposed in [71]. The requirement in (A.3) can be met either by using $(d_v, d_c)$-regular random code ensembles and using the analysis in [70], or by using regular Expanders [20]. In particular, in Appendix D, we show that almost all codes in the (9, 18)-regular code ensemble of sufficiently large length $N$ can reduce the number of errors by $\theta = 15\%$ after one iteration of the PBF algorithm, if the fraction of errors is upper-bounded by $\alpha_0 \leq 5.1 \cdot 10^{-4}$. We also show that at least 4.86% of the (9, 18)-regular codes of length $N = 50{,}000$ can reduce the number of errors by $\theta = 15\%$ after one iteration of the PBF algorithm, if the number of errors satisfies $\alpha_0 N \leq 20$, which is equivalent to $\alpha_0 \leq 0.0004$.
III. ENCODED: ENCODED COMPUTATION WITH DECODERS EMBEDDED
In this section, we present the main scheme that we use for noisy computation of linear transformations. We call this scheme "ENCODED" (Encoded Computation with Decoders EmbeddeD). We aim to provide an overview of ENCODED in this section. This scheme will then be modified to a tree-structured scheme, ENCODED-T, in Section IV-A and further modified to a bit-flipping-based technique, ENCODED-F, in Section IV-C.
A. ENCODED: A Multi-Stage Error-Resilient Computation Scheme
Instead of computing a binary linear transformation $r = s \cdot A$ without using any redundancy, we will compute
$$x = r \cdot G = s \cdot AG,$$
where $G = [I, P] = [g_1; g_2; \ldots; g_K]$ is the $K \times N$ generator matrix of the chosen systematic LDPC code. The matrix product $AG$ is assumed to be computed offline in a noise-free fashion. An important observation is that since all rows in the matrix product $AG$ are linear combinations of the rows in the generator matrix $G$, the rows of $AG$ are codewords as well. That is,
$$\tilde{G} = AG = [\tilde{g}_1; \tilde{g}_2; \ldots; \tilde{g}_L],$$
where each row $\tilde{g}_l$, $l = 1, \ldots, L$, is a codeword. Then, if the computation were noiseless, the correct computation result $r = s \cdot A$ could be obtained from the combined result
$$x = [r, r \cdot P] = r \cdot G.$$
Since $r \cdot G = s \cdot AG = s \cdot \tilde{G}$,
$$x = s \cdot \tilde{G} = \sum_{l=1}^{L} s_l \tilde{g}_l.$$
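The relationship between the uncoded result $r = s \cdot A$ and the encoded result $x = s \cdot AG$ can be checked on a toy example. The sketch below is noiseless and uses hypothetical sizes ($L = 3$, $K = 2$, $N = 4$) and an arbitrary small systematic generator matrix, not an LDPC code; the helper `matmul_gf2` is introduced only for illustration.

```python
def matmul_gf2(X, Y):
    """Product of two binary matrices (lists of rows) over F_2."""
    cols = list(zip(*Y))
    return [[sum(a & b for a, b in zip(row, col)) % 2 for col in cols] for row in X]

A = [[1, 0],
     [1, 1],
     [0, 1]]                    # L x K transformation matrix
P = [[1, 1],
     [0, 1]]                    # parity part of the systematic generator
G = [[1, 0] + P[0],
     [0, 1] + P[1]]             # K x N generator matrix G = [I, P]

G_tilde = matmul_gf2(A, G)      # precomputed offline; each row is a codeword
s = [[1, 0, 1]]                 # input as a 1 x L row vector
r = matmul_gf2(s, A)[0]         # uncoded result s * A
x = matmul_gf2(s, G_tilde)[0]   # encoded result s * AG
systematic_part = x[:len(r)]    # first K entries of x
```

Because $G$ is systematic, the first $K$ entries of `x` reproduce `r`, and the remaining entries are the parity bits that the embedded decoders can exploit.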
In the following sections, we will explain how error control coding can be used to reliably compute $x = \sum_{l=1}^{L} s_l \tilde{g}_l$. If the gates were noiseless, every partial sum $x^{(l)} = \sum_{j=1}^{l} s_j \tilde{g}_j$ would be a codeword, since it is a linear combination of codewords. When gates are noisy, during the $l$-th stage, we first compute $x^{(l-1)} + s_l \tilde{g}_l$ using noisy AND gates (binary multiplication) and noisy XOR gates (binary addition) and then correct errors (with high probability) using an LDPC decoder or an expander decoder to get $x^{(l)}$. During the entire computing process, AND gates and XOR gates introduce errors, while the noisy decoders suppress errors. Finally, it will be proved in Theorem 1 and Theorem 3 that the error probability is maintained below a small constant. We summarize the ENCODED technique in Algorithm 1.

Algorithm 1 ENCODED (Encoded Computation with Decoders EmbeddeD)
Store an all-zero vector $x^{(0)}$ in an N-bit register.
FOR l from 1 to L
• Use N unreliable AND gates to multiply $s_l$ with $\tilde{g}_l$, the l-th row in $\tilde{G}$, add this result to $x^{(l-1)}$ using N unreliable XOR gates, and store the result in the N-bit register. 7
• Use an unreliable decoder to correct errors and get $x^{(l)}$.
END
Output $x^{(L)}$ as the output x.

Compared to many classical results [34], [35], [37], [55] on applying error control coding to noisy computing, instead of computing after encoding, the proposed scheme combines encoding and computing into a joint module (see Fig. 2). Because there is no separation between computing and encoding, in some sense, we encode the computation, rather than encoding the message. We briefly discuss the intuition underlying the ENCODED technique in Section III-B. We note that we change the FOR-loop in Alg. 1 to a tree structure (ENCODED-T) in Section IV-A in order to reduce error accumulation, as explained in Remark 4 in Section IV-A.
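The multi-stage loop of ENCODED can be sketched as a small simulation. This is only an illustration under stated assumptions: the embedded decoder here is a single noisy bit-flipping iteration rather than the Gallager-B or PBF decoders analyzed in the paper, every gate is modeled as flipping its output with the same probability `eps`, and all function names are hypothetical.

```python
import random

def noisy(bit, eps, rng):
    """Flip `bit` with probability eps (a Gate Model I style failure)."""
    return bit ^ int(rng.random() < eps)

def noisy_bit_flip_decoder(word, checks, eps, rng):
    """One bit-flipping iteration in which every gate output is noisy."""
    unsat = [noisy(sum(word[v] for v in chk) % 2, eps, rng) for chk in checks]
    votes = [[] for _ in word]
    for chk, u in zip(checks, unsat):
        for v in chk:
            votes[v].append(u)
    out = []
    for bit, vs in zip(word, votes):
        flip = int(2 * sum(vs) > len(vs))   # majority of adjacent checks unsatisfied
        out.append(noisy(bit ^ flip, eps, rng))
    return out

def encoded(s, G_tilde, checks, eps, rng):
    """Sketch of the ENCODED loop: noisy AND, noisy XOR, then an embedded noisy decoder."""
    n = len(G_tilde[0])
    x = [0] * n                                                  # all-zero register x^(0)
    for s_l, g_l in zip(s, G_tilde):
        prod = [noisy(s_l & g, eps, rng) for g in g_l]           # N noisy AND gates
        x = [noisy(a ^ b, eps, rng) for a, b in zip(x, prod)]    # N noisy XOR gates
        x = noisy_bit_flip_decoder(x, checks, eps, rng)          # suppress errors
    return x
```

As a usage example, one could take the checks of a length-8 repetition code (one check per edge of a 3-cube), let each row of `G_tilde` be all-ones or all-zeros, and compare the output for `eps = 0.0` with the output for a small positive `eps`; with `eps = 0.0` the loop reduces to the exact computation of $s \cdot \tilde{G}$.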
B. Intuition Underlying the Embedded Decoders
The basic idea of our proposed computing scheme is to split the computation into a multistage computation of $x = \sum_{l=1}^{L} s_l \tilde{g}_l$, and use embedded decoders inside the noisy circuit to repeatedly suppress errors as the computation proceeds. Since the noisy circuit can only be constructed using unreliable gates, the embedded decoders are also built from unreliable gates.
Why is such a multistage computation helpful? For instance, if "uncoded" matrix multiplication $r = sA$ is carried out, each output bit is computed using an inner product, and O(L) unreliable AND and XOR-operations are required. Without repeated suppression, the error probability of each output bit approaches $\frac{1}{2}$ as $L \to \infty$. Intermediate and repeated error suppression alleviates this error accumulation problem. Can one use a feedback structure, as is used in Turbo codes [72] and ARA codes [73] (these codes often have a feedback structure [74] for encoding), to keep errors suppressed, instead of the LDPC codes used here? A feedback structure can be detrimental since errors persist in the feedback loop and propagate to the future, which can make the final bit error probability large. This observation motivated us to use LDPC codes.
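The error accumulation in the uncoded case can be quantified with the parity formula of Lemma 1. The sketch below assumes a simplified setting in which only the XOR gates of the inner product fail, each independently with probability `eps`, so that the output bit is wrong exactly when an odd number of gates fail; the function name is hypothetical.

```python
def xor_chain_error_probability(eps, num_gates):
    """Probability that an odd number of `num_gates` independent flips occur."""
    return 0.5 * (1.0 - (1.0 - 2.0 * eps) ** num_gates)

# Error probability of one uncoded output bit for increasing inner-product lengths.
accumulation = [xor_chain_error_probability(0.01, n) for n in (1, 10, 100, 1000)]
```

The values in `accumulation` increase monotonically towards 1/2, which illustrates why intermediate error suppression is needed once $L$ is large compared with $1/\epsilon$.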
Also note that due to the 'last-gate' effect in noisy circuits, error probability cannot approach zero. Thus, our goal is not to eliminate errors, but to suppress them so that the error probability (or the error fraction) is kept bounded below a target value that depends on the error probability of the last gate.
IV. MAIN RESULTS ON COMPUTING A BINARY LINEAR TRANSFORMATION WITH EMBEDDED DECODERS
In this section, we show that a linear transformation can be computed 'reliably' (in accordance with the goals of Problems 1-3 in Section II-B) even in the presence of noise, using error control coding. We provide three results, one for each of the formulations in Problem 1 to Problem 3. These results are obtained using two variants of ENCODED, which we call ENCODED-T and ENCODED-F. ENCODED-T uses the Gallager-B decoding algorithm for error suppression, while ENCODED-F uses the PBF algorithm. The implementation details of ENCODED-T and ENCODED-F are provided in Section IV-A and Section IV-C, respectively. We also compare the resource requirements of this coding-based computation with repetition-based computation using simulations (in Section IV-D).
A. ENCODED-T: A Scheme for Reliable Computation of Linear Transformations Under Gate Model I
The unreliable gates used in the computing scheme are the AND gates, XOR gates and majority gates defined in Gate Model I, with error probabilities $p_{\mathrm{and}}$, $p_{\mathrm{xor}}$ and $p_{\mathrm{maj}}$ respectively. We change the FOR-loop of ENCODED in Section III-A slightly and use a D-branch tree with depth M instead. We call this computing scheme "ENCODED-T" (ENCODED scheme with a Tree structure); it is conceptually illustrated in Fig. 3(a). We use this tree structure because it reduces the number of stages of the computing scheme from L in the FOR-loop in Alg. 1 to a number that grows only logarithmically in L. This reduces the extent of information mixing caused by message-passing decoding (in comparison with ENCODED's sequential structure), which introduces correlation among messages and makes the density evolution analysis difficult. This issue is detailed in Remark 4.
The message s = (s_1, . . . , s_L) is input at the leaf nodes. The output $x = s \cdot \tilde{G} = (x_1, \ldots, x_N)$ is computed from bottom to top and finally obtained at the root. Note that the tree is not necessarily complete. Specifically, the tree is complete from the first level to the (M − 1)-th level, i.e., the level just above the bottom level, and the number of nodes in the bottom level is chosen such that the number of leaf nodes is L. An illustration of a non-complete 3-branch tree with L = 22 leaf nodes (colored red) is shown in Fig. 3(a). From Fig. 3(a), we see that removing the leaf children-nodes of a complete non-leaf node (a non-leaf node with $d_T$ leaf children-nodes) reduces the number of leaf nodes by $d_T - 1$, because the $d_T$ removed leaf children-nodes (red nodes) are replaced by one non-leaf node that turns into a leaf node. Therefore, the total number of non-leaf nodes is $\frac{L-1}{d_T-1}$. Each of the L leaf nodes stores one of the L rows of the matrix $\tilde{G}$, i.e., $\tilde{g}_1$ to $\tilde{g}_L$. At the start of the computing process, the l-th of the L leaf nodes calculates $s_l \cdot \tilde{g}_l$ using N unreliable AND gates and stores it as an intermediate result in a register. In the upper levels, each non-leaf node performs a component-wise XOR-operation of the $d_T$ intermediate results from its $d_T$ children. Observe that if no gate errors occur, the root gets the binary sum of all $s_l \cdot \tilde{g}_l$, l = 1, . . . , L, which is the correct codeword $x = s \cdot \tilde{G}$.
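The noiseless behavior of this tree can be sketched as follows. This is only an illustration under stated assumptions: gates are error-free, no decoding is performed, and children are grouped into consecutive chunks of size d_T, which is a simplification of the tree shape in Fig. 3(a). The names `tree_encode` and `direct_encode` are hypothetical.

```python
from typing import List


def xor_vectors(vectors: List[List[int]]) -> List[int]:
    """Component-wise XOR of a non-empty list of equal-length bit vectors."""
    out = [0] * len(vectors[0])
    for v in vectors:
        out = [a ^ b for a, b in zip(out, v)]
    return out


def tree_encode(s: List[int], G_tilde: List[List[int]], d_T: int) -> List[int]:
    """Compute s * G_tilde over GF(2) with a d_T-ary XOR tree (noise-free)."""
    # Leaf level: the l-th leaf ANDs s_l with every bit of its stored row.
    level = [[s_l & g for g in row] for s_l, row in zip(s, G_tilde)]
    # Upper levels: each non-leaf node XORs the results of up to d_T children.
    while len(level) > 1:
        level = [xor_vectors(level[i:i + d_T])
                 for i in range(0, len(level), d_T)]
    return level[0]


def direct_encode(s: List[int], G_tilde: List[List[int]]) -> List[int]:
    """Reference computation: one inner product per output bit."""
    n = len(G_tilde[0])
    return [sum(s_l & row[j] for s_l, row in zip(s, G_tilde)) % 2
            for j in range(n)]
```

Because XOR is associative and commutative, the tree result matches the direct inner-product computation whenever no gate fails; the tree only changes the order in which partial sums are combined.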
In order to deal with errors caused by unreliable gates, each non-leaf tree node is a compute-and-correct unit shown in Fig. 3 (c). Unreliable XOR gates are used to perform the component-wise XOR-operation of the intermediate results.
A noisy Gallager-B decoder (see Appendix A) is used to correct errors in the associated register after the XOR-operation. Note that the number of bits transmitted from variable nodes to parity check nodes during each Gallager-B decoding iteration is $E = d_v N$, where $d_v$ is the variable node degree of the Tanner graph and E is the total number of edges. Therefore, at a tree node, the register stores $E = d_v N$ bits as intermediate results instead of N bits as in Algorithm 1. 8 These E bits of messages can be viewed as $d_v$ copies of the corresponding N-bit codeword, with indices from 1 to $d_v$. The XOR-operations at the beginning of the next stage are also done on codeword copies with the same index.
8 (This technique was first used in [22] and [23] for the correction circuit in fault-tolerant storage, which is known as the Taylor-Kuznetsov memory. A related technique was used in [33, p. 216] to construct fault-tolerant linear systems. See [75] for details.)
Before sending the output to the parent-node, each non-leaf node of the tree structure shown in Fig. 3 performs C iterations of message-passing decoding, using the embedded decoder, on the E-bit intermediate result obtained by the XOR-operations. We use the index i = 1, . . . , C to number the decoding iterations done at a single tree node, which is different from the index m used to number the levels in the tree structure. However, we will show that it suffices to use C = 1 iteration (see Lemma 2) to bring the bit error probability of the E-bit intermediate result back to $p_{\mathrm{maj}} + \frac{1}{d_T} p_{\mathrm{thr}}$. In the noisy decoder, the bit error probability of the intermediate result, assuming the decoding neighborhood is cycle-free for i + 1 iterations (in fact, the number of levels of the decoding neighborhood grows by 2C at each tree node, see Remark 4 for details), follows the density evolution $P_e^{(i+1)} = f(P_e^{(i)})$, and the explicit expression of the function f(·) is given in Lemma 4 in Appendix B. This evolution is illustrated in Fig. 3(d). The bit error probability follows the directed path shown in Fig. 3(d), and asymptotically approaches the fixed point of the density evolution function as the number of iterations increases, provided that decoding neighborhoods continue to satisfy the cycle-free assumption (which no fixed good code does). However, the expression of f(·) is complicated, so we provide an upper bound $f(P_e^{(i)}) < f_0(P_e^{(i)})$ in Lemma 5 for further analysis.
During the computing process, the XOR-operations introduce errors, while the Gallager-B decoding process suppresses them. The bit error probability of the intermediate result is reduced repeatedly at different stages and is ensured to be bounded within the interval ( p maj , p reg ) (as illustrated in Fig. 3(b) ), where the parameter p reg is defined as
Fig. 4. This figure shows the decoding neighborhoods in two adjacent levels of the tree structure in ENCODED-T. The black nodes in the decoding neighborhood denote variable nodes, while the white ones denote parity check nodes. The decoding neighborhood in the (M − 2)-th level is not cycle-free due to a short cycle of length 8. Note that the tree structure in ENCODED-T and the decoding neighborhoods are two completely different things and should not be confused.
Remark 4: An astute reader might notice that as the computation process proceeds, the message-passing algorithm introduces increasing correlation in messages as they are passed up the tree. This introduces a difficulty in analyzing the error probability using density-evolution techniques, because density evolution relies on the cycle-free assumption on the decoding neighborhoods. 9 As shown in Fig. 4, the decoding neighborhoods of the same level in the ENCODED-T tree have the same neighborhood structure (because all nodes use the same LDPC code and the number of decoding iterations is constant across the same level), while the decoding neighborhood on the upper level, the (M − 2)-th level (see Fig. 4), grows by a depth of 2C compared to the adjacent lower level, the (M − 1)-th level. Note that the XOR-operations before the LDPC decoding iterations do not change the nodes in the decoding neighborhood (again, because the same code is used at all nodes). Due to the small girth of the LDPC code chosen in the example in Fig. 4 (girth = 8), the decoding neighborhood is not cycle-free, and hence the messages sent to a variable node can be correlated. In Fig. 4, the correlation between messages at the root of the decoding neighborhood is caused by the correlation between messages sent by the red-colored node ν_1. We circumvent this issue 10 by choosing a code of girth large enough so that no correlations are introduced even over multiple levels of the tree.
The tree structure of ENCODED-T reduces the number of levels exponentially (in comparison with ENCODED), thereby reducing the girth requirement.
9 In fact, the analysis of message-passing algorithms implemented on general graphs with cycles is a key question in many real-world problems, and convergence results are known only in some limited cases [33], [76].
10 Another possible way could be to prove convergence to cycle-free neighborhoods through randomized constructions, as is done in [27] and [32].
To understand this, as in [32] and [33], we define the number of independent iterations of an LDPC code as the maximum number of iterations of message-passing decoding such that the messages sent to any variable node or parity check node are independent (i.e., the number of iterations after which the code reaches its girth). We also use the phrase 'overall number of decoding iterations' in ENCODED-T to denote the sum (over levels) of the maximum number of iterations over all nodes at each level of the ENCODED-T tree. For instance, if C decoding iterations are executed at each tree node, then the "overall" number of iterations is CM, where M is the number of levels in the tree. Intuitively, this quantifies the extent of information mixing in the Tanner graph by the message-passing algorithm. We need to ensure that the overall number of decoding iterations is smaller than the number of independent iterations of the LDPC code used. For a fixed C, the tree structure of ENCODED-T requires the number of independent iterations, and hence also the code girth, to be on the order of $\log N / \log d_T$.
The importance of the ENCODED-tree structure also becomes clear: the FOR-loop in ENCODED (Alg. 1) has an exponentially larger number of levels, and requires a large girth, on the order of L, for L stages of the algorithm to maintain independence of messages in a decoding neighborhood at the root node.
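The gap between the two structures can be made concrete with a small sketch. It assumes, as a simplification not stated in this form in the text, that the number of levels M is the smallest integer with d_T^M ≥ L, and that the sequential FOR-loop runs one decoding round per stage. The function names are hypothetical.

```python
def tree_levels(num_leaves: int, d_T: int) -> int:
    """Smallest M such that d_T ** M >= num_leaves (assumed level count)."""
    levels, capacity = 0, 1
    while capacity < num_leaves:
        capacity *= d_T
        levels += 1
    return levels


def overall_iterations(num_leaves: int, d_T: int, C: int):
    """Return (tree, sequential) overall decoding iteration counts:
    C * M for the tree versus C * L for a loop with L stages."""
    return C * tree_levels(num_leaves, d_T), C * num_leaves


if __name__ == "__main__":
    for L in (22, 1024, 10**6):
        print(L, overall_iterations(L, d_T=3, C=1))
```

The first entry of the pair is the quantity that must stay below the number of independent iterations of the code, so a logarithmic level count translates directly into a much weaker girth requirement.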
The number of independent iterations of message-passing decoding of an LDPC code can be made as large as 
[Algorithm 2 (ENCODED-T), final steps as recovered from the float:]
• Each node $v_m^k$ performs one iteration of the message-passing decoding.
END
Change the E-bit vector $\tilde{y}_1^1$ back to the N-bit codeword $y_1^1$ by randomly selecting one copy. Output $y_1^1$ as the output y.
). Thus, for d T and N large enough, the overall decoding neighborhoods are cycle free, and density evolution analysis is valid. More details are provided in Appendix B.
In Section IV-C, we provide another way to handle this correlation issue that uses codes designed for correcting worstcase errors. For these codes, correlations do not matter as long as the number of errors does not exceed a certain threshold.
B. Analysis of ENCODED-T
In what follows, we prove Theorem 1, which characterizes the performance of Algorithm 2.
Theorem 1 (Error Suppression Using Gallager-B Decoding for Problem 1): Using unreliable AND gates, majority gates and XOR gates from Gate Model I with respective gate error probabilities $p_{\mathrm{and}}$, $p_{\mathrm{xor}}$ and $p_{\mathrm{maj}}$ that satisfy 11
11 In this result, we obtain different conditions on the error probabilities of different types of gates as sufficient conditions, instead of a uniform bound on all gates. We do so to offer maximum flexibility to the circuit designer.
where d = d v −1 2 , and the tree width d T satisfies
the binary linear transformation r = s · A can be computed with output bit error probability $P_e^{\mathrm{bit}} \le p_{\mathrm{maj}} + \frac{1}{d_T} p_{\mathrm{thr}}$ using the ENCODED technique provided in Alg. 2 (see Section IV-A) with an LDPC code that satisfies assumptions (A.1) and (A.2). Further, the number of operations per bit satisfies
where E is the number of edges in the Tanner graph, K is the number of outputs and N is the code length of the utilized LDPC code.
Remark 5: Note that P bit e remains bounded even as L is increased. Thus, we can compute linear transforms with large size and still obtain good error performance.
Proof of Theorem 1: We construct the computation tree of the ENCODED-T technique for computing $x = s \cdot AG = s \cdot \tilde{G}$ as described in Section IV-A. In all, there are M levels, and in the M-th (bottom) level the number of nodes is chosen such that the overall number of leaf nodes is L. Each leaf node has the codeword $\tilde{g}_l$ stored in it at the beginning of the computation, where $\tilde{g}_l$ is the l-th row of $\tilde{G}$, an L-by-N matrix (see (12)). Note that M satisfies
To ensure that the number of leaf nodes is L, the number of non-leaf nodes S nl is
1) Error Probability Analysis: Define p 0 to be the bit error probability at the start of the first round of the Gallager-B decoding carried out in the (M − 1)-th level of the computing tree in Fig 3(b) . Since at this time, each bit is computed by XORing d T bits (see Fig 3(a) ) where each bit is calculated from an AND-operation (see Fig 3(b) ), we know from Lemma 1 that
where step (a) is from the inequality
x ∈ (0, 1). Using definition of p reg in (16) and condition (17)
In the following, we will prove that as long as the error probabilities of noisy gates satisfy (17), the noisy Gallager-B decoder can make the bit error probability fall below $p_{\mathrm{maj}} + \frac{1}{d_T} p_{\mathrm{thr}}$ after one iteration of decoding.
Lemma 2: Suppose all noise parameters satisfy (17) . Then, as long as the bit error probability before decoding is smaller than p reg defined in (16) , after one iteration of decoding, the bit error probability p dec e satisfies
Proof: See Appendix B.
Remark 6: Lemma 2 describes the behavior shown in Fig. 3(d). For noiseless LDPC decoders [32], [77], [31], the fixed point of the density evolution function is zero when the code is operated for a channel that has small enough noise. However, for noisy density evolution, it can be shown that the fixed point [27, Th. 5]. Therefore, after one iteration of Gallager-B decoding, the bit error probability reduces from p_0 to something less than $p_{\mathrm{maj}} + \frac{1}{d_T} p_{\mathrm{thr}}$. Then, the corrected codeword is sent to the parent-node in the next level of the tree structure. After that, the XOR-operation of $d_T$ codewords from children-nodes is carried out. From Lemma 1, we know that the error probability after this XOR-operation is upper bounded by
where (a) follows from (17) and (b) follows from (16) . Therefore, we can reuse the result in Lemma 2 and carry out a new Gallager-B decoding iteration. To summarize, the bit error probability starts at p 0 and then oscillates between p reg and p maj during the entire computation process. This behavior is qualitatively illustrated in Fig. 3 (b) and numerically illustrated through simulation results in Fig. 6 .
2) Computational Complexity Analysis: Each compute-and-correct unit consists of E D-fan-in noisy XOR gates, an E-bit register and a Gallager-B decoder, where E is the number of edges in the LDPC bipartite graph.
The operations required in one iteration of decoding are E XOR-operations and E majority-operations, because on each edge that connects a variable node v and a parity-check node c, two messages $m^{(i)}_{c \to v}$ and $m^{(i)}_{v \to c}$ are calculated, using an XOR-operation and a majority-operation respectively. Thus, the number of operations carried out during one iteration of decoding is 2E. Since the number of non-leaf tree nodes is $S_{nl} = \frac{L-1}{d_T-1}$ (see (21)), the number of operations required in all non-leaf nodes is at most $(2E + E)\frac{L-1}{d_T-1} = O(EL)$. The operations required in the first layer (input layer) of the computing tree are LE AND-operations. The computing process outputs K bits, so the number of operations required per bit is $\frac{1}{K}\left(3E\,\frac{L-1}{d_T-1} + LE\right) = O\!\left(\frac{LE}{K}\right)$.
Since the number of edges $E = d_v N$ grows linearly in N, the number of operations per bit is $O(LN/K)$.
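The operation count in this complexity analysis can be evaluated numerically with a short sketch. It follows the counts stated in the text (3E operations per non-leaf node, LE AND-operations at the input layer, K output bits); the function name and the example values L = 1000 and K = 600 are assumptions chosen for illustration only.

```python
def encoded_t_ops_per_bit(L: int, N: int, K: int, d_v: int, d_T: int) -> float:
    """Upper bound on operations per output bit for ENCODED-T."""
    E = d_v * N                          # number of Tanner-graph edges
    non_leaf_nodes = (L - 1) / (d_T - 1)
    tree_ops = 3 * E * non_leaf_nodes    # E XORs to combine + 2E per decoding iteration
    leaf_ops = L * E                     # AND-operations at the input layer
    return (tree_ops + leaf_ops) / K


if __name__ == "__main__":
    # d_v = 6, N = 1200, d_T = 2 as in the simulation setup; L and K are hypothetical.
    print(encoded_t_ops_per_bit(L=1000, N=1200, K=600, d_v=6, d_T=2))
```

For fixed code parameters the count grows linearly in L, in line with the O(LN/K) scaling.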
C. ENCODED-F: A Scheme for Reliable Computation of Linear Transformations Under Gate Model II
In this section, we consider unreliable gates defined in Definition 2, which are either perfect or defective. We construct a computing scheme using a decoding scheme different from that of ENCODED-T that is still able to attain a small error fraction in the final output. This computing scheme operates in exactly the same manner as ENCODED (see Alg. 1). However, the embedded noisy decoder is a PBF decoder. This computation scheme is named ENCODED-F (ENCODED using the flipping algorithm). We modify ENCODED as follows: we partition the entire computing process into $\frac{L}{d_s-1}$ stages, where $d_s$ is called the group size. First, we store an all-zero codeword in the N-bit register. In the l-th stage, we first use $(d_s - 1)N$ AND gates to obtain the $d_s - 1$ scalar-vector multiplications
. . , $(d_s-1)l$}, where $\tilde{g}_i$ is the i-th row of the combined matrix $\tilde{G} = AG$. Then, we use N XOR gates to add the $d_s - 1$ results to the N-bit register. The parameter $d_s$ is chosen so that $d_s \le D$, the maximum fan-in of each noisy gate. After that, we use one iteration of the PBF algorithm (see Section III) to correct errors. We use P XOR gates and N majority gates in one iteration of the PBF algorithm.
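The staged accumulation described above can be sketched in a noise-free setting. This is an illustration only: gates are assumed error-free, the PBF iteration is represented by a comment rather than implemented, and the name `encoded_f_noiseless` is hypothetical.

```python
from typing import List


def encoded_f_noiseless(s: List[int], G_tilde: List[List[int]], d_s: int) -> List[int]:
    """Accumulate s * G_tilde over GF(2) in groups of d_s - 1 rows per stage."""
    group = d_s - 1
    register = [0] * len(G_tilde[0])      # start from the all-zero codeword
    for start in range(0, len(s), group):
        rows = range(start, min(start + group, len(s)))
        # (d_s - 1) * N AND gates: scalar-vector products s_i * g_i.
        products = [[s[i] & g for g in G_tilde[i]] for i in rows]
        # N XOR gates of fan-in d_s: add the products to the register.
        for j in range(len(register)):
            acc = register[j]
            for prod in products:
                acc ^= prod[j]
            register[j] = acc
        # In ENCODED-F, one iteration of noisy PBF decoding would run here.
    return register
```

Each XOR gate takes the current register bit plus d_s − 1 product bits as inputs, which is why the text requires d_s ≤ D.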
We note here that the tree structure in ENCODED-T (see Alg. 2) could also be used in ENCODED-F. However, we still use the FOR-loop structure in Alg. 1, because the maximum tolerable gate error probability would be smaller with the tree structure. This is for two reasons: (i) the tree structure in ENCODED-T was motivated by the correlations induced in messages as they are passed up the tree, but correlations do not matter when decoding using the PBF algorithm; (ii) interestingly, there is a benefit to using the FOR-loop structure: in ENCODED-T, the error probability increases by a factor of $d_T$ at every level due to the XOR-ing of $d_T$ inputs. Since these XOR operations are not needed in ENCODED-F, the constraints on the gate error probability needed to keep the error probability suppressed are more severe in ENCODED-T than in ENCODED-F.
In what follows, we prove Theorem 2, which quantifies the error-tolerance of ENCODED-F. The basic tool used to prove Theorem 2 is a modified version of the worst-case error correcting result in the requirement (A.3) , which provides the worst-case error correcting capability of regular LDPC codes using one iteration of noisy PBF decoding. 
which can be obtained by combining (26) . Suppose in the (l − 1)-th stage, after adding a set of (d s − 1) codewords
. . , (d s −1)(l −1)} to the N-bit register, the number of errors is strictly less than Nα 0 . Then, according to condition (A.3), if no computation errors occur during execution of one iteration of PBF algorithm, the fraction of errors can be reduced to Nα 0 (1−θ). Whenever an XOR gate flips the corresponding parity check value during the PBF algorithm, it affects at most d c majority gates. In total, there are P XOR gates used in one iteration of the PBF algorithm, so there are at most α xor Pd c errors due to XOR errors in the PBF algorithm. There are at most α maj N errors due to majority gate failures. After this iteration of bit flipping, another set of (d s − 1) codewords s i g i is added to the N-bit register with N(d s − 1) AND gates and N XOR gates. These two operations introduce (α xor + α and (d s − 1))N errors. Therefore, the total error fraction before the next PBF algorithm is upper bounded (using the union bound) by
where R is the code rate ($R = \frac{N-P}{N}$). As long as (26) holds and $d_c \le D$, before the next bit flipping, it holds that
Therefore, the induction can proceed.
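The induction step can be checked numerically with a sketch that simply adds up the error contributions named in the prose above, expressed as fractions of N. The exact form of the paper's bound is not reproduced here; the grouping of terms, the function name and the example values are assumptions for illustration.

```python
def error_fraction_before_next_pbf(alpha0: float, theta: float,
                                   a_and: float, a_xor: float, a_maj: float,
                                   d_c: int, d_s: int, rate: float) -> float:
    """Union-bound style error fraction just before the next PBF iteration."""
    after_pbf = alpha0 * (1.0 - theta)          # residual after a noiseless PBF pass
    pbf_xor = a_xor * (1.0 - rate) * d_c        # P/N = 1 - R faulty checks, each hitting d_c bits
    pbf_maj = a_maj                             # faulty majority gates
    accumulate = a_xor + a_and * (d_s - 1)      # faulty gates in the next accumulation
    return after_pbf + pbf_xor + pbf_maj + accumulate


if __name__ == "__main__":
    eps = 1.275e-6  # hypothetical common gate error fraction
    frac = error_fraction_before_next_pbf(5.1e-4, 0.15, eps, eps, eps,
                                          d_c=18, d_s=14, rate=0.5)
    print(frac, frac < 5.1e-4)
```

When the returned value stays below α_0, the hypothesis of the induction is restored for the next stage.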
In each stage, we need N + P XOR-operations and N majority-operations. During the entire computation, we need NL AND-operations. Therefore, the computational complexity per output bit, which is the total number of operations in $L/(d_s - 1)$ stages divided by K bits, is $\frac{(2N + P)\,L/(d_s - 1)}{K} + \frac{NL}{K}$. ENCODED-F can be applied to Gate Model I as well, which is characterized in the following theorem.
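The per-bit operation count for ENCODED-F can be evaluated with a short sketch. The function name is hypothetical, rounding the stage count up when L is not a multiple of d_s − 1 is an assumption, and the example parameters are the ones used for ENCODED-F in the simulations of Section IV-D with P = N d_v / d_c = 2000.

```python
def encoded_f_ops_per_bit(L: int, N: int, P: int, K: int, d_s: int) -> float:
    """Operations per output bit: (2N + P) per stage plus N*L AND-operations."""
    stages = -(-L // (d_s - 1))          # ceiling of L / (d_s - 1)
    per_stage = (N + P) + N              # N + P XOR-operations, N majority-operations
    return (per_stage * stages + N * L) / K


if __name__ == "__main__":
    print(encoded_f_ops_per_bit(L=2100, N=4000, P=2000, K=2000, d_s=8))
```

With these parameters there are 300 stages, and the AND-operations account for most of the total.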
the binary linear transformation r = s · A can be computed using 2N+P 
where the probability P blk
However, in probabilistic settings, the number of errors at any stage could exceed $N\alpha_0$. In what follows, we use large deviation analysis to show that the probability of exceeding $N\alpha_0$ is small. First, we review the large deviation result for the binomial distribution [78, p. 502, Example 3].
Proof: The inequality (34) is the large deviation bound for the binomial distribution and is presented in [78, p. 502, Example 3]. Note that $D(p+\lambda \,\|\, p)$ is monotone non-increasing for p ∈ (0, λ). When p < λ, we have $D(p+\lambda \,\|\, p) > D(2\lambda \,\|\, \lambda)$. Therefore, (34) holds for p < λ. Then, Theorem 3 follows from Theorem 2 and Lemma 3.
Proof of Theorem 3: Using Theorem 2, we know that if the error fraction in all stages is bounded by the inequality (26) 
as in the condition (30), we have
Therefore,
= Pr (α and > p and + λ) + Pr (α xor > p xor + λ)
Since
Therefore, using the union bound for the L stages, the total error probability is upper bounded by $P_e^{\mathrm{blk}} < 3L \exp\left(-D(2\lambda \,\|\, \lambda) N\right)$.
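The union bound above can be evaluated with a small sketch, reading D(·‖·) as the Kullback-Leibler divergence (in nats) between two Bernoulli distributions. The function names are hypothetical.

```python
import math


def kl_bernoulli(a: float, b: float) -> float:
    """KL divergence D(a || b) between Bernoulli(a) and Bernoulli(b), in nats."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))


def block_error_bound(L: int, N: int, lam: float) -> float:
    """Evaluate 3 * L * exp(-D(2*lam || lam) * N)."""
    return 3.0 * L * math.exp(-kl_bernoulli(2.0 * lam, lam) * N)


if __name__ == "__main__":
    lam = 1e-3
    print(kl_bernoulli(2 * lam, lam), (2 * math.log(2) - 1) * lam)
    print(block_error_bound(L=100, N=10**5, lam=lam))
```

For small λ the divergence is close to (2 log 2 − 1)λ, so the bound decays exponentially in λN and becomes meaningful only once N is large compared with 1/λ.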
Remark 7: The analysis of the PBF algorithm (see Appendix D) still requires randomized code constructions. Another method to analyze the bit flipping algorithm is to use expander codes (also see Appendix D). However, hardware-friendly expander codes tend to be hard to construct and use in practice, while many hardware-friendly LDPC codes have been designed [79] [80] [81] [82]. In fact, ENCODED-T and ENCODED-F each have some advantages over the other, so neither universally outperforms the other. On the one hand, ENCODED-T works for all regular LDPC codes, but it requires a tree structure for its theoretical analysis to work. On the other hand, ENCODED-F does not require a tree structure, and requires less redundancy than ENCODED-T (it does not need to maintain $d_v$ copies of the computation), but it requires the LDPC code to satisfy certain properties. More concretely, it requires that each single iteration of the simple bit flipping algorithm corrects a constant fraction of errors (say, 75%) for any combination of fewer than some constant number of errors (say, 20 errors). This property is hard to verify in practice, and only an existence result can be obtained. This makes it hard to say which scheme is better. For this reason we keep both the density evolution analysis for general LDPC codes in Section IV-B under the assumption of large girth, and the PBF analysis for ENCODED-F.
The following converse result holds for all computation schemes. Although this converse result does not match with any of the achievable results listed above, it matches with an achievable result when a "noiseless decoder" is available (details will be provided in Section VI) in the scaling of the target error probability p tar . Thus, we believe the converse result captures the required computational complexity for the beginning stages of the linear transform computation.
Theorem 4 (Converse Result): For Gate Model I with error probability $\epsilon$, maximum fan-in D, and linear transformation r = s · A with A having full row rank, in order to achieve $P_e^{\mathrm{blk}}$ smaller than $p_{\mathrm{tar}}$, the number of operations required per bit is lower bounded as $N_{\text{per-bit}} \ge \frac{L \log(1/p_{\mathrm{tar}})}{K D \log(D/\epsilon)} = \Omega\!\left(\frac{L \log(1/p_{\mathrm{tar}})}{K \log(1/\epsilon)}\right)$.
Proof: See Appendix C.
Fig. 5. Illustration of the 3-time distributed voting scheme for computing an inner product $s \cdot a_j = s_1 a_{1j} + s_2 a_{2j} + \ldots + s_L a_{Lj}$, where s is the input to the linear transform sA, and $a_j$ is the j-th column of A. The computation is divided into L stages. In the i-th stage, the distributed voting scheme computes $x_j^{(i)} = x_j^{(i-1)} + s_i g_{ij}$ three times using three sets of AND-gates and XOR-gates, and uses three noisy majority-gates (which are called voters in [33]) to compute three copies of the majority vote. Then, the output of each majority gate is sent to the corresponding copy for the computation in the next stage. Note that the bounds on the bit error ratio are for the error probability after the decoding stages, so they only apply for the even stages.
D. Simulation Results for ENCODED Techniques
We present simulation results in Fig. 6 , which shows the variation of bit error probability during the process of implementing Algorithm 2. In the simulation, we generate random binary A matrices where each entry takes value one with probability 1/2. The x-axis is the computing steps from the bottom to the top of the noisy computing tree structure. The y-axis is the bit error probability. As we have mentioned, during the entire computing process, computation introduces errors and decoding suppresses errors. Thus, the bit error probability oscillates between two limits. This is exactly the expected behavior as shown in Fig. 3(b) .
This simulation uses a randomly generated (6, 12)-regular LDPC code of length 1200. The systematic generator matrix G is computed by solving the equation GH = 0 in the binary field using Gaussian elimination, where H is the parity check matrix. The tree in Algorithm 2 is set to be a two-branch tree, i.e., $d_T = 2$. The failure probability values of the different unreliable gates are set to be the same as the threshold values computed using the condition (17) in Theorem 1: $p_{\mathrm{maj}} = 0.001$, $p_{\mathrm{xor}} = 0.00026$ and $p_{\mathrm{and}} = 0.002$. We still assume that each operation in the decoding process is done by a single unreliable gate, and all gates fail independently of each other. Notice that the error probability lower limit is just above $p_{\mathrm{maj}} = 0.001$, which is consistent with our analysis in Section IV-B. The bit error probability after each decoding iteration should be confined between $p_{\mathrm{maj}}$ and $p_{\mathrm{maj}} + \frac{1}{d_T} p_{\mathrm{thr}}$ in theory (see Lemma 2).
It is interesting to note that the computation scheme works at fairly practical values of node degrees ($d_v = 6$) and blocklengths (N = 1200). The target error probabilities are typically much smaller, but so are gate-error probabilities. Moreover, the scheme works even though the choice of the tree width $d_T$ does not satisfy the constraints (18). These observations suggest that the bounds in Theorem 1 are conservative. The moderate blocklength of the code suggests that the scheme could be applied in practice, but a deeper investigation is needed, which is beyond the scope of this work.
We also use simulations to compare ENCODED and repetition-based schemes. In particular, we provide a comparison between ENCODED-F and a particular repetition-based scheme called the "distributed voting scheme" [33], which is designed for $p_{\mathrm{maj}} > 0$. This method repeats not only the computation part, but also the majority voting part of the repetition-based circuit. The distributed voting scheme is illustrated in Fig. 5. In this way, we can compare the (repetition-coding based) distributed voting scheme with ENCODED when both use noisy gates.
The performance comparison is shown in Fig. 7. In the distributed majority scheme, we use three-time repetition or four-time repetition. For ENCODED-F, we set $d_v = 4$, $d_c = 8$, $d_s = 8$, K = 2000, L = 2100, N = 4000. We set $p_{\mathrm{and}} = 0.000125$, $p_{\mathrm{maj}} = 0.0005$ and $p_{\mathrm{xor}} = 0.001$. We set these error parameters because we assume that the error probability of each gate is proportional to its fan-in number (we use 2-input AND-gates, 4-input MAJ-gates and 8-input XOR gates). Note that the number of compute-and-correct stages in ENCODED-F should be $\frac{L}{d_s-1} = 300$. In one compute-and-correct stage, we need N XOR-operations of fan-in $d_s = 8$ for binary addition, P XOR-operations of fan-in $d_c = 8$ for parity computation and N MAJ-operations of fan-in $d_v = 4$ for majority computation. In all 300 stages, we also need NL AND-operations of fan-in 2. Therefore the number of operations per output bit for ENCODED-F is
In the distributed majority voting scheme with repetition time t m (t m can be 3 or 4 when the majority gate with fan-in 4 is used), the number of operations per output bit is
N Rep AND−2,per-bit = t m L.
Therefore, when the repetition time t m is 3 or 4, the number of operations per output bit for ENCODED-F is always smaller than the number of operations per output bit for the distributed majority voting scheme.
E. Theoretical Comparison With Repetition Coding
In this section, we provably show the advantage of ENCODED through theoretical analysis. We also provide a result in an online version [83] of this paper which shows ENCODED beats repetition-based techniques in scaling sense.
In this paper, although we obtain results on the number of operations for ENCODED in Theorems 1-3, these results are biased for the comparison between ENCODED and repetition-based schemes, because the number of operations does not take into account the gate fan-in. Therefore, to compare the complexity of operations with different fan-in, we define a new concept called the "effective number of operations". We assume that the "effective number of operations" for an operation with fan-in c is $N_{c\text{-fan-in}} = c$ (the analysis for a different $N_{c\text{-fan-in}}$ can be done similarly). We show that if we consider the problem of finding a binary linear transform scheme that achieves target error probability $p_{\mathrm{tar}} = 5.1 \cdot 10^{-4}$ using only noisy gates with $\max(p_{\mathrm{xor}}, p_{\mathrm{maj}}, p_{\mathrm{and}}) < 1.3 \cdot 10^{-6}$, the effective number of operations of ENCODED-F is smaller than that of distributed majority voting, provided that the size of the linear transform satisfies $N = 2K > 9.85 \cdot 10^{7}$ and $L > p_{\mathrm{tar}}/p_{\mathrm{and}}$. We choose these parameters only to show that ENCODED can provably beat repetition-based schemes in situations when the parameters are not absurdly large, and hence the theoretical analysis here has the potential to provide practical insight. Here $\max(p_{\mathrm{xor}}, p_{\mathrm{maj}}, p_{\mathrm{and}})$ is interpreted as the maximum error probability over all types of gates, which allows gates of the same type (e.g., MAJ-gates) with different fan-in to have different error probabilities.
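The fan-in weighting can be illustrated with a sketch that applies the rule "an operation with fan-in c counts as c" to the per-stage operation mix of ENCODED-F stated in Section IV-D (N XOR-operations of fan-in d_s, P XOR-operations of fan-in d_c, N MAJ-operations of fan-in d_v per stage, and NL AND-operations of fan-in 2 overall). The function name is hypothetical, and this is not claimed to reproduce the paper's numbered equations.

```python
def encoded_f_effective_ops_per_bit(L: int, N: int, P: int, K: int,
                                    d_s: int, d_c: int, d_v: int) -> float:
    """Fan-in-weighted operation count per output bit for ENCODED-F."""
    stages = -(-L // (d_s - 1))                    # ceiling of L / (d_s - 1)
    per_stage = d_s * N + d_c * P + d_v * N        # XOR (add), XOR (parity), MAJ
    and_ops = 2 * N * L                            # 2-input AND gates
    return (stages * per_stage + and_ops) / K


if __name__ == "__main__":
    print(encoded_f_effective_ops_per_bit(L=2100, N=4000, P=2000, K=2000,
                                          d_s=8, d_c=8, d_v=4))
```

Compared with the unweighted count, the weighting penalizes the wide XOR and majority gates of the decoder more heavily than the 2-input AND gates.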
1) Counting the Effective Number of Operations: First, we compare the effective number of operations in both schemes. For ENCODED-F, we use a (9, 18) code. To make the comparison fair, we allow the distributed majority voting scheme to group several stages into one stage as well. Recall that we use $d_s$ to denote the number of stages that ENCODED-F groups into one stage. Similarly, we use $d'_s$ to denote the number of stages that distributed majority voting groups into one stage. In general, $d'_s \neq d_s$.
We compare ENCODED-F using (9, 18) LDPC codes with distributed majority voting with three-time repetition. We show when
the "effective" number of operations in ENCODED-F is less than that of distributed majority voting. Note that d s can be arbitrary.
The number of compute-and-correct stages in ENCODED-F is In all compute-and-correct stages, the overall number of AND-operations of fan-in 2 is N L. Then,
In the distributed majority voting scheme with repetition time 3, the number of operations per output bit is 
Therefore, from (47)-(50), for ENCODED-F with d c = 18 and d v = 9, the effective number of operations is
From (51) to (53) , for distributed majority voting, the effective number of operations is (55) Therefore, when d s > 14,
2) Analyzing the Probability of Error: Now, we analyze the error probability of ENCODED-F for $d_s = 14$, $d_c = 18$ and $d_v = 9$. From Lemma 7 in Appendix D, using almost all codes in a $(d_v, d_c)$-regular LDPC random code ensemble with $d_v > 4$ and N large enough, after one iteration of the PBF algorithm, one can reduce the number of errors by at least $\theta \alpha_0 N$ for any $\alpha_0 N$ worst-case errors if $\alpha_0$ and θ are small enough. That is, using a $(d_v, d_c)$-regular LDPC code, the fraction of errors after one iteration of the noiseless PBF algorithm will be smaller than $\alpha_0 (1 - \theta)$. Recall that this is the condition (A.3) that we require on the utilized LDPC code. In Example 1 in Appendix D, for the (9, 18)-regular LDPC code, we computed numerically the threshold value of $\alpha_0$ for θ = 0.15 and obtained $\alpha_0 = 5.1 \cdot 10^{-4}$. We also obtained finite-length bounds which state that there exist (9, 18)-regular LDPC codes with length N = 50,000 that can reduce the number of errors by 15% for an arbitrary pattern of at most 20 errors, which corresponds to the case when $\alpha_0 = 4 \cdot 10^{-4}$ and θ = 0.15.
From Theorem 3, using the (9, 18) code, when the maximum gate error probability $\epsilon = \max(p_{\mathrm{xor}}, p_{\mathrm{maj}}, p_{\mathrm{and}})$ satisfies the condition
ENCODED-F has bounded final error fraction with high probability, which is
where $\lambda = \frac{\theta \alpha_0}{54}$ and $D(2\lambda \,\|\, \lambda) = (2 \log 2 - 1)\lambda + O(\lambda^2)$. In particular, if we choose $\epsilon = \frac{1}{60}\theta\alpha_0 < \frac{1}{54}\theta\alpha_0 = \lambda$, the final error fraction satisfies $\delta_e^{\mathrm{frac}} < \alpha_0 = \frac{60}{\theta} \cdot \epsilon$ with probability $1 - 3L \exp(-D(2\lambda \,\|\, \lambda) N)$. As we have mentioned, for θ = 0.15, we obtain $\alpha_0 = 5.1 \cdot 10^{-4}$. Therefore, when the gate error probabilities satisfy $\max(p_{\mathrm{xor}}, p_{\mathrm{maj}}, p_{\mathrm{and}}) = \epsilon = \frac{1}{60}\theta\alpha_0 = 0.0025\,\alpha_0 = 1.3 \cdot 10^{-6}$, the obtained error fraction is smaller than $\alpha_0 = \frac{60}{\theta} \cdot \epsilon = 400\,\epsilon = 5.1 \cdot 10^{-4}$ with probability $1 - 3L \exp(-D(2\lambda \,\|\, \lambda) N)$, which is approximately 1 with reasonably large N, which can be guaranteed 12
12 We believe that further optimization in code design can provide techniques for error suppression for even smaller values of N.
Therefore, if we consider the problem "find a binary linear transform scheme that achieves target error probability $p_{\mathrm{tar}} = \alpha_0 = 5.1 \cdot 10^{-4}$ using only noisy gates with $\max(p_{\mathrm{xor}}, p_{\mathrm{maj}}, p_{\mathrm{and}}) < 1.3 \cdot 10^{-6}$", ENCODED-F has a smaller "effective number of operations" than distributed majority voting. Additionally, one-time repetition or two-time repetition cannot obtain $p_{\mathrm{tar}} = \alpha_0 = 5.1 \cdot 10^{-4}$ when L is reasonably large, so that $\frac{1}{2}\left[1 - (1 - 2 p_{\mathrm{and}})^{L}\right] \approx L p_{\mathrm{and}} > \alpha_0$. Thus, we conclude that ENCODED-F beats repetition-based schemes under this circumstance. Here, we acknowledge that the problem parameters (such as $\max(p_{\mathrm{xor}}, p_{\mathrm{maj}}, p_{\mathrm{and}}) < 1.3 \cdot 10^{-6}$ and $N > 9.85 \cdot 10^{7}$) are chosen to show that the theoretical analysis works even when the parameter sizes are not extremely large, and thus the theoretical analysis technique has the potential to provide practical insight.
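The numerical relations between θ, α_0, λ and the gate error probability in this paragraph can be re-derived with a short sketch. Only the relations λ = θα_0/54 and ε = θα_0/60 and the values θ = 0.15, α_0 = 5.1·10^−4 are taken from the text; the function name and the returned dictionary layout are hypothetical.

```python
def encoded_f_thresholds(theta: float, alpha0: float) -> dict:
    """Derived quantities for the (9, 18) example."""
    lam = theta * alpha0 / 54.0          # deviation parameter lambda
    eps = theta * alpha0 / 60.0          # chosen maximum gate error probability
    return {
        "lambda": lam,
        "epsilon": eps,
        "epsilon_over_alpha0": eps / alpha0,   # equals theta / 60
        "alpha0_over_epsilon": alpha0 / eps,   # equals 60 / theta
    }


if __name__ == "__main__":
    for name, value in encoded_f_thresholds(theta=0.15, alpha0=5.1e-4).items():
        print(name, value)
```

The chosen ε lies strictly below λ, which is the condition used when applying the large deviation bound.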
V. COMPUTING A LINEAR TRANSFORMATION RELIABLY AND ENERGY-EFFICIENTLY WITH VOLTAGE SCALING
In this section, we consider unreliable gates with tunable failure probability [84], where the supply voltage, and hence the energy consumed by gates, can be adjusted to attain a desired gate reliability. To model this property within Gate Model I in (1), we assume that the added noise $z_g \sim \mathrm{Bernoulli}(\epsilon_g(E_v))$, in which $\epsilon_g(E_v)$ is a function that depends on the supply energy $E_v$. We assume that $E_v$ is identical for all gates at any stage of the computation, while it can vary across stages. Intuitively, $\epsilon_g(\cdot)$ should be a monotonically decreasing function, since the error probability should be smaller if more energy is used. Suppose the energy-reliability tradeoff functions of AND-gates, XOR-gates and majority-gates are $\epsilon_{and}(\cdot)$, $\epsilon_{xor}(\cdot)$ and $\epsilon_{maj}(\cdot)$ respectively. Then, the failure probabilities of these three types of gates are $p_{and} = \epsilon_{and}(E_v)$, $p_{xor} = \epsilon_{xor}(E_v)$ and $p_{maj} = \epsilon_{maj}(E_v)$.
A. Uncoded Matrix Multiplication vs ENCODED-T
In this section, we compare the required energy for ENCODED-T with that for 'uncoded' matrix multiplication r = sA, where the circuit voltage is maintained high to ensure overall error probability is smaller than target error probability. The uncoded matrix multiplication is how almost all circuits today operate. Here, we only provide a scaling sense comparison to show the advantage of ENCODED techniques. maj ( p tar ) . Proof: "Uncoded" scheme: To compute each output bit using straightforward dot-product-based multiplication, one needs to compute a dot product of the message s with one column in the matrix A, which needs 2L-1 unreliable operations (L AND-operations and L − 1 XOR-operations).
From Lemma 1, we know that the bit error probability is
Since ( 
where step (a) follows from p and < 1 2L−2 , and 
Therefore, the total energy required for each output bit is (L · max{ −1 and ( 2 p tar L ), −1 xor ( 2 p tar L )}). From Theorem 1, we know that in the ENCODED-T technique, N per-bit = (L) is sufficient to achieve bit error probability smaller or equal to p maj + 1 d T p thr . From (17), it is reasonable to make p xor = p and = p maj = p thr = 1 2 p tar , in which case p maj + 1 d T p thr < 2 p maj = p tar . Since there are O L N K max{ −1 maj ( 1 2 p tar ), −1 xor ( 1 2 p tar ), −1 and ( 1 2 p tar )} . Furthermore, p maj < p tar due to the 'last-gate effect'. Therefore, the total energy required for each bit is at least
Remark 8:
We show an illustrative example when $\epsilon_{and}(\cdot) = \epsilon_{xor}(\cdot) = \epsilon_{maj}(\cdot) = \epsilon_g(\cdot)$. Because $\epsilon_g^{-1}(u)$ typically decreases monotonically in $u$, we consider three specific cases: exponential decay, polynomial decay and sub-exponential decay. For exponential decay, we assume $\epsilon_g(u) = \exp(-cu)$, $c > 0$. Then the total energy required for each output bit for the 'uncoded' matrix multiplication scales as $L\log\frac{L}{p_{tar}}$, while that for ENCODED-T scales as $L\frac{N}{K}\log\frac{1}{p_{tar}}$. For polynomial decay, $\epsilon_g(u) = (1/u)^c$, $c > 0$, the total energy required for each output bit for the 'uncoded' matrix multiplication scales as $L\left(\frac{L}{p_{tar}}\right)^{1/c}$, while that for ENCODED-T is
For sub-exponential decay, we assume $\epsilon_g(u) = \exp(-c\sqrt{u})$, $c > 0$. By sub-exponential we mean that the decay is slower than exponential but faster than polynomial. The sub-exponential decay model is inspired by and obtained from [85] on spintronic devices [86]-[88]. Then the total energy required for each output bit for the 'uncoded' matrix multiplication scales as $L\left(\log\frac{L}{p_{tar}}\right)^2$, while that for ENCODED-T scales as $L\frac{N}{K}\left(\log\frac{1}{p_{tar}}\right)^2$. In all cases, if $K = RN$ for some constant 'rate' $R$, the scaling of the required energy consumption of ENCODED-T is smaller than that of the uncoded scheme.
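The three tradeoff models of Remark 8 can be compared numerically by inverting each model and plugging in the per-gate error targets used in this section ($2p_{tar}/L$ for the uncoded scheme, $\frac{1}{2}p_{tar}$ for ENCODED-T). The sketch below is illustrative only: all function names are hypothetical, every hidden constant is set to 1, and $N/K$ is replaced by $1/R$ for an assumed rate $R$, so only the growth of the ratio with $L$ is meaningful, not the absolute values.

```python
import math


def inv_exponential(p, c=1.0):
    # eps(u) = exp(-c * u)  =>  u = log(1/p) / c
    return math.log(1.0 / p) / c


def inv_polynomial(p, c=1.0):
    # eps(u) = (1/u)^c  =>  u = p^(-1/c)
    return p ** (-1.0 / c)


def inv_subexponential(p, c=1.0):
    # eps(u) = exp(-c * sqrt(u))  =>  u = (log(1/p) / c)^2
    return (math.log(1.0 / p) / c) ** 2


def uncoded_energy(inv_eps, L, p_tar):
    # About L gate uses per output bit, each tuned to error 2 * p_tar / L.
    return L * inv_eps(2.0 * p_tar / L)


def encoded_t_energy(inv_eps, L, p_tar, rate):
    # About L * N / K gate uses per output bit (N / K = 1 / rate),
    # each tuned to error p_tar / 2.
    return (L / rate) * inv_eps(p_tar / 2.0)
```

For a fixed rate, the ratio `uncoded_energy / encoded_t_energy` grows with `L` under all three models, which is the scaling advantage claimed in the remark. For moderate `L` the ratio can still be below 1, because the constants are not modeled here.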
In the next subsection, we will show that using 'dynamic' voltage scaling, we can achieve even lower energy by using a two-phase computation scheme called ENCODED-V. For example, when $\epsilon_g(u) = (1/u)^c$, $c > 0$, the energy consumption per output bit is
B. ENCODED-V: Low-energy Linear Transformations Using Dynamic Voltage Scaling
In this part, we modify the ENCODED-F technique in Section IV-C with 'dynamic' voltage scaling to obtain arbitrarily small output error fraction. The gate model here is Model I. The original ENCODED-F technique has L/(d s − 1) stages, where in each stage, a noisy decoder of the utilized LDPC code is used to carry out one (noisy) iteration of PBF decoding. In the original ENCODED-F technique, we assumed that gate failure probability is constant (and equal for all gates) throughout the duration of the computation process. Here, we partition the entire ENCODED-F technique into two phases. In the first phase, we use constant supply energy, while in the second phase, we increase the supply energy as the computation proceeds, so that the gate failure probability decreases during the computation process, in order to achieve the required output error fraction with high probability.
For ease of presentation, we consider the case when d s = 2, i.e., we only add d s − 1 = 1 codeword to the N-bit storage at each stage. The extension to general d s is straightforward. We partition the entire ENCODED-F so that there are L − L vs stages in the first phase and L vs stages in the second phase, where L vs is defined as
where p tar is the required final output error fraction. In the i -th stage of the last L vs stages, we assume that the supply energy is increased to some value to ensure that
We call this (dynamic) voltage-scaling scheme the ENCODED-V technique. Theorem 5: (Using dynamic voltage scaling for Problem 2) Using unreliable AND gates, majority gates and XOR gates defined from Gate Model I (D, ) with maximum fan-in D and error probability p and , p xor and p maj , and using a regular LDPC code that satisfies assumption (A.3), the binary linear transformation r = s · A can be computed using the ENCODED-F technique with dynamic voltage scaling, with per-bit energy consumption E per-bit = L − L vs K N −1 and ( p and ) + N −1 maj p maj
where L vs , which is a function of p tar , is defined in (63) . Further, the output error fraction is below p tar with probability at least 1 − P blk e , where the probability P blk
Proof: See Appendix E.
As in the analysis in Section V-A, we consider three specific cases of the energy-reliability tradeoff: the exponential decay model $\epsilon_{and}(u) = \epsilon_{xor}(u) = \epsilon_{maj}(u) = \exp(-cu)$, $c > 0$; the polynomial decay model $\epsilon_{and}(u) = \epsilon_{xor}(u) = \epsilon_{maj}(u) = (1/u)^c$, $c > 0$; and the sub-exponential decay model $\epsilon_{and}(u) = \epsilon_{xor}(u) = \epsilon_{maj}(u) = \exp(-c\sqrt{u})$, $c > 0$. We evaluate the total energy consumption per output bit in these three cases under a specific choice of supply energy that ensures the condition (64).
Corollary 1: (Using dynamic voltage scaling for Problem 1) Using a (d v , d c ) regular LDPC code that satisfies assumption (A.3) (with parameters α 0 and θ ) and
and $\lambda^*$ is defined in (67), the ENCODED-V technique can achieve output bit error probability $p_{tar}$ with total energy consumption per bit
when the energy-reliability tradeoff function $\epsilon_{and}(u) =$
TABLE I: THIS TABLE SHOWS THE ENERGY-RELIABILITY TRADEOFFS OF DIFFERENT COMPUTING SCHEMES UNDER DIFFERENT GATE ERROR PROBABILITY MODELS
We use Table I to show the energy-reliability tradeoff of "uncoded" matrix multiplication, ENCODED-T and ENCODED-V.
VI. WHEN A NOISELESS DECODER IS AVAILABLE
The conclusion in Theorem 3 can be further tightened if we use a noiseless PBF decoder after the noisy computation. Although the assumption that the last step of the entire computation process is fault-free is not valid under our Gate Model I or Gate Model II, it is often adopted in existing literature on computing with noisy components [33] , [34] , [89] .
Theorem 6 (What If We Have a Noiseless Decoder): Suppose the unreliable gates are drawn from Gate Model I $(D, \epsilon)$. Further assume that a noiseless PBF decoder is available. Then, the linear transformation $r = s \cdot A$ that outputs $K$ bits can be computed with $P_e^{blk} < p_{tar}$ using $\frac{1}{\lambda^*}\log\frac{3L}{p_{tar}}$ unreliable operations per output bit, which scales as $\log\frac{1}{p_{tar}}$, and an extra number of noiseless operations per output bit that scales as $\log\log\frac{1}{p_{tar}}$, where the parameter $\lambda^*$ is defined in (33) in Theorem 3.
Proof: We use the ENCODED-F technique to do noisy linear transformations. That is, instead of using Gallager-B decoding algorithm to correct errors, we use the PBF algorithm. Theorem 3 shows that the final error fraction can be upper bounded by a small constant α 0 with high probability as long as (30) holds. The total number of operations per bit
. If we require the error probability p tar to be arbitrarily small, we have to use a noiseless decoder to correct residual errors in the final output. We can use the noiseless decoder to carry out (log N) iterations of noiseless PBF algorithms to correct all errors, which introduces an additional (log N) operations per bit.
The overall error probability is the same as (32). To ensure that $P_e^{blk}$ is smaller than $p_{tar}$, it suffices (see (32)) to let $3L\exp(-\lambda^* N) < p_{tar}$. This is satisfied when
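Thus, we need $\frac{4L}{K\lambda^*}\log\frac{3L}{p_{tar}}$ unreliable operations per bit, which scales as $\frac{L}{K}\log\frac{L}{p_{tar}}$, and an extra number of noiseless operations per bit that scales as $\log N$, that is, as $\log\log\frac{L}{p_{tar}}$.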
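The code-length requirement in this proof follows from solving $3L\exp(-\lambda^* N) < p_{tar}$ for $N$, which gives $N > \frac{1}{\lambda^*}\log\frac{3L}{p_{tar}}$. A minimal sketch of that calculation is given below. The function names are hypothetical, and any value passed as `lam_star` is an assumption, since the definition of $\lambda^*$ in (33) is not restated in this section.

```python
import math


def block_error_bound(L, N, lam_star):
    """Value of the bound 3 * L * exp(-lam_star * N)."""
    return 3.0 * L * math.exp(-lam_star * N)


def min_code_length(L, p_tar, lam_star):
    """Smallest integer N with 3 * L * exp(-lam_star * N) < p_tar."""
    # Strict inequality: N must exceed log(3L / p_tar) / lam_star.
    return math.floor(math.log(3.0 * L / p_tar) / lam_star) + 1
```

Because $N$ enters only through a logarithm of $L/p_{tar}$, tightening the target error probability by several orders of magnitude increases the required code length only additively.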
Remark 9:
As discussed in Remark 6, the output error probability is at least p maj , the error probability of a majority gate, due to the 'last-gate effect'. Therefore, in order to achieve arbitrarily small error probability, the noiseless operations in Theorem 6 are necessary.
In fact, the bound in Theorem 4 is a lower bound on the number of operations that are used at the entrance stage of the computation scheme, i.e., operations that have one of the $L$ inputs $(s_1, s_2, \ldots, s_L)$ as an argument. Therefore, Theorem 6 and Theorem 4 together assert that the number of noisy operations scales as $\log\frac{1}{p_{tar}}$ under the setting of Problem 1, if the 'last-gate effect' can be addressed using a few noiseless operations whose number scales as $\log\log\frac{1}{p_{tar}}$.
VII. CONCLUSIONS AND FUTURE WORK
Can reliable computation be performed using gates that are all equally unreliable? As we discussed, the error probability is lower bounded by the last gate's error probability $\epsilon$. We provide LDPC-code-based strategies (called ENCODED) that attain error probability close to $\epsilon$ (which we bound by $2\epsilon$). Further, we show that these strategies outperform repetition-based strategies that are commonly used today.
The key idea that ENCODED relies on is to repeatedly suppress errors in computation process by, in a sense, encoding the computation matrix of the linear transformation, instead of encoding inputs (as is done in traditional communication). Using ENCODED, both probabilistic errors and worst-case errors can be kept suppressed.
Inspired by voltage-scaling techniques commonly used to reduce power in circuit design, we also analyzed possible gains attainable using 'static' and 'dynamic' voltage scaling in conjunction with our ENCODED technique. It would be meaningful to experimentally model the power-reliability tradeoffs of voltage scaling to give more insights to the designer. On the modeling side, it would also be important to include energy consumed in wiring [41] , [43] , [44] (which can be a significant chunk of the total energy in decoding circuits [90] ) in these models, and observe if predicted gains due to coding are significantly reduced. Perhaps wiring energy will also motivate design of novel coding techniques that attempt to correct errors with local information as much as possible.
There are many coding-theoretic problems that fall out naturally. What are practical codes that can be used to reduce computational errors? Are there benefits to applying more recent discoveries, such as spatially coupled LDPC codes [91] , instead of expander codes?
More broadly, the problem of computation with noisy gates is of considerable practical and intellectual interest. It is widely accepted that biological systems operate with noisy computational elements, and yet provide good performance at low energy. In engineered systems, with saturation of Dennard's scaling and Moore's law, new device technologies are being used to design circuits that are invariably error-prone. A comprehensive understanding of reliability-resource tradeoffs in error-correction coding in computing could give these novel technologies (e.g., carbon nanotubes and mechanical switches) a better chance to compete with established ones (i.e., CMOS). To that end, it will be key to identify what causes faults in these novel technologies so that they can be modeled and analyzed, and appropriate codes can be designed for them. Intellectually, it is interesting (and widely acknowledged) that the remarkable gains that coding brings to communications, especially at long range, are not easy to obtain in computational settings. The theoretical reasoning for this thus far rests on simplistic models and has rather loose bounds [45]. Improved strategies and improved outer bounds will go a long way in characterizing how large these gains can be.
A. Connections With Coded Computing With Stragglers and "Exascale Computing"
We note an important connection between ENCODED and the recent works on coded computation in the presence of "stragglers" [92]-[96]. These works focus on "processor-level" (rather than gate-level) noise, e.g., it is assumed in [92] that the product of input $s$ with each column of $A$ is "erasure-prone" with some probability (which depends on models of the time required for computation). The formulation there is not applicable in two ways that are crucial for modern "exascale" computing systems: (i) there is an increasing trend in the distributed systems community to consider "soft-errors" that are undetectable [97], whereas [92]-[96] largely focus on erasures; and more importantly, (ii) there is an increasing need for understanding scalability when the number of (fixed memory) processors increases for a fixed total problem size (to understand the limits of gains with parallelization of a problem). This is called "strong scaling" [98, Ch. 9], whereas "weak scaling" allows for increasing problem size and number of processors, while keeping the memory of each processor fixed. The works [92]-[96] only examine a fixed number of processors, with the memory of each processor increasing with problem size, which is, strictly speaking, "weaker than weak" scaling.
For the specific problem of matrix-vector multiplication, strong scaling allows adding more processors than the number of rows and/or columns of the matrix to increase parallelization. For example, when computing $s \times A$, the matrix $A$ is often split into both horizontal and vertical pieces (as used in the algorithm "SUMMA" [99]). ENCODED can be adapted to this split to suppress error propagation (horizontal decomposition is similar to (12), and vertical decomposition is imposed by using limited memory gates). In ENCODED-T, the tree structure helps introduce increased parallelism, speeding up the computation, while keeping errors in check through repeated error suppression. However, the algorithms in [92]-[96] do not easily adapt to horizontal division. If one naively uses the strategies in [92]-[96] for strong scaling with soft errors, errors will accumulate and cause the resulting output to be far from the correct output.
One limitation of the technique proposed here is that it is limited to finite fields instead of real number coding. It is important to extend ENCODED to real number codes, and a preliminary attempt is made in [100] on iterative algorithms for logistic regression, where LDPC-type real-number coding techniques (inspired from [101] ) are used for error-correction over reals.
APPENDIX A DETAILS OF GALLAGER-B DECODING ALGORITHM
Assume a variable node v is connected to d v parity check nodes in N v and a parity check node c is connected to d c variable nodes in N c . Suppose the received bits are r = (r 1 , . . . r N ). The decoding algorithm we use is the Gallager-B algorithm:
• From variable node to check node:
and z is a randomly generated bit.
• From check node to variable node:
Remark 10: Note that the updating rule m (i) v→c = z in (69) , which is used to break ties, is different from the original rule m (i) v→c = y v in [19] , in which y v is the channel output associated with the variable node v. This is because the problem that we consider is a computing problem, instead of a communication problem, and hence there are no channel outputs. Note that the analysis is also done for the modified updating rule. Although the modified updating rule is theoretically sound, we acknowledge that the cost of generating a random bit may be higher than that of the majority rule.
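The two message updates of the modified Gallager-B rule can be sketched as follows. This is an assumed reading of the update rules based on Remark 10: the variable-to-check message is taken to be the majority of the check messages arriving on the variable node's other $d_v - 1$ edges, with a random bit $z$ used only on ties, and the check-to-variable message is the XOR of the variable messages arriving on the check node's other $d_c - 1$ edges. The function names are hypothetical, and the gates here are noiseless.

```python
import random


def variable_to_check(incoming, rng=random):
    """Message on edge v->c from the check messages on v's other d_v - 1 edges.

    Assumed rule: majority vote, with a random bit z breaking ties.
    """
    ones = sum(incoming)
    zeros = len(incoming) - ones
    if ones > zeros:
        return 1
    if zeros > ones:
        return 0
    return rng.randint(0, 1)  # tie: randomly generated bit z


def check_to_variable(incoming):
    """Message on edge c->v: XOR of the variable messages on c's other d_c - 1 edges."""
    out = 0
    for bit in incoming:
        out ^= bit
    return out
```

For $d_v = 9$ each variable node combines eight incoming messages, so ties are possible and the random tie-breaking bit is actually exercised.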
APPENDIX B PROOF OF LEMMA 2
In this section, we prove that the bit error probability can be made smaller than a small constant $p_{maj} + \frac{1}{d_T}p_{thr}$ after one iteration of Gallager-B decoding. We use density evolution to analyze the change of error probability before and after decoding.
B. Density Evolution Analysis
We examine the $m$-th level in the tree structure of ENCODED-T. After the outputs from the $(m + 1)$-th level are obtained, they are forwarded to the $m$-th level of the tree structure to perform a component-wise XOR-operation. The results of this XOR-operation are stored in the E-bit registers at a node $v_m^l$ (a compute-and-correct unit) in the $m$-th level and are decoded using $C$ iterations of the Gallager-B algorithm. Now, we focus on the $C$ iterations of Gallager-B decoding done at the node $v_m^l$. For simplicity, we write the message-passing result after the $i$-th iteration as $x^{(i)} = (x^{(i)}_{v\to c})$, which is the vector constituted by the messages sent from variable nodes to parity check nodes. The initial input $x^{(0)}$ is the output of the unreliable XOR-gates in the node $v_m^l$. Denote the correct message-passing bits by $w^{(i)} = (w^{(i)}_{v\to c})$, i.e., the bits that would be obtained if no computing errors were introduced in the entire computation process (not merely in iteration $i$). We write $p^{(i)}_{v\to c}$ for the bit error probability of $x^{(i)}_{v\to c}$, that is,
We want to calculate the evolution of p (i) v→c with i . From [27] and [78] we know that in density-evolution analysis, the bit error probability does not depend on the transmitted codeword, based on the check node and variable node symmetry of the message-passing algorithm, and the channel symmetry and the message noise symmetry [26, Definition 5] . In our problem, the channel symmetry comes from the fact that the AND gates flip different outputs with the same probability. The message wire symmetry comes from the fact that the majority gates and XOR gates flip outputs with the same probability. Note that we do not need the source symmetry, and hence we can use the proof of [27, Th. 1] to show that the bit error probability P bit e does not depend on the correct codeword at the node v l m . Therefore, we can assume without loss of generality that the correct input (and hence output) in the linear computation s ·G is an all-zero codeword and hence
From assumption (A.3) we know that when the number of levels in the tree structure in the ENCODED-T technique is smaller than or equal to $\frac{\log N}{2\log[(d_v - 1)(d_c - 1)]}$, we can assume that the decoding neighborhood for each variable node is cycle-free and all bits entering a majority-gate or an XOR-gate are independent of each other. In our case, choosing $C = 1$ iteration at each level, we can indeed ensure that the constraint on the number of decoding iterations holds, since the tree structure in the ENCODED-T technique makes the total number of decoding iterations equal to $C \cdot (M - 1) = \frac{\log L}{\log d_T}$ (see (20); we use $M - 1$ because only non-leaf nodes have embedded decoders), and the tree-width $d_T$ can be set large enough so that (18) is satisfied.
Therefore, based on symmetry and independence, we can use the analysis for the noisy Gallager-B decoder in [27] and attain the performance predicted by density evolution for
, define four functions:
Intuitively, $\bar\gamma(u)$ denotes the error probability after the XOR-operation at a check node and $\bar\eta(u)$ denotes the error probability after the majority-operation at a variable node. These functions are borrowed from [27] and are crucial for analyzing noisy Gallager-B density evolution. Note that we change the form of the functions $\alpha(\cdot)$, $\gamma(\cdot)$, $\eta(\cdot)$ in [27] into $1 - \bar\alpha(\cdot)$, $1 - \bar\gamma(\cdot)$, $1 - \bar\eta(\cdot)$, in correspondence with the usual goal of analyzing error probability, instead of correctness probability.
We first state a result from [27] , and then simplify the result using an upper bound. Note that the LDPC decoding rule used in [27] is slightly different from ours, as stated in Remark 10. We will address this issue in the proof of the upper bound.
Lemma 4 (28, 
where $\bar\eta(\cdot)$ is defined in (75) and $\bar\gamma(\cdot)$ is defined in (73). The lower bound $p_e^{dec} > p_{maj}$ in (24) follows from the fact that $p_{maj} < \frac{1}{2}$ (see Gate Model I) and $p_e^{dec} = p^{(C)} = f(p^{(C-1)})$, where $C$ is the total number of iterations of decoding at each level. In what follows, we upper-bound the RHS of (76) by upper-bounding the functions $\bar\gamma(\cdot)$ and $\bar\eta(\cdot)$ defined in (73) and (75). The result is shown in Lemma 5.
Lemma 5: For regular LDPC codes with check node degree d c and variable node degree d v
whereη
Proof: First, note that the decoding algorithm used in [27] is slightly different from ours in that the tie-breaking rule in [27] is $m^{(i)}_{v\to c} = y_v$, which is different from our rule $m^{(i)}_{v\to c} = z$, where $z$ is a randomly generated bit (see Remark 10). It can be shown that, if our updating rule is used, the $\bar\eta$ function in the density evolution function (76) should be changed from (75) to
When
, which can be
For the functionγ (u) in (73), we upper bound it with the following two inequalities:
Combining (83) and (82) 
where step (a) is due to the fact thatη 0 (γ ) is monotonically increasing.
C. Completing the Proof of Lemma 2
We need to prove that if the bit error probability before decoding is smaller than $p_{reg} = (d_T + 1)p_{thr} + p_{xor}$, the bit error probability after decoding is smaller than $p_{maj} + \frac{1}{d_T}p_{thr}$. Using Lemma 5, we only need to prove
which is equivalent to
We know that
where step (a) is from the definition p reg = (d T + 1) p thr + p xor and step (b) is from the second condition in (17) . Thus, to prove (85) , it suffices to prove
in Theorem 1. Thus, the above inequality is ensured by
which is the first condition in (17) .
APPENDIX C PROOF OF THEOREM 4
Theorem 4 provides a lower bound on the number of operations by lower-bounding the operations done at the entrance stage of the noisy circuit, i.e., operations that have one of the L inputs (s 1 , s 2 , . . . s L ) as an argument. In order to prove Theorem 4, we need the following lemma (stated implicitly in [49, Proposition 1]) which characterizes the equivalence of a noisy-gate model and a noisy-wire model.
Lemma 6: For each unreliable gate from Gate Model I $(D, \epsilon)$ with error probability $\epsilon$ and fan-in number $\le D$, its output variable can be stochastically simulated by (equivalent in distribution to) another unreliable gate $\tilde g$ that computes the same function but with the following property: each input wire flips the input independently with probability $\epsilon/D$ and the gate has additional output noise independent of the input wire noise.
Proof: For an arbitrary unreliable gate
consider another unreliable gate together with noisy wires
where $w_j$ is the noise on the $j$-th input wire and takes value 1 with probability $\epsilon/D$. The probability that all $d$ wires convey the correct inputs is $(1 - \epsilon/D)^d > 1 - \frac{d\epsilon}{D} > 1 - \epsilon$. Therefore, if $\tilde z_g$ is 0 w.p. 1, the error of $\tilde g$ will be smaller than $\epsilon$. Thus, using standard continuity arguments, we can find a random variable $\tilde z_g$ which equals 1 w.p. $< \epsilon$, while making $\tilde y$ and $y$ equivalent in distribution.
Based on this lemma, we know that a noisy network defined in Section II-A can always be replaced by another network, where each wire has an error probability $\epsilon/D$. Before a specific input $s_k$ enters the noisy circuit, it is always transmitted along the wires connected to the entrance stage of the gates in the circuit. Because of the assumption that gates after the inputs are noisy, the bit will be 'sampled' by the noisy wires. For convenience of analysis, we assume each gate can only be used once, so that the number of operations is equal to the number of unreliable gates. Now that each gate only computes once, each noisy wire can only carry information once as well. We assume each $s_k$ is transmitted on $T_k$ distinct wires. Then, the probability that the message on all $T_k$ wires flips is $p_k = (\epsilon/D)^{T_k}$.
Therefore, the error probability of the input bit $s_k$ satisfies $P_{in}^k > p_k$. Since the matrix $A$ is assumed to have full row rank, if the linear transformation computation is noiseless, even a single input bit error leads to an output block error. Therefore, even when the linear transformation computation is noiseless, the output block error probability $P_e^{blk}$ is greater than the input error probability $P_{in}^k$. Since the computation is noisy, $P_e^{blk}$ is still greater than $P_{in}^k$, and hence is greater than $p_k$. Therefore, if $(\epsilon/D)^{T_k} = p_k > p_{tar}$, the block error probability satisfies $P_e^{blk} > P_{in}^k > p_k > p_{tar}$, which contradicts the aim of making the block error probability smaller than $p_{tar}$. Thus,
which means that for any bit s k
Therefore, the number of wires connected to each input bit must be at least $\frac{\log(1/p_{tar})}{\log(D/\epsilon)}$. Since the number of input bits is $L$, the total number of wires connected to all input bits is at least $\frac{L\log(1/p_{tar})}{\log(D/\epsilon)}$. Since we are using gates with fan-in at most $D$, the number of gates is at least $\frac{L\log(1/p_{tar})}{D\log(D/\epsilon)}$, and so is the number of operations. Since there are $K$ output bits, the number of operations per output bit satisfies $N_{per-bit} > \frac{L\log(1/p_{tar})}{KD\log(D/\epsilon)}$.
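The counting argument above can be evaluated numerically. The sketch below computes the smallest number of entrance wires $T_k$ with $(\epsilon/D)^{T_k} \le p_{tar}$ and the resulting per-output-bit bound. The function names are hypothetical, and any parameter values supplied by a caller are illustrative rather than taken from the paper.

```python
import math


def min_wires_per_input(p_tar, eps, D):
    """Smallest integer T_k with (eps / D)^T_k <= p_tar."""
    return math.ceil(math.log(1.0 / p_tar) / math.log(D / eps))


def ops_per_output_bit_lower_bound(L, K, p_tar, eps, D):
    """Value of L * log(1/p_tar) / (K * D * log(D/eps))."""
    return L * math.log(1.0 / p_tar) / (K * D * math.log(D / eps))
```

Since both logarithms use the same base, the bound does not depend on the choice of base. It grows only logarithmically as $p_{tar}$ shrinks, matching the $\log\frac{1}{p_{tar}}$ scaling discussed in Section VI.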
APPENDIX D CODES THAT SATISFY THE REQUIREMENT (A.3)
The existence of codes that satisfy the requirement (A.3) follows from a result in [70] . We first present the result from [70] .
Define β 0 , β 1 , β 2 , β 3 respectively as the largest integer less than d v /2, the largest integer less than or equal to d v /2, the smallest integer greater than or equal to d v /2, and the smallest integer greater than d v /2. Create four real parameters γ 12 , δ 12 , π 0 and ω 0 that satisfy the following inequalities
where d is the largest odd number which is less than or equal to d c , and
Define the following polynomials
where d is the largest even number less than or equal to d c . Then we define
where h(·) is the entropy function defined as
and
The base of all the logarithms is e. Then, [71, Th. 1] and the last paragraph on [71, p. 521] implies the following result: 
where the maximization is over all values of $\gamma_{12}$, $\delta_{12}$, $\pi_0$, $\omega_0$ that satisfy (88)-(92). Then, for any $\bar\alpha_0 < \alpha_0$, if $N$ is sufficiently large, almost all codes in this ensemble can correct at least $\theta\bar\alpha_0 N$ errors out of any arbitrary $\bar\alpha_0 N$ errors using one iteration of the PBF algorithm.
Proof: Here we briefly summarize the proof in [70]. Denote by $\bar p_e(\bar\alpha_0 N)$ the fraction of (bad) codes in the $(d_v, d_c)$-regular ensemble that cannot correct a linear fraction $\theta\bar\alpha_0 N$ of all combinations of $\bar\alpha_0 N$ errors or less using one iteration of the PBF algorithm. Then, according to [71, eq. (38)
Therefore, when $\bar\alpha_0$ is sufficiently small so that $f(\alpha) < 0$ for all $\alpha < \bar\alpha_0$, $\bar p_e(\bar\alpha_0 N) \to 0$ as $N \to \infty$, which means that almost all codes in the $(d_v, d_c)$-regular ensemble can correct a $\theta$ fraction of all possible combinations of $\bar\alpha_0 N$ errors using one iteration of the PBF algorithm. Theorem 1 in [71] was stated for $\theta = 0$, and the original constraint corresponding to the constraint (88) ($(1 - \theta)\alpha N \le \gamma_{12}N + \delta_{12}N$) was $\alpha N \le \gamma_{12}N + \delta_{12}N$. In this paper, we use the result for $\theta = \text{constant} > 0$. This result can be obtained by directly changing the original constraint $\alpha N \le \gamma_{12}N + \delta_{12}N$ in [70]
where the outer summation is over all integer values of $\alpha N \le \bar\alpha_0 N$, and the inner summation is over all integer values of $\gamma_{12}N$, $\delta_{12}N$, $\pi_0(1 - R)N$, $\omega_0 d_v N$ that satisfy (88) to (92). We will use this refined bound to obtain a finite-length result in the following example.
Example 1: One example of the parameter choice is $d_v = 9$, $d_c = 18$ and $\theta = 0.15$. In this case, we computed the first positive root of $f(\alpha) = 0$ using MATLAB and obtained $\alpha = 5.1 \cdot 10^{-4}$. This means that using one iteration of the PBF algorithm, we can correct a fraction $\theta = 0.15$ of $5.1 \cdot 10^{-4} \cdot N$ worst-case errors using a (9, 18) regular LDPC code when $N$ is sufficiently large. We can also use this result to obtain finite-length bounds (computing an upper bound on the fraction of bad codes using (107)). We obtained that at least 4.86% of (9, 18) regular LDPC codes of length $N = 50{,}000$ in the random LDPC ensemble can reduce the number of errors by 15% using one iteration of the PBF algorithm, when the number of errors is smaller than or equal to 20, which corresponds to the case when $\alpha_0 = 0.0004$.
The existence of codes that satisfy requirement (A.3) can also be established using expander LDPC codes. Here, we review some results on expander LDPC codes [20].
Definition 7: (Expander Graph) An (N, P, d v , γ, α) bipartite expander is a d v -left-regular bipartite graph G(V L ∪V R , E) where |V L | = N and |V R | = P. In this bipartite graph, it holds that ∀S ⊂ V L with |S| ≤ γ N, N (S) ≥ αd v |S|, where N (S) denotes the neighborhood of the set S, i.e., the set of nodes in V R connected to S. An (N, P, d v , γ, α) expander LDPC code is a length-N LDPC code, where the Tanner graph of the code is the corresponding expander graph with V L corresponding to the set of variable nodes and V R the parity check nodes. We use d c = d v N/P to denote the right-degree of the expander code.
Lemma 8 [21, Th. 11]: Using an $(N, P, d_v, \gamma, \frac{3}{4} + \epsilon_e)$ regular expander LDPC code with parity check node degree $d_c = d_v N/P$, one can use one iteration of the noiseless PBF algorithm to bring the fraction of errors down from $\alpha$ to $(1 - 4\epsilon_e)\alpha$, provided that the original corrupted codeword has at most a $\gamma(1 + 4\epsilon_e)/2$ fraction of errors.
Example 2: The construction of a good Expander code has been investigated for a long time. Constructive approaches for Expander codes can be found in [103] and [104] . In [104] [105] [106] , it is shown that random regular LDPC codes are expanders with high probability when the code length N → ∞. In [106, Th. 8.7] it is shown that, suppose γ max is the positive solution of the equation d v (which means that all sets of left nodes with cardinality smaller than γ N have an expansion factor at least 3 4 + e ). For d v = 16, d c = 32, and e = 0.0375 (which is equivalent to 4 e = 0.15, the same as θ = 0.15 in Example 1), we use MATLAB to numerically solve the above equation and obtained γ max ≈ 4.1 * 10 −5 , which means the fraction of errors α can be as large as γ max (1 + 4 e )/2 = 2.3575 · 10 −5 .
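The numbers in Example 2 follow directly from Lemma 8 and can be reproduced with a few lines. This is only an arithmetic sketch: the function names are hypothetical, and the values of $\gamma_{max}$ and $\epsilon_e$ are the ones quoted in the example.

```python
def max_correctable_fraction(gamma, eps_e):
    """Largest error fraction covered by Lemma 8: gamma * (1 + 4 * eps_e) / 2."""
    return gamma * (1.0 + 4.0 * eps_e) / 2.0


def fraction_after_pbf(alpha, eps_e):
    """Error fraction after one noiseless PBF iteration: (1 - 4 * eps_e) * alpha."""
    return (1.0 - 4.0 * eps_e) * alpha


gamma_max = 4.1e-5   # value quoted in Example 2
eps_e = 0.0375       # so that 4 * eps_e = 0.15, matching theta in Example 1

alpha_max = max_correctable_fraction(gamma_max, eps_e)
alpha_next = fraction_after_pbf(alpha_max, eps_e)
```

Here `alpha_max` reproduces the $2.3575 \cdot 10^{-5}$ figure of the example, and `alpha_next` is 15% smaller, which is the same per-iteration reduction as $\theta = 0.15$ in Example 1.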
APPENDIX E PROOF OF THEOREM 5
We tune the energy supply such that max( p and , p xor , p maj )
is satisfied for the first L − L vs stages (first phase), which ensures that
is satisfied. We tune the energy supply such that
is satisfied for the last L vs stages (second phase), which ensures that
is satisfied (we have mentioned this in (64)). Since this version of ENCODED-V technique with dynamic voltage scaling has the same procedure and constant supply energy during the first L − L vs stages (first phase) as the ENCODED-F technique, from Theorem 3, we know that after the first (L − L vs ) stages, the output error fraction is smaller than α 0 with probability at least 1 − P blk e , where P blk e < 3(L − L vs ) exp (−λ * N) and λ * is defined in (33) .
We will prove that, after the i -th stage of the remaining L vs stages, the error fraction is upper bounded by α (i)
with high probability. Thus, after L vs iterations, we obtain α (L vs )
where the last step can be verified by plugging in (63) . The case for i = 0 is already true as argued above. Suppose (113) holds for some i ≥ 0, then, we prove (113) also holds for the (i + 1)-th stage of the second phase. Note that from (112), the probability that the number of new errors introduced during the PBF decoding at the (i + 1)-th stage, which is [d c (1 − R) + 1]α 
where step (a) follows from (112), step (c) follows from the large deviation bound in Lemma 3 and λ (i+1) is defined in (67) . Therefore, with probability at least 1 − 3 exp − λ (i+1) N ,
where step (a) can be obtained by combining (116) and (113). Now that we have proved (113) for the $(i + 1)$-th stage, we can carry out the mathematical induction for all $i$ satisfying $1 \le i \le L_{vs}$. If (113) holds for all $i$, the final error fraction is smaller than $p_{tar}$. Thus, the overall probability that the final error fraction is greater than $p_{tar}$ is upper bounded by the sum of $3(L - L_{vs})\exp(-\lambda^* N)$ for the first $L - L_{vs}$ stages and the RHS of (115) for the last $L_{vs}$ stages, which is
Thus, (66) is proved. Finally, we compute the overall energy consumption. The energy consumed in the i -th stage can be written as 
By summing over all stages both in the first phase and the second phase and normalizing by the number of outputs K , the total energy consumption per output bit can be written as in (65) . 
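To give a feel for how many second-phase stages are involved, the sketch below computes the smallest $L_{vs}$ for which $\alpha_0(1 - \theta/2)^{L_{vs}} \le \frac{1}{2}p_{tar}$, the bound that Appendix F states for the ENCODED-V output error fraction. This is an assumption-laden illustration: the definition of $L_{vs}$ in (63) is not restated here and may differ in constants or rounding, and the function name is hypothetical.

```python
import math


def second_phase_stages(alpha0, theta, p_tar):
    """Smallest L_vs with alpha0 * (1 - theta/2)^L_vs <= p_tar / 2 (assumed form)."""
    if alpha0 <= p_tar / 2.0:
        return 0  # the first phase alone already meets the target
    shrink = -math.log(1.0 - theta / 2.0)  # per-stage decay exponent
    return math.ceil(math.log(2.0 * alpha0 / p_tar) / shrink)
```

Under this reading, the number of second-phase stages grows only logarithmically in $1/p_{tar}$, so the supply energy has to be raised over a comparatively short tail of the computation.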
APPENDIX F PROOF
in the i -th stage of the last L vs stages (defined in (63)).
By plugging in (119), (120) and L vs = into the error probability expression (66), we know that the ENCODED-V technique has output error fraction smaller than α 0 (1 − θ/2) L vs ≤ 1 2 p tar with probability at least 1 − P blk e , where P blk e satisfies P blk e < 3(L − L vs ) exp −λ * N + 3L vs exp(− λ (L vs +1) N),
where λ (i+1) = D(2λ (i+1) λ (i+1) ) = (2 log 2 − 1) λ (i+1) + O((λ (i+1) )
