A majority logic decoder made of unreliable logic gates, whose failures are transient and data-dependent, is analyzed. Based on a combinatorial representation of fault configurations, a closed-form expression for the average bit error rate of a one-step majority logic decoder is derived for a regular low-density parity-check (LDPC) code ensemble and the proposed failure model. The presented analysis framework is then used to establish bounds on the one-step majority logic decoder performance under the simplified probabilistic gate-output switching model. Based on the expander property of Tanner graphs of LDPC codes, it is proven that a version of the faulty parallel bit-flipping decoder can correct a fixed fraction of channel errors in the presence of data-dependent gate failures. The results are illustrated with numerical examples of finite geometry codes.
I. INTRODUCTION
Increased integration factor of integrated circuits, together with stringent energy-efficiency constraints, results in increased unreliability of today's semiconductor devices. As a result of supply voltage reduction and process variation effects, fully reliable operation of hardware components cannot be guaranteed [1].
Von Neumann first considered the problem of reliability of systems constructed from unreliable components [2]. His approach includes multiplexing of component logic gates and relies on high redundancy to achieve the desired system reliability. Dobrushin and Ortyukov [3] refined von Neumann's method and provided upper bounds on the redundancy required for reliable computation of a Boolean function implemented using faulty gates. Elias [4], on the other hand, applied more general coding techniques to the problem of reliable computing. He showed that, except for some particular cases such as the exclusive-OR function, there is no code that outperforms von Neumann's multiplexing method. Overviews of problems in fault-tolerant computation are given by Winograd and Cowan [5] and Pippenger [6].
Error control coding, as a method for adding redundancy to ensure fault tolerance of memory systems built from unreliable hardware, was introduced in the late sixties and early seventies by Taylor [7] and Kuznetsov [8]. In their memory system, an information sequence encoded by a low-density parity-check (LDPC) code is stored in unreliable memory cells, which are periodically updated using a "noisy" correcting circuit. They proved that, under the so-called von Neumann failure model, such a memory, even with a number of redundant gates linear in the memory size, is capable of achieving arbitrarily small error probability [7]. The equivalence between Taylor-Kuznetsov (TK) fault-tolerant memory architectures and a Gallager-B decoder built from unreliable logic gates was first observed by Vasić et al. [9], [10], and developed by Vasić and Chilappagari [11] into a theoretical framework for the analysis and design of faulty decoders of LDPC codes.
Performance of ensembles of LDPC codes under faulty iterative decoding was studied by Varshney [12], who showed that, if certain symmetry conditions are satisfied, the density evolution technique is applicable to faulty decoders, and used it to examine the performance of faulty Gallager-A and belief-propagation algorithms. Density evolution analysis of noisy Gallager-B decoders was presented in a series of complementary papers by Yazdi et al. [13], [14] and by Huang et al. [15]. In [13] the authors studied the performance of the binary Gallager-B decoder used to decode irregular LDPC codes and proposed an optimal resource allocation of noisy computational units, i.e., variable and check nodes of varying degrees, in order to achieve the minimal error rate. The faulty decoder of non-binary regular LDPC codes was analyzed in [14] in the presence of von Neumann errors. In [15] a more complicated failure model was considered, which includes transient errors and permanent memory errors. A similar analysis was done by Leduc-Primeau and Gross [16], where the faulty Gallager-B decoder, improved by a message repetition scheme, was studied. More general finite-alphabet decoders were investigated by Huang and Dolecek [17], while noisy min-sum decoder realizations were considered by Ngassa et al. [18] and by Balatsoukas-Stimming and Burg [19]. Dupraz et al. [20] improved the notion of a noisy threshold by introducing the so-called functional threshold, which accurately characterizes the convergence behavior of LDPC code ensembles under noisy finite-alphabet message-passing decoding.
Although complex soft-decision iterative decoders built from reliable components typically outperform low-complexity majority logic decoders, this is not necessarily true for faulty decoders. In addition, a simple probabilistic gradient descent bit-flipping decoder, recently proposed by Al Rasheed et al. [21], achieves a high level of fault tolerance. Recently, Vasić et al. [22] showed that the probabilistic behavior of the Gallager-B decoder due to unreliable components can lead to improved performance. This has resulted in increased interest in hard-decision decoders. In our previous work [23] we investigated the performance of the Gallager-B decoder under timing errors and showed that the density evolution technique is not applicable in that case.
In all the above references, a special type of failures, the so-called transient failures, is assumed. Transient failures manifest themselves at particular time instants but do not necessarily persist at later times. These failures have probabilistic behavior, and we assume that their statistics are known. The simplest such statistical model is the von Neumann failure model [2], which assumes that each component of a (clocked) Boolean network fails at every clock cycle with some known probability. Additionally, failures are neither temporally nor spatially correlated. In other words, failures of a given component are independent of those in previous clock cycles and independent of failures of other components.
However, the von Neumann failure model is only a rough approximation of the physical processes leading to logic gate failures. The actual probability of failure of a logic gate is highly dependent on the digital circuit manufacturing technology, and for high integration factors the failures are data-dependent and/or temporally correlated, as shown by Zaynoun et al. [24]. For example, timing errors caused by incorrect switching of a gate output are heavily dependent on the data values processed by the gate in previous computing cycles and cannot be represented accurately by the von Neumann model. Perez-Andrade et al. [25] conducted gate-level simulations that provided the bit error rate performance of noisy LDPC decoders under timing, but not data-dependent, errors. In the work by Leduc-Primeau et al. [26], [27], the complexity of the soft-decision min-sum decoder prevented modeling hardware unreliability at the bit level. Instead, failures are inserted based on conditional distributions associated with each combination of an update message and a transmitted bit. These distributions were precalculated for a given decoder topology.
One-step majority logic (OS-MAJ) decoding, introduced in the sixties by Rudolph [28], is an important class of algorithms in the context of faulty decoding. An OS-MAJ decoder can be seen as a Gallager-B/bit-flipping decoder [29] in which the decoding process is terminated after only one iteration, and bits are decoded by a majority vote on multiple parity-check decisions. In contrast to iterative decoders, the bit error rate performance of these decoders can be evaluated analytically for finite-length codes, as shown by Radhakrishnan et al. [30].
Guaranteed error correction of LDPC codes has so far been studied only for iterative decoders built from reliable components. Sipser and Spielman [31] showed that expander LDPC codes can be conveniently used to guarantee the correction of a fraction of errors, i.e., there exists some α, 0 < α < 1, for which the decoder can correct αN worst-case errors, where N is the code length. They proved that both serial and parallel bit-flipping algorithms can correct a fixed fraction of errors if the underlying Tanner graph is a good expander. In later work, Burshtein [32] generalized their results and proved that a linear number of errors can be corrected by the parallel bit-flipping algorithm for almost all column-weight-four codes. Expander graph arguments can also be used to provide guarantees for message-passing algorithms [33] and linear programming decoding [34]. Recently, Chilappagari et al. [35] provided another look at the guaranteed error correction of bit-flipping algorithms. They found the relation between the girth of the Tanner graph and the guaranteed error correction capability of an LDPC code.
In this paper we examine the effects of data-dependent gate failures on the performance of bit-flipping decoding. We propose a gate-state model that captures the data-dependent and correlated nature of gate failures. We derive a closed-form expression for the bit error rate (BER) at the output of the OS-MAJ decoder for an ensemble of regular LDPC codes free of four-cycles. Then, we derive bounds on the BER performance under a simplified data-dependent model, called the probabilistic gate-output switching model. Additionally, we investigate the error correction capabilities of noisy bit-flipping decoders and show that expander graph arguments can be used to establish lower bounds on the guaranteed error correction capability in the presence of data-dependent gate failures.
The rest of the paper is organized as follows. Section II discusses preliminaries on codes on graphs. Section III describes a novel approach to gate failure modeling. Section IV is dedicated to the theoretical analysis of the OS-MAJ decoder under the data-dependent failure model. The error correction capability of the noisy bit-flipping decoder is investigated in Section V. Numerical results are presented in Section VI. Finally, concluding remarks and future research directions are given in Section VII.
II. PRELIMINARIES
Let G = (U, E) be a graph with a set of nodes U and a set of edges E. An edge e is an unordered pair (v, c), which connects two neighboring nodes v and c. The cardinality of U, denoted |U|, is the order of the graph, while |E| is the size of the graph. The set of neighbors of a particular node u is denoted by N(u), while the set of corresponding edges is denoted by E(u). The number of neighbors of a node u, denoted d(u), is called the degree of u. The average degree of a graph G is $\bar{d} = 2|E|/|U|$.
The girth g of a graph G is the length of the shortest cycle in G. A bipartite graph G = (V ∪ C, E) is a graph constructed from two disjoint sets of nodes V and C, such that all neighbors of nodes in V belong to C and vice versa. The nodes in V are called variable nodes and the nodes in C are called check nodes. A bipartite graph is said to be γ-left-regular if all variable nodes have degree γ; similarly, a graph is ρ-right-regular if all check nodes have degree ρ.
Consider a (γ, ρ)-regular binary LDPC code of length N and its graphical representation given by a γ-left-regular and ρ-right-regular Tanner bipartite graph G, with $N\gamma/\rho$ check nodes and N variable nodes. Let $x = (x_1, x_2, \ldots, x_N)$ be a codeword of a binary LDPC code, which appears at the input of a binary symmetric channel (BSC). The output of the channel, $r = (r_1, r_2, \ldots, r_N)$, where $\Pr\{r_k \neq x_k\} = p$, is decoded by our majority logic decoder. The number of flipped bits equals the Hamming distance between the transmitted codeword x and the received word r, denoted $d_H(x, r)$. The decoder is divided into processing units that correspond to nodes in the Tanner graph representation of the code. Let $\overrightarrow{m}_i(e)$ and $\overleftarrow{m}_i(e)$ be the messages passed on an edge e from a variable node to a check node and from a check node to a variable node, respectively, during the i-th decoding iteration. Similarly, $\overrightarrow{m}_i(R)$ and $\overleftarrow{m}_i(R)$ denote the sets of all messages from/to a variable node over a set of edges $R \subseteq E$. We next summarize our majority logic decoder.
• At iteration i = 0, the variable-to-check messages are initialized using the values received from the channel, $\overrightarrow{m}_0(e) = r_v$, $\forall e \in E(v)$.
• During each iteration i, i ≥ 1, a variable node processing unit v performs majority voting on the binary messages received from its neighboring check nodes as follows
where s ∈ {0, 1} and ⌊x⌋ denotes the largest integer smaller than or equal to x. The output of the majority logic (MAJ) gate, described by the function Φ(·), is then passed to all neighboring check nodes, i.e., $\overrightarrow{m}_i(e) = \Phi(\overleftarrow{m}_{i-1}(E(v)))$, $\forall e \in E(v)$.
• During each iteration i, i ≥ 0, a check node processing unit c performs ρ exclusive-OR (XOR) operations defined as follows
The results of the XOR operations are passed to the neighboring variable nodes by mapping $\overleftarrow{m}_i$ …
If the decoding is terminated after the i-th iteration, the MAJ gate output $\Phi(\overleftarrow{m}_i(E(v)))$ represents the decoded bit $x_v$. Note that, when built from perfectly reliable logic gates, our decoder is functionally equivalent to the parallel bit-flipping decoder [31]. Hardware unreliability in the decoder comes from unreliable computation of the majority and XOR update functions, because the logic gates performing these functions are prone to data-dependent failures, as described in the following section.
Instead of calculating a parity-check equation, each check node c sends $\overleftarrow{m}_i(e)$, referred to as an estimate of the neighboring variable node v connected to c through the edge e. Hence, the messages leaving a check node are calculated by different (ρ − 1)-input XOR gates. This means that ρ XOR gates need to be implemented in each check node, and that the failure of one XOR gate affects the reliability of only one variable node. This fact is used to determine the error correction capability of the faulty bit-flipping decoder in Section V. A variable node, on the other hand, performs a majority vote on the γ received estimates, producing a bit decision, and contains a single γ-input MAJ gate.
When the decoding is terminated after only one iteration, and a bit $x_v$ is decoded by the MAJ gate output $\Phi(\overleftarrow{m}_0(E(v)))$, our decoder reduces to the well-known OS-MAJ decoder, which we recently analyzed in our previous works [36], [37]. Due to its simplicity, the first part of this paper focuses on the OS-MAJ decoder.
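The fault-free version of the decoder summarized above can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions: checks are given as lists of variable indices (here the seven lines of the Fano plane, a small finite-geometry example with γ = ρ = 3 and girth 6, chosen only for illustration), and when no value reaches the ⌊γ/2⌋ + 1 majority threshold (possible only for even γ) the variable keeps its current value. The names `FANO_LINES`, `estimates`, `majority`, and `decode` are hypothetical.

```python
from typing import List, Sequence

# Illustrative code: lines of the Fano plane PG(2,2) used as parity checks,
# giving a (3,3)-regular Tanner graph with girth 6 on N = 7 variables.
FANO_LINES: List[List[int]] = [
    [0, 1, 3], [1, 2, 4], [2, 3, 5], [3, 4, 6],
    [4, 5, 0], [5, 6, 1], [6, 0, 2],
]


def estimates(checks: Sequence[Sequence[int]], word: Sequence[int], v: int) -> List[int]:
    """One estimate of bit v per check containing v: XOR of the other rho-1 bits."""
    out = []
    for row in checks:
        if v in row:
            e = 0
            for u in row:
                if u != v:
                    e ^= word[u]
            out.append(e)
    return out


def majority(ests: Sequence[int], tie_value: int) -> int:
    """Return s if at least floor(gamma/2)+1 estimates equal s, else tie_value."""
    threshold = len(ests) // 2 + 1
    ones = sum(ests)
    if ones >= threshold:
        return 1
    if len(ests) - ones >= threshold:
        return 0
    return tie_value  # assumed tie rule; reachable only for even gamma


def decode(checks: Sequence[Sequence[int]], r: Sequence[int], iterations: int = 1) -> List[int]:
    """Reliable-gate decoder; iterations=1 is OS-MAJ decoding of r."""
    word = list(r)
    for _ in range(iterations):
        # Parallel update: every bit uses estimates formed from the previous word.
        word = [majority(estimates(checks, word, v), word[v]) for v in range(len(word))]
    return word


print(decode(FANO_LINES, [1, 0, 0, 0, 0, 0, 0]))  # -> [0, 0, 0, 0, 0, 0, 0]
```

Since the estimate sent by check c differs from the current bit exactly when c is unsatisfied, the majority rule flips a bit when more than half of its checks are unsatisfied, which is the parallel bit-flipping update; the single channel error above is removed in one iteration.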
III. DATA-DEPENDENT GATE ERROR MODEL

A. General Modeling Approach
… $y_m$) a gate input vector, i.e., a vector of arguments. Denote by $\{y^{(k)}\}_{k\ge0}$ a time sequence of input vectors, and by $\{\xi^{(k)}\}_{k\ge0}$ the corresponding error sequence. In this manuscript we use the terms "failure" and "error" interchangeably, meaning that failures are "additive" errors. In the classical von Neumann transient failure model, the error values $\{\xi^{(k)}\}_{k\ge0}$ are independent of the input sequence $\{y^{(k)}\}_{k\ge0}$.
In order to capture the data and time dependence of gate failures more accurately, we propose the following gate-state model. Namely, we assume that $\xi^{(k)}$ is affected by the current and M − 1 prior consecutive gate input vectors, i.e., its probability depends on the input vector sequence in the time interval $[k-M+1, k]$.
The number of states grows exponentially with M and ρ.
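As a worked count, assume (our reading of the model above) that the state of an m-input gate at time k is the sequence of its M most recent binary input vectors $(y^{(k-M+1)}, \ldots, y^{(k)})$, and recall that a check-node XOR gate has m = ρ − 1 inputs. The count below only illustrates the exponential growth; it is not an equation taken from the original text.

```latex
% States of one m-input gate with memory M (M input vectors of m bits each)
|\mathcal{S}_{\mathrm{gate}}| = \left(2^{m}\right)^{M} = 2^{mM},
\qquad m = \rho - 1 \;\Rightarrow\; |\mathcal{S}_{\mathrm{gate}}| = 2^{M(\rho-1)} .
% gamma such gates feeding one MAJ gate: at most (2^{M(\rho-1)})^{\gamma} = 2^{\gamma M(\rho-1)} joint states.
% Example: rho = 4, M = 2 gives 2^{6} = 64 states per XOR gate.
```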
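The inputs of a MAJ gate are the outputs of γ XOR gates in the neighboring check nodes. Thus, at time k these gates can be associated with a state array $\sigma^{(k)} = (\sigma_1^{(k)}, \sigma_2^{(k)}, \ldots, \sigma_\gamma^{(k)})$, whose elements represent the states of the particular XOR gates. Let $S_\sigma$ be the set of all possible state arrays. Then, a gate-error probability vector can be associated with each state array and formed as $\varepsilon^{(k)} = (\varepsilon_1^{(k)}, \varepsilon_2^{(k)}, \ldots, \varepsilon_\gamma^{(k)})$, where $\varepsilon_j^{(k)}$ is the failure probability of the j-th XOR gate in state $\sigma_j^{(k)}$.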
The set of all gate-error probability vectors, denoted by $S_\varepsilon$, can be obtained by measurements or by simulation of the selected semiconductor technology. Thus, in our analysis we assume that every $\varepsilon^{(k)} \in S_\varepsilon$ is known.
B. Probabilistic Gate-Output Switching Model
Due to supply voltage reduction, resulting for example from aggressive voltage scaling [38], switching of a gate output is prolonged, and the signal is sampled or used in the next stage before it reaches a steady value. This causes timing-related errors that greatly influence the reliability of combinational circuits, as documented in a number of prominent articles in the area of circuit design [24], [39]-[43]. Recently, Amaricai et al. [41] investigated the probabilistic nature of gate switching for sub-powered CMOS circuits. They proposed several fault injection models for CMOS circuits in which errors are added only when the gate output changes. Translated to our model, this means that it is sufficient to consider the case M = 2.
In this subsection we define the probabilistic gate-output switching (GOS) model, in which a logic gate switches incorrectly with a probability that depends on the supply voltage, the temperature, and the gate delay under consideration. Compared with more complex models that account for the fact that different input patterns cause failures with different probabilities, this model was shown to have reduced complexity with only minor degradation of accuracy [41].
In the GOS error model, the probability that a gate fails to switch at time k is

$$\Pr\{\xi^{(k)} = 1 \mid z^{(k)} \neq z^{(k-1)}\} = \bar{\varepsilon},$$
where $z^{(k)}$ denotes the ideal gate output at time k and the bar indicates a scalar value. In the subsequent analysis, the failure probability of an XOR gate is denoted by $\bar{\varepsilon}_{XOR}$, while the failure probability of MAJ gates is $\bar{\varepsilon}_{MAJ}$. On the other hand, when the ideal gate output is unchanged during two consecutive time instants, the function f is always computed correctly, as assumed in [40] and [41], i.e., $\Pr\{\xi^{(k)} = 1 \mid z^{(k)} = z^{(k-1)}\} = 0$. Note that in the presented model, switching is defined as a transition from the ideal output at the previous time instant, $z^{(k-1)}$, not from the actual output $\hat{z}^{(k-1)}$. A possible failure at time instant k − 1 therefore does not affect the failure at the k-th instant, since the signal from the (k − 1)-th instant reaches a steady value during the k-th time interval.
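The GOS rule can be illustrated by a small simulation of a single gate's output sequence. This is a sketch under stated assumptions: the ideal outputs $z^{(k)}$ are given as a list, a failure is an additive error $\xi^{(k)} = 1$ that can occur only when $z^{(k)} \neq z^{(k-1)}$, and time 0, having no predecessor, is assumed to be failure-free. The function name `gos_outputs` is hypothetical.

```python
import random
from typing import List, Sequence


def gos_outputs(ideal: Sequence[int], eps: float, rng: random.Random) -> List[int]:
    """Actual outputs of one gate under a GOS-style failure rule (sketch).

    At time k > 0 the gate may fail only if its ideal output switches
    (ideal[k] != ideal[k-1]); it then fails with probability eps, i.e. the
    output keeps its old value. Switching is measured against the previous
    *ideal* output, so an earlier failure does not influence later ones.
    """
    actual = [ideal[0]]  # assumption: no failure at time 0
    for k in range(1, len(ideal)):
        switched = ideal[k] != ideal[k - 1]
        xi = 1 if (switched and rng.random() < eps) else 0  # additive error
        actual.append(ideal[k] ^ xi)
    return actual


print(gos_outputs([0, 1, 1, 0, 0, 1], eps=1.0, rng=random.Random(1)))
# -> [0, 0, 1, 1, 0, 0]
```

With $\bar\varepsilon = 1$ every switching instant fails, so the actual output lags the ideal one by one step while constant stretches are unaffected; with $\bar\varepsilon = 0$ the ideal sequence is returned unchanged.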
Clearly, since in the GOS model each XOR gate has two states, there are $2^\gamma$ possible realizations of the state array $\sigma^{(k)}$. However, permuting the zero and $\bar{\varepsilon}_{XOR}$ entries of a gate-error probability vector does not change the overall error probability at the output of the decoder. Thus, it suffices to consider the gate-error probability vectors whose first t elements are equal to $\bar{\varepsilon}_{XOR}$ and whose last γ − t elements are equal to zero, i.e., $\varepsilon(t) = (\bar{\varepsilon}_{XOR}, \ldots, \bar{\varepsilon}_{XOR}, 0, \ldots, 0)$, with t nonzero entries.
The GOS model does not capture all effects that may lead to timing-related errors, since changes in multiple inputs can cause a gate failure even if the ideal output remains unchanged [24]. However, in the most recent literature dedicated to CMOS circuits operating with a supply voltage near or below the threshold voltage [40], [41], these effects were neglected. The general framework presented in the previous subsection remains applicable to other, more complicated scenarios.
IV. ANALYSIS OF THE OS-MAJ DECODER
In this section we present an analytical method for performance evaluation of an ensemble of regular LDPC codes with girth at least six decoded by the faulty OS-MAJ decoder described in the previous sections. In a Tanner graph with girth at least six, the variable nodes (other than v) connected to the γ neighboring checks of a variable node v are all distinct. First, we consider a particular code bit $x_v$ and calculate the probability that it is miscorrected for a given XOR gate-error probability vector $\varepsilon \in S_\varepsilon$. The first part of our analysis assumes that the unreliability of XOR gates follows the general state model described in Section III-A, while MAJ gates operate reliably; the result is given in the following lemma.
Let $q_l$ be a vector corresponding to one lexicographically ordered u-subset of the set $[l] = \{1, 2, \ldots, l\}$, and let a vector $q_r$ contain the remaining elements of [l], arbitrarily ordered. We create a vector q by concatenating $q_l$ and $q_r$. All possible vectors q can be arranged into the rows of a $\binom{l}{u} \times l$ array $Q_{u,l}$. For example, if l = 4 and u = 2, the rows of $Q_{2,4}$ are (1, 2, 3, 4), (1, 3, 2, 4), (1, 4, 2, 3), (2, 3, 1, 4), (2, 4, 1, 3), and (3, 4, 1, 2). The array $Q_{u,l}$ is instrumental in the book-keeping of data-dependent error probabilities, as described below.
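The array $Q_{u,l}$, and the kind of book-keeping it supports in Lemma 1 below, can be sketched in Python. The sketch is our own illustration under explicit assumptions, not a transcription of Eq. (1): the remaining elements $q_r$ are listed in increasing order; girth at least six makes the γ estimates of $x_v$ independent; each estimate is the XOR of ρ − 1 bits received over a BSC with crossover probability p; the j-th XOR gate adds an error with probability $\varepsilon_j$ independently of the channel; and the reliable MAJ output is wrong when at least ⌊γ/2⌋ + 1 estimates are wrong. All function names are hypothetical.

```python
from itertools import combinations
from math import comb, prod
from typing import List, Sequence


def q_array(u: int, l: int) -> List[List[int]]:
    """Rows of Q_{u,l}: a lexicographically ordered u-subset of {1..l},
    followed by the remaining elements (increasing order, one admissible choice)."""
    rows = []
    for subset in combinations(range(1, l + 1), u):  # lexicographic for sorted input
        rest = [x for x in range(1, l + 1) if x not in subset]
        rows.append(list(subset) + rest)
    return rows


def miscorrection_prob(p: float, rho: int, eps: Sequence[float]) -> float:
    """Illustrative P(x_v decoded wrongly), reliable MAJ gate, gamma = len(eps).

    Estimate j is wrong iff exactly one of {odd number of channel errors among
    its rho-1 inputs, failure of XOR gate j} occurs. Ties (even gamma) are
    counted as correct decisions: an assumption of this sketch.
    """
    gamma = len(eps)
    pe = (1 - (1 - 2 * p) ** (rho - 1)) / 2  # channel-only error of one estimate
    w = [pe * (1 - e) + (1 - pe) * e for e in eps]  # per-estimate error probability
    total = 0.0
    for i in range(gamma // 2 + 1, gamma + 1):
        for row in q_array(i, gamma):  # first i entries index the wrong estimates
            total += prod(w[j - 1] for j in row[:i]) * prod(1 - w[j - 1] for j in row[i:])
    return total


def miscorrection_prob_equal(p: float, rho: int, gamma: int, eps_xor: float) -> float:
    """Equal XOR failure probabilities: every configuration of i wrong
    estimates is equally likely, so the sum collapses to a binomial tail."""
    pe = (1 - (1 - 2 * p) ** (rho - 1)) / 2
    w = pe * (1 - eps_xor) + (1 - pe) * eps_xor
    return sum(comb(gamma, i) * w**i * (1 - w) ** (gamma - i)
               for i in range(gamma // 2 + 1, gamma + 1))


print(q_array(2, 4))
print(miscorrection_prob(0.1, 2, [0.0, 0.0, 0.0]))  # approx. 3p^2(1-p) + p^3 = 0.028
```

Because only the multiset of $\varepsilon_j$ values enters the sum, permuting the entries of ε leaves the result unchanged, which mirrors the reduction to the vectors ε(t) under the GOS model.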
Lemma 1: The probability that a code bit x v of a (γ , ρ)-regular LDPC code is incorrectly decoded by the OS-MAJ decoder with unreliable XOR gates, which fail according to a gate-error probability vector ε, is given by
and $q_{t,m}$ denotes the element in the t-th row and the m-th column of the matrix $Q_{i,\gamma}$. Proof: See Appendix A. In the transient gate failure model introduced by von Neumann [2], the code bit error probability is independent of the state arrays, i.e., $\varepsilon^{(k)} = \varepsilon(\gamma)$, k > 0, 1 ≤ v ≤ N. Thus, for the special case of von Neumann errors, which we previously investigated in [37], all XOR gates have the same failure probability $\bar{\varepsilon}_{XOR}$. Moreover, any configuration of i incorrect estimates is equally likely, and Eq. (1) simplifies to
Further derivations under the general failure model are given in Appendix B. Here, we continue the analysis assuming the simplified GOS model, which describes the timing-related nature of gate failures. Let $\{x^{(k)}\}_{k\ge0}$ be a codeword sequence transmitted through the channel. In the GOS error model, a failure is not possible when an XOR gate output remains unchanged, i.e., when the gate input vectors at consecutive time instants k − 1 and k, k > 0, are the same or differ in an even number of positions. When x …
and no channel errors occur, all XOR gates used for decoding a bit $x_v^{(k)}$ operate reliably. However, the parity of the gate input vectors can change due to channel-induced errors, that is, when an odd number of gate inputs from the consecutive time instants are flipped. The probability of such an event is equal to
Thus, the output of each XOR gate used for decoding a bit $x_v$ will be flipped with probability $B\bar{\varepsilon}_{XOR}$. On the contrary, when … $x_v^{(k)}$ are received with no errors, all γ XOR gates used for producing the bit decision of $x_v^{(k)}$ change their outputs. Note also that, under the GOS model, the bit miscorrection probability depends only on the number of non-zero elements of ε. Thus, without loss of generality, we can express the decoder performance in terms of ε(t), defined in Section III-B. This allows us to formulate the following theorem, which bounds the error probabilities of the OS-MAJ decoder built entirely from unreliable gates.
The lower bound corresponds to the case $x^{(k-1)} = x^{(k)}$, while the worst-case performance is obtained when $d_H(x^{(k-1)}, x^{(k)}) = N$. Proof: See Appendix C. The OS-MAJ decoder built entirely from reliable components satisfies the symmetry condition, i.e., its performance is independent of the codeword being decoded. We see that this symmetry condition does not hold for OS-MAJ decoding in the presence of errors caused by incorrect switching of a gate output. The previous theorem reveals a fundamental property of OS-MAJ decoding under data-dependent hardware failures: the BER depends on the order in which the codewords are decoded. For example, consecutive decoding of three identical codewords results in the lowest error rate, while the decoder performs worst if two complementary codewords are decoded consecutively. Note also that for the best codeword order the average BER does not depend on the reliability of the MAJ gates, while for the worst codeword order it is lower bounded by $\bar{\varepsilon}_{MAJ}$.
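The parity-change probability that multiplies $\bar{\varepsilon}_{XOR}$ above can be checked numerically. This sketch is our reconstruction, not the paper's numbered equation: it assumes the transmitted bits feeding an XOR gate leave its ideal output unchanged between k − 1 and k (the situation described before the theorem), and that the 2(ρ − 1) received input bits at the two instants are flipped independently by a BSC with crossover probability p, so the gate switches exactly when an odd number of them are flipped. The function names are hypothetical.

```python
from itertools import product


def parity_change_prob(p: float, rho: int) -> float:
    """Closed form: P(odd number of flips among 2*(rho-1) independent BSC(p) bits)."""
    n = 2 * (rho - 1)
    return (1 - (1 - 2 * p) ** n) / 2


def parity_change_prob_brute(p: float, rho: int) -> float:
    """Same quantity by enumerating all flip patterns (small rho only)."""
    n = 2 * (rho - 1)
    total = 0.0
    for flips in product((0, 1), repeat=n):
        k = sum(flips)
        if k % 2 == 1:
            total += p**k * (1 - p) ** (n - k)
    return total


print(parity_change_prob(0.01, 4))  # B for rho = 4, p = 0.01
```

Multiplying this quantity by $\bar{\varepsilon}_{XOR}$ gives the per-gate flip probability used in the text; the brute-force version serves only to confirm the closed form on small cases.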
The reliability of the decoder can be increased by making the MAJ gates operate reliably, or by optimizing the decoding order, which effectively decreases the switching activity of the gates comprising the decoder. The latter method is infeasible, as it requires rearranging the codewords at the transmitter. Finding an optimal rearrangement involves minimizing the sum of Hamming distances between pairs of consecutive codewords, which reduces to the traveling salesman problem. However, the notion of an "optimal" decoding order provides insight into performance bounds.
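To make the reordering objective concrete, the sketch below computes the total switching $\sum_k d_H(x^{(k-1)}, x^{(k)})$ of a decoding order and reduces it with the greedy nearest-neighbor heuristic for the traveling salesman problem. The heuristic is our own swapped-in illustration: it is not optimal in general, it is not proposed in the text, and, as noted above, reordering at the transmitter is infeasible in practice. The example words and function names are hypothetical.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Hamming distance between two equal-length binary words."""
    return sum(x != y for x, y in zip(a, b))


def total_switching(words: Sequence[Sequence[int]]) -> int:
    """Sum of Hamming distances between consecutively decoded words."""
    return sum(hamming(words[k - 1], words[k]) for k in range(1, len(words)))


def nearest_neighbor_order(words: Sequence[Sequence[int]]) -> List[int]:
    """Greedy TSP-path heuristic (not optimal): start at word 0 and always
    decode next the closest remaining word, breaking ties by index."""
    remaining = set(range(1, len(words)))
    order = [0]
    while remaining:
        last = words[order[-1]]
        nxt = min(remaining, key=lambda j: (hamming(last, words[j]), j))
        order.append(nxt)
        remaining.remove(nxt)
    return order


batch = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
order = nearest_neighbor_order(batch)
print(order, total_switching(batch), total_switching([batch[j] for j in order]))
# -> [0, 2, 1, 3] 10 6
```

Here the greedy order lowers the total Hamming distance between consecutive words from 10 to 6, i.e., fewer bit changes between consecutive words and hence, under the GOS model, fewer opportunities for switching failures.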
Assuming an optimal decoding order, we now find an upper bound on the number of switchings that guarantees that the error rate does not exceed the error rate of the decoder made of reliable MAJ gates.
Let $\alpha_s$ be the fraction of bits in a transmitted codeword
. These bits are decoded incorrectly with probability bounded by P
On the other hand, the probability of miscorrecting a bit from the remaining
then the error rate of the decoder built only from unreliable components is less than $P_v(\gamma)$. In other words, the BER can be made even lower than the BER of a decoder made of reliable MAJ gates. For a higher fraction of switchings, no codeword rearrangement can compensate for failures of MAJ gates, and the BER can be lowered only by making the gates more reliable. In Section VI we numerically evaluate the error rates and the bounds for the decoding schedule for codes based on finite geometries.
V. GUARANTEED ERROR CORRECTION
In this section we prove that the correcting capability of the iterative majority logic decoder, built partially from unreliable gates, increases linearly with the code length when the Tanner graph of the code satisfies the expansion property, defined as follows.
Definition 1 [31] : A Tanner graph G of a (γ , ρ)-regular LDPC code is a (γ, ρ, α, δ) expander if for every subset S of at most α N variable nodes, at least δ|S| check nodes are incident to S.
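For toy examples, Definition 1 can be checked by exhaustive search. The Python sketch below (hypothetical helper `min_expansion`) returns the smallest ratio |N(S)|/|S| over all nonempty sets S of at most `max_size` variable nodes; the graph is a (γ, ρ, α, δ) expander with αN = `max_size` exactly when this ratio is at least δ. The search is exponential in `max_size`, so it only illustrates the definition, and the Fano-plane parity checks are again an assumed example.

```python
from fractions import Fraction
from itertools import combinations
from typing import List, Sequence

# Illustrative (3,3)-regular code: lines of the Fano plane as parity checks.
FANO_LINES: List[List[int]] = [
    [0, 1, 3], [1, 2, 4], [2, 3, 5], [3, 4, 6],
    [4, 5, 0], [5, 6, 1], [6, 0, 2],
]


def min_expansion(checks: Sequence[Sequence[int]], n: int, max_size: int) -> Fraction:
    """Smallest |N(S)|/|S| over nonempty variable subsets S with |S| <= max_size."""
    # nbrs[v] = indices of the check nodes incident to variable v
    nbrs = [{c for c, row in enumerate(checks) if v in row} for v in range(n)]
    best = None
    for size in range(1, max_size + 1):
        for subset in combinations(range(n), size):
            covered = set().union(*(nbrs[v] for v in subset))
            ratio = Fraction(len(covered), size)
            if best is None or ratio < best:
                best = ratio
    return best


print(min_expansion(FANO_LINES, 7, 2))  # -> 5/2
```

In the Fano plane any two variables share exactly one check, so a pair covers 3 + 3 − 1 = 5 checks and the minimum ratio for |S| ≤ 2 is 5/2; three non-collinear points cover only 6 checks, lowering the minimum to 2 for |S| ≤ 3.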
We assume that the following two conditions are satisfied: (i) the MAJ gates used in the decoder are reliable, and XOR failures follow the error mechanism introduced in Section III-B, and (ii) no more than $C_{XOR}$ XOR gates are erroneous in the first iteration. The need for these assumptions is discussed later. We now formulate the theorem that gives the error correction capability of the noisy majority logic decoder. … Proof: Let $V_i$ be the set of corrupt variables at the beginning of the i-th decoding iteration. The set of corrupt variables at the beginning of the (i + 1)-th iteration (i.e., at the end of the i-th iteration), $V_{i+1}$, can be divided into two disjoint subsets: (i) $V_{i+1} \cap V_i$, the subset of corrupt variables that remained corrupt at the end of the i-th iteration, and (ii) $V_{i+1} \setminus V_i$, the subset of newly corrupted variables, i.e., variables that were correct in the (i − 1)-th iteration but became corrupt during the i-th iteration. Let $S_i$ be the set of variables that were corrected during the (i − 1)-th iteration and stayed correct at the end of the i-th iteration. Since the variables in $S_i$ are flipped in the (i − 1)-th iteration, it follows from the definition of the GOS error model that any variable in $S_i$ may cause a failure of the neighboring XOR gates in the i-th iteration and, consequently, incorrect estimates of the variables with which it shares neighbors. On the other hand, no XOR gate output failure occurs in check nodes connected only to variables that were not flipped in the (i − 1)-th iteration.
Each incorrect estimate of a particular variable in $V_{i+1} \setminus V_i$ is due to the variable's connection (through shared neighbors) to variables from the set $V_i \cup S_i$. This follows from the fact that a check node that sends an incorrect estimate to a node in $V_{i+1} \setminus V_i$ must also be connected to at least one other node that causes that incorrect estimate. Consequently, no variable node outside $V_i \cup S_i$ can cause an incorrect estimate of a node in the set of newly corrupt variables $V_{i+1} \setminus V_i$. Thus, each incorrect estimate of a variable in $V_{i+1} \setminus V_i$ indicates that a check is shared by two variables in $V_i \cup S_i \cup V_{i+1}$. On the other hand, there are no restrictions on the possible neighbors of a check producing only correct estimates: they can be variables in $V_{i+1} \setminus V_i$ or variables outside the set $V_i \cup S_i \cup V_{i+1}$. The number of correct estimates of each newly corrupt variable in $V_{i+1} \setminus V_i$ cannot be greater than γ/2, which means that the correct estimates are produced by at most γ/2 different neighboring check nodes. Then, for some δ, 0 < δ ≤ 1, we have
Variables corrected during the i-th iteration (the set $V_i \setminus V_{i+1}$), as well as variables from $S_i$, may have all of their neighboring check nodes distinct. Since a variable from $V_i \cap V_{i+1}$ shares at least half of its neighbors with other variables from $V_i \cup S_i$, it contributes at most 3γ/4 additional check nodes to $\delta\gamma|V_i \cup S_i|$, and we have
On the other hand, assuming that the subgraph induced by $V_i \cup V_{i+1} \cup S_i$ satisfies the expansion property of Definition 1, i.e.,
for all i > 0, then, since we consider $(\gamma, \rho, \alpha, (7/8 + \epsilon)\gamma)$ expanders, it follows that
Combining the previous expression with Eqs. (4) and (5), we obtain
Because all elements of $S_i$ were corrupt before the (i − 1)-th iteration, we know that $|S_i| \le |V_{i-1}|$, which, based on the previous inequality, implies
Let $|V_2| \le \beta|V_1|$, β > 0. Then, $|V_i|$ can be bounded as stated in the following lemma. Lemma 2: The number of corrupt variables at the beginning of the i-th decoding iteration, i > 1, $|V_i|$, is bounded by
Proof: See Appendix D. To complete this part of the proof of Theorem 2, we have to analyze the first decoding iteration and bound $|V_2|$. In the following lemma, we show that an upper bound on $|V_2|$ can be expressed in terms of $|V_1|$ and $C_{XOR}$, the number of XOR gate failures in the first iteration.
Lemma 3: The number of corrupt variables after the first decoding iteration, $|V_2|$, under the condition $|V_1| < (3 + 8\epsilon)\alpha N/4$, is bounded by
Proof: From the analysis presented in [31], we know that the decoder built from reliable components reduces the number of corrupt variables to at most $(1 - 4\delta)|V_1|$, for all $0 < \delta \le 1/4$. The first summand in Eq. (9) is obtained by noting that $\delta = 1/8 + \epsilon$. The second summand in Eq. (9) follows from the fact that each XOR gate failure can corrupt at most one additional variable.
From Eq. (9), $\beta = (1 - 8\epsilon)/2 + |C_{XOR}|/|V_1|$, and Eq. (8) can be rewritten as
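The arithmetic linking Lemma 3 to β can be spelled out as follows, assuming (as the proof of Lemma 3 indicates) that Eq. (9) is the sum of the reliable-decoder term and at most one additional corrupt variable per XOR failure:

```latex
|V_2| \le (1 - 4\delta)\,|V_1| + C_{XOR}, \qquad \delta = \tfrac{1}{8} + \epsilon
\;\Rightarrow\; 1 - 4\delta = \tfrac{1}{2} - 4\epsilon = \tfrac{1 - 8\epsilon}{2},
\qquad
\beta = \frac{1 - 8\epsilon}{2} + \frac{C_{XOR}}{|V_1|}.
% Example: epsilon = 1/16 and C_XOR = 0 give beta = (1 - 1/2)/2 = 1/4.
```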
The previous equation shows that, for all $\epsilon \in (0, 1/8]$, the number of corrupt variables decreases over the iterations, which after a sufficient number of iterations leads to the correction of all initially corrupt variables.
Note that in our derivation we also assumed that $|V_i \cup V_{i+1} \cup S_i| < \alpha N$ for all i > 0 (Eq. (6)). We next prove, by mathematical induction, that this assumption holds for all $V_1$ satisfying the condition given in the theorem.
Let us assume that
This means that Eq. (10) is satisfied for the first i − 1 iterations and that we can use it to bound $|V_{i-1}|$ and $|V_i|$. Assume, by way of contradiction, that
On the other hand, for some δ, $7/8 + \epsilon \le \delta \le 1$, the number of checks connected to $D \cup S_i \cup V_i$ is bounded by
Combining the previous relation with the lower bound given by the expansion, we obtain
On the other hand, since
based on Eq. (10) we finally obtain
and
The function $g_1(\epsilon)$ is monotonically increasing on the interval (0, 1/8], and its minimal value on this interval satisfies $\min_{0<\epsilon\le 1/8} g_1(\epsilon) > 3/8$. Similarly, the maximal value of the function $g_2(\epsilon)$ on the same interval is $\max_{0<\epsilon\le 1/8} g_2(\epsilon) = \sqrt{2}$.
Since $1/\sqrt{1 - 8\epsilon} > 1$, we can conclude that inequality (12) contradicts our initial assumption about $|V_1|$ given in the theorem formulation, and hence $|S_i \cup V_i \cup V_{i+1}| < \alpha N$ for all i > 2. When i = 2, Eq. (11) reduces to
which also contradicts our initial assumption. Finally, the condition $|V_1 \cup V_2| < \alpha N$ follows from Eq. (9) and the initial condition on $|V_1|$. This proves the theorem.
In the previous analysis we assumed that the XOR gates are unreliable, but the MAJ gates are not. If the MAJ gates are also prone to data-dependent failures, error correction cannot be guaranteed, because in the worst-case scenario the correction of every variable can be undone by a failure of its MAJ gate.
The necessity for a portion of the decoder's circuit to be reliable is well established in the literature (see, for example, [6], [7], [12], [36], [44]). Our architecture with reliable MAJ gates is consistent with the "golden gate assumption." This assumption requires using perfect gates for the final step of information extraction when multiple copies of a transmitted bit exist in a decoder, which is the case for message-passing decoders. Thus, the circuit used for this step is assumed to operate reliably.
Note that the decoder's correcting capability depends not only on the expansion property of its Tanner graph, but also on the number of XOR gate failures in the first iteration, $C_{XOR}$. If too many XOR gates fail during the first iteration, the decoding process will not converge to the correct codeword. Recall from the GOS error model that $C_{XOR}$ depends on the XOR gate failures at the time instant prior to the first decoding iteration. We have no control over the number of XOR gate failures before decoding has started, but there is a practical way to overcome this and force $C_{XOR}$ to be zero. Before we start decoding a new codeword, we can let all transistor-level transient processes in the decoding circuitry reach a stationary state, so that there are neither transitions at gate outputs nor accumulated errors prior to the start of decoding. In practice, this can be done by slightly slowing down the clock in the first iteration and letting the signal levels stabilize. Since the clock is slower, there are no timing errors and the XOR computations are reliable, which yields $C_{XOR} = 0$.
We next compare our results with those of [31], where a reliable decoder was considered. It can be observed that the presence of XOR gate failures reduces the number of errors that can be tolerated by the bit-flipping decoder. For example, when the Tanner graph has expansion $(7/8 + \epsilon)$, the perfect decoder can correct $9\alpha N/16$ errors, which is twice the error correction capability of the faulty decoder. In the limiting case $\epsilon = 1/8$, the number of correctable errors is upper bounded by $3\alpha N/8$, which is only 3/8 of the number of errors correctable by the decoder built from reliable components.
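As a quick arithmetic check of these comparisons, assume that the faulty decoder corrects $N\alpha_{total} = 3(3+8\epsilon)\alpha N/32 - \sqrt{2}\,C_{XOR}$ errors, the expression used in Section VI-B, and set $C_{XOR} = 0$:

```latex
\begin{align*}
\epsilon \to 0: &\quad \frac{3(3+8\epsilon)\,\alpha N}{32} \;\to\; \frac{9\,\alpha N}{32} = \frac{1}{2}\cdot\frac{9\,\alpha N}{16},\\
\epsilon = \tfrac{1}{8}: &\quad \frac{3(3+1)\,\alpha N}{32} = \frac{3\,\alpha N}{8}.
\end{align*}
```

The first line gives the factor of two relative to the perfect decoder's $9\alpha N/16$; the second gives the limiting value $3\alpha N/8$.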
The problem of explicitly constructing expander graphs with expansion arbitrarily close to $\gamma$ (called lossless expanders) was investigated by Capalbo et al. [45], where it was shown that the required expansion $(7/8+\epsilon)\gamma$ can be achieved with graph left-degree $\gamma = \mathrm{poly}(\log(\gamma/\rho), 8/(1-8\epsilon))$. Here $8/(1-8\epsilon) = 1/\epsilon'$, where $\epsilon' = 1/8 - \epsilon$ is the loss parameter for which $(1-\epsilon')\gamma = (7/8+\epsilon)\gamma$. This proves the existence of an expander code that can tolerate a fixed fraction of errors under data-dependent gate failures.
Another proof of guaranteed error correction for LDPC codes was provided by Chilappagari et al. [35], where the correction capability of an LDPC code was expressed in terms of the girth of its Tanner graph. In the following theorem, we extend the results presented in [35] to the case of the noisy decoder.
Theorem 3: Consider an LDPC code with a $\gamma$-left-regular Tanner graph with $\gamma \ge 8$ and girth $g = 2g_0$. Then the majority logic decoder built from unreliable check nodes can correct any error pattern $V_1$ such that $|V_1| < 9\,n_0(\gamma/4, g_0)/32 - \sqrt{2}\,C_{XOR}$, where $n_0(\gamma/4, g_0)$ is given by Eq. (13).
Proof: In order to prove the theorem, we use the following lemma.
Lemma 4: The number of checks connected to a set of variable nodes $V$ in a $\gamma$-left-regular Tanner graph with girth $g = 2g_0$ satisfies
where $f(|V|, g_0)$ represents the maximal number of edges in an arbitrary graph with $|V|$ nodes and girth $g_0$.
Proof: See [35].
Based on the Moore bound, we know that the number of nodes $n(d, g_0)$ in a graph with average degree $d \ge 2$ and girth $g_0$ satisfies [46] $n(d, g_0) \ge n_0(d, g_0)$,
where $n_0(d, g_0)$ is defined in Eq. (13). On the other hand, since $\gamma/4 \ge 2$, a graph with $|V| < n_0(\gamma/4, g_0)$ nodes must have average degree smaller than $\gamma/4$. Since a graph with average degree $\bar{d}$ on $|V|$ nodes has $\bar{d}|V|/2$ edges, it follows that $f(|V|, g_0) < \gamma|V|/8$.
Combining the previous expression with Eq. (14) we obtain
Note that it was shown in [35] that $\gamma \ge 4$ is a sufficient condition for guaranteed error correction on a Tanner graph with girth $g$. Due to logic gate failures, higher expansion is required than for the perfect decoder (here $\gamma \ge 8$), but the other conclusions remain the same as for the perfect decoder.
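The Moore bound used in the proof of Theorem 3 can be evaluated with the short sketch below. It assumes that Eq. (13) is the standard Moore bound for irregular graphs with average degree $d \ge 2$ and girth $g_0$ from [46], namely $n_0(d, 2r+1) = 1 + d\sum_{i=0}^{r-1}(d-1)^i$ and $n_0(d, 2r) = 2\sum_{i=0}^{r-1}(d-1)^i$; the function names are hypothetical.

```python
import math


def moore_bound(d: float, g0: int) -> float:
    """n0(d, g0): minimum number of nodes of a graph with average degree d >= 2 and girth g0.

    Assumes the irregular Moore bound form for Eq. (13).
    """
    if d < 2 or g0 < 3:
        raise ValueError("need average degree d >= 2 and girth g0 >= 3")
    r, odd = divmod(g0, 2)
    geometric = sum((d - 1) ** i for i in range(r))  # sum_{i=0}^{r-1} (d-1)^i
    return 1 + d * geometric if odd else 2 * geometric


def max_set_with_7_8_expansion(gamma: int, g0: int) -> int:
    """Largest |V| with |V| < n0(gamma/4, g0).

    For such sets the auxiliary graph has average degree < gamma/4, hence fewer than
    gamma*|V|/8 edges, so Lemma 4 gives more than 7*gamma*|V|/8 neighbouring checks.
    """
    if gamma < 8:
        raise ValueError("the argument needs gamma/4 >= 2, i.e. gamma >= 8")
    return math.ceil(moore_bound(gamma / 4, g0)) - 1


# Sanity checks against classical 3-regular cages.
print(moore_bound(3, 5))  # 10 (Petersen graph)
print(moore_bound(3, 6))  # 14 (Heawood graph)
```

Using `math.ceil(...) - 1` returns the largest integer strictly below $n_0$, matching the strict inequality $|V| < n_0(\gamma/4, g_0)$ in the argument above.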
VI. NUMERICAL RESULTS
A. Error Probability Analysis
Codes designed from finite geometries are an important class of one-step majority logic decodable codes [47]. It was proven that for an LDPC code derived from finite geometries, the OS-MAJ decoder can correct up to $\gamma/2$ errors. In this section we investigate two-dimensional projective geometry LDPC codes over the Galois field $\mathrm{GF}(2^s)$, denoted as PG$(2, 2^s)$ codes, $s > 0$. The PG$(2, 2^s)$ codes have right-degree $\rho = 2^s + 1$, left-degree $\gamma = 2^s + 1$ and minimum distance $d_{min} = 2^s + 2$.
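The parameters quoted above can be tabulated with a small helper. The code length $n = 2^{2s} + 2^s + 1$ (the number of points of the projective plane) is not stated in the text and is added here as a standard fact; the helper and field names are hypothetical. The sketch also checks that $\lfloor\gamma/2\rfloor$ coincides with $\lfloor(d_{min}-1)/2\rfloor$.

```python
from typing import NamedTuple


class PGCodeParams(NamedTuple):
    s: int
    n: int        # code length = number of points of PG(2, 2^s)
    rho: int      # right (check-node) degree
    gamma: int    # left (variable-node) degree
    d_min: int    # minimum distance
    t_osmaj: int  # errors correctable by the OS-MAJ decoder, floor(gamma / 2)


def pg_code_params(s: int) -> PGCodeParams:
    if s < 1:
        raise ValueError("s must be a positive integer")
    q = 2 ** s
    gamma = rho = q + 1
    return PGCodeParams(s=s, n=q * q + q + 1, rho=rho, gamma=gamma,
                        d_min=q + 2, t_osmaj=gamma // 2)


for s in range(1, 6):
    params = pg_code_params(s)
    # gamma = 2^s + 1 is odd, so floor(gamma/2) = 2^(s-1) = floor((d_min - 1)/2).
    assert params.t_osmaj == (params.d_min - 1) // 2
    print(params)
```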
The average bit error probabilities for several PG codes are shown in Fig. 1(a) for the case when the MAJ gates operate reliably. The performance upper bounds are calculated using Eq. (2) for two XOR gate error rates, $\bar{\varepsilon}_{XOR} = 10^{-3}$ and $10^{-2}$, and compared with the case $\bar{\varepsilon}_{XOR} = 0$, i.e., with the perfect decoder. It should be noted that the lower bounds correspond to rare hardware failures and can be well estimated using $\bar{P}_v(\gamma) \approx P_v(p, (0, \ldots, 0))$. This is the reason why they are omitted from Fig. 1(a).
It can be seen that frequent hardware failures can lead to significant performance degradation. This degradation is especially pronounced in the region of low BSC crossover probabilities. For example, if $p = 10^{-3}$, extremely unreliable
When the MAJ gates are unreliable, the BER performance is, in the worst case, determined by $\bar{\varepsilon}_{MAJ}$, as can be seen in Fig. 1(b). This corresponds to the highly unrealistic case in which only a sequence of pairs of complementary codewords is transmitted. In a more realistic scenario, only a fraction of bits will be miscorrected with the probability given as the upper bound in Eq. (2). Let us denote this fraction by $\alpha_s$. Then, as can be seen in Fig. 1(b), the degradation can be significantly lower, and it is possible to achieve a bit error rate lower than $\bar{P}_v(\gamma)$ (dotted curve in Fig. 1(b)). This reveals that, instead of designing a decoder in which all MAJ gates operate reliably, it makes sense to optimize the decoding schedule in such a way that the switching activity of the gates used in the decoder is reduced. In Fig. 2 we illustrate the maximal fraction of switches $\alpha_{s,\max}$ for which the average BER is lower than $\bar{P}_v(\gamma)$. It can be observed that for codes with higher error correction capabilities, it is more difficult to satisfy this condition. In addition, if more reliable MAJ gates are used in the decoder, more switching events can be tolerated.
B. Guaranteed Error Correction
From Theorem 2 it follows that the number of errors that can be corrected depends on the expansion property, represented by $\alpha$ and $\epsilon$, and on the hardware failures inherited from the time instant prior to decoding, $C_{XOR}$. Here we provide an upper bound on the fraction of channel errors, $\alpha_{total} = 3(3+8\epsilon)\alpha/32 - \sqrt{2}\,C_{XOR}/N$, that can be corrected by the decoder. We use the following lemma to obtain the upper bound numerically.
Lemma 5: If there exists a family of $(\gamma, \rho, \alpha, (7/8+\epsilon)\gamma)$ expander codes with code length going to infinity, then $\alpha_{total}(\alpha, \epsilon) \le \alpha_{total}(\alpha^*, \epsilon^*)$, where $\alpha^*$ and $\epsilon^*$ satisfy the relation $\epsilon^* = (1-(1-\alpha^*)^\rho)/(\alpha^*\rho) - 7/8$.
Proof:
The previous relation follows from [31, Th. 25], where it was shown that a set of $\alpha N$ variables can have at most $N\gamma(1-(1-\alpha)^\rho)/\rho + O(1)$ neighbors, and from the fact that we look for graphs that expand by at least a factor of $(7/8+\epsilon)$: equating $(7/8+\epsilon)\gamma\alpha N$ with this maximal number of neighbors gives the relation in Lemma 5.
In Fig. 3(a) we express $\alpha_{total}(\alpha^*, \epsilon^*)$ in terms of $C_{XOR}/N$ for different $\rho$-right-regular Tanner graphs. We consider only the cases $\rho \ge 8$. We can observe that, for example, for $\rho = 8$, when the influence of inherited failures can be neglected, we can potentially correct more than 1% of erroneous bits. In addition, the correction capability of a code decreases as $\rho$ increases. When XOR gate failures prior to decoding become comparable with the correction capability of a code, a threshold is reached and the bound rapidly decreases. The threshold is independent of $\rho$. For sufficiently large $C_{XOR}/N$, the decoder performance degrades to the point where no error correction can be guaranteed. This happens, for example, for $\rho = 8$ when $C_{XOR}/N \ge 1\%$.
Another perspective on the error correction of noisy decoders is provided in Fig. 3(b). Here we examine how the girth of $\gamma$-left-regular Tanner graphs affects the decoder performance. In addition, we compare the results given by Theorem 3 with the correction capability of the noisy OS-MAJ decoder, $\gamma/2 - C_{XOR}$. It can be observed that the error correction bound guaranteed by Theorem 3 is not tight for small girth ($g \le 8$); it is actually lower than the known OS-MAJ decoder correction capability. However, for higher girths of Tanner graphs, the results given by Theorem 3 are significant. For example, when $g = 12$, $C_{XOR} = 0$ and $\gamma = 12$, we can guarantee correction of error patterns of weight 7, which is not possible using the OS-MAJ decoder.
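A minimal numerical sketch of this procedure follows. It assumes that the upper bound is obtained by maximizing $\alpha_{total}$ along the curve of Lemma 5 with $\epsilon^*$ restricted to $[0, 1/8]$; the function names are hypothetical, and the normalized quantity $C_{XOR}/N$ is passed as `c_xor_frac`. Because $(1-(1-\alpha)^\rho)/(\alpha\rho) = \frac{1}{\rho}\sum_{i=0}^{\rho-1}(1-\alpha)^i$ decreases from 1 as $\alpha$ grows, $\epsilon^* \ge 0$ holds exactly on an interval $(0, \alpha_{\max}]$, which bisection locates.

```python
import math


def eps_star(alpha: float, rho: int) -> float:
    """Lemma 5 relation: eps* = (1 - (1 - alpha)^rho) / (alpha * rho) - 7/8."""
    return (1.0 - (1.0 - alpha) ** rho) / (alpha * rho) - 7.0 / 8.0


def alpha_total(alpha: float, eps: float, c_xor_frac: float) -> float:
    """Correctable fraction 3(3 + 8 eps) alpha / 32 - sqrt(2) C_XOR / N."""
    return 3.0 * (3.0 + 8.0 * eps) * alpha / 32.0 - math.sqrt(2.0) * c_xor_frac


def alpha_total_bound(rho: int, c_xor_frac: float, grid: int = 10_000) -> float:
    """Maximize alpha_total along the Lemma 5 curve, keeping 0 <= eps* <= 1/8."""
    # eps*(alpha) falls monotonically from 1/8 (alpha -> 0) to 1/rho - 7/8 < 0 at
    # alpha = 1, so bisection finds alpha_max, the largest alpha with eps* >= 0.
    lo, hi = 1e-12, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if eps_star(mid, rho) >= 0.0:
            lo = mid
        else:
            hi = mid
    alpha_max = lo
    # Grid search over (0, alpha_max]; the last grid point is alpha_max itself.
    best = max(
        alpha_total(a, eps_star(a, rho), c_xor_frac)
        for a in (alpha_max * k / grid for k in range(1, grid + 1))
    )
    return max(best, 0.0)  # a negative value means no correction can be guaranteed


for rho in (8, 12, 16):
    print(rho, alpha_total_bound(rho, c_xor_frac=0.0))
```

For $\rho = 8$ and $C_{XOR} = 0$ the sketch returns a value slightly above 0.01, in line with the observation about Fig. 3(a) that more than 1% of erroneous bits can potentially be corrected; the bound shrinks as $\rho$ grows and is clipped to zero once $\sqrt{2}\,C_{XOR}/N$ exceeds it.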
VII. CONCLUSION
While the von Neumann error model is suitable for the theoretical evaluation of fault-tolerant systems, the applicability of results obtained under this error model to real-world systems is limited. In practice, the unreliability of logic gates is usually data-dependent and correlated in time. Hence, describing the hardware unreliability phenomenon more accurately requires a change of modeling paradigm. We advocate the use of state models, which provide a more general modeling approach. Based on the data-dependent gate failure model, we developed an analytical method for evaluating the performance of OS-MAJ decoders. Our method enables calculating the BER of any regular LDPC code of girth at least six. These BER values depend strongly on the decoded codewords, and we have succeeded in bounding them for the case of errors caused by the probabilistic nature of gate switching.
In addition, based on the expander properties of Tanner graphs, we proved that the number of errors correctable by the majority logic decoder grows linearly with the code length. Although the required expansion is usually associated with codes with high left-degrees ($\gamma \ge 8$), our results are significant as they are the first known results regarding the guaranteed error correction of LDPC decoders made of unreliable components.
Future research includes investigating fault-tolerant schemes that use other types of LDPC decoders under data-dependent hardware failures. We are working on generalizing our results to more complex iterative decoders, such as finite-alphabet iterative LDPC decoders. Based on the structural properties of Tanner graphs of LDPC codes, we are also investigating the possibility of designing novel decoders that work well under data-dependent hardware failures.
APPENDIX A (PROOF OF LEMMA 1)
Given that each received bit is erroneous with probability $p$, the probability that the output of a fully reliable XOR gate is also erroneous is equal to $A$. As the $j$-th XOR gate fails with probability $\varepsilon_j$, the error probability at the output of the $j$-th XOR gate is $P_j = \varepsilon_j(1 - A) + (1 - \varepsilon_j)A$. Each row of the error configuration matrix $Q_{i,\gamma}$ represents one possible error configuration that results in exactly $i$ erroneous bit estimates at the inputs of the MAJ gate. The total number of such error configurations is $\binom{\gamma}{i}$. A bit $x_v$ will be incorrectly decoded if the majority of its estimates are incorrect. Thus, for odd values of $\gamma$, only values of $i$ greater than or equal to $(\gamma+1)/2$ lead to a miscorrection. If $\gamma$ is even, a tie (an equal number of correct and incorrect estimates) is possible. In such cases, $\gamma/2$ incorrect estimates can result in a miscorrection, which is captured by the second part of Eq. (1).
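The counting argument of Lemma 1 can be sketched by enumerating the rows of $Q_{i,\gamma}$ directly. Two ingredients are assumptions, since their definitions are not reproduced here: $A$ is taken as the probability that an odd number of the $\rho-1$ other bits of a check are in error, $A = (1-(1-2p)^{\rho-1})/2$, and a tie for even $\gamma$ is counted with a hypothetical weight `tie_weight` (0.5 corresponds to a fair coin), standing in for the second part of Eq. (1).

```python
from itertools import combinations
from math import prod
from typing import Sequence


def xor_error_prob(p: float, rho: int) -> float:
    """Assumed A: a reliable XOR output is wrong iff an odd number of its rho-1 inputs are wrong."""
    return (1.0 - (1.0 - 2.0 * p) ** (rho - 1)) / 2.0


def miscorrection_prob(p: float, eps: Sequence[float], rho: int,
                       tie_weight: float = 0.5) -> float:
    """P_v(p, eps): probability that a MAJ gate fed by gamma = len(eps) XOR estimates errs."""
    gamma = len(eps)
    a = xor_error_prob(p, rho)
    # Error probability at the output of the j-th (possibly faulty) XOR gate.
    pj = [e * (1.0 - a) + (1.0 - e) * a for e in eps]
    total = 0.0
    for i in range((gamma + 1) // 2, gamma + 1):  # ceil(gamma/2) <= i <= gamma
        weight = tie_weight if 2 * i == gamma else 1.0  # a tie needs even gamma
        for wrong in combinations(range(gamma), i):  # one row of Q_{i,gamma}
            wrong_set = set(wrong)
            total += weight * prod(pj[j] if j in wrong_set else 1.0 - pj[j]
                                   for j in range(gamma))
    return total


print(miscorrection_prob(1e-2, [0.0] * 5, rho=5))   # reliable XOR gates
print(miscorrection_prob(1e-2, [1e-2] * 5, rho=5))  # every XOR gate fails w.p. 0.01
```

Explicit enumeration costs $2^\gamma$ products, which is affordable for the small left-degrees considered here; when all $\varepsilon_j$ are equal, the sum collapses to a binomial tail.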
APPENDIX B (DERIVATION OF BIT ERROR RATE UNDER GENERAL FAILURE MODEL)
Let $\{x^{(k)}\}_{k\ge 0}$ be a codeword sequence transmitted through the channel. Clearly, the decoding error of $x^{(k)}$ depends on the $M-1$ codewords previously transmitted through the channel. Let
Proof: The expression given by Eq. (1) represents the miscorrection probability for an arbitrarily chosen bit under one hardware failure scenario, i.e., one state array $\sigma^{(k)}$.
A particular state $s_m$ of the $m$-th XOR gate, $1 \le m \le \gamma$, will appear if channel errors change only certain bits of the code sequence $x_{m,v}$. The number of such bits is equal to the Hamming distance between the error-free code sequence and the XOR state $s_m$. As the inputs of the XOR gates are mutually independent, the probability of occurrence of the state array $\sigma^{(k)}$ can be derived by multiplying the individual XOR state probabilities, and we have
The error probability of a bit $x^{(k)}_v$, under the assumption that a fixed sequence of $M$ codewords was transmitted through the channel, can be derived by summing the products $P_{\sigma^{(k)}} P_v(p, \varepsilon^{(k)})$ obtained for all possible error vectors $\varepsilon^{(k)} \in S_\varepsilon$, and the BER can be derived by performing one additional averaging over all code bits.
The gate-error probability vector depends on the transmitted codeword, and $P_v(p, \varepsilon^{(k)})$ is actually a conditional error probability. The conditional BER can be obtained by averaging $P_v(p, \varepsilon^{(k)})$ over all gate-error probability vectors, but the complexity of computing it grows exponentially in the product $\gamma(\rho - 1)M$. However, different gate-error probability vectors $\varepsilon^{(k)}$ may lead to the same bit error probability. In our analysis we take advantage of the fact that, for the GOS failure model, the number of terms that need to be calculated is significantly lower.
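The averaging described above can be sketched as follows. The sketch assumes that every XOR input sequence $x_{m,v}$ has the same length `seq_len` (plausibly $(\rho-1)M$ bits, which would match the $\gamma(\rho-1)M$ complexity mentioned in the next paragraph). Each state array is supplied as a pair (Hamming distances of the $\gamma$ XOR states, gate-error vector), and `bit_error` stands for any implementation of $P_v(p, \varepsilon^{(k)})$; all names are hypothetical.

```python
from math import prod
from typing import Callable, Iterable, Sequence, Tuple

# (Hamming distances of the gamma XOR states, gate-error probability vector)
StateArray = Tuple[Sequence[int], Sequence[float]]


def xor_state_prob(p: float, hamming: int, seq_len: int) -> float:
    """Probability that a BSC(p) maps x_{m,v} onto one given state at this Hamming distance."""
    return p ** hamming * (1.0 - p) ** (seq_len - hamming)


def state_array_prob(p: float, hammings: Sequence[int], seq_len: int) -> float:
    """P_sigma: product of the gamma XOR-state probabilities (independent XOR inputs)."""
    return prod(xor_state_prob(p, d, seq_len) for d in hammings)


def conditional_bit_error(p: float, states: Iterable[StateArray], seq_len: int,
                          bit_error: Callable[[float, Sequence[float]], float]) -> float:
    """Sum of P_sigma * P_v(p, eps) over the supplied state arrays."""
    return sum(state_array_prob(p, hammings, seq_len) * bit_error(p, eps)
               for hammings, eps in states)


# Normalization check: two XOR gates whose inputs are 2-bit sequences.
# A 2-bit sequence has one state at distance 0, two at distance 1, one at distance 2.
dists = [0, 1, 1, 2]
all_states = [((d1, d2), (0.0, 0.0)) for d1 in dists for d2 in dists]
print(conditional_bit_error(0.1, all_states, 2, lambda p, eps: 1.0))  # 1 up to rounding
```

With `bit_error` returning 1 for every state, the sum reduces to the total probability of all state arrays, which is a convenient check that the enumeration is complete.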
APPENDIX C (PROOF OF THEOREM 1)
The value $\bar{P}_v(I^{(k)})$ given in Eq. (3) represents the probability that an incorrect bit estimate is passed to the neighbouring MAJ gate under the GOS failure model, for a given parity indicator $I^{(k)}$. This value can be obtained by averaging Eq. (1) from Lemma 1 over all error probability vectors $\bar{\varepsilon}(t)$. Namely, the probability that at time $k$, $t$ logic gates used for decoding a bit $v$ change their output values, under the parity indicator $I^{(k)}$, is equal to
t − j $B^{\gamma + 2j - I^{(k)} - t}(1 - B)^{I^{(k)} + t - 2j}$, where $j = t$ if $I^{(k)} = 0$, and $j = 0$ if $I^{(k)} = \gamma$. The sum over all possible configurations of $t$ non-zero failure probabilities is the contribution of $P_v(p, \bar{\varepsilon}(t))$ to the overall miscorrection probability.
However, a final bit decision at time $k$ can also be incorrect due to failures of the MAJ gates, which also depend on the bit estimates from time instant $k-1$. From the definition of the GOS model it follows that a failure of a MAJ gate at time $k-1$ will not affect the reliability of the same gate during the $k$-th computing cycle. Thus, the probability of bit miscorrection at time $k$, $\bar{P}_{e,GOS}(I^{(k)})$, depends on $\bar{P}_v(I^{(k-1)})$, the probability of bit miscorrection given that there is no failure of the MAJ gate at time $k-1$, rather than on $\bar{P}_{e,GOS}(I^{(k-1)})$. The probability of bit error at time instant $k$ can now be expressed as
Bane Vasić is a Professor of Electrical and Computer Engineering and Mathematics at the University of Arizona and a Director of the Error Correction Laboratory. He is an IEEE Fellow and da Vinci Fellow, and is a past Chair of the IEEE Data Storage Technical Committee. He is an inventor of the soft error-event decoding algorithm, and the key architect of a detector/decoder for Bell Labs data storage read channel chips which were regarded as the best in industry. His pioneering work on structured low-density parity-check (LDPC) error correcting codes and invention of codes has enabled low-complexity iterative decoder implementations. Structured LDPC codes are today adopted in a number of communications standards. His work on trapping sets has led to iterative decoding algorithms with fast convergence, low error floors and low power consumption. He is a co-founder of Codelucida, a startup company developing such advanced error correction solutions for communications and data storage.
