Motivated by recently derived fundamental limits on total (transmit + decoding) power for coded communication, this paper investigates how close regular LDPC codes can get to these fundamental limits. For two decoding algorithms (Gallager-A and Gallager-B), we provide upper and lower bounds on the required decoding power based on models of parallelized decoding implementations. As the target error-probability is lowered to zero, we show that the transmit power must increase unboundedly in order to minimize total (transmit + decoding) power. Complementing our theoretical results, we develop detailed physical models of decoding implementations using rigorous (post-layout) circuit simulations, and use them to provide a framework to search for codes that may minimize total power. Our results show that approaching the total-power channel capacity requires increasing the complexity of both the code design and the corresponding decoding algorithm as the communication distance is increased, or as the target error-probability is lowered.
the specific modulation choice, coding strategy, equalization strategy, etc. [29] , [20] . Even for a fixed communication strategy, processing power depends strongly on the implementation technology (e.g. 45 nm CMOS) and the choice of circuit architecture.
With simplistic encoding/decoding models, recent literature has explored fundamental limits [20] , [21] , [35] , [33] , [34] , [5] , [6] on transmit + encoding + decoding power. These limits hold for any code and any encoding/decoding algorithm, under specific implementation models. The models abstract power consumed in computational nodes [20] , [33] , [34] and wiring [35] , [21] , [5] , [6] in the encoder/decoder implementations, and show that there is a fundamental tradeoff between transmit and encoding/decoding power. The simplicity of these models is needed 1 in order to obtain universal limits.
In this work, we therefore pose the question (see Fig. 1 ): how small is the gap between known families of codes and decoding algorithms and fundamental limits on total 2 power? To address this question, we first provide asymptotic upper and lower bounds (Sections III-IV) on required decoding power. Our novelty, in comparison with the fundamental limits [21] , lies in performing these analyses by restricting our attention to regular LDPC codes and Gallager decoding algorithms. This choice is motivated by both the order-optimality of regular LDPC codes in some theoretical models of circuit power [20] , and their practical utility in both short [1] and long [15] distance settings. Recent work of Blake and Kschischang [6] also studied the energy of LDPC decoding circuits, and an important connection to this work is highlighted in Section IV-D.
Within these restrictions we provide the following insights:
1) Wiring power, which explicitly brings out physical constraints in a digital system [47] , costs more in the low error-probability limit than the power consumed in computational nodes. Thus, counting the number of algorithmic operations is not a sufficient metric of decoding complexity when power is important.
2) Approaching Shannon capacity requires keeping transmit power near the Shannon limit even as the error probability approaches zero. However, when total-power minimization is the goal, keeping transmit power bounded (e.g., by approaching Shannon capacity) can lead to suboptimal decoding power. For instance, we observe (Theorems 3 and 4) that at sufficiently low error-probability, it is more total-power-efficient to use uncoded transmission than regular LDPC codes with Gallager decoding, if using bounded transmit power. However, if transmit power is allowed to grow unboundedly, LDPC codes can outperform uncoded transmission in this total-power sense.
3) Under the assumption that processing power is dominated by nodes as opposed to wires, fundamental limits on total power can be achieved in the low error-probability limit via regular LDPC codes with Gallager-B decoding. However, when wires dominate, a large gap exists between fundamental limits and lower bounds on Gallager decoding.
To obtain insights on optimizing code choices for system implementations at practical error-probabilities, we then develop empirical models of decoding power consumption of 1 and 2 bit message-passing algorithms for regular LDPC codes (Section V-C). These models are constructed by simulating (post-layout) power consumption for simple codes and decoders, breaking down the circuit power into its constituents, and generalizing these constituents of power to structurally similar codes.
1 In a nutshell, the models assume that any synchronous VLSI circuit is a set of computational nodes connected to each other using wires.
2 In this paper, since we focus on transmit and decoding power, we use the term total power to denote just the sum of transmit and decoding power.
Shannon-theoretic analysis yields transmit-power-centric results, which are plotted as "waterfall" curves (with corresponding "error-floors") demonstrating how close the code performs to the Shannon limit. There, the channel path-loss can usually be ignored because it is merely a scaling factor for the term to be optimized (namely the transmit power), thereby not affecting the optimizing code. Since we are interested in total power, the path-loss impacts the code choice. For simplicity of understanding, path-loss is translated into a more relatable metric, communication distance, using a simple model for path-loss. The resulting question is illustrated in Fig. 1(b) : At a given data-rate, what code and corresponding decoding algorithm minimize the transmit + decoding power for a given transmit distance and error-probability? In Section V-C, we present optimization results for this question in a 60 GHz communication setting using our models. This particular setting is chosen not just because of the short distance, but also because the results highlight another conceptual point we stress in this paper:
4) Approaching total-power capacity requires an increase in the complexity of both the code design and the corresponding decoding algorithm as the communication distance is increased, or as the target error-probability is lowered.
The results presented in this paper provide a framework for optimizing codes and decoders, which can be used to obtain optimal code-decoder choices for some (but not all) system designs. First, we only consider a limited set of coding strategies, and while the results and models presented here extend easily to irregular LDPC constructions, they are not necessarily applicable to all decoders. Second, modern transceivers [39] contain many other processing power sinks, including analog-to-digital converters (ADCs), digital-to-analog converters (DACs), power amplifiers, modulation, and equalizers, and the power requirements of each of these components can vary 3 based on the coding strategy. While recent works have started to address fundamental limits [26] and modeling [27] of power consumption of system blocks from a mixed-signal circuit design perspective, the tradeoffs between these components and the code choice remain relatively unexplored. Hence, while analyzing decoding power is a start, and is especially relevant when decoding power is the dominant sink of energy, other system-level tradeoffs should be addressed in future work. It is also of great interest to understand tradeoffs at a network level (e.g., see [20] ), where multiple transmitting-receiving pairs are communicating in a shared wireless medium. In such situations, one cannot simply increase transmit power to reduce decoding power: the resulting interference to other users needs to be accounted for as well.
The remainder of the paper is organized as follows. Section II states the assumptions and notation used in the paper.
Sections II-C to II-F introduce theoretical models of VLSI circuits and decoding energy. These models are analyzed in Sections III and IV respectively, in the context of the question illustrated in Fig. 1(a) . Section V discusses detailed physical models of decoding implementations, in the context of the question illustrated in Fig. 1(b) . Section VI concludes the paper.
II. SYSTEM AND VLSI MODELS FOR ASYMPTOTIC ANALYSIS
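Throughout this paper, we rely on the family of Bachmann-Landau notation [38] (i.e. "big-O" notation). For any two functions $f(x)$ and $g(x)$ defined on some subset of $\mathbb{R}$, asymptotically (as $x \to \infty$), $f(x) = O(g(x))$ if $|f(x)| \le c_1 |g(x)|$, $f(x) = \Omega(g(x))$ if $|f(x)| \ge c_2 |g(x)|$, and $f(x) = \Theta(g(x))$ if $c_3 |g(x)| \le |f(x)| \le c_4 |g(x)|$, for some positive real-valued constants $c_1, c_2, c_3, c_4$. All logarithm functions log(·) are natural logarithms unless explicitly stated otherwise.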
A. Communication channel model
We assume the communication between transmitter and receiver takes place over an AWGN channel with flat-fading. The transmission strategy uses BPSK modulation, and a $(d_v, d_c)$-regular LDPC code of design rate $R = 1 - \frac{d_v}{d_c}$ [48] (which is assumed to equal the code rate). The blocklength of the code is denoted by $n$, and the number of source bits is denoted by $k = nR$. We further assume that $d_v \ge 4$ for reasons that have to do with how fast the error-probability decreases with the number of decoding iterations, which will become clear in Lemma 2. The decoder performs a hard-decision on the observed channel outputs before starting the decoding process, thereby first recovering noisy codeword bits transmitted through a Binary Symmetric Channel (BSC) of flip probability p 0 = Q(
Here, $SNR$ is the received signal-to-noise ratio and $Q(\cdot)$ is the right-tail probability (complementary cumulative distribution function) of the standard normal distribution, $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-u^2/2} \, du$. The transmit power $P_T$ is assumed to be proportional to $SNR$, modeling fixed distance and fixed fade-coefficient wireless communication. It follows via the Mills ratio inequalities [32] , that
for some constant η > 0, where SN R = ηP T .
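The relationship between the Q-function and its Mills ratio bounds can be checked numerically. The following is a minimal sketch, not the authors' code: it uses the standard bounds $\frac{x}{1+x^2}\phi(x) < Q(x) < \frac{\phi(x)}{x}$ for $x > 0$, where $\phi$ is the standard normal density. The function names are hypothetical, and the mapping `p0 = Q(sqrt(SNR))` is an assumption made only for illustration, since the exact argument of $Q(\cdot)$ depends on how $SNR$ is normalized.

```python
import math


def q_function(x):
    """Right-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mills_bounds(x):
    """Standard Mills ratio bounds on Q(x), valid for x > 0."""
    if x <= 0:
        raise ValueError("bounds are stated for x > 0")
    phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    lower = x / (1.0 + x * x) * phi
    upper = phi / x
    return lower, upper


def bsc_flip_probability(snr):
    """Flip probability after hard decision.

    ASSUMPTION: p0 = Q(sqrt(SNR)); the normalization of SNR is illustrative.
    """
    return q_function(math.sqrt(snr))


if __name__ == "__main__":
    for snr in (1.0, 4.0, 9.0, 16.0):
        lo, hi = mills_bounds(math.sqrt(snr))
        print(snr, lo, bsc_flip_probability(snr), hi)
```

Because both bounds decay like $e^{-x^2/2}$ up to polynomial factors, the flip probability decays exponentially in the transmit power whenever $SNR$ is proportional to $P_T$.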
B. Decoding algorithm assumptions
We consider two decoding algorithms which were originally proposed in Gallager's thesis [31] , and are now called [48] "Gallager A" and "Gallager B". Both algorithms are one-bit message-passing algorithms which were analyzed in [31] under the assumption that any two messages being passed in a given iteration are independent. For this assumption to hold, decoding can run only for the number of algorithmic iterations until which the decoding neighborhood around each variable-node is locally tree-like. In this work, we assume the decoder operates under this constraint. Any larger number of iterations introduces correlations in messages, rendering density-evolution analysis [49] of error-probability invalid 4 .
It is known that the girth [44] of the code determines how many such independent algorithmic iterations can be accommodated.
This maximum number of independent iterations, which we denote as $N_{iter}$, is explicitly $N_{iter} = \left\lfloor \frac{g-2}{4} \right\rfloor$ [44], where $g$ is the girth of the Tanner graph, which is even and at least 4 for any LDPC code. Hence, we assume that $g \ge 6$ so that at least one decoding iteration is used. The target average bit-error probability is denoted by $P_e$, decoding power by $P_{Dec}$, and "total" power by $P_{total} := P_T + P_{Dec}$. The minimum total power for a strategy is denoted by $P_{total,min}$ and the optimizing transmit power by $P_T^*$.
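As a small illustration of the girth constraint, the sketch below maps a girth value to the number of independent iterations. The function name is hypothetical, and rounding $(g-2)/4$ down to an integer is an assumption made so that the result is a whole number of iterations.

```python
def max_independent_iterations(girth):
    """Number of independent iterations allowed by a Tanner graph of given girth.

    ASSUMPTION: (girth - 2) / 4 is rounded down to an integer.
    """
    if girth < 4 or girth % 2 != 0:
        raise ValueError("the girth of a Tanner graph is an even integer >= 4")
    return (girth - 2) // 4


if __name__ == "__main__":
    for g in (4, 6, 8, 10, 14):
        print(g, max_independent_iterations(g))
```

Under this reading, a girth of 4 allows no independent iteration, which is consistent with the requirement $g \ge 6$.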
C. VLSI layout model for decoding implementation
Different models for analyzing the wiring and area complexity of VLSI circuits were introduced several decades ago in computer science, but the most commonly used one is attributed to Thompson [54] . Our model for the LDPC decoding circuit in this paper is an adaptation of Thompson's model, and it entails the following assumptions:
1) The VLSI circuit includes processing elements which perform computations and store data, and wires which connect them. The circuit is placed on a square grid of horizontal and vertical wiring tracks of finite width λ > 0, and contact squares of area λ 2 at the overlaps of perpendicular tracks.
2) Neighboring parallel tracks are spaced apart by width λ.
3) Wires carry information bi-directionally and can only cross orthogonally at the contact squares.
4) The layout is drawn in the plane. In other words, the model does not allow for more than two metal layers for routing wires in the manner that modern IC manufacturing processes do.
5) The processing elements in the circuit hold finite memory and are situated at the contact squares of the grid. They connect to wires routed along the grid.
6) Since wires are routed only horizontally and vertically, any single contact has access to a maximum of 4 distinct wires.
To accommodate higher-degree nodes, a processing element requiring x external connections (for x > 4) can occupy a square of side-length xλ on the grid, with wires connecting to any side. No wires pass over the large square.
Hence, λ can be thought of as a "toy-model" of the minimum feature-size metric which is often used to describe IC fabrication processes. Consequently, in the body of the paper we refer to this model as Implementation Model (λ). The decoder circuit is assumed to be implemented in a "fully-parallel" manner [8] , i.e. a processing element never acts as more than one vertex in the Tanner 
D. Time required for processing
In Sections II-E, II-F we will describe two models of energy consumption for the VLSI circuit. In order to later translate these energy models to power models, we need the time required for computation (the computation time is measured in seconds and is different from N iter ). In this section, we characterize the computation time.
The computations are assumed to happen in clocked iterations, with each iteration consisting of two steps. In the first step of each iteration, one-bit information messages (based on the Gallager A or B decoding algorithm) are passed from all 6 variable-nodes to neighboring check-nodes along connecting wires. In the second step, each check-node computes a function of its inputs (according to the decoding algorithm) and passes a one-bit message back to each neighboring variable-node.
We denote the decoding throughput (number of source bits decoded per second) by $R_{data}$. Because a batch of $k$ source bits is processed in parallel, the time available for processing is $T_{proc} = \frac{k}{R_{data}}$ seconds. The required power for decoding is therefore $P_{Dec} = \frac{E_{Dec}}{T_{proc}} = \frac{E_{Dec}}{k} R_{data}$, which is simply the energy per source bit ($E_{Dec}/k$) multiplied by the (fixed) data rate.
E. Computational node model of decoding power
Definition 1 (Node Model (ξ node )). The energy consumed in each variable or check node during one decoding iteration is E node .
This constant can depend on λ, d v , d c , and R data . The total number of nodes at the decoder is n nodes = n + (n − k) = 2n − k.
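The total energy is $E_{Dec} = E_{node} \, n_{nodes} \, N_{iter}$. The decoding power is $P_{Dec} = \frac{E_{node} \, n_{nodes} \, N_{iter}}{k} R_{data} = \xi_{node} N_{iter}$.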
Here, ξ node is a constant that depends on λ, d v , and d c . This model assumes that the entirety of the decoding energy is consumed in computational processing nodes, and wires require no energy. In spirit, the model is simply counting the number of operations for the given message-passing decoding algorithm. We note that the node model accounts for the number of iterations because the leakage power in the nodes (which is the power expended even when there is no switching) is not negligible [12] . The next energy model complements the node model by accounting for energy consumed in wiring.
F. Message-passing wire model of decoding power
Definition 2 (Wire Model (ξ wire )). The decoding energy is $E_{Dec} = E_{unit-area} A_{wires}$, where $E_{unit-area}$ is the energy consumed in each unit-area of a wire and $A_{wires}$ is the total area occupied by the wires in the circuit. The decoding power is $P_{Dec} = \frac{E_{unit-area} A_{wires}}{k} R_{data} = \xi_{wire} A_{wires}$, where $\xi_{wire}$ is a constant depending on $\lambda$, $d_v$, $d_c$, and $R_{data}$.
The wires in the decoder consume power whenever they are "switched," i.e. when the message along the wire changes its value 5 . The fraction of time the wires need to be switched is called the activity-factor of the wires. If the messages passed along the wires were completely random (i.e. Bernoulli(1/2)) and independent over time, then each wire would need to be switched half the time (on average). In reality however, as decoding proceeds, the messages tend to stabilize, reducing switching and hence the power consumed in the wires. Thus the model here accounts for the fact that the activity factor is low after the first few iterations by ignoring the energy consumed in the wires after the first iteration.
In practice, either of the two power models can be a better approximation based on the simplicity of the computations and the complexity of the wiring required for the decoding algorithm.
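The two energy models and the processing-time relation of Section II-D can be sketched as follows. This is only an illustrative sketch: the function names and the numerical values in the usage example are hypothetical and are not taken from the paper.

```python
def processing_time(k, r_data):
    """Time available to decode a batch of k source bits at throughput r_data."""
    return k / r_data


def node_model_power(e_node, n, k, n_iter, r_data):
    """Decoding power when all energy is consumed in computational nodes."""
    n_nodes = 2 * n - k  # n variable nodes plus (n - k) check nodes
    e_dec = e_node * n_nodes * n_iter
    return e_dec / processing_time(k, r_data)


def wire_model_power(e_unit_area, a_wires, k, r_data):
    """Decoding power when all energy is consumed in wiring area."""
    e_dec = e_unit_area * a_wires
    return e_dec / processing_time(k, r_data)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the formulas.
    print(node_model_power(e_node=1e-12, n=2000, k=1000, n_iter=3, r_data=1e9))
    print(wire_model_power(e_unit_area=1e-15, a_wires=1e6, k=1000, r_data=1e9))
```

Note that the node-model power grows with the number of iterations, whereas the wire-model power depends only on the wiring area, reflecting the choice to ignore wire energy after the first iteration.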
III. ANALYSIS OF NODE MODEL

A. Approximation analysis of Gallager decoding algorithms
We first obtain bounds on the number of algorithmic decoding iterations required to attain a specific error-probability. These results are used in Section III-B to bound the total power under the Node Model.
Lemma 1. The required number of independent iterations N iter to attain error-probability P e with a Gallager-A decoder is
Proof: See Appendix A.
Lemma 2. The required number of independent iterations N iter to attain error-probability P e with a Gallager-B decoder with variable node degree d v ≥ 4 is given by
Proof: See Appendix B. Importantly, this does not hold for d v < 4 because Gallager-A and Gallager-B are equivalent then.
B. Minimum total power of node model
In this section, we investigate the question: as P e → 0, which decoding algorithms, with associated optimal transmit power, minimize the total power under the Node Model of Section II-E?
1) Gallager-A decoding:
Corollary 1. Under the Node Model, the optimal total power using Gallager-A Decoding is
which is achieved by transmit power
Proof: Applying Lemma 1 to the Node Model, the power consumed by a Gallager-A decoder is given by
and the total power is given by
Thus, if $P_T$ is bounded even as $P_e \to 0$, the total power $P_{total,bdd\,P_T} = \Theta\left(\log \frac{1}{P_e}\right)$. If instead $P_T$ is allowed to increase unboundedly, optimizing over $P_T$, the minimum total power is P total,min = min
for which the optimal transmit power, P *
Corollary 2. Under the Node Model, the optimal total power using Gallager-B Decoding is
Proof: Using Lemma 2 in the Node Model, the power consumed by a Gallager-B decoder is given by
Thus the total power is given by
The optimal total power as P e → 0 is given by
In this case the optimal transmit power is constant, even as P e → 0.
C. Comparison with fundamental limits
Can we reduce the power under the Node Model via a better code, or a more sophisticated decoding algorithm? After all, Gallager-B is merely a one-bit message-passing algorithm, and belief-propagation requires the transmission of infinite-length log-likelihoods. It was shown in [20] that under the Node Model and a fully-parallelized decoding implementation such as Implementation Model (λ), the optimal total power is lower bounded by $\Omega\left(\log \log \frac{1}{P_e}\right)$, matching Corollary 2. In fact, using a code which performs closer to Shannon capacity can even reduce efficiency: if a capacity-approaching LDPC code is used instead of a regular LDPC code, the asymptotic performance under the Gallager-B decoding algorithm matches that of regular LDPCs under Gallager-A. This is because the error-probability decays only exponentially (and not doubly-exponentially) with the number of iterations under Gallager-B decoding if degree-2 variable nodes are present [18] , and [51] shows that degree-2 variable nodes are required in order to achieve capacity (the fraction of degree-2 variable nodes required to attain capacity under message-passing decoding is characterized in [51] ).
IV. ANALYSIS OF WIRE MODEL
We now shift our focus toward analyzing the Wire Model described in Section II-F. We first state some bounds on the blocklength of LDPC codes which we will refer back to at several points in the remainder of the paper.
Lemma 3. For a given girth g of a (d v , d c )-LDPC code, a lower bound on the blocklength n is given by
, and an upper bound on the blocklength is given by
Proof: For the lower bound, see [35, Appendix I] , and for the upper bound, see [35, Claim 2] .
A. Bounds on wiring area of decoders
To make use of the energy model of Section II-F, we must characterize the total wiring area of the decoder. Techniques for obtaining upper and lower bounds on the total wire area for different computations were explored in often-forgotten computer science works [54] , [41] , [43] , [40] . In the following subsections, we introduce some graph theory concepts and we directly apply them to obtain bounds on the wiring of an LDPC decoder.
1) Lower bound on wiring area:
We first give the trivial lower bound on the wiring area of the decoder for any regular LDPC code under Implementation Model (λ).
Lemma 4. For a (d v , d c )-regular LDPC code of blocklength n, the wiring area A wires under Implementation Model (λ) is
Proof: There are nd v wires. Each wire has width λ > 0 and positive length (no two wires overlap completely).
In his thesis [40] , Leighton utilizes the crossing number (a property first defined by Turán [56] ) of a graph as a tool for obtaining lower bounds on the wiring area of circuits. We use the following two definitions to introduce this property.
Definition 3 (Graph Drawing).
A drawing of a graph G is a representation of G in the plane such that each vertex of G is represented by a distinct point and each edge is represented by a distinct continuous arc connecting the endpoints, which does not cross itself. We assume that in any drawing, no edge passes through vertices other than its endpoints and no edges overlap others for any nonzero length (i.e. anything other than at discrete points where they might cross).
Definition 4 (Crossing Number).
The crossing number of a graph G, cr(G), is the minimum number of edge-crossings over all possible drawings of G. An edge-crossing is any point in the plane other than a vertex of G where a pair of edges intersects.
Crossing numbers continue to be of interest to combinatorialists and graph-theorists, and many difficult problems on finding exact crossing numbers or bounds for various families of graphs remain open [46] .
It follows that for any graph G, the wiring area (A wires ) of the corresponding circuit under Implementation Model (λ) is lower bounded as A wires ≥ λ 2 cr(G). This is due to the fact that any VLSI layout of the type described in Section II-C is isomorphic to a drawing of G in the sense of Definition 3. Therefore the minimum number of wire crossings of any layout of G is cr(G). Since every crossing has area λ 2 , the inequality follows. From this, any lower bound on the crossing number of a computation graph also yields a lower bound on its circuit wiring area. The first lower bound on the crossing number of a general graph (often called the "crossing number lemma" [3] ) was proved independently by Ajtai et al. [17] and Leighton [40] .
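To make the chain "crossing number bound, hence wiring area bound" concrete, the sketch below evaluates the classical crossing number lemma on the Tanner graph of a regular LDPC code. It assumes the standard form of the lemma, $cr(G) \ge |E|^3 / (64 |V|^2)$ whenever $|E| \ge 4|V|$, rather than the sharper girth-dependent bound used later in this paper. The function names are hypothetical.

```python
def tanner_graph_size(n, dv, dc):
    """Return (|V|, |E|) for the Tanner graph of a (dv, dc)-regular code."""
    if (n * dv) % dc != 0:
        raise ValueError("n * dv must be divisible by dc")
    num_checks = n * dv // dc
    return n + num_checks, n * dv


def crossing_lemma_bound(num_vertices, num_edges):
    """Classical crossing lemma: cr(G) >= |E|^3 / (64 |V|^2) if |E| >= 4|V|.

    Returns 0.0 when the density condition fails (only the trivial bound holds).
    """
    if num_edges < 4 * num_vertices:
        return 0.0
    return num_edges ** 3 / (64 * num_vertices ** 2)


def wiring_area_lower_bound(n, dv, dc, lam=1.0):
    """A_wires >= lam^2 * cr(G), with cr(G) bounded via the crossing lemma."""
    num_vertices, num_edges = tanner_graph_size(n, dv, dc)
    return lam ** 2 * crossing_lemma_bound(num_vertices, num_edges)


if __name__ == "__main__":
    print(wiring_area_lower_bound(1024, 8, 16))  # dense enough for the lemma
    print(wiring_area_lower_bound(1024, 4, 8))   # density condition fails
```

For a $(d_v, d_c)$-regular code the density condition $|E| \ge 4|V|$ reduces to $d_v d_c \ge 4(d_v + d_c)$, so it holds only for sufficiently large node degrees.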
In this paper, we make use of the following improvement from [11] .
Theorem 1 (Pach, Spencer, Tóth [11] ). Let $G = \{V, E\}$ be a graph with girth $g > 2\ell$ and $|E| \ge 4|V|$. Then $cr(G)$ satisfies $cr(G) \ge k_\ell \frac{|E|^{\ell+2}}{|V|^{\ell+1}}$, where $k_\ell > 0$ is a constant dependent on $\ell$.
We now obtain lower bounds on wiring area based on the minimum number of independent iterations the code allows for.
Lemma 5 (Crossing Number Lower Bound on A wires ). For any (d v , d c )-regular LDPC code C that allows for at least N iter independent decoding iterations with a decoder D implemented in Implementation Model (λ), the decoder wiring area A wires is lower bounded in the order of N iter as
we can tighten this to
$A_{wires} = \Omega\left(e^{N_{iter}\left(\gamma + 2\log d_v + 2\log d_c - 2\log(d_v + d_c)\right)}\right)$.
Proof: Let C be a (d v , d c )-regular LDPC code that allows for at least N iter independent decoding iterations. We know that the minimum girth, g min of C must satisfy
From Lemma 3, the blocklength n of the code C can be expressed in the order of N iter as
. And from Lemma 4 we then have
Now, assume
. Let V C , E C denote the sets of vertices and edges in the Tanner graph of C. The sizes are
We can then carry out the following manipulations
Hence, |E C | ≥ 4|V C |. Also, using the fact that g min > 4N iter − 4 we can apply Theorem 1 and write
Niter(γ+2 log dv+2 log dc−2 log(dv+dc)) .
How loose can (13) 
, the loosest the bound can become is $A_{wires} = \Omega\left(e^{N_{iter}(\gamma + 2\log 4)}\right)$.
2) Upper bound on wiring area: Since the total circuit area is always an upper bound on the area occupied by wires, we use an upper bound on the circuit area to obtain the following upper bound on the wiring area based on the maximum number of independent iterations that the code allows for.
Lemma 6 (Upper bound on A wires ). For any (d v , d c )-regular LDPC code C, that allows for at most N iter independent decoding iterations, the decoder wiring area A wires is upper bounded in the order of N iter as
Proof: Let C be a (d v , d c )-regular LDPC code that allows for at most N iter independent decoding iterations. We know that the maximum girth, g max of C must satisfy
From Lemma 3, the blocklength of any such code can be upper bounded in the order of N iter as
where
. Then, consider a "collinear" VLSI layout of the Tanner graph of C which satisfies all the assumptions described in Section II-C. Arrange all variable-nodes and check-nodes in the graph along a horizontal line, leaving λ spacing between consecutive nodes. The total length of this arrangement is then O(n). Allocate a unique horizontal wiring track for each of the nd v edges in the Tanner graph. Then, every connection in the graph can be made with two vertical wires (one from each endpoint) which connect to the opposite ends of the dedicated horizontal track.
The total height of this layout is then O(n), and the total area is O(n 2 ). An example collinear layout is given in Fig. 3 .
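The quadratic scaling of the collinear layout can be illustrated with a short area calculation. This is a sketch under stated assumptions, not the construction of Fig. 3: the function name is hypothetical, each horizontal track is assumed to consume its own width λ plus λ of spacing, and nodes with more than four connections are assumed to occupy a square of side (degree × λ) as in Section II-C.

```python
def collinear_layout_area(n, dv, dc, lam=1.0):
    """Area of a collinear layout for a (dv, dc)-regular Tanner graph (sketch)."""
    if (n * dv) % dc != 0:
        raise ValueError("n * dv must be divisible by dc")
    num_checks = n * dv // dc
    num_edges = n * dv
    # Nodes with more than 4 connections occupy a square of side (degree * lam).
    var_side = dv * lam if dv > 4 else lam
    chk_side = dc * lam if dc > 4 else lam
    # All nodes sit on one horizontal line, with lam spacing between neighbors.
    width = n * (var_side + lam) + num_checks * (chk_side + lam)
    # ASSUMPTION: one dedicated horizontal track per edge, each taking 2 * lam.
    height = max(var_side, chk_side) + 2 * lam * num_edges
    return width * height


if __name__ == "__main__":
    a1 = collinear_layout_area(1600, 8, 16)
    a2 = collinear_layout_area(3200, 8, 16)
    print(a1, a2, a2 / a1)  # the ratio approaches 4 as n grows
```

Both the width and the height grow linearly in $n$ for fixed degrees, so doubling the blocklength roughly quadruples the area, in line with the $O(n^2)$ bound.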
Substituting the result from (14) for n, we obtain the desired bound. We mention that this upper bound is crude, since the O(|V | 2 ) layout construction applies for any graph G = {V, E} which
if G has crossing number exactly cr(G). Via this, any upper bound on the crossing number yields an upper bound on total area. Hence, a constructive proof of a drawing algorithm for semi-regular graphs which yields sub-quadratic (in n) crossing numbers would be immediately applicable to the design of codes with short wires and less power consumed at the decoder.
B. Minimum total power of wire model
We now present analogues of results in Section III-B, where we instead consider decoding power described by the Wire Model of Section II-F. We translate the wiring area bounds of Section IV-A to power bounds.
Theorem 2 (Asymptotic bounds on P wires ). Under Implementation Model (λ) and Wire Model (ξ tech ), the decoding power P wires for any regular LDPC code that is decoded for exactly N iter iterations is bounded as
Proof: The result is a straightforward conclusion from Lemma 3, Lemma 5, and Lemma 6.
1) Gallager-A decoding:
Theorem 3. The optimal total power under Gallager-A decoding in the Wire Model (ξ tech ) for any regular LDPC code to achieve error-probability P e is bounded as
Further, if $P_T$ is bounded even as $P_e \to 0$, then the required power diverges as a polynomial in $\frac{1}{P_e}$, which is exponentially worse than using uncoded transmission.
Proof: See Appendix C.
2) Gallager-B decoding:
Theorem 4. The optimal total power under Gallager-B decoding in the Wire Model (ξ tech ) for any regular LDPC code to achieve error-probability $P_e$ is lower bounded as $P_{total,min} = \Omega\left(\log^{2/3} \frac{1}{P_e}\right)$ and upper bounded as
Further, if $P_T$ is bounded even as $P_e \to 0$, then $P_{total,bdd\,P_T} = \Omega\left(\log^{2} \frac{1}{P_e}\right)$.
Proof: See Appendix D.
C. Comparison with fundamental limits
In [21] , using another Wire Model, it is shown that the total power required for any message-passing decoding algorithm is fundamentally lower bounded by Ω log
Pe . In comparison, Theorem 4 shows that the total power for regular LDPCs for Gallager-B decoding diverges to infinity at least as fast as $\log^{2/3} \frac{1}{P_e}$. The Wire Model of [21] and the one here have a small difference: while the power is assumed to be proportional to $A_{wires} N_{iter}$ in [21] , here it is assumed to be simply proportional to $A_{wires}$. This difference in modeling is not significant for Gallager-B decoding: the number of iterations, $N_{iter}$ is Θ log
Thus even if we adopted the Wire Model of [21] in this paper, the introduced discrepancy would be bounded by a multiplicative factor of $\log \log \frac{1}{P_e}$, which is small relative to the fractional powers of $\log \frac{1}{P_e}$ in play here. Theorem 3 shows that coding can be useful: the Gallager-A algorithm can outperform uncoded transmission in total-power in the order sense. However, the gap in total power between the two is merely a multiplicative factor of log log these degrees is ≈ 0.98, which suggests little order-sense improvement over uncoded transmission. Hence, the wiring area at the decoder (particularly, how much better it can be than the bound of Lemma 6) is crucial in determining how much can be gained by using Gallager-B decoding instead of uncoded transmission.
D. The need for new code constructions
In fact, wiring complexity of the code is so crucial that inefficient constructions can cause the lower bound and upper bound of Theorem 4 to match. Recent work of Blake and Kschischang [6] has shown the following theorem, which holds for randomly generated regular-LDPCs and for any sequence of randomly generated LDPC codes which approach capacity.
be a sequence of (regular or capacity-approaching) LDPC codes that are generated randomly with blocklengths
By Theorem 4 then, randomly generated LDPC codes with Gallager-B decoding will have minimum total power that is
Pe , where 0.97 < k < 1, providing little order-sense improvement over uncoded transmission. The authors of [6] highlight the fact that Theorem 5 does not rule out the possibility that there may exist a subset of codes with measure 0 asymptotically that has sub-quadratic wiring area. One open problem that was highlighted at the end of Section IV-A2, specifically the problem of proving tight upper bounds on the crossing number for (even some classes of) semi-regular graphs, could potentially provide an answer to this issue. Thus, while randomly generated codes provide a means to approach Shannon capacity, new code constructions are needed in order to approach total-power capacity.
V. CIRCUIT SIMULATION EXPLORATION FOR FINITE-LENGTH CODES
At reasonable error probabilities (e.g. $10^{-8}$) and short distances (e.g. less than five meters), asymptotic bounds cannot provide precise answers on which codes to use. For example, consider the following problem, shown graphically in Fig. 1(b) .
Problem 1. Suppose we want to design a point-to-point communication system that operates over a given channel. We are given a target error-probability P e , communication distance r, and system data-rate R data that the link must operate at. Which code and corresponding decoding algorithm minimize the total (i.e. transmit + decoding) power?
Since the bounds of Sections II-IV are derived as P e → 0, they may not be applicable to most instances of Problem 1. In this section we therefore develop an optimization paradigm for jointly choosing codes and decoding algorithms to answer specific instances of Problem 1. We focus on one-bit Gallager-A [31] and two-bit [25] decoding algorithms, still restricting the number of algorithmic iterations to
. Because of the effort required in implementing or even simulating a single decoder in hardware, we construct models for power consumed in decoding implementations of different algorithms based on post-layout simulations for simple decoders. The models developed attempt to capture detailed physical aspects (e.g. interconnect lengths
and impedance parameters, propagation delays, silicon area, and power-performance tradeoffs) of implementations, in stark contrast with their theoretical counterparts of Sections II-IV. In Section V-C, we use these models to investigate solutions to some instances of Problem 1. We bridge the gap between the asymptotic bounds and the circuit simulations by comparing the bounds with the behavior of the optimal total power as r is fixed and P e → 0.
A. Note on channels and constellation size
To answer Problem 1 using precise numbers, additional physical assumptions about the channel (e.g. bandwidth, fading, path-loss, temperature, constellation size) are required in comparison to the model of Section II-A. The channel is still assumed to be binary-input AWGN with flat-fading. However, while Section II-A assumes BPSK modulation for all transmissions, due to the introduction of a data-rate constraint and fixed passband bandwidth $W$ (for fair comparison), the constellation size is required to vary based on the code rate. Explicitly, the transmission strategy is assumed to use either BPSK or square-QAM modulation, mapping codeword bits to constellation symbols. We assume the transmitter signals at a rate of $W$ symbols/s and that the minimum square constellation size ($M$) satisfying the system data-rate requirement is chosen: $M$ is always the smallest square of an even integer for which:
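As a rough illustration of this selection rule, the sketch below picks the constellation size under an assumption of ours (the constraint itself is not reproduced in the text here): a rate-$R$ code with an $M$-point constellation at $W$ symbols/s is taken to deliver $W R \log_2 M$ information bits per second. BPSK ($M = 2$) is tried first, followed by squares of even integers. The function and parameter names are hypothetical.

```python
import math

def min_constellation_size(rate, bandwidth_hz, data_rate_bps, max_side=64):
    """Smallest constellation size M meeting the data-rate requirement.

    Assumption (not taken from the paper): the link carries
    bandwidth_hz * rate * log2(M) information bits per second.
    Candidates are BPSK (M = 2) followed by square QAM with
    M = 4, 16, 36, ... (squares of even integers).
    """
    candidates = [2] + [side * side for side in range(2, max_side + 1, 2)]
    for m in candidates:
        if bandwidth_hz * rate * math.log2(m) >= data_rate_bps:
            return m
    raise ValueError("no candidate constellation meets the data rate")

# W = 7 GHz and R_data = 7 Gb/s, the values quoted for Section V-C.
for code_rate in (0.25, 0.5, 0.75, 1.0):
    print(code_rate, min_constellation_size(code_rate, 7e9, 7e9))
```

Under this assumed constraint, lowering the code rate forces a larger constellation, which is the effect the later discussion of rate-1/2 codes relies on.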
For calculating transmit power numbers, the thermal noise variance used is σ N 0 is obtained as a function of the system and channel parameters:
where $\lambda$ is the wavelength of transmission at center frequency $f_c$ in Hz ($\lambda = 3 \times 10^{8} / f_c$). The channel error-probability for BPSK transmissions under this model is:
and the channel error-probability for $M$-ary square QAM is [7]:
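For orientation, the following sketch evaluates textbook error-probability expressions as a function of the linear ratio $E_b/N_0$. It is only an assumption that these match the expressions intended here: the BPSK formula is the standard $Q(\sqrt{2E_b/N_0})$, and the square-QAM formula is the usual nearest-neighbour approximation for Gray mapping, which may differ from the exact form in [7]. All function names are hypothetical.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bpsk_error_prob(ebn0):
    """BPSK bit-error probability on AWGN; ebn0 is linear, not in dB."""
    return q_function(math.sqrt(2.0 * ebn0))

def square_qam_error_prob(m, ebn0):
    """Approximate bit-error probability of Gray-mapped square M-QAM.

    Standard nearest-neighbour approximation; the exact expression
    used in the paper (from its reference [7]) may differ.
    """
    k = math.log2(m)  # bits per symbol
    arg = math.sqrt(3.0 * k * ebn0 / (m - 1))
    return (4.0 / k) * (1.0 - 1.0 / math.sqrt(m)) * q_function(arg)

ebn0 = 5.0  # hypothetical operating point (about 7 dB)
print(bpsk_error_prob(ebn0), square_qam_error_prob(16, ebn0))
```

A quick consistency check is that the approximation with $M = 4$ reduces to the BPSK expression at the same $E_b/N_0$.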
We specifically mention here that the asymptotic bounds from Sections II-IV remain unchanged, even if we substitute $M$-ary QAM for BPSK as the signaling constellation. This follows from the fact that the RHS of equation (17)
For the results presented in Section V-C, we assume the decoding throughput is required to be equal to $R_{\text{data}} = 7$ Gb/s.
We assume a channel center frequency of $f_c = 60$ GHz and a bandwidth of $W = 7$ GHz. The temperature $T$ is 300 K. The distances considered are much larger than the wavelength of transmission (≈ 0.5 cm), so the "far-field approximation" applies.
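The quoted wavelength can be checked directly from these parameters. The short sketch below also evaluates a thermal noise power of $kTW$, which is our assumption about the noise model (the paper's own noise expression is not reproduced above). Variable names are hypothetical.

```python
BOLTZMANN = 1.380649e-23   # J/K
SPEED_OF_LIGHT = 3e8       # m/s, the rounded value used in the text

f_c = 60e9   # center frequency in Hz
W = 7e9      # bandwidth in Hz
T = 300.0    # temperature in K

wavelength_m = SPEED_OF_LIGHT / f_c
# Assumption: thermal noise power over the band is k * T * W.
noise_power_w = BOLTZMANN * T * W

print("wavelength [cm]:", wavelength_m * 100)
print("noise power [W]:", noise_power_w)
print("distance / wavelength at r = 2.8 m:", 2.8 / wavelength_m)
```

At the distances of a few meters considered later, the link spans several hundred wavelengths, which is consistent with the far-field assumption.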
B. Simulation-based models of LDPC decoders
Given a code, decoding algorithm, and desired data-rate, calculating the required decoding power $P_{\text{Dec}}$ is a difficult task. Even within the family of regular LDPC codes and specified decoding algorithms, the decoder can be implemented in myriad ways. The choice of circuit architecture, implementation technology, and even process-specific transistor options can have a significant impact on the decoding power [8], [29]. The models we present here are based on simulations of synchronous, fully-parallel decoding architectures in a 90 nm CMOS process with a standard threshold voltage. While this provides insight (see Section V-C), a solution to any instance of Problem 1 requires optimization of $P_T + P_{\text{Dec}}$ over not just super-exponentially many codes and decoding algorithms, but also all decoder architectures, implementation technologies, and process options.
1) Initial post-layout simulations: Our models for large-degree LDPC codes are constructed based on circuit simulations using the STMicroelectronics 90 nm standard CMOS process with 7 metal layers. First, post-layout simulations of one-bit and two-bit decoders for two simple codes were performed. Both codes were (3, 4)-regular LDPC codes (of girth 6 and 8) that were generated randomly using the guess-and-test algorithm in [28] and [30]. The CAD flow used is detailed in Appendix E, and a diagram of the flow is given in Fig. 5(b). The next section details how these results are generalized to larger codes.
2) Physical model of LDPC decoding: Even within our imposed restrictions on the LDPC code degrees, girth, and number of message-passing bits for decoding, constructing a decoding power model that applies to all combinations of these code parameters requires some simplifying assumptions:
1. Decoders are assumed to operate at a fixed supply voltage (chosen to be 0.6 V, the minimum voltage used in our initial simulations for which all decoders meet timing constraints). We then model the minimum-achievable clock period $T_{\text{CLK}}$, maximum-achievable clock frequency $f_{\text{CLK}}$, and decoding throughput $R_{\text{Dec}}$ for each decoder as functions of $b$, $g$, $d_v$, $d_c$:
In ( We model the components of decoding power as
In (21), $P_{\text{VN}}(\cdot,\cdot)$ and $P_{\text{CN}}(\cdot,\cdot)$ are the power consumed in individual variable and check nodes respectively, and $P_{\text{wire}}(\cdot,\cdot,\cdot,\cdot)$ is the power consumed in a single message-passing interconnect. Note that (21) is a sum of all power consumed in computations and wires of the decoder (the coefficients in (21) count the number of occurrences of each power sink in the decoder). The details of the node power models are given in Appendix G and the details of the wire power models are given in Appendix H.
3) Satisfying the communication data-rate: Fixing the supply voltage for a decoder and using the fastest possible clock speed only allows for a single decoding throughput. Hence, parallelism in order to meet the system data-rate requirement $R_{\text{data}}$ in Problem 1 is also modeled. For example, Fig. 4(a) shows the decoding power vs. decoding throughput for two hypothetical decoders ('A' and 'B') at a fixed supply voltage. If an application demands throughput that is twice A's throughput, two copies of decoder 'A' can be used in parallel. Together, they provide twice the throughput, and require twice the power of a single decoder 'A'. Fig. 4(b) shows the communication system architecture which accommodates this choice. Two separate codewords are transmitted at twice the throughput of a single decoder 'A', and a multiplexer at the receiver passes a separate codeword to each of the parallel decoders, which decode the two codewords independently. We therefore allow any integer multiple of points on the curves of Fig. 4(a) to be achieved using this strategy. Though making such a design choice in practice would introduce additional hardware and a slight power consumption overhead, we ignore this cost in our analysis.
What if we want a decoding throughput that is, say, 1.5 times the throughput for a single decoder? In other words, can we interpolate between the points 'A' and '2A' in Fig. 4(a)? In cases where integer multiples of a single decoder's throughput do not exactly reach $R_{\text{data}}$, we first find the minimum number of parallel decoders that when combined exceed the required throughput. Calling this minimum number of decoders $Q$, we then assume that the clock frequency of each of the parallel decoders is slowed down until the overall throughput of the parallel combination is exactly $R_{\text{data}}$. Explicitly, the formula to determine this "underclocked" frequency $f_u$ is:
Because the decoding power is modeled as being linearly proportional to the decoder clock frequency (see Section V-B2 and Appendices G-H), we multiply each individual decoder's power by the appropriate frequency scaling factor $\kappa = \frac{f_u}{f_{\text{CLK}}(b, g, d_v, d_c)}$, and then multiply the result by the number of parallel decoders to get the total power of the parallel combination:
Substituting (18) and using the explicit formula above for $f_u$, we replace $\kappa$ and carry out some algebra to obtain:
Hence, we assume that any (throughput, power) point on the line through 'A' and '2A' in Fig. 4(a) can be achieved in this manner (with the obvious exception of points that have negative throughput and power). Therefore, in our analysis in Section V-C, we assume the decoding throughput is exactly $R_{\text{data}}$ and we use the decoding power numbers obtained through this linear interpolation between the modeled points.
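The parallelization and underclocking steps described above can be sketched as follows. The sketch assumes, as the text does for power, that throughput also scales linearly with clock frequency; the function name, argument names, and the numbers in the example are hypothetical and are not simulation results.

```python
import math

def parallel_decoder_power(p_single, r_single, f_clk, r_data):
    """Power of Q underclocked parallel decoders delivering exactly r_data.

    p_single : power of one decoder at its maximum clock frequency (W)
    r_single : throughput of one decoder at that frequency (bit/s)
    f_clk    : maximum clock frequency of one decoder (Hz)
    r_data   : required system data-rate (bit/s)
    Assumes throughput and power both scale linearly with clock frequency.
    """
    q = math.ceil(r_data / r_single)        # minimum number of decoders
    f_u = f_clk * r_data / (q * r_single)   # underclocked frequency
    kappa = f_u / f_clk                     # frequency scaling factor
    total_power = q * kappa * p_single      # = p_single * r_data / r_single
    return q, f_u, total_power

# Hypothetical decoder 'A': 10 mW at 4 Gb/s with a 500 MHz clock,
# asked to deliver 1.5 times its native throughput.
q, f_u, total = parallel_decoder_power(10e-3, 4e9, 500e6, 6e9)
print(q, f_u, total)
```

Because the factor $Q$ cancels, the total power depends only on the ratio of the required data-rate to the single-decoder throughput, which is why every point lies on the straight line through 'A' and '2A'.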
4) Comparing different coding strategies: Now, given a subset of codes and decoders, how should a system designer jointly choose a code and decoding algorithm to minimize the total system power? Within the channel model of Section V-A, consider specific instances of Problem 1: let the path-loss coefficient $\alpha$ and $R_{\text{data}}$ be fixed. Then, for each choice of $(r, P_e)$, we can compare the required $P_T + P_{\text{Dec}}$ for each combination of code and decoding algorithm modeled in Section V-B2, and find the minimizing combination. A flowchart detailing the steps in this analysis is given in Fig. 5(a).
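The comparison at a single $(r, P_e)$ point amounts to an argmin over the modeled code-decoder pairs. The sketch below shows only that selection step; the two candidate strategies and their power functions are made-up placeholders, not models or results from this paper, and all names are hypothetical.

```python
def best_strategy(candidates, r, p_e):
    """Return (name, total_power) minimizing P_T + P_Dec at (r, p_e).

    candidates maps a strategy name to a pair of callables
    (transmit_power(r, p_e), decoding_power()). Both are hypothetical
    stand-ins for the channel model of Section V-A and the decoder
    power models of Section V-B2.
    """
    totals = {
        name: transmit_power(r, p_e) + decoding_power()
        for name, (transmit_power, decoding_power) in candidates.items()
    }
    name = min(totals, key=totals.get)
    return name, totals[name]

# Placeholder numbers for illustration only (path-loss exponent 3).
toy_candidates = {
    "uncoded": (lambda r, p_e: 1e-3 * r**3, lambda: 0.0),
    "coded": (lambda r, p_e: 2e-4 * r**3, lambda: 5e-3),
}

for distance in (1.0, 3.0):
    print(distance, best_strategy(toy_candidates, distance, 1e-8))
```

Repeating this selection over a grid of $(r, P_e)$ values gives the kind of region map that a contour plot of minimizing strategies would display.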
C. Example: 60 GHz point-to-point communication
An example plot which shows the minimum achievable $P_T + P_{\text{Dec}}$ for different $P_e$ values at a fixed distance $r = 2.8$ m and $\alpha = 3$ is given in Fig. 6. The plot also shows the curve of the optimizing $P_T$ which achieves the minimum $P_T + P_{\text{Dec}}$, and the Shannon limit [50] for the AWGN channel. The horizontal distance between the optimizing $P_T$ curve and the $P_T + P_{\text{Dec}}$ curve in Fig. 6 corresponds to the optimizing $P_{\text{Dec}}$. As $P_e$ decreases, this decoding power increases, indicating an increase in the total-power-minimizing decoder's complexity.
1) Connection with asymptotic bounds: To investigate at what $P_e$ values the asymptotic bounds on $P_{\text{wires}}$ from Section IV-B become relevant, an example plot including the upper and lower bounds on the wiring power of the optimizing decoder at a fixed distance $r = 4.4$ m and $\alpha = 3$ is given in Fig. 6(b). This particular distance is chosen for the plot, since the optimizing decoders happen to be 1-bit Gallager-A decoders, allowing for a fair comparison with the results of Theorem 3. Because multiplicative constants are not modeled in the asymptotic analysis, the bounds are scaled by the appropriate constant (the power consumed in a single interconnect, found using the models of Section V-B2) to facilitate comparison of the behavior of total power curves in Fig. 6(b). The optimal total power curve sits within the upper and lower bounds on the wire model. The gap between the upper and lower bounds increases significantly as $P_e \to 0$, which is expected given the multiplicative gap (linear in blocklength) between the two.
2) Joint optimization over code-decoder pairs: Based on the previous sections, it is clear that the form of the total power curve changes with communication distance. For improved understanding, we use two-dimensional contour plots in the $(r, P_e)$ space to evaluate choices of codes and decoders, as suggested by Fig. 1(b). An example is shown in Fig. 7(a), which compares code and decoding algorithm choices for path-loss coefficient $\alpha = 3$. In the top plot, the contours represent regions in the $(r, P_e)$ space where specific combinations minimize $P_T + P_{\text{Dec}}$, and in the bottom plot, regions in the $(r, P_e)$ space are divided based on the value of the minimum $P_T + P_{\text{Dec}}$. The best choices for these instances of Problem 1 turn out to be rate-$\frac{1}{2}$ codes. Lower rate codes require large constellations for a 7 Gb/s data-rate, thus requiring large transmit power for the same $p_0$, and higher rate codes require larger decoding power due to increased complexity and size of higher degree nodes.
Some tradeoffs between $P_T + P_{\text{Dec}}$ and code and decoder complexity can also be observed in Fig. 7(a): to achieve minimum power, the number of message-passing bits $b$ must increase with $r$, and the code girth $g$ must increase with decreasing $P_e$.
How does the inclusion of uncoded transmission as a possible strategy change the picture? Contour plots with uncoded transmission included are given in Fig. 7(b). Comparing Fig. 7(a) with Fig. 7(b), we see that when uncoded transmission is included, it overtakes areas in the $(r, P_e)$ space where $P_e$ is high and $r$ is very small. However, Fig. 7(b) suggests that simple codes and decoders can still outperform uncoded transmission at reasonably low $P_e$ and distances of several meters or more.
VI. CONCLUSIONS AND DISCUSSIONS
We have developed a framework to determine total (transmit + decoding) power for regular LDPC codes. We then compared the derived results with known fundamental limits. If only node power is considered, we show that it is order-optimal to transmit with bounded power with the Gallager-B decoding algorithm. Further, the minimum total power increases as $\log\log\frac{1}{P_e}$, which matches the fundamental limits on this component of power from [21]. However, if wiring power is considered, it turns out that transmitting at bounded transmit power is suboptimal, and in fact even worse than just using uncoded transmission. This suggests that measuring complexity of decoding by merely counting the number of operations (e.g. [55], [37], [4]) is insufficient for understanding system-level power consumption.
We then asked the question as to when coding can indeed be useful in achieving an order-sense improvement in total power over uncoded transmission. It turns out that in the wire model, achieving any order-sense advantage over uncoded transmission requires that both transmit and decoding power diverge to $\infty$ as $P_e \to 0$, which calls into question the assumption that one should approach Shannon capacity in order to minimize power.
Our work highlights an important question (see Section IV-D) that has received little attention in the coding-theoretic literature: design of codes that have good performance while maintaining small wiring area (see [45] , [19] , [5] , [6] , [35] ). For wire power consumption, there is a significant gap between the bounds on power consumed by LDPC decoders derived here, and the fundamental limits derived in [21] . Nevertheless, it is entirely possible that regular LDPC codes do not achieve these fundamental limits, even in the order sense. It is therefore important to investigate wiring complexity and power consumption of other coding families, such as polar codes [4] , and convolutional codes [42] .
The simulation-based estimates of decoding power presented in Section V suggest that coding can be useful for minimizing total power, even in short-distance settings. For instance, they predict that simple regular LDPC codes can achieve lower error probabilities than uncoded transmission in short-distance settings while still consuming the same total power, even at distances as low as 2 meters. However, in these regimes, it is possible that "classical" coding families (e.g. Hamming or Reed-Solomon codes [42]) might be even more efficient, and hence they need to be examined as well.
(6)], the error-probability in the $i$th decoding iteration, denoted by $p_i$, follows
Since the RHS of (27) is differentiable with respect to $p_{i-1}$, by Taylor's theorem there exists a real function $R_1(x)$ with $\lim_{x \to 0} R_1(x) = 0$ such that:
The RHS of (28) is the first-order Maclaurin expansion of $p_i$. Further, the remainder term $R_1(p_{i-1})$ has Lagrange form:
where $x^* \in (0, p_{i-1})$. Calculating the second derivative of $p_i$ with respect to $p_{i-1}$, we find that it is:
It is easily verified that the derivative of the RHS of (30) with respect to $p_{i-1}$ is nonnegative for $p_{i-1} \in [0, \frac{1}{2}]$. Therefore, the RHS of (30) is maximized at $p_{i-1} = \frac{1}{2}$, which gives the bound $\frac{d^2 p_i}{d p_{i-1}^2} \le 0$. Hence, we always have
Plugging (31) into (28) and applying the resulting relation recursively, we obtain an upper bound on p i :
Similarly, the RHS of (30) is minimized at p i−1 = 0. Plugging back into (29), we obtain the following:
Using (33) in (28), applying the relation recursively, and combining it with the bound of (32):
Note the leftmost term of (34) is negative when $p_0 > \frac{1}{(d_v-1)(d_c-1)}$, the stability threshold for Gallager-A decoding. However, there is some constant $K_A > 0$ such that transmit power $P_T \ge K_A$ implies $p_0 < \frac{1}{(d_v-1)(d_c-1)}$. Notice that if $P_T' = \max\{P_T, K_A\}$, then $P_T' = \Theta(P_T)$. Hence, we can assume all sides of (34) are non-negative. Applying the Mills ratio bounds [32] on the $p_0$ terms:
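The behavior analyzed above can be explored numerically. The sketch below iterates the standard density-evolution recursion for Gallager-A decoding on a binary symmetric channel, which we assume is the relation referred to as (27); the function names and the example parameters ($d_v = 3$, $d_c = 6$, $p_0 = 0.01$) are hypothetical choices for illustration.

```python
def gallager_a_step(p_prev, p0, dv, dc):
    """One Gallager-A density-evolution step on a BSC with flip prob. p0.

    Assumed to be the recursion denoted (27) in the text:
    p_i = p0 - p0 * ((1 + s) / 2)^(dv-1) + (1 - p0) * ((1 - s) / 2)^(dv-1),
    where s = (1 - 2 * p_{i-1})^(dc-1).
    """
    s = (1.0 - 2.0 * p_prev) ** (dc - 1)
    checks_all_correct = ((1.0 + s) / 2.0) ** (dv - 1)
    checks_all_wrong = ((1.0 - s) / 2.0) ** (dv - 1)
    return p0 - p0 * checks_all_correct + (1.0 - p0) * checks_all_wrong

def iterations_to_reach(p_target, p0, dv, dc, max_iter=200):
    """Number of iterations until p_i <= p_target, or None if not reached."""
    p = p0
    for i in range(1, max_iter + 1):
        p = gallager_a_step(p, p0, dv, dc)
        if p <= p_target:
            return i
    return None

dv, dc, p0 = 3, 6, 0.01
print("stability threshold:", 1.0 / ((dv - 1) * (dc - 1)))
print("iterations to reach 1e-8:", iterations_to_reach(1e-8, p0, dv, dc))
# Near p = 0 the map is approximately linear with slope p0*(dv-1)*(dc-1).
print("local contraction ratio:", gallager_a_step(1e-6, p0, dv, dc) / 1e-6)
```

Below the stability threshold the error probability contracts by roughly the factor $p_0 (d_v-1)(d_c-1)$ per iteration, so the iteration count grows only logarithmically in $1/P_e$; above the threshold the recursion does not converge to zero.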
Inverting all sides of (35), taking $\log(\cdot)$ on all sides, and replacing $p_i$ by $P_e$ and $i$ by $N_{\text{iter}}$:
For some constants c 
Gallager [31, Eqn. 4.17] shows that the $\frac{j-1}{2}$th-order Maclaurin expansion is
Here, $R_B(p_{i-1})$ takes the form
where $x^* \in (0, p_{i-1})$. Clearly, the
22 for some constants c 1 , c 2 . Plugging (40) into (38),
Hence we immediately have
Applying (42) recursively
Therefore, using $f \lesssim g$ and $f \gtrsim g$ as shorthand for $f = O(g)$ and $f = \Omega(g)$ respectively,
The constant in (43) is
, and the constant in (44) is
Inverting both sides of (43) and (44), taking $\log(\cdot)$ on both sides, and replacing $i$ by $N_{\text{iter}}$ and $p_i$ by $P_e$
Because N iter ≥ 0, the sum of the log c l B terms of opposite sign on the RHS of (45) will be non-positive. We can therefore loosen the upper bound of (45) by canceling them
We can further loosen the upper bound of (46) by noticing
Plugging in this upper bound back into (46)
1 p0 with log √ 4πηP T + ηP T and dividing all sides of (47) by
Taking log(·) on all sides of (48) we obtain
And dividing all sides of (49) by $\log\frac{d_v-1}{2}$, we obtain
The result now follows for small P e and large P iter denote the number of independent Gallager-A decoding iterations required for a given regular LDPC code to achieve error-probability P e . Via Theorem 2 and Lemma 1, the total power is lower bounded by
Similarly, there is an upper bound for the total power
It follows that if $P_T$ is not increased unboundedly as $P_e \to 0$, then the required decoding power diverges as a power of $\frac{1}{P_e}$, which is exponentially larger than the power required for uncoded transmission. In order to find the optimizing transmit power, let $L_{P_e}(P_T)$ denote the function in the $\Omega$ expression of (51) and let $U_{P_e}(P_T)$ denote the function in the $O$ expression of (52):
We start by analyzing the lower bound. To find the $P_T$ which minimizes $L_{P_e}$, we differentiate $L_{P_e}$ and set the derivative to 0
Now, let P = P T γ log 1 Pe η . Substituting into (55), we get
log Pe
The positive, real-valued solution to (58) is given by the principal branch $W_0(\cdot)$ of the Lambert W function [23]. Explicitly, when $x, z \in \mathbb{R}^+$ satisfy the relation $x = z e^{z}$, we say $z = W_0(x)$. Hence we can write
Rewriting $P$ in terms of $P_T$ we find the optimizing transmit power
The first two terms in the asymptotic expansion of $W_0(x)$ as $x \to \infty$ are $\log(x) - \log\log(x)$ [23]. In fact, $\forall x \ge e$ [36]:
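The definition of $W_0$ and its leading asymptotic terms can be checked numerically. The sketch below computes $W_0(x)$ for $x > 0$ with Newton's method (a generic numerical choice of ours, not a procedure from the paper) and compares it with $\log x - \log\log x$. It also checks the elementary sandwich $\log x - \log\log x \le W_0(x) \le \log x$ for $x \ge e$, which follows from $W_0(x) + \log W_0(x) = \log x$ and is not claimed to be the sharper bound cited from [36]. The function name is hypothetical.

```python
import math

def lambert_w0(x, tol=1e-12, max_iter=100):
    """Principal branch W_0(x) for x > 0, via Newton's method on z*e^z = x."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    z = math.log1p(x)  # starting point at or above the root
    for _ in range(max_iter):
        ez = math.exp(z)
        step = (z * ez - x) / (ez * (z + 1.0))
        z -= step
        if abs(step) <= tol * max(1.0, abs(z)):
            break
    return z

for x in (math.e, 10.0, 1e3, 1e6):
    w = lambert_w0(x)
    two_term = math.log(x) - math.log(math.log(x))
    print(x, w, two_term, math.log(x))
```

The two-term expansion approaches $W_0(x)$ slowly, which is why the constants and lower-order terms are dropped only at the end of the derivation.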
Using (63) 
Ignoring constants and non-dominating terms in the denominators of (64), we get the lower bound of Theorem 3:
An identical minimization of $U_{P_e}$ in (53) reveals that the optimizing transmit power is upper bounded as
Ignoring constants and non-dominating terms in the denominators of both the transmit and decoding power in (65), we get iter denote the number of independent Gallager-B decoding iterations required for a given regular LDPC code to achieve error-probability P e . The total power is
where (a) holds from Theorem 2, and (b) holds from Lemma 2. Using the upper bound from Theorem 2, we also know,
Then, considering the bounds on γ in Theorem 2, we examine the exponent of
Because $d_c > d_v$ for any regular LDPC code,
Also, because $d_v > 3$ for Gallager-B decoding, the denominator of (68) can be bounded by:
Thus, using (69) and (70) in (68),
Substituting this lower bound on the exponent back into (66), we obtain:
If the transmit power is bounded even as $P_e \to 0$, then the total power (and the decoding power) diverges
Differentiating the expressions on the RHS of (71) with respect to $P_T$ and setting the result to zero, the minimum total power is lower bounded as:
Moving to the upper bound, via Theorem 2, we find that the exponent of
Then substituting (73) into (67), we get the bound
Differentiating the expressions inside the $O(\cdot)$ of (74) with respect to $P_T$ and setting the result to zero, we obtain the upper bound of Theorem 4.
APPENDIX E CAD FLOW DETAILS FOR SIMULATION AND POWER ESTIMATION
Decoders are constructed in a hierarchical manner, where behavioral Verilog descriptions of variable and check nodes are mapped to standard cells using logic synthesis and then placed and routed. Then, these nodes are connected according to the parity-check matrix of the codes using place and route, resulting in fully-parallel layouts for the decoders. Post-layout simulation is then performed, using extracted RC parasitics and typical corners for the ST 90 nm CMOS process. Three components of the total power consumption (computational nodes, message-passing interconnects, and global clock-tree) are isolated by means of post-layout netlist modification. Starting from a nominal supply voltage of 1.2 V and going down to a minimum of 0.6 V, power reduction steps are taken by means of supply voltage and frequency scaling. During this process, no changes to the circuit architecture or transistor sizing in the decoder cells are made.
Initial simulations are performed assuming that the received sequence at the output of the channel is all-zero. An assumption of an all-zero transmitted sequence (codeword) is easy to justify in theory [48] due to the linearity of the code, the symmetry of the channel, and the symmetry of the decoders with respect to ones and zeros. In practice, however, the received sequence of bits has errors, and could therefore require more switching activity at the decoder than these all-zero simulations indicate. However, since the target error probabilities we consider (see Section V-C) correspond to channel flip probabilities of $p_0 \approx 10^{-2}$ or less, the expected number of errors in the received sequence is small and the extra switching activity caused by bit flips is ignored. Assuming a random initial state for gates and wires, the power consumption for all valid codewords (that satisfy the parity-check constraints) should be the same. Thus, assuming small $p_0$, power consumption for any likely sequence received from the channel is estimated (with slight underestimation) by simply measuring the required power for the all-zero codeword.
It is assumed that all decoders operate at the minimum clock period T 
APPENDIX G CIRCUIT MODEL FOR COMPUTATION POWER
The power consumption of a logic gate is approximately proportional to the total capacitance of the gate, but it consists of both dynamic power (which is also proportional to the activity-factor at the input of the gate), and static power (which has no dependence on the activity-factor) [47] . In simulation, the static power consumption of decoders at 0.6V supply is observed to be on average less than 5% of the total decoding power, and it is therefore ignored for these models. for Gallager-A nodes (P VN (1, 6, 3, 4) ) and (P CN (1, 6, 3, 4) ), scaled by the appropriate ratio of logic gate area to account for different capacitance. Given the small channel error probabilities considered in this work, it is assumed that most errors are corrected within the initial iterations of decoding. Under Gallager-A decoding, it is assumed that messages stabilize within 1 iteration and under two-bit decoding, it is assumed to take 2 iterations since the "strength" bits take an extra iteration to stabilize when an error is corrected. For these first few iterations, worst-case switching activity is assumed, but after these iterations, flip-flop access power (which occurs at every clock cycle) in variable nodes is assumed to be the only dynamic power consumed in the decoder. This gives an effective activity-factor for the rest of the decoder of a(b, g) = in the decoder is modeled using the formula for the dynamic power consumed in interconnects [47] :
