Using the Thompson circuit complexity model, it is shown that fully parallel encoding and decoding schemes with asymptotic block error probability that scales as $O(f(n))$ have energy that scales as $\Omega\big(n\sqrt{-\ln f(n)}\big)$. In addition, it is shown that the number of clock cycles $T(n)$ required for any encoding or decoding scheme that reaches this bound must scale as $T(n) \ge \Omega\big(\sqrt{-\ln f(n)}\big)$. Similar scaling results are extended to serialized computation. A similar approach is extended to three dimensions by generalizing the Grover information-friction energy model. Within this model, it is shown that encoding and decoding schemes with probability of block error $P_e(n)$ consume at least $\Omega\big(n(-\ln P_e(n))^{1/3}\big)$ energy.
I. INTRODUCTION
Expanding on work started in [1] and more recently advanced in [2]-[4], we use a computational complexity model introduced in [5] that allows us to consider fundamental tradeoffs between the asymptotic energy, number of clock cycles, and block error probability for sequences of encoders and decoders.
As we define more formally later, an f (n)-coding scheme is a sequence of codes of increasing block length n in which block error probability scales as O( f (n)).
We show that all fully parallel $f(n)$-coding schemes with $T(n)$ clock cycles have encoding and decoding energy $E(n)$ that scales as $E(n) \ge \Omega\left(\frac{-n \ln f(n)}{T(n)}\right)$. We show that the energy-optimal number of clock cycles $T(n)$ for encoders and decoders scales as $O\big(\sqrt{-\ln f(n)}\big)$, giving a universal energy lower bound of $\Omega\big(n\sqrt{-\ln f(n)}\big)$. A consequence of our result is that $e^{-cn}$-coding schemes have encoding and decoding energy that scales at least as $n^{3/2}$ with an energy-optimal number of clock cycles that scales as $n^{1/2}$. This approach is generalized to serial implementations.
Recent work on the energy complexity of error control coding circuits has focused largely on planar circuits. However, circuits implemented in all three spatial dimensions exist [6], and so we generalize the recent information friction (or bit-meters) model introduced by Grover [3] to circuits implemented in three dimensions to show that, in terms of block length $n$, a bit-meters coding scheme in which block error probability is given by $P_e(n)$ has encoding/decoding energy that scales as $\Omega\big(n(-\ln P_e(n))^{1/3}\big)$. We show how this approach can be generalized to an arbitrary number of dimensions.
In Section II we discuss prior work, and in particular we discuss existing results on complexity lower bounds for different models of computation for different notions of "good" encoders and decoders. The main technical results of this work are in Section III, where we study fully parallel lower bounds within the Thompson model, Section IV, where we study serial lower bounds, and in Section V, where we study a multidimensional generalization of the Grover bit-meters model. In these sections we present lower bounds for decoders, as the derivation for encoding lower bounds is almost exactly the same. We provide an outline of the technique for encoder lower bounds in Section VI. In Section VII we discuss limitations and weaknesses in the model used. In Section VIII, we discuss other energy models of computation. In Section IX we discuss possible future work, and conjecture that similar tradeoffs may extend to circuits that perform probabilistic inference.
Notation: We use standard Bachmann-Landau notation in this paper. For any non-negative real-valued functions $f(x)$ and $g(x)$, the statement $f(x) = O(g(x))$ (or equivalently $f(x) \le O(g(x))$) means that for sufficiently large $x$, $f(x) \le c\,g(x)$ for some positive constant $c$. The statement $f(x) = \Omega(g(x))$ (or equivalently $f(x) \ge \Omega(g(x))$) means that for sufficiently large $x$, $f(x) \ge c\,g(x)$, again for some positive constant $c$. The statement $f(x) = \Theta(g(x))$ means that there are two positive constants $b$ and $c$ such that $b \le c$ and for sufficiently large $x$, $b\,g(x) \le f(x) \le c\,g(x)$.
II. PRIOR RELATED WORK: COMPUTATIONAL COMPLEXITY LOWER BOUNDS FOR GOOD DECODERS AND ENCODERS
The earliest work on computational complexity lower bounds for error control coding circuits comes from Savage in [7] and [8], who considered bounds on the memory requirements and number of logical operations needed to compute decoding functions. However, wiring area is a fundamental cost of coding circuits, and these works do not consider it.
0018-9448 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
More recently, in [1], the authors use a model similar to our model, except that the notion of "area" they use is the size of the smallest rectangle that completely encloses the circuit under consideration. In [2], Grover et al. consider the same model that we do, and find Thompson energy lower bounds as a function of block error probability for encoders and decoders. Our analysis of the Thompson model differs from the approach of Grover et al. in a number of ways. Firstly, central to the work of Grover et al. is a bound on block error probability when the number of inter-subcircuit bits communicated is low (presented in Lemma 2 in the Grover et al. paper), which is analogous to our result in (4) of the proof of Theorem 1. Our result simplifies this relationship using probability arguments. Secondly, the Grover et al. paper does not present the energy-optimal number of clock cycles in terms of asymptotic probability of block error, nor does it present the fundamental tradeoff between number of clock cycles, energy, and reliability within the Thompson model that we present in this paper. Moreover, the technique of [2] does not extend to serial implementations.
In [4] we considered the corner case of decoding schemes in which block error probability asymptotically was less than $1/2$ for serial and parallel decoding schemes. We did not, however, analyze schemes in terms of the rate at which block error probability approaches 0, nor did we compute the energy-optimal number of clock cycles as we do herein.
There has also been some work on complexity scaling rules for encoding and decoding of specific types of codes. Low-density parity-check coding VLSI scaling rules have been studied in [9] and [10] and polar coding scaling rules have been studied in [11] . The scaling rules presented in this paper are general and apply to any code.
Another computational model that has proven more tractable than the Turing time complexity model is the constant depth circuit model (see [12] for a detailed description of this model). Super-polynomial lower bounds on the size of constant depth circuits that compute certain notions of "good encoding functions" (though not decoding) were derived in [13]. In this case, the notion of "good" considered was the ability to correct at least $\Omega(n)$ errors at rates asymptotically above 0. Similar related work exists in [14], which discovered lower bounds on the formula-size of functions that perform good error control coding; similar bounds were later discovered in [15].
III. THOMPSON MODEL

A. Circuit Model
The model we will consider derives from Thompson [5] . A description of this model can also be found in [16, Ch. 2] . The specific model we consider has been studied in [2] , [4] , [9] , and [10] . The reader should refer to [4] for details of the model. The important parameters to be extracted from the model are A, the circuit area, and T , the number of clock cycles in a computation. In this paper we also introduce the symbol q, which is the switching activity factor: this is the fraction of the circuit that switches, on average, in each clock cycle. Since in this paper we are only concerned with scaling rules, we assume that both the technology constant and the wire width considered in [2] and [4] are equal to 1. The energy of a computation is thus defined as E = q AT .
Note that a circuit can be associated with a graph in the natural way, in which a wire corresponds to an edge of the graph and a node corresponds to a vertex. An edge connects two vertices if their associated nodes are connected by wires. A diagram of a small circuit next to its associated graph is given in Fig. 1 .
Lemma 2, presented below, is derived in [5]. It relates the area of a circuit to its graph's minimum bisection width, and is a key component of our Thompson model circuit lower bounds.
B. Definitions and Lemmas
A key concept of this paper is that of an f (n)-coding scheme.
Definition 1: An $f(n)$-coding scheme is a sequence of codes of increasing block length $n$, together with a sequence of encoders and decoders, in which the block error probability associated with the code of block length $n$ scales as $O(f(n))$, with rates bounded below by a constant greater than 0.
We let A(n), T (n), q(n), and E(n) = q(n)A(n)T (n) be the area, number of clock cycles, switching activity factor, and energy of the encoder/decoder for the code of block length n. Whether these quantities refer to the encoder or decoder properties will be clear from context. Moreover, we will usually suppress these quantities' dependence on n.
We now present a sequence of definitions and lemmas similar to those found in [2] and [4] .
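Lemma 1 [4]: Suppose that $X$, $Y$, and $\hat{X}$ are random variables that form a Markov chain $X \to Y \to \hat{X}$, that $X$ takes on values from a finite alphabet $\mathcal{X}$ with a uniform distribution (i.e., $P(X = x) = 1/|\mathcal{X}|$ for all $x \in \mathcal{X}$), that $Y$ takes on values from a finite set $\mathcal{Y}$, and that $\hat{X}$ takes on values from a set $\hat{\mathcal{X}}$. Suppose as well that $\mathcal{X} \subseteq \hat{\mathcal{X}}$. Then:
$$P(\hat{X} \ne X) \ge 1 - \frac{|\mathcal{Y}|}{|\mathcal{X}|}.$$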
Remark 1: We will interpret $X$ as the set of symbols a particular subcircuit will need to estimate, $\hat{X}$ as that subcircuit's estimate of those symbols, and $Y$ as the bits injected into the subcircuit during the computation. Note that this result mirrors the result of [3, Lemma 4]. In that lemma, the author proves that if a circuit has $r/3$ bits to make an estimate $\hat{X}$ of a random variable $X$ that is uniformly distributed over all binary strings of length $r$, then that circuit makes an error with probability at least $1/9$. Our lemma presented here includes this lemma as a special case by setting $|\mathcal{Y}| = 2^{r/3}$ and $|\mathcal{X}| = 2^{r}$. In this case we can infer:
$$P(\hat{X} \ne X) \ge 1 - \frac{|\mathcal{Y}|}{|\mathcal{X}|} = 1 - 2^{-2r/3} \ge 1 - 2^{-2/3} > \frac{1}{9},$$
where the last two inequalities are implied by $r \ge 1$.
Proof: (of Lemma 1) See [4, Lemma 2] . This flows from a simple application of the law of total probability and the definition of a Markov chain.
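The bound in Lemma 1 can be checked numerically on a small instance. The sketch below assumes the bound has the form $P(\hat{X} \ne X) \ge 1 - |\mathcal{Y}|/|\mathcal{X}|$ (the form consistent with Remark 1) and, for simplicity, restricts attention to a deterministic map from $X$ to $Y$; the function names are illustrative and do not come from [4].

```python
import random
from collections import defaultdict

def best_error_probability(g, x_size):
    """Smallest achievable P(X_hat != X) when X is uniform on range(x_size)
    and the estimator only observes Y = g[X] (a deterministic map)."""
    preimages = defaultdict(list)
    for x in range(x_size):
        preimages[g[x]].append(x)
    # Given Y = y, every x in the preimage of y is equally likely, so the
    # best estimator is correct for exactly one x per observed value of y.
    correct = len(preimages)
    return 1 - correct / x_size

def lemma1_bound(x_size, y_size):
    """Assumed form of the Lemma 1 lower bound on the error probability."""
    return 1 - y_size / x_size

random.seed(0)
r = 6
x_size, y_size = 2 ** r, 2 ** (r // 3)   # |X| = 2^r, |Y| = 2^(r/3)
g = [random.randrange(y_size) for _ in range(x_size)]
err = best_error_probability(g, x_size)
print(err, lemma1_bound(x_size, y_size))
```

Because at most $|\mathcal{Y}|$ distinct values of $Y$ can be observed, no estimator can be correct on more than $|\mathcal{Y}|$ of the $|\mathcal{X}|$ equally likely inputs, which is the counting fact behind the bound.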
Definition 2: A bisection of a set of vertices $V \subseteq V_G$ of a graph $G = (V_G, E_G)$ is a set of edges $E \subseteq E_G$ that, once removed from the graph, results in two disconnected subgraphs with vertices $V_1$ and $V_2$ such that $\big||V \cap V_1| - |V \cap V_2|\big| \le 1$.
That is, it is a set of edges that, once removed, divides the vertices of $V$ roughly in half. The minimum bisection width of a set of vertices $V$ is the size of a smallest bisection.
Note that since a circuit is associated with a graph, we can discuss such a circuit's minimum bisection width, that is the minimum bisection width of the graph with which it is associated. Herein we will consider bisecting the output nodes of a circuit.
Lemma 2: All circuits whose associated graphs have minimum bisection width $\omega$ have circuit area $A \ge \omega^2/4$.
Proof: See Thompson [5], or the generalization derived in [9, Lemma 10].
We now discuss the notion of nested minimum bisection, a concept introduced by Grover et al. [2] and also used in [4] which we again present here so the paper is self contained.
Suppose that a circuit has $k$ output nodes. If the output nodes of such a circuit are minimum bisected, this results in two disconnected subcircuits each with roughly $k/2$ output nodes. These two subcircuits can each have their output nodes minimum bisected again, resulting in four disconnected subcircuits, now each with roughly $k/4$ output nodes.
Definition 3: This process of nested minimum bisections on a circuit, when repeated $r$ times, is called performing $r$ stages of nested minimum bisections. In the case of this paper, the set of nodes to be minimum bisected will be the output nodes. We may also refer to this process as performing nested bisections, and a circuit under consideration in which nested bisections have been performed as a nested bisected circuit. Note that we will omit the term "minimum" in discussions of such objects, as this is implicit.
Note that associated with an $r$-stage nested bisected circuit are $2^r$ subcircuits. Note as well that once a subcircuit has only one node, it does not make sense to bisect that subcircuit again. Suppose we are nested-bisecting the $k$ output nodes of a circuit. In this case, one cannot meaningfully nested-bisect the output nodes of a circuit $r$ times if $2^r > k$.
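The bookkeeping of nested bisections can be sketched as follows. This is only an illustration of the subcircuit counts and sizes: it splits a list of output nodes in half repeatedly and ignores the wiring entirely, whereas a true minimum bisection also chooses which nodes go in each half so as to cut the fewest edges. The function name `nested_bisect` is hypothetical.

```python
def nested_bisect(output_nodes, r):
    """Split the output nodes into 2**r groups by r stages of halving.

    Only group sizes are modelled here; the choice of which nodes land in
    which half (the 'minimum' part of a minimum bisection) is not.
    """
    nodes = list(output_nodes)
    if 2 ** r > len(nodes):
        raise ValueError("cannot perform r stages of bisection when 2**r > k")
    groups = [nodes]
    for _ in range(r):
        next_groups = []
        for group in groups:
            half = (len(group) + 1) // 2
            next_groups.append(group[:half])
            next_groups.append(group[half:])
        groups = next_groups
    return groups

groups = nested_bisect(range(10), 2)
print([len(g) for g in groups])
```

With $k = 10$ and $r = 2$ this produces four groups whose sizes differ by at most one, matching the "roughly $k/2^r$ output nodes" description.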
Note that each of the $2^r$ subcircuits induced by the $r$-stage nested bisection may have some internal wires, and also wires that were deleted and connect to nodes in other subcircuits. We can index the $2^r$ subcircuits with the symbol $i$.
Definition 4: Let the number of wires attached to nodes in subcircuit $i$ that were deleted in the nested bisections be $f_i$. This quantity is the fan-out of subcircuit $i$.
We shall also consider the bits communicated to a given subcircuit.
Definition 5: Let $b_i = f_i T$, where we recall that $T$ is the number of clock cycles used in the running of the circuit under consideration. This quantity is called the bits communicated to the $i$th subcircuit.
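We can now define an important quantity.
Definition 6: The quantity $B_r = \sum_{i=1}^{2^r} b_i$ is called the inter-subcircuit bits communicated.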
Note that each subcircuit induced by the nested bisections will have close to $k/2^r$ output nodes within it (a consequence of choosing to bisect the output nodes at each stage); however, each may have a different number of input nodes.
Definition 7: This quantity is called the number of input nodes in the $i$th subcircuit and we denote it $n_{\text{in},i}$.
Note that $\sum_{i=1}^{2^r} n_{\text{in},i} = n$ for all valid choices of $r$. That is, the sum over the number of input nodes in each subcircuit is the total number of input nodes in the original circuit.
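The per-subcircuit quantities defined above can be collected in a small data structure. The sketch below assumes that the inter-subcircuit bits communicated is the sum $B_r = \sum_i b_i$ of the per-subcircuit values; the class and function names, and the numbers in the example, are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class Subcircuit:
    fan_out: int  # f_i: deleted wires attached to nodes of this subcircuit
    n_in: int     # n_{in,i}: input nodes located inside this subcircuit

def bits_communicated(sub, T):
    """b_i = f_i * T: each deleted wire can carry one bit per clock cycle."""
    return sub.fan_out * T

def inter_subcircuit_bits(subs, T):
    """B_r, taken here to be the sum of b_i over all 2**r subcircuits."""
    return sum(bits_communicated(s, T) for s in subs)

# Hypothetical 2-stage nested bisection (4 subcircuits) run for T = 10 cycles.
subs = [Subcircuit(3, 5), Subcircuit(1, 2), Subcircuit(4, 6), Subcircuit(2, 3)]
T = 10
B_r = inter_subcircuit_bits(subs, T)
n = sum(s.n_in for s in subs)  # the input-node counts sum to n
print(B_r, n)
```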
This now allows us to present an important lemma.
Lemma 3: All fully-parallel circuits with inter-subcircuit bits communicated $B_r$ have product $AT^2$ bounded by:
Proof: This result, from Grover et al. [2], flows from applying Lemma 2 recursively on the nested-bisected structure and optimizing.
Lemma 4: All fully-parallel circuits with inter-subcircuit bits communicated $B_r$ and number of input nodes $n$ have product $AT$ bounded by:
Proof: See [2]. This result flows from the observation that $A \ge n$ for a fully parallel circuit and then combining this inequality with (1).
Definition 8: An $(n, k)$-decoder is a circuit that computes a decoding function $f : \{0,1\}^n \to \{0,1\}^k$. It is associated with a codebook (and therefore, naturally, an encoding function $g : \{0,1\}^k \to \{0,1\}^n$), a channel statistic $P(y^n \mid x^n)$ (which we will assume herein to be the statistic induced by $n$ channel uses of a binary erasure channel), and a distribution $p_{x^k}$ from which the source is drawn (which we will assume to be the statistic generated by $k$ independent fair binary coin flips). The quantity $n$ is the block length of the code, and the quantity $k$ is the number of bits decoded.
Definition 9: The block error probability of a decoder, denoted P e , is the probability that the decoder's estimate of the original source is incorrect. Note that this probability depends on the source distribution, the channel, and the function that the decoder computes.
Definition 10: A decoding scheme is an infinite sequence of circuits D 1 , D 2, . . . each of which computes a decoding function, with block lengths n 1 < n 2 < . . . and bits decoded k (n 1 ) , k (n 2 ) , . . .. They are associated with a sequence of codebooks C 1 , C 2 , . . . and a channel statistic.
Though we assume the channel statistic associated with each decoder is the statistic induced by $n$ uses of a binary erasure channel with erasure probability $\epsilon$, our lower bound results also apply to any channel that is a degraded erasure channel, including the binary symmetric channel. Our results in terms of binary erasure probability can be applied to decoding schemes for the binary symmetric channel with crossover probability $p$ by substituting $\epsilon = 2p$.
Definition 11: We let $P_e(n)$ denote the block error probability for the decoder with input size $n$. We let $R(n) = k(n)/n$ be the rate of the decoder with input size $n$.
We also classify decoding schemes in terms of how their probability of error scales in the definition below.
Definition 12: An f (n)-decoding scheme is a decoding scheme in which for sufficiently large n the block error probability P e (n) < f (n).
Definition 13: The asymptotic-rate, or more compactly, the rate of a decoding scheme is lim n→∞ R (n), if this limit exists, which we denote R.
Note that the rate of a decoding scheme may not be the rate of any particular codebook in the decoding scheme.
Definition 14: An exponentially-low-error decoding scheme is an $e^{-cn}$-decoding scheme for some $c > 0$ with asymptotic rate $R$ greater than 0.
We will also consider another class of decoding schemes, one which can be considered less reliable.
Definition 15: A polynomially-low-error decoding scheme is a $1/n^t$-decoding scheme for some $t > 0$ with asymptotic rate $R > 0$.
We will also need to define a sublinear function, which will be used to deal with a technicality in Theorem 1.
Definition 16: A sublinear function $f(n)$ is a function in which $\lim_{n \to \infty} f(n)/n = 0$.
C. Main Lower Bound Results
We can now state the main theorem of this paper.
Theorem 1: All $f(n)$-decoding schemes for a binary erasure channel with erasure probability $\epsilon$ in which $f(n)$ monotonically decreases to 0 and in which $-\ln(f(n))$ is a sublinear function have energy that scales as
$$AT \ge \Omega\big(n\sqrt{-\ln f(n)}\big) \quad (2)$$
and $AT^2$ complexity that scales as:
$$AT^2 \ge \Omega\big(-n\ln f(n)\big). \quad (3)$$
Remark 2: To show this, we use a proof by contradiction. We suppose that the scheme is an f (n)-coding scheme. We suppose that inter-subcircuit bits communicated is low. Then, an averaging argument can show that there must be at least one subcircuit with fewer bits communicated to it from outside the subcircuit than it is responsible for decoding. The probability of block error for the circuit in this case must be at least the probability that all the input bits in this subcircuit are erased by the channel. Using this argument and a judicious choice of the number of nested bisections can show then that the scheme is not an f (n)-coding scheme. Thus, for an f (n)-coding scheme the inter-subcircuit bits communication must be high, which implies the energy and area lower bound results. The details are given below.
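The two ingredients of this argument, the averaging step and the choice of the number of nested bisections, can be illustrated numerically. The sketch below assumes the erasure lower bound takes the form $P_e \ge \frac{1}{2}\epsilon^{2n/N}$ with $N = 2^r$ (a subcircuit with at most $2n/N$ inputs, all erased, guessing wrong with probability at least $1/2$). All function names are hypothetical, and `smallest_r` searches for a suitable $r$ rather than using any closed form.

```python
def erasure_lower_bound(n, N, eps):
    """Assumed bound on block error when B_r < k/2: (1/2) * eps**(2n/N)."""
    return 0.5 * eps ** (2 * n / N)

def smallest_r(n, eps, f_n):
    """Smallest r for which the bound with N = 2**r is at least f_n."""
    r = 0
    while erasure_lower_bound(n, 2 ** r, eps) < f_n:
        r += 1
    return r

def has_starved_subcircuit(b, n_in, k, n):
    """Averaging step: is there a subcircuit with b_i < k/N and n_in_i <= 2n/N?"""
    N = len(b)
    return any(bi < k / N and ni <= 2 * n / N for bi, ni in zip(b, n_in))

# Polynomially small target f(n) = 1/n^2 on an erasure channel with eps = 1/2.
n, eps = 1024, 0.5
f_n = 1 / n ** 2
r = smallest_r(n, eps, f_n)
print(r, erasure_lower_bound(n, 2 ** r, eps) >= f_n)

# Four subcircuits, k = 400 bits to decode, n = 800 inputs, B_r = 190 < k/2.
print(has_starved_subcircuit([30, 90, 20, 50], [500, 100, 100, 100], 400, 800))
```

In the first example the search stops at $r = 7$ ($N = 128$), so any decoder for this block length with $B_r < k/2$ at that depth would have block error probability at least $f(n)$.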
Proof: (Of Theorem 1) Associated with each decoder is its $B_r$, the inter-subcircuit bits communicated. We can choose $r$ to be any function of $n$ so long as $2^r < nR(n) = k(n)$. From here on, we will suppress the dependence of $r(n)$, $k(n)$, and $R(n)$ on $n$. For ease of notation, let $N = 2^r$ be the number of subcircuits induced by the $r$ stages of nested bisections. Consider any specific sufficiently large circuit in our decoding scheme, and suppose that $B_r < k/2$. Then there exist at least $N/2$ subcircuits in which $b_i < k/N$ (where we recall $b_i$ is the bits communicated to the $i$th subcircuit from Definition 5). Suppose not, i.e., that there are at least $N/2$ subcircuits with $b_i \ge k/N$. Then
$$B_r \ge \frac{N}{2} \cdot \frac{k}{N} = \frac{k}{2},$$
contradicting $B_r < k/2$.
Let $Q$ be the set of (at least $N/2$) subcircuits with bits communicated to them less than $k/N$. Using a similar averaging argument, we claim that within $Q$ there must be one subcircuit in which $n_{\text{in},i} \le 2n/N$. If not, that is, if all $N/2$ subcircuits in $Q$ have greater than $2n/N$ input bits injected into them, then the total number of input nodes in the entire circuit is greater than $\frac{2n}{N} \cdot \frac{N}{2} = n$, but there are only $n$ input nodes in the entire circuit. Thus, there is at least one subcircuit in $Q$ in which $b_i < k/N$ and $n_{\text{in},i} \le 2n/N$. Suppose that all the input bits injected into this special subcircuit are erased. Then, that subcircuit makes an error with probability at least $1/2$ by Lemma 1, since it will have to form an estimate of $k/N$ bits by only having injected into it fewer than $k/N$ bits. Thus, if $B_r < k/2$ then, with $\epsilon$ denoting the erasure probability:
$$P_e \ge P(\text{error} \mid \text{all } n_{\text{in},i} \text{ bits erased})\, P(\text{all } n_{\text{in},i} \text{ bits erased}) \ge \frac{1}{2}\,\epsilon^{n_{\text{in},i}},$$
where this first inequality flows from summing one term in a law of total probability expansion of the probability of block error, and the second from lower bounds on these probabilities. Combining this observation with the fact that $n_{\text{in},i} \le 2n/N$ gives us the following observation:
$$\text{if } B_r < \frac{k}{2} \text{ then } P_e \ge \frac{1}{2}\,\epsilon^{2n/N}. \quad (4)$$
This is true for any valid choice of $r$. Now suppose that our decoding scheme is an $f(n)$-decoding scheme. We choose $r$ to be
$$r = \left\lceil \log_2 \frac{2n \ln \epsilon}{\ln(2 f(n))} \right\rceil, \quad (5)$$
where $\epsilon$ is the erasure probability. Note that $N$ cannot grow faster than $O(n)$ since we assumed $f(n)$ was monotonically decreasing. Thus, this is a valid choice for $r$. Note as well that $N$ increases with $n$ because of the sub-linearity assumption on $-\ln(f(n))$. Then, if $B_r < k/2$, by directly substituting into (4),
$$P_e \ge \frac{1}{2}\,\epsilon^{2n/N} \ge f(n).$$
In other words, if $B_r < k/2$ then our decoding scheme is not an $f(n)$-decoding scheme. Thus, for this choice of $r$, $B_r \ge k/2$. Thus, by Lemma 4,
where we substituted the value for $N$ in the first line and used the fact that $\lceil x \rceil \le x + 1$ in the second, proving inequality (2) of the theorem. As well, by Lemma 3, using $B_r \ge k/2$ for this choice of $r$, following a similar substitution as in the previous paragraph gives inequality (3).
Corollary 1: All exponentially-low-error decoding schemes have energy that scales as
$$AT \ge \Omega\left(\frac{n^{3/2}}{\sqrt{p(n)}}\right)$$
for all functions $p(n)$ that increase without bound. Moreover, any such scheme that has energy that grows optimally, i.e., as $AT = O(n^{3/2})$, must have $T(n) \ge \Omega(n^{0.5})$.
Proof: Note that an exponentially-low-error decoding scheme has $P_e \le e^{-cn}$ for some $c > 0$. Thus, such a scheme is also an $e^{-cn/p(n)}$-decoding scheme, for any increasing $p(n)$.
The result then directly flows by substituting $f(n) = e^{-cn/p(n)}$ into (2) of Theorem 1.
For the second part of the corollary, suppose that for some constant c, a decoding scheme has
We have as well from (3) and substituting $f(n) = e^{-cn/p(n)}$
Suppose that
for a $g(n)$ that grows with $n$, i.e., that $T$ asymptotically grows slower than $O(n^{1/2})$. Then, to satisfy (7) we need
for all increasing p (n), implying
To see this precisely, suppose otherwise and then it is easy to see that, combined with (8) the inequality in (9) will be unsatisfied. If (8) is true, however, then,
Since this is true for all increasing $p(n)$, it is true for, say, $p(n) = \ln g(n)$, implying that the product $AT$ grows strictly faster than $n^{3/2}$, contradicting the assumption of (6).
We generalize Corollary 1 to decoding schemes with different asymptotic block error probabilities below:
Theorem 2: All $f(n)$-decoding schemes with asymptotic rate greater than 0 in which $f(n)$ is sub-exponential and whose energy scales as $E = \Theta\big(n\sqrt{-\ln f(n)}\big)$ (that is, their energy matches the lower bound of (2) of Theorem 1) must have $T(n) = \Omega\big(\sqrt{-\ln f(n)}\big)$. Moreover, for all decoding schemes in which $T(n)$ is faster than this optimum (that is, grows more slowly), $E \ge \Omega\left(\frac{-n \ln f(n)}{T(n)}\right)$.
Note that from (3),
As well, suppose
for some increasing $g(n)$. Then, from (10), $A \ge \Omega(n g(n))$. Combining this bound on $A$ with (11) and taking the square root implies that $AT \ge \Omega\big(n\sqrt{-\ln f(n)}\sqrt{g(n)}\big)$, contradicting (10). Moreover, for all $T(n)$ growing slower than that required for optimal energy, (3) implies that
$$A \ge \Omega\left(\frac{-n \ln f(n)}{T(n)^2}\right).$$
Corollary 2: All polynomially-low-error decoding schemes have energy that scales at least as
$$\Omega\big(n\sqrt{\ln n}\big).$$
If this optimum is reached, then $T(n) \ge \Omega(\sqrt{\ln n})$.
Proof: This energy lower bound flows from letting $f(n) = 1/n^t$ and then substituting this value into (2). The time lower bound flows from directly applying Theorem 2.
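The tradeoff between latency and energy in this section can be made concrete with a small numerical sketch. It assumes the two lower bounds take the forms $n\sqrt{-\ln f(n)}$ and $-n \ln f(n)/T(n)$ and sets every hidden constant to 1, so the numbers only indicate scaling; `energy_lower_bound` is a hypothetical helper name.

```python
import math

def energy_lower_bound(n, neg_log_f, T):
    """Larger of the universal bound n*sqrt(L) and the latency-dependent
    bound n*L/T, where L = -ln f(n). All constants are set to 1."""
    return max(n * math.sqrt(neg_log_f), n * neg_log_f / T)

n = 10_000
L = 2 * math.log(n)        # polynomially low error, f(n) = 1/n^2
T_star = math.sqrt(L)      # latency at which the two bounds coincide

for T in (1, T_star, 10 * T_star):
    print(round(T, 2), round(energy_lower_bound(n, L, T)))
```

For $T$ below $\sqrt{-\ln f(n)}$ the latency-dependent term dominates and the bound exceeds $n\sqrt{-\ln f(n)}$; at or above that value the bound flattens, which is the sense in which $\sqrt{-\ln f(n)}$ clock cycles are needed to reach the energy optimum.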
IV. SERIAL DECODING SCHEME SCALING RULES
Let the number of output nodes in a particular decoder be denoted $j$.
Definition 17: A serial decoding scheme is a decoding scheme in which the number of output nodes j is constant with increasing block length n.
In [4] we considered the case of allowing the number of output nodes $j$ to increase with increasing block length. We required an assumption that such a scheme be output-regular, which we define below.
Definition 18 [4] : An output-regular circuit is a circuit in which there are a set of specified clock cycles in which outputs are to appear in output nodes. At each of these special clock cycles, exactly one output bit must appear in each output node of the circuit. This definition excludes circuits where some output nodes output a bit during some clock cycle and other output nodes do not during this clock cycle. An output-regular decoding scheme is one in which each decoder in the scheme is an output-regular circuit.
Definition 19: An increasing-output-node decoding scheme is a scheme in which the number of output nodes increases with increasing block length n.
To prove the theorem below, we will consider dividing the circuit in time, into epochs.
Definition 20: For a circuit undergoing T clock cycles, an epoch is an interval of integers between 1 and T .
For example, if T = 20 an epoch may be the set of integers 10, 11, 12. We then may consider things like the "bits output during this epoch" which are the circuit outputs that appear in the circuit during clock cycles 10, 11, or 12.
Given a particular epoch, we can then consider the bits output during that epoch. These are simply the output bits of the computation that appear in output nodes in clock cycles within the epoch.
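The epoch bookkeeping can be sketched in a few lines. The output schedule below is hypothetical (every one of $j = 2$ output nodes produces one bit in every clock cycle), and `bits_output_during` is an illustrative name.

```python
def bits_output_during(epoch, output_schedule):
    """Number of output bits that appear in output nodes during the epoch.

    `output_schedule` maps a clock cycle to the number of output bits that
    appear in that cycle; cycles missing from the map contribute zero.
    """
    return sum(output_schedule.get(t, 0) for t in epoch)

T = 20
j = 2
output_schedule = {t: j for t in range(1, T + 1)}  # hypothetical schedule
epoch = range(10, 13)                              # clock cycles 10, 11, 12
print(bits_output_during(epoch, output_schedule))
```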
We now have the tools to extend our results to serial computation schemes.
Theorem 3: All constant-output-node serial $f(n)$-decoding schemes with switching activity factor greater than $q > 0$ for each decoder in the scheme have energy that scales as $\Omega(-n \ln f(n))$.
Proof: Let there be a total of $j$ circuit output nodes in each circuit in the scheme, regardless of $n$. Since at each clock cycle there are at most $j$ outputs that appear in output nodes of the circuit, we divide the computation into $M$ epochs, each of which has between $A + 1$ and $A + j + 1$ circuit outputs. Since there are $Rn$ outputs of the computation, we can conclude that the number of epochs (which we will denote $M$) is bounded by:
$$M \ge \frac{Rn}{A + j + 1}.$$
Since there are $M$ epochs and in total $n$ input bits injected into the circuit during the entire computation, the average number of input bits injected into the circuit during each of these epochs is $n/M$ and, by combining with the inequality above, this is bounded by:
$$\frac{n}{M} \le \frac{A + j + 1}{R}.$$
Thus, there must be at least one epoch (labelled $i$) in which the number of bits injected into the circuit ($n_{\text{in},i}$) is at most $\frac{A + j + 1}{R}$. At the beginning of this epoch, we assume optimistically that the entire circuit area is used to store $A$ bits of information and this information is carried over to the next epoch.
Substituting this into (13) gives us:
Thus, such a scheme is not an f (n)-coding scheme. We must conclude therefore that
We note that $T \ge Rn/j$ so that there are enough clock cycles to output all the $Rn$ output bits. Thus,
Theorem 4: All output-regular increasing-output-node $f(n)$-decoding schemes have energy that scales as $\Omega\big(n(-\ln f(n))^{1/5}\big)$.
Proof: We divide the computation into $M = \frac{Rn}{8A}$ epochs and divide the circuit into $N_s = \frac{bA}{-\ln(2f(n))}$ subcircuits through nested bisections, for some constant $b$ which we will choose later.
Consider an epoch $i$. We let $B_{r,i}$ be the inter-subcircuit bits communicated during epoch $i$. We shall prove by an averaging argument that if $B_{r,i} < j/2$ then, in at least $M/2$ of the epochs, there is a subcircuit with (1) area less than $4A/N_s$, (2) fewer than $2j/N_s$ bits injected into it from outside the subcircuit, and (3) fewer than $\frac{4n}{N_s M}$ bits injected into its input nodes during this epoch. First, note that there are greater than or equal to $M/2$ epochs with no more than $2n/M$ circuit inputs injected into them during this epoch. Suppose not; then the total number of bits injected into the circuit during the computation is greater than $\frac{M}{2} \cdot \frac{2n}{M} = n$, a contradiction, since $n$ is the number of input bits injected into the circuit.
Secondly, during each of these (at least $M/2$) epochs, there are at least $N_s/2$ subcircuits with fewer than $\frac{4n}{N_s M}$ inputs injected into them. Otherwise, the total number of inputs injected into the circuit during this epoch is more than $\frac{N_s}{2} \cdot \frac{4n}{N_s M} = \frac{2n}{M}$, but by the above paragraph these epochs have fewer than this number of bits injected.
Thirdly, among these $N_s/2$ low-input-bits-injected subcircuits, there are more than $N_s/4$ subcircuits with area less than $4A/N_s$. Otherwise the total area of the circuit is greater than $\frac{N_s}{4} \cdot \frac{4A}{N_s} = A$, a contradiction.
Fourthly, in these (at least $N_s/4$) low-area, low-input-bits-injected subcircuits, there is at least one subcircuit with fewer than $2j/N_s$ bits injected into it from outside the subcircuit during the epoch. Otherwise, if all of these more than $N_s/4$ subcircuits have at least $2j/N_s$ bits injected into them, then $B_{r,i} > \frac{2j}{N_s} \cdot \frac{N_s}{4} = \frac{j}{2}$, but we assumed $B_{r,i} < j/2$. Since we divided the computation into $\frac{Rn}{8A}$ epochs, by the output regularity assumption, during an epoch the circuit is responsible for decoding $8A$ bits, and (since we nested-bisected the output nodes) each subcircuit during an epoch is responsible for decoding $8A/N_s$ bits. Consider the low-area low-bits-communicated subcircuit that must exist if $B_{r,i} < j/2$. The total number of bits communicated to this subcircuit is at most the area of this subcircuit, plus the bits communicated to it from the other subcircuits during the epoch. But this is less than $4A/N_s + 2j/N_s < 8A/N_s$ bits communicated to it (where we use $A > j$ since $j$ is the number of output nodes, so the area is at least this). That is, it has fewer bits injected into it than it is responsible for decoding. If all its at most $\frac{4n}{N_s M}$ input bits are erased, then it makes an error with probability at least $1/2$ by Lemma 1. In this case the block error probability is bounded by:
$$P_e \ge \frac{1}{2}\,\epsilon^{4n/(N_s M)},$$
where $\epsilon$ is the erasure probability. We substitute our value for $N_s$ to give us:
We also have $T \ge Rn/j$ since the number of clock cycles has to at least be enough to output every one of its $Rn$ output bits in its $j$ output nodes. Thus:
and so:
ln(2 f (n)) 32 ln() This allows us to conclude that:
$$AT \ge \Omega\big(n(-\ln f(n))^{1/5}\big).$$
V. INFORMATION FRICTION IN THREE-DIMENSIONAL CIRCUITS
In the previous section, we discussed the two-dimensional Thompson model. In this model, wire length can dominate energy consumption. However, the model involves planar circuits, while modern circuit design techniques can exploit all three spatial dimensions [6]. Moreover, computations performed in the human brain are also obviously done in three dimensions [17]. In this section we adapt Grover's [3] information friction, or "bit-meters", model to computations done in three dimensions. Since the model was introduced, it has informed the work of [18]-[24]. We use our three-dimensional generalization to derive energy complexity lower bounds on decoder and encoder circuits. We discuss how this approach can be generalized to an arbitrary number of dimensions. We present the model below and then prove our main complexity result.
• A circuit is a grid of computational nodes at locations in the set $\mathbb{Z}^3$, where $\mathbb{Z}$ is the set of integers. Some nodes are input nodes, some are output nodes, and some are helper nodes. Note that Grover [3] considers this model in terms of a parameter characterizing the distance between the nodes, but since we are concerned with scaling rules, we will assume that they are placed at integer locations, allowing us to avoid unnecessary notation. The Grover paper considered scaling rules in which nodes are placed on a plane, in which the number of dimensions $d = 2$. In our results we will discuss the case of $d = 3$ and afterwards discuss how the approach can be generalized to an arbitrary number of spatial dimensions.
• A circuit is to compute a function of $n$ binary inputs and $k$ binary outputs.
• At the beginning of a computation, the $n$ inputs to the computation are injected into the input nodes. At the end of the computation the $k$ outputs should appear at the output nodes. A node can be both input and output.
• A node can communicate messages along its links to any other node, and can receive bits communicated to it from any other node.
• Each node has constant memory, and can compute any computable function of all the inputs it has received throughout the computation that are stored in its memory, to produce a message that it can send to any other node.
• We associate a computation with a directed multi-graph, that is, a set of edges linking the nodes. For every computation, there is one edge per bit communicated along a link in the computation's associated multi-graph. The "cost" of an edge in such a multi-graph is the Euclidean distance between the two nodes that it connects. Note that if a node communicates $m$ bits to another node in a computation, then that computation's associated multi-graph must have $m$ edges connecting the two nodes. This multi-graph is called a computation's communication multi-graph.
• The energy, or the bit-meters, denoted $\beta$, of a computation is the sum of the costs of all the edges in the computation's associated multi-graph (that is, the sum of the Euclidean distances of all the edges).
We consider a grid of three-dimensional cubes, with "inner cubes" nested within them. This object is a generalization of the "stencil" object defined by [3].
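The bit-meters cost defined in the list above can be computed directly from a communication multi-graph. In the sketch below the multi-graph is represented, as an assumption of this illustration, by triples (source, destination, number of bits), so that $m$ parallel edges between two nodes are stored as one triple with bit count $m$; the function name is hypothetical.

```python
import math

def bit_meters(edges):
    """Sum of Euclidean edge lengths over the communication multi-graph.

    `edges` is an iterable of (src, dst, bits) triples, where src and dst
    are integer points in Z^3 and `bits` counts parallel edges between them.
    """
    return sum(bits * math.dist(src, dst) for src, dst, bits in edges)

edges = [
    ((0, 0, 0), (3, 4, 0), 2),  # two bits sent over distance 5
    ((0, 0, 0), (0, 0, 1), 1),  # one bit sent over distance 1
]
beta = bit_meters(edges)
print(beta)
```

Here the computation costs $2 \cdot 5 + 1 \cdot 1 = 11$ bit-meters.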
Definition 21: An (L, λ)-nested cube grid is an infinite grid of cubes, with side length L and inner cube side length L(1 − 2λ). Note that the inner cubes are centered within the outer cubes. Fig. 2 shows a diagram of one cube in an (L, λ)-nested cube grid, to which the reader can refer to visualize this nested cube structure. A set of nested cube grid parameters is valid if L > 0 and 0 < λ < 1/2.
Note that a nested cube grid can be placed conceptually on top of a bit-meters circuit. We will consider placing a nested cube grid aligned with the Cartesian 3-space that defines our circuit. We can specify the position of a nested cube grid that is parallel to a set of Cartesian coordinates by calling one of the corners of an outer cube the origin, and then specifying the location of this origin. A particular set of parameters for a nested cube grid and a location for its origin (called its orientation) induces a set of subcircuits, defined below.
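As a rough sketch of how an orientation partitions space, the following code decides which outer cube a point falls in and whether it lies inside the centered inner cube. The helper names `cube_index` and `in_inner_cube` are assumptions made for this example; the inner cube is taken to leave a margin of λL on every side, which follows from its side length L(1 − 2λ) and its centering.

```python
import math

def cube_index(point, origin, L):
    """Integer index of the outer cube (side L) that contains `point`."""
    return tuple(math.floor((x - o) / L) for x, o in zip(point, origin))

def in_inner_cube(point, origin, L, lam):
    """True if `point` lies inside the inner cube of its outer cube.

    The inner cube is centered, so along each axis it spans
    [lam * L, L - lam * L] measured from the outer cube's corner.
    """
    for x, o in zip(point, origin):
        offset = (x - o) % L  # position within the outer cube, in [0, L)
        if not (lam * L <= offset <= L - lam * L):
            return False
    return True

# Hypothetical parameters: L = 8, lambda = 1/4, grid origin at (0, 0, 0),
# so each inner cube spans [2, 6] along every axis of its outer cube.
print(in_inner_cube((3, 4, 5), (0, 0, 0), 8, 0.25))   # inside the margin-free region
print(in_inner_cube((1, 4, 5), (0, 0, 0), 8, 0.25))   # too close to a face
print(cube_index((11, 12, 13), (0, 0, 0), 8))
```

Moving the origin changes both functions' answers for a fixed node, which is what the later choice of orientation exploits.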
Definition 22: A subcircuit, associated with a particular orientation of a nested cube grid, is the part of a bit-meters circuit within a particular outer cube.
Nodes in any subcircuit can thus be considered to be either inside an inner cube or outside an inner cube. For any circuit with a finite number of nodes there will thus be some cubes that contain computational nodes, and some that do not. We label the subcircuits that contain nodes with the index i. The number of input nodes in subcircuit i we denote $n_{\mathrm{in},i}$. The number of output nodes in subcircuit i we denote $k_i$. Furthermore, we denote the number of output nodes within the inner cube of subcircuit i as $k_{\mathrm{in},i}$.
Definition 23: We define $k_{\mathrm{in}} = \sum_i k_{\mathrm{in},i}$, which is the number of output nodes within inner cubes, and which we will often simply refer to with the symbol $k_{\mathrm{in}}$.
We will show in Lemma 6 that there exists a nested cube grid orientation in which k in is high.
Definition 24: The internal bit-meters of a subcircuit i is the total length of all the communication multi-graph edges completely within subcircuit i, plus the length of the parts of the remaining edges that lie within subcircuit i. This quantity is denoted with the symbol $\beta_i$. Note that $\beta = \sum_{\text{all subcircuits } j} \beta_j$ (where we may have to sum over some subcircuits that do not contain any nodes).
Since a computation has associated with it its communication multi-graph, for a given subcircuit we can consider the subgraph formed by all the paths that start outside of the cube and end inside the inner cube. We can group all the vertices of this graph that start outside the outer cube and call this the source, and group all vertices inside an inner cube and call it the sink. For this graph we can consider its min-cut, the minimum set of edges that, once removed, disconnects the source from the sink.
Definition 25: The number of bits communicated from outside a cube to within an inner cube, or, bits communicated, is the size of this minimum cut. For a particular subcircuit i we refer to this quantity with the symbol b i .
Remark 3: This quantity is analogous to (but not the same as) the quantity $b_i$ for the Thompson circuit model from Definition 5, and thus we use the same symbol. The reader should not confuse these symbols; the Thompson model definition applies to discussions in Section III, and the bit-meters model definition applies in this section, Section V.
If the $n_{\mathrm{in},i}$ internal bits of a subcircuit are fixed, then the subcircuit inside an inner cube will compute a function of the messages passed from outside the outer cube. Clearly, the size of the set of possible messages injected into this inner cube is $2^{b_i}$ (since $b_i$ is the min-cut of the paths leading from outside to inside).
Lemma 5: All subcircuits with bits communicated b i have internal bit meters at least b i λL.
Proof: This result flows from Menger's Theorem [25], [26], which states that any network with min-cut $b_i$ has at least $b_i$ edge-disjoint paths from source to sink. Each of these paths must have length at least λL, by the triangle inequality.
Remark 4: This lemma makes rigorous the idea that to communicate $b_i$ bits from outside a subcircuit to within its inner cube, the bit-meters this takes is proportional to the distance from outside an outer cube to within an inner cube (λL) and the number of bits communicated.
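To make the min-cut quantity $b_i$ and the bound of Lemma 5 concrete, here is a small sketch that computes the min-cut of a unit-capacity directed multi-graph using breadth-first augmenting paths (the Edmonds-Karp variant of max-flow, chosen here for brevity and not taken from the paper). The node labels, the toy graph, and the values of L and λ are hypothetical; "S" stands for the grouped vertices outside the outer cube and "T" for the grouped vertices inside the inner cube.

```python
from collections import defaultdict, deque

def min_cut(edges, source, sink):
    """Min-cut (= max-flow) of a directed multi-graph with unit-capacity edges.

    `edges` is a list of (u, v) pairs; repeated pairs are parallel edges.
    """
    cap = defaultdict(int)   # residual capacities
    adj = defaultdict(set)   # neighbours in the residual graph
    for u, v in edges:
        cap[(u, v)] += 1
        adj[u].add(v)
        adj[v].add(u)        # allow flow to be undone along the reverse edge
    flow = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Capacities are integers, so pushing one unit per path is valid.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1

# Hypothetical subcircuit: two helper nodes "a" and "b" between S and T.
edges = [("S", "a"), ("S", "a"), ("S", "b"), ("a", "T"), ("b", "T"), ("a", "b")]
b_i = min_cut(edges, "S", "T")
L, lam = 16, 1 / 8
print(b_i, b_i * lam * L)  # bits communicated and the Lemma 5 lower bound
```

In this toy graph only two edges enter "T", so at most two bits can be communicated, and Lemma 5 then gives internal bit-meters of at least $b_i \lambda L$.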
In the lemma below we show that there exists an orientation of any nested cube grid such that k in is high.
Lemma 6: For all three-dimensional bit-meters circuits with k output nodes and all valid nested cube grid parameters L and λ, there exists an orientation of an (L, λ)-nested cube grid in which the number of output nodes within inner cubes ($k_{\mathrm{in}}$) is bounded below by:
$$k_{\mathrm{in}} \ge (1 - 2\lambda)^3 k.$$
Remark 5: Note that the relative volume of the inner cubes is $(1 - 2\lambda)^3$. This lemma says there exists an orientation of any nested cube grid in which the fraction of output nodes within inner cubes is at least this fraction, so this result is not surprising.
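The averaging idea behind Lemma 6 can be checked numerically. The sketch below samples grid origins uniformly at random and counts how many nodes land in inner cubes; the node layout, the parameters L and λ, the sample count, and the seed are all hypothetical. Because the grid is periodic with period L, sampling the origin from $[0, L)^3$ is assumed to be equivalent to sampling it from any other cube of side L.

```python
import random

def count_in_inner(nodes, origin, L, lam):
    """Number of nodes lying inside inner cubes for a given grid origin."""
    lo, hi = lam * L, L - lam * L
    return sum(
        all(lo <= (x - o) % L <= hi for x, o in zip(p, origin))
        for p in nodes
    )

rng = random.Random(0)  # fixed seed so that the run is repeatable
L, lam = 5.0, 0.125
# 64 hypothetical output nodes on a 4 x 4 x 4 block of integer locations.
nodes = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
k = len(nodes)

counts = []
for _ in range(5000):
    origin = tuple(rng.uniform(0, L) for _ in range(3))
    counts.append(count_in_inner(nodes, origin, L, lam))

mean_fraction = sum(counts) / (len(counts) * k)
target = (1 - 2 * lam) ** 3
print(mean_fraction, target, max(counts) / k)
```

The sample mean should sit close to $(1 - 2\lambda)^3$, and the best sampled orientation can never do worse than the sample mean, which is the probabilistic-method step in miniature.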
Proof: This is a natural generalization of the Grover result (see [3, Lemma 2]), which uses the probabilistic method. We consider placing the origin of an (L, λ)-nested cube grid uniformly at random within a cube of side length L centered at the origin of the Cartesian 3-space. We index the k output nodes by i. Let $\mathbb{1}_{\mathrm{in},i}$ be the indicator random variable that is equal to 1 if output node i is within an inner cube. Then, given the uniform measure on the position of the cube, the quantity $k_{\mathrm{in}}$ is a random variable. We observe:
$$k_{\mathrm{in}} = \sum_{i=1}^{k} \mathbb{1}_{\mathrm{in},i}.$$
Taking the expectation of both sides gives us:
$$E[k_{\mathrm{in}}] = \sum_{i=1}^{k} E[\mathbb{1}_{\mathrm{in},i}] = k (1 - 2\lambda)^3, \quad (15)$$
where in (15) we use the observation that, for each output node, the probability that it is in an inner cube is equal to the relative volume of the inner cube. Thus, the expected value of $k_{\mathrm{in}}$ is $k(1 - 2\lambda)^3$ and so there must be at least one nested cube grid orientation in which $k_{\mathrm{in}}$ is greater than or equal to that value.
Lemma 7: For all valid nested cube parameters L and λ, $n_{\mathrm{in},i} \le (L + 1)^3$ and thus, for sufficiently large L, $n_{\mathrm{in},i} \le 2L^3$.
Proof: Intuitively, since nodes sit at integer locations, there cannot be more than on the order of $L^3$ input nodes in a cube of volume $L^3$. The $(L + 1)^3$ bound comes from considering the corner case of a cube whose faces exactly touch nodes.
We can now state the main results of this section.
Theorem 5: All 3D-bit-meters decoders for a binary erasure channel with erasure probability $\epsilon$ and sufficiently large block length, with block error probability $P_e$, have bit-meters β bounded by:
$$\beta \ge \frac{27}{512} \left( \frac{\ln(4P_e)}{2 \ln \epsilon} \right)^{1/3} k.$$
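Proof: We consider the number of bits communicated from outside a subcircuit i to within the inner cube of subcircuit i ($b_i$). This number must be at least $k_{\mathrm{in},i}$ to handle the case in which all the input bits in the entire cube are erased. Otherwise, when all these bits are erased, one of the output nodes must guess at least one bit, making an error with probability at least 1/2, as formally justified by Lemma 1. This allows us to argue that:
$$P_e \ge P(\text{error} \mid \text{all } n_{\mathrm{in},i} \text{ input bits are erased}) \, P(\text{all } n_{\mathrm{in},i} \text{ input bits are erased}) \ge \frac{1}{2} \epsilon^{n_{\mathrm{in},i}}. \quad (16)$$
If $\beta < \lambda L k_{\mathrm{in}}$ then there exists a subcircuit indexed by i in which $b_i < k_{\mathrm{in},i}$. Suppose otherwise, that is, $b_i \ge k_{\mathrm{in},i}$ for all i. Then
$$\beta \ge \sum_{i} \beta_i \ge \sum \lambda L b_i \ge \sum \lambda L k_{\mathrm{in},i} = \lambda L k_{\mathrm{in}},$$
where we apply Lemma 5 after the first inequality, and for convenience suppress the subscript on the summation sign after the first instance. This contradicts our assumption that $\beta < \lambda L k_{\mathrm{in}}$. We choose the parameter L in terms of the probability of error in order to derive a contradiction if a circuit does not have high enough bit-meters. Specifically, we choose
$$L = \left( \frac{\ln(4P_e)}{2 \ln \epsilon} \right)^{1/3}. \quad (17)$$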
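Consider the nested cube structure that has $k_{\mathrm{in}} \ge (1 - 2\lambda)^3 k$, which must exist by Lemma 6. If $\beta < \lambda L k_{\mathrm{in}}$ then there must exist a subcircuit i that has fewer than $k_{\mathrm{in},i}$ bits injected from outside the subcircuit to within its inner cube. Thus:
$$P_e \overset{(a)}{\ge} \frac{1}{2} \epsilon^{n_{\mathrm{in},i}} \overset{(b)}{\ge} \frac{1}{2} \epsilon^{2L^3} \overset{(c)}{=} 2 P_e,$$
where (a) flows from (16), (b) from Lemma 7, and (c) from the evaluation of this expression by substituting (17). This is a contradiction. Thus, all bit-meters decoders must have
$$\beta \ge \lambda L k_{\mathrm{in}} \ge \lambda (1 - 2\lambda)^3 L k.$$
The second inequality flows from the fact that we are considering the nested cube structure in which $k_{\mathrm{in}} \ge (1 - 2\lambda)^3 k$, which must exist by Lemma 6. We may choose any valid λ to maximize this bound, and letting λ = 1/8 gives us:
$$\beta \ge \frac{27}{512} \left( \frac{\ln(4P_e)}{2 \ln \epsilon} \right)^{1/3} k.$$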
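The choices of L and λ in this proof can be sanity-checked numerically. The sketch below assumes the bound has the form $\beta \ge \lambda (1 - 2\lambda)^3 L k$ with $L = (\ln(4P_e) / (2 \ln \epsilon))^{1/3}$, as the proof's choice of L and the scaling stated in the abstract suggest; the function names and the sample values of $P_e$, $\epsilon$, and k are hypothetical.

```python
import math

def cube_side(p_e, eps):
    """L = (ln(4 P_e) / (2 ln eps))^(1/3); needs 0 < 4 * p_e < 1 and 0 < eps < 1."""
    return (math.log(4 * p_e) / (2 * math.log(eps))) ** (1 / 3)

def bit_meters_lower_bound(k, p_e, eps, lam=1 / 8):
    """Assumed form of the bound: lam * (1 - 2 lam)^3 * L * k."""
    return lam * (1 - 2 * lam) ** 3 * cube_side(p_e, eps) * k

# lam = 1/8 maximises lam * (1 - 2 lam)^3 on (0, 1/2): scan a fine grid.
grid = [i / 1000 for i in range(1, 500)]
best_lam = max(grid, key=lambda lam: lam * (1 - 2 * lam) ** 3)
constant = (1 / 8) * (3 / 4) ** 3          # equals 27/512
print(best_lam, constant)

# With this L, eps^(2 L^3) equals 4 P_e, which is what drives the contradiction.
p_e, eps = 1e-6, 0.5
side = cube_side(p_e, eps)
print(eps ** (2 * side ** 3), 4 * p_e)
print(bit_meters_lower_bound(k=1000, p_e=p_e, eps=eps))
```

The grid scan agrees with setting the derivative $(1 - 2\lambda)^2 (1 - 8\lambda)$ to zero, which gives λ = 1/8 and the constant 27/512.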
Remark 6: Note that this argument naturally generalizes to d-dimensional space, in which all d-dimensional bit-meters decoders have energy that scales as $\beta \ge \Omega\left( k \, (-\ln P_e)^{1/d} \right)$.
VI. ENCODER LOWER BOUNDS
In terms of scaling rules, all the decoder lower bounds presented herein can be extended to encoder lower bounds. The main structure of the encoder lower bounds (inspired by [2] and [3] ) follows the same structure as the decoder lower bounds.
The key point will be in defining what is meant by error probability of an encoder. For the sake of this section, we shall consider a communication system with a message, an encoder, a binary erasure channel, and a decoder. Moreover, we assume that the decoder uses any possible type of decoding (and so we may assume it does maximum likelihood decoding). The probability of error is the probability that the maximum likelihood decoder does not recover the original message. Note that this error probability is actually a property of the underlying code, and not the encoder itself. But as we shall see, any encoder for a code permitting an f (n)-coding scheme must have minimum energy scaling requirements, which we derive in this section.
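Theorem 6: All fully-parallel f(n)-encoding schemes with number of clock cycles T(n) have energy
$$E(n) \ge \Omega\left( \frac{-n \log f(n)}{T(n)} \right),$$
with optimal lower bound of $E(n) \ge \Omega\left( n \sqrt{-\log f(n)} \right)$ when $T(n) \le O\left( \sqrt{-\log f(n)} \right)$.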
All serial, f (n)-encoding schemes have energy that scales as E(n) ≥ (n log f (n)) .
All increasing-output-node, output-regular f(n)-encoding schemes have energy that scales as $E(n) \ge \Omega\left( n \left( -\log f(n) \right)^{1/5} \right)$.
Finally, all three-dimensional bit-meters encoding schemes associated with block error probability $P_e$ have energy that scales as $E(n) \ge \Omega\left( n \left( -\ln P_e \right)^{1/3} \right)$.
The key point of these proofs is to recognize that an encoder is a function that maps k input bits to n output bits.
Proof: For the first inequality, follow the steps of the proof of Theorem 1 but instead nested-bisect the k input nodes of the encoder. Following analogous steps of this proof, we can show that if B r < k 2 then there must be one subcircuit (with number of output bits n out,i at most 2n N ) in which the number of bits communicated out of this subcircuit is less than the number of message bits injected. Suppose then that the channel erases all n out,i output bits of this subcircuit. Then, the decoder will only have the rest of the code bits to decode the message bits associated with this subcircuit. However, the rest of the circuit has less than k 2 r bits communicated from the subcircuit. We can then apply Lemma 1 to show that in this case the decoder must make an error with probability at least 1/2. Use this to derive the observation similar to (4) that:
The rest of the proof follows exactly as in the proof of Theorem 1. For the next two inequalities, follow this outline, but instead of subdividing the k outputs of the circuit as in the decoder case, subdivide the k inputs of the encoder circuit into subcircuits and epochs. The proofs then follow exactly as in the decoder results of Theorems 3 and 4. The last inequality flows from following Theorem 5, and using an approach analogous to that of Grover [3].
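The fully parallel scaling can be illustrated with a few lines of arithmetic. The sketch assumes the expression $-n \ln f(n) / T(n)$ quoted in the introduction, with the constants hidden by Ω dropped, and takes $\ln f(n)$ directly as an argument to avoid floating-point underflow of f(n) itself. The function names and the example $f(n) = e^{-n}$ are hypothetical.

```python
import math

def parallel_energy_scaling(n, log_f, cycles):
    """-n ln f(n) / T(n): the Omega(.) expression with constants dropped.

    `log_f` is ln f(n), a negative number when f(n) < 1.
    """
    return -n * log_f / cycles

def balanced_cycles(log_f):
    """T(n) = sqrt(-ln f(n)), the cycle count quoted as energy-optimal."""
    return math.sqrt(-log_f)

# Hypothetical exponentially decaying error probability f(n) = exp(-n).
n = 400
log_f = -float(n)
cycles = balanced_cycles(log_f)                      # sqrt(400) = 20
energy = parallel_energy_scaling(n, log_f, cycles)   # 400 * 400 / 20
print(cycles, energy, n ** 1.5)
```

For this choice of f(n) the expression reduces to $n^{3/2}$, matching the scaling stated in the introduction for $e^{-cn}$-coding schemes.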
VII. LIMITATIONS OF THESE RESULTS
There are a number of weaknesses in the models we have used. Firstly, our results are asymptotic. For a fixed block error probability and rate, there may be a specific circuit that reaches this block error probability using a circuit design methodology that does not generalize, and so does not scale in the way predicted by our theorems.
Note that our quantity T refers to number of clock cycles, which reflects one of the main "time costs" in a circuit computation. In real circuits, the "time cost" of a computation involves two parameters: the number of clock cycles required, and the time it takes to do each clock cycle. In our model, we do not consider the time per clock cycle. In real circuits, this quantity often varies with wire lengths. We do not consider this in our model.
A particular weakness of the lower bounds that use the Thompson model is that they assume a switching activity factor q(n) bounded below by a positive constant. On the other hand, the information-friction model accounts for the possibility of schemes in which the switching activity factor changes with increasing block length, so, combined with the results of Grover [3], the asymptotic energy lower bounds we derive apply.
VIII. OTHER ENERGY MODELS OF COMPUTATION
There has been some work on energy models of computation different from the Thompson energy models and Grover information friction models, and herein we provide a short review.
Bingham and Greenstreet [27] classify the tradeoffs between the "energy" complexity of parallel algorithms and "time" complexity for the problems of sorting, addition, and multiplication using a model similar to, but not the same as, the model we use. In the grid model used by these authors, a circuit is composed of processing elements laid out on a grid, in which each element can perform an operation. In this model the circuit designer can choose the speed of each operation, but this comes at an energy cost. Running real circuits at higher voltages can lower the delay of each processing element but raises energy [28]. The model used by the authors in [27] captures some of this fundamental tradeoff. Note that our model assumes constant voltage. Non-trivial results that show how real energy gains can occur by lowering voltages in decoder circuits have been studied in [29], but we do not study this here.
Another energy model of computation was presented by Jain et al. [30] . This model introduced an augmented Turing machine, a generalization of the traditional Turing machine [31] . The authors introduce a transition function, mapping the current instruction being read, the current state, the next state and the next instruction to the "energy" required to make this transition. This model (once the transition function is clearly defined for a specific processor architecture) would be good for the algorithm designer at the software level. However, we do not believe this model informs the specialized circuit designer. The Thompson model which we analyze, on the other hand, can include, as a special case, the energy complexity of algorithms implemented on a processor, as our model allows for a composition of logic gates to form a processor.
Landauer [32] derives that the energy required to erase one bit of information is at least kT ln 2, where k is Boltzmann's constant, and T is the temperature. Thus, a fundamental limit of computation comes from having to erase information. Of course, it may be possible to do reversible computation in which no information is erased that can use arbitrarily small amounts of energy, but such circuits must be run arbitrarily slowly. This suggests a fundamental time-energy tradeoff different from the tradeoff discussed herein. Landauer [33] , Bennett [34] and Lloyd [35] provide detailed discussions and bibliographies on this line of work. Demaine et al. [36] extract a mathematical model from this line of work and analyze the energy complexity of various algorithms within this model. Note that the Thompson model we use is one informed by how modern VLSI circuits are created, even though they operate at energies far above ultimate physical limits.
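For a sense of scale, Landauer's bound kT ln 2 can be evaluated directly. The sketch below uses the exact SI value of Boltzmann's constant; the choice of 300 K as a room-temperature example and the function name are assumptions made for illustration. Note that k here is Boltzmann's constant, not the number of output bits used elsewhere in this paper.

```python
import math

BOLTZMANN_K = 1.380649e-23  # J/K, exact by definition in the SI

def landauer_limit(temperature_kelvin):
    """Minimum energy in joules to erase one bit: k * T * ln 2."""
    return BOLTZMANN_K * temperature_kelvin * math.log(2)

energy_per_bit = landauer_limit(300.0)  # hypothetical room temperature
print(energy_per_bit)  # on the order of 3e-21 J
```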
IX. FUTURE WORK
Currently, our work on lower bounds has not been extended to other channels, such as the additive white Gaussian noise channel. A natural question that arises from this work is whether, for each f(n) that is sub-exponential, there is a coding scheme with encoding and decoding energy that matches, or comes close to matching, the energy lower bounds. In [16] it is shown that, through a parallelization technique, generalized polar codes [37] can reach the two-dimensional energy lower bounds up to an $n^{\epsilon} \operatorname{polylog}(n)$ factor, for any $\epsilon > 0$, where polylog(n) is a function that grows polylogarithmically in n. We conjecture that the energy lower bounds of Theorem 5 can also be reached using a similar parallelization technique, and this remains an area of future work. As well, there has been no work on upper bounds for the serial decoding scheme results. We conjecture that the constant-output-node lower bounds of Theorem 3 can be reached using serialized polar codes, though this may require some significant work. Perhaps more difficult will be to find corresponding upper bounds for the increasing-output-node lower bounds of Theorem 4. Generalizing the serial lower bounds to three-dimensional computation is also an area of future work.
More practically, simulating actual circuits and then comparing them to the lower bounds predicted by the theorems may also be of interest and is an area of future work. However, we conjecture that the gaps between actual energy values in working circuits and the lower bounds will be large. We believe that instead of predicting actual values of area, time, and energy consumption in real circuits, our discussion predicts how optimal circuits scale.
The decoding problem for communication systems is a special case of the more general problem of inference. Well known algorithms used for inference, for example the Sum-Product Algorithm [38] and variational methods [39] , are generalizations of Gallager's [40] low-density parity-check decoding algorithms. Thus, we conjecture that there may be similar tradeoffs between energy, latency, and reliability in circuits that perform inference.
There is also possibly a deep connection between this paper and neuroscience. Firstly, experimental evidence by Chklovskii et al. [17] suggests that brain tissue in living organisms is optimized for a parameter similar to, but not the same as, the "information friction" parameter discussed in this paper. Moreover, Sreenivasan and Fiete [41] show that brains contain powerful error-control codes. Understanding how the energy of such decoders scales in brains is a possibly interesting biology question related to this line of work.
