The fundamental problem considered in this paper is, basically: what is the energy consumed by an implementation of a compressive sensing algorithm on a circuit? We use the proposed "bit-meters" measure as a quantity proportional to energy, i.e., the product of the number of bits transmitted and the distance over which the information is transported. Analogous to the friction imposed on relative motion between two surfaces, this model is called the "information-friction" model. Using this "information-friction" model, we provide a fundamental lower bound for the implementation of compressive sensing decoding algorithms on a circuit. Further, we explore and compare a series of decoding algorithms based on different implementation-circuits. As our second main result, we give an order-tight asymptotic characterization, for a fixed number of precision bits, in the regime m = O(k) (m is the number of measurements and k is the number of non-zero entries in the compressible vector). We thus show that the asymptotic lower bound is order-optimal for sub-linear sparsity $k = n^{1-\beta}$ ($\beta \in (0, 1)$, n is the total number of input entries), since a proposed algorithm with a corresponding implementation-circuit achieves an upper bound of the same order.
The "information-friction" model was first considered in [12] for finding a trade-off between the energy consumed in encoding/decoding and the transmission power in a communication system. Compressive sensing, spurred by [4, 9], has recently been thoroughly explored using information theory, and order-optimal algorithms for encoding and decoding based on a sparse-graph construction have been given, for instance, in [1, 10]. In certain practical settings, such as some distributed cellphone networks (this example is also given in [1]), a naturally generated ensemble of the measurement matrix and the compressed signal is available, and the base-station, after receiving the compressed signal, needs to decode it and extract the needed information as efficiently as possible. It is therefore natural to ask whether the compressive sensing algorithms above are also energy-efficient for decoding when implemented on physical circuits. To answer this question, the first problem is to find the fundamental limit on the energy of any circuit implementation of a compressive sensing decoding algorithm. The setting for compressive sensing is the traditional one: the compressible signal is assumed to be a k-sparse real vector X of length n, and the corresponding compressed signal is a length-m real vector Y computed using an $m \times n$ real measurement matrix A (as the usual setting, we always assume n > m > 1 throughout the following pages). The goal is to efficiently recover the input vector X as a recovery vector $\hat{X}$ with a small average block error probability $P_e^{\rm blk} = \Pr(E^{\rm blk} = 1)$, where $E^{\rm blk} = 1$ if, roughly speaking, the normalized $\ell_1$ distance between the input vector X and the recovery vector $\hat{X}$ is larger than $2^{-Q}$ for a fixed constant Q, and $E^{\rm blk} = 0$ otherwise. We use Q to denote the precision, in bits, used to implement compressive sensing decoding algorithms on a circuit. In practice only finitely many bits are available to store each entry, so it is reasonable to treat Q as a constant in our analysis.
As an initial work, we focus on the "information-friction" model and find a lower bound on the "bit-meters", namely, the product of the number of bits moved and the distance over which these bits are moved. (The "bit-meters" metric was first suggested in [12] as a possible replacement for the VLSI model introduced by Thompson and others in [3, 5, 15, 16, 18, 20], and explored further in [2, 7, 13, 17, 19], for measuring the energy consumed in a circuit.) Using this metric, we assume the energy consumed is proportional to the bit-meters needed to support a computation. In the first part of the main results, we show that for a fixed precision Q, the required bit-meters (energy) for decoding a compressed signal can be no smaller than $\Omega\!\left(\sqrt{\frac{nk}{\log n}\,\log\frac{1}{P_e^{\rm blk}}}\right)$ asymptotically in the regime m = O(k) and $k = n^{1-\beta}$, where $\beta \in (0, 1)$ is a constant. This asymptotic lower bound is proved to be tight, since a proposed algorithm together with a corresponding implementation-circuit achieves an upper bound of the same order.
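To make the "bit-meters" measure concrete, the following sketch computes the information-friction cost of a set of point-to-point transfers on a planar circuit as the sum of (number of bits) times (Euclidean distance travelled); the node positions and message sizes are purely illustrative assumptions.

```python
import math

def bit_meters(transfers):
    """Information-friction cost: sum over transfers of (bits moved) * (distance moved).

    Each transfer is a tuple (source_xy, destination_xy, num_bits)."""
    return sum(bits * math.dist(src, dst) for src, dst, bits in transfers)

# Toy example: three messages moved across a 2-D substrate.
transfers = [
    ((0.0, 0.0), (3.0, 4.0), 8),    # 8 bits over distance 5.0  -> 40 bit-meters
    ((1.0, 1.0), (1.0, 6.0), 2),    # 2 bits over distance 5.0  -> 10 bit-meters
    ((2.0, 0.0), (2.0, 0.5), 16),   # 16 bits over distance 0.5 ->  8 bit-meters
]
print(bit_meters(transfers))        # 58.0
```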
II. Background
In this section, we formalize the employed models, including both the compressive sensing model and the decoding circuit model.
A. Compressive Sensing Model
For the compressive sensing model, we consider two basic variants: a probabilistic one and a combinatorial one. Since we assume the length of the input vector is large, the two models are asymptotically equivalent. For ease of analysis, we adopt the probabilistic model, whose independence property avoids tedious calculations; for presentation and elucidation, the combinatorial model is used to give intuition about both the fundamental limit and the performance of the various decoding algorithms.
Definition 1 (Compressive Sensing Model (n, m, p)). A bounded length-n "compressible" vector $X \in \mathbb{R}^n$ is the input vector; each entry $X_i \in \mathbb{R}$ satisfies $|X_i| \le U$ (with a constant upper bound $U \ge 0$) and is non-zero with probability p and zero with probability 1 - p, independently across entries. We assume p = o(1) as the sparsity assumption. The measurement matrix is a real matrix $A \in \mathbb{R}^{m \times n}$, and the length-m "compressed" vector $Y \in \mathbb{R}^m$ is the output vector such that Y = AX. Based on the output vector Y, a recovery vector $\hat{X}$ is decoded.
Definition 2 (Compressive Sensing Model (n, m, k)). A bounded length-n "compressible" vector $X \in \mathbb{R}^n$ is the k-sparse input vector, which contains exactly k non-zero entries, and each entry $X_i \in \mathbb{R}$ satisfies $|X_i| \le U$ (with a constant upper bound $U \ge 0$). We assume k = o(n) as the sparsity assumption. The measurement matrix is a real matrix $A \in \mathbb{R}^{m \times n}$, and the length-m "compressed" vector $Y \in \mathbb{R}^m$ is the output vector such that Y = AX. Based on the output vector Y, a recovery vector $\hat{X}$ is decoded.
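A minimal sketch of the probabilistic model of Definition 1; the dense Gaussian measurement matrix below is only an illustrative placeholder (the constructions in this paper rely on sparse measurement matrices).

```python
import numpy as np

def sample_cs_instance(n, m, p, U=1.0, seed=0):
    """Sample (X, A, Y) from the Compressive Sensing Model (n, m, p): each entry of X
    is non-zero with probability p, bounded by U in magnitude, and Y = A X."""
    rng = np.random.default_rng(seed)
    support = rng.random(n) < p
    X = np.where(support, rng.uniform(-U, U, size=n), 0.0)
    A = rng.standard_normal((m, n))     # placeholder measurement matrix
    Y = A @ X
    return X, A, Y

X, A, Y = sample_cs_instance(n=1000, m=100, p=0.02)
print(np.count_nonzero(X), Y.shape)     # roughly n * p = 20 non-zeros, (100,)
```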
Note that by the strong law of large numbers (see, for instance, the textbook [11]), or even by the weaker Chernoff bound in [6], the number of non-zero entries of the input vector X in the probabilistic compressive sensing model is confined to a constant-factor range around k (e.g. [k/2, 3k/2]) except with probability $e^{-\Theta(k)}$, which is negligible compared with the error probability we achieve as a result in Section IV.
Moreover, since we will analyze the performance of compressive sensing algorithms in terms of a bit-stream representation and communication, we additionally need to define the average block error probability based on quantization and a given norm $\|\cdot\|_q$ of interest for $0 < q \le \infty$.
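Under this criterion, a decoding attempt counts as a block error when the relative $\ell_q$ distance exceeds $2^{-Q}$; a small sketch of the error indicator (the choice q = 1 is just an example):

```python
import numpy as np

def block_error(X, X_hat, Q, q=1):
    """Return E_blk: 1 if ||X - X_hat||_q / ||X||_q > 2**(-Q), else 0."""
    rel_err = np.linalg.norm(X - X_hat, ord=q) / np.linalg.norm(X, ord=q)
    return int(rel_err > 2.0 ** (-Q))

X = np.array([0.0, 0.7, 0.0, -0.3])
X_hat = np.array([0.0, 0.7001, 0.0, -0.3])   # off by 1e-4 in one entry
print(block_error(X, X_hat, Q=8))    # 0: the relative l1 error is below 2**-8
print(block_error(X, X_hat, Q=16))   # 1: it exceeds 2**-16
```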
B. Decoding Circuit Model
For the decoding circuit model, we arrange the nodes (input and output) on a lattice $\Lambda \subset \mathbb{R}^2$, a discrete additive subgroup of the 2-D Euclidean space $\mathbb{R}^2$ which spans $\mathbb{R}^2$. Furthermore, each point (i.e., node) of the lattice $\Lambda$ (i.e., the circuit) is endowed with a constant packing radius $\rho(\Lambda) > 0$, which guarantees a minimum distance between nodes in a real implementation. For consistency with [12], we present the generalized circuit model using the same order of definitions as [12].
Definition 4 (Lattice ($\rho(\Lambda)$)). A Lattice ($\rho(\Lambda)$) is defined as a lattice $\Lambda$ that spans $\mathbb{R}^2$ with a constant packing radius $\rho(\Lambda) > 0$.
Definition 5 (Substrate). A Substrate is defined as a compact subspace $V \subset \mathbb{R}^2$ which is also a $\Lambda$-subspace.
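A concrete instance of Definitions 4 and 5: node locations drawn from the scaled integer lattice $\Lambda = 2\rho\,\mathbb{Z}^2$ (so the packing radius is $\rho$) inside a square substrate; both the lattice and the substrate size are illustrative assumptions.

```python
import itertools
import math

def lattice_points(rho, side):
    """Points of the lattice 2*rho*Z^2 lying in the square substrate [0, side]^2."""
    step = 2.0 * rho                            # nearest-neighbour spacing = 2 * packing radius
    ticks = range(int(side / step) + 1)
    return [(i * step, j * step) for i, j in itertools.product(ticks, ticks)]

pts = lattice_points(rho=0.5, side=4.0)
min_gap = min(math.dist(a, b) for a, b in itertools.combinations(pts, 2))
print(len(pts), min_gap)                        # 25 nodes, minimum spacing 1.0 = 2 * rho
```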
III. Decoding Algorithms
In this section, we take a heuristic approach to exploring different constructions of implementation-circuits for compressive sensing algorithms. The basic issues are where to place the two types of nodes (input and output) and how they communicate with one another. We discuss two cases: one centralizes the input nodes, and the other arranges the nodes distributively. For the regime of interest m = O(k), the latter design always dominates the former; we will show that local decoding greatly reduces the consumed energy and approaches the order-optimal trade-off.
A. Centralized-Decoding Algorithm
Centralization of the input-nodes, which serves as the simplest construction, is introduced and analyzed first. A direct intuition is that this design suits algorithms with relatively high decoding complexity and a larger number of measurements, namely those in which a larger number of input-nodes need to talk to one another frequently; the benefit is a better error probability.
However, for those algorithms with a relatively sparse measurement matrix, the average block error probability is of order $P_e^{\rm blk} = \Omega(e^{-m})$, and the centralized construction is then no longer energy-efficient, as the following argument shows.
Proof: We only count the bit-meters from the input-nodes to the output-nodes. After the input-nodes collaborate to decode the recovery vector $\hat{X}$, $\Omega(k)$ bits of information must be transmitted to the output-nodes over an average distance of $\Omega(\sqrt{n})$, so bit-meters(DecCkt) = $\Omega(k\sqrt{n})$. Combining this with $P_e^{\rm blk} = \Omega(e^{-m})$ and m = O(k), we conclude that bit-meters(DecCkt) = $\Omega\!\left(\sqrt{nk\,\log\frac{1}{P_e^{\rm blk}}}\right)$.
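A back-of-the-envelope comparison of the two layouts, under the heuristic distance estimates used above (output-nodes spread over an area of order n, so the centralized layout moves each of the k recovered values, Q bits each, a distance of order $\sqrt{n}$, while a local layout only crosses a sub-region of area of order n/k); the constants are arbitrary.

```python
import math

Q = 8  # fixed precision bits per recovered value

def centralized(n, k):
    return k * Q * math.sqrt(n)        # each value crosses ~ the whole circuit

def local(n, k):
    return k * Q * math.sqrt(n / k)    # each value stays inside its own sub-group

for n in (10**4, 10**6, 10**8):
    k = round(n ** 0.7)                # sub-linear sparsity k = n^(1 - beta), beta = 0.3
    print(n, k, round(centralized(n, k) / local(n, k), 1))   # ratio grows like sqrt(k)
```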
B. Distributive-Decoding Algorithms
We now propose two energy-efficient compressive sensing algorithms, both of which use local decoding to reduce the order of the bit-meters. Instead of arranging all the input nodes associated with the received output vector at the center of the circuit, we distribute them across the circuit with carefully designed algorithms. Intuitively, since a large fraction of the communication is carried out within small regions, the consumed energy is greatly reduced; this intuition is verified mathematically later. Leaving the formal definitions to Section IV-B, we describe the algorithms stage by stage, accompanied by schematic graphs. The analysis of their performance is given in Section IV.
1) Chain Algorithm: a) First Stage:
The input vector containing n real entries is first divided into k separate sub-groups, which are compressed individually; each sub-group contains n/k entries. We need to make sure that if a sub-group contains only a single non-zero entry, then that entry is filtered out and resolved within the fixed precision. The decoding can be conducted independently among all these k sub-groups, so the corresponding decoder only needs to handle these sub-groups locally, i.e., it only communicates within the local region of each sub-group. This is an intuitive way of saving energy, since the distances are greatly reduced for most of the communication between nodes. A more rigorous depiction of the decoding circuit and the node arrangement is provided in the following section (see Section IV). In this stage, if we use a constant number c of measurements for each sub-group, then a constant proportion $\rho$ of the k non-zero entries ($\rho k$ entries) can be located and solved to Q-bit precision with high probability (the failure probability is exponentially small in k). We further define $\phi = 1/(1-\rho)$ as a parameter for the remaining stages.
(Figure: schematic graph of the measurement structure for one sub-group. For each node on the right, the weights of the edges connected to it are made unique, which is the requirement of the Identification Phase; in the Verification Phase the connections are kept the same.)
b) Second Stage up to the $\log_\phi(k/\log^2 k)$-th Stage: In the second stage we further combine $\phi = 1/(1-\rho)$ of the sub-groups resulting from the first stage, which can be regarded as the reverse process of a dichotomy. Note that in the i-th stage the conflation is only carried out in a local region with area of order approximately $\log^2 k\,(\phi^{i-1} n/k)$.
In the i-th stage, $k/\phi^{i-2}$ sub-groups are merged into $k/\phi^{i-1}$ new sub-groups, and each new sub-group contains $\phi^{i-1} n/k$ entries of the input vector. The filtering step of the first stage (Section III-B1a) is applied to each new sub-group and, as in the first stage, the corresponding decoder only needs to handle the information locally. The algorithm continues the conflation up to the $\log_\phi(k/\log^2 k)$-th stage, still using the constant c as the number of measurements for each sub-group. As a result, approximately 1/2 of the $k/\phi^{i-1}$ remaining non-zero entries can be located and solved to Q-bit precision with high probability (the failure probability is of order 1/k).
c) $(\log_\phi(k/\log^2 k) + 1)$-th Stage (Wind-up Stage): After $\log_\phi(k/\log^2 k)$ stages, the algorithm stops the conflation and globally decodes the entire length-n input vector, given the information from the previous stages (Sections III-B1a and III-B1b), using $O(\log^2 k)$ measurements in this last stage. Here the decoder communicates information across the entire decoding circuit, which helps improve the error probability. Overall, the algorithm achieves an average block error probability $P_e^{\rm blk}$ of order 1/k and bit-meters of order $O\!\left(\sqrt{\frac{nk}{\log n}\,\log\frac{1}{P_e^{\rm blk}}}\right)$.
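The chain algorithm's stage schedule can be tabulated directly from the description above; the per-stage resolution fraction rho and the measurement constant c are not specified in the excerpt, so the values below are placeholders.

```python
import math

def chain_schedule(n, k, rho=0.5, c=4):
    """Rows of (stage, sub_groups, entries_per_subgroup, measurements_in_stage)."""
    phi = 1.0 / (1.0 - rho)
    rows, groups, stage = [], float(k), 1
    while groups > math.log2(k) ** 2:                   # stages 1 .. log_phi(k / log^2 k)
        rows.append((stage, round(groups), round(n / groups), c * round(groups)))
        groups /= phi                                   # phi old sub-groups merge into one
        stage += 1
    rows.append(("wind-up", 1, n, round(math.log2(k) ** 2)))
    return rows

for row in chain_schedule(n=2**20, k=2**10):
    print(row)
```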
2) Shotgun Algorithm: a) First Stage:
In a similar way to the chain algorithm (Section III-B1), the input vector containing n real entries is first divided into k separate sub-groups, which are compressed individually; each sub-group contains n/k entries. Again we make sure that if a sub-group contains only a single non-zero entry, then that entry is filtered out and resolved within the fixed precision. The decoding can be conducted independently among all these k sub-groups, so the corresponding decoder only needs to handle these sub-groups locally, i.e., it only communicates within the local region of each sub-group, which saves energy since the distances are greatly reduced for most of the communication between nodes. In this stage, if we use a constant number c of measurements for each sub-group, then a constant proportion $\sigma$ of the k non-zero entries ($\sigma k$ entries) can be located and solved to Q-bit precision with high probability (the failure probability is exponentially small in k).
(Figure 4. The schematic graph for the last stage, illustrating the clearing process. This stage winds up the algorithm using $\log^2 k$ measurements and global communication between nodes to decode the entire length-n input vector, so that all the remaining non-zero entries are resolved with high probability.)
b) Second Stage up to the $\log_\varphi(k/\log^2 k)$-th Stage: In the i-th stage, $k/\varphi^{i-2}$ sub-groups are merged into $k/\varphi^{i-1}$ new sub-groups, each containing $\varphi^{i-1} n/k$ entries of the input vector (with $\varphi = 1/(1-\sigma)$ defined analogously to $\phi$). Similarly, the filtering step of the first stage (Section III-B2a) is applied to each new sub-group and, as in the first stage, the corresponding decoder only needs to handle the information locally. The algorithm continues these independent combination processes up to the $\log_\varphi(k/\log^2 k)$-th stage, still using the constant c as the number of measurements for each sub-group. As a result, approximately 1/2 of the $k/\varphi^{i-1}$ remaining non-zero entries can be located and solved to Q-bit precision with high probability (the failure probability is of order 1/k).
c) $(\log_\varphi(k/\log^2 k) + 1)$-th Stage (Wind-up Stage): After $\log_\varphi(k/\log^2 k)$ stages, the algorithm stops the combination and globally decodes the entire length-n input vector, given the information from the previous stages (Sections III-B2a and III-B2b), using $O(\log^2 k)$ measurements in this last stage. Here the decoder communicates information across the entire decoding circuit, which helps improve the error probability. Overall, the algorithm achieves an average block error probability $P_e^{\rm blk}$ of order 1/k and bit-meters of order $O\!\left(\sqrt{\frac{nk}{\log n}\,\log\frac{1}{P_e^{\rm blk}}}\right)$.
In summary, the chain algorithm combines local sub-circuits sequentially and gradually aggregates the information to resolve the input vector X, and it performs better than the shotgun algorithm. Its drawback is that for a large input vector X it necessarily requires several powerful central nodes within each local sub-circuit to handle the gathered information. In the shotgun algorithm, every node has the same functionality, and the decoder merely needs to filter out the possible single non-zero entry in each local sub-circuit at each stage. Below we restate Theorem 4, the main result proved in Section IV-B using the algorithms given above.
Theorem 2. For the Compressive Sensing Model (n, m, k) with a decoding circuit DecCkt implemented on the Implementation Model ($\rho(\Lambda), \mu$) that achieves an average block error probability $P_e^{\rm blk}$, in the regime m = O(k) and $k = n^{1-\beta}$ the bit-meters can be lower and upper bounded as:
$$\text{bit-meters(DecCkt)} = \Theta\!\left(\sqrt{\frac{nk}{\log n}\,\log\frac{1}{P_e^{\rm blk}}}\right).$$
(Figure 6. This graph illustrates the stencil partition of a circuit. The sub-lattice, which has a larger fundamental parallelepiped, defines the sub-circuits. For instance, the sub-circuit in the upper-left corner contains the fundamental parallelepipeds (the order of the quotient is $\lambda = 9$) and has 6 input-nodes and 10 output-nodes, of which 1 output-node lies in the inner parallelepiped.)
IV. Main Results
A. Fundamental Limit
In this section, a lower bound on the bit-meters of all possible compressive sensing decoding algorithms is derived. The idea of the proof comes from [12]: use a "stencil" to divide the entire circuit into several sub-circuits, find the minimal number of bits communicated between the sub-circuits, and multiply by a distance induced by both the side-length of the sub-circuits and a fraction parameter introduced in the following definition. (Note that we define the circuit using a lattice, so in what follows we sometimes call each sub-circuit a "parallelepiped". Although we are using a 2-D substrate represented by a 2-D subspace of Euclidean space, for the convenience of possible future generalization we use the term "parallelepiped" instead of "parallelogram".)
A Stencil ($\lambda, \eta, u$) defines a partition of a circuit into sub-circuits, each occupying the same substrate area $\det(\Lambda_0)$. The fraction parameter $\eta$ defines the inner parallelepiped and the outer parallelepiped of each sub-circuit. If a computational node lies on the boundary of an outer parallelepiped, then it is arbitrarily assigned to one of the sub-circuits. Let the i-th sub-circuit have $m_i$ input-nodes and $n_i$ output-nodes within the outer parallelepiped and $n_i^{\rm inside}$ output-nodes inside the inner parallelepiped.
Lemma 1. For a sub-lattice $\Lambda_0 \subseteq \Lambda$ with quotient order $\lambda = |\Lambda/\Lambda_0|$, the packing radius of $\Lambda_0$ satisfies $\rho(\Lambda_0) = \rho(\Lambda)\sqrt{\lambda\,\sigma(\Lambda_0)/\sigma(\Lambda)}$.
Proof: By the definition of the Implementation Model ($\rho(\Lambda), \mu$), the lattice $\Lambda$ has a packing radius $\rho(\Lambda) > 0$, and for a 2-D lattice there exists a positive packing density $\sigma(\Lambda) > 0$ such that
$$\sigma(\Lambda) = \frac{\pi \rho(\Lambda)^2}{\det(\Lambda)}.$$
Similarly, for the sub-lattice $\Lambda_0$ we also have a positive packing density
$$\sigma(\Lambda_0) = \frac{\pi \rho(\Lambda_0)^2}{\det(\Lambda_0)}.$$
Moreover, the cardinality of the quotient $\Lambda/\Lambda_0$ equals
$$\lambda = |\Lambda/\Lambda_0| = \frac{\det(\Lambda_0)}{\det(\Lambda)}.$$
Combining the three identities yields $\rho(\Lambda_0) = \rho(\Lambda)\sqrt{\lambda\,\sigma(\Lambda_0)/\sigma(\Lambda)}$.
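A quick numerical check of these standard lattice relations for the square lattice $\Lambda = \mathbb{Z}^2$ and the sub-lattice $\Lambda_0 = 3\mathbb{Z}^2$ (quotient order $\lambda = 9$, matching the example in the Figure 6 caption):

```python
import math

det_L, rho_L = 1.0, 0.5        # Z^2: fundamental area 1, packing radius 1/2
det_L0, rho_L0 = 9.0, 1.5      # 3Z^2: fundamental area 9, packing radius 3/2

sigma_L = math.pi * rho_L ** 2 / det_L      # packing density of Lambda    (= pi/4)
sigma_L0 = math.pi * rho_L0 ** 2 / det_L0   # packing density of Lambda_0  (= pi/4)
lam = det_L0 / det_L                        # order of the quotient Lambda / Lambda_0

print(lam)                                                    # 9.0
print(rho_L0, rho_L * math.sqrt(lam * sigma_L0 / sigma_L))    # 1.5 and 1.5 agree
```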
Lemma 2. On the Implementation Model ($\rho(\Lambda), \mu$), for any fraction parameter $\eta > 0$ there exists a point $u \in \Lambda$ for the Stencil ($\lambda, \eta, u$) such that the number of output-nodes covered by the Stencil is lower bounded by $(1-2\eta)^2 n$.
Proof: The quantity $(1-2\eta)^2 n$ is the expected number of output-nodes covered by the Stencil when the point $u \in \Lambda$ is uniformly distributed; thus there exists at least one point u satisfying the requirement.
Lemma 3. On the Implementation Model ($\rho(\Lambda), \mu$), for any Stencil-partition into L sub-circuits, the fraction of sub-circuits satisfying $m_i \le \min(2m/L, n_i)$ is at least $\min\!\left(\frac{1-R}{1+R}, \frac{1}{2}\right)$, where R = m/n.
Proof: Assume n > m. First we choose the origin O of the Stencil such that the locations of the output-nodes satisfy Lemma 2. Then, for this fixed choice of $u \in \Lambda$, we consider the worst-case locations of the input-nodes, i.e., those minimizing the fraction of sub-circuits satisfying $m_i \le \min(2m/L, n_i)$. We call an inner sub-circuit non-locally decodable if $m_i \le \min(2m/L, n_i)$ and locally decodable otherwise. (The figure in Section IV-A explicitly gives an example.) For the first case, $m_i \le 2m/L$, the minimal fraction of non-locally decodable sub-circuits is 1/2; similarly, the minimal fraction is $(1-R)/(1+R)$ for the second case, $m_i \le n_i$. Since the total number of input-nodes is m, taking the overlap of the two cases into account, $\min\!\left(\frac{1-R}{1+R}, \frac{1}{2}\right)$ is a lower bound on the fraction of sub-circuits with $m_i \le \min(2m/L, n_i)$. Below we prove the claim explicitly.
Let $\chi = \min(2m/L, (n+m)/2L)$. Since each locally decodable sub-circuit contains more than $\chi$ input-nodes (it has $m_i > 2m/L$ or $m_i > n_i$, and each sub-circuit contains roughly $(n+m)/L$ nodes), the fraction $f_1$ of locally decodable sub-circuits satisfies $f_1 \le m/(\chi L)$, which implies that the fraction $f_2$ of non-locally decodable sub-circuits satisfies
$$f_2 = 1 - f_1 \ge 1 - \frac{m}{\min(2m, (n+m)/2)} = \min\!\left(\frac{1}{2}, \frac{1-R}{1+R}\right).$$
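A simulation of this counting bound under the simplifying assumption that every sub-circuit holds the same total number of nodes T = (n+m)/L: over random node placements, the realized fraction of sub-circuits with $m_i \le \min(2m/L, n_i)$ always stays above the worst-case bound $\min(1/2, (1-R)/(1+R))$.

```python
import random

def worst_observed_fraction(n, m, L, trials=200, seed=1):
    rng = random.Random(seed)
    T = (n + m) // L                                # nodes per sub-circuit (assumed equal)
    worst = 1.0
    for _ in range(trials):
        labels = [1] * m + [0] * n                  # 1 = input-node, 0 = output-node
        rng.shuffle(labels)
        ok = 0
        for i in range(L):
            m_i = sum(labels[i * T:(i + 1) * T])    # input-nodes in sub-circuit i
            n_i = T - m_i                           # output-nodes in sub-circuit i
            ok += m_i <= min(2 * m / L, n_i)
        worst = min(worst, ok / L)
    return worst

n, m, L = 9000, 3000, 100
R = m / n
print(worst_observed_fraction(n, m, L), min(0.5, (1 - R) / (1 + R)))
```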
Lemma 4. If at most H(X)/3 bits of information are available to obtain an estimate $\hat{X}$ of a variable X with entropy H(X), then $\Pr(\hat{X} \neq X) \ge 1/9$.
Proof: We modify Fano's inequality [8] applied to reconstructing the message X by defining an error random variable $E(X, \hat{X})$ such that
$$E(X, \hat{X}) = \begin{cases} 1, & \hat{X} \neq X, \\ 0, & \hat{X} = X. \end{cases}$$
Since the input vector X, the output vector Y and the recovery vector $\hat{X}$ form a Markov chain $X \to Y \to \hat{X}$, we get $H(X) = H(X \mid \hat{X}) + I(X; \hat{X})$. Thus, by Fano's inequality [8],
$$h_b(P_e) + P_e \log(|\mathcal{X}| - 1) \ge H(X \mid \hat{X}),$$
where $h_b(\cdot)$ on the LHS is the binary entropy function (it will also appear in later parts). Given available information of at most H(X)/3 bits, we have $H(X \mid \hat{X}) \ge 2H(X)/3$, so the error probability $P_e := \Pr(\hat{X} \neq X)$ is lower bounded by the smallest value satisfying
$$h_b(P_e) + P_e \log(|\mathcal{X}| - 1) \ge \frac{2}{3} H(X).$$
Then, since n > m > 1, we have $P_e \ge 1/9$.
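A numerical illustration of this Fano-type argument for a uniform X over $M = 2^{10}$ values (the uniform alphabet is an assumption made only for this illustration): with mutual information capped at H(X)/3, the smallest error probability compatible with the displayed inequality is well above 1/9.

```python
import math

def h_b(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

M = 2 ** 10
H = math.log2(M)          # entropy of a uniform X over M values
target = 2.0 * H / 3.0    # H(X | X_hat) >= 2 H(X) / 3 when I(X; X_hat) <= H(X) / 3

# Smallest P_e on a fine grid for which Fano's inequality can still be satisfied.
p_min = min(p for p in (i / 10000 for i in range(1, 10000))
            if h_b(p) + p * math.log2(M - 1) >= target)
print(round(p_min, 2), p_min >= 1 / 9)    # about 0.57, True
```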
Lemma 5. For the Compressive Sensing Model (n, m, p) with a decoder circuit DecCkt implemented on the Implementation Model ($\rho(\Lambda), \mu$), if the relative error satisfies $\|X - \hat{X}\|_q / \|X\|_q \le 2^{-Q}$, then there exists a constant $C_0(X, q) < 1$, depending only on the input vector X and the norm q, such that at least $C_0\, n p Q$ bits need to be received by the output-nodes asymptotically.
Proof: Let $Q_i$ denote the number of bits of the entry $\hat{X}_i$ of the recovery vector $\hat{X}$ that agree with the corresponding bits of the input vector X; the total number of bits received by the output-nodes is then at least $\sum_{i=1}^{n} Q_i$. Since $|X_i| \ge 0$ for all i and, by our assumption, there is a constant $U \ge 0$ such that $|X_i| \le U$ for every i, Jensen's inequality (see, for instance, the book [14]) gives that the expectation of this sum is lower bounded by $C_0\, n p Q$ for some constant $C_0(X, q) < 1$. The asymptotic result follows as $n \to \infty$.
Lemma 6. For the Compressive Sensing Model (n, m, p) with a decoder circuit DecCkt implemented on the Implementation Model ($\rho(\Lambda), \mu$), consider any decoder sub-circuit SubCkt obtained via the Stencil-partition that satisfies $m_i \le \min(2m/L, n_i)$, where L is the number of sub-circuits. If the bit-meters of SubCkt are smaller than $\eta \rho(\Lambda_0) C_0 \left(n_i^{\rm inside} - m_i\right) p Q / 3$, then the average block error probability satisfies $P_e^{\rm blk} \ge p^{2m/L}/9$.
Proof: In the i-th sub-circuit, if the bit-meters are smaller than $\eta \rho(\Lambda_0) C_0 \left(n_i^{\rm inside} - m_i\right) p Q / 3$, then, since the distance between the outer parallelepiped and the inner parallelepiped is bounded from below by $\eta \rho(\Lambda_0)$, at most $C_0 \left(n_i^{\rm inside} - m_i\right) p Q / 3$ bits of information can be communicated from outside the outer parallelepiped to the inside of the inner parallelepiped. Now, since $m_i \le n_i$, consider the event, denoted by $\mathcal{L}$, that among the $n_i$ entries corresponding to the $n_i$ output-nodes, some $m_i$ entries of the input vector X take non-zero values; in this case the decoder cannot resolve all the Q-bit values at the output-nodes. The event $\mathcal{L}$ occurs with probability at least $p^{m_i} \ge p^{2m/L}$. Conditioning on the event $\mathcal{L}$ and applying Lemmas 4 and 5 (via Fano's inequality [8]), if the received entropy is smaller than $C_0 \left(n_i^{\rm inside} - m_i\right) p Q / 3$, then the average block error probability, defined by $P_e^{\rm blk} = \Pr(E^{\rm blk} = 1)$ where $E^{\rm blk} = 1$ if $\|X - \hat{X}\|_q / \|X\|_q > 2^{-Q}$ and $E^{\rm blk} = 0$ otherwise, is larger than 1/9. Thus, under the assumption of Lemma 6, the (unconditional) error probability for recovering the $n_i$ entries of the input vector X with precision Q in the i-th sub-circuit is lower bounded by $p^{2m/L}/9$. Since the average block error probability $P_e^{\rm blk}$ of the entire circuit is at least that of any sub-circuit, Lemma 6 follows.
Theorem 3. For the Compressive Sensing Model (n, m, p) with a decoding circuit DecCkt implemented on the Implementation Model ($\rho(\Lambda), \mu$) that achieves an average block error probability $P_e^{\rm blk}$,
$$\text{bit-meters(DecCkt)} = \Omega\!\left( n p\, Q\, \rho(\Lambda) \sqrt{\frac{n \log\frac{1}{P_e^{\rm blk}}}{m \log\frac{1}{p}}} \right). \qquad (2)$$
Proof: The outer parallelepipeds of the Stencil divide the circuit into L sub-circuits. Let the i-th sub-circuit have $m_i$ input-nodes and $n_i$ output-nodes within the outer parallelepiped and $n_i^{\rm inside}$ output-nodes inside the inner parallelepiped.
Using Lemma 2 and Lemma 3 we can choose a fixed origin O of the Stencil such that at least a $(1-2\eta)^2$ fraction of the n output-nodes is covered by the inner parallelepipeds. Moreover, the number of sub-circuits with $m_i \le \min(2m/L, n_i)$ is at least $\min\!\left(\frac{1-R}{1+R}, \frac{1}{2}\right) L$, which will be used below. Next, setting $L = 2m / \log_p\!\left(10 P_e^{\rm blk}\right)$ and using Lemma 6, if we assume that the bit-meters of some "non-locally decodable" sub-circuit are smaller than $\eta \rho(\Lambda_0) C_0 \left(n_i^{\rm inside} - m_i\right) p Q / 3$, then the average block error probability is lower bounded as
$$P_e^{\rm blk} \ge \frac{p^{2m/L}}{9} = \frac{10\, P_e^{\rm blk}}{9} > P_e^{\rm blk},$$
which is a contradiction. Thus for each "non-locally decodable" sub-circuit, i.e., each sub-circuit obtained via the Stencil-partition that satisfies $m_i \le \min(2m/L, n_i)$, letting $\mu(i)$ denote the bit-meters of the i-th sub-circuit, we must have
$$\mu(i) \ge \eta \rho(\Lambda_0) C_0 \left(n_i^{\rm inside} - m_i\right) p Q / 3$$
for all such i. Therefore, we can bound the total bit-meters of the decoding circuit by
$$\text{bit-meters(DecCkt)} \ge \sum_{i:\; m_i \le \min(2m/L,\, n_i)} \eta \rho(\Lambda_0) C_0 \left(n_i^{\rm inside} - m_i\right) p Q / 3. \qquad (3)$$
Since, by Lemma 1, $\rho(\Lambda_0) = \rho(\Lambda)\sqrt{\lambda\,\sigma(\Lambda_0)/\sigma(\Lambda)} = \Omega\!\left(\rho(\Lambda)\sqrt{(n+m)/L}\right)$, substituting $\rho(\Lambda_0)$ into inequality (3), and using Lemma 2 and Lemma 3 to lower bound $\sum_i \left(n_i^{\rm inside} - m_i\right)$ by a constant fraction of n, we obtain inequality (2). Choosing $\eta = 1/4$ yields Theorem 3.
Hence, for the regime m = O(k) of our interest, one can derive the following result in order expression, which serves as the benchmark for our design of algorithms. Substituting k = np into inequality (2) gives
$$\text{bit-meters(DecCkt)} = \Omega\!\left( k\, Q\, \rho(\Lambda) \sqrt{\frac{n \log\frac{1}{P_e^{\rm blk}}}{m \log\frac{n}{k}}} \right),$$
which differs from the original lower bound in inequality (2) only in constant factors. Since k = o(n) by the sparsity assumption, in the regime m = O(k) we get R = m/n = o(1), and $\log\frac{n}{k} = \Theta(\log n)$ for $k = n^{1-\beta}$. Therefore, treating Q and $\rho(\Lambda)$ as constants, we can asymptotically bound the bit-meters as
$$\text{bit-meters(DecCkt)} = \Omega\!\left(\sqrt{\frac{nk}{\log n}\,\log\frac{1}{P_e^{\rm blk}}}\right).$$
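A numerical sanity check of this last simplification, using the form of the bound given above and the arbitrary illustrative choices m = 2k, $\beta = 0.3$ and $P_e^{\rm blk} = 1/\sqrt{k}$: the ratio between the substituted bound and the simplified order expression stays essentially constant as n grows.

```python
import math

def substituted_bound(n, k, m, P_blk):
    """k * sqrt( n * log(1/P) / (m * log(n/k)) ), dropping the constants Q and rho(Lambda)."""
    return k * math.sqrt(n * math.log(1.0 / P_blk) / (m * math.log(n / k)))

def order_expression(n, k, P_blk):
    """sqrt( (n * k / log n) * log(1/P) )."""
    return math.sqrt(n * k * math.log(1.0 / P_blk) / math.log(n))

beta = 0.3
for n in (10**6, 10**9, 10**12):
    k = round(n ** (1 - beta))
    m = 2 * k                                # regime m = O(k)
    P_blk = 1.0 / math.sqrt(k)               # error probability of the achievable scheme
    ratio = substituted_bound(n, k, m, P_blk) / order_expression(n, k, P_blk)
    print(n, round(ratio, 3))                # stays near sqrt(1 / (2 * beta)) ~ 1.29
```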
B. Achievability
First we define the encoding matrix and the corresponding decoding steps; several properties then follow, by which we are able to prove the achievability results. Among them:
2) the number of measurements satisfies $m \le 2ck + \sqrt{k}$, where c is the measurement constant;
3) the number of decoding steps of the algorithm is $N_{\rm DecCkt} = O(k)$.
$$\cdots = O\!\left(\sqrt{\frac{nk}{\log n}\,\log\frac{1}{P_e^{\rm blk}}}\right). \qquad (7)$$
We get equation (5) because, as assumed, the precision parameter Q is fixed, which is reasonable in many applications. Summing up all terms in equation (5) yields equation (6). By Theorem 4, the average block error probability satisfies $P_e^{\rm blk} = O(1/\sqrt{k})$; thus $O(\sqrt{\log n}) = O(\sqrt{\log k})$ in the sub-linear regime $k = n^{1-\beta}$ with $\beta \in (0, 1)$, and hence $O(\sqrt{\log n}) = O\!\left(\sqrt{\log\frac{1}{P_e^{\rm blk}}}\right)$, which implies equation (7). Therefore, as the conclusion, we get
$$\text{bit-meters(DecCkt)} = O\!\left(\sqrt{\frac{nk}{\log n}\,\log\frac{1}{P_e^{\rm blk}}}\right).$$
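A rough per-stage accounting of the distributed construction, under an assumed cost model in which, per sub-group and per stage, only the c measurement values and O(1) newly resolved entries (Q bits each) travel across the sub-group's region, while the wind-up stage moves its $O(\log^2 k)$ measurements across the whole circuit. This is a heuristic illustration, not the formal analysis; it shows how the geometric decay across stages keeps the total at the order of $\sqrt{nk}$.

```python
import math

def stage_cost(n, groups, Q=8, c=4):
    """Bit-meters of one stage: each of `groups` sub-groups moves ~ (c + 1) * Q bits
    across a region of area n / groups (diameter ~ sqrt(n / groups))."""
    return groups * (c + 1) * Q * math.sqrt(n / groups)

def total_bit_meters(n, k, phi=2.0, Q=8, c=4):
    total, groups = 0.0, float(k)
    while groups > math.log2(k) ** 2:                   # stages 1 .. log_phi(k / log^2 k)
        total += stage_cost(n, groups, Q, c)
        groups /= phi
    total += math.log2(k) ** 2 * Q * math.sqrt(n)       # wind-up: global communication
    return total

for n in (2**20, 2**26, 2**32):
    k = round(n ** 0.7)
    print(n, round(total_bit_meters(n, k) / math.sqrt(n * k), 1))   # ratio stays roughly constant
```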
