Lower bounds over Boolean inputs for deep neural networks
with ReLU gates.
Anirbit Mukherjee∗ Amitabh Basu†
Abstract
Motivated by the resurgence of neural networks in being able to solve complex learning tasks, we undertake a study of high depth networks using ReLU gates, which implement the function $x \mapsto \max\{0, x\}$. We try to understand the role of depth in such neural networks by showing size lower bounds against such network architectures in parameter regimes hitherto unexplored. In particular, we show the following two main results about neural nets computing Boolean functions of input dimension $n$:
• We use the method of random restrictions to show an almost linear, $\Omega(\epsilon^{2(1-\delta)} n^{1-\delta})$, lower bound for completely weight-unrestricted LTF-of-ReLU circuits to match the Andreev function on at least a $\frac{1}{2} + \epsilon$ fraction of the inputs, for $\epsilon > \sqrt{\frac{2\log^{\frac{2}{2-\delta}}(n)}{n}}$ and any $\delta \in (0, \frac{1}{2})$.
• We use the method of sign-rank to show exponential-in-dimension lower bounds for ReLU circuits ending in an LTF gate, of depths up to $O(n^{\xi})$ with $\xi < \frac{1}{8}$, with some restrictions on the weights in the bottommost layer. All other weights in these circuits are kept unrestricted. This in turn also implies the same lower bounds for LTF circuits with the same architecture and the same weight restrictions on their bottommost layer.
Along the way we also show that there exists an $\mathbb{R}^n \to \mathbb{R}$ Sum-of-ReLU-of-ReLU function which Sum-of-ReLU neural nets can never represent, no matter how large they are allowed to be.
1 Introduction
There has been a recent surge of activity in using neural networks for complex artificial intelligence
tasks (like this very recent spectacular demonstration [34] of the power of neural nets). This has
rekindled interest in understanding neural networks from a complexity theory perspective. A myr-
iad of hard mathematical questions have surfaced in the attempts to rigorously explain the power
of neural networks and a comprehensive overview of these can be found in this recent three part
series of articles from The Center for Brains, Minds and Machines (CBMM) [26, 25, 42]. There is a rich literature investigating the complexity of the function classes represented by neural networks with various kinds of gates (or “activation functions”, which is the more common parlance in machine learning). Many papers, a canonical example being the classic paper by Maass [23],
establish complexity results for the entire class of functions represented by circuits where the gates
∗Department of Applied Mathematics and Statistics, Johns Hopkins University, Email: amukhe14@jhu.edu
†Department of Applied Mathematics and Statistics, Johns Hopkins University, Email: basu.amitabh@jhu.edu
can come from a very general family. This is complemented by papers that study a very specific
family of gates such as the sigmoid gate or the LTF gate [16], [35], [31], [19], [3], [32],
[28], [4]. Many associated results can also be found in these reviews [20, 27]. Recent circuit
complexity results in [18], [38], [6], [17] stand out as significant improvements over known
lower (and upper) bounds on circuit complexity with threshold gates. The results of Maass [23]
also show that very general families of neural networks can be converted into circuits with only LTF
gates with at most a constant factor blow up in depth and polynomial blow up in size of the circuits.
In the last 5 years or so, a particular family of gates called Rectified Linear Unit (ReLU) gates has been reported to have significant advantages over more traditional gates in practical applications of neural networks. Such a gate with $n$ real inputs computes the following output,
$$\mathbb{R}^n \to \mathbb{R} \qquad (1)$$
$$x \mapsto \max\{0, b + \langle w, x \rangle\} \qquad (2)$$
where $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ are fixed parameters associated with the gate ($b$ is called the bias of the gate). In comparison, the $\pm 1$ valued LTF gate mentioned above computes (for the same weights as above) the function $(2 \cdot \mathbb{1}_{\{b + \langle w, x \rangle \geq 0\}} - 1)$, where $\mathbb{1}_{\{b + \langle w, x \rangle \geq 0\}}$ is the 0/1 indicator function for the stated halfspace condition.
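To fix these two gate types concretely, here is a minimal sketch in Python (our own illustration, assuming NumPy; the function names are not from any cited work):

```python
import numpy as np

def relu_gate(w, b, x):
    """ReLU gate: x -> max(0, b + <w, x>), as in (1)-(2)."""
    return max(0.0, b + float(np.dot(w, x)))

def ltf_gate(w, b, x):
    """The +-1 valued LTF gate: 2 * 1[b + <w, x> >= 0] - 1."""
    return 2 * int(b + float(np.dot(w, x)) >= 0) - 1
```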
Some of the prior results which apply to general gates, such as the ones in [23], also apply to
ReLU gates, because those results apply to gates that compute a piecewise polynomial function
(ReLU is a piecewise linear function with only two pieces). However, as witnessed by results on LTF
gates, one can usually make much stronger claims about specific classes of gates. To the best of
our knowledge, no prior results have been obtained for ReLU gates from the perspective of Boolean
complexity theory, i.e., the study of such circuits when restricted to Boolean inputs. The main focus
of this work is to study circuits computing Boolean functions mapping $\{-1,1\}^m \to \{-1,1\}$ which
use ReLU gates in their intermediate layers, and have an LTF gate at the output node (to ensure that
the output is in {−1, 1}). We remark that using an LTF gate at the output node while allowing more
general analog gates in the intermediate nodes is a standard practice when studying the Boolean
complexity of analog gates (see, for example, [23]).
Although we are not aware of an analysis of lower bounds for ReLU circuits when applied to only
Boolean inputs, there has been recent work on the analysis of such circuits when viewed as a func-
tion from $\mathbb{R}^n$ to $\mathbb{R}$ (i.e., allowing real inputs and output). From [8] and [7] (with restrictions on the domain and the weights) we know of (super-)exponential lower bounds on the size of Sum-of-ReLU circuits for certain easy Sum-of-ReLU-of-ReLU functions. Depth vs. size tradeoffs for such circuits have recently also been studied in [39, 12, 21, 41, 30] and in a recent paper [2] by the current authors. To the best of our knowledge, no lower bounds scaling exponentially with the dimension are known for analog deep neural networks of depth more than 2.
In what follows, the depth of a circuit will be the length of the longest path from the output node
to an input variable, and the size of a circuit will be the total number of gates in the circuit. We will
also use the notation Sum-of-ReLU to refer to circuits whose inputs feed into a single layer of ReLU
gates, whose outputs are combined into a weighted sum to give the final output. Similarly, Sum-of-ReLU-of-ReLU denotes the circuit of depth 3, where the output node is a simple weighted sum, and the intermediate gates are all ReLU gates in the two “hidden” layers. We analogously define Sum-of-LTF, LTF-of-LTF, LTF-of-ReLU, LTF-of-LTF-of-LTF, LTF-of-ReLU-of-ReLU and so on. We will also use the notation LTF-of-(ReLU)$^k$ for a circuit of the form LTF-of-ReLU-of-ReLU-$\cdots$-ReLU with $k \geq 1$ levels of ReLU gates.
2 Statement and discussion of results
Boolean vs. real inputs. We begin our study with the following observation, which shows that
ReLU circuits have markedly different behaviour when the inputs are restricted to be Boolean, as
opposed to arbitrary real inputs. Since AND and OR gates can both be implemented by ReLU gates,
it follows that any Boolean function can be implemented by a ReLU-of-ReLU circuit. In fact, it is
not hard to show something slightly stronger:
Lemma 2.1. Any function $f : \{-1,1\}^n \to \mathbb{R}$ can be implemented by a Sum-of-ReLU circuit using at most $\min\{2^n, \sum_{S : \hat{f}(S) \neq 0} |S|\}$ ReLU gates, where $\hat{f}(S)$ denotes the Fourier coefficient of $f$ for the set $S \subseteq \{1, \ldots, n\}$.
The Lemma follows by observing that the indicator function of each vertex of the Boolean hypercube $\{-1,1\}^n$ can be implemented by a single ReLU gate (see the sketch below), and the parity function on $k$ variables can be implemented by $k$ ReLU gates (see Appendix C). Thus, if one does not restrict the size of the circuit, then Sum-of-ReLU circuits can represent any pseudo-Boolean function. In contrast, we will now show that if one allows real inputs, then there exist functions with just 2 inputs (i.e., $n = 2$) which cannot be represented by any Sum-of-ReLU circuit, no matter how large.
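For the vertex-indicator observation: for a vertex $v \in \{-1,1\}^n$ we have $\langle v, x \rangle = n$ iff $x = v$, and $\langle v, x \rangle \leq n - 2$ otherwise, so the single ReLU gate $\max\{0, \langle v, x \rangle - (n-1)\}$ is the 0/1 indicator of $v$. A quick check (our own illustrative Python):

```python
import itertools

def vertex_indicator(v, x):
    # One ReLU gate: outputs 1 iff x == v on {-1,1}^n, since <v, x> equals
    # n exactly at x = v and is at most n - 2 at every other vertex.
    n = len(v)
    return max(0, sum(vi * xi for vi, xi in zip(v, x)) - (n - 1))

cube = list(itertools.product([-1, 1], repeat=4))
v = cube[5]
assert all(vertex_indicator(v, x) == (1 if x == v else 0) for x in cube)
```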
Proposition 2.2. The function max{0, x1, x2} cannot be computed by any Sum-of-ReLU circuit, no
matter how many ReLU gates are used. It can be computed by a Sum-of-ReLU-of-ReLU circuit.
The first part of the above proposition (the impossibility result) is proved in Appendix A. The second part follows from Corollary 2.2 of a previous paper by the authors [2], which states that any $\mathbb{R}^n \to \mathbb{R}$ function that can be implemented by a circuit of ReLU gates can always be implemented with at most $\lceil \log(n+1) \rceil$ layers of ReLU gates (with a weighted sum giving the final output).
Restricting to Boolean inputs. From this point on, we will focus entirely on the situation where
the inputs to the circuits are restricted to {−1, 1}. One motivation behind our results is the desire
to understand the strength of the ReLU gates vis-a-vis LTF gates. It is not hard to see that any circuit
with LTF gates can be simulated by a circuit with ReLU gates with at most a constant blow-up in
size (because a single LTF gate can be simulated by 2 ReLU gates when the inputs are a discrete set
– see Appendix B). The question is whether ReLU gates can do significantly better than LTF gates
in terms of depth and/or size.
A quick observation is that Sum-of-ReLU circuits can be linearly (in the dimension n) smaller than
Sum-of-LTF circuits. More precisely,
Proposition 2.3. The function $f : \{-1,1\}^n \to \mathbb{R}$ given by $f(x) = \sum_{i=1}^{n} 2^i \left(\frac{1+x_i}{2}\right)$ can be implemented by a Sum-of-ReLU circuit with 2 ReLU gates, while any Sum-of-LTF circuit that implements $f$ needs $\Omega(n)$ gates.
The above result follows from the following two facts: 1) any linear function is implementable by 2 ReLU gates, and 2) any Sum-of-LTF circuit with $w$ LTF gates gives a piecewise constant function that takes at most $2^w$ different values. Since $f$ takes $2^n$ different values (it evaluates every vertex of the Boolean hypercube to the corresponding natural number expressed in binary), we need $w \geq n$ gates.
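The first fact rests on the identity $t = \max\{0, t\} - \max\{0, -t\}$, so a linear function is the sum of two ReLU gates fed with opposite weights. A sketch verifying this for the $f$ of Proposition 2.3 (our own illustrative Python, small $n$):

```python
import itertools

def f_linear(x):
    # f(x) = sum_{i=1}^{n} 2^i * (1 + x_i) / 2, from Proposition 2.3.
    return sum(2 ** (i + 1) * (1 + xi) / 2 for i, xi in enumerate(x))

def f_two_relus(x):
    # The same f as an affine form t = b + <w, x> pushed through the
    # identity t = ReLU(t) - ReLU(-t): a Sum-of-ReLU with 2 gates.
    b = 2 ** len(x) - 1                       # sum_i 2^i / 2
    t = b + sum(2 ** i * xi for i, xi in enumerate(x))
    return max(0, t) - max(0, -t)

assert all(f_two_relus(x) == f_linear(x)
           for x in itertools.product([-1, 1], repeat=5))
```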
In the context of these preliminary results, we now state our main contributions. For the next result we recall the definition of the Andreev function [1], which has been used many times before to prove computational lower bounds [24, 15, 14].
Definition 1 (Andreev's function). Andreev's function is the following mapping,
$$A_n : \{0,1\}^{\lfloor \frac{n}{2} \rfloor} \times \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor \times \left\lfloor \frac{n}{2\lfloor \log(\frac{n}{2}) \rfloor} \right\rfloor} \longrightarrow \{0,1\}$$
$$(x, [a_{ij}]) \longmapsto x_{\mathrm{bin}\left(\left\{\left(\sum_{j=1}^{\left\lfloor \frac{n}{2\lfloor \log(\frac{n}{2}) \rfloor} \right\rfloor} a_{ij}\right) \bmod 2\right\}_{i=1,2,\ldots,\lfloor \log(\frac{n}{2}) \rfloor}\right)}$$
where “bin” is the function that gives the decimal number represented by its input bit string.
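As a concrete reading of the definition, the following sketch (our own, in Python; bit strings are 0-indexed lists and “bin” reads the row-parity string as a binary number) evaluates $A_n$:

```python
def andreev(x, a):
    # x: bit list of length floor(n/2); a: matrix with floor(log2(n/2))
    # rows, each of length floor(n / (2 * floor(log2(n/2)))).
    # The parity of each row of a contributes one bit of an index into x.
    idx_bits = [sum(row) % 2 for row in a]
    idx = int("".join(str(bit) for bit in idx_bits), 2)   # the "bin" map
    return x[idx]
```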
We are particularly inspired by the most recent use of the Andreev function by Kane and Williams [18] to get the first super-linear lower bounds for approximating it using LTF-of-LTF circuits. We will give an almost linear lower bound on the size of LTF-of-ReLU circuits approximating this Andreev function, with no restriction on the weights $w, b$ of any gate.
Theorem 2.4. For any $\delta \in (0, \frac{1}{2})$, there exists $N(\delta) \in \mathbb{N}$ such that for all $n \geq N(\delta)$ and $\epsilon > \sqrt{\frac{2\log^{\frac{2}{2-\delta}}(n)}{n}}$, any LTF-of-ReLU circuit on $n$ bits that matches the Andreev function on at least a $\frac{1}{2} + \epsilon$ fraction of the inputs has size $\Omega(\epsilon^{2(1-\delta)} n^{1-\delta})$.
It is well known that proving lower bounds without restrictions on the weights is much more challenging, even in the context of LTF circuits. In fact, the recent results in [18] are the first super-linear lower bounds for LTF circuits with no restrictions on the weights. With restrictions on some or all of the weights, e.g., assuming poly($n$) bounds on the weights (typically termed the “small weight assumption”) in certain layers, exponential lower bounds have been established for LTF circuits [11, 16, 32, 33]. Our next results are of this flavor: under certain kinds of weight restrictions, we prove exponential lower bounds on the size of LTF-of-(ReLU)$^{d-1}$ circuits. One thing to note is that our weight restrictions are assumed only on the bottom layer (closest to the input). The other layers can have gates with unbounded weights. Nevertheless, our weight restrictions are somewhat unconventional.
Definition 2 (Weight restriction condition). Let $m \in \mathbb{N}$ and let $\sigma$ be any permutation of $\{1, \ldots, 2^m\}$. Let us also consider an arbitrary sequencing $\{x_1, \ldots, x_{2^m}\}$ of the vertices of the hypercube $\{-1,1\}^m$. Define the polyhedral cone
$$P_{m,\sigma} := \{a \in \mathbb{R}^m : \langle a, x_{\sigma(1)} \rangle \leq \langle a, x_{\sigma(2)} \rangle \leq \ldots \leq \langle a, x_{\sigma(2^m)} \rangle\}.$$
In words, $P_{m,\sigma}$ is the set of all linear objectives that order the vertices of the $m$-dimensional hypercube in the order specified by $\sigma$. We will impose the condition that there exists a $\sigma$ such that for each ReLU gate in the bottom layer, the vector $w \in P_{m,\sigma}$ ($w$ as defined in (1)) and all weights are integers with magnitude bounded by some $W > 0$.
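Operationally, a weight vector $a$ with distinct inner products over the hypercube lies in $P_{m,\sigma}$ for exactly the $\sigma$ obtained by sorting the vertices by $\langle a, \cdot \rangle$; ties place $a$ on a common face of several cones. A small sketch (our own illustrative Python):

```python
import itertools

def induced_order(a):
    # Sort the vertices of {-1,1}^m by their inner product with a; any a
    # whose inner products are all distinct lies in P_{m,sigma} precisely
    # for the permutation sigma realizing this sorted order.
    verts = itertools.product([-1, 1], repeat=len(a))
    return sorted(verts, key=lambda x: sum(ai * xi for ai, xi in zip(a, x)))
```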
We will prove our lower bounds against the function proposed by Arkadev and Nikhil in [5],
$$g : \mathrm{OMB}^0_n \circ \mathrm{OR}_{n^{\frac{1}{3}} - \log n} \circ \mathrm{XOR}_2 : \{-1,1\}^{2(n^{\frac{4}{3}} - n\log n)} \to \{-1,1\} \qquad (3)$$
which we will refer to as the Arkadev-Nikhil function in the remainder of the paper. Here OMB is the ODD-MAX-BIT function, a $\pm 1$ threshold gate which evaluates to $-1$ on, say, an $n$-bit input $x$ if $\sum_{i=1}^{n} (-1)^{i+1} 2^i (1 + x_i) \geq \frac{1}{2}$. We show the following exponential lower bound against this function.
Theorem 2.5. Let $m, d, W \in \mathbb{N}$. Any depth $d$ LTF-of-(ReLU)$^{d-1}$ circuit on $2m$ bits whose bottom layer weights are restricted as per Definition 2, and which implements the Arkadev-Nikhil function on $2m$ bits, requires a circuit size of
$$\Omega\left((d-1)\,\frac{2^{\frac{m^{1/8}}{d-1}}}{(mW)^{\frac{1}{d-1}}}\right).$$
Consequently, one obtains the same size lower bounds for circuits with only LTF gates of depth $d$.
Note that this is an exponential-in-dimension size lower bound even for super-polynomially growing bottom layer weights (subject to the additional constraints of Definition 2), and up to depths scaling as $d = O(m^{\xi})$ with $\xi < \frac{1}{8}$.
We note that the Arkadev-Nikhil function can be represented by an $O(m)$ size LTF-of-LTF circuit with no restrictions on weights (see Theorem 2.6 below). In light of this fact, Theorem 2.5 is somewhat surprising, as it shows that for the purpose of representing Boolean functions a deep ReLU circuit (ending in an LTF gate) can get exponentially weakened when just its bottom layer weights are restricted as per Definition 2, even if the integers are allowed to be super-polynomially large. Moreover, the lower bounds also hold for LTF circuits of arbitrary depth $d$, under the same weight restrictions on the bottom layer. We are unaware of any exponential lower bounds on LTF circuits of arbitrary depth under any kind of weight restrictions.
We will use the method of sign-rank to obtain the exponential lower bounds in Theorem 2.5. The sign-rank of a real matrix $A$ with all non-zero entries is the least rank of a matrix $B$ of the same dimensions with all non-zero entries such that for each entry $(i,j)$, $\mathrm{sign}(B_{ij}) = \mathrm{sign}(A_{ij})$. For a Boolean function $f : \{-1,1\}^m \times \{-1,1\}^m \to \{-1,1\}$, one defines the “sign-rank of $f$” as the sign-rank of the $2^m \times 2^m$ dimensional matrix $[f(x,y)]_{x,y \in \{-1,1\}^m}$.
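As a concrete illustration (our own sketch in Python), one can tabulate this $2^m \times 2^m$ matrix for a two-party function; note that ordinary rank is always an upper bound on sign-rank, which is exactly how it will be used in Section 4:

```python
import itertools
import numpy as np

def sign_matrix(f, m):
    # The 2^m x 2^m matrix [f(x, y)] for f: {-1,1}^m x {-1,1}^m -> {-1,1}.
    verts = list(itertools.product([-1, 1], repeat=m))
    return np.array([[f(x, y) for y in verts] for x in verts])

eq = lambda x, y: 1 if x == y else -1     # the equality function, as a toy case
G = sign_matrix(eq, 3)
print(np.linalg.matrix_rank(G))           # rank(G) upper-bounds sign-rank(G)
```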
This notion of sign-rank has been used to great effect in diverse fields, from communication complexity to circuit complexity to learning theory. Explicit matrices with high sign-rank were not known until the breakthrough work of Forster [9]. Forster et al. made elegant use of this complexity measure to show exponential lower bounds against LTF-of-MAJ circuits in [10]. Much of the previous literature on sign-rank has been reviewed in the book by Satya Lokam [22]. Most recently, the following result was obtained by Arkadev and Nikhil in [5], leading to a proof of the strict containment of LTF-of-MAJ in LTF-of-LTF.
Theorem 2.6 (Theorem 4.2 and Corollary 1.2 in [5]). The Arkadev-Nikhil function $g$ in equation (3) can be represented by a linear sized LTF-of-LTF circuit, and $\mathrm{sign\text{-}rank}(g) \geq 2^{\frac{n^{1/3} - 2\log n}{16}}$.
We will prove our theorem by showing a small upper bound on the sign-rank of LTF-of-(ReLU)$^{d-1}$ circuits which have their bottommost layer's weights restricted in the said way.
3 Lower bounds for LTF-of-ReLU against the Andreev function (Proof
of Theorem 2.4)
We will use the classic “method of random restrictions” [37, 36, 13, 40, 29] to show a lower bound for weight-unrestricted LTF-of-ReLU circuits representing the Andreev function. The basic philosophy of this method is to take an arbitrary LTF-of-ReLU circuit which supposedly matches the Andreev function on a large fraction of the inputs, randomly fix the values on some of its input coordinates, and do the same fixing on the same coordinates of the input to the Andreev function. Then we show that upon doing this restriction the Andreev function collapses to an arbitrary Boolean function on the remaining inputs (what it collapses to depends on what values were fixed on its restricted inputs). On the other hand, we show that the LTF-of-ReLU circuit collapses to a circuit of such small size that with high probability it cannot possibly approximate a randomly chosen Boolean function on the remaining inputs. This contradiction yields the lower bound.
There are two important steps in implementing the above idea. First, one has to precisely define when a ReLU gate, upon a partial restriction of its inputs, can be considered removable from the circuit. Once this notion is clarified, it turns out that doing random restrictions on a ReLU gate is the same as doing random restrictions on an LTF gate, as was recently done in [18]. Second, it needs to be true that LTF-of-ReLU circuits of any fixed size cannot represent too many of all the Boolean functions possible at the same input dimension. For this very specific case of LTF-of-ReLU circuits, where ReLU gates necessarily have a fan-out of 1, Theorem 2.1 in [23] applies, and we have from there that LTF-of-ReLU circuits over $n$ bits with $w$ ReLU gates can represent at most $N = 2^{O((wn + w + w + 1 + 1)^2 \log(wn + w + w + 1 + 1))} = 2^{O((wn + 2w + 2)^2 \log(wn + 2w + 2))}$ Boolean functions. We note that, slightly departing from the usual convention for neural networks, in this work by Wolfgang Maass direct wires from the input nodes to the output LTF gate are allowed. This flexibility ties in nicely with how we want to define a ReLU gate as becoming removable under the random restrictions that we use.
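For intuition, removability of a single restricted ReLU gate can be tested directly (our own sketch in Python): the affine argument $b + \langle w, x \rangle$ is sign-definite over the surviving sub-cube iff its value at the fixed coordinates dominates the total weight mass of the free coordinates.

```python
def removable(w, b, fixed):
    # fixed: dict {coordinate index: +-1 value}. The restricted gate
    # max(0, b + <w, x>) is removable iff b + <w, x> is <= 0 on the whole
    # surviving sub-cube (gate == constant 0) or >= 0 on it (gate is affine,
    # so its wires can be routed directly to the output LTF gate).
    base = b + sum(w[i] * v for i, v in fixed.items())
    slack = sum(abs(w[i]) for i in range(len(w)) if i not in fixed)
    return base + slack <= 0 or base - slack >= 0
```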
Random Boolean functions vs. any circuit class
In everything that follows, all sampling (denoted by $\sim$) is to be understood as sampling from the uniform distribution, unless otherwise specified. First we note the following well-known claim.
Claim 1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any given Boolean function. Then the following is true,
$$\mathbb{P}_{g \sim \{\{-1,1\}^n \to \{-1,1\}\}}\left[\mathbb{P}_{x \sim \{-1,1\}^n}[f(x) = g(x)] \geq \frac{1}{2} + \epsilon\right] \leq e^{-2^{n+1}\epsilon^2}$$
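The claim is a Hoeffding-style bound over the $2^n$ independent coordinates of the truth table of $g$. A quick numerical sanity check (our own sketch, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 10, 0.1, 10_000
f = rng.integers(0, 2, 2 ** n)                      # a fixed truth table
hits = sum((rng.integers(0, 2, 2 ** n) == f).mean() >= 0.5 + eps
           for _ in range(trials))
print(hits / trials, np.exp(-2 ** (n + 1) * eps ** 2))  # empirical freq vs bound
```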
From Claim 1 it follows that if $N$ is the total number of functions in any circuit class (whose members are called $C$), then we have, by the union bound,
$$\mathbb{P}_{g \sim \{\{-1,1\}^n \to \{-1,1\}\}}\left[\exists C \text{ s.t. } \mathbb{P}_{x \sim \{-1,1\}^n}[C(x) = g(x)] \geq \frac{1}{2} + \epsilon\right] \leq N e^{-2^{n+1}\epsilon^2} \qquad (4)$$
Equipped with these basics, we are now ready to begin the proof of the lower bound against weight-unrestricted LTF-of-ReLU circuits.
Proof.
Definition 3. Let $D$ denote an arbitrary LTF-of-ReLU circuit over $\lfloor \log(\frac{n}{2}) \rfloor$ bits.
For some $\frac{\epsilon}{3} \leq \frac{1}{2}$ and a size function denoted $s(n, \epsilon)$, we use equation (4), the definition of $D$ above and the upper bound given earlier on the number of LTF-of-ReLU functions at a fixed circuit size (now used for circuits on $\lfloor \log(\frac{n}{2}) \rfloor$ bits) to get,
$$\mathbb{P}_{f \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor} \to \{0,1\}}\left[\forall D \text{ s.t. } |D| \leq s(n,\epsilon) : \mathbb{P}_{y \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor}}[f(y) = D(y)] \leq \left(\frac{1}{2} + \frac{\epsilon}{3}\right)\right]$$
$$\geq 1 - 2^{O\left(s^2 \log^2(\frac{n}{2}) \log(\log(\frac{n}{2})\,s)\right)} e^{-\left(\frac{\epsilon^2}{9}\right) 2^{1 + \lfloor \log(\frac{n}{2}) \rfloor}} \geq 1 - 2^{O(s^2 k^2 \log(ks))} e^{-\left(\frac{2\epsilon^2}{9}\right) 2^k}$$
whereby in the last inequality above we have assumed that $n = 2^{k+1}$. This assumption is legitimate because we want to estimate certain large $n$ asymptotics. For any arbitrarily chosen constant $C < \frac{2}{9}$ we try to satisfy the condition $O(s^2 k^2 \log(ks)) - \frac{2\epsilon^2 2^k}{9} \leq -C\epsilon^2 2^k$, i.e., $O(s^2 k^2 \log(ks)) \leq O(\epsilon^2 2^k)$. For any constant $\theta > 0$ and large enough $x > 0$ we have $\log(x) < x^{\theta}$, and hence the above constraint on $s$ is satisfied if we work in the regime $s \leq O\left(\frac{(\epsilon^2 2^k)^{\frac{1}{2+\theta}}}{k}\right)$. So for this range of $s$ we have,
$$2^{O(s^2 k^2 \log(ks))} e^{-\left(\frac{2\epsilon^2}{9}\right)2^k} \leq e^{O(s^2 k^2 \log(ks)) - \left(\frac{2\epsilon^2}{9}\right)2^k} \leq e^{-C\epsilon^2 2^k}.$$
Now we want $e^{-C\epsilon^2 2^k} \leq \frac{\epsilon}{3}$.
On the other hand, for the upper bound on $s$ to make sense we need $\epsilon^2 2^k \geq k^{2+\theta}$. It is clear that both conditions are satisfied if, for asymptotically large $n$, we choose $\epsilon > \sqrt{\frac{2\log^{2+\theta}(\frac{n}{2})}{n}}$. And corresponding to this we have, for $s(n,\epsilon) \leq O\left(\frac{\epsilon^{\frac{2}{2+\theta}}\, n^{\frac{1}{2+\theta}}}{2^{\frac{1}{2+\theta}} \log(\frac{n}{2})}\right)$,
$$\mathbb{P}_{f \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor} \to \{0,1\}}\left[\forall D \text{ s.t. } |D| \leq s(n,\epsilon) : \mathbb{P}_{y \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor}}[f(y) = D(y)] \leq \left(\frac{1}{2} + \frac{\epsilon}{3}\right)\right] \geq 1 - \frac{\epsilon}{3} \qquad (5)$$
Definition 4 ($F^*$). Let $F^*$ be the subset of all the $f$ above for which the above event is true.
Now we recall the definition of the Andreev function (Definition 1) for the following definition and claim.
Definition 5 ($\rho$). Let $\rho$ denote the set of all possible “random restrictions” which fix all the input bits of $A_n$ except 1 bit in each row of the matrix $a$. So the restricted function (call it $A_n|_\rho$, overloading the notation for simplicity) computes a function of the form,
$$A_n|_\rho : \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor} \to \{0,1\}$$
From the definitions of $A_n$ and $\rho$ above, the following is immediate.
Claim 2. The truth table of $A_n|_\rho$ is the $x$ string in the input to $A_n$ that gets fixed by $\rho$. Thus we observe that if $\rho$ is chosen uniformly at random, then $A_n|_\rho$ is a $\lfloor \log(\frac{n}{2}) \rfloor$ bit Boolean function chosen uniformly at random.
Let $f^*$ be an arbitrary member of $F^*$. Let $x^* \in \{0,1\}^{\lfloor \frac{n}{2} \rfloor}$ be the truth table of $f^*$. Let $\rho(x^*)$ denote restrictions on the input of $A_n$ which fix the $x$ part of its input to $x^*$. So when we sample restrictions uniformly at random from the restrictions of type $\rho(x^*)$, the different instances differ in which bit of each row of the matrix $a$ (of the input to $A_n$) they leave unfixed, and to what values they fix the other entries of $a$. Let $C$ be an $n$ bit LTF-of-ReLU Boolean circuit of size, say, $w(n, \epsilon)$. Thus under the restriction $\rho(x^*)$, both $C$ and $A_n$ become $\lfloor \log(\frac{n}{2}) \rfloor$ bit Boolean functions.
Now we note that a ReLU gate over $n$ bits, upon a random restriction, becomes redundant (and hence removable) iff its linear argument reduces to a function that is everywhere non-positive or everywhere non-negative on the surviving inputs. In the former case the gate computes the constant function zero, and in the latter case it computes a linear function, which can be implemented simply by introducing wires connecting the inputs directly to the output LTF gate. Thus in both cases the resultant function no longer needs the ReLU gate in order to be computed. (We note that such direct wires from the input to the output gate were allowed in how the counting was done of the total number of LTF-of-ReLU Boolean functions at a fixed circuit size.) Combining the two cases, we note that the conditions for collapse (in this sense) of a ReLU gate are identical to the conditions for collapse of an LTF gate with the same linear argument. Hence, corresponding to the random restrictions $\rho$, we can directly utilize the random restriction lemma (Lemma 1.1) of [18] to say that,
$$\mathbb{P}_{\rho(x^*)}\left[\mathrm{ReLU}|_{\rho(x^*)} \text{ is removable}\right] \geq \eta, \quad \text{where } \eta = 1 - O\left(\frac{\log n}{\sqrt{n}}\right).$$
The above definition of $\eta$ implies,
$$\mathbb{P}_{\rho(x^*)}\left[\text{an } n\text{-bit ReLU gate is not removable}\right] \leq 1 - \eta$$
$$\implies \mathbb{E}_{\rho(x^*)}\left[\text{number of ReLU gates of } C \text{ that are not removable}\right] \leq w(n,\epsilon)(1-\eta)$$
$$\implies \mathbb{P}_{\rho(x^*)}\left[\text{number of ReLU gates of } C \text{ that are not removable} \geq s(n,\epsilon)\right] \leq \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)} \quad \text{(by Markov's inequality)}$$
$$\implies \mathbb{P}_{\rho(x^*)}\left[\text{size of } C|_{\rho(x^*)} \leq s(n,\epsilon)\right] \geq 1 - \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)} \qquad (6)$$
Now we compare with the definitions of $\epsilon$ and $f^*$ to observe that (a) with probability at least $1 - \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)}$, $C|_{\rho(x^*)}$ is a circuit of the type appearing in the event of equation (5), and (b) by the definition of the Andreev function, $A_n|_{\rho(x^*)}$ has its truth table given by $x^*$ and hence it specifies the same function as $f^* \in F^*$. Hence for all $x^*$ and $\rho(x^*)$ we can write,
$$\mathbb{P}_{y \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor}}\left[C|_{\rho(x^*)}(y) = A_n|_{\rho(x^*)}(y) \,\middle|\, \text{size of } C|_{\rho(x^*)} \leq s(n,\epsilon)\right] \leq \frac{1}{2} + \frac{\epsilon}{3} \qquad (7)$$
For all $x^*$, equation (6) can be rewritten as,
$$\mathbb{P}_{\rho(x^*)}\left[\text{size of } C|_{\rho(x^*)} \leq s(n,\epsilon)\right] \geq 1 - \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)} \qquad (8)$$
Equation (5) can be written as,
$$\mathbb{P}_{f \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor} \to \{0,1\}}\left[f \in F^*\right] \geq 1 - \frac{\epsilon}{3} \qquad (9)$$
Claim 3. Circuits $C$ have low correlation with the Andreev function:
$$\mathbb{P}_{z \sim \{0,1\}^n}[C(z) = A_n(z)] \leq \frac{\epsilon}{3} + \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)} + \frac{1}{2} + \frac{\epsilon}{3}$$
Proof. We think of sampling $z \sim \{0,1\}^n$ as a two step process: first sample $\tilde{f}$, a $\lfloor \log(\frac{n}{2}) \rfloor$ bit Boolean function, and fix the first $\lfloor \frac{n}{2} \rfloor$ bits of $z$ to be the truth table of $\tilde{f}$; then randomly assign values to the remaining $\lfloor \frac{n}{2} \rfloor$ bits of $z$. Call this latter $\lfloor \frac{n}{2} \rfloor$ bit string $x_{\mathrm{other}}$.
$$\begin{aligned}
\mathbb{P}_{z \sim \{0,1\}^n}[C(z) = A_n(z)] &= \mathbb{E}_{z \sim \{0,1\}^n}\left[\mathbb{1}_{C(z) = A_n(z)}\right] \\
&= \mathbb{E}_{z}\left[\mathbb{1}_{C(z)=A_n(z)} \mathbb{1}_{\tilde{f} \in F^*}\right] + \mathbb{E}_{z}\left[\mathbb{1}_{C(z)=A_n(z)} \mathbb{1}_{\tilde{f} \notin F^*}\right] \\
&= \mathbb{P}_{z}\left[(C(z) = A_n(z)) \cap (\tilde{f} \in F^*)\right] + \mathbb{P}_{z}\left[(C(z) = A_n(z)) \cap (\tilde{f} \notin F^*)\right] \\
&= \mathbb{P}_{z}\left[(C(z) = A_n(z)) \mid (\tilde{f} \in F^*)\right]\mathbb{P}_{z}\left[\tilde{f} \in F^*\right] + \mathbb{P}_{z}\left[(C(z) = A_n(z)) \cap (\tilde{f} \notin F^*)\right] \\
&\leq \mathbb{P}_{z}\left[(C(z) = A_n(z)) \mid (\tilde{f} \in F^*)\right] + \mathbb{P}_{z}\left[\tilde{f} \notin F^*\right] \\
&\leq \mathbb{P}_{z}\left[(C(z) = A_n(z)) \mid (\tilde{f} \in F^*)\right] + \frac{\epsilon}{3}
\end{aligned}$$
In the last line above we have invoked equation (9). Now we note that sampling the $n$ bit string $z$ such that $\tilde{f} \in F^*$ is the same as doing a random restriction of the type $\rho(\tilde{f})$ and then randomly picking a $\lfloor \log(\frac{n}{2}) \rfloor$ bit string, say $y$. So we can rewrite the last inequality as,
$$\begin{aligned}
\mathbb{P}_{z \sim \{0,1\}^n}[C(z) = A_n(z)] &\leq \mathbb{P}_{(\rho(\tilde{f}), y)}\left[C(\rho(\tilde{f}), y) = A_n(\rho(\tilde{f}), y)\right] + \frac{\epsilon}{3} \\
&\leq \mathbb{E}_{(\rho(\tilde{f}), y)}\left[\mathbb{1}_{C(\rho(\tilde{f}),y) = A_n(\rho(\tilde{f}),y)} \mid (\tilde{f} \in F^*)\right] + \frac{\epsilon}{3} \\
&\leq \mathbb{E}_{(\rho(\tilde{f}), y)}\left[\mathbb{1}_{C(\rho(\tilde{f}),y) = A_n(\rho(\tilde{f}),y)} \mathbb{1}_{\text{size of } C|_{\rho(\tilde{f})} < s(n,\epsilon)} \mid (\tilde{f} \in F^*)\right] \\
&\quad + \mathbb{E}_{(\rho(\tilde{f}), y)}\left[\mathbb{1}_{C(\rho(\tilde{f}),y) = A_n(\rho(\tilde{f}),y)} \mathbb{1}_{\text{size of } C|_{\rho(\tilde{f})} \geq s(n,\epsilon)} \mid (\tilde{f} \in F^*)\right] + \frac{\epsilon}{3} \\
&\leq \mathbb{P}_{(\rho(\tilde{f}), y)}\left[C(\rho(\tilde{f}),y) = A_n(\rho(\tilde{f}),y) \,\middle|\, \left((\text{size of } C|_{\rho(\tilde{f})} < s(n,\epsilon)) \cap (\tilde{f} \in F^*)\right)\right] \\
&\quad + \mathbb{P}_{(\rho(\tilde{f}), y)}\left[\text{size of } C|_{\rho(\tilde{f})} \geq s(n,\epsilon) \mid (\tilde{f} \in F^*)\right] + \frac{\epsilon}{3} \\
&\leq \left(\frac{1}{2} + \frac{\epsilon}{3}\right) + \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)} + \frac{\epsilon}{3}
\end{aligned}$$
In the last step above we have used equations (7) and (8).
So after putting back the value of $\eta$ and the largest scaling of $s(n, \epsilon)$ that we can have (from equation (5)), the upper bound on the above probability becomes,
$$\frac{1}{2} + \frac{2\epsilon}{3} + O\left(\frac{w(n,\epsilon)\log(n)}{\sqrt{n}\left(\frac{\epsilon^{\frac{2}{2+\theta}}\, n^{\frac{1}{2+\theta}}}{2^{\frac{1}{2+\theta}} \log(\frac{n}{2})}\right)}\right)$$
Thus the probability is upper bounded by $\frac{1}{2} + \epsilon$ as long as
$$w(n, \epsilon) = O\left(\frac{\epsilon^{1 + \frac{2}{2+\theta}}\, n^{\frac{1}{2} + \frac{1}{2+\theta}}}{\log(\frac{n}{2})\log(n)}\right).$$
Stated as a lower bound, we have that if an LTF-of-ReLU circuit has to match the $n$-bit Andreev function on more than a $\frac{1}{2} + \epsilon$ fraction of the inputs, for $\epsilon > \sqrt{\frac{2\log^{2+\theta}(\frac{n}{2})}{n}}$ and some $\theta > 0$ (asymptotically this is like having a constant $\epsilon$), then the LTF-of-ReLU circuit needs to be of size $\Omega\left(\epsilon^{\frac{4+\theta}{2+\theta}}\, n^{\frac{1}{2} + \frac{1}{2+\theta}}\right)$. Now we define $\delta \in (0, \frac{1}{2})$ by $\delta = \frac{\theta}{2(2+\theta)}$, and that gives the form of the almost linear lower bound as stated in the theorem.
4 Smaller upper bounds on the sign-rank of LTF-of-(ReLU)$^{d-1}$ with weight restrictions only on the bottommost layer (Proof of Theorem 2.5)
For a $\{-1,1\}^M \to \{-1,1\}$ LTF-of-ReLU circuit with any given weights on the network, the inputs to the threshold function of the top LTF gate are some set of $2^M$ real numbers (one for each input). Over all these inputs, let $p > 0$ be the distance from 0 of the largest negative number on which the LTF gate ever gets evaluated. Then by increasing the bias at this last LTF gate by a quantity less than $p$, we can ensure that no input to this LTF gate is 0, while the entire circuit still computes the same Boolean function as originally. So we can assume without loss of generality that the input to the threshold function at the top LTF gate is never 0. We also recall that the weights at the bottommost layer are constrained to be integers of magnitude at most $W > 0$.
Let this depth $d$ LTF-of-(ReLU)$^{d-1}$ circuit map $\{-1,1\}^m \times \{-1,1\}^m \to \{-1,1\}$. Let $\{w_k\}_{k=1}^{d-1}$ be the widths of the ReLU layers, indexed by $k$ increasing with distance from the input. Thus, the output LTF gate gets $w_{d-1}$ inputs; the $j$-th input, for $j = 1, 2, \ldots, w_{d-1}$, is the output of a circuit $C_j$ of depth $d-1$ composed of only ReLU gates. Let $f_j(x,y) : \{-1,1\}^m \times \{-1,1\}^m \to \mathbb{R}$ be the pseudo-Boolean function implemented by $C_j$. Thus the output of the overall LTF-of-(ReLU)$^{d-1}$ circuit is,
$$f(x,y) := \mathrm{LTF}\left(\beta + \sum_{j=1}^{w_{d-1}} \alpha_j f_j(x,y)\right) \qquad (10)$$
Lemma 4.1. Let $k \geq 0$ and $w_1, \ldots, w_k \geq 1$ be natural numbers. Consider a circuit with $2m$ inputs and a single output, consisting of only ReLU gates, of depth $k+1$, with $w_i$ ReLU gates at depth $i$, where $i = 1$ corresponds to the layer closest to the input (note that the single output ReLU gate is not counted here). We restrict the inputs to $\{-1,1\}^m \times \{-1,1\}^m$, so the circuit implements a pseudo-Boolean function $g : \{-1,1\}^m \times \{-1,1\}^m \to \mathbb{R}$.
Assume that the weights of the $w_1$ ReLU gates in the layer closest to the input are restricted as per Definition 2. Define the $2^m \times 2^m$ matrix $G(x,y)$, whose rows and columns are indexed by $(x,y) \in \{-1,1\}^m \times \{-1,1\}^m$, as
$$G(x,y) = g(x,y).$$
Then $G$ has a block structure, where the rows and columns can be partitioned contiguously into $O\left(\left(\prod_{i=1}^{k} w_i\right)(mW)\right)$ blocks (thus, $G$ has $O\left(\left(\prod_{i=1}^{k} w_i\right)^2 (mW)^2\right)$ blocks), and within each block $G$ is constant valued.
Before we prove the Lemma, let us see why it implies Theorem 2.5. Let $F_j(x,y)$ be the matrix obtained from the ReLU circuit outputs $f_j(x,y)$ from (10), and let $F(x,y)$ be the matrix obtained from $f(x,y)$. Let $J_{2^m \times 2^m}$ be the matrix of all ones. Then
$$\begin{aligned}
\mathrm{sign\text{-}rank}(F(x,y)) &= \mathrm{sign\text{-}rank}\left(\mathrm{sign}\left(\beta J_{2^m \times 2^m} + \sum_{j=1}^{w_{d-1}} \alpha_j F_j(x,y)\right)\right) \\
&\leq \mathrm{rank}\left(\beta J_{2^m \times 2^m} + \sum_{j=1}^{w_{d-1}} \alpha_j F_j(x,y)\right) \\
&\leq 1 + \sum_{j=1}^{w_{d-1}} \mathrm{rank}(F_j(x,y)) \\
&= O\left(\left(\prod_{k=1}^{d-1} w_k\right)^2 (mW)^2\right)
\end{aligned}$$
where the first inequality follows from the definition of sign-rank, the second inequality follows from the subadditivity of rank, and the last estimate is a consequence of Lemma 4.1. Indeed, a matrix with the block structure in the conclusion of Lemma 4.1 has rank at most $O\left(\left(\prod_{i=1}^{k} w_i\right)^2 (mW)^2\right)$, by expressing it as a sum of that many rank one matrices and using the subadditivity of rank.
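The rank estimate can be sanity-checked numerically: a matrix that is constant on each cell of an $r \times c$ contiguous block partition has rank at most $\min\{r, c\}$, in particular within the bound used above (our own sketch, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
r = c = 5                              # number of row blocks and column blocks
row_sizes = rng.integers(1, 6, r)      # arbitrary contiguous block sizes
col_sizes = rng.integers(1, 6, c)
B = rng.normal(size=(r, c))            # one value per block
# Expand each block value over its rows and columns: a block-constant matrix.
A = np.repeat(np.repeat(B, row_sizes, axis=0), col_sizes, axis=1)
print(np.linalg.matrix_rank(A), "<=", min(r, c))
```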
Now we recall that the Arkadev-Nikhil function $g$ (which is a linear sized depth 2 LTF circuit) on $2m = 2(n^{\frac{4}{3}} - n\log n)$ bits has sign-rank $\Omega(2^{n^{\frac{1}{3}} - 2\log n})$. It follows that $n^{\frac{4}{3}} \geq m$, and for any constant $C \in (0,1)$, for large enough $n$ we would have $\mathrm{sign\text{-}rank}(g) = \Omega(2^{Cn^{\frac{1}{3}}}) = \Omega(2^{Cm^{\frac{1}{4}}})$. From the above upper bound on the sign-rank of our bottom layer weight restricted LTF-of-(ReLU)$^{d-1}$ circuit with widths $\{w_k\}_{k=1}^{d-1}$, it follows that for it to represent the Arkadev-Nikhil function we would need $\left(\prod_{k=1}^{d-1} w_k\right)^2 (mW)^2 = \Omega(2^{Cm^{\frac{1}{4}}})$. Hence, using the AM-GM inequality $\sum_{k=1}^{d-1} w_k \geq (d-1)\left(\prod_{k=1}^{d-1} w_k\right)^{\frac{1}{d-1}}$ and the fact that $\frac{Cm^{1/4}}{2} \geq m^{1/8}$ for large enough $m$, it follows that the size $(1 + \sum_{k=1}^{d-1} w_k)$ required for such LTF-of-(ReLU)$^{d-1}$ circuits to represent the Arkadev-Nikhil function is
$$\Omega\left((d-1)\,\frac{2^{\frac{m^{1/8}}{d-1}}}{(mW)^{\frac{1}{d-1}}}\right).$$
The statement about LTF circuits is a straightforward consequence of the above result and Claim 5 in Appendix B, which says that any LTF gate can be simulated by 2 ReLU gates.
We now prove Lemma 4.1.
Proof of Lemma 4.1. We will prove this Lemma by induction on k.
The base case of the induction, $k = 0$: a single ReLU gate. A single ReLU gate's output is given by $\max\{0, \langle a_1, x \rangle + \langle a_2, y \rangle + b\}$, where $a_1, a_2 \in \mathbb{R}^m$ and $b \in \mathbb{R}$. Since the entries of $a_1, a_2$ and $b$ are assumed to be integers bounded by $W > 0$, the terms $\langle a_1, x \rangle$ and $\langle a_2, y \rangle$ can each take at most $O(mW)$ different values, since $x, y \in \{-1,1\}^m$. So we can arrange the rows and columns in increasing order of $\langle a_1, x \rangle$ and $\langle a_2, y \rangle$ respectively, and then partition the rows and columns contiguously according to these values; the base case is proved.
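A small numerical illustration of the base case (our own sketch): after sorting the rows by $\langle a_1, x \rangle$ and the columns by $\langle a_2, y \rangle$, the matrix of a single integer-weight ReLU gate is constant on each block cut out by the distinct inner-product values, of which there are only $O(mW)$ per side.

```python
import itertools
import numpy as np

m, W = 3, 2
rng = np.random.default_rng(2)
a1, a2 = rng.integers(-W, W + 1, m), rng.integers(-W, W + 1, m)
b = int(rng.integers(-W, W + 1))
verts = list(itertools.product([-1, 1], repeat=m))
rows = sorted(verts, key=lambda x: np.dot(a1, x))   # order rows by <a1, x>
cols = sorted(verts, key=lambda y: np.dot(a2, y))   # order columns by <a2, y>
G = np.array([[max(0, b + np.dot(a1, x) + np.dot(a2, y))
               for y in cols] for x in rows])
# Within a run of equal <a1, x> values and equal <a2, y> values, G is constant.
```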
The induction step. We first make a simple claim about sums of matrices which are blockwise constant.
Claim 4. Let $w, M, D$ be fixed natural numbers. Let $A_1, \ldots, A_w$ be any $M \times M$ matrices such that for each $A_i$ the rows and columns can be partitioned contiguously into $D$ blocks (not necessarily equal in size), such that $A_i$ is constant valued within each of the resulting $D^2$ blocks. Then $A := A_1 + \ldots + A_w$ is an $M \times M$ matrix whose rows and columns can be partitioned contiguously into $w(D-1) + 1$ blocks such that $A$ is constant valued within each block defined by this partition of the rows and columns.
Proof. The partition of the rows of $A_i$ into $D$ contiguous blocks is equivalent to a choice of $D-1$ dividing lines out of the $M-1$ possible lines. When we sum the matrices, the refined partition of the sum is given by a selection of at most $w(D-1)$ lines out of the $M-1$ lines, giving us at most $w(D-1)+1$ contiguous blocks. The same argument holds for the columns.
To complete the induction step, we observe that a ReLU circuit with $k+1$ layers of ReLU gates can be seen as computing $g(x,y) = \max\{0, b + \sum_{i=1}^{w_k} a_i g_i(x,y)\}$, where $g_i(x,y)$ is the output of a ReLU circuit of depth $k$. Thus, the corresponding matrices satisfy $G(x,y) = \max\{0, bJ_{2^m \times 2^m} + \sum_{i=1}^{w_k} a_i G_i(x,y)\}$, where $J_{2^m \times 2^m}$ is the matrix of all ones, and the “max” is taken entrywise. The induction hypothesis tells us that the rows and columns of each matrix $G_i$ can be partitioned contiguously into $O\left(\left(\prod_{i=1}^{k-1} w_i\right)(mW)\right)$ blocks such that $G_i$ is constant valued within each block. Thus, by Claim 4, the rows and columns of the matrix $bJ_{2^m \times 2^m} + \sum_{i=1}^{w_k} a_i G_i(x,y)$ can be partitioned into $O\left(\left(\prod_{i=1}^{k} w_i\right)(mW)\right)$ contiguous blocks. Taking the entrywise max with 0 preserves this block structure, which completes the induction.
5 Acknowledgements
We would like to thank Aurko Roy (Google Brain, San Francisco Bay Area) for extensive discussions
on the methods used and the questions addressed in this work. We also thank Nikhil Mande (TIFR),
Piyush Srivastava (TIFR) and Xin Li (JHU) for helpful conversations on circuit complexity. Amitabh
Basu and Anirbit Mukherjee gratefully acknowledge support from the NSF grant CMMI1452820.
References
[1] A. E. Andreev. About one method of obtaining more than quadratic effective lower bounds of
complexity of pi-schemes, 1987.
[2] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with
rectified linear units. https://eccc.weizmann.ac.il/report/2017/098/, 2016.
[3] H. Buhrman, N. Vereshchagin, and R. de Wolf. On computation and communication with
small bias. In Computational Complexity, 2007. CCC’07. Twenty-Second Annual IEEE Conference
on, pages 24–32. IEEE, 2007.
[4] M. Bun and J. Thaler. Improved bounds on the sign-rank of AC$^0$. In LIPIcs-Leibniz International Proceedings in Informatics, volume 55. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
[5] A. Chattopadhyay and N. S. Mande. Weights at the bottom matter when the top is heavy.
arXiv preprint arXiv:1705.02397, 2017.
[6] R. Chen, R. Santhanam, and S. Srinivasan. Average-case lower bounds and satisfiability algo-
rithms for small threshold circuits. In LIPIcs-Leibniz International Proceedings in Informatics,
volume 50. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
[7] A. Daniely. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.
[8] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference
on Learning Theory, pages 907–940, 2016.
[9] J. Forster. A linear lower bound on the unbounded error probabilistic communication com-
plexity. Journal of Computer and System Sciences, 65(4):612–625, 2002.
[10] J. Forster, M. Krause, S. V. Lokam, R. Mubarakzjanov, N. Schmitt, and H. U. Simon. Relations
between communication complexity, linear arrangements, and computational complexity. In
International Conference on Foundations of Software Technology and Theoretical Computer Sci-
ence, pages 171–182. Springer, 2001.
[11] A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, and G. Turan. Threshold circuits of bounded
depth. In Foundations of Computer Science, 1987., 28th Annual Symposium on, pages 99–110.
IEEE, 1987.
[12] B. Hanin. Universal function approximation by deep neural nets with bounded width and
relu activations. arXiv preprint arXiv:1708.02691, 2017.
[13] J. Håstad. Almost optimal lower bounds for small depth circuits. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 6–20. ACM, 1986.
[14] R. Impagliazzo, R. Meka, and D. Zuckerman. Pseudorandomness from shrinkage. In Foun-
dations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 111–119.
IEEE, 2012.
[15] R. Impagliazzo and M. Naor. Decision trees and downward closures. In Structure in Complexity Theory Conference, 1988. Proceedings., Third Annual, pages 29–38. IEEE, 1988.
[16] R. Impagliazzo, R. Paturi, and M. E. Saks. Size–depth tradeoffs for threshold circuits. SIAM
Journal on Computing, 26(3):693–707, 1997.
[17] V. Kabanets, D. Kane, and Z. Lu. A polynomial restriction lemma with applications. In Elec-
tronic Colloquium on Computational Complexity (ECCC), volume 24, page 26, 2017.
[18] D. M. Kane and R. Williams. Super-linear gate and super-quadratic wire lower bounds for
depth-two and depth-three threshold circuits. In Proceedings of the forty-eighth annual ACM
symposium on Theory of Computing, pages 633–643. ACM, 2016.
[19] M. Krause and P. Pudlák. On the computational power of depth 2 circuits with threshold
and modulo gates. In Proceedings of the twenty-sixth annual ACM symposium on Theory of
computing, pages 48–57. ACM, 1994.
[20] T. Lee, A. Shraibman, et al. Lower bounds in communication complexity. Foundations and Trends in Theoretical Computer Science, 3(4):263–399, 2009.
[21] S. Liang and R. Srikant. Why deep neural networks? arXiv preprint arXiv:1610.04161, 2016.
[22] S. V. Lokam et al. Complexity lower bounds using linear algebra. Foundations and Trends in Theoretical Computer Science, 4(1–2):1–155, 2009.
[23] W. Maass. Bounds for the computational power and learning complexity of analog neural
nets. SIAM Journal on Computing, 26(3):708–732, 1997.
[24] M. S. Paterson and U. Zwick. Shrinkage of De Morgan formulae under restriction. Random Structures & Algorithms, 4(2):135–150, 1993.
[25] T. Poggio and Q. Liao. Theory ii: Landscape of the empirical risk in deep learning. arXiv
preprint arXiv:1703.09833, 2017.
[26] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but
not shallow-networks avoid the curse of dimensionality: A review. International Journal of
Automation and Computing, pages 1–17, 2017.
[27] A. A. Razborov. On small depth threshold circuits. In Scandinavian Workshop on Algorithm
Theory, pages 42–52. Springer, 1992.
[28] A. A. Razborov and A. A. Sherstov. The sign-rank of AC$^0$. SIAM Journal on Computing, 39(5):1833–1855, 2010.
[29] B. Rossman. On the constant-depth complexity of k-clique. In Proceedings of the fortieth
annual ACM symposium on Theory of computing, pages 721–730. ACM, 2008.
[30] I. Safran and O. Shamir. Depth separation in relu networks for approximating smooth non-
linear functions. arXiv preprint arXiv:1610.09887, 2016.
[31] A. A. Sherstov. Powering requires threshold depth 3. Information processing letters, 102(2-
3):104–107, 2007.
[32] A. A. Sherstov. Separating AC$^0$ from depth-2 majority circuits. SIAM Journal on Computing, 38(6):2113–2129, 2009.
[33] A. A. Sherstov. The unbounded-error communication complexity of symmetric functions. Com-
binatorica, 31(5):583–614, 2011.
[34] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker,
M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature,
550(7676):354–359, 2017.
[35] K.-Y. Siu, V. P. Roychowdhury, and T. Kailath. Rational approximation techniques for analysis
of neural networks. IEEE Transactions on Information Theory, 40(2):455–466, 1994.
[36] J. Håstad. The shrinkage exponent of De Morgan formulas is 2. SIAM Journal on Computing, 27(1):48–64, 1998.
[37] B. A. Subbotovskaya. Realizations of linear functions by formulas using $+, \cdot, -$. Doklady Akademii Nauk SSSR, 136(3):553–555, 1961.
[38] S. Tamaki. A satisfiability algorithm for depth two circuits with a sub-quadratic number of
symmetric and threshold gates. In Electronic Colloquium on Computational Complexity (ECCC),
volume 23, page 4, 2016.
[39] M. Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.
[40] A. C.-C. Yao. Separating the polynomial-time hierarchy by oracles. In Foundations of Computer
Science, 1985., 26th Annual Symposium on, pages 1–10. IEEE, 1985.
[41] D. Yarotsky. Error bounds for approximations with deep relu networks. arXiv preprint
arXiv:1610.01145, 2016.
[42] C. Zhang, Q. Liao, A. Rakhlin, K. Sridharan, B. Miranda, N. Golowich, and T. Poggio. Theory
of deep learning iii: Generalization properties of sgd. Technical report, Center for Brains,
Minds and Machines (CBMM), 2017.
A Proof of Proposition 2.2
We first observe that the set of points where $\max\{0, x_1, x_2\}$ is not differentiable is precisely the union of the three half-lines (or rays) $\{(x_1, x_2) : x_1 = x_2, x_1 \geq 0\} \cup \{(0, x_2) : x_2 \leq 0\} \cup \{(x_1, 0) : x_1 \leq 0\}$. On the other hand, consider any Sum-of-ReLU circuit, which can be expressed as a function of the form
$$f(x) = \sum_{i=1}^{w} c_i \max\{0, \langle a_i, x \rangle + b_i\},$$
where $w \in \mathbb{N}$ is the number of ReLU gates in the circuit, and $a_i \in \mathbb{R}^2$, $b_i, c_i \in \mathbb{R}$ for all $i = 1, \ldots, w$. This implies that $f(x)$ is piecewise linear and the set of points where $f(x)$ is not differentiable is precisely the union of the $w$ lines $\langle a_i, x \rangle + b_i = 0$, $i = 1, \ldots, w$. Since a union of lines cannot equal the union of the three half-lines $\{(x_1, x_2) : x_1 = x_2, x_1 \geq 0\} \cup \{(0, x_2) : x_2 \leq 0\} \cup \{(x_1, 0) : x_1 \leq 0\}$, we obtain the consequence that $\max\{0, x_1, x_2\}$ cannot be represented by a Sum-of-ReLU circuit, no matter how many ReLU gates are used.
B Simulating an LTF gate by a ReLU gate
Claim 5. Any LTF gate $\{-1,1\}^n \to \{-1,1\}$ can be simulated by a Sum-of-ReLU circuit with at most 2 ReLU gates.
Proof. Given an LTF gate $(2 \cdot \mathbb{1}_{\langle a, x \rangle + b \geq 0} - 1)$, it separates the points of $\{-1,1\}^n$ into two subsets such that the hyperplane $\langle a, x \rangle + b = 0$ is a separating hyperplane between the two sets. Let $-p < 0$ be the value of the function $\langle a, x \rangle + b$ at the hypercube vertex on the “$-1$” side which is closest to this separating hyperplane. Now imagine a continuous piecewise linear function $f : \mathbb{R} \to \mathbb{R}$ such that $f(x) = -1$ for $x \leq -p$, $f(x) = 1$ for $x \geq 0$, and for $x \in (-p, 0)$, $f$ is the straight line connecting $(-p, -1)$ to $(0, 1)$. It follows from Corollary 3.1 of our previous work [2] that this $f$ can be implemented by an $\mathbb{R} \to \mathbb{R}$ Sum-of-ReLU circuit with at most 2 ReLU gates hinged at the points $-p$ and $0$ of the domain. Because the affine transformation $\langle a, x \rangle + b$ can be implemented by the wires connecting the $n$ input nodes to the layer of ReLUs, it follows that there exists an $\mathbb{R}^n \to \mathbb{R}$ Sum-of-ReLU circuit with at most 2 ReLU gates implementing the function $g(x) = f(\langle a, x \rangle + b) : \mathbb{R}^n \to \mathbb{R}$. It is clear that $g(x) = \mathrm{LTF}(x)$ for all $x \in \{-1,1\}^n$.
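A sketch of this construction (our own, in Python), checking $g = \mathrm{LTF}$ on the whole hypercube for one choice of weights:

```python
import itertools
import numpy as np

def ltf_via_two_relus(a, b, x, p):
    # g(x) = -1 + (2/p) * (ReLU(t + p) - ReLU(t)) with t = <a, x> + b:
    # equals -1 for t <= -p and +1 for t >= 0, which covers every vertex.
    t = np.dot(a, x) + b
    return -1 + (2 / p) * (max(0.0, t + p) - max(0.0, t))

a, b = np.array([2.0, -1.0, 3.0]), 0.5
cube = list(itertools.product([-1, 1], repeat=3))
vals = [np.dot(a, x) + b for x in cube]
p = -max(v for v in vals if v < 0)     # gap to the closest "-1"-side vertex
assert all(abs(ltf_via_two_relus(a, b, x, p)
               - (1 if np.dot(a, x) + b >= 0 else -1)) < 1e-9 for x in cube)
```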
C PARITY on $k$ bits can be implemented by an $O(k)$ size Sum-of-ReLU circuit
For this proof it is convenient to think of the PARITY function as the following map,
$$\mathrm{PARITY} : \{0,1\}^k \to \{0,1\}, \qquad x \mapsto \left(\sum_{i=1}^{k} x_i\right) \bmod 2$$
It is clear that in the evaluation of the PARITY function as stated above, the required sum over the coordinates of the input Boolean vector takes as value every integer in the set $\{0, 1, 2, \ldots, k\}$. The PARITY function can then be lifted to a function $f : \mathbb{R} \to \mathbb{R}$ such that $f(y) = 0$ for all $y \leq 0$, $f(y) = y \bmod 2$ for all $y \in \{1, 2, \ldots, k\}$, $f(y) = k \bmod 2$ for all $y > k$, and for any $y \in (p, p+1)$ with $p \in \{0, 1, \ldots, k-1\}$, $f$ is the straight line connecting the points $(p, p \bmod 2)$ and $(p+1, (p+1) \bmod 2)$. Thus $f$ is a continuous piecewise linear function on $\mathbb{R}$ with $k+2$ linear pieces. It then follows from Theorem 2.3 of our previous work [2] that this $f$ can be implemented by an $\mathbb{R} \to \mathbb{R}$ Sum-of-ReLU circuit with at most $k+1$ ReLU gates hinged at the points $\{0, 1, 2, \ldots, k\}$ of the domain. The wires from the $k$ inputs to the ReLU gates can implement the linear function $\sum_{i=1}^{k} x_i$. Thus it follows that there exists an $\mathbb{R}^k \to \mathbb{R}$ Sum-of-ReLU circuit (say $C$) such that $C(x) = \mathrm{PARITY}(x)$ for all $x \in \{0,1\}^k$.
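Making the hinge coefficients explicit, one convenient realization of such an $f$ (our own sketch; the text above only asserts existence via [2]) is $f(y) = \mathrm{ReLU}(y) + \sum_{p=1}^{k} 2(-1)^p \mathrm{ReLU}(y - p)$, which interpolates $y \mapsto y \bmod 2$ on $\{0, 1, \ldots, k\}$; this is all that is needed, since $\sum_i x_i$ never leaves that set:

```python
import itertools

def parity_relu(x):
    # k + 1 ReLU gates hinged at 0, 1, ..., k, applied to y = sum(x);
    # the alternating +-2 coefficients flip the slope at each integer.
    k, y = len(x), sum(x)
    out = max(0, y)
    for p in range(1, k + 1):
        out += 2 * (-1) ** p * max(0, y - p)
    return out

assert all(parity_relu(x) == sum(x) % 2
           for x in itertools.product([0, 1], repeat=5))
```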
