Motivated by the resurgence of neural networks in solving complex learning tasks, we undertake a study of high depth networks using ReLU gates, which implement the function x → max{0, x}. We try to understand the role of depth in such neural networks by showing size lower bounds against such network architectures in parameter regimes hitherto unexplored. In particular, we show the following two main results about neural nets computing Boolean functions of input dimension n:
• We use the method of random restrictions to show an almost linear, $\Omega(\epsilon^{2(1-\delta)} n^{1-\delta})$, lower bound for completely weight unrestricted LTF-of-ReLU circuits to match the Andreev function on at least a $1/2 + \epsilon$ fraction of the inputs.
• We use the method of sign-rank to show lower bounds that are exponential in the dimension for ReLU circuits ending in an LTF gate and of depths up to $O(n^{\xi})$ with $\xi < 1/8$, with some restrictions on the weights in the bottom most layer. All other weights in these circuits are kept unrestricted. This in turn also implies the same lower bounds for LTF circuits with the same architecture and the same weight restrictions on their bottom most layer.
Introduction
There has been a recent surge of activity in using neural networks for complex artificial intelligence tasks (like this very recent spectacular demonstration [34] of the power of neural nets). This has rekindled interest in understanding neural networks from a complexity theory perspective. A myriad of hard mathematical questions have surfaced in the attempts to rigorously explain the power of neural networks, and a comprehensive overview of these can be found in this recent three part series of articles from The Center for Brains, Minds and Machines (CBMM) [26, 25, 42]. There is a rich literature investigating the complexity of the function classes represented by neural networks with various kinds of gates (or "activation functions", which is the more common parlance in machine learning). Many papers, a canonical example being the classic paper by Maass [23], establish complexity results for the entire class of functions represented by circuits where the gates can come from a very general family. This is complemented by papers that study a very specific family of gates such as the sigmoid gate or the LTF gate [16], [35], [31], [19], [3], [32], [28], [4]. Many associated results can also be found in these reviews [20, 27]. Recent circuit complexity results in [18], [38], [6], [17] stand out as significant improvements over known lower (and upper) bounds on circuit complexity with threshold gates. The results of Maass [23] also show that very general families of neural networks can be converted into circuits with only LTF gates with at most a constant factor blow up in depth and polynomial blow up in size of the circuits.
In the last 5 years or so, a particular family of gates called Rectified Linear Unit (ReLU) gates has been reported to have significant advantages over more traditional gates in practical applications of neural networks. Such a gate with n real inputs computes the following output,
$$x \mapsto \max\{0,\; b + \langle w, x \rangle\} \tag{1}$$
where $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ are fixed parameters associated with the gate (b is called the bias of the gate). In comparison, the ±1 valued LTF gate mentioned above computes (for the same weights as above) the function $2 \cdot \mathbf{1}_{b + \langle w, x \rangle \ge 0} - 1$, where $\mathbf{1}_{b + \langle w, x \rangle \ge 0}$ is the 0/1 indicator function for the stated halfspace condition.
Some of the prior results which apply to general gates, such as the ones in [23], also apply to ReLU gates, because those results apply to gates that compute a piecewise polynomial function (ReLU is a piecewise linear function with only two pieces). However, as witnessed by results on LTF gates, one can usually make much stronger claims about specific classes of gates. To the best of our knowledge, no prior results have been obtained for ReLU gates from the perspective of Boolean complexity theory, i.e., the study of such circuits when restricted to Boolean inputs. The main focus of this work is to study circuits computing Boolean functions mapping $\{-1,1\}^m \to \{-1,1\}$ which use ReLU gates in their intermediate layers, and have an LTF gate at the output node (to ensure that the output is in {−1, 1}). We remark that using an LTF gate at the output node while allowing more general analog gates in the intermediate nodes is a standard practice when studying the Boolean complexity of analog gates (see, for example, [23]).
Although we are not aware of an analysis of lower bounds for ReLU circuits when applied to only Boolean inputs, there has been recent work on the analysis of such circuits when viewed as a function from $\mathbb{R}^n$ to $\mathbb{R}$ (i.e., allowing real inputs and output). From [8] and [7] (with restrictions on the domain and the weights) we know of (super-)exponential lower bounds on the size of Sum-of-ReLU circuits for certain easy Sum-of-ReLU-of-ReLU functions. Depth versus size tradeoffs for such circuits have recently also been studied in [39, 12, 21, 41, 30] and in a recent paper [2] by the current authors. To the best of our knowledge no lower bounds scaling exponentially with the dimension are known for analog deep neural networks of depths more than 2.
In what follows, the depth of a circuit will be the length of the longest path from the output node to an input variable, and the size of a circuit will be the total number of gates in the circuit. We will also use the notation Sum-of-ReLU to refer to circuits whose inputs feed into a single layer of ReLU gates, whose outputs are combined into a weighted sum to give the final output. Similarly, Sum-of-ReLU-of-ReLU denotes the circuit with depth 3, where the output node is a simple weighted sum, and the intermediate gates are all ReLU gates in the two "hidden" layers. We analogously define Sum-of-LTF, LTF-of-LTF, LTF-of-ReLU, LTF-of-LTF-of-LTF, LTF-of-ReLU-of-ReLU and so on. We will also use the notation LTF-of-(ReLU)$^k$ for a circuit of the form LTF-of-ReLU-of-ReLU-. . .-ReLU with k ≥ 1 levels of ReLU gates.
can be implemented by a Sum-of-ReLU circuit with 2 ReLU gates, and any Sum-of-LTF that implements f needs Ω(n) gates.
The above result follows from the following two facts: 1) any linear function is implementable by 2 ReLU gates, and 2) any Sum-of-LTF circuit with w LTF gates gives a piecewise constant function that takes at most $2^w$ different values. Since f takes $2^n$ different values (it evaluates every vertex of the Boolean hypercube to the corresponding natural number expressed in binary), we need w ≥ n gates.
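Both facts can be checked on a small instance. The sketch below assumes that f reads x ∈ {0,1}^n as a binary number (this matches the description above, but the bit ordering and the helper names are illustrative assumptions, not taken from the original statement). It uses the identity z = max{0, z} − max{0, −z} to realize a linear function with 2 ReLU gates.

```python
from itertools import product

def relu(t):
    return max(0, t)

def linear_via_two_relus(w, b, x):
    """Compute <w, x> + b as ReLU(z) - ReLU(-z), where z = <w, x> + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return relu(z) - relu(-z)

def binary_value(x):
    """Assumed form of f: read x in {0,1}^n as a binary number (x[0] least significant)."""
    weights = [2 ** i for i in range(len(x))]
    return linear_via_two_relus(weights, 0, x)

n = 4
values = {binary_value(x) for x in product([0, 1], repeat=n)}
print(len(values))  # 16, i.e. 2**n distinct outputs on the hypercube
```

Since w LTF gates produce at most 2^w distinct output patterns, a Sum-of-LTF circuit can only take 2^n distinct values if w ≥ n.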
In the context of these preliminary results, we now state our main contributions. For the next result we recall the definition of the Andreev function [1], which has been used many times to prove computational lower bounds [24, 15, 14].
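Definition 1 (Andreev's function). Andreev's function is the following mapping,
$$A_n : \{0,1\}^{\lfloor n/2 \rfloor} \times \{0,1\}^{\lfloor \log(n/2) \rfloor \times \left\lfloor \frac{n}{2 \lfloor \log(n/2) \rfloor} \right\rfloor} \to \{0,1\}, \qquad (x, [a_{ij}]) \mapsto x_{\mathrm{bin}\left( \left( \bigoplus_{j} a_{ij} \right)_{i = 1, \ldots, \lfloor \log(n/2) \rfloor} \right)}$$
where "bin" is the function that maps a bit string to the natural number it represents in binary.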
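To make the definition concrete, here is a minimal sketch, assuming the standard form of Andreev's function: XOR each row of the matrix a, read the resulting bits as an index, and output that bit of x. The toy sizes, the bit ordering inside `bin_to_int` and all names are illustrative assumptions. The sketch also shows the effect of fixing all but one entry in each row of a: the restricted function runs through every bit of x exactly once, so its truth table is x up to a relabeling of inputs.

```python
from itertools import product

def bin_to_int(bits):
    """Natural number represented by a bit string (most significant bit first)."""
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value

def andreev(x, a):
    """x: list of 0/1 bits; a: matrix (list of rows) of 0/1 bits."""
    row_parities = [sum(row) % 2 for row in a]
    return x[bin_to_int(row_parities)]

# Toy instance (hypothetical sizes): x has 8 bits, a has 3 rows and 2 columns.
x = [0, 1, 1, 0, 1, 0, 0, 1]
rows = 3
fixed_rest = [[1], [0], [1]]  # values a restriction assigns to the non-free column

def restricted(y):
    """Andreev's function after fixing x and all of a except column 0."""
    a = [[y[i]] + fixed_rest[i] for i in range(rows)]
    return andreev(x, a)

truth_table = [restricted(y) for y in product([0, 1], repeat=rows)]
print(sorted(truth_table) == sorted(x))  # True: every bit of x is read exactly once
```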
We are particularly inspired by the most recent use of the Andreev function by Kane and Williams [18] to get the first superlinear lower bounds for approximating it using LTF-of-LTF circuits. We will give an almost linear lower bound on the size of LTF-of-ReLU circuits approximating this Andreev function with no restriction on the weights w, b for each gate.
, any LTF-of-ReLU circuit on n bits that matches the Andreev function on n bits for at least a $1/2 + \epsilon$ fraction of the inputs has size $\Omega(\epsilon^{2(1-\delta)} n^{1-\delta})$.
It is well known that proving lower bounds without restrictions on the weights is much more challenging even in the context of LTF circuits. In fact, the recent results in [18] are the first superlinear lower bounds for LTF circuits with no restrictions on the weights. With restrictions on some or all the weights, e.g., assuming poly(n) bounds on the weights (typically termed the "small weight assumption") in certain layers, exponential lower bounds have been established for LTF circuits [11, 16, 32, 33]. Our next results are of this flavor: under certain kinds of weight restrictions, we prove exponential lower bounds on the size of LTF-of-(ReLU)$^{d-1}$ circuits.
In words, $P_{m,\sigma}$ is the set of all linear objectives that order the vertices of the m-dimensional hypercube in the order specified by σ. We will impose the condition that there exists a σ such that for each ReLU gate in the bottom layer, the vector $w \in P_{m,\sigma}$ (w as defined in (1)) and all weights are integers with magnitude bounded by some W > 0.
We will prove our lower bounds against the function proposed by Arkadev and Nikhil in [5] ,
which we will refer to as the Arkadev-Nikhil function in the remainder of the paper. Here OMB is the ODD-MAX-BIT function, which is a ±1 threshold gate that evaluates to −1 on, say, an n-bit input x if
We show the following exponential lowerbound against this function,
circuits on 2m bits such that the weights in the bottom layer are restricted as per Definition 2 that implements the Arkadev-Nikhil function on 2m bits will require a circuit size of
Consequently, one obtains the same size lower bounds for circuits with only LTF gates of depth d.
Note that this is a size lower bound exponential in the dimension, even for super-polynomially growing bottom layer weights (with the additional constraints of Definition 2) and up to depths scaling as
We note that the Arkadev-Nikhil function can be represented by an O(m) size LTF-of-LTF circuit with no restrictions on weights (see Theorem 2.6 below). In light of this fact, Theorem 2.5 is somewhat surprising, as it shows that for the purpose of representing Boolean functions a deep ReLU circuit (ending in an LTF gate) can get exponentially weakened when just its bottom layer weights are restricted as per Definition 2, even if the integers are allowed to be super-polynomially large. Moreover, the lower bounds also hold for LTF circuits of arbitrary depth d, under the same weight restrictions on the bottom layer. We are unaware of any exponential lower bounds on LTF circuits of arbitrary depth under any kind of weight restrictions.
We will use the method of sign-rank to obtain the exponential lower bounds in Theorem 2.5. The sign-rank of a real matrix A with all non-zero entries is the least rank of a matrix B of the same dimension with all non-zero entries such that for each entry (i, j), $\mathrm{sign}(B_{ij}) = \mathrm{sign}(A_{ij})$. For a Boolean function $f : \{-1,1\}^m \times \{-1,1\}^m \to \{-1,1\}$ one defines the "sign-rank of f" as the sign-rank of the $2^m \times 2^m$ matrix $[f(x,y)]_{x,y \in \{-1,1\}^m}$. This notion of sign-rank has been used to great effect in diverse fields, from communication complexity to circuit complexity to learning theory. Explicit matrices with a high sign-rank were not known until the breakthrough work by Forster [9]. Forster et al. made elegant use of this complexity measure to show exponential lower bounds against LTF-of-MAJ circuits in [10]. Much of the previous literature about sign-rank has been reviewed in the book by Satya Lokam [22]. Most recently the following result was obtained by Arkadev and Nikhil in [5], leading to a proof of strict containment of LTF-of-MAJ in LTF-of-LTF. We will prove our theorem by showing a small upper bound on the sign-rank of LTF-of-(ReLU)$^{d-1}$ circuits which have their bottom most layer's weights restricted in the said way.
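The gap between rank and sign-rank can be seen on a toy example. The sketch below is an illustration chosen here, not a construction from the paper: the ±1 "greater-than-or-equal" matrix A has full rank, while the matrix B with entries i − j + 1/2 has the same sign pattern and rank 2, so the sign-rank of A is at most 2. The helper `rank` is a plain exact Gaussian elimination written for this example.

```python
from fractions import Fraction

def rank(matrix):
    """Exact rank over the rationals by Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    n_cols = len(rows[0]) if rows else 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [vi - factor * vr for vi, vr in zip(rows[i], rows[r])]
        r += 1
    return r

def same_signs(A, B):
    """True if A and B (both without zero entries) agree in sign entrywise."""
    return all((a > 0) == (b > 0) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

N = 6
A = [[1 if i >= j else -1 for j in range(N)] for i in range(N)]
B = [[Fraction(2 * (i - j) + 1, 2) for j in range(N)] for i in range(N)]
print(rank(A), rank(B), same_signs(A, B))  # 6 2 True
```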
Lower bounds for LTF-of-ReLU against the Andreev function (Proof of Theorem 2.4)
We will use the classic "method of random restrictions" [37, 36, 13, 40, 29] to show a lowerbound for weight unrestricted LTF-of-ReLU circuits for representing the Andreev function. The basic philosophy of this method is to take any arbitrary LTF-of-ReLU circuit which supposedly matches the Andreev function on a large fraction of the inputs and to randomly fix the values on some of its input coordinates and also do the same fixing on the same coordinates of the input to the Andreev function. Then we show that upon doing this restriction the Andreev function collapses to an arbitrary Boolean function on the remaining inputs (what it collapses to depends on what values were fixed on its inputs that got restricted). But on the other hand we show that the LTF-of-ReLU collapses to a circuit which is of such a small size that with high-probability it cannot possibly approximate a randomly chosen Boolean function on the remaining inputs. This contradiction leads to a lowerbound.
There are two important concepts towards implementing the above idea. First, one has to precisely define when a ReLU gate, upon a partial restriction of its inputs, can be considered removable from the circuit. Once this notion is clarified it will automatically turn out that doing random restrictions on a ReLU gate is the same as doing random restrictions on an LTF gate, as was recently done in [18]. Second, it needs to be true that LTF-of-ReLU circuits of any fixed size cannot represent too many of all the Boolean functions possible at the same input dimension. For this very specific case of LTF-of-ReLU circuits, where ReLU gates necessarily have a fan-out of 1, Theorem 2.1 in [23] applies, and we have from there that LTF-of-ReLU circuits over n bits with w ReLU gates can represent at most $N = 2^{O((wn+w+w+1+1)^2 \log(wn+w+w+1+1))} = 2^{O((wn+2w+2)^2 \log(wn+2w+2))}$ Boolean functions. We note that, slightly departing from the usual convention with neural networks, in this work Wolfgang Maass allows for direct wires from the input nodes to the output LTF gate. This flexibility ties in nicely with how we want to define a ReLU gate as becoming removable under the random restrictions that we use.
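The removability notion used in the proof is that a ReLU gate can be dropped once its affine argument has constant sign on all remaining inputs: it then computes either the constant zero or a plain linear function that direct wires to the output gate can carry. A minimal sketch of that check follows, assuming inputs in {0,1}^n and using hypothetical names and weights; the treatment of the boundary value 0 is a choice made here for illustration.

```python
def relu_gate_status(w, b, restriction):
    """Classify the gate max(0, b + <w, x>) on {0,1}^n under a partial assignment.

    restriction[i] is 0 or 1 for a fixed coordinate and None for a free one.
    """
    fixed_part = b + sum(wi * ri for wi, ri in zip(w, restriction) if ri is not None)
    free = [wi for wi, ri in zip(w, restriction) if ri is None]
    lowest = fixed_part + sum(min(0, wi) for wi in free)   # minimum of the argument
    highest = fixed_part + sum(max(0, wi) for wi in free)  # maximum of the argument
    if highest <= 0:
        return "constant zero"   # the gate never fires
    if lowest >= 0:
        return "linear"          # the gate always passes its argument through
    return "still a ReLU"

w, b = [3, -1, 2], -1  # hypothetical gate parameters
print(relu_gate_status(w, b, [1, None, None]))
print(relu_gate_status(w, b, [0, None, 0]))
print(relu_gate_status(w, b, [None, None, 0]))
```

An LTF gate with the same affine argument becomes constant under exactly the same sign condition, which is why a restriction lemma stated for LTF gates can be reused.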
Random Boolean functions vs any circuit class
In everything that follows, all sampling (denoted by ∼) is to be understood as sampling from a uniform distribution unless otherwise specified. First we note this well-known lemma.
Claim 1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any given Boolean function. Then the following is true,
From the above it follows that if N is the total number of functions in any circuit class (whose members be called C) then we have by union bound,
Equipped with these basics we are now ready to begin the proof of the lowerbound against weight unrestricted LTF-of-ReLU circuits, Proof. 
2 k whereby in the last inequality above we have assumed that $n = 2^{k+1}$. This assumption is legitimate because we want to estimate certain large n asymptotics. For any arbitrarily chosen constant C < 2 9 we try to satisfy the following condition, $O(s^2 k^2 \log(ks))$ −
For any constant θ > 0 and large enough x > 0 we would have $\log(x) < x^{\theta}$, and hence the above constraint on s gets satisfied if we work in the regime, s ≤ O( 
Definition 4 ($F^*$). Let $F^*$ be the subset of all the functions f above for which the above event is true.
Now we recall the definition of the Andreev function in equation 1 for the following definition and the claim,
Definition 5 (ρ). Let ρ denote the set of all possible "random restrictions" where one fixes all the input bits of $A_n$ except 1 bit in each row of the matrix a. So the restricted function (call it $A_n|_\rho$, overloading the notation for simplicity) computes a function of the form,
From the definitions of $A_n$ and ρ above the following is immediate.

Claim 2. The truth table of $A_n|_\rho$ is the x string in the input to $A_n$ that gets fixed by ρ. Thus we observe that if ρ is chosen uniformly at random then $A_n|_\rho$ is a $\lfloor \log(n/2) \rfloor$-bit Boolean function chosen uniformly at random.

Let $f^*$ be any arbitrary member of $F^*$. Let $x^* \in \{0,1\}^{\lfloor n/2 \rfloor}$ be the truth table of $f^*$. Let $\rho(x^*)$ be the restrictions on the input of $A_n$ which fix the x part of its input to $x^*$. So when we sample restrictions uniformly at random from the restrictions of the type $\rho(x^*)$, the different instances differ in which bit of each row of the matrix a (of the input to $A_n$) they leave unfixed and to what values they fix the other entries of a. Let C be an n-bit LTF-of-ReLU Boolean circuit of size, say, $w(n, \epsilon)$. Thus under the restriction $\rho(x^*)$ both C and $A_n$ are $\lfloor \log(n/2) \rfloor$-bit Boolean functions.

Now we note that a ReLU gate over n bits becomes redundant (and hence removable) upon a random restriction iff its linear argument reduces to either an everywhere non-positive function or an everywhere positive function of the remaining inputs. In the former case the gate computes the constant function zero, and in the latter case it computes a linear function, which can be implemented simply by introducing wires connecting the inputs directly to the output LTF gate. Thus in both cases the resultant function no longer needs the ReLU gate for it to be computed. (We note that such direct wires from the input to the output gate were allowed in how the counting was done of the total number of LTF-of-ReLU Boolean functions at a fixed circuit size.) Combining both the cases, we note that the conditions for collapse (in this sense) of a ReLU gate are identical to the conditions for collapse of an LTF gate with the same linear argument. Hence, corresponding to the random restrictions ρ, we can directly utilize the random restriction lemma 1.1 from [18] to say that,
Now we compare with the definitions of ε and $f^*$ to observe that (a) with probability at least $1 - \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)}$, $C|_{\rho(x^*)}$ is of the circuit type as in the event in equation 2, and (b) by definition of the Andreev function it follows that $A_n|_{\rho(x^*)}$ has its truth table given by $x^*$ and hence it specifies the same function as $f^* \in F^*$. Hence for all $x^*$ and $\rho(x^*)$ we can equally well write this as,
For all $x^*$, equation 6 can be rewritten as,
Equation 5 can be written as,
Claim 3. Circuits C have low correlation with the Andreev function
Proof. We think of sampling a $z \sim \{0,1\}^n$ as a two step process of first sampling $\tilde{f}$, a $\lfloor \log(n/2) \rfloor$-bit Boolean function, and fixing the first $\lfloor n/2 \rfloor$ bits of z to be the truth table of $\tilde{f}$, and then randomly assigning values to the remaining $\lfloor n/2 \rfloor$ bits of z. Call this latter $\lfloor n/2 \rfloor$-bit string $x_{\text{other}}$.
In the last line above we have invoked equation 9. Now we note that sampling the n-bit string z such that $\tilde{f} \in F^*$ is the same as doing a random restriction of the type $\rho(\tilde{f})$ and then randomly picking a $\lfloor \log(n/2) \rfloor$-bit string, say y. So we can rewrite the last inequality as,
In the last step above we have used equations 7 and 8.
So after putting back the values of η and the largest scaling of s(n, ǫ) that we can have (from equation 5), the upperbound on the above probability becomes,
Thus the probability is upper bounded by

Over all these inputs let p > 0 be the distance from 0 of the largest negative number on which the LTF gate ever gets evaluated. Then by increasing the bias at this last LTF gate by a quantity less than p we can ensure that no input to this LTF gate is 0 while the entire circuit still computes the same Boolean function as originally. So we can assume without loss of generality that the input to the threshold function at the top LTF gate is never 0. We also recall that the weights at the bottom most layer are constrained to be integers of magnitude at most W > 0.
Let this depth
k=1 be the widths of the ReLU layers at depths indexed by increasing k with increasing distance from the input. Then G has a block structure, where the rows and columns can be partitioned contiguously into
, and within each block G is constant valued.
Before we prove the Lemma, let us see why it implies Theorem 2.5. Let $F_j(x,y)$ be the matrix obtained from the ReLU circuit outputs $f_j(x,y)$ from (10), and let $F(x,y)$ be the matrix obtained from $f(x,y)$. Let $J_{2^m \times 2^m}$ be the matrix of all ones. Then sign-rank(F (x, y)) = sign-rank
where the first inequality follows from the definition of sign-rank, the second inequality follows from the subadditivity of rank and the last inequality is a consequence of Lemma 4.1. Indeed, a matrix with block structure as in the conclusion of Lemma 4.1 has rank at most O (
by expressing it as a sum of these many matrices of rank one and using subadditivity of rank. Now we recall that the Arkadev-Nikhil function g (which is a linear sized depth 2 LTF circuit) on $2m = 2(n^{4/3} - n \log n)$ bits has sign-rank $\Omega(2^{n^{1/3} - 2 \log n})$. It follows that $n^{4/3} \ge m$, and for any constant $C \in (0,1)$ and large enough n we would have, sign-rank(g) = Ω(2 Cn 
The statement about LTF circuits is a straightforward consequence of the above result and Claim 5 in Appendix B which says that any LTF gate can be simulated by 2 ReLU gates. We now prove Lemma 4.1.
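The rank argument for block-constant matrices can be sanity-checked on a single bottom-layer gate. The sketch below uses hypothetical integer weights with magnitude at most W and the helper names are illustrative; it builds the matrix of a gate max{0, ⟨a1, x⟩ + ⟨a2, y⟩ + b} over x, y ∈ {−1, 1}^m, with rows and columns sorted by ⟨a1, x⟩ and ⟨a2, y⟩. Rows with the same value of ⟨a1, x⟩ are identical, so the rank is at most the number of row blocks (and likewise for columns).

```python
from fractions import Fraction
from itertools import product

def rank(matrix):
    """Exact rank over the rationals by Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [vi - factor * vr for vi, vr in zip(rows[i], rows[r])]
        r += 1
    return r

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

m, W = 3, 2
a1, a2, b = [2, -1, 1], [-2, 2, 1], -1  # hypothetical weights, |entries| <= W
cube = list(product([-1, 1], repeat=m))

# Sort rows by <a1, x> and columns by <a2, y> so equal values form contiguous blocks.
xs = sorted(cube, key=lambda x: dot(a1, x))
ys = sorted(cube, key=lambda y: dot(a2, y))
G = [[max(0, dot(a1, x) + dot(a2, y) + b) for y in ys] for x in xs]

row_blocks = len({dot(a1, x) for x in cube})
col_blocks = len({dot(a2, y) for y in cube})
print(row_blocks, col_blocks, rank(G))
```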
Proof of Lemma 4.1. We will prove this Lemma by induction on k.
The base case of the induction, k = 0: a single ReLU gate. A single ReLU gate's output is given by $\max\{0, \langle a_1, x \rangle + \langle a_2, y \rangle + b\}$, where $a_1, a_2 \in \mathbb{R}^m$ and $b \in \mathbb{R}$. Since the entries of $a_1$, $a_2$ and b are assumed to be integers bounded in magnitude by W > 0, the terms $\langle a_1, x \rangle$ and $\langle a_2, y \rangle$ can each take at most O(mW) different values, since $x, y \in \{-1,1\}^m$. So we can arrange the rows and columns in increasing order of $\langle a_1, x \rangle$ and $\langle a_2, y \rangle$ and then partition the rows and columns contiguously according to these values, and the base case is proved.

The induction step. We first make a simple claim about the sum of matrices which are block wise constant. To complete the induction step, we observe that a ReLU circuit of depth k + 1 can be seen as computing $g(x,y) = \max\{0, b + \sum_{i=1}^{w_k} a_i g_i(x,y)\}$, where $g_i(x,y)$ is the output of a ReLU circuit of depth k. Thus, the corresponding matrices satisfy $G(x,y) = \max\{0, bJ_{2^m \times 2^m} + \sum_{i=1}^{w_k} a_i G_i(x,y)\}$, where $J_{2^m \times 2^m}$ is the matrix of all ones, and the "max" is taken entrywise. The induction hypothesis tells us that the rows and columns of each matrix $G_i$ can be partitioned contiguously into O (

−p and 0 on the domain. Because the affine transformation $\langle a, x \rangle + b$ can be implemented by the wires connecting the n input nodes to the layer of ReLUs, it follows that there exists an $\mathbb{R}^n \to \mathbb{R}$ Sum-of-ReLU circuit with at most 2 ReLU gates implementing the function $g(x) = f(\langle a, x \rangle + b) : \mathbb{R}^n \to \mathbb{R}$. It is clear that g(x) = LTF(x) for all $x \in \{-1,1\}^n$.
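The simulation of an LTF gate by 2 ReLU gates on Boolean inputs can be sketched as follows. The example assumes integer weights, takes p to be the smallest magnitude among the negative values of the affine argument on the hypercube, and places the two gates at −p and 0. It also assumes that the output sum may carry a constant offset of −1; that offset and all names and weights are assumptions of this illustration rather than details confirmed by the text.

```python
from fractions import Fraction
from itertools import product

def relu(t):
    return max(0, t)

def affine(a, b, x):
    return sum(ai * xi for ai, xi in zip(a, x)) + b

def ltf(a, b, x):
    """The +-1 valued threshold gate with the convention 'argument >= 0 gives +1'."""
    return 1 if affine(a, b, x) >= 0 else -1

def ltf_via_two_relus(a, b, x, p):
    """Ramp from -1 (argument <= -p) to +1 (argument >= 0) built from two ReLU gates."""
    y = affine(a, b, x)
    return -1 + Fraction(2, p) * (relu(y + p) - relu(y))

a, b = [3, -2, 1], -1  # hypothetical integer weights and bias
cube = list(product([-1, 1], repeat=len(a)))
negatives = [affine(a, b, x) for x in cube if affine(a, b, x) < 0]
p = min(-v for v in negatives) if negatives else 1

agree = all(ltf_via_two_relus(a, b, x, p) == ltf(a, b, x) for x in cube)
print(agree)  # True
```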
C PARITY on k bits can be implemented by an O(k) size Sum-of-ReLU circuit
For this proof it is convenient to think of the PARITY function as the following map,
$$\mathrm{PARITY} : \{0,1\}^k \to \{0,1\}, \qquad x \mapsto \Big( \sum_{i=1}^{k} x_i \Big) \bmod 2$$
It is clear that in the evaluation of the PARITY function as stated above, the required sum over the coordinates of the input Boolean vector takes as value every integer in the set {0, 1, 2, ..., k}.
The PARITY function can then be lifted to a function $f : \mathbb{R} \to \mathbb{R}$ such that f(y) = 0 for all y ≤ 0, f(y) = y mod 2 for all y ∈ {1, 2, ..., k}, f(y) = k mod 2 for all y > k, and for any y ∈ (p, p + 1) with p ∈ {0, 1, ..., k−1}, f is the straight line connecting the points (p, p mod 2) and (p+1, (p+1) mod 2). Thus f is a continuous piecewise linear function on $\mathbb{R}$ with k + 2 linear pieces. Then it follows from Theorem 2.3 of our previous work [2] that this f can be implemented by an $\mathbb{R} \to \mathbb{R}$ Sum-of-ReLU circuit with at most k + 1 ReLU gates hinged at the points {0, 1, 2, ..., k} on the domain. The wires from the k inputs to the ReLU gates can implement the linear function $\sum_{i=1}^{k} x_i$. Thus it follows that there exists an $\mathbb{R}^k \to \mathbb{R}$ Sum-of-ReLU circuit (say C) such that C(x) = PARITY(x) for all $x \in \{0,1\}^k$.
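The construction above can be written out explicitly. In the sketch below the output coefficients are derived here from the slope changes of f at its hinge points (slope +1 on (0, 1), alternating sign afterwards, slope 0 beyond k); they are not quoted from [2], and the function names are illustrative.

```python
from itertools import product

def relu(t):
    return max(0, t)

def parity_coefficients(k):
    """Output weights for the k + 1 gates max(0, s - p), p = 0, ..., k.

    Each weight equals the change of slope of f at the hinge p.
    """
    return [1] + [2 * (-1) ** p for p in range(1, k)] + [(-1) ** k]

def parity_circuit(x):
    """Sum-of-ReLU circuit: every gate reads s = x_1 + ... + x_k shifted by its hinge."""
    k = len(x)
    s = sum(x)
    return sum(c * relu(s - p) for p, c in enumerate(parity_coefficients(k)))

ok = all(
    parity_circuit(x) == sum(x) % 2
    for k in range(1, 7)
    for x in product([0, 1], repeat=k)
)
print(ok)  # True
```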
