Abstract. We observe that many important computational problems in NC^1 share a simple self-reducibility property. We then show that, for any problem A having this self-reducibility property, A has polynomial-size TC^0 circuits if and only if it has TC^0 circuits of size $n^{1+\epsilon}$ for every $\epsilon > 0$ (counting the number of wires in a circuit as the size of the circuit). As an example of what this observation yields, consider the Boolean Formula Evaluation problem (BFE), which is complete for NC^1 and has the self-reducibility property. It follows from a lower bound of Impagliazzo, Paturi, and Saks that BFE requires depth d TC^0 circuits of size $n^{1+\epsilon_d}$. If one were able to improve this lower bound to show that there is some constant $\epsilon > 0$ (independent of the depth d) such that every TC^0 circuit family recognizing BFE has size at least $n^{1+\epsilon}$, then it would follow that TC^0 ≠ NC^1. We show that proving lower bounds of the form $n^{1+\epsilon}$ is not ruled out by the Natural Proofs framework of Razborov and Rudich, and hence there is currently no known barrier to separating classes such as ACC^0, TC^0, and NC^1 via existing "natural" approaches to proving circuit lower bounds. We also show that problems with small uniform constant-depth circuits have algorithms that simultaneously have small space and time bounds. We then make use of known time-space tradeoff lower bounds to show that SAT requires uniform depth d TC^0 and AC^0[6] circuits of size $n^{1+c}$ for some constant c depending on d.
1. Introduction
There is consensus in the research community that one of the most challenging and important open problems in computer science is to prove that various computational problems require large circuits in order to be computed. However, there is also a great deal of pessimism in the community regarding the likelihood of proving such lower bounds on circuit size any time in the near future. One goal of this article is to suggest that there might be some reason to be more optimistic about prospects for circuit size lower bounds; we show that in certain settings, superpolynomial bounds would follow as a consequence of some very modest-sounding lower bound results (such as a lower bound of size $n^{1.0001}$). Of course, a confirmed pessimist would say that this is merely evidence that even these modest-sounding lower bounds are likely to remain beyond our reach.
1.1. THE QUEST FOR CIRCUIT LOWER BOUNDS.
This article focuses primarily on the task of proving superpolynomial lower bounds for various well-studied restricted classes of circuits, such as NC^1, TC^0, and CC^0[6]. The reader can find definitions of these classes in Section 2, along with a brief discussion of their importance and significance. Here, we recall just a few salient facts:

- Although it seems at first to be an absurdly weak class, CC^0[6] (the class of problems that can be solved by constant-depth polynomial-size circuit families consisting only of MOD-6 gates) has not yet been shown to have less computational power than NP. Some theoreticians suspect that CC^0[6] cannot even compute the AND function [Barrington et al. 1990a; Hansen and Koucký 2010]. Showing that AND (or any other problem in NP) lies outside of CC^0[6] would constitute a significant advance in complexity theory.
- The "majority function" MAJ, which determines whether more than half of the input bits are 1, is the canonical representative of the complexity class TC^0, consisting of the problems computed by constant-depth polynomial-size threshold circuits. Separating the complexity classes TC^0 and CC^0[6] is equivalent to proving a superpolynomial lower bound on the size of CC^0[6] circuits computing MAJ.
- NC^1 is the class of Boolean functions that can be represented by Boolean formulae of polynomial size. NC^1 has several natural problems that are complete under very restrictive notions of reducibility; we mention in particular the problem of evaluating a Boolean formula, which we denote by BFE. Separating the complexity classes NC^1 and TC^0 is equivalent to proving a superpolynomial lower bound on the size of constant-depth threshold circuits computing BFE.
The problem of separating these and other circuit complexity classes has remained open for more than two decades. This in itself would be cause for some discouragement about the prospects for progress. Additional grounds for despair were provided by Razborov and Rudich [1997], who showed that, if a class of circuits C is strong enough to compute pseudorandom function generators, then a wide variety of proof techniques are incapable of proving that a given problem is too difficult to be computed by circuits in C. Since there are constructions of pseudorandom function generators computable in TC^0 that are conjectured to be cryptographically secure [Naor and Reingold 2004], this has been viewed as constituting a significant barrier to progress on proving circuit lower bounds.
Although superpolynomial circuit size lower bounds have proved elusive, there has been significant work proving more modest lower bounds. For example, Håstad [1998] presents a nearly-cubic lower bound on the formula size for a certain function. Nonlinear lower bounds on branching program size have been presented [Ajtai 1999; Beame et al. 2003]. The time-space tradeoff results that are surveyed by van Melkebeek [2007] give run-time lower bounds of the form $n^c$ for small-space computations.
However, none of these lower bounds has led to superpolynomial lower bounds. More to the point, there was no expectation that a circuit size lower bound of the form $n^c$ could possibly yield superpolynomial circuit bounds. In this article, we show that there are several settings where precisely this sort of "amplification" can occur.
Moreover, in Section 8, we show that the work of Razborov and Rudich [1997] on "Natural Proofs" poses no barrier to proving weak lower bounds of the form $n^c$. This can be viewed as holding out some hope of separating circuit classes by proving circuit lower bounds using "natural" proof techniques.
1.2. OUR CONTRIBUTIONS. The main tool allowing us to obtain our results is self-reducibility of problems. We show that many problems in and around NC^1 (such as BFE, MAJ, AND, and many others) are strongly downward self-reducible. Then we show that, for any strongly downward self-reducible set, a lower bound of size $n^c$ implies a superpolynomial size lower bound.¹ In particular, we obtain the following corollaries:

Let us examine this third corollary more closely. It is interesting to recall that some nonlinear lower bounds for BFE are known. Impagliazzo et al. [1997] showed that any depth d TC^0 circuit for PARITY must have $n^{1+\epsilon_d}$ wires (where $\epsilon_d = \Omega(1/(2.5)^d)$). Since there is a trivial reduction from PARITY to BFE (see the detailed definition of BFE in Section 2), the same size lower bound holds for BFE. In order to separate TC^0 from NC^1, it would suffice to improve this to a lower bound of size $n^{1+\epsilon}$ where $\epsilon$ does not depend on d.

¹A special case of this general observation (relating only to regular sets) also appears in a survey article by Koucký [2009]; the present article expands significantly on the related results of Koucký [2009].

14:4 E. ALLENDER AND M. KOUCKÝ

One might reasonably wonder whether it is overly optimistic to expect to prove constant-depth circuit size lower bounds that do not depend on the depth d. Most circuit size lower bounds in the literature (such as those of Furst et al. [1984], Yao [1985], Håstad [1988], Razborov [1987], and Smolensky [1987]) do degrade with depth. For instance, the parity function requires depth d AC^0 circuits of size $2^{\Omega(n^{1/(d-1)})}$, and this is nearly optimal [Håstad 1988]. However, it is important to note that there are exceptions to this trend; Rossman [2008] recently proved that, for every constant k, the k-clique problem requires AC^0 circuits with $\omega(n^{k/4})$ gates, independent of the depth.
Clearly, no proof of TC^0 ≠ NC^1 can follow from a PARITY lower bound such as the bound of Impagliazzo et al. [1997], and equally clearly, their argument does not yield a lower bound on the size of depth d CC^0[6] circuits computing BFE (since CC^0[6] circuits of linear size compute PARITY). In fact, there seem to be no known superlinear lower bounds for BFE on depth d CC^0[q] circuits for any q with at least two distinct prime factors. We now turn to the question of obtaining lower bounds for CC^0[q] and the related class AC^0[q], in order to discuss some of our other theorems.
Fortnow [2000] showed that SAT does not have logspace-uniform NC^1 circuits of size $n^{1+o(1)}$. (Several improvements of this result of Fortnow are presented by van Melkebeek [2004, Theorem 1.5].) Since we are able to show that modest lower bounds for BFE would yield superpolynomial lower bounds, it is natural to wonder whether the same situation holds for SAT. That is, if one could build on the Fortnow lower bound and show that SAT requires AC^0[6] circuits of size $n^{1.01}$, would it follow that NP ≠ AC^0[6]? We know of no such implication, and in Section 5 we show that the approach that works for BFE cannot transfer directly to SAT. More specifically, in Section 5 we show that all strongly downward self-reducible sets lie in (uniform) NC. Thus, in order to demonstrate that SAT has the sort of self-reducibility properties that would enable us to amplify modest lower bounds to superpolynomial lower bounds, one would have to first prove that P = NP. (It is still conceivable that one could proceed by arguing that if NP = AC^0[6], then SAT has the desired type of self-reduction, but we have not been able to construct such an argument.)
It is interesting to note that Srinivasan [2003] has shown that an $\Omega(n^{1+\epsilon})$ lower bound on the running time of algorithms that compute weak approximations (of the form $n^{1-o(1)}$) to MAX-CLIQUE would imply P ≠ NP.
2. Preliminaries
2.1. CIRCUIT COMPLEXITY CLASSES. This article focuses on Boolean circuits and in particular on the circuit class NC^1 and its subclasses. Let us remind the reader of the main definitions, and present some notation. For more background on circuit complexity, the reader is referred to the text by Vollmer [1999].
For a function $f: \{0,1\}^* \to \{0,1\}$ and an integer $n \ge 1$, $f_n: \{0,1\}^n \to \{0,1\}$ is the restriction of f to inputs of size n.
We begin our discussion of circuits by considering a special case: formulas. A Boolean formula in n variables $x_1, x_2, \ldots, x_n$ is a rooted tree where each internal node is labeled by some function such as AND, OR, or NOT and each leaf is labeled either by one of the input variables $x_1, \ldots, x_n$ or by a constant zero or one (false or true). Given an input $x \in \{0,1\}^n$, one can inductively assign a value to each node of the formula as follows: each leaf labeled by a variable gets the value of that variable, each leaf labeled by a constant gets the value of that constant, and each internal node gets the value of the function that labels it applied to the values of its children. In the case where the function labeling a node is not symmetric, the order of the children has to be specified. The value (output) of the formula on input x is the value of the root node. Hence, a Boolean formula naturally computes a function $f: \{0,1\}^n \to \{0,1\}$. The nodes of the formula are generally referred to as gates. The in-degree of a gate is called its fan-in. In addition to the elementary functions AND, OR, and NOT, we will also consider gates computing the function MAJ (which evaluates to one if and only if a strict majority of its inputs are one) and the MOD-q function for an integer $q \ge 2$ (which is one if and only if the number of its inputs set to one is not divisible by q). The MOD-2 function will also be referred to as the PARITY function ($\oplus$ function). Sometimes we allow a more complex function to be computed by a gate; a node of a formula can be designated as an oracle gate. Typically all the oracle gates in a given formula will compute the same Boolean function $g: \{0,1\}^* \to \{0,1\}$, although we allow a single formula to have oracle gates for $g_m$ and $g_{m'}$ for $m \neq m'$.
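To make the inductive evaluation rule concrete, here is a minimal Python sketch. The tuple representation of nodes and the helper names (`evaluate`, `AND`, `OR`, `NOT`, `MAJ`, `MOD`, `PARITY`) are illustrative choices of ours, not notation from the article; an oracle gate is modeled simply as an arbitrary Python function applied to the ordered list of its children's values.

```python
from typing import Callable, Sequence, Tuple

# A formula node is one of:
#   ("var", i): the input variable x_i (1-indexed, as in the text)
#   ("const", b): the constant b in {0, 1}
#   (gate, child_1, ..., child_k): an internal node; gate maps the ordered child values to 0/1
Node = Tuple


def AND(vals: Sequence[int]) -> int:
    return int(all(vals))


def OR(vals: Sequence[int]) -> int:
    return int(any(vals))


def NOT(vals: Sequence[int]) -> int:
    (v,) = vals  # NOT is unary
    return 1 - v


def MAJ(vals: Sequence[int]) -> int:
    # One iff a strict majority of the inputs are one.
    return int(2 * sum(vals) > len(vals))


def MOD(q: int) -> Callable[[Sequence[int]], int]:
    # One iff the number of inputs set to one is NOT divisible by q.
    def gate(vals: Sequence[int]) -> int:
        return int(sum(vals) % q != 0)
    return gate


PARITY = MOD(2)


def evaluate(node: Node, x: Sequence[int]) -> int:
    """Inductively assign values: leaves first, then each gate applied to its children."""
    head = node[0]
    if head == "var":
        return x[node[1] - 1]
    if head == "const":
        return node[1]
    # Children are evaluated in order, which matters for non-symmetric (e.g., oracle) gates.
    return head([evaluate(child, x) for child in node[1:]])
```

For example, the formula $(x_1 \vee \neg x_2) \wedge (x_2 \oplus x_3)$ is written `(AND, (OR, ("var", 1), (NOT, ("var", 2))), (PARITY, ("var", 2), ("var", 3)))`, and any callable can stand in for an oracle gate.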
The oracle should be viewed as a parameter for the formula; for a function g and formula φ with oracle gates, the formula φ with oracle for g is the formula φ where each oracle gate computes the function g. For a set A, an oracle gate for A is an oracle gate computing the characteristic function of A.
A Boolean circuit is a generalization of a formula where, instead of a rooted tree, we allow an arbitrary directed acyclic multigraph. (We allow multiple edges, or wires, between nodes.) The nodes of out-degree zero are the output nodes. This way a circuit can compute a function $f: \{0,1\}^n \to \{0,1\}^m$, for integers $n, m \ge 1$. In circuits we also allow oracle gates to have several distinct output bits (wires), thus allowing us to have oracle gates for functions $g: \{0,1\}^m \to \{0,1\}^{m'}$ for $m' > 1$. (The tree-like nature of formulas imposes the restriction that $m' = 1$ in a formula.)
The depth of a circuit is the length of the longest path from an input node to an output node. The size of a circuit is the number of its wires, which is the number of edges in it. We will frequently refer also to the number of gates in a circuit.
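Both cost measures can be computed mechanically. The following sketch assumes a circuit is given as a list of (source, target) wire pairs, with a repeated pair standing for parallel wires; it counts size as the number of wires and depth as the length, in wires, of the longest path. The function name and input format are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, Iterable, Tuple


def circuit_size_and_depth(wires: Iterable[Tuple[Hashable, Hashable]]) -> Dict[str, int]:
    """Size = number of wires (parallel wires counted separately); depth = longest path."""
    wires = list(wires)
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in wires:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    longest = {v: 0 for v in nodes}  # longest path (in wires) ending at each node
    ready = deque(v for v in nodes if indeg[v] == 0)  # inputs and constant gates
    processed = 0
    while ready:  # Kahn's topological order: a node is popped after all its predecessors
        u = ready.popleft()
        processed += 1
        for v in succ[u]:
            longest[v] = max(longest[v], longest[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if processed != len(nodes):
        raise ValueError("wires contain a cycle; a circuit must be acyclic")
    return {"size": len(wires), "depth": max(longest.values(), default=0)}
```

In a DAG every longest path starts at a node of in-degree zero and ends at a node of out-degree zero, so the maximum over all nodes equals the longest input-to-output path.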
A circuit computes a function on a fixed number of variables. To compute a function $f: \{0,1\}^* \to \{0,1\}$ by circuits we need an infinite family of circuits $\{C_n\}_{n \ge 1}$, where for each $n \ge 1$, circuit $C_n$ computes $f_n$. One may abuse notation and say that f is computable by circuits with property $\gamma(n)$. Such an expression means that there is a family of circuits $\{C_n\}_{n \ge 1}$ where each $C_n$ has property $\gamma(n)$ and computes $f_n$. Similarly, asymptotic statements should be interpreted with respect to the input size; for example, "f is computable by polynomial-size constant-depth circuits" means that there is a circuit family $\{C_n\}_{n \ge 1}$, a polynomial $p(n)$, and a constant d, such that each $C_n$ computes $f_n$ and has size at most $p(n)$ and depth at most d. Similarly for formulas.
In addition to functions over the binary alphabet {0,1}, we also consider functions over an arbitrary alphabet $\Sigma$. In such cases we assume that there is some fixed encoding $\mathrm{Bin}: \Sigma \to \{0,1\}^*$ of symbols from $\Sigma$ into fixed-length binary strings; circuits for a function over the alphabet $\Sigma$ operate on inputs encoded symbol-by-symbol by Bin. (The string homomorphism Bin extends in the obvious way to a function $\mathrm{Bin}: \Sigma^* \to \{0,1\}^*$.) Furthermore, a circuit for a function with non-Boolean output produces a binary encoding of the output symbol. The definitions of computability by circuits and of all the other terms extend naturally to this case; however, we only require that a circuit computing a function f defined on $\Sigma^*$ operate correctly on binary strings corresponding to binary encodings of strings from $\Sigma^*$. Thus, on inputs that do not correspond to binary encoded strings from $\Sigma^*$, the circuit may give an arbitrary output. For example, a function $f: \Sigma^* \to \{0,1\}$ is computed by a circuit family $\{C_n\}_{n \ge 1}$ if for some $k \ge 1$ there is a binary encoding $\mathrm{Bin}: \Sigma \to \{0,1\}^k$ such that for each $n \ge 1$ and each input $x \in \Sigma^n$, $C_n(\mathrm{Bin}(x))$ outputs $f_n(x)$. In this case, the size of the input is considered to be n although its binary encoding has length kn. Oracle gates for a function over an arbitrary alphabet $\Sigma$ also operate on binary encoded strings from $\Sigma^*$, and on invalid inputs we assume that they output all zeros. (We state this convention only in order to make such oracle gates unambiguous; none of our results depends on it.)
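As an illustration of a fixed-length encoding Bin and its extension to strings, the sketch below gives each symbol a distinct k-bit code with $k = \lceil \log_2 |\Sigma| \rceil$ (at least 1). Ordering the codes by a symbol's position in a list is an assumption made only for this example; any injective fixed-length encoding would do, and all names here are hypothetical. The `decode` helper returns `None` on bit strings that are not encodings, mirroring the convention that circuits may behave arbitrarily on such inputs.

```python
from typing import Dict, Optional, Sequence


def make_bin(alphabet: Sequence[str]) -> Dict[str, str]:
    """Give each symbol a distinct k-bit code, k = ceil(log2 |alphabet|), at least 1."""
    k = max(1, (len(alphabet) - 1).bit_length())
    return {sym: format(idx, f"0{k}b") for idx, sym in enumerate(alphabet)}


def encode(word: str, code: Dict[str, str]) -> str:
    """Extend Bin to strings symbol by symbol (a string homomorphism)."""
    return "".join(code[sym] for sym in word)


def decode(bits: str, code: Dict[str, str]) -> Optional[str]:
    """Return x with Bin(x) == bits, or None if bits encodes no string over the alphabet."""
    k = len(next(iter(code.values())))
    inverse = {c: sym for sym, c in code.items()}
    if len(bits) % k != 0:
        return None
    chunks = [bits[i:i + k] for i in range(0, len(bits), k)]
    if any(chunk not in inverse for chunk in chunks):
        return None
    return "".join(inverse[chunk] for chunk in chunks)
```

For a five-symbol alphabet such as the one used for BFE, this gives k = 3, so an input of n symbols is encoded by 3n bits while its size is still counted as n.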
A language A is a subset of $\Sigma^*$ for some finite alphabet $\Sigma$. Every language naturally corresponds to its characteristic function $\chi_A: \Sigma^* \to \{0,1\}$ defined by $\chi_A(x) = 1$ if and only if $x \in A$. Vice versa, every function into {0,1} corresponds to a language. We will identify languages with their characteristic functions. We say that A is recognized by $\{C_n\}_{n \ge 1}$ if its characteristic function is computable by $\{C_n\}_{n \ge 1}$.
This allows us to define the following classes of functions.
- NC^0 is the class of functions computable by polynomial-size constant-depth circuits built using fan-in two AND and OR gates and unary NOT gates.
- AC^0 is the class of functions computable by polynomial-size constant-depth circuits built using unbounded fan-in AND and OR gates and unary NOT gates.

Some authors define these classes in terms of languages instead of functions, and use notation such as FAC^0 or FNC^1, etc., to refer to the associated class of functions. We prefer the simpler notation, and are confident that no confusion will result. We use the names of the function classes to denote also the corresponding circuit families; for example, we refer to "AC^0 circuit families" or more succinctly to "AC^0 circuits". As presented, these classes are nonuniform; that is, it is not required that there be an easy way to construct the circuits for inputs of length n. We shall also need to consider logspace-uniform and Dlogtime-uniform versions of these classes [Barrington et al. 1990b]. A circuit family $\{C_n\}_{n \ge 1}$ is logspace-uniform if there is a procedure that runs in logarithmic space and, on input $1^n$, outputs the description of $C_n$. A circuit family $\{C_n\}_{n \ge 1}$ is Dlogtime-uniform if there is a procedure that, on input $(n, i, r, j, s, t)$, where n, i, j, r, s are integers encoded in binary and t is a gate type (e.g., AND, OR, NOT, oracle, input, 0, 1), runs in time linear in its input size and accepts if and only if the gate of $C_n$ having label i is of type t and its rth child is the sth output bit of the gate labeled j. In case gate i is an input gate, the procedure accepts if gate i takes the value of the sth input bit. Furthermore, the procedure accepts $(n, i, j, s, \mathrm{output})$ if and only if the sth output bit of gate i is the jth output bit of the circuit $C_n$.
We also require that the procedure accepts the input $(n, i, d)$ if and only if d is equal to the fan-in of the gate of $C_n$ having label i; without this condition it is not always clear that Turing reducibilities defined in terms of uniform circuit families are closed under composition.²

Thus, for example, Dlogtime-uniform AC^0 is the class of functions computable by Dlogtime-uniform families of AC^0 circuits, or more precisely, the class of functions computable by some Dlogtime-uniform family of circuits of polynomial size and constant depth that are built using unbounded fan-in AND and OR gates and unary NOT gates.
A string $w \in \{0,1\}^*$ of length n is the binary representation of the integer $m = \sum_{i=1}^{n} 2^{n-i} w_i$. The logarithm base two is denoted by log. We use the following convention throughout the paper. Whenever we refer to some real value a (such as $a = \log n$ or $a = n^\epsilon$) in a context where there should be an integral quantity (for instance: "a string of length a"), the reader should read it as $\lceil a \rceil$.

2.2. REDUCTIONS AND COMPLETE PROBLEMS. The reader is probably familiar with the notion of polynomial-time many-one reducibility $\le^p_m$. Polynomial-time reducibility is an extremely useful tool for classifying NP-complete problems and, more generally, for classifying the complexity of problems that are not believed to […] [Agrawal et al. 1998; Agrawal 2001].

²There are additional conditions that are required in order to obtain a satisfactory definition of uniform NC^1; we refer the reader to the work of Ruzzo, who gives a uniformity condition with the desirable property that uniform NC^1 corresponds to logarithmic time on an alternating Turing machine [Ruzzo 1981].
Note that we have defined ≤ reductions. We give detailed definitions of three such problems: the word problem over the permutation group $S_5$ on five elements [Barrington 1989], the Boolean Formula Evaluation problem [Buss 1987], and s-t-connectivity on directed graphs of width 5.

[…] reductions, but this language has the same complexity as the functional version of the problem that we have presented, and that version is more convenient to work with; working with the language would rather obscure things.)

(2) The s-t-Connectivity Problem on Directed Graphs of Width 5. This is an NC^1-complete variant of s-t-connectivity. We say that a directed graph is of width k if its vertices can be partitioned into layers where each layer is of size at most k, the layers are linearly ordered, and every edge goes from vertices of one layer to vertices of the next layer. Every two consecutive layers of a width 5 directed graph form a bipartite graph, and this bipartite graph can be represented by a 5 × 5 adjacency matrix. Thus, a width 5 directed graph with n + 1 layers can be described by a sequence of n 5 × 5 adjacency matrices. The s-t-connectivity problem on directed graphs of width 5 is the problem of deciding whether a given vertex s in the first layer is connected by a path to a vertex t of the last layer in a width 5 directed graph. It is more convenient for us to work with the following functional version of connectivity (which has the same complexity as the decision problem), where we ask about connectivity between all vertices of the first and last layers. Let $\Sigma = \{0,1\}^{5 \times 5}$ be the set of binary 5 × 5 matrices. We define $\text{W5-STCONN}: \Sigma^* \to \Sigma$ by $\text{W5-STCONN}(A_1 A_2 \cdots A_n) = A$, where A
represents the connectivity between the first and last layers of a width 5 directed graph with n + 1 layers with adjacency matrices $A_1, A_2, \ldots, A_n$. It is a standard fact that A is equal to the product $A_1 A_2 \cdots A_n$ over the Boolean semiring ({0,1}, OR, AND), and this could also be taken as a formal definition of W5-STCONN. Moreover, one can view W5-STCONN as a word problem over the monoid $\Sigma$, where the binary operation is matrix multiplication over ({0,1}, OR, AND) and the identity element of $\Sigma$ is the identity matrix. This view of W5-STCONN will also be useful for us. Clearly, the word problem over $S_5$ is a special case of W5-STCONN.

(3) The Boolean Formula Evaluation Problem. Roughly speaking, the Boolean Formula Evaluation problem is the set of formulas that evaluate to true. We will make use of its variant where we focus only on balanced formulas (i.e., formulas whose graph is a complete binary tree of depth d). Input instances thus consist of a string of $2^d$ zeros and ones representing the values that label the leaves of the formula, along with a sequence of $2^d - 1$ labels for the internal nodes of the tree. Let $\Sigma = \{0, 1, \wedge, \vee, \oplus\}$. The set BFE consists of all of the "well-formed formulas" over alphabet $\Sigma$ that evaluate to 1.
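The following Python sketch illustrates the standard fact that W5-STCONN is the product of the layer matrices over ({0,1}, OR, AND), and cross-checks it against a direct layer-by-layer search. It assumes the convention that entry $[i][j]$ of $A_k$ is 1 exactly when there is an edge from vertex i of layer k to vertex j of layer k + 1; all function names are hypothetical. The helper `perm_matrix` shows the special case of $S_5$: a permutation corresponds to a layer whose edges form a bijection.

```python
from typing import Iterable, Sequence, Tuple

W = 5
Matrix = Tuple[Tuple[int, ...], ...]  # W x W 0/1 matrix; [i][j] = 1 iff edge i -> j (assumed)


def from_edges(edges: Iterable[Tuple[int, int]]) -> Matrix:
    es = set(edges)
    return tuple(tuple(int((i, j) in es) for j in range(W)) for i in range(W))


def bool_mul(a: Matrix, b: Matrix) -> Matrix:
    """Matrix product over ({0,1}, OR, AND)."""
    return tuple(
        tuple(int(any(a[i][k] and b[k][j] for k in range(W))) for j in range(W))
        for i in range(W)
    )


IDENTITY: Matrix = from_edges((i, i) for i in range(W))


def w5_stconn(mats: Sequence[Matrix]) -> Matrix:
    """Evaluate the word A_1 A_2 ... A_n in the monoid; the empty word gives the identity."""
    result = IDENTITY
    for m in mats:
        result = bool_mul(result, m)
    return result


def reachable(mats: Sequence[Matrix], s: int, t: int) -> bool:
    """Direct search: is vertex t of the last layer reachable from vertex s of the first?"""
    frontier = {s}
    for m in mats:
        frontier = {j for i in frontier for j in range(W) if m[i][j]}
    return t in frontier


def perm_matrix(p: Sequence[int]) -> Matrix:
    """A permutation p of {0, ..., 4} as a layer with edges i -> p[i]."""
    return from_edges((i, p[i]) for i in range(W))
```

Under this convention, the product $A_1 A_2$ of two permutation matrices corresponds to applying the permutation of $A_1$ first and then that of $A_2$.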
In order to simplify the proof that our construction in Proposition 3.9 is Dlogtime-uniform, we choose a particular encoding that will be convenient. The "well-formed formulas" consist of strings of the form vx such that, for some d, x is a string of length $2^d$ in $\{0,1\}^*$, and v is a string of length $2^d - 1$ in $\{\wedge, \vee, \oplus\}^*$ representing the labels of the internal nodes of the formula, given in the order specified by the following recursive definition. If d = 1, then there is only one internal node, so there is no need to specify the order. If d = 2, then the label of the root is listed first, followed by the label of the left child, and then by the label of the right child.
If d > 2 and d = 2c − 1, then the $2^c - 1$ labels of the subtree T of depth c containing the root are given first, in the order specified for trees of depth c. This is followed by $2^c$ encodings of the subtrees of depth c − 1 whose values feed into T (starting from the leftmost subtree), in the order specified for trees of depth c − 1.
If d > 2 and d = 2c, then the $2^c - 1$ labels of the subtree T of depth c containing the root are given first, in the order specified for trees of depth c. This is followed by $2^c$ encodings of the subtrees of depth c whose values feed into T (starting from the leftmost subtree), in the order specified for trees of depth c.
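One way to check an understanding of this recursive label order is to evaluate formulas encoded in it. The sketch below assumes that the $2^d$ leaf values in x are listed from left to right (the text above fixes only the order of the internal labels); `eval_balanced` and `in_bfe` are hypothetical names. The key step is that, for $d \ge 2$, the tree splits into a top subtree of depth $c = \lceil d/2 \rceil$ and $2^c$ bottom subtrees of depth $d - c$, which covers both the $d = 2c - 1$ and $d = 2c$ cases (and, with c = 1, reproduces the explicit d = 2 rule).

```python
from typing import Sequence

OPS = {
    "∧": lambda a, b: a & b,
    "∨": lambda a, b: a | b,
    "⊕": lambda a, b: a ^ b,
}


def eval_balanced(labels: str, leaves: Sequence[int], d: int) -> int:
    """Evaluate a complete binary formula of depth d.

    labels: the 2**d - 1 internal-node labels, in the recursive order described above
    leaves: the 2**d leaf values, left to right (an assumption of this sketch)
    """
    if d == 1:
        return OPS[labels[0]](leaves[0], leaves[1])
    c = (d + 1) // 2            # depth of the top subtree T containing the root
    b = d - c                   # depth of each of the 2**c subtrees feeding T
    top, rest = labels[: 2**c - 1], labels[2**c - 1:]
    per_sub = 2**b - 1          # number of labels in each bottom subtree
    values = [
        eval_balanced(rest[k * per_sub:(k + 1) * per_sub], leaves[k * 2**b:(k + 1) * 2**b], b)
        for k in range(2**c)    # leftmost bottom subtree first
    ]
    return eval_balanced(top, values, c)


def in_bfe(s: str) -> bool:
    """Membership in BFE: s = vx with |v| = 2**d - 1 labels and |x| = 2**d leaves, d >= 1."""
    n = len(s)
    d = (n + 1).bit_length() - 2  # a well-formed string has length 2**(d+1) - 1
    if d < 1 or n + 1 != 2 ** (d + 1):
        return False
    v, x = s[: 2**d - 1], s[2**d - 1:]
    if any(ch not in OPS for ch in v) or any(ch not in "01" for ch in x):
        return False
    return eval_balanced(v, [int(ch) for ch in x], d) == 1
```

For d = 3, the string "∧∨∨∧⊕∧∧" lists the top subtree (root ∧, children ∨ and ∨) followed by the four depth-1 subtrees from left to right.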
The reader may wonder whether it is necessary to be so particular about our encoding of the problem BFE. To some extent, the choice of encoding is crucial. For instance, if a formula were not encoded as a formula, but instead were encoded as an unsorted list of gates and edges, then it is an easy exercise to show that evaluating a formula is complete for L, using the fact that determining whether a vertex u occurs before a vertex v in a directed line graph presented as an unsorted list of edges is complete for L [Etessami 1997]. Thus it is at least important that the formulas in BFE be presented as parenthesized expressions or some similar encoding. The general (not necessarily balanced) Boolean formula evaluation problem is in NC^1 [Buss 1987], and thus there are "efficient" $\le^{\mathrm{AC}^0}_m$ reductions from the general formula evaluation problem to the balanced encoding that we have chosen for BFE, but the reductions that one obtains from known NC^1 algorithms (e.g., Buss [1987, 1993] and Buss et al. [1992]) do not appear to be computable by linear-size AC^0 circuits. This is one reason why we do not know how to obtain linear-size strong downward self-reductions for the general Boolean formula evaluation problem, such as we present for BFE. The reason why we include $\oplus$ as an operation in BFE is so that there will be a linear-size reduction from PARITY to BFE, so that the nonlinear PARITY lower bounds [Impagliazzo et al. 1997] will immediately carry over to BFE.
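A linear-size reduction from PARITY to BFE can be sketched as follows, assuming that padding with zeros is acceptable (zeros do not change parity): pad the input to length $2^d$ and label every internal node with $\oplus$. Because all labels are equal, the recursive label order does not matter for this family of formulas. The function names are hypothetical, and the evaluator handles only all-$\oplus$ formulas.

```python
def parity_to_bfe(bits: str) -> str:
    """Map a 0/1 string to a balanced all-XOR formula in BFE's vx format."""
    n = len(bits)
    d = max(1, (n - 1).bit_length())   # smallest d >= 1 with 2**d >= n
    leaves = bits + "0" * (2**d - n)   # zero padding preserves parity
    return "⊕" * (2**d - 1) + leaves   # output length 2**(d+1) - 1 < 4 * max(n, 1)


def eval_all_xor(formula: str) -> int:
    """Evaluate an all-XOR balanced formula by XOR-ing adjacent pairs level by level."""
    num_leaves = (len(formula) + 1) // 2
    level = [int(ch) for ch in formula[num_leaves - 1:]]
    while len(level) > 1:
        level = [level[i] ^ level[i + 1] for i in range(0, len(level), 2)]
    return level[0]
```

Since $2^d < 2n$ for $n \ge 2$, the output has fewer than 4n symbols, and each output symbol is either a fixed label, a copied input bit, or a constant zero.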
Even in this restricted form, BFE is complete for NC^1. (See, e.g., the proof of Lemma 7.2 in Barrington et al. [1990b].)

PROPOSITION 2.1 (Barrington 1989; Buss 1987).
The problem W5-STCONN remains complete for NC^1 if directed edges are permitted in both directions between adjacent layers, as well as in the undirected case. The arguments that we present for W5-STCONN also carry over to these variants, with minor technical modifications.
Although NC^1 has several natural complete problems under ≤ […]

For any circuit complexity class C, we define C-Turing reducibility. Let f and g be two functions. We say that $f \le^C_T g$ if there is a family of circuits of polynomial size computing f, where the circuits have oracle gates for the function g in addition to the collection of gates that is provided in the definition of the circuit class C.³ Turing reducibility will be used in the next section, in order to define downward self-reducibility.

³In this article, we do not make use of NC^1-Turing reducibility, and indeed this definition would need to be modified in order to coincide with the definition of NC^1-Turing reducibility as studied by Cook [1985] and Wilson [1990] and others. In defining AC^k reducibility, each oracle gate is considered to have depth 1, as in our definition, but in defining NC^k reducibility, Cook and Wilson felt that it was more in keeping with the flavor of bounded fan-in circuits to define the depth of an oracle gate to be the logarithm of its fan-in. Using their convention, an NC^0-Turing reduction could have oracle gates of only bounded fan-in, which is not a very useful notion. In contrast, our definition yields exactly the type of "NC^0-Turing reducibility" that we need in our definition of "pure self-reducibility".
Reductions can be either uniform or nonuniform. The reader can verify that all of the examples of reductions that we present in this article are Dlogtime-uniform. It is worth observing that if A is complete for any of the uniform classes that we consider under uniform $\le^{\mathrm{AC}^0}_m$ or $\le^C_T$ reductions, then it is also complete for the corresponding nonuniform class under nonuniform reductions of the same type. For example, if B is in nonuniform NC^1, then there is a nonuniform family of Boolean formulae $\{\phi_n\}_{n \ge 1}$ accepting B. The set $D = \{(\psi, x) : \text{a Boolean formula } \psi \text{ given in infix notation evaluates to 1 on } x\}$ is in uniform NC^1 [Buss 1987; Buss et al. 1992], and thus there is a uniform reduction from D to A. Composing this uniform reduction with the nonuniform reduction of B to D that maps x to $(\phi_{|x|}, x)$ yields the desired nonuniform reduction of B to A. Note that, for this example, it is important that B is presented in terms of Boolean formulae, instead of, say, logarithmic-depth Boolean circuits, since it is not known whether logarithmic-depth Boolean circuits can be evaluated in NC^1. A similar construction works also for constant-depth circuits. As an example, we briefly explain the case of […]

Since completeness results carry over from the uniform setting to the nonuniform setting, we will henceforth slightly abuse notation and simply say that a set A is "complete under $\le^C_T$ reductions" even when C is a nonuniform class, without explicitly mentioning that the reductions must be nonuniform in this case.
The following fact about Dlogtime-uniform Turing reductions is not entirely obvious, and thus for completeness we provide a proof. Let circuit family $\{C_n\}$ be a Turing reduction of f to g, and let $\{D_n\}$ be a Turing reduction of g to h. The composition of these reductions is the reduction of f to h that results by replacing each oracle gate of $C_n$ having fan-in m by $D_m$.
PROPOSITION 2.3. For any of the classes C defined in this section, the composition of two Dlogtime-uniform $\le^C_T$ reductions is a Dlogtime-uniform $\le^C_T$ reduction.
PROOF. Let $\{C_n\}$ and $\{D_n\}$ be two Dlogtime-uniform families of reductions. Define a new family $\{E_n\}$ where $E_n$ has the following gates:
$$\{i : i \text{ is a non-oracle gate of } C_n\} \cup \{(i, m, j) : i \text{ is an oracle gate of } C_n \text{ that has fan-in } m \text{ and } j \text{ is a gate of } D_m\}.$$
Since the definition of Dlogtime-uniformity ensures that it is easy to recognize the fan-in of an oracle gate, it is routine to establish that the family $\{E_n\}$ (with the obvious connections among gates to implement the composed reduction) is Dlogtime-uniform. For all of the polynomial-size circuit classes C defined in this section, it is immediate that the resulting reduction $\{E_n\}$ is also a $\le^C_T$ reduction.
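The gate-naming scheme of $E_n$ can be sketched in a few lines. The sketch below builds only the gate set of $E_n$, with gate types and fan-ins, and omits the wiring; the dictionary format and the name `compose_gate_sets` are assumptions for illustration. Each label $(i, m, j)$ records the fan-in m of the replaced oracle gate, which is why it helps that fan-ins are easy to recognize.

```python
from typing import Callable, Dict, Hashable, Tuple

GateSpec = Tuple[str, int]  # (gate type, fan-in)


def compose_gate_sets(
    c_gates: Dict[Hashable, GateSpec],
    d_family: Callable[[int], Dict[Hashable, GateSpec]],
) -> Dict[Hashable, GateSpec]:
    """Gate set of E_n: the non-oracle gates i of C_n, plus a gate (i, m, j) for each
    gate j of D_m, where i is an oracle gate of C_n with fan-in m. Wiring is omitted."""
    e_gates: Dict[Hashable, GateSpec] = {}
    for i, (kind, fan_in) in c_gates.items():
        if kind != "oracle":
            e_gates[i] = (kind, fan_in)          # keeps its label from C_n
        else:
            for j, spec in d_family(fan_in).items():
                e_gates[(i, fan_in, j)] = spec   # a copy of D_m, tagged by (i, m)
    return e_gates
```

In a full composition, the input gates of each copy of $D_m$ would be identified with the m wires entering oracle gate i; the sketch keeps them as separate labeled nodes.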
3. Downward Self-Reducibility
In this section we define downward self-reducibility and present several examples of downward self-reducible functions. Intuitively, a function is downward self-reducible if it can be efficiently computed from its own values at shorter inputs. We give a formal definition next.
A C self-reduction for f is a family of oracle circuits witnessing that $f \le^C_T f$, where on input x, the oracle circuit does not feed input x into any of its oracle gates.
Self-reducibility sometimes also goes by the name "autoreducibility." The term "self-reducibility" is more common in those settings (as here) where interest centers on routines that enforce the condition that x is not queried, by ensuring that all queries have length shorter than the length of x.

Definition 3.1. Let $f: \{0,1\}^* \to \{0,1\}^*$ be a function, and let C be a class of circuits. Let $s(n), m(n): \mathbb{N} \to \mathbb{N}$ be functions such that for all n, $m(n) < n$. We say that $f_n$ is downward self-reducible to $f_{m(n)}$ by a C reduction of size $s(n)$ if there is a family of C oracle circuits $\{C_n\}_{n \ge 1}$ computing f such that for each n, $C_n$ uses its oracle gates to query f on inputs of size at most $m(n)$, and has at most $s(n)$ wires.
Most of the self-reductions that we present consist of almost no hardware other than oracle gates. We call such reductions "pure"; a pure self-reduction for f is an NC^0 self-reduction for f, that is, a self-reduction where the only gates are oracle gates, as well as bounded fan-in AND and OR gates and unary NOT gates.
Definition 3.2. Let f : {0,1}^* → {0,1}^* be a function. Let s(n), m(n) : ℕ → ℕ be functions such that m(n) < n for all n, and let d ≥ 1 be an integer. We say that f_n is downward self-reducible to f_{m(n)} by a pure reduction of depth d and size s(n) if there is a circuit family {C_n}_{n≥1} with the following properties for each n: C_n computes f_n, has depth at most d and size at most s(n), and consists of fan-in two AND and OR gates, unary NOT gates, and oracle gates that compute f on inputs of size at most m(n).
We use the term "pure" rather than simply calling these NC^0 reductions because "NC^0" usually refers to computation in which the output depends on at most O(1) bits of the input. Pure self-reductions do not share that property.
We will almost exclusively be interested in functions that are downward self-reducible to inputs of size at most m(n) = n^ε, for some ε > 0. This notion of downward self-reducibility is essentially identical to what Goldwasser et al. [2007] call "strong downward self-reducibility". Hence, if f is downward self-reducible to f_{n^ε} by a pure reduction for some ε > 0, we will also call it strongly downward self-reducible. (Similarly, if f is downward self-reducible to f_{n^ε} by a C reduction for some class C, we will say that f is C strongly downward self-reducible.) For our purposes, however, it is important to pay close attention to the size and depth of the reduction.
The rest of this section is devoted to showing that the following problems are strongly downward self-reducible: AND, W_M, MOD-q, W5-STCONN, MAJ and BFE. We also present somewhat weaker downward self-reducibility results for various types of iterated matrix multiplication problems.
We start with an example that may seem trivial, but is nonetheless useful.
PROPOSITION 3.3. For any 0 < ε < 1, AND_n is downward self-reducible to AND_{n^ε} by a Dlogtime-uniform pure reduction of depth O(1/ε) and size O(n). The same holds for OR.
PROOF. Consider the AND function. The idea of the proof is simple: form a tree of depth O(1/ε) from AND_{n^ε} gates and assign one of the variables to each leaf. However, ε and n may be arbitrary, so this construction may not be uniform. A Dlogtime-uniform construction therefore requires care with the details. We give a more detailed construction below to demonstrate the necessary techniques. A reader familiar with uniformity issues may want to skip the rest of the proof.
Let an integer k satisfy 2^{−k} ≤ ε < 2^{−k+1}. If n < 4^{2^k}, then a tree of AND_2 gates can be used to compute AND_n. So assume for the rest of the proof that n ≥ 4^{2^k}. Pick the largest integer ℓ ≥ 1 such that 2^ℓ ≤ n^{1/2^k} and the smallest integer m such that n < (2^ℓ)^m. We will use AND_{2^ℓ} gates to build the circuit. We will label gates of the circuit by labels from {0, 1, ..., m} × ({0,1}^ℓ)^m. Not all labels will be valid; some labels will be unused. We describe the valid labels together with the associated gates next. Let n_1, n_2, ..., n_m ∈ {0,1}^ℓ be such that n_1 n_2 ⋯ n_m is the mℓ-bit binary representation of n − 1, padded with leading zeros if necessary. […] represents a valid label in any of the following cases: […] No other label is used. Since ε is a constant, k is also a constant. One can easily verify from the description of the gate labeling that the connectivity language for the circuit with respect to this labeling is decidable by a Dlogtime procedure. (Given n in binary, one can find ℓ and m in time linear in the length of the binary representation of n. Incrementing and decrementing a number in binary can also be done in time linear in the length of its representation. All other operations clearly take linear time, assuming that our Dlogtime machine has at least two tapes.) One can also easily verify that the described circuit computes exactly AND_n.
14:14 E. ALLENDER AND M. KOUCKÝ
We claim that the circuit contains O(n) wires. Indeed, the number of wires between the bottom level of AND gates and the inputs is at most n + 2^ℓ. The next layer up contains at most n/2^ℓ + 2^ℓ wires, the one above it at most n/(2^ℓ)^2 + 2^ℓ, and so on. Thus the number of wires in the circuit is at most 2n + (m + 1)2^ℓ. Since ε < 1, we have k ≥ 1 and hence 2^ℓ ≤ √n. Furthermore, 2^{ℓ(m−1)} ≤ n, so m ≤ 1 + log n. Thus, the number of wires in the circuit is bounded by 2n + √n · (2 + log n).
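The wire count above can be checked numerically. The following Python sketch is a nonuniform simplification under our own assumptions. It ignores the Dlogtime-uniform labeling entirely and groups values into consecutive blocks of at most `fan_in` inputs; these play the role of the AND_{2^ℓ} gates, with `fan_in` ≤ n^ε. It counts one wire per value fed into a gate. The helper name `and_tree` is hypothetical.

```python
def and_tree(bits, fan_in):
    """AND of `bits` via a tree of AND gates with fan-in at most `fan_in`.

    Returns (value, depth, wires); each value fed into a gate counts as one wire.
    This is a nonuniform sketch: the Dlogtime-uniform labeling is not modeled.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not bits:
        raise ValueError("need at least one input bit")
    layer = list(bits)
    depth = wires = 0
    while len(layer) > 1:
        wires += len(layer)  # every value on this layer feeds exactly one gate
        # consecutive blocks of at most fan_in values, one AND gate per block
        layer = [all(layer[i:i + fan_in]) for i in range(0, len(layer), fan_in)]
        depth += 1
    return bool(layer[0]), depth, wires
```

Take n = 10000 and fan-in 10, which corresponds to ε = 1/4. The tree then has depth 4 and 10000 + 1000 + 100 + 10 = 11110 wires, within the 2n + (m + 1)2^ℓ bound.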
The case of AND and OR can be further generalized to word problems over finite monoids.
PROPOSITION 3.4. For any finite monoid M and any 0 < ε < 1, (W_M)_n is downward self-reducible to (W_M)_{n^ε} by a Dlogtime-uniform pure reduction of depth O(1/ε) and size O(n).
The proof is essentially the same as for AND and OR. One uses gates computing W_M on inputs of size at most n^ε, together with constants for the binary encoding of the identity 1_M. For an integer q ≥ 2, consider the monoid Z_q = ({0, 1, ..., q − 1}, + (mod q)). Applying the proposition to Z_q yields the next corollary.
PROOF. Clearly, MOD-q can be computed using W_{Z_q}. The converse also holds: one can compute W_{Z_q} using MOD-q. The proof of the corollary consists of showing how (W_{Z_q})_ℓ can be computed using gates for (MOD-q)_ℓ, and then applying the previous proposition to W_{Z_q}. A reader familiar with the conversion between MOD-q and W_{Z_q} may want to skip the rest of the proof.
Let b ≥ 1 be a constant, let Bin : Z_q → {0,1}^b be an arbitrary injective function, and let ℓ > 4q be an integer. We show how to use (MOD-q)_ℓ gates to compute (W_{Z_q})_ℓ encoded by Bin. Let x_1, x_2, ..., x_ℓ ∈ Z_q be an input to W_{Z_q}, and let y_1, ..., y_{bℓ} be its encoding by Bin. We will build a circuit that takes y_1, ..., y_{bℓ} as its input and outputs z_1 ⋯ z_b, where z_1 ⋯ z_b is the encoding by Bin of Σ_{i=1}^{ℓ} x_i (mod q). The circuit will have constant depth (depending only on Bin and q). It will use O(ℓ) bounded fan-in AND, OR and NOT gates and q MOD-q gates of fan-in ℓ. The circuit computes as follows.
Let m = ⌈ℓ/q⌉ − 2. Partition {1, ..., ℓ} arbitrarily into nonempty sets S_1, S_2, ..., S_m of size at most 2q; since ℓ > 4q, this is possible. For each i ∈ {1, ..., m}, let w_i = 1^{s_i} 0^{q−s_i}, where s_i = Σ_{j∈S_i} x_j (mod q). As q and b are constants, w_1 w_2 ⋯ w_m can be computed from y_1, ..., y_{bℓ} by a circuit of constant depth using O(ℓ) fan-in two AND and OR gates and unary NOT gates.
For j = 0, ..., q − 1, let g_j be a MOD-q gate of fan-in ℓ whose input is the string w_1 w_2 ⋯ w_m 0^j 1^{q−j}, padded with zeros to length ℓ. The number of ones fed into g_j is congruent to Σ_{i=1}^{ℓ} x_i − j (mod q). Hence g_j evaluates to zero if and only if Σ_{i=1}^{ℓ} x_i ≡ j (mod q). The outputs of g_0, ..., g_{q−1} therefore uniquely determine Σ_{i=1}^{ℓ} x_i (mod q). A constant-size circuit of bounded fan-in AND, OR and NOT gates can process them to compute Bin(Σ_{i=1}^{ℓ} x_i (mod q)). This gives the desired circuit for computing (W_{Z_q})_ℓ encoded by Bin. (For ℓ ≤ 4q, one can build a constant-depth circuit computing (W_{Z_q})_ℓ using fan-in two AND gates and unary NOT gates.)
By Proposition 3.4, there are constants b and k and a function Bin : Z_q → {0,1}^b with the following property: for all sufficiently large n, there is a circuit C_n of depth at most k/ε with at most kn wires that computes (W_{Z_q})_n encoded by Bin. C_n uses fan-in two AND and OR gates, unary NOT gates, and gates for (W_{Z_q})_ℓ encoded by Bin, for ℓ ≤ n^ε. Take C_n and replace each gate for (W_{Z_q})_ℓ by the circuit constructed in the preceding paragraph. This yields a circuit C'_n computing (W_{Z_q})_n. The circuit C'_n consists of fan-in two AND and OR gates, unary NOT gates, and (MOD-q)_ℓ gates, for ℓ ≤ n^ε. Each (W_{Z_q})_ℓ gate of fan-in bℓ is replaced by a constant-depth circuit that uses O(ℓ) wires. Hence the depth and the number of wires of C'_n are only a constant factor larger than those of C_n.
Encode an input v_1, v_2, ..., v_n ∈ {0,1} symbol by symbol using Bin, and feed the resulting string into the circuit C'_n. The output is Σ_{i=1}^{n} v_i (mod q), encoded by Bin. From this encoded value, one can decode the output of MOD-q on input v_1, ..., v_n. Hence, using O(n) additional fan-in two AND and OR gates and unary NOT gates, one can convert C'_n into a constant-depth circuit for (MOD-q)_n. The overall size of the circuit will be linear in n.
One can verify that the construction can be made Dlogtime-uniform. Indeed, the circuit computing W_{Z_q} using MOD-q gates can be made Dlogtime-uniform, and its gate labeling can be concatenated with the labeling of gates in C_n to obtain a gate labeling of C'_n. Additional labels can be used for the gates that compute the Bin encoding of the input and the decoding of the output of C'_n. The details of these constructions are straightforward, and we leave them to the interested reader.
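The gadget built from the gates g_0, ..., g_{q−1} can be simulated directly. The Python sketch below is illustrative only, and it makes several choices of its own:

- It works with values in Z_q rather than their Bin encodings.
- It takes the sets S_i to be consecutive index ranges.
- It encodes each partial sum s_i in unary as w_i = 1^{s_i}0^{q−s_i}.
- It assumes that a MOD-q gate outputs 1 exactly when the number of ones is not divisible by q. This convention is what makes g_j evaluate to zero iff the sum is j.

The function names are hypothetical.

```python
def mod_gate(bits, q):
    """MOD-q gate. ASSUMED convention: output 1 iff the number of ones is not divisible by q."""
    return int(sum(bits) % q != 0)


def sum_mod_q_via_mod_gates(xs, q):
    """Compute sum(xs) mod q for xs in Z_q (len(xs) > 4q) with q MOD-q gates of fan-in len(xs)."""
    ell = len(xs)
    if ell <= 4 * q:
        raise ValueError("the construction needs more than 4q inputs")
    m = -(-ell // q) - 2                           # m = ceil(ell / q) - 2
    bounds = [i * ell // m for i in range(m + 1)]  # consecutive sets S_1..S_m, each of size <= 2q
    ws = []
    for i in range(m):
        s_i = sum(xs[bounds[i]:bounds[i + 1]]) % q  # depends on O(q) inputs: constant-size circuitry
        ws += [1] * s_i + [0] * (q - s_i)           # w_i = 1^{s_i} 0^{q - s_i}
    zero_gates = []
    for j in range(q):
        g_input = ws + [0] * j + [1] * (q - j)
        g_input += [0] * (ell - len(g_input))       # pad with zeros to fan-in ell
        if mod_gate(g_input, q) == 0:
            zero_gates.append(j)
    # Exactly one gate g_j evaluates to zero, namely j = sum(xs) mod q.
    assert len(zero_gates) == 1
    return zero_gates[0]
```

Taking S_i consecutive is only one of the arbitrary partitions the proof allows. Any partition into m sets of size at most 2q gives the same result.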
Because of the connection between W5-STCONN and word problems over monoids, we also obtain:
PROPOSITION 3.6. For any 0 < ε < 1, W5-STCONN_n is downward self-reducible to W5-STCONN_{n^ε} by a Dlogtime-uniform pure reduction of depth O(1/ε) and size O(n).
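Propositions 3.4 and 3.6 both rest on evaluating a product by a tree of short subproducts. The hypothetical Python sketch below evaluates a word over a finite monoid using only "oracle" calls on subwords of length at most `block`, which stands in for n^ε. Each layer pads its last block with the identity, mirroring the constants for 1_M. Because the monoid need not be commutative, each block must consist of consecutive symbols. We chose the example monoid S_5 (permutations of {0, ..., 4} under composition) for illustration.

```python
from functools import reduce
from itertools import permutations


def word_product(word, op, identity):
    """Direct evaluation of a word over a monoid (plays the role of the oracle)."""
    return reduce(op, word, identity)


def word_product_tree(word, op, identity, block):
    """Evaluate the same product using oracle calls on subwords of length <= block.

    Returns the product and the number of oracle calls used.
    """
    if block < 2:
        raise ValueError("block must be at least 2")
    layer = list(word) or [identity]
    calls = 0
    while len(layer) > 1:
        # Pad with the identity so that the layer splits into full consecutive blocks.
        layer = layer + [identity] * (-len(layer) % block)
        layer = [word_product(layer[i:i + block], op, identity)
                 for i in range(0, len(layer), block)]
        calls += len(layer)
    return layer[0], calls


def compose(p, q):
    """Permutation p followed by permutation q (permutations as tuples of images)."""
    return tuple(q[p[i]] for i in range(len(p)))


S5 = list(permutations(range(5)))
IDENTITY = tuple(range(5))
```

With 1000 symbols and block length 10, three layers suffice, using 100 + 10 + 1 = 111 oracle calls. This reflects the depth O(1/ε) and linear size of the reduction.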
We can prove a similar self-reducibility claim for MAJ. This time the proof is a little more involved and uses the following lemma. […]
PROOF. First, we prove the claim for ε = 1/2 to illustrate the technique. For simplicity and clarity, we mostly ignore rounding issues. We can view the input as n 1-bit integers a_1, ..., a_n. To determine the output of MAJ_n, we will compute the binary representation of the sum of these integers. The total sum is obtained in several stages. Each stage takes as input a sequence a_1, a_2, ..., a_m of integers and converts it into a shorter sequence b_1, b_2, ..., b_{m'} with the same sum. That is, m' < m and a_1 + a_2 + ⋯ + a_m = b_1 + ⋯ + b_{m'}. The first stage starts with the input as a sequence of 1-bit integers. The last stage outputs a single integer representing the total sum of the input bits. No integer at any stage can exceed n. Therefore we can always truncate a number of more than log(n + 1) bits to its log(n + 1) least significant bits. (If convenient, we may also pad the binary representation of any number with leading zeros to log(n + 1) bits.) […] √n/2 and using Lemma 3.7 (with ℓ = log(n + 1) and m = √n/2, truncating the outputs to log(n + 1) bits), compute for each subsequence log(n + 1)-bit integers representing the sum of the a_i's in that subsequence. Output all the 5 log(n + 1) integers obtained from the application of the lemma. Since each subsequence contains at most √n/2 integers, this stage requires at most O(√n log n) AND_2, NOT and MAJ_{√n} gates.
Stage 3. 5 log(n + 1) × log(n + 1)-bit integers → 3 × log(n + 1)-bit integers. This stage receives log(n + 1)-bit integers a_1, a_2, ..., a_{5 log(n+1)} and outputs b_1, b_2, b_3. It proceeds as follows. We divide the binary representation of each a_i, i ∈ {1, ..., 5 log(n + 1)}, into blocks of log log n consecutive bits. We regard each block as a (log log n)-bit integer. This gives integers a_{i,1}, a_{i,2}, ..., a_{i,k}, where k ≈ log(n + 1)/log log n and a_i = Σ_{j=1}^{k} 2^{(k−j) log log n} a_{i,j}.
Amplifying Lower Bounds by Means of Self-Reducibility
For j = 1, ..., k, we apply Lemma 3.7 to the sequence a_{1,j}, a_{2,j}, … […] Finally, each s_j is a sum of at most 5 log(n + 1) integers of log log n bits each. It can therefore be represented by log(5 log(n + 1) + 1) + log log n ≤ 5 + 2 log log n bits. From s_1, ..., s_k we can form three integers b_1, b_2, b_3 whose total is the sum of the a_i's (see Figure 1). Formally, b_i = Σ_{j ≡ 4−i (mod 3)} 2^{(k−j) log log n} s_j, where j ranges from 1 to k.
This stage involves O(log n/log log n) applications of Lemma 3.7 with parameters ℓ and m of order less than log n, and k(5 + 2 log log n) DNF formulas of size n^{o(1)}. Hence, it can be implemented by a constant-depth circuit consisting of a linear number of AND_2, NOT and MAJ_{n^{o(1)}} gates.
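The regrouping of the column sums s_j into b_1, b_2, b_3 can be checked concretely. The Python sketch below is our own illustration, and its names are hypothetical. It splits each number into k blocks of t bits, most significant block first, so that a_i = Σ_j 2^{(k−j)t} a_{i,j}. It computes the column sums directly rather than with Lemma 3.7.

Within each b_i, the shifts of consecutive terms differ by 3t bits. So as long as every s_j is below 2^{3t}, the terms do not overlap and b_i is simply a concatenation of bit strings. This is where the bound on the bit length of s_j matters.

```python
def split_blocks(a, k, t):
    """Blocks a_{i,1..k} of t bits each, most significant first, so that
    a == sum(2**((k - j) * t) * blocks[j - 1] for j in 1..k) whenever a < 2**(k*t)."""
    mask = (1 << t) - 1
    return [(a >> ((k - j) * t)) & mask for j in range(1, k + 1)]


def stage3(nums, k, t):
    """Turn many (k*t)-bit integers into three integers b_1, b_2, b_3 with the same sum."""
    blocks = [split_blocks(a, k, t) for a in nums]
    # s_j is the sum of the j-th blocks (obtained via Lemma 3.7 in the proof).
    s = [sum(row[j] for row in blocks) for j in range(k)]
    b = []
    for i in (1, 2, 3):
        # b_i collects 2^{(k-j)t} * s_j over j = 1..k with j ≡ 4 - i (mod 3).
        b.append(sum(s[j - 1] << ((k - j) * t)
                     for j in range(1, k + 1) if j % 3 == (4 - i) % 3))
    return b, s
```

As an illustrative parameter choice, take n ≈ 2^64. Numbers then have 65 bits, t = 6 ≈ log log n and k = 11, and there are 5 · 65 = 325 numbers. Each s_j is at most 325 · 63 < 2^{18} = 2^{3t}.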
Stage 4. 3 × log(n + 1)-bit integers → 1 × log(n + 1)-bit integer. Two n-bit integers can be added by AC^0 circuits using O(n^2) AND_n, OR_n and NOT gates (see, e.g., Vollmer [1999, Theorem 1.15]). Hence, three log(n + 1)-bit integers can be added by constant-depth circuits using O(log^2 n) AND_{log(n+1)}, OR_{log(n+1)} and NOT gates. Thus the inputs a_1, a_2, a_3 of this stage can be summed to obtain the final sum by a constant-depth circuit using O(log^2 n) MAJ_{O(log n)} and NOT gates.
The total sum obtained from Stage 4 can be compared with the binary representation of n/2 by an AC^0 circuit. This circuit consists of O(log^2 n) AND_{log(n+1)}, OR_{log(n+1)} and NOT gates, or alternatively of MAJ_{O(log n)} and NOT gates. Each stage of the computation can be done by constant-depth circuits consisting of a linear number of AND_2, NOT and MAJ_{√n} gates. Hence the claim follows for ε = 1/2.
For general ε, the computation proceeds similarly, except that the first two stages are replaced. Instead, we repeatedly apply a stage that reduces the input sequence a_1, a_2, ..., a_m to a sequence b_1, b_2, ..., b_{m'}, where m' = 2m log(n + 1)/n^ε. This reduction applies Lemma 3.7 to subsequences of the a_i's of length n^ε/2. Once m ≤ n^ε/2, a single application of Lemma 3.7 produces log(n + 1) integers, which are passed to the last two stages of the above procedure. Clearly, 2 + 1/ε repetitions suffice for the first stage. Each repetition requires at most n^{1−ε} · O(n^ε) = O(n) MAJ_{n^ε}, AND_2 and NOT gates.
We have established that the self-reductions have a linear number of gates. It remains to prove the size bound of O(n^{1+ε}) by counting wires. There are O(n) gates, each with fan-in at most n^ε, so the total size is O(n^{1+ε}). Dlogtime-uniformity of the circuit is routine to establish.
We have seen that AND, OR, MOD-q and MAJ are all downward self-reducible. We also saw that downward self-reducibility holds for the word problem over any finite monoid. This yields self-reductions for some of the standard complete problems for NC^1: the word problem over S_5, and W5-STCONN. We thank Mario Szegedy for pointing out that BFE (another standard complete problem for NC^1) is also downward self-reducible:
BFE_n is downward self-reducible to BFE_{n^ε} by a Dlogtime-uniform pure reduction of depth O(1/ε) and size O(n).
PROOF. We will show that there is a constant c and an oracle circuit family {C_n}_{n≥1} with the following properties: C_n is a pure reduction of depth c and size O(n) reducing BFE_n to BFE_{4n^{1/2}}, and no path from a leaf to the root of C_n encounters more than two oracle gates.
We first show that this suffices to prove the proposition. Suppose we replace each oracle gate for BFE_m in C_n by the oracle circuit C_m. This yields a Dlogtime-uniform family of pure reductions of depth 3c and size O(n) reducing BFE_n to BFE_{16n^{1/4}}, in which no path from a leaf to the root encounters more than four oracle gates. (Each oracle gate for BFE_m uses O(m) wires and is replaced by a circuit that also has O(m) wires. Thus the size of the circuit is multiplied by at most a constant.) By induction, for every k we obtain a Dlogtime-uniform family of pure reductions of depth (2^k − 1)c and size O(n) reducing BFE_n to BFE_{4^k n^{1/2^k}}. In particular, for ε of the form 1/2^{k−1}, there is a Dlogtime-uniform family of pure reductions of depth (2^k − 1)c and size O(n) reducing BFE_n to BFE_{n^ε}. This holds because 4^k n^{1/2^k} ≤ n^{1/2^{k−1}} for all large n. The proposition follows, since every ε is within a factor of 2 of some smaller number of the form 1/2^{k−1}.
We now prove the claim by presenting the circuit family {C_n}. BFE contains only inputs of length n of the form n = 2^{d+1} − 1 for some integer d, so assume n has this form. Assume also that d is odd; the construction is simpler if d is even. Denote the first 2^d − 1 input symbols by v, and the last 2^d input symbols by x.
The output gate of C n will be an AND gate of fan-in two, where one child a checks if the input is a well-formed formula, and the other child b evaluates the formula, assuming that it is well-formed. We consider b first.
The gate b is an oracle gate whose input is the string v'x'. Here v' consists of the first 2^{(d+1)/2} − 1 symbols of v. The string x' has 2^{(d+1)/2} symbols, namely the outputs of oracle gates b_i, for 1 ≤ i ≤ 2^{(d+1)/2}. Suppose the input string vx is well-formed. Then v' encodes the subformula of v that contains the output gate of v and has roughly half the depth of v. The oracle gates b_i evaluate the subformulas of v that feed into v'. More precisely, the oracle gate b_i takes as input a string (v_i, x_i). Here v_i is the ith block of length 2^{(d−1)/2} − 1 after v' in v, and x_i is the ith block of length 2^{(d−1)/2} in x. It is immediate that the gate b produces the desired output if the input is a well-formed formula. A routine calculation shows that the queries have length at most 4n^{1/2}.
We now turn to the construction of the subcircuit a that tests whether the input is well-formed. Recall that the input is well-formed if and only if v ∈ {∧, ∨, ⊕}^* and x ∈ {0,1}^*. This is simply an AND of n conditions (call them c_i), each of which can be computed using NC^0 circuitry. We need to evaluate this AND using oracle gates for BFE_m with m ≤ 4n^{1/2}. To do this, we first use another layer of NC^0 circuitry to halve the fan-in of the unbounded fan-in AND that we need to compute. We compute conditions c'_j defined by c'_j = c_{2j−1} ∧ c_{2j} for j = 1, ..., (n − 1)/2, and c'_{(n+1)/2} = c_n. The input is well-formed if and only if BFE(v''x'') evaluates to true, where x'' consists of the bits c'_j and v'' = ∧^{(n−1)/2}. This well-formed instance of BFE can be evaluated using queries to BFE_m for m ≤ 4n^{1/2}, by the same formula-evaluation construction as for gate b.
To complete the proof, we merely observe that the number of wires is easily seen to be linear in n, and we note that Dlogtime-uniformity is routine to establish.
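One layer of the reduction (the gate b together with the gates b_i) can be simulated as follows. This Python sketch rests on assumptions of our own:

- The formula is stored with its 2^d − 1 operators in heap order (root first, then each level from left to right), followed by its 2^d leaf bits.
- Subtrees are extracted explicitly, whereas the proof takes each v_i to be a contiguous block of v.
- The operator names "and", "or" and "xor" stand for ∧, ∨ and ⊕.
- The well-formedness test (subcircuit a) is omitted.

The oracle is any function that evaluates a smaller formula, and all names are hypothetical.

```python
OPS = {"and": lambda a, b: a & b, "or": lambda a, b: a | b, "xor": lambda a, b: a ^ b}


def eval_formula(v, x):
    """Evaluate a complete binary formula: v holds 2^d - 1 operators in heap
    order (root first), x holds the 2^d leaf bits from left to right."""
    vals = list(x)
    while len(vals) > 1:
        half = len(vals) // 2
        ops = v[half - 1:2 * half - 1]  # operators on the level just above vals
        vals = [OPS[op](vals[2 * p], vals[2 * p + 1]) for p, op in enumerate(ops)]
    return vals[0]


def subformula(v, x, d, t, p):
    """Operators (in heap order) and leaves of the subtree rooted at position p on level t."""
    sub_v = []
    for r in range(d - t):
        start = (2 ** (t + r) - 1) + p * 2 ** r
        sub_v.extend(v[start:start + 2 ** r])
    width = 2 ** (d - t)
    return sub_v, x[p * width:(p + 1) * width]


def eval_by_halving(v, x, d, oracle):
    """One layer of the self-reduction: the gates b_i evaluate the 2^t bottom
    subformulas with oracle queries, then the gate b queries the top part."""
    t = (d + 1) // 2
    leaves = [oracle(*subformula(v, x, d, t, p)) for p in range(2 ** t)]
    return oracle(v[:2 ** t - 1], leaves)
```

The accompanying check also records the length of every query and confirms that none exceeds 4n^{1/2}, as claimed in the proof.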
Indeed, we point out that any problem that is complete for a complexity class that has a strongly downward self-reducible complete problem must be strongly downward self-reducible. See Proposition 5.3.
Another problem for which we can prove downward self-reducibility is Iterated Matrix Multiplication. Let IMM_{n,d,ℓ} : {0,1}^* […]
The following more interesting lemma will be useful in the next section.
LEMMA 3.11. There is a universal constant c_CRR such that for any 0 < ε < 1 and […]
Here, c_CRR is a specific constant that can be determined from the paper of Hesse et al. [2002]. The exact value of c_CRR is not important for our purposes, but we estimate that c_CRR < 10.
PROOF. Hesse et al. [2002] give Dlogtime-uniform TC^0 circuits with O(n^{c_CRR}) wires that do the following tasks: […]
Let n be large enough and set d = d(n). Using these three circuit families, we can reduce IMM_{n,d,n} to O(n^2) instances of mIMM_{n,d,q_i}, computed in parallel for O(n^2) distinct O(log n)-bit primes q_i. The computation has three phases:

1. Compute the Chinese Remainder Representation of each input matrix. This gives O(n^2) instances of mIMM_{n,d,q_i} to solve.
2. Compute the iterated product modulo each q_i, which yields the output in Chinese Remainder Representation.
3. Convert the answer to binary representation.

The following three steps describe the computation in more detail.
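The three-step outline can be checked in ordinary arithmetic: reduce entries modulo many primes, multiply modulo each prime, and recombine by Chinese remaindering. The Python sketch below makes two simplifying assumptions. It replaces the Hesse et al. circuits B and R by direct computation. It also uses a hand-picked list of primes whose product must exceed every entry of the true product. The function names are hypothetical, and `pow(x, -1, m)` requires Python 3.8 or later.

```python
from functools import reduce


def mat_mul_mod(A, B, q):
    """Product of two d x d matrices with entries reduced modulo q."""
    d = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(d)) % q for j in range(d)]
            for i in range(d)]


def crt(residues, moduli):
    """Smallest nonnegative x with x ≡ r (mod q) for each pair; moduli pairwise coprime."""
    M = 1
    for q in moduli:
        M *= q
    x = 0
    for r, q in zip(residues, moduli):
        Mq = M // q
        x += r * Mq * pow(Mq, -1, q)  # modular inverse of Mq modulo q
    return x % M


def imm_via_crt(mats, primes):
    """Iterated product M_1 M_2 ... M_n of nonnegative integer matrices.

    Step 1: reduce every entry modulo each prime q_j.
    Step 2: multiply the reduced matrices modulo q_j, giving N_j = N mod q_j.
    Step 3: recover each entry of N by Chinese remaindering.
    The product of the primes must exceed every entry of N.
    """
    d = len(mats[0])
    identity = [[int(i == j) for j in range(d)] for i in range(d)]
    residues = []
    for q in primes:
        reduced = [[[e % q for e in row] for row in M] for M in mats]          # Step 1
        residues.append(reduce(lambda A, B: mat_mul_mod(A, B, q), reduced, identity))  # Step 2
    return [[crt([N[i][j] for N in residues], primes) for j in range(d)]     # Step 3
            for i in range(d)]
```

In the proof the primes have O(log n) bits and there are 2n^2 of them, enough to represent every entry of the product. The sketch only mirrors that requirement through its assumption on the product of `primes`.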
Step 1. We convert the input matrices M_1, ..., M_n into matrices M_{i,j}, for i ∈ {1, ..., n} and j ∈ {1, ..., 2n^2}, as follows. For each i ∈ {1, ..., n} and k, ℓ ∈ {1, ..., d}, we apply the circuit B_{2n^2} to the entry (M_i)_{k,ℓ} of M_i, padded with leading zeros to 2n^2 bits, to obtain […]. That is, each matrix M_{i,j} consists of the entries of M_i modulo the O(log n)-bit prime q_j. This step consists of n · d^2 copies of the circuit B_{2n^2}, so it can be done by a TC^0 circuit of size O(d^2 n^{1+2c_CRR}).
Step 2. For each j ∈ {1, 2, ..., 2n^2}, we compute the product N_j of matrices […]
Step 3. The previous step yields matrices N_1, N_2, ..., N_{2n^2}, which represent the product N of the matrices M_1, ..., M_n; here, N_j = N mod q_j. For each k, ℓ ∈ {1, ..., d}, apply the circuit R_{2n^2} to ((N_1)_{k,ℓ}, q_1), ((N_2)_{k,ℓ}, q_2), ..., ((N_{2n^2})_{k,ℓ} […]
Amplifying Lower Bounds
In the previous section we have established several downward self-reducibility results. In this section we show that any problem that is downward self-reducible in this way has circuits of polynomial size of some type if and only if it has very small circuits of that type. Thus, if a small circuit size lower bound can be proved for any such problem, it can be "amplified" into a superpolynomial size lower bound. The general form of our claims is:
If a function f is computable by polynomial-size circuits of type C, then for any ε > 0, f is computable by circuits of type C using O(n^{1+ε}) gates and wires.
The circuit types we will consider are AC^0, ACC^0, CC^0, TC^0 and NC^1 circuits. The functions f we consider will typically (but not always) be complete for some complexity class. For example, MAJ is complete for TC^0 (under ≤^{NC^0}_T reductions), and the word problem for S_5 is complete for NC^1. Our claim therefore has the following consequence: a lower bound of Ω(n^{1+ε}), for some ε > 0, on the number of wires or gates needed to compute f would separate some of these circuit classes. The following proposition summarizes known relationships between these circuit classes.
PROPOSITION 4.1.
Except for the proper inclusion AC^0 ⊊ ACC^0 [Furst et al. 1984; Yao 1985; Håstad 1988] […]
PROOF. Assume that f_n has circuits of type C with n^k + k wires. Let δ = min(ε/k, ε_0). Consider the reduction of f_n to f_{n^δ}, which has size O(s(n)) and hence at most O(s(n)) oracle gates. Replace each oracle gate for f_{n^δ} by the circuit of type C of size n^{δk} + k. This gives a circuit of type C for f_n with O(s(n)) + O(s(n)) · (n^{δk} + k) = O(s(n) n^ε) wires.
The claim follows. (Technically, the class C may not allow the bounded fan-in AND, OR or NOT gates that can appear in the pure reduction. In that case such gates must be simulated by constant-size circuits of type C. This simulation affects the size bound by at most a constant factor.)
By analyzing the depth of the circuits constructed in the proof of Theorem 4.2, one can observe the following: if C is a class of bounded-depth circuits, then f has circuits of type C with depth O(1/ε) and O(s(n)n^ε) wires. In most of our arguments, either s(n) = n or s(n) = n^{1+ε_0}, for any 0 < ε_0 < 1. This yields the following corollary.
COROLLARY 4.3.
[…] wires, where the uniformity machine runs in time K log n. (We have not computed the value of K, and this value may depend on minor details of the formulation used to define Dlogtime-uniformity. We anticipate that K = 4 suffices. The self-reductions have a very regular structure, and the O(log n) running time of the "original" TC^0 circuit family is simulated only to determine the structure of circuits for inputs of size n^ε, for small values of ε.)
Sometimes concrete lower bounds are easier to prove for specially constructed sets than for the standard complete sets of a complexity class. The following corollary shows that we can also "amplify" lower bounds for such specially constructed sets. The reason is that if one can show that a specially constructed set lies in NC^1, then one can typically determine some upper bound on the depth d(n) of the NC^1 […]
Note that one can replace the dimension bound 2^{√log n} in the theorem by any other function in n^{o(1)}. The contrapositive may be more informative. If one can show, for some ε > 0, that BIMM_{n,2^{√log n}} requires NC^1 circuits of size Ω(n^{1+ε}), then one has shown that NC^1 ≠ NL. Unlike the earlier theorems in this section, we obtain only an implication and not an equivalence, because BIMM_{n,2^{√log n}} is not known (or believed) to be complete for NL. Note also that this result concerns NC^1 circuit size; it does not seem to translate into a useful statement about formula size.
PROOF. Since BIMM_{n,n} is in NL, our assumption implies that BIMM_{n,n} is computable by NC^1 circuits of size O(n^k) for some k > 0. Let ε > 0 and set δ = ε/k. Then BIMM_{n^δ,n^δ} is computable by NC^1 circuits of size O(n^{δk}) = O(n^ε). Hence BIMM_{ℓ,2^{√log n}} is computable by NC^1 circuits of size O(n^ε) for any ℓ ≤ n^δ. (Here we use the fact that 2^{√log n} grows more slowly than n^δ for any δ > 0.) By Proposition 3.10, BIMM_{n,2^{√log n}} is downward self-reducible to BIMM_{n^δ,2^{√log n}}. The reduction is pure, has size O(n · 2^{2√log n}), and uses O(n) oracle gates for BIMM_{ℓ,2^{√log n}}, ℓ ≤ n^δ. We can replace each oracle gate by an NC^1 circuit […]
We now turn to the complexity class #L (the class of functions that count the number of accepting paths of NL machines). This is the largest complexity class that we know how to address using these techniques. Iterated Matrix Multiplication IMM_{n,n,n} is complete for #L (see Allender and Ogihara [1996]). IMM_{n,2^{√log n},n} is a subproblem that is not known (or expected) to be complete for #L, but it is also not known to lie in any smaller complexity class. […]
PROOF. Since IMM_{n,n,n} is in #L, our assumption implies that IMM_{n,n,n} is computable by TC^0 circuits of size O(n^k) for some k > 0. Choose δ = 1/k. Then IMM_{n^δ,n^δ,n^δ} is computable by TC^0 circuits of size O(n^{δk}) = O(n), and hence so is IMM_{n^δ,2^{√log n},n^δ}. By Lemma 3.11, IMM_{n,2^{√log n},n} is downward self-reducible to IMM_{n^δ,2^{√log n},n^δ} […] [Vollmer 1999, Lemma 2.11.1].) The depth of the circuit increases by a factor of at most O(log n), and the size by at most a constant factor.
The preceding two theorems do not use problems known to be complete for well-known complexity classes. Thus we obtain only implications regarding NL and #L, rather than equivalences concerning whether these classes collapse to NC^1. However, IMM_{n,3,n} is complete for GapNC^1 [Caussinus et al. 1998], the class of functions over the integers computable by polynomial-size arithmetic formulae. All functions in NC^1 are in GapNC^1, and it has been conjectured that GapNC^1 coincides with NC^1 [Allender 2004]. Among well-studied complexity classes not known to be contained in NC^1, GapNC^1 is the only one for which we can present a strongly downward self-reducible complete problem. […]
PROOF. Let us prove the first equivalence. Assume that GapNC^1 ⊆ TC^0. As IMM_{n,3,n} is in GapNC^1, there is some k > 0 such that IMM_{n,3,n} has TC^0 circuits of size O(n^k). Let ε = 2c_CRR/k. By Lemma 3.11, IMM_{n,3,n} is downward self-reducible to IMM_{n^ε,3,n^ε} by TC^0 circuits of size O(d^2 n^{3+2c_CRR}) with O(n^3) oracle gates. Replace each oracle gate in the reduction by the TC^0 circuit for IMM_{n^ε,3,n^ε} of size O(n^{εk}) = O(n^{2c_CRR}). This gives a TC^0 circuit of size O(9 · n^{3+2c_CRR} + n^3 · n^{2c_CRR}) computing IMM_{n,3,n}, which proves one implication. The other implication follows from the fact that IMM_{n,3,n} is complete for GapNC^1 under ≤^{AC^0}_m reductions. The equivalence for NC^1 follows from the first one by an argument similar to the proof of the previous theorem.
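For the record, the size calculation in the last proof works out as follows, with d = 3 (so d^2 = 9) and ε = 2c_CRR/k. The key observation is that the final exponent 3 + 2c_CRR does not depend on k.

```latex
\begin{aligned}
\varepsilon = \frac{2c_{\mathrm{CRR}}}{k}
  \;&\Longrightarrow\; O\big((n^{\varepsilon})^{k}\big) = O\big(n^{2c_{\mathrm{CRR}}}\big)
  \quad\text{per oracle gate},\\
\text{total size}
  &= O\big(3^{2}\, n^{3+2c_{\mathrm{CRR}}}\big) + O(n^{3})\cdot O\big(n^{2c_{\mathrm{CRR}}}\big)
   = O\big(n^{3+2c_{\mathrm{CRR}}}\big).
\end{aligned}
```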
Limits on Downward Self-Reducibility
In the previous section we saw that downward self-reducibility is an interesting tool for studying circuit classes. We showed that, to separate circuit classes such as ACC^0 and NC^1, quadratic lower bounds on the circuit complexity of certain NC^1-complete problems would suffice. What about separating ACC^0 from, say, NP? That should in principle be a much easier task. Can downward self-reducibility establish an analog of Corollary 4.3 for ACC^0 versus NP? The following theorem shows that significant obstacles must be overcome before such an approach can work. To establish that a problem is strongly downward self-reducible, one must already have an efficient algorithm for it. […]
PROOF. We prove the second claim first.
(2) Let n ≥ 2. To build a circuit for f_n, start with the TC^0 circuit of depth d and size n^k that reduces f_n to f_{n^ε}, for some ε < 1. Replace each oracle gate in this circuit with the circuit that reduces f_{n^ε} to f_{(n^ε)^ε}. The new circuit has depth at most d^2 and size at most n^k + n^k · n^{εk}. We repeat this process until the oracle gates have size O(1). At that point we replace them by circuitry of size O(1) computing f on small inputs. The number of stages is O(log log n); thus, the depth is d^{O(log log n)} = log^{O(1)} n. The size of the circuit is polynomially bounded by […]
Finally, replace each MAJ gate by an NC^1 circuit. It is easy to verify that the resulting circuit is logspace-uniform if the self-reduction circuits are. This establishes that f ∈ NC. In order to see that f has TC^0 circuits of size 2^{…} […] [Barrington et al. 1990a]. At most one of these AND gates will evaluate to 1. Hence taking the MOD-q of these AND gates computes the DNF for f_m.
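The stage count in part (2) can be made concrete. Track log_2 of the query size: each composition multiplies it by ε, so roughly log_{1/ε} log n compositions bring the queries down to constant size. The hypothetical Python sketch below counts the stages, taking "constant size" to mean size at most 2. It takes log_2 n as input so that huge n need not be materialized.

```python
def composition_stages(log2_n, eps):
    """Number of compositions of an n -> n**eps self-reduction needed before the
    oracle queries have size at most 2, working with log2 of the query size."""
    if not 0 < eps < 1:
        raise ValueError("eps must lie strictly between 0 and 1")
    log_size = float(log2_n)
    stages = 0
    while log_size > 1:
        log_size *= eps  # queries of size m become queries of size m**eps
        stages += 1
    return stages
```

With ε = 1/2, squaring n (doubling log_2 n) adds only one stage. This is the O(log log n) behaviour used in the proof. For example, log_2 n = 64 gives 6 stages and log_2 n = 1024 gives 10.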
(3) Again we use the obvious recursive algorithm. We run the Turing reduction, and whenever it asks an oracle query about a smaller instance of f, we recursively invoke the reduction on that instance. If the reduction runs in time O(n^k), then […]
Speculation. Theorem 5.1 suggests abandoning any attempt to show that SAT has the downward self-reducibility property. However, it does not exclude the following approach to proving an analog of Corollary 4.3 for NP. (Such an analog might, for instance, state that if NP = TC^0, then SAT has TC^0 circuits of size n^2.) Rather than presenting a self-reduction for SAT unconditionally, one could start from the assumption that NP ⊆ TC^0. Under that assumption, one would construct a downward self-reduction of SAT (or of some other specially constructed set in NP), and conclude that SAT has TC^0 circuits of almost linear size. We note that if NP ⊆ TC^0, then SAT certainly does have the strong downward self-reducibility property; this follows from Proposition 5.3 below. However, nothing can be said about the size of this self-reduction beyond its being computed by a polynomial-size NC^0 oracle circuit. So this does not seem to let us conclude that SAT has TC^0 circuits of, say, quadratic size. […]
PROOF. The polynomial-size reductions between f and g each ask queries of size at most n^k for some k, for all n ≥ 2. The strong downward self-reduction of f reduces f_n to f_{n^ε} for some ε > 0. Let t be such that ε^t < 1/(2k^2). Let {C_n} be the circuit family that is the t-fold composition of the downward self-reduction of f. By Proposition 2.3, {C_n} is a C downward self-reduction that, on inputs of length n^k, makes no query of length greater than (n^k)^{ε^t} < n^{1/(2k)}. Composing the reduction from g to f with the reduction computed by {C_n} yields a reduction of g_n to f_{n^{1/(2k)}}.
Composing this reduction with the reduction from f to g, we obtain a reduction of g_n to g_{n^{1/2}}. This establishes that g is C strongly downward self-reducible.
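The exponent bookkeeping in this proof can be verified with exact rational arithmetic. The sketch below uses hypothetical names and takes ε and k as inputs. It finds the smallest t with ε^t < 1/(2k^2). It then tracks the exponent of n along the chain:

- Queries to f have size (n^k)^{ε^t} = n^{kε^t}.
- The reduction from f to g raises sizes to at most the kth power, so queries to g have size n^{k^2 ε^t} < n^{1/2}.

```python
from fractions import Fraction


def fold_count(eps, k):
    """Smallest t with eps**t < 1/(2*k**2), computed exactly (requires 0 < eps < 1)."""
    eps = Fraction(eps)
    if not 0 < eps < 1:
        raise ValueError("eps must lie strictly between 0 and 1")
    t = 0
    while eps ** t >= Fraction(1, 2 * k * k):
        t += 1
    return t


def exponent_chain(eps, k):
    """Exponents of n for the f-queries and g-queries after t-fold composition."""
    t = fold_count(eps, k)
    f_exp = k * Fraction(eps) ** t  # (n^k)^(eps^t) = n^(k * eps^t)
    g_exp = k * f_exp               # the f-to-g reduction raises sizes to the k-th power
    return t, f_exp, g_exp
```

For ε = 1/2 and k = 2 this gives t = 4, f-queries of size n^{1/8} < n^{1/4}, and g-queries of size n^{1/4} < n^{1/2}.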
Inapproximability of MAX-CLIQUE
In this section we adapt the technique of Srinivasan [2003] to the setting of constant-depth circuit classes. We also obtain a lower bound on the complexity of any polynomial-time reduction of MAX-CLIQUE to the problem of computing approximations to MAX-CLIQUE.
MAX-CLIQUE is the following computational problem: given an undirected graph G, determine the size of the largest clique in G. For simplicity, we assume that G is given by its adjacency matrix. We define the size of G to be the number of vertices in G. The following is known [Zuckerman 2007] (see also Håstad [1998], Feige and Kilian [1998], Khot [2001], and Khot and Ponnuswami [2006]): if, for some ε > 0, an n^{1−ε}-approximation to MAX-CLIQUE is computable in P, then P = NP. We use the technique of Srinivasan [2003] to show the following statement: […]
Note that the depth of the O(n^{1+(k−1)ε(n)})-size circuits does not increase as ε(n) decreases. As stated, the theorem holds only for nonuniform circuits, but a uniform version holds for any function ε(n) that is sufficiently easy to compute. To prove the theorem, we need the following simple lemma. […]
PROOF. The computation of the circuit proceeds in three steps. We identify the integers 0, ..., 2^ℓ − 1 with their ℓ-bit binary representations.
Step 1. [...] and g_i(y_j)). Hence, we need at most 3 · 2^ℓ · m wires for this step.
Step 2. [...] Chandra et al. [1985].
Step 3. Compute the output z. For i = 0, ..., 2^ℓ − 2, let e_i = (d_i AND (NOT d_{i+1})), and let e_{2^ℓ−1} = d_{2^ℓ−1}. Hence, the kth bit of the ℓ-bit binary representation of z is obtained by taking the OR of the gates computing e_i for all those i such that the kth bit of the ℓ-bit binary representation of i is one. This step requires ℓ OR gates of fan-in at most 2^ℓ, together with 2^ℓ − 1 AND gates of fan-in 2 and NOT gates.
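The three steps above can be simulated in software to see why the output bits are correct. Steps 1 and 2 are only partly legible in this text, so the Python sketch below assumes (this is an assumption, not the paper's stated construction) that they produce threshold indicators d_i = 1 exactly when some input y_j satisfies y_j ≥ i. Under that assumption, d is a block of ones followed by zeros, exactly one e_i is 1 (namely e_z for z the maximum), and Step 3 reads off the bits of z. The function name max_by_threshold_indicators is hypothetical.

```python
def max_by_threshold_indicators(ys: list[int], ell: int) -> int:
    """Compute max(ys) for ell-bit integers following Step 3 of the proof.

    ASSUMPTION: Steps 1-2 are taken to yield d[i] = 1 iff some y_j >= i,
    for i = 0, ..., 2**ell - 1.
    """
    size = 1 << ell
    if not ys or any(not 0 <= y < size for y in ys):
        raise ValueError("need a nonempty list of ell-bit integers")
    # Steps 1-2 (assumed): threshold indicators d_i = OR_j [y_j >= i]
    d = [any(y >= i for y in ys) for i in range(size)]
    # Step 3: e_i = d_i AND NOT d_{i+1}, and e_{2^ell - 1} = d_{2^ell - 1}
    e = [d[i] and not d[i + 1] for i in range(size - 1)] + [d[size - 1]]
    # bit k of z is the OR of e_i over all i whose k-th bit is 1
    z = 0
    for k in range(ell):
        if any(e[i] for i in range(size) if (i >> k) & 1):
            z |= 1 << k
    return z
```

Each of the ℓ output ORs ranges over the 2^{ℓ−1} indices i with the relevant bit set, which is consistent with the gate count stated for Step 3.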
Clearly, the combination of the above three steps gives a constant-depth circuit of size c · 2^ℓ · m that correctly computes z. Dlogtime-uniformity of the circuit is routine to establish.
[...] of size O(n^{1+(k−1)ε}) computing an n^{1−ε}-approximation of MAX-CLIQUE. The computation of the approximation proceeds as follows: we partition the vertices of the graph G into n^{1−ε} parts V_1, ..., V_{n^{1−ε}}, each of size at most n^ε. For i = 1, ..., n^{1−ε}, we compute in parallel MAX-CLIQUE of G restricted to V_i. Then we output the largest of these partial results. The correctness of the algorithm follows from the simple observation that if G contains a clique of size f(G), then by the pigeonhole principle, for some i, V_i contains at least f(G)/n^{1−ε} vertices of that clique, and hence MAX-CLIQUE of G restricted to V_i is at least f(G)/n^{1−ε}. The size of a circuit carrying out the computation can be bounded as follows. We use n^{1−ε} circuits of size O((n^ε)^k) = O(n^{kε}) to compute the values of the n^{1−ε} MAX-CLIQUE subproblems. This requires size O(n^{1−ε} · n^{kε}) = O(n^{1+(k−1)ε}) in total. By Lemma 6.2, we can find the maximum of the n^{1−ε} values in the range {0, ..., n^ε} by an AC^0 circuit of size O(n^{1−ε} · n^ε) = O(n).
Thus, the size of the circuits is bounded by O(n^{1+(k−1)ε}). Dlogtime-uniformity of the circuit is routine to establish. The case of TC^0 and NC^1 is proven by essentially the same argument.
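The partition-and-maximize algorithm in this proof can be sketched sequentially. In the Python sketch below, an exact brute-force clique search stands in for the hypothetical O(n^{kε})-size circuits solving each subproblem, and the parts are taken to be consecutive blocks of vertices (an illustrative choice; any partition works). The helper names max_clique_size and partition_approx are hypothetical. With part_size about n^ε there are about n^{1−ε} parts, and the pigeonhole argument guarantees that the answer times the number of parts is at least the true clique number.

```python
from itertools import combinations


def max_clique_size(adj: list[list[bool]], vertices) -> int:
    """Exact MAX-CLIQUE of the subgraph induced by `vertices` (brute force)."""
    vs = list(vertices)
    for r in range(len(vs), 0, -1):
        for cand in combinations(vs, r):
            if all(adj[u][v] for u, v in combinations(cand, 2)):
                return r
    return 0


def partition_approx(adj: list[list[bool]], part_size: int) -> int:
    """Split the vertices into consecutive blocks of at most `part_size`
    vertices, solve each block exactly, and return the largest answer."""
    n = len(adj)
    parts = [range(s, min(s + part_size, n)) for s in range(0, n, part_size)]
    return max((max_clique_size(adj, p) for p in parts), default=0)
```

In the circuit version the per-part computations run in parallel and the final maximum is taken by the AC^0 circuit of Lemma 6.2; here both are simply sequential loops.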
The technique from the previous proof can also be used to establish the following claim.
PROOF. In the proof of Theorem 6.1, we have seen how to compute an m^{1−ε}-approximation of MAX-CLIQUE_m by asking queries to MAX-CLIQUE_{m^ε}. If there is a polynomial-time algorithm that solves MAX-CLIQUE_n using an oracle for m^{1−ε}-approximation of MAX-CLIQUE_m, where m ≤ n^k, then we can combine it with the above reduction to obtain the desired self-reduction.
This gives rise to what is perhaps the first example of a lower bound showing that there is no "quick" reduction between two natural NP-optimization problems. For many natural NP-complete problems A and B, very efficient reductions between A and B are known. (For example, for any problem A ∈ NTIME(n log^{O(1)} n), there is a many-one reduction from A to SAT that is computable in time O(n log^{O(1)} n) [Cook 1988].) It is easy to show that if B ∉ NTIME(n^k), then any reduction from B to SAT requires time n^k/log^{O(1)} n; but this does not provide any useful lower bound on the complexity of reducing natural problems to SAT, since no natural NP-complete problem is known to lie outside of NTIME(n). No pair of natural NP-complete problems A and B seems to be known for which a reduction from A to B is known to require more than linear time (even under the assumption that P ≠ NP).
In contrast, consider the problem of computing a √n-approximation to MAX-CLIQUE. Zuckerman [2007] presents a deterministic polynomial-time Turing reduction from MAX-CLIQUE to this approximation problem. (More precisely, Zuckerman shows that distinguishing graphs having only small cliques from graphs with large cliques is complete for NP under many-one reductions; that is, one can decide the membership of a formula in SAT from the answer to an instance of an arbitrary n^{1−ε}-approximation of MAX-CLIQUE_n. The polynomial-time Turing reduction from MAX-CLIQUE follows from the trivial observation that MAX-CLIQUE is computable in P^SAT = P^NP.) How long must the queries in this reduction be? Assuming that P ≠ NP, Theorems 6.3 and 5.1 tell us that the queries in this reduction must ask about graphs with at least n^2 vertices. We can state the following claim.
COROLLARY 6.4. P = NP if and only if there is an α < 2 and a deterministic polynomial-time Turing reduction from MAX-CLIQUE to the problem of computing a √n-approximation to MAX-CLIQUE that asks queries of size no greater than n^α.
PROOF. One direction follows from the observation that if P = NP then there is a polynomial-time Turing reduction for this problem that asks queries of size O(1) (or asks no queries at all).
For the other direction: if there is a reduction from MAX-CLIQUE to the problem of computing a √n-approximation to MAX-CLIQUE that asks queries only about graphs of size n^α for some α < 2, then by Theorem 6.3, MAX-CLIQUE_n is downward self-reducible to MAX-CLIQUE_{n^{α/2}}. By Theorem 5.1, this implies that MAX-CLIQUE is computable in polynomial time, and hence NP = P.
Clearly, analogous statements can be proved for the n^ε-approximation for any value of ε such that 0 < ε < 1 and α < 1/(1−ε); the case ε = 1/2 is likely to be of greatest interest. Similar claims can also be proved for probabilistic reductions instead of deterministic ones, under the assumption that SAT does not have probabilistic polynomial-time algorithms.
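The condition α < 1/(1−ε) comes from the exponent bookkeeping below, a short derivation that uses only the fact from the proof of Theorem 6.3 that an m^{1−δ}-approximation of MAX-CLIQUE_m can be computed from queries to MAX-CLIQUE_{m^δ}; an m^ε-approximation is the case δ = 1 − ε.

```latex
\begin{align*}
  m^{\varepsilon}\text{-approximation of MAX-CLIQUE}_m
    &\;=\; m^{1-\delta}\text{-approximation with } \delta = 1-\varepsilon,
    \ \text{answered via MAX-CLIQUE}_{m^{1-\varepsilon}},\\
  m \le n^{\alpha}
    &\;\Longrightarrow\; m^{1-\varepsilon} \le n^{\alpha(1-\varepsilon)},\\
  \alpha(1-\varepsilon) < 1
    &\;\Longleftrightarrow\; \alpha < \frac{1}{1-\varepsilon}
    \qquad\bigl(\varepsilon = \tfrac12:\ \alpha < 2,\ \ m^{1/2} \le n^{\alpha/2}\bigr).
\end{align*}
```

So whenever α(1−ε) < 1, the composed reduction maps MAX-CLIQUE_n to instances with at most n^{α(1−ε)} vertices, a strong downward self-reduction; for ε = 1/2 this is the n^{α/2} bound used in the proof of Corollary 6.4.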
It is worth mentioning that, in some sense, decreasing the query length in Zuckerman's reduction [Zuckerman 2007] from MAX-CLIQUE to computing an n^{1/2}-approximation to MAX-CLIQUE is a universal approach to proving P = NP: if any approach will work, then this approach will.
Circuit Lower Bounds
We observed in Section 1.2 that, although BFE requires size n^{1+ε_d} on depth-d TC^0 circuits [Impagliazzo et al. 1997], no similar bound for ACC^0 or even CC^0[q] circuits is known. Here, we present lower bounds of this sort for SAT.
We begin this section by showing that problems with small constant-depth circuits have algorithms that run quickly and have small space bounds. Let TISP(t(n), s(n)) denote the class of problems that are computable by machines running in time O(t(n)) and using space O(s(n)). (This definition is somewhat sensitive to the underlying model of computation. We shall always refer explicitly to either the Turing machine model or the random access machine model, to clarify which class is meant.)
A technical matter that must be dealt with in stating the following theorem is that Dlogtime-uniformity does not seem to guarantee that there is a quick way to enumerate, for a given gate h, the list of gates g for which there is a wire from g to h. There are some standard techniques for ensuring that this property holds (see, e.g., Allender and Gore [1994]), but these techniques seem to involve a polynomial blow-up in the circuit size, which we would prefer to avoid. We believe that, for most uniform families of circuits that are constructed, a quick enumeration of the inputs to a given gate will nonetheless be possible. Rather than alter the definition of Dlogtime-uniformity, in this section we simply say that a circuit family is strongly uniform if it is Dlogtime-uniform and, in addition, on input (n, i, h), the name of the gate g that is the ith input to the gate in C_n having label h can be computed in time log^{O(1)} n.
[...] (1+ε) log n). Since we can use more space, we will use it to remember the computed values of gates that have fan-in larger than n^δ. The faster algorithm will then also evaluate the circuit recursively, but whenever it computes the value of a gate with fan-in larger than n^δ, it records the value, so that such a gate is evaluated at most once. On a random access machine, we store the values in a binary search tree; on a Turing machine, we store them in a simple list. Since the circuit has O(n^{1+ε}) wires, there are at most O(n^{1+ε}/n^δ) gates with fan-in larger than n^δ, so we need space only O(n^{1+ε−δ} log^{O(1)} n). Finding the value of a gate and determining whether it has already been computed takes time O(log^{O(1)} n) on a random access machine and O(n^{1+ε−δ} log^{O(1)} n) on a Turing machine. To bound the total time needed to evaluate the circuit, notice that we will have to recursively evaluate a tree of fan-in at most n^δ and depth d. To traverse this tree, we need to make n^{δd} visits to the nodes.
Besides that, we will have to evaluate the gates with large fan-in. Since there are at most O(n^{1+ε}) wires leading into them, these gates cost at most O(n^{1+ε}) additional node visits. This yields the claimed time bound.
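The evaluation strategy described in this proof (plain recursion for small fan-in gates, remembering the values of large fan-in gates) can be sketched as follows. This is a minimal Python sketch under stated assumptions: the circuit encoding (a dict from gate name to operation and input list, with inputs named "x0", "x1", ...) and the function name evaluate are hypothetical, the dict lookup of a gate's inputs stands in for the strong-uniformity enumeration, and a Python dict replaces the binary search tree (random access machine) or list (Turing machine) used in the proof. It returns the number of node visits so the effect of caching can be observed.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical encoding: gate name -> (operation, names of its inputs).
# Names starting with "x" are reserved for input variables "x0", "x1", ...
Circuit = Dict[str, Tuple[str, List[str]]]


def evaluate(circuit: Circuit, output: str, x: Sequence[int],
             fanin_threshold: int) -> Tuple[bool, int, int]:
    """Evaluate `output` by depth-first recursion, caching only the values of
    gates whose fan-in exceeds `fanin_threshold` (the n^delta of the proof).

    Returns (value, number of cached gates, number of node visits)."""
    cache: Dict[str, bool] = {}
    visits = 0

    def val(g: str) -> bool:
        nonlocal visits
        visits += 1
        if g.startswith("x"):                 # input variable
            return bool(x[int(g[1:])])
        op, ins = circuit[g]
        big = len(ins) > fanin_threshold
        if big and g in cache:                # large fan-in: evaluated at most once
            return cache[g]
        bits = [val(h) for h in ins]          # small fan-in: simply recompute
        if op == "AND":
            r = all(bits)
        elif op == "OR":
            r = any(bits)
        elif op == "NOT":
            r = not bits[0]
        elif op == "MAJ":                     # strict majority of the inputs
            r = 2 * sum(bits) > len(bits)
        else:
            raise ValueError(f"unknown gate type {op!r}")
        if big:
            cache[g] = r
        return r

    value = val(output)
    return value, len(cache), visits
```

In a small circuit where a five-input MAJ gate feeds two later gates, a threshold of 2 caches that gate and its inputs are read once, while a threshold of 10 disables caching and the gate's subtree is traversed twice.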
We need to make use of known time-space tradeoffs for SAT. The following theorem is a special case of Theorem 1.3 
Since this is true for all ε < 1/(3d − 1), we have in particular that SAT is in DTIME(n^c) on random access machines for all c > 1.
For the rest of the proof, fix some ε < 1/(3d − 1). In particular, SAT is in TISP(n^{1.5}, n^{1−ε}) on random access machines. By Theorem 7.2, if we let c approach 1 from above, the value of e (in Theorem 7.2) approaches 1 from below. Thus, there is some value of c > 1 for which e > 1 − ε (in the statement of Theorem 7.2). Fix these values of c and e. Since n^{1−ε} ≤ n^e, we now have that SAT is in TISP(n^{1.5}, n^e) on random access machines. At this point, by Theorem 7.2, we know that SAT cannot both be solvable by co-nondeterministic random access machines in time O(n^c) and be in TISP(n^{1.5}, n^e) on random access machines. But we have already observed (three paragraphs ago) that SAT is in DTIME(n^c), and thus it is solvable in co-nondeterministic time O(n^c). Thus, we must conclude that SAT is not in TISP(n^{1.5}, n^e). But this contradicts the conclusion of the preceding paragraph. The case d = 1 follows from the case d = 2.
The Natural Proofs Barrier
Razborov and Rudich [1997] identified a significant obstacle to further progress in proving lower bounds on circuit size, by observing that existing lower bound arguments rely on the existence of an easy-to-recognize combinatorial property of a function f that (a) is shared by a large fraction of all functions, and (b) is shared by no function that has small circuits of a given type. Razborov and Rudich showed that any "Natural Proof" that follows this paradigm and shows that a function cannot be computed by circuits of a class C constitutes a proof that C cannot compute pseudorandom function generators. It is not clear how significant an obstacle this poses for proving lower bounds against ACC^0, since there is not much evidence that ACC^0 circuit families can compute pseudorandom function generators. However, for TC^0, this is a serious impediment, since Naor and Reingold [2004] have presented a good candidate pseudorandom function generator that is computable in TC^0. (The reader should keep in mind the distinction between pseudorandom function generators and pseudorandom bit generators. It is known that there are no pseudorandom function generators computable in AC^0 [Linial et al. 1993]; in contrast, if the Naor-Reingold generator is secure, then there are pseudorandom bit generators computable in NC^0 [Applebaum et al. 2006].)
It is premature to argue very strongly that we have identified a path around this obstacle. After all, the only new lower bound that this article offers is to be found in Section 7, and that bound follows from known time-space tradeoff results. (These time-space tradeoffs, in turn, rely on diagonalization, which lies outside the natural proofs framework but gives lower bounds only for uniform circuit families. The natural proofs framework addresses the problem of finding lower bounds for nonuniform circuit complexity.)
However, we contend that it is at least plausible that a natural proof could form the basis for a proof that NC^1 ≠ TC^0, even assuming that the Naor-Reingold generator is cryptographically secure.
How? A proof that NC^1 ≠ TC^0 could conceivably consist of two parts:
(1) A proof that BFE requires TC^0 circuits of size n^{1.5}, and
(2) An appeal to Corollary 4.3, to conclude that NC^1 ≠ TC^0.
14:32 E. ALLENDER AND M. KOUCKÝ
Let us assume for the moment that someone hands us a natural proof of the n^{1.5} lower bound that takes care of the first part of this hypothetical argument. The entire two-part argument nonetheless fails to be a "natural" proof, because the proof of Corollary 4.3 centers on strong downward self-reducibility, which is a combinatorial property that is shared by only a vanishingly small fraction of all Boolean functions on n variables, contrary to the requirements of a natural proof. (Strictly speaking, the strong downward self-reducibility property is not a "combinatorial property" in the sense of the Natural Proofs framework, as it is a relationship between function values on different input sizes. However, all strongly downward self-reducible functions must have truth-tables of small Kolmogorov complexity (since the truth-table of size 2^n is determined completely by a truth-table of size 2^{n^δ}), and thus they constitute a tiny fraction of all functions.)
So now we are left with the question of whether it is reasonable to hope that a natural proof could show that BFE requires TC^0 circuits of size n^{1.5}. First, we note that there are already examples of natural proofs that yield lower bounds of the form n^k for some fixed k. The parity lower bound of Impagliazzo et al. [1997] gives a lower bound of this form for BFE on TC^0 circuits of depth d. Håstad [1998] gives a nearly cubic lower bound on formula size. These are natural proofs.
Next, in order to address directly the question of which obstacles identified by Razborov and Rudich might block a proof that BFE requires TC^0 circuits of size n^{1.5}, let us examine their framework more closely by recalling their definitions of "natural" and "useful" combinatorial properties.
Let F_n denote the class of all Boolean functions f_n: {0,1}^n → {0,1}. A property {T_n ⊆ F_n}_{n∈ℕ} is Quasi P-natural if there is a sub-property {T*_n ⊆ T_n}_{n∈ℕ} such that, for some ε, c > 0, (1) |T*_n| ≥ |F_n|/2^{n^ε}, and (2) there is a deterministic algorithm that, given the truth-table of a function f_n: {0,1}^n → {0,1}, decides whether f_n ∈ T*_n in time 2^{n^c}.
Furthermore, a property {T_n ⊆ F_n}_{n∈ℕ} is useful against a circuit class Λ if no sequence of functions {f_n ∈ T_n}_{n∈ℕ} is computable by circuits from Λ.
Razborov and Rudich show that any Quasi P-natural property that is useful against TC^0 can be used as a subroutine to foil any purported pseudorandom function generator that is computable in TC^0. More generally, they show how to transform any natural lower bound proof into a lower bound on the complexity of computing a pseudorandom function generator. However, it is absolutely essential for their argument that there be a single natural property T that is useful against TC^0 circuits of size n^k for every k; lower bounds for circuits of size n^k for small fixed k translate into lower bounds for pseudorandom function generators that are so weak as to be uninformative. More to the point, natural properties that are useful against circuits of a fixed polynomial size can easily be shown to exist.
To be concrete, let us exhibit an example of a property T = {T_n}_{n∈ℕ} that is Quasi P-natural and useful against TC^0 circuits of size O(n^{1.5}). Our property T is defined as follows:
T_n = { f_n: {0,1}^n → {0,1} | f_n does not have circuits of depth log* n and size n^2 consisting of MAJ and NOT gates }.
It is a trivial exercise to verify that T is natural and useful against TC^0 circuits of size O(n^{1.5}). Of course, we are not able to establish that BFE has property T; if it does, then by Corollary 4.3, NC^1 ≠ TC^0. Clearly, this argument makes use of no special properties of TC^0; one can easily come up with a Quasi P-natural property that will be useful against any class of circuits of a fixed polynomial size.
However, the existence of property T does not seem to imply anything very interesting about the nonexistence of pseudorandom function generators (and consequently does not yield interesting upper bounds on the complexity of factoring Blum integers, which would follow if the Naor-Reingold generator were insecure [Naor and Reingold 2004]). Thus it seems to us reasonable to hope for a "natural" proof that BFE satisfies property T, which would then yield an "unnatural" proof of TC^0 ≠ NC^1, by Corollary 4.3.
Conclusions and Open Problems
So are there reasons to be more optimistic about prospects for lower bounds? We are not sure. The truth is that we do not understand computation. All the known lower bounds essentially rest on information-theoretic arguments, and none of them really takes computation into account. We realize that this is a vague statement; part of the challenge in seeking lower bound proofs is to be able to say something more precise. For example, we are unable to handle recursion, so our bounds typically deteriorate with depth. Hence, the underlying message of Razborov and Rudich (namely, that we need to go beyond combinatorial arguments) is still a worthwhile message. We identify two still unresolved challenges that we believe would advance our understanding of computation:
-Prove Ω(n^2) lower bounds on the length of width-5 branching programs computing an explicit function (by which we mean any problem in NP). It appears that nothing better than Ω(n^2/log n) is known [Neciporuk 1966; Razborov 1991].
-Prove (n [...]
Are there perhaps fundamental barriers that remain in our path, as we attempt to prove circuit lower bounds?
One way to explore this question is to follow the lead of Razborov [1995b], who showed that (under cryptographic assumptions) the bounded arithmetic proof system S^2_2 cannot prove that SAT requires circuits of superpolynomial size. (In earlier work, Razborov [1995a] had argued that most existing lower bound arguments can be carried out in even weaker systems.)
Perhaps techniques similar to those of Razborov [1995b], combined with our observations, can enable one to prove that S^2_2 (or a similar system) cannot prove that BFE requires TC^0 circuits of size n^{1+ε}.
The most important and interesting question raised by this work is whether it can ultimately lead to separations of complexity classes. (This topic is also discussed in a recent survey [Allender 2008].) However, a number of other questions naturally arise. We close by listing two such questions.
-Are there sets complete for every level of the NC hierarchy that are downward self-reducible to instances of size n^ε? Or is there some fundamental reason why we were unable to find a downward self-reduction of this sort for any problem that is complete for NL or L? (In Theorem 4.5, we worked with a restricted version of the NL-complete problem BIMM; the restriction is not believed to be complete for NL.) Showing that a complete set for L is strongly downward self-reducible (via a pure reduction) would show that every problem in L has subexponential-size
