A Lower Bound for Perceptrons and an Oracle Separation of the $PP^{PH}$ Hierarchy, by Berg, Christer & Ulfberg, Staffan
Journal of Computer and System Sciences 56, 263–271 (1998)
A Lower Bound for Perceptrons and an Oracle Separation
of the $PP^{PH}$ Hierarchy
Christer Berg and Staffan Ulfberg
Department of Numerical Analysis and Computing Science, Royal Institute of Technology, SE-100 44 Stockholm, Sweden
E-mail: berg@nada.kth.se, staffanu@nada.kth.se
Received December 20, 1996
We show that there are functions computable by linear size boolean circuits of depth $k$ that require superpolynomial size perceptrons of depth $k-1$, for $k < \log n/(6 \log\log n)$. This result implies the existence of an oracle $A$ such that $\Sigma_k^{p,A} \not\subseteq PP^{\Sigma_{k-2}^{p,A}}$ and, in particular, this oracle separates the levels in the $PP^{PH}$ hierarchy. Using the same ideas, we show a lower bound for another function, which makes it possible to strengthen the oracle separation to $\Delta_k^{p,A} \not\subseteq PP^{\Sigma_{k-2}^{p,A}}$. © 1998 Academic Press
1. INTRODUCTION
There is a strong connection between lower bounds for
boolean circuits (consisting of AND, OR, and NOT gates)
and relativization results about the polynomial time
hierarchy. This fact was first established by Furst, Saxe, and
Sipser [5]. Sipser [13] later defined a family of functions
that are computable by linear size circuits of depth k and
showed that they require superpolynomial size boolean circuits
of depth $k-1$. Yao [14] and Håstad [8, 9] improved
Sipser’s result by showing that the same functions actually
require exponential size circuits of depth $k-1$; this fact
implies the existence of an oracle that separates the levels in
the polynomial time hierarchy.
There is a similar correspondence between the levels in $PP^{PH}$ and constant depth perceptrons. A perceptron is a circuit with a single threshold gate at the top, whose inputs are the outputs of boolean circuits with AND, OR, and NOT gates, called the perceptron’s subcircuits. A depth $k$ perceptron has subcircuits of depth $k-1$. Linear size perceptrons can compute majority and are thus more powerful than polynomial size constant depth circuits [5].
In 1991, Green [6] used a result by Boppana and Håstad [8] on approximating parity to prove a lower bound for the size of constant depth perceptrons that compute parity. This bound implies the existence of an oracle that separates $\oplus P$ from $PP^{PH}$.
Green [7] also discussed the question of whether there is an oracle that separates the levels in the $PP^{PH}$ hierarchy. Since this follows from a sufficiently strong lower bound for the size of depth-$(k-1)$ perceptrons computing functions computable by perceptrons of depth $k$, Green worked on such a lower bound. He proved an exponential lower bound for depth-3 monotone perceptrons computing a function computable by linear size depth-4 perceptrons, and concluded that if the same result could be proved in the nonmonotone case, the Håstad switching lemma could be used to show the separation for all $k$. In the monotone setting, the separation between depth-$k$ and depth-$(k-1)$ perceptrons for all $k$ follows from a stronger result by Håstad and Goldmann [10] that separates boolean circuits of depth $k$ from threshold circuits of depth $k-1$.
In this paper we show that there are functions computable by linear size boolean circuits of depth $k$ that require superpolynomial size perceptrons of depth $k-1$ for $k < \log n/(6 \log\log n)$ and exponential size perceptrons for constant $k$.
The key to making the proof work is to use a separation between polynomial size depth-2 perceptrons with bounded fan-in and general polynomial size depth-2 perceptrons as the basis for the induction. This separation follows from the “one-in-a-box” theorem by Minsky and Papert [12]. To use this simpler basis (compared to the one Green suggested), we need a somewhat stronger statement of Håstad’s switching lemma. The statement as formulated in this paper follows from a closer reading of Håstad’s proof.
We show that our result on perceptrons implies the existence of an oracle that separates the levels in the $PP^{PH}$ hierarchy and, in fact, that there exists an $A$ such that $\Sigma_k^{p,A} \not\subseteq PP^{\Sigma_{k-2}^{p,A}}$.
The fact that our basis, i.e., the “one-in-a-box” theorem by Minsky and Papert, implies that $NP^{NP} \not\subseteq PP$ under an oracle was noted by Fu [4]. Beigel [1] has strengthened this separation to obtain that $P^{NP} \not\subseteq PP$ under an oracle, and in the last section of this paper we use his result as a basis for a lower bound for perceptrons with bounded weights. Using this lower bound, we get an oracle $A$ such that $\Delta_k^{p,A} \not\subseteq PP^{\Sigma_{k-2}^{p,A}}$. (We use $\Delta_k^p$ to denote the complexity class $P^{\Sigma_{k-1}^p}$.)
Article No. SS971552
Copyright © 1998 by Academic Press
All rights of reproduction in any form reserved.
Beigel, Hemachandra, and Wechsung [2] showed that $P^{NP[\log]} \subseteq PP$, and later Beigel, Reingold, and Spielman [3] proved the even stronger $P^{PP[\log]} = PP$. A relativization of these results shows that our result is almost tight.
2. A LOWER BOUND FOR PERCEPTRONS
We begin this section by defining the function $f_k^m$, first defined by Sipser [13], which can be computed by linear size circuits of depth $k$. Then we show the main theorem, which states that perceptrons of depth $k$ with bounded fan-in that compute this function must be large. As a corollary we get that perceptrons of depth $k-1$ computing $f_k^m$ must be large.
A perceptron is a circuit with a single threshold gate at the top. The threshold gate may have arbitrary weights and outputs 1 if the weighted sum of its inputs exceeds some threshold value. The inputs to the threshold gate are outputs of boolean circuits, the perceptron’s subcircuits. The subcircuits consist of alternating levels of AND and OR gates; the top gates of the subcircuits are either all AND gates or all OR gates. Input variables may be negated, but otherwise no NOT gates may occur in the circuits. A general perceptron can be transformed into this form by at most doubling its size. A depth $k$ perceptron has subcircuits of depth $k-1$.
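As an illustration of this definition, the following Python sketch evaluates a perceptron as a threshold gate over arbitrary subcircuits. The representation of subcircuits as Python callables and the names `perceptron` and `majority_perceptron` are assumptions made for this example only; they do not appear in the paper.

```python
from typing import Callable, Sequence

# Hypothetical representation: a subcircuit is any 0/1-valued function of the input bits.
Subcircuit = Callable[[Sequence[int]], int]


def perceptron(weights: Sequence[float], subcircuits: Sequence[Subcircuit],
               threshold: float, x: Sequence[int]) -> int:
    """Output 1 iff the weighted sum of the subcircuit outputs exceeds the threshold."""
    total = sum(w * c(x) for w, c in zip(weights, subcircuits))
    return 1 if total > threshold else 0


def majority_perceptron(n: int) -> Callable[[Sequence[int]], int]:
    """Majority of n bits: unit weights, one trivial subcircuit per input variable."""
    subs = [lambda x, i=i: x[i] for i in range(n)]
    return lambda x: perceptron([1] * n, subs, n / 2, x)
```

The majority example reflects the remark in the introduction that linear size perceptrons can compute majority.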
Definition 1. The function $f_k^m$ is a function of $m^{k-2} \cdot \frac{1}{2}(m\log m)^{1/4} \cdot \sqrt{\frac{1}{2}km\log m}$ variables. It is defined by a depth $k$ circuit that has the form of a tree. At the leaves of the tree are unnegated variables. The root is an OR gate with fan-in $\frac{1}{2}(m\log m)^{1/4}$. Below are alternating levels of AND and OR gates with fan-in $m$. The bottom-most level has fan-in $\sqrt{\frac{1}{2}km\log m}$.
The logarithms in this definition, as well as in the rest of
the paper, are base 2 logarithms.
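The tree in Definition 1 can be made concrete with a small Python sketch. It assumes that non-integer fan-ins are rounded down (the paper does not say how rounding is handled), and the function names `sipser_fanins` and `eval_tree` are hypothetical.

```python
import math
from typing import List, Sequence


def sipser_fanins(k: int, m: int) -> List[int]:
    """Fan-ins from the root down to the bottom level; rounding down is an assumption."""
    log_m = math.log2(m)
    top = max(1, math.floor(0.5 * (m * log_m) ** 0.25))
    bottom = max(1, math.floor(math.sqrt(0.5 * k * m * log_m)))
    return [top] + [m] * (k - 2) + [bottom]


def eval_tree(fanins: Sequence[int], x: Sequence[int]) -> int:
    """Evaluate an alternating OR/AND tree (root is an OR gate) on leaf values x."""
    n = math.prod(fanins)
    assert len(x) == n

    def rec(level: int, offset: int, width: int) -> int:
        if level == len(fanins):
            return x[offset]  # a leaf: an unnegated variable
        child_width = width // fanins[level]
        vals = (rec(level + 1, offset + j * child_width, child_width)
                for j in range(fanins[level]))
        # Even levels (counted from the root) are OR gates, odd levels are AND gates.
        return int(any(vals)) if level % 2 == 0 else int(all(vals))

    return rec(0, 0, n)
```

For example, with $k=3$ and $m=16$ the rounded fan-ins are 1, 16, and 9, giving 144 variables.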
The theorem is proved analogously to Håstad’s proof that circuits of depth $k$ are more powerful than circuits of depth $k-1$. A central part of the proofs is the Håstad switching lemma, which states that if we set a subset of the input variables to 0 or 1 randomly, we can switch the two lowest levels of AND and OR gates without increasing the circuit size very much. Then, two levels of the circuit can be collapsed into one.
We set some input variables of a circuit $C$ to 0 and 1 through the use of restrictions. A restriction $\rho$ is a mapping of the variables to the set $\{0, 1, *\}$, where $*$ means that the variable remains unset. Let $C\lceil_\rho$ denote the circuit obtained by replacing each input variable $x_i$ by $\rho(x_i)$. We use the same distributions of restrictions as Håstad used in [8, Chap. 6]; they are defined below.
Definition 2. Let $q$ be a real number and let $(B_i)_{i=1}^{r}$ be a partition of the variables (that is, the $B_i$ are disjoint and their union equals the set of all variables). Let $R^+_{q,B}$ be the probability space of restrictions which takes values as follows. For $\rho \in R^+_{q,B}$ and every $B_i$, $1 \le i \le r$, independently:
1. With probability $q$ let $s_i = *$; else $s_i = 0$.
2. For every $x_k \in B_i$ let $\rho(x_k) = s_i$ with probability $q$; else $\rho(x_k) = 1$.
Similarly, a probability space $R^-_{q,B}$ of restrictions is defined by interchanging the roles played by 0 and 1.
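The two-step sampling in Definition 2 can be sketched in Python as follows. The encoding of variables as integers, the string `"*"` for an unset variable, and the function name `sample_restriction` are assumptions of this sketch.

```python
import random
from typing import Dict, List, Tuple

STAR = "*"  # marks a variable (or block value s_i) that remains unset


def sample_restriction(blocks: List[List[int]], q: float, rng: random.Random,
                       positive: bool = True) -> Tuple[Dict[int, object], List[object]]:
    """Sample rho from R^+_{q,B} (positive=True) or R^-_{q,B} (positive=False).

    Returns the restriction rho (variable -> 0, 1 or STAR) and the block values s_i.
    """
    forced, default = (0, 1) if positive else (1, 0)
    rho: Dict[int, object] = {}
    s: List[object] = []
    for block in blocks:
        # Step 1: with probability q the block value is STAR, else it is forced.
        s_i = STAR if rng.random() < q else forced
        s.append(s_i)
        # Step 2: each variable gets s_i with probability q, else the default value.
        for var in block:
            rho[var] = s_i if rng.random() < q else default
    return rho, s
```

Note that within one block every variable is either set to the default value or shares the single block value $s_i$, so a block never contains both a 0 and a $*$ under $R^+_{q,B}$.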
Definition 3. For a restriction $\rho \in R^+_{q,B}$, let $g(\rho)$ be a restriction defined as follows: for all $B_i$ with $s_i = *$, $g(\rho)$ gives the value 1 to all variables given the value $*$ by $\rho$, except one to which it gives the value $*$. To make $g(\rho)$ deterministic, we assume that it gives the value $*$ to the variable with the highest index given the value $*$ by $\rho$. If $\rho \in R^-_{q,B}$, then $g(\rho)$ is defined similarly, but now takes the values 0 and $*$.
It follows from the definition of $g(\rho)$ that it never assigns values to the same variables as $\rho$, and for a circuit $C$, we denote $(C\lceil_\rho)\lceil_{g(\rho)}$ by $C\lceil_{\rho g(\rho)}$.
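A minimal Python sketch of Definition 3 and of the composed restriction $\rho g(\rho)$ follows. The data layout (a dictionary from integer variable indices to 0, 1 or `"*"`, plus the list of block values $s_i$) and the names `g_of_rho` and `compose` are assumptions made for illustration.

```python
from typing import Dict, List

STAR = "*"  # marks an unset variable


def g_of_rho(rho: Dict[int, object], blocks: List[List[int]], s: List[object],
             positive: bool = True) -> Dict[int, object]:
    """Compute g(rho): in each block with s_i = STAR, keep only the highest-index
    starred variable unset and fix the other starred variables to 1 (0 for R^-)."""
    fill = 1 if positive else 0
    g: Dict[int, object] = {}
    for block, s_i in zip(blocks, s):
        if s_i != STAR:
            continue
        starred = [v for v in block if rho[v] == STAR]
        if not starred:
            continue
        keep = max(starred)
        for v in starred:
            g[v] = STAR if v == keep else fill
    return g


def compose(rho: Dict[int, object], g: Dict[int, object]) -> Dict[int, object]:
    """The combined restriction rho g(rho): g only touches variables rho left unset."""
    return {v: (g.get(v, STAR) if val == STAR else val) for v, val in rho.items()}
```

After applying the combined restriction, each block contributes at most one unset variable.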
The gates at the bottom-most level in $f_k^m$ induce a natural partitioning of the variables. When the restrictions are used, the blocks $B_i$ in the previous two definitions correspond to this partitioning.
We use the Håstad switching lemma for this distribution of restrictions [8, Lemma 6.3], and in the proof of the main theorem we need an important property of the switching lemma: after switching, the inputs accepted by the different ANDs form disjoint sets. The fact that this property follows from the proof of the switching lemma was noted by Boppana and Håstad [8, Lemma 8.3]. Their note, however, concerned another distribution of random restrictions and was not explicitly proved; for this reason we include a sketch of the proof where we emphasize the “disjointness property.”
Lemma 4. Let $G$ be an AND of ORs, all of size at most $t$, and $\rho$ a random restriction from $R^+_{q,B}$ or $R^-_{q,B}$. Then the probability that $G\lceil_{\rho g(\rho)}$ cannot be written as an OR of ANDs all of size less than $s$, where the inputs accepted by the different ANDs form disjoint sets, is bounded by $\alpha^s$ where $\alpha = 6qt$. The same holds for converting an OR of ANDs to an AND of ORs.
Proof. Let $AND(G\lceil_{\rho g(\rho)}) \ge s$ denote the event that $G\lceil_{\rho g(\rho)}$ cannot be written as an OR of ANDs that accept disjoint sets, all ANDs having size less than $s$. This definition is slightly different from that in Håstad’s original proof, where the same notation is used to denote the event that $G\lceil_{\rho g(\rho)}$ cannot be written as an OR of ANDs of size less than $s$.
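We use induction to prove the lemma, and in fact, we prove that if $G = \bigwedge_{i=1}^{w} G_i$, where the $G_i$ are ORs of fan-in at most $t$, then
$$P[AND(G\lceil_{\rho g(\rho)}) \ge s \mid F\lceil_\rho \equiv 1] \le \alpha^s$$
for an arbitrary $F$.
The basis $w = 0$ is obvious. For the induction step first rewrite
$$P[AND(G\lceil_{\rho g(\rho)}) \ge s \mid F\lceil_\rho \equiv 1] \le \max\bigl(P[AND(G\lceil_{\rho g(\rho)}) \ge s \mid F\lceil_\rho \equiv 1 \wedge G_1\lceil_\rho \equiv 1],\; P[AND(G\lceil_{\rho g(\rho)}) \ge s \mid F\lceil_\rho \equiv 1 \wedge G_1\lceil_\rho \not\equiv 1]\bigr).$$
The first term is taken care of by the induction hypothesis, and the rest of the proof deals with the estimate of the second term.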
We denote the set of variables occurring in $G_1$ by $T$ and we have that $|T| \le t$. If $G\lceil_\rho$ is identically 0, $G\lceil_{\rho g(\rho)}$ does not require long ANDs, so we may assume that it is not 0 and, therefore, that some of the variables in $T$ must be given the value $*$ by $\rho$.
A block $B$ is exposed if there is a variable $x_i \in B \cap T$ and $\rho(x_i) = *$. If the variables in $T$ belong to the $r$ different blocks $B_1, \ldots, B_r$, we know that some of these blocks are exposed. Let $\exp(Y)$, $Y \subseteq \{1, \ldots, r\}$, denote that the blocks indexed by $Y$ are the ones that are exposed. We get
$$P[AND(G\lceil_{\rho g(\rho)}) \ge s \mid F\lceil_\rho \equiv 1 \wedge G_1\lceil_\rho \not\equiv 1] \le \sum_{\emptyset \ne Y \subseteq \{1, \ldots, r\}} P[\exp(Y) \mid F\lceil_\rho \equiv 1 \wedge G_1\lceil_\rho \not\equiv 1] \times P[AND(G\lceil_{\rho g(\rho)}) \ge s \mid F\lceil_\rho \equiv 1 \wedge G_1\lceil_\rho \not\equiv 1 \wedge \exp(Y)].$$
In this sum, the first factor is identical to the corresponding first factor in the original proof by Håstad, which uses the following lemma.
Lemma 5 (Lemma 6.6 in [8]). $P[\exp(Y) \mid F\lceil_\rho \equiv 1 \wedge G_1\lceil_\rho \not\equiv 1] \le (2q)^{|Y|}$.
It remains to estimate the second factor, which differs
from that in the original proof.
Let $Y^*$ denote the set of variables that remain in the exposed blocks after the restriction $\rho g(\rho)$ when the blocks indexed by $Y$ are exposed. We rewrite
$$G = \bigvee_{\sigma \in \{0,1\}^{|Y|}} G\lceil_{Y^* = \sigma} \wedge Q(Y^*, \sigma),$$
where $Q(Y^*, \sigma)$ denotes a predicate that is true when the variables in $Y^*$ take the values of $\sigma$; it can be written as an AND of size $|Y|$.
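Partition $\rho = \rho_1\rho_2$ into $\rho_1$ that is the restriction on the exposed blocks and $\rho_2$ that is the restriction on the blocks that are not exposed. We get
$$\begin{aligned}
&P_\rho[AND(G\lceil_{\rho g(\rho)}) \ge s \mid F\lceil_\rho \equiv 1 \wedge G_1\lceil_\rho \not\equiv 1 \wedge \exp(Y)] \\
&\quad = P_\rho[AND(G\lceil_{\rho g(\rho)}) \ge s \mid F'\lceil_{\rho_2} \equiv 1 \wedge G_1\lceil_{\rho_1} \not\equiv 1 \wedge \exp(Y)] \\
&\quad = P_\rho\Bigl[AND\Bigl(\bigvee_{\sigma \in \{0,1\}^{|Y|}} (G\lceil_{Y^*=\sigma})\lceil_{\rho g(\rho)} \wedge Q(Y^*, \sigma)\Bigr) \ge s \Bigm| F'\lceil_{\rho_2} \equiv 1 \wedge G_1\lceil_{\rho_1} \not\equiv 1 \wedge \exp(Y)\Bigr] \\
&\quad \le \sum_{\sigma \in \{0,1\}^{|Y|}} P_\rho[AND((G\lceil_{Y^*=\sigma})\lceil_{\rho g(\rho)}) \ge s - |Y| \mid F'\lceil_{\rho_2} \equiv 1 \wedge G_1\lceil_{\rho_1} \not\equiv 1 \wedge \exp(Y)] \\
&\quad \le \sum_{\sigma \in \{0,1\}^{|Y|}} \max_{\rho_1} P_{\rho_2}[AND((G\lceil_{Y^*=\sigma}\lceil_{\rho_1 g(\rho_1)})\lceil_{\rho_2 g(\rho_2)}) \ge s - |Y| \mid F'\lceil_{\rho_2} \equiv 1] \\
&\quad \le 2^{|Y|}\alpha^{s-|Y|}.
\end{aligned}$$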
For the first equality, note that the condition $G_1\lceil_\rho \not\equiv 1$ is equivalent to $G_1\lceil_{\rho_1} \not\equiv 1 \wedge G_1\lceil_{\rho_2} \not\equiv 1$. Since $\rho_2$ assigns only the values 0 and 1 to variables in $T$, the second of these requirements can be combined with $(F\lceil_{\rho_1})\lceil_{\rho_2} \equiv 1$, and we write this as $F'\lceil_{\rho_2} \equiv 1$.
In the second equality, we do not have to apply the restriction to $Q(Y^*, \sigma)$, since the variables in $Y^*$ are not affected by it anyway.
The first inequality is obtained by noting that if, for all $\sigma$, $(G\lceil_{Y^*=\sigma})\lceil_{\rho g(\rho)}$ can be written as an OR of ANDs, all of length less than $s - |Y|$ and all of which accept disjoint input sets, then $G\lceil_{\rho g(\rho)}$ can be written as an OR of ANDs, all of length at most $s$, in a way that they all accept disjoint input sets.
In the second inequality we use that for boolean predicates $R$ and $S$ we have $P_{\rho_1,\rho_2}[R(\rho_1, \rho_2) \mid S(\rho_1)] \le \max_{\rho_1 : S(\rho_1)} P_{\rho_2}[R(\rho_1, \rho_2)]$ to get rid of the two last conditions.
We are finally in a position to use the induction
hypothesis, which is done in the last inequality.
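We now evaluate the sum to get
$$\sum_{\emptyset \ne Y \subseteq \{1, \ldots, r\}} (2q)^{|Y|} 2^{|Y|} \alpha^{s-|Y|} = \alpha^s \sum_{i=1}^{r} \binom{r}{i} \left(\frac{4q}{\alpha}\right)^i = \alpha^s\left(\left(1 + \frac{4q}{\alpha}\right)^r - 1\right) \le \alpha^s\left(\left(1 + \frac{2}{3t}\right)^t - 1\right) \le \alpha^s\left(\left(e^{2/(3t)}\right)^t - 1\right) \le \alpha^s.$$
∎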
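The final chain of inequalities can be checked numerically. The Python sketch below, with the hypothetical helper name `switching_sum`, compares the sum over nonempty index sets (grouped by size) with the closed form, using $\alpha = 6qt$ so that $4q/\alpha = 2/(3t)$ and $r \le t$.

```python
import math


def switching_sum(q: float, t: int, r: int):
    """Return (direct sum, closed form) for sum_{i=1}^{r} C(r, i) (4q/alpha)^i."""
    alpha = 6 * q * t
    ratio = 4 * q / alpha  # equals 2 / (3t), independent of q
    direct = sum(math.comb(r, i) * ratio ** i for i in range(1, r + 1))
    closed = (1 + ratio) ** r - 1
    return direct, closed
```

Since $(1 + 2/(3t))^t < e^{2/3}$, the closed form stays below $e^{2/3} - 1 < 1$, which is what the last inequality uses.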
The next lemma shows that applying a restriction to the defining circuit for $f_k^m$ does not reduce it too much; with high probability the remaining circuit computes $f_{k-1}^m$. We fix the parameter $q$ that is used for the restrictions to be $((2k/m)\log m)^{1/2}$ for the rest of the paper.
Lemma 6 (Håstad [8, Lemma 6.8]). If $k$ is even, the circuit that defines $f_k^m\lceil_{\rho g(\rho)}$ for a random $\rho \in R^+_{q,B}$ will contain the circuit that defines $f_{k-1}^m$ with probability at least 2/3, for all $m$ such that $m/\log m \ge 100k$, $m > m_1$, where $m_1$ is some constant. The same is true for odd $k$ when $R^+_{q,B}$ is replaced by $R^-_{q,B}$.
Proof. Suppose $k$ is even so that the restriction comes from $R^+_{q,B}$ and the lowest level of the defining circuit for $f_k^m$ consists of AND gates; the other case is analogous.
The probability that one or more of the AND gates on the lowest level is reduced to the constant 1 by the restriction, or, equivalently, that one or more OR gates on the next lowest level is reduced to the constant 1 by the restriction, is bounded by 1/6. This fact follows since the AND gate corresponding to the block $B_i$ takes the value 1 with probability
$$(1-q)^{|B_i|} = \left(1 - \left(\frac{2k}{m}\log m\right)^{1/2}\right)^{\sqrt{(1/2)km\log m}} = \left(\left(1 - \left(\frac{2k}{m}\log m\right)^{1/2}\right)^{-\sqrt{m/(2k\log m)}}\right)^{-k\log m} < e^{-k\log m} < \frac{1}{6}m^{-k},$$
and since there are fewer than $m^k$ AND gates on the bottom-most level.
Now suppose we are in the (good) case that all AND gates take the values $s_i$. The probability that the number of remaining inputs to an OR gate (i.e., the number of AND gates not forced to 0) is less than $\sqrt{\frac{1}{2}km\log m}$ is at most $\frac{1}{6}m^{-k}$. This follows since the expected number of such gates is $qm = \sqrt{2km\log m}$, and using Chernoff’s inequality we get that the probability of obtaining less than half the expected number is at most $e^{-qm/8} < \frac{1}{6}m^{-k}$ for $m/\log m \ge 100k$. So, with probability 5/6 none of the OR gates will have fewer than $\sqrt{\frac{1}{2}km\log m}$ inputs.
To sum up, we get a circuit that contains $f_{k-1}^m$ with probability at least 2/3. ∎
We use induction over the depth of the perceptron to
prove the theorem. For the basis we need the following
lemma.
Lemma 7. The bottom fan-in of a depth two perceptron that computes $f_2^m$ is at least $\frac{1}{2}(m\log m)^{1/4}$.
Proof. The lemma follows from the “one-in-a-box” theorem by Minsky and Papert [12, Theorem 3.2] by substituting the variable $m$ in their theorem by $\frac{1}{2}(m\log m)^{1/4}$. ∎
We are now ready to prove the main theorem.
Theorem 8. There are no depth $k$ perceptrons computing $f_k^m$ with bottom fan-in $(1/\sqrt{3k})(m\log m)^{1/4}$ and fewer than $2^{(1/\sqrt{3k})(m\log m)^{1/4}}$ gates, not counting the gates on the lowest level, for $m \ge m_1$, where $m_1$ is some constant.
Proof. We may assume that $m\log m \ge k^2$, since otherwise we have $2^{(1/\sqrt{3k})(m\log m)^{1/4}} \le 2^{1/\sqrt{3}}$.
The theorem is proved by induction over $k$. The basis $k = 2$ follows from Lemma 7. We first give an overview of the proof of the induction step, which is proved by contradiction.
We assume that there is a small-sized depth-$k$ perceptron with bounded fan-in computing $f_k^m$. Then, we apply a random restriction to the inputs, and with high probability we get a perceptron that computes $f_{k-1}^m$. Also, because of the Håstad switching lemma, we can switch the two lower levels of ANDs and ORs in the perceptron without increasing the fan-in, and therefore collapse two levels to one and obtain a small-sized perceptron of depth $k-1$ and bounded fan-in computing $f_{k-1}^m$.
When $k = 3$ the procedure above does not work, since there are only two levels of AND and OR gates in the perceptron. However, we can deal with this as follows. Suppose that the lower level of the depth-3 perceptron consists of OR gates (the other case is analogous). When switching an AND of ORs to an OR of ANDs, Lemma 4 says that the input sets accepted by the ANDs are mutually disjoint. Therefore, the output from the OR gate is always equal to the sum of the outputs of the AND gates, and we can thus substitute all OR gates that resulted after switching by summation gates, and thereafter collapse the two top-most levels.
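The observation that an OR of ANDs accepting mutually disjoint input sets equals the sum of the AND outputs can be checked by brute force on small examples. In the Python sketch below, a term is encoded as a dictionary from variable index to required bit; this encoding and the function names are assumptions of the example.

```python
from itertools import product
from typing import Dict, List, Sequence


def term_value(term: Dict[int, int], x: Sequence[int]) -> int:
    """Value of an AND of literals: term maps a variable index to its required bit."""
    return int(all(x[v] == b for v, b in term.items()))


def or_equals_sum(terms: List[Dict[int, int]], n: int) -> bool:
    """Check on all inputs of length n that the OR of the terms equals their sum."""
    for x in product((0, 1), repeat=n):
        vals = [term_value(t, x) for t in terms]
        if int(any(vals)) != sum(vals):
            return False
    return True
```

For instance, $x_0 \vee x_1$ written as $x_0 \vee (\bar{x}_0 \wedge x_1)$ has disjoint terms, so the OR can be replaced by a summation gate, whereas the terms $x_0$ and $x_1$ overlap on the input $(1, 1)$.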
We make the intuition formal by assuming that there is a depth-$k$ perceptron $P$ with bottom fan-in $(1/\sqrt{3k})(m\log m)^{1/4}$ that has fewer than $2^{(1/\sqrt{3k})(m\log m)^{1/4}}$ gates, not counting the gates on the lowest level, computing $f_k^m$.
Apply a random restriction from $R^+_{q,B}$ or $R^-_{q,B}$ depending on whether $k$ is even or odd, respectively. From Lemma 6 we have that with probability at least 2/3, the perceptron $P$ computes a function at least as hard as $f_{k-1}^m$. Note that this lemma requires $m/\log m \ge 100k$, which follows from $m\log m \ge k^2$ for large enough $m$.
Now we want to use Lemma 4 to show that all the subcircuits on the two bottom levels can be switched. We know that the bottom fan-in is at most $t = (1/\sqrt{3k})(m\log m)^{1/4}$ and to use the induction hypothesis we have to obtain a circuit with fan-in at most $s = (1/\sqrt{3(k-1)})(m\log m)^{1/4}$. For each single subcircuit, the probability that this fails is at most $(6qt)^s$, which we multiply by the maximum number of
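such circuits to get a bound for the probability that this fails for at least one circuit,
$$\begin{aligned}
2^{(1/\sqrt{3k})(m\log m)^{1/4}}(6qt)^s
&= 2^{(1/\sqrt{3k})(m\log m)^{1/4}} \left(6\left(\frac{2k}{m}\log m\right)^{1/2} \times \frac{1}{\sqrt{3k}}(m\log m)^{1/4}\right)^{(1/\sqrt{3(k-1)})(m\log m)^{1/4}} \\
&\le \left(12\left(\frac{2k}{m}\log m\right)^{1/2} \frac{1}{\sqrt{3k}}(m\log m)^{1/4}\right)^{(1/\sqrt{3(k-1)})(m\log m)^{1/4}} \le \frac{1}{2},
\end{aligned}$$
where the last inequality holds since $m\log m \ge k^2$.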
Note that the number of gates that are not on the bottom-
most level does not increase when switching, which we need
to use the induction hypothesis.
Thus, with probability at least 2/3, $P\lceil_{\rho g(\rho)}$ computes a function at least as hard as $f_{k-1}^m$, and with probability at least 1/2, $P\lceil_{\rho g(\rho)}$ is a function that does not compute $f_{k-1}^m$ by the induction hypothesis. Therefore, both of these events must happen simultaneously with positive probability, a contradiction. ∎
Corollary 9. There are no depth-$(k-1)$ perceptrons that compute $f_k^m$ with fewer than $2^{(1/\sqrt{3k})(m\log m)^{1/4}}$ gates for $m \ge m_1$ for some constant $m_1$.
Proof. Assume there is a perceptron that contradicts the corollary. Then, adding a gate with fan-in one to all the inputs yields a depth $k$ perceptron that does not exist according to Theorem 8. ∎
Corollary 10. There are functions computable by linear size circuits of depth $k$ that require superpolynomial size perceptrons of depth $k-1$ if $3 \le k < \log n/(6\log\log n)$, where $n$ is the number of input variables.
Proof. From Corollary 9 we have that if $(1/\sqrt{3k})(m\log m)^{1/4} \in \omega(\log n)$ then depth-$(k-1)$ perceptrons computing $f_k^m$ require superpolynomial size. We have $n > m^{k-2}$ and, thus, $k - 2 < \log n/\log m$ and $\log m/\log n < 1$. For $m\log m \ge k^2$ we have $n < m^k$ and, thus, $k > \log n/\log m$. We get
$$\sqrt{3k} < \sqrt{3\frac{\log n}{\log m} + 6} = \sqrt{\frac{\log n}{\log m}\left(3 + 6\frac{\log m}{\log n}\right)} < 3\sqrt{\frac{\log n}{\log m}}$$
so that
$$\frac{1}{\sqrt{3k}}(m\log m)^{1/4} > \frac{1}{3}\left(\frac{m\log^3 m}{\log^2 n}\right)^{1/4},$$
which is $\omega(\log n)$ if $m\log^3 m \in \omega(\log^6 n)$. This holds if $k < \log n/(6\log\log n)$, since we then have $(\log^6 n)^k < n < m^k$ so that $m > \log^6 n$. ∎
3. ORACLE SEPARATION
In this section we show how the lower bound in Theorem 8 implies the existence of an oracle that separates the different levels in the $PP^{PH}$ hierarchy. An oracle $A$ is a fixed set of strings called an oracle set. Let $y^A$ denote the characteristic function for the oracle $A$ so that $y_z^A$ is 1 if the string $z$ is in $A$.
An alternating Turing machine is a nondeterministic Turing machine whose states are marked by $\wedge$, $\vee$, 0, or 1. States marked $\wedge$ and $\vee$ have at most two next configurations, and states marked 0 and 1 are the halting states, in which the machine rejects or accepts, respectively. Let a $\Sigma_d^{p,A}$ machine denote an alternating Turing machine with access to the oracle $A$. Such a machine has an additional oracle query tape; when the query tape contains the string $z$, the machine can enter a special oracle query state to compute $y_z^A$ in one time step.
Given an input value $x$ to a $\Sigma_d^{p,A}$ machine, the result of the computation is determined by evaluating the computation tree in the natural way. The computation tree is defined as follows. Every possible machine configuration is represented by a node, which is labeled by $\wedge$, $\vee$, 0, or 1, depending on the marking of the corresponding state. A node is the parent of those nodes that represent the possible next configurations; thus the nodes representing halting configurations are the leaves of the tree. The maximum number of blocks of consecutive states labeled by $\wedge$ or $\vee$ on a path from the root to a leaf is $d$, and the block of states closest to the root is an $\vee$ block.
A $PP^{\Sigma_{d-1}^{p,A}}$ machine is defined like a $\Sigma_d^{p,A}$ machine, except that the machine starts its computation in a state marked by $+$ (a counting state). In the computation tree, the block closest to the root is labeled by $+$, and the machine accepts if the evaluation of the tree exceeds a threshold value.
The complexity classes $\Sigma_d^{p,A}$ and $PP^{\Sigma_{d-1}^{p,A}}$ contain exactly those languages that can be decided by $\Sigma_d^{p,A}$ and $PP^{\Sigma_{d-1}^{p,A}}$ machines in polynomial time, respectively.
An alternating Turing machine can be converted to a machine that makes all oracle queries at the end of the computation with only a polynomial increase in execution time and no extra alternations. We will assume that all machines have this form for the rest of the paper. The conversion is done as follows. Oracle queries are replaced by nondeterministic guesses for the oracle answer. Along one computation branch the machine assumes the answer 0, and along the other it assumes the answer 1; it remembers the question, the guess for the answer, and the marking of the original query state for the rest of the computation. A halting state is replaced by a number of states that verify the guesses made in the computation. If all the guesses were correct, the computation branch accepts or rejects according to the marking of the original halting state. Otherwise, the machine makes sure that the computation branch does not affect the result of the computation by accepting if the first incorrect guess was made in an $\wedge$ state, and by rejecting otherwise.
The following relation between alternating oracle Turing machines and perceptrons is similar to, for example, Lemma 2.1 in [11] and Corollary 2.2 in [5].
Lemma 11. Let $M^A$ be a $PP^{\Sigma_{d-1}^{p,A}}$ oracle machine that runs in time $t$ on input $x$. Then there is a depth-$(d+1)$ perceptron $P$ with unit weights and bottom fan-in $t$ which has a subset of the $y_z^A$ as inputs such that for every oracle $A$, $M^A$ accepts $x$ precisely when $P$ outputs 1 on inputs $y_z^A$. This perceptron has at most $2^t$ gates, not counting gates on the bottom-most level.
Proof. For a fixed $x$, write down the computation tree for $M^A(x)$ for some oracle $A$. On a path from the root of the tree to a leaf, let the first node where an oracle query occurs (if one exists) be called a boundary node. Boundary nodes mark where the machine starts verifying its guesses in the computation, and since this is the last part of a computation branch and since it is deterministic, there is exactly one leaf under each boundary node.
When varying the oracle A we get different computation
trees. However, the part of the computation trees that are
above the boundary nodes remain the same regardless of the
oracle. The part of the computation trees below a boundary
node accepts or rejects, depending on the oracle. However,
note that the only oracle queries that may occur at and
below the boundary node are the ones for which guesses
have been made above the boundary node; their number is
bounded by t.
We now construct a perceptron from the computation tree of $M^A(x)$. Every boundary node in the computation tree is replaced by a DNF or CNF formula, depending on whether the boundary node is labeled by $\vee$ or $\wedge$, respectively. In the case of boundary nodes labeled by $\vee$, we make a conjunction for each possible set of oracle answers that makes the computation branch accept, and we combine them into a DNF formula. In the case of boundary nodes labeled $\wedge$, we make a conjunction for each possible set of oracle answers that makes the computation branch reject, and we combine them into a DNF formula. Taking the negation of this formula and applying De Morgan’s law yields a CNF formula. Finally, the resulting tree is collapsed to yield a perceptron of depth $d+1$ and bottom fan-in $t$.
The depth of the original computation tree is at most $t$ and, hence, its size is at most $2^t$. Since new gates from the DNF and CNF formulas only appear on the bottom-most level, the maximum number of gates on higher levels in the resulting perceptron is at most $2^t$. ∎
Theorem 12. There exists an oracle $A$ such that for all $d$ there is a language $L_d(A)$ in $\Sigma_d^{p,A}$ which is not recognizable by any $PP^{\Sigma_{d-2}^{p,A}}$ machine, i.e., $\Sigma_d^{p,A} \not\subseteq PP^{\Sigma_{d-2}^{p,A}}$.
Proof. The intuition for the proof is that a $PP^{\Sigma_{d-2}^{p,A}}$ machine corresponds to a depth $d$ perceptron with inputs $y_z^A$ for each input $x$. Theorem 8 from the last section is used to show that such a perceptron is too small for computing the function $f_d^m$ of the oracle bits for some $m$. We define a language $L_d(A)$ that depends on $A$ in such a way that for a machine to decide the language properly for all oracles would require the corresponding perceptron to compute $f_d^m$; thus, we can choose an oracle such that the machine makes an error.
Let $x = x_1x_2 \cdots x_n$ and
$$L_d(B) = \{1^n \mid \exists x_1, x_2, \ldots, x_{n/d}\; \forall x_{n/d+1}, \ldots, x_{2n/d} \cdots Q x_{n(d-1)/d+1}, \ldots, x_n : x \in B\}.$$
Now, a $PP^{\Sigma_{d-2}^{p,B}}$ machine that decides if $1^n \in L_d(B)$ has a corresponding perceptron of depth $d$ that evaluates a function $h_d^n$ in the variables $y_z^B$ for $|z| = n$, as exemplified in Fig. 1. The function $h_d^n$ resembles $f_d^{2^{n/d}}$, the only difference being the fan-in on the top and bottom levels, which for $h_d^n$ is $2^{n/d}$ while it is lower for $f_d^{2^{n/d}}$. This means that $h_d^n$ contains $f_d^{2^{n/d}}$ and is, therefore, at least as hard to compute.
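To make the quantifier structure of $L_d(B)$ concrete, the following Python sketch decides whether $1^n \in L_d(B)$ by brute force for a finite oracle set $B$ given as a set of bit strings. It assumes that $d$ divides $n$, and the function name `in_L_d` is hypothetical. The recursion mirrors the circuit for $h_d^n$: existential blocks act as OR gates of fan-in $2^{n/d}$ and universal blocks as AND gates.

```python
from itertools import product
from typing import Set


def in_L_d(n: int, d: int, B: Set[str]) -> bool:
    """Decide whether 1^n is in L_d(B): d alternating quantifier blocks of n/d bits
    each, starting with an existential block, followed by the test x in B."""
    assert n % d == 0, "this sketch assumes that d divides n"
    block = n // d

    def rec(level: int, prefix: str) -> bool:
        if level == d:
            return prefix in B  # leaf: the oracle bit y_x^B
        branches = (rec(level + 1, prefix + "".join(bits))
                    for bits in product("01", repeat=block))
        # Even levels are existential (OR), odd levels are universal (AND).
        return any(branches) if level % 2 == 0 else all(branches)

    return rec(0, "")
```

For example, with $n = d = 2$ the set $B = \{00, 01\}$ puts $1^2$ in $L_2(B)$, since the choice $x_1 = 0$ works for both values of $x_2$.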
By the construction of $L_d(B)$ it is clear that $L_d(B) \in \Sigma_d^{p,B}$, and we now construct an oracle $B$ such that $L_d(B)$ cannot be decided by a $PP^{\Sigma_{d-2}^{p,B}}$ machine for any $d$.
Let $M_i^B$ for $i = 1, 2, \ldots$ be an enumeration of $PP^{\Sigma_{d-2}^{p,B}}$ machines for all constants $d$. The oracle will be built in rounds, and in round $i$ we will make sure that the machine $M_i$ makes an error for some input.
FIG. 1. Checking whether $1^3 \in L_3(B)$ corresponds to evaluating the function $h_3^3$ given by this circuit.
In round $i$ we do the following. Suppose $M_i^B$ is a $PP^{\Sigma_{d-2}^{p,B}}$ machine which runs in time $cn^c$ (observe that $c$ and $d$ depend on $i$). We want to fix some of the $y_z^B$ such that $M_i^B$ makes an error for at least one of the strings in $L_d(B)$. More precisely, we make $M_i^B$ fail on the string $1^{n_i}$, where $n_i$ is chosen large enough that both of the following statements are satisfied:
1. None of the $y_z^B$ with $|z| = n_i$ have previously been set.
2. $(1/\sqrt{3d})(2^{n_i/d} n_i/d)^{1/4} > cn_i^c$.
The first requirement makes sure that none of the strings previously set in the oracle interfere with the current machine on input $1^{n_i}$. For any oracle query $y_z^B$ with $|z| \ne n_i$ that $M_i^B$ may make, we substitute the correct value for those that are already determined and fix previously undetermined variables to some arbitrary value.
From Lemma 11 there is a perceptron of depth $d$, bottom fan-in $cn_i^c$, and with at most $2^{cn_i^c}$ gates (excluding those on the bottom-most level) with a subset of the $y_z^B$ as inputs that outputs 1 for the oracles that make $M_i^B$ accept the string $1^{n_i}$. This perceptron is, due to Theorem 8, not powerful enough to compute $h_d^{n_i}$ and is thus unable to determine if the string $1^{n_i}$ is in $L_d(B)$ for all $B$. It is therefore possible to set the $y_z^B$ of length $n_i$ such that $M_i^B$ makes an error on $1^{n_i}$. ∎
4. IMPROVING THE ORACLE SEPARATION
Beigel [1] obtained the oracle separation $P^{NP} \not\subseteq PP$. To do this he introduced the language ODD-MAX-BIT and proved a relation between the bottom fan-in, the maximum weight, and the size for perceptrons deciding it.
Definition 13 (Beigel [1, Definition 2]). ODD-MAX-BIT is the set of all strings in $\{0, 1\}^*$ whose rightmost 1 is in an odd-numbered position, i.e., the set of strings of the form $x10^l$ where the length of $x$ is even.
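Definition 13 translates directly into a short membership test. In the Python sketch below, positions are numbered from 1 starting at the left, matching the form $x10^l$ with $|x|$ even; the function name `odd_max_bit` is an assumption of the example.

```python
def odd_max_bit(s: str) -> bool:
    """Return True iff the rightmost 1 of the bit string s is in an odd-numbered
    position, counting positions from 1 at the left end of the string."""
    pos = s.rfind("1")  # 0-based index of the rightmost 1, or -1 if there is none
    return pos != -1 and (pos + 1) % 2 == 1
```

For example, `odd_max_bit("100")` holds (here $x$ is empty and $l = 2$), whereas `odd_max_bit("01")` does not, since the rightmost 1 is in position 2.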
We use this function to define a more complex one that is suited for obtaining a lower bound for perceptrons of depth $k$. We first slightly change the definition of the function $f_k^m$ from the previous section.
Definition 14. The function $f_k^m$ is a function of $m^{k-1}\sqrt{\frac{1}{2}km\log m}$ variables. It is defined by a depth $k$ circuit that has the form of a tree. At the leaves of the tree are unnegated variables. The bottom-most level has fan-in $\sqrt{\frac{1}{2}km\log m}$. The rest of the levels, including the root for $k > 1$, all have fan-in $m$. The root is an AND gate, and below are alternating levels of OR and AND gates.
The function for which we obtain a lower bound is $g_k^m$, defined below.
Definition 15. For $k = 2$, $g_2^m$ computes the ODD-MAX-BIT function of $\frac{1}{6}\sqrt{m\log m}$ variables. For $k > 2$, $g_k^m$ is a function of $m^{k-2}\sqrt{\frac{1}{2}(k-2)m\log m}$ variables. It computes the ODD-MAX-BIT function of the outputs of $m$ disjoint $f_{k-2}^m$ functions.
When applying a random restriction to a circuit computing $g_k^m$, as is the case with $f_k^m$, with high probability we get a circuit that computes $g_{k-1}^m$. As before, fix the parameter $q$ that is used for the restrictions to be $((2k/m)\log m)^{1/2}$.
Lemma 16. If $k$ is odd, the circuit that defines $g_k^m\lceil_{\rho g(\rho)}$ for a random $\rho \in R^+_{q,B}$ will contain the circuit that defines $g_{k-1}^m$ with probability at least 2/3 for all $m$ such that $m/\log m \ge 100k$, $m > m_1$, where $m_1$ is some constant. The same is true for even $k$ when $R^+_{q,B}$ is replaced by $R^-_{q,B}$.
Proof. When $k>3$, the proof works as for Lemma 6.
The case $k=3$ is now special, so let us consider it in detail;
the circuit computes the ODD-MAX-BIT function of $m$
AND gates, each with fan-in $\sqrt{\frac{1}{2}m\log m}$. As in the proof of
Lemma 6, with probability at least $5/6$, none of the AND
gates is reduced to the constant 1, and they thus take the
values $s_i$.
When all $m$ AND gates take the values $s_i$ (each of which is 0
or $*$) under the restriction, we show that with high probability
the remaining circuit contains a circuit that computes the
ODD-MAX-BIT function of $\frac{1}{6}\sqrt{m\log m}$ variables.
Divide the $m$ inputs to the ODD-MAX-BIT function
(which take the values of $s_1, s_2, \ldots, s_m$) into $\sqrt{m\log m}$
blocks $D_i$ of size $|D_i| = \sqrt{m/\log m}$. The probability that
such a block $D_i$ has a $*$ in at least one odd numbered input
and also in at least one even numbered input is at least
\[
1 - 2(1-q)^{|D_i|/2}
= 1 - 2\left(\left(1 - \sqrt{\frac{6\log m}{m}}\right)^{-\sqrt{m/(6\log m)}}\right)^{-\sqrt{6}/2}
> 1 - 2e^{-\sqrt{6}/2} > \frac{1}{3},
\]
since the probability that a block does not get any $*$ in an
even (odd) numbered input is $(1-q)^{|D_i|/2}$.
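The per-block estimate can be checked numerically. The sketch below assumes $k=3$, so $q=\sqrt{6\log m/m}$, a block size of $\sqrt{m/\log m}$, and logarithms to base 2 (the product $q\,|D_i|/2=\sqrt{6}/2$ does not depend on the base). The function name is hypothetical.

```python
import math


def block_success_lower_bound(m: float) -> float:
    """Lower bound 1 - 2(1-q)^(|D_i|/2) for a block to hold both kinds of star."""
    log_m = math.log2(m)                      # base 2 is an assumption
    q = math.sqrt(6 * log_m / m)              # restriction parameter for k = 3
    block_size = math.sqrt(m / log_m)         # |D_i|
    return 1 - 2 * (1 - q) ** (block_size / 2)


# Limiting value used in the text: 1 - 2 e^{-sqrt(6)/2}.
LIMIT = 1 - 2 * math.exp(-math.sqrt(6) / 2)
```

Since $(1-q)^{1/q} < e^{-1}$ for $0<q<1$, the bound stays above `LIMIT`, which in turn exceeds $1/3$.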
We construct a new ODD-MAX-BIT circuit by using one
input from every block that has a $*$ at both an odd numbered
and an even numbered input. We use an odd numbered
input from the first such block, an even numbered
input from the next one, and so on. Thus, we obtain a circuit
computing ODD-MAX-BIT of as many inputs as there are
blocks having both an odd numbered and an even numbered $*$.
There are $\sqrt{m\log m}$ blocks, so the expected number of
such blocks is at least $\frac{1}{3}\sqrt{m\log m}$, and using Chernoff's
inequality we obtain that the probability of getting fewer than
$\frac{1}{6}\sqrt{m\log m}$ such blocks is at most $e^{-\sqrt{m\log m}/24} < 1/6$. ∎
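The exponent $\sqrt{m\log m}/24$ is what the standard multiplicative Chernoff lower-tail bound $\Pr[X \le (1-\delta)\mu] \le e^{-\delta^2\mu/2}$ gives with $\mu=\frac{1}{3}\sqrt{m\log m}$ and $\delta=\frac{1}{2}$. The following sketch evaluates it; the choice of this particular form of the inequality and of base-2 logarithms are assumptions, and the function names are hypothetical.

```python
import math


def chernoff_lower_tail(mu: float, delta: float) -> float:
    """Multiplicative Chernoff bound on Pr[X <= (1 - delta) * mu]."""
    return math.exp(-delta ** 2 * mu / 2)


def too_few_good_blocks_bound(m: float) -> float:
    """Bound on the probability of fewer than (1/6) sqrt(m log m) good blocks."""
    blocks = math.sqrt(m * math.log2(m))      # number of blocks D_i
    mu = blocks / 3                           # lower bound on the expectation
    return chernoff_lower_tail(mu, 0.5)       # equals exp(-blocks / 24)
```

Under these assumptions the bound drops below $1/6$ once $\sqrt{m\log m} > 24\ln 6$, which holds for instance at $m=256$ but not at $m=128$.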
A depth two perceptron that contains no AND gates with
negated literals and no identical AND gates is said to be in
positive normal form by Minsky and Papert [12]. Beigel
called this clean form, and he formulated the following
lemma, which uses the construction from [12, p. 33].
Lemma 17 (Beigel [1, Lemma 1]). If $f$ is computed by a
perceptron with size $s$, maximum weight $w$, and order $d$, then
$f$ is computed by a perceptron in clean form with size $2^d s$,
weight $sw$, and order $d$.
Lemma 18. There are no depth two perceptrons with unit
weights that compute $g_2^m$ with bottom fan-in $m^{1/6}$ and fewer than
$2^{5m^{1/6}}$ gates, for $m$ larger than some constant.
Proof. A lemma due to Beigel [1, Lemma 5] gives the
following relation between the size $s$, maximum weight $w$,
and order $d$ for depth two perceptrons in clean form
computing ODD-MAX-BIT of $N$ input bits:
\[
w \ge \frac{1}{s}\, 2^{\left\lfloor (N-1) / \left(2\left\lceil d^2/(\sqrt{87}-9) \right\rceil\right) \right\rfloor}.
\]
Suppose there is a depth-2 perceptron with unit weights
that computes $g_2^m$ and has bottom fan-in (order) $m^{1/6}$ and
fewer than $2^{5m^{1/6}}$ gates. Then, due to Lemma 17, there is a
perceptron in clean form with fan-in $m^{1/6}$, size $2^{6m^{1/6}}$, and
weights bounded by $2^{5m^{1/6}}$ computing ODD-MAX-BIT of
$\frac{1}{6}\sqrt{m\log m}$ variables.
If we put these values into Beigel's formula above, we get
a contradiction for large enough $m$. ∎
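The contradiction can be seen from the growth rates alone. The following sketch uses only the fact that Beigel's bound has the shape $sw \ge 2^{\lfloor (N-1)/(2\lceil C d^2\rceil)\rfloor}$; the symbol $C$ for the absolute constant inside the ceiling is introduced here for illustration and its exact value does not matter.

```latex
\begin{align*}
  sw &\le 2^{6m^{1/6}} \cdot 2^{5m^{1/6}} = 2^{11\,m^{1/6}}, \\
  \frac{N-1}{d^{2}}
     &= \frac{\tfrac{1}{6}\sqrt{m\log m} - 1}{m^{1/3}}
      = \Theta\!\left(m^{1/6}\sqrt{\log m}\right), \\
  \left\lfloor \frac{N-1}{2\lceil C d^{2}\rceil} \right\rfloor
     &= \Theta\!\left(m^{1/6}\sqrt{\log m}\right)
      > 11\,m^{1/6} \quad \text{for all sufficiently large } m .
\end{align*}
```

The extra factor $\sqrt{\log m}$ in the exponent is what eventually outgrows the constant 11, so the assumed perceptron cannot exist once $m$ is large enough.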
Theorem 19. For k3, there are no depth k perceptrons
computing gmk with unit weights, bottom fan-in (1- k) m16,
and less than 2(1- k) m16 gates, not counting the gates on the
lowest level, for m larger than some constant. When k=2 we
have a somewhat stronger result. The bound for the number of
gates holds when counting all gates in the circuit.
Proof. We may assume that $m \ge k^3$, since otherwise we
have $2^{(6/\sqrt{k})\,m^{1/6}} \le 2^6$. The basis for the induction, $k=2$,
follows from Lemma 18.
Assume there is a depth-$k$ perceptron $P$ with bottom
fan-in $(1/\sqrt{k})\,m^{1/6}$ that has fewer than $2^{(6/\sqrt{k})\,m^{1/6}}$ gates
computing $g_k^m$.
To be able to apply Lemma 16, choose one of $R^+_{q,B}$ or
$R^-_{q,B}$ (according to the parity of $k$) and apply a random
restriction to the perceptron. We have that with probability at
least $2/3$, the restricted perceptron $P$
computes a function at least as hard as $g^m_{k-1}$.
The bottom fan-in is at most $t=(1/\sqrt{k})\,m^{1/6}$, and to use
the induction hypothesis we have to obtain a circuit with
fan-in at most $s=(1/\sqrt{k-1})\,m^{1/6}$. For each single gate on
the next to the lowest level, the probability that this fails is
at most $(6qt)^s$ due to Lemma 4, and we multiply this by the
maximum number of switchings to get a bound for the
probability that the conversion fails for at least one circuit:
\begin{align*}
2^{(6/\sqrt{k})\,m^{1/6}} (6qt)^s
&= 64^{(1/\sqrt{k})\,m^{1/6}} \left(6 \left(\frac{2k}{m}\log m\right)^{1/2} \frac{1}{\sqrt{k}}\, m^{1/6}\right)^{(1/\sqrt{k-1})\,m^{1/6}} \\
&\le \left(6 \cdot 64 \left(\frac{2k}{m}\log m\right)^{1/2} \frac{1}{\sqrt{k}}\, m^{1/6}\right)^{(1/\sqrt{k-1})\,m^{1/6}}
\le \frac{1}{2},
\end{align*}
where the last inequality holds since $m \ge k^3$.
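To get a feeling for how large $m$ must be for this bound to drop below $1/2$, the quantity $2^{(6/\sqrt{k})m^{1/6}}(6qt)^s$ can be evaluated in the logarithmic domain. This is a rough numerical sketch: the function name is hypothetical and base-2 logarithms are assumed for $\log m$.

```python
import math


def log2_switching_failure_bound(m: float, k: int) -> float:
    """Return log2 of 2^{(6/sqrt k) m^{1/6}} * (6 q t)^s."""
    root = m ** (1 / 6)
    q = math.sqrt(2 * k * math.log2(m) / m)   # restriction parameter
    t = root / math.sqrt(k)                   # current bottom fan-in
    s = root / math.sqrt(k - 1)               # target bottom fan-in
    log2_gates = 6 * root / math.sqrt(k)      # log2 of the number of switchings
    # Work with logarithms: the raw product under- or overflows quickly.
    return log2_gates + s * math.log2(6 * q * t)
```

Note that $6qt = 6\sqrt{2\log m}\,m^{-1/3}$ does not depend on $k$. Under these assumptions the bound is far below $1/2$ at $m=2^{40}$, while at $m=2^{12}$ it is still vacuous, which matches the proviso that $m$ be larger than some constant.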
Going from depth 3 to depth 2 is a special case, since we
need to bound the number of gates in the entire depth-2 circuit.
Suppose without loss of generality that the lowest level
of the depth-3 perceptron consists of OR gates (if it consists
of AND gates we can negate the perceptron's weights and
use De Morgan's law). Then, the depth-2 perceptron that
results after switching has AND gates on the lowest level,
and the fan-in of these gates is bounded by $s$ so that each of
them accepts at least a fraction $2^{-s}$ of the inputs. We know
from Lemma 4 that all the AND gates accept disjoint inputs
so we get that the maximum number of AND gates that
results from each switching is bounded by $2^s$. Thus, the total
number of gates in the resulting depth-2 circuit is at most
\[
2^s\, 2^{(6/\sqrt{k})\,m^{1/6}} = 2^{(1/\sqrt{2})\,m^{1/6}}\, 2^{(6/\sqrt{3})\,m^{1/6}} < 2^{(6/\sqrt{2})\,m^{1/6}}. \qquad ∎
\]
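The final inequality is a statement about the coefficients of $m^{1/6}$ in the exponents, and the resulting bound must also fit the hypotheses of Lemma 18. A minimal arithmetic check, with a hypothetical function name:

```python
import math
from typing import Tuple


def depth3_to_depth2_exponents() -> Tuple[float, float, float]:
    """Coefficients of m^{1/6} in the exponents for k = 3.

    Returns (obtained, claimed, lemma18), where
    obtained = 1/sqrt(2) + 6/sqrt(3)  (gates per switching times switchings),
    claimed  = 6/sqrt(2)              (the bound stated for the depth-2 circuit),
    lemma18  = 5                      (gate bound ruled out by Lemma 18).
    """
    obtained = 1 / math.sqrt(2) + 6 / math.sqrt(3)
    claimed = 6 / math.sqrt(2)
    lemma18 = 5.0
    return obtained, claimed, lemma18
```

The three values are increasing, so a depth-2 circuit with fewer than $2^{(6/\sqrt{2})m^{1/6}}$ gates and bottom fan-in $(1/\sqrt{2})m^{1/6} \le m^{1/6}$ is covered by Lemma 18.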
Theorem 20. There exists an oracle $A$ such that, for all $d$,
there is a language $L_d(A)$ in $\Delta^{p,A}_d$ which is not recognizable
by any $PP^{\Sigma^{p,A}_{d-2}}$ machine, i.e., $\Delta^{p,A}_d \not\subseteq PP^{\Sigma^{p,A}_{d-2}}$.
Proof. The proof is analogous to the proof of
Theorem 12, but we change the language to one that can be
decided by a $\Delta^{p,A}_d$ machine. First, let
\[
L'_d(B) = \{\, y \mid \forall x_1, x_2, \ldots, x_{|y|}\ \exists x_{|y|+1}, \ldots, x_{2|y|} \cdots Q\, x_{(d-1)|y|+1}, \ldots, x_{d|y|} : yx \in B \,\},
\]
where $x_1, x_2, \ldots$ denote the individual bits of $x$. This
language is reminiscent of the language we used in the proof
of Theorem 12, but it may contain many strings of the same
length.
We then define
\[
L_d(B) = \{\, 1^n \mid \max(L'_{d-2}(B) \cap \{0,1\}^{n(d-1)}) \text{ ends in a } 1 \,\}.
\]
The idea is that deciding if the string $1^n$ is in $L_d(B)$ is the
same as computing the ODD-MAX-BIT function of $2^{n(d-1)}$
inputs, where the inputs are the characteristic function of
the language $L'_{d-2}(B)$ for inputs of length $n(d-1)$.
Notice that a deterministic Turing machine using an NP
machine as an oracle can compute the ODD-MAX-BIT
function of $2^n$ input variables in polynomial time by doing
binary search for the highest index of a variable that is set
to 1. Since a $\Sigma^{p,B}_{d-2}$ oracle can decide the language
$L'_{d-2}(B)$, we thus have that a $P^{NP}$ machine using a $\Sigma^{p,B}_{d-2}$
machine as an oracle can decide the language $L_d(B)$, and
therefore, the language $L_d(B)$ is in $\Delta^{p,B}_d$.
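The binary search mentioned above can be sketched as follows. The oracle is modelled as a callable answering the NP-type question "is some input with index at least $i$ equal to 1?"; this interface and the names `odd_max_bit_via_oracle` and `make_oracle` are assumptions made for illustration. With $2^n$ inputs the loop makes $O(n)$ oracle queries.

```python
from typing import Callable, Sequence


def odd_max_bit_via_oracle(num_inputs: int,
                           exists_one_at_or_after: Callable[[int], bool]) -> bool:
    """Decide ODD-MAX-BIT on positions 1..num_inputs using existence queries."""
    if num_inputs < 1 or not exists_one_at_or_after(1):
        return False                          # no input is 1
    lo, hi = 1, num_inputs                    # invariant: highest 1 lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2              # round up so the interval shrinks
        if exists_one_at_or_after(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo % 2 == 1                        # parity of the highest index holding 1


def make_oracle(bits: Sequence[int]) -> Callable[[int], bool]:
    """Toy stand-in for the NP oracle: scans an explicit bit list."""
    return lambda i: any(bits[i - 1:])
```

In the setting of the proof the inputs are not given explicitly; each existence query would itself guess an index and consult the lower-level oracle for $L'_{d-2}(B)$, which the toy `make_oracle` replaces by a direct scan.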
The existence of a $PP^{\Sigma^{p,B}_{d-2}}$ machine that decides if
$1^n \in L_d(B)$ implies a perceptron of depth $d$ that evaluates a
function $h^n_d$ in the variables $y^B_z$ for $|z|=n$. The language
construction ensures that the function $h^n_d$ is at least as hard to
compute as $g_d^{2^{n(d-1)}}$ (the defining circuit for $g_d^{2^{n(d-1)}}$ has
smaller fan-in at some levels).
Due to Theorem 19, however, the perceptron corresponding
to a $PP^{\Sigma^{p,A}_{d-2}}$ machine deciding the language is too small
for computing the function $g^m_d$ of the oracle bits for some $m$.
The construction of an oracle $B$ such that $L_d(B)$ cannot
be decided by a $PP^{\Sigma^{p,B}_{d-2}}$ machine for any $d$ is now as in the
proof of Theorem 12, the only difference being that $n_i$ is
chosen such that
\[
\frac{6}{\sqrt{d}}\left(2^{n_i(d-1)}\right)^{1/6} > c\,n_i^{c}
\]
in each round. ∎
ACKNOWLEDGMENTS
We thank Mikael Goldmann and Johan Håstad for helpful discussions.
REFERENCES
1. R. Beigel, Perceptrons, PP, and the polynomial hierarchy, Comput. Complexity 4, No. 4 (1994), 339-349.
2. R. Beigel, L. A. Hemachandra, and G. Wechsung, Probabilistic polynomial time is closed under parity reductions, Inform. Process. Lett. 37 (1991), 91-94.
3. R. Beigel, N. Reingold, and D. Spielman, PP is closed under intersection, J. Comput. System Sci. 50, No. 2 (1995), 191-202.
4. B. Fu, Separating PH from PP by relativization, Acta Math. Sinica (New Ser.) 8, No. 3 (1992), 329-336.
5. M. Furst, J. B. Saxe, and M. Sipser, Parity, circuits, and the polynomial-time hierarchy, Math. Systems Theory 17 (1984), 13-27.
6. F. Green, An oracle separating ⊕P from PP^{PH}, Inform. Process. Lett. 37 (1991), 149-153.
7. F. Green, A lower bound for monotone perceptrons, Math. Systems Theory 28 (1995), 283-298.
8. J. Håstad, "Computational Limitations for Small-Depth Circuits," ACM doctoral dissertation awards, MIT Press, Cambridge, MA, 1987.
9. J. Håstad, Almost optimal lower bounds for small depth circuits, in "Randomness and Computation" (S. Micali, Ed.), Advances in Computing Research, Vol. 5, pp. 143-170, JAI Press, London, 1989.
10. J. Håstad and M. Goldmann, On the power of small-depth threshold circuits, Comput. Complexity 1, No. 2 (1991), 113-129.
11. K. Ko, Relativized polynomial time hierarchies having exactly k levels, SIAM J. Comput. 18, No. 2 (1989), 392-408.
12. M. Minsky and S. Papert, "Perceptrons," MIT Press, Cambridge, MA, 1988. [expanded edition]
13. M. Sipser, Borel sets and circuit complexity, in "Proceedings of 15th Annual ACM Symposium on Theory of Computing, 1983," pp. 61-69.
14. A. Yao, Separating the polynomial-time hierarchy by oracles, in "Proceedings of the 26th IEEE Symposium on Foundations of Computer Science, 1985," pp. 1-10.
                   
