1. INTRODUCTION
There has been a great deal of recent research focussed on neural networks as a promising approach to artificial intelligence. The main attraction has been the possibility of using training instead of explicit programming in order to obtain the required performance of a system. The degree to which explicit programming is excluded varies from cases where a randomly generated network is presented with a representative sample of input/output pairs, to those in which some knowledge of the problem domain allows the design of the network to be tailored to simplify the training complexity.
The other attraction of neural networks is their potential for highly parallel implementation. This possibility has been explored in the design of practical hardware systems but has not received much theoretical analysis. The purpose of this paper is to address this aspect of neural networks by linking them with the much-studied boolean circuits, which are the most natural theoretical model of massively parallel implementation.
In order to show how the classes of neural networks fit into the circuit complexity hierarchy, we must first consider them as boolean functions. This involves restricting the inputs to $\{0, 1\}$ vectors and thresholding the output of the whole network. Note that we allow real values for the activation of hidden nodes and we are not considering just simple linear threshold networks, though these networks do form subclasses of the classes we study. So that we may model the computation in our networks asymptotically, we need to put bounds on the maximum fan-in to nodes, the number of bits of accuracy used in storing the weight and activation values, and on the depth of the network. As with standard boolean circuit theory, these bounds are made in terms of the number of inputs to the network.

Requests for reprints should be sent to John Shawe-Taylor, Department of Computer Science, Royal Holloway and Bedford New College, University of London, Egham, Surrey TW20 0EX, UK.
For each natural number $k$ we define the class $NN^k$ of problems to be those that can be solved by neural networks with monotonic activation functions (not all necessarily the same function) at the nodes, $b$ bits of accuracy used in representing the weights and activation values, maximum fan-in $\Delta$ and height $h$, where $\log \Delta = O(\sqrt{\log n})$, $b \log \Delta = O(\log n)$ and $h \log \Delta = O(\log^k n)$, where $\log^k x$ denotes $(\log x)^k$. We show that for all $k$,

$$NC^k \subseteq NN^k \subseteq AC^k.$$
The paper is organized as follows. Section 2 introduces the terminology and definitions we require. These include circuit complexity and communication complexity. Section 3 gives a number of background results which will be used later. Included is a proof of the well-known equivalence between communication complexity and circuit depth. Section 4 shows how a neural network is converted to an equivalent boolean circuit, while Section 5 completes the complexity analysis and the proof of the main result. We finish with conclusions and directions for further research.
J. S. Shawe-Taylor, M. H. G. Anthony, and W. Kern

2. DEFINITIONS
Networks
A network is a directed acyclic graph. The nodes of in-degree 0 are called inputs, and are labelled with a variable $x_i$ or with a constant 0 or 1. The nodes of in-degree $k > 0$ are called gates and are labelled with a function on $k$ inputs. The in-degree of a node is referred to as its fan-in and its out-degree as its fan-out. One node will be designated as the output node. A network specifies a function in a natural way. The size of the network is the number of gates and the depth is the maximum distance from an input to the output.
A neural network is a network for which each edge has an associated weight value. The functions associated with the nodes take the weighted sum of the inputs to the node and pass the result through an activation function. The activation function is a monotonically increasing function from the real numbers to the interval $[0, 1]$. Thus to specify the functionality of a neural network, we must specify the weights and the activation functions for each node. The activation function of the output node is a threshold function. By restricting the inputs to boolean values a neural network determines a boolean function; see McClelland and Rumelhart (1986).
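The definition above can be made concrete with a short sketch. It is an illustration only: the gate-list representation, the names `evaluate`, `threshold` and `EXAMPLE`, and the choice of a sigmoid for the hidden node are assumptions made here, not part of the paper.

```python
import math
from itertools import product


def sigmoid(s):
    """A monotonically increasing activation with values in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-s))


def threshold(theta):
    """Threshold activation: 1 if the weighted sum reaches theta, else 0."""
    return lambda s: 1 if s >= theta else 0


def evaluate(gates, inputs):
    """Evaluate a network given as gates in topological order.

    Each gate is (name, [(source, weight), ...], activation); the last gate
    is taken to be the output node (a representation assumed for this sketch).
    """
    values = dict(inputs)
    output = None
    for name, edges, activation in gates:
        net = sum(weight * values[src] for src, weight in edges)
        values[name] = activation(net)
        output = values[name]
    return output


# Hypothetical two-input example: one real-valued hidden node, thresholded output.
EXAMPLE = [
    ("hidden", [("x1", 1.0), ("x2", 1.0)], sigmoid),
    ("out", [("hidden", 1.0)], threshold(0.8)),
]


def truth_table():
    """Boolean function determined by EXAMPLE on inputs 00, 01, 10, 11."""
    return [evaluate(EXAMPLE, {"x1": a, "x2": b})
            for a, b in product((0, 1), repeat=2)]
```

With these particular weights the hidden activation exceeds 0.8 only when both inputs are on, so the network determines the boolean AND of its inputs even though the hidden node is real valued.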
A boolean circuit is a network for which the functions associated with the gates are the boolean functions AND or OR. We will also allow negation to appear on inputs. That is, each input will be paired with its negation. All circuits containing only AND, OR and NOT gates can be transformed into a circuit of the type we consider which is at most twice the size and has the same depth. This result is mentioned in Boppana and Sipser (1990) and a proof may be obtained along the lines of our Proposition 4.1. A boolean circuit is termed binary if each gate has two inputs. A boolean circuit represents a boolean function.
Complexity
Let $\mathbb{N}$ denote the natural numbers, $\{0, 1\}^n$ the set of binary strings of length $n$, and $\{0, 1\}^*$ the set of all finite binary strings. Let $f : \{0, 1\}^* \to \{0, 1\}$. We say that $f$ is computed by a family of networks if the members of the family are indexed by the natural numbers such that the $n$-th network has $n$ inputs and computes the function $f|_{\{0,1\}^n}$. We say that some parameter of the family of networks (e.g., size, fan-in, etc.) has complexity $c = c(n)$ if the $n$-th network has value $c(n)$ for that parameter. For example, the circuit complexity of a function is the size complexity of the family of minimal boolean circuits with constant fan-in that compute the function. For more information on boolean circuit complexity we refer the reader to Boppana and Sipser (1990) and Wegener (1987).
The classes $NC^k$ ($AC^k$) are defined to be those functions with polynomial circuit complexity which can be computed by a family of boolean circuits of constant (unbounded) fan-in and depth $O(\log^k n)$.
The class $NN^k$ is defined to be those functions which can be computed by a family of polynomially sized neural networks with weights and activation values determined to $b$ bits of accuracy, fan-in equal to $\Delta$ and depth $h$, satisfying $\log \Delta = O(\sqrt{\log n})$, $b \log \Delta = O(\log n)$ and $h \log \Delta = O(\log^k n)$.
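To see how the three constraints trade off against one another, it may help to look at two sample parameter choices. These are illustrative choices made here, not taken from the text; $\Delta$ denotes the maximum fan-in, $b$ the number of bits and $h$ the depth.

```latex
% Largest permitted fan-in (illustrative choice):
\Delta = 2^{\sqrt{\log n}}, \qquad b = \sqrt{\log n}, \qquad
h = \frac{\log^k n}{\sqrt{\log n}}
\;\Longrightarrow\;
\log \Delta = \sqrt{\log n}, \quad
b \log \Delta = \log n, \quad
h \log \Delta = \log^k n .

% Constant fan-in (illustrative choice):
\Delta = 2, \qquad b = \log n, \qquad h = \log^k n
\;\Longrightarrow\;
\log \Delta = 1, \quad
b \log \Delta = \log n, \quad
h \log \Delta = \log^k n .
```

In the first case the fan-in $2^{\sqrt{\log n}}$ eventually exceeds every fixed power of $\log n$ yet stays below $n^{\epsilon}$ for every $\epsilon > 0$, while the accuracy drops to $\sqrt{\log n}$ bits. In the second case the accuracy may be as large as $\log n$ bits.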
Communication Protocols
Consider a function $F$ of two inputs, $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, which takes values in some finite set. Assume that two computing agents, normally referred to as Alice and Bob, are given the values of $x$ and $y$, respectively. They are interested in determining the value of the expression $F(x, y)$. We may assume that they both have unlimited computing resources and, of course, knowledge of $F$, and ask only how many bits must pass between them in the worst case in order to determine the function value. In fact, we only require that one of the participants determines the value, while the other must know only that the value has been determined. In order to make the messages passed between Alice and Bob meaningful, they must agree beforehand on a system of rules to decide at each stage who should send the next message and what information it contains. Clearly, after any particular sequence of bits has been passed between them, they must both know who is to send the next bit, though its value may affect who sends the bit after that. This system of rules is called a communication protocol.
More generally, F need not be a function: the aim may simply be to determine any one value from a particular set of outputs corresponding to the inputs x, y. The definition of a communication protocol in this case is analogous.
The complexity of a protocol is the number of bits communicated in the worst case. The trivial protocol consists of either Alice or Bob sending all their bits to the other, thus allowing them to compute the value of the function. For more information on communication complexity the reader is referred to Lovász and Saks (1988).
The Difference Problem for a monotonic boolean function $f$ is the problem in which, given two inputs $a$ and $b$ for which $f(a) = 0$ and $f(b) = 1$, one has to determine an index $i$ for which $a_i \neq b_i$. This is the second type of problem described above, where any index $i$ such that $a_i$ and $b_i$ differ will be a correct output.
In the next section we will present and prove a well-known result which states that a boolean function $f$ can be evaluated by a binary boolean circuit of depth $t$ if and only if the communication complexity of the Difference Problem for $f$ is at most $t$; see Karchmer and Wigderson (1988).
3. PRELIMINARY RESULTS
We first show the trivial inclusion $NC^k \subseteq NN^k$, which follows from showing that we can simulate binary AND and OR gates with bounded fan-in neurons using just one bit of accuracy.

Proof. Given an arbitrary binary boolean circuit, we will convert it to an equivalent neural network with the same number of nodes and depth using binary neurons and one bit of accuracy. The underlying graph is the same as that used for the boolean circuit. The weights on all the lines will be 1. The activation functions for nodes which were AND gates in the boolean circuit will be a threshold function with threshold 1.5, while for an OR gate it will be a threshold function with threshold 0.5. Clearly the neural network computes the same function as the boolean circuit. Hence, if a function $f$ lies in the class $NC^k$, there are circuits of polynomial size, constant fan-in and depth $O(\log^k n)$ which compute the function. We can therefore find neural networks with constant fan-in, a constant number of bits of accuracy and $O(\log^k n)$ depth which compute the function $f$. This implies $f$ lies in the class $NN^k$, as required.
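The gate simulation used in this proof is easy to check exhaustively. The following is a minimal sketch under the thresholds stated above (weights 1, threshold 1.5 for AND, 0.5 for OR); the function names are hypothetical and chosen only for this illustration.

```python
from itertools import product


def threshold_neuron(inputs, theta, weights=None):
    """Binary neuron: outputs 1 iff the weighted input sum is at least theta."""
    if weights is None:
        weights = [1] * len(inputs)  # all lines carry weight 1, as in the proof
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0


def and_neuron(x, y):
    """Neuron replacing a binary AND gate (threshold 1.5)."""
    return threshold_neuron([x, y], 1.5)


def or_neuron(x, y):
    """Neuron replacing a binary OR gate (threshold 0.5)."""
    return threshold_neuron([x, y], 0.5)


def gates_agree():
    """Check both neurons against the boolean gates on all four inputs."""
    return all(
        and_neuron(x, y) == (x & y) and or_neuron(x, y) == (x | y)
        for x, y in product((0, 1), repeat=2)
    )
```

Since each gate is replaced by a single neuron on the same graph, size, depth and fan-in are unchanged by the substitution.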
∎

Next we consider the equivalence between communication complexity and circuit depth. We will give a proof of this result both for completeness and because we will need the result in the slightly stronger form presented here.

PROPOSITION 3.2 (Karchmer & Wigderson, 1988).

Proof. ($\Rightarrow$) We can assume the existence of a binary boolean circuit of depth $t$ which Alice and Bob will use to direct the protocol. We will also assume that the two inputs to each of the gates have been labelled 0 and 1 in the same way on both Alice's and Bob's copy of the circuit. They use the information passed in the protocol to both trace the same path from the output node to an input which solves the Difference Problem. We assume inductively that Alice's input gives output 0 at the current node, while Bob's gives output 1. If the node is an AND node, Alice sends the label of an input to the node which also gives output 0 for her. Since Bob's output at the AND node is 1, his output at this input must also be 1, and so we complete the inductive step. If the node is an OR node, then Bob sends the label of an input to the node which also gives output 1 for him. This will again complete the inductive step. After at most $t$ steps, they must reach an input node which solves the Difference Problem.
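The forward direction just described can be sketched in a few lines. This is an illustration rather than the authors' construction verbatim: the nested-tuple circuit encoding (`("AND", l, r)`, `("OR", l, r)`, `("IN", i)`) and the function names are assumptions made here, and only unnegated inputs are modelled.

```python
def evaluate(node, x):
    """Evaluate a binary AND/OR circuit, given as nested tuples, on input x."""
    if node[0] == "IN":
        return x[node[1]]
    left, right = evaluate(node[1], x), evaluate(node[2], x)
    return left & right if node[0] == "AND" else left | right


def difference_protocol(circuit, a, b):
    """Trace a path from the output to an input on which a and b differ.

    Alice holds a with f(a) = 0, Bob holds b with f(b) = 1. Returns the
    index found and the number of bits exchanged (one per gate on the path).
    """
    assert evaluate(circuit, a) == 0 and evaluate(circuit, b) == 1
    node, bits = circuit, 0
    while node[0] != "IN":
        if node[0] == "AND":
            # Alice names a child that outputs 0 on her input.
            choice = 0 if evaluate(node[1], a) == 0 else 1
        else:
            # Bob names a child that outputs 1 on his input.
            choice = 0 if evaluate(node[1], b) == 1 else 1
        node = node[1 + choice]
        bits += 1
    return node[1], bits


# Hypothetical depth-2 circuit: (x0 AND x1) OR (x2 AND x3).
CIRCUIT = ("OR", ("AND", ("IN", 0), ("IN", 1)), ("AND", ("IN", 2), ("IN", 3)))
```

For example, `difference_protocol(CIRCUIT, (1, 0, 0, 1), (0, 1, 1, 1))` follows the right-hand AND gate and stops at input 2 after two bits, which matches the depth of the circuit.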
($\Leftarrow$) The proof is by induction on $t$. For $t = 0$ the result is trivial, as $X$ and the function $f$ must be such that $f|_X(x) = x_i$ for some $i$. Suppose now that the result holds for values smaller than $t \geq 1$, that $X \subseteq \{0, 1\}^n$ and that the Difference Problem for $f|_X$ has communication complexity at most $t$. We must consider two cases depending on who sends the first bit in the protocol. As they are symmetrical, we will consider only the case where Alice sends the first bit. We will define two functions $g$ and $h$ which both agree with $f$ on different subsets of $X$. The function $g$ ($h$) will agree with $f$ on the set $X_g$ ($X_h$) composed of the union of $X_1 = f^{-1}(1) \cap X$ and the set of inputs from $X$ which, when given to Alice, cause her to send a 0 (1). Note that, however the functions $g$ and $h$ are defined elsewhere on $X$, we will have $f = g \wedge h$. We claim that for $g$ ($h$) the communication complexity of the Difference Problem on inputs from $X_g$ ($X_h$) is less than $t$. Both protocols use the protocol for $f$. This is possible since the two functions agree with $f$ on their respective sets. However, the protocol for $f$ is in both cases shortened by removing the transmission of the first bit. Since this bit is known to be 0 (1) for any input given to Alice from $X_g$ ($X_h$), this bit does not need to be transmitted for Bob to continue with the protocol and for them both to correctly identify an index on which the inputs differ. By induction there are binary boolean circuits of depth at most $t - 1$ which compute $g$ correctly on $X_g$ and $h$ correctly on $X_h$. By combining the outputs of these two circuits into one AND gate, we obtain a circuit of depth at most $t$ which computes $g \wedge h$ for some extensions of $g$ and $h$ to $X$. By the above observation this agrees with $f$ on $X$. ∎

Our reason for considering the above result is to perform the following conversion from a linear threshold neuron to an equivalent boolean circuit.
Proof. We first construct a protocol of maximum length $(b + \log \Delta + 1) \log \Delta$ which solves the Difference Problem for the function $f$. This will imply the existence of a binary boolean circuit computing $f$ and of depth at most $(b + \log \Delta + 1) \log \Delta$. We will then observe that the circuit involves groups of layers all containing AND gates. These can therefore be concentrated into single multiple-input AND gates, reducing the depth significantly, and implying the required result.
The protocol assumes that Alice has been given an input $x$ for which $f(x) = 0$, while Bob has an input $y$ satisfying $f(y) = 1$. Bob and Alice agree on a numbering of the inputs to the neuron. The protocol involves bisecting the set of inputs, always retaining a set for which Alice's weighted sum is less than Bob's. This is true at the outset for the complete set of inputs, while it is true of a single input if and only if that input is a solution to the Difference Problem. At each stage Alice computes her weighted sum of the first half of the current set and transmits this value to Bob using $b + \log \Delta$ bits. Bob computes the equivalent weighted sum for his input. If this is greater than the value transmitted by Alice, he transmits a 0 back, indicating that Alice should continue with this as the new set of inputs. Otherwise, he returns a 1, implying that Alice's weighted sum on the other half of the inputs in the current set is less than Bob's, and so this should be adopted as the current set. The bisection of the inputs takes $\log \Delta$ interactions, while the number of bits transmitted in a single interaction is $b + \log \Delta + 1$, giving the length of the protocol as $(b + \log \Delta + 1) \log \Delta$. This implies the existence of a binary boolean circuit of depth $(b + \log \Delta + 1) \log \Delta$ which computes the function $f$. The number of nodes in a binary circuit of depth $h$ is easily seen to be bounded by $2^{h+1}$, and therefore this circuit has at most $2\Delta^{b + \log \Delta + 1}$ nodes. However, in the construction described in the proof of Proposition 3.2, bits transmitted by Alice translate into AND gates. Hence the transmission of $b + \log \Delta$ bits by Alice translates into a subtree of AND nodes of depth $b + \log \Delta$ with at most $2^{b + \log \Delta}$ inputs. By making these substitutions, we reduce the depth of the circuit significantly, obtaining a network of depth $2 \log \Delta$ with nodes having a fan-in of at most $2^{b + \log \Delta}$. The size of this network is at most the size of the binary circuit above, and is therefore bounded by $2\Delta^{b + \log \Delta + 1}$, as required.
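The bisection protocol in this proof can be simulated directly. The sketch below makes some simplifying assumptions that are not in the text: weights are nonnegative integers below $2^b$, the fan-in is a power of two, and the two parties are simulated in one function (named `bisection_protocol` here for illustration) that simply counts the bits they would exchange.

```python
from math import ceil, log2


def bisection_protocol(weights, b, x, y):
    """Find an index where x is off and y is on by bisecting the inputs.

    Alice holds x, Bob holds y, and Alice's weighted sum over all inputs is
    assumed to be smaller than Bob's. Returns (index, bits exchanged).
    """
    def wsum(v, start, stop):
        return sum(weights[i] * v[i] for i in range(start, stop))

    lo, hi = 0, len(weights)          # current set is the index range [lo, hi)
    log_fan_in = ceil(log2(len(weights)))
    assert wsum(x, lo, hi) < wsum(y, lo, hi)
    bits = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        alice_sum = wsum(x, lo, mid)  # Alice sends this using b + log(fan-in) bits
        bob_sum = wsum(y, lo, mid)
        bits += b + log_fan_in + 1    # Alice's value plus Bob's one-bit reply
        if bob_sum > alice_sum:
            hi = mid                  # Bob replies 0: keep the first half
        else:
            lo = mid                  # Bob replies 1: switch to the second half
    return lo, bits
```

With the hypothetical weights `[3, 1, 2, 2]` (so $b = 2$ and fan-in 4), Alice's input `(1, 0, 0, 0)` and Bob's input `(0, 1, 1, 1)`, the protocol ends at index 2 after two rounds of five bits each, in line with the $(b + \log(\text{fan-in}) + 1)\log(\text{fan-in})$ bound.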
Proof. We again devise a protocol to distinguish input vectors, but now only input vectors satisfying the restriction that the on-inputs in each group are all of lower index than the off-inputs. The same protocol is used as before, except that Bob and Alice agree on a subdivision of the inputs which respects the grouping. In this way, after $\log \Delta$ interchanges, they will have identified a group for which Alice's weighted sum is less than Bob's. Since their inputs both respect the restriction, it will be sufficient for Alice to send the number between 1 and $2^b$ which is the largest index of an on-input in her group. The next input above this one can then be guaranteed to be a solution of the Difference Problem, since it must be off for Alice and on for Bob. Hence, the protocol has only been extended by $b$ bits, all sent by Alice. In the binary circuit, this translates into $b$ extra levels of AND gates, while in the reduced circuit (after collapsing the subtrees of AND gates) it gives one extra level composed of AND gates with at most $2^b$ inputs. The depth of the binary circuit is therefore at most $(2b + \log \Delta + 1) \log \Delta + b$, and therefore the number of nodes in each circuit is at most

$$2^{(2b + \log \Delta + 1) \log \Delta + b + 1} = 2^{b+1} \Delta^{2b + \log \Delta + 1},$$

which is less than $(2\Delta)^{2b + \log \Delta + 1}$.
The result follows. ∎

4. NETWORK CONVERSION
Having covered the preliminary results that will be needed in the network conversion, we can begin the main part of the reduction proof. We begin with a general neural network with given accuracy, fan-in and depth and convert it into an equivalent boolean circuit whose parameters can be computed from those of the original network. The conversion has three stages. The first is to eliminate negative weights (apart from threshold values) by moving them back through the network. The second stage involves replacing each neuron by $2^b$ threshold neurons, one for each possible output value of the original neuron. In the third stage we replace the threshold neurons by boolean circuits using the results of the last section. At each stage, the size, depth, and fan-in of the new circuit will be computed from the parameters of the previous stage.
Eliminating Negative Weights
This subsection shows how negative weights can be removed from a network in much the same way as negations can be moved to convert a general boolean circuit to one containing only AND and OR gates as internal nodes. The conversion at most doubles the number of nodes while not altering the fan-in or depth.

Proof. We prove the result by first duplicating all the nodes in the network in such a way that the pair of each node has a complementary output, whatever the input to the network. This process doubles the number of nodes and weights. We then show that if there are any negative weights they can be replaced by positive ones. Finally, any redundant nodes can be deleted.
To duplicate a non-input node $v$, we make a copy $v'$ of $v$, which is connected to exactly the same nodes as $v$, but with weights of opposite sign. Furthermore, the activation function $f'$ of $v'$ is related to the activation function $f$ of $v$ by

$$f'(x) = 1 - f(-x),$$

so that the output of $v'$ is the complement of the output of $v$, as required. We now show how to remove a negative weight. Let $v$ be connected to $z$ via a negative weight. In order to remove the negative weight $w_{zv}$, we replace the $v \to z$ connection with a $v' \to z$ connection from $v$'s complement to $z$ having weight $-w_{zv}$. In addition, we change the activation function by shifting the argument of the function by the constant $w_{zv}$. Hence, the new output of $z$ is

$$f_z\Bigl(\sum_{u \neq v} w_{zu} o_u - w_{zv}(1 - o_v) + w_{zv}\Bigr) = f_z\Bigl(\sum_{u} w_{zu} o_u\Bigr),$$

where $o_u$ denotes the output of node $u$ and $f_z$ the original activation function of $z$. In this way we can remove all the negative weights in the network. This completes the proof. ∎
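A small numerical check may make the single removal step easier to follow. The sketch assumes the complement activation has the form $f'(s) = 1 - f(-s)$, which is what the complementary-output requirement forces; the weights, the sigmoid activations and all function names are hypothetical choices for this illustration. It removes only the one negative weight into $z$; the duplicate node itself still carries opposite-sign weights, which the full procedure would treat in turn.

```python
import math
from itertools import product


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def complement(f):
    """Activation of the duplicate node v': f'(s) = 1 - f(-s)."""
    return lambda s: 1.0 - f(-s)


W_ZU, W_ZV = 1.5, -2.0  # hypothetical weights into z; W_ZV is the negative one


def outputs_u_v(x1, x2):
    """Outputs of two hidden nodes u and v feeding z (weights chosen arbitrarily)."""
    return sigmoid(x1 + x2), sigmoid(2 * x1 + x2)


def original_z(x1, x2):
    o_u, o_v = outputs_u_v(x1, x2)
    return sigmoid(W_ZU * o_u + W_ZV * o_v)


def converted_z(x1, x2):
    o_u, _ = outputs_u_v(x1, x2)
    # v' sees the same inputs as v, with weights of opposite sign.
    o_v_bar = complement(sigmoid)(-(2 * x1 + x2))
    # z now reads v' through the positive weight -W_ZV, and its activation
    # has its argument shifted by the constant W_ZV.
    shifted = lambda s: sigmoid(s + W_ZV)
    return shifted(W_ZU * o_u + (-W_ZV) * o_v_bar)


def max_discrepancy():
    """Largest difference between the two versions of z over boolean inputs."""
    return max(abs(original_z(a, b) - converted_z(a, b))
               for a, b in product((0, 1), repeat=2))
```

Up to floating-point rounding, `max_discrepancy()` is zero, reflecting that the shift by $w_{zv}$ exactly cancels the constant introduced by reading the complemented output.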
In view of this result, we assume from now on that all weights are positive. This will not affect our complexity results, since each neural network can be converted to one with positive weights in which the number of nodes is increased by at most a constant factor 2 and the maximum fan-in is unchanged.
Converting to Boolean Circuit
We first convert a neural network to an equivalent Linear Threshold Network, before completing the conversion to a boolean circuit using Proposition 3.4.
To convert a neural network to a Linear Threshold Network we first take $2^b$ copies of each of the internal (hidden) nodes, where $b$ is the number of bits of accuracy. With $b$ bits there are at most $2^b$ different values which can be represented. Let these values be $t_1, t_2, \ldots, t_{2^b}$ in increasing order between 0 and 1. The output of the $i$-th copy will be 1 if and only if the output of the original network node was greater than or equal to $t_i$. The node takes input from the $2^b$ copies of the nodes corresponding to those connected to it in the original neural network. The weight on the $i$-th line will be $(t_i - t_{i-1})w$, where $w$ is the weight on the corresponding line in the neural network. Assuming the difference between two denotable values is also denotable, this implies that to denote the new weight will require $2b$ bits. The net input to each copy of the node is therefore

$$\sum_u w_u \sum_{i=1}^{2^b} (t_i - t_{i-1})\, o^u_i = \sum_u w_u f_u,$$

where $o^u_i$ is the output of $u$ thresholded at $t_i$ and we take $t_0 = 0$. Here, $f_u$ denotes the value of the activation function at node $u$ in the neural network. Input lines from input nodes do not need to be duplicated and their weights do not need to be changed, since the inputs are already binary. The output node does not need to be duplicated since it is by definition a threshold node. Hence, the new network computes the same boolean function as the original neural network. The new linear threshold network has the same depth as the original network, the number of nodes has been increased by a factor $2^b$, the fan-in has likewise increased by this factor, while the number of bits has doubled.
There is, however, one important property of the network. If, for a particular node in the original network, the node corresponding to output at least $t_i$ is switched on for a particular input, then the nodes corresponding to lower denotable values will also be on.
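The replacement of one $b$-bit output by $2^b$ threshold copies, and the ordering property just noted, can be checked with exact arithmetic. The sketch below makes assumptions that are not in the text: the denotable values are taken to be the evenly spaced fractions $k/(2^b - 1)$, and the convention $t_0 = 0$ is used so that the weighted sum telescopes. The function names are hypothetical.

```python
from fractions import Fraction


def denotable_values(b):
    """The 2^b values t_1 < ... < t_{2^b} in [0, 1] (evenly spaced by assumption)."""
    n = 2 ** b
    return [Fraction(k, n - 1) for k in range(n)]


def threshold_copies(value, levels):
    """Outputs of the 2^b copies: the i-th copy is on iff value >= t_i."""
    return [1 if value >= t else 0 for t in levels]


def reconstruct(copies, levels, weight=1):
    """Weighted sum over the copies using line weights (t_i - t_{i-1}) * weight."""
    total, prev = 0, Fraction(0)  # prev starts at the assumed t_0 = 0
    for o, t in zip(copies, levels):
        total += (t - prev) * weight * o
        prev = t
    return total
```

For each denotable output value $v$, `threshold_copies` returns a run of ones followed by zeros, which is the ordering property, and `reconstruct` returns exactly `weight * v`, so a downstream node receives the same net input as in the original network.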
We use Proposition 3.4 to convert each of the individual linear threshold neurons of the network created above. Note that, by the observation above, the restriction required by the proposition is satisfied. Hence, for a neural network of depth $h$, maximum fan-in $\Delta$ and accuracy of $b$ bits, we obtain an equivalent boolean circuit with depth $h(2 \log \Delta + 1)$, with binary OR gates and $2^{2b + \log \Delta}$-input AND gates, and size at most

$$N\, 2^b\, (2\Delta)^{2b + \log \Delta + 1},$$

where $N$ is the number of nodes in the original neural network.
5. COMPLEXITY RESULTS
We are now ready to prove our main result.

Proof. Let $g$ be a function in $NN^k$. Then for each $n$ there is a neural network which computes $g|_{\{0,1\}^n}$ and has size $N(n)$, height $h(n)$, fan-in $\Delta(n)$ and $b(n)$ bits of accuracy such that $N(n) = O(\mathrm{poly}(n))$, $\log \Delta(n) = O(\sqrt{\log n})$, $b(n) \log \Delta(n) = O(\log n)$ and $h(n) \log \Delta(n) = O(\log^k n)$. By the above we can convert the neural network into a boolean circuit of size

$$N(n)\, 2^{b(n)}\, (2\Delta(n))^{2b(n) + \log \Delta(n) + 1}.$$

Taking logarithms of this expression gives:

$$\log N(n) + b(n) + \bigl(2b(n) + \log \Delta(n) + 1\bigr)\bigl(\log \Delta(n) + 1\bigr) = O(\log n).$$

Hence, the circuit is polynomially sized. The depth of the circuit is $h(n)(2 \log \Delta(n) + 1) = O(\log^k n)$.
This implies that the function $g$ lies in $AC^k$, as required.
∎

These results concern the exact representation, as boolean circuits, of neural networks with weights and outputs to a fixed accuracy of $b$ bits. This can be regarded as in some sense approximating unlimited accuracy neural networks with boolean circuits, where the degree of approximation is determined by $b$. However, Raghavan (1988) has shown that a single linear threshold neuron with boolean inputs, unlimited accuracy on its weights, and fan-in $\Delta$ can be represented exactly by a linear threshold neuron with the same fan-in, but with weights constrained to be $\Delta \log \Delta$-bit integers. That is, the neuron can be replaced by one which computes the same function, has the same fan-in, but has bounded accuracy $\Delta \log \Delta$. Using this, we can obtain the following exact representation theorem for linear threshold networks.

Proof. The first containment follows as earlier, in the proof that $NC^k \subseteq NN^k$. For the second containment, we use the preceding result. Let $g$ be a function in $TN^k$. Then for each $n$ there is a feedforward linear threshold network which computes $g|_{\{0,1\}^n}$, with size $N(n)$, height $h(n)$ and fan-in $\Delta(n)$ such that $\Delta(n) = O((\log n)^{1-\epsilon})$ and $h(n) \log \Delta(n) = O(\log^k n)$. Now, by the previous result, we can convert this threshold network into a boolean circuit of size at most $2N(n)\Delta^{\Delta \log \Delta + \log \Delta + 1}$ and depth at most $2h \log \Delta$. Taking logarithms, we get

$$1 + \log N(n) + (\Delta \log \Delta + \log \Delta + 1) \log \Delta = O\bigl(\log n + (\log n)^{1-\epsilon} (\log \log n)^2\bigr) = O(\log n).$$

Hence, the circuit has polynomial size. As before, the depth of the circuit is $2h \log \Delta = O(\log^k n)$.
Therefore $g$ lies in $AC^k$, as required.
∎
6. CONCLUSIONS
We have introduced a hierarchy of classes of neural networks and shown that they interleave the well-known boolean complexity classes $NC^k$ and $AC^k$. The neural network class introduced has two significant limitations. The first is in the number of bits used to represent the real numbers involved and the second is in the fan-in to the nodes. The limitation on the fan-in is sublinear in the number of inputs, but significantly larger than logarithmic. This appears to be a critical limitation on the computational power of the network.
The limitation on the number of bits is at best logarithmic, but decreases to the square root of the logarithm as maximum fan-in is used. This appears to be a fairly severe limitation on the expressibility of the numbers involved. This would seem a reasonable limitation not only for standard computing equipment but also for the biological neural networks, where synapse accuracy does not appear to be very great or increase dramatically in more advanced warm-blooded species.
There are several indications why the limitation on the number of bits is perhaps not as severe as we might suppose. The first is that in the proof that $NC^k \subseteq NN^k$, we require only one bit of accuracy to represent the boolean circuit, leaving $O(\log n)$ bits in "reserve" in this constant fan-in example. More importantly, Theorem 5.3 shows that at least in the case of linear threshold circuits with fan-in still further restricted, the expressive power is no longer affected by increasing the number of bits indefinitely. It is an open question whether there is a restriction on the fan-in which would make the computational power of a neural network independent of the accuracy of the numbers involved, but in view of Theorem 5.3, this certainly does not seem an impossibility.
Our conclusions are therefore that, as in the case of threshold circuits, it appears to be the large fan-in nodes which significantly increase the computational power of neural networks rather than the detailed functions computed at a node. This is reinforced by our allowing any monotonic activation functions to be used in internal nodes in the classes discussed. We feel that to implement neural networks in hardware, more emphasis should perhaps be placed on processing large fan-ins in parallel rather than modelling traditional activation functions exactly.
