This paper shows that neural networks which use continuous activation functions have VC dimension at least as large as the square of the number of weights w. This result settles a long-standing open question, namely whether the well-known O(w log w) bound, known for hard-threshold nets, also held for more general sigmoidal nets. Implications for the number of samples needed for valid generalization are discussed.
Introduction
One of the main applications of artificial neural networks is to pattern classification tasks. A set of labeled training samples is provided, and a network must be obtained which is then expected to correctly classify previously unseen inputs. In this context, a central problem is to estimate the amount of training data needed to guarantee satisfactory learning performance. To study this question, it is necessary to first formalize the notion of learning from examples.
One such formalization is based on the paradigm of probably approximately correct (PAC) learning, due to Valiant [15]. In this framework, one starts by fitting some function $f$, chosen from a predetermined class $F$, to the given training data. The class $F$ is often called the "hypothesis class", and for purposes of this discussion it will be assumed that the functions in $F$ take binary values $\{0, 1\}$ and are defined on a common domain $X$. (In neural network applications, $F$ typically corresponds to the set of all neural networks with a given architecture and choice of activation functions. The elements of $X$ are the inputs, possibly multidimensional.) The training data consists of labeled samples $(x_i, \varepsilon_i)$, with each $x_i \in X$ and each $\varepsilon_i \in \{0, 1\}$, and "fitting" by an $f$ means that $f(x_i) = \varepsilon_i$ for each $i$. Given a new example $x$, one uses $f(x)$ as a guess of the "correct" classification of $x$. Assuming that both training inputs and future inputs are picked according to the same probability distribution on $X$, one needs the space of possible inputs to be well sampled by the training data, so that $f$ is an accurate fit. We omit the details of the formalization of PAC learning, since there are excellent references available, both in textbook (e.g. [1, 11]) and survey paper (e.g. [10]) form, and the concept is by now very well known.
After the work of Vapnik in statistics [16] and of Blumer et al. in computational learning theory [4], one knows that a certain combinatorial quantity, called the Vapnik-Chervonenkis (VC) dimension $\mathrm{VC}(F)$ of the class $F$ of interest, completely characterizes the sample sizes needed for learnability in the PAC sense.
(The appropriate definitions are reviewed below. In Valiant's formulation one is also interested in quantifying the computational effort required to actually fit a function to the given training data, but we ignore that aspect in the current paper.) Very roughly speaking, the number of samples needed in order to learn reliably is proportional to $\mathrm{VC}(F)$. Estimating $\mathrm{VC}(F)$ then becomes a central concern. Thus from now on, we speak exclusively of VC dimension, instead of the original PAC learning problem.
The work of Cover [5] and Baum and Haussler [2] dealt with the computation of $\mathrm{VC}(F)$ when the class $F$ consists of networks built up from hard-threshold activations and having $w$ weights; they showed that $\mathrm{VC}(F) = O(w \log w)$. (Conversely, Maass showed in [9] that there is also a lower bound of this form.) It would appear that this definitely settled the VC dimension (and hence also the sample size) question.
However, the above estimate assumes an architecture based on hard-threshold ("Heaviside") neurons. In contrast, the usually employed gradient descent learning algorithms (the "backpropagation" method) rely upon continuous activations, that is, neurons with graded responses. As pointed out in [13], the use of analog activations, which allow the passing of rich (not just binary) information among levels, may result in higher memory capacity as compared with threshold nets. This has serious potential implications in learning, essentially because more memory capacity means that a given function $f$ may be able to "memorize" in a "rote" fashion too much data, and less generalization is therefore possible. Indeed, the paper [14] showed that there are conceivable (though not very practical) neural architectures with extremely high VC dimensions. Thus the problem of studying $\mathrm{VC}(F)$ for analog networks is an interesting and relevant issue. Two important contributions in this direction were the papers by Maass [9] and by Goldberg and Jerrum [6].

Assume given an activation $\sigma$ which has different limits at $\pm\infty$, and is such that there is at least one point where it has a derivative and the derivative is nonzero (this last condition rules out the Heaviside activation). Then there are architectures with arbitrarily large numbers of weights $w$ and VC dimension proportional to $w^2$. The proof relies on first showing that networks consisting of two types of activations, Heavisides and linear, already have this power. This is a somewhat surprising result, since purely linear networks result in VC dimension proportional to $w$, and purely threshold nets have, as per the results quoted above, VC dimension bounded by $w \log w$. Our construction was originally motivated by a related one, given in [6], which showed that real-number programs (in the Blum-Shub-Smale model of computation [3]) with running time $T$ have VC dimension $\Omega(T^2)$.
The desired result on continuous activations is then obtained, approximating Heaviside gates by $\sigma$-nets with large weights and approximating linear gates by $\sigma$-nets with small weights. This result applies in particular to the standard sigmoid $1/(1 + e^{-x})$. (However, in contrast with the piecewise-polynomial case, there is still in that case a large gap between our $\Omega(w^2)$ lower bound and the $O(w^4)$ upper bound which was recently established in [7].) A number of variations, dealing with Boolean inputs, or weakening the assumptions on $\sigma$, are also discussed. The last section includes some brief remarks regarding an interpretation of our results in terms of threshold-only networks with "shared" weights.
Basic Terminology and Definitions
It is possible to formulate a general definition of "network architecture" that allows for very arbitrary nets; see [10]. However, in order to streamline the presentation we will only provide a simpler definition which is sufficient for dealing with first-order (additive-synapse) nets. (At one point we do need to deal technically with product units, but we treat that case in an ad hoc manner.)
Formally, a (first-order, feedforward) architecture or network $\mathcal{A}$ is a connected directed acyclic graph together with an assignment of a function to a subset of its nodes. The nodes are of two types: those of fan-in zero are called input nodes and the remaining ones are called computation nodes or gates. An output node is a node of fan-out zero. To each gate $g$ there is associated a function $\sigma_g : \mathbb{R} \to \mathbb{R}$, called the activation or gate function associated to $g$.
The number of weights or parameters associated to a gate $g$ is the integer $n_g$ equal to the fan-in of $g$ plus one. (This definition is motivated by the fact that each input to the gate will be multiplied by a weight, and the results are added together with a "bias" constant term, seen as one more weight; see below.) The (total) number of weights (or parameters) of $\mathcal{A}$ is by definition the sum of the numbers $n_g$, over all the gates $g$ of $\mathcal{A}$.
The number of inputs $m$ of $\mathcal{A}$ is the total number of input nodes (one also says that "$\mathcal{A}$ has inputs in $\mathbb{R}^m$"); it is assumed that $m > 0$. The number of outputs $p$ of $\mathcal{A}$ is the number of output nodes (unless otherwise mentioned, we assume by default that all nets considered have one-dimensional outputs, that is, $p = 1$).
Two examples of gate functions that are of particular interest are the identity or linear gate, $\mathrm{Id}(x) = x$ for all $x$, and the threshold or Heaviside function, $H(x) = 1$ if $x \geq 0$, $H(x) = 0$ if $x < 0$.
If $G$ is some set of functions $\mathbb{R} \to \mathbb{R}$ so that, for each gate $g$ of $\mathcal{A}$, the function $\sigma_g \in G$, then we say that $G$ is a set of gates for $\mathcal{A}$. We use informal terminology as well; for instance, if we say that $\mathcal{A}$ consists of (or is made of) linear and threshold gates, we mean that $G = \{H, \mathrm{Id}\}$ is a set of gates for $\mathcal{A}$.

Let [...] where $p$ is the number of outputs of $\mathcal{A}$, defined by first assigning an "output" to each node, recursively on the distance from the input nodes. Assume given an input $x \in \mathbb{R}^m$ and a vector of weights $w \in \mathbb{R}^n$. We partition $w$ into blocks $(w_1, \ldots, w_l)$ of sizes $n_1, \ldots, n_l$ respectively.

Assume that $\mathcal{A}$ is an architecture with inputs in $\mathbb{R}^m$ and scalar outputs, and that the (unique) output gate has range $\{0, 1\}$. A subset $A \subseteq \mathbb{R}^m$ is said to be shattered by $\mathcal{A}$ if for each Boolean function $\beta : A \to \{0, 1\}$ there is some weight $w \in \mathbb{R}^n$ so that $F_w(x) = \beta(x)$ for all $x \in A$. The Vapnik-Chervonenkis (VC) dimension of $\mathcal{A}$ is the maximal size of a subset $A \subseteq \mathbb{R}^m$ that is shattered by $\mathcal{A}$. If the output gate can take non-binary values, we implicitly assume that the result of the computation is the sign of the output. That is, when we say that a subset $A \subseteq \mathbb{R}^m$ is shattered by $\mathcal{A}$, we really mean that $A$ is shattered by the architecture $H(\mathcal{A})$ in which the output of $\mathcal{A}$ is fed to a sign gate.
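The definition of shattering can be made concrete with a brute-force check on a very small class. The sketch below is illustrative only: the helper names and the finite weight grid are assumptions, and the class is the single threshold gate $H(w_0 + w_1 x)$ on one real input. A grid search can certify that a set is shattered; a negative answer only covers the grid, although for this class three points on a line indeed cannot be shattered, because the gate is monotone in $x$.

```python
from itertools import product

def heaviside(x):
    """Threshold gate H: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def threshold_net(w, x):
    """Single threshold gate with bias w0 and input weight w1."""
    w0, w1 = w
    return heaviside(w0 + w1 * x)

def is_shattered(points, weight_candidates, net):
    """True if every 0/1 labeling of `points` is produced by net(w, .)
    for at least one w among the candidates."""
    realized = {tuple(net(w, x) for x in points) for w in weight_candidates}
    return all(labels in realized
               for labels in product((0, 1), repeat=len(points)))

# Hypothetical finite grid of candidate weights (w0, w1).
grid = [(k / 2, s) for k in range(-10, 11) for s in (-1, 1)]

two_points = is_shattered([0, 1], grid, threshold_net)
three_points = is_shattered([0, 1, 2], grid, threshold_net)
print(two_points, three_points)
```

On this grid the two-point set is shattered and the three-point set is not, which matches a VC dimension of 2 for this one-gate class.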
The above formalism is too cumbersome for all proofs, so we often use obvious shortcuts. For instance, if we say "$\mathcal{A}$ is the net $w_0 + w_1 H(2x - 1)$" we really mean that this expression represents the scalar function computed by the obvious net with one input (the variable $x$) and two gates (one Heaviside, one linear); note that the number of weights is 4 (the weights are $2, -1, w_0, w_1$).
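As a small check of the weight-counting convention (fan-in plus one per gate), the net $w_0 + w_1 H(2x - 1)$ can be written out explicitly. The dictionary encoding of the graph and the function names below are assumptions made for illustration, not notation from the paper.

```python
def heaviside(x):
    """Threshold gate H: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

# Each gate maps to (gate function name, list of predecessor nodes).
# "x" is the single input node; it is not a gate and carries no weights.
architecture = {
    "h": ("H", ["x"]),     # Heaviside gate computing H(2x - 1)
    "out": ("Id", ["h"]),  # linear gate computing w0 + w1 * h
}

def count_weights(arch):
    """Total number of weights: sum over gates of (fan-in + 1)."""
    return sum(len(preds) + 1 for _, preds in arch.values())

def net(x, w0, w1):
    """Scalar function computed by the two-gate net."""
    return w0 + w1 * heaviside(2 * x - 1)

print(count_weights(architecture))   # 2 weights per gate, 4 in total
print(net(0.25, 3, 5), net(0.5, 3, 5))
```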
Another convention, consistent with standard computer science usage, is that we may use a phrase like "for each $n \geq 1$ there is an architecture $\mathcal{A}$ with $O(\sqrt{n})$ weights and gates in $G$" to assert the existence of a sequence of architectures $\mathcal{A}_n$ so that $G$ is a set of gates for each $\mathcal{A}_n$ and so that the number of weights of $\mathcal{A}_n$ is $O(\sqrt{n})$ as $n \to \infty$. In this context, when we say that $\mathcal{A}$ shatters a set of size $\Omega(n)$ we really mean that there is a sequence of sets $A_n$ so that $\mathcal{A}_n$ shatters $A_n$ and the cardinality of each $A_n$ is $\Omega(n)$.

Proof. Our architecture has $n$ parameters $W_1, \ldots, W_n$; each of them is an element of $T = \{0.w_1 \ldots w_n : w_i \in \{0, 1\}\}$. The shattered set will be $S = [n]^2$. For a given choice of $W = (W_1, \ldots, W_n)$, $\mathcal{A}$ will compute the Boolean function $f_W : S \to \{0, 1\}$ defined as follows: $f_W(x, y)$ is equal to the $x$-th bit of $W_y$. Clearly, for any Boolean function $f$ on $S$, there exists a (unique) $W$ such that $f = f_W$.
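The encoding step of this proof, storing an arbitrary Boolean table on $[n]^2$ in the binary expansions of $n$ numbers, can be checked directly. The sketch below uses exact rationals and hypothetical helper names; it illustrates only the bookkeeping, not the network that extracts the bits.

```python
from fractions import Fraction
from itertools import product

def encode(table, n):
    """Given table[(x, y)] in {0, 1} for x, y in 1..n, return
    W = [W_1, ..., W_n] where W_y = 0.b_1 b_2 ... b_n in binary
    and b_x = table[(x, y)]."""
    return [sum(Fraction(table[(x, y)], 2 ** x) for x in range(1, n + 1))
            for y in range(1, n + 1)]

def f_W(W, x, y):
    """The x-th binary digit after the point of W_y."""
    return int(W[y - 1] * 2 ** x) % 2

def all_tables_realized(n):
    """Check that every Boolean function on [n]^2 equals f_W for some W."""
    cells = [(x, y) for x in range(1, n + 1) for y in range(1, n + 1)]
    for bits in product((0, 1), repeat=len(cells)):
        table = dict(zip(cells, bits))
        W = encode(table, n)
        if any(f_W(W, x, y) != table[(x, y)] for (x, y) in cells):
            return False
    return True

print(all_tables_realized(3))
```

With $n$ numbers of $n$ bits each, all $2^{n^2}$ labelings of the $n^2$ points are available, which is where the quadratic count comes from.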
We first consider the obvious architecture which computes the function:
sending each point $y \in [n]$ to $W_y$. This architecture has $n - 1$ threshold gates, $3(n - 1) + 1$ weights, and just one linear gate.
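One construction consistent with the stated counts is a telescoping sum, $W_1 + \sum_{i=2}^{n} (W_i - W_{i-1})\, H(y - i + 1/2)$. This particular formula is an assumption made for illustration and may differ from the one used in the paper; the sketch checks that it selects $W_y$ and that its weight count is $3(n - 1) + 1$.

```python
from fractions import Fraction

def heaviside(x):
    """Threshold gate H: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def select(W, y):
    """Return W_y for y in 1..n, using n - 1 threshold gates whose
    outputs are combined by a single linear gate."""
    n = len(W)
    out = W[0]  # bias of the linear gate
    for i in range(2, n + 1):
        # The gate H(y - i + 1/2) outputs 1 exactly when y >= i.
        out += (W[i - 1] - W[i - 2]) * heaviside(y - i + Fraction(1, 2))
    return out

def weight_count(n):
    """Fan-in plus one per gate."""
    threshold_gates = 2 * (n - 1)  # each has fan-in 1 plus a bias
    linear_gate = (n - 1) + 1      # fan-in n - 1 plus a bias
    return threshold_gates + linear_gate

W = [Fraction(k, 8) for k in (3, 1, 4, 1, 5)]
print([select(W, y) for y in range(1, 6)] == W, weight_count(5))
```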
Next we define a second multi-output net which maps $w \in T$ to its binary representation f
Constant Number of Linear Gates
We do not know whether similar constructions using only a constant number of linear gates are possible. However, by making a distinction between "fixed weights" and "programmable weights" one can prove a result in that direction. Given a sequence of architectures $\mathcal{A}_j$, with numbers of weights $n(j)$ respectively, assume that some partition of the indices $i = 1, \ldots, n$
Arbitrary Sigmoids
We now extend the preceding VC dimension bounds to networks that use just one activation function (instead of both linear and threshold gates). All that is required is that the gate function have a sigmoidal shape and satisfy a very weak smoothness property:
1. $\sigma$ is differentiable at some point $x_0$ (i.e., $\sigma(x_0 + h) = \sigma(x_0) + \sigma'(x_0)h + o(h)$) where $\sigma'(x_0) \neq 0$.
2. $\lim_{x \to -\infty} \sigma(x) = 0$ and $\lim_{x \to +\infty} \sigma(x) = 1$ (the limits 0 and 1 can be replaced by any distinct numbers).
A function $\sigma$ satisfying these two conditions will be called sigmoidal. Given any such $\sigma$, we will show that networks using only $\sigma$ gates provide quadratic VC dimension.

This implies by induction on the depth of $g$ that for any gate $g$ of $N$ and any input $x \in S$, the net input $I_{g,\epsilon}(x)$ to $g$ in the transformed net $N_\epsilon$ satisfies $\lim_{\epsilon \to 0} I_{g,\epsilon}(x) = I_g(x)$ (here, we use the fact that the output function of every $g$ is continuous at $I_g(x)$). In particular, by taking $g$ to be the output gate of $N$, we see that $N$ and $N_\epsilon$ compute the same function on $S$ if $\epsilon$ is small enough. Such a net $N_\epsilon$ can be transformed into an equivalent net $N'$ that uses only $\sigma$ as gate function by a simple transformation of its weights and thresholds.
The number of weights remains the same, except at most for a constant term that must be added to each net input to a gate; thus if $N$ has $n$ weights, $N'$ has at most $2n$ weights. □
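The two approximations behind this kind of argument, a steep $\sigma$ standing in for a Heaviside gate and a locally linearized $\sigma$ standing in for a linear gate, can be checked numerically. The sketch below assumes the standard sigmoid, an expansion point $x_0 = 1$, and illustrative function names; it is a numerical illustration under those assumptions rather than the construction used in the proof.

```python
import math

def sigmoid(x):
    """Standard sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the standard sigmoid."""
    s = sigmoid(x)
    return s * (1.0 - s)

def approx_heaviside(x, eps):
    """sigma(x / eps): the large weight 1/eps pushes the output toward
    H(x) for x != 0. At x = 0 the value is 1/2, so inputs must stay
    away from the threshold."""
    return sigmoid(x / eps)

def approx_linear(x, eps, x0=1.0):
    """(sigma(x0 + eps * x) - sigma(x0)) / (eps * sigma'(x0)): a small
    input weight eps keeps sigma near x0, where it is almost affine."""
    return (sigmoid(x0 + eps * x) - sigmoid(x0)) / (eps * sigmoid_prime(x0))

# eps is kept moderate so that math.exp does not overflow.
print(approx_heaviside(0.5, 1e-3), approx_heaviside(-0.5, 1e-3))
print(approx_linear(3.0, 1e-4))
```

Both errors shrink as `eps` decreases, which is the behavior needed on a fixed finite set of inputs.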
More General Gate Functions
The objective of this section is to establish results similar to Theorem 4, but for even more arbitrary gate functions, in particular weakening the assumption that limits exist at infinity. The main result is, roughly, that any $\sigma$ which is piecewise twice (continuously) differentiable gives at least quadratic VC dimension, save for certain exceptional cases involving functions that are almost everywhere linear.
In deriving the results of this section, we find it useful to allow networks with multiplication and division gates, which strictly speaking are not networks in the way that we have defined the concept. By this we will mean that the definition of architecture is extended so that the symbols "×" and "/" are also allowed as labels for gates. The gates labeled "/" have fan-in two and number of weights also two (even though there is no natural numerical parameter associated to the gate); the output of such a gate is defined as the quotient of its two inputs, assuming that the second input is nonzero. The multiplication gates may have arbitrary fan-in, and "number of weights" equal to this fan-in; their output is defined as the product of the inputs. An input to a circuit is said to be valid if it does not cause a division by zero at any division gate. We will only work with sets of valid inputs (so the domain of the function computed by such a generalized network is a subset of $\mathbb{R}^m$ and shattering is only defined for subsets of this domain).
We may work indifferently with multiplication gates of fan-in 2 or of unbounded fan-in. The number of weights is unchanged up to a constant factor, since a $k$-ary multiplication $x_1 \cdots x_k$ can be replaced for $k > 2$ by a depth-$(k - 1)$ circuit $x_1 \cdot (x_2 \cdot (x_3 \cdot (\cdots x_k)) \cdots)$ of binary gates with $2k - 2$ weights. We now explain how to get rid of division gates.
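The count of $2k - 2$ weights follows from using $k - 1$ binary gates, each with "number of weights" equal to its fan-in of two. A minimal sketch, with a hypothetical function name:

```python
def binary_chain(xs):
    """Evaluate x_1 * (x_2 * (... * x_k)) using only binary
    multiplication gates. Returns (value, gates, weights)."""
    if len(xs) < 2:
        raise ValueError("need at least two inputs")
    value = xs[-1]
    gates = 0
    for x in reversed(xs[:-1]):
        value = x * value  # one binary multiplication gate
        gates += 1
    # A multiplication gate has as many weights as its fan-in.
    return value, gates, 2 * gates

print(binary_chain([2, 3, 4, 5]))
```

For $k = 4$ this gives 3 gates and 6 weights, against 4 weights for a single gate of fan-in 4, so the overhead is at most a factor of two.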
Theorem 7 Let $N$ be a network made of linear, multiplication and division gates, with one threshold gate at the output and $w$ weights. Then $N$ can be simulated on its set of valid inputs by a network $N'$ of linear and multiplication gates with one threshold gate at the output and $w' = O(w)$ weights.
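The idea behind Theorem 7 can be illustrated by carrying every intermediate value as a (numerator, denominator) pair and replacing the final division by a product, using $H(x_1/x_2) = H(x_1 x_2)$ for $x_2 \neq 0$. The update rules below are ordinary fraction arithmetic and are an assumption; they may differ in detail from the rules used in the proof. The example circuit and all names are hypothetical.

```python
from fractions import Fraction

def heaviside(x):
    """Threshold gate H: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def lift(x):
    """Represent the plain value x as the pair (x, 1)."""
    return (x, 1)

def linear(a, u, b, v, c):
    """Pair representing a*u + b*v + c, built from products and sums only."""
    (u1, u2), (v1, v2) = u, v
    return (a * u1 * v2 + b * v1 * u2 + c * u2 * v2, u2 * v2)

def multiply(u, v):
    """Pair representing u * v."""
    return (u[0] * v[0], u[1] * v[1])

def divide(u, v):
    """Pair representing u / v; valid inputs guarantee v is nonzero."""
    return (u[0] * v[1], u[1] * v[0])

def output(u):
    """H(u1 / u2) computed without division as H(u1 * u2)."""
    return heaviside(u[0] * u[1])

def circuit_with_division(x, y):
    """Reference circuit H((2x - y + 1) / (x * y) - 3), exact arithmetic."""
    return heaviside(Fraction(2 * x - y + 1, x * y) - 3)

def circuit_without_division(x, y):
    """The same circuit using only linear and multiplication operations."""
    X, Y = lift(x), lift(y)
    s = linear(2, X, -1, Y, 1)
    q = divide(s, multiply(X, Y))
    return output(linear(1, q, 0, lift(0), -3))

valid = [(x, y) for x in range(-3, 4) for y in range(-3, 4) if x * y != 0]
print(all(circuit_with_division(x, y) == circuit_without_division(x, y)
          for x, y in valid))
```

Each original gate is replaced by a constant number of linear and multiplication operations, which is consistent with the $O(w)$ bound on the number of weights.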
Proof. Let us assume without loss of generality that all linear and multiplication gates of $N$ are binary, and that the output gate is unary. Each non-output gate $g$ of $N$ will be replaced in $N'$ by two gates $g_1$ and $g_2$ such that on any valid input of $N$, the values assumed by these 3 gates satisfy $g(x) = g_1(x)/g_2(x)$. Each input to the new gates $g_1$ and $g_2$ is now a pair $(x_1, x_2)$, representing the corresponding input $x_1/x_2$ to the original gate $g$. The outputs of $g_1$ and $g_2$ are passed as a pair to the next gates in the graph. The gates $g_1$ and $g_2$ are formed from each $g$ according to three simple rules: [...]

Finally, a multiplication gate is added before the output gate, since $H(x_1/x_2) = H(x_1 x_2)$ whenever $x_2 \neq 0$.

1. $\sigma$ is piecewise-constant, and in this case the VC dimension of any architecture of $n$ weights is $O(n \log n)$;
2. $\sigma$ is affine, and in this case the VC dimension of any architecture of $n$ weights is at most $n$;
3. there are constants $a \neq 0$ and $b$ such that $\sigma(x) = ax + b$ except at a finite nonempty set of points. In this case, the VC dimension of any architecture of $n$ weights is $O(n^2)$, and there are architectures of VC dimension $\Omega(n \log n)$.
Note that the upper bound of the first special case is tight for threshold nets, and that of the second special case is tight for linear functions in $\mathbb{R}^n$.
The proof of Theorem 8 is broken down into several steps. The following result deals with the most interesting case.
Lemma 3 Assume that $\sigma$ is piecewise $C^2$ and that there exists a point $x_1 \notin \{a_1, \ldots, a_p\}$ where $\sigma''(x_1) \neq 0$.

Proof. The proof technique is similar to that of Theorem 5. We will show that an architecture consisting of linear and multiplication gates can be simulated on any finite set by an architecture of $\sigma$-gates with the same number of weights, up to a constant factor.

A $\sigma$-gate can be replaced by a subnetwork made of a constant number of threshold gates by the following relation:
Note that the role of the second sum is to compute the correct value of $\sigma$ at the breakpoints $a_1, \ldots, a_p$. □
If $\sigma$ is not piecewise-constant, there is a non-trivial (i.e., not reduced to one of the $a_i$'s) piece where $\sigma(x) = ax + b$ with $a \neq 0$. If this relation holds in fact over $\mathbb{R}$, $\sigma$ is affine and we are in the second case of Theorem 8.
In this case, the input-output mapping of the network is affine, so that the VC dimension is bounded by the number of weights (this observation goes back at least to 1965; see [5]).
If the relation $\sigma(x) = ax + b$ ($a \neq 0$) holds everywhere except at a nonempty finite number of points, we are in special case 3. The VC dimension of any architecture of $n$ weights is $O(n^2)$ by [6] (that paper actually deals with arbitrary piecewise-polynomial gate functions). The lower bound is established in Lemmas 6 and 7.
The only remaining case is that in which there exist at least two non-trivial pieces, and on at least one of them $\sigma$ is not constant. This leads again to quadratic VC dimension, as shown in the next lemma.

Proof. As in Lemma 3, the proof technique is similar to that of Theorem 5. We will show that an architecture consisting of linear and multiplication gates can be simulated on any finite set by an architecture of $\sigma$-gates with the same number of weights, up to a constant factor.
Assume for instance that [...] $\neq 0$. Linear gates can be simulated using $\sigma$ on $I$, just as in the proof of [...]

In order to deal with the last exceptional case in Theorem 8, we find it useful to introduce another auxiliary computation model, based on circuits of linear and equality gates. An equality gate has fan-in one and outputs $E(x) = 1$ when its input $x$ is equal to 0; it outputs $E(x) = 0$ otherwise. The following lemma completes the proof of Theorem 8 (note that we only need the direct implication).
Lemma 7 Let $\sigma$ be a real function of the form $\sigma(x) = \alpha x + \beta$ (with $\alpha \neq 0$) except at a finite number of points.
A network of linear and equality gates with $n$ weights can be simulated by a network of $\sigma$-gates with $O(n)$ gates. Conversely, a network of $\sigma$-gates with $n$ weights can be simulated by a network of linear and equality gates with $O(n)$ gates.
Proof. As usual, a linear gate can be simulated using $\sigma$ on any of its linear pieces.
Assume that $\sigma(x) = \alpha x + \beta$ except for $x \in \{a_1, \ldots, a_p\}$. For [...]

In Section 2, we showed that we can get VC dimension $n^2$ when using a mixture of $O(n)$ threshold and linear gates, compared to the known upper bound of $O(n \log n)$ which would hold if only threshold gates were allowed. The gain involved in adding linear gates would seem to be counterintuitive, since it is obviously possible to rewrite a network made up of linear and threshold gates as a network made exclusively of threshold gates. The explanation of this apparent paradox is that, when rewriting in this manner, the number of weights becomes as high as $O(n^2)$. The resulting weights are "shared" among gates. But such a sharing arrangement is not allowed in our definitions, and indeed, cannot be exploited when using standard Cover-like counting arguments, as in [5, 2], which are dependent upon the fan-in of gates.
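The growth in the number of weights when linear gates are eliminated can be seen on a generic toy case. The layout below is an assumption chosen for illustration and is not the network of Proposition 1: one linear gate reads $n$ values and feeds $k$ threshold gates. After absorbing the linear gate, every threshold gate reads the $n$ values directly, and its weights are products of the old ones, so the same few parameters reappear in many gates.

```python
def heaviside(x):
    """Threshold gate H: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def mixed_net(h, c, c0, a, b):
    """One linear gate z = c0 + sum_i c_i h_i feeding k threshold
    gates H(a_j * z + b_j)."""
    z = c0 + sum(ci * hi for ci, hi in zip(c, h))
    return [heaviside(aj * z + bj) for aj, bj in zip(a, b)]

def threshold_only_net(h, c, c0, a, b):
    """Same outputs without the linear gate: gate j has input weights
    a_j * c_i and bias a_j * c0 + b_j."""
    return [heaviside((aj * c0 + bj)
                      + sum(aj * ci * hi for ci, hi in zip(c, h)))
            for aj, bj in zip(a, b)]

def weight_counts(n, k):
    """(weights with the linear gate, weights after eliminating it)."""
    mixed = (n + 1) + 2 * k
    threshold_only = k * (n + 1)
    return mixed, threshold_only

h, c, c0 = [1, 0, 1], [2, -3, 1], -1
a, b = [1, -1, 2], [0, 1, -5]
print(mixed_net(h, c, c0, a, b) == threshold_only_net(h, c, c0, a, b))
print(weight_counts(10, 10))
```

With $k = n$ the count goes from $3n + 1$ to $n^2 + n$, while the number of independent parameters stays linear in $n$.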
As a simple illustration of this process, consider a way of rewriting the network obtained in Proposition 1 as a network which only employs threshold gates. We sketch next how one eliminates linear gates in that case.
First of all, the function $f$ [...] by its thresholded value. This construction does not involve any linear gates. Each entry of $C$ occurs in the gates computing $w_1, \ldots, w_n$, but these $n$ instances can be considered as a single instance of a "shared weight". We conclude that it is possible to shatter the same set as in Proposition 1 by means of an architecture of threshold gates with $O(n^2)$ "non-programmable" (i.e., constant) weights and only $O(n)$ "programmable" shared weights (the entries of $C$ and affine functions of $d$). Note that without weight sharing, the VC dimension of a threshold network with $n$ programmable weights remains $O(n \log n)$ by the counting argument of [2].
A more restrictive type of weight-sharing has been studied in the neural network literature, and proved to be useful in invariant recognition tasks [8]. A formal model is studied in [12], and it is shown that the VC dimension remains $O(n \log n)$. In this model one assumes that there is an equivalence relation between weights; this is similar to our weight-sharing mechanism. However, one also assumes that there is an equivalence relation on nodes, and that this relation is compatible (in a precise sense) with the equivalence relation on weights. This makes the model more restrictive, explaining the smaller VC dimension.
