Abstract
Introduction
In this paper we shall consider feedforward neural networks (NNs) made of linear threshold gates (TGs). A neuron (i.e., linear TG) will compute a Boolean function (BF) $f : \{0,1\}^n \to \{0,1\}$, where one of the $k$ input vectors is $Z_k = (z_{k,0}, \ldots, z_{k,n-1}) \in \{0,1\}^n$ and $f(Z_k) = \mathrm{sgn}\left(\sum_{i=0}^{n-1} w_i z_{k,i} + \theta\right)$, with the synaptic weights $w_i \in \mathbb{R}$, the threshold $\theta \in \mathbb{R}$, and sgn the sign function. Two cost functions commonly associated with a NN are: (i) depth, which is the number of layers (or the number of edges, if we consider unit length for all the edges connecting the TGs) on the longest input to output path; and (ii) size (or node complexity), which is the number of neurons (TGs). Unfortunately, these measures are not the best criteria for ranking different solutions when going for silicon [21], as "comparing the number of nodes is inadequate for comparing the complexity of neural networks as the nodes themselves could implement quite complex functions" [41]. The fact that size is not a good estimate for area is explained as follows: (i) the area of one neuron can be related to its associated weights; and (ii) the area of the connections is, in most cases, neglected. Here are several alternatives of how the area could scale in relation to weights and thresholds:
for purely digital implementations, the area scales at least with the cumulative size of the weights and thresholds (the bits for representing these weights and thresholds have to be stored); for certain types of analog implementations (e.g., using resistors or capacitors), the same type of scaling is valid (in particular cases, analog implementations can have binary encoding, thus the area would scale with the cumulative log-scale size of the parameters); there are some types of implementations (e.g., transconductance ones) which offer a constant size per element, thus scaling only with $\sum_{NN}$ fan-ins.
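As a minimal sketch of the definitions above, the following Python fragment evaluates a linear TG and computes three of the area estimates discussed in this Introduction (sum of fan-ins, number of bits, and sum of absolute values). The function names, the convention that sgn returns 1 for a non-negative argument and 0 otherwise, and the restriction to nonzero integer parameters are assumptions made for illustration only.

```python
import math
from itertools import product


def threshold_gate(weights, theta, z):
    """Linear TG: 1 if sum_i w_i * z_i + theta >= 0, else 0 (assumed sgn convention)."""
    return 1 if sum(w * zi for w, zi in zip(weights, z)) + theta >= 0 else 0


def area_estimates(network):
    """network: list of (weights, theta) pairs with nonzero integer parameters."""
    params = [(*w, t) for w, t in network]
    return {
        "size": len(network),                                   # number of TGs
        "fan_ins": sum(len(w) for w, _ in network),             # sum of fan-ins
        "bits": sum(math.ceil(math.log2(abs(v))) for p in params for v in p),
        "abs_sum": sum(abs(v) for p in params for v in p),      # sum |w_i| + |theta|
    }


# Two hypothetical gates: 2-input AND and 2-input OR.
AND2 = ((1, 1), -2)
OR2 = ((1, 1), -1)

and_table = [threshold_gate(*AND2, z) for z in product((0, 1), repeat=2)]
or_table = [threshold_gate(*OR2, z) for z in product((0, 1), repeat=2)]
estimates = area_estimates([AND2, OR2])
```

Even on this tiny example the estimates differ (2 gates, 4 connections, 1 bit, magnitude 7), which is the reason size alone is a poor proxy for area.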
All these 'cost functions' are linked to VLSI by the assumptions one makes on how the area of a chip scales with the weights and the thresholds [5, 9, 10]. That is why several other measures (i.e., 'cost functions') beside size have already been used: the total number of connections, or $\sum_{NN}$ fan-ins, has been used by several authors [1, 25, 30, 33]; the total number of bits needed to represent the weights and thresholds, $\sum_{NN} \left(\sum_i \lceil \log |w_i| \rceil + \lceil \log |\theta| \rceil\right)$, has been used by others [22, 23, 41]; the sum of all the absolute values of the weights and thresholds, $\sum_{NN} \left(\sum_i |w_i| + |\theta|\right)$, has also been advocated [5, 10, 18, 19, 21], while another similar 'cost function' is $\sum_{NN} \left(\sum_i w_i^2 + \theta^2\right)$, which has been used in the context of genetic programming for reaching minimal NNs [44]. The sum of all the absolute values of the weights and thresholds has also been used as an optimum criterion for: (i) linear programming synthesis [32]; (ii) defining the minimum-integer TG realisation of a function [27]. Recently [3], the same measure (under the name of "total weight magnitude") has been used in the context of computational learning theory. It was proven that the generalisation [...] showing that the misclassification probability converges at [...] and $c$ is a constant.
With respect to delay, two VLSI models have been commonly in use [38]: the simplest one assumes that delay is proportional to the input capacitance, hence a TG introduces a delay proportional to its fan-in; a more exact one considers the capacitance along any wire, hence the delay is proportional to the length of the connecting wires. It is worth emphasising that it is anyhow desirable to limit the range of parameter values [42] for VLSI implementations, be they digital or analog, because both the maximum value of the fan-in [28, 40] and the maximal ratio between the largest and the smallest weight [23, 24, 29, 42] cannot grow over a certain (technological) limit.
In this paper $\lfloor x \rfloor$ is the floor of $x$, i.e. the largest integer less than or equal to $x$, and $\lceil x \rceil$ is the ceiling of $x$, i.e. the smallest integer greater than or equal to $x$. All the logarithms are to the base 2, unless explicitly mentioned otherwise.
The focus of this paper will be on NNs having limited fan-in (the fan-in will be denoted by $\Delta$). We will present both theoretical proofs and simulations in support of our claim that VLSI- and size-optimal NNs can be obtained for small fan-ins. For simplification, we shall consider only NNs having $n$ binary inputs and $p$ binary outputs (if real inputs and outputs are needed, it is always possible to quantize them up to a certain number of bits such as to achieve a desired precision [4, 7, 11, 24]). In Section 2 we shall present a theoretical proof for the $\mathbb{F}_{n,m}$ class of functions, showing that their VLSI-optimal implementation is achieved with small constant fan-ins. Section 3 deals with BFs, and details the generalisation of a result from [26] to arbitrary fan-ins. Based on that generalisation we will show that the size can be minimised for small fan-ins. Finally, in Section 4 we will suggest how to implement $\mathbb{F}_{n,m}$ functions in size-optimal NNs having small constant fan-ins.
Conclusions, open questions and further directions for research complete the paper. Due to space limitations some of the lengthy mathematical proofs suggested in [6, 7] have been omitted, but can be found in [9, 10].
VLSI-optimal neural implementations of $\mathbb{F}_{n,m}$ functions
[...] BFs). This class of functions has been introduced [...]
Proposition 1 (Theorem 1 from [34])
The complexity realisation (i.e., number of threshold elements) of $\mathbb{F}_{n,m}$ (the class of Boolean functions $f(x_1 x_2 \ldots x_{n-1} x_n)$ that have exactly $m$ groups of ones) is at most $2(2m)^{1/2} + 3$.
The construction has: a first layer of $\lceil (2m)^{1/2} \rceil$ TGs (COMPARISONs) with fan-in $= n$ and weights $\le 2^{n-1}$; a second layer of $2 \lceil (m/2)^{1/2} \rceil$ TGs of fan-in $= n + \lceil (2m)^{1/2} \rceil$ and weights $\le 2^n$; and one more TG of fan-in $= 2 \lceil (m/2)^{1/2} \rceil$ and weights $\in \{-1, +1\}$ in the third layer.
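To make the quantities in Proposition 1 concrete, here is a small Python sketch that counts the groups of ones of a BF and lists the layer sizes of the three-layer construction. It assumes that the truth table is listed in increasing order of the input read as an integer, and that a group is a maximal run of consecutive ones in that order; both the function names and this reading are illustrative assumptions.

```python
import math


def count_groups_of_ones(truth_table):
    """Count maximal runs of consecutive 1s (assumed meaning of 'groups of ones')."""
    groups, previous = 0, 0
    for bit in truth_table:
        if bit == 1 and previous == 0:
            groups += 1
        previous = bit
    return groups


def three_layer_sizes(m):
    """Layer sizes of the construction: ceil(sqrt(2m)), 2*ceil(sqrt(m/2)), 1."""
    first = math.ceil(math.sqrt(2 * m))
    second = 2 * math.ceil(math.sqrt(m / 2))
    return first, second, 1


example_table = [0, 1, 1, 0, 0, 1, 0, 1]   # hypothetical 3-input BF
m = count_groups_of_ones(example_table)
layers_for_8_groups = three_layer_sizes(8)
```

The total number of TGs is the sum of the three layer sizes, which grows like the square root of $m$ rather than linearly in $m$.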
Red'kin also proved that if the implementation of BFs of this type is restricted to circuits having no more than three layers, then the upper bound (following his method of synthesis) is equal to the lower bound obtained from capacity considerations. Although this construction is size-optimal, it is not VLSI-optimal, as the weights and thresholds are exponential.
Another 
This constructive class of solutions (which we shall call $B_\Delta$) was proposed in [18, 19], and is based on decomposing COMPARISON in a tree structure. The NN has a first layer of 'partial' COMPARISONs ($\lfloor \Delta/2 \rfloor$ bits from $X$ and $\lfloor \Delta/2 \rfloor$ bits from $Y$) followed by a $\Delta$-ary tree of TGs combining these partial results. The fact that the BFs implemented by the nodes are linearly separable functions was proven in [17, 21]. The network has: [...] [6, 7]: take $\Delta = 2 \lceil \sqrt{n} \rceil$, which leads to $B_{2 \lceil \sqrt{n} \rceil}$, and the NN has depth $= 2$, size $= \lceil 2\sqrt{n} \rceil$, with weights and thresholds of at most $2^{\lceil \sqrt{n} \rceil}$.
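The depth-2 choice of fan-in can be checked numerically. The sketch below uses the depth expression $\lceil \log n / (\log \Delta - 1) \rceil$ quoted for this class of solutions later in the paper, together with the $2^{\Delta/2}$ bound on weights and thresholds; the function names are hypothetical and the code is only an illustration under those two formulas.

```python
import math


def b_delta_depth(n, delta):
    """Depth ceil(log2(n) / (log2(delta) - 1)); requires delta > 2."""
    if delta <= 2:
        raise ValueError("the depth expression needs a fan-in larger than 2")
    return math.ceil(math.log2(n) / (math.log2(delta) - 1))


def depth_two_fan_in(n):
    """Fan-in 2 * ceil(sqrt(n)) used for the depth-2 solution."""
    return 2 * math.ceil(math.sqrt(n))


def weight_bound(delta):
    """Bound 2^(delta/2) on weights and thresholds (even delta assumed)."""
    return 2 ** (delta // 2)


delta_64 = depth_two_fan_in(64)            # fan-in for 64-bit operands
depth_64 = b_delta_depth(64, delta_64)     # resulting depth
bound_64 = weight_bound(delta_64)          # largest weight magnitude
```

For 64-bit operands this gives a fan-in of 16, depth 2 and weights of at most 256, whereas a fan-in of 4 would need 6 layers: smaller fan-ins trade depth for smaller weights.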
For normal length COMPARISONs, Vassiliadis et al. [39] claim improvements over ROS [35] and SRK [37]. We present in Table 1 the results reported in [7]. Both VCB and $B_{2 \lceil \sqrt{n} \rceil}$ achieve better performances than SRK and ROS. For depth $= 2$, $B_{2 \lceil \sqrt{n} \rceil}$ outperforms VCB (both for 32-bit and for 64-bit operand lengths), while for depth $= 3$ the $B_\Delta$ solution has lower weights and fan-ins, but slightly more gates. Still, $B_\Delta$ has two main advantages: (i) being a class of solutions, it can be used for other depths (see the last column of Table 1); (ii) because the weights and the fan-ins are lower, the area should also be lower.
It is known that a VLSI design is considered optimum when judged on a combined measure $AT^2$ [38], thus:
$AT^2 = 3 \lceil n / (\lceil \log n \rceil + 1) \rceil \times 3^2 = O(n / \log n)$.
The natural extension of the circuit complexity results to VLSI complexity ones is obtained by using the closer estimates for the area and the delay (as discussed in the Introduction).
Proposition 6
If the area of a neural network is estimated as $\sum_{NN}$ fan-ins, there are neural networks computing the COMPARISON of two $n$-bit numbers which occupy between $O(n)$ and $O(n^2)$ area:
For all these different estimations of $A$ and $T$, the $AT^2$ complexity values have been computed, ordered, and can be seen in Table 2.
Not wanting to complicate the proof, we shall determine the VLSI-optimal fan-in when implementing COMPARISON (in fact an $\mathbb{F}_{n,1}$ function), for which several solutions were detailed in Propositions 2 to 9. The same result is valid for $\mathbb{F}_{n,m}$ functions, as can be intuitively expected: the delay is determined by the first layer of COMPARISONs, while the area is mostly influenced by the same first layer of COMPARISONs (the area for implementing the MAJORITY gate can be neglected [15, 21]). From the alternatives presented in Table 2, we have chosen $\sum_{NN} \left(\sum_i |w_i| + |\theta|\right)$ for area and depth for delay, but other estimates lead to similar results (the optimal $AT^2$ being $O(n \log^2 n)$ in four out of nine cases; see Table 2). To get a better understanding, the $AT^2$ values have been computed for variable fan-ins and for different numbers of inputs $n$, and can be seen in Figure 1.
Proposition 10 The VLSI-optimal neural network which computes the COMPARISON of two $n$-bit numbers has small constant fan-in 'neurons' with small constant bounded weights and thresholds.
Proof From Propositions 3 and 7:
and we compute the derivative: 
which, when equated to zero, leads to $\ln \Delta \, (\Delta \ln 2 - 2) = 4$ (again a transcendental equation). Its root lies slightly above 6, so $\Delta_{optim} = 6$ is the nearest integer solution, and because the weights and the thresholds are bounded by $2^{\Delta/2}$ (Proposition 3) the proof is concluded.
□
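The transcendental equation in the proof is easy to solve numerically. The following Python sketch finds its root by bisection and, as a cross-check, minimises over integer fan-ins a factor of the form $2^{\Delta/2} / (\Delta \log^2 \Delta)$. That factor is an assumption: it is simply a form whose derivative vanishes exactly when $\ln \Delta \, (\Delta \ln 2 - 2) = 4$, and it is not claimed to be the expression used in the original proof.

```python
import math


def stationarity(delta):
    """Left side minus right side of ln(delta) * (delta * ln 2 - 2) = 4."""
    return math.log(delta) * (delta * math.log(2) - 2) - 4


def bisect_root(f, lo, hi, tol=1e-9):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def assumed_at2_factor(delta):
    """ASSUMED fan-in dependent part of AT^2: 2^(delta/2) / (delta * log2(delta)^2)."""
    return 2 ** (delta / 2) / (delta * math.log2(delta) ** 2)


root = bisect_root(stationarity, 3, 10)                  # real-valued optimum
best_integer = min(range(3, 33), key=assumed_at2_factor)  # integer optimum
```

Under these assumptions the real-valued root falls between 6 and 7 and the best integer fan-in is 6, in agreement with the value stated in the proof.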
To get a better understanding, the $AT^2$ values have been computed for variable fan-ins and different numbers of inputs $n$, as can be seen in Figure 1, while Figure 2 presents exact plots of the $AT^2$ measure which support our previous claim $\Delta_{optim} = 6 \ldots 9$ (as the proof has been obtained using several approximations: neglecting ceilings, using the complexity estimate, etc.).
Size-optimal neural implementations of BFs
We start from the classical construction developed by Shannon [36] for synthesising one BF with fan-in 2 AND-OR gates. It was extended to the multioutput case and modified to apply to NNs by Horne & Hush [26] .
Proposition 11 (Theorem 3 from [26]) [...]
$f(x_1 x_2 \ldots x_{n-1} x_n) = \bar{x}_1 f_0(x_2 \ldots x_{n-1} x_n) + x_1 f_1(x_2 \ldots x_{n-1} x_n)$.
By doing this recursively for each subfunction, the output BFs will, in the end, be implemented by binary trees. Horne & Hush [26] [...] (Figure 2 from [26]). That $q$ which minimises the size of the two subnetworks is determined by solving $d(size)/dq = 0$, and gives:
By substituting this value in (2) and (3), the minimum size:
$size \approx 3p \cdot 2^{n-q} \approx 3p \cdot 2^n / (n + \log p)$
is obtained.
□
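The decomposition on which this construction rests can be verified mechanically. The sketch below is a minimal Python illustration of the recursive Shannon expansion of a BF given as a truth table, using fan-in 2 AND-OR combinations at each step; the truth-table ordering (first variable most significant) and the function names are assumptions made for the example.

```python
from itertools import product


def cofactors(truth_table):
    """Split a truth table into f0 (x1 = 0) and f1 (x1 = 1); x1 is most significant."""
    half = len(truth_table) // 2
    return truth_table[:half], truth_table[half:]


def shannon_eval(truth_table, x):
    """Evaluate f(x) as (not x1 AND f0(rest)) OR (x1 AND f1(rest)), recursively."""
    if len(truth_table) == 1:
        return truth_table[0]
    f0, f1 = cofactors(truth_table)
    x1, rest = x[0], x[1:]
    return ((1 - x1) & shannon_eval(f0, rest)) | (x1 & shannon_eval(f1, rest))


parity3 = [0, 1, 1, 0, 1, 0, 0, 1]   # hypothetical example: 3-input parity
rebuilt = [shannon_eval(parity3, x) for x in product((0, 1), repeat=3)]
```

Each recursion level removes one variable, which is why the resulting trees are binary and have as many levels as there are decomposed variables.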
We will use a similar approach for the case when the fan-in is limited by $\Delta$.
Proposition 12 Arbitrary Boolean functions f
Proof We use the approach of Horne & Hush [26] and limit the fan-in to $\Delta$. Each output BF can be decomposed in $2^{\Delta-1}$ subfunctions (i.e., $2^{\Delta-1}$ AND gates). The OR gate would have $2^{\Delta-1}$ inputs. Thus, we have to decompose it in a $\Delta$-ary tree of fan-in $= \Delta$ OR gates. This first decomposition step eliminates $\Delta - 1$ variables and generates a tree of [...]
Repeating this procedure recursively k times, we have:
where the subfunctions depend only on $q = n - k\Delta$ variables. We now generate all the possible subfunctions of $q$ variables with a subnetwork of
The inequality (7) can be proved by induction. Clearly, $size \cdot 2^{2^0} < (size + 1) \cdot 2^{2^0}$. Let us consider the statement true for $\alpha$; we prove it for $\alpha + 1$:
(due to hypothesis), thus:
$(size + 1) \cdot 2^{2^{\alpha\Delta}} < 2^{2^{(\alpha + 1)\Delta}}$
and computing the logarithm of the left side: $2^{\alpha\Delta} + \log(size + 1)$
From (4) and (6) 
and using the notations $k\Delta = \gamma$, $\beta = p(\Delta - 1) / (\Delta \ln 2)$, and taking logarithms of both sides:
which has an approximate solution $\gamma \approx n - \log(n + \log p)$.
The same result can be obtained by computing with finite differences (instead of approximating the partial derivative):
size , ,
and after taking twice the logarithm of both sides and using the same notations we have:
which has as the approximate solution:
Starting again from (10), we compute $\partial size / \partial \Delta = 0$:
which, by neglecting $2^{\gamma + \Delta} / \{k (\ln 2) \cdot 2^n\}$, gives:
2n-y-A l o g P + 2 y -k -n = i.e., the same equation as (1 1).
[...] (close) vicinity of the parabola $k\Delta = n - \log(n + \log p)$. [...] Figure 3. From Figure 3(a), (b) and (c) it may seem that $k$ and $\Delta$ used in the proof of Proposition 12 have the same influence on $size_{BFs}$. The discrete parabola-like curves approximating $k\Delta = n - \log(n + \log p)$ can be seen in Figure 3 [...]
Sketch of proof We will analyse only the critical points by using the approximation $k\Delta = n - \log n$. Intuitively the claim can be understood if we replace this value in (10):
which clearly is minimised for $\Delta = 2$. The detailed proof computes $size_{BFs}(n, p, k, \Delta)$ for $k = (n - \log n)/\Delta$ (i.e., $size_{BFs}(n, p, \Delta)$), and shows that:
Hence, the function is monotonically increasing and the minimum is obtained for the smallest fan-in $\Delta = 2$. Because the proof has been obtained using successive approximations, several simulation results are presented in Table 3. It can be seen that while for relatively small $n$ the size-optimal solutions are obtained even for $\Delta = 16$, starting from $n \ge 64$ all the size-optimal solutions are obtained for $\Delta = 2$.
It is important that the other relative minima (on, or in the vicinity of, the parabola $k\Delta = n - \log n$) are only slightly larger than the absolute minimum. They might be of practical interest as leading to networks having fewer layers: $n / \log \Delta$ instead of $n$. Last, but not least, it is to be mentioned that all these relative minima are obtained for fan-ins strictly lower than linear, as $\Delta \le n - \log n$.
Size-optimal neural implementations of $\mathbb{F}_{n,m}$ functions
[...] where depth $B_\Delta = \lceil \log n / (\log \Delta - 1) \rceil$, but a substantial enhancement is obtained if the fan-in is limited. Due to the limitation, the maximum number of different BFs which can be computed in each layer is:
For large enough $m$ (needed for achieving a certain precision [10, 23, 42]), and/or large enough $n$, the first terms of the sum (13) will be larger than the equivalent ones from (14). This is equivalent to the trick from [26], as the lower levels will compute all the possible functions using only limited fan-in COMPARISONs. Hence, the optimum size becomes:
[...] explained as the same BFs are computed redundantly. In terms of fan-in, several exponentially decreasing terms will be replaced by double exponential increasing terms.
Following similar steps to the ones used in Proposition 13, it is possible to show that the minimum size is obtained for $\Delta = 3$. To get a better understanding we have done extensive simulations by considering that $m = 2^{\varepsilon n}$. Some of the results of these simulations can be seen in Figure 4.
They show that it is always possible to obtain a significant reduction of the size by properly choosing a small constant fan-in. It is to be mentioned that the size reduction is by a huge factor, which is of the form $2^{\varepsilon n - c}$, for very small fan-ins $\Delta_{optim} = 4 \ldots 6$.
Conclusions and further work
The paper has focused on sparsely connected NNs, i.e. NNs having either (small) constant fan-ins, or fan-ins at most logarithmic in the number of inputs $n$. Using different cost functions, which are closer estimates than size and depth for the area and the delay of a VLSI chip, we have been able to prove that VLSI-optimal implementations of $\mathbb{F}_{n,m}$ functions are obtained by small constant fan-ins.
Concerning size-optimal solutions, we have shown that:
arbitrary BFs require small, but not necessarily constant, fan-ins (at most $n - \log n$);
$\mathbb{F}_{n,m}$ functions require small constant fan-ins.
Some of these results have already been applied to optimising the VLSI design of a neural constructive algorithm [5, 14, 15]. We are working on a mixed constructive algorithm which, after quantizing the input space as in [4, 11, 12, 13], could synthesise $\mathbb{F}_{n,m}$ functions, arbitrary BFs, or a mixture of them such as to reduce the area of the resulting VLSI chip. This could also have applications to automatic synthesis of mixed analog/digital circuits [8, 16]. An alternative solution currently under investigation [14] is to use such a synthesis step after quantizing the input space as detailed in [24].
[...] finding closer estimates (i.e., other cost functions) for optimal mixed analogue/digital implementations.
The main conclusion is that VLSI-optimal solutions can be obtained for small constant fan-ins. We mention here that there are similar small constants relating to our capacity of processing information [31].
