This section considers some of the alternative approaches towards modeling biological functions by digital circuits. It starts by introducing some circuit complexity issues and arguing that there is considerable computational and physiological justification that shallow threshold gate circuits are computationally more efficient than classical Boolean circuits. We comment on the tradeoff between the depth and the size of a threshold gate circuit, and on how design parameters like fan-in, weights and thresholds influence the overall area and time performances of a digital neural chip. This is followed by briefly discussing the constraints imposed by digital technologies and by detailing several possible classification schemes as well as the performance evaluation of such neurochips and neurocomputers. Lastly, we present many typical and recent examples of implementation and mention the 'VLSI-friendly learning algorithms' as a promising direction of research.
• Baum (1988b) presented a network with one hidden layer having m/n neurons capable of realizing an arbitrary dichotomy on a set of m points in general position in $\mathbb{R}^n$; if the points are on the corners of the n-dimensional hypercube (i.e., binary vectors), m − 1 nodes are still needed (the general position condition is now special and strict).
• Huang and Huang (1991) proved a slightly tighter bound: only 1 + (m − 2)/n neurons are needed in the hidden layer for realizing an arbitrary dichotomy on a set of m points which satisfy a more relaxed topological assumption as only the points forming a sequence from some subsets are required to be in general position; also the m − 1 nodes condition was shown to be the least upper bound needed.
• Arai (1993) recently showed that m − 1 hidden neurons are necessary for arbitrary separability (any mapping between input and output for the case of binary-valued units), but improved the bound for the two-category classification problem to m/3 (without any condition on the inputs).
A study which attempts to unify these two lines of research has been published by Bulsari (1993), who gives practical solutions for one-dimensional cases including an upper bound on the number of nodes in the hidden layer(s). Extensions to the n-dimensional case using three- and four-layer solutions are derived under piecewise constant approximations having constant or variable width partitions and under piecewise linear approximations using ramps instead of sigmoids.
To strengthen such claims, we shall go briefly through some basic circuit complexity results (Papadopoulos and Andronikos 1995, Parberry 1994, Paterson 1992, Pippenger 1987, Roychowdhury et al 1991a, b, 1994b, Siu et al 1994) and argue that there is considerable computational and physiological justification that shallow (i.e., having relatively few layers) threshold gate circuits are computationally more efficient than classical Boolean circuits. When considering computational complexity, two classes of constraints can be distinguished:
• Some arising from the physical constraints (related to the hardware in which the computations are embedded) and including time constants, energy limitations, volumes, geometrical relations and bandwidth capacities.
• Others are logical constraints: (i) computability constraints and (ii) complexity constraints which give upper and/or lower bounds on some specific resource (e.g., size and depth required to compute a given function or class of functions).
The first aspect when comparing Boolean and threshold logic is that they are equivalent in the sense that any Boolean function can be implemented using either logic in a circuit of depth-2 and exponential size (simple counting arguments show that the fraction of functions requiring a circuit of exponential size approaches one as n → ∞ in both cases). Yet, threshold logic is more powerful than Boolean logic, as a Boolean gate can compute only one function whereas a threshold gate can compute up to the order of $2^{\alpha n^2}$ functions by varying the weights, with 1/2 ≤ α ≤ 1 (see Muroga 1962 for the lower bound, and Muroga 1971 and Winder 1962 for the upper bound). An important result which clearly separates threshold and Boolean logic is due to Yao (1985) (see also Håstad 1986 and Smolensky 1987) and states that in order to compute a highly oscillating function like PARITY in a constant depth circuit, at least exp[c(n k ) 1/2 ] Boolean gates with unbounded fan-in are needed (Furst et al 1981, Paturi and Saks 1990). In contrast, a depth-2 threshold gate circuit for PARITY has linear size.
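The linear-size claim for PARITY can be checked with a small sketch. The construction below is one standard way to build such a circuit and is not taken from the references above: hidden gate k fires when at least k inputs are 1, and the output gate combines the hidden gates with alternating +1/−1 weights. The function names are illustrative.

```python
from itertools import product

def threshold_gate(inputs, weights, threshold):
    """Linear threshold gate: outputs 1 iff the weighted sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def parity_depth2(x):
    """PARITY of the bits in x using n hidden threshold gates plus one output gate."""
    n = len(x)
    # Hidden gate k (k = 1..n) fires iff at least k of the inputs are 1.
    hidden = [threshold_gate(x, [1] * n, k) for k in range(1, n + 1)]
    # Output gate: alternating +1/-1 weights, so the weighted sum is 1 for an
    # odd number of ones and 0 for an even number.
    out_weights = [1 if k % 2 == 1 else -1 for k in range(1, n + 1)]
    return threshold_gate(hidden, out_weights, 1)

def check_parity(n):
    """Exhaustively compare the circuit with the definition of PARITY on n bits."""
    return all(parity_depth2(x) == sum(x) % 2 for x in product((0, 1), repeat=n))
```

The circuit uses n + 1 gates in two layers, so its size is linear in n, at the price of unbounded fan-in in both layers.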
Another interesting aspect is the tradeoff between the depth and the size of a circuit (Beiu 1994, 1997, Beiu and Taylor 1996a, Beiu et al 1994c, Siu and Bruck 1990c, Siu et al 1991b). There exists a very strong bias in favor of shallow circuits (Judd 1988, 1992) for several reasons. First, for a fixed size, the number of different functions computable by a circuit of small depth is larger than the number of those computed by a deeper circuit. Second, such a circuit is also faster, because its delay grows with its depth. Finally, one should notice that biological circuits must be shallow, at least within certain modules like the cortical structures: the overall response time (e.g., recognizing a known person from a noisy image) is known to be in the few hundred millisecond range, although the devices themselves are slow (the response time of biological neurons being at least in the 10-ms range due to the refractory period). Other theoretical results (Abu-Mostafa 1988a, b) also support the shallow architecture of such circuits.
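The biological argument amounts to a simple ratio. As a rough illustration, taking 300 ms as an assumed representative value for "a few hundred milliseconds" and 10 ms per neuron, the number of sequential processing stages is bounded by:

```latex
\[
  d_{\max} \;\approx\; \frac{T_{\text{response}}}{T_{\text{neuron}}}
  \;\approx\; \frac{300\ \text{ms}}{10\ \text{ms}} \;=\; 30 .
\]
```

So only a few tens of layers can be traversed within the observed response time, whatever the number of neurons involved.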
A lot of work has been devoted to finding minimum size and/or minimum constant-depth threshold gate circuits (Hajnal et al 1987, Hofmeister et al 1991, Razborov 1987, Roychowdhury et al 1994a, Siu and Bruck 1990a, 1993b, Siu and Roychowdhury 1993), but little is known about tradeoffs between those two cost functions (Beiu et al 1994c, Siu et al 1991b), and even less about how design parameters like fan-in, weights and thresholds influence the overall area and time performances of a digital neural chip. Since for the general case only existence exponential bounds are known (Bruck and Smolensky 1992, Siu et al 1991b), it is important to isolate classes of functions whose implementations are simpler than that of others (e.g., shallow depth and polynomial size (Rief 1987)). Several of the corner-stone results obtained so far have been gathered in table E1.4.1. Here n is the number of input variables, and the nomenclature commonly in use is (see Amaldi and Mayoraz 1992, Papadopoulos and Andronikos 1995, Parberry 1994, Roychowdhury et al 1994b, Siu et al 1994, Wegener 1987):
• $AC^k$ represents the circuits of polynomial size with AND and OR unbounded fan-in gates and depth $O(\log^k n)$
• $NC^k$ is the class of Boolean functions with bounded fan-in, and having size $n^c$ (polynomial) and depth $O(\log^k n)$
• $TC^0$ is the family of functions realized by polynomial size threshold gate circuits with unbounded fan-in and constant depth
• $LT_1$ ($\widehat{LT}_1$) is the class of Boolean functions computed by linear threshold gates with real weights (bounded by a polynomial in the number of inputs, $|w_i| \le n^c$ (Bruck 1990))
• $LT_k$ is the class of Boolean functions computed by a polynomial size, depth-k circuit of $LT_1$ gates (Bruck 1990, Siu and Bruck 1990b)
• $PT_1$ is the class of Boolean functions that can be computed by a single threshold gate in which the number of monomials is bounded by a polynomial in n (Bruck 1990, Bruck and Smolensky 1992)
• $PT_k$ is the class of Boolean functions computed by a polynomial size, depth-k circuit of $PT_1$ gates
• $PL_1$ is the class of Boolean functions for which the spectral norm $L_1$ is bounded by a polynomial in n (Bruck and Smolensky 1989)
• $PL_\infty$ is the class of Boolean functions with the spectral norm $L_\infty^{-1}$ bounded by a polynomial in n (Bruck and Smolensky 1989)
• $MAJ_1$ is the class of Boolean functions computed by linear threshold gates having only ±1 weights (Mayoraz 1991, Siu and Bruck 1990c)
• $MAJ_k$ is the class of Boolean functions computed by a polynomial size, depth-k circuit of $MAJ_1$ gates (Albrecht 1992, Mayoraz 1992, Siu and Bruck 1993).
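To make the relation between gates with bounded integer weights and gates with only ±1 weights more concrete, here is a minimal sketch of the usual replication argument: an integer weight w on an input is replaced by |w| copies of that input, each with weight +1 or −1. It assumes integer weights and keeps the threshold unchanged; the helper names are hypothetical and the code is not taken from the cited papers.

```python
from itertools import product

def lt_gate(x, weights, threshold):
    """Linear threshold gate with arbitrary integer weights."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= threshold else 0

def to_unit_weights(weights):
    """Expand integer weights into +1/-1 weights on replicated inputs.

    Returns (wiring, unit_weights): wiring[j] is the index of the original
    input feeding replicated line j.
    """
    wiring, unit_weights = [], []
    for i, w in enumerate(weights):
        wiring += [i] * abs(w)
        unit_weights += [1 if w > 0 else -1] * abs(w)
    return wiring, unit_weights

def unit_weight_gate(x, wiring, unit_weights, threshold):
    """Evaluate the equivalent gate that only uses +1/-1 weights."""
    replicated = [x[i] for i in wiring]
    return lt_gate(replicated, unit_weights, threshold)
```

The fan-in of the expanded gate is the sum of the absolute weights, which stays polynomial in n whenever the weights themselves are polynomially bounded.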
Recently three complexity classes for sigmoid feedforward neural networks have been defined and linked with the (classical) above-mentioned ones:
• $NN^k$ is defined (Shawe-Taylor et al 1992) to be the class of functions which can be computed by a family of polynomially sized neural networks of depth h, with a given fan-in and with weights and threshold values determined to b bits of precision (accuracy)
• A second class is defined (Beiu et al 1994e, Beiu and Taylor 1996b) to be the class of functions which can be computed by a family of polynomially sized neural networks which satisfies slightly less restrictive conditions for fan-in and accuracy: the logarithm of the fan-in is $O(\log^{1-\varepsilon} n)$ and $b = O(\log^{1-\varepsilon} n)$
• A third class is defined (Beiu et al 1994d, Beiu and Taylor 1996b) to be the class of functions which can be computed by a family of polynomially sized neural networks having linear fan-in and logarithmic accuracy (fan-in $O(n)$ and $b = O(\log n)$).
Still, in many situations one is concerned with the values of a function for just a vanishingly small fraction of the $2^n$ possible inputs. Such functions can also be implemented in poly-size shallow circuits; the size and depth of the circuit can be related to the cardinality of the set of interesting inputs (Beiu 1996b, Beiu and Taylor 1996a, Beiu et al 1994a, Tan and Vandewalle 1992). Such functions are also appealing from the learning point of view, the relevant inputs being nothing else but the set of training examples (Beiu 1996b, Beiu and Taylor 1995b, Linial et al 1989, Takahashi et al 1993).
Circuit complexity has certain drawbacks which should be mentioned:
• The extension of the poly-size results to other functions and to the continuous domain is not at all straightforward (Maass et al 1991)
• Even the known bounds (for the computational costs) are sometimes weak
• Time (i.e., delay) is not properly considered
• All complexity results are asymptotic in nature and may not be meaningful for the range of a particular application.
But the scaling of some important parameters with respect to some others represents quite valuable results:
• Area of the chip (wafer) grows like the cube of the fan-in
• Area of the digital chip (wafer) grows exponentially with accuracy.
Furthermore, it was shown recently that the fan-in and the accuracy are linearly dependent parameters. If the number of inputs to one neuron is n, the reduction of the fan-in by decomposition techniques has led to the following results:
• If the fan-in is reduced to (small) constants, the size grows slightly faster than the square of the number of inputs (i.e., $n^2 \log n$) while the depth growth is lower than logarithmic (i.e., $\log n / \log\log n$)
• Boolean decomposition can be used for reducing the fan-in, but at the expense of a superpolynomial increase in size ($(n^{\log n})^{1/2}$) and a double logarithmic increase in depth ($\log^2 n$).
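To get a feeling for these scaling laws, the following sketch simply evaluates the growth expressions quoted above. All hidden constants are set to 1, logarithms are taken base 2, and the exponential dependence on accuracy is assumed to be a doubling per extra bit; these are illustrative assumptions, so only ratios between values are meaningful.

```python
import math

def constant_fanin_cost(n):
    """Size ~ n^2 log n and depth ~ log n / log log n (constants set to 1, n > 2)."""
    size = n ** 2 * math.log2(n)
    depth = math.log2(n) / math.log2(math.log2(n))
    return size, depth

def relative_area(fan_in, bits, ref_fan_in=8, ref_bits=8):
    """Area relative to a reference neuron: cubic in fan-in, exponential in accuracy.

    The reference neuron (8 inputs, 8 bits) is an arbitrary choice for illustration.
    """
    return (fan_in / ref_fan_in) ** 3 * 2.0 ** (bits - ref_bits)
```

For instance, `relative_area(16, 8)` gives 8.0: doubling the fan-in costs about as much area as three extra bits of accuracy under these assumptions.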
Much better results can be achieved for a particular function. Due to such scaling problems, theoretical results show that we can implement (as digital chips or wafers) only neural networks having (sub)logarithmic accuracy and (sub)linear fan-in (with respect to the number of inputs n). From the practical point of view (the two parameters being dependent) these should be translated to (sub)logarithmic both for accuracy and for fan-in. The main conclusion is that fully parallel digital implementations of neural networks (as chips or wafers) are presently limited to artificial neural networks having $10^2$ to $10^3$ inputs and about $10^3$ to $10^4$ neurons of $10^2$ to $10^3$ inputs each. As will be seen later, these values are in good accordance with those from chips and wafers which stick as much as possible to a parallel implementation. Although we do expect that technological advances will push these limits, the improvements cannot be spectacular, at least in the near future.
Such drastic limitations have forced designers to approach the problem from different angles:
• By using time-multiplexing
• By building arrays of (dedicated) chips working together and exploiting as much as possible (in one way or another) the architectural concept of pipelining
• By using non-conventional techniques such as: stochastic processing (Taylor 1989a, Köllmann et al 1996), sparse memory architecture (Aihara et al 1996) or spike processing (Jahnke et al 1996).
These allow the simulation of far larger neural networks, by mapping them onto the existent (limited) hardware.
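Time-multiplexing, the first of the approaches listed above, can be sketched as follows: a layer with many (virtual) neurons is evaluated on a small number of physical processing elements, one batch of neurons after another. The cycle model (one multiply-accumulate per processing element per cycle, inputs broadcast one at a time) is an assumption made for illustration and does not describe any particular chip.

```python
def layer_direct(weights, x):
    """Reference: every neuron has its own hardware (one row of weights per neuron)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def layer_time_multiplexed(weights, x, n_pe):
    """Evaluate the same layer on n_pe physical processing elements.

    Returns (outputs, cycles), where one cycle is one multiply-accumulate
    step performed simultaneously by all physical elements.
    """
    n_neurons = len(weights)
    outputs = [0] * n_neurons
    cycles = 0
    for start in range(0, n_neurons, n_pe):       # one batch of virtual neurons
        batch = range(start, min(start + n_pe, n_neurons))
        for j, xj in enumerate(x):                # one broadcast input per cycle
            for i in batch:                       # performed in parallel in hardware
                outputs[i] += weights[i][j] * xj
            cycles += 1
    return outputs, cycles
```

With N neurons of fan-in F on P elements the sketch needs ceil(N/P) · F cycles, which shows how hardware size is traded directly against time.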
E1.4.3 Digital VLSI
Digital neurochips (and, thus, neurocomputers) benefit from the legacy of the most advanced human technology: digital information processing. VLSI technology is the main support for electronic implementations. It has been mature for many years, and allows a large number of processing elements to be mapped onto a small silicon area. That is why it has attracted many researchers (Alla et al 1990, Alspector et al 1988, Barhen et al 1992, Beiu 1989, Beiu and Rosu 1985, Boser et al 1992, Del Corso et al 1989, Disante et al 1989, 1990a, b, Faggin 1991, Fornaciari et al 1991b, Holler 1991, Jackel 1992, Mackie et al 1988, Personnaz et al 1989, Tewksbury and Hornak 1989, Weinfeld 1990).
The main constraints of VLSI come from the fact that the designer has to implement the processing elements on a limited two-dimensional area and, even more, connect these elements by means of a limited number of available layers. This leads to limited interconnectivity, as has been discussed in Akers et al (1988), Baker and Hammerstrom (1988), Hammerstrom (1988), Reyneri and Filipi (1991), Szedegy (1989) and Walker et al (1989), and to limited precision (higher precision requires larger area, both for storing and for processing, leading to fewer neurons per chip (Dembo et al 1990, Denker and Wittner 1988, Myhill and Kautz 1961, Obradovic and Parberry 1990, Stevenson et al 1990, Walker and Akers 1992)). The shallowness of slow biological neural networks has to be traded off for (somewhat) deeper networks made of higher speed elements. Besides these, power dissipation might impose another severe restriction (especially for wafer scale integration, WSI). The tradeoff is either to reduce the number of neurons per chip (working at high speed) or to reduce the clock rate (while having more neurons). Lastly, the number of available pins to get the information on and off the chip is another strong limitation.
From the biological point of view, synapses have to be restricted on precision and range to some small number of levels (Baum 1988a, Baum and Haussler 1989). Lower bounds on the size of the network have been obtained both for networks with real valued synaptic weights and for networks where the weights are limited to a finite number of possible values (Siu and Bruck 1990a). These bounds differ only by a logarithmic factor, but to achieve near optimal performance O(m) levels are required (Baum 1988b), where m is the number of training examples given. A similar logarithmic factor has been proven in Hong (1987), Raghavan (1988) and Sontag (1990) when replacing real weights by integers. Some results concerning the needed number of quantization levels have already been presented in Section E1.2.2 and can be supplemented by many references. For example, Baker and Hammerstrom (1988), Hollis et al (1990), Höhfeld (1990), Shoemaker et al (1990), Allipi (1991), Asanović and Morgan (1991), Holt and Hwang (1991, 1993), Nigri (1991) and Xie and Jabri (1991) argue that the execution phase needs roughly 8 bits (6 to 10), while learning demands about 16 bits (14 to 18). There are a few exceptions: Halgamuge et al (1991) is the only pessimistic one, claiming that 32 bits are needed, while Reyneri and Filipi (1991) claim that 20 to 22 bits are needed in general, but explicitly mention that this value can be reduced to 14 to 15 bits or even lower by properly choosing the learning rate (for backpropagation). New weight discretization learning algorithms can go much lower: to just several bits (see Section E1.2.4). This makes them ideal candidates for digital implementations.
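The effect of the word length on a stored weight can be illustrated with a uniform quantizer. The sketch below is a generic symmetric fixed-point rounding scheme, assumed here for illustration only; it is not one of the weight discretization learning algorithms mentioned above, and the weight range `w_max` is a hypothetical parameter.

```python
def quantize(w, bits, w_max=1.0):
    """Round w to the nearest of the 2**bits - 1 symmetric levels in [-w_max, w_max]."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 positive levels for 8 bits
    step = w_max / levels
    clipped = max(-w_max, min(w_max, w))  # saturate values outside the range
    return round(clipped / step) * step

def max_quantization_error(weights, bits, w_max=1.0):
    """Largest absolute difference between a weight and its quantized value."""
    return max(abs(w - quantize(w, bits, w_max)) for w in weights)
```

For weights inside the range the error is at most half a quantization step, so each additional bit roughly halves it; this is the quantity that the 8-bit (execution) and 16-bit (learning) figures quoted above trade against silicon area.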
Today, digital VLSI design is still the most important design style. The advantages of the dominant CMOS technology are small feature sizes, lower power consumption and a high signal-to-noise ratio. For neural networks these are supplemented by the following advantages of digital VLSI design styles (see Pöchmüller 1994 and Hammerstrom 1995 for more details):
• Simplicity (an important feature for the designer)
• High signal-to-noise ratio (one of the most important advantages over analog designs)
• Circuits are easily cascadable (as compared to analog designs)
• Higher flexibility (digital circuits in general can solve many tasks)
• Reduced fabrication price (certainly of interest for customers)
• Many CAD (computer aided design) systems are available to support a designer's work
• Reliable (as fabrication lines are stable).
Digital VLSI implementations of a neural network are based on several building blocks:
• Summation can easily be realized by adders (many different designs are possible and well known: combinatorial, serial, dynamic, carry look-ahead, Manchester, carry select, Wallace tree)
• Multiplication is usually the most area-consuming operation and in many cases a multiplier is time-multiplexed (classical solutions are serial, serial/parallel and fully parallel, which differ in speed, accuracy and area)
• Nonlinear transfer function (very different nonlinear activation functions (Das Gupta and Schnitger 1993) can be implemented by using circuits for full calculations, but most digital designs use either a small lookup table (Nigri 1991) or, for even lower area and higher precision, a dedicated circuit for a properly quantized approximation, as can be seen in table E1.4.2 and also in Murtagh and Tsoi (1992) and Sammut and Jones (1991))
• Storage elements (are very common, either static or dynamic, from standard RAM cells)
• Random number generators (are normally realized by shift registers with feedback via XOR-gates).
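Two of these building blocks lend themselves to a short software sketch. The first function is a generic piecewise-linear sigmoid whose slopes are powers of two (so that each multiplication reduces to a shift); it is an assumed stand-in and not one of the circuits listed in table E1.4.2. The second is a 4-bit shift register with XOR feedback; the register length and tap positions are chosen here purely for illustration.

```python
import math

def sigmoid_pwl(x):
    """Piecewise-linear sigmoid with slopes 1/4, 1/8 and 1/32 (powers of two)."""
    a = abs(x)
    if a >= 5.0:
        y = 1.0
    elif a >= 2.375:
        y = 0.03125 * a + 0.84375
    elif a >= 1.0:
        y = 0.125 * a + 0.625
    else:
        y = 0.25 * a + 0.5
    return y if x >= 0 else 1.0 - y       # symmetry: f(-x) = 1 - f(x)

def lfsr4_step(state):
    """One step of a 4-bit shift register with XOR feedback (x^4 + x^3 + 1)."""
    bit = (state ^ (state >> 1)) & 1      # XOR of the two tapped bits
    return (state >> 1) | (bit << 3)      # shift right, feed the new bit in at the top

def lfsr4_sequence(seed=1):
    """States visited from a nonzero seed until the register returns to it."""
    states, s = [], seed
    while True:
        states.append(s)
        s = lfsr4_step(s)
        if s == seed:
            return states
```

Comparing `sigmoid_pwl` with `1 / (1 + math.exp(-x))` on a fine grid gives a maximum absolute error just below 0.02, and `lfsr4_sequence()` visits all 15 nonzero 4-bit states before repeating, which is what makes such a register usable as a pseudo-random source.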
E1.4.4 Different implementations

E1.4.4.1 General comments
Table E1.4.2 (fragment). Beiu et al (1993, 1994b): piecewise approximation of the classical sigmoid; errors ≤ ±1.9% using only a shift register and several logic gates.
As the number of different architectures proposed, and of chips, boards and dedicated computers fabricated and reported in the literature, is on the order of hundreds, we cannot mention all of them here. Instead, we shall try to cover important types of architectures by several representative implementations, although certain readers could sometimes disagree with our choice. For a deeper insight the reader is referred to the following books: Eckmiller and von der Malsburg (1988), Eckmiller et al (1990), Souček and Souček (1988), Sami (1990), Zornetzer et al (1990), Antognetti and Milutinovic (1991), Ramacher and Rückert (1991), Sanchez-Sinencio and Lau (1992), Hassoun (1993), Przytula and Prasama (1993), Delgado-Frias and Moore (1994) and Glesner and Pöchmüller (1994) together with the references therein. Several overview articles or chapters can also be recommended: Alspector and Allen (1987), Mackie et al (1988), Jackel et al (1987), Jackel (1991), Przytula (1988), DARPA (1989), Del Corso et al (1989), Denker (1986), Goser et al (1989), Personnaz and Dreyfus (1989), Treleaven (1989), Treleaven et al (1989), Schwartz (1990), Burr (1991, 1992), Nordström and Svensson (1991), Graf et al (1991, having many references; 1993), Hirai (1991), Holler (1991), Ienne (1993a, b), Lindsey and Lindblad (1994) and the recent ones: Heemskerk (1995), Hammerstrom (1995) and Morgan (1995).
One of the difficult problems when discussing dedicated architectures for artificial neural networks is how to classify them. There are many different ways of classifying such architectures, and we shall mention here some which have already been presented and used in the literature.
• A first classification can be made based on the division of computer architectures due to Flynn (1972): single instruction stream, single datastream (SISD); single instruction stream, multiple datastreams (SIMD); multiple instruction streams, single datastream (MISD), which does not make too much sense; multiple instruction streams, multiple datastreams (MIMD). Most of the architectures proposed for implementing neural networks belong to the SIMD class, and thus the group should be further subdivided into: systolic arrays, processor arrays (linear, mesh, multidimensional) and even pipelined vector processors.
• Another classification has been based on 'how many and how complex' processing elements are (Nordström and Svensson 1991) . Computer architectures can be characterized by the level of parallelism which can be: moderately parallel (16 to 256 processors), highly parallel (256 to 4096 processors) or massively parallel (more than 4096 processors). As a coarse measure of the 'complexity' of the processing elements, the bit-length (i.e., the precision) of a processing element has been used.
• A much simpler classification of neurocomputers has been suggested by Heemskerk (1995): those consisting of a conventional computer and an accelerator board; those built from general purpose processors; and those built from dedicated neurochips.
• A completely different classification was suggested by Glesner and Pöchmüller (1994) based on the following three criteria: biological evidence (mimicking biological systems; mimicking on a higher level; or without biological evidence), mapping onto hardware (network-oriented; neuron-oriented; or synapse-oriented) and implementation technology (digital; analog; or mixed).
For digital electronic implementations only, a simple three-class subclassification scheme, somewhat similar to that of Heemskerk (1995), could be the following (Beiu 1994).
• Dedicated digital neural network chips (Kung 1989, Kung and Hwang 1988, 1989a, Wawrzynek et al 1993) can reach speeds of up to 1G connections per second. Several examples of such chips are: L-Neuro from Philips (Duranton 1996, Duranton et al 1988, Duranton and Sirat 1989), X1 and N64000 of Adaptive Solutions (Adaptive Solutions 1991, Hammerstrom 1990), Ni1000 from Intel (Reilly 1991, Holler et al 1992), MA16 from Siemens (Ramacher 1990, Ramacher and Rückert 1991, Ramacher et al 1991a), p-RAM from King's College London (Clarkson and Ng 1993, Clarkson et al 1989), Hitachi's WSI (Yasunaga et al 1989, 1990) and the 1.5-V chip, SMA from NTT (Aihara et al 1996), NESPINN from the Institute of Microelectronics, Technical University of Berlin (Jahnke et al 1996), or SPERT from the International Computer Science Institute, Berkeley (Asanović et al 1992, 1993d, Wawrzynek 1993, 1996).
• Special purpose digital coprocessors (sometimes called neuroaccelerators) are special boards that can be connected to a host computer (PCs and/or workstations) and are used in combination with a neurosimulator program. Such a solution tries to combine both advantages: accelerated speed and a flexible, user-friendly environment. Well known are the Delta Floating Point Processor from SAIC (DARPA 1989), which can be connected to a PC host, and the ones produced by Hecht-Nielsen Computers (Hecht-Nielsen 1991): ANZA, Balboa. Their speed is on the order of 10M connections per second, improving tenfold on a software simulator. Some of them use conventional RISC microprocessors, some use DSPs or transputers, while others are built with dedicated neurochips.
• Digital neurocomputers can be considered massively data-parallel computers. Several neurocomputers are: WARP (Arnould 1985, Kung and Webb 1985, Annaratone et al 1987), CM (Means and Hammerstrom 1991), RAP (Morgan et al 1990, Beck 1990), SANDY (Kato et al 1990), MUSIC (Müller et al 1995), MIND (Gamrat et al 1991), SNAP (Hecht-Nielsen 1991, Means and Lisenbee 1991), GF-11 (Witbrock and Zagha 1990, Jackson and ...), Toshiba (Hirai 1991), MANTRA (Blayo 1991, Lehmann et al 1993), SYNAPSE (Ramacher 1992, Ramacher et al 1991a, Johnson 1993a), HANNIBAL (Myers et al 1993), BACCHUS and PAN IV (Huch et al 1990, Palm and Palm 1991), PANNE (Milosavlevich et al 1996), 128 PE RISC (Hiraiwa et al 1990), RM-nc256 (Erdogan and Wahab 1992), CNAPS (Adaptive Solutions 1991, Hammerstrom 1990), Hitachi WSI (Boyd ...).
But even such a subclassification is not very clear cut, as in too many cases there are no sharp borders. For example, many neurocomputers have been assembled based on identical boards built with custom designed neurochips: SNAP uses the HNC 100 NAP chip; MANTRA uses the GENES IV and the GACD1 chips; HANNIBAL uses the HANNIBAL chip; SYNAPSE uses the MA 16 chip; MasPar MP-1 uses the MP-1 chip; CNAPS uses the X1 or the N64000 chip; CNS-1 will use the Torrent and Hydrant chips. That is why we have decided in this section to use a more detailed classification which starts with the first historical neurocomputers and continues through acceleration boards, slice architectures, arrays of DSPs (digital signal processors), arrays of transputers, arrays of RISC processors, SIMD and systolic arrays built of dedicated processing elements, and several other alternatives, and ends with some of the latest implementations.
Besides classification and classification criteria, another problem when dealing with neurocomputers and neurochips is their performance evaluation. While the performance of a conventional computer is usually measured by its speed and memory, for neural networks 'measuring the computing performance requires new tools from information theory and computational complexity' (Abu-Mostafa 1989). Although the different solutions presented here will be assessed for size, speed, flexibility and cascadability, great care should be taken especially when considering speed. Hardware approaches are very different, thus making it almost impossible to run the same benchmark on all systems. Even for machines which support backpropagation (which is commonly used as a benchmark), the average number of weight updates per second or CUPS (connection updates per second) reported in publications shows different computational power, even for the same machine. This is due to: different precision of weights; the use of fixed point representation in some cases; and the size of the network to be simulated (larger networks may be implemented more efficiently). A typical example of two different backpropagation implementations on WARP can be found in Pomerleau et al (1988). For architectures which do not support learning, the number of synaptic multiplications per second or CPS (connections per second) will be mentioned, but the same caution should be taken due to different word lengths (precision of computation) and network architectures. Normalizing the CPS value by the number of weights leads to CPS per weight or CPSPW, which was suggested as a better way to indicate the processing power of a chip (Holler 1991). Precision can also be included in the processing performance by considering connection primitives per second (CPPS), which is CPS multiplied by bits of precision and by bits for representing the inputs (van Keulan et al 1994).
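The derived figures of merit are simple ratios and products of the raw CPS value. The sketch below applies the definitions given above; "bits of precision" is read here as the weight word length, which is an interpretation, and the chip parameters in the usage lines are hypothetical numbers chosen only to exercise the formulas.

```python
def cpspw(cps, n_weights):
    """Connections per second per weight: CPS normalized by the number of weights."""
    return cps / n_weights

def cpps(cps, weight_bits, input_bits):
    """Connection primitives per second: CPS x bits of precision x input bits."""
    return cps * weight_bits * input_bits

# Hypothetical chip: 100 MCPS, 10 000 weights, 8-bit weights and 8-bit inputs.
example_cpspw = cpspw(100_000_000, 10_000)
example_cpps = cpps(100_000_000, 8, 8)
```

Two chips with the same CPS can therefore differ by a large factor in CPPS when their word lengths differ, which is one reason raw CPS figures should be compared with caution.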
Another reason for taking such speed measurements with a lot of care is that some of the articles report only on a small test chip (and the results reported are extrapolations to a future full-scale chip or to a board of chips and/or neurocomputer), or that only peak values are given.
Finally, for neurochips and neurocomputers which are dedicated to a certain neural architecture (e.g., the Boltzmann machine (Murray et al 1992, 1994); Kohonen's self-organizing feature maps (Hochet et al 1991, Goser et al 1989, Rüping and Rückert 1996, Tryba et al 1990, Thiran 1993, Thiran et al 1994, Thole et al 1993); Hopfield networks (Blayo and Hurat 1989, Gascuel et al 1992, Graf and de Vegvar 1987a, b, Graf et al 1987, Savran and Morgül 1991, Sivilotti et al 1986, Weinfeld 1989, Yasunaga et al 1989); Neocognitron (Trotin and Darbel 1993, White and Elmasry 1992); radial basis functions and restricted coulomb energy (LeBouquin 1994, Scofield and Reilly 1991)), or for those which are built as stochastic devices (Clarkson and Ng 1993, Clarkson et al 1993a, b, Köllmann et al 1996), it is almost impossible to assess their speed. It should be mentioned that due to such insurmountable problems there is usually little if any information on benchmarks.
E1.4.4.2 Typical and recent examples
We shall first mention Mark III and Mark IV from a historical point of view.
• Mark III was built at TRW, Inc., during 1984 (Hecht-Nielsen 1989). The design used eight Motorola M68010-based boards running at 12 MHz, with 512 kbytes of DRAM memory each. The software environment used was called ANSE (Artificial Neural Systems Environment). The original Mark III had a capacity of approximately 8000 processing elements (neurons) and 480 000 connections, and had a speed of 380 000 CPS (large instar network using Grossberg learning).
• Mark IV was also built at TRW, Inc., but under funding from the Defense Science Office of the Defense Advanced Research Projects Agency (DARPA). A detailed description is given by Hecht-Nielsen (1989). The machine was capable of implementing as many as 262 144 processing elements and 5.5 M connections, and had a sustained speed of 5 MCPS, whether or not learning was taking place (Kuczewsk et al 1988). It had a mass of 200 kg and drew 1.3 kW of power. The basic computing unit was a 16-bit Texas Instruments TMS32020 DSP. The idea was that Mark IV would be a node of a larger neurocomputer (which was never intended to be constructed).
In the meantime most neural network simulations have been performed on sequential computers. The performance of such software simulations was roughly between 25 000 and 250 000 CPS in 1989 (DARPA 1989). More recent results show impressive improvements on computers having just one processing element.
• IBM 80486/50MHz exhibits 1.1 MCPS and 0.47 MCUPS (Müller et al 1995) .
• Sun (Sparcstation 10) has 3.0 MCPS and 1.1 MCUPS (Müller et al 1995).
• NEC SX-3 (supercomputer) achieves 130 MCUPS (the implementation was presented by Koike from NEC at the Second ERH-NEC Joint Workshop on Supercomputing 1992 Zürich, but no published English reference seems to be available). As NEC SX-3 has 5.9 Gflops it is expected that a similar performance would be obtained on a Cray Y-MP/8 (which has 2.5 Gflops).
Similar results have been reported for Hypercube FPS 20 (Roberts and Wang 1989, Neibur and Brettle 1992) and CM (Deprit 1989, Zhang et al 1990). At least one order of magnitude increase can be expected on Fujitsu, Intel Paragon or on the NEC SX-4.
As a first alternative, aimed at increasing the speed of simulations on PCs and workstations, special acceleration boards have been developed (Williams and Panayotopoulos 1989).
• Delta Floating Point Processor from the Science Application International Corporation (SAIC), has separate addition and multiplication parts; it runs at 10 MCPS and 1-2 MCUPS (Souček and Souček 1988, Works 1988 ).
• SAIC later developed SIGMA-1, which has 3.1 M virtual interconnections and has reached 11 MCUPS (Treleaven 1989).
Many other accelerator boards are mentioned in tabular form by Lindsey and Lindblad (1994). One simple way to increase performance even more is to use processors in parallel. A classical design style was used for slice architectures, and several representative models are detailed.
• Micro Devices have introduced the NBS (Neural Bit Slice) chip MD1220 (Micro Devices 1989a). The chip has eight processing elements with hard-limiting thresholds and eight inputs (Yestrebsky et al 1989); no other threshold function is supported. The architecture is suited for multiplication of a 1-bit synapse input with a 16-bit weight. The weights are stored in standard RAM, but only eight external weights per neuron and seven internal weights per neuron are supported. Such a reduced fan-in (maximum 15 synapses per neuron) is quite a drastic limitation. It can be overcome with additional external circuits, but increasing the fan-in decreases the accuracy (as the 16-bit accumulator can overflow). The chip has a processing rate of 55 MIPS, which roughly corresponds to 8.9 MCPS.
• A similar chip is the Neuralogix NLX-420 Neural Processor Slice from Neuralogix (1992), which has 16 processing elements. A common 16-bit input is multiplied by a weight in each processing element in parallel. New weights are read from off-chip. The 16-bit weights and inputs can be user selected as 16 1-bit, 4 4-bit, 2 8-bit or 1 16-bit value(s). The 16 neuron sums are multiplexed through a user-defined piecewise continuous threshold function to produce a 16-bit output. Internal feedback allows for multilayer networks.
• The Philips L-Neuro 1.0 chip (Duranton and Sirat 1989, Theeten et al 1990, Maudit et al 1992) was designed to be easily interfaced to transputers. It also has a 16-bit processing architecture in which the neuron values can be interpreted as 8 2-bit, 4 4-bit, 2 8-bit or 1 16-bit value(s). Unlike the NLX-420, there is a 1 kbyte on-chip cache to store the weights. The chip has 32 inputs and 16 output neurons and only the loop on the input neurons is parallelized (weight parallelism). This chip has on-chip learning with an adjustable learning rate. A later version (Dejean and Caillaud 1994) was followed recently (Duranton 1996) by L-Neuro 2.3 (see the paragraph on the latest implementations).
• BACCHUS is another slice architecture, designed at Darmstadt University of Technology. There have been three successive versions, I, II and III (Huch et al 1990, Pöchmüller and …). The neurons perform only a hard-limiting threshold function. The final version was designed as a sea-of-gates in 1.5-µm CMOS (Glesner et al 1989, Glesner and …). The chip contains 32 neurons and runs at 32 MCPS (but for 1-bit interconnections!). An associative system, PAN IV, based on BACCHUS III chips has been built (Palm and Palm 1991). It has eight BACCHUS III chips (for a total of 256 simple processors) and 2 Mbytes of standard RAM. The system was designed only as a binary correlation matrix memory.
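The NLX-420 and L-Neuro 1.0 entries above both describe a 16-bit word that can be interpreted as several narrower values. A minimal sketch of that idea follows. The field order (least significant field first), the unsigned interpretation and the function names `unpack_fields` and `weighted_sum` are assumptions made for illustration; they are not taken from either chip's documentation.

```python
def unpack_fields(word, width):
    """Split a 16-bit word into 16 // width unsigned fields.

    Assumption: fields are returned least significant first.
    """
    if width not in (1, 2, 4, 8, 16):
        raise ValueError("width must be 1, 2, 4, 8 or 16")
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(16 // width)]


def weighted_sum(word, width, weights):
    """Multiply every field of `word` by its weight and accumulate."""
    fields = unpack_fields(word, width)
    if len(fields) != len(weights):
        raise ValueError("one weight per field is required")
    return sum(f * w for f, w in zip(fields, weights))


if __name__ == "__main__":
    print(unpack_fields(0xABCD, 4))         # four 4-bit fields
    print(weighted_sum(0x0102, 8, [3, 5]))  # two 8-bit fields
```

Narrower fields give more values per word and cycle at the price of precision, which is the tradeoff these slice chips leave to the user.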
For even higher performances the designers have used SIMD arrays (various one- or two-dimensional systolic architectures; Kung and Hwang 1988, 1989a, 1989b, Kung and Webb 1985), made of DSPs (digital signal processors), RISC processors, transputers or dedicated chips.
Many neuroprocessors have been built as arrays of DSPs.
• One of the first array-processors proposed for neural network simulation was built at IBM Palo Alto Scientific Center (Cruz et al 1987) . The building block was the NEP (Network Emulation Processor) board able to simulate 4000 nodes (neurons) with 16 000 links (weights) and a speed of between 48 000 and 80 000 CUPS. Up to 256 NEPs could be cascaded (through a NEPBUS communication network), thus allowing for networks of 1 million nodes and 4 million links.
• Another DSP neuroprocessor called SANDY emerged from Fujitsu Laboratories (Kato et al 1990). The DSP used was the Texas Instruments TMS320C30 connected in a SIMD array. SANDY/6 (with 64 processors) was benchmarked on NETtalk (Sejnowski and Rosenberg 1986) … (Morgan et al 1990, Kohn et al 1992). These chips are connected via a ring of Xilinx programmable gate arrays, each implementing a simple two-register data pipeline and running at the DSP clock speed of 16 MHz. A single board can perform 57 MCPS and 13.2 MCUPS, with a peak performance for a whole system reaching 640 MCPS (tested at 570 MCPS) and 106 MCUPS.
• At the Swiss Federal Institute of Technology in Zürich, a 63-processor system named MUSIC (Multiprocessor System with Intelligent Communication) has been developed (Müller et al 1994, 1995). The architecture is similar to that of RAP but differs in the communication interface. Three Motorola 96002 DSPs (32-bit floating-point) are mounted on one board, each one with a Xilinx LCA XC3090 programmable gate array and an Inmos T805 transputer. Up to 21 boards (i.e., 63 processors) fit into a standard 19-inch rack. A global 5-MHz ring connects the nodes and communication can be overlapped with computation. The complete system has achieved 817 MCPS and 330 MCUPS (for a 5000-1575-63 two-layer perceptron), but the peak performance is 1900 MCPS. A fully equipped system consumes 800 W.
• PANNE (Parallel Artificial Neural Network Engine) has been designed at the University of Sydney (Milosavlevich et al 1996) and exploits the many specialized features of the TMS320C40 DSP chip. One board contains two DSPs together with 32 Mbytes of DRAM and 2 Mbytes of high speed SRAM. These are accessed through a dedicated local bus. Apart from this local bus, each board has a global bus and six programmable unidirectional 8-bit ports specially designed to allow connections of neighboring DSPs at 20 Mbytes per second. The system has up to eight boards and is estimated at 80 MCUPS.
Different solutions have been implemented on arrays (networks) of transputers (Ernst et al 1990, Murre 1993). Ernoult (1988) … and 9.9 MCUPS (Mühlbein and Wolf 1989). This performance should increase tenfold on Parsytec's Gigacluster, which uses T9000 transputers. Instead of transputers some researchers have used RISC processors, and here are some of the neurocomputers built as arrays of RISC processors.
• One solution was to design a RISC processor (dedicated to simulating neural networks) and to assemble several of them in SIMD arrays. Here we can mention the 16-bit Neural RISC developed at University College London (Treleaven and Rocha 1990). Several neural RISCs have been connected in a linear array. A linear array interconnecting scheme has several advantages: simplified wiring and ease of cascadability. Several arrays are linked by an interconnecting module (Pacheco and Treleaven 1992). This allows for different topologies (rings, meshes, cubes) and is expandable up to a maximum of 65 536 processors. The flexibility is high as the computer is of the MIMD type (multiple instructions multiple data).
• REMAP 3 was an experimental neurocomputing project (Bengtsson et al 1993, Linde et al 1992) whose objective was to develop a parallel reconfigurable SIMD computer using FPGAs. The performance was estimated to be between 100 and 1000 MCUPS.
• Another solution is to use a standard RISC processor. An example is the 128 PE RISC which uses the Intel 80860 (Hiraiwa et al 1990) . 128 processors are connected in a two level pipeline array where the horizontal mesh connections serve for information exchange (weights) and vertical meshes share dataflow. For a 256-80-32 network and 5120 training set vectors, the performance is around 1000 MCUPS.
• BSP400 from Brain Style Processor (Heemskerk et al 1991, Heemskerk 1995) used the low-cost commercial MC68701 (an 8-bit microprocessor). Due to the low speed of the processor used (1 MHz!) the overall performance reached only 6.4 MCUPS when 400 processors were used.
Because both DSP and RISC processors are more powerful and flexible than the task of simulating neural networks requires, a better alternative is to use smaller and more specific (less flexible) dedicated processing elements. This can increase the computational power while keeping the cost very low. The trend has been marked by the use of SIMD (single instruction multiple data) arrays and especially systolic arrays (Kung and Hwang 1988) of dedicated chips. Systolic arrays are a class of architecture where the processing elements and the interconnecting scheme can be optimized for solving certain classes of algorithms. Matrix multiplication belongs to this class of algorithms (Leiserson 1982), and it is known that neural network simulation relies heavily on matrix multiplication (Beiu 1989, Kham and Ling 1991, Kung and Hwang 1989b). The SIMD arrays are similar structures, the main difference being that the elementary processing elements have no controllers and that a central controller is in charge of supervising the activity of all the elementary processing elements.
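To make the link between systolic arrays and the matrix-vector product of a neural layer concrete, here is a small software simulation of a ring of processing elements. It is only a sketch of the general ring-systolic idea: it does not model any of the chips listed in this section, the name `ring_systolic_matvec` is hypothetical, and a square weight matrix is assumed for simplicity.

```python
def ring_systolic_matvec(W, x):
    """Compute y = W x the way a ring of n processing elements (PEs) would.

    PE i stores row i of W and one accumulator. The input values circulate
    around the ring, so in every step each PE performs exactly one
    multiply-accumulate on the value it currently holds.
    """
    n = len(x)
    if len(W) != n or any(len(row) != n for row in W):
        raise ValueError("this sketch assumes a square n x n weight matrix")
    acc = [0] * n
    ring = list(x)                      # ring[i] is the value held by PE i
    for step in range(n):
        for i in range(n):              # all PEs work in parallel in hardware
            j = (i + step) % n          # index of the input PE i holds now
            acc[i] += W[i][j] * ring[i]
        ring = ring[1:] + ring[:1]      # every PE passes its value along
    return acc


if __name__ == "__main__":
    print(ring_systolic_matvec([[1, 2], [3, 4]], [5, 6]))
```

After n steps every accumulator holds one weighted sum, and each PE only ever talked to its neighbor, which is why such arrays are easy to wire and to cascade.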
• The WARP array was probably the earliest systolic one (Kung and Webb 1985, Arnould 1985, Annaratone et al 1987). Although built primarily for image processing, it has also been used for neural network simulation (Pomerleau et al 1988). It is a ten (or more) processor programmable systolic array. The system can work either in a systolic mode, or in a local mode (each processor works independently). A performance of 17 MCUPS was obtained on a 10-processor WARP.
• The ARIANE chip (Gascuel et al 1992) is a 64-neuron implementation in a 1.2-µm CMOS of the architecture first proposed by Weinfeld (1989). The chip, having 420 000 transistors in 1 cm², implements a fully digital Hopfield-type network, thus continuing on the lines of other Hopfield-type implementations (Sivilotti 1986, Graf et al 1986). All operations are performed by a 12-bit adder/subtracter. There are 64 connections per neuron, making it possible to store 4096 weights. The reported speed is 640 MCUPS, but this figure cannot be compared to standard CUPS as the chip does not implement backpropagation. The main drawback is that the chip is not easily cascadable (however, a four chip board has been designed).
• SNAP (… performance being 640 MCPS) and 128 MCUPS. Although the system performs lower than CNAPS (described below), we have to mention that SNAP uses 32-bit floating point arithmetic.
• The APLYSIE chip is a two-dimensional systolic array dedicated to Hopfield-type networks (Blayo and Hurat 1989). Since the outputs are only +1 and −1, the synaptic multiplication can be performed by an adder/subtracter (as in Weinfeld's 1989 solution). The weights are limited to 8 bits and the partial product is computed by a 16-bit register. The adder/subtracter is of the serial type to minimize the area, and this choice also matches the serial interconnecting scheme used. An advantage of such a solution is its cascadability.
• The GENES chip is a generalization of APLYSIE and it was implemented at the Swiss Federal Institute of Technology (Lausanne) as part of the MANTRA project (Lehmann and Blayo 1991, Lehmann et al 1993, Viredaz et al 1992). It is based on the same recurrent systolic array as APLYSIE, but it has been enhanced to simulate several neural network architectures. The first chip of the family was GENES HN8, implementing each synapse as a serial-parallel multiplier. Two versions have been fabricated: a 2 × 2 array of processors and a 4 × 4 array of processors. Weights and inputs are represented on 8 bits. The partial sum is calculated on 24 bits. A full board, GENES SY1, was built as a 9 × 8 array of GENES HN8 2 × 2 chips (18 × 16 synapses) and was able to reach 110 MCPS. A GENES IV chip was later designed as an upgrade of GENES HN8 (Lehmann et al 1993, Viredaz et al 1992). It has 16-bit inputs and synaptic weights and uses 39 bits for the partial sum. The chip was designed with standard cells in a 1-µm CMOS technology on a 6.2 × 6.2 mm² area. Together with another chip, GACD1 (dedicated to the error computation for delta rule and backpropagation), it was used to build the first MANTRA neurocomputer as a 40 × 40 array of processing elements. The speed is estimated at 500 MCPS and 160 MCUPS.
• A low-cost high-speed neurocomputer system has recently been proposed (Strey et al 1995) and implemented (Avellana et al 1996). The system is based on a dedicated AU chip which has been designed to adapt its internal parallelism dynamically to the data precision, with the aim of an optimal utilization of the available hardware resources. The AU chip is organized as a pipeline structure where the data path can be adapted dynamically to the encoding of the data values. The chip has been realized in a 0.7-µm technology and occupies 80 mm². Four chips are installed on a board together with: a Motorola DSP96002 (used for the management of the local bus, computation of the sigmoid function, error calculation, winner calculation and convergence check); an FPGA for communication; local weight memories; central memory; and FIFO memory. Several boards can be used together. For 16-bit weights and with only one board the estimated performance is 480 MCPS and 120 MCUPS.
• TNP (Toroidal Neural Processor) is a linear systolic neural accelerator engine developed at Loughborough University of Technology (Jones and Sammut 1993, Jones et al 1990). The system is still under development although several prototype chips have been successfully fabricated and tested.
• HANNIBAL (Hardware Architecture for Neural Networks Implementing Backpropagation Algorithm Learning) was built at British Telecom. A dedicated HANNIBAL chip contains eight processing elements (Naylor et al 1993), each one with 9216 bits of local memory (configurable as 512 17-bit words, or 1024 9-bit words). Such a chip allows for high fan-in neurons to be implemented; up to four lower fan-in neurons can be mapped onto one processing element. The neuron activation function is realized by a dedicated approximation for area saving reasons. The chip uses reduced word length (8-bit in the recall phase and 16-bit when learning; Vincent and Myers 1992) and it was fabricated in a 0.7-µm CMOS technology. This has led to 750 000 transistors in a 9 × 11.5 mm² area. The clock frequency is 20 MHz and a single chip can reach 160 MCPS.
• MM32K (Glover and Miller 1994) is a SIMD having 32 768 simple processors (bit serial). A custom chip contains 2048 processors. The bit serial architecture allows for the variation of the number of bits (variable precision). The processors are interconnected by a 64 × 64 full crossbar switch with 512 processors connected to each port of the switch.
• SYNAPSE 1 and SYNAPSE X (Synthesis of Neural Algorithms on a Parallel Systolic Engine) from Siemens (Ramacher 1990, Ramacher et al 1991b) are dedicated to operation on matrices and are based on the MA16 chip (Beichter et al 1991), which has four systolic chains (of four multipliers and four adders each). The chip runs at 25 MHz and was fabricated in 1.0-µm CMOS. Its 610 000 transistors occupy 187 mm². The MA16 alone achieves 800 MCPS when working on 16-bit weights. The SYNAPSE neurocomputer is simply a two-dimensional systolic array of MA16 chips arranged in two rows by four columns. The weights are stored off chip in local memories. Both processor rows are connected to the same weight bus, which excludes the operation on different input patterns. The MA16s in a row form a systolic array where input data as well as intermediate results are propagated for obtaining the total weighted sum. Multiple standard 68040s and additional integer ALUs are used as general purpose processors which complement the systolic processor array. The standard configuration has eight MA16s, two MC68040s for control and 128 Mbytes of DRAM. It performs at 5100 MCPS and 133 MCUPS.
• CNAPS (Connected Network of Adaptive Processors) is a SIMD array from Adaptive Solutions, Inc (Adaptive Solutions 1991, Hammerstrom 1990). X1 is a neural network dedicated chip with on-chip learning. It consists of a linear array of elementary processors, each one having a 32-bit adder and a 24-bit multiplier (fixed-point). The structure of an elementary processor is such that it can work with three different weight lengths: 1-bit, 8-bit and 16-bit weights (Hammerstrom 1990, Hammerstrom and Nguyen 1991). X1 chips are fully cascadable, allowing the construction of linear arrays having arbitrarily many elementary processors. Another chip, the N64000, was produced in 0.8-µm CMOS and 80 elementary processors have been embedded in this design. N64000 is a large chip (one square inch) containing over 11 million transistors (Griffin et al 1991) and, due to defects in the fabrication process, only 64 functioning processing elements are used from one chip (the other 16 being redundant). The same idea will be used at a higher level for Hitachi's WSI (wafer scale integration), to be discussed later. The maximum fan-in of one neuron is 4096 and there are 256K programmable synapses on the 26.2 × 27.5 mm² chip. The chip alone can perform 1600 MCPS and 256 MCUPS for 8- or 16-bit weights (12 800 MCPS for 1-bit weights). The CNAPS has four N64000 chips running at 20 MHz on one board (256 processing elements). The maximum performance of the system is quite impressive: 5700 MCPS and 1460 MCUPS (Adaptive Solutions 1991, McCator 1991), but these values are for 8- and 16-bit weights! Hammerstrom and Nguyen (1991) have also compared a Kohonen self-organizing map implemented on the CNAPS (516 MCPS and 65 MCUPS) with its performance on a SPARC station (0.11 MCPS and 0.08 MCUPS).
• MasPar MP-1 is a SIMD computer based on the MP-1 chip (Blank 1990, MasPar 1990). It is a general purpose parallel computer but it exhibits excellent performance when simulating neural networks. The core chip is MP-1, which has 32 processing elements working on 32-bit floating point numbers (each processing element can be viewed as a small RISC processor). MP-1 was fabricated in 1.6-µm CMOS on an area of 11.6 × 9.5 mm² and has 450 000 transistors. The chip works at a moderate clock frequency of only 14 MHz for minimizing the dissipated power. One board uses 32 MP-1 chips, thus having 1024 processing elements which are arranged in a two-dimensional array. The connection scheme is different from others: 16 processing elements are configured as a 4 × 4 array with an X-net mesh and form a 'processor element cluster'. These clusters are again connected as an X-net mesh of clusters. The processors are connected together from the edges to form a torus. On top of that, a global communication between processing elements is realized by a dedicated 1024 × 1024 crossbar interconnecting network having three stages for routing. MasPar can have from 1 to 16 boards. The largest configuration has 16 384 processing elements. Grajski et al (1990) have simulated neural networks on a MasPar MP-1 with 4096 processing elements (MasPar MP-1 1100). A 900-20-17 backpropagation network obtained 306 MCUPS, but on the largest MasPar MP-1 1200 (16 384 processing elements) performance is expected to be on the order of GCUPS.
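The ARIANE and APLYSIE entries above rely on the fact that, with outputs restricted to +1 and −1, each synaptic product reduces to adding or subtracting the weight. The sketch below illustrates this. The synchronous update, the convention that a zero sum maps to +1, and the names `hebbian_weights` and `hopfield_step` are assumptions for illustration; neither chip's actual update schedule is described in this section.

```python
def hebbian_weights(patterns):
    """Off-line weight setup: sum of outer products with a zero diagonal."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W


def hopfield_step(W, state):
    """One synchronous update using only additions and subtractions.

    Because every state value is +1 or -1, the product w * s is replaced
    by "add w" or "subtract w", as an adder/subtracter would do.
    """
    new_state = []
    for row in W:
        total = 0
        for w, s in zip(row, state):
            total = total + w if s == 1 else total - w
        new_state.append(1 if total >= 0 else -1)  # assumed tie-break: +1
    return new_state


if __name__ == "__main__":
    stored = [1, -1, 1, -1]
    W = hebbian_weights([stored])
    print(hopfield_step(W, [-1, -1, 1, -1]))  # one corrupted element
```

No multiplier appears in the recall loop, which is the area saving both chips exploit.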
Many other alternatives have also been presented and we shall briefly enumerate some of them here.
• WISARD belongs to the family of weightless neural networks or the RAM model (Aleksander and Morton 1990) and has been used in image recognition.
• The pRAM (probabilistic RAM) is a nonlinear stochastic device (… Taylor 1989a, b, 1990a, 1991a, c) with neuron-like behavior which, as opposed to the simple RAM model, can implement nonlinear activation functions and can generalize after training (Clarkson et al 1993a). It is based on a pulse-coding technique and several chips have been fabricated. The latest digital pRAM has 256 neurons per chip. The 16-bit 'weights' (probabilities) are stored in an external RAM in order to keep the costs at a minimum. Up to 1280 neurons can be interconnected by combining five chips. Learning (Clarkson and Ng 1993, Clarkson et al 1991a, b, 1992b, 1993b, c, Gorse and Taylor 1990b, 1991b, Guan et al 1992) is performed on-chip. The pRAM uses a 1-µm CMOS gate-array with 39 000 gates. A PC board has been designed and tested. A VMEbus-based neural processor board (using the pRAM-256) has also been recently built (El-Mousa and Clarkson 1996). The board is being used for studying the various architectures and advantages of hardware-based learning using pRAM artificial neural networks. For this purpose, the board relies heavily on the use of in-system programmable logic devices (ISPLD) to facilitate changing the support hardware logic associated with the actual neural processor without the need to rewrite and/or exchange parts of it.
• Intel has several neural network solutions (Intel 1992a, b). Two commercial chips are dedicated to radial basis functions (Watkins et al 1992): the IBM ZISC036 (LeBouquin 1994) and the Ni1000 (Scofield and Reilly 1991), built in cooperation with Nestor. The ZISC036 (from Zero Instruction Set Computer) contains 36 prototype neurons, where the vectors have 64 8-bit elements and can be assigned to categories from 1 to 16 384 (i.e., the first layer has 36 neurons fully connected by 8-bit weights to the 64 neurons of the second layer). Multiple ZISC036 chips can be easily cascaded to provide additional prototypes, while the distance norm is selectable between city-block (Manhattan) or the largest element difference. The ZISC036 implements a region of influence (ROI) learning algorithm (Verleysen and Cabestany 1994) using signum basis functions with radii of 0 to 16 383. Recall is either according to the ROI identification, or via the nearest-neighbor readout, and takes 4 µs, which gives a presentation rate of 250 K patterns per second. The Ni1000 was developed jointly by Intel and Nestor and contains 1024 prototypes of 256 5-bit elements (i.e., the first layer has 256 neurons, while the second layer is fully connected to the first layer by 5-bit weights and has 1024 neurons). The distance used is the city-block (Manhattan) distance. The third layer has 64 neurons working in a sequential way, but achieving higher precision. All the weights and the thresholds are stored on board in a nonvolatile memory, as the chip is implemented in Intel's 0.8-µm EEPROM process. On the same chip a Harvard RISC is used to accelerate learning (Johnson 1993b), which increases the overall number of transistors to 3.7 million. The chip implements two on-chip learning algorithms: restricted coulomb energy or RCE (Reilly et al 1982) and probabilistic neural networks or PNN (Specht 1988). Other algorithms can be microcoded. In a pattern processing application the chip can process 40 000 patterns per second (Holler et al 1992).
• A generic neural architecture was proposed by Vellasco and Treleaven (1992). The idea is to tailor the hardware to the neural network to be simulated. This can increase the performance at the expense of reduced flexibility. The aim of such an approach is to automatically generate application-specific integrated circuits (ASICs). Several chips have been fabricated. Other authors have been working on similar approaches (Disante et al 1990b, Fornaciari et al 1991a), or have tried a mapping onto FPGAs (Beiu and Taylor 1995c, Botros and Abdul-Aziz 1994, Gick et al 1993, Rossmann et al 1996, Rückert et al 1991).
• Several implementations of the Boltzmann machine have also been reported. A high-speed digital one is that of Murray et al (1992, 1994). The chip, realized in a 1.2-µm CMOS technology, has 32 neural processors and four weight update processors supporting an arbitrary topology of up to 160 functional neurons. The 9.5 × 9.8 mm² area hosts 400 000 transistors. This includes the 20 480 5-bit weights stored in a dynamic RAM (the activation and temperature memories are static). Although clocked at 125 MHz, the chip dissipates less than 2 W. The theoretical maximum learning rate is 350 MCUPS and the recall rate is typically 1200 patterns per second. An SBus interface board was developed using several reconfigurable Xilinx FPGAs.
• ArMenX is a distributed computer architecture (Poulain Maubant et al 1996) articulated around a ring of FPGAs acting as routing resources as well as fine grain computing resources (Léonhard et al 1995) . This allows for a high degree of flexibility. Coarse grain computing relies on transputers and DSPs. Each ArMenX node contains an FPGA (Xilinx 4010) tightly coupled to an Inmos T805 transputer and a Motorola DSP56002, but other processors could be used. The node has 4 Mbytes of transputer RAM and 384 Kbytes of DSP RAM and the FPGA connects to the left and right neighboring nodes. The sustained performance of a node is about 5 MCPS and 1.5 MCUPS, and it is expected that the scale-up will be linear for a 16-node machine: 80 MCPS and 24 MCUPS.
• A solution which uses on-line arithmetic has been proposed in Girau and Tisserand (1996) and should be implemented on an FPGA. A redundant number representation allows very fast arithmetic operations, the estimated speed being 5.2 MCUPS per chip.
• The use of stochastic arithmetic computing for all arithmetic operations of training and processing backpropagation networks has also been considered (Köllmann et al 1996). Arithmetic operations become quite simple. The main problem in this case is the need for numerous independent random generators. The silicon reported uses a decentralized pseudorandom generator based on the principle of shifting the turn-around code for parities formed on partial stages of a feedback shift register. A 3.5 × 2.8 mm² silicon prototype has been implemented in 1-µm CMOS technology. The prototype delivers a theoretical performance of 400 MCUPS for 12-bit weight length and 15-bit momentum length. It is estimated that a state-of-the-art 0.25-µm process would allow 4K synapses and 64 neurons to fit into 160 mm² if standard cells are used; a custom design should increase these values to 16K synapses and 128 neurons.
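The remark that stochastic arithmetic makes operations 'quite simple' but needs many independent random generators can be illustrated in a few lines. The sketch below uses the common unipolar encoding, in which a value p in [0, 1] is represented by a bit stream whose bits are 1 with probability p, so that a single AND gate multiplies two values. The encoding, the stream length and the use of Python's `random` module in place of the shift-register generator of the reported silicon are all assumptions for illustration.

```python
import random


def encode(p, length, rng):
    """Unipolar stochastic encoding: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def decode(bits):
    """Estimate the encoded value as the fraction of ones."""
    return sum(bits) / len(bits)


def stochastic_multiply(p, q, length=20000, seed=0):
    """Multiply p and q with a bitwise AND of two independent streams."""
    a = encode(p, length, random.Random(seed))
    b = encode(q, length, random.Random(seed + 1))  # separate generator
    return decode([x & y for x, y in zip(a, b)])


if __name__ == "__main__":
    print(stochastic_multiply(0.5, 0.4))  # close to 0.2, not exact
```

If both operands were drawn from the same stream, the AND would simply return that stream and the result would estimate p rather than p times q, which is why independent generators are the critical resource.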
Some of the latest implementations are pushing the performances even further and we shall mention here the most promising ones, even if by our classification some of them might also fall in another class.
• The RM-nc is a reconfigurable machine for massively parallel-pipelined computations and has been proposed in Erdogan and Wahab (1992). The reconfigurability is not only in the domain of communication and control, but also in the domain of processing elements. A fast floating point sum-of-products circuit using special carry-save multipliers (with built-in on-the-fly shifting capability and extensive pipelining) has been proposed and is to be implemented on FPGAs. The performance of an RM-nc256 machine (with 256 processing FPGAs) has been estimated for NETtalk (203-60-26 network with 13 826 connections) at a speed of 2000 MCUPS. No implementation has yet been reported.
• One interesting development is based on WSI (Mann et al 1987, Rudnik and Hammerstrom 1988, Tewksbury and Hornak 1989). A first neural network WSI has been developed by Hitachi (Yasunaga et al 1989, 1990). This first version was designed only for Hopfield networks without learning. Hitachi's WSI has 576 neurons with a fan-in of 64. Weights are represented on 10 bits. If larger fan-in is required, three neurons can be cascaded to increase the fan-in to 190 (this reduces the number of available neurons). A 'small' 5-inch wafer and a 0.8-µm CMOS technology have been used to realize the designed 19 million transistors. The wafer has 64 chips of 12 neurons each; one redundant chip (Zorat 1987) is used to replace faulty neurons from the other chips. Up to 37K synapses are available on chip. For controlling the neurons and the buses there are eight more chips on the wafer. The only way to keep the power to a reasonable 5 W is a quite slow clock rate of 2.1 MHz, but the actual performance is still around 138 MCPS. The same idea has been used by Hitachi (Boyd 1990) to design a WSI for multilayer feedforward networks including the backpropagation algorithm. The weights' accuracy has been increased to 16 bits to cope with the required precision of on-chip learning. One wafer has 144 neurons and eight wafers have been stacked together to form a very small neurocomputer with 1152 neurons (Yasunaga et al 1991). The reported speed is 2300 MCUPS. Using a similar architecture and the present-day state-of-the-art 0.3-µm CMOS technology, we can expect to have a 10 000-neuron WSI in the very near future.
• For portable applications Hitachi has also developed a 1.5 V digital chip with 1 048 576 synapses. The chip can emulate 1024 fully connected neurons (fan-in of 1024 each) or three layers of 724 neurons. An on-chip DRAM cell array is used to store the 8-bit weights. A 256-way parallel sum-of-products circuit (Baugh parallel multiplier) pushes the processing speed to 1370 MCPS. A scaled-down version of the chip was fabricated using a 0.5-µm CMOS design rule. It allowed an estimation of the full-scale chip: 15.4 × 18.6 mm² and 75 mW.
• The new L-Neuro 2.3 (Duranton 1996) is a fully programmable vectorial processor in a highly parallel chip composed of an array of twelve DSPs which can be used not only for neurocomputing, but also for fuzzy logic applications, real-time image processing and digital signal processing. Besides the twelve DSPs, the chip contains: a RISC processor, a vector-to-scalar unit, a 32-bit scalar unit, an image addressing module and several communication ports. All the DSPs are linked together: by a broadcast bus connecting all DSPs; by two shift chains linking the DSPs as a systolic ring; by fast neighbor-to-neighbor connections existing between adjacent DSPs; and also to an I/O port. All the internal buses are connected together through a programmable crossbar switch. The RISC processor of one chip can be used to control several other L-Neuro chips, allowing an expansion in a hierarchical fashion. The chip was fabricated in 0.6-µm technology and has 1.8 million transistors clocked at 60 MHz. It can implement different learning algorithms such as backpropagation, Kohonen feature maps, radial basis functions and neural trees (Sirat and Nadal 1990). The peak performance is estimated at 1380 MCUPS and 1925 MCPS but no tests have yet been reported.
• One very interesting approach is the novel SMA (Sparse Memory Architecture) neurochip (Aihara et al 1996) which uses specific models to reduce neuron calculations. SMA uses two key techniques to achieve extremely high computational speed without an accuracy penalty: 'compressible synapse weight neuron calculation' and 'differential neuron operation'. The compressible synapse weight neuron calculation uses the transfer characteristics of the neuron to stop the calculation for the sum if it is determined that the final sum will be in the saturation region. This also cancels subsequent memory accesses for the synapse weights. The purpose of differential neuron operation is to do calculations only on those inputs whose level has changed. A dedicated processing unit having a 22-bit adder, a 16-bit shifter, an EX-OR gate and two 22-bit registers has been designed. A test chip having 96 processing units has been fabricated in 0.5-µm CMOS and occupies 16.5 × 16.7 mm². It runs at 30 MHz and dissipates 3.2 W. The chip can store 12 228 16-bit synapse weights and has a peak performance of 30 GCPS (tested at 18 GCPS).
• SPERT (from Synthetic Perceptron Testbed) (Asanović et al 1992, 1993d, Wawrzynek et al 1993, 1996) is a fully programmable single chip neuromicroprocessor which borrowed heavily from the experience gained with RAP (Morgan et al 1990, 1992). It combines a general purpose integer data path with a vector unit of SIMD arrays optimized for neural network computations and with a wide connection to external memory through a single 128-bit VLIW instruction format. The chip is implemented in 1.2-µm CMOS and runs at 50 MHz. Its peak performance has been estimated at 350 MCPS and 90 MCUPS. The chip is intended to be a test chip for the future Torrent chip: the basic building block of CNS-1 (see below). Recent developments have led to SPERT-II (Wawrzynek et al 1996), which has a vector instruction set architecture (ISA) based on the industry standard MIPS RISC scalar ISA.
• NESPINN (Neurocomputer for Spike-Processing Neural Networks) is a mixed SIMD/dataflow neurocomputer (Roth et al 1995, Jahnke et al 1996). It will allow the simulation of up to 512K neurons with up to 10⁴ connections each. NESPINN consists of the spike-event list (the connectivity of sparsely connected networks is handled through lists), two connectivity units containing the network topology (a regular and a nonregular connection unit), a sector unit controlling the processing of sectors, and the NESPINN chip. The chip has a control unit and eight processing elements; each processing element has 2 Kbytes of on-chip local memory and an off-chip neuron state memory. The chip has been designed and simulated and will be implemented in 0.5-µm CMOS. It will operate at 50 MHz in either SIMD or dataflow mode. The estimated performance of the system with one NESPINN chip, for a model network with 16K neurons of 83 connections each, is 10¹¹ CUPS.
• CNS-1 from the University of California, Berkeley (the acronym stands for Connectionist Network Supercomputer-1) (Asanović et al 1993a-c, 1994) is currently under development. It is targeted at speech and language processing as well as early and high-level vision and large conceptual knowledge representation studies. The CNS-1 is similar to other massively parallel computers, with major differences in the architectural details of the processing nodes and the communication mechanisms. Processing nodes will be connected in a mesh topology and operate independently in a MIMD style. The processor node, named Torrent, includes a MIPS CPU with a vector coprocessor running at 125 MHz, a Rambus external memory interface, and a network interface. The design is scalable up to 1024 Torrent processing nodes, for a total of up to 2 TeraOps and 32 Gbytes of RAM. The host and other devices will connect to CNS-1 through custom VLSI I/O nodes named Hydrant, connected to one edge of the mesh and allowing up to 8 Gbytes of I/O bandwidth. A sketch of the future CNS-1 can be seen in figure E1.4.2. The goal set ahead is to be able to evaluate networks with one million neurons and an average of one thousand connections per unit (i.e., a total of a billion connections) at a rate of 100 times per second, or 10¹¹ CPS and 2 × 10¹⁰ CUPS.
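The two SMA techniques described above can be illustrated with a minimal software sketch. It is not the circuit of Aihara et al (1996): the function names, the saturation limits, the assumption that inputs lie in [0, 1], and the use of a precomputed bound on the remaining weight magnitudes to detect saturation are all assumptions made here for illustration only.

```python
from typing import List, Sequence, Tuple


def remaining_bounds(weights: Sequence[float]) -> List[float]:
    """bound[k] = sum of |w_i| for i >= k (hypothetical helper).

    With inputs assumed to lie in [0, 1], this bounds how far the
    not-yet-processed terms can still move the partial sum. In hardware
    such a bound would be prepared once, ahead of time.
    """
    bound = [0.0] * (len(weights) + 1)
    for k in range(len(weights) - 1, -1, -1):
        bound[k] = bound[k + 1] + abs(weights[k])
    return bound


def saturating_sum(weights: Sequence[float], inputs: Sequence[float],
                   bound: Sequence[float],
                   sat_lo: float = -4.0, sat_hi: float = 4.0
                   ) -> Tuple[float, int]:
    """Accumulate w*x, stopping once the result must be saturated.

    Returns (clamped sum, number of weights actually fetched).
    """
    total = 0.0
    for k, (w, x) in enumerate(zip(weights, inputs)):
        if total - bound[k] >= sat_hi:   # cannot drop below the upper limit
            return sat_hi, k
        if total + bound[k] <= sat_lo:   # cannot rise above the lower limit
            return sat_lo, k
        total += w * x
    return max(sat_lo, min(sat_hi, total)), len(weights)


def differential_update(prev_sum: float, weights: Sequence[float],
                        old_inputs: Sequence[float],
                        new_inputs: Sequence[float]) -> Tuple[float, int]:
    """Update a stored weighted sum using only the inputs that changed.

    Returns (new sum, number of inputs that were processed).
    """
    updated, touched = prev_sum, 0
    for w, x_old, x_new in zip(weights, old_inputs, new_inputs):
        if x_new != x_old:
            updated += w * (x_new - x_old)
            touched += 1
    return updated, touched
```

In this sketch the early exit saves both additions and weight fetches, mirroring the cancelled memory accesses mentioned above, while the differential update does work proportional to the number of changed inputs rather than to the fan-in.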
Several of the implementations presented here have been plotted in figures E1.4.3 (digital neurochips) and E1.4.4 (dedicated neurocomputers and supercomputers). As can be seen from these two figures, some architectural improvements are to be expected from the techniques used in designs like the SMA and the NESPINN, which could reach speed performances similar to the CNS-1.
We should not end this section without mentioning an interesting alternative that has recently emerged. To cope with the limited accuracy, new learning algorithms with quantized weights have started to appear (see also Section E1.2.4). One might call them 'VLSI-friendly learning algorithms', which was the topic covered in MicroNeuro'94. Such algorithms could be used to map neural networks onto FPGAs or onto custom integrated circuits. The first such learning algorithm (Armstrong and Gecsie 1979, 1991) in fact synthesizes Boolean functions using adaptive tree networks whose elements, after training and elimination of redundant elements, perform classical (Boolean) logical operations (AND and OR). This line of research has been extended by using a combination of AND and OR gates after an initial layer of threshold gates (Ayestaran and Prager 1993, Bose and Garga 1993). New learning algorithms have been developed by quantizing other learning algorithms (Höhfeld and Fahlman 1991, Jabri and Flower 1991, Makram-Ebeid et al 1989, Pérez et al 1992, Sakaue et al 1993, Shoemaker et al 1990, Thiran et al 1991) or by devising new ones (Fiesler et al 1990, Höhfeld and Fahlman 1991, Hollis and Paulos 1994, Hollis et al 1991, Mézard and Nadal 1989, Nakayama and Katayama 1991, Oliveira and Sangiovanni-Vincentelli 1994, Walter 1989, Xie and Jabri 1992), a particular class being the one dealing with threshold gates (Beiu et al 1994a, Beiu and Taylor 1995a, b, 1996a, Diederich and Opper 1987, Gruau 1993, Krauth and Mézard 1987, Kim and Park 1995, Littlestone 1988, Raghavan 1988, Roy et al 1993, Tan and Vandewalle 1992, Venkatesh 1989). Four overviews have compared and discussed such constructive algorithms (Śmieja 1993, Fiesler 1994, Moerland and Fiesler 1996, Beiu 1996c). The main conclusion is that a lot of effort and creativity has been used recently to improve digital solutions for implementing artificial neural networks.
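To make the notion of quantized weights concrete, the following minimal sketch rounds real-valued weights onto a signed integer grid and evaluates a threshold gate using integer arithmetic only. It does not reproduce any of the algorithms cited above; the uniform rounding rule and the names `quantize_weights`, `threshold_gate`, `bits` and `w_max` are assumptions introduced here for illustration.

```python
from typing import List, Sequence


def quantize_weights(weights: Sequence[float], bits: int,
                     w_max: float) -> List[int]:
    """Round weights onto a signed grid of `bits` bits (illustrative rule).

    The largest representable magnitude, 2**(bits - 1) - 1, corresponds
    to w_max; values outside [-w_max, w_max] are clipped.
    """
    levels = 2 ** (bits - 1) - 1
    quantized = []
    for w in weights:
        q = round(w * levels / w_max)
        quantized.append(max(-levels, min(levels, q)))
    return quantized


def threshold_gate(int_weights: Sequence[int], inputs: Sequence[int],
                   int_threshold: int) -> int:
    """Output 1 if the integer weighted sum reaches the threshold, else 0."""
    total = sum(w * x for w, x in zip(int_weights, inputs))
    return 1 if total >= int_threshold else 0


# Example: 4-bit weights, i.e. integers in the range -7..7.
q = quantize_weights([0.8, -0.33, 0.1], bits=4, w_max=1.0)
```

A gate built this way needs only small adders and a comparator, which is what makes integer weights attractive for FPGA or custom digital implementations; the price is the rounding error, which the learning algorithm has to tolerate or compensate for.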
The many designs proposed over the years make this a lively area and confirm the strong interest in it. Fresh proposals, together with estimates and/or results, already show impressive performance, competing with analog chips and reaching towards a region which, not so long ago, was considered accessible only to future optical computing.
