For the calculation of the other parts of a neural algorithm, like scaling factors or the transfer function, the systolic matrix-vector calculation is embedded in an asynchronous multiprocessor of Motorola MC68040 CPUs. This allows the system to be programmed easily in high-level programming languages. In many models of neural net [...]

2 System Architecture

SYNAPSE-1 is a modular system whose building blocks are arranged in a two-dimensional structure [3, 4]. The building blocks are (Fig. 1) a two-dimensional array of MA16 neuro signal processors, weight memories, data units, and a control unit.
The central part of the neurocomputer is the matrix of processing elements. The processing elements receive data from the Data Units at the left edge of the processing array. The synaptic weights that are required for processing are input from the weight memories at the top edge. Partial results are computed and pipelined along the rows of the matrix to the right. The results are routed back to the Data Units. Here all computational steps are executed that cannot be done in the processing array itself. Processing is done under central control by the Control Unit. The processing power and the storage capacity of the memories can be adapted to the application needs. The minimal system configuration which is operat-
Units
System extension is easily achieved by using the appropriate number of boards in a system crate. The interconnection is provided by two special-purpose buses for weights and control signals, and a system bus. The VMEbus has been chosen as the system bus for simple interfacing to off-the-shelf hardware like the host computer or I/O devices. For initialization of the weights, and for reading out and storing of learned weights, there is also an interface to the Control Unit for direct access. Via this connection the Weight Memory can be addressed sequentially using single and block transfers.
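The dataflow through the processing array described in this section (data entering from the left, weights from the top, partial results handed along each row to the right) can be illustrated with a small functional sketch in C++. This is only a simplified model under stated assumptions: the function name `systolic_matvec`, the parameter `pes_per_row`, and the even split of the input vector across the processing elements of a row are hypothetical and do not reflect the actual MA16 timing or data layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Functional model of one matrix-vector product y = W * x.
// Each matrix row is handled by a chain of `pes_per_row` processing elements.
// Every element adds the contribution of its own slice of the input vector to
// the partial sum received from its left neighbour and passes it to the right.
// (Assumption: even slicing; the real array is pipelined in hardware.)
std::vector<std::int64_t> systolic_matvec(
    const std::vector<std::vector<std::int32_t>>& W,
    const std::vector<std::int32_t>& x,
    std::size_t pes_per_row) {
    assert(pes_per_row > 0);
    const std::size_t n = x.size();
    const std::size_t slice = (n + pes_per_row - 1) / pes_per_row;

    std::vector<std::int64_t> y(W.size(), 0);
    for (std::size_t r = 0; r < W.size(); ++r) {
        assert(W[r].size() == n);
        std::int64_t partial = 0;  // value entering the row at the left edge
        for (std::size_t pe = 0; pe < pes_per_row; ++pe) {
            const std::size_t begin = pe * slice;
            const std::size_t end = std::min(n, begin + slice);
            for (std::size_t k = begin; k < end; ++k) {
                partial += static_cast<std::int64_t>(W[r][k]) * x[k];
            }
            // `partial` is now handed to the next element on the right.
        }
        y[r] = partial;  // leaves the row at the right edge, back to the Data Unit
    }
    return y;
}
```

The result is independent of `pes_per_row`; the parameter only changes how the work is divided along a row, which mirrors the statement that processing power can be adapted to the application.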
Neuroprocessor Unit

Data Unit
A Data Unit feeds the data to be processed by the selected neural algorithm into the two rows of MA16 processors that are implemented on one Neuroprocessor Unit, receives their results, and postprocesses them (Fig. 3). The Data Unit is controlled by a Motorola MC68040 CPU with local memory and a VMEbus interface. The data to be processed in the Neuroprocessor Unit are stored in a separate on-board memory (Y-Memory). In most cases postprocessing is done by a special-purpose arithmetic unit tailored to the operations required in most neural algorithms.
5.2 CPU and VMEbus Interface
The processor module uses a Motorola MC68040 as CPU [5]. This [...] The VME master interface of the CPU supports "normal" and block transfers in extended address space. Fast read out of data from another VME board, e.g., video boards, to the CPU or the Y- [...] The VME slave interface is functionally completely independent of the master interface. It allows parallel access to the multiported data and instruction memory of the CPU, to the data memory of the Neuroprocessor Unit, and to control registers of the DMA controller and the Arithmetic Unit. To [...]
Arithmetic Unit
For postprocessing of the accumulated results of the matrix-vector products, as they are calculated by the Neuroprocessor Units, there is a special-purpose pipeline processor. In this Arithmetic Unit (Fig. 4) scaling factors and the transfer function of the neural algorithm are computed.
At the input of the Arithmetic Unit the data of the accumulation bus are stored in a FIFO buffer. Since these data have a width of 48 bits, the most significant bits have to be selected for processing in the Arithmetic Unit. This is done by a barrel shifter that is controlled by the information provided by a max detector. During processing this max detector snoops on the accumulation bus of the Neuroprocessor Unit in order to detect and to store the position of the most significant bit.
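The interplay of max detector and barrel shifter can be sketched in C++ as follows. This is a minimal illustration, not the hardware design: the output width of 16 bits, the names `max_detect` and `select_significant_bits`, and the choice of one common shift for a whole block of accumulated values are assumptions made for the example. The 48-bit accumulator values are carried in 64-bit integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Max detector: highest bit position (0-based) used by any magnitude in the
// block, ignoring the sign bit. Returns -1 if all values are 0 or -1.
int max_detect(const std::vector<std::int64_t>& acc) {
    std::uint64_t seen = 0;
    for (std::int64_t v : acc) {
        // For negative values, ~v has the same significant bit range.
        seen |= static_cast<std::uint64_t>(v < 0 ? ~v : v);
    }
    int msb = -1;
    while (seen != 0) { ++msb; seen >>= 1; }
    return msb;
}

struct Selected {
    std::vector<std::int16_t> values;  // most significant bits of each value
    int shift;                         // right shift applied by the barrel shifter
};

// Barrel shifter: shift every value right so that the detected most
// significant bit plus a sign bit fit into `out_bits` (assumed 16 here).
Selected select_significant_bits(const std::vector<std::int64_t>& acc,
                                 int out_bits = 16) {
    assert(out_bits >= 2 && out_bits <= 16);
    const int msb = max_detect(acc);
    const int shift = std::max(0, msb + 2 - out_bits);
    Selected out{{}, shift};
    for (std::int64_t v : acc) {
        // Right shift of a negative value is arithmetic since C++20 and on
        // mainstream compilers before that.
        out.values.push_back(static_cast<std::int16_t>(v >> shift));
    }
    return out;
}
```

Returning the shift together with the values keeps the scaling information available for later stages, for example when a scaling factor has to compensate for the discarded low-order bits.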
The Arithmetic Unit consists of an ALU, a multiplier, and a look-up [...]
Programmable Sequencer
The main part of the programmable sequencer is an SRAM in which programs are stored that consist of very-long-instruction-words (VLIW) of 64 bits width. The VLIW is divided into individual fields that are interpreted as coded commands for the units controlled by the sequencer. These are (see Fig. 6) the top, upper, and lower row of MA16 chips; the accumulation and data output FIFOs; the backplane drivers; and the weight and Z-memories. At every step one control word is read out and used to generate the control signals for these units. For more compact program coding, two nested hardware loops are available with loop counters and address registers. An additional repeat function allows one instruction to be repeated for a programmable number of clock cycles.
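How two nested hardware loops and a repeat function compress a program can be sketched in C++. The sketch expands a compact program into the stream of control words issued per clock cycle and shows how a field can be cut out of a 64-bit word. All names (`Instruction`, `LoopProgram`, `expand`, `field`) and the field positions used in the usage example are hypothetical; the actual field layout of the SYNAPSE-1 VLIW is not given in the text.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Extract `width` bits starting at bit `lsb` from a 64-bit control word.
inline std::uint64_t field(std::uint64_t word, unsigned lsb, unsigned width) {
    assert(width >= 1 && lsb + width <= 64);
    const std::uint64_t mask =
        (width == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << width) - 1);
    return (word >> lsb) & mask;
}

struct Instruction {
    std::uint64_t control_word;  // one 64-bit VLIW
    std::uint32_t repeat;        // number of clock cycles the word is issued
};

struct LoopProgram {
    std::vector<Instruction> body;  // body of the inner loop
    std::uint32_t outer_count;      // iterations of the outer hardware loop
    std::uint32_t inner_count;      // iterations of the inner hardware loop
};

// Expand the compact program into the control words seen by the units,
// one entry per clock cycle.
std::vector<std::uint64_t> expand(const LoopProgram& p) {
    std::vector<std::uint64_t> issued;
    for (std::uint32_t o = 0; o < p.outer_count; ++o) {
        for (std::uint32_t i = 0; i < p.inner_count; ++i) {
            for (const Instruction& ins : p.body) {
                assert(ins.repeat >= 1);
                for (std::uint32_t r = 0; r < ins.repeat; ++r) {
                    issued.push_back(ins.control_word);
                }
            }
        }
    }
    return issued;
}
```

With a two-instruction body, loop counts of 2 and 2, and repeat counts of 1 and 3, two stored words expand to 16 issued cycles, which illustrates why loops and repeats make program coding more compact.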
To ease programming and to avoid programming errors, the sequencer words are not used directly for controlling the hardware. Rather, additional hardware automatically inserts opcodes for refreshes of the dynamic memories into the instruction sequence from the program memory. During refreshes, opcodes for "no operation" are generated for the other units, like the MA16 array. The program code for the sequencer is generated using a C++ cross compiler and a special assembler.
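The automatic refresh insertion can be pictured with the following C++ sketch. It models the instruction stream as a sequence of cycles and inserts a refresh cycle after a fixed number of program cycles, during which the other units receive a no-operation. The fixed interval, the names `Op`, `Cycle`, and `insert_refresh`, and the reduction of the units to two groups are assumptions for illustration only; the text does not state how the hardware schedules refreshes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { Program, Refresh, Nop };

struct Cycle {
    Op memory_op;               // what the dynamic memories receive
    Op array_op;                // what the other units (e.g. MA16 array) receive
    std::size_t program_index;  // next program word; it is issued only in Program cycles
};

// Interleave refresh cycles into a program of `program_length` words.
// After every `refresh_interval` program words one refresh cycle is inserted
// (assumed policy), and the program resumes unchanged afterwards.
std::vector<Cycle> insert_refresh(std::size_t program_length,
                                  std::size_t refresh_interval) {
    assert(refresh_interval > 0);
    std::vector<Cycle> out;
    std::size_t since_refresh = 0;
    for (std::size_t i = 0; i < program_length; ++i) {
        if (since_refresh == refresh_interval) {
            // Memories are refreshed; all other units idle for this cycle.
            out.push_back({Op::Refresh, Op::Nop, i});
            since_refresh = 0;
        }
        out.push_back({Op::Program, Op::Program, i});
        ++since_refresh;
    }
    return out;
}
```

Because the insertion happens outside the stored program, the programmer's instruction sequence stays free of refresh handling, which is the point made in the paragraph above.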
Usual programs, such as those for standard matrix-vector products, are already compiled and can be loaded directly from a library. After some modifications the sequencer is started. Such modifications are, e.g., addresses for the weights and the loop [...]

[...] give users the option of non-hardware-dependent programming. The top two layers are entirely independent of the hardware, while the two layers beneath them are increasingly hardware-dependent.
Figure 7. Software architecture of SYNAPSE-1
SENN++: Simulator
The top level is occupied by the simulator component of SENN++ (Simulation Environment for Neural Networks). This is where the topology of a neural network, the algorithms to be used, and the data to be learned are entered, with the aid of an easy-to-use description language. An interactive graphical user interface allows users to monitor the simulation, change parameters, select data records for learning, and so on. Progress of the simulation can be illustrated on-line with a variety of graphics features. At this level users are dependent on the existing objects (neurons, synapses, algorithms, etc.), and do not have to program anything nor have any knowledge of SYNAPSE-1. The option of programming by the user is provided on the next lower level.
SENN++: Class Library
The simulator is implemented with a class library which provides users with objects such as neurons, synapses, clusters of neurons, connectors between clusters, various methods of visualization, complete learning algorithms, administration of learning patterns, etc. A C++ programmer can use these classes for his or her own programs, derive new classes from the existing ones, and add new properties. This makes it possible to develop new algorithms with a minimum of programming effort. Provided that the programmer adheres to certain standards, these new classes automatically become available in the simulator and can be integrated there and manipulated interactively.

There is a separate matrix class for each type of memory, Y, W and Z, each containing the elementary operations that can be executed for that memory. Thanks to the generous SYNAPSE-1 hardware it is possible to execute concatenations of compute-intensive and non-compute-intensive operations in one step. For example, an entire layer of neurons can be processed with a single nAPL instruction containing all parameters for the operations involved. This means that extremely powerful hardware-optimized instructions are available for execution. Each of these instructions is executed with the aid of messages which initiate the parallel processes (tasks) on the Control Unit and Data Unit, ensuring
