A digital neural network architecture including a forward cascade of layers of neurons, having one input channel and one output channel, for forward processing of data examples that include many data packets, and a backward cascade of layers of neurons, having one input channel and one output channel, for backward propagation learning of errors of the processed data examples, each packet being of a given size. The forward cascade is adapted to be fed, through the input channel, with a succession of data examples and to deliver a succession of partially and fully processed data examples, each consisting of a plurality of packets. The fully processed data examples are delivered through the one output channel. Each one of the layers is adapted to receive as input in its input channel a first number of data packets per time unit and to deliver as output in its output channel a second number of data packets per time unit. The forward cascade of layers is inter-connected to the backward cascade of layers by means that include inter-layer structure, such that, during the processing phase of the forward cascade of neurons, any given data example that is fed from a given layer in the forward cascade to a corresponding layer in the backward cascade, through the means, is synchronized with the error of the given processed data example that is fed to the corresponding layer from a preceding layer in the backward cascade. The first number of data packets and the second number of data packets are the same for all the layers.
The present invention concerns a digital hardware architecture for realizing neural networks.
Throughout the description, reference is made to the following list of references:
[...] implement linear image convolutions, and the non-linear output function offers additional capabilities. While it is often extremely difficult to construct empirically a traditional image processing algorithm for non-trivial tasks, the self-learning capabilities of neural networks can be exploited for some image processing applications in an automatic way. However, available high performance architectures for neural network processing do not support such applications in an efficient manner.
A number of digital parallel architectures for neural network processing have been described in the literature [2-11]. Some of them are limited to forward processing only [2, 8], while others support learning as well [3-7, 9-11]. Some architectures employ systolic data flow [6, 9, 10], and the others use only broadcast. As is well known, "systolic" processing means: in systolic computation, data flows rhythmically from one processing unit to another each clock cycle. The multiple identical or similar processing units operate on the data simultaneously, each unit processing a different part of the data at the same time.
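The forward computation of each neuron i in layer R, consistent with the definitions that follow, is:
$$V_R(i) = \sum_{j} W_R(i,j)\, Y_Q(j) + B_R(i), \qquad Y_R(i) = f_R(V_R(i))$$
where: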
Y_R(i) is the output of neuron i in layer R;
Y_Q(j) is the output of neuron j in layer Q and is also the jth input to each neuron in layer R;
W_R(i,j) is the weight assigned in neuron i (in layer R) to its jth input;
B_R(i) is the bias for neuron i in layer R;
V_R(i) is the sum of the weighted inputs and bias for neuron i in layer R;
f_R is the output function for all the neurons in layer R.
Still further, a typical, known per se, basic back propagation learning algorithm is given below:
Learning Algorithm
The Back Propagation supervised learning algorithm with variable learning rate is supported. Each example is presented to the network together with a corresponding desired output. The algorithm minimizes the difference (error) between the network outputs and the desired outputs for every example. This is carried out by calculating the change that must be applied to each weight and bias for each example presented to the network. All weight and bias changes are accumulated over all input examples (an epoch).
Subsequently, weights and biases are updated, and the next epoch can begin. This process is repeated until the error reaches the desired level or ceases to improve.
The weights and biases in layer R are changed as follows:
ΔW_R(i,j) is the total accumulated weight change. The error depends on whether the present layer is the last one in the network. For the last layer of the network, say layer S, the error calculated for each neuron is:
where: t(i) is the desired output from neuron i in layer S, and f'_S is the derivative of the output function for layer S. For a hidden layer, say layer R, preceded by Q and succeeded by S (containing s neurons), the error calculated for each neuron is:
$$\delta_R(i) = f'_R\big(V_R(i)\big) \sum_{j=1}^{s} \delta_S(j)\, W_S(j,i)$$
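The forward computation and the error terms described above can be sketched in a few lines of Python. This is a minimal illustration in floating point, not the fixed-point hardware data path. The function names are hypothetical, and the last-layer error form (desired output minus actual output, times the derivative) is the common textbook convention, assumed here because the corresponding expression is not reproduced in this text.

```python
def forward_layer(weights, biases, inputs, f):
    """Return (V, Y): V(i) = sum_j W(i,j) * Y_Q(j) + B(i) and Y(i) = f(V(i))."""
    sums = [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]
    return sums, [f(v) for v in sums]


def last_layer_error(targets, outputs, sums, f_prime):
    """Assumed textbook form: delta_S(i) = (t(i) - Y_S(i)) * f'_S(V_S(i))."""
    return [(t - y) * f_prime(v) for t, y, v in zip(targets, outputs, sums)]


def hidden_layer_error(sums_r, deltas_s, weights_s, f_prime):
    """delta_R(i) = f'_R(V_R(i)) * sum_j delta_S(j) * W_S(j, i).

    weights_s[j][i] is the weight that neuron j of the succeeding layer S
    assigns to its i-th input, i.e. to the output of neuron i in layer R.
    """
    return [f_prime(v) * sum(d * row[i] for d, row in zip(deltas_s, weights_s))
            for i, v in enumerate(sums_r)]
```

Note that the hidden-layer error needs the weights of the succeeding layer, which is why the new forward weights have to be made available to the backward stage of the previous layer after each epoch.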
It is accordingly an object of the present invention to provide for a digital systolic neural network capable of realizing forward processing and back propagation learning simultaneously.
Four terms are used frequently in the following description and appended claims:
1. Data packet: a data unit (typically but not necessarily 9 bits long) that is fed to the digital neural network of the invention. The data packet propagates in the internal bus(es) of the chip and is delivered as output from the network after having been processed by the network. Insofar as an image processing application is concerned, a data packet that is fed to the first layer of the network during the processing phase normally corresponds to a pixel.
2. Data set: the number of data packets that are fed simultaneously to the neural network on a single channel.
3. Data example: a collection of data packets that constitute one example during processing of the neural network. Thus, for example, the 5x5 neighboring pixels of a given pixel (i.e. 25 data packets or 5 data sets) form one data example.
4. Epoch: a collection of data examples that should be fed to and processed by the neural network, during the learning phase, in order to gather "sufficient error information" that will allow change of the network's parameters, i.e. weights and/or biases.
It should be noted that, for clarity, the proposed architecture of the invention is described with reference to an image processing application. Those versed in the art will readily appreciate that the invention is by no means bound by this exemplary application.
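The relationship between data packets, data sets and data examples defined above can be made concrete with a short Python sketch. It assumes the image is a list of rows of pixel values and that the packets of a neighborhood are taken in row-major order; the ordering and both function names are illustrative assumptions, not taken from the description.

```python
def neighborhood(image, row, col, size=5):
    """Collect the size x size pixels centred on (row, col) as one data example.

    Each pixel is one data packet; row-major packet order is an assumption.
    """
    half = size // 2
    return [image[r][c]
            for r in range(row - half, row + half + 1)
            for c in range(col - half, col + half + 1)]


def split_into_data_sets(example, packets_per_set=5):
    """Partition one data example into data sets of packets_per_set packets."""
    if len(example) % packets_per_set != 0:
        raise ValueError("example length must be a multiple of the set size")
    return [example[k:k + packets_per_set]
            for k in range(0, len(example), packets_per_set)]
```

For a 5x5 neighborhood this yields 25 data packets grouped into 5 data sets, which matches the example given in the definition of a data example.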
A Digital Systolic Neural Network Chip (DSNC) of the invention implements a high performance architecture for real-time image processing. It can be used as a single or a multi-layer neural network. The proposed architecture supports many forward algorithms and the Back Propagation learning algorithms, which are utilized in parallel. The chip is designed for fast execution of both forward processing and learning, and the latter can be carried out at essentially the same speed as the former. The output and derivative functions (i.e. f_R and f'_R) in each layer, as well as the network topology, are user defined. The output and derivative functions are normally, although not necessarily, realized as a Look Up Table (LUT). In a specific example of 25 neurons, arranged in 5 consecutive banks, each holding 5 neurons, 25 data packets are supported for each computation, making the DSNC suitable for real-time image processing of 5x5 pixel neighborhoods. Preferably, although not necessarily, signed integer arithmetic is employed with e.g. 9-bit input/output precision and e.g. 8 bits for weights and biases. The output forward value and the error calculated in each neuron are normalized to nine bits, thus taking advantage of the full dynamic range available to them. Four data channels are employed, each capable of carrying five data packets, giving 20 parallel buses in total. By concatenating multiple DSNCs, almost any size neural network can be constructed for real-time image processing.
Accordingly, the present invention provides for a digital neural network architecture that includes a forward cascade of layers of neurons, having one or more input channels and one or more output channels, for forward processing of data examples that include, each, a plurality of data packets, and a backward cascade of layers of neurons, having one or more input channels and one or more output channels, for backward propagation learning of respective errors of the processed data examples, each packet being of a given size. The forward cascade is adapted to be fed, through the one or more input channels, with a succession of data examples and to deliver a succession of partially and fully processed data examples, each consisting of a plurality of packets. The fully processed data examples are delivered through said one or more output channels. Each one of the layers is adapted to receive as input in its input channel a first number of data packets per time unit and to deliver as output in its output channel a second number of data packets per said time unit; the improvement wherein:
(i) the forward cascade of layers is inter-connected to the backward cascade of layers by means that include inter-layer structure, such that, during the processing phase of the forward cascade of neurons, any given data example that is fed from a given layer in the forward cascade to a corresponding layer in the backward cascade, through said means, is essentially synchronized with the error of said given processed data example that is fed to said corresponding layer from a preceding layer in said backward cascade; and
(ii) the first number of data packets and the second number of data packets are essentially the same for all said layers.
Preferably, the computation propagates in a systolic manner.
The invention further provides for a digital neural network architecture that includes a forward cascade of layers of neurons, having one or more input channels and one or more output channels, for forward processing of data examples that include, each, a plurality of data packets, and a backward cascade of layers of neurons, having one or more input channels and one or more output channels, for backward propagation learning of respective errors of the processed data examples, each packet being of a given size.
The forward cascade is adapted to be fed, through the one or more input channels, with a succession of data examples and to deliver a succession of partially and fully processed data examples, each consisting of a plurality of packets. The fully processed data examples are delivered through the one or more output channels. Each one of the layers is adapted to receive a first number of data packets per time unit and to deliver as output a second number of data packets per said time unit; the improvement wherein:
(i) each layer includes at least two banks of neurons operating in a systolic manner;
(ii) each data example is partitioned to a plurality of data [...]
[...] with the input layer of the backward cascade, having said index l+1, being inter-connected to both the output layer of said forward cascade, having said index l, and to an additional output, thereby constituting a cascade of 2l layers, each associated with an essentially identical propagation period T, the latter being the elapsed time for propagating data between the at least one input channel and the at least one output channel of each layer from among said 2l layers.
The network further has an inter-layer pipeline structure for attaining systolic operation of the backward and forward cascades, such that the input to each layer having index i in the forward cascade is also fed as input, after having been delayed for a delay period T_d by the inter-layer structure that includes hard-wired logic, to a corresponding layer having index (2l-i+1) in the backward cascade, such that said delay period T_d equals (4l-4i+3) times said propagation period T.
it and/or j may or may not be integers, all as required and appropriate.
The realization of the neural network of the invention is not confined to a specific chip arrangement. Thus, a chip may constitute one layer with a forward (and possibly also a backward propagation) constituent. By this embodiment, a neural network of n layers is realized by n distinct chips. Alternatively, a single chip may encompass only a portion of a layer or, if desired, more than one layer. Those versed in the art will, therefore, appreciate that the design of the chip in terms of the neural network portion that is realized is determined as required and appropriate, depending upon the particular application.
BRIEF DESCRIPTION OF THE DRAWINGS
For a better understanding, the invention will now be described with reference to the accompanying drawings, in 
DESCRIPTION OF PREFERRED EMBODIMENTS
Multiple alternative VLSI architectures can be designed for efficient execution of neural networks for real-time image processing tasks. For simplicity, and for network partitioning reasons, only architectures in which each chip computes a single layer of the neural network are considered. By this particular example, the architecture is preferably subject to two implementation constraints in order to be implemented with current technology: the chips are limited to (i) one million gates and (ii) 500 I/O pins. In addition, it is proposed that many useful image processing tasks can be accomplished with 5x5 pixel neighborhoods. Thus, digital VLSI neural network architectures having up to 25 neurons and up to 25 inputs per layer are considered. The neurons will support multiple forward algorithms, as well as the Back Propagation learning algorithm. I/O data precision is chosen to be nine bits (to provide for signed 256 gray levels) and weight precision is 8 bits.
Under the above exemplary constraints (1 MGates, 500 I/O pins, 25 neurons per layer, I/O data precision of 9 bits and weight precision of 8 bits), an optimal price performance is sought. Preferably (although not necessarily), systolic architectures are employed, and it is assumed that each algebraic calculation (such as an addition or a multiplication) is completed during one clock cycle. To maximize hardware utilization, the pipeline should preferably be balanced, in the sense that all other operations should require substantially similar time to complete. As will be explained in greater detail below, when the number of physical neurons is less than 25, multiple cycles are required for execution of the 25 logical neurons of the network layer, and similarly for the inputs. The proposed solution assumes balanced architectures, namely architectures containing no idle resources when fully loaded. For example, if each multiplication is completed in one clock cycle, there is no need for more multipliers than there are inputs to each neuron. If an algebraic operation (e.g. multiplication) requires more than one cycle, then known per se techniques such as pipelining or time sharing of multipliers may be used in order to complete the operation while maintaining the same input and output throughput.
It is desired to design the architectures so as to execute Back Propagation learning at essentially the same speed as forward processing. This is motivated by the many research and development applications where learning constitutes the principal time consuming effort. Analyzing the algorithmic data flow of Back Propagation can be performed along similar directions, and the results are already incorporated in the architecture described herein. Note that if learning is eliminated, or is designed for substantially lower performance than forward processing, the chip size can be reduced or, alternatively, faster architectures which can process larger images in real time can be designed under the same constraints.
Before turning to a detailed discussion of one proposed architecture, it should be noted that the various constraints specified in the above discussion exemplify the motivation in designing a neural network that meets one's needs. It should be emphasized that the present invention is by no means bound by the specified constraints. Thus, for example, the proposed architecture is not bound to one layer. The maximal allowed number of gates and the I/O pads are determined depending upon the particular application. The implementation is not confined to 5x5 pixel neighborhoods. Likewise, 25 neurons, nine bits of data precision and a weight precision of eight bits are only one example. It is not necessary to complete the algebraic operations during one clock cycle. The pipeline need not necessarily be balanced, and alternative combinations of parallel and serial architectures may be realized. Whilst a balanced architecture is preferable, it is not obligatory.
Accordingly, the various constraints that are mentioned herein may be adjusted or eliminated, and other constraints may be imposed all as required and appropriate, depending upon the particular application.
A Description of a Preferred DSNC Architecture
There follows a description of some variants of one exemplary architecture in which a Digital Systolic Neural Network Chip (DSNC) realizes one layer (both forward processing and backward propagation learning), hereinafter "one-layer DSNC architecture". Referring at the outset to FIG. 2, the exemplary DSNC architecture 10 implements one layer of 25 neurons and up to 25 data packets to each neuron for each computation. If desired, fewer than 25 data packets may be fed to the first layer (e.g. 3x3). In this case, several neurons in the first layer (and possibly also in interim layers) will be rendered inoperative. The 25 inputs are presented in five consecutive cycles of data sets of five data packets each. The 25 neurons are grouped in five banks 12, 14, 16, 18 and 20, respectively, each containing five neurons. As will be explained in greater detail below, each neuron is capable of realizing both the forward and backward execution. Neurons residing in the same bank perform the same operations in parallel. All five banks operate systolically in parallel on different input data sets. Data to and from the chip are carried over two input channels 22, 28 and two output [...]
Since most networks make use of up to l = three layers, the architecture is optimized for fast learning when using such networks. As is well known, back propagation learning on a data example (several data packets) occurs only after the forward processing of that example has already been completed throughout the network. Immediately afterwards, weight changes associated with that input example are calculated, and the same inputs are utilized again for this calculation. Hence, the architecture should provide for intermediate internal storage of the inputs, since it is impractical to read them again.
Learning is carried out in the network in the opposite direction to forward processing, i.e., from the last layer 34 to the first layer 30, utilizing to this end DSNC constituents 34″, 32″ and 30″. Thus, the total length of the inter-layer pipeline consists of six layers (three forwards, three backwards) as shown in FIG. 3. While learning at full speed (a new data set entering every clock cycle), the three-layer network confronts a time lag of eleven input data examples between the entry of the inputs of the first example and the time they are used again in the first layer. To ensure the availability of these inputs without inputting them again, e.g. a 275-entry, 9-bit-wide FIFO buffer, being an exemplary inter-layer structure, is incorporated in the chip (275 stands for 11 registers, each holding 25 data packets, each 9 bits long). To this end, 11 registers 36(1) ... 36(11), having the same time delay as constituents 30′, 32′, 34′, 34″ and 32″, assure that constituent 30″ will be fed essentially simultaneously with the processed data example from constituent 32″ and the data example from register 36(11), respectively.
In a similar manner, registers 38(1) ... 38(7) assure that constituent 32″ will be fed essentially simultaneously with the processed data example from constituent 34″ and the data example from register 38(7), respectively.
Similarly, 3 registers 40(1), 40(2) and 40(3) assure that constituent 34″ will be fed essentially simultaneously with data from constituents 32′, 34′ and also with desired data outputs. In the latter case, a desired data output, fed via channel 33, and the fully processed output of constituent 34′, fed to channel 33″, will be synchronized with the partially processed output from constituent 32′ for proper execution of algorithmic expression No. 9.
It should be noted that whereas FIG. 3 illustrates a three-layer architecture, the invention is, of course, not bound by this configuration. Thus, for example, should a four-layer architecture be employed, i.e. a fourth layer be concatenated to layer 34, this will require use of four constituent [...]
These channels can also be used in a known per se manner by the host to monitor the learning progress of the network.
Normally, the DSNC accommodates j banks, each of which holds up to m neurons, such that m*j = n, and if n is not an integral multiple of m, then j can be a non-integer number. Moreover, the DSNC can receive inputs over k_i buses and can generate outputs over k_o buses, such that m = k_i*i_t, and if m is not an integral multiple of k_i, then i_t is not an integer number, and further m = k_o*o_f, and if m is not an integral multiple of k_o, then o_f is not an integer.
Each neuron computes the sum of its weighted inputs. The sum is used to address the LUTs and retrieve the appropriate output and derivative values. An internal Sum Channel 70, comprising five 9-bit buses, transfers in parallel the five sums of a single bank to the LUTs. In consecutive cycles, consecutive banks send their sum outputs over the Sum Channel, thus enabling efficient time sharing of the LUTs.
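The LUT addressing described above, where a neuron's 9-bit sum selects an output value, can be sketched in Python as follows. The two's-complement addressing, the input scale factor of 64 and the scaling of outputs to the range -255..255 are illustrative assumptions; the description states only that the sum is 9 bits wide and addresses user-defined LUTs.

```python
import math

SUM_BITS = 9                           # width of each Sum Channel bus
LUT_SIZE = 1 << SUM_BITS               # 512 addressable entries
OUT_MAX = (1 << (SUM_BITS - 1)) - 1    # 255, largest signed 9-bit magnitude used


def build_lut(func, input_scale):
    """Tabulate func over all signed 9-bit sums (assumed two's-complement address)."""
    table = []
    for address in range(LUT_SIZE):
        # Addresses in the upper half represent negative sums.
        value = address - LUT_SIZE if address >= LUT_SIZE // 2 else address
        table.append(round(OUT_MAX * func(value / input_scale)))
    return table


def lookup(table, signed_sum):
    """Fetch the tabulated output for a signed sum in the range -256..255."""
    return table[signed_sum & (LUT_SIZE - 1)]


# Hypothetical example: a hyperbolic tangent output function.
TANH_LUT = build_lut(math.tanh, 64.0)
```

A second table built the same way from the derivative function would serve the backward stage. Because the table is indexed by the sum alone, one physical LUT can be shared by every neuron of a layer, which is what the time sharing over the Sum Channel exploits.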
Normally, when using a DSNC, there are three distinct operational modes: initialization, processing, and concluding an epoch. In order to use the chip efficiently and at the highest performance, every mode requires data to be entered in a well organized fashion. The principal rule to remember while using this architecture is that input data flows systolically between the five banks, and accordingly there is a time lag of one clock cycle between the composition of calculations in adjacent banks. The description hereinbelow will predominantly focus on the operation of the proposed architecture in the processing stage. The known per se aspects of the initialization and concluding-an-epoch stages will not be expounded upon herein.
Initialization
The organization of weights (in forward and backward stages) into register files depends on the order in which inputs enter the neuron. For example, the first pile of registers in each neuron holds the weights for inputs No. 1, 6, 11, 16 and 21 out of the total 25 data packets, which enter in five consecutive input cycles. In order to take advantage of the systolic architecture, the weights should be loaded into the banks from the last bank to the first bank. Weight initialization requires five input cycles per neuron, or 125 clock cycles to initialize all 25 neurons. The biases and the learning rate are initialized likewise, each taking five clock cycles. An 8-bit weight is appended with an allow-change bit which controls the learning of the network. If an allow-change bit is zero, the corresponding weight is not allowed to change during learning.
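The weight organization and the initialization cycle count given above can be checked with a small Python sketch. Only the contents of the first pile (inputs 1, 6, 11, 16, 21) are stated in the description; extending the same stride-5 pattern to piles 2 through 5 is an assumption, and the function names are hypothetical.

```python
def pile_input_indices(pile, inputs_total=25, packets_per_set=5):
    """1-based input numbers whose weights are held in register pile `pile` (1-based).

    Pile p is assumed to hold the weight of the p-th packet of every data set.
    """
    return list(range(pile, inputs_total + 1, packets_per_set))


def weight_init_cycles(neurons=25, input_cycles_per_neuron=5):
    """Clock cycles needed to load the weights of all neurons of one layer."""
    return neurons * input_cycles_per_neuron
```

With the default parameters the first pile holds inputs 1, 6, 11, 16 and 21, and loading all 25 neurons takes 125 clock cycles, as stated in the text.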
The output and derivative functions are loaded into the LUTs via the bi-directional Output Forward Channel. The data are entered serially, and each entry is loaded simultaneously to all five identical LUTs. The two sets of LUTs (for the output and derivative functions) are loaded in series.
It should be noted that, whilst for simplicity the LUT is depicted as an integral portion of the neuron architecture, this is normally not the case, since the LUT is preferably utilized as a separate module shared by more than one neuron (and, by this particular embodiment, by all the neurons).
Next, and as shown in column 2 of Table II, a second data set containing data packets 6-10 of example A is fed simultaneously to neurons N1-N5, whilst at the same time the previous input 1-5 is fed from 5 input registers 27 to each one of neurons N6-N10 in the second bank 14. Put differently, N6-N10 receive a delayed input of data packets 1-5. The procedure continues in a similar manner in cycles 3 and 4, and in cycle 5 neurons N1-N5 receive, each, the fifth data set consisting of data packets 21-25; neurons N6-N10 receive, each, delayed inputs of data packets 16-20; neurons N11-N15 receive, each, delayed inputs of data packets 11-15; neurons N16-N20 receive, each, delayed inputs of data packets 6-10; and neurons N21-N25 receive, each, delayed inputs of data packets 1-5 of data example A. In cycle 6, neurons N1-N5 receive data packets of a new data example B.
In cycle 11, each one of the neurons N1-N5 is capable of producing an output since it has already processed the entire 25 input packets (in five cycles). By this particular example, the five cycle delay from the receipt of the last input sequence (i.e. data packets 21-25 in cycle 5) until the result (i.e. the processed data example) is actually delivered from neurons N1-N5 (in cycle 11) stems from 3 cycles for summation (in unit 66), one cycle for bias calculation in module 68 and one cycle for LUT calculation (72).
Reverting now to cycle 11 of Table II, the remaining neurons N6-N10 process delayed data packets 21-25 of data example B; neurons N11-N15 process delayed inputs 16-20; N16-N20 process delayed data packets 11-15; and N21-N25 process delayed data packets 6-10.
During the same cycle (11), neurons N1-N5 also commence a new cycle of calculation and receive input packets 1-5 that belong to the next 25 inputs of data example C.
During succeeding cycles 12, 13, 14 and 15 N1-N5 will process the remaining packets 6 to 25 in order to yield the next output in cycle 21 (not shown).
Reverting now to cycle 12, neurons N6-N10 are now capable of generating an output after having processed all 25 data packets of data example A. In cycle 12, N11-N15, N16-N20 and N21-N25 continue to receive the appropriate input for processing. As shown, in cycle 13 neurons N11-N15 produce an output, and likewise in cycles 14 and 15 neurons N16-N20 and N21-N25 produce their respective outputs.
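The systolic schedule walked through above can be summarized in a short Python sketch that reports which data packets each bank receives in a given clock cycle and when each bank delivers its output. Banks are numbered from 0 (N1-N5) to 4 (N21-N25) and examples from 0 (example A); this numbering, the function names and the assumption that example A starts at cycle 1 on an empty pipeline are illustrative choices, not part of the description.

```python
SETS_PER_EXAMPLE = 5   # five data sets per 25-packet data example
PACKETS_PER_SET = 5    # five data packets per data set
OUTPUT_LATENCY = 6     # last input set in cycle 5, first output in cycle 11


def bank_input(cycle, bank):
    """Return (example, first_packet, last_packet) seen by `bank` at 1-based `cycle`.

    Each bank sees the input stream one cycle later than the bank before it.
    Returns None while the stream has not yet reached the bank.
    """
    step = cycle - 1 - bank
    if step < 0:
        return None
    example, data_set = divmod(step, SETS_PER_EXAMPLE)
    first = data_set * PACKETS_PER_SET + 1
    return example, first, first + PACKETS_PER_SET - 1


def output_cycle(example, bank):
    """Cycle in which `bank` delivers its result for `example`."""
    return (example + 1) * SETS_PER_EXAMPLE + OUTPUT_LATENCY + bank
```

For instance, `bank_input(11, 1)` gives `(1, 21, 25)`, i.e. neurons N6-N10 process packets 21-25 of example B in cycle 11, and `output_cycle(0, 0)` gives 11, the cycle in which N1-N5 deliver the result for example A.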
Attention is now directed to Table II-A below, which illustrates the operation of both the forward and backward cascades of neurons with respect to a succession of data examples (designated as A-P) and extending over 80 cycles of computation. As before, it is assumed that the neural network has been operating for a long time before cycle 1 and continues to operate after cycle 80. Unlike Table II, the rows in Table II-A are broken down by layers, i.e. layer 1 to layer 3 of the forward cascade and layer 3 to layer 1 of the backward cascade. The rows in Table II-A correspond to the six constituents 30′, 32′, 34′, 34″, 32″ and 30″, respectively. Each of the forward layers is further partitioned into "in" and "out" rows, and each backward layer is further partitioned into "error in", "past input" and "error out" rows.
As shown in Table II-A, in clock cycles 1-5 the five sets of data example A are fed to forward layer 1, i.e. constituent 30′ in FIG. 3. In cycles 11-15 partially processed data example A leaves forward layer 1 and enters forward layer 2 and the past input of layer 2. The partially processed data is referred to as the given data example that is fed from the forward to the backward layer. Said given data example may likewise be a raw data example that is fed to the first layer of the forward cascade, or fully processed data, as the case may be. By partially processed data example it is meant that the data example has been processed by neurons N1-N5 (see Table II) and is entering the first out of 7 registers that interconnect constituent 30′ of the forward cascade with constituent 32″ of the backward cascade. The latter is illustrated by the indication of "A" in the past input row of backward layer 2 in cycles 11-15. As clearly arises from Table II-A, the processing of the entire example A by neurons N1-N25 takes 25 cycles, and 5 more cycles are required for the specified summation, bias and LUT calculations, giving rise to the delivery of a fully processed data example A at the output of forward layer 3 at cycles 31-35. Indeed, as shown in Table II [...] (designated as "error out") in backward layer 3, and the specified error is passed back to backward layer 2 (shown in the "error in" line of backward layer 2). Backward layer 2 is now capable of processing also the partially processed example A after the latter was subject to delay by the 7 registers that interconnect the first layer in the forward cascade to the second layer in the backward cascade. The propagation of the latter partially processed example A is depicted in Table II-A in the "past input" row of backward layer 2, commencing from cycles 11-15 and ending at cycles 41-45. Layer 2 is now capable of calculating an error of processed example A in compliance with algorithmic expression 10.
Layer 2 will deliver "error out" results (in compliance with algorithmic expression 10) at cycles 51-55, and the latter will be fed at the same cycles to backward layer 1 as "error in" data ("error in" signifies an error of a processed example). Layer 1 will utilize at the same cycles the past input data that has been propagated through the 11 registers 36(1) to 36(11), and is capable, in its turn, of calculating an error of processed example A in compliance with algorithmic expression 10. In cycles 61-65 backward layer 1 completes the learning phase with respect to example A and produces as an output an error of processed example A.
It is accordingly appreciated that 11 registers are required to interconnect the input of the first layer in the forward cascade to layer 1 of the backward cascade, and likewise 7 registers and 3 registers are required to interconnect the inputs of the second and third layers in the forward cascade to the corresponding layer 2 and layer 3 of the backward cascade, respectively. Since each register holds one data example that contains 25 packets (each 9 bits long), the size of each register is 225 bits.
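The register counts quoted above agree with the delay relation stated earlier in the description, namely a delay of (4l-4i+3) propagation periods for forward layer i in a network of l layers. The following Python sketch evaluates that relation and the resulting buffer sizes; the function names are hypothetical, and treating each propagation period as one example-wide register is an assumption that is consistent with the counts of 11, 7 and 3 given in the text.

```python
def delay_registers(num_layers, layer_index):
    """Registers between the input of forward layer i and its backward layer.

    Evaluates 4*l - 4*i + 3 for a network of l layers (i is 1-based).
    """
    return 4 * num_layers - 4 * layer_index + 3


def buffer_bits(registers, packets_per_example=25, bits_per_packet=9):
    """Total storage, in bits, of a chain of example-wide registers."""
    return registers * packets_per_example * bits_per_packet


# Three-layer network of FIG. 3: 11, 7 and 3 registers for layers 1, 2 and 3.
REGISTERS_PER_LAYER = [delay_registers(3, i) for i in (1, 2, 3)]
```

For the first layer this gives 11 registers of 225 bits each, i.e. 275 nine-bit entries or 2475 bits in total, which matches the 275-entry FIFO buffer mentioned for the three-layer example.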
The maximum size of images that may be processed in real time (30 images per second) depends on the most heavily loaded layer in the network. The load is a function of the number of inputs in a single data example and the number of neurons used. The overall network processing rate is limited by the processing speed of its most heavily loaded layer.
Concluding an Epoch
Updating the weights and biases of the DSNC neurons takes place after the last weight-change calculation is concluded for the last input example of the epoch. Each cycle, one weight in each forward weight register pile is updated, using the change-weight piles. This update is carried out from the last to the first weight in the pile. This order exploits the systolic architecture in the same manner as explained above, and only 125 clock cycles are thus required for passing the new forward weights to the backward stages of the previous layer.
The initial data for operation (network topology, weights, biases, learning rate, output and derivative functions) can be stored in memories or supplied by a host computer. The processed images can come from an image acquisition system, e.g. a CCD camera coupled to the host computer or, if desired, be retrieved from image files stored in or received by the host. In any case, neighborhood generators are responsible for arranging the pixels into the required data sets and feeding them successively into the network. While learning, the calculated new weights and biases can be sent [...]
The same 3-layer network can be built with only one DSNC, executing different layers at different times. This implementation is less expensive but runs slightly more than three times slower in forward mode and slightly more than six times slower while learning, due to the time required to store intermediate data packets in external memories and retrieve the same, and the time required to modify weights, biases, learning rates, and output and derivative functions for each layer.
As depicted, for example, in FIG. 7 [...] appreciate that larger images (up to 1230x1230 pixels) can be processed in real time when using smaller neighborhoods and fewer neurons in each layer, as depicted in Table II.
To test the ability of such a limited numeric precision architecture to learn and process early-vision tasks, an explicit structural simulation of the architecture was written in PASCAL. An edge detection task was then learned using a 256 gray level image (FIG. 9), 1024 examples of 3x3 pixel neighborhoods from the same image, and 1024 corresponding desired output pixels taken from a desired output image (FIG. 10). The network contained two layers, with seven neurons in the first layer and one neuron in the second layer. The initial weights were empirically selected at random from the ranges [-127, -120] and [120, 127]. The output functions of the first and second layers were empirically set as hyperbolic tangent and sigmoid, respectively. The learning rate was constantly equal to one. The number of epochs required was 49. The result image (FIG. 11) was generated by forwarding the input image (FIG. 9) through the network and thresholding it at the gray level of 221. The test indicates that, at least for some image processing tasks, the limited numeric precision of the architecture is quite sufficient.
The numeric precision of the DSNC architecture has been determined empirically. It is optimized for a certain class of algorithms, and accordingly a precision other than 8 bits for weights and biases and 9 bits for data may be required for other algorithms. Digital simulations should be carried out to compare the performance and the quality of the results over a number of different precisions, until the appropriate precision is identified.
The DSNC architecture combines both forward processing and learning. Certain applications may not require learning, and for them a subset of the architecture, with only forward processing hardware, can be designed. As explained in the Appendix hereinbelow, about 2/3 of the number of gates, i.e. 447,595 out of 617,640, may be eliminated in such cases. This could lead to either smaller chips or to higher performance architectures. For instance, using the same technology and size of chip as in the Appendix, 1230x1230 or larger images can be processed in real time.
Combining the learning with forward processing, and providing for maximum rate learning, is highly desirable in some applications. When data attributes change often, continual learning is an integral part of processing, and it should not delay normal forward processing. When real-time image data are processed, additional large image storage buffers would have been required if learning were to be performed at rates slower than the real-time data input rate. Finally, training neural networks is an important task by itself, and often requires a large number of empirical iterations. Thus, the DSNC is also intended for use in special purpose workstations for neural network development.
Neural networks can offer an attractive implementation for some complex image processing algorithms; their self-learning capability often helps overcome the extreme difficulty of constructing conventional image processing algorithms for non-trivial tasks. High performance architectures for neural network processing are required in order to execute image processing algorithms at video rates.
7. A digital neural network according to claim 6, wherein l=3, m_i=5 for 1≤i≤5, n=5, for real-time image processing of images, wherein each image consists of essentially 550x550 pixels.
