Abstract-In this paper, we present the design of a deterministic bit-stream neuron, which makes use of the memory rich architecture of fine-grained field-programmable gate arrays (FPGAs). It is shown that deterministic bit streams provide the same accuracy as much longer stochastic bit streams. As these bit streams are processed serially, this allows neurons to be implemented that are much faster than those that utilize stochastic logic. Furthermore, due to the memory rich architecture of fine-grained FPGAs, these neurons still require only a small amount of logic to implement. The design presented here has been implemented on a Virtex FPGA, which allows a very regular layout facilitating efficient usage of space. This allows for the construction of neural networks large enough to solve complex tasks at a speed comparable to that provided by commercially available neural-network hardware.
I. INTRODUCTION
While artificial neural networks (ANNs) are able to provide compact solutions to difficult problems, their real attraction lies in their ability to learn by example. However, in order to obtain a network that produces a desirable output, the network weights must typically be trained upon the available examples many times. Due to the large number of iterations often required, neural networks implemented in software can take many hours of CPU time to learn a particular problem. For example, in a time series prediction competition, while ANNs performed very well, the run times ranged from 3 h on a CRAY Y-MP to three weeks on a SPARC 1 [13].
In light of this, there has been a steadily increasing number of hardware implementations of neural networks [16], [19], [21], [22], [25]. Implementing neural networks in hardware can speed up the training by several orders of magnitude, due both to the faster nature of the hardware and to the fact that such networks can operate in parallel.
II. DIGITAL NEURAL NETWORKS
In the review of hardware neural networks by Lindsey and Lindblad [22], the implementation of neural hardware was divided into three broad categories: analog, digital, or a hybrid of the two. In the discussion on digital neurons, it was argued that:
"For the designer, digital technology has the advantages of mature fabrication techniques, weight storage in RAM, and arithmetic operations exact within the number of bits of the operands and accumulators. From the user's viewpoint, digital chips are easily embedded into most applications. However, digital operations are usually slower than in analog systems, especially in the weight input multiplication, and analog inputs must first be converted to digital."

This review is now slightly dated, and we would argue that several additional benefits of implementing neural networks in digital technology have arisen due to the proliferation of very large (digital) field-programmable gate arrays (FPGAs), which can be reconfigured as required. The ability to reconfigure the operation of these chips allows neural networks to be tailored to the task at hand, allows task-specific logic (such as preprocessing of data) to be placed on the chip, and opens the way to widespread use and adaptation of such a design.
While these are additional benefits to designing digital neurons, the speed disadvantage outlined above remains. In digital designs, a large amount of logic is required for a fast multiplication. As a network typically contains many connections, a separate fast multiplication for each connection requires far more space than can be provided on even the largest digital chip. Thus, in order to fit the design in the chip space, the design must either implement the multiplication operations serially, hence slowing down the design, or (to still have the design operate in parallel) the logic required by the multiplication operation must be drastically reduced. Several authors have investigated using stochastic logic to achieve this [1] , [2] , [20] , [27] - [29] .
A. Stochastic Logic
Stochastic logic is a digital technique that realizes pseudoanalog operations (multiplication, summation, etc.) using stochastically coded bit-streams. A stochastic bit-stream is simply a sequence of bits, where the probability of each bit being set high is proportional to the value the bit-stream represents. Such a bit-stream can be represented by an $N$-component random binary vector $\mathbf{x} = (x_1, \ldots, x_N)$ with the following properties:

$$x_i \in \{0, 1\}, \qquad P(x_i = 1) = p, \qquad P(x_i = 0) = 1 - p.$$

With stochastic bit-streams where $i \neq j$, the components $x_i$ and $x_j$ are independent, identically distributed random variables. The probability distribution used depends on the representation adopted. The two common mappings of a real number to a binary pulse sequence are unipolar and bipolar [2]. When using the bipolar representation, $p$ is the probability of any one bit being high and $1 - p$ the probability of a bit being low. The resulting vector can represent a real number $x = 2p - 1 \in [-1, 1]$. In the unipolar representation, $p$ is the probability of any one bit being high and $1 - p$ is the probability of a bit being low. In this representation the resulting vector represents a real number $x = p \in [0, 1]$. As the bipolar scheme allows both positive and negative signals to be represented, it might be thought to be better suited to neural-network signals. However, the unipolar representation, which only allows the encoding of positive signals, has been adopted, as it has proved convenient to transmit the sign of the signal separately.

With the unipolar representation, the vector $\mathbf{z}$, corresponding to the product of $\mathbf{x}$ and $\mathbf{y}$, has $P(z_i = 1) = xy$ and $P(z_i = 0) = 1 - xy$. As the vector produced by ANDing the components of $\mathbf{x}$ and $\mathbf{y}$ together is equivalent to $\mathbf{z}$, a bit-wise AND of two unipolar stochastically encoded bit-streams produces a bit-stream which represents the product of these bit-streams.
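The AND-based multiplication described above can be illustrated with a small software model. This is a minimal sketch, not the authors' hardware: the function names, the fixed seed, and the stream length are assumptions chosen only to make the behaviour easy to observe.

```python
import random

def stochastic_stream(value, length, rng):
    """Unipolar stream: each bit is 1 with probability `value`."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def and_multiply(xs, ys):
    """Bit-wise AND of two equal-length streams."""
    return [a & b for a, b in zip(xs, ys)]

def decode(stream):
    """Fraction of high bits, i.e. the unipolar value of the stream."""
    return sum(stream) / len(stream)

rng = random.Random(0)      # fixed seed, purely for repeatability
N = 100_000                 # deliberately long so the estimate is tight
x_bits = stochastic_stream(0.5, N, rng)
y_bits = stochastic_stream(0.4, N, rng)
estimate = decode(and_multiply(x_bits, y_bits))   # should be near 0.5 * 0.4
```

The two streams must be generated independently for the AND to approximate the product, which is why each stream draws its own random numbers.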
B. Error From Stochastic Multiplication
The number of ones in a given vector has a binomial distribution (it is the sum of $N$ independent Bernoulli trials). For such a distribution of length $N$ with generating probability $p$, the mean and variance are given by

$$\mu = Np, \qquad \sigma^2 = Np(1 - p).$$

As the distribution of the number of ones in the product vector $\mathbf{z}$ is also binomial, the expected number of ones is $Nxy$. An estimate of the product of $x$ with $y$ is given by

$$\hat{z} = \frac{1}{N} \sum_{i=1}^{N} z_i. \qquad (1)$$

Thus, the error in a specific estimate for a given $x$ and $y$ is

$$e = \hat{z} - xy. \qquad (2)$$

The mean square error (mse) for a given $x$ and $y$, taken over the whole ensemble of outputs of the stochastic processes $\mathbf{x}$ and $\mathbf{y}$, is given by

$$\text{mse} = E\left[(\hat{z} - xy)^2\right]. \qquad (3)$$

For stochastic computation $E[\hat{z}] = xy$, so the above equation can be rewritten as

$$\text{mse} = E\left[(\hat{z} - E[\hat{z}])^2\right] = \operatorname{Var}(\hat{z}). \qquad (4)$$

As $\hat{z}$ is the number of ones divided by $N$, and the variance of the number of ones is $Nxy(1 - xy)$, the mse for stochastic multiplication is

$$\text{mse} = \frac{xy(1 - xy)}{N}. \qquad (5)$$

Three consequences of these results are the following.

1) $\hat{z}$ is an unbiased estimate of $xy$.
2) The estimate is reliable in that the rms error reduces as $1/\sqrt{N}$, where $N$ is the number of bits in the representation.
3) The maximum error occurs when the product $xy = 0.5$.

Fig. 1 shows the theoretical rms error in counts that results from stochastic computation for bit-streams of length 128 for various values of input bit-streams. Shown alongside this in Fig. 2 is the experimental error for bit-streams of the same length. While the exact error obtained depends upon the sequence used, the results shown here are typical of those obtained using stochastic computation, and the error is unacceptably high for neural computation: in this case, the mean error is as much as five counts out of 128. For individual cases the error can be much larger.
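As a worked check of the size of this error, assume the count of ones in the product stream is binomial with success probability $xy$ (the symbols $x$ and $y$ for the two encoded values are notation adopted here). The worst case for a 128-bit stream is then:

```latex
\[
\mathrm{rms} = \sqrt{\frac{xy\,(1 - xy)}{N}}
\quad\Longrightarrow\quad
\mathrm{rms}\Big|_{N = 128,\; xy = 0.5} = \sqrt{\frac{0.25}{128}} \approx 0.0442,
\]
\[
\text{in counts:}\quad 128 \times 0.0442 \approx \sqrt{128 \times 0.25} = \sqrt{32} \approx 5.66 .
\]
```

This is of the same order as the error of roughly five counts out of 128 quoted above.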
III. DETERMINISTIC BIT-STREAM COMPUTATION
The requirement for independent random bit-streams implies the need for many random sources. There have been some attempts to solve this problem by using techniques for generating multiple uncorrelated pseudorandom sources [1]; however, these still require considerable hardware resources. By using digital chopping (which is easy to realize in hardware), Gschwind et al. [15] have reduced the number of sources required by making one of the bit-streams deterministic (see footnote 3). Although this reduces the amount of hardware required for random sources, they still consume a significant amount of hardware. In either case the probabilistic nature of the bit-streams produced causes a large error in the signal value.
For bit-stream computation there will always be some error in the signal value, due to the quantized nature of the signals. To see this, consider a value $x$ represented by a bit-stream. The count of the bit-stream is an integer value, so the best representation of the value that can be obtained is

$$\hat{x} = \frac{\mathrm{round}(xN)}{N}$$

where $N$ is the length of the bit-stream, and $\mathrm{round}(\cdot)$ is defined as the closest integer value to the real value.
The exact magnitude of the error produced will depend on the value being represented, but for the best case it will lie somewhere in the range of [0, 0.5] counts. It will be shown that by making the bit-streams deterministic it is possible to reduce the error in the output bit-streams to lie within these bounds.
A. Making Bit-Streams Deterministic
For a signal period of $N$ clock cycles, there are only $N$ possible values that can be represented. The generation of a unique bit-stream for each of these possible values can be achieved by constraining the time when the signal is high to be the first part of the signal period.
Once one bit-stream has been constrained in this manner, the error produced through ANDing it with the other bit-stream can be minimized if this bit-stream is configured such that it is evenly spread across the signal period.
By defining the following terms:
the equations for the vectors associated with input value and the weight value can be written as where , .
where round otherwise. (7)

Footnote 3: One of the reviewers of this paper expressed the opinion that the method presented here is very similar to previous neural hardware developed using a digital chopping method such as that presented by Gschwind et al. While the authors are not of this opinion, readers are encouraged to refer to this work.

An example of two such bit-streams can be seen in Fig. 3. Here both the input bit-streams represent the value 0.5 and the signal period is 24. As shown, the resulting bit-stream is high for six of the 24 clock cycles, which represents the desired value of 0.25.
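The two kinds of deterministic bit-stream can be sketched in software as follows. The front-loaded input stream follows the description above directly. The rule used to spread the weight stream evenly is an assumption made for this illustration (a rounding-based spacing rule) and is not claimed to be the paper's own equation; all function names are hypothetical.

```python
def input_stream(count, period):
    """Ones packed into the first `count` cycles of the signal period."""
    return [1 if i < count else 0 for i in range(period)]

def weight_stream(count, period):
    """`count` ones spread evenly over `period` cycles.

    Assumed rule (not taken from the paper): bit i is high when
    round((i + 1) * count / period) exceeds round(i * count / period),
    computed in integers with halves rounded up.
    """
    def rounded(i):
        return (2 * i * count + period) // (2 * period)
    return [rounded(i + 1) - rounded(i) for i in range(period)]

def and_count(xs, ws):
    """Number of cycles on which both streams are high."""
    return sum(a & b for a, b in zip(xs, ws))

# Setting of Fig. 3: both values 0.5, signal period 24.
period = 24
product_count = and_count(input_stream(12, period), weight_stream(12, period))

# Worst-case error, in counts, over every input/weight pair for a period of 128.
N = 128
worst = max(
    abs(and_count(input_stream(x, N), weight_stream(w, N)) - x * w / N)
    for x in range(N + 1)
    for w in range(N + 1)
)
```

Under this assumed spacing rule the AND of the two streams counts round(x·w/N) ones, so the result differs from the ideal product by at most half a count.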
The estimate of obtained through ANDing two such bit-streams together then becomes the number of values of in this equation, which are less than or equal to . This is the number of values of in the range specified for which

As the rms error would be expected to be of the same order, deterministic logic results in an rms error which is inversely proportional to $N$. In contrast, the implementation of stochastic logic results in the rms error being inversely proportional to $\sqrt{N}$. To determine the equations governing the distribution of the average rms error for the stochastic and deterministic cases, the average error was determined through simulation for various values of $N$, and plotted on the log-log graph of Fig. 4. For each value of $N$ shown, a random value in the range [0, 1] was obtained for each of the two inputs, and a bit-stream of the desired length was generated using the appropriate method. The error between the desired value and the actual value was determined, and this was repeated 500 times to obtain an approximation of the average rms error.
When trend-lines were fitted to these values the equations which were found to fit the points determined were Average rms deterministic Average rms stochastic
The trend-lines for both cases closely match the expected power law, and the fit of the trend-lines to the data points is very good, with the $R^2$ values in each case being close to 1.0 ( for the deterministic case and for the stochastic case).
Using these equations it can be determined that at 128 clock cycles the average rms error produced with deterministic computation is only 0.003125. In contrast, for stochastic computation to yield an error this small would require a signal period on the order of 8000 clock cycles. Previous studies have shown that this level of accuracy is not adequate for training if backpropagation is the training algorithm employed [17], [18]. However, alternative algorithms are available for which this level of accuracy is adequate to perform network learning [3], [6], [7], [11], [31].
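The simulation procedure described in this section can be approximated in a few lines. This is a sketch under stated assumptions rather than a reproduction of the authors' experiment: the deterministic product uses an assumed rounding-based spreading rule for the weight stream, the seed and the set of signal periods are arbitrary, and the function names are hypothetical.

```python
import math
import random

def stochastic_product(x, w, n, rng):
    """Fraction of cycles on which two independent random streams are both high."""
    hits = 0
    for _ in range(n):
        a = rng.random() < x
        b = rng.random() < w
        hits += int(a and b)
    return hits / n

def deterministic_product(x, w, n):
    """Value obtained by ANDing a front-loaded stream with an evenly spread one."""
    xc, wc = round(x * n), round(w * n)
    # Assumption: with an evenly spread weight stream the AND count is
    # round(xc * wc / n); integer arithmetic, halves rounded up.
    return ((2 * xc * wc + n) // (2 * n)) / n

def rms_errors(n, trials, rng):
    """Return (stochastic rms error, deterministic rms error) for period n."""
    sq_s = sq_d = 0.0
    for _ in range(trials):
        x, w = rng.random(), rng.random()
        sq_s += (stochastic_product(x, w, n, rng) - x * w) ** 2
        sq_d += (deterministic_product(x, w, n) - x * w) ** 2
    return math.sqrt(sq_s / trials), math.sqrt(sq_d / trials)

rng = random.Random(1)      # fixed seed for repeatability
results = {n: rms_errors(n, 500, rng) for n in (32, 64, 128, 256)}
```

Plotting the two columns of `results` against the signal period on log-log axes should show the two different slopes discussed above, although the exact coefficients depend on the assumptions made here.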
B. Error in Deterministic Case for a Signal Period of 128
With a signal period of 128 clock cycles, the error for all possible input combinations is shown in Fig. 5 . As can be seen, the error never exceeds the value of 0.5 counts. Shown in Fig. 6 is the required (evenly distributed) weighting bit-streams when the signal period is 128 bits. In this figure, black represents logic low and white represents logic high. The axis is the value of the th bit of the bit stream. The axis is the value of the weights (multiplied by ).
Note that the error shown in Fig. 5 is the difference between the desired output and the actual output of the weight. This is shown in preference to the percentage error as the neuron error is directly related to the activation signal error. As the activation signal is the sum of several weighted inputs, the effect of a given difference upon the activation value is the same regardless of the value of the weighted input.
IV. IMPLEMENTATION OF A SINGLE NEURON
The discussion so far has only dealt with the mechanism to carry out the multiplication required to perform the weighting of the inputs. To perform the task of a generic artificial neuron, the circuitry must also sum these weighted inputs, pass this sum through a transformation function, and output this value. (Many variations on this scheme exist; however, artificial neurons that perform these generic tasks are the most widely used and accepted. As we are aiming to allow for the increased use of hardware neural networks, the neuron that we present here has been designed to perform these generic tasks.)
A functional diagram of the hardware neuron is shown in Fig. 7 . The blocks in this diagram correspond to hardware modules which carry out the appropriate tasks. 
A. Neuron Weights
The weight values are represented by the evenly distributed bit-streams for reasons which are outlined above. This scheme requires that these bit-streams be either stored in memory or calculated as needed. The question of whether or not it is feasible to store the weights in memory is governed by the degree of accuracy required. This is because for a large signal period, the calculation of the bit-streams "on the fly" requires less hardware space than storing the bit-streams. However, as the bit-streams become shorter there comes a break-even point at which storing the required values in hardware is more efficient and quicker than calculating them. In the memory-rich architecture of the Virtex FPGA, with bit-streams only 128 bits long, storage is by far the better option.
For bit-streams with a signal period of 128 bits, the weight storage component can be implemented with eight Virtex SRL16 primitives (each of which is a 16-bit shift register). As four of these components can be placed in one Virtex configurable logic block (CLB), the weight storage can be done in just two CLBs.
The overall design of the weighting component is shown in Fig. 8. The logic required to load a weight stream into the weight component is simple, with a 2-1 multiplexer required to select between the input weight bit-stream and the output of the shift register. If it is desired to change the weights, then the load weights signal is set high, and the input weight bit-stream (which represents the new weight value) is loaded into the shift register. However, if the current value of the weight is to be retained, then the load weights signal is set low, and the output of the shift register is fed back into its input, causing the same bit-stream (and, hence, weight value) to be reused.
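The load/recirculate behaviour of the weight storage can be modelled behaviourally as below. This is a software sketch only: the class and parameter names are hypothetical, and a single 128-entry queue stands in for the chain of shift-register primitives.

```python
from collections import deque

class WeightRegister:
    """Behavioural model of a recirculating weight shift register."""

    def __init__(self, length=128):
        self.bits = deque([0] * length, maxlen=length)

    def clock(self, load_weights, weight_in=0):
        """Advance one cycle and return the bit leaving the register."""
        out = self.bits.popleft()
        # 2-1 multiplexer: take the new bit when loading, else recirculate.
        self.bits.append(weight_in if load_weights else out)
        return out

pattern = [1, 0, 0, 1] * 32          # any 128-bit weight stream
reg = WeightRegister(len(pattern))
for bit in pattern:                  # load for one full signal period
    reg.clock(load_weights=1, weight_in=bit)
first_pass = [reg.clock(load_weights=0) for _ in pattern]
second_pass = [reg.clock(load_weights=0) for _ in pattern]
```

With the load signal held low, every signal period replays the same stored stream, which is the reuse of the weight value described above.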
The input sign and weight sign are loaded when a signal referred to as the sign timing signal goes high. The signs of the input and weight bit-streams are read from the same lines used to transmit the signal values which requires a separate time to transmit the sign. This will be described in Section IV-G.
The value output from the weighted inputs can be one of three values on each clock cycle: -1, 0, or +1. The encoding of the signal is changed from a two-bit signal/sign representation to a positive/negative representation. The positive signal is high if the signal sign is positive and the signal bit-stream value is high. Similarly, the negative signal is high if the signal sign is negative and the signal bit-stream value is high. This has been done to make the summation of the signals easier to implement and, due to the nature of the Virtex FPGA, requires no extra logic blocks.
B. Weight Array
The neuron has been designed to be able to accept up to ten inputs. Each of the ten associated weights is instantiated within the weight array. In addition, this component contains circuitry that is used to change the value stored within each weight.
The inputs to the weight array are a neuron address, a weight address (both of which are four-bit values) and a bit-stream. The weight array decodes the neuron address and weight address using RAM-based lookup tables to determine if one of its weights is being addressed. If it is, then the weight array sets the load signal of that weight high to enable the value stored within that weight to be changed.
The bit-stream presented to the weight array is obtained from a component that is external to the network. This component stores all the possible weight bit-streams in memory. With a signal period of 128 bits, there are 128 possible values, which means that 16 384 bits (128 × 128) are needed to store every possible weight bit-stream. Due to limitations on memory, this external component is shared between several neurons. It is for this reason that the addressing scheme described above has been adopted.

Changing the weight values is achieved using a custom processor, which uses a special instruction that causes the external hardware to load a specific weight with the desired bit-stream. A detailed description of this processor can be found in chapter four of the thesis by Braendler [6].
C. Summation of Weighted Input Bit Streams
The positive and negative bit-streams output from the weights are summed by the input adder on each clock cycle. This summation is performed by lookup tables, as efficient lookup tables can be implemented in the Virtex FPGA through its distributed ROM. However, the maximum size of the individual distributed ROM blocks in the Virtex device is only 32 bits. Blocks of this size allow the implementation of a lookup table with five inputs. As each weight outputs two bits (positive and negative) and there are ten weights, the design requires four (identical) lookup tables to calculate the "initial summation." The result of the initial summation is four vectors representing the sum of the upper and lower five bits of the positive and the negative bit-streams. With the neuron designed to be able to accept up to ten inputs, this initial summation can be carried out without wastage.
As the signs of the inputs to the lookup tables are fixed, the signs of the four values from the initial summation are also fixed, with each being an integer in the range [0, 5]. A three-bit vector is required to represent this range of values.
Intermediate summation is then performed as shown in Fig. 9. The output of this intermediate summation is two four-bit signed vectors (in two's complement), each of which can take on an integer value in the range [-5, 5].
Finally, these two values are added together to obtain a count (the "single-cycle sum"). This value is a delayed summation of the weighted inputs, with three clock cycles passing between the weights outputting their values, and the summation of these outputs appearing from the input adder.
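The lookup-table summation described in this subsection can be sketched as follows. The 32-entry table mirrors a five-input lookup; how the four partial sums are paired in the intermediate stage is an assumption made here (positive minus negative for each half), since only the resulting range of [-5, 5] is stated in the text. All names are hypothetical.

```python
# 32-entry table: number of set bits in a five-bit address (values 0..5).
POPCOUNT5 = [bin(addr).count("1") for addr in range(32)]

def pack5(bits):
    """Pack five 0/1 values into a lookup-table address."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | b
    return addr

def encode(signal_bit, sign_negative):
    """Signal/sign pair -> (positive, negative) pair output by a weight."""
    return (signal_bit & (1 - sign_negative), signal_bit & sign_negative)

def single_cycle_sum(pos_bits, neg_bits):
    """Sum of ten weighted inputs for one clock cycle, in the range [-10, 10]."""
    # Initial summation: four identical five-input lookups.
    pos_lo = POPCOUNT5[pack5(pos_bits[:5])]
    pos_hi = POPCOUNT5[pack5(pos_bits[5:])]
    neg_lo = POPCOUNT5[pack5(neg_bits[:5])]
    neg_hi = POPCOUNT5[pack5(neg_bits[5:])]
    # Intermediate summation (pairing assumed): two signed values in [-5, 5].
    lo = pos_lo - neg_lo
    hi = pos_hi - neg_hi
    # Final addition gives the single-cycle sum.
    return lo + hi

example = single_cycle_sum([1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
                           [0, 1, 0, 0, 0, 1, 0, 0, 0, 0])
```

The model ignores the three-cycle pipeline delay of the hardware adder and only shows the arithmetic performed on each cycle.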
D. Accumulator
The next stage in the neuron is the accumulator. The function of the accumulator is to sum the contribution from the individual weights to form an internal activation value. This internal activation value is then passed on to the transform component. As will be discussed below, the transform component implements an approximation to the function. This approximation outputs the maximum value when the magnitude of the activation value is greater than or equal to 240 counts, which corresponds to a value of just less than two.
Because of this, it is not necessary to continue adding to the accumulator once the activation value has passed the value of 240. This allows the internal count of the accumulator to be stored in a nine-bit two's complement number, allowing it to represent values from 255 down to -256. Limiting the number of bits needed for the internal count reduces the amount of hardware required for the accumulator.
Determining if the value of the internal count has passed the value of 240 can be done by XORing bit seven through to four with the sign bit (bit eight) of the count to obtain a four-bit comparison vector. If this comparison vector is all ones, then the cutoff value has been reached (or exceeded) and there is no need to add any further values to this count.
Because there is a maximum of ten weights per neuron, it is only possible to add an integer value in the range [-10, 10] to the internal count on each cycle. Thus, having the addition stop within 16 counts of the maximum magnitude means that the internal count never exceeds this maximum magnitude.
The output of the accumulator is a sign bit and a magnitude bit vector. The sign bit is simply the uppermost bit of the internal counter. Converting the $n$-bit two's complement number to an $(n-1)$-bit magnitude bit vector has been achieved by outputting the lower $n-1$ bits and inverting all these bits if the number is negative. Doing this introduces an error of one count into negative signals, the effect of which will be discussed in Section VI.
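The cutoff test and the magnitude conversion can be checked with a small bit-level model. This is an illustrative sketch with hypothetical names; it follows the nine-bit count, the XOR of bits seven to four with the sign bit, and the invert-if-negative conversion described above.

```python
class Accumulator:
    """Nine-bit two's complement accumulator that stops adding near +/-240."""

    def __init__(self):
        self.count = 0                      # stays within -256..255

    def _bits(self):
        return self.count & 0x1FF           # nine-bit two's complement pattern

    def saturated(self):
        bits = self._bits()
        sign = (bits >> 8) & 1              # bit eight
        upper = (bits >> 4) & 0xF           # bits seven down to four
        compare = upper ^ (0xF if sign else 0x0)
        return compare == 0xF               # all ones: cutoff reached

    def add(self, single_cycle_sum):
        """Add one cycle's sum (an integer in [-10, 10]) unless cut off."""
        if not self.saturated():
            self.count += single_cycle_sum

    def output(self):
        """Return (sign bit, eight-bit magnitude)."""
        bits = self._bits()
        sign = (bits >> 8) & 1
        low = bits & 0xFF
        return sign, (low ^ 0xFF) if sign else low

pos, neg = Accumulator(), Accumulator()
for _ in range(200):
    pos.add(10)
    neg.add(-10)
```

In this model the negative side stops one count later than the positive side and reports a magnitude one count smaller than the true value, which is the one-count error for negative signals noted above.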
E. Implementation of Transformation
With stochastic neurons, the transformation of the inputs can be done with a simple comparison between the total net contribution from the inputs on any one cycle, and a threshold value in the range [27] , [29] . A sigmoidal function can be approximated through an appropriate choice of this threshold value. However, this approach requires independent inputs, an extended signal period to produce a relatively smooth curve and at least 15 inputs to make the transform sufficiently smooth [27] . The deterministic bit-stream neuron has a short signal period, and certainly cannot guarantee independent inputs, making this approach infeasible.
However, due to the short signal period used in the deterministic bit-stream neuron, the transform can be efficiently implemented using a linear piecewise approximation to the function. This can be done in several ways, with the method that we have adopted being as follows.
The magnitude bit-vector from the accumulator is split into two parts. The upper three bits of this bit-vector are used to provide an index that is used to look up two values (an offset and a divisor). The offset and divisor values used for different activation values are shown in Table I.

The value of the transformation is then calculated according to

$$f(a) = \mathrm{offset}(a) + \frac{a \bmod 32}{\mathrm{divisor}(a)} \qquad (10)$$

where $a$ is the activation value. The first term in this equation is simply the offset for that particular activation value. The second term is the activation value modulo 32 (which is just the lower five bits of the magnitude bit-vector) divided by the divisor for the activation value. Because the divisor is selected to always be a power of two, this can be implemented efficiently by a right-shift operation.
The resulting transformation obtained using this scheme can be seen in Fig. 10 , where comparison shows it to be a close approximation to the function. Note that the transform component only operates on the magnitude of the activation value. The sign information is not used within this component, but is passed directly from the accumulator to the pulser for generation of the appropriate output bit stream.
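The offset-plus-shift calculation can be sketched as below. The mechanism (upper three bits as index, lower five bits shifted right) follows the description in this subsection, but the table entries are hypothetical placeholders chosen only to give a saturating, continuous curve; they are not the values of Table I.

```python
# Hypothetical (offset, shift) pairs, one per 32-count segment, where
# shift = log2(divisor). Placeholders only, NOT the values of Table I.
SEGMENTS = [(0, 0), (32, 0), (64, 1), (80, 1),
            (96, 2), (104, 2), (112, 3), (116, 3)]

def transform(magnitude):
    """Piecewise-linear transform of an eight-bit activation magnitude."""
    index = (magnitude >> 5) & 0x7          # upper three bits select the segment
    remainder = magnitude & 0x1F            # activation value modulo 32
    offset, shift = SEGMENTS[index]
    return offset + (remainder >> shift)    # divide by a power of two via shift

curve = [transform(m) for m in range(256)]
```

Storing the shift amount rather than the divisor itself is a design choice of this sketch; it makes the power-of-two restriction explicit and keeps the per-cycle work to one lookup, one shift, and one addition.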
F. Pulser
If the neuron is connected to other neurons, then the bit-vector obtained from the transform component must be converted back to an appropriate bit-stream. The required format for the converted bit-stream is for all of its high bits to occur in the first part of the bit-stream (as discussed in Section III). To achieve this, the pulser simply loads the output magnitude from the transformation into an internal register. Then, while the value of this register is not zero, it outputs a high signal and decrements the stored value by one. The sign of the output is the sign obtained from the activation component.
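The pulser's load-and-decrement behaviour is simple enough to model directly. The following is a minimal sketch with a hypothetical function name; it emits one bit per clock cycle for a full signal period.

```python
def pulser(magnitude, period):
    """Yield one output bit per clock cycle for a full signal period."""
    register = magnitude                    # load the transform output
    for _ in range(period):
        if register != 0:
            register -= 1                   # count down while outputting high
            yield 1
        else:
            yield 0

stream = list(pulser(5, 8))
```

The resulting stream has all of its high bits at the start of the period, which is the format required of an input stream to the next neuron.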
G. Timing and Transmission of Sign
From the above discussion, it should be apparent that a small number of clock cycles are required for a single bit of data to "flow" through a neuron.
The latency associated with this data flow means that the neuron requires a timing signal to tell it when to perform the transmission of the bit-streams, but it also means that there is a period where the signal lines are not being used to transmit values. It is possible to take advantage of this cessation of activity by using it to transmit the sign of the signal. By doing this, only one line is required for each value, which makes routing the design a far easier task. Additionally, an extra bit of accuracy is obtained at very small cost.
The sign of the signal may be transmitted either at the start of the signal or at the end, provided that the same protocol is adopted for all signals. We chose to use the start of the signal period, so that the sign information is in place before the signals are transmitted.
The required timing signals are generated via a component external to the neural circuitry and distributed via the global clock lines of the Virtex FPGA (see Fig. 11 ).
V. PERFORMANCE STATISTICS OF THE VARIOUS COMPONENTS
A breakdown of the time and space requirements (the number of slices of a Virtex chip used) of each individual component is shown in Table II. These figures were generated by running the place and route software for each component. The speed and

When the neuron was floor-planned, the amount of logic required by each of the components was reduced, and the maximum speed of each component was increased. The maximum speed obtainable from the floor-planned neuron was 110 MHz, and the number of slices occupied by a single neuron was decreased to 140.
The floor-planned neuron occupies an area of 10 × 7 configurable logic blocks. The basic floorplan of the design is shown in Fig. 12. As can be seen, the amount of area occupied by the weight array has reduced considerably: when floor-planned it occupies just over 60 slices (30 CLBs). Within this component, the weighting of the ten inputs is performed. Using a conventional multiplier for each input-weight pair would require over 400 CLBs, so it can be seen that the use of deterministic bit-stream logic resulted in a considerable reduction in the amount of hardware required for each neuron. Importantly, it also reduces the number of connections that are required, as in the scheme presented here only one signal line is required per input. This greatly alleviates the difficulty in routing the often complex connections of a neural network in FPGAs, where operators can only handle a limited amount of fan-in [14].
The latency of each component is also shown in Table II, with the total latency of the neuron being seven clock cycles. Combined with the 128 clock cycles required to transmit the bit-streams, this means that the overall number of clock cycles required for one update of the neuron is 135. With a maximum clock frequency of 110 MHz, this means that the overall time required for one "update" to occur is 1.25 µs. The number of updates which one weight can carry out in 1 s is, thus, 800 000.
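The update time follows from the cycle count and the clock frequency. As a worked check, with $T$ denoting the time for one update (a symbol introduced here):

```latex
\[
N_{\text{cycles}} = 128 + 7 = 135, \qquad
T = \frac{135}{110 \times 10^{6}\ \text{Hz}} \approx 1.23\ \mu\text{s}, \qquad
\frac{1}{T} \approx 8.1 \times 10^{5}\ \text{updates per second}.
\]
```

These values are in line with the rounded figures of 1.25 µs and 800 000 updates per second used in the text.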
It should be noted that the latency between neurons is 135 clock cycles, as the neurons must perform the weighting, accumulation, and transformation of the input signals for an entire neuron period before they can begin outputting data. (In contrast, stochastic bit-stream neurons can pass on a bit immediately after receiving a bit, so the latency between stochastic neurons is a single clock cycle).
For networks formed from deterministic neurons, this latency between neurons means that the processing time required is proportional to the number of hidden layers. For networks with one hidden layer, three "neuron periods" will be required to process the data. However, despite this delay, the deterministic neuron still offers a significant speed advantage over stochastic neurons. The speed advantage offered will depend upon the level of accuracy desired, and the number of hidden layers employed. When the deterministic neurons presented here (seven bits of accuracy) are used to form networks with one hidden layer, a speedup advantage of approximately 20 times is obtained over an equivalent network formed with stochastic neurons.
VI. OUTPUT OF THE NEURON
While the output of the neuron is an approximation to the curve, the actual output does not necessarily match this approximation. This is because there is an error in the activation value of the neuron due to addition of errors in the weighted inputs. As the error between the actual and desired weighted inputs can be as much as 0.5 of a count, the total error in the activation value of the neuron may be as much as 0.5 × NumInputs counts. However, looking at Fig. 5, it is clear that the sign of the error changes rapidly across the input-weight surface. With many inputs to one neuron, this will result in the errors tending to cancel one another out.
In the neuron presented here, there are ten inputs, so the error in the activation values may be as much as 5 counts. To test the actual output of the neuron versus the desired output, a single neuron was run with random weight values and random input values for 50 000 examples. A scatter plot showing the error in the neuron output for each case (the neuron output minus the expected neuron output) is shown in Fig. 13. Note that this error is worst when the activation value is in the range [-0.5, 0.5] as we are looking at the output of the neuron. In the range [-0.5, 0.5] there is a 1-1 mapping of the activation values to output values. Outside this range, the mapping is , and so the error in the neuron output is reduced from the error in the activation by a factor of .
What is also apparent from Fig. 13 is that there is some asymmetry between positive and negative values. This asymmetry is due to the error introduced in the accumulator component for negative signals (caused by the conversion from a twos complement bit-vector to a magnitude bit vector). However, it is clear from Fig. 13 that the effect of this error on the neuron output is small.
While Fig. 13 shows the range of the error in the output values of the neuron, it is perhaps of more interest to determine the distribution of the error in the output values for a given input value.
In order to do this, it is first necessary to know the distribution of the error in the weighted input bit-streams, which can be obtained from analyzing the results used to obtain Fig. 5. It turns out that the error from the deterministic scheme is uniformly distributed across the range [-0.5, 0.5] with mean and variance , respectively. As the activation value is the sum of such inputs, by the central limit theorem the mean and variance of the activation value will be and , respectively. This corresponds to a standard deviation in the activation error of 1.69 counts.
Although the weighted input error is uniformly distributed, using the central limit theorem it can be stated that the activation error will approach a normal distribution. To confirm this, the distribution curve for the activation error was generated through simulation, and plotted against the normal distribution in Fig. 14 , where it can be seen that the actual distribution does indeed closely approximate the normal distribution.
Thus, from these results we expect the activation error-and hence the output error-to be normally distributed about the expected activation value. To empirically show this, the following test was performed. The neuron was run for 200 iterations with input and weight values chosen such that the resulting activation value was the same each time, but that the input and weight values differed on each iteration. 8 For each iteration the difference from the desired value was calculated, and a histogram was constructed from these results. This histogram was then normalized by dividing through by the count of the most common value.
This process was performed for activation values ranging from -2.5 to 2.5, in 0.05 increments. The results are shown in Fig. 15. The shading in this figure represents the normalized frequency of occurrence; thus, white represents a frequent event and black represents an infrequent event. Although the histogram was normalized, the data shown varies between -0.2 and 1.2, as the algorithm used to generate the surface interpolated some values outside the actual bounds.
What is clear from this image is that the activation value of the neuron does occasionally stray two or three counts from the desired activation level. However, in general, the error between the desired value and the actual value is within 2 counts, which is consistent with the discussion above. 
VII. COMPARISON WITH OTHER HARDWARE IMPLEMENTATIONS

A. Hardware Specifications
Because of the wide variety of network architectures and hardware implementations, no one or two numbers can give a true picture of the hardware's capabilities. Basic specifications of a neural network include the network architecture, the number of external inputs and outputs, the number of neurons and of synapses per neuron, the number of layers, etc. For a hardware implementation, specifications include the technology used (analog, digital, or hybrid) and the accuracy (in number of bits) of the inputs and outputs, of the weights, and of the accumulators. Various figures of merit indicate the hardware performance. The most common performance rating is connections-per-second (CPS), which is defined as the rate of multiply-and-accumulate operations during recall processing. For a more detailed overview of the various possible figures of merit, refer to [22].
B. Deterministic Network Performance
The neuron described in this paper has been implemented and tested upon the XCV800 Virtex FPGA. Although the XCV800 is at the upper end of FPGAs, there are much larger devices available. The largest device currently available is the XCV3200E, which has a floor space of 104 × 156 CLBs. As one neuron can be placed within a 10 × 7 block of CLBs, we estimate that it will be possible to place around 200 neurons into the XCV3200E (including the extra logic required to run the networks). As each neuron has ten connections, this means that it will be possible to place a network with 2000 synapses. In Section V, it was shown that one weight is able to perform 800 000 updates per second, which for a network with 2000 connections translates to a maximum speed of 1600 MCPS.
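The arithmetic behind this estimate can be laid out explicitly. The sketch below takes the CLB array as 104 by 156 and the neuron footprint as 10 by 7 CLBs. The function and key names are illustrative, and the area-only bound is an added simplification that ignores placement constraints and control logic, which is why it exceeds the figure of around 200 neurons used in the text.

```python
def estimate_capacity(clb_rows=104, clb_cols=156, neuron_clbs=10 * 7,
                      neurons_placed=200, synapses_per_neuron=10,
                      updates_per_weight_per_s=800_000):
    """Reproduce the capacity and speed estimate from the figures in the text."""
    total_clbs = clb_rows * clb_cols
    # Upper bound from area alone (an illustrative simplification).
    area_bound = total_clbs // neuron_clbs
    synapses = neurons_placed * synapses_per_neuron
    # One connection update per weight update; report in millions per second.
    mcps = synapses * updates_per_weight_per_s / 1e6
    return {
        "total_clbs": total_clbs,
        "area_bound_neurons": area_bound,
        "synapses": synapses,
        "mcps": mcps,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value}")
```

With these inputs the device offers 16 224 CLBs, the area-only bound is 231 neurons, and 200 placed neurons give 2000 synapses and 1600 MCPS.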
Table III (adapted from [22]) compares the performance of the neuron which we present with several commercially available neural-network hardware devices. Although the speed and the number of deterministic neurons which can fit onto one chip are better than those provided by these commercially available devices, the accuracy and the number of synapses of the design presented here are significantly lower. However, as mentioned previously, it has been shown that the accuracy of the neurons presented here is sufficient for good performance if suitable training algorithms are used [7].
The small number of synapses available with deterministic neurons is a result of the limited fan-in of these neurons. However, this limited fan-in was chosen as it is more efficient for hardware implementation [5], and it has been shown that it is possible to construct networks from neurons with such limited connectivity which perform as well as "maximally connected" networks [8].
VIII. TRAINING THE NETWORKS
The accuracy of the neurons presented here is insufficient for the algorithms traditionally used to train neural networks to obtain good results. However, as the number of hardware implementations of neural networks has grown, alternative algorithms have emerged which do perform well on limited accuracy neurons [4].
As discussed in [6, Ch. 8] , such algorithms can be divided into local and global optimization methods. However, because the neurons described in this paper do not smoothly map changes in their inputs to changes in their output, it would be difficult to train a network composed of such neurons using local optimization algorithms.
Fortunately, a number of global minimization algorithms are suitable for the training of neural networks of limited accuracy [4] , [10] , [24] . We have used a variant on the particle swarm optimization (PSO) algorithm to perform the training of these networks [6] . This algorithm is based on the dynamics of collective behavior whereby individuals share information to arrive at successively better points.
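To make the idea of particles sharing information concrete, the following is a minimal sketch of a standard global-best PSO with an inertia weight. It is not the variant used in this work, whose details are given in [6]. The parameter values, the function names, and the toy objective are assumptions for illustration, and the clipping of positions to [-1, 1] reflects the weight range of the neurons.

```python
import random


def pso_minimize(objective, dim, n_particles=20, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5,
                 lower=-1.0, upper=1.0, seed=0):
    """Standard global-best PSO; positions are clipped to [lower, upper]."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lower, upper) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # each particle's best point
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]   # best point seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(upper, max(lower, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


def toy_error(weights):
    """Hypothetical stand-in for the training error of a network."""
    return sum((w - 0.25) ** 2 for w in weights)


if __name__ == "__main__":
    weights, error = pso_minimize(toy_error, dim=3)
    print(weights, error)
```

Because the update uses only objective values and never a gradient, the same loop applies when the objective is the measured error of a hardware network whose output does not vary smoothly with its weights.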
The problems attempted were XOR, the Iris dataset [12], and the Pima Indian diabetes dataset [26]. Additionally, two 2-D problems, described below, were attempted in order to obtain some visual feedback on how the networks were performing.
IX. RESULTS
The training algorithm was run on a PC, with the weight and input values of the network being downloaded to the hardware. The results are shown in Table IV. XOR is a common test problem for novel neural networks, because it is one of the simplest nontrivial problems. For the XOR problem, the training and testing datasets were identical, consisting of all four data points. The results presented are the average of ten runs. The results for the remaining datasets were generated using ten-fold cross-validation.
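For readers unfamiliar with the procedure, ten-fold cross-validation can be sketched as below. How the folds were actually formed in these experiments is not stated, so the shuffling, the seed, and the function name are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]    # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # The Iris dataset has 150 samples, giving 135 training and 15 test points.
    for train, test in k_fold_indices(150):
        print(len(train), len(test))
```

Each sample appears in exactly one test fold, so the reported error is an average over ten networks, each tested on data it was not trained on.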
As can be seen from Table V, the hardware network was able to consistently solve this problem to 100% accuracy. The performance on the Iris dataset, which is more difficult than XOR but still a relatively simple problem for neural networks, was good, with an average training and testing error of 2.2% and 2.5%, respectively. Better results on this problem have been reported in [9]; however, the results presented here are only marginally worse, and better than a number of other published results [30].
The Pima Indian dataset represents another step up in difficulty. This dataset was a part of the Statlog project, and so results for a variety of methods exist [23]. The best neural-network solution reported in the Statlog project was achieved by radial basis function networks, which were trained to 21.8% error, and tested to 24.3% error. The results we have obtained are again only marginally worse than this, and better than many of the training algorithms investigated.
The two 2-D problems attempted are both nonstandard datasets, so no results are available for comparison. However, the purpose of using these datasets was to get a better feel for the decision boundaries that the networks were producing. In both cases the datasets were generated by randomly selecting points within the region . For the square dataset, 255 such points were chosen, and if a point was within the square bounded by , then the desired value was 1, otherwise the desired value was 1. For the two circles problem, 500 random points were chosen, with the desired value being 1 if the point was within one of two circles and 1 otherwise. The two circles were both of radius 0.2, with one centered at (0.3, 0.3) and the other centered at . The resulting plots of the datasets can be seen in Figs. 16 and 17, respectively. In these figures, the datasets are shown by the circles and crosses. A cross indicates a desired value of 1, and a circle indicates a desired value of 1. Contour lines showing the network approximation obtained for one "split" are superimposed on each dataset.
It is clear from these pictures that the actual output is only an approximation to the desired outputs. As can be seen, the sides of the decision boundary between the upper and lower areas of the network are smooth, gradual curves, whereas the actual dataset exhibits a very sharp decision boundary. This is a consequence of the weights of the network being limited to lie between plus and minus one. However, despite this inability to form steep decision boundaries, the classification achieved with these networks is good, and bettered the results that we could achieve with backpropagation implemented in software with 64-bit accuracy.
X. CONCLUSION
While stochastic computation results in very small circuitry requirements for the multiplication, the length of time for which the circuit must be run to ensure accurate results is very long. This drastically reduces the advantage of implementing the networks in hardware. In contrast, deterministic computation requires only a very small number of clock cycles to compute to an acceptable accuracy, and yet still results in circuitry requirements modest enough to allow the formation of relatively small neurons. Such neurons can be used to form networks large enough to solve complex problems on a single FPGA.
