The probabilistic Random Access Memory (pRAM) is a biologically inspired model of a neuron. This paper describes the behaviour of the pRAM in relation to both binary and real-valued input vectors. The pRAM is hardware-realisable, as is its reinforcement training algorithm.
Introduction
The probabilistic RAM 1, 2) (pRAM) device has recently been described 3) as an example of a VLSI implementation of an artificial neural network. The pRAM neuron 2, 4) generates an output in the form of a spike train in which the probability of generating a spike is controlled by an internal weight, represented as a real-valued number. The firing probabilities for all possible binary input vectors can be trained, so 2^N weights are used in each pRAM, where N is the number of synaptic inputs to the pRAM.
The pRAM is a hardware device with intrinsically neuron-like behaviour: its memory contents can be assigned values which have a neurobiological interpretation, and it is this feature which allows networks of pRAMs, with suitably chosen memory contents, to closely mimic the behaviour of living neural systems.
In a pRAM, all 2^N memory contents are independent random variables, and the probability that the device fires (a = 1) in response to the binary input vector i is given by the addressed memory content:

$$\Pr(a = 1 \mid i) = \alpha_i . \qquad (1)$$

Thus, in addition to possessing a maximal degree of nonlinearity in its response function (a deterministic pRAM, with α ∈ {0, 1}^{2^N}, can realise any of the 2^{2^N} possible binary functions of its inputs), pRAMs differ from units more conventionally used in neural network applications in that noise is introduced at the synaptic rather than the threshold level; it is well known that synaptic noise is the dominant source of stochastic behaviour in biological neurons.
The pRAM models the noise which arises from the release of neurotransmitter by single vesicles in the synapses of real neurons.
Each vesicle releases a variable amount of neurotransmitter which may cause the neuron to fire with a given probability. This feature is represented by the memory content α_i in the pRAM, which is the firing probability for the input vector i present at the neuron's synapses at that instant. By contrast, a conventional N-input neuron has N weights which are multiplied by the input synaptic activity; these products are summed before a threshold, or a squashing function, is applied to determine the neuron's output. Nonlinear interactions between synapses cannot be represented in a single neuron of this form, whereas the pRAM can exhibit fully nonlinear behaviour in terms of synaptic activity.
Linear-weighted-sum neurons can be made noisy, but this is normally achieved by superimposing noise at the point of summation or at the threshold level.
It is also noted that because pRAM networks operate in terms of 'spike trains' (streams of binary digits produced by the addressing of successive memory locations), information concerning the timing of firing events is retained; this potentially allows phenomena such as the observed phase-locking of visual cortical neurons to be reproduced by pRAM nets, with the possibility of using such nets as part of an effective 'vision machine'.
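To make this behaviour concrete, the following minimal Python sketch models a single pRAM as a lookup table of firing probabilities indexed by the binary input address; the class name and the XOR example are illustrative and not part of the published design.

```python
import random

class PRAM:
    """A single N-input pRAM: 2^N trainable firing probabilities."""
    def __init__(self, n_inputs):
        self.n = n_inputs
        # One probability (memory content) alpha_u per binary address u.
        self.alpha = [0.5] * (2 ** n_inputs)

    def address(self, bits):
        # Map the binary input vector to a memory location.
        u = 0
        for b in bits:
            u = (u << 1) | b
        return u

    def fire(self, bits):
        # Emit a spike (1) with probability alpha_u at the addressed location.
        return 1 if random.random() < self.alpha[self.address(bits)] else 0

# Example: a 2-input pRAM realises a nonlinear function such as XOR directly,
# with no weighted sum involved (alpha in {0,1} gives a deterministic pRAM).
xor = PRAM(2)
xor.alpha = [0.0, 1.0, 1.0, 0.0]
print([xor.fire([a, b]) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
```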
Reinforcement Training
Reinforcement training is a strategy used in problems of adaptive control in which individual behavioural units (of which pRAMs are an example) only receive information concerning the quality of the performance of the system as a whole, and must discover for themselves how to change their behaviour so as to improve this performance. Because it relies only on a global success/failure signal, reinforcement training is likely to be the method of choice for 'on-line' neural network applications.
A form of reinforcement training for pRAMs has been devised which is fast and efficient.
Networks of such units are likely to find wide application, for example, in the control of autonomous robots. Control need not be centralised; small nets of learning pRAMs could, for example, be located in the individual joints of a robot limb. Such a control arrangement would, in many ways, be akin to the semi-autonomous neural ganglia found in insects.
The form of the training rule devised for the pRAM 6) is described by

$$\Delta\alpha_u(t) = \rho\,\big[\,r(t)\,(a(t)-\alpha_u(t)) + \lambda\,p(t)\,(\bar{a}(t)-\alpha_u(t))\,\big]\,\delta_{u,i(t)} , \qquad (2)$$

where r and p ∈ {0, 1} are the global success and failure signals respectively, received from the environment at time t, a(t) is the unit's binary output (with ā = 1 − a), α_u is the weight addressed by the binary input vector u, and ρ and λ are constants ∈ [0, 1]. The delta function is included to make it clear that only the location which is actually addressed at time t is available for modification, the contents of the other locations being unconnected with the behaviour that led to reward or punishment at time t. When r = 1 (success) the probability α_u changes so as to increase the chance of emitting the same value from that location in the future, whereas if p = 1 (failure) the probability of emitting the other value when addressed increases. The constant λ represents the ratio of punishment to reward; a nonzero value for λ ensures that training converges to an appropriate set of memory contents and that the system does not become trapped in false minima. Note that reward and penalty take effect independently; this allows the possibility of 'neutral' actions which are neither punished nor rewarded but may correspond to a useful exploration of the environment. Figure 2 shows the manner in which rule (2) has been implemented in hardware. The memory contents are updated each clock period according to rule (2).
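As a software illustration of rule (2), the sketch below updates the single addressed location; the values of ρ and λ are arbitrary choices for the example.

```python
def train_step(alpha, u, a, r, p, rho=0.1, lam=0.05):
    """Apply reinforcement rule (2) to the addressed location u.

    alpha : list of memory contents (firing probabilities)
    u     : address selected by the current binary input vector
    a     : the unit's binary output at this time step
    r, p  : global reward / penalty signals in {0, 1}
    """
    a_bar = 1 - a
    alpha[u] += rho * (r * (a - alpha[u]) + lam * p * (a_bar - alpha[u]))
    # alpha[u] stays in [0, 1]: each term moves it toward a value in {0, 1}.
    return alpha
```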
There are many interesting problems of adaptive control which require real-valued inputs.
These inputs, when digitised, may be presented to a pRAM network in parallel, as shown later in the paper. It is also possible to input such a real-valued vector in the form of a spike train to a single pRAM input. The pRAM is modified to include spike generators at the inputs (themselves pRAMs) and spike integrators at the outputs (counters), which enable such inputs and outputs to be handled. Such a modified device is called an integrating pRAM (i-pRAM).
Integrating pRAM
This device performs mappings from [0, 1]^N to {0, 1} using the concept of time-averaging of spike trains. At each time step r = 1 ... R (with R = 2^M for an M-bit integrator), the instantaneous values on the N address inputs select a particular location i(r) in the pRAM, which results in a binary output a(r). These outputs are accumulated in a spike integrator (Figure 3) whose contents are reset at the start of a cycle. After R time steps, the contents of the counter are used to generate the binary i-pRAM output, which is 1 with probability (1/R) Σ_{r=1}^{R} a(r). This i-pRAM can be developed further to implement a generalised form of the training rule (2). According to rule (2), the input of a single binary address results in the contents of the single addressed location being modified. However, the i-pRAM can be used to implement a generalised form of training rule (2) in which the input of a real-valued number causes the contents of multiple locations to be modified.
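A minimal sketch of one i-pRAM cycle follows, assuming each real-valued input in [0, 1] is coded as an independent Bernoulli spike train of length R (the coding scheme and R = 64 are assumptions for illustration):

```python
import random

def ipram_cycle(alpha, x, R=64):
    """One i-pRAM cycle: R addressing steps followed by a thresholded output.

    alpha : list of 2^N firing probabilities
    x     : real-valued input vector, components in [0, 1]
    """
    count = 0                      # spike integrator, reset each cycle
    access = [0] * len(alpha)      # address counters (used for training)
    for _ in range(R):
        bits = [1 if random.random() < xi else 0 for xi in x]  # input spikes
        u = int("".join(map(str, bits)), 2)
        access[u] += 1
        if random.random() < alpha[u]:      # pRAM fires for this address
            count += 1
    a = 1 if random.random() < count / R else 0  # fires w.p. the mean rate
    X = [c / R for c in access]    # X_u: access frequencies for this cycle
    return a, X
```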
Address counters are required for counting the number of times each of the storage locations is addressed. This device can then be used to implement the generalised training rule referred to above. This generalised training rule 7) is

$$\Delta\alpha_u(t) = \rho\,\big[\,r(t)\,(a(t)-\alpha_u(t)) + \lambda\,p(t)\,(\bar{a}(t)-\alpha_u(t))\,\big]\,X_u(t) , \qquad (5)$$

where X_u(t) = (1/R) Σ_{r=1}^{R} δ_{u,i(r)} replaces the delta function in eq. (2). Thus in the learning i-pRAM case every location is available for updating, with the change being proportional to that address's responsibility for the ultimate i-pRAM binary output a(t). The X_u's record the frequency with which addresses have been accessed and are derived from the address counters above.
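Continuing the sketch, rule (5) applies the same reward and penalty corrections to every location, weighted by its access frequency X_u from the cycle just completed:

```python
def train_ipram(alpha, X, a, r, p, rho=0.1, lam=0.05):
    """Generalised rule (5): update all locations in proportion to X_u."""
    a_bar = 1 - a
    for u, x_u in enumerate(X):
        if x_u > 0:   # only locations actually visited this cycle change
            alpha[u] += rho * (r * (a - alpha[u])
                               + lam * p * (a_bar - alpha[u])) * x_u
    return alpha
```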
The learning rule may be further generalised in order to deal with situations in which reward or punishment may occur an indefinite number of time steps after the critical action which caused the environmental response. 7) In such delayed-reinforcement tasks it is necessary to learn path-action rather than position-action associations. This can be done by adding eligibility traces to each memory location. These decay exponentially, by a factor δ, when a location is not accessed, but otherwise are incremented to reflect both access frequency and the resulting i-pRAM action. One trace records "access and activity", whereas a complementary trace records "access and inactivity" (both are equally important in developing an appropriate response to a changing environment). These traces are denoted e_u(t) and f_u(t), respectively. The eligibility traces are initialised to zero at the start of a task and subsequently updated according to

$$e_u(t) = \delta\, e_u(t-1) + \bar{\delta}\, a(t)\, X_u(t) , \qquad (6)$$
$$f_u(t) = \delta\, f_u(t-1) + \bar{\delta}\, \bar{a}(t)\, X_u(t) , \qquad (7)$$

where δ̄ = 1 − δ. The necessary extension of eq. (5), which results in the capacity to learn about temporal features of the environment, is

$$\Delta\alpha_u(t) = \rho\,\big[\,r(t)\,\big(e_u(t) - \alpha_u(t)\,(e_u(t)+f_u(t))\big) + \lambda\,p(t)\,\big(f_u(t) - \alpha_u(t)\,(e_u(t)+f_u(t))\big)\big] . \qquad (8)$$

When δ = 0, so that e_u = a X_u and f_u = ā X_u, eq. (8) reduces to the original learning i-pRAM training rule (5).
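The delayed-reinforcement variant can be sketched as follows, maintaining the two traces per location and applying rule (8) whenever an environmental signal arrives; the decay factor value is illustrative:

```python
def update_traces(e, f, X, a, delta=0.9):
    """Rules (6)-(7): decay all traces, then credit visited locations."""
    for u, x_u in enumerate(X):
        e[u] = delta * e[u] + (1 - delta) * a * x_u
        f[u] = delta * f[u] + (1 - delta) * (1 - a) * x_u

def train_delayed(alpha, e, f, r, p, rho=0.1, lam=0.05):
    """Rule (8): reinforce according to each location's eligibility."""
    for u in range(len(alpha)):
        s = e[u] + f[u]
        alpha[u] += rho * (r * (e[u] - alpha[u] * s)
                           + lam * p * (f[u] - alpha[u] * s))
```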
VLSI pRAMs
The pRAM-256 is a VLSI device which processes 256 internal pRAMs in a similar manner to an earlier pRAM device 3) (Fig. 4). The earlier device 3) provided only local reinforcement training, in which two "auxiliary pRAMs" are used to determine the appropriate reward and penalty signals for each "learning pRAM". The pRAM-256 device is no longer restricted to local learning. By allowing the reward and penalty inputs to each pRAM to be reconfigured through the use of "connection pointers", global, local and competitive methods of training can all be used in this fourth generation of pRAM hardware. 8)
The configuration of the pRAM network is defined by "connection pointers" stored alongside the pRAM weight memory (Fig. 4). Thus many architectures can be built with the same hardware; reconfiguring the network only requires updating a connectivity table.
At the same time as the pRAMs are being processed, new external data may be shifted onto the chip so that neural processing may proceed without incurring any data transfer delay.
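The reconfigurability can be pictured as a connectivity table which maps each pRAM's address inputs and its reward and penalty lines to the outputs of other pRAMs or to external pins. The layout below is a hypothetical illustration, not the chip's actual register map:

```python
# Hypothetical connectivity table: for each pRAM, where its address inputs
# and its r/p training inputs come from ("ext:k" = external pin k).
connections = {
    0: {"inputs": ["ext:0", "ext:1"], "r": "global_r", "p": "global_p"},
    1: {"inputs": [0, "ext:2"],       "r": "global_r", "p": "global_p"},
    # ... one entry per pRAM; rewriting entries re-wires the network
    #     for global, local or competitive training without new hardware.
}
```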
Hardware Learning
The learning algorithm used in the pRAM-256 is given by eq. (2) above. It may be seen from eq. (2) that the weight α_u always remains in the interval [0, 1], so that clipping is never required: each update moves α_u a fraction ρ (or ρλ) of the way toward a value in {0, 1}, so the result is a convex combination of quantities lying in [0, 1]. The action of the learning algorithm is to move the weight closer to the value of a (the pRAM output) if a reward is given, and to move it further away from a if a penalty signal is given, for binary inputs and outputs. Thus beneficial actions are made more likely to happen in the future and adverse actions less likely to occur, given the same circumstances. It is the use of a hardware-realisable algorithm on-chip which makes a totally hardware-based learning system possible. This algorithm is incorporated within the pRAM-256 chip shown in Fig. 4.
Interfacing to External Hardware
To interface the pRAM-256 device to external hardware, programmable logic devices (FPGAs) are currently used. These FPGAs contain many registers, some of which may be used to count the spikes coming from the serial outputs of individual pRAMs and so generate a mean firing frequency. Combinational circuitry within the FPGA may be used to model the environment, generating the reward and penalty signals during training.
An example of the use of the pRAM-256 to solve a pattern classification task has been given by Clarkson and Ng 9) . The next section shows how a pRAM network may be used to solve a classical pole-balancing task.
Inverted Pendulum System
The inverted pendulum system is a classical problem that can be used to test the performance of a neural network. 9) In this experiment, the system comprises a rigid pole and a cart on which the pole is hinged. In order to reduce the complexity of the problem, the cart moves on rails to its right or left, depending on the force exerted on the cart. The pole is hinged to the cart through a frictionless free joint which allows the pole to move in one dimension only. Figure 5 shows the arrangement of the pendulum system. The objective of the neural network control system is to balance the pole starting from nonzero conditions by applying appropriate force to the cart. 
Network architecture
In order to generate the appropriate force, F, for balancing the pole, the network must be able to estimate the state variables at time t+1, given the state variables at the present time t. This is done by using an estimation subnet which takes θ(t), θ̇(t) and F as inputs (Fig. 6). The outputs of the estimation subnet are θ(t+1) and θ̇(t+1). The subnet is trained to perform the calculation according to the equations of motion of the pendulum, discretised to first order:

$$\theta(t+\delta) = \theta(t) + \delta\,\dot{\theta}(t) , \qquad \dot{\theta}(t+\delta) = \dot{\theta}(t) + \delta\,\ddot{\theta}(t) , \qquad (9)$$

which are the linear approximations of the next-time-step state variables; the angular acceleration θ̈(t) is given by the cart-pole dynamics.
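The training targets can be generated offline by integrating the cart-pole dynamics. The sketch below uses the standard frictionless cart-pole equations; the physical constants and the helper name `step` are assumptions, since the paper's values are not reproduced here:

```python
import math

G = 9.8                      # gravity, m/s^2
M_CART, M_POLE = 1.0, 0.1    # assumed cart and pole masses, kg
L = 0.5                      # assumed half-length of the pole, m
DT = 0.011                   # time step delta (~11 ms per slice, as in the text)

def step(theta, theta_dot, F):
    """One linear (Euler) update of the pole state under force F (N)."""
    total = M_CART + M_POLE
    # Standard frictionless cart-pole angular acceleration.
    tmp = (F + M_POLE * L * theta_dot**2 * math.sin(theta)) / total
    theta_dd = (G * math.sin(theta) - math.cos(theta) * tmp) / (
        L * (4.0 / 3.0 - M_POLE * math.cos(theta)**2 / total))
    # Eq. (9): first-order approximation of the next-time-step state.
    return theta + DT * theta_dot, theta_dot + DT * theta_dd
```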
The state variables are represented by 6-bit signed binary numbers and F is represented by one bit. The control action F is calculated by another subnet, the action subnet, whose structure is shown in Fig. 7 .
The inputs to this action subnet are the state variables corresponding to the present time step.
The estimation and action subnets are connected to form a slice network, whose structure is depicted in Fig. 8. The action subnet takes inputs from the previous slice network and produces the corresponding control action (F). The control action and the previous state variables are fed into the estimation subnet, which prepares the prediction of the next-time-step state variables for the adjacent slice network. Therefore, by connecting N units of the slice network, one can predict the movement of the inverted pendulum system and the control goal can be achieved.
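Conceptually, the chained slice networks unroll the control loop in hardware; in software the same computation is the loop below, where `estimate` and `act` stand for the trained subnets (hypothetical function names):

```python
def rollout(theta, theta_dot, slices, estimate, act):
    """Unrolled prediction over a chain of slice networks.

    estimate(theta, theta_dot, F) -> next (theta, theta_dot)
    act(theta, theta_dot)         -> control action F (one bit)
    """
    trajectory = []
    for _ in range(slices):
        F = act(theta, theta_dot)                          # action subnet
        theta, theta_dot = estimate(theta, theta_dot, F)   # estimation subnet
        trajectory.append((theta, theta_dot))
    return trajectory
```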
The delay introduced by the slice network corresponds to one time step (δ) between two consecutive state variable pairs. In this experiment, the network is used to predict the movement (the angle and angular speed) of the pendulum system from t = 0 to t = 1 s.
Obviously, a shorter value for δ requires a larger network, because more slice networks are needed to cover the time span; a large value of δ would therefore be preferred on these grounds. On the other hand, the delay cannot be too long, for if the control action is too slow the network will no longer be able to balance the pole. In addition, with a long delay the linear approximation from the estimation subnet may no longer be precise enough, and a higher-order approximation, e.g. a second-order approximation, would have to be used instead; this would increase the training time. Since δ is related to the speed of the system clock, we can manipulate the time delay by adjusting the system clock. In this experiment the pRAM system clock is 33 MHz and the delay induced by a combined forward and weight-update process is 174 ns. For each slice network, 64 samples are taken and the average is sent to the decision maker (Fig. 9); this contributes to a total delay time of 11 ms.
In order to cover a time span of 1 second, a network with 90 slice networks is required. The schematic diagram of the complete pRAM-based control system is shown in Fig. 9 .
The decision maker is an external circuit which performs the following tasks:
(i) it averages the outputs of the slice networks; (ii) for each slice network, it computes the Hamming distance between the estimated pole movement and the desired one; and (iii) it generates the environmental signals (r and p). The calculated Hamming distance is thresholded: if it is lower than a predefined threshold, the corresponding slice network is rewarded; otherwise, the slice network is penalised.
The decision maker uses a 12-bit accumulator to calculate the average of the slice network outputs. After 64 samples have been added to the accumulator, the top 6 bits are taken as the average of the samples (discarding the bottom 6 bits divides the 12-bit sum by 64).
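A behavioural sketch of the decision maker, with the reward threshold as an assumed value:

```python
def decision_maker(samples, estimated, desired, threshold=2):
    """Average 64 six-bit samples and issue reward/penalty signals.

    samples   : 64 six-bit slice-network outputs
    estimated : estimated pole movement, as a bit string
    desired   : target pole movement, as a bit string of equal length
    """
    acc = sum(samples)           # fits in a 12-bit accumulator (max 4032)
    average = acc >> 6           # top 6 bits: divide the sum by 64
    hamming = sum(b1 != b2 for b1, b2 in zip(estimated, desired))
    r, p = (1, 0) if hamming < threshold else (0, 1)
    return average, r, p
```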
Training strategy
The network is trained by the on-chip reinforcement training algorithm which is provided in the pRAM-256 modules. Every slice network receives a pair of environmental signals (r and p) from the decision maker, hence the learning is localised.
Estimation subnet
Since every slice network uses the same estimation subnet, only one estimation subnet need be trained; the trained subnet is then replicated accordingly. The training set is generated by solving eq. (9). In order to improve the network's generalisation from the training set, a random noise signal whose amplitude is up to 10% of the maximum input value is added during training. For a shorter training time, the training noise level can be reduced at the expense of generalisation. The elements of the training set are the initial pole angle, angular velocity and F (the applied force), together with the desired pole angle and angular velocity.
All of the above variables are represented using 6-bit signed binary numbers. When there is no training noise, the estimation subnet is trained to perform a one-to-one mapping between present and next-state angles and velocities. Such a one-to-one mapping, together with the stochastic nature of the pRAM, means that the estimation subnet cannot produce a pair of stable outputs. By introducing training noise, the subnet is forced to perform a many-to-one mapping. As a result, the outputs of the subnet remain the same even when there is a slight change in the inputs; hence a pair of more stable outputs results.
Action subnet (slice network)
Outputs of the slice networks can be regarded as the trajectory of the pole; every slice network corresponds to a particular position of the pole at a given time. Given the pair of state variables, the action subnet is expected to generate the correct control action, which is used to estimate the state variables at the next time step. Whether the control action is good or not depends on the generated state variables. During the training process, the generated state variables are compared with a pair of target values; if the difference is within a predefined limit, the action subnet is rewarded, otherwise it is penalised. Training must be halted while the average of the slice network output is being calculated. The decision maker can do so by setting r = 0 and p = 0. An alternative strategy is to disable the TRAIN input to the pRAM-256, which results in only forward passes being processed. By means of the strategy above, a pRAM net was trained to solve eq. (9) and thereby to balance the inverted pendulum.
Conclusions
It has been shown how the pRAM artificial neuron may process temporal data in two ways. The first is by the use of the integrating pRAM, which retains temporal information in the form of an activity history. The second is by dividing time into slices; this method is closer to pattern recognition than the first. We have proposed an application of a neuroprocessor, the pRAM-256, to solve the inverted pendulum balancing problem using the second method. This application shows the flexibility of pRAM-256 chips in the construction of pRAM-based neural networks. In the system described, ten pRAM-256 chips are used, connected via the on-chip serial links. In addition, the on-chip reinforcement learning facility enables fast learning for a large-scale network. The need for an external decision maker, however, is a drawback for a self-contained system; we are investigating the feasibility of incorporating such a facility into a single package.
