Nanodevices have terrible properties for building Boolean logic systems: high defect rates, high variability, high death rates, drift, and (for the most part) only two terminals. Economical assembly requires that they be dynamical. We argue that strategies aimed at mitigating these limitations, such as defect avoidance/reconfiguration, or applying coding theory to circuit design, present severe scalability and reliability challenges. We instead propose to mitigate device shortcomings and exploit their dynamical character by building self-organizing, self-healing networks that implement massively parallel computations. The key idea is to exploit memristive nanodevice behavior to cheaply implement adaptive, recurrent networks, useful for complex pattern recognition problems. Pulse-based communication allows the designer to make trade-offs between power consumption and processing speed. Self-organization sidesteps the scalability issues of characterization, compilation and configuration. Network dynamics supplies a graceful response to device death. We present simulation results of such a network (a self-organized spatial filter array) that demonstrate its performance as a function of defects and device variation.
Nanoelectronics and computing paradigms
Nanodevices are crummy (see footnote 1). High defect rates, high device variability, device ageing, and limitations on device complexity (e.g., two-terminal devices are much easier to build) are to be expected if we intend to mass produce nanoelectronic systems economically. Not only that, it is almost axiomatic among many researchers that such systems will be built from simple structures, such as crossbars, composed of nanodevices that must be configured to implement the desired functionality (Heath et al 1998, Williams and Kuekes 2000, Kuekes and Williams 2002, DeHon 2003, Snider et al 2004, DeHon 2005, Ma et al 2005, Snider 2005). So we are faced with the challenge of computing with devices that are not only crummy, but dynamical as well.
Footnote 1: 'Crummy' was introduced into the technical lexicon by Moore and Shannon (1954).
Can reliable Boolean logic systems be built from such crummy devices? Yes, but we argue that at some point as device dimensions scale down, the overhead and complexity become so costly that performance and density improvements will hit a barrier. In the next section we discuss two frequently proposed strategies for implementing Boolean logic with crummy, dynamical devices (reconfiguration and coding theory) and argue that each has severe scalability problems. This is not to suggest that logic at the nanoscale is not worth pursuing. It clearly is, and semiconductor manufacturers have the economic motivation to continue scaling down as aggressively as their profits permit. Rather, we are suggesting that the 'killer app' for nanoelectronics lies elsewhere.
An alternative computational paradigm (adaptive, recurrent networks) is computationally powerful and requires only two types of components, which we call 'edges' and 'nodes'. Edge to node ratios are typically high, hundreds to thousands, and edges are, unfortunately, difficult to implement efficiently. This difficulty has made these networks extremely unpopular; software implementations are impossibly slow, and hardware implementations require far too much area.
In this paper we propose using memristive nanodevices to implement edges, conventional analog and digital electronics to implement nodes, and pairs of bipolar pulses, called 'spikes', to implement communication. The tiny size of the nanodevices implementing edges would allow, for the first time, a practical hardware implementation of adaptive, recurrent networks. We suggest that such networks are a better architectural fit to nanoelectronics than Boolean logic circuits. They are robust in the presence of device variations and defects; they degrade gracefully as devices become defective over time, and can even 'self-heal' in response to internal change; they can be implemented with simple, crossbar-like structures. Just as importantly, they can be structured to self-organize their computations, sidestepping scalability problems with device characterization, compilation and configuration. Such systems can contain large numbers of defective components, but we will not need to discover where they are; in fact, we will not care where they are. The system can adapt and 'rewire' itself around them.
Boolean logic is hard with crummy, dynamical devices

Reconfiguration to the rescue?
Several researchers have proposed using a combination of device characterization, defect avoidance, and configuration to handle initial static defects (DeHon 2003, Snider et al 2004, Snider and Williams 2007). The strategy is a three-pass algorithm:
(1) Characterization. Analyze every nanowire and nanodevice in the system and compile a list of resources which are defective (stuck open or stuck closed, shorted, broken, out-of-spec, etc). Such analysis algorithms were used in the Teramac system (Culbertson et al 1997).
(2) Defect avoidance. Give the list of defects from pass 1 to a compiler which maps a desired circuit onto the defective fabric, routing around defective components (Culbertson et al 1997, Snider et al 2004, Snider and Williams 2007).
(3) Configuration. Give the mapping determined in pass 2 to a controller that electrically configures each of the mapped components.
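The three passes above can be sketched on a toy fabric. This is only an illustration under strong simplifying assumptions: the function names, the 'measured value within a spec window' defect test, and the first-fit mapping are all hypothetical, and a real defect-avoiding compiler must also solve placement and routing.

```python
from typing import Dict, List, Set, Tuple


def characterize(measured: List[float], spec: Tuple[float, float]) -> Set[int]:
    """Pass 1: return indices of resources whose measured value is out of spec."""
    low, high = spec
    return {i for i, value in enumerate(measured) if not (low <= value <= high)}


def avoid_defects(num_logical: int, num_physical: int,
                  defects: Set[int]) -> Dict[int, int]:
    """Pass 2: map each logical component onto a working physical resource.

    First-fit assignment only; routing constraints are ignored (assumption).
    """
    working = [p for p in range(num_physical) if p not in defects]
    if len(working) < num_logical:
        raise ValueError("not enough working resources for this circuit")
    return {logical: working[logical] for logical in range(num_logical)}


def configure(mapping: Dict[int, int]) -> List[str]:
    """Pass 3: emit one (hypothetical) programming command per mapped resource."""
    return [f"program physical {p} as logical {l}"
            for l, p in sorted(mapping.items())]


measured = [1.0, 0.0, 0.9, 5.0, 1.1]          # toy per-device measurements
defects = characterize(measured, spec=(0.5, 2.0))
mapping = avoid_defects(3, len(measured), defects)
commands = configure(mapping)
print(defects, mapping, commands)
```

Even in this toy form, every pass depends on per-chip data (`measured`), which is why the whole sequence has to be repeated for each manufactured part.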
Since every chip will have a unique set of defects, the above process must be applied to each and every chip. This presents some interesting challenges for manufacturing, since the time required to perform the above steps will contribute to production cost.
Characterization (pass 1) is problematic due to device variability: the functional state of a device (or wire) is not necessarily discrete (working versus nonworking) but can lie on a continuum. And characterizing, say, 10^12 nanodevices in a reasonable amount of time is not likely to be trivial, especially given the bottleneck of pins on the chip. It is not clear if existing characterization algorithms could be applied to systems like this, or how well they would scale.
Compilation (pass 2) also presents considerable risk. Compiling circuits onto configurable chips (such as FPGAs) today is a time-consuming process, due to the NP-hard placement and routing problems that lie at the compiler's core. Even circuits comprising only a few tens of thousands of gates can require several minutes to compile, depending on the degree of optimization needed, and that is assuming a defect-free target, where a single compilation can be used to manufacture thousands of parts. One proposal for minimizing this problem requires an initial 'ideal' compilation onto a hypothetical defect-free fabric, laying out components a little more sparsely than optimal. This would be done only once for each (circuit type, chip type) combination, so one could afford to spend enormous amounts of compute time to arrive at this ideal configuration. The configuration of an individual chip on the production line would then be viewed as a 'perturbation' of the ideal configuration, with resource allocation shifted as necessary to avoid defects. One might even combine the characterization pass with this pass for further speed improvements. This strategy might be workable. But it is not clear how well this would scale, or how robust this would be in the presence of defect clustering.
Configuration (pass 3) is the most straightforward, with the most significant risk being configuration time restrictions due to the pin bottleneck and power dissipation.
Note that the above approach of characterize, compile, configure does not handle device death. What happens when a nanodevice stops working and the chip starts computing nonsense? If the nanodevices are reconfigurable, the system can be stopped and reconfigured to work around the newly formed defects. But that assumes additional circuitry to detect the malfunction (e.g. self-checking circuits (Wakerly 1978)), and a companion host processor capable of implementing the three passes in order to reconfigure the chip. Such a processor would have to be fast, reliable (which probably means it would not be implemented with nanodevices), and likely would require a significant amount of memory. Implementing such a coprocessor would seem to negate the benefits that one was presumably getting from the nanoscale circuit in the first place.
Coding theory to the rescue?
Coding theory has been used for decades to robustly transmit information over noisy channels by adding a small amount of redundant information. Can coding theory do the same for logic circuits by adding a small number of redundant gates or components in order to achieve reliable operation?
von Neumann (1956), looking forward to nanoscale computation, was perhaps the first to address this question. His approach used a code that replicated each logic gate, and combined the replicated gate outputs in a clever way so that the entire system achieved a desired level of reliability. Although his primary concern was correcting errors induced by transient faults, the approach could also compensate for initially defective devices or 'device deaths' as long as the failing devices were randomly distributed and the number did not exceed a threshold (the trade-off being that device deaths would reduce the system's tolerance of transient faults). The overhead for his scheme could be enormous, though. Replication factors for Boolean logic went to infinity as fault rates approached about 1%. This bound has been improved by later researchers, with replication factors of only 3 claimed for fault rates up to 7.8 × 10^−8 for majority gate logic (Roy and Beiu 2004) (see footnote 2). Still, replication coding appears practical only when fault+death rates are acceptably low, and even then the overhead is economically unappealing. What you would really like is an efficient, non-replication code that works at projected fault, defect and death rates.
At the gate level, no such code exists (Elias 1958, Peterson and Rabin 1959, Winograd 1962). At a higher level (e.g. multi-level logic, function units, etc), though, efficient coding schemes are possible for special cases: operations implementing group or semigroup operations (Hadjicostis 2002).
Such codes have been proposed to implement robust, nanoscale demultiplexers built from defective crossbars and could be used to efficiently implement special-purpose circuits composed largely of group operations, such as digital filters. But a study by Reischuk (1997) suggests that an efficient code for general logic systems may not exist: 'Only a vanishing proportion of all Boolean functions can be computed reliably'. Presumably the 'vanishing proportion' refers to group and semigroup operations, although (to our knowledge) this has not been proved.
Coding theory can certainly help in augmenting circuits for detecting faults and protecting circuits in special cases. But in the general case, coding theory does not appear to offer any magic bullets for building reliable Boolean logic systems out of crummy, dynamical parts. It is possible that some combination of reconfiguration and coding could provide the desired reliability with economic redundancy. Given how valuable such an approach would be, we have not abandoned the search, but given the elusive nature of the prize, we have called it the 'Holy Grail' of nanoscale Boolean logic.
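To give a feel for the replication trade-off discussed in this section, the sketch below computes the failure probability of a majority vote over replicated gates. It is not von Neumann's multiplexing construction or the Roy and Beiu analysis; it is plain modular redundancy, assuming independent gate failures and a perfect voter, and the function name is ours.

```python
from math import comb


def majority_failure_probability(p: float, copies: int = 3) -> float:
    """Probability that a majority of `copies` independent gates fail.

    Assumes each copy fails independently with probability p and that the
    voter itself never fails (an idealization).
    """
    if copies % 2 == 0:
        raise ValueError("use an odd number of copies so a majority exists")
    needed = copies // 2 + 1
    return sum(
        comb(copies, k) * p**k * (1 - p) ** (copies - k)
        for k in range(needed, copies + 1)
    )


for p in (1e-2, 1e-4, 1e-6):
    print(p, majority_failure_probability(p))
```

With three copies the failure probability is roughly 3p^2, so the benefit shrinks quickly as the per-gate fault rate p grows, while the area cost of replication is paid regardless.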
Adaptive, recurrent networks
A recurrent network is a dynamical system that can be represented by a directed, cyclic graph. An edge of the graph is used to transmit information between two nodes; generally edges perform a transformation (e.g. a scaling or multiplication) of the data received from the sender before relaying it to the receiver. A node of the graph performs a nonlinear transformation of the data received on its edges, and responds by adjusting its internal state and sending information back out on its edges. (Although edges are directed, information traverses them in both directions.) If edges change their state (and their corresponding transfer function) in response to the data sent through them, the network is adaptive. The timescale on which edges change their state is much longer and slower than the timescale on which nodes change their state.
An adaptive, recurrent network evolves over time due to feedback interactions between the nodes, edges and inputs from the outside world. If the components are 'well-behaved' (describable by deterministic, continuous processes), the system can be modeled by a set of coupled, differential equations. Good component behavior is not essential, however, as interesting computational systems can be built from stochastic components and modeled, at least approximately, with difference equations or stochastic differential equations (Kosko 1990). Of course interconnecting nonlinear, dynamical components with feedback is risky since the resulting network may not be stable, and may not produce interesting behavior even if it is. Fortunately, research in nonlinear dynamics has provided insight into how to build networks that respond stably to an external stimulus with a useful result (Grossberg 1976a, 1976b, 1980).
Footnote 2: Replication tends to be worse than this in practice since wires at the nanoscale are themselves potentially defective components. Long nanowires require buffers (and lots of them because of high nanowire resistivity leading to large RC time constants) so the number of 'components' in the system scales worse than linearly with the number of 'gates' in the original circuit.
Adaptive, recurrent networks are difficult to implement in hardware due almost solely to the complexity of dynamical edge behavior (since nodes are usually very sparse relative to edges, their implementation is less of an issue). Edges can be complex dynamical systems involving multiple processes, but for computation we can reduce this to a single process or state variable that implements two important properties:
(1) Multiplication. The edge must be able to take an input data value and scale it by some factor to produce an output. The scaling factor, often called the edge's 'weight,' is a function of the edge state.
(2) Adaptation. The edge state, and hence its weight, must change in a useful way in response to the behavior of the nodes it is connected to. An equation that describes this state evolution is called the edge's learning law.
In a practical implementation, edge states should be nonvolatile. This is a lot of functionality (multiplicative transfer function, useful learning law, nonvolatile memory) to pack into a single device. (See Fusi et al (2000) for one idea of how to do this.)
Many different learning laws have been proposed for edges (Kosko 1990, Grossberg 1997, Gerstner and Kistler 2002). The most famous, and one of the earliest, is the Hebbian learning law (Hebb 1949), often paraphrased as nodes that 'fire together, wire together.' In other words, when two connected nodes are 'active' at the same time, the weight of their connecting edge increases to reinforce that correlation. This learning rule is not complete, since it provides a rule for increasing edge weight but no means for decreasing it. A network built from such edges would eventually drive all edge weights to their maximum values. Hence Hebbian-based learning laws must have an additional non-Hebbian term to prevent edge weight runaway.
Perhaps the first complete learning law for the weight, w, of an edge connecting two nodes, n_i and n_j, was the 'gated steepest descent' law devised by Grossberg in the 1950s (although it was not published for another decade (Grossberg 1997)):

$$\frac{dw}{dt} = S_j \left( -\frac{1}{\tau_s} w + k S_i \right) \tag{1}$$

In this equation: w represents the weight of the edge; S_j is a function (which we will call the 'activity') of the internal state of the sink node, n_j, driven by the edge; S_i is the activity of the source node, n_i, driving the edge; k is a 'learning' rate; and τ_s is a time constant describing a decay or 'forgetting' rate. The 'activity' functions are left to the network designer to define, although stability imposes constraints on the form these functions may take. Expanding the right side of this equation yields a Hebbian term

$$k S_j S_i \tag{2}$$

that increases the edge weight in response to correlated activity, and a non-Hebbian decay term

$$-\frac{1}{\tau_s} S_j w \tag{3}$$

that decreases the weight in the absence of correlated activity. The correlation term, k S_j S_i, appears in some form in nearly all learning laws. This term is another reason why edges are difficult to implement: it requires a second multiplication in the edge, in addition to the multiplication required in the edge transfer function. In practice, the activity functions S_i and S_j may produce single bit outputs, in which case this multiplication reduces to a logical AND, but still this forces a nonlinear operation into the edge.

Figure 1. A two-layer adaptive, recurrent network with a center-surround interconnect in each layer. x_i and y_i refer to nodes in the bottom and top layer, respectively. Edges are shown as lines with arrowheads interconnecting nodes between layers (straight arrows) or within a layer (curved arrows). w_ij is the weight of an edge connecting x_i to y_j. Figure 4 contains more details about nodes.

At the faster node timescale, one simple node model sums its inputs, transforms the sum using a sigmoidal transfer function (such as the logistic function 1/(1 + e^−x)), then lowpass filters that result. If y is the internal state of the node, it can be modeled with the equation
A structure useful for stabilizing recurrent networks is the on-center/off-surround interconnect (top plane in figure 1). Nodes, y_j, are arranged in a two-dimensional grid, with each node connected to near neighbors with edges. Close neighbors, called the 'center' (shown by the smaller oval in the top layer of figure 1), are driven by 'excitatory' edges that increase the receiving node's activity. Slightly more distant neighbors, called the 'surround' (larger oval in figure 1), are driven by 'inhibitory' edges that decrease the receiving node's activity. The edge weight for each of these connections is denoted w_j'j, where j' is the index of the source node and j the index of the sink. This weight is permanently fixed at zero for widely separated nodes that are not in each other's 'center' or 'surround.' Each node also has a set of edges from a previous layer (shown as x_i in figure 3) or from the environment. The weight of this edge is denoted w_ij. If we assume the gated learning law, the behavior of this network can be described by the following equations:
These structures can be cascaded, as suggested in figure 1. Network structure and dynamics are explored in greater depth in the work of Grossberg (1976a, 1976b, 1980). Global stability and structural stability are discussed by Wilson (1999), Carpenter and Grossberg (1991) and Kosko (1990, 1991).
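The edge and node dynamics described in this section can be explored numerically. The sketch below is a minimal Euler integration that assumes a gated learning law of the form dw/dt = S_j (−w/τ_s + k S_i), which matches the Hebbian and decay terms described above, and a node modeled as a lowpass-filtered logistic function of its summed inputs. The function names, the node time constant `tau_n`, and all numerical values are illustrative assumptions, not values from the paper.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic sigmoid, 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def edge_step(w: float, s_src: float, s_sink: float,
              k: float, tau_s: float, dt: float) -> float:
    """One Euler step of dw/dt = S_j * (-w / tau_s + k * S_i)."""
    return w + dt * s_sink * (-w / tau_s + k * s_src)


def node_step(y: float, inputs: list, tau_n: float, dt: float) -> float:
    """One Euler step of a lowpass-filtered sigmoid of the summed inputs."""
    return y + dt * (-y + logistic(sum(inputs))) / tau_n


def run_edge(w0: float, s_src: float, s_sink: float, steps: int = 20000) -> float:
    """Integrate the edge weight with fixed activities (assumed constants)."""
    w = w0
    for _ in range(steps):
        w = edge_step(w, s_src, s_sink, k=0.5, tau_s=2.0, dt=0.01)
    return w


both_active = run_edge(0.0, s_src=1.0, s_sink=1.0)   # grows toward k * tau_s
sink_only = run_edge(1.0, s_src=0.0, s_sink=1.0)     # decays toward zero
sink_silent = run_edge(0.3, s_src=1.0, s_sink=0.0)   # gated off: unchanged
print(both_active, sink_only, sink_silent)
```

The three runs show the gating: the weight only moves when the sink node is active, it grows when source and sink are active together, and it decays when the sink is active alone.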
Circuit model
The architecture proposed in this paper is a hybrid system for implementing an adaptive, recurrent network. Nodes are implemented in silicon with conventional analog and digital circuitry. Edges are implemented with dynamical nanowire junctions formed by the crossing of two nanowires separated by a thin (few nanometers thick) material, such as a metal oxide. This structure is interesting because of the small edge implementation, and because some materials yield junctions that are non-volatile. This makes it possible to create a self-organizing network that does not lose its computational properties when the power is turned off.
The inspiration for this architecture was the memristive device behavior (Chua and Kang 1976) we found for many nanojunctions in our lab (figure 2), along with speculations by Chua (2003) that such behavior should be expected at the nanoscale because of various physical processes. Memristive behavior has also been seen by other researchers (Lau et al 2004, Duan et al 2002), though it is not always clear whether these devices are bistable. We noted that many of our junctions were not bistable but took on a continuum of states. This made designing logic circuits out of such components even more challenging; how could one implement a logic gate out of devices (crummy to begin with) that changed their state in response to signals traversing them? Either you stabilize the devices somehow, or exploit their dynamical behavior, which is the approach taken here.
Memristive edges
A time-invariant, voltage-controlled memristive device changes its resistance over time (Chua and Kang 1976). Given one or more state variables, w, the device obeys the equations

$$\frac{dw}{dt} = f(w, v), \qquad i = g(w, v)\, v$$

where i is current and v is voltage. In this paper we consider an extremely simple, voltage-controlled, memristive model, where the state variable, w, is just the conductance of the device. The function f(w, v) is linear in w, clipping when w hits a maximum or minimum value; this constrains the conductance to lie within a range (w_min, w_max). But we assume that f(w, v) is nonlinear in the voltage, v, in a 'sinh-like' way (figure 4), yielding an equation of the form:
The exact form of the sinh-like function (which is plausible given quantum tunneling) is unimportant since we are only going to drive three different voltages across the device (V , 2V , and −V ); what matters is that a voltage, V , causes the device to change its conductance by a small amount, while a voltage twice as large, 2V , causes it to change by a much greater amount, much more than twice the change caused by V . This property provides the multiplication required to implement correlational learning in equation (2). Note that edges have a polarity and that the voltage drop across the edge is defined to be (v source − v sink ). This means that a positive voltage drop will cause conductance to increase, and negative voltage drop will cause it to decrease.
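The key property described above, that a drop of 2V changes the conductance by far more than twice the change caused by V, can be checked with a small numerical sketch. The update rule below assumes dw/dt = RATE · sinh(STEEPNESS · v) with hard clipping, ignoring any other dependence on w; `RATE`, `STEEPNESS`, the conductance limits and the function names are hypothetical choices for illustration only.

```python
import math

W_MIN, W_MAX = 0.0, 1.0       # assumed conductance limits (arbitrary units)
RATE, STEEPNESS = 1e-4, 4.0   # hypothetical constants of the sinh-like law


def conductance_rate(v: float) -> float:
    """Sinh-like dependence of dw/dt on the voltage drop v."""
    return RATE * math.sinh(STEEPNESS * v)


def apply_pulse(w: float, v: float, width: float = 1.0) -> float:
    """Apply a constant voltage v for `width` time units, then clip w."""
    w_new = w + width * conductance_rate(v)
    return min(W_MAX, max(W_MIN, w_new))


V = 1.0
delta_v = apply_pulse(0.5, V) - 0.5         # small increase
delta_2v = apply_pulse(0.5, 2 * V) - 0.5    # much larger increase
delta_neg = apply_pulse(0.5, -V) - 0.5      # small decrease
ratio = delta_2v / delta_v
print(delta_v, delta_2v, delta_neg, ratio)
```

Because sinh grows roughly exponentially, the ratio of the two changes is far larger than 2 for these constants, which is what lets a coincidence of two pulses act like a multiplication.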
Nodes
Nodes are synchronous components with three input terminals and one output terminal (figure 4(a)). Multiple edges can be connected to each input or output. A node communicates with other nodes via edges by sending a pair of pulses (a positive pulse followed by a negative pulse) to the 'output' terminal and 'prior input' terminal. Each positive/negative pulse pair is called a 'spike.' Each terminal acts as a virtual ground, so that incoming spikes are sensed by measuring the current flow into the terminal; simultaneously arriving spikes sum their currents. Note in figure 4 that although node terminals are labeled as 'outputs' or 'inputs,' information transmission is bidirectional through non-lateral edges.
Spike timing is shown in figure 4(b). All nodes are synchronized to a global, three-phase clock. Forward spikes, consisting of a positive pulse in phase 0 followed by a negative pulse in phase 1, are sent to the node's output. Back spikes, delayed or phase-shifted by one phase relative to forward spikes, are sent to the 'prior' input terminal (and could be sent to the others as well). The negative pulse of a back spike is wider than the positive pulse; as we will see, this pulse pair implements the 'decay' term in equation (3). A node receives and processes forward spikes on its inputs and back spikes delivered by other nodes on its output.
Node internals and operation are sketched in figure 5. A state machine guides the node through two modes of operation: a processing mode and a spiking mode. In processing mode, the node integrates weighted positive pulses of forward spikes received in phase 0 from the input terminals, and the weighted positive pulses of back spikes received in phase 1 from the output pin (see footnote 3). Since edges have different conductances or weights, but spikes all have the same amplitudes, the current collected by the node's inputs for a given spike is proportional to the conductance of the edge carrying the spike.
Figure 5. Top: synchronous node consists of a few analog components (summing amplifiers, scaled inverter, analog switches, leaky integrator, and comparator) under the control of a state machine. Bottom: timing diagram for node operation. A node sums incoming spikes with a leaky integrator, firing a pair of output spikes (forward spike and back spike) when the integrator exceeds a threshold. After spike firing, the integrator is reset to a negative value.
If we index incoming edges by j , the node's internal integrator state, n, approximates the following equations during processing mode:
In this paper, the scaling constants f p and f e equal 1.0, f i equals −0.5, and f b is 0.
Whenever the integrator output exceeds threshold in phase 2, the state machine transitions to spike mode in the following cycle. In that mode a forward spike and back spike are generated, the integrator is reset to the value −threshold, and input spikes are ignored. After spiking, the node then returns to processing mode in the next cycle.
Footnote 3: In this paper we consider only a single layer of processing nodes, so there are no incoming back spikes to process.
The leakiness of the integrator supplies the sigmoidal transfer function of equation (4). If spikes arrive randomly but infrequently, the integration decay usually prevents the integrator from reaching the threshold and the node rarely fires a spike. As spikes become more frequent, output spiking probability increases, but eventually the output spike rate saturates due to integrator saturation and to the fact that the state machine limits the node to spiking, at most, every other cycle.
Note that when a node is in spiking mode, it ignores all incoming spikes. It is difficult to analyze what effect this has on learning, but it is nevertheless desirable for three reasons:
(1) It prevents self-feedback. Since a nanowire implementation of this approach would likely involve interlaced crossbars similar to that shown by Snider and Williams (2007), output nanowires from a node would cross over its own input nanowires, thus delivering a node's output spikes to itself. By ignoring all spikes in the window following an output spike, this feedback is eliminated.
(2) It mimics the 'refractory period' seen in biological neurons, which also ignore incoming spikes immediately after generating an output spike. This does not necessarily make this a good thing, but it at least suggests that it may not hurt.
(3) As mentioned earlier, it helps implement the sigmoidal transfer function of the nodes (which is only apparent when you average spikes over time), necessary to suppress noise.
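The node behavior described in this section (leaky integration, threshold, reset to −threshold, and ignoring inputs while spiking) can be sketched as a small state machine. This is a simplified, cycle-level model with assumed values for the threshold and leak factor and a constant drive instead of random spikes; the class and function names are ours.

```python
from typing import List


class Node:
    """Leaky integrate-and-fire node with a one-cycle spiking mode."""

    def __init__(self, threshold: float = 1.0, leak: float = 0.9) -> None:
        self.threshold = threshold
        self.leak = leak          # fraction of the integrator kept per cycle
        self.state = 0.0          # integrator value
        self.spiking = False      # True while the node is in spiking mode

    def cycle(self, weighted_input: float) -> bool:
        """Advance one clock cycle; return True if a spike is emitted."""
        if self.spiking:
            # Spiking mode: emit the spike, reset, and ignore the input.
            self.state = -self.threshold
            self.spiking = False
            return True
        # Processing mode: leak, integrate, and check the threshold.
        self.state = self.leak * self.state + weighted_input
        if self.state > self.threshold:
            self.spiking = True   # the spike is emitted in the next cycle
        return False


def spike_rate(drive: float, cycles: int = 1000) -> float:
    """Fraction of cycles in which a freshly reset node spikes."""
    node = Node()
    spikes: List[bool] = [node.cycle(drive) for _ in range(cycles)]
    return sum(spikes) / cycles


rates = {drive: spike_rate(drive) for drive in (0.0, 0.05, 0.3, 10.0)}
print(rates)
```

Weak drive leaks away before reaching threshold, moderate drive produces occasional spikes, and very strong drive saturates at one spike every other cycle, which is the saturating behavior the text attributes to the integrator and state machine.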
Edge learning
An edge's conductance changes every phase as a function of the voltage drop across the edge induced by forward spikes from the source node and back spikes from the sink node (figure 6). There are four cases to consider:
(1) Neither source nor sink spike. The voltage drop across the memristive edge is zero for the entire cycle, so conductance does not change (although DC offsets might contribute a small drift term).
(2) Source spikes. If the change in the edge weight, w, due to the spikes is very small, the positive and negative pulses of the source spike will roughly cancel each other out.
(3) Source and sink spike. The negative pulse of the forward spike, amplitude V, and the positive pulse of the back spike, amplitude −V, cause a voltage drop across the edge of 2V. This causes a large conductance change (equation (10)), implementing the Hebbian component of learning (equation (2)).
(4) Sink spikes. The wider width of the back spike's negative pulse causes a net decrease in conductance. This implements the decay component of learning (equation (3)).
As a result, the nodes and edges cooperate to implement the gated steepest descent learning law (equation (1)).
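The four cases above can be checked with a per-phase bookkeeping sketch. The encoding below is an assumption made for illustration: each spike is represented by its contribution to the voltage drop across the edge in each clock phase, chosen so that the only drops the device ever sees are V, 2V and −V, as stated earlier. The sinh-law constants, the relative width of the wide back-spike pulse, and the omission of conductance clipping are also assumptions.

```python
import math

V = 1.0                       # pulse amplitude (arbitrary units)
BACK_WIDTH = 1.5              # assumed relative width of the wide back pulse
RATE, STEEPNESS = 1e-4, 4.0   # hypothetical sinh-law constants

# Per-phase (voltage drop contribution, pulse width) for each spike type.
FORWARD = {0: (-V, 1.0), 1: (+V, 1.0)}
BACK = {1: (+V, 1.0), 2: (-V, BACK_WIDTH)}


def conductance_change(source_spikes: bool, sink_spikes: bool) -> float:
    """Net conductance change over one three-phase cycle (no clipping)."""
    total = 0.0
    for phase in range(3):
        drop, width = 0.0, 0.0
        if source_spikes and phase in FORWARD:
            drop += FORWARD[phase][0]
            width = max(width, FORWARD[phase][1])
        if sink_spikes and phase in BACK:
            drop += BACK[phase][0]
            width = max(width, BACK[phase][1])
        total += width * RATE * math.sinh(STEEPNESS * drop)
    return total


neither = conductance_change(False, False)
source_only = conductance_change(True, False)
both = conductance_change(True, True)
sink_only = conductance_change(False, True)
print(neither, source_only, both, sink_only)
```

Under these assumptions the source-only pulses cancel, the sink-only case gives a small net decrease, and the coincident case gives a large net increase dominated by the single 2V phase.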
Simulations
Application
To explore the sensitivity of the network to defects, device variation, noise, and so on, we simulate a classic problem in self-organized systems: the self-organization of an array of spatial filters sensitive to edges at various orientations. Such an array might be used as the first stage of a visual pattern matching system such as a character recognition system. The idea is to feed in a pixelated grayscale image (figure 7), find the edges in it, and determine the angular orientation and magnitude of each edge segment across the image. This is not a very exciting application by itself. But ever since von der Malsburg's first implementation (1973), this has become a standard first problem in self-organizing systems (analogous to writing the 'hello, world' program in a new programming language) and is therefore almost mandatory (Miller 1994, Olson and Grossberg 1998, Farkas and Miikkulainen 1999). It also makes a simple and convenient test bed for experimentation.
Network structure
The simulated network consists of a single layer of nodes, with center-surround recurrences, driven by a pair of transducer arrays that translate the pixels of an input image into sequences of spikes (figure 8). The arrays are shown separated in figure 8 for clarity; in an actual implementation, the photocell array, transducer arrays, and node array would be interleaved. All arrays are of size 16 × 16 and have their edges wrapped around in a torus to minimize boundary effects; this is a computational shortcut that compensates for the small number of nodes in the network (simulating the dynamics is computationally expensive, so larger networks require more processing power and patience).
The photocell array is used to present grayscale images to the transducer arrays. Each pixel in that array takes on a value in the range −1 (white) to 1 (black).
The two transducer arrays, called the 'on' array and 'off' array, implement spatial bandpass filters. Each cell of these two arrays computes a weighted average of pixel intensities in small circular, center-surround neighborhoods of the photocell array directly beneath it. The 'center' neighborhoods have a radius of 3, and the 'surround' neighborhoods have a radius of 6; the computed average for each 'on' cell is:
(In equation (12), the coefficients K_center and K_surround were chosen so that 'center' coefficients summed to 1, and 'surround' coefficients summed to −1.) The average for an 'off' cell is the negative of the corresponding 'on' cell average. Negative averages are clipped at zero for both types of cells to determine their 'activity.' In each cycle, the 'on' and 'off' cells emit spikes to the nodes in the cell array with probability 1/(1 + e^(−(activity − 7))).
'On' cells spike most strongly when they see a black region in their center surrounded by white; 'off' cells respond most strongly to white centers surrounded by black.
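The transducer behavior described above can be sketched as follows. The sketch assumes uniform coefficients within each circular neighborhood (so 'center' coefficients sum to 1 and 'surround' coefficients sum to −1), and assumes the surround is a full disk of radius 6 that includes the center region; the paper's actual coefficients may differ. Function names and the test image are illustrative.

```python
import math
from typing import List, Tuple

SIZE = 16                       # arrays are 16 x 16, wrapped into a torus
R_CENTER, R_SURROUND = 3, 6     # neighborhood radii from the text


def disk_mean(image: List[List[float]], row: int, col: int, radius: int) -> float:
    """Unweighted mean over a circular neighborhood, with toroidal wrap."""
    values = []
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            if dr * dr + dc * dc <= radius * radius:
                values.append(image[(row + dr) % SIZE][(col + dc) % SIZE])
    return sum(values) / len(values)


def activities(image: List[List[float]], row: int, col: int) -> Tuple[float, float]:
    """Return ('on', 'off') activities, each clipped at zero."""
    average = (disk_mean(image, row, col, R_CENTER)
               - disk_mean(image, row, col, R_SURROUND))
    return max(0.0, average), max(0.0, -average)


def spike_probability(activity: float) -> float:
    """Logistic spike probability 1 / (1 + e^-(activity - 7))."""
    return 1.0 / (1.0 + math.exp(-(activity - 7.0)))


# Black disk (+1) of radius 2 on a white (-1) background, centered at (8, 8).
image = [[1.0 if (r - 8) ** 2 + (c - 8) ** 2 <= 4 else -1.0
          for c in range(SIZE)] for r in range(SIZE)]
on_center, off_center = activities(image, 8, 8)   # cell under the black disk
on_far, off_far = activities(image, 0, 0)         # cell in a uniform region
print(on_center, off_center, on_far, off_far, spike_probability(on_center))
```

A cell centered on the black disk has positive 'on' activity and zero 'off' activity, while a cell in a uniform region has zero activity for both, which is the bandpass behavior the text describes.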
The node array has a square, center-surround recurrence structure. Each node receives edge inputs from a 7 × 7 square neighborhood of cells in the 'on' and 'off' arrays, centered on the cells directly beneath it. These two sets of edges (a total of 98) are collectively called the 'receptive field' of the node. The node also excites its neighbors within a 3 × 3 square neighborhood, and inhibits neighbors within a 4 × 4 square neighborhood. Since lateral inhibitory inputs have only half the effect of excitatory lateral inputs ( f i and f e in figure 5), the net effect is excitation of the eight closest neighbors, and inhibition of the ring of neighbors surrounding those. Default network parameters are shown in table 1. 
Procedure
The network was initialized by setting all edges to their maximum weight and all node integrators to zero. Random 16 × 16 input images were constructed by randomly assigning pixel values from a Gaussian distribution, and then smoothing the toroidal image (to approximate the Gaussian random fields of Bartsch and van Hemmen (2001) ) by convolution with the kernel 18 exp(
2 ), where d is distance measured in units of pixels. The network was clocked for 5 000 000 cycles, with a new random image presented every nine cycles.
To evaluate the network's ability to detect oriented edges, we constructed a set of 24 half-black/half-white, oriented edge images, $I^{\theta}$, with orientations, θ, uniformly distributed over (0, 2π). The pixels, $I^{\theta}_{ij}$, in each image were either white (−1), black (+1), or appropriately interpolated if they intersected the oriented edge.
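One possible way to build such a test set is sketched below. The interpolation rule is an assumption (a linear ramp across a one-pixel-wide band around the edge), as are the image size argument, the convention that θ gives the direction of the edge normal, and the function names.

```python
import math


def edge_image(theta, size):
    """Half-black/half-white image with black = +1 and white = -1.

    ASSUMPTIONS: the edge passes through the image center, theta is
    the direction of the edge normal (pointing into the black half),
    and pixels within half a pixel of the edge are linearly
    interpolated between -1 and +1.
    """
    c = (size - 1) / 2.0
    nx, ny = math.cos(theta), math.sin(theta)
    image = []
    for i in range(size):
        row = []
        for j in range(size):
            d = (j - c) * nx + (i - c) * ny  # signed distance to the edge
            row.append(max(-1.0, min(1.0, 2.0 * d)))
        image.append(row)
    return image


def edge_image_set(size, count=24):
    """Return (thetas, images) for `count` evenly spaced orientations."""
    thetas = [2.0 * math.pi * k / count for k in range(count)]
    return thetas, [edge_image(t, size) for t in thetas]
```

Because orientations span the full circle, the images at θ and θ + π are negatives of each other, so the set distinguishes edge polarity as well as edge angle.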
The response, R, of a node to the image $I^{\theta}$ was computed from the weights of the forward edges within that node's receptive field:
where the operator $[\cdot]^+$ is equal to its argument for positive values and zero for negative values. This is not, strictly speaking, an accurate measure of a node's response to the image, but is an approximation that is commonly used (for example, see Olson and Grossberg (1998)).

Figure 9. Left: receptive fields of network after 5 000 000 cycles of evolution. Right: network selectivity, equation (15), as a function of time.
To evaluate the quality of the edge detectors in the presence of defects and device variation, we use a 'selectivity' metric which is often used to measure orientation filter performance. Following Swindale (1998), we calculated the components
The preferred orientation, ϕ, for a node can then be calculated as
and orientation selectivity, $S_i$, for node $n_i$ as
Network selectivity was then calculated as the average orientation selectivity for all nodes in the network:
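As a hedged illustration of how a metric of this kind can be computed, the sketch below uses a standard vector-averaging form: the responses are summed as vectors over the test orientations, the angle of the resultant gives the preferred orientation, and its length divided by the total response gives the selectivity. This form is an assumption chosen for the example and may differ in detail from equations (13)-(15); the function names are hypothetical.

```python
import math


def orientation_stats(thetas, responses):
    """Return (preferred_orientation, selectivity) for one node.

    ASSUMPTION: vector averaging over the full circle, i.e.
    a = sum R cos(theta), b = sum R sin(theta),
    preferred = atan2(b, a), selectivity = |(a, b)| / sum R.
    """
    a = sum(r * math.cos(t) for t, r in zip(thetas, responses))
    b = sum(r * math.sin(t) for t, r in zip(thetas, responses))
    total = sum(responses)
    preferred = math.atan2(b, a) % (2.0 * math.pi)
    selectivity = math.hypot(a, b) / total if total > 0 else 0.0
    return preferred, selectivity


def network_selectivity(thetas, responses_per_node):
    """Average the per-node selectivities over all nodes."""
    values = [orientation_stats(thetas, r)[1] for r in responses_per_node]
    return sum(values) / len(values)
```

Under this form, a node that responds equally to every orientation has selectivity near 0, and a node that responds to a single orientation has selectivity 1, so the network average lies between 0 and 1 for non-negative responses.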
Simulation issues
The network defined in the previous section contains 256 nodes and 31 488 edges. Since the spike-based implementation of the network approximates the differential equations (5)-(7), one could simulate the network behavior by simply integrating the 31 744 coupled differential equations for the network using, for example, a fourth-order Runge-Kutta integrator with adaptive step sizes (Press et al 1992). In practice this is difficult because the equations are 'stiff,' meaning that the system contains equations that react on widely different timescales: the fast nodes and the slow edges. Conventional integrators, such as Runge-Kutta, are notoriously slow on stiff systems, because of the need to take tiny steps to obtain acceptable accuracy. Specialized stiff integrators, such as the semi-implicit extrapolation method (Press et al 1992), can be much faster on small systems, but do not scale well since they require multiple inversions of the Jacobian matrix, each of which requires $O(n^3)$ operations and storage of $O(n^2)$ matrix elements. Since n = 31 744 for this system, such integrators are not feasible here.
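To make the scaling argument concrete, the short calculation below estimates the size of a dense Jacobian for this system. It assumes dense storage with 8-byte floating-point elements and treats n³ only as an order-of-magnitude operation count; both are assumptions for illustration, and the function name is hypothetical.

```python
def dense_jacobian_cost(nodes, edges, bytes_per_element=8):
    """Order-of-magnitude cost of one dense Jacobian inversion.

    ASSUMPTION: dense storage and an n**3 operation count with no
    constant factor, so the figures are rough estimates only.
    """
    n = nodes + edges
    elements = n ** 2
    return {
        "n": n,
        "elements": elements,
        "bytes": elements * bytes_per_element,
        "operations": n ** 3,
    }


cost = dense_jacobian_cost(nodes=256, edges=31_488)
print(cost["n"])                # 31744 state variables
print(cost["elements"])         # 1007681536 matrix elements
print(cost["bytes"] / 1e9)      # roughly 8 GB of storage
print(cost["operations"] / 1e13)  # roughly 3.2e13 operations per inversion
```

Under these assumptions a single inversion already requires on the order of a billion stored elements, and the method needs several inversions per step.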
So we are forced to a simpler scheme, similar to that described by Gear and Kevrekidis (2002) and often used for stiff systems. The idea is to alternate integration steps of the fast nodes and the slow edges. When integrating nodes, the edge states are held fixed and node states updated by accumulating incoming spikes and (leakily) integrating them using equation (11). When integrating edges, the node states are held fixed and the edge states updated using a forward Euler approximation of equation (6). This is simple to implement and greatly speeds up the simulation, at the cost of some loss of accuracy.
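The alternating scheme can be sketched on a toy network as follows. The node and edge dynamics used here are placeholders, not equations (11) and (6): a simple leaky integrator stands in for the node update, and a Hebbian-like rule with decay stands in for the edge update. The update ratio `fast_per_slow`, the input spike probability, and all names and constants are likewise assumptions.

```python
import random


def step_nodes(nodes, weights, spikes, leak=0.1):
    """Fast step: edges held fixed, nodes leakily integrate weighted spikes.

    ASSUMPTION: placeholder dynamics standing in for equation (11).
    """
    return [(1.0 - leak) * v + sum(w * s for w, s in zip(row, spikes))
            for v, row in zip(nodes, weights)]


def step_edges(weights, nodes, spikes, rate=1e-3, dt=1.0,
               w_min=0.0, w_max=1.0):
    """Slow step: nodes held fixed, edges advance by one forward Euler step.

    ASSUMPTION: dw/dt = rate * (pre * post - w) is a placeholder
    standing in for equation (6).
    """
    new_weights = []
    for post, row in zip(nodes, weights):
        new_row = []
        for pre, w in zip(spikes, row):
            dw = rate * (pre * post - w)
            new_row.append(min(w_max, max(w_min, w + dt * dw)))
        new_weights.append(new_row)
    return new_weights


def simulate(num_nodes=4, num_inputs=6, cycles=200, fast_per_slow=10, seed=0):
    """Alternate several fast node steps with one slow edge step."""
    rng = random.Random(seed)
    weights = [[1.0] * num_inputs for _ in range(num_nodes)]  # maximum weight
    nodes = [0.0] * num_nodes                                 # integrators at zero
    for cycle in range(cycles):
        spikes = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(num_inputs)]
        nodes = step_nodes(nodes, weights, spikes)
        if (cycle + 1) % fast_per_slow == 0:
            weights = step_edges(weights, nodes, spikes)
    return nodes, weights
```

The point of the structure is that neither step needs a Jacobian: each update touches every state variable once, so the cost per cycle grows linearly with the number of nodes and edges.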
Results
The receptive fields of a defect-free and device-invariant network clocked for 5 000 000 cycles are shown in figure 9. To visualize the receptive fields of each node, we consider the weights of the receptive field edges and compute a pixel value for each pair of edges as
In all receptive field images we then show $\text{pixel}_{ij}$ mapped on a continuous grayscale spanning −1 (white), 0 (gray), and 1 (black).
The evolution of the lower right quadrant of the network is shown in figure 10. Here we can see the emergence of edge orientation filters, and the tendency of the network to drive the edge weights into a strongly bimodal distribution.
The results of simulations of network sensitivity to defects and device variation are shown in figure 11. In these experiments we compared networks using the 'selectivity' metric, equation (15).
Figure 10. Network evolution. Receptive fields develop oriented edge detectors as edge weights are driven to a strongly bimodal distribution. The length of the line segment shown in the central column for each node corresponds to the node's selectivity (equation (14)), while its direction corresponds to the node's orientation (equation (13)). Network selectivity at the right shows the value for equation (15).

The first set of simulations addressed edge defects. A 'stuck open' edge had a fixed weight of 0; a 'stuck closed' edge had a fixed weight of w max . In these experiments, edges were randomly selected to be stuck open (or stuck closed) with different probabilities (0, 0.05, 0.1, ...). As can be seen from the bar graphs, network selectivity degrades gently as defect rates increase, with a greater sensitivity to stuck closed defects.
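A minimal sketch of this kind of defect injection is given below. It assumes that each edge is selected independently with the stated probability and that stuck edges are re-pinned to their fixed weight after every learning update; the function names are hypothetical.

```python
import random


def choose_stuck_edges(num_edges, probability, rng):
    """Pick the set of defective edge indices.

    ASSUMPTION: each edge is stuck independently with `probability`.
    """
    return {e for e in range(num_edges) if rng.random() < probability}


def enforce_defects(weights, stuck, stuck_value):
    """Pin stuck edges to their fixed weight, leaving other edges untouched.

    Use stuck_value = 0 for 'stuck open' and w_max for 'stuck closed'.
    Intended to be called after every edge update so that learning
    cannot move a defective edge.
    """
    return [stuck_value if e in stuck else w for e, w in enumerate(weights)]


rng = random.Random(1)
stuck_open = choose_stuck_edges(num_edges=31_488, probability=0.05, rng=rng)
weights = enforce_defects([1.0] * 31_488, stuck_open, stuck_value=0.0)
```

Keeping the defect set separate from the weight list makes it easy to sweep the defect probability while reusing the same simulation loop.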
The second set of simulations explored sensitivity to edge weight variation. The first simulation fixed w max at 1.0 and varied w min to achieve 'on/off' ratios (w max /w min ) varying from 5 to 1000. Although larger ratios yield higher network selectivities, the results show rapidly diminishing returns for ratios larger than about 20, and suggest it might be possible to build functional networks even with low-ratio devices. The second simulation studied the impact of on/off ratio variation, with ratios uniformly distributed over different ranges (e.g. 5-20). The network was surprisingly insensitive to this variation. In the final simulation of the set, edges were assigned values for w max uniformly distributed in the interval (1 − factor, 1 + factor). Again, the network was quite insensitive to variation.
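The per-edge sampling behind these variation experiments might look like the sketch below. It combines the on/off ratio variation and the w max variation in a single hypothetical helper for brevity, whereas the text describes them as separate simulations; setting `wmax_factor=0` gives ratio variation alone. The derivation of w min as w max divided by the sampled ratio is an assumption.

```python
import random


def sample_edge_limits(num_edges, ratio_range=(5.0, 20.0),
                       wmax_factor=0.0, seed=0):
    """Return a list of (w_min, w_max) pairs, one per edge.

    ASSUMPTIONS: w_max is uniform in (1 - factor, 1 + factor), the
    on/off ratio is uniform over `ratio_range`, and
    w_min = w_max / ratio.
    """
    rng = random.Random(seed)
    limits = []
    for _ in range(num_edges):
        w_max = rng.uniform(1.0 - wmax_factor, 1.0 + wmax_factor)
        ratio = rng.uniform(*ratio_range)
        limits.append((w_max / ratio, w_max))
    return limits


# Ratio variation only (w_max fixed at 1.0), ratios spread over 5-20.
ratio_only = sample_edge_limits(1000, ratio_range=(5.0, 20.0))

# w_max variation only, with a fixed on/off ratio of 20.
wmax_only = sample_edge_limits(1000, ratio_range=(20.0, 20.0), wmax_factor=0.3)
```

Each edge then clips its weight to its own interval during learning, so device-to-device variation enters the simulation only through these per-edge limits.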
The last set of simulations explored sensitivity to variations in device learning rates. The 'Hebbian' experiments randomly varied the correlational learning coefficient from table 1, w( f − b+), uniformly over the range 1.5 × (1 − factor, 1 + factor). The 'decay' simulations varied the 'pulse' coefficient, w( f +), similarly. (The remaining coefficients, w( f −), w(b+) and w(b−), maintained the ratios to w( f +) shown in table 1 for each device.) Increasing learning rate variations gently degraded network performance.
Related work
The literature on nanoscale defects and their impact on computational architectures is already quite vast (Graham and Gokhale 2004, Mishra and Goldstein 2004, Schmid and Leblebici 2003). Device crumminess has led many researchers to propose alternatives to Boolean logic as a computational paradigm at the nanoscale (Beckett and Jennings 2002, Bandyopadhyay et al 2002), including some hybrid digital/analog schemes similar to the one proposed in this paper (Sarpeshkar and O'Halloran 2002, Beiu 2004). Fault tolerance has also been extensively studied in the context of neural networks (Piuri 2001, Distante et al 1991, Neti et al 1992).
The work most similar to this, and the most fully developed, is the CMOL/CrossNet architecture developed over several years at Stony Brook by the group led by Konstantin Likharev (Folling et al 2001, Turel et al 2004, Lee and Likharev 2007, Ma and Likharev 2007). The CrossNet architecture is primarily a feedforward neural network with neurons implemented in CMOS and synapses in bistable nanocrossbar switches (footnote 5). The CMOS interfaces to the nanowire crossbars through small 'pins' (with nanoscale-sharp tips) distributed uniformly over the CMOS surface, providing direct addressability to each nanowire. Each synapse is implemented by an N × N array of nanoswitches (where N is usually about 4) to form a resistor network capable, through suitable programming of the switches, of implementing approximately $N^2$ different weights through resistive current summing. They have studied the issues of defect tolerance of stuck open switches, and of programming the network, using both weight-import and in situ learning.
From a manufacturing viewpoint, the CMOL approach presents some interesting process challenges; for a discussion of these, see Snider and Williams (2007). CrossNet circuits might also be difficult to program: the most successful scheme found requires an external measurement of each weight, including the effects of bad switches (Lee and Likharev 2007). This raises the scalability concerns discussed in section 2. We agree with Lee and Likharev that in situ training is required for a scalable architecture. It is unfortunate that initial results of in situ training simulations are not encouraging: on the MNIST benchmark of handwritten digit recognition, the best CrossNet network was only capable of achieving a classification error of about 3.5%. State-of-the-art for MNIST is about 0.4% (for example, see Simard et al (2003)). It is also not clear from the CrossNet papers if device variation, device death, and nanowire resistance variation have been taken into account.

Footnote 5: Recurrent networks have also been proposed for CMOL.
The approach of this paper is clearly not as developed as CMOL/CrossNets (we have not yet been able to simulate MNIST, for example), but it starts from a somewhat different philosophical stance. We consider uncertainty and scalability to be the primary problems of the nanoscale, and are therefore excluding paradigms which cannot handle both. It is not clear that traditional feedforward neural networks are any better matched to the problems of the nanoscale than Boolean logic, and making them robust against overfitting through regularization (e.g. weight decay or early stopping) presents even more challenges. These considerations are leading us in a somewhat different direction.
Discussion
The architecture presented in this paper is intended to be suggestive, not prescriptive, demonstrating a concept rather than a mature strategy. The simulated network self-organizes and is robust, but its learning is not structurally stable (footnote 6): changing the statistics of its input images would cause its edge weights to continue to evolve. Additional complexity is required for structural stability (see Carpenter and Grossberg 1991, Kosko 1991). Still, the basic principles discussed here should apply, and the symmetry of recurrent networks should allow for a regular implementation of crossing wires to implement edges. In addition, the 'spike'-based communication allows a designer to make trade-offs between computation rate and power expenditure, which would be difficult to do in a network using analog communication.
There are a number of shortcomings and potential sources of error in this study.
DC offsets in node terminals were not addressed, even though they would be inevitable in any implementation. The edges were assumed to be time-invariant, although irreversible atomic or molecular configuration changes might occur over time. Nodes were assumed identical, even though variations in, for example, integration time constants and spiking thresholds would certainly occur. Wire capacitance would distort the pulse shapes, which would have an unknown impact on learning stability. Noise was not studied in detail here (although the stochastic nature of the transducer cells could be considered 'noise'). These and other issues require more study.
The main contribution of this paper is a simple schema for building robust, self-organizing networks out of crummy, memristive nanodevices that may, in fact, be cheap to manufacture, and that might be integrated with existing semiconductor processes. Pulse-based communication allows the designer to make trade-offs between power consumption and processing speed.
Self-organization sidesteps the scalability issues of characterization, compilation and configuration. Network dynamics supplies a graceful response to device death. The network described here is a building block, or stepping-stone, to more complex (but cheap and energy efficient) networks that might be built to tackle complex pattern recognition problems that are beyond the reach of today's processors.
