Abstract-We describe asynchronous circuits that can relay spikes between multiple chips in a grid. These circuits interface with an on-chip SRAM to implement programmable connectivity among chips. We introduce a packet format that is compatible with updating the SRAM. From a high-level specification, we synthesized and fabricated these circuits in an area of 0.206 mm² in 0.18-µm CMOS technology. Test results that measure performance and demonstrate correct function on first silicon are presented.
I. LAYERED NEUROMORPHIC SYSTEMS
Neuromorphic engineers are building single-chip systems with ever-increasing complexity, at both the neuronal and microcircuit levels. At the neuronal level, the conductance-based neuron is displacing the integrate-and-fire one [1]-[3]. At the microcircuit level, the canonical microcircuit (stereotyped local connectivity among specialized cell types) is displacing the free-standing all-purpose neuron [4, 5]. This trend has capped the number of pixels and bands in neuromorphic vision and auditory systems, which, despite ongoing miniaturization, are bumping up against the single-chip limit.
Faced with the problem of finite area, the neocortex took advantage of the third dimension, organizing its various neuronal types into six layers and transforming its canonical microcircuit into a columnar circuit [4]. Inspired by this architecture, neuromorphic engineers have looked to 3D microfabrication techniques [6, 7]. Though impressive systems have been built, 3D integration remains a niche solution, used mainly in memory devices [8], where structure is regular, functionality fixed, and bandwidth enormous; these features do not apply to neuromorphic systems.
Here we propose a solution that supports and exploits neuromorphic systems' unique features. On the one hand, bandwidth is discounted: bandwidth-per-neuron is a millionth of a wire's. This favors packet-switching, where wires are shared by numerous users (as in the Internet), over circuit-switching, where a wire is dedicated to a single pair of users (as in the archaic phone system) [9, 10]. On the other hand, flexibility is priced: functionality comes from reconfiguring connections (through developmental and learning mechanisms). This also favors packet-switching: packets may be rerouted on the fly simply by relabelling them [ 1] whereas circuits must be set up in advance.
In our multichip architecture, layers of different neurons reside on different copies of the same chip (differentiated by electronically adjustable parameters) and the columnar circuit (CC) spans corresponding locations on these chips (Fig. 1). To restrict the size of the look-up table that specifies its connectivity, we assume the CC is translation-invariant; thus there are only as many entries as there are chips. If there is a connection, the entry also specifies its properties, physiological (e.g., excitatory or inhibitory) or, in the case of a multicompartment neuron, anatomical (e.g., basal or apical). In this way, any desired CC may be realized simply by programming the appropriate bits into a small SRAM that fits snugly on the periphery of the chip.
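As an illustration of the translation-invariant look-up table, the following is a minimal sketch in Python, not the on-chip SRAM layout. It assumes the table is consulted on the receiving chip and indexed by the source chip's address; the names Entry, look_up, and TABLE, and the string labels for connection properties, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    connected: bool  # does this source layer project to the local layer?
    kind: str        # hypothetical property label, e.g. "excitatory"


def look_up(table: Dict[int, Entry], source_chip: int,
            row: int, col: int) -> Optional[Tuple[int, int, str]]:
    """Return (row, col, kind) if the source chip connects here, else None."""
    entry = table.get(source_chip)
    if entry is None or not entry.connected:
        return None
    # Translation invariance: the entry does not depend on row or col,
    # so one entry per chip covers every location in the array.
    return (row, col, entry.kind)


# Hypothetical three-entry table, as seen by one receiving chip.
TABLE = {
    1: Entry(connected=True, kind="excitatory"),
    2: Entry(connected=True, kind="inhibitory"),
    3: Entry(connected=False, kind=""),
}
```

Because the same entry serves every (row, col) location, the table size grows with the number of chips rather than with the number of neurons, which is the property the text relies on to keep the SRAM small.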
Section II describes our architecture's connectivity network. Section III presents our network's high-level specification and Section IV decomposes this into smaller, concurrent processes. Section V presents synthesized circuits and Section VI presents test results.
II. GRID NETWORKS
Capitalizing on existing chip-to-chip links [12], grids were developed to broadcast a packet consisting of a neuron's chip, row and column addresses whenever it spikes; these packets are relayed from chip to chip [13]. A packet may have multiple column addresses (following the chip and row address) corresponding to multiple spikes read from the same row. More recent work added the capability to selectively deliver or filter a packet based on its chip address; a particular chip may be either targeted or excluded [14]. The excluded mode provides the functionality we desire for broadcasting spikes; we do not wish to send them back to the same chip. The targeted mode provides the functionality we desire for programming; we wish to write a byte to a particular SRAM on a particular chip.
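The two delivery modes can be sketched in Python as follows. This is an illustrative model only, under several assumptions not stated in the text: chips sit on a hypothetical one-dimensional chain, the packet's relative chip address (called head here) is decremented once per hop, and the marked chip is the one where head reaches zero. The names Packet, spikes, and delivery_hops are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Packet:
    head: int        # relative chip address; assumed to decrement per hop
    row: int         # row address shared by all spikes in the packet
    cols: List[int]  # one or more column addresses read from that row
    targeted: bool   # True: targeted mode; False: excluded mode


def spikes(packet: Packet) -> List[Tuple[int, int]]:
    """Expand a packet into the (row, col) spikes it carries."""
    return [(packet.row, col) for col in packet.cols]


def delivery_hops(packet: Packet, n_hops: int) -> List[int]:
    """Return the hop indices (1-based) at which the packet is delivered."""
    delivered = []
    head = packet.head
    for hop in range(1, n_hops + 1):
        head -= 1                      # relative address updated at each hop
        at_marked_chip = head == 0     # assumed marker for the chosen chip
        # Targeted: deliver only at the marked chip.
        # Excluded: deliver everywhere except the marked chip.
        if at_marked_chip == packet.targeted:
            delivered.append(hop)
    return delivered
```

In this model, a targeted packet with head = 3 relayed over four hops is delivered only at the third hop, while an excluded packet with head = 4 is delivered at hops one to three and filtered at the fourth.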
Targeted and excluded delivery capitalize on the relative addressing scheme grids use. To support relative addressing, a packet's chip address is incremented or decremented as it hops from chip to chip. Thus, we can arrange for an overflow or underflow to occur at a particular chip by modifying the
1-4244-0173-9/06/$20.00 ©2006 IEEE.
[14]. We describe the T process using CHP (Communicating Hardware Processes) notation [15] (see Table I):
where denotes complement, k is a local byte, and t is a packet. . . We have decomposed the programmable T into eight concurrent processes. In the next section, we synthesize these into circuits. In this step, we also choose the data-encoding scheme. We use 1-of-4 (Table II) instead of bundled-data (binary) encoding because we want our circuits to be delay-insensitive. In 1-of-4, the data is valid if any one of a set of four wires is driven; otherwise it is neutral [15]. This validity functions as the request; bundled-data requires a separate signal for this purpose. A 4-input OR checks for validity or neutrality in a single 1-of-4 set, whereas a datapath with M sets requires M 4-input ORs that feed into a tree composed of M-1 C-elements (VN in Fig. 3a). We name this validity signal va for port A. Since 1-of-4 encodes two bits, we use 1-of-2 code (dual-rail) when we want to encode a single bit. For bit b, we label these two signals b.0 (b = 0) and b.1 (b = 1).
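The validity detection described above can be modeled behaviorally in Python. This is a sketch under stated assumptions rather than the synthesized circuit: the mapping from a two-bit value to a wire index is assumed (Table II is the authoritative encoding), the C-element is modeled as a two-input state-holding gate, and the names encode_1of4, set_valid, CElement, and ValidityTree are hypothetical.

```python
from typing import List, Sequence, Tuple

NEUTRAL: Tuple[int, int, int, int] = (0, 0, 0, 0)


def encode_1of4(value: int) -> Tuple[int, ...]:
    """Drive exactly one of four wires for a 2-bit value (assumed mapping)."""
    if not 0 <= value <= 3:
        raise ValueError("1-of-4 encodes two bits: value must be 0..3")
    return tuple(1 if i == value else 0 for i in range(4))


def set_valid(wires: Sequence[int]) -> int:
    """4-input OR: 1 if any wire in the set is driven, 0 if neutral."""
    return int(any(wires))


class CElement:
    """Two-input C-element: output follows the inputs when they agree,
    and holds its previous value when they differ."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = a
        return self.out


class ValidityTree:
    """M 4-input ORs feeding a tree of M-1 C-elements."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.cells = [CElement() for _ in range(m - 1)]

    def update(self, sets: Sequence[Sequence[int]]) -> int:
        if len(sets) != self.m:
            raise ValueError("expected one 1-of-4 set per datapath slot")
        signals: List[int] = [set_valid(s) for s in sets]
        cells = iter(self.cells)  # same cell order on every call
        while len(signals) > 1:
            merged = []
            for i in range(0, len(signals) - 1, 2):
                merged.append(next(cells).update(signals[i], signals[i + 1]))
            if len(signals) % 2:
                merged.append(signals[-1])  # odd signal passes up a level
            signals = merged
        return signals[0]
```

In this model the output rises only once every set is valid and falls only once every set is neutral; in between it holds its value, which is the behavior a validity signal such as va needs in order to serve as a request.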
V. CIRCUIT SYNTHESIS
We begin with DCTL, which feeds DEC a true borrow-bit on B after it detects X on F. DCTL In the first sequence, when MADR asserts fi, FILTER signals SWITCH (bo+) to supply data to SRAM. MADR releases fi when the write is complete. In the second sequence, ei signals FILTER to find X (be.1); the third process receives words otherwise.
We implement SEND's C and D communications with SWITCH and RECEIVER as a triangular communication, with both ports active (see Fig. 3a). APP is implemented as combinational logic. SEND relays RECEIVER's acknowledge to SRAM through MRD. SEND's HSE:
The first sequence waits for ji from SRAM before it signals SWITCH (co+) to supply the RECEIVER with the first word (row-address); it acknowledges MRD when X appears (ce.1). The second sequence signals SWITCH to supply words to RECEIVER (column-addresses) otherwise.
Once we have HSE sequences, the final step is to compile them into production-rule sets (PRS), which are straightforward to implement with CMOS transistors [15]. Due to space constraints, we show only the synthesized CMOS circuits (see Fig. 4). In the next section we present test results that validate functionality and measure performance.
VI. TEST RESULTS
We present test results from a full-custom chip fabricated in 0.18-µm CMOS technology. This chip has a die area of 29.59 mm² (6.19 mm × 4.78 mm) with the programmable T occupying 0.697% of that area (Fig. 5). The array comprises 4 × 320 × 240 pixels: a group of 4 pixels corresponds to RGBY color video output (half VGA resolution). Two bits from the SRAM (256 × 16-bit) determine which of these four pixels is targeted. Each pixel is a pulse-extender configured to function as either a sample-and-hold or a leaky integrator.
In testing the chip, we are concerned with performance and functionality. One measure of performance is burst rate, the rate at which a chip receives words within a packet. We measured the maximum burst rate to be 62.9 MHz (Fig. 6), a 38.2% improvement upon the previous best of 45.5 MHz (0.25-µm CMOS) [14]. We tested functionality by simulating activity from virtual chips with a computer through a USB (Universal Serial Bus) link using the packet format described in Fig. 7. When the array receives an address-event, the pixel with the corresponding address outputs a current. By measuring the response of a single pixel, we successfully demonstrated programming, filtering, and delivery (Fig. 8).
Fig. 7. Programming and address-event packet formats. Both packets comprise four words; the LSB is the tail bit. Programming: Head is set so that its value is zero at the target chip. Words one and two are address and data. Packets are delivered to the on-chip receiver if K is set, with AP appended. Address-event: Head contains the relative address of the packet's source chip. Words one and two are the row and column-addresses; there can be multiple column-addresses.
Fig. 6. Logic analyzer traces demonstrating maximum burst rate of 62.9 MHz. The receiver acknowledges when the data is valid and releases the acknowledge when the data is neutral. The bits are arranged from MSB to LSB (from set 4 to set 0 and left to right within the set).
systems that expand beyond the single-chip limit.
