The Von Neumann architecture, defined by a strict and hierarchical separation of memory and processor, has been a hallmark of conventional computer design since the 1940s. It is becoming increasingly unsuitable for cognitive applications, which require massive parallel processing of highly interdependent data. Inspired by the brain, we propose a significantly different architecture characterized by a large number of highly interconnected simple processors intertwined with very large amounts of low-latency memory. We contend that this memory-centric architecture can be realized using 3D wafer scale integration, for which the technology is nearing readiness, combined with current CMOS device technologies. The natural fault tolerance and lower power requirements of neuromorphic processing make 3D wafer stacking particularly attractive. In order to assess the performance of this architecture, we propose a specific embodiment of a neuronal system using 3D wafer scale integration; formulate a simple model of brain connectivity including short- and long-range connections; and estimate the memory, bandwidth, latency, and power requirements of the system using the connectivity model. We find that 3D wafer scale integration, combined with technologies nearing readiness, offers the potential for scaleup to a primate-scale brain, while further scaleup to a human-scale brain would require significant additional innovations.
INTRODUCTION
Fueled by the explosion in the Internet of Things and social media, the sheer amount of data in the world today is growing at a tremendous pace [Kelly and Hamm 2013]. The bulk of new information being created takes the form of unstructured data (e.g., images, videos, text, news feeds, spatio-temporal trends), and computing systems are increasingly being called upon to do insightful and intelligent analysis very different from conventional transactional processing. An added attraction of the approach proposed here is that the underlying device technology is mature (not even dependent on the latest technology node).
Because we are interested in a system-level view of brain functionality for this highly scaled-up system, we will illustrate our ideas using a cortical algorithm, the Hierarchical Temporal Memory (HTM) model [Hawkins and Blakeslee 2004]. Nonetheless, many features of the architecture could potentially be used to scale up other neuronal systems based on different approaches. For example, the SpiNNaker project [Furber et al. 2014] is based on a massive network of highly interconnected parallel processors with a communications infrastructure optimized for delivery of small data packets. A second example is the TrueNorth architecture [Merolla et al. 2014], in which each chip contains 4096 interconnected neurosynaptic cores. Note that both of these approaches are based on spiking neural networks (SNNs) [Maass 1997], in contrast to the HTM model. However, all of these architectures share features that are well suited to scaling up using 3D-WSI: a large network of processing cores (currently realized using digital CMOS), a large aggregate memory bandwidth, and message passing between cores supported by a strong communications infrastructure.
Finally, we note that each of these methods is based on the idea of event-driven simulation, meaning that information is sent only when an event occurs, such as a neuron spiking (in an SNN) or a cell becoming active (in HTM). The energy consumption of event communication depends fundamentally on the sparsity of such events, not on the underlying model that triggered the event, and the information is transmitted as a message packet, not a biological spike. Note that if the message to be transmitted also contains numerical data, then representing it as a long series of spikes is not very energy efficient, as it requires charging and discharging an electrical wire multiple times to represent the numerical data accurately.
The purpose of this article is to carry out a feasibility study of 3D-WSI for realizing a neuronal system and to outline both its strong advantages and its significant challenges. The main contributions of this article are as follows:
- Presentation of a principal embodiment of a neuronal system using 3D-WSI (Section 2)
- Development of a connectivity model of brain-like function to assess the performance of the system (Section 3)
- Study of the scaling behavior as the neuron count is increased to a level comparable to that of the human brain (Section 4)
We then discuss the routing and fault tolerance (Section 5), and conclude with a discussion of our findings. Table I gives a list of symbols used in this work.
3D WAFER-SCALE INTEGRATION

Suitability of 3D Wafer-Scale Integration for Neuromorphic Computing
Wafer Scale Integration refers to the fabrication of an entire system on a wafer which is not diced into individual dies. Individual fields on a wafer are connected together by a metallization level that stitches across the reticle boundaries. If two such wafers are bonded together, the circuits on the top stratum can be connected to the ones below using Through Stratum Vias (TSVs), which can be made very dense by thinning the top stratum. Additional wafers can continue to be thinned and bonded in succession, with each stratum connected to the one below by a network of TSVs. Although 3D Wafer Scale Integration (3D-WSI) offers the potential for a massively parallel, highly interconnected system in a compact form factor, past attempts were unsuccessful; the key enabling technologies, such as wafer bonding and thinning [Skordas et al. 2011] and submicron overlay tolerance, have emerged only recently. It is projected that 1 μm vias, with a pitch of approximately 2-2.5 μm and better than 10% overlay tolerance, will be possible in a high-volume manufacturing environment within three years, giving a density of about 200,000 vias/mm². Moreover, any application of 3D-WSI has two key requirements. First, the yield of the individual dies must be very high. Second, 3D-WSI is mainly suited to low-power applications, since the heat must be removed from the bonded wafer package. Both of these requirements stand in sharp contrast to the tendency toward increasing speed and chip complexity characteristic of past technology scaling.
However, as initially pointed out by Mead [1990], 3D-WSI is very well suited for neuromorphic computing. Rather than trying to build a single, centralized processor, 3D-WSI enables realization of a large network of small processing units, interconnected with each other and with low-latency memory at very high bandwidth. Thousands of such small processors can be realized on a single wafer. Because the processor design is vastly simplified (<4 cores), the expected yield is much higher than for enterprise server chips. Nonetheless, as discussed in greater detail later, fault tolerance and repair techniques are a crucial aspect of the design, because some defects in the processing units, their memory units, and the interconnects are inevitable in such a large system. Indeed, the biggest reason why 3D-WSI is so well suited to a neuromorphic computing system is that many neural algorithms, particularly those that allow connections to be flexibly formed and destroyed [Hawkins and Blakeslee 2004], are remarkably robust to defects (we will show a dramatic example of this), in stark contrast to transactional processing applications in which a single point of failure can be crippling. Finally, the overall power density is also greatly reduced, due both to the distribution of processing over the wafer (with simplified processor complexity) and to lower communication power (from much shorter wires).

Fig. 1. One possible embodiment of a neuronal system using 3D-WSI, consisting of separate logic and memory wafers bonded together and connected using high-density TSVs.
Before ending this section, we comment on the possibility of using a silicon carrier interposer with packaged chips as an alternate method of scaling up. There are two main issues with this approach: (1) within the interposer, the chip-to-chip interconnectivity is still limited by the packaging, whose features have scaled very slowly compared to the device features; (2) the 3D interconnectivity between interposers is highly limited, as each interposer would need to be mounted on a substrate such as a PCB, and the interposer-to-interposer communication would effectively become a board-to-board connection. While an interposer could be an interesting intermediate step toward full 3D-WSI, it does not allow the ultimate scalability afforded by removing the chip packaging entirely, and hence we choose to focus on 3D-WSI in this work.

Principal Embodiment

Figure 1 shows conceptually one embodiment of our idea. In this embodiment, a logic wafer is fabricated with a few thousand small processors whose function is to perform the memory-intensive primitive computations required by neuronal simulations. These processors are specialized and designed to accelerate a few highly specific tasks, for example:

- Multiply a vector or a matrix by a constant
- Multiply a matrix by a vector
- Determine whether the overlap of two vectors exceeds some threshold value
- Decode synaptic connection addresses stored in memory from a compressed representation

These examples include operations needed both in traditional machine learning and in machine intelligence algorithms such as HTM. For example, the first two are fundamental in back-propagation-type machine learning algorithms [LeCun et al. 2015], while the third is specific to HTM-type algorithms based on sparse distributed representations [Hawkins and Blakeslee 2004]. The last is useful in a very large system where addresses are stored in a compressed representation, as discussed later.

Each small processor, hereafter referred to as a node, has a sizeable private memory on the memory wafer which is directly connected to it. Since each node is responsible for storing information about a number of neurons m in its domain, a substantial fraction of its memory is dedicated to storage of all the synaptic connections made by each neuron in its domain. Each connection requires b_syn bits for both the connection address and some data about the connection, such as its permanence. Therefore, the minimum memory requirement for the m neurons in the node, if each has an average of s synaptic connections, is m × s × b_syn bits. For a neuronal system with M total neurons, the number of address bits required, in an unoptimized representation, is log₂ M. Note that the number of address bits (∼34 for the 20 billion neurons estimated in the human cortex [Pakkenberg and Gundersen 1997]) can significantly exceed the number of data bits (∼8), leading to a highly inefficient ratio of address to data bits and a bloated memory requirement. As described in the Appendix, an efficient addressing scheme can be applied which exploits the observation that the vast majority of synaptic connections in the brain are local to a given neuron. In the example given there for a 64-bit addressing scheme, a highly local address is compressed to 25 bits by using run-length encoding for all the leading zeros in the address, which can be optimally arranged due to the locality.
That case would correspond to a biological example of 75,000 neurons/mm³, where a neuron would have about 300,000 potential connections within a sphere of radius 1 mm, and the worst-case connection at the edge of the sphere, at coordinates (+23, −23, +23), would require 25 (rather than 64) bits. We find that, on average, about 22 address bits are required for this case, which is reasonable considering the 25-bit worst case. Note that this scheme still allows any neuron to connect to any other neuron, which is essential for long-distance, white-matter communication [Wen and Chklovskii 2005] and for a flexible number of synaptic connections per neuron [Hawkins and Ahmad 2015], but keeps the memory requirements tenable. Finally, we assume 8 data bits (a typical number for the permanence in an HTM-type algorithm), so b_syn ∼ 30.
We target an average value of s = 1000, based on Hawkins and Ahmad [2015], which finds that a number in this range is the minimum necessary for a neuron with distal synapses along a stretch of dendrite to act effectively as a local coincidence/pattern detector. Eventually we would like to grow this to an average s of 10,000, at the higher end of the commonly accepted range of 1000-10000 synapses per neuron in humans [Worobey et al. 2015], but we will use 1000 as a good starting point for the analysis. For b_syn = 30, we find that 1 GB of memory can accommodate about 250,000 neurons.
For the memory wafer, we begin our analysis with DRAM because of its technological maturity. Based on a current estimate of 0.2 Gb/mm² for DRAM density [TechInsights 2013], a 300 mm wafer could contain about 1 TB of memory, assuming a utilization factor of 80% for the wafer. Partitioning this into 1 GB sections, we could combine this memory wafer with a logic wafer containing 1000 nodes. Overall, this logic-memory wafer pair would have about 250 million neurons, about the number in a small mammal. In order to scale the system up, we could continue to bond multiple logic-memory wafer pairs together. However, since each node occupies only about 1 mm², we could reasonably put ∼7000 nodes on a 300 mm wafer (occupying ∼10% of the wafer area) and still leave significant area for the wiring and TSVs. This single logic wafer, with ∼7000 nodes, would then need to be paired with 7 DRAM wafers to provide the same 1 GB of memory per processor as before. Via blockage should not be an issue given the high via density of ∼200,000 vias/mm². To scale up even more, this stack of 8 wafers could be stacked with perhaps 1-2 other such stacks, but further stacking will likely be very challenging. As discussed later, further scaling will require innovations in memory density to reduce the number of memory wafers so that the total number of strata remains tenable (<30).
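To make the capacity arithmetic concrete, the following minimal Python sketch (ours, for illustration) reproduces the neurons-per-gigabyte and per-wafer figures; the 80% utilization and 0.2 Gb/mm² density are the values quoted above, and the 250,000 neurons per GB used in the text is a conservative rounding of the result.

```python
# Capacity arithmetic for one node and one memory wafer, using the
# figures quoted in the text (b_syn = 30 bits, s = 1000 synapses/neuron,
# 0.2 Gb/mm^2 DRAM on a 300 mm wafer at 80% utilization).
import math

b_syn, s = 30, 1000
bits_per_GB = 8 * 2**30
neurons_per_GB = bits_per_GB // (s * b_syn)
print(f"neurons per 1 GB of node memory: {neurons_per_GB:,}")  # ~286,000 (text rounds to 250,000)

wafer_area_mm2 = math.pi * 150**2       # 300 mm wafer
dram_Gb = wafer_area_mm2 * 0.8 * 0.2    # 80% utilization x 0.2 Gb/mm^2
print(f"DRAM per wafer: ~{dram_Gb / 8 / 1024:.1f} TB")  # ~1.4 TB, i.e., about 1 TB
```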
In contrast to a chip-to-chip connection, nodes on a wafer can be interconnected using a wide parallel bus of metal wires on the wafer, avoiding the multiplexing to a small number of input-output pins that becomes necessary on a chip for packaging purposes. The total bandwidth of each node is given by w × f_bus, assuming 1 bit per wire (unidirectional), where w is the number of input-output wires at the processor and f_bus is the clock frequency of the data bus, which must be chosen low enough to avoid skew in the parallel bit stream. For f_bus = 100 MHz and w = 2000 (half incoming and half outgoing), the outgoing bandwidth of each node is about 100 Gbps, and the total system bandwidth would be N times the per-node value, a very high value for N in the thousands.
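As a quick sanity check, the quoted per-node figure follows directly from w and f_bus (a one-line sketch, using the parameters above):

```python
# Outgoing per-node bus bandwidth: half of w = 2000 wires carry outgoing
# traffic at f_bus = 100 MHz, 1 bit per wire per cycle.
w, f_bus = 2000, 100e6
print(f"outgoing bandwidth per node: {(w / 2) * f_bus / 1e9:.0f} Gbps")  # 100 Gbps
```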
In addition to the high-bandwidth connections between processors, 3D stacking of the memory on top of the logic processors enables a wide, high-frequency memory bus interface. High-density TSVs can be used to directly connect each processor to its memory domain, alleviating the memory wall between processor and memory. This redesign of the memory hierarchy around a wide memory bus enabled by 3D stacking can lead to significant speedup and energy savings [Woo et al. 2010].
Stitching and Local/Express Lanes
For the nodes on a wafer to communicate with each other, they must be connected by metal lines that cross the field boundaries. This can be done by adding a final stitching layer between fields. In addition to local connections between adjacent nodes, wafer scale integration allows additional point-to-point connections to be made between distant nodes. For example, each node can also be connected to another node some fixed distance away in the x-direction, in the y-direction, and even in the z-direction. A network of these express connections greatly reduces the number of hops required to send a message from one node to a distant one. Although they are hard-wired, unlike the brain's flexible connections, they can mimic the brain's ability to reduce conduction delays through direct long-range connections, a key feature of its remarkable efficiency [Wen and Chklovskii 2005].
Current Status of 3D Wafer-Scale Integration
We end this section with a review of the state of the art in 3D-WSI. The key enabler of 3D-WSI is aggressive thinning of the wafer, which allows, through preservation of the TSV aspect ratio, a reduction in TSV feature size and an increase in TSV density. An innovative feature is the implementation of TSVs after thinning and bonding, allowing for tighter overlay tolerance and greater inter-strata connectivity. Today, multi-stacks of 4 silicon wafers, thinned to 5 μm with 0.25-1 μm TSV features, have been demonstrated, as has functional control of memory on one wafer using logic on another. Wafers may be fabricated in parallel and then bonded, or bonded first and then processed sequentially [Batude et al. 2015], with different tradeoffs. For this application, we favor the parallel approach because each wafer can be processed and tested independently, leading to higher yield and robustness. In particular, logic and memory wafers each have very different and highly complex fabrication requirements, which would make processing a number of them in series very challenging. Although the sequential approach offers the potential for extremely high interconnect density and stacking with zero alignment overlay, the TSV density offered by the parallel approach (∼200,000 vias/mm²) is more than sufficient for the high-bandwidth requirements discussed later.
A SIMPLE CONNECTIVITY MODEL FOR PERFORMANCE EVALUATION
The human brain is characterized by a very high neuron count, thousands of synaptic connections per neuron but overall very sparse connectivity, and connectivity at both local ('grey matter') and global ('white matter') length scales [Wen et al. 2005]. If our goal is to design a computing system capable of solving unstructured (brain-like) problems, we can reasonably expect that it will need connectivity comparable to the brain's, although the actual connectivity will of course depend on the particular application. In order to assess the performance of a computing system tasked with solving unstructured computational problems, we need a connectivity model which will give us an idea of the networking demands on the system. Figure 2 illustrates a simple model based on physical (experimentally verified) models of brain connectivity at both short and long length scales. It consists of several regions, or functional units, with dense connections between cells within each region and sparse connections between the regions. We now discuss this model in greater detail.
Local Connectivity
We first consider the local ('grey matter') connectivity. The connection probability between two neurons generally decreases with the separation distance l_sep between their cell bodies. Studies on rat brains [Hellwig 2000; Perin et al. 2011] find a local decay length l_loc typically in the hundreds of microns, for a neuron density of ∼75,000/mm³, corresponding to l_sep ∼ 25 μm if the neurons are evenly spaced in a simple cubic lattice. Using this information, we can calculate how often a neuron connects with other neurons in the same node, in nearest neighbor nodes (one hop away), in next nearest neighbor nodes (two hops away), and so on.
If the m neurons of a processor are arranged in a simple cubic lattice, the length of one edge of the cube is m^(1/3) l_sep. Any given neuron will typically connect with other neurons within a distance of a few l_loc of itself. Since we would like to keep most of the connections within the same node to minimize network traffic, we can define a characteristic "nodal length" ratio q = m^(1/3) l_sep / l_loc whose value should be >1.

Fig. 3. Fraction of neuronal connections as a function of the number of hops, for various values of q. The case of 0 hops means connections within the same node, 1 means connections to all nearest neighbors, 2 means connections to all next nearest neighbors, and so forth. As q decreases, the fraction of connections in nodes outside the same node increases, requiring more network traffic. The lines are meant only as a guide to the eye, since the number of hops is discrete.

Figure 3 shows the fraction of neuronal connections as a function of the number of hops, using the Gaussian parameterization suggested in Hellwig [2000], for various values of q. The number of hops indicates whether the connection is in the same node (number of hops = 0), in a nearest neighbor node (number of hops = 1), and so forth. Note that for m = 250,000 and l_loc = 620 μm, the longest range found by Hellwig [2000], q = 2.6, and about 73% of the connections are within the node and 24% are to nearest neighbors, which is a reasonable design point that allows most of the network to be used for the long-range connections discussed later. The curves for different q in Figure 3 can be viewed either as changing l_loc for fixed m or as changing m for fixed l_loc. However, since q depends only weakly on m, as m^(1/3), the design is only very weakly dependent on this initial choice. For example, the three curves represent more than a 100× variation in m for l_loc = 620 μm. This insensitivity is fortunate because, while greatly increasing m may seem beneficial, it also increases the demands on the internal processor, its memory pipe, and the outgoing bandwidth for synaptic connections that are not local.
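A Monte Carlo sketch of this kind of calculation is given below. The isotropic Gaussian kernel with per-axis scale l_loc and the max-norm hop count (all 26 surrounding nodes counted as one hop) are our simplifying assumptions, so the fractions will not match Figure 3 exactly, but the qualitative behavior, with most connections staying within a few hops for q > 1, is the same.

```python
# Monte Carlo estimate of the hop distribution for local connectivity,
# assuming an isotropic Gaussian connection kernel (per-axis scale l_loc)
# and max-norm hop counting -- both simplifying assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, l_sep, l_loc = 250_000, 25.0, 620.0     # neurons/node; um
edge = m ** (1 / 3) * l_sep                # node cube edge ~1575 um, so q = edge/l_loc ~ 2.6

src = rng.uniform(0, edge, size=(1_000_000, 3))    # presynaptic cells in the home node
dst = src + rng.normal(0, l_loc, size=src.shape)   # targets drawn from the Gaussian kernel
hops = np.abs(np.floor(dst / edge)).max(axis=1).astype(int)

for h in range(4):
    print(f"{h} hops: {np.mean(hops == h):6.1%}")
```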
Finally, we point out two sources of error in our simplistic model. First, pyramidal neurons are arranged in mini-columns [Buxhoeveden and Casanova 2002] and preferentially connect to other pyramidal neurons in the same mini-column, so neither the neuron arrangement nor the connection probability is isotropic. Of course, we could have parameterized each direction by its own l_loc and l_sep, and our conclusion would be the same provided each l_loc is large compared to its l_sep. Second, the tails of the connection probability are likely to be underestimated by the Gaussian parameterization. However, in the next section, we present a global connectivity model with a parameter reflecting the fraction of long-distance connections, so any underestimation of the tails can be incorporated by adjusting this parameter upwards. Therefore, neither of these simplifications is expected to have a major effect on our estimates.
Global Connectivity
The brain may be structurally and functionally divided into many regions which communicate with each other through a network of global ('white matter') connectivity. While mapping the neural pathways that underlie human brain function is a hugely complex undertaking (see, for example, [NIH Brain Initiative website]), there exists a simple yet widely accepted model that can be used to simulate the inter-region connectivity properties of the brain. As summarized in Bassett and Bullmore [2006], numerous studies in cat and macaque brains have found that the functional connectivity exhibits attributes of a Small World Network (SWN), as first proposed in a classic paper [Watts and Strogatz 1998]. Subsequent work using diffusion MRI in humans [Hagmann et al. 2007] has also found global organization in the form of a SWN.
The essential idea of a SWN is illustrated using the concrete example shown in Figure 4. Each vertex of a regular lattice of R = 512 vertices, arranged in a one-dimensional ring lattice, is connected to its nearest neighbors by k = 16 edges per vertex. (The pictures in the figure are meant only to illustrate the various regimes and have many fewer vertices and edges.) Traveling clockwise, each edge of each vertex is allowed to be rewired to another randomly chosen destination vertex with probability p. For small p, the existence of just a few of these rewired connections greatly diminishes the average path length needed to traverse the network, but the network retains the high degree of local connectivity ('cliquishness') of the original regular lattice (p = 0). This regime of high clustering and short path length, which occurs for a surprisingly wide range of p, is called a SWN, and has been found to describe the behavior of many highly disparate systems [Bassett and Bullmore 2006]. For large p, the network becomes randomly connected, marked by short path lengths and poor clustering. Figure 4 shows these regimes using the average clustering coefficient C(p) and path length L(p) (as defined in Watts and Strogatz [1998]) as functions of p.
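The SWN regimes are easy to reproduce; the sketch below uses the networkx implementation of the Watts-Strogatz construction (R, k, and the p values mirror the example above; the seeds are arbitrary) and also computes the small-worldiness metric σ that we use later in the scaling study.

```python
# Clustering C(p), path length L(p), and small-worldiness sigma for a
# Watts-Strogatz ring of R = 512 vertices with k = 16 edges per vertex.
import networkx as nx

R, k = 512, 16
base = nx.connected_watts_strogatz_graph(R, k, 1.0, seed=0)  # p = 1: random baseline
C1 = nx.average_clustering(base)
L1 = nx.average_shortest_path_length(base)

for p in (0.0, 0.01, 0.1, 1.0):
    G = nx.connected_watts_strogatz_graph(R, k, p, seed=1)
    C = nx.average_clustering(G)
    L = nx.average_shortest_path_length(G)
    sigma = (C / C1) / (L / L1)    # >> 1 in the small-world regime
    print(f"p={p:<5} C={C:.3f} L={L:.2f} sigma={sigma:.1f}")
```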
Application of the Connectivity Model
In order to formulate a reasonable model of the networking demands on our computing system, we must consider both long-range and short-range connectivity. This is biologically plausible, since a given neuron might be expected to have a mixture of short-range and long-range connections through proximal and distal arborizations [Ruppin et al. 1993]. However, since the brain seeks to minimize the energy expended on communication, the connectivity is dominated by short-range connections [Hasler and Marr 2013]. We choose to study the behavior of our system treating the proportion of short-range vs. long-range connections as a parameter. While an assumption of <10% long-range connections is likely to be typical, we extend our study to 50% long-range connections to examine the bandwidth, latency, and power implications in an extreme worst-case scenario.
To apply the global connectivity model, we first group some number of nodes into a region, or functional unit, such that there are a total of R regions in the network. For every edge in a simulated network with R vertices, with k and p chosen to be in the SWN regime as illustrated above, we calculate the total number of hops (local and express) needed to make the connection. By combining this with the local connectivity model in some proportion (here chosen to be 90% short-range (grey matter) and 10% long-range (white matter)), we obtain a distribution such as the one shown in Figure 5(a) for the case of 512 regions (13,824 nodes). While the vast majority of connections belong to grey matter and hence require at most 2 hops, the worst-case network latency is set by the relatively small number of connections requiring >20 hops.
The number of required hops in Figure 5(a) benefits greatly from the presence of express lanes, as discussed earlier and illustrated in the inset of Figure 5(b). Figure 5(b) shows that a dramatic reduction in the maximum number of hops, compared to having no express lanes (0 on the x-axis), is achievable. This benefit is comparable to that seen in the SWN through the rewiring of a few random links, but is of course more costly, since it involves hard-wiring an express channel for every node. While an optimal express lane length of ∼9 is apparent, it is noteworthy that the large reduction in the maximum number of hops depends only weakly on this choice.
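A one-dimensional toy model (ours, far simpler than the 3D network simulated for Figure 5(b)) captures why an intermediate express-lane length works best: a message rides express links of length E and then finishes locally, so the worst case over a row of N nodes behaves like N/E + E and is minimized near E ∼ √N, with only a shallow dependence around the optimum.

```python
# Worst-case hops along one axis with hard-wired express links of
# length E: take d // E express hops, then d % E local hops.
N = 24    # nodes per row (24^3 = 13824 nodes, as in Figure 5)

def worst_hops(E):
    return max(d // E + d % E for d in range(N))

for E in (1, 3, 5, 6, 9, 12):
    print(f"express length {E:2d}: worst-case hops per axis = {worst_hops(E)}")
# The minimum is shallow: lengths from ~3 to ~9 all give similar results.
```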
Cortical Algorithm
The unified model above tells us the required connectivity, but understanding the network demands requires an algorithm that tells us how much and how often data will flow on the network. As an example, we use the Hierarchical Temporal Memory (HTM) algorithm of cortical processing [Hawkins and Blakeslee 2004]. While a full explanation of HTM is beyond the scope of this work, we provide a very brief summary that should help in understanding the network demands. During the course of a single HTM iteration, each functional region activates a number of cells based on input from the outside world or other regions and sends a message to each activated cell's connections. While the messages are traveling throughout the network, each node's processor evaluates and updates the state of the synapses of all its affected neurons based on its input. It may create new synapses, or destroy weak ones, which is how it "learns". By the time the next iteration begins, all of the messages must have reached their destinations in order for the network latency to remain effectively "hidden" behind the computation time; therefore, the time of one iteration sets the time scale for data flow on the network. To estimate the time of a single iteration, we have carried out numerous HTM simulations on ARM A9 processors and found a typical range of 50-200 ms per iteration, depending on the number of synapses, which changes as the simulation proceeds. This estimate also depends on processor performance (here we have assumed a frequency of 1 GHz) on the specified tasks. We will later show that the condition for keeping the latency hidden behind the compute cycle is well satisfied for a timestep in the ms range, but note that this condition could be violated if the processors were greatly accelerated. In that limit, the system would become dominated by the network latency rather than the compute cycle.
A SCALING STUDY
This section explores the bandwidth, latency, and power implications of scaling up to human-brain levels of neuron count using the three cases shown in Table II. Case 2 is an 8× scaleup in total neuron count over Case 1, and Case 3 is an 8× scaleup in total neuron count over Case 2. The average number of synapses per neuron is kept fixed at 1000. Hence, the number of DRAM wafers in Case 2 is 8× that of Case 1, and the number in Case 3 is 8× that of Case 2. In each case, a region is assumed to consist of 27 nodes, or about 7 million neurons, which is in the right range for primate brains [Collins et al. 2010]. The bottom part of Table II shows the SWN parameters used for each case. The values of C(p) and L(p) for the chosen values of p are in the range of biological examples [Bassett and Bullmore 2006]. The last column shows the value of σ = (C(p)/C(1))/(L(p)/L(1)), which is a metric of the "small-worldiness" of the system [Humphries et al. 2006]. The ratio σ should be well above 1 if the system is in the small-world regime, since clustering should be high compared to the random case of p = 1 (C(p) >> C(1)) while the average path length should be comparable (L(p) ∼ L(1)). For the larger cases (Cases 2 and 3), we have maintained a high value of σ by scaling up the number of edges k and scaling down the probability p by the same factor as the number of regions R is scaled up.
Bandwidth and Latency
The network traffic can be estimated as follows. In a given iteration of the underlying cortical algorithm, a fraction α of all neurons, or α × m neurons per node, become active. We assume that only those neurons whose state has changed since the previous iteration need to send a message to each of their s synaptic connections, leading to α × m × s messages to be sent per node in each iteration. If a fraction γ of these connections are outside the node (mainly long-range, but also a few short-range connections), then each node sends γ × (α × m × s) × b_msg bits onto the network in each iteration. Here b_msg is the number of bits contained in a message packet, including the header and payload, and is significantly larger than the b_syn bits used to store just the address in compressed format and the data bits. Some of these messages (predominantly the white-matter ones) require multiple hops and thus need to be rerouted several times during the iteration. Figure 6(a) shows the outgoing bandwidth per node as a function of the fraction of connections that are white-matter (long-range) for the three cases, assuming an iteration of 100 ms and an activity factor α = 0.01. For Case 3, even in the extreme example of 50% white matter, the 25 Gbps is comfortably within our earlier estimate of 100 Gbps.
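To give a feel for the magnitudes, the sketch below evaluates this formula with per-node parameters like those above. Neither b_msg nor the mean hop count is fixed numerically in the text, so the values used here are illustrative assumptions; the hop multiplier reflects the fact that forwarded traffic also occupies the outgoing bus of every intermediate node.

```python
# Per-node traffic estimate: gamma * (alpha * m * s) * b_msg / t_iter,
# with a hop multiplier for forwarded (multi-hop) traffic.
alpha, m, s = 0.01, 250_000, 1000   # activity factor; neurons/node; synapses/neuron
gamma = 0.5                          # fraction of connections leaving the node (worst case)
b_msg = 64                           # bits per message packet, header + payload (assumed)
mean_hops = 15                       # mean hops for off-node traffic (assumed)
t_iter = 0.1                         # iteration time, s

originated = gamma * alpha * m * s * b_msg / t_iter
print(f"originated: {originated / 1e9:.1f} Gbps; "
      f"with forwarding: {originated * mean_hops / 1e9:.0f} Gbps")
# Both figures sit comfortably below the ~100 Gbps per-node bus.
```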
Since the curves in Figure 6(a) scale with the activity factor α, the outgoing bandwidth could approach the bandwidth capability at high activity factors, as might occur at a few very active nodes, but a very high activity factor is not consistent with biological energy constraints [Lennie 2003]. Also shown in Figure 6(a) is a breakup of the outgoing bandwidth for Case 3 into grey- and white-matter components. Except when the white-matter fraction is very small (below 4%), the network traffic is dominated by white matter, as it should be in a well-designed system.
As long as the system is not bandwidth-constrained, the latency will be determined by the maximum number of hops, which is set by the long-range connections. Figure 6(b) shows the maximum number of hops as well as the connection-weighted average number of hops for the three cases. We can estimate the time per hop as the sum of the transit time across the hop and the processing time at each intermediate node where the message must be rerouted. For the short distances between nodes, the transit time (due to RC delay) is small compared to the rerouting time. Conservatively estimating the rerouting time as 100 ns per hop, the worst case of 40 hops would require on the order of a few μs, so the assumption of latency hiding during the 100 ms iteration is valid, provided the system is not bandwidth-constrained.
Power Consumption
The total system power can be approximated as the sum of the processor power, memory power, and communication power. Since typical low-end processors consume in the 10's of μW per MHz (see, for example, [ARM website]), a reasonable estimate of the processor power is 0.5 W per node at 1 GHz, as we need to add some power for the drivers of the communication lines (explained below). The power consumption of a 1 GB DRAM can be estimated from standard datasheets (see, for example, [Micron website]) and from Vogelsang [2010] as approximately 100 mW (including refresh), assuming a 1% neuron activity factor, which means about 1% of the data will be pulled (random access) per iteration. The communication power can be estimated from (1/2) C_w V_dd² per bit transition, based on the capacitance C_w per wire, the supply voltage V_dd, and the number of connections. For V_dd = 1 V and a typical BEOL wiring capacitance of ∼2 pF/cm, a local hop of ∼0.3 cm would require ∼0.3 pJ/bit = 0.3 mW/Gbps, and an express hop of ∼3 cm would require ∼3 pJ/bit = 3 mW/Gbps. Figure 6(c) shows the total power per node and its breakup into these three components as a function of the white-matter fraction for Case 3 (Cases 1 and 2 are similar). Adding the three components together yields a power estimate of about 600 mW per node, or a total of 1 kW for Case 1, 8 kW for Case 2, and 66 kW for Case 3. We have carried out thermal simulations which find that, for a stack of 10 wafers, 1 kW/wafer is sustainable using air cooling and 10 kW/wafer is sustainable using water cooling [Sikka et al. 2015]. Thus, the thermal challenges even for a "human-like" system are manageable.
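The per-node budget can be assembled as below; the component figures are the ones quoted above, while the split of traffic between local and express hops is our assumption for illustration.

```python
# Per-node power budget: processor + DRAM + communication.
proc_W = 0.5                 # processor plus communication-line drivers at 1 GHz
dram_W = 0.1                 # 1 GB DRAM including refresh, ~1% activity
local_mW_per_Gbps = 0.3      # 0.3 pJ/bit for a ~0.3 cm local hop
express_mW_per_Gbps = 3.0    # 3 pJ/bit for a ~3 cm express hop
traffic_Gbps = 25.0          # worst-case per-node traffic (Case 3, 50% white matter)
express_frac = 0.3           # fraction of hop-bits on express lanes (assumed)

comm_W = traffic_Gbps * ((1 - express_frac) * local_mW_per_Gbps
                         + express_frac * express_mW_per_Gbps) / 1000
total_W = proc_W + dram_W + comm_W
print(f"communication: {comm_W * 1e3:.0f} mW; total per node: {total_W * 1e3:.0f} mW")
# ~28 mW for communication and ~630 mW total, consistent with the ~600 mW above.
```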
ROUTING AND FAULT TOLERANCE
During each iteration, messages must be passed from one node to another according to a message passing protocol. This protocol may be deterministic for simplicity, or adaptive to mitigate congestion in the system. One example of adaptive routing for this system is the algorithm of May et al. [1997], which alleviates traffic congestion by routing a message from one node to another through a randomly selected intermediate node. The routing algorithm must also be able to deal with defects and failures, which we now discuss in detail.
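The sketch below illustrates the two flavors. Dimension-order (XYZ) routing is a standard deterministic choice for meshes and is our illustrative stand-in (the text does not fix a particular deterministic protocol); the randomized two-phase variant, in the spirit of May et al. [1997], detours through a random intermediate node to spread congestion.

```python
# Deterministic dimension-order routing and a randomized two-phase
# variant on a 3-D mesh of nodes (grid coordinates are node indices).
import random

def xyz_route(src, dst):
    """Deterministic: correct the x coordinate, then y, then z."""
    path, cur = [tuple(src)], list(src)
    for axis in range(3):
        step = 1 if dst[axis] > cur[axis] else -1
        while cur[axis] != dst[axis]:
            cur[axis] += step
            path.append(tuple(cur))
    return path

def randomized_route(src, dst, grid=24, seed=0):
    """Two-phase: route to a random intermediate node, then to dst."""
    rng = random.Random(seed)
    mid = tuple(rng.randrange(grid) for _ in range(3))
    return xyz_route(src, mid)[:-1] + xyz_route(mid, dst)

print(len(xyz_route((0, 0, 0), (5, 3, 2))) - 1)          # 10 hops (Manhattan distance)
print(len(randomized_route((0, 0, 0), (5, 3, 2))) - 1)   # longer path, less congestion
```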
Resilience to faults is essential for a wafer-scale system as pre-existing defects during fabrication and real-time failures during operation are inevitable. Fault tolerance can be realized through a combination of redundancy, repair techniques, and algorithms that route around defects, borrowing from a long history of known techniques for mesh networks. However, probably the single most important factor in alleviating vulnerability to failures is the remarkable ability of many neural algorithms to be naturally resilient to faults. We examine each of these in turn.
Redundancy and repair techniques [Arzubi 1973; Robson et al. 2007] play a crucial role in dealing with fabrication defects. For example, because TSVs are freely available due to their high density, sparse TSV faults can be tested and repaired using redundancy after integration [Chi et al. 2013]. While the processor yield is expected to be very high due to the processors' relative simplicity, extra processors can be added without substantial area penalty. If necessary, power domains can be used to shut off a block of processors containing a fault that could otherwise be fatal to the entire system.

Fig. 7. Effect on performance (average reward) of suddenly disabling 50% or 75% of the columns after 7000 decision epochs in an HTM-type algorithm. Although the performance shows an initial drop, the system is able to reconfigure its resources autonomously and recover to nearly its original performance level. Reprinted with permission from J. Marecki.
Unfixable defects on a wafer, such as non-functional nodes, can be addressed using well-known routing algorithms, in some cases extensions of mature 2D routing algorithms such as those used for Network-on-Chip. Because the TSVs of the stacked wafers have RC characteristics comparable to the planar wiring load, the routing algorithm can treat the TSVs in the vertical dimension the same as the wiring in the x and y directions on the wafer, from a graph point of view. Many known techniques for fault avoidance, such as deadlock-free and livelock-free routing [Boppana and Chalasani 1995], can also be adopted in 3D. Routers for 3D need two more ports due to the addition of the +z and −z directions, but the area and power consumption will be of the same scale [Bahmani et al. 2012]. The third dimension is also beneficial in providing additional paths to route around faults.
Due to the robustness of neural algorithms, the presence of a few faults, which may occur during operation, will not substantially affect the performance of the system. Figure 7 shows a simulated result giving a dramatic illustration of this robustness. Here an HTM-like cortical algorithm is subjected to a sudden disabling of a large percentage (50% or 75%) of the mini-columns, each of which contains a group of cells. Although the performance shows an initial drop, the system displays a remarkable resilience, reconfiguring its resources autonomously and recovering to nearly its original performance level within a reasonable time. Behind this ability to recover is the system's capacity simply to form new synapses when existing ones are destroyed.
DISCUSSION AND CONCLUSION
The purpose of this work has been to carry out a feasibility study of using 3D-WSI to realize a neuronal system approaching human-brain levels of neuron count and connectivity. We now review our findings to understand where the biggest gaps and challenges lie.
Using a simple model to emulate network connectivity, we found that the high bandwidth afforded by metal lines directly on the wafer and by TSVs between wafers is quite adequate for the expected traffic. When compounded over the entire system with thousands of nodes, the total bandwidth capability compares very favorably to that of the human brain, which has been estimated at 1 Tbps [Laughlin and Sejnowski 2003]. Communication latency also appears to be very low in our system. Aided by express channels, the latency can be well hidden during the compute cycle even in the worst case. Interestingly, it appears that the human brain is possibly more limited by communication time (with axonal propagation times in the 20 ms range [Wen and Chklovskii 2005]), while the 3D-WSI system is more constrained by compute time because of its inability to parallelize below the level of a single node. Still, an iteration time in the 10's of ms should allow for sensory perception on the time scale of a second. Finally, the estimated power consumption in the 10's of kW, while high, is manageable if advanced cooling techniques are used.
The biggest challenge appears to be the memory requirements, since we desire storage-class density with access times typical of volatile memories. The primate-like case described above, requiring 2 logic wafers and 16 DRAM wafers, is still feasible at today's DRAM densities, using wafer bonding techniques which to date have demonstrated stacks of 4 wafers. However, scaling up to the human-scale case using 128 DRAM wafers at today's density would not be feasible, pointing to the need for further memory innovation, which becomes even more important given the slower scaling of DRAM feature size compared to digital CMOS. For the human-scale case, a 10× increase in density would be needed to bring the number of wafers down to a feasible level (<30) for 3D-WSI. These density increases may be possible with advances in emerging memory technologies [Meena et al. 2014]; one promising example is given in Cappelletti [2015].
Before advocating the use of 3D-WSI, it is worthwhile to ask whether this system could practically be realized using the conventional method of packaged chips mounted on boards. While it is possible to achieve a comparable bandwidth using high-speed SERDES MGTs (multi-gigabit transceivers) [Kimura et al. 2014], the very high bandwidth consumes significant power to drive the high-capacitance chip-to-chip and board-to-board connectors [Hasler and Marr 2013] and to maintain synchronization. Using Kimura et al. [2014], in which 28 Gbps is achieved with 560 mW in a 28 nm technology, we estimate an added 4 W per node just for the high-bandwidth communications, resulting in an added 400 kW for a human-scale system. Each SERDES operation also adds a delay of ∼100 ns for serialization and deserialization to the time for each hop. The beauty of not serializing lies in the inherent simplicity and power savings of parallel communication: power is consumed only when used, not all the time as in a SERDES MGT, which must maintain clock synchronization. Finally, with approximately 25 chips per board, the human-scale system would require a roomful of racks to implement. The 3D-WSI system would be much more compact, even when control and input/output chips, power supply, and cooling system are included.
Finally, while we believe that 3D-WSI represents a strong next step in advancing brain-inspired computing, we comment on what could be the next steps in a neuromorphic roadmap for further scaleup. First, as discussed, we have chosen as a starting point an average of 1000 synaptic connections per neuron, while a 10× increase to an average of 10,000 per neuron would be desirable to improve the contextual capabilities of the system [Hawkins and Ahmad 2015]. In addition to much higher memory requirements, the network traffic would also increase markedly. As a possible futuristic enhancement, optical interconnects (see, for example, Schow et al. [2010]) offer one solution for greatly increasing the bandwidth-distance product. In addition, they offer the possibility of a free-space interconnect that can be flexibly reconfigured to meet changing network demands [Katayama et al. 2013].

Fig. 8. Example of address compression for a single synaptic connection in a 64-bit addressing space. The top frame shows the original representation of the connection using its absolute address. The middle frame shows some improvement when the relative address is used along with run-length encoding. The bottom frame shows significant additional improvement when relative address, run-length encoding, and rearrangement are used. In the bottom frame, msb refers to the most significant bit, smsb to the second most significant bit, and so forth.
In conclusion, we have found that 3D-WSI offers a feasible path for realization of a next-generation computing system with significant cognitive capabilities that could complement today's enterprise servers. The ability to tackle unstructured computation problems would have significant impact on a wide variety of fields such as cybersecurity, healthcare, public safety, economics and finance, and robotics.
APPENDIX
This Appendix describes a method to store the addresses of synaptic connections using a highly compressed representation (run-length encoding) that takes advantage of the observation that the overwhelming majority of a neuron's connections are located in its local vicinity. Naively, for M total neurons in the system, log₂ M address bits are required to specify an absolute address for each synaptic connection. If M is very large, the number of address bits can significantly exceed the number of data bits being stored with the address, resulting in a highly inefficient ratio of address bits to data bits. However, we can improve on this situation by storing, for a given neuron, the address of each of its connections relative to the address of the given neuron, rather than as an absolute address. Since the vast majority of connections are local, the relative addresses will contain a large number of leading zeros, which can be stored compactly by noting only the number of nonzeros in the relative address. Figure 8 gives a concrete example for the case of 64-bit addressing. First, the original absolute address is divided into four fields: a 16-bit global address field G and three 16-bit local address fields x, y, and z. Suppose that a given neuron, whose relative address we take to be all 0's, makes a connection with another neuron located in the same region, but +23 units away in x, −23 units away in y, and +23 units away in z, which we denote as (+10111, −10111, +10111) in binary. As a first compression, we could represent this address, consisting of 27 leading zeros followed by 37 bits of information, using 47 bits (6 bits for the number of leading zeros, 37 bits of information, and 4 bits for the signs of the address fields), instead of the original 64 bits. However, if we regroup the x, y, z subfields such that we take the most significant bit of each, then the second most significant bit of each, and so forth, we increase the number of leading zeros from 27 to 49, requiring only 25 bits to store (6 bits for the number of leading zeros, 15 data bits, and 4 bits for the signs of the address fields).
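The accounting can be verified with a short sketch; the bit layout (16-bit global field, three 16-bit magnitude fields interleaved msb-first, a 6-bit leading-zero count, and 4 sign bits) follows the description above, while the function itself is ours.

```python
# Compressed size, in bits, of a relative synaptic address after
# msb-first interleaving of the x, y, z magnitude fields and run-length
# encoding of the leading zeros (6-bit count + payload + 4 sign bits).
def compressed_bits(g, dx, dy, dz):
    mags = [f"{abs(v):016b}" for v in (dx, dy, dz)]
    interleaved = "".join(m[i] for i in range(16) for m in mags)
    addr = f"{g:016b}" + interleaved          # 64 bits; signs kept separately
    leading_zeros = len(addr) - len(addr.lstrip("0"))
    return 6 + (64 - leading_zeros) + 4

print(compressed_bits(0, +23, -23, +23))      # 25 bits, versus 64 uncompressed
```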
