Abstract. Emergent nanoscale non-volatile memory technologies with high integration density offer a promising solution to overcome the scalability limitations of CMOS-based neural network architectures, by efficiently exhibiting the key principle of neural computation. Despite the potential improvements in computational costs, designing high-performance on-chip communication networks that support flexible, large-fanout connectivity remains a daunting task. In this paper, we elaborate on the communication requirements of large-scale neuromorphic designs, and point out the differences with conventional network-on-chip architectures. We present existing approaches for on-chip neuromorphic routing networks, and discuss how new memory and integration technologies may help to alleviate the communication issues in constructing next-generation intelligent computing machines.
Introduction
Neuromorphic computing is an inter-disciplinary area of research, aiming to emulate the computational principles of Biological neural systems and to utilize such principles to solve complex and computationally intensive problems. The discipline originated with the goal of constructing silicon models of Biological neurons and synapses, based on the observation that the basic Physics governing the flow of current across a semiconductor junction and Biological "junction" in an ion channel were the same [1] . Early papers in the field include the design of silicon models of the retina and cochlea, with analog CMOS circuits emulating Biological circuits discovered through anatomical studies [2] [3] [4] [5] [6] .
Concurrently, neural networks were being used for pattern recognition, where neurons were modeled using the McCulloch-Pitts model [7] . In this approach, a neuron is a function that partitions points in space into two regions by a separating hyperplane. Mathematically, a neuron computes a dot product between its weight vector and an input vector, and the output is given by the sign of the result. Generalizations of this where the output is computed using sigmoids or other functional forms were also studied [8] . More recently, such neural networks with a large number of parameters (millions, if not more) have been used to achieve unsurpassed classification accuracy for tasks such as object labeling, face detection, speech processing, etc. Sometimes work in this area is also referred to as "neuromorphic," since it can trace its roots to Biology as well. In what follows, we do not examine these more recent "deep neural networks," since while the neurons have Biological connections, the communication structure they use is engineered for functionality, without any attempt to mimic Biology. Neuromorphic systems can be modeled entirely in software, and several projects have developed tools to model Biological neurons, synapses, axons, and dendrites at varying levels of detail [9] [10] [11] . However, such systems are very slow. Even a simplified model of Biological neural networks implemented on a massively parallel machine ran 10× slower than real-time, and consumed 655 kW of power [12] . A Biological system of the same complexity takes five orders of magnitude less power while operating in real-time. Hence, in order to explore the potential of Biologically-inspired computing systems for solving real-world problems, new types of energy-efficient devices and hardware architectures with high scalability factors are required. 
A domain where such systems could have a large impact is the Internet of Things (IoT), where small form-factor battery operated devices could be enhanced with capabilities akin to the sensory system (e.g. vision or speech recognition) of Biological systems.
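As a concrete illustration of the McCulloch-Pitts neuron described earlier in this section, the following minimal sketch computes the dot product between a weight vector and an input vector and returns the sign of the result, together with a sigmoid generalization. The function names, the `bias` term (the offset of the separating hyperplane), and the convention that an activation of exactly zero maps to +1 are assumptions made for illustration only.

```python
import math

def threshold_neuron(weights, inputs, bias=0.0):
    """Return +1 or -1 depending on which side of the hyperplane the input lies."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Assumed convention: an activation of exactly zero is mapped to +1.
    return 1 if activation >= 0 else -1

def sigmoid_neuron(weights, inputs, bias=0.0):
    """Generalization where the output is a smooth function of the same dot product."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))

w = [1.0, -1.0]                            # hyperplane x0 = x1
print(threshold_neuron(w, [2.0, 0.5]))     # 1
print(threshold_neuron(w, [0.5, 2.0]))     # -1
```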
Hardware neuromorphic computing systems [13] [14] [15] [16] [17] [18] [19] [20] [21] contain structures that model neurons, synapses, and the connectivity between neurons and synapses. Current large-scale systems have been implemented using conventional CMOS technology, and there has been a large body of work in recent years on novel memory devices for reducing the power consumption in neuromorphic systems. We begin with a brief overview of this body of work (Section 2), summarizing the focus of researchers developing novel devices for neuromorphic systems. We then present the problem of implementing connectivity in neuromorphic systems and discuss some of the issues involved in building large-scale neuromorphic hardware that can model millions of neurons (Section 3). We argue that the cost of communication in such systems is extremely high due to the complexity of supporting flexible high-fanout connectivity. This is consistent with what has been observed in existing large-scale neuromorphic systems (Section 4). We discuss what new materials/device technologies might be able to do to alleviate this problem (Section 5), and provide possible directions for future research in this area.
Fig. 1: Generic memristor structure and its symbol [22].
Our primary thesis is that almost all the recent work in memory technologies and materials/device design for neuromorphic systems cannot have a major impact on the system-level power consumption of large-scale neuromorphic systems without new techniques being developed to address the communication problem. When examining the complete system, it is quite clear that the focus of device researchers has been on only a portion of the total power requirement; a non-trivial fraction of the total power budget remains unaddressed, limiting the benefits of the proposed device technologies in isolation.
Resistive memory technologies for neuromorphic systems
Most prominent large-scale neuromorphic projects have employed commercial CMOS technology for implementing neural principles. In recent years, there have been growing research efforts to utilize emergent memory and device technologies for scaling up neuromorphic hardware systems. In neuromorphic engineering, most such technologies are primarily considered for implementing synaptic and learning models. However, there is potential to utilize nanoscale emergent devices to overcome the interconnect limitations [24] in 2D/3D VLSI integration. In recent neuromorphic designs [13, 17, 26], the communication and standby power have become a significant portion of total power consumption. In this section we review the most relevant research articles that have investigated the potential of two-terminal emergent devices in improving the performance of CMOS neuromorphic computing systems.
Multiple emergent memory devices have recently been proposed to overcome the limitations of conventional CMOS memories [30] in embedded systems. Memory resistive (memristor) elements, oxide-based resistive random access memory (oxRAM), conductive-bridging random access memory (CBRAM), and phase-change memory (PCRAM) are examples of the emergent memory devices that are being explored for integration with CMOS-based neuromorphic circuits.
The characteristics of these memory devices vary depending on the switching materials.
In order to incorporate the emergent devices with CMOS-based neuromorphic systems, certain design constraints need to be satisfied [32]. The main idea has been to use such memory elements to implement synaptic computation in a more compact and energy-efficient way compared to their CMOS counterparts. Given the large number of synapses in neuromorphic networks, more efficient synapse circuits lead to significant improvements in overall performance. While the term memristor was introduced in 1971 [33], the first experimental prototypes were only recently demonstrated by HP Labs [34]. Fig. 1 illustrates the generic structure of memristive devices. The memristor can be considered as two regions in series, doped and undoped. Ideally, the resistance difference between the doped and undoped regions is very high, and the overall device resistance is a function of the doping ratio (w/G), where w is the length of the doped region and G is the thickness of the memristor. The size of these regions changes depending on the amount and direction of the charge that flows through the device. For instance, in a voltage-controlled device, applying a positive voltage widens the doped region until w/G = 1, at which point the resistance is very low and the device is in the ON state. Ideally, memristors can be programmed to represent analog memory or various intermediate resistance states between the high-resistance state (HRS) and the low-resistance state (LRS) [37]. Because of their programmable resistive states and small footprint (on the order of a few nm), memristors are promising candidates for efficiently implementing the synaptic connection between pre-synaptic and post-synaptic neurons. There has been significant activity dedicated to developing memristor-based neuromorphic systems [30, 38, 39, 40]. A variety of oxide-based materials
have been investigated to implement the synaptic array in neuromorphic systems [31, 45, 46], for applications such as pattern recognition [42], face detection [43], and learning algorithms [44]. Further details on memristive synapse implementations are beyond the scope of this paper; readers can refer to Hong et al. [31] for a comprehensive survey of oxide-based memristor devices and their benchmarking metrics for neuromorphic computation. Crossbar-based structures are typically used to implement synaptic arrays. In crossbar schemes, the memristors are placed at the cross-points between vertical and horizontal wires. However, as the size of the crossbar array increases, the power consumption increases dramatically, due to sneak current paths through the memristive memory elements as well as other alternate current paths. Memristors are therefore typically placed at each grid point in series with an access transistor to limit the sneak current. This scheme, referred to as 1T1R (one-transistor-one-resistor), leads to a larger cell size (e.g. $20F^2$) and lowers the memory density. 1D1R (one-diode-one-resistor) and 0T1R [41] are among other solutions that have been investigated for limiting the sneak current issues [47]. Hybrid CMOS-memristor architectures have been proposed based on the cross-net structure [48], where the idea has been to use a grid of synaptic memristive devices on top of CMOS-implemented neuronal arrays (see Fig. 2). Successful implementations of such a scheme can help to significantly scale the number of neurons and synapses on a single die compared to standard CMOS technology. Memristor devices have also been utilized to implement training algorithms, for instance back-propagation [49] and biologically-inspired learning rules, e.g. Spike-Timing-Dependent Plasticity (STDP). Furthermore, the potential of all-memristive neuromorphic computing systems has also been examined [50]. CBRAM [51, 52] and oxRAM [53] are examples of other nanoscale devices that have been studied for implementing synapse circuits. Their structure is based on a Metal-Insulator-Metal (MIM) configuration, and they are typically used as binary memory elements. While it is possible to program these devices with multi-bit resolution, this is often achieved at the cost of higher variability and larger control circuits [54]. These devices are typically fabricated using different materials and techniques [35], and therefore, depending on configuration, exhibit different electrical characteristics.
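The two-region picture of the memristor described earlier in this section (a doped and an undoped region in series, with resistance governed by the doping ratio w/G) can be sketched numerically. The text only states that the resistance is a function of w/G; the linear series mix used below is one common idealization and is an assumption here, as are the function name and the example `r_on` and `r_off` values.

```python
def memristor_resistance(w, G, r_on=100.0, r_off=16e3):
    """Resistance (ohms) of an idealized memristor with doped length w and thickness G.

    Assumed model: the doped region contributes r_on and the undoped region
    r_off, weighted by the fraction of the device each one occupies.
    r_on and r_off are hypothetical example values.
    """
    x = w / G  # doping ratio, 0 (fully undoped) to 1 (fully doped)
    if not 0.0 <= x <= 1.0:
        raise ValueError("doped length must satisfy 0 <= w <= G")
    return r_on * x + r_off * (1.0 - x)

# Sweep the doping ratio from the OFF state (w = 0) to the ON state (w = G).
for w in (0.0, 2.5, 5.0, 7.5, 10.0):
    print(w, memristor_resistance(w, G=10.0))
```

Intermediate values of w give the intermediate resistance states that make the device attractive as an analog synaptic weight.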
Although the efficiency of memristor devices has been successfully demonstrated for a single device or a small array of synapses [38], the potential of large-scale CMOS-memristor designs is still being explored in academia. Device reliability, manufacturability, and yield issues are major obstacles to implementing a large synaptic matrix based on emergent nanoscale memories. Combinations of different switching materials are being investigated in order to overcome these limitations in large-scale integration of neuromorphic CMOS and memristors. If successful, this would further reduce the power required for performing computation. In the next section, we discuss the communication requirements of next-generation scalable low-power neuromorphic processors, and investigate the impact of communication and computation on overall power consumption.
Communication requirements
A diverse variety of spike-based neuromorphic computing systems have been developed and implemented in silicon over the past three decades. There are many differences in the implementation details of each system, based on the goals of the implementation and the state of silicon technology at the time the system was created. For instance, early implementations were dominated by analog circuit design, but completely or mostly digital implementations can be seen in recent years due to the scaling of CMOS device technology from a feature size of 3µm in the mid 1980s to 14nm and below today. Some systems aimed to faithfully mimic Biology, which included operating at Biological time constants. Others were aimed at accelerated modeling, with time constants many orders of magnitude faster than Biology. These choices have cascading implications that change the overall architecture of the neuromorphic system. In particular, communication requirements can change dramatically depending on the implementation goals.
Despite their many differences, neuromorphic systems all attempt to emulate Biology. This means that no matter how various aspects of Biology are modeled, the hardware representing neurons and synapses must logically operate in a massively parallel fashion. Hence, it is possible to view every large-scale neuromorphic hardware platform as a parallel collection of neurons and synapses with some communication infrastructure. Furthermore, it is widely accepted that neurons and synapses have dense local connectivity and relatively sparse global connectivity. Hence, every platform consists of a collection of clusters of neurons and synapses that support efficient local communication within the cluster combined with a global inter-cluster communication network (Fig. 3) .
The flexibility of a routing architecture determines the connectivity between neurons and synapses that is natively supported by the hardware. It should come as no surprise that there is a high cost to be paid for supporting highly flexible connectivity, whereas systems that are more restrictive can be made more efficient. The cost of flexibility is higher design complexity and more hardware resources including on-chip memory. In this section, we examine the communication requirements for a generic spiking neural network and possible trade-offs between the system flexibility and its complexity.
Consider a system that models N neurons and S synapses per neuron ($N \approx 10^{14}$ for the mammalian brain, and $S \approx 10^4$). Since the average number of incoming connections to a neuron from other neurons is S, it follows that the average number of outgoing connections from a neuron is also S (with some exceptions for external inputs and outputs). A neuromorphic communication network has to support an average fan-out of roughly $10^4$ in silicon. Static usage of wiring resources for this purpose is infeasible, since VLSI systems have always been wiring limited even when the average signal fanout is in the range of three to four destinations. Therefore, packet switching is the architecture of choice for global communication in neuromorphic systems. The speed of silicon device switching is used to time-multiplex wires to provide support for global communication. The address-event representation (AER) is the most commonly used communication protocol in neuromorphic systems. In the AER protocol, each neuron is given a unique address and, upon generating a spike, the address packet is routed to the destinations of the source neuron. Hence, the global inter-cluster communication network is a packet-switched communication network, and its single-chip component is a network-on-a-chip (NoC).
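The AER idea described above (each neuron has a unique address, and a spike becomes an address packet sent over a shared, time-multiplexed link) can be sketched in a few lines. This is a toy model under stated assumptions, not the protocol of any particular chip: the class `AERFabric`, its methods, and the choice to emit one packet per destination are all hypothetical; real systems may instead send a single source-address packet and expand the fan-out at the destination or inside the routers.

```python
from collections import defaultdict, deque

class AERFabric:
    """Toy address-event fabric: one shared link, one packet per (source, destination)."""

    def __init__(self):
        self.routing_table = defaultdict(list)  # source address -> destination addresses
        self.link = deque()                     # shared link, serialized in time

    def connect(self, src, dst):
        self.routing_table[src].append(dst)

    def spike(self, src):
        # A spike carries no payload beyond addresses; its timing is implicit
        # in when the packet appears on the link.
        for dst in self.routing_table[src]:
            self.link.append((src, dst))

    def deliver(self):
        """Drain the link and return, per destination, the list of source addresses."""
        delivered = defaultdict(list)
        while self.link:
            src, dst = self.link.popleft()
            delivered[dst].append(src)
        return dict(delivered)

fabric = AERFabric()
fabric.connect(0, 1)
fabric.connect(0, 2)
fabric.spike(0)
print(fabric.deliver())  # {1: [0], 2: [0]}
```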
Routing memory
Assume a generic spiking neural network with N neurons and S synapses per neuron. This results in total N × S destinations. In conventional routing schemes used by parallel machines, each destination would be assigned a unique address encoded with lg(N S) bits. In order to support arbitrary connectivity, N S lg(N S) bits would have to be stored-which corresponds to ≈38.7GB of storage for a million neuron system (40.5KB per neuron), which would dominate the silicon area needed for the system. Many techniques have been introduced to reduce the storage required. One approach groups the synapses for a neuron by their type e.g. inhibitory vs excitatory, fast vs slow, etc. Modeling the computation performed by the synapse as a linear filter permits the per-synapse state to be grouped by synapse type using the principle of superposition [23] . If there are k types of synapses, the storage requirements would be N S lg(kN ). However, even for small k (e.g. k = 4), this only reduces the storage by a factor of 1.5. For this reason, state of the art neuromorphic systems typically limit the flexibility of supported connectivity in order to limit the memory needed to store connectivity information.
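The storage estimates in this subsection follow directly from the expressions N S lg(N S) and N S lg(kN). The sketch below reproduces them for a million-neuron system with S = 10^4; the function name is hypothetical, and two conventions are assumed because they match the quoted figures: fractional bits per entry are not rounded up, and GB/KB are read as binary units (2^30 and 2^10 bytes).

```python
import math

def routing_table_bits(n_neurons, synapses_per_neuron, synapse_types=None):
    """Bits needed to store arbitrary connectivity.

    Without synapse grouping each entry needs lg(N*S) bits; with k synapse
    types each entry needs lg(k*N) bits. Fractional bits are kept as-is.
    """
    n, s = n_neurons, synapses_per_neuron
    if synapse_types is None:
        bits_per_entry = math.log2(n * s)
    else:
        bits_per_entry = math.log2(synapse_types * n)
    return n * s * bits_per_entry

N, S = 10**6, 10**4
full = routing_table_bits(N, S)
grouped = routing_table_bits(N, S, synapse_types=4)

print(round(full / 8 / 2**30, 1))      # total storage in GiB, about 38.7
print(round(full / N / 8 / 2**10, 1))  # storage per neuron in KiB, about 40.5
print(round(full / grouped, 2))        # saving from k = 4 synapse types, about 1.5x
```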
Latency constraints
In neuromorphic hardware systems, spikes are represented by binary addresses sent in the form of packets across the network. The packets carry the spikes' timing (implicitly) and contain the routing information to create the virtual connection(s) between a source node and its destination(s). The packets may include other information such as axonal delays or synaptic weights. In terms of size, the spike packets in neuromorphic hardware are typically smaller than those of conventional communication networks, consisting of little more than the packet header that stores routing information. Since neuromorphic systems represent information in the timing of spikes, it is important that the communication network preserve the delivery time of a spike relative to the time it was generated. In other words, the communication network must have predictable latency relative to the time scale of the computation. In what follows, we assume a packet-switched network consisting of a collection of routers interconnected in some pre-defined topology [24]. The bandwidth and timing requirements are the important factors in the design of large-scale neuromorphic hardware systems. Fig. 4 illustrates a sketch of a network with N neurons with a firing rate of R (R is typically 10Hz). Consider a very simple network model, where the network has a bisection bandwidth of B spikes/s, a spike latency of l, and a mean router link occupancy per spike of o. Since the bandwidth of a link is determined by the rate at which it can process spikes, the link bandwidth is 1/o spikes/s. Such a network requires time l to route the first spike, l + o for the second spike, and so on for spikes sharing a link. The number of links between the two halves of the network is $C = Bo$. Suppose the N/2 neurons in one half generate spikes that have to cross over to the other half. Since there are C links, even under a uniform distribution of traffic, the last group of packets will have a latency of $l + \left(\frac{N}{2C} - 1\right)o$ while the first set of packets will have a latency of l. Hence, the uncertainty in spike arrival will be in the range of $\left[0, \left(\frac{N}{2C} - 1\right)o\right]$ relative to when the spikes were generated. Note that for a fixed bisection bandwidth, it is better to have more parallel links (higher C) with lower bandwidth per link if the traffic can be distributed across them. (Note that the Biological extreme of dedicated wiring minimizes this quantity, but is infeasible in current silicon technology.)
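The simple network model above can be turned into a small calculation. The sketch assumes, as one reading of the model, that N/2 spikes must cross the bisection at once, that they are spread evenly over C links, and that each link has occupancy o = C/B (so that C links of bandwidth 1/o give bisection bandwidth B); the last group of spikes then waits (N/(2C) - 1)·o longer than the first. The function name and the example bandwidth are hypothetical.

```python
def arrival_uncertainty(n_neurons, bisection_bw, n_links):
    """Worst-case spread (seconds) in spike arrival across the bisection.

    Assumes N/2 spikes cross simultaneously, uniformly over n_links links.
    """
    occupancy = n_links / bisection_bw     # o = C / B, seconds per spike per link
    groups = n_neurons / (2 * n_links)     # spikes queued on each link
    return (groups - 1) * occupancy

B = 1e9  # hypothetical bisection bandwidth in spikes/s
for c in (10, 100, 1000):
    print(c, arrival_uncertainty(10**6, B, c))
```

For a fixed B the spread equals (N/2 - C)/B, so it shrinks as C grows, which is the sense in which more parallel, slower links are preferable.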
For a firing rate of R = 10Hz, Biological models typically care about temporal precision in the 0.1ms range [25] (i.e. $1/(10^3 \times R)$). Using this rule of thumb to bound the timing uncertainty, and using $o = C/B$, we have that
$$\left(\frac{N}{2C} - 1\right) o \le \frac{1}{10^3 R} \quad\Longrightarrow\quad B \ge 10^3 R \left(\frac{N}{2} - C\right),$$
a significantly stricter requirement than $B \ge NR/2$ for conventional networks where the packet timing does not carry information. For accelerated network modeling (i.e. values of R significantly faster than real-time), there is a "double penalty" for the speedup: one from simply a larger volume of traffic, and a second from a much stricter latency delivery constraint. For $N = 10^6$ and R = 10Hz, even with ≈ 1000 links between two halves of the network, this comes to B ≈ 50Gspikes/s rather than 5Mspikes/s for traditional networks. Note that both numbers scale down once we assume significant communication locality, i.e. if we assume that αN/2 rather than N/2 neurons generate spikes that have to cross over to the other half of the network. Also, the bandwidth requirement can be reduced by relaxing the temporal precision used by the system: going from a 0.1ms tolerance to a 1ms tolerance would cut the bandwidth requirement by a factor of ten. This is why most state-of-the-art neuromorphic systems have networks that at first glance look highly over-designed and underutilized: the latency constraint compounds the network bandwidth requirement.
Topology and router design
The NoC architecture is specified by its topology and routing strategy. Different topologies have been adopted to support higher throughput demands over time. One-dimensional, two-dimensional, and hierarchical networks are examples of NoC topologies that are utilized by the neuromorphic community. The link bandwidth between routers in a NoC can also be estimated using an analysis similar to the one above. Assuming that a cluster of neurons has on average $N_c$ neurons that have to communicate with neurons in a different cluster, if r is the average degree of a router and the average number of links traversed by a spike is d, then each link will have to support traffic that is approximately $N_c R d / r$. Using an argument similar to the one above, the latency uncertainty of a link is $\left[0, \left(\frac{N_c d}{r} - 1\right)o\right]$. Hence, the network has to be designed to support much higher per-link bandwidth to preserve spike timing, even though only a small fraction of the bandwidth is utilized in practice.
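The per-link traffic estimate quoted above, approximately N_c·R·d/r, is easy to evaluate for a candidate design. In the sketch below the function name and all example values (cluster size, hop count, router degree) are hypothetical and chosen only to show the arithmetic.

```python
def link_traffic(n_c, rate_hz, avg_hops, router_degree):
    """Approximate average traffic per link in spikes/s: N_c * R * d / r."""
    return n_c * rate_hz * avg_hops / router_degree

# Hypothetical example: 256 communicating neurons per cluster firing at 10 Hz,
# spikes crossing 8 links on average, routers of degree 4.
print(link_traffic(n_c=256, rate_hz=10.0, avg_hops=8, router_degree=4))  # 5120.0
```

The average load is only a few thousand spikes per second per link, far below what a digital link can carry, which is consistent with the observation that the bandwidth is provisioned for timing rather than for throughput.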
Fig. 5(b): Power consumption for different 2D-mesh network sizes, estimated using the DSENT tool [27], and for the existing neural cores: SpiNNaker [14], Neurogrid [20], DYNAPs [26] and TrueNorth [13].
A major difference between neuromorphic and conventional NoCs is the amount of network traffic generated by a node in the network. In contrast to conventional applications of NoCs, the computing nodes (neurons) in neuromorphic systems run at very low frequencies (on the order of Hz), and so drastically lower traffic is injected into the routing fabric. Fig. 5(a) shows the average latency for different networks with high workloads as simulated by a NoC simulator used for computer architecture studies [28]. As the figure suggests, the latency in networks with high workload varies significantly with network topology and routing algorithm in addition to the network load. However, as we have argued above, the link bandwidth has to be over-provisioned relative to conventional networks in order to keep the spike delivery uncertainty small. Hence, as far as the network is concerned, we are operating in a regime with very low workload compared to the peak feasible bandwidth. As can be seen (Fig. 5(a)), in networks with very low workload (below 0.01 spikes/neuron/cycle), the routing methodology and the network load do not significantly impact the network latency. Given the cycle time of NoC digital circuits (at most a few nanoseconds) and the low neuronal firing rates (on the order of a few Hz), most neuromorphic NoC workloads fall well below 0.001 spikes/neuron/cycle. This characteristic means that the network topology and routing algorithm are not as important in systems operating at Biological time scales as they are in conventional NoCs. Another differentiating feature is that the power consumption of a neuromorphic core is usually substantially lower than that of conventional compute cores, like microprocessors, that are traditionally used with NoC fabrics. For example, in both the TrueNorth [13] and DYNAPs [26] projects, the neural cores are custom designed and require a fraction of the power consumption of a standard low-power microprocessor. So, for neuromorphic systems, the power consumption of the routing fabric can be a significant portion of the total system power. Fig. 5(b) shows the network power estimated by an architectural NoC simulator [27] that is used to study on-chip networks. 
Apart from the SpiNNaker project (which consists of a large number of ARM cores as compute nodes), other neuromorphic systems have networks that consume orders of magnitude less power than those modeled by the NoC simulator because they are much more parsimonious in their hardware usage. We note that the estimated power figures of the different neuromorphic networks in Fig. 5(b) are not intended for direct comparison with each other, as they represent different choices in terms of network flexibility; we simply note that reducing the network power consumption is an important aspect of neuromorphic hardware design. In the light of new memory technologies, the power gap between computation and communication in custom-designed neuromorphic systems is likely to increase, making the network power consumption even more important in future neuromorphic systems.
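The low-workload claim made in this subsection (most neuromorphic NoC workloads fall well below 0.001 spikes/neuron/cycle) follows from multiplying the firing rate by the NoC cycle time. The sketch below assumes a 2 ns cycle as a representative value of "a few nanoseconds"; that number and the function name are illustrative assumptions.

```python
def spikes_per_neuron_per_cycle(rate_hz, cycle_time_s):
    """Injected load per neuron, in spikes per NoC clock cycle."""
    return rate_hz * cycle_time_s

load = spikes_per_neuron_per_cycle(rate_hz=10.0, cycle_time_s=2e-9)
print(load)             # roughly 2e-8 spikes/neuron/cycle
print(load < 0.001)     # True: orders of magnitude below the low-workload regime
```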
Existing Large-scale Neuromorphic CMOS Architectures
We briefly review major existing neuromorphic hardware systems and their design choices in addressing the design scalability challenges. As will be evident, each system is organized as a cluster of neurons and synapses with a global inter-cluster communication network. Table 1 shows examples of representative neuromorphic hardware systems that have been developed by academic and industrial research teams. The system architecture developed during the SyNAPSE project, called TrueNorth [13] , is the largest single chip built within the neuromorphic community with 1 million neurons, 256 million synapses, and a network of 4096 neurosynaptic cores. Each core contains 256 neurons and 256×256 synapses using SRAM-based cross-bar arrays. The neurons and synapses implemented in the "neurosynaptic core" are designed to have behavior that matches a deterministic software model for the system. The global routing topology is a two-dimensional mesh architecture.
Thanks to the power efficiency of asynchronous circuits and the use of a low-leakage CMOS process, the power figure reported for TrueNorth is its most distinguishing feature compared to other neuromorphic hardware efforts. The chip consumes only 72 mW, significantly lower than other neuromorphic substrates [13].
The Loihi chip [17], designed by Intel, is a many-core neuromorphic architecture that comprises 128 neural cores, three embedded x86 processors, and an asynchronous 2D mesh network-on-chip routing fabric for connecting the neural cores. Each neuromorphic core is equipped with a programmable learning engine to implement different training algorithms. The learning capability and more flexible routing network distinguish Loihi from prior large-scale custom-designed neuromorphic systems. Loihi is a fully digital architecture implemented in a 14 nm CMOS process, and its routing network is designed using asynchronous circuits.
The Neurogrid project built a system that consists almost entirely of custom mixed-signal hardware for modeling biological neurons and synapses. The core hardware element in Neurogrid is the Neurocore chip, a custom ASIC that uses analog VLSI to implement neurons and synapses, and digital asynchronous VLSI to implement spike-based communication. The chip was fabricated in a 180nm process technology, and contains a 256×256 array of neurons whose core functionality is implemented with analog circuits. Each neuron is implemented using custom analog circuitry that directly implements a quadratic integrate-and-fire (QIF) model, and is combined with four types of synapse circuits [20]. The routing architecture uses a tree topology, with hardware support for multicasting.
The SpiNNaker (Spiking Neural Network Architecture) project [14] at the University of Manchester has taken the most "general-purpose" approach to the design of large-scale neuromorphic systems. The core hardware element for the neuromorphic system is a custom-designed ASIC called the SpiNNaker chip. This chip includes eighteen ARM processor nodes (the ARM968 core available from ARM Ltd, one of the project's industrial partners), and a specially designed router for communication between SpiNNaker chips [29]. The routing topology is a two-dimensional torus, with additional diagonal connections. A complete SpiNNaker board contains 47 of these chips, and the goal is to assemble 1200 of these boards. A full SpiNNaker system of this size would consume about 90 kW.
Table 1: The specifications of the major neuromorphic hardware projects (columns: TrueNorth [13], SpiNNaker [14], Neurogrid [20], Loihi [17]). The power breakdown numbers are provided in the reference articles, except for the Neurogrid and Loihi chips.
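The system sizes quoted in this section can be cross-checked with simple arithmetic. The sketch below uses only the figures stated in the text (4096 cores of 256 neurons and 256×256 synapses for TrueNorth; 18 cores per chip, 47 chips per board, 1200 boards and about 90 kW for the full SpiNNaker system). The variable names are hypothetical, and the per-chip power is a plain average of the system figure, so it is assumed to include any board-level overhead.

```python
# TrueNorth: cores x neurons per core, and cores x synapses per core.
truenorth_cores = 4096
truenorth_neurons = truenorth_cores * 256            # 1,048,576 (about 1 million)
truenorth_synapses = truenorth_cores * 256 * 256     # 268,435,456 (256 x 2**20)

# SpiNNaker: ARM cores in the full planned system.
spinnaker_chips = 47 * 1200                          # 56,400 chips
spinnaker_cores = 18 * spinnaker_chips               # 1,015,200 ARM cores
watts_per_chip = 90e3 / spinnaker_chips              # average share of the 90 kW budget

print(truenorth_neurons, truenorth_synapses)
print(spinnaker_cores, round(watts_per_chip, 2))
```

The "256 million synapses" figure corresponds to 256 × 2^20, i.e. a million counted as 2^20.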
Summary
When examining a complete neuromorphic system, a few items become quite clear. First, the static power consumption is a large fraction of the total power budget. This is because neuromorphic systems operate at very slow timescales compared to the capabilities of modern CMOS devices. The traditional approach taken by conventional digital systems (namely running computation at a very high throughput, which amortizes the static power consumption) is not as easy to accomplish in the neuromorphic domain due to the "double penalty" imposed by strict latency constraints. Note that while our analytical modeling is quite simple, its simplicity is what makes it applicable to a large class of neuromorphic systems. Second, traditional on-chip network architectures consume significantly more power than the custom-designed networks in neuromorphic chips. In spite of this, the network power is a double-digit percentage of the total system power. This means that even if the rest of the system took zero power, the maximum power benefit would be limited to a factor of ten. Finally, the approach of using a large number of small clusters to achieve power efficiency means that any memory requirements for routing have to be distributed into a large number of small memories rather than a single monolithic memory. Hence, what becomes important is not how large a single memory bit is, but the effective density of a small memory array. The array efficiency of a memory array is the ratio of the area occupied by the storage elements (the bits) to the total area of the memory, which includes addressing as well as read/write circuitry. Even SRAMs (which arguably have the simplest external circuitry while having the largest storage element area) have array efficiencies as low as 50%-80% for sizes below 1Mb. A smaller storage element, as proposed by numerous researchers, makes the array efficiency even worse when the memory size is fixed. 
This observation was validated in a design that used phase-change memory technology as the synaptic element [55] .
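Two of the quantitative points in this summary can be sketched numerically: the bound on system-level power savings when the network takes a fixed fraction of the power, and the way array efficiency degrades when the storage cell shrinks but the memory size stays fixed. The function names and example areas are hypothetical, and the second function assumes that the peripheral (addressing and read/write) area does not change with the cell size.

```python
def max_system_power_gain(network_fraction):
    """Upper bound on total power reduction if everything except the network took zero power."""
    if not 0.0 < network_fraction <= 1.0:
        raise ValueError("network_fraction must be in (0, 1]")
    return 1.0 / network_fraction

def array_efficiency(n_bits, cell_area, periphery_area):
    """Storage-element area divided by total memory area."""
    storage = n_bits * cell_area
    return storage / (storage + periphery_area)

print(max_system_power_gain(0.10))            # network at 10% of power: at most 10x

# Same 1024-bit array and periphery, with a cell four times smaller.
print(array_efficiency(1024, 1.0, 1024.0))    # 0.5
print(array_efficiency(1024, 0.25, 1024.0))   # 0.2
```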
Less explored device technologies for neuromorphic hardware
In this section, we present some less-explored device technologies that have been proposed to improve the performance of memory systems, that also have applicability in improving the scalability of neuromorphic systems.
Three-dimensional integrated circuits
Three-dimensional (3D) integration has been explored as a way to improve memory system performance by vertical stacking of multiple memory device layers. This technology is mature enough to be commercially available, as it has demonstrable advantages in improving the integration density of memory with logic. Also, since neuromorphic circuits operate with extremely low power budgets, one of the major concerns in 3D integration, the problem of heat dissipation, is alleviated. Neuromorphic networks are typically composed of a massively parallel collection of neurons and synapses with large neuronal fan-outs. A vast majority of hardware resources in neuromorphic systems are therefore dedicated to the interconnections between the computing modules. Unlike biology, neuromorphic VLSI systems are implemented in 2D substrates. The large number of connections imposes serious design challenges for scaling up the size of neuromorphic networks in VLSI. Recent advances in the semiconductor industry, e.g. flip-chip and through-silicon vias (TSVs), enable vertical integration of integrated circuits. Such integration could lead to significant performance improvements in implementing highly interconnected networks, in comparison to their planar counterparts. TSV technology [57, 58] is one of the most promising solutions for stacking multiple dies in a small package and for memory integration with conventional CMOS. 3D TSV-based ICs are used in stacked memory and processing systems [59, 60], CMOS and MEMS image sensors [62], IoT SoCs [63], and 3D FPGAs [64]. Utilizing TSVs in vertical integration reduces wire complexity, the 2D die size, and the interconnection delays in high-density interconnect networks. However, such integration faces serious technological challenges from both design and physical standpoints. 
A major drawback in the design of 3D TSV ICs is the lack of commercial EDA tool support and of accurate electrical models that capture the physical properties of TSV connections integrated with CMOS circuits [61]. IC design tools equipped with detailed TSV characterization models can greatly help in investigating the full potential of this technology across different application domains. Depending on the operating frequency, TSV links can exhibit resistive or capacitive behavior. A detailed model and the frequency response of TSV channels are presented in [56]. Thermal power and cross-talk are other important factors to consider when using TSV connections with CMOS. Furthermore, TSV arrays may affect the behavior of CMOS-based circuits by inducing thermo-mechanical stress in the front end of line (FEOL) layer [58].
Three-dimensional TSV integration technology can enable a higher level of parallelism in highly interconnected neuromorphic computing networks with layered structures [65]. In such a configuration, each layer contains one or more neural modules, and TSV arrays provide the inter-module connections. Given the huge number of inter-neuron connections, it is impractical to use one TSV link per neural connection, mainly because of TSV-to-TSV coupling [66], crosstalk [57], and density issues [65]. TSV technology offers opportunities for energy and performance improvements in large-scale neuromorphic design by reducing the network distance per spike (parameter d in the analysis presented in Section 3) through higher-radix routers. The energy cost of a communication network is a direct function of the number of hops (routers and interconnects) that neural spikes traverse from the source to the destinations. Therefore, TSV-based neuromorphic systems are expected to offer a lower energy cost than 2D structures. Note, however, that router complexity in TSV-equipped hypercube neuromorphic designs increases because of the extra links required per inter-layer router. To investigate the feasibility and impact of TSV technology on large-scale neuromorphic designs, access to more detailed TSV models and design tools is essential.
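The hop-count argument can be illustrated with a short calculation. The following is a minimal sketch, not the analysis of Section 3: it assumes a plain mesh topology (rather than the hypercube mentioned above), uniformly random source and destination routers, and dimension-ordered routing, so that the distance per spike is the mean Manhattan distance. The function names and the 4096-router example are hypothetical choices made for illustration.

```python
from itertools import product
from typing import Sequence


def mean_hops_line(n: int) -> float:
    """Mean |i - j| for i, j drawn independently and uniformly from 0..n-1.

    Closed form: sum over all pairs of |i - j| is n(n^2 - 1)/3, divided by n^2.
    """
    return (n * n - 1) / (3 * n)


def mean_hops_mesh(dims: Sequence[int]) -> float:
    """Mean Manhattan distance between two uniformly chosen routers in a mesh.

    Distances along each dimension are independent, so the means add up.
    """
    return sum(mean_hops_line(n) for n in dims)


def mean_hops_bruteforce(dims: Sequence[int]) -> float:
    """Same quantity by exhaustive enumeration (only practical for small meshes)."""
    nodes = list(product(*(range(n) for n in dims)))
    total = sum(
        sum(abs(a - b) for a, b in zip(u, v)) for u in nodes for v in nodes
    )
    return total / len(nodes) ** 2


if __name__ == "__main__":
    # Hypothetical example: 4096 routers arranged as a planar mesh
    # versus a 16-layer stack with vertical (TSV) links between layers.
    flat = mean_hops_mesh((64, 64))
    stacked = mean_hops_mesh((16, 16, 16))
    print(f"2D 64x64 mesh:    {flat:.2f} hops per spike on average")
    print(f"3D 16x16x16 mesh: {stacked:.2f} hops per spike on average")
    print(f"ratio: {flat / stacked:.2f}x")
```

Under these assumptions the stacked arrangement needs fewer hops per spike for the same number of routers, at the price of two additional ports on every router that has neighbors above and below it. The sketch ignores the per-hop energy difference between planar wires and TSVs, which would have to come from a detailed TSV model.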
Nano-Electro-Mechanical Systems (NEMS)
Nano-electro-mechanical systems are miniaturized mechanical systems that use the mechanical properties of silicon (and other materials), in addition to its electronic properties, to build circuits. NEMS has also been proposed as a way to reduce leakage currents in CMOS, and hence could benefit neuromorphic systems where leakage power dominates. Excessive leakage current of field-effect transistors is a major obstacle in the design of neuromorphic circuits in ultra deep sub-threshold CMOS processes [68, 69]. A device's off-current consumes power even when the transistor is inactive. The problem is aggravated in advanced CMOS processes, e.g. 28 nm, where the off-current is non-negligible in comparison to the on-current of the transistors, leading to an increase in total power dissipation. The power breakdown presented in Table 1 shows that static power consumption becomes a dominant factor of total power dissipation in large-scale custom neuromorphic systems, especially once computation power is reduced using emerging devices. For instance, the static power in TrueNorth [13], a fully digital design, accounts for 60% of total power at a 20 Hz firing rate. The leakage power issue in deep-submicron mixed-signal neuromorphic designs would be even more critical than in digital designs [68]. Mixed-signal designs typically occupy a larger silicon area (and hence have higher static power) due to their highly parallel structure. In addition, the leakage current may affect the behavior of analog sub-threshold circuit blocks. Neuromorphic sub-threshold CMOS neuron and synapse circuits typically operate with pico-ampere currents, while the off-current of NMOS transistors in 28 nm can be as high as pico-amperes (Fig. 6), even for long transistors (e.g. length = 300 nm) that are integrated with digital switches. In other words, a larger leakage current not only leads to higher power dissipation, but also narrows the range of currents in which the neuromorphic sub-threshold circuits can reliably operate.
Nano-electro-mechanical (NEM) switches have recently been investigated as a potential alternative to CMOS [69] [70] [71] [72] [73]. The potential of NEM devices has been investigated for SRAM cells [70, 74, 75], FPGAs [76, 77], 3D integration [78], asynchronous logic [67], etc. Hybrid CMOS-NEMS neuromorphic circuits, presented in [69], implement the leaky-integrate-fire (LIF) neuron model [79] and take a step toward reducing the leakage power in a neuromorphic system. Normally-open and normally-closed NEM relays can be used to build replacements for NMOS and PMOS switches. A challenge with the ideal scenario depicted here is that manufacturing uncertainty can require different body voltages for different NEMS devices. Compared to digital CMOS switches, the switching delay of NEMS is high, and the number of times a device can switch before it fails is limited. However, since neuromorphic systems operating on Biological time scales have extremely low activity, the reliability of NEM switches may not be as important a factor in this application domain as it is in high-speed digital logic.
In addition to the neural computation circuits, NEMS switches can be used in local communication blocks to alleviate leakage power issues. Most custom-designed neuromorphic CMOS designs, discussed in Sec. 3.4, use SRAM-based crossbar arrays to implement local connectivity. The SRAM blocks are typically designed using conventional 6-transistor circuits and suffer from high leakage power consumption. Within this context, future research may investigate the potential of NEMS switches in combination with CMOS for implementing local synaptic connections, as an alternative to high-leakage SRAM cells in large-scale neuromorphic designs.
Discussion and Conclusion
In recent years, research interest in the potential of emergent nanoscale device technologies has been growing, as researchers have investigated their application to the design of more efficient memory and interconnection technologies. Large-scale neural networks are one of many applications that can benefit significantly from reliable emergent memory technologies. A large number of research articles have explored the potential of using new memory devices, e.g. memristors, to replace CMOS-based memories in implementing neuromorphic computing units such as synapses. Yield, scalability (e.g. size and control overhead), and reliability issues are major obstacles to using the emergent nanoscale devices with current technologies. Although large-scale neuromorphic hardware requires a considerable amount of storage in total, the storage is not centralized; rather, it consists of a large number of small blocks (e.g. [13]) that are distributed among the clusters of neurons and synapses. Hence, the control overhead of accessing a small memory block is nontrivial, and emerging memory technologies do not provide a means to tackle this issue unless they can also reduce this overhead. To use the emergent memory technologies efficiently in scalable neuromorphic systems, compact and power-efficient small memories are needed.
Successful utilization of the emergent memories in large-scale neuromorphic designs is expected to further reduce the power required for performing computation. Despite this potential improvement, the impact on total power consumption is debatable. In recent neuromorphic developments, the power required for computation is not the dominant component of total power. Static power and communication power account for a major fraction of total system power in large-scale designs. Although the emergent memory devices partially help to reduce static power by replacing high-leakage SRAM blocks, the communication power requirements would remain a major issue.
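The limited effect of cheaper computation on total power can be made concrete with a small calculation. This is a sketch under stated assumptions: the 60% static share is the TrueNorth figure quoted earlier in the text, while the 25%/15% split of the remainder between communication and computation is hypothetical and chosen only for illustration, as is the function name.

```python
from typing import Dict, Tuple


def power_after_compute_reduction(
    static: float, communication: float, computation: float, reduction: float
) -> Tuple[float, Dict[str, float]]:
    """Divide computation power by `reduction`; return the new total and shares.

    Static and communication power are left unchanged, which is the
    assumption this sketch is meant to examine.
    """
    new_computation = computation / reduction
    total = static + communication + new_computation
    shares = {
        "static": static / total,
        "communication": communication / total,
        "computation": new_computation / total,
    }
    return total, shares


if __name__ == "__main__":
    # Normalized budget: static share from the text, the rest hypothetical.
    baseline = dict(static=0.60, communication=0.25, computation=0.15)
    for reduction in (1.0, 10.0, 100.0):
        total, shares = power_after_compute_reduction(**baseline, reduction=reduction)
        print(
            f"computation / {reduction:>5.0f}: total = {total:.3f}, "
            f"static share = {shares['static']:.1%}, "
            f"communication share = {shares['communication']:.1%}"
        )
```

With these numbers, total power can never fall below the sum of the static and communication terms, however efficient the computation becomes, and the relative weight of both terms grows as computation gets cheaper.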
Emergent 3-D integration technologies such as TSV offer promising solutions to overcome the limitations facing current interconnection technology. TSV-based designs have been explored for a variety of systems such as image sensors, memories, FPGAs and SoCs. The 3-D integration technologies have significant potential for highly interconnected large neuromorphic networks, as they allow multiple dies to be stacked and reduce wire complexity. Currently this technology is not easily accessible to designers, but this can change with time.
Because of their near-zero leakage current, CMOS-compatible NEMS devices could alleviate the static power issues in CMOS circuit implementations. Hybrid CMOS-NEMS designs have been investigated for memories, FPGAs and 3D SoCs. These switches could potentially be used in the communication fabrics of neuromorphic networks.
In this paper, we elaborated on the communication requirements of large-scale neuromorphic designs in terms of power and bandwidth requirements, provided a detailed comparison with conventional network-on-chip architecture, and discussed the potential improvements and issues in integrating the emergent nanoscale technologies in large-scale hardware neural networks. Research advances in emergent memory technologies can offer a route toward improved biological-scale neural computing systems only if they address communication and idle power requirements in addition to the computational requirements of such systems.
