Neural Network (NN) accelerators with emerging ReRAM (resistive random access memory) technologies have been investigated as one of the promising solutions to address the memory wall challenge, due to the unique capability of processing-in-memory within ReRAM-crossbar-based processing elements (PEs). However, the high efficiency and high density advantages of ReRAM have not been fully utilized due to the huge communication demands among PEs and the overhead of peripheral circuits.
Introduction
We propose a full-stack solution that utilizes hardware resources efficiently, rather than complicating hardware. It is composed of a novel reconfigurable architecture for ReRAM-based NN accelerators, the Field Programmable Synapse Array (FPSA), and a software system including the neural synthesizer, the spatial-to-temporal mapper, and placement & routing.
1 It is the resistive-capacitive delay of just the crossbar circuits.
For communication, we optimize the communication subsystem with a reconfigurable routing architecture, which provides massive wiring resources for extremely high bandwidth and low latency, and we utilize them with the placement & routing tool. Due to this optimization, we achieve a speedup of about two orders of magnitude over PRIME.
For peripheral circuits, we employ spiking schema to simplify the PE circuit while still maintaining the functionality of vector-matrix multiplication and Rectified Linear Unit (ReLU) activation for artificial neural networks (ANNs). We leverage the neural synthesizer to make the NN computation more compact and enable our high-density homogeneous hardware to support different kinds of NN operations in order to fully utilize the advantage of ReRAM. The latency and area of the entire PE are reduced by 94.90% and 36.63% respectively, which provides another order-of-magnitude speedup.
Last but not least, we introduce spiking memory blocks (SMBs) and configurable logic blocks (CLBs) in hardware as on-chip buffer and programmable logic. They are utilized by the spatial-to-temporal mapper to achieve optimized resource allocation and scheduling in order to balance the storage and computation requirements of NN, especially catering to the weight sharing property of convolutional neural networks (CNNs). It can lead to super-linear performance increase with more hardware resources.
In our design, the performance is no longer bounded by the communication bottleneck, and the peripheral circuit overhead is significantly reduced. Experiments show that the performance is increased by 1000× compared to PRIME [9], which is all due to the architectural and system improvements.
ReRAM-device variation is also considered: We propose a novel weight representation method, the add method, to decrease device variation exposed to NN models. It can approach the full precision accuracy for large-scale NNs.
The contributions of this paper are summarized as follows.
• We propose a full-stack solution for ReRAM-based NN accelerators, including a reconfigurable architecture, FPSA, and the software hierarchy. The latter fully utilizes the various kinds of programmable resources provided by the former to deploy NNs efficiently. Evaluations show that our approach can outperform a state-of-the-art ReRAM-based accelerator, PRIME, by up to 1000× for NN inference.
• We have observed that communication is the bottleneck of existing ReRAM-based NN accelerators and then propose to optimize it with a reconfigurable routing architecture to break this bound.
• We make the PE design much more compact and efficient by leveraging the spiking schema. The latency is decreased by 19.6× and the density is improved by 1.6×.
Finally, we believe that it is a new design philosophy for ReRAM-based NN accelerators. Inspired by the spirit of the reduced instruction set computer (RISC) architecture of the conventional computer systems, our compact hardware design enables extremely high performance and can support rich NN functionalities with the software stack.
Background and Related Work

ReRAM-Based NN Acceleration
Neural network applications are both memory-intensive and compute-intensive. Thus, many NN accelerators [3, 5, 7, 8, 13, 15, 22, 23, 30, 32, 38] based on mature digital circuits have been built to speed up NN computations.
To further increase the performance and eliminate other problems such as the memory wall, quite a few studies on ReRAM-based NN accelerators and neuromorphic hardware [2, 9, 18, 21, 24, 33, 36, 37, 39, 41, 44] have also been proposed.
Resistive random access memory, known as ReRAM, is a type of emerging non-volatile memory, which stores information using its resistance. Prior work [17] shows that the ReRAM-based crossbar is very efficient at computing analog vector-matrix multiplications in the locations where the matrices are stored. As shown in Figure 1, there is a ReRAM cell at each intersection of the crossbar. An input voltage vector $\{V_i\}$ is applied to the rows and is multiplied by the conductance matrix of ReRAM cells $\{G_{ji}\}$. The resulting currents $\{I_j\}$ are summed across each column. The output current vector can be calculated by $I = GV$.
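The relation $I = GV$ can be illustrated with a minimal sketch of an ideal crossbar. It assumes perfectly linear cells and ignores wire resistance; the function name `crossbar_currents` and the example values are hypothetical, chosen only for illustration.

```python
def crossbar_currents(G, V):
    """Ideal crossbar: I[j] = sum_i G[j][i] * V[i].

    G[j][i] is the conductance of the cell at input row i and output column j
    (microsiemens here), V[i] is the voltage applied to row i (volts), so the
    returned column currents are in microamperes.
    """
    for column in G:
        if len(column) != len(V):
            raise ValueError("each column needs one conductance per input row")
    return [sum(g * v for g, v in zip(column, V)) for column in G]


# Hypothetical 3-row, 2-column crossbar.
G = [[1, 2, 0],
     [3, 0, 1]]
V = [0.5, 1.0, 0.25]
I = crossbar_currents(G, V)  # one current per column
```

Each output current is a dot product computed where the weights are stored, which is the processing-in-memory property the crossbar provides.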
Existing studies on ReRAM-based NN accelerators [9, 39, 41] treat the ReRAM-crossbar as a very low-precision vector-matrix multiplication engine, and use it as the building block, combined with peripheral circuits, to construct NN accelerators. To support higher precision, these studies usually use the splicing method, which employs multiple cells for different bits of the high-precision number and shift-adds the partial sums of the different bits to get the final result. For example, ISAAC conservatively uses 8 cells to represent one 16-bit weight; each cell represents 2 bits. PRIME [9] and PipeLayer [41] are modified from the ReRAM-based memory chip. Thus, their PEs are connected through the internal hierarchical memory bus. ISAAC [39] is a dedicated accelerator, which employs a NoC.
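The splicing method can be sketched in a few lines. The example below assumes unsigned 16-bit weights split into eight 2-bit slices, as in the ISAAC example above, and integer inputs; the function names are hypothetical and the digital arithmetic only stands in for the analog crossbar columns.

```python
BITS_PER_CELL = 2      # each cell stores 2 bits
CELLS_PER_WEIGHT = 8   # 8 cells together hold one 16-bit weight


def split_weight(w):
    """Split an unsigned 16-bit weight into 2-bit slices, least significant first."""
    if not 0 <= w < 1 << (BITS_PER_CELL * CELLS_PER_WEIGHT):
        raise ValueError("weight does not fit in 16 bits")
    mask = (1 << BITS_PER_CELL) - 1
    return [(w >> (BITS_PER_CELL * k)) & mask for k in range(CELLS_PER_WEIGHT)]


def spliced_dot(weights, inputs):
    """Dot product computed per slice, then merged by shift-and-add."""
    slices = [split_weight(w) for w in weights]
    total = 0
    for k in range(CELLS_PER_WEIGHT):
        # One low-precision column: the k-th slice of every weight.
        partial = sum(s[k] * x for s, x in zip(slices, inputs))
        # Shift the partial sum to its bit position and accumulate.
        total += partial << (BITS_PER_CELL * k)
    return total
```

For example, `spliced_dot([40000, 123, 65535], [3, 5, 7])` equals the ordinary dot product of the same vectors, at the cost of eight low-precision columns and a shift-add stage per output.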
Reconfigurable Architecture
Reconfigurable architecture provides much higher efficiency than general-purpose processors while providing more flexibility than Application Specific Integrated Circuits (ASICs). There are also some reconfigurable routing architectures designed for NN accelerators such as MAERI [27], but they target accelerators based on digital circuits. Their capability is still far from the demands of ReRAM-based PEs. FPGA is one of the most widely used reconfigurable architectures, composed of many Configurable Logic Blocks (CLBs). The main function modules of a CLB are Look-Up Tables (LUTs) that can be configured to implement any logic function. The routing architecture of an FPGA chip occupies up to 90% of the total area [14], and provides most of the reconfigurability. It consists of wires and programmable switches. The programmable switches use Connection Boxes (CBs) to configure the connections from CLBs to the routing network, and use Switch Boxes (SBs) to configure the connections between different wire segments. There have been many studies [10, 34, 45, 46, 48, 50] on using ReRAM to augment existing reconfigurable architectures. For example, ReRAM cells are used to replace SBs and CBs in FPGA [10] and to implement arbitrary logic functions [50].
Figure 2. Performance vs. Area for the peak performance, the ideal case (with infinite bandwidth), and the real case for running VGG16 [40] on PRIME [9] (45nm process). The performance of the real case is bounded by communication.
Motivation
We analyze the scalability and performance of PRIME [9], which uses the memory bus as the communication subsystem; we assume that its structure can scale out linearly under the 45nm process. A large-scale CNN, VGG16 [40] for ImageNet [11], is employed as the NN application.
Based on the hardware configurations and NN requirements, we can get three performance bounds (Figure 2) as follows.
Computation Bound. It is the theoretical upper bound (defined as the peak performance in this paper): the product of the number of PEs and the performance of one PE, i.e., the total computation capability provided.
Utilization Bound. Usually, computation and communication capabilities are two important factors restricting performance improvement. But even if the communication is ideal, the performance (called the ideal performance) still cannot reach the peak value, because of the following two utilization issues:
• Temporal Utilization (Load Balance). The first is the imbalance between storage and computation requirements of NNs, especially for convolutional neural networks (CNNs). For example, the first two convolutional layers of VGG16 only occupy 0.028% of weight storage but consume 12.5% of computation because the weights are reused by 224 × 224 different regions of the input feature map, while the fully connected layers take 89.3% of storage but only consume 0.8% of computation. In contrast, ReRAM-crossbars integrate computation and storage in the same physical place; thus a PE can only provide computing power commensurate with its storage capacity. To map a neural network onto the ReRAM-based NN accelerator, the prerequisite is that there should be enough PEs for all the weight parameters. This mapping is quite unbalanced: about 0.028% of PEs have to process 12.5% of computation and become the bottleneck, while the utilization of the other PEs is low. This issue can be solved when more PEs are available: we can duplicate these layers' weights onto more PEs to speed them up significantly. For example, adding an extra 0.028% of PEs for the first two layers can double the performance. That is why the first half of the ideal performance curve shows a super-linear increase. The curve will converge to linear scalability and approach the peak performance when different layers are balanced.
• Spatial Utilization (Crossbar Mapping). The fixed size of crossbars cannot match weight matrices of different scales perfectly, which also affects the PE utilization. Between the two, the first is the main issue.
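The load-balance argument can be checked with a coarse back-of-the-envelope model. The sketch below assumes that each layer group owns a share of PEs equal to its storage share, that its processing time is proportional to its computation share divided by its PE share, and that the pipeline runs at the pace of the slowest group. The "other" group is obtained by subtraction from the percentages quoted above and lumps many layers together, so it is only an approximation; all names are hypothetical.

```python
def bottleneck_time(groups, extra=None):
    """Relative time of the slowest group: computation share / PE share."""
    extra = extra or {}
    return max(compute / (pes + extra.get(name, 0.0))
               for name, (pes, compute) in groups.items())


# (PE share in %, computation share in %), using the VGG16 figures from the text.
groups = {
    "conv1-2": (0.028, 12.5),
    "other": (100 - 0.028 - 89.3, 100 - 12.5 - 0.8),  # remainder, by subtraction
    "fc": (89.3, 0.8),
}

base = bottleneck_time(groups)
# Duplicate the first two layers once: 0.028% more PEs in total.
doubled = bottleneck_time(groups, {"conv1-2": 0.028})
speedup = base / doubled
```

Under these assumptions the first two layers dominate by a wide margin, so duplicating them once halves the bottleneck time, which matches the super-linear region of the ideal curve.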
Communication Bound. In real cases with limited bandwidth, the utilization cannot be improved efficiently when more PEs are provided because the communication subsystem cannot fetch enough data in time for the PEs. This leads to a large gap with the ideal case.
Currently, PRIME has tried to balance the computation and communication requirements. However, due to its limited bus bandwidth, its real performance is far below the ideal value (two orders of magnitude lower than the latter).
Based on these observations, it is reasonable to improve the performance of ReRAM-based accelerators with the following methods in order.
1. Improving Communication. We should improve the communication subsystem to break the communication bound.
2. Reducing Area. We should reduce the area of a single PE to push the performance to the high-utilization region of the utilization bound for a given chip area.
3. Reducing Latency. We should reduce the latency of PEs to increase the peak performance (the upper bound) further.
Accordingly, we adopt the reconfigurable routing architecture first and then design simplified PE circuits to reduce area and latency, which are given in Section 4; the whole system software stack is proposed in Section 5.
Architecture Design
Figure 3 shows the overview of the FPSA architecture. It contains three kinds of function blocks: ReRAM-based processing elements (PEs) for computation, spiking memory blocks (SMBs) for buffering, and configurable logic blocks (CLBs) for controlling. These blocks are connected through a reconfigurable routing architecture. Function blocks and the routing architecture are all programmable, providing massive computation, buffering, controlling, and wiring resources for software to utilize.
To reduce the peripheral circuit overhead, we employ spiking schema to perform the vector-matrix multiplication. It uses the spike count to represent a high-precision number rather than the amplitude of an analog signal. The area and latency can be significantly reduced with this schema. In addition, the spiking memory block is customized to buffer spiking signals.
Routing Architecture
PEs and other function blocks are connected by the routing architecture and work in parallel in a pipelined manner. The pipeline clock cycle is bounded by the maximum latency of all pipeline stages, including the computation and communication latency. As mentioned before, the computation time has been significantly reduced by the ReRAM-crossbar, which makes communication the system bottleneck.
Therefore, we adopt the reconfigurable routing architecture widely used in FPGA chips, instead of the memory bus or NoC in existing NN accelerators. Compared to the memory bus and NoC that reuse physical channels for different traffic and provide flexible runtime data-path, the reconfigurable routing architecture assigns individual channels for each signal in the configuration phase and has a fixed runtime data-path (since the NN topology is fixed, the runtime flexibility is unnecessary). Furthermore, compared to the bus and NoC where the worst communication latency is not guaranteed, the maximum latency of critical path can be evaluated in advance.
One of the most widely used FPGA routing architectures is the island-style architecture: configurable logic blocks (CLBs) are connected to the wiring network through connection boxes (CBs) and different wiring segments are connected through switch boxes (SBs). Normally, the routing architecture consumes most of the FPGA chip area [14] . In our design, the area consumption would be greater because of more fan-in/outs in the ReRAM-based PEs than those of CLBs in normal FPGA.
To reduce this overhead, we adopt the approach of previous work, mrFPGA [10], which employs ReRAM cells to construct CBs and SBs to reduce the area consumption. Figure 3 provides a detailed view of the routing architecture, in which SBs and CBs are placed over the function blocks. Specifically, the connections in SBs and CBs are decided by the resistance of the ReRAM cells. For example, an ReRAM cell with high resistance means that there is no connection between the two corresponding segments, while low resistance means that they are connected. Figure 3 also provides the detailed wiring and layout inside CBs and SBs, which only use five metal layers from M5 to M9 without resource conflict. Function blocks are connected to the wiring network through the CBs at four sides.
Processing Elements
We use spiking schema to simplify the peripheral circuits of the PE. The inputs of the PE are digital spike trains that use the spike count to represent a number between 0 and 1. Although it requires $2^n$ spikes to represent a number of $n$ bits, processing spikes is, overall, much more efficient than processing high-precision analog signals.
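A minimal sketch of this number representation follows. It assumes a sampling window of $2^n$ cycles and spreads the spikes evenly over the window; the even spacing and the function names are assumptions made for illustration, since only the spike count carries the value.

```python
def encode(value, n_bits):
    """Encode a value in [0, 1] as a 0/1 spike train of 2**n_bits cycles."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    window = 1 << n_bits
    count = round(value * window)  # number of spikes in the window
    # Emit a spike whenever the running quota floor(t * count / window) increases.
    return [1 if (t + 1) * count // window > t * count // window else 0
            for t in range(window)]


def decode(train):
    """Recover the represented value from the spike count."""
    return sum(train) / len(train)


train = encode(0.75, 3)  # 8-cycle window, 6 spikes
value = decode(train)
```

The window length sets the precision: an $n$-bit number needs a window of $2^n$ cycles, which is the cost the text refers to.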
The core of the PE is an ReRAM crossbar followed by spiking neuron circuits. The input signal will be converted into a charging voltage and applied to each row of the crossbar. Then the resulting current of each column will be injected into the corresponding neuron circuit, which accumulates the current and issues a spike when the threshold voltage is reached.
In order to handle negative weights with positive conductance values, we use two physically adjacent columns to represent one logical column of the weight matrix, one for the positive part and one for the negative part. The output spike train of the negative column will be subtracted from the positive one to get the final output.
Accordingly, the main components of a PE are charging units (one for each row), the ReRAM-crossbar, neuron units (one for each column), and spike subtracters (one for every two columns). The overview of a PE is shown in Figure 4A.
Charging Unit. As shown in Figure 4B, since the input spike is a 1-bit signal, the DAC can be simplified to a transistor. When a spike signal arrives, the transistor will turn on and the charging voltage will be applied to this row.
ReRAM Crossbar. Figure 4C is the ReRAM crossbar. Each row connects to an input charging unit and each column connects to an output neuron unit. ReRAM cells are at the intersections of the crossbar.
Neuron Unit. It is an analog implementation of a widely used spiking neuron model, the integrate-and-fire (IF) model. As shown in Figure 4D, it has a capacitor to integrate the current from the corresponding column. When its internal voltage reaches the threshold voltage, a spike signal will be stored in the S-R latch; the discharging unit will be turned on to discharge the capacitor until the voltage reaches the reset value. The discharging unit can also be triggered by a reset signal because we use the spike count in a sampling window to represent a number. Thus, a reset signal will be sent to clear internal states before a new sampling window begins.
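The behavior of the neuron unit can be sketched with a discrete-time integrate-and-fire model. This is a simplification under stated assumptions: the capacitor voltage is taken to rise by a given increment each cycle (the real RC charging is nonlinear), firing resets the voltage to the reset value, and the function name and parameters are hypothetical.

```python
def if_neuron(increments, threshold, reset=0):
    """Integrate per-cycle voltage increments; fire and reset at the threshold.

    Returns the output spike train (one 0/1 entry per cycle). The state starts
    at `reset`, which models the reset signal sent before a sampling window.
    """
    voltage = reset
    spikes = []
    for dv in increments:
        voltage += dv
        if voltage >= threshold:
            spikes.append(1)
            voltage = reset  # discharge the capacitor to the reset value
        else:
            spikes.append(0)
    return spikes


slow = if_neuron([1] * 8, threshold=2)  # fires every second cycle
fast = if_neuron([2] * 8, threshold=2)  # twice the input, twice the spikes
```

In this model the output spike count grows in proportion to the accumulated input, which is the property the proof later in this section relies on.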
Spike Subtracter. Figure 4E shows the circuit of the spike subtracter. It has two input spike trains from the corresponding two neuron units. The output is also a spike train, whose spike count is the difference of the two inputs. The working mechanism is that the spikes from the negative neuron unit will block the next spike coming from the positive neuron.
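The blocking mechanism can be sketched as a small state machine. The model below assumes that each negative spike leaves one pending "block" and that a negative spike arriving in the same cycle as a positive spike blocks that spike; both the same-cycle rule and the function name are assumptions made for illustration.

```python
def subtract_spikes(pos, neg):
    """Output spike train in which each spike on `neg` blocks the next spike on `pos`."""
    if len(pos) != len(neg):
        raise ValueError("spike trains must cover the same sampling window")
    pending = 0  # negative spikes that have not blocked anything yet
    out = []
    for p, n in zip(pos, neg):
        pending += n
        if p and pending > 0:
            pending -= 1   # this positive spike is blocked
            out.append(0)
        else:
            out.append(p)  # pass the positive spike (or no spike) through
    return out


out = subtract_spikes([1, 1, 0, 1, 1, 0, 1, 1],   # 6 positive spikes
                      [0, 1, 0, 0, 1, 0, 0, 0])   # 2 negative spikes
```

Here the output carries 6 − 2 = 4 spikes, and when the negative train has more spikes than the positive one the output is empty, which gives the ReLU behavior. In this simplified model a negative spike that arrives after the last positive spike blocks nothing, so the count equals the difference only when the two trains are interleaved, as they are with rate-coded inputs.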
Although we use spiking schema in our circuit design, the computation achieved by the circuit is just a vector-matrix multiplication followed by the ReLU activation function; the precision depends on the size of the sampling window. The proof is as follows. The equivalent charging circuit is shown in Figure 4F. We denote the charging voltage from the voltage source as $V_{dd}$, the capacitance of the neuron unit as $C$, and the charging time of each clock cycle as $\tau$. For the $j$-th output neuron unit, the equivalent resistance of the ReRAM-crossbar is denoted as $R_j(t)$ at time $t$. We suppose that, starting from the reset voltage $V_{re}$, the neuron unit's capacitor reaches the threshold $V_{th}$ in the $T$-th cycle. In accordance with the model of charging a capacitor in an RC circuit, Equation 1 gives the capacitor's voltage $U_T$ at cycle $T$.
For convenience, we denote the right-hand side of Equation 2 as $\eta$ because it is a constant. On the left-hand side, the equivalent resistance only counts the rows with spike inputs. Therefore, we can derive Equation 3 as follows, where $s_i(t)$ is the spike signal for the $i$-th row at time $t$ and $g_{ji}$ is the conductance of the cell at the intersection of the $i$-th row and the $j$-th column.
Suppose the size of the sampling window is $\Gamma$ cycles. During this period, the spike counts of the $i$-th input row and the $j$-th output column are $X_i$ and $Y_j$ respectively. Thus, the voltage of the capacitor has reached the threshold $Y_j$ times and then we have Equation 4.
By definition, $X_i$ is the sum of $s_i(t)$ over the sampling window $\Gamma$. Thus, the relationship between the input and output spike counts is shown in Equation 5.
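Further, we connect two columns to one spike subtracter to support negative weight values. Suppose the corresponding spike counts and conductance values for the positive and negative columns are $Y_j^+$, $Y_j^-$ and $g_{ji}^+$, $g_{ji}^-$ respectively. The subtracter outputs $Y_j^+ - Y_j^-$ spikes when this difference is positive; otherwise the output spike count is 0. Thus the final spike count from the $j$-th output port is shown in Equation 6.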
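The chain of steps in this proof can be written out as follows. This is a sketch reconstructed from the definitions in the text, assuming ideal RC charging toward $V_{dd}$ for a time $\tau$ per cycle and neglecting the charge lost when the capacitor overshoots the threshold; the exact form of each line is an assumption, and the superscripts $+$ and $-$ for the positive and negative columns are notation introduced here.

```latex
% Sketch under the stated assumptions; symbols follow the surrounding text.
\begin{align}
U_T &= V_{dd} - (V_{dd} - V_{re})
       \exp\!\Big(-\frac{\tau}{C}\sum_{t=1}^{T}\frac{1}{R_j(t)}\Big) \\
\sum_{t=1}^{T}\frac{1}{R_j(t)} &= \frac{C}{\tau}
       \ln\frac{V_{dd} - V_{re}}{V_{dd} - V_{th}} \equiv \eta
       \qquad \text{(setting } U_T = V_{th}\text{)} \\
\sum_{t=1}^{T}\sum_{i} s_i(t)\, g_{ji} &= \eta
       \qquad \text{(since } 1/R_j(t) = \textstyle\sum_i s_i(t)\, g_{ji}\text{)} \\
\sum_{t=1}^{\Gamma}\sum_{i} s_i(t)\, g_{ji} &\approx \eta\, Y_j \\
Y_j &\approx \frac{1}{\eta}\sum_{i} g_{ji}\, X_i
       \qquad \text{(using } X_i = \textstyle\sum_{t=1}^{\Gamma} s_i(t)\text{)} \\
Y_j = \max\big(Y_j^{+} - Y_j^{-},\, 0\big)
    &\approx \mathrm{ReLU}\Big(\frac{1}{\eta}\sum_{i}
       \big(g_{ji}^{+} - g_{ji}^{-}\big) X_i\Big)
\end{align}
```

The last line is a vector-matrix multiplication with effective weights $(g_{ji}^{+} - g_{ji}^{-})/\eta$ followed by ReLU, with a quantization error that shrinks as the sampling window $\Gamma$ grows.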
Session: Machine Learning I ASPLOS'19, April 13-17, 2019, Providence, RI, USA
In conclusion, the difference from existing ReRAM-based accelerators that employ spiking schema (e.g., PipeLayer [41]) is that we directly charge the capacitor and transmit spike trains between PEs. Thus, the overhead of current mirrors and encoders/decoders for spike trains can be removed. Equation 6 shows that with this simplification we can still complete the vector-matrix multiplication followed by ReLU. In addition, owing to the area reduction, we do not need to reuse peripheral circuits for different rows and columns. They can process the input and output of an ReRAM-crossbar in parallel. In contrast, existing ReRAM-based accelerators usually share ADCs and/or DACs to reduce the area overhead, which also leads to a corresponding increase in processing delay (e.g., in ISAAC [39], 128 crossbar-columns share one ADC). Our approach achieves a good balance in terms of function, area cost, and time delay. Quantitative evaluation will be given in Section 6.
Spiking Memory Block
As shown in Figure 3, in addition to the computation resources provided by PEs, we also have spiking memory blocks (SMBs) to provide on-chip buffers for the intermediate data.
Since the size of on-chip buffers has a significant impact on chip area, we only store the spike counts instead of the spike trains to fully use the buffers. The counters and spike generators are embedded inside the SMB to do the encoding and decoding between spike counts and spike trains; thus the SMB can directly send and receive spike trains but only store the spike counts. The internal memory is indexed by bits so that it can fit any sampling window size (e.g., when the sampling window is $2^n$, it can store the spike counts in units of $n$ bits).
Although we heavily adopt ReRAM in our PE design and routing architecture, we still use SRAM for the SMB. ReRAMs are not suitable for buffers because they have low endurance (they can support about $10^{12}$ writes).
Configurable Logic Block
Further, we include configurable logic blocks (CLBs) to provide logic resources for controlling, as shown in Figure 3. The control signals for PEs and SMBs are generated by the CLBs.
We also use SRAMs to implement the LUTs in CLBs. Although ReRAM provides higher density than SRAM, it requires current sense amplifiers to read data, which consume a lot of area. Thus, its area efficiency is very poor when the capacity is small: a conventional 6-input LUT can be implemented with a 64-bit memory. According to NVSim [12], the area of a 64-bit SRAM is 35.129 μm² under the 45nm process while the area of an ReRAM of the same capacity is 172.229 μm². Thus, CLBs contain multiple SRAM-based LUTs, flip-flops, and multiplexers to perform any logic function.
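The statement that a 6-input LUT is a 64-bit memory can be illustrated directly: the truth table has $2^6 = 64$ entries and the inputs form the read address. The sketch below models the memory as a 64-bit integer; the function names and the parity example are hypothetical.

```python
def make_lut(func, n_inputs=6):
    """Build the truth table as a bit vector: bit k stores func of the bits of k."""
    table = 0
    for address in range(1 << n_inputs):
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        if func(*bits):
            table |= 1 << address
    return table


def lut_read(table, inputs):
    """Evaluate the LUT: the input bits form the address of the stored bit."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return (table >> address) & 1


parity = make_lut(lambda *bits: sum(bits) % 2)   # 6-input XOR
and6 = make_lut(lambda *bits: all(bits))         # 6-input AND
```

For instance, `lut_read(parity, [1, 0, 1, 1, 0, 0])` returns 1. Reconfiguring the logic function only means rewriting the 64 stored bits, which is why a small, dense memory is the natural implementation.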
System Design
We rely heavily on the software system to enable the flexible functionality and high efficiency of the FPSA architecture. So far, the hardware has provided massive computation, buffering, and controlling resources in the form of the three kinds of function blocks, as well as massive wiring resources and configurable connections through the routing architecture. How to make full use of these hardware resources to fit the diversity of NN requirements is a complex problem, especially as we try to maintain the advantages of ReRAM (i.e., the high computational density of vector-matrix multiplication).
From a formal perspective, most deep learning frameworks [1, 6, 35] use computational graph (CG) as the programming model to represent NNs. Thus, the problem is how to efficiently map the software-level CG to the above reconfigurable resource pool.
We divide the problem into three independent sub-problems and design the software stack to solve them respectively, as shown in Figure 5. First, the neural synthesizer transforms the NN CG to bridge the gap between the NN requirements and the hardware functionality. Second, the spatial-to-temporal mapper gives the optimized allocation of PE resources and the scheduling strategy for the above-mentioned output CG, including the corresponding control logic; all of them are collectively referred to as the function-block netlist. Finally, we place the netlist onto the FPSA chip and generate the routing.
Neural Synthesizer
Here the key is to maintain the user-friendly programming interface and synthesize the NN model into a hardware-friendly, compact representation for efficient execution.
Flexible NN Programming. The computational graph (CG) is a widely used programming model in most deep learning frameworks. It is a graph that consists of many tensor operations and describes the data dependencies of the operations. There are hundreds of flexible and complex operations in most deep learning frameworks.
Efficient ReRAM Execution. Supporting hundreds of operations in hardware is impractical. On the other hand, our ReRAM-based PE can complete vector-matrix multiplication with the ReLU function very efficiently (see Section 4.2). Therefore, the neural synthesizer is expected to synthesize the software CG into an equivalent CG that only includes operations the hardware can support efficiently.
We adopt the existing NN compiler framework from Y. Ji et al. [19, 20] to do the synthesis. They propose to transform a trained software NN into an equivalent network that meets hardware constraints; one case study is to transform such a CG into a core-op graph (a core-op is defined as an operation composed of a low-precision vector-matrix multiplication and ReLU). Namely, it can implement different kinds of operations with the core-op, and then fine-tune the model to retain the accuracy. The basic idea is to construct dedicated structures with core-ops to implement other operations or approximate them with multilayer perceptrons (MLPs). Further, large fully-connected layers or convolutional layers will be split into multiple small core-ops.
Figure 5. System stack of FPSA.
Spatial-to-Temporal Mapper
The output core-op graph only contains purely computational tasks. If we map CG nodes onto PEs directly, it will require an extremely large number of PEs, which is impractical. For example, although a convolutional layer reuses its kernel weights for different regions of the input feature map, its core-op graph contains individual core-ops for each region. Thus, we have to temporally map the core-op graph onto hardware with the on-chip buffering and controlling resources. Still taking the convolutional layer as an example, we can map all core-ops with shared weights onto one or more PEs and reuse the weights in a time-division-multiplexing manner. Accordingly, the mapper will generate an optimized netlist of function blocks for the core-op graph: PEs complete all the computation tasks, buffers hold the intermediate data, and control logic will be generated to schedule the execution. Further, the buffers separate the entire circuit into multiple pipeline stages, and different pipeline stages process different samples in parallel. The mapping involves the following two sub-steps.
Resource Allocation. As discussed in Section 3, different layers reuse their weights different numbers of times. We should assign more PEs to those layers that reuse weights more times. To do so, we put all the core-ops with the same weights into one group. The number of core-ops in one group is denoted as the reuse degree. The number of iterations required to complete the computation of a group depends on the number of PEs assigned to that group. We first allocate one PE for each group to satisfy the minimum storage requirement. To balance the pipeline stages, we will assign extra PEs to those groups that require more iterations to complete if more PEs are available. The number of duplications (PEs) assigned to one group is referred to as the duplication degree of that group. We use the duplication degree of the group with the maximum reuse degree as the duplication degree of the entire model. With n× duplication degree, the temporal utilization bound is usually increased by n×.
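The allocation policy can be sketched as a simple greedy loop. The sketch assumes that a group with reuse degree r and duplication degree d needs ceil(r / d) iterations and that each extra PE goes to the group that currently needs the most iterations; the heap-based greedy rule, the function name, and the example groups are assumptions for illustration, not the mapper's actual algorithm.

```python
import heapq
from math import ceil


def allocate(reuse_degree, total_pes):
    """Return the duplication degree per group for a budget of `total_pes` PEs."""
    if total_pes < len(reuse_degree):
        raise ValueError("need at least one PE per group to store the weights")
    dup = {group: 1 for group in reuse_degree}  # minimum storage requirement
    # Max-heap on remaining iterations (negated because heapq is a min-heap).
    heap = [(-reuse, group) for group, reuse in reuse_degree.items()]
    heapq.heapify(heap)
    for _ in range(total_pes - len(dup)):
        neg_iters, group = heapq.heappop(heap)
        if -neg_iters <= 1:
            break  # every group already finishes in one iteration
        dup[group] += 1
        iters = ceil(reuse_degree[group] / dup[group])
        heapq.heappush(heap, (-iters, group))
    return dup


plan = allocate({"conv1": 8, "conv2": 4, "fc": 1}, total_pes=7)
```

With 7 PEs the hypothetical model ends up with duplication degrees 4, 2, and 1, so no group needs more than 2 iterations; with only 3 PEs the slowest group would need 8.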
Scheduling. After the core-ops are assigned to PEs, we also need to schedule the execution order, insert buffers between PEs, and generate the control signals to get the netlist. We denote the core-op graph as $G = (V, E)$, where $V$ is the node set and $E$ is the edge set. $A_v$ denotes the PE assigned to the core-op $v \in V$. $s_v$ and $e_v$ represent the start cycle and end cycle for executing the core-op $v$, respectively. The following constraints should be satisfied.
• Resource Conflict (RC). Two core-ops cannot be executed simultaneously if they are assigned to the same PE, which is shown in Formula 7.
• No-Buffer Dependency (NBD). If there is a data dependency between nodes $u$ and $v$, and if these two nodes are placed into directly connected PEs without buffers, the execution time of $v$ needs to cover that of $u$ to receive the spike train generated by $u$, as shown in Formula 8.
• Buffered Dependency (BD). Resource conflict and no-buffer dependency may conflict; thus we add buffers between the two PEs to resolve the conflict. The buffers will store the firing rate of $u$ and generate spikes for $v$ when $A_v$ is ready. This constraint is given by Formula 9.
• Buffer Conflict (BC). If two nodes $u$ and $v$ receive spike trains from the same port of one buffer, the buffer should provide the spike trains of sampling window $\Gamma$ one by one. The timing should satisfy Formula 10.
• Sampling Window (SW). Finally, the execution time of each core-op cannot be less than $\Gamma$, as in Formula 11.
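The five constraints can be made concrete with a small schedule checker. The inequalities used below are one reading of the prose descriptions, not the paper's Formulas 7 to 11: execution intervals are taken as half-open $[s_v, e_v)$, "cover" is read as $s_v \le s_u$ and $e_u \le e_v$, a buffered consumer may start once the producer has ended, and two readers of the same buffer port must start at least $\Gamma$ cycles apart. The data layout and all names are hypothetical.

```python
from itertools import combinations


def violations(ops, edges, window):
    """List the constraints broken by a schedule.

    ops:   name -> (pe, start, end), with half-open interval [start, end)
    edges: (u, v, port); port is None for a direct PE-to-PE connection,
           otherwise an identifier of the buffer port that v reads from
    """
    found = []
    for name, (_, start, end) in ops.items():
        if end - start < window:                       # SW
            found.append(("SW", name))
    for (a, (pa, sa, ea)), (b, (pb, sb, eb)) in combinations(ops.items(), 2):
        if pa == pb and sa < eb and sb < ea:           # RC: same PE, overlap
            found.append(("RC", a, b))
    readers = {}
    for u, v, port in edges:
        _, su, eu = ops[u]
        _, sv, ev = ops[v]
        if port is None:
            if not (sv <= su and eu <= ev):            # NBD: v covers u
                found.append(("NBD", u, v))
        else:
            if sv < eu:                                # BD: v waits for u
                found.append(("BD", u, v))
            readers.setdefault(port, []).append(v)
    for vs in readers.values():
        for a, b in combinations(vs, 2):
            if abs(ops[a][1] - ops[b][1]) < window:    # BC: one window at a time
                found.append(("BC", a, b))
    return found


edges = [("a", "b", None), ("b", "c", "smb0.out0"), ("b", "d", "smb0.out0")]
good = {"a": (0, 0, 4), "b": (1, 0, 4), "c": (0, 4, 8), "d": (1, 8, 12)}
bad = dict(good, d=(1, 4, 8))  # d now reads the buffer port together with c
```

Under this reading, `violations(good, edges, 4)` is empty, while the `bad` schedule reports a buffer conflict between `c` and `d`. An optimizer would search over all $s_v$ and $e_v$ for a schedule with no violations.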
We can optimize all the $s_v$ and $e_v$ for a certain objective under these constraints. Here, we show a greedy algorithm in Algorithm 1 to minimize the number of buffers used and the latency.
The basic idea is to traverse the graph in topological order and try to connect PEs without buffers. If there is any conflict, a buffer from an SMB should be inserted to separate them.
When all $s_v$ and $e_v$ have been determined, the control signals can be generated accordingly with the CLBs.
Placement & Routing
The last step is to place all function blocks of the netlist onto physical units. Then the CBs and SBs in the routing architecture can be configured to connect the function blocks according to the topology of the netlist. The placement & routing problem is the same as the one for FPGA. We adopt the mature solution used in the FPGA development tool-chain, which usually uses a simulated annealing (SA) algorithm for the placement, and uses Dijkstra's shortest path algorithm for the routing to minimize the latency of the critical path.
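The routing step can be illustrated with a textbook implementation of Dijkstra's algorithm over a graph of wire segments. This is a generic sketch, not the tool-chain's router: the graph layout, the delay values, and the node names are hypothetical, and a real router must also handle congestion between competing signals.

```python
import heapq


def shortest_path(graph, src, dst):
    """Return (total_delay, path) of the minimum-delay route, or None.

    graph: node -> list of (neighbor, delay) pairs with non-negative delays.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, delay in graph.get(node, []):
            candidate = d + delay
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]


# Hypothetical routing graph: two PEs connected through two switch boxes.
graph = {
    "PE0": [("SB0", 1), ("SB1", 4)],
    "SB0": [("SB1", 1), ("PE1", 5)],
    "SB1": [("PE1", 1)],
}
route = shortest_path(graph, "PE0", "PE1")
```

In this example the route through both switch boxes has a total delay of 3, which beats either route that uses a single switch box.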
Evaluation
We evaluate the FPSA architecture and its system stack with a set of typical NN applications. Specifically, we evaluate the contributions of the routing architecture and simplified PEs to the whole system improvement separately. Further, the scalability is evaluated when more resources are provided.
Experiment Configurations and Methodology
Benchmark. We evaluate our proposal on NN models of different scales, including MLP-500-100 for the MNIST dataset [28] (an MLP with two hidden layers composed of 500 and 100 neurons), LeNet [29] for the MNIST dataset, VGG17 for the CIFAR-10 dataset [25], AlexNet [26], GoogLeNet [43], VGG16 [40], and ResNet152 [16]. The last four are for the ImageNet dataset [11].
Figure 6. Comparison between PRIME, FP-PRIME (FPSA with PRIME's PE), and FPSA for VGG16. (Y-axis: Performance / OPS; curves: peak and ideal performance of PRIME and FPSA; annotated steps: Improved Communication, Reduced Area, Reduced Latency.)
Figure 7. The breakdown of processing latency of one PE of PRIME, FP-PRIME, and FPSA (for VGG16). (Y-axis: Average Latency / ns; components: Computation, Communication.)
Table 3. The overall performance of FPSA for different NN models.
Baseline. We compare FPSA to state-of-the-art ReRAMbased accelerators, PRIME [9] , ISAAC [39] , and PipeLayer [41] , especially PRIME (as detailed information is available). Previous studies already show great speedup over conventional digital circuits. For example, Eyeriss [7] achieves 35 frame/s throughput and 115.4ms latency for AlexNet on a chip of 12.25mm 2 under 65nm process with off-chip memory, while we achieve 28.2K frame/s and 100.49μs on 51.86mm 2 under 45nm process without off-chip memory. Most of the improvements come from device benefit. Thus, we only compare with ReRAM-based accelerators to show the improvements from the innovation at the architecture and system levels.
FPSA Configuration. The crossbar size is set to 256×512; the positive and negative values of each logic column is represented with two adjacent crossbar-columns respectively. Logically, the crossbar size is 256 × 256. At each intersection, we put 8 cells connected in parallel. Each cell can be set to 16 levels (4-bit), and we add up the values of 8 cells to represent an 8-bit weight. This is done for reliability reasons, which will be discussed in Section 7.2. We integrate 128 LUTs in one CLB to make the area and number of pins of one CLB similar to one PE. For SMBs, we choose SRAM with 16Kb capacity.
Simulation Setup. We use the mrVPR tool for mrFPGA [10] as the placement & routing tool to evaluate the area consumption and the critical path for communication. mrVPR has two inputs: an architecture description file that contains the parameters of all the function blocks, and a netlist composed of these blocks. We implement the neural synthesizer to generate the core-op graph and the spatial-to-temporal mapper to generate the function-block netlist for mrVPR. The parameters of the function blocks are listed in Table 1. We use NVSim [12] to evaluate the ReRAM crossbar, sense amplifier, SMB, and CLB, and use Synopsys Design Compiler for the other peripheral circuits; all are under the 45nm process. The routing architecture is stacked over the function blocks. According to the mrVPR report, the routing architecture occupies less area than the function blocks. We build a simulator to evaluate the performance based on the routing result reported by mrVPR.
Methodology. To show the effects of the new routing architecture and simplified PEs, we first compare PRIME with FP-PRIME (FPSA's routing architecture with PRIME's PE) to show that the communication bound of PRIME can be broken. Then, FP-PRIME is compared with FPSA to show the further improvement from the new PE circuits. In addition, we evaluate FPSA with different models to give the overall performance.
Session: Machine Learning I ASPLOS '19, April 13-17, 2019 , Providence, RI, USA
Performance Improvement
Overall Comparison. In Figure 6, we compare PRIME, FP-PRIME, and FPSA for VGG16. FP-PRIME is composed of the FPSA routing architecture and PRIME's PEs, so its peak performance and ideal performance are the same as PRIME's. The performance improvements come from the three aspects listed in Section 3: improving communication, reducing area, and reducing latency.
• Improved Communication. Comparing PRIME and FP-PRIME in Figure 6, we can see that by introducing the reconfigurable routing architecture, FP-PRIME breaks the communication bound. Its performance is very close to the ideal case (the gap looks negligible on the logarithmic axes).
• Reduced Area & Latency. Comparing FP-PRIME and FPSA, the performance increases further because of the area and latency reduction of our PE design.
Combining these together, we can achieve up to 1000× speedup with the same area consumption.
Communication Improvement. In Figure 7, we show the average computation and communication latency of one PE for VGG16. Communication accounts for most of the latency of PRIME. By introducing the reconfigurable routing, the communication latency is reduced to 59.4ns, which is negligible compared to the computation time of 3064.7ns. By further simplifying the peripheral circuits of the PE, the computation time is reduced to 156.4ns, while the communication time increases to 633.9ns because we transmit the spike trains directly instead of spike counts. This communication overhead is the reason for the gap between the ideal case and the real case for FPSA in Figure 6. It could be reduced by adding buffers: currently, the input spike signal of the charging unit is held by its source PE. Adding more buffers between the source and target PEs would reduce the latency, but it would also erode the density advantage of the current FPSA design. We discuss the effect of transmitting spike trains further in Section 7.1.
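To make the shift in the latency balance concrete, the sketch below computes the share of per-PE latency spent on communication from the figures quoted above. Adding the two components into a single total is an assumption for illustration (it ignores any overlap between computation and communication); the dictionary and function names are hypothetical.

```python
# Average per-PE latencies for VGG16 quoted in the text (nanoseconds).
LATENCY_NS = {
    "FP-PRIME": {"computation": 3064.7, "communication": 59.4},
    "FPSA": {"computation": 156.4, "communication": 633.9},
}


def total_latency(design):
    """Sum of computation and communication latency (assumes no overlap)."""
    parts = LATENCY_NS[design]
    return parts["computation"] + parts["communication"]


def communication_share(design):
    """Fraction of the summed latency that is spent on communication."""
    return LATENCY_NS[design]["communication"] / total_latency(design)


if __name__ == "__main__":
    for name in LATENCY_NS:
        print(f"{name}: total = {total_latency(name):.1f} ns, "
              f"communication share = {communication_share(name):.1%}")
```

Under this additive assumption, communication is a small fraction of FP-PRIME's latency but the dominant component for FPSA, which matches the gap between the ideal and real cases discussed above.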
Area & Latency Reduction. In Table 2, we compare the area and latency of one PE in PRIME with those in FPSA. The area is reduced by 36.63% and the latency is reduced by 94.90%, which together improve the computational density by 31×. Most of the improvement comes from the latency reduction, because the simplified peripheral circuits no longer need to be shared among different rows and columns. The resulting computational density is 38.004TOPS/mm², which is higher than that of PRIME [9] (1.229TOPS/mm²), PipeLayer [41] (1.485TOPS/mm²), and ISAAC [39] (0.479TOPS/mm²).
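The 31× figure follows from the two reductions if we assume that a PE performs the same number of operations per pass in both designs, so that computational density scales as 1 / (area × latency). The short check below is a sketch under that assumption; the function name is hypothetical.

```python
def density_gain(area_reduction, latency_reduction):
    """Gain in operations per second per unit area.

    Assumes the work done by one PE per pass is unchanged, so density is
    proportional to 1 / (area * latency).
    """
    remaining_area = 1.0 - area_reduction
    remaining_latency = 1.0 - latency_reduction
    return 1.0 / (remaining_area * remaining_latency)


# Reductions quoted in the text for one PE (FPSA relative to PRIME).
gain_from_reductions = density_gain(0.3663, 0.9490)

# Cross-check against the quoted densities in TOPS/mm^2.
gain_from_densities = 38.004 / 1.229

if __name__ == "__main__":
    print(f"from reductions: {gain_from_reductions:.2f}x")
    print(f"from densities:  {gain_from_densities:.2f}x")
```

Both routes give a value just under 31, so the quoted reductions and the quoted densities are mutually consistent.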
Scalability & Utilization
We test the performance of FPSA under 1×, 4×, 16×, and 64× duplication degrees (defined in Section 5.2) for all the benchmark models; the results are shown in Figure 8. The detailed performance for the 64× case is listed in Table 3.
In Figure 8a, with 4×, 16×, and 64× duplication degrees, the geometric mean of the performance improvement is 3.06×, 10.88×, and 38.65×, respectively. In contrast, the geometric mean of the area consumption increases by only 1.25×, 1.85×, and 3.73×, respectively. In particular, for the last four (ImageNet) models, the area consumption increases by only 1.003×, 1.074×, and 1.504× on average.
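Dividing the performance gain by the area gain gives the improvement in performance per unit area at each duplication degree. The sketch below uses only the geometric means quoted above; because the ratio of two geometric means equals the geometric mean of the per-model ratios, the result can be read as the geometric-mean density gain. Variable and function names are hypothetical.

```python
DUPLICATION = [4, 16, 64]
PERF_GAIN = [3.06, 10.88, 38.65]  # geometric-mean speedup over the 1x case
AREA_GAIN = [1.25, 1.85, 3.73]    # geometric-mean area increase over the 1x case


def density_gains(perf, area):
    """Performance-per-area gain for each duplication degree."""
    return [p / a for p, a in zip(perf, area)]


if __name__ == "__main__":
    for dup, gain in zip(DUPLICATION, density_gains(PERF_GAIN, AREA_GAIN)):
        print(f"{dup}x duplication: {gain:.2f}x performance per area")
```

A gain above 1 at every degree, growing with the duplication degree, is what "super-linear" means here: performance rises faster than the area spent to obtain it.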
The reason for the super-linear scalability is that utilization increases when more resources are available. In Figure 8c, we show the peak computational density, the spatial utilization bound (due to imperfect crossbar mapping), the temporal utilization bound (due to unbalanced workload), and the real computational density. The two bounds depend on the properties of the models: there is no weight sharing in the MLP model, so its workload is balanced and the two bounds coincide. For CNN models, the spatial utilization bound does not change when more resources are available (we discuss how to improve this bound in Section 7.3), but the temporal utilization bound increases significantly, which provides the super-linear scalability (as long as the communication bound is not hit).
Discussion
Beyond the overall improvements, several other considerations affect the details of our design.
Spiking Schema
The spiking schema has been used in existing designs, e.g. PipeLayer [41], to reduce the overhead of ADCs and DACs, but there is a significant difference between our work and theirs: we transmit spike trains directly through the routing architecture, while they transmit spike counts. Besides saving the overhead of encoder/decoder circuits, this also reduces the end-to-end latency and the on-chip buffer requirement.
As discussed in Section 5.2, when two PEs are connected directly without buffers, the post-PE can start computation only 1 cycle after the pre-PE starts (the No-Buffer Dependency (NBD)), and we need only a 1-bit buffer to store the current spike. If we transmitted spike counts instead, the post-PE would have to wait at least $2^n$ cycles (the sampling window for an n-bit number) until the pre-PE finishes all its computation before starting its own. In addition, it would need an n-bit buffer to store the spike count. Thus, by transmitting the spike trains directly, we can gain up to a $2^n\times$ end-to-end latency reduction for NBD and an $n\times$ saving in buffer consumption. The drawback is that we generate $2^n$ bits of traffic for an n-bit number, which is the reason for the increased communication latency from FP-PRIME to FPSA in Figure 7. Compared to the original latency of PRIME, however, this increase is negligible. The latencies are listed in Table 3: the latency for VGG16 is only 671.8μs, while PRIME's is 102.0ms.
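The trade-off described above can be written down as a small cost model. This is a sketch under simplifying assumptions: a linear chain of PEs connected without buffers, one spike per cycle, and a sampling window of 2^n cycles for an n-bit value. The function names and the chain topology are hypothetical and are not taken from the FPSA mapper.

```python
def start_offset_cycles(n_bits, transmit):
    """Cycles a post-PE waits after its pre-PE starts."""
    if transmit == "train":
        return 1                 # spikes are forwarded as they are produced
    if transmit == "count":
        return 2 ** n_bits       # must wait for the whole sampling window
    raise ValueError("transmit must be 'train' or 'count'")


def chain_latency_cycles(num_pes, n_bits, transmit):
    """End-to-end cycles for a chain of PEs, each with a 2^n-cycle window."""
    window = 2 ** n_bits
    # Every PE after the first starts one offset later; the last PE then
    # runs for a full window.
    return (num_pes - 1) * start_offset_cycles(n_bits, transmit) + window


def per_link_cost(n_bits, transmit):
    """Return (buffer_bits, traffic_bits) for one n-bit value on one link."""
    if transmit == "train":
        return 1, 2 ** n_bits    # 1-bit buffer, one bit of traffic per cycle
    if transmit == "count":
        return n_bits, n_bits    # n-bit buffer, n bits of traffic
    raise ValueError("transmit must be 'train' or 'count'")


if __name__ == "__main__":
    for mode in ("train", "count"):
        print(mode, chain_latency_cycles(4, 6, mode), per_link_cost(6, mode))
```

In this model the per-hop start offset shrinks by a factor of 2^n and the buffer by a factor of n when spike trains are sent, at the price of 2^n bits of traffic per value instead of n.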
Device Variation and NN Accuracy
ReRAM devices are not ideal. Due to the programming overhead and the intrinsic working mechanism of ReRAM cells, the conductance of a cell cannot be programmed to exactly the expected value; the conductance also has cycle-to-cycle variation [49]. The device variation inevitably leads to inaccurate results even if we set a tight margin between levels. The reason is that, in ReRAM-crossbar-based computing, there is no explicit read to quantize the obtained conductance, and all currents (with errors) from the cells along the same column accumulate. Some software approaches, e.g. Vortex [31], have been proposed to make NN models more robust to variation. We have adopted these methods in our neural synthesizer, but because the inherent fault tolerance of NNs is limited, their effect is limited for relatively large variation. Thus, from the architecture perspective, we should also use more cells for one weight value to reduce the variation exposed to the software level. Without loss of generality, suppose the conductance of a ReRAM cell is a random variable that obeys a normal distribution $N(\mu, \sigma^2)$ rather than a fixed number. We use the normalized deviation, which is the ratio between the standard deviation and the value range, to measure the variation exposed to software.
The existing splicing method. Most existing architecture studies [9, 39] employ the splicing method, which uses multiple cells for different bits of a number, to increase the representation precision of ReRAM. Suppose we use two n-bit cells to form a 2n-bit number, one for the high n bits and one for the low n bits. Their conductance values are H and L, respectively: $H \sim N(h, \sigma^2)$ and $L \sim N(l, \sigma^2)$, where h and l are the expected values of the high n bits and low n bits, respectively. The number is then expressed as $2^n H + L \sim N(2^n h + l, (2^n \sigma)^2 + \sigma^2)$. Its normalized deviation is $\sqrt{2^{2n} + 1}\,\sigma / (2^{2n} - 1)$, which is almost equal to that of the one-cell case, $\sigma / (2^n - 1)$. In other words, splicing brings little improvement in accuracy.
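A quick numerical check of the splicing result is sketched below. It evaluates the two closed-form expressions from the paragraph above, with σ expressed in units of one conductance level (an assumption made so that the value range of an n-bit cell is 2^n − 1). The function names are hypothetical.

```python
import math


def one_cell_deviation(sigma, n_bits):
    """Normalized deviation of a single n-bit cell: sigma / (2^n - 1)."""
    return sigma / (2 ** n_bits - 1)


def splice_deviation(sigma, n_bits):
    """Normalized deviation of 2^n * H + L built from two n-bit cells."""
    std = math.sqrt((2 ** n_bits * sigma) ** 2 + sigma ** 2)
    value_range = 2 ** (2 * n_bits) - 1
    return std / value_range


if __name__ == "__main__":
    sigma, n_bits = 1.0, 4
    single = one_cell_deviation(sigma, n_bits)
    spliced = splice_deviation(sigma, n_bits)
    print(f"one cell: {single:.4f}, spliced: {spliced:.4f}, "
          f"ratio: {spliced / single:.3f}")
```

For 4-bit cells the ratio of the two values is close to 1, which is the sense in which splicing leaves the exposed variation almost unchanged: the high-order cell's error is amplified by the same factor that extends the range.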
The new add method. We propose the add method, which adds the conductance values evenly to increase precision and reduce variation. Consider the general case in which n cells ($X_1, \ldots, X_n$ with $X_i \sim N(x_i, \sigma^2)$) are joined together with coefficients $a_1, \ldots, a_n$. Then the number is expressed as
$$\sum_i a_i X_i \sim N\Big(\sum_i a_i x_i,\ \sum_i a_i^2 \sigma^2\Big).$$
The normalized deviation is decreased by a factor of $\sum_i |a_i| / \sqrt{\sum_i a_i^2}$. According to the Cauchy inequality, this decrease reaches its maximum value, $\sqrt{n}$, when $|a_1| = \ldots = |a_n|$. Figure 9 shows the effect of the two methods on the accuracy of VGG16. The variation data is derived from real fabricated ReRAM cells [49]. PRIME uses two 4-bit cells to form an 8-bit weight value with splicing; the accuracy drops to 70% of the full-precision accuracy. In our design, we use 16 4-bit cells, 8 for positive and 8 for negative values, to form an 8-bit weight value with the add method; the accuracy is close to the full-precision accuracy.
Figure 9. The normalized accuracy of VGG16 (normalized by the full-precision accuracy) for the splice and add methods with different numbers of cells used (4-bit for each cell).
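The effect of the coefficients can be checked numerically. The sketch below assumes the decrease factor has the form Σ|a_i| / √(Σ a_i²), which is the form consistent with the Cauchy-inequality bound of √n quoted above: the value range grows with Σ|a_i| while the standard deviation grows with √(Σ a_i²). The function name is hypothetical.

```python
import math


def deviation_decrease(coeffs):
    """Factor by which the normalized deviation shrinks: sum|a_i| / sqrt(sum a_i^2)."""
    numerator = sum(abs(a) for a in coeffs)
    denominator = math.sqrt(sum(a * a for a in coeffs))
    return numerator / denominator


if __name__ == "__main__":
    # Add method: 8 cells with equal coefficients reach the sqrt(n) bound.
    print(f"add, 8 equal cells: {deviation_decrease([1] * 8):.3f}")
    # Splicing two 4-bit cells corresponds to coefficients (2^4, 1).
    print(f"splice, 2 cells:    {deviation_decrease([16, 1]):.3f}")
    # Two cells with equal coefficients, for comparison.
    print(f"add, 2 equal cells: {deviation_decrease([1, 1]):.3f}")
```

With equal coefficients the factor equals √n, while the strongly unequal coefficients of splicing give a factor barely above 1, which is why adding cells evenly reduces the exposed variation and splicing does not.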
Spatial Utilization
The spatial utilization bound comes from the fact that weight matrices cannot fit crossbars perfectly. Moreover, we find that the neural synthesizer aggravates this situation: it introduces many small-scale weight matrices to implement operations such as reduction and max pooling. For example, in GoogLeNet, the pooling operations occupy 67.2% of the PEs after synthesis, which leads to the large gap between the peak performance and the spatial utilization bound in Figure 8c. To improve the utilization from the hardware perspective, we could introduce PEs of different scales to fit weight matrices better. From the software perspective, a future task is to find a better set of hardware-supported operations than the core-op.
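The mismatch between matrix shapes and crossbar size can be quantified with a simple tiling model. The sketch below assumes that each weight matrix is tiled onto its own logical 256×256 crossbars and that unused cells in a crossbar are not shared with other matrices; this is an illustrative assumption, not the mapping algorithm of the neural synthesizer, and the function names are hypothetical.

```python
import math

CROSSBAR_ROWS = 256
CROSSBAR_COLS = 256  # logical columns of one crossbar


def crossbars_needed(rows, cols):
    """Number of crossbars used to tile a rows x cols weight matrix."""
    return math.ceil(rows / CROSSBAR_ROWS) * math.ceil(cols / CROSSBAR_COLS)


def spatial_utilization(rows, cols):
    """Fraction of the allocated crossbar cells that hold weights."""
    allocated = crossbars_needed(rows, cols) * CROSSBAR_ROWS * CROSSBAR_COLS
    return rows * cols / allocated


if __name__ == "__main__":
    # A perfect fit, a slightly oversized matrix, and a small matrix of the
    # kind introduced for reduction or pooling.
    for shape in [(256, 256), (300, 300), (4, 1)]:
        print(shape, crossbars_needed(*shape), f"{spatial_utilization(*shape):.4f}")
```

Small matrices occupy a whole crossbar while using only a few cells, which illustrates why PEs of several sizes could raise the bound.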
Conclusion
By analyzing the bottlenecks and bounds of ReRAM-based NN acceleration, we propose a full system design of a ReRAM-based NN accelerator, from the circuit level to the architecture and system levels. Owing to the software system and massive hardware resources, it supports the functional diversity and optimized execution of NN models on the proposed compact and efficient ReRAM PEs, achieving up to 1000× speedup compared to an existing ReRAM-based design, PRIME. Last but not least, the computational density, 38TOPS/mm², is also much higher than that of its counterparts.
