Memristor crossbars are circuits capable of performing analog matrix-vector multiplications, overcoming the fundamental energy efficiency limitations of digital logic. They have been shown to be effective in special-purpose accelerators for a limited set of neural network applications.
1 Introduction
General-purpose computing systems have benefited from technology scaling for several decades, but are now hitting an energy wall. This trend has led to growing interest in domain-specific architectures. Machine Learning (ML) workloads in particular have received tremendous attention because of their pervasiveness across many application domains and their high performance demands. Several architectures have been proposed, both digital [20, 21, 36, 61, 72, 89] and mixed digital-analog using memristor crossbars [22, 23, 73, 95, 100].
ML workloads tend to be data-intensive and perform a large number of Matrix Vector Multiplication (MVM) operations. Their execution on digital CMOS hardware is typically characterized by high data movement costs relative to compute [49]. To overcome this limitation, memristor crossbars can store a matrix with high storage density and perform MVM operations with very low energy and latency [5, 13, 52, 87, 98, 116]. Each crosspoint in the crossbar stores a multi-bit value in one memristor device, which enables high storage density [112]. Upon applying an input voltage at the crossbar's rows, we get the MVM result as output current at the crossbar's columns based on Kirchhoff's law. A crossbar thus performs MVM in one computational step, including O(n²) multiplications and additions for an n × n matrix, which typically takes many steps in digital logic. It also combines compute and storage in a single device to alleviate data movement, thereby providing intrinsic suitability for data-intensive workloads [23, 95].
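Concretely, with the matrix encoded as a grid of conductances g_ij and the input vector applied as row voltages V_i, the current at column j is given by the standard crossbar relation (consistent with the MVMU description in Section 3.2):

```latex
I_j = \sum_{i=1}^{n} g_{ij} \, V_i
```

Reading all n column currents in parallel yields the full matrix-vector product in a single step.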
Memristor crossbars have been used to build special-purpose accelerators for Convolutional Neural Networks (CNNs) and Multi-Layer Perceptrons (MLPs) [23, 73, 95], but these designs lack several important features for supporting general ML workloads. First, each design supports only one or two types of neural networks, with layers encoded as state machines. This approach does not scale to a larger variety of workloads due to increased decoding overhead and complexity. Second, existing accelerators lack the types of operations needed by general ML workloads. For example, Long Short-Term Memory (LSTM) [51] workloads require multiple linear and transcendental vector functions which cannot be executed on crossbars efficiently and are not supported by existing designs. Third, existing designs do not provide flexible data movement and control operations to capture the variety of access and reuse patterns in different workloads. Since crossbars have high write latency [54], they typically store constant data while variable inputs are routed between them in a spatial architecture. This data movement can amount to a significant portion of the total energy consumption, which calls for flexible operations to optimize it.
To address these limitations, we present PUMA, a Programmable Ultra-efficient Memristor-based Accelerator. PUMA is a spatial architecture designed to preserve the storage density of memristor crossbars to enable mapping ML applications using on-chip memory only. It supplements crossbars with an instruction execution pipeline and a specialized ISA that enables compact representation of ML workloads with low decoder complexity. It employs temporal SIMD units and a ROM-Embedded RAM [69] for area-efficient linear and transcendental vector computations. It includes a microarchitecture, ISA, and compiler co-designed to optimize data movement and maximize area and energy efficiency. To the best of our knowledge, PUMA is the first programmable and general-purpose ML inference accelerator built with hybrid CMOS-memristor technology.
A naïve approach to generality is not viable because of the huge disparity in compute and storage density between the two technologies. CMOS digital logic has an order of magnitude higher area requirement than a crossbar for equal output width (∼20×, see Table 3). Moreover, a crossbar's storage density (2-bit cells) is 160MB/mm², which is at least an order of magnitude higher than SRAM (6T, 1-bit cells) [95]. A 90mm² PUMA node can store ML models with up to 69MB of weight data. Note that the PUMA microarchitecture, ISA, and compiler are equally suitable to crossbars made from emerging technologies other than memristors, such as STT-MRAM [94], NOR Flash [46], etc.
We make the following contributions:
• A programmable and highly efficient architecture, exposed by a specialized ISA, for scalable acceleration of a wide variety of ML applications using memristor crossbars.
• A complete compiler which translates high-level code to the PUMA ISA, enabling the execution of complex workloads on thousands of spatial cores.
• A detailed simulator which incorporates functionality, timing, and power models of the architecture.
• An evaluation across ML workloads showing that PUMA can achieve promising performance and energy efficiency compared to state-of-the-art CPUs, GPUs, the TPU, and application-specific memristor-based accelerators.
Our simulator and compiler have been open-sourced to enable further research on this class of accelerators.
2 Workload Characterization
This section characterizes different ML inference workloads with a batch size of one. The characteristics are summarized in Table 1. The section's objective is to provide insight into the suitability of memristor crossbars for accelerating ML workloads and to highlight the implications for the proposed architecture.
Multi-Layer Perceptron (MLP)
MLPs are neural networks used in common classification tasks such as digit recognition, web search, etc. [26, 61]. Each layer is fully-connected and applies a nonlinear function to the weighted sum of outputs from the previous layer. The weighted sum is essentially an MVM operation. Equation 1 shows the computations in a typical MLP (act is nonlinear):
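A standard formulation consistent with this description (the original Equation 1 may use different symbols) is:

```latex
y_{l} = act\left( W_{l} \, y_{l-1} + b_{l} \right) \tag{1}
```

where y_{l-1} is the output vector of the previous layer, W_l the layer's weight matrix, and b_l its bias.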
MLPs are simple, capturing the features common across the ML workloads we discuss: dominance of MVM operations, high data parallelism, and use of nonlinear operations. 
Nonlinear operations
In addition to MVM operations, MLPs (and other ML workloads) perform nonlinear vector operations (e.g., ReLU). Since these operations cannot be performed in crossbars, an implication on the architecture is the need to provide digital functional units to support them. Such functional units consume significantly more area (∼20×) than crossbars for equal output width (see Table 3). The challenge is to size these units appropriately to provide sufficient throughput without offsetting crossbar area/energy efficiency.
Long Short-Term Memory (LSTM)
LSTMs are the state-of-the-art technique for sequence processing tasks like speech processing, language modelling, etc. [51]. Each layer is fully-connected and performs linear and nonlinear operations on the weighted sum of the current input and the previous state. These operations translate into two MVMs followed by multiple (typically four) vector arithmetic operations and (typically four) nonlinear functions. Equations 2 to 4 show the computations in a typical LSTM:
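A standard LSTM formulation consistent with this description (the original Equations 2-4 may use different symbols) is:

```latex
\begin{aligned}
(i_t,\, f_t,\, o_t,\, g_t) &= (\sigma,\, \sigma,\, \sigma,\, \tanh)\bigl(W_x x_t + W_h h_{t-1} + b\bigr) && (2)\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && (3)\\
h_t &= o_t \odot \tanh(c_t) && (4)
\end{aligned}
```

The two MVMs are W_x x_t and W_h h_{t-1}; the gate activations and the element-wise (⊙) updates account for the four nonlinear functions and the four vector arithmetic operations.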
To the best of our knowledge, PUMA is the first memristor-based accelerator demonstrated with LSTMs.
Linear and transcendental operations
Unlike MLPs, LSTMs also perform linear vector operations. Moreover, the typical nonlinear vector operations in LSTMs are transcendental (e.g., tanh, sigmoid). Supporting these operations has the same implication on the architecture as discussed in Section 2.1.3 for nonlinear operations. Transcendental operations are particularly challenging due to their high computational complexity.
Weight reuse
Another key distinction of LSTMs compared to MLPs is data reuse. LSTM inputs consist of a sequence of vectors processed across multiple time-steps with the same weights. This feature benefits CMOS architectures by amortizing DRAM accesses for loading weights, but provides no extra advantage to memristor crossbars, where weights are already stationary. That said, the scope of weight reuse in LSTMs spans only a few inputs, so the workload remains memory-bound and still suffers on CMOS hardware from insufficient amortization of DRAM accesses.
Convolutional Neural Network (CNN)
CNNs are widely used for image recognition and classification [67]. They typically include several convolutional, pooling, and response normalization layers. A convolution layer consists of weight kernels strided across the input image in a sliding-window fashion. It exhibits a non-sequential memory access pattern since a window of the input consists of parts of the input image from different rows. Equation 5 shows the computations in a typical convolutional layer of a CNN:
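A standard formulation of a convolutional layer consistent with this description (the original Equation 5 may use different symbols; C input channels, K×K kernels, stride s) is:

```latex
out_{m,x,y} = act\Bigl(\sum_{c=1}^{C} \sum_{i=1}^{K} \sum_{j=1}^{K} W_{m,c,i,j} \cdot in_{c,\, x \cdot s + i,\, y \cdot s + j}\Bigr) \tag{5}
```

where m indexes the output feature map and (x, y) the output position.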
Input reuse and compute intensity
Convolution layers exhibit both weight and input data reuse. They can be mapped to matrix-matrix multiplications which successively apply weight kernels to different input windows. Matrix-matrix multiplications are compute-bound, which makes them well-suited for CMOS hardware since there is enough data reuse to amortize DRAM access cost. However, memristor crossbars can still perform well on matrix-matrix operations by treating them as successive MVMs. An implication on the architecture is the opportunity to take advantage of input reuse to minimize data movement within the chip. Another implication is that iterating over inputs creates the need for control flow to represent the workload compactly without code bloat.
Non-sequential access
Unlike MLPs and LSTMs, CNNs exhibit non-sequential accesses due to the way inputs are traversed as well as the behavior of non-convolutional layers such as pooling. The implication on the architecture is the need to support fine-grained random access to memory.

3 Core Architecture
We propose a programmable architecture and ISA design that leverage memristor crossbars for accelerating ML workloads. PUMA is a spatial architecture organized in three tiers: cores, tiles, and nodes. Cores consist of analog crossbars, functional units, and an instruction execution pipeline. Tiles consist of multiple cores connected via a shared memory. Nodes consist of multiple tiles connected via an on-chip network. Nodes can in turn be connected together via a chip-to-chip interconnect for large-scale execution. While this hierarchical organization is common in related work [20, 95], our key contributions lie in the core architecture (this section) and tile architecture (Section 4) that bring programmability and generality to memristor crossbars without compromising their energy and area efficiency. An overview of the core architecture is shown in Figure 1. The following subsections discuss the components of the core architecture and the insights behind their design.
Instruction Execution Pipeline
Existing memristor-based accelerators [23, 73, 95] are limited to one or two ML workloads. They use state machines that can be configured to compose a small set of functional blocks (e.g., convolution block, pooling block, etc.). While this approach works well when the scope of workloads is small, supporting a larger variety of workloads creates high decoding complexity. For this reason, our core architecture breaks functionality down into finer-grain instructions and supplements memristor crossbars with an instruction execution pipeline. Our approach is based on the observation in Section 2 that despite the large variety of ML workloads, these workloads share many low-level operations. The instruction execution pipeline is an in-order pipeline with three stages: fetch, decode, and execute. Keeping the pipeline simple saves area, avoiding offsetting the crossbars' area efficiency. The ISA executed by the pipeline is summarized in Table 2. Instructions are seven bytes wide. The motivations for wide instructions are discussed in Sections 3.3 and 3.4.3. ISA instruction usage is shown in Section 3.6. More ISA details are discussed in another paper [7].
The instruction execution pipeline supports control flow instructions (jmp and brn in Table 2), as motivated in Section 2.3.1. It also includes a Scalar Functional Unit (SFU) that performs scalar integer operations (ALUint in Table 2) to support the control flow instructions.
Matrix-Vector Multiplication Unit (MVMU)
The MVMU (illustrated in Figure 1) consists of memristor crossbars that perform analog MVM operations, and peripherals (DAC/ADC arrays) that interface with digital logic via the XbarIn and XbarOut registers. XbarIn registers provide digital inputs to the DACs, which feed analog voltages to the crossbar. ADCs convert crossbar output currents to digital values, which are stored in the XbarOut registers. This crossbar design is similar to ISAAC [95]. Figure 2(a) shows how memristor crossbars can be used to perform analog MVM operations. DACs convert the input vector to analog voltages applied at crossbar rows. The matrix is encoded as conductance states (g_ij) of the devices that constitute the crossbar. The currents at crossbar columns constitute the MVM result. They are integrated (converted to voltage) then converted to digital values with ADCs.
Precision Considerations
Practically realizable crossbars provide 2-6 bits of precision per device [52]. We conservatively use 2 bits per device, and realize 16-bit MVM operations by combining 8 crossbars via bit-slicing [95], illustrated in Figure 2(b). ADCs are reused across columns to save area. The impact of precision on inference accuracy is evaluated in Section 7.6.
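The recombination can be summarized with a short functional sketch, a behavioral model rather than the hardware datapath: each of the 8 crossbars holds one 2-bit slice of every weight, all slices see the same input, and the digitized partial results are shift-and-added. Unsigned fixed-point weights are assumed, and the streaming of input bits through the DACs used by ISAAC-style designs is omitted for brevity:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kSlices = 8;       // 8 crossbars x 2 bits = 16-bit weights
constexpr int kBitsPerCell = 2;

using Vec = std::vector<int64_t>;
using Mat = std::vector<Vec>;

// Extract slice k (2 bits) of every weight: the values one crossbar stores.
Mat weightSlice(const Mat& w, int k) {
  Mat s(w.size(), Vec(w[0].size()));
  for (size_t i = 0; i < w.size(); ++i)
    for (size_t j = 0; j < w[0].size(); ++j)
      s[i][j] = (w[i][j] >> (k * kBitsPerCell)) & 0x3;
  return s;
}

// One crossbar: the analog MVM modeled as an integer dot product per column.
Vec crossbarMVM(const Mat& s, const Vec& in) {
  Vec out(s[0].size(), 0);
  for (size_t j = 0; j < out.size(); ++j)
    for (size_t i = 0; i < in.size(); ++i)
      out[j] += s[i][j] * in[i];
  return out;
}

// Full 16-bit MVM: run all 8 slices on the same input, then shift-and-add.
Vec bitSlicedMVM(const Mat& w, const Vec& in) {
  Vec out(w[0].size(), 0);
  for (int k = 0; k < kSlices; ++k) {
    Vec partial = crossbarMVM(weightSlice(w, k), in);
    for (size_t j = 0; j < out.size(); ++j)
      out[j] += partial[j] << (k * kBitsPerCell);
  }
  return out;
}
```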
Crossbar Co-location and Input Sharing
Crossbar peripherals have an order of magnitude higher area than the crossbar itself. Since all eight 2-bit crossbars of a 16-bit MVM operation are used simultaneously on the same input, we co-locate these 2-bit crossbars on the same core in the same MVMU, which allows us to use the same XbarIn registers and DAC array to feed them with minimal routing congestion. This co-location and input reuse is provided transparently in the ISA, which exposes a full 16-bit MVM operation in a single instruction (MVM in Table 2).
Input Shuffling
As motivated in Section 2.3.1, ML workloads with sliding-window computations typically reuse large portions of the input across successive MVM operations (∼80% for convolutional layers with filter size 5 and unit stride). However, reused input values may appear at different positions in the input vector. To avoid moving data around in XbarIn, the MVM instruction provides operands (filter/stride in Table 2) that re-route XbarIn registers to different DACs, enabling logical input shuffling without physical data movement.
Multiple MVMUs per Core
A core may have multiple MVMUs, in which case it is desirable to activate them in parallel since MVMs are heavy operations. The in-order pipeline does not capture the Instruction-Level Parallelism (ILP) between MVM instructions automatically. Instead, the ISA exposes an operand (mask in Table 2) to allow a single MVM instruction to activate multiple MVMUs at once. Compiler optimizations that use this operand are discussed in Section 5.3.2.
Crossbar Writes
PUMA is an inference accelerator, so crossbars are initialized with weights using serial writes at configuration time, prior to execution, and are not written to during execution. In this sense, PUMA is a spatial architecture where data is routed between crossbars, each crossbar storing a different portion of the model. Larger models therefore require more area, and may scale to multiple nodes.
Vector Functional Unit (VFU)
The VFU executes linear and nonlinear vector operations (ALU and ALUimm in Table 2), as motivated by Sections 2.1.3 and 2.2.1. An important aspect of designing vector instructions and the VFU is choosing the vector width. Since ML workloads have high data parallelism, they execute wide vector operations, which motivates having wide vector instructions. Wide vector instructions have the benefit of reducing instruction count, and consequently, fetch, decode, and instruction storage overhead. On the other hand, hardware considerations motivate a narrow VFU vector width to avoid offsetting the area efficiency of crossbars, as discussed in Section 2.1.3.
To balance the tension between workloads favoring wide vector width and hardware favoring narrow vector width, we propose a VFU design based on temporal SIMD. Temporal SIMD uses a narrow VFU to execute wide vectors over multiple cycles. The vector instruction operands specify the starting address of the vectors in the register file as well as the vector width (vec-width in Table 2). The operand steer unit holds the decoded instruction and reads the operands from the register file over subsequent cycles to feed the VFU. The additional vec-width operand required by temporal SIMD motivates our wide instruction design.
Provisioning adequate width for the VFU maintains crossbar area efficiency benefits without the VFU becoming a bottleneck and compromising throughput. A narrow VFU is possible because typical ML workloads compute O(n) more operations per MVM instruction than per vector instruction. Section 7.6 evaluates the impact of VFU width on efficiency.
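A minimal behavioral sketch of temporal SIMD follows; the lane count, register-file layout, and the vector-add operation are illustrative assumptions, not the actual PUMA parameters:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kVfuLanes = 4;  // the sweetspot found in the design space sweep

struct VectorInstr {
  size_t src1, src2, dest;  // starting register-file addresses
  size_t vecWidth;          // number of elements to process (vec-width operand)
};

// One wide vector instruction executed on a narrow VFU over multiple cycles.
void executeTemporalSIMD(std::vector<int16_t>& regFile, const VectorInstr& op) {
  // The operand steer unit holds the decoded instruction and feeds the VFU
  // kVfuLanes operands per cycle until the whole vector is consumed.
  for (size_t base = 0; base < op.vecWidth; base += kVfuLanes) {
    size_t lanes = std::min(kVfuLanes, op.vecWidth - base);  // one VFU cycle
    for (size_t l = 0; l < lanes; ++l)
      regFile[op.dest + base + l] = static_cast<int16_t>(
          regFile[op.src1 + base + l] + regFile[op.src2 + base + l]);  // e.g., vector add
  }
}
```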
Register File
We propose a register file design that uses ROM-Embedded RAM [69] to accomplish two objectives: (1) harboring general purpose registers, and (2) providing area-efficient transcendental function evaluations as motivated in Section 2.2.1.
Implementing transcendental functions
Area-efficient function evaluation is crucial for preserving crossbar storage density. For this reason, we use a ROM-Embedded RAM structure [69], which adds a wordline per row to embed a ROM used to store look-up tables without increasing the array area or RAM latency. Alternative digital implementations of transcendental functions are prohibitive due to their high area requirements, especially in light of the large number of transcendental function types used in ML. Transcendental function evaluations also use temporal SIMD (Section 3.3) to minimize fetch/decode energy consumption. Figure 3 details the operation of a ROM-Embedded RAM. In RAM mode, both wordlines (WL1 and WL2) are activated, followed by precharging or driving the bitlines for read or write operations, respectively (similar to a typical SRAM). In ROM mode, since a ROM access overwrites the RAM data, the first step is to buffer the RAM data. Subsequently, 1 is written to all cells with both wordlines activated. Next, 0 is written to all cells while keeping WL1 deactivated, which writes 0 to a cell only if its AXL is connected to WL2. Therefore, cells with AXL connected to WL1 and WL2 will store a ROM value of 1 and 0, respectively. A read with both wordlines activated retrieves the ROM data, followed by restoring the original RAM contents.
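The ROM-mode sequence can be captured in a small functional model, a behavioral sketch only: cell geometry and wordline electrics are abstracted behind a per-cell flag recording whether the cell's AXL is wired to WL1 or WL2:

```cpp
#include <bitset>
#include <cstddef>

// Functional model of one ROM-Embedded RAM row, following the step
// sequence described above.
template <size_t N>
struct RomEmbeddedRow {
  std::bitset<N> ram;       // current RAM contents
  std::bitset<N> axlOnWL1;  // true: AXL on WL1 (ROM '1'); false: on WL2 (ROM '0')

  std::bitset<N> readROM() {
    std::bitset<N> saved = ram;      // 1) buffer RAM data (ROM access clobbers it)
    ram.set();                       // 2) write 1 to all cells, both wordlines active
    for (size_t i = 0; i < N; ++i)   // 3) write 0 with WL1 deactivated:
      if (!axlOnWL1[i]) ram[i] = 0;  //    only cells with AXL on WL2 take the 0
    std::bitset<N> rom = ram;        // 4) read with both wordlines activated
    ram = saved;                     // 5) restore the original RAM contents
    return rom;
  }
};
```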
Sizing the register file
The register file enables buffering data in general-purpose registers to avoid higher-cost accesses to shared memory. However, if the register file were too large, it would degrade core storage density. A key observation is that in the majority of ML kernels, data is consumed within 1-2 instructions after being produced. This property is preserved via proper instruction scheduling by the compiler to reduce register pressure (Section 5.3.1). We therefore provision a per-core register file size of 2 × (crossbar dimension) × (# crossbars per core), e.g., 512 registers for a core with two 128-wide MVMUs. This size retains storage density while addressing the buffering requirements in the common case, as shown in Section 7.6. For window-based computations such as pooling layers that have a large number of intervening instructions (due to non-sequential data access across rows), the compiler spills registers to tile memory (Section 5.4).
ISA implications
To accommodate the large register file required to match crossbar size, long operands are used in the ISA (src and dest in Table 2), which is another motivation for the wide instruction design. To accommodate moving data between general-purpose registers and XbarIn/XbarOut registers, a copy instruction is included.
Memory Unit (MU)
The MU interfaces the core with the tile memory via load and store instructions. These instructions can be executed at 16-bit word granularity to support random access, as motivated in Section 2.3.2. However, the instructions also take a vec-width operand for wide vector loads created by sequential access patterns. Vector loads also use temporal SIMD (Section 3.3) to minimize fetch/decode energy consumption.

Static Instruction Usage

Figure 4 shows the breakdown of the static instruction count for six different ML workloads. The breakdown demonstrates that MVM alone is insufficient to support all types of workloads, and that the proposed ISA and functional units can be used to bridge that gap. The ratio of instructions requiring the MVMU versus the VFU varies depending on the number of matrix versus vector transformation layers in the network. CNNs additionally use control flow instructions, as discussed in Section 2.3.1. Deeper (or wider) versions of the same networks tend to have a similar instruction breakdown, except for data movement instructions, which tend to be more numerous to implement larger matrices spanning multiple cores and tiles.
Summary
In summary, the core architecture provides programmability while maintaining crossbar area efficiency. It features an instruction pipeline exposed by an ISA to support a wide variety of ML workloads. The use of temporal SIMD and ROM-Embedded RAM enable linear, nonlinear, and transcendental vector operations. Data movement optimizations are enabled via input shuffling, proper sizing of the register file, and flexible memory access instructions.
4 Tile Architecture

Figure 5 illustrates the architecture of a tile. A tile comprises multiple cores connected to a shared memory. The tile instruction memory holds send and receive instructions that move data between tiles. The shared memory and receive buffer are described in the following subsections.
Shared Memory
The shared memory facilitates communication across cores and tiles. Our shared memory design follows two key principles: (1) enabling inter-core synchronization, and (2) sizing the shared memory to preserve storage density.
Inter-core synchronization
Synchronization between cores happens when the output of one layer is sent as input to the next layer. It also happens within a layer if a large weight matrix is partitioned across multiple cores and tiles and partial MVM results need to be aggregated. To enable synchronization, we augment the shared memory with an attribute buffer that has two attributes per data entry: valid and count. The use of valid and count is illustrated in Figure 6. This mechanism enables consumer cores to block until producer cores have written their values, and ensures that producer cores do not overwrite data until it is consumed.
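The protocol can be sketched in software as follows, under assumed semantics: in hardware the blocking is a pipeline stall rather than a spin loop, and the entry layout is illustrative:

```cpp
#include <atomic>
#include <cstdint>

// One shared-memory word plus its attribute-buffer entry.
struct SharedEntry {
  std::atomic<bool> valid{false};  // set by the producer, cleared when drained
  std::atomic<uint8_t> count{0};   // number of consumers still to read
  int16_t data{0};
};

void produce(SharedEntry& e, int16_t v, uint8_t numConsumers) {
  while (e.valid.load()) { /* block: previous value not yet fully consumed */ }
  e.data = v;
  e.count.store(numConsumers);
  e.valid.store(true);             // consumers may now proceed
}

int16_t consume(SharedEntry& e) {
  while (!e.valid.load()) { /* block until the producer has written */ }
  int16_t v = e.data;
  if (e.count.fetch_sub(1) == 1)   // the last consumer frees the entry
    e.valid.store(false);
  return v;
}
```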
Sizing the shared memory
ML workloads may require buffering a large number of inputs to exploit their weight reuse pattern. However, a large shared memory degrades crossbar storage density. PUMA's spatial architecture enables programming inter-core/tile pipelines that exploit the inter-layer parallelism in ML workloads with weight reuse. These pipelines can be used to maintain throughput while keeping the shared memory size small. The pipeline parallelism is based on the observation that we do not need all the outputs of the previous layer to start the current layer's computation. For example, LSTMs process a sequence of vectors with S × N inputs per layer, where S is the number of vectors per input sequence and N is the vector size. A layer can begin its computation as soon as its first N inputs are available. Section 7.5 discusses the sizing requirements for different workloads and the impact on energy consumption.
Receive Buffer
The receive buffer is an N × M structure with N FIFOs, each with M entries. FIFOs ensure that data being sent from the same source tile is received in the same order. Having multiple FIFOs enables multiple source tiles to send data concurrently using different FIFOs. It also enables data to be received through the network independently of receive instruction ordering in the program. This independence is important because receive instructions are executed in program order in a blocking manner for hardware simplicity.
Each send and receive instruction has a fifo-id operand that specifies the receiving FIFO to be used for incoming data. Using the FIFO ID instead of the sender tile ID provides additional flexibility for the compiler to apply FIFO virtualization, where a FIFO can be used by different sender tiles in different programs or program phases while keeping the number of physical FIFOs small. The key insight is that a typical ML layer will receive inputs only from the tiles mapped to the previous layer. Therefore, using 16 FIFOs (despite the node having 138 tiles) supports workloads with up to (16 tiles) × (8 cores) × (2 MVMUs) × 128 = 32,768 previous-layer activations, which suffices for large-scale ML workloads.
Summary
In summary, PUMA tiles enable inter-core synchronization, inter-core/tile pipelines to contain shared memory size, and FIFO virtualization for efficient inter-tile communication.
5 Compiler
PUMA is a spatial architecture (not a data-parallel processor), which means that each core/tile executes a different set of instructions. Writing different code for each core/tile is not scalable as applications grow in size and complexity. A compiler is therefore mandatory for programmer productivity. This section describes key aspects of the compiler, while other implementation details of the compiler and the rest of the software stack are described in another paper [7].
Programming Interface
The PUMA compiler is a runtime compiler implemented as a C++ library. A simple code example is shown in Figure 7. The programmer first creates a model (line 01) with input/output vectors (lines 03-05) and constant matrices (lines 07-08). The programmer may also create vector streams, which are useful for setting up inter-core/tile pipelines (see Section 4.1.2). The programmer then describes a computation (line 10), which executes at run time to build a graph of the model (Figure 7 on the right). Finally, the model is compiled (line 12) to generate PUMA assembly code for each core and tile. In addition to this native interface, ONNX bindings are also provided for wider adoption and interoperability, enabling the compilation of models written in popular DNN frameworks such as Caffe2, PyTorch, Cognitive Toolkit, MXNet, and others.
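Since Figure 7 is not reproduced here, the following hypothetical sketch conveys the flavor of the interface; all identifiers (puma.h, Model::create, InputVector, relu, etc.) are illustrative assumptions, not the actual API:

```cpp
#include "puma.h"  // hypothetical header; the real library's entry point may differ

int main() {
  Model model = Model::create("mlp-layer");            // line 01: create a model

  auto in  = InputVector::create(model, "in", 1024);   // lines 03-05: I/O vectors
  auto out = OutputVector::create(model, "out", 256);

  auto W = ConstMatrix::create(model, "W", 1024, 256); // lines 07-08: constant matrix

  out = relu(W * in);  // line 10: runs at compile time, recording graph nodes,
                       // not performing the actual MVM

  model.compile();     // line 12: emit PUMA assembly for each core and tile
}
```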
Graph Partitioning
The first step in the compilation process is graph partitioning. The compiler divides tensors into 2D tiles, each the size of one MVMU, with appropriate padding, and divides the corresponding vectors and operations in the model accordingly.
Next, the graph is hierarchically partitioned, distributing subgraphs to different MVMUs, cores, and tiles as shown in the example in Figure 8 . The partitioning scheme used in this paper prioritizes placing MVMUs that feed to the same outputs together on the same core/tile, followed by those that read the same inputs, followed by those that feed each other (i.e., producer-consumer MVMUs). After partitioning the graph, the compiler inserts load/store operations across cores and allocates shared memory accordingly, reusing memory locations when there is pipelining. The compiler also inserts send/receive operations across tiles and assigns FIFO IDs accordingly (see Section 4.2), thereby virtualizing the FIFOs and ensuring there are no conflicts.
Instruction Scheduling
After the graph is partitioned into a sub-graph for each core/tile, the compiler schedules instructions by linearizing each sub-graph. Instruction scheduling has three main objectives: reducing register pressure, capturing ILP of MVM operations, and avoiding deadlock. 
Reducing register pressure
There are many possible ways to linearize the sub-graph, as long as source operations are visited before their destinations to enforce data dependences. We use a reverse post-order traversal, which prioritizes consuming produced values before producing new ones. This ordering reduces the number of live values, and hence the register pressure. Figure 9 shows an example.
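A minimal sketch of the traversal is shown below; the graph representation is an illustrative assumption:

```cpp
#include <functional>
#include <vector>

// A node's successors are its consumers; the graph is acyclic.
struct Node { std::vector<int> succs; };

// Reverse post-order: a DFS emits each node after all its successors,
// and reversing that order visits every producer before its consumers
// while finishing a value's consumers before starting unrelated work,
// which shortens live ranges.
std::vector<int> reversePostOrder(const std::vector<Node>& g, int root) {
  std::vector<bool> visited(g.size(), false);
  std::vector<int> postorder;
  std::function<void(int)> dfs = [&](int n) {
    visited[n] = true;
    for (int s : g[n].succs)
      if (!visited[s]) dfs(s);
    postorder.push_back(n);  // node emitted after all its successors
  };
  dfs(root);
  return {postorder.rbegin(), postorder.rend()};  // reverse => producers first
}
```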
Capturing ILP of MVM operations
As explained in Section 3.2.4, it is desirable to run different MVMUs in the same core simultaneously. The compiler must therefore fuse independent MVM operations on different MVMUs in the same core. We call this optimization MVM coalescing. MVMs can be coalesced when there are no data dependences between them. It is also desirable to coalesce MVMs whose results are consumed soon after one another to reduce register pressure. The compiler's strategy for coalescing is to first look for coalescing candidates among MVMs that are tiles of the same large MVM operation. Once those are exhausted, the remaining MVMs are coalesced by traversing the graph in reverse post-order (before linearization) and fusing each MVM node with the first eligible candidate in the traversal order. The dependence information is updated every time a fusion takes place. Finally, the graph is traversed one last time to perform the linearization. Figure 9(d) and (e) show two example linearizations, with Figure 9(e) following the proposed approach and resulting in fewer live values at once.
Avoiding deadlock
Linearizing the sub-graph of each core introduces control edges to the graph. Since communication across cores is blocking (see Section 4.1.1), cycles introduced by improper linearization cause deadlock, as shown in the example in Figure 10(b). For this reason, sub-graphs are not linearized independently. The entire graph is linearized at once, placing each instruction in the corresponding core/tile sequence, to ensure a globally consistent linearization order.
Register Allocation
The final step in compilation is register allocation. Recall that a core has three sets of registers: XbarIn, XbarOut, and general purpose. XbarIn (XbarOut) registers can be written (read) by any non-MVM instruction but can only be read (written) by MVM instructions. General purpose registers can be read and written by any non-MVM instruction. Liveness analysis is performed on each of these sets of registers separately. Conflicting live ranges in XbarIn and XbarOut registers result in spilling to general purpose registers via copy instructions. Conflicting live ranges in general purpose registers result in spilling to shared memory via load/store instructions.
Summary
In summary, the PUMA compiler provides a high-level programming interface and performs graph partitioning, instruction scheduling, and register allocation to generate low-level assembly. Instruction scheduling aims at reducing register pressure, coalescing MVMs, and avoiding deadlock.
6 Evaluation Methodology
PUMAsim
We have implemented a detailed architectural simulator (PUMAsim) to evaluate the performance and energy consumption of PUMA. PUMAsim runs applications compiled to the PUMA ISA and provides detailed traces of execution. The simulator incorporates functionality, timing, and power models of all system components. The datapath for the PUMA core and tile was designed at Register-Transfer Level (RTL) in Verilog HDL and mapped to the IBM 45nm SOI technology using Synopsys Design Compiler, which was used to measure area and power consumption. Subsequently, these area and power numbers were added to PUMAsim for system-level evaluations of workloads. For fair comparison with other application-specific accelerators, the datapath energy numbers have been scaled to the 32nm technology node. Memory modules are modelled in Cacti 6.0 [76] to estimate area, power, and latency. The on-chip network is modelled using the cycle-level Booksim 2.0 interconnection network simulator [59] and Orion 3.0 [63] for energy and area models. We use a chip-to-chip interconnect model similar to DaDianNao's [20], which has also been adopted by other accelerators. The MVMU power and area models are adapted from ISAAC [95]. The memristors have a resistance range of 100kΩ-1MΩ and a read voltage of 0.5V. The ADC is based on the Successive Approximation Register (SAR) design, and its area and power were obtained from the ADC survey and analysis in [77, 107].
In the present study, we do not compromise ML accuracy, as we conservatively choose 2-bit memristor crossbar cells. Note that laboratory demonstrations have shown up to 6-bit capabilities [52]. We use 16-bit fixed-point precision, which provides very high accuracy in inference applications [20, 95]. Table 3 shows the PUMA configuration used in our analysis and lists the area-energy breakdown of PUMA components.
System and Workloads
We choose popular server-grade CPUs and GPUs (listed in Table 4), the Google TPU [61] (a CMOS-based ASIC), and ISAAC [95] (an application-specific memristor-based accelerator) for evaluating PUMA. To measure the power consumption of CPUs and GPUs, we used management tools such as the board management controller (BMC) and nvidia-smi, respectively. For GPUs, we do not include full system power, just the board/device power. We run multiple iterations of the benchmarks on the GPU, discarding the longer warmup iterations and reporting results from the faster and more stable iterations. Torch7 [29] was used to execute the ML models on CPUs and GPUs. The PUMA compiler was used to compile models for PUMA. The models used are listed in Table 5.

7 Results

Inference Energy

Figure 11(a) shows PUMA inference energy compared to the other platforms. PUMA achieves massive energy reductions across all benchmarks for all platforms. Energy improvements come from two sources: lower MVM compute energy from crossbars and lower data movement energy by avoiding weight data movement.
CNNs show the least energy reduction over CMOS architectures (11.7×-13.0× over Pascal). Recall that CNNs have a lot of weight reuse because of the sliding window computation (discussed in Section 2.3.1). Hence, CMOS architectures can amortize DRAM accesses of weights across multiple computations. For this reason, PUMA's energy savings in CNNs come primarily from the use of crossbars for energy efficient MVM computation.
MLPs and LSTMs have little or no weight reuse (discussed in Section 2). Therefore, in addition to efficient MVM computation, PUMA has the added advantage of eliminating weight data movement. For this reason, we see much better energy reductions for MLPs (30.2×-80.1× over Pascal), Deep LSTMs (2,302×-2,446× over Pascal), and Wide LSTMs (758×-1336× over Pascal).
LSTMs (both deep and wide) show better reductions than MLPs because they have much larger model sizes (see # Parameters in Table 5). As a model grows in size, weight data grows as O(n²) and input data grows as O(n). For this reason, we see an increasing disparity between CMOS architectures, which move both weight and input data, and PUMA, which only moves input data. Wide LSTMs have few layers (1-2) with very large matrices, whereas Deep LSTMs have many layers (6-10) with smaller matrices. The large matrices in Wide LSTMs span multiple PUMA cores/tiles to compute one logical MVM, incurring higher intra-layer data movement overheads. Hence, Deep LSTMs show higher energy benefits than Wide LSTMs.

Inference Latency

Figure 11(b) shows PUMA inference latency compared to the other evaluated platforms. PUMA achieves latency improvements across all platforms except MLPs on some GPUs. Latency improvements come from three sources: lower MVM compute latency from crossbars, no weight data access latency, and spatial architecture pipelining which exploits inter-layer parallelism.
CNNs show the least latency improvement over CMOS architectures (2.73×-2.99× over Pascal). Since CNNs are compute-bound, CMOS architectures can hide the data access latency. Thus, PUMA's primary latency improvements in CNNs come from the use of crossbars for low-latency MVM computation and from spatial architecture pipelining.
LSTMs, on the other hand, are memory-bound. PUMA has the added advantage of eliminating weight data access latency in addition to low-latency MVM computation. For this reason, we see much better latency improvements for Deep LSTMs (41.6×-66.0× over Pascal) and Wide LSTMs (4.70×-5.24× over Pascal) than we see for CNNs. In comparing Deep and Wide LSTMs, Deep LSTMs have more layers than Wide LSTMs, hence more inter-layer parallelism with which to exploit spatial architecture pipelining (see # LSTM Layers in Table 5). Moreover, Deep LSTMs have less intra-layer communication than Wide LSTMs, hence lower data access latency.
MLPs show slowdowns compared to some GPU datapoints (0.24×-0.40× compared to Pascal). The reason is that, despite MLPs being memory-bound, their sizes are typically small enough that the memory bandwidth bottleneck is not pronounced, so they perform fairly well on GPUs. Moreover, MLPs have no inter-layer parallelism, so they do not exploit spatial architecture pipelining (Section 4.1.2). Nevertheless, PUMA's order-of-magnitude energy reduction is still beneficial for MLPs in energy-constrained environments.
Batch Throughput and Energy
Inference applications are not usually intended for large batch sizes due to real-time application requirements. Nevertheless, CMOS architectures perform well with large batch sizes because of the weight reuse that data-batching exposes. For this reason, we compare PUMA's batch energy and throughput with the other platforms in Figure 11.

Table 6 compares key technology features and efficiency metrics for the TPU [61] and PUMA. PUMA has 8.3× higher peak area-efficiency (TOPS/s/mm²) than the TPU. Since PUMA does not rely on weight reuse to improve throughput like CMOS architectures do, its area-efficiency remains constant across workloads and batch sizes. On the other hand, the TPU's peak throughput is almost an order of magnitude lower for applications with low data reuse due to its inability to amortize weight data movement. PUMA has 64.4×, 193×, and 9.7× higher area-efficiency than the TPU for MLP, LSTM, and CNN respectively, at the best TPU batch size.
PUMA has 1.65× higher peak power-efficiency (TOPS/s/W) than the TPU, with similar trends and reasoning as for area-efficiency on specific workloads. We expect PUMA's power-efficiency advantage over the TPU to grow by over 3× as silicon processes scale from 32nm to 7nm and 5nm. Thanks to PUMA's higher peak throughput, we can follow the power-reduction scaling curve at constant performance. Conversely, to narrow the performance gap, the TPU would follow a scaling curve closer to the performance-increase curve at constant power. Note that the former scaling curve is much faster (up to ∼40% power reduction per silicon node compared with ∼20% performance increase). Further, ADC power efficiency has also been following a similar and very fast scaling trend, with ∼2× power reduction per 1.8 years at the same sampling rate [78]. Using a digital MVMU would increase the total chip area of the accelerator by 4.93× for the same performance and would consume 6.76× the energy (factoring in data movement energy due to increased area).
Tensor Cores
Nvidia V100 GPUs with tensor cores (FP16) can be up to 6× more energy-efficient (architecture, tensor cores, and half-precision) than Pascal GPUs. Therefore, PUMA can still achieve energy gains over GPUs with tensor cores, despite the technology difference (PUMA: 32nm, V100: 12nm).

Evaluation of Optimizations

Input shuffling reduces data movement by exploiting input reuse (Section 2.1). Graph partitioning (compared to a baseline that partitions the graph randomly) reduces the number of loads, stores, sends, and receives, and hence the overall energy. Register pressure is kept low by the compiler, with little or no access to spilled registers across benchmarks. MVM coalescing runs MVMUs in parallel within a core, which reduces latency.

Design Space Exploration

Figure 12 shows a PUMA tile's peak area and power efficiency swept across multiple design-space parameters. For each sweep, all other parameters are kept at the sweetspot (the PUMA configuration with maximum efficiency). Efficiency is measured using a synthetic benchmark: an MVM operation on each MVMU, followed by a VFU operation, then a ROM-Embedded RAM look-up.

Increasing the MVMU dimension increases the number of crossbar multiply-add operations quadratically and the number of peripherals linearly, resulting in better amortization of the peripheral overhead. However, larger MVMUs also require ADCs with higher resolution, and ADC overhead grows non-linearly with resolution, which counterbalances the amortization. Increasing the # MVMUs per core increases efficiency because of the high efficiency of memristor crossbars relative to CMOS digital components. However, with too many MVMUs, the VFU becomes a bottleneck, which degrades efficiency. Increasing the VFU width degrades efficiency because of the low efficiency of CMOS relative to memristor crossbars; however, a VFU that is too narrow becomes a bottleneck. The sweetspot is found at 4 vector lanes. Increasing the # cores per tile improves efficiency until shared memory bandwidth becomes the bottleneck. Increasing the register file size results in lower efficiency, but a register file that is too small results in too many register spills.

Figure 13 shows PUMA's inference accuracy for different memristor bit precisions (bits per device) and write noise levels (σN). Higher precision can lead to larger accuracy loss due to the reduction in noise margin. It can be seen that PUMA with 2-bit memristors performs well even at high noise levels. Real CMOS hardware follows the σN = 0 noise level. Further, recent research has explored coding schemes for reliable memristor computation at high precision [38, 92].
8 Related Work
Sze et al. [103] provide a thorough survey of deep learning accelerators. In the digital realm, accelerators can be classified as weight stationary spatial architectures [15, 17, 36, 43, 85, 93] , output stationary spatial architectures [33, 47, 86] , spatial architectures with no local reuse [19, 20, 117] , and row stationary spatial architectures [21] . Many designs also support optimizations and features including weight pruning and exploiting sparsity [3, 4, 25, 32, 48, 84, 89, 119] , reducing precision [8, 62] , stochastic computing [65, 90] , layer fusing [6] , meeting QoS/QoR requirements [108] , graph tuning [56] , and reconfigurable interconnects [68] . Digital accelerators have varied in their degree of flexibility, ranging from custom accelerators specialized for a particular field [14, 79, 101, 114] , to accelerators that are fully programmable via an ISA [61, 71, 72, 105, 119] . FPGAs have also been popular targets for building accelerators [37, 45, 74, 86, 96, 97, 110, 117] . All these works remain in the digital domain, while PUMA leverages hybrid digital-analog computing. Near-memory acceleration for ML has been proposed using DRAM [42, 64, 70] and SRAM [111, 118] . PUMA uses non-volatile memristive crossbars for near-memory acceleration.
Many machine learning accelerators have been proposed that leverage memristor crossbars [9, 10, 12, 22, 23, 53, 58, 66, 73, 88, 95, 100]. These accelerators have been demonstrated on several types of workloads, including BSBs [53], MLPs [22, 39, 73, 88], SNNs [9, 66], BMs [12], and CNNs [22, 23, 95, 100, 109]. Some accelerators support inference only [9, 23, 53, 73, 88, 95] while others also support training [12, 22, 66, 100]. Ji et al. [57, 58] transform neural networks to configure such accelerators. These accelerators vary in flexibility, but even the most flexible rely on state machine configuration and have only been demonstrated on a few types of workloads. PUMA is the first memristor-based accelerator for machine learning inference that is ISA-programmable and general-purpose.
Fujiki et al. [40] propose an ISA-programmable memristor-based accelerator. Their accelerator is a data-parallel accelerator, whereas PUMA is a data-flow accelerator with more capability for producer-consumer synchronization. Moreover, their accelerator optimizes crossbars for vector operations in general-purpose workloads, whereas PUMA optimizes crossbars for the MVM operations prevalent in machine learning and uses digital VFUs for vector operations rather than crossbars.
Chung et al. [24] propose Brainwave, a spatial accelerator built with FPGAs. Compared to Brainwave, a PUMA core performs 0.26 million 16-bit ops, equivalent to 1.04 million 8-bit ops, per coalesced MVM instruction. A Brainwave NPU performs 1.3 million 8-bit ops per instruction. Therefore, PUMA and Brainwave have comparable control granularity, while PUMA has 40.8× higher storage density (Brainwave Stratix 10 estimate).
Memristors have also been proposed for building byte-addressable non-volatile memories. There have been various works centered around system support for non-volatile memory, including file systems [30], memory allocators [11, 83], programming models [16, 27, 106], durable data structures [31, 55, 113], representation of pointers [18, 28, 34, 35], and architecture support [60, 80, 82, 99].
There have been concerns in the community about memristor manufacturability. We distinguish between medium-density embedded memristor applications and high-density storage-class memory (SCM). Memristors in PUMA use a 1T1R configuration, which has been shown to have good manufacturability [50]. They are very different from SCM, where the selector transistor may be replaced with an in-line two-terminal selector device for higher density, which complicates manufacturability. Panasonic formed a joint venture with the UMC foundry in 2017 to enable integration of memristors into UMC's 40nm CMOS process, with first samples planned in 2018 [115]. TSMC also completed development of its own memristor technology, which entered risk production in a 40nm ULP CMOS process node at the end of 2017 [1].
9 Conclusion
PUMA is the first ISA-programmable accelerator for ML inference that uses hybrid CMOS-memristor technology. It enhances memristor crossbars with general purpose execution units carefully designed to maintain crossbar area/energy efficiency and storage density. Our accelerator design comes with a complete compiler to transform high-level code to PUMA ISA and a detailed simulator for estimating performance and energy consumption. Our evaluations show that PUMA can achieve significant improvements compared to state-of-the-art CPUs, GPUs, and ASICs for ML acceleration.
