Abstract-This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks. Techniques to do in-situ arithmetic in SRAM arrays, create efficient data mappings, and reduce data movement are proposed. The Neural Cache architecture is capable of fully executing convolutional, fully connected, and pooling layers in-cache. The proposed architecture also supports quantization in-cache.
I. INTRODUCTION
In the last two decades, the number of processor cores per chip has steadily increased while memory latency has remained relatively constant. This has led to the so-called memory wall [1], where memory bandwidth and memory energy have come to dominate computation bandwidth and energy. With the advent of data-intensive systems, this problem is further exacerbated and, as a result, a large fraction of energy today is spent in moving data back and forth between memory and compute units. At the same time, neural computing and other data-intensive computing applications have emerged as increasingly popular application domains, exposing much higher levels of data parallelism. In this paper, we exploit both these synergistic trends by opportunistically leveraging the huge caches present in modern processors to perform massively parallel processing for neural computing.
Traditionally, researchers have attempted to address the memory wall by building a deep memory hierarchy. Another solution is to move compute closer to memory, which is often referred to as processing-in-memory (PIM). Past PIM [2]-[4] solutions tried to move computing logic near DRAM by integrating DRAM with a logic die using 3D stacking [5]-[7]. This helps reduce latency and increase bandwidth; however, the functionality and design of DRAM itself remain unchanged. Also, this approach adds substantial cost to the overall system, as each DRAM die needs to be augmented with a separate logic die. Integrating computation on the DRAM die itself is difficult since the DRAM process is not optimized for logic computation.
In this paper, we instead completely eliminate the line that distinguishes memory from compute units. Similar to the human brain, which does not separate these two functionalities distinctly, we perform computation directly on the bit lines of the memory itself, keeping data in-place. This eliminates data movement and hence significantly improves energy efficiency and performance. Furthermore, we take advantage of the fact that over 70% of silicon in today's processor dies simply stores and provides data retrieval; harnessing this area by re-purposing it to perform computation can lead to massively parallel processing.
The proposed approach builds on an earlier silicon test chip implementation [8] and architectural prototype [9] that show how simple logic operations (AND/NOR) can be performed directly on the bit lines in a standard SRAM array. This is performed by enabling SRAM rows simultaneously while leaving the operands in-place in memory. This paper presents the Neural Cache architecture, which leverages these simple logic operations to perform arithmetic computation (add, multiply, and reduction) directly in the SRAM array by storing the data in transposed form and performing bit-serial computation, while incurring only an estimated 7.5% area overhead (which translates to less than 2% area overhead for the processor die). Each column in an array performs a separate calculation, and the thousands of memory arrays in the cache can operate concurrently.
The end result is that cache arrays morph into massive vector compute units (up to 1,146,880 bit-serial ALU slots in a Xeon E5 cache) that are one to two orders of magnitude larger than a modern graphics processor's (GPU's) aggregate vector width. By avoiding data movement in and out of memory arrays, we naturally save vast amounts of energy that is typically spent in shuffling data between compute units and on-chip memory units in modern processors.
Neural Cache leverages opportunistic in-cache computing resources for accelerating Deep Neural Networks (DNNs). There are two key challenges to harness a cache's computing resources. First, all the operands participating in an in-situ operation must share bit-lines and be mapped to the same memory array. Second, intrinsic data parallel operations in DNNs have to be exposed to the underlying parallel hardware and cache geometry. We propose a data layout and execution model that solves these challenges, and harnesses the full potential of in-cache compute capabilities. Further, we find that thousands of in-cache compute units can be utilized by replicating data and improving data reuse.
simultaneously. Computation (and and nor) on the data stored in the activated word-lines is performed in the analog domain by sensing the shared bit-lines. Compute cache [9] uses this basic circuit framework along with extensions to support additional operations: copy, bulk zeroing, xor, equality comparison, and search.
Data corruption due to multi-row access is prevented by lowering the word-line voltage to bias against the write of the SRAM array. Measurements across 20 fabricated 28 nm test chips (Figure 2a) demonstrate that data corruption does not occur even when 64 word-lines are simultaneously activated during such an in-place computation; Compute cache, however, only needs two. Monte Carlo simulations also show robustness of more than six sigma, which is considered the industry standard for robustness against process variations. This robustness comes at the cost of increased delay during compute operations, but it has no effect on conventional array read/write accesses. The increased delay is more than compensated for by the massive parallelism exploited by Neural Cache.
C. Cache Geometry
We provide a brief overview of a cache's geometry in a modern processor. Figure 3 illustrates a multi-core processor modeled loosely after Intel's Xeon processors [14] , [15] . Shared Last Level Cache (LLC) is distributed into many slices (14 for Xeon E5 we modeled), which are accessible to the cores through a shared ring interconnect (not shown in figure) . Our proposal is to perform in-situ vector arithmetic operations within the SRAM arrays (Figure 3 (d) ). The resulting architecture can have massive parallelism by repurposing thousands of SRAM arrays (4480 arrays in Xeon E5) into vector computational units.
We observe that LLC access latency is dominated by wire delays inside a cache slice, accessing upper-level cache control structures, and network-on-chip. Thus, while a typical LLC access can take ∼30 cycles, an SRAM array access is only 1 cycle (at 4 GHz clock [14] ). Fortunately, in-situ architectures such as Neural Cache require only SRAM array accesses and do not incur the overheads of a traditional cache access. Thus, vast amounts of energy and time spent on wires and higher-levels of memory hierarchy can be saved.
III. NEURAL CACHE ARITHMETIC
Compute cache [9] supported several simple operations (logical and copy). These operations are bit-parallel and do not require interaction between bit lines. Neural Cache requires support for more complex operations (addition, multiplication, reduction). The critical challenge in supporting these complex computing primitives is facilitating interaction between bit lines. Consider supporting an addition operation which requires carry propagation between bit lines. We propose bit-serial implementation with transposed data layout to address the above challenge.
A. Bit-Serial Arithmetic
Bit-serial computing architectures have been widely used for digital signal processing [16], [17] because of their ability to provide massive bit-level parallelism at low area costs. The key idea is to process one bit of multiple data elements every cycle. This model is particularly useful in scenarios where the same operation is applied to the same bit of multiple data elements. Consider the following example to compute the element-wise sum of two arrays with 512 32-bit elements. A conventional processor would process these arrays element-by-element, taking 512 steps to complete the operation. A bit-serial processor, on the other hand, would complete the operation in 32 steps as it processes the arrays bit-slice by bit-slice instead of element-by-element. Note that a bit-slice is composed of bits from the same bit position, but corresponding to different elements of the array. Since the number of elements in arrays is typically much greater than the bit-precision for each element stored in them, bit-serial computing architectures can provide much higher throughput than bit-parallel arithmetic. Note also that bit-serial operation allows for flexible operand bit-width, which can be advantageous in DNNs where the required bit width can vary from layer to layer.
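The 512-element example can be sketched in software as follows. This is a minimal illustration and not the hardware implementation: each bit-slice is modeled as one Python integer whose bit j belongs to element j, and the helper names (to_bit_slices, from_bit_slices, bit_serial_add) are hypothetical.

```python
import random

def to_bit_slices(values, width):
    """Slice i is an int whose bit j holds bit i of values[j]."""
    slices = []
    for i in range(width):
        s = 0
        for j, v in enumerate(values):
            s |= ((v >> i) & 1) << j
        slices.append(s)
    return slices

def from_bit_slices(slices, count):
    """Inverse of to_bit_slices: rebuild the `count` element values."""
    return [sum(((s >> j) & 1) << i for i, s in enumerate(slices))
            for j in range(count)]

def bit_serial_add(a_slices, b_slices):
    """Add every element at once, one bit position per step.

    The result wraps modulo 2**width because the final carry is dropped.
    """
    carry = 0          # one carry bit per element, packed into an int
    out = []
    steps = 0
    for a, b in zip(a_slices, b_slices):   # LSB slice first
        out.append(a ^ b ^ carry)
        carry = (a & b) | ((a ^ b) & carry)
        steps += 1
    return out, steps

WIDTH, COUNT = 32, 512
rng = random.Random(0)
xs = [rng.getrandbits(WIDTH) for _ in range(COUNT)]
ys = [rng.getrandbits(WIDTH) for _ in range(COUNT)]

sum_slices, steps = bit_serial_add(to_bit_slices(xs, WIDTH),
                                   to_bit_slices(ys, WIDTH))
result = from_bit_slices(sum_slices, COUNT)
```

The loop in bit_serial_add runs once per bit position, so the step count depends on the operand width (32) and not on the number of elements (512).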
Note that although bit-serial computation is expected to have higher latency per operation, it is expected to have significantly larger throughput, which compensates for higher operation latency. For example, the 8KB SRAM array is composed of 256 word lines and 256 bit lines and can operate at a maximum frequency of 4 GHz for accessing data [14] , [15] . Up to 256 elements can be processed in parallel in a single array. A 2.5 MB LLC slice has 320 8KB arrays as shown in Figure 3 . Haswell server processor's 35 MB LLC can accommodate 4480 such 8KB arrays. Thus up to 1,146,880 elements can be processed in parallel, while operating at frequency of 2.5 GHz when computing. By repurposing memory arrays, we gain the above throughput for near-zero cost. Our circuit analysis estimates an area overhead of additional bit line peripheral logic to be 7.5% for each 8KB array. This translates to less than 2% area overhead for the processor die.
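The parallelism figures quoted above follow from the array geometry. The short calculation below reproduces them from the numbers given in the text; the variable names are ours.

```python
# Geometry taken from the text: 256 x 256 arrays, 320 arrays per slice, 14 slices.
WORD_LINES = 256
BIT_LINES = 256
ARRAYS_PER_SLICE = 320
SLICES = 14

ARRAY_BYTES = WORD_LINES * BIT_LINES // 8        # 8 KB per array
slice_bytes = ARRAYS_PER_SLICE * ARRAY_BYTES     # 2.5 MB per slice
llc_bytes = SLICES * slice_bytes                 # 35 MB LLC
total_arrays = SLICES * ARRAYS_PER_SLICE         # 4480 arrays
alu_slots = total_arrays * BIT_LINES             # one bit-serial ALU per bit line

print(total_arrays, alu_slots)
```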
B. Addition
In conventional architectures, arrays are generated, stored, accessed, and processed element-by-element in the vertical direction along the bit lines. We refer to this data layout as the bit-parallel or regular data layout. Bit-serial computing in SRAM arrays can be realized by storing data elements in a transpose data layout. Transposing ensures that all bits of a data element are mapped to the same bit line, thereby obviating the necessity for communication between bit lines. Section III-F discusses techniques to store data in a transpose layout. We use the addition of two vectors of 4-bit numbers to explain how addition works in the SRAM. The 2 words that are going to be added together have to be put in the same bit line. The vectors A and B should be aligned in the array like Figure 4 . Vector A occupies the first 4 rows of the SRAM array and vector B the next 4 rows. Another 4 empty rows of storage are reserved for the results. There is a row of latches inside the column peripheral for the carry storage. The addition algorithm is carried out bit-by-bit starting from the least significant bit (LSB) of the two words. There are two phases in a single operation cycle. In the first half of the cycle, two read word lines (RWL) are activated to simultaneously sense and wire-and the value in cells on the same bit line. To prevent the value in the bit cell from being disturbed by the sensing phase, the RWL voltage should be lower than the normal VDD. The sense amps and logic gates in the column peripheral (Section III-E) use the 2 bit cells as operands and carry latch as carry-in to generate sum and carry-out. In the second half of the cycle, a write word line (WWL) is activated to store back the sum bit. The carry-out bit overwrites the data in the carry latch and becomes the carry-in of the next cycle. As demonstrated in Figure 4 , in cycles 2, 3, and 4, we repeat the first cycle to add the second, third, and fourth bit respectively. 
Addition of two n-bit numbers takes n + 1 cycles to complete, with the additional cycle used to write the carry at the end.
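A software model of the in-array addition is sketched below. It is illustrative only: the array is a list of rows (word lines), each row holds one bit per bit line, bits are stored LSB first, and the final carry is written to an extra row after the result rows. The row ordering, the extra carry row, and the function names are assumptions made for the sketch.

```python
def transpose_store(values, width):
    """Return `width` rows; row i holds bit i (LSB first) of every value."""
    return [[(v >> i) & 1 for v in values] for i in range(width)]

def in_array_add(array, a_row, b_row, out_row, width):
    """Add the vectors stored at a_row and b_row; return the cycle count."""
    cols = len(array[0])
    carry_latch = [0] * cols                 # one latch per bit line
    cycles = 0
    for i in range(width):
        ra, rb = array[a_row + i], array[b_row + i]
        # First half-cycle: sense both rows and compute sum / carry-out.
        sum_bits = [x ^ y ^ c for x, y, c in zip(ra, rb, carry_latch)]
        carry_out = [(x & y) | ((x ^ y) & c)
                     for x, y, c in zip(ra, rb, carry_latch)]
        # Second half-cycle: write the sum back, update the carry latch.
        array[out_row + i] = sum_bits
        carry_latch = carry_out
        cycles += 1
    # Extra cycle: write the final carry as the most significant result bit.
    array[out_row + width] = carry_latch[:]
    cycles += 1
    return cycles

WIDTH = 4
A = [3, 15, 7, 9]
B = [5, 1, 8, 15]
array = (transpose_store(A, WIDTH) + transpose_store(B, WIDTH)
         + [[0] * len(A) for _ in range(WIDTH + 1)])

cycles = in_array_add(array, a_row=0, b_row=WIDTH, out_row=2 * WIDTH, width=WIDTH)
sums = [sum(array[2 * WIDTH + i][j] << i for i in range(WIDTH + 1))
        for j in range(len(A))]
```

For the 4-bit example the model uses five cycles, matching the n + 1 count, and every bit line produces its sum independently of the others.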
C. Multiplication
We demonstrate how bit-serial multiplication is achieved based on addition and predication using the example of a 2-bit multiplication. In addition to the carry latch, an extra row of storage, the tag latch, is added to the bottom of the array. The tag bit is used as an enable signal to the bit line driver. When the tag is one, the addition result sum will be written back to the array. If the tag is zero, the data in the array will remain unchanged. Two vectors of 2-bit numbers, A and B, are stored in the transposed fashion and aligned as shown in Figure 6. Another 4 empty rows are reserved for the product and initialized to zero. Suppose A is a vector of multiplicands and B is a vector of multipliers. First, we load the LSB of the multiplier to the tag latch. If the tag equals one, the multiplicand in that bit line will be copied to the product in the next two cycles, as if it is added to the partial product. Next, we load the second bit of the multiplier to the tag. If the tag equals one, the multiplicand in that bit line will be added to the second and third bits of the product in the next two cycles, as if a shifted version of A is added to the partial product. Finally, the data in the carry latch is stored to the most significant bit of the product. Including the initialization steps, it takes $n^2 + 5n - 2$ cycles to finish an n-bit multiplication. Division can be supported using a similar algorithm and takes $1.5n^2 + 5.5n$ cycles.
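The predicated shift-and-add scheme can be modeled as below. This is a functional sketch under our own assumptions (LSB-first rows, hypothetical function names); it does not attempt to reproduce the exact cycle schedule, so the cycle count is taken from the formula in the text.

```python
def bit_serial_multiply(a_vals, b_vals, n):
    """Multiply n-bit vectors element-wise using tag-predicated additions."""
    cols = len(a_vals)
    a_rows = [[(v >> i) & 1 for v in a_vals] for i in range(n)]
    b_rows = [[(v >> i) & 1 for v in b_vals] for i in range(n)]
    prod = [[0] * cols for _ in range(2 * n)]      # product rows, zeroed
    for k in range(n):                             # one multiplier bit per pass
        tag = b_rows[k][:]                         # load multiplier bit into tag latch
        carry = [0] * cols
        for i in range(n):                         # add A into product bits k .. k+n-1
            row = prod[k + i]
            for c in range(cols):
                a, p, cin = a_rows[i][c], row[c], carry[c]
                carry[c] = (a & p) | ((a ^ p) & cin)
                if tag[c]:                         # write-back only where tag is one
                    row[c] = a ^ p ^ cin
        for c in range(cols):                      # store carry as the next product bit
            if tag[c]:
                prod[k + n][c] = carry[c]
    return [sum(prod[i][c] << i for i in range(2 * n)) for c in range(cols)]

def multiply_cycles(n):
    """Cycle count quoted in the text for an n-bit multiplication."""
    return n * n + 5 * n - 2

products = bit_serial_multiply([3, 2, 1, 3], [3, 1, 2, 0], n=2)
```

Bit lines whose tag is zero keep their partial product untouched, which is how the same instruction stream serves every multiplier value at once.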
D. Reduction
Reduction is a common operation for DNNs. Reducing the elements stored on different bit lines to one sum can be performed with a series of word line moves and additions. Figure 5 shows an example that reduces 4 words, C1, C2, C3, and C4. First, words C3 and C4 are moved below C1 and C2 to different word lines. This is followed by addition. Another set of move and addition reduces the four elements to one word. Each reduction step increases the number of word lines to move, as the number of bits in the partial sum grows. The number of reduction steps needed is $\log_2$ of the number of words to be reduced. In column multiplexed SRAM arrays, moves between word lines can be sped up using sense-amp cycling techniques [18].
When the elements to be reduced do not fit in the same SRAM array, reductions must be performed across arrays, which can be accomplished by inter-array moves. In DNNs, reduction is typically performed across channels. In the model we examined, our optimized data mapping is able to fit all channels in the space of two arrays that share sense amps. We employ a technique called packing that allows us to reduce the number of channels in large layers (Section IV-A).
Figure 7: Bit line peripheral design
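The move-and-add reduction can be sketched as follows. The model keeps one partial sum per bit line, moves the upper half of the bit lines under the lower half, and adds; it assumes a power-of-two number of words and that the partial-sum width grows by one bit per step. The function name and the width bookkeeping are illustrative assumptions.

```python
def tree_reduce(words, width=8):
    """Reduce one word per bit line to a single sum.

    Returns (total, steps, final_width).
    """
    lanes = list(words)
    if not lanes or len(lanes) & (len(lanes) - 1):
        raise ValueError("number of words must be a power of two")
    steps = 0
    while len(lanes) > 1:
        half = len(lanes) // 2
        moved = lanes[half:]                       # move: upper half goes below lower half
        lanes = [x + y for x, y in zip(lanes[:half], moved)]   # add on all bit lines
        width += 1                                 # each addition can add one bit
        steps += 1
    return lanes[0], steps, width

total, steps, final_width = tree_reduce([1, 2, 3, 4], width=8)
```

For four words the loop runs twice, first forming C1 + C3 and C2 + C4 and then adding those two partial sums.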
E. SRAM Array Peripherals
The bit-line peripherals are shown in Figure 7. Two single-ended sense amps sense the wire-and result from two cells, A and B, in the same bitline. The sense amp in BL gives the result of $A \,\&\, B$, while the sense amp in BLB gives the result of $\overline{A} \,\&\, \overline{B}$. The sense amps can use the reconfigurable sense amplifier design [12], which can combine into a large differential SA for speed in normal SRAM mode and separate into two single-ended SAs for area efficiency in computation mode. Through a NOR gate, we can get $A \oplus B$, which is then used to generate the sum ($A \oplus B \oplus C_{in}$) and carry ($(A \,\&\, B) + ((A \oplus B) \,\&\, C_{in})$). As described in the previous sections, C and T are latches used to store the carry and tag bits. A 4-to-1 mux selects the data to be written back among $Sum$, $Carry_{out}$, $Data_{in}$, and $Tag$. The Tag bit is used as the enable signal for the bit line driver to decide whether to write back or not.
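The column logic can be checked exhaustively with a small truth-table model. The sketch assumes that the BLB sense amp yields the AND of the complemented cell values (that is, the NOR of A and B), which is what makes the NOR gate produce the XOR; the function name is hypothetical.

```python
from itertools import product

def column_peripheral(a, b, c_in):
    """Return (sum, carry_out) for one bit line."""
    and_ab = a & b                        # sensed on BL
    nor_ab = (1 - a) & (1 - b)            # sensed on BLB
    xor_ab = 1 - (and_ab | nor_ab)        # NOR of the two sense-amp outputs
    s = xor_ab ^ c_in
    c_out = and_ab | (xor_ab & c_in)
    return s, c_out

# The pair (carry_out, sum) must encode a + b + c_in for every input combination.
truth_table = {(a, b, c): column_peripheral(a, b, c)
               for a, b, c in product((0, 1), repeat=3)}
```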
F. Transpose Gateway Units
The transpose data layout can be realized in the following ways. First, leverage programmer support to store and access data in the transpose format. This option is useful when the data to be operated on does not change at runtime. We utilize this for filter weights in neural networks. However, this approach increases software complexity by requiring programmers to reason about a new data format and cache geometry. Second, design a few hardware transpose memory units (TMUs) placed in the cache control box (C-BOX in Figure 3 (b) ). A TMU takes data in the bit-parallel or regular layout and converts it to the transpose layout before storing into SRAM arrays or vice-versa while reading from SRAM arrays. The second option is attractive because it supports dynamic changes to data. TMUs can be built out of SRAM arrays with multi-dimensional access (i.e., access data in both horizontal and vertical direction). Figure 8 shows a possible TMU design using an 8T SRAM array with sense-amps in both horizontal and vertical directions. Compared to a baseline 6T SRAM, the transposable SRAM requires a larger bitcell to enable read/write in both directions. Note that only a few TMUs are needed to saturate the available interconnect bandwidth between cache arrays. In essence, the transpose unit serves as a gateway to enable bit-serial computation in caches.
IV. NEURAL CACHE ARCHITECTURE
The Neural Cache architecture transforms SRAM arrays in LLC to compute functional units. We describe the computation of convolution layers first, followed by other layers. Figure 9 shows the data layout and overall architecture for one cache slice, modeled after Xeon processors [14] , [15] . The slice has twenty ways. The last way (way-20) is reserved to enable normal processing for CPU cores. The penultimate way (way-19) is reserved to store inputs and outputs. The remaining ways are utilized for storing filter weights and computing.
A typical DNN model consists of several layers, and each layer consists of several hundred thousand convolutions.
Neural Cache assumes 8-bit precision and quantized inputs and filter weights. Several works [19] - [21] have shown that 8-bit precision has sufficient accuracy for DNN inference. 8-bit precision was adopted by Google's TPU [22] . Quantizing input data requires re-quantization after each layer as discussed in Section IV-D.
A. Data Layout
This section first describes the data layout of one SRAM array and execution of one convolution. Then we discuss the data layout for the whole slice and parallel convolutions across arrays and slices.
A single convolution consists of generating one of the E × E × M output elements. This is accomplished by multiplying R×S×C input filter weights with a same-size window from the input feature map across the channels. Neural Cache exploits channel-level parallelism in a single convolution. For each convolution, an array executes R×S Multiply and Accumulate (MAC) operations in parallel across channels. This is followed by a reduction step across channels.
An example layout for a single array is shown in Figure 10 (a). Every bitline in the array has 256 bits and can store 32 1-byte elements in transpose layout. Every bitline stores R×S filter weights (green dots). The channels are stored across bit lines. To perform MACs, space is reserved for accumulating partial sums (lavender dots) and for scratch pad (pink dots). Partial sums and scratch pad take 3×8 and 2×8 word lines.
Reduction requires an additional 8 × 8 word lines as shown in Figure 10 (b) . However the scratch pad and partial sum can be overwritten for reduction as the values are no longer needed. The maximum size for reducing all partial sums is 4 bytes. So to perform reduction, we reserve two 4 byte segments. After adding the two segments, the resultant can be written over the first segment again. The second segment is then loaded with the next set of reduced data.
Each array may perform several convolutions in series, thus we reserve some extra space for output elements (red dots). The remaining word lines are used to store input elements (blue dots). It is desirable to use as many word lines as possible for inputs to maximize input reuse across convolutions. For example in a 3 × 3 convolution with a stride of 1, 6 of the 9 bytes are reused across each set of input loads. Storing many input elements allows us to exploit this locality and reduce input streaming time.
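The reuse figure for a 3 × 3 window can be reproduced by counting the input positions shared by two adjacent windows. The helper below is an illustrative sketch that assumes the window slides along one dimension only.

```python
def window_reuse(r, s, stride):
    """Return (shared, total) input positions for two adjacent R x S windows."""
    first = {(y, x) for y in range(r) for x in range(s)}
    second = {(y, x + stride) for y in range(r) for x in range(s)}
    return len(first & second), len(first)

reused, window_size = window_reuse(3, 3, 1)
```

With a stride of 1 the two windows share two of their three columns, which gives the 6 of 9 bytes mentioned above; a larger stride reduces the overlap.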
The filter sizes (R×S) range from 1-25 bytes in Inception v3. The common case is a 3 × 3 filter. Neural Cache data mapping employs filter splitting for large filters. The filters are split across bitlines when their size exceeds 9 bytes. The other technique employed is filter packing. For 1 × 1 input loading would function the same way as convolution layers, except without any filters in the arrays.
Calculating the maximum value of two or more numbers can be accomplished by designating a temporary maximum value. The next output value is then subtracted from the temporary maximum, and the result is stored in a separate set of word lines. The most significant bit of the result is used as a mask for a selective copy. The next input is then selectively copied to the maximum location based on the value of the mask. This process is repeated for the rest of the inputs in the array.
Quantization of the outputs is done by calculating the minimum and maximum value of all the outputs in the given layer. The min can be computed using a similar set of operations described for max. For quantization, the min and max will first be calculated within each array. Initially, all outputs in the array will be copied to allow performing the min and max at the same time. After the first reduction, all subsequent reductions of min/max are performed the same way as channel reductions. Since quantization needs the min and max of the entire cache, a series of bus transfers is needed to reduce min and max to one value. This is slower than in-array reductions; however, unlike channel reduction, min/max reduction happens only once in a layer, making the penalty small.
After calculating the min and max for the entire layer, the result is then sent to the CPU. The CPU then performs floating point operations on the min and max of the entire layer and computes two unsigned integers. These operations take too few cycles to show up in our profiling, so their cost is assumed to be negligible. The two unsigned integers sent back by the CPU are used for in-cache multiplications, adds, and shifts to be performed on all the output elements to finally quantize them.
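The mask-based maximum can be sketched as follows. The model assumes unsigned values of a fixed width and one extra bit for the difference, so that the most significant bit of the difference is one exactly when the next value exceeds the temporary maximum. These assumptions and the function name are ours.

```python
def in_array_max(columns, width=8):
    """columns: one list of unsigned `width`-bit values per bit line."""
    wrap = (1 << (width + 1)) - 1
    cur_max = [col[0] for col in columns]          # temporary maximum per bit line
    for k in range(1, len(columns[0])):
        nxt = [col[k] for col in columns]
        # Subtract the next value from the temporary maximum (two's complement).
        diff = [(m - x) & wrap for m, x in zip(cur_max, nxt)]
        # MSB of the difference is the mask: set when the next value is larger.
        mask = [(d >> width) & 1 for d in diff]
        # Selective copy: overwrite the maximum only where the mask is set.
        cur_max = [x if t else m for m, x, t in zip(cur_max, nxt, mask)]
    return cur_max

columns = [[3, 200, 17], [255, 0, 254], [5, 5, 5], [0, 1, 2]]
maxima = in_array_max(columns)
```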
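One common way to realize this split between CPU and cache is fixed-point rescaling, sketched below. The text does not specify what the two unsigned integers are; the sketch assumes they are a multiplier and a shift amount approximating the scale 255 / (max - min), and all names are hypothetical.

```python
def requant_params(lo, hi, shift=16, out_bits=8):
    """CPU side: floating point work that yields two unsigned integers."""
    levels = (1 << out_bits) - 1
    scale = levels / (hi - lo) if hi > lo else 0.0
    multiplier = int(round(scale * (1 << shift)))
    return multiplier, shift

def requantize(values, lo, multiplier, shift, out_bits=8):
    """In-cache side: add (subtract the min), multiply, shift, per element."""
    levels = (1 << out_bits) - 1
    return [min(((v - lo) * multiplier) >> shift, levels) for v in values]

outputs = [-1000, 0, 1000, 3000]
multiplier, shift = requant_params(min(outputs), max(outputs))
quantized = requantize(outputs, min(outputs), multiplier, shift)
```

Only integer operations are applied per output element, which is the property that lets the bulk of the work stay in the cache arrays.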
Batch Normalization requires first quantizing to 32 bit unsigned. This is accomplished by multiplying all values by a scalar from the CPU and performing a shift. Afterwards scalar integers are added to each output in the corresponding output channel. These scalar integers are once again calculated in the CPU. Afterwards, the data is re-quantized as described above.
In Inception v3, ReLU operates by replacing any negative number with zero. We can write zero to every output element with the MSB acting as an enable for the write. Similar to max/min computations, ReLU relies on using a mask to enable selective write.
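A minimal model of the masked write is shown below, assuming two's complement values of a given width so that the MSB is the sign bit. The helper names are illustrative.

```python
def to_bits(value, width=8):
    """Encode a signed integer as a `width`-bit two's complement pattern."""
    return value & ((1 << width) - 1)

def in_array_relu(patterns, width=8):
    """patterns: one two's complement bit pattern per bit line."""
    out = []
    for p in patterns:
        msb = (p >> (width - 1)) & 1      # sign bit acts as the write enable
        out.append(0 if msb else p)       # zero is written only where MSB is set
    return out

activated = in_array_relu([to_bits(-3), to_bits(5), to_bits(0), to_bits(-128)])
```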
Avg Pool is mapped in the same way as max pool. All the inputs in a window are summed and then divided by the window size. Division is slower than multiplication, but the divisor is only 4 bits in Inception v3.
Fully Connected layers are converted into convolution layers in TensorFlow. Thus, we are able to treat the fully connected layer as another convolution layer.
E. Batching
We apply batching to increase the system throughput. Our experiments show that loading filter weights takes up about 46% of the total execution time. Batching multiple input images significantly amortizes the time for loading weights and therefore increases system throughput. Neural Cache performs batching in a straightforward way. The image batch is processed sequentially in layer order. For each layer, the filter weights are first loaded into the cache as described in Section IV-A. Then, a batch of input images is streamed into the cache and computation is performed in the same way as without batching. For the whole batch, the filter weights of the involved layer remain in the arrays, without reloading. Note that for layers with large outputs, the total output size after batching may exceed the capacity of the reserved way. In this case, the output data is dumped to DRAM and then loaded again into the cache. In Inception v3, the first five layers require dumping output data to DRAM.
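The amortization effect can be illustrated with a simple first-order model. It assumes that the weight-loading share of a single-image run (about 46%) is paid once per batch and that all remaining time scales linearly with the batch size; it ignores the DRAM dumping described above and is not a reproduction of measured results.

```python
WEIGHT_LOAD_FRACTION = 0.46   # share of single-image time spent loading weights

def relative_time_per_image(batch_size, load_fraction=WEIGHT_LOAD_FRACTION):
    """Per-image time relative to an unbatched run, under the model above."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return load_fraction / batch_size + (1.0 - load_fraction)

for b in (1, 2, 4, 8, 16):
    print(b, round(relative_time_per_image(b), 3))
```

Under this model the per-image time approaches the non-loading share as the batch grows, so the benefit of batching saturates once weight loading is mostly amortized.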
F. ISA support and Execution Model
Neural Cache requires supporting a few new instructions: in-cache addition, multiplication, reduction, and moves. Since only one layer in the network is being operated on at any given time, all compute arrays execute the same in-cache compute instruction. The compute instructions are followed by move instructions for data management. The intra-slice address bus is used to broadcast the instructions to all banks. Each bank has a control FSM which orchestrates the control signals to the SRAM arrays. The area of one FSM is estimated to be 204 μm², which sums to 0.23 mm² across 14 slices. Given that each bank is executing the same instruction, the control FSM can be shared across a way or even a slice. We chose not to optimize this because of the insignificant area overheads of the control FSM. Neural Cache computation is carried out in ways 1-19 of each slice. The remaining way (way-20) can be used by other processes/VMs executing on the CPU cores for normal background operation. Intel's Cache Allocation Technology (CAT) [25] can be leveraged to dynamically restrict the ways accessed by CPU programs to the reserved way.
V. EVALUATION METHODOLOGY
Baseline Setup: For baseline, we use dual-socket Intel Xeon E5-2697 v3 as CPU, and Nvidia Titan Xp as GPU. The specifications of the baseline machine are in Table II . Note that the CPU specs are per socket. Note that the baseline CPU has the exact cache organization (35 MB L3 per socket) as we used in Neural Cache modeling. The benchmark program is the inference phase of the Inception v3 model [26] . We use TensorFlow as the software framework to run NN inferences on both baseline CPU and GPU. The default profiling tool of TensorFlow is used for generating execution time breakdown by network layers for both CPU and GPU. The reported baseline results are based on the unquantized version of Inception v3 model, because we observe that the 8-bit quantized version has a higher latency on the baseline CPU due to lack of optimized libraries for quantized operations (540 ms for quantized / 86 ms for unquantized). To measure execution power of the baseline, we use RAPL [27] for CPU power measurement and Nvidia-SMI [28] for GPU power measurement.
widely-used SRAM. The Tensor Processing Unit (TPU) [22] is another ASIC for accelerating DNN inferences. The TPU chip features a high-throughput systolic matrix multiply unit for 8-bit MAC, as well as 28 MB on-chip memory.
In general, custom ASIC accelerator solutions achieve high efficiency while requiring extra hardware and incurring design costs. ASICs lack flexibility in that they cannot be re-purposed for other domains. In contrast, our work is based on the cache, which improves performance of many other workloads when not functioning as a DNN accelerator. Neural Cache aims to achieve high performance while allowing the flexibility of general purpose processing. Further, Neural Cache is limited by commercial SRAM technology and the general purpose processor's interconnect architecture. A custom SRAM accelerator ASIC can potentially achieve significantly higher performance than Neural Cache. Because it is built on SRAM technology, we also expect the compute efficiency of Neural Cache to improve with newer technology nodes.
The BrainWave project [38] builds an architecture consisting of FPGAs connected with a custom network, for providing accelerated DNN service at a datacenter scale. The FPGA boards are placed between network switches and host servers to increase utilization and reduce communication latency between FPGA boards. The FPGA board features a central matrix vector unit, and can be programmed with a C model with ISA extensions. BrainWave with a network of Stratix 10 280 FPGAs at 14 nm is expected to have 90 TOPs/s, while Neural Cache (Xeon E5 2-socket processor) achieves 28 TOPs/s at 22 nm technology without requiring any additional hardware. BrainWave with current generation FPGAs achieves 4.7 TOPs/s.
Terasys presents a bit-serial arithmetic PIM architecture [39]. Terasys reads the data out and performs the compute in bit-serial ALUs outside the array. Neural Cache differs by performing partial compute along the bitlines and augmenting it with a small periphery to perform arithmetic in an area-efficient architecture. Further, Terasys performs software transposes, while Neural Cache has a dedicated hardware transpose unit, the TMU.
Bit-serial computation exploits parallelism at the level of numerical representation. Stripes [40] leverages bit-serial computation for inner product calculation to accelerate DNNs. Its execution time scales proportionally with the bit length, and thus enables a direct trade-off between precision and speed. Our work differs from Stripes in that Neural Cache performs in-situ computation on SRAM cells, while Stripes requires arithmetic functional units and dedicated eDRAM.
Sparsity in DNN models can be exploited by accelerators [41] , [42] . Utilizing sparsity in DNN models for Neural Cache is a promising direction for future work.
VIII. CONCLUSION
Caches have traditionally served only as intermediate low-latency storage units. Our work directly challenges this conventional design paradigm, and proposes to impose a dual responsibility on caches: store and compute data. By doing so, we turn them into massively parallel vector units, and drastically reduce on-chip data movement overhead. In this paper we propose the Neural Cache architecture to allow massively parallel compute for Deep Neural Networks. Our advancements in compute cache arithmetic and neural network data layout solutions allow us to provide performance competitive with modern GPUs with negligible area overheads. Nearly three-fourths of a server class processor die area today is devoted to caches. Even accelerators use large caches. Why would one not want to turn them into compute units?
