Abstract-This article presents the Neural Cache architecture, which repurposes cache structures to transform them into massively parallel compute units capable of running inferences for deep neural networks. Techniques to perform in situ arithmetic in SRAM arrays, create efficient data mappings, and reduce data movement are proposed. The Neural Cache architecture is capable of fully executing convolutional, fully connected, and pooling layers in cache. Our experimental results show that the proposed architecture can improve efficiency over a GPU by 128× while requiring a minimal area overhead of 2%.
CACHES ARE USED in almost all modern microprocessors, both general-purpose processors and accelerators. They occupy a large fraction (over 70%) of the die area. Intel's latest server-class Xeon processor, for instance, devotes 35 MB to its last level cache alone. Furthermore, a processor spends a disproportionately large fraction of its time and energy on moving data through its cache hierarchy and on instruction processing, compared to actual computation.
To tackle these inefficiencies, we propose a bold idea: repurpose the elements in cache structures and transform them into large data-parallel compute units. Data stored in cache memory arrays share wires (bitlines) and signal sensing apparatus (senseamps). We observed that arithmetic operations can be computed over these shared structures by adding a few gates to them. We refer to this in-SRAM computing technique as bitline computing. We present a novel bit-serial compute SRAM design that is capable of addition, multiplication, and division on bitlines. 1 Our design incurs a 7.5% area overhead for each 6T SRAM data array, which translates to a mere 2% area overhead for a Xeon processor die. We have also taped out a prototype chip with SRAM arrays capable of bit-serial arithmetic. 2 The end result is that the thousands of arrays in the cache morph into massive vector compute units (up to 1,146,880 bit-serial ALU slots in a Xeon E5 35-MB last level cache) whose aggregate width is one to two orders of magnitude larger than the aggregate vector width of a modern graphics processor (GPU). By avoiding data movement in and out of memory arrays, we naturally save the vast amounts of energy that are typically spent shuffling data between compute units and on-chip memory units in modern processors.
Our work has its roots in the processing-in-memory (PIM) line of work. 3 PIMs move logic near main memory (DRAM), and thereby reduce the gap between memory and compute units. Neural Cache, in contrast, repurposes cache (SRAM) structures into compute units, keeping data in place. It is unclear whether in-place DRAM operations are possible, primarily because DRAM accesses are destructive. Thus, in-place computation would corrupt the input operand data. Solutions that copy data and compute on the copies are possible. 4 While in-place operations in memristors are promising, 5,6 memristors remain an emerging technology waiting for large-scale adoption and are also significantly slower than SRAM.
Neural Cache leverages opportunistic in-cache computing resources for accelerating deep neural networks (DNNs). Two key challenges must be overcome in order to harness a cache's computing resources. First, all the operands participating in an in situ operation must share bitlines and be mapped to the same memory array. Second, the intrinsic data-parallel operations in DNNs have to be exposed to the underlying parallel hardware and cache geometry. We propose a data layout and execution model that solves these challenges and harnesses the full potential of in-cache compute capabilities. Further, we find that millions of in-cache compute units can be utilized by replicating data and improving data reuse.
The Compute Cache utilizes a bit-parallel data format to perform its operations. Neural Cache, in contrast, implements a bit-serial architecture. A bit-parallel format requires communicating data across bitlines to propagate a carry. By using a bit-serial format, carries can be stored in a latch along the bitline, which saves us the complexity of communication across bitlines and also allows configurable precision.
Bit-serial computing architectures have been widely used for digital signal processing because of their ability to provide massive bit-level parallelism at low area cost. The key idea is to process one bit of multiple data elements every cycle. This model is particularly useful in scenarios where the same operation is applied to the same bit of multiple data elements. Note that although bit-serial computation is expected to have higher latency per operation, it is expected to have significantly larger throughput, which compensates for the higher operation latency.
For example, an 8-KB SRAM array is composed of 256 wordlines and 256 bitlines and can operate at a maximum frequency of 4 GHz for accessing data. 10 Note that this is far faster than a 35-MB LLC access from a core, which takes 20-30 ns. Up to 256 elements can be processed in parallel in a single array. A 2.5-MB LLC slice has 320 8-KB arrays, as shown in Figure 1. A Xeon server processor's 35-MB LLC can accommodate 4,480 such 8-KB arrays. Thus, up to 1,146,880 elements can be processed in parallel, while operating at a frequency of 2.5 GHz when computing. By repurposing memory arrays, we gain the above throughput for near-zero cost. Our circuit analysis estimates the area overhead of the additional bitline peripheral logic to be 7.5% for each 8-KB array. This translates to less than 2% area overhead for the processor die.
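The array and ALU-slot counts above follow directly from the cache geometry. The short Python check below reproduces them; it assumes binary units (1 KB = 1,024 bytes), and the constant names are illustrative rather than taken from the original work.

```python
# Back-of-the-envelope count of bit-serial ALU slots in the last level cache.
# Assumption: binary units (1 KB = 1024 bytes); names are illustrative only.
KB = 1024
MB = 1024 * KB

ARRAY_BYTES = 8 * KB          # one SRAM array
WORDLINES_PER_ARRAY = 256
BITLINES_PER_ARRAY = 256      # one bit-serial ALU slot per bitline
SLICE_BYTES = int(2.5 * MB)   # one LLC slice
LLC_BYTES = 35 * MB           # whole last level cache

# 256 wordlines x 256 bitlines = 65,536 bits = 8 KB
assert WORDLINES_PER_ARRAY * BITLINES_PER_ARRAY // 8 == ARRAY_BYTES

arrays_per_slice = SLICE_BYTES // ARRAY_BYTES      # arrays in one slice
slices = LLC_BYTES // SLICE_BYTES                  # slices in the LLC
total_arrays = LLC_BYTES // ARRAY_BYTES            # arrays in the LLC
alu_slots = total_arrays * BITLINES_PER_ARRAY      # elements processed in parallel

print(arrays_per_slice, slices, total_arrays, alu_slots)
```

Running it gives 320 arrays per slice, 14 slices, 4,480 arrays, and 1,146,880 ALU slots, matching the figures quoted in the text.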
Bit-Serial Arithmetic
Data are mapped to a transposed layout in which different bitlines hold data from different elements of the operand vector. Each $n$-bit element is stored across $n$ wordlines, and thus each wordline holds one bit-slice from 256 vector elements, as shown in Figure 1(c). All bits in a bit-slice have the same bit position. By activating two wordlines in the SRAM, we are able to sense a logical AND on the bitline (BL) and a logical NOR on the bitline complement (BLB). Note that we use a reconfigurable differential senseamp 8 to sense BL and BLB independently. A 1-bit full adder can be created by adding several gates to the ends of the senseamps, as shown in Figure 1(d). We then add one bit position per cycle by activating two wordlines, which lets us add two $n$-bit numbers in $n + 1$ cycles.
Figure 2 shows an example 12 × 4 SRAM array with the transposed layout. The array stores two vectors A and B, each with four 4-bit elements. Four wordlines are necessary to store all bit-slices of the 4-bit elements. Vector A occupies the first four rows of the SRAM array and vector B the next four rows. Another four empty rows of storage are reserved for the results. There is a row of latches inside the column peripheral for carry storage. The addition algorithm is carried out bit by bit, starting from the least significant bit (LSB) of the two words.
There are two phases in a single operation cycle. In the first half of the cycle, two read wordlines (RWLs) are activated to simultaneously sense and wire-AND the values in cells on the same bitline. The senseamps and logic gates in the column peripheral use the two bit cells as operands and the carry latch as carry-in to generate the sum and carry-out. In the second half of the cycle, a write wordline is activated to store back the sum bit. The carry-out bit overwrites the data in the carry latch and becomes the carry-in of the next cycle. As demonstrated in Figure 2, in cycles 2, 3, and 4, we repeat the first cycle to add the second, third, and fourth bits, respectively. Addition thus takes $n + 1$ cycles to complete, with the additional cycle used to write the final carry.
The RWL voltage is reduced from the normal VDD to prevent disturbance of the data in the bit cells. In addition, the cache frequency during compute mode is reduced from 4 to 2.5 GHz to reflect the voltage change.
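The bit-serial addition on a transposed layout can be mimicked in software. The Python sketch below is illustrative only: the function names are hypothetical, the columns are visited in a loop whereas the hardware operates on all bitlines at once, and the XOR is derived from the sensed AND and NOR values as one plausible gate arrangement, not necessarily the exact circuit of Figure 1(d).

```python
def to_transposed(values, n_bits):
    """Return n_bits rows (wordlines); row i holds bit i (LSB first) of every element."""
    return [[(v >> i) & 1 for v in values] for i in range(n_bits)]


def from_transposed(rows):
    """Rebuild one integer per column (bitline) from LSB-first bit-slice rows."""
    n_cols = len(rows[0])
    return [sum(rows[i][c] << i for i in range(len(rows))) for c in range(n_cols)]


def bit_serial_add(a_rows, b_rows):
    """Add two transposed vectors one bit-slice per cycle; returns (sum_rows, cycles)."""
    n_bits, n_cols = len(a_rows), len(a_rows[0])
    carry = [0] * n_cols              # one carry latch per bitline
    sum_rows, cycles = [], 0
    for i in range(n_bits):           # one cycle per bit position, LSB first
        row = []
        for c in range(n_cols):       # hardware does all columns in parallel
            a, b = a_rows[i][c], b_rows[i][c]
            and_ab = a & b            # value sensed on BL
            nor_ab = 1 - (a | b)      # value sensed on BLB
            xor_ab = 1 - (and_ab | nor_ab)
            row.append(xor_ab ^ carry[c])                 # sum bit written back
            carry[c] = and_ab | (xor_ab & carry[c])       # carry-out for next cycle
        sum_rows.append(row)
        cycles += 1
    sum_rows.append(carry[:])         # extra cycle stores the final carry
    cycles += 1
    return sum_rows, cycles


a = [3, 5, 9, 15]                     # four 4-bit elements, as in the Figure 2 example size
b = [1, 7, 6, 15]
sum_rows, cycles = bit_serial_add(to_transposed(a, 4), to_transposed(b, 4))
print(from_transposed(sum_rows), cycles)
```

For the four 4-bit pairs used here, the sketch prints the sums `[4, 12, 15, 30]` and a count of 5 cycles, consistent with the $n + 1$ cycle count for $n = 4$.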
Multiplication takes $n^2 + 5n - 2$ cycles and is performed as a series of conditional additions of partial products. In addition to the carry latch, an extra row of storage, the tag latch, is added to the bottom of the array. The tag bit is used as an enable signal to the bitline driver. When the tag is one, the sum produced by the addition is written back to the array. If the tag is zero, the data in the array remain unchanged. Division can be supported using a similar algorithm and takes $1.5n^2 + 5.5n$ cycles.
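The tag-predicated multiplication can be sketched in the same style. The Python below is a functional illustration under stated assumptions, not the authors' microcode: the helper names are hypothetical, unsigned operands are assumed, and the simulation does not try to reproduce the exact cycle schedule. The cycle-count helpers simply evaluate the formulas quoted above.

```python
def to_transposed(values, n_bits):
    """Row i holds bit i (LSB first) of every element; one column per bitline."""
    return [[(v >> i) & 1 for v in values] for i in range(n_bits)]


def from_transposed(rows):
    n_cols = len(rows[0])
    return [sum(rows[i][c] << i for i in range(len(rows))) for c in range(n_cols)]


def predicated_add(acc_rows, a_rows, offset, tag):
    """Add a_rows into acc_rows starting at wordline `offset`.

    Results are written back only on columns whose tag bit is 1,
    mimicking the tag latch gating the bitline driver.
    """
    n_cols = len(tag)
    carry = [0] * n_cols
    for i, a_row in enumerate(a_rows):
        for c in range(n_cols):
            a, p = a_row[c], acc_rows[offset + i][c]
            s = a ^ p ^ carry[c]
            carry[c] = (a & p) | (carry[c] & (a ^ p))
            if tag[c]:
                acc_rows[offset + i][c] = s
    top = offset + len(a_rows)        # row above the addend is still zero here
    for c in range(n_cols):
        if tag[c]:
            acc_rows[top][c] = carry[c]


def bit_serial_multiply(a_rows, b_rows):
    """Unsigned multiply as conditional additions of shifted partial products."""
    n_bits, n_cols = len(a_rows), len(a_rows[0])
    product = [[0] * n_cols for _ in range(2 * n_bits)]
    for j in range(n_bits):
        tag = b_rows[j][:]            # load multiplier bit-slice j into the tag latches
        predicated_add(product, a_rows, j, tag)
    return product


def add_cycles(n):
    return n + 1


def mul_cycles(n):
    return n * n + 5 * n - 2


def div_cycles(n):
    return (3 * n * n + 11 * n) // 2  # equals 1.5*n^2 + 5.5*n


a, b = [3, 5, 9, 15], [1, 7, 6, 15]
products = from_transposed(bit_serial_multiply(to_transposed(a, 4), to_transposed(b, 4)))
print(products, add_cycles(8), mul_cycles(8), div_cycles(8))
```

For 8-bit operands the formulas evaluate to 9 cycles for addition, 102 for multiplication, and 140 for division, which illustrates how latency per operation grows quadratically with precision while throughput is recovered through the number of parallel bitlines.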
Transpose Gateway Units
The transpose data layout can be realized in the following ways. First, leverage programmer support to store and access data in the transpose format. This option is useful when the data to be operated on does not change at runtime. We utilize this for filter weights in neural networks. However, this approach increases software complexity by requiring programmers to reason about a new data format and cache geometry.
Second, design a few hardware transpose memory units (TMUs) placed in the cache control box (C-BOX in Figure 1(b)). A TMU takes data in the bit-parallel or regular layout and converts it to the transpose layout before storing it into SRAM arrays, or vice versa when reading from SRAM arrays. The second option is attractive because it supports dynamic changes to data. TMUs can be built out of SRAM arrays with multidimensional access (i.e., access to data in both horizontal and vertical directions). Figure 3 shows a possible TMU design using an 8T SRAM array with senseamps in both horizontal and vertical directions. Note that only a few TMUs are needed 
CONVOLUTION IN-CACHE
-D input activation channel, followed by reduction across C input channels. Figure 4 shows a typical data mapping scheme of one convolutional layer. In each 8-KB 256 × 256 SRAM array, a bitline stores an unrolled 2-D filter in $R \times S \times 8$ wordlines (assuming 8-bit filter weights), and the input activations to be multiplied with the weights are loaded into another $R \times S \times 8$ wordlines. Each bitline corresponds to one input channel. The M filters span multiple arrays in the same or neighboring ways. One way of each slice is reserved for storing the outputs of the previous layer, and another way is reserved for system background processes.
The in-cache convolution is performed in five successive stages, as described below:
1) Weight Loading: At the start of each layer, the filter weights are loaded from DRAM into the cache. Mapping a unique 2-D filter to each bitline does not fully utilize all bit-serial compute units. We observe that each layer in the network produces several thousand output pixels, and each output requires one convolution. All the outputs can be produced in parallel, provided we have sufficient computing resources and the filters are replicated such that each bitline can perform an independent 2-D convolution. Thus, filters are replicated across all the ways and then across slices. The interslice ring and intraslice bus allow low-cost replication of weights using broadcasts. After filter replication, the output pixels that still cannot be computed in parallel are computed serially.
2) Input Loading: The input activations are broadcast from the reserved way to all the compute arrays in the other ways.
3) MAC (Multiply-Accumulate): After data loading, in each array, the $R \times S$ weights are multiplied with the $R \times S$ inputs sequentially; after each multiply, the product is accumulated into a partial sum held in reserved wordlines in the array.
4) Reduction: To sum over all the input channels of each filter, the partial sums on different bitlines are added up in the reduction stage. In each array, the partial sums of half the channels to be reduced are copied to another set of wordlines and aligned channelwise with the other half of the channels. Then, the second half of the partial sums is added into the first half. One copy and addition is called a round of reduction; reduction rounds are conducted iteratively until the final result is calculated.
5) Output Transfer: After reduction, the output activation maps in the compute arrays are transferred to the reserved array.
Then, the inputs at a different height or width are loaded in, and the MAC, reduction, and output transfer repeat.
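The MAC and reduction stages can be illustrated with a small functional model. The Python sketch below is an assumption-laden simplification: it uses ordinary integer arithmetic in place of bit-serial operations, treats each list entry as one bitline (one input channel), assumes the channel count is a power of two, and uses hypothetical function names. It shows only the data flow of sequential MACs followed by halving reduction rounds.

```python
def mac_stage(weights, inputs):
    """Per-channel partial sums: each bitline multiplies its R*S weights
    with R*S inputs sequentially and accumulates the products."""
    partial_sums = []
    for w_channel, x_channel in zip(weights, inputs):
        acc = 0
        for w, x in zip(w_channel, x_channel):   # R*S sequential MACs
            acc += w * x
        partial_sums.append(acc)
    return partial_sums


def reduction_stage(partial_sums):
    """Halving reduction across channels; returns (total, rounds).

    Each round copies the second half of the lanes so that it lines up
    with the first half, then adds it in. Assumes a power-of-two count.
    """
    lanes = list(partial_sums)
    assert lanes and len(lanes) & (len(lanes) - 1) == 0, "power-of-two channels assumed"
    rounds = 0
    while len(lanes) > 1:
        half = len(lanes) // 2
        moved = lanes[half:]                               # copy and align
        lanes = [x + y for x, y in zip(lanes[:half], moved)]  # add into first half
        rounds += 1
    return lanes[0], rounds


# Four input channels, each with an unrolled 2x2 filter (R = S = 2).
weights = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 2, 2, 2], [1, 0, -1, 0]]
inputs = [[1, 1, 1, 1], [5, 6, 7, 8], [1, 2, 3, 4], [9, 9, 9, 9]]

partial = mac_stage(weights, inputs)
output_pixel, rounds = reduction_stage(partial)
print(partial, output_pixel, rounds)
```

With four channels, the reduction finishes in two rounds; in general the number of rounds grows with the base-2 logarithm of the channel count, which is why reduction is a separate, comparatively short stage after the MACs.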
The original paper 1 discusses several data mapping details, such as use of replicated filters to execute thousands of 3-D convolutions in parallel, leveraging reuse across inputs, and splitting/packing of filters to accommodate different filter shapes.
RESULT HIGHLIGHTS
For our baselines, we used a dual-socket Intel Xeon E5-2697 v3 as the CPU and an Nvidia Titan Xp as the GPU. Our Neural Cache modeling was based on the exact cache organization of the Xeon's 35-MB last level cache. Table I shows the studied server configurations. We refer the reader to the full paper 1 for details of the experimental methodology.
Performance
Neural Cache achieves a 7.7× speedup in latency over the baseline GPU and an 18.3× speedup over the baseline CPU, as shown in Figure 5. The significant speedup can be attributed to the elimination of the high overheads of on-chip data movement from the cache to the core registers, and to data-parallel convolutions.
Consider an example layer, Conv2D_Layer_2b_3×3. This layer computes ≈1.4 million convolutions, out of which Neural Cache executes ≈32,000 convolutions in parallel and 43 in series. The compute cache arrays show 99.7% utilization for this layer during convolutions (after data loading). Each convolution takes 2,784 cycles (236 cycles/MAC × 9 + 660 reduction cycles). Note that we assume two cycles for each compute SRAM access to account for column multiplexing. The whole layer takes 119,712 cycles (43 convolutions in series × 2,784 cycles), so the convolutions finish in 0.0479 ms with Neural Cache running at 2.5 GHz. The remaining time for the layer is attributed to data loading. The CPU and GPU cannot take advantage of data parallelism on this scale because they lack sufficient compute resources and suffer from on-chip data-movement bottlenecks.
Figure 6. Throughput with varying batch size.
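The per-layer timing for this example can be recomputed from the quoted per-MAC, reduction, and serialization figures. The Python check below uses only numbers stated in the text; the variable names are illustrative. Multiplying 43 serial convolutions by 2,784 cycles gives 119,712 cycles, which is the value consistent with the stated 0.0479 ms at 2.5 GHz.

```python
# Cycle and latency arithmetic for the example 3x3 convolution layer.
MAC_CYCLES = 236              # cycles per 8-bit multiply-accumulate (from the text)
FILTER_TAPS = 9               # 3 x 3 filter
REDUCTION_CYCLES = 660        # cycles for the channel reduction (from the text)
SERIAL_CONVOLUTIONS = 43      # convolutions that must run back to back
COMPUTE_CLOCK_HZ = 2.5e9      # compute-mode frequency

cycles_per_convolution = MAC_CYCLES * FILTER_TAPS + REDUCTION_CYCLES
layer_cycles = SERIAL_CONVOLUTIONS * cycles_per_convolution
layer_time_ms = layer_cycles / COMPUTE_CLOCK_HZ * 1e3

print(cycles_per_convolution, layer_cycles, round(layer_time_ms, 4))
```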
Figure 6 shows the system throughput in number of inferences per second as the batch size varies. As shown in the figure, at the highest batch size, Neural Cache achieves a throughput of 604 inferences/s, which is equivalent to 2.2× the GPU throughput, or 12.4× the CPU throughput. The increased throughput of Neural Cache when batching can be attributed to the amortization of filter loading time. Loading filter weights into the cache and distributing them into arrays takes up 46% of the total inference time when processing a single image.
Power and Energy
Neural Cache achieves an energy efficiency that is 16.6× better than the baseline GPU and 37.1× better than the baseline CPU. The energy efficiency improvement can be explained by the reduction in data movement, the increased instruction efficiency of a SIMD-like architecture, and the optimized in-cache arithmetic units. Neural Cache's data mapping not only makes the computation entirely data independent but also makes the operations performed identical, allowing for mega SIMD instructions. The average power of Neural Cache is 53.11% and 49.87% lower than that of the GPU and CPU baselines, respectively. Thus, Neural Cache does not pose a thermal challenge for the servers.
Overall
Table 2 shows how Neural Cache compares to the baseline CPU and GPU in various performance measures. To summarize, our experimental results show that the proposed architecture can improve the inference latency by 18.3× over a server-class multicore CPU (Xeon E5) and by 7.7× over a server-class GPU (Titan Xp) for the Inception v3 model. In addition, Neural Cache is 37.1× and 16.6× more energy efficient than the CPU and GPU, respectively. This translates to an overall efficiency gain of 679× and 128× over the CPU and GPU, respectively, while requiring a minimal area overhead of 2%.
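The overall efficiency gains of 679× and 128× are numerically consistent with multiplying the latency speedup by the energy-efficiency gain for each baseline. The Python check below makes that arithmetic explicit; treating the overall figure as this product is an assumption inferred from the numbers, and the dictionary names are illustrative.

```python
# Assumption: overall efficiency gain = latency speedup x energy-efficiency gain.
speedup = {"CPU": 18.3, "GPU": 7.7}          # latency improvement over each baseline
energy_gain = {"CPU": 37.1, "GPU": 16.6}     # energy-efficiency improvement

overall = {name: round(speedup[name] * energy_gain[name]) for name in speedup}
print(overall)
```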

RELATED WORK
Recent years have witnessed several neural network accelerator architectures, including commercial products. 7,11 In general, custom accelerator solutions achieve high efficiency while requiring extra hardware and incurring design costs. However, they lack flexibility in that they cannot be repurposed for other domains. In contrast, our work is based on the cache, which improves the performance of many other workloads when not functioning as a DNN accelerator. Neural Cache aims to achieve high performance while retaining the flexibility of general-purpose processing. Because it is built on SRAM technology, we also expect the compute efficiency of Neural Cache to improve with newer technology nodes. Neural Cache is the first work to demonstrate in-place arithmetic operations in caches. Other promising approaches toward in-SRAM computing 12 utilize analog computing, which requires expensive ADCs and is restricted to limited precision. Our design is digital and CMOS compatible. It processes one bit at a time, eliminating the need for ADCs and enabling higher precision. Since only a binary value needs to be sensed every cycle, existing SRAM array senseamps can be utilized. Bit-serial computing over bitlines obviates the need for communication across bitlines for carry propagation, keeping the logic compact. Further, it magnifies the throughput and enables configurable precision, which is attractive for machine learning. A low-overhead digital design makes our proposal compatible with and attractive for existing processors.
CONCLUSION
Computer designers have traditionally separated the role of storage and compute units. Memories and caches stored data. Processors' logic units computed them. Is this separation necessary? A human brain does not separate the two so distinctly. Why should a processor? Our work raises this fundamental question regarding the role of caches, and proposes to impose a dual responsibility on caches: store and compute data. By doing so, we turn them into massively parallel vector units, and drastically reduce on-chip data movement overhead.
As caches can be found in almost all modern processors, we envision Neural Cache to be a disruptive technology that can enhance commodity processors with large data-parallel accelerators at almost no cost. CPU vendors (Intel, IBM, etc.) can thus continue to provide high-performance general-purpose processors, while enhancing them with a coprocessor-like capability to exploit massive data parallelism. Such a processor design is particularly attractive for difficult-to-accelerate applications that frequently switch between sequential and data-parallel computation.
To conclude, nearly three-fourths of a server-class processor's die area today is devoted to caches. Even accelerators use large caches. Why would one not want to turn them into compute units?
Charles Eckert is currently a PhD student in the Department of Computer Science and Engineering, University of Michigan. His research focuses on using in-memory computing to accelerate machine learning applications. He has a BS in computer engineering from SUNY Binghamton and an MSE degree in computer science and engineering from the University of Michigan. 
