Loom (LM), a hardware inference accelerator for Convolutional Neural Networks (CNNs), is presented. In LM, every bit of data precision that can be saved translates to proportional performance gains. Specifically, for convolutional layers LM's performance scales inversely proportionally with the precisions of both weights and activations. For fully-connected layers LM's performance scales inversely proportionally with the precision of the weights. LM targets area- and bandwidth-constrained System-on-a-Chip designs, such as those found on mobile devices, that cannot afford the multi-megabyte buffers that would be needed to store each layer on-chip. Accordingly, given a data bandwidth budget, LM boosts energy efficiency and performance over an equivalent bit-parallel accelerator. For both weights and activations LM can exploit profile-derived per layer precisions. At runtime, LM further trims activation precisions at a granularity much finer than a layer. Moreover, it can naturally exploit weight precision variability at a granularity finer than a layer. On average, across several image classification CNNs and for a configuration that can perform the equivalent of 128 16b × 16b multiply-accumulate operations per cycle, LM outperforms a state-of-the-art bit-parallel accelerator [1] by 4.38× without any loss in accuracy while being 3.54× more energy efficient. LM can trade off accuracy for additional improvements in execution performance and energy efficiency, and it compares favorably to an accelerator that targets only activation precisions. We also study 2- and 4-bit LM variants and find that the 2-bit per cycle variant is the most energy efficient.
INTRODUCTION
Deep neural networks (DNNs) have become the state-of-the-art technique in many recognition tasks such as object [2] and speech recognition [3]. Given their many applications and high computation and memory demands, DNNs are prime candidates for hardware acceleration. While a few different types of DNNs exist, Convolutional Neural Networks (CNNs) in particular dominate applications where the input is an image or video. Devices executing such CNNs will be required to perform mostly, if not only, inference. An example is computational photography, where machine learning has shown great promise in replacing classical algorithms [4].
We present Loom (LM), a hardware accelerator for inference with CNNs targeting embedded systems where reducing the amount of data transferred per memory connection, be it an external or internal one, is paramount. Specifically, given a memory bandwidth budget, LM's goal is to boost performance and energy efficiency compared to a state-of-the-art data-parallel accelerator. LM exploits the precision requirement variability of CNNs to reduce the memory footprint, increase bandwidth utilization, and deliver performance that scales inversely proportionally with precision for both convolutional (CVLs) and fully-connected (FCLs) layers. Ideally, compared to using a fixed precision of 16 bits, LM achieves a speedup of $\frac{256}{P_a \times P_w}$ and $\frac{16}{P_w}$ for CVLs and FCLs, respectively, where $P_w$ and $P_a$ are the precisions of weights and activations. LM also reduces the number of weight and activation bits read by $\frac{16}{P_w}$ and $\frac{16}{P_a}$, respectively. To deliver these benefits LM processes both activations and weights bit-serially while compensating for the loss in computation bandwidth by exploiting parallelism. Judicious reuse of activations and weights enables LM to improve performance and energy efficiency over conventional bit-parallel designs without requiring a wider memory interface. For both weights and activations LM utilizes profile-derived per layer precisions. For activations, LM further trims their precision at a much finer granularity at runtime, utilizing the approach of Lascorz et al. [5]. By exploiting precision, LM delivers benefits for all activations and weights regardless of whether they are ineffectual or not.
We evaluate LM on an SoC and compare against a bit-parallel fixed-precision accelerator (DPNN) over a set of image classification CNNs. For a configuration that is sized to match the peak computation bandwidth of a bit-parallel accelerator that can perform at peak 128 16b × 16b multiply-accumulate operations per cycle, on average LM yields a speedup of 3.25×, 1.74×, and 3.19× over DPNN for the convolutional, fully-connected, and all layers, respectively. The energy efficiency of LM over DPNN is 2.63×, 1.41×, and 2.59× for the aforementioned layers, respectively. LM enables trading off accuracy for additional improvements in performance and energy efficiency. For example, accepting a 1% relative loss in accuracy, LM yields 3.57× higher performance and 2.87× more energy efficiency than DPNN. We also perform a sensitivity study varying the equivalent peak compute bandwidth and the number of bits that LM processes per cycle. We find that LM scales well up to a configuration equivalent to 256 16b × 16b multiply-accumulate operations per cycle and that a 2-bit per cycle design achieves the best energy efficiency, albeit not the best performance.
The rest of this document is organized as follows: Section 2 illustrates the key concepts behind LM via an example. Section 3 presents the DPNN and Loom architectures. The evaluation methodology and experimental results are presented in Section 4. Section 5 reviews related work, and Section 6 concludes.
cycle. The engine can process two new 2-bit weights and/or activations per cycle, for a throughput of two 2b × 2b products per cycle.
Loom's Approach: Figure 1b shows an equivalent LM engine which matches the bit-parallel engine's throughput by producing 8 1b × 1b products every cycle. The engine comprises a 2×2 array of bit-serial subunits (4 in total). Each subunit accepts 2 bits of input activations and 2 bits of weights per cycle and performs 2 1b × 1b products. The subunits along the same column share the activation inputs while the subunits along the same row share their weight inputs. In total, this engine accepts 4 activation and 4 weight bits, equaling the input bandwidth of the bit-parallel engine. Each subunit has two 1-bit Weight Registers (WRs) and one 2-bit Output Register (OR) for accumulating its products. , the LSBs of four weights from filters 0 and 1. Each of these two subunits calculates two 1b × 1b products (the product and accumulation would take place in the subsequent cycle, adding one more pipeline stage, a detail the example omits for clarity) and stores their sum into its OR. In Figure 1c and cycle 2, the left column subunits now multiply the same weight bits with the most significant bits (MSBs) $a_{0/1}$ and $a_{1/1}$ of activations $a_0$ and $a_1$, respectively, and accumulate these into their ORs. In parallel, the two right column subunits load $a_{0/0}$ and $a_{1/0}$, the LSBs of the input activations $a_0$ and $a_1$, and multiply them by the LSBs of weights , the MSBs of the weights from filters 2 and 3 and multiply them with $a_{0/0}$ and $a_{1/0}$. In cycle 5 and Figure 1f, the right subunits complete the multiplication of their WR-held weights and $a_{0/1}$ and $a_{1/1}$, the MSBs of the two activations. By the end of this cycle, output activations $o_2$ and $o_3$ are ready as well.
In total it took 4+1 cycles to process 32 1b × 1b products (4, 8, 8, 8, 4 products in cycles 1 through 5, respectively). Notice that at the end of the 5th cycle, the left column subunits are idle, thus the WRs could have loaded another set of weights commencing the computation of a new set of outputs. In the steady state, with 2b input activations and weights, this engine will be producing 8 1b ×1b terms every cycle thus matching the 2 2b ×2b throughput of the parallel engine. If the weights could be represented using only one bit, LM would be producing two output activations per cycle, twice the bandwidth of the bit-parallel engine.
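The arithmetic behind the example can be illustrated with a minimal sketch that computes an inner product using only 1b × 1b products, each weighted by the significance of the two bits involved. This is a functional model only: it assumes unsigned values, ignores the pipelining and the subunit array of Figure 1, and the function name is hypothetical.

```python
def bit_serial_inner_product(activations, weights, p_a, p_w):
    """Inner product of unsigned values built only from 1b x 1b products.

    Returns the result and the number of 1b x 1b products performed.
    """
    total = 0
    products = 0
    for wb in range(p_w):              # one weight bit position at a time, LSB first
        for ab in range(p_a):          # one activation bit position at a time, LSB first
            partial = 0
            for a, w in zip(activations, weights):
                partial += ((a >> ab) & 1) & ((w >> wb) & 1)  # AND acts as a 1b x 1b product
                products += 1
            total += partial << (ab + wb)  # shift by the combined bit significance
    return total, products


# Two 2-bit activations against one filter of two 2-bit weights.
acts = [3, 2]
wts = [1, 3]
result, n = bit_serial_inner_product(acts, wts, p_a=2, p_w=2)
print(result, n)
```

With 2-bit activations and weights, one output of two terms costs 8 1b × 1b products, so the four filters of the example account for the 32 products counted above. Halving the weight precision to one bit halves the product count per output.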
In general, if the bit-parallel hardware was using $P_{base}$ bits to represent the weights while only $P_w$ bits were actually required, for the FCLs the LM engine would outperform the bit-parallel engine by $\frac{P_{base}}{P_w}$.
The LM would use an array of $P_{base} \times k$ units, where $k$ is the number of $P_{base} \times P_{base}$ products DPNN processes per cycle. Each subunit would produce $k$ 1b × 1b products. Since there is no weight reuse in FCLs, 16 cycles are required to load a different set of weights to each of the 16 columns. Thus having activations that use fewer than 16 bits would not improve performance (but could improve energy efficiency).

Convolutional Layers: LM processes CVLs similarly to FCLs but exploits weight reuse across different windows to benefit from a reduction in precision for both weights and activations. Specifically, in CVLs the subunits across the same row share the same weight bits, which they load in parallel into their WRs in a single cycle. These weight bits are multiplied by the corresponding activation bits over $P_a$ cycles. Another set of weight bits needs to be loaded every $P_a$ cycles, where $P_a$ is the input activation precision. Here LM exploits weight reuse across multiple windows by having each subunit column process a different set of activations. Assuming that the bit-parallel engine uses $P$ bits to represent both input activations and weights, LM will outperform the bit-parallel engine by $\frac{P^2}{P_w \times P_a}$, where $P_w$ and $P_a$ are the weight and activation precisions LM uses, respectively.
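The two ideal speedup expressions can be evaluated directly. The helper names below are hypothetical, and the sketch assumes a 16-bit baseline and precisions between 1 and the baseline width; it models only the ideal case and ignores utilization and memory effects.

```python
def ideal_fcl_speedup(p_w, p_base=16):
    """Ideal FCL speedup over the bit-parallel baseline: P_base / P_w."""
    if not 1 <= p_w <= p_base:
        raise ValueError("weight precision must be in [1, p_base]")
    return p_base / p_w


def ideal_cvl_speedup(p_a, p_w, p_base=16):
    """Ideal CVL speedup over the bit-parallel baseline: P_base^2 / (P_a * P_w)."""
    if not (1 <= p_a <= p_base and 1 <= p_w <= p_base):
        raise ValueError("precisions must be in [1, p_base]")
    return (p_base * p_base) / (p_a * p_w)


print(ideal_fcl_speedup(8))       # weights need 8 of 16 bits
print(ideal_cvl_speedup(8, 8))    # both operands need 8 of 16 bits
print(ideal_cvl_speedup(16, 16))  # worst case: matches the baseline
```

Because activation precision does not appear in the FCL expression, trimming activations helps only the convolutional layers in this model.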
LOOM ARCHITECTURE
This section describes the baseline fixed precision bit-parallel accelerator and the Loom architecture.
Data Supply and Baseline System
Our baseline design (DPNN), shown in Figure 2a, is an appropriately configured data-parallel engine inspired by the DaDianNao accelerator [1], the de facto standard used for comparison in most accelerator studies. DPNN uses 16-bit fixed-point activations and weights. DPNN comprises k inner product units (IP), each processing a different filter. Every cycle DPNN accepts as input N activations and N corresponding weights for each of the k filters. In the configuration shown, N = 16 and k = 8. The N activations are broadcast to all IP units. Each IP unit multiplies each of the N activations with one of its N weights, reduces the resulting N 32b products with an adder tree, and accumulates the result into an output register. In total, every cycle, DPNN calculates N × k products, producing k partial output activations.
An Activation Memory (AM) and a Weight Memory (WM) supply respectively the activations and the weights. An input activation buffer (ABin) buffers the input activations while an output activation buffer (ABout) temporarily buffers the output activations. For clarity, in our description we assume a single tile that processes up to 128 weights (8 filters) and 16 activations per cycle.
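As a rough functional illustration of one baseline cycle, the sketch below broadcasts N activations to k inner-product units and accumulates one partial output per filter. It models behavior only, not timing or bit widths; the function name and the sample data are hypothetical.

```python
def dpnn_cycle(activations, weights_per_filter, accumulators):
    """One cycle of the baseline: N activations broadcast to k inner-product units."""
    for f, weights in enumerate(weights_per_filter):
        products = [a * w for a, w in zip(activations, weights)]  # N multipliers per unit
        accumulators[f] += sum(products)                          # adder tree, then accumulate
    return accumulators


N, K = 16, 8                               # configuration described in the text
acts = list(range(N))                      # sample activations 0..15
wts = [[f + 1] * N for f in range(K)]      # filter f uses the constant weight f + 1
out = dpnn_cycle(acts, wts, [0] * K)
print(out)
print(N * K, "products per cycle")
```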
Loom
For LM to match our DPNN configuration it needs to process 128 filters concurrently and 16 weight bits per filter per cycle, for a total of 128 × 16 = 2048 weight bits per cycle. Alternatively, LM could process 32 filters over 64 windows; however, we leave this investigation for future work. LM also accepts 256 1-bit input activations, each of which it multiplies with 128 1-bit weights, thus matching the computation bandwidth of DPNN in the worst case where both activations and weights need 16 bits. Figure 2b shows the Loom design. It comprises 2K Serial Inner-Product Units (SIPs) organized in a 128 × 16 grid. Every cycle, each SIP multiplies 16 1b input activations with 16 1b weights and reduces these products into a partial output activation. The SIPs along the same row share a common 16b weight bus, and the SIPs along the same column share a common 16b activation bus. Accordingly, as in DPNN, the SIP array is fed by a 2Kb weight bus and a 256b activation input bus. Similar to DPNN, LM has an ABout and an ABin. LM processes both activations and weights bit-serially.
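The bus widths and the worst-case compute equivalence stated above follow from the grid dimensions. The short check below uses only numbers given in the text; the constant names are illustrative.

```python
SIP_ROWS, SIP_COLS = 128, 16    # SIP grid
LANES = 16                      # 1b x 1b products per SIP per cycle
BASE_BITS = 16                  # baseline precision
DPNN_MACS_PER_CYCLE = 128       # 16b x 16b products per cycle in the baseline

weight_bus_bits = SIP_ROWS * LANES        # one 16b weight bus per SIP row
activation_bus_bits = SIP_COLS * LANES    # one 16b activation bus per SIP column
sips = SIP_ROWS * SIP_COLS

# Worst case: a 16b x 16b product takes 16 * 16 bit-serial cycles.
worst_case_cycles = BASE_BITS * BASE_BITS
lm_bit_products = sips * LANES * worst_case_cycles
dpnn_bit_products = DPNN_MACS_PER_CYCLE * worst_case_cycles * BASE_BITS * BASE_BITS

print(weight_bus_bits, activation_bus_bits, sips)
print(lm_bit_products == dpnn_bit_products)
```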
Reducing Memory Footprint and Bandwidth: Since both weights and activations are processed bit-serially, LM can store weights and activations in a bit-interleaved fashion, using only as many bits as necessary, thus boosting the effective bandwidth and storage capacity of the weight memory and the AM. For example, given 2K 13b weights to be processed in parallel, LM would first pack their bit 0 onto contiguous rows, then their bit 1, and so on up to bit 12. DPNN would store them using 16 bits instead. A transposer can rotate the output activations prior to writing them to AM from ABout. Since each output activation entails inner products with tens to hundreds of inputs, the demand on the transposer will be low.
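A minimal sketch of the bit-interleaved layout follows, assuming unsigned values and one list per bit position standing in for a memory row. The function names are hypothetical, and the actual row width and memory organization are not modeled.

```python
def pack_bit_planes(values, precision):
    """Row b holds bit b of every value (bit 0 first), using `precision` rows in total."""
    return [[(v >> b) & 1 for v in values] for b in range(precision)]


def unpack_bit_planes(planes):
    """Inverse of pack_bit_planes: rebuild each value from its bits."""
    count = len(planes[0]) if planes else 0
    return [sum(planes[b][i] << b for b in range(len(planes))) for i in range(count)]


weights = [5, 4095, 8191, 0]             # all fit in 13 bits
planes = pack_bit_planes(weights, 13)
assert unpack_bit_planes(planes) == weights

stored_bits = 13 * len(weights)          # bit-interleaved, precision-sized storage
baseline_bits = 16 * len(weights)        # fixed 16-bit storage
print(stored_bits, baseline_bits)
```

For the 2K 13b weights of the example, the same accounting gives 2048 × 13 bits instead of 2048 × 16 bits, a footprint ratio of 16/13.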
Convolutional Layers: Processing starts by reading in parallel 2K weight bits from memory, loading 16 bits to all WRs per SIP row. The loaded weights will be multiplied by 16 corresponding activation bits per SIP column bit-serially over $P_a^L$ cycles, where $P_a^L$ is the activation precision for this layer $L$. Then, the second bit of weights will be loaded into WRs and multiplied with another set of 16 activation bits per SIP row, and so on. In total, the bit-serial multiplication will take $P_a^L \times P_w^L$ cycles, where $P_w^L$ is the weight precision for this layer $L$. Whereas DPNN would process 16 sets of 16 activations and 128 filters over 256 cycles, LM processes them concurrently but bit-serially over $P_a^L \times P_w^L$ cycles. If $P_a^L$ and/or $P_w^L$ are less than 16, LM will outperform DPNN by $256/(P_a^L \times P_w^L)$. Otherwise, LM will match DPNN's performance.

Fully-Connected Layers: Processing starts by loading the LSBs of a set of weights into the WR registers of the first SIP column and multiplying the loaded weights by the LSBs of the corresponding activations. In the second cycle, while the first column of SIPs is still busy multiplying the LSBs of its WRs by the second bit of the activations, the LSBs of a new set of weights can be loaded into the WRs of the second SIP column. Each weight bit is reused for 16 cycles, multiplying with bits 0 through 15 of the input activations. Thus, there is enough time for LM to keep any single column of SIPs busy while loading new sets of weights to the other 15 columns. For example, as shown in Figure 2b, LM can load a single bit of 2K weights to SIP(0,0)..SIP(0,127) in cycle 0, then load a single bit of the next 2K weights to SIP(1,0)..SIP(1,127) in cycle 1, and so on. After the first 15 cycles, all SIPs are fully utilized. It will take $P_w^L \times 16$ cycles for LM to process 16 sets of 16 activations and 128 filters, while DPNN processes them in 256 cycles.
Thus, when $P_w^L$ is less than 16, LM will outperform DPNN by $16/P_w^L$, and it will match DPNN's performance otherwise.

SIP: Bit-Serial Inner-Product Units: Figure 3 shows LM's Bit-Serial Inner-Product Unit (SIP). Every clock cycle, each SIP multiplies 16 single-bit activations by 16 single-bit weights to produce a partial output activation. Internally, each SIP has 16 1-bit Weight Registers (WRs), 16 2-input AND gates to multiply the weights in the WRs with the incoming input activation bits, and a 16-input 1b adder tree that sums these partial products. $AC_1$ accumulates and shifts the output of the adder tree over $P_a^L$ cycles. Every $P_a^L$ cycles, $AC_2$ shifts the output of $AC_1$ and accumulates it into the OR. After $P_a^L \times P_w^L$ cycles the Output Register (OR) contains the inner product of an activation and weight set. In each SIP, a multiplexer after $AC_1$ implements cascading. To support signed 2's complement activations, a negation block is used to subtract the sum of the input activations corresponding to the most significant bit (MSB) of the weights from the partial sum when the MSB is 1. Each SIP also includes a comparator (max) to support max pooling layers.

Dynamic Precision Reduction: So far we assumed that software provided profile-derived per layer activation and weight precisions [6]. Lascorz et al. observed that the hardware can further shorten these precisions by inspecting the actual values at runtime [5]. LM adjusts the precision per group of 256 activations that it processes concurrently. Per-bit-position OR trees produce a 16-bit vector indicating the positions where any of the activations has a 1. A leading one detector identifies the most significant such position and thus the precision in bits that is sufficient.

Processing Layers with Few Outputs: For LM to keep all the SIPs busy an output activation must be assigned to each SIP. This is possible as long as the layer has at least 2K outputs.
However, in the networks studied some FCLs have only 1K output activations. To avoid underutilization, LM implements SIP cascading, in which SIPs along each row can form a daisy-chain, where the output of one can feed into an input of the next via a multiplexer. This way, the computation of an output activation can be sliced along the bit dimension over the SIPs in the same row. In this case, each SIP processes only a portion of the input activations, resulting in several partial output activations along the SIPs on the same row. Over the next $S_n$ cycles, where $S_n$ is the number of bit slices used, the $S_n$ partial outputs can be reduced into the final output activation.

Other Layers: Similar to DaDianNao (DaDN) [1], LM processes the additional layers needed by the studied networks. To do so, LM incorporates units for MAX pooling as in DaDN. Moreover, to apply nonlinear activations, an activation functional unit is present at the output of the ABout. Given that each output activation typically takes several cycles to compute, it is not necessary to use more such functional units compared to DPNN.

Total Computational Bandwidth: In the worst case, with 16b activations and weights, a single 16b × 16b product that would have taken DPNN one cycle to produce now takes LM 256 cycles. Since DPNN calculates 128 products per cycle, LM needs to calculate the equivalent of 256 × 128 16b × 16b products every 256 cycles. LM has 128 × 16 = 2048 SIPs, each producing 16 1b × 1b products per cycle. Thus, over 256 cycles, LM produces 2048 × 16 × 256 1b × 1b products, matching DPNN's compute bandwidth.

SIP columns and accommodate precisions that are multiples of 2 and 4, respectively. For example, for LM 4b reducing $P_a^L$ from 8 to 5 bits produces no performance benefit, whereas for LM 1b it would improve performance by 1.6×.
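The dynamic precision detection and the granularity effect in the closing example can be sketched together. This is a software model under stated assumptions: activations are treated as unsigned, a group of all zeros is assumed to still need one bit, and a design that processes b bits per cycle is assumed to round each precision up to a multiple of b. All function names are hypothetical.

```python
MAX_BITS = 16


def group_precision(activations, max_bits=MAX_BITS):
    """Bits sufficient for a group: OR all values, then find the leading one."""
    mask = (1 << max_bits) - 1
    or_vector = 0
    for a in activations:
        or_vector |= a & mask                  # per-bit-position OR across the group
    return max(or_vector.bit_length(), 1)      # leading-one position; assume at least 1 bit


def effective_precision(precision, bits_per_cycle):
    """Round a precision up to a multiple of the bits processed per cycle."""
    return ((precision + bits_per_cycle - 1) // bits_per_cycle) * bits_per_cycle


def relative_speedup(p_old, p_new, bits_per_cycle):
    """Speedup from trimming p_old to p_new, assuming time scales with effective precision."""
    return effective_precision(p_old, bits_per_cycle) / effective_precision(p_new, bits_per_cycle)


print(group_precision([3, 17, 0, 9]))   # the largest value, 17, needs 5 bits
print(relative_speedup(8, 5, 1))        # 1 bit per cycle
print(relative_speedup(8, 5, 4))        # 4 bits per cycle
```

Under these assumptions the model reproduces the example: trimming from 8 to 5 bits gives 1.6× with one bit per cycle and no benefit with four bits per cycle, since 5 bits still round up to 8.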
EVALUATION
This section evaluates Loom's performance, energy, and area, and explores the trade-off between accuracy and performance in comparison to DPNN and Stripes [7].
Methodology
Execution time is modeled via a custom cycle-accurate simulator, and energy and area measurements are collected over layouts of all designs. The designs were synthesized for worst case, typical case, and best case corners with the Synopsys Design Compiler using a TSMC 65nm library. Layouts were produced with Cadence Innovus using the typical case corner synthesis results, which were more pessimistic for LM than the worst case scenario. Power results are based on the actual data-driven activity factors. The clock frequency of all designs is set to 1GHz. The ABin and ABout SRAM buffers were modeled with CACTI [8], and AM and WM were modeled as eDRAM with Destiny [9]. We first evaluate LM assuming that all the activations fit on chip and that the weights can be read from off-chip memory without any bandwidth constraint, so as to explore the design space without being affected by the choice of a particular off-chip memory. We conclude by investigating performance with a single channel of low-power DDR4-4267.

Weight and Activation Precisions: Table 1 reports the profile-derived per layer precisions of input activations and network precisions of weights for the CVLs and FCLs using the method of Judd et al. [6]. Since LM's performance for the CVLs depends on both $P_a^L$ and $P_w^L$, we adjust them independently. We use per layer activation precisions and a single weight precision common to all CVLs. We found little inter-layer variability for weight precisions, but additional per layer exploration is warranted. Since LM's performance for FCLs depends only on $P_w^L$, we only adjust weight precision for FCLs. The precisions that guarantee no top-1 accuracy loss for CVLs vary from 5 to 13 bits for input activations and from 10 to 12 bits for weights. When a 99% relative top-1 accuracy is still acceptable, the activation and weight precisions can be as low as 4 and 10 bits, respectively. The per layer weight precisions for the FCLs vary from 7 to 10 bits.
Performance and Energy Efficiency
Figures 4a and 4b show, respectively, the performance and energy efficiency of the Loom, Stripes, and DStripes configurations relative to DPNN with the 100% accuracy precision profiles of Table 1 and for all layers combined. Stripes is based on the Stripes accelerator [7], which exploits only profile-derived per layer activation precisions and only for CVLs. DStripes incorporates dynamic precision reduction [5].
On average, LM 1b outperforms DPNN by more than 3× while being more than 2.5× more energy efficient. When LM processes multiple bits per cycle the performance benefits are lower, but energy efficiency improves up to 2.9×. LM 1b consistently outperforms Stripes and DStripes in performance and Stripes in energy efficiency. LM 1b is more energy efficient than DStripes except for GoogLeNet, where its energy efficiency is within 2% of DStripes. Table 2 reports per network performance and energy efficiency for LM configurations relative to DPNN for the FCLs and CVLs separately, and for the 100% and 99% accuracy profiles. In general, LM 1b outperforms LM 2b and LM 4b in most cases, with the latter two being more energy efficient. On occasion the latter two outperform LM 1b under the 100% accuracy profiles in FCLs. Since for LM the performance improvement in FCLs is only due to the use of lower weight precisions, processing multiple activation bits per cycle does not affect performance in the steady state. However, processing more activation bits per cycle reduces the initiation interval per layer, an effect that becomes noticeable for small FCLs.
The table also reports detailed results for Stripes. For FCLs, Stripes's performance and energy efficiency suffer as it does not exploit weight precisions. With the 99% accuracy profiles, both performance and energy efficiency improve considerably for FCLs and CVLs. Performance with DStripes would be identical to Stripes for the FCLs. We do not present detailed results for DStripes due to space limitations, noting that LM consistently outperforms DStripes while being more energy efficient, except for the CVLs of GoogLeNet, where the difference in energy efficiency is small.
Area Overhead
Post-layout measurements were used to measure the area of DPNN and Loom. The LM 1b configuration requires 1.34× the area of DPNN while achieving on average a 3.19× speedup. The LM 2b and LM 4b configurations reduce the area overhead to 1.25× and 1.16× while still improving the execution time by 3.05× and 2.74×, respectively. Thus LM exhibits better performance vs. area scaling than DPNN.
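The performance versus area claim can be checked by dividing each reported speedup by the corresponding relative area. The numbers below are the ones reported above; the dictionary layout and the ratio as a figure of merit are illustrative choices, with DPNN at 1.0 by definition.

```python
# Speedup and area, both relative to DPNN, as reported in the text.
configs = {
    "LM 1b": {"speedup": 3.19, "area": 1.34},
    "LM 2b": {"speedup": 3.05, "area": 1.25},
    "LM 4b": {"speedup": 2.74, "area": 1.16},
}

perf_per_area = {name: c["speedup"] / c["area"] for name, c in configs.items()}

for name, value in perf_per_area.items():
    print(f"{name}: {value:.2f}x performance per unit area vs. DPNN")
```

Every configuration lands well above 1.0, which is what the scaling statement asserts.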
Scaling
Thus far we assumed that all activations fit on chip and focused on a single LM configuration. We next consider configurations with practical on- and off-chip memory hierarchies. Specifically, we size the activation memory so that most layers can fit on-chip, avoiding off-chip accesses that today require at least two orders of magnitude more energy, a critical consideration in embedded systems. Accordingly, DPNN requires 2MB of activation memory (VGG19 requires 10MB, which is impractical for embedded systems, and thus has to spill activations off-chip). Since LM processes both activations and weights bit-serially, it naturally stores and communicates values on- and off-chip using the per layer precisions. As a result, LM requires only 1MB of on-chip memory for the activations. However, since LM processes more filters concurrently compared to DPNN, it can benefit from a larger weight memory. Figure 5 shows how average performance over all networks scales for different configurations where the number of SIPs is chosen to match the peak compute bandwidth (x-axis) of a bit-parallel accelerator. For example, the "128" configurations can perform the equivalent of 128 16b × 16b multiply-accumulate operations per cycle. For each configuration Figure 5 reports performance relative to DPNN and absolute performance as frames per second (fps). The figure reports results for the convolutional layers only and also for all layers. This is done because fully-connected layers are off-chip bound (and thus are affected by our choice of off-chip memory) whereas the convolutional layers are compute bound. Here we restrict attention to LM 1b. LM outperforms DPNN for all design points shown and can achieve real-time processing rates even for the "32" configuration. The relative performance advantage of LM drops for the larger configurations since LM requires more parallelism and suffers more from increased underutilization as the number of weight lanes grows.
DStripes's relative performance over DPNN remains constant for the range shown. LM outperforms DStripes up to the "128" configurations. At "256" LM and DStripes perform nearly identically and at "512" the latter performs better.
The figure also reports the weight memory capacity, the relative (vs. DPNN) area overhead, and the energy efficiency for the various LM configurations. For the "64" and "32" configurations LM requires 128KB and 544KB less memory in total than DPNN. However, for the "128" and the "256" configurations LM requires more memory than DPNN. Regardless, the performance benefits exceed the relative area overhead, and thus LM provides a better performance/area trade-off than DPNN. For the "256" configuration energy efficiency suffers with LM. However, this measurement ignores the energy of off-chip traffic, which is on average 0.61× less with LM. Moreover, as CNNs evolve to process higher resolution images, the size of activation memory increases significantly compared to the filter sizes, which makes the effect of data compression more important [11]. Thus we expect that for higher resolution images LM will be ever more appealing.
Per Group Weight Precisions
Thus far we assumed that LM exploits software-provided profile-derived per layer weight precisions [6]. However, exploiting the approach of Lascorz et al. [10], LM can further trim the weight precisions at a finer granularity to boost the performance and energy efficiency of both FCLs and CVLs. The per group weight precisions can be detected at runtime similarly to the activation precisions, or can be detected statically and communicated via per group metadata. The performance and energy efficiency of Loom configurations relative to DPNN with the precision profiles of Table 3 and for all layers combined are shown in Table 4. For these estimates we assume that performance scales linearly with weight precision.
Exploiting the effective weight precisions yields a speedup of 4.38×, 4.20×, and 3.76× over DPNN for LM 1b , LM 2b , and LM 4b configurations, respectively. The energy efficiency of LM over DPNN is 3.54×, 3.95×, and 3.94× for the aforementioned configurations.
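A minimal sketch of static per group weight precision detection follows. It is not the method of [10]: it assumes signed two's complement weights, equal-sized groups, and an execution time that scales linearly with weight precision, the last being the same assumption used for the estimates above. The function names, group size, and sample weights are hypothetical.

```python
def signed_bits(value):
    """Bits needed to represent an integer in two's complement."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1          # one extra bit for the sign


def per_group_precisions(weights, group_size):
    """Precision sufficient for each consecutive group of weights."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [max(signed_bits(w) for w in g) for g in groups]


def estimated_speedup(layer_precision, group_precisions):
    """Speedup over per layer precision, assuming time is linear in weight precision."""
    average = sum(group_precisions) / len(group_precisions)
    return layer_precision / average


weights = [3, -4, 100, -128, 1, 0, -1, 2]
precisions = per_group_precisions(weights, group_size=4)
layer_precision = max(precisions)              # a per layer profile must cover the widest group
print(precisions, estimated_speedup(layer_precision, precisions))
```

In this toy layer one group needs 8 bits and the other only 3, so processing each group at its own precision averages 5.5 bits instead of 8.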
RELATED WORK
Due to space limitations, we limit attention to the few works that are most closely related. We have already compared to Stripes [7] extended with dynamic precision reduction [5].
Pragmatic's performance for the CVLs depends only on the number of activation bits that are 1, but it does not improve performance for FCLs [12]. Further performance improvement may be possible by combining Pragmatic's approach with LM's, but the costs per SIP may make this prohibitively expensive. Proteus exploits per layer precisions, reducing memory footprint and bandwidth, but requires crossbars per input weight [13]. Loom does not need crossbars. Hardwired NN implementations naturally exploit per layer precisions [14]. Loom does not require that the whole network fit on chip nor does it hardwire precisions. Furthermore, Loom trims activation precisions further at runtime.
Several accelerators target ineffectual weights and/or activations for dense and/or sparse networks [15], [16], [17], [18]. Most target either FCLs or CVLs alone. LM targets both layer types and benefits all inputs, ineffectual or not.
CONCLUSION
This work presented Loom, a hardware inference accelerator for DNNs whose execution time for the convolutional and the fully-connected layers scales proportionally with the precision p used to represent the input data. LM can trade off accuracy vs. performance and energy efficiency on the fly. Future work may consider extending LM to further exploit weight sparsity.
