We present MATIC (Memory-Adaptive Training and In-situ Canaries), a voltage-scaling methodology that addresses the SRAM efficiency bottleneck in DNN accelerators. To overscale DNN weight SRAMs, MATIC combines the characteristics of destructive SRAM reads with the error resilience of neural networks in a memory-adaptive training process. PVT-related voltage margins are eliminated by using bit-cells from the synaptic weights themselves as in-situ canaries that track runtime environmental variation. Demonstrated on a low-power DNN accelerator fabricated in 65nm CMOS, MATIC enables up to 3.3× total energy reduction, or an 18.6× reduction in application error.
I. INTRODUCTION
Deep neural networks (DNNs) have demonstrated state-of-the-art performance on a variety of signal processing tasks, and there is growing interest in DNNs for next-generation IoT and embedded platforms. However, recent work has shown that for accelerators that reduce or eliminate DRAM access, on-chip SRAM dedicated to synaptic weights accounts for greater than 50% of total system power [1]. The on-chip memory problem is particularly acute in DNNs with classifier layers, where data-reuse techniques [2], [3] are ineffective since classifier weights are unique, and these weights account for greater than 80% of total weight parameters [4]. Voltage scaling can enable significant static and dynamic power reduction, but read and write stability constraints have historically prevented more aggressive scaling for SRAM: either SRAM is placed on a separate voltage rail hundreds of millivolts higher than the rest of the design, or the system shares a unified voltage domain. In either case, significant energy savings from voltage scaling remain unrealized due to SRAM operating-voltage constraints, translating to shorter operating lifetimes.
In this paper we present MATIC (Memory-Adaptive Training and In-situ Canaries), the first hardware/algorithm co-design methodology that exploits the inherent error tolerance of DNNs to remove wasteful voltage margins and aggressively scale the voltage of weight SRAMs with little to no accuracy loss (Figure 1). In addition to the development of MATIC, we design and implement SNNAC, a low-power DNN accelerator for energy-constrained mobile devices, and demonstrate state-of-the-art energy efficiency.
II. BACKGROUND

A. Deep Neural Networks
Deep neural networks (DNNs) are a class of bio-inspired machine learning models that are represented as a directed graph of neurons [5]. DNN operation can be divided into two key mechanisms: (1) inference and (2) training:
(1) During inference, a neuron $k$ in layer $j$ implements the composite function
$$z_k^{(j)} = f\Big(\textstyle\sum_i w_{k,i}^{(j)}\, z_i^{(j-1)}\Big),$$
where $z_i^{(j-1)}$ denotes the output from neuron $i$ in the previous layer, and $w_{k,i}^{(j)}$ denotes the weight in layer $j$ from neuron $i$ in the previous layer to neuron $k$. $f(x)$ is a non-linear function, typically a sigmoidal function or rectified linear unit (ReLU). Since the computation of a DNN layer can be represented as a matrix-vector dot product (with $f(x)$ computed element-wise), DNN execution is especially amenable to dataflow hardware architectures designed for linear algebra.
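Since a layer is just a matrix-vector product followed by an element-wise non-linearity, the mapping is easy to state in code. The following minimal NumPy sketch is our own illustration of the inference equation above, not code from the paper:

```python
import numpy as np

def layer_forward(W, z_prev, f=lambda x: np.maximum(x, 0.0)):
    """One DNN layer: z^(j) = f(W @ z^(j-1)); f defaults to ReLU."""
    return f(W @ z_prev)

# Example: a 3-neuron layer fed by 4 inputs
W = np.arange(12, dtype=float).reshape(3, 4) / 10.0
z = layer_forward(W, np.ones(4))   # array([0.6, 2.2, 3.8])
```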
(2) Training involves iteratively solving for weight parameters using some variation of gradient descent. Given a weight $w_{k,i}^{(j)}$, its value at training iteration $n$ is given by
$$w_{k,i}^{(j)}[n] = w_{k,i}^{(j)}[n-1] - \eta\, \frac{\partial J}{\partial w_{k,i}^{(j)}},$$
where $\eta$ is the learning rate and $J$ is a suitable loss function (e.g., mean-squared error or cross-entropy). The partial derivatives of the loss function w.r.t. the weights are computed by propagating error backwards via partial differentiation (backprop). For example, for a network with a single hidden layer, sigmoid activations $f(x) = 1/(1 + e^{-x})$, and mean-squared loss $J = \frac{1}{2}\sum_k \big(t_k - z_k^{(2)}\big)^2$, the error gradient w.r.t. an output-layer weight $w_{k,i}^{(2)}$ is
$$\frac{\partial J}{\partial w_{k,i}^{(2)}} = -\big(t_k - z_k^{(2)}\big)\, z_k^{(2)}\big(1 - z_k^{(2)}\big)\, z_i^{(1)}.$$
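As a sanity check on the gradient expression above, the following NumPy sketch (our own illustration, with an assumed network size) compares the analytic gradient against a finite-difference estimate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
z1 = rng.normal(size=4)          # hidden-layer outputs z^(1)
W2 = rng.normal(size=(3, 4))     # output-layer weights w^(2)_{k,i}
t  = rng.normal(size=3)          # training targets

def loss(W):
    z2 = sigmoid(W @ z1)         # output-layer activations z^(2)
    return 0.5 * np.sum((t - z2) ** 2)

# Analytic gradient: dJ/dw_{k,i} = -(t_k - z_k) * z_k * (1 - z_k) * z_i
z2 = sigmoid(W2 @ z1)
grad = (-(t - z2) * z2 * (1 - z2))[:, None] * z1[None, :]

# Finite-difference check on one element
eps = 1e-6
Wp, Wm = W2.copy(), W2.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
fd = (loss(Wp) - loss(Wm)) / (2 * eps)
print(np.isclose(grad[0, 0], fd))   # True: analytic and numeric agree
```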
[Figure 2: weight-adaptation example; after several training iterations the network has learned around the injected bit-error.]
B. Weight Adaptation
The simplified example in Figure 2 shows how trainable weight parameters imbue neural networks with intrinsic resilience to error. The network in Figure 2(a) is initialized with 8-bit integer weights such that the network loss is zero for the training set. The network uses a sigmoid activation function, and square loss with stochastic gradient descent (SGD) for training [5]. To simulate a fault in SRAM, at iteration 1 we apply a static mask to bit '4' of $w_{1,2}$ (Figure 2(b)). While the fault initially degrades error performance, the surrounding weights adapt after several iterations; the large non-zero gradients illustrate the backprop mechanism compensating for the error injected on $w_{1,2}$.

C. SRAM Read Failures

Figure 3 shows a simplified 6T SRAM bit-cell model. Variation-induced mismatch between bit-cell devices creates an inherent state-independent offset [6]. This offset results in each bit-cell having a "preferred state." For instance, the bit-cell depicted in Figure 3 favours driving out to logic '0'. Due to statistical variation, larger memories are likely to contain a number of cells with significant offset error.
As supply voltage scales, the diminished noise margin allows the bit-cell to be flipped to its preferred state during a read. Once flipped, the bit-cell retains state, favouring its (now incorrect) bit value due to the persistence of the built-in offset. Consequently, the occurrence of memory bit-cell read failure at low supply voltages is random in space, but provides stable read outputs consistent with each cell's preferred state. The read failures described above are distinct from bitline access-time failures, which can be corrected with ample timing margin.

III. MATIC DESIGN
A. Memory-adaptive Training
Memory-adaptive training leverages the inherent resilience of neural networks to adapt around bit-errors in synaptic weights that result from voltage scaling past $V_{min,read}$. Random mismatch results in bit-cells that are biased towards a given storage state. If a bit-cell stores the complement of its "preferred" state, performing a read at a sufficiently low $V_{DD}$ flips the cell, and subsequent reads will be incorrect but stable. The SRAM read failures described above are profiled post-silicon and incorporated into the backpropagation (backprop) algorithm (as described in Section II-B) with an injection mask (Figure 4). During every training iteration, the injection mask preserves the fractional portion of a given synaptic weight and applies bit-masks (corresponding to bit failures in SRAM) to the fixed-point weight. The masked weight is used in SGD before being recombined with its fractional part for the weight update. Preserving the fractional weight is critical, since it enables the gradual value-shifts that occur over multiple (fractional) weight updates.
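A minimal sketch of the masked weight update is given below. It assumes weights quantized to 8-bit fixed-point codes with per-bit stuck-at masks profiled from the SRAM; the scale factor, mask format, and forward_backward helper are illustrative assumptions (and signed-arithmetic corner cases are glossed over), not the paper's exact implementation:

```python
import numpy as np

SCALE = 128.0   # assumed Q1.7-style fixed-point scale, for illustration

def masked_weights(w, and_mask, or_mask):
    """Quantize weights, apply profiled stuck-at bit faults, and return both
    the masked weight (used in the forward/backward pass) and the preserved
    fractional remainder (recombined at the weight update)."""
    w_fixed = np.round(w * SCALE).astype(np.int32)
    frac = w - w_fixed / SCALE                 # fractional part to preserve
    w_faulty = (w_fixed & and_mask) | or_mask  # stuck-at-0 / stuck-at-1 bits
    return w_faulty / SCALE, frac

# One training iteration (forward_backward is a hypothetical helper):
# w_m, frac = masked_weights(w, and_mask, or_mask)
# loss, grad = forward_backward(x, y, w_m)
# w = (w_m + frac) - lr * grad   # recombine the fraction before updating
```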
To evaluate the feasibility of memory-adaptive training, we first examine the training flow with simulated SRAM failure rates. At each voltage, a proportion of randomly selected weight bits is fixed to either '1' or '0', where the proportion of fixed bits is determined from SPICE Monte Carlo simulation. Figure 5 shows that a significant fraction of bit errors can be tolerated, and that MATIC provides a reasonably smooth energy-error tradeoff curve.
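To emulate a given voltage point in software, per-bit stuck-at masks can be drawn at the failure rate predicted by Monte Carlo simulation. A sketch under these assumptions (that each failed bit sticks at '0' or '1' with equal probability, which is our own assumption):

```python
import numpy as np

def random_stuck_masks(shape, n_bits, p_fail, rng):
    """AND/OR masks where each weight bit fails independently with
    probability p_fail; a failed bit sticks at '0' or '1' with equal odds."""
    and_mask = np.full(shape, (1 << n_bits) - 1, dtype=np.int32)
    or_mask = np.zeros(shape, dtype=np.int32)
    for b in range(n_bits):
        fails = rng.random(shape) < p_fail
        stuck1 = rng.random(shape) < 0.5
        and_mask &= ~((fails & ~stuck1).astype(np.int32) << b)  # stuck-at-0
        or_mask |= (fails & stuck1).astype(np.int32) << b       # stuck-at-1
    return and_mask, or_mask

masks = random_stuck_masks((256,), n_bits=8, p_fail=0.05,
                           rng=np.random.default_rng(0))
```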
B. In-Situ Synaptic Canaries
The in-situ canary circuits are bit-cells drawn directly from the synaptic-weight SRAMs that facilitate SRAM supply-voltage control (Figure 6). Traditional canary circuits replicate critical circuits to detect imminent failure, but require added margin and are vulnerable to PVT-induced mismatch. Instead, MATIC uses the synaptic weights themselves as in-situ canary circuits, leveraging a select number of bit-cells on the margin of failure to maintain a target bit-cell read-failure rate. The in-situ canary technique relies on two key observations:
1. Since the most marginal, failure-prone bit-cells are chosen as canaries, canaries fail before other performance-critical bit-cells (see the selection sketch below).
2. Neural networks are robust to a small fraction of uncompensated errors [7]. As a result, the failure states of canary bit-cells are not critical for overall performance, and canaries can be selected directly from the synaptic weights.
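A sketch of how canary selection might be implemented from post-silicon profiling data, assuming a per-bank map of each bit-cell's measured read-failure voltage is available (the data structure is illustrative, not from the paper):

```python
def select_canaries(fail_voltage, n_canaries=8):
    """Pick the most marginal bit-cells in a bank, i.e. those with the highest
    measured read-failure voltage; these fail first as V_DD scales down."""
    ranked = sorted(fail_voltage.items(), key=lambda kv: kv[1], reverse=True)
    return [addr for addr, _ in ranked[:n_canaries]]

# fail_voltage maps a bit-cell address, e.g. (row, col), to the V_DD at which
# the cell first failed during post-silicon profiling.
```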
At runtime, in-situ canary bits are polled periodically to determine whether supply-voltage modifications should be applied. While we use an integrated microcontroller in the test chip described below, the runtime controller can be implemented with faster or more efficient circuits if required. For our tests we use a binary control policy, where the SRAM supply voltage is adjusted in 1mV steps upon detecting a failing/succeeding canary bit-cell. Since we pick the most marginal, failure-prone bit-cells, only the canary bits need to be re-written at runtime. For canary selection, we conservatively select eight distributed, marginal canary bit-cells from each weight-storage SRAM.
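The resulting policy is a simple bang-bang controller. The sketch below is a hypothetical software rendering of one polling iteration; read_bit, write_bit, and set_vdd stand in for platform-specific hardware access and are our own assumed interface:

```python
def canary_control_step(canaries, v_now, read_bit, write_bit, set_vdd,
                        v_step=0.001):
    """One polling iteration of the binary control policy: raise V_DD by one
    1mV step if any canary fails (re-writing the failed canaries), otherwise
    lower it by one step. canaries is a list of (address, expected_bit)."""
    failing = [(addr, exp) for addr, exp in canaries if read_bit(addr) != exp]
    if failing:
        v_now += v_step                # back off by one step
        for addr, exp in failing:
            write_bit(addr, exp)       # only canary bits are ever re-written
    else:
        v_now -= v_step                # margin available: scale down
    set_vdd(v_now)
    return v_now
```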
IV. DNN ACCELERATOR ARCHITECTURE
To demonstrate the effectiveness of MATIC, we implement a light-weight SoC called SNNAC (Systolic Neural Network AsiC) in 65nm CMOS. The SNNAC architecture (Figure 7) is based on the systolic dataflow design from SNNAP [8], modified for SoC integration. The SNNAC core consists of a fully-programmable central Neural Processing Unit (NPU) that contains eight multiply-accumulate (MAC)-based Processing Elements (PEs) arranged in a systolic ring. Energy-efficient PE compute is achieved with fixed-point arithmetic at 8-22 bit precision, and each PE uses a dedicated voltage-scalable SRAM bank to enable local storage of synaptic weights. The systolic ring is attached to the activation function unit (AFU), which minimizes energy footprint with piecewise-linear approximations of activation functions (e.g., sigmoid, ReLU). Programmability is achieved with statically compiled instruction schedules, which time-multiplex the computation of each DNN layer onto the systolic array. SNNAC also includes a sleep-enabled OpenMSP430-based microcontroller (μC), which handles runtime control and off-chip communication.
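As an illustration of the AFU's approach, a piecewise-linear activation can be implemented as interpolation over a small breakpoint table. The breakpoints and segment count below are assumptions for illustration, not SNNAC's actual lookup table:

```python
import numpy as np

# Assumed breakpoints for a piecewise-linear sigmoid (not SNNAC's real table)
XS = np.array([-6.0, -3.0, -1.0, 0.0, 1.0, 3.0, 6.0])
YS = 1.0 / (1.0 + np.exp(-XS))   # exact sigmoid values at the breakpoints

def pwl_sigmoid(x):
    """Linear interpolation between breakpoints; inputs outside the table
    clamp to the endpoint values (~0.0025 and ~0.9975)."""
    return np.interp(x, XS, YS)
```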
V. EXPERIMENTAL RESULTS
At 25C and 0.9V, SNNAC nominally operates at 250 MHz and dissipates 16.8 mW, achieving a 90.6% classification rate on MNIST [9]. The other application benchmarks include face detection (MIT CBCL face database [10]), and two benchmarks from the approximate-computing suite AxBench [11]. We find that the compiled SRAMs (rated at 0.9V across PVT) first exhibit read failures at 0.53V at 25C, with all reads failing at 0.4V (Figure 8(a)). Since the point of first failure is dictated by the tails of the $V_{min,read}$ statistical distribution, we expect voltage savings to increase in more advanced process nodes, and with larger memories. For instance, the SRAM variability study from [6] exhibits $V_{min,read}$ failures starting at ~0.66V with a 45nm, 64 kb array. Figure 10 shows how MATIC recovers application error, resulting in an 18.6× reduction in average error-increase (AEI) versus naive hardware (Table I). To avoid biasing the application error analysis, all benchmarks use compact DNN topologies that minimize intrinsic over-parameterization (Figure 8(b)).
For energy efficiency we consider the operation of SNNAC in three feasible operating scenarios: HighPerf (high performance), EnOpt_A (energy optimal, separate logic/SRAM voltages), and EnOpt_B (energy optimal, single voltage domain). At the minimum-energy point (MEP) across the three cases, MATIC enables up to 3.3× total energy reduction, and 5.1× energy reduction in SRAM (Table II, Figure 9). We note that SRAM energy is minimized at 0.5V with a 38% SRAM bit-cell failure rate, which translates to an 87% classification rate on MNIST (versus 29.3% for naive hardware). To demonstrate system stability over temperature, we execute the application benchmarks in a chamber with ambient temperature control, and sweep the temperature from -15C to 90C for a given nominal voltage. Figure 11(a) shows the SRAM voltage settings dictated by the in-situ canary system for an initial setting at 0.5V, with 0% mismatch relative to the expected error. Table III shows that the MATIC-SNNAC combination achieves state-of-the-art energy efficiency and a wider operating-voltage range compared to other FC-DNN accelerators. The performance of SNNAC is either better than or comparable to state-of-the-art convolution-centric (Conv.) accelerators, despite the lack of data and filter reuse opportunities in unique FC-layer weights.
VI. CONCLUSION
In this paper we presented MATIC, the first hardware/algorithm co-design methodology that addresses the energy-efficiency bottleneck imposed by synaptic weight SRAMs. Key developments and contributions include:
1. Memory-adaptive training - a technique that leverages the adaptability of neural networks to train around errors resulting from SRAM voltage scaling.
2. In-situ synaptic canaries - the use of bit-cells directly from the weight SRAMs for voltage control and variation-tolerance.
In addition, we designed and implemented SNNAC, a low-power DNN accelerator fabricated in 65nm CMOS (Figure 11(b-c)). Demonstrated on SNNAC, the application of MATIC results in 3.3× total energy reduction and 5.1× energy reduction in SRAM, or an 18.6× reduction in application error, thus enabling robust and energy-efficient operation for a general class of DNN accelerators.
