Memristive crossbars have become a popular means for realizing unsupervised and supervised learning techniques. Often, to preserve mathematical rigor, the crossbar itself is separated from the neuron capacitors. In this work, we sought to simplify the design, removing extraneous components so that it consumes significantly less power at a minimal cost in accuracy. This work provides derivations for the design of such a network, named the Simple Spiking Locally Competitive Algorithm, or SSLCA, as well as CMOS designs and results on the CIFAR and MNIST datasets. Compared to a non-spiking model which scored 33 % on CIFAR-10 with a single-layer classifier, this hardware scored 32 % accuracy. When used with a state-of-the-art deep learning classifier, the non-spiking model achieved 82 % and our simplified, spiking model achieved 80 %, while compressing the input data by 79 %. Compared to a previously proposed spiking model, our proposed hardware consumed 99 % less energy to do the same work at 21 × the throughput. Accuracy held up under online learning with a write variance of up to 3 % and a read variance of up to 40 %. The proposed architecture's excellent accuracy and significantly lower energy usage demonstrate the utility of our innovations. This work provides a means for extremely low-energy sparse coding in mobile devices, such as cellular phones, or for very sparse coding as is needed by self-driving cars or robots that must integrate data from multiple, high-resolution sensors.
I. INTRODUCTION
Sparse coding, accomplished through algorithms that encode an input stimulus in a new basis with few non-zero elements, has been shown to yield excellent image classification accuracy [1] . These algorithms have also been shown to reduce the learning time required for backpropagation, a popular training method for neuromorphic architectures [2] . Since there are few non-zero elements, sparse coding also provides a means for minimizing the bandwidth required to transfer sensor data amongst multiple processors, or to store that data in long-term storage. These algorithms have gained traction in recent years partially thanks to the discovery of these benefits as well as biological evidence that the V1 visual layer in mammalian cortices performs similar functionality [3] , [4] .
In addition to benefits for machine learning, the implementation of neuromorphic algorithms, of which sparse coding algorithms are a subset, on custom Application-Specific Integrated Circuits (ASICs) has been a wildly popular area of study, largely thanks to the development of programmable, variable-resistance nanodevices, named memristors, that can be used to realize the needed synapses in a more compact, energy-efficient form [3]-[9].
Our work extends from the Locally Competitive Algorithm (LCA) proposed by Rozell et al. in 2008 , an optimal solver for the sparse coding problem [10] , coupled with Oja's rule, used to repeatedly tune the dictionary towards an optimal solution for a set of training inputs [11] . The sparse code from the LCA was passed to a supervised layer, which was trained to classify the stimulus either as the value of a handwritten digit or the type of object in a tiny image: MNIST and CIFAR-10, respectively [12] , [13] . This approach leverages the transformation of input data from its native space to a decorrelated space via the LCA, and then uses traditional machine learning techniques to classify the stimulus based on its decorrelated representation. We have used this approach in the past [7] . In practice, there are several benefits to this approach: improved accuracy due to the stability of the decorrelated representation (an effect similar to dropout), and the ability to compress the input stimulus between the measuring device and the identification layer. A single frame of HD video data consists of approximately 6.2 Mbit of information, which the LCA can compress down to 1.6 Mbit, a 74 % reduction, at an RMSE of 5.8 %, or down to 1.0 Mbit, an 86 % reduction, at an RMSE of 7.0 %. For video surveillance systems or autonomously driving vehicles, this means that several cameras could be wired into a single, high-speed, low-energy sparse coding device, greatly reducing the needed communication bandwidth for the system.
The LCA provides competition amongst inputs to minimize the number of active outputs and simultaneously maximize the fidelity of the reconstruction produced. The exact function minimized by the LCA, with the input stimulus as s, the sparse code as a, the reconstruction as ŝ, and a cost function C(·) based on the sparse thresholding that yielded a, is written as:
||s(t) − ŝ(t)||² + λ Σ_m C(a_m(t)).   (1)

Minimizing this function, Rozell et al. derived an ODE for each output neuron's internal state u_m(t), where b_m(t) is the dot product between the input stimulus and neuron m's dictionary element, a_m(t) is the thresholded output of u_m(t), G_m,n is the overlap between dictionary elements m and n, and τ is a time constant:

τ du_m(t)/dt = b_m(t) − u_m(t) − Σ_(n≠m) G_m,n a_n(t).   (2)
While an effective equation, prior attempts at implementing the LCA directly in hardware have suffered from a few issues that prevented an efficient implementation. The most obvious comes from the inhibition term in Eq. (2): each output column depends on all other output columns. For a naive implementation, and indeed the one chosen by Rozell's group and presented in Shapero et al. [14], this implies O(N²) hardware scaling: doubling the number of output elements quadruples the required hardware. Additionally, a low-power implementation of the dot product in the ODE is non-trivial: it had to either be implemented digitally, or using next-generation components like memristors. While variable-resistance nanodevices like memristors can compute a dot product using little power themselves, making the computation accurate would require a virtual ground, a very power-hungry configuration [15].
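To make the scaling problem concrete, the following minimal NumPy sketch (our own notation, assuming a soft-threshold cost for C(·)) performs one forward-Euler step of the LCA dynamics in Eq. (2); the N × N matrix G of pairwise dictionary overlaps is the term that forces O(N²) hardware or computation:

```python
import numpy as np

def lca_step(u, s, Phi, lam=0.1, tau=10e-3, dt=1e-3):
    """One forward-Euler step of the LCA ODE (Eq. (2)).

    u   : (N,)   internal neuron states
    s   : (M,)   input stimulus
    Phi : (M, N) dictionary; column n is neuron n's receptive field
    lam : sparsity threshold (assumes a soft-threshold cost C)
    """
    a = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)  # thresholded activations
    b = Phi.T @ s                                       # per-neuron driving dot products
    G = Phi.T @ Phi - np.eye(Phi.shape[1])              # N x N inhibition -- the O(N^2) term
    du = (b - u - G @ a) / tau
    return u + dt * du, a

# Example: a 64-element patch encoded by 32 output neurons.
rng = np.random.default_rng(0)
Phi = rng.random((64, 32))
Phi /= np.linalg.norm(Phi, axis=0)
u = np.zeros(32)
for _ in range(200):
    u, a = lca_step(u, rng.random(64), Phi)
```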
In this work we set out to provide a low-power, hardware-friendly realization of an LCA-like algorithm that used a spiking framework. To maximize power savings, the model was simplified, and the drawbacks of that simplification were investigated; spikes were utilized to save power, and the efficacy of spiking for power saving was explored. The proposed architecture has been named the Simple Spiking Locally Competitive Algorithm (SSLCA). The SSLCA was compared with both the original LCA's ODE and the later spiking work of Shapero et al., using both the MNIST and CIFAR-10 datasets. Through these comparisons, we found that our proposal demonstrated excellent power and scaling qualities. It is our hope that this work provides the basis for efficient sparse-coding hardware in next-generation computers and robotics.
II. RELATED WORK
Due to the development of nanodevices that can function as hardware synapses, such as memristors, and the popularity of neuromorphic algorithms, there is a substantial body of prior work on ASICs for neuromorphic sparse coding architectures. The relative performance of these architectures, in terms of energy efficiency and throughput, is shown in Fig. 1.
The original work on LCAs was Rozell et al., 2008 [10]. Rozell et al. sought to improve upon prior sparse coding algorithms by deriving an optimal expression to both minimize the sparse equation (Eq. (1)) and smooth the generated sparse representation when given time-varying input. Their work derived an ODE that solved both of these problems. However, a hardware implementation of this ODE would scale with O(N²) operations as an artifact of its inhibitory term (Eq. (2)). Furthermore, digital implementations would require many iterations to stabilize the ODE, while analog implementations would require constant, voltage-scaled inputs and a power-hungry virtual ground for each output neuron [7]. The LCA was implemented using analog signals and subthreshold currents on a Field Programmable Analog Array (FPAA) with floating gates in 2012 by Shapero et al., a group including C. Rozell [16]. While it demonstrated power consumption scaling of only O(N√N), the required hardware still scaled as O(N²), and convergence was relatively slow, occurring after 240 µs.

Fig. 1: Comparison of the SSLCA's energy efficiency and throughput, presented in this work, with previous state-of-the-art results. One "Op," or operation, is the complete generation of a sparse code from a single set of inputs; we used this metric as it is scale-invariant (doubling the number of input elements processed in parallel would not affect this quantity).
These drawbacks were addressed to some extent later by Shapero et al. in 2013 [14]. That work extended the original LCA to a spiking architecture, referred to in this work as the Spiking Locally Competitive Algorithm (SLCA). The motivation for spiking largely seems to have stemmed from biology: biological neural systems appear to use spiking rather than constant signals [3], [4], [17]. Spiking models have also long been believed to consume less power, and to exhibit additional computational power due to their stochasticity [18]-[20]. The validity of leveraging spikes to save power is discussed further in Section IV-A2 of this work. In their work, Shapero et al. showed that their SLCA consumed more power than their LCA at small sizes, but that it scaled only as the desirable O(N), and would consume less power than the LCA at large network sizes. Additionally, they reduced the convergence time to 25 µs, nearly 90 % faster than their LCA, with a throughput of 40 kOps/s. However, the required hardware still scaled as O(N²).
Other spiking networks optimized for sparse coding have been published, such as SAILnet, introduced by Zylberberg et al. in 2011 [21]. ASICs using this architecture have been studied, with a substantial reduction in power compared to the approach presented by Shapero et al. Knag et al. [22] used the SAILnet architecture to process images at only 48 pJ/input for their inference logic with a throughput of 0.55 MOps/s, or at 176 pJ/input with a throughput of 4.8 MOps/s, 120 × as fast as Shapero et al. Their design was CMOS-based and utilized a decreased resolution for weight storage: 4 bits per excitatory or inhibitory weight. This decision has been justified in a number of prior works dealing with how much accuracy is needed for sparse coding algorithms to perform well [7], [23]. Like the LCA, SAILnet uses a direct inhibitory weight between each pair of output neurons, yielding a scaling complexity of O(N²).
The closest family of algorithms that does not exhibit O(N²) scaling is Spike-Timing-Dependent Plasticity (STDP). STDP exploits what is known as "Hebbian" learning, where input spike events that occur at the same time as an output spike event become more likely to trigger that output spike event. The common idiom for this behavior is, "neurons that fire together, wire together." In effect, each output neuron learns to activate when a correlated set of inputs fires together. This is very similar to what happens in sparse coding, where a neuron responds to a specific pattern in the input. The primary difference is that STDP makes no effort to preserve the information found in the input; that is, it is not a compressive algorithm. Rather, the purpose of STDP is to flag which features are present in the input and how prevalent they are, without regard for the other features present. Sparse coding, on the other hand, will suppress the output of a feature that is already represented by a combination of other features. Both techniques are forms of unsupervised learning, except that sparse coding requires inhibitory terms while STDP does not. This gives STDP the desirable quality of O(N) scaling. Due to its excellent scaling properties, STDP was used in one of the earliest attempts to replicate the features found in mammalian visual cortices [3], has been explored as an autoencoder [24], and has been used to generate unsupervised features for digit classification on the MNIST digit database [5]. STDP is also one of the dominant architectures researched using next-generation nanodevices such as memristors [4]-[6], [8], [9], [17]. The downside of omitting inhibition is that more output neurons are required; with 50 neurons, prior research showed that STDP achieved 80 % accuracy on MNIST, while a sparse coding layer using the LCA achieved 85 % [5], [7].
Recent work by Sheridan et al. showed that their group has manufactured memristive crossbars and applied voltage across the network to calculate the similarity coefficients in the LCA equations [25] . The implementation of the actual LCA in Sheridan et al. was performed on a traditional computer reading data from the crossbar; our work extends their work by proposing a means of implementing the entirety of the LCA on the same chip as the memristive crossbar with few additional components. The Sheridan et al. work also advocated using a Winner-Take-All (WTA) approach to training the weight matrix. While effective, using WTA was motivated largely by the supposition that a single neuron's firing would dominate the response to most stimuli; however, with inhibition and larger, more complicated inputs, this is not the case.
III. MODEL

In light of issues with previous hardware implementations and the potential benefits of sparse coding algorithms discussed in Sections I and II, we set out to develop the simplest architecture for sparse coding that would exhibit O(N) scaling, utilize inhibition, and emphasize low-power operation. During inference, input spikes pass through a Row Header. Voltage is forwarded from the Row Headers to a nanowire crossbar with memristors at each junction. Current is allowed to pass through each memristive junction and is used to charge or discharge an LIF neuron in each Column Header (Fig. 3). When any LIF neuron spikes, an output spike is propagated and inhibitory forces are passed back through the crossbar to the Row Headers. The count of output spikes across any given time window describes the sparse code for the input pattern seen during that time window.
While any device whose resistance can be modified in-situ would suffice, memristors from Lu et al.'s group were chosen due to their nanoscale form factor and ability to be fabricated in tight crossbars [25] . These devices additionally exhibit a low on:off ratio, which has been associated with devices that possess better long-term storage and analog qualities [15] , [25] .
As an initial step, we established that the chosen architecture should fit the form shown in Fig. 2 . Assuming good accuracy could be derived, such an architecture would be sufficient for implementing sparse coding with the desired traits. Such an architecture would clearly exhibit O(N ) scaling. Inhibition could be implemented with a backwards-pass through the same crossbar used to charge the output neurons. Low-power operation would stem from the simplicity of the architecture, its good scaling properties, and an innovation on the way the neurons were integrated into the architecture.
Neurons in this architecture differ from previously proposed architectures. Like prior work, the Column Headers implement Leaky-Integrate-and-Fire (LIF) neurons [8] , [14] ; in contrast to those architectures, the LIF neurons both accrue charge and discharge via the crossbar in relation to the current input. In addition to requiring fewer components, this configuration has proven more tolerant of un-normalized receptive fields, a phenomenon discussed in Section III-A.
Inference with this architecture begins with input spikes reaching the Row Headers. Input spikes were chosen due to biological inspiration, the promise of lower power consumption, and also partially because memristors exhibit vastly different resistances at different voltages; using spikes rather than voltage-scaled inputs helps to avoid this situation [15]. Additionally, keeping both the input and output of the algorithm as spikes increases the homogeneity of the system. This enables the architecture to encode not only an input stimulus but also, e.g., the output of an STDP algorithm. Upon reaching the Row Headers, spikes are converted into voltages (either Vcc or 0 V) which are applied to the nanowire crossbar. At each crossbar junction, a memristive device provides resistance between the Row Headers and the Column Headers. The resistance of each device is set such that, at Vcc, it is between R_min and R_max. The pattern of conductances formed by these memristive devices in each column is the Receptive Field (RF) of the corresponding neuron. An input pattern aligning with this field will trigger an output spike in this column before any other column. Current flows through these memristive devices into or out of the capacitor in each Column Header. When one of the Column Header capacitors reaches a given threshold, that column generates an output spike, and current flows from that column back through the corresponding memristive devices into the Row Headers to recharge the inhibitory forces. Once the output spike event finishes, the process repeats.
The derivation of necessary parameters was broken down into two stages: calculations without inhibition, and an extension of those calculations to incorporate inhibition. This divide was necessary to ensure the solution was tractable, and had the added benefit of deriving two versions of the architecture which were used to demonstrate the benefits of inhibition.
A. Uninhibited SSLCA
To begin the derivation for the Uninhibited SSLCA, we start with the equation for Rozell et al.'s LCA, Eq. (2), and remove the inhibitory term. What remains is a leaky dot product, with no O(N²) scaling problem. However, the resulting equation also no longer possesses optimality guarantees. Oja's rule, used in this work to train each neuron's RF, can somewhat remedy the missing inhibitory term by adjusting multiple RFs to work together to reconstruct the input without inhibition [11]. However, even with this compensation, using a leaky dot product for sparse coding would be difficult in hardware due to the angle property of the dot product between two vectors:

a · b = ||a|| ||b|| cos(θ).   (3)
From Eq. (3), a larger magnitude in either vector could be used to compensate for a larger difference in angle. In other words, a maximally-conductive RF would generate more current than an RF that is a better match. Prior works have solved this issue by normalizing each RF [25] . However, we decided to solve this problem by creating a negative stimulus via inactive input channels. The current through these channels would be proportional to the RF, meaning that missing activity where the RF is conductive would lead to a higher penalty. This is the reason that capacitors in this network are placed directly on the crossbar, rather than behind a diode or equivalent.
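A small numerical sketch (illustrative conductance values only) shows the effect: a uniformly bright column wins a plain dot product, but once inactive inputs are allowed to drain the neuron through the same devices, the effective drive becomes the ratio of matched to total conductance, and the better-matched column wins:

```python
import numpy as np

# Relative conductances; the 0.25 floor mirrors the device's on/off ratio (Section III-C).
inputs = np.array([1.0, 1.0, 0.0, 0.0])        # rows 0 and 1 are spiking
rf_bright  = np.array([0.9, 0.9, 0.9, 0.9])     # highly conductive, poorly matched
rf_matched = np.array([0.7, 0.7, 0.25, 0.25])   # dimmer, but aligned with the input

for name, rf in [("bright ", rf_bright), ("matched", rf_matched)]:
    drive = inputs @ rf     # raw charging current: a plain dot product
    total = rf.sum()        # total conductance, including the inactive (draining) rows
    print(name, "dot product =", round(drive, 2),
          " steady-state fraction of Vcc =", round(drive / total, 2))
# bright  : dot product = 1.8, steady-state fraction = 0.5
# matched : dot product = 1.4, steady-state fraction = 0.74
```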
Using the layout from Fig. 2, the Row Headers for the uninhibited SSLCA are simple passthroughs, and the Column Headers are simply a capacitor and a Schmitt trigger that drains all capacitors once any one neuron's voltage exceeds V_fire volts. The partial derivative of any neuron's voltage is therefore:

C ∂V_neuron/∂t = Σ_i G_i (V_i − V_neuron),   (4)
where C is the capacitance of the capacitor, V_neuron is the current voltage of that capacitor, V_i is the ith input's voltage (one of Vcc or 0 V, depending on whether it is currently spiking), and G_i is the conductance of the memristive device connecting the nanowires of the ith Row Header and the neuron in question's Column Header.
Assuming an input row i spikes to voltage Vcc with a mean activity of K_i (on for a fraction K_i of the time, off for 1 − K_i), and is grounded the rest of the time, this can be reduced via the Laplace transform to:

V_neuron(t) = Vcc (Q2/Q1) (1 − e^(−Q1 t/C)) + V_neuron,t=0 e^(−Q1 t/C),   (5)
where V_neuron,t=0 is the neuron's voltage at t = 0. Q1 = Σ_i G_i, the column's total conductance, and Q2 = Σ_i K_i G_i, a matching metric between the stored RF and the input pattern, arise as intuitive factors that affect the neuron's state. To establish the necessary values for C and V_fire, Q1 and Q2 need to be derived in a way that produces good results for the network's "average case." Empirically, we found that assuming both the input and stored RF have binary elements (even for analog problems) produced the best results: the K values are either 1 or 0, while the G values are either the minimum or maximum conductance of our memristive devices. The resulting calculation for Q1 and Q2, required to determine both the network's trigger voltage V_fire and neuron capacitance C, is described in Algorithm 1. Though these calculations are based on a single sample of Q1 and Q2, our results showed that the network still worked well outside of these "average cases" (Section IV).
Sparse coding being the goal of this architecture, we also make the assumption that any spike event will reset all neuron charges to 0 V, implying that each output spike only encodes input activity seen since the end of the previous output spike. The downside to this assumption is that the architecture becomes a one-hot system: a pattern of simultaneously-firing output spikes becomes impossible. Superficially, this is in contrast to some other work on stochastic computation with spiking neurons [18] - [20] . The network still encodes stochastic information in a single output spike: the input pattern represented is stochastic due to the input spikes' duty cycles, and as such the corresponding output is stochastically selected. By not allowing a pattern of simultaneous output spikes, the number of representable input patterns in a single event is reduced. However, due to this stochasticity, we have found that collecting multiple output spikes over a period of time results in a stochastic pattern of output activity that accurately represents the input. This is functionally identical to the tradeoff of memory for time in computation; we are reducing the memory of the momentary output of our architecture in exchange for longer runtime. For sparse coding, where the resulting code often needs to be stored or otherwise buffered, this is not an issue.
With the above assumption, all V_neuron,t=0 = 0, and Eq. (5) can be rearranged to calculate C based on some Q1, Q2, V_fire, and t_fire, where V_fire and t_fire are the desired voltage and time at which an output spiking event should occur given the input and stored RF parameters that produce Q1 and Q2:

C = −Q1 t_fire / ln(1 − V_fire Q1 / (Q2 Vcc)).   (6)
As t_fire can be calculated from the desired hardware clock rate and number of spikes per patch, the remaining parameters needed to fully specify the uninhibited SSLCA are V_fire, Q1, and Q2. Our experiments yield good results when V_fire is calculated based on a thresholded maximum voltage from Eq. (5), with Q1 and Q2 calculated for the desired minimum RF that the resulting sparse code can represent, and when the Q1 and Q2 for the calculation of C come from an average case of the data set used with the network. The exact procedure followed to calculate these values is described in Algorithm 2. The algorithm requires knowledge of the expected average value of a stored receptive field, Rf_avg, as well as an idea of the minimum input intensity that should trigger an output spike, Rf_least. After scanning across many different combinations of these variables, we discovered that setting Rf_least = (1 − e^(−1)) Rf_avg typically yields optimal results; this relation was used throughout this work.
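As a concrete illustration of this calibration, the short sketch below evaluates Eq. (5) and solves Eq. (6) for C. The numbers are illustrative only: Q1 and Q2 would come from Algorithm 1, and the 0.9 threshold factor applied to the maximum voltage is our own assumption.

```python
import math

def v_neuron(t, Q1, Q2, C, Vcc=0.7, V0=0.0):
    """Neuron voltage over time, per Eq. (5)."""
    return Vcc * (Q2 / Q1) * (1 - math.exp(-Q1 * t / C)) + V0 * math.exp(-Q1 * t / C)

def capacitance_for(t_fire, V_fire, Q1, Q2, Vcc=0.7):
    """Solve Eq. (5) for C so the neuron reaches V_fire at t_fire (Eq. (6))."""
    return -Q1 * t_fire / math.log(1 - V_fire * Q1 / (Q2 * Vcc))

# Hypothetical column totals (siemens); Algorithm 1 would supply these for
# the average case (Rf_stored = Rf_input = Rf_avg) and the least-intense case.
Q1_avg, Q2_avg = 50e-6, 30e-6
Q2_least = Q2_avg * (1 - math.exp(-1))          # Rf_least = (1 - e^-1) Rf_avg
Vcc, t_fire = 0.7, 0.8e-9

V_fire = 0.9 * Vcc * Q2_least / Q1_avg          # thresholded maximum of Eq. (5)
C = capacitance_for(t_fire, V_fire, Q1_avg, Q2_avg, Vcc)
print(V_fire, C)                                 # ~0.24 V, ~48 fF for these values
```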
Following Algorithm 2, and substituting the resulting values into Eq. (6), all parameters for constructing the uninhibited network are defined, and the network can be built. Applying voltage spikes of magnitude Vcc to the input lines with a maximum duty cycle of K_max will cause the best-matching column to spike for t_spike seconds; collecting these spikes across a window of time (e.g. 10(t_fire + t_spike) for an average of 10 spikes) will produce a reasonable reconstruction of the input based on the network's receptive fields. Results with the uninhibited SSLCA can be found in Section IV.

B. Adding Inhibition to the SSLCA

One of the original requirements deduced at the beginning of Section III was the need for inhibition. Prior works have shown the need for inhibition in an effective sparse coding system [5], [7], and Section IV-A3 of this work demonstrates this as well. While works such as that of Shapero et al. implemented inhibition by using additional hardware between each pair of neurons [14], leading to O(N²) scaling, the SSLCA is designed in a way that allows for O(N) scaling.

Algorithm 1: Process used to determine Q1 and Q2 given a stored RF of average, relative conductance Rf_stored and a matching input of average, relative intensity Rf_input.
Input: Rf_stored, the average, relative conductance of the stored RF (this value must be on the interval (G_min/G_max, 1]); Rf_input, the average, relative intensity of all inputs; G_min, the minimum conductance of a crossbar device; G_max, the maximum conductance of a crossbar device; K_max, the proportion of time spent at Vcc for an input signal spiking at its maximum rate; N, the number of inputs to the network.
Output: Q1, Q2.
(The calculation assumes that the stored RF consists entirely of elements at G_max or G_min, and that the input pattern matches the stored RF but with a scaled intensity; from these assumptions it determines the portion of inputs spiking at maximum intensity and, from it, Q1 and Q2.)
Algorithm 2: Recommended process for selecting V_fire, Q1, and Q2, required for the calculation of C from Eq. (6).
Input: Rf_avg, the desired average, relative conductance of a stored RF (this value must be on the interval (G_min/G_max, 1]); Rf_least, the smallest average, relative input intensity that is expected to trigger an output spike; G_min, the minimum conductance of a crossbar device; G_max, the maximum conductance of a crossbar device; K_max, the proportion of time spent at Vcc for an input signal spiking at its maximum rate; N, the number of inputs to the network.
begin
V_fire ← thresholded maximum voltage from Eq. (5), with Q1, Q2 from Algorithm 1 applied to Rf_stored = Rf_avg, Rf_input = Rf_least, other parameters matching;
Q1, Q2 ← Q1, Q2 from Algorithm 1 applied to Rf_stored = Rf_avg, Rf_input = Rf_avg.
end

Instead of additional hardware, a percentage of the SSLCA's running time is dedicated to calculating inhibitory forces. Whenever an output spike is generated, current is passed from the corresponding column back through the SSLCA's crossbar and is used to charge capacitors in the Row Headers. Intuitively, the charges on these capacitors indicate how well represented the corresponding input signal is in the current reconstruction; overrepresented input signals will be suppressed. This is implemented through the Row and Column Headers shown in Fig. 3.
The Column Header for the Inhibited SSLCA is identical to that of the Uninhibited SSLCA: a standard LIF neuron setup, with the state capacitor connected directly (through a transmission gate) to the nanowire crossbar rather than being buffered. A crude Schmitt trigger setup ensures that all output capacitors drain and adequate inhibition charge flows back through the crossbar with each output spike. The Row Header is more complicated, but the important aspect is that a capacitor storing the inhibition state discharges whenever an input spike arrives, and charges whenever an output spike occurs. The capacitor is charged through the crossbar junctions; the resistor for discharging the capacitor is the one in the labeled Inhibition Logic Module, and is referred to as R_inhib. The stored inhibition state, when above Vcc/2, prevents input spikes from reaching the crossbar. Vcc/2 is chosen as it maximizes the linearity of the inhibitory response, since both charging and discharging occur at the same point on the exponential function (Eq. (7)).

Fig. 3: Row and Column Header circuits for the inhibited SSLCA. The Row Header's responsibilities are to stop input spikes from reaching the crossbar when they are inhibited, and to keep track of the current state of the inhibitory forces. An Inhibition Logic Module is diagrammed as broken out from the main circuit for space reasons. The CHARGE port is responsible for sinking current from the crossbar when an output spike has occurred, and in turn charges the capacitor in the Inhibition Logic Module, which prevents subsequent spikes from applying a voltage on the crossbar. After enough input spikes occur, the capacitor becomes sufficiently drained to apply voltage to the crossbar once more. The Column Header is much simpler and uses a transmission gate to direct current to and from the neuron's state capacitor. When any neuron fires, the capacitor is drained, and Vcc is applied to the crossbar in the same column. A simple RC circuit, cleaned up by several NOT gates, is responsible for the output spike.
Calibrating this architecture requires specifying both the capacitor, C_inhib, and the resistor, R_inhib, in the Inhibition Logic Module (Fig. 3). Ideally, a sparse coding algorithm should produce a stable, one-hot response to an input that exactly matches any of the stored RFs, and should combine several outputs when representing inputs that do not match a stored RF exactly. For simplicity, we focused on tuning the inhibitory components of the network to an input that matches the stored conductance of an RF, similarly to Algorithms 1 and 2. Additionally, we make room for inhibition in the spike cycle by using a neuron capacitance of C_cb = f(C), where C is from Eq. (6). Using R_cb as the equivalent resistance of the memristive devices used to charge the inhibitory force, and R_inhib as the resistance in the Inhibition Logic Module, we can write a few equations to describe the inhibition voltage V_i for a specific input i both before an output spiking event (V_i,pre) and after an output spiking event (V_i,post):
where t_spike is the duration of an output spike, K_i is the portion of the time that the input being tracked is active, and V_i,0 is the voltage after an output spike. For a stable system with a uniform firing rate, V_i,0 = V_i,post, and we are left with:
With the inhibition voltage after a spike defined, one issue remains: so long as V_i,0 > Vcc/2, the desired t_fire will no longer match t_fire without inhibition. There is always a period of time during which input spikes are inhibited, inflating t_fire. We label this period of time as t_inhib, and rewrite t_fire as t_inhib + t_collect, allowing Eq. (8) to be rewritten and a second equation for V_i,0 to be written by integrating backwards to V_i,0 from Vcc/2. These two equations are then combined into a single equality, the solution of which indicates adequate values for R_inhib and C_inhib:
Unfortunately, this formulation leaves two new variables, t_inhib and t_collect. Additionally, were we to use the original capacitance calculated in Eq. (6), we would miss the desired t_fire due to the added time for inhibition. To solve all of these problems, we set C_cb = f(C) = C/2. t_collect is then solved for using the new neuron capacitance C_cb and the Q1, Q2 from Algorithm 1 with Rf_stored = Rf_input = Rf_avg. t_inhib is then solved for by subtracting t_collect from t_fire. The remaining variable, R_inhib, is solved for by taking the log of both sides of the above equality (Eq. (9)), squaring the result, and minimizing the resulting function via Python's scipy.optimize.minimize, ensuring a near-zero result [26].

Examples of the results from Algorithm 2 plus the inhibition transformations (C_cb = f(C)) can be seen in Table I. Notably, N = 192 corresponds to an input image of dimension 8 × 8 × 3, while N = 48 corresponds to an input dimension of 4 × 4 × 3. Row 1 is similar to the settings used in most of our experiments. While C_cb = 1200 fF is significant, this number could be greatly reduced by future memristive technologies with greater resistance (rows 2 and 3). Rows 4 and 5 highlight that a higher ratio of G_max to G_min results in a higher V_fire, which would help to overcome the comparator's input offset voltage and would allow the algorithm to better represent zero weights in each neuron's RF; both of these would increase the algorithm's effectiveness. A lower G_max may be artificially imposed on the network if the circuit designer wants less capacitance and is willing to sacrifice some of the accuracy that comes from a high G_max to G_min ratio. Rows 6 through 8 demonstrate the effects of fewer inputs, and of varying Rf_avg, the expected average stored RF in the network. As written, Rf_avg is also treated as the average input to the network; for very low G_max to G_min ratios, this might not make sense, and the actual average input value should be added as a separate input to Algorithm 2 and used in the final calculation of Q1 and Q2 to correct for the difference.
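The numerical step that finds R_inhib can be sketched as below. The two sides of Eq. (9) are not reproduced in this text, so lhs() and rhs() here are stand-in placeholder expressions with illustrative component values (our own assumptions); only the log-square-minimize pattern follows the procedure described above.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder constants and expressions standing in for the two sides of Eq. (9);
# in the actual design these are the pre-/post-spike inhibition voltages written
# in terms of R_inhib, C_inhib, R_cb, t_spike, t_inhib, and t_collect.
C_inhib, R_cb, t_spike, t_collect = 100e-15, 20e3, 0.2e-9, 0.4e-9

def lhs(R_inhib):
    return np.exp(-t_collect / (R_inhib * C_inhib))

def rhs(R_inhib):
    return 0.5 * (1 + np.exp(-t_spike / (R_cb * C_inhib)))

def residual(x):
    R = abs(x[0])
    # Take the log of both sides, square the difference, and minimize (as in the text).
    return (np.log(lhs(R)) - np.log(rhs(R))) ** 2

result = minimize(residual, x0=[10e3], method="Nelder-Mead")
R_inhib = abs(result.x[0])   # near-zero residual indicates a consistent solution
```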
To validate this network design, we investigated parameters other than Rf_avg for Rf_stored and Rf_input when optimizing the inhibitory response. Figure 4 demonstrates the results: while different combinations require different values of R_inhib to be completely accurate, choosing a single, median value works well in practice.

Fig. 4: Values of R_inhib needed to achieve the desired spike rate with different Rf_stored and Rf_input values (the spike rate scales with the ratio of Rf_input to Rf_stored). Ideally, the resulting plot would be flat, indicating that a single value of R_inhib is sufficient for all cases. Since it is not flat, areas requiring a larger R_inhib than the chosen value will spike slower than expected, and areas requiring a smaller value will spike faster. In practice, we use Rf_avg for both (the blue line); receptive fields with a high stored value and a low input will under-spike, which should not be an issue as those regions should be better covered by another neuron.
C. Training
The networks we used were trained using the ADADELTA algorithm in tandem with Oja's rule across two epochs of the training data, an identical approach to our prior work [7], [11], [27]. Briefly, Oja's rule states that a neuron's receptive field changes proportionally to its output activity multiplied by the difference between the original input and the LCA's reconstruction. When considering the reconstructions for Oja's rule, we used the ratio of the conductance of each memristive device to G_max, the maximum expected conductance of a crossbar device. This approach limited the minimum representation of each input element in an RF to the on/off ratio of the memristive device. For our experiments, we used the Yang et al. device, which featured an on/off ratio of around 0.25 at 0.7 V [15]. We also tried training without this limitation (allowing the learned weight to drop all the way to 0, even though the device conductance would be set to 0.25), but did not find such a change to impact accuracy, although it did affect the Root-Mean-Square Error (RMSE) between the input and the reconstruction. Since the logical minimum does not affect the programmed conductance, this makes sense: the resulting sparse code is unchanged. The benefit of training with a non-zero minimum representable value is that the training could be done using only the memristive crossbar, without supplemental memory. Homeostasis was used during training to encourage the network to use all available neurons, similar to prior work by Querlioz et al. [28]. If a neuron had not produced an output spike after several patches, V_fire was lowered for that neuron to encourage it to spike. This behavior was disabled when evaluating accuracy and RMSE.
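The Oja's-rule update used during training can be sketched as follows: the change is proportional to each neuron's output activity times the residual between the input and its reconstruction, with weights kept within the representable conductance range. This is a minimal sketch; variable names and the fixed learning rate are ours, and the ADADELTA scheduling is omitted.

```python
import numpy as np

def oja_update(Phi, s, a, lr=0.01, g_ratio=0.25):
    """One Oja's-rule step for the dictionary Phi (inputs x neurons).

    s : (M,) input patch;  a : (N,) output spike counts (the sparse code).
    Weights are clipped to [g_ratio, 1.0], mirroring the device's on/off ratio.
    """
    s_hat = Phi @ a                        # reconstruction from the sparse code
    Phi += lr * np.outer(s - s_hat, a)     # activity-weighted residual update
    np.clip(Phi, g_ratio, 1.0, out=Phi)    # limit to representable relative conductances
    return Phi
```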
For this work, all conductances were represented as analog values. We have conducted prior research that assumed a lower resolution of conductances would be achievable [7] . Generally, currently available literature has shown that memristors might be trained within 1% of a target resistance [29] , which is much better than the 4-bit resolution needed for good performance with neuromorphic algorithms [7] , [23] .
D. Models Used for Power and Accuracy Comparisons
The SSLCA was simulated algorithmically based on the above equations and algorithms. The simulator was written as a hybrid event/time-based simulator, stepping by the maximum of the next predicted spiking event and a small window of time (2 × 10⁻¹⁸ s). Unless otherwise specified, our algorithm was configured to collect an average of 10 spikes per input, based on Fig. 5. Accuracy was computed with a Single-Layer Perceptron (SLP) network that was trained to associate the resulting sparse codes with the category that generated them. This setup is efficient to compute, but does not rival the accuracy of a state-of-the-art deep learning architecture; that is investigated in Section IV-A3.
While crossbar and capacitor power were calculated through these simulations, comparator power for the column headers was derived by simulating the 5 GHz comparator from Xu et al. 2011 at 4 GHz using a 0.7 V power supply, implemented with 45 nm CMOS transistors using the Predictive Technology Model published by Zhao et al. in 2006 [30] , [31] . It was found that, per column, this setup added 2.2 µW.
Since we used Xu et al.'s comparator at 4 GHz, we configured the networks for an average spike accumulation period (t_fire) of 0.8 ns and an output spike duration (t_spike) of 0.2 ns. Input spikes were considered with a minimum period of 0.4 ns and an active duty cycle of K_max = 0.5 unless otherwise noted.
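These timing choices directly set the architecture's nominal throughput; the short calculation below, using the 10-spikes-per-input configuration described above, reproduces the 100 MOps/s figure reported in Section IV:

```python
t_fire, t_spike = 0.8e-9, 0.2e-9     # seconds: accumulation and output-spike periods
spikes_per_input = 10                # average spikes collected per encoded patch

time_per_op = spikes_per_input * (t_fire + t_spike)   # 10 ns per sparse code
throughput = 1.0 / time_per_op                        # 1e8 Ops/s = 100 MOps/s
print(time_per_op, throughput)
```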
E. Example Code Availability
The simulation implementation used in this work was made available on Github at https://github.com/wwoods/tlab_sslca.
IV. RESULTS
The SSLCA, SLCA, and LCA were tested with two different datasets to demonstrate the relative performance of the SSLCA. Reported RMSE values were generated as though zero were representable, and accuracies were from an SLP (discussed in Section III-D). Experiments were run either 12 times, or until accuracy was determined to within ±10 % with 95 % confidence, as per [32].
To show that our assumptions and simplifications did not result in significantly worse accuracy than the algorithms from which the SSLCA was derived, all results were compared with both the LCA and Shapero et al.'s SLCA. An accuracy comparison across different numbers of spikes can be seen in Fig. 5 . The LCA implementation is from equation (3.1) of Rozell et al.'s paper [10] ; the SLCA implementation consisted of equations (5)-(7) in Shapero et al.'s paper [14] . Note that only the outputs of the SLCA network are spiking, while its inputs are constant voltages. Our work deals with both spiking inputs and outputs.
Power numbers for the LCA come from equation (13) of [14]. The throughputs of each of those architectures were several orders of magnitude lower than the SSLCA's (Fig. 1).
A. CIFAR-10
The first dataset, CIFAR-10, consisted of 60 000 32 × 32 RGB images, each containing one of 10 classes of objects [13]. For faster simulation and to demonstrate the scalability of each algorithm, these were scaled down to both 3 × 3 and 8 × 8. As the CIFAR-10 dataset contains equal numbers of each class, a simple accuracy was used to evaluate each algorithm's abilities.

Fig. 5: Comparison of the LCA [10], SLCA [14], and the SSLCA on CIFAR-10 scaled to 8 × 8; suffixes indicate completeness (2× indicates 384 neurons, while 0.5× indicates 96 neurons). While the SLCA achieves lower RMSE with significantly more spikes (around 100), for practical numbers of spikes the SSLCA produced much better results. The LCA performed better classification with fewer output neurons because it had slightly less output activity, which with a shallow classifier is more effective. A lower RMSE is more important for deep learning, seen in Fig. 12. While the LCA displayed promising power statistics for this problem, its throughput was four orders of magnitude smaller.
1) Accuracy: Compared to an optimal, analog implementation of the LCA, the SSLCA with inhibition matched performance on the 8 × 8 rescale of CIFAR-10 with a 3 % relative loss in accuracy (33 % vs. 32 %; Fig. 6). The uninhibited SSLCA always produced a worse reconstruction than its inhibited counterpart, although for low Rf_avg (and correspondingly a higher number of spikes per patch) its classification accuracy was better with the simple SLP classifier. The trained network had an average spike count of 8, even though the architecture was configured for 10 spikes. The spike duty cycles were K_in = 0.5 and K_out = 0.2.
Fig. 6: SSLCA accuracy targeting 10 spikes on CIFAR-10 rescaled to 8 × 8 with and without inhibition, compared to the LCA. Lower Rf_avg tended to produce lower RMSE due to increased activity in the resulting sparse code and an increased spike count (since the input intensity is greater than the target, spikes happen more frequently than calibrated). Inhibition ubiquitously reduced the RMSE, although with an SLP its classification accuracy was less than the uninhibited version for darker receptive fields. See Section IV-A3 for the impact of RMSE when using a deep classifier.

The performance seen on the 3 × 3 and 8 × 8 rescales is compared in Fig. 7. In both instances, the performance of the LCA is approached by the SSLCA. However, the value of Rf_avg that optimizes accuracy is not obvious based on the problem's statistics; using the dataset average works well for the 3 × 3 case, but the 8 × 8 case requires a smaller Rf_avg in order to encourage more output activity, which translates into higher accuracy. At values of Rf_avg approaching the memristive device's minimum, the algorithm breaks down. This result can be explained through Eq. (6) and Algorithm 2: a small Rf_avg results in a lower Q1 and thus a smaller C, reducing the smoothing of input spike activity and in turn producing less consistent patterns of output spikes.
Another facet investigated was how different K factors (spike duty cycles) affected the overall classification accuracy of the system. The result is shown in Fig. 8 ; generally, a higher input duty cycle K in performed better, and a lower output duty cycle K out performed better. Intuitively this makes sense: larger duty cycles for input spikes means that more spikes are expected to work together when forming a single output spike; smaller duty cycles for output spikes means more time spent collecting input spikes, and thus each output spike represents a better average of the input spikes triggering it.
Fig. 8: CIFAR-10 spike accuracy with different spike duty cycles. High duty cycles for input spikes and low duty cycles for output spikes are the most accurate, but require more power (Fig. 11).

Device variability was also considered. Previous work has shown that memristive devices can be tuned to within 1 % of a desired resistance [29]. Another work has demonstrated significant variance from one read to the next [33]. To test how the SSLCA performed with imperfect hardware, we implemented three types of conductance deviations: read deviation, write deviation using offline training, and write deviation with online training. Read deviation was re-calculated after every output spike to better simulate the time-varying nature of read randomization, and varied the effective conductance of a device uniformly by ±0 % to 80 % (a standard deviation of 0 % to 46 %). Write deviation with offline training consisted of training the model without variance, and then varying the conductance uniformly by ±0 % to 180 % (not allowed to drop below 0 S; a standard deviation of 0 % to 104 %). Write deviation with online training was applied after each application of Oja's rule, and modified the target conductances uniformly by ±0 % to 30 % (a standard deviation of 0 % to 17 %). These results can be seen in Fig. 9. Neither read variability nor offline-trained write variability was found to have a significant impact. However, for online training, write variability could be tolerated only up to 3 %. While this tolerance is above the 1 % write precision empirically demonstrated by Alibart et al. [29], prior work on imperfect updates has indicated that sensitivity to these deviations might be further mitigated with a more aggressive training regimen [7].

Fig. 9: The effects on the SSLCA of conductance variability during each read cycle (the period of time between two output spikes), during each write cycle (online training), or when a weight matrix learned offline is written to the memristive crossbar (offline training). Our results showed that unmitigated write deviations become serious for online algorithm stability after 3 %. However, using offline training or modifying the training approach as previously reported helps significantly [7].
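The deviation model described above is a uniform multiplicative perturbation of each device's conductance, re-drawn per read cycle for read deviation and applied to the written (or online-updated) weights for write deviation. A minimal sketch, with the 0 S clamp noted in the text; the random generator and array shapes are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def deviate(G, frac):
    """Perturb conductances G uniformly by +/- frac (e.g. 0.8 for +/-80 %),
    never letting a device drop below 0 S."""
    noisy = G * (1.0 + rng.uniform(-frac, frac, size=G.shape))
    return np.maximum(noisy, 0.0)

# Read deviation:            re-drawn after every output spike.
# Write deviation (offline): applied once to a trained weight matrix.
# Write deviation (online):  applied after each application of Oja's rule.
```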
2) Power: The inhibited SSLCA exhibited extremely low power consumption; at the optimal Rf_avg = 0.43, the consumption was just 2.34 pJ/input (Fig. 10) with a throughput of 100 MOps/s. Compared with prior work such as Knag et al., whose lowest energy consumption was 48 pJ/input, this was a 95 % reduction in energy consumption for 180 × the throughput during inference [22]. At their high throughput (310 MHz), the SSLCA exhibited a 99 % reduction in energy consumption with a still substantially improved 21 × throughput.

Fig. 11: Power and accuracy trade-offs on CIFAR-10 with varied K_in. The LCA power shown represents only a voltage-scaled crossbar, to directly compare spiking and non-spiking approaches. Spiking always consumed less power than a voltage-scaling approach, a combination of the dataset having a high average input and the spiking algorithm utilizing inhibition of input signals (see Fig. 10 for the effects of inhibition on power consumption).
Spiking architectures are often considered to produce power savings, though the extent of these savings has been a topic of discussion for some time [18]; we investigated that claim in Fig. 11. Except for very large duty cycles, the spiking architecture's crossbar used less power than the non-spiking, voltage-scaled crossbar. With a spiking implementation like the SSLCA, where input spikes are suppressed during an output spike, we would have expected the spiking approach to consume less power so long as the maximum input duty cycle K_in remained below the ratio of the mean squared input intensity to the mean input intensity (roughly, the average input intensity).
This is a result of average power scaling with the square of voltage versus linearly with a duty cycle. The SSLCA surpasses this expectation due to the additional input spike suppression implemented through the inhibition mechanism. Interestingly, the standard deviation for the SSLCA's power was also substantially lower, probably as a result of columns in the SSLCA not being grounded, unlike the LCA, which sinks all current into a virtual ground [7] , [15] .
3) Information Retention for Deep Learning: While an SLP might be used in practice due to the simplicity of its implementation, it does not adequately express the depth of the information contained in the input dataset. To determine how much useful information was retained by both the LCA and SSLCA encodings, we used these architectures to encode augmented CIFAR-10 input images using convolutions of different sizes. The convolved, sparse coded input images were then passed as input to a state-of-the-art deep learning architecture, DenseNet-BC, presented by Huang et al. in 2016 [34]. This network architecture consists of a number of dense blocks that each halve the scale of the input data; within each dense block are many more layers, each accepting as input all previous layers within the dense block. Using this setup with 3 dense blocks and parameters L = 190, k = 40, Huang et al. achieved 96.54 % accuracy on CIFAR-10 (with data augmentation) [34]. See Huang et al. for further details on these parameters.
We tested our architecture by dividing the input CIFAR-10 image into non-overlapping patches of S × S, and encoding each patch using either the LCA or the SSLCA. For example, S = 4 implies that the 32 × 32 CIFAR-10 image was broken into 8 × 8 non-overlapping regions of 4 × 4; each region was then sparse coded, and the resulting "image" consisting of all such encodings was passed to the DenseNet. For the 2× networks with S = 4, this means that rather than receiving each image as a 32 × 32 × 3 spatial array, we passed in an 8 × 8 × 96 spatial array. For the 0.5× networks, the spatial array passed would only have a depth of 24. The SSLCA was configured with Rf_avg = 0.45.
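A sketch of how images were repackaged for the DenseNet, assuming a generic encode() callable that returns an N-element sparse code for a flattened patch (all names here are ours):

```python
import numpy as np

def encode_image(img, S, encode):
    """Split a 32 x 32 x 3 image into non-overlapping S x S patches and
    sparse code each one, giving a (32/S) x (32/S) x N 'image' of codes."""
    grid = 32 // S
    out = None
    for r in range(grid):
        for c in range(grid):
            patch = img[r*S:(r+1)*S, c*S:(c+1)*S, :].ravel()   # S*S*3 inputs
            code = encode(patch)                               # N output activities
            if out is None:
                out = np.zeros((grid, grid, code.size))
            out[r, c] = code
    return out

# Stand-in encoder with 96 neurons, i.e. the "2x" network at S = 4 (twice the 48 patch inputs).
img = np.random.rand(32, 32, 3)
codes = encode_image(img, S=4, encode=lambda p: np.random.rand(96))
print(codes.shape)   # (8, 8, 96)
```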
In order to allow each DenseNet a similar amount of expressiveness for its classification, we parametrized the DenseNet so that the final dense block would output a 4 × 4 spatial array; the original paper's final block output an 8 × 8 array. To accomplish this, each DenseNet had a number of dense blocks B = −1 + log₂(32/S). To hold the number of computations that each DenseNet performed roughly equivalent, we chose L = 40, k = 12, and the number of filters on the initial convolution before the first dense block was k₀ = 6S²B rather than 16. The limitation of this approach is that the number of tunable parameters becomes significantly larger with larger values of S, creating a greater potential for overfitting.
Rather than 300 epochs with mini-batches of 64 samples, we used 150 epochs and mini-batches of 32 samples to train these networks. We trained using stochastic gradient descent with an initial learning rate of 0.1; after 75 epochs this was reduced to 0.01, and after 112 epochs this was further reduced to 0.001. Simulations were done with Keras; the DenseNet implementation […]. Compression factors were calculated by doubling the number of active neurons used to encode each patch (as both an index and a value need to be sent in a sparse code) and dividing by the number of input elements represented.

Fig. 14: Since MNIST has a much lower average input value than CIFAR, a bias needed to be applied to compensate for the additional current lost from the neurons back into the crossbar (Eq. (5)). Since a bias also increased the duty cycle of each input signal, more power was consumed.
For different values of S, the LCA maintained similar compression factors, due to the threshold λ being held constant with an increasing number of inputs, leading to more active outputs (Fig. 13) . In contrast, the SSLCA's sparsity comes from the number of spikes collected, which was fixed at 10 for all experiments. Thus, for larger patch sizes, more and more sparse representations were created, resulting in lower accuracy but higher compression. These parameters are all configurable and could be used to trade off between accuracy and compression, but these values were chosen as they produced roughly equivalent compression at S = 4. If the input dataset has greater covariance, then the accuracy loss would be lower for higher compression rates. Inhibition was always beneficial with a deeper classifier, and each increase in accuracy aligned with lower RMSE without exception.
B. MNIST
The second dataset, MNIST, consisted of 70 000 28 × 28 grayscale images, each containing a single, centered, handwritten digit [12] . Again for faster simulations, this dataset was scaled down to 14 × 14. The test set contains an equal number of each digit class, so a simple accuracy was tabulated for each algorithm.
1) Accuracy: The SSLCA as defined up to this point performed notably worse on MNIST than the non-spiking LCA (Fig. 14). Unlike CIFAR, which has an average input value of 0.47, MNIST has an average input value of only 0.13. Since the SSLCA was deliberately designed to include a leak current through the crossbar, this lower input intensity could not sustain neuron charge reliably, leading to more random patterns of input events being encoded in each output spike. We found that by applying a bias signal, redefining the duty cycle of each input signal from K_max k_input to K_max (bias + (1 − bias) k_input), we could remedy this problem while preserving the power gains of the SSLCA architecture. For MNIST, we found that a bias of 0.35 boosted performance from 77 % correct classification up to 84 %, versus a performance of 88 % by an optimal, analog LCA. The MNIST experiments' responses to write deviations were not found to be significantly different from the CIFAR experiments', shown in Fig. 9.
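The bias is a simple affine remapping of each input's duty cycle, keeping fully-off inputs spiking often enough to sustain neuron charge against the deliberate crossbar leak; a one-line sketch using the values from the text:

```python
def duty_cycle(k_input, k_max=0.5, bias=0.35):
    """Remap an input intensity in [0, 1] to its spiking duty cycle.
    Without the bias this is simply k_max * k_input."""
    return k_max * (bias + (1.0 - bias) * k_input)

# MNIST background pixels (k_input = 0) now spike 17.5 % of the time instead of never.
print(duty_cycle(0.0), duty_cycle(1.0))   # 0.175, 0.5
```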
2) Power: The power savings on MNIST, even with the bias, were in line with those found for CIFAR: 0.9 pJ/input at 100 MOps/s. Note that the increased power savings compared to CIFAR (which consumed 2.34 pJ/input) were due to the lower relative number of outputs to inputs: additional neurons are more expensive than additional input lines (partially due to the comparator, though mostly due to the crossbar). While the cost of additional inputs differs from the cost of additional columns, the SSLCA still demonstrates O(N) scaling in both dimensions.
As seen in Fig. 14, a non-spiking approach might consume less power on MNIST due to the low Rf_avg of the dataset. On the other hand, the power presented for the LCA does not include inhibitory logic, unlike the SSLCA; it would be difficult to include inhibition logic without closing the already-narrow margin.
V. CONCLUSION

Our work demonstrated that memristive devices with a low on:off ratio can be used for fast, low-power sparse coding, with in-situ learning, as long as their conductances can be set within ±3 %. We improved upon a previously published all-CMOS ASIC, achieving 21 × the throughput using 99 % less energy per input. The quality of the resulting sparse codes was also evaluated with a state-of-the-art deep learning network, and the codes were shown to reduce relative accuracy by only 2.4 % while compressing the input data by 79 %. These figures could be adjusted for higher accuracy and lower compression. We showed that even datasets with low input activity, such as MNIST, could be properly represented through the use of a bias. The proposed SSLCA architecture was demonstrated to be very resistant to device variations, particularly when used with offline training. Sparse coding algorithms such as the SSLCA could be used to greatly reduce communication bandwidth between visual sensors and other processing algorithms, such as deep-learning networks. This architecture has applications in robotics and self-driving cars as well as surveillance and next-generation computers.
