Pattern matching algorithms, which may be realized via associative memories, require further improvements in both accuracy and power consumption to achieve more widespread use in real-world applications. In this work we utilized a memristive crossbar to combine computation and memory in an approximate Hamming distance computing architecture for an associative memory. For classifying handwritten digits from the MNIST data-set, we showed that using the Hamming distance rather than the traditional dot product increased accuracy and decreased power consumption by 100x. Moreover, we showed that we can trade off accuracy to save additional power, or vice versa, by adjusting the input voltage. This trade-off may be adjusted for the architecture depending on its application. Our architecture consumed 200x less power than other previously proposed Hamming distance associative memory architectures, due to the use of memristive devices, and is 256x faster than prior work due to our leveraging of in-memory computation. Improved associative memories should prove useful for CPUs, handwriting recognition, DNA sequence matching, object detection, and other applications.
INTRODUCTION
Pattern matching is hampered by its high power consumption and low efficiency in many important applications, such as handwriting recognition, DNA sequence matching, object tracking, and network intrusion detection [2, 3, 10, 13]. These applications have gained popularity in recent years, partially as a result of the fabrication of the first memristor in 2008 by HP Labs [20]. The memristor is a two-terminal passive device that was first postulated by L. Chua in 1971; functionally, it is capable of storing a value which manifests as the resistance of the device [1]. Realizing the memristor in hardware aided pattern matching applications by combining the needed storage and computation elements into a single device. Memristors are generally considered good candidates for these kinds of applications because their dynamic resistance can be exploited to perform analogue operations.
Figure 1: MNIST software classification rates obtained when using the DP and the HD as the distance metrics when increasing the number of stored patterns, averaged over 10 runs. The number of test patterns used was 5000. The classification rate increased as the number of stored patterns was increased because more versions of the digits are stored, making it easier to correctly recognize the input pattern.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
A conventional topology for memristive circuits is a nanowire crossbar structure, where the memristors are located at the intersections between the horizontal and vertical nanowires. The compatibility of memristors with state-of-the-art CMOS allows for dense Resistive Crossbar Networks (RCNs) to be used alongside traditional circuitry, providing circuit blocks that both store information and can compute a number of different functions based on that information. Such networks are very useful for pattern matching applications, where a correlation between an input and previously stored data needs to be computed [5, 16, 19].
The RCN computes the Degree Of Match (DOM) between the input pattern and the stored patterns. Which DOM is achieved depends on the circuit topology around the RCN. As a result, RCNs have gained popularity in the area of pattern matching and neuromorphic applications [4, 11, 15, 16, 22]. Often, the DOM achieved is the Dot Product (DP), which may be achieved by adding an extra column to correct for bias and measuring current through the crossbar columns [22]. However, for pattern matching applications, several distance metrics other than the DP may be used, such as the Hamming Distance (HD) or the Gaussian Distance (GD) [6, 23]. These other distance metrics are sometimes more accurate in determining the best matching pattern. Our own software results showed that using the HD instead of the DP increases the classification rate on the MNIST data-set [7] from 57% to 85% (Figure 1). In this paper we present a memristive crossbar architecture that performs in-memory computation to return the stored pattern best matching an input pattern, utilizing the HD as the measure of similarity.
RELATED WORK
Architectures that calculate the HD have been used before to build associative memories. For example, Yusuke et al. designed a hierarchical multi-chip architecture of fully parallel HD associative memories [25]. Although their multi-chip structure achieved good capacity and scalability, they had the HD module as a separate block. This design suffered because the data had to be sent to the HD block to calculate the HD, increasing latency as well as resulting in a power consumption of 51.3 mW for a 64 x 32 (word size x number of words) associative memory.
In 2002, Mattausch et al. designed an associative memory architecture that was more compact than previously designed ones [9]. Although Mattausch's design was an improvement from the area perspective, it still suffered from high latency because the data had to be brought to the calculating block, and also suffered from a high power consumption of 260 mW.
Rahimi et al. [12] proposed an Approximate Associative Memristive Memory (A2M2) for energy-efficient GPUs. Their design depended on modifying the Ternary Content Addressable Memory (TCAM) cell design by Li et al. [8], replacing some of the CMOS transistors with memristive devices. By doing so, the TCAM consumed less power, and they applied the approximate computing concept to trade off accuracy for power. However, their design still featured some CMOS transistors in the TCAM cells, limiting the area-saving potential of using memristive nanodevices. In contrast, an RCN as proposed in this work can be extremely compact. Their design also had a higher operating voltage than our approach in this work; our lower operating voltage should result in additional power savings. Regrettably, a direct power comparison is impossible, as Rahimi et al. only provide energy statistics for entire systems and not for the A2M2 alone, which is the part relevant to this work.
METHODOLOGY
RCNs compute the DOM between an input and the stored patterns. Figure 2 shows a basic RCN consisting of horizontal and vertical nanowires with memristors lying at the intersection of these wires, represented by black squares. The memristor model used in our simulations was based on the equations from Yang et al. [24]. The ratio of maximum to minimum resistance for this device is approximately 3.3; more details can be found in prior work [22]. We will show that depending on the termination circuit used for columns in an RCN, the DOM computed can be either the DP or the HD.
When the input patterns are presented as input voltages to the crossbar rows, the current flowing through the memristor with conductance $g_{ij}$, where $i$ and $j$ denote the row and column number respectively, is $V_i \times g_{ij}$. Therefore, assuming the termination circuit is equivalent to a ground node, the total current flowing through column $j$ is $\sum_i V_i \times g_{ij}$.
The total current of each column represents the correlation between the input and the pattern stored in that column; thus, the best match is the pattern corresponding to the highest correlation magnitude $\sum_i V_i \times g_{ij}$. The column with the highest correlation magnitude output from the RCN would then be chosen as holding the most similar stored pattern, and that pattern would be retrieved as the closest match. Choosing the highest current can be done using a Winner Takes All (WTA) circuit. We have implemented the WTA in software for this work, as we are primarily interested in the storage and computation of the HD itself.
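The column-current sum and the software WTA described above can be sketched as a single matrix-vector product followed by an argmax. This is a minimal idealized sketch, not the SPICE model used in this work: it assumes ideal wires and perfectly grounded columns, and the names `column_currents`, `winner_takes_all`, and the on-resistance `r_on` are hypothetical. Only the 0.1 V read voltage and the resistance ratio of roughly 3.3 come from the text.

```python
import numpy as np

def column_currents(x, stored, v_read=0.1, r_on=100e3, ratio=3.3):
    """Ideal column currents I_j = sum_i V_i * g_ij with grounded columns.

    x:      binary input pattern, one bit per crossbar row
    stored: binary matrix (rows x columns), one stored pattern per column
    r_on is an illustrative assumption, not a value from the paper.
    """
    g_on = 1.0 / r_on
    # A stored "1" is a low-resistance device, a stored "0" a high-resistance one.
    g = np.where(np.asarray(stored) == 1, g_on, g_on / ratio)   # g_ij
    v = v_read * np.asarray(x, dtype=float)                     # V_i
    return v @ g                                                # one current per column

def winner_takes_all(values):
    """Software WTA: index of the column with the largest output."""
    return int(np.argmax(values))

x = np.array([1, 0, 1, 1])
stored = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0],
                   [1, 0, 0]])
currents = column_currents(x, stored)
best = winner_takes_all(currents)   # column 0 stores a copy of x
```

Note that a stored "0" still conducts some current in this model, which is why the DP architecture needs the bias correction discussed later.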
The patterns used in our simulations were taken from the MNIST data-set [7] and were scaled down to 16 x 16 to reduce the SPICE simulation time. These patterns were stored in our crossbar offline. Due to the low-voltage operation of the memristor model used, a read voltage of 0.1 V was applied for "1"s in the input pattern. This voltage was chosen based on prior results [22]. For "0"s in the input pattern, the corresponding row was grounded (0 V). The hardware was simulated in SPICE using Xyce, an open-source, SPICE-compatible, high-performance analog circuit simulator [14].
Figure 2: RCNs were used for storing patterns, and to evaluate the correlation between an input and the stored patterns. The patterns were stored in the columns of the RCN, and binary input patterns were applied to the RCN rows in the form of voltages with read voltage VR or 0 V, representing binary "1" and "0" respectively. Depending on the termination circuit used, RCNs can be used to compute either a DP or an HD correlation.
Each column in the RCN was connected to a termination circuit (Figure 2). This termination circuit determined whether the RCN performed a DP or an HD correlation evaluation.
Dot Product Evaluation
Achieving an in-memory DP evaluation from the RCN is common and was presented and discussed in several papers [4, 15, 16, 22]. The main idea is to terminate the RCN columns with Virtual Ground (VGND) modules using an inverting amplifier. A bias column, produced by passing the inputs through a column of maximally-resistive memristive devices, is then subtracted from the measured currents to force the final calculation to produce the DP exactly [17, 22].
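The effect of the bias column can be checked with a short derivation. The symbols below are introduced here for illustration and are not notation from the paper: $x_i, s_{ij} \in \{0, 1\}$ are the input and stored bits, $V_R$ is the read voltage, and $g_{on}$ and $g_{off}$ are the conductances of a device storing "1" and "0". Assuming ideal virtual grounds, $V_i = V_R x_i$ and $g_{ij} = g_{off} + (g_{on} - g_{off}) s_{ij}$, so:

```latex
\begin{align}
I_j &= \sum_i V_i \times g_{ij}
     = V_R\, g_{off} \sum_i x_i + V_R\,(g_{on} - g_{off}) \sum_i x_i s_{ij} \\
I_{bias} &= \sum_i V_i \times g_{off} = V_R\, g_{off} \sum_i x_i \\
I_j - I_{bias} &= V_R\,(g_{on} - g_{off}) \sum_i x_i s_{ij}
\end{align}
```

Under these assumptions the corrected current is proportional to the binary dot product $\sum_i x_i s_{ij}$, with a scale factor that is the same for every column.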
Note that, to calculate the DP, the VGND modules are required: without the VGND modules, sneak paths (traditionally undesired currents) would develop between columns in the RCN, making it impossible to calculate the DP exactly. It would be desirable to remove the VGND modules, as they consume significant power. This effect will be seen later in our figures demonstrating the power consumption of an HD vs DP architecture (Figure 5).
Hamming Distance
For binary pattern-matching applications, the main problem with the DP is that if either the stored or input pattern has a "0" in a row, then that row will not contribute to the measured output value. In other words, a match between two "0"s and a mismatch between a "0" and a "1" result in the same output. The HD, on the other hand, is the count of rows where the input and stored patterns do not match, regardless of which value was stored or input.
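A small software example makes the difference concrete. The sketch below is illustrative only (the helper names and the 8-bit patterns are made up for this example): an exact copy of the input and an all-ones pattern score the same DP, while the HD separates them.

```python
import numpy as np

def dot_product(x, s):
    """DP: counts rows where both the input and the stored bit are 1."""
    return int(np.dot(x, s))

def hamming_distance(x, s):
    """HD: counts rows where the input and the stored bit differ."""
    return int(np.count_nonzero(x != s))

x = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # input pattern
exact = x.copy()                          # stored pattern identical to the input
all_ones = np.ones_like(x)                # stored pattern with every bit set

dp_exact, dp_ones = dot_product(x, exact), dot_product(x, all_ones)
hd_exact, hd_ones = hamming_distance(x, exact), hamming_distance(x, all_ones)
```

Here the DP cannot tell the two stored patterns apart, because the stored "1"s facing input "0"s contribute nothing, whereas the HD counts each of them as a mismatch.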
Evaluation
When computing the HD, we require the effect of stored "1"s corresponding to input "0"s to have an influence on the total column current $\sum_i V_i \times g_{ij}$. Practically this can be achieved by inducing a voltage drop at the end of each column rather than using a virtual ground. When there is a positive voltage in each column, that voltage produces a current back through the memristive devices corresponding to rows where a "0" is being input. If these rows were storing a "1", then more current will leave through the device than if the rows were storing a "0", as illustrated in Figure 3. The result is that the voltage of a column will be notably lower if there are more "0"s in the input pattern than in the stored pattern, just like when calculating the HD. Thus, the column with the highest voltage will be the column corresponding to the best matching stored pattern.
Figure 3: This illustration demonstrates how sneak paths are leveraged to compute the HD. Only a single column's currents are shown, but each column is independent in this architecture, so the same principle applies for other columns. When the input pattern is presented to the RCN, current flows both into and out of the RCN columns. The output column current is represented by $\sum_i V_i \times g_{ij}$. However, when there is a voltage drop at the end of the column Vdl, currents will form from the column to the GND nodes provided by "0"s in the input pattern. The currents flowing through a memristor storing a "1" (thick dotted arrow) are larger than currents flowing through a "0" (thin dotted arrow) because a "0" represents high resistance and a "1" represents low resistance. This is beneficial for computing the HD, as currents through a "0" device to ground (which represents a stored "0" with an input "0") will be smaller than currents through a "1" device (stored "1" to an input "0"). Since more current flows out of a mismatch than a match, the voltage Vdl will be higher when there are fewer mismatches, which is exactly what is required to compute an HD. Thus, the voltage Vdl represents the inverse of the HD value between the input and the pattern stored in column j. The termination resistance can be omitted completely when "0" inputs are grounded, as long as any other circuit connected to the RCN has high impedance, further explained in Subsection 3.2.1.
There are several ways to induce voltage drops on the RCN columns. The simplest is terminating each column with a termination resistor. These termination resistors need to be large enough to produce a significant voltage drop, forcing the current sinking into the ground nodes provided by the input pattern to be significant enough that the measured voltage for a column storing a poor pattern match is distinguishable from that of other, better-matching columns. Otherwise, the RCN will revert to evaluating a DP correlation instead of an HD one.
On the other hand, if "0"s in the input pattern are grounded rather than treated as high impedance, as shown in Figure 3, the termination resistance can also be omitted completely. This leaves the columns as floating voltage nodes with the maximum possible voltage, which is useful for overcoming the input offset voltage of operational amplifiers in an attached WTA circuit. This constraint is further discussed in Section 3.2.2.
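The behaviour described above can be illustrated with a static nodal-analysis sketch. It is an idealization and not the SPICE setup used in this work: it assumes ideal row drivers, no wire resistance, a high-impedance readout, and fixed device conductances, so each column node obeys Kirchhoff's current law independently. The function name, the on-resistance `r_on`, and the example patterns are hypothetical; the 0.1 V read voltage, the resistance ratio of roughly 3.3, and the 800 µV comparator offset are taken from the text.

```python
import numpy as np

def column_voltages(x, stored, v_read=0.1, r_on=100e3, ratio=3.3, r_term=None):
    """Steady-state voltage of each column node.

    KCL at column j: sum_i (V_i - V_j) * g_ij = V_j * g_term
    =>  V_j = sum_i V_i * g_ij / (sum_i g_ij + g_term)
    r_term=None models an omitted termination resistor (floating column).
    """
    g_on = 1.0 / r_on
    g = np.where(np.asarray(stored) == 1, g_on, g_on / ratio)
    g_term = 0.0 if r_term is None else 1.0 / r_term
    v_rows = v_read * np.asarray(x, dtype=float)   # "1" -> v_read, "0" -> grounded
    return (v_rows @ g) / (g.sum(axis=0) + g_term)

x = np.array([1, 1, 0, 0, 1, 0, 1, 0])
stored = np.column_stack([
    x,                          # HD = 0 (exact match)
    [1, 1, 0, 0, 1, 0, 0, 1],   # HD = 2
    np.ones(8, dtype=int),      # HD = 4, same DP as the exact match
    1 - x,                      # HD = 8 (complement)
])

v_float = column_voltages(x, stored)                  # no termination resistor
v_small_r = column_voltages(x, stored, r_term=100.0)  # small termination resistor
```

In this toy example the floating-column voltages fall as the HD grows, and the exact match and the all-ones pattern are well separated. With the small termination resistor the same two columns end up only about a microvolt apart, since the column node is pulled close to ground and the output becomes DP-like.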
Resolution
Since our HD approach depends on the dynamic behaviour of the RCN, the accuracy/precision of our architecture had to be measured. An experiment was performed to determine the lowest number of differing bits at which our architecture can still successfully distinguish between two stored patterns as compared to the input. That is, we must answer the question: if the HD between two patterns is 1, will our architecture be able to distinguish between them, or will it treat them as matching patterns? This reduces to whether or not the difference between the two highest output column voltages is large enough to be differentiated by a comparator circuit. To determine this, the architecture was tested with input and stored patterns with a known HD between them, and the 800 µV input offset voltage of the TSV621 operational amplifier was used as the offset [18]. If the difference between the two highest columns' voltages was less than 800 µV, the two patterns were considered to be indistinguishable, and the pattern with the lower column index was selected.
RESULTS
Classification of Handwritten Digits
Figure 4 shows the hardware results from simulating our proposed architecture in SPICE. Shown are classification results on the MNIST data-set for the different architectures/approaches discussed in the previous section, as well as the software results (Python), for RCNs of up to 100 columns. We observe that the hardware implementations match their software equivalents in terms of classification rate. As in software, increasing the number of stored patterns also increases accuracy. This was previously explored in Figure 1. Also as seen in software, the architectures in hardware that used the HD as a distance metric outperformed the systems that used the DP (72.5% vs 52% classification rate at 100 stored patterns).
Compared to the software results, the HD architecture had a 1.5% worse classification rate on average. This was due to the input offset voltage constraint for the TSV621 comparator used for the experiments. As will be seen later, this could be remedied by increasing the voltage used for "1"s in the input pattern from 0.1 V. Increasing this voltage results in better classification performance, but consumes more power. This is explored in Figure 7.
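The offset constraint can be modelled in software as a WTA with a dead band. The sketch below follows the rule stated in the Resolution subsection (two highest columns closer than 800 µV are treated as indistinguishable and the lower column index is selected); the function name, the returned flag, and the example voltages are hypothetical.

```python
import numpy as np

def wta_with_offset(column_voltages, offset=800e-6):
    """Pick the winning column given a comparator input offset voltage.

    Returns (index, resolved). If the two highest voltages differ by less
    than `offset`, the comparator cannot separate them and the lower of the
    two column indices is returned with resolved=False.
    Expects at least two columns.
    """
    v = np.asarray(column_voltages, dtype=float)
    order = np.argsort(-v, kind="stable")        # columns sorted by descending voltage
    best, runner_up = int(order[0]), int(order[1])
    if v[best] - v[runner_up] < offset:
        return min(best, runner_up), False
    return best, True

# 0.5 mV apart: below the 800 uV offset, so the lower index wins by convention.
tie = wta_with_offset([0.0700, 0.0705, 0.0500])
# 2 mV apart: the comparator can resolve the true maximum.
clear = wta_with_offset([0.0700, 0.0720, 0.0500])
```

In the first call the true best column (index 1) is lost to the tie rule, which is one way the hardware classification rate can fall below the software result.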
The power consumption of the HD and DP architectures is compared in Figure 5. Our experiments reveal that an HD architecture is several orders of magnitude more efficient than a DP architecture. For a 256 x 100 network, the DP architecture consumed 80 mW, while the HD architecture that we are proposing only consumed 0.4 mW, thus consuming less than 1/100 of the power. This is almost entirely due to the removal of the VGND modules, which need to source a lot of current to maintain 0 V on each column.
Compared to a DP architecture, the HD architecture yielded both better accuracy and lower power consumption. These improvements are facets of the architecture and not the data-set: for an associative memory storing binary patterns, there is no benefit to using the DP architecture, and the proposed HD architecture will always have better performance with less supporting circuitry.
Figure 4: [...] with 5000 test patterns. The HD and DP RCNs were tested with and without the offset condition to calculate the percentage error between the offset and the no-offset cases. The HD architecture consistently outperformed the DP one in classification rate. Moreover, the error bars for the DP are much larger than those of the HD. Both of these are a result of the DP only accounting for the number of matching "1"s between the input and stored patterns. The HD, on the other hand, accounts for all mismatched bits.
Figure 5: Power consumption increases linearly with the RCN size because more columns are added, but each column is independent of the others. The power consumed by the HD architecture is 100x lower than that of the DP architecture. DP architectures consume significantly more power because they require VGND modules, which must match each column's current to maintain a voltage of 0 V, while the HD architecture does not.
Termination Resistance
To test our theory that the termination resistors could be omitted entirely, we ran experiments to find the change in classification rate on the MNIST handwritten digit database when a termination resistor was used and the value of the termination resistance was varied. The input voltages used here were 0.1 V and 0 V, representing "1" and "0" respectively. Figure 6 shows the classification rate and the power consumption for different values of termination resistance. The figure shows that the classification rate increases as the termination resistor value increases. This is due to larger resistances resulting in larger column voltages, forcing more sneak paths into ground nodes through mismatched "1"s and "0"s (Figure 3). This in turn creates larger voltage differentials between the columns, which are easier to differentiate in a comparator. As an added benefit, using a larger termination resistor, or omitting it entirely, saves power.
Figure 6: Classification rate and power consumption for different values of termination resistance. The classification rate increased from 51% to 72% as the resistance was increased from 100 Ω to 1 MΩ. This was due to the low voltage drop at low resistances not being significant enough to overcome the op-amp input offset voltage, which must be surpassed before the circuit can differentiate between two different columns. At 90 kΩ and higher, a high enough voltage drop was induced to have an HD-like correlation. Higher resistance is also preferable because it reduces the power consumed.
Classification Rate & Power Versus Input Voltage
Figure 7 demonstrates that the classification rate of the system increased as the input voltage was increased. This effect occurred because a higher input voltage also produces a larger differential voltage between different columns. When that differential surpasses the 800 µV input offset voltage of the TSV621, the architecture becomes capable of discerning which column's pattern is a better match to the input pattern. Once that threshold is surpassed for a Hamming distance of 1, further increase in the input voltage would not be beneficial. Unfortunately, increasing the input voltage also increases power consumption. Using 0.1 V with a 256 x 100 RCN consumes 0.4 mW, while using 0.5 V for the same RCN consumes 11 mW. Choosing an appropriate input voltage depends on the application: if we choose a lower voltage, accuracy will be diminished but power consumption will be decreased as well. A larger value will produce higher accuracy: the 72.5% accuracy from Figure 4 increases to 74% when 0.5 V is used instead of 0.1 V.
Figure 7: The classification rate for 100 stored patterns using 5000 test patterns at different input voltages. The classification rate increased with the input voltage because larger input voltages equate to larger differences between column voltages; once the difference between two columns surpasses the comparator's threshold of 800 µV, the architecture can determine which column is a better match for the input pattern. In other words, increasing the input voltage increased the accuracy and the resolution of the system. The classification rate became constant starting from 0.3 V because the problem ceases to be offset-voltage related and instead becomes data-set specific. The error bars for the classification rate fluctuated from 2-4% because this approach is sensitive to which training patterns were selected to be stored in the RCN.
Higher input voltages produced higher classification rates because the resolution at which two HDs could be distinguished also increased. That is, the minimum measurable Hamming distance between two patterns decreased. Figure 8 shows the minimum HD between two stored patterns as a function of the number of bits per pattern (the number of rows in the RCN) and the input voltage. The minimum HD increases as the number of bits per pattern increases, or when the read voltage is decreased. This increase occurs because the difference between the highest and second highest column output is less than the offset voltage (800 µV) at low voltages. Increasing the input voltage decreases the minimum measurable HD, increasing the accuracy of the system. The downside of increasing the input voltage is that this will also increase power consumption. The system was tested using the MNIST data-set to measure the classification rate and power consumption at different input voltages. Figure 7 shows the obtained results, which confirm that increasing the input voltage increased the classification rate, but also increased the power consumption quadratically. Applications that tolerate less accuracy in order to save power, such as approximate computing applications, could use low input voltages, while systems that need accuracy and are not constrained by power consumption can use higher input voltages. The input voltage can only be increased to a certain threshold value depending on the memristor model used, as increasing the input voltage over this threshold will cause destructive read operations on the RCN and will change the memristors' weights. All RCNs also have a threshold beyond which further increasing the input voltage will not help, as a single bit change can already be registered by the comparator circuit.
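The quadratic trend can be checked against the two reported operating points of the 256 x 100 RCN. As a rough estimate, assume the RCN behaves as a fixed resistive network during reads, so that power scales with the square of the read voltage $V_R$ (the symbol is introduced here for illustration):

```latex
\[
P \propto V_R^2
\quad\Rightarrow\quad
P(0.5\,\mathrm{V}) \approx P(0.1\,\mathrm{V}) \left(\frac{0.5}{0.1}\right)^2
= 0.4\,\mathrm{mW} \times 25 = 10\,\mathrm{mW}
\]
```

This estimate is close to the 11 mW reported at 0.5 V, which is consistent with the quadratic increase described above.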
Compared to other associative memory architectures utilizing the HD, our architecture consumes less power. For example, our architecture outperformed Yusuke et al. [25] and Mattausch et al. [9] in terms of both speed and power consumption. Our architecture consumed only 0.4 mW at 0.1 V input voltage for an associative memory of size 256 x 100, versus 51.3 mW for 64 x 32 and 260 mW for 32 x 128. Although these power numbers include the WTA circuit while ours does not, our network is substantially larger and our power consumption is a small enough portion of theirs that this is likely not significant. Our architecture should also provide substantially faster calculations due to the in-memory computation architecture; ours does not need to transfer data from a separate memory module to the module that calculates the HD. For example, Yusuke et al. require D cycles to find the HD of a word of D bits, while ours does the computation dynamically and requires only the time to read the memristors' values, 0.5 ns in a 2 GHz system. This results in a speed-up of 256x for a network with 256 inputs, and this speed increase scales linearly with the number of inputs.
Figure 8: The minimum measurable HD as a function of the number of bits per pattern and the read/input voltage. Increasing the number of bits decreases the HD architecture's accuracy because each added bit means another added resistive device; since this architecture statically becomes a voltage divider problem, more resistors mean that each mismatch will produce a smaller change in the column's voltage. As the difference between column voltages falls below the 800 µV threshold, the minimum measurable HD increases. However, increasing the input voltage increases the architecture's accuracy, as the difference between two columns' voltages again increases beyond the 800 µV offset voltage.
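The voltage-divider argument for the minimum measurable HD can be explored with a small idealized model. This is a sketch under stated assumptions rather than a reproduction of the SPICE results: floating columns, ideal row drivers, an alternating 1/0 input pattern, and mismatches introduced by flipping the first k stored bits are all choices made for illustration, and the function names are hypothetical. The 800 µV offset and the resistance ratio of roughly 3.3 are taken from the text.

```python
import numpy as np

def column_voltage(x, s, v_read, ratio=3.3):
    """Floating-column voltage: V = sum_i V_i * g_i / sum_i g_i."""
    g = np.where(s == 1, 1.0, 1.0 / ratio)   # relative conductances
    return v_read * np.dot(x, g) / g.sum()

def min_measurable_hd(n_bits, v_read, offset=800e-6, ratio=3.3):
    """Smallest number of flipped stored bits whose column voltage differs
    from that of an exact match by at least the comparator offset."""
    x = (np.arange(n_bits) % 2 == 0).astype(int)   # input 1, 0, 1, 0, ...
    v_ref = column_voltage(x, x, v_read, ratio)    # exact match (HD = 0)
    for k in range(1, n_bits + 1):
        s = x.copy()
        s[:k] = 1 - s[:k]                          # introduce k mismatches
        if abs(v_ref - column_voltage(x, s, v_read, ratio)) >= offset:
            return k
    return None                                    # never distinguishable

hd_256_low_v = min_measurable_hd(256, 0.1)
hd_256_high_v = min_measurable_hd(256, 0.5)
hd_64_low_v = min_measurable_hd(64, 0.1)
```

Within this model, a longer pattern needs more mismatches before the column voltage moves by 800 µV, and a higher read voltage needs fewer, which is the same qualitative trend as in the caption above. The absolute numbers depend on the assumed patterns and should not be compared with the simulated figure.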
Comparing the ability of this HD architecture to classify digits against similar approaches, our system performed almost the same as the system proposed by Woods et al. before any training [21]. For example, at 50 stored patterns our proposed system achieved approximately 63% compared to approximately 65%; at 100 stored patterns we achieved 73% compared to 68%. Although the approach in that work outperforms this architecture once trained, the approach in this work is significantly simpler, has no training step, and achieves similar or better classification rates compared to the untrained system.
CONCLUSION
We proposed an in-memory computation Hamming distance architecture for an associative memory utilizing memristors. Results from testing with the MNIST data-set showed that using the Hamming distance as a distance metric instead of the dot product achieved a much higher classification rate of 72.5% versus 52%. We also produced power savings of more than 100x versus a dot product architecture that also used memristors. Additionally, we showed that our architecture is configurable depending on the application: by increasing the input voltage to our Hamming distance architecture, its accuracy was also increased. We explored this in terms of the minimum Hamming distance for which two stored patterns could be differentiated from one another. However, increasing the input voltage also led to increased power consumption. The optimal input voltage therefore depends on the application: either accuracy or power can be optimized with the proposed architecture. This adjustment was shown for MNIST to boost accuracy from 72.5% to 74%, while increasing power from 0.4 mW to 11 mW. The power consumed by our architecture was shown to be more than 200x lower than other published architectures computing the Hamming distance, and it was also 256x faster than those architectures due to the use of in-memory computation. This work may be useful for improving the viability of pattern matching hardware for applications such as handwriting recognition, DNA sequence matching, object tracking, or network intrusion detection. As an extension of this paper, different memristive devices could be used, and device and cycle variations could be investigated to check the effects of these parameters on this HD approach.
