ABSTRACT We present the design and the performance of a hierarchical associative memory (AM) based on phase locking of coupled oscillators used for pattern recognition. The use of coupled oscillators, rather than Boolean logic, provides for implementations using emerging nanotechnology, such as magnetic spin-torque oscillators and resonant body transistor oscillators, which have the potential of lower energy and higher density than CMOS solutions. We develop a model for the general behavior of weakly coupled nonlinear oscillators that perform pattern matching using a simulation of coupled CMOS ring oscillators. We derive a simple analytic model for their phase locking behavior and use this reduced model in a hierarchical AM for image recognition tasks, such as identifying handwritten numbers.
I. INTRODUCTION

COMPUTING systems based on Boolean logic can be viewed as fabrics of combinational operators interspersed with restoring amplification and temporal storage. In this model, as implemented in CMOS, the transient state of logical operations is captured in the charge on the capacitive structures and wires, while static state is captured in cross-coupled, or active, feedback structures. However, the power/performance limitations of end-of-life CMOS are encouraging researchers to investigate new technologies for computation [25]. Unfortunately, to date, investigators have had little success in identifying technologies that can compete with CMOS using charge-based Boolean logic [1]. This leads us to rethink the use of both charge-based state and Boolean logic for the application of emerging nanotechnologies.
Some emerging technology devices have special nonlinear responses in either the time or frequency domain that are quite different from the bistable step function of traditional CMOS devices. For example, they have more than two stable states, periodic response functions or sigmoid response functions, which can be mapped to similar natural dynamic systems, such as human neural circuitry [2] . Therefore, these devices motivate us to find opportunities for building novel architectures for cognitive tasks, such as pattern recognition or computer vision.
In this paper, we focus on using nanoscale devices that can be used to create weakly coupled oscillators. Weakly coupled nonlinear oscillators have a dynamic oscillatory behavior which is similar to a weakly coupled neural network model [3] . Recently, such oscillators have been fabricated using spin-torque oscillators (STOs) [4] , [5] , [27] , resonant body transistor oscillators [6] , [7] , or NeuroMOS [8] .
For a cluster of loosely coupled nonlinear oscillators, such as these, we can use their relative phase relationships as a representation of state. Using phase or frequency as the basis of state, we use the degree of synchronization of the oscillators, in response to sets of input values, as a primitive computational operation. For associative memory (AM) comparison operations, synchronization of the oscillators based on the degree of similarity between stored and input vectors, encoded in frequency and phase, becomes our primitive operation [9] . Based on this concept, we are developing information processing systems that utilize nanoscale oscillators to perform spatio-temporal recognition tasks. Thus, we use the temporal coherence of weakly coupled nonlinear oscillators to perform associative processing where we match input pattern vectors to a database of stored template patterns in order to perform fast pattern recognition in high-dimension feature spaces.
While the ability to perform associative processing does not provide a complete computing system, it does provide a mechanism for a broad range of pattern matching and classification tasks that are intrinsically data-parallel, providing a very fast solution to a host of useful applications.
In the rest of this paper, we first introduce the general associative processing paradigm we are using for these investigations in Section II. In supplementary material, we describe CMOS ring oscillator prototype circuits that we use to develop a model for the pattern matching behavior of coupled oscillators. We use this CMOS ring oscillator as a representative system for exploring the space of coupled oscillators, which would be realized using emerging nanodevices [26], [27]. Next, in Section III, we apply our model to the handwriting recognition task and show the performance of this model compared with one that uses a traditional Hamming distance metric. In Section IV, we propose the architecture of a hybrid oscillator/CMOS implementation of an associative processing system and discuss the need to partition the architecture to support a scalable design. In Section V, we introduce our proposed hierarchical system for performing nearest neighbor search operations and explain the algorithms and data structures we have developed to exploit the unique functionality of our design. In Section VI, we present our experiments on using this system for the handwriting recognition task and discuss the accuracy and the performance of the system. Finally, we give a summary and conclusions.
II. ASSOCIATIVE PROCESSING
Fig. 1 shows an abstract view of a content addressable AM [10]. Here, a set of associative storage words constitute a single large memory. The set of pattern vectors to be matched (e.g., handwritten digits) is first stored in the memory as template pattern vectors. We define the templates as the stored pattern vectors of an AM in this paper. Depending on the application, codes that correspond to the templates are also stored. Matching operations proceed by broadcasting (on a bus) input vectors to all words in the memory. Then, each word performs a local comparison or match operation. This local match operation is relative, generating a degree of match (DoM) between the input pattern vector and the local template stored in the word. Next, the DoM from each word is compared by a global resolution function, and the best matching result is returned.
Matching results can be output in one of three ways. First, the best matching template can be returned. This is, in effect, a restored version of the noisy input. This is called auto-associative computation. Second, the stored code associated with the template can be returned. This search for key, return value is more like a database search and is called hetero-associative processing. Finally, simply the index of the matching template can be returned, obviating the need for a code memory. Each of these modes of operation is possible depending on the application. However, in each case, the fundamental local match operation between the broadcast pattern vector and each stored template in parallel is the key function that defines the associative operation. For our investigations, the basic function is the hetero-associative model that outputs the index of the best matching templates based on the input pattern vectors.
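The broadcast, local match, and global resolution steps described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the DoM here is a simple count of agreeing elements standing in for the oscillator-based measure.

```python
from typing import List, Sequence, Tuple


def degree_of_match(x: Sequence[int], template: Sequence[int]) -> int:
    """Stand-in DoM (assumption): number of agreeing elements, higher is better."""
    return sum(1 for a, b in zip(x, template) if a == b)


def associative_lookup(x: Sequence[int],
                       templates: List[Sequence[int]],
                       codes: List[str]) -> Tuple[int, Sequence[int], str]:
    # Broadcast x to every word; each word computes its own local DoM.
    doms = [degree_of_match(x, t) for t in templates]
    # Global resolution: pick the word with the highest DoM.
    best = max(range(len(doms)), key=doms.__getitem__)
    # Index output, auto-associative output, hetero-associative output.
    return best, templates[best], codes[best]


templates = [[0, 0, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0]]
codes = ["A", "B", "C"]
index, restored, code = associative_lookup([1, 1, 0, 1], templates, codes)
```

The three return values correspond to the three output modes: the index of the winner, the restored template, and the stored code.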
Recently, Hoppensteadt has studied weakly connected networks of neural oscillators near multiple Andronov-Hopf bifurcation points [13] and proposed a canonical model for oscillatory dynamic systems. This dynamic model was proved to be able to form attractor basins at the minima of a Lyapunov energy function by adjusting the coupling matrix through the Hopfield rule. It is the synthesis of these two paradigms, associative processing neural networks and content addressable memories, that has inspired our development of associative processors based on coupled oscillators.
Given that we can develop coupled oscillator modules for computation (as discussed in the other papers in this special section), mapping phase-based operations into Boolean logic operations, such as Exclusive-OR for bitwise matching, is inefficient. Rather, Hoppensteadt and Izhikevich [14] and Hölzel and Krischer [15] have shown that the oscillators can be used to directly perform computation in terms of associative operations. The use of coupled nonlinear oscillators as computational primitives to perform comparison operations has two unique advantages. First, the oscillators have the intrinsic ability to synchronize as a function of the similarity of the input to stored vectors. Second, we can use simple analog circuits to develop a DoM without recourse to local arithmetic circuits.
This first advantage provides an analog comparison rather than what is provided by the traditional Boolean associative processors based on Exclusive-OR operations. The synchronization of coupled oscillators as an attractor basin can thus provide a higher level norm, such as the Euclidean distance [28] . More important is the fact that the oscillator's characteristic attractor basin will give us a DoM that spans all of the dimensions of a multidimensional input vector without the need for any numeric calculations to be done in CMOS support circuitry. As an example, if we look at a k-element vector comparison, in order to calculate the Hamming distance, we need to sum the k bitwise differences. In addition, for the Euclidean distance, we need to take the k-element differences, compute the squares, sum the results, and take a square root. If, on the other hand, we have k oscillators for the k vector elements, then the multidimensional attractor basin of the coupled system will do the equivalent computation in the physics. We only need to measure the resulting waveform to retrieve our distance result.
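For reference, the arithmetic that a digital implementation would need for a k-element comparison can be written out as follows. This is a plain software sketch of the two metrics named above, not a model of the oscillator physics; the function names are illustrative.

```python
import math
from typing import Sequence


def hamming_distance(x: Sequence[int], t: Sequence[int]) -> int:
    # Sum the k bitwise differences.
    return sum(a != b for a, b in zip(x, t))


def euclidean_distance(x: Sequence[float], t: Sequence[float]) -> float:
    # k differences, k squares, one sum, one square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, t)))


h = hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])
e = euclidean_distance([0.0, 3.0], [4.0, 0.0])
```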
The second advantage comes directly from the collective behavior of the coupled oscillators. Since every word of the AM is just a cluster of coupled oscillators, the signal at the common node of the cluster provides a direct indication of the state of the comparison. Thus, as shown in the supplementary material for the CMOS oscillators, we can use a simple rectifier/integrator circuit to create an analog DoM for each word/cluster, which can be easily compared with all other words, and give the best match output for the global resolution operation required for the system. This function can be performed for an N word memory with N analog comparators and a simple iterative parallel A/D scheme, where a decreasing reference voltage is compared in parallel with each match value in a binary search [16] , or N log(N ) comparators can directly compute the winner using a tournament tree.
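As a rough software analogue of the tournament-tree option for global resolution, the sketch below compares DoM values pairwise, round by round, until one word remains. The function name and the tie-breaking rule (the earlier word wins) are assumptions, and the input list is assumed to be non-empty.

```python
from typing import List, Tuple


def tournament_winner(doms: List[float]) -> Tuple[int, int]:
    """Return (index of the highest DoM, number of comparison rounds)."""
    contenders = list(range(len(doms)))
    rounds = 0
    while len(contenders) > 1:
        survivors = []
        # Pair up neighbours; each pair models one analog comparator.
        for i in range(0, len(contenders) - 1, 2):
            a, b = contenders[i], contenders[i + 1]
            survivors.append(a if doms[a] >= doms[b] else b)
        if len(contenders) % 2:
            survivors.append(contenders[-1])  # odd one out gets a bye
        contenders = survivors
        rounds += 1
    return contenders[0], rounds


winner, rounds = tournament_winner([0.2, 0.9, 0.4, 0.7, 0.1])
```

In this model the number of rounds grows with the logarithm of the number of words, which is the reason a tree is attractive for large N.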
Other papers in this special issue explore several technologies that could be used to implement the nonlinear oscillators, including analog resonant body transistors [6] , and magnetic STOs [4] . Recently, Fan et al. [29] designed an AM unit with the same structure using Spin Hall Effect STO (SHE-STO) and CMOS interface circuits. In the supplementary materials of this paper, we present a CMOS ring oscillator implementation of AM module that we use for our architecture and algorithm explorations.
III. APPLICATION INSTANCE: HANDWRITTEN DIGIT RECOGNITION
In this section, we evaluate our oscillator cluster model on a handwriting recognition problem by using the DoM between two image pattern vectors, represented by the function $f_{\mathrm{out}}$ from Sup (2) (supplementary material). As in a nearest neighbor classifier, to recognize an input image, the closest pattern vector in the data set is retrieved by comparing the DoMs between the input pattern vector and each of the stored templates.
As discussed in the supplementary material, the input pattern's element values are used to control each corresponding oscillators' frequencies, and the synchronization of the oscillators determines the DoM.
In these tests, we use the MNIST handwritten digit data set to demonstrate pattern matching because it is widely used to test the algorithms of pattern recognition and machine learning [19]-[21]. It contains 70 000 images of handwritten digits from 0 to 9 with a training set of 60 000 examples and a test set of 10 000 examples. The digits have been size-normalized and centered in a fixed-size image. The images in the MNIST data set have 28 × 28 pixels with 256 gray levels. These samples of handwritten digits are randomly selected during data splitting and uniformly distributed in the training and test sets.
Our objective is to demonstrate that the oscillator-based system is capable of nearest neighbor searching at high speed and low power consumption. Because a matching operation using a distance metric can be embedded in most algorithms (for example, as the final classifier of an object recognition algorithm fed with features), our system can improve the efficiency of most recognition tasks with the advantages mentioned in previous sections.
For our experiments, we do not use any preprocessing or feature extraction algorithm except converting the data from grayscale images into binary images, with the threshold level at 50%. Thus, each image can be viewed as a 784-bit pattern vector (28 × 28 = 784). Finally, a test image is classified as representing the same digit as the closest matching vector in the training set.
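A minimal sketch of the binarization step is given below. The text only states a 50% threshold; the exact cutoff used here (half of the maximum gray level, with strictly greater values mapped to 1) and the function name are assumptions.

```python
from typing import List


def binarize(image: List[List[int]], levels: int = 256,
             threshold: float = 0.5) -> List[int]:
    """Flatten a grayscale image row by row into a binary pattern vector."""
    cutoff = threshold * (levels - 1)  # assumed cutoff: 127.5 for 256 levels
    return [1 if pixel > cutoff else 0 for row in image for pixel in row]


small = binarize([[0, 255], [127, 128]])
full = binarize([[200] * 28 for _ in range(28)])  # one 28 x 28 image
```

A 28 × 28 image yields a vector of 784 elements.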
The purpose of binarization is for comparing the results with the ideal Hamming distance, so that the system can perform nearest neighbor search directly on these vectors. The reason we use the Hamming distance metric as the baseline for the evaluation of performance is that it is the most efficient metric to compute using Boolean CMOS logic.
A direct implementation of the nearest neighbor classifier for an AM system is to build a single large memory with multiple words that can hold all 60 000 images as 784-dimension vectors and can classify each input image by directly comparing the DoM with every stored image. After the model has stored all of the images, we input images from the test set sequentially. If the output image and the input image have the same label (digit), we say the recognition is successful. Otherwise, we call it a failure. The hit rate (or accuracy) of this simulation is computed based on the 10 000 test results from the whole test set, where accuracy is defined as the percentage of correctly recognized digits. This model represents the best performance possible using pixel-by-pixel distance to classify a binary handwritten digit image.
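The flat nearest neighbor evaluation described above can be sketched as follows, using the Hamming distance baseline on toy data. The helper names and the toy vectors are illustrative only; ties are resolved in favor of the earlier stored template, which is an assumption.

```python
from typing import List, Sequence


def hamming(x: Sequence[int], t: Sequence[int]) -> int:
    return sum(a != b for a, b in zip(x, t))


def classify(x: Sequence[int], templates: List[Sequence[int]],
             labels: List[int]) -> int:
    # Compare the input against every stored template and keep the closest.
    best = min(range(len(templates)), key=lambda j: hamming(x, templates[j]))
    return labels[best]


def accuracy(test_vectors, test_labels, templates, labels) -> float:
    # Percentage of test vectors whose retrieved label matches the true label.
    hits = sum(classify(x, templates, labels) == y
               for x, y in zip(test_vectors, test_labels))
    return 100.0 * hits / len(test_vectors)


templates = [[0, 0, 0, 0], [1, 1, 1, 1]]
labels = [0, 1]
tests = [[0, 0, 0, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
truth = [0, 1, 0]
acc = accuracy(tests, truth, templates, labels)
```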
The simulation results provide the performance of the oscillator-based system compared with the Hamming distance baseline. The DoM from the oscillators can achieve an accuracy of 94.35%, which is very close to the performance of the Hamming distance, 95.11%.
However, performing the direct search in the space of the whole training set is very inefficient, in that we need to compare the DoM from 60 000 vectors simultaneously, and this is not feasible for the design of detection circuits in the oscillator system.
A better strategy is to reduce the number of vectors in the search space. In the previous searching process, the system is required to find the best matching vector among all the training vectors and use its label as the recognition result. Given the fact that there are only ten classes to recognize in this multiclass problem, we just need ten vectors to represent ten clusters of different classes. Note that the term cluster here refers to a subset of the training data set and should not be confused with the term oscillator cluster. These ten vectors can be obtained by averaging all the image vectors of each digit or by a clustering algorithm. For example, the mean vector of all the images of handwritten digit 0 is the representation of cluster 0. Sup Fig. 7 shows the images of these vectors. During the recognition process, the input vector could be compared with the ten vectors, instead of 60 000 training vectors, and the search can proceed in a hierarchical manner. In Sections IV and V, we show the design of an architecture and algorithm to organize the search into such a structure.
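The averaging option can be illustrated with a short sketch that builds one mean vector per label and then matches an input against those representatives only. The function names are hypothetical, and the Manhattan distance is used here as a stand-in for the DoM.

```python
from typing import Dict, List, Sequence


def class_centroids(vectors: List[Sequence[float]],
                    labels: List[int]) -> Dict[int, List[float]]:
    """Mean vector of all training vectors that share a label."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for v, y in zip(vectors, labels):
        if y not in sums:
            sums[y] = [0.0] * len(v)
            counts[y] = 0
        sums[y] = [s + e for s, e in zip(sums[y], v)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def nearest_class(x: Sequence[float], centroids: Dict[int, List[float]]) -> int:
    # Assumed metric: Manhattan distance to each class representative.
    return min(centroids,
               key=lambda y: sum(abs(a - b) for a, b in zip(x, centroids[y])))


train = [[0, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]
train_labels = [0, 0, 1, 1]
reps = class_centroids(train, train_labels)
predicted = nearest_class([0, 1, 1], reps)
```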
IV. SYSTEM DESIGN: HARDWARE ARCHITECTURE
In this section, we describe a system architecture that we have designed for large and high-dimensional nearest neighbor search applications, such as the handwritten digit recognition we mentioned above. We first give a description of the hardware and then turn to the algorithmic improvements that the architecture enables.
A. ASSOCIATIVE MODULE AND PROCESSING NODE
Fig. 2 shows the architecture of a single AM module. Fig. 2 should be compared with Sup Figs. 1 and 2. However, irrespective of the implementation of the coupled oscillators, we can abstract the behavior into a block diagram, which has the matching functionality and the associated control circuitry. The module contains clusters of oscillators, each shown as a star around a summing node. Each cluster has a template memory, $T_j$, that holds a vector of $t^j_i$ values. The template memory can be either digital or analog. For digital memories, we would need D/A converters to provide the template values to the oscillator clusters for comparison. Once the templates are loaded, comparisons happen between every oscillator cluster and the input pattern vector, $X$. A DoM detector circuit feeds its output to a winner-takes-all network to choose the best match among all $j$ clusters composing the memory.
We have designed a SystemC digital simulation of this architecture to explore design tradeoffs in terms of the capabilities of the oscillator arrays and the analog and digital CMOS support circuitry necessary to implement a complete system [9] . Based on our preliminary studies as outlined in this paper and in [26] - [27] , we foresee two technology challenges to practical implementations of large-scale AMs based on coupled oscillators. The first stems from the need to couple a large number of oscillators into clusters. The second comes from the need to pick a winner among many words using analog comparisons on the DoM outputs. We discuss each of these two challenges below.
The first problem can be seen most clearly if the oscillators are coupled via proximity on a substrate, as discussed in [27] for the Spin Torque Nano-Oscillators. Then, there is a 2-D geometric constraint on placement, such that the coupling among oscillators is symmetric. For any more than k = 6 oscillators in a circle, the arc length between pairwise oscillators is less than the distance to the center (or summing node), so pairwise coupling will be dominant (i.e., r versus 2π r/k). The result of this will be that the vector match operation could have an unwanted positional bias. Even for electrical coupling, there is likely to be an upper limit, due to wiring or fabrication constraints, on the number of oscillators that can be effectively coupled into a cluster. In the CMOS implementation above, the oscillators are coupled in a star configuration (rather than a true all-to-all); however, even for that configuration, there would be an upper limit on the size of the clusters, which can be effectively coupled. Therefore, we need an architectural solution that allows us to group smaller clusters into larger words in order to solve problems of practical significance, where the size of the pattern vectors could easily be thousands of elements.
We can address this problem by grouping AM modules into a larger structure creating an AM processing node. Fig. 3 provides two approaches to achieve this. Fig. 3(a) shows small clusters developing individual DoMs, and then, the results being summed to give an overall DoM for the entire vector. As an alternative, Fig. 3(b) shows the winner-takes-all results from matches on partial words being summed digitally and then passed to a global winner-takes-all result. Either technique allows us to build AM processing nodes that can support matching of vectors of an arbitrary width.
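The two grouping schemes can be contrasted in a small sketch. It assumes each small cluster reports a count of agreeing elements as its partial DoM, which is a simplification of the analog behavior; the function names and the tie-breaking toward the earlier template are also assumptions.

```python
from typing import List, Sequence


def chunks(v: Sequence[int], size: int) -> List[Sequence[int]]:
    return [v[i:i + size] for i in range(0, len(v), size)]


def partial_dom(x: Sequence[int], t: Sequence[int]) -> int:
    return sum(1 for a, b in zip(x, t) if a == b)


def winner_sum_of_doms(x, templates, size) -> int:
    """Fig. 3(a) style: sum the partial DoMs of each word, then pick a winner."""
    totals = [sum(partial_dom(xc, tc)
                  for xc, tc in zip(chunks(x, size), chunks(t, size)))
              for t in templates]
    return max(range(len(totals)), key=totals.__getitem__)


def winner_sum_of_votes(x, templates, size) -> int:
    """Fig. 3(b) style: winner-takes-all per partial word, then count wins."""
    votes = [0] * len(templates)
    for c, xc in enumerate(chunks(x, size)):
        parts = [partial_dom(xc, chunks(t, size)[c]) for t in templates]
        votes[max(range(len(parts)), key=parts.__getitem__)] += 1
    return max(range(len(votes)), key=votes.__getitem__)


templates = [[0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]]
x = [0, 0, 0, 1, 1, 1, 1, 1]
by_sum = winner_sum_of_doms(x, templates, 4)
by_votes = winner_sum_of_votes(x, templates, 4)
```

The two schemes agree on this toy input, but they need not agree in general, since voting discards the margin by which each partial word wins.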
The second problem of scaling to a larger system size comes from the ability of the analog CMOS support circuitry to resolve a single winner among the thousands or millions of words in a large application. As can be seen in sup Fig. 2 , circuits are needed to detect the DoM for each cluster. Then, those values need to be compared. We can either convert each DoM into a digital value or perform analog comparisons. Using the digital solution has two problems. First, it requires an A/D for each word. The precision of these A/Ds will need to be high in order to be able to resolve the values between many words. Second, digital hardware will have to perform this N -way comparison. Using an analog solution, we can use a competitive timing (integration) technique, as above, but this still requires precision components and high tolerances in order to accurately resolve the contest between a large number of charging capacitors.
Unfortunately, solving this second problem via partitioning is not as straightforward as the first, word-length, problem, since the resolution among competing DoMs is dependent on both the precision and the dynamic range of the results from the entire memory array. Our solution to this problem is to use a tree of AM processing nodes and a preclustering technique that reduces the demands on the hardware. We discuss this in Section IV-B.
B. N-TREE HIERARCHICAL ARCHITECTURE
Assume there are $m$ pattern vectors, $\{p_1, p_2, \ldots, p_m\}$, as templates in the AM system. When $m$ is very large, it becomes difficult for a single AM processing node of oscillators to hold and process all the patterns.
If each AM processing node can only store $n$ patterns, while $m \gg n$, then the $m$ patterns can be clustered hierarchically into a tree structure and stored in a tree of processing nodes, as shown in Fig. 4. Thus, we have a tree of AM processing nodes, which consists of a root node and a hierarchy of children, ending with leaf nodes at the bottom of the tree. In this tree structure, every AM node has at most $n$ children nodes. Only leaf nodes (nodes that have no children) store specific patterns from the original input set, $\{p_1, p_2, \ldots, p_m\}$. The higher nodes are used to store summary information associated with different levels of pattern clusters. Each nonleaf node stores centroids of the multidimensional values of the subtrees under it. A search of a nonleaf node returns the index of the subtree whose centroid is closest to the input pattern. Thus, every search operation, to retrieve one pattern, results in a path in the tree, from the root to a leaf. During the retrieval process, the key pattern is input recursively into the AM nodes at different levels of the tree, and finally, the pattern with the highest DoM is returned.
C. VIRTUAL TREE HARDWARE DESIGN
Fig. 5 shows a virtual tree built from one AM node and a separate memory for templates and indices (tags). In this architecture, the single node takes on the role of different nodes in the tree by loading the associated template memory for that node. As an example, in one search operation, the node would first take on the root node memory and search. That search will return a tag, one of $n$ choices for the subtree containing the pattern. That tag is used as an index into the common memory, such that the node will load the templates and the associated tags for that subtree root. A second search will pick one of $n$ subsubtrees, that is, one of $n^2$ nodes on the third level of the tree. Again, the node will load the templates and tags for that subsubtree. If, in this case, it is a leaf node, then a third search will be performed and the correct pattern will be identified.
One result of the development of the virtual tree model, which will guide our future development, is the optimization of the search time in the hierarchical memory. Given our two assumptions (first, that the number of oscillators is limited, and thus, we will have only one physical node; second, that the templates will be stored in digital SRAM and converted into analog voltages for the actual comparison), we can optimize the design of the organization of the tree (in terms of search time) using a simple timing model, with the time to load one node being a function of the number of words in the node. We assume our memory is as wide as one word and we have a word-wide array of D/A converters. The height of the tree is a logarithmic function of the fan-out of each node, which again is the number of words in one node. Taking $n$ as the number of words in a node, which is also the fan-out, and $m$ as the size of the database, the total tree search time is proportional to $n \log_n m$. Due to the time it takes to load the data, it would have a time penalty of $n$ compared with a (nonpipelined) tree. The area savings, in terms of number of oscillators, would be one node compared with all the nodes in a tree with a fan-out of $n$ and a height of $h = \log_n m$. Thus, we can simply trade area for time in optimizing the architecture.
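The timing model can be explored numerically with a short sketch. It assumes one time unit per word loaded and rounds the tree height up to a whole number of levels; both choices, and the function names, are assumptions made for illustration.

```python
def tree_height(m: int, n: int) -> int:
    """Smallest h with n**h >= m, i.e. log_n(m) rounded up (assumes n >= 2)."""
    h, capacity = 0, 1
    while capacity < m:
        capacity *= n
        h += 1
    return h


def virtual_tree_search_time(m: int, n: int, t_word: float = 1.0) -> float:
    # Each of the h levels requires loading n words into the single node.
    return n * tree_height(m, n) * t_word


# Relative search time for a 60 000-pattern database at a few fan-outs.
times = {n: virtual_tree_search_time(60000, n) for n in (2, 10, 100)}
```

Under these assumptions, a small fan-out gives a tall tree with cheap levels, while a large fan-out gives a shallow tree with expensive loads.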
V. SYSTEM DESIGN: SOFTWARE ARCHITECTURE AND ALGORITHM
As discussed above, part of the solution to the second scaling problem is partitioning the data into a tree structure, where the nonleaf nodes of the tree store representative centroids for the subtrees under them. The creation of these values can be thought of in terms of offline training of the memory.
A. TRAINING
Once the hierarchical structure of the AM model is fixed, the next question is how to organize the patterns into a tree structure and which features can be used for the recognition at each node? These questions determine the method of training and pattern retrieval. We employ the fact that the AM nodes always output the pattern with the highest DoM, namely, the one nearest to the input pattern in the data space. Therefore, for this paper, we use a hierarchical k-means clustering algorithm to organize the data in the tree.
Hierarchical k-means clustering is a top-down divisive clustering strategy. All patterns start in one cluster and are divided recursively into clusters by clustering subsets within the current cluster as one moves down the hierarchy. The clustering process ends when all the current subsets have no more than $k$ elements, so they are nondivisible. This algorithm generates a k-tree structure that properly matches our AM structure. The internal nodes are the centroids of each cluster in different levels and the external nodes (leaf nodes) are the exact patterns. During the clustering process, some clusters may become nondivisible earlier than others and have higher positions in the tree than the other leaf nodes.
As an example, with the eight patterns $p_1$ to $p_8$ given in Fig. 6(a) and $k = 2$, the first clustering generates two clusters, $\{p_1, p_2, p_3, p_4\}$ and $\{p_5, p_6, p_7, p_8\}$, with centroids $C_{11}$ and $C_{12}$. Then, these two clusters split into four subclusters, $\{p_1, p_2\}$, $\{p_3, p_4\}$, $\{p_5, p_6\}$, and $\{p_7, p_8\}$. Their centroids are $C_{21}$, $C_{22}$, $C_{23}$, and $C_{24}$. Since the number of patterns in every cluster no longer exceeds the threshold, $k$, the clustering process ends. After the centroids and patterns are written into each AM node correctly, the training process is finished.
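A compact sketch of this training procedure is shown below. It is not the authors' implementation: the k-means initialization (the first k points), the fixed iteration count, the squared Euclidean metric, and the rule that a cluster stops splitting once it holds at most k patterns are all assumptions chosen to reproduce the shape of the Fig. 6(a) example on 1-D toy data.

```python
from typing import Dict, List

Vector = List[float]


def dist2(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(points: List[Vector], k: int, iters: int = 20) -> List[List[Vector]]:
    centroids = [list(p) for p in points[:k]]  # deterministic start (assumption)
    groups: List[List[Vector]] = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            groups[nearest].append(p)
        centroids = [mean(g) if g else centroids[j] for j, g in enumerate(groups)]
    return [g for g in groups if g]


def build_tree(points: List[Vector], k: int) -> Dict:
    node = {"centroid": mean(points), "children": [], "patterns": []}
    groups = kmeans(points, k) if len(points) > k else []
    if len(groups) < 2:          # small or unsplittable cluster becomes a leaf
        node["patterns"] = points
    else:
        node["children"] = [build_tree(g, k) for g in groups]
    return node


pts = [[1.0], [2.0], [4.0], [5.0], [11.0], [12.0], [14.0], [15.0]]
tree = build_tree(pts, 2)
```

With these toy values, the root splits into two clusters of four patterns, and each of those splits into two leaves of two patterns.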
In the application for the MNIST data set, we use an N -tree model with a fan-out of 10. Instead of k-means clustering, the child nodes of the root node (first layer) are generated by classifying all the training vectors into ten clusters according to their label, as above. Each node contains the vectors with the same label and its centroid is the mean vector of the cluster. Then, we use hierarchical k-means clustering to generate the rest of the tree structures as we discussed above. Fig. 6(b) shows the part of the N -tree model built on the MNIST data set and gives some examples of the centroid of each node.
B. RECOGNITION
For each input pattern to be recognized, the retrieval process starts from the root node. The AM unit at the root node will output the nearest centroid as the winner. This centroid has stored with it the corresponding code (or index) for a node in the next level of the tree, and the system will repeat the recognition process until the current node is a leaf node. The final output is a stored pattern. Considering the same simple example that we mentioned above [Fig. 6(a)], assume we have a test pattern, $p_t$, which is closest to $p_5$. The retrieval process is as follows.
1) Input $p_t$ to the root node AM$_0$; the matching pattern is $C_{12}$, which points to node AM$_2$ of level 1, which represents cluster $\{p_5, p_6, p_7, p_8\}$.
2) Input $p_t$ to the level 1 node AM$_2$; the matching pattern is $C_{23}$, which points to node AM$_3$ of level 2, which represents cluster $\{p_5, p_6\}$.
3) Input $p_t$ to the level 2 node AM$_3$; the matching pattern is $p_5$, which is marked as a leaf, and searching stops.
For the N-tree structure on the MNIST data set, it is not necessary for recognition to store the digit labels of each vector in the leaf node, since the label of the result is indicated by the nodes of the first layer. Therefore, the N-tree structure solves the hardware design problem of having a large search space by representing each training vector hierarchically from obscure to explicit. However, for this solution, the retrieval performance cannot outperform the nearest neighbor search with ten clusters, because the retrieval result is actually determined at the first layer. Even though the search is from the root node to the leaf node, once the digit label is chosen at the first search, the pattern retrieval under this node will not influence the recognition result. To address this problem, we apply a search technique called branch and bound search to help us find the global nearest neighbor in the whole tree.
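The three retrieval steps above can be traced with a small sketch. The 1-D pattern values, the node names, and the use of the Manhattan distance in place of the DoM are all hypothetical; only the tree shape follows the Fig. 6(a) example.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def d(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def node(name: str, centroid: Vector, children=(), patterns=()) -> Dict:
    return {"name": name, "centroid": centroid,
            "children": list(children), "patterns": list(patterns)}


# Hypothetical 1-D values for p1..p8.
p = {i: [v] for i, v in zip(range(1, 9),
                            [1.0, 2.0, 4.0, 5.0, 11.0, 12.0, 14.0, 15.0])}
root = node("root", [8.0], children=[
    node("C11", [3.0], children=[
        node("C21", [1.5], patterns=[p[1], p[2]]),
        node("C22", [4.5], patterns=[p[3], p[4]])]),
    node("C12", [13.0], children=[
        node("C23", [11.5], patterns=[p[5], p[6]]),
        node("C24", [14.5], patterns=[p[7], p[8]])]),
])


def retrieve(tree: Dict, x: Vector) -> Tuple[Vector, List[str]]:
    path, current = [], tree
    while current["children"]:
        # One AM search per level: follow the nearest centroid.
        current = min(current["children"], key=lambda c: d(x, c["centroid"]))
        path.append(current["name"])
    # Final search in the leaf returns a stored pattern.
    return min(current["patterns"], key=lambda q: d(x, q)), path


match, path = retrieve(root, [11.2])  # test pattern closest to p5
```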
C. BRANCH AND BOUND SEARCH
To improve the performance of the hierarchical architecture, we adopt an algorithm called branch and bound search [22] . The basic idea is to search additional nodes that possibly contain a better answer, after we finish a routine N -tree search. In other words, when a leaf node is visited at the end of the search process, we backtrack on the tree and check other nodes. The branch and bound search algorithm provides a method for generating a list of other nodes that need to be searched in order to optimize the result.
In branch and bound search, a simple rule can be used to check if a node contains a possible candidate for the nearest neighbor to the input vector. For a cluster with centroid $C_i$, we define its radius $r_i$ as the farthest distance between the centroid and any element in the cluster. We define $X$ as the input vector and $B$ as the distance between $X$ and the best nearest neighbor found so far. Then, if a cluster satisfies
$$d(X, C_i) - r_i > B$$
where $d$ is the distance function, no element in cluster $C_i$ can be closer to $X$ than the current bound. Initially, $B$ is set to be $+\infty$. This rule is shown in Fig. 7. Theoretically, the branch and bound search can be guaranteed to find the nearest neighbor of an input vector from a hierarchical k-means clustered data set, which means it should have the same recognition performance as the flat single AM model. Usually, the worse the clustering of the data set, the more nodes need to be searched in order to obtain the nearest neighbor.
However, some nodes have a large radius due to only very few data points that are far away from the centroid. The clusters in these nodes are actually quite compact and have a large distance from the input vector in the high-dimension space. Thus, they are actually not worth searching, in spite of being within the bounds by the checking rule. To speed up the search, we adjust the branch and bound search by changing the condition of the checking rule as follows:
$$d(X, C_i) - \alpha r_i > B$$
where $\alpha$ is a factor used to reduce the effective radius of the clusters, and ranges from 0 to 1. Reducing the clusters' radius can help us avoid the clusters that are less likely to have a better solution, and thus access many fewer nodes in the process. When $\alpha = 1$, the traditional branch and bound search is applied; when $\alpha = 0$, no branch and bound search is applied.
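A minimal sketch of the search with radius reduction follows. It assumes the checking rule has the form "skip cluster i when d(X, C_i) − α r_i > B", uses the Manhattan distance in place of the DoM, and visits the nearest child first; the class and function names and the toy data are illustrative.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def d(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


class Node:
    def __init__(self, members: List[Vector], children: Tuple["Node", ...] = ()):
        self.members = list(members)      # every pattern stored below this node
        self.children = list(children)
        n = len(self.members)
        self.centroid = [sum(col) / n for col in zip(*self.members)]
        self.radius = max(d(self.centroid, q) for q in self.members)


def internal(children: List[Node]) -> Node:
    return Node([q for c in children for q in c.members], tuple(children))


def bb_search(root: Node, x: Vector,
              alpha: float = 1.0) -> Tuple[Optional[Vector], float, int]:
    best, bound, visited = None, math.inf, 0
    stack = [root]
    while stack:
        current = stack.pop()
        # Assumed checking rule: prune when d(X, C_i) - alpha * r_i > B.
        if d(x, current.centroid) - alpha * current.radius > bound:
            continue
        visited += 1
        if not current.children:          # leaf: compare against stored patterns
            for q in current.members:
                if d(x, q) < bound:
                    best, bound = q, d(x, q)
        # Push the farthest child first so the nearest child is popped first.
        stack.extend(sorted(current.children,
                            key=lambda c: d(x, c.centroid), reverse=True))
    return best, bound, visited


tree = internal([Node([[0.0], [10.0]]), Node([[7.0], [8.0]])])
exact = bb_search(tree, [9.6], alpha=1.0)
pruned = bb_search(tree, [9.6], alpha=0.0)
```

In this toy case the input lies nearer to the centroid of the compact cluster, so the search with α = 0 returns 8.0 after visiting two nodes, while α = 1 backtracks into the wide cluster and finds the true nearest neighbor 10.0 at the cost of one more node.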
VI. TESTS AND PERFORMANCE
In a final test, we implemented this N -tree hierarchical AM model for the same recognition tasks on the MNIST data set as we did in the Section V-B. We apply two norms, the Manhattan distance and the DoM from the oscillator clusters on both the training and the recognition process, with the algorithms we described in this section. The generated N -tree model has five layers and 20 401 nodes, where 9329 of them are leaf nodes containing the 60 000 images. We then use branch and bound search to perform the recognition of the 10 000 test images. We compute the accuracy and the average number of nodes searched for these 10 000 cases. For traditional branch and bound search, the accuracies are 94.98% for the Manhattan distance and 94.02% for oscillator clusters, which is very close to the accuracy of the direct search in the single large AM unit (95.11% and 94.35%). During these retrieval processes, the average number of search nodes is 6275.55, and on average, 5117.24 of them are the leaf nodes. If we do not use branch and bound, only five nodes and one leaf node are searched.
We also test the effect of the radius reduction technique on the branch and bound algorithm. The results are shown in Fig. 8. In these two charts, the horizontal axis is the factor α for radius reduction, ranging from 1 to 0; the left vertical axis is the accuracy, ranging from 50% to 100%; the right vertical axis is the number of nodes for the two plots of nodes visited. There are three sets of data in these charts: 1) the accuracy; 2) the number of all the nodes visited; and 3) the number of leaf nodes visited during the search process. As these two charts demonstrate, reducing the effective radius of the clusters dramatically cuts the number of nodes that are searched, while the accuracy is not impaired until α is smaller than 0.4. For both the Manhattan distance and the oscillator clusters, the accuracy and the search speed can be traded off against each other. Reducing the radius in the checking rule of the branch and bound search helps us obtain the best balance between accuracy and search speed. According to other experiments we have done, α = 0.4 is also applicable to the branch and bound search on other image data sets, such as the FERET data set for face recognition [23].
The search speed of the system depends on the decision speed of one AM unit. In [9], the decision time for a CMOS AM unit of size 1024 is 30 ns. In a more recent work, the decision time is ∼5 ns for an STO AM unit of size 256 [29]. For searching the nearest neighbor across a large data set, the speed is also determined by the scale of system parallelism, and the efficiency is influenced by how many matching operations can be performed in parallel, as we discussed in Section IV. Our architecture design is nevertheless adaptable to different technologies and has improved efficiency for searching large databases.
In [29], the energy consumption and the speed for a single AM pattern matching operation are discussed. In their design, the AM unit is composed of an SHE-STO and the CMOS support circuits around it, including a subtractor, a DAC, and DoM circuits composed of an integrator and an analog merger. For image vectors of 256 pixels, the energy per AM operation is 259 pJ, of which 160.5 pJ is consumed by the STOs. With such an STO-based AM unit, we can estimate the energy consumption for matching one input pattern across the whole training data set of our architecture. Table 1 gives the average number of AM operations for the entire test data set, accuracies, and energy estimates for different system configurations. The N-tree architecture with the optimized B&B search algorithm balances the tradeoff between accuracy and energy consumption by reducing the number of matching operations. We do not compare the speed and power directly with complete CMOS associative memory circuits because of the numerous designs, configurations, and scaling choices for traditional CMOS.
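The arithmetic behind such an estimate can be sketched as follows. This is a back-of-the-envelope illustration, not a reproduction of Table 1: it assumes that each node visited costs exactly one AM operation and that the 259 pJ figure from [29] for 256-pixel vectors applies unchanged, neither of which is established here. The function name `search_energy_pj` is hypothetical, and the node counts are the averages reported above.

```python
# Figures quoted from [29] for a 256-pixel STO-based AM unit.
ENERGY_PER_AM_OP_PJ = 259.0  # total energy per AM operation, in pJ
STO_ENERGY_PJ = 160.5        # portion consumed by the STOs, in pJ


def search_energy_pj(am_ops: float,
                     energy_per_op_pj: float = ENERGY_PER_AM_OP_PJ) -> float:
    """Energy for one query, assuming energy scales linearly with AM operations."""
    return am_ops * energy_per_op_pj


# Fraction of each AM operation's energy that is spent in the STOs.
sto_share = STO_ENERGY_PJ / ENERGY_PER_AM_OP_PJ

# Assumption: one AM operation per node visited.
traditional_bnb_pj = search_energy_pj(6275.55)  # average nodes, traditional B&B
greedy_descent_pj = search_energy_pj(5)         # five nodes without B&B

print(f"STO share of AM energy: {sto_share:.1%}")
print(f"traditional B&B: {traditional_bnb_pj / 1e6:.3f} uJ per query")
print(f"no B&B:          {greedy_descent_pj / 1e3:.3f} nJ per query")
```

Under these assumptions, the energy per query is proportional to the number of matching operations, which is why reducing the number of nodes visited translates directly into energy savings.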
VII. CONCLUSION
We have developed a new associative architecture for information processing based on the ability of weakly coupled nonlinear oscillators to perform pattern matching in high-dimensional vector spaces. We demonstrated the abilities of the architecture on handwriting image recognition tasks where the dimensionality was 728. There are several novel concepts embodied in this design.
First, it allows for the partitioning of clusters of oscillators based on the physical properties of the oscillators and microfabrication limitations. This is important because the sizes of the input vectors are quite large, while the number of oscillators that can usefully work in a single cluster is limited due to component matching and repeatability.
Second, the repeatability and the tolerance of the oscillator circuits, as well as the ability of the CMOS interface circuits to resolve the best match among competing matches, limit the number of comparison words that can be evaluated in parallel. However, the number of vectors that need to be compared (the size of the database) is in the thousands for interesting applications. The N-tree architecture allows us to partition the words into groups based on a k-means clustering algorithm. It is important to note that, with a few exceptions [24], hierarchical AMs have been largely unsuccessful, since it is precisely their ability to do massive searches in parallel that gives AMs their performance gain over sequential and hashing techniques. This is the first hardware-software implementation of a hierarchical AM using clustering techniques that we know of and, together with a modified branch and bound strategy, it provides very fast searches with near-optimal hit-rate performance.
Third, the limitation on the expected number of oscillators that can be reliably fabricated on one substrate also gives rise to the need to virtualize the search tree. That is, only a subset of the number of nodes needed in the tree would actually be fabricated. Since the searches in the tree are processed sequentially by level (root, middle, and leaf), only one node at each level is active at a given time. Therefore, we can emulate the entire tree with only one node. 
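The virtualization idea can be sketched in software as a single reusable unit that is reloaded with the stored patterns of each node as the search descends. The sketch below is only an illustration under stated assumptions: `AMUnit` is a hypothetical stand-in for one physical AM unit, matching uses the Manhattan distance rather than oscillator dynamics, the tree is a toy dictionary, and the descent is a plain greedy one without branch and bound.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def manhattan(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


class AMUnit:
    """Hypothetical model of one physical AM unit with a fixed capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.stored: List[Vector] = []
        self.loads = 0  # how many times the unit has been reprogrammed

    def load(self, patterns: List[Vector]) -> None:
        if len(patterns) > self.capacity:
            raise ValueError("node holds more patterns than the unit can store")
        self.stored = list(patterns)
        self.loads += 1

    def match(self, x: Vector) -> int:
        """Index of the stored pattern closest to x."""
        return min(range(len(self.stored)),
                   key=lambda i: manhattan(x, self.stored[i]))


# Each entry maps a node id to (stored patterns, child ids); leaves have no children.
Tree = Dict[str, Tuple[List[Vector], Optional[List[str]]]]


def descend(tree: Tree, root_id: str, x: Vector, unit: AMUnit) -> Vector:
    """Greedy root-to-leaf descent that reuses one AM unit for every level."""
    node_id = root_id
    while True:
        patterns, children = tree[node_id]
        unit.load(patterns)  # reprogram the single unit with this node's contents
        winner = unit.match(x)
        if children is None:
            return patterns[winner]
        node_id = children[winner]


tree: Tree = {
    "root": ([(1.0, 0.0), (9.0, 0.0)], ["A", "B"]),  # centroids of A and B
    "A": ([(0.0, 0.0), (2.0, 0.0)], None),
    "B": ([(5.0, 0.0), (10.0, 0.0), (12.0, 0.0)], None),
}
unit = AMUnit(capacity=4)
result = descend(tree, "root", (4.0, 0.0), unit)
```

Here the whole tree is served by one unit that is loaded once per level visited, at the cost of a reload between levels.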
