Abstract: This paper introduces a tri-state logic self-organizing map (bSOM) designed and implemented on a field programmable gate array (FPGA) chip. The bSOM takes binary inputs and maintains tri-state weights. A novel training rule is presented. The bSOM is well suited to FPGA implementation, trains more quickly than the original self-organizing map (SOM), and can be used in clustering and classification problems with binary input data. Two practical applications, character recognition and appearance-based object identification, are used to illustrate the performance of the implementation. The appearance-based object identification forms part of an end-to-end surveillance system implemented wholly on FPGA. In both applications, binary signatures extracted from the objects are processed by the bSOM. The system performance is compared with a traditional SOM with real-valued weights and a strictly binary weighted SOM.
Implementation and Applications of Tri-State

I. Introduction
One of the original motivations for research into neural networks is the observation that neural systems are massively parallel and can therefore potentially escape some of the inherent computational limitations of strictly serial architectures. However, most neural network research uses simulations on standard CPU architectures, and so does not address the architectural issues found in real parallel hardware. This paper introduces an architecture for self-organizing maps custom-designed for field programmable gate array (FPGA) implementation, which is designed to exploit the fine-grained parallelism of the FPGA while respecting its architectural limitations. The FPGA platform is chosen as it is reconfigurable, allowing easy custom design of each implementation, and on-chip integration with other system functions.
The original self-organizing map (SOM) proposed by Kohonen [1], [2] consists of two layers: the input and the competitive layers. It is an unsupervised neural network with competitive learning that captures the topology and probability distribution of input data, and can be used for a wide range of pattern recognition purposes, including anomaly detection, clustering, and classification [3]-[5]. In the vast majority of implementations the SOM input data and neurons are represented by real numbers (with floating-point representation), making it difficult to implement efficiently on FPGAs, which in general do not have specialized floating-point hardware, and therefore provide only inefficient implementations of real numbers.
H. Meng is with the School of Engineering and Design, Brunel University, Uxbridge UB8 3PH, U.K. (e-mail: hongying.meng@brunel.ac.uk).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2012.2197077
Weightless neural networks (WNNs) [6] are an alternative neural network architecture that directly exploits hardware capabilities (commercially available random access memory) and uses binary inputs and outputs. Instead of adjusting weights, learning is implemented by changing look-up table entries, providing very rapid training [7]. In WNNs, memory blocks play the role of the "neurons" in the system. This approach to neural networks was pioneered by Aleksander [8], [9], and has since been further developed by Austin [10], and others [11]. An N-input RAM node (RAM-based neuron) has $2^N$ memory locations addressed by an N-bit binary string. An N-bit binary input string will access only one memory location. Learning in a RAM node is accomplished by writing the desired output into the corresponding look-up table. RAM networks are taught to respond with a "1" output for those patterns in the training set and only for those patterns. Generalization is achieved by subsampling the input space with multiple RAM nodes (with cross-sampling of inputs), and aggregating the RAM-node outputs. A limitation of RAM nodes is that a "0" output may be ambiguous, indicating either lack of a corresponding training example or existence of a counterexample [7].
To overcome this ambiguity, Aleksander and Myers [12] developed the probabilistic logic node (PLN) system. The PLN node uses a tri-state scheme with three levels (0, 0.5, 1) in which the value of 0.5 means that an output of 0 or 1 can be expected with equal probability if that node is addressed. The three levels in PLN are represented using two bits. PLNs are initialized to 0.5 values; in training these are replaced with 0s or 1s. A further development is the probabilistic RAM (pRAM) model [13], whose weights are fixed-point probability estimates approximating the range [0, 1]. Similar to other RAM-based networks, an N-input pRAM node has $2^N$ memory locations. In its basic form, the pRAM comprises a number of memory locations, a comparator, and a noise generator [14].
In SOM networks the neurons in the competitive layer each have a "weight vector" that represents a position in the input space, and therefore act as "prototype" vectors. During training and execution the "winning" neuron is identified as that with the minimum distance from its prototype vector to the input vector using an appropriate distance metric, D. During execution, the winner-take-all (WTA) algorithm is used and the winning neuron stands for the input [1], [2]. During training, the prototype vectors of the winning neuron and its topological neighbors are adjusted toward the input vector so that the prototypes come to represent cluster centers. The Euclidean distance is most frequently used as the distance metric.
Although the SOM uses real data inputs and outputs, in some applications the data are either presented as a binary string, or may be conveniently recoded as such (a "binary signature"). For example, in image processing Haar filter responses are often used to produce a (long) binary signature. In this case, the real-number representation of prototypes is inefficient, and arguably inappropriate. Most proposed SOM hardware implementations have adopted a real-valued architectural model, modified to utilize the parallel nature of FPGAs [15] . Manolakos and Logaras [15] presented a parallel SOM architecture design following the systolic model, which is realized as a flexible soft IP core. Soft IP-based FPGA processor cores generally have lower performance levels and higher resource utilization [16] than hard IP cores implementing the same functions, but they are highly flexible and can be customized for a specific application with relative ease [17] . In contrast, hard processor IP cores are generally highly optimized and fine tuned, but difficult to port to other targets with equivalent performance [16] and overspecified for restricted tasks.
This paper presents a tri-state self-organizing map (the bSOM), which takes a binary input vector and maintains tri-state weights. The design is implemented as a soft IP core in Handel-C, where the number of neurons, the number of elements per input vector element, and the number of bits for data and weights are all tunable parameters, maximizing flexibility and minimizing complexity. The architecture is well suited to FPGAs, achieving very high training and execution speeds, and is easily integrated into a wider on-chip system. The architecture may be used for various pattern recognition tasks, including clustering and classification. We demonstrate its use in two applications: handwritten character recognition and moving object identification. In the latter application, the bSOM is part of a larger on-chip system that includes feature extraction from color video sequences to produce binary signatures.
Preliminary versions of this material have been presented in conference papers [18] , [19] ; this paper extends and integrates the presentation.
The remainder of this paper is divided into five sections. Section II gives an overview of hardware solutions to the implementation of SOM. This is followed by the details and training rules of the proposed bSOM in Section III. Section IV describes the FPGA realization of the proposed bSOM and Section V presents the two practical applications of the bSOM with experimental results. We conclude in Section VI with suggested future work.
II. Hardware Architectures for SOM
Hardware implementations of neural networks are essential to take full advantage of the inherent parallelism of neural networks [20]. Software simulations are useful for investigating the capabilities of neural network models and creating new algorithms, but they fall short where fast execution and training are required [21], and they become a bottleneck as problem size scales up [22]. There are two major approaches to implementing neural networks in hardware: analog and digital implementations. Digital neural networks are more popular due to their greater accuracy, flexibility, and relative insensitivity to noise [23].
FPGAs provide an appealing platform for the implementation of digital neural networks, due to their reconfigurability and consequently small nonrecurring engineering cost. Neural architectures invariably need to be "tuned" for specific applications (e.g., number of inputs); this is difficult to accommodate in a specialized neural ASIC chip, but easily handled on an FPGA. Moreover, neural networks are rarely used alone, and can be integrated directly on the chip with other system functions (e.g., video or image input, feature extraction, control functions).
However, a key limitation of FPGAs is the cost of implementing arithmetic, particularly floating-point operations, and most traditional neural networks are designed around real-valued arithmetic. This suggests that either efficient representations of real values must be used, or that the problem should be recast to use a binary representation.
A popular approach is to use fixed-point arithmetic to approximate real values. Pena and Vanegas [5] implemented a fixed-point version of the SOM on FPGA. They simplified the neighborhood function and introduced a set of new learning. Raygoza-Panduro et al. [24] presented a fixed-point SOMbased neuro-processor using a Xilinx Virtex II FPGA for the analysis and classification of tension deformation patterns of knee ligaments, capable of recognizing different sequences of movement patterns for a knee joint with damage to the anterior cruciate ligaments.
Kurdthongmee [25] presented a modified SOM implemented on FPGA, used for image quantization. They used unsigned integer arithmetic operations suitable for moderate density FPGAs. A similar implementation where the distance, neighborhood, and learning rate computation is replaced with a simplified version, was presented by Chang et al. [26] and Porrmann et al. [27] . An efficient SOM architecture based on a new frequency adaptive learning algorithm, which efficiently replaces the neighborhood adaptation function of the original SOM, was presented in [26] . The design was implemented on a Xilinx FPGA and is capable of quantizing a 512×512 pixel color image in about 1.003 s at a 35 MHz clock rate without the use of subsampling.
A design based on the universal rapid prototyping system RAPTOR2000 for the acceleration of SOM is presented in [27] . Using Xilinx FPGAs, the implementation achieves a speedup of up to 190 times (with five FPGA modules on the RAPTOR2000 system) compared to a software implementation on a state-of-the-art personal computer. A similar system implemented on a Xilinx Virtex II XC2V300, aimed at reducing the training processing time of SOM, has been presented in [28] . The design consists of 16 units in the input layer. The number of neurons in the output layer is divided into three sections: the processing unit array, the address generator, and the controller. Compared with a software implementation, the design achieves approximately 89% speedup. However, these systems still have fairly low numbers of neurons and modest speedup, reflecting the significant amount of silicon area required to deal with the fixed-point arithmetic.
Recognizing these issues, Yamakawa et al. [4] proposed a binary weighted vector SOM based on FPGA. The proposed SOM used binary data for both input and weight vectors. The Hamming distance is used as the distance metric between input and weight vectors. However, as their input data actually consist of integers, the weight vector was updated with priority given to the most significant bit (MSB), thus attempting to utilize a hybrid scheme that treats the weights as a direct representation of integer values in some functions, and as binary strings in others. This produces some peculiarities (e.g., in treating the least-significant and most-significant bits equally in the Hamming-distance calculation). Nonetheless, the implementation was five times faster than the real number weighted SOM in software and 140 times faster in hardware, and achieved comparable results [4]. This highlights a key principle: the most successful design will take account of the nature of the hardware architecture, as demonstrated by Austin's [22] ability to implement a fast system on low-cost digital hardware.
III. Tri-State SOM
This paper introduces a tri-state SOM (the bSOM), which combines concepts from the traditional SOM [1], [2] with the tri-state logic pioneered in the PLN. The bSOM has the same essential structure as a standard SOM (an input layer and a competitive layer) and is capable of the same wide range of applications as the SOM. The bSOM takes a binary vector input and maintains tri-state prototype vectors ("weights") with {0, 1, #} as the possible values. We use # to represent a "don't care" state (signifying that the corresponding input vector bit is matched whether it is set or clear). The resulting architecture implements very efficiently on FPGA, and the additional logic state significantly improves performance compared to a strictly binary architecture. In comparison with WNNs, the weight vectors have the same length as the input binary vector, whereas a WNN uses $2^N$ memory locations per logic node; moreover, there is no need to subsample the input space and combine outputs in a pyramid structure, so the input part of the architecture is relatively simple.
One of the functions of standard SOMs [1], [2] is to reflect topological information prevalent in high-dimensional input data in the organization of the 1-D or 2-D map of neurons [29]. Each neuron in an SOM has a topological neighborhood, typically 1-D or 2-D and of a defined shape (e.g., circle, square, or hexagon in two dimensions), with the size of the region specified by a "radius" parameter r. For ease of hardware implementation, we have used 1-D neighborhoods in the bSOM. Given a binary input vector $x = (x_1, x_2, \ldots, x_n)$, all the units in the competitive layer are "connected" by corresponding prototype vectors, $w_j = (w_{j1}, w_{j2}, \ldots, w_{jn})$. The bSOM training algorithm is discussed below, and compared and contrasted with the original SOM algorithm [1], [2] and Yamakawa's [4] implementation.
In contrast to Yamakawa [4], we assume that the input is strictly binary, and we use a tri-state weight vector. We use a specialized distance metric, and a specialized probabilistic update rule during training, both of which are necessary to reflect our tri-state weight structure. In contrast, in [4] the basic Hamming distance is used as a distance metric, despite its unsuitability for binary representation of fixed-point integer inputs, and the weight $w_j(t + 1)$ is updated by
.N] for N-bit input vector, with ⊗ representing the exclusive OR operation, but with priority given to the MSB to reflect the integer encoding.
A. Distance Computation
We used a modified version of the Hamming distance to compare input to prototype vectors, as shown in (1), for an input vector x and weight vector w
where $\bar{x}_i$ and $\bar{w}_{ji}$ are the bit inverses of $x_i$ and $w_{ji}$, respectively. This equation implies that any input bit value "matches" a "#" in the prototype vector. A consequence of this is that prototype vectors may effectively represent a region rather than a point or, viewed alternatively, may be selective to distance in some dimensions while ignoring others. This is a powerful feature of the approach. We may think of tri-state prototypes as corresponding to schemata in Holland's genetic algorithm [30], and so we refer to the modified distance metric as the Schema distance, D.
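As an illustration of the matching behavior described above, the following minimal Python sketch counts only those positions where a prototype holds a definite 0 or 1 that disagrees with the input, so that a "#" matches either input value. The string encoding over the characters 0, 1, and # and the function name `schema_distance` are assumptions made for this example; the sketch follows the prose description rather than the bit-level form of (1).

```python
def schema_distance(x: str, w: str) -> int:
    """Schema distance between a binary input x and a tri-state prototype w.

    Assumed encoding (illustrative only): x is a string over '0'/'1',
    w is a string over '0'/'1'/'#', where '#' is the "don't care" state.
    """
    if len(x) != len(w):
        raise ValueError("input and prototype must have the same length")
    # A position contributes 1 only when the prototype bit is definite
    # (not '#') and differs from the input bit.
    return sum(1 for xi, wi in zip(x, w) if wi != "#" and wi != xi)


if __name__ == "__main__":
    print(schema_distance("1011", "1#01"))  # one definite bit disagrees
    print(schema_distance("1011", "####"))  # all-# prototype matches anything
```

A prototype consisting entirely of "#" is at distance zero from every input, which is why the number of "#" symbols is a natural tie-break criterion.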
B. WTA
The unit with the smallest Schema distance to the input is defined as the winning neuron. We use the #-count (number of #s in the weight string) as a tie-break when the Schema distances of multiple neurons to the input vector are the same; the winner is the neuron with the lowest #-count. This implies that we prioritize prototypes with a more specific representation.
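The winner selection with its #-count tie-break can be sketched in a few lines of Python. The string encoding and both function names are assumptions for illustration. The text does not say how a tie in both distance and #-count is resolved; this sketch falls back to the lowest neuron index, which is an arbitrary choice.

```python
def schema_distance(x: str, w: str) -> int:
    """Count definite prototype bits that disagree with the input ('#' matches all)."""
    return sum(1 for xi, wi in zip(x, w) if wi != "#" and wi != xi)


def winner(x: str, prototypes: list[str]) -> int:
    """Return the index of the winning neuron for input x.

    Primary key: smallest Schema distance.
    Tie-break:   smallest number of '#' symbols (most specific prototype).
    Any remaining tie goes to the lowest index (assumption of this sketch).
    """
    if not prototypes:
        raise ValueError("at least one prototype is required")
    return min(
        range(len(prototypes)),
        key=lambda j: (schema_distance(x, prototypes[j]), prototypes[j].count("#")),
    )


if __name__ == "__main__":
    protos = ["1#1#", "101#", "0000"]
    # The first two prototypes both have distance 0; the second has fewer '#'.
    print(winner("1011", protos))
```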
C. Neighborhood Selection and Weight Update
As in the original SOM and in [4] , a neighborhood N of neurons around the winning neuron w is selected and updated; the size of the neighborhood progressively decreases. We use a probabilistic update rule as follows.
1) A bit in the weight vector is only updated if it is different from its corresponding input vector bit.
2) An update probability is used for each iteration during training. This value decreases linearly as training progresses.
3) A bit is updated by changing its value from 1 to #, 0 to #, or # to (0 or 1) depending on the input bit value.
The behavior of an individual bit can be modeled as a Markov chain with a conditional Markov transition matrix ($T$). Fig. 1(a) illustrates the case where the probability that a particular bit is set, when that neuron wins, is 0.5. If the probability of applying the conditional Markov transition matrix is given as $p = 1 - \alpha$ (where $\alpha$ is the update rate), the resulting effective Markov transition matrix ($T_e$) for a bit to change is as shown in Fig. 1(b). If $T$ is a regular transition matrix, then as $n$ approaches infinity, $T^n \to S$, where $S$ is a matrix with constant vectors, as shown in Fig. 2. The illustrated transition matrix settles after the 12th iteration. This supports the observation that the bSOM requires few iterations to converge, as compared to the original SOM and that presented in [4].
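The three-step update rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the hardware logic: prototypes are strings over 0, 1, and #, a "#" counts as different from any input bit for the purposes of step 1, and the start and end values of the linearly decreasing update probability (`p_start`, `p_end`) are hypothetical parameters, since the text does not give them.

```python
import random


def linear_probability(t: int, total: int, p_start: float, p_end: float) -> float:
    """Update probability at iteration t (0-based), decreasing linearly.

    p_start and p_end are hypothetical parameters of this sketch.
    """
    span = max(total - 1, 1)
    return p_start + (p_end - p_start) * t / span


def update_prototype(x: str, w: str, p_update: float, rng: random.Random) -> str:
    """Apply the probabilistic tri-state update of prototype w toward input x."""
    out = []
    for xi, wi in zip(x, w):
        # Step 1: only bits that differ from the input are candidates.
        # Step 2: each candidate is updated with probability p_update.
        if wi != xi and rng.random() < p_update:
            # Step 3: a definite bit relaxes to '#'; a '#' commits to the input bit.
            out.append(xi if wi == "#" else "#")
        else:
            out.append(wi)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    p = linear_probability(t=0, total=50, p_start=0.5, p_end=0.0)
    print(update_prototype("1011", "1#00", p, rng))
```

Note that a mismatched definite bit needs two successful updates to flip (for example 0 to # and then # to 1), which gives the prototype an intermediate "undecided" state between the two binary values.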
IV. Tri-State SOM on FPGA Architecture
The most critical aspect of any hardware design is the selection of an architecture that provides the most efficient and effective implementation [26]. The specifications of the circuit implemented on FPGA are given in Table I, with the corresponding block diagram in Fig. 3. The circuitry is made up of five basic blocks: the weight initialization, pattern input, WTA, neighborhood update, and display blocks. The circuitry is parameterized by the input bit width, N, and requires only a simple reconfiguration for a different design. Three of the five blocks run in parallel: the pattern input, WTA, and display (output) blocks. The weight initialization block is triggered only at startup. Similarly, the neighborhood update block is triggered when a winning node is identified for an input binary vector. Details of the five basic blocks are presented in the following sections.
A. Weight Initialization Block
This block is used to randomly initialize all the weight vectors in the network. All the neurons in the network are initialized in parallel bit-by-bit; hence, it takes as many clock cycles as there are bits in the binary input vector to complete the initialization. The hardware architecture presented here has been tested with binary image characters of size 28 × 28, totalling N = 784 bits (and also with binary signatures from moving objects with N = 768 bits). The sizes of the input and weight vectors are all set to N bits and can easily be altered for any input size. The presented implementation takes exactly N clock cycles to completely initialize all the neurons.
B. Pattern Input Block
This block is used to acquire the binary input vector (or binary image) from an external source. The size of the input vector, N, is preconfigured and the input is complete when a total of N bits is read from the input source. This binary datum is stored in the input vector and then passed onto the schema distance computation unit for further processing.
C. WTA Block
This block is made up of two parts, the distance computation unit and the winning neuron unit. The distance computation unit is used to compute the Schema distance between the input binary vector and all neurons in the bSOM. The Schema distance between the input vector x and a neuron w j , as shown in (1) is a bitwise operation, and hence takes as many clock cycles as there are bits in the input vector. Since the Schema distances for all the neurons are computed in parallel, it takes exactly N clock cycles to complete the distance computations for all the neurons in the network.
The winning neuron unit uses the results from the Schema distance computed in the distance computation unit to identify the winning neuron. The design, as shown in Fig. 4 , uses a tree-structured series of comparators to select the minimum of a pair of two inputs. For an implementation with 40 neurons, the design takes exactly seven clock cycles to compute the node with the minimum Schema distance.
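The tree-structured minimum search can be modeled in software as a pairwise reduction. The Python sketch below is a behavioral model only, with assumed names; it returns the index of the minimum distance and the number of comparator levels traversed. It does not model the #-count tie-break or any pipeline registers, so the level count it reports is not expected to equal the clock-cycle figure quoted above for the hardware.

```python
def tree_argmin(distances: list[int]) -> tuple[int, int]:
    """Pairwise tree reduction returning (index of minimum, comparator levels).

    Ties are resolved in favor of the left input of each comparator
    (an assumption of this sketch).
    """
    if not distances:
        raise ValueError("at least one distance is required")
    level = list(enumerate(distances))  # (neuron index, distance) pairs
    levels = 0
    while len(level) > 1:
        nxt = []
        # Compare neighbors two at a time.
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            nxt.append(a if a[1] <= b[1] else b)
        # An unpaired last entry passes through to the next level unchanged.
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0][0], levels


if __name__ == "__main__":
    dists = [17, 4, 9, 4, 30]
    print(tree_argmin(dists))
```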
D. Neighborhood Update Block
This block is used to select the neighborhood of the winning neuron and to update the neurons in the specified region. The size of the neighborhood reduces as training progresses. In the hardware implementation the neighborhood size is initialized to 4, and decrements every I/4 iterations until it reaches a minimum of 1, where I is the total number of iterations. The update requires a random number generator, which is complex to implement in hardware and computationally expensive. To avoid these costs, a look-up table (LUT) with 2000 randomly generated numbers has been implemented on the FPGA. For a mismatched bit between the input vector and the neuron to be updated, one of the 2000 values is selected using the iteration count. If the number of iterations exceeds 2000, the last 10 bits of the iteration count are used to address the random number in the LUT. Mismatching bits in the neuron vector are updated, as discussed in Section III-C. A # is implemented as binary "10."
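The neighborhood schedule and the LUT addressing described above can be sketched in Python as follows. Several details are assumptions of this sketch and are not stated in the text: the iteration count is taken to be 0-based, the neighborhood size is interpreted as a radius on either side of the winner, and the 1-D neighborhood is clipped at the ends of the map rather than wrapped.

```python
def neighborhood_size(t: int, total_iterations: int) -> int:
    """Start at 4 and decrement every total_iterations/4 iterations, minimum 1."""
    step = max(total_iterations // 4, 1)
    return max(1, 4 - t // step)


def neighborhood(winner: int, radius: int, n_neurons: int) -> list[int]:
    """Indices of neurons within `radius` of the winner on a 1-D map (clipped)."""
    lo = max(0, winner - radius)
    hi = min(n_neurons, winner + radius + 1)
    return list(range(lo, hi))


def lut_index(t: int, lut_size: int = 2000) -> int:
    """Address into the random-number LUT for iteration t.

    Below lut_size the iteration count is used directly; beyond that only
    its 10 least-significant bits are used, giving an address in 0..1023.
    """
    return t if t < lut_size else t & 0x3FF


if __name__ == "__main__":
    for t in (0, 25, 50, 75, 99):
        print(t, neighborhood_size(t, 100))
    print(neighborhood(winner=1, radius=4, n_neurons=40))
    print(lut_index(3000))
```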
E. Output Display Blocks
The output display block displays the neurons (weights) as an image on an external video graphics array (VGA) for visual verification. It runs in parallel with the input and WTA blocks, at the refresh rate for the VGA used (typically 60 Hz).
F. Implementation Platform
The bSOM architecture discussed here has been implemented on a Xilinx Virtex-4 FPGA chip (XC4VLX160) with approximately 152 064 logic cells and embedded RAM totalling 5184 kb. The design and verification were accomplished using the Handel-C high-level descriptive language. Compilation and simulation were achieved using the Agility DK design suite. Synthesis (the translation of abstract high-level code into a gate-level netlist) was accomplished using Xilinx ISE tools.
G. Training Speed
To compare the training speeds of the bSOM and the cSOM (the original real-valued SOM of Kohonen) on the FPGA architecture, a simplified version of the cSOM has been implemented on the Xilinx Virtex-4 FPGA. In the simplified version of the cSOM, the Manhattan distance is used instead of the Euclidean distance. Also, to accommodate the fine-grained learning in the cSOM, 8 bits are used to represent values ranging from 0 to 1 in a fixed-point format. The bSOM design can be clocked at 40 MHz and the cSOM design at 25 MHz. The cSOM implementation takes three times as many clock cycles as the bSOM, due to the intermediate arithmetic operations required for updating the 8-bit fixed-point memory locations. At 25 MHz the cSOM can be trained with approximately 10 000 patterns per second, whereas at 40 MHz the bSOM can be trained with approximately 25 000 patterns per second, a 2.5-fold improvement over the training time of the cSOM on FPGA. The clock frequencies of 40 MHz and 25 MHz also cover the logic for controlling the external VGA display and the camera. These are the most stable clock frequencies observed in the actual hardware test for the two implementations; the frequencies could be much higher without the requirement to interface these devices. Table II gives the details of the resource utilization of the two FPGA implementations for the 784-bit character recognition problem.
V. Applications and Experimental Results
The performance of the bSOM has been verified using two practical applications: handwritten character recognition, and moving object identification. To verify the performance of the bSOM, the MNIST database of handwritten digits [31] , sample shown in Fig. 5 , was used to test the implementation both in PC simulations and on the FPGA hardware architecture. A comparison on the PC between the original SOM as presented by Kohonen in [1] and [2] (herein referred to as the cSOM), a strictly binary SOM (BSOM), and the proposed tri-state SOM (bSOM) algorithms is also given in this section. Although the bSOM is meant for hardware implementation, it has been implemented on a PC using MATLAB to enable comparison with the original SOM. To illustrate the importance of the tri-state (0, 1, #) rather than binary (0, 1) representation, the BSOM version uses the Hamming distance metric, but otherwise is implemented with the same parameters as the bSOM.
A. Handwritten Character Recognition
To illustrate the comparative performance of the bSOM in cluster analysis and topological ordering, we have tested the system on the MNIST handwritten character dataset [31]. To evaluate how effective the clustering is, the neurons were visually inspected after training and a label (0-9) was assigned to each neuron. A labeled independent test set (10 000 numeric characters) was then used to test the classification accuracy of the three hand-labeled SOMs.
The software-based simulation of the bSOM was run on a PC with a general-purpose processor clocked at 2.8 GHz and 2 GB of SDRAM. Initial experiments were conducted to empirically select control parameters (number of neurons, neighborhood size, and learning rate) for all three models, and to determine the number of neurons required to represent all 60 000 patterns in the dataset (see Fig. 5). Table III illustrates the influence of the different parameters on the cSOM and bSOM performance. Although the bSOM performs better than the cSOM, there is significant improvement in performance for the cSOM as the number of iterations increases. Increasing the number of neurons in the network increases the performance of both the cSOM and bSOM, with some of the neurons left unused for larger networks.
Experiments were conducted with the number of neurons ranging from 10 to 100 in steps of 10. The bSOM results improve with increasing numbers of neurons until performance plateaus at 80 neurons (with minimal improvement thereafter). The initial neighborhood size (4) was determined using the cSOM, and adopted for the other implementations.
After empirically selecting parameter values, tests were conducted to compare the convergence of the bSOM, cSOM, and BSOM at 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, and 500 iterations. The experiment was repeated ten times at each iteration count with the exception of 500, which (due to computational load) was repeated only five times. In the case of the original SOM, repetitions did not make any significant difference to the results, whereas the results of the strict binary and tri-state SOM showed some variability. Fig. 6 illustrates the results. The BSOM has markedly inferior performance. The cSOM and bSOM appear to have similar performance at high iteration counts; however, the bSOM performs better at low iteration counts and plateaus around 50 iterations; increasing the number of iterations beyond this point does not make a significant difference. The cSOM appears to plateau after 700 iterations (not illustrated) with a performance level of 89%; it took approximately 50 h to complete one training run at this number of iterations, and we did not repeat the experiment.
Samples of the resulting topological maps with 100 neurons in each network (after 100 iterations) are illustrated in Fig. 7 for cSOM, BSOM, and bSOM. We note that the specific assignment of neurons to patterns is not significant, although the 1-D topological ordering is reflected (neighborhood runs across rows in a scan-line fashion) in the clustering. The cSOM captures ambiguities well (appearing as "blurring" of the pattern in some nodes). The bSOM can also achieve this to some extent, whereas the BSOM reflects only specific patterns. It is the ability of the bSOM to capture at least some level of ambiguity that distinguishes it from the BSOM in terms of performance. Table IV (left-hand side) shows the average performance level from Fig. 6 numerically. Although in this experiment the SOMs have been used for clustering and a post-hoc analysis of correct classification conducted, for comparison we list the performance of various classifiers on this dataset, as presented in the literature. Reported accuracy levels are 88% using a linear classifier (1-layer NN) [31] , 84% using sparse distributed memory [32] , 94% using support vector machine, and 99.5% using a two-stage pattern recognition architecture using feature extraction (a large convolutional neural network with unsupervised pretraining) [33] .
The performance measure in Table IV is the object-level classification measure. Table V gives the pixel-level accuracy measure on the MNIST dataset at four different iteration counts for the three implementations. This is the pixel-level comparison of the 10 000 test data and the neurons that represent them. Percentage correct classification (PCC), a widely used method for assessing a classifier's performance, is also given in Table V, where "Iter." is the number of iterations.
B. Target Identification
Our second implementation illustrates the bSOM as a component of a surveillance system. The system, fully implemented on FPGA, analyzes real-time videos, applies background differencing [34], segments multiple objects, and tracks them. The tracking and segmentation modules yield individual objects, each represented using a bounding box. The bSOM is used to perform appearance-based target identification: a number of known objects (individuals) are learned by the system, and during tracking the bSOM is used to identify which object(s) is/are in view.
The objects are represented using a simple binary signature, extracted from the color histograms; this is frequently sufficient to identify individual objects from a reasonably small set. However, the approach generalizes to more sophisticated feature extraction techniques. A 768-bin histogram is generated, with 256 bins for each of the RGB color components. To convert this into a binary signature the average bin frequency, $\mu_{bin}$, is computed; any bin with a value greater than or equal to $\mu_{bin}$ is represented as binary 1, and as 0 otherwise (see Fig. 8).
A binary feature vector (binary signature), $x = \{x_1, x_2, \ldots, x_N\}$ for $N = 768$, is generated as follows:
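The thresholding step described above can be sketched in Python. The function name, the representation of an object as a list of (R, G, B) tuples with 8-bit channel values, and the ordering of the three 256-bin blocks (R, then G, then B) are assumptions made for this illustration.

```python
def binary_signature(pixels: list[tuple[int, int, int]]) -> list[int]:
    """Build a 768-bit signature from the RGB pixels of a segmented object.

    Each channel contributes a 256-bin histogram; a bin is set to 1 when its
    count is greater than or equal to the average bin frequency, else 0.
    """
    if not pixels:
        raise ValueError("object must contain at least one pixel")
    hist = [0] * 768
    for r, g, b in pixels:
        hist[r] += 1          # bins 0..255:   red channel (assumed order)
        hist[256 + g] += 1    # bins 256..511: green channel
        hist[512 + b] += 1    # bins 512..767: blue channel
    mu_bin = sum(hist) / len(hist)  # average bin frequency
    return [1 if h >= mu_bin else 0 for h in hist]


if __name__ == "__main__":
    # A uniformly colored object sets exactly one bin per channel.
    sig = binary_signature([(10, 20, 30)] * 768)
    print(sum(sig), sig[10], sig[256 + 20], sig[512 + 30])
```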
To test the short-term recognition of the bSOM with signatures extracted from the color histogram, a limited number of objects (nine people) have been used to train a fixed-size bSOM. The bSOM is trained using binary signatures collected from all moving objects in a training sequence. The number of unique objects that appear in the scene determines the number of neurons required in the bSOM. Ideally, the number of neurons should be the same as the total number of unique objects. However, due to partial occlusion, camera jitter, over-segmentation, and under-segmentation, the appearance and hence the histogram for an object may vary from frame to frame so that each individual is represented by multiple nodes.
Fig. 9. Processing of objects for identification. The tracking system detects objects and constructs a bounding box. A binary signature is extracted from the color histogram of each object, and fed to the bSOM for identification.
After training the network with binary signatures extracted from 2248 manually labeled objects, we use a win-frequency-based algorithm to automatically label nodes for object identification. For each node, we count how many of the training patterns for which it "wins" the competition correspond to each known object. The node is then labeled with the object it wins most frequently. To test the performance, 1139 manually labeled, independent test patterns are used. During the testing phase, the winning neuron is identified. If the minimum Schema distance exceeds a threshold value set during training, the object is classified as unknown; otherwise, it is identified using the node label.
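The labeling and identification procedure can be sketched as below. This is a hedged illustration only: the names `distance`, `winner`, `label_nodes`, and `identify` are hypothetical, tri-state weights are modeled as 0, 1, or a don't-care symbol `"#"`, and the distance used here (a count of mismatches over committed weights, ignoring don't-care positions) is an assumption standing in for the distance measure defined earlier in the paper.

```python
from collections import Counter, defaultdict

DONT_CARE = "#"  # assumed encoding of the third weight state


def distance(weights, x):
    """Count positions where a committed weight (0 or 1) disagrees with the input bit."""
    return sum(1 for w, bit in zip(weights, x) if w != DONT_CARE and w != bit)


def winner(nodes, x):
    """Return (index, distance) of the node closest to input x."""
    dists = [distance(w, x) for w in nodes]
    best = min(range(len(nodes)), key=dists.__getitem__)
    return best, dists[best]


def label_nodes(nodes, train_x, train_y):
    """Label each node with the object it wins most often on the training set."""
    wins = defaultdict(Counter)
    for x, y in zip(train_x, train_y):
        idx, _ = winner(nodes, x)
        wins[idx][y] += 1
    return {idx: counts.most_common(1)[0][0] for idx, counts in wins.items()}


def identify(nodes, labels, x, threshold):
    """Return the winning node's label, or "unknown" if the match is too poor."""
    idx, d = winner(nodes, x)
    if d > threshold or idx not in labels:
        return "unknown"
    return labels[idx]


# Toy usage with two 4-bit nodes and three labeled training patterns.
nodes = [[0, 0, DONT_CARE, 1], [1, 1, 1, DONT_CARE]]
labels = label_nodes(nodes, [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0]], ["A", "A", "B"])
```

Nodes that never win a training pattern receive no label in this sketch, so an input that lands on such a node is also reported as unknown; how unused nodes are handled in the actual system is not stated here.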
The object identification system has been tested with video data recorded over a period of 2 h, with a total of 18 122 frames. The video was recorded in an indoor environment, very close to the exit of a building. Typically, people enter and leave the building at the same point. The scene contains normal office furniture, which partially occludes the moving objects in some locations. There is some variation in lighting conditions, particularly around the wide transparent windows (see Fig. 9). Frames from the first 30 min, with nine different persons entering the building, were used to train the system. A tracking system is used to segment and extract the pixels of all moving objects, as shown in Fig. 9. Objects with fewer than 768 pixels are filtered out as noise, which also avoids values of $\mu_{\text{bin}}$ less than 1 in (2). Fig. 10 shows three of the nine objects used to train the bSOM. In the figure, the actual object being tracked is shown on the left, and its binary signatures over the period of time that it appears in the scene are shown on the right. The binary signatures are shown as images; each row in the right-hand image corresponds to the 768 bits of the binary signature for the object's color histogram in one frame.
Tests were conducted with the number of neurons ranging from 10 to 100 in increments of 10. For networks with more than 50 neurons, the recognition level for both the bSOM and cSOM exceeds 90%, but some neurons are never used; 40 neurons were adequate for good performance. With nine distinct objects, this corresponds to roughly four neurons per object in this environment. The average performance of the cSOM, bSOM, and BSOM using 40 neurons is presented to the right of Table IV; these figures are also illustrated in Fig. 11. The performance is consistent with the overall observations on the MNIST dataset: the bSOM and cSOM have comparable performance, the BSOM performs significantly worse, and the bSOM trains relatively quickly, although the difference is not so marked on this dataset.
C. Statistical Significance of Results
This section examines the statistical significance of the performance of the three SOM implementations (cSOM, bSOM, and BSOM) presented in Sections V-A and V-B. We used the Wilcoxon rank-sum test to determine whether there is any significant difference between the classification performance of the three algorithms. A one-tailed test was used to determine whether higher average performance by one algorithm over another was statistically significant. Table VI shows the Wilcoxon statistic (z), the p-values (asymptotic significance), and the significance for all 12 iteration counts on the MNIST dataset. The values in Table VI suggest that bSOM significantly outperforms cSOM for fewer than 80 iterations at the 5% significance level. There is no statistically significant difference in performance for 80 or more iterations. This test shows that bSOM trains more quickly than cSOM, but that ultimate performance is comparable. Table VII shows the Wilcoxon rank-sum test results for the object identification problem. As with MNIST, bSOM outperforms cSOM for smaller iteration counts, with the exception of iteration 40. However, cSOM outperforms bSOM for higher iteration counts (80-500), with the exception of iterations 100 and 200. There is no statistically significant difference at iteration 100. We conclude that bSOM trains more quickly than cSOM on this dataset, but ultimately cSOM has marginally higher performance.
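For readers unfamiliar with the test, a one-tailed Wilcoxon rank-sum comparison can be sketched as follows. This is an illustrative sketch using the large-sample normal approximation without a tie correction; the function name `rank_sum_test` and the accuracy values in the usage example are hypothetical and are not taken from Tables VI or VII.

```python
import math


def rank_sum_test(a, b):
    """One-tailed Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p) for the alternative that values in `a` tend to be larger
    than values in `b`. Tied values receive their average rank; no tie
    correction is applied to the variance.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    n1, n2 = len(a), len(b)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_a - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability of N(0, 1)
    return z, p


# Hypothetical accuracies from repeated runs of two algorithms.
acc_first = [0.91, 0.92, 0.93, 0.94, 0.95]
acc_second = [0.81, 0.82, 0.83, 0.84, 0.85]
z, p = rank_sum_test(acc_first, acc_second)
significant = p < 0.05  # 5% significance level, as used in the text
```

With samples this small an exact test would normally be preferred; the normal approximation is used here only to keep the sketch short.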
VI. Conclusion
We presented a new neural network architecture, the tri-state self-organizing map, which is suitable for clustering, anomaly detection, and classification. By utilizing the concept of tri-state logic, originally presented in weightless neural networks, we can produce an efficient system that has comparable performance to a traditional real-valued SOM in handling binary input data (in contrast to a simple binary system using the Hamming distance), with significantly greater computational efficiency. The bSOM is particularly well suited to an FPGA platform, trains in fewer iterations than the original SOM, and has a much smaller implementation footprint. We demonstrated the potential use of the bSOM in handwritten character recognition, and in security surveillance systems as an object identification system using binary signatures extracted from color histograms. The work presented here forms part of an end-to-end surveillance system fully implemented on FPGA.
In the future, we will demonstrate further applications of the tri-state SOM, integration with more sophisticated binary signature extraction algorithms, and integrated self-optimization, including online learning.
