Abstract
Introduction
New algorithms are usually developed as off-line, non-real-time implementations. For them to serve a useful purpose in industry and defense, beyond an academic setting, the state of the art in hardware architectures often has to advance so that they can be executed in real time. Such is the case with many of the underlying algorithms used in computer vision. Of specific interest here are high speed hardware architectures for real-time unsupervised data clustering.
This paper addresses the mapping of the unsupervised histogram peak-climbing clustering algorithm to a novel high speed architecture suitable for VLSI implementation and real-time performance. Specifically, this architecture exploits paradigms of massive connectivity, like those inspired by neural networks, and the parallelism and functionality integration afforded by emerging nanometer semiconductor technologies. Special attention is paid to the clustering of high dimensionality sparse data sets, like those found in the clustering of information-rich features used for color image segmentation and computer vision, and to achieving an "orders of magnitude" performance increase over current implementations on a generic compute platform.
These architectures will aid computer vision technology in delivering on its promise of real-time data processing and information generation, and are solutions of special interest to industry and defense. An example of a defense application is the automatic analysis of scenes for the recognition of targets and foes in battlefields.
Clustering algorithms can be implemented on conventional compute platforms. While these platforms offer a high degree of flexibility, they are not tuned specifically for the task of clustering, and therefore suffer from a high amount of overhead and are inherently inefficient. For instance, the clustering algorithm used in this work has been benchmarked at 172 ms on a 2.27 GHz processor with virtually unlimited memory (RAM, not disk). This instance of the algorithm clustered 961 vectors of 22 dimensions each, which is equivalent to an image resolution of 128 x 128 pixels. This by no means comes close to the levels of performance necessary for real-time video processing and higher resolutions.
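As a rough check of this gap, the benchmarked time can be compared against a per-frame budget. The target of 30 frames per second used here is an assumption for illustration; the text does not state a target rate at this point.

```latex
\frac{1}{172\ \text{ms}} \approx 5.8\ \text{frames/s},
\qquad
\frac{1}{30\ \text{frames/s}} \approx 33.3\ \text{ms per frame},
\qquad
\frac{172\ \text{ms}}{33.3\ \text{ms}} \approx 5.2
```

Under that assumption, the software benchmark exceeds the frame budget by roughly a factor of five even at 128 x 128 pixels.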
Some recent related work
Significant advances in the quality of color image segmentation results have recently been reported in the literature [1], [2]. This methodology uses high dimensionality Multispectral Random Field Texture Models [3], [4] and Color Content as features of a subimage defined by a sliding window. These features are in turn clustered using an unsupervised peak-climbing algorithm in the highly multidimensional feature space. Once the features are clustered, the clusters are mapped back to the spatial domain of the image, which yields the image segmentation.
Although some highly talented researchers have devoted great efforts to devising hardware architectures to accelerate the execution of clustering algorithms, these efforts have not fully addressed the high performance demands of real-time high quality color video processing [5], [6], [7], [8], [9], [10]. The specific problem of conceiving architectures for the very efficient unsupervised peak-climbing clustering algorithm has not been addressed either. Some of the literature refers to the same type of architectural approach [6], [7], [8], [9], while other efforts have been mostly focused on aiding the performance of Artificial Neural Networks (ANNs) [11], [12], [14], [15], [16], [17], [18] or apply to specific problem domains [19], and many of the architectures reported have been of an analog nature [11], [12], [13], [20]. While analog processing tends to necessitate fewer components for a given complex operation, the technology and the approach suffer from several drawbacks. Namely, analog VLSI technology is more expensive to manufacture and test, is less accurate than a digital implementation, suffers from serious sensitivity to noise, and its performance is very susceptible to changes in supply voltage and environmental temperature conditions. Compensating for all these susceptibilities and obtaining a robust design may require added complexity, which may hinder the reliable manufacturability of the circuitry.
Clustering algorithm
This section describes the clustering algorithm implemented in this work. Given M features f of dimensionality N to be clustered, the first step is to generate a histogram of N dimensions [21]. This histogram is generated by quantizing each dimension, with the cell size CS(k) of each dimension k computed from that dimension's minimum and maximum values and the number of quantization levels Q (Equation 1). Since the dynamic range of the vectors can be quite different in each dimension, the cell size differs from one dimension to the next. Hence the cells are hyper-boxes. This provides efficient dynamic range management of the data, which tends to enhance the quality and accuracy of the results. Next, the number of feature vectors falling in each hyper-box is counted, and this count is associated with the respective hyper-box, creating the required histogram.
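The histogram step can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [21]: it assumes CS(k) = (max_k - min_k) / Q and a bin index of floor((f(k) - min_k) / CS(k)), clamps the maximum value into the last cell, and stores only occupied cells, which suits sparse data. The function names are hypothetical.

```python
from collections import Counter


def cell_sizes(features, q):
    """Return per-dimension minima and cell sizes.

    Assumes CS(k) = (max_k - min_k) / Q, which is an assumed form of the
    quantization, not one confirmed by the text.
    """
    dims = range(len(features[0]))
    mins = [min(f[k] for f in features) for k in dims]
    maxs = [max(f[k] for f in features) for k in dims]
    sizes = [(maxs[k] - mins[k]) / q for k in dims]
    return mins, sizes


def bin_index(f, mins, sizes, q):
    """Quantize one feature vector into a tuple of per-dimension cell indexes."""
    idx = []
    for x, lo, cs in zip(f, mins, sizes):
        i = int((x - lo) / cs) if cs > 0 else 0
        idx.append(min(i, q - 1))  # the maximum value falls in the last cell
    return tuple(idx)


def build_histogram(features, q):
    """Count the feature vectors per hyper-box; only occupied cells are stored."""
    mins, sizes = cell_sizes(features, q)
    return Counter(bin_index(f, mins, sizes, q) for f in features)
```

Because each dimension gets its own cell size, two dimensions with very different dynamic ranges are both split into Q cells, which is what makes the cells hyper-boxes rather than hyper-cubes.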
After the histogram is generated in the feature space, a peak-climbing clustering approach is utilized to group the features into distinct clusters. This is done by locating the peaks of the histogram. Figure 1 illustrates this peak-climbing approach for a two-dimensional example. The number in each cell (hyper-box) represents a hypothetical count of the feature vectors captured by that cell. By examining the counts of the 8-neighbors of a particular cell, a link is established between that cell and the closest cell having the largest count in the neighborhood. At the end of the link assignment, each cell is linked to at most one parent cell, but can be the parent of more than one cell. A peak is defined as a cell with the largest density in its neighborhood, i.e. a cell with no parent. A peak and all the cells that are linked to it are taken as a distinct cluster representing a mode in the histogram. Once the clusters are found, they can be mapped back to the original data domain from which the features were extracted. Features grouped in the same cluster are tagged as belonging to the same category.
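The linking rule can be illustrated with a short Python sketch over a sparse histogram stored as a dictionary from cell index tuples to counts. Several details are assumptions made for illustration: a neighbor must have a strictly larger count to become a parent, ties between equally dense neighbors are broken by squared offset distance, and neighbors are enumerated exhaustively, which is only practical for low-dimensional examples such as the two-dimensional case of Figure 1.

```python
from itertools import product


def link_parents(hist):
    """Map each occupied cell to its parent cell, or to None if it is a peak."""
    n = len(next(iter(hist)))
    # All neighbor offsets in {-1, 0, 1}^n except the cell itself (8 in 2-D).
    offsets = [o for o in product((-1, 0, 1), repeat=n) if any(o)]
    parents = {}
    for cell, count in hist.items():
        best, best_key = None, None
        for off in offsets:
            nb = tuple(c + d for c, d in zip(cell, off))
            nb_count = hist.get(nb, 0)
            if nb_count <= count:
                continue  # assumed rule: only a strictly denser cell can be a parent
            # Prefer the largest count, then the closest neighbor.
            key = (-nb_count, sum(d * d for d in off))
            if best_key is None or key < best_key:
                best, best_key = nb, key
        parents[cell] = best
    return parents


def assign_clusters(hist):
    """Follow the links uphill and label every cell with the id of its peak."""
    parents = link_parents(hist)
    peak_ids = {}
    labels = {}
    for cell in hist:
        root = cell
        while parents[root] is not None:
            root = parents[root]
        labels[cell] = peak_ids.setdefault(root, len(peak_ids))
    return labels
```

For example, with counts `{(0, 0): 1, (0, 1): 3, (1, 1): 5, (5, 5): 2, (5, 6): 4}`, the first three cells climb to the peak at `(1, 1)` and the last two to the peak at `(5, 6)`, giving two clusters.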
In the clustering algorithm described in [21], the input consists of the address of a data tensor in memory with dimensions J_V x J_H x N, containing Multispectral Simultaneous Autoregressive (MSAR) Random Field model features from a video frame, and the number of quantization levels to be used, which applies to all dimensions of the feature space. J_V denotes the number of windows used to extract the MSAR model features in the vertical direction of the video frame, and J_H is the number of windows in the horizontal direction. For convenience, let us denote the total number of feature vectors as J = J_V x J_H. The input is fixed point data normalized to the interval [-1, +1) over the entire J_V x J_H x N data set. The output of the architecture is the address of a J_V x J_H matrix containing the clusters in the video frame space, and the number of resulting clusters. Figure 2 shows the different steps of this implementation of the clustering algorithm and the overall architecture. The chosen architecture follows a globally systolic partition. The steps are: computing the minimum and maximum values of each dimension of the feature vectors; finding the cell size CS(k) for each dimension k; creating the histogram or assigning bin indexes to the data vectors; allocating the vectors to bin numbers; linking the bins; assigning the clusters; grouping the clusters in parallel with determining the number of clusters; and mapping the clusters back to the video frame spatial domain from the multidimensional MSAR feature space.
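The final step, mapping clusters back to the frame, amounts to a table lookup followed by a reshape. The sketch below assumes the J feature vectors are stored in row-major window order, which the text does not specify. `map_to_frame`, `j_v` and `j_h` are hypothetical names, with `j_v` and `j_h` standing for the vertical and horizontal window counts.

```python
def map_to_frame(bin_of_vector, cluster_of_bin, j_v, j_h):
    """Return a j_v x j_h matrix of cluster labels and the number of clusters.

    bin_of_vector  : list of histogram bin indexes, one per feature vector,
                     assumed to be in row-major window order.
    cluster_of_bin : dict mapping each occupied bin to its cluster label.
    """
    if len(bin_of_vector) != j_v * j_h:
        raise ValueError("expected one bin index per window")
    labels = [cluster_of_bin[b] for b in bin_of_vector]
    frame = [labels[r * j_h:(r + 1) * j_h] for r in range(j_v)]
    return frame, len(set(cluster_of_bin.values()))
```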
In order to achieve an ultra high speed implementation that supports real-time high density video, a number of design considerations apply:
1. Algorithm implementation analysis and benchmarking in C, to group the major steps into phases that can borrow budgeted time from each other.
2. Parallelization of steps whenever possible, as in the computation of the maximum and minimum values over the data set in each dimension.
3. Register bank architectures that maximize parallel data access.
4. Dual access to pipeline register banks, and a large global memory structure for redundant-access storage of data through the different stages of the computation.
Architectural details
This section presents the architectural details of this high speed data clustering processor. In all figures, the Processing Element (PE) being discussed is bounded by dashed lines.
Figure 4 shows the details of the PE that computes the cell size CS(k) for each dimension. N PEs are instantiated in parallel, one for each dimension. Because of the high dimensionality of Random Field models, the number of quantization levels in each dimension necessary for effective and efficient clustering is very small: Q = 3 … 8. This allows the division operation of Equation 1 to be implemented as a multiplication by the inverse of Q stored in a small look-up table (LUT).
Figure 5 shows the details of the PE that computes the histogram indexes for each data vector. N PEs are instantiated in parallel, one for each dimension, and each instantiated PE cycles J times through all the values in a given dimension of the feature vectors. Since the possible number of quantization levels has been constrained to six, the division in the Index PE can be implemented by a simple parallel restoring division algorithm that is limited to computing only the first three bits of the quotient.
Figure 6 shows the details of the PE that allocates and identifies a data vector with a given histogram bin. J instantiations of this PE are made, one for each possible bin. The purpose of the compressor is to count the number of ones from the comparators, which corresponds to the density of a given bin in the histogram. Table 1 additionally shows the specific structure of the compressor tree with full adders and half adders capped at J for J = 24882.
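The two arithmetic shortcuts described for the Cell Size and Index PEs can be modeled in software as follows. This is a behavioral sketch under stated assumptions, not the circuit of Figures 4 and 5: operands are treated as non-negative integers, the reciprocal table uses 16 fractional bits (`FRAC_BITS` is a hypothetical choice), and the three quotient bits are computed sequentially here rather than in parallel.

```python
FRAC_BITS = 16  # assumed fixed-point precision of the reciprocal table

# Small look-up table holding 1/Q in fixed point for Q = 3 ... 8.
INV_Q = {q: round((1 << FRAC_BITS) / q) for q in range(3, 9)}


def cell_size_fixed(vmax, vmin, q):
    """Approximate (vmax - vmin) / q with one multiply and one shift."""
    return ((vmax - vmin) * INV_Q[q]) >> FRAC_BITS


def restoring_div3(dividend, divisor):
    """Restoring division limited to the three most significant quotient bits.

    Exact for 0 <= dividend < 8 * divisor; larger dividends saturate at 7.
    """
    remainder, quotient = dividend, 0
    for bit in (2, 1, 0):
        trial = remainder - (divisor << bit)
        if trial >= 0:
            # The trial subtraction succeeded: keep it and set this bit.
            remainder = trial
            quotient |= 1 << bit
        # Otherwise the remainder is "restored", i.e. left unchanged.
    return quotient
```

Three quotient bits cover indexes 0 to 7, which is enough for any Q up to 8. The reciprocal multiply is exact when Q is a power of two and can be off by one unit in the last place otherwise, so a hardware design would need to choose the table precision accordingly.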
Discussion and conclusions
The rest of the micro-architecture, which establishes the links between the histogram bins and assigns the clusters so that the results can be output, follows a structure very similar to that of Figure 6. The only notable exception is that the PE uses a novel computational cell to calculate the norm between two 22-dimensional vectors. This cell is shown in Figure 7. A global clocking, control, and shared memory network ties all the modules together to form the complete architecture.
This paper describes a high performance VLSI architecture for the real-time clustering of high dimensionality data extracted from video. Processing rates suitable for DVD quality video processing at MPEG-2 frame rates can be sustained. By using a top level quasi-systolic architectural partitioning scheme and extensive connectivity at the lower levels, the performance is improved by a factor of 118 or more over what can be achieved on a generic compute platform. This architecture can be used in many military, industrial, and commercial applications that require real-time intelligent machine processing of high quality video.
