Abstract-We describe an efficient architecture for a generic object recognition system based on an ensemble classifier in a Field Programmable Gate Array (FPGA) environment. Using a bag of covariance matrices as the object descriptor improves object recognition accuracy while speeding up the learning process. We extend this technique and present its hardware architecture, together with an object classifier based on an on-line variant of random forests (RF) implemented using the Logarithmic Number System (LNS). First, we describe the algorithm and architecture of our model, which comprises several computation modules. We then test and verify the model's functionality using numerical simulation on the GRAZ02 dataset. The results show that the proposed system performs strongly relative to floating-point and fixed-point implementations, even when only 10% of the training examples are used, and that it is reasonably power efficient.
I. INTRODUCTION
COMBINING multiple classifiers (e.g., decision trees) into an ensemble is an advanced machine learning technique that substantially improves classification over single classifiers. Random forests (RFs) [1], a representative decision tree-based ensemble, have emerged as a principal machine learning tool that combines the properties of an efficient classifier and feature selection, running on general-purpose processor-based (GPP-based) custom hardware and optimized operating systems. Rather than minimizing training error, RF minimizes the generalization error, while being fast to train, proven not to overfit, and computationally efficient ($O(\sqrt{V}\, T \log T)$, where V is the number of variables and T is the number of observations). These merits make RF a promising tool for adaptive classification problems. RF has also been applied to vision problems such as object recognition [2]-[7], OCR [8], and key point recognition [9]. Despite the apparent success of RF, virtually no work has been done to map its ideal mathematical model to a compact and reliable hardware design.
In this paper we present an object recognition system implemented on a field programmable gate array (FPGA), which enables the learning algorithm to scale up. As can be seen in Fig. 1, the recognition process is composed of an automatic representation of objects as covariance matrices followed by a tree-based RF detector that operates in on-line mode. We have shown in [4] that using a bag of covariance matrices as the object descriptor improves the accuracy of object recognition while speeding up the learning process, so we extend this technique and present its hardware architecture. The RF detector is designed using the Logarithmic Number System (LNS) [10], which allows the required word length to be reduced to 16 bits; consequently, a general-purpose microprocessor of the same word length is used. The architecture comprises several computation modules, referred to as 'Covariance Matrices', 'Tree Units', 'Majority Vote Unit', and 'Forest Units'. The main contribution of our approach (in addition to its impact on the tradeoff between algorithmic accuracy and hardware implementation cost) is three-fold: (1) it reduces arithmetic complexity using a modified RF based on LNS (RF-LNS), (2) it is designed to be easily integrated into a system-on-chip (SoC) that can perform both automatic feature selection and recognition, and (3) it allows for a fair comparison with floating-point (FP) and fixed-point (FX) implementations. We test and verify the model's functionality using numerical simulation and present results obtained using examples from the GRAZ02 dataset [11]. First, in Section II we present related work and highlight general constraints in implementing hardware-based recognition systems. Section III describes the object descriptor we use and gives an overview of the RF algorithmic settings. In Section IV we present the full architecture and design of our recognition system. We follow with an experimental evaluation and an estimate of the required precision in Section V. A brief conclusion appears in Section VI.
II. HARDWARE-BASED MACHINE LEARNING
Perhaps motivated by the high computational complexity of many software-oriented machine vision algorithms, there have been several attempts to create faster hardware implementations that can identify and localize objects in a given scene or image with high recognition performance. Some studies use Pulsed Neural Networks (PNNs), which employ Pulsed Neuron (PN) or spiking neuron models, for object localization and processing; PN models are reported to adapt much better than traditional neural networks. In [12] and [13], k-means clustering is implemented using reconfigurable hardware. The Kerneltron [14], [15] is an SVM classification module with a system precision of no more than 8 bits. A fully digital architecture for SVM classification employing linear and RBF kernels has also been proposed; the minimal word size it is able to use is 20 bits. However, to the best of our knowledge, ours is the first attempt to implement RF in hardware. We expect further progress using this approach.
A. Hardware implementations: problems and constraints
Any hardware implementation of machine vision algorithms, be it analog, digital, or optical, brings along various constraints. Hardware contains all the basic blocks needed to build any logic or mathematical function imaginable, but it is limited by the parallelism available in the program (i.e., performance) and by power consumption. FPGAs provide the flexibility to cope with evolving applications, but at the cost of large performance, area, power, and reconfiguration time penalties.
B. Logarithmic Number System (LNS)
LNS is an alternative to the conventional FP representation of real numbers. The idea is to convert values into logarithms once and keep them in this representation throughout the entire computation. LNS represents a number by a sign bit and its exponent in a certain base. The multiplication of two numbers is simply the sum of their exponent parts, $\log_2(x \cdot y) = \log_2(x) + \log_2(y)$, while divisions and square roots are implemented by fixed-point subtraction and a bit shift, respectively. However, the addition of two LNS numbers with exponents $X = \log_2|x|$ and $Y = \log_2|y|$, $\log_2|x + y| = X + \log_2|1 + 2^{Y-X}|$, is not a linear operation and requires two fixed-point adder/subtractors and a lookup-table (LUT) process (function generators (FGs)). The size of LNS adders increases exponentially as the operands' word lengths increase. LNS arithmetic is therefore usually advantageous at low precision, where it also offers a constant relative error.
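The LNS identities above can be illustrated with a short software sketch. It is not the hardware design: the function names are hypothetical, exponents are kept as Python floats rather than fixed-point words, and the term $\log_2(1 + 2^{Y-X})$ is evaluated directly where hardware would use a lookup table. Only same-sign addition is covered.

```python
import math

def to_lns(x):
    """Encode a nonzero real as (sign bit, log2|x|)."""
    if x == 0:
        raise ValueError("zero has no finite logarithm")
    return (0 if x > 0 else 1, math.log2(abs(x)))

def from_lns(v):
    """Decode (sign bit, exponent) back to a real number."""
    sign, e = v
    return (-1.0) ** sign * 2.0 ** e

def lns_mul(a, b):
    # Multiplication: XOR the signs, add the exponents.
    return (a[0] ^ b[0], a[1] + b[1])

def lns_div(a, b):
    # Division: XOR the signs, subtract the exponents.
    return (a[0] ^ b[0], a[1] - b[1])

def lns_sqrt(a):
    # Square root: halve the exponent (a one-bit right shift in fixed-point).
    if a[0] == 1:
        raise ValueError("square root of a negative number")
    return (0, a[1] / 2.0)

def lns_add(a, b):
    """Same-sign addition: X + log2(1 + 2^(Y - X)), with X the larger exponent."""
    if a[0] != b[0]:
        raise NotImplementedError("mixed-sign addition is outside this sketch")
    big, small = max(a[1], b[1]), min(a[1], b[1])
    # Hardware would read log2(1 + 2^(Y - X)) from a LUT; here it is computed.
    return (a[0], big + math.log2(1.0 + 2.0 ** (small - big)))
```

Taking X as the larger exponent keeps $2^{Y-X}$ in (0, 1], which bounds the range of the table that a hardware implementation would need.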
III. ALGORITHMIC CONSIDERATIONS
The proposed object recognition approach consists of two basic models: an object descriptor based on covariance matrices [4], [16] and a classifier based on an on-line variant of RF implemented on an FPGA using LNS.
A. Covariance Matrices Descriptor
We use a bag of covariance matrices (Fig. 2) to represent an object region. Let I be an input color image, and let F be the $W \times H \times d$ dimensional feature image extracted from I,

$$F(x, y) = \phi(I, x, y),$$

where the function φ can be any feature mapping (such as intensity, color, etc.). For a given region R ⊂ F, let $\{z_k\}_{k=1 \cdots n}$ be the d-dimensional feature points inside R. We represent region R with the d × d covariance matrix $C_R$ of the feature points,

$$C_R = \frac{1}{n-1} \sum_{k=1}^{n} (z_k - \mu)(z_k - \mu)^T,$$

where μ is the mean of the feature points in R.
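As a minimal sketch of the region covariance descriptor, the function below computes the d × d covariance matrix of a list of d-dimensional feature points. The function name is hypothetical, the unbiased 1/(n-1) normalization is an assumption, and plain Python lists stand in for whatever feature image layout an implementation would actually use.

```python
def region_covariance(points):
    """Covariance matrix of the d-dimensional feature points of one region.

    `points` is a list of equal-length sequences z_k (one per pixel in R).
    Uses the unbiased 1/(n-1) normalization, which is an assumption here.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two feature points")
    d = len(points[0])
    # Mean feature vector mu of the region.
    mu = [sum(p[j] for p in points) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for p in points:
        diff = [p[j] - mu[j] for j in range(d)]
        # Accumulate the outer product (z_k - mu)(z_k - mu)^T.
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b]
    return [[c / (n - 1) for c in row] for row in cov]
```

The descriptor size depends only on d, not on the number of pixels in the region, which is what makes regions of different sizes comparable.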
B. Image Labeling
We gradually build our knowledge of the image from features to a covariance matrix to a bag of covariance matrices. We start by forming a covariance matrix C from image features such that each feature z in C has intensity μ(z) and associated variance $\lambda^{-1}(z)$, so that λ is the inverse variance (precision). We then group covariance matrices, as sets of spatially grouped features in C that are likely to share common labels, into a bag of covariance matrices.

Covariance matrix. Different regions of an object may have different descriptive powers and, hence, a different impact on learning and recognition (Fig. 2A). Following [16], we represent image objects with five covariance matrices $C_{i=1 \cdots 5}$ of the features computed inside R (Fig. 2B), noting that features in the covariance matrix may be used in multiple image locations.

Color. Color is described by the Ohta color space histogram values of the pixels. This histogram is chosen because it is less sensitive to variations in illumination. The Ohta values of the pixels in an image are clustered using k-means, i.e., each pixel in image I is assigned to the nearest cluster center, and the histogram frequencies are then normalized.

Appearance. We use histograms of Local Binary Patterns (LBPs) to represent each feature's appearance in some appearance space. Fig. 2C depicts the points that must be sampled around a particular point (x, y) in order to calculate the LBP. In our implementation, each sample point lies at a distance of 2 pixels from (x, y); instead of the traditional 3 × 3 rectangular neighborhood, we sample the neighborhood circularly with two different radii (1 and 3). The resulting operators are denoted by $LBP_{8,1}$ and $LBP_{8,1+8,3}$, where the subscripts give the number of samples and the neighborhood radii.

A bag of covariance matrices. A bag of covariance matrices, which is a concatenation of the Ohta color space histogram and an appearance model based on LBP and SIFT for different features of an image window, is presented in Fig. 1E. We then estimate the bag of covariance matrix likelihoods $P(I_i \mid C, I_i)$ and the likelihood that each bag of covariance matrices is homogeneously labeled. We use this representation to automatically detect any target in images. We then apply the on-line RF learner to select object descriptors and to learn an object classifier.
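The circular LBP operator used for the appearance features can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: the function names are hypothetical, off-grid samples are obtained by bilinear interpolation, the bit order of the samples is arbitrary, and the image is a plain list of rows of grey values.

```python
import math

def bilinear(img, y, x):
    """Bilinearly interpolate a grey value at real-valued position (y, x)."""
    h, w = len(img), len(img[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def lbp_code(img, y, x, samples=8, radius=1):
    """LBP code at (y, x): one bit per circular sample that is >= the center."""
    center = img[y][x]
    code = 0
    for k in range(samples):
        angle = 2.0 * math.pi * k / samples
        sy = y - radius * math.sin(angle)
        sx = x + radius * math.cos(angle)
        if bilinear(img, sy, sx) >= center:
            code |= 1 << k
    return code

def lbp_histogram(img, samples=8, radius=1):
    """Normalized histogram of LBP codes over all pixels with a full neighborhood."""
    h, w = len(img), len(img[0])
    hist = [0] * (1 << samples)
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            hist[lbp_code(img, y, x, samples, radius)] += 1
    total = sum(hist) or 1
    return [count / total for count in hist]
```

One plausible reading of $LBP_{8,1+8,3}$, again an assumption, is the concatenation `lbp_histogram(img, 8, 1) + lbp_histogram(img, 8, 3)`.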
C. RF for Recognition
A detailed discussion of Breiman's RF [1] learning algorithm is beyond our scope here; however, to simplify the further discussion, we briefly define some fundamental terms.

Decision tree. For the k-th tree, a random covariance matrix $C_k$ is generated, independent of the past random covariance matrices $C_1, \ldots, C_{k-1}$, and a tree is grown using the training set of positive and negative images I and the covariance feature $C_k$. The decision generated by a random tree corresponds to a covariance feature selected by the learning algorithm. Each tree casts a unit vote for a single matrix, resulting in a classifier $h(I, C_k)$.

Forest. Given a set of M decision trees, a forest is computed as an ensemble of these tree-generated base classifiers $h(I, C_k)$, $k = 1, \ldots, M$. Finally, a forest detector is computed as a majority vote.

Majority vote. If there are M decision trees, the majority voting method gives a correct decision if at least $\lfloor M/2 \rfloor + 1$ decision trees give correct outputs. If each tree independently has probability p of making a correct decision, then the forest makes a correct decision with probability

$$P = \sum_{i=\lfloor M/2 \rfloor + 1}^{M} \binom{M}{i} p^i (1-p)^{M-i}.$$
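To make the majority-vote argument concrete, the sketch below evaluates the probability that at least $\lfloor M/2 \rfloor + 1$ of M trees are correct. It assumes the trees err independently with a common per-tree accuracy p, so the number of correct trees is binomially distributed; the function name is hypothetical.

```python
from math import comb

def forest_accuracy(num_trees, p):
    """Probability that a strict majority of independent trees is correct.

    Assumes every tree is correct with the same probability p and that
    tree errors are independent (a binomial model).
    """
    needed = num_trees // 2 + 1  # floor(M/2) + 1 correct votes
    return sum(
        comb(num_trees, i) * p ** i * (1 - p) ** (num_trees - i)
        for i in range(needed, num_trees + 1)
    )
```

For example, three trees with p = 0.7 give 3(0.7)^2(0.3) + (0.7)^3 = 0.784, which is higher than any single tree. Real trees are correlated, so this figure is best read as an optimistic bound.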
D. On-line RF for Recognition
To obtain an on-line algorithm, each of the steps above must work on-line, so that the current base classifier is updated whenever a new sample arrives. In particular, our on-line RF involves two steps in inferring the object category (Algorithm 1). First, based on the covariance object descriptor, we develop a new conditional permutation scheme for computing the feature importance measure. Second, the fixed set of K trees is initialized, and the individual trees in the RF are then incrementally generated, each from a specifically selected covariance matrix from the bag of covariance matrices. For updating, any on-line learning algorithm may be used, but we employ a standard Kalman filtering technique.
Algorithm 1 On-line Random Forests
1: Initially select the number K of trees to be generated.
5: Vector $C_k$ that represents a bag of covariance matrices is generated
6: Construct tree $h(I, C_k)$ using any decision tree algorithm
7: Each tree makes its estimation based on a single matrix from the bag of covariance matrices at I
8: Each tree casts a vote for the most popular covariance matrix at image I
9: The popular covariance matrix at I is predicted by selecting the matrix with the maximum votes over $h_1, h_2, \ldots, h_k$
10: Return a hypothesis $h_l$
12: end for
13: Get the next sample set
14: Output: proximity measure, feature importance, a hypothesis h
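The overall flow of an on-line ensemble with majority voting can be sketched in software as below. This is a deliberately simplified stand-in and not Algorithm 1 itself: each base learner is a trivial nearest-running-mean rule on one randomly chosen feature instead of a decision tree, the update is a plain running mean rather than a Kalman filter, and all class and method names are hypothetical.

```python
import random
from collections import Counter

class OnlineStump:
    """Toy base learner: nearest running class mean on a single feature."""

    def __init__(self, feature):
        self.feature = feature
        self.mean = {}   # label -> running mean of the chosen feature
        self.count = {}  # label -> number of samples seen

    def update(self, x, label):
        # Incremental mean update, so no past samples need to be stored.
        value = x[self.feature]
        n = self.count.get(label, 0) + 1
        old = self.mean.get(label, 0.0)
        self.count[label] = n
        self.mean[label] = old + (value - old) / n

    def predict(self, x):
        if not self.mean:
            return None  # nothing learned yet
        value = x[self.feature]
        return min(self.mean, key=lambda c: abs(value - self.mean[c]))

class OnlineForest:
    """Ensemble of K base learners combined by majority vote."""

    def __init__(self, num_trees, num_features, seed=0):
        rng = random.Random(seed)
        # Each learner is tied to one randomly selected feature.
        self.trees = [OnlineStump(rng.randrange(num_features))
                      for _ in range(num_trees)]

    def update(self, x, label):
        for tree in self.trees:
            tree.update(x, label)

    def predict(self, x):
        votes = Counter(tree.predict(x) for tree in self.trees)
        votes.pop(None, None)  # ignore learners that have seen no data
        return votes.most_common(1)[0][0] if votes else None
```

A typical use is to call `update` once per arriving labeled sample and `predict` at any time in between, which mirrors the sample-by-sample operation described for the on-line RF.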
IV. HARDWARE ARCHITECTURE

A. FPGA Architecture
All FPGAs consist of three major components: 1) logic blocks (LBs); 2) I/O blocks; and 3) programmable routing, as shown in Fig. 3(A). A logic block (LB) is a functionally complete logic circuit; a design is partitioned to the LB size, mapped, placed, and routed in an interconnect framework to perform a desired operation. Field programmability is achieved through switches (transistors controlled by memory elements or fuses): an N-input LUT can implement any N-input Boolean function, and each I/O block is programmed to act as an input or output, as required. The programmable routing is also configured to make the necessary connections between logic blocks, and from logic blocks to I/O blocks. The processing power of an FPGA is highly dependent on the processing capabilities of its LBs and the total number of LBs available in the array. Generally, FPGAs use logic blocks that contain one or more LUTs, typically with at least four inputs. A four-input LUT can implement any binary function of four logic inputs. Fig. 3 shows the architecture of a simple LB containing one four-input LUT and one flip-flop for storage.
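The statement that a four-input LUT can implement any Boolean function of four inputs can be illustrated in software: the LUT is simply a 16-entry truth table addressed by the input bits. The sketch below is a behavioral model only, with hypothetical function names, and says nothing about any particular vendor's LUT primitive.

```python
def make_lut(func, n_inputs=4):
    """Pack the truth table of `func` into an integer, one bit per input pattern."""
    table = 0
    for address in range(1 << n_inputs):
        # Bit i of the address is the value of input i.
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        if func(*bits):
            table |= 1 << address
    return table

def lut_eval(table, *bits):
    """Evaluate the LUT: the inputs form an address into the stored table."""
    address = sum(bit << i for i, bit in enumerate(bits))
    return (table >> address) & 1
```

Configuring the FPGA amounts to loading `table`; evaluation is then a single memory read regardless of how complex the original function was.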
B. Transform into Log-domain
Rather than adopting FP arithmetic, we base our design on LNS, which eliminates the need for multiplications and divisions and allows all operations to be carried out using shifts and additions. In LNS, a number x is represented in signed-magnitude form, i.e., as a pair (S, e), where $x = (-1)^S r^e$, S being the sign bit (which is either 0 or 1 according to the sign of x) and e being the signed exponent of the radix r (usually radix 2). The exponent e is expressed in fixed-point binary with, say, G bits for the integer part, F bits for the fractional part, and one bit for the sign of the exponent, i.e., with a total of (G + F + 1) bits. If the radix is 2, then the smallest number that can be represented using this scheme is $2^{-N}$, where N = (s

The ratio between two consecutive numbers is equal to $r^{2^{-F}}$, and the corresponding precision $\epsilon$ is roughly $(\ln r)\,2^{-F}$. Typically, if G = 5, F = 30, and r = 2, we have a precision of 30 bits in radix 2. However, for the purpose of comparison with the precision of the FP representation, $\epsilon$ will be assumed to be $2^{-23}$ ($\approx 10^{-7}$). Numbers closer to zero are represented with better precision in LNS than in FP systems. Moreover, LNS departs from FP in that its relative error is constant, and LNS can often achieve an equivalent signal-to-noise ratio with fewer bits of precision than FP architectures.

The feature map φ produces float values, which require considerable storage space in FPGA memory. In order to reduce the hardware cost, we propose to approximate the function φ using LG. This function transforms the float elements of φ into binary elements. For the 'Tree Units' we compute 16 covariance matrices in 32-bit memory.

Basically, the decision trees consist of two types of nodes: decision nodes, corresponding to state variables, and leaf nodes, which correspond to all possible covariance features that can be taken. In a decision node, a decision is taken about one of the inputs. Each leaf node stores the state values for the corresponding region in the image, meaning that a leaf node stores a value for each relevant covariance matrix that can be taken. The tree starts out with only one leaf node that represents the entire image region; a decision then has to be made whether the node should be split or not. An ACC block performs the accumulation operations at each node. Once a tree is constructed, it can be used to map an input vector to a leaf node, which corresponds to a region in the image. A decision tree can then be converted into an equivalent 'Tree Unit' by extracting one logic function per class from the tree structure. Each 'Tree Unit' gives a unit vote for its most popular object class. A 'Forest Unit' is an ensemble of trees grown incrementally to a certain depth. The object is recognized as the one having the majority vote, stored in the 'Majority Vote Unit'. A SIGM block performs the sigmoid evaluation function for the majority votes.
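The effect of the number of fractional exponent bits F on LNS precision can be checked numerically with the sketch below. It assumes radix 2, rounds the exponent to the nearest multiple of $2^{-F}$, and ignores the range limit imposed by the G integer bits; all function names are hypothetical.

```python
import math

def lns_encode(x, frac_bits):
    """Return (S, e) with e rounded to the nearest multiple of 2**-frac_bits."""
    if x == 0:
        raise ValueError("zero has no finite logarithm")
    sign = 0 if x > 0 else 1
    scale = 1 << frac_bits
    exponent = round(math.log2(abs(x)) * scale) / scale
    return sign, exponent

def lns_decode(sign, exponent):
    """Recover the real value (-1)^S * 2^e."""
    return (-1.0) ** sign * 2.0 ** exponent

def relative_step(frac_bits):
    """Relative gap between consecutive LNS values: 2^(2^-F) - 1."""
    return 2.0 ** (2.0 ** -frac_bits) - 1.0

def max_relative_error(frac_bits):
    """Worst-case relative error of round-to-nearest: 2^(2^-(F+1)) - 1."""
    return 2.0 ** (2.0 ** -(frac_bits + 1)) - 1.0
```

Because the bound returned by `max_relative_error` does not depend on x, the relative error is the same for very small and very large magnitudes, which is the constant-relative-error property discussed above.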
C. Object Recognition Architecture based on RF-LNS
V. EVALUATION
We now demonstrate the usefulness of this framework for recognizing generic objects such as bikes, cars, and persons.
A. Dataset
The functionality of the proposed system was simulated, and the hardware was programmed. We used data derived from the GRAZ02 dataset [11], a collection of 640 × 480 24-bit color images. As can be seen in Fig. 5, this dataset has three object classes: bikes, cars, and persons. Table 1 reports the number of images and objects in each class; 380 images are available for the background class.
B. Experimental Settings
Our RF-LNS is trained with varying amounts (10%, 50% and 90% respectively) of randomly selected training data. All images not selected for the training split were put into the test split. For the 10% training data experiments, 10% of images were selected randomly with the remainder used for testing. This was repeated 20 times. For the 50% training data experiments, stratified 5 × 2 fold cross validation was used. Each cross validation selected 50% of the dataset for training and tested the classifiers on the remaining 50%; the test and training sets were then exchanged and the classifiers retrained and retested. This process was repeated 5 times. Finally, for the 90% training data situation, stratified 1 × 10 fold cross validation was performed, with the dataset divided into ten randomly selected, equally sized subsets, with each subset being used in turn for testing after the classifiers were trained on the remaining nine subsets.
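The three splitting protocols can be sketched as index generators. This is an illustrative reconstruction, not the scripts used for the experiments: the function names and the seeding are assumptions, and stratification is done by dealing the shuffled members of each class round-robin into the folds.

```python
import random
from collections import defaultdict

def random_holdout(n_items, train_fraction, repeats, seed=0):
    """Repeated random split, e.g. 10% training and the rest testing, 20 times."""
    rng = random.Random(seed)
    n_train = max(1, round(n_items * train_fraction))
    for _ in range(repeats):
        order = list(range(n_items))
        rng.shuffle(order)
        yield sorted(order[:n_train]), sorted(order[n_train:])

def stratified_folds(labels, k, rng):
    """Split indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

def cross_validation_splits(labels, k, repeats, seed=0):
    """Yield (train, test) index lists for `repeats` rounds of stratified k-fold.

    k=2, repeats=5 gives the 5 x 2 protocol; k=10, repeats=1 gives 1 x 10.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for i in range(k):
            test = sorted(folds[i])
            train = sorted(j for fold in folds[:i] + folds[i + 1:] for j in fold)
            yield train, test
```

With k = 2, consecutive pairs of splits are exchanges of each other, matching the description of swapping the training and test halves.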
VI. PERFORMANCES
GRAZ02 images contain only one object category per image, so the recognition task can be seen as a binary classification problem: bikes vs. background (i.e., non-bikes), people vs. background, and cars vs. background. Generalization performance in these object recognition experiments was estimated using a statistical measure, the Area Under the ROC Curve (AUC). AUC is a measure of classifier performance that is independent of the threshold: it summarizes how the true positive and false positive rates change as the threshold gradually increases from 0.0 to 1.0, and it does not summarize accuracy at any single threshold. An ideal classifier has an AUC of 1.0 and a random classifier has an AUC of 0.5.
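For reference, AUC can be computed without tracing the ROC curve explicitly, using its equivalent pairwise form: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below uses this form; the function name and the 0/1 label encoding are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` uses 1 for the object class and 0 for background.
    Quadratic in the number of samples, which is fine for illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(positives) * len(negatives))
```

Since only the ordering of the scores matters, the result is unchanged by any monotonic rescaling of the classifier output, which is why no decision threshold has to be chosen.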
A. Finite Precision Analysis
The primary task here is to analyze the precision requirements for performing recognition. The RF-LNS precision was varied to ascertain optimal LNS precisions and to compare them against the cost of using FP architectures. Tables II, III, and IV give the mean AUC values across all runs, to 2 decimal places, for each combination of RF-LNS setting and training data amount, for the bikes, cars, and people datasets, respectively. The performance of RF-LNS is reported with weights quantized to 4, 8, and 16 bits, and for different decision tree depths, from depth = 3 to depth = 7. For example, a figure of 85% means that 85% of object images were correctly classified but 15% of the background images were incorrectly classified (i.e., thought to be foreground). For RF-LNS to maintain acceptable performance, 16 bits of precision are sufficient for all GRAZ02 categories, even when only 10% of the training examples are used. The low precision required by RF-LNS makes it competitive with FP arithmetic for our generic object recognition application.
B. Efficiency and Hardware area
In order to evaluate the efficiency of the RF-LNS classifier in terms of hardware area, 10- and 20-bit fixed-point (FX) implementations were synthesized for comparison, and the resulting numbers of slices are shown in Table V. It is worth noting that on most datasets the RF-LNS takes roughly the same number of slices as the inadequate 10-bit FX version. When compared against the more realistic 20-bit FX version, the RF-LNS classifiers are about one-half the size of the FX classifiers. Our design also achieves a high clock rate. For the 1-bit RF-LNS, the power dissipation is small, and the area usage on the FPGA is less than 2 percent.
VII. CONCLUSIONS AND FUTURE WORKS
Efficient hardware implementations of machine learning techniques yield a variety of advantages over software solutions: increased processing speed and reliability, as well as reduced cost and complexity. In this paper, the RF technique is modified so that classification is performed using LNS arithmetic. The model is applied to a generic object recognition task, and the results show that at low precision the RF-LNS hardware offers significant area savings compared to the fixed-point alternative. With these characteristics, RF-LNS may be a good basis for designing real-time, low-power object recognition systems. Our future goals include further exploring the precision requirements of hardware RF-LNS, performing noise analysis to determine the robustness of the hardware classifier, and extending LNS hardware architectures to other machine learning algorithms.
