Abstract-Real-time face recognition by computer systems is required in many commercial and security applications because it helps protect both privacy and security. However, face recognition generates huge amounts of data in real time, and filtering meaningful data out of this raw data with high accuracy is a complex task. Most existing techniques focus primarily on accuracy and rely on extensive matrix-oriented computations. Efficient realizations reduce the computational space using eigenvalues. However, an eigenvalue-oriented evaluation has a minimum time complexity of O(n³), where n is the rank of the covariance matrix; the cost of generating the covariance matrix is extra. Our frequency distribution curve (FDC) technique avoids matrix decomposition and other computationally intensive matrix operations. FDC is formulated with a bias towards efficient hardware realization and high accuracy by using simple vector operations. FDC requires pattern vector (PV) extraction from an image within O(n²) time. The enhanced FDC-based architecture proposed in this paper further shifts a computationally expensive component of FDC to the offline layer of the system, resulting in very fast online evaluation of the input data. Furthermore, efficient online testing is pursued using an adaptive controller (AC) for PV classification that utilizes the Euclidean vector norm. The pipelined AC architecture adapts to the availability of resources in the target silicon device. Our implementation on an XC5VSX50t FPGA demonstrates a high accuracy of 99% in face recognition for 400 images in the ORL database, generally requiring less than 200 nsec per image.
I. INTRODUCTION
There are two main classes of biometric techniques: intrusive and nonintrusive. Face recognition systems, which employ a nonintrusive biometric approach, are increasingly in demand for defense, security and commercial applications, as they protect both safety and privacy during the process [1]. Such systems generate a lot of raw data in real time. Numerous engineering applications that take decisions based on raw data employ pattern recognition approaches that first extract meaningful data [2] and then generate eigenvalues to represent this dataset.
PCA (Principal Component Analysis) is one such common technique associated with face recognition algorithms [2] [3] [4] [5] [6] [7] [8] [9].
PCA computes the eigenvalues of the covariance matrix [20], usually in O(n³) time, where n is the rank of the matrix. PCA provides less than 85% accuracy even when using 50% of the training images per subject [10] [11] [12] [13] [14]. Many variants of PCA improve the accuracy and/or speed up the computation [15, 16]. Linear Discriminant Analysis (LDA) normally provides better accuracy than PCA when the dimensionality of the transformed space is one less than the number of classes used in the training set. However, LDA needs more than 80% of the stored images per subject for training purposes [13, 14]. In fact, LDA's accuracy deteriorates badly when the system is trained with only one or two images per subject. Furthermore, it has a higher computational cost than PCA because it uses PCA along with the multivariate normal distribution of the covariance matrix.
Fisher Discriminant Analysis (FDA) works well with the reduced space created by PCA. FDA uses LDA to optimize the sample data point projections [13, 14]. FDA and LDA use a global Euclidean structure for the extraction of features and ignore local face details. Therefore, FDA may not achieve the 99% accuracy that is frequently required for reliable real-time processing. Real-time pattern recognition needs algorithms that can extract and identify known features in minimum time [29]. Relevant embedded systems are expected to recognize faces in less than a µsec. One of these real-time systems, proposed by Microsoft, can handle 15 frames, with two frames of face detection per second, when implemented on a 700 MHz Intel or a 200 MIPS-strong ARM processor [17]. Any software-based realization of an application is normally slower than its hardware-based counterpart. For example, hardware-based face recognition using an artificial neural network and eigenfaces has been implemented on an Analog Devices ADSP-BF535 EZ-KIT device [18]. This system provides recognition with a maximum accuracy of 80%, takes 36 msec per recognition, and uses more than one MB of storage for each face. A multi-processor architecture that includes a smart camera achieves recognition in 4.3 msec with 90% accuracy [19].
Many algorithms have been developed to achieve maximum accuracy within minimum time [10, 11, 20, 21, 22, 23]. However, none of them demonstrates ~99% accuracy within O(n²) time with a single training image per subject for popular benchmark face databases. Our proposed technique extracts both global and local details using the simple frequency distribution curve (FDC) matching method, which requires storage space for a 256 × m array of gray levels on the host machine and for two vectors simultaneously in the BRAM memory of the FPGA, where m is the number of subjects.
Our proposed technique needs only one training image per subject, with frontal pose and good lighting, to provide reliable recognition within O(n²) time. Accuracy may degrade when the training images are acquired with poor lighting or an angled pose, but a decision is still provided within O(n²) time.
Hardware realization can be customized considering FPGA resource constraints concerning memory and other on-chip areas that can be configured for computations. With our FPGA-based architecture, the generation of pattern vectors (PVs), a computationally expensive process, is accomplished offline. These PVs are grouped off-line according to their similarities in order to minimize the online recognition time. This is done by selecting those calculated PVs which are expected to match the input images. An adaptive controller in the online layer speeds up the recognition rate significantly. It is shown here that the adaptive PV-Controller (APVC) improves the recognition rate of the basic, non-adaptive architecture by 80%. Furthermore, the APVC-based pipelined architecture can be configured to match available device resources and application time requirements. 
II. EIGENVALUE METHODS

Let
A. PCA Reduced Eigen Spaces
The basis vectors are computed in the reduced dimensional space for the eigenfaces. The eigenvectors of the square covariance matrix generate an eigenspace. The eigenvectors are denoted by , 1,2,3 … , . These vectors are usually computed by tridiagonalization and subsequent decomposition of the covariance matrix [15]. The covariance matrix is computed as , and the basis vectors for the n² × d dimensions are computed with an original space of n² × n², that is
Then, the projection of the L leading eigenvectors is computed and stored in a matrix 
where is the row mean of the A matrix having dimensionality 1 × n². The following equation represents the projected space, and Equation (5) represents an error vector whose entries are associated with an input test image.
Equation (5) is part of the PCA classifier: a statistical scalar value is computed for each input test image, followed by a threshold analysis of the error to achieve recognition. However, the underlying PCA reduction process requires at least O(n³) time.
B. Fisher Faces
Eigenfaces maximize the variance between classes while ignoring the within-class variance. Fisher faces use LDA to compute both variances separately in order to find the direction that discriminates efficiently between classes [12, 24]. Fisher faces usually perform better than LDA and PCA when the data for the classes are unimodal and the training data set includes a large number of images per subject [13, 14]. Let K_b and K_w be the variance between classes and within a class, respectively:
where is the number of images in class j, m is the total number of classes, and is the j th sample in the i th class.
An optimal projection can be obtained by maximizing the ratio of the determinants in Equation (8):
, … . , contains the L largest eigenvectors. Projections between classes can be derived using Equation (7) and the corresponding eigenvalues [24] . However, LDA uses supervised learning and focuses on global structure because the denominator in Equation (8) is to be minimized or the numerator to be maximized.
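For orientation, the textbook form of the Fisher criterion is sketched below. The notation (n_j, \mu_j, \mu, x_i^{(j)}, W, w_l) is assumed here for illustration and may differ from the symbols used in Equations (6)-(8); only K_b, K_w, m and L are taken from the text.

```latex
\begin{align}
K_b &= \sum_{j=1}^{m} n_j \,(\mu_j - \mu)(\mu_j - \mu)^{T}, \\
K_w &= \sum_{j=1}^{m} \sum_{i=1}^{n_j}
       \bigl(x_i^{(j)} - \mu_j\bigr)\bigl(x_i^{(j)} - \mu_j\bigr)^{T}, \\
W_{\mathrm{opt}} &= \arg\max_{W}
       \frac{\lvert W^{T} K_b W \rvert}{\lvert W^{T} K_w W \rvert}
       = [\, w_1, w_2, \dots, w_L \,].
\end{align}
```

In this assumed notation, n_j is the number of images in class j, x_i^{(j)} is the i-th sample of class j, \mu_j is the mean of class j, \mu is the overall mean, and w_1, ..., w_L are the generalized eigenvectors of the pair (K_b, K_w) associated with the L largest eigenvalues.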
III. PROPOSED FACE RECOGNITION ALGORITHM
Various pattern classifiers have been applied to face recognition, such as nearest neighbor, Bayesian, and support vector machine classifiers [3, 21, 25, 26]. However, none of them shows close to 100% accuracy in O(n²) time with just one or two training images on three popular benchmark face databases. In this paper, we develop a new classifier based on the cumulative frequency distribution and use the standard variance vector (SVV). SVV acts as a template kernel and is represented by a graph. Figure-1 shows the plots of the cumulative frequency against the gray levels. Two images from the same class and three images from other classes are used for the proposed transformation. We can observe in Figure-1 that plots belonging to the same class do not deviate substantially. The graphs are divided into three regions using three threshold values. It has been observed that the exclusiveness of the three regions provides better decision quality. Further, it cannot increase the search time by more than O(n), and the computing time for an input image is not more than O(n²). The input image has to pass through the same computing steps that were used to extract the features in the training process. The proposed classifier provides linear-time decision making for the input test image.
Let $T = \{t_1, t_2, \dots, t_m\}$ be a training set having m classes/columns. Define a transformation P on T which computes the gray level distribution as follows:

$$P(t_i)_j = \bigl|\{\text{pixels of } t_i \text{ with gray level } j\}\bigr|, \quad \text{where } i = 1, 2, \dots, m;\; j = 0, 1, 2, \dots, 255 \qquad (9)$$

We then normalize the distribution:

$$\hat{P}(t_i)_j = \frac{P(t_i)_j}{R} \qquad (10)$$

where $R$ defines the resolution of the images. We accumulate the distribution as calculated in Equation (9) and then sort it, which yields the frequencies of gray level appearance in gradually increasing order. Each row in the normalized matrix represents a reference for the respective class.
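The transformation described by Equations (9) and (10) can be illustrated with a short Python sketch. It assumes that an image is given as a list of rows of integer gray levels in 0..255, and it reads "accumulate" as a running (cumulative) sum of the normalized histogram, which is already non-decreasing and therefore needs no separate sorting step. The function name `pattern_vector` and this reading are assumptions for illustration, not the authors' implementation.

```python
def pattern_vector(image, levels=256):
    """Return a normalized cumulative gray-level distribution for one image.

    image  -- list of rows, each a list of integer gray levels in [0, levels)
    levels -- number of gray levels (256 for 8-bit images)
    """
    # Gray-level histogram: one pass over all pixels.
    counts = [0] * levels
    n_pixels = 0
    for row in image:
        for gray in row:
            counts[gray] += 1
            n_pixels += 1
    if n_pixels == 0:
        raise ValueError("image contains no pixels")

    # Accumulate the counts, then normalize by the image resolution.
    vector = []
    running = 0
    for count in counts:
        running += count
        vector.append(running / n_pixels)
    return vector


# Tiny 2 x 2 example image: two pixels at level 0, one at 1, one at 255.
example_pv = pattern_vector([[0, 0], [1, 255]])
```

The single pass over the pixels is what keeps extraction within the stated bound for an n × n image; the resulting vector always has 256 entries regardless of the image size.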
Let be a transformed testing PV vector obtained by applying Equations (9) and (10) to the incoming test image, which is then subtracted from a column of the matrix to populate a vector M. The content of the M vector is overwritten for each input test image, because the decision entry is simultaneously stored in the vector.
, (11) where is the number of test images. To maximize the variance between classes, each vector in Equation (11) is divided into three regions because the distribution in Equation (11) shows Gaussian behavior [27] . The minimization of the following objective function, which has been developed for efficient face recognition, provides a better discrimination analysis:
It is observed that distinguishing among three regions in Equation (12) for the variance vector improves the discriminating power of the proposed technique. The variance vector has a critical role, as observed in Figure-1 (used in our experiments) [28].
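One way to picture the three-region comparison is sketched below in Python. The objective function of Equation (12) is not available in the text, so the sketch substitutes a simple stand-in: the mean absolute difference between a test PV and a reference PV is computed in each of three gray-level regions and compared against one threshold per region. The region boundaries (85, 170), the threshold values and the function names are all hypothetical.

```python
def region_deviations(test_pv, ref_pv, bounds=(85, 170)):
    """Mean absolute difference between two PVs in three gray-level regions.

    bounds -- hypothetical split points (low, high) between the regions
    """
    if len(test_pv) != len(ref_pv):
        raise ValueError("pattern vectors must have the same length")
    low, high = bounds
    if not 0 < low < high < len(test_pv):
        raise ValueError("bounds must split the vector into three regions")

    # Difference vector between the test PV and the stored reference PV.
    diff = [t - r for t, r in zip(test_pv, ref_pv)]
    regions = (diff[:low], diff[low:high], diff[high:])
    return [sum(abs(d) for d in region) / len(region) for region in regions]


def is_match(test_pv, ref_pv, thresholds=(0.02, 0.02, 0.02), bounds=(85, 170)):
    """Binary decision: every region must stay within its (assumed) threshold."""
    deviations = region_deviations(test_pv, ref_pv, bounds)
    return all(dev <= limit for dev, limit in zip(deviations, thresholds))


# A linear ramp as a stand-in reference PV, and a strongly shifted copy.
reference = [i / 255 for i in range(256)]
shifted = [min(1.0, value + 0.5) for value in reference]
```

Evaluating each region separately means that a large deviation confined to one part of the curve cannot be averaged away by good agreement elsewhere, which is the intuition behind dividing the graphs into regions.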
IV. ARCHITECTURE
Our FDC-based three-layer architecture is shown in Figure-2. Feature extraction for the set of training images using the proposed FDC technique is a computationally expensive process that is handled offline during pre-processing. This process produces the PVs in time depending upon the number of subjects d in the given data set. The PVs are stored in the host machine RAM. One PV is transferred at a time to the online layer through the PCI or other bus interface. The number of pre-stored images per subject determines the total test space.
The second layer of the architecture is used for online testing. The digital image, which is treated as a matrix of gray levels, requires substantial bandwidth to be transferred to the FPGA. Such bandwidth may not be available because FPGAs have a rather limited number of input/output pins and limited operating frequencies. For this reason, our approach converts the image matrix into the one-dimensional input pattern vector (IPV) on the host machine for transmission to the FPGA board. An FPGA buffer directs this data to the on-chip BRAM memory for efficient FDC computation. FDC applies the mathematical steps in Equations (10) and (11).
One PV from the pre-stored collection of PVs in the host RAM is transferred while the IPV is being computed, thus overlapping computations with data transfers. After the IPV is computed in the FPGA, the classifier determines statistically, according to Equations (11) and (12), a possible match (binary decision). This process is repeated for the PVs pre-stored in the host RAM until a true decision is reached by the classifier. Our experiments show that, on average, twelve iterations are needed for a successful recognition. We further improve the efficiency of the architecture in Figure-2 by introducing an adaptive classification technique for the pre-stored (training set based) PVs.
The enhanced architecture relies on a process that eliminates the need to test all the pre-stored PVs against the online produced IPV. The proposed architecture modifications are minimal. The new architecture is shown in Figure-3. In this architecture, the Euclidean norm (length) of a vector is used to calculate tags for the pre-stored PVs. The tags are scalar real values computed for the PVs. The objective is to minimize the number of false attempts by introducing the following algorithm in the host machine.
Algorithm for the controller to locate the most suitable PV for the incoming test image
Step (1): Using the Euclidean norm, the formula $\|v\| = \sqrt{\sum_i v_i^2}$ is used to calculate tags for the stored PVs.
Step (2): The computed tags, which are related to the PVs, are then stored in a tag vector Vm of length m, where m is the number of subjects in the database.
Step (3): Apply step (1) to the incoming test image to obtain its tag y, and subtract y from each element of Vm to generate a Euclidean length difference vector Diff_PV of dimension m.
Step (4): Sort the Diff_PV vector in ascending order and then store the sorted values with their indices in an array F_PV having dimension 2 x m.
Step (5): The index associated with the first value in F_PV is used to send the first PV, which has the maximum probability of matching the IPV. If recognition is not successful, then choose the next value from the F_PV array and repeat this process until a successful recognition signal is received from the next hardware layer.
Step (6): Repeat steps 3 to 5 for each test image.
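The controller steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the differences in step (3) are taken in absolute value so that ascending order puts the closest tag first, and the hardware classifier is replaced by a caller-supplied `matches` function.

```python
import math


def tag(pv):
    """Step (1): Euclidean norm of a pattern vector, used as its scalar tag."""
    return math.sqrt(sum(value * value for value in pv))


def candidate_order(stored_tags, test_pv):
    """Steps (3)-(4): indices of stored PVs, closest tag first."""
    y = tag(test_pv)
    # Absolute difference is an assumption; the text only says "subtract".
    diff_pv = [abs(stored_tag - y) for stored_tag in stored_tags]
    return sorted(range(len(stored_tags)), key=lambda index: diff_pv[index])


def recognize(stored_pvs, test_pv, matches):
    """Step (5): try PVs in tag order until `matches` reports success.

    Returns (index_of_matching_pv, attempts), or (None, attempts) on failure.
    """
    stored_tags = [tag(pv) for pv in stored_pvs]  # Step (2), done offline
    attempts = 0
    for index in candidate_order(stored_tags, test_pv):
        attempts += 1
        if matches(test_pv, stored_pvs[index]):
            return index, attempts
    return None, attempts


# Toy data: three stored PVs with tags 1, 5 and 10.
stored = [[1.0, 0.0], [3.0, 4.0], [6.0, 8.0]]


def close_enough(a, b):
    """Stand-in classifier: largest element-wise difference below 0.5."""
    return max(abs(x - y) for x, y in zip(a, b)) < 0.5


result = recognize(stored, [3.0, 4.1], close_enough)
```

Because the tags are scalars, ordering the candidates costs only a sort over m values per test image, while each avoided classifier attempt saves a full PV transfer and comparison.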
In Figure-3, the host layer has a tag vector of length m generated during step (2) of the above algorithm. In the online layer, two signals are introduced to control the next transfer of the PV from the host layer to the online layer. The signal TF_Fag has a "true" value for a match and "false" otherwise (i.e., demanding the next PV). A "true" value for TF_Fag is also transmitted to the I/O buffer to get a new test image. Furthermore, for a "true" signal, a new ID_IPV is sent to the host layer.
Our simulation results for the new algorithm using the ORL database are shown in histogram form in Figure-4 [28]. The column length indicates the frequency of successful attempts, whereas the horizontal axis shows the number of required tests against PVs. The proposed controller algorithm provides a significant improvement for face recognition compared to the original architecture in Figure-2. The first column in Figure-4 shows that ~190 test images will get a successful PV from the PV_controller within the fifth iteration. The average number of run-time tested PVs is reduced from 20 to 7.7.
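As a quick check of the reported averages, the relative reduction in the number of run-time tested PVs follows directly from the two figures quoted above (20 and 7.7). The helper name is illustrative only.

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrinks when going from before to after."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Average tested PVs: 20 without the controller, 7.7 with it.
reduction = relative_reduction(20.0, 7.7)
print(f"{reduction:.1%}")  # prints 61.5%
```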
A. Pipelining
In Figure-5, a pipelined architecture is formulated to take advantage of regularity in PV_ID, IPV_ID and Vm array data. In this architecture, PV_controller is replaced with a new layer for PV classification. This layer has three major components, namely Trained_PV, P_PV_controller and PV_Cache. Trained_PV represents the collection of PVs, similar to the architecture without the pipeline. P_PV_Controller has two extra components compared to the architecture in Figure-3. These components are a Circular FIFO and a k-dimensional array containing the sorted tags of PVs in the corresponding pipelined modules. The Circular FIFO is introduced in the PV_Controller to handle the sequence of identification tags for incoming test images. It is assumed that the k modules can get their most suitable trained PVs from the controller layer; these PVs are stored in advance in the PV_Cache area.
The diamond block provides input to the OR gates to preempt early any further processing after a successful match/recognition. This block generates a positive decision if the test image belongs to a person in the training database; otherwise, its feedback signal to the host machine requests the next appropriate PV. This block plays an important role in the pipelined architecture using the PV_Cache area, as shown in Figure-5 .
A "true" or "1" answer for the Boolean expression in Equation (12) means that the input image matches that particular PV; otherwise, the next appropriate PV has to be sent from PV_cache to the online layer. PV_Cache holds the PVs according to the number of test images being under process in the online layer. PV_Cache will be flushed when a "true" signal is received from one of the OR gates.
The horizontal modules in the online layer in Figure-5 depend upon the target FPGA resources, and the vertical elements determine the number of FPGAs on the target board. Therefore, the proposed pipelined architecture is a modular, robust reconfigurable system, as shown in Figure-6. The obtained architecture speed as a function of the pipeline stages was simulated and the result is shown in Figure-6, assuming a 10% overhead for new stages.
The output of the OR gates that generate a feedback to the controller further improves the decision time for an appropriate PV identification that leads to a transmission to the online layer. A high speed face recognition process can then be obtained.
V. RESULTS
Table-I shows the resource utilization data and the execution time when implementing our architectures on a Xilinx Virtex5 FPGA device. The Xilinx 11.1 suite was used, which includes the AccelDSP tool with MATLAB files for floating-point verification. According to the synthesis report, 15 32-bit multipliers, 53 adders and 47 subtractors are used in the implementation. Pipelining in the multipliers, adders and subtractors improves the time further, but the resource consumption reaches 36% of the FPGA real estate. On the other hand, an execution time of approximately 80 nsec makes our system a very viable choice for real-time face recognition. Our chosen development tool provides two frequencies for the designed system, one being the requested frequency and the other the maximum based on the circuit's critical path after RTL implementation. The execution time for both frequencies is shown in Table-I using the ORL database. The worst frequency is the selected value before the synthesis process mentioned in Table-I. The pipelined architecture shows an improvement of more than 87% compared to the architecture in Figure-3 for the XC5VSX50t Virtex5 device, since up to three modules in Figure-5 can work simultaneously. The testing of any image using the proposed system provides a decision within 0.6 µsec with an accuracy of 98.3%. This was validated with the ORL database with 400 images of 40 subjects and 5600 pixels in each gray level image. In comparison, the Microsoft real-time system provides a decision in 0.5 sec, and the Analog Devices-based system gives a result in 36 msec with 80% accuracy.
VI. CONCLUSION
Real-time face recognition systems have very high computational demands while also requiring high accuracy. These requirements are significant when dealing with security applications. Most of the existing face recognition algorithms have been developed for desktop-based offline systems. Our proposed frequency distribution curve (FDC) matching technique primarily pre-calculates pattern vectors (PVs) that can be subsequently used by the online classifier. Novel FDC-based architectures were presented that achieve substantial speedups while also providing highly accurate recognition. The adaptive PV_controller-based FDC architecture yields high speedups with due consideration to resource constraints stemming from the chosen FPGA device. Furthermore, the pipelined architecture exploits the parallelism capabilities of configurable devices, thus providing even more viable solutions for real-time face recognition tasks.
