Abstract-Neural networks have been widely used in face recognition as reliable classifiers. In the proposed method, a neural network classifier with Canonical Signed Digit (CSD) coefficients is used to speed up the recognition system. The FPGA implementation of the proposed method indicates that high-speed recognition can be achieved by using a neural network classifier with CSD coefficients while maintaining a good recognition rate.
I. INTRODUCTION
Human face recognition is widely used as a noninvasive biometric method to identify or recognize a given face image using some features of a face in a stored database of faces [1]. It is an active area of research in pattern recognition and image processing. Its wide range of practical applications includes personal identity verification, video surveillance, facial expression extraction, advanced human-computer interaction and computer vision [1]. Several algorithms have been proposed to improve face recognition capability to a satisfactory level. However, practical applications of face recognition, such as controlling access to authorized systems, require very high speed of operation. It is therefore necessary to use a specialized hardware system for face recognition in order to meet real-time performance. This paper presents a specialized hardware implementation of the classifier for face recognition to provide high-speed performance.
The face recognition process involves three major steps: preprocessing, feature extraction and classification. The first step enhances the image quality, because images may suffer from noise and poor illumination; in preprocessing, the noise is removed and color normalization is applied. The second step, feature extraction, reduces the dimension and extracts the important features from the face images. The last step, classification, recognizes an unknown face image based on the features extracted from the database in the previous step.
The neural network (NN) is a well-known classifier for human face recognition because of its robustness and good learning capability. It can be implemented both in software and in hardware. However, the processing speed of a software system is low, which makes it impractical for applications that require high-speed processing. Therefore, to realize the full benefits of face recognition with a neural network classifier, it is important to implement the system in hardware, which offers high-speed recognition for real-time applications and also offers some flexibility over a software based NN by taking full advantage of the inherent parallelism of the NN architecture [2]. Hardware implementation can be digital, analog or hybrid, but digital implementation is more popular because it offers advantages such as greater flexibility and lower sensitivity to noise and fabrication difficulties. FPGA implementation is well suited to a digital hardware based NN because the device can be reconfigured and the parallel architecture of the NN can be preserved [5]. However, designing the multiplier in the neurons of the NN is challenging because it consumes most of the processing time. In this paper, a Canonical Signed Digit (CSD) [3] multiplier is used because it reduces the number of partial products and hence increases the speed. The CSD representation uses the ternary coefficient set {-1, 0, 1} and has some unique properties: 1) No two adjacent digits are nonzero. 2) The CSD representation of each number is unique. 3) For a y bit CSD number, the average number of nonzero digits is y/3 + 1/9 + O(2^(-y)) [3], which means that, on average, a CSD number contains about 33% fewer nonzero digits than a two's complement number.
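The CSD properties listed above can be illustrated with a short software sketch. The Python code below is not the converter used in the hardware design; it is a minimal illustration of one common way to derive CSD digits from a signed integer, and the function names `to_csd` and `from_csd` are hypothetical.

```python
from typing import List


def to_csd(value: int) -> List[int]:
    """Return the CSD digits of a signed integer, least significant digit first.

    Each digit is -1, 0 or +1. Hypothetical helper for illustration only.
    """
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives digit +1, value mod 4 == 3 gives digit -1.
            # Choosing -1 in the second case makes the next digit zero,
            # which is what guarantees "no two adjacent nonzero digits".
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def from_csd(digits: List[int]) -> int:
    """Rebuild the integer from CSD digits (least significant digit first)."""
    return sum(d << pos for pos, d in enumerate(digits))


# 7 is 0111 in binary (three nonzero bits) but 8 - 1 in CSD (two nonzero digits).
print(to_csd(7))  # [-1, 0, 0, 1]
```

In this sketch the digit list is stored least significant digit first, so position `pos` carries weight 2^pos.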
II. FEATURE EXTRACTION
The ORL face database [4] was used as the database of stored images. It contains 400 face images of 40 persons, and each image is 92x112 pixels in size. Each class includes 10 images of the same person in an upright, frontal position against a dark homogeneous background, with variation in facial expression. Principal Component Analysis (PCA) [6] [7] [12] was used for feature extraction from the image database. Half of the images were used for training and the rest were used for testing. The mean of the training set, …, is calculated from equation (1), and the covariance matrix, C, in (3) is computed from the mean-subtracted images in (2).
The eigenvectors are calculated for C in (3) and are used to compute the principal components, as given in (4).
The PCA feature vector is used to project both the training and the testing samples, which results in a corresponding set of weights for each image, as shown in (5). The arrangement of the set of sample weights is expressed in (6).
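For reference, the PCA steps described above are commonly written in the standard eigenface form sketched below. This is a textbook formulation, not a reproduction of equations (1) to (6): all symbols except C are assumptions (Q training images written as vectors \Gamma_i, mean \Psi, mean-subtracted images \Phi_i, eigenvectors u_k, weights \omega_k, K retained components) and may differ from the authors' notation and normalization. The block assumes the amsmath package.

```latex
\begin{align*}
\Psi &= \frac{1}{Q}\sum_{i=1}^{Q}\Gamma_i
  && \text{mean of the training images} \\
\Phi_i &= \Gamma_i - \Psi
  && \text{mean-subtracted image} \\
C &= \frac{1}{Q}\sum_{i=1}^{Q}\Phi_i\Phi_i^{T}
  && \text{covariance matrix} \\
C\,u_k &= \lambda_k u_k
  && \text{eigenvectors of } C \\
\omega_k &= u_k^{T}\,\Phi
  && \text{weight of a mean-subtracted image } \Phi \text{ on } u_k \\
\Omega &= [\omega_1, \omega_2, \dots, \omega_K]^{T}
  && \text{feature vector of that image}
\end{align*}
```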
Linear Discriminant Analysis (LDA) is another popular feature extraction method [8] [9] [10]. PCA lacks discrimination ability: because it considers all variations across the training samples, e.g. lighting variation and facial expression, it retains unwanted features, and it extracts the features that are important for representing a class rather than for separating classes [11] [12] [13]. In contrast, LDA uses face class information to find a subspace that better discriminates between different face classes, and it extracts the most effective features for class separability [11].
LDA is applied to the set of feature vectors obtained from the PCA projections of the training samples in (6), and it is used to find another subspace for a second projection [13, 14]. LDA therefore requires at least two training samples per class to calculate the scatter matrices. For M classes, the mean image of each class, μ_i, and the total mean, μ, of the n training samples can be calculated by:
The between-class scatter matrix and the within-class scatter matrix are given by:
Here the weighting term is the prior class probability. The advantage of applying LDA to PCA features is that it reduces the singularity problem of the within-class scatter matrix [12, 13] by creating another subspace in which the data are optimally projected according to the Fisher Linear Discriminant criterion, as expressed in (11).
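A common textbook form of the quantities described above is sketched below for orientation. It is not a reproduction of the paper's equations: the symbols X_i (set of PCA feature vectors of class i), n_i (number of samples in class i), P_i = n_i / n (prior class probability), S_b, S_w, W and w_k are assumptions, and the authors' exact normalization may differ. The block assumes the amsmath package.

```latex
\begin{align*}
\mu_i &= \frac{1}{n_i}\sum_{x \in X_i} x, \qquad
\mu = \frac{1}{n}\sum_{i=1}^{M}\sum_{x \in X_i} x \\
S_b &= \sum_{i=1}^{M} P_i\,(\mu_i-\mu)(\mu_i-\mu)^{T} \\
S_w &= \sum_{i=1}^{M} P_i\,\frac{1}{n_i}\sum_{x \in X_i}(x-\mu_i)(x-\mu_i)^{T} \\
W_{\mathrm{opt}} &= \arg\max_{W}
  \frac{\lvert W^{T} S_b W\rvert}{\lvert W^{T} S_w W\rvert} \\
S_w^{-1} S_b\, w_k &= \lambda_k w_k
\end{align*}
```

In this form the columns of the optimal projection are the eigenvectors w_k with the largest eigenvalues, and at most M - 1 of these eigenvalues are nonzero.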
Equation (12) gives the final set of eigenvectors of S for the largest eigenvalues [10]. The feature vectors obtained from PCA and from LDA were applied separately to the neural network classifier for the training and test images.
III. NEURAL NETWORK BASED FACE RECOGNITION
Fig. 1 shows the schematic representation of a feed forward neural network (FFNN) [15] with a single hidden layer and an output layer. The input feature vector, …, propagates through the network in the forward direction from the input layer to the hidden layer(s) and then to the output layer, whose neurons produce the final output, ….
The architecture of a neuron is illustrated in Fig. 2, where … are the synaptic weights of the neuron for the corresponding inputs, and the weighted inputs are added to the bias to get the output:
The final output of a neuron is found by passing this sum through the activation function. In the architecture of a neuron, the most time-consuming part is computing the weighted inputs by multiplication. With a CSD multiplier, the multiplication time can be reduced because CSD coefficients contain fewer nonzero elements, which speeds up the whole network.
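The neuron computation described above (weighted sum plus bias, followed by an activation function) can be sketched in a few lines of Python. This is a floating-point illustration only, assuming a tanh activation; the function names `neuron_output` and `layer_output` are hypothetical, and the hardware design works on fixed-point values instead.

```python
import math
from typing import List, Sequence


def neuron_output(inputs: Sequence[float],
                  weights: Sequence[float],
                  bias: float) -> float:
    """Weighted sum of the inputs plus bias, passed through tanh."""
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return math.tanh(net)


def layer_output(inputs: Sequence[float],
                 weight_rows: Sequence[Sequence[float]],
                 biases: Sequence[float]) -> List[float]:
    """Apply every neuron of one layer to the same input vector."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Two-layer forward pass: the hidden layer output feeds the output layer.
hidden = layer_output([1.0, 2.0], [[0.5, -0.25], [0.1, 0.2]], [0.0, 0.0])
final = layer_output(hidden, [[1.0, -1.0]], [0.0])
```

The multiplications inside `neuron_output` are the operations that the CSD multiplier replaces with shifts and additions in hardware.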
IV. HARDWARE IMPLEMENTATION
The ANN with CSD coefficients for human face recognition was first evaluated at system level, for PCA and LDA separately, using the Matlab neural network toolbox. After features are extracted with PCA and LDA from the ORL database, the feature vector contains 39 elements for each image. This vector is applied to the NN to train the network, and after learning, the weights were used for the hardware implementation. The hyperbolic tangent (tanh) activation function and the Levenberg-Marquardt (LM) training algorithm [16] were used for training the network.
A multilayer FFNN with a single hidden layer and a single output layer with 6 output neurons was designed for hardware implementation, where the number of neurons in each layer can be varied as required. The optimum numbers of hidden neurons for PCA and LDA were selected as 28 and 29, respectively, from Matlab training, because the recognition rates of the two networks were maximum for these values. The inputs and weights were transferred from Matlab. The network receives N sequential inputs, where N is equal to the length of the feature vector of an image. Each neuron has its own weight ROM, which holds the weights. The index of the current input determines the address of the weight ROM, and a counter provides the necessary synchronization. This process is repeated until all inputs have been applied serially to the hidden layer. Therefore, each hidden neuron takes N steps to produce its final output for N inputs.
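The serial operation of the hidden layer can be mimicked by a small behavioural model. The Python sketch below is an illustration under stated assumptions, not the VHDL design: the class `SerialNeuron` and the function `run_hidden_layer` are hypothetical names, integers stand in for the fixed-point values, and the bias and activation stages are omitted.

```python
from typing import List, Sequence, Tuple


class SerialNeuron:
    """Behavioural model of one hidden neuron that is fed one input per step."""

    def __init__(self, weight_rom: Sequence[int]) -> None:
        self.weight_rom = list(weight_rom)  # one stored weight per input
        self.counter = 0                    # address counter into the ROM
        self.accumulator = 0                # running sum of weighted inputs

    def step(self, x: int) -> bool:
        """Consume one input; return True once all N inputs have been used."""
        weight = self.weight_rom[self.counter]  # counter selects the weight
        self.accumulator += weight * x          # multiply and accumulate
        self.counter += 1
        return self.counter == len(self.weight_rom)


def run_hidden_layer(neurons: Sequence[SerialNeuron],
                     inputs: Sequence[int]) -> Tuple[int, List[int]]:
    """Feed the inputs one at a time to every neuron and count the steps."""
    steps = 0
    for x in inputs:
        for neuron in neurons:  # in hardware all neurons work in parallel
            neuron.step(x)
        steps += 1
    return steps, [n.accumulator for n in neurons]


layer = [SerialNeuron([1, 2, 3]), SerialNeuron([0, -1, 1])]
steps, sums = run_hidden_layer(layer, [4, 5, 6])
```

With three inputs the model needs three steps regardless of the number of neurons, which matches the N-step latency described above.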
Each neuron in the hidden layer receives 18 bit inputs serially, which are multiplied by the corresponding CSD weights. The multiplication process is carried out for N inputs. Fig. 3 shows the basic block diagram of a neuron, where each 18 bit binary weight is converted into CSD by a CSD converter and both the input and the CSD weight are applied to the CSD multiplier. The 18 bit CSD multiplier was designed to provide a 36 bit output, which an accumulator adds to the previous 36 bit weighted input saved in a register. The multiplier shifts the binary multiplicand according to the position of each nonzero coefficient digit and then adds or subtracts the shifted partial products according to the sign of that digit. The multiplication was also carried out with CSD coefficients from which different numbers of nonzero digits had been removed, which introduces some error into the network and allows its effect on performance to be observed. When multiplication and accumulation in a neuron are completed for N inputs, the result is passed to a LUT based tanh activation function to get the final 18 bit output of the neuron. The design of the output layer is similar to that of the hidden layer, except that each output neuron receives its inputs from the hidden layer in parallel and performs the multiplications in parallel. As a result, all output neurons provide the outputs of the network at the same time.
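The shift-and-add behaviour of the CSD multiplier, and the effect of removing nonzero digits from a coefficient, can be sketched in Python as follows. This is a software illustration, not the VHDL multiplier: the function names are hypothetical, word lengths are not modelled, and the choice to drop the least significant nonzero digits first is an assumption, since the text does not state which digits are removed.

```python
from typing import List


def to_csd(value: int) -> List[int]:
    """CSD digits (-1, 0, +1) of an integer, least significant digit first."""
    digits = []
    while value != 0:
        digit = 2 - (value & 3) if value & 1 else 0
        value -= digit
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x: int, digits: List[int]) -> int:
    """Multiply x by a CSD coefficient using only shifts, adds and subtracts."""
    acc = 0
    for pos, d in enumerate(digits):
        if d == 1:
            acc += x << pos   # add the shifted partial product
        elif d == -1:
            acc -= x << pos   # subtract the shifted partial product
    return acc


def drop_nonzero_digits(digits: List[int], count: int) -> List[int]:
    """Zero out `count` nonzero digits, least significant first (assumption)."""
    out = list(digits)
    remaining = count
    for pos, d in enumerate(out):
        if remaining == 0:
            break
        if d != 0:
            out[pos] = 0
            remaining -= 1
    return out


weight = to_csd(23)                      # 32 - 8 - 1
exact = csd_multiply(10, weight)         # 230
approx = csd_multiply(10, drop_nonzero_digits(weight, 1))  # 240, weight became 24
```

Each removed nonzero digit saves one partial product, at the cost of a coefficient that no longer equals the trained weight.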
The architecture of the ANN was coded in VHDL and simulated in ModelSim, and the input feature vectors of 200 test images were applied to the network to observe the recognition accuracy. The design was synthesized using Xilinx ISE and implemented in an xc6vlx760 FPGA from the Xilinx Virtex-6 family, which contains about 119k slices and 720 block RAMs of 36 Kbits each. Each slice contains four LUTs and eight flip-flops, and only some slices can use their LUTs as distributed RAM.
V. RESULTS
Table 1 presents the comparison of the accuracies obtained from Matlab and from ModelSim simulation. PCA-NN in hardware achieved 92% recognition accuracy, which is the same as in Matlab. For LDA, the accuracy is 94% at software level and 93.5% at hardware level. However, if the maximum number of correctly recognized classes is considered, the recognition rate of the hardware implementation improves by 3.4% for PCA-NN and 1.8% for LDA-NN. Table 2 presents the recognition rates for 40 subjects, for both PCA-NN and LDA-NN at hardware level, for each nonzero bit removed from the coefficients. For 18 bit CSD, every weight contains at most 9 nonzero bits, so removing some of these nonzero bits from the CSD coefficients adds some error, because the resulting weights are no longer exactly the same as the original weights produced by the network. For 1 and 4 bit reduction, the accuracy drops by 18% and 19% for the PCA and LDA networks, respectively.
The time requirements of the PCA and LDA networks from Xilinx synthesis are presented in Table 3, where the time for both networks decreases almost linearly with each nonzero bit removed from the coefficients. For 1 and 4 bit reduction, the network speeds up by 17% and 15% for PCA and LDA, respectively, but at the cost of accuracy, which also decreases as nonzero bits are removed from the weights. Table 4 presents the resource requirements obtained from the synthesis results, where the resources also decrease as nonzero bits are removed from the coefficients.
VI. CONCLUSIONS
Face recognition using PCA-NN and LDA-NN with 18 bit CSD coefficients was realized on a hardware platform for high-speed recognition. The accuracy was found to be almost the same as in Matlab, and a high recognition speed was achieved, which makes the design suitable for real-time applications. Moreover, the performance of the network was analyzed in terms of accuracy, time and resources as nonzero bits were removed from the CSD coefficients, and all three were found to decrease when error was added to the network in this way.
