We present a custom hardware system for image recognition, featuring a dimensionality reduction network and a classification stage. We use Bi-Directional PCA and Linear Discriminant Analysis for feature extraction, and classify based on Manhattan distances. Our FPGA-based implementation runs at 75 MHz, consumes 157.24 mW of power, and can classify a 61 × 49-pixel image in 143.7 µs, with a sustained throughput of more than 7,000 classifications per second. Compared to a software implementation on a workstation, our solution achieves the same classification performance (93.3% hit rate), with more than twice the throughput and more than an order of magnitude less power.
Introduction
During the last few decades, automatic face recognition has evolved greatly. This development involved increased algorithmic complexity, which translates into long computation time and high energy consumption. Software implementations of most highly effective methods require the performance of a state-of-the-art workstation to operate in real time. Their cost, size and power requirements preclude their use in embedded and portable electronic systems. This has motivated the development of custom hardware implementations of face recognition algorithms, which can exploit the parallelism available in silicon integrated circuits, achieving high performance with much smaller die area and power than general-purpose microprocessors.
To accomplish high performance, neural networks based on linear subspaces are normally used, mainly because VLSI technology favors architectures that feature regular computation and local or structured communication. Often the network is trained (and retrained) offline in software and the coefficients are then transferred onto the chip, greatly simplifying the implementation.
This paper presents a custom hardware implementation of an image classification algorithm for face recognition. The algorithm uses Bi-Dimensional Principal Components Analysis (BDPCA) and Linear Discriminant Analysis (LDA) for dimensionality reduction and feature extraction, and a Manhattan distance metric to perform the classification. We implemented our design on a Xilinx Spartan 3 XC3S1000 FPGA and tested it with the Yale database of face images. Our chip, compared to a floating-point software implementation of the algorithm, achieves the same classification hit rate of 93.3% and classifies an image in half the time (143.7 µs), while consuming only 157.24 mW of power.
Design

Algorithm and testing
To select a suitable algorithm to implement, we analyzed different designs and assessed their classification performance and hardware cost. We considered four algorithms for feature extraction: Eigenfaces (PCA), Fisherfaces (PCA+LDA), BDPCA and BDPCA+LDA [1, 3, 5]. For these tests, both PCA and BDPCA project the images onto a 24-dimensional space, while LDA projects to 14 dimensions. For classification, we tested Euclidean and Manhattan distances.
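To make the dimensions concrete, the following minimal sketch applies a BDPCA projection followed by LDA to a single image. It assumes the usual two-sided BDPCA form, in which a row projection and a column projection reduce the image to a small matrix that is then flattened into the 24-dimensional vector; the 4 × 6 shape is the one reported later for the hardware. The function and variable names, the orientation of the image (61 rows by 49 columns), and the placeholder selector matrices are assumptions for illustration only, not trained coefficients.

```python
def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def extract_features(image, w_row, w_col, w_lda):
    """BDPCA followed by LDA (assumed shapes: image 61x49, w_row 61x4,
    w_col 49x6, w_lda 14x24)."""
    reduced = matmul(matmul(transpose(w_row), image), w_col)  # 4x6 matrix
    flat = [value for row in reduced for value in row]        # 24-element row vector
    return [sum(c * v for c, v in zip(lda_row, flat)) for lda_row in w_lda]  # 14 features


# Placeholder data: a synthetic image and selector matrices (NOT trained values).
ROWS, COLS = 61, 49
image = [[(r + c) % 256 for c in range(COLS)] for r in range(ROWS)]
w_row = [[1.0 if r == k else 0.0 for k in range(4)] for r in range(ROWS)]
w_col = [[1.0 if c == k else 0.0 for k in range(6)] for c in range(COLS)]
w_lda = [[1.0 if j == i else 0.0 for j in range(24)] for i in range(14)]

features = extract_features(image, w_row, w_col, w_lda)
```

With real coefficients, `w_row`, `w_col` and `w_lda` would come from offline training; only the shapes matter for this sketch.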
We used the Yale database as input data, which consists of 15 subjects, with 11 images of each, showing variations in lighting and facial expressions and details. The images are 8-bit grayscale, and were centered and resized to a resolution of 61 × 49 pixels using the Matlab Image Processing toolbox. We use 5 images of each subject for training, and 6 of each subject for testing. The procedure is performed 3 times, changing the selection for training and test images. We computed the mean classification hit rate and standard deviation for each experiment.
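The evaluation protocol above can be sketched as follows. The text does not say how the training and test images are selected in each repetition, so the random per-subject shuffle, the seeds, the use of the sample standard deviation, and all names below are assumptions made for illustration.

```python
import random
import statistics

SUBJECTS = 15
IMAGES_PER_SUBJECT = 11
TRAIN_PER_SUBJECT = 5  # the remaining 6 images per subject are used for testing


def split_dataset(seed):
    """Return (train, test) lists of (subject, image_index) pairs.

    Assumption: images are chosen at random per subject; the original
    selection rule is not stated in the text.
    """
    rng = random.Random(seed)
    train, test = [], []
    for subject in range(SUBJECTS):
        indices = list(range(IMAGES_PER_SUBJECT))
        rng.shuffle(indices)
        train += [(subject, i) for i in indices[:TRAIN_PER_SUBJECT]]
        test += [(subject, i) for i in indices[TRAIN_PER_SUBJECT:]]
    return train, test


def summarize(hit_rates):
    """Mean and sample standard deviation of the per-repetition hit rates."""
    return statistics.mean(hit_rates), statistics.stdev(hit_rates)


# Three repetitions, each with a different train/test selection.
splits = [split_dataset(seed) for seed in range(3)]
```

Each split yields 75 training and 90 test images; `summarize` would then be applied to the three hit rates obtained from the three repetitions.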
Fisherfaces achieves the best recognition rate (97.8%) with the lowest standard deviation. However, BDPCA+LDA achieves the second best performance (93%) with a memory usage 40 times lower than Fisherfaces and 3-4 times fewer arithmetic operations. Classification performs better with the Manhattan distance. Based on these results, we chose to implement our hardware solution with BDPCA+LDA for feature extraction, and Manhattan distance for classification.
Fig. 1(a) depicts the architecture of our image classifier. The classification is performed in three steps. First, the chip reads the image from external RAM and performs BDPCA dimensionality reduction. The second stage arranges the resulting matrix as a row vector and projects it onto the LDA feature space. Finally, the classifier computes the Manhattan distance between the vector and a database of reference face images in feature space, selecting the class based on the smallest distance. The BDPCA and LDA projection matrices and the classifier reference vectors are small, heavily reused, and require fast access; therefore, they are stored in on-chip RAM. These matrices and reference vectors are calculated offline in a computer and are then transferred to the chip.
Fig. 1(b) illustrates the three steps scheduled in a pipelined fashion. BDPCA projection is the slowest stage because it is the most computationally intensive, and is further limited by the external RAM access time. When the projected matrix is available, LDA projection starts concurrently with BDPCA projection for the next image. Distance computation and classification start as soon as the first results from LDA are available, thus the two stages largely overlap.
All 6 modules overlap their operation in a pipelined fashion. Because of limited external RAM bandwidth, the module processes 2 pixels of the input image every 7 clock cycles, with an initial latency of 13 cycles.
With a faster memory, the module could process 2 pixels every 3 clock cycles with a latency of 6 cycles.
Fig. 3 shows the LDA projector. As Fig. 1(b) shows, the module starts when BDPCA results are available in the register file. The block reads the 4 × 6 resulting matrix from BDPCA as a 24-element row vector. Module (1) stores the 14 × 24 LDA matrix using 16-bit coefficients, and streams it at a rate of 6 coefficients per cycle. Module (2) multiplies the 6 coefficients by 6 elements of the BDPCA vector, generating 6 results that are accumulated in a 3-stage pipeline by modules (3) to (4). Without this pipeline, the accumulation becomes the critical path of the circuit and limits the clock frequency. The module outputs 32-bit results, which are then used by the classifier.
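The data flow of the LDA projector described above can be modeled in software as a dot product evaluated six products at a time. The sketch below is a behavioral model only: it does not reproduce the pipeline timing, and because Python integers are unbounded it does not model the 16-bit coefficient or 32-bit result widths. The names `LANES` and `lda_project` are hypothetical.

```python
LANES = 6  # coefficients multiplied per step, matching the 6 per cycle in the text


def lda_project(lda_matrix, vector):
    """Project a 24-element vector with a 14x24 matrix, LANES products at a time."""
    outputs = []
    for row in lda_matrix:
        acc = 0
        for start in range(0, len(vector), LANES):
            coeffs = row[start:start + LANES]
            values = vector[start:start + LANES]
            products = [c * v for c, v in zip(coeffs, values)]  # parallel multipliers
            acc += sum(products)                                # adder tree + accumulator
        outputs.append(acc)
    return outputs


# Placeholder integer data (not real LDA coefficients).
matrix = [[(i + j) % 7 - 3 for j in range(24)] for i in range(14)]
vector = list(range(24))
projected = lda_project(matrix, vector)
```

Grouping the products this way gives the same result as a plain dot product; in hardware, the grouping determines how many multipliers operate in parallel.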
Architecture
Implementation of LDA
As with the BDPCA block, all operations in the LDA projector are pipelined. The block produces a 32-bit element of the output vector in the feature space every 8 clock cycles, with an initial latency of 6 cycles. 
Implementation of the classifier
The classifier, shown in Fig. 4, starts operating when the LDA projector produces its first result. Module (1) stores the 14-element vectors of the reference database, and streams four 32-bit coefficients per cycle. Modules (2) and (3) compute parallel subtractions, absolute values and accumulations to implement the Manhattan distance on 4 vectors simultaneously. Module (4) implements a Loser-Take-All (LTA) circuit that selects the smallest distance to classify the image. All operations are performed with 32-bit precision. Because the classifier overlaps its operation with the LDA projector, it outputs its result only 10 cycles after the LDA block finishes.
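A behavioral sketch of this classifier is given below: it computes Manhattan distances to the reference vectors in groups of four and keeps the smallest one, which is the role of the LTA circuit. The function names, the label list, and the tie-breaking rule (the first minimum wins) are assumptions; the text does not specify how ties are resolved, and the 32-bit arithmetic is not modeled.

```python
GROUP = 4  # reference vectors compared simultaneously, as in the text


def manhattan(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(features, references, labels):
    """Return (label, distance) of the nearest reference vector."""
    best_distance, best_label = None, None
    for start in range(0, len(references), GROUP):
        group = references[start:start + GROUP]
        distances = [manhattan(features, ref) for ref in group]  # parallel in hardware
        for offset, distance in enumerate(distances):
            # Loser-take-all: keep the smallest distance seen so far.
            if best_distance is None or distance < best_distance:
                best_distance, best_label = distance, labels[start + offset]
    return best_label, best_distance


# Placeholder reference database with 14-element vectors (not real face data).
references = [[10 * k] * 14 for k in range(5)]
labels = ["a", "b", "c", "d", "e"]
result = classify([19] * 14, references, labels)
```

Here the query vector is closest to the third reference, so `result` is `("c", 14)`.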
Experimental results
We synthesized a gate-level implementation of the classifier from our Verilog HDL code using the Synopsys Synplicity hardware compiler and mapped it onto a Xilinx Spartan 3 FPGA using the Xilinx place-and-route tool. The circuit achieves maximum throughput with a clock frequency of 75 MHz, limited by the access time of the external RAM. Table 1 shows the performance results for our implementation. We repeated the experiment of Section 2.1 with the Yale database on the FPGA, comparing its classification performance to the software implementation. Our fixed-point circuit achieves a hit rate of 93.3%, slightly better than the software for this particular dataset, but the difference is negligible (0.3%).
Fig. 4. Implementation of the classifier.
The circuit classifies one image in 143.7 µs, with a sustained throughput of 7,045 classifications per second. This is more than twice the throughput achieved by the software version running on a state-of-the-art workstation. Removing the limitation imposed by the external RAM, the performance more than doubles. We also estimated the power consumption using the Xilinx Xpower tool, showing that the circuit dissipates 157.24 mW. The utilization of combinational logic, flip-flops and RAM is less than 40%; the circuit is currently limited by the hardware multipliers (we use 22 out of 24 available on the chip). If we used a more complex classifier (e.g. an RBF network), we would need to reuse the multipliers with an increase in logic complexity.
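The reported classification time can be cross-checked against the pixel rates given for the BDPCA block with a rough estimate. The sketch below assumes that BDPCA dominates the total time and that pixels are consumed strictly in pairs; it ignores the LDA and classifier stages, so it is expected to land slightly below the measured 143.7 µs. The function name and constants are illustrative.

```python
import math

CLOCK_HZ = 75e6   # clock frequency reported in the text
PIXELS = 61 * 49  # 2,989 pixels per image


def bdpca_time_us(cycles_per_pair, latency_cycles):
    """Estimated BDPCA time in microseconds for one image."""
    pairs = math.ceil(PIXELS / 2)  # 2 pixels are processed per step
    cycles = pairs * cycles_per_pair + latency_cycles
    return cycles / CLOCK_HZ * 1e6


current = bdpca_time_us(7, 13)  # external RAM as implemented
faster = bdpca_time_us(3, 6)    # hypothetical faster memory
speedup = current / faster
```

Under these assumptions the estimate is about 139.7 µs with the current memory and about 59.9 µs with the faster one, which is consistent with the statement that performance more than doubles.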
Conclusions
We have described the architecture and implementation of a custom digital hardware implementation of a face recognition algorithm. Our solution uses BDPCA and LDA for feature extraction, and Manhattan distance for classification. We selected these algorithms based on their measured performance and their implementation cost in hardware. Our architecture makes use of parallelism and pipelined execution to maximize throughput within the available hardware resources. The resulting implementation on a Xilinx Spartan 3 FPGA achieves a 93.3% classification hit rate on several experiments using the Yale database, which is equivalent to a floating-point software version. Compared to software running on a state-of-the-art microprocessor, our circuit performs more than twice as fast, with a reduction of more than one order of magnitude in power.
