Abstract. In this paper an FPGA based embedded vision system for face detection is presented.
Introduction
can be found for example in the paper [24]. At the outset it is worth mentioning the main challenges associated with face detection. The first is the relatively high variability of shape and appearance between different people (including the presence of a beard, a moustache, different hairstyles, glasses, etc.). Furthermore, in most cases faces appearing in different poses (frontal, profile, rotated) should be detected. Another problem is related to the background of the captured images. In the simplest case it is a homogeneous surface, but it could also have a rich texture or even include other faces (e.g. a crowd). In addition, lighting conditions and the presence of shadows (e.g. on half of the face) are very important factors.
M. Drożdż, T. Kryjak
One of the most "obvious" methods is based on skin colour detection. In the YCbCr or HSV (Hue, Saturation, Value) colour space it is possible, at least in theory, to segment parts of the scene with a skin-like colour. In the next stage, after connected component labelling, objects with a specific shape (oval) and size are selected. The main advantage of this method is its simplicity and computation speed. On the other hand, its reliability is quite limited due to high sensitivity to illumination changes. However, it can be used as an auxiliary part of a face detection system, where it makes it possible to reduce the number of candidates or to verify the detection results. An example of such a solution is described in the work [6].
The most common and widely used approach is undoubtedly the method proposed by P. Viola and M. Jones [20]. It is available in the popular image processing library OpenCV and in Matlab, and is the basis for many commercial solutions. Its operation is based on the "classic" scheme: feature extraction followed by classification. In the first stage Haar features are used, i.e. simple rectangular patterns that make it possible to analyse local changes in brightness. It is worth noting that a significant acceleration of the calculations can be achieved using the so-called integral image. In the classification stage a cascade of weak classifiers is used. Its architecture is obtained with the AdaBoost machine learning algorithm. This solution rejects a great number of false candidates at an early stage and is therefore relatively fast (at least on sequential architectures).
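The speed-up offered by the integral image can be illustrated with a short sketch. It is not taken from [20] or from the described system; the function names and the zero-padded table layout are assumptions made for illustration. Once the table is built, the sum over any rectangle needs only four look-ups, regardless of the rectangle size.

```python
import numpy as np


def integral_image(img):
    """Summed-area table with an extra zero row and column (assumed layout)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] from four table look-ups."""
    bottom, right = top + height, left + width
    return int(ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left])


def two_rect_haar(ii, top, left, height, width):
    """Illustrative two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))
```

For example, with `img = np.arange(12).reshape(3, 4)`, the call `rect_sum(integral_image(img), 1, 1, 2, 2)` returns the same value as `img[1:3, 1:3].sum()`.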
Other methods, based on a similar scheme, involve LBP (Local Binary Patterns) or HOG (Histogram of Oriented Gradients) features. The first approach is presented in the work [25]. The authors claim that LBP (especially the multi-block variant) makes it possible to obtain good detection results and to reduce the computational complexity. HOG features and an SVM (Support Vector Machine) classifier were used in the work [23]. Both issues will be discussed in detail later in this paper.
In recent years, deep convolutional neural networks (DCNN) have gained very strong interest in the image processing community, mainly due to their very good detection results (currently DCNNs are used by major technology companies like Facebook, Microsoft or Google). This method consists of two components: a number of convolutional and sub-sampling layers (feature extraction step) and a fully connected neural network (classification). All parameters for this method are obtained during a learning process, which requires very large image datasets. It is worth emphasising that the feature extraction stage is also "learned" and not manually designed. The main disadvantage of this solution is its computational complexity (the networks are very large). This forces the use of high performance, but relatively energy inefficient GPU (Graphics Processing Unit) cards or dedicated ASICs (Application Specific Integrated Circuits). A DCNN based face detection system is described in the work [13].
Designing an embedded vision system is quite a complex issue. At the very beginning it is necessary to choose an appropriate computing platform, which should satisfy two, often conflicting, requirements. First, real-time processing should be supported (i.e. processing of all data transmitted by the camera). Second, the solution should be energy efficient. Among the possibilities worth mentioning are: general purpose processors (CPUs, Central Processing Units, or GPPs, General Purpose Processors), ASICs (Application Specific Integrated Circuits), reprogrammable FPGAs (Field Programmable Gate Arrays) and heterogeneous systems like Xilinx's Zynq (a combination of an ARM processor and reprogrammable logic in one package). These platforms have their advantages and disadvantages, an in-depth discussion of which is beyond the scope of this article.
In the described application an FPGA based solution was used. It ensures high performance and flexibility, while maintaining a relatively low energy consumption. In addition, a number of different vision systems were previously implemented in FPGAs, from simple convolutional filtering to quite advanced detection and recognition algorithms and even DCNNs. An overview can be found in the book [1].
During the process of designing the described embedded face detection system the following requirements were considered:
• obtaining a relatively high detection accuracy, with a small number of false positives,
• providing easy scalability for resolutions from 720 × 576, through 1280 × 720 and ultimately even for HD 1920 × 1080,
• keeping real-time performance (at least 50 frames per second),
• supporting multi-scale face detection,
• implementing the algorithm in a pipeline vision system on an FPGA device.
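To give a sense of what the real-time requirement means for a pipelined design, the active pixel rate can be estimated as below. This is only a back-of-the-envelope sketch: the function name is an assumption, and blanking intervals of the video signal are ignored, so the actual pixel clock is higher.

```python
def active_pixel_rate(width, height, fps):
    """Number of image pixels that must be processed per second (blanking ignored)."""
    return width * height * fps


# Resolutions taken from the requirements list, at 50 frames per second.
for width, height in [(720, 576), (1280, 720)]:
    rate = active_pixel_rate(width, height, 50)
    print(f"{width} x {height} @ 50 fps: {rate / 1e6:.2f} Mpixel/s")
```

Since a pipelined system receives one pixel per clock cycle, every module has to accept a new pixel at this rate without stalling.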
The main contributions of this paper are:
• the first (to the best of the authors' knowledge) HOG+SVM based face detection implemented in an FPGA,
• a simple, yet effective multi-scale processing approach,
• a detailed description of the designed hardware modules.
This paper is organized as follows. First, in Section 2 previous FPGA implementations of face detection algorithms are reviewed. Then, in Section 3 the proposed solution is discussed. Finally, in Section 4 the designed hardware modules are presented. The paper ends with a short summary and an indication of future research directions.
FPGA implementations of face detection algorithms
As already mentioned, face detection is an issue of great practical importance. Therefore, a number of FPGA implementations of different algorithms can be found in the literature. Some of them are described below.
In the work [6] skin colour segmentation based on a modified RGB model was used. The obtained binary mask was subjected to filtration (morphological closing).
Then, parameters of the detected objects were analysed (area, shape). The described system has been imple-
A similar solution was described in the work [14]. Skin-region detection on an RGB image was supported by foreground object segmentation (block based running average approach). The whole module was described in SystemC and VHDL. The presented simulation results indicate that real-time computation is possible.
A system which combines skin colour segmentation, lips detection and modified LBP features was presented in [21]. The YCbCr colour space, morphological opening and connected component labelling were used to detect skin-like areas. The lips were segmented using R/G and Y component thresholding. LBPs were used to detect horizontal edges. The integration of these three characteristics made it possible to obtain high reliability (the authors reported 94.9% accuracy, but did not specify the database used).
The system has been verified in practice on the DE2-70 board with Cyclone II FPGA from Altera. Real-time processing for 320×240 @ 30 fps video stream was obtained.
In the work [12] a simple two-layer neural network for face detection, using floating point calculations and implemented in a Spartan 3 FPGA device, was presented. The grayscale input image patches had a size of 23 × 17 pixels. The authors reported a 38 times speed-up in comparison to a software implementation. However, no data on accuracy was provided.
The vast prevalence of Viola-Jones based approaches [20], mentioned earlier, is also reflected in the number of described FPGA implementations. In the paper [11] a system capable of processing 143 frames of 640 × 480 pixels was proposed. It consisted of several modules. The first one allowed multi-scale image processing (a double buffering approach was used). Furthermore, two clock domain processing was used to increase system perfor-
In the article [7] an embedded implementation of the [7]. The system was practically verified in a Virtex 5 FPGA device. For the considered resolution 100 fps were obtained.
In the work [4] a face detection system able to process a 640 × 480 @ 60 fps video in real-time was proposed.
The architecture was quite similar to the previous ones. 
The proposed solution
The proposed solution consists of several modules: image scaling, feature extraction (HOG or LBP) and SVM classification. These components are discussed later in this section. Finally, the effectiveness of the HOG+SVM and LBP+SVM approaches is compared.
Sliding window and multi-scale object detection
At the outset, the typical sliding window based approach should be discussed. Let us suppose that a module which accepts input image patches of size M × N and returns a binary classification result is available (this can also be generalized quite straightforwardly to multiple object classes). The detection window is usually significantly smaller than the input image resolution (e.g. 20 × 20 vs. 1280 × 720). Thus, the detection should be performed for all possible locations in the image. In the basic case, this means moving the window with a one pixel stride horizontally and vertically. However, due to the considerable computational complexity, a larger step is usually used.
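The effect of the stride on the number of classified windows can be illustrated with a minimal sketch. The function names are assumptions introduced for illustration and the stride of 4 is only an example value, not a parameter of the described system.

```python
def window_positions(img_w, img_h, win_w, win_h, stride):
    """Yield the top-left corner (x, y) of every window that fits in the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield x, y


def count_windows(img_w, img_h, win_w, win_h, stride):
    """Number of windows that have to be classified for a single scale."""
    return sum(1 for _ in window_positions(img_w, img_h, win_w, win_h, stride))
```

For a 20 × 20 window in a 1280 × 720 image, `count_windows(1280, 720, 20, 20, 1)` gives 883961 positions, while a stride of 4 reduces this to 55616.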
Furthermore, the classification module is usually able to detect only objects of a certain size (exactly fitting into the detection window). On the other hand, objects of different sizes occur in real-life images, e.g. people at various distances from the camera. In order to detect them, one of two approaches can be used:
• several modules with different detection window sizes,
• input image scaling.
and angle are calculated:
In case of colour images (e.g. in RGB) these opera-
to the "lower" histogram bin, θ_h is the angle corresponding to the "higher" histogram bin and H is the histogram, then the mentioned interpolation can be described with the following equations:
Then, the normalized feature vector is computed using one of the following equations:
• L2-Hys: like L2, with an additional limit of 0.2 on the values of v,
where $\|v\|_k$ is the $k$-norm ($k = 1, 2$) and $\epsilon$ is a small constant.
Normalized histograms for the whole detection window form the final feature vector, which is then passed to the classifier. The stages described above are schematically presented in Fig. 2.
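The block normalization step can be sketched as follows. The formulas follow the commonly used definitions of the L2, L2-Hys and L1 schemes for HOG and are an assumption here, since they are not guaranteed to match the equations of this paper term by term. The function names and the value of `eps` are likewise illustrative.

```python
import numpy as np


def l2_norm(v, eps=1e-3):
    """L2 scheme (assumed form): v / sqrt(||v||_2^2 + eps^2)."""
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)


def l2_hys(v, eps=1e-3, limit=0.2):
    """L2-Hys: L2 normalization, limiting the values to 0.2, then renormalizing."""
    v = np.minimum(l2_norm(v, eps), limit)
    return l2_norm(v, eps)


def l1_norm(v, eps=1e-3):
    """L1 scheme (assumed form): v / (||v||_1 + eps)."""
    return v / (np.sum(np.abs(v)) + eps)
```

The limit in L2-Hys reduces the influence of a few very strong gradients, so that a single dominant edge does not suppress the remaining bins of the block.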
Feature extraction -Local Binary Patterns
Local Binary Patterns (LBP) were first introduced by T. Ojala [17]. They are considered good local texture descriptors. Furthermore, these features are computationally very efficient and invariant to local illumination changes.
The basic LBP is defined as:
where: $i_n$ ($n = 0, 1, \ldots, 7$) are the pixel intensities from a 3 × 3 context, $i_c$ is the intensity of the centre pixel and $P$ is the number of sample points.
The LBP could be visualised as an 8-bit integer number.
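A minimal sketch of the basic LBP computation for a single 3 × 3 context is given below. It assumes the commonly used definition, in which bit n is set when the neighbour intensity is not smaller than the centre intensity. The order in which the neighbours are mapped to bits and the function name are assumptions made for illustration.

```python
import numpy as np

# (row, column) offsets of the 8 neighbours; the bit order is an assumption.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(patch):
    """Return the 8-bit LBP code of a 3 x 3 patch (centre at patch[1, 1])."""
    centre = patch[1, 1]
    code = 0
    for n, (dr, dc) in enumerate(NEIGHBOURS):
        if patch[1 + dr, 1 + dc] >= centre:  # thresholding against the centre pixel
            code |= 1 << n                   # weight 2^n for neighbour n
    return code
```

Because only comparisons with the centre pixel are used, adding a constant to all intensities in the patch leaves the code unchanged.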
Moreover, some modifications of the basic LBP pattern descriptor were introduced. In
The following computations are quite similar to HOG. A histogram is computed in cells of e.g. 32 × 32 pixels. Usually no normalization is required. The histograms are directly used as the feature vector.
Support Vector Machines
Support Vector Machines (SVM) are one of the most popular and widely used binary classifiers. The method was proposed by V. Vapnik [19] . 
Hardware implementation
Hardware implementations of the HOG+SVM algorithm have been previously described in the literature in the context of pedestrian detection. For example, in the work [8] a floating point FPGA module was proposed and in [18] a multi-scale and high resolution system was presented.
A general scheme of the proposed vision system is presented in Fig. 4. For each scale an independent HOG+SVM object detection is performed (Subsections 4.3 and 4.4). In the last step the best face candidate is determined, first for a given scale (the best detection) and then among the three considered scales (the best detection in scales). In the current version only one "best" face is detected (mainly due to the target application, driver monitoring). However, implementing multi-face detection is rather straightforward. The detection is visualized on the output video stream (module mark face). Additionally, the scale in which the face was found is provided by the signal scaleOut.
Multi-scale image processing
A common feature of sliding window based detectors is the rather small size of the detection window (for example, in the Viola-Jones approach it is 20 × 20 pixels). This reduces the cost of the computationally complex classification process. In the considered application the window size was set to 64 × 96. Therefore, bigger or smaller faces could not be detected directly and it was necessary to use the approach described in Section 3.1.
In the considered system the input image has a 1280 × 720 pixel resolution. It is scaled to three smaller images
The expected coordinates were calculated according to the following formula:
The module is based on two sets of counters. The first determines the coordinates of the currently considered pixel in the input image. The second does the same for the output image. On this basis it is possible to calculate the so-called expected coordinates, cf. Equation (12).
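The idea of the two counter sets can be sketched in software as follows. The mapping used for the expected coordinates (output coordinate multiplied by the scale and rounded) is an assumption and may differ from Equation (12). The function name, the zero-based coordinates and the example scale are also illustrative, and the low-pass filtering is omitted.

```python
import math


def downscale_stream(width, height, scale):
    """Return the input coordinates at which an output pixel is produced."""
    emitted = []
    y_out = 0                                       # output row counter
    for y_in in range(height):                      # input row counter
        y_e = math.floor(y_out * scale + 0.5)       # expected row (assumed formula)
        if y_in != y_e:
            continue
        x_out = 0                                   # output column counter
        for x_in in range(width):                   # input column counter
            x_e = math.floor(x_out * scale + 0.5)   # expected column (assumed formula)
            if x_in == x_e:
                emitted.append((x_in, y_in))
                x_out += 1
        y_out += 1
    return emitted
```

For an 8 × 4 input and a scale of 2.0 the output pixels are taken at columns 0, 2, 4, 6 of rows 0 and 2. The valid output pixels therefore appear at irregular moments in the input stream, which is the reason why a streaming implementation needs a data valid flag.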
In the first step the x_e, y_e values are computed (it is assumed that the initial position of the output image is (1, 1)). Then, if the coordinates of the current pixel in the input image are equal to the expected ones, the scaled pixel value is determined (grayscale f). It is the result of a 5 × 5 Gaussian low-pass filtering with coefficients given in matrix (13). The coefficients sum to 32 and therefore the required division operation can be realized by a simple
Scaled image reformatting
The applied scaling procedure results in "uneven" pixel
It is worth noting that the module used also allows the scaled image to be displayed on the screen (the valid pixels are then located on the right side). For this purpose, the original data enable signal (deOldOut) is used. However, for pixels without valid data the output is set to 0.
HOG
A general scheme of the HOG feature vector computation module is presented in Fig. 8 . At the outset it should be emphasized that the proposed sliding window approach was designed to operate on a continuous pixel stream. Fig. 6 ).
Then the edge modulus (magnitude) and angle were computed according to Equations (1) and (2). To realize the square root and arctangent functions, the CORDIC IP cores provided by Xilinx were used [22]. It is worth noting that in the case of the arctangent it was necessary to design some additional logic to scale the input data and to handle some specific cases like division by zero.
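A software reference for this stage, useful for instance when verifying the hardware results, could look as follows. It assumes the standard definitions (modulus as the square root of the sum of squared gradient components, unsigned angle in degrees). The function name is an assumption, and the library functions stand in for the CORDIC cores.

```python
import math


def modulus_and_angle(gx, gy):
    """Return the gradient modulus and the unsigned angle in the range [0, 180)."""
    modulus = math.hypot(gx, gy)             # sqrt(gx^2 + gy^2)
    angle = math.degrees(math.atan2(gy, gx)) # signed angle in (-180, 180]
    return modulus, angle % 180.0            # fold opposite directions together
```

Using the two-argument arctangent avoids the explicit division gy / gx, so the case gx = 0 needs no special handling in this software model, unlike in the hardware module.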
Bin choice, modulus division
In the next step, the appropriate histogram bins were selected (bins choice). A scheme of the module used is presented in Fig. 9. First, the input angle (θ) in the range [0; 180] is multiplied by 9/180 (number of bins divided by angle range). This gives the lower bin index (θ_lindex) for this angle. Typically the upper bin is defined as θ_hindex = θ_lindex + 1. However, when θ_lindex = 8, then θ_hindex should equal 0 (wrap). This case was handled by the module bin. Then, using θ_lindex, the distance between the current angle θ and the lower histogram bin centre (θ_l, stored in module bin centres) was determined. Multiplying the difference by the constant 9/180 gives the first scaling factor, cf. Equation (3). Subtracting this value from 1 gives the second scaling factor, cf. Equation (4). Finally, the input modulus was multiplied by the scaling factors. The outputs of this module were therefore two bin indexes (θ_lindex as bottomBin and θ_hindex as topBin), as well as two scaled moduli (bottomModulus and topModulus).
A single histogram update involves the following steps:
• reading data from the memory (the current histogram bin value),
• bin value update,
• saving the new data in the memory.
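The bin choice and the histogram update can be modelled in software as in the sketch below. Two assumptions are made: the bin centres lie at multiples of 20 degrees, and the lower bin receives the weight that decreases with the distance from its centre. The signal names bottomBin, topBin, bottomModulus and topModulus from the text are mirrored by the returned values, while the function names are introduced for illustration only.

```python
import math

N_BINS = 9  # number of histogram bins over the range [0, 180]


def bin_choice(theta, modulus):
    """Split one gradient vote between two neighbouring histogram bins."""
    pos = theta * N_BINS / 180.0           # angle expressed in bin units
    low = int(math.floor(pos)) % N_BINS    # lower bin index
    high = (low + 1) % N_BINS              # upper bin index, wraps 8 -> 0
    w_high = pos - math.floor(pos)         # (theta - theta_l) * 9/180
    w_low = 1.0 - w_high
    return low, high, modulus * w_low, modulus * w_high


def update_histogram(hist, theta, modulus):
    """Read, update and write back the two affected bins."""
    bottom_bin, top_bin, bottom_modulus, top_modulus = bin_choice(theta, modulus)
    hist[bottom_bin] += bottom_modulus
    hist[top_bin] += top_modulus
    return hist
```

For θ = 170 and a modulus of 4 the vote is split equally between bins 8 and 0, which is the wrap case mentioned above.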
In a straightforward implementation, to meet the real-time processing requirement all these steps should be performed in a single pixel clock cycle. This is possible, assuming that a dual port memory with independent read and write ports, as well as a zero-latency adder, are used. Furthermore, a correct implementation also requires proper handling of multiple updates of the same bin. The mentioned issues are described in more detail in the paper [10].
What is more, in the considered case two histogram bins should be updated in one clock cycle (topModulus and bottomModulus). Finally, the result was subjected to a square root operation and the value was stored in a FIFO buffer. From there is
SVM
After computing the HOG features, the SVM classification was performed. This operation involved multiplying each normalized histogram bin value by a weight, summing up the products and adding a constant bias, cf. Equation (10). As stated earlier, the detection window size used was 64 × 96 pixels. Therefore, as the cell size was 32 × 32, only 2 blocks were considered. Thus, the feature vector had 72 elements (72 = 2 · 36 = 2 · (4 · 9)).
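The classification step described above reduces to a dot product followed by a bias addition, which can be sketched as follows. The function names and the decision threshold of zero are assumptions made for illustration. The weights and the bias would come from the SVM training and are not specified here.

```python
FEATURE_LENGTH = 72  # 2 blocks x 4 cells x 9 bins


def svm_score(features, weights, bias):
    """Linear SVM response: weighted sum of the features plus a constant bias."""
    if len(features) != FEATURE_LENGTH or len(weights) != FEATURE_LENGTH:
        raise ValueError("expected 72-element feature and weight vectors")
    return sum(f * w for f, w in zip(features, weights)) + bias


def is_face(features, weights, bias):
    """Binary decision, assuming that a positive response indicates a face."""
    return svm_score(features, weights, bias) > 0.0
```

In a streaming implementation the products can be accumulated as the normalized bins arrive, so the whole feature vector does not need to be stored before the classification starts.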
A scheme of the designed module is presented in 
System integration and evaluation
The above discussed hardware modules have been de-
This particular platform was selected due to the planned future work on driver fatigue monitoring. However, in this project, the ARM processor system was not used.
The Sony HDR CX280 camera was used as the video stream source. It provided a 1280 × 720 @ 50 fps HDMI signal. The results were displayed on an LCD monitor (connected to the VGA output). The use of logic resources is summarized in Tab. 3. The proposed face detection module consumes over 60% of the available resources of the smallest device in the Zynq family. This result is somewhat ambiguous. On the one hand, it should motivate further optimization of the module, but on the other there is still some logic left for other operations. Fig. 14.
