Abstract. Adaboost algorithm based on Haar-like feature and integral image is made up of cascaded classifier, and the computation is very large. This paper uses the hardware and software co-design for implementing the face detection system on Zynq-7000.And realizes the functions of video acquisition, processing, real-time detection and display, and designs the hardware acceleration module of the integral image. At the same time, it improves the detection speed of the system by using dual-core operation, and the detection speed of the system is obviously improved compared with the pure software solution. This design is verified on ZedBoard. It can work with high resolution stream for real-time face detection.
Introduction
In recent years, the issue of information security has received more and more attention. The traditional identification technology can't meet the requirements of modern Society. As one of the methods of biometric recognition, face detection is a key step in human face detection. But its speed has been relatively slow until P. Viola and Jones proposed the cascade algorithm AdaBoost in year 2001 [1] . AdaBoost is the turning point of face detection technology, which makes the detection speed and detection rate improve at the same time.
Most of the traditional face detection systems are based on PC platform or DSP processor, but these two schemes are relatively poor in portability and cost. T.Theocharides proposed hardware architecture based on array-based algorithms [2] , but the architecture requires a lot of hardware resources. In 2008, Shi Yuehua and others proposed a software and hardware co-design, using the pipeline structure to further improve the detection rate [3] .
This design is based on ZYNQ platform of ARM + FPGA architecture with hardware and software co-design to achieve the face detection system [4] . We design a parallel hardware architecture based on the proposed method to accelerate the calculation of integral image. And the image preprocessing module is implementation with hardware; In addition, using the multi-core structure of ARM to further improve the speed of face detection. Integral image is Viola and Jones proposed in 2001 and applied to eigenvalue calculation [1] .The integral image ( , ) II x y at location ( , ) xy is contains the sum of the pixels in the region determined by an origin and the ( , ) xy point, as shown in Fig. 1 Figure 2 . AdaBoost detection process Figure 3 . AdaBoost detection system
The core idea of AdaBoost cascade classifier is to classify the classifiers to complete the face detection where the order is the simple classifier first and then complexed [6] . The whole algorithm detection process is shown in Fig. 2 . First, the trained cascade classifier is loaded, inputting the image, and the integral image of the original image is calculated. Next, each window of NxN (N is the detection window length or width) is scanned. For each window, each classifier is traversed in order, and the each feature in the classifier are calculated and compared with the feature thresholds to select the influencing factors. In each strong classifier, the sum of all the feature influencing factors represents the similarity of the window and the human face, and finally the similarity is compared with the strong classifier threshold  . If the similarity is greater than  , the window is regarded as a 
DDR3
Processing System AXI Bus face window, and the next strong classifier test can be continued. The window through all strong classifiers is regarded the human face window. In order to adapt to different sizes of the face, after each test, enlarging the detection window by proportion and repeat detection process until the size of detection window closes to the picture size. Finally, merging the face windows, that is, the face windows with the same size and position are needed to merge into one face window and output [7] . Hardware and Software Co-design of System Architecture In this paper, hardware and software co-design process are based on SoC system, build the entire system in a single chip, and divided into hardware platforms and software implementation. The experimental platform is the Zynq chip. In addition, we also need the CMOS camera and the VGA interface. The division scheme of the hardware and software is: the hardware part is in charge of the video signal acquisition, image preprocessing, integral image calculation and VGA display driver module and other functions; the main detection process is implemented in software.
Hardware Platform. The basis of the hardware part is video channel implementation based on ZYNQ. As shown in Fig. 3 , the video processing and display platform.
Video data from the camera divides into two channels, one is the original video path, the other sends data to the video preprocessing module, and then sends data to the DDR through VDMA, after the processor detecting and then reading the results from the DDR with VDMA. At last, the two video channels is merged by V-OSD (Video On Screen Display) module and output to the VGA screen. The integral image is the basis of calculating Haar-like features, and the computational complexity is large. Image preprocessing includes gray scale and scale scaling of the original image. The hardware accelerator mainly deals with these two parts to improve the detection speed.
Software Section. AdaBoost algorithm detection process mainly complete in the software part (ARM Cortex-A9). Because of the development platform contains dual-core Cortex-A9 processor, using the dual-core structure, we can speed up the face detection process. The advantage of multi-core processing is that it can perform multiple tasks at the same time, so the face detection process can be divided into two parts in processors, each processor to complete half of the task detection, theoretically it can double the detection speed. The memory address of each kernel is allocated properly to prevent errors. At the same time, the dual-core communicate through the OCM to ensure that the synchronization of two processors during running the detection process [8] .
The detection process is running in two processors, which are the master-core and sub-core. While detecting, the master core start the sub-core to run, when the master core complete its own task and also detect the task of sub-core are completed. It will merge the results from two cores. As a result, the rectangle information of the face position is output.
Dedicated Hardware Acceleration
The hardware acceleration module is mainly used to calculate the integral image quickly and improve the calculation speed of Haar-like features. The definition of the integral image is shown in the Eq.1. It can also be obtained by the Eq.2 .The first term on the right part of Eq.2 is the value of the integral image at the point 01 ( , ) xy, so we can obtain Eq.3. Hence, when the integral image of region A is known, if we want to compute the integral image values in region B, we can regard B as an independent region and compute its integral image; then, we add each row of integral values with the rightmost value at the same row in the integral image of A (Fig. 4(a) ) [9] . Based on this architecture, we can compute the integral image values with blocks of original image in hardware design, and use pipelining and parallel structure to speed up the whole system. We use a 4x4 image to illustrate this pipeline calculation unit. As shown in Fig. 5 (a) , we can complete the accumulation that interval 1 
21
n  at the cycle n , and get the accumulation of 2 n pixels, n is calculation cycle numbers. The premise of the pipeline structure of the image is a whole line of pixel data input at the same time. For image of large size, we could not access a whole row simultaneously as the input or output data width of memory that stores image pixels is limited. To solve this problem, we refer to the analysis of Eq. (3) and Fig. 4 (a) , divide the image into serval "strips" (Fig. 4 (b) ). We first compute the integral image of strip 1. Then we use the property mentioned above to compute the integral image values in strip 2 as an independent image, and add each row of strip 2 with the rightmost value at the same row in the integral image of strip 1 to obtain the integral image value of region strip 2. Similarly, we can compute the integral image values of the rest strips. These strips are compute in a pipeline way. For an mn  image, if the width of each strip is w (Usually m could be exactly divided by w ), there would be / mw strips, each strip requires 2 log w steps to compute the integral image [10] . In experiments, we find that when =32 w , the architecture achieves the highest power efficiency and area efficiency. Hence, the structure of "strips" are shown in Fig. 5.(b) .
Verification
In this paper, we implement the face detection system on ZedBoard development board based on ZYNQ-7000. Design uses software and hardware co-design, software needs to be modified because of the hardware accelerator, which not only deal with the detection work, but also responsible for hardware control, dual-core start and run, detection window enlarge and the result merging. After the system starts, the camera can capture the images in real time, and the result can be seen on the screen. As shown in Fig. 6 , the original image captured by the camera and the face detection result.
The ZedBoard-Zynq-7000 resources consumed by the whole system are shown in Table 1 . The system uses 22 classifiers, including 2135 Haar-like features. The images of 640 480  captured by the camera transit into DDR through the VDMA, the software detect and transit the result to the VGA display through the VDMA in the real-time. The SoC solution for face detection which uses an 
Conclusion
In this brief, we design and implement the face detection system using SW/HW co-design based on ZedBoard. On the Zedboard platform, the pipeline parallel structure of integral image calculation and image preprocessing implement on FPGA, software part use multi-core to speed up the detection.
