Abstract-Face localization is an important research direction in face recognition technology. Its aim is to segment the face from the background of the image under inspection. The technology is widely used in areas such as identity verification, human-machine interaction (HMI), visual communication, virtual reality, and public records. In this paper, we first construct a skin color model and calculate the skin similarity of the image data, then calculate the face boundaries by spatial projection. To improve real-time performance, we use the Xilinx high-level synthesis tool Vivado HLS (AutoESL) to generate a hardware module from a C program on the Zynq platform. Using this hardware module for face location greatly improves the computing speed. The simulation results show that the proposed method works well and that efficiency increases by 80%.
I. INTRODUCTION
With the development of science and technology and the continuous progress of human society, computer vision has entered every aspect of social life. Based on visual analysis, the biometric features of the face are widely used in identity verification, HMI, visual communication, virtual reality, and similar applications. Face location is the first step in face feature analysis; it is responsible for detecting and localizing the face in a digital image, and it is an important research topic in face recognition and facial expression analysis. Since Viola and Jones first proposed a rapid real-time detection method in 2001 [1], many face location algorithms have been presented, such as region segmentation, gray projection, edge detection, and template matching. In [2], region segmentation is applied, but it cannot find the correct position when the subject wears glasses. In [3], gray projection is used; however, its accuracy is affected by changes in face pose. In [4], edge detection is adopted, but it requires a large amount of data preprocessing and cannot be implemented fully in real time. Meanwhile, as image resolution increases, a single CPU platform can no longer meet the demand for rapid detection. Parallel computing on multi-core GPUs is widely used in compute-intensive applications. Because face location is inherently parallel, more and more researchers consider parallel processing to accelerate detection. NVIDIA launched the CUDA computing platform in 2007, and CPU and GPU co-processing is presented in [5]. Although such platforms offer powerful computing capability, their size and power consumption remain high. At the same time, because of the heavy demand for mobile detection terminals, embedded face detection systems have attracted attention.
In [6], a parallel accelerated face detection system is implemented, but the system must be revised whenever the image size changes, and its overhead is large. In [7], a GPU-accelerated face detection method achieves some speedup, but its GPU load is unbalanced, which limits the acceleration.
In this paper, we present an embedded face location system based on the Zynq-7000.
The Zynq-7000 series integrates a dual-core ARM Cortex-A9 (PS) and Xilinx 7 series FPGA fabric (PL) on a single chip [8], and was the first product family to combine high-performance ARM Cortex-A series processors with a high-performance FPGA. Every Zynq-7000 chip contains the same PS, but the PL and I/O resources differ between devices. Compared with a standalone ARM Cortex-A9 board or a standalone Xilinx FPGA board, Zynq products not only integrate a processor and an FPGA with different technical characteristics on a single chip, but also provide a high-performance on-chip interconnect between them. In addition, Xilinx provides the Vivado HLS tool to convert a high-level language description (such as C, C++, or SystemC) into a hardware module, which can accelerate the software and improve the system's real-time performance [9].
In this paper, we provide a face location algorithm written in C. The algorithm uses skin color segmentation to obtain candidate face regions and then projects the image after edge detection to locate the face region. The algorithm is fast and simple, so the C program is easy to synthesize into a hardware module with the HLS tool on the Zynq platform. The experimental results indicate that the hardware module works well and that efficiency increases greatly.
II. THE STRUCTURE OF ZYNQ

A. Internal Structure of Zynq
The internal structure of Zynq consists of two parts: the processing system (PS) and the programmable logic (PL).
The application processing unit (APU) is located in the PS. The APU contains a dual-core ARM Cortex-A9 processor and two NEON coprocessors, one per core. The two cores share a 512 KB L2 cache. Each core is a high-performance, low-power design with its own 32 KB L1 instruction cache and 32 KB L1 data cache. The ARM Cortex-A9 is based on the ARMv7-A architecture and supports virtual memory; it can run 32-bit ARM instructions, 16-bit and 32-bit Thumb instructions, and 8-bit Java bytecodes in the Jazelle state. The NEON coprocessor targets media and signal processing and provides instructions optimized for audio, video, image, voice, and 3D graphics processing.
Figure 1. Zynq-7000 AP SoC architecture diagram [10]
B. SCU Module
In Figure 1, the snoop control unit (SCU) [11] connects the two ARM Cortex-A9 cores to the memory subsystem. It intelligently manages data coherency between the two cores and the shared L2 cache, and is responsible for data transfer, arbitration, memory interconnection, and cache coherence. The SCU communicates with each Cortex-A9 core over a cache coherency bus (CCB). It supports MESI (Modified, Exclusive, Shared, Invalid) snooping, which achieves higher efficiency and performance by avoiding unnecessary accesses. The SCU implements duplicated four-way associative tag RAMs that act as a local directory of the lines held in the L1 data caches, so that it can check whether requested data is present. The SCU can copy clean data from one core's cache to the other's without involving main memory. It can also move dirty data between cores while avoiding the delays caused by sharing or write-back. Because coherency is managed by an independent hardware unit, software complexity is reduced and coherency is maintained across different operating systems and software drivers.
C. Accelerator Coherence Port
The Accelerator Coherence Port (ACP) is a 64-bit AXI slave interface that provides an asynchronous, cache-coherent access point from the PL directly to the SCU of the ARM Cortex-A9 MPCore processor. The PL can use this interface to access the entire APU cache and memory system, which simplifies software design and improves system performance. As a standard AXI slave interface, the ACP supports all standard read and write operations without any additional PL components. The ACP therefore gives the PL a memory interface that keeps caches coherent with the ARM Cortex-A9. When a read from the ACP targets a coherent memory region, the SCU checks whether the required data is already stored in an L1 cache. If it is, the data is returned directly. If the L1 lookup misses, the request may still hit in the L2 cache. If the L2 cache also misses, the SCU finally forwards the request to main memory.
D. I/O Peripherals and Programmable Logic Peripherals
The Zynq-7000 AP SoC includes many common internal I/O peripherals and memory interfaces, which form an important part of the PS [12]. The I/O peripherals include GPIO, a Gigabit Ethernet controller, a USB controller, an SPI controller, and a UART controller. Besides providing the usual I/O functionality, these peripherals have been adapted for the Zynq-7000 AP SoC so that they support the PS + PL architecture well and can make flexible use of the PL. The memory interfaces on Zynq are also rich, including a DDR controller, a Quad-SPI controller, and a NAND/NOR/SRAM controller.
The programmable logic "peripheral" (PL) is essentially a Xilinx FPGA. To make the architecture easier to understand, the PL on the Zynq platform can be viewed in two ways. First, it can be seen as a reconfigurable "peripheral" that belongs to the PS and is controlled by the ARM processor. In this role the PL can be configured as a variety of extension peripherals, such as additional serial, Ethernet, or video interfaces. Second, the PL can be seen as a master device that works without ARM control. In this case the PL can actively exchange data with external chips through its interfaces; it can even act as a master to fetch data from the main memory of the APU, store data, and control the computation of the ARM processor. Adding the PL to a traditional ARM-based SoC brings more flexibility.
E. Connection between PL and PS
The PL and PS are connected through multiple interfaces and signals that allow tighter or looser coupling; in total there are more than 3000 connections. This lets designers efficiently integrate hardware accelerators or other PL logic. A PL module can be designed to be accessed by the processor, or to access the main memory resources within the PS. At system start-up, the PS always boots first and then configures the PL using a software-centric approach.
716 JOURNAL OF MULTIMEDIA, VOL. 9, NO. 5, MAY 2014
The PL can be configured during the boot process or at a later time. In particular, the PL can be fully or partially reconfigured through dynamic configuration. Data exchange between the PS and PL is based on the AXI bus protocol [13]. The AXI pathways inside the PS follow the AMBA bus specification. As long as a user-designed PL module complies with the AXI protocol, it can communicate with the PS over these pathways, for example to provide data to the processor or to access DDR, OCM, and the L2 cache. There are two types of interface signals between the PL and PS:
Functional interface: This includes the AXI, EMIO, interrupt, DMA flow control, clock, and debug interfaces. When developing FPGA logic in the PL, designers use these interface signals to exchange data with the PS. The interfaces serve different purposes and should be selected according to need.
Configuration interface: This includes the PCAP, SEU, configuration status, and Program/Done/Init signal interfaces. These signals are connected to built-in modules within the PL and give the PS the ability to control the PL.
In the hardware design, we mainly use the AXI, EMIO, DMA, and PCAP interfaces.
Here we introduce two important concepts on the Zynq platform: MIO and EMIO. MIO stands for Multiplexed I/O (multi-function I/O), and EMIO stands for Extended Multiplexed I/O. MIO is the set of I/O pins of the PS, 54 pins in total. These pins can be used for GPIO, SPI, UART, timer, Ethernet, USB, and other functions. Each pin can be assigned to one of several functions, hence the name multi-function I/O. Because the number of MIO pins is limited, Zynq provides EMIO for cases in which they are not sufficient. With EMIO, a PS peripheral signal is routed into the PL and then out through a PL pin to the outside of the chip. EMIO can also connect the PS to PL modules as control signals, for example by using GPIO to control several PL modules. In this way the modules can be connected without going through the AXI bus, although CPU resources are consumed to control them. The difference between MIO and EMIO is shown in Figure 2.
F. Advanced eXtensible Interface
The AXI (Advanced eXtensible Interface) protocol describes data transmission between master and slave devices, which are connected through handshake signals. The master asserts the VALID signal when its data is ready, indicating that the data is valid. The slave asserts the READY signal when it is ready to accept data. A transfer takes place when both VALID and READY are asserted. While both signals remain asserted, the master continues to transmit the next data item. Transmission stops when the master deasserts VALID or the slave deasserts READY. AXI is part of the AMBA (Advanced Microcontroller Bus Architecture) family proposed by ARM and is an upgrade over the earlier AMBA buses. It is a high-performance, high-bandwidth, low-latency on-chip bus that can replace the earlier AHB and APB buses.
AXI4 includes three interface standards: AXI4 (also called AXI Memory Map or AXI Full), AXI4-Stream, and AXI4-Lite. AXI4-Stream was defined jointly by Xilinx and ARM for FPGA designs dominated by large data paths. The AXI4 protocol structure is shown in Figure 3. AXI4 plays the role of the earlier AHB bus standard: it provides a high-speed interconnect channel, supports burst mode, and is mainly used to access high-speed data storage. AXI4-Lite provides single data transfers to peripherals; it is equivalent to the original APB protocol and is mainly used to access low-speed peripherals. The AXI4-Stream interface is similar to a FIFO. It needs no address during transmission and continuously streams data directly from the master to the slave. It is mainly used for high-speed data transmission, such as video, high-speed A/D conversion, PCIe, and DMA. AXI4-Stream is similar to the Xilinx LocalLink interface.
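The VALID/READY handshake can be illustrated with a toy cycle-level model in C. This is only a sketch of the transfer rule, not AXI RTL; the function name `axi_stream_transfer` and the per-cycle flag arrays are hypothetical and are introduced here for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy cycle-level model of a VALID/READY handshake (illustrative only).
 * A word moves from src to dst only in cycles where both flags are high.
 * Returns the number of words transferred within n_cycles. */
size_t axi_stream_transfer(const int *src, size_t n_words,
                           const bool *valid, const bool *ready,
                           size_t n_cycles, int *dst)
{
    size_t sent = 0;
    for (size_t cycle = 0; cycle < n_cycles && sent < n_words; ++cycle) {
        if (valid[cycle] && ready[cycle]) {
            /* Handshake completes: the current word is accepted. */
            dst[sent] = src[sent];
            ++sent;
        }
        /* Otherwise the same word is held until both sides are ready. */
    }
    return sent;
}
```

With three words and the flag patterns VALID = {1, 1, 0, 1, 1} and READY = {0, 1, 1, 1, 1}, the words are accepted in cycles 1, 3, and 4; the other cycles are stalls.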
III. SOFTWARE/HARDWARE CO-DESIGN BASED ON ZYNQ
Software/hardware co-design is the simultaneous and coordinated design of hardware and software, starting from a definition of the whole system. It includes hardware/software partitioning, development of the hardware and software systems, and joint debugging. The software/hardware co-design procedure is shown in Figure 4. First, the system is described in a high-level language, such as C, Matlab, or SystemC. Second, the system is partitioned according to its requirements: the original C program is decomposed into a software part that executes on the processor and a hardware part that is transformed from C into a hardware module. Third, if the hardware part requires a special interface, it is designed on top of the Zynq AXI interface and its function is verified independently. Fourth, a software interface to the hardware is designed in the software part; if the development involves an operating system, a driver is written for the hardware coprocessor. Finally, the hardware and software are combined for joint simulation and debugging.
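The partitioning step can be sketched in C as follows. This is a minimal illustration under assumptions, not the partition used in this paper: `threshold_kernel` stands for a function that would be handed to the HLS tool, `count_white` stands for code that stays on the ARM core, and `N_PIXELS` is an arbitrary fixed size chosen so that the loop bound is known at synthesis time. All three names are hypothetical.

```c
#include <stdint.h>

#define N_PIXELS 64  /* hypothetical fixed block size for this sketch */

/* Hardware candidate: a regular per-pixel loop with a fixed bound.
 * In an HLS flow, tool-specific directives (for example a pipelining
 * directive on the loop) would be attached here; they are omitted so
 * that the code stays plain, portable C. */
void threshold_kernel(const uint8_t in[N_PIXELS], uint8_t out[N_PIXELS],
                      uint8_t threshold)
{
    for (int i = 0; i < N_PIXELS; ++i) {
        out[i] = (in[i] >= threshold) ? 255 : 0;
    }
}

/* Software part: remains on the processor and consumes the result. */
int count_white(const uint8_t buf[N_PIXELS])
{
    int count = 0;
    for (int i = 0; i < N_PIXELS; ++i) {
        if (buf[i] == 255) {
            ++count;
        }
    }
    return count;
}
```

Keeping the hardware candidate free of dynamic memory and data-dependent loop bounds is a common choice because such code is usually easier to synthesize, and the same function can still be compiled and tested on the processor before synthesis.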
IV. FACE LOCATION
Face localization [14] [15] is an important research direction in face recognition [18] [19] [20] technology, and the process is in effect human face detection. It first determines whether a face is present in the input image; if there is one, it then determines further information [21] about the face. Skin color does not depend on other features of the face and is not sensitive to changes in pose or facial expression. It is therefore stable and differs significantly from the color of most background objects.
To extract the face region from an image, we need a skin color model that holds under a variety of lighting conditions, so we first need an appropriate color space. The RGB color space represents not only color but also brightness, so it is not suitable for skin models: changes in illumination alter the brightness and make it more difficult to locate the face. Instead, we exploit the clustering of skin color by transforming the RGB model into a color space in which chroma and brightness are independent, namely the YCbCr color space, in which the components are uncorrelated.
Face color is relatively concentrated in the YCbCr color space (referred to as the color clustering property), so we choose YCbCr for face detection. This requires mapping the RGB space to the YCbCr color space.
In the equation (1) 
Because floating-point arithmetic is costly to implement in hardware, we floor the Y, Cb, and Cr values to integers.
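A minimal sketch of such an integer conversion is given below. It assumes the widely used full-range ITU-R BT.601 (JPEG-style) coefficients scaled by 256; the coefficients of equation (1) may differ, and the function name `rgb_to_ycbcr_int` is hypothetical.

```c
#include <stdint.h>

/* Integer RGB -> YCbCr conversion with coefficients scaled by 256.
 * Assumed coefficients (full-range BT.601 approximation):
 *   Y  =  0.299 R + 0.587 G + 0.114 B
 *   Cb = -0.169 R - 0.331 G + 0.500 B + 128
 *   Cr =  0.500 R - 0.419 G - 0.081 B + 128
 * The +128 offset is added before the shift so every intermediate sum
 * is non-negative and the right shift acts as a floor. */
void rgb_to_ycbcr_int(uint8_t r, uint8_t g, uint8_t b,
                      uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    int32_t ri = r, gi = g, bi = b;
    *y  = (uint8_t)(( 77 * ri + 150 * gi +  29 * bi) >> 8);
    *cb = (uint8_t)((-43 * ri -  85 * gi + 128 * bi + 128 * 256) >> 8);
    *cr = (uint8_t)((128 * ri - 107 * gi -  21 * bi + 128 * 256) >> 8);
}
```

Only integer multiplications, additions, and shifts are used, which maps more directly to FPGA resources than floating-point arithmetic.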
In the YCbCr color space, we compute normalized chromaticity histograms over a large number of facial skin samples. The Cr and Cb values of faces with different skin colors are plotted as a 3D graph in Figure 5. Figure 5 shows that the skin color distribution closely resembles a two-dimensional Gaussian distribution. We therefore model skin color with a two-dimensional Gaussian model M = (m, C) [16], in which m is the mean value and C is the covariance matrix of the chrominance vector.
The color model is then used to compute the probability that the color of an arbitrary pixel is skin.
After computing the similarity of every pixel in the image, we find the maximum similarity and use it to normalize all pixel similarities to the range [0, 1]. The higher the value, the more likely the pixel belongs to skin; the lower the value, the less likely. Assume that the image contains only two gray-level classes, one representing the object and the other the background, with a significant difference between them in the gray-level histogram. The histogram is then bimodal, and the simplest approach is to select the valley point of the histogram as the threshold for image binarization, as shown in Figure 6. In this paper, we use morphological operations to remove noise and smooth the boundary.
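The normalization and valley-point selection can be sketched in C as below. The function names are hypothetical, and the way the second peak is chosen (weighting each bin by its squared distance from the highest peak) is an assumed heuristic, not necessarily the procedure used for Figure 6.

```c
#include <stddef.h>

/* Scale all similarities so that the largest one becomes 1.0. */
void normalize_similarity(double *sim, size_t n)
{
    double max = 0.0;
    for (size_t i = 0; i < n; ++i)
        if (sim[i] > max) max = sim[i];
    if (max <= 0.0) return;              /* nothing to normalize */
    for (size_t i = 0; i < n; ++i)
        sim[i] /= max;
}

/* Pick a threshold at the valley between the two main histogram peaks. */
int valley_threshold(const unsigned hist[256])
{
    /* Highest peak. */
    int p1 = 0;
    for (int k = 1; k < 256; ++k)
        if (hist[k] > hist[p1]) p1 = k;

    /* Second peak: large count AND far from the first peak. */
    int p2 = p1;
    unsigned long long best = 0;
    for (int k = 0; k < 256; ++k) {
        unsigned long long d = (unsigned long long)((k - p1) * (k - p1));
        unsigned long long score = (unsigned long long)hist[k] * d;
        if (score > best) { best = score; p2 = k; }
    }
    if (p2 == p1) return p1;             /* effectively unimodal */

    /* First minimum-count bin between the two peaks. */
    int lo = p1 < p2 ? p1 : p2;
    int hi = p1 < p2 ? p2 : p1;
    int valley = lo;
    for (int k = lo; k <= hi; ++k)
        if (hist[k] < hist[valley]) valley = k;
    return valley;
}
```

To apply it, the normalized similarities would be quantized to 256 gray levels, accumulated into `hist`, and each pixel set to white when its level exceeds the returned threshold.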
Experimental results have shown that edge detection based on the grayscale morphological erosion transform is effective. The erosion of an image f by a structuring element b is defined as
$$ f \ominus b = \{\, x \mid b_x \subseteq f \,\} $$
Here f denotes the original image, b denotes the structuring element, and $b_x$ denotes b translated to position x. The erosion transform finds the positions at which b fits entirely inside f. Erosion is a shrinking operation: it always removes the outermost layer of the original image, so it is used to remove noise and to thin the image.
The other basic operation in morphology is the dilation transform.
Dilation is an expansion procedure: it extends the outer boundary and enlarges the original image. The dilation transform is used to fill holes and connect cracks.
Erosion and dilation are not inverse operations, but they can be cascaded. Applying erosion followed by dilation to an image with the same structuring element is called the opening operation, which is a fundamental operation in mathematical morphology.
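Erosion, dilation, and opening on a binary image can be sketched in C as follows. The 3x3 square structuring element, the treatment of pixels outside the image as background, and the function names are assumptions made for this illustration; the structuring element used in the experiments is not specified here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Binary erosion with a 3x3 square structuring element.
 * Pixels are 0 or 1; pixels outside the image count as background. */
void erode3x3(const uint8_t *src, uint8_t *dst, int w, int h)
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            uint8_t keep = 1;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= h || xx < 0 || xx >= w ||
                        !src[yy * w + xx]) {
                        keep = 0;   /* the element does not fit here */
                    }
                }
            }
            dst[y * w + x] = keep;
        }
    }
}

/* Binary dilation with the same 3x3 structuring element. */
void dilate3x3(const uint8_t *src, uint8_t *dst, int w, int h)
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            uint8_t hit = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < h && xx >= 0 && xx < w &&
                        src[yy * w + xx]) {
                        hit = 1;    /* the element touches foreground */
                    }
                }
            }
            dst[y * w + x] = hit;
        }
    }
}

/* Opening = erosion followed by dilation. Returns 0 on success, -1 if
 * the temporary buffer cannot be allocated. */
int open3x3(const uint8_t *src, uint8_t *dst, int w, int h)
{
    uint8_t *tmp = malloc((size_t)w * (size_t)h);
    if (!tmp) return -1;
    erode3x3(src, tmp, w, h);
    dilate3x3(tmp, dst, w, h);
    free(tmp);
    return 0;
}
```

On a 7x7 image containing a 3x3 white block and one isolated white pixel, opening removes the isolated pixel and restores the block.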
We apply the opening operation to the binarized image described above to smooth the boundary and remove burrs, which makes the outline of the face clearer.
Based on the analysis above, we propose a projection method to locate the face region. We count the number of white pixels in each column and find the maximum; the column counts, normalized by this maximum, determine the left and right boundaries. Likewise, we count the number of white pixels in each row and find the maximum; the row counts, normalized by this maximum, determine the upper and lower boundaries. Finally, the boundary is marked in the color image. The algorithm flow diagram is shown in Figure 7.
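The projection step can be sketched in C as follows. The rule used here, taking the first and last column (or row) whose count reaches a fraction `ratio` of the maximum, is an assumption about how the normalized counts are turned into boundaries; `FaceBox`, `span_above`, and `locate_face` are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { int left, right, top, bottom; } FaceBox;

/* First and last index whose count is at least ratio * max(counts).
 * Both outputs are -1 when all counts are zero. */
static void span_above(const int *counts, int n, double ratio,
                       int *first, int *last)
{
    int max = 0;
    for (int i = 0; i < n; ++i)
        if (counts[i] > max) max = counts[i];
    *first = -1;
    *last = -1;
    if (max == 0) return;
    for (int i = 0; i < n; ++i) {
        if ((double)counts[i] >= ratio * (double)max) {
            if (*first < 0) *first = i;
            *last = i;
        }
    }
}

/* Locates the face box in a binary image (non-zero = white pixel).
 * Returns 0 on success, -1 if no white pixel exists or allocation fails. */
int locate_face(const uint8_t *bin, int w, int h, double ratio, FaceBox *box)
{
    int *cols = calloc((size_t)w, sizeof *cols);
    int *rows = calloc((size_t)h, sizeof *rows);
    if (!cols || !rows) { free(cols); free(rows); return -1; }

    /* Vertical and horizontal projections of the white pixels. */
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (bin[y * w + x]) { cols[x]++; rows[y]++; }

    span_above(cols, w, ratio, &box->left, &box->right);
    span_above(rows, h, ratio, &box->top, &box->bottom);
    free(cols);
    free(rows);
    return (box->left >= 0) ? 0 : -1;
}
```

With `ratio = 0.5`, a single stray white pixel outside the main blob does not move the box, because its column and row counts stay below half of the maximum.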
The face location results are shown in Figure 8.
Xilinx provides design tools that address the traditional problems of heterogeneous multi-core design [17]. The high-level synthesis tool Vivado HLS (AutoESL) converts a C program into hardware, Xilinx Platform Studio is used to design and build the hardware system, and the Xilinx Software Development Kit is used to design the software interfaces and drivers.
The overall system structure and framework are shown in Figure 9.
In this paper, we use the Xilinx Zynq-7000 platform to design a face location system in which a hardware accelerator is generated from a C program. Face location tests demonstrate that running the algorithm in a hardware module gives high performance and good real-time behavior.
The experimental results also show that the design greatly reduces processing time and lowers the design cost. FPGA verification shows that the system produces good image quality, which confirms that the proposed hardware structure is valid and works well. It can be integrated into location and monitoring systems and used for the geometric normalization step of a license plate recognition system. These techniques have broad application prospects in high-speed face recognition systems.
