Emerging embedded 3D vision systems for robotics and security applications utilize object detection to perform video analysis in order to intelligently interact with their host environment and take appropriate actions. Such systems have high performance and high detection-accuracy demands, while requiring low energy consumption, especially when dealing with embedded mobile systems. However, there is a large image search space involved in object detection, primarily because of the different sizes in which an object may appear, which makes it difficult to meet these demands. Hence, it is possible to meet such constraints by reducing the search space involved in object detection. To this end, this article proposes a depth and edge accelerated search method and a dedicated hardware architecture that implements it to provide an efficient platform for generic real-time object detection. The hardware integration of depth and edge processing mechanisms, with a support vector machine classification core onto an FPGA platform, results in significant speed-ups and improved detection accuracy. The proposed architecture was evaluated using images of various sizes, with results indicating that the proposed architecture is capable of achieving real-time frame rates for a variety of image sizes (271 fps for 320 × 240, 42 fps for 640 × 480, and 23 fps for 800 × 600) compared to existing works, while reducing the false-positive rate by 52%.
The research leading to these results received funding from the European Union's Seventh Framework Programme, managed by REA-Research Executive Agency (http://ec.europa.eu/research/rea), (FP7/2007-2013) under project SAFEMETAL (FP7-SME-2010-262568), as well as from E! 5527-RUNNER, an EU-funded project under EUREKA's Eurostars Programme. Authors' addresses: C. Kyrkou (corresponding author), C. Ttofis, and T. Theocharides, KIOS Research Center for Intelligent Systems and Networks, Department of Electrical and Computer Engineering, University of Cyprus, Nicosia, Cyprus 1678; email: {kyrkou.christos, ttofis.christos, ttheocharides}@ucy.ac.cy.
INTRODUCTION
Object detection is an important integrated task in several embedded applications from the computer vision and artificial intelligence domains and refers to the ability of a computer system to analyze an image and determine the presence of an object of interest. Such embedded applications are often associated with requirements for real-time performance, high detection accuracies, and low energy consumption. Additionally, these constraints must often be met under limited available hardware resources. Software implementations of object detection for embedded applications, even if highly flexible, cannot satisfy these constraints. Consequently, research has focused on the design of custom hardware architectures for object detection.
Reconfigurable hardware platforms, such as FPGAs, have emerged as a very attractive platform for implementing architectures for object-detection applications. FPGAs offer high flexibility with regards to area, power, and performance and thus are able to meet application-specific constraints, which is difficult to achieve with other platforms, such as CPUs and GPUs, due to their fixed interconnect and high power demands.
Several FPGA-based object-detection hardware architectures can be found in the literature. The majority of these architectures operate on low-resolution (320 × 240) images, employ a traditional sliding-search-window-approach to search for objects, and also downscale the image several times to find objects of different sizes. However, as the image resolution increases, the number of generated search windows increases as well. Depending on many factors, including the number of scales that need to be searched (from highest to lowest resolution), the overlap between successive windows and the window size itself, the increase in search space can make it difficult to meet real-time constraints and maintain an acceptable detection accuracy. Software implementations of object detection use search-reduction techniques, such as motion detection and color processing, to reduce the search space. However, only a few hardware implementations feature such search-reduction methods that could potentially improve the efficiency of embedded object-detection systems [Sadri et al. 2004; He et al. 2009; Ming and Yisong 2010] . Additionally, some of these techniques, such as color processing, are application specific and thus cannot be used in a variety of object-detection applications. Finally, search-reduction techniques that have been used in hardware do not give information on the object size, and thus, exhaustive search must still take place even in a smaller image region.
Alternatively, with the emergence of 3D systems, it is possible to utilize depth information to accelerate object detection. Depth information has been successfully used in software implementations for intelligent object recognition systems [Wu et al. 2009; Darrell et al. 1998; Wang et al. 2004]; however, many assumptions and simplifications are made to allow software implementations to achieve near real-time performance. To the best of our knowledge, our initial attempt towards a depth-accelerated object-detection system was the first to consider a fully custom hardware approach that utilizes depth information. That work demonstrated the performance speed-ups, detection improvements, and energy savings stemming from the use of depth information compared to a conventional sliding-window-based object-detection system. Similarly, a second previous work of ours showed the hardware realization and subsequent benefits of an edge-based search-reduction method, used both for window-size estimation and fast window rejection, again compared to the traditional sliding-window approach.
In this work, we show how the preceding two methods can be merged together into a single algorithm and how that algorithm can be implemented in hardware utilizing a dedicated architecture in order to provide an efficient and generic framework for real-time object detection. Compared to our previous works, we perform experiments on larger images (640 × 480 and 800 × 600), rather than just 320 × 240, a common image size often used in object-detection systems, to explore the potential and scalability of the proposed architecture. We also provide a more detailed comparison with other hardware implementations of object-detection systems, some of which also feature search-reduction methods. Finally, realizing that the classification performance is equally important towards an efficient object-detection system, we employ the reduced set method [Burges 1996 ] to reduce the number of support vectors used by the support vector machine (SVM) classifier, resulting in increased classification performance and reduced memory demands. The proposed architecture is implemented on a Virtex 5 FPGA targeting a face-detection application and is able to handle various image sizes of 320 × 240, 640 × 480, and 800 × 600 pixels, achieving 271, 42, and 23 frames-per-second (fps), respectively, while also providing a 52% reduction in the false-positive rate, compared to the sliding-window method.
The rest of the article is organized as follows. Section 2 presents the basic steps that constitute the proposed depth- and edge-directed object-detection algorithm. Section 3 provides details on related object-detection algorithms and hardware implementations. The hardware architecture of the proposed method is described in Section 4, while Section 5 details the experimental platform and evaluation methodology for the architecture, presents and discusses the performance results, and also discusses how our framework could be used for other applications. Finally, Section 6 concludes the article and provides directions for future research.
DEPTH- AND EDGE-DIRECTED SEARCH SPACE REDUCTION
Object detection is concerned with identifying the presence of an object of interest in an image. This is a tedious task which typically involves a sliding window scanning the input image and various downscaled versions of it (a process called image pyramid generation) in order to find objects of interest in various sizes. This exhaustive search makes it difficult to meet real-time constraints, especially as the image resolution increases, since more scales will need to be searched for the object of interest, and as a result, the number of search windows also increases. It is possible to increase the window size as the image size increases in order to reduce the search space; however, in such a case, the classifier demands on hardware resources, memory, and processing speed will also increase. As such, it is preferable to keep the window size at a considerably small size and introduce search-reduction techniques in the overall processes to increase performance. In the remainder of this section, we outline how depth and edge information can be utilized to reduce the search space and speed up the object-detection process.
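To give a feel for how quickly the exhaustive search space grows with resolution, the following minimal sketch counts the windows visited by a sliding-window scan over an image pyramid. The 20 × 20 window matches the classifier window size mentioned later in the article; the window step of 2 pixels and the downscaling factor of 1.2 are illustrative assumptions, not values taken from this work, and the function name is hypothetical.

```python
def count_sliding_windows(width, height, win=20, step=2, scale=1.2):
    """Count windows visited by an exhaustive multi-scale sliding-window scan.

    win   : square classifier window size in pixels
    step  : assumed shift between successive windows (illustrative)
    scale : assumed downscaling factor between pyramid levels (illustrative)
    """
    total = 0
    w, h = width, height
    while w >= win and h >= win:
        # Number of window positions along each axis at this pyramid level.
        nx = (w - win) // step + 1
        ny = (h - win) // step + 1
        total += nx * ny
        # Downscale the image for the next pyramid level.
        w, h = int(w / scale), int(h / scale)
    return total


if __name__ == "__main__":
    for size in [(320, 240), (640, 480), (800, 600)]:
        print(size, count_sliding_windows(*size))
```

Under these assumptions, the window count grows faster than the pixel count alone would suggest, because larger images also add pyramid levels.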
Depth Extraction and Object-Size Estimation
Depth information (i.e., the distance of an object from the camera) can be extracted from the host environment of the embedded object-detection system. A number of methods could be used for this purpose, such as the Microsoft Kinect sensor or a stereo vision system that processes a stereo image (a pair of left and right images) [Trucco and Vierri 1998]. In the context of stereo camera systems, the depth (Z) is obtained from the disparity map d(x, y) using the formula Z = f * (b/d(x, y)), where b is the baseline distance between the optical centers of the stereo cameras, and f is the focal length of the stereo camera system. As such, any stereo-based depth extraction method that can produce the disparity map could work in the context of our framework.
By using depth information extracted from a vision system, it is possible to estimate the size of the object at a given location of the image, thus avoiding downscaling the input image several times and subsequently reducing the number of windows that need to be classified. The actual size of the object (O_size), as it is represented in the real world, and its projection in one of the stereo image frames (W_size) are related by Equation (1) [Trucco and Vierri 1998]. The equation is derived from the setup shown in Figure 1, which shows that the ratio between the size of the object in the real world (O_size) and its distance from the camera (Z) equals the ratio between the size in the image (W_size) and the camera focal length (f), that is, O_size / Z = W_size / f:

$$ W_{size} = \frac{f \cdot O_{size}}{Z} \tag{1} $$

Additionally, using the relationship between depth and the disparity map (d) that applies for stereo vision systems, Z = f * (b/d), we can use the disparity value instead (Equation (2)), thus avoiding the need to compute the actual depth:

$$ W_{size} = \frac{O_{size} \cdot d}{b} \tag{2} $$
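As a worked illustration of the size estimation described above, the sketch below combines the ratio O_size / Z = W_size / f with the stereo relation Z = f * (b/d). The function names and the numeric values (focal length in pixels, baseline, and object size in meters) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def depth_from_disparity(d, focal_px, baseline_m):
    """Z = f * b / d, with f and d in pixels and b in meters."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / d


def window_size_from_depth(obj_size_m, focal_px, depth_m):
    """Window size in pixels from depth: W_size = f * O_size / Z."""
    return focal_px * obj_size_m / depth_m


def window_size_from_disparity(obj_size_m, baseline_m, d):
    """Window size in pixels from disparity: W_size = O_size * d / b.

    The focal length cancels out, so the depth never has to be computed.
    """
    return obj_size_m * d / baseline_m


if __name__ == "__main__":
    f, b, o, d = 500.0, 0.1, 0.2, 25.0      # hypothetical example values
    z = depth_from_disparity(d, f, b)        # 2.0 m
    via_depth = window_size_from_depth(o, f, z)
    via_disparity = window_size_from_disparity(o, b, d)
    assert math.isclose(via_depth, via_disparity)
    print(z, via_depth, via_disparity)
```

Both routes give a window of about 50 pixels for these values, which shows why the disparity value alone is enough to select the window size.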
Edge-Based Window-Rejection Process
The performance of object-detection systems also suffers from the necessity that all windows need to go through the classification process. Thus valuable computational time as well as power are wasted on potentially unpromising regions. An efficient way to eliminate windows prior to the classification process, in a manner that can be parallelized in hardware and does not require many hardware resources, is utilizing edge information. Edges provide information about visual features in an image, and thus the number of edge pixels in an image can give an indication of the useful information in a particular image region. An early approach to eliminating windows from the classification process, in addition to constructing the windows, was first proposed in Anilla and Devarajan [2010] . Overall, the edge-based window-rejection process involves obtaining the edge image, using an edge-detection algorithm such as Sobel, scanning the image, and rejecting windows depending on the number of edge pixels. Specifically, after obtaining the edges of a particular image region (window), the edge pixels are counted, and if they exceed a certain threshold, which is assigned according to the object of interest, the window is considered for classification; otherwise, it is not classified and considered a non-object. In Anilla and Devarajan [2010] , a window is discarded if it contains no edge pixels. However, the zero edge pixels threshold is susceptible to noise and is only able to remove regions with almost uniform intensity, and so in our approach, we use a larger threshold.
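The rejection test described above can be sketched in a few lines. This is a minimal software illustration, assuming a binary edge image stored as rows of 0/1 values; the function names and the threshold value in the usage example are hypothetical, since the actual threshold depends on the object of interest.

```python
def count_edge_pixels(edge, x, y, size):
    """Count edge pixels (value 1) in the size x size window with top-left corner (x, y)."""
    return sum(sum(row[x:x + size]) for row in edge[y:y + size])


def keep_window(edge, x, y, size, threshold):
    """Return True if the window has enough edge pixels to be worth classifying."""
    return count_edge_pixels(edge, x, y, size) > threshold


if __name__ == "__main__":
    edge = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    print(keep_window(edge, 0, 0, 4, threshold=3))  # 4 edge pixels exceed 3
    print(keep_window(edge, 0, 0, 2, threshold=3))  # 1 edge pixel does not
```

With a threshold of zero, the same function reproduces the stricter rule in which only windows with no edge pixels at all are discarded.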
Depth- and Edge-Accelerated Object-Detection Process
An overview of the proposed accelerated object-detection approach is given in Figure 2 . The approach relies on the computation of the disparity map from a stereo image to extract depth information. Each value in the disparity map corresponds to the center of a candidate window. The candidate window size is determined by Equation (2), and a candidate window is formed around each disparity value. The disparity map is sampled every few pixels (disparity search overlap), and a candidate window is extracted for the sampled disparity values only rather than for every pixel in the disparity map. Furthermore, if the size of the candidate window exceeds the image boundaries, indicating the presence of a large object at the border of the image, that window can be discarded, as the object may fully fit in another candidate window. Finally, any disparity values that map to window sizes which are smaller than the size of the classifier can also be discarded on the premise that they wouldn't be considered for classification in the sliding-window approach either. After the size of the candidate window has been validated, that window is extracted from the edge image, as shown in Figure 2 , and the number of edge pixels contained in the window is determined. If it exceeds a predetermined threshold, the candidate window is considered valid, and that window is extracted from the grayscale image, resized to the classifier size, and then classified by the classification algorithm. Otherwise, it is discarded and the whole process is then repeated for the next candidate window.
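The candidate-window flow described above can be summarized in software as follows. This is a simplified sketch, not the hardware algorithm itself: the function and parameter names are hypothetical, the sampling step of five pixels follows the value used by the Window Extraction Unit later in the article, and the edge count is taken over the full-size window rather than over a resized one.

```python
def candidate_windows(disparity, edge, obj_over_b, clf_size, edge_thresh, step=5):
    """Yield (x0, y0, size) for candidate windows that pass the size and edge checks.

    disparity   : 2D list of disparity values
    edge        : 2D list of 0/1 edge pixels, same dimensions as disparity
    obj_over_b  : precomputed O_size / b, so that size = d * obj_over_b
    clf_size    : classifier window size in pixels
    edge_thresh : minimum edge-pixel count (assumed, application specific)
    """
    height, width = len(disparity), len(disparity[0])
    for cy in range(0, height, step):
        for cx in range(0, width, step):
            size = int(disparity[cy][cx] * obj_over_b)
            if size < clf_size:
                continue  # smaller than the classifier window
            x0, y0 = cx - size // 2, cy - size // 2
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue  # window exceeds the image boundaries
            edges = sum(sum(row[x0:x0 + size]) for row in edge[y0:y0 + size])
            if edges > edge_thresh:
                yield (x0, y0, size)


if __name__ == "__main__":
    disp = [[2.0] * 20 for _ in range(20)]
    edge = [[1] * 20 for _ in range(20)]
    wins = list(candidate_windows(disp, edge, obj_over_b=4.0, clf_size=4, edge_thresh=10))
    print(len(wins), wins[0])
```

Each surviving window would then be resized to the classifier size and classified; windows that fail any check cost only a multiplication, a few comparisons, and an accumulation.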
RELATED WORK
In recent years, a fair amount of work has been done on the development of algorithms and techniques used in 3D vision systems. Hetzel et al. [2001], for example, use local feature histograms extracted from range images to recognize single object images. Browatzki et al. [2011] use 2D and 3D features from color and depth images, respectively, in order to recognize objects. Affine feature finders are combined with SIFT in Moreels and Perona [2005] in order to provide robust 3D recognition under changes in viewpoint, lighting, and scale. More recently, efficient 3D feature-extraction algorithms have been proposed in the literature, such as the Fast Point Feature Histograms (FPFH) used for 3D image registration [Rusu et al. 2009] and Normal Aligned Radial Features (NARF) that are used to find and recognize 3D objects in range images [Steder et al. 2010]. Finally, other approaches, such as Liebelt et al. [2008], use generated 3D object representations to recognize different objects. Common classification methods that have been used in 3D object-recognition methods include nearest-neighbor methods [Liebelt et al. 2008], multilayer perceptrons [Browatzki et al. 2011], and support vector machines [Roobaert and Van Hulle 1999; Ruiz-Llata and Yebenes-Calvino 2009]. The main focus of these works was on providing higher recognition accuracy that could be used in a generic framework for 3D object recognition. However, hardware implementations of such 3D object-recognition systems have been rather limited to simple problems with low-resolution images [Ruiz-Llata and Yebenes-Calvino 2009]. Nevertheless, the advancements in the research of 3D vision systems provide an opportunity to utilize additional available information, such as depth, in order to accelerate 2D object-detection systems by reducing the search space. A fair amount of work has been done in hardware implementations of 2D object-detection algorithms, mostly utilizing FPGAs and targeting face detection.
These works employ pattern-recognition algorithms for the classification stage, such as neural networks, support vector machines, and the Viola-Jones detection algorithm, and utilize the sliding-window approach to search for objects of various sizes, while most works targeted 320 × 240 images. The following illustrates some of the most important works found in the literature, while Table I provides a summary of the techniques used in each work and the achieved performance.
One of the first attempts at implementing face detection in hardware was that of McCready [2000], and it was the first to demonstrate the benefits of a dedicated hardware solution. The author designed a custom face-detection algorithm based on a neural network classifier and optimized it for the TM-2 (Transmogrifier 2) configurable multiboard FPGA platform. It operated on a 320 × 240 image frame and achieved a frame rate of 30 fps with a detection accuracy of 87%. However, the proposed implementation utilized nine boards on the TM-2 system.
The detection framework by Viola and Jones [2004] is a popular approach used for 2D object detection. The main benefit of this algorithm in terms of hardware implementation is that it only requires additions and only a few multiplications, making it attractive for resource-constrained systems. Hence, a few hardware implementations of this algorithm can be found in the literature. Cho et al. [2009] showed a parallel implementation of this algorithm on an FPGA, demonstrating near real-time performance of 23 fps for 320 × 240 images. Hiromoto et al. [2009] proposed a hybrid model that consisted of parallel and sequential modules. The most frequent parts of the algorithm are implemented by the parallel modules, while the least frequent by the sequential. Through this approach, the authors manage to save hardware resources while achieving 30 fps. Finally, Kyrkou and Theocharides [2011a] propose a systolic architecture for the implementation of the Viola-Jones algorithm both for ASIC and FPGAs, with the FPGA system able to process images at 64 fps. Han et al. [2011] proposed a face-detection system that utilizes the Modified Census Transform (MCT) and the AdaBoost learning algorithm with cascaded classifiers targeting mobile environments. They use only a single classification stage, and their system can detect up to 32 faces. Their system achieves high frame rates and high detection accuracies; however, the FPGA architecture and FPGA utilization information are not presented, making it difficult to assess their proposed system.
The hardware implementation of a Convolutional Face Finder (CFF) [Garcia and Delakis 2004] based on a convolutional neural network was demonstrated in Farrugia et al. [2009]. The proposed architecture consisted of a ring of processing elements (PEs) that implemented the CFF algorithm and demonstrated how pipelined architectures can be used to speed up the detection process. They exploit parallelism by dividing the image into vertical strips with overlapping regions, and each PE processes a block of that strip. However, a ring of 25 PEs is required to achieve real-time performance on 640 × 480 images, and consequently, a large FPGA is used to fit the architecture. Furthermore, not enough details are given on the I/O requirements of their architecture or on its memory and buffering requirements.
The aforementioned works have demonstrated how dedicated hardware solutions and classifier optimizations in hardware can provide high speed ups and real-time performance. However, higher frame-rates can only be achieved by integrating hardware search-space-reduction mechanisms. The following works demonstrate different methods and approaches that have been implemented in hardware in order to reduce the search space. Ming and Yisong [2010] use a hardware/software approach to design an FPGA-based face-detection system with skin-detection acceleration. A Nios-embedded soft processor handles the face identification task, while a dedicated hardware accelerator handles the skin-detection process. This specific work focuses on accelerating the search-space reduction in hardware rather than the actual classification. The processor can then handle the classification process, since the search space is reduced. However, this also depends on the input image size. Sadri et al. [2004] implemented a face-detection neural-network algorithm on a Virtex II Pro FPGA, which included a skin color filter to reduce the search space within the image and an edge-detection mechanism to produce a binary image on which the neural network operates. The majority of the system was implemented on the FPGA custom logic fabric, while higher-level operations were left to the Power PC processor.
Their system operated on 800 × 600 input images with a frame rate of 9 fps if the entire image was processed. The resulting frame rate could be improved up to 90 fps if only 25% of the image was searched, using skin detection, and only upright frontal detection was considered. The approach of using binary-image data demonstrates the potential performance benefits; however, the impact on detection accuracy is unclear.
Finally, He et al. [2009] demonstrated the hardware implementation of a massively parallel face-detection system that achieved frame rates of over 600 frames per second. They utilize two search-reduction techniques, motion detection and skin detection, and also make a few simplifications to the face-detection procedure. The input image is of 640 × 480 pixels; however, all subsequent operations, including the detection process, happen on a downscaled 80 × 60 image, greatly reducing the search space, while only three face sizes are considered (11 × 11, 19 × 19, and 27 × 27). The classification for the three window sizes is done in parallel, and thus the proposed system needs a lot of resources to provide these high frame rates. This approach may not be suitable for other object-detection applications, as downscaling the input image may result in loss of quality.
These works have demonstrated the successful implementation of face-detection systems on FPGAs. The majority of these implementations have either adopted the sliding-window approach or integrated face-detection-specific search-reduction techniques, such as skin detection. The majority of these works have targeted image sizes of 320 × 240 pixels for face detection; however, other applications may require processing higher-resolution images. Recognizing that there is a need for a more generic object-detection framework, and with the advancements in the field of 3D vision systems, we propose an object-detection architectural framework that is based on depth information as well as edge detection that can provide a more generic platform for various detection applications and can also process larger image sizes.
PROPOSED HARDWARE SYSTEM ARCHITECTURE
The architecture that implements the proposed object-detection process consists of four major hardware units, each implementing the major tasks of the algorithm, as outlined in Section 2. These units are the Disparity Computation Unit (DCU) which computes the depth information from a stereo image, the Edge Computation Core (ECC) that implements a Sobel edge detector, the Window Extraction Unit (WEU) that processes the depth and edge information, and the Classification Engine (CE) which implements a support vector machine classifier. It is also possible to use different approaches and algorithms for each of these units; however, the overall architectural framework will remain the same. The system also consists of a memory controller and a system controller that optimize accesses to the external memory, control I/O operation, and synchronize the other major units. The system uses three on-chip buffers to store the image data; there are dedicated buffers for the edge image, the disparity image, and the left image of the stereo pair, which is used as the reference image for the disparity computation. Figure 3 shows an overview of the system architecture and the communication flow between units. Certain features of the architecture were optimized for FPGA implementation; however, the architecture can also be implemented using an ASIC design flow with only minor adjustments.
The detection operation begins when the memory controller fetches stereo image data from the external memory to the ECC in raster-scan fashion and stores the incoming pixels of the left stereo image in the input image buffer, while the produced edge image is stored in the edge image buffer. The DCU reads the edge image pixels, performs the disparity map computation, and stores the disparity pixels in line buffers (disparity image buffer). The WEU reads pixels from the line buffers to validate the candidate window and extract window pixels from the edge image and the left stereo image. All valid windows are then fed to the CE for classification. 
Edge Computation Core
The Edge Computation Core (ECC) integrated into the system implements a flexible and scalable Sobel edge-detection architecture. It employs hardware features, such as parallelism and pipelining, to speed up the repetitive calculations involved in the Sobel operation, and uses optimized memory structures to reduce redundant memory reads. This is particularly important, as the time spent on edge detection must be small enough for the overall operation to obtain a speed-up.
The architecture of the ECC is shown in Figure 4 and consists of an I/O controller, a set of FIFO line buffers used for temporary pixel storage and parallel window processing, and a series of convolution units (CONV), as well as comparators. The I/O controller fetches pixel data from the input stereo image in a row-wise fashion and forwards them to the input port of the scan-line buffers. The scan-line buffers produce four successive 3 × 3 windows per cycle, which are convolved with the 3 × 3 Sobel kernels by the convolution units. The architecture utilizes two edge-detection units, one for the left and one for the right input image. The edge map of the left stereo image is stored in the edge image buffer and is used by the WEU, in the later stages, to determine whether a candidate window should be rejected prior to classification. Additionally, the produced edge images are also used for the fast computation of the disparity map, as described in Hadjitheophanous et al. [2010].
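For readers less familiar with the Sobel operation, the following software sketch mirrors what the convolution units and comparators compute for each 3 × 3 window. It is a sequential illustration only, not the parallel ECC design; the use of |gx| + |gy| as the gradient magnitude and the threshold value are assumptions made here for simplicity.

```python
def sobel_edges(img, threshold):
    """Return a binary edge map (1 = edge) for a grayscale image given as rows of ints.

    Border pixels are left at 0 because their 3 x 3 neighborhood is incomplete.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column, weights 1-2-1.
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row, weights 1-2-1.
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            # |gx| + |gy| approximates the gradient magnitude without a square root.
            out[y][x] = 1 if abs(gx) + abs(gy) > threshold else 0
    return out


if __name__ == "__main__":
    # Vertical step edge: two dark columns followed by three bright ones.
    img = [[0, 0, 255, 255, 255] for _ in range(5)]
    for row in sobel_edges(img, threshold=128):
        print(row)
```

The output is a one-bit-per-pixel image, which is the form consumed by the edge-count and disparity stages described in the rest of this section.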
Disparity Computation Unit
The Disparity Computation Unit (DCU) used in the architecture extracts depth information from a stereo vision system, and its architecture is based on an improved version of the hardware architecture that was proposed in Hadjitheophanous et al. [2010]. The architecture, which is shown in Figure 5, combines the sum-of-absolute-differences (SAD) matching algorithm with edge features for faster and more efficient processing. Specifically, the use of edge features reduces the hardware demands, since processing happens with one-bit pixels rather than eight-bit pixels for grayscale images. Thus the area saved is used to increase parallelism and performance. This is significant for the efficient integration of this unit into an object-detection system, since the overhead introduced must not affect the performance of the classification process.
The DCU follows the ECC in the computation flow, as it receives edge pixels from the ECC and uses them to perform correlation on the input stereo images to produce the disparity map. The DCU can process images with a disparity range of 80 pixels and window sizes up to 11 × 11 pixels. The DCU architecture uses scan-line buffers to receive edge pixels from the ECC in streaming fashion. The scan-line buffers temporarily store the pixels and allow for parallel processing of many windows. As such, multiple parallel adders and subtractors are utilized, which facilitate the parallel implementation of the SAD algorithm. Through this parallel structure, the DCU is capable of producing a disparity value every cycle. The disparity values are produced faster than they are consumed by the WEU and CE; thus, it is only necessary to store a couple of lines of the disparity map in the disparity image buffer. More disparity map pixels can be produced very rapidly, so there will always be data available for the WEU, which follows the DCU in the computation flow. Activating the DCU only when the WEU needs disparity values has two further advantages. First, energy is saved, since the DCU is idle for much of the time. The exception is when many disparity values are invalid, in which case the DCU produces disparity values continuously; in that case, however, the CE is idle, which again saves energy. Second, we reduce the on-chip memory requirements for storing the disparity map and use the remaining memory resources to store the input image. The reader is referred to Hadjitheophanous et al. [2010] and Ttofis et al. [2012] for a more detailed description of the concepts used in the design of the DCU.
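The idea of SAD matching on one-bit edge pixels can be illustrated with a short sequential sketch. This is not the DCU design described in the cited works: it computes one disparity value at a time, uses a small default window and disparity range instead of the 11 × 11 window and 80-pixel range supported by the hardware, and assumes rectified images with the left image as the reference. The function name is hypothetical.

```python
def sad_disparity(left, right, x, y, win=3, max_disp=16):
    """Disparity at (x, y) of the left edge map using SAD over a win x win window.

    left, right : 2D lists of 0/1 edge pixels (rectified stereo pair)
    The caller must keep the window around (x, y) inside the left image.
    """
    r = win // 2
    best_d, best_cost = 0, None
    for d in range(0, max_disp + 1):
        if x - d - r < 0:
            break  # the shifted window would leave the right image
        cost = 0
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                # For one-bit pixels the absolute difference reduces to an XOR.
                cost += left[y + dy][x + dx] ^ right[y + dy][x + dx - d]
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


if __name__ == "__main__":
    # A vertical edge at column 5 in the right image appears at column 8 in the left image.
    right = [[1 if x == 5 else 0 for x in range(16)] for _ in range(5)]
    left = [[1 if x == 8 else 0 for x in range(16)] for _ in range(5)]
    print(sad_disparity(left, right, 8, 2))
```

Because each pixel comparison is a single XOR, many window comparisons can be evaluated side by side in hardware, which is consistent with the area savings attributed to edge features above.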
Window Extraction Unit
The Window Extraction Unit (WEU), shown in Figure 6, is the third unit in the detection system pipeline and performs the operations necessary for the validation of candidate windows, namely the window-size estimation and the accumulation of edge pixels. Additionally, it performs the reverse mapping process that rescales large windows to the appropriate window size for the classifier (m × n, which can be either square or rectangular). The WEU is enabled once the DCU starts generating and storing disparity values in the disparity image buffer. The disparity map is sampled by the WEU every five pixels, and thus not all values of the disparity map are processed. The WEU loads the sampled values from the buffer and proceeds to determine the corresponding window size of each disparity value using Equation (2). To implement the equation, the incoming disparity value is multiplied by the preloaded fixed-point value (O_size/b), producing the window size in pixels. The calculated window size goes through comparators that check whether the size is equal to or larger than the classifier window size, and whether the window lies within the image boundaries. If the conditions are met, then the corresponding window is extracted from the edge image buffer, using the reverse mapping process, and the number of edge pixels in the window is counted. The reverse mapping technique is implemented by multiplying the actual window coordinates by a predetermined scaling factor (computed window size/classifier window size). This ensures that for every window, regardless of the estimated window size, only m × n pixel values will be read and processed by the subsequent stages. Details on the reverse mapping process are given in Figure 7. The edge image pixel values are either 1 or 0; hence, a simple accumulation suffices to give the number of edge pixels in the window.
Storing and processing of the edge image pixels happens in groups of four bits to improve performance, and as such, the accumulation is done for four edge image pixels as well. Once all the pixels are accumulated, the sum is compared to a threshold to determine if it should be discarded or not. If the threshold is exceeded, the WEU starts fetching window pixels, this time from the input image buffer using the reverse mapping technique, to the classifier.
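The reverse mapping step can be sketched as a nearest-neighbor address transformation. The sketch below is a software illustration under the assumption that scaled addresses are truncated to integers; the fixed-point details of the hardware multipliers are not reproduced, and the function names are hypothetical.

```python
def reverse_map_window(image, x0, y0, win_w, win_h, out_w, out_h):
    """Read an out_h x out_w window from a larger win_h x win_w region of the image.

    Each output address (i, j) is scaled by win/out (the scaling factor) and
    offset by the region's top-left corner (x0, y0), so only out_w * out_h
    pixels are ever read, whatever the size of the region.
    """
    out = []
    for i in range(out_h):
        src_y = y0 + (i * win_h) // out_h
        row = [image[src_y][x0 + (j * win_w) // out_w] for j in range(out_w)]
        out.append(row)
    return out


def count_edges_reverse_mapped(edge, x0, y0, win_w, win_h, out_w, out_h):
    """Accumulate the 0/1 edge pixels of the reverse-mapped window."""
    return sum(map(sum, reverse_map_window(edge, x0, y0, win_w, win_h, out_w, out_h)))


if __name__ == "__main__":
    image = [[y * 8 + x for x in range(8)] for y in range(8)]
    # An 8 x 8 region read as a 4 x 4 window picks every second pixel.
    for row in reverse_map_window(image, 0, 0, 8, 8, 4, 4):
        print(row)
```

Since the same routine serves both the edge image and the grayscale image, the edge count and the classifier input are taken at the same sampled pixel positions.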
Classification Engine
The CE architecture is based on the SVM processing architecture proposed in . It consists of an array of processing elements (PEs) and reflects the processing requirements of the SVM computation flow, which requires the calculation of a kernel function that has both vector and scalar operations. The architecture consists of two types of PEs, which perform the vector and scalar operations of the SVM computation flow (i.e., the vector and scalar units). That architecture permits the SVM processing core to handle the processing of multiple windows, as each row can perform classification of one window. Additionally, it allows trading off processing the training data in parallel or processing input windows in parallel. The latter, however, depends on the memory I/O and parallel access capabilities.
Fig. 7. The reverse mapping process involves reading only the pixels that correspond to an m × n window on a larger image. Each pixel address in the m × n window is transformed into an address corresponding to the pixel's location in the larger image using scaling factors F1, F2, which are computed by the larger window dimensions divided by the classifier window dimensions.
We have optimized that architecture for the specific problem at hand. Specifically, the developed system uses the dual-port on-chip memories available on the FPGA to store the input image, with one port used for writing and the other for reading, so only one window can be accessed at a time. As such, we adapt the SVM classifier architecture to speed up the classification of a single window by processing as many training data as possible in parallel. The modified array architecture, which is shown in Figure 8, consists of five rows, each processing the same window but with different reduced-set vectors (training data), thus increasing performance. The results of each row's vector units are accumulated by that row's scalar unit. In turn, the results of the scalar units are accumulated to produce the final classification result. Using this approach, a 20 × 20 window can be classified in 445 cycles. This makes the classification process the bottleneck in the overall computation, since the ECC and DCU can produce results almost every cycle, based on their developed architectures. Hence, unless windows are discarded, there will almost always be data available to be processed by the CE.
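The row-partitioned accumulation can be illustrated with a small software model. This is a sketch under stated assumptions rather than the CE implementation: the kernel is taken to be a second-degree polynomial of the form (x·z + 1)^2 (the exact kernel constants are an assumption), the reduced-set vectors are split evenly across rows, and the names `poly2_kernel` and `classify_window` are hypothetical.

```python
def poly2_kernel(x, z):
    """Second-degree polynomial kernel; the '+ 1' offset is an assumed form."""
    dot = sum(a * b for a, b in zip(x, z))  # vector operation
    return (dot + 1.0) ** 2                 # scalar operation


def classify_window(window, vectors, weights, bias, rows=5):
    """Model of the array: every row sees the same window but a different
    slice of the reduced-set vectors; row sums are then added together."""
    per_row = -(-len(vectors) // rows)  # ceiling division
    row_sums = []
    for r in range(rows):
        lo = r * per_row
        hi = min(lo + per_row, len(vectors))
        # Per-row accumulation, standing in for that row's scalar unit.
        row_sums.append(sum(weights[i] * poly2_kernel(window, vectors[i])
                            for i in range(lo, hi)))
    # Final accumulation across rows yields the classification score.
    score = sum(row_sums) + bias
    return score, row_sums
```

Because addition is associative, the score does not depend on how the vectors are divided among rows; the partitioning only changes how much of the work proceeds in parallel.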
EXPERIMENTAL METHODOLOGY AND RESULTS

Experimental FPGA Platform and Methodology
The proposed architecture was evaluated after implementation on a Xilinx ML505 board (Virtex 5-LX110T FPGA), targeting the application of face detection as a benchmark. The evaluation image dataset consisted of real-world images that were taken from a custom-built stereo vision system as well as synthetic images. The stereo image capturing system comprised two video cameras separated by a baseline distance of 77 mm, both with a focal length of 25 mm. Stereo image pairs were loaded to the FPGA board DRAM from the compact flash through a Microblaze subsystem that was used for initialization and verification purposes. After the initialization phase, the object-detection architecture was enabled, and data was retrieved from the DRAM by a custom memory controller. A total of 20 images per size (320 × 240, 640 × 480, and 800 × 600) were used to evaluate performance metrics, such as frame rate and detection accuracy. Visual output was directed to a digital monitor through the on-board DVI output. Classification was performed by a support vector machine which was able to classify 20 × 20 images. The kernel used for the SVM was a second-degree polynomial, which was found to perform well for image-processing applications [Osuna et al. 1997]. The SVM model was trained using MATLAB and the CBCL Face Database #1 [CBCL 2000] using the methodology and strategies employed in Osuna et al. [1997] and Burges [1998], producing a total of 80 reduced-set vectors [Burges 1996]. Table II summarizes the system algorithmic parameters.
FPGA Implementation Discussion
The system architecture was implemented on a Virtex 5 LX110T FPGA and was optimized for the specific FPGA hardware features and resources in order to be able to process images up to 800 × 600 pixels. Specifically, the image buffers are implemented as dual-port block RAMs, which are available on the FPGA, to facilitate the streaming nature of the operations. One port is used for writing image data and the other for reading. A total memory space of 800 × 240 pixels was allocated for buffering the input grayscale image. As such, a whole image of 320 × 240 can fit on the FPGA; for larger images (640 × 480 and 800 × 600), only a part of the image is stored on-chip. This is sufficient, as in most cases, the window will be available in the on-chip memory, and as a result, external memory access will not be needed. Furthermore, whenever the window pixels are not on-chip, this indicates the presence of a very large object that covers most of the image; as such, a large portion of the image will not need to be processed, and so the overhead from accessing the external memory is negligible. The Sobel convolution process of the ECC unit was implemented using shift registers rather than multiplications to save hardware resources and increase frequency. The SVM array comprised a total of 80 vector units (5 rows and 16 columns), with each vector unit handling the processing of the input window with one reduced-set vector. A total of five scalar units was required for the implementation of the SVM computation flow. All the units in the SVM processing core require multiplication and accumulation units; however, since the scalar units have higher bitwidth requirements, they were mapped on the DSP48E units of the FPGA for performance improvement, while the vector units were mapped on the LUT logic. The FPGA resource utilization of the proposed system is illustrated in Table III, which also shows relevant results from related works.
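To illustrate how a Sobel convolution can avoid multipliers, the sketch below computes both gradients with additions, subtractions, and one-bit left shifts only, since the Sobel coefficients are ±1 and ±2. This is one reading of the shift-based implementation, not a description of the ECC unit itself; the |gx| + |gy| magnitude approximation, the threshold parameter, and the function names are assumptions.

```python
def sobel_3x3(p):
    """Sobel gradients of a 3x3 neighbourhood p (list of three rows of ints).

    Multiplications by 2 are replaced with a left shift by one bit.
    """
    gx = (p[0][2] + (p[1][2] << 1) + p[2][2]) \
       - (p[0][0] + (p[1][0] << 1) + p[2][0])
    gy = (p[2][0] + (p[2][1] << 1) + p[2][2]) \
       - (p[0][0] + (p[0][1] << 1) + p[0][2])
    return gx, gy


def is_edge(p, threshold):
    """Binary edge decision; |gx| + |gy| is an assumed magnitude approximation."""
    gx, gy = sobel_3x3(p)
    return 1 if abs(gx) + abs(gy) > threshold else 0
```

The binary output matches the 0/1 edge image format that the WEU accumulates over each candidate window.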
The FPGA technology used plays an important role in the efficiency and performance of a design. A feature-rich FPGA platform that provides more processing power (in the form of more reconfigurable logic, embedded multipliers, and embedded block RAM) to further exploit parallelism could potentially provide higher performance. However, the architecture design has to be scalable and utilize the FPGA resources efficiently in order to deliver the full performance potential of a given FPGA platform. As such, comparing architectures implemented using different FPGA technologies, even if it may be indicative of potential performance and required hardware, may not be fair. Thus, we focus our comparison on works that have used the Virtex 5 FPGA technology; however, we also include other works in Table III for a more comprehensive view. The reported figures for our implementation are for a system capable of processing images of up to 800 × 600. Fewer resources would be needed if the system only had to process lower-resolution images. Our proposed architecture requires fewer LUT resources than other works and only half of the available register resources of the targeted FPGA. The CE requires the majority of the DSP48E units; this could be lowered by reducing the number of rows in the classifier, although possibly at the cost of classification performance per window. Architectures that use the Viola-Jones detection algorithm have an additional advantage in that they do not require many multiplier units, since the algorithm is mostly implemented with adders/subtractors and accumulators. On-chip storage is critical for reducing external memory I/O and increasing performance, as also demonstrated in He et al. [2009]. Hence, our system utilizes the majority of available FPGA block RAMs to reduce external I/O.
We targeted the frequency offered on the available FPGA board, and as such, we pipelined the critical paths of the design to achieve a frequency of 100 MHz. The frequency can be improved with further optimization for higher throughput. With this frequency, the system manages to offer very good performance results.
Performance Results and Discussion
Performance in object-detection systems is measured in terms of frame rate and detection accuracy. A real-time performance of around 30 frames per second (fps) is sufficient for most video-processing applications; however, applications with multiple data streams and high-resolution video analysis may require higher frame rates. Through the speedup gained by the search-reduction techniques, the FPGA implementation of the proposed depth- and edge-directed object-detection system is capable of high performance for 320 × 240 and 640 × 480 images (271 and 42 fps, respectively), while achieving near real-time performance of 23 fps for 800 × 600 images. The average frame rate achievable by our system and other systems is shown in Table I. The bottleneck in the performance of our system is the CE, which takes a few hundred cycles to classify a window (depending on its size), whereas the ECC (because of the parallel window operations) and the DCU (because of the binary data processing) produce a result almost every cycle. Thus, they are capable of achieving over 100 fps for all targeted image resolutions, as demonstrated in Ttofis et al. [2012]. The achieved performance is then limited by the number of windows that need to be classified and the classification time per window. The work from He et al. [2009] achieved 625 fps for 640 × 480 images; however, the actual size of the processed image is 80 × 60, with only three object sizes considered. The FPGA system by Sadri et al. [2004] can process 800 × 600 images at 90 fps; however, this is achieved only if 25% of the image is searched through skin detection and the detection itself happens on binary images, and thus it is unclear how detection accuracy is affected.
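The dependence of frame rate on the number of classified windows and the classification time per window can be checked with a back-of-envelope calculation. The sketch below assumes that the CE is the only bottleneck, that every classified window costs the 445 cycles reported for a 20 × 20 window, and that the clock is the 100 MHz reported for this implementation. The function names are hypothetical, and the resulting window budgets are illustrative estimates, not measured results.

```python
def fps_bound(clock_hz, cycles_per_window, windows_per_frame):
    """Upper bound on frame rate when classification dominates."""
    return clock_hz / (cycles_per_window * windows_per_frame)


def window_budget(clock_hz, cycles_per_window, target_fps):
    """Largest average number of classified windows per frame that still
    sustains target_fps under the same assumption."""
    return clock_hz / (cycles_per_window * target_fps)


if __name__ == "__main__":
    CLOCK_HZ = 100e6          # reported operating frequency
    CYCLES_PER_WINDOW = 445   # reported for a 20 x 20 window
    for fps in (271, 42, 23):  # reported frame rates
        budget = window_budget(CLOCK_HZ, CYCLES_PER_WINDOW, fps)
        print(f"{fps} fps -> about {budget:.0f} windows per frame")
```

Under these assumptions, halving the number of windows that reach the classifier doubles the bound, which is why discarding windows before classification has such a direct effect on frame rate.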
The depth of objects, the number of edge pixels in the image, and the disparity map sampling step are all factors that affect the frame rate. Depending on the depth of objects and the number of edge pixels in the input stereo image, the frame rates may be somewhat higher or lower. Coarser-grain sampling of the disparity map could further improve performance, as fewer windows would be generated. An increase of the disparity search overlap for 800 × 600 images from five pixels to ten pixels could potentially double the performance without any significant loss in detection accuracy. Increasing the disparity search overlap could also enable the architecture to process image sizes larger than 800 × 600 in real time. The proposed system architecture was tailored to the available FPGA resources, and as such, there are a few limitations that could be overcome by investigating an ASIC design, especially for higher-resolution images. Aspects that could be explored through an ASIC implementation include the design of on-chip memory structures that would allow for higher pixel throughput and optimizations of the CE architecture to increase classification performance per window. Finally, it would be possible to instantiate parallel CEs to process multiple windows in parallel.
Detection accuracy is an equally important performance metric for object-detection systems, and it heavily depends on the classification algorithm used as well as the training data. The Viola-Jones detection algorithm has demonstrated very good results in terms of detection accuracies for 2D object detection; however, SVMs have also shown very good results, and literature suggests that the detection results are comparable [Osuna et al. 1997]. Furthermore, SVMs have also been used for 3D object recognition [Roobaert and Van Hulle 1999; Ruiz-Llata and Yebenes-Calvino 2009]. Detection results of our system and other related works are shown in Table I. A direct comparison of the detection accuracies, however, is not fair, since the image datasets used in other works consist of single (non-stereo) images, and as such, we cannot use them in our stereo processing system. Additionally, the training data and preprocessing methods also impact the classification performance. The proposed depth- and edge-accelerated system can achieve classification rates of ∼80%, which would suffice for most applications. The detection accuracy is slightly lower than that of the Viola-Jones face-detection implementation in Kyrkou and Theocharides [2011a]; however, higher classification rates are possible by improving the training dataset. An inherent advantage of the proposed depth and edge search-reduction approach comes from the fact that the number of windows that are processed is reduced; as a result, the false alarm rate also decreases when compared to the traditional sliding-window approach. This is true regardless of which classifier is used, provided that it has acceptable discrimination capabilities. The number of falsely classified faces (false positives) decreased by an average of 52% compared to the sliding-window approach, while the number of correctly classified faces (true positives) remained roughly the same.
Some synthetic and real-world images used for the experiments and the resulting detections are illustrated in Figure 9.
The proposed hardware architecture has been optimized for face detection; however, with minor modifications, it could be adjusted to perform detection of any object. First, we consider the changes needed for the architecture to be used in other applications. In general, only the classifier needs hardware changes; if an SVM produces adequate results, then an architecture similar to the one we propose in this work could also be used; however, any other classifier could be incorporated in the architecture if necessary (e.g., the Viola-Jones-based classifier in Kyrkou and Theocharides [2011a] could be used). The other changes concern specific parameters that need to be adjusted in each hardware module in order to accommodate the new object of interest. Specifically, the window size for each detection scenario is different, and as a result, the pre-computed WEU parameters are adjusted to the object size characteristics (window height and window width in Figure 6). The threshold for the minimum number of edge pixels in a window could also be adjusted, since the window is now different. Additionally, it is also possible to introduce an upper threshold on the number of edge pixels, depending on the object of interest, to eliminate noisy regions or cluttered background regions. Finally, any feature extraction algorithm (SURF, SIFT, etc.) could be integrated by implementing an appropriate processing unit that would process the valid windows prior to the classification process.
To illustrate that the proposed framework could be used in other applications, we performed simulations for car side-view detection on 320 × 240 cropped and resized test images from EISATS Set 1 [2007] with a window size of 100 × 40. Some detection results are shown in Figure 10. We briefly describe the changes made to accommodate car side-view detection. The disparity- and edge-detection tasks were carried out with the same parameters as face detection, as they do not take any object-specific parameters, while the window-extraction process was adjusted for a 100 × 40 window. The classifier architecture was adapted to the new training set. A polynomial SVM with 50 reduced-set vectors was utilized. Because of the larger search window size, it was possible to increase the disparity search overlap, which resulted in fewer generated windows compared to face detection. However, the time needed for classification and edge-pixel accumulation increased, as both tasks are dependent on the window size (vector dimensionality). The hardware demands for the SVM classifier also changed because of the increased window size. The memory needed increased, since the reduced-set vectors have higher dimensionality; however, the processing resources needed were reduced, as there were fewer reduced-set vectors. Using the Xilinx synthesis and simulation tools, it was estimated that the expected performance for this application could reach up to 70 fps. The figure is lower than for face detection, since the classification time per window, which is the bottleneck in the overall computation, increased.
CONCLUSIONS
This article presented a search-reduction approach that utilizes depth and edge information and its hardware realization on an FPGA, which can be integrated into existing 3D vision systems in order to build efficient, embedded image-analysis systems. The implemented object-detection system offers improvements both in terms of frame rate and detection accuracy. Specifically, results indicate real-time performance for a range of image sizes on a Virtex 5 FPGA. Furthermore, the detection accuracy is improved with respect to the sliding-window approach, as many windows are discarded prior to the classification process. The overall framework and proposed hardware architecture could be further improved either with application-specific optimizations or an ASIC implementation. Additionally, the edge- and depth-directed method could be used with any classification algorithm and potentially other search-reduction techniques and feature-extraction algorithms for a variety of object-detection applications. As an immediate follow-up to this work, we also intend to explore how to incorporate 3D information into the classification process to provide more robust detection results. Finally, we will also explore different approaches and algorithms to implement the three main tasks of the algorithm, namely, edge detection, disparity computation, and classification, and the impact of different algorithms in terms of hardware implementation and performance.
