ABSTRACT Significantly improved performance of the various learning algorithms has revived the interest in computer vision for recognition applications during the current decade. This paper reports a vision-based hardware recognition architecture combining the Haar-like feature extraction with the support vector machine (SVM) classification. To support an optimal tradeoff between resource requirements, processing speed, and recognition accuracy, a 12-bit fixed-point computation for block-based feature normalization and a recycling allocation of minimalized memory resources are proposed in this paper. Furthermore, an efficient scale generation of target objects for recognition is enabled by configurable windows with high size flexibility. Additionally, a parallel-partial SVM-classification architecture is developed for improving the recognition speed, by accumulating the partially completed SVM results for multiple windows in parallel. The proposed hardware architecture is verified with an Altera DE4 platform to achieve a high throughput rate of 216 and 70 f/s for XGA (1024×768) and HD (1920×1080) video resolutions, respectively. A recycled memory space of only 193 KB is sufficient for processing high-resolution images up to 2048×2048 pixels during online testing. Using the INRIA person dataset, 89.81% average precision and maximum accuracy of 96.93% for pedestrian recognition are realized. Furthermore, about 99.08% accuracy is achieved for two car recognition tasks using the UIUC dataset (side view of cars) and a frontal car dataset collected by ourselves at Hiroshima University with the proposed hardware-architecture framework.
I. INTRODUCTION
A. BACKGROUND
Machine learning (ML) research has been around for many decades and employs various algorithms to analyze input data for different applications. Speech recognition [1], [2] was the first subject before many other ML applications, such as machine translation [3], [4] and computer vision [5]-[7], became popular. Driven by the exponentially increasing usage of image sensors and connected devices, more than 10 gigabytes of data per second will be generated by 2021 [8]. During the ongoing era of big data, interest in ML has exploded to enable extraction of meaningful information from a plethora of raw data and to represent it by some type of generalized model. The computational power available to ML algorithms must be enhanced to keep up with the growth rate of generated data.
Modeling the visual world in all its rich complexity is far more difficult than modeling other signals such as the vocal tract that produces spoken sounds [9]. Video data is arguably one of the biggest data forms in computer-vision applications. To reduce the related computation and communication cost, many artificial intelligence (AI) tasks employ specific algorithms to extract representative information from the video captured by the image sensor. This information can be analyzed for recognizing specific objects (e.g., face [10], pedestrian [11], vehicle [12], traffic sign [13]), or for taking immediate actions (e.g., in robot navigation [14] or driving assistance [15]). In computer vision, a typical framework for object recognition can be divided into two main parts, as shown in Fig. 1: feature extraction and classification. In supervised learning, a set of weights of the selected model is trained for the classification by learning from previously labelled examples of the given task.
Although approaches such as deep neural networks (DNN) blur the distinction between feature extraction and classification, feature extraction is still an indispensable step for capturing and representing the meaningful information from the raw pixels of the image sensor in various ML algorithms. Many well-known features such as histogram of oriented gradients (HOG) [16], scale-invariant feature transform (SIFT) [17], and speeded-up robust features (SURF) [18] are still popular in many application fields. Both traditional SIFT and SURF are based on finding specific interest points even in scaled spaces, while HOG describes a whole image or image patch by computing edge gradients. The original HOG feature usually leads to a higher detection accuracy, but it is neither scale-invariant nor rotation-invariant. Its higher feature dimensionality also leads to more complex computations and causes HOG to consume more resources in hardware implementations.
Feature selection is a critical issue for classifiers since only a limited number of data samples can be selected but are then used for representing all original attributes of the images in an approximate way. The original SURF includes both a detector and a descriptor. The initial detection stage is based on the Hessian matrix to find the local interest points, while the subsequent description stage calculates the Haar-wavelet responses [19] in both horizontal and vertical directions for orientation assignment of these local interest points, which makes SURF faster and better repeatable than SIFT and many other descriptors. On the other hand, SURF is restricted to the extraction of local features from these distinctive locations of interest points in the images such as edges, blobs, or corners. In order to extract the increased global-feature information for the complete image and to reduce the overall computational complexity, this work simplifies the architecture of the SURF algorithm and employs the Haar-like global feature to describe the whole image. Thus the necessity of a detection stage for determining the local interest points is removed.
Classification in the computer-vision framework is to determine the class of the objects, which have been represented by the chosen features extracted from the testing images. Popular techniques for classification include the nearest neighbor search (NNS) [20] , linear or non-linear support vector machines (SVM) [21] and Adaboost [22] , which each have their respective merits. The accuracy and computational complexity are two important factors for selecting a classifier in different fields.
Depending on the design methodologies and tools used, there are many different approaches to implementing object recognition. Rapid prototyping and a much shorter development period have made software approaches attract great interest in computer vision. Implementations developed in C/C++, Java, Matlab or Python, using specific software development tools and cross compilers, can be downloaded from the internet and run on a standard processor architecture such as Intel Pentium. Recently, software solutions such as OpenCV [23], BoofCV [24], NumPy [25] or Scikit-learn [26] have improved rapidly due to their open and comprehensive standard libraries, which support various algorithms for classification, regression and clustering. This has further led to a breakthrough in the convolutional neural network (CNN), for which these methodologies and tools can be optimally applied. Against this background, deep-learning methods for object classification have developed rapidly and have been widely applied in the current decade to various fields such as Internet of Things (IoT) [27], natural language processing [28] or clinical diagnosis [29]. The hierarchical architecture of a CNN repeats convolutional operations to learn and adapt to the given tasks by filtering the information at each stage. Trainable feature detectors make CNNs highly adaptive and enable the achievement of high accuracy in most applications.
Recently, CNN-based methods have trended towards larger and deeper networks with increasing sample sizes of databases such as ImageNet [30]. The heavy training stages of conventional neural networks with massive visual samples from the database inevitably lead to the disadvantage of slow performance. Many improved detection models are reported for specific applications, with performance good enough to be successfully executed on high-power platforms with fast processing capability. To avoid laborious and unreliable manual annotation of large-scale image databases, a deformable part model (DPM) [31] was proposed for weakly supervised learning, effectively further improving the final accuracy. Another improved model [32] with higher mean average precision (mAP) tried to avoid the heavy dependence on a large amount of training data without bounding-box annotations, based instead on verifying the complementarity of visual similarity and semantic relatedness.
However, the standard metric of mAP does not tell the entire story for real deployments of computer vision systems [33] . Various detection models are proposed for offering a tradeoff between processing speed and accuracy.
The YOLO9000 model [34] can predict detections of more than 9000 different object categories on the full 448×448 resolution in real time. However, in the age of cross-media systems, the necessity for higher resolution capacity in practical applications arises. A Region Proposal Network (RPN) [35] made an algorithmic improvement to speed up the expensive computations of the original region-based CNNs.
Deeper networks naturally require more powerful processors and consume more hardware resources and energy, so these CNN-based approaches turn out to be too expensive by a significant margin for common civilian applications. Software solutions usually carry a further significant overhead due to the need for expensive operating systems. Meanwhile, compiler inefficiency also leads to relatively slower processing speed. The dependencies between the CPU's hardware and software also cause performance reductions, such as higher energy consumption and longer processing time. Deployments on mobile devices are generally accompanied by supply constraints such as less high-speed memory and limited battery-based energy, but nevertheless require real-time performance. Thus new solutions, which consume less processing time, hardware resources and energy, are in strong demand to meet the real-world requirements of mobile applications.
B. RELATED WORKS
A more specialized processor type named graphics processing unit (GPU), popularized by Nvidia, is more efficient than a general-purpose CPU for image processing. However, more energy is consumed by a GPU when compared to CPU-based solutions. For example, Rister et al. [36] reported a heterogeneous SIFT detector, incorporated by a GPU, to have taken an average of 50 seconds for continuous iteration over a video-frame dataset, which is unaffordable for real-time (i.e., ≥30 fps) mobile applications.
Various custom-hardware architectures [37] - [40] in integrated circuit designs are proposed for faster processing speed and lower power consumption in mobile applications. Reference [37] reported a 52 mW system-on-chip (SoC) scheme which supported multi-object recognition for videos with HD (1920 × 1080) resolution at 30 fps by an object viewpoint prediction engine and a visual vocabulary processor. A hardware accelerator in [38] provided a scheme for combination of a simplified HOG algorithm with simultaneous SVM calculation on field programmable gate arrays (FPGA) to achieve HD-resolution video processing at 30 fps as well. Another FPGA scheme [39] achieved a faster processing speed of 64 HD image frames per second in real-time pedestrian detection using a HOG descriptor and a SVM classifier with a time-multiplexing approach. A full-scale demonstration system for pedestrian detection with nine image scales from 800×600 to 170×128 has been implemented in an Altera Cyclone IV EP4CE115 FPGA device in [40] . This multi-scale pedestrian detector enabled real-time detection for HD resolution as well, processing a non-stop pixel stream from the camera. Although these previous strategies achieved real-time image processing, the speed efficiency was still unsatisfactory for many high-speed vision applications, since the processing speed mainly determines the possible response time to take quick actions.
High-resolution images are popular in these research works as they provide enough pixels to identify faraway or fast-moving objects. A configurable hardware architecture for an AdaBoost-based face detection system [41] took advantage of the integral image for fast Haar-like feature calculation and achieved a maximal speed of 30 fps for 1080p (1920×1080 pixels) video frames at 200 MHz frequency. Logic elements were reduced by more than 50% and 48.4% of RAM was saved compared with previous AdaBoost schemes. Nevertheless, a large additional external memory for buffering image blocks, integral images and detected windows was essential, but was not sufficiently detailed or considered in the performance discussion of [41]. Under these conditions, low power consumption (1.8 mW per 320×240-pixel frame) was reported, mainly due to the TSMC 65nm technology and some transformations for complex hardware operations such as division.
Our previous works [42] , [43] focused on a cell-based hardware architecture combined with NNS classification without feature normalization, simplifying the calculation and leading to a resolution flexibility for the video frame of up to 1024 ×∞ pixels. Approximately 31 fps in XGA (1024×768) throughput has been obtained when operating at 200 MHz, while consuming only 1.29 nJ/pixel of energy and 0.203 Mbit of on-chip memory. Such a cell-based solution clearly reduces the consumption of power and hardware resources for the implementation circuit, when compared to conventional schemes, largely due to the benefit from the concept of ''regular rule of reusing times'' (RRRT) for each cell during window construction. However, the on-chip storage requirement for the NNS classification increases linearly with the number of reference samples. Further, the non-normalized feature-extraction strategy reduced the computational amount and the complexity at the expense of approximately 6% accuracy degradation in pedestrian recognition. In addition, the multiple iterative calculations based on the cells still consume redundant resources, since the scan window slides with a quadratic stride of one block (2×2 cells).
Related to the SURF plus SVM or Haar-like plus SVM frameworks, further recognition frameworks combining HOG and SVM were reported, such as [44] and the aforementioned [38]-[40]. These previous works tried to exploit the fact that HOG features turn out to be more accurate and suitable for pedestrian detection, because they compute edge gradients over a whole image. The reported FPGA-based HOG feature extraction processor [44], embedded with SVM for pedestrian detection, performed well with 47 fps for SVGA (800×600) images at 100 MHz, but for higher resolutions the performance needs to be further enhanced. Multi-scale detection results are additionally considered in [40] for both SVGA and HD images, processed on an FPGA implementation for full real-time pedestrian detection. Nevertheless, since the original HOG feature is not rotation-invariant, the reported strategy of [40] was to use a computationally expensive Fourier analysis to achieve rotation invariance for the HOG feature. This is largely unaffordable in hardware implementations, especially for mobile applications.
In this paper, we propose a hardware-optimized recognition method combining the SVM classifier with global features over the whole image using a Haar-like descriptor, which is distinguished from previous local SURF approaches. An optimized tradeoff between detection accuracy, processing speed, flexibility and hardware cost is the main consideration in this work for mobile devices, which generally supply fewer fast-memory resources and limited battery-based energy, but nevertheless require real-time performance.
C. CONTRIBUTIONS
This work presents a hardware-efficient architecture for object recognition using a framework composed of global Haar-like descriptor and linear SVM classifier. The resulting implementation delivers high-throughput processing to achieve real-time, robust and accurate object recognition with low hardware and energy costs. The main contributions of this work include:
• Approximate computation for block-based feature normalization to achieve robust recognition with lower resource consumption;
• Efficient image-scale generation with configurable window sizes for multi-scale object recognition;
• Flexible regulation for memory allocation to reduce storage overhead and power consumption;
• Parallel-partial recognition accelerator, operating on multiple windows at the same time, for achieving high-resolution-image processing with high throughput rate.
D. STRUCTURE
The rest of this paper is organized as follows. Section II describes the fundamental principles of the employed recognition algorithm. Section III thoroughly introduces the proposed hardware architecture integrating local Haar-like feature extraction, approximate block-based normalization, and linear SVM classification. In Section IV, performance verification is developed based on FPGA implementation, and experimental results are discussed. Finally, conclusions are given in Section V.
II. CONFIGURABLE OBJECT RECOGNITION
A. HAAR-LIKE FEATURE EXTRACTION
Feature selection is a critical issue for object recognition. Appropriate features can maintain original attributes of the raw pixels from the image sensor in desired application scenes. The edge Haar-like features used in this paper are reminiscent of Haar basis functions, which were reported by Viola et al. [19], and are known to be robust in decreasing object-matching complexity as well as enhancing computational performance. Specifically, the difference between the sums of the adjacent pixels within two rectangular oriented planes (see A and B in Fig. 2) over a 4×4-pixel grid (named sub-cell) of spatial locations in either horizontal or vertical direction is computed according to (1) or (2), respectively.
FIGURE 2. Scanning order and mapping relationships between sub-cell, cell, block and window for local Haar-like feature extraction.
Here, $p(x)$ and $p(y)$ are pixel intensities in horizontal and vertical direction, respectively. $D_x$ and $D_y$ represent the differences in their respective direction within the 4×4-pixel sub-cell, as constituted by the orientation planes A and B. Absolute values $|D_x|$ and $|D_y|$ are determined as well to capture the polarity of intensity changes. Afterwards, the differences of 2×2 non-overlapped sub-cells are accumulated for the local four-dimensional Haar-like wavelet response $\vec{v}_{cell}$ of one cell, where
Instead of representing only detected interest points as in many conventional works, we apply the edge Haar-like wavelets as a global feature descriptor across the entire image to preserve more useful information and to eliminate the necessity for interest-point detection from the feature extraction.
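The sub-cell and cell responses described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact forms of (1)-(3) are not reproduced here, so the split of each 4×4 sub-cell into left/right and top/bottom halves for planes A and B, the sign convention (B minus A), and the component order of the four-dimensional cell vector are all assumptions. The function names are hypothetical.

```python
def subcell_response(sub):
    """sub: 4x4 list of pixel intensities. Returns (Dx, Dy).

    Assumed layout: plane A is the left (or top) half of the sub-cell,
    plane B the right (or bottom) half; the response is sum(B) - sum(A).
    """
    left = sum(sub[r][c] for r in range(4) for c in range(2))
    right = sum(sub[r][c] for r in range(4) for c in range(2, 4))
    top = sum(sub[r][c] for r in range(2) for c in range(4))
    bottom = sum(sub[r][c] for r in range(2, 4) for c in range(4))
    return right - left, bottom - top


def cell_response(cell):
    """cell: 8x8 list of pixels, i.e. 2x2 non-overlapped 4x4 sub-cells.

    Returns the assumed 4-D vector [sum Dx, sum Dy, sum |Dx|, sum |Dy|].
    """
    sum_dx = sum_dy = sum_abs_dx = sum_abs_dy = 0
    for r0 in (0, 4):
        for c0 in (0, 4):
            sub = [row[c0:c0 + 4] for row in cell[r0:r0 + 4]]
            dx, dy = subcell_response(sub)
            sum_dx += dx
            sum_dy += dy
            sum_abs_dx += abs(dx)
            sum_abs_dy += abs(dy)
    return [sum_dx, sum_dy, sum_abs_dx, sum_abs_dy]


# A horizontal intensity ramp has a purely horizontal response.
ramp = [list(range(8)) for _ in range(8)]
print(cell_response(ramp))
```

Because no interest-point detection precedes these sums, the same two functions can be applied to every cell of the image in raster order.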
Although the response of the local cell is the fundamental computational component for a window-based object recognition, there is a further desired advancement space for accelerating the processing since the window slides by a quadratic stride, i.e., one block covering 2×2 cells. For robust object recognition, four cell-based Haar-like responses related to the same block are normalized before constructing the block-based feature components. To meet a favorable tradeoff between computational complexity and recognition accuracy, this work employs the L1-norm operation, which is defined as:
Here, $\vec{v}_{ncell}(i)_j$ represents the $i$-th normalized cell-based Haar-like feature component within the $j$-th block in one window, while each $\vec{v}_{cell}(i)_j$ is computed by (3). Note that each of the 4 feature components in the $i$-th cell-based $\vec{v}_{cell}(i)_j$ is independently normalized by the accumulation of the corresponding feature components in the four cells within the $j$-th block according to (4).
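As a floating-point reference for this per-component normalization, a minimal sketch might look as follows. It assumes that the divisor for each of the four components is the sum of the absolute values of that component over the four cells of the block, and that a zero divisor yields zero; neither detail is confirmed by the text. The 16-dimensional block vector is built here by plain cell-order concatenation, which only stands in for the zigzag realignment. All function names are hypothetical.

```python
def l1_normalize_block(cells):
    """cells: four 4-D cell vectors belonging to one block.

    Each component k is divided by the accumulation of component k over
    the four cells (absolute values assumed; zero divisor gives 0.0).
    """
    divisors = [sum(abs(c[k]) for c in cells) for k in range(4)]
    return [
        [(c[k] / divisors[k]) if divisors[k] else 0.0 for k in range(4)]
        for c in cells
    ]


def block_vector(cells):
    """Concatenate the four normalized cell vectors into a 16-D block vector.

    Cell-order concatenation is an assumption standing in for the zigzag
    realignment of the normalized components.
    """
    return [x for c in l1_normalize_block(cells) for x in c]


cells = [[1, 2, 3, 4]] * 4
print(block_vector(cells))
```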
We propose an approximate computation based on fixed-point division with data expansion and shift operations instead of using the costly floating-point division. The four normalized feature components are realigned in a zigzag manner within the $j$-th block and then the 16-dimensional normalized block-based feature vectors $\vec{v}_{block}(j)$ are sequentially created according to
Ultimately, the $d_w$-dimensional feature vector (FV) for representing a target object is constructed inside a window with a cell-based block movement in a zigzag-scan manner according to
where m and n describe the non-overlapped block number for the desired-size scan window in horizontal and vertical directions, respectively, as illustrated in . In this work, flexible window sizes can be reconfigured in the proposed hardware architecture for enabling scalable image recognition, which makes the conventional image-pyramid construction unnecessary. The ratio between the sizes of the scaled windows corresponds to the scaling factor used for construction of the image pyramid. Intermediate calculation results for feature extraction based on cells, blocks and windows are temporarily stored within flexibly allocated memory locations to reduce the storage overhead.
B. SUPPORT VECTOR MACHINES CLASSIFIER
Theory and algorithm of the linear support vector machine (SVM) model were originally established by Vapnik [45]. A linear SVM classifier is applied here because it can produce higher recognition accuracy than other classifiers, including the NNS classifier applied in our previous works [42], [43]. For classification problems, the linear SVM approach in the feature space aims at constructing a classifier of the form:
Here, sgn(·) is the sign function and $f(\vec{v}, \vec{w})$ is the kernel function. The classifier $y(\vec{v})$ carries out the learning task to map the Haar-like FV for a window, $\vec{v} = \{v_i \mid i = 1, 2, \ldots, d_w\}$, extracted from the input testing images, onto a given pattern according to the linear kernel $f(\vec{v}, \vec{w})$. This is done by separating the testing data into two classes, i.e., a positive and a negative class represented by +1 and −1 values, respectively. The linear SVM kernel $f(\vec{v}, \vec{w})$ is defined as
where $v_i$ is the $i$-th dimensional feature component for
Given a training set of FVs, $w_i$ and $b$ can be determined by finding optimal values under specific constraints [34] such as minimization of a cost function
where w is a mathematical matrix notation for representing the weight vector $\vec{w}$. Slack variables $\varepsilon = \{\varepsilon_i \mid i = 1, 2, \ldots, N\}$ are introduced to allow some misclassification during the training phase with N FVs, and C is a positive parameter provided by the user that controls the amount of misclassification. For the linear SVM learning kernel, C determines the upper bound of correct input-FV placement with respect to the margins associated with Lagrange multipliers in the range of (0, C). This work employs an off-line linear SVM learning kernel to provide the weight vector $\vec{w}$ and constant bias $b$ for the on-line object recognition.
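For the on-line phase, the classifier reduces to a weighted sum followed by a sign test. The sketch below assumes the usual linear form, i.e. the inner product of the window FV with the weight vector plus the bias, and assumes that a score of exactly zero maps to the positive class; the function names are hypothetical.

```python
def svm_kernel(v, w, b):
    """Linear kernel value: inner product of FV v and weights w, plus bias b."""
    return sum(wi * vi for wi, vi in zip(w, v)) + b


def classify(v, w, b):
    """Sign of the kernel value: +1 (positive class) or -1 (negative class).

    Mapping a score of exactly 0 to +1 is an assumption.
    """
    return 1 if svm_kernel(v, w, b) >= 0 else -1


w, b = [0.5, -1.0], 0.5          # illustrative off-line training result
print(classify([2.0, 0.0], w, b), classify([1.0, 2.0], w, b))
```

Since only multiplications and additions with pre-trained constants remain, the on-line stage needs no training logic at all.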
III. PROPOSED HARDWARE ARCHITECTURE
A. OVERALL ARCHITECTURE
To obtain an optimal trade-off between resource requirement, power consumption, processing speed, flexibility, and recognition accuracy, this paper presents a configurable hardware implementation for scalable object recognition composed of a global Haar-like feature descriptor and a linear SVM classifier with flexible image resolution up to 2048×2048 pixels. As illustrated in Fig. 3 , the proposed recognition framework in this work includes two parts: off-line training and on-line testing. Table 1 describes the main calculation steps with pseudo codes for our recognition algorithm, combining Haar-like features and a block-based SVM classifier. Specifically, a practical hardware-oriented implementation of each step should consider appropriate optimized hardware architectures with suitable bit width and processing latency, etc., for data storage and transmission.
Corresponding to the hardware architecture in Fig. 3, our approach uses an off-line training phase in advance to obtain the SVM classification model parameters and to save resource requirements. The off-line training phase is executed by software emulation on a 3.30 GHz Intel® Core™ i5-4590 CPU with 8 GB of RAM. A set of training images is collected, grouped into positive and negative samples, and then used for extracting a number (N) of training Haar-like FVs at the beginning of the off-line training phase. The k-means clustering technique is then run on the entire set of training Haar-like FVs to determine k cluster centers (k ≪ N), so as to accelerate the training speed by reducing the number of input training vectors to the following SVM training model. Afterwards, the weight vector $\vec{w}$ and the constant bias $b$ are generated by the linear SVM training model for application in the proposed hardware architecture during on-line object recognition.
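The reduction from N training FVs to k cluster centers can be illustrated with plain Lloyd iterations. This is a generic k-means sketch, not the authors' training code: the seeding with the first k vectors, the fixed iteration count, and the handling of empty clusters are assumptions, and the text does not state how class labels are attached to the resulting centers before SVM training, so that step is left out.

```python
def kmeans(points, k, iterations=20):
    """Reduce a list of feature vectors to k cluster centers (Lloyd's algorithm).

    Assumptions: the first k points seed the centers, a fixed number of
    iterations is run, and an empty cluster keeps its previous center.
    """
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            # Index of the nearest center by squared Euclidean distance.
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


fvs = [[0, 0], [0, 1], [10, 10], [10, 11]]   # N = 4 toy feature vectors
print(sorted(kmeans(fvs, 2)))                # k = 2 centers fed to SVM training
```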
This paper focuses on hardware implementation of the second phase of on-line testing shown in Table 1, which is applied for real-time recognition. The on-line testing phase mainly consists of two parts. First, the incoming testing pixel stream from the camera is mapped to the feature space by the Haar-like descriptor. Then the SVM classification output is provided, which is weighted by the vector $\vec{w}$ and added to a constant bias $b$ for a given class. Three functional parts, i.e., the Haar-like feature extraction engine, the window regulation engine and the SVM classification engine, constitute the proposed configurable object-recognition architecture in this work, as shown in the upper part of Fig. 3.
Firstly, the Haar-like feature extraction engine processes the incoming pixel stream on-the-fly by synchronizing to the working frequency of the camera so that no image buffer or partial pixel buffer is needed for the global feature extraction. All pixels scanned from the camera are processed immediately without pre-storage or pre-processing such as integral image calculation. Variable input parameters define image width (IW), image height (IH), window width (WW), window height (WH), and square cell size (CS) of the scalable high resolution images, thus enabling high flexibility and scalability for both feature extraction and object detection. These input parameters are initialized at the beginning of the testing phase.
Secondly, window feature construction and parallel SVM classification are consistently actuated by the parametric window index from the window regulation engine. The cell-based Haar-like features are sent for normalization within the relevant blocks. The block unit is shifted by one cell in both horizontal and vertical directions inside one window so that there are up to 4 overlapping blocks on one cell. Likewise, the window shifting leads to overlapping of multiple windows over each block. Thus, we can calculate the normalized block-based feature vectors in raster manner over the whole image, and then output them in sequence to proceed with the feature-vector construction of all relevant overlapping windows, which are in different construction stages depending on the block location within each window. The accumulation of window-based Haar-like features and the calculation of the linear SVM-classifier results are executed synchronously based on the block unit. All the partial results are buffered in fast embedded memory.
VOLUME 7, 2019
The memory allocation for partial SVM-result storage and proper data access from the SVM-weight storage are also optimized by the updating scheme of the window index. Finally, classification results are sequentially produced during the parallel accumulation of partial recognition results for multiple windows by the SVM classification engine, resulting in a high overall throughput rate.
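The parallel-partial accumulation can be modeled in software as follows. The sketch takes normalized block vectors in raster order and, for every window that covers the current block, adds the matching slice of the weight vector times the block vector to that window's partial sum; a window's result is emitted once its last block has arrived, after which its storage slot is released. The window stride in block positions, the raster ordering of blocks inside a window, and the dictionary used as partial-result memory are assumptions for illustration, and all names are hypothetical.

```python
def window_origins(n_blocks, win_blocks, stride):
    """Block positions at which a window can start along one axis."""
    return range(0, n_blocks - win_blocks + 1, stride)


def stream_classify(block_fvs, wbx, wby, stride, w, b):
    """block_fvs[by][bx]: block feature vector at block position (bx, by).

    Returns {(ox, oy): kernel value} for every window origin, accumulating
    partial dot products block by block instead of building full window FVs.
    """
    by_count, bx_count = len(block_fvs), len(block_fvs[0])
    dim = len(block_fvs[0][0])
    partial = {}   # live partial sums, one entry per window under construction
    results = {}
    for by in range(by_count):
        for bx in range(bx_count):
            fv = block_fvs[by][bx]
            for oy in window_origins(by_count, wby, stride):
                if not oy <= by < oy + wby:
                    continue
                for ox in window_origins(bx_count, wbx, stride):
                    if not ox <= bx < ox + wbx:
                        continue
                    # Position of this block inside the window (raster order).
                    local = (by - oy) * wbx + (bx - ox)
                    weights = w[local * dim:(local + 1) * dim]
                    acc = partial.get((ox, oy), 0)
                    acc += sum(a * c for a, c in zip(weights, fv))
                    if local == wbx * wby - 1:
                        # Last block of this window: emit result, free the slot.
                        results[(ox, oy)] = acc + b
                        partial.pop((ox, oy), None)
                    else:
                        partial[(ox, oy)] = acc
    return results


grid = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]  # 1-D toy block FVs
print(stream_classify(grid, 2, 2, 1, [1, 2, 3, 4], -1))
```

In this model the number of live entries in `partial` is bounded by the windows currently under construction, which mirrors the idea of reusing the storage of completed windows.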
B. FEATURE EXTRACTION ENGINE
According to the row-raster-scan manner of the image sensor in the camera, the pixels covering the entire image are transferred to the proposed Haar-like feature extraction engine in sequential order as shown in Fig. 4. Dispensing with the interest-point detection applied in many conventional methods such as SIFT and SURF, the feature extraction operation starts as soon as the pixels are received by the on-the-fly data-stream-processing architecture. To offer higher throughput and lower storage requirements, the need for image buffers or pixel-row buffers to cache the pixels for local feature extraction is eliminated in this work. On the basis of the transformation characteristics from pixel intensities to sub-cell responses described in (1) and (2), the non-overlapped sub-cell responses are successively generated with a pixel-based pipelined circuitry. Then the sequential sub-cell responses are accumulated as soon as their generation has finished by the same summation circuit containing a multiplexer and an adder, in order to construct the local cell-based feature components according to (3). This data-flow and hardware-architecture concept leads to a reduction in hardware consumption and power dissipation for the construction of the local cell response $\vec{v}_{cell}$, when compared to previous architectures.
The acceptable image resolution for the proposed Haar-like feature descriptor depends on the storage capacity for intermediate calculation results, while more memory usage leads to larger area requirements and higher power dissipation. This work introduces a flexible allocation of memory space, enabling the overwriting of data that is no longer needed and the reuse of the related storage location, to obtain a better tradeoff between high image resolution and small hardware consumption. Due to the pipeline strategy for local feature extraction, processed pixels, sub-cell responses, and cell-based and block-based feature components can always be deleted after the respective data has been transferred to the next pipeline stage. Thus the memory space for caching the intermediate results of sub-cells, cells or blocks can be continuously reused.
Flexible control to allocate and regulate the memory space for different image resolutions is implemented by modulo-M counters (i.e., counting from '0' to 'M-1') in combination with some essential peripheral logic gates.
Given an image width w, a modulo-w/4 counter is applied for allocating the access address to read/write the partial sub-cell responses from/to the intermediate-storage memory with w/4 words. A similar circuit implementation is used for memory allocation in the case of serial local cell-FV calculation. Since the maximum acceptable image resolution in the proposed hardware architecture relies on the memory capacity for such intermediate storage, this work provides four dual-port memories supplying 8 kB capacity for extracting local cell-based features to enable a scalable resolution of up to 2048 × 2048 pixels for the processed image frames.
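A software model of this address allocation is sketched below. A modulo-w/4 counter walks over a memory of w/4 words while the pixels arrive in row-raster order; each word accumulates the partial horizontal response of one sub-cell over four image rows, is emitted after the fourth row, and is then cleared for reuse by the next row of sub-cells. The sign convention (right half minus left half), the restriction to the horizontal response, and the class and function names are assumptions; image width and height are assumed to be multiples of 4.

```python
class ModuloCounter:
    """Counts 0, 1, ..., modulus-1 and wraps around."""

    def __init__(self, modulus):
        self.modulus = modulus
        self.value = 0

    def tick(self):
        current = self.value
        self.value = (self.value + 1) % self.modulus
        return current


def stream_dx(pixels, width, height):
    """Row-raster pixel stream -> horizontal sub-cell responses, using w/4 words."""
    words = width // 4
    mem = [0] * words                 # recycled intermediate storage
    addr = ModuloCounter(words)
    stream = iter(pixels)
    out = []
    for y in range(height):
        for _ in range(words):
            p0, p1, p2, p3 = (next(stream) for _ in range(4))
            a = addr.tick()
            mem[a] += (p2 + p3) - (p0 + p1)   # partial response of this pixel row
            if y % 4 == 3:                    # fourth row completes the sub-cell
                out.append(mem[a])
                mem[a] = 0                    # location is free for reuse
    return out


print(stream_dx(list(range(8)) * 4, width=8, height=4))
```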
In addition, feature normalization is applied to the four cell-based feature components within each block, to mitigate recognition-accuracy degradation resulting from environmental changes such as brightness changes. The L1-norm according to (4) leads to decimals between ''0'' and ''1'' for each dimension result. However, in a straightforward implementation the floating-point operation relies on resource-consuming mathematical circuitry, which is unaffordable in many hardware designs. Since integer arithmetic is much faster and less complex than floating-point arithmetic, as it can usually be done directly by using common logic gates (AND, OR, XOR), this work applies an approximate computation with fixed-point block-based feature normalization, as illustrated in Fig. 5. The cell-based feature components are respectively expanded by computing the product of each feature dimension with a factor $2^q$ ($q > 1$) before normalization. This product can be realized by bit-shifting operations, which are even faster than additions. The shifting amount $q$ is set to 12 in the practical circuit, bound by the tradeoff between hardware cost and result precision. Under consideration of data size and calculation speed, a 32-bit fixed-point divider is employed to perform the division between the cell-based feature components and the block-based divisor for feature normalization according to (4).
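The expand-then-divide idea can be checked numerically with a short sketch. It assumes non-negative operands, truncating integer division, and a result of zero for a zero divisor; the function name is hypothetical, and the 32-bit width of the divider is not modeled.

```python
Q = 12  # shifting amount q used for data expansion


def fixed_point_normalize(component, divisor, q=Q):
    """Approximate component / divisor as an integer scaled by 2**q.

    Assumptions: non-negative operands, truncating division, and a zero
    divisor mapped to 0.
    """
    if divisor == 0:
        return 0
    return (component << q) // divisor   # left shift realizes the factor 2**q


approx = fixed_point_normalize(1, 3)
print(approx, approx / 2 ** Q, 1 / 3)
```

With q = 12 the truncation error of each normalized component stays below 2^-12, which illustrates the precision side of the stated tradeoff.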
Since the 2×2-cell block moves by one cell, i.e., half the block size, within the window in a zigzag-scan manner, all cells are covered by more than one block, except for the cells in the four window corners. Thus nearly all of the serial cell-based feature components are reused, which means that temporary storage is necessary for caching cell-based and block-based feature components. Nevertheless, the buffer space of earlier-stored cells and blocks that are no longer needed for normalization can be overwritten with subsequently received cells and partial block-normalization results, in the order in which this storage space becomes available for reuse in the cell and block memories. Specifically, only one image row of cell-based and block-based feature components has to be stored while the block moves cell by cell during normalization. Thus 256 words for temporarily storing one row of 8×8-pixel cells and 2×2-cell blocks are enough for 2048×2048-pixel images. Note that there is sufficient time (8 clock cycles) to perform the feature normalization in real time for each block, since the feature-extraction engine receives one pixel per clock cycle from the image sensor in the camera.
C. WINDOW REGULATION ENGINE
The window regulation engine, shown as the second major functional block in Fig. 3, serves as a control unit for the concurrent processing of window-feature construction and SVM classification. On the one hand, the sequentially incoming normalized block-based FVs are allocated to the relevant windows according to the respective block location in each of these windows, so that the window-based FVs, which are used for matching the target objects in the input testing image, are generated in parallel. On the other hand, during the construction of the window-based FVs, the linear SVM kernel calculations according to (8) are carried out in parallel as well. The window index indicating the overall calculation progress is generated by the window regulation engine. Specifically, a system configuration unit (SCU) transforms the customization parameters (e.g., CS, WW, WH, IW, and IH) into dynamic control signals for window localization and window index update, e.g., the initial window index $I_{ni}$, which represents the first and smallest window index covering the current block.
The circuit architecture for implementing the units for the window localization and window index update is illustrated in Fig. 6. Basically, the window localization circuitry is used for calculating the number of windows covering the current block and their respective locations (i.e., window indices) across the entire image.
According to the dynamic control signals generated by the SCU, all window indices related to the current block can be computed in sequence by the adder network in the window circuitry according to $I_{or} = I_{ni} + N_{CI} \cdot i + j$, where $i \in [1, N_{ov}]$ and $j \in [1, N_{oh}]$. $N_{ov}$ and $N_{oh}$ are the respective numbers of overlapping windows in the vertical and horizontal directions, while $N_{CI}$ is the number of windows across the entire image in the horizontal direction. The detailed structure of the adder network is shown in Fig. 7. In this work, almost all blocks are covered by multiple windows, except for the four blocks located at the four corners of the rectangular image. Based on the first window index $I_{ni}$ for the current block, the $N_{ov}$ window indices $I_{or}$ for each of the $N_{oh}$ horizontally overlapping windows are calculated in turn. With the block-based FVs received sequentially in row raster-scan order, all related window indices across the entire image are calculated.
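The index enumeration performed by the adder network can be sketched in software as below. The function name `covering_window_indices` and the numeric parameter values are hypothetical; the formula and the index ranges are taken as stated in the text, and the loop order (all vertical indices for one horizontal position, then the next) follows the description above.

```python
def covering_window_indices(i_ni, n_ci, n_ov, n_oh):
    """List I_or = I_ni + N_CI * i + j for i in [1, N_ov], j in [1, N_oh]."""
    indices = []
    for j in range(1, n_oh + 1):        # horizontally overlapping windows
        for i in range(1, n_ov + 1):    # vertically overlapping windows
            indices.append(i_ni + n_ci * i + j)
    return indices


# Illustrative values only: 10 windows per image row, a block covered by
# 2 windows vertically and 3 windows horizontally.
example = covering_window_indices(i_ni=0, n_ci=10, n_ov=2, n_oh=3)
```

In hardware these sums are produced by adders rather than by a loop, but the resulting set of indices is the same.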
The main purpose of computing the related window indices is to indicate the calculation progress and to account for the memory words required for storing temporary window results. Because most of the blocks are covered by multiple windows and arrive in row raster-scan order, the intermediate results of some windows have to be cached until the accumulations of the relevant window-based FVs are completed. In other words, not all window indices $I_{or}$ have to be retained until the whole image is completed. As illustrated in Fig. 8, in order to reduce resource consumption, only the region named 'cache region', defined by two overlapping window rows that share one row of boundary blocks at the end of the upper window, has to be preserved for the parallel window calculation. The redundant previous-window index $I_{pr}$ is subtracted from the original window index $I_{or}$ according to (10). The resulting updated window index $I_{up}$ indicates the access address of the memory for storing the temporary window results during the SVM calculation.
The redundant previous-window index $I_{pr}$, indicating the front-stored window, can be computed by $I_{pr} = N_{CI} \cdot N_{RW} \cdot n_{wr}$, where $N_{RW}$ represents the number of non-overlapped blocks in the vertical window direction, $N_{CI}$ represents the number of windows in the horizontal direction across the entire image, and $n_{wr}$ is a dynamic variable representing the current row number of non-overlapping windows in the entire image. A modulo-M counter is employed for monitoring the count of non-overlapping window rows in the vertical direction. In particular, a 5-bit modulo-32 counter is sufficient for counting the non-overlapping window rows when a window with 64-pixel height is shifted over an image with 2048-pixel height.
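The index update and the counter sizing can be checked with a few lines of code. This is a hedged sketch: the function names and the numeric arguments of the index example are hypothetical, and the relation between the updated, original, and previous-window indices is written as a plain subtraction, which is how the text describes (10).

```python
def previous_window_index(n_ci, n_rw, n_wr):
    """I_pr = N_CI * N_RW * n_wr, as given in the text."""
    return n_ci * n_rw * n_wr


def updated_window_index(i_or, n_ci, n_rw, n_wr):
    """Subtract I_pr from I_or (our reading of the description of (10))."""
    return i_or - previous_window_index(n_ci, n_rw, n_wr)


def window_row_counter(image_height, window_height):
    """Modulus and bit width of the counter for non-overlapping window rows."""
    modulus = image_height // window_height
    bits = max(1, (modulus - 1).bit_length())
    return modulus, bits


# Counter sizing for a 64-pixel-high window on a 2048-pixel-high image.
modulus, bits = window_row_counter(2048, 64)

# Illustrative index update with made-up parameter values.
i_up = updated_window_index(i_or=175, n_ci=10, n_rw=8, n_wr=2)
```

The first result reproduces the 5-bit modulo-32 counter mentioned above; the second shows how a large running index is mapped back to a small memory address.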
The window index is updated along with the dynamic shifting of the cache region from the upper to the lower part of the image. Superfluous memory space for window storage can be eliminated with the developed flexible regulation for memory allocation, based on the updated window index $I_{up}$. Given a testing image with w×h pixels, the necessary number of buffered windows $N_{BW}$ is given by (11).
D. SVM CLASSIFICATION ENGINE
Under the control of the window regulation engine, the block-based normalized FVs and the trained model parameters (i.e., weight vector $\vec{w}$ and bias $b$) are synchronously invoked for the linear SVM kernel calculation according to (8) by the SVM classification engine, which operates in parallel on all related windows and intermediately stores the partial SVM-classification results, as illustrated in Fig. 9.
To identify a target object in the testing phase, the training model should employ appropriate datasets and choose designated images as positive or negative samples in the training phase. Since the training phase does not have to be synchronized with the testing phase in practical applications, the training for identifying specific target objects is carried out off-line in advance to save hardware resources. The corresponding model data are transferred to the SVM classification engine, where the high-dimensional weight vectors are stored in the single-port SVM weight storage. The high-dimensional weight vectors from the off-line linear SVM-kernel training are split into four parts with respect to the zigzag arrangement of block-based weight components in each window. In accordance with the number $N_b$ of overlapping blocks in a designated window, each part of the SVM weight storage has $N_b$ words. Four 64-bit single-port memories, each with 128 words, are employed in the real circuit, so that up to $N_b = 128$ overlapping blocks in a window with their 16 weight components can be included. The 64 bits from each single-port memory are further split into four 16-bit parts, corresponding to the 4 components of a cell FV.
On the other hand, the block-based normalized FVs of the input testing images from the camera are transferred for matrix multiplication with the block-based weight components. For each 2×2-cell block, 16-dimensional block-based computations are executed at the same time. The products of the multiple multiplications for one block are further accumulated in parallel by the adder tree. To obtain the integrated window-based results for detecting the target object in the testing image, the block-based results are further accumulated in the related overlapping windows in parallel, and the intermediate partial window results are cached in the 'Partial SVM storage', as shown in the right part of Fig. 9. The required memory capacity for buffering and updating the window results is determined by the window index $I_{up}$ according to (10) and (11), as discussed in Section III-C. Specifically, a 1024-word dual-port memory is sufficient for caching the intermediate SVM results in the case of 2048×2048-pixel images scanned by a 64×128-pixel window. A reduction of storage requirements by more than 80% for calculating SVM results is achieved in the case of HD images due to the flexible reutilization of memory space.
Due to the partial-processing concept in this work, temporary storage of partial SVM-classification results is only necessary for a part of all windows at the same time. Further, the intermediately cached partial results are in a different completion status for each of the windows processed in parallel. Given a 64×128-pixel window with $N_b = 105$ overlapping blocks, once the block indicated in red color in Fig. 8 is received and processed, the classification calculation for the window highlighted with a blue-solid boundary line is 99/105 ≈ 94.3% complete, while the window indicated with a pink-solid boundary line is only 1/105 ≈ 0.95% complete.
A 'Load' signal is produced by a finite state machine and applied for automatically latching the completed SVM result of each window into a register. Finally, the bias b from the training model is added to the latched SVM result. Thus the SVM results for each of the windows, slid across the whole testing image, are outputted sequentially and applied for indicating the classification results.
Once the calculation of each current window is completed, the corresponding memory space in the 'Partial SVM storage' can be overwritten and replaced by the intermediate SVM result of a new incoming window. The corresponding window index $I_{up}$ for indicating the access address of the 'Partial SVM storage' is then reassigned as well.
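The parallel-partial accumulation, the latching of completed results, and the reuse of storage slots can be mimicked in software as follows. This is an illustrative sketch rather than the implemented engine: the class `PartialSvmStorage`, its dictionary-based storage, and the toy two-component feature vectors are assumptions, and the caller is assumed to supply the weight components that belong to the block's position inside the given window.

```python
class PartialSvmStorage:
    """Software stand-in for the 'Partial SVM storage' with recycled slots."""

    def __init__(self, n_blocks, bias):
        self.n_blocks = n_blocks   # N_b: number of blocks per window
        self.bias = bias           # b from the off-line trained model
        self.partial = {}          # slot address -> [partial sum, blocks seen]

    def accumulate(self, slot, block_fv, block_weights):
        """Add one block's dot product; return the score once complete."""
        dot = sum(f * w for f, w in zip(block_fv, block_weights))
        entry = self.partial.setdefault(slot, [0, 0])
        entry[0] += dot
        entry[1] += 1
        if entry[1] < self.n_blocks:
            return None            # window is still incomplete
        del self.partial[slot]     # the slot may now hold a new window
        return entry[0] + self.bias


# Toy example: windows of 3 blocks, two windows in flight at once.
storage = PartialSvmStorage(n_blocks=3, bias=-5)
weights = [1, 1]
r1 = storage.accumulate(0, [1, 2], weights)
r2 = storage.accumulate(1, [9, 9], weights)
r3 = storage.accumulate(0, [3, 4], weights)
r4 = storage.accumulate(0, [5, 6], weights)
```

Only the last call returns a score, because window 0 has then received all of its blocks; window 1 remains cached with one block accumulated, which mirrors the different completion states discussed above.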
In addition, different datasets and chosen designated images can be applied to retrain the off-line model, so that multiple patterns can be used for recognizing different target objects on line. In particular, multiple SVM classification engines can be operated in parallel for multi-object recognition.
IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. HARDWARE IMPLEMENTATION
The hardware architecture was described in Verilog HDL and verified on an Altera DE4 development board, which is powered by the Stratix IV GX device. A demonstration system for pedestrian detection was then built, as illustrated in Fig. 10. The camera records testing videos in real time with a charge-coupled device (CCD) and passes the pixel data of each frame to the developed object-recognition circuitry implemented on the DE4 FPGA board. The synthesis tools running on the PC transfer the configuration files to the FPGA development board through the USB Blaster port. Specifically, the camera-captured XGA-size image frames are transferred to the FPGA device on the main board through a Camera Link connector, as illustrated on the right side of Fig. 10. The circuitry running on the main board handles the pedestrian identification through the developed circuitry for Haar-like feature extraction and SVM classification, and then outputs image frames with detection results to an LCD display via an independent DVI transmitter-receiver sub-board. Weight vectors and biases for the SVM classifier are determined beforehand with the off-line training model and then stored in the RAM of the main FPGA board. In addition, the desired sizes of both detection windows and image frames can be designated flexibly by initializing the input settings.

FIGURE 10. Prototype of the object recognition system based on an Altera DE4 FPGA development board, employing Haar-like feature extraction and SVM classifier.

The resource utilization of the proposed circuitry on the FPGA board for fast human recognition is listed in Table 2. Most logic operations of the proposed circuitry are implemented by addition and subtraction, which are configured with combinational adaptive look-up tables (ALUTs). The frequent access, transmission, and storage of intermediate computational results for multiple overlapping windows make use of the memory ALUTs (<1% needed) on the Stratix IV board.
Multiply and divide operations in the circuitry are also configured with the combinational ALUTs (total usage only 3%) rather than with the digital signal processing (DSP) block elements to obtain more efficient resource usage.
In contrast to an NNS strategy, the SVM classifier eliminates the influence of an increasing number of reference vectors on the memory requirements. The total memory usage in this work is mainly determined by the window size for detecting the target objects, since the window size determines the dimensionality of the feature vector and the weight vector, which is partially or completely stored in memory. For instance, based on the 16-dimensional components for each block (16×16 pixels), the 105 overlapping blocks within one 64×128-pixel window make up a 1680-dimensional weight vector for the SVM model, while the 121 overlapping blocks within one 96×96-pixel window create a 1936-dimensional weight vector. Thus 420 or 484 64-bit words, respectively, are required in the SVM weight storage (see Fig. 9) for storing the weight vector trained with the SVM model for a 64×128-pixel window or a 96×96-pixel window. On the other hand, the memory requirements for caching the intermediate window results during both Haar-like feature extraction and SVM classification are determined by the testing-image width.
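The block counts, weight-vector dimensions, and word counts quoted above follow from the window geometry and can be reproduced with a short calculation. The function names are hypothetical; the block count per window is derived from the 8×8-pixel cells and the 2×2-cell blocks that move by one cell, and the 16-bit weight width and 64-bit word width are those stated for the SVM weight storage.

```python
CELL = 8              # cell size in pixels (8x8-pixel cells)
DIMS_PER_BLOCK = 16   # 4 cells per block, 4 components per cell
BITS_PER_WEIGHT = 16  # one 16-bit part per weight component
WORD_BITS = 64        # word width of the SVM weight storage


def blocks_per_window(width, height, cell=CELL):
    """2x2-cell blocks moving by one cell: (cells_x - 1) * (cells_y - 1)."""
    return (width // cell - 1) * (height // cell - 1)


def weight_dimensions(width, height):
    """Dimensionality of the window-based weight vector."""
    return blocks_per_window(width, height) * DIMS_PER_BLOCK


def weight_words(width, height):
    """Number of 64-bit words needed to hold the weight vector."""
    return weight_dimensions(width, height) * BITS_PER_WEIGHT // WORD_BITS


pedestrian = (blocks_per_window(64, 128), weight_dimensions(64, 128),
              weight_words(64, 128))
car = (blocks_per_window(96, 96), weight_dimensions(96, 96),
       weight_words(96, 96))
```

For the two window sizes this yields 105 blocks, 1680 dimensions, and 420 words, and 121 blocks, 1936 dimensions, and 484 words, in agreement with the numbers in the text.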
The demonstration system needs approximately 31.46 ms per XGA image frame when operating at a clock frequency of 25 MHz, resulting in ≈31.79 frames per second (fps), i.e., real-time (frame rate >30 fps) object recognition. The synthesis results on the FPGA board indicate that our developed circuitry is able to work at up to 170 MHz, which enables a maximum processing speed of 216 fps with XGA frames or 70 fps with HD frames for object recognition. The total FPGA-board power dissipation is 2161.17 mW when operating at 170 MHz, of which 652.07 mW is the dynamic power dissipation of the FPGA core. An even higher processing speed with substantially lower power consumption can be obtained if the circuitry is fabricated as an advanced CMOS ASIC.
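The XGA timing figures can be reproduced with a simple estimate. The sketch below assumes that the circuitry consumes exactly one pixel per clock cycle, as stated for the feature-extraction engine, and that no additional cycles are spent on blanking or pipeline overhead; the latter is an idealizing assumption, and the function names are hypothetical.

```python
def frame_time_ms(width, height, clock_hz):
    """Idealized frame time in milliseconds at one pixel per clock cycle."""
    return width * height / clock_hz * 1e3


def frames_per_second(width, height, clock_hz):
    """Idealized frame rate at one pixel per clock cycle."""
    return clock_hz / (width * height)


xga_time_25mhz = frame_time_ms(1024, 768, 25e6)
xga_fps_25mhz = frames_per_second(1024, 768, 25e6)
xga_fps_170mhz = frames_per_second(1024, 768, 170e6)
```

Under these assumptions the estimate gives about 31.46 ms and 31.79 fps at 25 MHz, and about 216 fps at 170 MHz for XGA frames.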
A comparison to other hardware implementations is presented in Table 3. Much less memory is required in this work than in other FPGA schemes, due to the flexible regulation for memory allocation. The similarly small memory usage of our previous NNS-based architecture implemented in 65 nm CMOS [43] mainly results from the finite number of stored reference vectors. The memory usage in [43] will substantially increase with a larger number of on-chip-stored reference vectors. In [39], twice the throughput (64 fps) for HD resolution is obtained at the cost of consuming approximately twice the memory, when compared to [38]. The throughput rate was improved with less memory usage in [40], mainly due to the higher operating frequency.
More comparisons to CNN-based approaches implemented in hardware are listed in Table 4. An 8-bit fixed-point LeNet inference engine [46] achieved a throughput of 44.9 GOPS in pipeline processing and a high accuracy of 98.16% for handwritten digits when implemented on a Xilinx 485t FPGA. Another CNN accelerator [47] optimized the OpenCL kernels, efficiently utilized the flexible hardware resources on an Arria 10 GX1150, and achieved desirable performance in both floating-point and fixed-point schemes compared to existing CNN-based methods. Reference [48] provided a CNN model with high throughputs of 780.6 GOPS, 669.1 GOPS, and 552.1 GOPS for AlexNet, VGG16, and FCN-16s, respectively. The CNN-based solution [46] consumed fewer resources than the other two CNN-based works in Table 4, at the cost of much lower throughput. However, [46] achieved excellent power efficiency, even in comparison to our work. The hardware-oriented strategies enhance the processing speed, but CNN algorithms are known to be both computation-intensive and data-intensive, so that they generally consume much more hardware resources than our method with separate feature extraction and object recognition, as listed in Table 4. Table 4 illustrates that our method provides a better balance between computation, throughput, resource requirements (e.g., memory access), and cost for mobile applications than the CNN-based approaches. Moreover, the maximum image resolution that can be processed is extended in this work while a higher throughput is achieved at the same time, based on the applied acceleration strategy with optimized choices of parallel and partially processed architecture parts. The approximate computation for block-based normalization further reduces the computational complexity and thus the resource consumption.
Additionally, the flexibility of variable window sizes from 64×64 pixels to 512×512 pixels can be applied for multi-scale recognition. Consequently, a fast multi-scale object-recognition prototype with efficient resource utilization has been implemented by using the proposed hardware circuitry.
B. EXPERIMENTAL ANALYSIS
To evaluate the recognition accuracy, we use pedestrian and car recognition as example applications of our FPGA implementation. An equivalent software emulation was built to verify the proposed framework with Haar-like feature descriptor and SVM classifier. Figure 11(a-c) shows some positive samples of the front-view car dataset collected by ourselves at Hiroshima University (HU), side-view car samples from the UIUC dataset [49], and pedestrian samples from the INRIA dataset [50], respectively. The negative samples for both pedestrian and car recognition are cropped from the INRIA dataset. Specifically, 2416 pedestrian images, 550 side-view car images, and 1258 front-view car images are cropped as positive samples. They are applied for training with 12180 negative samples using an off-line linear SVM kernel, as illustrated in the bottom half of Fig. 3. To verify the effectiveness of the proposed block-based 'Haar-like plus SVM' solution in comparison to our previous cell-based solution, which combined a Haar-like feature descriptor with an NNS classifier [43], another 1126 positive pedestrian samples and 4530 negative samples in 64×128-pixel windows were cropped from the INRIA dataset for person recognition. In addition, 199 side-view car images from the UIUC dataset and 556 front-view car images collected at HU are applied for car recognition in the testing phase. Figure 12 shows the precision-recall curves of the proposed SVM-based solution with normalization for both pedestrian and car recognition. For pedestrian recognition on the INRIA dataset, the solution based on the SVM classifier performs better than the NNS-based classifier [43]. The area below the precision-recall curve indicates an average precision (AP) of 89.81% for the proposed SVM-based classifier, which is approximately 7 percentage points higher than that of our previous NNS-based classifier (82.84%).
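For readers who want to reproduce an average-precision figure from classifier scores, one common way of computing the area below a precision-recall curve is sketched here. The step-wise summation used in this sketch is an assumption, since the exact integration method behind the reported AP values is not stated, and the function name and the toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve.

    scores: classifier outputs (higher means more likely positive).
    labels: 1 for positive samples, 0 for negative samples.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y == 1)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    true_positives = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Toy example with four samples, two of them positive.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

In the toy example the two positives are found at ranks 1 and 3, so the result is the mean of the precisions 1 and 2/3.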
HOG-based pedestrian recognition performs better than the 'Haar-like plus SVM' solution, since the HOG feature is more capable of capturing shape information than the Haar-like feature. For car recognition, the framework based on Haar-like features obtains a performance comparable to that of the HOG-based pedestrian-detection solution. A further positive impact of the block-based feature normalization is the mitigation of the recognition-accuracy degradation due to image-brightness changes. The L1-norm defined in (4), operating on the four cells within each block, is employed in this work due to its simplicity for hardware calculations when compared to other normalization strategies such as the L1-sqrt-norm or the L2-norm in [16].
The true positive rate (TPR) against the false positive rate (FPR) for our SVM-based pedestrian recognition with and without normalization is plotted as receiver operating characteristic (ROC) curves in Fig. 13. Such ROC curves visualize the relative trade-off between true positive (benefits) and false positive (costs) rates. Better-optimized prediction models have a larger area below their ROC curve and approach closer to the upper left corner, i.e., the coordinate (0,1). We find that the normalized classification models approach clearly closer to the upper left corner and thus perform better than the non-normalized ones. In other words, the feature normalization clearly benefits the recognition performance by raising the correct-classification potential of the SVM classifier.
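A minimal sketch of how ROC points and the area below the curve can be obtained from classifier scores is given below. It is a generic illustration, not the evaluation script used for Fig. 13: the function names and the toy data are hypothetical, and tied scores are not given any special treatment.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative samples are required")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example with two positive and two negative samples.
points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
auc = area_under_curve(points)
```

A model whose curve passes closer to the coordinate (0,1) yields an area closer to 1, which is the criterion used above to compare the normalized and non-normalized models.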
In addition, Fig. 13 reveals the influence of the numerical format on both the pedestrian and the car prediction models. The models built on floating-point computation with normalization are found to be more accurate for pedestrian recognition. Nevertheless, we employ the fixed-point calculation to reduce the computational complexity and thus the hardware cost, according to the analysis in Section III-B. Furthermore, car recognition (both front view and side view) performs better than pedestrian recognition, since the image samples in the INRIA dataset contain more complicated information such as pedestrian postures and background texture. Though the HOG-based model [16] works better for pedestrian detection, our original Haar-like model obtains a similarly desirable performance in car recognition. We have selected the Haar-like model for mobile applications due to the savings in hardware resource consumption.
To independently estimate the influence of the numerical format on the detection accuracy, approximate computations based on fixed-point division with data expansion at different bit widths are carried out and compared with the floating-point computation, using 4600 input samples from the INRIA dataset for recognition-accuracy determination. The statistical results show an accuracy degradation between 1.8% and 6.9% for the fixed-point approximate computations, depending on the bit width, as illustrated in Fig. 14. However, the accuracy degradation narrows gradually and becomes stable for the fixed-point solutions as the number of input samples increases. Specifically, the average accuracy for 6-bit, 8-bit, 10-bit, and 12-bit fixed-point pedestrian recognition is approximately 92.89%, 93.61%, 94.38%, and 95.49%, respectively, while the average accuracy of the floating-point solution is 98.46%. With a growing number of testing samples, the maximal accuracy gap between the 12-bit fixed-point solution and the floating-point solution shrinks to 2.1%. To obtain a good trade-off between hardware cost and accuracy, this work employs 12-bit fixed-point instead of more costly floating-point computation to realize fast recognition with low hardware cost and a maximum accuracy of 96.93%. Thus, a comparable accuracy can be obtained by employing the fixed-point hardware architecture with less resource consumption and higher processing speed. We have further evaluated the detection accuracies of the SVM classifier using the proposed block-based Haar-like feature for both pedestrian and car recognition, and compared them to the cell-based HOG feature results of our previous work [51], as illustrated in Fig. 15. In [51], the original HOG feature is simplified to save hardware resources for mobile applications, at the cost of a certain degree (≈2%) of precision degradation.
An accuracy of up to approximately 99.09% for side-view car recognition based on the UIUC dataset and 99.08% for front-view car recognition can be reached in this work, which is better than the performance for pedestrian recognition. Although the HOG-based solution obtains slightly higher accuracy for car recognition, as shown in Fig. 15, we have employed the block-based 'Haar-like plus SVM' solution with its acceptable accuracy, given the finite resources such as on-chip memory and processing energy in most practical mobile applications.
V. CONCLUSION
We introduced a hardware-efficient recognition architecture using Haar-like features and an SVM classifier, which achieves an average precision approximately 7 percentage points higher than that of our previous NNS-based architecture. The proposed recognition framework is verified to obtain a competitive accuracy of 96.93% for the complex pedestrian recognition and about 99.08% for car recognition. A block-based feature normalization with approximate computation is included for robust recognition with less resource consumption. The window size can be configured by input signals in this hardware architecture to enable efficient scale generation for multi-scale recognition. Moreover, flexible regulation of the memory allocation is employed to reduce the storage requirements to 193 KB for high-resolution frame processing up to 2048×2048 pixels. HD (1920×1080 pixels) frames can be handled at a high speed of 70 fps at the maximum frequency of 170 MHz to achieve real-time object recognition. Although the FPGA implementation in this paper consumes more than 2 W of total power at 170 MHz, much lower power dissipation and faster processing speed are expected from using advanced CMOS technology in our future work. In addition, multiple implementations of the proposed recognition architecture can be integrated and operated in parallel to enable synchronous multi-object recognition.
XIANGYU ZHANG received the B.S. degree from Tianjin University, China, in 2013, and the M.S. degree from Hiroshima University, Japan, in 2016.
She is currently pursuing the Ph.D. degree with the Taoyaka Program, Hiroshima University. Her main research interests include the hardware development for energy-efficient image-recognition algorithms.
HANS JÜRGEN MATTAUSCH (M'96-SM'00) received the Ph.D. degree from the University of Stuttgart, Germany, in 1981.
From 1982 to 1996, he was with Siemens AG, Munich, Germany, where he was involved in the development of CMOS technology, memory development, CMOS circuit design, and compact model development. Since 1996, he has been a Professor with Hiroshima University, researching in the fields of VLSI design, nano-electronics, and compact modeling. He is a member of IEICE.
