Certain classification tasks in computer vision require the classifier response to be computed at every pixel of an image. When such dense evaluation is combined with large, complex features, it becomes challenging to build the classifier on a standard PC architecture and still achieve real-time performance.
INTRODUCTION
A typical visual object detection task requires "computing something everywhere in the image". One of the simplest forms uses a set of classifiers, one per primitive object, to compute response maps in all image pixels. This works as an operator transforming the image to another image, more suitable for building an object detector.
We want to detect passenger cars in videos of crossing traffic based on a structured model that includes the wheels, the wheelbase, the pillars, etc. The wheel classifier and detector (Fig. 1) is therefore the part that has to be run everywhere in the image. When evaluating the detection likelihood ratio for the rest of the model, the raw input image is needed, but it is not necessary to access all its pixels. In this paper we concentrate on this approach: pre-computing wheel likelihood maps for use in a car detector.
We present an FPGA implementation of an object classifier. Our solution is generic and can be used in most localized object detection tasks. In addition, to boost performance under diverse illuminations, our procedure includes learnable adaptive image re-quantization based on local image contrast. We refer the reader to [1] for a more complete picture of the whole car detection subsystem and to [2] for an extended version of this paper.
In recent years, the need for real-time applications of object detection has risen, and many hardware systems have been proposed for accelerating object classification and detection. The most common is the AdaBoost-based classifier adapted from the classification scheme by Viola and Jones for face detection [3]. The majority of papers use Haar-like features and focus on FPGA implementation of the integral image feature extraction and cascaded classification. Gao et al. [4] have implemented a 40-stage cascade on an FPGA, processing 16 features per stage simultaneously. Hiromoto et al. [5] compute features in the first few stages in parallel in order to speed up the initial decision process.
He et al. [6] also use Haar features, but they build a cascade of Artificial Neural Network classifiers. Support Vector Machines are also suitable for FPGA acceleration, as in [7], where a cascading approach is again used. The huge amount of data that must be processed in traditional sliding-window approaches is reduced in [8], where a hardware edge detector has been added to steer the focus of the classification cascade.
We utilize the AdaBoost-based framework for detection. This learning scheme has the advantage that a criterion for selecting efficient features can be easily incorporated.
We place emphasis on using features that are more structured than simple Haar-like ones: general kernels that require a full convolution to be computed. The properties of the bank of kernels are determined by the application and can represent knowledge about the detected objects that would be hard to learn automatically. In our application, we use rotationally invariant templates for wheel detection.
We do not employ a classification cascade, for two reasons. The first is that we use fewer (but more complicated) features. The second reason is that we need the classifier response map (not the decision) for the entire image for post-processing (Fig. 1).
CLASSIFIER WITH STRUCTURED FEATURES
A linear classifier classifies fixed-size image patches $S_{x,y}$ around a pixel $(x, y)$. The classifier learned by AdaBoost is composed of a set of weak classifiers, each defined as a triplet $(M_i, t_i, g_i)$,
where $M_i$ is its associated kernel (either a real or a complex matrix), $t_i$ is a threshold and $g_i = \pm 1$ is a sign. A kernel has the same size as an image patch. For a classifier of car wheels, rotational invariance is a natural requirement, since it avoids the need to learn all possible wheel rotations and the associated risk of overfitting. We construct some of the kernels as a complex sinusoid with unit $L_2$ norm. The modulus of the dot product with such a kernel is then a rotationally invariant feature.
Specifically, for an image patch $S_{x,y}$ the dot product with the kernel, $d_i = \operatorname{vec}(S_{x,y})^\top \operatorname{vec}(M_i)$, is compared with the threshold to obtain the weak classifier's decision $y_i$,
These decisions are aggregated using learned weighting coefficients $\alpha_i$ to form the overall classifier response
To simplify the FPGA implementation, the kernel values are quantized to $\{-1, 0, +1\}$, allowing multiplications in the dot product to be replaced by simpler logical operations, and the $L_2$ norm in (1) is replaced by the $L_1$ norm,
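The three relations referred to as (1) to (3) in this section can be written out as follows. This is a sketch assembled from the definitions in the text rather than a verbatim restatement: the response symbol $H$, the use of $\operatorname{sign}$ and the placement of $g_i$ are assumptions, and for real-valued kernels the norm is assumed to reduce to the dot product itself.

```latex
\begin{align}
  y_i &= g_i \,\operatorname{sign}\bigl(\lVert d_i \rVert_2 - t_i\bigr), \tag{1}\\
  H(x, y) &= \sum_i \alpha_i \, y_i, \tag{2}\\
  \lVert d_i \rVert_1 &= \lvert \operatorname{Re} d_i \rvert + \lvert \operatorname{Im} d_i \rvert. \tag{3}
\end{align}
```

With the quantized kernels, $\lVert d_i \rVert_1$ from (3) takes the place of $\lVert d_i \rVert_2$ in (1); both quantities need only the real and imaginary parts of $d_i$, but the $L_1$ form avoids squaring and a square root.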
Using a reference PC implementation, we compared a classifier with kernels quantized to $\{-1, 0, +1\}$ and the $L_1$ norm to a classifier with kernels quantized to 8 bits and the $L_2$ norm, and found that this change has very little effect on the classification error. For an image patch size of 25×29 pixels, we have learned a classifier of wheels consisting of 55 unique kernels (Fig. 2; the complex ones have two parts, so there are 77 parts in total) and 150 weak classifiers (some of them share the same kernel). Both limits were fixed for the learning phase.
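The classifier described in this section can be illustrated with a short software model. This is a minimal sketch, not the reference PC implementation: the function names, the data layout (flattened patches, weak classifiers as `(kernel_index, t, g, alpha)` tuples) and the strict `>` comparison against the threshold are assumptions made for illustration.

```python
def ternary_dot(pixels, kernel):
    """Dot product with a kernel whose entries are -1, 0 or +1 (no multiplications)."""
    total = 0
    for p, k in zip(pixels, kernel):
        if k == 1:
            total += p
        elif k == -1:
            total -= p
    return total


def feature(pixels, kernel_re, kernel_im=None):
    """Real kernels pass the dot product through; complex kernels use the L1 modulus."""
    re = ternary_dot(pixels, kernel_re)
    if kernel_im is None:
        return re
    return abs(re) + abs(ternary_dot(pixels, kernel_im))


def response(pixels, kernels, weak_classifiers):
    """Weighted sum of weak decisions; several weak classifiers may share a kernel."""
    feats = [feature(pixels, *k) for k in kernels]
    total = 0.0
    for kernel_index, t, g, alpha in weak_classifiers:
        y = g if feats[kernel_index] > t else -g  # assumed form of the decision
        total += alpha * y
    return total


# Toy data (hypothetical): a 6-pixel "patch", one real and one complex kernel.
patch = [3, 1, 4, 1, 5, 9]
kernels = [
    ([1, -1, 0, 0, 1, -1], None),
    ([1, 0, -1, 0, 1, 0], [0, 1, 0, -1, 0, 0]),
]
weak = [(0, 0, 1, 0.5), (1, 3, 1, 1.0), (1, 5, -1, 0.25)]
score = response(patch, kernels, weak)
```

Computing each kernel feature once and reusing it for every weak classifier that references it mirrors the sharing of kernels among the 150 weak classifiers.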
FPGA IMPLEMENTATION
We have implemented the classifier on a Xilinx ML605 evaluation board with a PCIe DMA data transfer wrapper.
The processing pipeline implementing the general functionality of a classifier is summarized in Fig. 3. The classifier is specified by supplying numeric constants; the appropriate parts of the VHDL source are generated automatically.
Linecache
The sliding window buffer outputs the whole image patch of dimensions 25×29. This block is implemented as a shift register, consisting of both general flip-flops and BRAMs.
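The behaviour of such a line buffer can be sketched in software. The model below is an assumption-laden illustration, not the VHDL design: it keeps the last `patch_h` image lines in a `deque` and, unlike the hardware shift register that emits one patch per clock cycle, it emits all patches of a row once that row is complete. All names are hypothetical.

```python
from collections import deque


def sliding_patches(pixel_stream, width, patch_h, patch_w):
    """Yield every patch_h x patch_w window of a row-major pixel stream."""
    rows = deque(maxlen=patch_h)  # the most recent patch_h image lines
    current = []
    for value in pixel_stream:
        current.append(value)
        if len(current) == width:  # one full image line has arrived
            rows.append(current)
            current = []
            if len(rows) == patch_h:
                for x in range(width - patch_w + 1):
                    yield [row[x:x + patch_w] for row in rows]


# Toy example: a 4-pixel-wide, 3-line image streamed pixel by pixel.
patches = list(sliding_patches(range(12), width=4, patch_h=2, patch_w=2))
```

Only `patch_h` lines are ever stored, which is why the size of the input image has little effect on the buffer beyond the line length.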
Slice Selector and Batches
This section describes generic algorithmic optimizations that are needed for real-time performance. Together with Sec. 4, it is the main contribution of this paper.
The number of dot products (77) is larger than the number of dot product-computing blocks (DPCBs, Sec. 3.4) that we are able to place on the FPGA (30). Also, processing a 25×29 image patch at once is hard to implement efficiently because
- too many parallel routes (25×29×n bits) would make timing closure very difficult,
- there are not enough HW resources, e.g. BRAMs that are used to store the kernels,
- some kernels cover only a sub-region of the image patch; processing speed can benefit from discarding the pixels not covered by a kernel.
We have divided the image patch area into four slices of 182 pixels each. To exploit the circularity of most kernels, we have made the slices circular as well (Fig. 4). Since the whole image patch has already been cached, the slice selector can select pixels completely arbitrarily.
The large number of dot products is handled by computing them in batches that are evaluated in sequence. Each batch computes several dot products in parallel. The distribution of kernels into batches is discussed later in Sec. 4.
The slice selector block outputs one of the four slices at each clock cycle, in a predefined order that is specific to the selected set of kernels and slice shapes.
Normalizer
An adaptive image patch normalization reduces the sensitivity of the classifier to local scene illumination. Each pixel in the image slice is multiplied by a coefficient and the result is re-quantized to 4 bits. These 182 multiplications are done in dedicated DSP blocks. The coefficient for each image patch is computed in advance on the host PC from the mean intensity of the patch, using a real-time integral image technique.
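The normalization step can be sketched as follows. The text states only that the coefficient depends on the mean patch intensity, so the rule used here (`TARGET_MEAN / mean`, with a hypothetical `TARGET_MEAN` in the middle of the 4-bit range) is an assumption, as are the function names and the truncating re-quantization.

```python
TARGET_MEAN = 8.0  # hypothetical target: middle of the 4-bit range 0..15


def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def patch_mean(ii, x, y, pw, ph):
    """Mean intensity of the pw x ph patch at (x, y), from four table lookups."""
    total = ii[y + ph][x + pw] - ii[y][x + pw] - ii[y + ph][x] + ii[y][x]
    return total / (pw * ph)


def coefficient(mean):
    """Assumed rule: scale the patch so that its mean maps to TARGET_MEAN."""
    return TARGET_MEAN / mean if mean > 0 else 0.0


def normalize(pixels, coeff):
    """Multiply each pixel by the coefficient and re-quantize to 4 bits."""
    return [min(15, max(0, int(p * coeff))) for p in pixels]


image = [[10, 20], [30, 40]]
table = integral_image(image)
mean = patch_mean(table, 0, 0, 2, 2)
quantized = normalize([10, 20, 30, 40], coefficient(mean))
```

The integral image makes the cost of each patch mean independent of the patch size, which is what allows the coefficients to be prepared on the host in real time.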
Dot Product-Computing Block (DPCB)
The block (Fig. 5) takes slices of the image patch, one at a time, and produces the dot product of the entire image patch with a single kernel. Because the kernel values are limited to $\{-1, 0, +1\}$, a case selection is performed instead of a multiplication. The sub-product for each slice is computed in an adder tree, and the sub-products are sequentially accumulated in an n-adder to obtain the result.
There are 30 DPCBs in the FPGA fabric. They all have the same input but each computes a different dot product.
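A software model of one DPCB illustrates why slicing does not change the result. This is a sketch under assumed names and a toy slice layout, not the hardware design: each slice is a list of pixel indices, and the accumulator plays the role of the sequential adder.

```python
def slice_partial(pixels, kernel, indices):
    """Sub-product over one slice; case selection replaces multiplication."""
    acc = 0
    for i in indices:
        k = kernel[i]
        if k == 1:
            acc += pixels[i]
        elif k == -1:
            acc -= pixels[i]
    return acc


def dpcb(pixels, kernel, slices, sequence):
    """Accumulate slice sub-products in the order given by the slice sequence."""
    acc = 0
    for s in sequence:
        acc += slice_partial(pixels, kernel, slices[s])
    return acc


# Toy layout (hypothetical): 8 pixels split into four 2-pixel slices.
slices = {1: [0, 1], 2: [2, 3], 3: [4, 5], 4: [6, 7]}
pixels = [5, 2, 7, 3, 9, 9, 9, 9]
kernel = [1, -1, 0, 1, 0, 0, 0, 0]  # nonzero only in slices 1 and 2

full = dpcb(pixels, kernel, slices, [1, 2, 3, 4])
short = dpcb(pixels, kernel, slices, [1, 2])  # all-zero slices skipped
```

Both calls give the same dot product, but the second needs two slice cycles instead of four, which is the saving the slicing scheme aims for.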
Modulus Block
This block computes the $L_1$ norm (3) of the dot products with complex kernels; for real-valued kernels the data are simply passed through.
Weak Classifier and Adder Tree
The weak classifier outputs the $\alpha_i y_i$ term in (2). Several weak classifiers share the same kernel dot product. We have put all 150 weak classifier blocks in the FPGA fabric. The weak classifiers' parameters $\alpha_i$, $t_i$, $g_i$ are hardwired during synthesis. The final adder tree sums up all the $\alpha_i y_i$ terms.
KERNEL SLICING SCHEME
The Kernel-Slice Incidence Table (Fig. 4c) defines the slices used in the computation for each kernel. A slice that contains only zero values for a given kernel is discarded from the computation.
Table 1: Possible batch assignment constructed from the Incidence Table for the set of filters in Fig. 2 and 30 DPCBs. Slices containing nonzero entries are marked '•'; slices with all zeros are either processed and discarded ('-') or not processed at all (' '). The slice processing sequence is (1 | 1, 2 | 1, 2, 3, 4). In total, the 30 DPCBs evaluate 7 · 30 = 210 slices, while only 158 of them (the number of '•' marks) contain nonzero data, i.e. the utilization is 75%. Note that both the real and the imaginary parts of a complex-valued kernel must be processed at the same time; the doubled rows must not be separated.
As described in Sec. 3.2, the kernels are grouped into batches that are run in sequence. The set of $N_{B_i}$ slices processed in a particular batch $i$ is the union of the slices required by all kernels in that batch (all of which have to use the same slices). The total number of slices $N = \sum_i N_{B_i}$ processed in the sequence then determines the overall processing time. Grouping kernels that require a similar set of slices helps minimize $N$. See Tab. 1 for an example of grouping.
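The cost of a batch assignment can be computed directly from an incidence table. The following is a minimal sketch with a made-up six-kernel table; the function names and the toy data are hypothetical, and each part of a complex kernel would appear as its own entry.

```python
def batch_slices(batch, incidence):
    """Union of the slices needed by every kernel in one batch."""
    needed = set()
    for kernel in batch:
        needed |= incidence[kernel]
    return needed


def total_slices(batches, incidence):
    """N: number of slices (clock cycles per pixel) over the whole batch sequence."""
    return sum(len(batch_slices(batch, incidence)) for batch in batches)


def utilization(batches, incidence, num_dpcb):
    """Fraction of DPCB slice evaluations that carry nonzero kernel data."""
    useful = sum(len(incidence[k]) for batch in batches for k in batch)
    return useful / (total_slices(batches, incidence) * num_dpcb)


# Toy incidence table (hypothetical): kernel name -> slices with nonzero entries.
incidence = {
    "a": {1}, "b": {1},
    "c": {1, 2}, "d": {2},
    "e": {1, 2, 3, 4}, "f": {3},
}
batches = [["a", "b"], ["c", "d"], ["e", "f"]]

n = total_slices(batches, incidence)           # slice sequence (1 | 1, 2 | 1, 2, 3, 4)
u = utilization(batches, incidence, num_dpcb=2)
```

For the figures quoted with Table 1, the same ratio is 158 / (7 · 30) ≈ 0.75. A search over batch assignments would minimize `total_slices` subject to each batch fitting into the available DPCBs.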
To summarize, the Kernel Slicing Scheme is defined by 1. the number and shape of slices, 2. the Kernel-Slice Incidence Table.
Currently, we have learned the classifier independently, created the slice shapes manually and used a semi-manual exhaustive search for distributing the kernels into batches. Tab. 1 shows that the result is not optimal. Some kernels could be replaced to make the overall code length $N$ shorter while keeping the classifier performance, or more kernels could be added to use the free DPCBs, thus keeping the processing time while improving the classifier performance. Either option can be incorporated into a weak feature selection method during AdaBoost learning.
All parts of the Kernel Slicing Scheme, as well as the classifier itself, should be optimized with respect to the overall detection rate and processing time, given the kernel bank, the training data for our classifier and the hardware constraints. This is a challenging and as yet unsolved problem.
PERFORMANCE, SCALING AND LATENCY
The latency for a single pixel is 240 ns, as shown in Fig. 3. The processing time depends on the size of the input image. Since we spend $N = 7$ clock cycles processing a single pixel, for a 951 × 400 pixel image we currently have a cycle time of 951 · 400 · 7 / 125 MHz = 21.3 ms per image, which determines the effective latency.
Our proposed architecture is scalable: the number of DPCBs can be adjusted to suit the size of a given FPGA (Table 2). Changing the size of the image patch also greatly influences the design (Table 3). On the other hand, the size of the input image has almost no influence on the chip layout.
It proved useful to use two slice selectors instead of one, each with a different set of slices, and to distribute the DPCBs between them. This brings yet another level of complexity to the optimization problem from Sec. 4. We have tested this design using the slice sequence codes (1 | 1 | 1, 2, 3, 4) and (1 | 1, 2 | 1, 2, 3), i.e., reducing the length from 7 to 6 slices.
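The timing figures in this section can be reproduced with a few lines of arithmetic. This is only a worked example using the numbers quoted in the text (125 MHz clock, one slice per clock cycle); the function name `frame_time_ms` is hypothetical.

```python
def frame_time_ms(width, height, slices_per_pixel, clock_hz):
    """Time to process one image when each pixel costs `slices_per_pixel` clock cycles."""
    cycles = width * height * slices_per_pixel
    return cycles / clock_hz * 1e3


CLOCK_HZ = 125e6  # clock frequency quoted in the text

seven = frame_time_ms(951, 400, 7, CLOCK_HZ)  # single slice selector, N = 7
six = frame_time_ms(951, 400, 6, CLOCK_HZ)    # two slice selectors, N = 6

print(f"N = 7: {seven:.1f} ms per image, {1e3 / seven:.1f} images/s")
print(f"N = 6: {six:.1f} ms per image, {1e3 / six:.1f} images/s")
```

Because the frame time is linear in $N$, shortening the slice sequence from 7 to 6 reduces it by one seventh.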
CONCLUSIONS
We have proposed an approach for implementing a dense linear classifier on an FPGA. The top-level scheme is deliberately generic; a specific instantiation is then tuned to a particular application. We have also discussed an interesting optimization problem whose solution would allow an efficient choice of the trade-off between processing speed and classification performance.
The proposed architecture is scalable; the designer decides how many blocks to place on the FPGA.
The performance evaluation shows that the design processes a 951 × 400 pixel image in 21.3 ms, which makes it suitable for real-time applications. The proposed wheel classifier response map computation is used in a car detector running in an intelligent vehicle as a part of a more complicated collision mitigation system [1] that requires a processing rate of 20-30 fps and a maximum latency of 200 ms.
ACKNOWLEDGEMENT
This work was supported by the European Commission under interactIVe, a large scale integrated project, part of the FP7-ICT-246587 for Safety and Energy Efficiency in Mobility. The authors would like to thank all partners within interactIVe for their support. Special thanks go to Stefan Wonneberger at Volkswagen Group Research.
