Abstract-The object detection algorithm developed by Viola and Jones has become very popular due to its high quality and detection speed. However, the complexity of the computation required to train a detector makes it difficult to develop and test potential improvements to this algorithm. Furthermore, improving or training new detectors in the field is problematic.
I. INTRODUCTION
Object detection is the task of determining if and where a particular object exists in an image of a scene. This is an important computer vision task and has applications in areas such as surveillance and robotics. Building a fast algorithm that is robust to variations in object scales, orientations and different lighting conditions remains a challenging problem.
One of the seminal developments in this field was the object detector by Viola and Jones [1]. Their detector involves scanning an image window across the scene and evaluating a fast binary classifier at each point to localize objects. The binary classifier takes an image window as its input and identifies whether it contains the desired object or not. This technique proved to be robust for frontal face detection and provided good detection run-time performance.
A major disadvantage of their algorithm is the long training time required to construct the binary classifier. Even on modern processors, training a detector can take hours to days and training may be repeated many times to tune the detector parameters. Further, adapting detectors to variable field environments may require low-power, embedded processors, exacerbating the training time. Thus, power-efficient acceleration of this training phase is important to facilitate exploration of extensions to this technique [2] and allow for object detectors to be trained quickly in the field.
The binary classifier used by Viola and Jones is an ensemble where the outputs of many weak classifiers are combined to classify an image window. During detection, these simple classifiers may be evaluated in parallel on an image, opening up the possibility for acceleration. A number of works have implemented the detection algorithm on various hardware platforms [3], [4], [5], providing real-time detection performance. Parallelism may also be exploited during training, where many weak classifiers are trained independently and one is selected to join the ensemble. However, the parallel tasks involved in training are more complex than those in detection. The large amount of resources in modern Field Programmable Gate Arrays (FPGAs) provides an opportunity to accelerate training in hardware as well. FPGAs offer several potential benefits as an acceleration platform. Their power efficiency relative to multi-core processors makes them amenable to training classifiers in the field. In addition, their reconfigurability relative to Application Specific Integrated Circuits allows modifications to be made to the training algorithm. However, to the best of the authors' knowledge, there has been no previous work targeting the acceleration of training in dedicated hardware.
In this paper, we propose a flexible, high-performance architecture for training Viola-Jones classifiers on FPGAs. We present an analysis of the architecture and describe some of the issues related to training these classifiers in hardware.
The rest of the paper is organized as follows. Section II provides background on how Viola-Jones detectors are trained. Considerations for mapping this algorithm to hardware are presented in Section III. Details of the FPGA architecture are given in Section IV. Results and analysis are shown in Section V and conclusions are given in Section VI.
II. BACKGROUND
The Viola-Jones object detector is constructed around a binary classifier producing a positive output when an image window contains the desired object and a negative output when it does not. We will denote positive outputs as 1 and negative outputs as 0. This classifier is used many times as an image window is scanned across a scene to locate objects. Thus, it must be fast to evaluate and have good detection performance. Some common metrics of detection performance are the percentage of correctly identified positive windows (hit rate) and percentage of incorrectly identified negative windows (false positive rate). A good detector has high hit rate and low false positive rate.
978-1-4673-2845-6/12/$31.00 c 2012 IEEE
The binary classifier used by Viola and Jones is composed of several levels of hierarchy forming an ensemble classifier. First, committee outputs are formed by additively combining the outputs of many weak, but fast to evaluate, classifiers and making a decision based on that aggregate input. Each weak classifier has an associated weight (α) that determines how much influence it has over the decision. Second, the outputs of several committees are combined in a cascade, such that the final classifier only produces a positive output if all committees produce positive outputs as well. This second stage allows the detector to quickly classify an image window as negative if even one of the committees produces a negative output. Fig. 1 shows the hierarchy of the binary classifier.
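The two levels of the hierarchy can be sketched in a few lines of Python. This is only an illustration of the structure described above: the function names and the representation of a committee as a `(weak_classifiers, alphas, phi)` tuple are assumptions made for this sketch, not part of the original design.

```python
def committee_output(window, weak_classifiers, alphas, phi):
    """Weighted vote: 1 if sum(alpha_t * h_t(window)) >= phi, else 0."""
    score = sum(a * h(window) for h, a in zip(weak_classifiers, alphas))
    return 1 if score >= phi else 0


def cascade_output(window, committees):
    """Return 1 only if every committee accepts the window.

    committees is a list of (weak_classifiers, alphas, phi) tuples
    (a hypothetical layout chosen for this sketch).
    """
    for weak_classifiers, alphas, phi in committees:
        if committee_output(window, weak_classifiers, alphas, phi) == 0:
            return 0  # early exit: later committees are never evaluated
    return 1
```

The early return is what makes the cascade cheap on negative windows: most windows in a scene are rejected by the first committees.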
Training the binary classifier involves constructing the ensemble to have good predictive performance on a known, labelled training set. This training set usually contains thousands of examples consisting of image data (x) and a label (y ∈ {1, 0}). The ensemble is built in a step-wise process where, first, weak classifiers are selected sequentially to form a committee. Once a committee has been formed, the current cascade is tested and a decision is made whether or not to add additional committees. This work focuses on accelerating committee training since very few other operations are involved at the cascade level of training.
A. Training Committees
Viola and Jones use a method called AdaBoost [6] to combine many weak classifiers together to form a committee. The idea behind AdaBoost is to add weak classifiers that perform well in areas where the committee is doing poorly.
The algorithm begins by assigning an equal weight (w) to all of the training examples. A large, diverse set of weak classifiers is then trained to minimize the weighted misclassification error (ε) on the training set:
$$\epsilon = \sum_{i=1}^{N} w_i \, |h(x_i) - y_i| \qquad (1)$$
Here N is the size of the training set and h(·) is the output of a weak classifier. The weak classifier with lowest error is added to the committee and its output weight (α) is determined by its performance on the training set. The weights of misclassified training examples are increased relative to the correctly classified examples and the process is repeated to continue adding weak classifiers. The final committee output is determined using an adjustable threshold (φ). Algorithm 1 shows the steps of the procedure. Training is stopped early if target hit and false positive rates are met for the committee.
Algorithm 1 Boosting T classifiers
1: Given N training examples $(x_1, y_1), \ldots, (x_N, y_N)$
2: Initialize: ...
...
8: ...
9: $\alpha_t \leftarrow \ln((1 - \epsilon_t)/\epsilon_t)$
10: end for
11: Final Classifier: $S(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \geq \phi$
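The boosting loop can be sketched in Python as follows. Several details are assumptions made for illustration: the initial weight is taken as 1/N, the weight update uses the common AdaBoost form (correctly classified weights scaled by ε/(1 − ε), then renormalized), the loop stops if no candidate beats chance, and the weak classifiers come from a fixed pool of callables instead of being trained per round. The names `boost` and `weighted_error` are hypothetical.

```python
import math


def weighted_error(h, examples, labels, weights):
    """Weighted misclassification error: sum of w_i * |h(x_i) - y_i|."""
    return sum(w * abs(h(x) - y) for w, x, y in zip(weights, examples, labels))


def boost(examples, labels, weak_pool, T):
    """Pick up to T weak classifiers; returns a list of (classifier, alpha)."""
    n = len(examples)
    weights = [1.0 / n] * n  # equal initial weights (assumed to sum to 1)
    committee = []
    for _ in range(T):
        best = min(weak_pool,
                   key=lambda h: weighted_error(h, examples, labels, weights))
        eps = weighted_error(best, examples, labels, weights)
        if eps == 0.0:
            # Perfect separation: alpha is unbounded, so training stops.
            committee.append((best, float("inf")))
            break
        if eps >= 0.5:
            break  # assumed stopping rule: nothing better than chance
        committee.append((best, math.log((1.0 - eps) / eps)))
        # Assumed update: shrink correctly classified weights, renormalize.
        beta = eps / (1.0 - eps)
        weights = [w * (beta if best(x) == y else 1.0)
                   for w, x, y in zip(weights, examples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return committee
```

With this update, the examples misclassified by the selected weak classifier carry half of the total weight in the next round, which pushes the next selection toward the cases the committee currently gets wrong.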
The computation at each step in the committee construction is thus broken down into a weak classifier training routine, during which all weak classifiers are trained, and a parameter update routine. The parameter update routine requires O(N) operations to update the training set. However, as will be shown in the next subsection, training a weak classifier requires O(N log N) operations and must be performed for all weak classifiers. Thus, the vast majority of the computation is involved in training the weak classifiers.
B. Weak Classifier Training
As the basic building block of the Viola-Jones classifier, the structure of the weak classifiers is very important. Each weak classifier is composed of a feature and a decision stump.
In object detection, features are often used to represent certain key characteristics of an image. For instance, a feature might quantify how much brighter the left side of an image is relative to the right side. The basic structures of features used by Viola and Jones are shown in Fig. 2(a). A feature value is calculated for a particular image by integrating the pixels in the white rectangular areas and subtracting the integrated values in the black areas. Thus, these features measure the relative pixel intensities for various shapes. A large feature pool is generated by translating and scaling these shapes across an image window. For example, over 500,000 possible features, and consequently weak classifiers, exist in a 32x32 window.
To quickly evaluate these features, Viola and Jones introduced the integral image representation. Integral image pixels are formed by summing the value of a pixel along with the pixels above and to the left of it in the original image. By first transforming all of the training set images into this representation as a pre-processing step, the pixel intensities in any rectangular area can be calculated using four memory accesses as shown in Fig. 2(b). In this paper, the training image data (x) will refer to the integral image representation.
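A minimal Python sketch of the integral image and the four-access rectangle sum is given below. The function names and the inclusive coordinate convention are choices made for this example.

```python
def integral_image(img):
    """ii[r][c] = sum of img over all rows <= r and columns <= c."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r][c] = row_sum + (ii[r - 1][c] if r > 0 else 0)
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom, columns left..right (inclusive)."""
    def at(r, c):
        # Positions just outside the top or left edge contribute zero.
        return ii[r][c] if r >= 0 and c >= 0 else 0
    return (at(bottom, right) - at(top - 1, right)
            - at(bottom, left - 1) + at(top - 1, left - 1))
```

A two-rectangle feature is then the difference of two `rect_sum` calls, independent of the rectangle size.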
The features specialize the weak classifiers to examine particular characteristics of the image window. In addition, the feature representation reduces the very high dimensional window down to a single dimension. The classification decision is made by choosing a threshold in this 1-D feature space and assigning a direction such that values on one side of the threshold are identified as positive and the values on the opposite side are identified as negative. This type of classifier is called a decision stump. In this paper, a decision stump directed up classifies all values above the threshold as positive.
To select the threshold value, the training examples are first sorted by their feature values. Training examples with the same feature values and label are combined with their weights accumulated. By iterating through this sorted array and calculating the error (Eqn. 1) at each point for both directions, a threshold dividing the training data can be found that minimizes the error. Since the quantity $|h(x_i) - y_i|$ is either zero or one, error calculation only involves addition and subtraction of weights.
The error at each feature value is calculated in a single pass by keeping track of the total positive and negative weights above ($P_A$, $N_A$) and below ($P_B$, $N_B$) the current value and is calculated as
$$\epsilon = \min(P_B + N_A,\; P_A + N_B) \qquad (2)$$
Here, $P_B + N_A$ is the misclassification error directed up while $P_A + N_B$ is the misclassification error directed down.
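The single-pass threshold search can be sketched in Python as below. The sketch assumes that a threshold placed at value v treats values at or above v as "above"; the function name `train_stump` and the returned tuple layout are hypothetical.

```python
def train_stump(feature_values, labels, weights):
    """Return (min_error, threshold, direction) for one feature.

    Convention assumed here: a stump "directed up" labels values at or
    above the threshold as positive (1); "down" labels them negative.
    """
    # Combine examples that share a feature value and label.
    pos, neg = {}, {}
    for v, y, w in zip(feature_values, labels, weights):
        table = pos if y == 1 else neg
        table[v] = table.get(v, 0.0) + w

    p_above, n_above = sum(pos.values()), sum(neg.values())  # P_A, N_A
    p_below = n_below = 0.0                                   # P_B, N_B
    best = (float("inf"), None, None)
    for v in sorted(set(feature_values)):
        # Threshold at v: values >= v count as "above".
        for err, direction in ((p_below + n_above, "up"),
                               (p_above + n_below, "down")):
            if err < best[0]:
                best = (err, v, direction)
        # Move the weight at v from "above" to "below".
        p, n = pos.get(v, 0.0), neg.get(v, 0.0)
        p_above, p_below = p_above - p, p_below + p
        n_above, n_below = n_above - n, n_below + n
    return best
```

Only the sort depends on more than linear time; the sweep itself uses additions and subtractions of weights, as noted above.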
Computationally, evaluating feature values and iterating through them require O(N) time, while sorting, in general, requires O(N log N) operations. Thus training several hundred thousand weak classifiers is quite computationally intensive. In addition, this training must be done for each weak classifier added to the ensemble.
C. Related Work
Although there has not been work in hardware architecture accelerating the training process, several algorithmic and software techniques have been developed to reduce the run-time of the training phase. Threading, through Intel's Threading Building Blocks (TBB), has been used in the popular OpenCV project [7] to provide some parallelism. In addition, algorithmic techniques have also been developed. For instance, McCane et al. [8] describe a method of restricting the number of weak classifiers compared at each iteration by using a heuristic search. This approximation necessarily has a negative effect on the final classifier quality due to the restricted search space. Wu et al. [9] propose a method that avoids re-training weak classifiers by using a large look-up table. However, pre-computation time can still be significant, especially if additional features are introduced to increase detection performance. A method of accelerating training of weak classifiers should allow further improvement using these algorithmic approaches.
III. HARDWARE WEAK CLASSIFIER TRAINING
The kernel of the training algorithm identified in Section II is the task of training weak classifiers, involving feature evaluation, sorting and threshold selection. To accelerate training in hardware, multiple training engines are used to train weak classifiers in parallel. In this section, the computation and communication concerns relating to the hardware implementation of this kernel will be analyzed.
A. Design Constraints
One of the important considerations for hardware design is the data width and representation. Typical applications of the Viola-Jones object detection algorithm use 8-bit greyscale images and are trained at window sizes around 20x20 pixels. It has been found, for instance, that a 20x20 window size achieves a higher hit rate than larger windows on face detection [10]. During our experiments, to facilitate more general exploration of window sizes, we decided to use a maximum window size of 32x32, although by reconfiguring the datapath, larger window sizes may be implemented. This choice requires the integral image representation to be 18 bits per pixel to support the maximum integral image value. Furthermore, since the feature values are signed, the feature value range for the three-element feature in Fig. 2 requires 19 bits to be represented.
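The 18-bit requirement can be checked with a short worked calculation. It assumes the worst case in which every pixel of a 32x32 window of unsigned 8-bit data takes the maximum value 255, so that the bottom-right integral image pixel holds the sum of all 1024 pixels.

```latex
\[
32 \times 32 \times 255 = 261{,}120 \;<\; 2^{18} = 262{,}144,
\qquad
261{,}120 \;>\; 2^{17} = 131{,}072
\]
```

Seventeen bits are therefore insufficient and eighteen bits are enough for any integral image pixel at this window size.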
Image and classifier weights (w,α) are typically represented as floating point numbers. Since floating point operations are expensive in hardware, we chose to represent these values as 32-bit fixed-point numbers.
B. Feature Evaluation
The computations involved in feature evaluation consist primarily of accumulation and basic multiplications which map well to FPGA resources. However, the evaluation of each feature value requires several random memory accesses, causing performance to be memory bound. FPGAs have very high on-chip memory bandwidth, but the large size of the training set precludes its storage in on-chip memory. Thus managing the limited external memory bandwidth is very important for achieving high performance.
The design platform in this work is a Xilinx ML605 development board equipped with DDR3 memory with a word size of 8 bytes. Each training engine requires a different set of integral image pixels depending on the feature being evaluated. The required memory bandwidth thus necessarily increases with the number of engines. However, the burst organization of DDR3 RAM provides some opportunity for data reuse. Since weak classifiers are trained independently, the order in which they are trained is unimportant. Thus by creating groups of weak classifiers that require pixels in the same bursts, the amount of memory bandwidth required for a particular set of training engines is reduced.
Ordering is done by first assigning each pixel of the integral image to a particular DRAM burst. With 18-bit pixels, 14 pixels are assigned to each 256-bit vector. This configuration leaves four bits unused for every 256 bits, but simplifies the extraction of data. The features are then enumerated and sorted by the number of bursts required to evaluate them. Next, sets of features are created and sized to the number of compute engines available. Each set begins with a feature with the current minimum number of bursts and additional features are added such that the number of bursts in the set is minimized. Fig. 3 shows the resulting average number of bursts required for different numbers of parallel engines. If only one engine is requesting data, 4.5 bursts are required on average. This is shown as the horizontal grey line and represents the best case limit. The "unordered" configurations are formed by grouping features in the order in which they are generated by translating and scaling each rectangular shape across the image window. Notice that the number of bursts required grows quickly with engines as they are not shared efficiently resulting in long periods where individual engines are idle. When features are ordered on the other hand, the majority of requested data is shared with 30 engines requiring about five bursts per configuration on average.
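The grouping procedure can be sketched in Python as follows. The sketch assumes row-major packing of integral image pixels into bursts and represents a feature simply as a list of the (row, col) pixels it reads; the paper does not state the packing order, and the names `burst_of`, `bursts_for` and `group_features` are hypothetical.

```python
PIXELS_PER_BURST = 14  # 14 x 18-bit pixels per 256-bit vector (4 bits unused)
WINDOW = 32            # maximum window width


def burst_of(row, col, width=WINDOW):
    """Burst index of a pixel, assuming row-major packing (an assumption)."""
    return (row * width + col) // PIXELS_PER_BURST


def bursts_for(feature):
    """Set of bursts touched by a feature given as a list of (row, col)."""
    return {burst_of(r, c) for r, c in feature}


def group_features(features, engines):
    """Greedy grouping: seed each group with the cheapest remaining feature,
    then keep adding the feature that introduces the fewest new bursts."""
    remaining = sorted(features, key=lambda f: len(bursts_for(f)))
    groups = []
    while remaining:
        group = [remaining.pop(0)]
        needed = set(bursts_for(group[0]))
        while remaining and len(group) < engines:
            nxt = min(remaining, key=lambda f: len(bursts_for(f) - needed))
            remaining.remove(nxt)
            group.append(nxt)
            needed |= bursts_for(nxt)
        groups.append((group, needed))
    return groups
```

Each returned group pairs the features assigned to the engines with the set of bursts that must be fetched once and shared among them.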
C. Sorting and Threshold Selection
Although feature evaluation is relatively simple, sorting is much more intensive, particularly since several thousand feature values must be sorted for each training engine. Instead of explicitly sorting the data, we use the feature values to index large arrays and insert the weight values appropriately. Separate tables are kept for positive and negative weights to facilitate the error calculation. Using this technique, a sorted array is created quickly during the feature evaluation step. However, the size of the table is the full range of the feature values rather than just the number of training examples. With two tables of $2^{19}$ 32-bit weights, this technique leads to some complications in terms of data storage and communication.
Memory is quickly exhausted if storing the sorted tables on the FPGA. However, storing the data in external memory causes a linear increase in memory bandwidth with the number of engines. Instead of creating the full sorted table at once, the sorting technique described above can be used to create sorted subsets of the full range by calculating feature values for the full training set and filtering out training examples not contained in the desired range. Thus during threshold selection, the full range can be explored by sequentially filling small tables and iterating through the loaded elements. This dramatically reduces the required memory capacity and allows FPGA Block RAMs to be used since at most two ports are required to accumulate weights in the table and read the values. In particular, we use 32-bit, 1024-element Block RAMs to store the positive and negative weights for each engine.
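Filling one sorted subset can be sketched in Python as below. The sketch models a table as a list indexed by `value - base` and keeps running totals for weights that fall outside the current range; the function name and return layout are assumptions for this example, and floating point weights stand in for the 32-bit fixed-point values.

```python
TABLE_SIZE = 1024  # elements per Block RAM table


def fill_tables(feature_values, labels, weights, base):
    """Accumulate weights for feature values in [base, base + TABLE_SIZE).

    Returns (pos_table, neg_table, below, above), where below and above are
    (positive, negative) weight totals that fell outside the range.
    """
    pos = [0.0] * TABLE_SIZE
    neg = [0.0] * TABLE_SIZE
    below = [0.0, 0.0]
    above = [0.0, 0.0]
    for v, y, w in zip(feature_values, labels, weights):
        idx = v - base
        slot = 0 if y == 1 else 1
        if idx < 0:
            below[slot] += w
        elif idx >= TABLE_SIZE:
            above[slot] += w
        else:
            (pos if y == 1 else neg)[idx] += w
    return pos, neg, tuple(below), tuple(above)
```

Because the table index is the feature value itself, reading the table from address 0 upward visits the weights in sorted order without any comparison-based sort.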
The above technique reduces the memory requirements, but exacerbates the time to iterate through the feature value range since tables must be filled multiple times. In general, the range of values for a particular feature will not be very large. Therefore, searching through all possible feature values is inefficient. Instead, we can reduce the number of sorted subsets explored by performing a two-phase threshold selection.
The two-phase threshold selection, shown in Fig. 4, begins with a low-precision pass where training example weights are accumulated into tables indexed by the most significant bits of the feature values. For example, the total weights in the 111, 110, 101 and 100 bins on the right hand table sum up to the values in the 1xx bin on the left of Fig. 4. From this table, error is calculated at rough thresholds; for instance, the error for a classifier directed upwards at threshold 100 is 0.4 since $P_B$ is 0.1 and $N_A$ is 0.3.
In addition to error, an uncertainty (∆error) is calculated for each bin, representing the amount by which the error could decrease if the range were searched in higher precision. This ∆error is defined as the negative weights in the bin above a threshold when directed up and the negative weights in the bin below when directed down. To understand this, consider the low-precision bin 1xx in Fig. 4. For a classifier directed up at threshold 100, every negative example in this bin is counted as an error. A finer threshold inside the bin can at best move all of that negative weight below the threshold without moving any positive weight, so the error can decrease by at most the negative weight in the bin.
Once the errors and ∆errors have been calculated, the potential errors (error − ∆error) are found. If any of these are lower than the current minimum error, then that subset becomes a candidate for a high-precision expansion. During the expansion step, the table is re-filled with elements belonging to the candidate subset to determine if the global minimum error can be improved. In the figure, the range 1xx has a low potential error of 0.1 and is thus a candidate for a high-precision pass. The actual minimum error in this range is 0.2, found at threshold 110. This is higher than the potential of 0.1 since the weights are not optimally distributed. Two-phase threshold selection ensures that only subsets with error potentially lower than the global minimum are searched. If it is found during the low-precision pass that the currently configured weak classifier cannot produce an error lower than the current minimum, then no high-precision pass is performed and no threshold is selected.
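The low-precision pass can be sketched in Python as follows. This is a software simplification under stated assumptions: the coarse errors are all computed before candidates are chosen (the hardware compares as it iterates), a bin is refined from its lower edge when directed up and from its upper edge when directed down, and the function name `low_precision_pass` is hypothetical.

```python
def low_precision_pass(pos_bins, neg_bins, current_min):
    """Coarse pass over bins indexed by the top bits of the feature value.

    Returns (best_error, candidates); candidates are the bin indices whose
    potential error (error - delta_error) is below the best error so far.
    """
    total_p, total_n = sum(pos_bins), sum(neg_bins)
    n_bins = len(pos_bins)
    # err_up[k], err_down[k]: error with the threshold at the lower edge of
    # bin k; k == n_bins places the threshold above every bin.
    err_up, err_down = [], []
    p_below = n_below = 0.0
    for k in range(n_bins + 1):
        err_up.append(p_below + (total_n - n_below))    # P_B + N_A
        err_down.append((total_p - p_below) + n_below)  # P_A + N_B
        if k < n_bins:
            p_below += pos_bins[k]
            n_below += neg_bins[k]
    best = min([current_min] + err_up + err_down)
    candidates = []
    for k in range(n_bins):
        # delta_error is the negative weight inside bin k in both directions.
        potential = min(err_up[k], err_down[k + 1]) - neg_bins[k]
        if potential < best:
            candidates.append(k)
    return best, candidates
```

Only the bins returned as candidates need a high-precision table fill; an empty list means the weak classifier can be bypassed after the coarse pass alone.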
D. System Architecture
Using the two-phase search described above, the two primary operations involved in training a weak classifier become filling the small, 1024-element tables based on the phase of the threshold selection process and iterating through the resulting loaded arrays to determine error or candidate subsets to search. The resulting systolic system architecture is shown in Fig. 5. Groups of weak classifiers are configured into this set of parallel engines and two-phase threshold selection is performed to find and train the minimum error weak classifier.
Each training engine consists of a Feature Evaluator, Block RAM storage for the Sorted Tables and a Threshold Selector. The abundance of wiring in FPGAs allows wide, high-throughput datapaths to be constructed between elements of the array. For instance, Feature Evaluators share the full 256-bit memory bandwidth through the systolic structure. Since connections exist primarily between adjacent engines, this structure is very scalable and large arrays may be built without the degradation of clock frequency.
IV. HARDWARE IMPLEMENTATION
In this section, implementation details of the Feature Evaluator, Sorted Tables and Threshold Selection will be discussed. The process of updating committee parameters will also be described.
A. Feature Evaluator
The inputs to the Feature Evaluator are the training example integral image data, weight and label. The engine computes the feature value based on the image data and passes the training data on to be sorted. Since tables are filled multiple times in our approach, it is very important to maintain high throughput to process the entire training set as quickly as possible. To simplify control logic, training examples do not share bursts; at least one memory read is required per feature value. The Feature Evaluator core is thus designed for a maximum throughput of one feature value per cycle.
Fig. 6 shows the structure of the Feature Evaluator. As mentioned in Section III, 256 bits of integral image data are available from the memory controller each cycle for a total of 14 pixels per cycle. It is possible for all pixels required by a feature to be contained within a single burst. Therefore, 14 filters, g(·), are aligned to the pixel locations. Each burst of image data is accompanied by an index denoting the position of the pixels in the image window. The filters then either ignore the pixel data or apply one of the simple operators: {+, −, +2×, −2×, +3×, −3×, +4×}. The result is fed into a tree accumulator structure. In modern 6-LUT based Xilinx FPGAs, ternary adders require only extra routing resources compared to a binary adder. Thus, a ternary tree adder is used to reduce resource requirements and improve latency. The structure is fully pipelined and has a latency of eight cycles.
By shifting configuration words through the systolic array, the input filters can be set up to implement a wide variety of possible features. A Feature Evaluator may combine up to nine different integral image pixels as a linear combination with the aforementioned input filter operations.
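To illustrate how a feature reduces to a small linear combination, the Python sketch below expresses a two-rectangle feature as six integral image taps. Adjacent rectangles share corner pixels, which is where the ±2 coefficients come from. The tap representation and the function names are assumptions for this example; they are not the configuration word format of the hardware.

```python
ALLOWED_COEFFS = {1, -1, 2, -2, 3, -3, 4}  # the filter operators listed above


def integral_image(img):
    """ii[r][c] = sum of img over all rows <= r and columns <= c."""
    ii = []
    for r, row in enumerate(img):
        running, out = 0, []
        for c, pixel in enumerate(row):
            running += pixel
            out.append(running + (ii[r - 1][c] if r > 0 else 0))
        ii.append(out)
    return ii


def evaluate_feature(ii, taps):
    """taps: list of ((row, col), coeff) pairs, at most nine of them."""
    if len(taps) > 9 or any(c not in ALLOWED_COEFFS for _, c in taps):
        raise ValueError("feature not expressible with the filter set")
    return sum(coeff * ii[r][c] for (r, c), coeff in taps)


def two_rect_taps(top, left, mid, right, bottom):
    """White rectangle (cols left..mid) minus black rectangle
    (cols mid+1..right) over rows top..bottom; needs top >= 1, left >= 1."""
    return [((bottom, mid), 2), ((top - 1, mid), -2),
            ((bottom, left - 1), -1), ((top - 1, left - 1), 1),
            ((bottom, right), -1), ((top - 1, right), 1)]
```

For a 4x4 test image with pixel values 1 to 16 in row-major order, `evaluate_feature(integral_image(img), two_rect_taps(1, 1, 1, 2, 2))` gives (6 + 10) − (7 + 11) = −2.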
Along with feature values, the weight and label data must also be passed to the Sorted Tables. To minimize the number of memory accesses, weights are stored separately from pixel data in external memory and are cached before pixel data is loaded. Once the current weights have been exhausted, a new set is loaded. In addition, all positive training examples are fetched before the negative examples. By keeping the two classes separate, the labels become implicit to the data being fetched.
B. Sorted Tables
The calculated feature values are used to generate addresses for the Sorted Tables and add the training example weights into the appropriate positive or negative Block RAM. An address is calculated depending on the current phase of the computation. During the low-precision phase, the upper bits of the feature values are used to index the tables, while in high-precision, the feature values are filtered based on the current sorted subset being evaluated. If the feature value is out of the range of the current subset, extra counters keep track of the total positive and negative weights above and below the table range. This sorting procedure is fully pipelined to maintain a throughput of one training example per cycle and has a latency of four cycles. This latency is required to read the current weight value in the target table address and accumulate the incoming weight.
C. Threshold Selection
Once the Sorted Tables have been filled, the Threshold Selection engines iterate through each element calculating the errors or candidates depending on the phase of the computation. For each table address, error is compared and candidates for high-precision expansion are aggregated among all parallel engines. However, a synchronous comparison is not desired as that would require simultaneous reduction from all engines, resulting in long wire delays. Instead, synchronization is maintained by running all engines offset by one cycle. As one engine is calculating the error or candidate for address i, the result from the previous engine has already been produced and may be compared; this is demonstrated in Fig. 7 .
To encode which engines have candidates for each subset, engines are represented as one bit in a candidate vector. Threshold Selection engine j receives a bit vector of size j − 1 encoding which engines are candidates for the current table address. If the potential error for engine j is lower than the minimum error found so far, a 1 is appended to the candidate vector; otherwise, a 0 is appended. In this way, the list of candidates for each low-precision bin and weak classifier being trained is iteratively constructed. The full list of candidates at the end of the array is stored in a 1024-element FIFO. Error and candidate calculation is again fully pipelined with a latency of eight cycles between requesting the data from the table and producing an error or candidate.
D. Parallel Engine Coordination
The system operation begins by configuring the engines for a particular set of weak classifier features. Sorted Tables are then filled by the Feature Evaluators. Candidate classifiers and subsets are next identified and pushed into a FIFO as explained above. Then, while the FIFO is not empty, the engines are reconfigured to perform high-precision expansion on the candidate subsets. During both the low-precision and high-precision phases, the global minimum error is updated and the weak classifier with minimum error is saved.
Since feature evaluation must occur before threshold selection, the operations are serialized. To keep the engines busy, the Sorted Tables Block RAMs are replicated, providing a buffer such that the Feature Evaluators and Threshold Selection engines work on different sets of weak classifiers concurrently. When switching to and from high-precision expansion, the values in the Sorted Tables are stale and must be refilled before Threshold Selection can continue. Thus, some serialization occurs during this transition.
E. Updating Parameters
After a minimum error classifier has been found, the example weights must be updated and an α t value must be calculated for the selected weak classifier. The weight update requires a multiplication and both calculations require evaluation of non-linear functions.
Weight update involves an error factor and a normalization shown in lines 7 and 8 of Algorithm 1. These operations are combined to rewrite the weight update multiplicative factors. Weights of correctly classified examples are multiplied by $1/(2(J - \epsilon_t))$ while incorrect examples are multiplied by $1/(2\epsilon_t)$. Here J is the sum of all the weights, which should always be approximately 1 but may deviate depending on inaccuracies in the weight update. In the case where $\epsilon_t = 0$, the weak classifier perfectly separates the positive and negative examples and the algorithm terminates.
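The combined factors can be checked with a short derivation. It assumes that the error factor has the usual AdaBoost form, $\beta_t = \epsilon_t/(J - \epsilon_t)$, applied only to correctly classified examples; that form is an assumption here, chosen because it reproduces the factors stated above. Before the update, the correctly classified examples carry total weight $J - \epsilon_t$ and the incorrect ones carry $\epsilon_t$.

```latex
\begin{align*}
\text{sum after error factor} &= (J - \epsilon_t)\,\beta_t + \epsilon_t = 2\epsilon_t \\
\text{factor for correct examples} &= \frac{\beta_t}{2\epsilon_t} = \frac{1}{2(J - \epsilon_t)} \\
\text{factor for incorrect examples} &= \frac{1}{2\epsilon_t}
\end{align*}
```

After the update, the correctly and incorrectly classified examples each carry a total weight of 1/2, so the weights sum to 1 again regardless of how far J had drifted.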
The weight update terms are thus determined by calculating the denominators and evaluating a reciprocal function. When $\epsilon_t$ is very small, the incorrect example weight update may grow very large and become difficult to estimate. In this case, we multiply both the associated weight value and $\epsilon_t$ by $2^k$, equivalent to shifting both of them k bits, to bring the denominator into the range [1, 2). This process makes the reciprocal function better behaved and easier to estimate. The same floating point approach is used for $1/(2(J - \epsilon_t))$ in case J deviates far from 1. A look-up table and three-stage piecewise linear interpolation (PLI) unit is used to estimate the reciprocal.
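The normalize-then-interpolate idea can be sketched in Python as follows. The segment count, the uniform segment spacing and the function name `reciprocal_pli` are assumptions for illustration; the sketch uses floating point arithmetic and does not model the fixed-point datapath or the pipeline stages of the hardware unit.

```python
def reciprocal_pli(x, segments=16):
    """Approximate 1/x for x > 0 by scaling x into [1, 2) with a power of
    two and interpolating linearly between look-up table entries."""
    if x <= 0:
        raise ValueError("x must be positive")
    k = 0
    while x < 1.0:   # equivalent to a left shift by one bit
        x *= 2.0
        k += 1
    while x >= 2.0:  # equivalent to a right shift by one bit
        x /= 2.0
        k -= 1
    # Table entries are 1/m at evenly spaced points m in [1, 2].
    step = 1.0 / segments
    i = min(int((x - 1.0) / step), segments - 1)
    lo = 1.0 + i * step
    y_lo, y_hi = 1.0 / lo, 1.0 / (lo + step)
    approx = y_lo + (x - lo) / step * (y_hi - y_lo)
    # The original input was x * 2**-k, so its reciprocal is approx * 2**k.
    return approx * 2.0 ** k
```

With 16 segments the relative error of this sketch stays near 0.1 percent across the whole input range, because the interpolation always operates on the same well-behaved interval.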
To estimate $\alpha_t = \ln((J - \epsilon_t)/\epsilon_t)$, the logarithm is first split into $\ln(J - \epsilon_t) - \ln \epsilon_t$. In this case, if $\epsilon_t$ approaches 0, $-\ln \epsilon_t$ approaches infinity and thus the error in this estimate can become very large. This error is controlled by scaling the ratio $(J - \epsilon_t)/\epsilon_t$ with a multiplicative factor. A look-up table and three-stage PLI are also used to estimate the natural logarithm. For more precision, the value of $\alpha_t$ may be calculated in software given the trained committee.
Parameter updates make use of the same system structure as weak classifier training. The last engine in the array is configured with the minimum error weak classifier to calculate $e_i$. Incoming weights are then multiplied with the appropriate weight update term using a DSP block and stored in a cache to be written back to memory.
V. RESULTS AND ANALYSIS
The design was implemented on a Xilinx ML605 development platform, containing a Virtex-6 LX240T FPGA, using ISE 13.1 tools. A PCIe interface was used to communicate with the hardware core and initialize memory with training data. A Xilinx MIG DDR3 controller with 400MHz I/O clock provided a maximum bandwidth of 6.4 GB/s to the compute engines which ran at 200 MHz, matching the user interface clock of the MIG. A maximum 30-engine system was implemented, consuming approximately 72 percent of available LUTs on the FPGA. The LUTs introduced by the continued addition of engines caused difficulty in floorplanning and consequently with routing and timing closure.
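The quoted bandwidth figure follows from the numbers given in the text, together with the fact that DDR memory transfers data on both clock edges. The 8-byte word size and the 256-bit vector per engine cycle are taken from Sections III and IV.

```latex
\begin{align*}
\text{DDR3 peak} &= 400\,\text{MHz} \times 2\,\tfrac{\text{transfers}}{\text{cycle}} \times 8\,\text{B} = 6.4\,\text{GB/s} \\
\text{engine side} &= 200\,\text{MHz} \times 256\,\text{bit} = 200\,\text{MHz} \times 32\,\text{B} = 6.4\,\text{GB/s}
\end{align*}
```

The two sides match, so one 256-bit vector per 200 MHz cycle is the most the compute engines can receive from external memory.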
To ensure correctness and measure performance, a face detector was trained using a set of 1000 faces and 1000 non-faces as positive and negative examples. Faces were collected from the FERET [11] dataset, while non-face images were collected from patches of the Caltech background [12] dataset. The examples were manually cropped, resized and equalized as 8-bit greyscale images. This dataset is small for training a good face detector, but it is sufficient to test the performance of the training implementations.
A. OpenCV Implementation
The traincascade procedure of the OpenCV library [7] version 2.3.1 was used as a performance benchmark. This implementation contains several deviations from the original algorithm of Viola and Jones. For instance, when evaluating features, the values are multiplied by a variance normalization factor. This is a useful technique for improving detection performance under different lighting conditions. However, it may be done as a pre-processing step on the training set, thus it was disabled during our comparison. The software implementation was modified to perform the same type of training as the hardware system.
In addition to these fundamental changes, several optimizations are implemented in OpenCV to reduce training time. Weight trimming is a technique that removes examples with very low weight, and thus low influence. A more direct method of improving performance is pre-calculating and sorting some of the feature values. Since the feature values do not change during training, a great deal of computation may be saved. However, the applicability of this technique is limited by the available system memory since data must be stored for many combinations of features and training examples. During testing, the default settings for both of these optimizations were used. Examples representing less than five percent of the total weight were trimmed and 512 MB of memory was reserved for pre-calculated data.
Software performance testing was done on a node of the GPC supercomputer at the SciNet HPC Consortium [13]. A GPC node contains two Intel 2.53 GHz Xeon E5540 CPUs, each with four cores and hyperthreading support for a total of 16 logical processors. The OpenCV library was compiled with icc version 12.1.3 and TBB version 4.0. The software was instrumented to measure only the committee training time, using clock_gettime(), and results were averaged over twenty runs. Hardware computation time was measured as the time between sending the start command over PCIe to receiving the committee results. One committee was trained with a desired hit rate of 0.99 and false positive rate of 0.5.
All tests were performed where the hardware and software produced committees with the same weak classifiers. In general, the hardware system will produce the same results as software except for the case where multiple options for the minimum error weak classifier exist. This can occur if multiple weak classifiers produce the minimum error or multiple thresholds exist that have the same minimum error. The algorithm does not differentiate between these options and thus they are equally valid solutions.
B. Hardware Utilization
Table I shows the full system resource utilization for K training engines, including memory controller and PCIe interface, after FPGA mapping. Resource utilization scales roughly linearly with the number of engines, each additional engine requiring approximately 3000 LUTs, 2300 Flip Flops and four 36 Kbit Block RAMs. Due to the systolic architecture, all configurations maintained a clock speed of 200 MHz and should continue to scale on larger devices.
C. Performance Results
Fig. 8 shows the speed-up of the hardware platform over OpenCV for different configurations of compute engines. Results for one Xeon CPU, using 8 threads, and two CPUs, using 16 threads, are shown. When comparing performance between a single FPGA and single CPU, the FPGA platform is able to obtain a 14-fold speed-up over the multi-threaded software implementation. Furthermore, against two CPUs, the hardware maintains a 7-fold speed-up. Despite the burst ordering technique described in Section III-B, the average number of bursts necessarily increases with the number of engines. Thus, performance scaling is less than linear.
The training parameters resulted in a committee with two weak classifiers, requiring about 52 seconds to train using eight threads on one CPU and less than 4 seconds using the hardware platform. When thousands of weak classifiers are trained for full detectors, the reduction in training time, and consequently in energy consumption, can be dramatic. The decision to use table-based sorting results in unique performance characteristics. Full iterations through the 1024-element tables must be performed even when the training set consists of only a few hundred examples. On the other hand, this approach performs well on large training sets. A weak classifier may be examined and bypassed using only iterations in the low-precision phase if no candidate subsets are found. Fig. 9 shows how the performance of the 30-engine system scales as the training set size increases. As can be seen, the performance relative to software increases with the training set size and should continue to increase for larger datasets.
In general, 19 bits are required to represent feature values in a 32x32, 8-bit window. However, if the thresholds of features are expected to be within a much smaller range, the search range can be reduced. This makes ∆error more meaningful, since fewer examples are compressed into each subset range and, accordingly, fewer subsets become candidates. For our test data, if the search range is reduced to 17 bits, the speed-up for the 30-engine system becomes 17-fold against one CPU.
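One way to arrive at the 19-bit figure is sketched below. It assumes a conservative bound in which the magnitude of a feature never exceeds the sum of all pixels in the window, plus one sign bit because a feature is a difference of rectangle sums. The function name and this particular derivation are assumptions and are not taken from the design itself.

```python
def feature_bits(window_side=32, pixel_bits=8):
    """Bits for a signed feature value under a whole-window magnitude bound."""
    max_pixel = 2**pixel_bits - 1                       # 255 for 8-bit pixels
    max_magnitude = window_side * window_side * max_pixel
    magnitude_bits = max_magnitude.bit_length()         # unsigned width
    return magnitude_bits + 1                           # plus one sign bit

bits_32x32 = feature_bits()
```

For a 32x32 window of 8-bit pixels the bound is 1024 × 255 = 261120, which needs 18 bits unsigned and therefore 19 bits signed. Restricting the threshold search to 17 bits covers a quarter of that range.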
VI. CONCLUSIONS AND FUTURE WORK
A major drawback of Viola-Jones style object detectors is their training time, particularly as the complexity and number of features increase. This paper demonstrates that FPGA hardware is not only applicable to accelerating the detection phase of Viola-Jones style object detectors, but can also provide good acceleration during training. Through the use of a two-phase threshold selection technique, the training algorithm can be efficiently mapped to FPGA resources. Furthermore, by designing for high throughput and making efficient use of external memory bandwidth, the presented hardware architecture is able to obtain noticeable speed-up over a multi-threaded OpenCV implementation. In power-sensitive field applications, FPGA hardware may be used to greatly accelerate object detection training in place of embedded processors.
This work focuses on the basic Viola-Jones algorithm. However, the structure of the architecture is very modular and may be extended to implement some of the advancements [2] that have been made to this detector. In the near future, we plan to add support for additional features and boosting methods.
