SUMMARY Visual object detection on embedded systems involves a multi-objective optimization problem in the presence of trade-offs between power consumption, processing performance, and detection accuracy. For a new Pareto solution with high processing performance and low power consumption, this paper proposes a hardware architecture for decision tree ensemble using multiple channels of features. For efficient detection, the proposed architecture utilizes the dimensionality of feature channels in addition to parallelism in image space and adopts task scheduling to attain random memory access without conflict. Evaluation results show that an FPGA implementation of the proposed architecture with an aggregated channel features pedestrian detector can process 229 million samples per second at 100 MHz operation frequency while it requires a relatively small amount of resources. Consequently, the proposed architecture achieves 350 fps processing performance for 1080P Full HD images and outperforms conventional object detection hardware architectures developed for embedded systems.
Introduction
Object detection is now an indispensable component of many practical applications on embedded systems, such as advanced driver assistance systems, robotics, and surveillance. These types of systems have strong constraints on power consumption, processing performance, and detection accuracy, which are in a trade-off relationship. Recently, deep learning has largely improved this trade-off, and features computed by deep convolutional networks have achieved remarkable inference performance beyond hand-engineered features. However, considering the power constraint and the limited resources of embedded systems, features that require computationally intensive networks are not always the best choice.
In the field of computer vision, multiple studies have proposed deep learning methods for real-time object detection [1]-[4]. The features for object detection are extracted through a fully convolutional network, and the computational cost is often reduced by a region-of-interest (RoI) estimator. Here, let us take Fast YOLO in [2] as an example since it is, to the best of our knowledge, the fastest method at the current stage. Fast YOLO processes 448x448 input images at more than 150 frames per second on a Titan X GPU, which means the constraints of object detection accuracy and processing performance are satisfied. However, the remaining constraint of power consumption is not satisfied at all: inference using a Titan X GPU requires approximately 200 Watts [5], which is not affordable for embedded systems. Alternatively, when using Tegra GPUs designed for embedded use, the processing performance is not sufficient.
On the other hand, researchers in the hardware community focus on conventional machine learning algorithms such as support vector machines (SVMs) and shallow neural networks (NNs) because they are easy to implement [6]-[8]. SVMs and shallow NNs mainly consist of multiply-accumulate operations and are implementable with a uniformly distributed array structure. However, it is well known that object detection using SVMs and shallow NNs suffers from poor detection accuracy, and the circuit area of multipliers becomes a critical issue when designing an accelerator that exploits a high degree of parallelism. To address this issue, it is necessary to find an object detection method with reasonably high detection accuracy and to develop efficient hardware for it that delivers high processing performance.
From this point of view, we focused on conventional object detection methods using decision tree ensembles (DTEs) [9], [10]. These object detection methods have multiple advantages in hardware implementation: low computational cost with soft cascade and multiplier-free operations in classification. Despite these attractive features, only a few studies on DTE hardware architectures have been reported [11], because the conditional branches of the decision stumps composing a DTE require random memory accesses, which prevents efficient parallel processing.
As a solution, this paper proposes a hardware architecture for DTEs. The proposed architecture processes object detection in a SIMD-like homomorphic manner even while DTEs are adopted as a classifier. This paper also proposes a task scheduling algorithm that controls memory accesses from multiple modules to improve processing performance. The proposed architecture has two distinctive features. First, it supports three-dimensional parallel memory access, 1-D for feature channels and 2-D for image space, achieving multiple times higher processing performance than conventional hardware architectures. Second, it takes advantage of algorithmic acceleration by using soft cascade, which improves processing performance by one to two orders of magnitude. For evaluating the proposed DTE hardware architecture, we assume the feature extraction method called aggregated channel features (ACF) [10] and classification using multi-scale classifiers with octave-wise feature maps [9].
The rest of this paper is organized as follows. Section 2 explains DTEs and conventional hardware architectures. Section 3 provides a hardware architecture for DTEs, and Sect. 4 proposes a task scheduling algorithm to handle memory access conflict. Section 5 describes evaluation results and provides an analysis of the proposed hardware architecture. Section 6 concludes the paper.
Target Application and Available Architectures
This section briefly explains the object detection method based on the DTE using ACF [10] , which is the target of hardware implementation in this work, and introduces DTE hardware architectures proposed so far.
Decision Tree Ensemble using Aggregated Channel Features
The underlying idea of ensemble learning is to boost the final learner's predictive performance by accumulating weighted votes from weak learners whose predictive performance is only slightly better than random guessing. As shown in Fig. 1, a DTE is an ensemble learning method using multiple decision trees (DTs) as weak learners. Each DT consists of decision nodes and leaf nodes, where a decision node selects one of its child nodes based on the comparison result between its input and threshold, and a selected leaf node returns its evaluation value. Given an input feature vector $x$, the final learner $H$ is defined as

$$H(x) = \sum_{i=1}^{T} h_i(x), \tag{1}$$
where $h_i$ is the prediction function of the $i$-th DT and returns the value of a selected leaf node, and $T$ is the number of DTs in the DTE. Compared with recent deep learning algorithms, a DTE is a shallow machine learning algorithm. However, it is reasonably deep and shows good classification performance for practical applications [12], which will also be demonstrated experimentally in Sect. 4.

In computer vision, non-rigid object detection has been a challenging issue and has been studied for several decades. Of the many existing methods, Dollár et al. proposed a remarkably efficient and accurate object detection method based on the DTE using ACF [10]. Figure 2 shows the processing flow of object detection using ACF. As shown in Fig. 2, ACF consists of ten channels extracted from three types of features: six channels from HOG, three channels from LUV color space, and one channel from gradient magnitudes. ACF calculates raw features of each channel, aggregates each 4x4 block to build an aggregated channel, and classifies input data extracted by sliding-window sampling. Non-maximum suppression clusters the detection results corresponding to an object into one. In [10], Dollár et al. reported that the DTE using ACF achieved a 17% log-average miss rate (MR) and 31.9 fps processing performance on a single CPU, which outperforms other types of state-of-the-art methods. However, to implement hardware that exploits ACF, it is necessary to clarify how to utilize multiple channels of features in parallel computation.
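The ensemble definition above can be illustrated with a minimal Python sketch for depth-two trees. The class name `DepthTwoTree`, the node layout, and the comparison direction (a node answers 1 when the feature is greater than or equal to the threshold) are assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DepthTwoTree:
    """Hypothetical depth-two tree: 3 decision nodes and 4 leaves."""
    feature_idx: Sequence[int]  # feature read by the root, left child, right child
    threshold: Sequence[float]  # threshold of the root, left child, right child
    leaves: Sequence[float]     # leaf values, ordered left to right

    def predict(self, x: Sequence[float]) -> float:
        # Assumed convention: response 1 (go right) when feature >= threshold.
        go_right = x[self.feature_idx[0]] >= self.threshold[0]
        node = 2 if go_right else 1
        second = x[self.feature_idx[node]] >= self.threshold[node]
        return self.leaves[2 * int(go_right) + int(second)]


def ensemble_score(trees: List[DepthTwoTree], x: Sequence[float]) -> float:
    """H(x): the sum of the leaf values selected in each tree."""
    return sum(tree.predict(x) for tree in trees)


trees = [
    DepthTwoTree([0, 1, 2], [0.5, 0.2, 0.8], [-1.0, -0.5, 0.5, 1.0]),
    DepthTwoTree([2, 0, 1], [0.3, 0.1, 0.9], [-0.25, 0.25, -0.75, 0.75]),
]
score = ensemble_score(trees, [0.6, 0.1, 0.9])
print(score)
```

Note that each tree reads only two features per sample, and which two depends on the first comparison; this data-dependent access is the source of the random memory accesses discussed in the paper.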
Hardware Architectures for Decision Tree Ensemble
There exist multiple hardware architectures for DTEs [13] , [14] , and Struharik and Novak classified them into three types in [15] : threshold networks, single-path architectures, and single-node architectures. Figure 3 shows a threshold network, a single-path architecture, and a single-node architecture, which are available implementations for a depth-two DT. The threshold network, shown in Fig. 3(b) , is an architecture that processes all decision nodes of a DT in parallel, calculating an output O as follows:
where $d_i$ is the binary response of the $i$-th decision node, 0 or 1, and $l_j$ is the value of the $j$-th leaf node. Since a threshold network calculates its output immediately after the input arrives, it is suitable for applications requiring a short delay between input and output. The single-path architecture, shown in Fig. 3(c), has pipeline stages of universal nodes, where the number of pipeline stages is equal to the depth of the DT, and the universal nodes are processing elements that carry out the function of decision nodes. Since the single-path architecture adopts a pipelined homogeneous structure, it achieves throughput equivalent to that of the corresponding threshold network with a relatively small amount of hardware resources. The single-node architecture, shown in Fig. 3(d), also uses universal nodes as processing elements but does not have pipeline stages. The single-node architecture has more flexibility in its design than the others mentioned above in that it can handle any processing order of decision nodes, and there exist multiple hardware architectures for storing the responses of decision nodes. However, its processing performance and required hardware resources largely depend on the design. Therefore, for a hardware implementation based on the single-node architecture, the architecture design plays an important role.
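As a hedged illustration of the threshold network for a depth-two DT, the sketch below writes the output as a sum of products over all node responses and checks it against a sequential root-to-leaf walk. The branch convention (response 0 selects the left child), the numbering of nodes (`d1` root, `d2` left child, `d3` right child), and the function names are assumptions, not taken from the source.

```python
from itertools import product


def threshold_network(d1, d2, d3, leaves):
    """Sum-of-products form: all three node responses are used at once."""
    l1, l2, l3, l4 = leaves
    return ((1 - d1) * (1 - d2) * l1 + (1 - d1) * d2 * l2
            + d1 * (1 - d3) * l3 + d1 * d3 * l4)


def traverse(d1, d2, d3, leaves):
    """Sequential walk through the same tree, one node per step."""
    if d1 == 0:
        return leaves[0] if d2 == 0 else leaves[1]
    return leaves[2] if d3 == 0 else leaves[3]


leaves = (-1.0, -0.5, 0.5, 1.0)
# Compare both forms for every combination of binary responses.
agree = all(
    threshold_network(*d, leaves) == traverse(*d, leaves)
    for d in product((0, 1), repeat=3)
)
print(agree)
```

Exactly one product term is non-zero for any combination of responses, which is why the parallel form needs every decision node evaluated up front while the walk needs only one node per level.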
Parallel Implementation of Decision Tree Ensembles Using Multiple Memory Banks
The proposed hardware architecture is a single-node architecture designed for exploiting multiple memory banks with a small amount of routing resources. This section explains its overview and details in order.
Architecture Overview
ACF has multiple types of features and requires sophisticated memory access patterns. Considering the parallel feature extraction before classification, we need to allocate a dedicated memory bank for each channel. In this case, the threshold network and the single-path architecture are not suitable because they require a massive amount of routing resources for supporting random memory access to all the banks. On the other hand, the single-node architecture can resolve the routing resource problem by assigning a universal node to each channel and merging the responses of decision nodes belonging to each DT. The proposed architecture is designed based on this idea. Figure 4 shows the overview of the proposed hardware architecture, and Table 1 lists the notations used in Fig. 4. The proposed architecture mainly consists of three sub-modules: decisionNodeCube, leafNodeCube, and ctrl, which are a 3-D array of decision nodes, a 3-D array of leaf nodes, and a control unit, respectively. The decisionNodeCube consists of $C$ decisionNodeMatrix modules, and the leafNodeCube consists of $M$ leafNodeMatrix modules, an accumMatrix module, and a chSelMem module, where $M$ is less than or equal to $C$ due to the non-uniformity of the channel usage described in Sect. 4. This architecture enables 3-D parallel classification, and the hardware handles a massive amount of data. Thus, in hardware design, the scalability of the decisionNodeMatrix, leafNodeMatrix, and accumMatrix sub-modules must be carefully considered; these sub-modules are discussed in Sect. 3.2.
Details of Sub-Modules
The processing flow of the proposed architecture depends entirely on the ctrl and the chSelMem modules. The ctrl observes the states of all the sub-modules and dynamically provides control signals, and the chSelMem provides static task schedules generated by the proposed scheduling algorithm described in Sect. 4. Therefore, the proposed architecture can handle any DTE by updating the task schedules.
Each of the $C$ decisionNodeMatrix modules composing the decisionNodeCube corresponds to one of the $C$ input feature channels: the HOG channels, the LUV channels, and the gradient magnitude channel described in Sect. 2. Each decisionNodeMatrix consists of three sub-modules: a decisionMem, a featureMem, and a 2-D array of $W_{node} \times H_{node}$ decisionNode modules. In the decisionNodeMatrix, the decisionMem is the only module controlled by the signal $c_d$ from the control unit, and the data $d_x$, $d_y$, and $d_t$ loaded from the decisionMem control the others. The featureMem provides the $W_{node} \times H_{node}$ feature values, $f$, of the block at the position $(d_x, d_y)$ to the decisionNode modules. To support loading the feature block at an arbitrary position, the featureMem uses $H_{node}$ dual-port line buffers, lineBuffer, and $H_{node}$ shift registers, horShiftReg, for location adjustment. The dual-port line buffers make it possible to load misaligned data at any vertical position in a single cycle, and the shift registers make it possible to extract the target data columns at any horizontal position. Each decisionNode generates a 1-bit comparison result as the decision response between a feature value of $f$ and a threshold $d_t$, as shown in Fig. 5(b), where $d_t$ is the threshold shared by all decisionNode modules of a decisionNodeMatrix.
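The behavior of one decisionNodeMatrix described above can be modeled in software as follows. This is only a functional sketch: the function names are hypothetical, the line buffers and shift registers are replaced by list slicing, and the comparison direction (response 1 when the feature is greater than or equal to the threshold) is an assumption.

```python
from typing import List


def load_block(feature_map: List[List[int]], dx: int, dy: int,
               w_node: int, h_node: int) -> List[List[int]]:
    """Return the h_node x w_node block whose top-left corner is (dx, dy)."""
    return [row[dx:dx + w_node] for row in feature_map[dy:dy + h_node]]


def decision_responses(block: List[List[int]], dt: int) -> List[List[int]]:
    """Compare every feature in the block against the shared threshold dt."""
    # Assumed direction: the response is 1 when feature >= threshold.
    return [[int(f >= dt) for f in row] for row in block]


# Toy single-channel feature map: value = x + 10 * y.
feature_map = [[x + 10 * y for x in range(8)] for y in range(6)]
block = load_block(feature_map, dx=3, dy=2, w_node=4, h_node=2)
bits = decision_responses(block, dt=25)
print(block)
print(bits)
```

All positions in the block share the same `(dx, dy, dt)` triple, which is what lets one entry of the decisionMem drive the whole 2-D array of decisionNode modules.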
Each of the $M$ leafNodeMatrix modules composing the leafNodeCube consists of three sub-modules: a leafMem, a ringShiftReg, and a 2-D array of $W_{node} \times H_{node}$ leafNode modules. The leafMem provides all leaf values of each DT to the leafNodeMatrix, the ringShiftReg vertically and horizontally rotates the $W_{node} \times H_{node} \times C$ 1-bit decision responses to the correct positions, and the leafNode selects a leaf value from $l$ based on each series of decision responses. Figure 5(c) shows the details of the leafNode for depth-two DTEs. As shown in Fig. 5(c), the leafNode includes a demultiplexer for rearranging the order of decision responses, three flip-flops (FFs) for storing the decision responses of a depth-two DT, and a leafNodeSel for selecting a leaf value based on the responses. This structure enables the leafNode to handle decision responses arriving in random order and to improve the processing performance through task scheduling. Also, a simple modification of the leafNode allows it to handle DTEs deeper than depth two: for processing depth-three DTEs, the leafNode requires 7-bit FFs to store seven decision responses, and the leafNodeSel requires an extension to select one of eight leaf values.

The accumMatrix consists of $W_{node} \times H_{node}$ accum modules. Figure 5 the offset, which is used for soft cascade rejection. When the static threshold of the soft cascade is $-s$, the offset is set to $s$, so that the control unit can decide soft cascade rejection only with the sign bit of the accumulated value, $a_s$.
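The two mechanisms described above, leaf selection from decision responses that arrive in any order and sign-bit rejection with an offset accumulator, can be sketched in Python as follows. The node numbering (0 for the root, 1 for the left child, 2 for the right child), the branch convention, the integer leaf values, and both function names are assumptions made for illustration.

```python
from typing import Iterable, Optional, Sequence, Tuple


def select_leaf(responses: Iterable[Tuple[int, int]], leaves: Sequence[int]) -> int:
    """Pick a leaf of a depth-two tree from (node_id, bit) pairs in any order."""
    stored = [0, 0, 0]              # models the three flip-flops
    for node_id, bit in responses:  # the demultiplexer routes each bit to its slot
        stored[node_id] = bit
    root = stored[0]
    child = stored[2] if root else stored[1]
    return leaves[2 * root + child]


def first_rejection(leaf_values: Sequence[int], s: int) -> Optional[int]:
    """Return the index of the tree at which the window is rejected, or None.

    The accumulator starts at the offset s, so testing the sign of the
    accumulated value is equivalent to testing (sum of leaf values) < -s.
    """
    acc = s
    for i, value in enumerate(leaf_values):
        acc += value
        if acc < 0:                 # only the sign bit needs to be inspected
            return i
    return None


leaf = select_leaf([(2, 1), (0, 1), (1, 0)], (10, 20, 30, 40))
rejected_at = first_rejection([3, -2, -5, 4], s=3)
survives = first_rejection([3, -2, -5, 4], s=5)
print(leaf, rejected_at, survives)
```

Storing every response before selecting is what frees the scheduler to reorder the decision nodes of a tree; the price is one flip-flop per decision node in each leafNode.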
Task Scheduling for Parallel Implementation
The proposed hardware architecture can process decision nodes of a DT in random order, and its processing performance depends on the efficiency of the parallel memory access. For further acceleration, we propose a task scheduling algorithm dedicated to the proposed architecture. This task scheduling is an optimization problem considering each decision node as a task subject to two constraints derived from the architecture design. This section explains how to formulate the task scheduling problem, describes the proposed algorithm, and analyzes its effectiveness.
Decision Tree Ensemble Scheduling Problem
Memory accesses resulting from conditional branches may cause memory conflicts while processing multiple DTs at once. The purpose of task scheduling is to avoid these conflicts by processing all decision nodes of a DT in a fixed order, as shown in Fig. 6, and by controlling parallel memory accesses from multiple DTs. Given a DTE classifier, this scheduling algorithm fixes a task schedule offline, and classification requires no additional computation for scheduling. The task scheduling is an optimization problem of finding the minimum completion time $t^*_{comp}$ and its assignment matrix $A^*$ for $M$ modules, defined as

$$t^*_{comp} = \min_{A} t_{comp}(A), \qquad A^* = \arg\min_{A} t_{comp}(A), \tag{3}$$

where $t_{comp}(A) = \max\{t \mid \exists m \in \{1, \ldots, M\},\ a_{mt} \neq 0\}$ is the completion time using an assignment matrix $A$, and $a_{mt}$ is the $(m, t)$-th entry of $A$, representing the decision node processed on the $m$-th module at the $t$-th cycle. If there is no task assignment, $a_{mt}$ is zero. The two constraints of this scheduling problem are defined as follows. When a decision node $d_{ij}$ is assigned to $a_{m_{ij} t_{ij}}$, the first constraint is

$$\forall i \in \{1, \ldots, T\},\ \forall j_1, j_2 \in \{1, \ldots, S\},\quad m_{i j_1} = m_{i j_2}, \tag{4}$$

where $S$ is the number of decision nodes in a DT, and for any $i$, the cycles $t_{ij}$ need to be consecutive numbers. The second constraint, Eq. (5), is stated over all cycles $t \in \{1, \ldots, T_{max}\}$ in terms of $c(d_{ij})$, where $c(d_{ij})$ represents the channel used in $d_{ij}$, and $T_{max}$ is the possible maximum completion time. These constraints represent that an identical leafNode processes all decision nodes belonging to a DT without preemption, and that each leafNodeMatrix exclusively uses decision responses from a decisionNodeMatrix at each time, respectively. Then, the scheduling problem can be defined as an offline problem, as shown in Fig. 7. This scheduling problem can be considered an extension of the NP-hard job shop scheduling problem.
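To make the two constraints concrete, the following sketch validates a candidate schedule and returns its completion time. The data layout (dictionaries keyed by a tree index and a node index) and the function name `check_schedule` are hypothetical. The second check encodes one reading of the prose description, namely that a channel can serve only one decision node per cycle; it is not a transcription of the paper's formal statement.

```python
from collections import defaultdict
from typing import Dict, Tuple

Node = Tuple[int, int]  # (tree index i, decision node index j)
Slot = Tuple[int, int]  # (module m, cycle t)


def check_schedule(schedule: Dict[Node, Slot], channel: Dict[Node, int]) -> int:
    """Validate a schedule and return its completion time (last used cycle)."""
    by_tree = defaultdict(list)
    for (i, _j), slot in schedule.items():
        by_tree[i].append(slot)
    for i, slots in by_tree.items():
        # First constraint: one module per tree, occupied in consecutive cycles.
        if len({m for m, _t in slots}) != 1:
            raise ValueError(f"tree {i} is split across modules")
        times = sorted(t for _m, t in slots)
        if times != list(range(times[0], times[0] + len(times))):
            raise ValueError(f"tree {i} does not occupy consecutive cycles")
    used_slots, used_channels = set(), set()
    for node, (m, t) in schedule.items():
        # A matrix entry a_mt can hold only one decision node.
        if (m, t) in used_slots:
            raise ValueError(f"module {m} is used twice in cycle {t}")
        # Second constraint (assumed reading): one access per channel per cycle.
        if (t, channel[node]) in used_channels:
            raise ValueError(f"channel {channel[node]} is used twice in cycle {t}")
        used_slots.add((m, t))
        used_channels.add((t, channel[node]))
    return max(t for _m, t in schedule.values())


channel = {(1, 1): 0, (1, 2): 1, (1, 3): 2, (2, 1): 1, (2, 2): 0, (2, 3): 0}
good = {(1, 1): (1, 1), (1, 2): (1, 2), (1, 3): (1, 3),
        (2, 1): (2, 1), (2, 2): (2, 2), (2, 3): (2, 3)}
completion = check_schedule(good, channel)
print(completion)
```

In the example, the second tree uses channel 0 twice, so the order in which its nodes are processed decides whether it can run alongside the first tree without a conflict.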
Proposed Heuristic Scheduling Algorithm
As mentioned above, the target scheduling problem is NP-hard, and it is difficult to find the optimal solution $t^*_{comp}$. Thus, the proposed algorithm aims to find a solution which is close to the lower bound, where the lower bound is equal to the maximum frequency in the channel histogram. The proposed algorithm adopts a greedy approach and focuses on the frequency of channels in a DTE. The assignment is in the order that the frequency of channels represents the priority, which reduces the number of assignment candidates and reduces the amount of computation. Also, for improving the completion time, it is a promising approach to make a flat histogram by reducing the number of channels considering the variations of the frequency in the channel histogram.

Algorithm 1 Task scheduling of a DTE
Input:
6: for all $h \in H_n$ do
7: $(m^*, t^*_{tgt}, t^*) \leftarrow (0, T_{max}, T_{max})$
8: $P(h)$: a set of tuples consisting of permutations of $h$
9: for all $P \in P(h)$ do
10: $(m, t) \leftarrow \mathrm{SWC}(A, P)$
Algorithm 1 shows the proposed algorithm. As preprocessing, the proposed method merges the $C$ input channels into $M$ channels, where the two channels with the lowest frequencies are merged in each iteration, as described in lines 23-35. The assignment process consists of $M$ iterations over the merged channels, and in its $n$-th iteration, the DTs containing the channel $n$, represented as $H_n$, are assigned. For each DT $h$ in $H_n$, the proposed algorithm searches for an assignment position satisfying the constraints described in Eqs. (4) and (5) for all the patterns of processing orders, as shown in lines 36-46. From all of the processing orders, the one with the earliest completion time and the channel $n$ is selected using the condition shown in line 12, and it is assigned to the assignment matrix.
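The preprocessing step and the lower bound described above can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: the function names are hypothetical, ties between equal frequencies are broken by channel index, and how the merged groups are consumed by the assignment step is not modeled.

```python
import heapq
from collections import Counter
from typing import List, Sequence, Tuple


def lower_bound(node_channels: Sequence[int]) -> int:
    """Largest number of decision nodes that read the same channel."""
    return max(Counter(node_channels).values())


def merge_channels(node_channels: Sequence[int], m: int) -> List[Tuple[int, Tuple[int, ...]]]:
    """Merge the two least frequent groups repeatedly until m groups remain.

    Returns (frequency, channels) pairs, most frequent first.
    """
    heap = [(freq, (ch,)) for ch, freq in Counter(node_channels).items()]
    heapq.heapify(heap)
    while len(heap) > m:
        f1, c1 = heapq.heappop(heap)   # least frequent group
        f2, c2 = heapq.heappop(heap)   # second least frequent group
        heapq.heappush(heap, (f1 + f2, tuple(sorted(c1 + c2))))
    return sorted(heap, reverse=True)


# Toy channel list: one entry per decision node in the ensemble.
node_channels = [0] * 5 + [1] * 4 + [2] * 2 + [3] * 1
groups = merge_channels(node_channels, m=3)
bound = lower_bound(node_channels)
print(groups, bound)
```

In this toy case the merged group of channels 2 and 3 holds three nodes, still fewer than the five nodes of channel 0, so merging flattens the histogram without exceeding the most frequent channel.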
Analysis of Scheduling Algorithm
In the analysis, the target classifiers are depth-two and depth-three ACF classifiers trained in the same manner as [10], consisting of 2,048 and 1,673 DTs, respectively. The MATLAB evaluation code of the Caltech Pedestrian Detection Benchmark [16] is used to evaluate detection accuracy. The log-average MRs on the INRIA Person Dataset [17] are 16.5% and 16.3%, respectively. Figure 8 shows the detection error trade-off curves of these two classifiers and the classifier reported in [10], where our classifiers achieve detection accuracy equivalent to that of the original ACF classifier. Figure 9 shows the histograms of input channels for these classifiers, which indicate that there is a large variance of frequencies among channels in both histograms. Taking into account the exclusiveness of memory access, the lower bound for this problem is equal to the maximum number of decision nodes in a channel.

Figure 10 shows the relationship between the parallel degree $M$ and the number of cycles required for processing the classifiers based on the task schedules. For both classifiers, the number of processing cycles decreases as $M$ increases up to 8. When $M$ is equal to 8, both numbers of processing cycles reach the lower bound drawn in dotted lines. Compared with serial classification, the proposed scheduling achieves 6.6 and 6.4 times speed-up for the depth-two and depth-three classifiers, respectively. For more details, Table 2 lists the number of processing cycles and the occupancy of the leafNodeCube. The result shows that the proposed scheduling reduces the number of cycles to the lower bound. Also, using soft cascade accelerates the processing of negative windows. In Table 2, $c_{neg}$ represents the average number of processing cycles for negative windows, and Table 2 shows that combining the proposed task scheduling and soft cascade reduces the average number of cycles of the depth-two and depth-three ACF classifiers to 3.3% and 5.4%, respectively, of the processing cycles required for a positive window.
Scheduling under Deeper Decision Tree Ensemble
Recent work [18] reports that deeper DTs show good detection performance. To analyze the relationship between the task scheduling performance and the depth of DTs, the proposed task scheduling is applied to a deep DTE classifier. The evaluation uses a depth-six classifier provided by the authors † , which is trained for the Caltech Pedestrian Detection Benchmark [16]. The classifier consists of 3,324 DTs, and the number of decision nodes in the classifier is 137,043. The number of available permutations calculated in line 8 of Algorithm 1 grows exponentially with the depth of a DT, so it is necessary to reduce the number of candidates for deep DTs. To mitigate this, the processing order is fixed to the order of channel frequency in this experiment. Figures 11(a) and 11(b) show its channel histogram and the scheduling results, respectively. The result shows a convergence curve similar to that of the shallow DTEs and achieves 4.1 times speed-up compared with the serial implementation when $M$ is equal to 8. However, the processing cycles do not reach the lower bound even when the parallelism is equal to the number of channels, since Eq. (5) is difficult to satisfy for all the decision nodes. Improving the task scheduling for deeper DTs is included in our future work.

Fig. 11 Result of depth-six ACF classifier.
Evaluation
This section explains how the fixed-point classifier used in the evaluation is generated, describes the implementation settings, and evaluates the FPGA implementation based on the proposed hardware architecture in terms of resource usage and processing performance.
Evaluation Settings
For hardware implementation, we converted the depth-two DTE described in Sect. 4 into a classifier in fixed-point representation by using the method proposed in [19] . Figure 12 shows detection results of the converted fixed-point classifier. The proposed hardware architecture is implemented using Verilog hardware description language (HDL) at register transfer level (RTL). Table 3 shows the implementation settings. The target device is Xilinx xc7z045ffg900, the target operating frequency is 100 MHz, and the degree of parallelism is 1,024: parallel degree 8 from feature channels and 128 from image blocks, respectively. In feature extraction, we use three types of feature descriptors, i.e., HOG, gradient magnitude, and RGB color channels. We adopted RGB channels instead of LUV channels because the difference of color channels does not cause notable accuracy loss and converting to LUV channels is computationally intensive [20] . Also, for hardware implementation efficiency, we assumed the classification procedure proposed by Benenson et al. [9] , which uses multiple classifiers corresponding to different window sizes and feature maps extracted from scaled images, instead of the genuine ACF classification procedure using a classifier and the fast feature pyramid (FFP) proposed in [10] . Although FFP shows efficient memory usage and higher processing performance for software implementation, it is not suitable for hardware implementation because FFP needs to generate each layer of feature pyramid sequentially.
Resource Usage
For the evaluation of resource utilization, we synthesized the RTL implementation with Vivado 2015.4.2. Table 4 shows the resource utilization of the proposed implementation. As in Table 4 , it occupies less than 35% of both slice and block RAM resources of the target FPGA for processing 1,024 decision nodes in parallel. Also, from the details of LUT usage, the balanced use of both LUTs and FFs can be confirmed.
Processing Performance
For the evaluation of processing performance, we simulated the RTL implementation with actual input images on ModelSim SE-64 10.3. In the simulation, the classification process takes 45,809 cycles or 0.46 milliseconds for a full HD image without scaling. Since processing time is linearly proportional to the image resolution, when the processing time $t_{single}$ represents the required cycles or time for a single-scale full HD image, the entire processing time $t_{all}$ for full search detection with sliding-window sampling is defined as follows:
where $N_{scale}$ is the number of scaled images. Given that $t_{single}$ is 0.46 milliseconds and the parameters shown in Table 5, the processing time $t_{all}$ becomes 2.86 milliseconds. Thus, the proposed hardware can process full HD images at 350 fps. Table 5 provides a processing performance comparison between the proposed method and three conventional methods: an ACF software implementation [10], an ACF hardware implementation [11], and a deformable part model (DPM) hardware implementation [8]. The DPM hardware implementation [8] is the fastest hardware implementation so far. Evaluation based on frame rate does not provide a precise result because it does not take detailed implementation settings into account, and window-based evaluation is suitable for a fair comparison [21]. Here, window-based evaluation is adopted, and the processing performance is evaluated by converting it to processed windows per second, $N_{wps}$, using the following equation:
As shown in Table 5 , the proposed implementation is 105.0 and 116.1 times faster than the software and the hardware implementations of ACF, respectively. Also, it is 57.6 times faster compared with the DPM hardware implementation, which was the fastest implementation.
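The timing figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (45,809 cycles, 100 MHz, and 2.86 milliseconds); the variable names are illustrative, and the ratio of the two processing times is a derived quantity, not a value reported by the paper.

```python
CLOCK_HZ = 100e6              # target operating frequency
CYCLES_SINGLE_SCALE = 45_809  # simulated cycles for one full HD image, no scaling
T_ALL_MS = 2.86               # reported time for full multi-scale search

# Cycles to milliseconds at the stated clock frequency.
t_single_ms = CYCLES_SINGLE_SCALE / CLOCK_HZ * 1e3

# Frame rate implied by the full-search processing time.
fps = 1e3 / T_ALL_MS

# Derived: cost of the multi-scale search relative to a single scale.
scale_overhead = T_ALL_MS / t_single_ms

print(f"t_single = {t_single_ms:.2f} ms")
print(f"frame rate = {fps:.0f} fps")
print(f"multi-scale overhead = {scale_overhead:.2f}x")
```

The single-scale time rounds to the 0.46 milliseconds quoted in the text, and the frame rate rounds to the reported 350 fps.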
Discussion
The processing performance of practical applications depends on both feature extraction and classification. So far, we have shown that the proposed hardware architecture can process full HD images at 350 fps. Now, we discuss the processing performance of the feature extraction part. We assumed three feature descriptors, as mentioned above. To verify the feasibility of our assumption, we implemented feature extraction modules in Verilog HDL and evaluated them in terms of throughput and resource utilization. Table 6 summarizes the implementation result with 32 degrees of parallelism, where the parallelism comes from eight channels and four scaled images from a full HD image. As shown in Table 6, the entire utilization of the feature extraction modules is less than 10% of slices. With 32 degrees of parallelism, the feature extraction modules can achieve 60 fps processing performance for full HD images. However, this does not seem to be enough for providing feature maps to the proposed DTE classifier. Besides, to realize N-class object detection, the proposed DTE classifier needs to process N times faster than the feature extraction modules. In this case, the proposed DTE classifier with the assumed feature extraction modules can classify five classes of objects simultaneously.
Conclusion
For practical applications using visual object detection, the improvement of the trade-offs between hardware resources, processing performance, and detection accuracy has been a critical issue, and the proposed architecture successfully resolved this issue by improving classification speed without detection accuracy degradation. The proposed architecture adopted a hardware and software cooperative design and is distinct from other existing architectures. The hardware implementation based on the single-node architecture exploits its resources by using the proposed task scheduling method. Since the task schedules are static within a DTE, once the task schedules of a DTE are fixed, there is no processing overhead in the detection phase. Also, the task schedule focusing on the lower bound of required cycles clarified that the efficient parallel degree is less than the number of feature channels used in ACF, and made it possible to reduce the hardware resources for leaf nodes without lowering processing performance. In the evaluation, the proposed architecture was more than 100 times faster than conventional ACF implementations without any detection accuracy degradation and more than 50 times faster than the fastest DPM implementation. The proposed method outperforms the conventional methods in terms of scalability and processing performance.
