A local binary pattern (LBP)-based tracking algorithm and a novel heterogeneous vision chip architecture that implements the algorithm at high speed are proposed. The algorithm is more robust than previously reported high-speed tracking algorithms. The proposed vision chip architecture adopts multiple levels of parallel processors to execute the algorithm in pixel-parallel and patch-parallel ways. Experimental results show that the proposed implementation achieves robust tracking at 1000 fps.
Introduction: A vision chip adopts massively parallel SIMD processing elements (PEs) to carry out image processing algorithms in a pixel-parallel fashion [1-4]. Such chips are widely applied to high-speed object tracking. However, reported vision chips can only perform simple, less robust algorithms such as background subtraction and self-window; thus they can only be applied to scenarios with a clear background and constant illumination. In this Letter, we propose a local binary pattern (LBP)-based tracking algorithm. The feature of the tracked object is characterised using an LBP descriptor that is discriminative and invariant to illumination changes. The algorithm is more robust than previously reported vision chip tracking algorithms. Moreover, we develop a heterogeneous vision chip architecture that can perform the algorithm with multiple levels of parallelism. Experiments and performance analysis are carried out. The results indicate that the proposed implementation achieves robust, high-speed tracking at 1000 fps.
Proposed tracking algorithm:
The LBP is one of the best-performing texture descriptors and has a wide range of applications [5]. A well-accepted way to form a global description with an LBP is to divide the image into several regions, extract an LBP histogram from each region and concatenate the histograms into an enhanced feature vector [6]. The main components of the proposed LBP-based high-speed tracking algorithm are shown in Fig. 1. The object window is first divided into a grid of sub-regions. The LBP histograms of the sub-regions are then concatenated to form a global feature of the object, as shown in Fig. 1a. This method has been successfully applied to facial image analysis, where it has proven to be discriminative and robust.

The tracking procedure is shown in Fig. 1b. In the next frame, the global features of the search windows surrounding the object location are generated and compared with the feature of the object. The sum of absolute differences (SAD) between the feature of each search window and the feature of the object window is calculated. The search window with the smallest SAD value becomes the new object window. If the algorithm is processed at high speed, the difference between two successive frames is limited to a few pixels, even with rotation and scale changes. The feature is updated in every frame to adapt to affine and scale changes of the object. In the following section, the algorithm is implemented on our vision chip with multiple levels of parallelism.

Proposed vision chip architecture:
The proposed vision chip architecture is shown in Fig. 2. It is mainly composed of an N × N array of macro blocks (MBs), a RISC core and some other necessary logic. The RISC core controls the overall system. Each MB consists of N1 × N1 PEs and a patch processing unit (PPU). Each PE contains a local memory and an arithmetic and logic unit (ALU) that can perform 1 bit AND, OR, addition and inversion operations. The (N1 × N)^2 PE array has an O(N × N) parallelism and can perform some algorithms in a pixel-parallel way [2, 3]. The PPU has a RISC-like structure. It consists of an instruction sequencer, an 8 bit ALU, a general purpose register, a PE input buffer register and a local memory. The PPU can perform arithmetic operations and index addressing that cannot be accomplished by the PE. Similar to the PE array, the N × N PPU array has an O(N × N) parallelism, and it can process algorithms in a patch-parallel way.

The feature match procedure can also be performed by the PPUs in parallel. Each PPU stores the histograms of the same sub-region from different frames, so the histogram SAD of all sub-regions can be calculated in the PPUs in parallel. Finally, the RISC core accesses the PPUs, and only a few operations are needed to determine the best match window and the object location.
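The tracking procedure described above (grid of sub-regions, concatenated LBP histograms, SAD matching over nearby search windows) can be illustrated with a minimal sequential sketch. It is not the authors' implementation and does not model the PE/PPU parallel mapping. It assumes the basic 3 × 3, 8-neighbour LBP operator with 256 histogram bins, border pixels handled by edge replication, and a square search range; the function names, the neighbour ordering and the `radius` parameter are all hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]

# Assumption: 8 neighbours taken clockwise from the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]


def pixel(img: Image, r: int, c: int) -> int:
    """Read a pixel, clamping coordinates to the image (edge replication)."""
    r = min(max(r, 0), len(img) - 1)
    c = min(max(c, 0), len(img[0]) - 1)
    return img[r][c]


def lbp_code(img: Image, r: int, c: int) -> int:
    """Basic 3x3 LBP: one bit per neighbour that is >= the centre pixel."""
    centre = pixel(img, r, c)
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        if pixel(img, r + dr, c + dc) >= centre:
            code |= 1 << bit
    return code


def window_feature(img: Image, top: int, left: int,
                   size: int, grid: int) -> List[int]:
    """Concatenate the LBP histograms of a grid x grid split of the window."""
    cell = size // grid
    feature: List[int] = []
    for gr in range(grid):
        for gc in range(grid):
            hist = [0] * 256
            for r in range(top + gr * cell, top + (gr + 1) * cell):
                for c in range(left + gc * cell, left + (gc + 1) * cell):
                    hist[lbp_code(img, r, c)] += 1
            feature.extend(hist)
    return feature


def sad(a: List[int], b: List[int]) -> int:
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def track(img: Image, ref: List[int], top: int, left: int, size: int,
          grid: int, radius: int) -> Optional[Tuple[int, int, int, List[int]]]:
    """Return (SAD, top, left, feature) of the best search window, if any."""
    best = None
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            t, l = top + dr, left + dc
            if t < 0 or l < 0 or t + size > len(img) or l + size > len(img[0]):
                continue  # skip windows that fall outside the frame
            feat = window_feature(img, t, l, size, grid)
            score = sad(feat, ref)
            if best is None or score < best[0]:
                best = (score, t, l, feat)
    return best
```

In a tracking loop, the feature returned by `track` would replace `ref` for the next frame, mirroring the per-frame feature update described above. On the chip, the per-pixel LBP codes and per-patch histograms would be computed in parallel rather than in nested loops.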

Implementation and experimental results:
A tracking system was built as shown in Fig. 3a. It contains a camera, two actuators and a field programmable gate array (FPGA) located behind the camera. The proposed architecture is realised on the FPGA. It consists of a 128 × 64 PE array and 16 × 8 PPUs uniformly distributed along the PE array. Each PE includes 64 bits of memory, and each PPU can access 4096 bits of PE memory (8 × 8 PEs, 64 bits per PE). The PE array and the PPU array give a peak performance of 56 GOPS (giga-operations per second) at a clock frequency of 50 MHz when performing 8 bit arithmetic operations. The prototype system continuously captures the target and extracts its position. According to this position, the actuators adjust the camera to keep the target in the centre of the view. Fig. 3b gives the record of the motion of the target during the tracking experiment. The collection of 1110 data points clearly exhibits the sinusoidal motion of the target.

A more detailed performance analysis of the proposed architecture is shown in Fig. 4. The breakdown of time and computation for the proposed vision chip performing the tracking algorithm on a 128 × 128 pixel image is shown in Fig. 4a. Thanks to the pixel-parallel PE array, 10.8 million operations are finished in only 194 μs. Data transfer and histogram operations are performed by the PPU array, which takes 309 μs to finish about 2.1 million operations. The RISC core controls the whole system and issues some of the SIMD instructions for the PE array and the PPU array; this takes 490 μs. Fig. 4b shows the instruction count of some kernel functions of the algorithm. LBP operators with different scales are realised. The
instruction cycles, respectively. Histograms with a bigger patch size can be computed using additional PE data transfer instructions. The SAD of two histograms can be computed in 100 PPU instruction cycles.

Table 1 shows a comparison with previously reported vision chips. The algorithms proposed in these studies either utilise a threshold value to obtain a binary image [3, 4] or capture a binary image in the first place [1, 2]. These algorithms can all be classified as binary image-based methods, and they can only work in artificial environments with a clear background and constant illumination. The proposed algorithm adopts a robust LBP feature that can work in much more complex environments. Furthermore, the proposed architecture has two levels of parallelism. The integration of PPUs greatly enhances the system flexibility and performance for the proposed algorithm. The architecture in [3] has the highest tracking speed among those listed in Table 1; however, its algorithm is simple. If our proposed algorithm were implemented using the architecture described in [3], the tracking speed would be below 1000 fps.

Conclusion:
This Letter introduces a high-speed tracking implementation composed of an LBP-based tracking algorithm and a heterogeneous vision chip architecture. Taking advantage of the LBP texture descriptor, the proposed algorithm is more robust than those in previously reported studies. In addition to the pixel-parallel PE array, the proposed architecture utilises the PPU to perform complex operations and gains patch-parallel processing ability. Experimental results show that the proposed high-speed tracking implementation achieves 1000 fps tracking. Our analysis indicates that the pixel-parallel and patch-parallel designs of the architecture greatly improve system performance and reduce the algorithm's execution time.
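As a cross-check of the 1000 fps figure, the per-frame time breakdown reported for Fig. 4a can be summed with a short calculation. This sketch assumes that the three stages run one after another with no overlap, which the text does not state explicitly; the function and variable names are hypothetical.

```python
# Per-frame stage times in microseconds, as quoted in the text for Fig. 4a.
STAGE_TIME_US = {"PE array": 194, "PPU array": 309, "RISC": 490}


def frame_budget(stage_times_us):
    """Return (total time in microseconds, resulting frames per second).

    Assumption: stages execute sequentially with no overlap.
    """
    total_us = sum(stage_times_us.values())
    return total_us, 1_000_000 / total_us


def throughput_gops(operations, time_us):
    """Effective throughput in giga-operations per second."""
    return operations / (time_us * 1e-6) / 1e9


total_us, fps = frame_budget(STAGE_TIME_US)
print(f"frame time: {total_us} us, frame rate: {fps:.1f} fps")
print(f"PE array:  {throughput_gops(10.8e6, 194):.1f} GOPS")
print(f"PPU array: {throughput_gops(2.1e6, 309):.1f} GOPS")
```

Under this assumption the three stages add up to 993 μs, just under the 1 ms budget that a 1000 fps frame rate allows, and the PE array's effective throughput comes out close to the quoted 56 GOPS peak.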
