Abstract-A heterogeneous multi-core processor is proposed to achieve real-time dynamic object recognition on HD 720p video streams. The context-aware visual attention model is proposed to reduce the required computing power for HD object recognition based on enhanced attention accuracy. In order to realize real-time execution of the proposed algorithm, the processor adopts a 5-stage task-level pipeline that maximizes the utilization of its 31 heterogeneous cores, comprising four simultaneous multithreading feature extraction clusters, a cache-based feature matching processor and a machine learning engine. Dynamic resource management is applied to adaptively tune thread allocation and power management during execution based on the detected amount of tasks and hardware utilization to increase energy efficiency. As a result, the 32 mm² chip, fabricated in 0.13 µm CMOS technology, achieves 30 frame/sec with 342 8-bit GOPS peak performance and 320 mW average power dissipation, which are a 2.72 times performance improvement and 2.54 times per-pixel energy reduction compared to the previous state-of-the-art.
I. INTRODUCTION
As the resolution of video applications continues to increase, object recognition applications are becoming increasingly computationally intensive, requiring hundreds of giga operations per second (GOPS). Examples of such applications include augmented reality, image retrieval/reconstruction and scene analysis. Since these computationally complex applications are now being implemented on mobile vision platforms that are constrained by form factor, processing delay and power dissipation, a dedicated processor is necessary to obtain 30 frame/sec application throughput and sub-Watt power consumption with HD (720p or 1080p) video streams. However, the scale invariant feature transform (SIFT) [1]-based object recognition algorithm, which is popular for its invariance to scaling, rotation and illumination, is computationally complex because of the heavy workload of its local feature extraction and matching operations. As a result, conventional vision processors fail to achieve real-time performance while sustaining low power dissipation.
Even the latest multi-threaded CPU [2] is only capable of achieving 4.48 frame/sec when performing SIFT-based recognition on 720p video, because its computing power is far below the required hundreds of GOPS. Conventional single-threaded RISCs or VLIW DSPs [3] perform even worse. In contrast, the extensive data-level parallelism (DLP) of GPUs [4] or multi-core processors [5], which integrate multiple single instruction multiple data (SIMD) or single instruction multiple thread (SIMT) processing units, enables them to achieve high computing power of hundreds of giga floating-point operations per second (GFLOPS), although at the cost of high power consumption approaching 200 W, which is far beyond the power budget of a mobile vision system. Thus, massively parallel DSPs, including IMAPCAR [6] and the Storm-1 processor [7], have been proposed to exploit high DLP with minimized power dissipation for specific applications, and achieve power efficiencies of 50 GOPS/W and 24.4 GOPS/W, respectively. However, their achievable computing power within the limited power budget of mobile platforms is still insufficient for HD video-based object recognition, one of the most complex vision applications. Considering these problems of conventional vision processors, a new vision architecture should possess not only exceedingly high computing power but also high power efficiency for real-time SIFT implementation in mobile vision platforms.
Furthermore, because of its massively parallel architecture, a tile-based SIFT implementation in a vision processor requires 60-150 GB/s of on-chip bandwidth and 10-25 GB/s of off-chip bandwidth for HD real-time object recognition. It is therefore important not only to minimize the number of off-chip accesses but also to sustain high utilization of the on-chip bandwidth. To achieve the maximum possible throughput, a highly parallel processor must sustain this on-chip bandwidth through high datapath utilization and network bandwidth in each core when performing compute-intensive operations such as convolution, cascaded feature extraction, and long-latency feature matching.
In consideration of these design constraints, in this paper we introduce a real-time low-power object recognition processor that achieves 30 frame/sec throughput and sub-Watt power consumption for 720p video streams. A new visual attention-based object recognition algorithm is proposed that reduces the entire workload by more than 33% to relax the required computing power and on/off-chip bandwidth. It helps the vision processor overcome many of the above challenges in conventional SIFT implementations. In addition, a network-on-chip (NoC)-based heterogeneous multi-core architecture is proposed to obtain high computing power by utilizing the different instruction-level parallelism (ILP), DLP and thread-level parallelism (TLP) of multiple processing cores. Lastly, we integrate a dynamic resource management (DRM) technique into the heterogeneous multi-core processor to minimize the power consumption of the processor as well as to increase the utilization of on-chip bandwidth.
The rest of this paper is organized as follows. Section II describes the attention-based recognition algorithm and the highlights of the proposed architecture compared to related architectures for vision applications. Section III explains the system architecture of the processor. The detailed core implementations are covered in Section IV. Section V discusses the design and advantages of the DRM of the processor. The chip implementation and the system evaluation follow in Section VI. Finally, Section VII concludes this paper.
II. BACKGROUND

A. Algorithm
Tile-based object recognition has been adopted to increase system throughput by executing multiple recognition threads on different decomposed tiles in parallel. In addition, an attention operation was adopted to filter out meaningless tiles with no object features and to focus on the tiles containing object features as regions of interest (ROIs). It helps the SIFT implementation increase its throughput and reduce power consumption by dramatically reducing the number of tiles to process. The previous attention model [8], which exploited not only bottom-up conspicuity information but also the top-down object familiarity between the query and the target objects in the database (DB), obtained 35% background clutter tile rejection on average and a 30% processing speed increase without any recognition accuracy degradation for 640 × 480 image-based object recognition.
However, when dynamic noises, such as motion blur, illumination and occlusion, fade SIFT features away from captured images, the previous model suffers from severe attention accuracy degradation and fails to accommodate HD object recognition because of the high computing power demanded by the increased number of ROIs in an image. To solve this problem, we present a more accurate attention algorithm that further reduces the number of tiles to process and increases processing speed. Fig. 2 shows the proposed attention model for HD object recognition in a mobile vision system, named the context-aware visual attention model (CAVAM). In addition to the previous saliency and object familiarity, the CAVAM integrates temporal familiarity, which measures the temporal coherence of consecutive frames by tracking and predicting the recognized objects. That is, the familiarity map reflects not only spatial conspicuity but also temporal continuity, so that the obtained ROI can track the target object movement accurately irrespective of dynamic noises. For example, a 77% loss of the total SIFT features in an image due to dynamic noises causes a 3.3 times increase in the number of ROI tiles in the previous attention model, and requires additional computing power (GOPS) and on-chip bandwidth (GB/s). In CAVAM, however, the required computing power and on-chip bandwidth are reduced by 16% and 46%, respectively, thanks to 1.44 times higher attention accuracy on average for dynamic object recognition with HD video streams. Thus, the 4.8% extra on-chip bandwidth for temporal familiarity generation is negligible compared to the performance gain.
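The way the three attention cues could jointly select ROI tiles can be illustrated with a minimal sketch. It assumes that saliency, object familiarity and temporal familiarity are available as per-tile scores in [0, 1] and that they are combined by a normalized weighted sum followed by a threshold. The combination rule, the weights, the threshold and the function name `select_roi_tiles` are all hypothetical; the text above does not specify how CAVAM fuses the maps.

```python
def select_roi_tiles(saliency, object_fam, temporal_fam,
                     weights=(1.0, 1.0, 1.0), threshold=0.5):
    """Return indices of tiles whose combined attention score reaches the threshold.

    All three inputs are per-tile scores in [0, 1]. The weighted-sum fusion is an
    assumption made for illustration, not the fusion rule of the actual CAVAM.
    """
    w_s, w_o, w_t = weights
    total = w_s + w_o + w_t
    rois = []
    for i, (s, o, t) in enumerate(zip(saliency, object_fam, temporal_fam)):
        score = (w_s * s + w_o * o + w_t * t) / total
        if score >= threshold:
            rois.append(i)
    return rois


# Tile 2 has weak spatial cues (e.g. motion blur) but strong temporal familiarity.
saliency = [0.9, 0.1, 0.2]
object_fam = [0.8, 0.1, 0.1]
temporal_fam = [0.7, 0.0, 0.9]

spatial_heavy = select_roi_tiles(saliency, object_fam, temporal_fam)
temporal_heavy = select_roi_tiles(saliency, object_fam, temporal_fam,
                                  weights=(1.0, 1.0, 4.0))
```

With equal weights only tile 0 is kept, whereas a larger temporal weight also keeps the blurred tile 2, which is the behavior the temporal familiarity term is meant to provide.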
B. Related Works
The proposed object recognition processor has some characteristics of application-specific hardware accelerators such as IMAPCAR [6] and National Taiwan University's machine learning SoC [9]. In these systems, a highly parallel SIMD architecture and a high-bandwidth dual memory architecture are adopted to accelerate in-vehicle image recognition and the K-means clustering algorithm, respectively, restricting redundant data computations and memory accesses for their targeted applications. However, unlike those application-specific processors, the proposed processor is also optimized for general stream processing with application characteristics such as compute intensity, data parallelism, and producer-consumer locality [10], similar to CELL [11], GPUs [4] and the Storm-1 processor [7]. CELL includes data-parallel synergistic processing units, GPUs support many lightweight data-parallel threads, and the Storm-1 processor utilizes an optimized ALU and memory architecture for kernel and stream data processing with different ILP and DLP. In terms of its chip multiprocessor (CMP) or multi-core architecture, the proposed streaming architecture is more like the Intel 80-Tile processor [12] and Toshiba's eight-core media processor [13], which process streams as threads in different cores. These multi-core processors exploit a packet-switched NoC and pipeline-based/thread-based parallel execution schemes, respectively, for high computing power. While exploiting the high-performance technologies of these multi-core architectures, the proposed processor is much more power efficient because it uses fixed-point ALUs instead of floating-point ALUs, and 2-D direct memory access (DMA)-integrated NoC interfaces instead of a cache-based memory system that requires a power-hungry hierarchical memory architecture. Therefore, the processor can achieve higher computing power with lower power consumption than the multi-core processors [12], [13] and also than SIMD-based parallel machines such as IMAPCAR and Xetal-II [14].
The previous generation of our vision processor [8] was designed to realize real-time object recognition for VGA images by employing a heterogeneous multi-core architecture containing SIMD and multiple instruction multiple data (MIMD) processing units in parallel. However, since its performance does not scale with the number of processing cores because of the limited throughput and bandwidth of its top-level architecture, that processor is not capable of object recognition on HD 720p video streams. Thus, this chip adopts dual-threaded SIMD/MIMD core clusters and latency/power-optimized cores in addition to the NoC-based heterogeneous architecture, taking advantage of a new fine-grain object recognition pipeline. Together with the machine learning-based DRM, the high level of ILP, DLP and TLP in the processor realizes object recognition for HD 720p video streams with enhanced throughput and power efficiency for advanced mobile vision applications.
III. SYSTEM ARCHITECTURE
The proposed processor is based on a heterogeneous multi-core architecture for low-power real-time object recognition in a mobile vision system [15]. To achieve high computing power with low power consumption, the processor adopts several system-level technologies for high on-chip bandwidth with increased datapath utilization. It integrates a hierarchical NoC architecture with 83.3 GB/s aggregate on-chip bandwidth and heterogeneous processing cores with 342 GOPS computing power, measured in 8-bit integer operations, as required for 720p video-based object recognition.
Even though CAVAM reduces the computing power by 16% and the on-chip bandwidth by 36% for HD object recognition, it is still difficult to satisfy both requirements at the same time in a vision processor. Thus, the proposed processor employs the 5-stage task-level pipeline and simultaneous multithreading (SMT) [16] of ROI tile processing to increase the throughput of tile-based object recognition in CAVAM through the increased ILP and TLP of the multi-core architecture. Furthermore, it integrates different types of parallel processing cores in a NoC architecture to reduce the processing delay of each pipeline stage as well as of the visual attention stage in CAVAM, thereby achieving the high computing power and on-chip bandwidth required for HD object recognition. For the power efficiency of a mobile vision platform, a resource management technique is applied to increase hardware utilization and throughput and to reduce the power dissipation of idling cores. Using these technologies, the proposed mobile vision processor achieves 342 GOPS computing power and 640 GOPS/W power efficiency as well as the 83.3 GB/s on-chip bandwidth required for CAVAM-based object recognition on 720p HD video streams.
A. SoC Architecture
Fig. 3 shows the overall block diagram of the proposed SoC. It consists of 4 simultaneous multithreading feature extraction clusters (SFECs) for SIFT feature extraction, a feature matching processor (FMP), a machine learning engine (MLE), and a dynamic resource controller (DRC). In each SFEC, one dual-threaded vector processing element (DVPE), integrating one 16-lane data-parallel SIMD processing element, and four scalar processing elements (SPEs) are connected by a local NoC router. Once the 4 SFECs have carried out the feature extraction operation, the FMP performs the subsequent matching operation on the extracted SIFT feature descriptors, which requires external memory access with minimized latency. The MLE is designed to realize several functions required by CAVAM by using its reconfigurable architecture with minimized power dissipation.
The DRC is the main controller, which is responsible for the throughput and power consumption of the multi-core processor. It contains a task management unit (TMU) for software-managed workload allocation, a dynamic voltage and frequency scaling (DVFS) [18] controller for optimizing power dissipation in the SoC, a NoC controller for sustaining high NoC bandwidth, and an ARM10-based host RISC processor. With the help of the DVFS and NoC controllers, the TMU enables the maximum aggregate on-chip bandwidth of the object recognition processor to reach 83.3 GB/s.
B. 5-Stage Task-Level Pipeline
In order to realize the CAVAM for 720p HD video streams, the 5-stage task-level pipeline is proposed to accelerate the overall processing speed of the feature extraction and matching operations for 16 × 16 image tiles. The feature detection of the previous SIFT algorithm [8] is divided into 3 fine stages, namely, Gaussian filtering (GF), difference of Gaussian (DoG), and localization (LOC), which are the 3 main procedures of SIFT scale space generation and extrema detection. Along with the conventional feature description (FD) and feature matching (FM) stages, the 5-stage object recognition is implemented to increase ILP/TLP with high system utilization, as shown in Fig. 4(a). The DVPE performs GF, DoG and LOC with a different special function unit (SFU) for each, and the SPE and FMP perform FD and FM, respectively. Fig. 4(b) shows the operation diagram of the proposed pipeline. Because this pipeline increases the datapath utilization of the SFEC to 0.92, the overall processing throughput of the SIFT pipeline is increased by 1.67 times compared to the conventional 3-stage pipeline.
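Why splitting feature detection into GF, DoG and LOC raises throughput can be seen from a simple steady-state model, sketched below. It assumes that a task-level pipeline completes one tile per bottleneck-stage latency and ignores stalls and transfer time. The per-stage cycle counts are hypothetical values chosen only for illustration, so the resulting ratio is not the measured 1.67 times figure; the 200 MHz clock is the operating frequency quoted elsewhere in the paper.

```python
def pipeline_throughput(stage_cycles, clock_hz):
    """Tiles per second in steady state: one tile completes per bottleneck-stage latency."""
    return clock_hz / max(stage_cycles)


# Hypothetical per-tile cycle counts (illustrative only, not measured values).
GF, DOG, LOC, FD, FM = 3000, 2000, 3000, 4000, 3000
CLOCK_HZ = 200e6  # 200 MHz operating frequency

# Conventional 3-stage pipeline: feature detection is one monolithic stage.
three_stage = pipeline_throughput([GF + DOG + LOC, FD, FM], CLOCK_HZ)
# 5-stage pipeline: GF, DoG and LOC become separate stages.
five_stage = pipeline_throughput([GF, DOG, LOC, FD, FM], CLOCK_HZ)
speedup = five_stage / three_stage
```

In this toy setting the bottleneck moves from the 8000-cycle detection stage to the 4000-cycle FD stage, so the gain is bounded by how evenly the finer stages are balanced.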
C. SMT-Based SIFT Implementation
In the SFEC, SMT is adopted to process multiple ROI tiles at the same time and squeeze further throughput out of the object recognition pipeline. A conventional single-threaded SIMD core suffers from a low datapath utilization of less than 0.3, since only a small part of the ALUs in each lane of the SIMD core is activated for a decoded instruction. Thus, to minimize the waste of power and processing time, the SIMD unit is segmented into 3 different SFUs to support SMT operation, and the SFUs correspond to GF, DoG and LOC, respectively. Fortunately, the ROI tiles are totally independent of each other, so that it is possible to exploit the producer-consumer locality of tile-based SIFT processing. Fig. 5 shows the proposed SMT-based SIFT implementation using SFECs. When the visual attention operation determines the ROIs in the image, the 128-entry TMU software queues accumulate the 32-bit pointers of the image tiles, and the TMU, acting as a thread scheduler, fetches the image tiles from the external DRAM and dispatches them to the SFECs. The SFEC is designed to perform the 3-stage feature detection operation for a maximum of two threads simultaneously. The theoretical peak IPC of the SFEC datapath is 1.92, which can increase the throughput of the overall architecture by 1.43 times compared to a single-threaded SFEC. Thanks to the NoC and SMT, out-of-order execution of the ROI-based SIFT operation can be easily implemented without complex scheduling or a re-order buffer. With only 12% extra area overhead and 9.9 mW extra power consumption for a simple context switching controller, 16 general purpose registers, and 20 kB of extra memory, the SFEC achieves at least a 30% processing delay reduction for ROI tile processing with 100-1000 instructions in an inner loop, and a 2.8 times processing speed increase for sequences of 10-100 ROI tiles, by hiding the data fetch delay as well as by increasing the number of pipeline stages.
As a result, thanks to the dual-threaded processing of 8 ROI tiles and the fine-grain pipeline with the increased throughput of the 4 SFECs, the average performance of the proposed architecture reaches 47700 tiles/sec for the feature extraction of the SFECs and 62720 vectors/sec for the matching operation of the FMP.
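The role of the TMU as a thread scheduler can be sketched as follows. This is a minimal software model, assuming a single FIFO of tile pointers and a simple slot-filling policy over 4 SFECs with 2 hardware threads each. The function name `dispatch`, the round-based model and the overflow handling are hypothetical; only the queue depth, the SFEC count and the two-thread limit come from the text.

```python
from collections import deque

QUEUE_ENTRIES = 128    # TMU queue depth stated in the text
NUM_SFEC = 4           # number of SFECs
THREADS_PER_SFEC = 2   # each SFEC runs at most two threads


def dispatch(tile_pointers):
    """Assign queued ROI tile pointers to (sfec, thread) slots, one round at a time.

    Returns a list of rounds; each round maps (sfec, thread) -> tile pointer.
    The slot-filling order is an assumed policy, not the TMU's actual one.
    """
    if len(tile_pointers) > QUEUE_ENTRIES:
        raise ValueError("more tile pointers than queue entries")
    queue = deque(tile_pointers)
    rounds = []
    while queue:
        batch = {}
        for sfec in range(NUM_SFEC):
            for thread in range(THREADS_PER_SFEC):
                if queue:
                    batch[(sfec, thread)] = queue.popleft()
        rounds.append(batch)
    return rounds


rounds = dispatch(list(range(10)))  # ten ROI tiles, pointers 0..9
```

Each full round holds 8 tiles in flight, matching the dual-threaded processing of 8 ROI tiles across the 4 SFECs.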
IV. CORE ARCHITECTURE
A. Dual-Threaded Vector Processing Element
The detailed architecture of the DVPE is depicted in Fig. 6. It consists of 3 SFUs for a 16-lane data-parallel SIMD unit. With 2 different threads, the SIMD datapath utilization can be increased to 0.92 on average for SIFT feature detection. Since each thread, or image tile in tile-based recognition, is totally independent, data consistency problems and race conditions between the two threads can be eliminated by isolating their memory spaces. Therefore, the SFEC contains 2 × 16 kB DMEM and 4 kB IMEM for the different threads [...] speed-up for Gaussian filtering and a 15% throughput increase in the SFEC pipeline. With the help of the proposed technique, the DVPE obtains 1.46 IPC and 0.82 utilization, which are 1.87 times and 2.7 times improvements, respectively. As a result, the proposed DVPE can perform the feature extraction operation for an ROI tile within 15000 cycles, a 3.45 times improvement over the previous SIMD core [8].
B. Feature Matching Processor
Fig. 7 shows the hardware diagram of the cache-based FMP for keypoint matching with zero-less locality sensitive hashing (ZLSH). Since more than 80% of the matching delay is spent on external DB accesses for feature searching, minimizing the keypoint search regions in the DB is essential to increase the matching throughput as well as the overall recognition speed. To this end, the FMP has two keypoint matching mechanisms: the primary cache-based matching (CBM) and the secondary DB-based matching (DBM). The CBM searches for the nearest neighbor among the keypoints that were used in the last matching and stored in the keypoint cache. If the keypoint is matched, the matching ends with a 98% reduction of external accesses. If the keypoint is not matched, the additional DBM is performed with the ZLSH index to access the candidate keypoints in the DB, which still results in an 86% access reduction compared with brute-force matching.
The 16 kB inter-frame cache is implemented with a 4-way set associative structure in which each way holds 32 keypoint vectors and each 128-byte cache line holds one keypoint vector. The 32 kB vector memory contains the detected keypoint vectors of the ROIs, which are compared with the cached vectors through the 128-way SAD array for CBM. The 1024-bit wide 1 kB configuration SRAM is used as operand register files to reduce redundant memory accesses for higher throughput. The hashing accelerator and a 10 kB hash table are implemented to realize ZLSH, which minimizes the degree of uneven binning of the hashing for maximum reduction of external memory accesses, and achieves a 64% reduction in the largest bin size compared with previous locality sensitive hashing [19]. As a result, the FMP takes less than 3.2 µs to match a keypoint with the DB in the proposed task-level object recognition pipeline.
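The two-level matching flow (CBM first, DBM on a miss) can be sketched in software as below. This is only an illustration under stated assumptions: descriptors are short integer lists, similarity is the sum of absolute differences (SAD) with an acceptance bound, and the DB is indexed by a toy hash that takes one bit per selected dimension. The hash in `bucket_key` is a generic stand-in and is not ZLSH; the names `match_keypoint`, `nearest`, `max_sad` and the cache update policy are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two descriptor vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(query, candidates, max_sad):
    """Return the (label, vector) candidate with the smallest SAD within max_sad, else None."""
    best, best_d = None, None
    for label, vec in candidates:
        d = sad(query, vec)
        if d <= max_sad and (best_d is None or d < best_d):
            best, best_d = (label, vec), d
    return best


def bucket_key(vec, dims=(0, 1, 2, 3), split=128):
    # Hypothetical hash: one bit per selected dimension (a stand-in, not ZLSH).
    return tuple(int(vec[i] >= split) for i in dims)


def match_keypoint(query, cache, hash_table, max_sad=64):
    """Try cache-based matching first; fall back to the hashed DB bucket on a miss."""
    hit = nearest(query, cache, max_sad)
    if hit is not None:
        return hit[0], "CBM"
    hit = nearest(query, hash_table.get(bucket_key(query), []), max_sad)
    if hit is not None:
        cache.append(hit)  # keep the matched DB keypoint for the next lookup
        return hit[0], "DBM"
    return None, "miss"


db = [("tank", [200, 10, 10, 200]), ("car", [10, 200, 200, 10])]
hash_table = {}
for label, vec in db:
    hash_table.setdefault(bucket_key(vec), []).append((label, vec))

cache = []
first = match_keypoint([198, 12, 9, 201], cache, hash_table)
second = match_keypoint([199, 11, 10, 200], cache, hash_table)
third = match_keypoint([100, 100, 100, 100], cache, hash_table)
```

The first query misses the empty cache and is resolved through the hashed bucket, while the second, similar query is resolved from the cache alone, which mirrors how inter-frame coherence lets most lookups avoid external DB accesses.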
C. Machine Learning Engine
The architecture of the MLE is shown in Fig. 8(a). It is designed to accelerate CAVAM's Kalman filter operation and the DRC's reinforcement learning algorithm with different processing granularities to optimize power consumption. The architecture of the MLE is very similar to application-specific reconfigurable processors, such as ADRES [20], the Montium Tile processor [21] and the eXtreme Processing Platform (XPP) [22], which adopt coarse-grained architectures for multimedia and communication applications. They exhibit a high level of ILP and/or DLP but less control flow. The MLE also exploits data-parallel, compute-intensive operations by adopting the SIMD computation model. The ALU of the SIMD processing core is optimized for sub-word-level operations, including ADD, SUB, MUL and SHIFT, supporting one shared instruction set architecture.
In terms of its reconfigurable processing core architecture, the MLE is analogous to MorphoSys [23], which comprises reconfigurable cell arrays, a RISC control processor, context memory, a frame buffer and a DMA controller. Similarly, the MLE adopts a 4 × 4 reconfigurable processing element (PE) array, consisting of 16 reconfigurable processing elements (RPEs), a 32 kB kernel memory for parameter operands, and a light-weight control RISC processor for reconfiguration control and instruction fetch/decode. An RPE is composed of 4 processing elements, each with 8-bit resolution granularity, so that the parallelism of the MLE can be modified for different operations. The block diagram of the RPE is depicted in Fig. 8(b). The 4 processing elements can be reconfigured from 8-bit resolution pixel-level operators to 32-bit resolution complex sequential learning operators. Each processing element can propagate its result data to the next processing element as a new operand.
As a result, processing elements with different bit resolutions of 8/16/24/32 bits can be applied to different types of target algorithms with minimized power consumption. For example, in the two extreme cases, namely the configuration of 16 32-bit elements for a high-precision learning algorithm and the configuration of 64 8-bit elements for the pixel-level operation of saliency map generation, the power consumption varies from 33 mW to 123 mW, while at most 71% of the unnecessary power dissipation in unused registers and ALUs is eliminated compared to a SIMD core-based implementation. In addition, it takes only 4.4 ms, based on 12.5 giga 16-bit multiply-accumulates per second (GMACS), to run the reinforcement learning algorithm of the DRM.
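How four 8-bit processing elements might serve either as independent pixel-level operators or as one wider operator can be illustrated for the ADD operation. The sketch below assumes that the value propagated from one element to the next is the carry of a ripple addition; this is an assumption for illustration, and the function names `pe_add8`, `rpe_add` and `rpe_add_packed8` are hypothetical rather than taken from the RPE design.

```python
def pe_add8(a, b, carry_in=0):
    """One 8-bit processing element: returns (8-bit sum, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def rpe_add(a, b, width_bits):
    """Unsigned add on width_bits/8 chained 8-bit PEs (wraps modulo 2**width_bits)."""
    assert width_bits in (8, 16, 24, 32)
    result, carry = 0, 0
    for lane in range(width_bits // 8):
        byte_a = (a >> (8 * lane)) & 0xFF
        byte_b = (b >> (8 * lane)) & 0xFF
        s, carry = pe_add8(byte_a, byte_b, carry)  # carry ripples to the next PE
        result |= s << (8 * lane)
    return result


def rpe_add_packed8(a, b):
    """Four independent 8-bit adds packed in one 32-bit word (no carry between lanes)."""
    result = 0
    for lane in range(4):
        s, _ = pe_add8((a >> (8 * lane)) & 0xFF, (b >> (8 * lane)) & 0xFF)
        result |= s << (8 * lane)
    return result


wide = rpe_add(0x00FF00FF, 0x00010001, 32)       # one 32-bit operation
packed = rpe_add_packed8(0x00FF00FF, 0x00010001)  # four 8-bit operations
```

The same operands give different results in the two modes because the carry either crosses lane boundaries or is discarded. With 16 RPEs, the packed mode corresponds to the 64-element 8-bit configuration and the chained mode to the 16-element 32-bit configuration mentioned above.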
V. DYNAMIC RESOURCE MANAGEMENT
DRM [24], a well-known technique for adaptively tuning the hardware resources of multi-core processors or data centers, is employed to handle the workload allocation and the voltage and frequency configuration of the proposed architecture. The DRC carries out the DRM operation, using its sub-IPs as hardware resource controllers. The TMU allocates ROI tiles to the DVPEs and keypoints to the SPEs and the FMP to sustain maximum throughput by keeping the cores from going idle frequently. In step with the TMU's workload allocation, the DVFS controller and the NoC controller are configured according to the performance margin for the 30 frame/sec real-time requirement. Since the implemented processor is designed to satisfy the maximum workload scenario of actual use cases, a large power saving can be obtained by DRM through aggressive voltage and frequency scaling.
A. DRM Implementation
Because the configuration is carried out per thread, which is a 16 × 16 ROI tile-based SIFT operation, the DRM of the object recognition processor needs a response time of less than a few µs, or 10-500 cycles. Therefore, we adopt a hardware-implemented DRC with software programmability, which is about 100 times faster than the middleware-based approach [25]. Using the on-line learning ability of the MLE, the DRC can change the throughput and power characteristics of the multi-core system with precise workload prediction for better energy efficiency, as shown in Fig. 9. The DRC adjusts the power management configuration of the SFECs based on the amount of ROIs and the utilization per frame. Since the dynamic resource management can estimate the optimal state transition point by adopting reconfigurable thresholds for the ROI count and the utilization, the optimum energy and throughput management can be selected from three different configurations: C0: DVFS; C1: DVFS + multithreading (MT); C2: DVFS + MT + dynamic tile allocation (DTA). The DTA is discussed in a following sub-section on utilization control. Based on these configurations, the multi-core architecture can change its throughput and power efficiency to sustain 30 frame/sec with the lowest energy consumption. The control parameters, which are 2 state transition thresholds and 2 energy configuration ROI points, are updated by a Q-learning-based on-line learning operation [26] to keep the DRC misprediction rate below 2.2% and prevent severe performance degradation. As a result, an energy efficiency of 9.6 mJ/frame or 10.5 nJ/pixel can be obtained with 320 mW average power consumption.
B. NoC-Based DVFS Implementation
Fig. 10(a) shows the NoC-based DVFS implementation of the proposed processor. It has 6 different voltage-frequency islands (VFIs), including one global island for the top NoC router, the MLE and the DRC, and 5 different local islands for the SFECs and the FMP core. When a packet is transferred between two different cores, level shifting and synchronization have to be performed between the different power and clock frequency domains.
To reduce the implementation complexity and increase system robustness, a monolithic NoC router design is proposed that merges a level shifter and a synchronizing dual-clock FIFO at the TX and RX ports of the NoC. The DRC configures each local VFI within a 0.7 V-1.2 V VDD range and a 50 MHz-200 MHz operating frequency range by using the external switched-mode power supply IC and the PLL. For higher NoC throughput, the proposed switch can configure the priority ports of the weighted round robin arbiter based on the DRM configuration to reduce packet conflicts.
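The idea of scaling frequency against the 30 frame/sec performance margin can be sketched as a small selection routine. It assumes discrete frequency steps within the stated 50 MHz-200 MHz range, 15000 cycles per ROI tile (the DVPE figure from Section IV-A), and 8 tiles processed in parallel (4 SFECs with 2 threads each), and it ignores memory stalls and the later pipeline stages. The step values, the parallelism model and the name `pick_frequency_mhz` are hypothetical.

```python
import math

FRAME_DEADLINE_S = 1 / 30               # 30 frame/sec real-time requirement
FREQ_STEPS_MHZ = (50, 100, 150, 200)    # hypothetical steps in the 50-200 MHz range
CYCLES_PER_TILE = 15000                 # DVPE cycles per ROI tile (Section IV-A)
PARALLEL_TILES = 8                      # assumed: 4 SFECs x 2 threads


def pick_frequency_mhz(num_roi_tiles):
    """Return the lowest frequency step that finishes all ROI tiles within one frame.

    Returns None when even the highest step misses the deadline.
    """
    cycles = math.ceil(num_roi_tiles / PARALLEL_TILES) * CYCLES_PER_TILE
    for freq in FREQ_STEPS_MHZ:
        if cycles / (freq * 1e6) <= FRAME_DEADLINE_S:
            return freq
    return None


light = pick_frequency_mhz(50)      # lightly loaded frame
heavy = pick_frequency_mhz(1800)    # heavily loaded frame
overload = pick_frequency_mhz(4000)
```

Under these assumptions a lightly loaded frame can run at the lowest step, where the supply voltage can also be lowered, while a heavily loaded frame needs a higher step to keep the frame rate.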
The proposed NoC router architecture is shown in Fig. 10(b). An arbiter controls the cross fabric to connect input ports to output ports, and each port contains 38-bit-wide, 8-word-deep queues. The network interface supports 640 MB/s/port bandwidth at 200 MHz to all the switches in the respective directions. Like Philips' AEthereal NoC [27], the proposed compact NoC router is designed with simple packet switching, but it obtains high processing speed with an area of only 0.31 mm² for the 8 × 8 top switch. Thanks to the monolithic NoC routers, the hierarchical star-ring network can be easily implemented without extra IP or back-end support, supporting the multi-core DVFS system.
C. SFEC Utilization Control
To achieve the highest SFEC throughput for SIFT feature extraction along with the 5-stage task-level pipeline, the utilization of each processing core should be kept as high as possible. In order to increase the utilization of the total of 4 DVPEs and 16 SPEs in the 4 SFECs, the processor adopts the DTA based on the distance between the ROIs being processed.
Since the 8 neighboring tiles of an ROI tile are additionally required to perform the scale space generation, at most 6 of the 9 tiles processed for one ROI can be shared with the next ROI processing. Based on the ROI distance, which indicates the number of tiles that can be shared between two different ROI processing operations, the TMU performs the DTA of ROI tiles to the SFECs by utilizing the different sustained bandwidths between SFEC cores: the 3.2 GB/s internal data channel of the SFEC, the 512 MB/s/port local ring NoC, and the 426 MB/s/port top star NoC. Through the internal bus and the local NoC, the shared tiles can be rapidly transferred between two different ROI threads. As a result, the DTA increases the processing speed by 17% compared to conventional sequential thread allocation by reducing the top channel occupancy.
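The number of tiles two ROIs can share follows directly from the overlap of their 3 × 3 tile neighborhoods, which the following sketch computes. It assumes ROIs are identified by integer (column, row) tile coordinates; the function name `shared_tiles` and the `halo` parameter are hypothetical.

```python
def shared_tiles(roi_a, roi_b, halo=1):
    """Count tiles common to the (2*halo+1) x (2*halo+1) neighborhoods of two ROI tiles.

    ROIs are given as (column, row) tile coordinates.
    """
    span = 2 * halo + 1
    dx = abs(roi_a[0] - roi_b[0])
    dy = abs(roi_a[1] - roi_b[1])
    return max(0, span - dx) * max(0, span - dy)


adjacent = shared_tiles((5, 5), (6, 5))     # horizontal neighbors
diagonal = shared_tiles((5, 5), (6, 6))     # diagonal neighbors
two_apart = shared_tiles((5, 5), (7, 5))    # one tile between them
far = shared_tiles((5, 5), (8, 5))          # neighborhoods no longer overlap
```

Directly adjacent ROIs share 6 of their 9 tiles, which is the maximum for two distinct ROIs and matches the figure above. A scheduler could place ROI pairs with a large overlap on the same SFEC so that the shared tiles travel over the fast internal channel instead of the top NoC.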
D. On-Line Learning-Based Energy Control
With the help of the MLE, the DRC can perform on-line learning-based dynamic resource control to minimize the power consumption for the different amounts of tasks in each frame. After the attention process decides the number of ROIs in the image, the DRC can measure the timing margin required to process all ROIs, so that it configures the thread allocation and DVFS strategy based on the DRM policy. The MLE performs the on-line learning operation on the varying patterns of the current task and the hardware utilization of the processing cores to estimate the optimized hardware resources and power margin for the current frame.
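A threshold-based choice among the three configurations C0, C1 and C2 could look like the sketch below. The mapping from ROI count and utilization to a configuration is an assumption made for illustration: the paper names the two monitored quantities and the reconfigurable thresholds but does not spell out the decision rule, so `select_config` and its parameters are hypothetical.

```python
def select_config(num_rois, utilization, roi_threshold, util_threshold):
    """Pick a DRM configuration from the ROI count and the utilization per frame.

    Assumed policy (not taken from the paper):
      C0: DVFS only              when the ROI workload is light
      C1: DVFS + MT              when the workload is heavy and utilization is high
      C2: DVFS + MT + DTA        when the workload is heavy and utilization is low
    """
    if num_rois < roi_threshold:
        return "C0"
    if utilization >= util_threshold:
        return "C1"
    return "C2"


light_frame = select_config(num_rois=50, utilization=0.9,
                            roi_threshold=400, util_threshold=0.8)
busy_frame = select_config(num_rois=1200, utilization=0.9,
                           roi_threshold=400, util_threshold=0.8)
starved_frame = select_config(num_rois=1200, utilization=0.5,
                              roi_threshold=400, util_threshold=0.8)
```

Keeping the thresholds as arguments reflects that they are the control parameters an on-line learner would adjust from frame to frame.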
The measurement result of the DRC operation for 30 continuous HD video frames is shown in Fig. 11. The number of ROIs per frame fluctuates between 50 and 1800, which can cause performance degradation or wasted power. Therefore, the MLE updates the energy configuration point (ECP) based on the optimized throughput and power consumption for the task. For the given test frames, it updates the two thresholds and the two ECPs of the DRM policy as control parameters through on-line learning, with the ROI variation as the monitored parameter. Then, the power and throughput are updated to provide optimized system performance based on the four parameters of the DRM. For the test video scene, the updated control parameters enable the processor to achieve 279 mW average power consumption and 14341 tiles/sec throughput. As a result, the DRC obtains a 1.3 times higher frame rate and a 66% energy reduction compared to static allocation.
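The on-line learning step can be illustrated with the standard tabular Q-learning update. The sketch below uses hypothetical states (a coarse workload level), actions (the three DRM configurations) and rewards; how the actual DRC encodes states, actions and rewards, and its learning rate and discount factor, are not given in the text.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action).

    q maps (state, action) -> value; missing entries are treated as 0.0.
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


ACTIONS = ("C0", "C1", "C2")   # hypothetical action set: the DRM configurations
q_table = {}

# Hypothetical reward of 1.0: the frame met its deadline with low energy.
first = q_update(q_table, "low_load", "C0", 1.0, "low_load", ACTIONS)
second = q_update(q_table, "low_load", "C0", 1.0, "low_load", ACTIONS)
```

Repeated rewards for the same state-action pair move its value toward the discounted return, so a configuration that keeps meeting the deadline at low energy becomes the preferred choice for that workload level.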
VI. IMPLEMENTATION AND EVALUATION
A. Chip Summary
The proposed chip in Fig. 12 is implemented in a 0.13 µm CMOS process and occupies 32 mm² with a gate count of 2.4 M NAND2 gates and 382 kB of on-chip SRAM. Table I summarizes the chip specification. A total 31-IP multi-core processor consumes [...]
Table II lists the comparison with four vision processors that target similar vision applications to this work. Compared with these four architectures, namely, a CMOS sensor-integrated camera chip [28], a massively parallel image processor [29] and our previous works [30], [8], this work improves power efficiency (GOPS/W) by at least 51.5%, 14.8%, 54.6% and 49.3%, respectively. Thanks to the 5-stage fine-grain pipeline and the SMT-enabled multi-core architecture, this chip obtains 342 GOPS, which is 1.5 times higher than our latest work [8], even with an 18% lower gate count. In addition, the DRM-based DVFS enhances the energy efficiency of the multi-core processor and enables the chip to obtain 640 GOPS/W. As a result, for object recognition applications, the obtained 10.5 nJ/pixel energy dissipation is the lowest reported and 2.54 times lower than that of the state-of-the-art object recognition processors.
Table III shows the computing power (GOPS) and power consumption breakdown of the proposed processor together with the gate count. The total peak performance amounts to 342 GOPS when 534 mW is dissipated. The 4 SFECs, the largest component of the processor with 1.88 M gates, account for the highest share of 112 GOPS and consume 236 mW, 44% of the total power consumption. The FMP performs 98 GOPS with 86 mW power consumption
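The quoted power efficiency and the SFEC power share can be checked directly from the peak figures in this section. The short calculation below uses only numbers stated above; the variable names are illustrative.

```python
PEAK_GOPS = 342          # total peak performance
PEAK_POWER_W = 0.534     # power dissipated at peak performance
SFEC_POWER_W = 0.236     # power consumed by the 4 SFECs

gops_per_watt = PEAK_GOPS / PEAK_POWER_W
sfec_power_share = SFEC_POWER_W / PEAK_POWER_W
```

The ratio comes out at roughly 640 GOPS/W and the SFEC share at roughly 44%, consistent with the values reported in the text.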
B. System Evaluation
The fabricated chip is integrated with an application multimedia board and applied to an unmanned aerial vehicle (UAV) system, as shown in Fig. 13. The chip is evaluated in a real demonstration system for 30 frame/sec 720p video streams from the UAV. The proposed object recognition processor, named BONE-V5, is integrated with the Texas Instruments OMAP4430-based multimedia board by using an FPGA extension board. Video processing, such as capturing and displaying images, is carried out by the Linux operating system running on the OMAP on the main board. The software system of BONE-V5 includes custom compilers for converting ANSI C-based SIMD and MIMD core programs into separate assembly codes, and a linker for merging them with custom assembly code for the kernel functions of vision applications. The generated assembly of each core is managed by the control program of the TMU, which is built with the same compiler. In contrast, the host RISC program uses the ARM compiler separately because of its different architecture. The communication between the two processors is conducted through the NoC interface in the FPGA, and the processor can access the 128 MB DDR2 SDRAM and the 4 MB SRAM on the extension board through the FPGA. The recognition accuracy measured in terms of the true positive rate is approximately 98.2%, with a false positive rate of less than 1.1%, for 22 different target objects such as toy tanks, cars and building miniatures. The recognition accuracy of the CAVAM is equal to that of the SIFT implementation without the attention operation, which obtains the theoretical maximum accuracy, while the processing time is reduced by more than 40%. With the help of CAVAM, this processor obtains 73% attention accuracy, which is 1.44 times higher than that of the previous object recognition processor [8]. As a result, the CAVAM-based processor can provide high-quality recognition performance for 720p HD video applications with low power consumption.
VII. CONCLUSION
In this paper, we present a real-time object recognition processor for HD 720p video streams in a mobile vision system. The context-aware visual attention model is proposed to reduce the on-chip bandwidth of HD video-based object recognition by at least 46%. Along with the proposed 5-stage task-level pipeline for SIFT-based object recognition, the heterogeneous multi-core processor employs simultaneous multithreading clusters for feature extraction and a latency-optimized matching processor for feature matching, and achieves throughputs of 47700 tiles/sec and 62720 vectors/sec, respectively. With the help of the machine learning engine, the dynamic resource controller increases the system utilization and the power efficiency at the same time. As a result, the fabricated SoC achieves 30 frame/sec dynamic object recognition for a UAV with 720p video streams while dissipating 320 mW on average at 200 MHz, achieving 2.54 times higher energy efficiency, at 10.5 nJ/pixel, than the state-of-the-art vision processors.
