I. INTRODUCTION
THE significance of mining and recognition in the next era of Tera [1] has received more and more attention in recent years. As a subfield in artificial intelligence, machine learning [2] provides a series of algorithms for mining and recognition, such as supervised learning algorithms, unsupervised learning algorithms, and reinforcement learning algorithms. These algorithms are widely employed in different applications in multimedia content analysis, including face detection [3], color image segmentation [4], and content-based image retrieval [5]. The repetitive operations for high-dimensional vector processing result in laborious computations for machine learning algorithms, so it is difficult to meet the real-time requirement by using traditional processors. Many hardware architectures and design methodologies for machine learning algorithms, such as Gaussian mixture model-based classification [6] and K-means clustering [7], [8], are proposed to accelerate the computational speed, but the hardware integration of different kinds of algorithms is still an open question. At the same time, the development of high-resolution CMOS image sensors [9] has introduced the requirements for high-performance video processing, which is crucial to satisfy the real-time requirement in mobile systems. Different kinds of large-kernel image processing operations, such as the median filter, the sharpening filter, the bilateral filter [10], and the Gaussian filter, are essential to video noise reduction and video quality enhancement. Therefore, it is necessary to develop suitable architectures and platforms for the coacceleration of machine learning and image processing tasks.
Due to the rapid growth of consumer electronics and advances of semiconductor technology, mobile devices, such as digital cameras, portable computers, and cellular phones, are equipped with various kinds of high-performance processors. To analyze the content of image and video data, many kinds of VLSI architectures are proposed. Abbo et al. propose a massively parallel processor for video scene analysis [11], Kim et al. propose a processor with a visual attention engine to recognize objects [12], and Cheng et al. propose an SoC which combines a processor with a CMOS sensor [13]. Nevertheless, these processors do not have suitable architectures to accelerate machine learning algorithms for multimedia content analysis. Different from the previously reported works [11]-[13], a high-performance machine learning SoC (MLSoC) for multimedia content analysis is proposed. It focuses on the coacceleration of computer vision and machine learning algorithms, and the image stream processor (ISP) and the feature stream processor (FSP) are integrated into the dual stream processor (DSP). Both processors, the ISP and the FSP, are established on the 256-bit local media bus (LMB), which is directly connected to the high-bandwidth dual memory (HBDM). The HBDM offers the DSP the instant access of video data and feature data, and the hardware architecture can achieve a maximum throughput of 62.5 Gpixel/cycle for image processing operations and 16 vector/cycle for machine learning algorithms.
This paper is organized as follows. The proposed SoC architecture is first described in Section II. Then, the architectures of the ISP and the FSP are introduced in Sections III and IV, respectively. Next, the VLSI implementation of the proposed work is shown in Section V. Finally, a short summary is given in Section VI.
0018-9200/$26.00 © 2010 IEEE
II. MLSOC ARCHITECTURE
Fig. 1 shows the system diagram of the MLSoC, which contains a complete platform for multimedia content analysis. There are two AMBA AHBs [14] of different bandwidth, and the data in the two buses can be exchanged through the AHB bus bridge. Each silicon intellectual property (SIP) on two AHBs can access the external DDR memory through the multichannel memory controller. To fully utilize the bus bandwidth, the RISC is connected to the 32-bit AHB, whereas the DSP is connected to the 128-bit AHB. The system and application tasks are executed on the RISC, and the DSP receives instructions from the RISC through the LMB-AHB interface, which is connected to the DMA controller to efficiently access video or feature data from the external DDR memory.
The architecture of the DSP is shown in Fig. 2. After receiving the instructions, the control unit in the DSP decodes them to coordinate the operations of the two stream processors. In the DSP architecture, the data are accessed through the LMB, which is connected to the 32-bank HBDM. Half of the HBDM is sufficient to store an image of 160 × 120 pixels, which can be obtained from the video input interface by down-sampling or slicing the image data. The purpose of the HBDM is to store the image data for the ISP and the feature data for the FSP, and the data in one memory of the HBDM can be fully copied to the other in 2048 cycles, where one cycle is defined as the inverse of the clock frequency of the system in this paper. The rapid data access accelerates the computation for multimedia content analysis and reduces the power consumption of data transmission between the RISC and the DSP because most of the operations are performed inside the DSP. Besides, the maximum input bandwidth of a subprocessor inside the DSP reaches 2048 bit/cycle.
III. IMAGE STREAM PROCESSOR
Large kernel operations for pixels are significant for image processing tasks in multimedia content analysis. For example, the Gabor transform [15] can be used for image texture extraction, and the Gaussian filter can be applied to the detection of scale-space extrema for feature points [16]. The image stream processor (ISP) employs massively parallel processing elements to process image or video data, and the input interface transforms the input bandwidth of image pixels from 16 pixel/cycle to 256 pixel/cycle inside the ISP to accomplish window-based operations. In other words, a maximum of 16 × 16 pixels in a window can be processed in the same cycle based on the ISP architecture. Since the window-based operations are processed in raster-scan order, the "Pixel Stream Memory," which is connected to the input interface in the ISP, can instantly offer the subprocessors the pixels of the current window by receiving 16 new pixels from the HBDM and discarding 16 old pixels of the previous window. Therefore, the ISP can help achieve tera-scale performance even though the bandwidth of the LMB is only 256 bit.
The ISP consists essentially of an arbiter, two subprocessors, and a set of shared memories. The arbiter controls the behavior of each processor according to the ISP instructions, and the processor not in use is automatically set to be inactive to save power. The set of shared memories, which includes the "Pixel Stream Memory" and the "Kernel Stream Memory," is used to store the pixel data and the kernel data for image processing tasks. The two subprocessors are the linear processor and the order processor, both of which are able to handle 256 pixel streams simultaneously, and the bit length of each pixel stream is 8 bit. These two subprocessors have the same output bandwidth, 1 pixel/cycle, and they deal with parallel data-in, scalar data-out image processing tasks.
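The sliding behavior of the "Pixel Stream Memory" described above can be illustrated with a small software model. This is only a sketch under stated assumptions: it assumes that the 16 new pixels form one vertical column of the window and that the 16 discarded pixels are the oldest column. The names `PixelStreamWindow` and `stream_windows` are hypothetical and do not come from the paper.

```python
from collections import deque

WINDOW = 16  # window width and height in pixels, as in the text


class PixelStreamWindow:
    """Toy model of a 16x16 window fed with one 16-pixel column per cycle."""

    def __init__(self):
        # Each entry is one column of WINDOW pixels. Once WINDOW columns are
        # held, appending a new column drops the oldest one automatically.
        self.columns = deque(maxlen=WINDOW)

    def push_column(self, column):
        if len(column) != WINDOW:
            raise ValueError("expected 16 pixels per cycle")
        self.columns.append(list(column))

    def is_full(self):
        return len(self.columns) == WINDOW

    def window(self):
        """Return the current window as a list of WINDOW rows."""
        return [[col[r] for col in self.columns] for r in range(WINDOW)]


def stream_windows(image, top):
    """Yield every 16x16 window of the 16-row band starting at row `top`."""
    buf = PixelStreamWindow()
    width = len(image[0])
    for x in range(width):
        # One "cycle": 16 new pixels arrive, 16 old pixels leave.
        buf.push_column([image[top + r][x] for r in range(WINDOW)])
        if buf.is_full():
            yield buf.window()
```

After the first 16 columns have been loaded, every further column of 16 pixels produces a complete 256-pixel window, which is how a 16 pixel/cycle input can sustain a 256 pixel/cycle supply to the subprocessors.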
In order to fetch the input pixel stream from the HBDM, a new memory architecture is developed for window-based operations in image processing tasks. The memory utilization scheme is illustrated in Fig. 3, where two examples are shown. The input image, whose width is denoted by W, is first partitioned into different slices, each of which includes W × 16 pixels. These slices are stored in one memory of the HBDM and sequentially arranged. When the ISP demands pixels in a 16 × 16-pixel window, the HBDM sends 16 pixels simultaneously to the ISP in one cycle, and the memory bandwidth of the HBDM can be fully utilized. For example, in Fig. 3, the pixels in Window 1 occupy Slice 2 and Slice 3, and the pixels in Window 2 occupy Slice N. No matter what the window position is, the pixels required by the ISP never occupy the same bank in the HBDM. Based on the memory architecture of the HBDM, the traditional line buffer memory [17], which is used for window-based operations, can be eliminated to reduce hardware costs.
A. Linear Processor
Fig. 4(a) shows the architecture of the linear processor, which handles linear operations for image processing. There are four levels of configuration network in this processor, and the amount of data is reduced level by level. The processing elements in the first level can deal with multiplications of a maximum of 256 parallel pixels and their corresponding coefficients from the window in the image, and the results are sent to the next level through the configuration network, which contains a set of context registers to manipulate the stream of pixels in each level. In the second level, the ALU trees handle subtraction and addition operations based on the results of multiplications in the first level. The data are then collected in the next level through the configuration network.
In the third level, dedicated accelerators for face detection, pixel variance calculation, and correlation coefficients are integrated into this processor to enhance the functionalities for video analysis. Based on the statistical analysis, these dedicated functionalities are frequently used in the image processing tasks for multimedia content analysis, and the high-throughput divider is used to compute one division operation per cycle. Then, the ALU in the fourth level performs simple instructions, such as additions and subtractions, for the final output data. General image linear operations, including the Laplacian filter, the low-pass filter, the Gaussian filter, the Gabor transform, and the 16 × 16-pixel convolution, can be executed in one cycle with 40 cycles of latency.
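The four-level data flow of the linear processor described above can be mimicked in software for one output pixel. The following is a minimal sketch, not the actual hardware behavior: the function name `linear_window_op`, the `divisor` parameter, and the final saturation to 8 bits are assumptions made for illustration, and the dedicated third-level accelerators are reduced to a single integer division.

```python
def linear_window_op(window, kernel, divisor=1):
    """Apply one linear window operation and return a single output pixel."""
    # Level 1: multiply each pixel by its kernel coefficient (up to 256 products).
    products = [p * k
                for pixel_row, kernel_row in zip(window, kernel)
                for p, k in zip(pixel_row, kernel_row)]

    # Level 2: reduce the products pairwise, as an adder tree would.
    level = products
    while len(level) > 1:
        if len(level) % 2:
            level = level + [0]  # pad odd-sized levels
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    total = level[0]

    # Level 3: one division, standing in for the high-throughput divider.
    quotient = total // divisor

    # Level 4: final ALU step; saturating to 8 bits is an assumption here.
    return max(0, min(255, quotient))
```

With a kernel of all ones and `divisor=256`, the sketch behaves as a 16 × 16 box (low-pass) filter. Other linear filters only change the coefficients, which matches the parallel data-in, scalar data-out pattern of the subprocessor.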
In this architecture, the pixels can be repetitively processed by the processing elements in different levels of the configuration network. Different from traditional processors, the equivalent input bandwidth of the ISP is computed as the sum of the input bandwidths of all pipeline stages. The bandwidth is distributed to the processing elements by the configuration network in four levels as shown in Fig. 4(a), and the performance of the linear processor can achieve 0.67 TOPS (Input Bandwidth = 0.31 TB/s + 0.98 TB/s + 0.04 TB/s = 1.33 TB/s) when the clock frequency is 300 MHz. This performance also results from the special design of the processor architecture, where the processing elements are cascaded in each level. Note that in this paper, 8-bit operations are considered for the performance evaluation.
B. Order Processor
Fig. 4(b) shows the architecture of the order processor, which deals with sorting operations for image processing. The sorting procedure [18], [19] is implemented by a set of reconfigurable hardware, which contains eight-stage processing elements for bit-wise operations and a set of multiplexers to compute the rank order of a set of parallel pixels. Each stage of processing elements contains 256 parallel bit-logic modules, an adder tree with nine layers, and a comparator to compare the results of summation between stages. The set of multiplexers is 16-stage pipelined to shorten the critical path for the real-time processing requirement. Common nonlinear image operations, such as the morphological filter, the 16 × 16-pixel median filter, and the arbitrary kernel median filter, can all be executed in one cycle with 40 cycles of latency.
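The eight bit-level stages of the order processor suggest a bit-serial rank-order selection, in which one output bit is resolved per stage from the most significant bit downward. The sketch below shows one common software formulation of that idea. It is an assumption about the general technique rather than the circuit of [18], [19], and the function name `rank_order_select` is hypothetical. Counting the candidates whose current bit is zero plays the role of the adder tree, and the comparison with the remaining rank plays the role of the comparator.

```python
def rank_order_select(pixels, rank, bits=8):
    """Return the pixel value of the given rank (0 = smallest)."""
    candidates = list(range(len(pixels)))  # indices still in the running
    remaining_rank = rank                  # rank among the current candidates
    result = 0
    for b in reversed(range(bits)):        # one stage per bit, MSB first
        zeros = [i for i in candidates if not (pixels[i] >> b) & 1]
        ones = [i for i in candidates if (pixels[i] >> b) & 1]
        if remaining_rank < len(zeros):
            # The wanted value has a 0 in this bit position.
            candidates = zeros
        else:
            # The wanted value has a 1; skip all candidates with a 0.
            remaining_rank -= len(zeros)
            candidates = ones
            result |= 1 << b
    return result
```

For a 16 × 16 window, `rank_order_select(window_pixels, 128)` gives the upper median of the 256 pixels, while `rank=0` and `rank=255` give the minimum and maximum used by morphological erosion and dilation.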
Similar to the linear processor, the configuration network of the order processor is responsible for the bandwidth allocation to eight "Bit-Level Processing Elements." Based on the configuration network, a total of 0.61 TB/s of bandwidth can be supplied to the processing elements at 300 MHz. The processing elements are also cascaded in each level to simultaneously process the input pixel streams, so the equivalent input bandwidth to the processing elements in the subprocessor is much higher than the total bandwidth of the input pixel streams and kernel streams (Input Bandwidth = 76.8 GB/s + 76.8 GB/s = 0.15 TB/s).
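The bandwidth figures quoted above can be reproduced with simple arithmetic. The following sketch shows one reading that matches the numbers: the assumptions that the input consists of one pixel stream plus one kernel stream, and that each of the eight stages receives a full 256-pixel stream, are interpretations and are not stated explicitly in the text.

```python
CLOCK_HZ = 300_000_000      # 300 MHz, from the text
PIXELS_PER_CYCLE = 256      # parallel pixel streams, from the text
BYTES_PER_PIXEL = 1         # 8-bit pixel streams, from the text
BIT_LEVEL_STAGES = 8        # "Bit-Level Processing Elements", from the text

# One 256-pixel stream (pixels or kernel) at 300 MHz, in bytes per second.
stream_bw = PIXELS_PER_CYCLE * BYTES_PER_PIXEL * CLOCK_HZ

# Assumption: the input bandwidth is one pixel stream plus one kernel stream.
input_bw = 2 * stream_bw

# Assumption: each of the eight cascaded stages sees a full 256-pixel stream.
equivalent_bw = BIT_LEVEL_STAGES * stream_bw

print(f"per stream : {stream_bw / 1e9:.1f} GB/s")
print(f"input      : {input_bw / 1e12:.2f} TB/s")
print(f"equivalent : {equivalent_bw / 1e12:.2f} TB/s")
```

Under these assumptions the equivalent bandwidth is four times the external input bandwidth, which is the effect of cascading the processing elements.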
These two subprocessors, the linear processor and the order processor, enable the ISP to handle most of the common operations used in the image processing tasks of computer vision algorithms, such as pre-processing and filtering. The performance is listed in Table I. Although the proposed architecture is designed to handle the operations with a maximum of 16 × 16 pixels in a window, it can be used for other window sizes which are smaller than 16 × 16 pixels as well. As long as the AHB resources are available, the performance of window-based operations from 3 × 3 pixels to 16 × 16 pixels listed in Table I can be higher than 100 fps for an HDTV image (1920 × 1080 pixels) when the clock frequency is 300 MHz.
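The frame-rate claim can be checked against the stated output bandwidth of 1 pixel/cycle. The sketch below is a back-of-the-envelope estimate only: it assumes a fully pipelined stream with no stalls and ignores the pipeline latency and any border handling. The helper name `frames_per_second` is hypothetical.

```python
def frames_per_second(width, height, clock_hz=300_000_000, pixels_per_cycle=1):
    """Upper bound on the frame rate for a given output rate in pixel/cycle."""
    return clock_hz * pixels_per_cycle / (width * height)


hdtv_fps = frames_per_second(1920, 1080)
print(f"HDTV upper bound: {hdtv_fps:.1f} fps")
```

Because the window size does not affect the output rate in this model, the same bound applies to every window size from 3 × 3 to 16 × 16 pixels, which is consistent with the statement that all of them exceed 100 fps.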
IV. FEATURE STREAM PROCESSOR (FSP)
The FSP, whose architecture is shown in Fig. 5 , is intended to handle feature vectors extracted for multimedia content analysis, and it contains two subprocessors for machine learning algorithms: the supervised learning processor and the unsupervised learning processor. Although these two processors support different algorithms, the common property is that they both need a large amount of data bandwidth, and the supported algorithms are suitable for stream processing because of the regular data access pattern. Therefore, the two subprocessors are directly connected to the LMB to deal with the feature data in the HBDM.
Both of the subprocessors are able to compute either the Manhattan distance or the Euclidean distance by using the same hardware resources, and the distance measurement is selected according to the instructions. The supervised learning processor can handle the K-nearest neighbor algorithm, which is frequently employed in information retrieval applications, and it contains a set of parallel processing elements that is able to handle feature vectors with a maximum of 128 dimensions. It also employs the automatic sorting mechanism for distance ranking, and the results can be immediately dumped out without extra sorting stages. The architecture of the automatic sorting mechanism is shown in Fig. 6, which contains a total of 128 parallel processing elements to compute the ranking of distances. In addition, to fully utilize the bandwidth of the LMB, the unsupervised learning processor employs the bandwidth adaptive mechanism [20], which allocates different hardware resources for the K-means clustering algorithm according to the number of feature vector dimensions. A simplified version of the architecture of the bandwidth adaptive mechanism is illustrated in Fig. 7, where four-parallel 1-D vectors are simultaneously processed in the three layers of different sets of processing elements in Mode 1, and two-parallel 2-D vectors and one-parallel 4-D vectors can be processed in Mode 2 and Mode 3, respectively. The actual unsupervised learning processor is much more complicated than Fig. 7 and focuses on the bandwidth adaptive mechanism for 16-D vectors, and the HBDM enhances the efficiency of the data access for the iteration process of K-means clustering. The performance of some FSP single operations for the DSP is listed in Table II, and the throughput of vectors can be adjusted to the bandwidth to obtain the optimal efficiency. (Input Bandwidth 0.33 TB/s) to process two 256-D vector streams in parallel.
Therefore, the performance of this processor can achieve 0.35 TOPS (Input Bandwidth = 0.38 TB/s + 0.33 TB/s) when the clock frequency is 300 MHz.
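The combination of a selectable distance measure and an automatic sorting mechanism in the supervised learning processor can be sketched in software as a K-nearest neighbor search that keeps its result list sorted at all times. This is only an illustration under stated assumptions: the function names are hypothetical, the sorted insertion uses the Python `bisect` module instead of the 128 parallel processing elements of Fig. 6, and the squared Euclidean distance is used because it yields the same ranking as the Euclidean distance.

```python
import bisect


def manhattan(a, b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Squared Euclidean distance; it preserves the Euclidean ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(query, references, k, distance=manhattan):
    """Return the k nearest references as sorted (distance, index) pairs."""
    ranked = []  # kept sorted after every insertion
    for index, ref in enumerate(references):
        bisect.insort(ranked, (distance(query, ref), index))
        if len(ranked) > k:
            ranked.pop()  # drop the current farthest entry
    return ranked
```

Since the list is sorted after every insertion, the ranking is available as soon as the last reference vector has been streamed in, which mirrors the statement that the results can be dumped out without extra sorting stages.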
V. VLSI IMPLEMENTATION
The proposed MLSoC is implemented on a 16-mm² die using TSMC 90-nm 1P9M process. The maximum operating frequency of the proposed MLSoC is 300 MHz, and the total on-chip memory is 79 KB. Table III shows the chip specifications, and the chip micrograph is shown in Fig. 8. The peak performance of the MLSoC achieves 1.3 TOPS (Input Bandwidth = 1.33 TB/s + 0.61 TB/s + 0.38 TB/s + 0.33 TB/s), and the input bandwidth inside these processors can achieve more than 2.6 TB/s while one operation uses 2-Byte pixels for image data or 2-Byte vectors for feature data. The power efficiency and the area efficiency of this work are compared with the previously reported works [11]-[13] in Fig. 9(a). The proposed work achieves 1.7 TOPS/W in the power efficiency and 81.3 GOPS/mm² in the area efficiency, both of which are the highest among the four works.
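The relation between the quoted bandwidths, the peak performance, and the area efficiency can be verified with a few lines of arithmetic. The sketch below only uses numbers given in the text. The assumption is that the peak performance is obtained by dividing the total input bandwidth by the 2 bytes consumed per operation; the variable names are illustrative.

```python
# Input bandwidths in TB/s as quoted in the text: linear processor,
# order processor, and the two FSP subprocessors.
bandwidths_tb_s = [1.33, 0.61, 0.38, 0.33]
BYTES_PER_OPERATION = 2   # "one operation uses 2-Byte pixels ... or vectors"
DIE_AREA_MM2 = 16         # from the text
QUOTED_PEAK_TOPS = 1.3    # from the text

total_bw = sum(bandwidths_tb_s)                  # TB/s
estimated_tops = total_bw / BYTES_PER_OPERATION  # tera-operations per second

# Area efficiency derived from the quoted (rounded) peak performance.
area_efficiency = QUOTED_PEAK_TOPS * 1000 / DIE_AREA_MM2  # GOPS/mm^2

print(f"total input bandwidth : {total_bw:.2f} TB/s")
print(f"estimated peak        : {estimated_tops:.2f} TOPS")
print(f"area efficiency       : {area_efficiency:.2f} GOPS/mm^2")
```

The estimate agrees with the quoted 1.3 TOPS after rounding, and dividing that figure by the die area reproduces the quoted 81.3 GOPS/mm² to within rounding.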
The ISP in the MLSoC is compared with the related work, the coarse-grained reconfigurable image stream processor (CRISP) [17], and the result is shown in Fig. 9(b), which shows that the proposed ISP uses 4.21 times the logic gate count to handle more than 11 times the input pixel bandwidth and to support 10.24 times the maximum window size. Fig. 10 shows the flowchart of an algorithm example using the proposed MLSoC. The image stream (160 × 120 pixels) is first loaded to the HBDM through the LMB-AHB interface. The ISP executes the instructions for "median filter" and sends the processed image stream to the HBDM. The FSP regards the data stored in the HBDM as the feature stream and executes K-means clustering by using the bandwidth adaptive mechanism. When the input vectors are 1-D, a speedup of 16 times can be achieved. The clustered feature vectors are stored to the HBDM again, and the process of image segmentation is completed. Then the data can be transferred to the external memory by the DMA controller. The total time to compute the algorithm example in the DSP, where 32 iterations are performed in K-means, is less than 0.5 ms. Other complicated examples, such as image retrieval and object recognition, can also be accomplished in the proposed MLSoC with the aid of the RISC and the external DDR memory.
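The algorithm example of Fig. 10 can be reproduced functionally in software as a median filter followed by K-means clustering of 1-D pixel values. The sketch below is a plain sequential reference under stated assumptions: the 3 × 3 window, the use of the lower median at image borders, the initial centers, and all function names are choices made for illustration and are not taken from the paper.

```python
from statistics import median_low


def median_filter(image, radius=1):
    """Median filter with a (2*radius+1)^2 window, clipped at the borders."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neigh = [image[j][i]
                     for j in range(max(0, y - radius), min(h, y + radius + 1))
                     for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(median_low(neigh))
        out.append(row)
    return out


def kmeans_1d(values, centers, iterations=32):
    """K-means on scalar features; returns final centers and labels."""
    centers = list(centers)

    def nearest(v):
        return min(range(len(centers)), key=lambda c: abs(v - centers[c]))

    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            groups[nearest(v)].append(v)
        # Keep the old center when a cluster receives no members.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, [nearest(v) for v in values]


def segment(image, centers):
    """Median-filter the image, then label each pixel by its cluster."""
    filtered = median_filter(image)
    flat = [p for row in filtered for p in row]
    _, labels = kmeans_1d(flat, centers)
    width = len(image[0])
    return [labels[i:i + width] for i in range(0, len(labels), width)]
```

In the MLSoC, the first stage corresponds to the ISP and the second to the FSP, with the HBDM holding the intermediate image so that no external memory access is needed between the two stages.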
VI. CONCLUSION
A 1.7 TOPS/W 16-mm² MLSoC is implemented in TSMC 90-nm CMOS technology. The new SoC architecture meets the flexibility and performance requirements of multimedia content analysis for consumer electronics, and the tera-scale performance of the DSP enables the coacceleration of computer vision and machine learning algorithms. Moreover, the proposed MLSoC achieves higher power efficiency and area efficiency than other related works.
[18] I. Hatirnaz, F. K. Gürkaynak, and Y. Leblebici, "Realization of a programmable rank-order filter architecture using capacitive threshold logic gates,"
His major research interests include low-power stream processors for computer vision and low-power video coding architectures.
Chen-Han Tsai received the B.S. degree in electrical engineering from National Taiwan University (NTU), Taipei, Taiwan, in 2002, where he is currently working toward the Ph.D. degree at the Graduate Institute of Electronics Engineering, NTU.
His major research interests include face detection and recognition, motion estimation, H.264/AVC video coding, digital TV systems, multimedia SoC, and related VLSI architectures. 
Shao-Yi Chien

