Abstract-Multiview video coding (MVC) systems require much more bandwidth and computation than mono-view video systems. Thus, when designing a VLSI architecture for an MVC system, hardware resource allocation is a critical issue. In this paper, we propose a new system bandwidth analysis scheme for various complicated MVC structures. The precedence constraint from graph theory is adopted to derive the processing order of frames in an MVC system. In addition, current block centric scheduling (CBCS) and search window centric scheduling (SWCS) are proposed for MVC bandwidth analysis. By adopting data reuse schemes, several design points are explored with the aid of the proposed analysis scheme, so that a suitable hardware resource allocation can be easily determined.
I. INTRODUCTION
Multiview video can provide users with a sense of complete scene perception by transmitting several views to the receiver simultaneously. It gives users vivid information about the scene structure. Moreover, it can provide 3D perception by showing two of these views to the viewer, one for each eye. As 3D-TV technology matures [1], multiview video coding (MVC) draws more and more attention. In recent years, the MPEG 3D audio/video (3DAV) Group has worked toward the standardization of MVC [2], which also advances multiview video applications. The reference software for MVC (JMVM1.0) [], which was recently released by the MPEG 3DAV Group, is based on the hybrid coding scheme, with H.264/AVC adopted as the base layer. Instruction profiling shows that 2.76 tera-operations/s (TOPS) of computation and 4.25 tera-bytes/s (TB/s) of memory access are required for real-time encoding of SDTV videos [3]. For MVC, the required computation and memory access are even larger. Therefore, hardware acceleration is an efficient solution.
Motion estimation (ME) and disparity estimation (DE) are the major components of an MVC system. They account for most of the computational complexity and memory bandwidth in the system. The large computational complexity is due to the many candidate blocks to be matched, and the huge memory bandwidth results from loading the data of these candidate blocks. The computational complexity can be addressed by parallel processing techniques or fast prediction algorithms. However, the system memory bandwidth is limited in a VLSI hardware system. Traditionally, the data of the current macroblock (MB) and the search window (SW) are loaded from system memory and then buffered in on-chip SRAMs or registers. The system memory bandwidth can be reduced by local data reuse schemes. Several data reuse strategies have been proposed with different trade-offs between system bandwidth and local memory size [4] [5]. In addition, a frame-level data reuse scheme has been proposed to further reduce the memory bandwidth of multiple-reference-frame ME [3]. However, as the design space extends from mono-view to multiview video systems, the demand for system bandwidth and for on-chip and off-chip buffers increases by an order of magnitude. Different applications require different MVC coding structures, which greatly increases the challenge of system design. Thus the previous data reuse schemes for mono-view video systems no longer support MVC efficiently. In this paper, a new system analysis scheme with precedence constraint is proposed for MVC systems. It utilizes the relation between the SWs for ME and DE and combines the previous data reuse schemes. With the aid of the precedence constraint, the most suitable scheduling and resource allocation for each coding structure can be systematically derived. The rest of this paper is organized as follows. The previous data reuse schemes for mono-view video coding systems are briefly introduced in Section II.
In Section III, the proposed system analysis scheme with precedence constraint is described. The performance evaluation and discussion are shown in Section IV. Finally, Section V concludes this paper.
II. PREVIOUS DATA REUSE SCHEMES FOR MONO-VIEW VIDEO CODING SYSTEMS
Data reuse is an important concept and is adopted in most VLSI designs for ME in video coding systems. Many data reuse schemes have been proposed, and they can be generally classified into two categories, intra-frame and inter-frame data reuse. Intra-frame data reuse schemes save memory bandwidth by exploiting the fact that the SWs of neighboring MBs overlap each other. According to the trade-off between system bandwidth and on-chip memory size, they can be classified into four schemes, indexed from level-A to level-D [4]. The level-A scheme requires the smallest on-chip memory size and the highest external bandwidth, while the level-D scheme requires the largest on-chip memory size and the lowest external bandwidth. Among the four schemes, the level-C scheme is often adopted because it is more suitable for implementation with current VLSI technology. To enhance the scalability of data reuse and fully utilize the hardware resources, Chen et al. [5] propose the level-C+ data reuse scheme. As shown in Fig. 1, it not only fully reuses the overlapped SWs in the horizontal direction, but also partially reuses the overlapped SWs in the vertical direction. It offers many design points between those of the level-C and level-D schemes. The system bandwidth can be further reduced with a small overhead in on-chip memory size.
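The trade-off that the level-C+ scheme exposes can be illustrated with a rough count. The following is a minimal Python sketch, not the formulation of [4] or [5]. It assumes square N × N macroblocks, a search range of ±p pixels in both directions (so the SW of one MB is taken as (N + 2p) × (N + 2p) pixels), full horizontal reuse, a stripe of n vertically adjacent MBs sharing one enlarged SW, and no frame-boundary effects. The function name and its parameters are hypothetical.

```python
def level_c_plus_cost(mb_size=16, search_range=16, stripe_height=1):
    """Rough per-MB load and SW buffer size for a vertical stripe of MBs.

    Assumptions (not from the paper): square MBs, +/- search_range pixels
    in both directions, full horizontal reuse, no frame-boundary effects.
    Returns (pixels loaded from system memory per MB, SW buffer pixels).
    """
    if stripe_height < 1:
        raise ValueError("stripe_height must be at least 1")
    # The stripe of MBs shares one SW that is taller than a single-MB SW.
    sw_height = stripe_height * mb_size + 2 * search_range
    sw_width = mb_size + 2 * search_range
    # Moving one MB to the right brings in one new column strip,
    # mb_size pixels wide, which serves all MBs in the stripe.
    loaded_per_mb = mb_size * sw_height / stripe_height
    buffer_pixels = sw_width * sw_height
    return loaded_per_mb, buffer_pixels


for n in (1, 2, 4):
    load, buf = level_c_plus_cost(stripe_height=n)
    print(f"stripe height {n}: {load:.0f} pixels/MB, buffer {buf} pixels")
```

Under these assumptions a stripe height of 1 corresponds to the level-C scheme, and taller stripes trade a larger SW buffer for fewer pixels loaded per MB.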
On the other hand, inter-frame data reuse schemes, such as the single reference frame multiple current MBs (SRMC) scheme [3], reuse SW data at the frame level when performing multiple-reference-frame ME. The concept of the SRMC scheme is shown in Fig. 2. The current MBs located at the same position in their corresponding frames have an identical SW in a reference frame. Therefore, only a single SW memory is required, and the system bandwidth can be further reduced. To achieve inter-frame data reuse, the ME procedure has to be rescheduled; that is, the ME operations of one current MB on different reference frames are processed in different time slots.
III. PROPOSED SYSTEM ANALYSIS SCHEME WITH PRECEDENCE CONSTRAINT
MVC is a challenging task because of the variety of coding structures and the varying number of view channels. The processing schedule and resource allocation greatly affect the architecture performance of MVC. The previous data reuse schemes for mono-view video systems are not sufficient for MVC. In this section, a system analysis scheme with precedence constraint is proposed to derive a suitable processing schedule and hardware resource allocation systematically. Before introducing the proposed analysis method, the system architecture of an MVC system is defined first. Then, the intra-inter-view data reuse scheme with precedence constraint, its corresponding analysis, and the case studies are described.
A. System Architecture of Multiview Video Coding Systems
Figure 3 shows the block diagram of the proposed multiview video encoder, which is based on the hybrid coding scheme. There are two kinds of view channels, the primary channel and the secondary channel. Both are encoded with H.264/AVC. There is no DE operation in the primary channel. The number of primary and secondary channels depends on the coding structure. The block engine includes quantization, transform, the deblocking filter, etc. After encoding, the compressed bitstream of each channel is transmitted. In addition, the hardware system architecture is defined in Fig. 4. It consists of three parts: the multiview video encoding engine, the system memory, and the processor. Most of the system bandwidth is required by the ME and DE parts. The heavy communication between the system memory and the SW buffers makes the bandwidth loading of the system bus a critical issue.
B. System Bandwidth Analysis with Precedence Constraint
1) Precedence Constraint: To cope with the complicated processing order of frames in an MVC system, we observe that data dependency exists between frames; that is, a current frame cannot be encoded until its reference frames are encoded. Therefore, the precedence constraint, a concept from graph theory, is adopted to express the data dependency between frames. Each frame can be regarded as a vertex v_i with the sequence order S(v_i). Each prediction arrow can be regarded as an edge e_ij with weight d(e_ij) between two vertices. Therefore, a constraint graph G(V, E) can be constructed with the following criterion,
(2)
Figure 5 shows an example of the precedence constraint applied to a stereo video coding structure. The first frames in both view channels are intra-coded. Thus no data dependency exists between them, and their vertex values are assigned 1. There is only one edge connected to v_3, so S(v_3) is assigned S(v_1) + d(e_13) = 2. The other vertex values are defined with the same rule. Therefore, the processing order of the frames can be derived.
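The vertex values can be computed mechanically once the prediction arrows are listed. The sketch below is one possible reading of the rule, under stated assumptions: unit edge weights, a value of 1 for frames without references, and a frame with several references taking the largest candidate value (a longest-path rule). The six-frame graph is a hypothetical stereo structure chosen for illustration, not the one in Fig. 5, and the function name is likewise hypothetical.

```python
from collections import defaultdict, deque


def vertex_values(vertices, edges, weight=1):
    """Assign S(v) to every frame of a prediction graph.

    vertices: iterable of frame ids; edges: (reference, current) pairs,
    both ends must appear in vertices.
    Assumed rule: S = 1 for frames with no reference, otherwise the
    maximum of S(reference) + weight over all incoming edges.
    """
    succs = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for ref, cur in edges:
        succs[ref].append(cur)
        indeg[cur] += 1

    S = {v: 1 for v in indeg if indeg[v] == 0}
    queue = deque(S)  # frames whose references are all resolved
    done = 0
    while queue:
        u = queue.popleft()
        done += 1
        for v in succs[u]:
            S[v] = max(S.get(v, 0), S[u] + weight)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if done != len(indeg):
        raise ValueError("prediction structure contains a cycle")
    return S


# Hypothetical stereo structure: odd frames in view 1, even frames in view 2.
frames = [1, 2, 3, 4, 5, 6]
arrows = [(1, 3), (2, 4), (3, 4), (3, 5), (4, 6), (5, 6)]
print(vertex_values(frames, arrows))
```

In this hypothetical graph, v_3 has a single incoming edge from v_1 and receives the value 2, matching the pattern of the worked example above, while v_4 has two references and takes the larger candidate.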
2) System Bandwidth and Memory Analysis Scheme: Although the processing order of the frames in an MVC system has been derived, the processing order of the prediction arrows has not yet been determined. The prediction is carried out with limited hardware resources, such as processing elements (PEs) and on-chip buffers. To make the analysis more systematic, two kinds of scheduling, current block centric scheduling (CBCS) and search window centric scheduling (SWCS), are proposed. In CBCS, each current block and its corresponding SWs are loaded from system memory for ME or DE. The prediction for the next MB does not start until the mode decision of the current block is finished. Therefore, each current frame is loaded only once from system memory. CBCS is a common scheduling with a simple data flow. However, in some MVC structures, a reconstructed frame may be read from system memory several times if it is required for predicting several current frames, which wastes much system memory bandwidth. In contrast to CBCS, in SWCS each reconstructed frame taken as a reference frame is loaded only once. When an SW is loaded, its corresponding current blocks are also loaded for cost computation. However, the mode decision of these current blocks is not finished if they have other reference frames. Therefore, the partial results for mode decision need to be stored in on-chip or off-chip memory. Storing them on-chip is usually not a suitable choice because it requires much silicon area. A small penalty of SWCS is a slight quality loss due to incomplete MV predictor generation. The SRMC scheme belongs to this kind of scheduling. Taking Fig. 6 as an example, the inter-view data reuse scheme proposed for our prior stereo video system [6] can be extended to MVC. With SWCS, SW_ME is first loaded for the current block in view channel 2 for DE. Then, the current block in view channel 1 is loaded for ME. Therefore, the on-chip memory and bandwidth required for SW_DE are saved. The choice of CBCS or SWCS for an MVC system greatly affects the performance of the system architecture, especially the system bandwidth. Whether CBCS or SWCS is chosen, the system bandwidth can be described as
BW_ME = n(ref_ME) × BW_SW_ME,
The system bandwidth is composed of four parts, BW_ME, BW_DE, BW_PR, and BW_BE. BW_SW_ME and BW_SW_DE are the bandwidths required for loading the SWs for ME and DE, respectively. They depend on whether an intra-frame data reuse scheme is adopted. BW_BE is the bandwidth required by the block engine introduced in Section III-A. n(ref_ME), n(ref_DE), n(Ori), and n(Rec) are the numbers of times that SW_ME, SW_DE, original frames, and reconstructed frames, respectively, are loaded or transferred through the system bus in every time slot. BW_PR is the bandwidth required for sending the partial cost results to, and loading them from, the system memory, counted by n(Cost_PR). In addition, for any two vertices v_i and v_j connected by an edge e_ij, a distance D(v_i, v_j) between v_i and v_j is defined, which is either equal to 1 or greater than 1. In CBCS, n(Ori) and n(Rec) are equal to the number of view channels regardless of D(v_i, v_j). n(ref_ME) and n(ref_DE) depend on the coding structure. n(Cost_PR) is equal to zero because the mode decision can be finished immediately without storing partial results. In the case of SWCS, D(v_i, v_j) > 1 means that, for an SW, its corresponding current frames have different vertex values. According to the precedence constraint, the partial cost results then need to be stored in the system memory. Thus n(Ori) and n(Cost_PR) increase. Usually, n(ref_DE) in CBCS is larger than that in SWCS. There is thus a trade-off between loading SWs and storing partial cost results. To make the analysis easier to follow, a design example is shown in the next section.
IV. CASE STUDIES AND PERFORMANCE EVALUATION
Fig. 7 shows a design example of the proposed system bandwidth analysis with precedence constraint. The coding structure consists of five view channels. First, the vertex values, which are regarded as the processing order, are derived by the proposed method. Then, the frames are arranged according to the processing order as shown in Fig. 7(c). Figures 7(c) and (d) show two schedulings. Prediction arrows with the same color can be executed in the same pipeline stage if the required SWs are ready. The full-search block-matching algorithm (FSBMA) and the level-C data reuse scheme are adopted. In the case of CBCS, n(Ori) = n(Rec) = 5, which means that five original frames and five reconstructed frames are transmitted through the system bus in a time slot. Among the prediction arrows with the same color, there are five ME prediction arrows and four DE prediction arrows. Thus n(ref_ME) = 5, and n(ref_DE) could be assigned 4. However, some frames, such as v_5, have three corresponding current frames. The SWs for these current frames overlap, so the bandwidth is reduced: n(ref_DE) = 4 − 2 = 2 and n(Cost_PR) = 0. With SWCS applied in Fig. 7, D(v_i, v_j) > 1 for some of the edges leading to v_7 and v_9, so the mode decision of v_7 and v_9 cannot be finished in a time slot. It also means that the current frames of v_7 and v_9 have to be loaded twice. Thus n(Ori) = 5 + 2 = 7, n(Rec) = 5, and n(Cost_PR) = 2 × 2. The multiplier 2 in n(Cost_PR) accounts for the data being sent off-chip for storage and loaded back on-chip for the final mode decision. After all the parameters are derived, the system bandwidth can be calculated.
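Once the counts are known, the four bandwidth terms can be tallied. The following is a hedged sketch rather than the exact formulation of the paper: BW_ME follows the expression given above, BW_DE is assumed to have the same form, and BW_PR and BW_BE are assumed to scale linearly with n(Cost_PR) and with n(Ori) + n(Rec), respectively. The CBCS counts are those of the design example; the per-unit bandwidths are hypothetical values in arbitrary units, and all names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Transfer counts per time slot, named after the symbols in the text."""
    n_ref_me: int
    n_ref_de: int
    n_ori: int
    n_rec: int
    n_cost_pr: int


def system_bandwidth(c, bw_sw_me, bw_sw_de, bw_frame, bw_cost):
    """Sum the four bandwidth parts under the assumptions stated above."""
    parts = {
        "ME": c.n_ref_me * bw_sw_me,
        "DE": c.n_ref_de * bw_sw_de,          # assumed same form as BW_ME
        "PR": c.n_cost_pr * bw_cost,          # assumed linear in n(Cost_PR)
        "BE": (c.n_ori + c.n_rec) * bw_frame,  # assumed linear in frame count
    }
    parts["total"] = sum(parts.values())
    return parts


# CBCS counts from the five-view design example.
cbcs = Counts(n_ref_me=5, n_ref_de=2, n_ori=5, n_rec=5, n_cost_pr=0)
# Hypothetical per-unit bandwidths, arbitrary units.
print(system_bandwidth(cbcs, bw_sw_me=3.0, bw_sw_de=3.0, bw_frame=1.0, bw_cost=0.5))
```

The SWCS case can be evaluated in the same way by supplying n(Ori) = 7, n(Rec) = 5, n(Cost_PR) = 4, and the SW counts read from the scheduling.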
The proposed analysis method can support more complicated MVC structures. Two MVC structures with 720 × 480 frame size and 30 fps, as shown in Fig. 8, are analyzed, and the results are shown in Fig. 9. The MB stripe height represents the degree of partial data reuse in the vertical direction. When the MB stripe height is equal to 1, the level-C+ scheme reduces to the level-C scheme. As the MB stripe height increases, the two curves intersect. The bandwidth requirement is lower in CBCS with a large MB stripe height. The reason is that in CBCS, n(ref_DE) is usually larger than in SWCS, so the bandwidth requirement for SW_DE is higher. However, as the reusable MB stripe height increases, BW_SW_DE becomes lower. In addition, n(Cost_PR) and n(Ori) are the overhead of SWCS, and this overhead cannot be reduced by adopting the level-C+ scheme. The trade-off between the system bandwidth and the required on-chip memory can be easily observed from the analysis. Therefore, the proposed analysis scheme provides an effective quantitative basis for design selection in MVC systems.
V. CONCLUSION
This paper presents a new bandwidth analysis scheme for various MVC structures. The concept of precedence constraint from graph theory is adopted to derive the processing order in an MVC structure. In addition, two schedulings for MVC are proposed for systematic analysis. In combination with the level-C+ data reuse scheme, several design points can be derived. Hardware resource allocation can be systematically defined through the trade-off between system bandwidth and on-chip memory.
