One of the largest challenges for coarse-grained reconfigurable arrays (CGRAs) is how to map applications onto them efficiently. The key mapping issues are (1) how to reduce the memory bandwidth, (2) how to exploit the parallelism in algorithms, and (3) how to achieve load balancing and take full advantage of the hardware potential. In this paper, we propose a novel parallelism scheme, called 'Hybrid partitioning', for mapping an H.264 high-definition (HD) decoder onto REMUS-II, a CGRA system-on-chip (SoC). Combining the good features of data partitioning and task partitioning, our methodology consists of three levels from top to bottom: (1) a hybrid task pipeline based on the slice and macroblock (MB) levels; (2) MB row-level data parallelism; and (3) sub-MB level parallelism. Further, on the sub-MB level, we propose several mapping strategies, such as hybrid variable block size motion compensation (Hybrid VBSMC) for MC, a 2D-wave for intra 4 × 4 prediction, and a parallel processing order for deblocking. These mapping strategies improve the performance of the algorithms on REMUS-II. For example, for a luma 16 × 16 MB, Hybrid VBSMC achieves 4 times greater performance than VBSMC and 2.2 times greater performance than the fixed 4 × 4 partition approach. Finally, we achieve 1080p@33fps H.264 high-profile (HiP)@level 4.1 decoding when REMUS-II runs at 200 MHz. Compared with typical hardware platforms, our design offers a better combination of performance, area, and flexibility. For example, it achieves an approximately 175% performance improvement over the commercial CGRA processor XPP-III while using only 70% of its area.
Introduction
Video coding standards have developed rapidly during the last decade. H.264/AVC (advanced video coding) is the latest digital video codec standard and has proven to be superior to prior standards in terms of compression ratio, quality, bit rate and error resilience [1]. This high coding gain comes mainly from the combination of new coding techniques such as inter-frame prediction with quarter-pixel resolution, multimode intra prediction, variable block sizes and context-based adaptive entropy coding, as well as multiple reference pictures and in-loop deblocking filters [2]. However, these coding efficiency improvements come at the expense of increased computational complexity. The combination of these new coding features makes H.264/AVC approximately four times more complex than MPEG-2 and two times more complex than MPEG-4 [3]. Despite this very high complexity, even mobile platforms nowadays require an FHD (full high-definition, 1920 × 1080) video decoder for H.264 and for many other standards. Embedded devices therefore face stringent requirements for more flexibility, higher performance, and lower power consumption. None of the traditional hardware architectures achieves both high performance and enough flexibility for emerging and evolving multimedia standards. General purpose processors (GPPs), including digital signal processors [4], help to reduce the time-to-market and development costs of new products because of their flexibility and ease of use. However, because of sequential execution, the performance of software is orders of magnitude lower than that of hardware, even when its larger power consumption is taken into account. In contrast, application specific integrated circuits (ASICs) [5] can provide high performance for a specific application. However, these circuits cannot be altered after fabrication, which makes them incapable of adapting to new system requirements or standards.
To find an architecture that better balances efficiency and flexibility, there has been increasing interest in reconfigurable architectures (RAs) in recent years. RAs can provide performance similar to that of ASICs by mapping computationally intensive tasks directly onto highly parallel on-chip hardware resources. This mapping allows RAs to avoid sequential execution. Meanwhile, RAs also maintain post-fabrication flexibility because they can be reconfigured for standard updates or new application requirements. Currently, field programmable gate arrays (FPGAs) still dominate the reconfigurable computing field [6]. However, the bit-level, fine-grained granularity of FPGAs has significant costs in terms of routing area, speed and configuration time. In contrast, a CGRA typically consists of an array of ALUs that provide word-level or subword-level datapaths. A major benefit of CGRAs is the reduced size of the configuration (also called the context), because CGRAs need fewer bits to control their architecture. Therefore, a single CGRA can be dynamically customized for different kernels or applications at runtime, whereas such runtime reconfiguration is too time-consuming for FPGAs.
To achieve these performance benefits and support a wide range of applications, reconfigurable systems are usually formed by combining CGRAs with a general purpose microprocessor [6]. The processor performs operations that cannot be efficiently performed in reconfigurable logic, such as data-dependent control. Computational kernels, meanwhile, are mapped to the CGRAs. REMUS-II is a novel CGRA SoC that contains a computation-intensive processing unit (RPU) and a control-intensive processing unit (µPU).
Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
Effectively mapping applications onto CGRA platforms is a key factor in determining whether CGRAs are promising. The motivation of this paper is to explore a strategy for mapping an H.264 HD decoder onto REMUS-II in parallel while taking both the algorithm features and the architecture characteristics into account. For this purpose, we consider effective hardware/software partitioning and scheduling schemes that balance the computational load and minimize inter-core communication. Further, on the sub-MB data level, we analyze the different kernel sub-algorithms and adopt mapping methods that maximize hardware utilization and reduce memory bandwidth requirements. Our mapping experience can serve as a reference for mapping similar applications onto CGRAs.
The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 gives an overview of H.264/AVC decoder, and Sect. 4 gives an overview of REMUS-II architecture. Section 5 analyzes the parallelism of H.264 decoder and proposes our 'Hybrid partitioning' scheme in detail on REMUS-II. Implementation results are discussed in Sect. 6. The conclusions are given in Sect. 7.
Related Work
Many CGRAs have been proposed in recent years [7] - [14] . The previous review [7] only summarizes some of the important CGRAs that were proposed before 2001.
MorphoSys [8] is composed of a tiny RISC main processor and an 8 × 8 ALU RCA that performs 16-bit operations in SIMD mode. A high-bandwidth memory interface consists of a frame buffer and a DMA controller that sits between the external memory and the RCA. However, communication between the frame buffer and the main memory occurs through a 32-bit bus and can be a bottleneck for the overall system performance. A CIF@30 fps MPEG-2 encoder, automatic target recognition and a data encryption algorithm have been mapped onto MorphoSys.
ADRES [9] tightly couples a VLIW processor and a reconfigurable array at the highest abstraction level. The reconfigurable array works as a coprocessor for the VLIW processor; as such, their executions never overlap. Because the two cannot execute instructions concurrently, the code running on the VLIW processor and the remaining code that is accelerated on the reconfigurable array may not be pipelined efficiently. In a previous study [15], H.264/AVC CIF decoding was implemented on ADRES at 56 MHz.
RICA [10] has a highly reconfigurable fabric of interconnected instruction cells that allow it to build circuits from an assembly representation of programs. Special ICs in the core provide interfaces to the data and program memories. The H.264/AVC decoder achieves between 12 fps and 21 fps at D1 after minor source code modifications. XPP-III [11] contains a rectangular array of three types of processing array elements (PAEs): ALU-PAE, RAM-PAE and FNC-PAE. The ALU-PAEs and RAM-PAEs compose a dataflow array. FNC-PAE is a VLIW-like processor kernel and is suitable for processing the control-oriented portion of the application. The H.264/AVC decoder can be split into a number of parallel tasks that are distributed onto multiple FNC-PAEs and the reconfigurable array. The simulation results allow up to HD (1920 × 1080@24 fps) decoding [16] .
MORA [12] consists of a scalable 2D array of identical RCs that are organized in 4 × 4 quadrants. The RCs are connected through a hierarchical reconfigurable network. MORA does not include a centralized RAM system. Instead, each individual RC is a tiny processor-in-memory (PIM) with internal data memory. Computations are performed close to the memory, avoiding time- and power-consuming communication between the RCs and the data memory. Three image processing algorithms are compared on different architectures. The results show that MORA clearly outperforms FPGA and DSP.
The PE array of FloRA [13] connects an ARM processor through a bus. One special feature of FloRA is that PEs support floating-point operations. Compared to software implementations, the average performance of an H.264 deblocking filter is increased 18 times for luma and 42 times for chroma MBs.
ERP [14] is a novel reconfigurable processor that uses a dynamically partitioned single-instruction multiple-data (DP-SIMD) model. The control unit manages the external I/O and the internal PU array via a PUA configuration and data memory. Owing to DP-SIMD, the ITQ and MC of H.264 require only 3.9 ms to process one D1 frame when the ERP runs at 200 MHz.
In summary, except for the industry CGRA XPP-III, only sub-portions of the H.264 decoder algorithm or a low performance H.264 decoder are implemented on these CGRAs.
The H.264/AVC Decoder Overview
H.264/AVC outperforms prior video coding standards in terms of compression and quality. A field or a frame is encoded to produce a coded picture. A picture is partitioned into fixed-size MBs, each of which contains a rectangular area of 16 × 16 luma samples and two associated areas of 8 × 8 chroma samples (Cb and Cr in 4:2:0 format). A picture can be split into one or several slices, each of which contains some number of MBs in raster scan order (from left to right, from top to bottom) when flexible MB ordering is not used. I MBs are predicted using intra prediction from decoded samples in the current slice. P and B MBs are predicted using inter prediction from previously decoded pictures (reference pictures).
The coded video data are organized into bitstreams. An H.264/AVC decoder interprets the syntax elements in the compliant bitstream to produce a reconstructed video sequence. The functional block is shown in Fig. 1 .
As shown in this figure, the incoming bitstream is stored on a memory buffer to be parsed and decoded at the entropy decoding stage. H.264/AVC main profile includes two different entropy-coding (ED) modes: (1) a simple context-based adaptive variable length coding (CAVLC) method; and (2) a more complex, compression-efficient, context-adaptive binary arithmetic coding (CABAC) method. The various syntax elements obtained after this process are demultiplexed and sent to the different functional kernels involved in the decoding process. At the MB level, syntax elements include the coding mode of the MB, the information required for forming the prediction, such as motion vector (MV) and spatial prediction mode, and the coded information of the residual (difference) blocks, such as the coded block pattern (CBP) and quantized transform coefficients. In particular, the residual samples of the current MB are reordered by following a typical inverse scan procedure. Then, the levels, which represent quantized transform coefficients, are inversely quantized via multiplication by a scaling factor. Finally, an integer-specified inverse transform is performed on the inverse-quantized coefficients. In parallel, a predictor is constructed from previously decoded pixels in the same frame (intra-coded MBs) or from specific pixels denoted by the MVs as belonging to reference pictures (inter-coded MBs). The decoded and inversely transformed residual samples are then added to the selected predictor. Finally, a deblocking filter is implemented to reduce annoying blocking artifacts that result from the block-based processing. The original MB is recovered with substantial objective and subjective quality improvements.
REMUS-II Architecture Overview
REMUS-II, short for Reconfigurable Multimedia System 2, is a novel CGRA SoC that improves the data flow [17] and the corresponding control flow compared with the previous version [18]. As shown in Fig. 2, REMUS-II consists of a main controller, ARM7DMI, that executes the necessary control and scheduling work; two reconfigurable process units (RPUs) that speed up computationally intensive tasks with intrinsic data parallelism; a micro-processor unit (µPU) that processes control-intensive tasks; and several assistant modules, including an interrupt controller (IntCtl) and a direct memory access controller (DMAC). All of the modules are AMBA2.0-compatible and connected to a 32/64-bit bus. Each RPU contains an RCA16 × 16 that is composed of four fundamental RCA8 × 8s. The instruction set supported by the RCs is close to that of a standard RISC processor, except for branch instructions. In addition, some special instructions for multimedia applications are added, such as the CLIP instructions. An RPU receives its configuration information through the configuration interface (CI). The configuration information, also called the context, is used to restructure the function and data communication of an RCA8 × 8 so that it accomplishes the expected function.
The µPU, which is one of the largest improvements relative to the previous version, consists of a 1 × 8 micro-process element array (µPEA) and a 1 × 2 special process element array (SPEA). The SPEA consists of two configurable stream processor elements that efficiently implement multistandard entropy decoding for multimedia applications, including CAVLD and CABAD for H.264, the Huffman/run-length decoder for MPEG-2 and 2D-VLD for AVS [19]. Apart from this, the SPEA is also in charge of handling special complex tasks, such as MV prediction in MC and boundary strength (BS) calculation in the H.264 deblocking algorithm. Similar to the tightly coupled VLIW processor in ADRES and the FNC-PAEs in XPP-III, the µPEA is responsible for carrying out control-intensive tasks. In the H.264 decoder, a µPE generates the configuration word (CW) of an MB according to the MB information, such as the MB type and prediction mode. These CWs select the correct context in the context group cache to restructure an RPU.
To satisfy the communication bandwidth requirements of multimedia applications and to control power consumption, REMUS-II contains a hierarchical memory architecture, as shown in Fig. 2. Each RCA8 × 8 has an input/output FIFO that makes a three-step pipeline (load, execute, store) possible. The RCA internal memory (RIM) holds intermediate data in the RCA and is also the data exchange center of an RPU. Acting as a cache between the external memory and the local memory in the RPU, the block buffer speeds up data access. To suit the data access patterns of multimedia applications, multiple data communication modes are supported. For more details, please refer to [17].
Parallel Analysis and Mapping the H.264/AVC Decoder to REMUS-II

Task Partitioning and Data Partitioning
There are two methods to parallelize H.264 decoder applications in a multicore environment: task partitioning and data partitioning [20]. The two approaches are sketched in Fig. 3. The main concern when choosing a partitioning method is to minimize the overheads it causes, such as inter-task communication, data synchronization and shared resource contention. In task partitioning [21] - [23], the sub-functions of the H.264 application are assigned to different hardware processing modules. Task partitioning fits ASIC implementations naturally, because a dedicated hardware module can be implemented for each sub-function to achieve the best performance. However, task partitioning requires significant inter-task communication to move the intermediate data between processing stages, and this may become the bottleneck. Additionally, significant synchronization overhead is required to activate the different modules at the right time. The main drawbacks of task partitioning on a CGRA, however, are load balancing and scalability. Different sub-functions have different execution times, which depend on the algorithm characteristics and the processed data. Because of imbalanced loads, one task can block other tasks, so the computing resources cannot be fully utilized. Scalability is also difficult to achieve with task partitioning, because a different throughput requirement means that the task partitioning must be re-implemented.
In contrast, data partitioning [24], [25] divides a picture into multiple sub-pictures. Each sub-picture is assigned to a different processing unit, which runs the whole program. Data partitioning inherently results in data locality and provides natural load balancing and easy scalability of the system. However, because of the communication overhead caused by possible dependencies between data partitions, the size and shape of the data partitions have to be chosen carefully. Data partitioning also places higher performance requirements on each processing module. Moreover, it requires much more program memory and data buffering than task partitioning.
Problem of Choosing Data Size
In H.264, a video segment can be decomposed into a hierarchy of data structures: from groups of pictures (GOPs) to frames, slices, MBs, and finally variable-sized blocks.
Many researchers have designed their own architectures around the data structure level that they consider the most appropriate. Roitzsch [26] proposed a slice-balancing approach to improve the load balance when exploiting slice-level parallelism. MB-level parallelism has been shown in theoretical and simulation analyses to be scalable and efficient. However, MB-level architectures [23] - [25] have some disadvantages: entropy decoding cannot be parallelized at the MB level, hardware resources cannot be fully utilized across different block sizes, and the number of independent MBs fluctuates at the start and the end of the coding, which reduces the parallelism. For a mobile appliance, the resource cost of the system becomes a more important factor. Compared with the MB-level pipelining architecture of conventional designs, 4 × 4-block level parallelism [21], [22] improves the utilization of the processing units, decreases the operation complexity, optimizes the circuit size and eliminates the bubbles that exist in an MB-level pipeline. The penalty is increased external memory access in MC. Considering the data dependencies between data partitions, the authors of [27] grouped adjacent MBs in a staircase shape to improve performance. However, this has negative consequences for the scheduling. To address these difficulties, a hybrid task pipelining scheme [28] was presented that balances the schedule with block-level, MB-level, and frame-level pipelining. We need to find the scheme that gives the minimum scheduling overhead and the maximum parallelism on REMUS-II.
Data Dependencies within H.264 Decoder
In order to maximize the parallelism in our architecture, we also need to explore all of the data dependencies in H.264 to choose the appropriate data granularity. As shown in Fig. 4, the GOP is the coarsest-grained data level in H.264. Inside a GOP, there is a sequence of I, B and P frames. I and P frames are used as references for other frames, but B frames might not be. In that case, the B frames can be processed in parallel. Each frame is partitioned into one or more slices. The size of a slice can vary from one MB up to a complete frame. Slices are self-contained and completely independent from each other.
At the MB level, it is necessary to take into account the dependencies between MBs in both the spatial domain and the temporal domain. Three types of data dependencies can be identified, as shown in Fig. 5. MBs in a slice are usually processed in scan order to ensure that these dependencies are satisfied.
Intra prediction: Depending on the prediction mode, data is predicted from pixels to the left, top-left, top, and/or top-right relative to the current MB.
MV prediction: The MVs of a MB are predicted from MVs of adjacent MBs to the left, top-left, top and/or top-right.
Deblocking: Filtering is performed for the 4 top pixel rows and 4 leftmost pixel columns of the current MB, using the 4 bottom rows of the MB to the top and the 4 rightmost columns of the MB to the left. In block level, depending on the size and shape of a coded block, there are similar restrictions like in MB level.
Proposed Hybrid Partitioning Scheme
Neither task partitioning nor data partitioning alone is a proper solution for mapping an H.264 FHD decoder onto the CGRA. In this section, we propose a novel parallelism scheme, called 'Hybrid partitioning', which combines the good features of data partitioning and task partitioning. Kim [29] tried 'Hybrid partitioning' on a quad-core system, but the data parallelism there exists only in the IDCT and prediction functions. A large amount of data synchronization remains, and the method is not scalable. Our methodology consists of three levels from top to bottom: (1) a hybrid task pipeline based on the slice and MB levels; (2) MB row-level data parallelism; and (3) sub-MB level parallelism.
Hybrid Task Pipeline Based on Slice and MB Level
To achieve H.264 1080p@30 fps decoding with a 200 MHz system clock, multiple techniques are employed at the system level to add slack to the time constraint. We have carefully analyzed the H.264 decoder and subdivided it into sub-tasks, data flow, control flow and configuration flow to fit the REMUS-II architecture, as shown in Fig. 6. From the system point of view, we adopt task partitioning: task-level parallelism is used to pipeline the H.264 decoding stages. Because there is only one EnD module in REMUS-II, the data object processed by the pipeline needs to be selected carefully to reduce the synchronization overhead and maximize the parallelism. As described in the previous section, no content of a slice is used to predict blocks of other slices in the same frame, and the search area of a dependent frame cannot cross the slice boundary. We therefore propose a hybrid task pipeline based on the slice and MB levels to balance parallelism and storage needs.
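The time constraint stated above can be made concrete with a short calculation. The following is a minimal sketch, not part of the REMUS-II tool flow; the function names are hypothetical, and it assumes that a 1080-line picture is coded as 68 MB rows (1088 lines), as is usual for H.264. It estimates how many clock cycles are available per MB on average.

```python
MB_SIZE = 16  # an MB covers 16 x 16 luma samples


def mbs_per_frame(width, height):
    """Number of MBs per frame, rounding each dimension up to whole MBs."""
    mb_cols = -(-width // MB_SIZE)   # ceiling division
    mb_rows = -(-height // MB_SIZE)
    return mb_cols * mb_rows


def cycles_per_mb(clock_hz, fps, width, height):
    """Average cycle budget per MB for a given clock and frame rate."""
    return clock_hz / (fps * mbs_per_frame(width, height))


if __name__ == "__main__":
    print(mbs_per_frame(1920, 1080))
    print(cycles_per_mb(200e6, 30, 1920, 1080))
```

Under these assumptions a frame holds 8160 MBs, so the whole decoding chain has roughly 817 cycles per MB on average at 200 MHz, which is why the stages have to overlap rather than run back to back.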
In the original decoding process, the MBs are processed in scan line order which includes the entropy decoding. The complexity of the entropy decoding method is derived from the fact that it is a bit-serial operation. The bit-serial operation requires several sequential operations to decode each bit of data. Therefore, the first step is to decouple the entropy decoding. The slice level entropy decoding (P1 in Fig. 6 ) is implemented by SPEA and the output is stored in one of two slice buffers which work in ping-pong mode. Thereby, the entropy decoding and other sub-tasks are pipelined at the slice level which reduces pipeline bubbles. The other sub-tasks mainly work at MB level.
The next stage includes IQ, filling the prediction caches, MV calculation, and BS calculation for every MB in a slice. Its purpose is to decouple the control-intensive and compute-intensive parts at the MB level.
According to different MB parameters, CW generation is performed on a µPE (including Intra/Inter, IDCT, and deblocking). The CW is used to select the appropriate context to reconfigure RPU. Due to the complex judgments and branchings based on the coding mode and MB type, generating CWs is a performance bottleneck. Hence, the µPEA, which adopts 8 µPEs, can at most handle 8 CWs simultaneously. In this manner, configurations can be generated to meet the RPU's computing needs.
MB level decoding consists of the three most computationally intensive decoder sub-tasks: 1) intra- or inter-prediction; 2) deblocking; and 3) inverse transform (including the secondary DC transforms). These sub-tasks account for 71% of the baseline profile decoder's time complexity [30]. After the control-intensive and compute-intensive parts have been decoupled, these sub-tasks share the following properties: separation of the compute-intensive responsibilities from the control-intensive responsibilities; intensive and regular calculation inside an MB (all of the data in one MB follow the same computing rule); and relative independence of data. All of these features make the sub-tasks suitable for RPU computing. Functions are dynamically configured at runtime and carried out in parallel.
MB Row-Level Data Parallelism
Our REMUS-II architecture can be seen as a dual-core system that includes two homogeneous RPUs. The MB level decoding sub-tasks (P4, P5 in Fig. 6) can be implemented on the RPUs using either a task pipeline or data parallelism. In our previous work [17], we adopted a simple task pipelining method (Model1 in Fig. 7): one RPU processes intra/MC and IDCT, and the other RPU processes deblocking. The biggest drawback of this method is the data dependence stall between the RPUs caused by load imbalance. According to the RTL simulation, the data dependence stall cycles of the two RPUs reach 10.9% and 59.4% of the total work cycles, respectively. This is a great waste of hardware resources.
To balance the load, we choose the data partitioning method and propose a novel MB row-level data parallelism (Model2 in Fig. 7). Instead of individual MBs, each RPU processes an entire MB row. As shown in Fig. 8, MBs with the same number in a circle are processed by the same RPU. Currently, REMUS-II has only two RPUs, so the maximum number is 2, but scalability can be achieved easily by simply increasing the number of RPUs. Because all MBs in a row, from the leftmost to the rightmost, are processed by the same RPU, the left dependency is resolved naturally. This also exploits data locality better, because the data of the left MB required by the current MB remain local to the same RPU. The dependency between MBs in two adjacent rows can also be resolved easily: the decoding of an MB in the current row must start at least two MBs behind the decoding of the upper row. MB row-level data parallelism reduces the synchronization overhead, and this static mapping method greatly simplifies the scheduling problem. Compared with our previous work [17], our method increases performance by more than 20% for I frames and by more than 10% for P(B) frames.
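The row-level schedule described above can be illustrated with a small timing model. This is only a sketch under simplifying assumptions that are not from the paper: every MB costs one time unit, rows are assigned to RPUs round-robin, and the only inter-row constraint is that the upper-right neighbor must be finished (which yields the two-MB lag). The function name `schedule` is hypothetical.

```python
def schedule(mb_rows, mb_cols, num_rpus=2):
    """Return finish[r][c], the time at which MB (r, c) completes."""
    finish = [[0] * mb_cols for _ in range(mb_rows)]
    rpu_free = [0] * num_rpus            # time at which each RPU becomes idle
    for r in range(mb_rows):
        rpu = r % num_rpus               # static round-robin row assignment
        for c in range(mb_cols):
            start = rpu_free[rpu]        # left MB is done: same RPU, in order
            if r > 0:
                # Wait for the upper-right MB (or the last MB of the upper row).
                dep_col = min(c + 1, mb_cols - 1)
                start = max(start, finish[r - 1][dep_col])
            finish[r][c] = start + 1     # unit-time MB (assumption)
            rpu_free[rpu] = finish[r][c]
    return finish


if __name__ == "__main__":
    for row in schedule(4, 6, num_rpus=2):
        print(row)
```

In this model the second row starts once two MBs of the first row are complete, and four rows of six MBs finish in 14 time units on two RPUs instead of 24 on one. Real MB times vary with MB type, so the figures only show the shape of the schedule.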
Sub-MB Level Parallelism Method
The main computing kernels are applied at the MB level, but H.264 allows some kernels to operate on smaller blocks. Each RPU contains four homogeneous RCAs which can be used to fully exploit potential sub-MB level parallelism. In the following sections, we will analyze sub-MB level parallelism in MC, intra prediction and deblocking algorithms, and give the corresponding mapping strategies in detail.
MC
Algorithm Analysis
MC in H.264 adopts several state-of-the-art technologies to achieve high quality at low bitrates, including tree-structured variable block sizes (TSVBS), fractional sample interpolation (FSI) and multiple reference pictures (MFP) [2]. As shown in Fig. 9, TSVBS allows four basic MB partitions and four sub-MB partitions for luma samples. The corresponding chroma is partitioned in the same manner as the luma, except that the partitions have half of the horizontal and vertical resolution.
The accuracy of MC is in units of one-quarter of the distance between luma pixels. The samples at half-sample positions in the luma component of the reference picture are generated first (b, h as shown in Fig. 10 (a)) with an interpolation filter that is based on a 6-tap FIR filter. In Fig. 10 (a), b is calculated from the six horizontal integer samples E, F, G, H, I and J. Similarly, h is interpolated by filtering A, C, G, M, R and T. Once all of the half-pel samples are available, the samples at quarter-pel positions are produced by averaging two horizontally or vertically adjacent half- or integer-pel position samples (a, d in Fig. 10 (b)), or a pair of diagonally opposite half-pel position samples (e in Fig. 10 (b)). The prediction values for the chroma component are always obtained by bilinear interpolation of four integer-pel position samples (a in Fig. 10 (c)), and the displacements used for chroma have one-eighth sample position accuracy. These filter functions are defined in Eqs. (1)-(3).
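Eqs. (1)-(3) are not reproduced in this text, so the luma part of the interpolation is sketched here using the filter taps (1, -5, 20, 20, -5, 1) and the rounding defined in the H.264 standard. The function names are hypothetical, and 8-bit samples are assumed.

```python
def clip255(x):
    """Clip a value to the 8-bit sample range."""
    return max(0, min(255, x))


def half_pel(samples):
    """Half-pel sample from six consecutive integer samples.

    For b the input is (E, F, G, H, I, J); for h it is (A, C, G, M, R, T).
    """
    e, f, g, h, i, j = samples
    acc = e - 5 * f + 20 * g + 20 * h - 5 * i + j   # 6-tap FIR, gain 32
    return clip255((acc + 16) >> 5)                  # round and normalize


def quarter_pel(s0, s1):
    """Quarter-pel sample: rounded average of two neighboring samples."""
    return (s0 + s1 + 1) >> 1


if __name__ == "__main__":
    b = half_pel((10, 20, 100, 120, 30, 15))
    a = quarter_pel(100, b)   # a lies between integer sample G and b
    print(b, a)
```

Each output sample depends only on integer-position inputs or on previously computed half-pel samples, which is the regularity that the mapping onto the RCA relies on.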
From the above description, we observe two basic characteristics of the algorithm that can improve parallelization on an RCA.
(1) The calculation is regular and identical in each partition block. (2) The data are relatively independent. Thus, the calculation between different partition blocks is independent. In one partition block, the filtering of different rows and columns is independent.
Mapping Strategy of MC onto REMUS-II
In designing the MC mapping strategy, memory bandwidth and hardware utilization are the two main problems. First, in the worst case the 6-tap filter requires fetching an (M + 5) × (N + 5) byte reference window to interpolate the fractional positions of an M × N sample block. In H.264, MC adopts a variable block size, and the 4 × 4 sub-MB block is the minimum partition. Previous researchers [22], [28] have used 4 × 4 block-based MC to support the different block sizes. However, this scheme forces each 4 × 4 block to always load 9 × 9 pixels, as shown in Fig. 11 (a), even though large block sizes are selected with higher probability than the 4 × 4 block size. This requires tremendous memory bandwidth because of the redundant data between adjacent blocks. As shown in Fig. 11 (b), the interpolation windows of neighboring 4 × 4 blocks overlap when the block mode is larger than 4 × 4. To minimize the bandwidth requirement, the shaded regions should be reused. When the block size is 16 × 16, the redundant data are nearly twice the size of the necessary data. Thus, the VBSMC scheme [31] was proposed to fetch reference data in units of 21 × 21, 21 × 13, 13 × 21, 13 × 13, 13 × 9, 9 × 13, and 9 × 9 pixels for 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4 blocks, respectively.
In order to take full advantage of the four RCAs in an RPU, a 'Hybrid VBSMC' scheme is implemented to achieve a trade-off between memory bandwidth and calculation parallelism. Partitions with luma block sizes of 16 × 16, 16 × 8 and 8 × 16 samples are divided into multiple 8 × 8 blocks so that 8 × 8 block-based MC can be used. According to the partition type, the exact reference window is loaded from external memory into the on-chip block buffer. Overlapping data between adjacent blocks are reused by sharing them in the high-speed internal memory. Each 8 × 8 block is mapped to the corresponding RCA according to its spatial position in the MB (as shown in Fig. 11 (c)). The purpose is to exploit block-level data parallelism.
Because they have no overlapping data, luma blocks of sizes 8 × 4, 4 × 8 and 4 × 4 use the VBSMC scheme, but blocks that belong to the same 8 × 8 block are mapped to the same RCA. This ensures that the load of the four RCAs is balanced. Chroma blocks are processed in the same way. In this manner, we achieve a balance between memory throughput and calculation parallelism. Table 1 compares the different block partition schemes for one example. For a luma 16 × 16 block with the MV equal to (0, 1), we compare the calculation performance (not including any data communication) of the three types of block partition schemes on REMUS-II. In this example, the VBSMC scheme is the worst case because, when a 16 × 16 block is handled as a whole, we are obliged to operate in sequential mode owing to the hardware scale of the 8 × 8 RCA. If the data communication overhead is considered, the advantage of Hybrid VBSMC should be even more obvious.
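The bandwidth argument above can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names and counts pixels only in the worst case, where every block needs the full (M + 5) × (N + 5) window; it compares loading one exact window per partition with loading a separate 9 × 9 window for every 4 × 4 sub-block.

```python
FILTER_MARGIN = 5  # extra rows/columns required by the 6-tap filter


def window_pixels(width, height):
    """Pixels in the worst-case (M + 5) x (N + 5) reference window."""
    return (width + FILTER_MARGIN) * (height + FILTER_MARGIN)


def fixed_4x4_pixels(width, height):
    """Pixels fetched when every 4 x 4 sub-block loads its own 9 x 9 window."""
    return (width // 4) * (height // 4) * window_pixels(4, 4)


if __name__ == "__main__":
    for w, h in [(16, 16), (16, 8), (8, 16), (8, 8)]:
        exact = window_pixels(w, h)
        fixed = fixed_4x4_pixels(w, h)
        ratio = (fixed - exact) / exact
        print(f"{w}x{h}: exact={exact} fixed4x4={fixed} redundant/necessary={ratio:.2f}")
```

For a 16 × 16 partition the exact window holds 441 pixels, while sixteen 9 × 9 windows hold 1296 pixels, so the redundant data amount to about 1.94 times the necessary data, in line with the "nearly twice" figure quoted in the text.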
Second, after examining the interpolation formulas of luma pixels at different positions, we find that the interpolation window is not always (M + 5) × (N + 5). For example, a luma 4 × 4 block whose MV points to horizontal-integer positions (pixels d and h in Fig. 10 (a)) only needs an interpolation window of 4 × 9. Therefore, we should load exactly the required data from external memory, rather than a general worst-case window, to further reduce the bandwidth requirement. This scheme is named Motion Vector Classification (MVC). We employ the MVC scheme for every partition type in 'Hybrid VBSMC'.
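The idea behind MVC can be sketched as follows. This is a simplified bounding-box rule and an assumption on our part: the five extra samples are fetched along an axis only when the MV has a fractional component on that axis. The actual classification on REMUS-II may be finer-grained. The function name `mvc_window` and the quarter-pel MV convention (fraction in the two low bits) are illustrative.

```python
def mvc_window(w, h, mv_x, mv_y):
    """Return (width, height) of the reference window for a w x h luma block.

    mv_x and mv_y are quarter-pel motion vector components; the two low
    bits hold the fractional position (assumed convention).
    """
    frac_x = mv_x & 3            # 0 means horizontal-integer position
    frac_y = mv_y & 3            # 0 means vertical-integer position
    win_w = w + 5 if frac_x else w
    win_h = h + 5 if frac_y else h
    return win_w, win_h


print(mvc_window(4, 4, 0, 2))    # horizontal-integer MV: prints (4, 9)
print(mvc_window(4, 4, 1, 1))    # fractional in both axes: prints (9, 9)
```

Under this rule a 4 × 4 block with a horizontal-integer MV loads 36 bytes instead of 81.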
Third, we map the interpolation algorithm onto the RCA. We use the bilinear filter of a chroma block to illustrate how to achieve optimal performance on the given RCA architecture. As described in Eq. (3), a bilinear filter needs four pixel values and two displacement values. The values of dx and dy range from 0 to 7, so we can precalculate the coefficients as input constants to reduce the computational complexity. The dataflow graph (DFG) after computational complexity reduction is shown in Fig. 12 (a). The DFG can be divided into four steps, each of which requires one cycle to complete. A mapping example is given in Fig. 12 (b). A chroma 8 × 8 MB has two 4 × 8 partitions. The MVs of the two partitions may differ, but the interpolation can be implemented in parallel because of their data independence. Likewise, the two 4 × 4 blocks (split by the broken line) in the same 4 × 8 partition can also be calculated in parallel because of the data independence between rows. Block0 is mapped onto RCA0, and the other sub-blocks are mapped onto their corresponding RCAs. The DFG of a pixel occupies 8 RCs, as shown in Fig. 12 (b). Every RCA can simultaneously calculate 4 pixels.
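The coefficient precalculation can be illustrated with a short sketch. It assumes that Eq. (3) is the standard H.264 chroma bilinear formula, with weights built from dx and dy and a rounding offset of 32 followed by a shift by 6. The names `COEFFS` and `interp_chroma_block` are hypothetical, and the code models only the arithmetic, not the RCA mapping.

```python
def chroma_coeffs(dx, dy):
    """Four bilinear weights for fractional offsets dx, dy in 0..7 (they sum to 64)."""
    return ((8 - dx) * (8 - dy), dx * (8 - dy), (8 - dx) * dy, dx * dy)


# Precomputed once: one coefficient set per (dx, dy) pair.
COEFFS = {(dx, dy): chroma_coeffs(dx, dy) for dx in range(8) for dy in range(8)}


def interp_chroma_block(ref, x0, y0, w, h, dx, dy):
    """Interpolate a w x h chroma block from ref (a list of rows) at (x0, y0).

    ref must hold at least (w + 1) x (h + 1) samples starting at (x0, y0).
    """
    ca, cb, cc, cd = COEFFS[(dx, dy)]      # constants for the whole block
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            a = ref[y0 + y][x0 + x]
            b = ref[y0 + y][x0 + x + 1]
            c = ref[y0 + y + 1][x0 + x]
            d = ref[y0 + y + 1][x0 + x + 1]
            row.append((ca * a + cb * b + cc * c + cd * d + 32) >> 6)
        out.append(row)
    return out


ref = [[10, 20], [30, 40]]
print(interp_chroma_block(ref, 0, 0, 1, 1, 4, 4))   # prints [[25]]
```

Because the four weights are constant for a whole partition, the per-pixel work reduces to four multiplications, the additions, and one shift, and no pixel depends on another, which is what allows several pixels to be computed side by side.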
Furthermore, loop unrolling is also an important method for exploiting data-level parallelism. Due to the complex characteristics of the REMUS-II architecture, we decided to perform manual loop unrolling to find the best algorithm representation. As shown in Fig. 12 (a), we divide the DFG into two stages and carry out loop unrolling separately. Because there is no data dependency between different iterations of the loop, unrolling is limited only by the size of the RCA, not by data dependencies.
Several representative unrolled loop versions are shown in Table 2 (assuming that the reference window data are already in on-chip memory). By comparing them, we can determine the best mapping method. However, the results also remind us that performance improvements can be counteracted by the overhead of context switching. Thus, the final performance is a tradeoff among parallel computing, temporary data communication and configuration scheduling.
Intra Prediction
Algorithm Analysis
Intra prediction predicts the pixels of an MB using the available neighboring blocks. For the luma component of an MB, a 16 × 16 predicted block is formed either by performing intra prediction for each 4 × 4 luma block in the MB or by performing intra prediction for the whole 16 × 16 MB. There are nine prediction modes for each 4 × 4 luma block and four prediction modes for a 16 × 16 luma block. Each 4 × 4 intra prediction mode generates predicted pixel values using some or all of the neighboring pixels A-M, as shown in Fig. 13. The arrows indicate the direction of prediction in each mode. The data dependency of 4 × 4 intra prediction is similar to that at the MB level.
Mapping Strategy of Intra Prediction onto REMUS-II
The double-z-scan order of 4 × 4 blocks specified by the standard is shown in Fig. 14 (a). The number in each rectangle represents the decoding order. Because of the data dependencies between two consecutively decoded blocks in the traditional decoding order, the prediction of the current block must wait until the reconstruction of the previous block has finished, which introduces much idle time. Wu [32] proposes a reordered decoding order, as shown in Fig. 14 (b). In the reordered decoding order, the predictions of blocks 3, 4, 5, 6, 7, 9, 10, 11, 12 and 13 do not require the reconstructed pixels of the preceding block, so each of them can be pipelined with its predecessor.
In fact, in Fig. 14 (a), once the reconstruction of block 1 is finished, blocks 2 and 4 can be predicted simultaneously. We therefore propose a 2D-wave method for 4 × 4 intra prediction, as shown in Fig. 14 (c). Blocks that have the same time slot can be calculated at the same time. The sub-MBs in the left half of the MB are mapped onto RCA0, and those in the right half are mapped onto RCA1. To the best of our knowledge, this is the first application of the 2D-wave method to intra 4 × 4. In Fig. 14 (a), blocks 0, 1, 2, 4, 5, 8, and 10 require data from the previous MB, while the other blocks only require data from previously decoded sub-MBs within the current MB. Due to MB row-level data parallelism, the left neighboring pixels of the current MB can be stored inside the RPU. On-chip caches are used only for storing the corner and top neighboring pixels of the current MB. The cache size is one row of HD resolution, i.e. 1088 pixels of 8 bits each. After intra prediction is performed, the contents of the cache are overwritten with the pixels from the bottom row of the current MB. In theory, our method can improve performance by up to 37.5% over the standard serial processing order.
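The 37.5% bound can be reproduced with a small scheduling sketch. Since Fig. 14 (c) is not reproduced here, the slot assignment below is an assumption: it is the conventional wavefront rule in which a block at column x and row y waits for its left, top-left, top and top-right neighbours, giving slot x + 2y. It also assumes that every block takes exactly one slot. The function names are illustrative.

```python
def wave_slots(cols=4, rows=4):
    """Earliest time slot of each 4x4 block in a cols x rows MB grid.

    Assumption: a block at column x, row y waits for its left, top-left,
    top and top-right neighbours, which gives slot = x + 2 * y.
    """
    return [[x + 2 * y for x in range(cols)] for y in range(rows)]


def rca_of(x, cols=4):
    """Left half of the MB runs on RCA0, right half on RCA1."""
    return 0 if x < cols // 2 else 1


slots = wave_slots()
n_blocks = 16
n_slots = max(max(row) for row in slots) + 1   # length of the wave in slots
saving = (n_blocks - n_slots) / n_blocks       # fraction of slots saved

for row in slots:
    print(row)
print(n_slots, saving)                          # prints: 10 0.375
```

With this assignment, block 2 (column 0, row 1) and block 4 (column 2, row 0) share slot 2, directly after block 1, and the 16 blocks finish in 10 slots instead of 16. Each slot holds at most one block from the left half and one from the right half, which matches the two-RCA mapping described above.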
Deblocking
Algorithm Analysis
In H.264/AVC, an in-loop deblocking filter is used to reduce annoying blocking distortions that are caused by the block-based integer discrete cosine transform (DCT) in intra/inter-frame prediction errors and MC predictions [33]. The comparatively high complexity of the loop filter, which easily accounts for one-third of the computational complexity of a decoder, is mainly due to the highly adaptive nature of the filter and the small block size employed for the DCT/IDCT.
The deblocking process can be divided into two main stages: (1) BS parameter calculation, and (2) FIR filtering. For an edge between two 4 × 4 luma sample blocks, the BS value ranges from 0 to 4 and depends on the modes and coding conditions of the two adjacent blocks. The BS values of a chroma edge are the same as those of the corresponding luma edge. Three types of filtering are employed, based on the BS values: a strong filter is chosen when the BS is equal to 4, a weak filter is chosen for BS values from 1 to 3, and a value of 0 corresponds to no filtering. The luma and chroma filters are different, and each filtering operation affects up to three samples on either side of the boundary. As shown in Fig. 15, the filter is applied to luma 4 × 4 block edges and chroma 2 × 2 block edges in the vertical and horizontal directions. Note that the chroma 8 × 8 block is also divided into 4 × 4 blocks, but the edge filtering operation is executed in units of 2 × 2 blocks.
For an MB, the deblocking filter is implemented in the following order: horizontal filtering of vertical edges from left to right, then vertical filtering of horizontal edges from top to bottom. The results of horizontal filtering will be modified by vertical filtering again.
From the above analysis, we can observe several characteristics of the algorithm that obstruct the performance improvement of deblocking on the REMUS-II:
(1) Almost every sample in a picture must be loaded from memory, either to be modified or to determine if neighboring samples will be modified. This requires large memory access bandwidth. (2) Filter order determines data dependence and obstructs parallelization of the sub-block. However, the vertical edges of vertically adjacent sub-blocks are uncorrelated. The same situation exists for the horizontal edges of horizontally adjacent sub-blocks.
Mapping Strategy of Deblocking onto REMUS-II
First, to resolve the problem of large memory access bandwidth, we have to reuse data in internal memory as much as possible. According to our MB row-level data parallelism, the current MB is stored in the RPU. To complete the deblocking filter, we also need data from the top and left neighbors in addition to the current MB. For luma, if we obtain these data on chip, we need an upper MB buffer whose size is four rows of a picture. We also need a 64-byte buffer to store data from the left MB. The advantage and the shortcoming of this solution are obvious: external memory access is reduced, but a large buffer is needed to decode HD video. Finally, we decided to obtain the left MB data from the RIM and the upper MB data from external memory. Each time the deblocking filter completes, the rightmost column data of the MB are stored in the RPU for the next MB. Except for the rightmost column MB of the current picture, the rightmost column data of the MB will not be stored in external memory.
Second, to eliminate obstacles in the mapping process, a few C source-level modifications and transformations need to be applied to expose greater parallelism. BS calculation is a control-intensive process that is not suitable for RCAs. In the original code, the workflow is as follows: calculate the four BS values of a luma edge (such as a0 to a3, as shown in Fig. 15 (a)), and then filter the current edge. If the current chroma edge also needs to be filtered, perform two chroma edge filter operations (Cb, then Cr). Then, calculate the four BS values for the next edge and filter it, and so on until the MB is complete. Because the two stages in the original code are coupled too closely, the RPU often had to wait for the result of the BS calculation. This wait time caused an unacceptable degradation of performance.
Therefore, the key point is to transform the original C code into two independent stages: (1) calculate all BS values of the MB, and (2) select the filter according to the calculated BS values. On REMUS-II, the BSs are calculated by SPEA to meet the requirements of HD decoding.
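The decoupling into two stages can be sketched as below. This is not the authors' code: the block descriptors are hypothetical dictionaries, and `compute_bs` is a simplified boundary-strength rule for frame pictures (intra, coded coefficients, then reference and MV differences in quarter-pel units). The point of the sketch is only the structure: stage 1 produces all BS values up front, and stage 2 consumes them without any further control decisions on block modes.

```python
def compute_bs(p, q, is_mb_edge):
    """Simplified boundary strength for two neighbouring 4x4 blocks.

    p and q are dicts with hypothetical keys: intra, coded, ref, mv.
    """
    if p["intra"] or q["intra"]:
        return 4 if is_mb_edge else 3
    if p["coded"] or q["coded"]:
        return 2
    if p["ref"] != q["ref"]:
        return 1
    if abs(p["mv"][0] - q["mv"][0]) >= 4 or abs(p["mv"][1] - q["mv"][1]) >= 4:
        return 1
    return 0


def bs_stage(edges):
    """Stage 1 (control-intensive): compute every BS value of the MB first."""
    return [compute_bs(p, q, mb_edge) for (p, q, mb_edge) in edges]


def filter_stage(bs_values):
    """Stage 2 (data-intensive): pick a filter per edge from the stored BS."""
    kinds = {0: "none", 4: "strong"}
    return [kinds.get(bs, "weak") for bs in bs_values]


inter = {"intra": False, "coded": False, "ref": 0, "mv": (0, 0)}
intra = {"intra": True, "coded": True, "ref": 0, "mv": (0, 0)}
moved = {"intra": False, "coded": False, "ref": 0, "mv": (8, 0)}

bs = bs_stage([(intra, inter, True), (inter, inter, False), (inter, moved, False)])
print(bs, filter_stage(bs))   # prints: [4, 0, 1] ['strong', 'none', 'weak']
```

With the stages separated, the filtering stage never stalls on a BS decision, so the two stages can run on different processing resources and overlap across MBs.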
A filter example is shown in Algorithm 1. The original chroma weak filter code contains a branch in the innermost loop. This branch is an obvious drawback that should not be neglected: we must transform control dependencies into data dependencies. Overall, there are two approaches to mapping branch operations onto a CGRA: (1) with predicated execution, we first calculate the conditional flag and then decide which branch to perform according to the flag; and (2) with speculative execution, the CGRA executes all branches at the same time and chooses one result based on the final branch flag. Because our RCs do not support branch or predicated execution, we adopt the speculative execution method. In the speculative method, filtering can be performed concurrently with condition checking to improve performance. By adding one flag to the conditional branch and moving the calculation out of the branch (as shown in the branch-coalescing version of Algorithm 1), the CGRA can perform the following calculations in parallel: (1) calculating the condition to obtain the flag value, (2) calculating the original expressions in the if-branch, and (3) calculating the expressions on the other side of the branch. The final result is obtained by multiplying the flag value with the result expression. Furthermore, for nested conditions in the luma deblocking filters, several flag variables are introduced and several multiplications are performed to obtain the final result.
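Since Algorithm 1 is not reproduced in this section, the following sketch stands in for it. It uses the H.264 chroma filter equations for BS below 4 as we understand them, with the clipping threshold `tc` passed in directly; the function names are illustrative. The first function keeps the branch, and the second computes the flag and both outcomes independently and then blends them by multiplication.

```python
def clip3(lo, hi, v):
    """Clamp v to the range [lo, hi]."""
    return max(lo, min(hi, v))


def chroma_weak_branch(p1, p0, q0, q1, alpha, beta, tc):
    """Reference version: the filter sits inside a conditional branch."""
    if abs(p0 - q0) < alpha and abs(p1 - p0) < beta and abs(q1 - q0) < beta:
        delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3)
        p0 = clip3(0, 255, p0 + delta)
        q0 = clip3(0, 255, q0 - delta)
    return p0, q0


def chroma_weak_coalesced(p1, p0, q0, q1, alpha, beta, tc):
    """Branch-free version: flag and filtered values are computed independently."""
    # (1) condition evaluated as a 0/1 flag
    flag = int(abs(p0 - q0) < alpha) * int(abs(p1 - p0) < beta) * int(abs(q1 - q0) < beta)
    # (2) expressions of the if-branch, always evaluated
    delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3)
    p0_f = clip3(0, 255, p0 + delta)
    q0_f = clip3(0, 255, q0 - delta)
    # (3) the other side of the branch is the unfiltered sample; blend by flag
    return flag * p0_f + (1 - flag) * p0, flag * q0_f + (1 - flag) * q0


print(chroma_weak_coalesced(100, 102, 110, 111, 20, 5, 2))   # prints (104, 108)
print(chroma_weak_coalesced(100, 102, 110, 111, 5, 5, 2))    # prints (102, 110)
```

The three numbered computations have no dependence on each other until the final blend, so they can be placed on different RCs and evaluated in the same cycles.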
Third, the deblocking filter follows a fixed processing order in each direction. Within this rule, four types of processing orders have been proposed to improve performance [34]-[36]. Most of the 4 × 4 blocks need to be filtered four times with their adjacent blocks (left, right, top, and bottom). To take full advantage of the four RCAs in the RPU and of the relative data independence that exists between different rows, between different columns, and between luma and chroma, we propose a novel parallel processing order (as shown in Fig. 16 (a)). Edges with the same number can be filtered in parallel. For a luma MB, we always divide the MB into two 16 × 8 blocks and map them onto RCA0 and RCA1 to horizontally filter the vertical edges in parallel. Thanks to the RIM transpose model (as shown in Fig. 16 (b)), we can implement vertical filtering of horizontal edges with the same operations as horizontal filtering of vertical edges, without the need for a separate data transpose operation. The chroma filtering is mapped onto RCA2 and RCA3. Because of the simplicity of the chroma filtering algorithm, the chroma filtering cycles are hidden by the luma filtering.
Lastly, except when the BS value equals 4, the four BS values of an edge may not be equal. In this case, a weak filter and no filter may coexist on one edge, and the order in which the different BS values appear is uncertain. Table 3 gives the statistical distribution of BS values for HD video. Although BS=0 is the most frequent value except in I-slices, the occurrence of BS=0 makes the processing non-uniform, because the weak filter is used only when BS equals 1 to 3. The consequence is that the RCA can only handle the edge of one 4 × 4 block, suspend the RCA pipeline, wait for the µPEA to generate a filter context for the next 4 × 4 edge, reconfigure the RPU according to this new context, filter the next edge, and so forth. This process causes large scheduling and reconfiguration overhead, leading to an unacceptable reduction in performance. Hence, we merge the no-filter case into the weak filter to form one combined filter. In essence, this combination is equivalent to transforming control-intensive tasks into computation-intensive ones. In this manner, we can now generate contexts and implement deblocking in units of an MB, which greatly enhances overall performance at the cost of slightly increased RCA computing time.
Implementing Results
The design of REMUS-II was synthesized with the TSMC 65 nm logic process. The REMUS-II area is 30.1 mm². To obtain accurate performance figures, we used RTL-level simulation to decode several H.264 test bitstreams. Each sequence used the YUV 4:2:0 format, consisted of 200 frames, and was encoded at HiP 1920 × 1080@30 fps. Table 4 shows the execution cycles of the H.264 kernel subtasks in selected cases on the RPU. Due to the complexity of MC 4 × 4, more than 823 cycles may be needed to complete the interpolation of an MB in a B slice when the MB is partitioned into luma 4 × 4 blocks. However, in HD bitstreams, the proportion of 4 × 4 cases is close to zero. Therefore, the average execution time of one MB is less than 800 cycles. Table 5 shows the performance results of a few representative sequences decoded on REMUS-II. The results show that we can achieve real-time H.264/AVC HD decoding on REMUS-II running at under 200 MHz. Table 6 summarizes the performance comparison, as well as relevant parameters, for implementations of real-time H.264 decoding on typical state-of-the-art hardware architecture platforms.
Table 6 Comparison of implementing H.264 decoding on different architectures.
Table 4 The execution time of different cases.
ASIC [5] achieves better performance and area through optimization from both the algorithmic and the architectural perspective. However, an ASIC can only support a single video standard due to its lack of flexibility. The platform-based SoC [22], which consists of a RISC and an ASIC decoder core, targets mobile applications such as handheld video players. Its important design considerations are size, power consumption and cost, not performance and flexibility. TI's OMAP3430 is a multimedia processor that is widely used in portable devices [4]. The OMAP3430 provides high processing power through an improved DSP along with an improved Imaging and Video Accelerator (IVA), which is specifically designed to address the requirements of high-end video systems. However, only WVGA decoding can be achieved at 430 MHz. ADRES [15], which is one of the most successful reconfigurable architectures, supports 30 fps H.264 decoding at resolutions up to D1 when using a VLIW processor with a 7 × 7 CGRA accelerator. The commercial CGRA XPP-III [16], which contains 40 ALU-PAEs (Processing Array Elements), 16 RAM-PAEs and 8 VLIW-class FNC-PAEs, can only implement 24 fps main profile full-HD decoding at 400 MHz. REMUS-II, which is also a CGRA system, reflects the inherent advantages of a reconfigurable architecture. REMUS-II clearly outperforms the other CGRAs such as ADRES and XPP-III, and its performance is close to that of the state-of-the-art ASIC. Additionally, a 1080p@30 fps AVS Part-2 Jizhun Profile@Level 4 decoder, a 1080p@30 fps MPEG-2 MP@High-level decoder, a CIF@30 fps H.264/AVC encoder and GPS baseband processing algorithms have been implemented on REMUS-II by simply changing contexts.
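As a sanity check on the figures in this section, the per-MB cycle budget and the frequency-normalized throughput can be computed directly. The sketch below assumes that a 1080p frame is coded as 120 × 68 MBs (height padded to 1088) and that all cycles are spent on MB processing. The comparison with XPP-III is one plausible reading of the 175% figure and ignores the difference between main and high profile. The function names are illustrative.

```python
MB_SIZE = 16


def mbs_per_frame(width, height):
    """Macroblocks per frame, with each dimension rounded up to a multiple of 16."""
    return -(-width // MB_SIZE) * -(-height // MB_SIZE)


def cycles_per_mb_budget(freq_hz, fps, width=1920, height=1080):
    """Average number of cycles available per MB to sustain fps at freq_hz."""
    return freq_hz / (fps * mbs_per_frame(width, height))


def fps_per_mhz(fps, freq_mhz):
    """Frame rate normalized by clock frequency."""
    return fps / freq_mhz


mbs = mbs_per_frame(1920, 1080)                        # 120 * 68
budget_30 = cycles_per_mb_budget(200e6, 30)            # budget for 30 fps
budget_33 = cycles_per_mb_budget(200e6, 33)            # budget for 33 fps
ratio = fps_per_mhz(33, 200) / fps_per_mhz(24, 400)    # REMUS-II vs XPP-III

print(mbs, round(budget_30), round(budget_33), round(ratio, 2))
# prints: 8160 817 743 2.75
```

An average of fewer than 800 cycles per MB therefore fits within the roughly 817-cycle budget for 30 fps at 200 MHz, and 33 fps corresponds to about 743 cycles per MB. A normalized ratio of 2.75 is consistent with an improvement of approximately 175% over XPP-III.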
Conclusion
We implemented an H.264 decoder on REMUS-II, a novel CGRA SoC. After comparing different partitioning methods and analyzing the features of the H.264 algorithms and the REMUS-II architecture, we proposed a novel parallelism scheme called 'Hybrid partitioning'. Further, on the sub-MB level, we proposed a variety of strategies such as loop coalescing, loop unrolling, computational complexity reduction, Hybrid VBSMC, block-level 2D-wave and parallel processing order to exploit parallelism. Our mapping strategies aim not only to maximize data reuse and computation parallelism, but also to reduce configuration scheduling and data communication between the RPU and external memory. They can serve as a useful reference for implementing similar stream-based applications on coarse-grained reconfigurable platforms. The experimental results show that 1080p@33 fps H.264 HiP@Level 4.1 decoding can be achieved when operating at 200 MHz on the final architecture. The design occupies 30.1 mm² of silicon area when using the TSMC 65 nm logic process. A comparison with typical state-of-the-art hardware architectures, shown in Table 6, shows that our implementation is more efficient in performance and flexibility.
