Abstract We describe an H.264 decoder implemented with our design methodology, in which a transaction-level system function model is first captured in SystemC and then refined into RTL with a library of communication templates. We determined its communication architecture by exploring the design space with template-based communication refinement so that it meets the requirement of decoding VGA pictures at 30 frames per second at a clock frequency of 50 MHz.
II. DESIGN FLOW
The design flow we used in designing the H.264 decoders consists of two steps, function modeling and architecture refinement, as shown in Figure 1. In the function modeling step, the designer captures a system-level function model of the H.264 decoder at the transaction level using SystemC (Figure 1a). In the refinement step, we refined the computation models first. Each computation TLM can be refined into an RTL model in an HDL, either manually or with a C-to-RTL synthesis tool. Note that we used computation models described manually in RTL because our design target was performance critical.
In the communication refinement step, for each channel in the design, we selected a suitable architecture template from the CATtree library, configured its parameters to meet the design constraints, and replaced the original channel with it. We verified the functional correctness of the refined system model with transaction-level simulation. When part of the system was refined into RTL, we checked its function and performance with our mixed-level simulation environment [3].
III A. Function-Level Models
The CAT library in our design environment supports several types of channels: FIFO, array, event, and variable channels.
A FIFO channel provides point-to-point, ordered, and synchronized data transmission, which is suitable for modeling a channel between two computation blocks that have a data dependency. An event channel provides point-to-point event notification without data transmission, which is useful for modeling a channel between two computation blocks that have a control dependency. An array channel models data storage that is addressable by index and does not include any synchronization function. We support three kinds of array channels: 1D, 2D, and 3D arrays. A variable channel models data storage with multiple writers and readers and does not have any synchronization function either.
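The difference between a synchronized FIFO channel and an unsynchronized array channel can be illustrated with a minimal plain C++ sketch. It is not the CAT library's interface: the class names `FifoChannel` and `Array2DChannel` and their methods are hypothetical, and the blocking behavior of a real SystemC channel is replaced here by a boolean return value so that the example stays single-threaded.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for a FIFO channel: point-to-point, ordered, bounded.
// A real transaction-level channel would suspend the caller when the FIFO is
// full or empty; this sketch reports that condition through the return value.
template <typename T>
class FifoChannel {
public:
    explicit FifoChannel(std::size_t depth) : depth_(depth) { assert(depth > 0); }

    // Returns false when the FIFO is full (the writer would have to wait).
    bool try_write(const T& value) {
        if (buf_.size() == depth_) return false;
        buf_.push_back(value);
        return true;
    }

    // Returns false when the FIFO is empty (the reader would have to wait).
    bool try_read(T& out) {
        if (buf_.empty()) return false;
        out = buf_.front();
        buf_.pop_front();
        return true;
    }

private:
    std::size_t depth_;
    std::deque<T> buf_;
};

// Hypothetical stand-in for a 2D array channel: index-addressable storage
// with no synchronization between the writer and the reader.
template <typename T>
class Array2DChannel {
public:
    Array2DChannel(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    void write(std::size_t r, std::size_t c, const T& v) {
        assert(r < rows_ && c < cols_);
        data_[r * cols_ + c] = v;
    }

    T read(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

private:
    std::size_t rows_, cols_;
    std::vector<T> data_;
};
```

In this sketch the FIFO enforces ordering and back-pressure, whereas the array can be read at any index at any time, so any ordering between producer and consumer has to come from elsewhere, for example from a separate event channel.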
B. Architecture-Level Models
A CATtree is provided for each communication primitive. Its root holds a SystemC TLM as the function-level model, which contains no architecture details. During the communication refinement step, we therefore refine each channel to a specific architecture by selecting one of the CATs in its corresponding CATtree so that the design constraints of the specific application are met. The CATs may include the following parameters.
(1) Buffer's memory size: FIFO depth or array size
(2) Buffer's memory type: register, on-chip memory, or off-chip memory
(3) Cache configuration: cache size and cache line size
(4) Bus topology: point-to-point or shared bus
(5) Bus arbitration scheme: priority or round robin
Changing a single architecture parameter in a CAT can change its performance and area significantly. We modeled the decoded frame buffer of the H.264 decoder with a 3D array channel that uses an off-chip SDRAM as its buffer memory. To find a good architecture, we varied the size of the cache embedded in the array channel as well as its cache line size, and measured the average number of clock cycles per macroblock required to transfer data from the decoded frame buffer to the Inter Prediction block. As shown in Figure 3, the performance varies from 787 cycles to 2667 cycles, and the logic gate count varies from 14.7K to 70.3K. In this experiment, we used the foreman bit-stream of seven QCIF pictures.
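As an illustration only, the CAT parameters listed in this section could be grouped into a configuration record such as the one below. All type and field names (`CatConfig`, `CacheConfig`, `cache_lines`, and so on) are hypothetical and are not taken from the CAT library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical enumerations mirroring the parameter choices named in the text.
enum class MemoryType { Register, OnChipMemory, OffChipMemory };
enum class BusTopology { PointToPoint, SharedBus };
enum class Arbitration { Priority, RoundRobin };

// Cache size and cache line size, both given in samples (width x height).
struct CacheConfig {
    std::size_t width;
    std::size_t height;
    std::size_t line_width;
    std::size_t line_height;
};

// One point in the design space of a single channel (assumed layout).
struct CatConfig {
    std::size_t buffer_size;   // FIFO depth or array size
    MemoryType memory;         // register, on-chip, or off-chip memory
    bool has_cache;            // whether the cache field is meaningful
    CacheConfig cache;
    BusTopology topology;
    Arbitration arbitration;
};

// Number of cache lines implied by a cache configuration.
inline std::size_t cache_lines(const CacheConfig& c) {
    assert(c.line_width > 0 && c.line_height > 0);
    assert(c.width % c.line_width == 0 && c.height % c.line_height == 0);
    return (c.width / c.line_width) * (c.height / c.line_height);
}

// Total number of samples the cache can hold.
inline std::size_t cache_capacity(const CacheConfig& c) {
    return c.width * c.height;
}
```

A design-space exploration would then enumerate such records per channel and evaluate each candidate for cycles and gate count, which is the kind of sweep the cache experiment above reports.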
IV. COMMUNICATION REFINEMENT FOR H.264 DECODERS
We designed two different H.264 baseline decoders, one for VGA at 30 frames/s and the other for CIF at 30 frames/s, both at a clock frequency of 50 MHz. To keep a margin of about 300 cycles, the decoders were targeted to decode one macroblock in 1100 cycles for VGA pictures and in 3900 cycles for CIF pictures, respectively. In addition, their critical path delays were limited to less than 15 ns.
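The per-macroblock targets follow from the clock frequency and the picture size. The sketch below reproduces the arithmetic, assuming the standard picture sizes of 640x480 for VGA and 352x288 for CIF and 16x16 macroblocks; the function names are illustrative.

```cpp
#include <cassert>

// Number of 16x16 macroblocks in a picture whose sides are multiples of 16.
inline long macroblocks_per_frame(long width, long height) {
    assert(width > 0 && height > 0);
    assert(width % 16 == 0 && height % 16 == 0);
    return (width / 16) * (height / 16);
}

// Whole clock cycles available per macroblock at a given clock and frame rate.
inline long cycles_per_macroblock(long clock_hz, long width, long height, long fps) {
    assert(fps > 0);
    return clock_hz / (macroblocks_per_frame(width, height) * fps);
}
```

With a 50 MHz clock this gives 1388 cycles per macroblock for VGA (1200 macroblocks per frame) and 4208 cycles for CIF (396 macroblocks per frame), so the targets of 1100 and 3900 cycles leave a margin of roughly 300 cycles in each case.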
As shown in Figure 4, we first captured a system function model, which is a TLM in SystemC, for the H.264 decoders. It consists of seven computation blocks: a bitstream parser (PARSER), a variable length decoder (VLD), an inverse transform and inverse quantization (ITQ) block, an intra-picture prediction (INTRA) block, an inter-picture prediction (INTER) block, a reconstruction (RECONST) block, and a deblocking filter (DF). Note that the bitstream parser controls all the computation blocks; this is not shown in Figure 4 for clarity. The computation blocks are basically pipelined at the MB level, but INTER, ITQ, INTRA, and RECONST are pipelined at the sub-MB level to reduce the buffer size of the channels.
To model the communication among the computation blocks, we used 65 FIFO channels, 18 array channels, and 3 variable channels. Note that only the important channels of the H.264 decoder are shown in Figure 4. After capturing the function model of the H.264 decoder, we manually refined each computation block into RTL. Their performance and complexity are summarized in Table 1. In designing the VGA decoder, we started with an initial configuration and went through five major refinement steps to reach the final communication architecture, as shown in Figure 5. To evaluate the system area, we synthesized the decoder in a 0.18 µm process technology with a 15 ns timing constraint. To measure the system performance, we simulated the whole H.264 decoder with a VHDL simulator, with only PARSER executed in software. In this experiment, we used the foreman bit-stream of forty QCIF pictures, where QP is 28 and the maximum number of reference frames is 15.
For the initial communication architecture, most of the FIFO channels were refined to register-implemented FIFOs [4] with a depth that enables MB-level pipelining. The decoded frame buffers were refined to a non-cached 3D array that uses an off-chip SDRAM. The line buffers were refined to an SSRAM 1D array that uses an on-chip SRAM. The other array channels were refined to a registered 1D array. In other words, as in other conventional video decoders, the initial architecture uses an off-chip memory for the frame buffers and on-chip memories for the line buffers, while the complexity of the other channels is kept to a minimum. After throughput analysis, we found that the array channel transferring data from the decoded frame buffer to the INTER block was the throughput bottleneck of the initial architecture, because the INTER block spent most of its cycles waiting to read data from the decoded frame buffer. Therefore, we decided to use an array channel with a cache to reduce the latency of the array channel that uses the off-chip SDRAM. In refinement step 1, we configured the 3D array as one with a 2D cache of size 32x32 and line size 8x8, based on the results in Figure 3. In refinement step 2, we configured the two 3D arrays for the chrominance decoded frame buffers as ones with a 2D cache of size 16x16 and line size 4x4. After refinement steps 1 and 2, the performance had improved by a factor of four, with a complexity overhead of 39 Kgates and 1.5 KB of on-chip memory.
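To make the idea of a 2D cache in front of the frame buffer concrete, the sketch below counts line fills for block reads. It is a minimal model under stated assumptions: the source does not describe the mapping or replacement policy, so a direct-mapped organization indexed by tile coordinates is assumed here, and the class name `BlockCache2D` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed organization: the cache holds (cache_w/line_w) x (cache_h/line_h)
// lines, each line being a line_w x line_h tile of samples. The tile at tile
// coordinates (tx, ty) maps to slot (tx mod slots_x, ty mod slots_y).
class BlockCache2D {
public:
    BlockCache2D(int cache_w, int cache_h, int line_w, int line_h)
        : line_w_(line_w), line_h_(line_h), slots_x_(0), slots_y_(0), misses_(0) {
        assert(line_w > 0 && line_h > 0);
        assert(cache_w % line_w == 0 && cache_h % line_h == 0);
        slots_x_ = cache_w / line_w;
        slots_y_ = cache_h / line_h;
        tags_.assign(static_cast<std::size_t>(slots_x_ * slots_y_), Tag{-1, -1});
    }

    // Returns true on a hit. On a miss the whole line would be fetched from
    // the off-chip memory, which is counted as one miss.
    bool access(int x, int y) {
        assert(x >= 0 && y >= 0);
        const int tx = x / line_w_;
        const int ty = y / line_h_;
        Tag& slot = tags_[static_cast<std::size_t>((ty % slots_y_) * slots_x_ + (tx % slots_x_))];
        if (slot.tx == tx && slot.ty == ty) return true;
        slot = Tag{tx, ty};
        ++misses_;
        return false;
    }

    // Reads a w x h reference block whose top-left sample is (x0, y0).
    void read_block(int x0, int y0, int w, int h) {
        for (int y = y0; y < y0 + h; ++y)
            for (int x = x0; x < x0 + w; ++x)
                access(x, y);
    }

    long misses() const { return misses_; }

private:
    struct Tag { int tx; int ty; };
    int line_w_, line_h_, slots_x_, slots_y_;
    long misses_;
    std::vector<Tag> tags_;
};
```

Under these assumptions, overlapping reference blocks from neighboring macroblocks reuse lines that are already resident, which is the effect the cached array channel relies on to hide SDRAM latency. The actual cycle savings depend on SDRAM timing, which this sketch does not model.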
After refinement step 2, we found that chrominance sample processing in the INTER block waited until luminance sample processing had finished in each macroblock, because luminance and chrominance processing share the same datapath in the DF block. We resolved this problem in refinement step 3 by increasing the FIFO depth from 16 bytes to 256 bytes. Consequently, the performance improved from 1049 cycles to 992 cycles, a reduction of 57 cycles, while the area increased by 18 Kgates.
Initially, the total size of the on-chip memory for the line buffers was 20 KB. To reduce it, in refinement step 4 we replaced the 1D array channels that use on-chip memory for the line buffers with 1D array channels that use the off-chip SDRAM. As a result, at a performance penalty of 65 cycles, the gate count increased by 40 Kgates and the on-chip memory size was reduced by 20 KB, which is equivalent to roughly 240 Kgates. The resulting reduction in silicon area was 1.9 mm2, which is about 44% of the total silicon area of the final VGA decoder.
Because the frame buffers and the line buffers use the same SDRAM and the same bus in this design, the performance depends on the bus arbitration scheme. The initial bus arbitration scheme was round robin. Therefore, in refinement step 5, we configured that the priority of [...] channels were scheduled by a round-robin policy, which improved the performance by 30 cycles without any penalty. After all five refinement steps, the performance had improved by about 4 times, the area had increased to 1.3 times the initial area, and the on-chip memory size had been reduced from 20 KB to 1.5 KB. Consequently, we could meet the design specification of decoding VGA pictures at 30 frames per second at 37 MHz.
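The cycle counts reported for steps 3 to 5 can be chained to check the final figure and the 37 MHz clock. The sketch below does this arithmetic, reading the 1049 cycles quoted for step 3 as the count after step 2 and assuming 1200 macroblocks per VGA frame (640x480 with 16x16 macroblocks); the function names are illustrative.

```cpp
#include <cassert>

// Cycles per macroblock after a given refinement step (2 to 5), built from the
// deltas quoted in the text.
inline long cycles_after_step(int step) {
    assert(step >= 2 && step <= 5);
    long cycles = 1049;            // after steps 1 and 2
    if (step >= 3) cycles -= 57;   // step 3: FIFO depth 16 bytes -> 256 bytes
    if (step >= 4) cycles += 65;   // step 4: line buffers moved to off-chip SDRAM
    if (step >= 5) cycles -= 30;   // step 5: bus arbitration reconfigured
    return cycles;
}

// Minimum clock frequency in Hz needed to sustain the given frame rate.
inline long min_clock_hz(long cycles_per_mb, long mbs_per_frame, long fps) {
    assert(cycles_per_mb > 0 && mbs_per_frame > 0 && fps > 0);
    return cycles_per_mb * mbs_per_frame * fps;
}
```

Chaining the deltas gives 992, 1057, and finally 1027 cycles per macroblock, and 1027 cycles x 1200 macroblocks x 30 frames/s is 36,972,000 cycles per second, which is consistent with the 37 MHz figure.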
We also designed another H.264 decoder for CIF pictures with similar refinement steps from the same initial communication architecture. Both results are summarized in Table 2. In the CIF decoder, we decided not to use a cache in the frame buffers because its performance constraint is looser than that of the VGA decoder. This reduced the system area by 0.7 mm2 while still meeting the performance constraint of the CIF decoder. Although we designed the communication architecture by refining the templates in the CATtree library rather than by designing it manually, Table 3 shows that the resulting designs have reasonable system area and performance compared with other designs.
We designed two different H.264 decoders with the CAT library. We found that our design flow was very efficient because the system function model captured for the VGA decoder could be reused directly in designing the CIF decoder. Furthermore, the communication DSE could be performed effectively with the CAT library we developed. The VGA decoder occupies an area of 4.3 mm2 and can decode a macroblock in 1027 cycles, while the CIF decoder occupies 3.6 mm2 and can decode a macroblock in 3914 cycles.
