As VLSI technology continues to advance, ASICs can easily achieve the required performance, and most of them are actually over-designed. Thus, architecture 
Introduction
H.264/AVC [1] is a new video coding standard developed for efficient video compression and reliable data transport. It outperforms prior standards such as H.263 and MPEG-4, and its applications include videoconferencing, video telephony, digital TV, and DVD. The H.264/AVC video encoder uses four kinds of 4-by-4 2-D transforms: forward, inverse, Hadamard, and inverse Hadamard. These four transforms are very similar, so this paper discusses only the forward transform, which is an approximation of the 2-D discrete cosine transform (DCT) [2] with the scaling multiplications integrated into the quantizer.

Many works address the algorithms and architectures for the DCT, such as [3] [4], but only a few focus on DCTs with small block sizes, such as those for H.264/AVC [5] [6] [7]. Moreover, in today's VLSI technology it is the interconnections, not the functional units, that dominate circuit performance (i.e. speed, silicon area, and even power consumption). Choosing an algorithm with fewer operations therefore no longer guarantees lower hardware complexity. In addition, the 2-D transforms in H.264/AVC can be carried out with only additions and shifts, and such simple operations make the interconnection overheads even more significant.

This paper discusses the area efficiency of the architecture mapping of several algorithms for the forward transform and shows that fewer operations do not necessarily result in smaller designs. In our experiments with 0.18µm CMOS technology, the most straightforward matrix multiplication, without separable 2-D operation or fast algorithm, has the best area efficiency for D1-size (720×480) video at 30fps. It saves 48%, 34%, and 16% of the silicon area of the previous works [5] [6] [7] respectively.

The rest of this paper is organized as follows. Section 2 reviews the algorithms for the 2-D transform in H.264/AVC. Section 3 describes their architecture mapping. Several designs are implemented via effective architecture mapping of the algorithms, and the results and area comparison are presented in Section 4. Section 5 concludes this paper.
Algorithms for Forward Transform

Separable or direct 2-D transforms
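The forward transform in H.264/AVC is defined as

$$Y = C X C^{T} = M C^{T}, \qquad M = C X, \tag{1}$$

where X and Y denote the 4-by-4 input and output matrices respectively, M denotes the intermediate result matrix, and the coefficient matrix C is defined as

$$C = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}.$$

Eq. (1) can be evaluated as a separable 2-D transform: four 1-D transforms on the columns of X produce M, and four 1-D transforms on the rows of M produce Y. Each 1-D transform maps a 4-element vector u to

$$v = C u. \tag{2}$$

Alternatively, we can rearrange X and Y into 16-element vectors, combine the column and row transforms into a single 16-by-16 coefficient matrix,

$$\operatorname{vec}(Y) = (C \otimes C)\,\operatorname{vec}(X), \tag{3}$$

and compute the output matrix Y directly by performing the matrix-vector multiplication on the elements of X. This is the direct 2-D transform.

Since the coefficients are very simple and require only trivial multiplications by shifts, the computational complexities of these two algorithms can be compared by the number of required additions and subtractions. The separable 2-D forward transform needs eight 1-D transforms as in Eq. (2), each of which requires twelve additions/subtractions (four outputs, each the sum of four terms), for 96 additions/subtractions in total. The direct 2-D forward transform, on the other hand, requires 240 additions/subtractions (16 outputs, each the sum of 16 terms).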
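The equivalence of the separable and direct formulations, and the two operation counts quoted above, can be checked with a short sketch. It assumes the standard H.264/AVC core-transform coefficients for C and a row-major rearrangement of X and Y into vectors. The helper names (`separable_forward`, `direct_forward`, `additions`) are illustrative and do not come from the paper.

```python
# Assumed coefficient matrix of the H.264/AVC 4x4 forward core transform.
C = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def separable_forward(x):
    """Separable 2-D transform: column transforms, then row transforms."""
    m = matmul(C, x)                 # four 1-D transforms on the columns of X
    return matmul(m, transpose(C))   # four 1-D transforms on the rows of M

def kron(a, b):
    """Kronecker product, giving the 16x16 coefficient matrix."""
    n = len(b)
    size = len(a) * n
    return [[a[i // n][j // n] * b[i % n][j % n] for j in range(size)]
            for i in range(size)]

def direct_forward(x):
    """Direct 2-D transform: one 16x16 matrix-vector multiplication."""
    k = kron(C, C)
    v = [x[i][j] for i in range(4) for j in range(4)]   # row-major vec(X)
    y = [sum(k[r][c] * v[c] for c in range(16)) for r in range(16)]
    return [y[4 * i:4 * i + 4] for i in range(4)]

def additions(matrix):
    """Additions/subtractions for a matrix-vector product: terms - 1 per row."""
    return sum(sum(1 for c in row if c != 0) - 1 for row in matrix)

X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(separable_forward(X) == direct_forward(X))
print(8 * additions(C), additions(kron(C, C)))
```

Counting one addition or subtraction per extra nonzero term in each row reproduces the 96 and 240 figures, since every coefficient in both matrices is nonzero.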
Algorithmic strength reduction
Both the 1-D and the direct 2-D forward transform algorithms have symmetric coefficients in their transform matrices, which can be exploited to rearrange the computations and reduce the number of additions/subtractions. For example, the 1-D forward transform can save four additions as shown in Fig. 1, so each 1-D transform needs only eight. Thus, the separable 2-D and fast forward transform, which is adopted in [5] and [6], requires only 64 additions/subtractions. The direct 2-D forward transform can be rearranged in a similar way, as adopted in [7]. Interestingly, this direct 2-D and fast forward transform also requires 64 additions/subtractions after the computation rearrangements. Table 1 summarizes the computational complexities of the above four forward transform algorithms.
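The saving in the 1-D case can be illustrated with a butterfly rearrangement. The sketch below assumes the standard H.264/AVC coefficients and a common two-stage butterfly; whether it matches Fig. 1 exactly is an assumption, and the function names are illustrative.

```python
def forward_1d_direct(x):
    """Direct 1-D transform: 4 outputs x 3 additions/subtractions = 12."""
    x0, x1, x2, x3 = x
    return [
        x0 + x1 + x2 + x3,
        2 * x0 + x1 - x2 - 2 * x3,
        x0 - x1 - x2 + x3,
        x0 - 2 * x1 + 2 * x2 - x3,
    ]

def forward_1d_fast(x):
    """Butterfly form: 8 additions/subtractions and two shifts."""
    x0, x1, x2, x3 = x
    # Stage 1: four additions/subtractions shared by all outputs.
    s03 = x0 + x3
    d03 = x0 - x3
    s12 = x1 + x2
    d12 = x1 - x2
    # Stage 2: four more additions/subtractions; x2 is a left shift.
    return [
        s03 + s12,
        (d03 << 1) + d12,
        s03 - s12,
        d03 - (d12 << 1),
    ]

def forward_2d_fast(block):
    """Separable fast 2-D transform: 8 butterflies x 8 = 64 additions."""
    cols = [forward_1d_fast(col) for col in zip(*block)]    # column transforms
    rows = [list(r) for r in zip(*cols)]                    # back to row order
    return [forward_1d_fast(row) for row in rows]           # row transforms

print(forward_1d_fast([3, -1, 4, 2]) == forward_1d_direct([3, -1, 4, 2]))
```

The first stage computes the sums and differences that the symmetric coefficients let all four outputs share, which is where the four additions are saved.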
Architecture Mapping
The most straightforward method to translate an algorithm into its hardware implementation is to allocate a dedicated functional unit to each operation. However, this is too costly for many practical applications, especially when advanced fabrication technology is used. Architecture shrinking, which time-multiplexes several operations on shared resources, is a commonly used technique to reduce such waste.
Folding [8] is a systematic methodology for mapping DSP algorithms onto hardware architectures. To clarify the following discussion, we classify the architectures into two major categories: data-parallel and data-serial. The former handles multiple input data concurrently, while the latter processes a single input datum at a time. We also use the term scaling-down factor (SF) extensively to indicate the number of steps the target architecture takes to perform the 4-by-4 forward transform. For example, the aforementioned "most straightforward architecture mapping" that dedicates a functional unit to each operation has SF=1. Note that a higher SF only implies that fewer functional units are required in the hardware implementation. It also incurs extra multiplexers and sometimes additional registers. These overheads offset the benefits, not to mention the fact that interconnection dominates the performance in today's deep-submicron VLSI technology.
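Because SF is the number of steps spent on each 4-by-4 block, it also fixes the clock rate needed for a given video format. The following sketch works through that arithmetic. It assumes one step per clock cycle and 4:2:0 chroma sampling (1.5 samples per pixel), neither of which is stated in the text, and the function name is illustrative.

```python
def required_clock_hz(width, height, fps, sf,
                      samples_per_pixel=1.5, block_samples=16):
    """Minimum clock rate if each 4x4 block takes `sf` cycles.

    samples_per_pixel=1.5 models 4:2:0 video (luma plus two subsampled
    chroma planes); this is an assumption, not taken from the paper.
    """
    blocks_per_second = width * height * samples_per_pixel * fps / block_samples
    return blocks_per_second * sf

# D1 (720x480) at 30 fps with SF = 256.
print(required_clock_hz(720, 480, 30, 256) / 1e6, "MHz")   # → 248.832 MHz
```

Under these assumptions, an SF=256 design needs roughly 249 MHz for D1 video at 30fps, which is comparable to the 250MHz clock rate quoted in the conclusions.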
Data-parallel architectures
A direct interconnection network is included for transposition of the intermediate results. Fig. 2(b) shows a shrunk version of Fig. 2(a) with SF=4. It requires only two 1-D transform modules, and the four column transforms and the four row transforms are performed in the corresponding modules respectively. In addition, the original memory-less transpose network becomes a transpose memory with 16 registers after the architecture shrinking. The architecture shrinking in this case is very efficient, since it incurs only an additional output de-multiplexer.
Data-serial architectures
Data-serial architectures process one input sample at a time and thus have SF≥16 for the 4-by-4 2-D forward transform. Fig. 4(a) shows the data-serial architecture for the direct 2-D and direct forward transform algorithm in Eq. (3) with SF=16. In each cycle, this architecture takes one input sample, multiplies it by the 16 coefficients in the corresponding column of the 16-by-16 coefficient matrix in Eq. (3), and accumulates the 16 products into 16 registers respectively. The accumulation registers therefore hold the 16 outputs of the forward transform after 16 cycles. Fig. 4(b) shows our proposed area-efficient architecture for the 2-D forward transform in H.264/AVC with SF=256. This architecture is a shrunk version of Fig. 4(a), obtained by folding it 16 times. Here, each sample is held at the input for 16 cycles; in each of these cycles it is multiplied by one of its 16 coefficients, and the product is accumulated into the corresponding output register. After 256 cycles, the outputs are ready in the 16 accumulation registers.
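The two data-serial schedules described above can be modelled cycle by cycle in software. This is a behavioural sketch only, not the authors' hardware. It assumes the standard H.264/AVC coefficients, row-major sample order, and a 16-by-16 coefficient matrix formed as the Kronecker product of C with itself. All function names are illustrative.

```python
# Assumed coefficient matrix of the H.264/AVC 4x4 forward core transform.
C = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def coeff(out_idx, in_idx):
    """Entry of the 16x16 matrix for row-major output and input indices."""
    i, j = divmod(out_idx, 4)
    k, l = divmod(in_idx, 4)
    return C[i][k] * C[j][l]      # always +/-1, +/-2 or +/-4: sign and shift

def serial_sf16(samples):
    """SF=16 model: 16 accumulators all update in the same cycle."""
    acc = [0] * 16
    cycles = 0
    for n, x in enumerate(samples):       # one new input sample per cycle
        for r in range(16):               # 16 parallel accumulate units
            acc[r] += coeff(r, n) * x
        cycles += 1
    return acc, cycles

def serial_sf256(samples):
    """SF=256 model: one shared adder, one accumulation per cycle."""
    acc = [0] * 16
    cycles = 0
    for n, x in enumerate(samples):       # sample held for 16 cycles
        for r in range(16):               # one output register per cycle
            acc[r] += coeff(r, n) * x
            cycles += 1
    return acc, cycles

block = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
samples = [v for row in block for v in row]
out16, cyc16 = serial_sf16(samples)
out256, cyc256 = serial_sf256(samples)
print(out16 == out256, cyc16, cyc256)     # → True 16 256
```

Both models perform the same 256 accumulations; the only difference is how many of them share a cycle, which is what folding trades against the number of adders.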
Simulation Results
In this section, we evaluate the area efficiency of our proposed data-serial architecture for the 4-by-4 2-D forward transform in H.264/AVC against some typical designs based on the algorithms and architectures classified in Sections 2 and 3.
The designs under investigation are listed in Table 2 with their abbreviated names. Each name encodes, in order, the 2-D scheme (S for separable, D for direct), the computation (F for fast, D for direct), the architecture (P for data-parallel, S for data-serial), and the SF. D/D/S/256 represents the proposed area-efficient design depicted in Fig. 4(b), while S/F/P/1, S/F/P/4 and D/F/P/2 denote the three previous works [5], [6] and [7] respectively. Data-serial architectures for the separable and the direct 2-D fast algorithms are derived for reference by applying ASAP operation scheduling and forward-backward register allocation with minimum registers [8]. These two designs are S/F/S/64 and D/F/S/64, and both of them have only a single adder, like our proposed D/D/S/256. An additional D/F/P/1 is designed to evaluate the architecture shrinking of D/F/P/2 [7] and to provide a fair comparison with S/F/P/1 [5]. Finally, D/D/S/128 is constructed to show the performance scalability of our proposed architecture. The eight designs in Table 2 ns.
The clock period constraints and the minimum area reported by the Synopsys Design Compiler for the eight designs under investigation are shown in Table 3. The designs are listed in descending order of their reported area.

Design         Clock period   Area
S/F/P/4 [6]    256 ns         158,051.3
D/F/P/2 [7]    512 ns         123,927.
Conclusions
This paper reviews the algorithms and architectures for the 4-by-4 2-D forward transform in H.264/AVC and describes an area-efficient data-serial architecture for the transform.
Owing to the fast functional units in the advanced 0.18µm technology and the regular dataflow, the proposed design, which uses direct matrix multiplication without any fast algorithm or separable 2-D operations, is the most area-efficient one for applications with lower pixel rates than D1@30fps. It saves 48%, 34%, and 16% of the silicon area of the previous works [5] [6] [7] respectively. Our experimental results in the UMC 0.18µm CMOS technology show that separable 2-D transform algorithms seem unable to perform well for small block sizes. Therefore, high-performance applications such as HDTV and cinema video may adopt the direct 2-D and fast forward transform algorithm [7]. For low-cost and area-critical codecs, the proposed data-serial architecture with the direct 2-D and direct forward transform algorithm is an effective alternative. Although our proposed design is the most area-efficient one for many practical applications (throughput < D1@30fps), the required clock rate (250MHz) is extremely high. It wastes significant power and may not be acceptable in many embedded systems. We are studying the power issues and trying to identify the application ranges in which our proposed approach has low-power advantages. In the future, we will study circuit techniques to reduce the clock overheads and broaden its applications.
