Abstract-This paper presents a cost-effective processor core design that features the simplest hardware and is suitable for discrete cosine transform/inverse discrete cosine transform (DCT/IDCT) operations in H.263 and digital cameras. The design combines a fast direct two-dimensional DCT algorithm, bit-level adder-based distributed arithmetic, and common subexpression sharing to reduce the hardware cost and enhance the computing speed. The resulting architecture is simple and regular, so that it can be easily scaled for higher throughput requirements. The DCT design has been implemented in 0.6-μm SPDM CMOS technology and costs only 1493 gates, or 0.78 mm². The proposed design meets the real-time DCT/IDCT requirements of an H.263 codec system for the QCIF image frame size at 10 frames/s with 4:2:0 color format. Moreover, the proposed design still has spare computing power for other operations when operating at 33 MHz.
the fast direct 2-D DCT algorithms [14], [15] that are superior to row-column DCT because the number of multiplication-accumulation operations is reduced by half. However, this comes at the cost of irregular data permutation. This problem does not affect the AU-based design because of its software-like controller. Furthermore, since all multiplication-accumulation operations are expressed as shift-and-add operations, common subexpressions can be shared [21]-[25], such that they are computed only once and then reused many times. The proposed architecture thus combines bit-level DA, the fast direct 2-D DCT algorithm, and common subexpression sharing to obtain an efficient 2-D DCT/IDCT processor. The resulting implementation meets the real-time H.263 encoding requirement and scales well to higher throughput applications such as MPEG2 MP@ML decoding.
This paper is organized as follows. In Section II, we introduce the design techniques used in this paper, including the fast direct 2-D DCT/IDCT algorithms, the corresponding DA formulations, and common subexpression sharing. Section III presents the architecture design with design techniques and scheduling considerations. We will show the hardware cost and performance comparison in Section IV. Section V presents the applications and comparisons of this processor. Finally, concluding remarks are given in Section VI.
II. DESIGN TECHNIQUES

A. Fast Direct 2-D DCT/IDCT Algorithm
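The 2-D DCT and IDCT for an $N \times N$ block sequence $x(i,j)$ with $0 \le i, j \le N-1$ are defined as

$$X(u,v) = \frac{2}{N}\, C(u)\, C(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \cos\frac{(2i+1)u\pi}{2N} \cos\frac{(2j+1)v\pi}{2N} \qquad (1)$$

$$x(i,j) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} C(u)\, C(v)\, X(u,v) \cos\frac{(2i+1)u\pi}{2N} \cos\frac{(2j+1)v\pi}{2N} \qquad (2)$$

where $C(k) = 1/\sqrt{2}$ for $k = 0$ and $C(k) = 1$ otherwise.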
Since computing the above 2-D DCT/IDCT by matrix multiplication requires $N^4$ multiplications, a commonly used approach in hardware designs to reduce the computation complexity is row-column decomposition, which performs a row-wise one-dimensional (1-D) transform followed by a column-wise 1-D transform with intermediate transposition. Though row-column decomposition is simpler and more regular for hardware implementations, its computation cost is much higher than that of the fast direct 2-D algorithms [14], [15]. The fast direct 2-D algorithms [14], [15], as shown in Fig. 1, exploit trigonometric identities such that only $N$ 1-D DCTs are needed to compute a 2-D DCT, instead of the $2N$ 1-D DCTs needed in the row-column decomposition. Besides, the fast direct 2-D algorithms do not need transpose memory. However, they have several stages of butterfly additions, which makes them difficult to implement in hardware. This problem is avoided in our design by appropriate address generation. Thus, the proposed design preserves the low computation cost of the fast direct 2-D DCT algorithms and avoids the irregular routing cost.
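As a point of reference for the two baseline approaches discussed above, the following Python sketch evaluates the 2-D DCT directly from its definition and by row-column decomposition, and the two results can be compared numerically. It assumes the common normalization with scale factor 2/N and C(0) = 1/√2; the function names are illustrative only, and the sketch does not implement the fast direct 2-D algorithm of [14], [15].

```python
import math

def c(k):
    # Normalization factor: 1/sqrt(2) for the DC term, 1 otherwise.
    return 1 / math.sqrt(2) if k == 0 else 1.0

def dct_1d(x):
    # N-point 1-D DCT with orthonormal scaling sqrt(2/N).
    n = len(x)
    return [
        math.sqrt(2 / n) * c(u)
        * sum(x[i] * math.cos((2 * i + 1) * u * math.pi / (2 * n)) for i in range(n))
        for u in range(n)
    ]

def dct_2d_direct(block):
    # Straight evaluation of the double sum: N^2 outputs, N^2 products each.
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for i in range(n):
                for j in range(n):
                    s += (block[i][j]
                          * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * j + 1) * v * math.pi / (2 * n)))
            out[u][v] = (2 / n) * c(u) * c(v) * s
    return out

def dct_2d_row_column(block):
    # N row transforms, a transposition, then N column transforms (2N 1-D DCTs).
    rows = [dct_1d(list(r)) for r in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

For an 8 × 8 block, the direct form uses 8^4 = 4096 products, while the row-column form calls `dct_1d` 16 times.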
B. DA Formulation
DA [26], [27] has been regarded as an efficient computation method since it distributes the arithmetic operations rather than lumping them as multipliers do. Conventional DA (called ROM-based DA) [26], [27] decomposes the variable input of the inner product into bit level to efficiently sum up the selected precomputed data. The precomputed data are stored in a ROM table for table look-up operations, which makes ROM-based DA regular and attractive for VLSI circuits. However, the ROM area in ROM-based DA increases exponentially and becomes impractically large as the size of the inner product increases. Besides, this type of DA does not exploit the numerical properties of the constant coefficients.
Another type of DA [6] (called adder-based DA), in contrast to conventional DA, decomposes the constant operand of the inner product into bit level and distributes the multiplication operations. Adder-based DA can exploit the distribution of binary value patterns and may maximize the hardware sharing possibility in the implementation. Considering an $N$-tap inner product with input sequence $x_i$, output sequence $y$, and coefficients $a_i$, we can express the inner product formulation as

$$y = \sum_{i=1}^{N} a_i x_i. \qquad (3)$$

The inner product expression reformulated with the adder-based DA algorithm is

$$y = \sum_{j=1}^{W} \left( \sum_{i=1}^{N} a_{i,j}\, x_i \right) 2^{-j} \qquad (4)$$

where $W$ is the word length of $a_i$ and $a_{i,j}$ denotes the $j$-th bit of $a_i$. Without loss of generality, this equation is expressed in an unsigned fraction form. This formulation enables the combination of DA and subexpression sharing, since identical subexpressions can be combined and the computation is skipped whenever $a_{i,j}$ is zero. Thus, the inner product requires only addition and shift operations, which can be implemented by a sequence of shift-add operations. The DA formulation can be applied directly to DCT/IDCT designs, since the DCT/IDCT can be viewed as a collection of inner products.
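The adder-based DA idea can be illustrated with a short Python sketch. It assumes unsigned fractional coefficients given as integers `q[i]` with `a_i = q[i] / 2**w`, where bit `j = 1` is the MSB with weight 2^-1; these names and the exact-fraction arithmetic are choices made for this illustration, not part of the original design. The bit planes are accumulated from LSB to MSB, and an input is added only where the coefficient bit is one.

```python
from fractions import Fraction

def adder_based_da(q, x, w):
    """Return (sum(a_i * x_i), number of additions) using only add and halve.

    q : coefficient integers, a_i = q[i] / 2**w (unsigned fraction form)
    x : input samples
    w : coefficient word length in bits
    """
    acc = Fraction(0)
    additions = 0
    for j in range(w, 0, -1):            # bit plane j, from LSB (j = w) to MSB (j = 1)
        for qi, xi in zip(q, x):
            if (qi >> (w - j)) & 1:      # skip the work when the coefficient bit is zero
                acc += xi
                additions += 1
        acc /= 2                         # one right shift per bit plane
    return acc, additions

# Coefficients 5/8, 3/8, 6/8 applied to inputs 8, 16, -4.
y, adds = adder_based_da([5, 3, 6], [8, 16, -4], 3)
```

Here `y` equals 5 + 6 - 3 = 8 and six additions are used, one per nonzero coefficient bit, which is why reducing the number of nonzero digits directly reduces the cycle count.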
C. Common Subexpression Sharing
Since the transform coefficients in the DCT/IDCT computation are constant for fixed-point transforms, these transform computations can be simplified by expressing the multiplications as shift-and-add operations and sharing the common ones. This technique is called common subexpression sharing [21]-[25]. Fig. 2 shows a filter example with coefficients and represented by the canonical signed digit (CSD). The circled groups of digits have the same subexpression, 
where denotes " " sample delay and " " digit right shifts of . If we define (6) we can rewrite the filtering operation as (7) Thus, by sharing the common subexpression, the number of additions is reduced from six to four. Fig. 3 shows the computation flow of the filter example. The common subexpression is computed first, and the result is then shifted or negated for the other computations. Therefore, much computation can be saved if good common subexpressions are found. However, sharing common subexpressions results in irregular routing in hardware designs. This problem is also avoided in our design by a proper address-generation scheme.
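The effect of sharing can be sketched in Python with two hypothetical constant multipliers, 45 and 85, chosen only for illustration (they are not the coefficients of Fig. 2). Both constants contain the bit pattern 101, so the term 5x can be computed once and reused. The helper class `AddCounter` is likewise an assumption of this sketch and simply counts additions.

```python
class AddCounter:
    """Adder that counts how many additions are performed."""
    def __init__(self):
        self.count = 0

    def add(self, a, b):
        self.count += 1
        return a + b

def multiply_direct(x, alu):
    # 45 = 101101b and 85 = 1010101b, each expanded bit by bit: 3 + 3 additions.
    y45 = alu.add(alu.add(alu.add(x << 5, x << 3), x << 2), x)
    y85 = alu.add(alu.add(alu.add(x << 6, x << 4), x << 2), x)
    return y45, y85

def multiply_shared(x, alu):
    # Shared subexpression t = 101b * x = 5x, computed once.
    t = alu.add(x << 2, x)
    y45 = alu.add(t << 3, t)     # 40x + 5x
    y85 = alu.add(t << 4, t)     # 80x + 5x
    return y45, y85

direct_alu, shared_alu = AddCounter(), AddCounter()
direct = multiply_direct(7, direct_alu)
shared = multiply_shared(7, shared_alu)
```

Both versions return (315, 595) for x = 7; the direct form uses six additions and the shared form three. The shared form, however, reads the intermediate `t` at two different shifts, which is the kind of irregular data access the text refers to.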
III. PROCESSOR CORE DESIGNS
A. DCT/IDCT Coefficient Exploration
The proposed processor design pushes the sharing properties of the adder-based DA formulation to the extreme case: only one word adder and one shifter. So, fewer computation cycles will result in higher throughput. (6), which are also used in many fast algorithms to reduce the computation complexity (8) (9) where

Since the coefficient matrices are constant, we can minimize the number of additions by signed-digit encoding. A commonly used signed-digit representation is CSD [28], which reduces the average number of nonzero digits from one-half to one-third of the total digits and guarantees that no two consecutive digits are both nonzero. Table I shows the CSD representation of the coefficients. The table shows that CSD reduces the number of nonzero bits by about 32%. Besides the direct manipulation of the coefficients, we also try to increase the sharing possibility by scaling the coefficients by . This scale factor can be easily removed by a shift in 2-D transform designs. The nonzero bit reduction of is about 26%, and the numbers of nonzero bits are 38 and 39 for and , respectively.
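For readers unfamiliar with CSD, the following Python sketch converts a non-negative integer to its canonical signed-digit form and counts the nonzero digits. The function names are illustrative, and the letter N is used for the digit -1. The example values are arbitrary and are not the DCT coefficients of Table I.

```python
def to_csd(n):
    """Return the CSD digits of a non-negative integer, LSB first, each in {-1, 0, 1}."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)      # n mod 4 == 1 gives +1, n mod 4 == 3 gives -1
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def csd_string(n):
    # MSB-first string, with 'N' standing for the digit -1.
    return "".join({1: "1", 0: "0", -1: "N"}[d] for d in reversed(to_csd(n))) or "0"

def nonzero_digits(n):
    return sum(1 for d in to_csd(n) if d != 0)

# 119 = 1110111b has six nonzero bits; its CSD form 1000N00N (128 - 8 - 1) has three.
example = csd_string(119)
```

Each nonzero digit costs one addition or subtraction in a shift-add realization, so `nonzero_digits` is a direct proxy for the adder cycles a coefficient needs before any sharing is applied.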
After applying the CSD representation, we use subexpression sharing to share the common computations. With signed-digit representations, a number and its negative can be shared with only a sign change. However, unlike the subexpression sharing in [24], [25], which follows the strict CSD representation, we relax the rule by using the general signed-digit representation. For example, "10N" and "011" require the same number of additions, but CSD allows only the first form. Allowing "011" may result in better sharing. Besides, following the formulation of adder-based DA, the subexpression sharing used in this paper expresses the computation in direct-form scheduling instead of the transposed direct-form scheduling used in other methods. Combined with the DA formulation, this scheduling achieves better precision by adding terms from LSB to MSB, which means lower hardware cost. Transposed direct-form scheduling accepts one input at a time and multiplies it with all coefficients, which requires more computation cycles per output in this design because fewer terms can be shared. Direct-form scheduling takes all the inputs at once and therefore needs more temporary storage. However, this disadvantage is easily eliminated by sharing the system memory if the 2-D DCT is used in a video codec system. Fig. 5 shows the common subexpression sharing for the DCT outputs on scaled coefficients . Scaled coefficients are used due to their better sharing property. By applying the signed-digit representation and common subexpression sharing, the number of additions in the 1-D DCT is reduced to 98. Compared with the original 144 additions, this is a 32% improvement.

B. Datapath Design

Fig. 6 shows the datapath of the proposed architecture, which includes a 16-bit adder/subtractor and a shifter. The adder is a carry-propagation adder to save area. The operands and operations of the datapath are controlled by a dedicated controller. The input, intermediate results, and final results are stored in the RAM or register files. This datapath is dedicated to the shift-add subexpression. The basic operation of this datapath can be expressed by $\mathrm{BUS} = (\mathrm{RA} \gg n) \pm \mathrm{RB}$, where "BUS" is the output of the datapath, "$\gg n$" means a right shift by $n$ bits, and "$\pm$" denotes the add/subtract operation. Therefore, we can store the operands in the registers RA and RB and select each of them by a MUX from one of two sources, RAM or BUS. The two registers provide temporary storage to save memory accesses. The shifter in the datapath performs 0- to 3-bit shifts, which suits the DCT coefficients. In the above subexpression sharing, the maximum shift is six bits. We split each 6-bit shift into two 3-bit shift operations, which reduces the shifter hardware by about 30% at the cost of only two more computation cycles.

The design of this datapath, in contrast to general-purpose CPU designs, places the shifter before the adder/subtractor. From Fig. 7, we can find that the version 1 design is not as efficient as the version 2 design, since additional cycles are often required in the version 1 design due to the improper shifter position.
The limitations of the datapath are the available RAM bandwidth and the position of the shifter. In this design, a single-port RAM is assumed to keep the hardware simple. To avoid idle cycles due to RAM access conflicts, we schedule the operation sequences according to their RAM accesses, which is implemented in the control signal generation. Other solutions, such as multiport RAM, can also be used to solve this problem. The position of the shifter prevents the operand in register RB from being shifted, since only register RA feeds the shifter, so a term that must be both shifted and subtracted cannot be taken from RB. This limitation can be eliminated by proper operation scheduling.
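A behavioral Python sketch of one datapath operation may help make these constraints concrete. It assumes the operation BUS = (RA >> n) ± RB with a shift range of 0 to 3 bits and 16-bit two's-complement wrap-around; the function names, and the use of a zero operand in RB for the first half of a 6-bit shift, are assumptions of this sketch rather than details confirmed by the design.

```python
def wrap16(value):
    # Interpret the low 16 bits as a two's-complement number.
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def datapath_op(ra, rb, shift, subtract=False):
    """One cycle of the datapath: shift RA right, then add or subtract RB."""
    if not 0 <= shift <= 3:
        raise ValueError("the shifter supports only 0 to 3 bit positions")
    shifted = ra >> shift                      # arithmetic right shift; only RA is shifted
    return wrap16(shifted - rb if subtract else shifted + rb)

def shift6_then_add(ra, rb):
    # A 6-bit shift is split into two 3-bit passes; the first pass adds zero.
    partial = datapath_op(ra, 0, 3)
    return datapath_op(partial, rb, 3)

result = shift6_then_add(640, 5)               # (640 >> 6) + 5
```

Because RB bypasses the shifter, an expression such as RA - (RB >> 2) has to be rescheduled, for example by exchanging the roles of the two registers, which is the situation the swap strategy in the next subsection addresses.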
C. Controller Design and Scheduling
Since the datapath is extremely simple, all the operand selections and shared term generation rely on the controller. Controllers based on finite-state machines are used in most controller designs. However, since no control signal in the proposed design depends on earlier states, we use a simplified finite-state machine, i.e., a counter-based controller, as shown in Fig. 8. Such a design is much simpler and is more easily adapted to other transform applications by changing only the combinational circuit part.
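The counter-based control scheme can be sketched in Python as a table lookup: the counter value is the only state, and a fixed table plays the role of the combinational logic that decodes it into control signals. The control-word fields, the two-step program, and the RAM contents below are all hypothetical and serve only to illustrate the idea; the sketch does not enforce the single-port RAM constraint.

```python
from typing import NamedTuple, Optional

class ControlWord(NamedTuple):
    load_ra: Optional[str]   # RAM address, "BUS" for the previous output, None to hold
    load_rb: Optional[str]
    shift: int               # right shift applied to RA (0 to 3)
    subtract: bool           # False: add RB, True: subtract RB
    store: Optional[str]     # RAM address for the result, None to skip the write

def run(program, ram):
    ra = rb = bus = 0
    ram_accesses = 0
    for counter in range(len(program)):       # the counter is the whole controller state
        cw = program[counter]                 # "combinational" decode of the counter
        for reg, src in (("RA", cw.load_ra), ("RB", cw.load_rb)):
            if src is None:
                continue                      # register keeps its operand
            if src == "BUS":
                value = bus                   # feed back the previous result, no RAM read
            else:
                value = ram[src]
                ram_accesses += 1
            if reg == "RA":
                ra = value
            else:
                rb = value
        shifted = ra >> cw.shift
        bus = shifted - rb if cw.subtract else shifted + rb
        if cw.store is not None:
            ram[cw.store] = bus
            ram_accesses += 1
    return ram, ram_accesses

PROGRAM = [
    ControlWord("x0", "x1", 2, False, None),  # BUS = (x0 >> 2) + x1, result not stored
    ControlWord("BUS", None, 1, True, "y"),   # y = (BUS >> 1) - x1, x1 stays in RB
]
ram, accesses = run(PROGRAM, {"x0": 40, "x1": 6})
```

With x0 = 40 and x1 = 6 the program stores y = 2 using three RAM accesses. Retargeting the core to another transform would, in this model, amount to replacing `PROGRAM` while leaving `run` unchanged.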
The RAM access conflict, which would reduce the efficiency of the design, is eliminated by the operation scheduling strategies described next. With all four scheduling techniques, one 1-D 8-point DCT completes in 121 cycles, only 23 cycles more than the originally estimated 98 cycles.
The first strategy is to group operations that share an operand and to keep that operand in use across consecutive operations. With this strategy, one operand stays in register RA or RB and only the other operand is read from RAM, so reloading of the same operand is minimized. The following is an actual code subsequence in the firmware of this IDCT:
This scheduling example keeps one operand or unchanged, where the boldface is read from RAM. To compute and , the data stays in "RA," and just read and to "RB" from RAM. For and computation, the and stay in "RA" and "RB", which can save memory read operations to obtain . The second strategy is to rearrange the operations such that the output data of current operation is the input of next operations. It will reduce one read cycle of the RAM access. Not all operations can be arranged by this strategy. Fortunately, the operations to compute output can always apply this strategy. The following list shows a code example to calculate .
This example reads one operand from the output bus, where the boldface and denote the data without RAM access. The output data and are fed as input of the next operation to reduce RAM access. When calculating output , the processor uses just one memory read access, and it writes data to RAM at last operation.
The third strategy is to eliminate the memory write access for outputs that will not be used later. If an output is not used in later operations, it does not need to be stored in RAM. Rearranging the operations can eliminate these memory write operations. The following list shows a code example of this case:
This example shows how to reduce the memory write access, where is the output data that does not have to be written to RAM.
is used immediately in and calculation, and it will not be used in later operations. Thus, we can avoid writing into RAM. Another special case is to swap data in "RA" and "RB" to solve the constraint of ALU design. If both data are reloaded from RAM, it will cost two memory read cycles. The following list is a code example that just needs one extra RAM-read cycle to overcome this problem. Original code is
Now use these codes instead
This example swaps the operands data, and . The output data and need both input data and but on different operands. We can resolve this by rewriting them into the right one, such that just one memory read cycle instead of two memory read cycles is required. 8 IDCT, the number of cycles required is .

Fig. 9 shows the savings in the number of additions when different design techniques are applied to compute the 2-D 8 × 8 DCT. This evaluation shows the effectiveness of each technique in the design. The 1-D DCT used in the row-column decomposition is based on the fast algorithm [13], which requires only 11 multiplications and 29 additions for an 8-point DCT with 16-bit precision. We adopt this approach as the relative reference, i.e., its 3104 additions are taken as 100%. The first significant improvement comes from the fast direct 2-D DCT algorithm, which accounts for a 42% reduction. The remaining improvement comes from the adder-based DA formulation and subexpression sharing. The RAM-conflict problem in this design adds about 6% extra addition cycles, which can be avoided with larger memory bandwidth. Of the 1208 addition cycles, the eight 1-D DCT computations use 968 cycles and the butterfly-stage additions use 240 cycles.
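The cycle figures quoted above can be cross-checked with a few lines of Python. The constants are taken from the text, with one exception: the cost of 15 additions per 16-bit multiplication is an assumption of this sketch, chosen because it reproduces the stated 3104-addition reference exactly; the source does not state how the reference was counted.

```python
ADDS_PER_MULT = 15                      # assumption: shift-add cost of one 16-bit multiply
REF_1D = 29 + 11 * ADDS_PER_MULT        # fast 1-D DCT [13]: 29 additions + 11 multiplications
reference = 16 * REF_1D                 # row-column decomposition uses 2N = 16 1-D DCTs

BUTTERFLY = 240                         # butterfly-stage additions of the direct 2-D algorithm
direct_2d = 8 * REF_1D + BUTTERFLY      # direct 2-D algorithm uses N = 8 1-D DCTs

IDEAL_1D, SCHEDULED_1D = 98, 121        # shared shift-add 1-D DCT: estimated and actual cycles
proposed = 8 * SCHEDULED_1D + BUTTERFLY

reduction_from_direct_2d = 1 - direct_2d / reference
ram_conflict_overhead = 8 * (SCHEDULED_1D - IDEAL_1D) / reference
proposed_fraction = proposed / reference

print(reference, direct_2d, proposed)
print(f"{reduction_from_direct_2d:.1%} {ram_conflict_overhead:.1%} {proposed_fraction:.1%}")
```

Under this assumption the direct 2-D algorithm alone gives a reduction of about 42%, the RAM-conflict overhead is about 6% of the reference, and the final 1208 cycles correspond to roughly 39% of the reference, consistent with the numbers in the text.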
This processor core design for DCT/IDCT with 16-bit word length is synthesized with a 0.6-μm SPDM CMOS cell library [29]. Table II shows the hardware cost and delay of the DCT/IDCT designs. The total gate count of the DCT is 1493, which is smaller than a single 16 × 16 multiplier: 2122 gates for a multiplier with a carry-save adder array, or 2536 gates for a multiplier with a Wallace tree array. The delay of 18.21 ns fits comfortably within the 30.3-ns cycle time of the assumed 33-MHz clock. Table III shows the hardware utilization of each function unit in the 1-D DCT design. The controller is always in use and is not listed in the table. The idle cycles in each function unit are due to the available memory bandwidth and the datapath limitation. The overall utilization is quite high for the DCT design. Similar statistics are found for the IDCT design. The precision of the IDCT unit meets the accuracy specifications [30], as shown in Table IV. Due to the DA formulation and proper selection of sharing terms, a short word length suffices to satisfy the precision requirement.

V. APPLICATIONS AND COMPARISONS
A. Applications to Various Video Standards
Table V shows the applications of the design to various video standards. For a digital still camera (DSC), which requires low cost while tolerating longer delay, this design can compute all DCT operations within 0.176 s. This delay leaves enough time for other functions such as quantization. Another application is the DCT/IDCT unit in an H.263 codec system. For QCIF size, the proposed design can meet real-time encoding requirements with only one datapath unit.

For larger picture sizes and higher frame rates, this design can be scaled simply by adding more datapath units or by raising the clock frequency. Since the datapath part is quite small, even eight datapath units need only 5104 gates, or 2.07 mm². Fig. 10 shows the scalable design with two datapath units. The bottleneck of the scalable designs is the available memory bandwidth. A larger bus width and multiport memory can eliminate this problem. Scalable designs also offer the possibility of low-power design. With more datapath Table V can be halved at double the clock rate. The 40.5-MHz clock rate in Table V does not introduce extra cost since the processor delay is 18.21 ns. A higher working frequency can easily be attained by using pipelining or high-speed adders. The tradeoff depends on the target application environment.
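A rough cycle budget for these applications can be sketched in Python from the 1208-cycle DCT figure and the 33-MHz clock. The picture formats below are assumptions of this sketch: a 640 × 480 luminance-only image for the DSC case (one choice that reproduces the quoted 0.176 s), and an IDCT cycle count equal to that of the DCT for the H.263 case, since the IDCT figure is not available here. System overheads are ignored.

```python
CLOCK_HZ = 33_000_000
CYCLES_PER_BLOCK = 1208                  # 2-D 8 x 8 DCT core cycles, from the text

def blocks_per_frame(width, height, samples_per_pixel):
    # Number of 8 x 8 blocks; 4:2:0 video carries 1.5 samples per pixel.
    return int(width * height * samples_per_pixel) // 64

def seconds_for(blocks, units=1):
    # Time for the given number of blocks, assuming ideal scaling over datapath units.
    return blocks * CYCLES_PER_BLOCK / (CLOCK_HZ * units)

# Digital still camera: assumed 640 x 480, luminance only.
dsc_blocks = blocks_per_frame(640, 480, 1.0)
dsc_seconds = seconds_for(dsc_blocks)

# H.263 encoder: QCIF (176 x 144), 4:2:0, 10 frames/s, DCT plus IDCT per block.
qcif_blocks_per_second = blocks_per_frame(176, 144, 1.5) * 10
qcif_load = 2 * seconds_for(qcif_blocks_per_second)   # busy fraction of each second

print(f"DSC: {dsc_seconds:.3f} s, QCIF encoder load: {qcif_load:.0%}")
```

Under these assumptions the QCIF encoder keeps a single datapath unit busy for less than half of each second, which is in line with the statement that spare computing power remains at 33 MHz.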
B. Comparisons With Other Relevant Approaches
Since the proposed design combines a dedicated ALU datapath with a software-oriented controller, comparisons with processor-based implementations can show its effectiveness. Table VI compares the computation time of one 8 × 8 DCT executed on our design with that on DSP processors [17]-[19] or a RISC processor with multimedia enhancements [20]. The instruction cycle counts in the table are taken directly from the referenced reports or papers. Note that in this paper, we consider only the core computation cycles of our design and do not include other system overheads. All the other implementations use fast algorithms, among which the direct 2-D fast algorithm requires the fewest cycles. Compared with the C30 processors [17], which include multipliers, the performance of our proposed design, 1208 cycles, is superior at a similar clock rate. Other designs, such as those in [18]-[20], use multiple processing units to accelerate the DCT execution. Our proposed design can attain the same performance by using multiple ALUs, whose cost, as shown in Table V, is still less than that in [18]-[20]. The proposed design achieves higher performance by using a dedicated datapath unit to accelerate subexpression sharing operations, at the expense of design flexibility and applicability. In some video applications, incorporating our design as a companion core to a conventional DSP processor can combine the advantages of a flexible DSP software approach and an efficient dedicated hardware accelerator. These advantages apply especially to inner product computations with either constant or variable coefficients.
Comparison with dedicated hardware designs is more difficult because of the different approaches used. The proposed design combines a software-oriented controller with hardware units, while dedicated hardware designs are purely hardware-oriented. Nevertheless, Table VII lists the comparisons with dedicated hardware designs. The data for previous designs are taken directly from the referenced papers. All these designs can meet the decoding speed and accuracy of MPEG2 MP@ML (640 × 480, 30 fps, 4:2:0, 13.82 Mpixels/s). The IDCT core in [7] uses a digit-serial construction to reduce the overall size. The low-power IDCT designs in [9], [11] use MACs for computation. The design in [10] is a DCT/IDCT accelerator in a DSP processor based on ROM-based DA. The DCT/IDCT unit in [8] shares the same hardwired multipliers for its computations. All five designs are based on row-column decomposition. Compared with these designs, the proposed design is very competitive in area cost at processing rates up to MPEG2 MP@ML decoding. However, for higher throughput rates such as HDTV requirements, memory bandwidth limits the applicability of the proposed design. In such cases, high-speed dedicated hardware designs [2]-[6] can provide a more efficient solution.
VI. CONCLUSION
In this paper, we propose a cost-effective processor core design for 2-D DCT/IDCT that can be used in digital still cameras and real-time H.263 encoding. We use the fast algorithm to reduce the computation, the DA formulation for higher precision, and subexpression sharing for lower hardware cost and fewer computation cycles. The resulting architecture is simple, regular, and easily scalable to higher throughput applications such as MPEG2 MP@ML decoding. Extensions to other inner product computations, such as filters and transforms, are easily achieved by applying the design techniques to rewrite the controller program. Low-power applications in portable multimedia terminals are possible due to the simple architecture and the small number of computation cycles.
