A generic multiplication scheme for the low power VLSI implementation of the DCT is described in this paper. The scheme concurrently processes blocks of cosine coefficient and pixel values during the multiplication procedure, with the aim of reducing the total switched capacitance within the multiplier circuit. The cosine coefficients within each block are manipulated such that some are processed using shift operations only. The remaining coefficients are presented to the multiplier inputs as a sequence, ordered according to the bit correlation between successive cosine coefficients. The paper describes the multiplication scheme and the power evaluation environment used, and presents results for a number of standard benchmark examples, demonstrating power savings of up to 50%.
INTRODUCTION
Currently there is considerable interest in the low power implementation of the Discrete Cosine Transform (DCT). This is mainly due to the DCT being the computational bottleneck of standards such as JPEG and MPEG [1]. Most research work considering low power implementation of the DCT has targeted reducing the computational complexity of the design or modifying it for operation under a lower supply voltage [1, 2]. Both of these techniques have a limited effect on power reduction. Another major contribution to power consumption is the effective switched capacitance [3, 4]. Only a few researchers have targeted reducing the power of a DCT implementation through a reduction in the amount of switched capacitance. This reduction has been achieved through techniques such as the detection of zero-valued DCT coefficients and lookup table partitioning [5]. This paper presents a generic multiplication scheme for the low power VLSI implementation of the DCT on CMOS-based Digital Signal [...]

Figure 2 outlines the algorithm, which commences with the following initialisation steps: (1) [...]

Figure 4 illustrates the framework utilised for the evaluation of the scheme with a number of 512 × 512 pixel example benchmark images.
These, which include the Lena, Man and checked images, are shown in Figure 5. The cosine coefficient matrix used was obtained using the MATLAB signal processing toolbox [9].

This [...] the DCT coefficient matrix [C], i.e., $R_i = C_{x1} = (244, 18.96, 0, 1.35)^T$. This procedure is carried out until $x = k = n$, at which point $R_i$ will contain $C_{44}$.
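The accumulation of partial products into $R_i$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `dct_matrix`, `column_based_dct` and `traditional_dct` are hypothetical, the cosine matrix is assumed to be the orthonormal DCT-II matrix, and the loop order (each pixel value held fixed while it is multiplied by a whole column of cosine coefficients) is one reading of the column-based scheme.

```python
import math


def dct_matrix(n):
    """Assumed cosine matrix E: the orthonormal n x n DCT-II matrix."""
    E = []
    for i in range(n):
        scale = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
        row = [scale * math.cos(math.pi * (2 * x + 1) * i / (2 * n))
               for x in range(n)]
        E.append(row)
    return E


def traditional_dct(E, D):
    """Row-by-column product C = E * D: each output C[i][k] is finished
    before the next one starts, so both multiplier inputs change on
    every multiplication."""
    n, m = len(E), len(D[0])
    return [[sum(E[i][x] * D[x][k] for x in range(n)) for k in range(m)]
            for i in range(n)]


def column_based_dct(E, D):
    """Same product, reordered: pixel D[x][k] is held fixed while it is
    multiplied by the cosine column E[0..n-1][x]; the partial products
    are accumulated in R (standing in for the register bank)."""
    n, m = len(E), len(D[0])
    C = [[0.0] * m for _ in range(n)]
    for k in range(m):
        R = [0.0] * n                  # partial products for column k of C
        for x in range(n):
            pixel = D[x][k]            # constant over the inner loop
            for i in range(n):
                R[i] += E[i][x] * pixel
        for i in range(n):
            C[i][k] = R[i]             # R now holds column k of C
    return C
```

Both orderings produce the same coefficients; they differ only in the sequence of operand pairs seen by the multiplier, which is what affects switching activity.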
As it stands, the algorithm can be implemented on traditional DSPs without any loss in throughput. However, for high throughput applications, a modified processor architecture is required. The architecture, a simplified version of which is shown in Figure 3, requires an internal register bank in order to store the partial products, (R), which eventually result in the DCT coefficients, [C]. In addition, a special memory unit is allocated for both the cosine and the pixel matrix elements, $E_{ix}$ and $D_{xk}$ respectively. A shifter is included to cope with the additional shift operations. The multiplier and the adder units are included to perform the normal multiply-add DSP operations. Since the outputs of the multiplier and the shifter share the same input of the register bank, a multiplexer is needed to select which of the two outputs drives that input. Another multiplexer is required to select between the multiplier/shifter output and the adder output, since some of the inputs to the register bank proceed directly to the bank without passing through the adder.

A C-program based test-fixture mapping system was developed to generate input simulation files for the Verilog-XL digital simulator [10]. This involved forming the appropriate image-pixel/cosine-coefficient pairs, in the order imposed by the multiplication algorithm, so that they can be applied to the inputs of the hardware multiplier. Three types of input simulation files are generated, representing the use of one of the following:
(1) Traditional cosine DCT multiplication (Traditional).
(2) Column-based multiplication algorithm without ordering (Column-based).
(3) Column-based multiplication algorithm with ordering according to minimum Hamming distance (Hamming).

In each simulation, the number of signal transitions (switching activity) at the output of each gate is monitored. Capacitive information for each gate is extracted from the layout of the multiplier circuit. Both of these are used to obtain a figure for the total switched capacitance of the multiplier. Table I shows the results obtained with the different benchmark images. Clearly, power saving is achieved in all cases, with a maximum of 50% with the checked image using Hamming, 
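The two ingredients of this evaluation, ordering coefficients by Hamming distance and combining per-gate transition counts with per-gate capacitance, can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the greedy nearest-neighbour ordering is an assumption (the text states only that ordering follows minimum Hamming distance), and the coefficients are assumed to be available as unsigned integer bit patterns.

```python
def hamming(a, b):
    """Number of bit positions in which integers a and b differ."""
    return bin(a ^ b).count("1")


def order_by_hamming(coeffs):
    """Greedy ordering (assumed): start with the first coefficient, then
    repeatedly pick the remaining one closest in Hamming distance to the
    coefficient most recently placed."""
    remaining = list(coeffs)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        j = min(range(len(remaining)),
                key=lambda idx: hamming(last, remaining[idx]))
        ordered.append(remaining.pop(j))
    return ordered


def input_transitions(seq):
    """Total bit toggles seen at a multiplier input fed with seq."""
    return sum(hamming(a, b) for a, b in zip(seq, seq[1:]))


def switched_capacitance(toggles, capacitance):
    """Sum over gates of (output transitions) * (gate capacitance).
    Both arguments are dicts keyed by a hypothetical gate name."""
    return sum(toggles[g] * capacitance[g] for g in toggles)
```

For example, the sequence `[0b0000, 0b1111, 0b0001, 0b1110]` toggles 11 input bits when applied as given, whereas `order_by_hamming` rearranges it to `[0b0000, 0b0001, 0b1111, 0b1110]`, which toggles 5.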
