An improved architecture for the adaptive discrete cosine transform by Martin, F & Bull, DR
                          Martin, F., & Bull, D. R. (1996). An improved architecture for the adaptive
discrete cosine transform. In Unknown. (Vol. 2, pp. 742 - 745). Institute of
Electrical and Electronics Engineers (IEEE). 10.1109/ISCAS.1996.541832
AN IMPROVED ARCHITECTURE FOR THE ADAPTIVE 
DISCRETE COSINE TRANSFORM 
François Martin and David R. Bull
Image Communications Group, Centre for Communications Research
University of Bristol, Queens Building, University Walk, Bristol BS8 1TR, UK
Dave.Bull@bristol.ac.uk
ABSTRACT 
This paper presents a new approach to the efficient 
realisation of the discrete cosine transform for the specific 
case of interlaced image sequence coding. In such cases, 
the conventional approach of decomposing each frame or 
frame difference into 8x8 blocks is often no longer 
satisfactory and an adaptive architecture capable of 
processing either 8x8 or two 4x8 blocks is desirable. The 
approach described is based on the decomposition used by 
Madisetti, modified to maximise shared hardware resources 
and to exploit arithmetic redundancy using primitive 
operator methods. The resulting architecture is compared 
with alternative implementation options using an area-time 
metric with savings in excess of 50% having been observed. 
1. INTRODUCTION 
Since its definition by Ahmed, the Discrete Cosine 
Transform (DCT) has been widely used in many image and 
video signal processing systems and has been incorporated 
in most international standards including JPEG, H.261,
MPEG-1 and -2. The definition of the DCT is given below 
in its one dimensional (1) and two dimensional (2) forms. 
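\[ X(k) = \sqrt{\frac{2}{N}}\,\varepsilon_k \sum_{m=0}^{N-1} x[m] \cos\left[\frac{(2m+1)k\pi}{2N}\right] \tag{1} \]
\[ X(k,l) = \frac{2}{N}\,\varepsilon_k \varepsilon_l \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x[m,n] \cos\left[\frac{(2m+1)k\pi}{2N}\right] \cos\left[\frac{(2n+1)l\pi}{2N}\right] \tag{2} \]
where
\[ \varepsilon_k = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases} \]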
Equations (1) and (2) may also be represented in matrix 
form as given in equations (3) and (4) respectively. 
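\[ \mathbf{X} = \mathbf{C}_N\,\mathbf{x} \tag{3} \]
\[ \mathbf{X} = \mathbf{C}_N\,\mathbf{x}\,\mathbf{C}_N^{T} \tag{4} \]
where $(\mathbf{C}_N)_{k,m} = \sqrt{2/N}\,\varepsilon_k \cos\left[\frac{(2m+1)k\pi}{2N}\right]$.
0-7803-3073-0/96/$5.00 ©1996 IEEE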
In the case of the 2D-DCT, the calculation complexity is
$O(N^4)$. However, using equation (4) it can be readily
shown that the 2D-DCT is separable. This implies that the
2D calculation can be performed by applying an N point
1D-DCT to the rows followed by an N point 1D-DCT on
the resulting columns. Although the number of operations
is reduced from $O(N^4)$ to $O(N^3)$ in the separable case, the
2D-DCT remains computationally intensive. For this 
reason, many algorithms have been developed to reduce its 
implementation complexity. 
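The separability argument can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: it assumes the orthonormal scaling $\sqrt{2/N}\,\varepsilon_k$ for equation (1), and the function names (`dct_1d`, `dct_2d_direct`, `dct_2d_separable`) and the small 4x4 test block are illustrative choices only.

```python
import math


def eps(k):
    """Normalisation factor: 1/sqrt(2) for k = 0, otherwise 1."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def basis(k, m, n):
    """cos[(2m+1) k pi / (2N)], the kernel shared by equations (1) and (2)."""
    return math.cos((2 * m + 1) * k * math.pi / (2 * n))


def dct_1d(x):
    """N-point 1D-DCT (orthonormal scaling is an assumption of this sketch)."""
    n = len(x)
    return [
        math.sqrt(2.0 / n) * eps(k) * sum(x[m] * basis(k, m, n) for m in range(n))
        for k in range(n)
    ]


def dct_2d_direct(block):
    """Direct double sum for every output: O(N^4) multiply-accumulates."""
    n = len(block)
    return [
        [
            (2.0 / n) * eps(k) * eps(l) * sum(
                block[m][j] * basis(k, m, n) * basis(l, j, n)
                for m in range(n) for j in range(n)
            )
            for l in range(n)
        ]
        for k in range(n)
    ]


def dct_2d_separable(block):
    """Row-column method: 2N one-dimensional transforms, O(N^3)."""
    rows = [dct_1d(row) for row in block]             # transform every row
    cols = [dct_1d(list(col)) for col in zip(*rows)]  # then every resulting column
    return [list(r) for r in zip(*cols)]              # transpose back to [k][l]


# Arbitrary 4x4 test block (illustrative data only).
block = [[float((3 * i + 5 * j) % 7) for j in range(4)] for i in range(4)]
direct = dct_2d_direct(block)
separable = dct_2d_separable(block)
max_diff = max(abs(direct[k][l] - separable[k][l]) for k in range(4) for l in range(4))
print(f"largest difference between the two methods: {max_diff:.2e}")
```

Both routines should agree to within floating-point rounding; the row-column version simply reuses the N-point 1D transform 2N times.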
Early fast DCT algorithms were FFT based, but these are 
no longer popular due to their requirement for complex 
multiplications and additions. Alternative algorithms have 
been proposed that decompose the DCT itself; some of 
these decompose the DCT into lower order operations, 
while others rely on cyclic correlated structures [1] or on
the DFT [2]. Other methods take into account the unitary
property of the $\mathbf{C}_N$ matrix which allows it to be factorised
into products of relatively sparse matrices [3]. Recursive 
methods have also been reported as alternative ways to 
calculate the DCT. Examples include a recursive form of 
equation (1) [4] and a recursive form of the matrix 
representation (equations (3) and (4)) [5]. For all such 
algorithms, the complexity is typically $O(N^2 \log_2 N)$.
Many algorithms have been developed to exploit the 
properties of VLSI implementation [6]. These include the 
use of distributed arithmetic [7] and, more recently, 
systolic arrays [8]. The latter offer the possibility of high
throughput rates, especially for HDTV applications.
This paper presents a new approach to the efficient 
realisation of the 1-D DCT for the specific case of an 
interlaced image sequence. MPEG-2, for example, has an
option for an interlaced field mode of operation. In such 
cases the conventional approach of decomposing each 
frame or frame difference into 8x8 blocks is often no longer 
satisfactory and an adaptive architecture is necessary. 
Section 2 of the paper justifies the need for such an
adaptive architecture and section 3 presents a new and
efficient solution to this problem based on a modified form of
Madisetti's algorithm [9] in conjunction with the use of
primitive operator techniques [10].
Authorized licensed use limited to: UNIVERSITY OF BRISTOL. Downloaded on February 9, 2009 at 08:58 from IEEE Xplore.  Restrictions apply.
2. JUSTIFICATION FOR AN ADAPTIVE DCT 
In the case of an interlaced image sequence, if motion 
occurs between the scanning of odd and even lines, large 
valued spurious DCT coefficients may be created at high 
frequencies. These will be coarsely quantised during 
compression, reducing the quality of the decoded image 
sequence. This is demonstrated in figures 1 to 4 where an 
artificial 16 pixel shift has been introduced between even 
and odd lines of the block. In such situations it would be 
desirable to have the option of computing 4x8 DCTs on odd 
and even lines independently for comparison with the 
conventional 8x8 transform. The solution yielding the best 
reconstructed image (according to some predefined 
performance criterion) would then be selected for the block 
in question. Although desirable, this approach has the 
disadvantage that the hardware must simultaneously 
provide both 4x8 and 8x8 transforms. The algorithm and
architecture presented in this paper represent an efficient
method of performing this task.
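The effect of inter-field motion on the transform can be illustrated on a single vertical column of an interlaced block. The sketch below is a simplified stand-in for the 16x16 experiment of figures 1 to 4: the column values, the orthonormal DCT scaling and the `ac_energy` selection criterion are all assumptions made for illustration, since the paper leaves the performance criterion unspecified.

```python
import math


def dct_1d(x):
    """N-point 1D-DCT with orthonormal scaling (an assumption of this sketch)."""
    n = len(x)
    out = []
    for k in range(n):
        eps = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        acc = sum(x[m] * math.cos((2 * m + 1) * k * math.pi / (2 * n)) for m in range(n))
        out.append(math.sqrt(2.0 / n) * eps * acc)
    return out


def ac_energy(coeffs):
    """Sum of squared non-DC coefficients (hypothetical selection criterion)."""
    return sum(c * c for c in coeffs[1:])


# One column of an 8-line block: even lines still show background (0),
# odd lines show an object that moved in before the second field (100).
column = [0.0, 100.0] * 4

frame = dct_1d(column)              # 8-point transform of the interleaved lines
even_field = dct_1d(column[0::2])   # 4-point transform of the even lines
odd_field = dct_1d(column[1::2])    # 4-point transform of the odd lines

frame_ac = ac_energy(frame)
field_ac = ac_energy(even_field) + ac_energy(odd_field)
mode = "field" if field_ac < frame_ac else "frame"

print("8-point coefficients:", [round(c, 2) for c in frame])
print("frame AC energy:", round(frame_ac, 2), "field AC energy:", round(field_ac, 2))
print("selected mode:", mode)
```

In this toy case the 8-point transform places its largest AC coefficient at the highest frequency, X(7), whereas each field is constant and its 4-point transform has only a DC term, so the field mode would be selected under the assumed criterion.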
Figure 1: correlated 16x16 image block
Figure 2: 16x16 DCT corresponding to figure 1.
Figure 3: 16x16 image block with a shift of 16 pixels between even and odd lines.
Figure 4: 16x16 DCT corresponding to figure 3.
3. A BLOCK ADAPTIVE ARCHITECTURE
The computational complexity of the DCT can be reduced
by applying a decomposition where the calculation is
divided on the basis of its even and odd rows (5):
\[
\begin{bmatrix} X(0) \\ X(2) \\ X(4) \\ X(6) \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
a_4 & a_4 & a_4 & a_4 \\
a_2 & a_6 & -a_6 & -a_2 \\
a_4 & -a_4 & -a_4 & a_4 \\
a_6 & -a_2 & a_2 & -a_6
\end{bmatrix}
\begin{bmatrix} x[0]+x[7] \\ x[1]+x[6] \\ x[2]+x[5] \\ x[3]+x[4] \end{bmatrix},
\qquad
\begin{bmatrix} X(1) \\ X(3) \\ X(5) \\ X(7) \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
a_1 & a_3 & a_5 & a_7 \\
a_3 & -a_7 & -a_1 & -a_5 \\
a_5 & -a_1 & a_7 & a_3 \\
a_7 & -a_5 & a_3 & -a_1
\end{bmatrix}
\begin{bmatrix} x[0]-x[7] \\ x[1]-x[6] \\ x[2]-x[5] \\ x[3]-x[4] \end{bmatrix}
\tag{5}
\]
where $a_i = \cos(i\pi/16)$.
All columns of each matrix in (5) have the same set of
coefficient magnitudes ($a_2, a_4, a_6$ for the first matrix and
$a_1, a_3, a_5, a_7$ for the second). Thus, for each set of inputs,
all operations of the form $a_i(x[n] + x[7-n])$ and
$a_i(x[n] - x[7-n])$ may be performed using shared hardware
and the result accumulated (possibly after reordering and
sign changing) to produce the outputs $X(0), \ldots, X(7)$.
Madisetti [9] proposed an implementation for the DCT and
the inverse DCT based on this decomposition and
demonstrated its use in a 100 MHz 2-D 8x8 DCT/IDCT
processor suitable for HDTV applications.
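The even/odd decomposition of the 8-point DCT can be verified against the direct definition. The following Python sketch is not taken from the paper or from Madisetti's design: it assumes the orthonormal scaling (giving the factor 1/2 for N = 8), and the tables `EVEN` and `ODD` encode each matrix entry as a signed index, so that `-6` stands for $-a_6$.

```python
import math

# a_i = cos(i*pi/16), the coefficients of the 8-point decomposition.
A = [math.cos(i * math.pi / 16) for i in range(8)]

# Signed coefficient indices for the two 4x4 matrices (rows give X(0),X(2),X(4),X(6)
# and X(1),X(3),X(5),X(7) respectively). Layout is an assumption of this sketch.
EVEN = [[4, 4, 4, 4], [2, 6, -6, -2], [4, -4, -4, 4], [6, -2, 2, -6]]
ODD = [[1, 3, 5, 7], [3, -7, -1, -5], [5, -1, 7, 3], [7, -5, 3, -1]]


def coeff(c):
    """Map a signed index c to +a_|c| or -a_|c|."""
    return math.copysign(A[abs(c)], c)


def dct8_direct(x):
    """8-point DCT from the defining sum: 64 multiplications."""
    out = []
    for k in range(8):
        eps = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        acc = sum(x[m] * math.cos((2 * m + 1) * k * math.pi / 16) for m in range(8))
        out.append(0.5 * eps * acc)
    return out


def dct8_even_odd(x):
    """8-point DCT via sums and differences of mirrored inputs: 32 multiplications."""
    s = [x[n] + x[7 - n] for n in range(4)]   # feeds the even-indexed outputs
    d = [x[n] - x[7 - n] for n in range(4)]   # feeds the odd-indexed outputs
    out = [0.0] * 8
    for r in range(4):
        out[2 * r] = 0.5 * sum(coeff(EVEN[r][n]) * s[n] for n in range(4))
        out[2 * r + 1] = 0.5 * sum(coeff(ODD[r][n]) * d[n] for n in range(4))
    return out


x = [1.0, 5.0, -3.0, 2.0, 0.0, 7.0, 4.0, -6.0]   # arbitrary test vector
err = max(abs(p - q) for p, q in zip(dct8_direct(x), dct8_even_odd(x)))
print(f"largest difference from the direct DCT: {err:.2e}")
```

Each column of `EVEN` uses only the indices 2, 4 and 6, and each column of `ODD` uses 1, 3, 5 and 7, which is the property that allows the products to be formed by shared hardware.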
To facilitate a block-adaptive structure, the DCT algorithm 
must be modified to efficiently produce 4 point transforms 
on both the even and odd components of an 8 point input 
vector as well as the original 8 point DCT. The 4 point 
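DCTs required are thus:
\[ \mathbf{X}_e = \mathbf{C}_4\,\mathbf{x}_e, \qquad \mathbf{X}_o = \mathbf{C}_4\,\mathbf{x}_o \tag{6} \]
where $\mathbf{x}_e = (x[0], x[2], x[4], x[6])^T$ and $\mathbf{x}_o = (x[1], x[3], x[5], x[7])^T$,
$\mathbf{X}_e$ is the DCT of the even rows and $\mathbf{X}_o$ is the DCT
of the odd rows of the input vector x. Using an approach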
similar to that of Madisetti, the 8-point DCT can be readily 
extended to efficiently implement the adaptive structure. 
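For example, the DCT of the even components in x can be
rewritten as (7):
\[
\begin{bmatrix} X_e(0) \\ X_e(2) \end{bmatrix}
= \frac{1}{\sqrt{2}}
\begin{bmatrix} a_4 & a_4 \\ a_4 & -a_4 \end{bmatrix}
\begin{bmatrix} x[0]+x[6] \\ x[2]+x[4] \end{bmatrix},
\qquad
\begin{bmatrix} X_e(1) \\ X_e(3) \end{bmatrix}
= \frac{1}{\sqrt{2}}
\begin{bmatrix} a_2 & a_6 \\ a_6 & -a_2 \end{bmatrix}
\begin{bmatrix} x[0]-x[6] \\ x[2]-x[4] \end{bmatrix}
\tag{7}
\]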
A similar decomposition can be applied to the odd rows of 
x. Comparing (5) and (7) an architecture which allows 
some sharing of resources between the 8x8 and 4x8 
transforms is now possible. Consider, for example, the
computation of X(2) and X(6). This can be achieved using the
accumulator-based architecture of figure 5, where the R 
block performs a permutation on the data streams according 
to sample index and the S blocks perform a sign change. It 
can be shown that with alternative permutations and sign 
changes, the $a_2$ and $a_6$ products can be reused in the
formation of other outputs: $X_o(1)$, $X_o(3)$, $X_e(1)$ and $X_e(3)$.
Similar configurations and their reuse can be devised for 
other product terms resulting in the architecture of figure 6.  
In this figure, the B1 and B2 blocks form the required 
products of input samples and DCT coefficients. Although 
these may be implemented using conventional fixed 
function multipliers, further complexity savings can be 
made if these are replaced with primitive operator sections 
[10].
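The reuse of the $a_2$ and $a_6$ products across the 8-point and 4-point transforms can be illustrated in software. The sketch below is only a behavioural model under stated assumptions: it uses the orthonormal DCT scaling (so the 8-point outputs carry a factor 1/2 and the 4-point outputs a factor $1/\sqrt{2}$), and the function `outputs_from_shared_products` with its fixed sign and index patterns is a hypothetical stand-in for the permutation (R) and sign-change (S) blocks, not a description of the circuit in figures 5 and 6.

```python
import math

A2 = math.cos(2 * math.pi / 16)
A6 = math.cos(6 * math.pi / 16)


def dct_1d(x):
    """Reference N-point DCT with orthonormal scaling (assumed)."""
    n = len(x)
    out = []
    for k in range(n):
        eps = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        acc = sum(x[m] * math.cos((2 * m + 1) * k * math.pi / (2 * n)) for m in range(n))
        out.append(math.sqrt(2.0 / n) * eps * acc)
    return out


def outputs_from_shared_products(x):
    """Form six outputs from one set of a2*x[n] and a6*x[n] products."""
    p2 = [A2 * v for v in x]   # a2 * x[n], computed once per input sample
    p6 = [A6 * v for v in x]   # a6 * x[n], computed once per input sample
    r = 1.0 / math.sqrt(2.0)   # 4-point scale factor under the assumed scaling
    return {
        # 8-point transform: signed accumulation over all eight samples
        "X(2)": 0.5 * (p2[0] + p6[1] - p6[2] - p2[3] - p2[4] - p6[5] + p6[6] + p2[7]),
        "X(6)": 0.5 * (p6[0] - p2[1] + p2[2] - p6[3] - p6[4] + p2[5] - p2[6] + p6[7]),
        # 4-point transform of the even samples x[0], x[2], x[4], x[6]
        "Xe(1)": r * (p2[0] + p6[2] - p6[4] - p2[6]),
        "Xe(3)": r * (p6[0] - p2[2] + p2[4] - p6[6]),
        # 4-point transform of the odd samples x[1], x[3], x[5], x[7]
        "Xo(1)": r * (p2[1] + p6[3] - p6[5] - p2[7]),
        "Xo(3)": r * (p6[1] - p2[3] + p2[5] - p6[7]),
    }


x = [1.0, 5.0, -3.0, 2.0, 0.0, 7.0, 4.0, -6.0]   # arbitrary test vector
shared = outputs_from_shared_products(x)
full, even, odd = dct_1d(x), dct_1d(x[0::2]), dct_1d(x[1::2])
reference = {"X(2)": full[2], "X(6)": full[6], "Xe(1)": even[1],
             "Xe(3)": even[3], "Xo(1)": odd[1], "Xo(3)": odd[3]}
worst = max(abs(shared[name] - reference[name]) for name in reference)
print(f"largest difference from the reference transforms: {worst:.2e}")
```

Only sixteen multiplications are performed here; every output is then a signed selection of those products, which mirrors the role the text assigns to permutation and sign changes ahead of the accumulators.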
To compare the resulting structure with alternative 
implementation options, a complexity metric of the form: 
Gate count x delay per sample has been used. Several 
implementation alternatives were investigated, based on 
existing algorithms and/or implementations [5-7, 9]. It was
found that a pipelined version of the proposed architecture
incorporating primitive operator graph multipliers was the
most efficient solution, offering a throughput of
approximately 10 ns per pixel using approximately 20 000
gate equivalents. This represents a 50% reduction in
complexity when compared to solutions based on Lee's [5]
or McGovern's [6] algorithms.
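The arithmetic behind the area-time metric is straightforward. The short Python sketch below uses only the approximate figures quoted above (20 000 gate equivalents, 10 ns per pixel); the value labelled `implied_alternative` is derived from the stated 50% reduction and is not a measured result from the paper.

```python
def area_time(gate_count, delay_ns):
    """Complexity metric: gate count multiplied by delay per sample (gate*ns)."""
    return gate_count * delay_ns


# Approximate figures quoted for the proposed pipelined architecture.
proposed = area_time(20_000, 10.0)

# A 50% reduction implies the compared solutions score roughly twice as high.
# This is an inference from the stated percentage, not reported data.
implied_alternative = proposed / (1.0 - 0.50)

print(f"proposed: {proposed:.0f} gate*ns, implied alternative: {implied_alternative:.0f} gate*ns")
```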
4. CONCLUSIONS 
The possibility of adaptively selecting the size of a DCT 
could be a major asset in the design of future video codecs 
especially in the case of interlaced images. This paper has 
proposed a structure based on a pipelined modified 
Madisetti algorithm incorporating primitive operator graph 
based multipliers. In an extensive comparative study, this 
was found to be the most efficient solution to this
problem, offering a 50% saving over the alternatives considered.
ACKNOWLEDGEMENTS 
The authors wish to thank Sony Broadcast and Professional 
Europe and the Centre for Communications Research at 
Bristol University for their support of this work. 
REFERENCES 
[1] Chan Y.-H., Siu W.-C., “A Cyclic Correlated Structure for
the Realisation of Discrete Cosine Transform”, IEEE Trans.
Circuits and Systems II, Vol. 39, No. 2, Feb. 1992, pp 109-113.
[2] Vetterli M., Ligtenberg A., “A Discrete Fourier Cosine
Transform”, IEEE J. Selected Areas Commun., Vol. 4, No. 1,
Jan. 1986, pp 49-61.
[3] Chen W.-H., Smith C.H., Fralick S.C., “A Fast
Computational Algorithm for the Discrete Cosine
Transform”, IEEE Trans. on Communications, Vol. 25, No. 9,
Sept. 1977, pp 1004-1009.
[4] Chau L.-P., Siu W.-C., “Recursive Algorithm for the
Discrete Cosine Transform with General Lengths”,
Electronics Letters, Vol. 30, No. 3, Feb. 1994, pp 197-198.
[5] Lee P., Huang F.-Y., “Restructured Recursive DCT and
DST Algorithms”, IEEE Trans. Signal Processing, Vol. 24,
No. 7, July 1994, pp 1600-1609.
[6] McGovern F.A., Woods R.F., Yan M., “Novel VLSI
implementation of (8x8) point 2-D DCT”, Electronics
Letters, Vol. 30, No. 8, 14 Apr. 1994, pp 624-626.
[7] White S.A., “Application of Distributed Arithmetic to
Digital Signal Processing: A Tutorial Review”, IEEE ASSP
Mag., Vol. 6, July 1989, pp 4-19.
[8] Chan Y.-T., Wang C.-L., “New Systolic Array
Implementation of the 2-D Discrete Cosine Transform and
its Inverse”, IEEE Trans. Circuits and Systems for Video
Tech., Vol. 5, No. 2, Apr. 1995, pp 150-157.
[9] Madisetti A., Wilson Jr. A.N., “A 100 MHz 2-D 8x8
DCT/IDCT Processor for HDTV Applications”, IEEE
Trans. Circuits and Systems for Video Tech., Vol. 5, No. 2,
Apr. 1995, pp 158-165.
[10] Bull D.R., Horrocks D.H., “Primitive operator digital
filters”, IEE Proc., Vol. 138, No. 3, June 1991, pp 401-412.
Figure 5: Calculation of X(2) and X(6)
Figure 6: Adaptive DCT architecture
