A new convolution structure for the realisation of the discrete cosine transform by Chan, YH & Siu, WC
A New Convolution Structure for the Realisation 
of the Discrete Cosine Transform 
Yuk-Hee Chan and Wan-Chi Siu 
Department of Electronic Engineering 
Hong Kong Polytechnic 
Hung Hom, Kowloon 
Hong Kong 
ABSTRACT 
In this p%er, we present a new formulation for converting 
a length-N( = 2  ) Discrete Cosine Transform into two length-N/2 
correlations. This formulation enables us to realise the Discrete 
Cosine Transform with a reduced number of operations compared 
to conventional approaches and it also results in extremely regular 
structure which is most witable for the realisationusing distributed 
arithmetic. 
Introduction 
The discrete cosine transform (DCT) has been widely used 
as a tool for digital signal processing applications, such as image 
coding. Many algorithms for the computation of the DCT have 
been proposed since the introduction of the DCT in 1974 by 
Ahmed et al.[l]. These algorithms can broadly be classified into 
two groups: 1) indirect computation through fast Discrete Fourier 
Transform and Walsh Hadamard transform[2-51 and 2) direct 
computation of DCT through matrix decomposition or recursive 
computation[6-151. 
Among them, Lee[6]'s Algorithm and Vetterli[ 151's 
algorithm meet the minimum known number of multiplications to 
implement a length 2* DCT. Lee[6]'s algorithm decomposes an 
N-point DCT into the sum of two N/2 point DCT's repeatedly to 
achieve the number of real multiplications as (N/2)logfl for an 
N-point DCT, with N =2'". Vetterli[lS]'s algorithm is special as it 
is a recursive algorithm which uses the property that DCT, DFT, 
sin-DFT and cos-DFT can be decomposed into two half-length of 
the individual transforms. Hence, it is difficult to classify it 
critically. Narasimha[S]'s algorithm is a typical example of group 
one. It uses an N-point discrete Fourier transform (DFT) 
algorithm to evaluate a DCT by a simple rearrangement of the 
input data, which requires Nlogfl-N + 2 real multiplications. 
In this pger ,  we present a new formulation for converting 
a length-N( = 2 ) Discrete Cosine Transform into two length-N/2 
correlations. This formulation enables us to realise the Discrete 
Cosine Transform with a reduced number of operations compared 
to conventional approaches and it also results in extremely regular 
structure which is most suitable for the realisation using distributed 
arithmetic. 
The Algorithm Derivation 
The DCT[l] of a real data sequence {x(i):i=O,l,,..N-l}, 
where N = 2'" and m is an integer, is defined by 
N- 1 
X(k) = 2 x(i) cos [2n(2i + l)k/4N] 
i=O for k =  0,1, ... N-1 (1) 
Let y(i) = x(2i) 
y(N-i-1) = x(2i + 1) for i =  O,l, ... N/2-1 (2) 
N- 1 
then, X(k) = y(i) cos [ h ( 4  + l)k/4N] 
for k = 0,l ... N-1 (3) 
Now let us split X(k) into odd and even sequences, say 
X(2k+ 1) and X(2k), and look for a formulation of the form 
cos[(4i + 1)(4k + 1)2~/4N] for cosine terms. The reason for such an 
arrangement will be clear at a latter stage. 
i=O 
For odd terms of X@): 
N- 1 
X(2k + 1) = y(i) cos [h(4i  + 1)(2k + 1)/4N] 
for k = O,l, ... N/2-1 i=O 
N- 1 
We define X(k)  = 2 y(i) cos [%(4i + 1)(4k + 1)/4N] 
for k = 0,1, ... N-1 
for k = O,l,  ... N/4-1 
for k = N/4,N/4 + 1, ... N/2-1 
i=O 
then it can be shown that 
X(4k + 1) = X'(k) 
X(2N-4k-1) = - X(k) 
For even terms ofX@): 
X(2k) = 
N- 1 
y(i) cos [h(4i  + 1)(2k)/4N] 
for k = O,l, ... N/2-1 i=O 
If we define X*(k) = 
N- 1 
z(i) cos[h(4i + 1)(2k+ 1)/4N] 
for k = 0,1, ... N-1 i=O 
where z(i) =2y(i)cos[(h/4N)(4i + l)] 
then w,e have X*(N/2-1) =X(N-2) 
and X (k) = X(2k) + X(2k + 2) 
for i = 0,1, ... N-1 (10) 
for k=0,1, ... N/2-2 (11) 
In order to have the required form, let us define again 
F(k) = . z(i) cos[h(4i + 1)(4k + 1)/4N] 
for k=O,1, ... N-1 (12) 
We can obtainX*(k) through the realisation of eqn. 12 since 
it is read;y shown that 
X (2k) = F(k) for k = OJ, ..., N/4-1 (13) 
X*(N-2k-1) = -F(k) for k = N/4,N/4 + 1, ... N/2-l(l4) 
N- 1 
1 = o  
CH2868-8/90/0000-2373$1.00  1990 IEEE 
Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on September 30, 2009 at 02:09 from IEEE Xplore.  Restrictions apply. 
Hence, the even terms of X(k) can be determined by the 
Let us summarize what we have done so far. X(k) can be 
sequence {F(k):k=O,l..N/2-1} through eqn.ll,13 and 14. 
computed through two steps: 
Step 1: Compute X(k) and F(k) for k=0,1, ... N/2-1 and 
Step 2: Compute X(k) by the following set of equations: 
X(4k + 3) = -X(N/2-k-l) 
X(4k + 2) = -F(N/2-k-1) - X(4k + 4) 
X(4k+ 1) = X(k) 
X(4k) = F(k) - X(4k + 2) for k = N/4-1, ... 1,O (15) 
where X(N)is defined as zero. 
Now let us define a bijective mapping[l4] on the set 
<5"> 4N = 4U + 1 (16) 
INDEX= (i: i = O,l,  ... N-l), where N = 2m, to itself 
then from eqn. 5 and 12, 
we have X"(k) = E y"(i) cos[(%/4N) < 
,where u,v E 0,1, ... N-1 
N- 1 
> 4 ~ ]  
i=O for k=O,1 ... N-1 (17) 
N- 1 
and 
where X(k)  =x( (< 5k >4N - 1)/4 ) 
F ( k )  = E z"(i) cos[(%/4N) < 5 i + k >  4 ~ 1  
i = O  for k = 0,l ... N-1 (18) 
F'(k) =F(  ( <5k> 4 N -  1)/4 ) 
y"(i) =y( ( < 5: > 4N - 1)/4 ) 
Z"(i) = Z( ( < 5l> 4N - 1)/4 ) for i,k E INDEX (19) 
We note that both of X'(k) and F'(k) are in correlation 
form which can be computed easily as there exists a number of fast 
and easy- implemented algorithms[l6] for the realisation of such 
a structure. 
Further Simplification and Realisation 
Consider the correlation F"(k) in eqn.(l8). We have 
where C(n) = cos[(b/4N) < 5 > 4 ~ 1  
for any integer n (21) 
As F"(N/2+n)  = -F"(n) for n = 0 , 1  ... N/2-1, only 
F(O),..F"(N/2-1) are required to compute. One may observe that 
C(N/2 + n) = -C(n) for n = 0 , l  ... N/2-1. Hence we have 
where g(n) = z"(n>-z"(N/2 + n) for n = 0,l ... N/2-1 
This saves almost half of the number of operations required 
to  realise F"(k) by eqn.(20). Furthermore,  as g(n)  = 
2{y"(n) +y"(N/2 + n)}C(n) for n = O,l..N/2-1, we need not compute 
z(n) fromy(n) term by term as shown in eqn.(lO) to compute the 
sequence {g(n)). This further saves N/2 multiplications. 
Let us clarify our proposal with a length 8 DCT with input 
sequence {x(i):i = O,l, ... 7}. Rearranging the data according eqn.(2) 
and from eqn.( lo), we have 
, 
where a = n/16. 
From eqns. (18), (20) and (22), we have 
L 
Similarly, from eqn.(l7), 
Hence, 
x(7) = -x'(2), 
X(5)=X'(1), X(4) =F(l)-X(6), 
x(3) =-x'(3), X(2) = -F(3)-X(4), 
X(1) = x'(O), X(0) = F(0)-X(2) 
X(6) = -F(2), 
Note that values of F"(N/2-n-l)'s of eqn.(22) are exactly 
equal to the coefficients of 2% of the polynomial F(z): 
where G(z) = g(N/2-)ZNn-' + ... +g(l)z +g(O), and 
C(z) = C(0)z '2-1 + ... + C(N/2-2)2 + C(N/2-1) 
Recall that an Nth order polynomial can be interpolated 
exactly through N + 1 points using Lagrange's interpolation 
formula. This suggests a method for determining the polynomial 
F(z). Firstly, the (N-2)th order polynomial C(z)G(z) is 
interpolated by using Lagrange's interpolation formula with 
G(zo)C(zo), G(zl)C(zl), ... and G(zN-z)~(zN-~), which requires N-1 
multiplications. Then F(z) can be determined by equation (23). 
Hence, values of F'(k)'s in eqn.(22) can be determined with N-1 
multiplications only. Values ofX(k)'s can also be computed by the 
same method. This method reduces the number of real 
multiplications of the new algorithm to 5N/2-2. 
Table 1 shows a comparison of the multiplicative 
complexity among the new algorithm, Lee[6]'s algorithm, 
Vetterli[ 151's algorithm and Narasimha[S]'s algorithm. The new 
algorithm shows its superiority in multiplicative complexity 
compared to other algorithms when N is larger than 16. 
F(z) = C(z)G(z)mod(zNn + 1) (23) 
Table 1. Comparision of number of multiplications between 
algorithms. 
New al. Algorithm[6] Algorithm[ 151 Algorithm[S] 
N 5N/2-2 (N/2)10g2N (N/2)10g;?N Nlog2N-N + 2 
8 18 12 
16 38 32 
32 78 80 
64 158 192 
128 318 448 
256 638 1024 
512 1278 2304 
12 18 
32 50 
80 130 
192 322 
448 770 
1024 1794 
2304 4098 
2374 
Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on September 30, 2009 at 02:09 from IEEE Xplore.  Restrictions apply. 
Hardware Realisation 
The proposed algorithm is very structural. Hence it is most 
suitable for VLSI implementation. Consider Fig.1 which shows 
a block diagram of the implementation of the new algorithm. Only 
three simple permutation networks and two correlation hardware 
modules are required. Besides, nearly half of hardware cost can be 
saved as hardware modules A and B can be realised serially using 
a single unit. In this case the realisation time is unavoidably 
increased. 
y(i) =x(2i), 
y(N-i-l)=x(2i+ 1) 
S t a g e  T 1 
d l  y (n)=y"(n) -y"(N/Z+n) g(n)=Z(y"(n) +y"(N/Z+n)) C(n) 
I I 
Hiw module B 2
Addition 
Figure 1. Hardware implementation olthe proposed algorithm 
S t a g e  1 2 
S t a g e  I3 
1 
There are a number of implementation methods to build a 
cyclic correlation hardware module. One of the possibilities is to 
realise it by distributed arithmetic[ 171. The distributed arithmetic 
is so regular and structural that it is very suitable for VLSI 
realisation. The advantages of this architecture are: (1) no actual 
multiplication involved as multipliers are replaced by memory 
look-up tables, (2) high accuracy as i t  suffers fewer 
roundingltruncation error than the other structures, (3) possible 
for modular circuit design as the structure is extremely regular and 
(4) simple structure which leads to a saving of gate count and 
makes routing easy. These features allow a high speed circuit 
design composed of memories, adders and registers only. 
Consider eqn.(22), if g(-n) is defined as -g(N/2-n), we have 
As g(?-k) can be expressed in the way 
M- 1 
g(7-k) = -g(V-k)O + g(V-k)j 2-j (25) 
j = 1  
where M, g(7-k)j and g(q-k)o are the word length, the jth 
most significant bit and the sign bit respectively. After scaling to 
2's-complement fractional number, equation (24) becomes 
j = 1  ? = O  
N/2- 1 
7 = 0  
Value of g(?-k)j C(?) can be pre-calculated and stored 
in a ROM with ROM size = 2N'2 words. Then F(k)  can be 
obtained by M ROM accesses and M-1 shift-additions after g(n s 
are available. The. implementation of the convolution l y  
distributed arithmetic is as shown in Fig.2. X"(k) can also be 
computed by the same approach. 
. 
initial 
value 
= a  
F ( k )  
U 
M bits word length 
Figure 2. Implementation of convolution with Distributed Arithmatic 
To speed up the whole process, we introduce the concept 
of pipelining. As shown in Fig.1, the whole process is divided into 
three stages. Stage 1 includes jermutation Network A and 
hardware modules for realising y (n) and g(n). Fig3 is a typical 
approach torealiseyO(n) and g(n). Note that no address generation 
is required to obtain the values for C(n). Actually, values of C(n) 
are stored in a circular buffer such that it is automatically sent out 
one by one sequentially. The circular buffer can be constructed 
with either ROM or RAM. It would be more flexible if RAM is 
used. One multiplier is required in this stage. 
y"(N/2 + n) 
Addr= <Addr+l>N/z 
ROM 
Figure 3. Hardware module for realising yo@) and g(n) from y"(n). 
Stage 2 includes the correlation hardware and the 
permutation network for realising 5 ' > 4 ~ = 4 k  + 1. The 
correlation hardware is realised by the distributed arithmetic as 
mentioned above. Fig. 4 shows a possible approach to realise the 
permutation network. Typically, permuted data can be obtained by 
using the technique of table look-up or a switch network. The use 
of switch networkincreases the hardware complexity and hence the 
hardware cost while the use of table look-up involves address 
generation. However, we can use ROM to store up the address 
generation table such that permuted data can be retrieved 
efficiently within two memory accesses. In a practical case, such as 
image compression, we only deal with short lengths, for instance, 
2375 
Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on September 30, 2009 at 02:09 from IEEE Xplore.  Restrictions apply. 
Addr = 
<Addr+  1 > ~ / z  
(=index) 
Addr = .1 
RAM index permuted 
data 4 data 
ROM 
Permutation 
table 
Figure 4. Permutation Network by using Table Look-Up. 
a length-16 DCT. Hence, this approach is effective and efficient as 
only a table of small size is required. Instead of dynamic approaches 
mentioned above, we can use static approach by wiring up inputs 
to appropriate outputs of the permutation network. This is not 
flexible but can speed up the permutation. It may not be impractical 
as only a particular short length DCT is required for some special 
usage. The permutation network Ain stage 1 has also the same type 
of constraints. 
& S 'tch init ial  Subtracter Register value c=a-b I cL-J  
U 
Figure 5. Addition Matrix in Stage 3. 
Stage 3 can be implemented as shown in FigS. Input data 
should be negated alternatively before going into the subtracter. 
This can be done by a dedicated hardware buffer which negated 
input data alternatively before sending them out or by using a 
multiplexer which acts as an switch as shown in FigS. The output 
of the subtracter is fed back to the input of the subtracter for the 
following subtraction such that final results can come out 
recursively. 
According to the derivation from eqn.(24) to (26), the time 
required for computing the two correlations is N(M- 1) TA/2, where 
TA is the time required for an addition, if they are computed in 
parallel as shown in Fig.1. The time required for stage 1 and stage 
3 are roughly (N/2) (TA + TM) and (N/2-1) TA respectively, where 
TM is the time required for an multiplication. Therefore, stage two 
will dominate the timing of the pipeline process if a fast multiplier 
exists. This gives the lower bound of the total time required for 
computing all DCT coefficients. Actually, we can make some 
modifications to increase the speed of the correlation further. The 
simplest one is to partition the input words into the most significant 
half and least significant half and so on. Then we can introduce 
parallelism in the computation by increasing the number of adders 
as we can deal with additions in parallel. Theoretically speaking, 
we can speed up the whole process to the upper bound that requires 
the least computation time, say (N/2-1) TA, by introducing a 
sufficient number of adders and multipliers when the hardware 
cost is not a problem. This upper bound is limited by the recursive 
nature of stage 3. 
In short, the proposed algorithm can be realised efficiently 
m d  easily by dedicated hardware or gate array technology. The 
structure of the hardware required is so simple that it involves a 
small memory size, a few adders, registers and one multiplier only. 
This can achieve a high performance DCT processor at aminimum 
cost and development time. 
Conclusion 
In this paper, a new algorithm is presented such that an 
N- length DCT can be directly computed by correlation. This 
algorithm involves no DFT computation and can be realised 
through very simple hardware structures and is very suitable for 
VLSI implementation. 
Reference 
[ 11 N.Ahmed, T.Natarajan and K.R.Rao, "Discrete cosine 
transform," IEEE trans., Vo1.C-23, pp.90-94, Jan. 1974. 
[2] M.R.Haralick, "A storage efficient way to implement the 
discrete cosine transform," IEEE Trans., Vo1.C-25, pp.764-765, 
July 1976. 
131 B.D.T ZLg and W.C.Miller, "On computing discrete cosine 
transform." IEEE Trans., Vol. C-27, pp.966-968, Oct. 1978. 
[4] J.Makou1, "A fast cosine transform in one and two dimensions," 
IEEE Trans., Vol.ASSP-28, pp.27-34, Feb. 1980. 
[5]  M.J.Narasimha and ii.M.Peterson, "On the computation of the 
discrete cosine transform," IEEE Trans., Vo!.COM-26, 
pp.934-946, June 1978. 
[6] B.G.Lee, "A new algorithm to compute the discrete cosine 
transform," IEEE Trans., VoLASSP-32, pp.1243-1245, Dec. 
1984. 
[7] W.H.Chen, C.H.Smith and S.C.Fralick, "A fast computational 
algorithm for the discrete cosine transform," IEEE Trans., 
Vol. COM-25, pp.1004-1009, Sept. 1977. 
[SI H.S.Hou, "A fast recursive algorithm for computing the discrete 
cosine transform," IEEE Trans., VoLASSP-35, pp.1455-1461, 
Oct. 1987. 
[9] M.L.Haque, " A two dimensional fast cosine transform," IEEE 
Trans., VoLASSP-33, pp.1532-1539, Dec. 1985. 
[ 101 HXtajima, "A symmetric cosine transform," IEEE Trans., 
Vo1.C-29, pp.317-323, Apr. 1980. 
[ l l ]  Z.Wang, "Fast algorithm for the discrete W transform and for 
the discrete Fourier transform," IEEE Trans., Vol.ASSP-32, 
[12] N.Suehiro and M.Hatori, "Fast algorithms for the DFT and 
other sinusoidal transforms," IEEE Trans., Vol.ASSP-34, 
pp.642-644, June 1986. 
[13] Z.Wang, "On computing the discrete Fourier and cosine 
transforms," IEEE Trans., VoLASSP-33, pp.1341-1344, Oct. 
1985. 
[ 141 P.Duhame1 and H. H'mida, " New 2n DCT algorithms suitable 
for VLSI implementation," ICASSP-85, Tampa, March 1985, 
[15] M. Vetterli and H. Nussbaumer, " A simple FFT and DCT 
algorithms with reduced number of operation," Signal 
Processing, Vol. 6, No. 4, August 1984, pp.267-278. 
[ 16]H.J.Nussbaumer, "Fast Fourier Transform and Convolution 
Algorithms," 2nd corrected and Updated Edi t ion,  
Springer-Verlag, 1982. pp.32-78. 
[17] S.A. White, " Applications of distributed arithmetic to digital 
signal processing: A tutorial review," IEEE ASSP magazine, 
pp.803- 816, Aug. 1984. 
pp.780-783. 
Vol. 6, NO. 3, July 1989, pp.4-19. 
2376 
Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on September 30, 2009 at 02:09 from IEEE Xplore.  Restrictions apply. 
