Novel formulation and realisation of discrete cosine transform using distributed arithmetic by Chan, YH & Siu, WC
IEEE Region 10 Conference on Computer and Communication Systems, September 1990, Hong Kong 
Novel Formulation and Realisation of 
Discrete Cosine Transform using Distributed Arithmetic 
Yuk-Hee Chan and Wan-Chi Siu 
Department of Electronic Engineering 
Hong Kong Polytechnic 
Hung Hom, Kowloon 
Hong Kong 
Abstract 
In this paper, a new algorithm is introduced that converts
a DCT of odd prime length N into two cyclic correlations of
length (N-1)/2. This formulation allows the DCT to be
realised with distributed arithmetic and yields an extremely
regular structure that is well suited to VLSI realisation.
An example is given to show the feasibility and the
structural regularity of the algorithm.
1. INTRODUCTION
The discrete cosine transform (DCT) is an important
tool for digital signal processing as it performs much like the
optimal Karhunen-Loeve transform (KLT) under a variety
of criteria. Many algorithms for the computation of the DCT
have been proposed since its introduction in
1974 by Ahmed, Natarajan and Rao[1]. Among these
algorithms, three[2-4] meet the minimum known
number of multiplications. However, all
of them still rely heavily on multipliers,
which are expensive to implement in VLSI or other
technologies.
Distributed arithmetic[5-9] is a technique whose
structure is extremely regular, which makes it well suited
to VLSI realisation. Its advantages include: (1) no actual
multiplication is involved, as multipliers are replaced by
memory look-up tables; (2) high accuracy, as it suffers fewer
rounding/truncation errors than other structures;
(3) modular circuit design is possible, as the
structure is extremely regular; and (4) the structure is simple,
which saves gates and makes routing easy. These
features allow a high-speed circuit design composed of
memories, adders and registers only. Moreover, the
structure of distributed arithmetic is particularly
suited to the realisation of convolutions/correlations. It
can reduce the computation of an N-length cyclic
convolution from N^2 multiplications and N(N-1)
additions to N(M-1) additions only, where M is the
wordlength of the input sequence {x(n): n = 0,1,...,N-1}.
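As a minimal sketch of the look-up-table principle described above (not the circuit used in this paper), the following Python fragment computes an inner product of fixed coefficients with M-bit two's-complement inputs using table look-ups, shifts and additions only. The function names `build_lut` and `da_inner_product`, and the use of integer (pre-quantised) coefficients, are assumptions made for illustration.

```python
def build_lut(coeffs):
    """Table of all partial sums: lut[a] adds coeffs[n] for each bit n set in a."""
    return [sum(c for n, c in enumerate(coeffs) if (a >> n) & 1)
            for a in range(1 << len(coeffs))]


def da_inner_product(lut, xs, wordlength):
    """Return sum(coeffs[n] * xs[n]) without any multiplication.

    xs are integers representable in `wordlength`-bit two's complement.
    One table look-up is made per bit plane, so M look-ups and M-1
    accumulating additions/subtractions are needed for M-bit words.
    """
    mask = (1 << wordlength) - 1
    words = [x & mask for x in xs]            # two's-complement bit patterns
    acc = 0
    for j in range(wordlength):
        # Gather bit j of every input word into one table address.
        addr = 0
        for n, w in enumerate(words):
            addr |= ((w >> j) & 1) << n
        partial = lut[addr] << j              # bit plane j has weight 2**j
        if j == wordlength - 1:
            acc -= partial                    # sign bit has weight -2**(M-1)
        else:
            acc += partial
    return acc


coeffs = [3, -5, 7, 2]                        # hypothetical quantised coefficients
xs = [5, -3, 0, -8]                           # 4-bit two's-complement inputs
print(da_inner_product(build_lut(coeffs), xs, 4))   # prints 14
```

The table has 2^L entries for L coefficients, which is why the technique is most attractive for short correlations.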
Hence, the success of using distributed
arithmetic to realise a given transform depends
heavily on whether the transform can be converted
into a convolution/correlation form. Successful results[7-9]
for the realisation of discrete Fourier transforms have been
reported. However, fewer significant results have been reported
in the literature for the realisation of the DCT using
distributed arithmetic. Recently, we proposed a new
approach to realise a 2^m-length DCT by correlations[10]. In
that case, distributed arithmetic can be used
to achieve an easy and fast chip implementation.
Most fast discrete cosine transform
algorithms have been proposed for the computation of 2^m-length
DCTs. In this paper, a new algorithm is proposed that
converts a DCT of odd prime length N into two cyclic
correlations of length (N-1)/2. This formulation allows us to
realise the DCT using distributed arithmetic with the
minimum number of operations, and it also results in an
extremely regular structure that is well suited to VLSI
realisation. An example is also given to show the
feasibility and the structural regularity of the present
algorithm.
2. ALGORITHM DERIVATION: 
The DCT[1] of a real data sequence {y(i): i = 0,1,...,N-1} is defined as

$$Y(k) = \sum_{i=0}^{N-1} y(i) \cos\left[\frac{2\pi(2i+1)k}{4N}\right] \quad \text{for } k = 0,1,\ldots,N-1 \tag{1}$$

We first define another sequence {x(i): i = 0,1,...,N-1} as

$$x(N-1) = y(N-1), \qquad x(i) = y(i) - x(i+1) \quad \text{for } i = 0,1,\ldots,N-2 \tag{2}$$

Then we have

$$Y(k) = \{ 2T(k) + x(0) \} \cos\left[\frac{k\pi}{2N}\right] \quad \text{for } k = 0,1,\ldots,N-1 \tag{3}$$

where T(k) is defined as

$$T(k) = \sum_{i=1}^{N-1} x(i) \cos\left[\frac{\pi i k}{N}\right] \quad \text{for } k = 1,2,\ldots,N-1 \tag{4}$$
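The relation between eqns (1) to (4) can be checked numerically. The following is a minimal sketch, assuming the reading of the equations given above; the function names `dct_direct` and `dct_via_recursion` are hypothetical and are introduced only for this check.

```python
import math


def dct_direct(y):
    """Eqn (1): Y(k) = sum_i y(i) cos[2*pi*(2i+1)*k / 4N]."""
    N = len(y)
    return [sum(y[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * N))
                for i in range(N))
            for k in range(N)]


def dct_via_recursion(y):
    """Eqns (2) to (4): pre-process y into x, form T(k), then scale."""
    N = len(y)
    # Eqn (2): x(N-1) = y(N-1), x(i) = y(i) - x(i+1).
    x = [0.0] * N
    x[N - 1] = y[N - 1]
    for i in range(N - 2, -1, -1):
        x[i] = y[i] - x[i + 1]
    out = []
    for k in range(N):
        # Eqn (4): T(k) = sum_{i=1}^{N-1} x(i) cos[pi*i*k / N].
        T = sum(x[i] * math.cos(math.pi * i * k / N) for i in range(1, N))
        # Eqn (3): Y(k) = {2T(k) + x(0)} cos[k*pi / 2N].
        out.append((2 * T + x[0]) * math.cos(k * math.pi / (2 * N)))
    return out


y = [1.0, -2.0, 3.5, 0.25, 4.0, -1.5, 2.0]
error = max(abs(a - b) for a, b in zip(dct_direct(y), dct_via_recursion(y)))
print(error < 1e-9)   # prints True
```

The identity follows from y(i) = x(i) + x(i+1) together with the sum-to-product rule for cosines, so the check does not depend on N being prime.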
CH2866-2/90/0000-190 $1.00 © 1990 IEEE
Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on July 28, 2009 at 00:17 from IEEE Xplore.  Restrictions apply. 
Among the Y(k)'s,

$$Y(0) = \sum_{i=0}^{N-1} y(i)$$

which requires simple additions only. For the other terms of
Y(k), we have to obtain the sequence {T(k): k = 1,2,...,N-1}.
Now let N be an odd prime. If we split T(k) into even- and
odd-indexed sequences, we have
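$$T(2k) = \sum_{i=1}^{N-1} x(i) \cos\left[\frac{2\pi i k}{N}\right] \quad \text{for } k = 1,2,\ldots,(N-1)/2 \tag{5}$$

and

$$T(2k'+1) = \sum_{i=1}^{N-1} x(i) \cos\left[\frac{\pi i (2k'+1)}{N}\right] \quad \text{for } k' = 0,1,\ldots,(N-3)/2 \tag{6}$$

Let 2k' + 1 = N - 2k, then we have

$$T(N-2k) = \sum_{i=1}^{N-1} e(i) \cos\left[\frac{2\pi i k}{N}\right] \quad \text{for } k = 1,2,\ldots,(N-1)/2 \tag{7}$$

where

$$e(i) = (-1)^{i} x(i) \quad \text{for } i = 1,2,\ldots,N-1 \tag{8}$$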
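As N is prime, there exists a bijective mapping of the
set INDEX = {i: i = 1,2,...,N-1} onto itself:

$$\langle g^{u} \rangle_N = v \tag{9}$$

where u,v ∈ {1,2,...,N-1}, g is a primitive root of N and
⟨x⟩_N means the residue of the number x modulo N. If we
extend the domain of T(N-2k) and T(2k) in eqns (7) and (5)
to {k: k ∈ INDEX}, we have

$$T'(k) = \sum_{i=1}^{N-1} x'(i) \cos\left[\frac{2\pi}{N} \langle g^{i+k} \rangle_N\right] \tag{10}$$

and

$$T''(k) = \sum_{i=1}^{N-1} e'(i) \cos\left[\frac{2\pi}{N} \langle g^{i+k} \rangle_N\right] \tag{11}$$

where

$$\begin{aligned}
x'(i) &= x(\langle g^{i} \rangle_N) \\
e'(i) &= e(\langle g^{i} \rangle_N) \\
T'(k) &= T(2\langle g^{k} \rangle_N) \\
T''(k) &= T(N - 2\langle g^{k} \rangle_N)
\end{aligned} \quad \text{for } i,k \in \text{INDEX} \tag{12}$$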
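To make the index mapping concrete, the sketch below builds the permutation ⟨g^i⟩_N and checks eqns (10) to (12) numerically against eqn (4). It is an illustration only: the helper names `primitive_root`, `T` and `correlation_form` are hypothetical, the primitive root is found by brute force, and T(k) is evaluated from the formula of eqn (4) for arguments outside 1,...,N-1.

```python
import math


def primitive_root(N):
    """Smallest primitive root of the prime N, found by brute force."""
    for g in range(2, N):
        if len({pow(g, u, N) for u in range(1, N)}) == N - 1:
            return g
    raise ValueError("no primitive root found; is N an odd prime?")


def T(x, k):
    """Formula of eqn (4), evaluated for an arbitrary integer k."""
    N = len(x)
    return sum(x[i] * math.cos(math.pi * i * k / N) for i in range(1, N))


def correlation_form(x, g):
    """Return (perm, T1, T2) with T1[k-1] = T'(k) and T2[k-1] = T''(k)."""
    N = len(x)
    perm = {i: pow(g, i, N) for i in range(1, N)}          # <g^i>_N, eqn (9)
    xp = {i: x[perm[i]] for i in perm}                     # x'(i), eqn (12)
    ep = {i: (-1) ** perm[i] * x[perm[i]] for i in perm}   # e'(i), eqns (8), (12)

    def C(n):
        return math.cos(2 * math.pi * pow(g, n, N) / N)

    T1 = [sum(xp[i] * C(i + k) for i in perm) for k in range(1, N)]   # eqn (10)
    T2 = [sum(ep[i] * C(i + k) for i in perm) for k in range(1, N)]   # eqn (11)
    return perm, T1, T2


x = [0.5, 1.0, -2.0, 3.0, 0.25, -1.5, 2.0]      # arbitrary test data, N = 7
g = primitive_root(7)
perm, T1, T2 = correlation_form(x, g)
print(g, [perm[i] for i in range(1, 7)])        # prints 3 [3, 2, 6, 4, 5, 1]
```

Because the cosine argument depends only on i+k, both sums are cyclic correlations of the permuted inputs with one fixed kernel.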
We note that equations (10) and (11) are in circular
correlation form, which can be computed easily as there exist
a number of fast and easily implemented algorithms for the
realisation of such a structure. Moreover, we can
simplify these equations further.
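Consider the correlation T'(k) in eqn (10) for k = 1 to
N-1; we have

$$\begin{bmatrix} T'(1) \\ T'(2) \\ \vdots \\ T'(N-1) \end{bmatrix} =
\begin{bmatrix}
C(2) & C(3) & \cdots & C(1) \\
C(3) & C(4) & \cdots & C(2) \\
\vdots & \vdots & & \vdots \\
C(1) & C(2) & \cdots & C(N-1)
\end{bmatrix}
\begin{bmatrix} x'(1) \\ x'(2) \\ \vdots \\ x'(N-1) \end{bmatrix} \tag{13}$$

where

$$C(n) = \cos\left[\frac{2\pi}{N} \langle g^{n} \rangle_N\right] \tag{14}$$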
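As

$$T'((N-1)/2+n) = T'(n) \quad \text{for } n = 1,2,\ldots,(N-1)/2 \tag{15}$$

only T'(1), ..., T'((N-1)/2) need to be computed. One may
also observe that C((N-1)/2+n) = C(n) for n = 1,2,...,(N-1)/2.
Hence we have

$$\begin{bmatrix} T'(1) \\ T'(2) \\ \vdots \\ T'((N-1)/2) \end{bmatrix} =
\begin{bmatrix}
C(2) & C(3) & \cdots & C(1) \\
C(3) & C(4) & \cdots & C(2) \\
\vdots & \vdots & & \vdots \\
C(1) & C(2) & \cdots & C((N-1)/2)
\end{bmatrix}
\begin{bmatrix} x'(1) + x'((N-1)/2+1) \\ x'(2) + x'((N-1)/2+2) \\ \vdots \\ x'((N-1)/2) + x'(N-1) \end{bmatrix} \tag{16}$$

In other words, T'(k) is given by:

$$T'((N-1)/2+k) = T'(k) = \sum_{i=1}^{(N-1)/2} \{ x'(i) + x'((N-1)/2+i) \}\, C(i+k) \quad \text{for } k = 1,2,\ldots,(N-1)/2 \tag{17}$$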
This becomes an (N-1)/2 length cyclic correlation.
Similarly, T''(k) can be realised through another
(N-1)/2 length cyclic correlation:

$$T''((N-1)/2+k) = T''(k) = \sum_{i=1}^{(N-1)/2} \{ e'(i) + e'((N-1)/2+i) \}\, C(i+k) \quad \text{for } k = 1,2,\ldots,(N-1)/2 \tag{18}$$
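The complete formulation can be sketched in software as follows. This is an illustrative floating-point model under the reading of eqns (2), (3), (8), (14), (17) and (18) given above, not a description of the hardware; the names `dct_direct` and `dct_prime_length` are hypothetical, and the primitive root g is supplied by the caller.

```python
import math


def dct_direct(y):
    """Eqn (1) evaluated directly, for reference."""
    N = len(y)
    return [sum(y[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * N))
                for i in range(N)) for k in range(N)]


def dct_prime_length(y, g):
    """Length-N DCT (N an odd prime, g a primitive root of N) computed
    through two cyclic correlations of length (N-1)/2."""
    N = len(y)
    h = (N - 1) // 2
    # Eqn (2): recursive pre-processing.
    x = [0.0] * N
    x[N - 1] = y[N - 1]
    for i in range(N - 2, -1, -1):
        x[i] = y[i] - x[i + 1]
    p = [pow(g, i, N) for i in range(N)]        # p[i] = <g^i>_N (p[0] unused)
    e = [(-1) ** i * x[i] for i in range(N)]    # eqn (8)
    # Folded inputs of eqns (17) and (18), stored 0-based.
    xs = [x[p[i]] + x[p[h + i]] for i in range(1, h + 1)]
    es = [e[p[i]] + e[p[h + i]] for i in range(1, h + 1)]

    def C(n):                                   # eqn (14)
        return math.cos(2 * math.pi * pow(g, n, N) / N)

    T = [0.0] * N
    for k in range(1, h + 1):
        t_even = sum(xs[i - 1] * C(i + k) for i in range(1, h + 1))   # eqn (17)
        t_odd = sum(es[i - 1] * C(i + k) for i in range(1, h + 1))    # eqn (18)
        # T'(k) = T(2<g^k>) and T''(k) = T(N - 2<g^k>). T is even and
        # periodic with period 2N, so fold each index back into 1..N-1.
        m = 2 * p[k]
        T[m if m < N else 2 * N - m] = t_even
        T[abs(N - 2 * p[k])] = t_odd
    Y = [float(sum(y))]                         # Y(0) needs additions only
    for k in range(1, N):
        Y.append((2 * T[k] + x[0]) * math.cos(k * math.pi / (2 * N)))   # eqn (3)
    return Y


y = [1.0, -2.0, 3.5, 0.25, 4.0, -1.5, 2.0]
err = max(abs(a - b) for a, b in zip(dct_direct(y), dct_prime_length(y, 3)))
print(err < 1e-9)   # prints True
```

The two inner sums are the only places where products with C(i+k) occur, which is where a distributed arithmetic unit would replace the multiplications.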
In summary, we can realise a prime length N DCT by
computing two (N-1)/2-length cyclic correlations at a cost of
N-1 multiplications and 7(N-1)/2 additions. This structure is
particularly attractive for hardware realisations such as those
using distributed arithmetic[6] or other VLSI design
techniques.
3. AN EXAMPLE
We now illustrate our proposal with a length-7 DCT with
input sequence {y(i): i = 0,1,...,6}. Obviously, we have

$$\begin{aligned}
Y(0) &= \sum_{i=0}^{6} y(i) \\
Y(1) &= [\, 2T(1) + x(0) \,] \cos(\pi/14) \\
Y(2) &= [\, 2T(2) + x(0) \,] \cos(2\pi/14) \\
Y(3) &= [\, 2T(3) + x(0) \,] \cos(3\pi/14) \\
Y(4) &= [\, 2T(4) + x(0) \,] \cos(4\pi/14) \\
Y(5) &= [\, 2T(5) + x(0) \,] \cos(5\pi/14) \\
Y(6) &= [\, 2T(6) + x(0) \,] \cos(6\pi/14)
\end{aligned}$$
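By choosing 3 as the primitive root, eqn (17) gives us
the results of {T(k): k = 2,4,6}:

$$\begin{bmatrix} T'(4) = T'(1) = T(6) \\ T'(5) = T'(2) = T(4) \\ T'(6) = T'(3) = T(2) \end{bmatrix} =
\begin{bmatrix}
\cos 4\alpha & \cos 12\alpha & \cos 8\alpha \\
\cos 12\alpha & \cos 8\alpha & \cos 4\alpha \\
\cos 8\alpha & \cos 4\alpha & \cos 12\alpha
\end{bmatrix}
\begin{bmatrix} x(3) + x(4) \\ x(2) + x(5) \\ x(6) + x(1) \end{bmatrix}
\quad \text{for } \alpha = \pi/7$$

{T(k): k = 1,3,5} can also be obtained from eqn (18):

$$\begin{bmatrix} T''(4) = T''(1) = T(1) \\ T''(5) = T''(2) = T(3) \\ T''(6) = T''(3) = T(5) \end{bmatrix} =
\begin{bmatrix}
\cos 4\alpha & \cos 12\alpha & \cos 8\alpha \\
\cos 12\alpha & \cos 8\alpha & \cos 4\alpha \\
\cos 8\alpha & \cos 4\alpha & \cos 12\alpha
\end{bmatrix}
\begin{bmatrix} x(4) - x(3) \\ x(2) - x(5) \\ x(6) - x(1) \end{bmatrix}$$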
Hence two length-3 correlations are required for the
computation of a length-7 DCT, at a cost of 21 additions
and 6 multiplications. Compared with our previously
reported results[11], this is a further reduction
in the numbers of multiplications and additions
required.
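The index bookkeeping of this example can be reproduced with a few lines of Python. The sketch below is illustrative only; `index_tables` is a hypothetical helper that lists which samples of {x(i)} are paired at the correlator inputs and which T(k) each correlator output delivers, using the fact that the formula for T(k) is even in k and periodic with period 2N.

```python
def index_tables(N, g):
    """Input pairing and output indices for the two (N-1)/2 correlations."""
    h = (N - 1) // 2
    p = [pow(g, i, N) for i in range(N)]                 # p[i] = <g^i>_N
    # Samples combined at correlator input i: x(p[i]) and x(p[h+i]).
    pairs = [(p[i], p[h + i]) for i in range(1, h + 1)]
    # Correlator output k delivers T(2 p[k]) and T(N - 2 p[k]),
    # with each index folded back into the range 1..N-1.
    even_out = [min(2 * p[k], 2 * N - 2 * p[k]) for k in range(1, h + 1)]
    odd_out = [abs(N - 2 * p[k]) for k in range(1, h + 1)]
    return pairs, even_out, odd_out


print(index_tables(7, 3))
# prints ([(3, 4), (2, 5), (6, 1)], [6, 4, 2], [1, 3, 5])
```

For N = 7 and g = 3 the pairs are (x(3), x(4)), (x(2), x(5)) and (x(6), x(1)), and the outputs are T(6), T(4), T(2) and T(1), T(3), T(5) respectively.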
4. HARDWARE REALISATION 
The proposed algorithm is highly structured, which makes it
very suitable for VLSI implementation. In particular, one
can realise it with the gate array technique. Figure 1 shows a
block diagram of the implementation of the proposed
algorithm. To speed up the whole process, pipelining is
applied to exploit the parallelism. As shown in figure 1, the
whole process is divided into 3 stages. In stage 1, the
sequence {x(i)} is generated recursively from the input
sequence {y(i)}. The value of x(0) is stored in a register for
later use. The coefficient Y(0) can be obtained at the
same time by summing up the sequence {y(i)}. In this stage,
only simple hardware modules such as adders and registers
are required.
Figure 1. Hardware implementation of the proposed algorithm
(three pipeline stages: Stage 1, Stage 2 and Stage 3).
Figure 2. Hardware implementation of the {T(k)} sequence
(permutation networks forming x'(i) = x(⟨g^i⟩_N); adders forming
x''(i) = x'(i) + x'((N-1)/2 + i) and e''(i) = e'(i) + e'((N-1)/2 + i);
two correlation hardware modules, each summing C(i+k) times its
input over i = 1,...,(N-1)/2; permutation networks delivering T(k)).
The hardware module for computing the sequence
{T(k)} is shown in figure 2. This module contains
permutation modules and two identical hardware modules
used for the realisation of the correlations. To save
hardware cost, one of these two identical modules
could be eliminated. However, the realisation time would
unavoidably increase as the workload of the remaining module
would then be doubled.
Figure 3. Permutation Network by using Table Look-Up
(the index addresses a ROM permutation table, whose output, the
permuted index, addresses the data to give the permuted data).
The respective permutation networks are realised by
using the table lookup technique as shown in figure 3.
Compared with a switch network, the table lookup
technique reduces the hardware complexity and hence
lowers the hardware cost. In our approach, we use a ROM to
store the address generation table so that we can fetch
the permuted data within two memory accesses. It is
worth pointing out that in many applications, such as
image coding, only short-length DCTs are of concern. Hence,
this approach is very efficient and effective as only a
small table is needed.
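A software analogue of this table lookup is sketched below, assuming that the ROM simply stores ⟨g^i⟩_N for i = 1,...,N-1. The names `build_address_rom` and `fetch_permuted` are hypothetical and chosen only for this illustration.

```python
def build_address_rom(N, g):
    """ROM contents: entry i-1 holds <g^i>_N for i = 1..N-1."""
    return [pow(g, i, N) for i in range(1, N)]


def fetch_permuted(data, rom, i):
    """Return x'(i) = x(<g^i>_N) using two memory accesses."""
    addr = rom[i - 1]      # access 1: look up the permuted index
    return data[addr]      # access 2: read the data word at that index


rom = build_address_rom(7, 3)
data = ["x(%d)" % n for n in range(7)]        # stand-ins for stored samples
print([fetch_permuted(data, rom, i) for i in range(1, 7)])
# prints ['x(3)', 'x(2)', 'x(6)', 'x(4)', 'x(5)', 'x(1)']
```

The table has only N-1 entries, which is consistent with the remark that a small table suffices for short transform lengths.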
The distributed arithmetic technique is used to
realise the correlation hardware module. It is recognised
that the distributed arithmetic architecture is very suitable
for VLSI implementation as it has the following advantages:
(1) complicated multiplications are replaced by table
lookup and shift-add operations, (2) higher accuracy is
achieved because fewer rounding/truncation errors are incurred
and (3) modular circuit design is possible. These features allow
a high-speed circuit design composed of memories, adders and
registers only. Figure 4 shows the implementation of the
correlation with distributed arithmetic.
Figure 4. Implementation of convolution with Distributed Arithmetic
(M-bit word length; the input words are rotated by 1 word every M cycles).
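The rotation scheme indicated in figure 4 can be modelled in software as follows. This is a minimal sketch under stated assumptions, not the authors' circuit: the coefficients are taken as pre-quantised integers, a single look-up table is shared by all outputs, and the names `build_lut`, `da_dot` and `da_cyclic_correlation` are hypothetical.

```python
def build_lut(coeffs):
    """lut[a] = sum of coeffs[n] over the bits n that are set in address a."""
    return [sum(c for n, c in enumerate(coeffs) if (a >> n) & 1)
            for a in range(1 << len(coeffs))]


def da_dot(lut, words, M):
    """Bit-serial inner product of the table's coefficients with M-bit
    two's-complement words: one look-up per bit, M-1 accumulations."""
    mask = (1 << M) - 1
    acc = 0
    for j in range(M):
        addr = 0
        for n, w in enumerate(words):
            addr |= (((w & mask) >> j) & 1) << n
        partial = lut[addr] << j
        acc += -partial if j == M - 1 else partial   # sign bit is negative
    return acc


def da_cyclic_correlation(coeffs, s, M):
    """out[k] = sum_i coeffs[(i + k) % L] * s[i] for k = 0..L-1."""
    lut = build_lut(coeffs)           # one table serves every output
    words = list(s)
    out = []
    for _ in range(len(s)):
        out.append(da_dot(lut, words, M))
        words = words[-1:] + words[:-1]   # rotate the inputs by one word
    return out


print(da_cyclic_correlation([3, -1, 4], [2, -3, 1], 4))   # prints [13, -11, -2]
```

Rotating the input words rather than the coefficients is what allows the table contents to stay fixed while successive outputs are produced.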
After post-permutation, the sequence {T(k)} is fed
into the final stage of the implementation. In this stage, the
items of the sequence are shifted up by 1 bit and then added to
the pre-calculated x(0) value. The corresponding output items
are then multiplied by the factor cos[kπ/2N] to obtain the
final cosine transform result. These multiplications can be
realised with a multiplier. However, to save hardware
cost further, one can realise them with a shift-add circuit instead.
Note that stage 2 dominates the timing of the pipelined
process. In stage 2, a total of M-1 additions are required for the
computation of each T(k) by the distributed arithmetic
processing unit. This means that the time allowed for
multiplying each corresponding T(k) term by cos(kπ/2N)
is roughly equivalent to the time for M-1 additions. Hence,
the most economical and convenient way to realise the
multiplication by cos(kπ/2N) is a shift-add circuit
realised in hardware, as shown in figure 5. As the values of
cos(kπ/2N) are pre-defined, one can construct a table of
N-1 words of these values to ease the determination of
cos(kπ/2N). In this case, the processing time of the whole
process can be evenly distributed among the three stages of the
pipelined structure.
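A shift-add multiplication by a tabulated constant can be sketched as below. The fixed-point format (`frac_bits` fractional bits), the rounding of the table entries and the names `cosine_table` and `shift_add_multiply` are assumptions made for this illustration; the paper does not specify them.

```python
import math


def cosine_table(N, frac_bits):
    """Table of N-1 fixed-point words approximating cos(k*pi/2N), k = 1..N-1."""
    scale = 1 << frac_bits
    return [round(math.cos(k * math.pi / (2 * N)) * scale) for k in range(1, N)]


def shift_add_multiply(value, coeff_word, frac_bits):
    """Multiply an integer by a non-negative fixed-point constant using
    only shifts and additions, one conditional add per coefficient bit."""
    acc = 0
    for j in range(coeff_word.bit_length()):
        if (coeff_word >> j) & 1:
            acc += value << j          # add the value weighted by 2**j
    return acc >> frac_bits            # drop the fractional bits (floor)


table = cosine_table(7, 12)            # six words for a length-7 DCT
approx = shift_add_multiply(1000, table[2], 12)    # about 1000*cos(3*pi/14)
print(abs(approx - 1000 * math.cos(3 * math.pi / 14)) < 2)   # prints True
```

All table entries are positive because kπ/2N lies between 0 and π/2 for k = 1,...,N-1, so no sign handling of the constant is needed.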
Figure 5. Implementation of the multiplication with shift-add circuit.
These results show that the proposed algorithm can be
realised efficiently and easily by dedicated hardware or gate
array technology. The structure of the hardware required is
so simple that it involves only memory and adders. This makes
it possible to achieve a high-performance DCT chip at minimum
cost and development time.
5. CONCLUSIONS 
In this paper, we present a new algorithm that
converts a prime length DCT into cyclic correlations. In
particular, this algorithm allows us to realise an N-point DCT
by using two (N-1)/2-point cyclic correlations, where N is an
odd prime. This makes the realisation simple as the
correlation structure is very suitable for hardware realisation
using distributed arithmetic. Finally, by making use of
distributed arithmetic and other simple techniques, we
propose an efficient approach to construct a short prime
length DCT chip that is suitable for VLSI realisation.
REFERENCES 
[1] N.Ahmed, T.Natarajan and K.R.Rao, "Discrete cosine
transform," IEEE Trans., Vol.C-23, pp.90-94, Jan. 1974.
[2] B.G.Lee, "A new algorithm to compute the discrete cosine
transform," IEEE Trans., Vol.ASSP-32, pp.1243-1245,
Dec. 1984.
[3] M.Vetterli and H.Nussbaumer, "Simple FFT and DCT
algorithms with reduced number of operations," Signal
Processing, Vol.6, No.4, pp.267-278, August 1984.
[4] H.S.Hou, "A fast recursive algorithm for computing the
discrete cosine transform," IEEE Trans., Vol.ASSP-35,
pp.1455-1461, Oct. 1987.
[5] S.A.White, "Applications of distributed arithmetic to
digital signal processing: A tutorial review," IEEE ASSP
Magazine, Vol.6, No.3, pp.4-19, July 1989.
[6] T.C.Chen, M.T.Sun and A.M.Gottlieb, "VLSI
Implementation of a 16*16 DCT," Proc. IEEE 1988, Vol.4,
pp.1973-1976.
[7] W.C.Siu, "Microprocessor-based implementation of digital
signal processing using distributed arithmetic," IERE
workshop on Advanced Microprocessors and Digital
Signal Processing, H.K. Section, Sep 1982, pp.148-157.
[8] W.C.Siu and C.F.Chen, "New realisation technique of
high-speed discrete Fourier Transform described by
distributed arithmetic," IEE Proc., Vol.130, Pt.E, No.6,
Nov 1983, pp.177-182.
[9] S.Chiu and C.S.Burrus, "A Prime Factor FFT Algorithm
using Distributed Arithmetic," IEEE Trans.,
Vol.ASSP-30, No.2, Apr 1982, pp.217-227.
[10] Y.H.Chan and W.C.Siu, "A New Convolution Structure
for the realisation of Discrete Cosine Transform,"
Proceedings, IEEE International Conference on Circuits
and Systems (ISCAS'90), May 1-3, 1990, New Orleans,
USA, pp.2373-2376.
[11] Y.H.Chan and W.C.Siu, "Algorithm for Prime Length
Discrete Cosine Transforms," Electronics Letters, 1st
Feb 1990, Vol.26, No.3, pp.206-208.
