Efficient prime factor algorithm and address generation techniques for the discrete cosine transform by Chau, LP et al.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 48, NO. 10, OCTOBER 2001 985
Efficient Prime Factor Algorithm and Address Generation
Techniques for the Discrete Cosine Transform
Lap-Pui Chau, Daniel Pak-Kong Lun, and Wan-Chi Siu
Abstract—This brief proposes an efficient prime factor algorithm for
the discrete cosine transform. In the proposal, we formulate the decompo-
sition directly, by using the proposed input and output mapping, a novel
in-place address generation scheme for input index mapping is derived,
while the formulations in the literature require a table to store the index
mapping. Besides, our approach requires one output index mapping only
while the conventional algorithms require two index mapping. Hence, by
using the proposed mappings and address generation techniques, less tem-
porary storage is required, such that a reduction on memory requirement
can be achieved during the implementation. A comparison of the address
generation time between our approach and the conventional approach is
also shown.
Index Terms—Address generation technique, discrete cosine transform,
fast algorithm, prime factor algorithm.
I. INTRODUCTION
The discrete cosine transform (DCT) [1]–[8] has been widely used
in the area of digital signal processing because of its energy packing
capabilities and also approaching to the statistically optimal transform
in decorrelating a signal governed by a Markov model. Recently, re-
searchers have been prominent in developing algorithms for real-time
implementation of DCT at high bit rates and which is now recom-
mended as a standard transform for image coding and data compression
by various standard committees. This coding scheme can be applied in
teleconferencing, videophones and high-definition television system.
Since DCT had been appeared, many efficient algorithms were pro-
posed to improve the computing speed and hardware complexity. Pre-
viously, Yang, Narasimha, and Lee [3], [4] developed a prime factor
algorithm for the DCT but its index mapping is quite complicated. Lee
[5] realized their input index mapping by constructing two index tables
but it occupies extra memory. Chakrabarti and JaJa [6] implemented the
DCT from DHT and found that the systolic architecture was suitable
for its realization. Lee and Huang [7] modified Lee’s [5] algorithms
and based on the previous research efforts to divide the input mapping
into five disjoint groups such that the complexity of the algorithm is de-
creased. Recently, Bi [8] proposed a straightforward approach to con-
vert a one-dimensional (1-D) input sequence into a two-dimensional
(2-D) array, which reduce the computational complexity and also avoid
irregular computational structure.
Although there are several algorithms [3]–[8] for computing the
DCT of an arbitrary length using a prime factor decomposition
technique, these algorithms usually need to calculate two output index
functions, namely N2k1 +N1k2 and N2k1  N1k2, respectively.
Besides, some researchers concentrate on the realization of the al-
gorithm expecting to reduce the complexity of the implementation,
however in-place address generation technique is also important to re-
duce the storage requirement while implementation. Let us recall that
conventional methods [3]–[8] require a table to store the input index
Manuscript received September 6, 1999; revised September 15, 2001. This
paper was recommended by Associate Editor A. Skodras.
L.-P. Chau is with the School of Electrical and Electronic Engineering,
Nanyang Technological University, Singapore 639798 (e-mail: lpchau@
ieee.org).
D. P.-K. Lun and W.-C. Siu are with the Department of Electronic and Infor-
mation Engineering, The Hong Kong Polytechnic University, Hong Kong.
Publisher Item Identifier S 1057-7130(01)11057-8.
mapping, all these methods required extra memory to store this index
table. Many different efficient techniques [9]–[12] have been reported
in the literature for an in-place computation and address generation of
the DFT. The term “in-place” is defined as the computation where the
memory requirement is restricted to be as large as the transform length.
Actually, a very small amount of temporary buffers is usually allowed
for the storage of the intermediate results while implementation.
In this brief, we propose an algorithm to directly implement the
prime factor DCT. This algorithm involves one output index function
N2k1+N1k2 only, such that the memory space and the computational
complexity are greatly reduced. Besides, we propose a novel in-place
address generation scheme for input index addressing which generates
the address column by column or row by row during the DCT pro-
cessing, by using this address generation scheme, a table to store the
mapping sequence is not required. Furthermore, the address generation
time is improved by about 20%.
II. DIRECT DECOMPOSITION OF PRIME FACTOR DCT ALGORITHM
The DCT of a sequence fx(i 0): i 0 = 0; 1; . . . ; N   1g can be
written as
Y (k) =
N 1
i =0
x(i0) cos

2N
(2i0 + 1)k ;
for k = 0; 1; . . . ; N   1: (1)
In order to obtain a simplified form of the 2-D DCT, the input sequence
x(i 0) should be permuted by (2) and (3)
i
0 =2i ; for i = 0; . . . ; N   1
2
(2)
=2N   2i   1 for i = N   1
2
+ 1; . . . ; N   1: (3)
Now consider the transform length N which can be factorized into two
mutually prime factors N1 and N2, respectively, then this 1-D DCT can
be decomposed into a 2-D DCT via a suitable bijective mapping. Let
i = f (i1; i2) be the input index mapping and k = g(k1; k2) be the
output index mapping.
Now we define
i = N2N
 1
2 i1 +N1N
 1
1 i2 N (4)
k = hN1k2 +N2k1iN (5)
where hN1N 11 iN = 1; hN2N 12 iN = 1 and hxiN = x modulo
N for i1; k1 = 0; 1; . . . ; N1   1 and i2; k2 = 0; 1; . . . ; N2   1.
Because f (i1; i2) and g(k1; k2) are bijective mapping, (1) becomes
Y (k2; k1) =
N  1
i =0
N  1
i =0
x(i2; i1)
 cos
2
N
hN1i2k2+N2i1k1iN+

2N
hN1k2+N2k1iN : (6)
According to (6), we note that the first term of the cosine argument
hN2i1k1 +N1i2k2iN is multiplied by (2=N), which means that the
value of the cosine function remains unchanged if N2i1k1+N1i2k2 >
N after the decomposition. Note that the second term of cosine argu-
ment hN1k2 + N2k1iN is multiplied by (=2N), hence the value of
the cosine function will be changed if N1k2 + N2k1 > N after the
1057–7130/01$10.00 © 2001 IEEE
986 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 48, NO. 10, OCTOBER 2001
decomposition, so the equation should be compensated for the case of
N1k2 + N2k1 > N .
IfN1k2+N2k1 < N , the value of the cosine function is unchanged
after decomposition, (6) becomes
Y (k2; k1) =
N  1
i =0
N  1
i =0
x(i2; i1)
 cos

2N2
(4i2+1)k2 cos

2N1
(4i1+1)k1
 
N  1
i =0
N  1
i =0
x(i2; i1)
 cos

2N2
(4i2 + 1)(N2   k2)
 cos

2N1
(4i1 + 1)(N1   k1) : (7)
If N1k2+N2k1 > N , (6) should be compensated by  (=2), and (6)
becomes
Y (k2; k1)
=
N  1
i =0
N  1
i =0
x(i2; i1)
 cos

2N2
(4i2+1)(N2 k2) cos

2N1
(4i1+1)k1
+
N  1
i =0
N  1
i =0
x(i2; i1)
 cos

2N2
(4i2+1)k2 cos

2N1
(4i1+1)(N1 k1) :
(8)
In order to obtain the same type of kernel of the DCT as shown in (1),
the same permuting algorithm as in (2) and (3) should be applied to (7)
and (8), so we have the mappings as shown in (9)–(12).
Let
i
0
2 =2i2 for i2 = 0; . . . ;
N2 1
2
(9)
=2N2 2i2 1 for i2 =
N2 1
2
+ 1; . . . ; N2 1 (10)
and
i
0
1 =2i1 for i1 = 0; . . . ;
N1 1
2
(11)
=2N1 2i1 1 for i1 =
N1 1
2
+ 1; . . . ; N1 1 (12)
and let us define Y 0(k2; k1) to be the output of the 2-D DCT
Y 0(k2; k1) =
N  1
i =0
N  1
i =0
x(i02; i
0
1)
 cos

2N2
2i02 + 1 k2 cos

2N1
2i01 + 1 k1 (13)
we have
Y (k2; k1) =Y
0(k2; k1)  Y
0(N2   k2; N1   k1) (14)
Y (k2; k1) =Y
0(N2   k2; k1) + Y
0(k2; N1   k1): (15)
From (13) to (15), Y (k2; k1) can be obtained by combining two 2-D
DCT outputs. The major advantage of the 2-D DCT is that it can be
TABLE I
INPUT ADDRESS MAPPING FOR 15-POINT DCT
Fig. 1. Data loading and retrieval operation of 3-point DCT algorithm.
implemented by short length DCT modules so that the number of op-
erations will be greatly reduced.
III. ADDRESS GENERATION TECHNIQUES FOR INPUT INDEX MAPPING
The address generation scheme can be derived by permuting the
input sequence. By substituting (4) and (5) and (9)–(12) into (2) and
(3), we can obtain two sets of equations.
Formula A:
AC =
hkCN2 + Ci2N
hkCN2   C   1i2N
for KC is even from 0 to N1   1 and C = 0 to N2   1.
Formula B:
AR =
hkRN2 +Ri2N
hkRN2  R  1i2N
for KR is even from 0 to N2   1 and R = 0 to N1   1 where AC and
AR are the addresses in Column C and Row R, respectively.
Example: In this example a 15-point prime factor DCT is going to
be evaluated. Let N1 = 3 and N2 = 5. According to Formula A, the
addresses in column 0 are 0, 2N2   1 = 9; 2N2 = 10, respectively.
According to Formula B, the address in row i 01 can be determined by
haddressi2N = i
0
1 or 2N1   i
0
1   1 i.e., h0i6 = 0; h9i6 = 3 =
6  2  1; h10i6 = 4 = 6  1  1. The row numbers for addresses 0,
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 48, NO. 10, OCTOBER 2001 987
Fig. 2. Graphical representation for in-place implementation of 15-point prime factor DCT algorithm.
TABLE II
OUTPUT MAPPING FOR 15-POINT DCT
9, and 10 are 0, 2, and 1, respectively. Table I shows the input mapping
for a 15-point prime factor DCT.
For the realization of the prime factor DCT algorithm, the computa-
tion in each dimension is independent. The global data array store the
input data in sequence, then the address generation scheme for input
index is used to select the global data to temporary data array, since
the address index is generated in an on-demand basis that means the
relevant row or column address is generated when the system request
during the DCT processing. So no table is required to store the indexing
map. The temporary data are processed using the short length DCT al-
gorithm, e.g., [2]. After the short length DCT implementation, the tem-
porary data will be passed back to the global data array. Fig. 1 demon-
strates the operations of the in-place prime factor DCT algorithm.
IV. OUTPUT INDEX MAPPING
In this section, we arrange to obtain the output index mapping k =
hN1k2 +N2k1iN . From the previous direct decomposition, the trans-
form output can be obtained after some add and subtract operations.
In case of N1k2 + N2k1 < N , subtraction is required. Otherwise,
addition is performed. It is obviously that the decomposition is a dec-
imation-in-frequency algorithm, because the transformed sequence is
grouped into different frequency index parts. That is, the input is in
normal order and the output requires permutation. Actually, the decom-
position is an in-place algorithm but the output is not in-order. In order
to get the output in-order, an extra unscrambling is required. Table II
shows the output mapping for a length-15 prime-factor DCT.
From the above mapping and the previous derivation, we know that
if N1k2 + N2k1 < N , we have
Y (0) =Y 0(0; 0) = Y (0; 0);Y (3) = Y 0(1; 0) = Y (1; 0);
Y (6) =Y 0(2; 0) = Y (2; 0);Y (9) = Y 0(3; 0) = Y (3; 0)
Y (12) =Y 0(4; 0) = Y (4; 0);Y (5) = Y 0(0; 1) = Y (0; 1);
Y (10) =Y 0(0; 2) = Y (0; 2);
Y (8) =Y (1; 1) = Y 0(1; 1) Y 0(5 1;3 1) = Y 0(1; 1) Y 0(4; 2)
Y (11) =Y (2; 1) = Y 0(2; 1) Y 0(5 2;3 1) = Y 0(2; 1) Y 0(3; 2)
Y (14) =Y (3; 1) = Y 0(3; 1) Y 0(5 3;3 1) = Y 0(3; 1) Y 0(2; 2)
Y (13) =Y (1; 2) = Y 0(1; 2) Y 0(5 1;3 2) = Y 0(1; 2) Y 0(4; 1)
if N1k2 + N2k1 > N , we have
Y (2) =Y (4; 1) = Y 0(5 4;1)+Y 0(4; 3 1) = Y 0(1; 1)+Y 0(4; 2)
Y (1) =Y (2; 2) = Y 0(5 2;2)+Y 0(2; 3 2) = Y 0(3; 2)+Y 0(2; 1)
Y (4) =Y (3; 2) = Y 0(5 3;2)+Y 0(3; 3 2) = Y 0(2; 2)+Y 0(3; 1)
Y (7) =Y (4; 2) = Y 0(5 4;2)+Y 0(4; 3 2)=Y 0(1; 2)+Y 0(4; 1):
Fig. 2 shows the steps for the computation of 15 point prime factor de-
composition algorithm with unscrambling output in global data array.
The set fxi ; i = 0; 1; . . . ; 14g is the input sequence, which is stored
in global data array and mapped according to our algorithm into the
2-D DCT array. After the two stages row–column evaluation, the output
data can be obtained by some add and subtract operations according to
the input and output index function.
V. SOFTWARE PERFORMANCE FOR INPUT INDEX MAPPING
The new address generation scheme has been tested using the Mi-
crosoft C. The aim of this section is to give the results of comparison of
our proposed address generation scheme with the conventional scheme
which was proposed by Lee [5]. Fig. 3 shows the percentage improve-
ment of the new approach as compared with the conventional approach.
As shown in the figure, the new approach improves an average of about
20% in speed over the address generation using the conventional ap-
proach.
988 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 48, NO. 10, OCTOBER 2001
Fig. 3. Percentage improvement for the new approach to the conventional
approach.
VI. CONCLUSION
In this brief, an efficient implementation of prime factor algorithm
has been developed. Our objective is to design a prime factor algo-
rithm which requires less memory consumption while implementation
such that it can be implemented using built-in-memory processor or
hardware implementation. Unlike other existing prime factor DCT al-
gorithms, this address generation algorithm generate the input index
mapping in on-demand basis which does not require a table to store the
index mapping. For the long length prime factor DCT application, due
to the memory restriction, it is not practical to generate an extra tempo-
rary 2-D address array. This shows the importance of our approach for
an in-place address generation technique. On the other hand, we have
also discussed the output index mapping, which is based on the equa-
tionN1k2+N2k1, the outputs can be obtained in a straightforward way
after a fast unscrambling procedure. The memory space and implemen-
tation complexity is further reduced since the equation N1k2  N2k1
appeared in reference [5] is eliminated. Finally, the address generation
time is also improved by about 20% over a wide range of DCT lengths.
REFERENCES
[1] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,”
IEEE Trans. Comput., vol. 23, pp. 90–93, Jan. 1974.
[2] L. P. Chau and W. C. Siu, “Direct formulation for the realization of dis-
crete cosine transform using recursive structure,” IEEE Trans. Circuits
Syst. II, vol. 42, pp. 50–52, Jan. 1995.
[3] P. P. N. Yang, M. J. Narasimha, and B. G. Lee, “A prime factor decom-
position algorithm for the discrete cosine transform,” in Proc., Int. Conf.
Comput., Syst. Signal Processing, India, 1984, pp. 132–135.
[4] P. P. N. Yang and M. J. Narasimha, “Prime factor decomposition of the
discrete cosine transform and its hardware realization,” in Proc., IEEE
Int. Conf. Acoust., Speech Signal Processing, San Diego, CA, 1984, pp.
772–775.
[5] B. G. Lee, “Input and output index mappings for a prime-factor-decom-
posed computation of discrete cosine transform,” IEEE Trans. Acoust.,
Speech Signal Processing, vol. 37, pp. 237–244, Feb. 1989.
[6] C. Chakrabarti and J. JaJa, “Systolic architectures for the computation
of the discrete hartley and the discrete cosine transforms based on prime
factor decomposition,” IEEE Trans. Comput., vol. 39, pp. 1359–1368,
Nov. 1990.
[7] P. Z. Lee and F. Y. Huang, “An efficient prime-factor algorithm for
the discrete cosine transform and its hardware implementations,” IEEE
Trans. Signal Processing, vol. 42, pp. 1996–2005, Aug. 1994.
[8] G. Bi, “Index mapping for prime factor algorithm of discrete cosine
transform,” Electron. Lett., vol. 35, no. 3, pp. 198–200, Feb. 1999.
[9] C. S. Burrus and P. W. Eschenbacher, “An in-place, in-order prime factor
FFT algorithm,” IEEE Trans. Acoust. Speech Signal Processing, vol.
ASSP-29, pp. 806–817, Aug. 1981.
[10] C. Temperton, “Implementation of a self-sorting in-place prime factor
FFT algorithm,” J. Computat. Phys., vol. 58, pp. 283–299, Oct. 1985.
[11] K. L. Wong, R. Chan, D. P. K. Lun, and W. C. Siu, “Efficient address
generation for prime factor algorithms,” IEEE Trans. Acoust. Speech
Signal Processing, vol. 38, pp. 1518–1528, Sept. 1990.
[12] D. P. K. Lun and W. C. Siu, “An analysis for the realization of an in-place
and in-order prime factor algorithm,” IEEE Trans. Signal Processing,
vol. 41, pp. 2362–2370, July 1993.
On RNS-Based Enhancements for Direct
Digital Frequency Synthesis
P. V. Ananda Mohan
Abstract—It is shown that Chren’s techniques for reducing the ROM re-
quirements for the implementation of reside number systems–based digital
frequency synthesis are not correct.
Index Terms—Digital frequency synthesis (DFS), residue number sys-
tems (RNS), VLSI architectures.
I. INTRODUCTION
The use of residue number systems (RNS) for digital frequency syn-
thesis (DFS) has been considered by Chren, [1], [2]. The classical dig-
ital frequency synthesizers facilitate generation of sinusoidal output by
using the architecture of Fig. 1(a). Herein, the “sine” function is stored
in ROM and is sequentially read. The L bit address of the ROM is gen-
erated by a phase accumulator ofL bit resolution which repeatedly adds
a phase increment of k corresponding to the frequency to be generated.
Since, sine-wave is symmetric, only sine function for one quadrant (0 to
90 degrees) is stored. The L bit word, however, corresponds to a max-
imum of 360 degrees. The MSB of this word contains sign information
and hence used to generate two’s complement of the contents read from
the ROM so as to obtain negative value of the sample stored in ROM.
In addition, the bit next to the MSB is used to selectively invert the
lower order bits of the address by using exclusive-OR gates controlled
by the second MSB. This can be appreciated by noting that for the case
of values in the second quadrant, complementing the bits yields an ad-
dress corresponding to the correct value. Assuming a clock frequency of
fc and frequency setting word k , and anL bit accumulator, the frequency
of the sinusoid generated is fc :k=2L. The phase increment is 2::k=2L.
The figures of merit of a DFS design are frequency, phase, and am-
plitude resolution. The frequency resolution is evidently fc=2L corre-
sponding to k = 1. Alternatively, the width of the accumulator is the
frequency resolution. Phase resolution is the width W of the ROM ad-
dress. Note that while phase accumulation can be done with resolution
as desired, the restriction on the memory size may need truncation of
L bits to W bits. Amplitude resolution is evidently the number of bits
used to represent the stored samples in the ROM.
Manuscript received July 5, 2000; revised October 4, 2001. This paper was
recommended by Associate Editor W. Sandham.
The author is with Transmission R&D, I.T.I. Ltd., Bangalore 560016, India.
Publisher Item Identifier S 1057-7130(01)11058-X.
1057–7130/01$10.00 © 2001 IEEE
