







Reconfigurable implementation of recursive DCT 







Cavendish School of Computer Science 
 
 
Copyright © [2003] IEEE.   Reprinted from Proceedings of the 2003 International 
Symposium on Circuits and Systems (ISCAS '03), pp. 289-292. 
    
This material is posted here with permission of the IEEE. Such permission of the 
IEEE does not in any way imply IEEE endorsement of any of the University of 
Westminster's products or services.  Internal or personal use of this material is 
permitted.  However, permission to reprint/republish this material for advertising or 
promotional purposes or for creating new collective works for resale or redistribution 
must be obtained from the IEEE by writing to pubs-permissions@ieee.org.  By 
choosing to view this document, you agree to all provisions of the copyright laws 
protecting it.  
   
 
RECONFIGURABLE IMPLEMENTATION OF 
RECURSIVE DCT KERNELS FOR REDUCED 
QUANTIZATION NOISE 
Süleyman Sirri Demirsoy, Robert Beck, Andrew G. Dempster and Izzet Kale 
Applied DSP and VLSI Research Group, Department of Electronic Systems 
University of Westminster, 115 New Cavendish St, London, W1W 6UW, UK 
Tel: +44 20 7911 5000 - 3611 (ext) e-mail: demirss@cmsa.wmin.ac.uk 
ABSTRACT 
Time-multiplexed implementations of recursive DCT processors are widely 
used in multimedia and compression applications. Three recently proposed 
Goertzel kernels offer a significant improvement (up to 90 %) in the noise 
performance of the time-multiplexed architecture, which allows the word-length 
specifications to be reduced. In this paper, a highly optimized 
reconfigurable DCT architecture is proposed that can perform the 
function of three different kernels (Type A, B and C) on a Virtex 
FPGA. 
1. INTRODUCTION 
Recursive DCT implementations are attractive due to their 
regular structure and reduced computational complexity. The 
transfer function of the DCT given in (1) can be implemented 
with a second-order IIR filter. The kernel employed in most of 
the designs in the literature [6], [7] is a resonator of Type B 
configuration, as shown in Figure 1(a). In [2], two alternative 
forms, Type A and Type C, were proposed (Figures 1(b) and 1(c)). 
$$H(z) = \frac{P_k\,(1 - z^{-1})}{1 - 2p_k z^{-1} + z^{-2}} \qquad (1)$$
$P_k = (4L_k/N)\cos(k\pi/N)$, k: frequency bin index 
$L_k = 1/\sqrt{2}$ for $k = 0$, $L_k = 1$ for all other k 
$2p_k = 2\cos(k\pi/N)$, N: transform length 
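As a rough illustration of how a second-order recursion can deliver one DCT coefficient, the sketch below runs a Type B style resonator with loop coefficient 2cos(kπ/N) over a block and compares the result with a direct summation. It assumes an unnormalized DCT-II definition, and the output factor (-1)^k cos(kπ/2N) was derived for this sketch only; it is not claimed to reproduce the scaling of the Pk multiplier in (1). The function names are hypothetical.

```python
import math

def dct_bin_direct(x, k):
    """Unnormalized DCT-II bin k by direct summation (reference)."""
    N = len(x)
    return sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
               for n in range(N))

def dct_bin_type_b(x, k):
    """Same bin from a second-order recursion with loop coefficient 2cos(k*pi/N)."""
    N = len(x)
    loop = 2.0 * math.cos(math.pi * k / N)
    s1 = s2 = 0.0                          # the two delay elements, initially cleared
    for sample in x:
        s0 = sample + loop * s1 - s2       # resonator update
        s2, s1 = s1, s0
    # (1 - z^-1) difference of the last two states, then the output scaling
    # assumed for this sketch.
    return (-1) ** k * math.cos(math.pi * k / (2 * N)) * (s1 - s2)

x = [1.0, -2.0, 3.5, 0.25, -7.0, 4.0, 2.0, -1.5]
for k in range(len(x)):
    print(k, round(dct_bin_direct(x, k), 6), round(dct_bin_type_b(x, k), 6))
```

Only one multiplication per input sample sits inside the loop, which is why the precision of the loop coefficient dominates the noise behaviour discussed in the rest of the paper.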
The benefits of these structures are investigated in [1] and 
substantial area gains in a fully parallel implementation were 
demonstrated. In that design, Type A, Type B and Type C 
structures were all employed for different frequency bins to have 
the optimum coefficient magnitude. Figure 2 shows how these 
magnitudes vary for different frequencies. By using Type A for 
the first one-third of the filter bins, Type B for the next one-third 
and Type C for the last one-third of the filter bins, the optimum 
coefficient values are used. 
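The one-third split can be checked numerically. The sketch below assumes loop coefficients of 2(1 - cos θ) for Type A, 2cos θ for Type B and 2(1 + cos θ) for Type C, with θ = kπ/N; the Type B value is the loop coefficient defined with (1), the Type C value follows a label visible in Figure 1, and the Type A value is an assumption made by symmetry. It picks, for each bin, the type whose coefficient has the smallest magnitude.

```python
import math

def loop_coefficients(k, N):
    """Loop coefficient of each kernel type for bin k (Type A form is assumed)."""
    theta = math.pi * k / N
    return {
        "A": 2.0 * (1.0 - math.cos(theta)),
        "B": 2.0 * math.cos(theta),
        "C": 2.0 * (1.0 + math.cos(theta)),
    }

def best_kernel(k, N):
    """Return (type, coefficient) with the smallest coefficient magnitude."""
    coeffs = loop_coefficients(k, N)
    kind = min(coeffs, key=lambda name: abs(coeffs[name]))
    return kind, coeffs[kind]

N = 16
choices = [best_kernel(k, N)[0] for k in range(N)]
print("".join(choices))
print(max(abs(best_kernel(k, N)[1]) for k in range(N)))
```

Under these assumptions the selected coefficient never exceeds 1 in magnitude, whereas the Type B coefficient alone reaches 2 at k = 0.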
For a time-multiplexed system, where only one recursive 
structure is used for all frequency bins by multiplexing it in time, 
Type B is the natural choice, since it has the average 
coefficient magnitude among the three types. However, if a 
reconfigurable structure that can be transformed into the three 
different types for different frequencies is developed with 
minimum reconfigurability overhead, a significant improvement 
in internal quantization noise figures can be achieved. 
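A behavioural sketch of such a time-multiplexed, reconfigurable recursion is given below. The three update equations are algebraic rearrangements of the same second-order recursion, written here under the assumption that Type A and Type C use the coefficients 2 - 2cos θ and 2 + 2cos θ together with a fixed multiplication by 2; they are not taken from the wiring of Figure 1. The type is switched per bin by thirds, and all names are hypothetical.

```python
import math

def kind_for_bin(k, N):
    """Type A for the first third of the bins, B for the next, C for the last."""
    if 3 * k <= N:
        return "A"
    if 3 * k <= 2 * N:
        return "B"
    return "C"

def resonator_states(x, k, N, kind):
    """State sequence of the recursion for bin k, computed in the chosen form."""
    c = math.cos(math.pi * k / N)
    s1 = s2 = 0.0
    states = []
    for sample in x:
        if kind == "A":      # small coefficient near DC
            s0 = sample + 2.0 * s1 - (2.0 - 2.0 * c) * s1 - s2
        elif kind == "B":    # conventional resonator
            s0 = sample + (2.0 * c) * s1 - s2
        elif kind == "C":    # small coefficient near the top of the band
            s0 = sample - 2.0 * s1 + (2.0 + 2.0 * c) * s1 - s2
        else:
            raise ValueError("kind must be 'A', 'B' or 'C'")
        s2, s1 = s1, s0
        states.append(s0)
    return states

def time_multiplexed(x):
    """Run one kernel over all bins in turn, reconfiguring its type per bin."""
    N = len(x)
    return [resonator_states(x, k, N, kind_for_bin(k, N))[-1] for k in range(N)]

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0,
     5.0, 3.0, -5.0, 8.0, 9.0, -7.0, 9.0, 3.0]
print(time_multiplexed(x))
```

In floating point the three forms agree; they differ only once the coefficient and the intermediate signals are held in finite word-lengths.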
[Figure 2 plot: numerical characteristics of coefficients; x-axis: normalized frequency] 
Figure 2 Coefficient magnitude behaviour of Type A, B 
and C structures vs. frequency. The optimum behaviour 
is also marked. 
FPGAs are attractive platforms for easy and fast implementation 
and offer high performance when the available resources are 
used efficiently. The effective use of the Look-Up Tables (LUTs) 
available on-board in Xilinx Virtex FPGAs is investigated in [3]. 
In this paper, a reconfigurable recursive kernel that is highly 
optimized for the Xilinx Virtex FPGA series is given and 
investigated for its noise performance. Section 2 describes the 
details of the Virtex slice architecture and the optimization steps 
of the recursive kernels. Section 3 gives the noise performance 
analysis and Section 4 concludes the paper. 
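[Figure 1 diagram: signal-flow graphs of the three recursive kernels, panels (a), (b) and (c). Legible labels include the sign factor (-1)^k, the delay z^-1 and a loop coefficient of the form 2*(1+Cos(kπ/N)).] 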
Figure 1 Recursive Goertzel filters, a) Type B, b) Type A, c) Type C 
IV-289 0-7803-7761-3/03/$17.00 © 2003 IEEE 
2. OPTIMIZATION 
The Virtex FPGA series comprises Configurable Logic Blocks 
(CLBs), which are the main programmable functional blocks. 
Each CLB contains two slices, each of which has two of the 
structures given in Figure 3(a). Fast carry logic is implemented in 
each half-slice by MUXCY. The Look-Up Table (LUT) is a 16-bit 
SRAM with 4 inputs. It can be programmed to realise any 
logic function of up to four inputs. Figures 3(b) and 3(c) show 
two examples: a 1-bit adder/subtractor implemented in a 
half-slice, and a subtractor with a multiplexer choosing one 
of its two inputs for subtraction from A. The half-slice also has a 
D-type flip-flop that can function as an output register. 
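A small behavioural model may help make the half-slice adder/subtractor concrete. The sketch below treats the LUT as a 16-bit truth table that holds the XOR of the two operand bits and the add/subtract control, and ripples the carry through a MUXCY-style selection. The sum XOR is modelled on the XORCY element of the Virtex carry chain, which the text above does not mention; the bit ordering of the truth table and all function names are assumptions of this sketch.

```python
def make_lut(func):
    """Pack a 4-input Boolean function into a 16-bit truth table (LUT contents)."""
    init = 0
    for address in range(16):
        bits = [(address >> i) & 1 for i in range(4)]
        if func(*bits):
            init |= 1 << address
    return init

def lut_read(init, i0, i1, i2, i3):
    """Look up one output bit; the four inputs form the SRAM address."""
    address = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)
    return (init >> address) & 1

# One adder/subtractor bit: propagate = a XOR b XOR sub (fourth input unused).
ADDSUB_LUT = make_lut(lambda a, b, sub, unused: a ^ b ^ sub)

def add_sub(a, b, sub, width=8):
    """Compute a + b (sub=0) or a - b (sub=1) modulo 2**width, one half-slice per bit."""
    carry = sub                      # carry-in of 1 completes the two's complement of b
    result = 0
    for bit in range(width):
        a_bit = (a >> bit) & 1
        b_bit = (b >> bit) & 1
        propagate = lut_read(ADDSUB_LUT, a_bit, b_bit, sub, 0)
        result |= (propagate ^ carry) << bit      # sum bit (XORCY-style)
        carry = carry if propagate else a_bit     # carry selection (MUXCY-style)
    return result

print(hex(ADDSUB_LUT), add_sub(100, 27, 0), add_sub(100, 27, 1))
```

Because the add/subtract control occupies only one LUT input, one further input remains free, which is what allows a small multiplexer to share the same LUT as in Figure 3(c).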
Figure 3 (a) Half of the Virtex slice diagram 
(simplified), (b) a combination of XOR gates 
implemented in the LUT to make an adder/subtractor, (c) 
a multiplexer choosing the input to subtract from A 
Any circuit optimization for a device with limited resources is a 
matter of utilizing the available resources as fully as possible, 
bearing in mind that the required operation can often be achieved 
by a modified structure that fits the resources better while 
maintaining the functionality. The reconfigurable DCT kernel of 
this paper, given in Figure 4(a), contains several multiplexers 
(MX1 to MX5), which allow the same components to be used with 
different inputs. The letters A, B and C indicate the type of kernel 
for which a line is required. The signs above the letters give the 
function required of the adder/subtractors (A1 to A4) for that 
input when the specified kernel is configured. The loop multiplier 
(M1) takes the coefficient of all kernels. The optimization is 
performed as follows: 
The adder/subtractor A1 and the delay element D2 can fit into a 
half-slice if the multiplexer MX1 does not exist. Moreover, the 
adder/subtractor A4 consumes half a slice by itself without 
making use of the delay element. By replicating D2 as D2b and 
storing the output of A4 in D2b, it is possible to remove MX1 
(see Figure 4(b)). 
MX3 requires an extra LUT. However, by simply feeding A3 
with '0' from MX5, the output for Type A can be obtained via 
A3 from the same output as Types B and C. This implies we have 
three different outputs from MX5 (Figure 4(c)). 
It is possible to generate the '0' signal by clearing D2. This 
reduces the number of necessary inputs to MX5 to two 
(Figure 4(d)). 
Figure 4 Optimization steps of the reconfigurable kernel 
The input to A3 coming from A1 requires negation for Type C, 
and the other input needs to be negated for Type B too. Negation 
on both inputs for different kernels cannot be implemented in a 
half-slice. However, by performing the negation for Type C at the 
same input as for Type B, it is possible to fit MX5 and A3 into the 
same half-slice. Hence the output of A3 for the Type C kernel 
needs to be negated. This is done by using negated coefficients in 
the Pk multiplier outside the recursive loop shown in Figure 1. 
The multiplication by 2 is performed by hardwiring the output of 
A4 shifted one bit toward the MSB. 
The output of MX2 needs to be negated for Types B and C. In 
its current form, it is not possible to fit MX2 with A2 into the same 
half-slice. It is known that, for Type A, D2 generates a '0' signal 
for MX5. By generating a '1' signal on D2, the output of D2 can 
be used as a select signal for the operation on the other input to 
MX2. This modification allows fitting MX2, A2 and D1 into the 
same half-slice (see Figure 4(e)). 
MX4 requires an extra LUT. It is possible to remove this 
multiplexer by using the output of A4 for the Type B kernel as well. 
A '0' signal generated by D2b allows the A1 signal to appear 
at the output of A4, after a small amount of added delay 
(Figure 4(f)). 
After all the modifications are performed, the reconfigurable 
DCT kernel occupies four half-slices, whereas the Type B kernel 
uses three half-slices. This overhead is easily compensated by the 
reduction in the coefficient word-length of multiplier M1. 
Hence, if a general multiplier is used for M1, the reconfigurable 
structure needs the same number of half-slices as Type B. For the 
choice of fixed or Reconfigurable Multiplier Blocks (ReMB) [8], 
the necessary coefficients are smaller in magnitude and are 
constructed using fewer adder stages. An example of this is 
given in [8] for a coefficient word-length of 12 bits. The loop 
multiplier M1 for the Type B kernel is constructed with four basic 
structures, whereas the loop multiplier for the reconfigurable 
kernel requires only three basic structures. 
3. NOISE PERFORMANCE 
Quantization noise exists in a circuit due to the fixed word-length 
of the intermediate and output signals. The performance 
of the system is affected by the choice of the signal word-lengths. 
Standards place several restrictions on the maximum quantization 
noise to prevent performance from degrading below a limit. For 
example, the MPEG-4 standard defines the maximum allowed MSE, 
mean error and magnitude error in an 8 by 8 block of pixels of 
video data for the two-dimensional inverse DCT (2D-IDCT) [5]. 
The coefficient values required by the Type B kernel for the 
frequencies close to DC and half-Nyquist are higher than the 
values required by Type A near DC and Type C near half-Nyquist 
(see Figure 2). An extra bit for the integer part of the 
coefficient is required if just the Type B coefficients are used. 
Therefore, by using the Type A and C kernels for these frequency 
bins, we save one bit for the same fractional precision. In addition, 
the analysis of the integer parts of the intermediate signals 
shows that the Type B kernel requires more bits for the integer 
parts of the signals than the Type ABC kernel does. Table 1 shows 
how many bits are required to represent the integer parts of the 
signals for a transform length of N=16. The reconfigurable 
kernel, Type ABC, spares more bits for the representation 
of the fractional parts at the outputs of A1, A2 and M1, and hence 
leads to better noise performance. 
TABLE 1 Number of bits required for the integer parts 
of the signals at the output of the specified components 
for N=16. Loop and Pk are the coefficient values for 
the loop multiplier (M1) and Pk multiplier (see 
Figure 1) 
A set of experiments was performed to demonstrate the 
enhanced performance. For various word-length specifications, 
both the Type B and Type ABC designs were stimulated with a 
thousand uniformly distributed random numbers between -300 
and 300. A reference design with floating-point precision was 
used to find the quantization noise generated by the two designs. 
The error measure for these experiments was the Mean Square 
Error (MSE). Table 2 shows the results for different 
specifications and transform lengths (N). The loop coefficient 
word-length for the Type B design (loopb) is always kept one bit 
longer than that of the Type ABC design (loopa) 
to maintain the same area consumption when a general-purpose 
multiplier is used. Pk is the coefficient word-length for the Pk 
multiplier. The frequency bins k=0 and k=4 are neglected in 
the calculation of the percentage reduction in noise because these 
bins only have noise contributions from the Pk 
multiplier, which is outside the recursive kernel. The gain 
differs for each frequency bin. The maximum and minimum 
gains that occurred for a set are also shown in Table 2. For 
particular frequency bins, there is a 93 % decrease in the noise. 
The average gain is a reasonable measure of the overall noise 
performance enhancement. There is up to a 67 % decrease 
in the MSE levels among the given sets. 
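The measurement procedure can be sketched as follows. This is only a simplified model under stated assumptions: it covers a Type B style recursion alone, rounds the loop coefficient and the state to a given number of fractional bits, leaves the output scaling unquantized, and uses an unnormalized DCT-II scaling chosen for the sketch. The seed, the word-lengths and all names are hypothetical, and the numbers it prints are not the results of Table 2.

```python
import math
import random

def quantize(value, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

def type_b_bin(block, k, frac_bits=None):
    """Bin k of one block; frac_bits=None gives the floating-point reference."""
    N = len(block)
    loop = 2.0 * math.cos(math.pi * k / N)
    if frac_bits is not None:
        loop = quantize(loop, frac_bits)
    s1 = s2 = 0.0
    for sample in block:
        s0 = sample + loop * s1 - s2
        if frac_bits is not None:
            s0 = quantize(s0, frac_bits)   # finite word-length of the state
        s2, s1 = s1, s0
    return (-1) ** k * math.cos(math.pi * k / (2 * N)) * (s1 - s2)

def mse_per_bin(samples, N, frac_bits):
    """Mean square error against the reference, per frequency bin."""
    blocks = [samples[i:i + N] for i in range(0, len(samples) - N + 1, N)]
    result = {}
    for k in range(N):
        errors = [(type_b_bin(b, k, frac_bits) - type_b_bin(b, k)) ** 2
                  for b in blocks]
        result[k] = sum(errors) / len(errors)
    return result

random.seed(0)
samples = [random.uniform(-300.0, 300.0) for _ in range(1000)]
coarse = mse_per_bin(samples, 8, 6)    # 6 fractional bits
fine = mse_per_bin(samples, 8, 20)     # 20 fractional bits
for k in range(8):
    print(k, coarse[k], fine[k])
```

Comparing two such runs bin by bin, with the kernel type and word-lengths changed between them, gives the kind of percentage decrease that Table 2 reports.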
Word-length specifications | N | Max | Min | Avg 
2. wl=20; loopa=18; loopb=19; Pk=16 
TABLE 2 The percentage decrease in the MSE noise 
with the reconfigurable Type ABC design as opposed to 
the Type B design. 
Figure 5 shows the MSE noise levels for the design given in the 
1st set of Table 2. As seen from the figure, although the loop 
coefficient precision for k=3 and k=5 is the same for the Type B and 
Type ABC designs, the difference in the magnitude of the 
intermediate signals leads to an enhancement in the noise 
performance of the Type B kernel outputs in the reconfigurable 
design. 
[Figure 5 plot: MSE vs. filter bin index, k] 
Figure 5 Comparison of MSE noise figures for N=8 and a 
word-length of 18 bits. The coefficient word-lengths in the 
Type B-only and reconfigurable kernels are 18 bits and 17 
bits respectively. An average decrease of 67% is 
achieved. 
4. CONCLUSION 
A reconfigurable recursive DCT processor that can compute 
three kernels (Type A, B and C) has been designed and 
optimized for the Xilinx Virtex FPGA series. It occupies only one 
more half-slice per bit than the Type B kernel does, due to 
high utilization of the slice resources. The loop multiplier 
coefficient requires one bit less for the same precision. For the 
same half-slice count, the MSE level is reduced by 67 % on 
average over all frequency bins in the proposed design when 
compared to the Type B kernel. The optimized structure 
includes design ideas that are applicable to VLSI 
implementation too. Future work will focus on the pipelined 
implementation of this reconfigurable structure to increase the 
operating clock frequency. 
5. REFERENCES 
[1] Demirsoy S., et al., "Novel recursive DCT implementations: 
A comparative study", IEEE Int. Conf. on Intelligent Data 
Acquisition and Advanced Computing Systems 
(IDAACS'2001), pp. 120-123, Ukraine, July 2001 
[2] Beck, R., "An Investigation of Finite-Precision Digital 
Resonators", PhD Thesis, University of Westminster, June 
2002 
[3] Turner R. H., T. Courtney and R. Woods, "Implementation 
of fixed DSP functions using the reduced coefficient 
multiplier", IEEE Proc. of ICASSP'2001, vol. 2, pp. 881-884, 
May 2001, USA 
[4] http://www.xilinx.com/xlnx/xil_prodcat_landingpage.jsp?title=Virtex Series 
[5] ISO/IEC, "Information technology - Coding of audio-visual 
object: Visual ISO/IEC 14496-2 Final Proposed 
Draft", 14496-2, July 1999 
[6] Wang J.L. et al., "Implementation of the DCT and its inverse 
by recursive structures", IEEE Workshop on Signal 
Processing Systems, pp. 120-130, Oct 1999 
[7] Kozick R.J. and M.F. Aburdene, "Methods for designing 
efficient parallel-recursive filter for computing discrete 
transforms", Telecommunication Systems, vol. 13, no. 1, 
2000, pp. 69-80 
[8] Demirsoy S. S., A. Dempster and I. Kale, "Design 
Guidelines for Reconfigurable Multiplier Blocks", 
submitted to IEEE ISCAS 2003 
