A Single-Channel Architecture for Algebraic Integer Based 8$\times$8 2-D
  DCT Computation by Edirisuriya, A. et al.
arXiv:1710.09975v1 [cs.AR] 27 Oct 2017
A Single-Channel Architecture for Algebraic Integer Based 8×8 2-D DCT Computation
A. Edirisuriya∗ A. Madanayake∗ R. J. Cintra† V. S. Dimitrov‡ N. T. Rajapaksha∗
Abstract
An area efficient row-parallel architecture is proposed for the real-time implementation of bivariate algebraic integer (AI) encoded
2-D discrete cosine transform (DCT) for image and video processing. The proposed architecture computes 8×8 2-D DCT transform
based on the Arai DCT algorithm. An improved fast algorithm for AI based 1-D DCT computation is proposed along with a single
channel 2-D DCT architecture. The design improves on the 4-channel AI DCT architecture that was published recently by reducing
the number of integer channels to one and the number of 8-point 1-D DCT cores from 5 down to 2. The architecture offers exact
computation of 8×8 blocks of the 2-D DCT coefficients up to the FRS, which converts the coefficients from the AI representation to
fixed-point format using the method of expansion factors. Prototype circuits corresponding to FRS blocks based on two expansion
factors are realized, tested, and verified on FPGA-chip, using a Xilinx Virtex-6 XC6VLX240T device. Post place-and-route results
show a 20% reduction in terms of area compared to the 2-D DCT architecture requiring five 1-D AI cores. The area-time and
area-time2 complexity metrics are also reduced by 23% and 22% respectively for designs with 8-bit input word length. The digital
realizations are simulated up to place and route for ASICs using 45 nm CMOS standard cells. The maximum estimated clock rate
is 951 MHz for the CMOS realizations, indicating 7.608·10^9 pixels/second and an 8×8 block rate of 118.875 MHz.
Keywords
DCT, Algebraic Integers, Expansion factors
1 Introduction
The discrete cosine transform (DCT) is a widely used mathemat-
ical tool in image and video compression. It is a core component
in contemporary media standards such as JPEG and MPEG [1].
Indeed, the DCT is known for its properties of decorrelation, en-
ergy compaction, separability, symmetry, and orthogonality [2].
However, the computational complexity of the DCT operation
imparts a significant burden on VLSI circuits for real-time applications.
Several algorithms have been proposed to reduce the
complexity of DCT circuits by exploiting its mathematical prop-
erties [3, 4, 5]. Integer transforms are employed in video stan-
dards such as H.264. However, we emphasize that these methods
are approximations, which are inherently inexact and may intro-
duce a computational error floor to the DCT evaluation.
To wit, one of the main obstacles in performing accurate
DCT computations is the implementation of the irrational co-
efficient multiplications needed to calculate the transform. Tra-
ditional DCT implementations adopt a compromise solution to
this problem employing truncation or rounding off [6] to approx-
imate these quantities. As a consequence, computational errors
are systematically introduced into the computation, leading to
degradation of the signal-to-noise ratio (SNR).
To partially address this issue, algebraic integer (AI) encod-
ing [7] has been employed [8, 9, 10, 11]. The main idea in this
approach is to map required irrational numbers into an array of
integers, which can be arithmetically manipulated in an error
free manner. At the end of the computation, AI based algo-
rithms require a final reconstruction step (FRS) in order to map
the resulting encoded integer arrays back into the usual fixed-
point representation. FRS can be implemented by means of
individualized circuits, in principle at any given precision [4].
∗A. Edirisuriya, A. Madanayake and N. T. Rajapaksha are with the Department of Electrical and Computing Engineering, University of Akron, Akron, OH, USA e-mail: arjuna@uakron.edu
†R. J. Cintra is with the Signal Processing Group, Departamento de Estatística, Universidade Federal de Pernambuco, PE 50740-540, Brazil e-mail: rjdsc@de.ufpe.br
‡V. S. Dimitrov is with the Department of Electrical and Computer Engineering, University of Calgary, Calgary, AB T2M 4S7, Canada e-mail: vdvsd103@gmail.com
Architectures based on the low-complexity Arai DCT algorithm [3] have been proposed in [9, 12]. The Arai DCT algorithm is a common choice for 8-point DCT computation in video and image processing applications because of its relatively low computational complexity. It is noted that this algorithm requires only five multiplications to generate the eight output coefficients. In the AI based architectures proposed in [9, 12], the algebraically encoded numbers are reconstructed and represented in fixed-point format at the end of the column-wise DCT calculation by means of an intermediate reconstruction step; the data is then encoded again before the row-wise DCT calculation. As a result, the intermediate reconstruction step introduces errors that propagate into subsequent sections. In a sense, the presence of such computational error in an intermediate stage of the calculation diminishes the point of employing an AI based structure.
To address this issue, a doubly AI encoded architecture, in which the reconstruction is performed only once at the end of the entire computation, was proposed in [13]. Such an architecture allows a completely error free computation throughout all algorithm stages of the 2-D DCT computation until the reconstruction stage. In fact, after the column-wise computation, the data path is divided and the row-wise computation is performed separately for each AI base. We use the term channel to refer to these data paths.
In this paper, we propose an improved fast algorithm for the
AI based 1-D DCT computation derived from the algorithm pro-
posed in [9]. Additionally, we present a 2-D DCT architecture
based on the improved 1-D transform. This 2-D architecture consists of a single channel, which provides improvements in terms of area and power consumption when compared with the four channel architecture described in [13]. We show that these improvements can be obtained without making any compromise in terms of accuracy. Detailed comparisons between the proposed architecture and some of the existing architectures are also provided.
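Table 1: 2-D AI encoding of Arai DCT constants
$$
\begin{array}{cc}
\cos(4\pi/16) & \cos(2\pi/16) - \cos(6\pi/16) \\[2pt]
\begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} & \begin{bmatrix} 0 & 0 \\ 2 & 0 \end{bmatrix} \\[10pt]
\cos(6\pi/16) & \cos(2\pi/16) + \cos(6\pi/16) \\[2pt]
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} & \begin{bmatrix} 0 & 2 \\ 0 & 0 \end{bmatrix}
\end{array}
$$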
2 Review of AI based DCT computation
AI encoding provides an exact representation devoid of quantization noise in DCT computations. An AI is defined as a root of a monic polynomial whose coefficients are integers [14]. A set of AIs may constitute a basis, and real numbers may be expressed as an integer linear combination of such basis elements. Thus, real numbers can possibly be represented without errors by an array of integers.
In [9], AI encoding is adapted to the Arai DCT algorithm [3], and the corresponding encoding is shown in Table 1. The corresponding AI basis is furnished by
$$
\begin{bmatrix} 1 & z_1 \\ z_2 & z_1 z_2 \end{bmatrix},
$$
where
$$
z_1 = \sqrt{2+\sqrt{2}} + \sqrt{2-\sqrt{2}}
\quad\text{and}\quad
z_2 = \sqrt{2+\sqrt{2}} - \sqrt{2-\sqrt{2}}.
$$
It should be noted that the hardware implementation of this representation requires only adders/subtracters.
Indeed, a given real number x is encoded into an integer array
$$
\begin{bmatrix} x^{(a)} & x^{(b)} \\ x^{(c)} & x^{(d)} \end{bmatrix},
$$
where x(a), x(b), x(c), and x(d) indicate the integers associated to basis elements 1, z1, z2, and z1z2, respectively. We refer to these integers as the AI components of x.
Therefore, quantity x can be decoded (reconstructed) from its
AI encoded form according to [9]:
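$$
x \equiv \operatorname{tr}\left(
\begin{bmatrix} x^{(a)} & x^{(b)} \\ x^{(c)} & x^{(d)} \end{bmatrix}
\cdot
\begin{bmatrix} 1 & z_1 \\ z_2 & z_1 z_2 \end{bmatrix}^{\top}
\right)
= x^{(a)} + x^{(b)} \cdot z_1 + x^{(c)} \cdot z_2 + x^{(d)} \cdot z_1 z_2,
\tag{1}
$$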
where tr(·) returns the trace of its argument and superscript ⊤
corresponds to the transposition operation. The decoding can
be done in a tailored FRS where the AI basis is represented at
the desired precision. Several FRS structures have been pro-
posed, including schemes that employ Booth encoding [9] and
the expansion factor method [13, 15].
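The decoding rule (1) can be illustrated with a minimal Python sketch. The function names `ai_decode` and `ai_add` and the dictionary `TABLE1` are illustrative choices, not names from the paper; floating-point arithmetic stands in for a finite-precision FRS. Evaluating the Table 1 entries this way suggests that each array decodes to four times the listed constant. That scale factor is an observation from this sketch, not a statement taken from the text.

```python
import math

# AI basis elements as defined in Section 2.
Z1 = math.sqrt(2 + math.sqrt(2)) + math.sqrt(2 - math.sqrt(2))
Z2 = math.sqrt(2 + math.sqrt(2)) - math.sqrt(2 - math.sqrt(2))


def ai_decode(x):
    """Evaluate (1) for a 2x2 integer array [[x_a, x_b], [x_c, x_d]]."""
    (xa, xb), (xc, xd) = x
    return xa + xb * Z1 + xc * Z2 + xd * Z1 * Z2


def ai_add(x, y):
    """Add two AI-encoded numbers component-wise, using integers only."""
    return [[x[r][c] + y[r][c] for c in range(2)] for r in range(2)]


# Integer arrays copied from Table 1.
TABLE1 = {
    "cos(4pi/16)": [[0, 0], [0, 1]],
    "cos(2pi/16) - cos(6pi/16)": [[0, 0], [2, 0]],
    "cos(6pi/16)": [[0, 1], [-1, 0]],
    "cos(2pi/16) + cos(6pi/16)": [[0, 2], [0, 0]],
}

if __name__ == "__main__":
    for name, array in TABLE1.items():
        print(f"{name:28s} decodes to {ai_decode(array):.6f}")
```

Because addition acts on the integer components only, sums of encoded constants stay exact until `ai_decode` is called, which mirrors the role of the FRS.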
3 Improved AI based 1-D DCT algorithm
The proposed 1-D DCT algorithm is derived from the 1-D AI
Arai DCT [9, 13], using algebraic manipulations. Its computa-
tional complexity consists of 20 additions; no multiplication or
shift operations are required, as depicted in Fig. 1. This provides
an improvement in terms of hardware resources when compared
with the algorithm proposed in [13] and [9], which requires 21 additions and 2 shift operations. Although the economy of one addition and two bit-shifting operations may seem modest when only a single DCT is considered, we note that in video processing systems the 1-D DCT is performed many times (once per column, per row, per component, per 8×8 block, per frame). Thus, the overall computational savings are cumulative. Further, the proposed method as well as its competitors are highly optimized procedures; thus one may not expect enormous gains in terms of computational complexity. Indeed, we are approaching the asymptotic limits of the theoretical DCT complexity. Finally, it is noted that these savings account for about 5% of the total complexity when only DCTs are considered, which can be a significant gain especially for low-power applications.
For the purpose of mathematical analysis, the transform is described in matrix notation. Let A denote the matrix transformation associated to the 1-D Arai DCT algorithm defined over the AI structure and B denote the matrix related to the operations in the FRS. Matrix A is represented as a signal flow diagram in the dashed block of Fig. 1. Matrix B simply carries the AI representation, ensuring that the resulting calculation is expressed over the AI basis as shown in (1).
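Therefore, the complete 1-D AI DCT is given by
$$
\mathbf{X}_{\mathrm{1D}} = \mathbf{B} \cdot \mathbf{A} \cdot \mathbf{x}_{\mathrm{1D}},
$$
where x1D and X1D are 8-point column vectors for the input and output sequences, respectively, and where
$$
\mathbf{A} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\
1 & 0 & 0 & -1 & -1 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
0 & -1 & -1 & 0 & 0 & 1 & 1 & 0 \\
-1 & -1 & 1 & 1 & -1 & -1 & 1 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix},
\quad
\mathbf{B} =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & z_1 z_2 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -z_1 z_2 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -z_2 & -z_1 z_2 & -z_1 & 1 \\
0 & 0 & 0 & 0 & z_2 & -z_1 z_2 & z_1 & 1 \\
0 & 0 & 0 & 0 & -z_1 & z_1 z_2 & z_2 & 1 \\
0 & 0 & 0 & 0 & z_1 & z_1 z_2 & -z_2 & 1
\end{bmatrix}.
\tag{2}
$$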
The multiplications present in B can be efficiently imple-
mented by means of the expansion factor method as described
in [15]. The expansion factor method employs a factor α which
scales the required multiplicands z1, z2, and z1z2 into quantities
that are close to integers governed by the following relationship:
$$
\alpha \cdot \begin{bmatrix} z_1 & z_2 & z_1 z_2 \end{bmatrix} \approx \begin{bmatrix} m_1 & m_2 & m_3 \end{bmatrix},
\tag{3}
$$
where m1, m2, and m3 are integers. This approach entails a reduction in the overall number of multiplications required by the FRS [15].
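Relationship (3) can be checked numerically with a short Python sketch. The two values of α used here are the expansion factors quoted in Section 5.4 of this paper; the helper names `nearest_integers` and `worst_relative_error` are hypothetical and only serve the illustration. The second helper is one possible way to gauge how close α·z lands to an integer, not a metric defined in the text.

```python
import math

# AI basis elements as defined in Section 2.
Z1 = math.sqrt(2 + math.sqrt(2)) + math.sqrt(2 - math.sqrt(2))
Z2 = math.sqrt(2 + math.sqrt(2)) - math.sqrt(2 - math.sqrt(2))
MULTIPLICANDS = (Z1, Z2, Z1 * Z2)


def scaled_multiplicands(alpha):
    """Left-hand side of (3): alpha * [z1, z2, z1*z2]."""
    return [alpha * z for z in MULTIPLICANDS]


def nearest_integers(alpha):
    """Candidate integers [m1, m2, m3] obtained by rounding."""
    return [round(v) for v in scaled_multiplicands(alpha)]


def worst_relative_error(alpha):
    """Largest relative gap between alpha*z and its nearest integer."""
    return max(abs(v - round(v)) / v for v in scaled_multiplicands(alpha))


if __name__ == "__main__":
    for alpha in (4.5958, 167.2309):
        print(alpha, nearest_integers(alpha), worst_relative_error(alpha))
```

Under this measure the larger expansion factor gives a tighter integer fit, which is consistent with the accuracy ordering reported for the two FRS variants.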
4 Row-parallel 2-D DCT architecture
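Let x2D and X2D be the 8×8 input and output data of the 2-D DCT, respectively. The 2-D DCT of x2D can be obtained after (i) applying the 1-D DCT to its columns; (ii) transposing the resulting matrix; and (iii) applying the 1-D DCT to the rows of the transposed matrix. Mathematically, the above procedure is given by:
$$
\mathbf{X}_{\mathrm{2D}} = \mathbf{B} \cdot \mathbf{A} \cdot \mathbf{x}_{\mathrm{2D}} \cdot (\mathbf{B} \cdot \mathbf{A})^{\top}
= \mathbf{B} \cdot \mathbf{A} \cdot \mathbf{x}_{\mathrm{2D}} \cdot \mathbf{A}^{\top} \cdot \mathbf{B}^{\top}.
\tag{4}
$$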
Matrix multiplications by B (reconstruction step) should be the final computation stage. Otherwise, an earlier multiplication by B represents an intermediate reconstruction stage, which may reintroduce numerical representation errors. Such errors would then propagate to subsequent 1-D DCT calls. This may defeat the purpose of AI encoding.
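The integer part of (4), with both multiplications by B deferred, can be sketched in Python with NumPy as follows. Matrix `A` is copied from (2). The function names are illustrative, and reusing one column-wise routine on both sides of the transposition is one reading of the row-column procedure, not a description of the authors' circuit. The final transpose is included only so that the result can be compared with the direct product A · x2D · A⊤.

```python
import numpy as np

# Integer matrix A from (2).
A = np.array([
    [ 1,  1,  1,  1,  1,  1,  1,  1],
    [ 1, -1, -1,  1,  1, -1, -1,  1],
    [ 1,  1, -1, -1, -1, -1,  1,  1],
    [ 1,  0,  0, -1, -1,  0,  0,  1],
    [ 1,  1,  1,  1, -1, -1, -1, -1],
    [ 0, -1, -1,  0,  0,  1,  1,  0],
    [-1, -1,  1,  1, -1, -1,  1,  1],
    [ 1,  0,  0,  0,  0,  0,  0, -1],
], dtype=np.int64)


def transform_columns(block):
    """Apply the integer stage A to every column of an 8x8 block."""
    return A @ block


def two_pass(block):
    """Column pass, transposition, second column pass, transpose back."""
    first = transform_columns(block)       # A . x
    second = transform_columns(first.T)    # A . x^T . A^T
    return second.T                        # A . x . A^T


if __name__ == "__main__":
    x = np.arange(64, dtype=np.int64).reshape(8, 8) - 32
    y = two_pass(x)
    print(np.array_equal(y, A @ x @ A.T), y.dtype)
```

Every entry of the result is an integer, so no rounding takes place before B is applied.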
In [13] an AI based architecture, where the reconstruction
step occurs only at the end of the computation, was proposed.
In other words, the reconstruction step was placed after both
row- and column-wise transforms.
In the current contribution, we propose a similar scheme.
However, unlike the architecture described in [13], which requires
a column-wise DCT call for each of the four AI components,
we propose an architecture which requires a single column-wise
DCT. As a result, the computational complexity is greatly de-
creased in this particular section of the algorithm. The complete
block diagram of the 2-D DCT architecture is given in Fig. 2.
[Figure 1 placeholder: signal flow graph with inputs x0–x7, block A producing Y0–Y7, and the FRS (block B) producing outputs X0–X7; legend: addition, negation.]
Figure 1: Fast algorithm for AI based 1-D DCT.
[Figure 2 placeholder: block diagram with inputs x0,k–x7,k, a 1-D block A, a transpose buffer, a second 1-D block A producing y0,k–y7,k, and the B(·)B⊤ block producing outputs X0,k–X7,k.]
Figure 2: Block diagram of the proposed 2-D DCT architecture.
4.1 Derivation of the 1-Channel Architecture
In this subsection, we describe mathematically the derivation of
the new architecture. Consider the following decomposition of
the matrix B, which follows from (2):
B = B0 +B1 · z1 +B2 · z2 +B3 · z1z2, (5)
where
B0 =


1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1

 , B1 =


0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 −1 0
0 0 0 0 0 0 1 0
0 0 0 0 −1 0 0 0
0 0 0 0 1 0 0 0


B2 =


0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 −1 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0
0 0 0 0 0 0 −1 0

 ,B3 =


0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 −1 0 0 0 0
0 0 0 0 0 −1 0 0
0 0 0 0 0 −1 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 1 0 0

 .
Matrices B0, B1, B2, and B3 are sparse, and contain only 0, 1,
and −1, leading to low complexity.
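Therefore, applying (5) to (4), the 2-D DCT computation can be rewritten as follows. Let Y2D = A · x2D · A⊤. Notice that the calculation of Y2D requires no multiplications. Then, we have:
$$
\begin{aligned}
\mathbf{X}_{\mathrm{2D}} &= \mathbf{B} \cdot \mathbf{Y}_{\mathrm{2D}} \cdot \mathbf{B}^{\top} \\
&= \sum_{i=0}^{3} \sum_{j=0}^{3} \mathbf{B}_i \cdot \mathbf{Y}_{\mathrm{2D}} \cdot \mathbf{B}_j^{\top} \cdot z_1^{m} z_2^{n},
\end{aligned}
\tag{6}
$$
where m, n ∈ {0, 1, 2} are the powers of z1 and z2 contributed by the basis elements that multiply Bi and Bj in (5). Therefore, the evaluation of X2D is highly dependent on the computation of the structure Bi · (·) · B⊤j.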
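The decomposition (5) and the double sum (6) can be verified numerically with a small NumPy sketch. The helper `sparse` and the names `PARTS` and `WEIGHTS` are illustrative assumptions; the non-zero positions are transcribed from the matrices B0 to B3 using 0-based indices, and floating point stands in for the irrational basis elements.

```python
import numpy as np

Z1 = np.sqrt(2 + np.sqrt(2)) + np.sqrt(2 - np.sqrt(2))
Z2 = np.sqrt(2 + np.sqrt(2)) - np.sqrt(2 - np.sqrt(2))


def sparse(entries):
    """Build an 8x8 integer matrix from {(row, col): value}, 0-based."""
    m = np.zeros((8, 8), dtype=np.int64)
    for (r, c), v in entries.items():
        m[r, c] = v
    return m


B0 = sparse({(0, 0): 1, (1, 1): 1, (2, 2): 1, (3, 2): 1,
             (4, 7): 1, (5, 7): 1, (6, 7): 1, (7, 7): 1})
B1 = sparse({(4, 6): -1, (5, 6): 1, (6, 4): -1, (7, 4): 1})
B2 = sparse({(4, 4): -1, (5, 4): 1, (6, 6): 1, (7, 6): -1})
B3 = sparse({(2, 3): 1, (3, 3): -1, (4, 5): -1, (5, 5): -1,
             (6, 5): 1, (7, 5): 1})

PARTS = [B0, B1, B2, B3]
WEIGHTS = [1.0, Z1, Z2, Z1 * Z2]   # basis element paired with each B_i

# Equation (5): B as a weighted sum of the four integer matrices.
B = sum(w * p for w, p in zip(WEIGHTS, PARTS))


def reconstruct_direct(y):
    """First line of (6): B . Y . B^T."""
    return B @ y @ B.T


def reconstruct_by_terms(y):
    """Second line of (6): sum of the 16 terms B_i . Y . B_j^T."""
    total = np.zeros((8, 8))
    for wi, bi in zip(WEIGHTS, PARTS):
        for wj, bj in zip(WEIGHTS, PARTS):
            # The matrix product is integer-only; only the weight is irrational.
            total += wi * wj * (bi @ y @ bj.T)
    return total


if __name__ == "__main__":
    y = np.arange(64, dtype=np.int64).reshape(8, 8) - 20
    print(np.allclose(reconstruct_direct(y), reconstruct_by_terms(y)))
```

Each of the 16 products inside the loop involves only entries 0, 1, and −1, which is why the block Bi · (·) · B⊤j dominates the design of the remaining hardware.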
5 Efficient Implementation of Bi · (·) ·B⊤j block
Matrices Bi, i ∈ {0, 1, 2, 3}, contain at most one non-zero element in each row. Therefore any matrix multiplication by Bi
can be trivially performed.
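As a hedged illustration of why such a product is trivial, the sketch below stores B1 as one signed tap per output row and applies it by plain row selection. The encoding `B1_WIRING` and both helper functions are hypothetical names introduced for this example; indices are 0-based and `None` marks an all-zero row.

```python
import numpy as np

# B_1 from Section 4.1 as (source_row, sign) per output row.
B1_WIRING = [None, None, None, None, (6, -1), (6, 1), (4, -1), (4, 1)]


def apply_wiring(wiring, y):
    """Left-multiply y by the sparse matrix described by `wiring`."""
    out = np.zeros_like(y)
    for r, tap in enumerate(wiring):
        if tap is not None:
            src, sign = tap
            out[r] = sign * y[src]   # copy (or negate) one row of y
    return out


def wiring_to_matrix(wiring):
    """Expand the wiring description into a dense integer matrix."""
    n = len(wiring)
    m = np.zeros((n, n), dtype=np.int64)
    for r, tap in enumerate(wiring):
        if tap is not None:
            src, sign = tap
            m[r, src] = sign
    return m


if __name__ == "__main__":
    y = np.arange(64, dtype=np.int64).reshape(8, 8)
    dense = wiring_to_matrix(B1_WIRING)
    print(np.array_equal(apply_wiring(B1_WIRING, y), dense @ y))
```

A right-multiplication by B⊤j can be handled the same way on columns, for example as `apply_wiring(wiring, m.T).T`.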
To maintain the row-parallel structure of the design, the block
that computes Bi ·Y2D ·B⊤j should output a single row of data
per clock cycle. Hence the 64 elements in the 8×8 Y2D matrix
should be stored for 8 clock cycles. The signal flow graph (SFG)
corresponding to the jth row of the storage structure is depicted
in Fig. 3.
The left- and right-multiplication of Y2D by an arbitrary 8×8 matrix would require 56 storage elements operating at a clock rate of Fclk and another 64 storage elements operating at a rate of Fclk/8. However, this operation requires far fewer storage elements due to the sparse nature of the matrices under consideration. In the next section we investigate this possibility.
5.1 Derivation of Half Column Independence
The 8×8 matrices Bi, i ∈ {0, 1, 2, 3}, can be decomposed into a
block matrix of 4×4 blocks in the following manner:
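$$
\mathbf{B}_i =
\begin{bmatrix}
\mathbf{B}_{i,0} & \mathbf{0}_4 \\
\mathbf{0}_4 & \mathbf{B}_{i,1}
\end{bmatrix},
\tag{7}
$$
where 04 is the 4×4 null matrix. Bi is a block diagonal matrix. Similarly,
$$
\mathbf{Y}_{\mathrm{2D}} =
\begin{bmatrix}
\mathbf{Y}_{\mathrm{2D},0} & \mathbf{Y}_{\mathrm{2D},1} \\
\mathbf{Y}_{\mathrm{2D},2} & \mathbf{Y}_{\mathrm{2D},3}
\end{bmatrix}.
$$
Therefore, the product Bi · Y2D · B⊤j is given by
$$
\begin{aligned}
\mathbf{B}_i \cdot \mathbf{Y}_{\mathrm{2D}} \cdot \mathbf{B}_j^{\top}
&=
\begin{bmatrix}
\mathbf{B}_{i,0} & \mathbf{0}_4 \\
\mathbf{0}_4 & \mathbf{B}_{i,1}
\end{bmatrix}
\cdot
\begin{bmatrix}
\mathbf{Y}_{\mathrm{2D},0} & \mathbf{Y}_{\mathrm{2D},1} \\
\mathbf{Y}_{\mathrm{2D},2} & \mathbf{Y}_{\mathrm{2D},3}
\end{bmatrix}
\cdot
\begin{bmatrix}
\mathbf{B}_{j,0}^{\top} & \mathbf{0}_4 \\
\mathbf{0}_4 & \mathbf{B}_{j,1}^{\top}
\end{bmatrix} \\
&=
\begin{bmatrix}
\mathbf{B}_{i,0} \cdot \mathbf{Y}_{\mathrm{2D},0} \cdot \mathbf{B}_{j,0}^{\top} & \mathbf{B}_{i,0} \cdot \mathbf{Y}_{\mathrm{2D},1} \cdot \mathbf{B}_{j,1}^{\top} \\
\mathbf{B}_{i,1} \cdot \mathbf{Y}_{\mathrm{2D},2} \cdot \mathbf{B}_{j,0}^{\top} & \mathbf{B}_{i,1} \cdot \mathbf{Y}_{\mathrm{2D},3} \cdot \mathbf{B}_{j,1}^{\top}
\end{bmatrix}.
\end{aligned}
\tag{8}
$$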
The above relations show that the computation of Bi · Y2D · B⊤j, for a particular choice of i and j, can be split into two sequential steps that are independent of each other. The computation of (8) is achieved by the following steps:
1. compute the first four rows of the resulting matrix using the
first four rows of Y2D;
2. compute the last four rows of the resulting matrix using the
last four rows of Y2D.
The above sequence of computation can be realized using the SFG in Fig. 4. Eight consecutive rows are stored in clocked FIFO registers. A total of 24 FIFO registers are needed for the section of the circuit operating at Fclk, while 32 registers are required for the slower section operating at Fclk/4, the parallel-load rate indicated in Fig. 4.
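The two-step evaluation of (8) can be checked with a short NumPy sketch. Only B0 and B3 are used to keep the example compact; the helper `sparse` and the function names are illustrative assumptions, with non-zero positions transcribed from Section 4.1 using 0-based indices.

```python
import numpy as np


def sparse(entries):
    """Build an 8x8 integer matrix from {(row, col): value}, 0-based."""
    m = np.zeros((8, 8), dtype=np.int64)
    for (r, c), v in entries.items():
        m[r, c] = v
    return m


B0 = sparse({(0, 0): 1, (1, 1): 1, (2, 2): 1, (3, 2): 1,
             (4, 7): 1, (5, 7): 1, (6, 7): 1, (7, 7): 1})
B3 = sparse({(2, 3): 1, (3, 3): -1, (4, 5): -1, (5, 5): -1,
             (6, 5): 1, (7, 5): 1})


def full_product(bi, y, bj):
    """Reference result using all eight rows of y at once."""
    return bi @ y @ bj.T


def half_by_half(bi, y, bj):
    """Evaluate (8) in two steps, each touching only four rows of y."""
    out = np.empty((8, 8), dtype=np.int64)
    out[:4] = bi[:4, :4] @ y[:4] @ bj.T   # step 1: B_{i,0} and rows 0-3
    out[4:] = bi[4:, 4:] @ y[4:] @ bj.T   # step 2: B_{i,1} and rows 4-7
    return out


if __name__ == "__main__":
    y = np.arange(64, dtype=np.int64).reshape(8, 8)
    for bi in (B0, B3):
        for bj in (B0, B3):
            print(np.array_equal(full_product(bi, y, bj), half_by_half(bi, y, bj)))
```

Since each step needs only four rows of Y2D, a buffer holding half a block at a time is sufficient in this sketch.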
5.2 Buffer
The buffer shown in Fig. 5 consists of a shift register section and
a bank of parallel-load registers. At time instant k, the shift
registers hold four consecutive rows of Y2D denoted by yk,0...7,
yk−1,0...7, yk−2,0...7, and yk−3,0...7. When k (mod 4) = 0, a par-
allel load is executed, synchronously transferring the content of
the shift registers (Fig. 5, block P) into the parallel load reg-
ister bank (Fig. 5, block Q). The timing diagram portraying
register transfers in this block is shown in Fig. 6. Transfers occur at every positive clock edge of Fclk. Next, the computation of the Bi,q · Y2D,r · B⊤j,s block, where i ∈ {0, 1, 2, 3}, q ∈ {0, 1}, r ∈ {0, 1, 2, 3}, and s ∈ {0, 1}, is considered.
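The buffer behaviour described in this subsection can be mimicked with a small behavioural model in Python. This is a sketch under stated assumptions, not a register-accurate description of Fig. 5: clock cycles are counted from 1, block P is modelled as a four-deep queue of rows, and block Q is modelled as a list of latched snapshots. The name `simulate_buffer` is hypothetical.

```python
from collections import deque


def simulate_buffer(rows, depth=4):
    """Model the shift register section (P) and parallel-load bank (Q).

    rows: iterable of rows of Y2D, one row arriving per clock cycle.
    Returns the successive contents latched into bank Q.
    """
    shift = deque(maxlen=depth)   # block P, newest row on the left
    latched = []                  # successive contents of block Q
    # Assumption: cycles are numbered from 1 so that a load happens
    # once four rows have been shifted in.
    for k, row in enumerate(rows, start=1):
        shift.appendleft(row)     # the oldest row drops off the right
        if k % depth == 0:
            # Parallel load: y_k, y_{k-1}, y_{k-2}, y_{k-3}
            latched.append(list(shift))
    return latched


if __name__ == "__main__":
    incoming = [f"y{k}" for k in range(1, 9)]
    for snapshot in simulate_buffer(incoming):
        print(snapshot)
```

In this model bank Q changes only once every four cycles, so the downstream cross-wiring sees a stable half block while the next four rows are being shifted in.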
5.3 Computing the Bi,q · Y2D,r · B⊤j,s Block
The computation of Bi,q · Y2D,r · B⊤j,s is achieved by cross-wiring. This step of the computation is free of addition and subtraction operations. Eight-input multiplexers are used for cross-wiring. Notice that (i) Bi,q Y2D,0 and Bi,q Y2D,2 and (ii) Bi,q Y2D,1 and Bi,q Y2D,3 are to be computed in a time-synchronous scheme. Thus, we need 32 multiplexers in total in order to compute all the 16 terms of (6). The multiplexers are switched time-synchronously, leading to periodic connections being made to the appropriate outputs of Block Q in Fig. 5. The wiring of the multiplexer inputs is governed by (6) and (8).
5.4 Final Reconstruction Step (FRS)
The cross-wired outputs from the Bi · Y2D · B⊤j block are presented to the FRS stage in order to recover the fixed-point coefficients. The computation is completely error free up to the FRS stage. The FRS implementation based on expansion factors α = 4.5958 and 167.2309 proposed in [13] was employed in the proposed hardware implementation. These expansion factors correspond to integer sets [m1 m2 m3] = [12 5 13] and [437 181 473], respectively, with each set satisfying (3). The architecture using expansion factor 167.2309 offers a significant improvement in accuracy when compared to expansion factor 4.5958 [13]. In [13, Sec. IV] a comprehensive account of the FRS implementation is furnished. Such previously described FRS is applied in the proposed architecture.
Figure 3: SFG depicting the direct implementation of the buffer section in the Bi · (·) · B⊤j block for the jth row.
Figure 4: SFG depicting the multiplexed implementation of the buffer section in the Bi · (·) · B⊤j block for the jth row.
Figure 5: Implementation of the B(·)B⊤ block (register content after the rising edge of the ith clock cycle, where i (mod 4) = 0, is shown).
Figure 6: Timing diagram depicting the register content at each clock cycle of Fclk with k (mod 4) = 0. Delay elements Dfα, Dsα, where α = 0, 1, 2, 3, are defined in Fig. 4 (active high, edge triggered).
6 Hardware Implementation and Results
Two designs corresponding to expansion factors 4.5958 and
167.2309 were physically implemented and tested on-chip using
field programmable gate array (FPGA) technology and thereafter mapped to 45 nm CMOS technology. We employed a Xilinx ML605 evaluation kit, which is populated with a Xilinx
Virtex-6 XC6VLX240T FPGA device. The JTAG interface was
used to input the test 8×8 2-D DCT arrays to the device from
the Matlab workspace. The measured outputs were returned
to the Matlab workspace.
6.1 On-chip Verification using Success Rates
As a figure of merit, we considered the success rate, defined as the percentage of coefficients which are within the error limit of ±e%. For e ∈ {0.005, 0.01, 0.05, 0.1, 1, 5, 10}, the success rates are given in Table 2. Input word length L was set to 4 and 8 bits. The proposed AI architectures for the FPGA are designed to be overflow-free at each stage throughout the AI encoded structure. The accuracy results obtained are exactly the same as the ones for the designs proposed in [13] based on the expansion factor FRS. Notice however that up to the FRS all AI encoded 2-D DCT computation is totally error free.
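A minimal Python sketch of the success-rate figure of merit follows. It assumes the error is measured relative to a reference (for example floating-point) coefficient, and it treats a zero reference as a hit only on an exact match; both choices are assumptions of this example, as the text does not spell them out. The function name and the sample numbers are made up for illustration and are not measured data.

```python
def success_rate(measured, reference, tolerance_percent):
    """Percentage of coefficients within +/- tolerance_percent of the reference."""
    hits = 0
    total = 0
    for m, r in zip(measured, reference):
        if r == 0:
            # Assumption: a zero reference only counts when matched exactly.
            ok = (m == 0)
        else:
            ok = abs(m - r) / abs(r) * 100.0 <= tolerance_percent
        hits += ok
        total += 1
    return 100.0 * hits / total


if __name__ == "__main__":
    # Toy values chosen for the example, not results from the paper.
    reference = [100.0, 200.0, -50.0, 10.0]
    measured = [100.004, 200.3, -50.0, 10.6]
    for e in (0.005, 1, 5, 10):
        print(e, success_rate(measured, reference, e))
```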
6.2 Area, Critical Path Delay, and Power Metrics
The design was simulated up to place and route in 45 nm CMOS
process using NCSU 45 nm PDK for ASICs and physically im-
plemented using 40 nm CMOS Xilinx Virtex-6 XC6VLX240T
FPGA. Then the area (A), critical path delay (T ) and power
consumption for each implementation were obtained, and are
shown in Tables 3 and 4 for FPGA-based physical implementa-
tion and ASIC simulation. The ASIC was simulated for power
consumption and timing at an operating voltage (VDD) of 1.1 V.
Area utilization in mm2 and the gate count in terms of 2-input
NAND gates are also provided. The metrics were measured
for different choices of finite precision using input word length
L ∈ {4, 8} bits. Similar metrics reported in [13] using the same
FPGA device and tools are also given in Table 3 for comparison
purposes.
The area utilization in FPGA implementation is given in Ta-
ble 3 in terms of the number of elementary programmable logic
blocks (slices) whereas the ASIC equivalent is given in terms of
the chip area. The total power consumption of the hardware de-
sign is constituted of static (leakage) and dynamic components.
Static power consumption in FPGAs is dominated by leakage
power of the logic fabric and of the configuration memory. Thus
this quantity is essentially independent of the proposed design.
Hence, only the dynamic power consumption of the FPGA im-
plementation is given. For the ASIC implementation both leak-
age and dynamic power components are given in Table 4. The
area-time (AT) and area-time-squared (AT²) metrics are also provided as a measurement of the overall performance. The AT performance is a suitable metric for area efficient designs, while the AT² metric could be used for designs with the speed of operation as the optimization goal [16]. The proposed architecture leads to a reduction of 23% of AT and 22% of AT² (Table 3) when compared with the architecture requiring five 1-D DCT cores [13], for designs using the set of integers [437 181 473] with 8-bit input word length. The CMOS design with clock frequency of 951 MHz amounts to a pixel rate of 7.608 G pixels/sec and an 8×8 block rate of 118.875 M blocks/sec.
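The throughput and area-time figures quoted here follow from simple arithmetic, sketched below in Python. The sketch assumes eight pixels (one row) are consumed per clock cycle, as implied by the row-parallel structure, and takes T as the reciprocal of the clock frequency. Function names are illustrative; the slice counts and frequencies in the usage example are the 8-bit (437, 181, 473) entries of Table 3.

```python
def throughput(clock_mhz, pixels_per_cycle=8, block_pixels=64):
    """Return (pixel rate, block rate), both in millions per second."""
    pixel_rate = clock_mhz * pixels_per_cycle
    block_rate = pixel_rate / block_pixels
    return pixel_rate, block_rate


def area_time(area, clock_mhz):
    """Return (AT, AT^2) with T = 1 / clock expressed in microseconds."""
    t_us = 1.0 / clock_mhz
    return area * t_us, area * t_us ** 2


if __name__ == "__main__":
    print(throughput(951))                 # 45 nm CMOS design
    at_ref, at2_ref = area_time(3445, 307.8)   # architecture of [13]
    at_new, at2_new = area_time(2591, 303.5)   # proposed architecture
    print(round(at_ref, 2), round(at_new, 2))
    print(round(at2_ref, 3), round(at2_new, 3))
```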
6.3 Comparison with Other Architectures
Table 5 provides a comparison between the existing AI based 2-D DCT architectures and designs based on the proposed architecture. Eight-bit versions of designs using expansion factors 4.5958 and 167.2309, corresponding to integer sets [12 5 13] and [437 181 473], respectively, are used for the comparison. The comparison clearly indicates the advantages of the proposed scheme with respect to throughput, accuracy, and flexibility.
7 Conclusion
An area efficient row-parallel architecture for 8×8 2-D DCT com-
putation based on AI number representation leading to exact
computations up to the FRS is proposed. The number of inte-
ger channels after the column-wise transform is reduced to one
down from four in [13] by eliminating the redundancies present
in AI channels. This further enables the simplification of the 1-D
blocks and the overall hardware complexity where only two 1-D
DCT blocks are needed. Architectural variants corresponding to
two expansion factors for the FRS were physically implemented,
each at 4-bit and 8-bit input precision, and subsequently verified
on a Xilinx Virtex-6 XC6VLX240T FPGA device. The maxi-
mum clock frequency for the FPGA realizations is 294.3 MHz.
In addition, the architectures were mapped to custom silicon realizations using 45 nm CMOS technology. The realizations were simulated up to the place and route stage, but no fabrication was attempted. The simulated designs achieved a potential maximum clock frequency of 951 MHz as reported by the Cadence Encounter tool.
References
[1] C.-J. Lian, Y.-W. Huang, H.-C. Fang, Y.-C. Chang, and L.-
G. Chen, “JPEG, MPEG-4, and H.264 codec IP develop-
ment,” in Proceedings of the Design, Automation and Test
in Europe, vol. 2, pp. 1118–1119, Mar. 2005.
[2] J. F. Blinn, “What’s the deal with the DCT?,” Computer
Graphics and Applications, IEEE, vol. 13, pp. 78–83, July
1993.
[3] Y. Arai, T. Agui, and M. Nakajima, “A fast DCT-SQ scheme
for images,” The Transactions of the IEICE, vol. E71,
pp. 1095–1097, 1988.
[4] V. S. Dimitrov, G. A. Jullien, and W. C. Miller, “A new
DCT algorithm based on encoding algebraic integers,” in
Proceedings of the 1998 IEEE International Conference on
Acoustics, Speech and Signal Processing, vol. 3, pp. 1377–
1380, May 1998.
[5] K. Wahid, Error-Free Implementation of the Discrete Co-
sine Transform. PhD thesis, University of Calgary, 2010.
Table 2: Success rates (%) of the DCT coefficient computation for various fixed-point bus widths and tolerance levels
(m1, m2, m3)    | L | 10%     | 5%      | 1%      | 0.1%    | 0.05%   | 0.01%   | 0.005%
(12, 5, 13)     | 4 | 99.0367 | 98.1356 | 90.4067 | 51.7311 | 42.5511 | 31.4689 | 24.5189
(12, 5, 13)     | 8 | 99.0356 | 98.0089 | 90.4433 | 51.9189 | 42.7911 | 31.4267 | 24.3556
(437, 181, 473) | 4 | 99.9867 | 99.9744 | 99.8856 | 98.9044 | 97.9    | 89.8322 | 80.8699
(437, 181, 473) | 8 | 99.99   | 99.9744 | 99.8856 | 98.9044 | 97.9    | 89.8322 | 80.8689

Table 3: Area-speed and power consumption for FPGA implementation on Xilinx Virtex-6 XC6VLX240T
(m1, m2, m3)    | L | Slices          | Frequency (MHz) | Dynamic power (mW) | AT (slices · µs) | AT² (slices · µs²)
                |   | [13] | Proposed | [13]  | Proposed | [13] | Proposed    | [13]  | Proposed | [13]  | Proposed
(12, 5, 13)     | 4 | 2377 | 2212     | 309.9 | 316.8    | 1871 | 616         | 7.67  | 6.98     | 0.025 | 0.022
(12, 5, 13)     | 8 | 3144 | 2536     | 300.4 | 294.3    | 1687 | 695         | 10.45 | 8.61     | 0.034 | 0.029
(437, 181, 473) | 4 | 2605 | 2291     | 312.4 | 312.9    | 912  | 592         | 8.34  | 7.32     | 0.028 | 0.024
(437, 181, 473) | 8 | 3445 | 2591     | 307.8 | 303.5    | 1123 | 714         | 11.19 | 8.54     | 0.036 | 0.028

Table 4: Simulated results for area-speed and power consumption from CMOS 45 nm ASIC place and route (VDD = 1.1 V)
(m1, m2, m3)    | L | Area (mm²) | Gate count | Speed (MHz) | Leakage power (mW) | Dynamic power (mW) | Total power (mW) | AT (mm² · ns) | AT² (mm² · ns²)
(12, 5, 13)     | 4 | 0.303      | 161.1K     | 951         | 1.693              | 941.8              | 943.5            | 0.317         | 0.335
(12, 5, 13)     | 8 | 0.404      | 215.2K     | 949         | 2.220              | 1273.9             | 1276             | 0.426         | 0.449
(437, 181, 473) | 4 | 0.394      | 209.9K     | 947         | 2.222              | 1073.7             | 1076             | 0.416         | 0.439
(437, 181, 473) | 8 | 0.439      | 233.9K     | 946         | 2.942              | 1451.8             | 1455             | 0.464         | 0.490

Table 5: Comparison of the proposed implementation with existing algebraic integer implementations
                                   | Nandi et al. [17]          | Jullien et al. [18]         | Wahid et al. [19]  | Madanayake et al. [13] (12, 5, 13) | Madanayake et al. [13] (437, 181, 473) | Proposed (12, 5, 13) | Proposed (437, 181, 473)
Measured results                   | No                         | No                          | No                 | Yes                                | Yes                                    | Yes                  | Yes
Structure                          | Single 1-D DCT + Mem. bank | Two 1-D DCT + Dual port RAM | Two 1-D DCT + TMEM | Five 1-D DCT + TMEM                | Five 1-D DCT + TMEM                    | Two 1-D DCT + TMEM   | Two 1-D DCT + TMEM
Multipliers                        | 0                          | 0                           | 0                  | 0                                  | 0                                      | 0                    | 0
Exact 2D AI computation            | No                         | No                          | No                 | Yes                                | Yes                                    | Yes                  | Yes
Operating frequency (MHz)          | N/A                        | 75                          | 194.7              | 300.39                             | 307.78                                 | 949                  | 946
8×8 blocks per clock cycle         | 1/128                      | 1/64                        | 1/64               | 1/8                                | 1/8                                    | 1/8                  | 1/8
8×8 block rate (×10^6 s^−1)        | 7.8125                     | 1.171                       | 3.042              | 37.55                              | 38.47                                  | 118.625              | 118.25
Pixel rate (×10^6 s^−1)            | 125                        | 75                          | 194.7              | 1158.95                            | 1187.35                                | 7592                 | 7568
Implementation technology          | Xilinx XC5VLX30            | 0.18 µm CMOS                | 0.18 µm CMOS       | Xilinx XCVLX240T                   | Xilinx XCVLX240T                       | 45 nm CMOS           | 45 nm CMOS
Coupled quantization noise         | Yes                        | Yes                         | Yes                | No                                 | No                                     | No                   | No
Independently adjustable precision | No                         | No                          | No                 | Yes                                | Yes                                    | Yes                  | Yes
FRS between row-column stages      | No                         | Yes                         | Yes                | No                                 | No                                     | No                   | No
[6] A. V. Oppenheim and C. J. Weinstein, “Effects of finite
register length in digital filtering and the fast Fourier trans-
form,” Proceedings of the IEEE, vol. 60, pp. 957–976, Aug.
1972.
[7] J. Cozzens and L. Finkelstein, “Computing the discrete
Fourier transform using residue number systems in a ring
of algebraic integers,” Information Theory, IEEE Transac-
tions on, vol. 31, pp. 580–588, Sept. 1985.
[8] R. A. Games, D. Moulin, and J. J. Rushanan, “VLSI de-
sign of an algebraic-integer signal processor,” in Proceedings
of the 32nd Midwest Symposium on Circuits and Systems,
vol. 2, pp. 808–812, Aug. 1989.
[9] V. Dimitrov, K. Wahid, and G. Jullien, “Multiplication-free
8×8 2D DCT architecture using algebraic integer encod-
ing,” IEE Electronics Letters, vol. 40, no. 20, pp. 1310–1311,
2004.
[10] S. Mohammadi and A. Javadi, “An efficient technique for
error-free implementation of H.264 using algebraic integer
encoding,” in International Conference on Signal Acquisi-
tion and Processing, pp. 145–150, Feb. 2010.
[11] A. Pradini, T. M. Roffi, R. Dirza, and T. Adiono, “VLSI
design of a high-throughput discrete cosine transform for
image compression systems,” in 2011 International Con-
ference on Electrical Engineering and Informatics (ICEEI),
pp. 1–6, July 2011.
[12] V. Dimitrov and K. Wahid, “On the error-free computation
of fast cosine transform,” International Journal: Informa-
tion Theories and Applications, vol. 12, no. 4, pp. 321–327,
2005.
[13] A. Madanayake, R. J. Cintra, D. Onen, V. S. Dimitrov,
N. Rajapaksha, L. T. Bruton, and A. Edirisuriya, “A row-
parallel 8×8 2-D DCT architecture using algebraic integer-
based exact computation,” IEEE Transactions on Circuits
and Systems for Video Technology, vol. 22, pp. 915–929,
June 2012.
[14] G. H. Hardy and E. M. Wright, An Introduction to the The-
ory of Numbers. London: Oxford University Press, 4 ed.,
1975.
[15] A. Edirisuriya, A. Madanayake, V. S. Dimitrov, R. J. Cin-
tra, and J. Adikari, “VLSI architecture for 8-point AI-based
Arai DCT having low area-time complexity and power at
improved accuracy,” J. Low Power Electron., vol. 2, no. 2,
pp. 127–142, 2012.
[16] C. D. Thompson, “Area-time complexity for VLSI,” in Pro-
ceedings of the eleventh annual ACM symposium on Theory
of computing, pp. 81–88, ACM, 1979.
[17] S. Nandi, K. Rajan, and P. Biswas, “Hardware implemen-
tation of 4×4 DCT quantization block using multiplication
and error-free algorithm,” in Proceedings of the 2009 IEEE
TENCON Region 10, pp. 1–5, 2009.
[18] M. Fu, G. A. Jullien, V. S. Dimitrov, and M. Ahmadi, “A
low-power DCT IP core based on 2D algebraic integer en-
coding,” in Proceedings of the 2004 International Sympo-
sium on Circuits and Systems, 2004 (ISCAS ’04), vol. 2,
pp. 765–768, May 2004.
[19] K. A. Wahid, M. Martuza, M. Das, and C. McCrosky, “Effi-
cient hardware implementation of 8×8 integer cosine trans-
forms for multiple video codecs,” Journal of Real-Time Pro-
cessing, pp. 1–8, July 2011.
