A Novel 2D Filter Design Methodology For Heterogeneous Devices
Christos-Savvas Bouganis, George A. Constantinides and Peter Y. K. Cheung
Department of Electrical and Electronic Engineering
Imperial College London
London, U.K.
Email: christos-savvas.bouganis@imperial.ac.uk
Abstract
In many image processing applications, fast convolution
of an image with a large 2D filter is required. Field-Programmable
Gate Arrays (FPGAs) are often used to achieve
this goal due to their fine-grain parallelism and reconfigurability.
However, the heterogeneous nature of modern reconfigurable
devices is not usually considered during design
optimization. This paper proposes an algorithm that
explores the implementation architecture of 2D filters, targeting
the minimization of the required area, by optimizing
the usage of the different components in a heterogeneous
device. Experiments show that the proposed algorithm
reduces the required area by 34% to 70% compared
to current techniques.
1 Introduction
In recent years, many image processing applications
have appeared in the literature that require the use of large
2D ﬁlters. Moderate size examples can be found in face de-
tection/recognition applications [4] where kernels with size
of 23×23 pixels are used, and some more extreme examples
can be found in medical imaging where applications require
kernels with size of up to 63×63 pixels. At the same time,
real-time implementation is often required, making the use
of hardware acceleration a necessity [1].
FPGAs are often used to achieve this goal due to their
ﬁne grain parallelism and reconﬁgurability. Modern FPGAs
are heterogeneous devices, often targeting the DSP com-
munity, and thus providing a mixture of resources that can
be used by DSP applications. The two main silicon cores
that are usually included in the recent devices are embed-
ded RAMs [15] and embedded multipliers [5]. The ﬁrst
one provides fast localized memory access, while the sec-
ond one provides high speed accurate multiplication.
Current techniques for 2D ﬁlter optimization for a mod-
ern reconﬁgurable device, such as word-length optimization
[3] and Singular Value Decomposition [12], do not take into
account the heterogeneity of the device. However, research
concerning the exploitation of heterogeneity for a particular
application has recently started to appear in the literature.
In [9], the authors propose an approach for exchanging em-
bedded RAMs for multipliers, whereas in [14] the author
proposes an algorithm that identiﬁes part of the circuit that
can be implemented in embedded RAMs.
In line with this direction, the proposed algorithm de-
parts from the current methods of 2D ﬁlter implementation
by providing an approach that makes explicit use of the het-
erogeneity of the device, targeting designs that use less area.
Furthermore, it provides a framework which allows the de-
signers to move their designs to different points in the three
dimensional design space of embedded RAMs, embedded
multipliers, and 4-input look-up tables (4-LUTs), keeping
the arithmetic error in the ﬁlter implementation below a
speciﬁed level.
The novel contributions of this paper are:
• To use Singular Value Decomposition to approximate
a 2D ﬁlter with a number of 2×1D convolutions and
one low complexity 2D convolution. This reduces the
number of high precision multipliers required for im-
plementation.
• To develop a resource allocation algorithm that maps
the decomposed ﬁlter design onto a given set of hetero-
geneous resources on an FPGA, including dedicated
multipliers, LUTs and embedded RAM in order to
minimize resource usage. The resulting method shows
a signiﬁcant reduction in used resources.
The paper is organized as follows. Section 2 describes
the current related work regarding ﬁlter designs for hetero-
geneous devices. High level and detailed descriptions of the
proposed algorithm are given in Section 3. Section 4 is fo-
cused on the cost models for the ﬁlter components and Sec-
tion 5 illustrates the speed and the convergence properties of
the algorithm. Finally, Section 6 focuses on the evaluation
of the algorithm, and Section 7 concludes the paper.
Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’05) 
0-7695-2445-1/05 $20.00 © 2005 IEEE
Figure 1. Detailed diagram of a constant coefficient
multiplier using canonic signed digit recoding (input x,
output y, shifts <<5 and <<2). Coefficient 29 is recoded
to [1 0 0 -1 0 1] using CSD, which leads to a reduction
in the required adders.
2 Related work
The paper focuses on the case where designs with high
throughput are required, but design latency is of secondary
importance. For this reason, only pipelined techniques for
implementation of a 2D convolution ﬁlter are considered.
A common technique for implementation of such a ﬁlter
on an FPGA is to use constant coefﬁcient multipliers and a
number of embedded RAMs. The RAMs are used to buffer
the incoming data in order to create the necessary 2D win-
dow on which the mask is applied. The constant coefﬁcient
multipliers are often implemented as shift/add combinations
using 4-LUTs and, to further optimize the design, the co-
efﬁcients are sometimes transformed using canonic signed
digit recoding, which reduces the required logic [6].
Canonic signed digit recoding represents the coefficients
in such a way that high-speed, low-area multiplication
can be achieved. It is a radix-2 signed digit system using
the digit set {1, 0, −1}. Given a number, its canonic signed
digit representation has two important properties: (a) the
number of non-zero digits is minimal, and (b) no two non-zero
digits are adjacent. Due to the first property, canonic signed
digit representation is widely used for implementing constant
coefficient multipliers [11]. Figure 1 illustrates an example of
CSD recoding for a coefficient with the value 29. The usual
binary quantization (11101) needs three adders, whereas under
CSD recoding only two are required.
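The recoding of the coefficient 29 can be reproduced with a short sketch. The function names `to_csd` and `adders_needed` are illustrative and do not come from the paper; the adder count assumes a plain shift/add multiplier that needs one adder fewer than the number of non-zero digits.

```python
def to_csd(value: int) -> list[int]:
    """Return the canonic signed digit form of a non-negative integer,
    most significant digit first, using digits from {1, 0, -1}."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = []  # built least significant digit first
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives digit +1, value mod 4 == 3 gives digit -1
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits[::-1] or [0]


def adders_needed(digits: list[int]) -> int:
    """Assumed cost model: one adder fewer than the number of non-zero digits."""
    return max(sum(1 for d in digits if d != 0) - 1, 0)


csd_29 = to_csd(29)                            # 32 - 4 + 1
binary_29 = [int(b) for b in bin(29)[2:]]      # 11101
```

Under these assumptions, `adders_needed(binary_29)` and `adders_needed(csd_29)` give the two adder counts quoted in the text.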
Another technique exploits the potential separability of
a 2D ﬁlter into sets of two 1D ﬁlters by using the Singular
Value Decomposition (SVD) [12] to express the original ﬁl-
ter as a linear combination of separable ﬁlters. Using this
technique, the initial ﬁlter can be implemented as a set of
1D ﬁlters where half of them are applied to the rows of the
image, and the other half to the columns. By decomposing
the 2D ﬁlter, the number of necessary multiplications may
be reduced. For a 2D filter with size m × m, the number
of multiplications in the original form is m². By applying
the Singular Value Decomposition algorithm, each stage of
the decomposition requires 2m multiplications. However,
the number of levels of decomposition that are required
depends on the separability properties of the ﬁlter and the
arithmetic accuracy for storing intermediate results. More-
over, the decomposition of the ﬁlter into separable masks is
independent of the quantization process of the coefﬁcients.
This technique can be combined with algorithms that mini-
mize the area cost of a ﬁlter by representing the coefﬁcients
using an appropriate number of bits such that the ﬁnal error
at the output of the ﬁlter is bounded by a user deﬁned value
such as in [3]. Current algorithms can be classified in three
categories: those that use an analytic approach to scaling
and error estimation [10], those that use simulation [7], and
those that use both techniques [2].
In this paper, we propose a novel algorithm to optimize
a 2D convolution ﬁlter implementation in a heterogeneous
device, given a set of constraints regarding the number of
embedded multipliers and 4-LUTs. The algorithm estimates
an approximation of the original 2D ﬁlter which minimizes
the mean square error and at the same time meets the user’s
constraints on resource usage. The proposed method alters
the structure of the original ﬁlter in order to ﬁnd a structure
that can be mapped in a more efﬁcient way to the targeted
device. It explores the design space at a higher level than
word-length optimization methods, which do not consider
altering the computational structure of the filter. The
proposed technique is thus complementary to these previous
approaches.
3 Algorithm description
The basic idea of the algorithm is to explore the redundancy
in the filter's impulse response using Singular Value
Decomposition. In this way, the coefficients are ordered
according to their impact on the overall error, and residual
errors are propagated to the next level of decomposition, and
so on.
The proposed algorithm takes as input the impulse response
of a 2D filter F with size m × m and a set of constraints
for the available embedded multipliers and slices¹.
It produces as output an approximation of the filter which
minimizes the mean square error and at the same time meets
the set of resource constraints. It should be noted that the
assumption of a square 2D filter is used for reasons of clarity;
the algorithm works for any 2D filter of any size.
¹ A slice is a term used by Xilinx, one of the two major FPGA
manufacturers, to denote a combination of two 4-LUTs together
with additional circuitry to support efficient arithmetic.
The main idea behind the algorithm is to explore the separability
of the input 2D filter. A 2D filter is called separable
if its impulse response $f(n_1, n_2)$ is a separable sequence, so
that $f(n_1, n_2) = f_1(n_1) f_2(n_2)$. The important property is that
a convolution with a separable filter can be decomposed as
follows:
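\[
\begin{aligned}
y(n_1, n_2) &= \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} x(n_1 - i, n_2 - j)\, f_1(i)\, f_2(j) \\
            &= \sum_{i=-\infty}^{\infty} f_1(i) \sum_{j=-\infty}^{\infty} x(n_1 - i, n_2 - j)\, f_2(j)
\end{aligned}
\]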
where x and y denote the input and output images respec-
tively.
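The identity above can be checked numerically. The following is a minimal sketch that assumes NumPy and finite-length sequences; the helper names `conv2d_full` and `conv_separable` are illustrative and not part of the paper.

```python
import numpy as np


def conv2d_full(image, kernel):
    """Direct 'full' 2D convolution: y(n1, n2) = sum_ij x(n1 - i, n2 - j) f(i, j)."""
    rows, cols = image.shape
    k_rows, k_cols = kernel.shape
    out = np.zeros((rows + k_rows - 1, cols + k_cols - 1))
    for i in range(k_rows):
        for j in range(k_cols):
            out[i:i + rows, j:j + cols] += kernel[i, j] * image
    return out


def conv_separable(image, f1, f2):
    """Inner sum first: convolve along n2 with f2, then along n1 with f1."""
    along_n2 = np.apply_along_axis(np.convolve, 1, image, f2)
    return np.apply_along_axis(np.convolve, 0, along_n2, f1)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 9))
f1 = rng.standard_normal(5)
f2 = rng.standard_normal(5)

y_direct = conv2d_full(x, np.outer(f1, f2))   # f(n1, n2) = f1(n1) f2(n2)
y_separable = conv_separable(x, f1, f2)
max_difference = np.max(np.abs(y_direct - y_separable))
```

The direct form uses 25 multiplications per output sample for this 5 × 5 mask, while the separable form uses 10, which is the saving the decomposition relies on.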
The algorithm decomposes the input filter into a set of N
separable filters ($\mathbf{A}_i$) and a non-separable one ($\mathbf{E}$):
\[
\mathbf{F} = \sum_{i=1}^{N} \mathbf{A}_i + \mathbf{E} \tag{1}
\]
At the same time, the algorithm quantizes the coefﬁcients of
the separable and non-separable ﬁlters using different pre-
cision levels, in order to achieve a more area efﬁcient ﬁl-
ter implementation, while meeting the resource constraints.
These two processes are not independent since the error in
the approximation of one level of the decomposition affects
the remaining levels of decomposition. The algorithm cal-
culates the approximation error at each decomposition level
and propagates it to the remaining decomposition levels, in
order to produce a better approximation of the original ﬁlter.
The addition of the non-separable filter E is essential in
the case where the filter under consideration has poor
separability properties. It considerably reduces the number of
decomposition levels required, permitting the algorithm
to achieve the required filter approximation within only a
few decomposition levels. The cost of implementing
the non-separable mask is kept low, since the mask encodes
the residual errors of the approximation by dedicating only
a few bits to the representation of the coefficients, as
described below.
To illustrate the potential for resource saving of this approach,
consider an m × m 2D filter with 16-bit coefficients.
A parallel implementation requires m² 16 × 16 multipliers,
if the data width is 16 bits. Assuming that this is approximated
with three levels of separable filters, only 6m high
precision multipliers are required. Although the implementation
of E still requires m² multiplications, the coefficients
have a small number of bits, suitable to be mapped to a
LUT-based multiplier implementation.
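As a worked instance of this count, the sketch below evaluates both expressions for the 23 × 23 kernel size mentioned in the introduction. The function name is illustrative, and the choice of m = 23 is only an example.

```python
def high_precision_multipliers(m: int, levels: int) -> tuple[int, int]:
    """Return (direct, decomposed) counts of high precision multipliers
    for an m x m filter: m^2 versus 2m per separable level."""
    return m * m, 2 * m * levels


direct, decomposed = high_precision_multipliers(m=23, levels=3)
```

For this size the direct form needs 529 high precision multipliers and the three-level decomposition needs 138, not counting the low precision multiplications of E.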
The separability exploration of the ﬁlter is performed us-
ing the Singular Value Decomposition algorithm [12] which
decomposes a matrix into a linear combination of the fewest
possible separable matrices. By applying the Singular Value
Decomposition algorithm, the initial ﬁlter F is decomposed
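into a set of separable filters $\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_L$:
\[
\mathbf{F} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{V}^T
           = \sum_{i=1}^{L} \lambda_i \mathbf{u}_i \mathbf{v}_i^T
           = \sum_{i=1}^{L} \mathbf{A}_i \tag{2}
\]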
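where $\mathbf{U}$ and $\mathbf{V}^T$ are orthogonal matrices and $\boldsymbol{\Lambda}$ is a
diagonal matrix containing the singular values $\lambda_i$ of $\mathbf{F}$,
which this paper refers to as eigenvalues. The eigenvalues
are sorted in descending order, thus $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_L$.
$\mathbf{u}_i$ and $\mathbf{v}_i$ correspond to the ith columns of the $\mathbf{U}$ and $\mathbf{V}$
matrices respectively.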
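A minimal NumPy sketch of (2) follows. The random 7 × 7 matrix stands in for a filter impulse response and is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.standard_normal((7, 7))      # stand-in for a filter impulse response

U, lam, Vt = np.linalg.svd(F)        # lam is returned in descending order

# A_i = lambda_i * u_i * v_i^T: one separable (rank-1) mask per level
A = [lam[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(lam))]

reconstruction_error = np.max(np.abs(F - sum(A)))
ranks = [int(np.linalg.matrix_rank(A_i)) for A_i in A]
```

Each `A[i]` has rank one, so it can be applied as two 1D filters, and the masks sum back to `F` up to floating-point rounding.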
Without taking into account the effect of coefficient
quantization on the filter approximation, the mean
squared error that is achieved by including the first D
decomposition levels is given by:
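\[
\begin{aligned}
\mathrm{Err}_D &= \frac{1}{m^2} \left\| \mathbf{F} - \hat{\mathbf{F}} \right\|^2 \\
&= \frac{1}{m^2} \left\| \sum_{i=D+1}^{L} \lambda_i \mathbf{u}_i \mathbf{v}_i^T \right\|^2 \\
&= \frac{1}{m^2} \operatorname{tr}\left\{ \left( \sum_{i=D+1}^{L} \lambda_i \mathbf{u}_i \mathbf{v}_i^T \right) \left( \sum_{i=D+1}^{L} \lambda_i \mathbf{u}_i \mathbf{v}_i^T \right)^T \right\} \\
&= \frac{1}{m^2} \sum_{i=D+1}^{L} \lambda_i^2
\end{aligned} \tag{3}
\]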
where || · || denotes the Frobenius norm, tr{·} denotes the
trace of a matrix and λi denotes the eigenvalue that cor-
responds to the ith decomposition level. This provides a
lower bound for the mean square error in the approximation
when only the ﬁrst D levels of the decomposition are con-
sidered without taking into account the quantization error of
the coefﬁcients.
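Equation (3) can be verified numerically by comparing the Frobenius-norm form with the closed form. This is a sketch assuming NumPy and an unquantized random filter; `truncation_mse` is an illustrative name.

```python
import numpy as np


def truncation_mse(F, D):
    """Mean squared error of keeping the first D SVD levels, computed two ways."""
    m_squared = F.size
    U, lam, Vt = np.linalg.svd(F)
    F_hat = (U[:, :D] * lam[:D]) @ Vt[:D, :]
    direct = np.sum((F - F_hat) ** 2) / m_squared    # (1/m^2) ||F - F_hat||^2
    closed_form = np.sum(lam[D:] ** 2) / m_squared   # (1/m^2) sum_{i>D} lambda_i^2
    return direct, closed_form


rng = np.random.default_rng(2)
F = rng.standard_normal((9, 9))
errors = [truncation_mse(F, D) for D in range(0, 10)]
```

The two values in each pair agree up to rounding, and the error is non-increasing in D, which is why quantization effects are the only thing that can push the achieved error above this bound.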
Given an input image $\mathbf{I}$ and a 2D filter $\mathbf{F}$, the resulting
image of the convolution is given by $\mathbf{Y} = \mathbf{I} * \mathbf{F}$, where $*$
denotes convolution. Using the filter decomposition in (1)
and (2), the resulting image is given by:
\[
\begin{aligned}
\mathbf{Y} &= \mathbf{I} * \left( \sum_{i=1}^{N} \mathbf{A}_i + \mathbf{E} \right) \\
&= \mathbf{I} * \sum_{i=1}^{N} \mathbf{A}_i + \mathbf{I} * \mathbf{E} \\
&= \sum_{i=1}^{N} \mathbf{I} * \mathbf{A}_i + \mathbf{I} * \mathbf{E} \\
&= \sum_{i=1}^{N} \mathbf{I} * \left( \lambda_i \mathbf{u}_i \mathbf{v}_i^T \right) + \mathbf{I} * \mathbf{E} \\
&= \sum_{i=1}^{N} \lambda_i \left( \mathbf{I} * \mathbf{u}_i \right) * \mathbf{v}_i^T + \mathbf{I} * \mathbf{E}
\end{aligned} \tag{4}
\]
Figure 2. Diagram of the decomposition with N = 2.
(Diagram labels: Rasterline In, Line Buffer, 1D FIR (u),
1D FIR (v), 2D FIR, branches A1, A2 and E, Rasterline Out.)
Figure 3. Detailed diagram of a decomposition stage.
The necessary registers at each stage are omitted for
reasons of clarity. (Diagram labels: Line Buffer,
1D FIR (vertical), 1D FIR (horizontal), coefficients
u1, u2, ..., um and v1, v2, ..., vm.)
This equation illustrates that the ﬁnal result can be ex-
pressed as the addition of two different types of convolution.
The ﬁrst one concerns a convolution with separable ﬁlters
and the second one a conventional 2D convolution. The former
type of convolution is performed along the columns of the
image first, followed by a convolution along the rows of the
image. A diagram of the decomposition is illustrated in Figure 2,
where the number of decomposition levels is N = 2.
Figure 3 illustrates the inner structure of each separable de-
composition stage. The line buffers are implemented using
the embedded RAM blocks. The available embedded multi-
pliers are used for the realization of the multiplications with
the eigenvalues λi, and the remaining multipliers are used
to realize appropriate constant coefﬁcient multiplications in
the separable ﬁlters. Finally, ﬁlter E is implemented us-
ing only LUT-based multipliers since it contains coefﬁcients
with low complexity.
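The structure of Figure 2 and equation (4) can be sketched in software as follows. This is an unquantized floating-point illustration assuming NumPy; it says nothing about the hardware mapping, and `conv2d_full` is an illustrative helper.

```python
import numpy as np


def conv2d_full(image, kernel):
    """Direct 'full' 2D convolution of an image with a small mask."""
    rows, cols = image.shape
    k_rows, k_cols = kernel.shape
    out = np.zeros((rows + k_rows - 1, cols + k_cols - 1))
    for i in range(k_rows):
        for j in range(k_cols):
            out[i:i + rows, j:j + cols] += kernel[i, j] * image
    return out


rng = np.random.default_rng(3)
image = rng.standard_normal((10, 12))
F = rng.standard_normal((5, 5))
N = 2                                                 # separable levels, as in Figure 2

U, lam, Vt = np.linalg.svd(F)
E = F - sum(lam[i] * np.outer(U[:, i], Vt[i, :]) for i in range(N))  # residual mask

Y = conv2d_full(image, E)                             # conventional 2D convolution
for i in range(N):
    columns = conv2d_full(image, U[:, i:i + 1])       # m x 1 mask: along the columns
    Y += lam[i] * conv2d_full(columns, Vt[i:i + 1, :])  # 1 x m mask: along the rows

max_difference = np.max(np.abs(Y - conv2d_full(image, F)))
```

Without quantization the sum of the branches matches the direct convolution with `F` up to rounding; the algorithm described next controls how far quantization is allowed to move the result away from this.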
The impact of each separable mask on the approximation
of the original filter is determined by the corresponding
eigenvalue. Thus, the coefficients in the first levels of
decomposition, which correspond to masks with large
eigenvalues, should be approximated more accurately than the
coefﬁcients that correspond to the masks of the ﬁnal levels.
The algorithm achieves this by allocating the available em-
bedded multipliers in the ﬁrst stages of the decomposition in
order to achieve high accuracy in the coefﬁcient approxima-
tion. For these levels, an extra multiplier for the eigenvalue
multiplication (see (2)) is not required since it can be ac-
commodated in the corresponding coefﬁcients. Moreover,
the algorithm allocates an embedded multiplier for the ﬁnal
addition of the partial results that correspond to the differ-
ent levels of decomposition. This is performed by multiply-
ing each partial result with the eigenvalue that corresponds
to the particular decomposition level. The rest of the co-
efﬁcients are mapped to multipliers realized using slices,
using the appropriate precision level for each one. The non-
separable component E corresponds to the decomposition
stages that are not taken into account and is given by:
\[
\mathbf{E} = \sum_{i=N+1}^{L} \mathbf{A}_i \tag{5}
\]
More specifically, for a decomposition of an m × m
2D filter into N decomposition levels, the algorithm
implements the first $K = \lfloor (M - N) / (2m) \rfloor$ decomposition levels using
embedded multipliers, where M is the number of available
embedded multipliers and N is the number of decomposition
levels. The remaining N − K stages are implemented
using a combination of embedded multipliers and slices.
The non-separable filter component, E, is always implemented
using only slices since it consists of low complexity
coefficients. Thus, the final decomposition of the filter can
be written as (6).
\[
\mathbf{F} = \sum_{i=1}^{K} \mathbf{A}_i + \sum_{i=K+1}^{N} \mathbf{A}_i + \mathbf{E} \tag{6}
\]
For example, consider a filter F with size 23 × 23, and assume
that 100 multipliers are available and N = 3. The algorithm
determines K = 2 and decomposes F as follows.
It implements the masks A1 and A2 using only embedded
multipliers. As these are separable masks, they need only
2 × 23 multipliers per level of decomposition, for a total of
92. These are implemented using the embedded multipliers,
while mask A3 is implemented using a combination
of the remaining 8 multipliers and slices. Finally, the
non-separable component E is implemented using only slices.
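The arithmetic of this example can be reproduced directly from the expression for K. The function name `split_levels` is illustrative; the sketch applies the formula as stated and does not add any clamping of K that the paper does not mention.

```python
def split_levels(M: int, N: int, m: int) -> tuple[int, int, int]:
    """Return (K, multipliers used by the first K levels, multipliers left over)."""
    K = (M - N) // (2 * m)      # K = floor((M - N) / (2m))
    used = K * 2 * m            # each separable level needs 2m multipliers
    return K, used, M - used


K, used, left_over = split_levels(M=100, N=3, m=23)
```

For M = 100, N = 3 and m = 23 this gives K = 2, with 92 multipliers used by A1 and A2 and 8 left over, matching the numbers in the example.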
In the above top-level description of the algorithm, the
effect of the coefficient quantization process on the final
approximation of the filter is not considered. Below, a
detailed description of the proposed algorithm is given that
addresses this effect. The algorithm can be divided into three
stages: the multiplier allocation stage, the decomposition
stage and the refinement stage. An overview of the
algorithm is given in Figure 4. Each stage is described in detail
below.
3.1 Multiplier allocation stage
In this stage, the algorithm determines the number of lev-
els of decomposition K that can be implemented using only
embedded multipliers. The algorithm decomposes the in-
put ﬁlter F using the SVD algorithm into a set of separable
masks as:
\[
\mathbf{F} = \sum_{i=1}^{N} \lambda_i \mathbf{u}_i \mathbf{v}_i^T \tag{7}
\]
The algorithm reserves the appropriate embedded multipliers
for the multiplications by λi and implements the first K
masks using the remaining embedded multipliers. This is
desirable since the coefficients that correspond to the initial
decomposition levels have a larger impact on the overall filter
approximation than the coefficients that correspond to later
decomposition levels. Also, the number of coefficients subject to
word-length optimization is reduced, leading to a faster
execution time of the algorithm. In the quantization of the
coefficients, the full precision that is provided
by the embedded multipliers of the available device can be
used.
In order to take into account the error inserted due to
quantization, the algorithm updates the input ﬁlter F after
each level of the decomposition as:
\[
\mathbf{F} \leftarrow \mathbf{F} - \mathbf{u}_q \mathbf{v}_q^T \tag{8}
\]
where $\mathbf{u}_q$ and $\mathbf{v}_q$ correspond to the quantized vectors of
coefficients $\sqrt{\lambda_i}\,\mathbf{u}_i$ and $\sqrt{\lambda_i}\,\mathbf{v}_i$ respectively. The SVD
decomposition is repeated on the updated filter F until the first K
masks are estimated. In this way, the decomposition of each
filter stage is optimized, given the quantization of the
coefficients for the previous levels. Thus, any error that has
been injected by the quantization process is addressed
through the rest of the decomposition levels. The algorithm
then proceeds to the decomposition stage.
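A minimal sketch of this stage is given below, assuming NumPy. The `quantize` helper is a hypothetical uniform fixed-point rounder with 9 fractional bits, chosen only for illustration; the paper does not specify the quantizer at this level of detail.

```python
import numpy as np


def quantize(values, frac_bits):
    """Hypothetical fixed-point quantizer: round to a multiple of 2**-frac_bits."""
    scale = 2.0 ** frac_bits
    return np.round(values * scale) / scale


def multiplier_allocation_stage(F, K, frac_bits=9):
    """Peel off K quantized separable masks, re-running the SVD on the residual."""
    residual = F.astype(float).copy()
    masks = []
    for _ in range(K):
        U, lam, Vt = np.linalg.svd(residual)
        u_q = quantize(np.sqrt(lam[0]) * U[:, 0], frac_bits)
        v_q = quantize(np.sqrt(lam[0]) * Vt[0, :], frac_bits)
        A_hat = np.outer(u_q, v_q)
        masks.append(A_hat)
        residual = residual - A_hat      # F <- F - u_q v_q^T, as in (8)
    return masks, residual


rng = np.random.default_rng(4)
F = rng.standard_normal((7, 7)) / 7.0
masks, residual = multiplier_allocation_stage(F, K=2)
```

Because the SVD is recomputed on the residual, the quantization error of one level becomes part of the input to the next level instead of being left uncorrected.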
3.2 Decomposition stage
In the decomposition stage the algorithm further decom-
poses the remaining input ﬁlter into a set of separable masks
and a non-separable mask using the SVD algorithm. The
rest of the available embedded multipliers are assigned to
some of the coefﬁcients, as described below, while the re-
maining coefﬁcients are represented using only one non-
zero signed digit, allowing the associated multiplications to
be implemented cost free in bit parallel hardware.
The algorithm ﬁrst allocates the available embedded
multipliers to the coefﬁcients of the vectors ui and vi by
taking into account the coefﬁcients from all the remaining
levels of the decomposition. The algorithm explores further
the decomposition by taking into account the error that is in-
serted to the ﬁnal ﬁlter approximation, by the quantization
of all the coefﬁcients and not only due to the coefﬁcients
of the current decomposition level. The actual allocation is
performed only for selected coefﬁcients of the ﬁrst level of
the new decomposition. This is done because the approxi-
mation of the new decomposition level determines the coef-
ﬁcients of the remaining levels. The rest of the coefﬁcients
for that level are quantized as before. Due to the fact that
the coefﬁcients are quantized, the eigenvalue of that level of
decomposition is then re-evaluated to correct for the quanti-
zation effects. The new λi is calculated using the following
system of linear equations [13]:
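\[
\begin{aligned}
\mathbf{F} \mathbf{v}_i &= \lambda_i \mathbf{u}_i \\
\mathbf{F}^T \mathbf{u}_i &= \lambda_i \mathbf{v}_i
\end{aligned} \tag{9}
\]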
where ui and vi have been quantized.
The ﬁlter F is updated as follows:
\[
\mathbf{F} \leftarrow \mathbf{F} - \lambda_i \mathbf{u}_i \mathbf{v}_i^T \tag{10}
\]
which is similar to (8) but has been adapted in order to ac-
commodate the λi parameter, and the process is repeated
for the remaining separable decomposition levels.
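With quantized vectors, the system (9) is over-determined in the single unknown λi. One possible reading, which is an assumption of this sketch and not stated in the paper, is to solve it in the least-squares sense. The quantizer (rounding to multiples of 1/16) is likewise only illustrative.

```python
import numpy as np


def reestimate_lambda(F, u_q, v_q):
    """Least-squares solution of F v = lambda u, F^T u = lambda v for lambda."""
    numerator = u_q @ (F @ v_q) + v_q @ (F.T @ u_q)
    return numerator / (u_q @ u_q + v_q @ v_q)


rng = np.random.default_rng(5)
F = rng.standard_normal((6, 6))
U, lam, Vt = np.linalg.svd(F)

# With the exact singular vectors the estimate recovers lambda_1
lambda_exact = reestimate_lambda(F, U[:, 0], Vt[0, :])

# With coarsely quantized vectors the estimate absorbs part of the error
u_q = np.round(U[:, 0] * 16) / 16
v_q = np.round(Vt[0, :] * 16) / 16
lambda_q = reestimate_lambda(F, u_q, v_q)

F_updated = F - lambda_q * np.outer(u_q, v_q)    # update (10)
```

Other estimators are possible, for example the value of λ that minimizes the Frobenius norm of the updated filter directly; the source does not say which one is used.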
The ﬁnal level, which is the non-separable mask E, is
formed by the resulting ﬁlter F where all the coefﬁcients
are quantized as before. This mask is actually the error term
of the initial input ﬁlter F and its approximation using a set
of separable masks and ﬁxed-point arithmetic.
3.3 Reﬁnement stage
In the reﬁnement stage, the algorithm assigns extra bits
to the coefﬁcients that have the largest error due to the quan-
tization process, in order to minimize the error of the ap-
proximation. The quantization is performed using canonic
signed digit representation [6]. The addition of extra bits is
constrained by the number of available slices.
In the case where the algorithm chooses to update a coefficient
that belongs to stage J of the decomposition, a
new eigenvalue $\lambda_J$ is estimated according to (9) using the
values already estimated for the rest of the coefficients, and
the whole algorithm is repeated starting from the decomposition
stage for the next level of decomposition. This is
required because the slice allocation of the later stages of
decomposition for minimizing the error in the approximation
depends on the coefficients of the previous levels. The
algorithm terminates when the number of used slices exceeds
the number of available slices. The final approximation to
F is given by:
\[
\mathbf{F} \approx \sum_{i=1}^{N} \hat{\mathbf{A}}_i + \hat{\mathbf{E}} \tag{11}
\]
where $\hat{\mathbf{A}}_i$ and $\hat{\mathbf{E}}$ are the quantized $\mathbf{A}_i$ and $\mathbf{E}$ respectively.
4 Cost model
The presented algorithm explores the cost of a 2D ﬁl-
ter design in the three dimensional space of embedded
RAMs, embedded multipliers, and slices, while simulta-
neously minimizing the error in the approximation of the
original ﬁlter. The cost model for the embedded multipliers
and embedded RAMs is straightforward. For the number
of slices for the required multiplications and the adder-trees
that are required by the design, an upper bound estimate
is derived from the number of non-zero bits in the canonic
signed digit encoding. This provides a fast and reliable
estimate of the required number of slices. Moreover, in the
cost model of the adder trees, the worst case scenario has
been taken into account keeping all the available precision
from the results of the multipliers.
5 Properties of the algorithm
The execution time of the proposed algorithm depends
on the separability properties of the ﬁlter, the size of the
2D ﬁlter, and also on the number and type of the available
resources. The number of available embedded multipliers
reduces the number of the coefﬁcients under optimization.
The main property of the algorithm that could lead to po-
tentially long execution times is its ability to backtrack to a
previous decomposition stage and discard any optimization
that was performed in the latter stages. This is a necessity, if
the optimum area usage for a given approximation error is
required, since the error in the approximation can be propa-
gated to the remaining decomposition levels and minimized.
In the case where the filter is highly separable, only a few
decomposition levels need to be explored, leading to
short execution times. In the worst case, the execution time
in our experiments was up to two hours for a 21 × 21 ﬁlter
and N = 3 on a Pentium 4 at 3.2GHz.
Due to its iterative nature, when the algorithm backtracks
to a previous decomposition level discarding any optimiza-
tion of the remaining stages, the error in the approximation
jumps to a higher value before it is further minimized. For
this reason, the algorithm keeps track of the best decom-
position during its execution presenting it at the end to the
user. If all the initial decomposition levels have been ap-
proximated without any error, these are not altered again by
the algorithm. Thus, the algorithm will always converge to
a solution and terminate. The number of required iterations
depends on the number of decomposition stages N, the size
of the mask m and the number of available slices S.
Algorithm: Optimized m × m 2D filter design for N levels
of decomposition, using M embedded multipliers and S
slices
  Set $\mathbf{F}_{original} \leftarrow \mathbf{F}$
  Calculate levels with only multipliers $K = \lfloor (M - N) / (2m) \rfloor$
  Multiplier allocation stage
    FOR j = 1 : K
      Using SVD estimate: $\mathbf{F} = \sum_i \lambda_i \mathbf{u}_i \mathbf{v}_i^T$
      Quantize $\mathbf{u}_q = \sqrt{\lambda_1}\,\mathbf{u}_1$ and $\mathbf{v}_q^T = \sqrt{\lambda_1}\,\mathbf{v}_1^T$
      Set $\hat{\mathbf{A}}_j \leftarrow \mathbf{u}_q \mathbf{v}_q^T$
      Update F as: $\mathbf{F} \leftarrow \mathbf{F} - \hat{\mathbf{A}}_j$
    END
    Set ds = K + 1
  Decomposition stage
    FOR j = ds : N
      Using SVD estimate: $\mathbf{F} = \sum_i \lambda_i \mathbf{u}_i \mathbf{v}_i^T$
      Determine coeff. to allocate embedded muls
      Quantize the coeff. of $\mathbf{u}_q \leftarrow \mathbf{u}_1$ and $\mathbf{v}_q \leftarrow \mathbf{v}_1$
      Estimate and quantize new $\lambda_q \leftarrow \lambda_1$ using (9)
      Set $\hat{\mathbf{A}}_j \leftarrow \lambda_q \mathbf{u}_q \mathbf{v}_q^T$
      Update F as: $\mathbf{F} \leftarrow \mathbf{F} - \hat{\mathbf{A}}_j$
    END
    Quantize the coefficients of F and set $\hat{\mathbf{E}} \leftarrow \mathbf{F}$
    IF slice constraints S are violated THEN EXIT
  Refinement stage
    Find the coefficient c that is not allocated to an
      embedded multiplier and has the largest
      approximation error
    Assign extra bit for its representation
    Let c belong to the Jth level of the decomposition
    IF J == N THEN update $\hat{\mathbf{E}}$
    ELSE update $\hat{\mathbf{A}}_J$
    IF slice constraints S are violated THEN EXIT
    IF J == N THEN
      GOTO Refinement stage
    ELSE
      Estimate $\mathbf{F} \leftarrow \mathbf{F}_{original} - \sum_{i=1}^{J} \hat{\mathbf{A}}_i$
      Set ds = J + 1
      GOTO Decomposition stage
Figure 4. Outline of the algorithm
6 Performance Evaluation
For the evaluation of the proposed algorithm, we focus
on the ability of the algorithm to ﬁnd a design that ﬁts in
a given area of the reconﬁgurable device. Such devices are
manufactured by regular repetition of a silicon tile, thus the
relation between the number of slices, the number of em-
bedded RAMs and embedded multipliers gives an indica-
tion for the relative number of resources that are found in the
device in an area of a certain size. The device that is used
for the evaluation of the algorithm is the XC2V8000 high-
end FPGA from Xilinx. It contains 168 embedded 18× 18
multipliers, 168 embedded RAMs and 46,592 slices. Since
we are interested in ﬁnding a design that ﬁts in a given area
of the device, the max operator is used to map the cost from
the three dimensional space onto one dimension, i.e.
\[
\text{Total cost} = \max\left( \frac{\#\text{RAMs}}{168},\ \frac{\#\text{MULs}}{168},\ \frac{\#\text{slices}}{46592} \right)
\]
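The cost metric is simple enough to state as a function. In the sketch below the device figures are the ones quoted above for the XC2V8000; the example resource counts are made up for illustration and do not come from the paper's experiments.

```python
XC2V8000 = {"rams": 168, "muls": 168, "slices": 46592}  # figures quoted in the text


def total_cost(rams: int, muls: int, slices: int, device=XC2V8000) -> float:
    """Fraction of the device area needed: the scarcest resource dominates."""
    return max(rams / device["rams"],
               muls / device["muls"],
               slices / device["slices"])


# Illustrative design: 12.5% of the RAMs, 50% of the multipliers, 25% of the slices
example = total_cost(rams=21, muls=84, slices=11648)
```

Here the multipliers are the limiting resource, so the design is charged half of the device even though most RAMs and slices in that area would remain unused. This is what gives the algorithm room to trade one resource type for another.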
For comparison, we test the proposed algorithm against
a direct implementation of the 2D ﬁlters using constant co-
efﬁcient multipliers. The comparative design is also opti-
mized by using canonic signed digit representation for the
coefﬁcients. It should be noted that in the experiments, the
coefﬁcients that are dedicated to embedded multipliers are
quantized to 10 bits using two’s complement encoding.
6.1 Filter approximation for generated ﬁlters
The performance of the proposed algorithm depends on
the separability properties of the ﬁlter under investigation.
Thus, two 2D ﬁlters of size 21×21 pixels with different sep-
arability properties are considered in order to assess the per-
formance of the algorithm. The two masks are illustrated in
Figures 5 and 6. Figure 7 illustrates the separability prop-
erties of the two ﬁlters by plotting their ﬁrst ﬁve eigenvalues
as a function of the level of decomposition. According to the
figure, MaskA is less separable than MaskB. Figure 8 shows
the total cost as a percentage of the area of the device, using
the proposed method and the direct implementation using
canonic signed digit representation, versus the mean square
error that can be achieved for the filter approximation. It
can be concluded that the proposed algorithm reduces the
total cost by between 34% and 70% (mean 55%). The pro-
posed algorithm has less effect on MaskA than on MaskB
since the former is less separable. However, an improve-
ment between 34% and 55% (mean 47%) is still obtained.
Also, for the MaskB curve, clear jumps can be seen in the
mean square error wherever the algorithm has inserted an
extra decomposition level. This is because its eigenvalues
are reduced substantially for up to the third level of the
decomposition, whereas in the case of MaskA they do not
decrease as much after the first level.
Figure 5. Impulse response of MaskA
Figure 6. Impulse response of MaskB
Figure 7. Plot of the eigenvalues of the masks
A and B. (Axes: decomposition level, eigenvalue.
Curves: Mask A, Mask B.)
Figure 8. Area usage of the design versus
achieved mean square error for a 21×21 mask.
(Axes: mean square error, area usage of the device (%).
Curves: Mask A (CSD), Mask A (Proposed method),
Mask B (CSD), Mask B (Proposed method).)
6.2 Filter approximation for a real application
The proposed algorithm was also applied to optimize ﬁl-
ters that are used in a real application [1]. The authors use a
set of 15× 15 bandpass ﬁlters to decompose an image into
scale and orientation selective channels for scene analysis.
The size of the proposed filters and the limited resources of
the available target device led the authors to truncate the
filters to a smaller size and to process each frame three times
in order to calculate all the filter responses. Figure 9 illustrates
one filter of the set. Applying the proposed algorithm
to this ﬁlter results in a reduction of between 43% and 76%
(mean 59%) in the required area. We also mapped the same
ﬁlter under the constraint that only two multipliers are avail-
able. This forces the algorithm to use only slices except for
the eigenvalue scaling. Figure 10 illustrates the percent-
age of the device that is required for the implementation of
the ﬁlter versus the mean square error of the approximation,
under the two different scenarios. From the graph it can be
concluded that the extra multipliers that are used by the al-
gorithm inﬂuence the results in the case where a low mean
square error has to be achieved. Moreover, both proposed
designs out-perform the canonic signed digit methodology
that is currently used. The average gain in the design when
only two multipliers are used is 55%, which is very close
to the case where the algorithm uses all the available re-
sources (59%). This demonstrates the ability of the algo-
rithm to efﬁciently implement a 2D ﬁlter convolution in the
absence of available multipliers by allocating appropriately
the available slices.
6.3 Word-length optimization performance
An experiment is performed to assess the performance of
the algorithm regarding the optimum allocation of the avail-
able slices of the device. The idea is to investigate the ability
of the algorithm to reconstruct a ﬁlter design given the ﬁlter
impulse response and implementation cost. This provides a
measure of the algorithm’s ability to converge to a known
feasible design. A set of 200 2D filters of size 7 × 7 is
randomly generated with the following properties:
• The filters are designed using the SVD algorithm.
• The first two levels are kept separable, whereas the rest
form a 2D filter (E).
• The eigenvalues of the two separable levels and the
coefficients are quantized using canonic signed digit
recoding with a randomly selected number of non-zero
bits between one and three.
• The E part of the filter is quantized using canonic
signed digit recoding with a randomly selected number
of non-zero bits between one and three.
Figure 9. Impulse response of MaskC
Figure 10. Area usage of the design versus achieved mean square error for a 15×15 mask (x-axis: mean square error, logarithmic scale; y-axis: area usage of the device (%); curves: Mask C (CSD), Mask C (Proposed method), Mask C (Only slices))
Figure 11. Histogram of the mean square error in the approximation of 200 separable 7x7 masks (x-axis: mean square error, ×10^-5; y-axis: percentage)
For each 2D ﬁlter, the associated cost in multipliers and
slices is calculated. The impulse responses of the 2D ﬁlters
are given as input to the algorithm in order to calculate an
approximation of the ﬁlters having as constraints their ac-
tual implementation cost. Moreover, the algorithm is forced
to decompose the ﬁlters into the same number of separable
masks, i.e. two.
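The quantization step used to generate the test filters can be illustrated with a minimal sketch. It assumes that a coefficient is first rounded to a fixed number of fractional bits, recoded into canonic signed digit form, and then truncated to its most significant non-zero digits; the paper does not spell out this procedure, and the function names and the `frac_bits` parameter are hypothetical. Truncation is a simple choice and is not guaranteed to give the closest value with that number of digits.

```python
def csd_digits(n):
    """Return the canonic signed digit (non-adjacent form) digits of the
    integer n, least significant first; each digit is -1, 0 or +1."""
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)  # +1 when n = 1 (mod 4), -1 when n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits


def quantize_csd(x, frac_bits, max_nonzero):
    """Round x to frac_bits fractional bits, then keep only the
    max_nonzero most significant non-zero CSD digits (assumed scheme)."""
    digits = csd_digits(round(x * (1 << frac_bits)))
    kept = 0
    value = 0
    for pos in range(len(digits) - 1, -1, -1):  # most significant digit first
        if digits[pos] != 0 and kept < max_nonzero:
            value += digits[pos] * (1 << pos)
            kept += 1
    return value / (1 << frac_bits)
```

For example, `quantize_csd(0.4, 8, 1)` gives 0.5 and `quantize_csd(0.4, 8, 3)` gives 0.40625, showing how the number of non-zero digits trades accuracy against the adders needed to realise the coefficient.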
The above design of 2D ﬁlters allows us to explore the
ability of the algorithm to allocate the available slices for the
quantization of the coefﬁcients in an efﬁcient way regard-
less of the separability properties of the ﬁlter under investi-
gation. Figure 11 shows the histogram of the mean square
error of the approximation for the case of 7 × 7 filters. The
graph shows that the majority of the filters are approximated
with a mean square error of around 4 × 10^-7. A limited number
of filters are approximated with less accuracy, with a maximum
approximation error of 2.1 × 10^-5. It should be
noted that the quantization applied in the original design
of the 2D filters alters the decomposition of the filter.
This further limits the ability of the algorithm to recover the
original filter.
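The error measure behind these histograms can be sketched as follows, assuming the approximation is the sum of the separable levels (each an eigenvalue times an outer product of two 1D filters) plus the non-separable part E, and that the mean square error is averaged over all coefficients of the impulse response. The function names and the data layout (nested lists, `(eigenvalue, u, v)` tuples) are hypothetical.

```python
def reconstruct(levels, residual):
    """Rebuild a 2D impulse response from separable levels plus residual E.

    levels   -- list of (eigenvalue, u, v) tuples; each contributes
                eigenvalue * u[i] * v[j] to coefficient (i, j)
    residual -- square 2D list holding the non-separable part E
    """
    n = len(residual)
    out = [row[:] for row in residual]  # copy so E is not modified
    for lam, u, v in levels:
        for i in range(n):
            for j in range(n):
                out[i][j] += lam * u[i] * v[j]
    return out


def mean_square_error(a, b):
    """Mean of the squared coefficient differences between two 2D filters."""
    n_coeffs = sum(len(row) for row in a)
    total = sum((x - y) ** 2
                for row_a, row_b in zip(a, b)
                for x, y in zip(row_a, row_b))
    return total / n_coeffs
```

Evaluating `mean_square_error(original, reconstruct(levels, residual))` for each of the 200 filters would give the values that such a histogram summarises.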
The same experiment is performed for 2D ﬁlters of size
15× 15. The histogram of the mean square error in the ap-
proximation is illustrated in Figure 12. It can be concluded
that the algorithm approximates the 15 × 15 ﬁlters better
than the 7× 7 ﬁlters.
Figure 12. Histogram of the mean square error in the approximation of 200 separable 15x15 masks (x-axis: mean square error, ×10^-5; y-axis: percentage)
It should be mentioned that in the worst case, where the
separability property of the mask is poor, the algorithm will
perform as well as the current techniques.
7 Conclusion
This paper presents a novel 2D ﬁlter design methodol-
ogy for heterogeneous devices. The main point of depar-
ture from the current algorithms for efﬁcient 2D ﬁlter im-
plementation is that it explores the computational structure
of the filter according to the different types of available resources
in the device. Experiments with filters of size up
to 21×21 and various separability properties have been performed.
The results indicate a reduction in the total cost of
between 34% and 70% (mean 55%). Future work
will involve the use of word-length optimization techniques
[3] and the use of current techniques for high-speed multi-
plication [8] to further enhance the performance of the algo-
rithm. Moreover, due to the mean square error criterion that
is used, the proposed method is more appropriate for chan-
nel equalization applications. However, a different criterion
like a min-max approximation of a ﬁlter frequency response
can be accommodated by slightly modifying the proposed
algorithm.
Acknowledgement
This work was funded by the UK Research Council un-
der the Basic Technology Research Programme “Reverse
Engineering Human Visual Processes” GR/R87642/02.
References
[1] C.-S. Bouganis, P. Y. K. Cheung, J. Ng, and A. A. Bharath.
A Steerable Complex Wavelet Construction and its Imple-
mentation on FPGA. Field-Programmable Logic and its ap-
plications, September 2004.
[2] R. Cmar, L. Rijnders, P. Schaumont, S. Vernalde, and
I. Bolsens. A methodology and design environment for DSP
ASIC fixed point refinement. Design, Automation, and Test in
Europe, Munich, Germany, pages 271–276, 1999.
[3] G. A. Constantinides, P. Y. K. Cheung, and W. Luk.
Wordlength Optimization for Linear Digital Signal Process-
ing. IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, 22(10), October 2003.
[4] S. Gong, S. McKenna, and A. Psarrou. Dynamic Vision:
From Images to Face Recognition. Imperial College Press,
1st edition, 2000.
[5] S. Haynes and P. Y. K. Cheung. Configurable multiplier
blocks for embedding in FPGAs. Electronics Letters,
34(7):638–639, 1998.
[6] I. Koren. Computer Arithmetic Algorithms. New Jersey:
Prentice-Hall Inc., 2nd edition, 2002.
[7] K.-I. Kum and W. Sung. Combined word-length optimiza-
tion and high-level synthesis of digital signal processing sys-
tems. IEEE Transactions on Computer-Aided Design of In-
tegrated Circuits and Systems, 20(8):921–930, August 2001.
[8] M. Martinez-Peiro, E. I. Boemo, and L. Wanhammar. Design
of high-speed multiplierless filters using a nonrecursive
signed common subexpression algorithm. IEEE Transactions
on Circuits and Systems-II: Analog and Digital Signal
Processing, 49(3):196–203, March 2002.
[9] G. Morris, G. A. Constantinides, and P. Y. K. Cheung. Mi-
grating Functionality from ROMs to Embedded Multipli-
ers. IEEE International Symposium on Field-Programmable
Custom Computing Machines, 2004.
[10] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee. Precision
and error analysis of MATLAB applications during automated
hardware synthesis for FPGAs. Design, Automation,
and Test in Europe, Munich, Germany, pages 722–728,
2001.
[11] I.-C. Park and H.-J. Kang. Digital ﬁlter synthesis based on
minimal signed digit representation. Annual ACM IEEE De-
sign Automation Conference, pages 468–473, 2001.
[12] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Nu-
merical Recipes in C. Cambridge University Press, 1992.
[13] G. Strang. Introduction to Linear Algebra. Wellesley-
Cambridge Press, 3rd edition, 1998.
[14] S. Wilton. SMAP: Heterogeneous Technology Mapping
for Area Reduction in FPGAs with Embedded Memory
Arrays. ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, February 1998.
[15] S. Wilton, J. Rose, and Z. Vranesic. The Memory/Logic
Interface in FPGA’s with Large Embedded Memory Arrays.
IEEE Transactions on Very-Large Scale Integration Systems,
7(1), March 1999.
