Synthesis and Optimization of 2D Filter Designs for Heterogeneous FPGAs by Bouganis, C et al.
Synthesis and Optimization of 2D Filter Designs for
Heterogeneous FPGAs
Christos-S. Bouganis, Sung-Boem Park, George A. Constantinides, and Peter Y.K. Che-
ung
Department of Electrical & Electronic Engineering,
Imperial College London,
London, SW7 2AZ,
U.K.
Many image processing applications require fast convolution of an image with one or more 2D
ﬁlters. Field - Programmable Gate Arrays (FPGAs) are often used to achieve this goal due to
their ﬁne grain parallelism and reconﬁgurability. However, the heterogeneous nature of modern
reconﬁgurable devices is not usually considered during design optimization. This paper proposes
an algorithm that explores the space of possible implementation architectures of 2D ﬁlters, target-
ing the minimization of the required area, by optimizing the usage of the diﬀerent components in a
heterogeneous device. This is achieved by exploring the heterogeneous nature of modern reconﬁg-
urable devices using a Singular Value Decomposition based algorithm, which provides an eﬃcient
mapping of ﬁlter’s implementation requirements to the heterogeneous components of modern FP-
GAs. In the case of multiple 2D ﬁlters, the proposed algorithm also exploits any redundancy that
exists within each ﬁlter and between diﬀerent ﬁlters in the set, leading to designs with minimized
area. Experiments with real ﬁlter sets from computer vision applications demonstrate an average
of up to 38% reduction in the required area.
Categories and Subject Descriptors: ... [...]: ...
General Terms: ...
Additional Key Words and Phrases: FPGA, 2D ﬁlter design, Singular Value Decomposition,
reconﬁgurable logic.
1. INTRODUCTION
One of the fundamental operators in image processing is the two dimensional convo-
lution. Examples can be found in edge detection and smoothing operations, where
the image is convolved with a speciﬁc 2D ﬁlter, or kernel, in order to produce the
desired result. Early ﬁlter sizes tended to be relatively small, e.g. 3× 3 pixels. In
recent years, many image processing applications have appeared in the literature
that require the use of much larger 2D ﬁlters. Moderate size examples can be found
in face detection/recognition applications where kernels with size of 23× 23 pixels
are used [Belhumeur et al. 1997], and some more extreme examples can be found
...
Permission to make digital/hard copy of all or part of this material without fee for personal
or classroom use provided that the copies are not made or distributed for proﬁt or commercial
advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and
notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish,
to post on servers, or to redistribute to lists requires prior speciﬁc permission and/or a fee.
c© 2008 ACM 1529-3785/2008/0700-0001 $5.00
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008, Pages 1–0??.
2 · ...
in medical imaging where applications require kernels with size of up to 63 × 63
pixels. Moreover, image processing algorithms usually require the convolution of
the input image with a set of 2D ﬁlters rather than with a single ﬁlter, which
increases the computational load [Bouganis et al. 2004]. At the same time, real-
time implementation is often required, making the use of hardware acceleration a
necessity [Bouganis et al. 2004].
FPGAs are often used to achieve this goal due to their ﬁne grain parallelism and
reconﬁgurability. Modern FPGAs are heterogeneous devices, often targeting the
DSP community, and thus providing a mixture of resources that can be used in
DSP applications. The two main coarse grain cores that are usually included in
the recent devices are embedded RAMs and embedded multipliers [Xilinx ; Altera
]. The former provides fast localized memory access, while the latter provides
high-speed accurate multiplication.
In line with this direction, the proposed algorithm departs from the current meth-
ods of 2D ﬁlter implementation by providing an approach that makes explicit use
of the heterogeneity of the device, targeting the reduction of design area. Further-
more, it provides a framework which allows the designers to move their designs to
diﬀerent points in the two dimensional design space of embedded multipliers and
4-input look-up tables (4-LUTs), keeping the variance of the error at the output
of the system below a speciﬁed level. In our previous work [Bouganis et al. 2005a;
2005b] we addressed the problem of allocating in an eﬃcient way the diﬀerent re-
sources of a modern reconﬁgurable device for implementation in an FPGA of a 2D
convolution with a single ﬁlter. In [Bouganis et al. 2005], we proposed an algorithm
that extends our work to target the implementation of a set of 2D ﬁlters, by pro-
viding an approach that exploits any redundancy that exists within each ﬁlter and
between diﬀerent ﬁlters in the set.
This paper presents in more depth the main ideas of the proposed algorithms
[Bouganis et al. 2005a; 2005b; Bouganis et al. 2005], provides a more detailed
cost model of the main components of the design, and presents a more eﬃcient
strategy in terms of computational complexity for searching the design space than
in [Bouganis et al. 2005a; 2005b; Bouganis et al. 2005]. Moreover, results are
presented for fully placed and routed FPGA designs regarding their area, speed,
and power consumption. In summary, the novel contributions of this paper are:
—The use of Singular Value Decomposition to approximate a 2D ﬁlter with a
number of separable 2D ﬁlters and one low complexity non-separable 2D ﬁlter.
This provides an eﬃcient mapping from the 2D ﬁlter design requirements to the
heterogeneous parts of modern reconﬁgurable devices. Furthermore, it reduces
the number of high precision multipliers required for implementation.
—An extension of the Singular Value Decomposition to approximate a set of 2D
ﬁlters with a number of separable 2D ﬁlters and a set of low complexity non-
separable 2D ﬁlters. Moreover, the proposed technique exploits any existing
redundancy within each ﬁlter and between diﬀerent ﬁlters of the input set, leading
to designs with minimized resource requirements.
—A resource allocation algorithm that maps the decomposed ﬁlter design onto a
given set of heterogeneous resources on an FPGA, including embedded multipliers
and LUTs, in order to minimize resource usage. The proposed method shows a
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 3
signiﬁcant reduction in resource usage, targeting various user requirements.
The paper is organized as follows. Section 2 describes the current related work
regarding implementation of 2D ﬁlter designs in hardware, and more speciﬁcally in
an FPGA. Section 3 describes and formulates the objective of the paper. High level
and detailed descriptions of the proposed algorithm are given in Section 4. Section
5 focuses on the evaluation of the algorithm, and Section 6 concludes the paper.
2. RELATED WORK
The paper focuses on the typical DSP case where designs with high throughput
are required, but design latency is of secondary importance. For this reason, only
pipelined techniques for the implementation of a 2D convolution ﬁlter are consid-
ered.
There is a large number of publications regarding digital ﬁlter implementations
taking into account the eﬀects of coeﬃcient quantization on the ﬁlter’s response.
In [Siohan 1990], the author presents statistical bounds for the degradation of the
frequency response due to coeﬃcient quantization in the case of direct-form realiza-
tion of 2D FIR ﬁlters with quadrantal or octagonal symmetry. In [Haseyama and
Matsuura 2006], the authors consider the problem of achieving the most accurate
approximation of the ﬁlter’s frequency response when the coeﬃcients are repre-
sented using ﬂoating-point precision. A ﬁxed number of bits are allocated to the
mantissa and exponent of the coeﬃcients. The proposed algorithm uses a genetic
algorithm and simulating annealing to calculate the coeﬃcients and it was reported
that the proposed method produces better approximation of the ﬁlter’s response
than simulating annealing-based only methods.
Moreover, some work has been concentrated on the implementation of multipli-
erless FIR digital ﬁlters where the coeﬃcients can be expressed as sums of signed
power-of-two (SPT) terms. It should be noted that a mixed-integer linear pro-
gramming formulation (MILP) can lead to the optimum solution, but its compu-
tational requirements prohibit this method for high order ﬁlters [Chen et al. 1995].
Polynomial-time algorithms for designing digital ﬁlters with power-of-two coeﬃ-
cients can be found in [Chen et al. 1995; Li et al. 2002]. However, the above works
consider the problem of digital ﬁlter implementation in general and not the prob-
lem of mapping a digital ﬁlter on an FPGA device taking into account the device’s
speciﬁc structure.
Current design methods for implementing a 2D ﬁlter in an FPGA have two main
stages. In the ﬁrst stage, the ﬁlter’s coeﬃcients are approximated using ﬁnite word-
length precision through optimization of an objective function. Depending on the
application, the objective function can be the mean square error or the maximum
absolute error, and can be applied in the spatial or frequency domain [Kodek 1980].
In the second stage, the constant coeﬃcient multiplications are often implemented
as shift/add combinations using 4-LUTs and, to further optimize the design, the
coeﬃcients are sometimes transformed using canonic signed digit recoding, which
reduces the required logic [Koren 2002].
Canonic signed digit recoding represents the coeﬃcients in a way such that high-
speed low-area multiplication can be achieved. It is a radix-2 signed digit system
using the set {1, 0,−1}. Given a number, its canonic signed digit representation
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
4 · ...
<< 2
x << 5
29x
Fig. 1. Detailed diagram of a constant coeﬃcient multiplier using canonic signed digit recoding.
Coeﬃcient 29 is recoding to [1 0 0 -1 0 1] using CSD, which leads to a reduction in the required
adders.
<< 4
<< 2
21x
x
<< 4
20x 21x 20x
x
<< 4<< 2<< 2
Fig. 2. Common subexpression elimination. The two products 20x and 21x can share a common
subexpression which leads to a reduction in the required adders.
has two important properties: (a) the number of non-zero digits is minimum, and
(b) no two non-zero digits are adjacent. Due to the ﬁrst property, canonic signed
representation is widely used for implementing constant coeﬃcient multipliers [Park
and Kang 2001; Samueli 1989]. Figure 1 illustrates an example of CSD, in the case
where the coeﬃcient has the value 29. The usual implementation needs three
adders, where under CSD recoding only two are required.
Techniques to further reduce the required resources when multiplying by several
coeﬃcients can also be found in the literature [Dempster and Macleod 1995; Pasko
et al. 1999; Yurdakul 2005]. This is achieved by using common partial results, which
eﬀectively reduces the required area for the multiplications. Figure 2 illustrates an
example of using common partial results for multiplication with coeﬃcients having
values 20 and 21.
Another technique [Andrews 1999] exploits the potential separability of a 2D
ﬁlter into sets of two 1D ﬁlters by using the Singular Value Decomposition (SVD)
[Press et al. 1992] to express the original ﬁlter as a linear combination of separable
ﬁlters. This technique can be lossless, where the original ﬁlter is decomposed in a
linear combination of separable ﬁlters without introducing any error, or it can be
lossy, where a certain error in the original ﬁlter approximation is allowed. Using this
technique, the initial ﬁlter can be implemented as a set of 1D ﬁlters where half of
them are applied to the rows of the input image, and the other half to the columns
of the resulting image from the ﬁrst convolution. By decomposing the 2D ﬁlter, the
number of necessary multiplications may be reduced. For a 2D ﬁlter with size m×n,
the number of multiplications in the original form is mn. By applying the Singular
Value Decomposition algorithm, each stage of the decomposition requires m + n
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 5
multiplications. However, the number of levels of decomposition that are required
depends on the separability properties of the ﬁlter and the arithmetic accuracy
for storing intermediate results. Moreover, the decomposition of the ﬁlter into
separable ﬁlters is considered to be, in this work, independent to the quantization
process of the coeﬃcients.
The above techniques can be combined with algorithms that minimize the area
cost of a ﬁlter by representing the result of a multiplication/addition using an ap-
propriate number of bits such that the ﬁnal error at the output of the ﬁlter is
bounded by a user deﬁned value such as in [Constantinides et al. 2003]. Current
algorithms can be classiﬁed in three categories. Those that use an analytic ap-
proach to scaling and error estimation [Constantinides et al. 2003], those that use
simulation [Kum and Sung 2001] and ﬁnally those that use both techniques [Cmar
et al. 1999].
In this paper, we propose a novel algorithm to optimize a 2D convolution ﬁlter
implementation in a heterogeneous device, given a set of constraints regarding the
number of embedded multipliers and 4-LUTs. The algorithm is based on a lossy
synthesis framework, where an approximation of the original 2D ﬁlter is targeted
which minimizes the error at the output of the system and at the same time meets
the user’s constraints on resource usage. Moreover, we extend our framework to
applications that require convolutions with a set of 2D ﬁlters rather than with a
single 2D ﬁlter, by exploiting any redundancy that exists within each ﬁlter and
between diﬀerent ﬁlters in the set. The proposed method alters the structure of
the original ﬁlter(s) in order to ﬁnd a structure that can be mapped in a more
eﬃcient way to the targeted device. The exploration of the design space is per-
formed at a higher level than the word-length optimization methods or methods
that use common subexpressions to reduce the area, since they do not consider
altering the computational structure of the ﬁlter. The proposed technique is thus
complementary to these previous approaches.
3. OBJECTIVE
Under the lossy synthesis framework, we extend the current two dimensional design
space of area-speed to include a third dimension which is the accuracy of the ﬁnal
result [Constantinides et al. 2001]. By allowing the user to specify the acceptable
deviation from the exact result, we are able to explore the design space more ef-
fectively and provide designs that directly target the user’s requirements. In the
present work, we are interested in ﬁnding an algorithm that produces a hardware-
friendly design for a 2D convolution realization, achieving a minimum error at its
output given a bound on the available area. In the speciﬁc problem, the metric
that is used to describe the accuracy of the result is the variance of the noise at the
output of the system. This describes the uncertainty of the result at the output
and has been adopted by many researchers from the Digital Signal Processing ﬁeld
[Constantinides et al. 2003]. The rest of the section focuses on the 1D case for
simplicity, however, the same results are valid in the 2D case too.
From [Mitra 2002] the variance of a signal at the output of a linear time invariant
(LTI) system, and in our speciﬁc case of a 2D convolution, when the input signal
is a white random process is given by (1), where σ2y is the variance of the signal at
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
6 · ...
h[n]
x[n] y[n]
h[n]
x[n] y[n]
e[n]
h[n]
Fig. 3. The top graph shows the original system, where the second graph shows the approximated
system and its decomposition to the original impulse response and to the error impulse response.
the output of the system, σ2x is the variance of the signal at the input, and h[n] is
the impulse response of the system.
σ2y = σ
2
x
∞∑
n=−∞
|h[n]|2 (1)
Under the proposed framework, the impulse response of the new system hˆ[n] can
be expressed as the sum of the impulse response of the original system h[n] and an
error impulse response e[n] as in (2).
hˆ[n] = h[n] + e[n] (2)
The new system can be decomposed into two parts as shown in Figure 3. The ﬁrst
part, has the original impulse response h[n], where the second part has the error im-
pulse response e[n]. Thus, the variance of the noise at the output of the system due
to the approximation of the original impulse response is given by (3), where SSE
denotes the sum of square errors in the ﬁlter’s impulse response approximation.
σ2noise = σ
2
x
∞∑
n=−∞
|e[n]|2 = σ2x · SSE (3)
In the case where the system exhibits N outputs, the objective is a function
f(·) of the output noise variances (4), where σ2noise,i denotes the variance of the ith
output. In this work we have used the average function but other functions like the
max(·) function can be accommodated by the system.
σ2system = f(σ
2
noise,1, σ
2
noise,2, . . . , σ
2
noise,N ) (4)
From (3) and (4) it can be concluded that the uncertainty at the output of the
system is proportional to the sum of square error of the impulse response approxi-
mation, which can be used as a measure to assess the system’s accuracy.
4. ALGORITHM DESCRIPTION
4.1 Single 2D filter
The main idea behind the proposed method is to decompose a given 2D ﬁlter into
a set of separable 2D ﬁlters, and to one non-separable 2D ﬁlter which encodes the
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 7
trailing error of the decomposition. The reason for that is made clear below.
A 2D ﬁlter is called separable if its impulse response h[n1, n2] is a separable
sequence, i.e.
h[n1, n2] = h1[n1]h2[n2].
The important property is that a convolution with a separable ﬁlter can be decom-
posed into two one-dimensional convolutions as shown below,
y[n1, n2] =
∞∑
i=−∞
∞∑
j=−∞
x[n1 − i, n2 − j]h1[i]h2[j]
=
∞∑
i=−∞
h1[i]
∞∑
j=−∞
x[n1 − i, n2 − j]h2[j]
where x and y denote the input and output images respectively.
The separable ﬁlters can potentially reduce the number of required multiplica-
tions from m×n to m+n for each ﬁlter with size m×n pixels. The non-separable
parts encode the trailing error of the approximation and still require m×n multipli-
cations. However, the coeﬃcients are intended to need fewer bits for representation
and therefore their multiplications are of low complexity. The decomposition that
we are seeking should provide a ranking for the coeﬃcients according to their impact
in the overall ﬁlter approximation error. This ordering would enable the algorithm
to assign the available resources in an optimum way. Moreover, the decomposition
should remove any redundancy that exists within the 2D ﬁlter.
In more detail, given a 2D ﬁlter F, the algorithm seeks a decomposition of
the form given in (5), where Aˆ is a linear combination of the separable ﬁlters
A1,A2, . . . ,Ar, and E is a non-separable 2D ﬁlter.
F = Aˆ + E (5)
Since the Aj matrices are separable, they can be decomposed into a product of
vectors as Aj = ujvTj , where uj and vj are vectors. Thus, the approximation of a
ﬁlter with r separable levels of decomposition is as in (6).
F =
r∑
j=1
λjAj + E =
r∑
j=1
λjujvTj + E (6)
The non-separable term E can be considered as the error term of the approximation
of the ﬁlter from the separable set. This term is essential for achieving an adequate
approximation of the ﬁlter using a relative small number of separable ﬁlters.
Given an input image X and a 2D ﬁlter F, the resulting image of the convolution
is given by Y = XF, where  denotes convolution. Using the ﬁlter decomposition
in (6), the resulting image is given by:
Y = X
⎛
⎝ r∑
j=1
λjAj + E
⎞
⎠ = r∑
j=1
λj(X uj) vTj + XE (7)
A top-level diagram of such design is illustrated in Figure 4. The ﬁgure shows a
design containing two decomposition levels, r = 2. A detailed view of one of the
separable stages is illustrated in Figure 5.
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
8 · ...
1D FIR (u)
adder
tree
1D FIR (v)
Error
term
adder
tree
1D FIR (u)
adder
tree
1D FIR (v)
adder
tree
adder
tree
adder
tree
X
Y = F 
Line
Buffers
1O
2O
X
Fig. 4. Top level diagram of the proposed decomposition for the case of two separable stages.
The diagram illustrates the major components of the decomposition and their connections. A
raster-scan format is assumed for the input image.
u1
u2
um
v1 v2 vm
1D FIR (vertical) 1D FIR (horizontal)
Input
Line Buffer
Line Buffer
: register
O
: multiplier: adder
Fig. 5. Detailed diagram of a decomposition stage.
For a given 2D ﬁlter there is an inﬁnite number of possible decompositions de-
scribed by (6). We are seeking the decomposition that has the minimum number of
separable parts r for a given mean square error approximation, which corresponds
to minimizing the number of required multipliers. Furthermore, the initial levels
of the decomposition should have more impact on the mean square error approxi-
mation of the ﬁlters than the later ones, a property that provides a ranking for the
coeﬃcients and is exploited by the proposed algorithm.
The decomposition that we are seeking is given by the Singular Value Decomposi-
tion algorithm, which decomposes a matrix into a linear combination of the fewest
possible separable matrices [Press et al. 1992]. By applying the Singular Value
Decomposition algorithm, the initial ﬁlter F is decomposed into a set of separable
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 9
ﬁlters A1,A2, . . . ,AL:
F = UΛVT
=
L∑
i=1
λiuiviT
=
L∑
i=1
Ai (8)
where U and VT are orthogonal matrices and Λ is a diagonal matrix containing the
eigenvalues λi. The eigenvalues are sorted in descending order, thus λ1 ≥ λ2 ≥ ... ≥
λL. ui and vi correspond to the ith columns of the U and V matrices respectively.
Without taking into account the quantization eﬀects of the coeﬃcients into the
ﬁlter approximation, the mean square error that is achieved by including the ﬁrst
D decomposition levels is given by:
ErrD =
1
m2
||F− Fˆ||2
=
1
m2
∣∣∣∣∣
∣∣∣∣∣
L∑
i=D+1
λiuiviT
∣∣∣∣∣
∣∣∣∣∣
2
=
1
m2
tr
⎧⎨
⎩
(
L∑
i=D+1
λiuiviT
)(
L∑
i=D+1
λiuiviT
)T⎫⎬
⎭
=
1
m2
L∑
i=D+1
λi
2 (9)
where || · || denotes the Frobenius norm, tr{·} denotes the trace of a matrix and
λi denotes the eigenvalue that corresponds to the ith decomposition level. This
provides a lower bound for the mean square error in the approximation when only
the ﬁrst D levels of the decomposition are considered without taking into account
the quantization error of the coeﬃcients.
4.2 Multiple 2D filters
The algorithm can be extended to the case where a set of 2D ﬁlters need to be
convolved with an input image. In this case, the main idea is to decompose a given
set of 2D ﬁlters into one set of ﬁlters, which are common for all the given ﬁlters,
and into a second set of ﬁlters which are speciﬁc for each ﬁlter. The common set
of ﬁlters should capture the common properties of the given set, while the second
set should capture their individual properties. Similar to the single ﬁlter case, out
of all possible decompositions we are interested in the case where the common set
of ﬁlters consists of separable ﬁlters.
Given a set of 2D ﬁlters F1,F2, . . . ,Fk, the algorithm seeks a decomposition of
the form given in (10), where Aˆi is a linear combination of the separable ﬁlters
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
10 · ...
1D FIR (u)
adder
tree
1D FIR (v) adder
tree
Error 
term 1
adder
tree
1D FIR (u)
adder
tree
1D FIR (v)
adder
tree
adder
tree
adder
tree
adder
tree
adder
tree
adder
tree
Error 
term 2
Error 
term 3
X
Y1 = F1 X
Y2 = F2 X
Y3 = F3 X
12O
13O
11O
21O
22O
23O
Fig. 6. Top level diagram of the proposed decomposition for the case of three ﬁlters using two
separable stages. The diagram illustrates the major components of the decomposition and their
connections. The necessary line buﬀers are omitted for reasons of clarity.
A1,A2, . . . ,Ar, and Ei is a non-separable 2D ﬁlter.
Fi = Aˆi + Ei (10)
Similar to (6), the approximation of each ﬁlter with r separable levels of decompo-
sition is given in (11), where the resulting images of the convolutions with an input
image X are given by (12).
Fi =
r∑
j=1
λijAj + Ei =
r∑
j=1
λijujvTj + Ei (11)
Yi = X
⎛
⎝ r∑
j=1
λijAj + Ei
⎞
⎠ = r∑
j=1
λij(X uj) vTj + XEi (12)
A top-level diagram of such design is illustrated in Figure 6. The ﬁgure shows
a design regarding three ﬁlters, k = 3, and two decomposition levels, r = 2. The
necessary line buﬀers are omitted for reasons of clarity. It should be noted that the
introduction of a common set of ﬁlters does not aﬀect the throughput of the design.
The output of each ﬁlter in the original set is produced by applying a ﬁlter speciﬁc
linear transformation to the outputs of the common set of ﬁlters and adding to that
the output of a ﬁlter which is speciﬁc for each ﬁlter in the original set (Ei).
Similar to the single ﬁlter case, there is an inﬁnite number of possible decompo-
sitions described by (10). In our case, we seek an extension of the SVD algorithm
to many ﬁlters.
4.3 Rank-1 Decomposition algorithm
Under our framework, the decomposition problem of multiple 2D ﬁlters can be
formulated as follows. Given a set of m × n matrices F1,F2, . . . ,Fk, ﬁnd the
minimum set of rank-1 matrices A1,A2, . . . ,Ar, such that linear combinations of
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 11
them can span the matrices F1,F2, . . . ,Fk i.e.
Fi =
r∑
j=1
λijAj . (13)
A matrix A is a rank-1 matrix, separable, if it can be decomposed as the product
of two vectors, i.e. A = uvT . The deﬁnition of the problem is the extension of
a Singular Value Decomposition for one matrix to a collection of matrices. The
special case where k = 2 is well researched [Ja’Ja 1978]. The general case where
k > 2 has been shown to be an NP-complete problem [Hastad 1990]. The proposed
framework is based on the decomposition proposed in [Shashua and Levin 2001].
It is an iterative algorithm which is based on the Singular Value Decomposition
algorithm. It is designed such that for the case of a single matrix (k = 1) the same
decomposition as the Singular Value Decomposition algorithm is produced. This
provides a uniﬁed framework for the case of single 2D ﬁlter and the case of multiple
2D ﬁlters.
The rank-1 decomposition algorithm is given in Figure 7. It should be noted that
the rank-1 decomposition algorithm is a greedy algorithm since previous selections
of the rank-1 elements are not re-evaluated. Thus, the process is not guaranteed
to converge to a global minimum. However, the convergence of the algorithm to a
local minimum is guaranteed. Following [Shashua and Levin 2001], let Rri be the
residual of the ith matrix after approximating the input set using r rank-1 matrices.
It can be proven that (14) holds, where M = min(m, k), N = min(mk, n), and ||.||F
denotes the Frobenius norm. k denotes the number of ﬁlters Fi in the input set,
and m and n denote the number of rows and columns of the ﬁlters respectively. It
should be noted that this provides only an upper bound on the convergence rate of
the algorithm. Experiments have shown that the convergence rate is much faster.
For more details and proof of convergence the reader is referred to [Shashua and
Levin 2001].
k∑
i=1
||Rri ||2F ≤
(
1− 1
NM
)r k∑
i=1
||Fi||2F (14)
4.4 Detailed Description of the Algorithm
This section gives a detailed description of the proposed algorithm under the uniﬁed
framework. Given a set of m × n ﬁlters F1,F2, . . . ,Fk, and a set of constraints
regarding the available number of embedded multipliers and slices, the algorithm
ﬁnds a design that minimizes the variance at the output of the system and meets
the constraints on resource requirements. The algorithm can be divided into two
main stages: the slice allocation stage and the multiplier allocation stage.
4.4.1 Slice allocation stage. In the slice allocation stage, the algorithm decom-
poses the input ﬁlters using the rank-1 decomposition algorithm and implements
all the constant coeﬃcient multiplications using only slices. In the case of inﬁnite
precision, the matrices in (11) can be deﬁned by one call of the rank-1 decom-
position algorithm. However, due to the coeﬃcient quantization in a hardware
implementation, quantization error is inserted at each level of the decomposition.
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
12 · ...
Algorithm: Decompose a set of m× n 2D matrices
F1,F2, . . . ,Fk into rank-1 matrices A1,A2, . . . ,Ar
FOR j = 1 : r
Calculate eigenvector u that corresponds to the
largest eigenvalue of the n× n matrix ∑ki=1 FiFTi
Form m× k matrix Fˆ = [FT1 u,FT2 u, . . . ,FTk u]
Calculate the eigenvector v that corresponds to the
largest eigenvalue of the m×m matrix FˆFˆT .
Repeat
Form the k × n matrix Fˆ = [F1v,F2v, . . . ,Fkv]T
Calculate the eigenvector u that corresponds to the
largest eigenvalue of the n× n matrix FˆT Fˆ.
Form the m× k matrix Fˆ = [FT1 u,FT2 u, . . . ,FTk u]T
Calculate the eigenvector v that corresponds to the
largest eigenvalue of the m×m matrix FˆFˆT .
until u and v vectors change less than a pre-speciﬁed
value that is set by the user.
Aj = uv
T .
λij = v
TFTi u.
Replace Fi with Fi − λijuvT .
END
Fig. 7. Outline of the rank-1 decomposition algorithm
The algorithm reduces the eﬀect of the quantization error by propagating the error
of each decomposition level to the next one.
The algorithm ﬁrst determines the separable ﬁlter A1 = u1vT1 for the ﬁrst level
of the decomposition and the corresponding λi1 for i = 1, . . . , k using the rank-1
decomposition algorithm. The algorithm quantizes the coeﬃcients that correspond
to the vectors u1 and v1 by using canonical signed digit recoding and using one
non-zero bit per coeﬃcient. Due to the fact that the coeﬃcients for this level are
quantized, the λi1 terms are re-evaluated to correct for the quantization eﬀects.
The new λi1 are calculated using the following system of linear equations [Strang
1998]:
Fiv1 = λi1u1 (15)
FiTu1 = λi1v1 (16)
where u1 and v1 have been quantized. The new λi1 are quantized using the same
quantization method as before.
The algorithm updates the input ﬁlters Fi as Fi ← Fi − λi1u1vT1 , and repeats
the above process until it estimates r levels of decomposition. The ﬁnal error terms
Ei are formed by the resulting ﬁlters Fi where all the coeﬃcients are quantized as
before.
The Precision Equalizing algorithm has been developed for allocating the avail-
able slices to the coeﬃcients in an optimum way. The main idea behind the algo-
rithm is to equalize the precision of the coeﬃcients that belong to the same 1D FIR
ﬁlter. The motivation behind this algorithm is that the error that is injected into
the system by the quantization of each coeﬃcient in the same 1D ﬁlter is propa-
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 13
Algorithm: Precision Equalizing Algorithm
FOR k = 1 to m (where m is the number of 1D FIR ﬁlters
Find the current precision of each coeﬃcient
Recalculate the quantization error of the coeﬃcient
Recalculate the eigenvalue of the corresponding level
and reset its HW
Recalculate the coeﬃcients of lower decomposition
levels using the rank-1 algorithm and reset their
HW to the default values
Calculate the ﬁlter error of the new ﬁlter conﬁguration
and store it
END
Find the conﬁguration with the smallest variance.
Fig. 8. Summary of the Precision Equalizing Algorithm
gated to the output of the system through the same transfer function. Given that
the variance of the noise at the output of the system due the quantization of each
coeﬃcient is proportional to the variance of the signal at the input of the coeﬃcient
multiplier, which is the same for the coeﬃcients that belong to the same 1D ﬁlter,
a sensible choice would be for the coeﬃcients of the same 1D ﬁlter to have the same
precision. It should be noted that in the Precision Equalizing Algorithm only one
coeﬃcient for each 1D FIR ﬁlter is considered for optimization at each iteration,
leading to solutions that are computational eﬃcient. Figure 8 gives an outline of
the Precision Equalizing Algorithm.
Regarding the complexity of the algorithm, for the case of k m×n ﬁlters and for
one iteration, the Precision Equalizing Algorithm has a complexity proportional to
3r + mnk, where r is the number of decomposition levels under investigation.
The ﬁrst stage terminates when all the available slices have been used by the
algorithm.
4.4.2 Embedded Multiplier Allocation stage. In this stage, the algorithm deter-
mines the coeﬃcients that will be placed to embedded multipliers. The coeﬃcients
that have the largest cost in terms of slices in the current design and reduce the
ﬁlters’ approximation error when they are allocated to embedded multipliers, are
selected. The second condition is necessary due to the limited precision of the em-
bedded multipliers (e.g. 18 bits in Xilinx Virtex II devices), which in some cases
may restrict the approximation of the multiplication and consequently of the ﬁnal
result. Because of the above condition, in the case where the user requires a design
with high accuracy, not all the available embedded multipliers speciﬁed by the user
are always allocated.
Since the proposed algorithm is based on the Singular Value Decomposition al-
gorithm, there exists an ordering of the coeﬃcients regarding the impact of their
approximation to the total approximation error. That is, coeﬃcients from the initial
decomposition levels have larger impact than coeﬃcients of the later levels. This
feature is explored by the algorithm to eﬃciently allocate the embedded multipliers
to coeﬃcients that belong to the initial decomposition levels and at the same time
their placement on an embedded multiplier would save a considerable amount of
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
14 · ...
resources.
Finally, the algorithm allocates the slices that have been freed, due to the previous
allocation of the embedded multipliers, to the rest of the design in order to achieve
an even lower approximation error. This stage is similar to the slice allocation stage
except that the coeﬃcients that have been allocated to embedded multipliers are
not considered for optimization.
It should be noted that the above decomposition is realized for a speciﬁc number
of decomposition stages. In order to explore the design space for diﬀerent decom-
position levels, the proposed algorithm exhaustively searches a predeﬁned set of
separable levels, and returns the designs that achieve the lowest variance at the
output of the system, for a speciﬁc area constraint.
4.5 Cost Model
The aim of the resource allocation algorithm is to ﬁnd a ﬁlter decomposition that
achieves the minimum error variance at the outputs of the system given constraints
on the available resources. For this reason an accurate and relatively fast model for
the resource usage is required.
A cost model has been constructed for a Xilinx Virtex-II device to predict the cost
of the diﬀerent components that are used by the algorithm like constant coeﬃcient
multipliers and adder trees.
In a Xilinx FPGA, the smallest logic element unit is a “slice”. It consists of two
4-LUTs, two storage elements (ﬂip-ﬂops) and some other components like carry
logic, logic gates and multiplexers [Xil ]. In the current cost model, a slice is further
decomposed into two half-slices, each consisting of a single 4-LUT and a ﬂip-ﬂop.
The model tracks the used LUT and ﬂip-ﬂop components of each half-slice in order
to estimate the required design area. A half-slice does not need to be fully utilized.
Either the ﬂip-ﬂop or the LUT can be used by the design. A LUT connected to a
ﬂip-ﬂop counts as a half-slice. A ﬂip-ﬂop connected to two ﬂip-ﬂops on either side,
as in delay element, counts as a half-slice, with its LUT component not being used.
Similarly, a LUT having LUTs at its input and output, as in combinatorial block,
is also considered as a half-slice with its ﬂip-ﬂip not being used.
The current model predicts a resource usage within 3% of the actual cost when
the component is synthesized and placed on the device. The cost estimate of in-
dividual components is then aggregated to obtain the cost estimate of the ﬁnal
decomposition. Experiments show that the worst case system-level estimate devi-
ates by 22% of the actual area, with a mean value of 12%. This is due to the fact
that the synthesis and place and route tools optimize further the design beyond the
scope of the current cost model, which has to be relatively fast, and tracks only
the LUT and ﬂip-ﬂop components of a slice. It should be noted that diﬀerent cost
models for diﬀerent devices can be easily accommodated by the system.
4.6 Frequency vs Area
Another dimension in hardware design is the maximum achieved frequency of the
design. Due to the nature of the 2D convolution, the design can be pipelined
very deeply. In the current framework, the algorithm can be tuned for a speciﬁc
frequency using the minimum required area.
The framework uses delay models for all the components and inserts extra regis-
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 15
ters whenever a particular component is likely to exceed the predeﬁned time delay.
It should be noted that the delay estimation is only an approximation since the
routing delay, which depends on the relative placement of the components, has not
been modelled. Simulations show that for a 9×9 Gabor ﬁlter targeting to a speciﬁc
noise variance at the output of the system, a change in the clock rate from 90MHz
to 57MHz leads to an area saving of 26%. More results regarding the frequency
versus area trade-oﬀ are illustrated in the performance evaluation section.
4.7 Adder trees
A component with a large contribution to the ﬁnal cost of the design is the adder
tree that is required after each 1D or 2D FIR stage. The proposed algorithm
minimizes the cost of the adder tree by clustering together intermediate results
that have similar precisions. This leads to the construction of more “compact”
adder trees. This property has also been exploited by the Precision Equalizing
Algorithm.
4.8 Area reduction vs separability
The performance of the proposed algorithm is also a function of the separability
properties of the 2D ﬁlters. The more separable the ﬁlters are, the more reduction
in area can be achieved. Considering a single 2D ﬁlter, its rank gives only a lower
bound for the separable levels that are needed in order to approximate the ﬁlter, due
to the quantization of the ﬁlter coeﬃcients. The proposed algorithm exhaustively
searches a predeﬁned set of separable levels, and returns the designs that achieve
the lowest variance at the output of the system, for a speciﬁc area constraint.
Thus, in the worst case scenario, where a 2D ﬁlter has a full rank, the proposed
algorithm will return a design that is no worse than a design obtained using current
techniques.
5. PERFORMANCE EVALUATION
The performance of the proposed algorithm is compared to a direct pipelined imple-
mentation of a 2D convolution using Canonic Signed Digit recoding for the constant
coeﬃcient multipliers. In the case of multiple 2D ﬁlters, an equivalent number of 2D
convolution designs are implemented and their combined performance is compared
against the proposed architecture.
The main target of the proposed algorithm is the computer vision ﬁeld. Thus,
the ﬁlters that are used to assess the performance of the algorithm are drawn
from real computer vision applications. The test sets are shown in Table I. Tests
1 and 2 evaluate the performance of the algorithm in the case of a single ﬁlter.
The Gabor and the Laplacian of Gaussian ﬁlters have been selected because of
their extensive use in computer vision applications. A convolution with a Gabor
ﬁlter yields images which are locally normalized in intensity and decomposed in
terms of spatial frequency and orientation [Gong et al. 2000]. The Laplacian of
Gaussian ﬁlter is used mainly for edge detection. The rest of the test set evaluate
the performance of the algorithm in the case of multiple ﬁlters. Tests 3 and 4
focus on Gabor ﬁlters, where tests 5 and 6 focus on ﬁlters that are used in speciﬁc
applications. Test set 5 is used in [Bouganis et al. 2004], where the authors use
the ﬁlters to decompose an image into scale and orientation selective channels for
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
16 · ...
Table I. Test sets. Tests 1 and 2 are used to assess the performance of the algorithm in the
case of a single ﬁlter, while Test cases 3 to 6 are used for multiple ﬁlter evaluation. Tests 1 to 4
correspond to common convolution operators in the computer vision ﬁeld, while Test cases 5 and
6 correspond to speciﬁc applications drawn from the computer vision ﬁeld.
Test Number Description
1 One 9× 9 Gabor ﬁlter
F (x, y) = α sin θe−ρ
2(α
σ
)2 , ρ2 = x2 + y2, θ = αx,
α = 4, σ = 6
2 One 9× 9 Laplacian of Gaussian ﬁlter
LoG(x, y) = − 1
πσ4
[1− x2+y2
2σ2
]e
− x2+y2
2σ2 ,
σ = 1.4
3 Four 9× 9 Gabor ﬁlters
Fi(x, y) = α sin(θi)e
−ρ2(α
σ
)2 , ρ2 = x2 + y2,
θi ∈ {αx, α
√
2(x + y), αy, α
√
2(y − x)}
α = 4, σ = 6
4 Four 15× 15 Gabor ﬁlters
Fi(x, y) = α sin(θi)e
−ρ2(α
σ
)2 , ρ2 = x2 + y2,
θi ∈ {αx, α
√
2(x + y), αy, α
√
2(y − x)}
α = 6, σ = 20
5 Four 15× 15 ﬁlters from a steerable pyramid for
edge detection [Bouganis et al. 2004]
6 Six 15× 15 ﬁlters describing inhibition neural
connections [Li 2003]
scene analysis. Test set 6 refers to a set of six ﬁlters that are used in the case of
pre-attentive segmentation in Primary Visual Cortex [Li 2003]. The selected test
sets for performance evaluation comprise a representative set of ﬁlters that vary
from commonly used ﬁlters to real application-speciﬁc ﬁlters.
5.1 Area gain evaluation
Tests 1 and 2 are used to assess the performance of the proposed algorithm for the
case of a single ﬁlter. Figure 9(a) shows the achieved variance of the error at the
output of the ﬁlter as a function of the area, for the proposed and the reference
algorithms. It can be concluded that in all cases the proposed algorithm leads
to designs that use less area than the reference algorithm, for the same variance
of the error at the output. Figure 9(b) illustrates the relative reduction in area
achieved by the proposed algorithm for diﬀerent values of the noise variance. The
oscillation in the ﬁgure is due to the discrete nature of the design space exploration.
An average reduction of 24.95% and 12.28% is achieved for the Test case 1 and 2
respectively. Alternative, the proposed methodology produces designs with up to
50dB improvement in the signal to noise ratio requiring the same area in the device
with designs that are derived from the reference algorithm. Moreover, Test case 1
was used for evaluation of the performance of the algorithm when embedded mul-
tipliers are available. 30 embedded multipliers of 18 × 18 bits are made available
in the algorithm. The relative percentage reduction achieved by the algorithm be-
tween designs that use the embedded multipliers and designs that realized without
any embedded multiplier is illustrated in Figure 10 for diﬀerent values of the noise
variance. The oscillation in the ﬁgure is due to the discrete nature of the design
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 17
−30 −25 −20 −15 −10 −5 0 5
0
2000
4000
6000
8000
10000
12000
C
os
t (
in
 s
lic
es
)
Variance of noise (log
10
)
(a)
−20 −15 −10 −5 0 5
15
20
25
30
35
40
45
50
Variance of noise (log
10
)
G
ai
n 
in
 s
lic
es
 (
%
)
(b)
Fig. 9. (a) Achieved variance of the noise at the output of the design versus the area usage of the
proposed design (+) and the reference design (*) for Test case 1. (b) illustrates the percentage
gain in slices of the proposed framework for diﬀerent values of the variance of the noise.
space exploration. It can be concluded that the algorithm explores eﬃciently the
available embedded multipliers leading to signiﬁcant gains in the required area. The
average gain in slices achieved by the algorithm is 9.35%. It should be noted that
the impact of the embedded multipliers is more evident in the larger variances of
the noise, since a LUT based constant coeﬃcient multiplier using Canonic Signed
Digit recoding leads to more accurate results after a certain word-length of the
constant coeﬃcient. Thus, designs targeting very small variance of the noise at
the output are implemented more eﬃciently using LUT based constant coeﬃcient
multipliers.
Tests 3 and 4 evaluate the performance of the algorithm in the case of multiple
ﬁlters. Test case 3 evaluates the algorithm for a case of four 9×9 ﬁlters, where Test
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
18 · ...
−30 −25 −20 −15 −10 −5 0 5
0
5
10
15
20
25
30
Variance of noise (log
10
)
G
ai
n 
in
 s
lic
es
 (
%
)
Fig. 10. Percentage gain in slices of the proposed framework for Test case 1 for diﬀerent values of
the variance of the noise between a design that uses 30 embedded multipliers and a design that
uses none.
case 4 uses four 15 × 15 ﬁlters. Figure 11(a) illustrates the achieved mean value
of the variance of the error at the output of the system for a given area constraint
for the proposed and the reference algorithms. It can be concluded that in all the
cases, the proposed algorithm explores the redundancy in the ﬁlters and produces
superior results than the reference algorithm. Figure 11(b) illustrates the relative
percentage reduction in area achieved by the proposed algorithm for diﬀerent values
of the noise variance. The achieved average gain in slices is 20.06% and 41.87% for
Test case 3 and 4 respectively. The above results show that in the case of large
ﬁlters the reduction achieved by the proposed algorithm is more evident, since the
existing redundancy is larger.
Finally, the proposed algorithm is tested using kernels from speciﬁc computer
vision applications. Test cases 5 and 6 ensemble two of these cases. Figure 12
shows the performance of the proposed algorithm for the two cases. In both cases,
the proposed algorithm outperforms the reference algorithm, regardless of the area
constraint. The mean relative percentage gain in slices is 38.01% for Test case 5
and 13.57% for Test case 6.
The above experiments demonstrate the power of the proposed algorithm to
exploit the redundancy inside a convolution ﬁlter, and between diﬀerent ﬁlters
in order to produce better designs in terms of area and accuracy. Moreover, it
exploits the heterogeneous structure of modern FPGAs regarding the embedded
multipliers by producing an order for the coeﬃcients regarding their impact to the
ﬁnal approximation error and allocating appropriately the embedded multipliers.
5.2 Algorithmic considerations
As has been described in Section 4.4, the proposed methodology performs a search
over the number of decomposition levels in order to ﬁnd the one that produces
the minimum variance of the error at the output of the system for a given set of
resources. In the case where inﬁnite precision arithmetic was used and under the
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 19
−20 −15 −10 −5 0 5
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
x 10
4
C
os
t (
in
 s
lic
es
)
Variance of noise (log
10
)
(a)
−18 −16 −14 −12 −10 −8 −6 −4 −2 0 2
10
15
20
25
30
35
40
Variance of noise (log
10
)
G
ai
n 
in
 s
lic
es
 (
%
)
(b)
Fig. 11. (a) Achieved variance of the noise at the output of the design versus the area usage of the
proposed design (+) and the reference design (*) for Test case 3. (b) illustrates the percentage
gain in slices of the proposed framework for diﬀerent values of the variance of the noise.
assumption that all multiplication blocks require the same area for implementa-
tion, the above search strategy would be unnecessary since the area and the error
of the system are monotonic functions of the number of the decomposition levels.
However, due to the quantization eﬀects and the varying area costs of the multi-
plication blocks, the area and the error of the system are not monotonic functions
of the number of the decomposition levels anymore. Thus, the algorithm has to
explore the design space by iterating through the decomposition level space when
the optimum design in terms of area is required for a given set of resources.
Figure 13 illustrates the area usage of the design versus the variance of the
noise at the output of the system for Test case 1. The graph is annotated with the
diﬀerent decomposition levels that are used at each design point. This demonstrates
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
20 · ...
−12 −10 −8 −6 −4 −2 0 2 4
0
10
20
30
40
50
60
Variance of noise (log
10
)
G
ai
n 
in
 s
lic
es
 (
%
)
Fig. 12. Percentage gain in slices of the proposed framework for diﬀerent values of the variance
of the noise for Test cases 5 (*) and 6 (+).
−30 −25 −20 −15 −10 −5 0 5
0
2000
4000
6000
8000
10000
12000
C
os
t (
in
 s
lic
es
)
Variance of noise (log
10
)
01
3
41
63
3165
531
13
5
8135
613
823
Fig. 13. Achieved variance of the noise at the output of the design versus the area usage of the
proposed design (+) and the reference design (*) for Test case 1. The numbers on the graph refer
to the number of decomposition levels that are used in each design point.
the necessity to iterate through the decomposition level space in order to ﬁnd the
optimal decomposition for a speciﬁc set of resources.
5.3 Cost model accuracy
The accuracy of the cost model is important for the validity of the results. The cost
model has been developed separately for each component in the ﬁlter design, with-
out exploiting any further optimization when diﬀerent components are combined
together. Thus, it is expected that the estimated cost of the design to deviate from
the actual cost. Test 1 and Test 3 are used to assess the accuracy of the estimated
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 21
−25 −20 −15 −10 −5 0 5 10
0
2
4
6
8
10
12
14
16
18
20
Relative error (%)
N
um
be
r 
of
 d
es
ig
ns
Fig. 14. Distribution of the error in the area prediction for a design. Relative Error =
Actual area−Predicted area
Actual area
∗ 100%
cost model. Figure 14 illustrates the distribution of the deviation between the pre-
dicted cost of the design using the proposed model and the the actual cost of the
design reported by the Xilinx back-end tools. The target device is a XC2V6000
FPGA, and 57 designs were used. 32 designs from Test case 1 (single ﬁlter) and
25 designs from Test case 3 (multiple ﬁlters). It can be concluded that the model
usually overestimates the actual area of a design by a mean relative error value of
12%. The framework’s overestimation in the area of the design is expected due to
the extra optimizations that are applied by the back-end mapping tools. However,
it should be noted that the distribution of the deviation is narrow, implying certain
conﬁdence on the decisions of the algorithm during the optimization and its results.
Using the proposed algorithm, the obtained designs for Test case 1 were syn-
thesized and placed and routed on a Xilinx XC2V6000-4 device. Also, the designs
obtained by the reference algorithm were synthesized for the same target device.
Figure 15 shows the achieved area usage of the designs versus the variance of the
noise at the output of the system. In all the cases, the proposed algorithm produces
designs that require less area than the reference algorithm for the same variance
of the error at the output of the system. The diﬀerent shape of the plots between
Figure 15 and Figure 9(a) is explained due to the use of Pareto curve.
5.4 Frequency vs Area of the design
Depending on the user’s requirements, the proposed framework has the option to
trade area for frequency and vice versa. This is achieved by pipelining the design to
diﬀerent depths. The proposed framework tracks all the components in the design,
and inserts extra pipeline stages in those that are likely to exceed the user’s target
clock period. The delay estimation is an approximation of the actual delay and
routing delay between the components has not be modelled since it depends on
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
22 · ...
−25 −20 −15 −10 −5 0 5
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
C
os
t (
in
 s
lic
es
)
Variance of noise (log
10
)
Fig. 15. Achieved variance of the noise at the output of the system versus the area usage of
the proposed design (+) and the reference design (*) for Test case 1. The designs have been
synthesized and placed and routed using the Xilinx tools ISE 8.1. The target device is a Xilinx
device XC2V6000-4.
the relative placement of the components in the device. Figure 16 illustrates the
frequency versus area trade-oﬀ for Test case 1, when the variance of the noise at the
output of the system is ﬁxed to σ2noise = 10
−7. The target device is a Xilinx device
XC2V6000-4, and the Xilinx tools ISE 8.1 were used for synthesis, placement, and
routing. The graph demonstrates the ability of the framework to explore the area-
speed design space and to produce a design that targets the user requirements.
5.5 Power issues
The power consumption of the designs derived by the proposed methodology is
evaluated and is compared against designs that are derived by the reference algo-
rithm. Test case 1 is used for the evaluation and comparison, the target device is
a XC2V6000-4 from Xilinx, the back-end tools from Xilinx are used for placement
and routing, and the XPower tool from Xilinx is used for the estimation of the total
power consumption. Post place & route information is used to fully capture glitch-
ing eﬀects, and the stimuli of each design is a 256× 256 image of Lenna, a widely
used standard test image in image processing community. Figure 17 illustrates the
total power consumption of designs derived by the proposed methodology and the
reference algorithm. The results show that the proposed methodology produces
designs with up to 40dB improvement in the signal to noise ratio consuming the
same power with designs that are derived from the reference algorithm. On av-
erage, a 53% reduction on the consumed power is achieved assuming designs that
target uniformly placed noise variance at the output of the system. This makes
the proposed methodology a better choice for DSP designers that target low-power
applications.
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 23
55 60 65 70 75 80 85 90 95
2600
2800
3000
3200
3400
3600
3800
Target clock rate (MHz)
C
os
t (
in
 s
lic
es
)
Fig. 16. Speed-area trade-oﬀ for Test case 1. The noise variance at the output is ﬁxed to σ2noise =
10−7. The target device is a Xilinx device XC2V6000-4, and the Xilinx tools ISE 8.1 were used
for synthesis, placement, and routing.
−16 −14 −12 −10 −8 −6 −4 −2 0 2 4
500
550
600
650
700
750
800
850
900
Variance of noise (log
10
)
T
ot
al
 p
ow
er
 c
on
su
m
pt
io
n 
(m
W
)
Fig. 17. Total power consumption for Test case 1. (+) indicates designs derived using the proposed
methodology, where (*) indicates designs derived using the reference algorithm. On average, a
53% reduction on the consumed power is achieved assuming designs that target uniformly placed
noise variance at the output of the system. The target device is a Xilinx device XC2V6000-4, and
the Xilinx tools ISE 8.1 were used for synthesis, placement, routing, and power estimation.
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
24 · ...
−14 −12 −10 −8 −6 −4 −2 0
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
x 10
4
C
os
t (
A
LU
T
s 
+
 F
F
A
LU
T
s)
Variance of noise (log
10
)
Fig. 18. Achieved variance of the noise at the output of the system versus the area usage of
the proposed design (+) and the reference design (*) for Test case 1. The designs have been
synthesized and placed and routed using the Altera tools Quartus II (v7.1). The target device is
a Stratix III (EP3SE50). Please note that the two points (+) that correspond to the proposed
design and are above the reference line are due to the Pareto curve and do not correspond to
actual designs.
5.6 Targeting an Altera device
A high-level area model for a Stratix III device from Altera was developed and
incorporated into the proposed framework. The high-level area model has a similar
structure to the Xilinx’s model but modiﬁcations have been made to address the
speciﬁc architecture of the new device. Experiments have demonstrated a maximum
relative error of 15% in the area prediction, which outperforms the corresponding
Xilinx high-level area model. Figure 18 shows the achieved area usage of the designs
versus the variance of the noise at the output of the system for Test case 1. The
obtained results correspond to synthesized and placed and routed designs on an
Altera Stratix III device (EP3SE50). In all the cases, the proposed framework
produces designs that require less area than the reference algorithm for the same
variance of the error at the output of the system.
5.7 Computational requirements
The required time for exploration of the design space using the proposed algorithm
depends on the set of the input ﬁlters and the set of targeted resources. Figure 19
illustrates the required run-time of the algorithm for Test cases 1, 3, and 4 targeting
diﬀerent number of decomposition levels. A zero decomposition level implies that
only the error term is used in the approximation. In all the cases the target resource
is 30,000 slices. The total time for Test case 1, 3, and 4 were approximately 1, 2,
and 3.2 hours respectively. The algorithm was implemented in MATLAB running
on an Intel Core 2 CPU at 1.86GHz PC.
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 25
0 2 4 6 8 10
0
200
400
600
800
1000
1200
1400
Decomposition levels
T
im
e 
(s
ec
)
Fig. 19. Time required for exploration of diﬀerent decomposition levels for Test cases 1 (*), 3
(+), and 4 (o). The algorithm was implemented in MATLAB running in an Intel Core 2 CPU at
1.86GHz.
6. CONCLUSION
This paper presents a novel 2D ﬁlter design methodology for heterogeneous FPGA
devices. The methodology explores the computational structure of a 2D ﬁlter ac-
cording to the diﬀerent types of available resources in the device. It employs the use
of a Singular Value Decomposition based algorithm which produces a decomposition
of the ﬁlter that maps directly to the diﬀerent components of modern reconﬁgurable
devices. Furthermore, it extends the work to designs with multiple ﬁlters by ex-
ploiting any redundancy that exists within each ﬁlter and between diﬀerent ﬁlters
in the set. Experiments with ﬁlter sets from computer vision applications demon-
strated a reduction of 13.57% and 38% on average in the required area. Future work
will involve the use of word-length optimization techniques [Constantinides et al.
2003] and the use of current techniques for high-speed multiplication [Dempster
and Macleod 1995] to further enhance the algorithm’s performance. Moreover, due
to the sum of squares error criterion that is used, the proposed method is more ap-
propriate for channel equalization applications. However, a diﬀerent criterion like
a min/max approximation of the ﬁlter’s frequency response can be accommodated
by slightly modifying the proposed algorithm.
Acknowledgement
This work was funded by the UK Research Council under the Basic Technology Re-
search Programme “Reverse Engineering Human Visual Processes” GR/R87642/02
and by the EPSRC (EP/C549481/1) grant.
REFERENCES
Virtex-II Platform FPGA Handbook.
Altera. http://www.altera.com.
Andrews, M. S. 1999. Architectures for generalized 2d ﬁr ﬁltering using separable ﬁlter struc-
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
26 · ...
tures. In Proceedings of the Acoustics, Speech, and Signal Processing, IEEE International
Conference. Vol. 4. 2215–2218.
Belhumeur, P., Hespanha, J., and Kriegman, D. 1997. Eigenfaces vs. ﬁsherfaces: recognition
using class speciﬁc linear projection. IEEE Transactions on Pattern Analysis and Machine
Intelligence 19, 7 (July), 711–720.
Bouganis, C.-S., Cheung, P. Y. K., and Constantinides, G. A. 2005. Heterogeneity Explo-
ration for Multiple 2D Filter Designs. In Proc. Field Programmable Logic and Applications.
263–268.
Bouganis, C.-S., Cheung, P. Y. K., Ng, J., and Bharath, A. A. 2004. A Steerable Complex
Wavelet Construction and its Implementation on FPGA. In Proc. Field Programmable Logic
and Applications. 394–403.
Bouganis, C.-S., Constantinides, G. A., and Cheung, P. Y. K. 2005a. A Novel 2D Filter
Design Methodology. In Proc. International Symposium in Circuits and Systems. 532–535.
Bouganis, C.-S., Constantinides, G. A., and Cheung, P. Y. K. 2005b. A Novel 2D Filter
Design Methodology For Heterogeneous Devices. In Proc. Field-Programmable Custom Com-
puting Machines. 1–10.
Chen, C.-L., Khoo, K.-Y., and A.N. Willson,Jr. 1995. An improved polynomial-time algorithm
for designing digital ﬁlters with power-of-two coeﬃcients. In IEEE International Symposium
on Circuits and Systems. Vol. 1. 223–226.
Cmar, R., Rijnders, L., Schaumont, P., Vernalde, S., and Bolsens, I. 1999. A methodology
and design environment for DSP ASIC ﬁxed point reﬁnement. In Proc. Design, Automation,
and Test in Europe, Munich, Germany. 271–276.
Constantinides, G. A., Cheung, P., and Luk, W. 2001. The Multiple Wordlength Paradigm.
In Proc. Field-Programmable Custom Computing Machines. 51–60.
Constantinides, G. A., Cheung, P. Y. K., and Luk, W. 2003. Wordlength Optimization for
Linear Digital Signal Processing. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 22, 10 (October).
Dempster, A. and Macleod, M. D. 1995. Use of minimum-adder multiplier blocks in FIR digital
ﬁlters. IEEE Trans. Circuits Systems II 42, 569 – 577.
Gong, S., McKenna, S., and Psarrou, A. 2000. Dynamic Vision: From Images to Face Recog-
nition, 1st ed. Imperial College Press.
Haseyama, M. and Matsuura, D. 2006. A ﬁlter coeﬃcient quantization method with genetic
algorithm, including simulated annealing. IEEE Signal Processing Letters 13, 4 (April), 189–
192.
Hastad, J. 1990. Tensor rank is NP-complete. Journal of Algorithms 11, 4, 644 – 654.
Ja’Ja, J. 1978. Optimal evaluation of pairs of bilinear forms. In Proc. of the 10th Annual ACM
Sym. of Theory of Computing. 173 – 182.
Kodek, D. 1980. Design of Optimal Finite Wordlength FIR Digital Filters Using Integer Linear
Programming Techniques. IEEE Trans. on Acoustics, Speech, and Signal Processing 28, 304
– 308.
Koren, I. 2002. Computer Arithmetic Algorithms, 2nd ed. New Jersey: Prentice-Hall Inc.
Kum, K.-I. and Sung, W. 2001. Combined word-length optimization and high-level synthesis of
digital signal processing systems. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 20, 8 (August), 921–930.
Li, D., Lim, Y. C., Lian, Y., and Song, J. 2002. A polynomial-time algorithm for designing
ﬁr ﬁlters with power-of-two coeﬃcients. IEEE Transactions on Acoustics, Speech, and Signal
Processing 50, 8 (August), 1935–1941.
Li, Z. 2003. V1 mechanisms and some ﬁgure-ground and border eﬀects. Journal of Physiology
Paris 97, 503–515.
Mitra, S. K. 2002. Digital Signal Processing: A Computer-Based Approach, second ed. McGraw-
Hill Higher Education.
Park, I.-C. and Kang, H.-J. 2001. Digital ﬁlter synthesis based on minimal signed digit repre-
sentation. In Annual ACM IEEE Design Automation Conference. 468–473.
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
... · 27
Pasko, R., Schaumont, P., Derudder, V., Vernalde, S., and Durackova, D. 1999. A new
algorithm for elimination of common subexpressions. IEEE Transactions on computer-aided
design of integrated circuit and systems 18, 1 (January), 58–68.
Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. 1992. Numerical Recipes in C.
Cambridge University Press.
Samueli, H. 1989. An improved search algorithm for the design of multiplierless ﬁr ﬁlters with
powers-of-two coeﬃcients. IEEE Transactions on Circuits and Systems 36, 7 (July), 1044–1047.
Shashua, A. and Levin, A. 2001. Linear image coding for regression and classiﬁcation using the
tensor-rank principle. In Computer Vision and Pattern Recognition Conference. Vol. I. IEEE,
Kauai, HI, USA, 42 – 49.
Siohan, P. 1990. An analysis of coeﬃcient inaccuracy for 2-d ﬁr direct form digital ﬁlters. IEEE
Transactions on Circuits and Systems 37, 10 (Oct.), 1308–1313.
Strang, G. 1998. Introduction to Linear Algebra, 3rd ed. Wellesley-Cambridge Press.
Xilinx. http://www.xilinx.com.
Yurdakul, A. 2005. Multiplierless implementation of 2-d ﬁr ﬁlters. Integration, the VLSI Jour-
nal 38, 4 (April), 597–613.
ACM Transactions on Computational Logic, Vol. V, No. N, September 2008.
