Exploiting Reconfigurable SWP Operators for Multimedia Applications by Menard, Daniel et al.
Daniel Menard, Hai Nam Nguyen, François Charot, Stéphane Guyetant,
Jérémie Guillot, Erwan Raffin, Emmanuel Casseau
To cite this version:
Daniel Menard, Hai Nam Nguyen, François Charot, Stéphane Guyetant, Jérémie Guillot, et
al. Exploiting Reconfigurable SWP Operators for Multimedia Applications. ICASSP: International
Conference on Acoustics, Speech and Signal Processing, May 2011, Prague, Czech
Republic. 36th International Conference on Acoustics, Speech and Signal Processing, 2011.
<inria-00567017>
HAL Id: inria-00567017
https://hal.inria.fr/inria-00567017
Submitted on 22 Nov 2016

EXPLOITING RECONFIGURABLE SWP OPERATORS FOR MULTIMEDIA APPLICATIONS
D. Menard 1, H.N. Nguyen 1, F. Charot 1, S. Guyetant 2, J. Guillot 3, E. Raffin 1, E. Casseau 1
1 IRISA/INRIA, Campus de Beaulieu, F-35000 Rennes, Email: name@irisa.fr
2 CEA, LIST, CEA/Saclay F-91191 Gif-Sur-Yvette, Email: stephane.guyetant@cea.fr
3 LIRMM, 161 rue Ada, F-34095 Montpellier, Email: jeremie.guillot@lirmm.fr
ABSTRACT
Implementing image processing applications in embedded systems
is a difficult challenge because of drastic constraints on cost,
energy consumption and real-time execution. Reconfigurable
architectures are good candidates to take up this challenge, especially
when the architecture supports different pixel word-lengths
through Sub-Word Parallelism (SWP) capabilities. Exploiting
the diversity of supported data types requires automation tools able
to optimize the data word-length under an accuracy constraint. In
this paper, a new approach for word-length optimization in the case
of SWP operations is proposed. Compared to existing approaches,
the optimization time is significantly reduced without sacrificing the
quality of the optimized solution. The results show the ability of our
approach to exploit the SWP capabilities associated with multimedia
processors.
1. INTRODUCTION
Multimedia applications are increasingly popular in embedded
systems. For several years, mobile devices have integrated real-time
video and image processing applications. Implementing these
applications in embedded systems is a difficult challenge because of drastic
constraints on cost, energy consumption and real-time execution.
Image processing at the pixel level, such as image filtering, edge
detection and pixel correlation, or at the block level, such as motion
estimation, has to be considered. Such applications are typically
computationally intensive and require energy-efficient architectures.
Moreover, the architecture must be flexible enough to adapt to the
diversity of the processing patterns associated with the different
standards and to the diversity of the data types handled by these
applications.
In the ROMA project [7, 9], a reconfigurable processor able to
adapt its computing structure to image processing applications has
been developed. To avoid complex interconnection networks
between operators, the processor is made up of a pipeline of flexible
coarse-grain reconfigurable operators exhibiting efficient power and
performance features. The architecture provides reconfigurability
at the data word-length level, by supporting different data
word-lengths, and at the operation level, by supporting different
multimedia-oriented operations. This processor provides a trade-off
between computation accuracy and implementation cost through SWP
capabilities.
This kind of architecture exhibits high performance, but design
automation tools are required to bridge the gap between increasingly
complex applications and high-performance hardware platforms,
and thus to reduce the time-to-market. As mentioned in [2],
one of the most time-consuming parts of the implementation process
is the fixed-point conversion, and especially the word-length
optimization stage. Most work on word-length optimization focuses
on hardware implementation [1]. In this case, the
architecture is defined by the developer and the operator word-length
can take any value in a given range. In the case of our architecture,
only a few word-lengths are available for each operation because of
the SWP capabilities. The methodologies presented
in [6, 4] target fixed-point processors and carry out a floating-point
to fixed-point transformation leading to ANSI-C code with integer
data types. Nevertheless, the different data types supported by
the architecture are not taken into account. In [8], the data
word-lengths are optimized for architectures with SWP operators.
The optimization process is based on a tree model, and a branch
and bound algorithm is used to find the best solution. This approach
is suitable for applications having a limited number of variables to
optimize or when the number of supported data types is limited. For
our ROMA architecture, five different data types, suitable for
multimedia applications, are supported.
The work presented in this paper is supported by the French
Architectures du Futur ANR program ANR-06-ARFU-004.
In this paper, a new approach for data word-length optimization
is proposed for architectures supporting several data
types. Our approach is based on two stages. First, an initial optimized
solution is found with a heuristic; then, this solution is refined to
improve its quality. This approach yields
an optimized solution with a significant reduction of the optimization
time compared to existing approaches. The results show the ability
of our approach to exploit efficiently the SWP capabilities associated
with multimedia processors.
The paper is organized as follows. In Section 2, the main
characteristics of our reconfigurable architecture are summarized and the
flexible operator is presented. The fixed-point conversion process
is described in Section 3, where the word-length optimization
problem is also presented. In Section 4, the proposed optimization
approach is detailed. The efficiency of our approach is illustrated
through different experiments in Section 5. Finally, Section 6 draws
conclusions.
2. FLEXIBLE RECONFIGURABLE ARCHITECTURE
The ROMA processor is a reconfigurable datapath of operators,
controlled by a low-footprint sequencer that triggers the execution and the
loading of configurations. The datapath is composed of a set of
local scratchpad memories, each associated with an address generator
that can create streams of data, a set of reconfigurable operators and
an optimized interconnect. The interconnect creates computing patterns
in which data are read from local memories, processed through a pipeline
of operators and written back to selected memories. The controller
handles streams over bursts of data and can also tune the datapath
at each clock cycle. For more details on the architecture, the reader
can refer to [7].
978-1-4577-0539-7/11/$26.00 ©2011 IEEE, ICASSP 2011
For performance enhancement, reconfigurable processors have
to overcome the overheads of reconfiguration, such as the complexity
of the interconnection network and the reconfiguration time. In
processors dealing with multimedia applications, these overheads can be
reduced by providing the reconfigurability inside the processing units
rather than at the interconnection level. The reconfigurable operator
has two inputs x and y and one output z. It
is able to carry out accumulation on different patterns:
addition/subtraction (xi ± yi), multiplication (xi × yi) and
absolute value of difference (|xi − yi|). Signed and unsigned
operations can be carried out. The operator input and output word-length
is equal to 40 bits.
Due to the low-precision nature of multimedia data,
reconfiguration at the operator level also provides additional
speed-up through the exploitation of data-level parallelism. Our processor
provides Sub-Word Parallelism (SWP) capabilities. An operator of
word-length N is split up to execute k operations in parallel on
sub-words of word-length at most N/k. This technique can reduce the code
execution time by up to a factor k. In practice, the SWP operator can
compute five 8-bit operations, four 10-bit operations, three 12-bit
operations, two 16-bit operations or one 40-bit operation. This
operator not only eliminates the need for reconfiguration time but also
provides reconfigurability at both the data size level (different pixel
data sizes) and the operation level (different multimedia-oriented
operations). This ensures a better utilization of the processor's resources
and significantly reduces the reconfiguration overheads.
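The five SWP configurations listed above can be captured in a small lookup table. The following Python sketch is only an illustration: the names SWP_PARALLELISM, smallest_config and occupied_bits are hypothetical and are not part of the ROMA tools. It selects the shortest supported sub-word able to hold a given number of bits and reports how much of the 40-bit operator each configuration occupies.

```python
# SWP configurations listed in the text: sub-word length (bits) -> parallel operations.
SWP_PARALLELISM = {8: 5, 10: 4, 12: 3, 16: 2, 40: 1}
OPERATOR_WIDTH = 40  # operator input/output word-length in bits


def smallest_config(required_bits):
    """Return the shortest supported sub-word length holding `required_bits` bits."""
    for word_length in sorted(SWP_PARALLELISM):
        if word_length >= required_bits:
            return word_length
    raise ValueError(f"{required_bits} bits exceed the {OPERATOR_WIDTH}-bit operator")


def occupied_bits(word_length):
    """Bits of the 40-bit operator actually used by one SWP configuration."""
    return word_length * SWP_PARALLELISM[word_length]


if __name__ == "__main__":
    for bits in (7, 9, 11, 13, 17):
        wl = smallest_config(bits)
        print(bits, "bits ->", wl, "-bit sub-words,", SWP_PARALLELISM[wl], "parallel operations")
```

With these figures the 8-bit and 10-bit configurations fill the 40-bit operator completely, whereas the 12-bit and 16-bit configurations occupy 36 and 32 bits respectively.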
3. DESIGN FLOW
The general flow used to map an application described in C code
with floating-point data types to the ROMA architecture is based on the
generic compilation infrastructure GECOS. First, the fixed-point
conversion of the application is carried out. The data word-length
is optimized according to the data types supported by our
architecture. A Control Data Flow Graph (CDFG) is used as intermediate
representation. Then, the fixed-point application is mapped to the
reconfigurable architecture with the DURASE flow [9].
3.1. Fixed-point conversion
The fixed-point conversion process defines the data type associated
with each variable. This conversion process can be divided into two
main modules. The first part corresponds to the determination of the
integer part word-length of each data. The number of bits for this
integer part must allow the representation of all the values taken by
the data, and is obtained from the data bound values. This first part
requires evaluating the dynamic range of each data. The interval
arithmetic technique is used for this evaluation.
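As an illustration of this first step, the sketch below propagates value bounds through a few operations with interval arithmetic and derives an integer word-length from the resulting bounds. It is a minimal sketch under stated assumptions: a signed two's complement format whose integer part includes the sign bit, and hypothetical helper names (interval_add, interval_sub, interval_scale, integer_bits) that do not come from the paper.

```python
def interval_add(a, b):
    """Bounds of x + y for x in a and y in b."""
    return (a[0] + b[0], a[1] + b[1])


def interval_sub(a, b):
    """Bounds of x - y for x in a and y in b."""
    return (a[0] - b[1], a[1] - b[0])


def interval_scale(a, k):
    """Bounds of k * x for x in a and a constant coefficient k."""
    lo, hi = a[0] * k, a[1] * k
    return (min(lo, hi), max(lo, hi))


def integer_bits(interval):
    """Integer bits (sign bit included) so that [lo, hi] fits in [-2**(b-1), 2**(b-1))."""
    lo, hi = interval
    bits = 1
    while lo < -(2 ** (bits - 1)) or hi >= 2 ** (bits - 1):
        bits += 1
    return bits


if __name__ == "__main__":
    pixel = (0, 255)                       # unsigned 8-bit pixel values
    diff = interval_sub(pixel, pixel)      # difference of two pixels
    acc = diff
    for _ in range(3):                     # accumulate four differences
        acc = interval_add(acc, diff)
    print(diff, integer_bits(diff))
    print(acc, integer_bits(acc))
```

The accumulated interval grows with each addition, which is why the integer part has to be sized from the bounds of the final data and not from the bounds of the inputs.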
The second part corresponds to the determination of the frac-
tional part word-length which depends on the data types supported
by the processor. The data types which minimize the implementation
cost and respect the accuracy constraint are selected. This process is
detailed in the next section.
3.2. Data word-length optimization
The data word-length optimization process must explore the
diversity of the data types supported by the processor. The goal of
this step is to minimize the implementation cost C under a given
accuracy constraint Pbmax. The fixed-point accuracy is evaluated
through the quantization noise power Pb. Let wl be the vector of the
word-lengths of the different operations. The optimization problem is
modelled as follows
\[ \min_{\mathbf{wl}} \; C(\mathbf{wl}) \quad \text{subject to} \quad P_b(\mathbf{wl}) < P_{b_{max}} \qquad (1) \]
Our methodology selects the SWP configuration which respects
the global accuracy constraint Pbmax and minimizes the
implementation cost. For the optimization process, the cost function C()
(implementation cost) and the constraint function Pb() (fixed-point
accuracy) must be evaluated.
3.2.1. Constraint evaluation
Most of the available approaches to evaluate the computation
accuracy due to fixed-point arithmetic are based on a bit-true simulation
of the fixed-point application [5, 3]. Nevertheless, this technique
suffers from a major drawback, which is the time required for the
simulations. An alternative to the simulation-based methods is an
analytical approach. Its main advantage is the reduction of the
execution time of the fixed-point optimization process. Indeed, the
accuracy metric expression is determined only once; then, the
fixed-point system accuracy is evaluated through the computation of a
mathematical expression. The method presented in [10] is used to
obtain this analytical expression.
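To make the benefit of an analytical expression concrete, the sketch below evaluates an output noise power from a closed-form sum instead of a simulation. It is not the method of [10]: it assumes the classical model in which each rounding is a uniform white noise source of power q²/12 with q = 2^(-f) for f fractional bits, and it assumes that each source reaches the output through a known power gain. The gains used here are made-up values.

```python
def rounding_noise_power(frac_bits):
    """Power of a uniform rounding error with step q = 2**-frac_bits, i.e. q**2 / 12."""
    q = 2.0 ** (-frac_bits)
    return q * q / 12.0


def output_noise_power(frac_bits, gains):
    """Closed-form Pb = sum_i K_i * sigma_i**2, evaluated without any simulation."""
    return sum(k * rounding_noise_power(f) for f, k in zip(frac_bits, gains))


if __name__ == "__main__":
    gains = [1.0, 2.0, 0.5]            # hypothetical power gains K_i, computed once
    for frac_bits in ([4, 4, 4], [6, 4, 4], [6, 6, 6]):
        print(frac_bits, output_noise_power(frac_bits, gains))
```

Once the gains are known, testing a new word-length combination only costs one evaluation of this sum, which is what makes the many evaluations required by the optimization affordable.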
3.2.2. Cost evaluation
The aim of this module is to estimate the global implementation cost
according to the SWP configuration selected for each CDFG
operation. Nevertheless, the goal is not to obtain an exact estimation of
the implementation cost but to compare and to select between several
SWP configurations. Thus, a simple estimation model is used
to evaluate the application implementation cost C. This cost can be
the execution time or the energy consumption. The execution time
is considered in the rest of the paper.
An operation oi is characterized by its function γi. The number
of times, ni, that this operation is executed is determined statically
for each operation oi. Let m(wli, γi) be the number of operations of
type γi which can be executed in parallel on the reconfigurable
operator when the operand word-length is equal to wli. Let c(γi) be the
cost to execute an operation of type γi. Let No be the number of
operations in the CDFG. The expression of the global implementation
cost C according to the operation word-lengths wl is as follows
\[ C(\mathbf{wl}) = \sum_{i=1}^{N_o} \left\lceil \frac{n_i}{m(wl_i, \gamma_i)} \right\rceil c(\gamma_i) \qquad (2) \]
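Equation (2) can be evaluated directly, as in the following sketch. Two simplifications are assumptions of this example and not statements of the paper: the parallelism m(wli, γi) is taken from the five SWP configurations of Section 2 regardless of the operation type, and the unit cost c(γi) defaults to one cycle. The operation names are hypothetical.

```python
# Parallelism m(wl, gamma); assumed here to be independent of the operation type.
SWP_PARALLELISM = {8: 5, 10: 4, 12: 3, 16: 2, 40: 1}


def implementation_cost(operations, word_lengths, unit_cost=None):
    """Evaluate Eq. (2): sum over operations of ceil(n_i / m(wl_i, gamma_i)) * c(gamma_i).

    operations   -- list of (gamma_i, n_i) pairs: operation type and execution count
    word_lengths -- list of wl_i, one per operation
    unit_cost    -- optional dict gamma -> c(gamma); one cycle when absent
    """
    unit_cost = unit_cost or {}
    total = 0
    for (gamma, n_exec), wl in zip(operations, word_lengths):
        packets = -(-n_exec // SWP_PARALLELISM[wl])   # integer ceiling of n_i / m
        total += packets * unit_cost.get(gamma, 1)
    return total


if __name__ == "__main__":
    ops = [("abs_diff", 256), ("add", 256)]           # hypothetical kernel
    for wls in ([8, 8], [8, 16], [16, 16], [40, 40]):
        print(wls, implementation_cost(ops, wls))
```

The ceiling matters when ni is not a multiple of the parallelism: 256 operations on 8-bit sub-words need 52 packets of five, the last packet being only partly filled.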
4. OPTIMIZATION ALGORITHM
The data word-length optimization is achieved in two stages. First,
an initial optimized solution is obtained with a heuristic that
extends, for SWP operators, the approach presented
in [1]. This heuristic finds an initial optimized solution
quickly but does not always select the optimal solution. Thus, this
optimized solution is refined with a branch & bound algorithm
integrating different techniques to drastically reduce the search space.
4.1. Determination of the initial optimized solution
For each operation oi, the list of the NSWP SWP configurations
which can be used is determined. Let wl^i_SWP be the vector
storing, in rising order, the operand word-length associated
with each SWP configuration. Let WL be a No × NSWP
matrix representing the search space for the optimization problem.
This matrix WL groups together the vectors wl^i_SWP as follows
\[ WL(i, 1 \ldots N_{SWP}) = {\mathbf{wl}^{i}_{SWP}}' \]
The proposed heuristic, described in Algorithm 1, is made up of
two steps: the determination of a starting solution
and the search for the optimized solution which respects
the accuracy constraint. The starting solution WL(jmwc)
corresponds to the combination of the minimal word-lengths associated
with each operation. The minimal word-length for operation oi
corresponds to the word-length which minimizes the cost and fulfils the
accuracy constraint when all the other word-lengths are set to their
maximal value. For the search of the optimized solution WL(jmin),
an iterative approach is used and the combination of the minimal
word-lengths is used as starting solution. For this combination, the accuracy
constraint is no longer fulfilled. At each iteration, the SWP
configuration associated with one operation oi is changed. The choice of
this operation is based on the metric ∇, which
evaluates the gain in terms of cost and accuracy for this new SWP
configuration. The metric ∇i computes the ratio between the
gradient of the accuracy and the gradient of the cost for operation oi,
where wlΔ denotes the vector wl in which operation oi has been moved
to its next SWP configuration
\[ \nabla_i \leftarrow \frac{P_b(\mathbf{wl}_{\Delta}) - P_b(\mathbf{wl})}{C(\mathbf{wl}_{\Delta}) - C(\mathbf{wl})} \qquad (3) \]
The operation which minimizes the cost increase and maximizes
the accuracy increase is chosen. The algorithm stops when the
accuracy constraint is fulfilled.
Algorithm 1 WL Optimization for SWP processors
{ Starting solution search }
for i = 1 . . . No do
    wl ← WL(1 . . . No, NSWP)
    j ← NSWP
    while Pb(wl) < Pbmax do
        j ← j − 1
        wl(i) ← WL(i, j)
    end while
    jmwc(i) ← j + 1
end for
{ Optimized solution search }
∀ i = 1 . . . No: wl(i) ← WL(i, jmwc(i))
j ← jmwc
while Pb(wl) ≥ Pbmax do
    for i = 1 . . . No do
        wlΔ ← wl
        wlΔ(i) ← WL(i, j(i) + 1)
        ∇i ← (Pb(wlΔ) − Pb(wl)) / (C(wlΔ) − C(wl))
    end for
    k ← argmax_i(∇i)
    j(k) ← j(k) + 1
    wl(k) ← WL(k, j(k))
end while
jmin ← j
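The two steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the heuristic, not the authors' implementation. The noise model (one rounding source of power 2^(-2wl)/12 per operation, scaled by a made-up gain) and the cost model (Eq. (2) with unit costs) are toy assumptions, and the selection score is written as noise reduction per unit of extra cost so that a larger value is better, which is an assumed sign convention.

```python
import math

WORD_LENGTHS = [8, 10, 12, 16, 40]               # supported word-lengths, rising order
PARALLELISM = {8: 5, 10: 4, 12: 3, 16: 2, 40: 1}


def noise_power(word_lengths, gains):
    """Toy Pb: one rounding source per operation, scaled by an assumed gain."""
    return sum(g * 2.0 ** (-2 * wl) / 12.0 for wl, g in zip(word_lengths, gains))


def cost(word_lengths, counts):
    """Eq. (2) with a unit cost of one cycle per packet."""
    return sum(-(-n // PARALLELISM[wl]) for wl, n in zip(word_lengths, counts))


def heuristic(gains, counts, pb_max):
    n_ops, top = len(gains), len(WORD_LENGTHS) - 1

    def pb(indices):
        return noise_power([WORD_LENGTHS[j] for j in indices], gains)

    def c(indices):
        return cost([WORD_LENGTHS[j] for j in indices], counts)

    # Step 1: minimal word-length of each operation, all others at their maximum.
    j = []
    for i in range(n_ops):
        trial, smallest = [top] * n_ops, top
        for candidate in range(top, -1, -1):
            trial[i] = candidate
            if pb(trial) >= pb_max:
                break
            smallest = candidate
        j.append(smallest)

    # Step 2: lengthen one operation per iteration until the constraint holds.
    while pb(j) >= pb_max:
        base_pb, base_c = pb(j), c(j)
        best_k, best_score = None, -math.inf
        for i in range(n_ops):
            if j[i] == top:
                continue
            trial = list(j)
            trial[i] += 1
            noise_gain = base_pb - pb(trial)
            extra_cost = c(trial) - base_c
            score = noise_gain / extra_cost if extra_cost > 0 else math.inf
            if score > best_score:
                best_k, best_score = i, score
        if best_k is None:
            raise ValueError("accuracy constraint cannot be met")
        j[best_k] += 1
    return [WORD_LENGTHS[x] for x in j]


if __name__ == "__main__":
    print(heuristic(gains=[1.0, 1.0, 1.0], counts=[256, 256, 16], pb_max=1e-7))
```

In this toy case the starting solution uses 10 bits for every operation and violates the constraint. The greedy step first lengthens the rarely executed operation, whose extra cost is small, and then one of the frequently executed ones.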
4.2. Refinement of the initial optimized solution
To refine the optimized solution WL(jmin) obtained with the
heuristic presented above, a branch & bound (B&B) algorithm is
carried out on a limited search space to obtain reasonable
optimization times. Different techniques are used to drastically limit the
search space. For each operation oi, the search space retained for this
B&B algorithm is made up of the SWP configuration jmin(i)
obtained with the heuristic and of the previous (jmin(i) − 1) and next
(jmin(i) + 1) SWP configurations of the initial search space. Thus,
each optimization variable takes at most three values.
This limitation to three values per variable is applied because the
probability that the distance between the optimal solution and the
initial optimized solution jmin is greater than two SWP
configurations is very low. Indeed, in this case, for our ROMA architecture,
it implies a difference of more than 4 bits between the two
solutions. Let WL(jopt) be the solution obtained with the refinement
approach.
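The restriction of the search space can be sketched as follows. The function name restricted_search_space and the 0-based configuration indices are choices of this example only.

```python
import math


def restricted_search_space(j_min, n_swp):
    """Keep, for each operation, the heuristic configuration and its two neighbours."""
    return [[j for j in (jm - 1, jm, jm + 1) if 0 <= j < n_swp] for jm in j_min]


if __name__ == "__main__":
    j_min = [0, 2, 4]                      # hypothetical heuristic result, 0-based
    space = restricted_search_space(j_min, n_swp=5)
    print(space)
    print("restricted combinations:", math.prod(len(s) for s in space))
    print("full combinations:", 5 ** len(j_min))
```

Operations whose heuristic configuration is the first or the last one keep only two candidates, so the number of combinations to explore is at most 3^No, compared with NSWP^No for the full search space.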
This optimization technique, based on a tree exploration, is sensitive
to the order in which the nodes are evaluated. To find a good solution
quickly and thus reduce the search space, the variables with the greatest
influence on the optimization process must be evaluated first. The metric ∇i
is used to order the optimization variables. The variables are ordered
by decreasing values of ∇i.
In the B&B algorithm, the partial solutions are evaluated to stop
the tree exploration if they cannot lead to the best solution. At the
tree level l, the exploration of the subtree induced by the node
representing wl(l) can be stopped if the minimal execution time which
can be obtained during the exploration of this subtree is greater than
the minimal execution time which has already been obtained. The
minimal execution time is determined by selecting, for each operation oj
(j ∈ [l + 1, No]), the SWP configuration with the minimal cost. At
the beginning of the B&B algorithm, the minimal cost is initialized
with the optimized cost C(WL(jmin)) obtained with the heuristic
algorithm. Compared to a classical B&B algorithm, it is not
necessary to find a first solution before applying this search space
limitation technique. This initial value for the minimal cost
avoids the exploration of many branches of the tree.
Likewise, at the tree level l, the exploration of the subtree
induced by the node representing wl(l) can be stopped if the minimal
value of the quantization noise power which can be obtained during
the exploration of this subtree is higher than the accuracy constraint
(Pbmax). This minimal noise power, i.e. the maximal SQNR, is obtained
by setting the word-lengths wl(j) (j ∈ [l + 1, No]) to their maximal values.
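The two pruning tests described above can be sketched with a small recursive exploration. As before, this is only an illustration under toy assumptions: the noise is modelled as one additive rounding source per operation with a made-up gain, the cost follows Eq. (2) with unit costs, and the function names are hypothetical. The initial_cost argument plays the role of C(WL(jmin)).

```python
import math

PARALLELISM = {8: 5, 10: 4, 12: 3, 16: 2, 40: 1}


def op_cost(wl, count):
    return -(-count // PARALLELISM[wl])            # ceil(count / m)


def op_noise(wl, gain):
    return gain * 2.0 ** (-2 * wl) / 12.0          # toy additive noise model


def branch_and_bound(candidates, gains, counts, pb_max, initial_cost=math.inf):
    n_ops = len(candidates)
    # Best case for the levels still to explore: cheapest cost, lowest noise.
    cost_tail = [0] * (n_ops + 1)
    noise_tail = [0.0] * (n_ops + 1)
    for i in range(n_ops - 1, -1, -1):
        cost_tail[i] = cost_tail[i + 1] + min(op_cost(w, counts[i]) for w in candidates[i])
        noise_tail[i] = noise_tail[i + 1] + min(op_noise(w, gains[i]) for w in candidates[i])

    best = {"cost": initial_cost, "wl": None}

    def explore(level, partial, cost_so_far, noise_so_far):
        if level == n_ops:
            if best["wl"] is None or cost_so_far < best["cost"]:
                best["cost"], best["wl"] = cost_so_far, list(partial)
            return
        for wl in candidates[level]:
            cost = cost_so_far + op_cost(wl, counts[level])
            noise = noise_so_far + op_noise(wl, gains[level])
            if cost + cost_tail[level + 1] > best["cost"]:
                continue                            # cannot beat the best known cost
            if noise + noise_tail[level + 1] >= pb_max:
                continue                            # cannot meet the accuracy constraint
            partial.append(wl)
            explore(level + 1, partial, cost, noise)
            partial.pop()

    explore(0, [], 0, 0.0)
    return best["cost"], best["wl"]


if __name__ == "__main__":
    candidates = [[10, 12, 16], [8, 10, 12], [10, 12, 16]]   # three values per variable
    print(branch_and_bound(candidates, [1.0, 1.0, 1.0], [256, 256, 16],
                           pb_max=1e-7, initial_cost=156))
```

Because the bound on the cost is available from the start, whole subtrees are discarded before any complete solution has been reached, which is the benefit claimed for initializing the search with the heuristic cost.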
5. EXPERIMENTS
Different experiments have been carried out on multimedia kernels
to underline the efficiency of our approach. The data stored in the
local memory are assumed to be aligned. The Pareto
curve for the Sum of Absolute Differences (SAD) algorithm is given in
Figure 1. The Pareto curve gives the best implementation cost obtained
for different accuracy constraints and shows the trade-off between
the computation accuracy and the implementation cost. The number
of variables in the optimization process is limited because all
operations processing the different elements of the same vector use the
same fixed-point format. This constraint on the word-length
optimization yields a more homogeneous solution able to
exploit the SWP capabilities of the architecture. On the Pareto curve,
four points sx are plotted; each corresponds to the solution obtained
for an SWP configuration using x-bit inputs. For this algorithm, the
number of points sx on the Pareto curve is limited because the
reconfigurable operator is able to implement directly the pattern
associated with the SAD algorithm, and thus the SWP configuration fixes
the word-lengths inside the SAD pattern. The implementation cost,
corresponding to the execution time, is normalized so as to give directly
the acceleration factor due to SWP operations.
Fig. 1. Pareto curve for the SAD algorithm on 16 × 16 blocks.
For the other algorithms, the coordinates (SQNRx; Cx)
associated with each point sx of the Pareto curve are given in Table 1.
The Signal-to-Quantization-Noise Ratio (SQNR) is computed from the noise
power Pb. The results show the ability of our approach to exploit the
SWP capabilities associated with our architecture. Different
trade-offs between the implementation cost and the computation accuracy
can be obtained. Thus, relaxing the accuracy constraint offers
opportunities to decrease the implementation cost. Compared to a
classical processor without any SWP capabilities, the maximal
acceleration factor is equal to five when 8-bit data are used. As shown by
the SATD example, the five supported data types can be fully
exploited only when the number of processed data is large. For
this example, only two distinct execution times are obtained because
the size of the block lines is equal to four, and thus only one packet
of four data or two packets of two data can be used.
SWP configuration   s8          s10         s12         s16
SAD (16 × 16)       (39;54)     (51;66)     (63;88)     (87;130)
SAD (8 × 8)         (51;15)     (63;18)     (75;24)     (99;34)
SATD (4 × 4)        (20;40)     (32;40)     (44;80)     (68;80)
DCT 8               (34;256)    (46;256)    (58;384)    (82;512)
DCT 16              (37;2048)   (50;2048)   (62;3072)   (86;4096)
Table 1. Points sx: (SQNRx; Cx) of the Pareto curve for different
applications. The Signal-to-Quantization-Noise Ratio SQNRx is expressed in
dB and the cost Cx in cycles.
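The SQNR values of Table 1 follow a regular pattern that can be checked with the usual definition SQNR = 10 log10(Ps/Pb). The sketch below assumes that every additional bit of word-length is given to the fractional part and that the noise power follows the q²/12 rounding model, so that each extra bit divides Pb by four. These are assumptions of the example and not statements of the paper.

```python
import math


def sqnr_db(signal_power, noise_power):
    """Signal-to-quantization-noise ratio in dB."""
    return 10.0 * math.log10(signal_power / noise_power)


def sqnr_gain_db(extra_bits):
    """SQNR gain when the noise power is divided by 4**extra_bits."""
    return 10.0 * math.log10(4.0 ** extra_bits)


# SQNR column of Table 1 for the SAD (16 x 16) row, indexed by input word-length.
SAD_16X16_SQNR = {8: 39, 10: 51, 12: 63, 16: 87}

if __name__ == "__main__":
    widths = sorted(SAD_16X16_SQNR)
    for narrow, wide in zip(widths, widths[1:]):
        measured = SAD_16X16_SQNR[wide] - SAD_16X16_SQNR[narrow]
        predicted = sqnr_gain_db(wide - narrow)
        print(f"{narrow} -> {wide} bits: table {measured} dB, model {predicted:.1f} dB")
```

Under these assumptions the model predicts about 6 dB per additional bit, which matches the 12 dB and 24 dB steps between the s8, s10, s12 and s16 points of the SAD rows.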
To analyze the efficiency of our approach, the quality and the
execution time of the global optimization process have been measured
and compared with a classical B&B algorithm like the one presented
in [8]. In this classical algorithm, the search space is not limited to
3 values per variable and no initial value is available for the minimal
cost. A first solution has to be found before applying the test on the
minimal cost.
To analyze the quality, the solution obtained with the classical
B&B algorithm is used as a reference. The relative error between
the solution obtained with our approach and the reference solution
has been measured for different accuracy constraints and different
applications. For these different experiments, our global
optimization approach always finds the same solution as the classical B&B
algorithm. The initial optimized solution and the refined solution
are identical in 76% of the cases. The maximal over-cost of the initial
optimized solution is less than 9.6%.
The execution time has been measured for different numbers
of variables in the optimization problem. The maximal execution
times obtained for different accuracy constraints are reported in
Table 2. The results for the initial optimized solution obtained with
the heuristic and for the solution obtained after refinement are given.
The execution time of the heuristic approach is very low and can be
neglected compared to that of the refinement algorithm. Thus, the global
execution time of our approach is close to that of the refinement
algorithm. Our approach significantly reduces the execution
time compared to the classical B&B algorithm. The heuristic
algorithm finds a good optimized solution and can be used
stand-alone if the optimization execution time is a crucial parameter.
Maximal execution time (s)
Number of variables          9      13     18     36
Initial optimized solution   0.3    0.45   1.09   2.4
Refined solution             0.4    6.3    38     44
Classical B&B                0.9    190    5212   5787
Table 2. Maximal optimization time (s) according to the number of
variables to optimize.
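The reduction of the optimization time can be quantified from the figures of Table 2 with a few lines of Python. The sketch simply divides the classical B&B time by the time reported for the refined solution. Whether the latter includes the heuristic time is not stated in the table, so the ratios should be read as indicative.

```python
# Maximal optimization times (s) copied from Table 2.
VARIABLES = [9, 13, 18, 36]
REFINED_S = [0.4, 6.3, 38, 44]
CLASSICAL_S = [0.9, 190, 5212, 5787]

# Ratio between the classical B&B time and the refined-solution time.
speedups = [classical / refined for classical, refined in zip(CLASSICAL_S, REFINED_S)]

if __name__ == "__main__":
    for n_vars, ratio in zip(VARIABLES, speedups):
        print(f"{n_vars} variables: about {ratio:.1f} times faster")
```

The ratio grows from about 2 for 9 variables to more than 130 for 18 and 36 variables.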
6. CONCLUSION
To implement complex multimedia applications in embedded
systems, high-performance, energy-efficient and flexible architectures
are needed. To reduce the time-to-market, design automation tools
are required to implement the applications efficiently on this kind
of architecture. In this paper, a new approach for word-length
optimization in the case of architectures able to manipulate several data
types has been presented. The search space of the B&B algorithm is
drastically reduced thanks to an initial optimized solution obtained
with a heuristic algorithm. Our approach benefits efficiently
from the diversity of the data types supported by the architecture.
7. REFERENCES
[1] M.-A. Cantin, Y. Savaria, and P. Lavoie. A comparison of automatic word length
optimization procedures. Circuits and Systems, 2002. ISCAS 2002. IEEE International
Symposium on, 2:II–612–II–615 vol.2, 2002.
[2] M. Clark, M. Mulligan, D. Jackson, and D. Linebarger. Accelerating Fixed-Point
Design for MB-OFDM UWB Systems. CommsDesign, January 2005.
[3] M. Coors, H. Keding, O. Lüthje, and H. Meyr. Fast Bit-True Simulation. In
IEEE/ACM Design Automation Conference 2001 (DAC 01), pages 708–713, Las
Vegas, US, June 2001.
[4] M. Coors, H. Keding, O. Lüthje, and H. Meyr. Design and DSP Implementation
of Fixed-Point Systems. EURASIP Journal on Applied Signal Processing,
2002(9):908–925, 2002.
[5] S. Kim, K. Kum, and S. Wonyong. Fixed-Point Optimization Utility for C and
C++ Based Digital Signal Processing Programs. IEEE Transactions on Circuits
and Systems II, 45(11):1455–1464, November 1998.
[6] K. Kum, J.Y. Kang, and W.Y. Sung. AUTOSCALER for C: An optimizing
floating-point to integer C program converter for fixed-point digital signal
processors. IEEE Transactions on Circuits and Systems II - Analog and Digital Signal
Processing, 47(9):840–848, September 2000.
[7] D. Menard, E. Casseau, S. Khan, O. Sentieys, S. Chevobbe, S. Guyetant, and
R. David. Reconfigurable Operator Based Multimedia Embedded Processor. In
Reconfigurable Computing: Architectures, Tools and Applications, volume 5453,
pages 39–49, Karlsruhe, Germany, 2009. Springer Berlin / Heidelberg.
[8] D. Menard, D. Chillet, and O. Sentieys. Floating-to-fixed-point Conversion for
Digital Signal Processors. EURASIP Journal on Applied Signal Processing, Special
issue on Design Methods for DSP Systems, 2006:1–19, January 2006.
[9] E. Raffin et al. Scheduling, binding and routing system for a run-time
reconfigurable operator based multimedia architecture. In Workshop on Design and
Architectures for Signal and Image Processing DASIP 2010, Edinburgh, November 2010.
[10] R. Rocher, D. Menard, P. Scalart, and O. Sentieys. Analytical accuracy
evaluation of Fixed-Point Systems. In 12th European Signal Processing Conference
(EUSIPCO 2007), Poznan, Poland, September 2007.
