Efficient Design Methods for Embedded Communication Systems by M. Holzer et al.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2006, Article ID 64913, Pages 1–18
DOI 10.1155/ES/2006/64913
Efficient Design Methods for Embedded
Communication Systems
M. Holzer, B. Knerr, P. Belanović, and M. Rupp
Institute for Communications and Radio Frequency Engineering, Vienna University of Technology,
Gußhausstraße 25/389, 1040 Vienna, Austria
Received 1 December 2005; Revised 11 April 2006; Accepted 24 April 2006
Nowadays, design of embedded systems is confronted with complex signal processing algorithms and a multitude of computationally
intensive multimedia applications, while time to product launch has been extremely reduced. Especially in the wireless domain,
those challenges are stacked with tough requirements on power consumption and chip size. Unfortunately, design productivity did
not undergo a similar progression, and therefore fails to cope with the heterogeneity of modern architectures. Electronic design
automation tools exhibit deep gaps in the design flow like high-level characterization of algorithms, floating-point to fixed-point
conversion, hardware/software partitioning, and virtual prototyping. This tutorial paper surveys several promising approaches to
solve the widespread design problems in this field. An overview of consistent design methodologies that establish a framework
for connecting the different design tasks is given. This is followed by a discussion of solutions for the integrated automation of
specific design tasks.
Copyright © 2006 M. Holzer et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION
Over the past 25 years, the field of wireless communications
has experienced a rampant growth, in both popularity and
complexity. It is expected that the global number of mobile
subscribers will reach more than three billion in the year 2008
[1]. Also, the complexity of the modern communication sys-
tems is growing so rapidly, that the next generation of mo-
bile devices for 3G UMTS systems is expected to be based on
processors containing more than 40 million transistors [2].
Hence, during this relatively short period of time, a stagger-
ing increase in complexity of more than six orders of magni-
tude has taken place [3].
In comparison to this extremely fast-paced growth in
algorithmic complexity, the concurrent increase in the
complexity of silicon-integrated circuits proceeds according to
the well-known Moore's law [4], famously predicting the
doubling of the number of transistors integrated onto a single
integrated circuit every 18 months. Hence, it can be concluded
that the growth in silicon complexity lags behind the extreme
growth in the algorithmic complexity of wireless
communication systems. This is also known as the algorithmic
complexity gap.
At the same time, the International Technology Roadmap
for Semiconductors [5] reported a growth in design produc-
tivity, expressed in terms of designed transistors per staﬀ-
month, of approximately 21% compounded annual growth
rate (CAGR), which lags behind the growth in silicon com-
plexity. This is known as the design gap or productivity gap.
The existence of both the algorithmic and the produc-
tivity gaps points to ineﬃciencies in the design process. At
various stages in the process, these ineﬃciencies form bottle-
necks, impeding increased productivity which is needed to
keep up with the mentioned algorithmic demand.
In order to clearly identify these bottlenecks in the design
process, we classify them into internal and external barriers.
Many potential barriers to design productivity arise from
the design teams themselves, their organisation, and inter-
action. The traditional team structure [6] consists of the re-
search (or algorithmic), the architectural, and the implemen-
tation teams. Hence, it is clear that the eﬃciency of the design
process, in terms of both time and cost, depends not only on
the forward communication structures between teams, but
also on the feedback structures (i.e., bug reporting) in the
design process. Furthermore, the design teams use separate
system descriptions. Additionally, these descriptions are very
likely written in diﬀerent design languages.
In addition to these internal barriers, there exist several
external factors which negatively affect the efficiency of the
design process. Firstly, the work of separate design teams is
supported by a wide array of different EDA software tools.
Thus, each team uses a set of tools completely separate from
that of any other team in the design process. Moreover, these tools
are almost always incompatible, preventing any direct and/or
automated cooperation between teams.
Figure 1: Design flow with several automated design steps: algorithm analysis (Section 3), bitwidth optimization (Section 4), HW/SW partitioning (Section 5), and virtual prototyping (Section 6).
Also, EDA tool support exhibits several “gaps,” that is,
parts of the design process which are critical, yet for which
no automated tools are available. Although they have high
impact on the rest of the design process, these steps typically
have to be performed manually, due to their relatively large
complexity, thus requiring designer intervention and eﬀort.
Designers typically leverage their previous experience to a
large extent when dealing with these complex issues.
In Figure 1 a design flow is shown, which identifies sev-
eral intermediate design steps (abstraction levels) that have
to be covered during the refinement process. This starts with
an algorithm that is described and verified, for example, in
a graphical environment with SystemC [7]. Usually in the
wireless domain algorithms are described by a synchronous
data flow graph (SDFG), where functions (A, B, C, D, E)
communicate with fixed data rates to each other. An interme-
diate design step is shown, where already hardware/software
partitioning has been accomplished, but the high abstraction
of the signal processing functions is still preserved. Finally
the algorithm is implemented utilising a heterogeneous
architecture that consists of processing elements (DSPs, ASICs),
memory, and a bus system.
Figure 1 also names several design tasks whose automation
promises a large reduction in design time.
This paper discusses the requirements and solutions for an
integrated design methodology in Section 2. Section 3
reports on high-level characterisation techniques, which
provide early estimates of the final system properties and
allow first design decisions to be made. Section 4 presents
environments for the conversion of data from floating-point to
fixed-point representation. Approaches for automated
hardware/software partitioning are shown in Section 5. The
decrease of design time by virtual prototyping is presented in
Section 6. Finally, conclusions end the paper.
2. CONSISTENT DESIGN FLOW
2.1. Solution requirements
In the previous section, a number of acute bottlenecks in the
design process have been identified. In essence, an environ-
ment is needed, which transcends the interoperability prob-
lems of modern EDA tools. To achieve this, the environment
has to be flexible in several key aspects.
Firstly, the environment has to be modular in nature.
This is required to allow expansion to include new tools as
they become available, as well as to enable the designer to
build a custom design flow only from those tools which are
needed.
Also, the environment has to be independent from any
particular vendor’s tools or formats. Hence, the environment
will be able to integrate tools from various vendors, as well
as academic/research projects, and any in-house developed
automation, such as scripts, templates, or similar.
To allow unobstructed communication between teams,
the environment should eliminate the need for separate sys-
tem descriptions. Hence, the single system description, used
by all the teams simultaneously, would provide the ultimate
means of cooperative refinement of a design, from the ini-
tial concept to the final implementation. Such a single system
description should also be flexible through having a modu-
lar structure, accommodating equally all the teams. Thus, the
structure of the single system description is a superset of all
the constructs required by all the teams, and the contents of
the single system description is a superset of all the separate
system descriptions used by the teams currently.
2.2. Survey of industrial and university approaches
Several research initiatives, both in the commercial and aca-
demic arenas, are currently striving to close the design and
productivity gaps. This section presents a comparative sur-
vey of these eﬀorts.
A notable approach to EDA tool integration is provided
by the model integrated computing (MIC) community [8].
This academic concept of model development gave rise to an
environment for tool integration [9]. In this environment,
the need for centering the design process on a single descrip-
tion of the system is also identified, and the authors present
an implementation in the form of an integrated model server
(IMS), based on a database system. The entire environment
is expandable and modular in structure,
with each new tool introduced into the environment requiring
a new interface. The major shortcoming of this environment
is its dedication to the development of software components
only. As such, this approach addresses solely the
algorithmic modelling of the system, resulting in software at the
application level. Thus, this environment does not support
the architectural and implementation levels of the design
process.
Synopsys is one of the major EDA tool vendors oﬀer-
ing automated support for many parts of the design pro-
cess. Recognising the increasing need for eﬃciency in the de-
sign process and integration of various EDA tools, Synopsys
developed a commercial environment for tool integration,
the Galaxy Design Platform [10]. This environment is also
based on a single description of the system, implemented as
a database and referred to as the open Milkyway database.
Thus, this environment eliminates the need for rewriting sys-
tem descriptions at various stages of the design process. It
also covers both the design and the verification processes and
is capable of integrating a wide range of Synopsys commer-
cial EDA tools. An added bonus of this approach is the open
nature of the interface format to the Milkyway database, al-
lowing third-party EDA tools to be integrated into the tool
chain, if these adhere to the interface standard. However, this
environment is essentially a proprietary scheme for integrat-
ing existing Synopsys products, and as such lacks any support
from other parties.
The SPIRIT consortium [11] acknowledges the inherent
ineﬃciency of interfacing incompatible EDA tools from var-
ious vendors. The work of this international body focuses on
creating interoperability between diﬀerent EDA tool vendors
from the point of view of their customers, the product devel-
opers. Hence, the solution oﬀered by the SPIRIT consortium
[12] is a standard for packaging and interfacing of IP blocks
used during system development. The existence and adop-
tion of this standard ensures interoperability between EDA
tools of various vendors as well as the possibility for integra-
tion of IP blocks which conform to the standard. However,
this approach requires widest possible support from the EDA
industry, which is currently lacking. Also, even the full adop-
tion of this IP interchange format does not eliminate the need
for multiple system descriptions over the entire design pro-
cess. Finally, the most serious shortcoming of this method-
ology is that it provides support only for the lower levels of
the design process, namely, the lower part of the architecture
level (component assembly) and the implementation level.
In the paper of Posadas et al. [13] a single source de-
sign environment based on SystemC is proposed. Within
this environment analysis tools are provided for time estima-
tions for either hardware or software implementations. Af-
ter this performance evaluation, it is possible to insert hard-
ware/software partitioning information directly in the Sys-
temC source code. Further, the generation of software for
real-time application is addressed by a SystemC-to-eCos li-
brary, which replaces the SystemC kernel by real-time oper-
ating system functions. Despite being capable of describing a
system consistently on diﬀerent abstraction levels based on a
single SystemC description, this does not oﬀer a concrete and
general basis for integration of design tools at all abstraction
levels.
Raulet et al. [14] present a rapid prototyping environ-
ment based on a single tool called SynDex. Within this envi-
ronment the user starts by defining an algorithm graph, an
architecture graph, and constraints. Further executables for
special kernels are automatically generated, while heuristics
are used to minimize the total execution time of the algo-
rithm. Those kernels provide the functionality of implemen-
tations in software and hardware, as well as models for com-
munication.
The open tool integration environment (OTIE) [15] is
a consistent design environment, aimed at fulfilling the re-
quirements set out in Section 2.1. This environment is based
on the single system description (SSD), a central repository
for all the refinement information during the entire design
process. As such, the SSD is used simultaneously by all the de-
sign teams. In the OTIE, each tool in the design process still
performs its customary function, as in the traditional tool
chain, but the design refinements from all the tools are now
stored in just one system description (the SSD) and thus
no longer subject to constant rewriting. Hence, the SSD is a
superset of all the system descriptions present in the tradi-
tional tool chain.
The SSD is implemented as a MySQL [16] database,
which brings several benefits. Firstly, the database implemen-
tation of the SSD supports virtually unlimited expandability,
in terms of both structure and volume. As new refinement
information arrives to be stored in the SSD, either it can be
stored within the existing structure, or it may require an ex-
tension to the entity-relationship structure of the SSD, which
can easily be achieved through addition of new tables or links
between tables. Also, the database, on which this implemen-
tation of the SSD is based, is inherently a multiuser system,
allowing transparent and uninterrupted access to the con-
tents of the SSD to all the designers simultaneously. Further-
more, the security of the database implementation of the SSD
is assured through detailed setting of access privileges of each
team member and integrated EDA design tool to each part of
the SSD, as well as the seamless integration of a version con-
trol system, to automatically maintain revision history of all
the information in the SSD. Finally, accessing the refinement
information (both manually and through automated tools)
is greatly simplified in the database implementation of the
SSD by its structured query language (SQL) interface.
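The expandability argument can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and column names (process, bitwidth, signal, bits) are hypothetical, not taken from the OTIE schema. The sketch only shows how refinement data from a newly integrated tool could be attached by adding a table linked to an existing one, and then read back through SQL.

```python
import sqlite3


def build_ssd():
    """Create a toy single system description (SSD) in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")
    # Core structure: the processes (functions) of the algorithmic model.
    db.execute(
        "CREATE TABLE process (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)"
    )
    db.executemany("INSERT INTO process (name) VALUES (?)", [("A",), ("B",), ("C",)])
    return db


def add_bitwidth_extension(db):
    """Extend the structure with a new linked table, as a new tool might require."""
    db.execute(
        "CREATE TABLE bitwidth ("
        " process_id INTEGER NOT NULL REFERENCES process(id),"
        " signal TEXT NOT NULL,"
        " bits INTEGER NOT NULL)"
    )


def store_bitwidth(db, process, signal, bits):
    """Store one refinement result for a signal of the named process."""
    (pid,) = db.execute(
        "SELECT id FROM process WHERE name = ?", (process,)
    ).fetchone()
    db.execute("INSERT INTO bitwidth VALUES (?, ?, ?)", (pid, signal, bits))


def bitwidths_of(db, process):
    """Read back all stored bitwidths of one process, sorted by signal name."""
    rows = db.execute(
        "SELECT b.signal, b.bits FROM bitwidth b"
        " JOIN process p ON p.id = b.process_id"
        " WHERE p.name = ? ORDER BY b.signal",
        (process,),
    )
    return rows.fetchall()
```

The existing process table is left untouched by the extension, which is the property the text attributes to the entity-relationship structure of the SSD.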
Several EDA tool chains have been integrated into the
OTIE, including environments for virtual prototyping [17,
18], hardware/software partitioning [19], high-level system
characterisation [20], and floating-point to fixed-point con-
version [21]. The deployment of these environments has
shown the ability of the OTIE concept to reduce the design
eﬀort drastically through increased automation, as well as
close the existing gaps in the automation coverage, by inte-
grating novel EDA tools as they become available.
3. SYSTEM ANALYSIS
For the design of a signal processing system consisting of
hardware and software many diﬀerent programming lan-
guages have been introduced like VHDL, Verilog, or Sys-
temC. During the refinement process it is of paramount im-
portance to assure the quality of the written code and to base
the design decisions on reliable characteristics. Those char-
acteristics of the code are called metrics and can be identified
on the diﬀerent levels of abstraction.
The terms metric and measure are used as synonyms
in the literature. In general, a metric is a measurement
that maps an empirical object to a numerical object. This
mapping should preserve all relations and structures. In other
words, a quality characteristic should be linearly related to
its measure, which is a basic concept of measurement in general.
Those metrics can be software related or hardware related.
3.1. Software-related metrics
In the area of software engineering, interest in the
measurement of software properties has been ongoing since the first
programming languages appeared [22]. One of the earliest soft-
Figure 2: Control flow graph (CFG) and expression tree of one basic block.
In general, the algorithm inside a function, written in the
form of sequential code, can be decomposed into its control
flow graph (CFG), built up of interconnected basic blocks
(BB). Each basic block contains a sequence of data
operations ending in a control flow statement as its last instruction.
A control flow graph is a directed graph with only one root
and one exit. The root is a vertex with no incoming edge,
and the exit is a vertex with no outgoing edge. Due
to programming constructs like loops, those graphs are not
cycle-free. The sequence of data operations inside one BB
itself forms a data flow graph (DFG) or, equivalently, one or
more expression trees. Figure 2 shows an example of a
function and its graph descriptions.
For the generation of the DFG and CFG, the source code
has to be parsed. This task
is usually performed by a compiler. Compilation
is separated into two steps. Firstly, a front end transforms
the source code into an intermediate representation (abstract
syntax tree). At this step, target-independent optimizations
such as dead code elimination or constant propagation are
already applied. In a second step, the intermediate representation is
mapped to a target architecture.
The analysis of a CFG can have diﬀerent scopes: a small
number of adjacent instructions, a single basic block, across
several basic blocks (intraprocedural), across procedures (in-
terprocedural), or a complete program.
For the CFG and DFG some common basic properties
can be identified as follows.
(i) For each graph type G, a set of vertices V and a set of edges E
can be defined, where |V| denotes the number of
vertices and |E| denotes the number of edges.
(ii) A path of G is defined as an ordered sequence
$S = (v_{\mathrm{root}}, v_x, v_y, \ldots, v_{\mathrm{exit}})$ of vertices starting at the root
and ending at the exit vertex.




Figure 3: Degree of parallelism for γ = 1 and γ > 1.
(iii) The path with the maximum number of vertices is
called the longest path or critical path and consists of
$|V_{LP}|$ vertices.
(iv) The degree of parallelism γ [24] can be defined as the
number of all vertices |V| divided by the number of
vertices in the longest path $|V_{LP}|$ of the algorithm:
$$\gamma = \frac{|V|}{|V_{LP}|}. \tag{1}$$
In Figure 3 it can be seen that for a γ value of 1, the graph
is sequential and for γ > 1 the graph has many vertices in
parallel, which oﬀers possibilities for the reuse of resources.
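As a minimal sketch of how γ could be computed, the following Python functions take a graph given as a successor dictionary and divide the number of vertices by the number of vertices on the longest root-to-exit path. The function names and the dictionary representation are assumptions of this example. The graph is also assumed to be acyclic (for instance, a CFG whose loops have been collapsed beforehand), and every vertex other than the exit is assumed to have at least one successor.

```python
def longest_path_vertices(succ, root, exit_):
    """Number of vertices on the longest root-to-exit path of an acyclic graph.

    `succ` maps every vertex to the list of its successors.
    """
    memo = {}

    def depth(v):
        # Longest path (counted in vertices) from v to the exit vertex.
        if v not in memo:
            if v == exit_:
                memo[v] = 1
            else:
                memo[v] = 1 + max(depth(s) for s in succ[v])
        return memo[v]

    return depth(root)


def degree_of_parallelism(succ, root, exit_):
    """gamma = |V| / |V_LP| for the graph described by `succ`."""
    vertices = set(succ)
    for targets in succ.values():
        vertices.update(targets)
    return len(vertices) / longest_path_vertices(succ, root, exit_)
```

A purely sequential chain of three vertices gives γ = 1, whereas a graph with three parallel branches between root and exit has five vertices and a longest path of three, giving γ = 5/3.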
In order to render the CFG context more precisely, we can
apply these properties and define some important metrics to
characterise the algorithm.
Definition 1 (longest path weight for operation j). Every vertex
of a CFG can be annotated with a set of different weights
$w(v_i) = (w_{i1}, w_{i2}, \ldots, w_{im})^T$, $i = 1, \ldots, |V|$, that describes the
occurrences of its internal operations (e.g., $w_{i1}$ = number of
ADD operations in vertex $v_i$). Accordingly, a specific longest
path with respect to the jth distinct weight, $S^j_{LP}$, can be
defined as the sequence of vertices $(v_{\mathrm{root}}, v_l, \ldots, v_{\mathrm{exit}})$ which
yields a maximum path weight $PW_j$ by summing up all the
weights $w_{\mathrm{root},j}, w_{l,j}, \ldots, w_{\mathrm{exit},j}$ of the vertices that belong to this path:
$$PW_j = \left( \sum_{v_i \in S^j_{LP}} w(v_i) \right)^T d_j. \tag{2}$$
Here the selection of the weight with the type j is
accomplished by multiplication with a vector $d_j = (\delta_{1j}, \ldots, \delta_{mj})^T$
defined with the Kronecker-delta $\delta_{ij}$.
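Definition 2 (degree of parallelism for operation j). Similar
to the path weight $PW_j$, a global weight $GW_j$ can be defined as
$$GW_j = \left( \sum_{v_i \in V} w(v_i) \right)^T d_j, \tag{3}$$
which represents the operation-specific weight of the whole
algorithm. The degree of parallelism for operation j is then defined as
$$\gamma_j = \frac{GW_j}{PW_j} \tag{4}$$
to reflect the reuse capabilities of each operation unit for
operation j.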
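The operation-specific quantities of Definitions 1 and 2 can be sketched in the same style. Here the path weight for operation j is taken to be the largest sum of the j-th weight component along any root-to-exit path, and the operation-specific degree of parallelism is taken, by analogy with γ, as the total weight of operation j divided by that path weight. The function names, the 0-based index j, and the acyclic-graph assumption are choices of this example.

```python
def path_weight(succ, weights, root, exit_, j):
    """Maximum sum of weight component j over all root-to-exit paths (PW_j).

    `weights` maps every vertex to a tuple of operation counts.
    """
    memo = {}

    def best(v):
        # Heaviest path from v to the exit with respect to operation j.
        if v not in memo:
            own = weights[v][j]
            if v == exit_:
                memo[v] = own
            else:
                memo[v] = own + max(best(s) for s in succ[v])
        return memo[v]

    return best(root)


def global_weight(weights, j):
    """Total count of operation j over all vertices (GW_j)."""
    return sum(w[j] for w in weights.values())


def operation_parallelism(succ, weights, root, exit_, j):
    """gamma_j = GW_j / PW_j; raises ZeroDivisionError if PW_j is zero."""
    return global_weight(weights, j) / path_weight(succ, weights, root, exit_, j)
```

Note that the heaviest path can differ from one operation type to the next, which is why the path is selected separately for each j.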
Definition 3 (cyclomatic complexity). The cyclomatic
complexity, as defined by McCabe [25], states the theoretical
number (see (5)) of test cases required to cover all linearly
independent paths through the CFG, a structural testing criterion
also known as basis path coverage:
$$V(G) = |E| - |V| + 2. \tag{5}$$
The generation of the verification paths is presented by Poole
[26] based on a modified depth-first search through the CFG.
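Equation (5) is straightforward to evaluate once the CFG is available as an edge list. The sketch below assumes a connected CFG with one root and one exit, given as a list of (source, target) pairs; the function name is illustrative only.

```python
def cyclomatic_complexity(edges):
    """V(G) = |E| - |V| + 2 for a connected CFG given as (source, target) pairs."""
    vertices = {v for edge in edges for v in edge}
    return len(edges) - len(vertices) + 2
```

For a single if/else construct (four vertices, four edges) the result is 2, matching the two branches that have to be exercised; straight-line code yields 1.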
Definition 4 (control orientation metrics). The control
orientation metrics (COM) identifies whether a function is
dominated by control operations:
$$\mathrm{COM} = \frac{N_{cop}}{N_{op} + N_{cop} + N_{mac}}. \tag{6}$$
Here $N_{cop}$ defines the number of control statements (if,
for, while), $N_{op}$ defines the number of arithmetic and logic
operations, and $N_{mac}$ the number of memory accesses. When
the COM value approaches 1, the function is dominated by
control operations. This usually indicates that a
control-oriented algorithm is better suited
to running on a controller than to being implemented as
dedicated hardware.
3.2. Hardware-related metrics
Early estimates of area, execution time, and power consump-
tion of a specific algorithm implemented in hardware are
crucial for design decisions like hardware/software partition-
ing (Section 5) and architecture exploration (Section 6.1).
Elaborating several different implementations in order to find
an optimal solution is usually not feasible.
Therefore, only critical parts are modelled (rapid prototyping [6])
in order to measure worst-case scenarios, with the
disadvantage that side effects on the rest of the system are neglected.
According to Gajski et al. [27] those estimates must satisfy
three criteria: accuracy, fidelity, and simplicity.
The estimation of area is based on an area characteriza-
tion of the available operations and on an estimation of the
needed number of operations (e.g., ADD, MUL). The area
consumption of an operation is usually estimated by a func-
tion dependent on the number of inputs/outputs and their
bit widths [28]. Further, the number of operations, for exam-
ple, in Boolean expressions can be estimated by the number
of nodes in the corresponding Boolean network [29]. Area
estimation approaches for design descriptions above the register
transfer level, such as SystemC, try to identify a simple model for the
high-level synthesis process [30].
The estimation of execution time of a hardware im-
plementation requires the estimation of scheduling and re-
source allocation, which are two interdependent tasks. Path-
based techniques transform an algorithm description from
its CFG and DFG representation into a directed acyclic
graph. Within this acyclic graph worst-case paths can be in-
vestigated by static analysis [31]. In simulation-based ap-
proaches the algorithm is enriched with functionality for
tracing the execution paths during the simulation. This tech-
nique is, for example, described for SystemC [32] and MAT-
LAB [33]. Additionally a characterization of the operations
regarding their timing (delay) has to be performed.
Power dissipation in CMOS is separated into two com-
ponents, the static and the dominant dynamic parts. Static
power dissipation is mainly caused by leakage currents,
whereas the dynamic part is caused by charging/discharging
capacitances and the short circuit during the switching.
Charging accounts for over 90% of the overall power dis-
sipation [34]. Assuming that capacitance is related to area,
area estimation techniques, as discussed before, have to be
applied. Fornaciari et al. [35] present power models for
different functional units like registers and multiplexers. Several
techniques for predicting the switching activity of a circuit
are presented by Landman [36].
3.3. Cost function and affinity
Usually the design target is the minimization of a cost or
objective function with inequality constraints [37]. This cost
function c depends on $x = (x_1, \ldots, x_n)^T$, where the
elements $x_i$ represent normalized and weighted values of
timing, area, and power. Economic aspects could also be
addressed (e.g., cyclomatic complexity relates to verification
effort). This leads to the minimization problem
$$\min_x c(x). \tag{7}$$
Additionally, those metrics are subject to a set of constraints $b_i$,
such as maximum area, maximum response time, or maximum
power consumption, given by the requirements of the
system. Those constraints, which can be grouped into a vector
$b = (b_1, \ldots, b_n)^T$, define a set of inequalities,
$$x \le b. \tag{8}$$
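The constrained minimization in (7) and (8) can be made concrete with a small sketch that selects, from a handful of candidate implementations, the feasible one with the lowest cost. The weighted-sum form of c(x), the candidate names, and all numbers are assumptions made for illustration; the paper does not prescribe a particular cost function.

```python
def cost(x, weights):
    """Weighted-sum cost c(x); the linear form is an assumption of this sketch."""
    return sum(w * xi for w, xi in zip(weights, x))


def feasible(x, b):
    """Check the inequality constraints x <= b component by component."""
    return all(xi <= bi for xi, bi in zip(x, b))


def best_candidate(candidates, weights, b):
    """Name of the feasible candidate with minimal cost, or None if none is feasible.

    `candidates` maps a name to its metric vector x, e.g. (time, area, power).
    """
    admissible = {name: x for name, x in candidates.items() if feasible(x, b)}
    if not admissible:
        return None
    return min(admissible, key=lambda name: cost(admissible[name], weights))
```

With normalized metric vectors, a candidate that violates the timing bound is discarded before its cost is ever compared, which mirrors the role of the constraint vector b.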
A further application of the presented metrics is their use
in the hardware/software partitioning process. Here a huge
search space demands heuristics that allow partitioning
within reasonable time. Nevertheless, a reduction of the
search space can be achieved by assigning certain functions to
hardware or software beforehand. This can be accomplished
by an affinity metric [38]. Such an affinity can be expressed as
$$A = \frac{1}{\mathrm{COM}} + \sum_{j} \gamma_j. \tag{9}$$
A high value of A, and thus a high affinity of an algorithm
to a hardware implementation, is caused by few control
operations and high parallelism of the operations that are used
in the algorithm. Thus an algorithm with an affinity value
higher than a certain threshold can be selected directly for
implementation in hardware.
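The threshold-based preassignment can be sketched as follows. The affinity is read here as A = 1/COM plus the sum of the operation-specific degrees of parallelism, which is an assumption about the exact form of (9); the function names, example values, and threshold are likewise hypothetical.

```python
def affinity(com, gammas):
    """A = 1/COM + sum of gamma_j; COM must be positive."""
    if com <= 0:
        raise ValueError("COM must be positive")
    return 1.0 / com + sum(gammas)


def preassign_to_hardware(functions, threshold):
    """Names of functions whose affinity exceeds the threshold, sorted by name.

    `functions` maps a name to a (COM, [gamma_j, ...]) pair.
    """
    return sorted(
        name
        for name, (com, gammas) in functions.items()
        if affinity(com, gammas) > threshold
    )
```

Functions selected this way are removed from the search space before the actual partitioning heuristic runs.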
4. FLOATING-POINT TO FIXED-POINT CONVERSION
Design of embedded systems typically starts with the conver-
sion of the initial concept of the system into an executable
algorithmic model, on which high-level specifications of the
system are verified. At this level of abstraction, models
invariably use floating-point formats, for several reasons. Firstly,
while the algorithm itself is undergoing changes, it is
necessary to relieve the designer of having to take
numeric effects into account. Hence, using floating-point
formats, the designer is free to modify the algorithm itself,
without any consideration of overflow and quantization effects.
Also, floating-point formats are highly suitable for algorith-
mic modeling because they are natively supported on PC or
workstation platforms, where algorithmic modeling usually
takes place.
However, at the end of the design process lies the imple-
mentation stage, where all the hardware and software com-
ponents of the system are fully implemented in the chosen
target technologies. Both the software and hardware compo-
nents of the system at this stage use only fixed-point numeric
formats, because the use of fixed-point formats allows dras-
tic savings in all traditional cost metrics: the required silicon
area, power consumption, and latency/throughput (i.e., per-
formance) of the final implementation.
Thus, during the design process it is necessary to perform
the conversion from floating-point to suitable fixed-point
numeric formats, for all data channels in the system. This
transition necessitates careful consideration of the ranges
and precision required for each channel, the overflow and
quantisation eﬀects created by the introduction of the fixed-
point formats, as well as a possible instability which these
formats may introduce. A trade-oﬀ optimization is hence
formed, between minimising introduced quantisation noise
and minimising the overall bitwidths in the system, so as to
minimise the total system implementation cost. The level of
introduced quantisation noise is typically measured in terms
of the signal to quantisation noise ratio (SQNR), as defined
in (10), where v is the original (floating-point) value of the
signal and v̂ is the quantized (fixed-point) value of the signal:
$$\mathrm{SQNR} = 20 \log_{10} \left| \frac{v}{v - \hat{v}} \right|. \tag{10}$$
The performance/cost tradeoﬀ is traditionally performed
manually, with the designer estimating the eﬀects of fixed-
point formats through system simulation and determin-
ing the required bitwidths and rounding/overflow modes
through previous experience or given knowledge of the sys-
tem architecture (such as predetermined bus or memory in-
terface bitwidths). This iterative procedure is very time con-
suming and can sometimes account for up to 50% of the to-
tal design eﬀort [39]. Hence, a number of initiatives to auto-
mate the conversion from floating-point to fixed-point for-
mats have been set up.
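The manual procedure described above can be mimicked by a small simulation loop: quantize a test signal with a candidate format, measure the SQNR, and widen the format until a target is met. In the sketch below, the SQNR is computed over a block of samples as the ratio of signal power to quantisation noise power in dB, which is one common aggregate form and an assumption of this example. The rounding mode (Python's round, i.e. round half to even), the saturating overflow mode, and all function names are also assumptions.

```python
import math


def quantize(value, word_bits, frac_bits):
    """Round to a signed fixed-point format, saturating on overflow."""
    scale = 1 << frac_bits
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    code = max(lowest, min(highest, round(value * scale)))
    return code / scale


def sqnr_db(signal, word_bits, frac_bits):
    """Signal power over quantisation noise power, in dB, for a list of samples."""
    noise = sum((v - quantize(v, word_bits, frac_bits)) ** 2 for v in signal)
    power = sum(v * v for v in signal)
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(power / noise)


def smallest_word(signal, int_bits, target_db, max_frac=32):
    """Smallest word length (sign + int_bits + fraction bits) reaching target_db."""
    for frac_bits in range(max_frac + 1):
        word_bits = 1 + int_bits + frac_bits
        if sqnr_db(signal, word_bits, frac_bits) >= target_db:
            return word_bits
    return None
```

Each additional fractional bit improves the SQNR by roughly 6 dB, so the loop makes the performance/cost trade-off explicit: the target SQNR fixes the minimum bitwidth, and therefore the implementation cost, for the given test signal.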
In general, the problem of automating the conversion
from floating-point to fixed-point formats can be based on
either an analytical (static) or statistical (dynamic) approach.
Each of these approaches has its benefits and drawbacks.
4.1. Analytical approaches
All the analytical approaches to automate the conversion
from floating-point to fixed-point numeric formats find
their roots in the static analysis of the algorithm in question.
The algorithm, represented as a control and data flow graph
(CDFG), is statically analysed, propagating the bitwidth re-
quirements through the graph, until the range, precision, and
sign mode of each signal are determined.
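A minimal sketch of such a static propagation is shown below, using plain interval arithmetic on a data flow graph whose operations are listed in topological order. It is not the algorithm of any of the tools discussed in this section; the operation set, the tuple layout, and the function names are assumptions, and only the range (integer bits and sign), not the precision, is propagated.

```python
def propagate_ranges(inputs, operations):
    """Forward-propagate value ranges through a data flow graph.

    `inputs` maps signal names to (low, high) ranges.
    `operations` is a topologically ordered list of (result, op, left, right).
    """
    ranges = dict(inputs)
    for result, op, left, right in operations:
        (a, b), (c, d) = ranges[left], ranges[right]
        if op == "add":
            ranges[result] = (a + c, b + d)
        elif op == "sub":
            ranges[result] = (a - d, b - c)
        elif op == "mul":
            products = (a * c, a * d, b * c, b * d)
            ranges[result] = (min(products), max(products))
        else:
            raise ValueError(f"unsupported operation: {op}")
    return ranges


def integer_bits(low, high):
    """Integer bits (including a sign bit if low < 0) covering [low, high]."""
    signed = low < 0
    n = 1 if signed else 0
    while True:
        if signed:
            fits = -(2 ** (n - 1)) <= low and high < 2 ** (n - 1)
        else:
            fits = high < 2 ** n
        if fits:
            return n
        n += 1
```

No simulation and no input data are needed, only the declared input ranges. The result is a worst-case bound: operands are treated as independent even when they are correlated, which is one source of the conservatism discussed for analytical approaches.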
As such, analytical approaches do not require any simu-
lations of the system to perform the conversion. This typi-
cally results in significantly improved runtime performance,
which is the main benefit of employing such a scheme. Also,
analytical approaches do not make use of any input data for
the system. This relieves the designer from having to pro-
vide any data sets with the original floating-point model
and makes the results of the optimisation dependent only on
the algorithm itself and completely independent of any data
which may eventually be used in the system.
However, analytical approaches suﬀer from a number of
critical drawbacks in the general case. Firstly, analytical ap-
proaches are inherently only suitable for finding the upper
bound on the required precision, and are unable to perform
the essential trade-oﬀ between system performance and im-
plementation cost. Hence, the results of analytical optimi-
sations are excessively conservative, and cannot be used to
replace the designer’s fine manual control over the trade-
oﬀ. Furthermore, analytical approaches are not suitable for
use on all classes of algorithms. It is in general not possible
to process nonlinear, time-variant, or recursive systems with
these approaches.
FRIDGE [39] is one of the earliest environments for
floating-point to fixed-point conversion and is based on an
analytical approach. This environment has high runtime per-
formance, due to its analytical nature, and wide applicabil-
ity, due to the presence of various back-end extensions to
the core engine, including the VHDL back end (for hardware
component synthesis) and ANSI-C and assembly back ends
(for DSP software components). However, the core engine
relies fully on the designer to preassign fixed-point formats
to a suﬃcient portion of the signals, so that the optimisation
engine may propagate these to the rest of the CDFG struc-
ture of the algorithm. This environment is based on fixed-
C, a proprietary extension to the ANSI-C core language and
is hence not directly compatible with standard design flows.
The FRIDGE environment forms the basis of the commercial
Synopsys CoCentric Fixed-Point Designer [40] tool.
Another analytical approach, Bitwise [41], implements
both forward and backward propagations of bitwidth re-
quirements through the graph representation of the system,
thus making more eﬃcient use of the available range and
precision information. Furthermore, this environment is ca-
pable of tackling complex loop structures in the algorithm
by calculating their closed-form solutions and using these to
propagate the range and precision requirements. However,
this environment, like all analytical approaches, is not capa-
ble of carrying out the performance-cost trade-oﬀ and results
in very conservative fixed-point formats.
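The forward propagation of range information that analytical approaches rely on can be illustrated with a minimal sketch. It is not the Bitwise implementation: it uses plain interval arithmetic, performs forward propagation only, and the helper names (`add`, `mul`, `int_bits`) as well as the two-tap filter example are assumptions made for illustration.

```python
def add(a, b):
    """Interval sum: the result contains every possible x + y."""
    return (a[0] + b[0], a[1] + b[1])

def mul(a, b):
    """Interval product: take the extremes of all corner products."""
    corners = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(corners), max(corners))

def int_bits(lo, hi):
    """Smallest number of integer bits (sign bit included) such that a
    two's-complement format covers the whole interval [lo, hi]."""
    n = 1
    while not (-(2 ** (n - 1)) <= lo and hi < 2 ** (n - 1)):
        n += 1
    return n

# Hypothetical two-tap filter y = 0.5 * x0 - 1.5 * x1 with inputs in [-1, 1].
x = (-1.0, 1.0)
y_range = add(mul((0.5, 0.5), x), mul((-1.5, -1.5), x))
y_bits = int_bits(*y_range)
```

The resulting range is a hard worst-case bound. It is safe for any input, which is exactly why formats derived this way tend to be conservative.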
An environment for automated floating-point to fixed-
point conversion for DSP code generation [42] has also
been presented, minimising the execution time of DSP code
through the reduction of variable bitwidths. However, this
approach is only suitable for software components and disre-
gards the level of introduced quantisation noise as a system-
level performance metric in the trade-oﬀ.
An analytical approach based on aﬃne arithmetic [43]
presents another fast, but conservative, environment for
automated floating-point to fixed-point conversion. The
unique feature of this approach is the use of probabilistic
bounds on the distribution of values of a data channel. The
authors introduce the probability factor λ, which in a nor-
mal hard upper-bound analysis equals 1. Through this prob-
abilistic relaxation scheme, the authors set λ = 0.999999 and
thereby achieve significantly more realistic optimisation re-
sults, that is to say, closer to those achievable by the designer
through system simulations. While this scheme provides a
method of relaxing the conservative nature of its core analyt-
ical approach, the mechanism of controlling this separation
(namely, the trial-and-error search by varying the λ factor)
does not provide a means of controlling the performance-
cost tradeoﬀ itself and thus replacing the designer.
4.2. Statistical approaches
The statistical approaches to perform the conversion from
floating-point to fixed-point numeric formats are based on
system simulations and use the resulting information to
carry out the performance-cost tradeoﬀ, much like the de-
signer does during the manual conversion.
Due to the fact that these methods employ system sim-
ulations, they may require extended runtimes, especially in
the presence of complex systems and large volumes of input
data. Hence, care has to be taken in the design of these op-
timisation schemes to limit the number of required system
simulations.
The advantages of employing a statistical approach to au-
tomate the floating-point to fixed-point conversion are nu-
merous. Most importantly, statistical algorithms are inher-
ently capable of carrying out the performance-cost trade-oﬀ,
seamlessly replacing the designer in this design step. Also, all
classes of algorithms can be optimised using statistical ap-
proaches, including nonlinear, time-variant, or recursive sys-
tems.
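The basic measurement step of a statistical approach can be sketched as follows: a signal is simulated once in floating point and once with a candidate fixed-point format, and the SQNR is computed from the difference. This is only a sketch under assumptions: the SQNR is taken here as the usual ratio of signal power to quantisation noise power in dB, the test signal is uniformly distributed, and the function names `quantize` and `sqnr_db` are hypothetical.

```python
import math
import random

def quantize(x, word_len, frac_bits):
    """Round x to a signed fixed-point value, saturating on overflow."""
    scale = 2 ** frac_bits
    lo = -(2 ** (word_len - 1))
    hi = 2 ** (word_len - 1) - 1
    return max(lo, min(hi, round(x * scale))) / scale

def sqnr_db(reference, fixed):
    """Signal power over quantisation noise power, in dB."""
    signal = sum(r * r for r in reference)
    noise = sum((r - f) ** 2 for r, f in zip(reference, fixed))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)

rng = random.Random(0)  # fixed seed so that the run is repeatable
samples = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
sqnr_8 = sqnr_db(samples, [quantize(s, 8, 6) for s in samples])
sqnr_12 = sqnr_db(samples, [quantize(s, 12, 10) for s in samples])
```

Each additional fractional bit should raise the measured SQNR by roughly 6 dB. Every such measurement costs one full simulation run, which is why the number of runs dominates the optimisation runtime.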
One of the earliest research eﬀorts to implement a sta-
tistical floating-point to fixed-point conversion scheme con-
centrates on DSP designs represented in C/C++ [44]. This
approach shows high flexibility, characteristic of statistical
approaches, being applicable to nonlinear, recursive, and
time-variant systems.
However, while this environment is able to explore the
performance-cost tradeoﬀ, it requires manual intervention
by the designer to do so. The authors employ two optimi-
sation algorithms to perform the trade-oﬀ: full search and
a heuristic with linear complexity. The high complexity of
the full search optimisation is reduced by grouping signals
into clusters, and assigning the same fixed-point format to
all the signals in one cluster. While this can reduce the search
space significantly, it is an unrealistic assumption, especially
for custom hardware implementations, where all signals in
the system have very diﬀerent optimal fixed-point formats.
QDDV [45] is an environment for floating-point to
fixed-point conversion, aimed specifically at video applica-
tions. The unique feature of this approach is the use of two
performance metrics. In addition to the widely used objective
metric, the SQNR, the authors also use a subjective metric,
the mean opinion score (MOS) taken from ten observers.
While this environment does employ a statistical frame-
work for measuring the cost and performance of a given
fixed-point format, no automation is implemented and no
optimisation algorithms are presented. Rather, the environ-
ment is available as a tool for the designer to perform man-
ual “tuning” of the fixed-point formats to achieve acceptable
subjective and objective performance of the video process-
ing algorithm in question. Additionally, this environment is
based on Valen-C, a custom extension to the ANSI-C lan-
guage, thus making it incompatible with other EDA tools.
A further environment for floating-point to fixed-point
conversion based on a statistical approach [46] is aimed at
optimising models in the MathWorks Simulink [47] environment.
This approach derives an optimisation framework for
the performance-cost trade-oﬀ, but provides no optimisa-
tion algorithms to actually carry out the trade-oﬀ, thus leav-
ing the conversion to be performed by the designer manually.
A fully automated environment for floating-point to
fixed-point conversion called fixify [21] has been presented,
based on a statistical approach. While this results in fine con-
trol over the performance-cost trade-oﬀ, fixify at the same
time dispenses with the need for exhaustive search optimi-
sations and thus drastically reduces the required runtimes.
This environment fully replaces the designer in making the
performance-cost trade-oﬀ by providing a palette of optimi-
sation algorithms for diﬀerent implementation scenarios.
For designs that are to be mapped to software running
on a standard processor core, restricted-set full search is the
best choice of optimisation technique, since it oﬀers guaran-
teed optimal results and optimises the design directly to the
set of fixed-point bitwidths that are native to the processor
core in question. For custom hardware implementations, the
best choice of optimisation option is the branch-and-bound
algorithm [48], oﬀering guaranteed optimal results. How-
ever, for high-complexity designs with relatively long simu-
lation times, the greedy search algorithm is an excellent alter-
native, oﬀering significantly reduced optimisation runtimes,
with little sacrifice in the quality of results.
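The difference between a restricted-set full search and a greedy search can be illustrated with a toy example. The sketch below is not the fixify implementation: the two-channel system, the cost model (sum of word lengths), the starting point of 32 bits, and all function names are assumptions chosen to keep the example short.

```python
import itertools
import math
import random

def quantize(x, word_len):
    """Signed fixed point on [-1, 1): 1 sign bit, word_len - 1 fractional bits."""
    scale = 2 ** (word_len - 1)
    return max(-scale, min(scale - 1, round(x * scale))) / scale

def simulate(word_lens, inputs):
    """Toy system with two quantised channels: the input and the output."""
    w_in, w_out = word_lens
    out = []
    for x in inputs:
        a = quantize(x, w_in)
        out.append(quantize(0.5 * a * a - 0.3, w_out))
    return out

def sqnr_db(word_lens, inputs, reference):
    fixed = simulate(word_lens, inputs)
    noise = sum((r - f) ** 2 for r, f in zip(reference, fixed))
    signal = sum(r * r for r in reference)
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)

def cost(word_lens):
    return sum(word_lens)  # assumed cost model: total number of bits

def restricted_full_search(native, target_db, inputs, reference):
    """Try every combination of native word lengths, keep the cheapest feasible one."""
    feasible = [w for w in itertools.product(native, repeat=2)
                if sqnr_db(w, inputs, reference) >= target_db]
    return min(feasible, key=cost) if feasible else None

def greedy_search(target_db, inputs, reference, start=32):
    """Remove one bit at a time from the channel that hurts the SQNR least."""
    current = [start, start]
    while True:
        candidates = []
        for i in range(len(current)):
            if current[i] > 2:
                trial = current.copy()
                trial[i] -= 1
                s = sqnr_db(trial, inputs, reference)
                if s >= target_db:
                    candidates.append((s, trial))
        if not candidates:
            return tuple(current)
        current = max(candidates, key=lambda c: c[0])[1]

rng = random.Random(1)
inputs = [rng.uniform(-0.9, 0.9) for _ in range(500)]
reference = [0.5 * x * x - 0.3 for x in inputs]
full = restricted_full_search((16, 32, 64), 40.0, inputs, reference)
greedy = greedy_search(40.0, inputs, reference)
```

In this toy setting the full search can only choose from {16, 32, 64}, whereas the greedy search may stop at any word length, so it should reach a lower total cost for the same SQNR target.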
Figure 4: Optimization results for the MIMO receiver design.
Figure 4 shows the results of optimising a multiple-input
multiple-output (MIMO) receiver design by all three optimisation
algorithms in the fixify environment. The results
are presented as a trade-off between the implementation cost
c (on the vertical axis) and the SQNR, as defined in (10)
(on the horizontal axis). It can immediately be noted from
Figure 4 that all three optimisation methods generally re-
quire increased implementation cost with increasing SQNR
requirements, as is intuitive. In other words, the optimisation
algorithms are able to find fixed-point configurations with
lower implementation costs when more degradation of nu-
meric performance is allowed.
It can also be noted from Figure 4 that the optimisa-
tion results of the restricted-set full search algorithm consis-
tently (i.e., over the entire examined range [5 dB, 100 dB])
require higher implementation costs for the same level of
numeric performance than both the greedy and the branch-and-bound
optimisation algorithms. The reason for this effect
is the restricted set of possible bitwidths that the full search
algorithm can assign to each data channel. In this example,
the restricted-set full search algorithm uses the word length
set of {16, 32, 64}, corresponding to the available set of fixed-point
formats on the TI C6416 DSP which is used in the original
implementation [49]. The full search algorithm can only
move through the solution space in large quantum steps, thus
not being able to fine tune the fixed-point format of each
channel. On the other hand, greedy and branch-and-bound
algorithms both have full freedom to assign any positive in-
teger (strictly greater than zero) as the word length of the
fixed-point format for each channel in the design, thus con-
sistently being able to extract fixed-point configurations with
lower implementation costs for the same SQNR levels.
Also, Figure 4 shows that, though the branch-and-bound
algorithm consistently finds the fixed-point configuration
with the lowest implementation cost for a given level of
SQNR, the greedy algorithm performs only slightly worse.
In 13 out of the 20 optimizations, the greedy algorithm re-
turned the same fixed-point configuration as the branch-
and-bound algorithm. In the other seven cases, the subtree
relaxation routine of the branch-and-bound algorithm dis-
covered a superior fixed-point configuration. In these cases,
the relative improvement of using the branch-and-bound al-
gorithm ranged between 1.02% and 3.82%.
Furthermore, it can be noted that the fixed-point con-
figuration found by the designer manually can be improved
for both the DSP implementation (i.e., with the restricted-set
full search algorithms) and the custom hardware implemen-
tation (i.e., with the greedy and/or branch-and-bound algo-
rithms). The designer optimized the design to the fixed-point
configuration where all the word lengths are set to 16 bits
by manual trial and error, as is traditionally the case. Af-
ter confirming that the design has satisfactory performance
with all word lengths set to 32 bits, the designer assigned all
the word lengths to 16 bits and found that this configuration
also performs satisfactorily. However, it is possible to obtain
lower implementation cost for the same SQNR level, as well
as superior numeric performance (i.e., higher SQNR) for the
same implementation cost, as can be seen in Figure 4.
It is important to note that fixify is based entirely on
the SystemC language, thus making it compatible with other
EDA tools and easier to integrate into existing design flows.
Also, the fixify environment requires no change to the origi-
nal floating-point code in order to perform the optimisation.
5. HARDWARE/SOFTWARE PARTITIONING
Hardware/software partitioning can in general be described
as the mapping of the interconnected functional objects that
constitute the behavioural model of the system onto a chosen
architecture model. The task of partitioning has been thor-
oughly researched and enhanced during the last 15 years and
produced a number of feasible solutions, which depend heav-
ily on their prerequisites:
(i) the underlying system description;
(ii) the architecture and communication model;
(iii) the granularity of the functional objects;
(iv) the objective or cost function.
The manifold formulations entail numerous very diﬀerent
approaches to tackle this problem. The following subsection
arranges the most fundamental terms and definitions that are
common in this field and shall prepare the ground for a more
detailed discussion of the sophisticated strategies in use.
5.1. Common terms
The functionality can be implemented with a set of intercon-
nected system components, such as general-purpose CPUs,
DSPs, ASICs, ASIPs, memories, and buses. The designer’s
task is in general twofold: selection of a set of system compo-
nents or, in other words, the determination of the architec-
ture, and the mapping of the system’s functionality among
these components. The term partitioning, originally describ-
ing only the latter, is usually adopted for a combination of
both tasks, since these are closely interlocked. The level at
which partitioning is performed varies from group to group,
as do the expressions used to describe these levels. The term
system level has always referred to the highest level of
abstraction. In the early nineties, however, the system level
identified VHDL designs composed of several functional objects in
the size of an FIR or LUT. Nowadays the term system level
describes functional objects of the size of a Viterbi or a Huﬀ-


















Figure 5: Common implementation architecture.
nitude. In the following the granularity of the system parti-
tioning is labelled decreasingly as follows: system level (e.g.,
Viterbi, UMTS Slot Synchronisation, Huﬀman, Quicksort,
etc.), process level (FIR, LUT, Gold code generator, etc.), and
operational level (MAC, ADD, NAND, etc.). The final implementation
has to satisfy a set of design constraints, such as
cost, silicon area, power consumption, and execution time.
Measures for these values, obtained by high-level estimation,
simulation, or static analysis, which characterize a given so-
lution quantitatively are usually called metrics; see Section 3.
Depending on the specific problem formulation a selection
of metrics composes an objective function, which captures the
overall quality of a certain partitioning as described in detail
in Section 3.3.
5.2. Partitioning approaches
Ernst et al. [50] published an early work on the partition-
ing problem starting from an all-software solution within
the COSYMA system. The underlying architecture model is
composed of a programmable processor core, memory, and
customised hardware (Figure 5).
The general strategy of this approach is the hardware ex-
traction of the computational intensive parts of the design,
especially loops, on a fine-grained basic block level (CDFG),
until all timing constraints are met. These computation in-
tensive parts are identified by simulation and profiling. User
interaction is demanded since the system description lan-
guage is Cx, a superset of ANSI-C. Not all Cx constructs have
valid counterparts in a hardware implementation, such as dynamic
data structures and pointers. Internally, simulated annealing
(SA) [51] is utilized to generate different partitioning
solutions. In 1994 the authors introduced an optional
programmable coprocessor in case the timing constraints
could not be met by hardware extraction [52]. The schedul-
ing of the basic blocks is identified to be as soon as possible
(ASAP) driven, in other words, it is the simplest list schedul-
ing technique also known as earliest task first. A further im-
provement of this approach is the usage of a dynamically ad-
justable granularity [53] which allows for restructuring of the
system’s functionality on basic block level (see Section 3.1)
into larger partitioning objects.
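The ASAP principle mentioned above can be sketched in a few lines: every block starts as soon as all of its predecessors have finished. The sketch assumes unlimited resources and known block durations, which is a simplification of the scheduling performed in COSYMA. The graph, the durations, and the function name `asap_schedule` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def asap_schedule(durations, preds):
    """Start every node as soon as all of its predecessors have finished."""
    start = {}
    for node in TopologicalSorter(preds).static_order():
        start[node] = max(
            (start[p] + durations[p] for p in preds.get(node, ())),
            default=0,
        )
    return start

# Hypothetical basic-block graph: A precedes B and C, both precede D.
durations = {"A": 2, "B": 3, "C": 1, "D": 2}
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
start_times = asap_schedule(durations, preds)
```

Here D cannot start before time 5, because it has to wait for the slower predecessor B even though C is already finished at time 3.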
In 1994, the authors Kalavade and Lee [54] published a
fast algorithm for the partitioning problem. They addressed
the coarse-grained mapping of processes onto an identi-
cal architecture (Figure 5) starting from a directed acyclic
graph (DAG). The objective function incorporates several
constraints on available silicon area (hardware capacity),
memory (software capacity), and latency as a timing con-
straint. The global criticality/local phase (GCLP) algorithm
is a greedy approach, which visits every process node once
and is directed by a dynamic decision technique considering
several cost functions.
The partitioning engine is part of the signal processing
work suite Ptolemy [55], first distributed in the same
year. This algorithm is compared to simulated annealing and
a classical Kernighan-Lin implementation [56]. Its tremendous
speed with reasonably good results is noteworthy, but
only a single partitioning solution is calculated in a
vast search space of often a billion solutions. This work has
been improved by the introduction of an embedded imple-
mentation bin selection (IBS) [57].
In the paper of Eles et al. [58] a tabu search algorithm
is presented and compared to simulated annealing and Kern-
ighan-Lin (KL). The target architecture does not diﬀer from
the previous ones. The objective function concentrates more
on a trade-oﬀ between the communication overhead be-
tween processes mapped to diﬀerent resources and reduc-
tion of execution time gained by parallelism. The most im-
portant contribution is the preanalysis before the actual par-
titioning starts. For the first time static code analysis tech-
niques are combined with profiling and simulation to iden-
tify the computation intensive parts of the functional code.
The static analysis is performed on operation level within
the basic blocks. A suitability metric is derived from the oc-
currence of distinct operation types and their distribution
within a process, which is later on used to guide the mapping
to a specific implementation technology.
The paper of Vahid and Le [59] opened a different perspective
in this research area. The architecture model does not
deviate from the models discussed above. The innovation
in this work is the decomposition
of the system into an access graph (AG), or call
graph. From a software engineering point of view a system’s
functionality is often described with hierarchical structures,
in which every edge corresponds to a function call. This rep-
resentation is completely diﬀerent from the block-based di-
agrams that reflect the data flow through the system in all
digital signal processing work suites [47, 55]. The leaves of
an access graph correspond to the simplest functions that do
not contain further function calls (Figure 6).
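How an access graph can be derived from source code may be illustrated with a small sketch. It is not the tool flow of Vahid and Le: it analyses a Python snippet with the standard `ast` module instead of C code, counts only static call sites as edge weights, and the function names `main`, `f1`, and `f2` are merely illustrative.

```python
import ast
from collections import Counter

SOURCE = """
def f2(z):
    return z + 1

def f1(x):
    return f2(x) * f2(x + 1)

def main():
    return f1(3)
"""

def access_graph(source):
    """Map each function to the functions it calls, with static call counts."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph = {}
    for name, node in funcs.items():
        calls = Counter(
            c.func.id
            for c in ast.walk(node)
            if isinstance(c, ast.Call)
            and isinstance(c.func, ast.Name)
            and c.func.id in funcs
        )
        graph[name] = dict(calls)
    return graph

graph = access_graph(SOURCE)
# Leaves are functions that do not call any other function.
leaves = [name for name, callees in graph.items() if not callees]
```

In this example `f2` is the only leaf, and the edge from `f1` to `f2` carries the weight 2 because `f2` is called at two places inside `f1`.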
The authors extend the Kernighan-Lin heuristic to be ap-
plicable to this problem instance and put much eﬀort in the
























Figure 6: Code segment and corresponding access graph.
the runtime of the algorithm. Indeed their approach yields
good results on the examined real and random designs in
comparison with other algorithms, like SA, greedy search, hi-
erarchical clustering, and so forth. Nevertheless, the assign-
ment of function nodes to the programmable component
lacks a proper scheduling technique, and the decomposition
of a usually block-based signal processing system into an ac-
cess graph representation is in most cases very time consum-
ing.
5.3. Combined partitioning and
scheduling approaches
In the later nineties research groups started to put more ef-
fort into combined partitioning and scheduling techniques.
The first approach of Chatha and Vemuri [60] can be seen
as a further development of Kalavade’s work. The architec-
ture consists of a programmable processor and a custom
hardware unit, for example, an FPGA. The communication
model consists of a RAM for hardware-software communi-
cation connected by a system bus, and both processors ac-
commodate local memory units for internal communication.
Partitioning is performed in an iterative manner on system
level with the objective of the minimization of execution time
while maintaining the area constraint.
The partitioning algorithm mirrors exactly the con-
trol structure of a classical Kernighan-Lin implementation
adapted to more than two implementation techniques. Every
time a node is tentatively moved to another kind of imple-
mentation, the scheduler estimates the change in the overall
execution time instead of rescheduling the task subgraph. In
this way a low runtime is preserved at the expense of the reliability
of the objective function. This work has been further extended
for combined retiming, scheduling, and partitioning
of transformative applications, for example, JPEG or MPEG
decoders [61].
A very mature combined partitioning and scheduling
approach for DAGs has been published by Wiangtong et
al. [62]. The target architecture, which forms the foundation
of their work, adheres to the concept given in Figure 5.






















































Figure 7: Rank-ordered DAG and its resulting schedule.
The work compares three heuristic methods to traverse the
search space of the partitioning problem: simulated anneal-
ing, genetic algorithm, and tabu search. Additionally the
most promising technique of this evaluation, tabu search, is
further improved by a so-called penalty reward mechanism.
This mechanism modifies the long-term memory, in which
the information about most/least frequently visited neigh-
bourhood solutions is stored. This solution yields the best re-
sults in terms of least processing time, shortest runtime of the
algorithm, while meeting all the hardware area constraints.
The applied technique that is utilised to schedule a vis-
ited partitioning solution avoids any resource conflicts and
is very fast. Not surprisingly the technique is essentially a
list scheduling technique. The process nodes are grouped
together a priori in a so-called precedence list, which is a
rank-ordered sequence, where one sequence element, or one
precedence level, contains all nodes with the same rank. As
the nodes’ ranks remain always the same in the DAG, inde-
pendent from the current partitioning, only the ordering of
the processes within one precedence level has to be calculated
for every new partitioning solution. An example of this ap-
proach can be seen in Figure 7. The nodes of the DAG are
ordered according to their rank, and their color identifies
the mapping to either software or hardware. The
scheduling of these processes is shown on the right side of
Figure 7. Most notably this approach returns an exact system
execution time to the partitioning engine, in contrast to the
estimation-based techniques described before. The scheduling
is reasonably fast and collisions are avoided completely.
However, the list scheduling fails to recognize situations in
which one software process would enable many hardware
processes to run in parallel, whereas the software process of the
same rank that is preferred instead does not have a single
hardware successor, because the decision is based on the larger bus
utilization.
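The construction of such a precedence list can be sketched as follows. The sketch assumes that the rank of a node is the length of the longest path from any source node, which is one common definition and not necessarily the one used in [62]. The example graph and the function name `precedence_list` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def precedence_list(preds):
    """Group the nodes of a DAG into levels of equal rank."""
    rank = {}
    for node in TopologicalSorter(preds).static_order():
        # Assumed rank: longest distance from a node without predecessors.
        rank[node] = 1 + max((rank[p] for p in preds.get(node, ())), default=-1)
    levels = {}
    for node, r in rank.items():
        levels.setdefault(r, []).append(node)
    return [sorted(levels[r]) for r in sorted(levels)]

# Hypothetical process graph, given as node -> list of predecessors.
preds = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": ["c", "d"]}
levels = precedence_list(preds)
```

Because the ranks depend only on the graph and not on the mapping, the levels are computed once. For each new partitioning only the order inside a level, which is left alphabetical in this sketch, has to be decided.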
The inspiration for the architecture model in the papers
of Knerr et al. [18, 63] and the paper [62] originates from
an industry-designed UMTS baseband receiver chip. Its ab-
straction (see Figure 8(a)) has been developed to provide a
maximum degree of generality while being along the lines
of the industry-designed SoCs in use. It consists of several
(here two) DSPs handling the control-oriented functions,
for instance, an ARM for the signalling part and a StarCore
for the multimedia part, several hardware accelerating units
(ASICs), for the data oriented and computation intensive sig-
nal processing, one system bus to a shared RAM for mixed
resource communication, and optionally direct I/O to pe-
ripheral subsystems. In Figure 8(b) the simple modification
towards the platform concept with one hardware processing
unit (e.g., FPGA) has been established (cf. Figure 5).
Table 1 lists the access times for reading and writing bits
via the diﬀerent resources of the platform in Figure 8(b).
The graph representation of the system, which should be
mapped onto this platform, is generally given in the form
of a task graph. It is usually assumed to be a DAG describ-
ing the dependencies between the components of the sys-
tem. The authors base their work on a graph representa-
tion for multirate systems, also known as synchronous data
flow graphs [64]. This representation forms the backbone
of renowned signal processing work suites, for example,
Ptolemy [55] or SPW [65]. In Figure 9, a simple example
of an SDF graph G = (V, E) is depicted on the left,

































(b) Modification to meet common FPGA and DSP approaches
Figure 8: Origin and modification of a general SoC platform abstraction.
Table 1: Maximum delays for read/write accesses to the communication resources.
Communication            Read (bits/cycle)   Write (bits/cycle)
Local software memory    128                 256
Local hardware memory    64                  128
Shared system bus        256                 512
Direct I/O               1024                1024
showing four vertices V = {v1, . . . , v4} connected by four
edges E = {e1, . . . , e4}. The numbers on the tail of each edge
represent the number of bits produced per invocation. The
numbers on the head of each edge indicate the number of bits
consumed per invocation. On the right the decomposition of
the SDF graph has been performed. In the single activation
graph (SAG) the input/output rate dependencies have been
solved and every process invocation is transformed into one
vertex. The vertices v1 and v2 are doubled according to their
distinct invocations that result from the data rate analysis.
The solid edges indicate precedence as well as data transfers
from one vertex to another, whereas the dashed edges just
indicate precedence. The data rates at the edges, that is, in-
terprocess communication, have been omitted for brevity in
this figure.
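The data rate analysis that determines how often each vertex is invoked can be sketched by solving the balance equations of the SDF graph: for every edge, the number of bits produced per iteration must equal the number consumed. The edge rates below are hypothetical, chosen so that v1 and v2 are invoked twice as described above. They are not taken from Figure 9, and the function name `repetition_vector` is an assumption.

```python
from fractions import Fraction
from math import gcd, lcm  # multi-argument forms need Python 3.9+

def repetition_vector(actors, edges):
    """Smallest positive integer firing counts q with
    q[src] * produced == q[dst] * consumed for every edge."""
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, produced, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent sample rates")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    counts = {a: int(rate[a] * scale) for a in actors}
    common = gcd(*counts.values())
    return {a: n // common for a, n in counts.items()}

# Hypothetical rates: (source, destination, bits produced, bits consumed).
actors = ["v1", "v2", "v3", "v4"]
edges = [("v1", "v2", 1, 1), ("v2", "v3", 1, 2),
         ("v3", "v4", 1, 1), ("v1", "v4", 1, 2)]
q = repetition_vector(actors, edges)
```

Each vertex of the SDF graph then appears in the SAG as many times as its entry in `q` indicates.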
Note that all of the known approaches discuss task graph
sets, which are homogeneous SDF graphs. This assumption
leads to the very convenient situation of single invocations of
every task. In this work, general SDF graphs with diﬀerent
input and output rates (see edge e2 in Figure 9) are consid-
ered. A mapping of a task in the SDF graph from hardware to
software causes a more complex situation since certainly all
invocations of this task have to be moved to the DSP.
It has to be stated that the hardware/software partitioning
process itself relies heavily on the metrics and measurements




























Figure 9: Simple SDFG (left) and the decomposition into its SAG
(right).
with a high fidelity of values like execution time, power consumption,
and chip area, the partitioning algorithm is capable
of returning useful decisions very early in the design process.
In 2004 the first work of this group was published,
dealing solely with high-level metrics generation and the
partitioning problem [19]. The objective function for the
hardware/software partitioning included estimations of ex-
ecution time for software and hardware implementation and
gate counts for the hardware implementation. The core al-
gorithm to examine the search space was an adaptation of
the Kernighan-Lin min-cut heuristic [56] for process graphs.
Mainly because of the advancements of the underlying archi-
tecture abstraction and the more and more realistic commu-
nication model, scheduling issues came to the fore. The gen-
eralization to SDF graph systems with multiple invocation of
a single process has been shown [63]. This work focused on
a fast rescheduling method, which returns exact execution
times for every single partitioning solution, which a parti-
tioning algorithm evaluates during its run. The performance





















Figure 10: Decrease of design time by virtual prototyping and automatic generation of virtual prototypes.
of the so-called local exploitation of parallelism (LEP) algorithm
has been shown to be better than the aforementioned popular
list scheduling techniques, like Hu-level scheduling [66]
and earliest task first [67]. Most importantly LEP has been
developed to preserve linear complexity, since it is aimed
to be applied within partitioning algorithms that move in-
crementally through the search space (direct neighbourhood
searches). This rescheduling method has been enhanced to-
wards multicore architectures with many ASICs and several
DSPs [63].
6. VIRTUAL PROTOTYPING
One of the main difficulties in the design of an embedded
system, which consists of software and hardware parts, is that
usually the design and testing of the strongly related software
parts have to wait until the hardware has been manufactured,
whereas hardware development, and especially its testing,
is rather independent of the software development. Thus
the design of an embedded system is a sequential process
(Figure 10).
The application of a so-called virtual prototype (VP),
which is a software model of the hardware, allows for an earlier
start of the software development process and provides a
platform for hardware architecture evaluation. In this technique,
software reflects the behavior of the hardware and implements
the interface to the software, as it will be
realized later in hardware. Such a VP can be implemented
faster than the hardware itself, because all the hardware im-
plementation details specific to the chosen technology can be
neglected and high-level description languages can be used
instead of hardware description languages (HDLs).
Generally, a complex SoC reflects a platform-based de-
sign (PBD), typically one or more DSPs surrounded by mul-
tiple hardware accelerators (HA). Those HAs are called VP
components if they are used inside a VP simulation. The
hardware/software partitioning process transforms a system-
level specification into a heterogeneous architecture com-
posed of hardware and software modules. This partitioning
can be performed in a tool-supported way (Section 5) or
manually based on the experience of the designer.
Additionally diﬀerent abstraction levels of a VP support
a refinement process, which is especially needed for systems
with high complexity being too large to support a consis-
tent refinement in one major step. Several properties of ab-
straction layers are proposed for a VP, as they can be time
related (e.g., untimed, timed functional, bus cycle accurate,
and cycle true), data related (e.g., floating-point and fixed-
point representation), and communication related (e.g., syn-
chronous data flow, transaction level modeling (TLM) [68],
and open core protocol international partnership OCP [69]).
In Figure 11 three diﬀerent abstraction levels for a VP are
shown: one VP (Figure 11(a)) for a first architecture eval-
uation, which is characterized by its properties (e.g., data
rates, execution time, and power consumption); another one
(Figure 11(b)) for software development, which achieves fast
simulation performance by using a synchronous data flow
description; and a third one (Figure 11(c)) for the cycle true
cosimulation on register transfer level (RTL). The following
sections explain those VP models in more detail.
6.1. Virtual prototype for architecture exploration
A first evaluation of the system performance is achieved by
a high-level characterisation of a VP component regarding
only its features like, for example, input/output data rates,
worst-case execution time (WCET), best-case execution time
(BCET), and power consumption. Those properties can be
combined to a cost function as shown in Section 3.3. Such a
model provides a base for investigating communication bottlenecks
on the bus system, power consumption of the system,
and the meeting of real-time constraints.


















































Figure 11: Diﬀerent abstraction levels of a VP component.
6.2. Virtual prototyping for software development
In order to have high simulation speed together with a cer-
tain accuracy of the model, a VP component can have a cycle
true interface to the bus system, whereas the implementation
of the VP component is a high-level description (e.g., syn-
chronous data flow description in COSSAP). An automatic
generation method for a VP tailored for platform-based designs
allows for a further decrease of development time [17]
(Figure 10). Within such a method the algorithmic description
is reused for the VP component (Figure 12).
Usually at algorithmic level the design information is free
of communication details. Thus, in order to achieve com-
munication of the VP components via the chosen platform,
an object-oriented environment in C++ provides the func-
tionality of functional blocks, ports, FIFOs, and scheduling.
While this implementation implies a certain hardware plat-
form, much emphasis is put on the fact that this platform
is very general, a DSP with a common bus structure for its
hardware accelerator units. The automation is implemented
for COSSAP designs based on GenericC descriptions only.
However, the methodology is left open for supporting other
descriptions, like SystemC.
The implementation of such a VP representation needs a
simulation environment that allows for the simulation of
parallel processes. This is provided by a simulation interface
proposed by the Virtual Socket Interface Alliance (VSIA) [70].
This approach uses static scheduling, which achieves faster
simulation than the event-based simulation of SystemC and
VHDL. Even compared to a plain C++ implementation, the VSIA
implementation introduces negligible overhead.
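To indicate why a static schedule avoids the run-time cost of an event queue, the sketch below computes the firing counts for a chain of synchronous data flow actors from the balance equations [65] and expands them into a flat firing list before simulation starts. It does not reproduce the VSIA interface; the Edge type and both function names are hypothetical, and the restriction to a simple chain of actors is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Edge between actor i and actor i+1 of a chain: tokens produced
// per firing of actor i and consumed per firing of actor i+1.
struct Edge {
    unsigned produced;
    unsigned consumed;
};

static unsigned long long gcd_u64(unsigned long long a, unsigned long long b) {
    while (b != 0) {
        unsigned long long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Smallest integer firing counts r with r[i] * produced == r[i+1] * consumed.
std::vector<unsigned> repetitions(const std::vector<Edge>& chain) {
    // Firing rates as reduced fractions num/den, starting from r[0] = 1.
    std::vector<unsigned long long> num{1}, den{1};
    for (const Edge& e : chain) {
        assert(e.produced > 0 && e.consumed > 0);
        unsigned long long n = num.back() * e.produced;
        unsigned long long d = den.back() * e.consumed;
        unsigned long long g = gcd_u64(n, d);
        num.push_back(n / g);
        den.push_back(d / g);
    }
    // Scale all fractions by the least common multiple of the denominators.
    unsigned long long common = 1;
    for (unsigned long long d : den) common = common / gcd_u64(common, d) * d;
    std::vector<unsigned> reps;
    for (std::size_t i = 0; i < num.size(); ++i)
        reps.push_back(static_cast<unsigned>(num[i] * (common / den[i])));
    return reps;
}

// Flat schedule for a chain: every actor fires all its repetitions in
// topological order, so each input FIFO is filled before it is read.
std::vector<std::size_t> flat_schedule(const std::vector<unsigned>& reps) {
    std::vector<std::size_t> order;
    for (std::size_t actor = 0; actor < reps.size(); ++actor)
        for (unsigned k = 0; k < reps[actor]; ++k) order.push_back(actor);
    return order;
}
```

The simulator then only loops over the returned list; no events have to be sorted or dispatched while the simulation runs.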
Evaluating a hardware/software system under real-time
constraints additionally requires an accurate description
environment for software and hardware. Both software and
hardware need to be annotated with execution time estimates.
The software parts in particular need to take into account the
effects of interrupts, which can be modelled with TIPSY (TImed
Parallel SYstem Modeling) C++ [71].
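The effect of interrupts on an annotated execution time can be illustrated with a small calculation. The following sketch is not the TIPSY C++ interface and may differ from the model in [71]; the Interrupt type and completion_time() are hypothetical. It assumes a single task, interrupt service routines that always preempt the task, and interrupts that are served one after another in order of arrival.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Interrupt {
    unsigned long arrival;  // time at which the interrupt is raised
    unsigned long service;  // annotated run time of its service routine
};

// Completion time of a task that starts at `start` and is annotated
// with `exec` time units of pure execution time.
unsigned long completion_time(unsigned long start, unsigned long exec,
                              std::vector<Interrupt> irqs) {
    std::sort(irqs.begin(), irqs.end(),
              [](const Interrupt& a, const Interrupt& b) {
                  return a.arrival < b.arrival;
              });
    unsigned long now = start;       // current simulated time
    unsigned long remaining = exec;  // task execution time still owed
    for (const Interrupt& irq : irqs) {
        assert(irq.arrival >= start);  // assumed: none pending at start
        if (irq.arrival >= now + remaining) break;  // task finishes first
        if (irq.arrival > now) {
            // The task runs undisturbed until the interrupt arrives.
            remaining -= irq.arrival - now;
            now = irq.arrival;
        }
        now += irq.service;  // the service routine preempts the task
    }
    return now + remaining;
}
```

For example, a task annotated with 100 time units that is interrupted at times 30 and 95 by routines of 10 and 5 time units completes at time 115 under these assumptions.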
6.3. Virtual prototype for hardware development
As a last step, the internal behavior of the hardware
accelerators has to be transformed into a cycle-true model.
This step is usually called high-level synthesis; it has been
investigated by many research projects [72] and has also been
adopted in commercially available tools like CatapultC [73].
In that sense, VP also supports a design flow based on
refinement steps, which allows a much more consistent
procedure than switching between description languages.
A semiautomatic synthesis is achieved by the MASIC (MATH to
ASIC) environment, which allows the control part of a system
to be described with the global control, configuration, and
timing language (GLOCCT). Within this language, the FIFOs have
to be defined and connected manually. Functions, which are
described in C, are used for a bit-true and cycle-true
implementation. Afterwards, the RTL code is generated
automatically. A speedup on the order of 5 to 8 times compared
to manual creation is achieved [13].
Valderrama et al. [74] describe communication structures that
are provided in a library. Such communication libraries
implement mechanisms ranging from simple handshakes up to
layered networks. Nevertheless, a focus is needed on the
hardware/software cosimulation process in order to increase
the efficiency and quality of the design process.
Figure 12: Reuse of algorithmic description for virtual prototype generation.
7. CONCLUSIONS
This paper presents an overview of modern techniques and
methodologies for increasing the efficiency of the design
process of embedded systems, especially in the wireless
communications domain. The key factor influencing efficiency
is the organization and structure of the overall design
process. In an effort to increase efficiency, modern design
methodologies tend towards unified descriptions of the system,
with flexible and generalized tool integration schemes. Such
environments can save most of the design effort invested in
rewriting system descriptions, thus resulting in a streamlined
design process. However, the most substantial increase in
efficiency comes from the automation of all individual steps
in the design process through dedicated tools which are
integrated into the design methodologies.
Firstly, design decisions at all levels of the design process
are based on characteristics of the system, also called metrics.
Hence, reliable, fast, and accurate analysis of the system is of
paramount importance. The required metrics are efficiently
obtained through static code analysis techniques, which offer
increased speed by avoiding lengthy simulations, as well as the
capability to estimate a wide range of required system
properties.
Floating-point to fixed-point conversion is a critical step
in the design process whose automation offers significant
savings in design effort. Automation through dynamic
(data-driven) techniques is most promising, allowing for complete
replacement of the designer’s manual effort, while achieving
the same quality of conversion results. Modern automation
techniques offer optimization algorithms specifically suited
for the conversion towards a particular implementation option,
such as DSP code or custom hardware.
Hardware/software partitioning is another key step in the
design process, for which a variety of automated techniques
exists. The practical use and applicability of these implemen-
tations to industrial projects hinges heavily on the strength of
the underlying algorithms, the degree to which the environ-
ment is tailored to the application domain, as well as the in-
tegration of the environment into an overall design method-
ology covering the entire design process.
Finally, virtual prototyping is a promising design technique
for speeding up the design process, by allowing parallel
development of both hardware and software components in
the system. Modern design techniques for automated generation
of virtual prototypes also exist, thus boosting the design
productivity substantially.
ACKNOWLEDGMENT
This work has been funded by the Christian Doppler Lab-
oratory for Design Methodology of Signal Processing Algo-
rithms.
REFERENCES
[1] Y. Neuvo, “Cellular phones as embedded systems,” in Pro-
ceedings of IEEE International Solid-State Circuits Conference
(ISSCC ’04), vol. 1, pp. 32–37, San Francisco, Calif, USA,
February 2004.
[2] J. Hausner and R. Denk, “Implementation of signal processing
algorithms for 3G and beyond,” IEEE Microwave and Wireless
Components Letters, vol. 13, no. 8, pp. 302–304, 2003.
[3] R. Subramanian, “Shannon vs. Moore: driving the evolution
of signal processing platforms in wireless communications,”
in Proceedings of IEEE Workshop on Signal Processing Systems
(SIPS ’02), p. 2, San Diego, Calif, USA, October 2002.
[4] G. Moore, “Cramming more components onto integrated cir-
cuits,” Electronics Magazine, vol. 38, no. 8, pp. 114–117, 1965.
[5] International SEMATECH, “International Technology Road-
map for Semiconductors,” 1999, http://www.sematech.org.
[6] M. Rupp, A. Burg, and E. Beck, “Rapid prototyping for wire-
less designs: the five-ones approach,” Signal Processing, vol. 83,
no. 7, pp. 1427–1444, 2003.
[7] R. L. Moigne, O. Pasquier, and J.-P. Calvez, “A graphical tool
for system-level modeling and simulation with systemC,” in
Proceedings of the Forum on Specification & Design Languages
(FDL ’03), Frankfurt, Germany, September 2003.
[8] G. Karsai, J. Sztipanovits, A. Ledeczi, and T. Bapty, “Model-
integrated development of embedded software,” Proceedings of
the IEEE, vol. 91, no. 1, pp. 145–164, 2003.
[9] G. Karsai, “Design tool integration: an exercise in seman-
tic interoperability,” in Proceedings of the 7th IEEE Interna-
tional Conference and Workshop on the Engineering of Com-
puter Based Systems (ECBS ’00), pp. 272–278, Edinburgh, UK,
April 2000.
[10] Synopsys Inc., “Galaxy Design Platform,” http://www.syno-
psys.com/products/solutions/galaxy platform.html.
[11] SPIRIT Consortium, http://www.spiritconsortium.com.
[12] SPIRIT Schema Working Group Membership, “SPIRIT-User
Guide v1.1,” Tech. Rep., SPIRIT Consortium, San Diego, Calif,
USA, June 2005.
[13] H. Posadas, F. Herrera, V. Fernández, P. Sánchez, E. Villar, and
F. Blasco, “Single source design environment for embedded
systems based on SystemC,” Design Automation for Embedded
Systems, vol. 9, no. 4, pp. 293–312, 2004.
[14] M. Raulet, F. Urban, J.-F. Nezan, C. Moy, O. Deforges, and Y.
Sorel, “Rapid prototyping for heterogeneous multicomponent
systems: an MPEG-4 stream over a UMTS communication
link,” EURASIP Journal on Applied Signal Processing, vol. 2006,
Article ID 64369, 1–13, 2006, special issue on design methods
for DSP systems.
[15] P. Belanović, B. Knerr, M. Holzer, G. Sauzon, and M.
Rupp, “A consistent design methodology for wireless embed-
ded systems,” EURASIP Journal on Applied Signal Processing,
vol. 2005, no. 16, pp. 2598–2612, 2005, special issue on DSP
enabled radio.
[16] MySQL Database Products, http://www.mysql.com/products/
database.
[17] B. Knerr, P. Belanović, M. Holzer, G. Sauzon, and M. Rupp,
“Design flow improvements for embedded wireless receivers,”
in Proceedings of the 12th European Signal Processing Confer-
ence (EUSIPCO ’04), pp. 2015–2018, Vienna, Austria, Septem-
ber 2004.
[18] P. Belanović, B. Knerr, M. Holzer, and M. Rupp, “A fully
automated environment for verification of virtual prototypes,”
EURASIP Journal on Applied Signal Processing, vol. 2006, Ar-
ticle ID 32408, 2006, special issue on design methods for DSP
systems.
[19] B. Knerr, M. Holzer, and M. Rupp, “HW/SW partitioning us-
ing high level metrics,” in Proceedings of International Confer-
ence on Computer, Communication and Control Technologies
(CCCT ’04), vol. 8, pp. 33–38, Austin, Tex, USA, August 2004.
[20] M. Holzer and M. Rupp, “Static code analysis of functional
descriptions in systemC,” in Proceedings of the 3rd IEEE Inter-
national Workshop on Electronic Design, Test and Applications
(DELTA ’06), pp. 243–248, Kuala Lumpur, Malaysia, January
2006.
[21] P. Belanović and M. Rupp, “Automated floating-point to
fixed-point conversion with the fixify environment,” in Proceedings
of the 16th International Workshop on Rapid System Prototyping
(RSP ’05), pp. 172–178, Montreal, Canada, June 2005.
[22] M. Shepperd and D. Ince, Derivation and Validation of
Software Metrics, Oxford University Press, New York, NY, USA,
1993.
[23] B. W. Boehm, Software Engineering Economics, Prentice-Hall,
Englewood Cliffs, NJ, USA, 1981.
[24] Y. L. Moullec, P. Koch, J.-P. Diguet, and J.-L. Philippe, “De-
sign trotter: building and selecting architectures for embedded
multimedia applications,” in Proceedings of IEEE International
Symposium on Consumer Electronics (ISCE ’03), Sydney, Aus-
tralia, December 2003.
[25] T. McCabe, “A complexity measure,” IEEE Transaction of Soft-
ware Engineering, vol. 2, no. 4, pp. 308–320, 1976.
[26] J. Poole, “A method to determine a basis set of paths to
perform program testing,” Report 5737, U.S. Department of
Commerce/National Institute of Standards and Technology,
Gaithersburg, Md, USA, November 1995.
[27] D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis:
Introduction to Chip and System Design, Kluwer Academic,
Norwell, Mass, USA, 1992.
[28] J. Pal Singh, A. Kumar, and S. Kumar, “A multiplier genera-
tor for Xilinx FPGA’s,” in Proceedings of the 9th IEEE Interna-
tional Conference on VLSI Design, pp. 322–323, Bangalore, In-
dia, January 1996.
[29] K. M. Büyüksahin and F. N. Najm, “High-level area
estimation,” in Proceedings of International Symposium on Low Power
Electronics and Design (ISLPED ’02), pp. 271–274, Monterey,
Calif, USA, August 2002.
[30] C. Brandolese, W. Fornaciari, and F. Salice, “An area estima-
tion methodology for FPGA based designs at SystemC-level,”
in Proceedings of the 41st Design Automation Conference (DAC
’04), pp. 129–132, San Diego, Calif, USA, June 2004.
[31] M. Holzer and M. Rupp, “Static estimation of the execution
time for hardware accelerators in system-on-chips,” in Pro-
ceedings of International Symposium on System-on-Chip (SoC
’05), pp. 62–65, Tampere, Finland, November 2005.
[32] H. Posadas, F. Herrera, P. Sánchez, E. Villar, and F. Blasco,
“System-level performance analysis in SystemC,” in Proceed-
ings of the Design, Automation and Test in Europe Conference
and Exhibition (DATE ’04), vol. 1, pp. 378–383, Paris, France,
February 2004.
[33] P. Bjureus, M. Millberg, and A. Jantsch, “FPGA resource
and timing estimation from Matlab execution traces,” in Pro-
ceedings of the 10th International Symposium on Workshop
on Hardware/Software Codesign, pp. 31–36, Estes Park, Colo,
USA, May 2002.
[34] S. Devadas and S. Malik, “A survey of optimization techniques
targeting low power VLSI circuits,” in Proceedings of the 32nd
ACM/IEEE Conference on Design Automation (DAC ’95), pp.
242–247, San Francisco, Calif, USA, June 1995.
[35] W. Fornaciari, P. Gubian, D. Sciuto, and C. Silvano, “Power es-
timation of embedded systems: a hardware/software codesign
approach,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 6, no. 2, pp. 266–275, 1998.
[36] P. Landman, “High-level power estimation,” in Proceedings of
International Symposium on Low Power Electronics and Design,
pp. 29–35, Monterey, Calif, USA, August 1996.
[37] T. K. Moon and W. C. Stirling, Mathematical Methods and
Algorithms for Signal Processing, Prentice-Hall, Upper Saddle
River, NJ, USA, 2000.
[38] D. Sciuto, F. Salice, L. Pomante, and W. Fornaciari, “Metrics
for design space exploration of heterogeneous multiprocessor
embedded systems,” in Proceedings of International Workshop
on Hardware/Software Codesign, pp. 55–60, Estes Park, Colo,
USA, May 2002.
[39] H. Keding, M. Willems, M. Coors, and H. Meyr, “FRIDGE: a
fixed-point design and simulation environment,” in Proceed-
ings of Design, Automation and Test In Europe (DATE ’98), pp.
429–435, Paris, France, February 1998.
[40] Synopsys, “Converting ANSI-C into fixed-point using Co-
centric fixed-point designer,” Tech. Rep., Synopsys, Mountain
View, Calif, USA, April 2000.
[41] M. Stephenson, J. Babb, and S. Amarasinghe, “Bitwidth anal-
ysis with application to silicon compilation,” in Proceedings of
the ACM SIGPLAN Conference on Programming Language De-
sign and Implementation (PLDI ’00), pp. 108–120, Vancouver,
BC, Canada, June 2000.
[42] D. Menard, D. Chillet, F. Charot, and O. Sentieys, “Automatic
floating-point to fixed-point conversion for DSP code genera-
tion,” in Proceedings of International Conference on Compilers,
Architecture and Synthesis for Embedded Systems (CASES ’02),
pp. 270–276, Grenoble, France, October 2002.
[43] C. F. Fang, R. A. Rutenbar, and T. Chen, “Fast, accurate static
analysis for fixed-point finite-precision effects in DSP designs,”
in Proceedings of IEEE/ACM International Conference
on Computer-Aided Design, pp. 275–282, San Jose, Calif, USA,
November 2003.
[44] S. Kim, K.-I. Kum, and W. Sung, “Fixed-point optimization
utility for C and C++ based digital signal processing pro-
grams,” IEEE Transactions on Circuits and Systems II: Analog
and Digital Signal Processing, vol. 45, no. 11, pp. 1455–1464,
1998.
[45] Y. Cao and H. Yasuura, “Quality-driven design by bitwidth
optimization for video applications,” in Proceedings of IEEE/ACM
Asia and South Pacific Design Automation Conference, pp. 532–
537, Kitakyushu, Japan, January 2003.
[46] C. Shi and R. W. Brodersen, “An automated floating-point to
fixed-point conversion methodology,” in Proceedings of IEEE
International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP ’03), vol. 2, pp. 529–532, Hong Kong, April
2003.
[47] MathWorks Simulink, http://www.mathworks.com/products/
simulink.
[48] J. Hromkovič, Algorithmics for Hard Problems, Springer, New
York, NY, USA, 2nd edition, 2003.
[49] C. Mehlführer, F. Kaltenberger, M. Rupp, and G. Humer, “A
scalable rapid prototyping system for real-time MIMO OFDM
transmission,” in Proceedings of the 2nd IEE/EURASIP Con-
ference on DSP Enabled Radio, Southampton, UK, September
2005.
[50] R. Ernst, J. Henkel, and T. Benner, “Hardware-software cosyn-
thesis for microcontrollers,” IEEE Design & Test, vol. 10, no. 4,
pp. 64–75, 1993.
[51] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization
by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–
680, 1983.
[52] D. Henkel, J. Herrman, and R. Ernst, “An approach to the
adaption of estimated cost parameters in the COSYMA system,”
in Proceedings of the 3rd International Workshop on
Hardware/Software Codesign (CODES ’94), pp. 100–107, Grenoble,
France, September 1994.
[53] J. Henkel and R. Ernst, “Hardware/software partitioner using
a dynamically determined granularity,” in Proceedings of the
34th Annual Conference on Design Automation (DAC ’97), pp.
691–696, Anaheim, Calif, USA, June 1997.
[54] A. Kalavade and E. A. Lee, “Global criticality/local phase
driven algorithm for the constrained hardware/software parti-
tioning problem,” in Proceedings of the 3rd International Work-
shop on Hardware/Software Codesign (CODES ’94), pp. 42–48,
Grenoble, France, September 1994.
[55] E. A. Lee, “Overview of the ptolemy project,” Tech. Rep.,
University of Berkeley, Berkeley, Calif, USA, March 2001.
http://ptolemy.eecs.berkeley.edu.
[56] B. Kernighan and S. Lin, “An efficient heuristic procedure in
partitioning graphs,” Bell System Technical Journal, vol. 49,
no. 2, pp. 291–307, 1970.
[57] A. Kalavade and E. A. Lee, “Extended partitioning problem:
hardware/software mapping and implementation-bin selec-
tion,” in Proceedings of the 6th IEEE International Workshop on
Rapid System Prototyping, pp. 12–18, Chapel Hill, NC, USA,
June 1995.
[58] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli, “System level
hardware/software partitioning based on simulated annealing
and tabu search,” Design Automation for Embedded Systems,
vol. 2, no. 1, pp. 5–32, 1997.
[59] F. Vahid and T. D. Le, “Extending the kernighan/lin heuris-
tic for hardware and software functional partitioning,” Design
Automation for Embedded Systems, vol. 2, no. 2, pp. 237–261,
1997.
[60] K. S. Chatha and R. Vemuri, “Iterative algorithm for hardware-
software partitioning, hardware design space exploration and
scheduling,” Design Automation for Embedded Systems, vol. 5,
no. 3, pp. 281–293, 2000.
[61] K. S. Chatha and R. Vemuri, “Hardware-software partition-
ing and pipelined scheduling of transformative applications,”
IEEE Transactions on Very Large Scale Integration (VLSI) Sys-
tems, vol. 10, no. 3, pp. 193–208, 2002.
[62] T. Wiangtong, P. Y. K. Cheung, and W. Luk, “Comparing
three heuristic search methods for functional partitioning in
hardware-software codesign,” Design Automation for Embed-
ded Systems, vol. 6, no. 4, pp. 425–449, 2002.
[63] B. Knerr, M. Holzer, and M. Rupp, “Fast rescheduling of
multi-rate systems for HW/SW partitioning algorithms,” in
Proceedings of the 39th Annual Asilomar Conference on Signals,
Systems, and Computers, Monterey, Calif, USA, October 2005.
[64] B. Knerr, M. Holzer, and M. Rupp, “A fast rescheduling
heuristic of SDF graphs for HW/SW partitioning algorithms,” in
Proceedings of the 1st International Conference on Communica-
tion System Software and Middleware (COMSWARE ’06), New
Delhi, India, January 2006.
[65] E. A. Lee and D. G. Messerschmitt, “Synchronous data flow,”
Proceedings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.
[66] T. C. Hu, “Parallel sequencing and assembly line problems,”
Tech. Rep. 6, Operations Research Center, Cambridge, Mass,
USA, 1961.
[67] J.-J. Hwang, Y.-C. Chow, F. D. Anger, and C.-Y. Lee, “Schedul-
ing precedence graphs in systems with interprocessor commu-
nication times,” SIAM Journal on Computing, vol. 18, no. 2, pp.
244–257, 1989.
[68] L. Cai and D. Gajski, “Transaction level modeling in system
level design,” Tech. Rep., Center for Embedded Computer Sys-
tems, Irvine, Calif, USA, 2003.
[69] A. Haverinnen, M. Leclercq, N. Weyrich, and D. Wingard,
“SystemC based SoC Communication Modeling for the OCP
Protocol,” Whitepaper, October 2002.
[70] U. Bortfeld and C. Mielenz, “C++ System Simulation Inter-
faces,” Whitepaper, July 2000.
[71] J. Cockx, “Efficient modelling of preemption in a virtual
prototype,” in Proceedings of International Workshop on Rapid
System Prototyping (RSP ’00), pp. 14–19, Paris, France, June 2000.
[72] S. Gupta, N. Dutt, R. Gupta, and A. Nicolau, “SPARK: a
high-level synthesis framework for applying parallelizing compiler
transformations,” in Proceedings of the 16th International Con-
ference on VLSI Design, pp. 461–466, New Delhi, India, Jan-
uary 2003.
[73] Y. Guo, D. McChain, and J. R. Cavallaro, “Rapid industrial
prototyping and scheduling of 3G/4G SoC architectures with
HLS methodology,” EURASIP Journal on Embedded Systems,
vol. 2006, Article ID 14952, 2006.
[74] C. A. Valderrama, A. Changuel, and A. A. Jerraya, “Virtual
prototyping for modular and flexible hardware-software sys-
tems,” Design Automation for Embedded Systems, vol. 2, no. 3-
4, pp. 267–282, 1997.
M. Holzer received his Dipl.-Ing. degree in
electrical engineering from the Vienna Uni-
versity of Technology, Austria, in 1999. Dur-
ing his diploma studies, he worked on the
hardware implementation of the LonTalk
protocol for Motorola. From 1999 to 2001,
he worked at Frequentis in the area of auto-
mated testing of TETRA systems and after-
wards until 2002 at Infineon Technologies
on ASIC design for UMTS mobiles. Since
2002, he has held a research position at the Christian Doppler
Laboratory for Design Methodology of Signal Processing Algorithms at
the Vienna University of Technology.
B. Knerr studied communications engineering at the Saarland
University in Saarbrücken and at the University of Technology
in Hamburg, respectively. He finished
the Diploma thesis about OFDM commu-
nication systems and graduated with hon-
ours in 2002. He worked for one year as a
Software Engineer at the UZR GmbH & Co
KG, Hamburg, on image processing and 3D
computer vision. In June 2003, he joined the
Christian Doppler Laboratory for Design Methodology of Signal
Processing Algorithms at the Vienna University of Technology as
a Ph.D. candidate. His research interests are HW/SW partitioning,
multicore task scheduling, static code analysis, and platform-based
design.
P. Belanović received his Dr. Tech. degree in
2006 from the Vienna University of Tech-
nology, Austria, where his research focused
on the design methodologies for embedded
systems in wireless communications, virtual
prototyping, and automated floating-point
to fixed-point conversion. He received his
M.S. and B.E. degrees from Northeastern
University, Boston, and the University of
Auckland, New Zealand, in 2002 and 2000,
respectively. His research focused on the acceleration of image
processing algorithms with reconfigurable platforms, both in re-
mote sensing and biomedical domains, as well as custom-format
floating-point arithmetic.
M. Rupp received his Dipl.-Ing. degree in
1988 at the University of Saarbrücken, Germany, and his
Dr. Ing. degree in 1993 at the Technische Universität
Darmstadt, Germany, where he worked with Eberhardt
Häusler on designing new algorithms for
acoustical and electrical echo compensa-
tion. From November 1993 until July 1995,
he had a postdoctoral position at the Uni-
versity of Santa Barbara, California with
Sanjit Mitra, where he worked with Ali H. Sayed on a robustness
description of adaptive filters with impacts on neural networks and
active noise control. From October 1995 until August 2001, he
was a member of the technical staff in the Wireless Technology
Research Department of Bell Labs, where he worked on
various topics related to adaptive equalization and rapid imple-
mentation for IS-136, 802.11, and UMTS. He is presently a Full
Professor for digital signal processing in mobile communications
at the Technical University of Vienna. He was an Associate Editor
of IEEE Transactions on Signal Processing from 2002 to 2005, he
is currently an Associate Editor of JASP EURASIP Journal of Ap-
plied Signal Processing, and of JES EURASIP Journal on Embed-
ded Systems, and he is elected as AdCom Member of EURASIP.
He authored and coauthored more than 200 papers and patents on
adaptive filtering, wireless communications and rapid prototyping,
as well as automatic design methods.
