Using on-chip networks to implement polymorphism in the co-design of object-oriented embedded systems  by Goudarzi, Maziar et al.
Journal of Computer and System Sciences 73 (2007) 1221–1231
www.elsevier.com/locate/jcss
Using on-chip networks to implement polymorphism
in the co-design of object-oriented embedded systems
Maziar Goudarzi ∗,1, Naser MohammadZadeh, Shaahin Hessabi
Department of Computer Engineering, Sharif University of Technology, Azadi Avenue, Tehran, Islamic Republic of Iran
Received 3 October 2005; received in revised form 11 March 2006
Available online 24 February 2007
Abstract
The Network-on-Chip (NoC) paradigm brings networks inside chips. We use the routing capabilities of the NoC as
a replacement for the Virtual Method Table (VMT) in hardware/software co-designed systems modelled in an Object-Oriented (OO) fashion, where
some methods may be implemented as hardware modules. This eliminates the VMT area and performance overhead in OO
co-designed embedded systems, where resources are limited and some functionality must be implemented in hardware to
meet the performance goals of the system. Our experimental results on real-world embedded applications show up to 32.15% lower
area and up to 5.1% higher speed compared to a traditional VMT-based implementation.
© 2007 Elsevier Inc. All rights reserved.
Keywords: Embedded systems; Object-oriented design; Network-on-chip (NoC); Hardware–software co-design; Polymorphism; Virtual method
dispatch; Application-specific instruction processor (ASIP)
1. Introduction
The centre of gravity in computing systems is moving from general-purpose computers to embedded computing
systems. Embedded systems of today's complexity can no longer be designed in an ad hoc manner; they need automated
design methodologies. Furthermore, available silicon technology on one side and market demand on the other
are pushing VLSI designers toward Electronic System-Level design (ESL) where the system-under-design is a
mix of hardware and software components. The International Technology Roadmap for Semiconductors reports that
80% of the development cost of embedded systems goes to software design [1]. This is a strong motive for us to
advocate a successful software design methodology, i.e. the object-oriented (OO) paradigm, for modelling the system
as the starting point in our ESL design methodology, named ODYSSEY [2].
The OO paradigm suggests modelling the system as a set of concurrently communicating objects and their interactions.
Each object belongs to a class that defines the data-fields of the object and the methods that can be invoked on
* Corresponding author. Corresponding address: 3rd Floor, Institute of System LSI Design Industry, Fukuoka, 3-8-33 Momochihama, Sawara-ku,
Fukuoka, 814-0001, Japan. Fax: +81 92 847 5190.
E-mail addresses: goudarzi@sharif.edu, goudarzi@slrc.kyushu-u.ac.jp (M. Goudarzi), naser_ml@yahoo.com (N. MohammadZadeh),
hessabi@sharif.edu (S. Hessabi).
1 Maziar Goudarzi is now with the System LSI Research Center of Kyushu University, Fukuoka, Japan. He would like to thank Professor Alan
Mycroft, Computer Laboratory, University of Cambridge, for his insightful discussions and comments on this work.
0022-0000/$ – see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.jcss.2007.02.009
the object. Each class can have child classes that inherit all data fields and methods of the parent, and that introduce
new data-fields and methods. Objects can be polymorphic; a polymorphic object of a certain class, say Base, can
dynamically change type to any other class that is a (direct or indirect) child of Base. Now, when a method is called
on a polymorphic object, that routine should be called which corresponds to the run-time class of the called object;
this feature is known as polymorphism and its implementation is called virtual method dispatch (the analogy for the
name comes from the fact that to implement polymorphism, the method call is dispatched to the appropriate method
implementation at run-time). Polymorphism is an indispensable feature of OO that allows for uniform handling of
all objects of a base class and all its descendants, and hence, reducing its overhead has been extensively investigated
by many researchers [3]. The dynamic nature of polymorphism prevents static optimisations beyond method boundaries.
Furthermore, it requires a run-time operation (often done using a mapping table called the Virtual Method Table
or VMT) to find the object class and to resolve the called method to the implementation corresponding to that class.
Various techniques try to minimise this run-time overhead [3] but none of them can remove it altogether. The situation
can become even worse if the OO model is subject to hardware–software co-design in order to explore the synergy
between hardware and software implementations. Two issues arise in this case: firstly, method calls may also be
invoked by hardware units, and secondly, method dispatch may cross the hardware–software boundary (see Section 2). This
re-emphasises the need for new techniques to efficiently implement polymorphism in co-designed OO systems.
The network-on-chip paradigm is proposed to address uncertainties in communication on systems-on-chips caused
by implementation issues in current and future deep submicron process technologies [4]. It proposes borrowing
models, techniques, and tools from the already-mature network design field and applying them to System-on-Chip (SoC)
design. At the heart of this paradigm resides a packet-switched network as the communication medium among the various
parts of the chip. If this technology-mandated integration of networks inside chips can also be used to dispatch
virtual methods with no additional hardware, polymorphism is provided for free. In this paper we show how we take
advantage of this network to propose a new mechanism for virtual method dispatch among hardware as well as software
method implementations. In summary, we implement method calls as network packets and assign the network
addresses and object numbers such that routing the packet results in dispatching the call. This implements polymorphism
for free; no additional circuitry is required other than the routing infrastructure inherent in the NoC paradigm.
This work advances our previous work [5] by providing a novel, scalable and distributed technique for dispatching
method calls among method implementations where the source of the method-calls as well as their destination
can vary between hardware and software. In [5] we introduced OO-ASIP as a special kind of Application-Specific
Instruction Processor (ASIP) with built-in support for the specific needs of OO embedded applications; this
specifically includes hardware mechanisms for dynamically dispatching method-calls among their hardware and software
implementations. There we showed that a first-generation OO-ASIP corresponds to a given class library already
designed for an application domain, and hence, it can serve all applications designed using that class library; we further
showed that the same OO-ASIP can cover extensions to the original class library as new software for the OO-ASIP,
and consequently, the same OO-ASIP can well serve all OO applications designed using the original class library
or its extensions. The focus in [5] was on the design methodology and the design flow for implementing
hardware–software object-oriented embedded applications. The OO-ASIP implementation presented in [5], although fulfilling
its designated tasks, suffers from a scalability issue due to its centralised method dispatcher. In this work, we focus
on the method dispatching issue and propose and analyse a new OO-ASIP implementation based on a distributed
network-based method dispatcher that not only addresses the main shortcoming of the implementation in [5], but also
takes advantage of the technology-mandated on-chip networks [4] to do this for free (see Section 4.1).
This paper is an extension of our previous work [6] on method-dispatching using on-chip networks. In [6] we
briefly presented the main idea and approach of identifying method dispatching with network routing. Here we
have added the following investigations:
• Detailed description of the motivation of the work and justifying the need for low-overhead method-dispatching
in object-oriented embedded applications especially when some methods are implemented in hardware for higher
efficiency.
• Explanation and discussion of alternative approaches for method-dispatching especially in embedded systems and
in the case where some methods are implemented in hardware for efficiency.
• Results of implementing our technique on an FPGA chip along with comparison to the most popular, but software-
only, technique.
• Case studies of a number of real-life embedded applications proving practicality of the technique in real
applications and quantitatively demonstrating its advantages.
The rest of this paper is organised as follows. Section 2 presents related work on co-design of OO models and
motivates the need for new techniques to keep a key software abstraction (i.e. polymorphism) intact and efficient across
the hardware–software boundary. Section 3 presents our proposed technique to implement polymorphism using packet
routing in on-chip networks. Implications of the scheme are analysed and discussed in Section 4. Section 5 provides
and analyses results of our implementation of the proposed technique and a number of case studies on a Virtex4
FPGA, and finally Section 6 summarises and concludes the paper.
2. Co-design of object-oriented models
Many researchers have used object-orientation to model and/or implement hardware and VLSI systems. However,
only a few of them [7–10] have also investigated the co-design of OO models. Two major approaches exist to partition
a given OO model into collaborating hardware and software components. The first solution divides the model into
software objects and hardware objects [11,12]. For the software objects, all the data and the method implementations
are kept local to the processor that hosts the objects. This can be done in the same way that a non-partitioned OO
program is compiled and run on a processor. The hardware objects are implemented in hardware in a variety of ways,
but this is not particularly important here. (As examples, the ODETTE project implements an object as a Finite-State
Machine (FSM) [11] while the OASE project uses analysis techniques to resolve OO constructs to non-OO ones so
as to pass them to behavioural synthesis tools [13]. A number of other approaches to synthesise hardware from OO
models can be found in [11].) Communication between the two partitions imposes some area/code overhead as well
as a performance penalty since extra hardware and software need to be added to the given OO model to interface the
two partitions. Although polymorphic calls may not cross the hardware–software boundary, all such approaches
impose some overhead to dispatch virtual methods in software objects as well as in hardware objects.
The second approach to partitioning OO models divides the model into software methods and hardware methods;
i.e., methods of the class library used in the model are assigned respectively to the software or the hardware partition.
Data of the objects are kept in a place where both partitions can equally well access them. We believe that this is more
intuitive since the operations (i.e. class methods) are assigned to partitions instead of the modelling components (i.e.
objects). Thus, we follow this approach to partition the OO model as other researchers also do [10] (although toward
a different target architecture). We have also developed an ESL design methodology [14], its supporting design-
automation tools [15], and a number of real-world case studies [16,17] following this approach.
The researchers following the former OO partitioning approach have not reported mechanisms for dispatching
virtual methods across the hardware–software boundary. We focus on the latter partitioning approach and discuss why
polymorphism is more complicated to implement here. In method-based hardware–software partitioning,
redefinitions of the same virtual method may reside in different partitions. Consequently, when dispatching virtual methods,
not only the target method but also the corresponding partition should be identified by the virtual method dispatch
mechanism, and accordingly, an appropriate mechanism should be employed for method invocation, parameter passing,
and method returning, which are collectively called method linkage. Non-virtual calls are a special case where we
statically know the method implementation style, and hence, the corresponding method linkage solution can be employed.
In this case, when both the caller and the callee are software methods, the traditional mechanisms of OO software
compilation can still be used. However, if either the caller or the callee is a hardware method, a new method linkage
mechanism is required; interface logic and a communication protocol must be devised to allow communication among
the processor and the hardware methods.
In the case of virtual methods, a software caller needs to extend traditional virtual method dispatch mechanisms with a
new tag identifying the implementation style of the callee; this is fairly inexpensively accomplished by adding a new
bit to every entry of the VMT. But for a hardware caller, new techniques are required since no previous work is reported
in the literature. The straightforward choice of replicating the VMT in every hardware module is not acceptable due to
its high area, power, and performance overhead (see Section 5). In the next section we provide a mechanism, using
on-chip networks, that simultaneously addresses the method linkage and virtual method dispatch problems when a hardware
caller or callee is involved.
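To make the "one extra bit per VMT entry" idea concrete for a software caller, the following C++ fragment is a minimal sketch under stated assumptions. The names VmtEntry, invoke_hardware, dispatch_via_vmt and c_f_software, the entry layout, and the placeholder return values are all hypothetical; the paper does not specify them, and the stand-in for the hardware interface merely returns a recognisable number.

```cpp
#include <cstdint>

// Signature assumed for a software method taking only the object.
using SoftwareMethod = int (*)(void* self);

// One VMT entry extended with a tag for the implementation style of the callee.
struct VmtEntry {
    bool in_hardware;          // the added tag bit
    SoftwareMethod sw_target;  // used when in_hardware is false
    std::uint32_t hw_unit;     // used when in_hardware is true (hypothetical unit index)
};

// Hypothetical stand-in for the interface logic that activates a hardware method.
inline int invoke_hardware(std::uint32_t hw_unit, void* /*self*/) {
    return 1000 + static_cast<int>(hw_unit);
}

// Software-side dispatch: look up the slot, then pick the linkage from the tag.
inline int dispatch_via_vmt(const VmtEntry* vmt, unsigned slot, void* self) {
    const VmtEntry& entry = vmt[slot];
    return entry.in_hardware ? invoke_hardware(entry.hw_unit, self)
                             : entry.sw_target(self);
}

// Placeholder software method used only to exercise the sketch.
inline int c_f_software(void* /*self*/) { return 3; }
```

The sketch covers only the software caller; a hardware caller would need its own copy of, or access to, such a table, which is the overhead the network-based scheme is meant to avoid.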
3. Polymorphism in a network-on-chip
Our NoC architecture and its components are given in Fig. 1. It consists of a processor core, which contains and
runs software methods (shown by small grey boxes inside the Instruction Memory box at the top of Fig. 1), and several
Functional Units (FUs, shown by big grey rectangles at the middle of Fig. 1) each implementing a hardware method.
Data of the objects is stored in a central data memory (the white box at the left-hand side of Fig. 1) to be accessible
to the FUs as well as the processor. An on-chip network connects all FUs and the processor core together. The NoC
architecture shown in Fig. 1 corresponds to the following C++-like code excerpt:
class A {
    virtual int f() {...}      // to be a hardware-method
    virtual char g(int) {...}  // to be a software-method
};
class B extends A {
    virtual int f() {...}      // to be a hardware-method
};
class C extends B {
    virtual int f() {...}      // to be a software-method
};
Here, class C is derived from class B derived from class A while both B and C override the f () method of A. There
is no restriction on assigning methods to partitions, and hence, the same class may have some methods in hardware and
some in software (e.g., A::f () is in hardware in Fig. 1 while A::g() is in software) and also redefinitions of the same
method may be in different partitions (e.g., A::f () and B::f () are hardware methods while C::f () is a software one).
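For readers who want to compile the hierarchy, the excerpt can be rendered in standard C++ roughly as follows. This is only a sketch: the method bodies are elided in the original, so the return values 1, 2 and 3 (and 'a' for g) are placeholders chosen to make the dispatch visible, and the helper call_f is a hypothetical name.

```cpp
// Standard C++ rendering of the C++-like excerpt; bodies are placeholders.
class A {
public:
    virtual ~A() = default;
    virtual int f() { return 1; }        // to be a hardware method
    virtual char g(int) { return 'a'; }  // to be a software method
};

class B : public A {
public:
    int f() override { return 2; }       // to be a hardware method
};

class C : public B {
public:
    int f() override { return 3; }       // to be a software method
};

// A call through a base-class reference is bound to the run-time class.
inline int call_f(A& object) { return object.f(); }
```

Calling call_f with an A, B or C object selects A::f, B::f or C::f respectively, while g is inherited unchanged by B and C.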
The processor contains the master thread of control and starts running it when powered up. When a method call
is dispatched to a hardware method, its corresponding FU is activated which may also activate other FUs, or even a
software method inside the processor, to accomplish its task.
The fundamental point in our dispatching mechanism is that we view method calls as packets sent over the on-chip
network from the caller module to the called one, carrying the parameters of the call as the packet data payload. The
return value(s), if any, are sent back in another packet from the callee to the caller; this packet also designates the end
of method execution. We assign the network addresses and object numbers such that routing of the packet results in
(dynamic) binding of the corresponding (virtual) method call. Consequently, not only is the target of the call resolved
irrespective of the call being virtual or non-virtual (see Sections 3.3 and 3.4), but method linkage is also done at the
same time. More importantly, this technique works irrespective of the caller or callee being in hardware or software
and requires only one table (the routing table of the network) as opposed to per-FU and per-processor replicated VMTs
in the case of a straightforward extension of the VMT technique. In the following subsections, we present our address and
number assignment techniques for network nodes and for the objects.
Fig. 1. Our NoC architecture and its components.
3.1. FU network-address assignment scheme
The input OO model consists of a library of classes each defining a set of methods. We assign a unique bit-field
identifier, cid, to each class in the input class library, and another unique identifier, mid, to the methods of each class
such that overridden methods in child classes use the same identifier as in the parent class; i.e., if class B is derived
from A and overrides the A::f (args) method by B::f (args), both f (args) methods share the same identifier, say M.
Each hardware method is implemented as an FU, and hence, each FU corresponds to a certain method, mid, of a
certain class numbered cid. Correspondingly, we set the network address of the FU to the bit-field identifier FUid in
the following equation where the “.” operation represents a bit-field concatenation:
FUid = <mid.cid>. (1)
The same identification scheme can be used for software methods; however, since all software methods are imple-
mented in the same processor, multiple network addresses are assigned to the processor. This causes no problem as
long as the routing tables of the on-chip network are maintained such that packets destined to different addresses can
arrive at the same node.
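Eq. (1) can be illustrated with a few lines of C++. This is a sketch only: the 8-bit width of the cid field, the numeric value 5 used for mid, and the names kCidBits and make_fuid are assumptions made for the example, not values given in the paper.

```cpp
#include <cstdint>

// Assumed width of the cid field; the paper leaves the widths open.
constexpr unsigned kCidBits = 8;

// Eq. (1): FUid = <mid.cid>, with mid in the high bits and cid in the low bits.
constexpr std::uint32_t make_fuid(std::uint32_t mid, std::uint32_t cid) {
    return (mid << kCidBits) | (cid & ((1u << kCidBits) - 1u));
}

// Overriding methods share the same mid, so A::f and B::f differ only in cid.
static_assert(make_fuid(5, 1) == 0x0501, "A::f with mid = 5, cid = 1");
static_assert(make_fuid(5, 2) == 0x0502, "B::f with mid = 5, cid = 2");
```

Because software methods follow the same numbering, the processor simply answers to every address whose <mid.cid> pair names a software method.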
3.2. Object numbering scheme
During code generation, compilers of OO programs identify each object by the starting address of the portion of
data memory allocated to store the data fields of that object. They also store a per-object hidden tag to identify the
class of the object [18]. We propose an alternative object identification scheme that explicitly includes the cid of the
object class. This is used in our method dispatching mechanism to identify the corresponding method implementation
on the fly (see below).
Each object is also assigned a number, objn, that is unique among all objects of the same class. The object identifier,
oid, is generated by concatenating this objn to the cid of the object class:
oid = <cid.objn>. (2)
For example, the objects of a class A with cid = 1 will be numbered 1.1, 1.2, 1.3, etc.
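The object numbering of Eq. (2) can be sketched in the same way. The 8-bit objn width and the helper names make_oid, cid_of and objn_of are assumptions for illustration only.

```cpp
#include <cstdint>

// Assumed width of the objn field; not specified in the paper.
constexpr unsigned kObjnBits = 8;

// Eq. (2): oid = <cid.objn>, with cid in the high bits and objn in the low bits.
constexpr std::uint32_t make_oid(std::uint32_t cid, std::uint32_t objn) {
    return (cid << kObjnBits) | (objn & ((1u << kObjnBits) - 1u));
}

// The class of an object can be read back directly from its identifier.
constexpr std::uint32_t cid_of(std::uint32_t oid) { return oid >> kObjnBits; }
constexpr std::uint32_t objn_of(std::uint32_t oid) {
    return oid & ((1u << kObjnBits) - 1u);
}

// Objects 1.1, 1.2 and 1.3 of class A (cid = 1) all carry the same cid.
static_assert(make_oid(1, 2) == 0x0102, "object 1.2");
static_assert(cid_of(make_oid(1, 3)) == 1, "class of object 1.3");
static_assert(objn_of(make_oid(1, 3)) == 3, "number of object 1.3");
```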
3.3. Method dispatch without polymorphism
When calling non-virtual methods (e.g., obj.f (params) in C++), the destination method implementation is statically
known. However, if the caller and/or the callee is a hardware method, a new method linkage mechanism,
such as ours, is still required. We first present our scheme for this case, and then extend it to polymorphic calls where the
type of the called object is not statically known.
We view each method call as a network packet. A method call is identified by a method (mid), an object (oid), and
the parameters of the call (params), and hence, the bit-field concatenation of these items represents the method call
and comprises the corresponding network packet to be sent; i.e. <mid.oid.params>.
Using Eq. (2) above, the oid field can be expanded to <cid.objn>, and hence, the packet becomes
<mid.<cid.objn>.params> or simply <mid.cid.objn.params>, which can also be regrouped to <<mid.cid>.objn.params>.
This finally becomes <FUid.objn.params> when observing that, according to Eq. (1), the <mid.cid> pair designates
an FUid, which is conveniently the unit that must handle the method and hence the destination of the packet.
In other words, a method call such as oid.mid(params) is equivalent to a network packet <mid.oid.params> which
is routed, for no extra cost, to the FU numbered FUid while the objn and params respectively represent the object to
work on and the parameters of the call.
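The regrouping argument can be checked mechanically. The sketch below builds the packet header from the caller's view, <mid.oid>, and reads it back from the network's view, <FUid.objn>; the 8-bit field widths and all function names are assumptions, and the params payload is left out because it plays no part in routing.

```cpp
#include <cstdint>

// Assumed field widths, for illustration only.
constexpr unsigned kCidBits = 8;
constexpr unsigned kObjnBits = 8;

// Bit-field concatenation <hi.lo>, where lo occupies lo_bits bits.
constexpr std::uint32_t concat(std::uint32_t hi, std::uint32_t lo, unsigned lo_bits) {
    return (hi << lo_bits) | (lo & ((1u << lo_bits) - 1u));
}

// Caller's view: the header is <mid.oid> with oid = <cid.objn>.
constexpr std::uint32_t header_from_call(std::uint32_t mid, std::uint32_t oid) {
    return concat(mid, oid, kCidBits + kObjnBits);
}

// Network's view: the leading <mid.cid> bits form the destination FUid.
constexpr std::uint32_t destination(std::uint32_t header) { return header >> kObjnBits; }
constexpr std::uint32_t payload_objn(std::uint32_t header) {
    return header & ((1u << kObjnBits) - 1u);
}

// Calling method 5 on object 1.7 yields a header whose destination is <5.1>.
static_assert(header_from_call(5, concat(1, 7, kObjnBits)) == 0x050107, "<5.1.7>");
static_assert(destination(0x050107) == concat(5, 1, kCidBits), "FUid = <5.1>");
static_assert(payload_objn(0x050107) == 7, "objn = 7");
```

No table lookup appears on the caller's side: the destination falls out of the bit layout alone.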
Major fields of the method-dispatching packet are shown in Fig. 2. The grey part shows the packet fields. In addition
to mid, cid, and objn, other headers may also be required, e.g. the source address to allow the called FU to return a
result if required and to notify the caller of the method completion. The objn field and method parameters are sent as
the data payload of the packet.
Fig. 2. The method-dispatching packet and its fields.
Example 1. Suppose that an OO model defines a class A with method f (args). Further assume that number 1 is
assigned to the class and the unique number M to the f (args) method. Our network-address assignment scheme
suggests <mid.cid> for the corresponding FU (or for the processor if the method is in software), resulting in M.1 for
A::f (args). When defining a7 as an object of class A, our object-numbering scheme suggests <cid.objn> or 1.7, for
example, as the identifier of the a7 object. Now, calling a7.f (params) corresponds to the following packet:
<M.1.7.params>
The packet destination address is M.1 (see the packet format in Fig. 2), and hence, when sent over the network, the
routing structure conveys it to its destination address, M.1, which corresponds to the A::f (args) FU (or the processor)
as expected.
3.4. Method dispatch with polymorphism
Polymorphism is expressed in various ways in different OO languages. We follow a C++-like approach by allowing
method-calls on pointers-to-objects; i.e. assuming that ap is a pointer to objects of class A (i.e. “A *ap;” in C++
syntax), the statement ap->f(args) is one representation of polymorphism. Polymorphism implies that the pointer
may point to objects of different classes (constrained by the class hierarchy and the class of the pointer) at run-time.
Unlike C++, however, we use the object numbers (see Section 3.2) to represent object pointers instead of memory
addresses of objects.
The previous section showed how a packet is assembled for a call with a statically known target. To implement
polymorphism, we simply put the run-time value of the pointer in the object-id portion of the packet (which overlaps
both header and payload) and send it on the network as before; depending on the dynamic value of the pointer,
the packet may reach different FUs or the processor, but in any case it will be the appropriate one, by the same
argument as in the previous section.
Example 2. Suppose that one derives two classes B and C from class A in Example 1, such that both of them override
A::f (args). Further suppose that A::f (args) and B::f (args) are hardware methods whereas C::f (args) is a software
method as shown in Fig. 1. Assume that we assign numbers 2 and 3 respectively to the B and C classes, and hence,
the two FUs are numbered M.1 and M.2 respectively and the processor network address is set to M.3 (recall that all
three f (args) methods use the same identifier M). Finally, assume that the system model defines only one object from
each class, respectively named a, b, c and numbered 1.1, 2.1, and 3.1.
A pointer to class A (e.g. ap) may dynamically point to a, b, or c; i.e. ap may contain 1.1, 2.1, or 3.1
respectively. Polymorphism implementation implies that calling ap->f(params) ought to result in invoking A::f (args)
or B::f (args) or C::f (args) depending on the run-time value of ap. Assembling a packet with the ap value results in
the packet <M.1.1.params> or <M.2.1.params> or <M.3.1.params>, which is routed by the network to node M.1 or M.2 or
M.3, respectively corresponding to the A::f (args) or B::f (args) or C::f (args) method implementation. Thus, the
polymorphism requirement is satisfied and the run-time operation of virtual method dispatch is identified with the
run-time operation of packet routing.
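Example 2 can be simulated in software as a rough sketch. The std::map below stands in for the routing table of the on-chip network; the numeric value 5 for M, the 8-bit field widths, the node labels and the function names are all assumptions made for the illustration.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Assumed field widths and an assumed numeric value for the method identifier M.
constexpr unsigned kCidBits = 8;
constexpr unsigned kObjnBits = 8;
constexpr std::uint32_t kM = 5;

constexpr std::uint32_t make_oid(std::uint32_t cid, std::uint32_t objn) {
    return (cid << kObjnBits) | objn;
}
constexpr std::uint32_t make_fuid(std::uint32_t mid, std::uint32_t cid) {
    return (mid << kCidBits) | cid;
}

// The single routing table: network address <mid.cid> to node.
inline const std::map<std::uint32_t, std::string>& routing_table() {
    static const std::map<std::uint32_t, std::string> table = {
        {make_fuid(kM, 1), "FU for A::f"},
        {make_fuid(kM, 2), "FU for B::f"},
        {make_fuid(kM, 3), "processor (C::f)"},
    };
    return table;
}

// ap->f(...): place the run-time pointer value (an oid) in the packet and route it.
inline std::string dispatch(std::uint32_t mid, std::uint32_t ap) {
    const std::uint32_t header = (mid << (kCidBits + kObjnBits)) | ap;
    const std::uint32_t dest = header >> kObjnBits;  // leading <mid.cid> bits
    return routing_table().at(dest);
}
```

With ap holding 1.1, 2.1 or 3.1, the same call site reaches the A::f unit, the B::f unit or the processor, without consulting any per-class table at the caller.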
4. Analysing the scheme
In this section we analyse our proposed method dispatch scheme in terms of its implementation cost and also the
implications of the change that we propose in the object identifier.
4.1. Implementation cost
Availability of a fast on-chip network is a key requirement for the success of our scheme. In general, a common
network connecting the processor and the FUs is required when some methods are implemented in hardware;
otherwise, point-to-point interfaces would be needed between every two units that could activate each other (this network
can be as simple as a bus, but current technology trends will ultimately necessitate a packet-switched network in deep
submicron technologies [4]). So the cost of implementing an on-chip network is not specific to our scheme but
is a necessity raised by today's semiconductor technology and also by the need to connect several hardware units to
one another.
If the traditional VMT-based scheme is simply extended to serve hardware–software implementations, each FU needs
either its own copy of the VMT, which imposes replication overhead, or access to a central VMT, which is not scalable and
requires additional time for each lookup. In either case, the delay of invoking (not looking up) the call target is inevitable
if the caller or the callee is in hardware. Our scheme combines the lookup and the invocation operations to save not
only area but also clock cycles (see Section 5).
4.2. Implications of object numbering scheme
Our above-explained approach implies a change in the object identifier so that the VMT is no longer required
and the routing facility suffices to dispatch virtual as well as non-virtual methods. This change in the object identifier
affects two operations that were traditionally (i.e. when compiling OO software) straightforward. The first operation is
manipulation of object-pointers; object-pointers must now contain these new identifiers instead of traditional address-
in-memory of objects. The second operation is access to object data fields; the address of the object in memory is no
longer readily available in our object-identifier scheme.
The former issue does not impose any overhead but may affect the number of objects (of a certain class) that a
pointer can point to. With n bits allocated for a pointer, the traditional approach would allow pointing to 2^n objects
of arbitrary classes (assuming that all objects have only 1 byte of data, since otherwise two of them would overlap in
memory). Our scheme divides the pointer into two portions, namely cid and objn, with m bits for the cid field. This
leaves n − m bits for the objn field, and hence, 2^(n−m) objects of a certain class can be distinguished, but in return,
objects can be of any size. Although not impossible, it is very unlikely that the number of distinguishable objects
becomes the bottleneck in a real application, especially in embedded systems where dynamic object (de)allocation is
discouraged to keep systems simple.
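As a worked instance of this trade-off, the following sketch computes the capacities for an assumed 16-bit pointer with a 4-bit cid field. The numbers and the function names objects_per_class and class_count are illustrative assumptions, not figures from the paper.

```cpp
#include <cstdint>

// Objects of one class distinguishable by an n-bit pointer with m bits of cid.
// Assumes m <= n and n - m < 64.
constexpr std::uint64_t objects_per_class(unsigned n, unsigned m) {
    return std::uint64_t{1} << (n - m);
}

// Number of classes that an m-bit cid field can name.
constexpr std::uint64_t class_count(unsigned m) { return std::uint64_t{1} << m; }

// n = 16, m = 4: 16 classes with up to 2^(16-4) = 4096 objects each.
static_assert(class_count(4) == 16, "2^4 classes");
static_assert(objects_per_class(16, 4) == 4096, "2^12 objects per class");
// The total number of identifiers is still 2^n; it is only partitioned per class.
static_assert(class_count(4) * objects_per_class(16, 4) == 65536, "2^16 in total");
```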
To address the latter issue, we combine our object-identification scheme with the traditional address-in-memory
scheme; i.e. we extend the object identifier to include the starting address of the object in memory as well as our
<cid.objn> identifier. This causes no implementation overhead when no object pointer is involved since the
compiler can statically determine the appropriate identifier to use. For example, to access data fields of an object (e.g.
obj.data++ in C++ notation) only the address-in-memory identifier is required whereas for a non-virtual method call
(e.g. obj.f () in C++ notation) only our proposed identification scheme is needed. However, when using
pointers-to-objects, both identifiers must be stored in the pointer variable since the value of the pointer is not statically known and
the compiler cannot eliminate the unnecessary identifier. This causes a small overhead that conveniently affects only
pointer storage and pointer assignments.
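One way to picture such a combined pointer is the sketch below. The struct layout, the 8-bit field widths and the names ObjectPointer, field_address and call_header are assumptions for illustration; the paper does not prescribe a representation.

```cpp
#include <cstdint>

// Assumed field widths, for illustration only.
constexpr unsigned kCidBits = 8;
constexpr unsigned kObjnBits = 8;

// A pointer-to-object carrying both identifiers.
struct ObjectPointer {
    std::uint32_t address;  // starting address of the object's data fields
    std::uint32_t oid;      // <cid.objn>, used for method dispatch
};

// Data-field access (e.g. ptr->data++) needs only the address part.
constexpr std::uint32_t field_address(ObjectPointer p, std::uint32_t offset) {
    return p.address + offset;
}

// A method call (e.g. ptr->f()) needs only the oid part: header <mid.cid.objn>.
constexpr std::uint32_t call_header(std::uint32_t mid, ObjectPointer p) {
    return (mid << (kCidBits + kObjnBits)) | p.oid;
}
```

Assigning one ObjectPointer to another copies two words instead of one, which is the storage and assignment overhead mentioned in the text.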
5. Experimental results
In this section we first report the results of implementing our network-based dispatching scheme on FPGA and
compare it to the traditional VMT-based approach widely used in full-software OO compilers. Then a number of case
studies implementing real-life applications are provided and the improvements in area and speed achieved by our
scheme are presented and discussed.
5.1. Implementation results on FPGA
To compare our method-dispatching scheme with traditional approaches, we implemented it on a Xilinx Virtex4
FPGA and measured the delays and hardware area using Xilinx Embedded Development Kit toolset and a Virtex4
Table 1
Area-speed results of implementing method-dispatching approaches on a Xilinx Virtex4 FPGA

Approach                                       Hardware area (slices of logic block)   Max. frequency (MHz)
Network-based approach                         1330 (processor) + 357                  131.688
VMT-based approach (software-only approach)    1330 (processor)                        131.688

Table 2
Delay figures of method calls with source and destination varying between software and hardware

Method dispatching delay (clock cycles)
Approach                  SW to SW call   SW to HW call   HW to SW call   HW to HW call
Network-based approach    500             75              463             1
VMT-based approach        105             NA              NA              NA

Table 3
Delay figures of return from method with source and destination varying between software and hardware

Return after method completion (clock cycles)
Approach                  SW to SW call   SW to HW call   HW to SW call   HW to HW call
Network-based approach    500             75              463             1
VMT-based approach        105             NA              NA              NA

development board. We also investigated the traditional VMT-based approach implemented by the GCC C++ compiler
on a MicroBlaze soft processor core implemented on logic blocks of the same Virtex4 FPGA. The resulting area and
maximum clock frequency figures are reported in Table 1.
The on-chip network consumed 357 slices of the FPGA. This is roughly 6.5% of the logic resources of the Virtex4
xc4vfx12 chip available on our development board. Note, however, that this is the price that one has to pay for the
technology-mandated NoC, not for our dispatching scheme. In this experiment we implemented our scheme on an
FPGA for rapid prototyping, but in a real NoC this network would be custom designed with optimum quality measures.
Also note in Table 1 that the on-chip network does not adversely affect the clock speed of the system.
Tables 2 and 3 compare the delays, in terms of clock cycles, when calling a method and when returning from it,
respectively, in our scheme and in the VMT-based scheme. As expected, software-to-software calls incur an overhead in the
network-based approach; however, this overhead conveniently affects only virtual method calls with a potential hardware
target. In other words, if a call is statically known to have all its potential targets in software, the traditional
VMT-based approach can be used to avoid the overhead. Moreover, note that the VMT-based approach is unable to directly
handle hardware methods and so is not a real rival for the network-based approach. Later in this section we extend the
VMT-based approach to support hardware methods and then compare it to the network-based approach in a number
of real-life case studies.
5.2. Case studies
To further investigate the approach in practice, we implemented three real-life embedded systems on the same
FPGA chip as above: an MPEG2 decoder, a JPEG encoder, and a JPEG decoder. For the MPEG2 decoder, we used the
MPEG2 decoding reference software [19] as an initial reference implementation and then developed an OO model for
MPEG2 decoding ourselves in C++ on a desktop PC environment. To test conformance of our implementation against
the MPEG2 standard, we used the reference software and movies given in [19] and successfully validated our OO
model on the standalone computer before running it on an embedded processor. Limitations in the memory capacity
of the MicroBlaze processor [20] (a Xilinx-provided embedded processor that we used in our experiments) and its
corresponding compilation tools, however, made us reduce the size of the original video frames to 32×32 pixels to be
able to compile the code for MicroBlaze and download and run it on our Virtex4 development board. We used a Xilinx
ML-401 board connected to a personal computer to configure the FPGA and also for sending/receiving data as well
as for single-stepping and debugging the software being executed on the MicroBlaze processor implemented in the
FPGA logic blocks. After validating this full-software implementation on the embedded processor, we moved the
time-consuming parts of the application to hardware by implementing them in the logic blocks of the FPGA. To compare
our network-based method dispatching scheme with the extended VMT-based approach (described below), the
hardware–software implementation was equipped once with the former and once with the latter. We then validated both
hardware–software implementations of the MPEG2 decoder using the same debugging setup employed for the
full-software implementation. The same procedures were employed to implement and validate the JPEG encoder and
JPEG decoder applications in full-software as well as hardware–software realisations. The initial reference software
for these JPEG applications was the JPEG benchmark version 6a obtained from the MiBench benchmark suite [21].
Characteristics of our OO implementation of the three case studies, covering total number of classes, total number of
virtual methods, and number of methods implemented in hardware, are given in Table 4.

Table 4
Characteristics of our OO implementation of the studied applications

| Application   | # classes | # virtual methods | # hardware methods |
|---------------|-----------|-------------------|--------------------|
| MPEG2 decoder | 4         | 111               | 3                  |
| JPEG encoder  | 3         | 10                | 6                  |
| JPEG decoder  | 4         | 20                | 7                  |

Table 5
Comparison of implementing an MPEG2 decoder using our approach and an extended VMT-based one

| MPEG2 decoder application   | Area (logic slices)     | Time (clocks) | Operating freq. (MHz) | Throughput (frames/sec) |
|-----------------------------|-------------------------|---------------|-----------------------|-------------------------|
| Network-based approach      | 1330 (processor) + 2958 | 16 992        | 120                   | 21 186                  |
| Extended VMT-based approach | 1330 (processor) + 3909 | 17 412        | 120                   | 20 675                  |
Our OO MPEG2 decoder has 4 classes and a total of 111 virtual methods, resulting in a 111-entry VMT, with 3
methods implemented in hardware. We also extended the traditional VMT-based approach by adding a per-entry tag
that represents the implementation style of the call target, and by replicating the VMT in each FU so that hardware
units can call virtual methods. Implementation results are given in Table 5.
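One way to picture the extended VMT is the sketch below. It is an illustration under stated assumptions rather than the implemented design: the field names, the 32-bit `target` encoding and the helpers `is_hardware_call` and `replicated_entries` are hypothetical. Only the per-entry implementation tag, the 111-entry table size and the per-FU replication come from the description above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Implementation style of a call target (the per-entry tag).
enum class Impl : std::uint8_t { Software, Hardware };

// One entry of the extended VMT. The meaning of `target` depends on the tag
// (hypothetical encoding): a code address for software methods, or the
// address of a hardware functional unit (FU) for hardware methods.
struct VmtEntry {
    Impl tag;
    std::uint32_t target;
};

// The MPEG2 case study has 111 virtual methods, hence 111 entries per table.
constexpr std::size_t kMpeg2VirtualMethods = 111;
using ExtendedVmt = std::array<VmtEntry, kMpeg2VirtualMethods>;

// Look up a virtual method and report whether the call targets a hardware FU.
bool is_hardware_call(const ExtendedVmt& vmt, std::size_t method_index) {
    assert(method_index < vmt.size());
    return vmt[method_index].tag == Impl::Hardware;
}

// Replication cost in table entries: every FU holds its own copy of the VMT.
constexpr std::size_t replicated_entries(std::size_t num_fus) {
    return num_fus * kMpeg2VirtualMethods;
}
```

With 3 hardware methods, `replicated_entries(3)` gives 333 table entries spread over the FUs, which is the storage that replication adds on top of the FU logic itself.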
The area reported in the Area column of Table 5 is composed of two parts: the area of the MicroBlaze processor core
(which is the same in both cases) and the area of the other hardware modules. In the extended VMT-based approach
the latter is higher because the VMT is replicated in each FU. Each VMT consumed 317 slices when implemented using
FPGA logic resources, so the total area overhead depends on the number of FUs. In our MPEG2 case study, with
3 methods in hardware, this resulted in 3909 slices (including the area required to implement the functionality of
the FUs in addition to the VMTs), which is 32.15% more than the 2958 slices of the FUs without VMTs. This confirms
our claim in Sections 2 and 4 that the straightforward solution of VMT replication results in unacceptable area
overhead and still suffers from VMT lookup delay. Note that even with VMT replication, a communication network is
still required to connect the FUs and the processor. In fact, our technique uses the same communication network
while also replacing all replicated VMTs with a single network routing table.
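As a check, the quoted overhead follows directly from the FU areas in Table 5 (processor area excluded) and from the per-VMT cost of 317 slices. The symbols A_VMT and A_net are introduced here only for this calculation:

```latex
\[
A_{\mathrm{VMT}} - A_{\mathrm{net}} = 3909 - 2958 = 951 = 3 \times 317 \ \text{slices},
\qquad
\frac{A_{\mathrm{VMT}} - A_{\mathrm{net}}}{A_{\mathrm{net}}}
  = \frac{951}{2958} \approx 0.3215 = 32.15\%
\]
```

The difference of 951 slices is exactly three replicated VMTs of 317 slices each, one per hardware method.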
The figures in the Time column of Table 5 reflect the number of clock cycles required to decode a sample movie
with 3 frames of 32 × 32 pixels. The Throughput column also corresponds to 32 × 32 pixel frames. Our technique
achieved a 2.5% speedup compared to the extended VMT-based implementation by eliminating VMT lookups. The
achieved speedup depends on the number of virtual method calls performed over the entire application run: the more
virtual method calls, the higher the speedup.
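The relation between the Time and Throughput columns can be reproduced with the short sketch below. The function names are hypothetical. The formula assumes that throughput is simply the number of processed frames divided by the elapsed time (clock cycles divided by operating frequency), which agrees with the Table 5 figures.

```cpp
#include <cassert>
#include <cstdint>

// Throughput in items (frames or pictures) per second: items processed
// divided by elapsed time, where elapsed time = clocks / frequency.
double throughput_per_sec(std::uint64_t items, std::uint64_t clocks, double freq_hz) {
    assert(clocks > 0);
    return static_cast<double>(items) * freq_hz / static_cast<double>(clocks);
}

// Relative speedup of the network-based run over the VMT-based run, as a
// fraction (0.025 means 2.5%), from their clock counts at equal frequency.
double speedup(std::uint64_t clocks_vmt, std::uint64_t clocks_network) {
    assert(clocks_network > 0);
    return static_cast<double>(clocks_vmt) / static_cast<double>(clocks_network) - 1.0;
}
```

With the Table 5 figures (3 frames at 120 MHz), `throughput_per_sec(3, 16992, 120e6)` evaluates to about 21 186 frames/sec and `throughput_per_sec(3, 17412, 120e6)` to about 20 675 frames/sec, while `speedup(17412, 16992)` is about 0.025, matching the reported 2.5%.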
Tables 6 and 7 give similar implementation results for the JPEG encoder and JPEG decoder, respectively. Our JPEG
encoder has 3 classes and 10 virtual methods with 6 of them implemented in hardware while the JPEG decoder has
4 classes and 20 virtual methods with 7 methods implemented in hardware. The values in the Time and Throughput
columns correspond to encoding/decoding 32×32 pixel true-colour pictures. As explained at the beginning of this
subsection, we had to reduce the size of the pictures in order to get the code compiled and run on the limited resources
of the embedded processor. In these implementations, our approach achieved 5.1% and 3.5% higher throughput for
the encoder and the decoder respectively.
Table 6
Comparison of implementing a JPEG encoder using our approach and an extended VMT-based one

| JPEG encoder application    | Area (logic slices)     | Time (clocks) | Operating freq. (MHz) | Throughput (pictures/sec) |
|-----------------------------|-------------------------|---------------|-----------------------|---------------------------|
| Network-based approach      | 1330 (processor) + 3216 | 6896          | 100                   | 14 501                    |
| Extended VMT-based approach | 1330 (processor) + 3456 | 7245          | 100                   | 13 802                    |

Table 7
Comparison of implementing a JPEG decoder using our approach and an extended VMT-based one

| JPEG decoder application    | Area (logic slices)     | Time (clocks) | Operating freq. (MHz) | Throughput (pictures/sec) |
|-----------------------------|-------------------------|---------------|-----------------------|---------------------------|
| Network-based approach      | 1330 (processor) + 3613 | 7562          | 105                   | 13 885                    |
| Extended VMT-based approach | 1330 (processor) + 4264 | 7829          | 105                   | 13 411                    |

Fig. 3. Area saving achieved by our network-based method dispatching scheme compared to the extended VMT-based approach.

Figure 3 reflects the area improvement obtained by our network-based scheme compared to the extended VMT-based
approach. The black bars in the figure show the improvement relative to the FPGA area occupied by the
FUs only, while the white bars consider the area occupied by the entire system consisting of the FUs as well as the
MicroBlaze embedded processor core. The black bars reflect the hardware overhead irrespective of the processor used;
this value remains the same when using another processor core or even a processor hard-wired in the chip (i.e.,
not implemented in the FPGA logic blocks). The saving shown in Fig. 3 is due to eliminating the need to
replicate VMTs. Consequently, it differs between applications depending on the number of virtual methods and
the number of hardware-implemented ones: the number of virtual methods determines the size of each VMT, while the
number of hardware methods determines the number of replicas. Thus, the MPEG2 decoder achieves the highest
saving since it has the largest product of these two factors (see Table 4), while the JPEG encoder shows
the lowest saving due to having the smallest product.
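The dependence on the product of the two factors can be checked against the reported figures with a small sketch. The struct and function names are hypothetical, and the numbers are taken from Tables 4 to 7. The product is only a rough proxy for the total size of the replicated tables, not an area model.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct CaseStudy {
    std::string name;
    int virtual_methods;   // determines the size of each VMT (Table 4)
    int hardware_methods;  // determines the number of VMT replicas (Table 4)
    int fu_area_network;   // FU slices, network-based approach (Tables 5-7)
    int fu_area_vmt;       // FU slices, extended VMT-based approach (Tables 5-7)
};

// Product of the two factors named in the text.
int replication_product(const CaseStudy& c) {
    return c.virtual_methods * c.hardware_methods;
}

// Slices saved by dropping the replicated VMTs.
int slices_saved(const CaseStudy& c) {
    assert(c.fu_area_vmt >= c.fu_area_network);
    return c.fu_area_vmt - c.fu_area_network;
}

// The three case studies with the figures reported in Tables 4-7.
std::vector<CaseStudy> case_studies() {
    return {
        {"MPEG2 decoder", 111, 3, 2958, 3909},
        {"JPEG encoder",   10, 6, 3216, 3456},
        {"JPEG decoder",   20, 7, 3613, 4264},
    };
}
```

The products are 333, 60 and 140 for the MPEG2 decoder, JPEG encoder and JPEG decoder, and the corresponding savings are 951, 240 and 651 slices, so both rank the three applications in the same order.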
6. Conclusions
An object-oriented embedded system can be co-designed by assigning each method to the hardware or the software
partition. This allows computation-intensive operations (methods) to be implemented in hardware, which is required to
achieve the performance goals of the embedded system. This, however, necessitates new method linkage
mechanisms for method calls involving a hardware method, and these must be efficient and low-cost to suit the
resource-constrained environment of embedded systems. We described such a mechanism that takes advantage of
an on-chip network to dispatch method calls and to pass parameters and/or return value(s). Our mechanism unifies
method dispatching with the network routing that is inherently required in NoC architectures; consequently, method
dispatch involving hardware methods costs nothing more than the on-chip network, which is already necessitated
by the realities of deep submicron technologies [4]. Moreover, our mechanism dispatches virtual
methods at no additional cost compared to non-virtual calls and to other hardware–software approaches,
and hence implements polymorphism for free. This effectively opens the way to object-oriented co-design, which
critics have traditionally questioned for its lower performance and implementation overhead. Our case studies on
MPEG2 decoder, JPEG encoder and JPEG decoder applications show 18–32% reduction in area with 2.5–5.1% reduction of
total execution time compared to the traditional VMT-based approach.
References
[1] International technology roadmap for semiconductors (ITRS)-design, http://public.itrs.net, 2005.
[2] The ODYSSEY project: Object-oriented design and sYntheSiS of embedded sYstems, Sharif University of Technology, Iran, online home
page: http://ce.sharif.edu/~odyssey/.
[3] K. Driesen, Software and hardware techniques for efficient polymorphic calls, PhD thesis, University of California, Santa Barbara, USA,
1999.
[4] L. Benini, G. DeMicheli, Networks on chips: A new SoC paradigm, IEEE Computer 35 (1) (2002) 70–78.
[5] M. Goudarzi, S. Hessabi, A. Mycroft, Object-oriented embedded system development based on synthesis and reuse of OO-ASIPs, J. Univers.
Comput. Sci. 10 (9) (2004) 1123–1155.
[6] M. Goudarzi, S. Hessabi, A. Mycroft, Overhead-free polymorphism in network-on-chip implementation of object-oriented models, in:
Proceedings of Design Automation and Test in Europe, DATE, 2004, pp. 1380–1381.
[7] The OASE Project: Objektorientierter hArdware/Software Entwurf, University of Tuebingen, Germany, online home page:
http://www-ti.informatik.uni-tuebingen.de/~oase/.
[8] The ODETTE Project: Object-oriented co-DEsign and functional test techniques, University of Oldenburg, Germany, online home page:
http://odette.offis.de.
[9] T. Parvataneni, G. Nannetti, C. Holgate, H. Eland, P. Onions, F. Wray, Object-orientated heterogeneous multiprocessor platform, European
and US patent application, GB2381336, 2003.
[10] W. Wolf, Object-oriented co-synthesis of distributed embedded systems, ACM Trans. Des. Automat. Electron. Systems 1 (3) (1996) 301–314.
[11] M. Radetzki, Synthesis of digital circuits from object-oriented specifications, PhD thesis, University of Oldenburg, Germany, 2000.
[12] C. Schulz-Key, T. Kuhn, W. Rosenstiel, A framework for system-level partitioning of object-oriented specifications, in: Proceedings of
Workshop on Synthesis and System Integration of Mixed Technologies, SASIMI, 2001.
[13] T. Kuhn, T. Oppold, M. Winterholer, W. Rosenstiel, M. Edwards, Y. Kashai, A framework for object-oriented hardware specification,
verification, and synthesis, in: Proceedings of Design Automation Conference, DAC, 2001.
[14] M. Goudarzi, The ODYSSEY methodology: ASIP-based design of embedded systems from object-oriented system-level models, PhD thesis,
Sharif University of Technology, Iran, 2005.
[15] M. Goudarzi, S. Hessabi, The ODYSSEY tool-set for system-level synthesis of object-oriented models, in: Proceedings of Embedded
Computer Systems: Architectures, MOdeling, and Simulation (SAMOS V), in: Lecture Notes in Comput. Sci., vol. 3553, 2005, pp. 394–403.
[16] M. Najafvand, Design and implementation of an object-oriented ASIP for JPEG decoding, MSc thesis, Sharif University of Technology, Iran,
2005.
[17] N. MohammadZadeh, Extending a JPEG-ASIP by software routines to decode MPEG2 streams, MSc thesis, Sharif University of Technology,
Iran, 2005.
[18] H.M. Deitel, P. Deitel, C++ How to Program, fifth ed., Prentice-Hall Publishers, 2005.
[19] The reference website for MPEG, MPEG software simulation group, MSSG, MPEG-2 video codec,
http://www.mpeg.org/MPEG/MSSG/#source.
[20] Xilinx Corporation, MicroBlaze soft processor core, http://www.xilinx.com/.
[21] M.R. Guthaus, J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, R.B. Brown, MiBench: A free, commercially representative embedded
benchmark suite, in: IEEE International Workshop on Workload Characterization, 2001, pp. 3–14.
