Software performance estimation strategies in a system-level design  tool by J. R., Bammi et al.
05 August 2020
POLITECNICO DI TORINO
Repository ISTITUZIONALE
Software performance estimation strategies in a system-level design tool / J. R., Bammi; E., Harcourt; W., Kruitzer;
Lavagno, Luciano; Lazarescu, MIHAI TEODOR. - ELETTRONICO. - (2000), pp. 82-86. ((Intervento presentato al
convegno CODES '00 tenutosi a San Diego, California, USA.
DOI:10.1145/334012.334028
This version is available at: 11583/1667466 since: 2018-10-30T15:14:31Z
ACM
Software Performance Estimation Strategies
in a System-Level Design Tool
Jwahar R. Bammi,
Edwin Harcourt
Cadence Design Systems
bammi@cadence.com
harcourt@cadence.com
Wido Kruijtzer
Philips Research Laboratories
wido.kruijtzer@philips.com
Luciano Lavagno,
Mihai T. Lazarescu
Politecnico di Torino
lavagno@polito.it
lazarescu@polito.it
ABSTRACT
High-level cost and performance estimation, coupled with a fast
hardware/software co-simulation framework, is a key enabler to
a fast embedded system design cycle. Unfortunately, the problem
of deriving such estimates without a detailed implementation
available is difficult.
In this paper we describe two approaches to solving the software cost
and performance estimation problem, and how they are used
in an embedded system design environment. A source-based
approach uses compilation onto a virtual instruction set, and
allows one to quickly obtain estimates without the need for a
compiler for the target processor. An object-based approach
translates the assembler generated by the target compiler to
assembler-level, functionally equivalent C. In both cases the
code is annotated with timing and other execution related in-
formation (e.g., estimated memory accesses) and is used as a
precise, yet fast, software simulation model. We contrast the
precision and speed of these two techniques comparing them
with those obtainable by a state-of-the-art cycle-based proces-
sor model.
1. INTRODUCTION
With the ability to mix processors, complex peripherals, and
custom hardware and software on a single chip, full-system de-
sign and analysis demand a new methodology and set of tools.
Nowadays, high performance IC technologies combine ever in-
creasing computing power with complex integrated peripherals
and large amounts of memory at decreasing costs. It comes
as no surprise that the software content of embedded systems
grows exponentially. While the system development tool indus-
try has overlooked this trend for years, most estimates place the
software development cost at well over half the total develop-
ment budget for a typical system. The bias towards software
development in system-level design arises mostly from the migration
of application-specific logic to application-specific code,
driven mainly by the need to mitigate product costs and time
to market pressures.
Short product life cycles and customization to niche markets
force designers to reuse not only building blocks, but entire architectures.
The final production cost is often paramount, thus
the prime directive is to find the right combination of processor,
memory, and glue logic for efficient manufacturing. This means
that several candidate architectures must be analyzed for appropriateness
and efficiency with respect to different applications
or behaviors. The fitness of a new architecture
An important part of the design consists in mapping the behavior
(from the specifications) to the architectural blocks (from IP
suppliers) in such a way that the cost, power consumption, and
timing of the system can be analyzed. For the hardware, ASIC
companies provide gate-level models and timing shells. For the
software, a similar characterization method is expected from
system development tools.
When properly separated, behavior and architecture may co-evolve.
As new requirements in the behavior call for changes
in the architecture, architecture considerations (e.g., production
cost) may lead to behavior modifications. Good system
design practice maintains an abstract specification while allowing
independent mapping of behavior onto architecture. This
is the essence of what has been termed function/architecture
co-design [6, 7].
Once mapped, the behavior can be annotated with estimated
execution delays. The delays depend on the implementation
type (hardware or software) and on the performance and in-
teraction of the architectural elements (e.g., IC technology, ac-
cess to shared resources, etc. for hardware, and clock rate, bus
width, real-time scheduling and CPU sharing, etc. for software).
These estimates should be accurate enough to help make
high-level choices, such as which behaviors should be implemented
in hardware and which in software. Once a behavior
is mapped to software there may be decisions about how to ar-
chitect the software in terms of tasks, RTOS scheduling policy,
and task priorities.
In this paper we present and contrast two techniques and tools
to accurately evaluate the performance of a system, working
at different levels of abstraction in order to trade off precision
and speed. To capture run-time task interaction, the evaluation
must be done dynamically in a simulation environment.
Moreover, it should be fast enough to enable the exploration of
several architectural mappings in search of the best implemen-
tation. We focus mainly on software written in C, because this
is the dominant high-level language in embedded system pro-
gramming. However, the approach can be applied equally well
to other languages, such as C++ or Java. Moreover, the object
code-based approach can also be used (with some limitations
discussed in [9]) to estimate pre-coded software blocks written
in assembler. This is ideal for DSP software blocks, which are
commonly implemented in assembler.
The rest of the paper is organized as follows. Section 2 intro-
duces the performance estimation problem and overviews re-
lated work. Section 3 describes our source-based approach and
discusses some of its drawbacks. Section 4 describes a more
precise object code-based approach and discusses some of the
trade-offs between the two. Section 5 presents the results obtained
for various benchmarks and examples. Section 6 concludes
the paper.
2. MOTIVATION AND BACKGROUND
The main software performance estimation techniques fall into
four groups:
1. filtering information that is passed between a cycle-accurate
ISS and a hardware simulator (e.g., by suppressing
instruction and data fetch-related activity in the hardware
simulator) [2, 3];
2. annotating the control flow graph (CFG) of the compiled
software description with information useful to derive a
cycle-accurate performance model (e.g., considering pipeline
and cache) [11, 13];
3. annotating the original C code with timing estimates, trying
to guess compiler optimizations [12];
4. using a set of linear equations to implicitly describe the
feasible program paths [10].
The first approach is precise but slow and requires a detailed
model of the hardware and software. Performance analysis can
be done only after completing the design, when architectural
choices are difficult to change.
The second approach analyses the code generated for each basic
block and tries to incorporate information about the optimiza-
tion performed by an actual compilation process. It considers
register allocation, instruction selection and scheduling, etc. In
our object code-based approach we partially use this scheme.
The third approach has the advantage of not requiring a com-
plete design environment for the chosen processor(s), since the
performance model is relatively simple (an estimated execution
time on the chosen processor for each high-level language statement).
However, it cannot account for compiler optimizations and complex
architectural features (e.g., pipeline stalls due to data dependencies).
In our source-based approach we extend this method to
handle arbitrary hand-written C code, rather than just synthe-
sized code.
The fourth approach has the advantage of not requiring a simu-
lation of the program, hence it can provide conservative worst-
case execution time information. However, so far it has been
targeted at worst case execution time analysis for a single pro-
gram. Embedded systems, on the other hand, are composed
of multiple tasks, accessing common resources, whose dynamic
activation can significantly modify each other's execution path
or timing behavior (e.g., by changing the state of the cache.)
[Figure 1 diagram not reproduced. Block labels: Functional C code, target cc, assembler, loader, Object code, executable, Target, ASM2C translator, source annotator, Annotated C code, Simulator.]
Figure 1: Simulation preparation flow.
Mixed approaches can be used under some circumstances. For
example, [8] tries to approach each step in the analysis with the
best currently known methods.
3. SOURCE-BASED ESTIMATION
The estimation flows discussed in this paper are shown in figure 1.
In this section we focus on the flow depicted on the
right, based on direct annotation of the C source. The key idea
is that of performing a partial compilation of the source code
using classical techniques described, for instance, in [4]. A min-
imum amount of information is retained to estimate the size
and execution time of the object code that the target compiler
would emit. In this way, one does not need access to the com-
plete development environment (compiler, assembler, debugger,
loader, Instruction Set Simulator) for the target processor, since
a single Virtual Compiler approximates the result of the target
compiler.
The Virtual Instruction (VI) set that we use as the target of the
virtual compilation process includes all the main classes of instructions
offered by existing processors. It is based on a RISC
philosophy, in that it provides a small set of instructions out of
which others can be synthesized. The VI set that is currently
used by our tool is as follows. We show also, as an example,
estimated cycle counts for each VI for the Motorola MCORE
processor [1] (the methods for deriving them are discussed be-
low):
LD, ST: load/store a register from/to memory (2 cycles).
OP.c, OP.s, OP.i, OP.l: simple arithmetic operations on a
byte, short word, word, and long word, respectively (1
cycle).
OP.f, OP.d: single- and double-precision floating-point operations
(160 and 360 cycles, respectively). Floating-point
operations are emulated in software and, depending on the
algorithm, can take various numbers of cycles. Here we
used an average number estimated with a cycle-accurate
ISS (hence the Virtual Machine (VM) method will not be accurate
for a program using many such operations).
MUL.*, DIV.*: multiply and divide operations. See above.
SUB, RET: subroutine call and return (8 cycles). The total
cycle count includes the overhead for storing the five non-volatile
registers (using LDM for the five registers takes 6
cycles) plus the time for a BSR (2 cycles), and vice versa
for return.
IF, GOTO: conditional and unconditional branch (2 cycles).
In general this is accurate for all of the branch instructions
except for the predictive branch instructions, which take
only 1 cycle if the branch is not taken. Hence 2 cycles is a
worst-case estimate.
    v__st_tmp = v__st;
    startup(proc);
    if (frozen_inp_events[proc][0] & 1) {
        goto L16;
    }
    ...

    v__st_tmp = v__st;
    __DELAY(LI+LI+LI+LI+LI+LI+OPc);
    startup(proc);
    if (frozen_inp_events[proc][0] & 1) {
        __DELAY(OPi+LD+LI+OPc+LD+OPi+OPi+IF);
        goto L16;
    }
    __DELAY(OPi+LD+LI+OPc+LD+OPi+OPi+IF);

    sb   $2, v__st_tmp.2
    jal  startup
    lw   $2, proc
    sll  $2, $2, 2
    lw   $2, frozen_inp_events($2)
    lbu  $4, 0($2)
    andi $2, $4, 0x0001
    .set noreorder
    .set nomacro
    bne  $2, $0, $L16
    andi $2, $4, 0x0004

    DELAY(sb);      v__st_tmp = _R2;
    DELAY(jal);     // startup(proc); deferred
    DELAY(lw+nop);  _R2 = proc;
    startup(proc);
    DELAY(sll);     _R2 = _R2 << 2;
    DELAY(lw+nop);  _R2 = (&frozen_inp_events+_R2);
    DELAY(lbu+nop); _R4 = (0+_R2);
    DELAY(andi);    _R2 = _R4 & 0x0001;
    DELAY(bne);     _jcond = (_R2 != _R0);
    // if (_jcond) goto L16; deferred
    DELAY(andi);    _R2 = _R4 & 0x0004;
    if (_jcond) goto L16;
Figure 2: From top to bottom: C code, source-based
simulation model, assembly code, and object code-
based simulation model.
Each virtual instruction represents a class of instructions on
the target architecture. For example, the OP virtual instruction
class represents over sixty different MCORE integer-ALU
instructions.
The first part of figure 2 presents a C fragment followed by an
automatically annotated version of the same file. The annotated
version is used for performance simulation; its annotations are
expressed in terms of the Virtual Machine instructions described
above. The simulation model is mainly composed of timing
instructions and the remaining behavioral part. Timing information
accumulates during function execution: each timing instruction
receives as its argument an arithmetic expression composed of
encoded Virtual Instruction names. Each Virtual Instruction has an
associated delay value, taken from the processor basis file. This value
is added to a global variable that represents the clock cycles
accumulated since the beginning of the simulation.
Similar annotations (not shown in the figure) also model memory
accesses for instruction fetches and data loads and stores,
which are filtered and estimated by a cache model, a bus model,
and a memory model not discussed in this paper.
One key aspect of our approach is that we generate virtual instructions
independently of the target processor. The only point
where the processor is taken into account is when evaluating
the delay of each virtual instruction. In this way, changing the
processor choice for a given piece of C code is simply a matter
of changing the basis file used by the simulator. On the other
hand, there are disadvantages, discussed further below.
3.1 Generation of the Processor Basis file
The processor basis file for each supported CPU contains an
estimated cycle count for each Virtual Instruction. This cycle
count must take into account an approximate view of the pro-
cessor pipeline (for example, the cycle counts for the MCORE
shown above take an optimistic view of it assuming that there
are never stalls.) At the same time, it ignores the behavior
of the memory hierarchy and system busses. These are taken
into account by other components of our simulation framework,
and the Virtual Compiler (as well as the object-based estimator
described below) only generates estimated addresses for instruction
and data accesses (not shown in the figures for the
sake of simplicity). Treatment of these additional architecture-dependent
delays is, however, outside the scope of this paper.
Currently we use two techniques in order to derive cycle counts
for each VI. The first technique is fairly straightforward, and is
generally used to bootstrap and verify the results of the second
one. It amounts to reading the processor manual and/or using a
cycle-accurate ISS (e.g., for VIs implemented in SW) and filling
in a table. The second technique is based on a statistical parameter
identification procedure. We compile a set of benchmarks
(providing a mix of all the VIs) both using the virtual compiler
and using the real compiler. We then simulate the annotated
C code and obtain a number of executions for each VI in each
benchmark. We also run each benchmark on a cycle-accurate
ISS or emulator of the target processor. Now we have a system
of linear equations of the form:
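\begin{align*}
n_{1,1} \cdot c_1 + n_{1,2} \cdot c_2 + \dots + n_{1,m} \cdot c_m &= N_1 \\
n_{2,1} \cdot c_1 + n_{2,2} \cdot c_2 + \dots + n_{2,m} \cdot c_m &= N_2 \\
&\;\;\vdots \\
n_{n,1} \cdot c_1 + n_{n,2} \cdot c_2 + \dots + n_{n,m} \cdot c_m &= N_n
\end{align*}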
where each $n_{i,j}$ is the number of VIs of type $j$ that our simulator
dynamically estimates in the execution of benchmark $i$, each
$c_j$ is the (unknown) cost of VI $j$, and each $N_i$ is the actual
cycle count for benchmark $i$. Note that the system has more
equations than unknowns, and hence we actually solve it
by minimizing the error between the predicted and the actual
counts.
3.2 Drawbacks of the Virtual Compilation approach
The most important limitations of an estimation made at the
source code level are that it is quite difficult to account for potential
compiler optimizations and for (CISC-style) instructions
that do not fall into any of the VI categories. For example, our
source-level estimator does not know how many registers the
target processor has. Hence it must estimate which variables are
likely to be stored in registers versus in memory. In the current
implementation we optimistically assume that scalar function
arguments and local scalar variables always end up in registers.
Even if the source code analyzer could perform compiler-like
optimizations, it would be impossible to guarantee that they
would match those done by the actual target compiler. For
these reasons we discuss next a more precise approach, based
on performing estimation after compilation and annotating a
regenerated C simulation model from the assembler output by
the compiler.
4. OBJECT CODE-BASED ESTIMATION
Consider now the flow shown on the left-hand side of figure 1.
The high-level C code of a given software task of the system
is compiled with the target C compiler. The output assembler
is translated to an assembler-level C model, that maintains the
original behavior, and is annotated with timing information.
This timing information is very accurate, since all the architectural
effects (instruction scheduling, register allocation, addressing
modes, memory accesses, ...) are visible at this level.
The assembler-level C is used as a very precise, yet fast, co-
simulation model. On the other path, the very same assembler
is used to generate the executable that will run in the target
environment.
Another important aspect of our software estimation technique
is that it supports a co-simulation where assembler-level trans-
lated C models can be mixed with functional non-translated
models (possibly annotated using the source-level technique dis-
cussed in Section 3 or by hand.) This feature may become useful
when it is not possible to compile the entire application (e.g.,
some blocks can be linked only as objects and are not available
in the source form.)
The possibility to mix pre-compiled and pre-characterized simu-
lation models for library functions can also make our approach
more efficient with respect to an ISS-based solution. In that
case, for example, any function in the C or mathematical func-
tion library has to be interpreted every time it is called, while
in our case it could have a faster, hand-written timing model.
The accuracy of the method relies on the fact that the simu-
lation model has the same behavior as the original program.
Hence, the basic assumptions needed to generate an accurate
simulation model using this method are:
- The input program has been optimized by the target compiler.
  Except for hardware optimizations, made by the
  target architecture at run time, no other optimization will
  be made (e.g., by the assembler).
- The optimizations made at run time by the target architecture
  (e.g., register renaming, speculative execution) are
  known. Model accuracy is best if they are data-independent.
- The input for the translator is generated by the same compiler
  that will be used for the target executable.
However, the last assumption may not always hold. The target compiler may
not be available at the time of system architecture definition,
or it may not output the assembler file, or a disassembler
for the target executable may not be available. In this case we
can use another supported target compiler to generate the assembler
and the simulation model. Once generated, the model
is portable source code in its own right, and it can be recompiled with the
`official' target compiler at the production stage.
Figure 2 shows a fragment of C code, the associated Virtual
Machine simulation model, the assembly code, and the corresponding
C code of the compilation-based simulation model.
Figure 3: PFC block diagram.
The latter is made of two main parts: the behavioral part and
the timing part. The behavioral part is an assembler-level C
code that reconstructs the behavior of the function. It makes
use of emulated registers, condition codes, and stack, references
to host memory, and flow control instructions. When the simulation
model is compiled by the host compiler, the references to
the emulated registers are ideally allocated to host registers
by the host compiler, so that the simulation model runs
fast.
For the simulation, the generated C model is compiled once
again on the host machine and executed. This results in a faster
simulation with respect to an interpretive (ISS-based) simula-
tion, where, for example, all instruction fetches are performed.
The drawback of this approach is that it is not possible to maintain
a total separation between the functional and timing information.
For example, compiler optimizations can change the
source code radically, and different simulation models with different
annotations are needed for different processors, compilers,
and even optimization levels.
Moreover, a serious simulation performance problem could arise
from architectures with complex processor condition codes that
can be set by several instructions. On the hardware side, setting
condition codes comes at no cost in terms of speed, while
the simulation model executes special code for emulating them.
Generating the emulation code for all the instructions that alter
the condition codes on the processor is definitely a waste
of time, since the condition codes are used in branch decisions
much less often than they are set. Doing this efficiently is a known
problem [5]. While RISC processors often do not use
condition codes, several older architectures (e.g., x86) use them
extensively, hence an efficient solution for translating them into
the C model has been provided.
When the assembler input is parsed, an internal representation
of the simulation model is created in the translator. This representation
includes all condition code updates. Before the
model is output, we perform a data flow analysis on the condition
codes and flag as useful only the settings that (conservatively)
have a chance to be used by a subsequent conditional
branch instruction. When the model is output, only the flagged
updates of the condition codes are generated, thus reducing useless
condition code set statements.
5. EXPERIMENTAL RESULTS
Figure 3 shows the block diagram of the Producer-Filter-Consumer
(PFC) design that we used for our experiments. At startup
the Controller enables the Producer to send a frame of an image
through Fifo2 to Filter. After processing the frame, Filter
pushes the result to Consumer using Fifo3. Consumer acts like
a sink and sends a received signal to Controller via Fifo4. Now
Controller enables Producer through Fifo1 to send a new frame
for processing, until there are no more frames to process.

Table 1: Cycle-accurate (a), source-based (b) and object
code-based (c) software performance on PFC test
case for none (noopt), moderate (opt), and heavy (oam)
compiler optimization. Values are cycle counts, with the
reported error with respect to (a) in parentheses.

(a)
task        noopt     opt       oam
producer    1987951   1594577   1594553
controller  421       366       366
consumer    56660     55304     55304

(b)
task        noopt              opt                oam
producer    1982231 (-0.29%)   1982231 (+24.3%)   1982231 (+24.3%)
controller  260 (-38.2%)       260 (-29.0%)       260 (-29.0%)
consumer    11035 (-80.5%)     11035 (-80.0%)     11035 (-80.0%)

(c)
task        noopt              opt                oam
producer    1982536 (-0.27%)   1591768 (-0.18%)   1591745 (-0.18%)
controller  1227 (+291%)       1168 (+319%)       1168 (+319%)
consumer    50702 (-10.5%)     49346 (-10.8%)     49346 (-10.8%)

Only Controller, Producer and Consumer are implemented in
software on a MIPS3000 processor (the cache model is set to
always hit in all cases.) We used three software performance
estimation methods:
1. A cycle-accurate simulator for MIPS3000 called TSS, used
as a reference (see table 1 (a); total simulation time 9
minutes.)
2. The source-based method (see table 1 (b), total simula-
tion time 30 seconds.) Errors obviously depend on the
level of optimization used by the compiler. In the case
of Consumer they are especially large due to an error in
estimating the cost of a function call inside a loop.
3. The object code-based method (see table 1 (c); total simulation
time 30 seconds). The large estimation errors
for Controller and Consumer in this case are due to the
different runtime environments, which require a different
prologue and epilogue for the routines implementing each
block in TSS (a cycle-accurate simulator using memory-based
communication) and in the performance simulator
(a discrete-event simulator using port-based communication).
These errors are more relevant than for Producer due to
the large amount of useful (and simulator-independent)
computation performed by the latter.
The systematic errors exhibited by Controller and Consumer
in the object code-based case could be reduced by instructing
the estimation tool not to analyze prologue and epilogue of top-
level functions (managed by the RTOS in one case and by the
simulation engine in the other), but to use pre-characterized
costs of entry and exit in the real environment (a fixed quantity,
that was used by the source-based approach as well.)
6. CONCLUSIONS
System functionality is often implemented in software, yet estimating
software performance of embedded multi-tasking reactive
systems is still a difficult problem. The work presented in
this paper attempts to analyze the time performance of soft-
ware as accurately as possible while still achieving a high simu-
lation speed (at least one order of magnitude faster than cycle-
accurate ISSs [13].) Reasonably accurate performance estima-
tion is needed in the face of internal pipelines, multiple issue
instructions, code and data caches, and memory hierarchies.
In the future, we are planning to apply this technique also to
VLIW (Very Long Instruction Word) processors. This type of
processor presents an interesting synergy between the compiler
design and the hardware design. A VLIW compiler performs
static code scheduling without requiring runtime pipeline inter-
locks, so that the code can be executed without having to take
any control decisions at runtime.
7. REFERENCES
[1] MCORE Reference Manual. Motorola.
[2] Mentor Graphics Seamless CVE Home Page.
http://www.mentorg.com/seamless/.
[3] Synopsys' Eagle Home Page.
http://www.synopsys.com.tw/products/hwsw/eagle_ds.html.
[4] A.V. Aho, R. Sethi, and J.D. Ullman. Compilers:
Principles, Techniques, and Tools. Addison-Wesley, 1986.
[5] B. Cmelik and D. Keppel. Shade: a fast instruction-set
simulator for execution profiling. ACM SIGMETRICS
Conference on Measurement and Modeling of Computer
Systems, 22(1):128-137, May 1994.
[6] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jureska,
L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E.
Sentovich, K. Suzuki, and B. Tabbara. Hardware-software
Co-Design of Embedded Systems: The POLIS Approach.
Kluwer Academic Publishers, Norwell, MA., 1997.
[7] H. Chang, L. Cooke, M. Hunt, G. Martin, A. McNelly,
and L. Todd. Surviving the SOC Revolution: A Guide to
Platform-Based Design. Kluwer Academic, 1999.
[8] R. Ernst and W. Ye. Embedded program timing analysis
based on path clustering and architecture classification.
In Proc. Int. Conf. Computer-Aided Design, pages
598-604, Nov. 1997.
[9] M.T. Lazarescu, M. Lajolo, J.R. Bammi, E. Harcourt,
and L. Lavagno. Compilation-based software performance
estimation for system level design. Proc. Design
Automation Conf., Jun. 2000.
[10] S. Malik, M. Martonosi, and Y.T.S. Li. Static timing
analysis of embedded software. In Proc. Design
Automation Conf., pages 147-152, Jun. 1997.
[11] F. Stappert. Predicting pipelining and caching behaviour
of hard real-time programs. 1998. C-LAB internal
document, Furstenalle 11, D-333102 Paderborn, Germany.
[12] K. Suzuki and A. Sangiovanni-Vincentelli. Efficient
software performance estimation methods for
hardware/software codesign. In Proc. Design Automation
Conf., pages 605-610, Jun. 1996.
[13] V. Zivojnovic and H. Meyr. Compiled hw/sw
co-simulation. In Proc. Design Automation Conf., 1996.
