Fast Dynamic Memory Integration in Co-Simulation Frameworks for
  Multiprocessor System on-Chip by Villa, O. et al.
O. Villa, P. Schaumont, I. Verbauwhede, M. Monchiero, G. Palermo
UCLA - {oreste, schaum, ingrid}@ee.ucla.edu
Politecnico di Milano - {monchier, gpalermo}@elet.polimi.it
Abstract
This paper proposes a technique to integrate and simulate a dynamic
memory in a multiprocessor framework based on C/C++/SystemC. By using
the host machine's memory management capabilities, dynamic data
processing is supported without compromising the speed and accuracy
of the simulation. A first prototype in a shared memory context is
presented.
1. Introduction
Multiprocessor Systems on-Chip (MPSoC) [3] are an appealing solution
for complex applications and are becoming feasible in contemporary
technology. Especially for these complex systems, designers need
frameworks to quickly evaluate different possible implementations of
the target architecture, covering interconnect, communication
protocol, and topology. Typical applications that run on MPSoC, such
as audio/video processing, manage large amounts of data. Simulating
these applications therefore requires memory management capabilities.
If these details are introduced in the model, memory simulation
becomes a significant portion of the simulator's workload and
degrades simulation speed. A good memory model for exploration should
therefore: (I) allow the execution of complex applications with huge
amounts of dynamic data, (II) be accurate, (III) have a low overhead,
and (IV) be easy to design. The contribution of this paper is the
development of flexible and efficient techniques based on the use of
the host machine's dynamic memory capabilities.
2. Framework
Today, co-simulation frameworks for MPSoC are based on the
integration of Instruction Set Simulators (ISSs) with simulation
kernels. A typical simulation framework is shown in Figure 1, where
dashed boxes represent our contribution. In most traditional
simulation frameworks, the memory subsystem is modeled as standard
hardware modules. Unless complex and slow dynamic memory models are
added, static memories implemented as tables are used.

Figure 1. Simulation Framework
(Figure labels: simulation kernel based on SystemC/C/C++; SWs API;
ISSs Interf; C/C++/SystemC/VHDL/... hardware modules; C/C++/SystemC
wrapper memories; host OS, host CPU, host MMU, host memory. Layers:
software layer, design model layer, host machine layer.)

In our approach, dynamic hardware memory modules are simulated using
cycle-true memory wrappers that access the operating system functions
which manage the host machine's memory. Dynamic memory operations
such as allocation, write/read and deallocation are implemented as
communications between the hardware modules, or ISSs, and the shared
memory's wrapper. The wrappers map these operations onto native host
machine functions, improving simulation speed. The wrapper guarantees
simulation accuracy by using delay parameters, which can be dynamic
and data dependent. Mechanisms are proposed to manage pointer
arithmetic for user-defined data types, to model finite-size memories
and to reserve pointers in a shared memory context. High-level APIs
used by the ISSs are also provided, using a C formalism. Multiple
dynamic shared memories are considered; methods to manage general
data structures are work in progress.
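
To illustrate how functional behavior and timing can be separated in
such a wrapper, the following C++ sketch computes a data-dependent
latency from a set of delay parameters. It is only a sketch under
stated assumptions: the names DelayParams, Op and latency_cycles, the
choice of parameters and the linear cost model are hypothetical and
are not taken from the framework described here.

```cpp
#include <cstdint>

// Hypothetical operation codes for the four dynamic memory operations.
enum class Op { Alloc, Write, Read, Free };

// Hypothetical delay parameters, expressed in simulated clock cycles.
struct DelayParams {
    std::uint32_t alloc_base;   // fixed cost of an allocation
    std::uint32_t access_base;  // fixed cost of a read or a write
    std::uint32_t per_element;  // extra cost per transferred element
    std::uint32_t free_base;    // fixed cost of a deallocation
};

// Returns the number of cycles the wrapper would wait before
// acknowledging the request. The host call itself takes no simulated
// time; only this value makes the operation visible in the timing.
std::uint32_t latency_cycles(const DelayParams& p, Op op, std::uint32_t dim) {
    switch (op) {
        case Op::Alloc: return p.alloc_base;
        case Op::Write:
        case Op::Read:  return p.access_base + p.per_element * dim;
        case Op::Free:  return p.free_base;
    }
    return 0;  // unreachable for valid opcodes
}
```

With parameters {10, 2, 1, 5}, for example, reading an 8-element
array costs 2 + 1 * 8 = 10 cycles in this model, while the host-side
copy remains a native operation.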
3. Dynamic Shared Memory
An implementation of our solution for multiple dynamic shared
memories is employed in a system composed of several processing
elements and an interconnect. Figure 2 shows the overall simulated
system, as well as the details of the shared memory subsystem. The
interconnect and shared memories are developed in C++. The simulation
kernel is based on C++ [1] and the ISSs are ARM simulators [2].

Figure 2. Shared Memory Implementation
(Figure labels: ISS1, ISS2, ..., ISSn; INTERCONNECT; Shared Mem 1,
Shared Mem 2, ..., Shared Mem P. WRAPPER MEMORY, design layer:
cycle-true FSM (comm, delays, drives func part); I/O registers; Data
In, Data Out; In ack/req, Out ack/req; functional part with
translator (memory size, endianness, data size, ptr type, function
calls) and pointer table (V Ptr., H Ptr., Type, Dim., Res.) holding
Unsigned Long Data, with the sum of dim < Tot. Size. Host layer: Host
OS, Host MMU, Host Memory.)

Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’05) 
1530-1591/05 $ 20.00 IEEE 

The wrapper is composed of a cycle-true part, described as a Finite
State Machine (FSM), and a functional part, composed of a pointer
table and a translator. In the FSM, communication follows a handshake
protocol and incoming signals are evaluated cycle by cycle.
Operations such as allocation, read/write and deallocation are
identified by an input code (opcode). The opcode and the shared
memory address (sm_addr), which identifies the memory module, are the
first data of every transaction between the ISSs and the wrapper.
Endianness handling, data type translation and host machine function
calls are performed by the translator, driven by the FSM. For each
allocation, the pointer table stores the virtual pointer seen by the
simulated architecture, the real host pointer used to retrieve the
data on the host machine, the size and type of the allocated space,
and a reservation bit used as a semaphore to reserve the pointer.

When an allocation is requested, the size (dim) and data type (type)
must be sent. The FSM maps this communication to a
calloc(dim, DATA_SIZE) call on the host machine through the
translator. The host machine returns a pointer, which is stored in
the table together with the type and dim information. A virtual
pointer (Vptr) is sent back to the ISS, completing the transaction.
Every new allocation is a new entry in the pointer table, and a
finite-size memory can be simulated by denying further allocations
when the sum of the allocated dimensions reaches a predefined limit.

When a write is requested, the virtual pointer (Vptr) and the data
(DataIn) are provided. The pointer table is checked to verify that
the Vptr is valid and to retrieve the real host machine pointer
(Hptr). DataIn is then stored at the host location given by Hptr. A
similar mechanism is used for a read operation: the data retrieved
from host memory is sent by the FSM as output data (DataOut) to the
communication port.
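
The functional part of these three operations can be sketched in C++
as follows. This is a minimal illustration under stated assumptions,
not the wrapper implemented in this work: the names
SharedMemoryWrapper and TableEntry are hypothetical, the element type
is assumed to be unsigned long, sizes are counted in elements, and
the cycle-true FSM, the handshake and the translator are omitted.
Virtual pointers follow the rule given later in this section (previous
Vptr plus previous size, starting from zero).

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// All names are hypothetical; the paper does not list its data layout.
using VPtr = std::uint32_t;  // virtual pointer seen by the simulated ISS
using Word = unsigned long;  // assumed element type (role of DATA_SIZE)

struct TableEntry {
    VPtr vptr;          // virtual pointer returned to the ISS
    Word* hptr;         // real host pointer returned by calloc
    std::uint32_t dim;  // number of elements allocated
    int type;           // data-type tag sent with the request
    bool reserved;      // reservation bit (not used in this sketch)
};

class SharedMemoryWrapper {
public:
    explicit SharedMemoryWrapper(std::uint32_t total_size)
        : total_size_(total_size) {}
    ~SharedMemoryWrapper() {
        for (const TableEntry& e : table_) std::free(e.hptr);
    }
    SharedMemoryWrapper(const SharedMemoryWrapper&) = delete;
    SharedMemoryWrapper& operator=(const SharedMemoryWrapper&) = delete;

    // Allocation: deny requests that do not fit in the simulated
    // memory, otherwise call the host allocator and add a table entry.
    bool alloc(std::uint32_t dim, int type, VPtr& vptr_out) {
        if (dim == 0 || dim > total_size_ - used_) return false;
        Word* hptr = static_cast<Word*>(std::calloc(dim, sizeof(Word)));
        if (hptr == nullptr) return false;
        // New Vptr = previous Vptr + previous size; the first is zero.
        VPtr vptr = table_.empty()
                        ? 0
                        : table_.back().vptr + table_.back().dim;
        table_.push_back(TableEntry{vptr, hptr, dim, type, false});
        used_ += dim;
        vptr_out = vptr;
        return true;
    }

    // Write: validate the Vptr, then store DataIn at the host location.
    bool write(VPtr vptr, Word data_in) {
        const TableEntry* e = find(vptr);
        if (e == nullptr) return false;
        *e->hptr = data_in;
        return true;
    }

    // Read: validate the Vptr, then return the host data as DataOut.
    bool read(VPtr vptr, Word& data_out) const {
        const TableEntry* e = find(vptr);
        if (e == nullptr) return false;
        data_out = *e->hptr;
        return true;
    }

private:
    // Exact-match lookup only; pointer arithmetic is not handled here.
    const TableEntry* find(VPtr vptr) const {
        for (const TableEntry& e : table_)
            if (e.vptr == vptr) return &e;
        return nullptr;
    }

    std::vector<TableEntry> table_;
    std::uint32_t total_size_;
    std::uint32_t used_ = 0;
};
```

In this sketch a wrapper created with a total size of 10 elements
accepts allocations of 4 and 6 elements, returning the virtual
pointers 0 and 4, and denies any further request until space is
released.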
For a free operation (deallocation), the Vptr must be provided. The
entry associated with the Vptr is removed from the pointer table, the
table is re-compacted and the dimension of the deallocated space is
subtracted from the running total of allocated memory. Finally, a
free call with the corresponding Hptr as argument is performed on the
host.

Each new Vptr is generated by adding the size of the previously
allocated space to the value of the previous Vptr in the table (see
the table in Figure 2). The first Vptr is zero by default. If pointer
arithmetic is used to access the memory, the wrapper receives virtual
pointer values that are not present in the table. In this case, the
host memory space is found by checking in the pointer table which
allocated space the Vptr falls in, and the Hptr is calculated by
adding the appropriate offset.

When indexed structures are exchanged, the dimension of the
transmitted array must also be sent together with the Vptr. I/O
registers are replaced by I/O arrays, where incoming and outgoing
data are temporarily stored until the communication is completed.
Afterward, they are moved to the host machine using a mechanism
similar to the one used for scalar data.

Data coherence is guaranteed by protecting the Vptr with the
reservation bit, which is set by an ISS that wants to protect the
pointer. To model data-dependent latencies, a set of delay parameters
can be used in the FSM. Multiple instances are easily managed, since
the host machine generates a different host pointer for every
allocation. High-level APIs very similar to the host machine
functions are used by the ISSs. Other hardware devices connected to
the system can access the memories using low-level communication.
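
The pointer-arithmetic lookup, the reservation bit and the free
operation described above can be sketched in C++ as follows. The
sketch makes several assumptions that are not stated in the text:
the names TableEntry, PointerTable, resolve, try_reserve and
free_block are hypothetical, virtual pointers are counted in elements
of type unsigned long, and the reservation bit is modeled as a simple
test-and-set flag without tracking which ISS owns it.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// All names are hypothetical illustrations of the described mechanism.
using VPtr = std::uint32_t;  // virtual pointer seen by the simulated ISS
using Word = unsigned long;  // assumed element type

struct TableEntry {
    VPtr vptr;          // base virtual pointer of the allocated space
    Word* hptr;         // base host pointer of the allocated space
    std::uint32_t dim;  // number of elements in the space
    bool reserved;      // reservation bit
};
using PointerTable = std::vector<TableEntry>;

// Pointer arithmetic: find the space that contains vptr and add the
// offset from its base to the host pointer. Returns nullptr when the
// virtual pointer falls outside every allocated space.
Word* resolve(const PointerTable& table, VPtr vptr) {
    for (const TableEntry& e : table) {
        if (vptr >= e.vptr && vptr - e.vptr < e.dim)
            return e.hptr + (vptr - e.vptr);
    }
    return nullptr;
}

// Reservation: set the bit if it is clear; fail if it is already set
// or if base is not the start of an allocated space.
bool try_reserve(PointerTable& table, VPtr base) {
    for (TableEntry& e : table) {
        if (e.vptr == base) {
            if (e.reserved) return false;
            e.reserved = true;
            return true;
        }
    }
    return false;
}

// Free: remove the entry, compact the table, subtract the size from
// the running total, then release the host memory.
bool free_block(PointerTable& table, std::uint32_t& used, VPtr base) {
    for (PointerTable::iterator it = table.begin(); it != table.end(); ++it) {
        if (it->vptr == base) {
            Word* hptr = it->hptr;
            used -= it->dim;
            table.erase(it);  // erase shifts later entries down
            std::free(hptr);
            return true;
        }
    }
    return false;
}
```

With two spaces of 4 and 6 elements at virtual pointers 0 and 4, the
virtual pointer 6 resolves to the host pointer of the second space
plus an offset of 2, and the virtual pointer 10 is rejected because
it lies past the last allocated element.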
4. Conclusions
Preliminary experimental results, obtained by simulating the GSM
algorithm, show that the overhead in terms of simulation speed
introduced by this model is very low. Comparing the simulation speed
of 4 ISSs with an interconnect and one memory against that of 4 ISSs
with an interconnect and 4 memories, we found a 20% degradation in
simulation speed.
References
[1] GEZEL. www.ee.ucla.edu/~schaum/gezel.
[2] SimIT-ARM. www.princeton.edu/~wquin/armsim.htm.
[3] W. Wolf and A. Jerraya. Multiprocessor Systems-on-Chips.
Morgan Kaufmann Publishers, 2004.
