An Abstract Model for Performance Estimation of the Embedded Multiprocessor CoreVA-MPSoC Technical Report (v1.0) by Ax, Johannes  et al.
An Abstract Model for Performance Estimation of
the Embedded Multiprocessor CoreVA-MPSoC
Technical Report (v1.0)
Johannes Ax∗, Martin Flasskamp∗, Gregor Sievers∗, Christian Klarhorst∗, Thorsten Jungeblut∗, and Wayne Kelly†
∗Cognitronics and Sensor Systems Group, CITEC, Bielefeld University, Bielefeld, Germany
†Science and Engineering Faculty, Queensland University of Technology, Brisbane, Australia
Email: jax@cit-ec.uni-bielefeld.de w.kelly@qut.edu.au
I. INTRODUCTION
This technical report presents an abstract model for
the performance estimation of the multiprocessor CoreVA-MPSoC.
The CoreVA-MPSoC targets streaming applications
in embedded and energy-limited systems. The abstract model
is used by our CoreVA-MPSoC compiler [1] to estimate the
performance of a given streaming application.
Our CoreVA-MPSoC compiler reads applications that are
described in the programming language StreamIt [2]. A
StreamIt program is represented by a structured data flow
graph of its tasks (filters). The CoreVA-MPSoC compiler
maps every filter of a program onto a particular core of
the MPSoC. An abstract model for such a partitioning is
presented in Section II. Section III shows the abstract model
of the hardware architecture of the CoreVA-MPSoC.
The configurable VLIW CPU CoreVA [3] is used as the
basic building block for our MPSoC. The CPU features L1
scratchpad memories for instruction and data. Several CPU
cores are tightly coupled within a cluster [4]. Several of those
clusters are connected via a network on chip (NoC) [5] (cf.
Fig. 1).
Within a cluster, each CPU can access the L1 data
memories of other CPUs via a bus-based interconnect (shared bus,
partial crossbar, or full crossbar). The NoC interconnect is composed of
three components: (i) routers transport the data through
the NoC in a packet-based manner; (ii) network links connect
the routers; and (iii) a network interface (NI) implements
the interface between the routers and the CPUs.
[Figure 1, block diagram. Labels: CPU Cluster (several instances), Cluster Interconnect, CoreVA CPU, Data MEM, Instr. MEM, FIFO, NI]
Fig. 1. Hierarchical CoreVA-MPSoC architecture
Section IV shows the abstract model for the total throughput
of a given partitioning of a StreamIt application. A goal of
the CoreVA-MPSoC compiler is to maximize this throughput
to achieve the best performance for an application.
II. MODEL OF THE STREAMIT PROGRAM
A StreamIt program can be represented as a structured
graph G = (F, E), where F is a set of filters and E is a set
of edges. Each e ∈ E is of the form (a, b) and represents a
communication channel between filters a ∈ F and b ∈ F
over which a message of size |e| (in bytes per work function call of a) is
sent from filter a to filter b. Each filter f ∈ F has a work
function with the estimated execution time W(f) in cycles.
The execution time W(f) includes repeated executions of the
work function, which may be required to consume or produce
enough data for the filter at its edges.
There exists a unique filter $\mathcal{L}$ without outgoing
edges, which is the last filter of the application: $\nexists\, b \in F$ s.t. $(\mathcal{L}, b) \in E$
There exists a unique filter $\mathcal{F}$ without incoming
edges, which is the first filter of the application: $\nexists\, b \in F$ s.t. $(b, \mathcal{F}) \in E$
$\mathcal{M}$: the multiplicity, i.e. how often the work functions of
all filters are called during one steady state iteration
(in work function calls per steady state iteration).
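The graph model above can be captured in a few lines of Python. This is only an illustrative sketch: the `Edge` class, the helper `first_and_last`, and the three-filter example pipeline are hypothetical names and values, not part of the CoreVA-MPSoC compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str   # producing filter a
    dst: str   # consuming filter b
    size: int  # |e|, message size in bytes (hypothetical values below)


def first_and_last(filters, edges):
    """Return the unique filter without incoming edges and the unique
    filter without outgoing edges, as required by the model."""
    has_incoming = {e.dst for e in edges}
    has_outgoing = {e.src for e in edges}
    firsts = [f for f in filters if f not in has_incoming]
    lasts = [f for f in filters if f not in has_outgoing]
    if len(firsts) != 1 or len(lasts) != 1:
        raise ValueError("graph needs exactly one first and one last filter")
    return firsts[0], lasts[0]


# Hypothetical three-stage pipeline: src -> fir -> sink
filters = ["src", "fir", "sink"]
edges = [Edge("src", "fir", 4), Edge("fir", "sink", 4)]
first, last = first_and_last(filters, edges)
```

For this pipeline, `first` is `"src"` and `last` is `"sink"`; a graph with two sinks would raise a `ValueError`.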
III. MODEL OF THE COREVA-MPSOC
The CoreVA-MPSoC consists of a set of processors P.
The StreamIt compiler maps each filter to a processor:
M : F ↦ P
The MPSoC has a set of clusters C, and each processor
belongs to a cluster: C : P ↦ C
Additionally, the MPSoC consists of a set of network links
N. Each n ∈ N has a maximum bandwidth B(n) (in bytes per cycle) that
it can handle. A network link can be a bus link within a
cluster, a network interface (NI), or a NoC link.
N(pa, pb) is the list of all network links involved when
sending a message from processor pa ∈ P to processor pb ∈ P
(depending on the routing algorithm): N : (pa, pb) → [N]
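As a rough illustration of the hardware model, the following Python sketch encodes the cluster mapping, the link bandwidths B(n), and a route function N(pa, pb). The topology (two clusters with two CPUs each), the link names, the bandwidth values, and the routing rule are all assumptions made for this example; the real route depends on the routing algorithm of the MPSoC.

```python
# Hypothetical topology: two clusters with two CPUs each.
cluster_of = {"p0": "c0", "p1": "c0", "p2": "c1", "p3": "c1"}  # C : P -> C

# B(n) in bytes per cycle; the values are placeholders.
bandwidth = {
    "bus_c0": 4.0, "bus_c1": 4.0,   # bus links inside the clusters
    "ni_c0": 4.0, "ni_c1": 4.0,     # network interfaces
    "noc_c0_c1": 4.0,               # NoC link between the two clusters
}


def route(pa, pb):
    """N(pa, pb): list of network links a message from pa to pb crosses.

    Assumed routing rule: no link on the same CPU, the cluster bus inside
    a cluster, and NI -> NoC link -> NI between clusters.
    """
    if pa == pb:
        return []
    ca, cb = cluster_of[pa], cluster_of[pb]
    if ca == cb:
        return [f"bus_{ca}"]
    lo, hi = sorted((ca, cb))
    return [f"ni_{ca}", f"noc_{lo}_{hi}", f"ni_{cb}"]
```

With these definitions, `route("p0", "p1")` stays on the bus of cluster `c0`, while `route("p3", "p0")` crosses both network interfaces and the NoC link.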
IV. MODEL OF THE THROUGHPUT
This section shows an abstract model for throughput
estimation of a StreamIt program as modeled in Section II,
mapped to a configuration of the CoreVA-MPSoC as modeled in Section III.
A. Throughput of a Processor
A filter f ∈ F has input edges: I( f ) = {(a, f )|(a, f ) ∈ E}
A filter f ∈ F has output edges: O( f ) = {( f , b)|( f , b) ∈ E}
For each filter f ∈ F we generate code of the form:
    foreach (Channel i in I(f))
        i.WaitInputReady
    foreach (Channel o in O(f))
        o.WaitOutputReady
    Work_f()
    foreach (Channel i in I(f))
        i.DoneWithInput
    foreach (Channel o in O(f))
        o.DoneWithOutput
Before executing the work function Work_f of filter
f ∈ F, it is necessary to wait until all communication
channels (input edges I(f) and output edges O(f)) are ready to
use. After Work_f, all communication channels (I(f) and
O(f)) can be set to done. The execution time of these wait
and done functions is determined by the channel type of edge
(a, b) ∈ E, which depends on the location of the filters a
and b (same processor, different processors in the same cluster,
or different clusters):
M(a) = M(b) → memory channel
M(a) ≠ M(b) ∧ C(M(a)) = C(M(b)) → cluster channel
C(M(a)) ≠ C(M(b)) → NoC channel
The execution time of the wait function for edge e ∈ E is represented
by $I_w(e)$ for input channels and $O_w(e)$ for output channels. The execution
time of the done function is represented by $I_d(e)$ for input channels
and $O_d(e)$ for output channels.
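The three-way case distinction for the channel type can be written as a small Python function. This is a sketch under assumed names: `channel_type`, the dictionaries `mapping` (for M) and `cluster_of` (for C), and the example placement are hypothetical.

```python
def channel_type(a, b, mapping, cluster_of):
    """Classify edge (a, b) by where its two filters are placed."""
    pa, pb = mapping[a], mapping[b]
    if pa == pb:
        return "memory"    # M(a) = M(b)
    if cluster_of[pa] == cluster_of[pb]:
        return "cluster"   # different CPUs, same cluster
    return "noc"           # C(M(a)) != C(M(b))


# Hypothetical placement of four filters on three CPUs in two clusters.
mapping = {"f0": "p0", "f1": "p0", "f2": "p1", "f3": "p2"}
cluster_of = {"p0": "c0", "p1": "c0", "p2": "c1"}

kinds = [
    channel_type("f0", "f1", mapping, cluster_of),
    channel_type("f1", "f2", mapping, cluster_of),
    channel_type("f2", "f3", mapping, cluster_of),
]
```

Checking the same-processor case first matters: two filters on one CPU trivially share a cluster, so the cluster test alone would misclassify memory channels.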
The execution time E(f) (in cycles per steady state
iteration) of filter f ∈ F is the sum of the execution time of
the filter's work function W(f) multiplied by the multiplicity
$\mathcal{M}$ and the software overheads of the
communication channels of all its input and output edges.
\[
E(f) = \mathcal{M}\, W(f) + \sum_{e \in I(f)} \bigl(I_w(e) + I_d(e)\bigr) + \sum_{e \in O(f)} \bigl(O_w(e) + O_d(e)\bigr) \quad \left[\frac{\text{cycles}}{\text{steady state iteration}}\right] \tag{1}
\]
The maximum throughput T(p) (in steady state iterations
per cycle) of processor p ∈ P is the inverse of the sum of the
execution times of all filters f ∈ F mapped to processor p.
\[
T(p) = \frac{1}{\sum_{f \in M'(p)} E(f)} \quad \left[\frac{\text{steady state iterations}}{\text{cycle}}\right] \tag{2}
\]
Here M'(p) is the set of all filters mapped to processor p ∈ P:
\[
M'(p) = \{ f \in F \mid M(f) = p \}
\]
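To make Equations (1) and (2) concrete, here is a minimal Python sketch. All numbers (work estimates, overheads, multiplicity) and names (`exec_time`, `processor_throughput`, `overhead`) are hypothetical; for brevity every edge gets the same overhead values, whereas in the model they depend on the channel type of the edge.

```python
# Hypothetical example values; none of these numbers come from the report.
MULT = 2                                     # multiplicity
work = {"a": 100, "b": 200, "c": 50}         # W(f) in cycles
edges = [("a", "b"), ("b", "c")]             # E
mapping = {"a": "p0", "b": "p0", "c": "p1"}  # M : F -> P
# Wait/done overheads in cycles for the input and output side of each edge.
overhead = {e: {"Iw": 5, "Id": 3, "Ow": 4, "Od": 2} for e in edges}


def exec_time(f):
    """E(f) following Eq. (1), in cycles per steady state iteration."""
    t = MULT * work[f]
    for e in edges:
        a, b = e
        if b == f:  # e is an input edge of f
            t += overhead[e]["Iw"] + overhead[e]["Id"]
        if a == f:  # e is an output edge of f
            t += overhead[e]["Ow"] + overhead[e]["Od"]
    return t


def processor_throughput(p):
    """T(p) following Eq. (2), in steady state iterations per cycle."""
    total = sum(exec_time(f) for f in work if mapping[f] == p)
    return float("inf") if total == 0 else 1.0 / total
```

In this example filter `b` costs 2 * 200 + (5 + 3) + (4 + 2) = 414 cycles, and processor `p0`, which runs `a` and `b`, needs 206 + 414 = 620 cycles per steady state iteration. A processor without filters is treated as never limiting the throughput.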
B. Throughput of a Network Link
An amount of data D(n) (in bytes per steady state
iteration) crosses each network link n ∈ N. This amount
of data depends on the multiplicity $\mathcal{M}$ and the message sizes
of all edges routed through this network link n.
\[
D(n) = \mathcal{M} \sum_{e \in \{(a,b) \in E \,\mid\, n \in N(M(a), M(b))\}} |e| \quad \left[\frac{\text{bytes}}{\text{steady state iteration}}\right] \tag{3}
\]
The maximum throughput T(n) of network link n ∈ N is
the maximum bandwidth B(n) (in bytes per cycle) that the network link
can handle divided by the amount of data D(n) (in bytes) that the
link has to transmit in one steady state iteration.
\[
T(n) = \frac{B(n)}{D(n)} \quad \left[\frac{\text{steady state iterations}}{\text{cycle}}\right] \tag{4}
\]
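The link load and link throughput can be sketched in Python as follows. The route table, message sizes, bandwidths, and function names are hypothetical and chosen only to give small numbers.

```python
# Hypothetical example values; none of these numbers come from the report.
MULT = 2                                         # multiplicity
size = {("a", "b"): 8, ("b", "c"): 16}           # |e| in bytes
mapping = {"a": "p0", "b": "p1", "c": "p2"}      # M : F -> P
# N(pa, pb): links crossed between two processors (assumed routes).
routes = {
    ("p0", "p1"): ["bus_c0"],
    ("p1", "p2"): ["ni_c0", "noc_c0_c1", "ni_c1"],
}
bandwidth = {"bus_c0": 4.0, "ni_c0": 4.0, "noc_c0_c1": 4.0, "ni_c1": 4.0}


def link_load(n):
    """D(n) following Eq. (3), in bytes per steady state iteration."""
    return MULT * sum(
        s for (a, b), s in size.items()
        if n in routes.get((mapping[a], mapping[b]), [])
    )


def link_throughput(n):
    """T(n) following Eq. (4), in steady state iterations per cycle."""
    d = link_load(n)
    return float("inf") if d == 0 else bandwidth[n] / d
```

Here the bus of cluster `c0` carries 2 * 8 = 16 bytes per iteration and sustains 4 / 16 = 0.25 iterations per cycle, while each link on the inter-cluster route carries 32 bytes and sustains 0.125 iterations per cycle. An unused link is treated as never limiting the throughput.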
C. Total Throughput of the System
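The throughputs of all processors form the set $T_{comp}$.
\[
T_{comp} = \{ T(p) \mid p \in P \} \quad \left[\frac{\text{steady state iterations}}{\text{cycle}}\right] \tag{5}
\]
The throughputs of all network links form the set
$T_{network}$.
\[
T_{network} = \{ T(n) \mid n \in N \} \quad \left[\frac{\text{steady state iterations}}{\text{cycle}}\right] \tag{6}
\]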
The amount of data D(f) (in bytes per steady state
iteration) consumed by a filter f ∈ F depends on the multiplicity
$\mathcal{M}$ and the message sizes of all its input edges.
\[
D(f) = \mathcal{M} \sum_{e \in I(f)} |e| \quad \left[\frac{\text{bytes}}{\text{steady state iteration}}\right] \tag{7}
\]
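The total throughput $T_{system}$ of the StreamIt application
is given by the bottleneck of the system. The bottleneck of
the system is the component (processor or network link)
with the lowest throughput.
\[
T_{system} = \min(T_{comp} \cup T_{network}) \quad \left[\frac{\text{steady state iterations}}{\text{cycle}}\right] \tag{8}
\]
If we also consider the amount of data D(f) that the last filter
$\mathcal{L}$ of the application consumes within one steady state iteration:
\[
T_{system} = D(\mathcal{L}) \cdot \min(T_{comp} \cup T_{network}) \quad \left[\frac{\text{bytes}}{\text{cycle}}\right] \tag{9}
\]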
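The bottleneck search is a plain minimum over all component throughputs. The Python sketch below uses hypothetical throughput values and names (`bottleneck`, `t_comp`, `t_network`), and assumes that processor and link names do not collide.

```python
# Hypothetical per-component throughputs in steady state iterations per cycle.
t_comp = {"p0": 1 / 620, "p1": 1 / 108}        # processors
t_network = {"bus_c0": 0.25, "ni_c0": 0.125}   # network links


def bottleneck(t_comp, t_network):
    """Return (component name, throughput) of the slowest component."""
    combined = {**t_comp, **t_network}  # assumes distinct component names
    name = min(combined, key=combined.get)
    return name, combined[name]


name, t_system = bottleneck(t_comp, t_network)

# Throughput in bytes per cycle: scale by the data consumed by the last
# filter, D(L) = multiplicity * sum of its input message sizes.
MULT = 2                    # hypothetical multiplicity
last_filter_inputs = [16]   # hypothetical |e| of the last filter's input edges
d_last = MULT * sum(last_filter_inputs)
bytes_per_cycle = d_last * t_system
```

In this example processor `p0` is the bottleneck with 1/620 iterations per cycle, which corresponds to 32/620 bytes per cycle at the output of the application. Returning the name along with the value is a design choice for the sketch: a partitioning algorithm would typically want to know which component to relieve.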
V. CONCLUSION
In this report, an abstract model for the performance
estimation of the CoreVA-MPSoC has been presented. The
abstract model is able to estimate the maximum throughput
of a certain streaming application mapped to a particular
configuration of the CoreVA-MPSoC.
REFERENCES
[1] W. Kelly, M. Flasskamp, G. Sievers, J. Ax, J. Chen, C. Klarhorst,
C. Ragg, T. Jungeblut, and A. Sorensen, “A Communication Model and
Partitioning Algorithm for Streaming Applications for an Embedded
MPSoC,” in Int. Symp. on System on Chip (SoC). IEEE, 2014.
[2] W. Thies, M. Karczmarek, and S. Amarasinghe, “StreamIt: A Language
for Streaming Applications,” in Int. Conf. on Compiler Construction.
Springer, 2002, pp. 179–196.
[3] G. Sievers, P. Christ, J. Einhaus, T. Jungeblut, M. Porrmann, and
U. Rückert, “Design-space Exploration of the Configurable 32 bit VLIW
Processor CoreVA for Signal Processing Applications,” in NORCHIP.
IEEE, 2013.
[4] G. Sievers, J. Ax, N. Kucza, M. Flasskamp, T. Jungeblut, W. Kelly,
M. Porrmann, and U. Rückert, “Evaluation of Interconnect Fabrics for
an Embedded MPSoC in 28nm FD-SOI,” in ISCAS. IEEE, 2015.
[5] J. Ax, G. Sievers, M. Flasskamp, W. Kelly, T. Jungeblut, and M. Porrmann,
“System-Level Analysis of Network Interfaces for Hierarchical
MPSoCs,” in NoCArc. ACM, 2015. In press.
