Composable local memory organisation for streaming applications on embedded MPSoCs by Ambrose, Jude et al.
Jude Ambrose, Anca Molnos, Andrew Nelson, Sorin Cotofana, Kees Goossens, Ben Juurlink
Conference Object, Postprint version
This version is available at http://dx.doi.org/10.14279/depositonce-6339
Suggested Citation
Ambrose, J.; Molnos, A.; Nelson, A.; Cotofana, S.; Goossens, K.; Juurlink, B.: Composable local memory
organisation for streaming applications on embedded MPSoCs. - In: CF '11 Proceedings of the 8th ACM
International Conference on Computing Frontiers. - New York, NY: ACM, 2011. - ISBN: 978-1-4503-0698-0. -
Article No. 23. DOI: 10.1145/2016604.2016631. (Postprint version is cited. Page number differs.)
Terms of Use
© ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for 
your personal use. Not for redistribution. The definitive version was published in CF '11 Proceedings 
of the 8th ACM International Conference on Computing Frontiers. - New York, NY: ACM, 2011, 
https://dl.acm.org/citation.cfm?doid=2016604.2016631.
Composable Local Memory Organisation for Streaming Applications on Embedded MPSoCs
Jude Ambrose1, Anca Molnos1, Andrew Nelson1,
Sorin Cotofana1, Kees Goossens2, and Ben Juurlink3
1Delft University of Technology
2Eindhoven University of Technology
3Technische Universität Berlin
Email: a.m.molnos@tudelft.nl, k.g.w.goossens@tue.nl, juurlink@cs.tu-berlin.de
ABSTRACT
Multi-Processor Systems on a Chip (MPSoCs) are suitable platforms for
the implementation of complex embedded applications. An MPSoC is
composable if the functional and temporal behaviour of each application
is independent of the absence or presence of other applications.
Composability is required for application design and analysis in
isolation, and for integration with linear effort. In this paper we
propose a composable organisation for the top level of a memory
hierarchy. This organisation preserves the short (one-cycle) access time
desirable for a processor’s frequent local accesses and enables the
predictability demanded by real-time applications. We partition the
local memory into two blocks: one private, for local tile data, and one
shared, for inter-tile data communication. To avoid application
interference, we instantiate one such shared local memory block and one
Remote Direct Memory Access (RDMA) component for each application
running on the processor. We implement this organisation on an MPSoC
with two processors on an FPGA. On this platform we execute a
composition of applications consisting of a JPEG decoder and a synthetic
application. Our experiments indicate that an application’s timing is
not affected by the behaviour of another application; thus composability
is achieved. Moreover, the use of the RDMA component leads to a 45%
performance increase on average for a number of workloads covering a
large range of communication/computation ratios.
ACM Categories & Subject Descriptors
C.1.2: Multiprocessors, D.4.5: Reliability
General Terms
Design, Performance, Verification
Keywords
Composability, Direct Memory Access, MPSoC
1. COMPOSABLE LOCAL MEMORY
Most state-of-the-art MPSoCs follow a tile-based organisation comprising
processor and memory tiles that communicate via an interconnect.
Processor tiles typically embed local memories. Applications running on
such platforms consist of communicating tasks that process infinitely
long input data streams. These applications share the MPSoC resources,
which may result in inter-application functional and temporal
interference due to resource request conflicts. In this context
composability [3, 4] is a desired platform property, as it enables
application design and analysis in isolation, and integration with
linear effort. Prior work proposed resource virtualisation techniques to
achieve composability, and demonstrated their feasibility for
processors, memory tiles, and the interconnect [3]. The techniques
proposed for memory tiles [1] incur arbitration delay, and thus do not
apply straightforwardly to local memories, which have to ensure fast
(preferably one-cycle) access to data.
To obtain a composable, fast local memory organisation we take four
steps, as presented in Figure 1. First, we employ a Remote Direct Memory
Access (RDMA) component (Figure 1(B)) to prevent an application from
monopolising the processor, potentially indefinitely, due to stalls on
remote reads. Second, to decrease contention, we partition the local
memory into a private memory and a pair of shared memories: one for the
outbound transactions initiated by the local processor, and one for the
inbound transactions initiated by a remote processor, as shown in
Figure 1(C). In this manner each of the memories is shared by at most
two access initiators. Third, to further avoid arbitration and its
potentially large time penalty, we propose to employ dual-ported
memories (Figure 1(D)). Finally, to achieve composability, i.e., no
local memory contention when multiple applications are mapped on a tile,
we propose to replicate the shared memories and the RDMA (Figure 1(E)).
Thus there is one RDMA and a pair of shared memories per application
mapped on a tile. Predictability, crucial for real-time applications, is
preserved by the four steps above.
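The organisation reached after the four steps can be sketched as a small structural model. This is only an illustration under stated assumptions: the class and function names are hypothetical, and the exact assignment of initiators to the ports of ShM-in and ShM-out is an assumption, since the text only states that each memory has at most two access initiators.

```python
from dataclasses import dataclass, field


@dataclass
class DualPortMemory:
    """A memory block with two ports; each port serves one initiator."""
    name: str
    initiators: list = field(default_factory=list)

    def attach(self, initiator: str) -> None:
        # A third initiator would require arbitration, which the
        # organisation is meant to avoid.
        if len(self.initiators) >= 2:
            raise ValueError(f"{self.name}: a third initiator needs arbitration")
        self.initiators.append(initiator)


def build_tile(applications):
    """Model step (E): one private memory per tile, plus one RDMA and one
    ShM-in/ShM-out pair per application mapped on the tile."""
    private = DualPortMemory("PrM")
    private.attach("processor")
    tile = {private.name: private}
    for app in applications:
        shm_out = DualPortMemory(f"ShM-out[{app}]")
        shm_out.attach("processor")       # assumed: processor produces outbound data
        shm_out.attach(f"RDMA[{app}]")    # assumed: the application's RDMA moves it out
        shm_in = DualPortMemory(f"ShM-in[{app}]")
        shm_in.attach("processor")        # assumed: processor consumes inbound data
        shm_in.attach(f"remote[{app}]")   # assumed: remote side fills it
        tile[shm_out.name] = shm_out
        tile[shm_in.name] = shm_in
    return tile


tile = build_tile(["JPEG", "A1"])
for memory in tile.values():
    print(memory.name, memory.initiators)
```

In this model no shared memory belonging to one application is ever attached to an initiator of another application, which is the structural property the replication step relies on.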
2. EXPERIMENTAL RESULTS
The experimental platform consists of two processor tiles, one memory
tile, and a NoC [2], implemented on a Xilinx ML-510 FPGA board. Each
processor tile contains a MicroBlaze core and local memory. The
MicroBlaze accesses the local memory via a one-cycle delay Local Memory
Bus.
[Figure 1: Steps towards a composable local memory. Panels: (A) initial, non-composable tile with local memory (LM); (B) RDMA; (C) partitioning into PrM, ShM-in and ShM-out; (D) dual-port memories; (E) replication for N applications.]
[Figure 2: Application workload and performance comparison. Panels: (a) Applications: JPEG (tasks vld, idct, cc) and Synthetic (A1) (tasks T1 to T5), mapped on Tile 1 and Tile 2; (b) A1 with RDMA; (c) A1 without RDMA. Axes in (b) and (c): application iterations, communication load (words), finishing time (clock cycles).]
We assess composability using a workload consisting of a JPEG decoder
and a simple synthetic application (A1). Figure 2(a) presents the task
graph and the task to processor tile mapping of both applications. We
investigate the difference in the JPEG’s response time between two
cases: the JPEG application executing alone (JPEG-single), and in
combination with the synthetic application A1 (JPEG-multi). We compare
two tile organisations: (1) when both applications share a single RDMA
engine (1 RDMA per tile); and (2) when the RDMAs are replicated (1 RDMA
per application). Figure 3 presents the response time differences
between JPEG-multi and JPEG-single for the vld task. As this figure
indicates, the response times differ when using a single RDMA per tile,
revealing interference, as expected. On the other hand, the response
time difference is zero with an RDMA per application, as there is no
interference. This suggests that our proposal is composable.
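The comparison above can be illustrated with a toy model of RDMA contention. The sketch below is not the platform's actual RDMA behaviour and uses no measured data: it assumes each engine serves requests in issue order, and the request times and transfer lengths are invented for illustration. The function names are hypothetical.

```python
def finish_times(requests, shared_engine):
    """requests: list of (app, issue_time, length_in_cycles), sorted by
    issue time. Each engine serves its requests one at a time, in order.
    Returns {(app, request_index): finish_time}."""
    engine_free = {}  # engine id -> cycle at which the engine becomes idle
    counts = {}       # app -> number of requests seen so far
    finished = {}
    for app, issue, length in requests:
        # One engine for the whole tile, or one engine per application.
        engine = "tile" if shared_engine else app
        start = max(issue, engine_free.get(engine, 0))
        engine_free[engine] = start + length
        index = counts.get(app, 0)
        counts[app] = index + 1
        finished[(app, index)] = start + length
    return finished


def jpeg_difference(shared_engine):
    """Finish-time difference (multi minus single) per JPEG request."""
    jpeg = [("JPEG", 0, 10), ("JPEG", 100, 10)]  # illustrative values
    a1 = [("A1", 95, 20)]                        # illustrative values
    single = finish_times(jpeg, shared_engine)
    multi = finish_times(sorted(jpeg + a1, key=lambda r: r[1]), shared_engine)
    return [multi[key] - single[key] for key in sorted(single)]


print(jpeg_difference(shared_engine=True))   # [0, 15]
print(jpeg_difference(shared_engine=False))  # [0, 0]
```

With a single engine per tile, the second JPEG request waits for A1's transfer and finishes 15 cycles later than when JPEG runs alone; with one engine per application, the difference is zero for every request.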
[Figure 3: JPEG, vld response time difference (JPEG-multi - JPEG-single). Axes: task iterations, response time difference (clock cycles). Series: 1 RDMA per tile, 1 RDMA per appl.]
We compare the performance of two processor tile organisations: one with
RDMAs and the other without RDMAs. The platform without RDMAs allows the
MicroBlazes to transfer data via a PLB bus to the NoC. This implies that
the MicroBlaze is blocked until the entire data is transferred to the
NoC, unlike in the case when the RDMA performs the transfer. For these
experiments we utilise only the synthetic application because its
communication and computation loads can be easily varied. The RDMA
increases the application performance by 70% for a communication load as
large as 8000 words, as shown in Figure 2(b)(c). The RDMA is an
inexpensive block as it uses very few FPGA resources, namely 332 look-up
tables and 276 flip-flops, representing 1% and 0.7% of the Xilinx ML-510
FPGA’s look-up tables and flip-flops, respectively.
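The reason an RDMA can shorten the finishing time is that communication proceeds while the processor computes. A minimal, idealised sketch of this effect follows. It is not the model used to obtain the measurements: it assumes perfect overlap in the RDMA case, ignores start-up and NoC effects, and the function name and cycle counts are hypothetical.

```python
def finishing_time(iterations, compute_cycles, comm_cycles, overlapped):
    """Idealised finishing time in clock cycles.

    Without an RDMA the processor is blocked during each transfer, so
    computation and communication add up. With an RDMA they are assumed
    to overlap perfectly, so the longer of the two dominates."""
    if overlapped:
        per_iteration = max(compute_cycles, comm_cycles)
    else:
        per_iteration = compute_cycles + comm_cycles
    return iterations * per_iteration


# Illustrative values only: 20 iterations, equal compute and transfer cost.
blocking = finishing_time(20, 1000, 1000, overlapped=False)
with_rdma = finishing_time(20, 1000, 1000, overlapped=True)
print(blocking, with_rdma)  # 40000 20000
```

Under these assumptions the benefit is largest when computation and communication take similar time, and it vanishes when either of the two is negligible.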
3. CONCLUSIONS
In this paper we propose a composable local memory organisation for a
tiled MPSoC executing a number of streaming applications. This
organisation utilises a Remote Direct Memory Access (RDMA) unit per
application and employs partitioned, dual-ported memory blocks. As a
result, this scheme is predictable and preserves fast access time to the
local memory (one-cycle accesses in our case). Experiments suggest that
composability is achieved and that, when communication and computation
are overlapped using RDMAs, performance is increased by 45% for a number
of synthetic workloads covering a large range of
communication/computation ratios. Employing the RDMAs led to no
significant difference in the number of FPGA look-up tables utilised and
increased the total number of flip-flops by 5%.
4. REFERENCES
[1] B. Akesson et al. Composable resource sharing based on
latency-rate servers. In DSD, 2009.
[2] K. Goossens et al. The aethereal network on chip after ten
years: Goals, evolution, lessons, and future. In DAC, 2010.
[3] A. Hansson et al. CoMPSoC: A template for composable and
predictable multi-processor system on chips. ACM Trans.
Des. Autom. Electron. Syst., 2009.
[4] H. Kopetz. Real-Time Systems: Design Principles for
Distributed Embedded Applications. Kluwer Academic
Publishers, 1997.
