Compositional Memory Systems for Multimedia Communicating Tasks by Molnos, A. M. et al.
Compositional memory systems for multimedia communicating tasks
A.M. Molnos
        
M.J.M. Heijligers
    
S.D. Cotofana
   
J.T.J. van Eijndhoven
    
   
Delft University of Technology
Mekelweg 4, Delft, The Netherlands
molnos@natlab.research.philips.com
    
Philips Research Laboratories
Prof. Holstlaan 4, 5656 AA
Eindhoven, The Netherlands
Abstract
Conventional cache models are not suited for real-time
parallel processing because tasks may flush each other’s
data out of the cache in an unpredictable manner. In this
way the system is not compositional so the overall perfor-
mance is difficult to predict and the integration of new tasks
expensive. This paper proposes a new method that imposes
compositionality on the system’s performance and makes
different memory hierarchy optimizations possible for
multimedia communicating tasks running on embedded
multiprocessor architectures. The method is based on a
cache allocation strategy that assigns sets of the unified
cache exclusively to tasks and to the communication buffers.
We also formulate the problem analytically and describe a
method to compute the cache partitioning ratio that optimizes
the throughput and the consumed power. When applied
to a multiprocessor with a memory hierarchy, our technique
also delivers a performance gain. Compared to the shared
cache case, an application consisting of two jpeg decoders
and one edge detection algorithm experiences 5 times fewer
misses, and an mpeg2 decoder experiences 6.5 times fewer
misses.
1. Introduction
The system’s high performance and predictability are the
required characteristics for state-of-the-art embedded mul-
timedia applications. To meet their performance require-
ments parallel processing of data on multi-processor archi-
tectures is needed. The low cost demands of this application
domain make the use of general purpose architectures with
several GHz clock frequencies impossible.
The increase in size and complexity of the applications
demands hardware platforms with large guaranteed memory
bandwidth. However, the bandwidth to off-chip memories
is not growing as fast as the speed of the computation
units. A possible approach to cope with this processor-memory
gap is to use memory hierarchies (caches) [3]. For
embedded systems, caches induce undesired unpredictability
because tasks can influence each other’s performance by
flushing each other’s data out of the cache. In this way the
system is not compositional: the overall performance cannot
be predicted from the performance of the system’s parts,
and changing a component has design implications for the
overall system, which lengthens the time-to-market and has
economic consequences.
In the real-time domain one could think of replacing
caches with scratch pad memories, but in practice removing
caches is not a good choice because they provide flexibility.
They don’t need to be resized or modified if the application
changes. Detailed analysis regarding the memory require-
ments of multimedia applications’ code and data at every
change of standards or features takes much effort and nega-
tively affects the time-to-market.
The importance of managing caches was already identified
in the domain of single processors running multiple threads.
Several cache partitioning and management techniques have
been proposed: for time-sharing environments [10], an
instruction-controlled, compile-time configured data cache
[7], and a first-level compiler-based partitioned cache [6].
However, for the multiprocessor domain there is a lack of
cache management methods.
This article presents a new cache management technique
for multimedia applications running on multiprocessor
architectures that achieves compositionality of the system’s
performance. The method allows tuning the
memory system according to different optimization criteria.
Our proposal is based on the observation that the following
holds true: in a multiprocessor system, the levels of unified
cache shared between processors are the most affected by
the inter-task conflicts. Thus we propose to achieve com-
positionality by exclusively assigning sets of the last level
of cache to tasks and to the communication buffers. An
accurate analytical formulation and a practical approxima-
tion to find the cache partitioning ratio for throughput and
power optimization are presented. The influence of task to
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’05) 
1530-1591/05 $ 20.00 IEEE 
processor assignment in this formulation is also discussed.
This paper presents a continuation of [5] where the case of
non-communicating multimedia applications is discussed.
To evaluate the performance implications of our method
we consider two applications (two instances of a jpeg decoder
plus one canny edge detection algorithm, and an mpeg2
decoder) running on a multiprocessor CAKE platform [9] with
4 VLIW processors and 512KB of L2 cache. The simulations
indicate that when the proposed paradigm is applied
to the L2 cache the system becomes compositional and
the number of L2 misses is reduced by up to 5 times for the
first application and 6.5 times for the second application.
The outline of this paper is as follows. Section 2
overviews the state of the art in the field of cache
management for multitasking environments. Section 3 discusses the
proposed method to induce compositionality, together with
the optimization possibilities and an approach to determine
the static cache partitioning ratio. Sections 4 and 5 present
the experimental framework and results. Finally, conclusions
are drawn in Section 6.
2. Related work
In the single processor domain different hardware ([10],
[8], [4]), software ([6]) and mixed ([7]) cache management
methods have been proposed. In [10] the problem of cache
efficiency for simultaneous threads in a time-sharing
environment is tackled using dynamic column caching, in a
best-effort way. Based on their number of misses, tasks
dynamically “steal” each other’s cache ways, such that the
overall number of misses is reduced. In [8] the problem
of the optimal allocation of cache between two competing
processes, such that the overall miss rate is minimized, is
presented. In [4] a method is described that divides a cache
into one partition for each real-time task plus a larger
partition, called the shared pool, for the non-real-time
tasks. In [6] the cache is partitioned among tasks at compile
and link time. In [7] a new data cache organization is
proposed: the L1 direct mapped cache can be partitioned and
configured at compile time and controlled by specific cache
instructions at run time, bringing considerable performance gain.
The main difference between our work and the previous
research is that we tackle the case of applications running
on platforms with a unified set-associative cache shared
between the processors. On such platforms, in our opinion,
no compositionality can be achieved and performance cannot
be guaranteed without explicitly taking the inter-task
communication into account. None of the above approaches
can be straightforwardly extended to our case. Column
caching, used in [10] and [8], allocates ways of every
cache set to the tasks. This type of partitioning severely
restricts the granularity of cache allocation to the
associativity of the cache. In [4] accesses to the shared
data structures do not have predictable cache behavior.
The method in [6] targets the L1 cache, requires a specific
compiler and is rather inflexible. The method in [7] is
limited to L1 direct mapped data caches for uniprocessors.
3. Cache partitioning for system compositionality
The platform model for the proposed method is a homogeneous
on-chip multiprocessor with a high-bandwidth
communication network and a partitionable, shared, unified
on-chip cache. On this platform parallel tasks that communicate
through the memory system are executed. Resource
contention for the shared buses and the shared levels
of cache generates unpredictable inter-task conflicts.
The communication network has high bandwidth, so the
resource contention there is low. Since the levels of cache
private to each processor are usually small and the task
switching rate in multimedia applications is typically low,
the first levels of cache can be considered private to each
task. Compositionality is induced in the system
by allocating to the tasks and the communication buffers
exclusive parts of the level of cache that is shared between
the processors.
The cache addresses of the communication buffers are
accessed by multiple tasks. If every task has a private cache
part, assigning a buffer to, for example, the consumer’s
cache part allows the producer to unpredictably evict some
of the consumer’s cached data. If all the buffers are
allocated in a shared cache pool they will still unpredictably
evict each other’s data, depending on their addresses and the
timing of the productions and the consumptions. In order to
have no cache interactions between tasks the communication
buffers need their own exclusive cache partitions.
Let’s now assume that the cache allocated to a communication
buffer is smaller than the buffer and that production
and consumption happen in parallel. If the producer is slow
enough not to flush formerly produced data out of the cache,
the consumer will have a hit. The number of misses thus varies
with the producing and consuming rates. So, in
order to have predictability for the communication buffers,
one of the following should be done:
- ensure that all accesses are hits (except cold misses).
- ensure that all accesses miss in the cache.
- predict the number of misses at the buffer level. This is
rather difficult because it requires cycle-accurate producer and
consumer rates, which are difficult to estimate and usually
not constant in a multiprocessor environment.
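The dependence of buffer misses on the producer and consumer rates can be illustrated with a small simulation. The sketch below is only a toy model under assumptions that are not taken from the paper: one buffer item per cache line, a direct-mapped partition of `cache_lines` lines, write allocation by the producer, and a hypothetical per-read "lead" that says how far ahead the producer is.

```python
def consumer_misses(leads, cache_lines):
    """Count consumer misses for a FIFO cached in `cache_lines` lines.

    leads[c] is how many items past item c the producer has reached
    when the consumer reads item c (at least 1, so item c exists).
    """
    num_items = len(leads)
    cache = [None] * cache_lines      # slot -> item currently cached
    produced = 0
    misses = 0
    for consumed, lead in enumerate(leads):
        # Producer writes (and caches) items until it reaches its lead.
        target = min(num_items, consumed + lead)
        while produced < target:
            cache[produced % cache_lines] = produced
            produced += 1
        # Consumer hits only if its item was not overwritten meanwhile.
        if cache[consumed % cache_lines] != consumed:
            misses += 1
    return misses


# Hypothetical scenarios: a slow producer, a fast one, and a rate change.
slow = consumer_misses([4] * 100, cache_lines=8)
fast = consumer_misses([16] * 100, cache_lines=8)
mixed = consumer_misses([4] * 50 + [16] * 50, cache_lines=8)
print(slow, fast, mixed)
```

In this model a producer that never runs more than `cache_lines` items ahead causes no consumer misses, while a faster producer makes most reads miss, and a change of rate halfway gives a count in between. This is why the miss count is hard to predict unless the partition is sized so that all accesses hit or all accesses miss.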
3.1. Optimization opportunities
In a compositional environment the performance of one
task is independent of the other tasks and this is achieved
by cache partitioning. Different optimization criteria, stat-
ically prioritizing tasks and guaranteeing performance are
possible by tuning the cache partitioning ratio.
An application consisting of $N$ tasks and $M$ communication
buffers can be represented as a graph $G = (\mathcal{T}, \mathcal{B})$,
where $\mathcal{T} = \{T_i \mid i = 0, 1, \ldots, N-1\}$ is the set of nodes
denoting the tasks and $\mathcal{B} = \{b_{i,j} \mid i, j = 0, 1, \ldots, N-1\}$ is the set of
edges denoting inter-task communication. An edge $b_{i,j}$ is
present in the application’s graph if the task $T_i$ exchanges
data with the task $T_j$. We consider the case of homogeneous
multiprocessors and the set of processors in the system is
$\mathcal{P} = \{p_k \mid k = 0, 1, \ldots, P-1\}$.
Typically a multimedia application executes for an infinite
time in a periodic manner. Every task $T_i$ has
the following functions assigned: $c(T_i)$, expressing its allocated cache
size, and $e(T_i, c(T_i))$, expressing its execution time for one
application period.
In a multiprocessor platform the execution time of every
task $T_i$ is specified by a set $S_i = \{s_j \mid j = 0, 1, \ldots, n_i - 1\}$ of
time slices when $T_i$ executes on different processors. Depending
on the cache size $c(T_i)$ the overall execution time
of task $T_i$ becomes:
$$e(T_i, c(T_i)) = \sum_{s_j \in S_i} e(s_j, \mathrm{proc}(s_j), c(T_i))$$
where $e(s_j, \mathrm{proc}(s_j), c(T_i))$ is the execution time of slice $s_j$
and $\mathrm{proc}(s_j)$ denotes the processor executing slice $s_j$.
For multimedia applications a good optimization criterion
is the execution throughput, defined as the number of complete
executions of the application per time unit. Let’s denote with $t(p_k)$
the time processor $p_k$ is used for the completion of one
application execution. The throughput is determined by
$\max_{p_k \in \mathcal{P}} \{t(p_k)\}$. Let $\mathcal{T}_k \subseteq \mathcal{T}$ be the subset of tasks that
execute at least one of their slices on $p_k$; then the processor
execution time becomes:
$$t(p_k) = \sum_{T_i \in \mathcal{T}_k} \; \sum_{s_j \in S_i,\ \mathrm{proc}(s_j) = p_k} e(s_j, p_k, c(T_i)) + t_{switch} + t_{idle}$$
where $t_{switch}$ and $t_{idle}$ are the amounts of time during which $p_k$
does task switching and is idle, respectively.
In an environment which allows task migration and dynamic
scheduling policies, the task set $\mathcal{T}_k$ executed on $p_k$
and the slices $S_i$ are execution dependent, so $t(p_k)$ cannot
be accurately computed. In order to have an exact analytical
model to reason about the overall throughput for
a certain cache partitioning, a static assignment of tasks to
processors is required. In this way, tasks assigned to the
same processor are executed overall sequentially, independent
of the scheduling decisions for that processor, and the
processor’s overall execution time will be:
$$t(p_k, \mathcal{T}_k) = \sum_{T_i \in \mathcal{T}_k} e(T_i, p_k, c(T_i)) + t_{switch} + t_{idle}$$
where $e(T_i, p_k, c(T_i))$ is the execution time of task $T_i$ on
processor $p_k$.
To optimize the throughput, the task to processor assignment
and the cache allocation should be such that
$\max_{p_k \in \mathcal{P}} \{t(p_k, \mathcal{T}_k)\}$ is minimized.
The power consumed by the system and the execution
time are other possible optimization criteria for
multimedia systems. The consumed power depends on the time
and the memory traffic that the system needs to complete
all its tasks. Optimizing the overall execution time
(respectively the number of misses) gives the largest
reduction in power consumption:
$$\min \Big\{ \sum_{T_i \in \mathcal{T}} e(T_i, c(T_i)) \Big\}$$
The method is applicable to systems with static
scheduling and allocation of tasks.
3.2. Proposed optimization method
Because in our experimental system task migration and
dynamic scheduling are allowed, in the proposed method
for finding the cache partitioning ratio the throughput and
power consumed are improved by optimizing the overall ap-
plication’s number of L2 misses (so execution time).
The problem of minimizing the total number of
cache misses is formulated as a (Mixed) Integer Linear
Programming problem. Let $m(T_i, c(T_i))$ be the number of misses of task
$T_i$ with $c(T_i)$ allocated cache sets. The objective is to find
$c(T_i)$ values for every task $T_i$ such that the overall number of
cache misses is minimized:
$$\min \Big\{ \sum_{T_i \in \mathcal{T}} m(T_i, c(T_i)) \Big\}$$
We denote the set of valid cache sizes with
$\{z_k \mid k = 0, 1, \ldots, K-1\}$, so $c(T_i) \in \{z_k\}$. Due to implementation
reasons $z_k$ can be, for example, limited to
powers of two. The number of misses of task $T_i$ with
$z_k$ cache sets, $m_{i,k} = m(T_i, z_k)$, can be obtained by
simulation or program analysis. In our model we use
an average $\bar{m}_{i,k}$ over the $m_{i,k}$ obtained from different
simulations of task $T_i$ having $z_k$ cache sets. We use a set
$\{x_{i,k} \mid i = 0, 1, \ldots, N-1;\ k = 0, 1, \ldots, K-1\}$ of $N \cdot K$ variables that
specify if the allocated cache for task $T_i$ is $z_k$ or not. A task can
have only one cache partition allocated, thus
$\sum_{k=0}^{K-1} x_{i,k} = 1$.
As a result the ILP solver produces an $\{x_{i,k}\}$ assignment
such that the overall number of misses is minimized. Using
$x_{i,k}$ and $\bar{m}_{i,k}$ the number of misses for task $T_i$ can be written
as:
$$m(T_i, c(T_i)) = \sum_{k=0}^{K-1} x_{i,k} \, \bar{m}_{i,k}$$
where the allocated cache size for every task $T_i$ is:
$$c(T_i) = \sum_{k=0}^{K-1} x_{i,k} \, z_k$$
The sum of the cache sizes allocated to all tasks and communication
buffers should be within the limits of the available
cache size $C$. The size of the cache assigned to the
communication buffer $b_j$ ($j = 0, 1, \ldots, M-1$) is $c(b_j)$, and this results in the
following constraint:
$$\sum_{i=0}^{N-1} c(T_i) + \sum_{j=0}^{M-1} c(b_j) \leq C$$
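For small instances, the choice of one cache size per task under the capacity limit can be explored without an ILP solver. The Python sketch below swaps in a simple dynamic program over the number of used sets; it is not the formulation solved in the paper, only an equivalent search for toy inputs. The function name `partition_cache`, the miss numbers and the capacity are hypothetical, and `capacity` is assumed to be what remains after the communication buffers have received their partitions.

```python
def partition_cache(avg_misses, sizes, capacity):
    """Pick one cache size per task, minimizing the total misses.

    avg_misses[i][k]: average misses of task i with sizes[k] sets
    sizes:            valid partition sizes (in sets)
    capacity:         sets left for tasks after buffer allocation
    Returns (total_misses, chosen_sizes) or None if nothing fits.
    """
    # best maps "sets used so far" -> (misses, sizes chosen so far)
    best = {0: (0, [])}
    for task_misses in avg_misses:
        nxt = {}
        for used, (misses, chosen) in best.items():
            for size, m in zip(sizes, task_misses):
                total = used + size
                if total > capacity:
                    continue  # violates the capacity constraint
                cand = (misses + m, chosen + [size])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical miss counts for two tasks and sizes of 1, 2 and 4 sets.
sizes = [1, 2, 4]
avg_misses = [[100, 60, 50], [80, 30, 10]]
print(partition_cache(avg_misses, sizes, capacity=5))
print(partition_cache(avg_misses, sizes, capacity=6))
```

Each task contributes exactly one size, which mirrors the constraint that a task has a single partition, and the capacity check mirrors the bound on the total allocated cache. For realistic task counts an ILP solver, as used in the paper, is the more practical choice.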
4. Experimental framework
4.1. Application Model
To describe the applications we use the Y-chart Applications
Programmers Interface (YAPI) [2]. The model of computation
in YAPI is based on Kahn Process Networks. Such a
network consists of parallel tasks that communicate through
(theoretically unbounded) FIFOs. Parallelism and communication
are explicit and synchronization is done implicitly
when reading from an empty FIFO (and when writing to a full
FIFO in the practical case where FIFOs are bounded). In practical
video YAPI applications the “memory-active” entities of the
system are tasks, FIFOs and frame buffers, so the cache is
exclusively split among them. The application model is not
restricted to YAPI descriptions and can be used with any
multitasking programming model with shared memory primitives,
e.g., POSIX threads.
Predictable access to the FIFOs is achieved by allocating
to each FIFO a cache partition of the same size as the FIFO.
For the frame buffers, production and consumption are
actually done sequentially (this is intrinsic to the YAPI
application). The frames used for the prediction are accessed
only after they are completely produced. This means that
by allocating an exclusive cache partition to every frame
buffer the compositionality of the system is preserved,
because the access to this buffer is sequential.
If all the tasks share the heap, another factor that influences
compositionality is dynamic memory allocation. Depending
on which data was previously allocated (so depending
on task scheduling and mapping) a task’s data structure
will have a different address, resulting in a different number
of misses inside the task. In the experiments in Section 5 we
assume that the memory allocation is done during the
initialization period and the overall allocation order is
always the same.
Figure 1. CAKE architecture - inside-tile view (CPUs with L1 caches and the L2 cache memory banks, connected by an interconnection network)
4.2. Hardware platform and operating system
In the experiments a practical instance of the CAKE
multiprocessor architecture is used [9]. The CAKE platform
consists of a homogeneous network of computing tiles on
a chip. Each tile contains CPUs (Trimedia and/or MIPS
cores), a router (for out of tile communication) and memory
banks (Figure 1). The processors are connected to memory
by a fast, high-bandwidth snooping interconnection
network. The on-tile memory is actually used as an L2 cache,
shared between tasks, providing fast access to the main
memory, which is outside the chip. The address space is
linear.
Allocating sets of the L2 cache is implemented by chang-
ing the conventional index part of an address to a new index.
In order to implement this, the cache has to be able to relate
memory accesses to tasks and communication buffers. Ide-
ally, each access should be labelled with a task id or buffer
id. In this way the cache can translate the address according
to a table indexed by this id. In the CAKE platform, the task
id is kept in a register and can therefore be used directly.
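The index translation described above can be sketched as a table lookup followed by a remapping of the conventional set index. The concrete scheme used here (partition base plus the conventional index modulo the partition size) and the constants `LINE_BYTES` and `TOTAL_SETS` are assumptions made for illustration; the text only states that the index is replaced according to a table indexed by the task or buffer id.

```python
LINE_BYTES = 64      # assumed cache line size (not given in the text)
TOTAL_SETS = 2048    # assumed number of L2 sets


def conventional_index(address):
    """Set index an unpartitioned cache would use for this address."""
    return (address // LINE_BYTES) % TOTAL_SETS


def partitioned_index(address, owner_id, table):
    """Set index inside the partition owned by a task or buffer.

    table maps an id to (first_set, num_sets) of its exclusive
    partition. A real design would also have to keep the index bits
    dropped by the modulo in the tag; that part is not modeled here.
    """
    first_set, num_sets = table[owner_id]
    return first_set + conventional_index(address) % num_sets


# Hypothetical translation table: one task and one FIFO.
table = {"idct": (0, 4), "fifo0": (4, 8)}
print(partitioned_index(0x1040, "idct", table))
print(partitioned_index(0x1040, "fifo0", table))
```

Because every id maps into a disjoint range of sets, accesses made on behalf of one task or buffer can never evict lines that belong to another.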
There are several ways to obtain an id for communication
buffers. A buffer id register could be used. This would
imply additional requirements for the compiler, which then
needs to keep that register up to date. Alternatively, a part of
the address could be used to encode the buffer id. This
reduces the usable address space and also requires adaptation
of the compiler for handling shared static data structures.
Nevertheless, for dynamic memory allocation the partitioning
can be implemented in a relatively straightforward way by
providing a dedicated malloc for shared buffers. A third
approach is to keep a table with intervals of shared memory.
This table needs to be loaded by the operating system. Then
for every access the cache can look up whether the address
has an associated buffer id.
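The interval-table approach can be sketched as a sorted list of address ranges searched with a binary search. The class name `BufferTable`, the helper `owner_of`, and the fallback to the task id when no interval matches are assumptions for illustration, not the interface of the actual operating system or cache.

```python
import bisect


class BufferTable:
    """Maps address intervals of shared memory to buffer ids."""

    def __init__(self, intervals):
        # intervals: (start, end_exclusive, buffer_id), non-overlapping
        self._intervals = sorted(intervals)
        self._starts = [start for start, _, _ in self._intervals]

    def lookup(self, address):
        """Return the buffer id covering `address`, or None."""
        pos = bisect.bisect_right(self._starts, address) - 1
        if pos >= 0:
            _, end, buffer_id = self._intervals[pos]
            if address < end:
                return buffer_id
        return None


def owner_of(address, task_id, table):
    """Id used to select the cache partition for one access."""
    buffer_id = table.lookup(address)
    return buffer_id if buffer_id is not None else task_id


# Hypothetical shared-memory layout loaded by the operating system.
table = BufferTable([(0x2000, 0x3000, "fifo1"), (0x1000, 0x1800, "fifo0")])
print(owner_of(0x1234, "idct", table))
print(owner_of(0x0500, "idct", table))
```

An access that falls inside a registered interval is attributed to the buffer, and any other access is attributed to the running task, so the two kinds of id can index the same translation table.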
The third alternative is chosen because for our experiments we
are mainly interested in the system level aspects (e.g.,
inducing the compositionality property, implications for the
number of misses) and we are not concerned with the
performance issues of the implementation. The third approach
is more generic than the others because any address range can
be placed anywhere in the cache. This easily allows for other
experiments, for example separating tasks’ instructions, static
initialized variables (data) and static uninitialized variables
(bss) in the cache, or sharing some cache partitions.
tasks           FrontEnd1   IDCT1     Raster1     BackEnd1
alloc. L2 sets  4           1         32          16
tasks           FrontEnd2   IDCT2     Raster2     BackEnd2
alloc. L2 sets  4           1         16          16
tasks           Fr. canny   LowPass   HorizSobel  VertSobel
alloc. L2 sets  4           16        8           16
tasks           HorizNMS    VertNMS   MaxTreshold
alloc. L2 sets  8           8         4
data            appl data   appl bss  rt data     rt bss
alloc. L2 sets  2           2         4           4
Table 1. L2 allocated sets for 2 jpegs & canny
tasks           input       vld       hdr         isiq     memMan
alloc. L2 sets  2           4         16          8        1
tasks           idct        add       decMV       predict  predictRD
alloc. L2 sets  4           4         8           16       2
tasks           writeMB     store     output
alloc. L2 sets  8           2         1
data            appl data   appl bss  rt data     rt bss
alloc. L2 sets  4           1         8           1
Table 2. L2 allocated sets to tasks for mpeg2
We have adapted the operating system, such that it man-
ages the necessary translation tables for the cache. For
this, it offers primitives of cache allocation for tasks and
for shared memory.
5. Experimental results
We evaluate the proposed technique using two applications
as workload. The first application (15 tasks) consists
of two jpeg decoders [1] working on different picture
formats and one line-based canny edge detection algorithm.
The second application (13 tasks) is an mpeg2 video decoder
[11]. In both examples there is a run-time operating system
that has an exclusive cache part allocated such that it does
not interfere with the application’s tasks.
The instance of the CAKE platform used has four Trimedia
processors and a 512KB, 4-way set-associative L2 cache. On the
experimental platform the statically allocated data (data and
bss) of the application and of the run-time system is shared
between tasks, so in order to have predictable access, by the
same reasoning as for communication buffers (Section 3),
exclusive cache partitions are allocated for it as well. The
last row of Tables 1 and 2 (corresponding to the two examples)
presents the allocated cache sizes for this data.
We performed simulations to determine the
number of misses as a function of the allocated cache size
(for each of the considered cache sizes) and we applied the
method in Section 3.2. The obtained partitioning ratios are
presented for the two examples in Table 1 and Table 2. To give
a measure of the inter-task cache interaction, Figure 2 presents
the number of misses for the shared cache case and the
partitioned cache case for all tasks and communication buffers.
For both examples we can observe that with only a few sets of
exclusive cache assigned to statically allocated data a major
improvement in performance is obtained.
Figure 2. Shared vs. best partitioned cache for every task and communication buffer
With respect to performance, cache partitioning has two
effects: on one hand the misses due to inter-task conflicts
are alleviated, but on the other hand some tasks do not benefit
from all the cache space, which increases their number of
misses. The balance between the two effects can be seen
in Figure 2. The results also indicate that for the first
application the L2 miss rate is improved from 9.46% to 2.21%
and as a result the number of cycles per instruction (CPI) of
every processor is reduced by approximately 20% (from 1.4
cpi to 1.1 cpi). For the mpeg2 application the L2 miss rate
is improved from 5.1% to 0.8% and as a result the CPI
of every processor is reduced by approximately 4%
(from 1.7-1.8 cpi to 1.6-1.7 cpi). For this application the
reduction in CPI is smaller because the mpeg2 implementation
used was not optimized for Trimedia, so it is more
L1 and processor bound than L2 bound. The mpeg2 application
was also simulated with 1MB of shared L2 cache; in that
case the L2 miss rate was 0.6% with a corresponding 1.7 cpi
for every processor.
Figure 3. Expected-simulated performance
comparison for every task
The graphs in Figure 3 present the difference between
the number of misses for every task expected using
the model from Section 3.2 and the number obtained by
simulating the determined best cache partitioning ratio for the
application. These differences give a measure of compositionality
and reflect the variation in the number of L2 misses
due to neglected effects like task switching, L1 and bus
contention. For both examples the largest difference for a
task between the expected and simulated number of misses,
relative to the overall simulated number of misses, is 2%.
The results suggest that the system behaves as expected and the
compositionality property is induced.
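The compositionality measure used above, the largest per-task difference between expected and simulated misses relative to the overall simulated misses, can be written down in a few lines. The Python sketch below uses hypothetical miss counts; the function name `max_relative_deviation` is not from the paper.

```python
def max_relative_deviation(expected, simulated):
    """Largest per-task |expected - simulated| misses, relative to
    the total number of simulated misses over all tasks."""
    total = sum(simulated.values())
    worst = max(abs(expected[task] - simulated[task]) for task in simulated)
    return worst / total


# Hypothetical miss counts, for illustration only.
expected = {"a": 100, "b": 200}
simulated = {"a": 110, "b": 190}
print(max_relative_deviation(expected, simulated))
```

A value close to zero means that the per-task predictions made in isolation still hold when all tasks run together, which is the property the partitioning is meant to induce.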
6. Conclusions
For multimedia applications running on embedded
multiprocessor systems we tackle the problem of non-compositionality
due to inter-task conflicts in the shared cache.
In the proposed approach compositionality is induced by
partitioning the unified level of cache shared between
processors using a scheme that exclusively allocates sets of
cache to the tasks and inter-task communication buffers.
Different optimization criteria, statically prioritizing tasks
and guaranteeing performance are possible by tuning the
cache partitioning ratio. For homogeneous multiprocessors
the problems of finding the cache partitioning ratio that
optimizes the throughput and the power are analytically
formulated and the influence of task to processor assignment
is discussed. A practical approximation of these formulations
is presented and validated using two applications:
two jpeg decoders and a canny edge detection algorithm running
in parallel, and a parallel mpeg2 decoder. By applying the
proposed method to a CAKE instance with 4 Trimedia processors
and 512KB L2 cache, compositionality was achieved:
the largest difference for a task between the expected and
simulated number of misses, relative to the overall simulated
number of misses, is 2%. In terms of performance, compared
with the shared cache case there were 5 times fewer L2
cache misses, resulting in a 20% reduction in cycles per
instruction, for the first example, and 6.5 times fewer L2
cache misses, resulting in a 4% cpi reduction, for the second
example.
References
[1] E. A. de Kock. Multiprocessor mapping of process networks: a jpeg decoding case study. Proceedings, 15th International Symposium on System Synthesis, October 2002.
[2] E. A. de Kock et al. Yapi: application modeling for signal processing systems. Proceedings, 37th Conference on Design Automation, pages 402–405, 2000.
[3] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, CA, 2003.
[4] D. B. Kirk. Smart (strategic memory allocation for real-time) cache design. IEEE Symposium on Real Time Systems, pages 229–237, 1989.
[5] A. Molnos, M. Heijligers, S. Cotofana, and J. van Eijndhoven. Compositional memory systems for data intensive applications. Proceedings, Design, Automation and Test in Europe, 2004.
[6] F. Mueller. Compiler support for software-based cache partitioning. ACM SIGPLAN Notices, 30(11), 1995.
[7] H. Muller, D. Page, J. Irwin, and D. May. Caches with compositional performance. Proceedings, Embedded Processor Design Challenges, pages 242–259, 2002.
[8] H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cache memory. IEEE Transactions on Computers, 41(9):1054–1068, 1992.
[9] P. Stravers and J. Hoogerbrugge. Homogeneous multiprocessing and the future of silicon design paradigms. Proceedings, International Symposium on VLSI Technology, Systems, and Applications (VLSI-TSA), April 2001.
[10] G. E. Suh, S. Devadas, and L. Rudolph. Dynamic cache partitioning for simultaneous multithreading systems. Thirteenth IASTED International Conference on Parallel and Distributed Computing Systems, 2001.
[11] P. van der Wolf et al. An mpeg-2 decoder case study as a driver for a system level design methodology. Proceedings, 7th International Workshop on Hardware/Software Co-Design (CODES’99), May 1999.
