Analysis of Memory Latencies in Multi-Processor Systems
J. Staschulat, S. Schliecker, M. Ivers, R. Ernst
Technical University of Braunschweig
Hans Sommer Str. 66, D-38106 Braunschweig, Germany
{staschulat|schliecker|ivers|ernst}@ida.ing.tu-bs.de
Abstract
Predicting timing behavior is key to efficient embedded
real-time system design and verification. Current approaches
to determine end-to-end latencies in parallel heterogeneous
architectures focus on performance analysis either on task
or system level. Memory accesses in particular, which are
basic operations of embedded applications, cannot be
accurately captured on a single level alone: While task level
methods simplify system behavior, system level methods
simplify task behavior. Both perspectives lead to overly
pessimistic estimations.
To tackle these complex interactions we integrate task
and system level analysis. Each analysis level is provided
with the necessary data to allow precise computations,
while adequate abstraction prevents high time complexity.
1 Introduction
Memory is a critical bottleneck in embedded systems,
as the gap between processor speed and memory access
time is increasing. Current chip designs use hierarchical
memory architectures with caches, multi-threading and
co-processors to reduce memory latency. Embedded applications
often have to meet real-time constraints, but performance
verification of complex systems is a challenge.
The state of the art in industrial practice is functional
testing and simulation. Simulation times are often too long for
complete code coverage, which would require exponential
time. Therefore, only critical paths are tested, and safe
timing bounds cannot be given.
Formal analysis is an alternative. With simplified as-
sumptions the analysis complexity is reduced and safe up-
per and lower performance bounds are calculated. One
such assumption is a constant memory latency, even though
it is influenced by the memory hierarchy, bus arbitration,
and buffer sizes as well as the background memory latency.
Static analysis approaches for single tasks, such as [5],
assume a constant memory access time. The behavior of caches
has been studied as well [5] [3] [8], but these approaches
assume a constant cache miss penalty. Even a small
overestimation of the memory access time leads to a large
overestimation of the worst case execution time of a single task.
This overestimated task execution time is used for
higher level system analysis for resource scheduling or for
throughput estimation of heterogeneous multi-processor ar-
chitectures. Such analyses have been proposed by [7] [6].
Compositional performance analysis methodologies com-
bine local techniques on the system level by connecting
their input and output event streams according to the overall
application and communication structure [4] [1].
Crowley and Baer propose in [2] an analysis for multi-
threaded processors. They identify parallelism in the exe-
cution of individual threads on a processor and use special
nodes with a negative execution time to model the gain of
multithreading.
While task level methods simplify system behavior, sys-
tem level methods simplify task behavior. Both perspec-
tives lead to overly pessimistic estimations, one reason for
the marginal influence of formal methods in industrial sys-
tem design. To tackle these complex interactions task level
and system level analysis are integrated. Each analysis level
is provided with the necessary data to allow precise
computations, while adequate abstraction prevents state space
explosion.
The paper is structured as follows. Section 2 describes
the problem statement and Section 3 reviews previous work.
Section 4 describes the integrated task and system level
analysis approach. Section 5 presents experiments. Finally,
we conclude in Section 6.
2 Problem Statement
A simple multi-processor architecture is given in Fig. 1.
Two processors are connected by a shared bus with memory
and a co-processor. For example, suppose that a heat control
application runs on CPU1. A sensor on CPU2 periodically
transmits temperature values, which are saved in
the shared memory. The application on CPU1 checks this
value but also loads its instructions and other data from this
memory device. The heat control is managed by the
co-processor, which is triggered by the application on CPU1.
Proceedings of the 5th Intl Workshop on Worst-Case Execution Time (WCET) Analysis Page 33 of 49
ECRTS 2005
5th Intl. Workshop on Worst-Case Execution Time (WCET) Analysis
http://drops.dagstuhl.de/opus/volltexte/2007/813
[Figure 1: CPU1, CPU2, Memory and Co-Proc. attached to a shared BUS]
Figure 1. Multi-processor design example.
This simple setup already shows the analysis complexities:
Memory traffic from CPU1 and CPU2 uses a shared
bus. The memory access time depends on the system state,
including bus state, buffers and memory state. In this paper,
we call the latency of a memory or co-processor request
issued by a resource the transaction latency. This denotes the
end-to-end latency, including the time for requesting the data,
transmission over the bus, processing on the remote
component and transmission back to its source.
The objective is to compute the worst case execution
time (WCET) of a task, e.g. the heat control on CPU1, while
analyzing transaction times on system level.
3 Previous Work
3.1 Single Task Analysis
The timing analysis of tasks is separated into two stages:
program path analysis and micro-architecture modeling.
Program path analysis is used to determine the path which
is executed in the worst case. To derive all the possible
program paths the program is transformed into a control-flow
graph. Based on this control-flow graph the worst case path,
which starts at the beginning node and ends at a terminating
node of the control-flow graph, is determined.
Micro-architectural modeling is understood as the timing
analysis of sequences of instructions. Li and Malik [5]
integrate program path analysis and micro-architecture
modeling to analyze the worst case execution time of tasks
under the influence of instruction caches. They use an Integer
Linear Program (ILP) to avoid an explicit enumeration of
all program paths. The micro-architectural model is used
to derive the execution time of individual basic blocks and the
ILP is used to find the path through the control-flow graph
which maximizes the execution time. In [8], means are
proposed for identifying infeasible paths in the program, and
the timing analysis is based on single-feasible paths instead
of basic blocks.
In this paper we use this analysis framework, called
SymTA/P, to transform a program into its control flow graph
and to compute the WCET by solving the ILP. We assume
that the execution times of single-feasible paths are given,
including micro-architectural influences. SymTA/P requires
that a conservative memory access time, either as cache
miss penalty or general memory access time, is specified
a priori. A more precise estimation of the worst case
memory access time should rather be done at system level,
where timing properties of all other components are
available.
3.2 System Level Analysis
In SymTA/S [4] a compositional performance analysis
methodology is proposed which integrates different local
scheduling analysis techniques through event stream prop-
agation. The local techniques are composed on the system
level by connecting their input and output event streams ac-
cording to the application and communication structure. In-
stead of considering each event individually, as simulation
does, the formal scheduling analysis abstracts from individ-
ual events to event streams. The analysis requires only a few
simple characteristics of event streams, such as an event pe-
riod, a maximum jitter and minimum distance. From these
parameters, the analysis systematically derives worst case
scheduling scenarios.
One way to extend the compositional analysis to memory
accesses would be to model each access as an event. How-
ever, this would require splitting the task into many smaller
atomic tasks and it would lead to a complex task descrip-
tion.
4 Integrated Analysis Approach
Our approach integrates the single task analysis and
global system analysis. Fig. 2 shows the workflow of the
integrated analysis. The global system analysis consists of
tasks and event streams. A task $i$ consumes tokens from an
input event stream $E_i^{in}$ and produces tokens for the output
event stream $E_i^{out}$. The event streams are connected to tasks
according to the overall application.
Memory accesses are modeled with additional communication
between task and system level. The task requests
a number of memory transactions, $T_{mem}$, from the system
level (dashed arrow). The memory request is propagated
via a system path. Based on the event models of the remaining
tasks, the end-to-end latency of this request, $L(T_{mem})$,
is computed by the system level analysis. This latency is
used by the task level analysis to compute the worst case
response time (WCRT), which finally determines its output
event model $E_i^{out}$. All tasks of the system which share event
streams on this path, or which are directly affected by the
memory access, have to be re-analyzed in the next iteration
loop. The analysis stops when a fixed point is found.
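The iteration described above can be sketched as a simple fixed-point loop. This is a minimal illustration, not the SymTA/S implementation: the names `iterate_to_fixed_point`, `task_analysis` and `system_analysis` are hypothetical, and the two toy functions only stand in for the real task level and system level analyses.

```python
from typing import Callable, Dict

Latencies = Dict[str, int]   # L(T_mem) per task, in cycles
Wcrts = Dict[str, int]       # WCRT per task, in cycles


def iterate_to_fixed_point(
    task_analysis: Callable[[Latencies], Wcrts],
    system_analysis: Callable[[Wcrts], Latencies],
    initial: Latencies,
    max_iterations: int = 100,
) -> Latencies:
    """Alternate both analyses until the latencies no longer change."""
    latency = dict(initial)
    for _ in range(max_iterations):
        wcrt = task_analysis(latency)      # task level: WCRT from L(T_mem)
        updated = system_analysis(wcrt)    # system level: L(T_mem) from WCRTs
        if updated == latency:             # fixed point reached
            return latency
        latency = updated
    raise RuntimeError("no fixed point found within max_iterations")


# Toy stand-ins (assumptions, for illustration only).
def toy_task_analysis(latency: Latencies) -> Wcrts:
    # 100 cycles of core time plus three memory transactions per task
    return {task: 100 + 3 * lat for task, lat in latency.items()}


def toy_system_analysis(wcrt: Wcrts) -> Latencies:
    # tasks with a WCRT of 130 cycles or more see 5 extra cycles of contention
    return {task: 10 + (5 if r >= 130 else 0) for task, r in wcrt.items()}


result = iterate_to_fixed_point(
    toy_task_analysis, toy_system_analysis, {"heat_control": 0}
)
```

In this toy setup the latency of the hypothetical `heat_control` task settles at 15 cycles. The iteration bound is a safeguard added for the sketch; the paper does not discuss convergence conditions.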
[Figure 2: Task1 ... Taskn with event streams $E_i^{in}$ and $E_i^{out}$ connected to the SymTA/S global system analysis; the task analysis (CFG, ILP) sends $T_{mem}$, receives $L(T_{mem})$, computes the WCRT and generates $E_i^{out}$.]
Figure 2. Workflow of integrated analysis.
This abstract modeling allows several compositional im-
plementations. From task level only the memory requests
need to be specified. This means any task level analysis
which can determine the memory access patterns can be
used here. In the next section we describe in detail how we
extend our single task analysis. In Section 4.2 we present
an extension to our system level analysis, which implements
this general framework.
4.1 Extension of Single Task Analysis
Worst case execution time analysis is based on the
control-flow graph (CFG) of an application. To include
compiler optimizations, the assembly code is parsed and the
corresponding CFG is constructed. At this stage we assume
that the core execution time has been computed. For each
basic block (or single-feasible path) the number of memory
accesses is extracted. For example, if a basic block $i$
requests three 32-byte memory blocks, then the transaction
request is $T_{mem}(b_i) = (32,32,32)$. Such a request is
generated for each basic block. In the second step all requests
are collected and propagated to the system level analysis, where
the transaction latency $L(T_{mem}(b_i))$ is computed considering
the system state. This is described in Section 4.2.
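As a small illustration of how such a request could be built, the sketch below splits the bytes a basic block loads into line-sized transactions. The function name `transaction_request`, the default line size of 32 bytes and the block names are assumptions made for this example only.

```python
from typing import Tuple


def transaction_request(bytes_needed: int, line_size: int = 32) -> Tuple[int, ...]:
    """Split the bytes a basic block loads into line-sized transactions."""
    if bytes_needed < 0 or line_size <= 0:
        raise ValueError("bytes_needed must be >= 0 and line_size > 0")
    count = -(-bytes_needed // line_size)   # ceiling division
    return (line_size,) * count


# Requests collected per basic block before they are passed to the system level.
requests = {
    "b1": transaction_request(96),   # three 32-byte blocks
    "b2": transaction_request(0),    # no memory access in this block
}
```

Rounding up to whole lines matches the case of instruction caches, where entire cache lines are requested.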
The worst case execution time is computed by solving
the ILP, which is constructed from the control flow graph.
The general ILP consists of an objective function and several
structural and functional constraints. Structural constraints
represent possible control flow, and functional constraints
describe loop bounds or denote infeasible paths.
The sum-of-basic-blocks formulation, as proposed by Li and Malik [5],
is given by the following objective function:
$\max \sum_{i=1}^{n} c_i \cdot x_i$,
where $c_i$ denotes the core execution time of basic block $i$,
$n$ the number of basic blocks of the task and $x_i$
the execution count of basic block $i$. We extend the objective
function to
$\max \sum_{i=1}^{n} (c_i + L(T_{mem}(b_i))) \cdot x_i$
to include the latency of the memory accesses $T_{mem}(b_i)$ issued
during the basic block. In the following we omit $b_i$ and write
only $T_{mem}$ for simplicity. The ILP is then solved to
find the longest execution path in the program.
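The effect of the extended objective can be illustrated on a tiny loop-free CFG. The sketch below does not solve an ILP: it enumerates all paths explicitly, which is only feasible for such toy graphs, so every execution count $x_i$ is 0 or 1. The graph, the core times and the latencies are invented for the example.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]


def enumerate_paths(succ: Graph, node: str, sink: str) -> List[List[str]]:
    """Return all paths from node to sink in an acyclic CFG."""
    if node == sink:
        return [[node]]
    return [
        [node] + rest
        for nxt in succ[node]
        for rest in enumerate_paths(succ, nxt, sink)
    ]


def worst_case(succ: Graph, source: str, sink: str,
               cost: Dict[str, int]) -> Tuple[int, List[str]]:
    """Maximize the sum of block costs over all source-to-sink paths."""
    paths = enumerate_paths(succ, source, sink)
    best = max(paths, key=lambda path: sum(cost[b] for b in path))
    return sum(cost[b] for b in best), best


# Diamond-shaped CFG: b1 branches to b2 or b3, both join in b4.
cfg = {"b1": ["b2", "b3"], "b2": ["b4"], "b3": ["b4"], "b4": []}
core = {"b1": 10, "b2": 20, "b3": 5, "b4": 10}        # c_i
latency = {"b1": 8, "b2": 0, "b3": 30, "b4": 8}       # L(T_mem(b_i))
extended = {b: core[b] + latency[b] for b in core}    # c_i + L(T_mem(b_i))

core_only = worst_case(cfg, "b1", "b4", core)
with_memory = worst_case(cfg, "b1", "b4", extended)
```

With these numbers the core-only worst case is 40 cycles along b1, b2, b4, while the extended objective yields 71 cycles along b1, b3, b4. The worst case path can therefore change once per-block transaction latencies are taken into account.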
4.2 System-Level Analysis of Transaction
Latency
The basic SymTA/S model of [4] is composed of tasks
and event streams. In the following we show how to extend
the SymTA/S framework to compute $L(T_{mem})$. A transaction
starts at some task $\tau_i$, is transmitted via a chain of
intermediate tasks and ends at the same task $\tau_i$. The total
latency of this transaction is denoted by $L(T_{mem})$.
A request $T_{mem}$ has to be translated into an event
stream, which SymTA/S supports. An event stream is a
tuple $E = \{P, J, d_{min}\}$, where $P$ denotes the period, $J$ the
jitter and $d_{min}$ the minimum distance between two events.
Given these parameters, the system level analysis calculates
the worst case response time $R(E)$ for this event stream $E$
along the chain, considering every other event stream of the
system that relates to this chain. In this paper we assume
that the transactions of other resources are given and are
independent of system behavior. From the response time
$R(E)$ the latency of the transaction $L(T_{mem})$ is calculated.
First, we give a translation for single transactions. In
a second step we translate multiple transactions.
4.2.1 Single Transaction
The idea for a single transaction, such as $T_{mem} = \{32\}$, is
to restrict an event stream to a single event by choosing
a period greater than the jitter. If the period were smaller
than the jitter, the second event of the event stream could
arrive together with the first one. We define the event stream
$E$ for a single transaction by $P = J'$, $J = 0$ and $d_{min} = 0$.
The minimum distance is zero because this parameter is not
used. The jitter is set to zero to exclude any future events.
As period we choose a large number $J'$. If the jitter
along the chain, $J_{max}$, is larger than the initially assumed
$J'$, the period is adjusted to $J_{max} + 1$ and the system
analysis is called a second time.
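The translation of a single transaction, including the possible second analysis run, can be sketched as follows. The `EventStream` class and the `SystemAnalysis` interface, which is assumed to return the pair of $R(E)$ and the jitter $J_{max}$ along the chain, are hypothetical and do not reflect the actual SymTA/S API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class EventStream:
    period: int   # P
    jitter: int   # J
    d_min: int    # minimum distance between two events


# Hypothetical interface: returns (R(E), J_max along the chain).
SystemAnalysis = Callable[[EventStream], Tuple[int, int]]


def analyze_single_transaction(
    system_analysis: SystemAnalysis, assumed_period: int
) -> Tuple[EventStream, int]:
    """Model one transaction as an event stream with P = J', J = 0, d_min = 0."""
    stream = EventStream(period=assumed_period, jitter=0, d_min=0)
    response, j_max = system_analysis(stream)
    # The period must stay greater than the jitter on the chain, so this
    # sketch also re-runs the analysis when both are equal.
    if j_max >= stream.period:
        stream = EventStream(period=j_max + 1, jitter=0, d_min=0)
        response, j_max = system_analysis(stream)
    return stream, response


def toy_system_analysis(stream: EventStream) -> Tuple[int, int]:
    # Stand-in for the real analysis: fixed response time and chain jitter.
    return 25, 1500


stream, latency = analyze_single_transaction(toy_system_analysis, 1000)
```

Here the initially assumed period of 1000 is too small for the chain jitter of 1500 reported by the toy analysis, so the period is raised to 1501 and the analysis runs a second time.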
4.2.2 Multiple Transaction
Multiple events, such as $T_{mem} = \{32,32,32\}$, are modeled
as a burst, which means that all requests are issued at the same
time instant. As period $P$ we start with a large value, $J'$.
The jitter is set to $P \cdot (|T_{mem}|-1)$ to guarantee that exactly
$|T_{mem}|$ events arrive together. Future events are excluded
if the period is greater than the total jitter $J_{max}$ on the chain.
Unfortunately, $J_{max}$ is only available after the first system
level analysis iteration, so a second iteration may be
necessary. We define $E$ for multiple transactions by $P = J'$,
$J = P \cdot (|T_{mem}|-1)$ and $d_{min} = 1$.
We assume a minimum distance of one instruction to be
conservative. In the future we will analyze the minimum
distance between memory accesses within basic blocks to
increase analysis precision. We also restrict each event to
request the same amount of data for simplicity. For
instruction caches this is not a limitation, since entire cache
lines are requested.
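To see why this parameter choice produces a burst of exactly $|T_{mem}|$ events, the sketch below evaluates an upper bound on the number of events in a time window. The bound min(ceil((window + J) / P), ceil(window / d_min)) is a form commonly used for periodic event models with jitter and minimum distance; it is an assumption here and is not stated in this paper. The function names are hypothetical.

```python
from typing import Tuple


def burst_stream(num_transactions: int, period: int) -> Tuple[int, int, int]:
    """Return (P, J, d_min) for a burst of num_transactions events."""
    return period, period * (num_transactions - 1), 1


def max_events(window: int, period: int, jitter: int, d_min: int) -> int:
    """Upper bound on the number of events in any window of the given length."""
    if window <= 0:
        return 0
    by_period = -(-(window + jitter) // period)   # ceil((window + J) / P)
    if d_min <= 0:
        return by_period                           # minimum distance unused
    return min(by_period, -(-window // d_min))     # ceil(window / d_min)


P, J, D = burst_stream(num_transactions=3, period=1000)
counts = [max_events(w, P, J, D) for w in (1, 3, 1000, 1001)]
```

With P = 1000 and three transactions, the bound is 1, 3, 3 and 4 events for windows of 1, 3, 1000 and 1001 time units. Three events fit into any window up to one period long, and a fourth can only appear once the window exceeds the period.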
Now the system analysis can compute the response time
$R(E)$ for single and multiple events. In the case of a single
event this is equal to the latency of the transaction,
$L(T_{mem}) = R(E)$. In the case of multiple events, $R(E)$
denotes the response time of the event that has
the worst case response time among all events of this event
stream. This includes latencies of previous events. Because
it is unknown which event caused the worst case response
time, we have to assume that it is the last one. Therefore
$$L(T_{mem}) = (|T_{mem}|-1) \cdot d_{min} + R(E) \qquad (1)$$
The first term denotes the time until the last event is
activated. This completes the description of system chain
analysis. All tasks belonging to this chain need to be
re-analyzed, since their input event model might have changed
due to the transaction.
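Equation (1) can be checked with a short worked example. The function name and the response time of 40 time units are assumptions chosen for illustration.

```python
def transaction_latency(num_transactions: int, d_min: int, response_time: int) -> int:
    """Evaluate L(T_mem) = (|T_mem| - 1) * d_min + R(E)."""
    if num_transactions < 1:
        raise ValueError("a request needs at least one transaction")
    return (num_transactions - 1) * d_min + response_time


# Single transaction: the first term vanishes, so L(T_mem) = R(E).
single = transaction_latency(num_transactions=1, d_min=0, response_time=40)

# Burst of three transactions with d_min = 1: two time units pass
# until the last event is activated.
burst = transaction_latency(num_transactions=3, d_min=1, response_time=40)
```

For the single transaction the latency equals the response time of 40, and for the burst it is 42.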
5 Experiments
We applied the analysis to the architecture as shown in
Fig. 1. For comparison, we use a simulation environment
for a network processor and an isolated approach, where
every transaction is assumed to take the maximum time ob-
served during the simulation.
Eight applications are executed on CPU1. For simplicity
we provided some background traffic for CPU2 and the
co-processor. Figure 3 compares the worst case response
times for constant memory access time (isolated),
the WCRT of the analysis described in this paper (integrated)
and the maximum response time observed in simulation.
The analysis provides a significantly reduced WCRT
compared to constant memory access times. As simulation
cannot provide the worst case scenario we cannot evaluate
the accuracy using simulation.
Figure 3. Experiment for WCRT analysis.
6 Conclusions and Future Work
In this paper we have integrated a static timing analysis
on task level and a schedulability analysis on system
level. Current approaches focus on only one level and
assume a constant delay for each memory access. In this
approach several memory transactions within a basic block
are grouped together. In a second step, the total access time
of these transactions is determined by system level
schedulability analysis. The experiment shows that the analysis
precision increases significantly compared to the isolated
approach.
Future research includes considering the type and size
of a memory transaction as well as a greater minimum
distance of memory accesses beyond basic blocks.
References
[1] S. Chakraborty, S. Künzli, and L. Thiele. A general
framework for analysing system properties in platform-based
embedded system designs. In Design, Automation, and Test in
Europe, Munich, Germany, March 2003.
[2] P. Crowley and J.-L. Baer. Worst-case execution time estima-
tion for hardware-assisted multithreaded processors. In The
Second Workshop on Network Processors, Anaheim, Califor-
nia, USA, Feb. 2003.
[3] C. Ferdinand and R. Wilhelm. Efficient and precise cache
behavior prediction for real-time systems. Real-Time Systems,
1999.
[4] M. Jersak, K. Richter, and R. Ernst. Performance analysis
for complex embedded applications. International Journal of
Embedded Systems, Special Issue on Codesign for SoC, 2004.
[5] S. Malik and Y.-T. S. Li. Performance Analysis of Real-Time
Embedded Software. Kluwer Academic Publishers, 1999.
[6] T. Pop, P. Eles, and Z. Peng. Holistic scheduling and analysis
of mixed time/event-triggered distributed embedded systems.
In CODES, Estes Park, Colorado, USA, May 2002.
[7] K. Tindell, A. Burns, and A. Wellings. An extendible ap-
proach for analysing fixed priority hard real-time systems.
Journal of Real-Time Systems, 6(2):133–152, March 1994.
[8] F. Wolf, J. Staschulat, and R. Ernst. Hybrid cache analysis
in running time verification of embedded software. Design
Automation for Embedded Systems, 7(3):271–295, Oct 2002.
