BUNDLEP: Prioritizing Conflict Free Regions in Multi-Threaded Programs
  to Improve Cache Reuse -- Extended Results and Technical Report by Tessler, Corey & Fisher, Nathan
BUNDLEP: Prioritizing Conflict Free Regions in
Multi-Threaded Programs to Improve Cache Reuse
Corey Tessler
Wayne State University
corey.tessler@wayne.edu
Nathan Fisher
Wayne State University
fishern@wayne.edu
Abstract—In “BUNDLE: Real-Time Multi-Threaded Scheduling
to Reduce Cache Contention”, Tessler and Fisher propose a
scheduling mechanism and combined worst-case execution time
calculation method that treats the instruction cache as a beneficial
resource shared between threads. Object analysis produces a
worst-case execution time bound and separates code segments
into regions. Threads are dynamically placed in bundles as-
sociated with regions at run time by the BUNDLE scheduling
algorithm where they benefit from shared cache values.
In the evaluation of the previous work, tasks were created
with a predetermined worst-case execution time path through the
control flow graph. Apriori knowledge of the worst-case path is
an impractical restriction on any analysis. At the time, the only
other solution available was an all-paths search of the graph,
which is an equally impractical approach due to its complexity.
The primary focus of this work is to build upon BUNDLE,
expanding its applicability beyond a proof of concept. We present
a complete a worst-case execution time calculation method that
includes thread level context switch costs, operating on real pro-
grams, with representative architecture parameters, and compare
our results to those produced by Heptane’s state of the art
method. To these ends, we propose a modification to the BUNDLE
scheduling algorithm called BUNDLEP. Bundles are assigned
priorities that enforce an ordered flow of threads through the
control flow graph – avoiding the need for multiple all-paths
searches through the graph. In many cases, our evaluation shows
a run-time and analytical benefit for BUNLDEP compared to
serialized thread execution and state of the art WCET analysis.
I. INTRODUCTION
For hard real-time systems, cache memory complicates the
calculation of a task’s worst-case execution time (WCET)
bound [1]–[4] . An architecture that includes an instruction
cache creates the possibility of (at least) two execution times
for every instruction. Executing an instruction from the cache,
a cache hit, typically takes less time than an instruction
that must fetched from main memory (a cache miss). For
hierarchical caches, the multiple loading times from one level
to another increases the diversity of execution times per
instruction. Previous work preceding BUNDLE [5] accounts
for variations in these execution times due to tasks sharing a
single cache by extending execution times of the preempted
or preempting task [2], [6]–[8].
BUNDLE moves away from the classical (negative) perspec-
tive of caches, by treating the cache as a benefit to multi-
threaded hard real-time tasks. It is comprised of two parts,
a worst-case execution time with cache overhead (WCETO)
analysis and scheduling algorithm. The analysis leverages
BUNDLE’s thread-level scheduling algorithm in calculating a
bound which includes the cache benefit between threads. In
addition, the analysis identifies conflict free regions used to
make scheduling decisions.
As a first work, the evaluation in BUNDLE [5] served as
a proof of concept, demonstrating the potential benefit in
WCETO and run-time savings. For WCETO calculations, it
described an all-paths walk with complexity O((|V |!)m) for V
conflict free regions, and m threads. The evaluation subverted
this bound by construction of synthetic programs with a known
set of worst-case execution paths.
Synthetic tasks, all-paths walks, and prior knowledge of
their worst-case paths are barriers to BUNDLE’s practical
application and acceptance. The primary goal of this work is to
provide a suitable WCETO calculation method and run-time
evaluation for BUNDLE scheduling that can be compared to
the classical approach. The following contributions are made
to reach these goals:
• A method for calculating conflict free regions (CFRs)
where instructions participate in exactly one CFR.
• A BUNDLE-based scheduling algorithm that prioritizes
conflict free regions named BUNDLEP.
• A suitable WCETO calculation method for BUNDLEP
which incorporates context switch costs.
• A complete evaluation and simulation environment based
on the Heptane1 package available for download [9].
The remainder of this work is structured as follows. Sec-
tion II sets BUNDLE and BUNDLEP in context of the related
work. Section III summarizes the core concepts of BUNDLE
and background required for the extensions. Section IV gives
a brief overview of BUNDLEP’s approach in contrast to
BUNDLE. Section V outlines the creation of conflict free re-
gions. Section VI is divided into three subsections, describing
the thread bottlenecks, priority assignment, and the BUNDLEP
scheduling algorithm. Section VII details the WCETO method
for BUNDLEP. Section VIII describes the evaluation method
as a compliment to Heptane and results. Our work concludes
with final remarks and potential extensions in Section IX.
II. RELATED WORK
Other efforts have been made to mitigate or manage the
cache impact of concurrent tasks. Memory-Centric Schedul-
1See https://team.inria.fr/alf/software/heptane
1
ar
X
iv
:1
80
5.
12
04
1v
1 
 [c
s.D
C]
  3
0 M
ay
 20
18
ing [10] is influenced by the cache impact of each task.
However, it only supports PREM-compliant [11] tasks that
have been divided into load and execution phases. Loading
phases are isolated from one another preventing an inter-thread
cache benefit between them. Another approach to predictable
cache behavior is taken by [12], using management techniques
such as cache coloring and blocking. Cache reuse is increased
for a single task, but the method does not accommodate
cached values being shared between distinct threads or tasks.
These are representative examples of the classical (negative)
perspective of caches, in contrast to BUNDLE’s positive view.
Aside from BUNDLE, the only works we are currently
aware of taking a positive perspective on caches with re-
spect to schedulability are related to Persistent Cache Blocks
(PCBS) [13], [14] or cache spread [15]. PCBs are cache blocks
that remain in the cache after a job completes to be reused with
the next release. However, PCBs are limited to a single task
(or thread), and the analytical benefits is limited to subsequent
jobs. Calandrino’s [15] examination of cache spread is limited
to empirical analysis with a more coarse grained approach than
BUNDLE.
III. SUMMARY OF BUNDLE
The motivation for BUNDLE stems from a positive perspec-
tive of caches in the setting of multi-threaded tasks on a shared
processor. From this perspective, there is an inter-thread cache
benefit when a thread encounters an unexpected cache hit due
to the previous execution of a thread from the same task. Other
concurrent approaches [16], [17] do not account for the benefit
in their analysis due to their focus on finding the worst-case
cache interleavings. Conversely, BUNDLE schedules threads to
avoid the worst-case and creates a quantifiable benefit of cache
reuse between threads.
The sporadic model [18] is extended to support BUNDLE’s
thread-level scheduling algorithm and analysis. Each task τi
in the set of tasks τ is represented by a tuple of mini-
mum inter-arrival time, relative deadline, and initial ribbon:
τi = 〈pi, di, Ri〉. A ribbon is the set of reachable instructions
from a single entry instruction, described by a conflict free
region graph Ri. The initial ribbon of a task is a starting point
for the first threads of execution released with each job.
Due to the complexity of intra-task scheduling, BUNDLE’s
scheduling algorithm and analysis is limited to one processor,
a single task2, with one ribbon, releasing m threads per
job. Execution on the shared processor is aided by a single
level direct-mapped cache with l lines. Loading a block from
main memory to the cache takes B cycles, with a uniform
number of clock cycles per instruction (CPI) denoted I. This
paper is the first, implementation-based step towards our larger
goal of bringing BUNDLE to a fully preemptive system-wide
scheduler, with multiple tasks, hierarchical caches, on many
cores.
To quantify the inter-thread cache benefit, BUNDLE sched-
ules threads in a manner cognizant of the program’s structure
2Without modification BUNDLE’s techniques could be applied to a non-
preemptive task-level scheduler.
as well as potential cache conflicts within and between threads.
Central to BUNDLE’s scheduling decisions and WCETO anal-
ysis are conflict free regions.
Conflict free regions are created from the control flow
graph [19] (CFG) of a ribbon. A control flow graph is
a weakly connected directed graph representing the flows
of execution through an executable object. Described by a
triple G = (N,E, h) of nodes, edges, and entry instruction.
Typically the nodes n ∈ N of a CFG are basic blocks, for
simplicity of presentation nodes of CFGs within this work are
single instructions. The directed edges (u, v) ∈ E ∧ u, v ∈ N
define the possible paths through the program, with h ∈ N at
the root of all paths, ending in a single terminal node.
A conflict free region (CFR) is a subset of nodes and
edges from the CFG of a ribbon. The subset is determined
by including instructions which, when executed, would not
cause an eviction of each other. When a CFR is extracted
from the CFG G the structure of the program is maintained.
An extracted CFR is itself a CFG, denoted F = (N,E, h) with
the following properties:
1) No two instructions (outside of the same block) map to
the same cache line.
2) All instructions of F are weakly connected to the entry
instruction h.
3) For any two instructions in (u, v) ∈ F , if there was an
edge between them in G then (u, v) ∈ E (of F ).
The set of CFRs of a ribbon’s CFG are collected in
the ribbon’s conflict free region graph (CFRG). A CFRG
R = (N,E, h) is a CFG where the nodes are CFRs. Connec-
tivity between CFRs from the CFG is preserved in the edges
of the CFRG. For an edge (n1, n2) in the CFG, if n1 and n2
are placed in distinct CFRs, then the CFRG must contain an
edge between the CFRs. Figure 1 illustrates the relationships
between the CFG, CFRs, and CFRG.
Fig. 1: CFG, CFR, and CFRG of a ribbon
It is the CFRG which drives the scheduling decisions of
BUNDLE’s algorithm. For each CFR within the CFRG, the
scheduling algorithm creates a container for threads called a
bundle. Only one bundle is active at any time, and only threads
of the active bundle are allowed to execute. If a thread of the
active bundle attempts to execute an instruction outside of the
active CFR, the thread is blocked. After being blocked, the
thread is placed in the bundle of the CFR it attempted to enter.
After all threads of the active bundle have blocked, the bundle
is depleted and another bundle is selected as active.
Graphical Notation: Execution under BUNDLE is illustrated
in Figure 2. Annotation of the CFRG R = (N,E, h) will
remain consistent with other figures. Nodes ni ∈ N are CFRs,
2
ni = (Ni, Ei, hi). The entry instruction hi of CFR ni has a
main memory address, which is denoted ai. When presented as
nodes in graphs, CFRs will be squares and instructions circles.
When needed, the CFR shaded light gray indicates its bundle
is active. Small black squares adjacent to or within a CFR are
the threads of the associated bundle.
(a) n0 active (b) n3 active (c) n2 active
Fig. 2: BUNDLE Execution
In Figure 2a, n0 is active. Accordingly, BUNDLE executes
each of n0’s threads until they leave the CFR, entering n1 and
n2. The next active bundle selected is n3 in Figure 2b, its one
thread executes until termination. This process repeats until
all threads have terminated, and the job has been completed.
By restricting execution by CFR (bundle), the execution
time of threads are lowered by sharing cached values where
they otherwise would not (e.g. letting each thread complete
the execution of the entire CFG before switching to the next
thread). In BUNDLE [5], the selection of which bundle to
activate is arbitrary. Selecting in such a manner can reduce the
inter-thread cache benefit, which can be observed in Figure 2.
Had n1 and n2 been activated before n3, more threads would
have received the benefit when n3 was activated.
IV. OVERVIEW OF BUNDLEP
Arbitrary selection of bundle activation has the deleterious
effect of increasing the complexity of WCETO analysis.
BUNDLEP addresses both issues of increasing the inter-thread
cache benefit, and simplifying WCETO calculation with a
single solution: assign priorities to the CFRs of the CFRG. At
run-time, a bundle inherits the priority of its associated CFR,
and the bundle with the best priority is always activated.
(a) (b) (c)
Fig. 3: BUNDLEP Execution
Priorities are assigned to CFRs to guarantee the minimum
number of activations, which maximizes the inter-thread cache
benefit per activation. Priority assignments are based on the
intuition that the terminal node of the CFRG should have
the lowest priority. Those nodes immediately preceding the
terminal node are given the second lowest, etc. This intuition
is illustrated by example in Figure 3.
CFR’s have their priority listed as superscripts, where the
lowest value has the best priority, ie. node n12 has priority 1
which is worse than n00. Figure 3 revisits the CFRG of Figure 2
to highlight the benefit of prioritizing CFRs. Starting with 3a,
the best priority CFR is selected as active (n0) and the bundle
is depleted. BUNDLEP’s next activation is a choice between n1
and n2, they have equal priority, either would be a valid choice.
The selection of n2 is made before n1. Figure 3b illustrates
only the first choice, omitting depicting the activation of n1
in favor of n3. Figure 3c shows the clear benefit of priority
assignment. The bundle for n3 is activated only once for six
threads, this maximizes the inter-thread cache benefit which is
equivalent to minimizing the number of activations. Compared
to Figure 2 where n3 could be activated up to four times,
possibly quadrupling the cache load penalty for the CFR at
run-time and during WCETO analysis.
The priority assignment which minimizes the number of
activations is shown in Theorem 1 as the longest path value
for each node of the CFRG from the initial node given positive
edge weights. This influences the creation of CFRs and the
treatment of loops in the CFRG. As such, the CFRG must be a
DAG, where the assignment of priorities to CFRs is performed
in polynomial time with the added benefit of enabling a
tractable WCETO analysis.
Theorem 1 (Maximum Bundle Activations). For a CFRG
R = (N,E, h), which is a DAG, where each node ni ∈ N
has priority pii equal to the length of the longest path from
h to ni the bundle ni will be activated at most once per job
using BUNDLEP.
Proof. To illustrate a contradiction assume a CFR ni is
activated more than once. Then there must exist a node nj
with a worse priority (greater value) pij > pii on a valid path
〈h, ..., nj , ..., ni, ...〉. Given that R is a DAG, there can be no
path from ni to nj . Since priorities are assigned equal to the
longest path from h to a node, then pij < pii contradicting
pij > pii. Therefore, ni can be activated at most once.
V. CONFLICT FREE REGION CREATION
To ensure the CFRG of a ribbon is a DAG3, the process
of converting a ribbon’s executable object into a CFG, then
CFRs, and finally a CFRG must avoid introducing ambiguity
or loops into the CFG or CFRG. To do so, the process is
divided into two stages: 1.) create an expanded CFG 2.)
create and link CFRs. In Subsection V-A the motivation and
definition of expanded CFGs is provided. Subsection V-B
details the assignment of individual instructions to CFRs and
their compilation into a CFRG.
A. Expanded Control Flow Graphs
Typically, for a CFG G = (N,E, h), a node n ∈ N is a
basic block identified by the memory address of the first
instruction of the block. In this work, nodes are individual
instructions. However, nodes are not identified by their ad-
dress. They are identified by their address and call stack. This
prevents loops from being introduced into the CFG.
Common to other hard real-time programs, ribbons are
restricted from including infinite loops, function pointers,
3CFRG’s cannot be a DAG in the presence of user defined loops, to maintain
the DAG structure loops are collapsed – which is described in Section VI-C.
3
long jmp’s, or unbounded recursions. With these restrictions
in place, it is still possible to create loops and ambiguity in
the CFG of the ribbon. As illustrated by Figure 4.
1 p r o c e d u r e a ( x , y )
2 c = x
3 x = b ( y )
4 y = b (2 ∗ c )
5 r e t u r n x + y
6 end p r o c e d u r e
Listing 1: Procedure a()
a: Introduced Loop b: Breaking the Loop
Fig. 4: Preventing Introduced Loops
In Figure 4, Listing 1 is used to generate the two CFGs 4a
and 4b. Numbers adjacent to lines in the pseudocode are the in
memory address of the statements. There are no cycles in the
procedure, however the CFG in 4a contains a cycle with 4 and
b(). This is due to nodes of the CFG being identified by their
address coupled with two calls to b(). The loop is broken in
Figure 4b by identifying each node by the instruction’s address
and complete call stack. Identifying instructions in this way
preserves the structure of the program, without introducing
loops into the CFG.
Throughout the remainder of this work, all CFG operations
take place over the expanded CFG of ribbons. Where a node
n ∈ N is identified by its address a and callstack s of depth
k, s = 〈n1, n2, ..., nk〉. Each entry of the callstack is a node
in N , where the first entry is the top of the stack – the node
calling n’s function. In the case of the first instruction h (and
all other nodes reachable without a function call), the callstack
has one element ∅ indicating no parenting call.
Creating an expanded CFG from a ribbon is a straight-
forward modification of common CFG [20], [21] program
analysis. As such, a detailed description of expanded CFG
creation is omitted.
B. Conflict Free Region Assignment
A CFG serves as the basis of construction of conflict free
regions and subsequently the conflict free region graph. In the
previous work [5], a single node of the CFG could belong
to multiple CFRs. Nodes of the CFG are excluded from
participating in multiple CFRs under BUNDLEP. If nodes from
the CFG participated in multiple CFRs, then loops may be
introduced into the CFRG. Additionally, WCETO calculation
and scheduling decisions developed in this work, rely upon
nodes being assigned to exactly one CFR.
Section III listed the three requirements of CFRs from [5].
A fourth requirement is added to accommodate BUNDLEP
scheduling and WCETO calculation for a CFG G = (N,E, h)
and CFRs (F1, F2, ..., Ff ) where Fi = (Ni, Ei, hi):
4) A node n ∈ N is present in at most one CFR:
n ∈ Ni =⇒ ∀k 6=i n 6∈ Nk
Placing a node of the CFG in a single CFR is referred
to as assignment. The assignment process relies upon nodes
of the expanded CFG being annotated with their call stack
and inner-most loop head. A single node may participate in
multiple loops (loops embedded within another). All loops
have a head; a starting instruction that determines if the loop
will repeat. The inner-most loop head of an instruction is the
loop head closest to the node in the hierarchy of embedded
loops it belongs to, identified by any suitable algorithm [22].
Assignment begins with a bi-level depth first search (DFS).
The top-level DFS marks nodes of G as CFR entry points. The
bottom-level DFS marks nodes with their CFR while handling
the special cases that could create loops in the CFRG. Both
use conflicts to bound their searches and return the set of
conflicting nodes as successor nodes to continue the search.
The coordinated result of the bi-level searches complete the
first stage of assignment, called CFR entry tagging. The top-
level DFS procedure TAGCFRS() is presented as pseudocode
in Algorithm 1. It makes use of a simulated cache C to identify
conflicts, with three notable methods: C.insert(a) caches a’s
block, C.clear() removes all blocks, and C.conflicts(a) returns
true if C.insert(a) would evict a cached block.
Algorithm 1 TAGCFRS()
1: G = (N,E, h) . Expanded CFG G
2: C . Simulated Cache
3: procedure TAGCFRS
4: s.clear() . Local stack
5: v.clear() . Visited node array
6: s.push(h) . Starting condition
7: while not s.empty() do
8: n← s.pop() . Take a node
9: v[n]← true . Mark the node as visited
10: C.clear() . Reset the cache
11: X ← LABELNODES(n) . Label CFR nodes
12: for x ∈ X do
13: s.push(x) if not v[x] . Conflict begins a CFR
14: end for
15: end while . v[n] = true indicates n is a CFR entry.
16: end procedure
The TAGCFRS() procedure is responsible for tracking the
entry instructions of CFRs. It resembles a typical DFS, mark-
ing nodes as visited, and adding them to the search list
when they have not been visited. Where TAGCFRS() differs
is the selection of subsequent nodes. In a typical DFS, the
subsequent nodes are the immediate successors of the current
node, in TAGCFRS() the subsequent nodes are the entry
instructions of successive CFRs.
Those entry instructions are determined by the LABELN-
ODES() method. Since no CFR may contain a cache con-
flict, any instruction that would conflict must be the entry
4
instruction of a subsequent CFR. Figure 5 provides an example
call to LABELNODES(n3), n3 is the entry instruction of the
current CFR and LABELNODES(n3) returns the set of entry
instructions for subsequent CFRs {n7, n8, n9}. Placed below
each node in the figure is the cache block it maps to.
Fig. 5: Call to LABELNODES(n3) Returning X
In addition to returning the entry instructions of subsequent
CFRs, the LABELNODES() procedure is responsible for mark-
ing nodes of the CFG with the CFR they belong to. In Figure 5
the nodes {n3, n4, n5, n6} are labeled with their CFR n3. The
pseudocode for LABELNODES() is given in Algorithm 2.
Algorithm 2 LABELNODES()
1: G = (N,E, h) . CFG G, shared with TAGCFRS()
2: C . Simulated cache, shared with TAGCFRS()
3: procedure LABELNODES(n)
4: s, x . Local stacks (not shared with TAGCFRS())
5: v . Local visited array (not shared with TAGCFRS())
6: if n.label 6= ∅ then
7: `← n.label . Breaking an existing CFR
8: end if
9: s.push(n)
10: while not s.empty() do
11: u← s.pop()
12: if (n.isHead() ∧ not n.inLoop(u)) ∨
13: . Case 1, Loop Exit
14: (u.isHead() ∧ u 6= n) ∨
15: . Case 2, Loop Head
16: (u.label 6= ∅ ∧ u 6= `) ∨
17: . Case 3, Already Assigned
18: (C.conflicts(u.a)) . Case 4, Cache Conflict
19: then
20: x.push(u) . Push the Conflict
21: v[u]← true . Skip u’s successors.
22: end if
23: next while if v[u] . Already visited
24: v[u]← true
25: . Case 5, Add to CFR n
26: u.label ← n . Label u with CFR n
27: C.insert(u.a) . Insert u into the Cache
28: for y ∈ G.succ(u) do
29: if not v[y] then
30: s.push(y)
31: end if
32: end for
33: end while
34: return x
35: end procedure
During each iteration of the DFS within LABELNODES(),
a candidate node u is deemed within the CFR or an entry
instruction of a subsequent CFR. If u is within the CFR,
the node is labeled with the CFR’s initial instruction n, and
u’s successors are added to the search list s. If u is an
entry instruction it is placed in the set of conflicts x, and no
successors of u are added to s. When the search list is empty,
the set of conflicts x is returned to the caller.
There are four cases in LABELNODES when the node u
may be deemed a conflict. The simplest is Case 4, when u
conflicts with another value present in the cache. Since CFRs
must contain no conflicts, u cannot be added to the CFR n.
Case 1 and 2 are related to loops. If u falls under Case 1,
then n is a loop-head to which u does not belong. Since u
was reachable from some instruction in the CFR n, it must be
an exit point of the loop and will not be permitted as part of
the CFR. If u falls into Case 2, then u is a loop-head. Loops
are collapsed (described in Section VI) to ensure the CFRG
is a DAG. To permit collapsing, loop-heads must start CFRs
and CFRs must only contain instructions of the same loop.
a: Before LABELNODES(n2) b: LABELNODES(n2), no Case 3.
c: LABELNODES(n2), Case 3. d: LABELNODES(n3), Case 3.
Fig. 6: Case 3 Protection
Case 3 provides two necessary forms of protection. The first
is against loops being added to the CFRG (ie. the fourth CFR
requirement). The second is to prevent the nodes of a CFR
becoming disconnected. For the WCETO method each CFR
must have a single WCET path, which will not exist if the
CFR is not weakly connected.
Figure 6 compares the result if Case 3 was not present
versus it being present. Divided into four parts, Figure 6a is
a snapshot of the CFRG from the TAGCFRS() procedure’s
perspective after calling LABELNODES(n1) before calling
5
LABELNODES(n2); the top half is the CFG, the bottom half
is the resulting CFRG. Figure 6b shows the result of calling
LABELNODES(n2) without Case 3 in place, creating both a
disconnected CFR n1 and a loop in the CFRG between n1
and n2. Figure 6c is the result of calling LABELNODES(n2)
with Case 3 protection. With the protection in place when
LABELNODES(n2) returns to TAGCFRS(), it returns {n3} as
the set of entry points for subsequent CFRs. There is a an
issue to be resolved with respect to n3 in Figure 6c. Before
the call n3 was previously assigned to the CFR n1, but is now
an entry node to a new CFR. As previously stated, a node
must reside in exactly one CFR. Case 3 resolves this issue as
well in Figure 6d, the result of calling LABELNODES(n3).
When the TAGCFRS() procedure returns, tagging has been
completed. All nodes of the CFG have been assigned to CFRs
given by their n.label. What remains to complete the CFRG
is to add edges between CFRs. This final stage is referred
to as linking. Pseudocode for linking is ommitted due to the
simplicity of the operation; a DFS of the CFG where unique
labels are added to the CFRG as nodes, and edges added when
edges in the CFG have differing labels at the end points.
The result is a CFRG R = (N,E, h), where N is the set of
CFRs, E the edges between CFRs, and h the entry CFR. For
consistency and clarity, a CFR Fi is identified by its entry node
in the CFG ni, so is the corresponding node in the CFRG. For
example, in Figure 6 the node in the CFRG and the CFR n3 is
labeled so because its entry instruction was n3 from the CFG.
VI. BUNDLEP
After converting a CFG to a CFRG, priorities are assigned to
CFRs to minimize the number of activations of each bundle. At
run-time, the BUNDLEP scheduling algorithm relies on some
hardware mechanism that halts threads attempting to execute
the entry instruction of an inactive CFR. Such a mechanism
was proposed in [23]; it relied on cache evictions rather than
thread execution, making it unsuitable for BUNDLEP. This
section proposes a new conflict interrupt mechanism suitable
for halting threads. The XFLICT interrupt behavior closely
matches that of hardware breakpoints [24].
A. Hardware Support
The XFLICT interrupt represents the attempted execution
of an instruction that may result in a cache conflict. Since the
execution may or may not result in a conflict, the conclusion
that an interrupt should be raised cannot be made without
additional information. That additional information is encoded
in the XFLICT TABLE. When the program counter is set to a
value present in the XFLICT TABLE, the proposed hardware
mechanism halts the CPU before executing the instruction and
raises an XFLICT interrupt carrying ancillary data encoding
the address of the instruction that raised it.
To illustrate how the interrupt, table, and scheduling algo-
rithm work together the process of activating n2 is illustrated
below. In Figure 7a, n1 has been depleted and its two threads
have moved to n3 where they are blocked. The BUNDLEP
scheduling algorithm will now select n2 as active and begin ex-
ecuting threads. Before n2 is activated, the XFLICT TABLE
(abbreviated X in the figure) is cleared, and the values are
replaced with the addresses of the entry instructions of the
subsequent CFRs of n2: (a3, a4) – recall every node of the
CFG is identified by it’s address a and stack s, CFRs are
identified by their entry nodes from the CFG, in this case
n3 and n4. In Figure 7b, n2 has been activated and threads
are permitted to execute. When any thread attempts to execute
either a3 or a4 the XFLICT interrupt is raised before the thread
can execute it. The BUNDLE scheduling algorithm receives
the interrupt, blocks the thread, and places it into the correct
bundle. Figure 7c provides a snapshot of execution after the
first two threads of n2 have exited the bundle and are blocked
awaiting the activation of n3 or n4.
(a) (b) (c)
Fig. 7: XFLICT Interrupts and BUNDLEP
B. BUNDLEP’s Scheduling Algorithm
Incorporation of the XFLICT interrupt and priorities into
BUNDLE scheduling is given by Algorithm 3, the pseudocode
for BUNDLEP. Bundles are stored in the array B, indexed by
the nodes of the CFRG R. A single bundle b has four members:
a starting address b.a, node from the CFRG b.n, priority b.pi,
and set of threads b.t. When a bundle is ready, it is placed in
the priority queue P .
Lines 6-8 are responsible for adding the threads of the job
to the initial bundle and adding the bundle to the priority
queue. Each iteration of the outer while loop starting on line 9
removes the best priority bundle from the queue and activates
it.
Lines 10-17 manage the XFLICT TABLE and a local
mapping from address to CFRG node. Addresses of successors
to b are added to the XFLICT TABLE so that interrupts will
be delivered when threads exit b. However, those addresses
may map to more than one CFR. To avoid ambiguity, the S
array maps addresses of successor CFRs to their proper node
in the CFRG.
The inner for loop starting on line 18, executes each thread
of the active bundle. When an XFLICT interrupt is raised, the
thread is placed in the successor’s bundle and added to the
priority queue as a candidate for activation. If the interrupt is
not raised, the thread has terminated and belongs to no bundle.
It is appropriate to mention the data structures used for bun-
dles and threads at this point. Threads of bundles are removed
6
Algorithm 3 BUNDLEP Scheduling Algorithm
1: T . Set of Threads
2: R = (N,E, h) . Conflict Free Region Graph
3: P . Priority Queue of Ready Bundles
4: B . Array of bundles indexed by their node n ∈ N
5: procedure BUNDLEP
6: b← B[h]
7: b.t.add(T )
8: P .insert(b, b.pi)
9: while b← P .removeMax() do . Best Bundle
10: S ← ∅ . Clear the successor array
11: XFLICT_CLEAR() . Clear the XFLICT table
12: . Create the mapping of address to node
13: for s ∈ R.succs(b.n) do
14: bs ← B[s]
15: S[bs.a]← bs
16: XFLICT_ADD(bs.a)
17: end for
18: for t ∈ b.t do
19: try {
20: RUN(t)
21: } catch (XFLICT x) {
22: bnext ← S[x.a] . Get the next bundle
23: bnext.t.add(t)
24: P.insert(bnext, bnext.pi)
25: next for . t has not terminated
26: }
27: end for
28: end while
29: end procedure
on line 18, the order is irrelevant requiring nothing more
than an array with O(1) time for extraction. Bundle activation
order is important, being removed from a priority queue on
line 9. An efficient implementation using a Fibonacci heap
has complexity O(1) for insertions and amortized complexity
O(log n) for removeMax. Thus each iteration of the while
loop is dominated by the removeMax operation, which will
be accounted for in the WCETO analysis of Section VII.
C. Priority Assignment
Priorities are assigned to CFRs during the offline object
analysis based on their longest path value from the entry
CFR h. The priority assignment method requires the CFRG
to be a DAG. This requirement is impossible to meet in all
circumstances. Although the analysis does not introduce loops,
the program structure may contain them. These user defined
loops may be contained within a single CFR, which would
ease the analysis, or they may span multiple CFRs creating a
necessary loop in the CFRG.
The structure of CFRs from user defined loops allows them
to be collapsed into a single false node. Every loop has a head,
members, and at least one exit CFR. The loop head identifies
all members of the loop, which are are collapsed into a single
false node during priority assignment and WCETO calculation.
Priorities are first assigned to the interior nodes, then assigned
the false node in the greater scope.
Figure 8 illustrates two scopes of loop collapse. In Fig-
ure 8a, at the greatest scope there is a single loop which is
collapsed into a single false node n1. As part of being col-
lapsed, the interior nodes are assigned priorities corresponding
to their longest path from the loop head. The false node is
given a priority according to the longest path to reach it from
the entry node, with two caveats. First, the interior real (ie.,
not false) node that is the loop head must have a priority worse
than all other members of the loop. Second, each false node
is given a unique priority within its scope, this guarantees
only members of the loop execute until all threads exit the
loop. This is why the false node n1 denoted by a hexagon has
priority 4, which is inherited by the loop head.
(a) Collapsing a Loop
(b) Collapsing an Embedded
Loop
Fig. 8: Loop Collapse and False Nodes
Figure 8b illustrates a smaller scope of embedded loop
collapse. The inner loop is processed first, which affects the
priority values of members within the outer loop.
By repeatedly collapsing loops into false nodes, all loops are
removed and the CFRG is converted to a DAG. Additionally,
at each scope within a collapsed node the graph structure is
also DAG (when excluding the incoming edges to the loop
head). Assigning priorities to interior nodes of the same scope
equal to their longest path from the loop head (with the worst
priority) guarantees each CFR is activated at most once per
iteration shown in Theorem 2.
Theorem 2 (Maximum Bundle Activations per Iteration). For
a graph of the collapsed node G = (N,E, n0) with loop
head n0, set of in-scope nodes N and edges E, where
each node ni ∈ N has priority pii equal to the longest path
from n0 to ni, and n0 has priority worse than all others
{pi0 | ∀nj∈N pi0 > pij} the bundle ni will be activated at most
once per iteration of n0.
Proof. Observation I: For a node n` in scope of n0 that is a
loop head, n` will be collapsed with all other nodes that have
n` as their inner-most loop head. Only the collapsed node n`
will be in scope of n0. Therefore, for any collapsed node, there
is exactly one loop in scope with head n0.
Observation II: A single iteration of the loop contained
within a collapsed node is defined as the series of activations
that begins with the activation of n0 and ends just before n0
would be selected as active once again. Since n0 has the worst
7
priority among all in-scope nodes, all other nodes must have
been depleted before n0 could be activated again.
Observation III: All other threads not in the loop of n0
must be in bundles with worst priority than any bundle in the
current scope. (Otherwise, bundles in n0’s scope would not
be scheduled). Thus, the current loop (and embedded loops)
will complete all iterations before any out-of-scope bundle is
executed.
Consider the graph G where the incoming edges
to n0 have been removed, removing the cycle, i.e.,
E = {(u, v)|(u, v 6= n0) ∈ E} as a graph G = (N,E, n0). By
Observation I, G is now a DAG of CFRs. Treating a single
iteration as a job release and applying Theorem 1 to G, each
n ∈ N is activated at most once per iteration for all threads
executing the loop.
VII. BUNDLEP WCETO CALCULATION
As a practical effort, the focus of this work is on the
calculation of an effective, safe WCETO bound. To that
goal, the bound calculation is formulated as an integer linear
program (ILP) the number of variables grow at O(V + E).
This section is devoted to describing the transformation of a
CFRG into a set of constraints and an objective function.
Assigning priorities to nodes of the CFRG and collapsing
loops (as described in Section VI) guarantees each node is
activated at most once. As such, the contributions of individual
nodes may be considered in isolation. What determines a
node’s individual contribution is the number of threads as-
signed to it. The maximization problem becomes finding the
greatest sum of contributions of individual nodes for a valid
assignment of threads. Figure 9 illustrates the relationship
between the CFRG, WCETO of individual nodes ωn(tn), and
objective function Ω =
∑
n∈N ωn(tn). Refer to Figure 13 in
the appendix for a more detailed example.
Fig. 9: CFRG Individual Nodes and ILP Objective
The WCETO of a node ωn(tn) depends on the number of
threads assigned to it tn. For real nodes, the function takes
the form of Equation 1. We assume a timing-compositional
architecture [25]; the number of cycles required to complete a
single node is divided into two parts: the memory demand and
the execution demand. The memory demand of a node n is the
product of the set of ECBs found in the CFR ECBn and the
block reload time B, γn = |ECBn| · B. The execution demand
is the product of the worst-case execution time of a single
thread over the node cn and the number of threads assigned tn.
Two context switch costs are included to reflect the penalty of
BUNDLEP scheduling, Xb is the number of cycles required to
switch to a new active bundle, and Xt is the cost of selecting a
thread from the active bundle. The costs Xb and Xt are directly
related to lines 9 and 18 of Algorithm 3.
ωn(tn) =
{
0, tn = 0
cn · tn · Xt + Xb + γn, tn ≥ 1
(1)
The WCETO for false nodes is given in Equation 2. It
depends on In the maximum number of iterations of the
loop collapsed into the false node n, and inscope(n). The
set of nodes returned by inscope(n) are the interior nodes of
the false node n, which may include other false nodes. For
example in Figure 8b, inscope(n1) = {n1, n2, n3, n5} where
n1 is a real node and n3 a false node.
ωn(tn) =

0, tn = 0
◦
γn + In ·
∑
i∈inscope(n)
◦
ωi(ti), tn ≥ 1 (2)
The memory and execution demand of a false node are not
entirely separable. Individual nodes within scope of n have
their per-iteration contribution bounded by
◦
ωi(ti), described
later. An initial memory demand for the false node n is
calculated as
◦
γn, it represents the number of cycles required to
cache all blocks of nodes within the collapsed node regardless
of scope. The set of nodes allscope(n) includes any real
node that has been collapsed under n. Using Figure 8b,
allscope(n1) = {n1, n2, n3, n4, n5} where no node is false.
Using the set of all nodes collapsed under n, the
multi-set union of their ECBs is formed and labeled◦
ECBsn =
⋃
i∈allscope(n)ECBsi. The product of cardinality
of the ECB multiset and the block reload time produces
◦
γn = B · |
◦
ECBsn|. This value accounts for all of the cycles
required to load every cache block found in the collapsed node.
For a false node i collapsed under a false node n, its
WCETO contribution
◦
ωi(ti) is defined by Equation 2. For
a real node i under a false node n, its WCETO contri-
bution
◦
ωi(ti) is given by Equation 3. It includes context
switch costs, execution demand, and the worst-case mem-
ory demand. A method similar to the ECB-Union CRPD
approach [2] is employed to calculate the memory demand
from the perspective of the affected node i. The worst-
case occurs when another real node collapsed under n
evicts the ECBs of i, forcing the blocks ECBsi to be
loaded when i is activated. The number of evictions can be
bounded by the ECBs of all loop members, specifically those
that occur more than once in the loop. The set of ECBs
found more than once under the collapsed node are given
by
2
ECBsn = {
⋃
u·k | u · k ∈
◦
ECBsn ∧ k ≥ 2}. Thus, the
memory demand bound for i is
2
γi = |
2
ECBsn ∩ ECBsi| · B.
Incorporating per-iteration context switches, execution and
memory demand into the bound for i yields Equation 3.
◦
ωi(ti) =
{
0, tn = 0
Xb + Xt · ti · ci + 2γi, tn ≥ 1
(3)
8
A valid assignment of threads takes into account the struc-
ture of the CFRG. To reflect the structure, threads are treated as
flow traversing the edges of nodes. The entry node is treated as
the source of flow, providing a total m threads on its outgoing
edges. All threads much reach the terminal node. At each node
the sum of threads along incoming edges and outgoing edges
must be equal (except the entry and terminal nodes).
The ILP finds the assignment of threads according to
the flow of the CFRG which maximizes the number of
cycles required to complete m threads according to BUNDLEP
scheduling, thus bounding the WCETO of a job. To conserve
space in the main body of the work, the details of transforming
the formulae to ILP constraints is placed in Appendix A.
VIII. EVALUATION
The evaluation takes the approach of comparing BUNDLEP’s
thread-level scheduling algorithm to a naive algorithm which
executes threads one after another (serially). Individual bench-
marks from the Ma¨lardalen [26] MRTC suite are treated as
ribbons releasing m threads per job. The WCET of each
job is analyzed twice, once for a single multi-threaded task
scheduled by BUNDLEP, and again for m serial threads by
Heptane. Similarly, the run-time behavior is collected for
each benchmark under BUNDLEP and serial execution. A
fully functional virtual machine with the tools and source is
available for download to recreate these results or expand upon
them [9].
Ideally, BUNDLEP would also be compared with BUNDLE.
However, the evaluation in [5] used synthetic programs rather
than compiled source (for any architecture). WCETO analysis
for BUNDLE is also intractable with complexity O((|N |!)m).
This is due to the nature of the algorithm, it does not restrict
the flow of threads through the CFRG, which demands all-
paths be repeatedly searched. A novel BUNDLE WCETO
implementation of an intractable solution, which is known to
be dominated by BUNDLEP is not compelling, as such it is
omitted from the evaluation.
The target platform for WCETO analysis and execution
is a MIPS 74K processor with a direct mapped single level
instruction cache. Cache blocks are restricted to 32 bytes. The
CPI I, block reload time B, and number of cache blocks ` vary
based on Table I. Additionally, the number of threads per job
m vary from 1 to 16 by powers of two. Jobs are executed on a
MIPS simulator provided by Heptane and modified to execute
BUNDLEP scheduling or a serial batch of threads.
CPI (I) BRT (B) ` m
1 100 {8, 16, 32} {2, 4, 8 ,16}
10 100 {8, 16, 32} {2, 4, 8 ,16}
TABLE I: MIPS 74K Architecture Parameters
Of the 27 MRTC benchmarks, 18 were evaluated. The
selection is limited by Heptane’s ability to perform WCET
analysis using the lp_solve ILP solver and the 12 gigabytes
of RAM available (the complete results are available in the
technical report [27]). For each benchmark, Heptane produces
a single WCET value for the execution of one thread through
the ribbon denoted cH . To compare Heptane’s WCET to
BUNDLEP’s WCETO Ω, the number of threads and context
switch costs are incorporated and quantified as a difference
∆ω = m · (cH + Xb)− Ω. Similarly, the number of cycles
required to execute on the simulator serially is denoted EH ,
cycles required for BUNDLEP execution denoted EB , and the
comparison ∆B = EH − EB . A positive ∆ value indicates
the BUNDLEP approach provides a benefit.
Context switch costs are encapsulated in BUNDLEP’s
WCETO Ω, they are not incorporated into cH . Serial execution
models threads as jobs of distinct tasks, switching between
jobs incurs a task-level context switch cost. A job switch
includes more heavyweight operations than a bundle-level
context switch, such as exchanging task control blocks instead
of thread control blocks. To favor the classical approach, the
bundle-level context switch cost Xb is also used as the task-
level context switch cost.
There are two context switch costs for BUNDLEP: between
bundles Xb and between threads Xt. To find representative
values for both costs, sample programs were written for the
target architecture and analyzed using Heptane. Selecting a
thread from an array and jumping to a new instruction address
took less than 10 cycles. For Xb a precise value would
require the implementation of a priority queue, supported by
an optimized heap, that could be analyzed by Heptane. The
implementation of such a priority queue is beyond the scope
of this work. However, a limited set of queue operations were
analyzed for removing one of two items taking less than
55 cycles. Since removeMax grows at log2(m) for bundle
selection, the context switch costs are set to Xb = 55 · log2(m)
and Xt = 10.
Figures 10a and 10b summarize the results of the evalua-
tion. The y-axis represents the number of benchmarks where
BUNDLEP benefits the task. Along the x-axis, the groups
separate the architecture parameters which are enumerated by
their “(B:I, `)”. For each group the result is tallied by the
number of threads per job, from 1 to 16.
There are several interesting observations to be made in
Figure 10a. Though BUNDLEP analysis provides a benefit in
the majority of cases, it does not always. As the cache size
is reduced the number of benchmarks that benefit increases.
Similarly, as the number of threads per job increases so do the
number of benchmarks that benefit. These trends are due to
the number of misses (typically) increasing as the cache size is
reduced, or the number of threads increased. BUNDLEP avoids
these conflicts or converts them to cache hits. Surprisingly, for
a single thread per job BUNDLEP may provide a lower bound
– this is likely due to the use of the expanded CFG instead of
the conventional CFG used by Heptane’s analysis.
The run-time benefit summary in Figure 10b more heavily
favors BUNDLEP, with unsurprising trends. For a single thread
per job, BUNDLEP provides no benefit since there is no reason
to block and incur context switch costs. As the number of
threads increases so does the run-time performance. As the
cache size decreases, the number of benchmarks that see
a run-time performance increases. When compared to the
9
24
6
8
10
12
14
16
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
o
u
n
t
of
B
en
ch
m
ar
k
s
∆
ω
>
0
Architecture (B:I, `)
m = 1
2
4
8
16
(a) Analytical Benefit of BUNDLEP
0
2
4
6
8
10
12
14
16
18
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
ou
n
t
of
B
en
ch
m
a
rk
s
w
h
er
e
∆
B
>
0
Architecture (B:I, `)
m = 1
2
4
8
16
(b) Run-Time Benefit of BUNDLEP
Fig. 10: Benefits of BUNDLEP
WCETO benefit, more benchmarks benefit from the run-time
behavior than the analysis would suggest. This implies further
refinements of the analysis are possible.
Across the four dimensions of the evaluation (cache size,
BRT, CPI, and number of threads per job), the expectation of
BUNDLEP’s benefit will increase as the cache size decreases,
increase as the BRT increases, decrease as the CPI increases,
and increase as the number of threads per job increase. Many
of the benchmarks match these expectations, such as the results
for ud found in Figure 11.
−1 × 106−500000
0
500000
1 × 1061.5 × 10
6
2 × 1062.5 × 10
6
3 × 1063.5 × 10
6
4 × 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark ud
m = 1
2
4
8
16
(a) Analytical Benefit
−100000
0
100000
200000
300000
400000
500000
600000
700000
800000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark ud
m = 1
2
4
8
16
(b) Run-Time Benefit
Fig. 11: Results for the ud Benchmark
−8 × 106
−7 × 106
−6 × 106
−5 × 106
−4 × 106
−3 × 106
−2 × 106
−1 × 106
0
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark matmult
m = 1
2
4
8
16
(a) Analytical Benefit
−600000−400000
−2000000
200000
400000
600000
800000
1 × 1061.2 × 10
6
1.4 × 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark matmult
m = 1
2
4
8
16
(b) Run-Time Benefit
Fig. 12: Results for the matmult Benchmark
The anomalies provide insights into the circumstances
where BUNDLEP may be improved. Using the matmult
benchmark as an example, BUNDLEP never produces an
analytical benefit, and rarely a run-time benefit (Figure 12).
Counter-intuitively as the number of threads increase the
analytical result worsens compared to the serial bound. This
is due to the structure of the CFRG, which has several small
CFRs contained within multiply embedded loops. The number
of bundle-level context switches with cost Xb outweighs the
benefit of sharing cache values.
IX. CONCLUSION
This work expands upon the foundation set in BUNDLE [5].
The central principles of treating the cache as a benefit to
execution times by scheduling threads in a cache cognizant
manner are refined and improved. In [5] permitting instructions
to reside in multiple CFRs created ambiguity in scheduling
decisions. The methods of [5] to create CFRs and CFRGs
introduced unnecessary loops.
Refining the creation of CFRs removes ambiguity from
scheduling decisions and prevents loops from being added
to the CFRG. The preclusion of additional loops, and the
assignment of priorities to CFRs reduces the complexity of
the WCETO calculation. Additionally, restricting the flow of
threads through the CFRG tightens the WCETO bound.
These improvements are demonstrated in the toolkit avail-
able for download and reuse for future use and expansion,
demonstrating BUNDLEP’s practical applicability for repre-
sentative compiled programs. The evaluation shows a benefit
to BUNDLEP scheduling in terms of analysis and run-time
behavior for specific programs and architecture parameters.
The evaluation provides further motivation for future im-
provements in the extraction of CFRs, scheduling mechanism,
and bound calculation. Benchmarks with anomalous results
highlight the cost of BUNDLEP scheduling, the greater number
of context switch costs. This cost must be balanced against the
inter-thread cache benefit, which is not always the case.
Future work seeks to find a balance in two ways: 1.)
ensuring CFRs are of greatest size 2.) developing criteria to
permit some cache conflicts when an imbalance exists. These
efforts coincide with our ongoing development of a multi-task
version of BUNDLEP integrated with the evaluation toolkit.
10
REFERENCES
[1] S. Altmeyer, R. I. Davis, L. Indrusiak, C. Maiza, V. Nelis, and J. Reineke,
“A generic and compositional framework for multicore response time
analysis,” in International Conference on Real Time and Networks
Systems. ACM, 2015, pp. 129–138. 1
[2] Y. Tan and V. Mooney, “Timing analysis for preemptive multitasking
real-time systems with caches,” ACM Transactions on Embededed
Computing Systems, vol. 6, no. 1, Feb. 2007. 1, 8
[3] S. Altmeyer, R. I. Davis, and C. Maiza, “Cache related pre-emption delay
aware response time analysis for fixed priority pre-emptive systems,” in
IEEE Real-Time Systems Symposium, Nov 2011, pp. 261–271. 1
[4] W. Lunniss, R. I. Davis, C. Maiza, and S. Altmeyer, “Integrating cache
related pre-emption delay analysis into edf scheduling,” in IEEE Real-
Time and Embedded Technology and Applications Symposium, 2013, pp.
75–84. 1
[5] C. Tessler and N. Fisher, “BUNDLE: Real-Time Multi-Threaded Schedul-
ing to Reduce Cache Contention,” in IEEE Real-Time Systems Sympo-
sium, 2016. 1, 3, 4, 9, 10
[6] H. Tomiyama and N. D. Dutt, “Program path analysis to bound cache-
related preemption delay in preemptive real-time systems,” in Interna-
tional Workshop on Hardware/Software Codesign, 2000, pp. 67–71. 1
[7] C.-G. Lee, J. Hahn, S. L. Min, R. Ha, S. Hong, C. Y. Park, M. Lee, and
C. S. Kim, “Analysis of cache-related preemption delay in fixed-priority
preemptive scheduling,” in IEEE Real-Time Systems Symposium, 1996.
1
[8] S. Altmeyer and C. Maiza Burguie`re, “Cache-related preemption delay
via useful cache blocks: Survey and redefinition,” Journal of Systems
Architecture, vol. 57, no. 7, pp. 707–719, Aug. 2011. 1
[9] C. Tessler, “BUNDLEP virtual machine toolkit,” 2018. [Online].
Available: http://www.cs.wayne.edu/∼fh3227/bundlep 1, 9
[10] S. Bak, G. Yao, R. Pellizzoni, and M. Caccamo, “Memory-aware
scheduling of multicore task sets for real-time systems,” in IEEE Inter-
national Conference on Embedded and Real-Time Computing Systems
and Applications, Aug 2012, pp. 300–309. 2
[11] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, and
R. Kegley, “A predictable execution model for cots-based embedded sys-
tems,” in IEEE Real-Time and Embedded Technology and Applications
Symposium, April 2011, pp. 269–279. 2
[12] B. C. Ward, J. L. Herman, C. J. Kenna, and J. H. Anderson, “Making
shared caches more predictable on multicore platforms,” in Euromicro
Conference on Real-Time Systems, July 2013, pp. 157–167. 2
[13] S. A. Rashid, G. Nelissen, S. Altmeyer, R. I. Davis, and E. Tovar,
“Integrated analysis of cache related preemption delays and cache
persistence reload overheads,” in IEEE Real-Time Systems Symposium,
Dec 2017, pp. 188–198. 2
[14] S. A. Rashid, G. Nelissen, D. Hardy, B. Akesson, I. Puaut, and
E. Tovar, “Cache-persistence-aware response-time analysis for fixed-
priority preemptive systems,” in Euromicro Conference on Real-Time
Systems, July 2016, pp. 262–272. 2
[15] J. M. Calandrino, “On the design and implementation of a cache-aware
soft real-time scheduler for multicore platforms,” Ph.D. dissertation,
Chapel Hill, NC, USA, 2009. 2
[16] R. Mittermayr and J. Blieberger, “Timing analysis of concurrent pro-
grams,” in OpenAccess Series in Informatics, vol. 23, 2012. 2
[17] Y. Liang, H. Ding, T. Mitra, A. Roychoudhury, Y. Li, and V. Suhendra,
“Timing analysis of concurrent programs running on shared cache multi-
cores,” Real-Time Systems, vol. 48, no. 6, pp. 638–680, Nov 2012. 2
[18] S. K. Baruah, A. K. Mok, and L. E. Rosier, “Preemptively scheduling
hard-real-time sporadic tasks on one processor,” in IEEE Real-Time
Systems Symposium, Dec 1990, pp. 182–190. 2
[19] F. E. Allen, “Control flow analysis,” ACM SIGPLAN Notices, vol. 5,
no. 7, pp. 1–19, Jul. 1970. 2
[20] F. Nielson, H. R. Nielson, and C. Hankin, Principles of Program
Analysis. Berlin, Heidelberg: Springer-Verlag, 1999. 4
[21] J. Midtgaard, “Control-flow analysis of functional programs,” ACM
Computing Surveys, vol. 44, no. 3, pp. 10:1–10:33, Jun. 2012. 4
[22] T. Wei, J. Mao, W. Zou, and Y. Chen, “A new algorithm for identifying
loops in decompilation,” in Static Analysis. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2007, pp. 170–183. 4
[23] C. Tessler, G. Bloom, and N. Fisher, “Work-in-progress: Reducing cache
conflicts via interrupts and bundle scheduling,” in IEEE Real-Time and
Embedded Technology and Applications Symposium, April 2017, pp.
125–128. 6
[24] MIPS Technologies, Inc., MIPS R© 74KTM Processor Core Family
Software User’s Manual, 2011. [Online]. Available: https://www.mips.
com/products/classic/ 6
[25] S. Hahn, J. Reineke, and R. Wilhelm, “Towards compositionality in
execution time analysis: Definition and challenges,” ACM SIGBED
Review, vol. 12, no. 1, pp. 28–36, Mar. 2015. 8
[26] J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper, “The Ma¨lardalen
WCET Benchmarks: Past, Present And Future,” in International Work-
shop on Worst-Case Execution Time Analysis, vol. 15. Dagstuhl,
Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2010, pp.
136–146. 9
[27] C. Tessler, “BUNDLEP evaluation results and technical report,” 2018.
[Online]. Available: http://arxiv.org/... 9
X. ACKNOWLEDGMENTS
We would like to express our gratitude to Isabelle Puaut and
Damien Hardy for their personal assistance with Heptane and
MIPS simulator. Without a freely available, extensible, and
reusable platform this work would not have been possible,
thank you!
11
APPENDIX A
ILP CONSTRAINTS FROM WCETO CONTRIBUTIONS
This appendix is dedicated to describing the transformation
of equations 1, 2, 3 and the supply of threads into the
constraints of the ILP for WCETO calculation. For a CFRG
R = (N,E, h), the objective of the ILP is to maximize:
Ω =
∑
n∈N
ωn(tn)
Several variables are added to the ILP which are not present
in the formulae. A binary selector variable bn ∈ {0, 1} is
added for each node, when the value is 1 the node has at
least one thread assigned to it. For every edge (u, v) ∈ E, the
variable t(u,v) represents the number of threads passed from
node u to v. The terminal node of the CFRG is identified as
z ∈ N , having out-degree zero.
Two functions are defined for each node. The successor and
predecessor functions return the sets given by their names.
Both of these functions properly obey the scope of the pro-
vided node n. If a false node is provided, the nodes collapsed
within it will not be included in the set. If a real node collapsed
within a false node is provided the set will include only nodes
found within the collapsed node (real or false).
Functions
preds(n) , {u|(u, n) ∈ E}
Set of immediate predecessors of n ∈ N .
succs(n) , {v|(n, v) ∈ E}
Set of immediate successors of n ∈ N .
What follows are the individual constraints generated for
each node. To clarify, a top-most false node contributes its
WCETO directly to the objective, being a part of n ∈ N .
Nodes collapsed within the false contribution to the objective
indirectly. Nodes hidden by collapse have their WCETO
reflected by their containing false node’s ω value.
Node Constraints
tn ∈ {0,m}
Number of threads assigned to node n.
bn ∈ {0, 1} ≤ tn
Binary selector for n, indicating that n has at least
one thread assigned.
tn ,
∑
u∈preds(n)
t(u,n)
Number of threads assigned to n must be equal to
the sum of all entering n.
tn ,
∑
v∈succs(n)
t(n,v)
Number of threads assigned to n must be equal to
the sum of all leaving n.
ωn , cn · tn · Xt + Xb · bn + γn · bn
WCETO of a regular n, see Equation 1.
ωn , (
◦
γn · bn) + In
 ∑
i∈inscope(n)
◦
ωi

WCETO for a false node n, see Equation 2.
◦
ωn , cn · tn · Xt + (Xb + 2γn) · bn
WCETO for a real node n within a collapsed node.
Special Case Constraints
th , m
The initial node h must have all m threads assigned.
tz , m
The terminal node z must have all m threads as-
signed.
APPENDIX B
WCETO EXAMPLE
The ILP objective function Ω, is the sum of the contributions
of the CFRs of the CFRG given an assignment of threads
per node. Figure 13 illustrates the source of each node’s
contribution for four threads (m = 4). It reuses the structure of
Figure 9 with detailed memory and execution demand values
that are closer to those found in the evaluation.
Fig. 13: CFRG Individual Nodes and ILP Objective
When completed, the ILP determines the WCETO bound
is 2,680 cycles. Understanding how the bound is calculated is
made easier by considering the memory demand indepedently
of the execution demand. For n0, there is no decision four
threads are assigned. The memory demand for n0 is 400
cycles, and 40 cycles per thread for 560 cycles total.
There are no decisions to be made for n1 or n3, the number
of threads assigned to them are determined by the structure
of the graph. For threads to be assigned to n4 and n5, their
combined execution and memory demand must be compared
to n2. For one thread, n2 has a total demand of 610 cycles.
For one thread, the combined demand for n4 and n5 is 710
cycles (the demand for interior nodes of n5 is 10 cycles per
thread. Though not explicitly listed in the figure, this is the
reason
∑ ◦
ω = 20 = 2 · 10).
The execution demand for a second thread (or third) of n2 is
110 cycles, and the combined execution demand for a second
thread of n4 and n5 is also 110 cycles. Any assignment where
t2, t4, and t5 are greater than or equal to one will result in
the same WCETO value. The assignment in Figure 13 has
balanced the threads across paths.
Table 14 lists the benchmarks evaluated
12
APPENDIX C
TECHNICAL REPORT ON WCETO RESULTS
From the MRTC benchmark suite, 18 of the 27 tests
were evaluated for their WCET and run-time performance.
Benchmarks are treated as one code segment with m requests
for execution per job release. The classical perspective treats
each execution as a job of a distinct task, while BUNDLEP
views them as threads.
From the classical perspective jobs are scheduled serially,
one after another. This is compared to BUNDLEP’s thread level
scheduling using two metrics. The first ∆ω is the difference
between the WCET of m serial threads and BUNDLEP’s
WCETO. When ∆ω is positive, BUNDLEP provides an an-
alytical benefit. The second metric is ∆B , the difference in
the number of cycles required to complete m threads serially
versus BUNDLEP’s scheduling mechanism. A positive ∆B
value indicates a shorter run-time under BUNDLEP.
Names of each of the benchmarks evaluated are listed in
Figure 14. Starting on the next page are the graphical results.
Two graphs are given for each benchmark, one to illustrate the
WCETO benefit (∆ω), and a second for the run-time benefit
(∆B).
bs bsort100
crc expint
fft insertsort
jfdctint lcdnum
matmult minver
ns nsichneu
qurt select
simple sqrt
statemate ud
Fig. 14: Benchmarks of MRTC
For each benchmark, altering the architecture parameters
has the following anticipated effects on the performance
of BUNDLEP. In general, as the number of cache lines is
increased, the analytical and run-time benefit of BUNDLEP
is expected to decrease; since the shared resource BUNDLEP
benefits from becomes less scarce. As the number of threads
increase, the analytical and run-time benefit of BUNDLE will
increase since the use of the cache increases. As the ratio of
cache block reload time to instruction execution time B : I
increases so should BUNDLEP’s performance.
For many of the benchmarks BUNDLEP performs as ex-
pected when varying cache sizes, threads, and CPI. Exam-
ples include: bs, bsort100, crc, insertsort, select,
qurt, and ud. A highlight of the analysis and run-time benefit
is the select benchmark. In the best case BUNDLEP reduces
the WCETO by roughly fifty percent, from the serial bound
of nearly four million close to two million. At run-time, the
observed execution time is roughly two-thirds of the serial
version.
For a few of the benchmarks: matmult, expint, and
ns, BUNDLEP analysis is worse for almost every architecture
and thread configuration. For each of these benchmarks the
run-time behavior favors BUNDLEP only slightly better than
the WCETO analysis. Examining the structure of the CFRG
for these benchmarks we find embedded loops with small
CFRs. Thus, the benefit of cache sharing is reduced and the
penalty of bundle-level and thread-level context switch costs
are relatively high.
In terms of analysis, the minver benchmark has the most
inconsistent results. For instance, comparing the configurations
of (100:1, 16) and (100:10, 16) as the number of threads
increases the performance of BUNDLEP increases for the
former. Yet, in the latter performance decreases, then increases.
This benchmark lies on the cusp of analytical benefit and
will receive further investigation, in part motivated by the
consistent run-time benefit.
We find merit in BUNDLEP’s scheduling and analytical ap-
proach. Under the best circumstances (100:1, 8) and m = 16,
BUNDLEP provides an analytical benefit for 16 of the 18
benchmarks, and a run-time benefit for 18. For the least
favorable configuration (100:10, 32) and m = 2, BUNDLEP
provides an analytical benefit for 9 of 18, and a run-time
benefit for 8 of the 18 benchmarks.
13
−5000
0
5000
10000
15000
20000
25000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark bs
m = 1
2
4
8
16
−2000
0
2000
4000
6000
8000
10000
12000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark bs
m = 1
2
4
8
16
−2 × 107
0
2 × 107
4 × 107
6 × 107
8 × 107
1 × 108
1.2 × 108
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark bsort100
m = 1
2
4
8
16
−5 × 106
0
5 × 106
1 × 107
1.5 × 107
2 × 107
2.5 × 107
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark bsort100
m = 1
2
4
8
16
−5 × 106
0
5 × 106
1 × 107
1.5 × 107
2 × 107
2.5 × 107
3 × 107
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark crc
m = 1
2
4
8
16
−500000
0
500000
1 × 106
1.5 × 106
2 × 106
2.5 × 106
3 × 106
3.5 × 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark crc
m = 1
2
4
8
16
14
−2.5 × 106
−2 × 106
−1.5 × 106
−1 × 106
−500000
0
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark expint
m = 1
2
4
8
16
−100000
0
100000
200000
300000
400000
500000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark expint
m = 1
2
4
8
16
−5 × 106
0
5 × 106
1 × 107
1.5 × 107
2 × 107
2.5 × 107
3 × 107
3.5 × 107
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark fft
m = 1
2
4
8
16
0
200000
400000
600000
800000
1× 106
1.2× 106
1.4× 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark fft
m = 1
2
4
8
16
−100000
−50000
0
50000
100000
150000
200000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark insertsort
m = 1
2
4
8
16
−10000
−5000
0
5000
10000
15000
20000
25000
30000
35000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark insertsort
m = 1
2
4
8
16
15
−200000
0
200000
400000
600000
800000
1 × 106
1.2 × 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark jfdctint
m = 1
2
4
8
16
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
1× 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark jfdctint
m = 1
2
4
8
16
−50000
0
50000
100000
150000
200000
250000
300000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark lcdnum
m = 1
2
4
8
16
−4000−2000
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark lcdnum
m = 1
2
4
8
16
−8 × 106
−7 × 106
−6 × 106
−5 × 106
−4 × 106
−3 × 106
−2 × 106
−1 × 106
0
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark matmult
m = 1
2
4
8
16
−600000−400000
−2000000
200000
400000
600000
800000
1 × 1061.2 × 10
6
1.4 × 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark matmult
m = 1
2
4
8
16
16
−100000−50000
0
50000
100000
150000
200000
250000
300000
350000
400000
450000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark minver
m = 1
2
4
8
16
0
50000
100000
150000
200000
250000
300000
350000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark minver
m = 1
2
4
8
16
−6 × 106
−5 × 106
−4 × 106
−3 × 106
−2 × 106
−1 × 106
0
1 × 106
2 × 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark ns
m = 1
2
4
8
16
−200000
0
200000
400000
600000
800000
1 × 106
1.2 × 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark ns
m = 1
2
4
8
16
−5000000
500000
1 × 1061.5 × 10
6
2 × 1062.5 × 10
6
3 × 1063.5 × 10
6
4 × 1064.5 × 10
6
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark nsichneu
m = 1
2
4
8
16
0
500000
1× 106
1.5× 106
2× 106
2.5× 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark nsichneu
m = 1
2
4
8
16
17
0
500000
1× 106
1.5× 106
2× 106
2.5× 106
3× 106
3.5× 106
4× 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark qurt
m = 1
2
4
8
16
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
200000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark qurt
m = 1
2
4
8
16
−2000000
200000
400000
600000
800000
1 × 1061.2 × 10
6
1.4 × 1061.6 × 10
6
1.8 × 1062 × 10
6
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark select
m = 1
2
4
8
16
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark select
m = 1
2
4
8
16
0
1000
2000
3000
4000
5000
6000
7000
8000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark simple
m = 1
2
4
8
16
0
500
1000
1500
2000
2500
3000
3500
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark simple
m = 1
2
4
8
16
18
−100000
0
100000
200000
300000
400000
500000
600000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark sqrt
m = 1
2
4
8
16
−100000
10000
20000
30000
40000
50000
60000
70000
80000
90000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark sqrt
m = 1
2
4
8
16
−500000
0
500000
1 × 106
1.5 × 106
2 × 106
2.5 × 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark statemate
m = 1
2
4
8
16
0
50000
100000
150000
200000
250000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark statemate
m = 1
2
4
8
16
−1 × 106−500000
0
500000
1 × 1061.5 × 10
6
2 × 1062.5 × 10
6
3 × 1063.5 × 10
6
4 × 106
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆ω for Benchmark ud
m = 1
2
4
8
16
−100000
0
100000
200000
300000
400000
500000
600000
700000
800000
(10
0:1
, 8)
(10
0:1
0, 8
)
(10
0:1
, 16
)
(10
0:1
0, 1
6)
(10
0:1
, 32
)
(10
0:1
0, 3
2)
C
y
cl
es
Architecture (B:I, `)
∆B for Benchmark ud
m = 1
2
4
8
16
19
