Abstract-One of the primary sources of unpredictability in modern multi-core embedded systems is contention over shared memory resources, such as caches, interconnects, and DRAM. Despite significant achievements in the design and analysis of multi-core systems, there is a need for a theoretical framework that can be used to reason about the worst-case behavior of real-time workload when both processors and memory resources are subject to scheduling decisions.
I. INTRODUCTION
Over the last decade, multi-core systems have rapidly increased in popularity and they are now the de facto standard in the embedded computing industry. Multi-core systems are significantly more challenging to analyze than their single-core counterparts due to the extensive sharing of hardware resources among logically independent execution flows. The primary source of performance unpredictability in this class of systems can be identified as the memory hierarchy. In fact, the memory hierarchy in multi-core platforms comprises a number of components that are concurrently accessed by multiple cores. These include: multi-level CPU caches, shared memory controllers and DRAM banks, and shared I/O devices. The interplay of accesses originated by multiple cores has a direct impact on the timing of subsequent memory accesses. The resulting temporal variability is in the range of multiple orders of magnitude, meaning that inaccurate performance modeling and analysis can lead to overly pessimistic worst-case execution time (WCET) estimates.
Despite the remarkable achievements in the analysis of hard real-time workload on multi-core systems, there is a fundamental lack of self-contained theoretical frameworks that can be used to reason on the schedulability of a generic multi-core hard real-time workload when both CPU and memory resources are subject to scheduling decisions. In fact, while consolidated techniques are used to reason about CPU scheduling, comparatively less general results are available to reason on memory scheduling. An even slimmer body of works has provided general results to reason on co-scheduling of CPU and memory. The majority of works in this area assume fixed assignment of memory resources to CPUs.
Memory Scheduling: there are two dimensions to the problem of assigning memory resources to applications. The first dimension is space scheduling, concerning the allocation over time of memory space (e.g., cache lines, DRAM banks, scratchpad pages). A second dimension is temporal scheduling, i.e., scheduling of access to a shared memory interface (e.g., an interconnect, a bus, or a memory controller). In this paper, we focus on the temporal dimension of memory scheduling. In a nutshell, memory interfaces/subsystems are associated with a characteristic sustainable bandwidth that can be partitioned among the CPUs of a multi-core system. If the bandwidth-to-cores assignment is determined offline and remains unchanged over time, we say that memory bandwidth is statically partitioned. Conversely, if bandwidth is dynamically assigned to cores, we say that memory bandwidth is subject to scheduling. We hereafter interchangeably use the terms "memory scheduling", "memory bandwidth scheduling", or "bandwidth partitioning" referring to the same concept.
Since memory bandwidth is constrained and often represents a bottleneck in multi-core systems, memory scheduling is an important dimension to consider and a way to achieve important real-time performance improvements. Clearly, if the memory bandwidth assigned to each core changes over time, this will have an effect on the response time of tasks. In this case, how can the worst-case response time be calculated? In this paper, we address this question. More specifically, we study the problem of determining the worst-case response time for a task that spans a sequence of time intervals, each with a different bandwidth-to-cores assignment.
In this paper, we make the following contributions:
1) we improve response time calculation under static (over time) but arbitrary (across cores) bandwidth partitioning; 2) we provide a general framework to perform response time analysis under dynamic bandwidth partitioning. Our approach can be used to analyze memory schedulers as long as: (i) changes in bandwidth-to-cores allocation are time-triggered; or (ii) a critical instant can be found for the possible CPU-to-tasks and bandwidth-to-cores scheduling decisions; 3) we demonstrate how the proposed analysis technique can be used in a time-triggered memory scheduling scenario, for an Integrated Modular Avionics (IMA) system. In particular, we show that dynamic bandwidth allocation significantly outperforms static allocation in the presence of varying memory-intensive workload.
II. BACKGROUND
A memory interface is characterized by a maximum guaranteed bandwidth. It is generally easy to analyze the temporal behavior of memory requests when the interface operates below the maximum guaranteed bandwidth [1]. Conversely, if the rate of memory requests exceeds such a threshold, the behavior of the memory subsystem can be hard to analyze, or lead to overly pessimistic worst-case estimates [2]. In multi-core systems, however, the available memory bandwidth can be arbitrarily distributed among cores. Take a 2-core system for instance, as depicted in Figure 1. Workload on the two cores can be either CPU-intensive (blue), or memory-intensive (red). For simplicity, the figure assumes that CPU-intensive workload is unaffected by changes in memory bandwidth (BW) assignment. Conversely, memory-intensive workload is roughly linearly affected by it. An even assignment as depicted in Figure 1(a) would provide 50% of the available memory bandwidth to each core. Even partitioning is not flexible: mostly memory-intensive workload is deployed on Core A, while mostly CPU-intensive workload is scheduled on Core B. As such, workload is penalized on Core A while memory bandwidth is wasted on Core B. Under this setup, the memory-intensive workloads on Core A and Core B take 5 and 3 time units to complete, respectively. The overall utilization is 95%.
If Core A is known to run memory-intensive tasks while Core B mostly processes CPU-intensive workload, it is beneficial to perform an uneven assignment, e.g., 80% and 20% of the available bandwidth assigned to Core A and B, respectively. This is depicted in Figure 1(b). In this case, bandwidth can be distributed to better meet the CPU/memory needs of workload on the various cores. The memory-intensive workload on Core A can benefit from this assignment, now completing in 2 time units. However, the (shorter) memory-intensive workload on Core B is negatively affected, completing in 4 time units. Overall utilization decreases to 85% in our example.
In both uneven and even partitioning, memory bandwidth allocation is statically decided at design time, i.e., it does not change over time. The workload on each core, however, can undergo variations in terms of memory requirements. This is often the case as more/less memory-intensive tasks (or partitions) are scheduled on each core. As such, it is natural to consider a scheme where the bandwidth-to-cores assignment is varied over time. In this case, we talk of dynamic bandwidth partitioning, i.e., memory scheduling. Figure 1(c) depicts one such example. Here, when only one core is executing memory-intensive workload, it is given 80% of the available bandwidth; when both are executing the same type of workload, the bandwidth is evenly distributed. Under the new scheme, the system operates at 80% utilization. In general, dynamic bandwidth assignment can yield significant performance improvement, because it is possible to produce an assignment scheme that follows the memory requirements of scheduled workload over time.
In the next sections, we address the problem of computing the time it takes in the worst-case to complete execution of workload that: (i) has known memory and CPU requirements; and (ii) spans over an arbitrary and known sequence of bandwidth assignments. 
III. SYSTEM MODEL AND ASSUMPTIONS
We hereby discuss the assumptions considered in our work. We also provide the basic terminology and notation required to present our results.
Multi-core Model: in this work, we assume a homogeneous multi-core system with m cores. We use the index i to refer to any of the m cores, i.e., $i \in \{1, \dots, m\}$. We make no assumption on the cache hierarchy, as we focus on the behavior of tasks with respect to main memory accesses. We only assume that hits in last-level cache (LLC) do not generate main memory traffic. Main memory transactions have fixed size, typically one cache line, indicated with $L_{size}$. We assume that access to main memory is granted to cores/processors following a round-robin scheme. We assume that the time to perform a single memory transaction is bounded in the interval $[L_{min}, L_{max}]$. We do not require all memory transactions to be of a fixed size, but assume that, in the worst case, all transactions have the maximum size. As an additional simplification, we assume no transaction parallelism, meaning that $L_{max}$ is also the maximum amount of interference that a given core can suffer due to an active memory transaction directed to a different core. No re-ordering of requests originated by different cores occurs in the system. This behavior can be achieved in a traditional COTS DRAM setup by assigning private memory banks to cores [2]-[4]. With private banking, the available DRAM banks are partitioned among the available m cores. These assumptions make the considered model compatible with the work in [5]-[7].
Workload Model: we consider a partitioned system in which each task in a set of tasks is statically assigned to one of the m cores [8] . Since our focus is on the behavior of workload in memory, we abstract away the details of each task and only consider the "load", or "workload" in terms of CPU time and the number of memory transactions that need to be completed by a given deadline. The load can correspond to a single task instance, or to an entire busy-period. This is in line with the approach followed in [6] , [7] , [9] . Reasoning in terms of workload allows us to remain generic with respect to the exact task scheduling strategy used at the CPU. For instance, under preemptive rate-monotonic scheduling (RM), in order to analyze the schedulability of a task τ , one would consider the deadline-constrained "load" comprised by the execution (CPU time and memory transactions) of one instance of τ , as well as that of all the instances of interfering higher-priority tasks. The deadline of the workload will be the deadline of τ .
Without loss of generality, we model deadline-constrained workload on a core i under analysis using three parameters: $C_i$, $\mu_i$, and $D_i$. Here, $C_i$ represents the worst-case amount of time required for pure execution on the CPU (no memory). For ease of notation, we will always consider the worst-case execution time in slots of $L_{max}$ and indicate the latter with $E_i = C_i / L_{max}$. It must hold that $E_i > 0$. Next, $\mu_i$ represents the worst-case number of main memory transactions [10] to be completed by the relative deadline $D_i$. We often use $\beta_i = E_i + \mu_i$ as a shorthand notation for the overall CPU and memory requirement of the workload under analysis. We assume that new workload is always released synchronously with respect to regulation periods, and scheduling decisions (on both CPU and memory) are taken at the boundaries of regulation periods.
Memory Bandwidth Regulation Model: in order to unevenly partition the memory bandwidth across cores, a budget-based memory bandwidth regulation scheme is used, such as MemGuard [11]. In this regulation scheme, per-core bandwidth regulators use hardware-implemented performance counters to monitor the number of memory transactions performed by each core over a period of time P. For this reason, P takes the name of "regulation period". Note that the number of memory transactions over P is a measure of bandwidth. Since the maximum latency of a single memory transaction is $L_{max}$, in the worst case it is always possible to perform $Q = P / L_{max}$ memory transactions in P. Each core can then be assigned a different budget $q_i$, as long as $\sum_{i=1}^{m} q_i \le Q$. However, in order to fully utilize the already constrained memory bandwidth, we consider the case $\sum_{i=1}^{m} q_i = Q$ without loss of generality. The budget assigned to all the m cores forms a vector, namely $Q = \{q_1, \dots, q_m\}$.
The key idea of memory bandwidth regulation is the following. A core i is given a budget q i , which represents the number of memory transactions that core i is allowed to perform during a regulation period P . The budget is replenished to q i at time zero and at every instant k · P , with k ∈ N. During a regulation period, the core executes tasks normally, performing memory transactions as needed. A hardware performance counter monitors the number of memory transactions, decreasing the residual budget accordingly. If core i depletes its budget q i before the next replenishment, core i is stalled until the next replenishment. P is a system-wide parameter which should be smaller than the minimum task period in the system. P is often experimentally set to 1 ms [11] , [12] .
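The regulation mechanism described above can be illustrated with a minimal sketch. It is not MemGuard's implementation: the class `BudgetRegulator`, its methods, and the value used for `l_max_ns` are hypothetical names and numbers chosen only for illustration. The sketch assumes a regulation period of 1 ms and a worst-case transaction latency chosen so that Q = 16, which matches the budget vector {2, 2, 5, 7} used in later examples.

```python
from dataclasses import dataclass, field


def total_budget(period_ns: int, l_max_ns: int) -> int:
    """Q = P / L_max: transactions that always fit in one regulation period."""
    return period_ns // l_max_ns


@dataclass
class BudgetRegulator:
    """Toy per-core budget regulator (hypothetical model, not MemGuard itself)."""
    budgets: list[int]                      # q_i for each core
    residual: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.residual = list(self.budgets)

    def replenish(self) -> None:
        """Called at time zero and at every instant k * P: reset budgets to q_i."""
        self.residual = list(self.budgets)

    def try_access(self, core: int) -> bool:
        """Account for one memory transaction; False means the core is stalled."""
        if self.residual[core] == 0:
            return False
        self.residual[core] -= 1
        return True


# Assumed values: P = 1 ms, L_max = 62.5 us, hence Q = 16.
Q = total_budget(1_000_000, 62_500)
regulator = BudgetRegulator([2, 2, 5, 7])
assert sum(regulator.budgets) == Q
```

A core whose `try_access` call returns `False` stays stalled until the next call to `replenish`, which models the budget replenishment at the start of each regulation period.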
Memory Schedule: in this paper, we assume that the memory schedule is known or, if an online memory scheduling rule is used, that a critical instant can be found for the bandwidth-to-core assignment rule. Online memory scheduling opens a whole new set of questions that are out of the scope of this work, e.g., optimality, or existence of critical instants for memory schedulers. As depicted in Figure 1(c), a memory schedule $S = \{B_1, \dots, B_N\}$ is a time-ordered sequence of N memory budget assignment intervals $B_j$. Each $B_j$ is of the form $B_j = (Q_j, L_j)$, where $Q_j = \{q^j_1, \dots, q^j_m\}$ is the budget-to-cores assignment used in interval j, and $L_j$ is the length in regulation periods of interval j. For instance, the memory schedule in Figure 1
Workload Span: the goal of this work is to compute the maximum number of regulation periods required to execute the workload under analysis to completion. This goes under the name of span, and is defined below.
Definition 1 (Span): We define span as the number of regulation periods to entirely complete E i units of execution and µ i memory transactions for the considered workload. The span is indicated throughout the paper with the symbol W i .
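The span $W_i$ of the workload is distributed over the intervals of the memory schedule. We indicate with $W^j_i$ the number of regulation periods of the span that fall within interval $B_j$. Since $\sum_{k=1}^{j-1} L_k$ is the cumulative length of intervals preceding $B_j$, the execution over interval $B_j$ must thus be equal to:
$$W^j_i = \min\left(L_j,\ \max\left(0,\ W_i - \sum_{k=1}^{j-1} L_k\right)\right).$$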
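As a hedged illustration of how a span is divided among scheduling intervals, the sketch below assumes that the workload starts at the beginning of the memory schedule, so that the portion of the span falling in an interval is the interval length clipped by whatever span remains after the preceding intervals. The function name `split_span` and the interval lengths in the usage line are hypothetical.

```python
def split_span(span: int, lengths: list[int]) -> list[int]:
    """Return, for each interval B_j, the regulation periods of the span inside it."""
    portions = []
    elapsed = 0  # cumulative length of the intervals preceding B_j
    for length in lengths:
        portions.append(min(length, max(0, span - elapsed)))
        elapsed += length
    return portions


# Hypothetical schedule with interval lengths 2, 3 and 4 regulation periods.
print(split_span(6, [2, 3, 4]))  # [2, 3, 1]
```

The portions always sum to the span as long as the span does not exceed the total length of the schedule.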
IV. MEMORY STALL
A fundamental concept is the notion of memory stall. In general, a memory request originating from the core i under analysis can be "stalled" for two reasons. The first reason is that the hardware memory arbiter has prioritized one or more other cores over i for access to the memory subsystem (memory interference). The second reason is that the core under analysis has exhausted its budget and is stalled until the beginning of the next regulation period.
We make no assumption on the behavior of tasks in cores other than core i. It follows that the maximum stall that can be suffered by a memory transaction on core i depends only on the number of memory transactions performed by i in the same regulation period. This is exemplified in Figure 2 , where i = 4 and Q = {2, 2, 5, 7}. The budget for Core 4 is q 4 = 7. It follows that there are 8 possible worst-case scenarios, denoted as (a)-(h) in the figure. In general, there are always q i + 1 possible cases. In the figure, pure execution is represented with "e" and is considered in slots of length L max , as mentioned in Section III.
The pattern in Figure 2 (a) depicts the worst-case memory interference from other cores (cores 1 to 3) provided that core 4 performs zero memory accesses within the regulation period. A more interesting case is Figure 2 (e). Here, Core 4 performs 4 memory accesses. For the first two memory accesses, since three cores (1, 2 and 3) can cause stall, 3 units of stall are accumulated per memory access. If we depict the stall as a curve, then the "slope" of the stall introduced by the first two accesses is 3. After the first two accesses, cores 1 and 2 are temporarily stopped due to regulation -they have exhausted their respective budgets. Core 3, however, can still cause stall on transactions from Core 4 under analysis. Hence, the stall slope for the 3 rd and 4 th transactions is 1. A more efficient way to visualize the possible stall scenarios is by plotting the per-period memory transactions and resulting stall. The q i + 1 possible cases represent a discrete domain. A corresponding continuous curve for the memory stall can be derived by "connecting" these discrete points. Call r i ∈ R ≥0 the number of memory transactions per regulation period being performed. We introduce the notion of memory rate.
Definition 2 (Memory rate): We define as memory rate the number of memory transactions performed per regulation period. Memory rates are indicated throughout the paper with the symbol r i .
A memory rate is often used to indicate the rate at which a total number of memory transactions $\mu_i$ is performed over the span of the considered workload $W_i$. Hence, $r_i = \mu_i / W_i$.
We can also indicate the number of memory transactions performed during a specific interval $B_j$ as $\mu^j_i$, with corresponding memory rate $r^j_i = \mu^j_i / W^j_i$.
A. Memory-stall Curves
The memory-stall curve for a core i represents the cumulative maximum interference-induced stall for a given memory rate r and is denoted as $I(r)_i$. Consider the same setup used for Figure 2. The memory-stall curves for cores 3 and 4 are provided in Figure 3. Considering Core 4, we have already discussed how the first two transactions introduce stall at a "slope" of 3. This is reflected in the $I(r)_4$ curve, which has slope 3 for $r \in\, ]0, 2[$. For clarity, let us construct the memory-stall curve for Core 3. The y-axis represents the cumulative maximum stall $I(r)_3$ that can be experienced by workload on Core 3 with a memory rate r (x-axis). Workload on Core 3 can perform 0, 1, 2, 3, 4 or 5 memory transactions in a regulation period. The first step is to compute the maximum stall in each of these cases. If the workload does not perform any memory transaction (r = 0) in a regulation period, then it will experience no stall, i.e., $I(0)_3 = 0$. When r = 1, it can be stalled by a maximum of 1 memory transaction from each of the other $m - 1 = 3$ cores, resulting in $I(1)_3 = 3$. Similarly, for all values of r up to $r = \min(q_1, \dots, q_m)$, the maximum stall is $I(r)_3 = (m - 1) \cdot r$, hence $I(2)_3 = 3 \cdot 2 = 6$. When Core 3 performs an additional memory transaction, i.e., r = 3, it can only be stalled by Core 4, since cores 1 and 2 have been regulated after their second memory access. Thus, the cumulative stall is $I(3)_3 = I(2)_3 + 1 \cdot 1 = 7$. Similarly, for r = 4, $I(4)_3 = I(3)_3 + 1 \cdot 1 = 8$. Finally, for $r = 5 = q_3$, Core 3 is regulated. Here the maximum cumulative stall is $I(5)_3 = Q - q_3 = 16 - 5 = 11$. The memory-stall curve $I(r)_3$ is obtained by connecting the discrete values of $I(k)_3$, $k \in \{0, \dots, 5\}$, calculated so far.
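Generalizing the example provided above, for any fixed budget Q we can define the stall curve $I(r)_i$ at the integer memory rates $k \in \{0, \dots, q_i\}$ as follows:
$$I(k)_i = \begin{cases} \sum_{l \neq i} \min(k, q_l) & \text{if } k < q_i \\ Q - q_i & \text{if } k = q_i \end{cases}$$
For non-integer values of r, $I(r)_i$ is obtained by connecting consecutive integer points with straight segments.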
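The construction of the discrete stall values can be sketched in a few lines. This is an illustration under the assumptions stated in the worked example: below its own budget, a core is delayed by at most one transaction from every other core that still has budget left, and a core that exhausts its budget loses the remainder of the regulation period. The function name `stall_points` and the 0-based core indexing are choices made for this sketch only.

```python
def stall_points(budgets: list[int], core: int) -> list[int]:
    """Return [I(0), ..., I(q_i)] for the given core (0-based index)."""
    q_i = budgets[core]
    total = sum(budgets)  # Q, assuming the budgets add up to Q
    others = [q for idx, q in enumerate(budgets) if idx != core]
    # k < q_i: every other core can interfere at most min(k, q_l) times.
    points = [sum(min(k, q) for q in others) for k in range(q_i)]
    # k = q_i: the core is regulated and waits out the other cores' budgets.
    points.append(total - q_i)
    return points


budgets = [2, 2, 5, 7]
print(stall_points(budgets, 2))  # Core 3: [0, 3, 6, 7, 8, 11]
print(stall_points(budgets, 3))  # Core 4: [0, 3, 6, 7, 8, 9, 9, 9]
```

The values for Core 3 reproduce the ones derived step by step in the text, including the jump to 11 when the core is regulated.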
Since the budget assignment $Q_j$ changes at every scheduling interval $B_j$, a different curve $I(r)^j_i$ needs to be considered on each interval.
If the resulting curve is concave, then the memory-stall curve is already final. This is the case for $I(r)_4$ in Figure 3. Otherwise, a refinement step is necessary to produce the final curve. Specifically, we take the upper envelope of each of the convex segments to obtain a concave curve. The result of this step is depicted as $\bar{I}(r)_3$ in Figure 3.
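One possible way to carry out the refinement step is to compute the upper concave envelope of the discrete stall points with a monotone-chain scan. The sketch below is an assumption about how this could be done, not the paper's procedure; `concave_envelope` and `evaluate` are hypothetical names, and exact rational arithmetic is used to avoid rounding issues when the result feeds a ceiling operation later on.

```python
from fractions import Fraction


def concave_envelope(points: list[int]) -> list[tuple[int, int]]:
    """Vertices of the upper concave envelope of the points (k, points[k])."""
    hull: list[tuple[int, int]] = []
    for x, y in enumerate(points):
        # Drop the last vertex while it lies on or below the chord to (x, y).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (x - x1) <= (y - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull


def evaluate(hull: list[tuple[int, int]], r: Fraction) -> Fraction:
    """Evaluate the piecewise-linear envelope at memory rate r in [0, q]."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= r <= x2:
            return y1 + Fraction(y2 - y1, x2 - x1) * (r - x1)
    raise ValueError("memory rate outside [0, q]")


core3 = concave_envelope([0, 3, 6, 7, 8, 11])
print(core3)                          # [(0, 0), (2, 6), (5, 11)]
print(evaluate(core3, Fraction(3)))   # 23/3
```

For Core 4 the input points are already concave, so the envelope only removes collinear points and leaves the curve unchanged.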
Definition 3 (Stall rate): We define as stall rate the amount of memory stall $\bar{I}(r)_i$ suffered per regulation period with a memory rate r. When considering multiple intervals, $\bar{I}(r)^j_i$ is the stall rate for core i on interval $B_j$.
If the span over $B_j$ is $W^j_i$ and $\mu^j_i$ transactions are performed in the interval, we can compute the worst-case total stall $S^j_i$ over $B_j$ as:
$$S^j_i = \bar{I}\left(\mu^j_i / W^j_i\right)^j_i \cdot W^j_i. \tag{3}$$
It follows that the total stall is $S_i = \sum_{j=1}^{N} S^j_i$. If the maximum memory stall that can be suffered by the workload under analysis can be derived, then the worst-case amount of time (in multiples of $L_{max}$) required to complete the considered workload is $W_i \cdot Q = \beta_i + S_i$. The rest of the paper is concerned with the calculation of the maximum total stall, and hence span, over a generic memory schedule $S = \{B_1, \dots, B_N\}$. For a fixed budget Q, a given memory rate $r_i$ and span $W_i$, Lemma 1 guarantees that computing $S_i$ according to Equation 3 always results in an upper bound on the maximum possible memory stall.
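Lemma 1: $\bar{I}(\mu_i / W_i)_i \cdot W_i$ is an upper bound to the cumulative stall suffered by a workload on core i that performs $\mu_i$ memory accesses over $W_i$ regulation periods, with $\mu_i \le W_i \cdot q_i$.
Proof: In each of the $W_i$ regulation periods, a number of memory accesses between 0 and $q_i$ could have been performed; hence, note that we cannot have $\mu_i > W_i \cdot q_i$. Let us indicate with $a_k$ the number of periods in which k memory accesses were performed. It must hold that $\sum_{k=0}^{q_i} a_k = W_i$. We can then write:
$$\mu_i = \sum_{k=0}^{q_i} k \cdot a_k.$$
The cumulative stall suffered over $W_i$ can be computed as:
$$\sum_{k=0}^{q_i} I(k)_i \cdot a_k.$$
Consider now computing the stall rate as $\bar{I}(\mu_i / W_i)_i$. First, by $I(r) \le \bar{I}(r)$ we have:
$$\sum_{k=0}^{q_i} I(k)_i \cdot a_k \le \sum_{k=0}^{q_i} \bar{I}(k)_i \cdot a_k.$$
Next recall that by definition of concavity for a generic function f(x), it must hold that:
$$\sum_{k} \alpha_k f(x_k) \le f\left(\sum_{k} \alpha_k x_k\right) \quad \text{for any } \alpha_k \ge 0 \text{ with } \sum_{k} \alpha_k = 1.$$
Hence, we can write:
$$\sum_{k=0}^{q_i} \bar{I}(k)_i \cdot a_k = W_i \cdot \sum_{k=0}^{q_i} \frac{a_k}{W_i} \bar{I}(k)_i \le W_i \cdot \bar{I}\left(\sum_{k=0}^{q_i} \frac{k \cdot a_k}{W_i}\right)_i = \bar{I}(\mu_i / W_i)_i \cdot W_i.$$
This implies that $\bar{I}(\mu_i / W_i)_i \cdot W_i$ is an upper bound to the cumulative stall $\sum_{k=0}^{q_i} I(k)_i \cdot a_k$ suffered by the workload for any pattern of memory accesses over $W_i$ periods, concluding the proof.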
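The statement of Lemma 1 can be checked numerically on the Core 3 example by enumerating every possible pattern of per-period accesses over a small number of regulation periods. The sketch below is only a sanity check, not a proof. The stall values are the ones derived for Core 3 in the text; the envelope vertices are an assumption obtained by taking the upper concave envelope of those values, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

I_POINTS = [0, 3, 6, 7, 8, 11]        # I(k) for Core 3, k = 0..5
HULL = [(0, 0), (2, 6), (5, 11)]      # assumed vertices of the concave envelope


def envelope(r: Fraction) -> Fraction:
    """Evaluate the concave envelope at memory rate r in [0, q]."""
    for (x1, y1), (x2, y2) in zip(HULL, HULL[1:]):
        if x1 <= r <= x2:
            return y1 + Fraction(y2 - y1, x2 - x1) * (r - x1)
    raise ValueError("memory rate outside [0, q]")


def lemma1_holds(periods: int) -> bool:
    """Check stall <= envelope(mu / W) * W for every access pattern over W periods."""
    q = len(I_POINTS) - 1
    for pattern in product(range(q + 1), repeat=periods):
        mu = sum(pattern)
        actual = sum(I_POINTS[k] for k in pattern)
        bound = envelope(Fraction(mu, periods)) * periods
        if actual > bound:
            return False
    return True


print(lemma1_holds(3))  # True
```

The bound is tight for some patterns: performing 5, 2 and 2 accesses in three periods yields a stall of 11 + 6 + 6 = 23, which equals the bound obtained at rate 3.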
V. WCET UNDER STATIC MEMORY BUDGET
In this section, we present a fixed-point iterative algorithm to compute the worst-case length of the workload on a core under analysis i under static memory budget Q. This is useful to understand the basic mechanisms to compute the span over a generic single memory scheduling interval.
In each iteration, the algorithm recomputes the maximum stall and thereby, the workload span, based on the workload span from the previous iteration (except for the base iteration) and the corresponding memory schedule. The key intuition behind iterative recomputation is that the increase in workload span in an iteration is likely to increase the maximum stall in the consecutive iteration due to a different worst-case distribution of memory requests across (a) different per memory-stall curves and/or (b) different memory scheduling intervals.
In the rest of the paper, we will always focus on the generic core under analysis. As such, we will drop the index i from all the notation introduced so far, unless required to resolve an ambiguity. Since we will be introducing a series of iterations of the algorithm, we subscript the iteration number (e.g., (k)) in the notation introduced so far.
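Iterative Algorithm: the span of the workload is computed through the following iteration:
$$W_{(0)} = \left\lceil \frac{\beta}{Q} \right\rceil, \qquad W_{(k)} = \left\lceil \frac{\beta + \bar{I}\left(\min\left(\mu / W_{(k-1)},\ q\right)\right) \cdot W_{(k-1)}}{Q} \right\rceil \tag{10}$$
where the iteration continues until convergence with $W_{(k)} = W_{(k-1)}$, or until $W_{(k)} \cdot Q \cdot L_{max} > D$. In the latter case, the workload is not schedulable. Since $\bar{I}(r)$ is only defined for $r \in [0, q]$, the term $\bar{I}(\min(\mu / W_{(k-1)}, q))$ ensures that the function is never evaluated on a value outside its domain.
Theorem 1: The iteration in Equation 10 terminates in a finite number of steps by either obtaining a value $W_{(k)} \cdot Q \cdot L_{max} > D$, or by converging, in which case $W_{(k)}$ is an upper bound on the span of the workload on the core under analysis.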
Proof Sketch: Notice that we omit the proof for Theorem 1 here, as it is a corollary of the more general Theorem 2. As such, the proof is provided in the Appendix.
For ease of explanation, Section V illustrates how to apply the algorithm in a specific instance. Subsequently, Section VI presents the generic algorithm.
Example of WCET over Static Budget: consider the static budget Q = {2, 2, 5, 7}. Let us now compute the span W of the workload with E = 40 and µ = 35 (i.e. β = 75) executing on Core 3. For simplicity, we ignore the workload's deadline D and focus only on its length. Since workload on Core 3 is being analyzed, we consider the stall curveĪ(r) 3 in Figure 3 
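The following sketch runs a fixed-point iteration of this kind on the example parameters above (E = 40, µ = 35, Core 3, Q = 16). It rests on several assumptions that are not confirmed by the text: the iteration starts from the stall-free span ⌈β/Q⌉, each step recomputes the span as ⌈(β + Ī(min(µ/W, q)) · W)/Q⌉, and the envelope vertices are the upper concave envelope of the Core 3 stall values. The deadline is expressed through the hypothetical parameter `max_periods`, in regulation periods. The printed value is what this sketch produces, not a figure taken from the paper.

```python
from fractions import Fraction
from math import ceil

HULL_CORE3 = [(0, 0), (2, 6), (5, 11)]   # assumed envelope vertices for Core 3


def envelope(hull, r: Fraction) -> Fraction:
    """Evaluate a concave piecewise-linear stall curve at rate r in [0, q]."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= r <= x2:
            return y1 + Fraction(y2 - y1, x2 - x1) * (r - x1)
    raise ValueError("memory rate outside [0, q]")


def static_span(e: int, mu: int, q: int, total_q: int, hull, max_periods=None):
    """Iterate the span until it stops growing; None if it exceeds max_periods."""
    beta = e + mu                        # requires e > 0, so beta > 0
    w = ceil(Fraction(beta, total_q))    # base iteration: no stall
    while True:
        rate = min(Fraction(mu, w), Fraction(q))
        stall = envelope(hull, rate) * w
        w_next = ceil((beta + stall) / total_q)
        if w_next == w:
            return w
        if max_periods is not None and w_next > max_periods:
            return None
        w = w_next


print(static_span(40, 35, 5, 16, HULL_CORE3))  # 10 (iterates 5, 9, 10, 10)
```

Exact fractions are used on purpose: at the last step the quotient is exactly 10, and a floating-point rounding error just above 10 would push the ceiling to 11.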
VI. WCET UNDER DYNAMIC MEMORY BUDGET
In this section, we extend our analysis to the case of dynamic bandwidth assignment. In this case, the workload could span across one or more memory scheduling intervals $B_1, \dots, B_N$. Recall from Section III that each interval $B_j = (Q_j, L_j)$ is characterized by a budget-to-cores assignment $Q_j = \{q^j_1, \dots, q^j_m\}$ and by a length $L_j$ in regulation periods. Assume that the per-interval span terms $W^j_{(k)}$ at a given iteration k are known. Then, the challenge is to determine how to distribute the total µ memory transactions among the $B_1, \dots, B_N$ intervals in a way that maximizes the overall stall. A distribution of memory transactions simply means that we derive the quantities $\mu^j_{(k)}$ for each $B_j$ interval.
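Obviously, it must hold that $\sum_{j=1}^{N} \mu^j_{(k)} \le \mu$. Algorithm 1 formulates the optimization problem that determines the number $\mu^j_{(k)}$ of memory requests assigned to each interval $B_j$ at the k-th iteration to maximize the overall stall $S_{(k)}$. Then, in Section VI-B we show how to efficiently solve the optimization problem. Note that once $\mu^j_{(k)}$ has been determined, based on Lemma 1 the stall in interval $B_j$ can be upper-bounded as $S^j_{(k)} = \bar{I}(\mu^j_{(k)} / W^j_{(k)})^j \cdot W^j_{(k)}$. The span is then computed through the following iteration:
$$W_{(0)} = \left\lceil \frac{\beta}{Q} \right\rceil, \qquad W_{(k)} = \left\lceil \frac{\beta + S_{(k-1)}}{Q} \right\rceil \tag{11}$$
Note that at each iteration k > 0, the values $W^j_{(k-1)}$ follow from $W_{(k-1)}$ and the memory schedule, while the values $\mu^j_{(k-1)}$ and $S_{(k-1)}$ are computed using Algorithm 1. As in Section V, the iteration continues until convergence or until $W_{(k)} \cdot Q \cdot L_{max} > D$.
Algorithm 1: stall maximization at a given iteration k
maximize $\sum_{j=1}^{N} \bar{I}(\mu^j_{(k)} / W^j_{(k)})^j \cdot W^j_{(k)}$ /* objective function (Line 8) */
subject to (Lines 11-13):
$\mu^j_{(k)} \in \mathbb{N}$ for all j /* integer assignment */
$\mu^j_{(k)} \le q^j \cdot W^j_{(k)}$ for all j /* regulation constraint */
$\sum_{j=1}^{N} \mu^j_{(k)} \le \mu$ /* total requests constraint */
Example: Suppose we are analyzing the behavior of workload with E = 15 and µ = 25 on Core 3 under the memory schedule depicted in Figure 4. Assume that at a given step k we have $W_{(k)} = 6$. In this case we have W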
For this example, we have S 
A. Proof of Correctness
We now formally prove that Equation 11 computes a valid upper bound for the workload length in number of regulation periods. We begin with some helper lemmas: Lemma 2 shows that the value of $W_{(k)}$ increases monotonically, which is required for the iteration to terminate, while Lemma 3 shows that if the iteration converges, we are able to distribute all µ memory requests among the N memory scheduling intervals.
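Lemma 2: At each iteration step in Equation 11 it holds: $W_{(k)} \ge W_{(k-1)}$.
Proof: First note that functions $\bar{I}(r)^j$ are concave and $\bar{I}(0)^j = 0$. For any such function and positive constant µ, one can prove that $\bar{I}(\mu / x)^j \cdot x$ is monotonic non-decreasing in x > 0 (a formal proof is reported in Lemma 6 in the Appendix). The proof then proceeds by induction over the index k.
Base Case: Since we assume β > 0, we have $W_{(0)} > 0$. Furthermore, since by definition all $W^j_{(0)}$ terms are non-negative and functions $\bar{I}(r)^j$ have non-negative ranges, we have $S_{(0)} \ge 0$ and thus $W_{(1)} \ge W_{(0)}$.
Inductive Step: consider $k \ge 2$. By induction hypothesis, we have $W_{(k-1)} \ge W_{(k-2)}$, and therefore $W^j_{(k-1)} \ge W^j_{(k-2)}$ for every interval $B_j$.
Now consider Line 8 of Algorithm 1: since $\bar{I}(\mu / x)^j \cdot x$ is monotonic for x > 0, it must hold:
$$\sum_{j=1}^{N} \bar{I}\left(\mu^j_{(k-2)} / W^j_{(k-1)}\right)^j \cdot W^j_{(k-1)} \ \ge\ \sum_{j=1}^{N} \bar{I}\left(\mu^j_{(k-2)} / W^j_{(k-2)}\right)^j \cdot W^j_{(k-2)} = S_{(k-2)}.$$
In other words, when running Algorithm 1 at iteration k based on the values $W^j_{(k-1)}$, there exists an assignment of variables ($\mu^j_{(k-1)} = \mu^j_{(k-2)}$, i.e., the same assignment as the previous iteration) that results in a value of the objective function that is greater than or equal to the one at iteration k − 1. Furthermore, the assignment $\mu^j_{(k-1)} = \mu^j_{(k-2)}$ is feasible, in the sense that it satisfies the constraints at Lines 11-13 of the algorithm: note that $\mu^j_{(k-2)} \le q^j \cdot W^j_{(k-2)} \le q^j \cdot W^j_{(k-1)}$. Hence, given that the optimization problem is maximizing the objective function, it is guaranteed to find an assignment for variables $\mu^j_{(k-1)}$ such that $S_{(k-1)} \ge S_{(k-2)}$, which by Equation 11 implies $W_{(k)} \ge W_{(k-1)}$.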
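Lemma 3: If the iteration in Equation 11 converges to $W_{(k)}$, then $\sum_{j=1}^{N} q^j \cdot W^j_{(k)} \ge \mu$, i.e., all µ memory requests can be distributed among the N memory scheduling intervals.
Proof: By contradiction, assume $\sum_{j=1}^{N} q^j \cdot W^j_{(k)} < \mu$. Then the assignment $\mu^j_{(k)} = q^j \cdot W^j_{(k)}$ satisfies all the constraints of Algorithm 1 and, since the functions $\bar{I}(r)^j$ are non-decreasing, it maximizes the objective function.
Furthermore, note that we have $\bar{I}(q^j)^j = Q - q^j$, and thus $S_{(k)} = \sum_{j=1}^{N} (Q - q^j) \cdot W^j_{(k)} = Q \cdot W_{(k)} - \sum_{j=1}^{N} q^j \cdot W^j_{(k)}$.
Finally, given E > 0 and based on Equation 11 at convergence, we derive:
$$W_{(k)} \cdot Q \ \ge\ \beta + S_{(k)} = E + \mu + Q \cdot W_{(k)} - \sum_{j=1}^{N} q^j \cdot W^j_{(k)} \ >\ Q \cdot W_{(k)},$$
which is a contradiction.
Theorem 2: The iteration in Equation 11 terminates in a finite number of steps by either obtaining a value $W_{(k)} \cdot Q \cdot L_{max} > D$ or converging, in which case $W_{(k)}$ is an upper bound to the worst-case span of the workload on the core under analysis.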
Proof: We first show that the algorithm terminates. By Lemma 2, W (k) ≥ W (k−1) . Since W (k) is a natural number, it follows that the algorithm must either converge or terminate with a value of W (k) greater than the deadline in a finite number of steps.
Hence, assume that the algorithm converges to $W_{(k)}$. Based on Lemma 3, we can find an assignment to variables $\mu^j_{(k)}$ such that $\sum_{j=1}^{N} \mu^j_{(k)} = \mu$. For any such assignment, by Lemma 1, $\bar{I}(\mu^j_{(k)} / W^j_{(k)})^j \cdot W^j_{(k)}$ is an upper bound to the stall when performing $\mu^j_{(k)}$ memory accesses over $W^j_{(k)}$ regulation periods. Now given that Algorithm 1 maximizes the objective function at Line 8 over all possible assignments to variables $\mu^j_{(k)}$, it follows that $S_{(k)}$ is an upper bound to the cumulative stall when performing µ memory accesses over $W_{(k)}$ regulation periods.
Finally, by definition, the worst-case length of the workload can be obtained (in number of slots) as the sum of β and the stall suffered by the workload. By convergence to $W_{(k)}$, we have:
$$W_{(k)} \cdot Q \ \ge\ \beta + S_{(k)},$$
and since $S_{(k)}$ is an upper bound to the stall suffered in $W_{(k)}$ regulation periods, this implies that $W_{(k)}$ is indeed an upper bound to the total span of the workload.
B. Implementing the Stall Algorithm
In this section, we show how to efficiently implement Algorithm 1. The algorithm is similar to a concave optimization problem, except that variables are integer rather than real.
By construction, each $\bar{I}(r)^j$ function is concave and piecewise linear, and can be thought of as a sequence of segments with decreasing slope. For each $\bar{I}(r)^j$ curve, consider the set of integer values of r corresponding to the beginning of a segment. Call each of these values a start point, and $\mathcal{E}^j$ the set of all the start points in $\bar{I}(r)^j$. Start points are highlighted with a solid dot (•) in Figure 4. Considering Core 3, for the example in the figure we have: $\mathcal{E}^1 = \{0, 2\}$, $\mathcal{E}^2 = \{0, 2, 3, 4\}$, and $\mathcal{E}^3 = \{0\}$. Using this formulation, we introduce two helper functions defined on $\mathcal{E}^j$ for $\bar{I}(r)^j$. First, the $next^j(r)$ function returns the next start point strictly greater than r, or the budget $q^j$ if no such start point exists:
$$next^j(r) = \min\{e \in \mathcal{E}^j \cup \{q^j\} : e > r\}.$$
Second, the function $slope^j(r)$ simply returns the slope of the segment at r. If r is a start point, the function returns the slope of the starting segment. Formally:
$$slope^j(r) = \frac{\bar{I}(next^j(r))^j - \bar{I}(r)^j}{next^j(r) - r}.$$
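The two helper functions can be sketched on a curve represented by its list of vertices. This representation is an assumption made for illustration: the vertices `(x, y)` are the segment start points plus the final point at the budget q, so the start-point set is every vertex except the last. Returning q when no further start point exists is also an assumption of this sketch, and the function names are hypothetical.

```python
from fractions import Fraction

Hull = list[tuple[int, int]]  # vertices (rate, stall); the last vertex is at rate q


def start_points(hull: Hull) -> list[int]:
    """Integer rates at which a segment of the curve begins."""
    return [x for x, _ in hull[:-1]]


def next_start(hull: Hull, r: Fraction) -> int:
    """Smallest start point strictly greater than r, or q if none exists (assumption)."""
    candidates = [x for x, _ in hull if x > r]
    if not candidates:
        raise ValueError("r must be strictly below the budget q")
    return min(candidates)


def slope(hull: Hull, r: Fraction) -> Fraction:
    """Slope of the segment containing r; at a start point, the segment starting there."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= r < x2:
            return Fraction(y2 - y1, x2 - x1)
    raise ValueError("r must be in [0, q)")


curve = [(0, 0), (2, 6), (5, 11)]        # assumed envelope for Core 3, budget {2, 2, 5, 7}
print(start_points(curve))               # [0, 2]
print(next_start(curve, Fraction(0)))    # 2
print(slope(curve, Fraction(2)))         # 5/3
```

With this curve the start-point set is {0, 2}, which agrees with the first set listed in the text for Core 3.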
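All the slopes are annotated in Figure 4 right above the corresponding segment. Algorithm 2 first initializes all variables $\mu^j_{(k)}$ to zero. Then, the algorithm iterates over Lines 9-19 until either (1) $\sum_{j=1}^{N} \mu^j_{(k)} = \mu$ holds, meaning that all µ memory transactions have been distributed among the N memory scheduling intervals; or (2) $\mu^j_{(k)} = q^j \cdot W^j_{(k)}$ for all intervals, meaning that we cannot assign any more memory transactions due to the regulation constraints. When the condition $\mu^j_{(k)} = q^j \cdot W^j_{(k)}$ holds for some interval $B_j$, we say that $B_j$ is saturated. The set of all the unsaturated intervals is computed at Line 11, and their respective memory rates $r^j = \mu^j_{(k)} / W^j_{(k)}$ given the current assignment $\mu^j_{(k)}$ are then derived. At Line 15, the algorithm selects the unsaturated interval with the largest $slope^j(r^j)$. At Line 17, the variable $\mu^j_{(k)}$ of the selected interval is increased by the minimum of two expressions: the number $\mu - \sum_{l=1}^{N} \mu^l_{(k)}$ of transactions still to be distributed, and the number $(next^j(r^j) - r^j) \cdot W^j_{(k)}$ of transactions that brings $r^j$ to the next start point. Since $next^j(r^j)$ and $r^j \cdot W^j_{(k)} = \mu^j_{(k)}$ are integer, the new value of $\mu^j_{(k)}$ is also integer. Furthermore, the new assignment cannot violate the constraints $\sum_{j=1}^{N} \mu^j_{(k)} \le \mu$ and $\mu^j_{(k)} \le q^j \cdot W^j_{(k)}$, since we use the minimum of the two expressions. Hence, this shows that the assignment to variables $\mu^j_{(k)}$ operated by Algorithm 2 is feasible according to the constraints at Lines 11-13 of Algorithm 1. Furthermore, note that Algorithm 2 is guaranteed to terminate after the assignment at Line 17 selects the first expression, or after all intervals have been saturated. The number of segment start points for each function $\bar{I}(r)^j$ is O(m); hence, the number of iterations of the algorithm is O(N · m).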
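A greedy distribution of this kind can be sketched as follows. This is an illustration under assumptions, not a transcription of Algorithm 2: each curve is given as a list of vertices ending at the budget q, the interval with the steepest current segment is served first, and each step assigns either all remaining transactions or just enough to reach the next start point. The function names and the second curve in the usage example (a hypothetical even budget {4, 4, 4, 4}) are not from the paper.

```python
from fractions import Fraction

Hull = list[tuple[int, int]]  # vertices (rate, stall); the last vertex is at rate q


def _slope(hull: Hull, r: Fraction) -> Fraction:
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= r < x2:
            return Fraction(y2 - y1, x2 - x1)
    raise ValueError("rate must be in [0, q)")


def _next(hull: Hull, r: Fraction) -> int:
    return min(x for x, _ in hull if x > r)


def _value(hull: Hull, r: Fraction) -> Fraction:
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= r <= x2:
            return y1 + Fraction(y2 - y1, x2 - x1) * (r - x1)
    raise ValueError("rate outside [0, q]")


def distribute(mu: int, spans: list[int], hulls: list[Hull]) -> list[int]:
    """Greedily assign up to mu transactions to intervals to maximize total stall."""
    assigned = [0] * len(spans)
    remaining = mu
    while remaining > 0:
        # An interval is saturated once it holds q * W transactions.
        open_ivs = [j for j, w in enumerate(spans)
                    if w > 0 and assigned[j] < hulls[j][-1][0] * w]
        if not open_ivs:
            break
        rates = {j: Fraction(assigned[j], spans[j]) for j in open_ivs}
        best = max(open_ivs, key=lambda j: _slope(hulls[j], rates[j]))
        to_next = (_next(hulls[best], rates[best]) - rates[best]) * spans[best]
        step = min(remaining, int(to_next))  # to_next is always an integer
        assigned[best] += step
        remaining -= step
    return assigned


def total_stall(assigned: list[int], spans: list[int], hulls: list[Hull]) -> Fraction:
    """Objective value: sum over intervals of envelope(mu_j / W_j) * W_j."""
    return sum((_value(h, Fraction(a, w)) * w
                for a, w, h in zip(assigned, spans, hulls) if w > 0), Fraction(0))


hulls = [[(0, 0), (2, 6), (5, 11)], [(0, 0), (4, 12)]]
print(distribute(12, [2, 3], hulls))  # [4, 8]
```

Because each per-interval term is concave and piecewise linear with integer breakpoints in the number of assigned transactions, always extending the steepest segment yields a maximum of the objective; ties between equal slopes can be broken arbitrarily.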
Finally, we show that once the algorithm terminates, the assignment to variables $\mu^j_{(k)}$ maximizes the cumulative stall $\sum_{j=1}^{N} \bar{I}(\mu^j_{(k)} / W^j_{(k)})^j \cdot W^j_{(k)}$, that is, the objective function in Algorithm 1. This follows from the way intervals are selected at Line 15. By contradiction, assume that there exists a different feasible assignment, call it $\{\hat{\mu}^j_{(k)}\}$, subtracting some number of memory transactions, say ∆, from one variable $\mu^j_{(k)}$ [...]
IMA systems rely on partitioned scheduling, where each partition is assigned, at compile time, a fixed start time and span in a major cycle, i.e., a hyperperiod (H). These partition-level scheduling decisions are stored at compile time, resulting in a static CPU schedule, which is repeated every major cycle.
Our analysis (Section VI) assumes a known memory budget assignment across cores and known workload parameters. IMA systems are therefore a natural fit and represent a real-world scenario. We consider a set of IMA partitions with a fixed major cycle and a fixed assignment of partitions to cores. For simplicity, we assume that the order of execution of the partitions is known, that each partition executes once in the major cycle, and that the major cycle is synchronized among cores. Our goal is to use the analysis from Section VI in an empirical evaluation that compares the schedulability ratio, i.e., the ratio of schedulable tasksets to generated tasksets, under a dynamic memory budget assignment policy against that under static budget assignment policies, for a fixed partition execution order on each core.
In the following subsections, we describe the setup used to compare the budget assignment policies and two sets of experiments: the first varies the number of cores, and the second varies the number of memory-intensive partitions in the system.
A. Setup
IMA Partition Set Generation: For each experiment run, we consider m cores and a set of 4 × m IMA partitions, with a fixed major cycle, i.e., hyperperiod (H), of 128 ms. The earliest start time of each partition is set to t = 0 and its deadline to the hyperperiod, i.e., 128 ms. From the perspective of the analysis (Section VI), each partition is a workload.
We characterize the varying memory demand between partitions, as exhibited by avionic applications [13], using a parameter called memory intensity (MI). MI is the ratio of the pure memory demand of a partition to the sum of its pure processing demand and pure memory demand in the single-core case, i.e., with no contention. We then use a bi-modal distribution for MI, where each partition is either in a HIGH MI mode or in a LOW MI mode. The use of two modes is motivated by two observations: it is consistent with the memory intensity behavior exhibited by partitions in a real avionic application [13], and some partitions perform I/O activity that is memory-intensive. All HIGH MI mode partitions are randomly assigned an MI value in the range [0.5, 0.99], whereas for LOW MI mode partitions the MI value range is [0.001, 0.1]. We use a parameter, the memory intensity ratio MIr, to vary the number of partitions in the HIGH MI mode relative to that in the LOW MI mode in the system.
Each partition is then randomly assigned to a core, such that each core ends up with 4 partitions. The setup then generates the per-partition single-core utilization using the UUniFast algorithm [14], such that U is the cumulative single-core utilization of each core. The parameter U thus allows varying the cumulative single-core utilization of the partitions assigned to a core. Next, the setup generates E and µ values for each partition based on its single-core utilization and memory intensity (MI) value, assuming no stall. The E and µ values of each partition respectively represent the aggregated E demand and the aggregated µ demand of all tasks assigned to it, in line with existing works such as [15] and [13].
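The generation procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the generator used for the experiments: the function names are hypothetical, MIr is interpreted here as the fraction of all partitions in HIGH MI mode, and the memory share of the stall-free demand is converted into a transaction count µ by dividing by L_max. The text does not spell out any of these three choices.

```python
import random

H = 0.128       # major cycle (hyperperiod) in seconds, as stated in the text
L_MAX = 2.4e-6  # per-transaction latency in seconds, as stated in the text


def uunifast(n, total_u, rng):
    """UUniFast: split total_u into n utilizations that sum to total_u."""
    utils = []
    remaining = total_u
    for i in range(1, n):
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utils.append(remaining - next_remaining)
        remaining = next_remaining
    utils.append(remaining)
    return utils


def draw_mi(is_high, rng):
    """Bi-modal memory intensity, using the two ranges given in the text."""
    return rng.uniform(0.5, 0.99) if is_high else rng.uniform(0.001, 0.1)


def generate_partition_set(m, total_u, mi_ratio, rng):
    """Return a list of m cores, each a list of 4 partition dicts."""
    n = 4 * m
    # ASSUMPTION: mi_ratio is the fraction of partitions in HIGH MI mode.
    n_high = round(mi_ratio * n)
    modes = [True] * n_high + [False] * (n - n_high)
    rng.shuffle(modes)  # random assignment of partitions to cores
    cores = []
    for c in range(m):
        core = []
        for k, u in enumerate(uunifast(4, total_u, rng)):
            mi = draw_mi(modes[4 * c + k], rng)
            demand = u * H                  # stall-free demand in seconds
            e = (1.0 - mi) * demand         # pure processing demand (s)
            # ASSUMPTION: memory time is converted to transactions via L_MAX.
            mu = int(mi * demand / L_MAX)
            core.append({"u": u, "mi": mi, "E": e, "mu": mu})
        cores.append(core)
    return cores


if __name__ == "__main__":
    for core in generate_partition_set(4, 0.5, 0.25, random.Random(1)):
        print([(round(p["u"], 3), round(p["mi"], 3), p["mu"]) for p in core])
```

Passing an explicit `random.Random` instance keeps each generated partition set reproducible, which is convenient when the same set has to be evaluated under several budget assignment policies.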
System-wide Parameters: We use realistic system-wide parameters: $L_{max} = 2.4 \times 10^{-6}$ s and P = 1 ms, resulting in Q = 41666, as described in [6].
Budget Assignment Policies: We consider two static budget assignment policies: static and even (SE), which assigns to each core a constant and identical budget of 1/m times the total budget, e.g., Q = {10416, 10416, 10416, 10416} for a 4-core system; and static and uneven (SU), which assigns to each core a constant budget based on the weight of that core, e.g., Q = {416, 20416, 5416, 15416}. For the SU policy, we use a heuristic to generate the weight of each core, and thereby a budget assignment, based on the input partition set.
The key idea behind the heuristic is to assign a higher memory budget to cores with higher memory demand. The heuristic computes the weight of each core as the ratio of the remaining cumulative µ to the sum of the remaining cumulative µ and cumulative E on that core. This is similar to memory intensity, albeit at the core level. The memory bandwidth Q is then partitioned among cores based on the computed weights, resulting in a budget assignment.
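The two static policies and the weighting heuristic can be sketched as below. This is a hedged illustration rather than the evaluated implementation: the function names are hypothetical, the split of Q is assumed to be proportional to the weights with rounding down, and µ is assumed to be converted to time by multiplying by a per-transaction latency so that it can be compared with E.

```python
import math


def static_even(q_total, m):
    """SE: every core receives the same share of the total budget."""
    return [q_total // m] * m


def core_weights(cores, latency):
    """Per-core weight: remaining memory demand over remaining total demand.

    cores is a list of (remaining_mu, remaining_E) pairs, one per core.
    ASSUMPTION: mu is converted to time as mu * latency before the ratio.
    """
    weights = []
    for mu, e in cores:
        mem = mu * latency
        weights.append(mem / (mem + e) if mem + e > 0 else 0.0)
    return weights


def weighted_budgets(q_total, weights):
    """ASSUMPTION: Q is split proportionally to the weights, rounded down."""
    total = sum(weights)
    return [math.floor(q_total * w / total) for w in weights]


if __name__ == "__main__":
    print(static_even(41666, 4))
    # Illustrative weights only, chosen so that the result matches the SU
    # example budgets quoted in the text; they are not taken from the paper.
    print(weighted_budgets(41666, [1, 49, 13, 37]))
```

Under these assumptions, `static_even(41666, 4)` yields the SE example from the text, and the illustrative weights reproduce the SU example. Calling `core_weights` again on the demand that is still outstanding gives updated weights, which is the step a policy would repeat if it re-evaluated budgets during the major cycle.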
We compare the static policies (SE and SU) against a dynamic policy (DY), which assigns dynamic memory budgets to cores using the same heuristic. Whereas the static SU policy applies the heuristic at time t = 0 only, the DY policy recomputes the weight of every core each time a partition finishes execution, resulting in a dynamic budget assignment.
B. Varying Number of Cores m
Figure 5 compares the schedulability ratios for each of the three budget assignment policies (DY, SU, and SE) as the number of cores m varies from 4 to 12 in steps of 4, for a fixed memory intensity ratio MIr of 0.25. For each value of U, we generated 1000 partition sets for the m = 4 case, and 100 partition sets for each of the m = 8 and m = 12 cases. On the x-axis, we vary the cumulative per-core utilization U from 0.1 to 0.9 in steps of 0.01.
First, we observe that as the number of cores m increases, the schedulability ratio decreases, i.e., the plots shift towards the left, for each of the three budget assignment policies. This is because, as the number of cores increases, the total memory supply remains constant whereas the total memory demand increases, since the number of HIGH MI mode partitions in the system increases. Second, for each value of m, the dynamic policy DY dominates the static policies SU and SE.
C. Varying Memory Intensity Ratio MIr
Next, we vary the memory intensity ratio MIr from 0.15 to 0.50, which changes the number of HIGH MI partitions and, consequently, the number of LOW MI partitions in the system. We set the number of cores m to 8. Figure 6 shows the schedulability ratios for each of the three budget assignment policies (DY, SU, and SE) under varying MIr. On the x-axis, we vary the cumulative per-core utilization U from 0.1 to 0.9 in steps of 0.01. We generated 100 partition sets for every combination of U and MIr.
As MIr increases, the cumulative memory load from all cores generally increases. Consequently, we observe that the schedulability ratio plots shift towards the left as MIr increases. Further, for each memory intensity ratio MIr, the dynamic budget assignment policy DY dominates the static policies SU and SE.
VIII. RELATED WORK
Recent literature on the design of real-time systems on multicore platforms considers main memory as a significant source of unpredictability, and an important interference channel to mitigate. Predictable memory controllers have been proposed in [16] - [18]. OS-level techniques implementable on COTS hardware to regulate access of cores to main memory have been proposed and evaluated in [1], [11], [19], [20]. Yet another body of work has investigated the idea of strictly serializing access of cores to main memory. For instance, the work in [21] clusters memory operations in tasks via cache pre-fetching using compiler-level transformations, defining memory- and execution-phases. Then, a central scheduler allows at most one memory-phase to be active at any point in time. A similar scheme was adopted in [22] - [25] using DMAs instead of CPU-initiated pre-fetches and scratchpad memories. A recent work [26] proposes an analysis for co-running tasks contending for memory resources, i.e., with no explicit bandwidth partitioning.
By clustering and serializing access to shared memory, interference is avoided by design. Compared to this approach, regulation has the advantage of being entirely implementable at OS-level. For memory regulation techniques, analytic bounds on the temporal behavior of tasks were also derived [6], [7], [9], [27]. Similarly, the work in [28] derives runtime guarantees when both a CPU server and memory regulation are used. These works focus on static memory bandwidth partitioning.
With respect to static and even budget assignment, a first analysis was derived in [6]. In [9], an analysis for static and uneven bandwidth partitioning was performed assuming only knowledge of the memory budget q_i for the core under analysis, and assuming arbitrary assignment to the other m − 1 cores. More recently, the work in [7] demonstrated that by leveraging exact knowledge of each core's budget q_i it is possible to drastically reduce the pessimism of the analysis.
A few works [1], [11] proposed unused budget reclamation. However, no offline guarantees can be provided on the dynamic portion of the assigned budget. The work in [29] considers budget reclamation and derives WCET guarantees assuming full knowledge of the workload on all cores. In a more recent work, Nowotsch et al. [15], [30] consider avionics temporal partitions with pre-defined budget assignment. In this way, they are able to compute offline the WCET of applications inside a partition, although the budget may vary at the boundaries of partitions. Finally, the work in [13] relaxes the strict single budget-to-partition assignment in [15] and allows different budgets to be assigned to a partition offline, enabling dynamic budget assignment from a set of design-time fixed budgets. By assuming that memory stall is pre-computed in each budget, the WCET computation problem is then decomposed into (1) assigning "compatible" budgets across cores; and (2) minimizing the use of high-budget slots by the task under analysis.
What sets this work apart is the generality of the provided results. Unlike the aforementioned literature, we do not assume any specific budget re-assignment scheme. In fact, we provide a methodology that can be used to compute the worst-case runtime of a task given any dynamic budget-to-core assignment. To use our results, either the budget assignment over time must be known exactly, or a critical instant for memory budget re-assignments must be identified.
IX. CONCLUSION
In this paper, we presented a methodology to analyze the worst-case execution time and schedulability of real-time workload under dynamic memory scheduling. We first introduced a simple iterative algorithm to compute the span of workload under static and uneven budget-to-core assignment. We then generalized the problem to consider a generic memory schedule and formulated the worst-case span analysis as a stall-maximization problem. Next, we demonstrated that the problem has strong similarities with concave optimization and proposed a low-complexity solution to determine the access pattern that maximizes the overall memory stall. As a use case, we considered an IMA setting where a subset of partitions run memory-intensive workload. In this scenario, dynamic memory scheduling outperformed traditional static bandwidth partitioning. The analysis assumes a known memory schedule. It is, however, generic with respect to CPU scheduling and is thus applicable to event-triggered CPU schedulers. As future work, we intend to study online bandwidth scheduling strategies for which a critical instant can be identified with respect to both processor and memory scheduling decisions.
