Consider global-Earliest-Deadline-First scheduling on a multiprocessor assuming (i) constrained-deadline sporadic tasks: a task generates a sequence of jobs and the deadline of a job is at most the minimum inter-arrival time of the task generating the job; (ii) stage-parallelism: a task comprises at least one stage, a stage comprises at least one segment, and a segment is allowed to execute only if all segments of the previous stage have finished; and (iii) contention for shared resources in the memory system: cache eviction, reordering in the memory controller (MC), and memory bus contention. We present an algorithm that (i) performs schedulability testing; (ii) configures virtual-to-physical address translation (VPAT) so a cache block fetched to the last-level cache by one task is not evicted by another; (iii) configures VPAT to reduce the reordering effect (REE) in the MC; and (iv) considers contention for the memory bus. Our algorithm solves a Mixed-Integer Linear Program (MILP); we have implemented it in a tool and evaluated it. Across all of our experiments, the maximum time the tool takes to finish is 18h and the median time is 2.5h. We have also done preliminary testing on a real computer platform.
INTRODUCTION
Multicore processors are the norm today. The trend is that the number of processors on a chip increases exponentially while the clock frequency stays constant. At the same time, software practitioners are under pressure to deliver improved functionality, which requires more CPU cycles. This trend makes it increasingly common that a job has an execution requirement so large that executing it sequentially causes a deadline miss; hence, the only way for a job to meet its deadline is to perform some execution in parallel. This brings the challenge:
C1. Schedule software where some parts can execute in parallel so that all deadlines are met; also prove before run-time that deadlines are met.
Table 1: Summary of the state of the art.
Solution                                            Addresses challenges
                                                    C1    C2    C3    C4
[17, 27, 13, 10, 9, 22, 4, 7, 2, 16, 8, 18, 24]     Yes   No    No    No
[21, 29]                                            No    Yes   No    No
[19, 25]                                            No    No    Yes   No
[28]                                                No    Yes   Yes   No
[26, 11, 20, 23, 31, 30, 14]                        No    No    No    Yes
This paper                                          Yes   Yes   Yes   Yes
The timing of software executing on a COTS multicore processor depends not only on the processor scheduler but also on contention for shared resources in the memory system. This includes (i) the last-level cache (LLC) shared between processors, (ii) the row buffer in each memory bank (MB) storing the most recently accessed row, and (iii) the memory bus (the bus between the memory controller (MC) and DRAM modules). A cache memory is typically organized as a set of cache sets where certain bits of the physical address (PA) of a memory access determine which cache set the memory access should use. Hence, if the virtual-to-physical address translation (VPAT) is set up such that no two memory accesses of different tasks use the same cache set, then it is guaranteed that a cache block fetched into the cache by one task cannot be evicted by another task. Also, DRAM modules are typically organized as a set of MBs with each MB having multiple rows and each MB having one row buffer which stores the data of the most recently accessed row. When a memory access experiences a miss in the LLC, it (i) precharges the MB, that is, the data in the row buffer of the MB is written back to its row in the MB, and then (ii) activates the MB, that is, loads a row of the MB (given by the PA) into the row buffer of the MB, and then (iii) reads data from this buffer and transfers the data to the processor (if the memory access is a read) or writes data to this row buffer (if the memory access is a write). If the row needed for a memory access is already loaded in the row buffer, then precharge and activate are not performed and hence execution is faster. Hence, MCs reorder memory accesses so memory accesses to the row that is in the row buffer get ahead in certain queues in the MC. Thus, a memory access can be delayed because other memory accesses, of other tasks, get ahead in the queue; this is the reordering effect (REE). Hence, if VPAT is set up so that a MB is accessed by at most one task, then it is guaranteed that no task can suffer from this REE. Also, a memory access can be delayed because other accesses use the memory bus. This brings the challenges:
C2. Configure VPAT so that a cache block fetched to the LLC by one task cannot be evicted by another.
C3. Configure VPAT so that reordering of memory accesses from different tasks is avoided; if reordering can still occur, then the schedulability test computes an upper bound on the extra execution time due to reordering.
C4. Compute an upper bound on extra execution time caused by processors sharing the memory bus.
Unfortunately, the literature offers no solution for all of these challenges; see Table 1.
Therefore, this paper presents a solution for all of these challenges. We assume global-EDF (gEDF), reformulate a previously-known schedulability test as a Mixed-Integer Linear Program (MILP), and extend this formulation to (i) configure VPAT so a cache block fetched to the LLC by one task is not evicted by another, (ii) configure VPAT to try to eliminate extra execution time caused by the REE in the MC (if this is not possible, the REE is considered in the schedulability test), and (iii) consider contention for the memory bus. We also present a tool that implements this theory and evaluate it. The remainder of this paper is organized as follows. Section 2 presents the system model. Section 3 adapts a previously known schedulability test to a MILP. Section 4 presents constraints that express an upper bound on the execution time of a segment due to memory contention and how it depends on VPAT, and also expresses other constraints. Section 5 puts it all together as a solution for the four challenges. Discussions, evaluations, and conclusions follow.
SYSTEM MODEL
Fig. 1 shows the parallel task model we use and Fig. 2 shows the hardware model we use. We consider a system with m processors of speed s and a taskset τ. A task τi in τ is characterized by Ti, Di, nsi, nsegi,j, and Ci,j. The interpretation of these parameters is that (i) τi generates a sequence of jobs with arrival times of two consecutive jobs of τi separated by at least Ti; (ii) a job of τi needs to finish execution by its absolute deadline (the absolute deadline of a job of τi is Di time units after its arrival); and (iii) job execution is described with stages where nsi denotes the number of stages of a job of τi and nsegi,j denotes the number of segments in stage j of a job of τi. A segment executing contiguously for ∆ time units performs ∆ × s units of execution. A segment of a job finishes when it has performed a number of units of execution equal to its execution requirement.
For a segment of a job, if the segment is in stage j of τi then its execution requirement is at most Ci,j assuming that it does not experience memory contention; if it experiences memory contention then its execution requirement may be larger (explained later). When a job of τi arrives, all the nsegi,1 segments of stage 1 of τi become eligible for execution. For each j ≥ 2, at the time when all the nsegi,j−1 segments of stage j − 1 of τi have finished, all the nsegi,j segments of stage j of τi become eligible for execution. A segment becomes non-eligible when it has finished execution. A job of τi finishes when all the nsegi,nsi segments of stage nsi of this job have finished. We assume ∀τi ∈ τ : Di ≤ Ti; such tasksets are called constrained-deadline sporadic tasksets.
gEDF works as follows: (i) jobs are assigned priorities such that if a job has higher priority than another job then the absolute deadline of the former is no later than the absolute deadline of the latter, (ii) a segment inherits the priority of the job it belongs to, and (iii) at each instant, if at most m segments are eligible for execution at this instant, then all of them execute at this instant; if m + 1 or more segments are eligible for execution, then the m highest priority segments at this instant are selected for execution at this instant. A taskset τ is gEDF-schedulable on a computer with m processors of speed s if, for each jobset that τ can generate, for each schedule that gEDF can generate for this jobset, it holds that each job finishes no later than its absolute deadline.
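Rule (iii) can be illustrated with a minimal Python sketch. The `Segment` class, its field names, and the tie-breaking by name are assumptions made only for this illustration; gEDF itself leaves ties between equal deadlines open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str            # hypothetical identifier, used only for this sketch
    abs_deadline: float  # absolute deadline inherited from the segment's job


def gedf_select(eligible, m):
    """Return the segments that execute at this instant under gEDF.

    If at most m segments are eligible, all of them are returned; otherwise
    the m segments with the earliest absolute deadlines are returned.
    """
    # Earlier absolute deadline means higher priority. Ties are broken by
    # name here only to make the sketch deterministic (an assumption).
    ranked = sorted(eligible, key=lambda seg: (seg.abs_deadline, seg.name))
    return ranked[:m]
```

For example, with three eligible segments and m = 2, the segment with the latest absolute deadline is the one left waiting.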
Each segment has a virtual address (VA) space. The VA space is organized into pages of size PAGESIZE bytes. (For example, PAGESIZE=4096 bytes.) The memory footprint of a segment of stage j of τi is at most npi,j pages. Each page is associated with a range of VAs, and a page is associated with a frame of physical memory (of size PAGESIZE bytes); this frame is associated with a range of PAs. The log2(PAGESIZE) least significant bits of the PA are called frame offset. The other bits of the PA are called frame index. The log2(PAGESIZE) least significant bits of the VA are called page offset. The other bits of the VA are called page index. A VA is mapped to a PA as follows: (i) the frame offset is identical to the page offset and (ii) the frame index is obtained through VPAT from the page index (typically through a page table in an operating system). For convenience, seg(i, j, g) denotes the g-th segment of stage j of τi and page(i, j, g, p) denotes page p of seg(i, j, g).
Figure 2: The hardware model we use.
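The VA-to-PA mapping described above can be sketched in a few lines of Python. This is an illustration only: the page table is modelled as a plain dictionary from page index to frame index, and the function names are hypothetical.

```python
PAGESIZE = 4096                          # bytes; the example value from the text
OFFSET_BITS = PAGESIZE.bit_length() - 1  # log2(PAGESIZE) for a power of two


def split_address(addr):
    """Split an address into (index, offset) using the page/frame size."""
    return addr >> OFFSET_BITS, addr & (PAGESIZE - 1)


def translate(va, page_table):
    """Map a VA to a PA: the offset is kept, the index goes through VPAT."""
    page_index, page_offset = split_address(va)
    frame_index = page_table[page_index]  # the VPAT lookup (assumed dict)
    return (frame_index << OFFSET_BITS) | page_offset
```

With the page table `{3: 7}`, the VA `0x3ABC` (page index 3, page offset `0xABC`) is translated to the PA `0x7ABC`.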
In order to allow segments to share data, we introduce shfr that specifies the requirements that this imposes on VPAT. shfr is a set containing 8-tuples with the interpretation that for each 8-tuple ⟨i, j, g, p, i′, j′, g′, p′⟩ it is required that page(i, j, g, p) and page(i′, j′, g′, p′) are mapped to the same frame.
In previous work [14], we presented and validated a model of the memory system of typical COTS multicore processor based systems. In this paper, we extend this model to a more fine-grained description of memory accesses. Our model is as follows. The LLC (see LLC in Fig. 2) is shared between processors. This cache is organized as a set of cache sets where certain bits in the PA determine which cache set a memory access is associated with. Some of these bits are part of the frame index and some are part of the frame offset. When a memory access experiences a miss in the LLC, the memory access is passed on to the MC, which identifies which MB the memory access is associated with (certain bits in the frame index determine this) and which row in this MB it is associated with (other bits in the frame index determine this), and the memory access is inserted in a queue for memory accesses to this MB. The queuing discipline First-Ready-First-Come-First-Served (FR-FCFS) is used. With this queuing discipline, FCFS is used but with the following exceptions: (i) a memory access can be prevented (to be explained later) from being performed at certain instants because there are DRAM timing parameters which state that a certain part of a DRAM access must wait until a certain timing requirement (based on previous memory accesses) is satisfied and (ii) elements in the queue of a MB can be reordered so that a memory access gets to the head of the queue when this memory access is associated with the row that is currently loaded in the row buffer. When a memory access gets to the head of the queue of the MB, it contends for the memory bus with memory accesses of other MBs.
When a memory access is granted the memory bus, the memory access precharges its associated MB (that is, the data in the row buffer is written back to its row in the MB) and then the memory access activates its associated row in its associated MB (that is, the data in this row is loaded to the row buffer) and finally it transfers data (from the row buffer of the MB to the MC if the memory access is a read; the other direction if it is a write). If the row associated with the memory access is already in the row buffer then precharge and activate are not performed.
In many of today's processors, the bits in the PA from which the cache set index of the LLC is obtained overlap with the bits that determine the frame index (see [28]). Also, in many of today's processors, the bits in the PA from which the MB index is obtained overlap with the bits that determine the frame index (see [28]). Therefore, one can use a technique called cache coloring [21, 29, 28, 15] which partitions memory frames of physical memory into cache colors so that if two memory accesses belong to different frames and these two memory frames belong to two different cache colors, then one memory access cannot evict a cache block fetched to the LLC by the other memory access. Also, one can use a technique called bank coloring [28] which partitions memory frames of physical memory into MB colors so that if two memory accesses belong to different frames and these two memory frames belong to two different MB colors, then one memory access cannot evict a row in a MB that the other memory access has loaded. Typically an MB color is a single MB. But a cache color typically consists of multiple cache sets. Let 19 /(32 × 16) = 2 10 ). Note that in some modern multicore chips, the number of cache colors is affected by a hash function within the LLC [15].
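To make the idea of coloring concrete, the following Python sketch derives a cache color and an MB color from a frame index. The bit layout (which frame-index bits select the cache color and which select the MB) is entirely hypothetical; on real hardware these positions are platform specific, may overlap, and may pass through a hash function.

```python
CACHE_COLOR_BITS = 10  # hypothetical: 2**10 cache colors
BANK_COLOR_BITS = 3    # hypothetical: 2**3 MB colors


def cache_color(frame_index):
    """Cache color = lowest CACHE_COLOR_BITS bits of the frame index (assumed)."""
    return frame_index & ((1 << CACHE_COLOR_BITS) - 1)


def mb_color(frame_index):
    """MB color = the next BANK_COLOR_BITS bits of the frame index (assumed)."""
    return (frame_index >> CACHE_COLOR_BITS) & ((1 << BANK_COLOR_BITS) - 1)


def may_evict_each_other(frame_a, frame_b):
    """Two frames can conflict in the LLC only if they share a cache color."""
    return cache_color(frame_a) == cache_color(frame_b)
```

Under this assumed layout, giving two tasks frames with disjoint cache colors rules out LLC evictions between them, and giving them disjoint MB colors rules out row-buffer interference between them.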
One factor that determines the execution time of a program is its self-eviction of blocks in its cache which, in turn, depends on VPAT. Hence, if map is the VPAT of all tasks in the system then Ci,j(map) is an upper bound on the execution requirement of a segment in stage j of τi for the case that this segment does not experience contention for resources in the memory system from other segments. Also, MAi,j,p(map) is an upper bound on the number of memory accesses reaching the MC from page p of a segment in stage j of τi for the case that this segment does not experience contention for resources in the memory system from other segments. Let Ci,j be a value such that ∀map : Ci,j(map) ≤ Ci,j. Let MAi,j,p be a value such that ∀map : MAi,j,p(map) ≤ MAi,j,p. In practice, if the VPAT map is known, then it is possible to obtain Ci,j(map) and MAi,j,p(map) (e.g. using a worst-case execution-time analysis tool) but obtaining Ci,j and MAi,j,p is very expensive because they describe the behavior of the software for all possible VPATs of the system. Even if Ci,j and MAi,j,p are obtained, it can happen that we choose a VPAT map such that Ci,j is much higher than Ci,j(map) (and analogously for MAi,j,p(map)). This would result in large pessimism. Also, note that in real systems, there are private caches. Typically, the cache set in the private cache that is used by a memory access is determined by the VA of the memory access. Thus, cache coloring cannot control the eviction in private caches. We defer discussion of these issues to Section 6. For now, assume Ci,j and MAi,j,p are known.
Let MBCF denote the memory bus clock frequency and let tCK = 1/MBCF. We assume (as do many previous studies [30, 23, 20, 11]) that a processor is stalled when it waits for memory. Let $L^{PRE}_{inter}$ denote the time required for precharge; $L^{ACT}_{inter}$ is the time required for activate; $L^{RW}_{inter}$ is the time for data transfer. Then, using [14], these can be computed from parameters available in DRAM datasheets [12] as follows:
We will also use parameters which describe how one memory access can experience interference from other memory accesses. $L_{conf}$ denotes the row-conflict service time. $L_{conhit}(x)$ is a function which describes the time it takes to serve x consecutive memory accesses to the same row in the same MB if the row was already activated. These can be computed [14] from parameters available in DRAM datasheets [12] as:
For convenience, we use the following notation. P denotes the least common multiple of the Ti values. $DMAX = \max_{\tau_i \in \tau} D_i$. Nre is a hardware limit on reordering (to be discussed later). {a..b} denotes the set of integers greater than or equal to a and less than or equal to b. Let s.t. mean such that and : mean it holds that. (∀t ≥ 0 : x) means the predicate (∀t s.t. t ≥ 0 : x). lhs means left-hand side and rhs means right-hand side. We define $UBNOMR = \sum_{\tau_i \in \tau} \left( \max_{j \in \{1..ns_i\}} nseg_{i,j} \right)$, meaning upper bound on the number of outstanding memory requests, and define LL = min(m − 1, UBNOMR − 1) and LH = LL + Nre.
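As a small worked example of this notation, the Python sketch below computes P, DMAX, UBNOMR, LL and LH. It assumes integer Ti values, reads UBNOMR as a sum over tasks of the largest number of segments in any stage, and uses a hypothetical dictionary layout for tasks (`T`, `D`, and a per-stage list `nseg`).

```python
from math import lcm  # available from Python 3.9


def derived_constants(tasks, m, n_re):
    """Compute (P, DMAX, UBNOMR, LL, LH) for a taskset.

    tasks: list of dicts with keys "T", "D" and "nseg" (segments per stage);
    this layout is an assumption made for the sketch.
    """
    P = lcm(*(t["T"] for t in tasks))             # least common multiple of Ti
    dmax = max(t["D"] for t in tasks)             # largest relative deadline
    ubnomr = sum(max(t["nseg"]) for t in tasks)   # outstanding-request bound
    ll = min(m - 1, ubnomr - 1)
    lh = ll + n_re
    return P, dmax, ubnomr, ll, lh
```

For two tasks with (T, D, nseg) = (10, 8, [1, 3, 1]) and (15, 15, [2]) on m = 4 processors with Nre = 12, this gives P = 30, DMAX = 15, UBNOMR = 5, LL = 3 and LH = 15.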
Tasks typically perform execution and access memory in an initialization phase which does not have real-time requirements. This execution and these memory accesses are not considered part of a job, but the pages accessed need to be mapped to memory frames. Therefore, INO indicates the number of pages accessed during initialization.
SCHEDULABILITY ANALYSIS FOR GEDF OF PARALLEL TASKS WITHOUT MEMORY CONTENTION
The literature offers many sufficient schedulability tests for gEDF for tasks that are not parallel; see for example [3, 6, 5]. Of particular interest is [5] which offers a schedulability test with speedup factor two. The key to its good performance is the use of ffdbf (forced-forward demand-bound function) for describing the maximum amount of execution a task can demand in a time interval, rather than using the traditional dbf (demand-bound function). This schedulability test [5] states that if there exists a σ such that σ is at least as large as the density of each task and for each value of t the sum of ffdbf of tasks is at most a certain value then the taskset is gEDF-schedulable. (If such a σ exists then it is called a witness.) Later work [2] extended this for parallel tasks by defining ffdbf for parallel tasks. Fig. 3 shows it. In Fig. 3, f(τ, m, s) is a schedulability test. It takes as input a taskset τ, m, and s and outputs a boolean such that if this boolean is true then τ is gEDF-schedulable on m processors of speed s. f(τ, m, s) is evaluated by checking if there exists a σ such that σ is at least as large as the density of each task and for each value of t the sum of ffdbf of tasks is at most (m
Figure 3: Previously known schedulability analysis for gEDF scheduling of parallel tasks [2]. f(τ, m, s) ⇒ τ is gEDF schedulable on m processors of speed s if tasks do not experience memory contention.
Intuitively, ffdbf(τi, t, v, s) expresses an upper bound on the amount of execution that a task τi can perform in a time interval of duration t if a deadline miss occurred at the end of this time interval; the parameter v is used to describe the amount of idleness in the system. Hence, ffdbf(τi, t, v, s) comprises two parts: execution of jobs of τi that arrived before the beginning of this time interval and execution of jobs of τi that arrived after the beginning of this time interval. An upper bound on the former is Ci − WJ(τi, (Di − (t mod Ti)) × v, s). An upper bound on the latter is
Together, this yields ffdbf(τi, t, v, s) as expressed in Fig. 3. See [5] for a discussion about ffdbf(τi, t, v, s) and [2] for WJ for parallel tasks.
We will now discuss how to modify this schedulability test and then rewrite it as a MILP. Note that evaluating f(τ, m, s) in Fig. 3 is non-trivial because it requires that one finds a σ such that a certain condition is true, and here σ is a real number. If, instead, we change this expression so that we only check a finite number of possible values of σ, then we get a schedulability test with a slight increase in pessimism but with the benefit of being easier to express as a MILP. We choose a value of K that is a positive integer (e.g. K = 20) and check only those σ such that there is a k ∈ {1..K} such that σ = (k/K) × s. This yields:
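The discretization of σ can be sketched as follows. The predicate `holds_for_sigma` is a hypothetical stand-in for the per-σ condition of the schedulability test (density and ffdbf checks); it is not part of the original formulation and is passed in by the caller.

```python
from fractions import Fraction


def sigma_candidates(s, K):
    """The K candidate witnesses sigma = (k/K) * s for k in {1..K}."""
    return [Fraction(k, K) * s for k in range(1, K + 1)]


def find_witness(s, K, holds_for_sigma):
    """Return the smallest k whose sigma satisfies the test, or None.

    holds_for_sigma is an assumed callable standing in for the condition
    that the schedulability test checks for a fixed sigma.
    """
    for k, sigma in enumerate(sigma_candidates(s, K), start=1):
        if holds_for_sigma(sigma):
            return k
    return None
```

Restricting σ to this grid can only reject tasksets that a real-valued witness would accept, which is the slight increase in pessimism mentioned above; a larger K makes the grid finer at the cost of a larger MILP.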
can be evaluated by only considering t ≤ P. Hence, f*(τ, m, s, K) is true if and only if the following is satisfiable:
(In the above expression, wi_k should be read as witness.) Observe that lhs of the inequality defining h*(τ, m, s, k, K, t) is a piecewise linear function of t and rhs of the inequality defining h
) it is only necessary to evaluate h*(τ, m, s, k, K, t) for the following values of t: (i) values of t such that the derivative of the piecewise linear function changes, (ii) t = P, and (iii) t = 0. With respect to (iii), note that h*(τ, m, s, k, K, 0) is true and hence it does not need to be checked. With respect to (ii), note that h*(τ, m, s, k, K, P) can be rewritten as (
With respect to (i), note that for t such that there is a positive integer q and a task
, the above mentioned derivative changes but this t is dominated by another t in the condition and hence, this t does not need to be checked. Hence, f*(τ, m, s, K) is true if and only if the following is satisfiable:
Note that in the expressions above, we use primed variables to indicate variables that are used for computing t. For example, i = 3 means that we are computing ffdbf for task τ3. But i′ = 3 means that we are computing a t and it is approximately three times T3.
We will now rewrite ffdbf(τi,
Figure 4: Schedulability analysis for gEDF scheduling of parallel tasks formulated as a MILP.
Figure 5: Domains of variables and domains of indices. Here we also show domains of variables that will be used in later sections.
(Intuitively, I i,q,i′,q′,j′,f′,k can be read as an integer number of jobs of task τi for a time interval of duration t. The symbol r i,i′,q′,j′,f′,k can be read as remainder of time. And aj i,i′,q′,j′,f′,k can be read as execution from all integer jobs of task τi.) Then, rewrite ffdbf(τi, t i′,q′,j′,f′,k, k/K, s) as:
, s) in the expression above. We will now discuss how to rewrite it so that it is in a form closer to a MILP. One can observe (from Fig. 3) that WJ is computed differently for three different cases. Also, for the second case, it is necessary to compute WJS and this is done through recursion; hence we need to know for which value of j this recursion terminates. Therefore, we introduce fi, sf, ss, and th as variables that indicate if a certain condition is true. If the condition is true then the variable is 1; otherwise the variable is 0. fi i,i′,q′,j′,f′,k indicates that when
, s) is called, the second case in the definition of WJ is taken and WJS terminates with the 3rd parameter taking the value j and the first case in the definition of WJS is taken. ss i,j,i′,q′,j′,f′,k indicates that when
, s) is called, the second case in the definition of WJ is taken and WJS terminates with the 3rd parameter taking the value j and the second case in the definition of WJS is taken. th i,i′,q′,j′,f′,k indicates that when
, s) is called, the third case in the definition of WJ is taken. Since exactly one of them is true, it holds that:
For convenience, we can also introduce the variable
With this notation, we obtain:
We obtain similar expressions for sf, ss, and th. The above reasoning yields that f*(τ, m, s, K) is true if and only if there is an assignment of values to variables such that the constraints in Fig. 4 are satisfied and the domains of the variables are as given by Fig. 5. In Fig. 3, Ci,j denotes the upper bound on the execution requirement of a segment in stage j of τi but in Fig. 4 cui,j denotes this. (cui,j means execution requirement that we will use.)
MEMORY CONTENTION
Previous work [14] offers a method for computing an upper bound on the response time of a task considering contention for resources in the memory system. That method assumes fixed-priority preemptive non-migrative scheduling and integrates memory contention analysis in the schedulability test. In this section, we will adapt this memory contention analysis by (i) computing an upper bound on the extra execution of a segment of a single job of a task without assuming any specific processor scheduler, (ii) using our more advanced (page-level) model of memory accesses, and (iii) expressing it in a form easily translatable to a MILP.
cmi,j,g denotes an upper bound on the execution requirement of seg(i, j, g) considering contention for resources in the memory system (the extra execution of this contention is considered to be part of the execution requirement). Also, recall
memory system (the extra execution of this contention is considered to be part of the execution requirement). Hence:
(30) in Fig. 7 expresses (20). Let o i,j,g,p,h,b = 1 indicate that page(i, j, g, p) is mapped to a memory frame with cache color h and MB color b; otherwise o i,j,g,p,h,b = 0. Clearly, a page can only be mapped to one frame and one frame belongs to exactly one cache color and one MB color. Hence, each page belongs to exactly one combination of cache and MB color. This yields (24). Also, if a cache and MB color is given then the number of pages that can be mapped to this combination of cache and MB color cannot exceed its amount of physical memory. For the special case that there are no shared frames and there was no memory required during initialization, considering those pages that are mapped to frames of cache color h and bank color b, we can express the limited memory capacity as:
Let us now discuss how to express the constraint of limited memory capacity for the general case. Let GSi,j,g,p be a constant that indicates how many pages map to the same frame as page(i, j, g, p) maps to. For normal pages, it holds that GSi,j,g,p = 1 but if a page maps to a shared frame then GSi,j,g,p is larger. GSi,j,g,p can be computed as follows. Form a graph with one vertex for each ⟨i, j, g, p⟩ and an edge between two vertices ⟨i, j, g, p⟩ and ⟨i′, j′, g′, p′⟩ if ⟨i, j, g, p, i′, j′, g′, p′⟩ ∈ shfr. Compute the connected components of the graph. Then, for ⟨i, j, g, p⟩, let GSVSi,j,g,p denote the set of vertices in the connected component to which the vertex corresponding to ⟨i, j, g, p⟩ belongs. Let GSi,j,g,p denote the cardinality of GSVSi,j,g,p. Then, if we consider a page page(i, j, g, p) and all other pages that map to the same frame, we know that all of them together consume only a single frame. We can model that as if each of them consumed 1/GSi,j,g,p of a frame. With this observation and letting ino h,b indicate the number of pages accessed during initialization that map to frames which belong to cache color h and MB color b, we obtain that the limited memory capacity can be expressed by (25). In addition, the requirement on shared frames, expressed by the set shfr, yields (31).
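The computation of GS via connected components can be sketched with a union-find structure. This is a minimal illustration, assuming pages are given as 4-tuples (i, j, g, p), shfr as a collection of 8-tuples, and that every page mentioned in shfr also appears in `pages`.

```python
from collections import Counter


def group_sizes(pages, shfr):
    """Return a dict mapping each page (i, j, g, p) to its GS value."""
    parent = {pg: pg for pg in pages}

    def find(x):
        # Find the representative of x's component, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each 8-tuple in shfr is an edge between two pages.
    for tup in shfr:
        a, b = tuple(tup[:4]), tuple(tup[4:])
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # GS of a page is the cardinality of its connected component.
    sizes = Counter(find(pg) for pg in pages)
    return {pg: sizes[find(pg)] for pg in pages}
```

A page that does not appear in shfr forms a component of its own and gets GS = 1, matching the statement about normal pages; each page in a component of size n then counts as 1/n of a frame in the capacity constraint.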
For a pair of segments that could possibly execute in parallel, we require that the VPAT is set up so that one segment cannot evict a cache block that another segment has fetched to the cache. We can express it as follows: x i,j,g,h = 1 indicates that seg(i, j, g) uses cache color h; otherwise x i,j,g,h = 0. Then, if seg(i, j, g) can execute in parallel with seg(i′, j′, g′), cache coloring requires: $x_{i,j,g,h} + x_{i',j',g',h} \leq 1$, where $(x_{i,j,g,h} = 1) \Leftrightarrow \left( \sum_{p \in \{0..np_{i,j}-1\}} \sum_{b \in \{0..B-1\}} o_{i,j,g,p,h,b} \geq 1 \right)$. Rewriting these to a form close to a MILP yields (27), (28), and (29).
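The cache-coloring requirement can be checked on a candidate assignment with a short Python sketch. It assumes the o-variables that equal 1 are given as a set of tuples (i, j, g, p, h, b) and that the pairs of segments that may run in parallel are supplied by the caller; both representations are choices made for this illustration.

```python
def colors_used(o, seg):
    """Cache colors h with x = 1 for segment seg = (i, j, g).

    x is 1 for color h exactly when at least one page of the segment is
    mapped to a frame of cache color h (for any MB color b).
    """
    return {h for (i, j, g, p, h, b) in o if (i, j, g) == seg}


def coloring_ok(o, parallel_pairs):
    """True if no two possibly-parallel segments share a cache color."""
    return all(
        colors_used(o, seg_a).isdisjoint(colors_used(o, seg_b))
        for seg_a, seg_b in parallel_pairs
    )
```

Requiring the two color sets to be disjoint is the set-based counterpart of the inequality that the two x-variables for the same color h sum to at most 1.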
Let mb i,j,g,b be an upper bound on the number of memory accesses from seg(i, j, g) to memory bank b. This yields (22). For a segment seg(i, j, g), the symbol mmbo i,i ,j ,g ,b is an upper bound on the number of memory accesses on memory bank b from multiple jobs of all other segments than seg(i, j, g) such that these memory accesses can impact a job of seg(i, j, g). (23) expresses it. In the proof of Theorem 1, we will show that it is an upper bound. Now consider memory contention. Look at the queues inside the MC in Fig. 2. A single memory access to MB b can be delayed by three effects:
1. There are already memory accesses in the queue for MB b when this single memory access is inserted in the queue for MB b and because of FCFS queuing, these other memory accesses are served first.
2. After this single memory access is enqueued in the queue for MB b, there are other memory accesses enqueued in the queue for this MB and these other memory accesses' row is currently loaded in the row buffer and hence they get ahead in the queue for this MB (reordering).
3. When one of the other memory accesses mentioned in 1) or 2) reaches the head of the queue of MB b, it is not served immediately; instead it has to wait for the memory bus being granted and this takes time because other memory accesses in the queues to other MBs than MB b use the memory bus.
Consider seg(i, j, g) and its (at most mb i,j,g,b ) memory accesses that it performs on memory locations of MB b and how the three effects increase the execution time of seg(i, j, g). 
About 1) and 2) We will now discuss coat i,j,g,b. Since we assume a processor stalls until its memory access has been completed, it follows that from each processor, there can be at most one outstanding memory access and hence there are at most LL memory accesses of 1) above. The hardware places a limit on the number of reorderings that can happen. In previous work [14], we introduced the parameter to indicate an upper bound on the number of those reorderings that a single memory access can experience. In this paper, we let Nre denote this parameter; a typical value [14] is Nre = 12. Consequently, the mb i,j,g,b memory accesses from seg(i, j, g) performing on MB b have to wait for at most mb i,j,g,b × (LL + Nre) = mb i,j,g,b × LH other memory accesses performing on MB b (because of 1) and 2) above). Using mmbo yields that the mb i,j,g,b memory accesses from seg(i, j, g) performing on MB b have to wait for at most
other memory accesses performing on MB b (because of 1) and 2) above). Let oat i,j,g,b be the expression in (47). (It means other accesses to this MB.) By inspecting $L_{conhit}(x)$ and the parameters in Section 2, one can see that these memory accesses have different effects; the memory accesses that are in the queue before a memory access has arrived to the queue cause more interference than the ones that arrive later and cause reordering. Fig. 6 shows an upper bound. It gives us coat i,j,g,b.
About 3) We will now discuss oao i,j,g,b. A memory access related to MB b is inserted in the queue for the memory bus only if (i) this memory access is at the head of the queue of MB b and (ii) there is no memory access related to MB b already in the queue of the memory bus. Hence, a memory access that has reached the head of the queue of its MB needs to wait for at most B − 1 other memory accesses until it is granted the memory bus. Consequently, the mb i,j,g,b memory accesses from seg(i, j, g) performing on MB b have to wait for at most (mb i,j,g,b + oat i,j,g,b) × (B − 1) other memory accesses performing on other MBs than MB b (because of 3)). Using mmbo yields that the mb i,j,g,b memory accesses from seg(i, j, g) performing on MB b have to wait for at most
other memory accesses performing on other MBs than MB b (because of 3) above). (48) expresses oao i,j,g,b .
This reasoning yields an upper bound on the execution requirement in a form close to a MILP; see Fig. 7.
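The two counting arguments above can be written out as a short Python sketch. It covers only the bounds stated in the text before they are refined with mmbo (the refined expressions (47) and (48) are not reproduced here), and the function and parameter names are hypothetical.

```python
def same_bank_wait_bound(mb, ll, n_re):
    """Effects 1) and 2): accesses on the same MB that can be served first.

    Each of the mb accesses waits for at most LL queued accesses plus at
    most Nre reordered accesses, i.e. mb * (LL + Nre) = mb * LH in total.
    """
    return mb * (ll + n_re)


def bus_wait_bound(mb, oat, num_banks):
    """Effect 3): accesses on other MBs that can be granted the bus first.

    Each of the mb + oat accesses on this MB waits for at most one access
    from each of the other num_banks - 1 MBs.
    """
    return (mb + oat) * (num_banks - 1)
```

For instance, with mb = 100, LL = 3 and Nre = 12 the first bound is 1500 accesses; using that value as oat with B = 8 MBs gives a second bound of 11200 accesses.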
THE MILP FORMULATION
Let Π denote the computer platform (the parameters m, s, H, B and the parameters describing the memory system). fmem(τ, Π, K) is a function which returns the tuple ⟨flag, o⟩ where flag is a boolean and o is a multi-dimensional array. If there exists an assignment of values to the variables so that the constraints in Fig. 4 and Fig. 7 are satisfied then flag is true and o holds the values of the o-variables; otherwise flag is false and o is undefined.
Some of the constraints mentioned are not MILP constraints: they have binary variables and logical operators. We will now discuss how to convert them to a MILP. A constraint of the form (x = 1) ⇒ (a = b) can be rewritten as: ((x = 1) ⇒ (a ≤ b)) ∧ ((x = 1) ⇒ (a ≥ b)). Note that if x is a variable with the domain {0, 1} and a and b are non-negative real variables and BIG is a constant selected so that a ≤ BIG and b ≤ BIG, then a constraint (x = 1) ⇒ (a ≤ b) can be rewritten as
Note that in (17), we can use BIG = m and in (28), we can use BIG = np i,j × B and (27) can be rewritten without BIG.
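The rewriting of an implication into a linear constraint can be checked by brute force. The sketch below uses the standard big-M form a − b ≤ BIG × (1 − x); this is assumed to be equivalent to the rewriting the text refers to, whose exact notation is not reproduced here.

```python
def implication(x, a, b):
    """The logical constraint (x = 1) => (a <= b)."""
    return (x != 1) or (a <= b)


def big_m(x, a, b, big):
    """Standard big-M linearisation (assumed form): a - b <= BIG * (1 - x)."""
    return a - b <= big * (1 - x)


def equivalent_on_grid(big, values):
    """Compare both forms for x in {0, 1} and all a, b drawn from values."""
    return all(
        implication(x, a, b) == big_m(x, a, b, big)
        for x in (0, 1)
        for a in values
        for b in values
    )
```

When x = 1 the linear form reduces to a ≤ b, and when x = 0 it reduces to a − b ≤ BIG, which always holds if 0 ≤ b and a ≤ BIG. If BIG is chosen too small, the linear form wrongly cuts off assignments with x = 0, which is why the text bounds the left-hand sides before choosing BIG.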
Let us now discuss the other constraints. In a feasible solution to Fig. 4 and Fig. 7 , the variables mb i,j,g,b and mmbo i,j,g,b are at most
Hence, the lhs of the constraints (33)- (46) is at most
Also, for each of the other constraints, the lhs is at most
Applying the rewriting expressed by (49) (and minor variants of it), with BIG = max((51), (52)), yields that all of our constraints can be converted to a MILP.
DISCUSSION
fmem(τ, Π, K) and the model it is based on have three shortcomings: (i) it outputs a mapping from a page to a cache color and MB color but we actually need a mapping from a page to a memory frame in PA, (ii) it assumes that MAi,j,p and Ci,j are given and known, and (iii) it ignores the effect of eviction of cache blocks in a private cache. We can deal with (i) by simply solving the MILP; if it returns a tuple whose 1st element is true, then a VPAT can be obtained from o (the 2nd element in the tuple) as follows:
    if (⟨i', j', g', p', i, j, g, p⟩ ∈ shfr) and (for this i', j', g', p', it holds that page(i', j', g', p') has already been mapped to a frame) then
        map page(i, j, g, p) to the same frame as page(i', j', g', p')
end for

Table 2: Five-number summary of the time required, in our experiments, for performing schedulability analysis and configuring memory (in seconds).
We can deal with (ii) by making an initial guess of the values of MAi,j,p and Ci,j and then checking whether the guess is correct; if it is not, we refine the guess with data obtained from the checking procedure. Specifically, we do it as follows. Guess values of MAi,j,p and Ci,j, call fmem(τ, Π, K) to obtain a new o, obtain a VPAT map from this o, and then obtain Ci,j(map) and MAi,j,p(map). Check whether Ci,j(map) ≤ Ci,j and MAi,j,p(map) ≤ MAi,j,p; if this check fails, then use map to obtain the execution requirement and number of memory accesses and use them as a new guess of MAi,j,p and Ci,j. An algorithm based on these ideas is shown below:
declare FAILURE
We can deal with (iii) by modifying the pseudo code above so that on line 7, it not only runs a WCET tool but also computes an upper bound on the cost of a preemption (e.g. by considering that the entire private cache needs to get reloaded) and also computes an upper bound on the number of preemptions that a job can experience. Line 10 can be changed analogously to line 7. Hence, by using these modifications, our solution can be used in practice.
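The guess-and-refine procedure described for (ii) can be sketched as a small loop. This is not the authors' listing (its line numbers do not apply here); it is a minimal sketch in which `solve` stands in for fmem together with the construction of the VPAT map, `measure` stands in for the WCET and memory-access analysis under that map, and `max_rounds` is an added safeguard. All three names are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

Params = Dict[str, float]  # e.g. {"C[1,2]": 0.005, "MA[1,2,7]": 1200.0}


def guess_and_refine(
    initial_guess: Params,
    solve: Callable[[Params], Tuple[bool, Optional[object]]],
    measure: Callable[[object], Params],
    max_rounds: int = 10,
) -> Optional[object]:
    """Return a memory map whose measured parameters respect the guess, or None."""
    guess = dict(initial_guess)
    for _ in range(max_rounds):
        flag, memory_map = solve(guess)
        if not flag:
            return None  # corresponds to "declare FAILURE"
        observed = measure(memory_map)
        # The guess is confirmed if every measured value is at most the guessed one.
        if all(observed[name] <= guess[name] for name in guess):
            return memory_map
        guess = observed  # use the measured values as the new guess
    return None  # no fixed point within the round limit (assumed safeguard)


# Toy stand-ins, purely illustrative: the solver succeeds while C <= 10,
# and the measured value of C under any map is 5.
def toy_solve(guess: Params) -> Tuple[bool, Optional[object]]:
    return (guess["C"] <= 10.0, "map" if guess["C"] <= 10.0 else None)


def toy_measure(memory_map: object) -> Params:
    return {"C": 5.0}


toy_result = guess_and_refine({"C": 3.0}, toy_solve, toy_measure)
```

In the toy run the first guess (C = 3) is too optimistic, the measured value 5 becomes the new guess, and the second round confirms it.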
EVALUATION SUMMARY
We have implemented a tool based on this theory and tested it on systems with 4 and 8 processors. Table 2 offers a summary of the results; for details, see the Appendix in [1]. It can be seen that the maximum time it takes to finish is 18h and the median time is 2.5h. We performed a preliminary evaluation of the guarantee provided by this tool as follows. We developed a plugin for AADL based on Valgrind; this plugin obtains an upper bound on the number of memory accesses, from each page, that reach the memory controller and then outputs a model of the software system. We applied this plugin to a taskset with synthetic benchmark programs (matrix multiply) and obtained the parameters of our model. We then ran the taskset on a gEDF implementation in the Linux kernel, using our previously developed Linux implementation of coordinated cache and bank coloring [28], configured as specified by our tool. We ran the software system for 8h and observed no deadline misses.
CONCLUSIONS
Using COTS multicore processors in hard real-time systems is challenging because (i) taking full advantage of them for meeting tight deadlines requires parallelization and (ii) the contention for shared resources in the memory system makes execution times hard to predict. In this paper, we have developed a solution that addresses these issues. Our main idea is to formulate a MILP that configures the memory mapping and performs schedulability analysis. 
APPENDIX
Proof
Theorem 1.
((⟨flag, o⟩ = fmem(τ, Π, K)) ∧ (flag = true)) ⇒ τ is gEDF schedulable on m processors of speed s for the case that tasks experience memory contention and the VPAT conforms to o.
Proof. If the theorem is false then there exists a τ, m, s, K and an assignment of the number of jobs that each task generates and an assignment of arrival times to jobs and execution requirements to segments and a schedule such that the following two statements are true:
1. ⟨flag, o⟩ = fmem(τ, Π, K) and flag = true.
2. for the jobset generated by τ with the aforementioned assignment, it holds that gEDF can generate the aforementioned schedule and there is at least one job that misses its deadline in this schedule.
For this schedule, let t0 denote the earliest time when a deadline miss occurs. Remove all jobs with arrival time ≥ t0. There is still a deadline miss at time t0. Let us now reason as follows: For each job with absolute deadline > t0 that performs execution after time t0, identify the latest stage of this job such that a segment of this stage performs execution after t0, and then reduce the execution of this segment. Repeated application of this yields that no job with absolute deadline > t0 performs execution after time t0. Hence, it holds that: (i) 1) and 2) above are true, (ii) one or more jobs with absolute deadline at t0 miss their deadlines, (iii) each job with absolute deadline < t0 meets its deadline, (iv) all jobs have arrival times < t0, and (v) no job with absolute deadline > t0 performs execution after time t0.
For each job with absolute deadline < t0, we can reason as follows: Let τi denote the task that generates the job. Let A denote the arrival time of this job, consider the time interval [A, A + Di), and consider a task τi' other than τi. Because of (iii) and (iv), there can be at most one job of τi' that arrives before A and has execution that overlaps with [A, A + Di). Also, because of (iii), there can be at most ⌈Di/Ti'⌉ jobs of τi' that arrive at or after A and have execution that overlaps with [A, A + Di).
For each job with absolute deadline ≥ t0, we can reason as follows: Let τi denote the task that generates the job. Let A denote the arrival time of this job, consider the time interval [A, A + Di), and consider a task τi' other than τi. Because of (iii) and (iv), there can be at most one job of τi' that arrives before A and has execution that overlaps with [A, A + Di). Also, because of (v), there can be at most ⌈Di/Ti'⌉ jobs of τi' that arrive at or after A and have execution that overlaps with [A, A + Di).
Consequently, for each of these cases, there are at most $(\lceil D_i / T_{i'} \rceil + 1) \times mb_{i',j',g',b}$ memory accesses on MB b of jobs of seg(i', j', g') that overlap with [A, A + Di). Putting it together yields that there are at most mmbo i,j,g,b memory accesses that can be issued in parallel with seg(i, j, g) for MB b. Since we know the values of mmbo i,j,g,b , using Fig. 7 yields cm i,j,g . This yields cu i,j , which provides an upper bound on the execution requirement. Since cu i,j is an upper bound on the execution requirement, we can treat the system as if there were no contention for resources in the memory system and execution requirements were given by cu i,j . Since the constraints in Fig. 4 are satisfied, all deadlines are met. This contradicts 2) above. Hence, the theorem is correct.
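The counting step of the proof can be illustrated numerically. The sketch below assumes the bound has the form (⌈Di/Ti'⌉ + 1) × mb, that is, one carry-in job plus the jobs released inside the window, each contributing mb accesses on the bank; the ceiling, the function name and the sample values are assumptions made for illustration.

```python
def overlapping_accesses_bound(window: int, period: int, mb: int) -> int:
    """Bound on accesses of another task's segment on one MB within a window.

    window -- length Di of the interval [A, A + Di)
    period -- minimum inter-arrival time Ti' of the other task
    mb     -- accesses per job of that segment on the bank considered
    """
    if window <= 0 or period <= 0 or mb < 0:
        raise ValueError("window and period must be positive, mb non-negative")
    jobs_released_in_window = -(-window // period)  # integer ceiling of Di / Ti'
    carry_in_jobs = 1  # at most one job that arrived before A still overlaps
    return (jobs_released_in_window + carry_in_jobs) * mb


# Illustrative values: Di = 10, Ti' = 4, 100 accesses per job on this bank.
sample_bound = overlapping_accesses_bound(10, 4, 100)
```

Integer time units are used here so that the ceiling is exact; with real-valued parameters a rational type would avoid rounding problems.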
Solving MILP
Finding a solution to the MILP expressed by Fig. 4 and Fig. 7 is challenging because (i) the number of variables and constraints is large and (ii) BIG is much larger than the other constants, which causes numerical issues. Therefore, we will rewrite the MILP to avoid numerical issues. We will also present different methods for solving the MILP; they differ in (i) the amount of time they take to finish and (ii) whether a solution is guaranteed to be found if one exists. They all have in common, however, that they return a tuple ⟨flag, o⟩ such that if flag is true, then the MILP is feasible. It can be seen in Fig. 7 that changing the domain of mb i,j,g,b from non-negative integer to non-negative real does not change the feasibility of the MILP. The same applies to mmbo i,j,g,b , oat i,j,g,b , and oao i,j,g,b . We will now rewrite the constraints without changing feasibility but so that numerical issues are avoided. Let SCALINGFACTORNACCESSES be an integer that we choose (e.g. SCALINGFACTORNACCESSES = 2^23). It can be seen that multiplying the right-hand sides of (38), (39), (40), (41) × coat i,j,g,b does not change feasibility. It can also be seen that dividing the right-hand side of (22) by SCALINGFACTORNACCESSES and replacing
× coat i,j,g,b does not change feasibility. Figure 8 shows these rewritten expressions. This leaves us with the question of how to choose SCALINGFACTORNACCESSES. We do it as follows.
    choose SCALINGFACTORNACCESSES ≥ (51)/(52) such that it is equal to two raised to some integer.
end if
In this way, the parameter BIG is kept small. We will now present the methods.
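One way to pick such a scaling factor is sketched below. The text only requires a power of two that is at least the ratio (51)/(52); choosing the smallest such power, and returning 1 when the ratio is at most 1, are assumptions of this sketch, as is the function name.

```python
from fractions import Fraction


def power_of_two_at_least(ratio: Fraction) -> int:
    """Smallest power of two (2**e with integer e >= 0) that is >= ratio."""
    if ratio <= 1:
        return 1  # 2**0 already dominates the ratio (assumed base case)
    power = 1
    while power < ratio:
        power *= 2
    return power


# Illustrative ratio: a bound of 5,000,000 accesses against a bound of 1.
scaling_factor = power_of_two_at_least(Fraction(5_000_000, 1))
```

A power of two is convenient because multiplying or dividing floating-point coefficients by it changes only the exponent, so the scaling itself introduces no rounding error.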
Method 1
Method 1 is guaranteed to output a solution if a solution exists. Method 1 is to simply take the constraints in Fig. 4 and Fig. 7 and solve the MILP. If there exists an assignment of values to the variables so that the constraints in Fig. 4 and Fig. 7 are satisfied then flag is true and o is the values of the o-variables; otherwise flag is false and o is undefined.
Method 2
Method 2 is guaranteed to output a solution if a solution exists. We can reason as follows: If there is a feasible solution, then it holds that for each cache color, the pages that are mapped to frames of this cache color all belong to the same task (otherwise (29) would be violated). Let occupiescachecolor i,h be 1 if τi occupies cache color h; otherwise 0. If, for this solution, there is a task τi and a task τi' and a cache color h and a cache color h' such that i < i' and h > h' and occupiescachecolor i,h = 1 and occupiescachecolor i',h' = 1, then we can change the o-values of the solution so that each page of τi that was mapped to h is mapped to h' and each page of τi' that was mapped to h' is mapped to h. Also update the x-values accordingly. This gives us a new feasible solution such that i < i' and h' < h and occupiescachecolor i,h' = 1 and occupiescachecolor i',h = 1. Repeating this argument yields that for each τi, tasks with lower index than τi only occupy cache colors of lower index and tasks with higher index than τi only occupy cache colors of higher index. If there is a cache color h that is not occupied by any task, then we can identify all tasks that occupy cache colors of index greater than h and let each of their memory allocations use a cache color whose index is 1 less. Also update the x-values accordingly.
For this reason, we can, without loss of generality, add the following constraint:
Method 2 is like Method 1 but with the constraint above.
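The end result of the swapping and compaction argument is a canonical relabeling of cache colors. The sketch below computes such a relabeling for a given assignment; it assumes that every cache color is occupied by at most one task, as (29) requires, and it does not reproduce the missing constraint itself. The function name and the data layout are hypothetical.

```python
from typing import Dict, Set


def canonical_color_relabeling(colors_of_task: Dict[int, Set[int]]) -> Dict[int, int]:
    """Map old cache colors to new ones so that lower-indexed tasks get lower colors.

    colors_of_task -- for each task index, the set of cache colors it occupies
    Returns a dict old_color -> new_color using colors 1, 2, 3, ... without gaps.
    """
    seen: Set[int] = set()
    for colors in colors_of_task.values():
        if seen & colors:
            raise ValueError("a cache color is shared by two tasks")
        seen |= colors

    relabel: Dict[int, int] = {}
    next_color = 1
    for task in sorted(colors_of_task):            # tasks in index order
        for old in sorted(colors_of_task[task]):   # keep relative order per task
            relabel[old] = next_color
            next_color += 1
    return relabel


# Illustrative assignment: task 1 holds colors {2, 5}, task 2 holds {1}, task 3 holds {7}.
mapping = canonical_color_relabeling({1: {5, 2}, 2: {1}, 3: {7}})
```

Applying the returned mapping to every page's cache color gives an assignment in which task indices and color indices are ordered consistently, which is the kind of solution the additional constraint of Method 2 keeps.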
Method 3
Method 3 is not guaranteed to output a solution if a solution exists. Method 3 is defined as follows.
1. Let the following variables be non-negative real numbers: loadfactorofcells, utilconsideringcont, loadofdeadlinei, myobj.
2. Let utilconsideringcont and loadofdeadlinei be defined as follows: utilconsideringcont = ( τ i ∈τ
)/(m×s) and loadofdeadlinei = ( τ i ∈τ max(
3. Solve the following problem: minimize myobj subject to $cu_i = \sum_{j=1}^{ns_i} (nseg_{i,j} \times cu_{i,j})$ and the constraints in Fig. 7 and return ⟨false, o⟩ where o is undefined.
We solve the optimization problem in step 3 as follows. We run the solver in rounds of 3600 seconds and check the status after each round. We finish this process if one of the following conditions is true: (i) an optimal solution has been found or (ii) at the current check a feasible solution has been found, at the preceding check a feasible solution had been found as well, and the MIP-gap (the gap between the objective value of the current solution and the best bound) has not changed between these checks. As a result, the solution we obtain from step 3 is either optimal or such that it did not change during the most recent 3600 seconds of solver time.
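The stopping rule for step 3 can be sketched independently of any particular solver. In the sketch below, `run_round` is a hypothetical callback that runs the solver for the given number of seconds and reports its status; `Status`, `solve_in_rounds` and the `max_rounds` cap are assumptions made for illustration and are not part of a real solver API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Status:
    optimal: bool          # solver proved optimality
    feasible: bool         # an incumbent (feasible solution) exists
    gap: Optional[float]   # MIP-gap of the incumbent, None if there is none


def solve_in_rounds(
    run_round: Callable[[float], Status],
    round_seconds: float = 3600.0,
    max_rounds: int = 1000,
) -> Optional[Status]:
    """Run the solver in fixed-length rounds and apply the stopping rule."""
    previous: Optional[Status] = None
    for _ in range(max_rounds):
        current = run_round(round_seconds)
        if current.optimal:
            return current  # condition (i)
        if (
            previous is not None
            and previous.feasible
            and current.feasible
            and current.gap == previous.gap
        ):
            return current  # condition (ii): no progress during the last round
        previous = current
    return previous  # round cap reached (assumed safeguard, not in the text)


# Scripted stand-in for a solver: no incumbent, then gaps 0.5, 0.3, 0.3.
script: List[Status] = [
    Status(False, False, None),
    Status(False, True, 0.5),
    Status(False, True, 0.3),
    Status(False, True, 0.3),
    Status(True, True, 0.0),
]
calls: List[float] = []


def scripted_round(seconds: float) -> Status:
    calls.append(seconds)
    return script[len(calls) - 1]


final_status = solve_in_rounds(scripted_round)
```

In the scripted run the loop stops after the fourth round, because the gap did not change between the third and fourth checks, so the fifth (optimal) status is never requested.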
The intuition behind the optimization problem of Step 3 is that we would like to find a memory allocation for which the MILP of Step 4 is feasible whenever the MILP in Method 2 is feasible. We do this by making sure that the load on the most heavily utilized memory cell is not too high (i.e. loadfactorofcells ≤ 1) and that conditions that approximate the schedulability test in the MILP of Method 2 are satisfied. Specifically, when t = ∞, checking whether the ffdbf is at most the supply is equivalent to checking utilconsideringcont ≤ 1. But there are many other durations for which we need to check whether the ffdbf is at most the supply. We only explore those durations that are equal to a deadline; hence we check ∀τi ∈ τ : loadofdeadlinei ≤ myobj.
Evaluation
In this section, we address the following questions: (i) how long does it take to perform the schedulability test (solve the MILP) and (ii) how pessimistic is our schedulability test. We will use Method 3. We will solve the MILP with Gurobi 6.0, a state-of-the-art solver. Consider the system in Fig. 9. It models a hypothetical autonomous system with 4 processors. Task τ1 performs sensor fusion (it first reads the sensors in its 1st stage, then performs parallel processing in its 2nd stage, and then merges the results in its 3rd stage). Task τ2 is a mission controller task (it takes high-level decisions about the mission, e.g., whether the mission should be aborted). Task τ3 recomputes the current plans when a certain critical event occurs (its 2nd stage performs computations in parallel). We will run our evaluation by varying parameters of this system. Specifically, we will vary C1,2 and MAi,j,p. We will vary C1,2 by simply setting it to a new value. We will vary MAi,j,p of all segments by multiplying them by mult. For example, mult = 0 means that all segments perform no memory accesses, and mult = 1 means that the upper bounds on the memory accesses are the same as in Fig. 9. Table 3 shows the outcome of our evaluation for m = 4 and Table 4 shows the outcome of our evaluation for m = 8.
The first column shows the value of C1,2. The second column shows the number of memory accesses to a page relative to the number of memory accesses stated in Fig. 9. If the value in the column is 1 then the number of memory accesses to a page is equal to the number of memory accesses stated in Fig. 9. The third column indicates the amount of time it takes to perform the schedulability analysis (with the MILP). The fourth column indicates whether that schedulability analysis provides a guarantee that the taskset is schedulable.
Consider the cases with C1,2 = 0.005 in Table 3. It can be seen that mult = 10 results in not-schedulable and mult = 4 results in schedulable. The reason for this is as follows. For the case mult = 10, in the first phase (when we obtain o), we obtain that the objective function is 0.78 and the left-hand side of the inequality in (17) is 0.78 × 4 = 3.12 (because there is a very large number of memory accesses). This requires that we choose k/K ≥ 0.78. Since K = 20, we obtain that k ≥ 16 and hence k/K ≥ 0.80; for k = 16 this gives the right-hand side of (17) the value 4 − (4 − 1) × 0.8 = 1.6. Hence, the left-hand side of the inequality in (17) is larger than its right-hand side and consequently this constraint is violated. With larger k, we obtain the same conclusion: the constraint is violated.
Hence, this MILP is infeasible.
For the case mult = 4, we obtain another outcome though. In the first phase (when we obtain o), we obtain that the objective function is 0.302 and the left-hand side of the inequality in (17) is 0.302 × 4 = 1.208 (because there are fewer memory accesses). This requires that we choose k/K ≥ 0.302. Since K = 20, we obtain that k ≥ 7 and hence k/K ≥ 0.35; for k = 7 this gives the right-hand side of (17) the value 4 − (4 − 1) × 0.35 = 2.95. Hence, the left-hand side of the inequality in (17) is less than its right-hand side and consequently this constraint is satisfied. It turns out that there is a solution (we got the solution for k = 18).
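The arithmetic of the two cases can be reproduced with exact rational numbers. The sketch below assumes, based only on the worked numbers in the text and not on (17) itself, that the left-hand side is the objective value times m and the right-hand side is m − (m − 1) × k/K; the function name and argument layout are hypothetical.

```python
from fractions import Fraction
from math import ceil
from typing import Optional, Tuple


def check_constraint_17(
    objective: str, m: int, K: int, k: Optional[int] = None
) -> Tuple[int, Fraction, Fraction, bool]:
    """Return (k, lhs, rhs, satisfied) for the assumed shape of constraint (17).

    objective -- objective value from the first phase, given as a decimal string
    m         -- number of processors
    K         -- number of discretization steps
    k         -- step to test; defaults to the smallest k with k/K >= objective
    """
    obj = Fraction(objective)
    if k is None:
        k = ceil(obj * K)
    lhs = obj * m
    rhs = m - (m - 1) * Fraction(k, K)
    return k, lhs, rhs, lhs <= rhs


# mult = 10: objective 0.78, m = 4, K = 20.
heavy = check_constraint_17("0.78", 4, 20)

# mult = 4: objective 0.302, m = 4, K = 20, smallest admissible k and k = 18.
light = check_constraint_17("0.302", 4, 20)
light_k18 = check_constraint_17("0.302", 4, 20, k=18)
```

Because the assumed right-hand side decreases as k grows, a violation at the smallest admissible k implies a violation for every larger k, which matches the conclusion for mult = 10. For mult = 4 the constraint still holds at k = 18, where the right-hand side is 1.3.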
We have rerun some of the experiments and found that for a given setting, the time required can vary. This is because the MILP solver solves multiple LPs and does this concurrently and hence the MILP solver is non-deterministic; for a given setup, the progress that it makes within one hour can vary. Hence, when using Method 3 for a given setting, the time required can be different for different runs. Note that this non-determinism is only about the running time; the output (true/false) of the algorithm is deterministic and hence it does not lead to unsafe results. 
