Most schedulability analysis techniques for multi-core architectures assume a single worst-case execution time (WCET) per task, which is valid in all execution conditions. This assumption is too pessimistic for parallel applications running on multi-core architectures with local instruction or data caches, for which the WCET of a task depends on the cache contents at the beginning of its execution, which in turn depend on the tasks that were executed immediately before the task under study. In this paper, we propose two scheduling techniques for multi-core architectures equipped with local instruction and data caches. The two techniques schedule a parallel application modeled as a task graph, and generate a static, partitioned, non-preemptive schedule that benefits from cache reuse between pairs of consecutive tasks. We propose an exact method, using an integer linear programming formulation, as well as a heuristic method based on list scheduling. The efficiency of the techniques is demonstrated through an implementation of these cache-conscious schedules on real multi-core hardware: a 16-core cluster of the Kalray MPPA-256, Andey generation. We point out implementation issues that arise when implementing the schedules on this particular platform. In addition, we propose strategies to adapt the schedules to the identified implementation factors. An experimental evaluation reveals that our proposed scheduling methods significantly reduce the length of schedules as compared to cache-agnostic scheduling methods. Furthermore, our experiments show that among the identified implementation factors, shared bus contention has the most impact.
Keywords Real-time scheduling · Cache-conscious schedules · Schedule implementation · Multi-core architectures · ILP · Static list scheduling

Introduction
Real-time embedded systems, i.e., those for which timing requirements prevail over performance requirements, are now widespread in our everyday lives. In particular, real-time applications can be found in cars, airplanes, spacecraft, and nuclear plants. With rising demand for applications that are increasingly compute-intensive and parallel, traditional single-core architectures are no longer a suitable choice for deploying real-time systems. This limitation of single-core architectures is typically referred to as the power wall (Venkatachalam and Franz 2005). To overcome this barrier, leading chip manufacturers have introduced multi-core platforms, in which multiple cores are integrated within a single chip. Multi-core platforms have been shown to improve energy efficiency and performance-per-cost ratio compared with single-core models (Geer 2005), mainly by exploiting thread-level parallelism. Examples of multi-core architectures include the Kalray MPPA-256 (Dupont de Dinechin et al. 2014), the Tilera Tile CPU line (Wentzlaff et al. 2007), and the Intel Xeon Phi (Sodani et al. 2016).
One of the important challenges of implementing safety-critical parallel applications on multi-core platforms is to guarantee that real-time constraints will be met under all the possible execution conditions of the applications. It is difficult to precisely estimate the Worst-Case Execution Time (WCET) of tasks that execute on multiple cores simultaneously. Traditional WCET estimation methods were designed for single-core architectures (Wilhelm et al. 2008): they account for program execution paths and the characteristics of the core micro-architecture, but ignore multi-core factors such as caches and buses that may be shared between cores (Nélis et al. 2014; Fernandez et al. 2014). Additionally, on architectures with local caches, the WCET of a task is affected by the contents of the cache at the beginning of execution, which in turn is affected by the task execution order. The net effect is that the WCET of any particular task can no longer be computed in isolation: it depends on the execution context of the task, including previous and concurrent tasks, which is ultimately defined by the task scheduler. While it is possible to continue using the traditional context-independent WCET, the results will be overly pessimistic. The following example illustrates the high degree of variation in WCET that can be caused solely by changes in the execution context.
Motivating example Let us consider an overly simplified parallel application made of three tasks, named T1, T2, and T3, executing on a dual-core platform, where each core is equipped with a unified private cache containing two lines. T1 and T2 access the same memory block, named m, whereas both the code and data of T3 are independent of tasks T1 and T2. Assuming that the cache is empty at the beginning of execution, we present two scenarios that demonstrate how the scheduling strategy can affect the WCET of each task. In the first scenario, illustrated in the left side of Fig. 1, the request of T2 for the memory block m is a cache miss, since at request time, the memory block has not been loaded in the cache yet. In the second scenario, illustrated in the right side of Fig. 1, the request of T2 for the memory block m is a cache hit since the memory block was already loaded into the cache by T1. In most machines a cache miss takes one or more orders of magnitude longer (in cycles) than a cache hit, since it requires loading the requested data from main memory. Therefore, the worst-case execution time of T2 in the second scenario is much lower than its worst-case execution time in the first scenario. This example shows that on architectures equipped with private caches, the execution order of tasks along with their assignment to specific cores can have an impact on the WCET of the tasks. Circularly, the task scheduler needs to know the WCET of each task so that it can determine a feasible sequence of tasks as assigned to specific cores. Therefore task scheduling and WCET estimation for multi-core platforms are interdependent: it is a chicken-and-egg problem. By making the scheduling strategies aware of the effect that execution context can have on WCET, we believe that the overall efficiency of parallel applications can be improved.

Fig. 1 The influence of scheduling strategies on the WCET of tasks
In this paper, we propose two cache-aware scheduling strategies that take advantage of cache reuse between pairs of consecutive tasks. Instead of assigning a single WCET to each task, we assign a set in which each WCET is associated with a task that could potentially precede it on the same core. This context-sensitive WCET takes into account the variation in execution time caused by the contents of the cache that remain after the preceding task completes. The objective of our proposed scheduling strategies is to minimize the schedule length (known as the makespan) by accounting for cache reuse. Throughout this paper we focus on a single parallel application, modeled as a task graph, in which each node represents a task and each edge represents a dependence relation between two tasks.
To further motivate our work, let us consider an example of scheduling an 8-input Fast Fourier Transform application (Bahn et al. 2008) on a 2-core platform. As shown in Fig. 2, in the task graph of the application, T2 and T3 feature code reuse since they call the same function, and T2 and T6 feature data reuse since the output of T2 is the input of T6. In that example, we observe a reduction in WCET of 10.7% on average when taking into account the cache affinity between pairs of tasks that may execute consecutively on the same core. By using the method to be presented in Sect. 3 to generate the cache-conscious schedule for that application, we observe an 8% reduction in the schedule length compared with its cache-agnostic equivalent.
Once the cache-conscious schedules are generated, our next objective is to implement these schedules on real multi-core hardware using a Kalray MPPA-256 (Dupont de Dinechin et al. 2014). In the implementation stage, we first identify the implementation challenges that arise when deploying those schedules on the platform, such as shared bus contention, the effect of our time-driven scheduler itself, and the lack of hardware-implemented data cache coherence. These implementation factors require us to make adjustments to our cache-conscious schedules. We thus propose a strategy to adapt the cache-conscious schedules to the identified implementation factors, such that the precedence relations of tasks are still satisfied, and the length of the adapted schedules is minimized.
The main contributions of this paper are as follows:
- We argue the importance of accounting for the effect of private caches on the WCET, and validate our position with experimental results.
- We propose an ILP-based scheduling method and a heuristic scheduling method to statically find a time-driven, partitioned, non-preemptive schedule of a parallel application modeled as a directed acyclic graph.
- We provide experimental results showing that the proposed scheduling techniques result in shorter schedules than their cache-agnostic equivalents.
- We identify implementation issues that arise when implementing cache-conscious schedules on the Kalray MPPA-256, and propose strategies for overcoming them.
- We investigate the impact of various implementation factors on cache-conscious schedules.
The rest of this paper is organized as follows. Section 2 gives the overview of our hardware target, and describes the abstract model of the hardware platform, as well as the task model used in cache-conscious schedule generation. Section 3 introduces two cache-conscious scheduling techniques: one based on an Integer Linear Programming (ILP) formulation and a heuristic based on list scheduling. The implementation of cache-conscious schedules on the Kalray MPPA-256 is presented in Sect. 4, where we describe the execution conditions of an application on the platform and propose our time-driven scheduler implementation. We also identify the implementation challenges that arise when deploying cache-conscious schedules on the platform, and introduce our strategies for overcoming these issues. In Sect. 5 we present an experimental evaluation of our proposed scheduling methods and our schedule implementation. Section 6 surveys related work. Finally, we summarize the contents of the paper and provide directions for future work in Sect. 7.
System model

Hardware model
Our target architecture is the Kalray MPPA-256 (Dupont de Dinechin et al. 2014), more precisely its first generation, named Andey. The Kalray MPPA-256 is a clustered many-core platform containing 288 cores which are organized into 16 compute clusters and 4 I/O clusters. These clusters are interconnected with a dual 2D-torus Network on Chip (NoC). In this study we generate cache-conscious schedules and implement them on a single compute cluster. An overview of a Kalray MPPA-256 compute cluster is given below.
Overview of a Kalray MPPA-256 compute cluster
A Kalray MPPA-256 compute cluster contains 17 identical VLIW (Very Long Instruction Word) cores. The first 16 cores, referred to as processing elements (PEs), are dedicated to general-purpose computations, while the 17th core, referred to as the resource manager (RM), manages processor resources for the entire cluster. Additionally, a Kalray MPPA-256 compute cluster contains a Debug Support Unit (DSU), a NoC Rx interface for receiving data, and a NoC Tx interface for transmitting data (supported by a DMA, i.e., Direct Memory Access, engine).
As announced in Dupont de Dinechin et al. (2014), every core in the Kalray MPPA-256 is fully timing-compositional (Wilhelm et al. 2009). Each core is equipped with a private instruction cache and a private data cache of 8 KB each. Both are two-way associative with a Least Recently Used (LRU) replacement policy. The default write policy of the data cache is write-through. Data flushed from the data cache is not immediately committed to the shared memory: the flushed data is temporarily held in a write buffer. Since there is no hardware-implemented data cache coherence between cores, the consistency of shared data between cores must be managed at the software level.
Tasks executing on different cores in the same cluster communicate through the shared memory (SMEM), which comprises 16 independent memory banks of 128 KB each, for a total capacity of 2 MB. Each memory bank is associated with a dedicated request arbiter that serves 12 bus masters: the D-NoC Rx interface, the D-NoC Tx interface, the DSU, the RM core, and 8 PE pairs. Each bus master has private paths connected to the 16 memory bank arbiters. The arbitration of memory requests to the SMEM's banks is performed in three stages, depicted in Fig. 3. The first two stages use a round-robin (RR) arbitration scheme. The first stage arbitrates between memory requests from the instruction cache (IC) and the data cache (DC) of each PE in a pair. In the second stage, the requests issued from each PE pair compete against those issued from other PE pairs, the D-NoC Tx, the DSU, and the RM. Finally, at the third stage, the requests compete against those coming from the D-NoC Rx under static-priority arbitration, where the requests from the D-NoC Rx always have higher priority.
Abstract model of a Kalray MPPA-256 compute cluster
To improve the generality of our cache-conscious scheduling techniques and make them usable on other architectures, we focus on a hardware model that abstracts away as many architectural details of the Kalray MPPA-256 as possible. This abstract model of a Kalray MPPA-256 compute cluster is illustrated in Fig. 4. All cores are homogeneous, and each core is equipped with a private instruction cache and a private data cache. Tasks executing on different cores communicate through the shared memory. Furthermore, we assume that:
- Tasks access the shared bus without contention;
- There is no cost for triggering a task at any specific instant of time.
The overheads for bus contention and task triggering, as well as other hardware-related overheads will be addressed in the implementation stage, to be presented in Sect. 4.
Task model
We model an application as a Directed Acyclic Graph (DAG) (Kwok and Ahmad 1999a), as illustrated in Fig. 2. A node in the DAG represents a task, and an edge represents a precedence relation between the source and target tasks, and may also indicate a transfer of information between them. A task can start executing only when all its direct predecessors have finished executing, and after all data transmitted from its direct predecessors are available. A task with no direct predecessor is an entry task, whereas a task with no direct successor is an exit task. Without loss of generality, it is assumed that there is a single entry task and a single exit task per application. The structure of the DAG is static, with no conditional execution of nodes. The volume of data transmitted along the edges (possibly zero) is known offline. Each task in the DAG is assigned a distinct integer identifier.
A communication for a given edge is implemented using transfers to and from a dedicated buffer located in the shared memory. The worst-case cost for writing data to and reading data from the buffer is integrated in the WCETs of the sending and the receiving tasks.
Due to the effect of caches, each task $\tau_j$ is not characterized by a single WCET value but instead by a set of WCET values. The set of WCET values for a task contains: (i) its most pessimistic WCET value, noted $WCET_{\tau_j}$, observed when there is no reuse of cache contents loaded by the task executed before on the same core; (ii) its WCET when the task reuses data and/or instruction cache contents from a directly preceding task on the same core. For example, the symbol $WCET_{\tau_i \to \tau_j}$ represents the WCET of task $\tau_j$ when $\tau_j$ reuses data and/or instruction cache contents from a directly preceding task $\tau_i$ on the same core. Note that the definition of the WCET of a task in this paper differs from the traditional definition, where the WCET of a task is simply the upper bound of its execution time in isolation. In contrast, our definition of a task's WCET is the upper bound of its execution times when taking into account the cache contents left by the task executed immediately before. Also note that for a task $\tau_j$ to benefit from cache reuse from a task $\tau_i$, $\tau_i$ has to be scheduled before $\tau_j$ on the same core (with no task scheduled in between). However, the end time of $\tau_i$ need not coincide with the start time of $\tau_j$. The scheduling algorithm may insert idle time between $\tau_i$ and $\tau_j$, for example to respect dependencies between tasks, while still benefiting from cache reuse.
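As an illustration of this task model, the set of WCET values attached to each task can be stored as a small lookup table indexed by the preceding task. The C sketch below is only one possible encoding: the type `wcet_table_t`, the bound `MAX_TASKS`, the sentinel `NO_PRED`, and the convention that a zero entry means "no reuse value available" are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TASKS 16   /* hypothetical upper bound on the number of tasks */
#define NO_PRED   (-1) /* no task ran before on this core */

/* Context-sensitive WCETs of one application (illustrative layout). */
typedef struct {
    int      n_tasks;
    uint64_t wcet_alone[MAX_TASKS];            /* WCET_{tau_j}: no cache reuse */
    uint64_t wcet_after[MAX_TASKS][MAX_TASKS]; /* [i][j] = WCET_{tau_i -> tau_j},
                                                  0 = no value available (assumption) */
} wcet_table_t;

/* WCET of task j when task prev ran immediately before it on the same core. */
uint64_t wcet_in_context(const wcet_table_t *t, int prev, int j)
{
    assert(j >= 0 && j < t->n_tasks);
    assert(prev == NO_PRED || (prev >= 0 && prev < t->n_tasks));
    if (prev == NO_PRED || t->wcet_after[prev][j] == 0)
        return t->wcet_alone[j];   /* fall back to the most pessimistic value */
    return t->wcet_after[prev][j];
}
```

Falling back to the pessimistic value whenever no reuse value is known keeps the lookup safe: it can only overestimate the execution time of the task.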
Cache-conscious scheduling algorithms
Our proposed scheduling methods take as inputs (a) the number of cores available, and (b) the DAG of a single parallel application decorated with context-sensitive WCET information for each task. The result is a time-driven, partitioned, non-preemptive schedule of the application. More precisely, the schedule for each core determines the start and finish times of all tasks assigned to the core. The objective is to find schedules with the shortest possible length (also known as makespan). We introduce two scheduling techniques:
- An ILP formulation that finds an optimal schedule of the application, i.e., one whose length is minimal under the considered assumptions (see Sect. 3.1);
- A heuristic method based on list scheduling that finds a schedule quickly (see Sect. 3.2). The length of the schedule is usually very close to the optimal one, as demonstrated in our experiments in Sect. 5.

The notation we use to describe our scheduling methods is summarized in Table 1. The first block defines notation for the task graph. A task $\tau_i$ is a direct predecessor of a task $\tau_j$ if there is an edge from $\tau_i$ to $\tau_j$ in the task graph. A task $\tau_i$ is an indirect predecessor of a task $\tau_j$ if there is an edge from $\tau_i$ to $\tau_j$ in the transitive closure of the task graph. For instance, in the motivating example of Fig. 2, T1 is a direct predecessor of T2 and an indirect predecessor of T14. The second block defines integer constants using upper-case letters. Finally, the third block defines variables using lower-case letters.
Cache-conscious ILP formulation
In this section we present our ILP formulation for cache-conscious scheduling, which we call CILP for "Cache-conscious ILP". Since cores are identical in our abstract model, the execution time of a task is not affected by the properties of the cores. Based on that observation, CILP focuses on constructing sequences of co-located tasks, which includes defining the start time and the finish time of each task in a sequence. Given these sequences, the assignment of tasks to cores is straightforward (each sequence is simply assigned to a core).
The objective function of CILP is to minimize the schedule length $sl$ of the parallel application, which is expressed as follows:

$$\text{minimize } sl \quad (1)$$

Since the schedule length for the parallel application has to be greater than or equal to the finish time $ft_{\tau_j}$ of any task $\tau_j$, the following constraint is required:

$$sl \ge ft_{\tau_j} \quad \forall \tau_j \quad (2)$$

The finish time $ft_{\tau_j}$ of a task $\tau_j$ is equal to the sum of its start time $st_{\tau_j}$ and its worst-case execution time $wcet_{\tau_j}$:

$$ft_{\tau_j} = st_{\tau_j} + wcet_{\tau_j} \quad (3)$$
In the above equation, variable $wcet_{\tau_j}$ models the variation in WCET of a task caused by private caches, and is computed as follows:

$$wcet_{\tau_j} = f_{\tau_j} \cdot WCET_{\tau_j} + \sum_{\tau_i} o_{\tau_i \to \tau_j} \cdot WCET_{\tau_i \to \tau_j} \quad (4)$$

where the sum ranges over the tasks $\tau_i$ that may immediately precede $\tau_j$ on the same core. The multiplicative term on the left corresponds to the case where task $\tau_j$ is the first task running on a core ($f_{\tau_j} = 1$). The summation term on the right corresponds to the case where the task $\tau_j$ is scheduled just after another co-located task $\tau_i$ ($o_{\tau_i \to \tau_j} = 1$). As shown later, only one of the binary variables among $f_{\tau_j}$ and $o_{\tau_i \to \tau_j}$ will be set by the ILP solver, such that exactly one of these WCET values will be assigned to $\tau_j$. The assignment depends solely on the preceding task (if any).
Constraints on the start time of tasks A task can be executed only when all of its direct predecessors have finished executing. In other words, the start time of a task must be greater than or equal to the finish times of all its direct predecessors:

$$st_{\tau_j} \ge \begin{cases} ft_{\tau_i}, & \forall \tau_i \in dPred(\tau_j) \\ 0, & \text{otherwise} \end{cases} \quad (5)$$

The final term in the above formula indicates that when a task has no predecessor, its start time must be greater than or equal to zero.
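When a co-located task $\tau_i$ is scheduled immediately before $\tau_j$, $\tau_j$ cannot start before $\tau_i$ finishes. In other words, the start time of $\tau_j$ must be greater than or equal to the finish time of $\tau_i$:

$$st_{\tau_j} \ge o_{\tau_i \to \tau_j} \cdot ft_{\tau_i} \quad (6)$$

Note that $\tau_j$ can be scheduled only after a task $\tau_i$ that is neither its direct nor indirect successor.

In order to linearize equation (6), we use classical big-M notation, which is expressed as:

$$st_{\tau_j} \ge ft_{\tau_i} + (o_{\tau_i \to \tau_j} - 1) \cdot M \quad (7)$$

where $M$ is a constant greater than any possible $ft_{\tau_j}$. When $o_{\tau_i \to \tau_j} = 1$ the constraint reduces to $st_{\tau_j} \ge ft_{\tau_i}$; when $o_{\tau_i \to \tau_j} = 0$ the right-hand side is negative and the constraint is trivially satisfied.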
Constraints on the execution order of tasks A task is preceded by exactly one other task on the same core unless it is the first scheduled task:

$$f_{\tau_j} + \sum_{\tau_i} o_{\tau_i \to \tau_j} = 1 \quad \forall \tau_j \quad (8)$$

The number of cores used is defined by the number of first scheduled tasks (number of variables $f_{\tau_j}$ equal to 1). Since the number of cores is $K$, the number of cores used has to be at most $K$:

$$\sum_{\tau_j} f_{\tau_j} \le K \quad (9)$$

Finally, a task has at most one co-located task scheduled immediately after it. This is expressed as:

$$\sum_{\tau_i} o_{\tau_j \to \tau_i} \le 1 \quad \forall \tau_j \quad (10)$$
An ILP solver produces results to the mapping/scheduling problem in the form of two sets of variables:

1. Task mapping is defined by variables $f_{\tau_j}$ and $o_{\tau_i \to \tau_j}$, which represent sets of co-located tasks along with the execution order within each set.
2. The static schedule for a core is defined by variables $st_{\tau_j}$ and $ft_{\tau_j}$, which represent the start and finish time of the tasks assigned to that core.
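To make the role of these variables concrete, the following C sketch checks a candidate solution of a toy three-task instance against the timing and ordering constraints described in this section and returns its schedule length. It is an illustrative checker, not an ILP solver. The names `instance_t`, `solution_t`, and `check_schedule`, the encoding of $f$ and $o$ through a `prev` array, and the use of a zero return value to signal a violation (which presumes strictly positive WCETs) are assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>

#define N       3     /* number of tasks in this toy instance */
#define NO_PRED (-1)  /* encodes f_j = 1: task j is first on its core */

typedef struct {
    int      dpred[N][N];      /* dpred[i][j] != 0: tau_i is a direct predecessor of tau_j */
    uint64_t wcet_alone[N];    /* WCET_{tau_j} */
    uint64_t wcet_after[N][N]; /* [i][j] = WCET_{tau_i -> tau_j}, 0 = unspecified */
    int      K;                /* number of cores */
} instance_t;

typedef struct {
    uint64_t st[N];   /* start times chosen by the solver */
    int      prev[N]; /* prev[j] = i encodes o_{i->j} = 1 */
} solution_t;

/* Returns the schedule length, or 0 if a constraint is violated. */
uint64_t check_schedule(const instance_t *in, const solution_t *s)
{
    uint64_t ft[N], sl = 0;
    int first = 0, succ_count[N] = {0};

    for (int j = 0; j < N; j++) {
        int i = s->prev[j];
        uint64_t w = in->wcet_alone[j];
        if (i == NO_PRED) {
            first++;                              /* one more task sequence */
        } else {
            if (i < 0 || i >= N || i == j) return 0;
            if (++succ_count[i] > 1) return 0;    /* at most one immediate successor */
            if (in->wcet_after[i][j] != 0) w = in->wcet_after[i][j];
        }
        ft[j] = s->st[j] + w;                     /* finish = start + context WCET */
        if (ft[j] > sl) sl = ft[j];               /* schedule length >= every finish time */
    }
    if (first > in->K) return 0;                  /* at most K sequences (cores) */

    for (int j = 0; j < N; j++) {
        int i = s->prev[j];
        if (i != NO_PRED && s->st[j] < ft[i]) return 0;        /* co-located order */
        for (int p = 0; p < N; p++)
            if (in->dpred[p][j] && s->st[j] < ft[p]) return 0; /* precedence */
    }
    return sl;
}
```

For instance, with edges from task 0 to tasks 1 and 2, placing tasks 0 and 1 on one core and task 2 on the other lets task 1 use its reduced WCET after task 0.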
Cache-conscious list scheduling method (CLS)
Finding an optimal solution to a partitioned, non-preemptive scheduling problem for a multi-core architecture is NP-hard (Kasahara and Narita 1984), and the exact method does not scale to large numbers of tasks, as shown in our experiments (see Sect. 5). Therefore, we developed a heuristic scheduling method that efficiently produces schedules even for a large number of tasks. This method is based on list scheduling (see Kwok and Ahmad (1999b) for a survey of list scheduling methods). Cache-conscious List Scheduling (CLS) begins with a list of tasks to be scheduled, scanning the list sequentially and scheduling each task without backtracking. For each task, CLS explores every potential core assignment that respects its precedence constraints. The core which allows the earliest finish time of the task is selected and the corresponding schedule is kept.
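The greedy step described above can be sketched in C as follows. This is a minimal illustration rather than the implementation evaluated in this paper: the types `app_t` and `schedule_t`, the bounds `MAX_TASKS` and `MAX_CORES`, the tie-breaking rule (lowest core index wins), and the convention that a zero entry in `wcet_after` means "no reuse value available" are assumptions. The input list is assumed to be in topological order.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TASKS 16   /* hypothetical bounds for the example */
#define MAX_CORES 4
#define NO_PRED   (-1)

typedef struct {
    int      n, K;                             /* number of tasks and of cores */
    int      edge[MAX_TASKS][MAX_TASKS];       /* edge[i][j] != 0: tau_i -> tau_j */
    uint64_t wcet_alone[MAX_TASKS];            /* WCET without cache reuse */
    uint64_t wcet_after[MAX_TASKS][MAX_TASKS]; /* [i][j]: WCET of j right after i */
} app_t;

typedef struct {
    int      core[MAX_TASKS];
    uint64_t st[MAX_TASKS], ft[MAX_TASKS];
    uint64_t length;
} schedule_t;

void cls(const app_t *a, const int *list, schedule_t *out)
{
    int      last[MAX_CORES];   /* last task placed on each core */
    uint64_t avail[MAX_CORES];  /* time at which each core becomes free */

    assert(a->K >= 1 && a->K <= MAX_CORES && a->n <= MAX_TASKS);
    for (int c = 0; c < a->K; c++) { last[c] = NO_PRED; avail[c] = 0; }
    out->length = 0;

    for (int k = 0; k < a->n; k++) {
        int j = list[k];
        uint64_t ready = 0;     /* latest finish time among direct predecessors */
        for (int p = 0; p < a->n; p++)
            if (a->edge[p][j] && out->ft[p] > ready) ready = out->ft[p];

        int best_c = -1;
        uint64_t best_st = 0, best_ft = 0;
        for (int c = 0; c < a->K; c++) {        /* try every core */
            uint64_t st = ready > avail[c] ? ready : avail[c];
            uint64_t w  = a->wcet_alone[j];
            if (last[c] != NO_PRED && a->wcet_after[last[c]][j] != 0)
                w = a->wcet_after[last[c]][j];  /* context-sensitive WCET */
            if (best_c < 0 || st + w < best_ft) {
                best_c = c; best_st = st; best_ft = st + w;
            }
        }
        out->core[j] = best_c; out->st[j] = best_st; out->ft[j] = best_ft;
        last[best_c] = j; avail[best_c] = best_ft;
        if (best_ft > out->length) out->length = best_ft;
    }
}
```

Because each task is placed once and never revisited, the cost of one run is proportional to the number of tasks times the number of cores, plus the predecessor scan.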
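To respect precedence constraints, task ordering in the list must follow a topological ordering (if $\tau_i$ is a direct predecessor or an indirect predecessor of $\tau_j$, $\tau_i$ appears before $\tau_j$ in the list). The list order is determined based on two classical metrics, both respecting topological order by construction. They both define for each task a weight ($tw_{\tau_j}$ for task $\tau_j$), based on the task WCET, as defined below. The bottom level metric defines for task $\tau_j$ the longest path from $\tau_j$ to the exit task ($\tau_j$ included), accumulating task weights along the path:

$$bottom\_level(\tau_j) = \begin{cases} tw_{\tau_j}, & \text{if } \tau_j \text{ is the exit task} \\ \max_{\tau_i \in dSucc(\tau_j)} bottom\_level(\tau_i) + tw_{\tau_j}, & \text{otherwise} \end{cases} \quad (11)$$

The top level metric symmetrically defines for task $\tau_j$ the longest path from the entry task to $\tau_j$ (excluding $\tau_j$ itself):

$$top\_level(\tau_j) = \begin{cases} 0, & \text{if } \tau_j \text{ is the entry task} \\ \max_{\tau_i \in dPred(\tau_j)} \left( top\_level(\tau_i) + tw_{\tau_i} \right), & \text{otherwise} \end{cases} \quad (12)$$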
The use of the direct successor function dSucc in Eq. 11 (respectively direct predecessor function dPred in Eq. 12) guarantees the topological ordering.
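Both metrics are longest-path computations over the DAG and can be written recursively. The C sketch below is illustrative only: the type `dag_t`, the adjacency-matrix encoding, and the bound `MAX_TASKS` are assumptions, and the recursion is left unmemoized because it targets small graphs.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TASKS 16  /* hypothetical bound for the example */

typedef struct {
    int      n;                          /* number of tasks */
    int      edge[MAX_TASKS][MAX_TASKS]; /* edge[i][j] != 0: tau_i -> tau_j */
    uint64_t tw[MAX_TASKS];              /* task weights */
} dag_t;

/* Longest weighted path from task j to an exit task, j included. */
uint64_t bottom_level(const dag_t *g, int j)
{
    assert(j >= 0 && j < g->n);
    uint64_t best = 0;
    for (int s = 0; s < g->n; s++)       /* scan direct successors */
        if (g->edge[j][s]) {
            uint64_t b = bottom_level(g, s);
            if (b > best) best = b;
        }
    return best + g->tw[j];
}

/* Longest weighted path from an entry task to task j, j excluded. */
uint64_t top_level(const dag_t *g, int j)
{
    assert(j >= 0 && j < g->n);
    uint64_t best = 0;
    for (int p = 0; p < g->n; p++)       /* scan direct predecessors */
        if (g->edge[p][j]) {
            uint64_t t = top_level(g, p) + g->tw[p];
            if (t > best) best = t;
        }
    return best;
}
```

With strictly positive weights, a predecessor always has a larger bottom level and a smaller top level than its successors, so sorting by decreasing bottom level or by increasing top level yields a topological order.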
What distinguishes CLS from existing scheduling techniques is its consideration of context-sensitive WCET coming from cache reuse. Since a task may have a different WCET for each of its potential predecessors, the weight of a task is defined to approximate the variability of its WCET. The weight $tw_{\tau_j}$ of a task $\tau_j$ is defined as:
This formula integrates the potential for the WCET of task τ j to be reduced, as well as the diminishing potential for WCET reduction as the number of cores increases.
As will be shown in Sect. 5.2, neither of the two metrics consistently outperforms the other for all task graphs. For this reason we kept both variations. For convenience, we define shorthand for specific forms of CLS:
- In CLS_BL, tasks are sorted according to the bottom level metric. In case of equality, the first tie-breaker is the top level metric, and remaining ties are broken arbitrarily by task identifier.
- In CLS_TL, tasks are sorted according to the top level metric, with ties broken first by the bottom level metric and then by task identifier.
- CLS indicates the better choice among the bottom level and top level metrics; i.e., the method giving the shorter schedule length for a particular task graph.
- NCLS is the cache-agnostic equivalent of CLS, and indicates the better choice among the bottom level and top level metrics for a system with no consideration of cache reuse. The weight of a task when using NCLS is its WCET ignoring cache reuse.
    #include <stdint.h>

    void sched(uint64_t triggerTime) {
        uint64_t curTimeStamp = 0;
        do {
            // get the timing information from the global cycle counter
            curTimeStamp = __k1_read_dsu_timestamp();
            // check the criterion for exiting from the loop
        } while (curTimeStamp < triggerTime);
    }

Listing 1 The code of the sched function

This section identifies the implementation factors that arise in the implementation of the schedules on a Kalray MPPA-256 compute cluster. We also propose strategies to overcome these implementation factors.
Assumptions on execution conditions
To limit contention among tasks when accessing the SMEM of the Kalray MPPA-256, we impose the following constraints:
- The code and data of the application must fit into the SMEM of the compute cluster. The Resource Manager (RM) will load the application entirely onto the cluster before the application starts, and will have no further role during execution of the application.
- The application is executed in isolation on a compute cluster to avoid potential contention from the NoC.
- Debug mode is prohibited to avoid contention from the Debug Support Unit (DSU).
These constraints simplify contention on the shared bus, such that contention can only occur between application tasks running on different cores (PEs). In other words, the arbitration of memory requests to the SMEM's banks is simplified from the three stages described in Sect. 2.1.1 to just one stage, in which access is granted according to an ordinary round-robin policy.
Time-driven scheduler
The Kalray MPPA-256 provides a timestamp global cycle counter for timing synchronization between cores in the cluster. In order to trigger the execution of a task at a specific instant of time, we implement a sched function (see Listing 1) to be invoked just prior to the task. The sched function repeatedly checks the task trigger time against the global cycle counter, and starts the task when it detects that its trigger time has been reached.
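The way the sched function is interleaved with the tasks of one core can be sketched as a simple dispatch loop. The C code below is an illustration under stated assumptions: the build flag `USE_KALRAY_DSU`, the wrapper `read_cycles`, the types and functions `slot_t`, `run_core`, and `demo_task`, and the simulated counter (advancing by 7 per read) used when the flag is absent are all hypothetical. Unlike Listing 1, this variant of `sched` returns the timestamp at which the loop exits so that the start delay can be observed.

```c
#include <assert.h>
#include <stdint.h>

#ifdef USE_KALRAY_DSU                 /* hypothetical build flag for the target */
#define read_cycles() __k1_read_dsu_timestamp()
#else
static uint64_t fake_cycles;          /* host-side stand-in for the cycle counter */
static uint64_t read_cycles(void) { return fake_cycles += 7; }
#endif

typedef struct {
    uint64_t trigger_time;            /* trigger time of the task, in cycles */
    void   (*body)(void);             /* task entry point */
} slot_t;

int demo_runs;                        /* hypothetical task used for illustration */
void demo_task(void) { demo_runs++; }

/* Busy-wait until the counter reaches trigger_time; return the time observed. */
static uint64_t sched(uint64_t trigger_time)
{
    uint64_t now;
    do {
        now = read_cycles();
    } while (now < trigger_time);
    return now;
}

/* Run the static schedule of one core; return the largest observed start delay. */
uint64_t run_core(const slot_t *slots, int n)
{
    uint64_t worst_delay = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        uint64_t started = sched(slots[i].trigger_time);
        uint64_t delay = started - slots[i].trigger_time;
        if (delay > worst_delay) worst_delay = delay;
        slots[i].body();              /* the task runs right after sched returns */
    }
    return worst_delay;
}
```

In this sketch the start delay of a task never exceeds the cost of one polling iteration, since the loop exits on the first counter value at or beyond the trigger time.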
Implementation challenges
Several complications arise in the implementation of time-driven, cache-conscious schedules on the Kalray MPPA-256:
Cache pollution caused by the scheduler
As described in Sect. 4.2, an instance of the scheduling function is interleaved between each pair of consecutive tasks on a given core. But since the data accessed by the scheduling function is unrelated to the task data, it effectively pollutes the cache, thereby attenuating the potential benefit of cache reuse between tasks. The net effect is an increase in the WCET which must be accounted for by our scheduling algorithm.
Shared bus contention
In a Kalray MPPA-256 compute cluster, concurrent requests issued from different cores to the same memory bank(s) compete against each other because each memory bank is equipped with only one request arbiter. Therefore the delay induced by shared bus contention must be taken into account. Note that a PE is continuously occupied by the execution of either the sched function or the tasks mapped on the PE. Therefore, the memory requests of a task may compete against those of both the tasks and the sched function executing in parallel with the task.
Delay to the start time of tasks because of scheduler
A task starts executing only when the sched function which precedes the task terminates. In the worst case, the execution of the task can be postponed by (at most) the amount of time that the sched function spent on its last iteration. As a result, there may be a gap between the trigger time of a task and the actual start time of the task (release jitter; Phatrapornnant and Pont 2006; Maaita and Pont 2005). The delay in the start time of a task affects its finish time, thus requiring the trigger time of every task to be updated such that the precedence relation(s) are maintained.
Lack of hardware-implemented data cache coherence
The Kalray MPPA-256 does not provide hardware support for cache coherence between cores. In a compute cluster, tasks executing on different cores communicate through the SMEM, and data in transit from the cache to the SMEM is temporarily held in a write buffer before being committed to the SMEM. This delay may cause communication between pairs of tasks executing on different cores to fail.
Communication failures can occur when a task is assigned to a different core than its predecessor and starts executing right after the termination of its predecessor. At the time that a task starts executing, the most recent data which the task intends to receive from its predecessor may not have been committed to the SMEM yet. As a result, the task may operate on obsolete data. In order to overcome this issue, all memory stores of the sending task must be committed to the SMEM before its termination. This can be done by inserting synchronization instructions at the end of each task. These instructions, which are available natively in the Kalray MPPA-256, include __builtin_k1_wpurge() and __builtin_k1_fence(). The former requests that the write buffer be flushed to the shared memory, while the latter waits for all data to be committed to the shared memory.

Section 3 presented algorithms for generating a cache-conscious schedule without accounting for these implementation factors. Here we present a technique, also based on ILP, that updates the task trigger times of the initial schedule (shifts them into the future) to account for these overheads. These modifications of trigger times maintain the per-core execution order of tasks of the initial schedule, and ensure that precedence relation constraints between the tasks remain satisfied. We refer to the adjusted schedule as an adapted cache-conscious schedule. Figure 6 illustrates the adapted cache-conscious schedule with its adjusted WCET and task trigger time. The technique proposed to shift task trigger times is called ACILP in the following, for adapted cache-conscious ILP.
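The software-managed coherence step mentioned earlier (purging the write buffer and waiting for the commit at the end of each task) can be sketched as a small task wrapper in C. This is an illustrative sketch: the build flag `USE_KALRAY_BUILTINS`, the wrapper names `write_buffer_purge` and `write_buffer_fence`, the function `run_task_and_commit`, and the example task `producer_task` are hypothetical, and the host-side stand-ins only count calls so that the code compiles off-target.

```c
#include <assert.h>

#ifdef USE_KALRAY_BUILTINS            /* hypothetical build flag for the target */
#define write_buffer_purge() __builtin_k1_wpurge()
#define write_buffer_fence() __builtin_k1_fence()
#else
int purge_calls, fence_calls;         /* host-side stand-ins: count calls only */
static void write_buffer_purge(void) { purge_calls++; }
static void write_buffer_fence(void) { fence_calls++; }
#endif

volatile int shared_buf[4];           /* stands in for a buffer located in the SMEM */

/* Hypothetical sending task: writes its output to the shared buffer. */
void producer_task(void) { shared_buf[0] = 42; }

/* Run a task body, then force its stores out to shared memory before returning. */
void run_task_and_commit(void (*body)(void))
{
    assert(body != 0);
    body();
    write_buffer_purge();   /* request the flush of the write buffer */
    write_buffer_fence();   /* wait until the data has been committed */
}
```

Placing both calls in a wrapper rather than in each task body keeps the synchronization cost in one place, which makes it easier to account for in the WCET of the sending task.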
Notation used in the ILP formulation Extending the set of symbols from Table 1 for managing the task graph, Table 2 introduces the following new symbols:
- The first section of the table defines notation that represents strictly the theoretical factors of cache-conscious scheduling (omitting reference to implementation factors). $B(\tau_j)$ represents the task executing on the same core and immediately preceding $\tau_j$, $C(\tau_j)$ represents the core to which $\tau_j$ is assigned, and $LT(c_j)$ represents [...]
- In the second section, $WCET^{over \setminus cont}_{\tau_j}$ and $MR_{\tau_j}$ account for overheads caused by all the implementation factors except for the delay from shared bus contention. The overhead estimation is detailed in Sect. 5.3.1. Additionally, the symbol $DMEM$ stands for the upper bound of memory access latency in the absence of contention (to be explained in Sect. 5.1).
- The third section defines variables using lower-case letters. Similar to CILP, we use symbols such as $sl$ and $ft_{\tau_j}$ for the schedule length and the finish time of $\tau_j$ in the adapted cache-conscious schedule, respectively. The symbol $tt_{\tau_j}$ represents the trigger time of $\tau_j$. It represents the same concept as $st_{\tau_j}$ (start time) in the initial schedule generation. We used a different symbol simply because the values of $tt_{\tau_j}$ and $st_{\tau_j}$ are not the same ($tt_{\tau_j} \ge st_{\tau_j}$ due to the consideration of implementation overheads). The symbol $wcet^{over}_{\tau_j}$ denotes the worst-case execution time of $\tau_j$ once shared bus contention is accounted for.

ILP formulation to account for the implementation factors (ACILP)

ACILP retains both the objective function and the constraints between schedule length and task finish time from the original CILP (presented in Sect. 3.1).
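The finish time $ft_{\tau_j}$ of $\tau_j$ is the sum of its trigger time $tt_{\tau_j}$ and its worst-case execution time $wcet^{over}_{\tau_j}$, which now accounts for shared bus contention:

$$ft_{\tau_j} = tt_{\tau_j} + wcet^{over}_{\tau_j}$$

Table 2 (excerpt): $intf^{c_k}_{\tau_j}$, binary variable; indicates whether the memory accesses from core $c_k$ interfere with the execution of $\tau_j$.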
To complete the ILP formulation, we first present constraints for computing the trigger time of tasks, then we present constraints for computing the worst-case execution time of tasks that account for shared bus contention.
Constraints on task trigger time If $\tau_j$ has any direct predecessors (noted as $\tau_i \in dPred(\tau_j)$), then $\tau_j$ cannot begin execution until those tasks have finished. In order to maintain the precedence relations among tasks, the trigger time of the task must be greater than or equal to the finish time of all its direct predecessors. The same constraint is introduced if $\tau_j$ has a co-located task that immediately precedes it (noted as $\tau_i \in B(\tau_j)$):

$$tt_{\tau_j} \ge ft_{\tau_i}, \quad \forall \tau_i \in dPred(\tau_j) \cup B(\tau_j)$$

When a task has no predecessor and is the first task running on a core, the trigger time of the task must be greater than or equal to zero:

$$tt_{\tau_j} \ge 0$$
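The trigger-time constraints above can be illustrated with a small sketch. It is not the ACILP formulation itself: it assumes that the execution time of each task is already fixed (whereas in ACILP the contention-aware WCET depends on the trigger times), and simply propagates the earliest trigger times that satisfy the precedence and per-core ordering constraints. The function name, the task names, and the cycle counts are hypothetical.

```python
from collections import deque


def earliest_trigger_times(wcet, preds, core_order):
    """Earliest trigger/finish times that respect the ordering constraints.

    wcet:       {task: execution time in cycles} (assumed fixed here)
    preds:      {task: set of direct predecessors in the task graph}
    core_order: {core: [tasks in their execution order on that core]}
    """
    # A task is constrained by its direct predecessors and by the
    # co-located task that immediately precedes it on its core.
    before = {t: set(preds.get(t, ())) for t in wcet}
    for order in core_order.values():
        for prev, cur in zip(order, order[1:]):
            before[cur].add(prev)

    after = {t: [] for t in wcet}
    pending = {t: len(before[t]) for t in wcet}
    for t, sources in before.items():
        for s in sources:
            after[s].append(t)

    tt = {t: 0 for t in wcet}  # unconstrained tasks: trigger time >= 0
    ft = {}
    ready = deque(t for t in wcet if pending[t] == 0)
    while ready:
        t = ready.popleft()
        ft[t] = tt[t] + wcet[t]
        for s in after[t]:
            tt[s] = max(tt[s], ft[t])  # tt_s >= ft_t
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)
    if len(ft) != len(wcet):
        raise ValueError("per-core order conflicts with precedence relations")
    return tt, ft


# Hypothetical 4-task graph mapped on two cores.
tt, ft = earliest_trigger_times(
    wcet={"T1": 10, "T2": 20, "T3": 15, "T4": 5},
    preds={"T2": {"T1"}, "T3": {"T1"}, "T4": {"T2", "T3"}},
    core_order={"c0": ["T1", "T2", "T4"], "c1": ["T3"]},
)
```

In this example T4 is triggered at time 30, the finish time of T2, which is later than the finish time of T3 (25).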
Constraints on the worst-case execution time to account for shared bus contention As stated in Dupont de Dinechin et al. (2014), every core of the Kalray MPPA-256 is fully timing compositional. The worst-case execution time of $\tau_j$ that accounts for shared bus contention, $wcet^{over}_{\tau_j}$, can therefore be safely computed as the sum of:
-the worst-case execution time of $\tau_j$, adjusted for overheads caused by all the implementation factors except shared bus contention (denoted as $WCET^{over\backslash cont}_{\tau_j}$);
-the shared bus contention induced on $\tau_j$ (denoted as $o^{contention}_{\tau_j}$).

$$wcet^{over}_{\tau_j} = WCET^{over\backslash cont}_{\tau_j} + o^{contention}_{\tau_j}$$
Let us denote as $\delta_{\tau_j}$ the maximum number of memory requests (from all cores) that could delay the execution of $\tau_j$, and as $DMEM$ the upper bound of memory access latency in a contention-free situation. The shared bus contention delay induced on $\tau_j$ is computed as:

$$o^{contention}_{\tau_j} = \delta_{\tau_j} \times DMEM$$

Let us denote as $\delta^{c_k}_{\tau_j}$ the maximum number of memory requests issued from core $c_k \ne C(\tau_j)$ that interfere with the execution of $\tau_j$. Considering the memory requests of all cores ($c_k \in c \wedge c_k \ne C(\tau_j)$) that delay the execution of $\tau_j$, we compute $\delta_{\tau_j}$ as follows:

$$\delta_{\tau_j} = \sum_{c_k \in c \,\wedge\, c_k \ne C(\tau_j)} \delta^{c_k}_{\tau_j}$$
To compute $\delta^{c_k}_{\tau_j}$, we need to determine whether the memory requests of $\tau_j$ compete against those issued from $c_k$ (represented by a binary variable $intf^{c_k}_{\tau_j}$). Since the sched function precedes every task, there is always some code executing on a core (either the sched function or a task) up to the point at which its last task completes. In order to ensure that all possible contentions are captured, we always account for potential interference between task $\tau_j$ and the operations on core $c_k$, except when $\tau_j$ is triggered after the termination of the last task running on $c_k$ (represented by $LT(c_k)$). If $\tau_j$ and $LT(c_k)$ are constrained by a precedence relation, we can predetermine the value of $intf^{c_k}_{\tau_j}$ as follows:
-if $\tau_j$ is either a direct or an indirect predecessor of $LT(c_k)$, $\tau_j$ must start executing before $LT(c_k)$. In this case, $intf^{c_k}_{\tau_j} = 1$;
-if $\tau_j$ is either a direct or an indirect successor of $LT(c_k)$, $\tau_j$ has to start executing after the termination of $LT(c_k)$. In this case, $intf^{c_k}_{\tau_j} = 0$.
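If $\tau_j$ and $LT(c_k)$ are not related by any precedence relation, $intf^{c_k}_{\tau_j}$ is determined by the following condition:

$$intf^{c_k}_{\tau_j} = \begin{cases} 1 & \text{if } tt_{\tau_j} < ft_{LT(c_k)} \\ 0 & \text{otherwise} \end{cases}$$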
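To restate the above condition in classical big-M notation:

$$tt_{\tau_j} - ft_{LT(c_k)} \ge -M \times intf^{c_k}_{\tau_j}$$
$$tt_{\tau_j} - ft_{LT(c_k)} < M \times (1 - intf^{c_k}_{\tau_j})$$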
where $M$ is a constant greater than any possible $ft_{LT(c_k)}$. To reduce the computational effort of solving the ILP formulation, when the memory requests issued by $\tau_j$ are determined to compete against those from $c_k$ ($intf^{c_k}_{\tau_j} = 1$), we assume that all memory requests of $\tau_j$ are delayed, i.e., $\delta^{c_k}_{\tau_j} = MR_{\tau_j}$. We formulate $\delta^{c_k}_{\tau_j}$ as follows:

$$\delta^{c_k}_{\tau_j} = intf^{c_k}_{\tau_j} \times MR_{\tau_j}$$
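As an illustration of the contention model described in this section, the following sketch evaluates the interference indicator, the number of delaying requests, the contention delay, and the resulting WCET of one task for a given set of trigger and finish times. In ACILP these quantities are decision variables of the ILP; here they are simply computed for fixed times, which is an assumption made for illustration. All names and input values are hypothetical, except the 17-cycle latency bound, which is the DMEM value reported in the evaluation.

```python
DMEM = 17  # cycles; contention-free memory latency bound reported in the paper


def contention_adjusted_wcet(task, core_of, tt, ft_last, wcet_no_cont, mr,
                             dmem=DMEM):
    """Return (intf, delta, o_contention, wcet_over) for one task.

    core_of:      {task: core the task is assigned to}
    tt:           {task: trigger time}
    ft_last:      {core: finish time of the last task LT(core) on that core}
    wcet_no_cont: {task: WCET including all overheads except bus contention}
    mr:           {task: number of memory requests of the task}
    """
    intf = {}
    for core, ft_lt in ft_last.items():
        if core == core_of[task]:
            continue  # only the other cores can interfere
        # Interference is assumed unless the task is triggered after the
        # last task of that core has finished.
        intf[core] = 1 if tt[task] < ft_lt else 0
    # Simple pessimistic model: all requests of the task are delayed once
    # per interfering core.
    delta = sum(flag * mr[task] for flag in intf.values())
    o_contention = delta * dmem
    return intf, delta, o_contention, wcet_no_cont[task] + o_contention


# Hypothetical task on core c0, triggered at cycle 100.
intf, delta, o_contention, wcet_over = contention_adjusted_wcet(
    task="T",
    core_of={"T": "c0"},
    tt={"T": 100},
    ft_last={"c0": 400, "c1": 150, "c2": 90},
    wcet_no_cont={"T": 1000},
    mr={"T": 11},
)
```

Core c2 has already finished its last task when T is triggered, so only c1 contributes to the delay in this example.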
Experimental evaluation
The experimental evaluation is divided into three parts.
-In the first part, we present the experimental conditions. We describe the properties of the benchmarks used in the experiments and the method used for estimating the WCETs and the number of cache misses of tasks. We also describe the experimental environment: the ILP solver and the machine used for running the ILP solver and the proposed heuristic scheduling algorithm.
-In the second part, we evaluate the quality of the generated cache-conscious schedules and the time required for generating them. The objective is to evaluate the maximum schedule length reduction attainable by our proposed cache-conscious scheduling methods for any multi-core architecture equipped with local caches. Therefore, in this study, we ignore implementation factors, such as hardware sharing, the effects of the time-driven scheduler, and the lack of hardware-implemented data cache coherence. Those implementation issues are addressed in the third part of the evaluation.
-In the third part, we first validate the functional and temporal correctness of applications when executing on a Kalray MPPA-256 compute cluster. We then quantify the impact of the overheads caused by different implementation factors on adapted cache-conscious schedules. Finally, we evaluate the performance of our proposed ACILP formulation in terms of both the quality of the adapted cache-conscious schedules and the time required for generating them.
Experimental conditions
Benchmarks In our experiments, we use 26 benchmarks of the StreamIt benchmark suite (Thies and Amarasinghe 2010). StreamIt is a programming environment that facilitates the programming of streaming applications, and was selected because it provides benchmarks with explicit parallelism and data transfers. We modified the StreamIt compilation toolchain (code generation step) to obtain task graphs compatible with our task model. The characteristics of the task graphs are summarized in Table 3. In the table, the maximum width of a task graph is defined as the maximum number of tasks with the same rank. The maximum width defines the maximum parallelism in the benchmark. The average width is the average number of tasks per rank and defines the average parallelism of the application. The higher the average width, the better the potential to benefit from a high number of cores. The depth of a task graph is defined as the longest path from the entry task to the exit task. Additional information on the benchmarks is reported in Table 4. Reported information is the code size for the entire application, the average code size per task, the standard deviation of code sizes (the higher the number, the higher the variability of the code sizes of tasks in the application), and the average amount of data communicated between tasks.
WCET and the number of cache misses estimation Many techniques exist for WCET analysis (Wilhelm et al. 2008) and could be used in our study to estimate WCETs and the gains resulting from cache reuse. At the time of the experiments, there was no publicly available WCET analysis tool for the Kalray MPPA-256. Furthermore, WCET estimation is not at the core of our scheduling methods. Therefore, we obtain WCET values by using measurements on a compute cluster of the Kalray MPPA-256. Measurements were performed on one core of the platform, with no activity on the other cores, providing fixed inputs for each task. The execution time of a task is retrieved using the platform's global cycle counter. The effect of reading the timestamp counter on the execution time of a task turned out to be negligible compared to the execution time of the task. We further observed that, thanks to the determinism of the architecture, when running a task several times in the same execution context (10 times in our experiments), the execution time is constant (the same behavior was reported in Nélis et al. 2016).
Additionally, in order to record the number of cache misses, we use two performance counters supported by the Kalray MPPA-256. One counts the number of instruction cache misses, the other one counts the number of data cache misses.
Experimental environment
We use Gurobi optimizer version 6.5 (Gurobi Optimization, Inc. 2015) for solving our proposed ILP formulations. The solving time of the solver is limited to 20 hours. The ILP solver and heuristic scheduling algorithms are executed on a 3.6 GHz Intel Core i7 CPU with 16 GB of RAM.
Evaluation of the performance of cache-conscious schedule generation
Context-sensitive WCET information
For each task, we record its execution time when not reusing cache contents, as well as its execution time when executed after any possible other task. Note that the way the benefit of cache reuse is evaluated also captures other (minor) hardware effects such as pipeline effects. Table 5 summarizes the statistics of the obtained execution times. This table shows the average and standard deviation of tasks' WCETs when having no cache reuse (the higher the standard deviation, the higher the variability of the WCETs of tasks in the application). It also shows the weighted average WCET reduction for each benchmark, computed as follows. For each task $\tau_j$ we calculate its average WCET reduction in percent, denoted here as $r_{\tau_j}$, where $WCET_{\tau_i \rightarrow \tau_j}$ is the WCET of $\tau_j$ when executed immediately after $\tau_i$ on the same core:

$$r_{\tau_j} = \frac{1}{|nSucc(\tau_j)|} \sum_{\tau_i \in nSucc(\tau_j)} \frac{WCET_{\tau_j} - WCET_{\tau_i \rightarrow \tau_j}}{WCET_{\tau_j}} \times 100 \qquad (23)$$

Equation 23 considers for a task $\tau_j$ all tasks $\tau_i$ that may be scheduled immediately before $\tau_j$ by the scheduler, i.e., all tasks in the set $nSucc(\tau_j)$ (tasks that are not successors of $\tau_j$ in the task graph), as defined in Table 1.
We observed that tasks with a small WCET have large WCET reductions when considering cache reuse. On the other hand, they have a low impact on schedule length because of their low WCET. Consequently, we weight each WCET reduction by the task's WCET ignoring cache reuse, yielding the following definition of the weighted average reduction:

$$\frac{\sum_{\tau_j} r_{\tau_j} \times WCET_{\tau_j}}{\sum_{\tau_j} WCET_{\tau_j}}$$
Regarding the cost of estimating context-sensitive WCETs of tasks due to cache reuse, we observed that the worst profiling time is 10 minutes for the most complex benchmark structure IDCT_2D_reference_fine. The benchmark contains 548 tasks, and 219238 pairs of tasks that may be executed one after the other (with respect to precedence constraints).
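The weighted average WCET reduction described above can be sketched as follows. The dictionary layout, the function name, and the numbers are hypothetical; tasks for which no context-sensitive WCET is available are assumed to contribute a reduction of zero, which is a choice made for this sketch only.

```python
def weighted_average_reduction(wcet, wcet_after):
    """Weighted average WCET reduction of a benchmark, in percent.

    wcet:       {task: WCET without cache reuse}
    wcet_after: {task: {previous task: WCET of task when run right after it}}
    """
    weighted_sum = 0.0
    for task, base in wcet.items():
        contexts = wcet_after.get(task, {})
        if contexts:
            # Average reduction (in percent) over all possible previous tasks.
            reduction = sum((base - w) / base * 100
                            for w in contexts.values()) / len(contexts)
        else:
            reduction = 0.0  # assumption: no measured context, no reduction
        # Weight each reduction by the WCET of the task without cache reuse.
        weighted_sum += reduction * base
    return weighted_sum / sum(wcet.values())


# Hypothetical two-task benchmark: A gains 50% after B, B gains 10% after A.
result = weighted_average_reduction(
    wcet={"A": 100, "B": 300},
    wcet_after={"A": {"B": 50}, "B": {"A": 270}},
)
```

Because B has the larger WCET, its modest reduction dominates: the weighted average in this example is about 20%, against an unweighted average of 30%.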
Benefits of cache-conscious scheduling
We show that cache-conscious scheduling, whether implemented using an ILP formulation (CILP) or a heuristic method (CLS), yields shorter schedules than equivalent cache-agnostic methods. This is shown by comparing CILP with NCILP, the same ILP formulation as CILP except that the cache effect is not taken into account (variable $wcet_{\tau_j}$ is systematically set to the cache-agnostic WCET, $WCET_{\tau_j}$). The gain is evaluated by the following equation, in which $sl$ stands for the schedule length:

$$gain = \frac{sl_{NCILP} - sl_{CILP}}{sl_{NCILP}} \times 100$$
The gain is also evaluated using a similar formula for the heuristic method CLS (shorter schedule results for CLS_BL and CLS_TL) as compared to its cache-agnostic equivalent.
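For illustration, the gain computation can be written as a short helper. The benchmark names and schedule lengths below are made up and do not correspond to any measured result; the function names are hypothetical.

```python
def schedule_length_gain(sl_agnostic, sl_conscious):
    """Gain (in percent) of a cache-conscious schedule over the
    cache-agnostic schedule produced by the equivalent method."""
    return (sl_agnostic - sl_conscious) / sl_agnostic * 100


def gains_and_average(lengths):
    """lengths: {benchmark: (cache-agnostic length, cache-conscious length)}."""
    gains = {name: schedule_length_gain(agnostic, conscious)
             for name, (agnostic, conscious) in lengths.items()}
    return gains, sum(gains.values()) / len(gains)


# Hypothetical schedule lengths in cycles.
gains, average = gains_and_average({
    "bench_a": (1000, 900),
    "bench_b": (2000, 1700),
})
```

The same helper applies to both comparisons (CILP against NCILP, and CLS against its cache-agnostic equivalent), since only the pair of schedule lengths changes.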
Results are reported in Figs. 7 and 8 for a 16-core architecture. In Fig. 7, only results for the benchmarks for which the optimal solution was found within a time budget of 20 hours are depicted. These figures show that both CILP and CLS reduce the length of schedules for all benchmarks. The gain is 11% on average for CILP and 9% on average for CLS. As expected, the highest reductions are obtained for the benchmarks with the highest weighted average WCET reduction, as reported in Table 5.
Comparison of exact (CILP) and heuristic (CLS) scheduling techniques
We compare CILP and CLS according to two metrics: the quality of the generated schedules, estimated through their length (the shorter the better), and the time required to generate the schedules. All results are obtained on a 16-core system. Table 6 gives the length of the generated schedules ($sl_{CILP}$ and $sl_{CLS}$), the run time of schedule generation (in seconds), and the gap (in percent) between the schedule lengths, computed by the following formula:

$$gap = \frac{sl_{CLS} - sl_{CILP}}{sl_{CILP}} \times 100$$

Table 6 notes: the distance in terms of schedule length between optimal solutions and heuristic solutions is given in bold; x, no solution is found in 20 h; f, feasible solution is found; o, optimal solution is found.
The smaller the gap, the closer CLS is to CILP. The gap between CLS and CILP is given only when CILP finds the optimal solution within a time budget of 20 hours. The table shows that CLS offers a good trade-off between efficiency and quality of the generated schedules. CLS generates schedules very fast compared to CILP (about 1 second for the biggest task graph, IDCT_2D_reference_fine, which contains 548 tasks). When scheduling big task graphs, such as DES, ChannelVocoder, and IDCT_2D_reference_fine, CILP is unable to find the optimal solution in 20 hours, which is expected because the problem of finding the optimal solution to a partitioned, non-preemptive scheduling problem on a multi-core architecture is NP-hard (Kasahara and Narita 1984; Nemhauser and Wolsey 1999). When CILP finds the optimal solution, the gap between CILP and CLS is very small (0.7% on average).

[Fig. 9: reuse pattern in the Lattice benchmark (tasks T1 to T10). Annotated WCET reductions: T6 after T1, 37.3%; T6 after T5, 22.6%; T9 after T4, 50.1%; T9 after T7, 31.3%]
The highest gap (7.3%) is observed for the Lattice benchmark. This can be explained by the fact that the WCETs of tasks in the Lattice benchmark are small and that the benchmark contains a reuse pattern (illustrated in Fig. 9) where reuse is higher between indirect predecessors than between direct predecessors. For example, the reduction of the WCET of T6 when executed directly after T1 on the same core (37.3%) is higher than when executed directly after T5 on the same core (22.6%). Similarly, the reduction of the WCET of T9 when executed directly after T4 on the same core (50.1%) is higher than when executed directly after T7 on the same core (31.3%). For such an application, the static sorting of CLS never places indirect precedence-related tasks (for which the higher reuse occurs) contiguously in the list, and therefore does not fully exploit the cache reuse present in the application.
Impact of the number of cores on the gain of CLS against NCLS
We evaluate the gain in terms of schedule lengths of CLS against its cache-agnostic equivalent when varying the number of cores. The results are depicted in Fig. 10 for a number of cores from 2 to 64.
In the figure, we can observe that whatever the number of cores, CLS always outperforms NCLS, meaning that our proposed method is always able to take advantage of the WCET reduction due to cache reuse to reduce schedule length. Another observation is that the gain decreases when the number of cores increases, up to a given number of cores. This behavior is explained by the fact that when the number of cores increases, the tasks are spread among more cores, which provides less opportunity to exploit cache reuse since exploiting the parallelism of the application is more profitable. However, even in that situation, the reduction of schedule length achieved by CLS against NCLS is significant most of the time.
Comparison of schedule lengths for CLS_TL and CLS_BL
We study the impact of the sorting technique used by list scheduling on the quality of schedules. Figure 11 depicts, for each benchmark, the ratio of the length of the schedule generated by CLS_TL to that of CLS_BL, $slRatio_{CLS\_TL/CLS\_BL} = \frac{sl_{CLS\_TL}}{sl_{CLS\_BL}}$. A ratio of 1 indicates that the two techniques generate schedules of identical length. Results are given for different numbers of cores (4, 8, 16, 32 and 64). The figure shows that neither method dominates the other for all benchmarks. Furthermore, the lengths of the schedules generated by CLS_TL and CLS_BL are most of the time very close to each other, if not identical. The explanation is that both CLS_TL and CLS_BL scan tasks in some topological order. For task graphs with a small width, there is a small number of possible topological orderings of tasks. For task graphs with a large width, the WCETs of most tasks having the same rank are correlated, since those tasks in the StreamIt benchmarks execute the same piece of code. Therefore, for task graphs having those properties, different orderings of tasks result in only slight changes in the final schedule length.
There is a significant difference between CLS_TL and CLS_BL in only two cases, ChannelVocoder on 4 cores and FmRadio on 8 cores. The differences between the lengths of the schedules generated by CLS_TL and CLS_BL in these cases are 3% and 8%, respectively. This shows that in some special cases, a change in the order of tasks in the list significantly affects the mapping of tasks, and hence the quality of the generated schedules. Since both CLS_TL and CLS_BL generate schedules very fast, throughout this paper we always used both and selected the best result obtained.
Evaluation of the cache-conscious schedules implementation
In the evaluation we use four benchmarks, named AudioBeam, AutoCor, FmRadio, and MergeSort.
Overheads and number of memory accesses estimation
The overhead induced on $\tau_j$ by cache pollution, noted as $o^{cache\_pollution}_{\tau_j}$, is computed by subtracting the worst-case execution time of $\tau_j$ when ignoring the execution of the sched function ($WCET_{\tau_j}$) from its worst-case execution time when the sched function is executed before the task ($WCET^{cpo}_{\tau_j}$), i.e., $o^{cache\_pollution}_{\tau_j} = WCET^{cpo}_{\tau_j} - WCET_{\tau_j}$.
The W C E T cpo τ j is estimated according to the execution order of τ j , which is available at the implementation stage. If τ i is the task executing on the same core and immediately preceding τ j , we record the execution time of τ j when the task is executed according to the following order: τ i → sched(0) → τ j . Since during the execution of the sched function the contents of caches do not change after the first iteration of the function, we pass zero to the input of the function.
Regarding the delay to the start time of a task τ j , noted as o delay_sched τ j , we measure the WCET of one iteration of the sched function. For measuring the value, we pass zero to the input of the function, and execute the function in isolation. We observed that o delay_sched τ j = 258 cycles. We also observed that the maximum number of cache misses of one iteration of the sched function is 10. Since the timestamp of the global cycle is stored at a specific address of the SMEM, one iteration of the function takes one more access to the SMEM to retrieve the information. Therefore, the maximum number of memory requests of one iteration of the sched function is 11.
The write buffer is 8-way fully associative (8 bytes per way), the memory access granularity is 8 bytes, and the contention-free memory access latency for 8 bytes is 10 cycles (Becker et al. 2016). The upper bound of the contention-free cost of flushing the write buffer to the SMEM, noted as $o^{WB\_flush}_{\tau_j}$, is therefore 8 × 10 = 80 cycles. Likewise, the upper bound of the number of memory accesses needed to flush the write buffer to the SMEM is 8.
Additionally, the overhead induced on τ j due to shared bus contention, noted as o contention τ j , is a variable in the ACILP formulation. Therefore, its value can be retrieved from the solution file of the Gurobi optimizer (after solving ACILP).
Furthermore, the upper bound of memory access latency in the contention-free case, noted as $DMEM$, is equal to the cost of loading an instruction cache line from the SMEM to the instruction cache. Since an instruction cache line contains 64 bytes, and a contention-free SMEM access takes 9 cycles followed by 8 bytes fetched on each consecutive cycle, loading an instruction cache line from the SMEM to the instruction cache takes 9 + 64/8 = 17 cycles.
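The arithmetic behind the constants reported above can be checked with a few lines. The variable names are ours; the numeric inputs are the figures quoted in the text, and the split of the 17 cycles into a 9-cycle access followed by one cycle per 8-byte chunk is our reading of the description.

```python
# Figures quoted in the text (contention-free case).
WRITE_BUFFER_WAYS = 8        # entries of 8 bytes each
ACCESS_LATENCY_8_BYTES = 10  # cycles to access 8 bytes in the SMEM
CACHE_LINE_BYTES = 64        # instruction cache line size
SMEM_ACCESS_CYCLES = 9       # cycles before the first 8 bytes arrive
BYTES_PER_CYCLE = 8          # bytes fetched on each consecutive cycle
SCHED_CACHE_MISSES = 10      # max cache misses of one sched iteration
SCHED_TIMESTAMP_READS = 1    # extra SMEM access to read the cycle counter

# Upper bound on the cost of flushing the write buffer: one access per way.
o_wb_flush = WRITE_BUFFER_WAYS * ACCESS_LATENCY_8_BYTES

# DMEM: cost of loading one instruction cache line from the SMEM.
dmem = SMEM_ACCESS_CYCLES + CACHE_LINE_BYTES // BYTES_PER_CYCLE

# Upper bound on the memory requests of one iteration of the sched function.
sched_requests = SCHED_CACHE_MISSES + SCHED_TIMESTAMP_READS
```

These evaluate to 80 cycles, 17 cycles, and 11 requests, matching the values used in the evaluation.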
Validation of the functional correctness and the timing correctness of the implementation
Functional correctness For validating the functional correctness of the implementation, we compare the outputs produced by each benchmark when executing in sequential order, i.e., all tasks are executed on one core, and when executing in parallel, i.e., tasks are executed according to their mapping and their scheduling information given in the adapted cache-conscious schedule. We observed the same outputs when executing benchmarks in sequential and when executing in parallel.
Temporal correctness For validating the temporal correctness of the implementation, we record the actual start time and the actual finish time of every task when executing on a Kalray MPPA-256 compute cluster according to their mapping and their schedule. We observed that precedence constraints between tasks are satisfied.
Quantification of the impact of different implementation factors on adapted cache-conscious schedules
The impact of an implementation factor on the adapted cache-conscious schedule is reflected by the ratio of the overall overhead induced on tasks by that factor to the schedule length. However, tasks are executed in parallel, so in general the schedule length is not a linear combination of the execution times of a set of tasks.

[Fig. 12: The schedule graph constructed based on the scheduling information in the adapted cache-conscious schedule depicted in Fig. 6 (nodes T1, T2, T3, T4)]
Therefore, for our quantification, instead of relying on the schedule length, we use the length of the longest path of the schedule, which is the accumulated execution time of the tasks along that path. In order to determine the longest path of the schedule, we construct a schedule graph based on the scheduling information of tasks in the schedule. In the schedule graph, each node represents a task. Two nodes are connected by an edge if the task represented by the sink node is triggered after the execution of the task represented by the source node. Figure 12 shows the schedule graph constructed from the scheduling information of tasks in the adapted schedule illustrated in Fig. 6. The weight of a node is the execution time of the task represented by the node, while the weight of every edge is zero. We use the implicit path enumeration technique (IPET) (Li and Malik 1995) to find the longest path of the schedule graph. We denote the set of tasks that lie on the longest path as $\tau_{cp}$, and the length of the longest path as $sl_{sg}$.
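The longest path of such a node-weighted schedule graph can also be obtained without an ILP solver. The sketch below is not IPET: it swaps in a dynamic-programming pass over a topological order, which gives the same longest path on an acyclic graph. The function name, task names, and execution times are hypothetical.

```python
def longest_path(exec_time, edges):
    """Longest node-weighted path of an acyclic schedule graph.

    exec_time: {task: execution time}, the weight of each node
    edges:     iterable of (source, sink) pairs, all of weight zero
    Returns (length of the longest path, list of tasks on that path).
    """
    succ = {t: [] for t in exec_time}
    pending = {t: 0 for t in exec_time}
    for src, dst in edges:
        succ[src].append(dst)
        pending[dst] += 1

    best = dict(exec_time)              # longest path ending at each task
    prev = {t: None for t in exec_time}
    order = [t for t in exec_time if pending[t] == 0]
    i = 0
    while i < len(order):               # process tasks in topological order
        u = order[i]
        i += 1
        for v in succ[u]:
            if best[u] + exec_time[v] > best[v]:
                best[v] = best[u] + exec_time[v]
                prev[v] = u
            pending[v] -= 1
            if pending[v] == 0:
                order.append(v)
    if len(order) != len(exec_time):
        raise ValueError("the schedule graph contains a cycle")

    last = max(best, key=best.get)
    length = best[last]
    path = []
    while last is not None:             # walk back along the recorded links
        path.append(last)
        last = prev[last]
    return length, path[::-1]


# Hypothetical schedule graph with four tasks.
sl_sg, tau_cp = longest_path(
    exec_time={"T1": 10, "T2": 20, "T3": 15, "T4": 5},
    edges=[("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T3", "T4")],
)
```

In this example the longest path is T1, T2, T4 with a length of 35.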
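The impact of cache pollution on the adapted cache-conscious schedule, noted as $oo_{cache\_pollution}$, is quantified as:

$$oo_{cache\_pollution} = \frac{\sum_{\tau_j \in \tau_{cp}} o^{cache\_pollution}_{\tau_j}}{sl_{sg}}$$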
The impact of the other implementation issues on the adapted schedule is noted and quantified in the same way. Furthermore, we compute the ratio of the effective execution time of tasks (i.e., the execution time of tasks when completely ignoring all implementation issues) to the length of the longest path of the schedule graph. The impact of the different implementation issues on the adapted cache-conscious schedules of all benchmarks in the study for 2, 4, 8, and 15 cores is shown in Fig. 13. The impact of cache pollution is negligible. This is expected, since the sched function is simple and its execution therefore disturbs the cache contents very little. Moreover, since the execution of the sched function is short, the delay it adds to the start time of tasks is small. Additionally, the overall overhead induced on the tasks in $\tau_{cp}$ by write buffer flushing is very small. The reason is twofold: (i) communicating tasks are likely to be assigned to the same core to benefit from data reuse, in which case no flush is needed; (ii) the cost of flushing the write buffer to the SMEM is small compared to the execution time of tasks. Compared to those implementation factors, shared bus contention has the highest impact on the adapted schedules. The impact of shared bus contention tends to increase with the number of cores. The reason is that when the number of cores increases, the number of concurrent tasks tends to increase, which is likely to introduce more interference with the execution of tasks.
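The quantification described above reduces to a ratio over the tasks of the longest path. The following sketch shows it for one implementation factor; the function name, the task names, and the overhead values are hypothetical, and the result is returned as a plain fraction (whether the paper reports it as a fraction or as a percentage is not assumed here).

```python
def overhead_impact(tasks_on_longest_path, overhead, longest_path_length):
    """Share of the longest path spent in one kind of overhead.

    tasks_on_longest_path: tasks lying on the longest path of the schedule graph
    overhead:              {task: overhead in cycles caused by the factor}
    longest_path_length:   length of the longest path, in cycles
    """
    total = sum(overhead[task] for task in tasks_on_longest_path)
    return total / longest_path_length


# Hypothetical contention overheads; T3 is not on the longest path and is
# therefore ignored.
impact = overhead_impact(
    tasks_on_longest_path=["T1", "T2", "T4"],
    overhead={"T1": 4, "T2": 10, "T3": 99, "T4": 6},
    longest_path_length=200,
)
```

Only overheads of tasks on the longest path count, which is why the large overhead of T3 does not affect the result in this example.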
Evaluation of the performance of ACILP
In this section we evaluate the ability of ACILP to account for contention, through a comparison with the work of Rihani et al. (2016), which has similar objectives to ACILP. The work described in Rihani et al. (2016) transforms a contention-free static time-driven schedule to account for interference. In that work, a double fixed-point algorithm is proposed, which iteratively updates the WCETs of tasks with contention delays and updates the trigger times of tasks accordingly (with respect to their execution order and their precedence relations) until that information is stable. In their double fixed-point algorithm, every task is forced to be triggered as soon as possible. In ACILP, however, tasks near the end of the schedule may be triggered later, at a time at which the shared bus contention induced on them is reduced. For this reason, ACILP is expected to generate shorter schedules than the double fixed-point algorithm.
After modifying the algorithm of Rihani et al. (2016) to use the same simple contention model as in our work, in which all memory requests of tasks involved in contention are delayed, we compare the schedule length and the time required for generating schedules when using ACILP and the double fixed-point algorithm. All benchmarks in the study are scheduled on 2, 4, 8, and 15 cores. Table 7 presents the length of the adapted schedules and the time required for generating them with ACILP and with the double fixed-point algorithm. The gain in terms of schedule length reduction, which shows the benefit of ACILP compared to the double fixed-point algorithm, is computed as follows, where $sl_{DFP}$ denotes the length of the schedule produced by the double fixed-point algorithm:

$$gain = \frac{sl_{DFP} - sl_{ACILP}}{sl_{DFP}} \times 100$$

Table 7 shows that ACILP achieves slight gains in some cases (the highest gain is 0.98%, when scheduling AudioBeam on 4 cores) and is never inferior to the double fixed-point algorithm. This result is expected, since ACILP only has the opportunity to reduce the contention induced on tasks near the end of schedules. Regarding the time required for generating the adapted schedules, both ACILP and the double fixed-point algorithm produce schedules very fast. For all benchmarks in the study, the longest solving time of ACILP is 42 seconds, whereas the longest time that the double fixed-point algorithm takes to generate a schedule is 6 seconds, in both cases when scheduling the FmRadio benchmark on 15 cores.
Related work
Schedule generation Schedulability analysis techniques rely on the knowledge of the WCET of tasks. Originally designed for single-core architectures, static WCET estimation techniques were extended recently to cope with multi-core architectures. Most research has focused on modeling shared resources (e.g., shared caches, shared bus, shared memory) in order to capture interferences between tasks which execute concurrently on different cores (Kelter et al. 2014; Liang et al. 2012; Chattopadhyay et al. 2010; Hardy et al. 2009; Altmeyer et al. 2015) . Most extensions of WCET estimation techniques for multi-cores produce a WCET for a single task in the presence of concurrent executions on the other cores. By construction, those extensions do not account for cache reuse between tasks as our scheduling techniques do. The scheduling techniques we propose have to rely on WCET estimation techniques to estimate the effect of local caches on tasks' WCETs.
Some WCET estimation techniques pay attention to the effect of private caches on WCETs. When analyzing the timing behavior of a task, Nemer et al. (2007) take into account the set of memory blocks that have been stored in the instruction cache (by the execution of previous tasks on the same core) at the beginning of its execution. Similarly, Potop-Butucaru and Puaut (2013), assuming that the mapping of tasks on cores is known, jointly perform cache analysis and timing analysis of parallel applications. These two WCET estimation techniques assume that the task mapping on cores and the task schedule on each core are known. In this paper, in contrast, task mapping and scheduling are selected to take advantage of cache reuse in order to obtain the shortest possible schedule length.
Much research effort has been spent on scheduling for multi-core platforms. Research on real-time scheduling for independent tasks is surveyed in Davis and Burns (2011). This survey gives a taxonomy of multi-core scheduling strategies: global vs. partitioned vs. semi-partitioned, preemptive vs. non-preemptive, time-driven vs. event-driven. The scheduling techniques we propose in this paper generate offline time-driven partitioned non-preemptive schedules. Most scheduling strategies surveyed in Davis and Burns (2011) are unaware of hardware effects and consider a fixed upper bound on tasks' execution times. In contrast, the scheduling techniques we propose in this paper address the effect of private caches on tasks' WCETs. Our work integrates this effect in the scheduling and mapping problem by considering multiple WCETs for each task depending on their execution contexts (i.e., cache contents at the beginning of their execution).
Some scheduling techniques that are aware of hardware effects were proposed in the past. They include techniques that simultaneously schedule tasks and the messages exchanged between them (Carle et al. 2014; Tendulkar et al. 2014; Puffitsch et al. 2015; Abdallah et al. 2016) ; such techniques take into consideration the Network-On-Chip (NoC) topology in the scheduling process.
Some other techniques aim at scheduling tasks in a way that minimizes contention when accessing shared resources (e.g., shared bus, shared caches) (Calandrino and Anderson 2009; Guan et al. 2009; Ding et al. 2013; Rouxel et al. 2017; Martinez et al. 2017; Kim et al. 2016). For example, Ding et al. (2013) and Rouxel et al. (2017) jointly perform shared resource contention modeling and task mapping/scheduling for multi-core platforms. Ding et al. (2013) focus on shared cache contention, whereas Rouxel et al. (2017) pay attention to shared bus contention. In those works, tasks are mapped and scheduled such that the contention delays induced on them are minimized, thus minimizing schedule length. Approaching the problem from a different direction, Martinez et al. (2017) modify existing schedules by introducing slack time between the execution of pairs of tasks consecutively assigned to the same core. This modification aims at limiting the contention between concurrent tasks contained in the existing schedules. Besides, some approaches (Yao et al. 2012; Becker et al. 2016; Perret et al. 2016a; Pellizzoni et al. 2011; Perret et al. 2016b) schedule tasks according to predictable execution models that guarantee spatial/temporal isolation between co-running tasks. For example, Becker et al. (2016) take advantage of the memory privatization features available in the Kalray MPPA-256 to separately allocate the private memory (including code and data) of tasks. In addition, they design a scheduling policy for tasks that comply with a PREM-like model (Pellizzoni et al. 2011), such that the memory requests of tasks are free from contention. The authors of ) propose a technique for reducing memory interference using a partitioning of DRAM banks and co-locating memory-intensive tasks on the same processor. Our scheduling solutions in this paper differ from those works because we pay attention to the effect of private caches on tasks' WCETs.
On the other hand, because our objective was to concentrate on the effects of private caches, in a first step we used a simple contention model, yielding to contention delays less precise than the ones proposed in Dasari et al. (2011) , Dasari and Nélis (2012) , and Kim et al. (2014) ). Using more precise contention models is left for future work. Compared with our previous work (Nguyen et al. 2017) , in this paper we consider practical, implementation related overheads, that are evaluated on a Kalray MPPA-256 compute cluster.
Having the same interest as us in utilizing data reuse between tasks, in Suhendra et al. (2006) Suhendra et al. jointly consider task scheduling and memory allocating for multi-core systems equipped with scratchpad memory (SPM). In the work, the most frequently accessed data are allocated in SPM, and tasks are scheduled properly to reduce the accesses latency to the off-chip memory. As compared to the approach, our scheduling methods take into account both instruction and data reuse between pairs of tasks, and schedule them in order to get benefit in term of WCET reduction from cache reuse.
Related studies also address the effect of private caches when scheduling tasks on multi-core architectures (Ward et al. 2014; Phavorin et al. 2015; Tessler and Fisher 2016). However, they are based on global and preemptive scheduling techniques, in which the cost of reloading the cache after a preemption or migration has to be accounted for. Compared to these works, our technique is partitioned and non-preemptive. We believe such a scheduling method gives us better control over cache reuse during scheduling. Furthermore, Phavorin et al. (2015) and Tessler and Fisher (2016) focus on single-core architectures, while our work targets multi-core architectures.
Schedule implementation
In the literature, most scheduling techniques for multi-core hardware focus on handling shared resource contention (see Fernandez et al. 2014 for a survey). In this paper, we pointed out that, along with shared resource contention, the effect of the time-driven scheduler and the lack of hardware-implemented data cache coherence are important factors that need to be considered when implementing time-driven, cache-conscious schedules on multi-core hardware (especially the Kalray MPPA-256).
Unlike the works that address shared resource contention, our intent is not to mitigate contention, but rather to integrate it into existing contention-free schedules. With the same objective, Rihani et al. (2016) propose the double fixed-point algorithm. As explained in Sect. 5.3, that algorithm does not produce adapted schedules with the shortest schedule length, as our proposed ACILP formulation does.
Conclusion
In this paper, we first studied the problem of scheduling a single parallel application on a multi-core platform subject to the effect of private caches. Two scheduling techniques, an optimal one and a heuristic one, have been proposed to generate static, time-driven, partitioned, non-preemptive schedules. We experimentally showed the benefit, in terms of schedule length reduction, of taking into account context-sensitive WCETs per task (due to cache reuse) during scheduling.
Second, we implemented time-driven cache-conscious schedules on a cluster of the Kalray MPPA-256. We identified the issues that arise when implementing the schedules on this platform, which can be summarized as:
- cache pollution and the delay to the start time of tasks caused by the execution of the time-driven scheduler;
- shared bus contention;
- lack of hardware-implemented data cache coherence.
Additionally, we proposed an ILP formulation to adapt time-driven, cache-conscious schedules to the implementation factors. Experimental validation has shown the functional and temporal correctness of our implementation and the efficiency of our proposed ILP formulation. We also observed that, among the identified factors, shared bus contention has the largest impact on the adapted schedules.
We see several opportunities to further improve and extend this work. The work presented in this paper currently proceeds in two steps. A schedule that ignores all implementation factors (contention, jitter) is first generated. It is then updated in a second step to account for these implementation factors. Proposing an integrated approach, which in particular accounts for contention delays from the start, is an interesting direction for future work.
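The two-step structure can be illustrated with a deliberately simplified sketch. It assumes that step 1 has already produced a contention-free schedule, and it replaces the ILP-based adaptation with a greedy pass that keeps the mapping and per-core order while pushing start times later. The function name `adapt_schedule`, the fixed per-task `delay` values, and the example task graph are hypothetical; in practice, contention delays depend on which tasks actually overlap in time.

```python
def adapt_schedule(schedule, deps, delay):
    """Step 2 (greedy stand-in): re-time a contention-free schedule.

    schedule: list of (task, core, start, wcet) from step 1, WCETs > 0.
    deps:     dict mapping a task to the list of its predecessors.
    delay:    dict mapping a task to assumed extra cycles
              (e.g., scheduler overhead plus bus contention).
    The mapping and the per-core order of step 1 are preserved.
    """
    finish, core_free, adapted = {}, {}, []
    # Visiting tasks by original start time ensures predecessors come first.
    for task, core, start, wcet in sorted(schedule, key=lambda e: e[2]):
        ready = max((finish[p] for p in deps.get(task, [])), default=0)
        new_start = max(start, ready, core_free.get(core, 0))
        end = new_start + wcet + delay.get(task, 0)
        finish[task] = end
        core_free[core] = end
        adapted.append((task, core, new_start, end))
    return adapted

# Hypothetical step-1 schedule on two cores: (task, core, start, wcet).
step1 = [("T1", 0, 0, 10), ("T2", 1, 10, 20), ("T3", 0, 10, 15), ("T4", 0, 30, 5)]
deps = {"T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}
delay = {"T1": 2, "T2": 4, "T3": 3, "T4": 1}

adapted = adapt_schedule(step1, deps, delay)
print(max(end for _, _, _, end in adapted))
```

In this example the contention-free schedule has length 35, and the adapted one has length 42. An integrated approach would instead let the delays influence the mapping and ordering decisions themselves.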
In addition, more benefit can be drawn from cache reuse between tasks. A task may reuse the workloads of several tasks executed before it, not only of the task executed immediately before it. We believe that if those reuses are considered, the advantage of cache-conscious scheduling strategies can be further improved. We envision two approaches to exploit this cache reuse. The first approach is to take into account the reduction in the WCET of a task when it is executed after several tasks, rather than only after the task executed immediately before it. Since the number of possible execution orders that need to be considered increases, estimating context-sensitive WCETs and scheduling tasks both become more complex. The second approach is to use cache locking techniques (Puaut and Decotigny 2002; Arnaud and Puaut 2006) to ensure that the useful workloads of tasks remain in the caches until the task that refers to them starts executing. Task scheduling then has to be performed jointly with cache locking in order to fully exploit the benefit of cache reuse.
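The growth in complexity of the first approach can be quantified with a rough count. The sketch below gives an upper bound on the number of execution contexts per task when the WCET depends on the ordered sequence of the k tasks executed just before it on the same core. The function name `context_count` is hypothetical, and the bound ignores precedence constraints, which rule out some orders in a real task graph.

```python
from math import perm

def context_count(n_tasks, history_depth):
    """Upper bound on the ordered sequences of `history_depth` distinct
    predecessors that can precede one task among `n_tasks` tasks."""
    return perm(n_tasks - 1, history_depth)

# For a 10-task graph: depth 1 is the pairwise case studied in this paper.
for depth in (1, 2, 3):
    print(depth, context_count(10, depth))
```

For 10 tasks, the bound grows from 9 contexts per task at depth 1 to 72 at depth 2 and 504 at depth 3, each of which would require its own WCET estimate.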
