Abstract-We propose a unified approach to the problem of scheduling a set of tasks with individual release times, deadlines and precedence constraints, and allocating the data of each task to the SPM (Scratchpad Memory) on a single processor system.
I. INTRODUCTION
In a typical embedded system, there are multiple concurrent tasks. Tasks may be subject to release times, deadlines, and precedence constraints. The release time and the deadline of a task specify its earliest start time and the latest completion time in a feasible schedule. The precedence constraints specify the data and control dependencies between tasks. In hard real-time embedded systems, it is essential to find a feasible schedule for all the tasks at the design stage.
The problems of scheduling a set of tasks with various constraints have been extensively studied [8, 15, 16]. Most scheduling problems are NP-complete. On a single processor, if tasks are preemptible, the EDF (earliest deadline first) strategy is guaranteed to find a feasible schedule for a set of tasks with individual release times, deadlines and precedence constraints whenever one exists [8]. However, if tasks are not preemptible, the problem of finding a schedule with minimum lateness for a set of independent tasks with individual release times and deadlines on a single processor is NP-complete [8].
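To make the preemptive EDF strategy concrete, the following minimal Python sketch simulates it for independent tasks on one processor. The function name preemptive_edf, the tuple layout (release, wcet, deadline), and the task values in the usage note are hypothetical and chosen only for illustration; precedence constraints are not handled here.

```python
import heapq


def preemptive_edf(tasks):
    """Simulate preemptive EDF on a single processor.

    tasks maps a task name to (release, wcet, deadline), with wcet > 0.
    Returns (segments, finish): segments is a list of (name, start, end)
    execution slices and finish maps each task to its completion time.
    """
    pending = sorted(tasks, key=lambda n: tasks[n][0])  # ordered by release time
    remaining = {n: tasks[n][1] for n in tasks}
    ready = []      # min-heap of (deadline, name)
    segments = []
    finish = {}
    t = 0
    i = 0
    while i < len(pending) or ready:
        if not ready:
            # Processor is idle: jump to the next release time.
            t = max(t, tasks[pending[i]][0])
        while i < len(pending) and tasks[pending[i]][0] <= t:
            name = pending[i]
            heapq.heappush(ready, (tasks[name][2], name))
            i += 1
        _, name = ready[0]  # ready task with the earliest deadline
        next_release = tasks[pending[i]][0] if i < len(pending) else float("inf")
        # Run until the task completes or the next task is released.
        run = min(remaining[name], next_release - t)
        if segments and segments[-1][0] == name and segments[-1][2] == t:
            segments[-1] = (name, segments[-1][1], t + run)  # extend current slice
        else:
            segments.append((name, t, t + run))
        t += run
        remaining[name] -= run
        if remaining[name] == 0:
            heapq.heappop(ready)
            finish[name] = t
    return segments, finish
```

For example, with tasks A = (0, 4, 10) and B = (1, 2, 4), task B preempts A at time 1, and both tasks meet their deadlines.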
Scratchpad memory is the on-chip SRAM managed by the compiler. It is an attractive alternative to cache in embedded systems due to its three major advantages. Firstly, it consumes less energy than cache. Secondly, it is easier to compute the WCET (Worst-Case Execution Time) of a task because the access time of each variable or instruction is known at compile time. Thirdly, the compiler can usually hide data hazards in modern RISC processors without any hardware support as the latency of each data access to SPM is known at compile time.
However, SPM also introduces additional challenges. One major challenge is that the task scheduling problem and the SPM allocation problem are mutually dependent. On one hand, the WCET of a task is dependent on the size of the SPM allocated to it. On the other hand, the size of the SPM allocated to each task is dependent on whether this task is preempted in the schedule or not. If a task Ti is preempted by another task Tj, then Ti and Tj cannot use the same section of SPM to store data, assuming there is no dynamic SPM reallocation.
In this paper, we study the problem of scheduling a set of tasks with individual release times, deadlines and precedence constraints on a single processor of an embedded system where SPM is used to replace data cache, and the problem of allocating the SPM to each task. We assume that the target embedded system is a hard real-time system where the deadline of each task must be met. We make the following major contributions.
1. We propose a novel unified approach to the task scheduling problem and the SPM allocation problem. The unified approach consists of a task scheduling algorithm and an SPM allocation algorithm. The task scheduling algorithm aims at minimizing the number of preemptions in a feasible schedule for the task set. The SPM allocation algorithm employs a novel data structure, namely, the preemption graph, to efficiently allocate SPM to tasks.
2. We have evaluated our approach and the one proposed by Suhendra et al. [19] by using six task sets with tasks selected from three benchmark suites, Powerstone [13], Mälardalen WCET Benchmarks [7], and SNU real-time benchmarks [14], and from an open-source UAV (Unmanned Aerial Vehicle) control application, PapaBench [5]. Across all the task sets, our approach improves the WCRT (worst-case response time) by up to 20.31% over their approach.
The rest of this paper is organized as follows. Section II describes the system model and key definitions. Section III shows how to determine the maximum SPM size for each task. Section IV describes our unified approach to task scheduling and SPM allocation. Section V presents our experimental results. Section VI describes related work. Section VII concludes the paper.
II. SYSTEM MODEL AND DEFINITIONS
The target hard real-time embedded system uses a single processor where an SPM is used to replace data cache. The size of the SPM is m bytes. The SPM occupies a contiguous section of the processor's memory space. The start address of the SPM is 0. The SPM is only used to store the local (stack) data of tasks. The problem of allocating the global data, heap data and code of a task set to SPM will be studied in future work.
978-1-4673-3030-5/13/$31.00 ©2013 IEEE
There is a set V = {T1, T2, ..., Tn} of n tasks to be executed on the processor. Each task is preemptible by any other task. However, our unified algorithm for task scheduling and SPM allocation preempts a task only if it is necessary. Tasks have individual release times, deadlines and precedence constraints.
The precedence constraints are represented by a DAG (directed acyclic graph) G = (V, E), where V = {T1, T2, ..., Tn} is the set of tasks, and E = {(Ti, Tj) : Tj can be executed only after Ti finishes} is a set of precedence constraints between tasks. Each task Ti has the following attributes:
1. Pre-assigned release time R(Ti).
Definition 1 Given a schedule σ and a task Ti, the live range of Ti, denoted as L(Ti), is a time interval [S(Ti), F(Ti)], where S(Ti) and F(Ti) are the start time and the finish time, respectively, of Ti in σ.
Given two tasks, they can share a section of SPM iff their live ranges do not overlap.
Definition 2 Given a schedule σ for a set of tasks, the interference graph of σ is an undirected graph G(σ) = (V, E), where V = {T1, T2, ..., Tn} is the set of tasks, and E = {(Ti, Tj) :
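The sharing rule and the interference graph can be sketched in a few lines of Python. This sketch assumes that an edge of the interference graph connects two tasks whose live ranges overlap, which is what the sharing rule suggests, and that two live ranges that merely touch at an endpoint do not overlap. The function names and the example live ranges are hypothetical.

```python
from itertools import combinations


def live_ranges_overlap(r1, r2):
    """Return True if live ranges r1 = (S, F) and r2 = (S, F) overlap.

    Assumption: ranges that only touch at an endpoint (one task finishes
    exactly when the other starts) are treated as non-overlapping.
    """
    (s1, f1), (s2, f2) = r1, r2
    return s1 < f2 and s2 < f1


def interference_graph(live):
    """Build the edge set of an interference graph.

    live maps a task name to its live range (start, finish). An edge joins
    two tasks whose live ranges overlap, i.e. tasks that cannot share SPM.
    """
    edges = set()
    for a, b in combinations(sorted(live), 2):
        if live_ranges_overlap(live[a], live[b]):
            edges.add((a, b))
    return edges
```

With live ranges T1 = (0, 6), T2 = (1, 3) and T3 = (6, 8), only T1 and T2 interfere, so T3 may reuse the SPM section of either task.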
Definition 3 Given a schedule, a task Ti is said to preempt a task Tj iff one of the following conditions holds: 1) Ti preempts Tj directly. 2) Ti is scheduled immediately after the completion of another task Ts, and Ts preempts Tj in the schedule.
Notice that our definition of preemption is a generalization of the traditional one.
Definition 4 Given a schedule σ for a set of tasks, the preemption graph of σ is a directed graph G = (V, E), where V = {T1, T2, ..., Tn} is the set of tasks, and E = {(Ti, Tj) : Ti, Tj ∈ V and Tj preempts Ti in σ}.
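As an illustration of Definitions 3 and 4, the sketch below derives the edges of a preemption graph from a list of execution slices. The slice format (name, start, end), the function name, and the example schedules are hypothetical. The sketch assumes a work-conserving schedule, so the processor is never idle while a task is suspended.

```python
def preemption_graph(segments):
    """Return the edges (preempted, preemptor) of a preemption graph.

    segments is a time-ordered list of (name, start, end) execution slices.
    A task that starts while another task is suspended is taken to preempt
    the most recently suspended task. This covers both direct preemption
    and the generalized case where the task starts right after a preemptor
    of the suspended task completes.
    """
    last_index = {}
    for idx, (name, _, _) in enumerate(segments):
        last_index[name] = idx  # index of the final slice of each task

    edges = []
    suspended = []   # stack of started but unfinished tasks
    started = set()
    prev = None
    for idx, (name, _, _) in enumerate(segments):
        if prev is not None and last_index[prev] >= idx:
            suspended.append(prev)       # prev was interrupted before finishing
        if name in started:
            suspended.remove(name)       # name resumes execution
        else:
            started.add(name)
            if suspended:
                edges.append((suspended[-1], name))
        prev = name
    return edges
```

If A runs, is interrupted by B, and then C runs immediately after B completes and before A resumes, the result contains both (A, B) and (A, C), so A has two children in the graph.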
98-3
It is easy to see that the preemption graph of any schedule constructed by using an EDF scheduler is a forest. The preemption graph is a key data structure of our unified algorithm for task scheduling and SPM allocation. We can easily prove that for each path in a preemption graph, the live ranges of any two tasks on the path overlap.
It is very important to determine the maximum SPM size of each task in a fair manner.
The approaches proposed in [20, 21] assume that the maximum SPM size of each task is known without proposing any approach to determining the maximum SPM size of each task in a fair manner. The approaches proposed in [18, 19]
Next, we propose a new approach for determining the maximum size of the SPM for each task based on our previous work on allocating variables of a single task to SPM [23]. For each variable vi, we define a benefit vector benefit(vi) as follows.
$$\mathrm{benefit}(v_i) = \frac{l - l'(v_i)}{\mathrm{size}(v_i)} \qquad (1)$$
where l is the vector of the lengths, in non-increasing order, of the k longest paths of the task immediately before allocating vi to the SPM, l'(vi) is the vector of the lengths, in non-increasing order, of the k longest paths of the task immediately after allocating vi to the SPM, and size(vi) is the size of vi. Intuitively, the benefit vector of a variable vi is the normalized contribution of vi to the k longest path lengths of the task. To compare any two benefit vectors, we use lexicographical ordering.
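A small Python sketch may help to fix the idea. It assumes that the benefit vector is the component-wise reduction of the k longest path lengths divided by the size of the variable, which is one reading of the "normalized contribution" described above. The function name and the numbers are hypothetical. Python tuples compare lexicographically, which matches the ordering used for benefit vectors.

```python
def benefit_vector(before, after, size):
    """Compute a benefit vector for one variable.

    before and after are the k longest path lengths of the task, each in
    non-increasing order, immediately before and after the variable is
    placed in the SPM; size is the size of the variable in bytes.
    Assumption: the vector is (before - after) / size, component-wise.
    """
    return tuple((b - a) / size for b, a in zip(before, after))


# Hypothetical example with k = 2: the longest path shrinks from 100 to 80
# cycles, the second longest path is unchanged, and the variable is 40 bytes.
example = benefit_vector((100, 90), (80, 90), 40)
```

Here example evaluates to (0.5, 0.0). It is lexicographically greater than (0.4, 9.0), because the first components already differ.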
In our definition of benefit vector, k is a parameter. On one hand, the larger the value of k, the more accurate the benefit vector. On the other hand, the larger the value of k, the higher the time complexity of computing a benefit vector.
In order to determine the maximum SPM size for each task in a fair manner, we introduce a threshold benefit vector α_min for all the tasks. For each task, we select a variable as an SPM resident only if its benefit vector is greater than α_min. The threshold benefit vector is a parameter of our approach. Its value needs to be tuned for each specific task set to obtain the best performance.
We determine the maximum SPM size Si of each task Ti as follows: Keep selecting the variable of Ti on the longest path that has the maximum benefit vector, provided that this benefit vector is greater than α_min, and allocating it to the SPM of the task, until no variable can be selected. For more details on selecting a most beneficial variable and allocating it to SPM, we refer to [23].
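The selection loop can be sketched as follows. This is a simplified illustration, not the procedure of [23]: each path is modeled as a base length plus a table of per-variable savings, every unallocated variable is considered (not only those on the longest path), and the benefit vector is assumed to be the reduction of the k longest path lengths divided by the variable size. All names and numbers are hypothetical.

```python
def k_longest(paths, in_spm, k):
    """Lengths of the k longest paths, in non-increasing order.

    paths is a list of (base_length, savings), where savings maps a
    variable to the cycles saved on that path if the variable is in SPM.
    """
    lengths = [base - sum(s for v, s in savings.items() if v in in_spm)
               for base, savings in paths]
    return tuple(sorted(lengths, reverse=True)[:k])


def max_spm_size(paths, sizes, alpha_min, k=2):
    """Greedily pick variables whose benefit vector exceeds alpha_min.

    sizes maps each variable to its size in bytes. Returns the total size
    of the selected variables and the set of selected variables.
    """
    in_spm = set()
    while True:
        before = k_longest(paths, in_spm, k)
        best_var, best_vec = None, alpha_min
        for v in sizes:
            if v in in_spm:
                continue
            after = k_longest(paths, in_spm | {v}, k)
            vec = tuple((b - a) / sizes[v] for b, a in zip(before, after))
            if vec > best_vec:  # lexicographic comparison of tuples
                best_var, best_vec = v, vec
        if best_var is None:
            return sum(sizes[v] for v in in_spm), in_spm
        in_spm.add(best_var)
```

With two paths (100, {x: 30, y: 5}) and (90, {y: 20}), sizes x = 40 and y = 400, and a threshold of (0.1, 0), only x is selected, so the maximum SPM size of this hypothetical task is 40 bytes.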
IV. UNIFIED TASK SCHEDULING AND SPM ALLOCATION
Given a set S of tasks with individual release times, deadlines, and precedence constraints, our objective is to find a feasible schedule for S on a single processor with an SPM with a size of m bytes to store local data of the tasks. A feasible schedule is one that satisfies all the constraints. Our unified approach to task scheduling and SPM allocation consists of two major parts: the task scheduling algorithm and the SPM allocation algorithm. The task scheduling algorithm aims at minimizing the number of preemptions when finding a feasible schedule for the task set. By default, it uses non-preemptive EDF (npEDF) scheduling. It uses preemptive EDF (pEDF) scheduling only if a task misses its deadline under npEDF scheduling. Initially, no task is preempted. Therefore, the whole SPM is allocated to each task Ti. When a task currently scheduled meets its deadline, the task scheduling algorithm calls the SPM allocation algorithm to allocate SPM to the task and each of the predecessors of the task in the preemption graph. During the execution of our unified approach, if a task is preempted, the SPM size of each predecessor of the task may decrease.
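The default npEDF pass can be illustrated with a short Python sketch. It covers only non-preemptive EDF dispatching for independent tasks and the detection of deadline misses; the switch to pEDF, the undo step, and the interaction with SPM allocation are not modeled. The function name, the tuple layout (release, wcet, deadline), and the example values are hypothetical.

```python
def non_preemptive_edf(tasks):
    """Dispatch independent tasks with non-preemptive EDF.

    tasks maps a task name to (release, wcet, deadline).
    Returns (schedule, missed): schedule lists (name, start, finish) in
    execution order, and missed lists the tasks that finish after their
    deadlines, which are the cases where preemption would be considered.
    """
    unscheduled = set(tasks)
    t = min(release for release, _, _ in tasks.values())
    schedule = []
    missed = []
    while unscheduled:
        ready = [n for n in unscheduled if tasks[n][0] <= t]
        if not ready:
            # Idle until the next release.
            t = min(tasks[n][0] for n in unscheduled)
            continue
        # Earliest deadline first; ties are broken by task name.
        name = min(ready, key=lambda n: (tasks[n][2], n))
        start, finish = t, t + tasks[name][1]
        schedule.append((name, start, finish))
        if finish > tasks[name][2]:
            missed.append(name)
        unscheduled.remove(name)
        t = finish
    return schedule, missed
```

For tasks A = (0, 4, 10) and B = (1, 2, 4), A runs to completion first and B finishes at time 6, after its deadline of 4. This is the kind of situation in which a preemption becomes necessary.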
Our unified approach uses the following variables:
• D(Ti): the deadline of Ti,
• wcet(Ti): the current worst-case execution time of task Ti,
• accu_time(Ti): the accumulated execution time of task Ti,
• preempted(Ti): a Boolean variable, denoting if task Ti has been preempted before,
• miss(Ti): a Boolean variable, denoting if task Ti has missed its deadline before, and
• start: storing successive scheduling points. A scheduling point is a time point at which a task is scheduled.
Our unified approach works as follows:
1. Compute the edge-consistent deadlines for all the tasks, and initialize the relevant data structures.
iii. Remove all the edges incident to the tasks in C in the current schedule from the preemption graph.
If a task
iv. Undo the current schedule for C, and continue to schedule all the unscheduled tasks, including the tasks in C.
Create an empty preemption graph G;
start = the earliest release time of all the tasks in S;
Scheduler_Allocator(S, start);
The SPM allocation algorithm works incrementally based on the current partial SPM allocation scheme. It is called by the task scheduling algorithm whenever a new task Ti is successfully scheduled. When called, it starts with Ti and works toward the source (root) task along the path from Ti to the root in the preemption graph. For each task Tj visited in the preemption graph, our SPM allocation algorithm tries to allocate Sj bytes to it. If Sj bytes are not available, it allocates the remaining free SPM space to Tj, subject to the interference constraints. Once a task Tk cannot be allocated Sk bytes, none of its predecessors in the preemption graph is allocated any SPM space. For each task Ti, we introduce four variables, start_addr(Ti), end_addr(Ti), spm_size(Ti), and wcet(Ti), where start_addr(Ti) and end_addr(Ti) are the start address and the end address of Ti, respectively, in the SPM. For a non-leaf task that can be allocated to SPM, its start address is one plus the maximum end address of all its children.
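The walk toward the root can be sketched in Python as follows. This is an illustration under stated assumptions, not Algorithm 3 itself: a leaf task is assumed to start at address 0, end addresses are inclusive, and the WCET update is omitted. The function name, the dictionary-based graph representation, and the sizes in the usage note are hypothetical.

```python
def allocate_along_path(task, parent, children, req, alloc, m):
    """Allocate SPM to a newly scheduled task and its ancestors.

    parent maps a task to the task it preempts (its parent in the
    preemption graph); children maps a task to the tasks that preempt it.
    req maps a task to its maximum SPM size in bytes, m is the SPM size,
    and alloc maps a task to (start_addr, end_addr) or None. alloc is
    updated in place and returned.
    """
    exhausted = False
    node = task
    while node is not None:
        if exhausted:
            alloc[node] = None  # a descendant could not be fully allocated
        else:
            ends = [alloc[c][1] for c in children.get(node, []) if alloc.get(c)]
            # Non-leaf: one plus the maximum end address of its children.
            start = max(ends) + 1 if ends else 0
            size = min(req[node], m - start)  # take what is still free
            alloc[node] = (start, start + size - 1) if size > 0 else None
            exhausted = size < req[node]
        node = parent.get(node)
    return alloc
```

As a usage example with an SPM of 2048 bytes, suppose task A requires 1024 bytes and is preempted by B (1400 bytes) and later by C (512 bytes). Calling the function for A, then B, then C leaves B at addresses 0 to 1399, C at 0 to 511, and A with only the remaining 648 bytes at 1400 to 2047. B and C share addresses because their live ranges do not overlap.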
Our SPM allocation algorithm is shown in Algorithm 3.
Next, we use an example to explain how our unified approach to task scheduling and SPM allocation works. We also use it to compare our SPM allocation algorithm with the graph coloring based SPM allocation technique proposed in [19].
There is a set of 10 independent tasks to be executed on a single processor where an SPM of 2K bytes is used to store local data of the tasks. The task attributes are shown in
Notice that by our SPM allocation algorithm, a task may not be fully allocated to the SPM. In this example, if we change the SPM size requirement of T1 to 1K bytes, our algorithm will allocate only 648 bytes of SPM space to T1.
Consider the final schedule shown in Figure 1b. For simplicity, we ignore the start times and finish times of all the tasks, and only consider their execution order. Next, we show how the graph coloring based SPM allocation technique proposed in [19] works. The interference graph of all the tasks in the final schedule is shown in Figure 1d. After applying the coloring algorithm, we have three colors. T1, T8 and T10 are assigned color 1, T2, T6, T7 and T9 are assigned color 2, and T3, T4 and T5 are assigned color 3. The SPM is partitioned into three disjoint sections for color 1, color 2 and color 3, respectively, and all the tasks with the same color share a section of the SPM, as shown in Figure 1f. As we can see, in order to place all the tasks in the SPM, the size of the SPM must be at least 3048 bytes, in contrast to the SPM size of 2K bytes needed by our SPM allocation algorithm. As a result, their approach cannot find a feasible schedule given an SPM size of 2K bytes.
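The space cost of a coloring-based partition can be illustrated with the sketch below. A simple greedy coloring is used as a stand-in, since the exact coloring algorithm of [19] is not reproduced here. Each color receives one SPM section sized for its largest task, and the sections are disjoint. The function names, the graph, and the sizes are hypothetical and are not the values of the example above.

```python
def greedy_coloring(nodes, edges):
    """Color an interference graph greedily, visiting nodes in order.

    Stand-in for the coloring step: adjacent tasks get different colors.
    Returns a dict mapping each node to a color index starting at 0.
    """
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    color = {}
    for n in nodes:
        used = {color[x] for x in adj[n] if x in color}
        c = 0
        while c in used:
            c += 1
        color[n] = c
    return color


def partitioned_spm_size(color, req):
    """SPM bytes needed when each color owns one disjoint section.

    The section of a color must hold the largest task with that color.
    """
    biggest = {}
    for task, c in color.items():
        biggest[c] = max(biggest.get(c, 0), req[task])
    return sum(biggest.values())
```

For a task A (1024 bytes) that interferes with B (1400 bytes) and C (512 bytes), where B and C do not interfere, two colors are needed and the partition requires 1024 + 1400 = 2424 bytes, which exceeds a 2048-byte SPM.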
V. EXPERIMENTAL RESULTS
A. Experiment Setup
In order to evaluate our unified approach, we created six task sets as shown in Table I. We selected 20 applications from three benchmark suites: Powerstone [13], Mälardalen WCET Benchmarks [7] and SNU real-time benchmarks [14]. The statistics of all the applications are given in Table II. Each application is a task. Each task set consists of a subset of tasks from the 20 applications. There are no precedence constraints for the first five task sets.
The 6th task set comes from a real-life open-source UAV control application, PapaBench [5]. It consists of 28 tasks and operates in two modes, fly by wire and autopilot, which means that the aircraft can be controlled both manually and automatically. Each mode consists of several tasks to control the aircraft and communicate with the ground station. For our evaluation purposes, we separated the tasks from the original implementation, and maintained the control dependencies among these tasks. The statistics of all the tasks in the 6th task set can be found in [19].
We implemented both our unified approach and the CR approach proposed in [19]. When determining the maximum SPM size for each task, we set k to 2, and the threshold benefit vector α_min to (0.1, 0). Since the CR approach does not handle individual deadlines, we revised it such that the interference between two tasks cannot be eliminated if delaying one task causes its deadline to be missed. In addition, we set the priority of each task to its deadline, where a smaller deadline implies a higher priority.
We manually assigned each task in all the six task sets a release time and a deadline for every SPM configuration in such a way that many preemptions are needed in order to find a feasible schedule.
We modified Chronos 4.0 [12] to calculate the WCET of each application with different SPM sizes. Infeasible path detection is enabled in Chronos. The target architecture is an out-of-order, pipelined processor, with an instruction cache and perfect branch prediction. If the instruction cache is hit, an instruction fetch takes 1 cycle. Otherwise, it takes 100 cycles. The target processor uses scratchpad memory to replace data cache. As in [19], the latencies of scratchpad memory and off-chip memory accesses are 1 cycle and 100 cycles, respectively. The execution time of each instruction is 1 cycle.
B. Results and Analysis
We evaluated both our approach and the CR approach under three different SPM size configurations: 10%, 20% and 30% of the total data size. We use two performance metrics, namely WCRT and feasible schedule, to compare both approaches. The WCRT of a schedule is the maximum completion time minus the minimum start time of all the tasks. The WCRTs produced by both approaches under various configurations are shown in Figure 2.
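The WCRT metric as defined above reduces to a one-line computation over a schedule. The sketch below assumes a schedule given as a list of (name, start, finish) entries, one per task; this representation and the example values are hypothetical.

```python
def wcrt(schedule):
    """WCRT of a schedule: maximum completion time minus minimum start time.

    schedule is a list of (name, start, finish) entries, one per task,
    where start and finish are the start and completion times of the task.
    """
    latest_finish = max(finish for _, _, finish in schedule)
    earliest_start = min(start for _, start, _ in schedule)
    return latest_finish - earliest_start
```

For a schedule in which A occupies [2, 9] and B occupies [3, 5], the WCRT is 9 - 2 = 7.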
In each figure, the black bars are for our approach and the light bars for the CR approach. Each bar represents the relative WCRT increase WCRT_inc, which is computed from WCRT_base and WCRT_alg, where WCRT_base is the WCRT of a schedule, computed by using pEDF, for the same task set without any SPM, and WCRT_alg is the WCRT computed by the approach being evaluated.
For the 10% SPM size in Set1, the CR approach cannot find a feasible schedule that meets all the deadlines while our approach does. Therefore, the second bar is empty. For Set3, the WCRTs computed by our approach and the CR approach are close. For this task set, the schedules computed by both have the same number of preemptions. However, our approach achieves a slightly better SPM utilization due to our more efficient SPM allocation algorithm. As a result, our approach performs slightly better. For all the other task sets, our approach performs significantly better. The maximum improvement on WCRT of our approach over the CR approach is 20.31%, which occurs in Set2 under the 30% SPM size configuration.
There are two major reasons that our approach performs better. The first reason is that our approach preempts a task only if it is necessary. The second reason is that our SPM allocation algorithm is more efficient, as we demonstrated in an example in Section IV.
It is worth noting that SPM is much less effective for an out-of-order processor than for an in-order processor used in [17]. The reason is that an out-of-order processor can hide off-chip memory access latencies by executing other ready instructions.
VI. RELATED WORK
The problems of scheduling tasks with various constraints have been extensively studied [8, 15, 16]. Various scheduling techniques have been proposed. One common assumption made by all the previous scheduling techniques is that the WCET of each task is known. If SPM is used to replace cache, this assumption no longer holds. As a result, the previous scheduling techniques that do not consider SPM are not applicable to processors with SPM.
A number of research groups have studied the SPM allocation for a single task [1, 2, 3, 9, 10, 11, 17, 24]. All the techniques proposed assume that the amount of the SPM allocated to each task is known, which is not true for typical embedded systems with concurrent tasks. As a result, those techniques cannot be used to solve the SPM allocation problem for concurrent tasks.
Recently, several research groups studied the SPM/cache allocation problem for concurrent tasks. [22] exploits both cache partitioning and dynamic cache locking to provide worst-case performance estimates for multitasking systems. [6] studies the problem of placing multiple tasks in the cache to improve cache performance. It proposes an ILP based approach to optimally placing multiple tasks in the cache. The ILP formulations aim to minimize a cost function which is the total conflicts multiplied by a weight assigned to each task. [4] proposes a dynamic scratchpad memory code allocation technique that supports dynamically created processes. Their approach partitions SPM into pages. At runtime, an SPM manager loads code pages of the running applications into the SPM on demand. It supports different sharing strategies that determine how the SPM is distributed among the running processes.
[21] proposes scratchpad memory management techniques for priority-based preemptive multitasking systems. The techniques are applicable to a real-time environment. It proposes three methods, namely the spatial, temporal, and hybrid methods, with an objective to achieve energy reduction in the instruction memory subsystems. It formulates each method as an ILP problem that simultaneously determines (1) partitioning of scratchpad memory space for the tasks, and (2) allocation of program code to scratchpad memory space for each task.
None of the above-mentioned approaches considers the mutual impacts between task scheduling and SPM allocation. As a result, they cannot achieve the best SPM utilization. [18] and [19] consider the mutual impacts between task scheduling and SPM allocation. [18] proposes an integrated task mapping, scheduling, SPM partitioning, and data allocation technique based on ILP. All the tasks are free of timing constraints and subject to precedence constraints. The ILP formulation explores the optimal performance limit and shows that integrated task scheduling and SPM optimization improves performance by up to 80% for embedded applications.
[19] presents several dynamic scratchpad allocation techniques that take the process interferences into account to improve the performance and predictability of the memory system. It models the application as an MSC (Message Sequence Chart) to capture the interprocess interactions. It proposes an iterative allocation algorithm that consists of two critical steps: (1) analyzing the MSC along with the existing allocation to determine potential interference patterns, and (2) exploiting this interference information to tune the scratchpad reloading points and content so as to best improve the WCRT.
The approach proposed in [19] is the most closely related to ours. Both their approach and ours take into account the mutual impacts between task scheduling and SPM allocation. Both consider real-time tasks with precedence constraints. However, there are four key differences between our approach and theirs. Firstly, our SPM allocation algorithm is more efficient than their graph coloring based approach. Secondly, in our approach, all the tasks are initially non-preemptible, which means that each task occupies the whole SPM. A task is preempted only if another task with a smaller deadline misses its deadline. In their approach, all the tasks are preemptible at the beginning, and are not allocated any SPM space. Detailed analysis in the CR (Critical Path Interference Reduction) algorithm is used to reduce the number of preemptions. As a result, our approach leads to fewer preemptions and higher SPM utilization. Thirdly, the task models are different. Under their task model, all tasks are periodic tasks without any additional release times, and all tasks have the same deadline. Our task model assumes that each task has its own release time and deadline. Lastly, their approach aims to minimize the worst-case response time. In contrast, our approach aims to minimize the number of preemptions while constructing a feasible schedule.
VII. CONCLUSION
We have proposed a unified approach to the problem of scheduling a set of tasks with individual release times, hard deadlines, and precedence constraints on a single processor where an SPM is used to replace data cache to store stack data of each task, and the problem of allocating SPM to each task. Our approach consists of two algorithms: a task scheduling algorithm and an SPM allocation algorithm. The former aims at minimizing the number of preemptions by using a mix of preemptive and non-preemptive EDF scheduling strategies. The latter employs a novel data structure, namely, the preemption graph, to allocate SPM to each task. Our simulation results show that our unified approach performs better than the approach proposed in [19]. Our future work is to extend our approach to multiprocessor systems.
