Abstract-Scratchpad memory is an attractive alternative to caches in real-time embedded systems due to its advantages in terms of timing predictability and power consumption. However, dynamic management of scratchpad content is challenging in multitasking environments. To address this issue, we propose the design of a novel Real-Time Scratchpad Memory Unit (RSMU). Our RSMU can be integrated in existing systems with minimal architectural modifications. Furthermore, scratchpad management is performed at the OS level, requiring no application changes. Compared to existing multitasking scratchpad management schemes, our approach improves schedulability by hiding the latency of memory transfers. We demonstrate and evaluate our system design on an embedded FPGA platform.
I. INTRODUCTION
Real-time schedulability theory demands knowledge of tasks' worst-case execution times (WCET) to provide timing guarantees. However, computing tight WCET bounds is becoming increasingly complex in modern embedded architectures. This is because Commercial-Off-The-Shelf (COTS) components are designed to improve average-case performance, and include advanced architectural optimizations such as out-of-order execution, dynamic branch prediction, multiple memory levels, and parallel interconnections. All such mechanisms are stateful and introduce unpredictability in the system by making resource access time dependent on access history.
In this paper, we focus on the problem of predictable access to local RAM resources, which are typically implemented as one or more levels of cache in COTS processors. In general, it is not possible to know exactly which memory accesses result in a cache hit. In the domain of static WCET analysis, an extraordinary amount of effort has been spent to model the cache state at each point in the execution of a task as precisely as possible [8], thus bounding the effects of intra-task interference (i.e., conflict misses). However, bounding inter-task interference in a preemptive multitasking system is significantly more complex: a preempting task can evict cache lines belonging to a preempted task, hence causing additional cache-related preemption delay. While approaches have been proposed to compute safe bounds on the set of evicted cache lines due to inter-task interference [1], the obtained bounds can be highly pessimistic and are not composable, since the WCET of a task becomes dependent on the pattern of memory accesses performed by all higher-priority tasks.
To improve the predictability of local memory accesses, ScratchPad Memory (SPM) has been proposed as an alternative to caches for real-time systems [15], [9], [11]. Since scratchpad content is explicitly managed, the state of local memory can be known precisely at all times, removing the unpredictability associated with intra-task interference in cache. As an added advantage, due to their simpler hardware structure, scratchpads consume less power than equally-sized caches [3]. On the other hand, scratchpad architectures typically place additional burdens on application programmers, who are now responsible for managing local RAM resources rather than relying on a transparent caching system. This issue is again compounded in multitasking systems, since optimal scratchpad management requires coordination among tasks.
We argue that to make scratchpad architectures viable in multitasking systems, the responsibility of managing SPM state should not be placed on the application programmer. Instead, in this paper we propose a novel hardware-software solution that dynamically loads/unloads tasks' content in the SPM under the control of the Operating System (OS). Our solution requires minimal application modifications and improves system schedulability by both allowing tighter WCET bounds and hiding the latency of memory accesses in a predictable manner. In detail, our contributions are as follows:
(1) We design a Real-time Scratchpad Memory Unit (RSMU) which transparently handles address translation and data transfers for tasks executed from SPM. Compared to other scratchpad-based predictable architectures [9] , the RSMU can be integrated in COTS embedded microprocessors with minimal overhead in terms of area and clock frequency.
(2) We detail a dynamic scratchpad management algorithm that hides the latency of loading/unloading the scratchpad by overlapping DMA operations with task execution. Compared to existing SPM management schemes, our technique significantly increases overall system utilization since the CPU never stalls while waiting to complete memory transfers.
(3) We demonstrate the applicability of our technique by implementing a hardware-software prototype on FPGA using FreeRTOS [5] and executing a set of memory-intensive real-time benchmarks. We support mixed-criticality applications, where safety-critical tasks execute from the SPM, while lower-criticality legacy applications are free to use existing cache resources. Only limited kernel modifications were required to support our RSMU and management algorithm.
The rest of the paper is organized as follows. We discuss related work in Section II. Section III introduces our system model and dynamic scratchpad management techniques. Details on RSMU operation and OS support are provided in Sections IV and V, respectively. Section VI discusses task scheduling and analysis. Finally, Section VII contains our detailed platform and schedulability evaluation, while Section VIII provides concluding remarks and future work.
II. RELATED WORK
ScratchPad Memory (SPM) has received considerable attention in the embedded and real-time communities as an alternative to processor caches [15] , [9] , [11] . However, support for dynamic SPM management in a multitasking environment has received comparatively limited attention. Dynamic scratchpad management based on compile-time task analysis has been proposed in [19] , [6] . Both approaches attempt to optimize the execution of a single task by dynamically loading/unloading portions of the task's code and data between main memory and SPM. They cannot be easily extended to multitasking systems. Most approaches aimed at preventing inter-task interference are based on static partitioning of memory resources, either in cache [17] , [10] or SPM [9] . These approaches can scale poorly as the number of allocated tasks increases.
SPM management in multitasking systems is considered in [20] , [18] , [7] . Verma et al. [20] first propose a hybrid approach: some tasks are assigned static partitions in SPM, but the SPM also includes a shared area where the content of the currently executed task can be loaded at run-time. However, this work is not targeted at real-time systems and instead attempts to minimize the overall power consumption for access to the memory hierarchy. The work in [18] extends the previous approach to sets of hard real-time tasks, but again considers power reduction as its only optimization metric. The authors of [7] consider on-demand allocation of scratchpad memory for embedded multimedia applications. The overall management scheme is akin to existing virtual memory management techniques and is not targeted at hard real-time workloads.
The work most directly related to our approach is the Carousel mechanism recently proposed by Whitham et al. [23], [22]. To the best of our knowledge, Carousel is the first work to explicitly consider managing on-chip scratchpad memory dynamically in a multitasking system with the objective of improving schedulability for hard real-time tasks. Under Carousel, the memory space of active tasks is organized as a stack of fixed-size blocks; at any time, only the top $n$ blocks are stored in the SPM. Carousel can be applied to any system where tasks are scheduled in a last-in first-out (LIFO) order, such as Rate Monotonic; this ensures that the currently executing task always occupies the top of the stack. When a higher-priority task is activated, it first uses DMA to swap out some blocks from SPM to main memory to free space, then it loads its own blocks, both code and data, before it starts executing. After the task finishes executing, it saves modified blocks back to main memory. Finally, the task swaps in all the blocks it swapped out earlier, so the stack is restored to the state prior to the task invocation. Carousel ensures that scratchpad-related preemption delay is predictable and accounted for in the WCET of the preempting task. However, the cost associated with moving blocks between main memory and the SPM can be significant. Figure 1 shows the overhead associated with a task invocation, where we assume that $code_i$ blocks of code and $data_i$ blocks of data of task $\tau_i$ must be loaded from main memory using DMA. From the figure, each task invocation incurs at least five DMA operations, moving a total of $4 \cdot data_i + 3 \cdot code_i$ blocks. During DMA transfers, the CPU is effectively stalled; as we show in Section VII-D, this can significantly impact system schedulability.
Compile-time SPM management approaches suffer dynamic data limitations due to pointer aliasing, pointer invalidating and object sizing. To avoid such problems, some dynamic approaches provide address translation from CPU virtual addresses to physical addresses either in main memory or in SPM. A standard Memory Management Unit (MMU) is employed in [7] , but it can only perform SPM allocation with the granularity of a virtual memory page; furthermore, it cannot avoid intra-task conflicts without compiler support. A Scratchpad Memory Management Unit (SMMU) is proposed in [21] . The SMMU handles SPM allocation and address translation on a per-object basis. A similar unit is employed in [23] , [22] , except that translation is per-block. In both cases, the translation logic must compare a virtual address against each object/block in the SPM. As a result, the comparators array circuit can be a significant performance bottleneck unless the number of objects/blocks is small. As we discuss in Section IV, the overhead of our translation mechanism is instead constant in the number and size of executed tasks.
III. SYSTEM MODEL
We consider scheduling a mixed-criticality system executed on a uniprocessor system, where the CPU is augmented with scratchpad memory and related management hardware. The system comprises a set $\Gamma^c$ of $N$ periodic critical tasks $\{\tau^c_1, \ldots, \tau^c_N\}$ and a set of periodic non-critical tasks. Critical tasks execute from the SPM, while non-critical tasks use cache memory available in the CPU. Our main objective is to protect the critical tasks from interference of other critical or non-critical tasks, while minimizing the overhead of SPM management. We assume a two-level scheduling scheme. Non-critical tasks run within a set of $M$ non-critical partitions, similar to the scheme adopted in Integrated Modular Avionics (IMA) systems [16]. The top-level scheduler allocates time to critical tasks and non-critical partitions. Each non-critical partition uses a bottom-level scheduler to allocate time to non-critical tasks executed within that partition. Any scheduling policy that supports an associated two-level scheduling analysis can be used at the bottom level. In particular, the fixed-priority scheduling used in [16] has been adopted to schedule non-critical tasks within their assigned partition.
In the context of this work, we assume that critical tasks and non-critical partitions are statically scheduled off-line, again in line with IMA practices. Each critical task $\tau^c_i$ is assigned one or more fixed-time slots within the hyperperiod (i.e., the least common multiple of critical tasks' and non-critical partitions' periods). Similarly, each non-critical partition is also [...]

Figure 2 also shows all DMA operations required to move code and data from main memory to the scratchpad and vice versa, as well as the state of the SPM at each context switch (i.e., boundary between two successive time slots). We do not detail the execution of non-critical tasks within the non-critical partition since they do not affect the scratchpad state. The SPM is dynamically partitioned so that it can hold the data and code of two successively executed critical tasks. During the execution of a critical task, the system uses DMA to load the code and data for the next scheduled critical task into the SPM. To free space, the system also unloads the previously executed critical task and writes back the modified content (data) to the main memory. For instance, the data of $\tau^c_1$ is unloaded to main memory while $\tau^c_2$ is running; then, the code and data of $\tau^c_3$ are loaded into the SPM from main memory. Note that there is no need to unload code from the SPM back to main memory, as it is not modified. As long as the time required for the described DMA operations is shorter than the length of the time slot of $\tau^c_2$, $\tau^c_3$ can run right after $\tau^c_2$ without any latency other than that for the context switch. As shown in Figure 2, non-critical tasks, which use the cache, can access memory to perform cache line fetches and replacements during the time slot of a non-critical partition. We do not allow DMA activity while a non-critical partition is running; consequently, the non-critical tasks do not suffer interference from DMA operations.
In addition, loading and unloading the SPM while a critical task is running will not degrade the task performance, as the CPU and the DMA access different partitions within the SPM. Therefore, critical tasks do not suffer interference from the DMA either.
The key difference between our proposed mechanism and existing dynamic SPM management schemes [23] , [18] is that our approach does not stall the CPU to load tasks into the SPM. Instead, we overlap DMA transfers with the execution of critical tasks on the CPU. This is possible because critical tasks do not need to access main memory while executing from SPM. Hence, we can co-schedule accesses to both system resources, CPU and main memory, predictably and in parallel. As we show in our scheduling evaluation in Section VII-D, by hiding the latency of DMA transfers, our scheme significantly improves system schedulability.
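To make the benefit of overlapping concrete, the following minimal Python sketch compares the two cases for a single time slot. The `CriticalTask` fields, the helper names, and all cycle counts are illustrative assumptions; they are not measurements or interfaces from our platform.

```python
from dataclasses import dataclass


@dataclass
class CriticalTask:
    name: str
    slot_cycles: int    # length of the task's time slot (hypothetical value)
    load_cycles: int    # DMA time to load code and data into the SPM
    unload_cycles: int  # DMA time to write modified data back to main memory


def dma_fits(prev, cur, nxt):
    """True if unloading `prev` and loading `nxt` completes within `cur`'s slot."""
    return prev.unload_cycles + nxt.load_cycles <= cur.slot_cycles


def stalled_length(cur):
    """CPU time consumed if the CPU instead stalls for the task's own load and unload."""
    return cur.load_cycles + cur.slot_cycles + cur.unload_cycles


# Made-up numbers, chosen only to illustrate the comparison.
tau1 = CriticalTask("tau1", slot_cycles=1000, load_cycles=300, unload_cycles=100)
tau2 = CriticalTask("tau2", slot_cycles=1000, load_cycles=400, unload_cycles=150)
tau3 = CriticalTask("tau3", slot_cycles=1000, load_cycles=350, unload_cycles=120)

overlap_ok = dma_fits(tau1, tau2, tau3)
stalled = stalled_length(tau2)
```

With these placeholder values the transfers for $\tau^c_1$ and $\tau^c_3$ fit inside the slot of $\tau^c_2$, so the slot costs the CPU only its own length, whereas a stalling scheme would add the load and unload time to it.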
IV. REAL-TIME SCRATCHPAD MEMORY UNIT
We base our system design on the architecture of a typical modern 32-bit embedded processor with separate instruction and data buses, as shown in Figure 3 . Therefore, the system consists of two SPMs along with two Real-time Scratchpad Memory Units (RSMUs). One pair of SPM and corresponding RSMU is for code (I-SPM and I-RSMU) and the other is for data (D-SPM and D-RSMU), similarly to a partitioned L1 cache architecture. Each SPM is divided into three partitions. The first partition is static and its size is fixed at compile time. It contains critical components of the OS, such as the scheduler, and libraries shared by multiple critical tasks. In addition, the OS can reserve a part of this partition to keep what we call the OS-SPM heap in order to provide dynamic memory allocation for critical tasks. The sizes of the other two dynamic partitions are variable, and are managed by the OS to fit the sizes of the critical tasks loaded in SPM at run-time. We use a standard DMA component managed by the OS to load/unload critical tasks between SPMs and main memory. Our implementation is based on the FreeRTOS [5] kernel, but our architecture is general and could be reasonably applied to most existing real-time OS. Details on the software implementation are presented in Section V. Figure 4 depicts a simplified view of the system's address map; for simplicity, we omit memory-mapped peripherals since they have no influence on the discussion. To clarify how critical tasks are managed, Figure 4 also shows how the code and data of a critical task τ c i is linked off-line, loaded in the system at boot time, and moved into SPMs at run-time. The CPU generates addresses in three memory regions: main memory, the I-RSMU's address space, and the D-RSMU's address space. Each RSMU allocates a much larger address space than the physical SPM connected to it requires: the RSMU has a physical address space and a virtual address space. 
The corresponding SPM occupies only the first $b$ addresses of the RSMU's address space, where $b$ is the size of the SPM in words. The first $b$ addresses are in the physical address space, and addresses beyond that are in the virtual address space. CPU requests targeting the RSMU's physical address space [...] to the physical addresses of the SPM partitions into which it has been loaded. This is done by instructing the two RSMUs to translate between the virtual addresses against which the task has been linked and the physical addresses of the SPM partitions. $\tau^c_3$ can then be executed. The described scheme makes the address translation transparent to application programmers, just like in a cache. In particular, the execution of a critical task is unaffected by the specific SPM physical addresses into which it is loaded at run-time. This ensures that porting application code to our system is straightforward; software does not need code annotations, special machine instructions or special compilers to take advantage of the described management technique. The main trade-off is that our scheme requires that the SPMs hold the code and data of two critical tasks at the same time instead of just one.
Finally, note that no special provisions are required for noncritical tasks and the OS. Outside of critical tasks, all other software components are linked against physical addresses (either in main memory or in the fixed SPM partition) and loaded at their assigned physical addresses at boot time.
A. Implementation Considerations
The main objective for the Real-time Scratchpad Memory Unit (RSMU) was to design a hardware component capable of translating addresses in an efficient way suitable for real-time systems. Therefore, unlike a traditional MMU, the RSMU translates addresses in constant time; no lookup table is required. In fact, the performance of the RSMU is independent of the number of tasks that can be loaded into the SPM or their sizes. In addition, the RSMU has minimal impact on the critical path and silicon area. As shown in Figure 3, the RSMUs are connected to the CPU via dedicated memory buses designed to connect to high-performance on-chip memories. Memory buses bypass the cache circuitry; consequently, CPU requests to RSMUs are uncached. Furthermore, SPMs are dual-ported to allow simultaneous access from the DMA and the CPU through the corresponding RSMUs.
Given that only one critical task is executing out of the SPMs at any given time, the RSMU is designed to translate only one task at a time. Figure 5 depicts the internal architecture of the RSMU. Each RSMU consists of two registers: the Translation-Edge Register (TER) and the Active-Task Offset Register (ATOR). The TER determines when the RSMU is supposed to translate an address. In particular, if the requested address is greater than the value stored in the TER, then the translation is performed. The translated address is based on the value stored in the ATOR. The equation for translating addresses is $SPM_{physical} = (CPU_{virtual} > TER)\ ?\ CPU_{virtual} - ATOR : CPU_{virtual}$. For example, if the value stored in the ATOR is "0x2000" and the value stored in the TER is "0x4000", then an incoming CPU address such as "0x5000" will be translated into "0x3000". Note that address "0x5000" is in the RSMU's virtual address space; thus, it needs to be translated. Typically, the TER is set to the end of that RSMU's physical space. This ensures that while a critical task is running, only requests targeting the RSMU virtual address space are translated. The ATOR is instead set to the difference between the base address of the virtual memory region against which the task is linked and the base address of the SPM partition into which the task is loaded at run-time.
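The translation rule is simple enough to model in a few lines. The Python sketch below is a software illustration of the equation above, not the hardware description; the function name `rsmu_translate` is an assumption introduced here.

```python
def rsmu_translate(cpu_addr, ter, ator):
    """Model of the RSMU rule: subtract ATOR only for addresses strictly above TER."""
    if cpu_addr > ter:
        return cpu_addr - ator  # virtual address: shift into the SPM partition
    return cpu_addr             # physical address: passed through unchanged


# Values taken from the example in the text.
translated = rsmu_translate(0x5000, ter=0x4000, ator=0x2000)
untouched = rsmu_translate(0x1000, ter=0x4000, ator=0x2000)
```

Here `translated` is 0x3000, matching the example, while an address at or below the TER, such as 0x1000, is returned as-is. The rule needs one comparison and one subtraction regardless of how many tasks exist, which is why its cost is constant.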
The RSMU is configured to cover a limited address space (e.g., 1 MB in our system, which is sufficient to map all critical tasks in our evaluation), and not the whole address space of the CPU, as in a conventional MMU. The underlying interconnect redirects requests to the RSMU only if the requested address is within the address space of the RSMU. Therefore, the RSMU intercepts only the requests targeting it. As a result, only one comparator circuit is required to determine which address space the incoming request is targeting. Overall, the RSMU uses smaller registers and address comparators (20 bits), which leads to faster hardware. A complete hardware evaluation is provided in Section VII-A. The reduction in the operating frequency caused by the RSMU is small compared to other translation units. Each RSMU exports memory-mapped configuration registers TER and ATOR. Any OS can utilize the RSMUs by employing a simple device driver to communicate with them through the configuration registers.
V. SOFTWARE ARCHITECTURE
FreeRTOS, like many other embedded real-time OSes, requires all software components, including the OS and all the tasks, to be compiled and linked together to produce the system executable binary file. The system is compiler independent, but a few lines corresponding to each critical task have to be added to the linker script. These lines put all input sections (code and data) of each critical task's object file contiguously into two output sections (code and data) named after that task, e.g. "task1.text" and "task1.data". This allows the OS to move the whole task as only two blocks. In addition, the linker script exposes several linker symbols to help the OS manage critical tasks. The most important symbols exposed by the linker script are the loading and virtual addresses of each critical task's sections, the size of each critical task's sections, and the SPM sizes.
We extended FreeRTOS to provide support for RSMU-based translation, SPM management and DMA operations. First, a device driver for the RSMUs has been developed. The driver exposes several macros in order to facilitate control of the RSMUs at a reasonable level of abstraction, e.g., map_section(virt_addr, phys_addr, size). Second, a check-up routine, to be executed at system initialization, has been developed to detect whether the compiled size of a critical task exceeds its allowed size. Using the linker symbols provided by the linker script, the check-up routine is able to verify the compiled code parameters by comparing them with the parameters defined by the system designer. This capability provides a shorter design cycle, as any design restrictions are identified earlier. If the check-up routine succeeds, the OS continues to initialize the system and then runs the scheduled tasks as described earlier.
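As an illustration of what the driver and the check-up routine have to do, the following Python sketch models one RSMU as its two registers. Only the name and parameter list of `map_section` come from the text; its body, the `Rsmu` class, and `oversized_tasks` are assumptions made for this example, and the real driver is a set of macros writing memory-mapped registers.

```python
class Rsmu:
    """Software model of one RSMU: only its two configuration registers."""

    def __init__(self, ter):
        self.ter = ter  # Translation-Edge Register
        self.ator = 0   # Active-Task Offset Register

    def map_section(self, virt_addr, phys_addr, size):
        # Assumption: `size` is only sanity-checked; translation needs just the offset.
        if size <= 0 or virt_addr <= self.ter:
            raise ValueError("section must be non-empty and linked above the TER")
        self.ator = virt_addr - phys_addr

    def translate(self, cpu_addr):
        return cpu_addr - self.ator if cpu_addr > self.ter else cpu_addr


def oversized_tasks(compiled_sizes, allowed_sizes):
    """Names of critical tasks whose compiled size exceeds the designer's limit."""
    return sorted(
        name for name, size in compiled_sizes.items() if size > allowed_sizes[name]
    )


d_rsmu = Rsmu(ter=0x4000)
d_rsmu.map_section(virt_addr=0x5000, phys_addr=0x3000, size=0x800)

# Hypothetical sizes in bytes; in the real system they come from linker symbols.
too_big = oversized_tasks(
    {"task1": 3000, "task2": 5000}, {"task1": 4096, "task2": 4096}
)
```

After the call to `map_section`, the ATOR holds 0x2000, so an access to 0x5004 resolves to 0x3004 in the SPM. The check reports `task2`, which is the kind of violation the check-up routine is meant to flag before any task runs.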
New modules called RSMU and Application Management Plug-in (AMP) have been developed and added to FreeRTOS. Since FreeRTOS does not distinguish between critical tasks and non-critical tasks, the AMP wraps some original FreeRTOS APIs, such as xTaskCreate, to handle critical tasks in addition to non-critical tasks. For example, when the AMP is used to create a critical task, the AMP places the stack in the data section of the critical task, and then assigns the appropriate priority level (critical/non-critical). The stack of a critical task is then moved with the rest of the task's data between the main memory and the SPM.
The scheduler is modified to adopt our scheduling policy. Context switches are triggered by the system's periodic timer interrupt. A scheduling table is generated off-line and used to determine when each time slot ends and the next one begins. The key is to know, at the time of a context switch, both the critical task that has to be executed and the critical task that will be scheduled immediately after. This allows us to pre-load a critical task, using DMA, into the SPM while the currently scheduled task is running. The DMA transfers are performed in the background and do not impact the overhead of the context switch on the CPU. However, the OS suffers some overhead to configure the DMA core before the DMA transfer can start. Due to the implementation of the DMA core we used in the experiments, the DMA core can support only one transfer at a time. Consequently, the OS must set up the DMA three times: once to unload the data of the previous critical task, and twice to load the code and the data of the next scheduled critical task. We incorporate the DMA set-up overhead in the schedulability analysis in Section VI. Note that interrupts do not alter the state of the SPM, since the timer interrupt service routine, the scheduler, the RSMU driver and the DMA set-up functions are allocated in the static SPM partitions.
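The table lookup performed at a context switch can be sketched as follows. This Python model is an assumption-laden illustration: the table layout, the function names, and the tuple encoding of a DMA request are invented here, and the real scheduler is part of the modified FreeRTOS kernel.

```python
def nearest_critical(table, idx, step):
    """Closest critical task before (step=-1) or after (step=+1) slot `idx`, cyclically."""
    n = len(table)
    for k in range(1, n + 1):
        kind, name = table[(idx + step * k) % n]
        if kind == "critical":
            return name
    return None  # the table contains no critical task


def dma_setups(table, idx):
    """DMA set-ups issued when the slot at `idx` starts."""
    kind, _ = table[idx]
    if kind != "critical":
        return []  # no DMA activity while a non-critical partition runs
    prev_task = nearest_critical(table, idx, -1)
    next_task = nearest_critical(table, idx, +1)
    return [
        ("unload", prev_task, "data"),  # write back modified data only
        ("load", next_task, "code"),
        ("load", next_task, "data"),
    ]


# Hypothetical off-line table for one cycle of the schedule.
table = [
    ("critical", "tau1"),
    ("critical", "tau2"),
    ("partition", "P1"),
    ("critical", "tau3"),
]
during_tau2 = dma_setups(table, 1)
during_p1 = dma_setups(table, 2)
```

For the slot of `tau2` the sketch yields the three set-ups from the earlier example: unload the data of `tau1`, then load the code and data of `tau3`. The choice to skip over non-critical partitions when looking for the previous and next critical task is our reading of the scheme, since partitions do not touch the SPM.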
For dynamic memory allocation, the malloc() and pvPortMalloc() functions are overridden. As a result, allocating dynamic data is now dependent on the current context. In particular, critical tasks allocate data from the OS-SPM heap, while non-critical tasks allocate data from the general system heap in main memory.
VI. TASK SCHEDULING AND ANALYSIS
This section discusses how to derive a predictable slotted schedule according to the two-level scheduling mechanism described in Section III. Each critical task τ [...] time units. Let $H$ be the length of the minor cycle, i.e., the greatest common divisor of all tasks' periods. We construct a periodic top-level schedule for critical tasks and non-critical partitions with period $H$. Each critical task and each partition is assigned one slot in the minor cycle. If a task has a period that is a multiple of the minor cycle, it executes for a fraction of its execution time in the minor cycle. The schedule has $N + M$ fixed slots in each minor cycle: each of $M$ slots is assigned to partition $\tau^{nc}_i$ and has a length [...]

The described approach is not the only possible scheme to build an off-line schedule for our RSMU architecture. In particular, a static schedule with period equal to the hyperperiod could result in a significantly reduced number of preemptions. We decided to use the minor cycle approach in the context of this work for two main reasons: (1) we believe that the main contribution of our paper is the description and demonstration of our novel RSMU architecture, and we leave the derivation of an optimized schedule, as well as the extension to on-line schedulers such as fixed-priority, as part of our future work; (2) since the proposed approach minimizes the size of time slots and maximizes the number of preemptions, it represents a worst-case situation for our proposed mechanism.
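The minor cycle and the share of work per cycle can be computed directly. The Python sketch below assumes integer periods and takes the per-cycle share of a task as its execution time scaled by $H$ over its period; this is our reading of "a fraction of its execution time" and does not include the rounding to the timer period applied when slot sizes are computed. All numbers are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def minor_cycle(periods):
    """Length H of the minor cycle: greatest common divisor of all periods."""
    return reduce(gcd, periods)


def per_cycle_share(exec_time, period, h):
    """Execution time a task needs in each minor cycle (before any rounding)."""
    return Fraction(exec_time * h, period)


periods = [10, 20, 40]  # hypothetical periods, in ms
H = minor_cycle(periods)
share = per_cycle_share(exec_time=6, period=20, h=H)
```

With these values the minor cycle is 10 ms, and a task with 6 ms of execution every 20 ms needs 3 ms in each minor cycle.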
We next compute the size $s_i$ [...]
where $p_s$ is the timer period. Note that [...]
To create the slotted schedule, we finally need to determine the order in which slots are executed within the minor cycle. Note that the order of non-critical partitions does not impact the schedulability of critical tasks in any way. Hence, we focus on computing the slot order for critical tasks; the schedulability of non-critical tasks can then be assessed based on the analysis in [16] . When deciding the slot order, we need to respect two constraints. First, the size of both the data and code scratchpad must be sufficient to execute the task in the current slot while performing the load/unload operations for the previous and next executed critical task. Second, the size of the slot must be sufficient to complete all DMA operations in time: the data of the previous critical task must be moved from the scratchpad to main memory, and both the code and data of the next critical task must be loaded from main memory.
To solve the slot assignment problem for critical tasks, we construct a simple SMT (Satisfiability Modulo Theories) problem instance. In the problem formulation in Equation 5 
Based on Equations 3 and 4, each slot has to be assigned to only one critical task and each critical task has to be assigned to only one slot. Equation 5 expresses the constraint on the DMA time. The right-hand side $\sum_{1 \le j \le N} x_{i,j} \cdot s_j$ computes the size of the $i$-th slot in the schedule, based on the task assigned to the slot. In the left-hand side, $\sum_{1 \le j \le N} x_{(i-1)\%N,\,j} \cdot DMA(data_j)$ represents the time required to unload from data scratchpad to main memory the data of the critical task executed in the previous slot, i.e., slot $(i-1)\%N$, where $\%$ is the modulo operation. Note that in slot 0, we need to unload the data used by the task executed in slot $N-1$ in the previous minor cycle. Similarly, $\sum_{1 \le j \le N} x_{(i+1)\%N,\,j} \cdot \big(DMA(data_j) + DMA(code_j)\big)$ represents the time required to load from main memory to data/code scratchpad the data/code (respectively) of the critical task executed in the following slot $(i+1)\%N$; again, note that in slot $N-1$, we need to load the code and data used by the task executed in slot 0 in the next minor cycle. Finally, Equations 6 and 7 express constraints on the size of the data and code scratchpad, respectively. As detailed in Section III, when a critical task is running, one portion of the scratchpad is used to contain the data/code of the running task, while the second portion allocates memory for the next executed critical task. Hence, we constrain the size of the data/code scratchpad to be at least equal to the data/code of the task running in slot $i$, plus the data/code of the task running in slot $(i+1)\%N$.
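For small task sets, the constraints described above can be checked by enumerating slot orders. The Python sketch below deliberately swaps the SMT formulation for a brute-force search over permutations, which satisfies the one-task-per-slot constraints by construction. The dictionary keys, function names, and task parameters are assumptions made for illustration; the parameters are not those of Table I, and the DMA set-up overhead is ignored.

```python
from itertools import permutations


def feasible(order, tasks, data_spm, code_spm):
    """Check the DMA-time and scratchpad-size constraints for one slot order."""
    n = len(order)
    for i in range(n):
        cur = tasks[order[i]]
        prev = tasks[order[(i - 1) % n]]
        nxt = tasks[order[(i + 1) % n]]
        # Unload previous data, load next data and code, all within the current slot.
        if prev["dma_data"] + nxt["dma_data"] + nxt["dma_code"] > cur["slot"]:
            return False
        # The running task and the next one must fit in each scratchpad together.
        if cur["data"] + nxt["data"] > data_spm:
            return False
        if cur["code"] + nxt["code"] > code_spm:
            return False
    return True


def find_order(tasks, data_spm, code_spm):
    """First feasible slot order in lexicographic order, or None."""
    for order in permutations(range(len(tasks))):
        if feasible(order, tasks, data_spm, code_spm):
            return order
    return None


# Hypothetical task set: times in time units, sizes in size units.
tasks = [
    {"slot": 10, "dma_data": 2, "dma_code": 3, "data": 6, "code": 8},
    {"slot": 8, "dma_data": 1, "dma_code": 1, "data": 4, "code": 4},
    {"slot": 8, "dma_data": 3, "dma_code": 4, "data": 10, "code": 8},
]
order = find_order(tasks, data_spm=16, code_spm=16)
```

For this made-up task set the order (0, 1, 2) fails because the second task's slot is too short to hide the transfers around it, while (0, 2, 1) satisfies every constraint. Enumeration grows factorially with $N$, which is one reason to prefer a solver for larger instances.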
As an example, consider Table I, which depicts task parameters for an arbitrary set of tasks. Assume the size of each scratchpad is 16 size units and the total length of the schedule fits within the minor cycle, i.e., s 
VII. PLATFORM EVALUATION
In this section, hardware, software, and schedulability are evaluated in order to give a clear view of how the system is performing. The hardware and software platforms are first evaluated independent of any running task. Then, several benchmarks are run to evaluate the combined hardware-software platform. Finally, evaluation of the system schedulability is conducted.
A. Hardware Evaluation
We implemented our solution on a hardware platform based on Altera's Cyclone II FPGA. The platform uses the Nios-II/f soft-core processor [4] with instruction and data caches of 16 KB each. Caches are direct mapped with a 32-byte cache line size. The platform provides 64 MB of off-chip SDRAM as main memory, running at 100 MHz, and a standard DMA core. The native operating frequency of the Nios-II processor was 100 MHz. A 64-bit cycle counter running at the same speed as the CPU is used to measure timing overheads. We modified the described platform by adding code and data RSMUs and SPMs, as shown in Figure 3. To fairly compare execution in the scratchpad versus cache, each SPM is also 16 KB in size. As shown in Table II, the frequency drops to 98.2 MHz, the area increases by 5.2% in terms of logic elements, and the used on-chip memory bits increase by 88.9%. The proposed design thus has minimal impact on logic area and operating frequency. On the other hand, the consumption of memory blocks is relatively high because we are adding SPM while retaining caches to better support legacy, non-safety-critical applications. As explained in Section III, the DMA does not interfere with cache while accessing main memory. As a result, the performance of the DMA core is entirely predictable. We evaluated DMA timing by first measuring the amount of time required to transfer some fixed amount of data from main memory to scratchpad or vice versa. We then derived estimated timing for a transfer of any size based on linear regression. Figure 6 depicts the measured and the calculated performance of the DMA for transfers from main memory to SPM. The derived equation for DMA timing is as follows:
where $DMA(b)$ is in cycles, and $b$ is in bytes. Note that the measured and estimated performance match extremely closely. The above equation does not include the DMA set-up overhead required to prepare the DMA transfer (see Table III). To accurately measure the DMA performance, the DMA set-up overhead is first determined by reading the cycle counter before and after the set-up routine. The DMA performance is then measured by reading the cycle counter right before the DMA set-up routine and again after the DMA finishes, which is detected by polling the status register of the DMA. Finally, the overhead is subtracted from the measurements.
Figure 6. DMA performance: the calculated performance matches the measured one.
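The measurement procedure, subtracting the set-up overhead and then fitting a line, can be sketched in Python as follows. Every number below is a synthetic placeholder chosen so that the arithmetic is easy to follow; none of them is a measurement from our platform, and the resulting coefficients are not those of the derived equation. The function name `fit_line` is likewise an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic placeholders, NOT measured values.
setup_cycles = 1000                      # cycle count of the set-up routine alone
sizes_bytes = [256, 512, 1024, 2048]     # transfer sizes b
raw_cycles = [1148, 1276, 1532, 2044]    # set-up plus transfer, per size

# Subtract the set-up overhead so the fit covers the transfer only.
transfer_cycles = [c - setup_cycles for c in raw_cycles]
slope, intercept = fit_line(sizes_bytes, transfer_cycles)


def dma_cycles(b):
    """Estimated transfer time in cycles for b bytes, using the fitted line."""
    return slope * b + intercept
```

Keeping the set-up overhead out of the fit matters because the overhead is paid once per transfer on the CPU, whereas the fitted term describes the background transfer whose latency the schedule must hide.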
B. Software Evaluation
About 1050 lines of code were added to the FreeRTOS kernel to support our scratchpad management scheme. Overall, the compiled size of the OS is small (18.5 KB) and, depending on the features and libraries needed by the critical tasks, only a small part of the OS must be stored in the SPM, starting from 4.5 KB for code and 324 B for data. Table III shows the software system parameters. The OS uses a timer period p_s of 1 ms. The timer interrupt overhead is measured by reading the cycle counter at the beginning of the timer Interrupt Service Routine (ISR) and reading it again before the ISR returns. We then add the time needed to execute the assembly instructions before and after the ISR (interrupt response and recovery time). This overhead applies only to system ticks that do not lead to a context switch. In the case of a context switch, the cycle counter is first read at the beginning of the timer ISR and then read again at the beginning of the new context, also accounting for the timer interrupt response time. In the same way, the context switch timing does not include any DMA set-up performed at the context switch. The table shows that the DMA set-up routine requires significant time. The reason is the non-optimized DMA driver provided by the Altera Hardware Abstraction Layer (HAL). With an optimized DMA driver, we expect the DMA set-up routine to take 100-150 cycles. The RSMU driver is implemented as a macro that allows the kernel to control the RSMUs and maps a critical task in eight clock cycles. Using the proposed build flow, several applications have been ported to this platform successfully without changing the applications' source code.
C. Benchmark Evaluation
A set of both synthetic and real benchmarks is executed on the platform to test the performance of scratchpad-based execution and to obtain data for our schedulability analysis. A synthetic benchmark is used to evaluate the baseline hardware performance. It simply iterates over a fixed-size buffer, reading the first word of each new cache line without performing any meaningful computation. We also selected and executed real benchmarks from existing embedded benchmark suites. The selection aims to represent several applications used in the embedded real-time domain. Furthermore, we chose memory-intensive applications to better stress the platform's memory subsystem. We chose three benchmarks from the well-known automotive EEMBC benchmark suite [14]: a2time (angle to time conversion), canrdr (response to remote CAN request), and rspeed (road speed calculation). We selected two benchmarks from the DIS (Data Intensive System) benchmark suite [13]: transitive and corner-turn.
We configure each benchmark based on the approach used in [2], which utilizes the same set of EEMBC applications. Each benchmark processes input data by executing a configurable number of iterations in each run; every iteration consumes an almost-constant amount of input data. As noted in [2], the memory access profile of each benchmark is highly dependent on the number of iterations. As an example, Figure 7 depicts the estimated cache stall ratio for the corner-turn benchmark. The stall ratio represents the fraction of the benchmark execution time during which the CPU is stalled while waiting for main memory operations (cache line fetches). On our platform, we evaluate the stall ratio by comparing the measured execution time of the task running from a cold cache versus its execution time when running out of the SPM. As shown in the figure, the stall ratio tends to decrease with an increasing number of iterations until it reaches an almost constant value; this is because each benchmark requires some fixed amount of code and data independently of the number of iterations. To obtain a realistic scenario, we thus select a number of iterations that falls within the flat part of the curve. Furthermore, we limit each benchmark either to run for at most 1 ms or, by size, to occupy no more than half of an SPM.
Fig. 7. The cache stall ratio for the corner-turn benchmark (x-axis: number of iterations; y-axis: ratio %)
Table IV shows the results for all analyzed benchmarks. The table reports the selected number of iterations, the size of the task's code and data in SPM, as well as the measured execution time of the task running from a cold cache, hot cache and SPM, respectively. The difference in execution time between hot cache and SPM is very small, indicating that the selected benchmarks do not suffer significant intra-task interference.
Most of the difference is actually caused by unpredictability due to the dynamic branch predictor. The table also reports the difference between the cold and hot cache execution times. Note that a dirty cache would result in even worse performance than the cold case, as the cache needs to write evicted lines back to main memory. The cache stall ratio provides a better estimate of the performance advantage of scratchpad execution: since our proposed scheme hides the latency of memory transfers to and from the SPM, the CPU never stalls while executing critical tasks from the SPM. The stall ratio of the synthetic benchmark is around 80%, indicating that SPM execution is roughly 5 times faster than execution from a cold cache. The stall ratio of the other benchmarks is highly dependent on their memory access patterns. The corner-turn benchmark, used in digital signal processing, has the highest ratio, about 19%, because it performs both unit-stride and non-unit-stride memory accesses, e.g., jumps to other cache lines. On the other hand, the transitive benchmark has the lowest ratio, as it performs regular unit-stride accesses. Finally, note that, in this particular implementation of the proposed platform, the CPU and the main memory (SDRAM) operate at the same frequency. However, in most COTS systems, the CPU runs at a significantly higher frequency than main memory. Hence, the measured stall ratios would be higher on such systems.
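The relation between stall ratio and speedup used in this discussion can be made explicit with a small sketch. It assumes that the stall ratio is computed as the fraction of the cold-cache execution time that disappears when running from the SPM, which is our reading of the comparison described above; the function names and the cycle counts in the example are hypothetical and are not values from Table IV.

```python
def stall_ratio(cold_cache_cycles, spm_cycles):
    """Fraction of the cold-cache execution time spent stalled on memory.

    Assumes SPM execution never stalls, so the difference between the two
    measurements is attributed entirely to cache line fetches.
    """
    return (cold_cache_cycles - spm_cycles) / cold_cache_cycles


def spm_speedup(ratio):
    """Speedup of SPM execution over cold-cache execution for a stall ratio."""
    return 1.0 / (1.0 - ratio)


# Hypothetical measurements, for illustration only.
ratio = stall_ratio(cold_cache_cycles=5000, spm_cycles=1000)
speedup = spm_speedup(ratio)
```

With a stall ratio of 80%, only one fifth of the cold-cache execution time remains, which gives the factor of roughly 5 quoted for the synthetic benchmark; a ratio of about 19% corresponds to a speedup of roughly 1.23.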
D. Schedulability Evaluation
In this section, we evaluate the scheduling scheme proposed in Section VI based on simulations. We compare our solution against the Carousel approach proposed in [23], [22], which to the best of our knowledge represents the state of the art in dynamic scratchpad management for real-time systems. The applications in Table IV, excluding the synthetic benchmark, are used to generate sets of random tasks. Given a target system utilization, tasks are generated one at a time: at every iteration, an application is randomly selected and assigned a random period from a predefined set of harmonic periods, {10 ms, 20 ms, 40 ms}. The task's utilization is then computed based on the application's execution time and the selected period. The generation stops when the sum of the individual tasks' utilizations reaches the required system utilization. After that, the overhead is added and the slot sizes are computed according to the methodology detailed in Section VI. The Z3 [12] SMT solver is used to solve the satisfiability problem and produce a feasible order of tasks within the minor cycle, if one exists.
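The task-set generation procedure described above can be sketched as follows. This is an illustrative sketch, not the simulator used for the evaluation: the execution times are placeholder values (the measured ones are in Table IV), the function name `generate_task_set` is hypothetical, and the sketch assumes that generation simply stops once the accumulated utilization reaches or exceeds the target, since the text does not state how the last task is handled. Overhead accounting, slot sizing, and the Z3-based ordering step are not shown.

```python
import random

# Harmonic periods used in the evaluation, in milliseconds.
PERIODS_MS = (10, 20, 40)

# Placeholder execution times in milliseconds (NOT the values of Table IV).
APP_EXEC_MS = {
    "a2time": 0.6,
    "canrdr": 0.5,
    "rspeed": 0.4,
    "transitive": 0.8,
    "corner-turn": 0.7,
}


def generate_task_set(target_utilization, rng):
    """Generate random tasks until their total utilization reaches the target.

    Returns a list of (application, period_ms, utilization) tuples and the
    total utilization of the generated set.
    """
    tasks = []
    total = 0.0
    while total < target_utilization:
        app = rng.choice(sorted(APP_EXEC_MS))
        period_ms = rng.choice(PERIODS_MS)
        utilization = APP_EXEC_MS[app] / period_ms
        tasks.append((app, period_ms, utilization))
        total += utilization
    return tasks, total


# Example: one task set for a 50% target utilization, with a fixed seed.
tasks, total = generate_task_set(0.5, random.Random(0))
```

Passing an explicit `random.Random` instance keeps each generated task set reproducible, which is convenient when the same sets must be evaluated under two different scheduling schemes.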
We specifically select harmonic periods for the sake of a fair comparison with Carousel. While we optimize the task schedule off-line, Carousel schedules tasks at run-time according to fixed priority; harmonic periods ensure that rate monotonic scheduling can schedule up to 100% CPU utilization. For a similar reason, we use the same system parameters for both schemes, such as context-switch overhead and DMA transfer time. Carousel schedulability is verified by applying response-time analysis as described in [23]; the analysis incorporates the context-switch overhead as well as the blocking time due to non-preemptive DMA operations. Figure 8 shows results in terms of the proportion of schedulable task sets for Carousel and our proposed approach. Each point in the graph represents 100 task sets. As shown in the figure, our approach is able to schedule a significantly higher number of task sets compared to Carousel. We believe that this result shows that hiding the latency of memory accesses can have a very significant impact on schedulability, even for systems such as the one in Table IV, where the time required to load data from main memory can be relatively small compared to execution time. The price we pay for this improvement is additional scratchpad memory: under Carousel, only the currently executing task must be loaded in SPM, whereas in our approach an additional partition must be reserved in SPM to pre-load the code and data of the next scheduled task.
We performed additional simulations to analyze the sensitivity of our approach to varying system parameters. Figure 9 evaluates schedulability for reduced scratchpad sizes. In particular, as the D-SPM size is reduced, there will be pairs of tasks that cannot fit into the scratchpad together; note that Table IV includes two benchmarks with a data size close to half the original D-SPM size. In order to show the pure effect of the scratchpad size on schedulability, a low utilization point of 10% is selected.
However, a similar effect is expected at other utilization points. The result confirms that our approach requires the SPM to be sufficiently large to accommodate the data of two critical tasks.
Fig. 9. The effect of changing the size of the scratchpad on schedulability
Finally, Figure 10 shows the effect of including larger periods in the set of harmonic periods. A high utilization point (90%) is selected in order to show the system's tolerance for overhead. Note that since we select periods with a harmonic ratio of 2, the largest period in the figure is 1024 times the smallest one. In such a situation, the minor cycle formulation in Section VI is less effective because tasks with long periods are split into very small time slots, increasing context switch and DMA overhead.
Fig. 10. The effect of increasing the set of the harmonic periods on schedulability (x-axis: number of different harmonic periods in the set; y-axis: schedulability)
VIII. CONCLUSIONS AND FUTURE WORK
Scratchpad memory is an attractive alternative to caches in real-time systems due to its higher predictability and lower power consumption. However, scratchpad management is challenging in the presence of a multitasking environment: previous approaches either statically partition the scratchpad, which is inefficient for a large number of tasks, or incur high overhead to load/unload the scratchpad content at run-time.
In this work, we have proposed a novel scratchpad architecture based on a Real-Time Scratchpad Memory Unit (RSMU) and a dynamic management scheme. Our approach can be integrated in existing COTS architectures with minimal modifications to existing hardware and software components. Furthermore, we improve system schedulability by hiding the latency of loading/unloading the scratchpad. We demonstrated and evaluated our system design on an FPGA platform.
As future work, we would like to extend our platform in two directions. First of all, we plan to support multicore systems. Since main memory is a shared resource, we will need to coordinate DMA operations performed by different cores. Some SPM partition should also be dedicated to communication among cores. A memory-centric scheduling approach to avoid contention among prefetching operations in main memory has been proposed in [24] ; however, it does not consider overlap between core execution and memory loading/unloading, which is key to our approach. Second, we would like to extend our approach to on-line schedulers such as fixed-priority and EDF. Since under on-line scheduling the next executed task is unknown, a preempting task cannot be executed immediately. However, we believe we can still overlap DMA and CPU activity by allowing the preempted task to continue execution while the DMA loads the preempting task into SPM.
