Introduction
Memory management is a major concern when developing real-time and embedded applications. Because of predictability issues, real-time systems are most often strictly static, avoiding dynamic allocation/deallocation and virtual memory. However, as these systems grow increasingly large and complex, there is now a need to move beyond this strictly static memory management.
In particular, a recent trend towards systems where different functions are implemented by concurrent processes can be observed, for instance in Integrated Modular Avionics (IMA) systems or in the automotive industry. Such systems need spatial separation between processes, which can be easily implemented via the use of the Memory Management Unit (MMU) of commercial processors. Moreover, cost constraints may limit the amount of physical memory available.
This study was partially supported by the French National Research Agency project Mascotte (ANR-05-PDIT-018-01).
Virtual memory uses hardware support (the MMU, Memory Management Unit) to compute at run-time where an address (called a virtual address) is located in physical memory. The virtual address space of a program is divided up into fixed-size units called pages. The mapping between virtual pages and physical pages is stored in data structures consulted by the MMU at every memory access (page tables stored in RAM, and a fully-associative cache named the TLB, for Translation Look-aside Buffer, which speeds up accesses to page tables). When a program attempts to reference an unmapped page, the MMU notices that the page is unmapped and traps to the operating system; such a trap is called a page fault. Upon a page fault, the operating system loads the page from disk (page-in). Symmetrically, when no free physical page is left, a replacement policy implemented by the operating system selects one physical page to evict from main memory (page-out). Modified pages have to be copied back to disk before being evicted; this is done either in the page fault handler or by an independent process, depending on the operating system. The benefits of virtual memory are twofold: (i) it provides spatial protection between processes, since each process has a private page table; (ii) it allows the execution of tasks whose address space is larger than the capacity of physical memory, since pages are paged in and out on demand, transparently to the programmer.
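As an illustration of the translation path described above, the following minimal Python sketch models the TLB and the page table as dictionaries. The names `translate`, `PageFault` and `PAGE_SIZE`, the 4096-byte page size and the unbounded TLB are assumptions made for the example only; a real MMU performs these steps in hardware and has a TLB of limited capacity.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (illustrative value)


class PageFault(Exception):
    """Raised when a virtual page has no mapping (trap to the operating system)."""


def translate(vaddr, tlb, page_table):
    """Return (physical address, source of the mapping) for a virtual address.

    tlb and page_table are dicts mapping virtual page numbers to
    physical page numbers. The TLB is modeled without a capacity limit.
    """
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    if vpage in tlb:                       # fast path: TLB hit
        return tlb[vpage] * PAGE_SIZE + offset, "tlb"
    if vpage in page_table:                # slow path: page table lookup
        tlb[vpage] = page_table[vpage]     # refill the TLB for later accesses
        return page_table[vpage] * PAGE_SIZE + offset, "page_table"
    raise PageFault(vpage)                 # unmapped page: page fault
```

The two possible sources of a mapping ("tlb" or "page_table") correspond to the two translation latencies whose unpredictability is discussed in this paper.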
In real-time systems, it is crucial to prove that tasks will meet their temporal constraints in all situations, including the worst-case situation. Therefore, predictability of performance is as important as average-case performance. One should be able to predict the Worst-Case Execution Time (WCET) of pieces of software for the system timing validation [13]. Virtual memory raises predictability issues at two levels:
Level of address translation: obtaining the mapping from a virtual page to a physical page requires a TLB lookup, plus possibly a page table lookup if the mapping is absent from the TLB. The duration of address translation is hard to predict, because: (i) not all mappings can be stored in the TLB because of its limited capacity, thus it is difficult to know which mappings will be served by the TLB and which ones will require a page table lookup; (ii) the TLB is shared between concurrent processes.
Level of paging activity: knowing whether or not a reference to a virtual page will result in a page fault is hard to predict. This is because physical memory is shared between concurrent processes, and in general any physical page, regardless of its owner process, may be selected by the page replacement algorithm. In addition, the replacement algorithm is never strict Least Recently Used (LRU), because it would be too costly to maintain the ordering of page references using current MMUs. Furthermore, common replacement policies may be arbitrarily complex and are in general undocumented or insufficiently documented, because they are implemented in software inside the operating system. For instance, a dedicated process may be used to write dirty pages back to disk, and some physical pages may be temporarily locked during a page-in or a page-out.
So far, attempts to provide real-time address spaces have focused on the predictability of virtual to physical address translation [11, 2]. Demand paging is carefully avoided: all physical pages are deliberately created in memory at process load-time, or wired in memory, to avoid the unpredictability due to page faults. Surprisingly, little effort has been devoted to reconciling the benefits of paging (in particular its ability to execute programs larger than main memory) with predictability. Providing some form of predictable paging seems very important to us, in a context where the volume of software embedded into devices grows and cost considerations limit the amount of available memory in some systems.
This paper makes a first step in that direction. We propose a compiler approach to introduce a predictable form of paging, in which page-in and page-out points of virtual pages are selected at compile-time, thanks to the static knowledge of possible references to virtual pages. Our approach operates on a single task and currently considers references to code only. The problem under study can be expressed as a graph coloring problem, heavily used in compilers for register allocation. Since the graph coloring problem is NP-complete for three or more colors, we define a heuristic which, in contrast to those used for register allocation, targets worst-case performance instead of average-case performance. Experimental results on task code show that predictability does not come at the price of performance loss as compared to standard demand paging.
The rest of the paper is organized as follows. Related work is surveyed in Section 2. Section 3 formulates the problem of off-line selection of page-in and page-out points as a graph coloring problem and proposes a WCET-oriented graph coloring heuristic. Experimental results on code are given in Section 4. Implementation issues and directions for future work are dealt with in Section 5.
Related work
A very predictable approach called overlaying [12] was used before hardware support for virtual memory became common. The software is divided into pieces called overlays. When an overlay is needed, it is explicitly loaded into memory by the program, overwriting an overlay that is no longer needed. Overlaying techniques, while highly predictable, were in most systems not automatic, requiring manual work from the programmer to define the overlays.
Virtual memory appeared in the sixties to provide spatial isolation between concurrent processes and to allow programs larger than the amount of physical memory to execute. The definition of efficient page replacement algorithms received considerable attention in the seventies. The optimal page replacement algorithm as defined in [1] evicts the page that will not be used for the longest time. Obviously, optimal replacement cannot be implemented in practice because it requires an exact knowledge of future memory accesses. Instead, existing page replacement strategies exploit the knowledge of past references to guess future ones. The most widely used replacement algorithms are approximations of Least Recently Used (LRU) replacement. LRU evicts the page that has not been used for the longest time. Approximations of LRU are used instead of strict LRU because strict LRU would be too costly to implement using standard hardware. Indeed, most MMUs use only 2 bits per page: U (for Used) and M (for Modified). The U (resp. M) bit is set by the MMU at every reference (resp. modification); these two bits are reset by software to implement an approximation of LRU replacement that is efficient in the average case. Our work differs from existing page replacement strategies in that we predetermine page-in and page-out points at compile-time rather than at run-time as in standard demand paging.
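To illustrate how a single U bit per page can approximate LRU, here is a minimal sketch of the classic second-chance (clock) policy. It is one well-known approximation, not necessarily the policy of any particular operating system, and the function and variable names are hypothetical.

```python
def clock_page_faults(refs, nframes):
    """Count page faults of a second-chance (clock) policy.

    refs is the sequence of referenced virtual pages and nframes the
    number of physical pages. used[i] plays the role of the U bit of
    the page held in frame i.
    """
    frames = [None] * nframes   # virtual page held by each physical page
    used = [False] * nframes    # U bits, set on reference, cleared by the scan
    hand = 0                    # position of the clock hand
    faults = 0
    for page in refs:
        if page in frames:
            used[frames.index(page)] = True   # reference sets the U bit
            continue
        faults += 1
        # Skip recently used pages, clearing their U bit (second chance).
        while used[hand]:
            used[hand] = False
            hand = (hand + 1) % nframes
        frames[hand] = page                   # evict and load the new page
        used[hand] = True
        hand = (hand + 1) % nframes
    return faults
```

On the reference string 1, 2, 3, 1, 4, 1 with three physical pages, this sketch takes 5 page faults whereas strict LRU takes 4: the scan clears all U bits and evicts page 1 although it was the most recently used, which shows how such approximations depart from strict LRU.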
So far, demand paging has been avoided in real-time operating systems. Demand paging simply cannot be implemented in real-time operating systems running on processors without an MMU. For processors with an MMU, some systems like Spring [10] use the MMU for protection between processes only. In Spring, all the pages of a program are loaded at process start so that page faults do not occur. Furthermore, the number of pages used by a process is limited, so that all address translations are served by the TLB without resorting to page table lookups. Other real-time systems like RT-Mach [16] and real-time extensions of POSIX provide a system call to wire pages in memory for real-time tasks. VxWorks [17] provides a library to control address translation, but does not provide any support for demand paging. Bennett and Audsley [2] focus on page table structure for address translation predictability. Our work builds on these studies ensuring predictability of address translation, and focuses on the predictability of the paging activity.
One may view the predictability issues caused by paging systems as identical to those raised by caches. Many methods have been designed in recent years to estimate worst-case execution times on architectures with instruction and/or data caches [9, 5], for different cache structures and replacement policies. The tightest predictions are obtained for LRU replacement. In contrast, pseudo round-robin and random replacement yield looser timing estimates [6]. Analysis methods originally defined for caches cannot be directly transposed to paging systems. The main reason is that page replacement policies are more sophisticated and less documented than cache replacement policies because they are software-implemented. Moreover, due to hardware-software interactions, page replacement policies are not strict LRU. Thus, to the best of our knowledge, no attempt to statically analyze page replacement policies has been made so far. In this paper, for the above-mentioned reasons, we do not try to predict the worst-case behavior of dynamic paging. Instead, we select page-in and page-out points at compile time thanks to the knowledge of possible future page references.
Graph coloring was recently used by Li et al. [8] for automatically managing transfers between scratchpad memory and off-chip memory, with performance considerations in mind. In contrast, our work focuses on transfers between RAM and disk and is predictability-oriented rather than performance-oriented.
Predictable paging: a graph coloring approach
This section is devoted to the modeling of predictable paging as a graph coloring problem. We first draw an informal parallel between predictable paging and graph coloring in paragraph 3.1. Paragraphs 3.2, 3.3 and 3.4 then describe our algorithm for static selection of page-in and page-out points in more detail.
Assuming that referenced virtual pages are known statically, which is common in real-time systems for predictability considerations, it is possible to define the program regions where these pages are used. We will term such regions webs. A web for a virtual page vp is the set of basic blocks that reference vp (see Fig. 1.a, in which three virtual pages are used).
Two webs are said to interfere if their intersection is not empty. An interference graph can then be defined: a node in the interference graph corresponds to a web, and an (undirected) edge corresponds to an interference between two webs (see Fig. 1.b).
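The definitions of webs and interference can be sketched as follows, assuming the set of virtual pages referenced by every basic block is given as a dictionary. The names `build_webs` and `interference_graph` and the data layout are hypothetical choices for the example.

```python
from itertools import combinations


def build_webs(used):
    """Return a dict mapping each virtual page to its web.

    used maps a basic block name to the set of virtual pages it
    references; the web of a page is the set of blocks referencing it.
    """
    webs = {}
    for block, pages in used.items():
        for vp in pages:
            webs.setdefault(vp, set()).add(block)
    return webs


def interference_graph(webs):
    """Return the set of undirected edges (a, b), a < b, between
    webs whose intersection is not empty."""
    edges = set()
    for a, b in combinations(sorted(webs), 2):
        if webs[a] & webs[b]:
            edges.add((a, b))
    return edges


# Small example with three virtual pages and four basic blocks.
used = {"b1": {1}, "b2": {1, 2}, "b3": {2, 3}, "b4": {3}}
webs = build_webs(used)
edges = interference_graph(webs)
```

In this example, webs 1 and 2 interfere through b2 and webs 2 and 3 through b3, while webs 1 and 3 do not interfere and may therefore share a physical page.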
Defining a mapping between virtual and physical pages amounts to assigning a physical page to every web, assuming a limited number of physical pages. It is equivalent to coloring the interference graph with a limited number of colors, one per physical page (see Fig. 1.c, with two physical pages represented by colors black and grey). Obviously, it might happen that the interference graph is not colorable. Then, webs have to be split, resulting in extra page-ins and page-outs. This iterative process has to be repeated until the interference graph becomes colorable.
The mapping between virtual and physical memory pages is a straightforward result of the coloring process (see rectangles in Fig. 1.d). Similarly, the location of page-in and page-out points is a direct result of the coloring: page-in points are on the web incoming edges, and page-out points are on the web outgoing edges (see small bullets in Fig. 1.d, shown for virtual page 2 only).
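The placement of page-in and page-out points on the incoming and outgoing edges of a web can be sketched as below. The function name `paging_points` and the representation of the CFG as a list of (source, destination) pairs are assumptions; the sketch also assumes that the task entry and exit are handled separately when they belong to the web.

```python
def paging_points(cfg_edges, web):
    """Return (page_in_edges, page_out_edges) for one web.

    cfg_edges is a list of (source, destination) basic block pairs and
    web is the set of basic blocks of the web. A page-in is placed on
    every edge entering the web, a page-out on every edge leaving it.
    """
    page_in = [(u, v) for (u, v) in cfg_edges if u not in web and v in web]
    page_out = [(u, v) for (u, v) in cfg_edges if u in web and v not in web]
    return page_in, page_out


# Linear CFG b1 -> b2 -> b3 -> b4 with a web covering b2 and b3.
cfg_edges = [("b1", "b2"), ("b2", "b3"), ("b3", "b4")]
page_in, page_out = paging_points(cfg_edges, {"b2", "b3"})
```

Here the page is loaded on edge (b1, b2) and released on edge (b3, b4); the internal edge (b2, b3) carries no paging operation.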
This problem is similar to register allocation in compilers [3], where webs represent variable usage and colors represent the processor's physical registers. Spill code is the register allocation equivalent of page-ins/page-outs in our problem.
In the following, we will use the term N-colorable to denote an interference graph colorable using N colors. The degree of a node in the interference graph will denote its number of neighbors in the interference graph.
Webs and interference graph are defined for every task. An inter-procedural Control Flow Graph (CFG) is constructed at compile-time. There is one node per basic block and an edge for every possible sequence between two basic blocks (caused by conditional and unconditional branches, function calls and function returns). The set of virtual pages that may be referenced by a basic block is assumed to be known at compile-time.
Let $G = (B, E)$ denote the program CFG, with $B$ the set of basic blocks and $E$ the set of transitions between basic blocks.
Ideally, if the number of physical pages were large enough, a virtual page would be paged-in before its very first use, and paged-out only after its last use, regardless of the references to virtual pages in between. As a consequence, the starting point of the coloring process is the interference graph of maximal webs.
More formally, a maximal web for a virtual page $vp$ is defined as the set of basic blocks either using $vp$ or belonging to an execution path between two basic blocks using $vp$ (see figure 2). Let $used(b)$ denote the set of virtual pages used by basic block $b$, and $succ^*(b)$ denote the set of direct or indirect successors of $b$ in the CFG. The maximal web of virtual page $vp$ is defined as follows:
$$MaxWeb(vp) = \{ b \mid vp \in used(b) \} \cup \{ b \mid \exists b_1, b_2 : vp \in used(b_1) \wedge vp \in used(b_2) \wedge b \in succ^*(b_1) \wedge b_2 \in succ^*(b) \}$$
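The definition of a maximal web can be illustrated by a small reachability computation. This is a minimal sketch under assumed data structures (a successor dictionary for the CFG and a dictionary of used pages per block); the names `reachable` and `maximal_web` are hypothetical.

```python
def reachable(succ, start):
    """Blocks reachable from start through one or more CFG edges."""
    seen, stack = set(), list(succ.get(start, ()))
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(succ.get(block, ()))
    return seen


def maximal_web(succ, used, vp):
    """Blocks using vp, plus blocks on a path between two blocks using vp."""
    users = {b for b, pages in used.items() if vp in pages}
    # Blocks that can be reached after some use of vp.
    after_use = set()
    for b in users:
        after_use |= reachable(succ, b)
    # Blocks from which some use of vp can still be reached.
    blocks = set(succ) | set(used) | {s for ss in succ.values() for s in ss}
    before_use = {b for b in blocks if reachable(succ, b) & users}
    return users | (after_use & before_use)


# Linear CFG b1 -> b2 -> b3 -> b4 -> b5; page 1 is used by b2 and b4.
succ = {"b1": ["b2"], "b2": ["b3"], "b3": ["b4"], "b4": ["b5"], "b5": []}
used = {"b1": set(), "b2": {1}, "b3": {2}, "b4": {1}, "b5": set()}
web_1 = maximal_web(succ, used, 1)
web_2 = maximal_web(succ, used, 2)
```

Block b3 does not reference page 1 but lies between two uses of it, so it belongs to the maximal web of page 1; the page therefore stays in memory from b2 to b4.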
Figure 2. Maximal webs
The starting point of the coloring algorithm is the set of maximal webs. Once a web is colored, its assigned color is not changed (greedy algorithm). Every web is assigned an integer weight, which is used as a heuristic in the coloring process and defines the order in which webs are colored. In the following, constant nbcol will represent the number of colors (number of physical pages). The algorithm assumes that every basic block uses a number of virtual pages lower than or equal to nbcol, but obviously it supports a total number of used virtual pages much larger than nbcol.
The algorithm for coloring the interference graph is sketched below. The data structures used by the coloring algorithm are first built (lines 6 and 7). Function AssignWeight, called at line 8, assigns a weight to every maximal web (see paragraph 3.4 for a description of the weight functions). The algorithm then iteratively tries to color the interference graph through a call to function Color (loop at lines 10 to 14). Function splitWeb splits the web that caused the coloring process to fail, if any.
The pseudo code of function Color is presented below. Functions getWebsGreaterOrEqual (resp. getWebsLowerThan) return the set of webs in the interference graph with a degree greater than or equal to (resp. lower than) parameter nbcol. Function Color scans and colors webs by decreasing weight value (loop in lines 11 to 14). Function AssignColor, called at line 13, assigns a color to web $w$, such that the assigned color is different from those of the interfering webs.
The web splitting procedure splitWeb, whose algorithm is not detailed for space considerations, splits the first non-colorable web detected by procedure Color. Let us assume that a particular web $w$ is to be split and interferes with a set of already colored webs. Procedure splitWeb extracts from $w$ the fully connected sub-web containing the smallest set of interfering nodes. An illustration of procedure splitWeb is given in figure 3. Assuming that webs 1 and 3 are already colored, the interference graph is not 2-colorable and thus web 2 has to be split. The smallest set of nodes interfering with webs 1 and 3 is b, which is excluded from web 2, which is thus split into webs 2.1 and 2.2. In this small example the resulting interference graph becomes 2-colorable.
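The following Python sketch illustrates a greedy coloring step in the spirit of function Color. It is not the original listing, so the line numbers quoted in the text do not apply to it. It assumes that webs whose degree is at least nbcol are colored first, by decreasing weight, followed by the remaining webs; this ordering, the function name `color` and the data layout are assumptions, and web splitting is not covered.

```python
def color(webs, edges, weight, nbcol):
    """Greedy coloring of the interference graph.

    webs is an iterable of web identifiers, edges a set of undirected
    (a, b) pairs, weight a dict web -> integer weight and nbcol the
    number of physical pages. Returns (colors, failed_web), where
    failed_web is the first web that could not be colored, or None.
    """
    neighbors = {w: set() for w in webs}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Webs with degree >= nbcol may fail to be colored: handle them first.
    high = [w for w in neighbors if len(neighbors[w]) >= nbcol]
    low = [w for w in neighbors if len(neighbors[w]) < nbcol]
    order = (sorted(high, key=weight.get, reverse=True)
             + sorted(low, key=weight.get, reverse=True))
    colors = {}
    for w in order:
        taken = {colors[n] for n in neighbors[w] if n in colors}
        free = [c for c in range(nbcol) if c not in taken]
        if not free:
            return colors, w          # this web would have to be split
        colors[w] = free[0]           # a color, once assigned, never changes
    return colors, None


# Three mutually interfering webs and two physical pages.
triangle = {(1, 2), (2, 3), (1, 3)}
weights = {1: 10, 2: 1, 3: 5}
colors, failed = color([1, 2, 3], triangle, weights, 2)
```

With two colors, webs 1 and 3 are colored first because of their higher weights and web 2 is reported as the web to split, which matches the situation of figure 3.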
Webs are colored by decreasing weight order. As our target applications have real-time constraints, we are primarily interested in minimizing their worst-case timing requirements. As a consequence, the weight function for a web $w$, called hereafter the frequency-based weight function and noted $W_{freq}(w)$, accounts for the impact of the web on the task worst-case execution time. $W_{freq}(w)$ is defined as follows:
$$W_{freq}(w) = \sum_{b \in w,\ vp_w \in used(b)} freq(b) \qquad (1)$$
where $vp_w$ is the virtual page associated to web $w$, $used(b)$ is the set of virtual pages used by basic block $b$, and $freq(b)$ is the number of references to basic block $b$ along the program worst-case execution path (WCEP). Execution frequencies along the worst-case execution path are a direct result of WCET estimation tools using Integer Linear Programming (ILP) to estimate WCETs [14]. Since the WCEP may change due to coloring decisions, it is re-evaluated regularly. The re-evaluation period is a parameter, ranging from re-evaluation at every coloring to no re-evaluation at all.
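A frequency-based weight can be sketched as follows. The sketch assumes that the weight of a web sums the WCEP execution counts of the blocks of the web that reference its virtual page; this reading, the function name `freq_weight` and the example frequencies are assumptions made for illustration.

```python
def freq_weight(web_blocks, vp, used, freq):
    """Sum of WCEP execution counts over the blocks of the web using vp.

    web_blocks is the set of basic blocks of the web, vp its virtual
    page, used maps a block to the pages it references, and freq maps
    a block to its execution count along the WCEP (0 if off the WCEP).
    """
    return sum(freq.get(b, 0)
               for b in web_blocks
               if vp in used.get(b, set()))


used = {"b2": {1}, "b3": {2}, "b4": {1}}
freq = {"b2": 10, "b3": 10, "b4": 1}      # hypothetical WCEP frequencies
w1 = freq_weight({"b2", "b3", "b4"}, 1, used, freq)
w2 = freq_weight({"b3"}, 2, used, freq)

# Webs are then colored by decreasing weight.
order = sorted({1: w1, 2: w2}.items(), key=lambda kv: kv[1], reverse=True)
```

Under these assumptions, the web of page 1 gets a weight of 11 and is colored before the web of page 2, whose weight is 10.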
A second weight function, based on the nesting level of basic blocks, was also defined for comparison purposes. Contrary to the frequency-based weight function, it does not use any frequency information and thus can be used without any WCET estimation tool available. It is a common heuristic used in compilers for register allocation [3], and it favors webs with deeply nested basic blocks.
This nesting-based weight function (2) is computed from the nesting level of the basic blocks of the web. A basic block in the main function has the base nesting level, and the nesting level increases with the number of enclosing loops, as for a basic block in the body of a loop itself enclosed in another loop.
Performance evaluation
We are interested in evaluating the worst-case timing behavior of programs. Estimation of WCETs is completed using static program analysis. Experimental conditions are described in paragraph 4.1. Experimental results are given in paragraph 4.2.
WCET estimation. Our experiments were conducted on MIPS R2000/R3000 binary code. The WCETs of tasks are computed by the Heptane static WCET analysis tool [4]. One may configure Heptane to estimate WCETs using either a tree-based method, through a bottom-up traversal of the syntactic tree of the analyzed C programs, or an IPET (Implicit Path Enumeration Technique) method, which generates a set of linear constraints from the program control-flow graph. Here, the IPET WCET estimation method is used, because we are interested in the frequency of basic blocks along the WCEP, which is a direct result of IPET estimation methods.
Heptane includes hardware modeling capabilities to estimate WCETs for programs running on architectures with instruction caches, (in-order) pipeline, simple branch prediction. In this paper, the hardware analysis phase of Heptane is bypassed and a constant cycle execution time per instruction is considered. A page-in time of million cycles is assumed. The page-out delays are zero because only code pages are considered.
Unless explicitly stated, the frequency-based weight function is used, and the WCEP is not re-evaluated during graph coloring. The page size is deliberately small, to stress the paging activity even on rather small benchmarks.
Benchmarks. The experiments were conducted on seven benchmarks, whose features are summarized in Table 1. All benchmarks but compress are maintained by the Mälardalen WCET research group (http://www.mrtc.mdh.se/projects/wcet/benchmarks.html). Compress is from the UTDSP Benchmark (http://www.eecg.toronto.edu/).
The main performance metric used in the following paragraphs is the number of page-ins along the worst-case execution path. Such a number is a direct output of the Heptane WCET estimation tool.
Influence of number of physical pages. The left part of figure 4 depicts the impact of the number of physical pages on the number of page-ins along the WCEP. For space considerations, results for only four benchmarks are given; the raw numbers for all benchmarks are given in an appendix available on demand. The figure shows that the smaller the number of physical pages, the higher the number of page-ins along the WCEP.
The right part of figure 4 gives the measured number of page faults for a demand-paging system using an LRU replacement policy, obtained using a small operating system running on a simulated MIPS processor. A comparison of the two parts of the figure shows the same evolution of the number of page loads when the number of physical pages varies. In particular, the number of page-ins gets unacceptably high for a small number of physical pages (thrashing phenomenon). The measured number of page-ins is in most cases lower than the estimated one, because WCET estimation tools consider the longest of all execution paths, whereas a measurement follows a single path. Furthermore, WCET estimation tools may overestimate the length of the WCEP (e.g. overestimation of the number of loop iterations as in the fft application, which has nested non-rectangular loops). All in all, except for fft, the number of page-ins is close to that of a dynamic paging system, which shows that predictability does not come at the price of performance loss.
Figure 6. Impact of weight function
Predictable paging vs analysis of LRU replacement. We have introduced our predictable paging technique because current page replacement policies used in real-time operating systems are not predictable enough. As a consequence, it is not possible to compare the WCETs of programs under state-of-the-art page replacement policies with those obtained with our predictable paging method. Thus, we have compared our proposal with a static analysis of LRU page replacement, a highly predictable replacement policy. Our analysis of LRU page replacement uses Heptane's static instruction cache analysis method for fully associative caches. The cache analysis method of Heptane (see [4] for details) is based on F. Mueller's static cache simulation [9]. Results are expressed in Fig. 5 in terms of number of page-ins along the worst-case execution path. When memory is not too scarce, our predictable paging yields lower WCET estimates than LRU. We have observed on small examples two situations explaining the pessimism of the analysis of LRU page replacement:
Circular access to a set of $n$ pages within a loop, on a system with fewer than $n$ physical pages. In that situation, LRU replacement behaves poorly because every evicted page will be reused shortly after in the loop. This deficiency, detailed in [7], is a deficiency of LRU replacement itself and not of the static analysis of LRU replacement.
Classification as misses of references to pages accessed both in the body of a loop and at the loop exit. In that situation, the static analysis of LRU page replacement considers that the loop may iterate zero times and thus that the page may have to be loaded from disk. This pessimism may become significant in the case of nested loops. Here, the problem is with the static analysis of LRU replacement and not with LRU replacement itself. This problem could be fixed if a lower bound on the number of loop iterations were provided to the WCET estimation tool.
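The circular access situation can be reproduced with a few lines of simulation. The sketch below is a plain strict-LRU simulator written for illustration; the function name `lru_page_faults` and the loop parameters (4 pages, 10 iterations) are assumptions, not taken from the experiments.

```python
from collections import OrderedDict


def lru_page_faults(refs, nframes):
    """Count page faults of strict LRU replacement with nframes physical pages."""
    frames = OrderedDict()   # keys ordered from least to most recently used
    faults = 0
    for page in refs:
        if page in frames:
            frames.move_to_end(page)        # page becomes most recently used
        else:
            faults += 1
            if len(frames) == nframes:
                frames.popitem(last=False)  # evict the least recently used page
            frames[page] = True
    return faults


n = 4
loop_refs = list(range(n)) * 10             # 10 iterations over n pages
faults_scarce = lru_page_faults(loop_refs, n - 1)
faults_enough = lru_page_faults(loop_refs, n)
```

With n - 1 physical pages every one of the 40 references faults, because LRU always evicts the page that will be needed next, whereas with n physical pages only the 4 initial loads occur.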
When the number of physical pages is extremely low, in most cases the analysis of LRU yields tighter WCET estimates than our scheme. A closer analysis of the sources of pessimism of our proposal in that situation is still needed and is left for future work.
Impact of weight function. The two weight functions presented in paragraph 3.4, namely the frequency-based one and the nesting-based one, have been implemented and tested. Figure 6 gives the number of page-ins along the worst-case execution path for these two weight functions. Except in very rare cases, the number of estimated page faults is much lower when using the frequency-based heuristic than when using the nesting-based one. Using frequency information is thus valuable for obtaining tight WCET estimates.
The numbers given in the appendix show that re-evaluating the worst-case execution path during the graph coloring has no impact. This is not surprising for the tasks with few data dependencies (matmult, jfdctint) but needs further investigation for the others.
Implementation issues and future work
Some hardware and/or operating system support is required to fully implement our proposal. The first requirement is to have support for executing code (here, page-ins and page-outs) at specific code locations. This could be done by using hardware debug registers or operating system support for debug, if any. Another requirement is to have support for changing translation information. This is expected to be straightforward for operating systems with page locking facilities, as in RT-Mach and real-time extensions of POSIX, or with a library to control address translation, as in VxWorks. Further work is required to evaluate the implementation cost of our proposal, in particular in the presence of shared libraries/code/memory segments or multiple threads sharing the same address space.
The algorithm for off-line selection of page-in and page-out points is independent of the type of pages referenced (code, data), as long as referenced pages are known at compile time. The difficulty of applying our scheme to data comes from the identification of the data pages referenced by every instruction, in case data addresses are computed dynamically (accesses to arrays, stack allocated data, dynamically allocated data). The identification of referenced pages need not be exact; it is sufficient that all pages that may be referenced are known. Our ongoing work evaluates the practical feasibility of identifying possibly referenced data pages, and quantifies the negative impact of an imprecise knowledge of data references.
Furthermore, to be used in hard real-time systems, disks (or any other classes of secondary storage) with predictable access times are required. This is a direction for future research.
Finally, this paper has focused on a single task, and has left open the choice of the number of pages assigned to every task. In [15], an algorithm for optimally partitioning a two-level memory so as to minimize task utilization is described. Such an algorithm can be used to select the number of physical pages assigned to each task so as to minimize utilization. A research direction would be to improve that algorithm to optimize schedulability rather than utilization.
A Appendix: Raw numbers
In the following tables, the first column indicates the number of pages used. The next three columns then give the estimated number of page-ins along the worst-case execution path using respectively: the frequency-based weight function with and without re-evaluation of the WCEP (R and no-R), and the nesting-based weight function. Column LRU gives the estimated number of page-ins along the worst-case execution path when assuming an LRU page replacement policy. Finally, the last column gives the measured number of page faults for an LRU page replacement policy when following one execution path.
