We present the design and implementation of UPMLIB, a runtime system that provides transparent facilities for dynamically tuning the memory performance of OpenMP programs on scalable shared-memory multiprocessors with hardware cache-coherence. UPMLIB integrates information from the compiler and the operating system to implement algorithms that perform accurate and timely page migrations. The algorithms and the associated mechanisms correlate memory reference information with the semantics of parallel programs and with scheduling events that break the association between threads and the data for which those threads have memory affinity at runtime. Our experimental evidence shows that UPMLIB makes OpenMP programs immune to the page placement strategy of the operating system, thus obviating the need for introducing data placement directives in OpenMP. Furthermore, UPMLIB provides solid improvements of throughput in multiprogrammed execution environments.
Introduction
Scalable shared-memory multiprocessor architectures converge remarkably to a common machine model, in which nodes with commodity microprocessors and memory are interconnected via a fast communication network and equipped with hardware support to provide the communication abstraction of a shared address space to the programmer [2] . High-level programming models for scalable parallel computers converge also to a small set of standards that represent essentially two programming methodologies with different communication abstractions, namely message-passing and shared-memory. MPI [3] and OpenMP [10] are the most popular representatives of these programming methodologies.
There has been considerable recent debate over what the programming model of choice should be for scalable shared-memory multiprocessors. Interestingly, contemporary systems such as the SGI Origin2000 [6] support programming models based on both message-passing and shared-memory, via customized runtime systems provided by the vendors. Performance experiences with real applications on these systems suggest that implementations of parallel programs with MPI often perform better than implementations of the same programs with OpenMP, particularly for large industrial applications [11]. The most prominent problem that OpenMP programs face on scalable multiprocessors is the non-uniformity of memory access latency (NUMA). Although the shared-memory communication abstraction hides data distribution details from the programmer, programs are extremely sensitive to the page placement strategy of the operating system. A poor page placement scheme may inflate the number of remote memory accesses, which cost two to ten times as much as local memory accesses on state-of-the-art systems. It is therefore crucial to ensure that threads and data are aligned in nodes, so that each thread is collocated on the same node with the data that it accesses most frequently. Unfortunately, in order to achieve this goal with a plain shared-memory programming model, the programmer must be aware of the page placement strategy of the operating system and either modify the program to adapt its memory reference pattern to the enforced system strategy, or bypass the operating system and hand-code a customized page placement scheme [4]. Both approaches compromise the simplicity of shared-memory programming models and jeopardize code portability across different platforms. Nevertheless, vendors of shared-memory multiprocessors are already facing the dilemma of whether data distribution directives should be introduced in OpenMP or not [7].
The question that motivates the work presented in this paper is whether OpenMP can be enhanced with runtime capabilities for transparently improving data locality without exporting data distribution details to the programmer. We present the design and implementation of UPMLIB (User-Level Page Migration library), a runtime system with mechanisms and algorithms that transparently optimize at runtime the page placement of OpenMP programs, using feedback from the compiler, the operating system and dynamic monitoring of the memory reference pattern of the programs. UPMLIB leverages dynamic page migration [12] at user-level to correct unfortunate page placement decisions made by the operating system. The notable difference of UPMLIB compared to previously proposed kernel-level page migration engines, is that the employed dynamic page migration algorithms correlate the memory reference information obtained from hardware counters with the semantics of the parallel computation and scheduling information provided by the operating system. This is accomplished by integrating the compiler, the runtime system and the operating system in the page migration engine. The overall approach improves the accuracy and timeliness of page migrations, amortizes well the cost of page migrations over time, and makes the page migration engine responsive to unpredictable runtime events that may harm data locality, such as thread migrations.
We have implemented UPMLIB on the SGI Origin2000, using the IRIX 6.5.5 memory management control interface. As a case study, we have used UPMLIB with unmodified OpenMP implementations of the NAS benchmarks [5]. Our results show that UPMLIB gives OpenMP codes the desirable immunity to the page placement strategies of the operating system. In addition, UPMLIB provides in some cases solid performance improvements compared to the native IRIX page placement and migration schemes, both for standalone parallel programs and for multiprogrammed workloads scheduled with space- and time-sharing by the IRIX kernel.
The rest of this paper is organized as follows. Section 2 outlines the design and the algorithms embedded in UPMLIB. Section 3 discusses some implementation details. Section 4 provides results with OpenMP codes that utilize UPMLIB to improve their data locality in dedicated and multiprogrammed execution environments. Section 5 concludes the paper.
UPMLIB Design and Algorithmics
The key design issue of UPMLIB is the integration of the compiler, the runtime system and the operating system in a unified framework that enhances the effectiveness of a dynamic page migration engine, by correlating the dynamic reference pattern of a parallel program with the semantics of the program and the scheduling status of its threads at runtime. UPMLIB implements feedback-guided optimization. Figure 1 shows the main modules and interfaces of UPMLIB, which are explained in detail in the following paragraphs.
Compiler Support. The OpenMP compiler identifies areas of the virtual address space that are likely to contain candidate pages for migration and instruments the programs to call the page migration services of UPMLIB at specific points during their execution. In our first prototype, the compiler locates shared arrays that are read and written in possibly disjoint sets of OpenMP parallel/work sharing constructs and identifies these arrays as hot memory areas. The compiler inserts calls to UPMLIB that activate dynamic monitoring of page reference activity and page migration on the hot areas. The implementation is flexible enough to exploit advanced compiler knowledge, in case the compiler can provide more accurate boundaries for the parts of the hot areas that are likely to concentrate the most significant fraction of remote memory accesses, as well as the exact points of the program at which page migration could improve locality by emulating data distribution and redistribution schemes.
Page Migration Mechanisms. The instrumentation pass of the compiler takes advantage of the semantics of the parallel program in order to migrate pages accurately and ahead of time. Most parallel codes are iterative in nature, in the sense that they repeat the same parallel computation for a number of iterations. The compiler instruments iterative programs to invoke the page migration algorithms of UPMLIB at the end of the outer iterations that encapsulate the complete parallel computation. At these points the runtime system can obtain an accurate view of the complete page reference pattern of the parallel computation by reading the hardware counters. Therefore, the runtime system is in a position to make very accurate page migration decisions and achieve an optimal page placement, where optimality is defined with respect to the observed repetitive reference trace of the program and is achieved when each page is placed so that the maximum latency due to remote memory accesses seen by any node in the system is minimized.
In the case of iterative parallel computations and in the absence of page-level false-sharing or thread migrations, the runtime system attains the best page placement with respect to the observed reference pattern after executing a single iteration of the parallel computation. Besides the advantage of timeliness, this strategy amortizes the cost of page migrations well over time. Cost amortization is of particular importance, since page migrations are expensive operations that cost around 1 millisecond each on state-of-the-art systems. Since page migration is performed based on the reference trace of the complete parallel computation, the page migration engine is not biased by temporary effects such as cold-start or phase changes in the reference pattern of the parallel computation.
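The amortization argument can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name and the latency figures are hypothetical, chosen to be consistent with the roughly 1 ms migration cost and the "two to ten times" remote-to-local ratio quoted in the text, and are not measurements.

```python
def break_even_remote_accesses(migration_cost_s, local_latency_s, remote_latency_s):
    """Number of remote accesses that a migration must turn into local
    accesses before its one-time cost has been recovered."""
    saving_per_access = remote_latency_s - local_latency_s
    if saving_per_access <= 0:
        raise ValueError("remote latency must exceed local latency")
    return migration_cost_s / saving_per_access


# Hypothetical figures: 1 ms per migration, 400 ns local access,
# 1200 ns remote access (a 3x remote-to-local ratio).
accesses = break_even_remote_accesses(1e-3, 400e-9, 1200e-9)
print(f"break-even after about {accesses:.0f} remote accesses")
```

Under these assumed numbers, a page pays back its migration only if it would otherwise have suffered more than a thousand remote misses, which is why migrating on the evidence of a complete iteration, and then reusing the placement for all later iterations, is attractive.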
UPMLIB handles non-iterative codes, as well as iterative codes with non-repetitive access patterns, using a sampling-based mechanism for migrating pages. The runtime system periodically wakes up a thread, which scans a fraction of the pages in the hot memory areas and migrates some of these pages if needed. The sampling frequency and the number of pages scanned upon each invocation of UPMLIB can be adjusted by the user to fit the needs of the application. Programs with fine-grain phase changes in the reference pattern benefit from short sampling intervals, while programs with coarse-grain phase changes can utilize longer sampling intervals. The number of pages scanned upon each invocation is selected to limit the cost of checking and migrating pages to at most a small fraction of the sampling interval. The algorithm for scanning pages can vary from sequential to strided to randomized scanning, in order to enable the runtime system to adapt the page migration engine to the distribution of hot pages in the virtual address space.
Page Migration Algorithms. UPMLIB uses by default a competitive algorithm for migrating pages. The criterion of competitiveness in the algorithm is the estimated latency seen by each node in the system due to remote memory accesses. This criterion incorporates the number of references, the estimated cost of each remote reference according to the distance in hops between the referencing node and the referenced page, and contention at the nodes to which references are issued. The competitive thresholds used in the algorithm may change at runtime, according to the observed effectiveness of page migrations in reducing the rate of remote memory accesses and the overall page migration activity. More details can be found in [8]. UPMLIB circumvents page-level false-sharing with a ping-pong prevention mechanism [8], which avoids migrating a page if it is likely to bounce between the same nodes more than once. The ping-pong prevention mechanism ensures that, unless the threads of a parallel program migrate between nodes, each page will be placed at the appropriate node within the first two iterations of the program, assuming an iterative program with a repetitive reference pattern. For the more general case in which pages can be ping-ponged between more than two nodes due to wide false-sharing, UPMLIB uses a bouncing threshold to limit the maximum number of times a page can move before being pinned at a node.
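The interplay of the competitive criterion, the ping-pong guard and the bouncing threshold can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of [8]: the names `PageState` and `choose_home`, the `hop_cost` callback, the default threshold values, and the rule "never return directly to the node just left" are all assumptions made for the example, and contention is ignored.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PageState:
    home: int                        # node that currently holds the page
    last_home: Optional[int] = None  # node the page most recently left
    moves: int = 0                   # migrations performed so far
    pinned: bool = False             # set once the bouncing threshold is hit


def choose_home(page: PageState, refs: List[int],
                hop_cost: Callable[[int, int], float],
                threshold: float = 2.0, max_moves: int = 4) -> int:
    """Decide where the page should live, updating `page` in place.

    refs[n] is the reference count of node n for this page and
    hop_cost(src, dst) is a relative cost per remote reference.
    """
    if page.pinned:
        return page.home
    best, best_cost = page.home, 0.0
    for node, count in enumerate(refs):
        if node == page.home:
            continue
        cost = count * hop_cost(node, page.home)  # estimated remote latency
        if cost > best_cost:
            best, best_cost = node, cost
    # Competitive test: the remote node must outweigh the home node.
    if best == page.home or best_cost <= threshold * refs[page.home]:
        return page.home
    # Ping-pong guard: do not bounce straight back to the previous home.
    if best == page.last_home:
        return page.home
    page.last_home, page.home = page.home, best
    page.moves += 1
    if page.moves >= max_moves:      # bouncing threshold reached
        page.pinned = True
    return page.home


uniform = lambda src, dst: 1.0       # every remote hop costs the same
page = PageState(home=0)
first = choose_home(page, [10, 100, 5], uniform)   # node 1 dominates
second = choose_home(page, [100, 10, 5], uniform)  # would bounce back to 0
third = choose_home(page, [0, 10, 100], uniform)   # a new node dominates
print(first, second, third)
```

In this sketch the second call leaves the page where it is even though node 0 satisfies the competitive test, because moving it would undo the previous migration.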
Integration with Multiprogramming.
On scalable shared-memory multiprocessors, the page placement strategy establishes an implicit association between threads and data in a parallel program. In principle, a thread is associated with its memory affinity set, that is, the set of pages that the thread accesses more frequently than any other thread. On a multiprogrammed system in which multiple parallel and sequential jobs execute simultaneously, the operating system arbitrarily preempts and migrates threads between nodes, thus breaking the association between these threads and their memory affinity sets. Thread migrations incur the cost of reloading the working sets of migrated threads from remote memory modules, as well as satisfying most cache misses incurred from migrated threads remotely. A page migration mechanism can alleviate this problem by forwarding the pages that belong to the memory affinity set of a migrated thread to the new node that hosts the thread. Unfortunately, a competitive page migration algorithm may fail to perform timely page migrations in this case. The reason is that the page reference counters may have accumulated obsolete page reference history that prevents a page from migrating unless the new home node of the migrated thread issues a sufficiently large amount of remote references to meet the competitive criterion.
UPMLIB uses a lightweight communication interface with the operating system to obtain scheduling information, which is used as a trigger for switching the page migration algorithm when the operating system migrates threads. The runtime system polls a vector in shared memory that stores the instantaneous mapping of threads to processors, and switches away from the default competitive algorithm on the fly if it detects that some threads have migrated between the execution of two consecutive parallel OpenMP constructs. In that case, UPMLIB activates a predictive algorithm, which migrates pages according to the rate of references observed after the thread migration from the previous and the new home node of a migrated thread. For iterative programs, the algorithm checks for each page whether some node other than the home node of the page issues remote references at a higher rate than in previous iterations, while the home node references the page at a lower rate than in previous iterations. If this happens, the algorithm speculates that the observed anomaly is due to a thread migration and verifies this speculation by checking the information provided by the operating system. If the speculation is verified, the algorithm migrates the page irrespective of the values accumulated in the reference counters for that page.
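The per-page check that the predictive algorithm performs for iterative programs can be sketched as below. The function name `predictive_target`, the representation of rates as per-node lists, and the encoding of the operating system information as a set of `(old_node, new_node)` pairs are assumptions made for illustration; the actual algorithm is described in [9].

```python
def predictive_target(home, prev_rates, cur_rates, migrated):
    """Return the node a page should be forwarded to, or None.

    prev_rates / cur_rates: per-node reference rates for one page in the
    previous and in the current iteration.
    migrated: set of (old_node, new_node) pairs for threads that the
    operating system reports as having moved.
    """
    # The home node must be referencing the page less than before.
    if cur_rates[home] >= prev_rates[home]:
        return None
    # Some other node must be referencing it more than before.
    rising = [n for n in range(len(cur_rates))
              if n != home and cur_rates[n] > prev_rates[n]]
    # Verify the speculation against the scheduling information.
    confirmed = [n for n in rising if (home, n) in migrated]
    if not confirmed:
        return None
    # Forward the page regardless of the accumulated counter values.
    return max(confirmed, key=lambda n: cur_rates[n])


target = predictive_target(0, [100, 2, 0], [5, 3, 90], {(0, 2)})
print(target)
```

Note that node 1 also shows a rising rate in the example, but it is ignored because no thread is reported to have moved from node 0 to node 1.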
For non-iterative programs, the predictive algorithm is invoked periodically and ages the counters of the nodes from which threads migrate by discarding their values after the thread migration, if there is no other thread of the program running on these nodes. Aging is performed to avoid biasing future page migration decisions with obsolete reference history, thus increasing the chances of migrating pages that belong to the memory affinity sets of migrated threads. Details on the predictive algorithms are available in [9] . Table 1 summarizes the UPMLIB user-level interface. Figure 2 gives an example of the use of UPMLIB in the NAS BT benchmark.
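The counter aging step used for non-iterative programs can be illustrated with a short sketch. The data layout (a dictionary from page address to a per-node list of counts, and dictionaries from thread id to node) and the name `age_counters` are hypothetical; only the rule itself, discarding the counters of a node that a thread has left when no other thread of the program remains there, comes from the text.

```python
def age_counters(counters, old_mapping, new_mapping):
    """Reset stale per-node counters in place and return the stale nodes.

    counters: dict mapping a page address to a list of per-node counts.
    old_mapping / new_mapping: dicts mapping thread id to node, before
    and after the scheduling change.
    """
    # Nodes that at least one thread has left.
    departed = {node for thread, node in old_mapping.items()
                if new_mapping.get(thread) != node}
    # Nodes that still host some thread of the program keep their history.
    still_occupied = set(new_mapping.values())
    stale = departed - still_occupied
    for per_node in counters.values():
        for node in stale:
            per_node[node] = 0
    return stale


counters = {0x1000: [50, 3, 0], 0x2000: [7, 9, 1]}
stale = age_counters(counters, {0: 0, 1: 1}, {0: 2, 1: 1})
print(stale, counters)
```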
Implementation
UPMLIB is implemented on the SGI Origin2000, using the IRIX 6.5.5 operating system interface. The runtime system is integrated with the NANOS OpenMP compiler [1], which implements the instrumentation pass for using UPMLIB.
Interfaces. The page migration facilities of UPMLIB use the memory management control interface (mmci) of IRIX (see Figure 1). The IRIX mmci provides significant flexibility in managing memory at user-level, by virtualizing the physical topology of the system. The user can create high-level abstractions of the physical memory space, called Memory Locality Domains (MLDs). MLDs can be statically or dynamically mapped to physical nodes of the system. After establishing a mapping between MLDs and nodes, the user can associate ranges of the virtual address space with MLDs in order to implement application-specific page placement schemes. IRIX also provides a memory migration facility that allows the pages in a user-specified range of the virtual address space to flow between different MLDs. UPMLIB uses the /proc interface for accessing hardware reference counters. The Origin2000 memory modules are equipped with 11-bit hardware counters. There is one counter per node for each page in memory, for system configurations of up to 64 nodes. The hardware counters are memory mapped to 32-bit software-extended counters by the operating system. To reduce overhead, UPMLIB tries to batch multiple page migrations for consecutive pages in the virtual address space into a single invocation of the IRIX memory migration facility.
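The batching of migrations for consecutive pages can be sketched as a simple coalescing pass. The sketch does not call any IRIX facility; the function name `batch_migrations`, the use of page numbers as keys, and the requirement that pages in one batch share the same destination node are assumptions made for illustration.

```python
def batch_migrations(moves):
    """Coalesce per-page migrations into contiguous ranges.

    moves: dict mapping a virtual page number to its destination node.
    Returns a list of (first_page, page_count, destination) tuples, one
    per invocation of the underlying range-migration facility.
    """
    batches = []
    for page in sorted(moves):
        dest = moves[page]
        if batches:
            first, count, last_dest = batches[-1]
            # Extend the current batch if the page is adjacent to it
            # and is headed for the same node.
            if last_dest == dest and first + count == page:
                batches[-1] = (first, count + 1, dest)
                continue
        batches.append((page, 1, dest))
    return batches


batches = batch_migrations({10: 1, 11: 1, 12: 2, 14: 2, 13: 2})
print(batches)
```

Here five individual page migrations collapse into two range operations, so the fixed per-invocation overhead is paid twice instead of five times.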
The communication between UPMLIB and the IRIX kernel is done via polling shared variables in the private data areas (prda) of IRIX threads. The operating system updates a flag in the prda of each thread, which stores the physical CPU on which the thread was scheduled during the last time quantum. UPMLIB uses this information in conjunction with hints provided by the IRIX kernel for adjusting the number of threads that execute OpenMP parallel/work sharing constructs. The latter are used to implement dynamic process control at user-level. UPMLIB detects thread preemptions and migrations at the boundaries of parallel constructs to trigger the multiprogramming-conscious page migration algorithms.
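Detecting thread migrations from the polled CPU information amounts to comparing two snapshots of the thread-to-CPU mapping taken at consecutive parallel construct boundaries. The sketch below models the snapshots as plain lists indexed by thread id rather than reading the prda; the function name and the `cpus_per_node` parameter (set to two here on the assumption of two processors per node) are assumptions made for the example.

```python
def detect_node_migrations(prev_cpus, cur_cpus, cpus_per_node=2):
    """Return {thread: (old_node, new_node)} for threads that changed node.

    prev_cpus / cur_cpus: physical CPU of each thread at the previous and
    at the current parallel construct boundary, indexed by thread id.
    """
    moved = {}
    for thread, (old_cpu, new_cpu) in enumerate(zip(prev_cpus, cur_cpus)):
        old_node = old_cpu // cpus_per_node
        new_node = new_cpu // cpus_per_node
        # A move between CPUs of the same node does not change the
        # distance between the thread and its pages, so it is ignored.
        if old_node != new_node:
            moved[thread] = (old_node, new_node)
    return moved


moved = detect_node_migrations([0, 1, 2, 3], [1, 0, 6, 3])
print(moved)
```

Only node-level changes matter for page placement, which is why threads 0 and 1 in the example, which merely swap CPUs within one node, are not reported.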
Mechanisms for executing page migrations. UPMLIB uses two mechanisms for executing the page migration algorithms. By default, the runtime system overlaps the execution of page migrations with the execution of the threads of a parallel program. We measured with microbenchmarks the average cost of a user-level page migration on the SGI Origin2000 to be approximately 1.3 milliseconds, including the cost of reading reference counters and executing the competitive algorithm. This makes it evident that UPMLIB cannot execute a large number of page migrations on the critical path of the program. Therefore, the runtime system uses a separate thread, called the memory manager, for executing page migrations. This thread is created in sleep mode when UPMLIB is initialized and wakes up upon every invocation of UPMLIB by the program. The memory manager executes in parallel with the application threads. This strategy works well for standalone parallel programs running on moderate to large processor scales, at which the program can easily sacrifice one processor for executing operating system code [4].
In loaded multiprogrammed systems in which the total number of active threads may be higher than the number of processors in the system, the memory management threads created by UPMLIB may undesirably interfere with the threads of parallel programs. To cope with this problem, UPMLIB also supports the execution of page migration algorithms from the master thread of the OpenMP program. In this case, the runtime system uses stripmining for the buffers that store the reference counters, in order to reduce the working set size of UPMLIB and avoid completely erasing the cache footprint of the master thread, which participates in the execution of parallel code.
Table 1. The UPMLIB user-level interface.
…: Initializes reference counting and activates dynamic page migration for the address range [va, va+size-1].
upmlib_migrate_pages(policy): Runs the specified page migration policy for all hot memory areas.
upmlib_check_pset(): Polls the effective processor set on which the program executes from shared memory and records thread migrations.
upmlib_switch(): Switches the page migration policy from competitive to predictive, and vice versa, using OS information.
upmlib_record_counters(): Records per-page/per-node reference counters for statistics collection.
Experimental Results
In this section we provide a small set of experimental results, as case studies that demonstrate the potential of UPMLIB. Figure 3 illustrates the performance of two application benchmarks from the NAS suite, BT and SP, both parallelized with OpenMP [5]. The experiments were conducted on a 64-processor SGI Origin2000, with MIPS R10000 processors clocked at 250 MHz and 8 Gbytes of memory. The charts plot the execution time of the benchmarks with three different initial page placement schemes, namely first-touch (label ft), round-robin (label rr) and a hypothetical worst-case placement in which all resident pages of the benchmarks are placed on a single node (label sn), thus exacerbating contention and latency due to remote accesses. For each of the three page placement schemes, we executed the benchmarks without page migration enabled (label IRIX), with the IRIX page migration engine enabled (label IRIXmig), and with the IRIX page migration engine disabled and user-level dynamic page migration enabled in UPMLIB (label upmlib). The experiments were executed using 32 dedicated processors. The primary outcome of the results is that the benchmarks exhibit sensitivity to the page placement strategy of the operating system and that, in the cases in which the page placement scheme is harmful, the IRIX page migration engine is unable to close the performance gap. For example, worst-case page placement incurs slowdowns of 1.73 to 2.27 even if dynamic page migration is enabled in the IRIX kernel. With round-robin page placement, slowdowns compared to first-touch are on the order of 30%. UPMLIB reduces the slowdown factor due to page placement to at most 1.06, thus making the OpenMP implementations of the benchmarks immune to the page placement strategy of the operating system and the associated problems with data locality. Furthermore, UPMLIB provides sizeable performance improvements (28% in the case of BT) over the best-performing page placement and migration scheme of IRIX.
Figure 4 illustrates the results from executions of multiprogrammed workloads with the NAS BT and SP benchmarks. Each workload included four identical copies of the same benchmark, plus a sequential background load consisting of an I/O-intensive C program that repetitively reads and writes files from disk. The workloads were executed on 64 processors. All instances of the parallel benchmarks requested 32 processors for execution; however, the benchmarks enabled the dynamic adjustment of the number of threads that execute parallel code, via the OMP_SET_DYNAMIC call. In these experiments, IRIX initially started all 128 threads of the parallel benchmarks, relying on time-sharing for the distribution of processor time among the programs. In the course of execution, IRIX detected that the parallel benchmarks underutilized some processors and reduced the number of threads accordingly, reverting to space-sharing for executing the workload, although some processors were time-shared due to the interference of the background load.
The results show the average execution time of the parallel benchmarks in the workloads with plain first-touch page placement (ft-IRIX), first-touch and the IRIX page migration engine enabled (ft-IRIXmig), and first-touch and UPMLIB with the predictive heuristic enabled (ft-upmlib). The theoretical optimal execution time of the benchmarks is also illustrated in the charts. The optimal time is computed as the standalone execution time of each benchmark on 32 processors and with the best page placement strategy (ft-upmlib, see Figure 3), divided by the degree of multiprogramming in the workload. The results illustrate the performance implications of multiprogramming on the memory performance of parallel programs when their threads are arbitrarily preempted and migrated between nodes by the operating system. Slowdowns of 2.1 to 3.3 are observed for BT and SP with respect to the optimal execution time, with the IRIX page management schemes. Our instrumentation has shown that the IRIX kernel performed on average around 2500 thread migrations during the execution of the workloads. UPMLIB is very effective in dealing with this problem, by aggressively forwarding pages in the memory affinity sets of migrated threads. This results in overall performance within 5% of the theoretical optimal performance.
Conclusion
This paper outlined the design and implementation of UPMLIB, a runtime system for tuning the page placement of OpenMP programs on scalable shared-memory multiprocessors, in which shared-memory programming models are sensitive to the alignment of threads and data in the nodes. UPMLIB takes a new approach by integrating the compiler and the operating system with the page migration engine, to improve the effectiveness of dynamic page migration. Our current effort is oriented towards three directions: utilizing the functionality of UPMLIB in codes with fine-grain phase changes in the memory access pattern of programs; customizing UPMLIB to the characteristics of specific kernel-level scheduling strategies; and integrating a unified utility for page and thread migration in UPMLIB, with the purpose of biasing thread scheduling decisions towards achieving better memory locality.
