A comparison of online superpage promotion mechanisms by Zhang, Lixin & Fang, Zhen
A Comparison of Online Superpage Promotion Mechanisms
Zhen Fang and Lixin Zhang
{zfang | lizhang}@cs.utah.edu
http://www.cs.utah.edu/impulse 1
UUCS-99-021
Department of Computer Science 
3190 Merrill Engineering Building 
University of Utah 
Salt Lake City, UT 84112
December 2, 1999
Abstract
The amount of data that a typical translation lookaside buffer (TLB) can map has not kept pace with 
the growth in cache sizes and application footprints. As a result, the cost of handling TLB misses limits 
the performance of an increasing number of applications. The use of superpages, multiple adjacent virtual 
memory pages that can be mapped with a single TLB entry, extends a TLB’s reach without significantly 
increasing its size or cost. The difficulty of identifying what sets of pages should be promoted to superpages 
combined with the overhead of performing these promotions restricts superpage use almost exclusively to 
wired system data structures. Previous studies have shown that simple online policies that decide to create 
superpages dynamically can be effective in reducing TLB penalties.
In this paper we analyze the performance of online superpage promotion for nine benchmarks on a 
simulated HP PA-RISC system running a BSD Unix kernel. We extend previous work in two ways. First, 
we study the impact of creating superpages dynamically by remapping pages at the memory controller 
instead of copying pages to make them contiguous. The use of such a hardware mechanism affects the choice 
between two previously described superpage promotion policies. Previous work has shown that an online 
approximation to a competitive policy is the best choice. Our results show that having hardware support 
makes a greedy policy perform equally well. Second, we use execution-driven simulation, whereas previous 
studies have used trace-driven simulation. Our results show that the differences in accuracy are significant, 
especially when studying complex interactions between operating systems and modern architectures.
Keywords: memory architecture, TLB performance, competitive algorithms, simulation 
Technical Areas: architecture, memory systems, operating systems
1 This effort was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research
Laboratory (AFRL) under agreement number F30602-98-1-0101 and DARPA Order Numbers F393/00-01 and F376/00. The
views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the
official policies or endorsements, either express or implied, of DARPA, AFRL, or the US Government.
1 Introduction
The translation lookaside buffers (TLBs) on most modern processors support superpages: groups of contiguous
virtual memory pages that can be mapped with a single TLB entry [8, 16, 27]. Using superpages makes
more efficient use of the TLB, but the physical pages that back a superpage must be contiguous and properly
aligned. Dynamically coalescing smaller pages into a superpage thus requires that all the pages be reserved
a priori, be coincidentally adjacent and aligned, or be copied so that they become contiguous. The overhead
costs of promoting superpages by copying include the direct costs of copying the pages and changing the
mappings. Other indirect costs are also important, such as the increased number of instructions executed
on each TLB miss (due to the new decision-making code in the miss handler) and the increased contention
in the cache hierarchy (due to the code and data used in the promotion process). When deciding whether
to create superpages, these costs must be balanced against the improvement in TLB performance.
Romer et al. [22] study several different policies for dynamically creating superpages. Their trace-driven
simulations and analysis show how a policy that balances potential performance benefits and promotion
overheads can improve performance in some TLB-bound applications by about 50%. Our work extends
theirs by measuring the added performance benefit, as well as the effect on the choice of policy, of using
hardware support at the memory system to make creating superpages cheaper.
The hardware that we model is the Impulse Memory Controller [28], which helps create superpages
without copying by adding another level in the memory hierarchy at the memory controller. In Impulse,
superpages are built through remapping.
Our research shows that combining the work of Romer et al. and the Impulse technology changes the
tradeoffs in designing an online superpage promotion policy. Romer et al. find that a competitive promotion
policy that tries to balance the overheads of creating superpages with their benefits achieves the best average
performance. Our experiments confirm this result when promotion is accomplished via copying, but we find
that the use of a more aggressive promotion policy that promotes superpages as soon as all of their constituent
sub-pages have been touched performs best when coupled with a remapping-based promotion mechanism.
We further find that the performance of Romer's competitive promotion policy can be improved by tuning
it to create superpages more aggressively, even when copying is employed to promote pages. In addition,
by using a detailed execution-driven simulator, we identify the impact of several performance factors not
covered by Romer et al.'s trace-based study, such as the detrimental effects of the cache pollution induced
by copying. Finally, we find that online superpage promotion achieves performance comparable to the
hand-coded superpage promotion mechanism employed by Swanson et al. [28].
The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 explains the 
two methods used to create superpages, along with the two policies investigated for promoting superpages 
at run time. Section 4 describes our simulation environment and benchmark suite, and Section 5 gives the 
experimental methodology and the results of our study. Section 6 summarizes our conclusions and discusses 
future work.
2 Related Work
Competitive algorithms perform cost/benefit analyses dynamically to make online decisions that guarantee 
performance within a constant factor of an optimal offline algorithm. Romer et al. [22] adapt this approach 
to TLB management, employing a competitive strategy to decide when to perform dynamic superpage 
promotion. They also investigate online software policies for dynamically remapping pages to improve cache 
performance [3, 21]. Competitive algorithms have been used to help increase the efficiency of other operating 
system functions and resources, including paging [24], synchronization [14], and file cache management [5].
Chen et al. [7] report on the performance effects of various TLB organizations and sizes. Their results 
indicate that the most important factor for minimizing the overhead induced by TLB misses is reach, the 
amount of address space that the TLB can map at any instant in time. Even though the SPEC benchmarks 
they study have relatively small memory requirements, they find that TLB misses increase the effective 
CPI (cycles per instruction) by up to a factor of five. Jacob and Mudge [13] compare five virtual memory 
designs, including combinations of hierarchical and inverted page tables for both hardware-managed and 
software-managed TLBs. They find that large TLBs are necessary for good performance, and that TLB 
miss handling overhead accounts for much of the memory-management overhead. They also project that the 
individual cost of TLB miss traps will increase in future microprocessors.
Proposed solutions to this growing TLB performance bottleneck range from changing the TLB structure
to retain more of the working set (e.g., multi-level TLB hierarchies [1, 9]), to implementing better management
policies (in software [12] or hardware [11]), to masking TLB miss latency by prefetching entries (in
software [2] or hardware [23]).
All of these approaches can be improved by exploiting superpages. Most TLBs now support superpages,
and have for several years [16, 27], but more research is needed into how best to make general use of this 
capability. Chen et al. [7] suggest the possibility of using variable page sizes to improve TLB reach, but 
do not explore the implications of their use. Khalidi et al. [15] and Mogul [17] discuss benefits of systems
that support superpages, advocating static allocation via compiler or programmer hints. Talluri et al. [18] 
report many of the difficulties attendant upon general utilization of superpages, most of which result from 
the requirement that superpages map physical memory regions that are contiguous and aligned.
On a system with four-kilobyte base pages, Talluri et al. [19] find that judicious use of 32-kilobyte 
superpages can reduce the impact of TLB misses on CPI by as much as a factor of eight. Exclusive use 
of the larger pages increases application working sets by as much as 60%, which can lead to inefficient use 
of main memory. However, mixing both page sizes limits this bloat to around 10%, and allowing the TLB 
to map superpages without requiring that all the underlying base pages be present (partial superpages) 
eliminates the problem altogether.
Swanson et al. [28] build superpages through a remapping approach with the Impulse hardware support.
They statically modify applications to create superpages via system calls. Their simulation results demon­
strate a two-fold increase in TLB reach and a 5%-20% improvement in the performance of some SPECint95 
and Splash2 applications with medium to high TLB miss rates. Our work uses the same hardware as
theirs, but with different kernel modifications.
3 Research Background
We measure the impact of combining no-copy superpage promotion with the two online promotion algorithms 
proposed by Romer et al. [22]. The methodological differences between this study and Romer et al.’s are
described in Section 4. In this section we describe the promotion policies we study, and then we briefly 
discuss the hardware used by Swanson et al. to support no-copy superpage promotion.
3.1 Promotion Algorithms
We evaluate two of the online superpage promotion policies developed by Romer et al. [22], asap and 
approx-online. asap is a greedy policy that promotes a superpage as soon as all of its component pages 
have been referenced. The algorithm does not consider reference frequency for the potential superpages, 
which minimizes bookkeeping overhead. The price for this simplicity is that the asap policy may build 
superpages that are rarely referenced later, in which case the benefits of these superpages would not offset 
the costs of building them.
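The asap rule described above can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the kernel code evaluated in this study: the `touched` array, the `asap_ready` name, and the fixed size of the tracked region are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_BASE_PAGES 64  /* hypothetical size of the tracked region */

/* touched[v] is true once base page v has been referenced (assumed bookkeeping). */
bool touched[NUM_BASE_PAGES];

/*
 * Return true if every base page in the aligned group of 2^n base pages
 * containing vpn has been referenced, i.e. asap would promote the group now.
 * No reference counts are consulted, which mirrors the greedy policy.
 */
bool asap_ready(size_t vpn, unsigned n)
{
    size_t size = (size_t)1 << n;
    size_t base = vpn & ~(size - 1);   /* align down to the superpage boundary */

    if (base + size > NUM_BASE_PAGES)
        return false;                  /* group extends beyond the tracked region */
    for (size_t i = 0; i < size; i++)
        if (!touched[base + i])
            return false;
    return true;
}
```

A miss handler built this way would set `touched[vpn]` on each miss and then test increasing values of `n`; the check needs only one bit of state per base page.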
approx-online uses a competitive strategy to determine when superpages should be coalesced. If a 
superpage P accrues many misses, we expect that it will be referenced again in the future, and that promoting 
it will prevent many future TLB misses. Such promotions effectively prefetch the translations for the non­
resident base pages in the new superpage. To track this reference information, the approx-online algorithm 
maintains a counter P.prefetch for each potential superpage P. On a TLB miss, the policy increments the 
counters for all potential superpages that would have prevented the miss. In other words, on a miss to base 
page p, P.prefetch is incremented for each potential superpage P that contains the referenced page p and 
at least one current TLB entry. When the miss charges for a superpage P0 reach a pre-set threshold for
superpages of size P0.size, the pages that constitute P0 are promoted into a superpage.
The miss charges of a potential superpage should reflect the number of misses that earlier promotion would
have eliminated. So, when superpage P0 is created, the prefetch counters of all larger superpages containing it
must be adjusted to reflect the now-diminished benefits of their promotion. For all superpages P that contain
P0, P.prefetch is decremented by P0.prefetch, since whenever P0.prefetch was incremented, P.prefetch was,
too.
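The charging and adjustment rules can be sketched as follows. This is a simplified illustration, not Romer et al.'s implementation: it assumes the TLB holds only base-page entries, that superpages contain at most 2^6 base pages, and that no candidate has been promoted yet. The names `same_superpage`, `charge_miss`, and `promote_adjust` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LEVEL 6   /* assumed: largest superpage spans 2^6 = 64 base pages */

/* Does the aligned superpage of 2^n base pages holding page p also hold q? */
bool same_superpage(uint32_t p, uint32_t q, unsigned n)
{
    return (p >> n) == (q >> n);
}

/*
 * Charge a miss on base page p. tlb[] lists the virtual page numbers of the
 * base pages currently mapped by the TLB. charged[n] is set when the
 * potential superpage of 2^n base pages around p contains at least one
 * current TLB entry, the condition for incrementing its prefetch counter.
 * Returns the number of counters that would be incremented.
 */
int charge_miss(uint32_t p, const uint32_t *tlb, size_t tlb_len,
                bool charged[MAX_LEVEL + 1])
{
    int count = 0;

    charged[0] = false;                 /* level 0 is the base page itself */
    for (unsigned n = 1; n <= MAX_LEVEL; n++) {
        charged[n] = false;
        for (size_t i = 0; i < tlb_len; i++) {
            if (tlb[i] != p && same_superpage(p, tlb[i], n)) {
                charged[n] = true;
                count++;
                break;
            }
        }
    }
    return count;
}

/*
 * After the level-k superpage is promoted, subtract its counter from every
 * larger enclosing candidate. prefetch[n] is the counter of the level-n
 * candidate that contains the promoted superpage.
 */
void promote_adjust(unsigned prefetch[MAX_LEVEL + 1], unsigned k)
{
    for (unsigned n = k + 1; n <= MAX_LEVEL; n++)
        prefetch[n] -= prefetch[k];
}
```

For a miss on page 0x60005 with page 0x60006 resident, `charge_miss` charges levels 2 through 6 and skips level 1, because the two-page candidate covering 0x60004 and 0x60005 holds no TLB entry. This version scans the TLB on every miss, which is the simple but costly approach.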
Consider a system in which the base page size is 4096 bytes, superpages are built using powers of two
base pages, and the largest superpage contains 64 base pages. On such a system, approx-online behaves
as follows. Let (va,n) denote a superpage starting at virtual page number va and composed of 2^n base pages. Assume
that the application incurs a TLB miss at virtual address 0x60005023 and the TLB contains a translation
for virtual base page 0x60006 but has no translation for 0x60004. The prefetch counters for potential
superpages containing the virtual page 0x60005 and the TLB entry 0x60006 are incremented by one. These
superpages are (0x60004,2), (0x60000,3), (0x60000,4), (0x60000,5), and (0x60000,6). approx-online then
finds the largest potential superpage that has reached its promotion threshold and creates it. For example,
if (0x60004,2).prefetch has reached the threshold for superpages of size four, the operating system promotes
the superpage and decrements the prefetch counters for the containing superpages (i.e., (0x60000,3) through
(0x60000,6)) by the value of (0x60004,2).prefetch.
A simple but inefficient way to compute prefetch charges is to scan the TLB on a miss to page p, and
check whether some potential superpage contains both p and at least one current TLB entry. To avoid the
overhead of scanning the contents of the TLB on each miss, Romer et al. propose tracking an additional
value, P.tlbcount, for each superpage P. This counter indicates how many of the superpage's component
subpages (one power of two smaller in size) are currently in the TLB or contain TLB entries. P.tlbcount
takes on one of four values: -1, 0, 1, or 2. If P is a superpage or part of a larger superpage that has been
promoted, then P.tlbcount = -1. Otherwise, let P1 and P2 be the two component subpages of P. P.tlbcount
is 0, 1, or 2, depending on how many of its component subpages are in the TLB. This strategy allows the
prefetch charges to be updated efficiently on a TLB miss. 2
Note that approx-online is a simplification of the more complex online policy [22], which not only
charges a TLB miss to the potential superpages containing p, but also blames the eviction of p on the fact
that unrelated pages were not coalesced into superpages. online thus tries to coalesce other superpages
(those that do not contain p). Romer [21] shows that approx-online is as effective as online, but has much
lower bookkeeping overhead.
2 The kernel on HP PA-RISC only decides which pagetable entry to insert into the TLB. It has no control over which TLB entry
is going to be evicted upon a TLB insert instruction, nor does it have knowledge of which TLB entry has become the victim.
To maintain tlbcount, however, the TLB miss handler needs information about the evicted virtual address and the size of the
victim page. In our implementation, we massaged HP PA-RISC [...] control register, which is accessible from [...]
appropriate in a research circumstance.
The choice of threshold value used to decide when to promote a set of pages to a superpage is critical 
to the effectiveness of approx-online. The ideal threshold is small enough for useful superpages to be 
promoted early, thereby eliminating future TLB misses, but large enough so that the cost of promotion does 
not dominate TLB overhead. We quantify this tradeoff in Section 5.1.
Romer et al. choose an appropriate threshold value by using a competitive strategy — a collection of 
pages is promoted to a superpage as soon as it has suffered enough TLB misses to pay for the cost of 
the promotion. Theoretically, the promotion threshold should be the promotion cost divided by the TLB 
miss penalty. For example, if the average TLB miss penalty is 40 cycles and copying two base pages to 
a contiguous two-page superpage costs 16,000 cycles, the threshold for superpage promotion would be 400 
(16,000 divided by 40). Romer [21] proves that a system employing approx-online can suffer no more 
than twice the combined TLB miss and superpage promotion overheads that would be incurred by a system 
employing an optimal offline promotion algorithm. Although the theoretical threshold bounds worst-case 
behavior to an acceptable level, smaller thresholds tend to work better in practice. In our experiments, we 
therefore run approx-online with a range of different threshold values.
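The theoretical threshold is a single division, sketched below. The function names are hypothetical, integer division is an assumption of the sketch, and the miss penalty is assumed to be nonzero.

```c
/*
 * Theoretical promotion threshold: promotion cost divided by the TLB miss
 * penalty, both in cycles. Assumes miss_penalty_cycles is nonzero.
 */
unsigned promotion_threshold(unsigned promotion_cost_cycles,
                             unsigned miss_penalty_cycles)
{
    return promotion_cost_cycles / miss_penalty_cycles;
}

/* A candidate is promoted once its miss charges reach the threshold. */
int should_promote(unsigned prefetch_count, unsigned threshold)
{
    return prefetch_count >= threshold;
}
```

With the figures from the text, `promotion_threshold(16000, 40)` yields 400. Running with a smaller value than this trades the worst-case bound for earlier promotion.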
3.2 Promotion via Remapping
No-copy superpage creation relies on hardware support provided by the Impulse memory controller [28]. 
Such hardware provides an extra level of address remapping at the memory: unused physical addresses are 
remapped into “real” physical addresses. In keeping with Impulse terminology, we refer to these remapped 
addresses as shadow addresses, or the shadow address space. From the point of view of the processor and
OS memory management system, shadow addresses are used in place of real physical addresses. Shadow 
addresses will be inserted into the TLB as mappings for virtual addresses, they will appear as physical tags 
on cache lines, and they will appear on the memory bus when cache misses occur. The existence of shadow 
memory is completely transparent to user programs and the processor. It is the Impulse memory controller 
that identifies a shadow address and translates it to a physical address through the shadow-to-physical memory
controller pagetable. The operating system is responsible for managing this new level of address translation, 
but the memory controller maintains its own page tables for shadow memory mappings. Building superpages
from base pages that are not physically contiguous can be accomplished by simply remapping the virtual
pages to contiguous, aligned shadow pages. The memory controller then maps the shadow pages to the
original physical pages. There is a much larger TLB in the memory controller, which we call the MTLB. The MTLB
operates in the same way as the processor TLB except that it translates shadow addresses to real physical
addresses. MTLB translation is not on the virtual address resolution critical path, which means it can be
built to a fairly large size.

Figure 1: Detailed Example of Using Shadow Physical Regions
Carter et al. [6] describe the hardware design of the Impulse memory controller. The shadow address
space is divided into seven contiguous regions. A hardware shadow descriptor register is introduced for each
shadow region to speed up backend memory access requests. The shadow descriptor contains the start
addresses of the shadow region and of the shadow-to-physical memory controller pagetable, and the size of
the pagetable.
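The lookup that a shadow descriptor supports can be modeled in software as below. This is a sketch of the idea only and not the Impulse hardware design: the struct layout, the field names, the flat one-level page table, and the 32-bit address width are assumptions, and the region start is assumed to be page aligned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                       /* 4096-byte base pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

/* Hypothetical software model of one shadow descriptor. */
struct shadow_descriptor {
    uint32_t        region_start;  /* first shadow address of the region */
    const uint32_t *page_table;    /* shadow page index -> physical frame number */
    size_t          num_entries;   /* size of the page table, in entries */
};

/*
 * Translate a shadow address to a real physical address. Returns false if
 * the address falls outside the region described by d.
 */
bool shadow_to_physical(const struct shadow_descriptor *d,
                        uint32_t shadow_addr, uint32_t *phys_addr)
{
    if (shadow_addr < d->region_start)
        return false;

    uint32_t index = (shadow_addr - d->region_start) >> PAGE_SHIFT;
    if (index >= d->num_entries)
        return false;

    /* Swap the shadow page number for the real frame; keep the page offset. */
    *phys_addr = (d->page_table[index] << PAGE_SHIFT) | (shadow_addr & PAGE_MASK);
    return true;
}
```

Because consecutive shadow pages may map to arbitrary frames, a contiguous and aligned shadow range can stand in for scattered physical pages, which is what lets the processor TLB treat the range as one superpage.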
Figure 1 illustrates how superpage mapping works on Impulse. For simplicity, internal details of the Impulse
memory controller are not shown. Interested readers can refer to Carter et al. [6]. Suppose 2G bytes are
reserved for real physical memory space. Physical addresses starting from 0x80000000 are therefore unused and
are interpreted as shadow address space. In this example, the OS has mapped a contiguous 16KB virtual
address range to a single shadow superpage at “physical” page frame 0x80240. Upon a reference to a virtual 
address within this superpage range, virtual to physical address translation is executed in the usual way 







memory system has a total memory latency of 60 cycles. The simulated remapping memory controller is
based on the HP controller [10] used in servers and high-end workstations. The MTLB is configured at 1024
entries.
The TLB holds both instruction and data translations. It is fully associative, employs a not-recently-used
replacement policy, and returns a translation in one cycle. In addition to the main TLB, a single-entry
micro-ITLB holds the most recent instruction translation. The base page size is 4096 bytes. Superpages
are built in power-of-two multiples of the base page size, and the biggest superpage that the TLB can map
contains 1024 base pages. Kernel code and data structures are mapped using a single block-TLB entry that
is not subject to replacement. Our results include measurements for two TLBs, a small one with only 32
entries, and a larger one with 128 entries, which lets us examine how scaling the TLB affects the applications
that we study. The smaller TLB size also is close to the TLB size that Romer et al. used in their study.
They generate their traces using ATOM [25] on a DEC Alpha 3000/700 running DEC OSF/1 2.1, a system
that contains a 225 MHz Alpha 21064 processor with a 32-entry DTLB and an 8-entry ITLB, a 2-megabyte
offchip cache, and 160 megabytes of main memory.
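The TLB reach implied by these parameters follows from simple arithmetic, sketched here. The helper name `tlb_reach_bytes` is hypothetical, and the sketch assumes every entry maps the same number of base pages.

```c
#include <stdint.h>

#define BASE_PAGE_BYTES 4096u

/* Reach in bytes of a TLB whose entries each map pages_per_entry base pages. */
uint64_t tlb_reach_bytes(unsigned entries, unsigned pages_per_entry)
{
    return (uint64_t)entries * pages_per_entry * BASE_PAGE_BYTES;
}
```

With base pages only, the 32-entry TLB reaches 128 kilobytes and the 128-entry TLB reaches 512 kilobytes. A single entry holding the largest superpage of 1024 base pages maps 4 megabytes by itself.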
4.1 Microbenchmark
When comparing online superpage promotion schemes, an important performance factor is the number of
TLB misses that must be eliminated per promotion to amortize the cost of implementing the promotion
algorithm. This cost includes the extra time spent in the TLB miss handler determining when to coalesce
pages, plus the time spent performing the actual promotions (via either copying or remapping). To explore
the cost/performance tradeoffs for each approach, we run a synthetic microbenchmark consisting of a loop
that touches 4096 different base pages for a configurable number of iterations:
    char A[4096][4096];
    for (j = 0; j < test_iterations; j++)
        for (i = 0; i < 4096; i++)
            sum += A[i][j];
Without superpages, each memory access in the synthetic microbenchmark suffers a TLB miss. However,
since every page is touched repeatedly, superpages can be used to reduce the aggregate cost of these TLB
misses. This experiment determines the break-even point for each approach, i.e., the number of iterations
at which the benefit of creating superpages exceeds the cost of doing so.
4.2 Benchmark Suite
To evaluate the different superpage promotion approaches on real-world problems, we use nine programs from
a mix of sources. Our benchmark suite includes three SPEC95 benchmarks (compress, gcc, and vortex),
three image processing benchmarks (raytrace, rotate, and filter), two scientific benchmarks (cga and
matmul), and one SPLASH-2 benchmark (radix) [29]. All benchmarks were compiled with gcc 2.7.2 and
optimization level "-O2".
Compress is the SPEC95 data compression program run on an input of one million characters. Note that
the default SPEC95 implementation of compress executes the compression algorithm 25 times, whereas the
version of compress employed by Romer et al. appears to have executed the algorithm only once. Running
a different number of iterations does not affect the relative performance of the various superpage promotion
algorithms, but it does make the raw numbers (e.g., the number of TLB misses) incomparable. Gcc is
the cc1 pass of the version 2.5.3 gcc compiler (for SPARC architectures) used to compile the 306-kilobyte
file "lcp-decl.c". Vortex is an object-oriented database program measured with the SPEC95 "test" input.
Radix is an integer radix sort program (based on the method of Blelloch et al. [4]) run with the SPLASH-2
default arguments. Cga is the NPB2.3 benchmark suite's class A conjugate gradient benchmark, which
performs a sparse matrix-vector product. Matmul is the conventional, tiled version of dense matrix-matrix
multiplication run on 1024x1024 matrices with 32x32 tiles. Raytrace is an interactive isosurfacing volume
renderer whose input is a 1024x1024x1024 volume; its implementation is based on work done by Parker et
al. [20]. Filter performs an order-129 binomial filter on a 32x1024 color image. Rotate turns a 1024x1024
color image clockwise through one radian.
Two of these benchmarks, gcc and compress, are also included in Romer et al.'s benchmark suite,
although we use SPEC95 versions, whereas they use SPEC92 versions. We do not use the other SPEC92
[...] benchmarks are based on tools used in the research environment at the University of Washington, and are
not readily available to us.

[Figure 2, panels (a) copying and (b) remapping; x axis: iterations; legend includes copy+aol8, copy+aol32, copy+aol64, copy+aol128]
Figure 2: Microbenchmark performance for 4096 base pages. aolxx: approx-online with threshold xx
5 Results
The performance results presented here are obtained through complete simulation of the benchmarks,
including both kernel and application time, the direct overhead of implementing the superpage promotion
algorithms, and the resulting effects on the memory system. We first present the results of our microbenchmark
experiments exploring the break-even points for each of the superpage promotion policies and mechanisms,
then we present comparative performance results for our application benchmark suite.
5.1 Microbenchmark Results
Figure 2(a) and Figure 2(b) illustrate our microbenchmark results for online superpage promotion via copying
and remapping, respectively. The microbenchmark's working set is sufficiently large that performance is the
same for both a 32-entry and a 128-entry TLB. The x axes indicate the number of times the microbenchmark's
main loop is repeated, i.e., the number of times each page is referenced. The execution times include kernel
startup times. These graphs emphasize the performance differences among the asap and approx-online
policies.
For instance, copying-based asap only becomes profitable after each page is touched more than one
thousand times, whereas the same policy breaks even after only sixteen references per page when remapping
is used. Copying performs much worse when pages are seldom referenced: execution is 94 times slower
than the baseline when the asap policy is employed but each page is touched only once. In that case asap
promotes (copies) all of the pages, even though they are never accessed again. In contrast, the remapping
promotion mechanism delivers more robust performance, and both remap+asap and remap+aol2 result in
a slowdown of less than a factor of two when the microbenchmark touches each page between one and eight
times. The cost of a TLB miss increases from around 30 cycles in the baseline to 700 cycles for remapping
asap, and to 88,000 cycles for copying asap. In this example, asap suffers only one TLB miss per subpage
before promoting the set of pages to a superpage, so this average TLB miss time includes the cost of promotion.
Performance for all the approx-online configurations suffers when the threshold is larger than the
number of references to each page, because the additional overheads in the TLB miss handler dominate the
microbenchmark execution time. The number of references required for this policy to be profitable increases
with the threshold. For copying-based promotion and thresholds of two and eight, the number of references
per page must be 64 and 16 times the threshold, respectively. For remapping and for copying with thresholds
of 32 or more, approx-online improves performance when the number of references per page is at least
eight times the threshold. The TLB miss penalty goes from about 30 cycles in the baseline to 800 cycles for
remapping approx-online and 5800 cycles for copying approx-online.
In general, the remapping-based policies deliver performance benefits at much lower thresholds, and all
policies and mechanisms perform well when pages are referenced at least 2048 times. asap exhibits the
largest variation in performance, delivering the best speedups when superpages are built via remapping, and
the worst slowdowns when superpages are built via copying.
5.2 Full-Application Results
Table 1 lists the characteristics of the baseline run of each benchmark, where no superpage promotion
occurs. These benchmarks demonstrate varying sensitivity to TLB performance: on the system with the
smaller TLB, between 14% and 77% of their execution time is spent in the data TLB miss handler. The
percentage of time spent handling TLB misses falls to between less than 1% and 58% on the system with a
128-entry TLB.



















32-entry TLB
compress    7369   928492         434950    7117    63878   60.06%
gcc         1196   [illegible]    126507     545     3305   15.66%
vortex      1372   274983         186154     740     5699   23.01%
radix        544    35541          10722     544     1960   20.00%
cga         3059   432698           7514   18572     7384   14.97%
matmul      9609   824743         144160   13136   138585   77.01%
raytrace    1311    86628           9894    2290     8459   33.98%
filter       603   131340          67961     537     4227   37.46%
rotate       483    32358          26149    3016     3617   53.64%
128-entry TLB
compress    3983   671?41         37?687    6846       22    0.06%
gcc          883   220213         120795     508      170    1.20%
vortex       912   236734         177643     684      616    4.05%
radix        520    33706          10261     543     1502   16.28%
cga         2704   431972           7332   18525      786    3.72%
matmul      2386   269850           4839   12147      136    0.33%
raytrace    1303    86042           9776    2283     8399   33.89%
filter       578   129410          67476     538     3745   34.70%
rotate       483    32354          26148    3016     3617   53.65%
Table 1: Characteristics of each baseline run.
Figures 3 and 4 show the normalized speedups of the different combinations of promotion policies (asap
and approx-online) and mechanisms (remapping and copying) compared to the baseline instance of each
benchmark. We can make two orthogonal comparisons from these figures: remapping versus copying, and
asap versus approx-online. The two dark bars on the left of the figure for each benchmark illustrate
results for remapping-based asap (remap+asap) and copying-based asap (copy+asap). The two light bars on
the right represent the best results from remapping-based approx-online (remap+aol) and copying-based
approx-online (copy+aol). The numbers appended to the approx-online labels indicate the optimal
thresholds for initiating the promotion of two 4-kilobyte base pages to one 8-kilobyte superpage. When
the base threshold is eight (e.g., for copy+aol8), the corresponding threshold for promoting two 8-kilobyte
superpages to a 16-kilobyte superpage is 2 x 8 = 16, and so forth. Online superpage promotion can improve
performance by up to a factor of 4.5 (on matmul, our tiled matrix multiply routine). However, it also can
decrease performance by up to 35% (when using the copying version of asap on rotate).
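The threshold scaling described above can be written as a shift. This sketch assumes the doubling continues at every level, which is one reading of "and so forth"; the function name and the level numbering (level 0 merges two base pages) are hypothetical.

```c
/*
 * Threshold for merging two superpages of 2^level base pages each, given
 * the base threshold used at level 0 (two 4-kilobyte base pages).
 * Assumes the threshold doubles with each level.
 */
unsigned level_threshold(unsigned base_threshold, unsigned level)
{
    return base_threshold << level;
}
```

With a base threshold of eight, as in copy+aol8, the function gives 8 for the 8-kilobyte superpage and 16 for the 16-kilobyte superpage.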
We first compare the two promotion algorithms, asap and approx-online, using the results from Figures 3
and 4. The relative performance of the two algorithms is strongly influenced by the choice of promotion
mechanism, remapping or copying. Using a remapping promotion mechanism, asap slightly outperforms
approx-online in the average case. It exceeds the performance of approx-online in thirteen of the eighteen
experiments, and trails the performance of approx-online in only four cases. The differences in performance
range from asap+remap outperforming aol+remap by 13% for gcc with a 32-entry TLB, to aol+remap
outperforming asap+remap by 17% for rotate with a 32-entry TLB. In general, however, the performance
differences between the two policies are small. Considering that the results we present for approx-online
are for the optimal threshold in each case (rather than some fixed system-wide threshold) and that asap is
a much simpler policy to implement, we believe that the asap policy is the best choice when remapping is
an option.
Figure 3: Normalized speedups for our two superpage promotion mechanisms for each of two promotion policies on
a system with a 32-entry TLB. This graph shows the best performance for any policy configuration in each category.
(Bar chart over the benchmarks compress, gcc, vortex, radix, cga, matmul, raytrace, filter, and rotate.)
Figure 4: Normalized speedups for our two superpage promotion mechanisms for each of two promotion policies on a
system with a 128-entry TLB. These results give the best performance for any policy configuration in each category.
(Bar chart over the same benchmarks; legend: asap, approx-online.)
The results change noticeably when we employ a copying promotion mechanism. In this case, approx-online
outperforms asap in ten of the eighteen experiments, while the policies perform identically in three of the
other eight cases. The magnitude of the disparity between approx-online and asap results is also
dramatically larger. The differences in performance range from asap outperforming approx-online by 28% for
vortex with a 32-entry TLB, to approx-online outperforming asap by 36% for raytrace with a 32-entry
TLB. Overall, our results confirm those of Romer et al.: the best promotion policy to use when creating
superpages via copying is approx-online.
The relative performance of the asap and approx-online promotion policies changes when we employ
different promotion mechanisms because asap tends to create superpages more aggressively than
approx-online. The design assumption underlying the approx-online algorithm (and the reason that
it performs better than asap when copying is used) is that superpages should not be created until the
opportunity cost of TLB misses equals the cost of creating the superpages. Given that remapping has a much
lower cost for creating superpages than copying, it is not surprising that the more aggressive asap policy
performs relatively better than approx-online when combined with the remapping mechanism (and vice
versa).
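The break-even reasoning behind approx-online can be illustrated with a small sketch. It is a simplified model under stated assumptions, not the algorithm as implemented in the simulator: the function names are ours, and the cycle counts are hypothetical values chosen only to show how a cheaper promotion mechanism lowers the break-even point.

```python
def break_even_misses(promotion_cost_cycles: int, miss_cost_cycles: int) -> int:
    """Smallest number of TLB misses whose total cost reaches the promotion cost."""
    if promotion_cost_cycles < 0 or miss_cost_cycles <= 0:
        raise ValueError("costs must be non-negative, and the miss cost positive")
    # Ceiling division without floating point.
    return -(-promotion_cost_cycles // miss_cost_cycles)


def should_promote(misses_charged: int, promotion_cost_cycles: int,
                   miss_cost_cycles: int) -> bool:
    """Promote once the misses charged so far have cost as much as promoting would."""
    return misses_charged * miss_cost_cycles >= promotion_cost_cycles


# Hypothetical costs (assumptions, not measurements from the paper).
MISS_COST = 40      # cycles per TLB miss
COPY_COST = 2000    # cycles to build a superpage by copying
REMAP_COST = 200    # cycles to build the same superpage by remapping

print(break_even_misses(COPY_COST, MISS_COST))
print(break_even_misses(REMAP_COST, MISS_COST))
```

As the promotion cost shrinks, the break-even count approaches one miss, at which point a break-even policy behaves much like the greedy asap policy.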
Figure 5: Performance details for selected benchmarks on the system with a 32-entry TLB.
(Panels: (a) compress, (b) vortex, (c) rotate.)
Note that we present numbers for an optimal approx-online policy for each benchmark, that is, one that
uses the promotion threshold that we observe to deliver the best performance for that benchmark. This
optimal threshold ranges from four (for remap+aol on most applications) to 32 (for copy+aol on two of the
applications). The wrong choice of threshold can hurt performance, as demonstrated by Figures 5 and 6.
For a 32-entry TLB, the difference between the performances of copy+aol with the best and worst threshold
choices between eight and 32 is as large as 23% for cga (1.14 versus 0.91). The performance spread with a
128-entry TLB is as large as 45% for matmul. The magnitude of the impact of proper threshold selection
is atypically large for these two programs, however. The choice of a fixed compromise threshold, e.g., 16,
reduces the average performance of approx-online by roughly 4%. In their earlier study, Romer et al.
employ a fixed threshold of 100, which we find to be far too large. This issue is discussed in more detail
in Section 5.3.
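The 23% figure quoted for cga appears to be the difference between the two normalized speedups (1.14 and 0.91), that is, 23 percentage points of baseline performance rather than a ratio between the two runs. The short sketch below makes the distinction explicit; the helper names are ours.

```python
def spread_in_points(best: float, worst: float) -> float:
    """Gap between two normalized speedups, in percentage points of the baseline."""
    return round((best - worst) * 100, 1)


def spread_relative(best: float, worst: float) -> float:
    """Gap expressed relative to the worse configuration, in percent."""
    return round((best / worst - 1) * 100, 1)


# Best and worst copy+aol thresholds for cga on the 32-entry TLB.
print(spread_in_points(1.14, 0.91))
print(spread_relative(1.14, 0.91))
```

Measured relative to the worse threshold instead, the same pair of speedups differs by a little over 25%.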
When we compare the two superpage creation mechanisms, remapping and copying, remapping is the clear
winner, but by highly varying margins. The difference in performance between the best overall
remapping-based algorithm (asap+remap) and the best copying-based algorithm (aol+copy) is as large as 51%
in the case of vortex on a 32-entry TLB. Overall, asap+remap outperforms aol+copy by more than
15% in seven of the eighteen experiments, although the margin is less than 5% in all but one of the other
experiments.
Figure 5 and Figure 6 illustrate one of the secondary reasons that the remapping mechanism outperforms 
the copying mechanism — cache pollution. These figures show the relative numbers of cache misses suffered 
by the benchmarks when superpage promotion is enabled (versus the baseline execution) for three of our 
benchmarks. The dark gray bars indicate the relative execution time of the benchmarks with superpage 
promotion enabled, while the medium gray bars indicate the relative number of cache misses. A shorter bar 
thus indicates improved performance or a reduction in the number of cache misses. For both configurations
of TLBs, the number of cache misses is high for copying-based promotion for each of the three benchmarks.
Take compress as an example. For 32-entry TLBs, superpage promotion improves the performance of all
policies and both approaches. However, the number of cache misses grows substantially for the copying
variant of approx-online on compress. A closer examination of compress reveals that the high DTLB
miss rate (60%) and resulting large number of superpage promotions lead to significant cache pollution as
the superpages are created. Nevertheless, due to the dramatic reduction in TLB misses, the performance of
compress improves by roughly 40%, despite the large increase in L1 cache misses. [...]
the cost of page copying and cache pollution is not amortized enough, so almost all copying-based
promotions fail to benefit at 32-entry TLBs.
Figure 6: Performance details for selected benchmarks on the system with a 128-entry TLB.
(Panel recovered: (a) compress.)
5.3 Discussion
Romer et al. show that approx-online is generally superior to asap when copying is used. When remapping
is used to build superpages, though, we find that the reverse is true. Using Impulse-style remapping results 
in larger speedups and consumes much less physical memory. Since superpage promotion is cheaper with a 
remapping mechanism, policies are much less likely to promote pages too aggressively.
Romer et al.’s trace-based simulation models no cache interference between the application and the
TLB miss handler; instead, that study assumes that each superpage promotion costs a total of 3000 cycles
per kilobyte copied [22]. Table 2 shows our measured lower bounds of the per-kilobyte cost (in CPU cycles)
to promote pages by copying for four representative benchmarks. We measure this bound by subtracting
the execution time of aol+remap from that of aol+copy and dividing by the number of kilobytes copied.
For our simulation platform and benchmark suite, superpage promotion costs vary with an application’s
cache performance. For compress, raytrace, and radix, all of which have cache hit ratios in excess of 96%,
superpage promotion is about twice as expensive as Romer et al. assumed. For rotate, which has a cache
hit ratio of only 82%, superpage promotion costs more than five times the cost charged in the trace-driven
study.
Table 2: Comparison of cache performances with average costs in cycles for approx-online superpage promotion
via copying.
benchmark    cost (cycles per KB)    cache hit ratios
rotation     16,316                  82.12%    87.47%
compress      5,909                  98.88%    99.41%
raytrace      5,186                  96.18%    97.61%
We also find that even when copying is used to promote pages, approx-online performs better with a
more aggressive (lower) threshold than is used by Romer et al. Specifically, the optimal threshold in our
experiments varies from 8 to 32, while their study uses a fixed threshold of 100. This difference in thresholds
has a significant impact on performance. For example, when we run the gcc benchmark using a threshold
of 128, approx-online with copying slowed performance by 4.3% with a 32-entry TLB, which is close to
the 0.9% slowdown reported in Romer et al.’s study; the difference is likely to be caused by the higher
per-kilobyte promotion costs we measured. In contrast, when we run approx-online with copying using
the optimal threshold of 8, performance is improved by 16%. Given that our measured cost of promoting
pages is much higher than the 3000 cycles estimated in their study, we expected our optimal thresholds to
be higher, not lower, than theirs. In general, we find that to achieve their maximum potential, even the
copying-based promotion algorithms need to be much more aggressive about creating superpages than was
suggested by the earlier study.
Figure 7: Normalized speedups with tlbcount checking disabled. This graph shows the best performance for any
policy configuration in each category with a 32-entry TLB.
(Bar chart over the benchmarks compress, gcc, vortex, radix, cga, matmul, raytrace, filter, and rotate.)
Finally, we can compare the results of our experiments using online superpage promotion against those
reported by Swanson et al. using a static promotion policy and a remapping mechanism. For their study [28],
[...] malloc() operations to request that a particular region of virtual memory be made into a superpage. They
find that static superpage promotion coupled with remapping improves the performance of compress by
approximately 5%, gcc by approximately 2%, radix by approximately 20%, and vortex by approximately
10% for a 128-entry processor TLB. Using a dynamic superpage promotion algorithm that automatically
selects pages for promotion without user input (asap+remap), the performance of compress drops by 1%
while the performances of gcc, radix, and vortex improve by 2%, 19%, and 7%, respectively. Thus, we find
that in most cases, the algorithmic overhead of running an online superpage promotion policy does not mask
the potential benefits of promotion even when coupled with a low-overhead promotion mechanism. This result
confirms the basic premise of Romer et al.’s study, that online promotion algorithms are a potentially valuable
operating system technique for improving memory system performance on a wide variety of platforms.
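Returning to the cost comparison behind Table 2, the lower-bound measurement described in this section (the execution time of aol+copy minus that of aol+remap, divided by the kilobytes copied) can be written as a short sketch. The function names are ours; the 3000 cycles per kilobyte is the cost assumed in the trace-driven study [22], and the per-kilobyte figures are the values listed in Table 2.

```python
ROMER_CYCLES_PER_KB = 3000  # per-kilobyte copy cost assumed by Romer et al. [22]


def copy_cost_lower_bound(cycles_aol_copy: float, cycles_aol_remap: float,
                          kb_copied: float) -> float:
    """Lower bound on the cycles spent per kilobyte copied during promotion."""
    if kb_copied <= 0:
        raise ValueError("kb_copied must be positive")
    return (cycles_aol_copy - cycles_aol_remap) / kb_copied


def ratio_to_trace_driven_cost(cycles_per_kb: float) -> float:
    """How many times larger a measured cost is than the trace-driven assumption."""
    return cycles_per_kb / ROMER_CYCLES_PER_KB


# Per-kilobyte costs as listed in Table 2.
table2 = {"rotation": 16316, "compress": 5909, "raytrace": 5186}
for name, cost in table2.items():
    print(f"{name}: {ratio_to_trace_driven_cost(cost):.2f}x the assumed cost")
```

The ratios for compress and raytrace come out near two and the ratio for rotation above five, in line with the "about twice" and "more than five times" statements in the text.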
Recall that in the approx-online algorithm, P.prefetch is not incremented unless potential superpage
P contains a current TLB entry as one of its component pages. To evaluate the importance of this tlbcount
check, we disable it, so that P.prefetch is incremented whenever any of the component pages of P
causes a TLB miss. Figure 7 shows the normalized speedups obtained on a 32-entry TLB
without tlbcount checking. Comparing Figure 7 with Figure 3 shows that performance drops noticeably
for almost all the configurations. This experiment demonstrates the effectiveness of tlbcount checking, which
makes approx-online superpage promotion less speculative.
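The effect of the tlbcount check can be sketched as follows. This is a simplified model, not the kernel code used in the experiments: apart from P.prefetch and tlbcount, which the text names, the class, function, and parameter names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PotentialSuperpage:
    pages: frozenset   # virtual page numbers that P would cover
    threshold: int     # promotion threshold for a superpage of this size
    prefetch: int = 0  # the P.prefetch counter from the text


def on_tlb_miss(P: PotentialSuperpage, missed_page: int, tlb_resident: set,
                check_tlbcount: bool = True) -> bool:
    """Charge a TLB miss on `missed_page` to P; return True if P should be promoted."""
    if missed_page not in P.pages:
        return False
    # tlbcount: how many of P's component pages currently hold a TLB entry.
    tlbcount = len(P.pages & tlb_resident)
    if check_tlbcount and tlbcount == 0:
        # No sibling page is mapped, so a superpage would not have avoided this miss.
        return False
    P.prefetch += 1
    return P.prefetch >= P.threshold


P = PotentialSuperpage(pages=frozenset({0, 1}), threshold=2)
print(on_tlb_miss(P, 1, tlb_resident=set()))  # not charged: no sibling in the TLB
print(on_tlb_miss(P, 1, tlb_resident={0}))    # charged once
print(on_tlb_miss(P, 1, tlb_resident={0}))    # charged twice, threshold reached
```

Passing check_tlbcount=False charges every miss on a component page, which corresponds to the more speculative variant evaluated in Figure 7.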
6 Conclusions and Future Work
To summarize our results, we find that when creating superpages dynamically:
• Remapping-based promotion outperforms copying-based promotion by up to 30%.
• Remapping-based superpage promotion has better cache performance than copying-based promotion. 
Depending on the application, the difference in cache performance can significantly affect the speedup 
of superpage promotion.
•  Remapping-based asap superpage promotion is the most promising approach (because the cost of 
promotion is relatively low).
Although our results for copying-based promotion are qualitatively similar to Romer et al.’s, they differ
quantitatively. Romer et al. use trace-driven simulation, so their cost model for promotion is quite simple.
Based on our measurements, the costs for copying-based promotion are significantly higher in a real system,
largely due to cache effects. In addition, we find that the promotion thresholds used in Romer et al.’s
approx-online simulations tend to be too high.
As applications continue to consume larger amounts of memory, the necessity of using superpages will 
grow. Our most significant result is that, given relatively simple hardware at the memory controller, a 
straightforward greedy policy for constructing superpages works well.
Further work in this area should look at how the different promotion mechanisms and policies interact 
with multiprogramming. When multiple programs compete for TLB space, it is possible that the choice 
of which mechanism and policy is best will change. In particular, the penalty for being too aggressive in 
creating superpages increases when the memory subsystem might be forced to tear down superpages to 
support demand paging. Our intuition is that remapping-based asap will likely remain the best choice, 
because it combines the lowest overhead promotion policy with the lowest overhead promotion mechanism.
References
[1] Advanced Micro Devices. AMD Athlon processor technical brief, http://www.amd.com/products/cpg/athlon/- 
techdocs/pdf/22054.pdf, 1999.
[2] K. Bala, F. Kaashoek, and W. Weihl. Software prefetching and caching for translation buffers. In Proc. of the
First OSDI, pp. 243-254, Nov. 1994.
[3] B. Bershad, D. Lee, T. Romer, and J. Chen. Avoiding conflict misses dynamically in large direct-mapped caches.
In Proc. of the 6th ASPLOS, pp. 158-170, Oct. 1994.
[4] G. Blelloch, C. Leiserson, B. Maggs, C. Plaxton, S. Smith, and M. Zagha. A comparison of sorting algorithms
for the connection machine cm-2. In Proc. of the 3rd Annual ACM Symposium on Parallel Algorithms and
Architectures, pp. 3-16, July 1991.
[5] P. Cao, E. Felten, and K. Li. Implementation and performance of application-controlled file caching. In Proc.
of the First OSDI, pp. 165-177, Nov. 1994.
[6] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote,
M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a smarter memory controller. In Proc. of the
Fifth HPCA, pp. 70-79, Jan. 1999.
[7] J. B. Chen, A. Borg, and N. P. Jouppi. A simulation based study of TLB performance. In Proc. of the 19th
ISCA, pp. 114-123, May 1992.
[8] Compaq Computer Corporation. Alpha 21164 Microprocessor Hardware Reference Manual, July 1999.
[9] HAL Computer Systems Inc. SPARC64-GP processor. http://mpd.hal.com/products/SPARC64-GP.html, 1999.
[10] T. Hotchkiss, N. Marschke, and R. McClosky. A new memory system design for commercial and technical
computing products. Hewlett-Packard Journal, 47(1):44-51, Feb. 1996.
[11] Intel Corporation. Pentium Pro Family Developer’s Manual, Jan. 1996.
[12] B. Jacob and T. Mudge. Software-managed address translation. In Proc. of the Third HPCA, pp. 156-167, Feb.
1997.
[13] B. Jacob and T. Mudge. A look at several memory management units, tlb-refill mechanisms, and page table
organizations. In Proc. of the 8th ASPLOS, pp. 295-306, Oct. 1998.
[14] A. Karlin, K. Li, M. Manasse, and S. Owicki. Empirical studies of competitive spinning for shared memory
multiprocessors. In Proc. of the 13th SOSP, pp. 41-55, Oct. 1991.
[15] Y. Khalidi, M. Talluri, M. Nelson, and D. Williams. Virtual memory support for multiple page sizes. In Proc.
of the 4th WWOS, pp. 104-109, Oct. 1993.
[16] MIPS Technologies, Inc. MIPS R10000 Microprocessor User’s Manual, Version 2.0, Dec. 1996.
[17] J. Mogul. Big memories on the desktop. In Proc. 4th WWOS, pp. 110-115, Oct. 1993.
[18] M. Talluri and M. Hill. Surpassing the TLB performance of superpages with less operating system support. In
Proc. of the 6th ASPLOS, pp. 171-182, Oct. 1994.
[19] M. Talluri, S. Kong, M. Hill, and D. Patterson. Tradeoffs in supporting two page sizes. In Proc. of the 19th
ISCA, pp. 415-424, May 1992.
[20] S. Parker, P. Shirley, Y. Livnat, C. Hansen, and P.-P. Sloan. Interactive ray tracing for isosurface rendering. In
Proc. of the Visualization ’98 Conference, Oct. 1998.
[21] T. Romer. Using Virtual Memory to Improve Cache and TLB Performance. PhD thesis, University of
Washington, May 1998.
[22] T. Romer, W. Ohlrich, A. Karlin, and B. Bershad. Reducing TLB and memory overhead using online superpage
promotion. In Proc. of the 22nd ISCA, pp. 176-187, June 1995.
[23] A. Saulsbury, F. Dahlgren, and P. Stenstrom. Recency-based TLB preloading, http://www.ce.chalmers.se/-
ash/recency-preloading.pdf, 1999.
[24] D. Sleator and R. Tarjan. Amortized efficiency of list update and paging rules. CACM, 28:202-208, 1985.
[25] A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. In Proc. of the
1994 ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 196-205, June
1994.
[26] L. Stoller, R. Kuramkote, and M. Swanson. PAINT: PA instruction set interpreter. TR UUCS-96-009, University
of Utah Department of Computer Science, Sept. 1996.
[27] SUN Microsystems, Inc. UltraSPARC User’s Manual, July 1997.
[28] M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In
Proc. of the 25th ISCA, pp. 204-213, June 1998.
[29] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and
methodological considerations. In Proc. of the 22nd ISCA, pp. 24-36, June 1995.
