W&M ScholarWorks
Dissertations, Theses, and Masters Projects

Theses, Dissertations, & Master Projects

2004

Efficient caching algorithms for memory management in
computer systems
Song Jiang
College of William & Mary - Arts & Sciences

Follow this and additional works at: https://scholarworks.wm.edu/etd
Part of the Computer Sciences Commons

Recommended Citation
Jiang, Song, "Efficient caching algorithms for memory management in computer systems" (2004).
Dissertations, Theses, and Masters Projects. Paper 1539623446.
https://dx.doi.org/doi:10.21220/s2-q8t1-e863

This Dissertation is brought to you for free and open access by the Theses, Dissertations, & Master Projects at W&M
ScholarWorks. It has been accepted for inclusion in Dissertations, Theses, and Masters Projects by an authorized
administrator of W&M ScholarWorks. For more information, please contact scholarworks@wm.edu.

Efficient Caching Algorithms
for Memory Management in Computer Systems

A Dissertation
Presented to
The Faculty of the Department of Computer Science
The College of William & Mary in Virginia

In Partial Fulfillment
Of the Requirements for the Degree of
Doctor of Philosophy

by
Song Jiang
2004

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

APPROVAL SHEET
This dissertation is submitted in partial fulfillment of
the requirements for the degree of

Doctor of Philosophy

Song Jiang

Approved by the Committee, June 2004

Phil Kearns

Bruce Lowekamp

Andreas Stathopoulos

Fabrizio Petrini
Los Alamos National Laboratory


To my mother, my wife and my son.


Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Memory Hierarchies and Caching
    1.1.1 Locality and Replacement algorithms
    1.1.2 Replacement Policies for Virtual Memory
    1.1.3 Global Replacement in Multiprogramming Environments
    1.1.4 Placement and Replacement in Distributed File Buffer Caches
  1.2 Contributions
  1.3 Organization

2 General-Purpose Replacement Algorithms
  2.1 Background
    2.1.1 The Problems of the LRU Replacement Algorithm
    2.1.2 An Executive Summary of my Algorithm
  2.2 Related Work
    2.2.1 User-level Hints
    2.2.2 Tracing and Utilizing History Information of a Block
    2.2.3 Detection and Adaptation of Access Regularities
    2.2.4 Working Set Models
  2.3 The LIRS algorithm
    2.3.1 General Idea
    2.3.2 The LIRS Algorithm Based on LRU Stack
    2.3.3 A Detailed Description
  2.4 Performance Evaluation
    2.4.1 Experimental Settings
    2.4.2 Access Pattern Based Performance Evaluation
      2.4.2.1 Performance for the Looping Type
      2.4.2.2 Performance for the Probabilistic Type
      2.4.2.3 Performance for the Temporally-Clustered Type
      2.4.2.4 Performance for the Mixed Type
    2.4.3 LIRS Performance with High End Systems
    2.4.4 LIRS versus Other Stack-Based Replacements
      2.4.4.1 LIRS Threshold and Access Characteristics
      2.4.4.2 LRU as a Special Member of the LIRS Family
  2.5 Sensitivity and Overhead Analysis
    2.5.1 Size Selection of List Q Holding Resident HIR Blocks (L_hirs)
    2.5.2 Overhead Analysis
  2.6 Summary

3 Virtual Memory Replacement Policies
  3.1 Background
    3.1.1 The Research Status of Memory Replacement Policies
    3.1.2 LRU/CLOCK and their Performance Disadvantages
    3.1.3 LIRS and its Performance Advantages
  3.2 Related Work
  3.3 Description of CLOCK-Pro
    3.3.1 Main Idea
    3.3.2 Data Structure
    3.3.3 Operations on Searching Victim Pages
    3.3.4 Making CLOCK-Pro Adaptive
  3.4 Performance Evaluation
    3.4.1 Simulation on Buffer Cache for File I/O
    3.4.2 Simulation on Memory for Program Executions
    3.4.3 Simulation on Program Executions with Interference of File I/O
  3.5 Summary

4 Thrashing in Multiprogramming Environments
  4.1 Background
    4.1.1 MPL versus System Thrashing
    4.1.2 Thrashing and Page Replacement
    4.1.3 Effectiveness of adaptive page replacement
    4.1.4 Our work
  4.2 Evolution of Page Replacement in Linux Kernel
    4.2.1 Kernel 2.0
    4.2.2 Kernel 2.2
    4.2.3 Kernel 2.4
    4.2.4 The Impact of Page Replacement on CPU and Memory Utilizations
  4.3 Evaluation of Page Replacement in Linux Kernels 2.2
    4.3.1 Experimental environment
    4.3.2 Page Replacement Behavior of Kernel 2.2.14
  4.4 The Design and Implementation of TPF
    4.4.1 The detection routine
    4.4.2 The protection routine
    4.4.3 State transitions in the system
  4.5 Performance Measurements and Analysis
    4.5.1 Observation and measurements of TPF facility
    4.5.2 Experiences with TPF in the multiprogramming environment
  4.6 Related Work
    4.6.1 The Working Set Model and its Implementation Issues
    4.6.2 Other Related Work
  4.7 Summary

5 Multi-Level Buffer Cache Management
  5.1 Background
    5.1.1 Hierarchical Caching and its Challenges
    5.1.2 Possible Solutions: Customized Second-Level Replacement and the Unified LRU
    5.1.3 Our Principles to Address the Challenges
  5.2 Quantifying Non-uniform Locality Strengths in Hierarchical Buffer Caching
    5.2.1 Methods to Distinguish Locality Strengths
    5.2.2 Comparisons of Locality Strength Quantification Methods
  5.3 The Unified and Level-aware Caching (ULC) Protocol
    5.3.1 An Executive Summary
    5.3.2 A Detailed Description
      5.3.2.1 The Single-client ULC Protocol
      5.3.2.2 The Multi-client ULC Protocol
  5.4 Performance Evaluation
    5.4.1 Performance Metric
    5.4.2 Simulation Environment
    5.4.3 Comparisons of Multi-level Schemes in a Three-level Structure
    5.4.4 The Performance Implication of System Parameters
      5.4.4.1 The Impact of Server Cache Size
      5.4.4.2 The Impact of Client Cache Size
      5.4.4.3 The Impact of Network Bandwidth
    5.4.5 Comparisons of Caching Schemes for Multi-client Workloads
  5.5 Related Work and Discussions
  5.6 Summary

6 Conclusions and Future Work
  6.1 General-Purpose Replacement Algorithms
  6.2 Low Cost Virtual Memory Replacement Algorithms
  6.3 Thrashing Prevention
  6.4 Multi-Level Buffer Cache Management

Bibliography

ACKNOWLEDGMENTS
Foremost, I give my thanks to my Lord, who strengthens me when I am weak, shows
me the way when I am lost, teaches me the wisdom out of heaven, and loves me in all
circumstances.

I would like to give my thanks to my adviser, Xiaodong Zhang, from the bottom of my
heart. During the past five years, he has provided me with guidance and help in both my
research work and my life. He always has the passion to motivate me with new research
directions, to challenge me for better solutions, and to guide me through the difficulties
in the process. He has always discussed research issues with me open-mindedly and
encouraged me to think in a broader context. The benefits I have so gratefully received
from him go well beyond the academic. He has also put much effort into helping me
overcome the difficulties in my life and has taken care of my well-being. I feel extremely
lucky to have Xiaodong as my adviser; he has made my time at William and Mary a warm
and happy memory.

I thank my committee, Phil Kearns, Bruce Lowekamp, and Andreas Stathopoulos at
William and Mary, and Fabrizio Petrini at Los Alamos National Laboratory, for their
encouragement and advice on my research work. I learned a lot from the systems course
taught by Phil, which prepared me for my system implementation work. Fabrizio has
provided me with insightful comments on my research work and much help with my career
development. I am really impressed by his dedication and passion as a researcher. I would
also like to thank William Bynum for his help in reading almost every one of my
manuscripts and giving detailed comments and suggestions. I thank Dimitrios Nikolopoulos
for his valuable cooperation and discussions. I really appreciate the help and
encouragement from Evgenia Smirni and Andreas Stathopoulos, who even gave so much baby
stuff for my newborn son. I thank Vanessa Godwin, who, as the administrative director of
the department, was so helpful and thoughtful in providing me with the assistance I needed.

I also give my thanks to Shirong Zhen, who was my master's thesis adviser at the
University of Science and Technology of China (USTC). He inspired my interest in research
in the computer science field.

I wish to give my thanks to my colleagues and friends for making my educational life
so memorable: to name a few, Songqing Chen, Xin Chen, Lei Guo, Yongguang Liang,
Hui Li, Ling Liu, Shuquan Nie, Shansi Ren, Tanping Wang, Qi Zhang, and Donghua Zhou.
I will miss the time I spent with them.

I would like to thank the warm-hearted friends I got to know over the years in the
Williamsburg community. In particular, I thank Harry Ambrose, who helped me with my
English study for over three years, Debra Kemelek, who hosted me as an international
student, as well as Eddie and Grace Liu, Walter and Elisabeth Kurth, Libby Von Fange,
Connie and Richard Castor, Erwoom Chiou, and Florence Lee, who consistently showed their
care and love to me and my family. They made my life in Williamsburg unforgettable.

Last, but not least, my deepest appreciation goes to my family. My wife, Shengli, has
been with me since shortly after I arrived in Williamsburg. Her commitment to our family
has made my life full of happiness. No words can fully express my gratitude to her.
Furthermore, I am really blessed to have my son, Caleb, who always reminds me of how
beautiful a life can be! I also give my thanks to my parents, in particular my mother,
Yinghua Wang, who always loves me and supports me with all her heart under any
circumstance. Without all of their love and support, this dissertation would not exist.

List of Tables

2.1 An example to explain how a victim block is selected by the LIRS algorithm and how LIR/HIR statuses are switched. An “X” indicates that the block of the row is referenced at the virtual time of the column. The recency and IRR ... L_lirs = 2 and L_hirs = 1, and at the time 10 the LIRS algorithm leaves two blocks in the LIR set = {A, B}, and the HIR set is {C, D, E}. The only resident HIR block is E.

3.1 Hit ratios of the replacement algorithms OPT, CLOCK-Pro, LIRS, CAR, and CLOCK on workload cpp.

3.2 Hit ratios of the replacement algorithms OPT, CLOCK-Pro, LIRS, CAR, and CLOCK on workload sprite.

3.3 A brief description of the benchmark programs (“Size” is in millions of instructions).

3.4 The performance (number of page faults per million instructions) of algorithms CLOCK-Pro, CAR and CLOCK on program m88ksim with and without the interference of I/O file data accesses.

3.5 The performance (number of page faults per million instructions) of algorithms CLOCK-Pro, CAR and CLOCK on program sor with and without the interference of I/O file data accesses.

4.1 Execution performance and memory related data of the 3 benchmark programs.

5.1 Comparisons of the four measures on locality strengths by comparing their abilities to distinguish locality strengths, the stabilities of the distinctions, and whether on-line measurements are possible.

List of Figures

1.1 The memory system is organized as a hierarchy, giving the user the illusion of a memory that is as large as the largest level of memory and as fast as the first level of cache.

1.2 The CLOCK replacement algorithm. The clock hand moves in the counter-clockwise direction. The reference bit of each page is either set (1) or unset (0).

1.3 CPU utilization is plotted against the number of processes in the system. Though increasing processes in the system could increase CPU utilization, too many processes could over-commit the limited memory and cause thrashing.

1.4 Multi-level buffer cache hierarchy. Caches are distributed along the clients, intermediate servers, and disk array, where accessed blocks can be buffered.

2.1 The LIRS stack S holds LIR blocks as well as HIR blocks with or without resident status, and a list Q holds all the resident HIR blocks.

2.2 Illustration of the reference results in the example shown in Table 2.1 on the LIRS stack. In this figure, (a) corresponds to the state at virtual time 9. Accessing B, E, D, or C at virtual time 10 results in (b), (c), (d) and (e), respectively.

2.3 The time-space map (left) of cs and the hit rate curves by various replacement policies (right).

2.4 The time-space map (left) of glimpse and the hit rate curves by various replacement policies (right).

2.5 The time-space map (left) of postgres and the hit rate curves by various replacement policies (right).

2.6 The time-space map (left) of cpp and the hit rate curves by various replacement policies (right).

2.7 The time-space map (left) of 2-pools and the hit rate curves by various replacement policies (right).

2.8 The time-space map (left) of sprite and the hit rate curves by various replacement policies (right).

2.9 The time-space map (left) of multi1 and the hit rate curves by various replacement policies (right).

2.10 The time-space map (left) of multi2 and the hit rate curves by various replacement policies (right).

2.11 The time-space map (left) of multi3 and the hit rate curves by various replacement policies (right).

2.12 The hit rate curves of workload OpenMail (left figure) and workload Cello99 (right figure).

2.13 The IRRs of references of the workloads postgres (left) and sprite (right).

2.14 The ratios of R_max to cache size in blocks (L) for workloads postgres (left) and sprite (right). R_max is the size of LIRS stack, which changes with virtual time. Cache size is 500.

2.15 The hit rate curves of workload postgres (left figure) and workload sprite (right figure) by varying the ratios of threshold values for LIR/HIR status switching to R_max in LIRS, as well as curves for OPT and LRU.

2.16 The hit rate curves of workload postgres (left figure) and workload sprite (right figure) by varying the size of list Q (L_hirs, the number of cache buffers assigned to the HIR block set) of the LIRS algorithm, as well as curves for OPT and LRU. “LIRS 2” means the size of Q is 2; “LIRS x%” means the size of Q is x% of the cache size in blocks.

2.17 The hit rate curves of workload postgres (left) and workload sprite (right) by varying the LIRS stack size limits, as well as curves for OPT and LRU. Limits are represented by ratios of the LIRS stack size limit in blocks to the cache size in blocks (L).

3.1 There are three types of pages in CLOCK-Pro: hot pages marked as “H”, resident cold pages marked as “C”, and non-resident cold pages marked as shadowed blocks with “C”. Around the clock, there are three hands: HAND_hot, pointing to the list tail (i.e., the last hot page) and searching for a hot page to turn into a cold page; HAND_cold, pointing to the last resident cold page and searching for a cold page to replace out of memory; and HAND_test, pointing to the last cold page in the test period, terminating test periods of cold pages, and removing non-resident cold pages passing the test period out of the list. The attached black dots represent reference bits of 1.

3.2 Hit ratios of the replacement algorithms OPT, CLOCK-Pro, LIRS, CAR, and CLOCK on workloads glimpse and multi2.

3.3 Adaptively changing the percentage of memory allocated to the cold pages in workloads multi2 and sprite.

3.4 Performance of CLOCK, CAR, CLOCK-Pro and OPT on programs with strong locality.

3.5 Performance of CLOCK, CAR, CLOCK-Pro and OPT on programs with moderate locality.

3.6 Performance of CLOCK, CAR, CLOCK-Pro and OPT on programs with weak locality.

4.1 The memory performance of gcc in a dedicated environment.

4.2 The memory performance of gzip in a dedicated environment.

4.3 The memory performance of vortex1 in a dedicated environment.

4.4 The memory performance of gzip (left figure) and vortex3 (right figure) during the interactions.

4.5 The memory performance of gcc (left figure) and vortex3 (right figure) during the interactions.

4.6 The memory performance of vortex1 (left figure) and vortex3 (right figure) during the interactions.

4.7 Dynamic transitions among normal, monitoring, and protection states in the improved kernel system.

4.8 The execution time comparisons (left figure) and comparisons of numbers of page faults (right figure) for the three groups of program interactions in the Linux without TPF and with TPF.

4.9 The memory performance of gzip (left figure) and vortex3 (right figure) during the interactions in the Linux with TPF.

4.10 The memory performance of gcc (left figure) and vortex3 (right figure) during the interactions in the Linux with TPF.

4.11 The memory performance of vortex1 (left figure) and vortex3 (right figure) in the Linux with TPF.

4.12 Comparison of total interaction execution times for the three groups of program interactions in the Linux with TPF, without TPF, and the ideal interaction times.

5.1 Multi-level buffer cache hierarchy. Caches are distributed along the clients, intermediate servers, and disk array, where accessed blocks can be buffered.

5.2 In the two-level unified LRU scheme, there is a unified LRU stack corresponding to the two levels of caches. The size of each individual LRU stack, N1 or N2, is equal to its respective cache size in terms of blocks. There are three types of accesses: (1) a hit in the L1 cache; (2) a hit in the L2 cache; (3) a miss in the two caches. In all three cases, the accessed blocks are moved to the top of the stack. Except in the first case, the block at the bottom of the L1 LRU stack is demoted onto the top of the L2 stack.

5.3 In access stream {R_t, t = 0, 1, 2, ...}, R_i, R_j, and R_l are three immediately consecutive references to block b. The current time is k. With these timing points, there are various measurements that can be used to quantify the locality strength of block b at time k, including the distance from R_k to R_l, called OPT Distance (OD), the distance from R_j to R_k, called Recency Distance (RD), the distance from R_j to R_l, called Current Re-use Distance (CRD), and the distance from R_i to R_j, called Last Re-use Distance (LRD).

5.4 In the LRU stack, for a given block, the position for the last access to the block corresponds to its LRD, its current position in the stack corresponds to its RD, and the position for its next access corresponds to its CRD. Before its current position exceeds its last access position (see left figure (a)), LRD-RD is LRD; after that (see right figure (b)), LRD-RD becomes RD. This allows LRD-RD to more accurately simulate CRD. The illustration also shows that RD and OD change with every reference.

5.5 Reference ratios to each of the segments (the ratios between the number of references to a segment and the number of all references in a workload). It also shows the accumulative reference ratios for the first N segments in each workload, where N is 1 through 10.

5.6 Movement ratio curves showing the ratios between the number of block movements across a segment boundary of the ordered lists and the number of total references for the four measures: OD, RD, CRD, and LRD-RD on various workloads. It shows that there are two groups of curves: OD and RD with high movement ratios, CRD and LRD-RD with low movement ratios.

5.7 An example to show the data structure of ULC for a 3-level hierarchy. The blocks with their recencies less than that of yardstick Y3 are kept in uniLRUstack. The level status (L1, L2 or L3) of a block is determined by its position between two yardsticks where it was accessed last time. Its recency status (R1, R2 or R3) is determined by its position between two yardsticks where it sits currently. To decide which block should be replaced in each level, the blocks in the same level can be viewed to be organized in a separate LRU stack (LRU1, LRU2, or LRU3), and the bottom block is for replacement.

5.8 An example to explain how a requested block is cached in the server cache, and how the allocation scheme adjusts the size of the server cache used by various clients in a multi-client two-level caching structure. Originally in (a) server stack gLRU holds all the L2 blocks from clients 1 and 2, which are also in their LRU2 stacks, respectively. Then block 9 is accessed in client 1. Because block 9 is between yardsticks Y1 and Y2 in its uniLRUstack, it turns into an L2 block and needs to be cached in the server. Because the server cache is full, the bottom block of gLRU, block 14, is replaced, which will be notified to its owner, client 2, through a piggyback on the next retrieved block going to client 2 (delayed notification). After the server buffers re-allocation (b), the size of server cache for client 1 is increased by 1 and that for client 2 is decreased by 1. So the clients and the server cooperate to make the server cache efficiently allocated with the aim of high performance for the entire system.

5.9 Hit ratios in each of the three levels, demotion rates at each of two boundaries (between L1 and L2, and between L2 and L3 cache), and average access time for each workload with the multi-level caching schemes indLRU, uniLRU and ULC.

5.10 The average access times for schemes ULC, uniLRU, MQ and indLRU with various server cache sizes. The client cache size is fixed. It is 256MB for zipf, and 128MB for httpd and dev1.

5.11 The average access times for schemes ULC, uniLRU, MQ and indLRU with various client cache sizes. The server cache size is fixed. It is 200MB for zipf and dev1, and 150MB for httpd.
5.12 The average access times

for schemes ULC,

161

uniLRU, MQ and indLRU with

various block transfer times. The client and server cache sizes are fixed, and
are 100MB each for all the workloads.....................................................................

162

5.13 The average access times of multi-client traces httpd, openmail, and db2 with
various server cache sizes. Among them httpd is with 7 clients, openmail is
with 6 clients, and db2 is w ith 8 clients. Each client contains 8MB, 1GB, or
256MB respectively.......................................................................................................

xxii

R eproduced with perm ission of the copyright owner. Further reproduction prohibited without permission.

163

ABSTRACT

As disk performance continues to lag behind that of memory systems and processors,
fully utilizing memory to reduce disk accesses is a highly effective way to improve
overall system performance. Furthermore, to serve the applications running on a computer
in a distributed system, not only the local memory but also the memory on remote servers
must be effectively managed to minimize I/O operations. The critical challenges in
effective memory cache management include: (1) insightfully understanding and quantifying
the locality inherent in memory access requests; (2) effectively utilizing the locality
information in replacement algorithms; (3) intelligently placing and replacing data in the
multi-level caches of a distributed system; and (4) ensuring that the overheads of the
proposed schemes are acceptable.
This dissertation provides solutions and makes unique and novel contributions in
application locality quantification, general replacement algorithms, low-cost replacement
policies, thrashing protection, and multi-level cache management in a distributed system.
First, the dissertation proposes a new method to quantify locality strength and to
accurately identify the data with strong locality. It also provides a new replacement
algorithm, which significantly outperforms existing algorithms. Second, considering the
extremely low cost requirements on replacement policies in virtual memory management, the
dissertation proposes a policy that meets these requirements and considerably exceeds the
performance of existing policies. Third, the dissertation provides an effective scheme to
protect the system from thrashing when running memory-intensive applications. Finally, the
dissertation provides a multi-level block placement and replacement protocol for a
distributed client-server environment, exploiting non-uniform locality strengths in the
I/O access requests.
The methodology used in this study includes careful application behavior characterization,
system requirement analysis, algorithm design, trace-driven simulation, and system
implementation. A main conclusion of the work is that there is still much room for
innovation and significant performance improvement in the seemingly mature and stable
policies that have been broadly used in current operating system design.


Efficient Caching Algorithms
for Memory Management in Computer Systems


Chapter 1

Introduction
With the ongoing dramatic increase in processor speeds and relatively stable disk speeds,
the disparity between processor speeds and disk access times keeps widening. For
memory-intensive applications and those with frequent file data accesses, execution
times are often dominated by I/O latency. Since disk access times are improving slowly,
these applications are receiving decreasing benefits from the rapid advance of processor
technology, and I/O latency is accounting for an increasing proportion of their execution
times. This technology trend makes memory play an increasingly important role as a cache
for I/O file data and virtual memory swap files. Fully utilizing memory to reduce disk
accesses is therefore an important issue for overall system performance. To serve the
applications running on a computer in a distributed system, not only the local memory but
also the memories distributed on remote servers, and even on other clients, have to be
effectively managed to minimize I/O operations.
In this dissertation, we examine four challenging issues in effective memory management
to reduce I/O accesses: (1) general-purpose memory replacement algorithms; (2) low-cost
virtual memory replacement policies; (3) thrashing prevention for running multiple
memory-intensive programs; and (4) multi-level distributed cache management. Our
dissertation provides solutions to these challenging issues, and uses trace-driven
simulation or implementation techniques to demonstrate their effectiveness in terms of both
performance and cost. Our dissertation demonstrates that innovative methods can
significantly improve the utilization of available memory and reduce I/O accesses by
effectively exploiting the locality in the access requests.

1.1 Memory Hierarchies and Caching

[Figure 1.1 diagram, listed from the top of the hierarchy (smaller, faster, and more
expensive per byte) to the bottom (larger, slower, and cheaper per byte):
  CPU registers: hold words fetched from the L1 cache
  on-chip L1 cache (SRAM): holds cache lines fetched from the L2 cache
  on-chip L2 cache (SRAM): holds cache lines fetched from main memory
  main memory (DRAM): holds blocks (pages) from local/remote disks
  local secondary storage (local disks)
  global secondary storage (distributed systems, web servers)]

Figure 1.1: The memory system is organized as a hierarchy, giving the user the illusion of a
memory that is as large as the largest level of memory and as fast as the first level of cache.

In computer systems, memory is organized as a memory hierarchy. A memory hierarchy
consists of multiple levels of memory with different speeds, sizes, and unit costs (see
Figure 1.1). In the hierarchy, the layers close to processors are two or three levels of
fast and expensive SRAM (Static RAM) cache memory, ranging in size from 128 KB to a few
megabytes. The next layer is main memory made of DRAM (Dynamic RAM), which offers a higher
capacity in the same chip area and costs less, but has a slower access time. Currently the
typical size of main memory is from 128 MB to 1 GB. The layer below main memory, which
is further away from processors, is the mechanical disk. A disk can hold hundreds
of gigabytes, but it is several orders of magnitude slower than memory made of RAM
chips. The goal of organizing memory into a hierarchy is to present the user with a
fast, large, and affordable memory, with its speed close to that of the CPU caches and its
size close to that of the disk.
When a program is running on a processor, it always tries to fetch the data it needs from
the memory closest to the processor. If the data is found there, which is called a hit, the
program continues its execution with the fetched data. However, if the data is not found
there, which is called a miss, the request has to be sent to the next layer of the memory
hierarchy to retrieve the data.
CPU caches are implemented primarily in hardware to match the processor speed. The
speed of the caches, especially the first-level cache, is critical to the processor speed,
because it directly affects the performance of load and store instructions. Because the
hit time of a hardware cache is so critical, its design is severely constrained: only
very simple and low-cost operations are allowed, so that most of them can be wired into the
hardware. For this purpose, direct-mapped or set-associative mapping is used to minimize
the addressing cost in both time and extra parts. The associativity of a set-associative
cache is typically from 2 to 16. Further increasing the associativity is not worthwhile
because of the diminishing hit rate improvement and the rapidly increasing comparator cost.
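As a rough illustration of why direct-mapped and set-associative lookups are cheap, the following sketch splits a byte address into a tag, a set index, and a block offset using only integer division and remainder. The cache geometry (32 KB, 64-byte lines, 4 ways) and the function name `split_address` are assumptions chosen for the example; they do not come from this dissertation.

```python
def split_address(addr, cache_bytes=32 * 1024, line_bytes=64, ways=4):
    """Split a byte address into (tag, set_index, offset).

    All sizes are hypothetical and are assumed to be powers of two.
    With ways=1 this degenerates to a direct-mapped cache.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    offset = addr % line_bytes                    # byte position within the line
    set_index = (addr // line_bytes) % num_sets   # which set must be searched
    tag = addr // (line_bytes * num_sets)         # compared against the tags in that set
    return tag, set_index, offset


print(split_address(0x12345))           # (9, 13, 5)
print(split_address(0x12345, ways=1))   # (2, 141, 5)
```

Only the `ways` tags stored in the selected set have to be compared with the address tag, which is why the comparator cost grows with associativity.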
The case is different for main memory and disk. On one hand, misses in main memory are
much more expensive than those in CPU caches, so even a small decrease in misses could
considerably reduce execution time. On the other hand, main memory can afford fully
associative mapping and more sophisticated management algorithms to reduce misses.
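The effect of a small change in the miss ratio can be made concrete with the usual average access time formula, hit time plus miss ratio times miss penalty. The numbers below (a 0.1 microsecond memory hit and a 10,000 microsecond disk access) are illustrative assumptions, not measurements from this work.

```python
def average_access_time(hit_time, miss_ratio, miss_penalty):
    """Average access time = hit_time + miss_ratio * miss_penalty (same time unit)."""
    return hit_time + miss_ratio * miss_penalty


# Hypothetical values in microseconds: memory hit 0.1, disk miss penalty 10,000.
before = average_access_time(0.1, 0.010, 10_000.0)   # about 100.1
after = average_access_time(0.1, 0.005, 10_000.0)    # about 50.1
print(before, after)
```

Under these assumed numbers, lowering the miss ratio by only half a percentage point roughly halves the average access time, because the miss penalty dominates the hit time by several orders of magnitude.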
To run programs whose total memory demand is larger than the amount of main
memory available on the machine, virtual memory (VM) is devised to use the main memory
as a cache that contains only the active portions of one or multiple programs. The rest of
the programs is saved in swap areas on disks. Conventionally, a virtual memory block
is called a page, and a virtual memory miss is called a page fault. To support a fully
associative virtual memory system, each program is equipped with a page table to index
its virtual memory space. When a page fault occurs on a memory access, which means that
the virtual page cannot be mapped onto a page resident in main memory, the program
issuing the memory access request has to stop and wait for the page to be retrieved from
the disk, which is much more expensive than a memory hit. Even though the sizes of both
memory and disk have rapidly increased, their speed gap remains largely unchanged.
With the dramatic decrease in memory prices, the amount of memory installed on computers
has significantly increased. However, this does not relieve the pressure on memory
used as a cache of the swap area. Parkinson’s Law [58] states that “Work expands to fill
the time available.” In the virtual memory case, the law reflects the fact that
programs expand to fill the memory available to hold them. The most obvious example
is that Microsoft continues to increase the memory demand of its operating systems and
office software to include more advanced functionality and increased performance. In the
field of scientific computation, many of the computational problems of interest to
scientists and engineers involve data sets that are much larger than physical memory.
Increases in processor power and available memory capacity make it feasible to solve larger
problems, or to solve the same problem at a finer granularity, and the size of the data set
grows with the problem being solved. For example, in the visualization of Computational
Fluid Dynamics (CFD), input data sets today can surpass 100 Gbytes, and are expected to
scale with the ability of supercomputers to generate them. Despite the continuing trend
toward larger memories, it is unlikely that these data sets will fit entirely within main
memory. How to make effective use of main memory as a cache of data on disks to reduce disk
accesses is an ever-present challenge for operating system designers. The goal of caching
is to keep active pages in memory, so that their next references are hits that need no disk
accesses. The effectiveness of caching in main memory depends on how effectively the
active portions of the program address space are identified for storing in memory. This is
also the theme of this dissertation.

1.1.1 Locality and Replacement Algorithms

Caching works because of the existence of program access locality, which states that “most
of the time, a program tends to reference only a few of its pages and the set of pages
being referenced changes slowly [12]”. Locality, which confines accesses to a small portion
of the program address space, is of two types. The first type is temporal locality, which
states that if an item is accessed, it will tend to be accessed again soon. The second type
is spatial locality, which states that if an item is accessed, items whose addresses are
close to it will tend to be accessed soon. The locality information exhibited in memory
accesses provides very useful hints for predicting which set of pages is likely to be used
soon and should be prefetched or kept in memory. Spatial locality is mostly exploited by
large page sizes and prefetching. Prefetching fetches pages into memory in advance, before
there are access requests for them. In contrast, demand paging brings a page into
memory only on a page fault. Temporal locality is used in caching to decide which
pages should be kept in memory and which pages should be evicted from memory to make
room for faulted pages. The algorithm used in this decision process is called a replacement
algorithm.
The study of replacement algorithms has a long history dating back to the 1950s and has
generated numerous papers on algorithms, modeling, implementation, and performance
evaluation. However, the problem is still far from being effectively solved and continues
to draw attention from industry and academia. Meanwhile, changes in program memory access
behaviors and system configurations generate new demands on replacement algorithms.
Traditionally, the metric used to evaluate replacement algorithms is the hit ratio, which
is defined as the ratio of the number of hits to the number of accesses. Under this metric,
the optimal algorithm is the one called OPT [1] or MIN [7, 63], which replaces the page that
will not be used for the longest period of time. It is easy to describe, but unfortunately
it is an off-line, unimplementable algorithm, because it requires future knowledge of the
reference requests. As a result, OPT is used mainly for comparison studies.
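Because OPT needs the whole reference string in advance, it is easy to simulate off-line on a recorded trace. The sketch below is one minimal way to do so, assuming the trace is a Python list of page identifiers; the function name `opt_misses` and the example trace are illustrative and are not taken from the dissertation.

```python
def opt_misses(trace, cache_size):
    """Count the misses of the off-line OPT (MIN) policy on a reference trace."""
    # For each position, precompute when the same page is referenced next.
    never = float("inf")
    next_use = [never] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], never)
        last_seen[trace[i]] = i

    cache = {}   # resident page -> position of its next reference
    misses = 0
    for i, page in enumerate(trace):
        if page not in cache:
            misses += 1
            if len(cache) >= cache_size:
                # Evict the page whose next reference is farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[page] = next_use[i]
    return misses


trace = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
misses = opt_misses(trace, 3)
print(misses, (len(trace) - misses) / len(trace))   # 9 misses, hit ratio 0.55
```

The hit ratio printed at the end follows the definition above: hits divided by accesses.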
There are two types of historical locality information used in general-purpose replacement
algorithms: recency and frequency. The recency of a page refers to the time of its last
reference. Least Recently Used (LRU) [46, 7] is the best-known replacement algorithm.
It assumes that a page not accessed recently will not be accessed in the near future. Thus
it chooses for replacement the page whose last reference is the farthest in the past. LRU
is very successful due to its simplicity, low cost, and good performance in most cases, and
it is widely used in various systems. However, because it considers very limited access
history and makes an assumption that does not hold for certain access patterns, LRU
performance could be unacceptably poor. In contrast, the Least Frequently Used (LFU)
replacement algorithm [20] uses frequency, the number of times a page has been accessed, to
select victim pages. However, LFU is rarely used in practice because of its severe
drawbacks: it has a very high running cost, and it causes cache pollution, in which pages
that accumulated large frequencies in the past but will not be used again are hard to replace.
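A small simulation makes the LRU weakness visible. The sketch below is a minimal LRU model built on Python's `OrderedDict`; the function name `lru_hit_ratio` and the looping trace are assumptions made for illustration.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, cache_size):
    """Simulate LRU on a reference trace and return its hit ratio."""
    cache = OrderedDict()   # least recently used page first, most recent last
    hits = 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)          # page becomes the most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)    # evict the least recently used page
            cache[page] = True
    return hits / len(trace)


loop = list(range(5)) * 10       # repeated loop over 5 pages
print(lru_hit_ratio(loop, 4))    # 0.0: every access misses
print(lru_hit_ratio(loop, 5))    # 0.9: only the first pass misses
```

With a loop that is just one page larger than the cache, LRU always evicts the page that will be needed soonest, so it never hits. This is one of the access patterns for which the LRU assumption does not hold.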
Most recently proposed replacement algorithms take both recency and frequency
into consideration. For example, the Least Recently/Frequently Used (LRFU) algorithm [45]
uses a parameter to dictate how much more weight is given to recent history than to
past history. Other algorithms that consider more history information include LRU-2 [57],
2Q [37], EELRU [67], MQ [82], LIRS [33], and ARC [51]. These algorithms differ in their
hit ratios under different access patterns, their overhead, and their adaptivity to access
pattern changes.
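To give a feel for how a single parameter can trade recency against frequency, the sketch below assumes the exponential weighting function F(x) = (1/2)^(lam * x) commonly associated with LRFU [45], where x is the age of a reference. The text above does not state this formula, so it should be read as an assumption; the function name `crf` and the reference times are hypothetical.

```python
def crf(reference_times, now, lam):
    """Combined recency and frequency value of a block.

    Each past reference at time t contributes (1/2) ** (lam * (now - t)).
    The exponential weighting is an assumed form, not taken from the text.
    """
    return sum(0.5 ** (lam * (now - t)) for t in reference_times)


block_a = [1, 2, 3, 4, 5]   # referenced often, but long ago
block_b = [6]               # referenced once, more recently
now = 10

# lam = 0: every reference counts equally, so the value is a plain frequency count.
print(crf(block_a, now, 0.0), crf(block_b, now, 0.0))   # 5.0 1.0
# lam = 1: old references decay quickly, so the more recent block ranks higher.
print(crf(block_a, now, 1.0) < crf(block_b, now, 1.0))  # True
```

Under this assumption, evicting the block with the smallest value behaves like LFU at one extreme of the parameter and like LRU at the other.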
The key challenge for a high-performance and low-cost replacement algorithm is to
accurately quantify locality strength and make efficient use of the locality information.
The first part of this dissertation provides a solution to meet this challenge.

1.1.2 Replacement Policies for Virtual Memory

We have stated that LRU is the most widely used replacement algorithm. Because virtual
memory (VM) management imposes a very stringent cost requirement on the policy, it is
actually LRU approximations that are used for VM page replacement. VM management requires
that the cost be associated with the number of page faults or bounded by a moderate
constant. An algorithm with its cost proportional to the number of memory references would
be prohibitively expensive: it would cause the user program to trap into the operating
system every few instructions, and the CPU would spend much more time on page replacement
work than on useful work for the user application, even when there are no paging requests.

[Figure 1.2 diagram: resident pages with their reference bits, and a pointer labeled
"Clock Hand".]

Figure 1.2: The CLOCK replacement algorithm. The clock hand moves in the counter-clockwise
direction. The reference bit of each page is either set (1) or unset (0).

There are several low-cost VM replacement algorithms, most of which attempt to simulate
LRU behavior. The FIFO (First-In, First-Out) replacement policy maintains a list of all
pages currently in memory, where the page at the head of the list is the oldest
one, and the page at the tail is the most recently loaded one. On a page fault, the page
at the head is removed for replacement and the faulted page is placed at the list tail.
This simple algorithm does not ensure that actively accessed pages stay in memory.
To take recent access information into account, FIFO evolved into the Second-Chance (SC)
algorithm [70]. In the SC algorithm, a reference bit is associated with each resident
page and is set by hardware on every memory access to the page. When a page moves to the
head of the list, its reference bit is checked. If the bit is set, the page is given a
second chance: its bit is cleared and it moves to the list tail. Otherwise, the page is
replaced. So SC looks for an old page that has not been referenced in the previous clock
interval. One way to implement the algorithm is to maintain the list as a circular queue
called CLOCK. A pointer called the clock hand indicates which page is to be examined next
for replacement (see Figure 1.2). When a free page is needed, the hand advances until it
finds a page with an unset reference bit. This implementation is usually called the CLOCK
algorithm. Experience and experiments have shown that CLOCK effectively simulates LRU and
performs very close to LRU. In a generalized CLOCK version called GCLOCK [69], a counter is
associated with each page rather than a single bit. The counter is incremented when the
page is hit. The circulating clock hand sweeps through the pages, decrementing their
counters, until a page with a count of zero is found for replacement.
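The CLOCK mechanism described above can be sketched in a few lines. The class below is a simplified user-level model, not kernel code: the class name `Clock`, its fields, and the choice to load a new page with an unset reference bit are assumptions made for this illustration, and the direction in which the hand moves is irrelevant to the logic.

```python
class Clock:
    """Minimal model of the CLOCK page replacement policy."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames   # page held by each frame
        self.ref_bit = [0] * num_frames     # one reference bit per frame
        self.where = {}                     # page -> frame index
        self.hand = 0                       # the clock hand

    def access(self, page):
        """Reference a page. Return True on a hit, False on a page fault."""
        if page in self.where:
            # In a real system the hardware sets this bit on every access.
            self.ref_bit[self.where[page]] = 1
            return True
        # Page fault: advance the hand, clearing set bits (the second chance),
        # until a frame with an unset bit is found.
        while self.ref_bit[self.hand] == 1:
            self.ref_bit[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.frames)
        victim = self.frames[self.hand]
        if victim is not None:
            del self.where[victim]
        self.frames[self.hand] = page
        self.where[page] = self.hand
        self.ref_bit[self.hand] = 0         # assumption: new pages start unset
        self.hand = (self.hand + 1) % len(self.frames)
        return False


clock = Clock(3)
print([clock.access(p) for p in [1, 2, 3, 1, 4]])   # [False, False, False, True, False]
print(sorted(clock.where))                           # [1, 3, 4]
```

In this run, page 1 is referenced again before the fault on page 4, so the hand clears its bit and skips it, and page 2 is replaced instead. The cost per access is a single bit write, and the sweep happens only on page faults, which is what makes the scheme affordable for VM.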
The CLOCK algorithm and its variants have dominated VM replacement
for more than three decades. Though their performance is satisfactory in general, they
inherit the performance drawbacks of LRU and seriously under-perform for some commonly
observed access patterns. On one hand, there are many general-purpose replacement
algorithms that improve on LRU performance. On the other hand, due to the extremely low
cost requirement of VM management, the performance advantages of these algorithms are
difficult to transfer to VM. So the challenge is to design a VM replacement algorithm
that has a cost comparable to that of CLOCK and overcomes the performance disadvantages of
LRU and CLOCK. The second part of the dissertation addresses this challenge.

1.1.3 Global Replacement in Multiprogramming Environments

In a multiprogramming environment, where multiple processes compete for page frames,
page replacement algorithms can be classified into two broad categories: local replacement
and global replacement. Local replacement requires that each process select a victim only
from its own set of allocated page frames to satisfy its page fault. Global replacement
allows a process to select a page frame belonging to any process for replacement and load
its faulted page into the frame¹. So one process can take a frame from another, and memory
allocation can be redistributed according to the competition among processes.
Local replacement uses a memory allocation scheme to assign an allocation to each process.
The assignment can be based on an estimate of the need of each process. However, a
static allocation cannot capture the dynamically changing memory demand of each program
[38]. As a result, memory space is not well utilized. If we dynamically adjust the allocation
to the current demand of each individual process, local replacement essentially evolves
into a global one. Researchers and system practitioners seem to have agreed that a local
policy is not an effective solution for virtual memory management, and it is rarely used
nowadays. Global replacement automatically adapts memory allocation to the memory demands
of processes through their page replacement interaction. This makes memory better utilized
under global replacement than under local replacement.
[Figure 1.3 plot: CPU utilization versus the number of processes in the system
simultaneously, with a region labeled "Thrashing".]

Figure 1.3: CPU utilization is plotted against the number of processes in the system. Though
increasing the number of processes in the system could increase CPU utilization, too many
processes could overcommit the limited memory and cause thrashing.

¹ Actually, in practice the page frame used for the current page fault may not be the frame
just replaced. Normally, operating systems do not wait until the free pages run out before
starting to search for pages to free. Instead, they set a threshold for the minimal number of
available free pages. Once the threshold is reached, they start to search for proper
replacement candidates to fill up the pool of free pages, so that free pages are ready for use
whenever needed.


One problem with global replacement algorithms is thrashing among multiple processes.
A primary objective of memory management is to maximize the effectiveness of main memory
in meeting the overall goals of sharing, throughput, and responsiveness. For this purpose,
we need to maintain a proper number of processes active simultaneously in memory.
If there are not enough active processes, main memory is underutilized, and the possibility
of all processes being blocked, leaving the CPU idle, is increased. If there is an excess of
active processes in memory, main memory will be overcommitted and an excessive number of
page faults will take place, which also leaves the CPU idle. This is called thrashing (see
Figure 1.3).
Now let us take a brief look at how thrashing develops among multiple processes.
The set of recently used, active pages of a process is called its working set [24], which is
used to estimate the current memory demand of a running program in the system. Suppose
a process enters a new phase in its execution and needs more page frames. It starts faulting
and could take pages away from other processes under a global page replacement policy.
Because of the memory overcommitment, the replaced pages may belong to the working sets of
those processes. Since these processes need these pages, they also fault, taking pages from
other processes, which escalates the problem further. The situation can worsen until the
system ends up spending most of its time handling page faults, and the processes can make
little progress. Thus thrashing occurs.
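The working set notion can be sketched as a sliding window over a per-process reference trace. The window-based definition, the helper names `working_set` and `is_overcommitted`, and the sample traces below are assumptions for illustration; the text above only characterizes the working set informally as the recently used, active pages.

```python
def working_set(trace, t, window):
    """Distinct pages referenced in the last `window` references up to position t."""
    start = max(0, t - window + 1)
    return set(trace[start:t + 1])


def is_overcommitted(traces, t, window, num_frames):
    """True if the summed working-set sizes of all processes exceed memory."""
    demand = sum(len(working_set(trace, t, window)) for trace in traces)
    return demand > num_frames


trace = [1, 2, 1, 3, 4, 4, 4, 5]
print(working_set(trace, 3, 4))   # {1, 2, 3}
print(working_set(trace, 7, 4))   # {4, 5}

# Two hypothetical processes that each touch 4 distinct pages.
procs = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(is_overcommitted(procs, 3, 4, 6))   # True: demand of 8 pages, 6 frames
print(is_overcommitted(procs, 3, 4, 8))   # False
```

When the summed demand exceeds the available frames, some process must lose pages from its working set, which is the starting point of the fault escalation described above.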
This problem may be addressed by reducing the number of active processes, thus
controlling the system load. This is called load control. However, by abruptly suspending
or even killing active processes, this brute-force mechanism could introduce unnecessary
working set reloading overhead, excessively reduce the number of active processes, and
reduce user interactivity. Actually, thrashing is directly related to the global replacement
policy. Without having to resort to local replacement policies, a global replacement policy
that can adapt its replacement behavior to the current CPU utilization would be a promising
way to overcome the aforementioned difficulties, and to alleviate or even solve the
thrashing problem. The third part of the dissertation develops techniques to address the
thrashing problem.

1.1.4 Placement and Replacement in Distributed File Buffer Caches

[Figure 1.4 diagram. Labels: High Level Caches, Low Level Caches, Client, Front-Tier Server,
End-Tier Server, Network, Client, Disk Array.]

Figure 1.4: Multi-level buffer cache hierarchy. Caches are distributed along the clients, intermediate
servers, and disk array, where accessed blocks can be buffered.

When a user requests a remote data item in a client-server distributed environment, the
retrieved data is cached in the client file buffer cache. It could also be cached in
intermediate server buffer caches and disk built-in caches, which together form a
multi-level buffer cache hierarchy (see Figure 1.4). For example, disk arrays use a
significant amount of cache RAM as a data buffer, attempting to serve as much re-accessed
data as possible from the cache without accessing disks. As an example, the EMC 8830 disk
array supports up to 64 GBytes of cache for this purpose. We might naively expect that a
large amount of cache memory invested along the data retrieval path in a distributed system
would automatically yield a steady performance increase. However, in the distributed
situation, the issue can be more complicated than we thought because of the existing buffer
cache management.
Unlike the processor cache hierarchy, where multi-level inclusivity [3] among the L1, L2,
..., Ln caches can be accepted as a principle to simplify the cache coherence protocol and
the cache behaviors at different levels are well coordinated, the multi-level caches here
are much more loosely connected. The placement of cached file blocks in the hierarchy and
the replacement at each level of cache are determined by local policies independently of
each other. Any block requested by a client is cached by the intermediate caches as it
passes through them on its way to the client. This causes accessed blocks to be redundantly
cached and leaves the caches under-utilized. Only block misses from the high level caches,
which are close to clients, appear at the low level caches, which are far away from clients.
This weakens the locality on which the replacement algorithm depends for its replacement
decisions, and makes the hit ratios at the low level caches deteriorate significantly.
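Both effects, redundant caching and weakened locality at the lower level, show up even in a toy two-level model in which each level runs LRU independently and the server sees only client misses. The class `LRUCache`, the function `two_level`, and the traces below are hypothetical names and inputs for this sketch; they are not the schemes evaluated later in the dissertation.

```python
from collections import OrderedDict


class LRUCache:
    """A small independent LRU cache."""

    def __init__(self, size):
        self.size = size
        self.blocks = OrderedDict()

    def access(self, block):
        hit = block in self.blocks
        if hit:
            self.blocks.move_to_end(block)
        else:
            if len(self.blocks) >= self.size:
                self.blocks.popitem(last=False)   # evict the LRU block
            self.blocks[block] = True
        return hit


def two_level(trace, client_size, server_size):
    """Return (client_hits, server_hits, server_requests, duplicated_blocks)."""
    client, server = LRUCache(client_size), LRUCache(server_size)
    client_hits = server_hits = server_requests = 0
    for block in trace:
        if client.access(block):
            client_hits += 1
            continue
        server_requests += 1            # only client misses reach the server
        if server.access(block):
            server_hits += 1
    duplicated = len(set(client.blocks) & set(server.blocks))
    return client_hits, server_hits, server_requests, duplicated


# Six blocks accessed in a loop: two 4-block caches could hold all of them.
print(two_level(list(range(6)) * 5, 4, 4))                          # (0, 0, 30, 4)
# Each block is accessed three times in a row, then never again.
print(two_level([b for b in range(10) for _ in range(3)], 2, 2))    # (20, 0, 10, 2)
```

In the first run the two caches together have room for eight blocks, yet they end up holding the same four blocks and neither level ever hits. In the second run the client absorbs all the re-references, so the request stream that reaches the server carries no reuse at all.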
There are several possible approaches to attack these problems. One approach is to make
the replacement algorithms at each level coordinate with each other and allow a block to be
cached in at most one place. Its potential problem is that it could incur an excessive
amount of communication overhead on the network for the coordination. Another approach is
to re-design local replacement algorithms to improve their hit ratios even with weakened
locality information. This is certainly inadequate from the point of view of the whole
system, because each cache still makes replacement decisions independently and the
redundant block caching problem is not solved. Without coordination among the caches, the
performance potential is greatly limited. In the fourth part of the dissertation, we
address the aforementioned problems in a distributed environment.

1.2 Contributions

The contributions of the dissertation in cache management algorithms are fourfold: a
general-purpose replacement algorithm [33], a low-cost replacement policy for virtual
memory, thrashing prevention in multiprogramming environments [34], and file block
placement and replacement in multi-level buffer caches [36]. They are outlined as follows:

• This dissertation proposes an efficient general-purpose replacement algorithm, called
Low Inter-reference Recency Set (LIRS). We designed the algorithm based on a locality
quantification metric called Inter-Reference Recency (IRR), also known as re-use distance
in previous studies in fields such as compilers [28] and CPU caches [64]. It describes
the time between two consecutive references to a block. In trace-driven simulations,
we compared the hit ratios of LIRS with those of LRU, LRU-2, 2Q, LRFU, EELRU, ARC,
and UBM. Without tuning sensitive parameters or assuming specific properties of
access patterns, LIRS outperforms all the other replacement algorithms across a large
number of real-life and synthetic traces with different memory sizes. In many cases,
its hit ratios are very close to the optimal ones. The data structure and operations
of LIRS are very simple but effective. Its running cost is as low as that of LRU.
Its unique performance and cost advantages have made LIRS very attractive to the
industry [73, 51].

• Inspired by the general-purpose LIRS replacement algorithm and by the pressing need
for a new virtual memory page replacement policy that improves on the performance of the
dominant CLOCK policy, this dissertation proposes an enhanced CLOCK replacement
policy called CLOCK-Pro. By additionally keeping track of a limited number of replaced
pages, CLOCK-Pro works in a fashion similar to CLOCK, at a cost affordable to VM.
Meanwhile, it brings the much-needed performance advantages of LIRS to CLOCK.
CLOCK-Pro even eliminates the only tunable parameter in LIRS and adapts itself to
changing access locality to serve a broad spectrum of workloads. For access patterns
where CLOCK is able to achieve high hit ratios, CLOCK-Pro behaves much like CLOCK.
For access patterns such as memory scans and large-scale loop accesses, where CLOCK
performs unacceptably poorly, CLOCK-Pro significantly reduces page faults and thus makes
the system more robust to various memory access behaviors. We also compared CLOCK-Pro
with other recently proposed VM page replacement policies, such as CAR [6], and show
that CLOCK-Pro consistently outperforms CAR.

• To deal with thrashing in multiprogramming environments, this dissertation provides
a scheme called Thrashing Protection Facility (TPF), which protects the system from
thrashing once thrashing is detected. The scheme deals with thrashing by adaptively
adjusting global page replacement policies. The adjustments are based on a careful
analysis of the correlation between global page replacement behaviors and CPU
utilization. Our implementation in Linux kernels shows that the scheme can reduce
program execution times by up to 67% when there is thrashing.

• In the area of multi-level buffer cache management, this dissertation proposes a
client-directed, coordinated file block placement and replacement protocol called Unified
Level-aware Caching (ULC), in which the strengths of locality are dynamically quantified
at the client level, where full locality information is available. The quantification
results are used to direct servers in placing or replacing file blocks at different levels
of the buffer caches, so that the locality of block accesses dynamically matches the
caching layout of the blocks in the hierarchy. The effectiveness of our proposed protocol
comes from achieving the following three goals: (1) The multi-level cache retains
the same hit rate as that of a single-level cache whose size equals the aggregate
size of the multi-level caches. (2) The non-uniform locality strengths of blocks are fully
exploited and ranked to fit into the physical multi-level caches. (3) The communication
overheads between caches are also reduced. Our trace-driven simulation results
show that ULC significantly and consistently outperforms existing multi-level caching
schemes.

In this long-term, comprehensive study of caching algorithms in the above four situations,
the dissertation demonstrates that there is still much room for innovation and
significant performance improvement in seemingly mature and stable policies that are broadly
used in system design, such as LRU replacement and load control. The algorithms proposed
and evaluated in the dissertation are valuable in making systems more capable of
handling large-scale, more complicated applications running on variously configured systems.

1.3 Organization

Chapter 2 describes our study of general-purpose cache replacement algorithms. Chapter 3
continues the replacement work and customizes the proposed replacement algorithm for
virtual memory management as an extremely low-cost policy. Chapter 4 discusses an
experimental study of thrashing prevention in multiprogramming environments by adaptively
adjusting global page replacement. Chapter 5 describes our study of the management of
distributed, multi-level buffer caches through an effective block placement and replacement
protocol. Chapter 6 provides the conclusions and future work of the dissertation.


Chapter 2

General-Purpose Replacement Algorithms

Replacement algorithms play important roles in buffer cache management, and their
effectiveness and efficiency are crucial to the performance of file systems, databases, and other
data management systems. In this chapter, we review previous work on improving the
performance of replacement algorithms and introduce the design of a novel replacement
algorithm.

2.1 Background

2.1.1 The Problems of the LRU Replacement Algorithm

The Least Recently Used (LRU) replacement algorithm is widely used to manage buffer caches due to
its simplicity, but many anomalous behaviors have been found with some typical workloads,
where the hit rates of LRU may increase only slightly with a significant increase in cache
size. These observations reflect LRU’s inability to cope with access patterns with weak locality,
such as file scanning, regular accesses over more blocks than the cache size, and accesses
on blocks with distinct frequencies. Here are some representative examples reported in the
research literature that illustrate how poorly LRU can behave.

1. Under the LRU policy, a burst of references to infrequently used blocks, such as
“sequential scans” through a large file, may cause the replacement of commonly referenced
blocks in the cache. This is a common complaint in many commercial systems:
sequential scans can cause interactive response time to deteriorate noticeably [57]. A
wise replacement policy should prevent “hot” blocks from being evicted by “cold”
blocks.

2. For a cyclic (loop-like) pattern of accesses to a file that is only slightly larger than
the cache size, LRU always mistakenly evicts the blocks that will be accessed soonest,
because these blocks have not been accessed for the longest time [67]. A wise
replacement policy should maintain a miss rate close to the buffer space shortage.

3. In an example of a multi-user database application [57], each record is associated with
a B-tree index. There are 20,000 records. The index entries can be packed into 100
blocks, and 10,000 blocks are needed to hold the records. We use R(i) to represent an
access to Record i, and I(i) an access to Index i. The access pattern of the database application
alternates references to random index blocks and record blocks: I(1), R(1), I(2),
R(2), I(3), R(3), ... . Thus, index blocks will be referenced with a probability of 0.005,
and data blocks with a probability of 0.00005. However, LRU will keep an equal
number of index and record blocks in the cache, and perhaps even more record blocks
than index blocks. A wise replacement policy should select the resident blocks according
to the reference probabilities of the blocks. Only those blocks with relatively high
probabilities deserve to stay in the cache for a long period of time.
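
The looping behavior in Example 2 can be reproduced with a few lines of simulation. The following is a minimal sketch, not code from the cited studies: the function name lru_hits and the workload (10 passes over a 101-block file) are assumptions chosen for illustration.

```python
from collections import OrderedDict


def lru_hits(trace, cache_size):
    """Count hits for an LRU cache of cache_size blocks over a reference trace."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # block becomes most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits


# Hypothetical workload: 10 passes over a 101-block file.
loop_trace = [b for _ in range(10) for b in range(101)]
loop_hits = lru_hits(loop_trace, cache_size=100)     # cache one block too small
fitting_hits = lru_hits(loop_trace, cache_size=101)  # whole file fits
```

With a cache only one block smaller than the loop, every reference misses, whereas a policy that kept any fixed 100 blocks resident would miss only about once per pass after warm-up.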

The reason LRU behaves poorly in these situations is that it makes a bold assumption:
the block that has not been accessed for the longest time will wait the longest
to be referenced again. This assumption cannot capture the access patterns
exhibited in these workloads with weak locality. Generally speaking, there is less locality
in buffer caches than in CPU caches or virtual memory systems [65].
However, LRU has its distinctive merits: simplicity and adaptability. It samples
and makes use of only very limited information, namely recency. In addressing the weakness
of LRU, existing policies either take more history information into consideration, such
as the LFU (Least Frequently Used)-like ones, at the cost of simplicity and adaptability, or
switch temporarily from LRU to other policies whenever regularities are detected. In the
switch-based approach, these policies actually act as supplements to LRU in a case-by-case
fashion. To make a prediction, these policies assume a relationship between
the future reference of a block and the behaviors of the blocks in its temporal or spatial
locality scope, while LRU only associates the future behavior of a block with its own reference
history. This additional assumption increases the complexity of implementations, as well
as the dependence of their performance on the specific characteristics of workloads. My LIRS
algorithm samples and makes use of only the same history information as LRU does, namely recency,
and mostly retains the simple assumption of LRU. Thus it is simple and adaptive. In our design,
LIRS is not directly targeted at specific LRU problems but fundamentally addresses the
limitations of LRU.

2.1.2 An Executive Summary of My Algorithm

We use the recent Inter-Reference Recency (IRR) as the recorded history information of each
block, where the IRR of a block refers to the number of other distinct blocks accessed between
two consecutive references to the block. In contrast, the recency refers to the number of
other distinct blocks accessed from the last reference to the current time. We call the IRR between
the last and the penultimate (second-to-last) references of a block the recent IRR, and simply call
it IRR without ambiguity in the rest of the dissertation. We assume that if the IRR of a block is
large, the next IRR of the block is likely to be large again. Following this assumption, we
select the blocks with large IRRs for replacement, because under our assumption these blocks are
highly likely to be evicted later by LRU before being referenced again. Note that
these evicted blocks may also have been recently accessed, i.e., each has a small recency.
Definitions similar to IRR for measuring data access locality have appeared in the literature
as early as 1970. Mattson et al. [46] define “stack distance” by measuring the number of
distinct virtual memory pages accessed between two consecutive accesses of the same page
in a stack. Recently, this concept has been generalized as “reuse distance” [77], referring to
the number of distinct data elements accessed between two consecutive uses of the same
data element.
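
The two definitions above can be made concrete with a short computation. This is a minimal sketch under stated assumptions: the function name recency_and_irr and the seven-reference trace are hypothetical, and the whole trace is scanned explicitly rather than maintained incrementally.

```python
import math


def recency_and_irr(trace):
    """Return {block: (recency, recent IRR)} after the whole trace has been seen."""
    positions = {}
    for time, block in enumerate(trace):
        positions.setdefault(block, []).append(time)

    result = {}
    for block, times in positions.items():
        last = times[-1]
        # Recency: distinct other blocks referenced after the last reference.
        recency = len(set(trace[last + 1:]))
        if len(times) >= 2:
            # Recent IRR: distinct other blocks between the last two references.
            irr = len(set(trace[times[-2] + 1:last]))
        else:
            irr = math.inf  # referenced only once so far
        result[block] = (recency, irr)
    return result


# Hypothetical trace: a b c b a b d
metrics = recency_and_irr(list("abcbabd"))
```

In this trace, block "c" has an infinite IRR because it has been referenced only once, while block "b" has both a small recency and a small IRR.
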
In comparison with LRU, by adequately considering IRR in the history information, our
policy is able to eliminate the negative effects caused by considering only recency, such
as the problems presented in the three examples above. When deciding which block to
evict, our policy utilizes the IRR information of blocks. It dynamically and responsively
distinguishes low IRR (denoted as LIR) blocks from high IRR (denoted as HIR) blocks, and
keeps the LIR blocks in the cache, where the recency of blocks is only used to help determine
the LIR or HIR statuses of blocks. We maintain an LIR block set and an HIR block set and
limit the size of the LIR set so that all the LIR blocks in the set can fit in the
cache. The blocks in the LIR set are not chosen for replacement, and there are no misses
on references to these blocks. Only a very small portion of the cache is assigned to store HIR
blocks. Resident HIR blocks may be evicted at any recency. However, when the recency
of an LIR block increases to a certain point, and an HIR block gets accessed at a smaller
recency than that of the LIR block, the statuses of the two blocks are switched. We name the
proposed policy “Low Inter-reference Recency Set” (denoted as LIRS) replacement, because
the LIR set is what the algorithm tries to identify and keep in the cache. The LIRS policy aims
at addressing three issues in designing replacement policies: (1) how to effectively utilize
multiple sources of access information; (2) how to dynamically and responsively distinguish
blocks by comparing their possibility of being referenced in the near future; and (3) how to
minimize implementation overheads.

2.2 Related Work

LRU replacement is widely used for the management of virtual memory, file caches, and data
buffers in databases. The three typical problems described in the previous section are found
in different application fields. Much effort has been made to address the problems of
LRU. We classify existing schemes into three categories: (1) replacement schemes based on
user-level hints; (2) replacement schemes based on tracing and utilizing history information
of block accesses; and (3) replacement schemes based on regularity detection.

2.2.1 User-level Hints

Application-controlled file caching [11] and application-informed prefetching and caching
[59] are schemes based on user-level hints. These schemes identify blocks with a low
possibility of being re-accessed in the future based on hints provided by users. To
provide appropriate hints, users need to understand the data access patterns, which adds to
the programming burden. In [53], Mowry et al. attempt to abstract hints from compilers to
facilitate I/O prefetching. Although their methods are orthogonal to our LIRS replacement,
the collected hints may help us to ensure the existence of the correlation of consecutive
IRRs. However, in most cases, the LIRS algorithm can adapt its behavior to different
access patterns without explicit hints.

2.2.2 Tracing and Utilizing History Information of a Block

Realizing that LRU only utilizes limited access information, researchers have proposed
several schemes to collect and use “deeper” history information. Examples are LFU-like
algorithms such as FBR and LRFU, as well as LRU-K and 2Q. We take a similar direction by
effectively collecting and utilizing access information to design the LIRS replacement.
Robinson and Devarakonda propose a frequency-based replacement algorithm (FBR)
that maintains reference counts in order to “factor out” locality [65]. However, it
is slow to react to changes in reference popularity, and some parameters have to be found
by trial and error. Having analyzed the advantages and disadvantages of LRU and LFU,
Lee et al. combine them by weighing the recency factor and the frequency factor of a block [45].
The performance of the LRFU scheme largely depends on a parameter called λ, which
decides the weight of LRU or LFU, and which has to be adjusted according to different
system configurations, and even according to different workloads. In contrast, LIRS does not have
a tunable parameter that is sensitive to the target workload.
The LRU-K scheme [57] addresses the LRU problems presented in Examples 1 and
3 in the previous section. LRU-K makes its replacement decision based on the time of
the Kth-to-last reference to each block: the resident block whose Kth-to-last reference
is the oldest is evicted. For simplicity, the authors recommended K = 2. By taking the time of the
penultimate reference to a block as the basis for comparison, LRU-2 can quickly remove
cold blocks from the cache. However, for blocks without significant differences in reference
frequencies, LRU-2 does not work well. In addition, LRU-2 is expensive: each block access
requires log(N) operations to manipulate a priority queue, where N is the number of blocks
in the cache.
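
The LRU-2 eviction rule can be sketched as follows. This is a simplified illustration rather than the algorithm of [57]: the names lru2_victim and lru2_hits and the trace are hypothetical, a linear scan replaces the priority queue mentioned above, and a block referenced only once is assumed to have an infinitely old penultimate reference.

```python
def lru2_victim(history, resident):
    """Return the resident block whose penultimate reference is the oldest."""
    def key(block):
        times = history[block]
        # A block referenced only once has no penultimate reference; treat it as
        # infinitely old, and break ties by the last reference time (plain LRU).
        penultimate = times[-2] if len(times) >= 2 else float("-inf")
        return (penultimate, times[-1])
    return min(resident, key=key)


def lru2_hits(trace, cache_size):
    """Count hits of a simplified LRU-2 cache over a reference trace."""
    history = {}       # block -> reference times, kept even after eviction
    resident = set()
    hits = 0
    for time, block in enumerate(trace):
        if block in resident:
            hits += 1
        else:
            if len(resident) >= cache_size:
                resident.remove(lru2_victim(history, resident))
            resident.add(block)
        history.setdefault(block, []).append(time)
    return hits


# Hypothetical trace: hot block "h" interleaved with a scan of cold blocks c1..c4.
scan_trace = ["h", "h", "c1", "c2", "h", "c3", "c4", "h"]
lru2_scan_hits = lru2_hits(scan_trace, cache_size=2)
```

With a two-block cache, the hot block survives the scan because each cold block has no penultimate reference and is therefore evicted first.
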
Johnson and Shasha propose the 2Q scheme, which has a constant-time overhead [37].
The authors showed that the scheme performs as well as LRU-2. The 2Q scheme can quickly
remove sequentially-referenced blocks and loopingly-referenced blocks with long periods
from the cache. This is done by using a special buffer, called the A1in queue, in which all
missed blocks are initially placed. When blocks are replaced from the A1in queue in
FIFO order within a short period of time, the addresses of those replaced blocks are temporarily
placed in a ghost buffer called the A1out queue. When a block is re-referenced, if its address is
in the A1out queue, it is promoted to a main buffer called Am. That is, only blocks that have a
short re-use distance, as measured by the A1in and A1out queues, can be cached for a
long period of time in Am. In this way, 2Q is able to distinguish frequently referenced
blocks from infrequently referenced ones. With the sizes of the A1in and A1out queues set
to the constants Kin and Kout, respectively, 2Q provides a victim block either from A1in or
from Am. However, Kin and Kout are pre-determined parameters in the 2Q scheme, which
need to be carefully tuned and are sensitive to the types of workloads. Although both the
2Q and the LIRS algorithms have simple implementations with low overheads, LIRS
overcomes the drawbacks of 2Q by properly updating the LIR block set.
Another recent algorithm, ARC, maintains two variable-sized lists [51]. Their combined size is
twice the number of pages that are held in the cache: one half of the list entries hold the blocks
in the cache, and the other half hold the access history of replaced blocks. The
first list contains blocks that have been seen only once recently and the second list contains
blocks that have been seen at least twice recently. The cache space allocated to the blocks
in these two lists is adaptively changed, depending on the list in which recent misses happen.
More cache space will serve cold blocks (resp. hot blocks) if there are more cold block
(resp. hot block) accesses. However, though the authors advocate the superiority of the
ARC algorithm for its adaptiveness and its lack of tunable parameters, the locality of blocks
in the two lists, quantified by recency or frequency, cannot be directly and consistently
compared. For example, a block that is regularly accessed with an IRR a little larger
than the cache size may have no hits at all, while a block in the second list can stay in the cache
without any accesses after it has been accepted into the list.
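
The interplay of the three 2Q queues described above can be sketched as follows. This is a simplified illustration, not the implementation of [37]: the class name TwoQ, its method names, and the parameter values in the usage lines are assumptions, and membership tests on the queues are linear-time for brevity.

```python
from collections import OrderedDict, deque


class TwoQ:
    """Simplified 2Q: A1in (FIFO), A1out (ghost FIFO of addresses), Am (LRU)."""

    def __init__(self, cache_size, kin, kout):
        self.cache_size, self.kin, self.kout = cache_size, kin, kout
        self.a1in = deque()       # resident blocks seen once, FIFO order
        self.a1out = deque()      # addresses only, no data
        self.am = OrderedDict()   # resident blocks, least recently used first

    def _reclaim(self):
        """Free one slot if the cache is full."""
        if len(self.a1in) + len(self.am) < self.cache_size:
            return
        if len(self.a1in) > self.kin or not self.am:
            victim = self.a1in.popleft()
            self.a1out.append(victim)          # remember its address
            if len(self.a1out) > self.kout:
                self.a1out.popleft()
        else:
            self.am.popitem(last=False)        # evict the LRU block of Am

    def access(self, block):
        """Reference a block; return True on a hit."""
        if block in self.am:
            self.am.move_to_end(block)
            return True
        if block in self.a1in:
            return True                        # stays in place in the FIFO
        if block in self.a1out:
            self.a1out.remove(block)
            self._reclaim()
            self.am[block] = True              # short re-use distance: promote
        else:
            self._reclaim()
            self.a1in.append(block)
        return False


# Hypothetical run: "x" is re-referenced soon after leaving A1in, then survives a scan.
q = TwoQ(cache_size=4, kin=1, kout=4)
refs = ["x", "s1", "s2", "s3", "s4", "x", "x", "s5", "s6", "s7", "s8", "x"]
results = [q.access(b) for b in refs]
```

The outcome depends on the chosen Kin and Kout: if Kout were too small, the address of "x" would already have left A1out before its second reference and the block would not be promoted, which illustrates the tuning sensitivity noted above.
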
The Inter-Reference Gap (IRG) of a block is the number of references between consecutive
references to the block; it differs from IRR in that duplicate references
to a block are counted. Phalke and Gopinath considered the correlation between history
IRGs and the future IRG [62]. The past IRG string for each block is modeled by a Markov chain
to predict the next IRG. However, as Smaragdakis et al. indicate, replacement algorithms
based on Markov models fail in practice because they try to solve a much harder problem
than the replacement problem itself [67]. An apparent difference between their scheme and our
LIRS algorithm is in how the distance between two consecutive references to a
block is measured. My study shows that IRR is more justifiable than IRG in this circumstance. First,
IRR only counts distinct blocks and filters out high-frequency events, which may be
volatile over time. Thus the IRR is more relevant to the next IRR than the IRG is to the
next IRG. Moreover, it is the “recency” but not the “gap” information that is used by LRU.
An elaborate argument favoring IRR in the context of virtual memory page replacement
can be found in [67]. Second, IRR can be easily dealt with under the LRU stack model
[20], on which most popular replacement algorithms are based.
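
The difference between the two measures shows up as soon as a hot block is referenced repeatedly between two references to another block. The sketch below is illustrative only; the function name irg_and_irr and the trace are hypothetical.

```python
def irg_and_irr(trace, block):
    """Return (IRG, IRR) measured between the last two references to block."""
    times = [t for t, b in enumerate(trace) if b == block]
    if len(times) < 2:
        raise ValueError("block needs at least two references")
    between = trace[times[-2] + 1:times[-1]]
    # IRG counts every intervening reference; IRR counts distinct blocks only.
    return len(between), len(set(between))


# Hypothetical trace: "h" is referenced seven times between the two references to "x".
gap, irr = irg_and_irr(list("xhhhhahhhx"), "x")
```

Here the repeated references to "h" inflate the IRG to 8, while the IRR stays at 2 because only two distinct blocks were touched.
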

2.2.3 Detection and Adaptation of Access Regularities

More recently, researchers took another approach: detecting access regularities from the
history information by relating the accessing behavior of a block to those of the blocks in
its temporal or spatial locality scope. Then different replacement policies, such as MRU, can be
applied to blocks with specific access regularities.
Glass and Cao propose an adaptive replacement policy, SEQ, for page replacement in virtual
memory management [30]. It detects sequential address reference patterns. If long sequences of
page faults are found, MRU is applied to such sequences. If no sequences are detected,
SEQ performs LRU replacement. Smaragdakis et al. argued that address-based detection
lacks generality, and advocated using aggregate recency information to characterize page
behaviors [67]. Their EELRU examines aggregate recency distributions of referenced pages
and changes the page eviction points using an on-line cost/benefit analysis, assuming a
correlation among temporally contiguously referenced pages, unlike LRU, which
always sets the eviction point at the bottom of the LRU stack. However, EELRU has to choose
an eviction point from a pre-determined set of LRU stack positions, and how the
set is selected could affect its performance. Moreover, because it relies on aggregate analysis,
EELRU cannot quickly respond to changing access patterns. Without spatial or temporal detection,
our LIRS uses independent recency events of each block to effectively characterize their references.
Choi et al. propose a new adaptive buffer management scheme called DEAR that
automatically detects the block reference patterns of applications and applies different
replacement policies to different applications based on the detected reference patterns [19].
Further, they propose an Application/File-level Characterization (AFC) scheme in [18],
which first detects the reference characteristics at the application level, and then at the
file level, if necessary. Accordingly, appropriate replacement policies are applied to blocks
with different patterns. The Unified Buffer Management (UBM) scheme by Kim et al.
also detects patterns in the recorded history [42]. Unlike the detection method proposed
in [19], which associates the backward distance and frequency with the forward distances
of blocks between two consecutive detection invocation points, UBM tracks reference
information such as the file descriptor, start block number, end block number, and loop
period if re-reference occurs. Though their elaborate detection of block access patterns
provides a large potential for high performance, these schemes address the problems in a
case-by-case fashion and have to cope with the allocation problem, which does not appear in LRU. To
facilitate the on-line evaluation of buffer usage, certain pre-measurements are needed to set
pre-defined parameters used in the buffer allocation scheme [18, 19]. My LIRS does not
have these design challenges. Just as LRU does, it chooses the victim block in the global
stack. However, it can use the advantages provided by the detection-based schemes.

2.2.4 Working Set Models

Lastly, we would like to compare our work with the working set model, an early work by
Denning [24]. A working set of a program is a set of its recently used pages. Specifically, at
virtual time t, the program’s working set Wt(θ) is the subset of all pages of the program
that have been referenced in the previous θ virtual time units (the working set window).
A working set replacement algorithm is used to ensure that no pages in the working set
of a running program will be replaced [25]. Because the model estimates the current memory
demand of a running program in the system, it does not incorporate the available cache size.
When the working set is larger than the cache size, the working set replacement algorithm
does not work properly. Another difficulty with the working set model is its weak ability
to distinguish recently referenced “cold” blocks from “hot” blocks. My LIRS algorithm
ensures that the LIR block set size is less than the available cache size and keeps the set in the
cache. IRR helps to distinguish the “cold” blocks from the “hot” ones: a recently referenced
“cold” block could have a small recency, but would have a large IRR.
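
The working set definition above translates directly into a window over the reference trace. The following is a minimal sketch; the function name working_set, the trace, and the convention that the window ends at (and includes) time t are assumptions made for illustration.

```python
def working_set(trace, t, theta):
    """Pages referenced in the theta virtual time units that end at time t.

    trace[i] is the page referenced at virtual time i + 1.
    """
    start = max(0, t - theta)
    return set(trace[start:t])


# Hypothetical trace of page references, one per virtual time unit.
pages = list("abacabdd")
ws = working_set(pages, t=8, theta=3)        # references at times 6, 7, 8
ws_wide = working_set(pages, t=4, theta=10)  # window longer than the trace so far
```

Note that the result depends only on the trace and the window θ; the cache size never enters the computation, which is the limitation discussed above.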

2.3 The LIRS Algorithm

2.3.1 General Idea

We divide the referenced blocks into two sets: the High Inter-reference Recency (HIR) block set
and the Low Inter-reference Recency (LIR) block set. Each block with history information in
the cache has a status, either LIR or HIR. Some HIR blocks may not reside in the cache, but
have metadata in the cache recording their statuses as non-resident HIR blocks. We also
divide the cache, whose size in blocks is L, into a major part and a minor part in terms of
their sizes. The major part, with size Llirs, is used to store LIR blocks, and the minor
part, with size Lhirs, is used to store blocks from the HIR block set, where Llirs + Lhirs
= L. When a miss occurs and a block is needed for replacement, we choose an HIR block
that is resident in the cache. The LIR block set always resides in memory, i.e., there are
no misses for references to LIR blocks. However, a reference to an HIR block is
likely to encounter a miss, because Lhirs is very small (its practical size can be as small as
1% of the cache size).
We use Table 2.1 as a simple example to illustrate how a replaced block is selected
by the LIRS algorithm and how LIR/HIR statuses are switched. In Table 2.1, the symbol “X”
denotes a block access at a virtual time unit (virtual time is defined on the reference
sequence, where each reference represents one time unit). For example, block A is accessed at time
units 1, 6, and 8. Based on the definitions of recency and IRR in Section 2.1.2, at time
unit 10, blocks A, B, C, D, and E have IRR values of 1, 1, “infinite”, 3, and “infinite”,
respectively, and recency values of 1, 3, 4, 2, and 0, respectively. We assume
Llirs = 2 and Lhirs = 1, thus at time 10 the LIRS algorithm leaves two blocks in the
LIR set = {A, B}. The rest of the blocks go to the HIR set = {C, D, E}. Because block E
is the most recently referenced, it is the only resident HIR block due to Lhirs = 1. If there
is a reference to an LIR block, we just leave it in the LIR block set. If there is a reference
to an HIR block, we need to know whether we should change its status to LIR.

Blocks / Virtual time   1   2   3   4   5   6   7   8   9  10  Recency  IRR
E                                                       X            0  inf
D                           X                   X                    2    3
C                                   X                                4  inf
B                               X       X                            3    1
A                       X                   X       X                1    1

Table 2.1: An example to explain how a victim block is selected by the LIRS algorithm and how
LIR/HIR statuses are switched. An “X” indicates that the block of the row is referenced at the
virtual time of the column. The Recency and IRR columns give the values at virtual time 10 for each
block. We assume Llirs = 2 and Lhirs = 1; at time 10 the LIRS algorithm leaves two blocks
in the LIR set = {A, B}, and the HIR set is {C, D, E}. The only resident HIR block is E.

The key to making the LIRS idea work in practice rests on whether we are
able to dynamically and responsively maintain the LIR block set and the HIR block set. When
an HIR block is referenced, it gets a new IRR equal to its recency. Then we determine
whether the new IRR is small compared with the reference statistics of existing LIR blocks, so
that we can decide whether we need to change its status to LIR. Here we have two options:
to compare it either with the IRRs or with the recencies of the LIR blocks. We choose the
recencies for the comparison. There are two reasons for this: (1) The IRRs are generated
before their respective recencies and are outdated, so they are not directly relevant to the
new IRR of the HIR block. The recency of a block is determined not only by its own reference
activity, but also by the recent activities of other blocks. The result of comparing the
new IRR with the recencies of the LIR blocks determines the eligibility of the HIR block to be
considered a “hot block”. Though we claim that IRRs are used to determine which block
should be replaced, it is the new IRRs that are directly used in the comparisons. (2) If the
new IRR of the HIR block is smaller than the recency of an LIR block, it will be smaller
than the upcoming IRR of the LIR block. This is because the recency of the LIR block
is a part of its upcoming IRR, and is not greater than the IRR. Thus the comparisons with
the recencies are actually comparisons with the relevant IRRs. Once we know that the
new IRR of the HIR block is smaller than the maximum recency of all the LIR blocks, we
switch the LIR/HIR statuses of the HIR block and the LIR block with the maximum recency.
Following this rule, we can (1) allow an HIR block with a relatively small IRR to join the LIR
block set in a timely way by removing an LIR block from the set; and (2) keep the size of the LIR
block set no larger than Llirs, so that the entire set can reside in the cache.
Again in the example of Table 1, if there is a reference to block D at time 10, then a
miss occurs. LIRS algorithm evicts resident HIR block E, instead of block B, which would
be evicted by LRU due to its largest recency. Furthermore, because block D is referenced,
its new IRR becomes 2, which is smaller th an the recency of LIR block B (=3), indicating
th a t the upcoming IR R of block B will not be smaller th an 3. So the status of block D
is switched to LIR, and the block joins the LIR block set, while block B becomes an HIR
block. Since block B becomes the only resident HIR block, it is going to be evicted from the
cache once another free block is requested. If at virtual tim e 10, block C with its recency 4,
rather than block D w ith its recency 2, gets referenced, there will be no status switching.
Then block C becomes a resident HIR block, though the replaced block is still E at virtual
tim e 10. The LIR block set and HIR block set are formed and dynamically maintained in
this way.
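The recency and IRR values used in this example can be computed mechanically from a reference trace. The following is a minimal Python sketch, not code from the dissertation. The function name `recency_and_irr` and the nine-reference trace are assumptions: the trace was chosen only so that the resulting recencies agree with the values quoted above (B = 3, C = 4, D = 2); the exact contents of Table 1 are not reproduced here.

```python
import math

def recency_and_irr(trace):
    """Return (recency, irr) dictionaries for a reference trace.

    irr[b]     : number of other distinct blocks referenced between the last
                 two references to b (infinity if b was referenced only once).
    recency[b] : number of other distinct blocks referenced since the last
                 reference to b.
    """
    last_pos = {}
    irr = {}
    for i, block in enumerate(trace):
        if block in last_pos:
            between = set(trace[last_pos[block] + 1:i])
            irr[block] = len(between - {block})
        else:
            irr[block] = math.inf
        last_pos[block] = i
    recency = {
        block: len(set(trace[pos + 1:]) - {block})
        for block, pos in last_pos.items()
    }
    return recency, irr

# Hypothetical trace for virtual times 1..9 (an assumption, see above).
trace = ["A", "D", "B", "C", "B", "A", "D", "A", "E"]
recency, irr = recency_and_irr(trace)
```

With this trace, a reference to D at time 10 would give D a new IRR equal to its current recency, 2, which is smaller than the recency of B, 3. This is the comparison that triggers the status switch described above.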

2.3.2 The LIRS Algorithm Based on LRU Stack

The LIRS algorithm can be efficiently built on the model of the LRU stack, which is an
implementation structure of LRU. The LRU stack is a cache storage containing L entries, each of
which represents a block². In practice, L is the cache size in blocks. The LIRS algorithm makes
use of the stack to record recency and to dynamically maintain the LIR block set and the
HIR block set. In contrast to the LRU stack, where only resident blocks are managed by
LRU replacement, we store LIR blocks, and HIR blocks with recencies less
than the maximum recency of the LIR blocks, in a stack called the LIRS stack S. S is similar to
the LRU stack in operation but has a variable size. With this implementation, we do not
need to explicitly keep track of the IRR and recency values or to search for the maximum
recency value. Each entry in the stack records the LIR/HIR status and a residence status
indicating whether or not the block resides in the cache. To facilitate the search for resident
HIR blocks, we link all these blocks into a small list Q with maximum size Lhirs. Once
a free block is needed, the LIRS algorithm removes a resident HIR block from the front of
the list for replacement. However, the replaced HIR block remains in the stack S with its
residence status changed to non-resident, if it is originally in the stack. We ensure the block
at the bottom of the stack S is an LIR block by removing HIR blocks below it. Once an
HIR block in the LIRS stack gets referenced, which means there is at least one LIR block,
such as the one at the bottom, whose upcoming IRR will be greater than the new IRR of
the HIR block, we switch the LIR/HIR statuses of the two blocks. The LIR block at the
bottom is evicted from the stack S and goes to the end of the list Q as a resident HIR
block. This block will soon be evicted from the cache due to the small size of the list Q (at
most Lhirs).

² For simplicity, in the rest of the dissertation we say, without ambiguity, “a block in the stack” instead
of “the entry of a block in the stack”.

Such a scheme is intuitive from the perspective of LRU replacement behavior: if a block
gets evicted from the bottom of the LRU stack, it means the block occupied a buffer during
the period of time when it moved from the top to the bottom of the stack without being
referenced. Why should we afford a buffer for another long idle period when the block is
loaded again into the cache? The rationale behind this is the assumption that temporal
IRR locality holds for block references.

Figure 2.1: The LIRS stack S holds LIR blocks as well as HIR blocks with or without resident
status, and a list Q holds all the resident HIR blocks.
[Diagram: LIRS stack S drawn from top to bottom and list Q drawn from front to end. Legend: LIR
block (all LIR blocks are resident), resident HIR block, non-resident HIR block.]

2.3.3 A Detailed Description

We define an operation called “stack pruning” on the LIRS stack S, which removes the HIR
blocks at the bottom of the stack until an LIR block sits at the stack bottom. This operation
serves two purposes: (1) We ensure the block at the bottom of the stack always belongs
to the LIR block set. (2) After the LIR block at the bottom is removed, those HIR blocks
contiguously located above it will not have chances to change their statuses from HIR to
LIR, because their recencies are larger than the new maximum recency of LIR blocks.

When the LIR block set is not full, all the referenced blocks are given LIR status until its
size reaches Llirs. After that, HIR status is given to any blocks that are referenced for the
first time, and to blocks that have not been referenced for so long that they are no
longer in stack S.

Figure 2.1 shows a scenario where stack S holds three kinds of blocks: LIR blocks, resident
HIR blocks, and non-resident HIR blocks, and a list Q holds all of the resident HIR blocks. An
HIR block may either be in the stack S or not. Figure 2.1 does not depict non-resident HIR
blocks that are not in the stack S. There are three cases with various references to these
blocks.

1. Upon accessing an LIR block X: This access is guaranteed to be a hit in the
cache. We move it to the top of the stack S. If the LIR block was originally located
at the bottom of the stack, we conduct a stack pruning. This case is illustrated in
the transition from state (a) to state (b) in Figure 2.2, based on the example shown
in Table 1.

2. Upon accessing an HIR resident block X: This is a hit in the cache. We move
it to the top of the stack S. There are two cases for block X: (1) If X is in the stack
S, we change its status to LIR. This block is also removed from list Q. The LIR block
at the bottom of S is moved to the end of list Q with its status changed to HIR. A
stack pruning is then conducted. This case is illustrated in the transition from state
(a) to state (c) in Figure 2.2. (2) If X is not in the stack S, we leave its status as HIR
and move it to the end of list Q.

3. Upon accessing an HIR non-resident block X: This is a miss. We remove the
HIR resident block at the front of list Q (it then becomes a non-resident block), and
evict it from the cache. Then we load the requested block X into the freed buffer and
place it at the top of stack S. There are two cases for block X: (1) If X is in the stack
S, we change its status to LIR and move the LIR block at the bottom of stack S to
the end of list Q with its status changed to HIR. A stack pruning is then conducted.
This case is illustrated in the transition from state (a) to state (d) in Figure 2.2. (2)
If X is not in the stack S, we leave its status as HIR and place it at the end of list Q.
This case is illustrated in the transition from state (a) to state (e).

Figure 2.2: Illustration of the reference results in the example shown in Table 1 on the LIRS stack.
In this figure, (a) corresponds to the state at virtual time 9. Accessing B, E, D, or C at virtual time
10 results in (b), (c), (d), and (e), respectively.
[Diagram: five panels (a) to (e), each showing LIRS stack S and list Q. Legend: LIR block (all LIR
blocks are resident), resident HIR block, non-resident HIR block.]
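The three cases above, together with stack pruning and the warm-up rule for a non-full LIR block set, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments. The class name `LIRSCache`, its attribute names, and the use of `OrderedDict` for stack S and list Q are all choices made here. The sketch also places no bound on the number of non-resident HIR entries kept in S.

```python
from collections import OrderedDict

LIR, HIR = "LIR", "HIR"

class LIRSCache:
    """Sketch of LIRS using a stack S and a list Q (names follow the text)."""

    def __init__(self, cache_size, hir_size=1):
        assert 1 <= hir_size < cache_size
        self.cache_size = cache_size
        self.l_lirs = cache_size - hir_size  # Llirs; hir_size plays the role of Lhirs
        self.S = OrderedDict()   # block -> status; first item = bottom, last = top
        self.Q = OrderedDict()   # resident HIR blocks; first item = front
        self.resident = set()    # blocks currently held in the cache
        self.n_lir = 0           # current size of the LIR block set

    def _prune(self):
        # Stack pruning: drop HIR blocks until an LIR block sits at the bottom.
        while self.S and next(iter(self.S.values())) == HIR:
            self.S.popitem(last=False)

    def access(self, x):
        """Reference block x. Returns True on a hit and False on a miss."""
        if self.S.get(x) == LIR:              # case 1: LIR block, always a hit
            self.S.move_to_end(x)
            self._prune()
            return True
        hit = x in self.resident
        if not hit:
            if self.n_lir < self.l_lirs:      # warm-up: LIR set not yet full
                self.S[x] = LIR
                self.n_lir += 1
                self.resident.add(x)
                return False
            if len(self.resident) >= self.cache_size:
                victim, _ = self.Q.popitem(last=False)  # front of list Q
                self.resident.discard(victim)           # stays in S if it is there
            self.resident.add(x)
        if x in self.S:                       # cases 2(1) and 3(1): switch statuses
            self.S[x] = LIR
            self.S.move_to_end(x)
            self.Q.pop(x, None)
            bottom, _ = self.S.popitem(last=False)      # LIR block at the bottom
            self.Q[bottom] = None                       # becomes resident HIR
            self._prune()
        else:                                 # cases 2(2) and 3(2): x stays HIR
            self.S[x] = HIR
            self.Q[x] = None
            self.Q.move_to_end(x)
        return hit
```

As a usage note, state (a) of the running example can be set up by hand with stack S holding, from bottom to top, B (LIR), D (non-resident HIR), A (LIR), and E (resident HIR), and list Q holding E. This layout is an assumption reconstructed from the recencies quoted in Section 2.3.1. Calling `access("D")` on that state evicts E, promotes D to LIR, and leaves B as the only block in Q, matching the transition to state (d).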

2.4 Performance Evaluation

2.4.1 Experimental Settings

To validate our LIRS algorithm and to demonstrate its strength, we use trace-driven
simulations with various types of workloads to evaluate it and compare it with other algorithms.
We have adopted many application workload traces used in previous literature that aims at
addressing the limitations of LRU. We have also generated a synthetic trace. Among these traces,
cpp, cs, glimpse, and postgres are used in [18, 19] (cs is named cscope and postgres
is named postgres2 there), sprite is used in [45], multi1, multi2, and multi3 are used in
[42], and OpenMail and Cello99 are used in [76]. We briefly describe the workload traces here.
These traces represent a wide range of access patterns, sizes, sources, and collecting times.

1. 2-pools is a synthetic trace, which simulates the application behavior of example 3 in
Chapter 2.1.1, with 100,000 references.

2. cpp is a GNU C compiler pre-processor trace. The total size of C source programs
used as input is roughly 11 MB.

3. cs is an interactive C source program examination tool trace. The total size of the C
programs used as input is roughly 9 MB.

4. glimpse is a text information retrieval utility trace. The total size of text files used
as input is roughly 50 MB.

5. postgres is a trace of join queries among four relations in a relational database system
from the University of California at Berkeley.

6. sprite is from the Sprite network file system, which contains requests to a file server
from client workstations for a two-day period.

7. multi1 is obtained by executing two workloads, cs and cpp, together.

8. multi2 is obtained by executing three workloads, cs, cpp, and postgres, together.

9. multi3 is obtained by executing four workloads, cpp, gnuplot, glimpse, and postgres,
together.

10. OpenMail is a trace of a production e-mail system running the HP OpenMail
application.

11. Cello99 is a trace of every disk I/O access for the month of April 1999 from an HP
9000 K570 server.

Because a well-designed replacement algorithm should perform well under the various access
patterns exhibited in workloads, we select traces 1-9, which are relatively small in scale but
cover a wide range of file access patterns, to compare the performance of LIRS with other
proposed algorithms. Then we use the two large-scale traces, OpenMail and Cello99, to
test the effectiveness of LIRS with applications on state-of-the-art, high-end server systems. The
only parameter of the LIRS algorithm, Lhirs, is set to 1% of the cache size, so Llirs = 99%
of the cache size. This selection results from a sensitivity analysis of Lhirs/Llirs, which is
described in Chapter 2.5.1.

2.4.2 Access Pattern Based Performance Evaluation

Through an elaborate investigation, Choi et al. classify the file cache access patterns into
four types [18]:

• Sequential references: all blocks are accessed one after another, and never re-accessed;

• Looping references: all blocks are accessed repeatedly with a regular interval (period);

• Temporally-clustered references: blocks accessed more recently are the ones more
likely to be accessed in the future;

• Probabilistic references: each block has a stationary reference probability, and all
blocks are accessed independently with the associated probabilities.

The classification serves as a basis for their access pattern detection and for adapting
to different replacement policies. For example, MRU applies to sequential and looping
patterns, LRU applies to temporally-clustered patterns, and LFU applies to probabilistic
patterns. Though our LIRS policy does not depend on such a classification, we would like
to use it to present and explain our experimental results. Because a sequential pattern is
a special case of a looping pattern (with infinite interval), we only use the last three groups:
looping, temporally-clustered, and probabilistic patterns.

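To make the four pattern types concrete, the following Python sketch generates a small synthetic trace of each type. It is an illustration only. All function names and parameters are assumptions introduced here, and the temporally-clustered generator in particular uses a simple "re-reference a recently used block with probability p_reuse" model that is one possible reading of the definition, not the model of [18].

```python
import random

def sequential(n_blocks):
    # Every block is accessed once, one after another, and never again.
    return list(range(n_blocks))

def looping(n_blocks, n_refs):
    # All blocks are accessed repeatedly with a regular interval of n_blocks.
    return [i % n_blocks for i in range(n_refs)]

def probabilistic(weights, n_refs, rng):
    # Block i is drawn independently with a stationary probability
    # proportional to weights[i].
    return rng.choices(range(len(weights)), weights=weights, k=n_refs)

def temporally_clustered(n_blocks, n_refs, rng, window=8, p_reuse=0.9):
    # Assumed model: with probability p_reuse, re-reference one of the
    # `window` most recently used blocks; otherwise pick any block.
    recent, trace = [], []
    for _ in range(n_refs):
        if recent and rng.random() < p_reuse:
            block = rng.choice(recent)
        else:
            block = rng.randrange(n_blocks)
        trace.append(block)
        if block in recent:
            recent.remove(block)
        recent.append(block)      # most recently used block goes last
        del recent[:-window]      # keep only the last `window` blocks
    return trace

rng = random.Random(0)            # fixed seed so that runs are repeatable
seq = sequential(5)
loop = looping(3, 7)
prob = probabilistic([8, 1, 1], 1000, rng)
clustered = temporally_clustered(1000, 1000, rng)
```

Traces produced this way can be fed to any replacement-policy simulator that accepts a sequence of block numbers.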
Policies LRU, LRU-2, 2Q, ARC, LRFU, and LIRS belong to the same category of
replacement policies. In other words, these policies take the same technical direction:
predicting the access possibility of a block through its own access history. Thus,
we focus our performance comparisons on LIRS and these policies. As representative
policies in the category of regularity detection, we choose two schemes for comparison:
UBM for its spatial regularity detection, and EELRU for its temporal regularity detection.
UBM simulation requires the file ID, offset, and process ID of each reference. However, some
traces available to us consist only of logical block numbers, each of which is a unique number
for an accessed block. Thus, we only include the UBM experimental results for the traces
used in [42], which are multi1, multi2, and multi3. We also include the results of OPT,
an optimal, off-line replacement algorithm [20], for comparison.

We divide traces 1-9 into 4 groups based on their access patterns. Traces cs, postgres,
and glimpse belong to the looping type, traces cpp and 2-pools belong to the probabilistic
type, trace sprite belongs to the temporally-clustered type, and traces multi1, multi2, and
multi3 belong to the mixed type.

We present performance results for each trace by a pair of figures: the time-space map
and the hit rate curves. In a time-space map, the x axis represents virtual time (the position
in the reference sequence of a given workload), and the y axis plots the logical block numbers of
the blocks referenced. The hit rate curves show the hit rates as the cache size increases for
various replacement policies on a workload trace.

2.4.2.1 Performance for the Looping Type

Figures 2.3 to 2.5 plot three pairs of time-space maps (left figures) and hit rate curves
(right figures) generated by the various replacement policies for traces cs, glimpse, and
postgres, respectively. The time-space maps show that all three programs have looping
patterns with long intervals. As expected, LRU performs poorly for these workloads, with
the lowest hit rates. Let us take cs as an example, which has a pure looping pattern. Each
block is accessed at almost the same frequency (see the left figure in Figure 2.3). Since all
blocks in a loop have the same eligibility to be kept in the cache, it is desirable to keep the
same set of blocks in the cache no matter what blocks are currently referenced. That is just
what LIRS does: the same LIR blocks are fixed in the cache because HIR blocks do not
have IRRs small enough to change their statuses. In the looping pattern, recency predicts
the opposite of the future reference time of a block: the larger the recency of a block is,
the sooner the block will be referenced. The hit rate of LRU for cs is almost 0% until the
cache size approaches 1,400 blocks, at which point the cache can hold all the blocks referenced
in the loop. It is interesting to see that the hit rate curve of LRU-2 overlaps with the LRU curve.
This is because LRU-2 chooses the same victim block as the one chosen by LRU for replacement.
When making a decision, LRU-2 compares penultimate reference times, each of which is the
recency plus the most recent IRG. However, the IRGs are the same for all the blocks at any
time after the first reference. Thus, LRU-2 relies only on recency to make its decision, the
same as LRU does. Generally, when recency makes a major contribution to the penultimate
reference time, LRU-2 behaves similarly to LRU.
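The behavior of LRU on a pure loop can be reproduced with a few lines of Python. The sketch below is illustrative only: the loop of 100 blocks repeated 10 times is a hypothetical stand-in, not the cs trace, and `lru_hit_rate` is a name chosen here.

```python
from collections import OrderedDict

def lru_hit_rate(trace, cache_size):
    """Simulate LRU on a trace and return the fraction of references that hit."""
    cache = OrderedDict()              # least recently used block comes first
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)   # block becomes the most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)   # evict the least recently used block
            cache[block] = None
    return hits / len(trace)

# Hypothetical pure loop: blocks 0..99 referenced in order, 10 times over.
loop = [i % 100 for i in range(1000)]
rates = {size: lru_hit_rate(loop, size) for size in (50, 99, 100)}
# rates == {50: 0.0, 99: 0.0, 100: 0.9}
```

Even with 99 of the 100 loop blocks cached, LRU always evicts exactly the block that will be referenced next, so its hit rate stays at zero until the cache holds the whole loop.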

Figure 2.3: The time-space map (left) of cs and the hit rate curves by various replacement policies
(right).
[Panels: CS (left, x axis: virtual time) and CS (right, x axis: cache size in # of blocks).]

Except for cs, the other two workloads have mixed looping patterns with different
intervals. LRU presents a stair-step hit rate curve for those workloads.
LRU is not effective until all the blocks in its locality scope are brought into the cache. For
example, only after the cache can hold 355 blocks does the LRU hit rate of postgres have
a sharp increase from 16.3% to 48.5% (see the right figure in Figure 2.5). Because LRU-2
considers the last IRG in addition to the recency, it is easier for it than for LRU to distinguish
blocks in loops with different intervals. However, LRU-2 lacks the capability to
deal with these blocks when varying recency is involved. My experiments show that the
performance improvement achieved by LRU-2 over LRU is limited (see the right figures
in Figures 2.4 and 2.5).

It is illuminating to observe the performance difference between 2Q and LIRS, because
both employ two linear data structures following a similar principle that only re-referenced
blocks deserve to be in the cache for a long period of time. We can see that the hit rates of 2Q
are significantly lower than those of LIRS for all three workloads (see the right figures in
Figures 2.3, 2.4, and 2.5). As the cache size increases, 2Q even performs worse than LRU for
workloads glimpse and postgres. Another observation for 2Q on glimpse and postgres is
a serious “Belady’s anomaly” [8]: increasing the cache size may increase the number
of misses. Though ARC is an adaptive algorithm without tunable parameters, it actually
shares the same problem as 2Q. The performance improvement of ARC over LRU
is very limited. Belady’s anomaly also appears in the workload glimpse for ARC. This is
mainly caused by the inconsistent quantification and comparison of the locality of blocks in
the two lists, which is effectively addressed in LIRS. We will provide an in-depth analysis of this
issue in Section 2.4.4.

Figure 2.4: The time-space map (left) of glimpse and the hit rate curves by various replacement
policies (right).
[Panels: GLIMPSE (left, x axis: virtual time; right, x axis: cache size in # of blocks).]

LRFU, which combines LRU and LFU, is not effective on a workload with a looping
pattern, because reference frequencies are hard to distinguish for looping references. The
LRFU and LRU hit rate curves for workload cs overlap, as shown in Figure
2.3.

My trace-driven simulation results show that LIRS significantly outperforms all of the other
policies, and its hit rate curves are very close to those of OPT. LIRS can make a more
accurate prediction of the future LIR/HIR status of each block for cs and postgres than for
glimpse, because the intervals of loops in cs and postgres have less variance, and thus the
consecutive IRRs have less variance (see the performance difference between cs and postgres
in Figures 2.3 and 2.5 and glimpse in Figure 2.4). However, the LIRS algorithm is not
sensitive to the variance of IRRs, which is reflected by its good performance on workload
glimpse. We explain it as follows.

Figure 2.5: The time-space map (left) of postgres and the hit rate curves by various replacement
policies (right).
[Panels: POSTGRES (left, x axis: virtual time; right, x axis: cache size in # of blocks).]

We denote the recency of the LIR block at the bottom of LIRS stack S as Rmax. When
there are no free block buffers, Rmax is larger than the cache size in blocks. Only when
two consecutive IRRs of references to a block vary across the value Rmax is the status
prediction of the LIRS algorithm based on the last IRR wrong. This includes two cases: (1)
an IRR less than Rmax is succeeded by another IRR greater than Rmax, and (2) an IRR
greater than Rmax is succeeded by another IRR less than Rmax. All other IRR variances,
no matter how large they are, cause no mishandling in LIRS replacement. Let
us take a close look at the penalty from a wrong LIR/HIR status decision: (1) Suppose a
block is labeled as LIR (due to its previous, small IRR) when it should be labeled as HIR.
The block will be evicted by LIRS after Llirs references (i.e., when the block reaches the
bottom of stack S), instead of being evicted after Lhirs references. Since Llirs is almost
as large as L, the performance penalty imposed by the LIRS mis-classification is no worse
than that imposed by LRU. (2) Suppose a block is labeled as HIR (due to its previous,
large IRR) when it should be labeled as LIR. The block will be evicted by LIRS far before
it reaches the stack bottom, instead of being hit by a reference before it reaches the stack
bottom. Thus LIRS would incur an extra miss if the block had been evicted from the HIR
resident list Q. However, because the number of block buffers assigned to list Q (Lhirs)
is very small, only 1% of the total cache size in our experiments, HIR blocks would
be replaced very soon, which reduces the chance for the replaced block to be re-referenced
shortly after its eviction. The block buffer freed for the period between the early eviction
and the block’s next reference helps to reduce the penalty from the extra misses.
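The condition under which the status prediction goes wrong can be written as a one-line predicate. The Python sketch below is an illustration with names chosen here. It treats Rmax as a fixed threshold, although in a running system Rmax changes over time, and it counts an IRR exactly equal to Rmax as "not less than Rmax", a boundary case the text does not specify.

```python
def is_mispredicted(prev_irr, next_irr, r_max):
    # The prediction based on prev_irr is wrong only when the two
    # consecutive IRRs fall on opposite sides of r_max.
    return (prev_irr < r_max) != (next_irr < r_max)

def count_mispredictions(irrs, r_max):
    """Count consecutive IRR pairs of one block that straddle r_max."""
    return sum(is_mispredicted(a, b, r_max) for a, b in zip(irrs, irrs[1:]))

# Hypothetical IRR sequence of one block, with r_max assumed to be 100.
irrs = [3, 50, 4, 5, 200, 300]
wrong = count_mispredictions(irrs, r_max=100)   # only the pair (5, 200) straddles 100
```

The jumps from 3 to 50 and from 200 to 300 are large, but they stay on one side of the threshold and so cause no misprediction, which is the sense in which LIRS is insensitive to IRR variance.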

2.4.2.2 Performance for the Probabilistic Type

Figures 2.6 and 2.7 plot two pairs of time-space maps (left figures) and hit rate curves
(right figures) generated by the various replacement policies for workloads cpp and 2-pools,
respectively. According to the detection results in [18], workload cpp exhibits probabilistic
reference patterns. The right figure in Figure 2.6 shows that before the cache size increases
to 100 blocks, the hit rate of LRU is much lower than that of LIRS for cpp. For example,
when the cache size is 50 blocks, the hit rate of LRU is 9.3%, while the hit rate of LIRS is 55.0%.
This is because holding a major reference locality needs about 100 blocks (see the left figure
of Figure 2.6). LRU cannot exploit locality until enough cache space is available to hold
all the recently referenced blocks. However, the capability of LIRS to exploit locality does
not depend on the cache size: when it identifies the LIR set to keep in the cache,
it always lets the set size match the cache size.

Figure 2.6: The time-space map (left) of cpp and the hit rate curves by various replacement
policies (right).
[Panels: CPP (left, x axis: virtual time; right, x axis: cache size in # of blocks).]

Workload 2-pools is generated to evaluate
the replacement policies on their abilities to recognize the long-term reference behavior.
Though the reference frequencies are largely different between the record blocks and the
index blocks, it is hard for LRU to distinguish them when the cache size is small relative
to the number of referenced blocks, because LRU takes only recency into consideration.
The LRU-2, 2Q, and LIRS algorithms take one more previous reference into consideration:
the time of the penultimate reference to a block is involved. Even though the reference
events to a block are randomized (the IRRs of a block are random with a certain fixed
frequency, which is unfavorable to LIRS), LIRS still outperforms LRU-2 and 2Q (see the
right figure in Figure 2.7). However, LRFU utilizes “deeper” history information. Thus,
the constant long-term frequency becomes more visible, and is ready to be utilized by the
LFU-like scheme. The performance of LRFU is slightly better than that of LIRS. It is not
surprising to see that the hit rate curve of EELRU exhibits poor performance and overlaps
with that of LRU, because EELRU relies on an analysis of a temporal recency distribution
to decide whether to conduct an early point eviction. In workload 2-pools, the blocks with
high access frequency and the blocks with low access frequency are alternately referenced,
thus no sign of an early point eviction can be detected.

Figure 2.7: The time-space map (left) of 2-pools and the hit rate curves by various replacement
policies (right).
[Panels: 2-POOLS (left, x axis: virtual time; right, x axis: cache size in # of blocks).]

2.4.2.3 Performance for the Temporally-Clustered Type

Figure 2.8 presents the time-space map (left figure) and the hit rate curves (right figure)
generated by the various replacement policies for workload sprite, which exhibits
temporally-clustered reference patterns. The right figure in Figure 2.8 shows that the LRU hit rate
curve smoothly climbs with the increase of the cache size. Although there is still a gap
between LRU and OPT, the slope of the LRU curve is close to that of OPT. Sprite is a
so-called LRU-friendly workload [67], which seldom accesses more blocks than the cache size
over a fairly long period of time. For this type of workload, the behavior of all the other
policies should be similar to that of LRU, so that their hit rates can be close to that of
LRU. Before the cache size reaches 350 blocks, the right figure in Figure 2.8 shows that
the hit rate of LIRS is higher than that of LRU. After this point, the hit rate of LRU is
slightly higher. Here is the reason for the slight performance degradation of LIRS beyond
that cache size: whenever there is a locality scope shift or transition, i.e., some HIR blocks
get referenced, one more miss than would occur in LRU may be experienced by each HIR
block. Only the next reference to the block in the near future after the miss makes it switch
from HIR to LIR status and then remain in the cache. However, because of the strong
locality, there are not frequent locality scope changes. So the negative effect of the extra
misses is very limited.

Figure 2.8: The time-space map (left) of sprite and the hit rate curves by various replacement
policies (right).
[Panels: SPRITE (left, x axis: virtual time; right, x axis: cache size in # of blocks).]

Figure 2.9: The time-space map (left) of multi1 and the hit rate curves by various replacement
policies (right).
[Panels: MULTI1 (left, x axis: virtual time) and MULTI1 (cs+cpp) (right, x axis: cache size in # of
blocks).]

2.4.2.4 Performance for the Mixed Type

Figures 2.9 to 2.11 present three pairs of time-space maps (left figures) and hit rate
curves (right figures) generated by the various replacement policies for workloads multi1,
multi2, and multi3, respectively. The authors in [42] provide a detailed discussion of why their
UBM shows the best performance among the policies they have considered: UBM, LRU-2,
2Q, and EELRU. Here we focus on performance differences between LIRS and UBM. UBM
is a typical spatial regularity detection-based replacement policy that conducts exhaustive
reference pattern detection. UBM tries to identify sequential and looping patterns
and applies MRU to the detected patterns. UBM further measures looping intervals and
conducts period-based replacements. For unidentified blocks, LRU is applied. A dynamic
buffer allocation among blocks managed by different policies is employed. Without devoting
specific effort to specific regularities, LIRS outperforms UBM for all three mixed type
workloads, which shows that our assumption on IRR holds well and LIRS is able to cope
with weak locality references in workloads with mixed type patterns.

Figure 2.10: The time-space map (left) of multi2 and the hit rate curves by various replacement
policies (right).
[Panels: MULTI2 (left, x axis: virtual time) and MULTI2 (cs+cpp+ps) (right, x axis: cache size in
# of blocks).]

2.4.3 LIRS Performance with High End Systems

Modern high-end server systems can have a couple of gigabytes of memory. Moreover,
state-of-the-art high-end disk arrays typically have several gigabytes of cache RAM, which is
mainly used as a low-latency pool of data that is accessed multiple times by the connected
servers. We have two issues to investigate for the high-end systems: (1) whether LRU
becomes competent enough to deal with the workloads on those systems equipped with
a large amount of memory, and (2) whether LIRS can help in such a system environment
once LRU under-performs. We use two large-scale workload traces for this investigation,
OpenMail and Cello99, which have been used in [76] by Wong and Wilkes for their study
of the effective use of cache RAM in disk arrays.

Figure 2.11: The time-space map (left) of multi3 and the hit rate curves by various replacement
policies (right).
[Panels: MULTI3 (left, x axis: virtual time) and MULTI3 (cpp+gnu+gli+ps) (right, x axis: cache size
in # of blocks).]

OpenMail was collected on an HP OpenMail email system for 25,700 users, 9,800 of
whom were active during the hour-long trace, which contains about 5.4M I/O requests. The
system consists of six HP 9000 K580 servers running HP-UX 10.20, each with 6 CPUs,
2GB of memory, and 7 SCSI interface cards. The original traces were collected on the six
nodes separately. We aggregated the six request streams into a single stream in the order
of their request times. Trace Cello99 is a collection of recorded disk I/O accesses for the
month of April 1999 from an HP 9000 K570 server. The trace contains about 61.9M I/O
requests. The server has 4 CPUs, about 2GB of main memory, two HP AutoRAID arrays,
and 18 directly connected disk drives. The system ran a general time-sharing load under
HP-UX 10.20.

[Figure 2.12 image omitted. Two panels of hit rate curves, each with cache size (MB) on the
x axis; the left panel is titled OpenMail.]

Figure 2.12: The hit rate curves of workload OpenMail (left figure) and workload Cello99 (right
figure).


Figure 2.12 shows the hit rates of OpenMail and Cello99 over a range of large file cache
sizes. For OpenMail, LRU seriously under-performs: significantly increasing the cache
size yields little improvement in hit rate until the cache size reaches 3GB, while LIRS
performs much better than LRU, with steadily increasing hit rates. While examining
the trace, we found that 60.3% of its blocks are accessed only once, while the references to
the other blocks exhibit random access characteristics. LRU allows each of those once-accessed
blocks to hold a buffer for at least L reference times, where L is the cache size in
blocks. This reduces the number of buffers used for caching re-used blocks, which are the
blocks that can contribute to the hit rate. LIRS replaces the once-accessed blocks shortly after
they are accessed, so it makes more buffers available to the re-used blocks. In general, Cello99 is
an LRU-friendly workload: its LRU hit rate increases steadily with the cache size, and
LIRS performs close to LRU on it. Note that LRU becomes a
little less effective after the cache size exceeds 2GB: the contribution of additional cache space
to its hit rate is reduced. In comparison, LIRS produces better performance than LRU,
which implies that LIRS can effectively overcome this weakness of LRU. The experiments on the
two large-scale workload traces show that the performance of LRU is susceptible to the
workload access characteristics. LRU can under-perform on various system settings when
workload access patterns are not friendly to it. The experiments also show the effectiveness of
LIRS in overcoming LRU's weakness on high-end systems.
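As a small illustration of the trace statistic quoted above, the share of blocks that are accessed only once can be measured with a simple count. The function name and the toy trace below are hypothetical and are not taken from the OpenMail data.

```python
from collections import Counter

def single_access_share(trace):
    """Return the fraction of distinct blocks that are referenced exactly once."""
    counts = Counter(trace)
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)

# Hypothetical toy trace: blocks 1 and 2 are re-used, blocks 10..15 are touched once.
trace = [1, 2, 10, 1, 11, 2, 12, 13, 1, 14, 2, 15]
share = single_access_share(trace)
```

Under LRU, every one of the once-accessed blocks in such a trace still occupies a buffer until enough other distinct blocks have been referenced to push it out.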

2.4.4 LIRS versus Other Stack-Based Replacements

To gain insight into the superiority of LIRS over other stack-based replacement algorithms,
including LRU and 2Q, we use a time-IRR graph to observe their actions on the
blocks accessed at different recencies. In the graph, the x axis plots virtual time, i.e., references
in the access stream, and the y axis plots IRR, the recency at which the reference at a specific
time takes place. For first-time accessed blocks, the IRR is infinite, and we do not plot it in the
graph. We select two representative workloads, a non-LRU-friendly one, postgres, and an
LRU-friendly one, sprite, for this study. Their IRRs are depicted in Figure 2.13.
[Figure 2.13 image omitted. Two scatter panels, POSTGRES (left) and SPRITE (right), plotting
IRR (y axis) against virtual time (x axis).]

Figure 2.13: The IRRs of references of the workloads postgres (left) and sprite (right)
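The points of a time-IRR graph can be derived from a reference trace with a plain LRU stack. The sketch below is only illustrative: it assumes that the IRR of a reference is the number of distinct other blocks accessed since the previous reference to the same block, and the function name and toy trace are hypothetical. The list-based stack costs O(n) per reference, which is acceptable for a sketch but not for large traces.

```python
def irr_sequence(trace):
    """Return (virtual_time, irr) pairs for every re-reference in the trace.

    First-time references have an infinite IRR and are skipped, as in the graph.
    """
    stack = []    # LRU stack, most recently used block first
    points = []
    for t, block in enumerate(trace):
        if block in stack:
            irr = stack.index(block)   # recency at the moment of the re-reference
            stack.pop(irr)
            points.append((t, irr))
        stack.insert(0, block)         # the referenced block moves to the stack top
    return points

# Hypothetical toy trace of block identifiers.
trace = ["a", "b", "a", "c", "d", "b", "a"]
points = irr_sequence(trace)
```

For this trace the re-reference to "a" at time 2 has IRR 1, while the later re-references to "b" and "a" each have IRR 3.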

The stack size in LRU, which is determined by the cache size in blocks, is fixed. If
the stack size is L, all the references shown in the graphs with IRRs less than L are
hits, and those with IRRs larger than L are misses in LRU. Thus the hit rates of LRU are
directly determined by the IRR distribution. If most IRRs are concentrated in the low
recency area, as shown in the graph for sprite, LRU achieves a high hit rate.
For workloads with dispersed recency distributions, LRU is incapable of achieving high
hit rates. For example, in workload postgres there are IRR concentrations at around
IRRs 350, 1150 and 1950 in the left figure of Figure 2.13. Corresponding to this IRR
distribution, there are obvious “lift ups” in the LRU hit rate curve when the cache size
reaches these values, as shown in Figure 2.5. However, if the number of blocks with
IRRs less than L is significantly smaller than the stack size L, a large number of blocks with
low recencies but high IRRs hold stack space (residing in the cache) without being
accessed before being replaced from the stack. The occupied buffers do not contribute to
the hit rate. Thus what really matters is IRR, not recency. To improve LRU, the blocks
to be cached should be the L blocks with the smallest
IRRs, rather than the L blocks with recencies less than L. Following this criterion,
the LIRS algorithm uses the LIRS stack to dynamically predict the L blocks that will have
the smallest IRRs. The LIRS stack serves two purposes: (1) holding the L blocks with
the smallest IRRs, called LIR blocks; (2) providing a threshold for being a LIR block. In our
algorithm the threshold is Rmax, the recency of the LIR block at the bottom of the LIRS stack.
The LIRS stack contains blocks with recencies less than Rmax. Thus the threshold
is also the LIRS stack size.
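The claim that LRU hit rates are directly determined by the IRR distribution can be checked on a small trace. The following sketch assumes that IRR counts the distinct other blocks referenced between two consecutive references to a block, so that a reference hits in an LRU cache of L blocks exactly when its IRR is less than L. The helper names and the toy trace are hypothetical.

```python
from collections import OrderedDict

def irrs(trace):
    """IRR of each reference (None for first-time references)."""
    stack, out = [], []
    for block in trace:
        if block in stack:
            out.append(stack.index(block))
            stack.remove(block)
        else:
            out.append(None)
        stack.insert(0, block)
    return out

def lru_hits(trace, size):
    """Count the hits of an LRU cache holding `size` blocks."""
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)
        else:
            if len(cache) >= size:
                cache.popitem(last=False)   # evict the least recently used block
            cache[block] = True
    return hits

# Hypothetical toy trace; its re-references have IRRs 2, 2, 2, 4, 4, 4.
trace = [1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 6, 7, 8, 1]
for size in (1, 2, 3, 4, 5):
    predicted = sum(1 for r in irrs(trace) if r is not None and r < size)
    assert predicted == lru_hits(trace, size)
```

In this toy trace the hit count rises only when the cache size passes an IRR concentration (from 2 to 3 blocks and from 4 to 5 blocks), a small-scale analogue of the “lift ups” described above.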

2.4.4.1 LIRS Threshold and Access Characteristics

To gain insight into the relationship between the threshold used by LIRS and workload
access characteristics, we plot the ratio of the LIRS stack size, Rmax, to the size of
the LRU stack, L, in Figure 2.14, with the cache size fixed at 500 blocks for the two
workloads postgres and sprite. We find that the threshold is an inherent reflection of the
LRU capability to exploit locality. If the references have strong locality, most of the
references are to blocks with small recencies. Thus the LRU stack still holds these blocks
when they are re-accessed, and LRU achieves a high hit rate. At the same time, these
blocks are low-IRR blocks, i.e., most of the references go to the LIR blocks, which
leaves only a small number of HIR blocks in the LIRS stack and causes the stack to shrink.
This is the case for workload sprite. With 500 buffer blocks the LRU stack is able to hold most
of the frequently referenced blocks (see the right figure of Figure 2.13). On the other hand, LIRS
can find enough low-IRR blocks within the recency range also covered by the LRU stack. Thus
there is no need for LIRS to raise its stack size significantly to hold a large number of blocks
with high recencies in the cache. This is evidenced in the right figure of Figure 2.14, where
the ratios of LIRS and LRU stack sizes are not far from 1 for most of the time.
However, once LIRS cannot find enough low-IRR blocks within the size of the LRU stack, it
raises its stack size accordingly. We observe that the thresholds of postgres increase
significantly in several phases, during the periods when more references go to blocks with
high recencies than to those with low recencies (see the left figure of Figure 2.14). With 500
buffer blocks, LRU with its fixed stack size cannot capture the locality distinctions among
blocks with high recencies, so all references to them miss. By increasing the stack
size according to the current access characteristics, LIRS can distinguish among
blocks with weak locality and make wiser replacement decisions than LRU. The
experiments also hint that the threshold is a good indicator of the LRU-friendliness of a
workload.

[Figure 2.14 image omitted. Two panels, postgres (cache size = 500) on the left and sprite
(cache size = 500) on the right, plotting the ratio of Rmax to L against virtual time.]

Figure 2.14: The ratios of Rmax to cache size in blocks (L) for workloads postgres (left) and
sprite (right). Rmax is the size of the LIRS stack, which changes with virtual time. Cache size is 500.
Replacement algorithm 2Q also tries to identify blocks with small IRRs and to hold
them in the cache. It relies on queue A1out to decide whether a block qualifies for promotion
to stack Am so that it can be cached for a long time, and consequently to decide whether a
block in Am should be demoted out of Am. In 2Q, the size of A1out serves as a threshold
to identify blocks with small IRRs, and Am holds these blocks. Because the threshold is
intended to predict the L blocks with the smallest IRRs among all accessed blocks, it should
be related to the access characteristics of blocks in Am. Unfortunately, it is not in 2Q. The
recommended size of A1out in paper [37] is 50% of the cache size. So the threshold used
in 2Q is a constant 1.5L, which would be a straight horizontal line at the y axis value
1.5L in a time-IRR graph. This threshold is too tight to let blocks join Am when the
LIRS threshold is larger than 1.5L, and too loose to keep blocks out of Am when the
LIRS threshold is smaller than 1.5L.
This explains why 2Q cannot provide a consistently improved performance over LRU.

2.4.4.2 LRU as a Special Member of the LIRS Family

In the LIRS algorithm, any HIR block with a new IRR smaller than the LIRS threshold can
change into LIR status, and may demote a LIR block into HIR status. The threshold
controls how easily a HIR block may become a LIR block, or how difficult it is for a
LIR block to become a HIR one. We vary the threshold value to obtain
a family of LIRS algorithms with various thresholds, in order to gain insight into the
relationship between LRU and LIRS. By lowering the threshold value, we strengthen the
stability of the LIR block set by making it more difficult for HIR blocks to switch their
status into LIR. This also prevents the LIRS algorithm from responding to relatively small
IRR variance. By increasing the threshold value, we go in the opposite direction. LRU then
becomes a special member of the LIRS family: a LIRS algorithm with an indefinitely large
threshold, which always gives any accessed block LIR status and keeps it in the cache until it
is evicted from the bottom of the stack.

[Figure 2.15 image omitted. Two panels, postgres (left) and sprite (right), plotting hit rate
against cache size (# of blocks) for OPT, LIRS 50%, LIRS 75%, LIRS 100%, LIRS 125%, LIRS 150%,
and LRU; the postgres panel also includes LIRS 550%.]

Figure 2.15: The hit rate curves of workload postgres (left figure) and workload sprite (right
figure) by varying the ratios of the threshold values for LIR/HIR status switching to Rmax in LIRS,
as well as curves for OPT and LRU.
Figure 2.15 presents the results of a sensitivity study of the threshold value. We again
use workloads postgres and sprite to observe the effect of changing the threshold value
among 50%, 75%, 100%, 125%, and 150% of Rmax. For postgres, we include a very large
threshold value, 550% of Rmax, to highlight the relationship between LIRS and LRU.
We have two observations. First, LIRS is not sensitive to the threshold value across a
large range. In postgres, the curves for threshold values of 100%, 125%, and 150% of Rmax
overlap, and the curves for 50% and 75% of Rmax are slightly lower than the curve for the
100% threshold. For sprite, an LRU-friendly workload, the LIRS hit rate curves move very
slowly toward that of LRU as the threshold value increases. Secondly,
the LIRS algorithm can simulate LRU behavior by greatly increasing the threshold. As the
threshold value increases to 550% of Rmax, the LIRS curve of workload postgres becomes very
similar in shape to that of LRU, and close to it (see the left figure of Figure 2.15). With a
further increase of the threshold value, the LIRS curve overlaps with that of LRU.

2.5 Sensitivity and Overhead Analysis

2.5.1 Size Selection of List Q Holding Resident HIR Blocks (Lhirs)

Lhirs is the only parameter in the LIRS algorithm. The blocks in the LIR block set can
stay in the cache for a longer time than those in the HIR block set and experience fewer page
faults. A sufficiently large Llirs (the cache size for LIR blocks) ensures that there are a large
number of LIR blocks. For this purpose, we set Llirs to 99% of the cache size and Lhirs to
1% of the cache size in our experiments, and achieve the expected performance. From the
other perspective, an increased Lhirs may be beneficial to the performance: it reduces the
first-time reference misses. With a longer list Q (larger Lhirs), it is more likely that a HIR
block will be re-accessed before it is evicted from the list, which helps the HIR block change
into LIR status without experiencing an extra miss. However, the benefit of a large Lhirs is
very limited, because the number of such misses is small.
We also use the two workloads, postgres and sprite, to observe the effect of changing the
size. We change Lhirs from 2 blocks to 1%, 10%, 20%, and 30% of the cache size. Figure 2.16
presents the results of a sensitivity study of Lhirs for postgres (left figure) and sprite (right
figure). For each workload, we measure the hit rates of OPT, LRU, and LIRS with different
Lhirs sizes while increasing the cache size. We have the following two observations. First, for both
workloads, we find that LIRS is not sensitive to the increase of Lhirs. Even for a very large
Lhirs that is not in favor of LIRS, the performance of LIRS with different cache sizes is still
quite acceptable. With the increase of Lhirs, the hit rate of LIRS approaches that of LRU.
Secondly, our experiments indicate that increasing Lhirs reduces the performance benefits
of LIRS for workload postgres, but slightly improves performance for workload sprite.

[Figure 2.16 image omitted. Two panels, postgres (left) and sprite (right), plotting hit rate
against cache size (# of blocks) for OPT, LIRS 2, LIRS 1%, LIRS 10%, LIRS 20%, LIRS 30%, and LRU.]

Figure 2.16: The hit rate curves of workload postgres (left figure) and workload sprite (right
figure) by varying the size of list Q (Lhirs, the number of cache buffers assigned to the HIR block set)
of the LIRS algorithm, as well as curves for OPT and LRU. “LIRS 2” means the size of Q is 2; “LIRS x%”
means the size of Q is x% of the cache size in blocks.

2.5.2 Overhead Analysis

LRU is known for its simplicity and efficiency. Comparing the time and space overhead
of LIRS and LRU, we show that LIRS keeps the LRU merit of low overhead. The time
overhead of the LIRS algorithm is O(1), which is almost the same as that of LRU, with a few
additional operations such as those on the list Q for resident HIR blocks. The extended
portion of the LIRS stack S is the additional space overhead of the LIRS algorithm.

[Figure 2.17 image omitted. Two panels, postgres (left) and sprite (right), plotting hit rate
against cache size (# of blocks) for OPT, LIRS, LIRS 1.5, LIRS 2.0, LIRS 2.5, LIRS 3.0, and LRU.]

Figure 2.17: The hit rate curves of workload postgres (left) and workload sprite (right) by
varying the LIRS stack size limits, as well as curves for OPT and LRU. Limits are represented by
the ratios of the LIRS stack size limit in blocks to the cache size in blocks (L).
The stack S contains metadata for the blocks with recencies less than Rmax. When
there is a burst of first-time (or “fresh”) block references, the LIRS stack could grow
unacceptably large. Imposing a size limit is therefore a practical issue in the implementation of
the LIRS algorithm. In an updated version of LIRS, the LIRS stack has a size limit that is
larger than L, and we remove the HIR blocks close to the bottom from the stack once the
LIRS stack size exceeds the limit. We have tested a range of rather small stack size limits,
from 1.5 times to 3.0 times L. From Figure 2.17, we can observe that even with these
strict space restrictions, LIRS retains its desired performance. The effect of limiting the LIRS
stack size is equivalent to reducing the threshold values in Section 2.4.4.2. As expected,
the results are consistent with the ones presented in Section 2.4.4.2. In addition, because a stack
entry consists of only several bytes, it is easily affordable to have a LIRS stack size limit much
larger than 3 times the LRU stack size. With such large limits, removing HIR block entries
close to the stack bottom because of the size limit has little negative effect on
LIRS performance. By moderately extending the LRU stack size, LIRS makes a large difference in
its performance. This is because our solution fundamentally addresses the critical limitations
of LRU.

2.6 Summary

We make two contributions in this work by proposing the LIRS algorithm: (1) We show that
the limitations of LRU with weak locality workloads can be successfully addressed without
relying on explicit regularity detection. By not depending on detectable pre-defined
regularities in the reference stream of workloads, my LIRS algorithm captures more
opportunities to improve LRU performance. (2) We show that earlier work on improving LRU,
such as LRU-K or 2Q, can be evolved into one algorithm with consistently superior performance,
without tuning or adaptation of sensitive parameters. The approach of these algorithms, which
trace only the history information of each referenced block itself, is promising because it is very
likely to produce a simple and low-overhead algorithm just like LRU. We have shown that the
LIRS algorithm accomplishes this goal.
My LIRS algorithm can be effectively applied in virtual memory management because of
its simplicity and its LRU-like assumption on workload characteristics. In the next chapter,
we will describe my design of a LIRS approximation, called CLOCK-Pro, whose reduced
overhead is comparable to that of LRU approximations such as the CLOCK and second
chance algorithms.

Chapter 3

Virtual Memory Replacement Policies
With the ever-growing performance gap between memory systems and disks, and rapidly
improving CPU performance, virtual memory (VM) management becomes increasingly
important for overall system performance. Because of the very stringent cost requirement that
VM management places on replacement policies, almost none of the general-purpose replacement
algorithms can be directly applied here. Research on VM replacement policies is of
special interest to operating system designers.

3.1 Background

3.1.1 The Research Status of Memory Replacement Policies

Memory management has been one of the most active research areas for decades,
ever since it was introduced in computer systems. On one frontier, to make the installed
memory effectively used, much work has been done on memory allocation, recycling, and
memory management in various programming languages. Many breakthroughs have been made in
both theory and practice. On another frontier, to reduce paging between memory
and disks, researchers and practitioners in both academia and industry are working hard
to improve the performance of page replacement, especially to avoid
the worst performance cases. A significant advance in this regard is increasingly
in demand with the continuously growing gap between memory and disk access times, and
rapidly improving CPU performance. Unfortunately, an approximation of LRU, the CLOCK
replacement policy [21], which was developed at least 35 years ago, still dominates almost
all the major operating systems including MVS, Unix, Linux and Windows (see footnote 1), even
though it has apparent performance problems inherited from LRU with certain commonly observed
memory access patterns.
We believe there are two factors responsible for the lack of significant improvements
in VM page replacement. First, there is a very stringent cost requirement on the policy
from VM management. It requires that the cost be associated with the number of page faults
or be a moderate constant. As we know, a page fault incurs a penalty worth hundreds of
thousands of CPU cycles. This allows a replacement policy to do its job without intrusively
interfering with application execution. A policy with a cost proportional to the number of
memory references would be prohibitively expensive. It would cause the user program to incur
a trap to the operating system every few instructions, and the CPU would spend much more
time on page replacement work than on useful work for the user application, even
when there are no paging requests. From the cost perspective, even LRU, a well-recognized
low-cost and simple replacement algorithm, is unaffordable, because it has to maintain the
LRU ordering of pages at all times. The second factor is that most proposed replacement
algorithms attempting to improve LRU performance turn out to be too complicated to
yield approximations whose costs meet the requirements of VM. This
is mainly because the weak cases of LRU are mostly attributed to its minimal use of history
access information, which motivates other researchers to take the opposite approach,
adding more bookkeeping and access statistic analysis to make their algorithms more
intelligent in dealing with access patterns unfriendly to LRU.

Footnote 1: This generally covers many CLOCK variants, including the Mach-style active/inactive
list and the FIFO list facilitated with hardware reference bits. These CLOCK variants share the
same performance problems plaguing LRU.

3.1.2 LRU/CLOCK and their Performance Disadvantages

LRU is designed based on the assumption that a page will be re-accessed soon if it has
been accessed recently. It manages a data structure conventionally called the LRU stack, in
which the Most Recently Used (MRU) page is at the stack top and the Least Recently Used
(LRU) page is at the stack bottom. The in-between pages in the stack strictly follow the
ordering of their last access times. To maintain the stack, the LRU algorithm has to move an
accessed page from its current position in the stack (assuming it is in the stack) to the
stack top. The LRU page at the stack bottom is the one to be replaced if there is a page
fault and no free space is available. In CLOCK, the memory spaces holding the pages can
be regarded as a circular buffer, and the replacement algorithm cycles through the pages
in order, like the minute hand of a clock. Each page is associated with a bit, called the
reference bit, which is set by hardware whenever the page is accessed. When it is necessary
to replace a page to service a page fault, the page pointed to by the hand is checked. If its
reference bit is unset, the page is replaced. Otherwise, the algorithm unsets its reference
bit and moves the hand on to the next page. Research and experience have shown
that CLOCK is a close approximation of LRU, and its performance characteristics are very
similar to those of LRU. So all the performance disadvantages of LRU discussed in the
following also apply to CLOCK.
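The CLOCK procedure described above can be sketched in a few lines. This is a minimal illustration rather than any operating system's implementation: the class and attribute names are hypothetical, the reference bit is set in software where real hardware would set it, and the sketch assumes that the faulting access itself sets the bit of the newly loaded page.

```python
class Clock:
    """Minimal CLOCK sketch: a circular buffer of frames with reference bits."""

    def __init__(self, nframes):
        self.pages = [None] * nframes    # page held by each frame
        self.ref = [False] * nframes     # reference bit of each frame
        self.frame_of = {}               # page -> frame index
        self.hand = 0

    def access(self, page):
        """Reference `page`; return True on a hit, False on a page fault."""
        if page in self.frame_of:
            self.ref[self.frame_of[page]] = True   # set by hardware in a real MMU
            return True
        # Page fault: advance the hand, clearing set bits, until the hand
        # points to a free frame or a frame whose reference bit is unset.
        while self.pages[self.hand] is not None and self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        if victim is not None:
            del self.frame_of[victim]
        self.pages[self.hand] = page
        self.ref[self.hand] = True       # assumption: the faulting access sets the bit
        self.frame_of[page] = self.hand
        self.hand = (self.hand + 1) % len(self.pages)
        return False

clock = Clock(3)
results = [clock.access(p) for p in [1, 2, 3, 4, 2, 5]]
resident = set(clock.frame_of)
```

In this run, page 2 is re-referenced before the fault on page 5, so the hand clears its bit and passes over it, and page 3 is replaced instead. The cost is paid only at page faults, which is what makes CLOCK affordable for VM management.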
The LRU assumption is valid for a significant portion of workloads, and LRU works
well for these workloads, which we call LRU-friendly workloads. The distance of a page in
the LRU stack from the stack top to its current position is called its recency, which is
the number of other distinct pages accessed after the last reference to the page. Assuming
an unlimitedly long LRU stack, the distance of a page from the stack top at the moment it is
accessed is called its re-use distance, indicating the number of other distinct pages accessed
between its last access and its current access. LRU-friendly workloads have two distinct
characteristics: (1) There are many more references with small re-use distances than with large
re-use distances. (2) Most references have re-use distances smaller than the available memory size
in terms of the number of pages. The locality exhibited in this type of workload is regarded
as strong, which ensures a high hit ratio and a steady increase of the hit ratio with the increase
of memory size.
However, there are occasions on which this assumption does not hold, where LRU
performance can be unacceptably degraded. One example access pattern is the memory scan,
which consists of a sequence of one-time page accesses. These pages actually have infinitely
large re-use distances and cause no hits. More seriously, in LRU the scan can flush all
the previously active pages out of memory. Linux, which uses a variant of CLOCK as its
replacement policy, faces a serious memory management challenge due to the scan
effect caused by accessing one-time or infrequently used file data on disks.
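The flushing effect of a scan can be reproduced with a small LRU simulation. The sketch below is illustrative only; the function name, page numbers, and sizes are hypothetical.

```python
from collections import OrderedDict

def lru_run(trace, size):
    """Return a list of hit (True) / miss (False) flags for an LRU cache of `size` pages."""
    cache, flags = OrderedDict(), []
    for page in trace:
        hit = page in cache
        flags.append(hit)
        if hit:
            cache.move_to_end(page)
        else:
            if len(cache) >= size:
                cache.popitem(last=False)   # evict the least recently used page
            cache[page] = True
    return flags

working_set = list(range(8))     # 8 active pages, which fit in 10 frames
scan = list(range(100, 120))     # 20 pages that are touched exactly once
trace = working_set * 3 + scan + working_set
flags = lru_run(trace, size=10)
hits_before_scan = sum(flags[:24])
hits_after_scan = sum(flags[-8:])
```

Before the scan, every re-reference to the working set hits. After the scan, the first pass over the working set misses on every page, because the one-time pages have pushed all active pages out of the cache.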
In Linux the memory management for process-mapped program memory and the file I/O
buffer cache is unified, so that memory can be flexibly allocated between them according
to their respective needs. Because of the unification, balancing the allocation between
program memory and the buffer cache poses a big problem. Here is a quote from Rik van Riel
at Red Hat Inc. [73] describing this problem: "... the amount of data on the file systems
tends to be several magnitudes larger than the amount of memory taken by the processes
in the system. This means that the number of accesses to pages from the file cache could
overwhelm the total number of accesses to the pages of the processes, even though the
individual pages of the processes get accessed more frequently than most file cache pages. In
other words, the system can end up evicting frequently accessed pages from memory in favor
of a mass of recently but far less frequently accessed pages." An example scenario is
that after extracting a large tarball, a user may find that the computer becomes much slower,
because the previously active working set has been replaced and has to be faulted in again.
To address this problem in a simple way, current Linux versions have to introduce some “magic
parameters” to keep the buffer cache allocation within the range of 1% to 15% of memory
size. However, this approach does not fundamentally solve the problem, because one major
cause of the allocation imbalance between process memory and the buffer cache is the
inefficiency of the replacement policy in dealing with infrequently accessed pages in buffer caches.
Another example access pattern defeating LRU is the loop, where a set of pages is accessed
cyclically. Loop and loop-like access patterns dominate the memory access behavior of many
programs, particularly in scientific computation applications. If the pages involved in the loop,
along with other pages accessed in a cycle, cannot completely fit in memory, there can
be repeated page faults and no hits at all. The most cited example [30, 67] of the loop
problem is that even if you have a memory of 100 pages to hold 101 pages of data, the hit ratio
would be ZERO if you loop over this data set!
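The 100-page versus 101-page example can be checked directly with a short LRU simulation. The function name and the number of loop iterations below are hypothetical choices for illustration.

```python
from collections import OrderedDict

def lru_hit_ratio(trace, size):
    """Hit ratio of an LRU cache with `size` page frames on the given trace."""
    cache, hits = OrderedDict(), 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)
        else:
            if len(cache) >= size:
                cache.popitem(last=False)   # evict the least recently used page
            cache[page] = True
    return hits / len(trace)

loop = list(range(101)) * 10            # loop ten times over 101 pages
ratio_100 = lru_hit_ratio(loop, 100)    # one frame short of the loop length
ratio_101 = lru_hit_ratio(loop, 101)    # the whole loop fits
```

With 100 frames, each page is evicted just before it is needed again, so not a single reference hits. With 101 frames, only the first iteration misses.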

3.1.3 LIRS and its Performance Advantages

A recently proposed breakthrough replacement algorithm, namely LIRS (Low Inter-reference
Recency Set) [33], removes all the aforementioned LRU performance disadvantages while still
maintaining a low cost close to LRU. It can not only overcome the side-effects of scan and
loop accesses, b u t also accurately differentiate the pages based on their locality strengths
quantified by re-use distance.
The key difference between LIRS and LRU in handling access history is that LIRS uses
re-use distance, rather than recency as in LRU, for the replacement decision. A
page with a large re-use distance will be replaced even if it has a small recency. For instance,
when a one-time-use page is accessed in a memory scan, LIRS will replace it quickly
because its re-use distance is infinite, even though its recency is very small. To retain the
low-cost merit of LRU, LIRS does not explicitly record and compare the re-use distances of
accessed pages, but dynamically categorizes the pages into two sets: one for pages with small
re-use distances, called the LIR set, and another for pages with large re-use distances, called the HIR
set. In LIRS, a page in the LIR set is cached and cannot be replaced until it proves
ineligible to stay in the LIR set because of its large recency. On the other hand, the pages
in the HIR set will be replaced soon after they are faulted in. An HIR page must generate a
relatively small re-use distance to turn into an LIR page and then enjoy the privilege of staying
in memory for a relatively long period of time. In contrast, LRU lacks these insights:
all accessed pages are indiscriminately cached until they are either re-accessed while they
are in the stack or replaced at the bottom of the stack, without considering which of the
two cases is more likely to happen. For infrequently accessed pages, which are most
likely to be replaced at the stack bottom without being re-accessed in the stack, holding
them in memory (as well as in the stack) is a waste of memory resources.
This explains the LRU misbehavior with access patterns of weak locality.
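To make the distinction between recency and re-use distance concrete, here is a small sketch that computes both on a toy trace. The helper names `reuse_distances` and `recency`, and the trace, are hypothetical and chosen only for illustration; a first access is reported with distance `None`, standing for an infinite re-use distance. The quadratic-time computation is for clarity, not efficiency.

```python
def reuse_distances(trace):
    """For each access, return (page, number of distinct other pages accessed
    since the previous access to that page), or (page, None) on a first access."""
    last_pos = {}
    result = []
    for i, page in enumerate(trace):
        if page in last_pos:
            between = trace[last_pos[page] + 1:i]
            result.append((page, len(set(between))))
        else:
            result.append((page, None))  # never seen before: infinite distance
        last_pos[page] = i
    return result


def recency(trace, page):
    """Number of distinct other pages accessed since the last access to `page`."""
    last = len(trace) - 1 - trace[::-1].index(page)
    return len(set(trace[last + 1:]))


# Hypothetical trace: pages A and B are re-used, then a one-time scan S1..S3.
trace = ["A", "B", "A", "B", "S1", "S2", "S3"]

print(reuse_distances(trace)[2])  # ('A', 1): A was re-used after one other page
print(recency(trace, "A"))        # 4: B, S1, S2, S3 were accessed since A
print(recency(trace, "S3"))       # 0: S3 is the most recently used page
```

At the end of the trace LRU ranks S3 as the most valuable page because its recency is smallest, while a policy based on re-use distance would prefer to keep A and B, whose last observed re-use distance is 1, over the scanned pages whose re-use distance is still infinite.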
The performance advantages of LIRS are impressive when it is compared with other
recently proposed replacement algorithms, including DEAR [19], AFC [18], UBM [42], 2Q
[37], LRU-2 [57], SEQ [30], LRFU [45], EELRU [67], and ARC [51]. The advantages include:
(1) Unlike DEAR, AFC, UBM, and SEQ, LIRS does not depend on explicitly detecting
the access regularities on which LRU fails in order to improve on LRU performance.
(2) Unlike LRU-2, LRFU, and EELRU, LIRS has an O(1) overhead, and its cost is actually
very close to that of LRU. (3) Unlike 2Q, SEQ, and ARC, LIRS is able to remove the LRU problems in
a broad spectrum of workloads with scan, loop, and various changing access patterns. The
ability of LIRS to effectively and intelligently replace infrequently accessed pages in
buffer caches has drawn attention from industry. Here is a brief comment by Rik
van Riel on LIRS [73]:

“the facts that LIRS would make the file cache vs process memory
balancing automatic and that LIRS would also do the right thing as a second level cache ...
make the implementation of LIRS for Linux a promising future experiments.”
In this chapter, we describe a VM page replacement algorithm, called CLOCK-Pro,
to take the place of CLOCK. It meets both the performance demands of application
users and the low-overhead requirement of system designers. CLOCK-Pro integrates the
principle of LIRS with the way in which CLOCK works. CLOCK-Pro has the following
features: (1) CLOCK-Pro works in a fashion similar to CLOCK, and its cost is easily
affordable in VM management. (2) CLOCK-Pro brings all the much-needed performance
advantages of LIRS into CLOCK. (3) Without any pre-determined parameters, CLOCK-Pro
adapts to changing access patterns to serve a broad spectrum of workloads. (4)
Through extensive simulations on real-life I/O and VM traces, we have shown significant
performance improvements of CLOCK-Pro over CLOCK as well as other representative VM
replacement algorithms.

3.2 Related Work

A large number of new replacement algorithms have been proposed over the years,
especially in the last fifteen years. Almost all of them target the performance
problems of LRU. In general, three approaches are taken in these algorithms:
(1) requiring applications to explicitly provide future access hints, such as
application-controlled file caching [11] and application-informed prefetching and caching [59];
(2) explicitly detecting the access patterns that fail LRU and adaptively switching to other effective
replacements, such as SEQ [30], EELRU [67], AFC [18], and UBM [42]; (3) tracing and
utilizing deeper history access information, such as FBR [65], LRFU [45], LRU-2 [57], 2Q
[37], MQ [81], LIRS [33], and ARC [51]. A more elaborate description and analysis of these
algorithms can be found in [33]. The algorithms taking the first two approaches usually place
too many constraints on the applications they are designed to serve to be applicable in the
VM of a general-purpose OS. For example, SEQ is designed to work in VM management,
and it only does its job when there is a page fault. However, its performance depends on
the effective detection of long sequential address reference patterns, on which LRU could
behave poorly. Thus, the mechanism it uses makes SEQ lose generality. For instance,
it is hard for SEQ to detect loop accesses over linked lists, or accesses by one application
to a sequence of pages when the sequence is randomly interleaved with accesses to the pages of
other applications. Among the algorithms taking the third approach, FBR, LRU-2, LRFU,
and MQ are too costly even compared with LRU. The performance of 2Q has been shown to
be very sensitive to its parameters and could be much worse than LRU. LIRS and ARC are
the two most promising candidate algorithms that could be applied in VM, because they
use data structures and operations similar to those of LRU and their cost is also close to that
of LRU. Both have the potential to produce approximation versions for VM while keeping
their respective performance advantages.
ARC maintains two variable-sized lists holding history access information of referenced
pages. Their combined size is twice the memory size in pages. So ARC not
only records the information of cached pages, but also keeps track of the same number of
replaced pages. The first list contains pages that have been touched only once recently (cold
pages) and the second list contains pages that have been touched at least twice recently (hot
pages). The cache space allocated to the pages in these two lists is adaptively changed,
depending on the list in which recent misses happen. More cache space will serve cold pages
(resp. hot pages) if there are more misses in the first list (resp. in the second list). However,
though ARC allocates memory to hot/cold pages adaptively according to the ratio of cold/hot page
accesses and excludes tunable parameters, the locality of pages in the two lists, which are supposed to
hold cold and hot pages respectively, cannot be directly and consistently compared. So hot
pages in the second list could have a weaker locality in terms of re-use distance than cold
pages in the first list. For example, a page that is regularly accessed with a re-use distance a
little larger than the memory size has no hits at all in ARC, while a page in the second list can
stay in memory without any accesses after it has been accepted into the list. This does not
happen in LIRS, because all pages, whether supposed to be hot (LIR pages) or cold (HIR pages), are
placed in the same list and compared in a consistent fashion. Any LIR/HIR status changes
are conducted responsively. There is one pre-determined parameter in the LIRS algorithm:
the amount of memory allocated to HIR pages. In CLOCK-Pro, the parameter is removed
and the allocation becomes fully adaptive to the current access patterns.
Compared with the research on general replacement algorithms targeting LRU, the
work specific to VM replacement and targeting CLOCK is much less and inadequate.
While Second Chance (SC) [70], the simplest kind of CLOCK algorithm, utilizes only one
reference bit to indicate recency, other CLOCK variants introduce a finer distinction
between page access histories. In a generalized CLOCK version called GCLOCK [69], a counter
is associated with each page rather than a single bit. The counter is incremented if the
page is hit. The circulating clock hand sweeps through the pages, decrementing their counters
until a page with a count of zero is found for replacement. In Linux and FreeBSD, a similar
mechanism called page aging is used. The counter is called age in Linux or act_count in
FreeBSD. When scanning through memory for pages to replace, the page age is increased
by a constant if its reference bit is set. Otherwise its age is decreased by a constant. One
problem with this kind of design is that the performance improvements are not consistent,
and “can be either better or worse than LRU” [55]. The parameters for setting the maximum
value of counters or adjusting ages are mostly decided empirically. Another problem is that
they consume too many CPU cycles and adjust slowly to changes in the access patterns,
as evidenced in Linux kernel 2.0. Recently, Bansal and Modha provided an
approximation version of ARC, called CAR [6], which has a cost close to that of CLOCK. Their simulation
tests on I/O traces indicate that CAR has a performance similar to ARC. Our simulation
experiments on both I/O and VM traces show that CLOCK-Pro has significantly better
performance than CAR.
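The GCLOCK mechanism described above can be sketched as follows. This is an illustrative toy, not the implementation from [69]: the class name `GClock`, the counter cap `max_count`, and the choice to start a newly loaded page with a count of zero are all assumptions made for the example.

```python
class GClock:
    """Toy GCLOCK: a per-page counter replaces the single reference bit."""

    def __init__(self, capacity, max_count=3):
        self.capacity = capacity
        self.max_count = max_count  # assumed cap; such limits are set empirically
        self.frames = []            # page held by each frame, in clock order
        self.count = {}             # page -> counter
        self.hand = 0

    def access(self, page):
        """Return True on a hit and False on a page fault."""
        if page in self.count:
            # A hit increments the counter of the page.
            self.count[page] = min(self.count[page] + 1, self.max_count)
            return True
        if len(self.frames) < self.capacity:
            self.frames.append(page)
        else:
            # Sweep, decrementing counters, until a page with count zero is found.
            while self.count[self.frames[self.hand]] > 0:
                self.count[self.frames[self.hand]] -= 1
                self.hand = (self.hand + 1) % self.capacity
            del self.count[self.frames[self.hand]]
            self.frames[self.hand] = page
            self.hand = (self.hand + 1) % self.capacity
        self.count[page] = 0        # assumption: new pages start at zero
        return False


cache = GClock(capacity=2)
results = [cache.access(p) for p in "ABAC"]
print(results)              # [False, False, True, False]
print(sorted(cache.frames)) # ['A', 'C']: the hit on A protected it from eviction
```

With `max_count=1` and new pages starting at zero, the sketch degenerates to a one-bit CLOCK, which is one way to see GCLOCK as a generalization.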
Because the design of VM replacement can hardly benefit from the work on
improving LRU, owing to the strict VM cost requirement, it remains a demanding challenge
in OS design.

3.3 Description of CLOCK-Pro

3.3.1 Main Idea

CLOCK-Pro takes the same principle as that of LIRS: it uses the re-use distance (called
IRR in the LIRS replacement algorithm) rather than recency in its replacement decision.
When a page is accessed, its re-use distance is the period of time, in terms of the number
of other distinct pages accessed, since its last access. Although there is a re-use distance
between any two consecutive references to a page, only the most recent distance is relevant
to the replacement decision. We use the re-use distance of a page right at the time of its
access to characterize it either as a cold page if it has a large re-use distance, or as a hot page
if it has a small re-use distance. Then we mark its status as either cold or hot. We place
all the accessed pages, either hot or cold, into the same list² in the order of their accesses³,
with the pages with small recency at the list head and the pages with large recency at the
list tail.
² Actually it is the directory entries that are placed in the list. However, for simplicity we say “a page in
the list” instead of explicitly “the entry of a page in the list”.
³ Actually we can only maintain an approximate access order, because we cannot update the list on a
hit access in a VM replacement algorithm and thus lose the exact access ordering between page faults.


To give the cold pages a chance to compete with the hot pages, and to ensure that their cold/hot
statuses accurately reflect their current access behaviors, we grant each cold page a test
period once it is accepted into the list. If it is accessed during its test period, the cold
page turns into a hot page. If the cold page passes the test period without a re-access, it
will leave the list. Note that a cold page in its test period can be replaced out of
memory, but its page entry will remain in the list for test purposes until the end of the
test period or until it is re-accessed.
The key question here is how to set the length of the test period. When a cold page is in
the list and there is still at least one hot page after it (i.e., with a larger recency), it can
turn into a hot page if it is accessed, because it has a new re-use distance smaller than that of the
hot page(s) after it. Accordingly, the hot page with the largest recency should turn into a
cold page. So the test period should be set as the largest recency of the hot pages. If we
make sure that the hot page with the largest recency is always at the list tail, and that all the
cold pages that pass this hot page terminate their test periods, then the test period of a
cold page is equal to the time before it passes the tail of the list. So a non-resident
cold page can be removed from the list right after it reaches the tail of the list. In practice,
we could shorten the test period and limit the number of cold pages in the test period to
save space. By implementing this test mechanism, we ensure that “cold” and “hot” are
defined by relative and constant comparison, not by a fixed threshold. This makes
CLOCK-Pro distinct from prior work, including 2Q and ARC, which attempts to
use a constant threshold to distinguish the two types of pages and treats them differently
in separate lists. Unfortunately, this makes these algorithms share the performance
weakness of LRU.


[Figure 3.1: a circular clock of page entries with three hands, HAND_hot, HAND_cold, and HAND_test.]

Figure 3.1: There are three types of pages in CLOCK-Pro: hot pages marked as “H”, resident cold
pages marked as “C”, and non-resident cold pages marked as shadowed blocks with “C”. Around the
clock, there are three hands: HAND_hot, pointing to the list tail (i.e., the last hot page) and searching
for a hot page to turn into a cold page; HAND_cold, pointing to the last resident cold page and searching
for a cold page to replace out of memory; and HAND_test, pointing to the last cold page in the test
period, terminating the test periods of cold pages, and removing non-resident cold pages that have
passed the test period from the list. The attached black dots represent reference bits of 1.
When it is necessary to generate a free space, we replace a resident cold page.

3.3.2 Data Structure

Let us first assume that the memory allocations for hot and cold pages, m_h and m_c, respectively,
are fixed, where m_h + m_c is the total memory size m (m = m_h + m_c). The number of hot
pages is also m_h, so all the hot pages are cached at any time. For a hot page to be replaced,
it must first change into a cold page. Apart from the hot pages, all the other accessed pages are
categorized as cold pages. Among the cold pages, m_c pages are cached, and at most m additional
non-resident cold pages have only their history access information cached. So in total there
are at most 2m directory entries for page access history in the list. As in CLOCK,
all the page entries are organized as a circular linked list, shown in Figure 3.1. For each
page, there is a cold/hot status associated with it. For each cold page, there is a flag to
indicate whether it is in the test period.
In CLOCK-Pro, there are three hands. HAND_hot points to the hot page
with the largest recency. The position of this hand actually serves as a threshold for being
a hot page. Any hot pages swept by the hand turn into cold pages. For convenience
of presentation, we call the page pointed to by HAND_hot the tail of the list, and the
page immediately before the tail page in the clockwise direction the head of the list.
HAND_cold points to the last resident cold page (i.e., the one furthest from the list head).
Because we always select a cold page for replacement, this is the position where we start
to look for a victim page, equivalent to the hand in CLOCK. HAND_test points
to the last cold page in the test period. This hand serves to terminate the test periods of
cold pages. The non-resident cold pages swept by this hand will leave the list. All the hands
move in the clockwise direction.
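The bookkeeping described in this section can be summarized in a small data-structure sketch. The type and field names below (`PageEntry`, `ClockProState`, `hand_hot`, and so on) are hypothetical and only mirror the prose; they are not taken from an actual kernel or simulator implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageEntry:
    """One directory entry in the circular list (hypothetical layout)."""
    page: int
    hot: bool = False      # cold/hot status
    resident: bool = True  # False for a non-resident cold page (history only)
    in_test: bool = False  # meaningful for cold pages only
    ref: bool = False      # reference bit, set by hardware on an access
    # Neighbour in the direction the hands move; excluded from repr/eq
    # because the list is circular.
    next: Optional["PageEntry"] = field(default=None, repr=False, compare=False)


@dataclass
class ClockProState:
    m: int                 # total memory size in pages
    m_c: int               # allocation for resident cold pages
    hand_hot: Optional[PageEntry] = None
    hand_cold: Optional[PageEntry] = None
    hand_test: Optional[PageEntry] = None

    @property
    def m_h(self) -> int:
        """Allocation for hot pages, since m = m_h + m_c."""
        return self.m - self.m_c

    @property
    def max_entries(self) -> int:
        """m_h hot + m_c resident cold + at most m non-resident cold entries."""
        return self.m_h + self.m_c + self.m


state = ClockProState(m=100, m_c=10)
print(state.m_h, state.max_entries)  # 90 200

# A two-entry ring: following `next` twice returns to the starting entry.
a, b = PageEntry(page=1, hot=True), PageEntry(page=2, in_test=True)
a.next, b.next = b, a
state.hand_hot, state.hand_cold, state.hand_test = a, b, b
```

The `max_entries` property simply restates the 2m bound on directory entries derived in the text.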

3.3.3 Operations on Searching Victim Pages

Just as in CLOCK, there are no operations in CLOCK-Pro on page hits; only the reference
bits of the accessed pages are set by hardware. Before we see how a victim page is selected,
let us examine how the three hands move around the list (clock), because the victim page is
found by coordinating the movements of the hands.
HAND_hot is moved when a cold page is accessed in its test period and thus
turns into a hot page. Accordingly, we need to turn the hot page with the largest recency
into a cold page. If the reference bit of the hot page pointed to by the hand is unset,
we can simply change its status and then move the hand forward. However, if the bit is set,
which indicates that the page has been re-accessed, we spare this page, reset its reference bit,
and keep it as a hot page. This is because the actual access time of the hot page could be
earlier than that of the cold page. Then we move the hand forward to examine the next page, until
it encounters a hot page with a reference bit of zero. That hot page then
turns into a cold page. Whenever the hand encounters a cold page, it terminates the page's test
period and removes the cold page from the list if it is non-resident (the most probable case).
This actually does the work assigned to HAND_test. Finally, the hand stops at a hot
page.
We keep track of the number of non-resident cold pages. Once the number exceeds m,
the memory size in number of pages, we terminate the test period of the cold page pointed to by
HAND_test and remove it from the list if it is non-resident, because the cold page has
used up its test period without a re-access and has no chance to turn into a hot page with
its next access. HAND_test then moves forward and stops at the next cold page.
HAND_cold is used to search for a resident cold page for replacement. If the reference bit of
the resident cold page currently pointed to by HAND_cold is unset, we replace the cold page for
a free space. Otherwise, if its bit is set and it is in its test period, we turn the cold page into
a hot page, move it to the list head, and ask HAND_hot for its actions, because an access
during the test period indicates a competitively small re-use distance. Note that the replaced
cold page will remain in the list as a non-resident cold page until it runs out of its test period.
The hand keeps moving until it encounters a cold page eligible for replacement, and
stops at the next resident cold page.
When there is a page fault, the faulted page must be a cold page. We first run HAND_cold
for a free space. If the cold page is not in the list, its re-use distance is highly likely to
be larger than the recency of the hot pages⁴. So the page is still categorized as a cold page and
is placed at the list head. It also initiates its test period. If the number of cold pages is
larger than the threshold, we run HAND_test. If the cold page is in the list⁵, the faulted
page turns into a hot page and is placed at the head of the list. We run HAND_hot to turn
a hot page with a large recency into a cold page.
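As an illustration of the hand movements, the sweep of HAND_hot described in this section can be sketched on a plain Python list standing in for the circular list. The names `Entry` and `run_hand_hot` are hypothetical, index order stands for the direction in which the hand moves, and the sketch assumes at least two hot pages so that the hand always has a hot page to stop at. It covers only HAND_hot, not the full algorithm.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    page: str
    hot: bool
    resident: bool = True
    in_test: bool = False
    ref: bool = False


def run_hand_hot(ring, hand):
    """Demote one hot page to cold and return the new hand index.

    `ring` is modified in place: non-resident cold pages swept by the hand
    are removed, and swept cold pages have their test period terminated.
    """
    if sum(e.hot for e in ring) < 2:
        raise ValueError("sketch assumes at least two hot pages in the list")
    demoted = False
    while True:
        entry = ring[hand]
        if entry.hot:
            if demoted:
                return hand               # finally the hand stops at a hot page
            if entry.ref:
                entry.ref = False         # re-accessed: spare it, keep it hot
            else:
                entry.hot = False         # unreferenced hot page turns cold
                demoted = True
            hand = (hand + 1) % len(ring)
        else:
            entry.in_test = False         # terminate the test period
            if entry.resident:
                hand = (hand + 1) % len(ring)
            else:
                del ring[hand]            # non-resident cold page leaves the list
                hand %= len(ring)


ring = [
    Entry("h1", hot=True, ref=True),
    Entry("c1", hot=False, resident=False, in_test=True),
    Entry("h2", hot=True),
    Entry("c2", hot=False, in_test=True),
    Entry("h3", hot=True),
]
hand = run_hand_hot(ring, 0)
print([e.page for e in ring], ring[hand].page)  # ['h1', 'h2', 'c2', 'h3'] h3
```

In this run h1 is spared because its reference bit was set, the non-resident c1 is dropped from the list, h2 is demoted to a cold page, c2 loses its test period, and the hand comes to rest on h3.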

3.3.4 Making CLOCK-Pro Adaptive

Until now, we have assumed that the memory allocations for hot and cold pages are fixed.
In LIRS, there is a pre-determined parameter, called L_hirs, that determines the percentage
of memory used by HIR pages. As shown in [33], L_hirs actually affects how differently
LIRS behaves from LRU. When L_hirs approaches 100%, the replacement
behavior of LIRS, as well as its hit ratios, are close to those of LRU. Although the evaluation of the LIRS
algorithm indicates that its performance is not sensitive to L_hirs variations within a large
range between 1% and 30%, it also shows that the hit ratios of LIRS could be moderately
lower than those of LRU for LRU-friendly workloads (i.e., with strong locality), and that increasing L_hirs
could eliminate the performance gap.
In CLOCK-Pro, resident cold pages are actually managed in the same way as in CLOCK. HAND_cold
behaves the same as the clock hand in CLOCK: sweeping across the pages while
sparing a page with a reference bit of 1 and replacing one with a reference bit of
0. So increasing m_c, the size of the allocation for cold pages, makes CLOCK-Pro behave
more like CLOCK. Let us see the performance implications of changing the memory allocation
in CLOCK-Pro. To overcome the CLOCK performance disadvantages with weak-locality access
patterns such as scan and loop, a small m_c value means quick eviction of cold pages just
faulted in and strong protection of hot pages from the interference of cold pages.
However, for strong locality accesses, almost all the accessed pages have relatively small re-use
distances, but some of the pages have to be categorized as cold pages. With a small m_c,
a cold page would have to be replaced out of memory soon after being loaded in, and then
loaded into memory again as a hot page by an additional page fault during its test period.
Increasing m_c would allow these cold pages to be cached for a longer time and make them
more likely to be re-accessed before being replaced. So they can save the additional
page faults.
⁴ We cannot guarantee the largeness because there are no operations on hits in CLOCK-Pro and we limit
the number of cold pages in the list. But our experimental results show that this approximation minimally affects
the performance of CLOCK-Pro.
⁵ The cold page must be in its test period. Otherwise, it must have been removed from the list.
For a given re-use distance of an accessed cold page, m_c decides the probability that the
page is re-accessed before being replaced during its test period. For a cold page with a
re-use distance larger than its test period, retaining the page in memory with a large
m_c is a waste of buffer space. On the other hand, for a page with a small re-use distance,
retaining the page in memory for a longer time with a large m_c would save an additional
page fault. In the adaptive CLOCK-Pro, we allow m_c to dynamically adjust to the current
re-use distance distribution. If a cold page is accessed during its test period, we increment
m_c by 1. If a cold page passes its test period without a re-access, we decrement m_c by
1. Note that the aforementioned cold pages include both resident and non-resident cold pages. With
this adaptation, CLOCK-Pro can take both the LRU advantages with strong locality
and the LIRS advantages with weak locality.
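The adaptation rule above amounts to a one-step counter update. A minimal sketch follows; the function name `adapt_m_c` and the clamping bounds (keeping m_c between 1 and m) are assumptions added for the example, since the text specifies only the increments and decrements.

```python
def adapt_m_c(m_c, m, accessed_in_test):
    """Return the new cold-page allocation after one test-period outcome.

    accessed_in_test=True  : a cold page was accessed during its test period.
    accessed_in_test=False : a cold page passed its test period without re-access.
    The bounds 1 <= m_c <= m are an assumption of this sketch.
    """
    if accessed_in_test:
        return min(m_c + 1, m)
    return max(m_c - 1, 1)


m, m_c = 100, 10
for outcome in [True, True, False, True]:  # hypothetical sequence of outcomes
    m_c = adapt_m_c(m_c, m, outcome)
print(m_c)  # 12
```

A run of accesses within test periods, typical of strong locality, pushes m_c up and makes the policy behave more like CLOCK; a run of expired test periods, typical of scans and long loops, shrinks m_c and strengthens the protection of hot pages.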

3.4 Performance Evaluation

To evaluate CLOCK-Pro and to demonstrate its performance advantages, we use trace-driven
simulations on various types of workloads to compare it with other algorithms,
including CLOCK, LIRS, CAR, and OPT. CAR [6] is an approximation of ARC [51]. OPT
is an optimal, offline, but unimplementable replacement algorithm [7].
Our simulation experiments are conducted in three steps with different kinds of workload
traces. Because LIRS was originally proposed as an I/O buffer cache replacement algorithm, in
the first step we test the replacement algorithms on I/O traces to see how well CLOCK-Pro
retains the LIRS performance advantages, as well as how it performs with typical I/O
access patterns. In the second step, we test the algorithms on the VM traces of application
program executions. Integrated VM management of file cache and program
memory, such as that implemented in Linux, is always desired, but raises the concern of
mistreatment of file data and process pages, as mentioned in Section 3.1.2. So in the third step,
we test the algorithms on aggregated VM and I/O traces to see how these algorithms
respond to the integration.
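For readers unfamiliar with trace-driven simulation, the following sketch shows the general shape of such an experiment for two of the policies named above, CLOCK and OPT. It is not the simulator used for the results in this chapter: the function names `clock_hits` and `opt_hits`, the toy trace, and the choice to load new pages with a reference bit of 0 are assumptions of the example.

```python
def clock_hits(trace, capacity):
    """Count hits of a one-bit CLOCK policy on a page reference trace."""
    frames = [None] * capacity
    ref = [False] * capacity
    where = {}                          # page -> frame index
    hand = hits = 0
    for page in trace:
        if page in where:
            ref[where[page]] = True     # a hit only sets the reference bit
            hits += 1
            continue
        while ref[hand]:                # spare referenced pages, clearing bits
            ref[hand] = False
            hand = (hand + 1) % capacity
        if frames[hand] is not None:
            del where[frames[hand]]     # evict the victim
        frames[hand] = page
        where[page] = hand
        hand = (hand + 1) % capacity
    return hits


def opt_hits(trace, capacity):
    """Count hits of the offline optimal policy (evict the page used furthest ahead)."""
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):   # precompute next reference times
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i
    cache = {}                                # page -> time of its next reference
    hits = 0
    for i, page in enumerate(trace):
        if page in cache:
            hits += 1
        elif len(cache) >= capacity:
            del cache[max(cache, key=cache.get)]
        cache[page] = next_use[i]
    return hits


trace = list("ABCABC")                        # a short loop over three pages
print(clock_hits(trace, 2), opt_hits(trace, 2))  # 0 2
```

Even on this tiny loop the gap between an online recency-based policy and the offline optimum is visible, which is why OPT serves as the upper bound in the comparisons that follow.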

3.4.1 Simulation on Buffer Cache for File I/O

The I/O traces used in this section are from [33], where they were used for the LIRS evaluation. In that
comprehensive performance evaluation, the traces are categorized into four groups based on
their access patterns, namely loop, probabilistic, temporally-clustered, and mixed patterns.
Here we select one representative trace from each of the groups for the replacement
evaluation, and briefly describe them.


1. glimpse is a text information retrieval utility trace. The total size of the text files used
as input is roughly 50 MB. The trace belongs to the loop pattern.

2. cpp is a GNU C compiler pre-processor trace. The total size of the C source programs
used as input is roughly 11 MB. The trace belongs to the probabilistic pattern.

3. sprite is from the Sprite network file system, and contains requests to a file server
from client workstations over a two-day period. The trace belongs to the temporally-clustered pattern.

4. multi2 is obtained by executing three workloads, cs, cpp, and postgres, together. The
trace belongs to the mixed pattern.

These are small-scale traces with clear access patterns. We use them to investigate the
implications of various access patterns for the algorithms. The hit ratios of glimpse and
multi2 are shown in Figure 3.2. To help readers clearly see the hit ratio differences among the
algorithms, we list the hit ratios of cpp and sprite in Tables 3.1 and 3.2, respectively. For
LIRS, the memory allocation to HIR pages (L_hirs) is set to 1% of the memory size, the same
value as used in [33]. There are several observations we can make from the experiments.

[Figure 3.2: two panels, GLIMPSE and MULTI2, plotting hit ratio against Memory Size (# of blocks) for OPT, CLOCK-Pro, LIRS, CAR, and CLOCK.]

Figure 3.2: Hit ratios of the replacement algorithms OPT, CLOCK-Pro, LIRS, CAR, and CLOCK
on workloads glimpse and multi2.

First, even though CLOCK-Pro does not responsively deal with hit accesses, in order to meet the
cost requirement of VM management, the hit ratios of CLOCK-Pro and LIRS are very close,
which shows that CLOCK-Pro effectively retains the performance advantages of LIRS. For
workloads glimpse and multi2, which contain many loop accesses, LIRS with a small L_hirs
is most effective, and the hit ratios of CLOCK-Pro are a little lower than those of LIRS. However, for the
LRU-friendly workload sprite, which consists of strong locality accesses, the performance of
LIRS could be lower than that of CLOCK (see Table 3.2). With its memory allocation adaptation,
CLOCK-Pro improves on the LIRS performance.
Figure 3.3 shows the percentage of memory allocated to cold pages during the
executions of multi2 and sprite for a memory size of 600 pages. We can see that
for sprite the allocations for cold pages are much larger than the 1% of memory used in LIRS,
and that the allocation fluctuates over time, adapting to the changing access patterns. It
sounds paradoxical that we need to increase the cold page allocation when there are many hot
page accesses in a strong locality workload. Actually, only the truly cold pages with large
re-use distances should be managed in a small cold allocation for quick replacement.
The so-called “cold” pages could actually be hot pages in strong locality workloads, because the
number of so-called “hot” pages is limited by their allocation. So the quick replacement of
these pseudo-cold pages should be avoided by increasing the cold page allocation. We can
see that the cold page allocations for multi2 are lower than those for sprite, which is consistent with the
fact that the multi2 access patterns consist of many long loop, weak locality accesses.


Pages    OPT  CLOCK-Pro   LIRS    CAR  CLOCK
   20   26.4       23.9   24.2   17.6    0.6
   35   46.5       41.2   42.4   26.1    4.2
   50   62.8       53.1   55.0   37.5   18.6
   80   79.1       71.4   72.8   70.1   60.4
  100   82.5       76.2   77.6   77.0   72.6
  200   86.0       84.0   84.3   84.8   81.8
  300   86.5       85.1   85.0   85.6   83.5
  400   86.5       85.7   85.6   85.7   84.3
  500   86.5       85.9   85.9   85.8   84.7
  600   86.5       86.2   86.2   86.0   85.0
  700   86.5       86.3   86.3   86.3   85.4
  800   86.5       86.4   86.4   86.4   85.2
  900   86.5       86.4   86.4   86.4   85.7

Table 3.1: Hit ratios (%) of the replacement algorithms OPT, CLOCK-Pro, LIRS, CAR, and CLOCK
on workload cpp.

Pages    OPT  CLOCK-Pro   LIRS    CAR  CLOCK
  100   50.8       24.8   25.1   26.1   22.8
  200   68.9       45.2   44.7   43.0   43.5
  300   78.8       58.8   58.6   59.1   59.5
  400   84.6       70.1   69.5   70.5   70.9
  500   87.9       77.5   76.0   77.7   78.3
  600   89.9       82.4   80.9   82.1   83.3
  700   91.3       85.3   83.8   85.3   86.0
  800   92.2       87.6   85.6   87.3   88.1
  900   92.8       88.8   86.8   88.8   89.4
 1000   93.2       89.7   87.6   89.6   90.4

Table 3.2: Hit ratios (%) of the replacement algorithms OPT, CLOCK-Pro, LIRS, CAR, and CLOCK
on workload sprite.

[Figure 3.3: two panels, one per workload, plotting the percentage of memory allocated to cold pages against Virtual Time (# of Pages).]

Figure 3.3: Adaptively changing the percentage of memory allocated to the cold pages in workloads
multi2 and sprite.
Second, regarding the performance differences among the algorithms, CLOCK-Pro and LIRS
have much higher hit ratios than CAR and CLOCK for glimpse and multi2, and are close
to the optimal ones. For strong locality accesses like sprite, there is little improvement
for either CLOCK-Pro or CAR. This is why CLOCK has won its popularity, considering
its extremely simple implementation and low cost.
Third, even with a built-in memory allocation adaptation mechanism, CAR cannot provide
consistent improvements over CLOCK, especially for weak locality accesses, for which a fix
is most needed in LRU. As we have analyzed, this is because CAR, like ARC, lacks a
consistent mechanism for comparing locality strengths.

3.4.2 Simulation on Memory for Program Executions

In this section, we use traces of the memory accesses of program executions to evaluate
the performance of the algorithms. All the traces used here are also used in [29], and many
of them are also used in [30, 67]. However, we do not include the performance results of
SEQ and EELRU, because of concerns about their generality or cost for VM management.
Interested readers are referred to the respective papers for the detailed performance results
of SEQ and EELRU, and can compare them with those of CLOCK-Pro and CAR. Here
we simply note that CLOCK-Pro provides better or comparable performance relative to SEQ and
EELRU.

Program    Description                               Size      Max. Mem. Demand (KB)
applu      Solve 5 coupled parabolic/elliptic PDE     1,068    14,524
blizzard   Binary rewriting tool for software DSM     2,122    15,632
coral      Deductive database evaluating query        4,327    20,284
gnuplot    PostScript graph generation                4,940    62,516
ijpeg      Image conversion into IJPEG format        42,951     8,260
m88ksim    Microprocessor cycle-level simulator      10,020    19,352
murphi     Protocol verifier                          1,019     9,380
perl       Interpreted scripting language            18,980    39,344
sor        Successive over-relaxation on a matrix     5,838    70,930
swim       Shallow water simulation                     438    15,016
trygtsl    Tridiagonal matrix calculation               377    69,688
wave5      Plasma simulation                          3,774    28,700

Table 3.3: A brief description of the benchmark programs (“Size” is in millions of
instructions).

Table 3.3 summarizes all the program traces used in this chapter. For detailed program
descriptions, space-time memory access graphs, and the trace collection methodology, readers
are referred to [29, 30]. These traces cover a wide range of access patterns. After
observing the memory access graphs drawn from the collected traces, the authors of
[30] categorized the programs coral, m88ksim, and murphi as having “no clearly visible
patterns”, with all accesses temporally clustered together; the programs blizzard,
perl, and swim as having “patterns at a small scale”; and the rest of the programs
as having “clearly-exploitable, large-scale reference patterns”. If we examine the program
access behaviors in terms of re-use distance, the programs in the first category belong to
the strong locality workloads, those in the second category belong to the moderate locality
workloads, and the remaining programs, in the third category, belong to the weak locality
workloads. Figure 3.4, Figure 3.5, and Figure 3.6 show the number of page faults per million
instructions executed for each of the programs, denoted as the page fault ratio, as the memory
size increases up to the program’s maximum memory demand. We exclude cold page faults, which
occur on the first access to a page. The algorithms considered here are CLOCK, CLOCK-Pro,
CAR and OPT.

[Figure 3.4 plots omitted. Curves: CLOCK, CAR, CLOCK-Pro, OPT. x-axis: Memory Size (KB).]

Figure 3.4: Performance of CLOCK, CAR, CLOCK-Pro and OPT on programs with strong locality.

[Figure 3.5 plots omitted. Panels: BLIZZARD, PERL, SWIM. Curves: CLOCK, CAR, CLOCK-Pro, OPT.
x-axis: Memory Size (KB).]

Figure 3.5: Performance of CLOCK, CAR, CLOCK-Pro and OPT on programs with moderate
locality.
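
The page fault ratio metric described above can be illustrated with a small sketch. The code below is not the simulator used for the figures; it is a minimal model of a basic CLOCK (second-chance) policy that counts only non-cold faults, under the assumption that page numbers in the trace are small non-negative integers. The names `clock_warm_faults` and `page_fault_ratio` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the page faults of a basic CLOCK (second-chance) policy on a trace of
 * page numbers, excluding cold faults (the first access to each page).
 * Assumption: every page number in the trace lies in [0, npages). */
long clock_warm_faults(const int *trace, size_t len, int npages, int nframes)
{
    assert(trace != NULL && npages > 0 && nframes > 0);
    int *frame = malloc(sizeof *frame * (size_t)nframes); /* page in each frame, -1 if empty */
    int *where = malloc(sizeof *where * (size_t)npages);  /* frame of each page, -1 if absent */
    char *ref = calloc((size_t)nframes, 1);               /* reference bit of each frame */
    char *seen = calloc((size_t)npages, 1);               /* has the page been accessed before? */
    assert(frame && where && ref && seen);
    for (int i = 0; i < nframes; i++) frame[i] = -1;
    for (int i = 0; i < npages; i++) where[i] = -1;

    long faults = 0;
    int hand = 0;
    for (size_t t = 0; t < len; t++) {
        int pg = trace[t];
        if (where[pg] >= 0) {            /* hit: only set the reference bit */
            ref[where[pg]] = 1;
            continue;
        }
        if (seen[pg])                    /* miss on a previously seen page */
            faults++;
        seen[pg] = 1;
        /* Sweep the hand, clearing reference bits, until an empty frame or a
         * frame whose bit is already clear is found. */
        while (frame[hand] != -1 && ref[hand]) {
            ref[hand] = 0;
            hand = (hand + 1) % nframes;
        }
        if (frame[hand] != -1)
            where[frame[hand]] = -1;     /* evict the victim */
        frame[hand] = pg;
        where[pg] = hand;
        ref[hand] = 1;
        hand = (hand + 1) % nframes;
    }
    free(frame); free(where); free(ref); free(seen);
    return faults;
}

/* Page fault ratio: faults per million instructions executed. */
double page_fault_ratio(long faults, double instructions)
{
    return (double)faults * 1e6 / instructions;
}
```

On a looping trace such as 0, 1, 2, 0, 1, 2 with only two frames, every re-access misses, which is the loop behavior that the weak locality programs expose in CLOCK.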
The experimental results clearly show that CLOCK-Pro significantly outperforms CLOCK
for the programs with weak locality, including applu, gnuplot, ijpeg, sor, trygtsl,
and wave5. For gnuplot and sor, which have very large loop accesses, the page fault ratios of
CLOCK-Pro are almost equal to those of OPT. The improvements of CAR over CLOCK are
far from consistent or significant; in many cases, CAR performs worse than CLOCK.
The weakness of CAR is most evident on the traces gnuplot and sor: it cannot correct the LRU
problems with loop accesses, and its page fault ratios are almost as high as those of CLOCK.

[Figure 3.6 plots omitted. Surviving panel titles: SOR, IJPEG, WAVE5, TRYGTSL. Curves: CLOCK,
CAR, CLOCK-Pro, OPT. x-axis: Memory Size (KB); y-axis: Page Faults per Million Instructions.]

Figure 3.6: Performance of CLOCK, CAR, CLOCK-Pro and OPT on programs with weak locality.

For the programs with strong locality accesses, including coral, m88ksim and murphi,
there is little room for other replacement algorithms to do a better job than CLOCK/LRU.
The good news is that both CLOCK-Pro and CAR retain the LRU performance advantages
for this type of program, and CLOCK-Pro even does a little better than CLOCK.
For the programs with moderate locality accesses, including blizzard, perl and swim,
the results are mixed. Though CLOCK-Pro and CAR improve on
CLOCK in most cases, there is a case in swim with small memory sizes where
CLOCK performs better than CLOCK-Pro and CAR. Though in most cases CLOCK-Pro
performs better than CAR, for perl and swim with small memory sizes, CAR performs
moderately better.
To summarize, we found that CLOCK-Pro can effectively remove the performance
disadvantages of CLOCK with weak locality accesses while retaining its performance advantages with
strong locality. It clearly outperforms CAR, which was
proposed with the same objectives as CLOCK-Pro.

3.4.3 Simulation on Program Executions with Interference of File I/O

In a unified memory management system, the file buffer cache and process memory are
managed with a common replacement policy. As stated in Section 3.1.2, memory
competition from a large number of file data accesses in the shared space could interfere
with program execution. Because file data is far less frequently accessed than
process VM, a process should be more competitive in keeping its memory from being taken
away as file cache buffer. However, recency-based replacement algorithms like CLOCK allow
these file pages to replace the process memory even if they are not frequently used, and thus to
pollute the memory. To provide a preliminary study of this effect, we select an I/O trace
[22] (WebSearch1) from a popular search engine and use the accesses of its first 900 seconds as
sample I/O accesses that co-occur with the process memory accesses in a shared memory
space. This segment of the I/O trace has extremely weak locality: among the total 1.12
million page accesses, there are 1.00 million unique pages accessed. We first scale the I/O
trace onto the execution time of a program and then aggregate the I/O trace with the
program VM trace in the order of access times. We select a program with strong locality
accesses, m88ksim, and a program with weak locality accesses, sor, for the study.
Tables 3.4 and 3.5 show the number of page faults per million instructions (only
the instructions of m88ksim or sor are counted) for m88ksim and sor, respectively, with
various memory sizes. We are not interested in the performance of the I/O accesses: there
would be few page hits even with a very large dedicated memory, because there is almost no
locality in those accesses.

Memory (KB)  CLOCK-Pro  CLOCK-Pro w/IO  CAR   CAR w/IO  CLOCK  CLOCK w/IO
 2000        9.6        9.94            9.7   10.1      9.7    11.23
 3600        8.2        8.83            8.3    9.0      8.3    11.12
 5200        6.7        7.63            6.9    7.8      6.9    11.02
 6800        5.3        6.47            5.5    6.8      5.5    10.91
 8400        3.9        5.22            4.1    5.8      4.1    10.81
10000        2.4        3.92            2.8    4.9      2.8    10.71
11600        0.9        2.37            1.4    4.2      1.4    10.61
13200        0.2        0.75            0.7    3.9      0.7    10.51
14800        0.1        0.52            0.7    3.6      0.7    10.41
16400        0.1        0.32            0.6    3.3      0.7    10.31
18000        0.0        0.22            0.6    3.1      0.6    10.22
19360        0.0        0.19            0.0    2.9      0.0    10.14

Table 3.4: The performance (number of page faults per million instructions) of algorithms
CLOCK-Pro, CAR and CLOCK on program m88ksim with and without the interference of I/O file
data accesses.
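
The trace preparation described above (scaling the I/O trace onto the execution time of the program, then aggregating the two traces in access-time order) can be sketched as follows. This is only an illustration under stated assumptions: both input traces are already sorted by time, timestamps start at zero, and the scaling is linear. The record type `access_t` and the function `merge_traces` are hypothetical names, not part of the original simulator.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double time;   /* access time in seconds */
    long page;     /* page number */
    int is_io;     /* 1 for file I/O accesses, 0 for program VM accesses */
} access_t;

/* Merge a program VM trace and an I/O trace into `out` in timestamp order.
 * I/O timestamps are linearly rescaled so that an I/O trace spanning io_span
 * seconds covers prog_span seconds of program execution.
 * `out` must have room for n_vm + n_io records. Returns the record count. */
size_t merge_traces(const access_t *vm, size_t n_vm,
                    const access_t *io, size_t n_io,
                    double io_span, double prog_span, access_t *out)
{
    assert(io_span > 0.0 && prog_span > 0.0);
    size_t i = 0, j = 0, k = 0;
    while (i < n_vm || j < n_io) {
        double t_io = 0.0;
        if (j < n_io)
            t_io = io[j].time * prog_span / io_span;   /* scaled I/O time */
        if (j >= n_io || (i < n_vm && vm[i].time <= t_io)) {
            out[k] = vm[i++];          /* VM access comes first (ties go to VM) */
            out[k].is_io = 0;
        } else {
            out[k] = io[j++];
            out[k].time = t_io;
            out[k].is_io = 1;
        }
        k++;
    }
    return k;
}
```

Keeping the `is_io` flag on each merged record makes it possible to count faults for the program accesses only, as is done for Tables 3.4 and 3.5.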


Memory (KB)  CLOCK-Pro  CLOCK-Pro w/IO  CAR   CAR w/IO  CLOCK  CLOCK w/IO
 4000        11.4       11.9            12.1  12.2      12.1   12.2
12000        10.0       10.7            12.1  12.2      12.1   12.2
20000         8.7        9.6            12.1  12.2      12.1   12.2
28000         7.3        8.6            12.1  12.2      12.1   12.2
36000         5.9        7.5            12.1  12.2      12.1   12.2
44000         4.6        6.5            12.1  12.2      12.1   12.2
52000         3.2        5.4            12.1  12.2      12.1   12.2
60000         1.9        4.4            12.1  12.2      12.1   12.2
68000         0.5        3.4            12.1  12.2      12.1   12.2
70600         0.0        3.0             0.0  12.2       0.0   12.2
74000         0.0        2.6             0.0  12.2       0.0   12.2

Table 3.5: The performance (number of page faults per million instructions) of algorithms
CLOCK-Pro, CAR and CLOCK on program sor with and without the interference of I/O file data
accesses.

From the simulation results shown in the tables, we found that:
(1) For the strong locality program, m88ksim, both CLOCK-Pro and CAR can
effectively protect the program execution from the I/O access interference, while CLOCK is not
able to reduce its page faults as the memory size increases.
(2) For the weak locality program, sor, only CLOCK-Pro can protect the program
execution from the interference, though its page faults are moderately increased compared
with its dedicated execution on the same size of memory. In contrast, CAR and CLOCK
cannot reduce their faults even when the memory size exceeds the program memory demand,
at which point the number of faults in the dedicated executions has dropped to zero.
We did not see a devastating influence on the program executions from the co-existence
of intensive file data accesses. This is because even the weak accesses of m88ksim are strong
enough to fend off the memory competition from file accesses with their page re-accesses, and
there are actually almost no page re-uses in the file accesses. However, if there are quiet
periods during active program execution, such as waiting for user interaction, the
program working set would be flushed by the file accesses under recency-based replacement
algorithms. Re-use distance based algorithms such as CLOCK-Pro do not have
this problem, because file accesses must generate small re-use distances to qualify
the file data for a long-term memory stay and to replace the program memory.

3.5 Summary

In this chapter, we proposed a new VM replacement policy, CLOCK-Pro, which is intended
to take the place of CLOCK, which currently dominates various OS designs. We believe it is
a promising replacement policy for future OS designs for the following reasons. (1) It has a low cost that
can be easily accepted by current systems. Though it could move up to three pointers
(hands) during one victim page search, the total number of hand moves is comparable
to that of CLOCK. Keeping track of the replaced pages in CLOCK-Pro doubles the size
of the linked list used in CLOCK. However, considering the marginal memory consumption
of the list in CLOCK, the additional cost is well acceptable. (2) CLOCK-Pro provides
a systematic solution to the CLOCK problems. It is not just a quick and experience-based
fix to a problem of CLOCK in a specific situation, but is designed based on a more
accurate locality definition, re-use distance, and addresses the source of the LRU problem.
(3) It is fully adaptive to strong or weak access patterns without any pre-determined
parameters. (4) Extensive simulation experiments on real-life I/O and VM traces show
significant and consistent performance improvements. We believe that CLOCK-Pro would
be very attractive to VM system designers in industry.


Chapter 4

Thrashing in Multiprogramming Environments

Improving CPU and memory utilization has been a fundamental consideration in the
design of operating systems. The interaction between memory management and CPU utilization is
much more involved in multiprogramming environments than in a dedicated execution
environment. Studies of page replacement policies, which have a direct impact on memory and
CPU utilization, have continued for several decades (e.g., a representative early
work in [1], and recent work in [30, 67]).

4.1 Background

4.1.1 MPL versus System Thrashing

Multiprogramming level, abbreviated as MPL, is defined as the number of active processes in
a system. We refer to these active processes in a multiprogramming environment as
interacting processes, because they compete for CPU and memory resources interactively.
How to dynamically maintain an optimal MPL to keep CPU utilization high has been a
fundamental issue in the design of operating systems [60]. Operating system designers aim
to provide an optimal solution to the problem of using the CPU and memory resources
effectively in multiprogramming, while avoiding the thrashing that multiprogramming can
cause. CPU utilization can be increased by increasing MPL, that is, by running more processes.
However, once MPL increases beyond a certain degree, the competition for memory pages among
processes becomes serious, which can eventually cause system thrashing, and CPU
utilization will then be significantly lowered. Considering the large variations in memory demands
among multiple processes and the dynamic memory requirements over the lifetimes of the
processes, it is not practically possible to set a pre-defined optimal MPL that avoids
thrashing while allowing a sufficient number of processes in the system. Existing operating
systems, such as BSD and Solaris, provide a load control facility to swap processes out and in,
if necessary, for thrashing protection. This facility allows the systems to adaptively lower
MPL, but process swapping can be quite expensive for both systems and user programs.

4.1.2 Thrashing and Page Replacement

Thrashing events can be directly affected by how page replacement is conducted. Most
operating systems adopt global LRU replacement to allocate the limited memory pages
among competing processes according to their memory reference patterns. With an increase
in MPL, memory allocation requests become more demanding. To keep more processes
active, the limited memory space should be fully utilized. The global LRU page replacement
policy follows this principle. However, the effort to improve memory utilization could cause
low CPU utilization.
In a multiprogrammed environment, global LRU replacement selects an LRU page for
replacement throughout the entire user memory space of the computer system. The risk of
low CPU utilization increases if the memory page shortage spreads across all the interacting
processes. For example, a process is not able to access its resident memory pages while
it is resolving page faults. These already obtained pages may soon become LRU
pages when memory space is demanded by other processes. When the process is
ready to use these pages in its execution turn, they may have already been
replaced to satisfy the memory requests of other processes. The process then has to request
the virtual memory system to retrieve these pages by replacing LRU pages of others. The
page replacement may become chaotic, and could cascade among the interacting processes,
eventually causing system thrashing. Once all interacting processes are in the waiting queue
due to page faults, the CPU is doing little useful work.

4.1.3 Effectiveness of adaptive page replacement

Existing operating systems protect against thrashing at the process scheduling level through load
controls. A commonly used mechanism is to suspend/reactivate programs, or swap them out/in,
to free more memory space after thrashing is detected. For example, the 4.4 BSD
operating system [50] initially suspends a program after thrashing. If the thrashing
continues, additional programs are suspended until enough memory becomes available. Our
experiments and analysis show that conducting adaptive page replacement has several system
performance advantages over process scheduling for eliminating thrashing. First,
since improper page replacement during process interactions is a major and internal source
of system thrashing, a solution that adaptively adjusts page replacement behavior to current
system needs can be fundamentally effective in addressing the problem. Second, the
alternatives available to load controls are limited to suspending or removing existing processes. Since this
approach is expensive and can dramatically degrade user program interactivity, it is only
used when the system is seriously thrashing. Finally, by using adaptive page replacement
in the first place, we are able to eliminate thrashing at its early stage, or significantly
delay the use of load controls. With adaptive page replacement and load controls
guarding at two different levels and two different stages, system performance will become
more stable and cost-effective.

4.1.4 Our work

The objective of our study is to provide highly responsive and cost-effective thrashing
protection by dynamically detecting thrashing and adaptively taking the necessary actions at the kernel level
of page replacement. It can also be regarded as page replacement adaptive to the system
situation. We have designed a dynamic system Thrashing Protection Facility (TPF) in the
system kernel, considering the trade-off between CPU and memory utilizations. Once TPF
detects system thrashing, one of the interacting processes is identified for protection.
The identified process has a short period of privilege during which it does not
contribute its LRU pages for removal. This allows the process to quickly establish its working
set. With the support of TPF, early thrashing can be eliminated at the level of page
replacement, so that process swapping is avoided or delayed until it is truly necessary.
TPF also improves system stability when memory is dynamically and competitively
demanded by interacting processes. We take the Linux kernel as a case study to illustrate
why TPF is needed and how it works.
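
The protection idea stated above, a short period of privilege during which the identified process does not contribute pages, can be illustrated with a very small sketch. It is not the TPF implementation: thrashing detection and the choice of which process to protect are not modeled, and the type `tpf_proc_t`, the functions `tpf_protect` and `tpf_next_donor`, and the abstract time unit are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int pid;
    long protected_until;   /* time at which the privilege expires; 0 if never protected */
} tpf_proc_t;

/* Grant a process a privilege period starting at `now` (hypothetical helper). */
void tpf_protect(tpf_proc_t *p, long now, long period)
{
    assert(p != NULL && period > 0);
    p->protected_until = now + period;
}

/* Return the index of the first process, scanning circularly from `start`,
 * that may currently be asked to give up pages, or -1 if all are protected. */
int tpf_next_donor(const tpf_proc_t *procs, int n, int start, long now)
{
    assert(procs != NULL && n > 0 && start >= 0 && start < n);
    for (int k = 0; k < n; k++) {
        int i = (start + k) % n;
        if (now >= procs[i].protected_until)
            return i;   /* not protected (or privilege has expired) */
    }
    return -1;
}
```

Because the privilege expires on its own, a protected process rejoins the normal page replacement order once the period ends.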

4.2 Evolution of Page Replacement in Linux Kernel

Linux, like most other systems, uses an approximate LRU scheme to keep the working set
of a process in the system, and to contribute already allocated pages that may not be
used in the near future to other interacting processes. A clock algorithm [70] is used, in which
NRU (Not Recently Used) pages are selected for replacement, because it
provides an acceptable approximation of LRU and is cheap to implement.
The current page replacement implementation in Linux is based on the following
framework. The interacting processes are arranged in an order in which they are searched for NRU pages
when few free pages are available in the user space and/or free pages are demanded by
interacting processes. The system examines each possible process to see if it is a candidate from
which NRU pages can be found for replacement. The kernel then checks through all
of the virtual memory pages in the selected process. In a moderately loaded system, we
could hardly observe execution performance differences due to the different page
replacement implementations. However, when processes are competitively demanding memory
allocations, interacting processes may chaotically replace pages among themselves, leading
to thrashing. We take three recent kernel versions to illustrate how the thrashing
potential is introduced and why it is hard for a non-adaptive replacement policy to deal with it.

4.2.1 Kernel 2.0

In Kernel 2.0, the NRU page contributions are proportionally distributed among interacting
processes. There is a “swap_cnt” variable for each process, which is initialized with a
quantity (RSS/1MB) proportional to its resident set size (RSS). Once an NRU page is
taken away from the process, its “swap_cnt” is decreased by one. Only when its
“swap_cnt” becomes zero, or the search for an NRU page fails in the resident space of the
process, is the next process in the process list examined. When a process with a “swap_cnt”
of zero is encountered, it is re-initialized using the same proportion rule. This strategy
effectively balances memory usage by making all the processes provide a proportional number of NRU
pages. However, a major disadvantage of this approach is its high potential for thrashing,
resulting in low CPU utilization. This is because when all the memory-intensive processes
are struggling to build their working sets under heavy memory loads, all of them are requesting more
pages through page faults, and none is given priority for the purpose of thrashing
protection.
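
The proportional rule described above can be sketched as a small user-space model. This is not kernel code: it assumes 4 KB pages (so 1 MB corresponds to 256 pages), assumes that a search for an NRU page never fails in a process that still owes pages, and uses the hypothetical names `proc20_t` and `next_donor_20`.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_MB 256   /* assumption: 4 KB pages */

typedef struct {
    long rss_pages;   /* resident set size in pages */
    long swap_cnt;    /* NRU pages still owed in the current round */
} proc20_t;

/* Choose the process that gives up the next NRU page. `*cur` is the index of
 * the process currently being examined and is advanced when its quota is
 * used up. Returns the donor index, or -1 if no process owes any page. */
int next_donor_20(proc20_t *procs, int n, int *cur)
{
    assert(procs != NULL && cur != NULL && n > 0 && *cur >= 0 && *cur < n);
    for (int tried = 0; tried < n; tried++) {
        proc20_t *p = &procs[*cur];
        if (p->swap_cnt == 0)                        /* re-initialize: RSS / 1 MB */
            p->swap_cnt = p->rss_pages / PAGES_PER_MB;
        if (p->swap_cnt > 0) {
            int donor = *cur;
            p->rss_pages--;                          /* one NRU page is taken away */
            p->swap_cnt--;
            if (p->swap_cnt == 0)
                *cur = (*cur + 1) % n;               /* quota used up: move on */
            return donor;
        }
        *cur = (*cur + 1) % n;                       /* RSS below 1 MB: skip */
    }
    return -1;
}
```

With one process holding 2 MB and another holding 1 MB, the first gives up two pages before the second gives up one, so every process keeps losing pages in proportion to its size and none is sheltered.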

4.2.2 Kernel 2.2

In order to address this limitation, Kernel 2.2 makes each identified process continuously
contribute its NRU pages until no NRU pages are available in the process. In an attempt to
increase CPU utilization, this strategy allows the rest of the interacting processes to build
up their working sets more easily by penalizing the memory usage of one process at a time.
Here is the major section of code that selects a process for page replacement in the kernel
function “swap_out” in mm/vmscan.c [47].

    for (; counter >= 0; counter--) {
        max_cnt = 0;
        pbest = NULL;
    select:
        read_lock(&tasklist_lock);
        p = init_task.next_task;
        for (; p != &init_task; p = p->next_task) {
            if (!p->swappable)
                continue;
            if (p->mm->rss <= 0)
                continue;
            /* Refresh swap_cnt? */
            if (assign == 1)
                p->mm->swap_cnt = p->mm->rss;
            if (p->mm->swap_cnt > max_cnt) {
                max_cnt = p->mm->swap_cnt;
                pbest = p;
            }
        }
        read_unlock(&tasklist_lock);
        if (assign == 1)
            assign = 2;
        if (!pbest) {
            if (!assign) {
                assign = 1;
                goto select;
            }
            goto out;
        }
        if (swap_out_process(pbest, gfp_mask))
            return 1;
    }
out:
    return 0;

In this section of code, the “swap_cnt” variable in a process’s data structure can be
thought of as a “shadow RSS”, which becomes zero when a swap-out operation on the process
fails. The “swap_cnt”s of all the swappable processes are re-assigned with the respective
RSS in a second pass through the process list in the inner loop when they have all become
zero. The inner loop selects the swappable process with the maximal RSS that has
not yet been swapped out. The variable “counter” controls how many processes are
searched before an NRU page is found. We can see that once a process provides an NRU
page, which means it is currently the one with the maximum “swap_cnt”, the process will
be selected for swapping upon the next request. This allows its NRU pages to be
replaced continuously until the search for an NRU page in the process fails. Compared with
the previous kernel version, in addition to the changes in the selection of processes for NRU
pages, there is another major change in this kernel. In Kernel 2.0, there is an “age”
associated with each page, which is increased by 3 when the page is referenced (called page aging)
and decreased by 3 each time the page is examined. Once the “age” decreases to zero, the page
becomes an NRU page and is ready to be replaced. Kernel 2.2 greatly simplifies
the structure by eliminating the “age” and only making use of the reference bit of each page
in the PTE (Page Table Entry). The bit is set when the page is referenced and reset when
the page is examined. The pages with reference bits of 0 are NRU pages and ready to be
replaced. This implementation produces NRU pages more quickly for a process with a
high page fault rate. These changes in Kernel 2.2 take a much more aggressive approach to
making an examined process contribute its NRU pages, attempting to help other interacting
processes establish their working sets and fully utilize the CPU.
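
The difference between the two schemes described above can be made concrete with a toy model. The sketch below follows the description in the text (a step of 3 for aging, a single reference bit otherwise) and counts how many examinations a page survives before it becomes NRU, under the assumption that the page is not referenced again once examinations start. The constant `AGE_STEP` and both function names are hypothetical, and the model ignores all other kernel details.

```c
#include <assert.h>

#define AGE_STEP 3   /* step size taken from the description in the text */

/* Page aging: each reference adds AGE_STEP to the age, each examination
 * subtracts AGE_STEP; the page is NRU once its age reaches zero.
 * Returns the number of examinations needed to reach that state. */
int exams_until_nru_aging(int touches)
{
    assert(touches >= 0);
    int age = touches * AGE_STEP;
    int exams = 0;
    do {
        if (age > 0)
            age -= AGE_STEP;
        exams++;
    } while (age > 0);
    return exams;
}

/* Reference bit only: a reference sets the bit, an examination resets it;
 * the page is NRU when an examination finds the bit already clear. */
int exams_until_nru_refbit(int touches)
{
    assert(touches >= 0);
    int referenced = touches > 0;
    int exams = 0;
    for (;;) {
        exams++;
        if (!referenced)
            return exams;
        referenced = 0;
    }
}
```

In this model a page referenced four times survives four examinations under aging but only two with the reference bit, while a page referenced once is treated the same as the heavily used one by the reference bit scheme.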
We have noted the effort made in Kernel 2.2 to retain CPU utilization by avoiding widely
spread page faults among all the interacting processes. However, such an effort increases
the possibility of replacing fresh NRU pages in the process being examined, while some
NRU pages in other interacting processes that have not been used for a long time continue
to be kept in memory. This approach benefits CPU utilization at the cost of lowering
memory utilization. Fortunately, in our experiments, we find that each interacting process
is still examined periodically at a reasonable time interval. Although the average time
interval in Kernel 2.2 is longer than that in Kernel 2.0.38, it seems to be sufficiently short to
let most interacting processes have a chance to be examined. Thus memory utilization is
not a major concern. However, the risk of system instability caused by low CPU utilization
remains.

4.2.3 Kernel 2.4

The latest Linux kernel is version 2.4, which makes considerable changes to the paging
strategy. Many of these changes aim to address concerns about memory performance
arising in Kernel 2.2. For example, without page aging, NRU replacement in Kernel 2.2 cannot
accurately distinguish the working set from incidentally accessed pages. Thus Kernel
2.4 reintroduces page aging, just as Kernel 2.0 and FreeBSD do. However, page
aging could help processes with high page fault rates keep their working sets, and thus cause
other processes to suffer serious page fault rates and trigger thrashing.
Kernel 2.4 distinguishes pages with an age of zero from those with positive ages by
separating them into non-active and active lists, respectively, to prevent bad interactions
between page aging and page flushing [72]. This change does not help protect the system
against thrashing, because the system still has no knowledge of which working sets of
particular processes should be protected when frequent page replacement takes place under
a heavy memory workload. A similar argument applies to BSD and FreeBSD, where a
system-wide list of pages forces all processes to compete for memory on an equal basis.
To make memory more efficiently utilized, Kernel 2.4 reintroduces the method used in
Kernel 2.0 for selecting processes to contribute NRU pages. Going through the process list
each time, it walks about 6% of the address space in each process to search for NRU pages.
Compared with Kernel 2.2, this method increases the possibility of thrashing.

4.2.4 The Impact of Page Replacement on CPU and Memory Utilizations

From the evolution of recent Linux kernels, we can see that in VM designs and
implementations, finding an optimal MPL with respect to thrashing has been translated into
considerations of the tradeoff between CPU and memory utilizations. For the purpose of
high CPU utilization, we require that the CPU not be idle when there are computing demands
from “cycle-demanding” processes. For the purpose of high memory utilization, we require
that no idle pages be kept unaccessed when there are memory demands from
“memory-demanding” processes. Our analysis has shown that the conflicting interests between the
requirements on CPU and memory utilizations are inherent in a multiprogramming system.
Regarding CPU utilization, the page replacement policy should keep at least one process
active in the process queue. Regarding memory utilization, the page replacement policy
should apply the LRU principle consistently to all the interacting processes. No process
should hide its old NRU pages from swapping while other processes contribute their fresh
NRU pages. It is difficult for a policy that constantly favors both CPU and memory utilization
to eliminate the risk of system instability leading to thrashing. This difficulty in the
design of page replacement in multiprogramming environments is common to operating
systems. Current systems lack effective mechanisms to integrate the two requirements for
the purpose of thrashing protection.
From the perspective of thrashing prevention, the page replacement implementation in
Kernel 2.2 is more effective than those in Kernel 2.0 and Kernel 2.4. However, we will show that the
critical weakness resulting from the conflicting interests between the requirements on CPU
and memory utilizations is inherent in Kernel 2.2 as well. Our experimental results, shown in
the next section, reveal its serious thrashing. Thus, we implement our TPF in Kernel 2.2
to show its effectiveness, a choice that does not favor our performance evaluation.

4.3 Evaluation of Page Replacement in Linux Kernel 2.2

4.3.1 Experimental environment

Our performance evaluation is based on experimental measurements. The machine used for
all experiments is a 400 MHz Pentium II with 384 MBytes of physical memory. The operating
system is Redhat Linux release 6.1 with kernel 2.2.14. Program memory space is allocated
in units of 4 KByte pages. The disk is an IBM Hercules with a capacity of 8,450 MBytes.
When memory related activities occur during program execution, such as memory accesses
and page faults, the system kernel is heavily involved. To gain insight into the VM behavior
of application programs, we have monitored program execution at the kernel level and
carefully added some simple instrumentation to the system. Our monitor program has two
functions: user memory space adjustment and system data collection. To adjust the memory
space available to user programs in experiments, the monitor program can serve as a
memory-adjustment process that requests a memory space of a fixed size, which is excluded
from page replacement. The available user memory space is adjusted by running the
memory-adjustment process with different fixed sizes of memory demand: the difference
between the physical memory space for users and the memory demand size of the
memory-adjustment process is the available user space in our experiments.
In addition, the monitoring program collects the following memory system status quanta
once every second during the execution of programs:

• Memory Allocation Demand (MAD): the total amount of requested memory space
reflected in the page table of a process, in pages. The memory allocation demand
quantum is dynamically recorded in the kernel data structure task_struct, and can
be accurately collected without intrusive effect on program execution.

• Resident Set Size (RSS): the total amount of physical memory used by a process, in
pages, which can be obtained from the kernel data structure task_struct.


• Number of Page Faults (NPF): the number of page faults of a process, which can be
obtained from task_struct of the kernel. There are two types of page faults for each
process: minor page faults and major page faults. A minor page fault causes an
operation to relink the page table to the requested page in physical memory; its
timing cost is trivial in the memory system. A major page fault happens when the
requested page is not in memory and has to be fetched from disk. We only consider
major page fault events for each process, which can also be obtained from task_struct.

• Number of Accessed Pages (NAP)¹: the number of pages accessed by a process within
a time interval of one second. This is collected by a simple system instrumentation.
During program execution, a system routine is periodically called to examine all the
reference bits in the page table of a specified process.
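
The NAP measurement can be pictured as a periodic sweep over per-page reference bits. The sketch below is a user-level Python model rather than kernel code: the `PageTable` class and its methods are hypothetical names, and clearing each bit after counting it is an assumption about how the one-second interval is delimited.

```python
class PageTable:
    """Toy page table: one reference bit per virtual page (hypothetical model)."""

    def __init__(self, num_pages):
        self.referenced = [False] * num_pages

    def touch(self, page):
        # Hardware would set the reference bit on every access to the page.
        self.referenced[page] = True

    def sample_nap(self):
        # Count the pages touched since the last sample, then clear the bits
        # so that the next interval starts from a clean state (assumption).
        nap = sum(self.referenced)
        self.referenced = [False] * len(self.referenced)
        return nap


table = PageTable(num_pages=8)
for page in (0, 3, 3, 5):      # page 3 is touched twice but counted once
    table.touch(page)
first_interval = table.sample_nap()
second_interval = table.sample_nap()   # nothing was touched in between
print(first_interval, second_interval)
```

Because a reference bit records only whether a page was touched, repeated accesses to the same page within one interval do not inflate the count.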

We have selected three memory-intensive application programs from SPEC 2000: gcc,
gzip, and vortex. Using the system facilities described above, we first run each of the
three programs in a dedicated environment to observe its memory access behavior without
major page faults and page replacement (the demanded memory space is smaller than
the available user space). Table 4.1 presents the basic experimental results of the three
programs, where the “description” column gives the application nature of each program,
the “input file” column gives the input file names from the SPEC2000 benchmarks, the
“memory requirement” column gives the maximum memory demand during the execution,
and the “lifetime” column gives the execution time of each program. The “lifetime” of each
program is measured without memory status quanta collection involved. These numbers for
each program represent the mean of 5 runs. The variation coefficient, calculated as the ratio
of the standard deviation to the mean, is less than 0.01.

¹ This quantum is only collected for dedicated executions of benchmark programs.
Programs   description            input file      memory requirement (MB)   lifetime (s)
gcc        optimized C compiler   166.i           145.0                     218.7
gzip       data compression       input.graphic   197.4                     248.7
vortex     database               lendian1.raw    115.0                     342.3
vortex     database               lendian3.raw    131.2                     398.0

Table 4.1: Execution performance and memory related data of the 3 benchmark programs.
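
As a brief illustration of the variation coefficient used for the lifetime measurements, the following Python sketch computes the ratio of the standard deviation to the mean for five runs. The run times are hypothetical values chosen only for illustration, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not say which was used.

```python
import statistics


def variation_coefficient(samples):
    """Ratio of the (sample) standard deviation to the mean of the samples."""
    return statistics.stdev(samples) / statistics.mean(samples)


# Hypothetical lifetimes (seconds) of five runs; these are not measured data.
runs = [218.5, 218.9, 218.6, 218.8, 218.7]
cv = variation_coefficient(runs)
print(f"mean = {statistics.mean(runs):.1f} s, variation coefficient = {cv:.4f}")
```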

4.3.2 Page Replacement Behavior of Kernel 2.2.14

The memory usage patterns of the three programs are plotted in memory-time graphs. In
a memory-time graph, the x axis represents the execution time sequence, and the y axis
represents three memory usage curves: the memory allocation demand (MAD), the resident
set size (RSS), and the number of accessed pages (NAP). The memory usage curves of the 3
benchmark programs measured by MAD, RSS, and NAP are presented in Figures 4.1 (gcc),
4.2 (gzip), and 4.3 (vortex1, which is vortex with the input file “lendian1.raw”). We find,
however, that Linux kernel 2.2.14 still leaves a high potential for interacting processes
to chaotically replace pages among themselves, significantly lowering CPU utilization and
causing thrashing if the page replacement continues under heavy load. To show this, we have
monitored the executions and memory performance of several groups of multiple interacting
programs. To make it easy to see how memory pages are allocated among processes and how
the allocation affects CPU utilization, we only present the results of running two benchmark
programs together as a group. We present three program interaction groups: gzip+vortex3
(vortex3 is vortex with the input file “lendian3.raw”), gcc+vortex3, and vortex1+vortex3.
The available user memory space was adjusted by the monitoring program so that each
interacting program had considerable performance degradation due to a memory shortage of
27% to 42%. (The shortage ratios are calculated based on the maximum memory requirements.
In practice, the realistic memory shortage ratios are smaller due to the dynamically changing
memory requirements of the interacting programs.)
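
To make the shortage ratio concrete, here is a minimal Python sketch. The maximum memory requirements come from Table 4.1, but the formula itself (one minus the available space divided by the sum of the maximum requirements) is an assumed reading of “calculated based on the maximum memory requirements”, and the 220 MB of available user space is a hypothetical value, not one reported in the experiments.

```python
# Maximum memory requirements in MB, taken from Table 4.1.
MAX_DEMAND_MB = {"gcc": 145.0, "gzip": 197.4, "vortex1": 115.0, "vortex3": 131.2}


def shortage_ratio(programs, available_mb):
    """Fraction of the combined peak demand that does not fit in memory.

    Assumed formula: 1 - available / (sum of maximum requirements).
    """
    total = sum(MAX_DEMAND_MB[name] for name in programs)
    return max(0.0, 1.0 - available_mb / total)


# 220 MB is a hypothetical available user space, chosen only for illustration.
ratio = shortage_ratio(["gzip", "vortex3"], available_mb=220.0)
print(f"shortage ratio = {ratio:.1%}")
```

Because the programs rarely sit at their peak demand simultaneously, a ratio computed this way is an upper bound on the shortage actually experienced, which matches the remark in parentheses above.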
[Figure 4.1 plot: “SPEC 2000 gcc”; curves MAD, RSS, and NAP plotted against execution time (second).]

Figure 4.1: The memory performance of gcc in a dedicated environment.

[Figure 4.2 plot: “SPEC 2000 gzip”; curves MAD, RSS, and NAP plotted against execution time (second).]

Figure 4.2: The memory performance of gzip in a dedicated environment.

[Figure 4.3 plot: “SPEC 2000 vortex”; curves MAD, RSS, and NAP plotted against execution time (second).]

Figure 4.3: The memory performance of vortex1 in a dedicated environment.
Figure 4.4 presents the memory usage behavior measured by MAD and RSS of the interacting
programs gzip (left figure) and vortex3 (right figure). After we added gzip to interact
with vortex3 at the 250th second, we observed that both RSS curves fluctuated up and down
most of the time. CPU utilization was lower than 50% during the interaction because both
processes were held in the waiting list by page faults most of the time. Adding more processes
would worsen the case due to the lack of free memory in the system. We found that at around
the 620th second and around the 780th second, gzip did get its working set and ran with a
small number of page faults. Unfortunately, it went back to chaotic competition after that
period. The measurement shows that the slowdown of gzip is 5.23, and that of vortex3 is 3.85.

[Figure 4.4 plots: “gzip (input.graphic) in the interaction” (left) and “vortex (lendian3.raw) in the interaction” (right); curves MAD and RSS plotted against execution time (second).]

Figure 4.4: The memory performance of gzip (left figure) and vortex3 (right figure) during the
interactions.

Figure 4.5 presents the memory usage behavior measured by MAD and RSS of the interacting
programs gcc (left figure) and vortex3 (right figure). For program vortex3, the RSS
curve suddenly dropped to about 14,000 pages after it had reached 26,870 pages, which was
caused by the memory competition of the partner program gcc. After that, the RSS curve
entered a fluctuating stage, causing a large number of page faults in each process, which
extended the first spike of gcc in the MAD and RSS curves to 865 seconds and extended an
RSS stair in vortex3 to 563 seconds. In this case the slowdown of program gcc is 5.61, and
that of vortex3 is 3.37.
[Figure 4.5 plots: “gcc (166.i) in the interaction” (left) and “vortex (lendian3.raw) in the interaction” (right); curves MAD and RSS plotted against execution time (second).]

Figure 4.5: The memory performance of gcc (left figure) and vortex3 (right figure) during the
interactions.

Figure 4.6 presents the memory usage behavior measured by MAD and RSS of the interacting
programs vortex1 (left figure) and vortex3 (right figure). Although the input files are
different, the memory access patterns of the two programs are the same. Our experiments
show that the RSS curves of both vortex programs changed similarly during the interactions.
To favor memory utilization, NRU pages were allocated between the two processes
back and forth, causing low CPU utilization and poor system performance. After the RSS
curves of both programs reached about 22,000 pages, their MADs could not be reached due
to memory shortage. Our experiments again show that the execution times of both programs
were significantly increased due to the page faults in the interaction. The slowdown
of vortex1 is 3.58, and that of vortex3 is 3.33.

[Figure 4.6 plots: “vortex (lendian1.raw) in the interaction” (left) and “vortex (lendian3.raw) in the interaction” (right); curves MAD and RSS plotted against execution time (second).]

Figure 4.6: The memory performance of vortex1 (left figure) and vortex3 (right figure) during the
interactions.

Our experiments indicate that although thrashing can be triggered by a brief, random
peak in the memory demand of a workload, the system may continue thrashing for an
unacceptably prolonged time. To make a system more resilient against dynamically changing
virtual memory load, a dynamic protection mechanism is desirable instead of a brute-force
process stop, such as process suspension or even process removal.
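
The slowdowns reported in this section can be related to the dedicated lifetimes in Table 4.1 with a few lines of Python. The sketch assumes that slowdown means execution time in the interaction divided by execution time in the dedicated environment; the text does not spell out the definition, so the implied times below are derived estimates, not measurements.

```python
# Dedicated lifetimes (seconds) from Table 4.1 and slowdowns reported above.
LIFETIME_S = {"gzip": 248.7, "gcc": 218.7, "vortex1": 342.3, "vortex3": 398.0}
SLOWDOWN = {  # (group, program) -> reported slowdown
    ("gzip+vortex3", "gzip"): 5.23,
    ("gzip+vortex3", "vortex3"): 3.85,
    ("gcc+vortex3", "gcc"): 5.61,
    ("gcc+vortex3", "vortex3"): 3.37,
    ("vortex1+vortex3", "vortex1"): 3.58,
    ("vortex1+vortex3", "vortex3"): 3.33,
}


def implied_time(program, slowdown):
    """Execution time implied by a slowdown, assuming slowdown = t_shared / t_dedicated."""
    return slowdown * LIFETIME_S[program]


for (group, program), s in SLOWDOWN.items():
    print(f"{group:16s} {program:8s} about {implied_time(program, s):7.1f} s")
```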

4.4 The Design and Implementation of TPF

We propose to implement TPF as part of page replacement for thrashing protection in
order to improve system stability under a heavy load. The main idea of TPF is simple
and intuitive. Once the system detects high page fault rates and low CPU utilization caused
by multiple processes, TPF identifies a process and helps it to quickly establish its working
set by temporarily granting the process a privilege in page replacement. With this
action, the CPU utilization quickly increases because at least one process is able to do
useful work. In addition, the memory space is expected to be released soon by the process
after its completion, so that the memory demands of other processes can be satisfied. We
have implemented TPF in the Linux kernel 2.2.14; it consists of two kernel utilities:
detection and protection routines.
The detection routine dynamically monitors the page fault rate of each process
and the CPU utilization of the system. The protection routine is awakened to conduct
priority-based page replacement when CPU utilization is lower than a predetermined
threshold and the page fault rates of more than one interacting process exceed a
threshold. The protection routine then grants a privilege to an identified process, which
will then contribute only a limited number of NRU pages. The identified process is the one
that has the smallest difference between its MAD and its RSS (the least memory demanding
process). The detection routine also monitors whether the identified process has lowered
its page fault rate to a certain degree. If so, its privilege is disabled. This action
retains memory utilization by treating each process equally.

4.4.1 The detection routine

There are four predetermined parameters in TPF:
1. CPU_Low: the lowest CPU utilization the system can tolerate.
2. CPU_High: the targeted CPU utilization for TPF to achieve.
3. PF_Low: the targeted page fault rate² of the identified process for TPF to achieve.
4. PF_High: the page fault rate threshold of a process to potentially cause thrashing.
We add one global linked list, high_PF_proc, in the kernel to record interacting processes
with high page fault rates. Once we find that the current page fault rate of a process
exceeds PF_High, we enter the process in the linked list.
We have also added three new fields to the task_struct data structure for each process:
1. num_pf: the number of page faults detected recently;
2. start_time: the system time of the first page fault among the above “num_pf” page faults;
and
3. privilege: whether the process is granted the privilege (=1) or not (=0).
Here are the kernel operations to determine and manage the processes exceeding the
threshold page fault rates.

if (process p encounters page faults) {
    if (p->num_pf == 0)
        p->start_time = current system time;   /* a new counting window begins */

    p->num_pf++;
    if (p is not in the "high_PF_proc" list)
        if (p->num_pf > high_PF) {             /* high_PF is the PF_High threshold */
            if (current system time - p->start_time <= 1 second)
                place p in high_PF_proc;
            p->num_pf = 0;
        }
}

² In our experiments only those page faults that are resolved by loading pages from the swap files
on disk are counted, because they are the most appropriate indicators of the effect of memory
shortage on processes.
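
The per-fault bookkeeping above can be exercised outside the kernel with a small Python model. This is a sketch under stated assumptions: the `Proc` class and `on_page_fault` function are hypothetical stand-ins for task_struct and the fault handler, time is passed in explicitly as seconds, and `PF_HIGH` uses the value of 10 reported later for the experiments.

```python
from dataclasses import dataclass

PF_HIGH = 10  # page faults; the value used later in the experiments


@dataclass(eq=False)  # eq=False: list membership is tested by identity
class Proc:
    name: str
    num_pf: int = 0
    start_time: float = 0.0
    privilege: int = 0


high_pf_proc = []  # stands in for the kernel's global linked list


def on_page_fault(p, now):
    """Update the counters of process p for one major page fault at time `now` (seconds)."""
    if p.num_pf == 0:
        p.start_time = now          # first fault of a new counting window
    p.num_pf += 1
    if p not in high_pf_proc and p.num_pf > PF_HIGH:
        if now - p.start_time <= 1.0:
            high_pf_proc.append(p)  # more than PF_HIGH faults within one second
        p.num_pf = 0                # start a new window either way


fast, slow = Proc("fast"), Proc("slow")
for i in range(11):
    on_page_fault(fast, now=i * 0.05)   # 11 faults within 0.5 s
    on_page_fault(slow, now=i * 0.5)    # 11 faults spread over 5 s
print([p.name for p in high_pf_proc])
```

Only the process whose faults fall inside a one-second window is listed; the slower one has its counter reset without being recorded.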

We check the page fault rate of each process in the high_PF_proc list every second. If
a process’s page fault rate is lower than low_PF (the PF_Low threshold), we dynamically
remove the process from the list by the following operations:

if (length(high_PF_proc) >= 1) {
    for each p in the list do {
        if (current system time - p->start_time >= 1 second) {
            if (p->num_pf / (current system time - p->start_time) < low_PF) {
                if (p->privilege == 1)
                    p->privilege = 0;
                remove p from the list;
            }
            p->num_pf = 0;
            p->start_time = current system time;
        }
    }
}
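
A user-level Python model of this once-per-second check is sketched below. The names `Proc` and `periodic_check` are hypothetical, the function returns a new list instead of unlinking entries in place, and `PF_LOW` uses the value of 1 page fault per second reported later for the experiments.

```python
from dataclasses import dataclass

PF_LOW = 1.0  # page faults per second; the value used later in the experiments


@dataclass(eq=False)
class Proc:
    name: str
    num_pf: int
    start_time: float
    privilege: int = 0


def periodic_check(high_pf_proc, now):
    """Drop processes whose fault rate fell below PF_LOW; return the new list."""
    kept = []
    for p in high_pf_proc:
        elapsed = now - p.start_time
        if elapsed < 1.0:
            kept.append(p)          # window too short to judge; look again later
            continue
        calmed = p.num_pf / elapsed < PF_LOW
        if calmed:
            p.privilege = 0         # a calmed process loses any privilege
        else:
            kept.append(p)
        p.num_pf = 0                # open a new measurement window
        p.start_time = now
    return kept


calm = Proc("calm", num_pf=0, start_time=0.0, privilege=1)
busy = Proc("busy", num_pf=25, start_time=0.0)
remaining = periodic_check([calm, busy], now=1.0)
print([p.name for p in remaining], calm.privilege)
```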

The CPU utilization is measured every second, based on the CPU idle time. Specifically,
we use (1 - idle ratio) to represent the current CPU utilization, where the idle ratio is the
portion of CPU time used by the idle process in the last second. The current CPU utilization
is compared with CPU_Low to determine whether the system is experiencing an unacceptably
low CPU utilization. The protection routine is triggered when the following three conditions
are all true.

if ((CPU utilization < CPU_Low) &&
    (length(high_PF_proc) >= 2) &&
    (no process has been protected)) {
    for all processes in high_PF_proc
        select the least memory hungry p;
    p->privilege = 1;
}
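
The trigger and the selection rule can be sketched together in Python. “Least memory hungry” is taken here to mean the smallest difference between MAD and RSS, as stated in the design overview; the `Proc` class, the function name, the page counts, and the decision to look for an already protected process only within the list are assumptions made for illustration.

```python
from dataclasses import dataclass

CPU_LOW = 0.40  # the value used later in the experiments


@dataclass(eq=False)
class Proc:
    name: str
    mad: int            # memory allocation demand, in pages
    rss: int            # resident set size, in pages
    privilege: int = 0


def maybe_protect(cpu_utilization, high_pf_proc):
    """Grant the privilege to one process if all three conditions hold."""
    already_protected = any(p.privilege for p in high_pf_proc)
    if cpu_utilization < CPU_LOW and len(high_pf_proc) >= 2 and not already_protected:
        # Least memory hungry: smallest gap between demand and residency.
        chosen = min(high_pf_proc, key=lambda p: p.mad - p.rss)
        chosen.privilege = 1
        return chosen
    return None


# The page counts below are hypothetical.
a = Proc("a", mad=50000, rss=30000)
b = Proc("b", mad=33000, rss=26000)
chosen = maybe_protect(0.25, [a, b])
again = maybe_protect(0.25, [a, b])   # a process is already protected
print(chosen.name, again)
```

Choosing the process closest to holding its full demand means the fewest extra pages are needed before at least one process can run without faulting.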

4.4.2 The protection routine

The privilege granting is implemented in a simple way in the kernel routine “swap_out”
presented in Chapter 4.3. The function swap_out_process(pbest, gfp_mask) resets the
“swap_cnt” of process “pbest” to 0 if, and only if, the system fails to get an NRU page from
the process, as we have shown in Chapter 4.3. A small modification in swap_out_process()
makes the privilege effective: we reset the “swap_cnt” to 0 even if an NRU page is
obtained from the protected process. This causes the protected process to provide at most
one NRU page in each examination loop over all swappable processes. Considering that a
large number of NRU pages exist in the rest of the interacting processes, such a change
effectively helps the protected process build up its working set and reduce its page fault
rate. Once its page fault rate is lowered satisfactorily, the protected process is removed
from the “high_PF_proc” list and loses its privilege.
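
The effect of the modification can be illustrated with a deliberately simplified Python simulation. It is not the kernel's swap_out logic: the model assumes that every process starts an examination loop with swap_cnt equal to its RSS, that the process with the largest swap_cnt is examined next, and that each examination takes one NRU page and costs one unit of swap_cnt. The function name and the page counts are hypothetical.

```python
def examination_loop(rss, nru, protected=None):
    """Simulate one examination loop and return the pages taken per process.

    Simplified model (assumption): swap_cnt starts at the RSS, the process
    with the largest swap_cnt is examined next, and each examination takes
    one NRU page. A protected process has its swap_cnt reset to 0 as soon
    as it has given up one page.
    """
    swap_cnt = dict(rss)
    left = dict(nru)
    taken = {name: 0 for name in rss}
    while any(swap_cnt.values()):
        best = max(swap_cnt, key=swap_cnt.get)
        if left[best] == 0:
            swap_cnt[best] = 0      # no NRU page found: stop examining it
            continue
        left[best] -= 1
        taken[best] += 1
        swap_cnt[best] = 0 if best == protected else swap_cnt[best] - 1
    return taken


# Hypothetical page counts for two interacting processes.
rss = {"p1": 6, "p2": 5}
nru = {"p1": 4, "p2": 4}
unprotected = examination_loop(rss, nru)
with_privilege = examination_loop(rss, nru, protected="p1")
print(unprotected, with_privilege)
```

In this toy model the protected process gives up a single page per loop while its partner continues to contribute all of its NRU pages, which is the asymmetry that lets the protected process rebuild its working set.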

4.4.3 State transitions in the system

The kernel memory management has the following three states with dynamic transitions:

1. normal state: In this state, no monitoring activities are conducted. The system deals
with page faults exactly as the original Linux kernel does. The system keeps track
of the number of page faults for each process and places the processes with high page
fault rates in “high_PF_proc”.

2. monitoring state: In this state, the detection routine is awakened to start monitoring
the CPU utilization and the page fault rates of processes in the linked list. If the
protection condition is satisfied, the detection routine will select a qualified process
for protection and go to the protection state. The system returns to the normal state
when no more than one process’s page fault rate is as high as the predetermined
threshold.

3. protection state: The protection routine will make the selected process quickly
establish its working set. In the protection state, the detection routine keeps monitoring
the CPU utilization and the page fault rate of each process in the list. The detection
routine is deactivated and the protection state transfers to the monitoring state as
soon as the protected process becomes stable and/or the CPU utilization has been
sufficiently improved.

[Figure 4.7 diagram: state transitions among the normal, monitoring, and protection states; recoverable edge labels include “CPU utilization < CPU_Low && ...” and “CPU utilization > CPU_High”.]

Figure 4.7: Dynamic transitions among normal, monitoring, and protection states in the improved
kernel system.

Figure 4.7 describes the dynamic transitions among the three states, which gives a complete
description of the TPF facility. When the system is normal (no page faults occur), the detection
and protection routines are not involved. As we have described in the implementation, the
algorithm only adds limited operations for each page fault and checks several system
parameters at an interval of one second. So, the overhead involved in detection and protection
is trivial compared with the CPU overhead of dealing with page faults.
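
The three states and their transitions can be written down as a small state machine. The Python sketch below is one possible reading of the description: the condition for leaving the normal state (at least two processes in high_PF_proc) and the use of a logical “or” for leaving the protection state are assumptions, and the function and parameter names are hypothetical.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    MONITORING = "monitoring"
    PROTECTION = "protection"


CPU_LOW, CPU_HIGH = 0.40, 0.80   # the values used later in the experiments


def next_state(state, cpu_util, num_high_pf, protected_is_stable):
    """Return the state reached after one per-second check.

    Assumptions: monitoring starts once at least two processes are in
    high_PF_proc, and protection ends when the protected process is stable
    or CPU utilization exceeds CPU_HIGH.
    """
    if state is State.NORMAL:
        return State.MONITORING if num_high_pf >= 2 else State.NORMAL
    if state is State.MONITORING:
        if num_high_pf <= 1:
            return State.NORMAL
        return State.PROTECTION if cpu_util < CPU_LOW else State.MONITORING
    # Remaining case: State.PROTECTION
    if protected_is_stable or cpu_util > CPU_HIGH:
        return State.MONITORING
    return State.PROTECTION


trace = [State.NORMAL]
for cpu, n, stable in [(0.9, 2, False), (0.3, 2, False), (0.5, 2, False),
                       (0.85, 2, True), (0.9, 1, False)]:
    trace.append(next_state(trace[-1], cpu, n, stable))
print([s.value for s in trace])
```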

4.5 Performance Measurements and Analysis

4.5.1 Observation and measurements of TPF facility

[Figure 4.8 bar charts: “without protection” versus “with protection” for the groups gzip/vortex3, vortex3/gcc, and vortex1/vortex3.]

Figure 4.8: The execution time comparisons (left figure) and comparisons of numbers of page faults
(right figure) for the three groups of program interactions in the Linux without TPF and with TPF.

The predetermined threshold values are set as follows: CPU_Low = 40%, CPU_High =
80%, PF_High = 10 page faults/second, PF_Low = 1 page fault/second. The performance
of TPF is experimentally evaluated with the three groups of interacting programs. Each
of the experiments has exactly the same setting as its counterpart conducted in Chapter
4.3, except that TPF is implemented in the kernel.
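
For reference, the four thresholds can be collected in a small configuration object. The Python sketch below uses the values stated above; the class name, the field names, and the consistency checks (each low threshold strictly below its high counterpart) are assumptions added for illustration, not part of the kernel implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TpfThresholds:
    cpu_low: float = 0.40    # lowest tolerable CPU utilization
    cpu_high: float = 0.80   # CPU utilization TPF aims to restore
    pf_high: float = 10.0    # faults/second that mark a process as a thrashing risk
    pf_low: float = 1.0      # faults/second targeted for the protected process

    def __post_init__(self):
        # Assumed sanity checks: the low/high pairs must leave a gap between
        # triggering protection and withdrawing it.
        if not 0.0 <= self.cpu_low < self.cpu_high <= 1.0:
            raise ValueError("need 0 <= cpu_low < cpu_high <= 1")
        if not 0.0 <= self.pf_low < self.pf_high:
            raise ValueError("need 0 <= pf_low < pf_high")


defaults = TpfThresholds()
print(defaults)
```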
Figure 4.9 presents the memory usage measured by MAD and RSS of the interacting
programs gzip (left figure) and vortex3 (right figure) in the Linux with TPF. Figure 4.4 shows
that, without TPF, thrashing between the processes started as soon as gzip joined the
execution at the 250th second. In contrast, Figure 4.9 shows that TPF quickly detected the
problem and went into the protection state. Because the RSS of vortex3 was close to its MAD,
it was selected for protection. After the protection, its page fault rate was lowered with the
establishment of its working set. Then the protection was disabled to allow the NRU pages
of vortex3 to be fully utilized. This is confirmed by the small gap between MAD and RSS,
which does not exist in the dedicated execution (see Figure 4.3). In the experiment we
observed that TPF had to intervene more than 10 times during the program interaction
to help vortex3 establish its working set. This is because program vortex3 is not strong
enough to keep its established working set under the competition of gzip. Even for this type
of program, TPF demonstrates its effectiveness. The number of page faults and the execution
time of vortex3 are reduced by 72% and 92%, respectively (see Figure 4.8).
[Figure 4.9 plots: “gzip (input.graphic) in the interaction” (left) and “vortex (lendian3.raw) in the interaction” (right); curves MAD and RSS plotted against execution time (second).]

Figure 4.9: The memory performance of gzip (left figure) and vortex3 (right figure) during the
interactions in the Linux with TPF.

The performance improvement for gzip is also significant. Its number of page faults
and execution time are reduced by 72% and 64%, respectively. Intuitively, its performance
should have been degraded because TPF forced it to contribute more memory space to
vortex3 for building up its working set. But this is not the case, for two reasons. First,
under the protection of TPF, vortex3 had an early completion. Then gzip could run without
memory competition and use the CPU cycles alone. Second, under the protection of TPF,
vortex3 could greatly reduce its page fault rate, which let gzip utilize most of the I/O
bandwidth and reduced its page fault penalty.
Figure 4.10 presents the memory usage measured by MAD and RSS of the interacting
programs gcc (left figure) and vortex3 (right figure) in the Linux with TPF. At the 397th
second, the memory demand from gcc rose rapidly, and both programs started to incur page
faults due to memory shortage. The thrashing significantly lowered CPU utilization, which
triggered TPF to take action. Because gcc demanded memory gradually, and kept the gap between
MAD and RSS small, gcc was selected for protection by TPF on the rising slope of its first MAD
spike. The memory is dynamically allocated between the two processes to ensure a
reasonable level of CPU utilization. The period for which the system stays in the protection
state is very limited, and thus memory utilization is maintained. TPF successfully smoothed
out the peak in memory load that might otherwise have caused the system to thrash.
Compared with the same run in the original Linux kernel, the execution times of programs gcc
and vortex3 are reduced by 69% and 57%, respectively, and the numbers of page faults of
programs gcc and vortex3 are reduced by 99% and 87%, respectively (see Figure 4.8).

[Figure 4.10 plots: “gcc (166.i) in the interaction” (left) and “vortex (lendian3.raw) in the interaction” (right); curves MAD and RSS plotted against execution time (second).]

Figure 4.10: The memory performance of gcc (left figure) and vortex3 (right figure) during the
interactions in the Linux with TPF.

Figure 4.11 presents the memory usage measured by MAD and RSS of the interacting
programs vortex1 (left figure) and vortex3 (right figure) in the Linux with TPF. During the
interactions, at the 433rd second of the execution, both programs started to incur page faults
due to memory shortage. The program vortex1 was then protected by TPF. We observed that
vortex1 easily held its working set thereafter and only a few TPF interventions were
needed. This is because vortex1 and vortex3 have similar memory access
rates and patterns. Thus once vortex1 was given the privilege to establish its working set, it
would keep the working set by frequently using it. In contrast to the performance seen in
Figure 4.6, a small correction from TPF could make a big difference in multiprogramming.
Compared with the same execution in the original Linux kernel, the execution times of
programs vortex1 and vortex3 are reduced by 46% and 42%, respectively, and the numbers
of page faults are reduced by 99% and 80%, respectively (see Figure 4.8).
[Figure 4.11 plots: “vortex (lendian1.raw) in the interaction” (left) and “vortex (lendian3.raw) in the interaction” (right); curves MAD and RSS plotted against execution time (second).]

Figure 4.11: The memory performance of vortex1 (left figure) and vortex3 (right figure) in the
Linux with TPF.

Figure 4.12 compares the total execution times for the three groups of interacting
programs in Linux with and without TPF. We define the execution time of each pair of
programs under the same multiprogramming condition with sufficient memory space as the ideal
interaction execution time. Figure 4.12 shows that the total interacting execution times in
the Linux with TPF for the three groups are significantly smaller than those in the Linux
without TPF, and very close to the ideal execution times. These experiments also indicate
that TPF has little runtime overhead.

[Figure 4.12 bar chart: “without protection”, “with protection”, and “ideal” for the groups gzip/vortex3, vortex3/gcc, and vortex1/vortex3.]

Figure 4.12: Comparison of total interaction execution times for the three groups of program
interactions in the Linux with TPF, without TPF and the ideal interaction times.

4.5.2 Experiences with TPF in the multiprogramming environment

1. Under what conditions does thrashing happen in a multiprogramming environment?
Our experiments show that the VM in Linux can normally keep a reasonable CPU
utilization even under a heavy workload, adapting to the variance of memory demands,
access patterns and access rates of different processes. A process that can frequently
access its working set in execution interactions has a strong position in the competition
for memory space during interactions. However, under the following three conditions,
thrashing can be triggered.

• The memory demands of one interacting program have unexpected “spikes”.
Case studies in Figure 4.4 and Figure 4.5 show such examples.
• The variance of memory demands, access patterns, and access rates of interacting
processes are similar. Case studies in Figure 4.6 show such examples.
• Serious memory shortage happens in the system.

2. For what cases is TPF most effective?
TPF is most effective in the first two cases discussed above. In other words, TPF
is able to quickly resolve the thrashing for interacting programs having dynamically
changing memory demands. We have shown that TPF responds quickly to increase
the CPU utilization and to stop thrashing by adapting page replacement to memory
allocations. In addition, the scheduling action of TPF intervenes little in
the system and the multiprogramming environment because the protection period is very
short, but it is effective in leading the system back to normal.

3. For what cases is TPF ineffective?
If the memory shortage problem is too serious in a multiprogramming environment,
the selected process may build up and hold its working set in memory at the cost
of obtaining most of the memory space of other processes. Although TPF can still
cause CPU cycles to be effectively utilized, the CPU overhead of serving the page faults
of other processes will significantly increase, and I/O channels may become heavily
loaded due to a large number of page faults. As a result, the protected process will
not run smoothly. Under such conditions, load control has to be used to swap out a
process to release memory space. With the support of TPF, the load control facility
will be used only when it is truly necessary.

4. How do the threshold parameters affect the performance of TPF?
We summarize the relationships between the four threshold parameters (see Chapter
4.4.1), the effectiveness of TPF, and the memory performance of interacting programs.
Smaller values of parameters CPU_Low and PF_High will make TPF more responsive
to system thrashing. On the other hand, larger values of CPU_High and PF_Low
will make the identified process stay longer in the protection state after it enters the
state. Thus, adjusting these parameters is equivalent to changing the extent of TPF
intervention in the system. The parameters are set based only on system requirements,
and do not depend on the nature of the application programs. For example, for systems with
high I/O bandwidths (e.g. parallel disk arrays), the values of PF_High and PF_Low can be set
larger, because page faults can be resolved quickly. In our experiments we found that
the performance of TPF was quite stable within a large range of parameter values.

4.6 Related Work

Improving CPU and memory utilization has been a fundamental consideration in the
design of operating systems. Extensive research on thrashing was conducted in the
1960s and 1970s. Among the proposed policies, the most influential one, which was able
to thoroughly protect against thrashing while keeping CPU utilization high, is the working
set policy. The working set policy provides a solution at the page replacement level, similar
to our policy.

4.6.1 The Working Set Model and its Implementation Issues

Denning proposes a working set model [24], [26], and [27] to estimate the current memory
demand of a running program in the system. A working set of a program is a set of its
recently used pages. Specifically, at virtual time $t$, the program's working set $W_t(\theta)$ is
the subset of all pages of the program that have been referenced in the previous $\theta$ virtual
time units (the working set window). The task's virtual time is a measure of the duration the
program has control of the processor and is executing instructions. The working set model
ensures that the same program with the same input data would have the same locality
measurements, independent of the memory size, the multiprogramming level, and
the scheduler policy used. A working set policy is used to ensure that no pages in the working
set of a running program will be replaced. Assume that priorities among the processes exist.
Once there is a request for free pages but none are available, the process with the
lowest priority has to produce a victim page for replacement. This implies that an active
process with the lowest priority may not be allocated its full working set. Since the I/O time
caused by page faults is excluded in the working set model, the working set replacement
algorithm can theoretically eliminate the thrashing caused by chaotic memory competition.
In comparison, other global policies such as LRU approximations (two-handed clock, FIFO
with second chance) used in currently popular UNIX-like operating systems are highly
susceptible to thrashing, because a program's resident set depends on many factors besides
its own locality. Our experimental observations are consistent with the conclusions in the
cited work on working set models.
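
To make the working set definition concrete, the following minimal Python sketch computes
the working set from a page reference trace. It assumes that virtual time advances by one
unit per reference; the function name `working_set` and the sample trace are illustrative
and do not come from the cited work.

```python
def working_set(refs, t, theta):
    """Return the working set at virtual time t for a window of theta units.

    Assumption of this sketch: refs[i] is the page referenced at virtual
    time i + 1, so virtual time advances by one unit per reference.
    """
    start = max(0, t - theta)
    # Distinct pages referenced during the last theta units of virtual time.
    return set(refs[start:t])


refs = ["a", "b", "a", "c", "d", "d", "b"]
print(sorted(working_set(refs, t=5, theta=3)))  # ['a', 'c', 'd']
```

A larger window can only add pages to the set, which is why the window size governs the
memory demand estimated for the program.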

A major difficulty in implementing the working set model in a modern computer system is
that its implementation overhead scales with the capacity of CPU and memory. The working set
model can be implemented in either hardware or software. The hardware approach requires
that each page frame be associated with a counter and an identifier register indicating
which process it belongs to. A broadcast clock pulse periodically increments the counter
of each page frame whose identifier register matches the memory domain of the running
process. When the running process references a page, the counter of that page frame is
automatically reset. When a counter is incremented beyond a predetermined threshold value,
the corresponding page frame is no longer a member of the working set.
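
The counter mechanism described above can be modeled in a few lines of Python. This is only
a software model of the hardware scheme, written under the assumption that one call to
`clock_pulse` stands for one broadcast pulse; the class and method names are hypothetical.

```python
class WorkingSetDetector:
    """Software model of the per-frame counter scheme (illustrative only)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.owner = {}    # page frame -> id of the owning process
        self.counter = {}  # page frame -> pulses since the last reference

    def map_frame(self, frame, pid):
        self.owner[frame] = pid
        self.counter[frame] = 0

    def reference(self, frame):
        # A reference by the running process resets the frame's counter.
        self.counter[frame] = 0

    def clock_pulse(self, running_pid):
        # Only frames belonging to the running process are aged.
        for frame, pid in self.owner.items():
            if pid == running_pid:
                self.counter[frame] += 1

    def working_set(self, pid):
        # A frame leaves the working set once its counter exceeds the threshold.
        return {f for f, p in self.owner.items()
                if p == pid and self.counter[f] <= self.threshold}


detector = WorkingSetDetector(threshold=2)
detector.map_frame(0, pid=1)
detector.map_frame(1, pid=1)
detector.map_frame(2, pid=2)
for pulse in range(3):
    if pulse == 2:
        detector.reference(0)   # frame 0 is touched again, frame 1 is not
    detector.clock_pulse(running_pid=1)
```

Every pulse visits every frame of the running process, which illustrates why the cost of
this scheme grows with the number of page frames.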
Compared with the approach of only associating a page-reference bit with each page
frame to support LRU-related page replacement policies in Linux and Unix systems, an
implementation of the working set detector is more expensive. With the increase of CPU
speed and memory capacity, and with an increasing number of memory-intensive workloads
in applications, the number of active pages owned by a process has dramatically increased,
which has become a major factor limiting such an implementation. In addition, since system
thrashing is considered an exceptional event, it may be difficult to convince computer
vendors to provide hardware support for the working set model. Instead, computer
architects prefer to adopt brute-force methods as exception handlers, such as
releasing memory space by urgently removing some processes.
When the working set detector is implemented in system software, a software counter
associated with each page frame must be updated routinely. Since monitoring a huge number
of page frames would become a routine operation in memory management, it would affect
system performance even when the system functions normally.

Although there exist a number of good approximations, the implementation cost of the
working set model may limit its direct usage in modern computer systems. However,
this model has given us several important motivations in memory system design and
implementation. First, being a local memory policy, the working set policy has inherent load
control and needs no special, additional mechanism to deal with thrashing. This certainly
saves the cost of an additional mechanism to stabilize a global LRU policy. While built on a
global replacement policy, TPF can protect the working set of a process, as if temporarily
under a local memory policy, to eliminate thrashing. Second, the working set policy employs
“feed forward control” rather than “feedback control”, which means that the working set policy
does not have to react to thrashing, but avoids thrashing in advance. The instability resulting
from the feedback of load control is greatly reduced by the responsive action of TPF on the
global policy.
Considering the cost of working set implementations in existing system kernels, and guided
by the principle of the working set model, we propose TPF, which is not part of the routine
operations in memory management, but is triggered only in an early stage of thrashing to
effectively stop the thrashing or significantly delay load control.

4.6.2 Other Related Work

Studies of page replacement policies, which have a direct impact on memory utilization,
have continued for several decades (e.g., a representative early work in [1], and recent
work in [30, 67]). The goal of an optimal page replacement policy is to achieve efficient
memory usage by replacing only those pages that will not be used in the near future when
available memory is not sufficient, thereby reducing the number of page faults. In a
single-programming environment, these proposed methods address both CPU and memory
utilization, since any extra page faults due to low memory utilization will make the CPU
stall. However, the system thrashing issue in a multiprogramming environment cannot be fully
addressed by the cited work, due to the conflicting interests between CPU and memory
utilization.
In the multiprogramming context, existing systems mainly apply two methods to eliminate
thrashing. One is local replacement; the other is load control. Local replacement
requires that the paging system select pages for a program only from its allocated memory
space when no free pages can be found in its memory allotment. Unlike a global replacement
policy, a local policy needs a memory allocation scheme to satisfy the needs of each
program. Two commonly used policies are equal and proportional allocation, neither of which
can capture the dynamically changing memory demand of each program [38]. As a result,
memory space may not be well utilized. On the other hand, an allocation policy that
dynamically adapts to the demand of individual programs will shift the scheme toward global
replacement. VMS [41] is a representative operating system using a local replacement policy.
Memory is partitioned into multiple independent areas, each of which is localized to a
collection of processes that compete with one another for memory. Unfortunately, this scheme
can be difficult to administer [44]. Researchers and system practitioners seem to have agreed
that a local policy is not an effective solution for virtual memory management. Our TPF is
built on a global replacement policy.
The objective of load control is to lower the MPL by physically reducing the number of
interacting processes. A commonly used load control mechanism is to suspend and reactivate
processes, or even to swap processes out and in, to free more memory space when thrashing
is detected. The 4.4 BSD operating system [50], the AIX system on the IBM RS/6000 [32], and
HP-UX 10.0 on the HP 9000 [31] are examples that adopt this method. In addition, the HP-UX
system provides a “serialize()” command to run the processes one at a time when thrashing
is detected. In contrast, TPF protects the system from thrashing at the page replacement
level. Memory allocation scheduling at this level allows us to carefully consider the tradeoff
between CPU and memory utilization.
In [35], we proposed another thrashing prevention mechanism called Token-ordered LRU,
which attempts to prevent the occurrence of thrashing by eliminating false LRU pages. False
LRU pages are produced because of the I/O penalties of page faults, rather than because of
program access delays. Using a token to set a memory allocation priority, Token-ordered
LRU can effectively prevent thrashing and achieve a performance improvement similar to
that of TPF.

4.7 Summary

We have investigated the risk of system thrashing in page replacement implementations
by examining the Linux kernel code of versions 2.0, 2.2, and 2.4, and by running interacting
SPEC2000 benchmark programs in a Linux system. Our study indicates that this risk
is rooted in the conflicting requirements on CPU and memory utilization in a
multiprogramming environment. We have experimentally observed several system thrashing
cases when processes dynamically and competitively demand memory allocations, which
causes low CPU utilization and long execution time delays, and eventually threatens system
stability.
We have proposed TPF and implemented it in the Linux kernel to prevent the system
from thrashing among interacting processes, and to improve CPU utilization under
heavy load. TPF is awakened when the CPU utilization is lower than a predetermined
threshold and the page fault rates of more than one interacting process exceed a
threshold. TPF then grants a privilege to an identified process to limit its contribution of
NRU pages. We create a simple kernel monitoring routine in TPF to dynamically identify
an interacting process that most deserves temporary protection. The routine also
monitors whether the identified process has satisfactorily lowered its page fault rate after the
protection. If so, its privilege is disabled so that it participates equally with other
processes in contributing NRU pages.
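
The trigger and release conditions summarized above can be sketched as a small decision
function. This is a hedged illustration, not the kernel routine: the function name
`tpf_step`, the mapping of the conditions onto the parameters `cpu_low`, `pf_high`, and
`pf_low`, and in particular the rule for choosing which process to protect (here simply
the smallest process id) are assumptions made for the example.

```python
def tpf_step(protected, cpu_util, fault_rates, cpu_low, pf_high, pf_low):
    """One monitoring step; returns the id of the protected process or None.

    protected   -- id of the currently protected process, or None
    cpu_util    -- current CPU utilization in [0, 1]
    fault_rates -- dict mapping process id to its page fault rate
    """
    if protected is not None:
        # Release the privilege once the protected process has lowered
        # its page fault rate sufficiently.
        if fault_rates.get(protected, 0.0) < pf_low:
            return None
        return protected

    faulting = sorted(pid for pid, rate in fault_rates.items() if rate > pf_high)
    if cpu_util < cpu_low and len(faulting) > 1:
        # Placeholder selection rule (assumption): protect the smallest pid.
        return faulting[0]
    return None


state = None
state = tpf_step(state, cpu_util=0.2, fault_rates={1: 80.0, 2: 90.0},
                 cpu_low=0.4, pf_high=50.0, pf_low=10.0)
```

Because the function does nothing unless both trigger conditions hold, it models a
facility that stays out of the routine path of memory management.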
Through experiments and performance evaluation, we show that the TPF facility
can effectively provide thrashing protection without negative effects on overall system
performance, for three reasons: (1) the privilege is granted only when a thrashing problem is
detected; (2) although the protected process could lower the memory usage of the rest of
the interacting processes for a short period of time, the system will soon become stable under
the protection; and (3) TPF is simple to implement, with little overhead, in the Linux kernel.
Because the conflicting interests between CPU and memory utilization are inherent in global
page replacement, and our solution regulates the conflict by tuning page replacement,
we believe that the TPF idea is applicable to the virtual memory systems of other UNIX-like
systems.

Chapter 5

Multi-Level Buffer Cache Management
In a large client/server cluster system, file blocks are cached in a multi-level storage
hierarchy: client buffer caches, multiple server buffer caches, and built-in caches of disks at
the bottom level. More and more applications rely on the hierarchy for their file accesses, so
the caching effectiveness of the hierarchy is important to application performance.

5.1 Background

5.1.1 Hierarchical Caching and its Challenges

With the ever-widening gap between the speeds of processors and hard disks, practitioners
try to make full use of the available buffer caches along a file block's retrieval route in
order to satisfy requests before they reach the disk surfaces. Besides the buffer caches
at clients, the requested blocks can also be cached in server buffer caches and disk built-in
caches, which form a multi-level buffer cache hierarchy (see Figure 5.1). For example,
modern high-end disk arrays typically have several gigabytes of cache RAM. Though multiple
buffer resources are lined up and their aggregate size is increasingly large, the issue of how
to make them work together effectively to deliver performance commensurate
with the aggregate size of the distributed buffer caches is still not well addressed. There are
two challenges related to this issue.

Figure 5.1: Multi-level buffer cache hierarchy. Caches are distributed along the clients,
intermediate servers (front-tier and end-tier), and disk array, where accessed blocks can be
buffered.
The first challenge comes from the weakened locality in the low level buffer caches¹.
Caching works because of the existence of locality, which is an inherent property of
application workloads. Only the first level buffer cache is exposed to the original locality and
has the highest potential to exploit it. Low level caches hold the misses from their upper level
buffer caches. In other words, the stream of access requests from applications is filtered by
the high level caches before it arrives at the low level ones. Thus the access stream seen
by low level caches has weaker locality than that available to the first level cache. The
performance of widely used recency-based replacement algorithms such as LRU can be
significantly degraded once they are employed in the low level buffer caches. Muntz and
Honeyman [54] as well as Zhou et al. [82] have observed serious performance degradation
in their file server buffer cache studies. In a work investigating the cost-effectiveness of
disk built-in caches for desktop PCs, Zhu and Hu found that the built-in cache contributes
little more to the reduction of average response time once its size exceeds 512KB with a
client cache size of 16MB [84]. The above cited work indicates that applying a replacement
algorithm independently at a low level buffer cache could lose the chance to exploit the
original locality. This motivates us to make replacement decisions based on the original
access stream, which is only available at the first level cache.

¹ By low level buffer caches, we customarily refer to the caches not close to the clients
running the workload. Similarly, high levels of buffer caches are those close to the clients.
Thus, the first level buffer cache is the client buffer cache with the highest level.

The second challenge comes from the undiscerning redundancy among levels of the buffer
caches. Redundancy means that a block is cached and duplicated along its retrieval route in
more than one cache. Without proper coordination among the levels, blocks could reside
undiscerningly in multiple buffer caches for a long period of time before they become cold
enough to be replaced by a local replacement algorithm. The redundancy can leave the
buffer cache hierarchy seriously under-utilized. Even if the aggregate size of the multi-level
buffer caches could hold the working set, under some access patterns the hierarchy would
behave as if it were only as big as the single level of cache with the largest size. We propose
to use a unified replacement scheme for a multi-level cache hierarchy, which can determine
an appropriate place for a block to be cached (if it needs to be cached). Thus undiscerning
redundancy can be eliminated. The hierarchy can perform as a unified cache with a size
equivalent to the aggregate size, so that all the cache space is fully utilized.

5.1.2 Possible Solutions: Customized Second-Level Replacement and the Unified LRU

We have seen recent work on each of the two issues. Most of the work attacks the
aforementioned challenges separately. Multi-Queue [82, 81] and unified LRU [76] are two
representative examples.
Multi-Queue (MQ) is a customized second-level replacement algorithm. To overcome
LRU's inability to deal with weak locality at the second level cache, MQ resorts to frequency,
the number of times a block has been accessed, to differentiate the locality of the accessed
blocks. For this purpose, it sets up multiple queues and uses access frequencies to determine
which queue a block should be in. Whenever the access frequency of a block accumulates
to a certain threshold, the block moves up to a queue for higher frequency blocks.
Periodically, blocks that have not been accessed for a period of time are demoted into a queue
for lower frequency blocks until they are finally replaced. By tracking and utilizing a deep
access history, MQ can achieve a higher hit ratio than LRU in a second-level cache. However,
there are two weaknesses in MQ when it is used to address the challenges in the multi-level
caching hierarchy. First, it inherits the disadvantages of frequency-based replacement
algorithms such as Least Frequently Used (LFU), which respond slowly to access pattern
changes and carry a high overhead. Second, because the clients own the original locality
information, the lack of hints from clients greatly limits its potential to exploit locality for
a high hit ratio.
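
The promotion and demotion behavior described above can be illustrated with a deliberately
simplified Python sketch. It is not the published MQ algorithm of [82, 81]: the class name
`SimpleMQ`, the promotion rule (queue index equal to the floor of the base-2 logarithm of
the access frequency), the fixed `lifetime`, and the omission of any history of evicted
blocks are all assumptions of this example.

```python
from collections import OrderedDict


class SimpleMQ:
    """Simplified multi-queue cache (illustrative sketch only)."""

    def __init__(self, capacity, num_queues=4, lifetime=8):
        self.capacity = capacity
        self.lifetime = lifetime
        # Each queue keeps its blocks in LRU order, oldest first.
        self.queues = [OrderedDict() for _ in range(num_queues)]
        self.info = {}   # block -> [frequency, queue index, expiry time]
        self.now = 0

    def _place(self, block, level):
        freq, old_level, _ = self.info[block]
        self.queues[old_level].pop(block, None)
        self.queues[level][block] = True   # most recently used end
        self.info[block] = [freq, level, self.now + self.lifetime]

    def _demote_expired(self):
        # A block left unaccessed past its lifetime drops one queue.
        for level in range(1, len(self.queues)):
            queue = self.queues[level]
            if queue:
                oldest = next(iter(queue))
                if self.info[oldest][2] < self.now:
                    self._place(oldest, level - 1)

    def _evict(self):
        # The victim is the oldest block of the lowest non-empty queue.
        for queue in self.queues:
            if queue:
                victim, _ = queue.popitem(last=False)
                del self.info[victim]
                return victim

    def access(self, block):
        """Reference a block; return True on a hit and False on a miss."""
        self.now += 1
        hit = block in self.info
        if not hit:
            if len(self.info) >= self.capacity:
                self._evict()
            self.info[block] = [0, 0, 0]
        self.info[block][0] += 1
        freq = self.info[block][0]
        level = min(freq.bit_length() - 1, len(self.queues) - 1)
        self._place(block, level)
        self._demote_expired()
        return hit


mq = SimpleMQ(capacity=2)
hits = [mq.access(b) for b in "aabca"]
```

In this run block a, accessed twice, survives the arrival of b and c even though it is the
least recently used block, which is the behavior that distinguishes a frequency-based scheme
from LRU.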
Another solution was proposed by Wong and Wilkes [76], who eliminate the redundancy
by simply applying a unified LRU scheme to a two-level buffer cache: the client buffer cache
and the disk array built-in buffer cache. As shown in Figure 5.2, there is a unified LRU stack.
The first portion of the LRU stack corresponds to the client cache, and the second portion
corresponds to the disk array cache. Any block moving from the first portion
into the second portion due to its increased recency incurs a demotion, an operation
that transfers a block from the current level to the next lower level cache. Since any recently
referenced blocks are brought to the top of the LRU stack, all newly referenced blocks are
cached in the first level cache and slip down to the low level caches through demotions if
they are not re-accessed. Though their scheme has a significant advantage over independent
replacements by eliminating redundancy, the unified LRU scheme has two critical weaknesses.
First, there is no explicit block placement arrangement that adapts to the access
patterns of blocks. A block requested by a client has to be transferred to the client for its use.
However, this block does not necessarily need to be cached there. For example, a block that
is unlikely to be re-accessed soon should be quickly evicted from the client cache after its
use, and may be cached at a low level cache or not cached at all. By indiscriminately storing
all the accessed blocks, high level caches cannot serve the blocks with strong locality well.
Second, it could generate a large number of demotions, because any access that is not a
hit in the client is accompanied by a demotion. It has been shown that the benefits of cache
coordination can be nullified by the demotion cost once the I/O bandwidth is below a
certain threshold [14].

Figure 5.2: In the two-level unified LRU scheme, there is a unified LRU stack corresponding to
the two levels of caches. The size of each individual LRU stack, N1 or N2, is equal to its
respective cache size in terms of blocks. There are three types of accesses: (1) a hit in the L1
cache; (2) a hit in the L2 cache; (3) a miss in both caches. In all three cases, the accessed
block is moved to the top of the stack. Except in the first case, the block at the bottom of the
L1 LRU stack is demoted onto the top of the L2 stack.
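
The three access cases of the unified LRU scheme can be reproduced with a short simulation.
The sketch below is an illustrative model rather than the implementation of [76]: the function
name `unified_lru` and the statistics it returns are assumptions, and a demotion is counted
only once the first level is full.

```python
def unified_lru(trace, n1, n2):
    """Simulate a two-level unified LRU stack and count demotions.

    n1 and n2 are the sizes, in blocks, of the L1 and L2 portions.
    """
    stack = []   # index 0 is the top (most recently used) of the stack
    stats = {"l1_hits": 0, "l2_hits": 0, "misses": 0, "demotions": 0}
    for block in trace:
        if block in stack:
            position = stack.index(block)
            stack.remove(block)
            if position < n1:
                stats["l1_hits"] += 1
            else:
                # The block leaves L2, so the bottom block of L1 is demoted.
                stats["l2_hits"] += 1
                stats["demotions"] += 1
        else:
            stats["misses"] += 1
            if len(stack) >= n1:
                stats["demotions"] += 1   # L1 is full: its bottom block moves to L2
            if len(stack) == n1 + n2:
                stack.pop()               # the bottom block of L2 is evicted
        stack.insert(0, block)
    return stats


stats = unified_lru("abcab", n1=2, n2=1)
```

Once the first level is full, every access that is not an L1 hit adds one demotion, so a
workload with few L1 hits pays close to one block transfer per reference.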

5.1.3 Our Principles to Address the Challenges

Our general approach to addressing the challenges includes two steps. First, we propose
a new method to quantify the locality strength of accessed blocks. Then we develop a
mechanism to lay out the cached blocks along the cache hierarchy according to their quantified
locality strength. To serve the purpose of block placement and replacement in multi-level
buffer caches, we have two requirements on the locality strength quantification method:
(1) distinction of locality strengths; and (2) stability of the distinction. These are also our
two principles for addressing the challenges. Regarding the distinction, if the algorithm can
accurately and responsively distinguish blocks with strong locality from those with weak
locality², then the stronger the locality of a block is, the higher the level of cache it should
be placed in. This hierarchical distinction of locality makes high levels of caches
contribute more to the hit ratio, which reduces the average access time because of their
low hit times. Since the arrangement of block caching positions is based on the distinction
of locality strengths, we need to re-arrange the blocks once the locality strengths change,
which means transferring blocks among levels. This incurs a communication cost. Thus the
stability of the distinction is critical to keeping the communication cost introduced by a
unified caching scheme low.
Following these two principles, we propose a client-directed file block placement and
replacement protocol, where non-uniform strengths of locality are dynamically identified at
the client level to direct file blocks to be placed or replaced at different levels of buffer
caches accordingly. The effectiveness of our proposed protocol comes from achieving the
following three goals. (1) The multi-level cache retains the same hit ratio as that of a single
level cache whose size equals the aggregate size of the multi-level caches. (2) The non-uniform
locality strengths of blocks are fully exploited and ranked to fit into the physical multi-level
caches. (3) The communication overheads between caches are reduced.

² By a block with strong locality, we mean that it is highly likely to be referenced soon, and
that it contributes more to the hit ratio by being cached than one with weak locality. The
strengths of locality are quantified differently in different replacement algorithms, which we
will discuss in the next section.

5.2 Quantifying Non-uniform Locality Strengths in Hierarchical Buffer Caching

5.2.1 Methods to Distinguish Locality Strengths

Caching works because of the existence of locality. While spatial locality is mostly exploited
by increasing block sizes and prefetching, replacement algorithms usually depend on
temporal locality to make re-accessed blocks hit in the cache. Belady first introduced the
concept of locality and recognized its importance in the context of memory systems [7].
With temporal locality, if a block is referenced, it will tend to be referenced again soon.
Although there exist a clear description and an agreed intuitive understanding of the
notion, a common quantitative definition of locality is rarely seen in the literature. However,
for the purpose of replacement, each replacement algorithm has its own method to
quantify locality strengths and to make distinctions among them.
Suppose the block reference stream is $\{R_t, t = 0, 1, 2, \ldots\}$ and the block accessed at
time $t$ is $block(R_t)$, as shown in Figure 5.3. The distance between two references $R_i$
and $R_j$ is the number of other distinct blocks accessed between time $i$ and time $j$.
Specifically, if $block(R_i) = block(R_j)$, the distance is called the re-use distance of the
block. For example, in a segment of a reference stream denoted by the blocks accessed,
{..., a, b, c, b, a, ...}, the re-use distance of block a is 2 because there are two other
distinct blocks, b and c, in between. The distance is also the distance between the positions
of two accessed blocks in the LRU stack [20], which is a list in which all accessed blocks are
stored in the order of their references, and any newly accessed block is moved to the top of
the stack. Though the LRU stack was initially used for the LRU replacement algorithm, it has
been widely used to describe and study various replacement algorithms, such as those in
[37, 33, 67].
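
The equivalence between re-use distance and LRU stack position can be checked with a small
Python sketch. The function name `reuse_distances` is illustrative; the stack is kept as a
plain list for clarity, not for efficiency.

```python
def reuse_distances(trace):
    """Return the re-use distance of every reference (None for a first access)."""
    stack = []      # LRU stack, index 0 is the top
    distances = []
    for block in trace:
        if block in stack:
            # The depth in the stack equals the number of other distinct
            # blocks referenced since the previous access to this block.
            depth = stack.index(block)
            distances.append(depth)
            stack.pop(depth)
        else:
            distances.append(None)
        stack.insert(0, block)   # a newly accessed block moves to the top
    return distances


print(reuse_distances(["a", "b", "c", "b", "a"]))  # [None, None, None, 1, 2]
```

The last value reproduces the example in the text: the second reference to block a has a
re-use distance of 2.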

Figure 5.3: In the access stream $\{R_t, t = 0, 1, 2, \ldots\}$, $R_i$, $R_j$, and $R_l$ are three
immediately consecutive references to block b. The current time is $k$. With these timing
points, there are various measurements that can be used to quantify the locality strength of
block b at time $k$, including the distance from $R_k$ to $R_l$, called OPT Distance (OD); the
distance from $R_j$ to $R_k$, called Recency Distance (RD); the distance from $R_j$ to $R_l$,
called Current Re-use Distance (CRD); and the distance from $R_i$ to $R_j$, called Last Re-use
Distance (LRD).

Here we do not consider methods that use frequency to estimate locality, because frequency
becomes irrelevant to the current locality when the accesses it counts took place much
earlier than the recent accesses.
The off-line optimal replacement algorithm, OPT, uses the distance between the current time
and the next reference to a block to quantify the locality strength of the block. We call
this distance the OPT distance (OD). Considering that OPT replacement maximizes the
hit ratio for a given cache by selecting the block with the largest OD for replacement, OD
provides the most accurate distinction of locality strengths among accessed blocks.
LRU replacement assumes that a block accessed recently will be accessed
again soon. Using the time of the last reference to a block to predict the time of its next
reference, the LRU algorithm uses Recency Distance (RD), which is the distance between
the block's last reference and the current time, to simulate the OPT Distance (OD). Both OD
and RD are measured relative to the current time, so they change with every reference to any
block. The locality strengths quantified with OD or RD could therefore be very dynamic. When
the stability of quantified locality strengths is of concern, it is unclear where a block should
be cached to reduce the communication cost.
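
The difference between ranking by OD and ranking by RD can be seen in a small single-level
simulation. The sketch below is illustrative only: the function name `count_hits` is an
assumption, the cache size must be at least one block, and OPT is implemented by evicting the
block whose next reference lies farthest in the future, which selects the block with the
largest OD.

```python
def count_hits(trace, size, policy):
    """Count hits for a cache of `size` blocks under "LRU" or "OPT"."""
    trace = list(trace)
    cache, hits = [], 0   # cache[0] is the least recently used block

    def next_use(block, now):
        # Time of the next reference to `block` after `now`, if any.
        try:
            return trace.index(block, now + 1)
        except ValueError:
            return float("inf")

    for now, block in enumerate(trace):
        if block in cache:
            hits += 1
            cache.remove(block)
        elif len(cache) == size:
            if policy == "LRU":
                victim = cache[0]                                    # largest RD
            else:
                victim = max(cache, key=lambda b: next_use(b, now))  # largest OD
            cache.remove(victim)
        cache.append(block)   # the accessed block becomes the most recent one
    return hits


loop = "abcabcabc"
lru_hits = count_hits(loop, 2, "LRU")
opt_hits = count_hits(loop, 2, "OPT")
```

On this looping trace, with a cache one block smaller than the loop, LRU never hits while
OPT does, because RD is a poor predictor of OD for such an access pattern.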
In the unified LRU replacement [76], when a block slips down in the LRU stack with
the ongoing references, it incurs a demotion once its recency reaches the size of its local
LRU stack. Had it been known, when the block was requested, at what recency it would be
re-accessed, we could have cached it directly at the level of cache corresponding
to that recency, and the demotions could have been avoided. This motivates us to use the
distance between the last reference and the next reference to a block, called Current Re-use
Distance (CRD), to quantify locality strengths. CRD is also the recency at which the block will
be referenced next time. After a block is accessed, its CRD will not change until its next
reference. This helps to stabilize the distinction of locality strengths. Because CRD represents
a future access timing, it cannot be collected on-line. In an on-line algorithm, we use
Last Re-use Distance (LRD) to simulate CRD (see Figure 5.3). LRD is also the recency
at which the block was accessed last time.
However, LRD could miss some of the most recent access information. The LRD of a block
does not account for the references made after the last reference to the block, which are
reflected in its recency. To responsively capture changes of locality scope (a hot block becomes
cold, or vice versa), we use the recency distance in place of LRD once the recency exceeds
LRD. That is, we use the larger of LRD and RD to simulate CRD, and call it LRD-RD. All of
the aforementioned locality strength measurements can be illustrated in the LRU stack shown
in Figure 5.4. We will develop our caching protocol based on a data structure that uses the
LRU stack as a basis.

[Figure 5.4: two LRU stack diagrams, (a) and (b), each marking the last access position, the current block position, and the next access position of a block, together with the distances LRD, RD, CRD, OD, and LRD-RD.]

Figure 5.4: In the LRU stack, for a given block, the position for the last access to the block
corresponds to its LRD, its current position in the stack corresponds to its RD, and the position for
its next access corresponds to its CRD. Before its current position exceeds its last access position
(see left figure (a)), LRD-RD is LRD; after that (see right figure (b)), LRD-RD becomes RD. This
allows LRD-RD to more accurately simulate CRD. The illustration also shows that RD and OD
change with every reference.
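To make the definitions concrete, the following Python sketch computes LRD, RD, and LRD-RD for every block after replaying a short trace through an unbounded LRU stack. It is only an illustration under stated assumptions: recency is counted as the stack position with the top at 0, a block that has been accessed only once is given an infinite LRD, and the function name `lrd_rd_after` is hypothetical.

```python
import math


def lrd_rd_after(trace):
    """Replay `trace` and return {block: (LRD, RD, LRD-RD)} at the end.

    Assumption: recency is the position in an unbounded LRU stack (top = 0),
    and a block seen only once has LRD = infinity.
    """
    stack = []  # LRU stack, index 0 is the most recently used block
    lrd = {}    # recency at which each block was last accessed
    for block in trace:
        if block in stack:
            lrd[block] = stack.index(block)  # recency at the moment of re-access
            stack.remove(block)
        else:
            lrd[block] = math.inf            # first access: no re-use distance yet
        stack.insert(0, block)
    # RD is the current stack position; LRD-RD is the larger of LRD and RD.
    return {b: (lrd[b], rd, max(lrd[b], rd)) for rd, b in enumerate(stack)}


if __name__ == "__main__":
    for block, values in lrd_rd_after(list("abcabde")).items():
        print(block, values)
```

For the trace a, b, c, a, b, d, e, block a was last re-accessed at recency 2 but has since sunk to position 3, so its LRD-RD equals its RD (case (b) in Figure 5.4), while block b still sits at position 2 and its LRD-RD equals its LRD (case (a)).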

5.2.2 Comparisons of Locality Strength Quantification Methods

Each of the four measures, OD, RD, CRD, and LRD-RD, is associated with a replacement
algorithm. A replacement algorithm ranks its accessed blocks according to a certain measure,
and selects the lowest-ranked block for replacement when a victim block is needed. For
example, the measure used by OPT is OD and the measure used by LRU is RD. How well a
measure satisfies the two requirements, the distinction of locality strengths and the stability
of that distinction, determines how well the corresponding replacement algorithm serves as a
unified replacement algorithm for a multi-level cache hierarchy.
To understand and compare the two abilities of the measures, we use six small-scale
workload traces (cs, glimpse, zipf, random, sprite, and multi) with representative access
patterns for the evaluation. The traces are briefly described as follows.

1. cs is a trace of an interactive C source program examination tool, which was collected
with about 9MB of kernel sources as input.

2. glimpse is a trace of a text information retrieval utility. The search was conducted on
text files of about 50MB and their index files of about 5MB.

3. zipf is a synthetic trace, in which only a few blocks are frequently accessed. Formally,
the probability of a reference to the ith block is proportional to 1/i. The data set it
accesses is 39MB.

4. random is a synthetic trace with a spatially uniform distribution of references across
all the accessed blocks. The data set it accesses is 39MB.

5. sprite consists of requests to a file server from client workstations over a two-day period
in the Sprite network file system [4], and covers a 28MB data set.

6. multi is obtained by executing four workloads, cpp, gnuplot, glimpse, and postgres,
together, and covers a 29MB data set.

Among these traces, cs and glimpse are used in [18, 19, 33], sprite is used in [45, 33], multi
is used in [42, 33], and zipf and random are used in [76] to evaluate the performance
of replacement algorithms. These traces represent the major access patterns common to
I/O requests. Traces cs and glimpse have a looping access pattern, where all blocks are
regularly and repeatedly accessed. Trace sprite has a temporally-clustered access pattern,
where blocks accessed more recently are the ones more likely to be accessed soon. It is
an LRU-friendly pattern. The access pattern of trace random is common in database
applications. Zipf-like access patterns exhibited in trace zipf are typical for file references
in Web servers. Trace multi has an access pattern mixed with sequential, looping and
probabilistic references.
For a given measure, each accessed block has a changing value. When there is a reference
to a block, the value of the block, and possibly the values of other blocks, are changed. For
each measure we maintain a list of blocks in ascending order of their measure values. The
list is dynamically updated with each new block reference to maintain the order, and in the
process blocks move within the list. We divide the full length of each list into
ten segments of equal size. We collect the number of references to each segment to observe
the locality strength distinction. We also collect the block movements across each of the
segment boundaries to observe the stability of the distinctions when the list is updated
with references. For example, if the given measure is RD, the list is actually an LRU stack
of unbounded size. Each of the ten segments represents a range of stack positions
with certain recencies. What we want to investigate is the positions in the stack where
references take place and the block movements in the stack for a given workload trace.

[Figure 5.5: chart with legend entries Segment 1 through Segment 10 for the workloads CS, GLIMPSE, ZIPF, RANDOM, SPRITE, and MULTI.]

Figure 5.5: Reference ratios to each of the segments (the ratios between the number of references to
a segment and the number of all references in a workload). It also shows the accumulative reference
ratios for the first N segments in each workload, where N is 1 through 10.
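The measurement described above can be sketched for the RD measure as follows. This is a minimal illustration, not the tool used to produce Figure 5.5: the function name `rd_segment_ratios` is hypothetical, the full list length is taken to be the number of distinct blocks in the trace, and first-time references (which find no block in the list) are counted in the total but assigned to no segment.

```python
def rd_segment_ratios(trace, num_segments=10):
    """Reference ratio per segment of an unbounded LRU stack (measure RD).

    Assumptions: the list length is the number of distinct blocks in the
    trace, and cold (first-time) references fall into no segment.
    """
    total_blocks = len(set(trace))
    stack = []                       # index 0 = top of the LRU stack
    hits = [0] * num_segments
    for block in trace:
        if block in stack:
            pos = stack.index(block)                 # RD at the time of reference
            hits[pos * num_segments // total_blocks] += 1
            stack.remove(block)
        stack.insert(0, block)                       # referenced block moves to the top
    return [h / len(trace) for h in hits]


if __name__ == "__main__":
    looping = list(range(10)) * 3    # ten blocks referenced in a loop, three times
    print(rd_segment_ratios(looping))
```

On this small looping trace every re-reference falls into the last segment of the RD list, which mirrors the behavior reported below for the looping workloads.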
Figure 5.5 shows the reference ratio distributions in the list for each measure. Each of
the measures orders accessed blocks in its list and places the blocks with small values at
the head of the list (in the case of measure RD, it is the top of an LRU stack). A good
distinction of locality strengths should generate a reference ratio distribution with more hits
appearing in the head portion of the list than in its tail portion. Assuming each of
the segments corresponds to a level of cache, we can observe the hit ratio on each level of
cache. From the figure we have the following observations:
(1) OD provides the best reference ratio distribution. The higher (closer to the list
head, and with a smaller segment number in Figure 5.5) a segment is, the higher the reference
ratio the segment achieves for OD. This reflects the strong ability of OD to accurately
make the distinction of locality strengths. Actually, the distribution generated by OD is
optimal considering the optimality of the OPT algorithm. Because high segments are mapped to
the high levels of caches, which have small hit times, such a distribution helps reduce the
average access time. In contrast, RD provides the worst distribution, though it attempts
to simulate (predict) OD. This is especially apparent for the workloads with a looping access
pattern: cs and glimpse. Most of their references go to the low segments (after Segment 9 in
cs, and after Segment 3 for glimpse). This indicates that even a unified LRU replacement
can hardly achieve high hit ratios until the aggregate cache size can hold all the accessed
blocks. RD only performs well on the workloads with an LRU-friendly access pattern, such
as sprite.
(2) CRD performs well for all the workloads with various access patterns. This reflects
its ability to make consistently accurate distinctions. Except for trace random, LRD-RD
performs very closely to CRD, though it does not depend on future knowledge. Without
looking ahead, an on-line algorithm can at best match RANDOM replacement
for trace random, which randomly selects a block for replacement and has a hit
ratio proportional to the cache size. Both LRD-RD and RD obtain such a distribution for
the trace.
(3) For the two on-line measures, LRD-RD produces significantly better locality
distinctions than RD for workloads cs, glimpse, zipf, and multi. For the LRU-friendly workload
sprite, both RD and LRD-RD perform very well, and RD performs a little better than LRD-RD.
[Figure 5.6: line plots, one panel per workload (recoverable panel titles: ZIPF, RANDOM, SPRITE, MULTI), with curves for RD, OD, CRD, and LRD-RD; x-axis: List Position (# of blocks).]

Figure 5.6: Movement ratio curves showing the ratios between the number of block movements
across a segment boundary of the ordered lists and the number of total references for the four
measures: OD, RD, CRD, and LRD-RD on various workloads. It shows that there are two groups
of curves: OD and RD with high movement ratios, CRD and LRD-RD with low movement ratios.

Figure 5.6 shows block movement ratios between the number of block movements across
each of the segment boundaries and the number of all references for each of the four
measures. For example, the first point from the left on a curve represents the ratio between the
number of times that blocks cross the boundary between the first and second segments
and the number of all references. A small movement ratio means a high stability for the
distinction of locality strengths. When the segments are mapped to the levels of caches
and a boundary corresponds to the interface of two adjacent levels of caches, a movement
ratio determines the communication overhead in unified caching. We have the following
observations in the figure:
(1) OD and RD have the highest movement ratios, which is expected because
of their volatility. Comparatively, CRD and LRD-RD have much lower movement ratios.
(2) The ratio gaps between CRD (resp. LRD-RD) and OD (resp. RD) are especially
pronounced with the looping pattern trace glimpse. However, even for the LRU-friendly
workloads like sprite and zipf, the gaps are still considerably large. This demonstrates
that an on-line unified caching scheme based on LRD-RD promises a much smaller additional
communication cost than one based on RD.
(3) The ratios of LRD-RD are smaller than those of CRD in most cases.

                                             OD       RD      CRD      LRD-RD
Ability to distinguish locality strengths    strong   weak    strong   strong
Stability of distinctions                    weak     weak    strong   strong
On-line measures                             no       yes     no       yes

Table 5.1: Comparisons of the four measures on locality strengths by comparing their abilities
to distinguish locality strengths, the stabilities of the distinctions, and whether on-line
measurements are possible.

Table 5.1 summarizes the four measures for distinguishing locality strengths, showing that
LRD-RD is a desirable basis for building a unified caching protocol.

5.3 The Unified and Level-aware Caching (ULC) Protocol

5.3.1 An Executive Summary

We have shown that the position of a block in the list ordered by LRD-RD provides a
strong hint for caching the block on a level corresponding to its list position, or not caching
it at all³. This also assures us that the block would still stay there with a high probability
when the block is accessed next time. Effectively using the hint, we propose a multi-level
buffer placement and replacement protocol, called the Unified and Level-aware Caching (ULC)
protocol, to exploit hierarchical locality. Based on the access patterns and available cache
sizes on each level, ULC running at the first level client dynamically ranks the accessed
blocks into levels L1, L2, ..., and Lout according to their LRD-RD positions, thus directing
them to be placed (cached) at the level L1 cache, the level L2 cache, ..., or not cached at any
level at the time of the retrieval, respectively. The size of the first level cache determines the
number of L1 blocks, those with the smallest LRD-RD values, and the same correlation holds for
the other levels of caches. Low level buffer caches are no longer responsible for extracting
locality from the filtered request stream presented to them. Every block request from the
high level buffer cache carries a level tag, so the low level caches only take their actions
accordingly. If the attached level tag matches its level number, this level will cache the
retrieved block. Otherwise, the block is discarded after it is sent to the next upper
level cache. When the block positions need adjusting, the client sends block demotion
instructions to low level caches, which demand that a block originally residing in a cache be
demoted into its next lower level cache. Our client-directed protocol attempts to answer
the following questions in designing hierarchical caching algorithms: (1) how to exploit the
locality in the entire buffer cache hierarchy thoroughly and consistently; (2) how to make
the exploited locality usable by all buffer caches in the hierarchy; and (3) how to minimize
the overhead of the protocol.

³ Those requested blocks that should not be cached in the first level cache are still brought into the client
for its use, but will not be cached there, i.e. these blocks will be quickly replaced from the client after the
reference.
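The level-tag rule in the summary above can be illustrated with a small Python sketch. It is a simplified reading of the text rather than the protocol implementation: the function name `route_retrieved_block` and the action labels are hypothetical, levels are numbered from 1 (the client) downward, and a block tagged for no level passed on the way up is simply forwarded.

```python
def route_retrieved_block(source_level, tag_level):
    """List what each level does as a block travels up to the client.

    source_level: level where the block was found (n + 1 can stand for disk).
    tag_level:    level tag carried by the request (None for an L_out block).
    Returns (level, action) pairs from just above the source up to level 1.
    """
    actions = []
    for level in range(source_level - 1, 0, -1):
        if level == tag_level:
            actions.append((level, "cache"))          # tag matches: keep a copy here
        elif level == 1:
            actions.append((level, "use_then_drop"))  # client uses it, replaces it soon
        else:
            actions.append((level, "forward"))        # pass upward, then discard
    return actions


if __name__ == "__main__":
    # A block read from disk (level 4 in a 3-level hierarchy) and tagged L2.
    print(route_retrieved_block(4, 2))
```

In this example the level 3 cache only forwards the block, the level 2 cache keeps it, and the client uses it without caching it, as described in footnote 3.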

5.3.2 A Detailed Description

In Chapter 5.2.2 we have shown that the LRD-RD measure is a promising basis on which to
build a multi-level caching protocol. However, an implementation of an algorithm based
exactly on the LRD-RD ranking criterion will take at least O(log n) time, where n is the number
of distinct accessed blocks. This is the cost of block ordering. In order to develop an
efficient algorithm with O(1) time complexity, we transform the process of determining
the position of a block in an LRD-RD ordered list into two separate steps: (1) When a
block is accessed, its recency is 0, so its LRD-RD is LRD, which is the recency at which it
was just accessed. We use the LRD to determine in which segment the block will be cached
at the time of retrieval. (2) Once a block is assigned to a specific segment, we use RD to
determine its position in the segment. Each segment corresponds to a level of cache, and
the size of the segment is the same as that of the cache.
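Step (1) can be pictured with the following Python sketch, which maps the LRD of a just-accessed block to a level under the idealized assumption that segment boundaries lie exactly at the cumulative cache sizes. The function name `level_for_lrd` and the example sizes are hypothetical; the actual protocol locates the boundaries with yardstick blocks, as described next.

```python
from bisect import bisect_right
from itertools import accumulate


def level_for_lrd(lrd, cache_sizes):
    """Return the level (1-based) for a block accessed at recency `lrd`.

    cache_sizes: number of blocks each level can hold, from level 1 downward.
    Returns None when the block falls beyond the last segment (an L_out block).
    Assumption: segment i covers the recencies between the cumulative sizes
    of levels 1..i-1 (inclusive) and levels 1..i (exclusive).
    """
    boundaries = list(accumulate(cache_sizes))   # e.g. [4, 8, 16] -> [4, 12, 28]
    index = bisect_right(boundaries, lrd)        # number of boundaries <= lrd
    return index + 1 if index < len(cache_sizes) else None


if __name__ == "__main__":
    for lrd in (0, 3, 4, 11, 12, 27, 28):
        print(lrd, level_for_lrd(lrd, [4, 8, 16]))
```

With hypothetical cache sizes of 4, 8, and 16 blocks, recencies 0 to 3 map to level 1, 4 to 11 to level 2, 12 to 27 to level 3, and anything larger is left uncached.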
As is shown in Figure 5.7, the recently accessed blocks are maintained in a unified
LRU stack, simplified as uniLRUstack. These blocks could be cached in any level of buffer
caches, or even not cached⁴. For each level of buffer cache there is a yardstick block in
uniLRUstack, which is the block cached in that level of cache that has the maximal recency
among the blocks cached there. We call them Y1, Y2, ..., Yn for the level L1, L2, ..., Ln caches,
respectively. The size of uniLRUstack is actually determined by the position of Yn, the
last yardstick, which always sits at the bottom of uniLRUstack. Any blocks with recencies
larger than that of Yn will be removed from uniLRUstack and become Lout blocks, which
are not cached in any level of caches. Only when a block gets accessed with a recency
between the recencies of Y(i-1) and Yi does the block become an Li block, which means it will
be cached in the level Li cache. All of the blocks cached on the same level can be viewed as
a local LRU stack, called LRUi, where the order of blocks is determined by their recencies
in uniLRUstack and its size does not exceed the size of that level of cache. The block to
be replaced on level Li is the bottom block of stack LRUi. For the requested blocks that
are neither cached in the L1 cache nor going to be cached there because their LRDs are larger
than the recency of Y1, we set up a small LRU stack called tempLRU to temporarily store
these blocks, so that they can be quickly replaced from the L1 cache.

⁴ In a protocol implementation, only some metadata, such as a block identifier and two statuses used in
the ULC protocol, are stored in the stack for each block, not the block itself.

[Figure 5.7: diagram of uniLRUstack with yardsticks Y1, Y2, and Y3 and the per-level LRU stacks; legend: Yardstick, L1 block, L2 block, L3 block, Lout block.]

Figure 5.7: An example to show the data structure of ULC for a 3-level hierarchy. The blocks with
their recencies less than that of yardstick Y3 are kept in uniLRUstack. The level status (L1, L2 or
L3) of a block is determined by its position between two yardsticks where it was accessed last time.
Its recency status (R1, R2 or R3) is determined by its position between two yardsticks where it sits
currently. To decide which block should be replaced in each level, the blocks in the same level can
be viewed to be organized in a separate LRU stack (LRU1, LRU2, or LRU3), and the bottom block
is for replacement.
There are two structures for the buffer cache hierarchy. One is the single-client structure,
in which there is only one client connected to one server⁵, and the other is the multi-client
structure, in which more than one client shares the same server, and blocks requested by
different clients are shared in the server. There are two additional challenges for the
multi-client ULC protocol: (1) how to cache shared blocks in server buffer caches, which could
carry different level tags set by different clients; and (2) how to allocate server cache buffers to
different clients.

5.3.2.1 The Single-client ULC Protocol

The single-client ULC algorithm runs at a client, which holds the first level cache. It has
the knowledge of the size of the buffer cache on each level. For each block in uniLRUstack,
there are two associated statuses: level status and recency status. Level status indicates at
which level the block is cached, such as L1, L2, ..., Ln, or Lout. When a block gets accessed,
we need to know its recency to determine its level status. The recency is actually its LRD.
It takes at least O(N) time to maintain the exact recency information for all blocks, where
N is the aggregate size of the buffer caches. Actually we only need to know between
which two yardsticks the recency lies. Thus we maintain a recency status Ri for
each block, which means its recency is between the recencies of yardsticks Y(i-1) and Yi (or
just less than that of Yi if i is 1). The cost to maintain recency statuses is O(1), which will be
explained.

⁵ Here we call the high level buffer cache the client, and the low level buffer cache the server, when we
discuss two adjacent levels.
Initially, if level Li is not full and the levels that are higher than it are full, any requested
Lout blocks get level status Li and reside in level Li. If all the caches are full, any blocks
accessed when they are not in uniLRUstack are given level status Lout. There are two
circumstances for a block to be outside uniLRUstack. One is that the block is accessed
for the first time; the other is that the block has not been accessed for so long
that it has left uniLRUstack from the bottom. For these blocks the level status is Lout,
and the recency status is Rout.
We define an operation for yardsticks in uniLRUstack called YardStickAdjustment,
which moves a yardstick from the current yardstick block with level status Li in the direction
towards the stack top to the next block with level status Li. All the blocks it passes, including
the current yardstick block, change their recency status from Ri to R(i+1). When a yardstick
block changes its position in uniLRUstack, we need to conduct yardstick adjustment to
ensure the yardstick is on the block with the correct recency status and with the largest recency
among the blocks on the level. Demoting a block into a low level cache is equivalent to
moving the bottom block of local stack LRUi into LRU(i+1), which is sorted by the block recencies
in uniLRUstack. To place the block at the correct recency position in LRU(i+1), we define
another operation for a demoted block called DemotionSearching, which searches in the
direction towards the stack bottom in uniLRUstack for the next block with a higher level
status.
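The yardstick adjustment operation can be sketched in Python as follows. This is an illustrative reading of the description above, not the implementation used in the simulator: the stack is modeled as a plain list of dictionaries ordered from top to bottom, the function name `yardstick_adjustment` and the field names are hypothetical, and every entry the yardstick passes is assumed to carry recency status Ri beforehand.

```python
def yardstick_adjustment(stack, yard_idx, level):
    """Move the level-`level` yardstick from index `yard_idx` towards the top.

    stack: list of entries ordered from the stack top (index 0) to the bottom;
           each entry is a dict with keys "block", "level", and "recency".
    Every entry passed, including the old yardstick, goes from R_i to R_(i+1).
    Returns the index of the new yardstick, or None if no L_i block is left.
    """
    idx = yard_idx
    while idx >= 0:
        stack[idx]["recency"] = level + 1        # passed entry: R_i -> R_(i+1)
        idx -= 1
        if idx >= 0 and stack[idx]["level"] == level:
            return idx                           # next L_i block towards the top
    return None


if __name__ == "__main__":
    stack = [
        {"block": 1, "level": 1, "recency": 1},
        {"block": 2, "level": 2, "recency": 1},
        {"block": 3, "level": 1, "recency": 1},  # current yardstick Y1
        {"block": 4, "level": 2, "recency": 2},
    ]
    print(yardstick_adjustment(stack, 2, 1), [e["recency"] for e in stack])
```

In the example the yardstick Y1 moves from block 3 up to block 1, and blocks 3 and 2 change their recency status from R1 to R2. Updating the level status of a demoted yardstick block is left to the caller in this sketch.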
There are two types of requests in ULC, which are sent from the client to the low level
caches to coordinate the various levels of caches so that they work under a unified caching algorithm.

1. Retrieve(b, i, j) (i ≥ j): retrieve block b from level Li, and cache it on level Lj when
it passes level Lj on its way to level L1.

2. Demote(b, i, j) (i < j): demote block b from level Li into level Lj.

If there is a reference to block b with level status Li and recency status Rj, there are
only two cases we need to deal with: i = j and i > j. The case i < j is not possible because
block b is demoted to level L(i+1) before j becomes larger than i. When block b is referenced, it is
moved to the top of uniLRUstack and its recency status becomes R1. This also makes it
stay at the top of stack LRUi. If i > 1, block b goes to stack tempLRU in the client and is
going to be replaced soon from the client cache. Then for each of the two cases, we act as
follows: (1) i = j. Block b remains in its current level of cache with the same level status
(Retrieve(b, i, i)). (2) i > j. Because block b will be moved from level Li and cached at level
Lj (Retrieve(b, i, j)), a space needs to be freed at level Lj. We demote the yardstick block
Yj to its next lower level cache, whose yardstick block may have to be demoted in turn if its
level status is higher than Li. Yardstick adjustment and demotion searching are conducted
here.

5.3.2.2 The Multi-client ULC Protocol

When there are multiple clients sharing one server, the cache buffers in the server are no
longer solely used by one client. In the single-client ULC protocol, the number of
blocks with level status Li (also called Li blocks), or the size of stack LRUi, is determined
by the size of the level Li cache. If the buffers at level Li are shared by multiple clients,
an allocation policy is needed on level Li for the performance of the entire system. To
obtain the best performance, it is known that allocation should follow the dynamic partition
principle: each client should be allocated a number of cache blocks that varies dynamically
in accordance with its working set size. Experience has shown that global LRU performs
well by approximating the dynamic partition principle [11]. Thus we use a global LRU stack
called gLRU in the server to facilitate the allocation operation. The block order in gLRU
is determined by the block recencies, which are determined by the timings of requests from
clients requiring a block be cached in the server. The bottom block of gLRU is the one to
be replaced when a free buffer is needed. For each block in gLRU we record its owner,
the client that most recently requested that the block be cached in this server. A block is cached
at the highest level among those directed by the clients. If there is only one client, the bottom
block of gLRU is always the yardstick block Yi in uniLRUstack, and is also the bottom
block of stack LRUi in the client. Because the server cache buffer is shared among the
clients, the bottom block of LRUi could have been replaced in the server. If this is the case,
it is equivalent to shrinking the cache size of the server dedicated to the client. So when
a block is replaced from gLRU, a message is sent to its owner client so that a yardstick
adjustment can occur there. Correspondingly, the size of LRUi is decreased by one. The
owner notifications of block replacements can be delayed until the next requested block is
sent to the owner client without affecting correctness. They are then piggybacked on the
next retrieved block, thus saving extra messages. Figure 5.8 shows an example to illustrate
the multi-client case. By dynamically adjusting the yardsticks of affected clients based on the
information provided by the allocation policy, we have a ULC algorithm in clients that allows
low level caches to change their sizes dynamically. The changing sizes are the results of the
allocation policy with the aim of high performance for the entire system.

[Figure 5.8: two panels, (a) and (b), each showing the clients' uniLRUstack, LRU1, and LRU2 stacks and the server's gLRU stack; legend includes Lout block.]

Figure 5.8: An example to explain how a requested block is cached in the server cache, and how
the allocation scheme adjusts the size of the server cache used by various clients in a multi-client
two-level caching structure. Originally in (a) server stack gLRU holds all the L2 blocks from clients
1 and 2, which are also in their LRU2 stacks, respectively. Then block 9 is accessed in client 1.
Because block 9 is between yardsticks Y1 and Y2 in its uniLRUstack, it turns into an L2 block and
needs to be cached in the server. Because the server cache is full, the bottom block of gLRU, block
14, is replaced, which will be notified to its owner, client 2, through a piggyback on the next retrieved
block going to client 2 (delayed notification). After the server buffers re-allocation (b), the size of
server cache for client 1 is increased by 1 and that for client 2 is decreased by 1. So the clients and
the server cooperate to make the server cache efficiently allocated with the aim of high performance
for the entire system.
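The server-side bookkeeping described above (a global LRU stack, one owner per block, and replacement notices piggybacked on the next reply) can be sketched in Python as follows. The class `GlobalLRUServer` and its method names are hypothetical, and the sketch covers only the allocation bookkeeping, not block transfer or the yardstick adjustment performed in the clients.

```python
from collections import OrderedDict, defaultdict


class GlobalLRUServer:
    """Toy model of the server stack gLRU with per-block owners."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.glru = OrderedDict()          # block -> owner; first item = bottom of gLRU
        self.pending = defaultdict(list)   # owner -> replaced blocks not yet reported

    def cache_block(self, block, client):
        """A client requires `block` to be cached in the server."""
        if block in self.glru:
            self.glru.move_to_end(block)                       # refresh its recency
        elif len(self.glru) >= self.capacity:
            victim, owner = self.glru.popitem(last=False)      # replace the bottom block
            self.pending[owner].append(victim)                 # delayed notification
        self.glru[block] = client                              # most recent requester owns it

    def reply_to(self, client):
        """Notices piggybacked on the next block sent to `client`."""
        return self.pending.pop(client, [])

    def allocation(self):
        """Number of server buffers currently owned by each client."""
        counts = defaultdict(int)
        for owner in self.glru.values():
            counts[owner] += 1
        return dict(counts)


if __name__ == "__main__":
    server = GlobalLRUServer(capacity=4)
    for block, client in [(14, 2), (15, 2), (8, 1), (7, 1)]:
        server.cache_block(block, client)
    server.cache_block(9, 1)               # block 9 becomes an L2 block of client 1
    print(server.allocation(), server.reply_to(2))
```

The usage example loosely follows Figure 5.8 with a hypothetical server of four buffers: caching block 9 for client 1 replaces block 14, client 2 learns of it with its next reply, and the allocation shifts from two buffers each to three for client 1 and one for client 2.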

5.4 Performance Evaluation

This section presents our trace-driven simulation results. We compare ULC with two other
multi-client caching schemes: independent LRU, simplified as indLRU, which is a
commonly used scheme, and unified LRU, simplified as uniLRU, an LRU-based unified caching
protocol [76].

5.4.1 Performance Metric

We use average block access time, Tave, to evaluate the performance of various protocols.
This metric measures the average time required to access a block perceived by applications.
The access time is determined by the hit ratios and miss penalties at different levels of the
caching hierarchy, as well as other communication costs. Generally, we can estimate Tave
for an n-level cache hierarchy as follows: $T_{ave} = \sum_{i=1}^{n} h_i T_i + h_{miss} T_m + T_{demotion}$, where
$h_i$ is the hit ratio at the level $L_i$ cache, $T_i$ is the time it takes to access the cache at level $L_i$,
$h_{miss}$ is the miss ratio for the cache hierarchy (equivalent to $1 - \sum_{i=1}^{n} h_i$), $T_m$ is the cost
for the miss, and $T_{demotion}$ is the demotion cost for block placements required by a unified
replacement protocol. If we assume the demotion cost for a block from level $L_i$ to $L_{i+1}$ is
$T_{d_i}$, and the demotion rate between level $L_i$ and $L_{i+1}$ is $h_{d_i}$, then
$T_{demotion} = \sum_{i=1}^{n-1} T_{d_i} h_{d_i}$.
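As a worked illustration of the metric, the following Python sketch evaluates the average access time from per-level hit ratios, access times, the miss cost, and the per-boundary demotion rates and costs. The function name `average_access_time` is hypothetical, and all numbers in the example are made up for illustration; they are not measurements from this study.

```python
def average_access_time(hit_ratios, hit_times, miss_time,
                        demotion_rates, demotion_costs):
    """T_ave = sum(h_i * T_i) + h_miss * T_m + sum(T_di * h_di).

    hit_ratios / hit_times:          one entry per cache level (level 1 first).
    demotion_rates / demotion_costs: one entry per boundary between levels.
    """
    if len(hit_ratios) != len(hit_times):
        raise ValueError("need one hit time per hit ratio")
    if len(demotion_rates) != len(demotion_costs):
        raise ValueError("need one demotion cost per demotion rate")
    miss_ratio = 1.0 - sum(hit_ratios)                        # h_miss
    t_hits = sum(h * t for h, t in zip(hit_ratios, hit_times))
    t_demotion = sum(r * c for r, c in zip(demotion_rates, demotion_costs))
    return t_hits + miss_ratio * miss_time + t_demotion


if __name__ == "__main__":
    # Hypothetical three-level example, times in milliseconds.
    t_ave = average_access_time(
        hit_ratios=[0.5, 0.3, 0.1], hit_times=[0.0, 1.0, 1.2], miss_time=11.2,
        demotion_rates=[0.2, 0.1], demotion_costs=[1.0, 0.2])
    print(round(t_ave, 3))
```

With these hypothetical inputs the hit terms contribute 0.42 ms, the misses 1.12 ms, and the demotions 0.22 ms, so a demotion rate of 0.2 on the first boundary alone accounts for 0.2 ms of the 1.76 ms total.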

We do not consider the situation where demotions are delayed so that their costs could be
hidden from applications, for two reasons: (1) Demotions are highly likely to occur in
bursts, especially for an LRU-based unified replacement, where 50%, and even around
90%, of the references incur demotions. A small number of dedicated buffers has difficulty
in buffering the delayed blocks, so the performance is unpredictable. (2) Reserving a large
number of buffers for delayed demotions actually reduces the cache size and would hurt the
hit ratios.
Specifically, for a two-level client-server cache hierarchy, the average access time is as
follows: $T_{ave} = h_c T_c + h_s T_s + (1 - h_c - h_s) T_m + h_{c-s} T_{c-s}$, where $h_c$ and $h_s$ are the hit ratios
for the client and server respectively, $T_c$ and $T_s$ are the costs for a hit in the client and
server respectively, and $T_m$ is the cost for a miss in the server. $T_m$ can be regarded as $T_s$
plus the disk access time for a block. $h_{c-s}$ is the demotion rate between the client
and the server, and $T_{c-s}$ is the cost for a demotion. We assume $T_c \approx 0$, and the demotion cost $T_{c-s}$
is approximated as the server hit time $T_s$. Then $T_{ave} \approx h_s T_s + (1 - h_c - h_s) T_m + h_{c-s} T_s$.

5.4.2 Simulation Environment

We use trace-driven simulation for the evaluation. Our simulator tracks the statuses of all
accessed blocks, and monitors the requests and hits seen at each cache level and the demotions
at each level boundary. We assume an 8 KB cache block. We use seven large-scale traces to
drive the simulator, including two synthetic traces, random and zipf, and five real-life
workload traces. We have described the two synthetic traces in Chapter 5.2. Here we
significantly increase the scale of these two traces: random accesses 65536 unique blocks
with a 512MB data set and contains about 65M block references; zipf accesses 98304 unique
blocks with a 768MB data set and contains about 98M block references. The three real-life
traces used for the single-client simulation are described as follows:

1. httpd was collected on a 7-node parallel web server for 24 hours [71]. The size of
the data set served was 524 MB, which is stored in 13,457 files. A total of about 1.5M
HTTP requests are served, delivering over 36 GB of data. We aggregate the seven
request streams into a single stream in the order of the request times for the
single-client structure study.

2. dev1 is an I/O trace collected over 15 consecutive days on a Redhat Linux 6.2 desktop
[13]. It contains text editor, compiler, IDE, browser, email, and desktop environment
usage. It has around 100K references. The size of the data set it accessed is around
600M.

3. tpcc1 is also an I/O trace, collected while running the TPC-C database benchmark
with 20 warehouses on Postgres 7.1.2 with Redhat Linux 7.1 [13]. It has around 3.9M
references. The data set size is around 256M.

We also select three traces for multi-client simulation. One of them is the original httpd
trace with seven access streams, each for one client. The other two multi-client traces are
as follows:

1. openmail was collected on a production e-mail system running the HP OpenMail
application for 25,700 users, 9,800 of whom were active during the hour-long trace
[76]. The system has 6 HP 9000 K580 servers running HP-UX 10.20. The size of the
data set accessed by all six clients is 18.6G.

2. db2 was collected by an 8-node IBM SP2 system running an IBM DB2 database that
performed join, set and aggregation operations for 7,688 seconds [71]. The total data
set size is 5.2GB and it is stored in 831 files.

For all the simulation experiments, we use the first one tenth of the block references in
each trace to warm the system before measurements are collected.
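
The warm-up convention can be sketched as follows. This is a minimal illustration and not
the simulator used in the experiments: the names LRUCache and measured_hit_ratio and the
parameter warm_refs are hypothetical, and a single-level LRU cache stands in for the
multi-level hierarchy.

```python
from collections import OrderedDict


class LRUCache:
    """Single-level LRU cache holding block identifiers (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # least recently used block first

    def access(self, block):
        """Reference a block; return True on a hit, False on a miss."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[block] = None
        return False


def measured_hit_ratio(trace, cache, warm_refs):
    """Replay the whole trace, but count hits only after the warm-up prefix."""
    hits = 0
    for i, block in enumerate(trace):
        hit = cache.access(block)
        if i >= warm_refs and hit:
            hits += 1
    measured_refs = len(trace) - warm_refs
    return hits / measured_refs if measured_refs else 0.0


# A loop over 5 blocks repeated 20 times: only the first 5 references miss.
trace = list(range(5)) * 20
with_warmup = measured_hit_ratio(trace, LRUCache(5), warm_refs=len(trace) // 10)
without_warmup = measured_hit_ratio(trace, LRUCache(5), warm_refs=0)
```

In this toy trace the compulsory misses all fall inside the warm-up prefix, so the measured
hit ratio reflects steady-state behavior rather than the cost of filling an empty cache.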

5.4.3 Comparisons of Multi-level Schemes in a Three-level Structure

To demonstrate the ability of the multi-level caching schemes (ULC, indLRU, and uniLRU)
to distinguish locality strengths, as well as their stability, we test them in a
three-level caching hierarchy with the five single-client traces, simulating a scenario
where the block transfer route consists of a client, a server, and the server's disk array
containing a large RAM cache. For a common local network environment, we assume the cost
to transfer an 8 KB block between the client and the server through the LAN is 1 ms, the
cost between the server and the RAM cache in the disk array through the SAN is 0.2 ms, and
the cost of reading a block from a disk into its cache is 10 ms [76]. We assume the cache
sizes of the client, the server, and the disk array are 100 MB each for traces random,
zipf, httpd, and dev1, and 50 MB each for trace tpcc1 due to its comparatively small data
set. We report the hit ratios at each of the three levels, the demotion rates at each
boundary, and the average access time for each workload with the three multi-level caching
schemes in Figure 5.9.
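
To make the cost assumptions concrete, the sketch below combines them into an average
access time. The additive model is an assumption made here for illustration (an L1 hit is
free, a hit at a lower level pays for every link the block crosses, a miss additionally
pays the disk cost, and each demotion pays for the link it crosses); it is not necessarily
the exact formula used by the simulator. The function name average_access_time is
hypothetical. The inputs are the hit ratios and demotion rates reported in this section
for trace random.

```python
# Per-block transfer costs in milliseconds, as assumed in the text.
LAN_MS = 1.0    # client <-> server
SAN_MS = 0.2    # server <-> disk-array RAM cache
DISK_MS = 10.0  # disk -> disk-array cache


def average_access_time(hit_ratios, demotion_rates):
    """Average access time (ms) under an ASSUMED additive cost model.

    hit_ratios:     (h1, h2, h3), fractions of references hitting in L1, L2, L3.
    demotion_rates: (d1, d2), demotions per reference at the L1-L2 and L2-L3
                    boundaries (both zero for an independent scheme).
    """
    h1, h2, h3 = hit_ratios
    d1, d2 = demotion_rates
    miss_ratio = 1.0 - (h1 + h2 + h3)
    l2_hit_time = h2 * LAN_MS
    l3_hit_time = h3 * (LAN_MS + SAN_MS)
    miss_penalty = miss_ratio * (LAN_MS + SAN_MS + DISK_MS)
    demotion_cost = d1 * LAN_MS + d2 * SAN_MS
    return l2_hit_time + l3_hit_time + miss_penalty + demotion_cost


# Values reported in this section for trace random.
ind_lru = average_access_time((0.195, 0.017, 0.003), (0.0, 0.0))
uni_lru = average_access_time((0.195, 0.196, 0.195), (0.805, 0.609))
reduction = 1.0 - uni_lru / ind_lru
```

Under these assumptions the model gives roughly 8.8 ms for indLRU and 6.0 ms for uniLRU on
trace random. The four terms correspond to the components of an access time breakdown: L2
hit time, L3 hit time, miss penalty, and demotion cost.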

Figure 5.9: Hit ratios at each of the three levels, demotion rates at each of the two
boundaries (between L1 and L2, and between L2 and L3), and average access time for each
workload with the multi-level caching schemes indLRU, uniLRU, and ULC. [Three graphs:
(1) hit ratio breakdown for each multi-level caching scheme; (2) demotion ratios at each
boundary for the unified caching schemes uniLRU and ULC; (3) average access time breakdown
(L2 hit time, L3 hit time, miss penalty, demotion cost) for each multi-level caching
scheme. Workloads: RANDOM, ZIPF, HTTPD, DEV1, TPCC1.]

Confirming the experimental results in [76], we observe that uniLRU significantly improves
on indLRU for all five traces, reducing average access time by 17% to 80% (see the third
graph). These results reflect two combined effects of uniLRU: (1) increased cache hit
ratios and (2) additional demotion cost. UniLRU eliminates the redundancy in the
hierarchy, making the low levels of caches contribute to the hit ratio just as if they
were part of the first level. For example, in a random access pattern, the contribution of
a cache to the hit ratio should be proportional to its size. However, for trace random
under indLRU, the second- and third-level caches gain much lower hit ratios (1.7% and
0.3%, respectively) than the first-level cache (19.5%) (see the first graph). The unified
replacement scheme uniLRU makes the low levels of caches much better utilized: their hit
ratios (19.6% and 19.5%, respectively) are almost the same as that of the first-level
cache (19.5%). However, this improvement comes at a considerable price: high demotion
rates. For example, for trace random uniLRU has a first-boundary demotion rate of 80.5%,
which means that 80.5% of block references are accompanied by “write-backs” to the server.
Furthermore, it has a 60.9% demotion rate at the second boundary (see the second graph).
The worst case for the demotion rates of uniLRU is trace tpcc1, which has a looping access
pattern. Its first-boundary demotion rate is 100%. This is because uniLRU has little power
to predict the level at which a block will next be accessed. For a looping access pattern,
blocks are accessed at a large recency equal to the loop distance, which implies that
almost all the blocks of tpcc1 are accessed after they have been demoted into the
second-level cache. So the hit ratio of the second-level cache is very high (92.5%), and
44.7% of the average access time is spent on demotion. Measured against the requirement
that a multi-level caching scheme distinguish locality strengths, this distribution under
uniLRU, with an L1 hit ratio (0.03%) far below the L2 hit ratio (92.5%), is a bad case.
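
The looping behavior described above can be reproduced with a small two-level sketch. It
is a simplified model written for illustration, not the uniLRU implementation evaluated
here: the function name simulate_unified_lru and the cache sizes are hypothetical, and the
two caches are assumed to form one exclusive LRU stack in which every block fetched from
outside L1 enters L1 and pushes L1's least recently used block down to L2.

```python
from collections import OrderedDict


def simulate_unified_lru(trace, l1_size, l2_size, warm_refs):
    """Two-level unified LRU with demotions; returns post-warm-up ratios."""
    l1, l2 = OrderedDict(), OrderedDict()  # least recently used block first
    l1_hits = l2_hits = demotions = measured = 0
    for i, block in enumerate(trace):
        counting = i >= warm_refs
        measured += counting
        if block in l1:
            l1.move_to_end(block)
            l1_hits += counting
            continue
        if block in l2:
            del l2[block]  # exclusive caching: the block moves up to L1
            l2_hits += counting
        l1[block] = None  # every block fetched from outside L1 enters L1
        if len(l1) > l1_size:
            demoted, _ = l1.popitem(last=False)
            l2[demoted] = None  # demotion: L1's LRU block is sent down
            demotions += counting
            if len(l2) > l2_size:
                l2.popitem(last=False)  # falls out of the hierarchy
    return l1_hits / measured, l2_hits / measured, demotions / measured


# Loop over 150 blocks: larger than L1 (100) but within L1 + L2 (200).
loop_trace = list(range(150)) * 10
l1_ratio, l2_ratio, demotion_rate = simulate_unified_lru(
    loop_trace, l1_size=100, l2_size=100, warm_refs=len(loop_trace) // 10)
```

Because the loop distance exceeds the L1 size, each block is demoted before it is
referenced again. After warm-up every reference in this toy trace hits in L2 and triggers
one demotion, which mirrors the pattern reported for tpcc1.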

Compared with uniLRU, the ULC protocol has an access-time-aware hit ratio distribution
along the levels of caches, with more hits appearing at upper levels. For example, the hit
ratios of levels L1, L2, and L3 are 50.3%, 45.1%, and 3.4%, respectively, for trace tpcc1.
Such a distribution is achieved without paying high demotion costs. For example, the two
boundary demotion rates of tpcc1 are 1.4% and 1.3%, respectively (see the second graph).
ULC also has significantly lower demotion rates than uniLRU for all five traces. This
explains why the proportion of demotion cost in the average access time for ULC is much
smaller (from 1% to 8.3%, with an average of 4.1%) than that for uniLRU (from 12.6% to
44.7%, with an average of 21.5%) (see the third graph).

The access time breakdowns also show that ULC still performs significantly better than
uniLRU except for trace random, even if we assume the demotions could be moved off the
critical path for response time. This is in fact an unrealistic assumption: experiments on
a client-server system running a TPC-C benchmark show that demotions can significantly
delay the network and lower the system throughput [14]. In summary, ULC reduces average
access time by 11% to 71% relative to uniLRU, with an average reduction of 34.6%.

5.4.4 The Performance Implication of System Parameters

To be widely applicable, a caching scheme should consistently deliver improved performance
over existing schemes across a large range of system parameters such as cache size and
network bandwidth. For the convenience of observing and comparing the performance
differences of the schemes in this study, we choose the client-server structure, a
two-level cache hierarchy, to present our results. For the two-level hierarchy evaluation,
we include Multi-Queue (MQ). In a client-server caching hierarchy, the environment that MQ
is designed for, we use MQ in the server and LRU in the client, independently. MQ
replacement has a parameter, called lifeTime, which determines how quickly the frequency
of a block that is no longer accessed decays. Because this parameter is workload
dependent, we run each trace with multiple sample lifeTime values in the range suggested
in [79] and report the best results of these runs. For this client-server structure, we
set the time to retrieve an 8 KB block from the server, Ts, to 0.4 ms, and the average
disk access time for an 8 KB block, Td, to 10 ms. Due to space constraints we report the
results for only one synthetic trace, zipf, and two real-life traces, httpd and dev1. The
results for the other traces are consistent with those presented here.

Figure 5.10: The average access times for schemes ULC, uniLRU, MQ, and indLRU with various
server cache sizes. The client cache size is fixed: 256 MB for zipf, and 128 MB for httpd
and dev1. [Three graphs (ZIPF, HTTPD, DEV1): average access time versus server cache size
(MB).]

5.4.4.1 The Impact of Server Cache Size

Figure 5.10 shows the average access time for each workload as the server cache size
changes for all four caching schemes: ULC, uniLRU, MQ, and indLRU. One observation about
the indLRU curves is that each has a flat segment at small server cache sizes. The curves
start to drop only when the server cache size approaches the client cache size. This
demonstrates the serious under-utilization of the server cache under indLRU due to
redundancy and the locality filtering effect. That is, under indLRU a relatively small
server cache contributes little to reducing the average access time in a system with a
large client cache. This is consistent with the study in [84], which suggests that an
increasingly large built-in disk cache helps little alongside a comparatively large file
system buffer cache under two independent LRU replacements. However, this effect does not
appear in the other three schemes, which achieve better performance than indLRU for all
the workloads.

For most of the cases, the performance of uniLRU is better than that of MQ, even though MQ
has no demotion costs. This reflects the merit of a unified caching scheme: elimination of
data redundancy. The performance gains of uniLRU over MQ also grow as the server cache
size increases. Our study shows that this is because MQ relies more on reference
frequencies to make replacement decisions when the cache size becomes large. Thus MQ
becomes less responsive to changing access patterns, and less effective than LRU-based
schemes with large server cache sizes. For all the traces ULC achieves the best
performance, steadily decreasing the access time as the server cache size increases. Its
high hit ratios and low demotion rates are the two major contributing factors.
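
The under-utilization of a small server cache under independent LRU can be illustrated
with a toy simulation. The sketch below is an assumption-laden stand-in for indLRU, not
the evaluated simulator: the helper names lru_access and server_hit_ratio are
hypothetical, the workload is a uniform random trace over 1000 blocks generated with a
fixed seed, and the client cache holds 200 blocks.

```python
import random
from collections import OrderedDict


def lru_access(cache, capacity, block):
    """Reference a block in an LRU cache (OrderedDict); return True on a hit."""
    if block in cache:
        cache.move_to_end(block)
        return True
    if len(cache) >= capacity:
        cache.popitem(last=False)  # evict the least recently used block
    cache[block] = None
    return False


def server_hit_ratio(trace, client_size, server_size):
    """Fraction of all references that hit in the server under independent LRU."""
    client, server = OrderedDict(), OrderedDict()
    server_hits = 0
    for block in trace:
        if lru_access(client, client_size, block):
            continue  # client hit: the server never sees this reference
        if lru_access(server, server_size, block):
            server_hits += 1
    return server_hits / len(trace)


rng = random.Random(0)  # fixed seed so the run is repeatable
trace = [rng.randrange(1000) for _ in range(50_000)]  # uniform random blocks
ratios = {size: server_hit_ratio(trace, client_size=200, server_size=size)
          for size in (50, 100, 200, 400)}
```

With a server cache smaller than the client cache, almost every block held by the server
is still resident in the client, so the server rarely sees a request it can satisfy. Its
hit ratio rises only once it is large enough to retain blocks after the client has
evicted them.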

5.4.4.2 The Impact of Client Cache Size

Figure 5.11 shows the average access time for each workload as the client cache size
changes. uniLRU benefits much more from added client cache size than indLRU and MQ do.
This is because increasing the client cache size has negative effects for indLRU and MQ:
more data redundancy in indLRU and weaker locality available to MQ in the server. A
unified caching scheme is immune to these effects. However, the performance of uniLRU is
worse than that of MQ with small client cache sizes for zipf and dev1. The explanation is
as follows. The smaller the client cache is, the more requested blocks are retrieved from
outside the client. In uniLRU every block brought in from outside the client incurs a
demotion. Small client caches therefore cause large demotion costs, which increase the
access time in uniLRU. Though ULC is also a unified caching scheme, it maintains the best
performance over the whole range of client cache sizes because of its accurate block
placement decisions.

Figure 5.11: The average access times for schemes ULC, uniLRU, MQ, and indLRU with various
client cache sizes. The server cache size is fixed: 200 MB for zipf and dev1, and 150 MB
for httpd. [Three graphs (ZIPF, HTTPD, DEV1): average access time plotted against cache
size (MB).]

Figure 5.12: The average access times for schemes ULC, uniLRU, MQ, and indLRU with various
block transfer times. The client and server cache sizes are fixed at 100 MB each for all
the workloads. [One graph per workload: average access time versus transfer time for a
block (ms).]

5.4.4.3 The Impact of Network Bandwidth

Figure 5.12 shows the average access time for each workload as we change the 8 KB block
transfer time. An increase in transfer time is expected to hurt unified schemes more than
independent schemes, because the former incur additional demotion costs determined by the
transfer time. Indeed, the average access time of uniLRU increases more quickly than those
of indLRU and MQ. However, with its low demotion rates, ULC is affected by the increase of
transfer time about as much as indLRU and MQ are, and even less for trace zipf, because
transfer time also contributes to the miss penalty and ULC has much reduced miss ratios.
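
A back-of-the-envelope model shows why demotions make a scheme more sensitive to transfer
time. The sketch assumes a simple additive cost model for the client-server hierarchy, and
the hit ratios below are hypothetical values chosen only to illustrate the trend; they are
not measurements from this study. The function names two_level_access_time and
slope_per_ms are likewise hypothetical.

```python
DISK_MS = 10.0  # average disk access time for a block, as set in this section


def two_level_access_time(ts_ms, client_hit, server_hit, demotion_rate):
    """Average access time (ms) in a client-server hierarchy, ASSUMED model.

    A server hit costs one transfer, a miss costs a transfer plus a disk
    access, and every demotion costs one additional transfer.
    """
    miss = 1.0 - client_hit - server_hit
    return (server_hit * ts_ms
            + miss * (ts_ms + DISK_MS)
            + demotion_rate * ts_ms)


def slope_per_ms(client_hit, server_hit, demotion_rate):
    """Increase in average access time per 1 ms of added transfer time."""
    t0 = two_level_access_time(1.0, client_hit, server_hit, demotion_rate)
    t1 = two_level_access_time(2.0, client_hit, server_hit, demotion_rate)
    return t1 - t0


# Hypothetical hit ratios, chosen only to illustrate the trend.
client_hit = 0.40
independent = slope_per_ms(client_hit, 0.10, demotion_rate=0.0)
# uniLRU demotes once for every block brought in from outside the client.
unified_lru = slope_per_ms(client_hit, 0.30, demotion_rate=1.0 - client_hit)
low_demotion = slope_per_ms(client_hit, 0.30, demotion_rate=0.05)
```

In this model the slope equals the client miss ratio plus the demotion rate. A scheme that
demotes on every client miss therefore pays twice per added millisecond compared with an
independent scheme, while a unified scheme with a low demotion rate stays close to the
independent one.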

5.4.5 Comparisons of Caching Schemes for Multi-client Workloads

Figure 5.13: The average access times of multi-client traces httpd, openmail, and db2 with
various server cache sizes. httpd has 7 clients, openmail has 6 clients, and db2 has 8
clients. Each client contains 8 MB, 1 GB, or 256 MB, respectively. [Three graphs (HTTPD,
OPENMAIL, DB2): average access time versus server cache size (MB).]

Because the performance of the uniLRU scheme can deteriorate significantly due to buffer
competition and data sharing among clients in the multi-client structure, Wong and Wilkes
also proposed two adaptive cache insertion policies to supplement their primitive scheme
[76]. Among their three multi-client traces httpd, openmail, and db2, httpd is the one
with data sharing. Since they did not state which version of their unified LRU schemes
should be used for a specific workload, we ran all the versions and report the best
results for comparison.

Figure 5.13 shows that ULC achieves the best performance for all the workloads. Most of
the time, indLRU has the worst performance. However, there are two cases where indLRU
beats uniLRU or MQ. One case is MQ with large server cache sizes for trace httpd. When the
server cache becomes large enough, LRU's inability to deal with weak locality becomes less
harmful, while MQ's shortcoming as a frequency-based replacement, its slowness in
responding to pattern changes, becomes prominent. The other case is uniLRU with small
cache sizes for trace db2. This is because db2 contains looping access patterns. LRU is
not effective on a workload with this pattern until all blocks in the looping scopes can
be held in the cache. A careful examination of the detailed experiment reports indicates
that both indLRU and uniLRU achieve very low hit ratios (6.9% and 7.9%, respectively) for
the two levels of caches, compared with that of MQ (12.3%) and that of ULC (35.1%). Thus
it is the large demotion cost of uniLRU (with an average demotion rate of 88.6% for the 8
clients, compared with 7.2% for ULC) that makes the difference. As the cache size
increases, some looping scopes are covered by the combined two-level caches but not by a
single level, which explains why the performance of uniLRU starts exceeding that of indLRU
when the server cache size reaches 1 GB. However, the performance of uniLRU remains worse
than that of MQ because of the looping access pattern of db2. For the traces httpd and
openmail, uniLRU beats MQ by eliminating data redundancy.

5.5 Related Work and Discussions

To address the challenges of replacement in a buffer caching hierarchy, researchers have
mainly tried two approaches: (1) re-designing the low-level cache replacement; (2)
extending an existing replacement into a unified hierarchy replacement through
coordination. The MQ algorithm [82] is representative of the first approach. However,
without coordination with clients, the performance potential of MQ is significantly
constrained. Since LRU is commonly used in software-managed buffer caches due to its
simplicity and adaptability, Wong and Wilkes [76] propose a protocol that integrates a
two-level buffer cache hierarchy into a single, large unified cache based on “demotion”
operations, and manage it using LRU. Their goal is to effectively utilize the built-in
cache in RAID, so the network they assume is a high-speed SAN. To reduce the possible
network bottleneck caused by demotions in a database client and storage server system,
Chen et al. [14] even proposed re-loading evicted blocks from disks rather than from
clients. Our technique reduces demotions by effectively utilizing historical access
patterns.

Jiang and Zhang [33] propose the LIRS replacement algorithm to address the performance
degradation of LRU on workloads with weak locality. They use a LIRS stack to track the
recencies of accessed blocks. Blocks that are accessed at small recencies are kept in the
cache. This single-level cache replacement motivates us to investigate whether the last
locality distance, LLD, can be effectively used to exploit hierarchical locality, so that
blocks with different locality strengths can be arranged into the correct cache levels.

The work on cooperative caching [23, 66, 74] coordinates the buffer caches of many client
machines distributed on a LAN to form a fourth level in the network file system's cache
hierarchy. Besides the local cache, server cache, and server disk, data can also be cached
in another client's cache. Associated issues include idle cache availability and
consistent sharing. Our ULC protocol is intended for the conventional file buffer cache
hierarchy, while the characterization of non-uniform locality is expected to enhance the
effectiveness of data placement in cooperative caching.

As far as the cache hierarchy between processor and memory is concerned, the interaction
of replacements at various levels and its performance implication do not pose a problem.
Multi-level inclusivity among the L1, L2, ..., Ln caches can be accepted as a principle to
simplify the cache coherence protocol [3]. This is because a lower-level cache is more
than ten times larger than its upper-level cache. With this large difference, the
reduction in useful cache size due to data redundancy has only a limited negative
performance impact on the low-level caches. In contrast, the sizes of buffer caches in the
hierarchy do not follow this regularity: a client buffer cache could even be larger than
the second-level cache.

We assume ULC works in a trusted environment. Though it is a client-directed protocol,
ULC does not increase the vulnerability of servers. This is because even with independent
caching schemes, a client still has the opportunity to abuse server buffers by sending
extra requests to servers to keep its blocks in the server.

The underlying algorithms in almost all existing file systems are LRU or its variants.
ULC basically inherits their data structure, the LRU stack. The operation cost associated
with the stacks is O(1) time per reference request. Regarding the space cost of the
stacks, we need 17 bytes (8 bytes for file identifier and block offset, 8 bytes for two
pointers in a doubly linked list, and 1 byte for statuses) for a block in the client,
which represents only 0.2% of an 8 KB block. The metadata in the shared server cache needs
an additional one or two bytes for recording the block owner. The stack sizes on levels
other than the first are determined by their cache sizes. Thus a server with a 1 GB cache
uses only 2.2 MB for its metadata. The first-level cache has to hold the uniLRU stack,
whose actual size is determined by the working set size of the applications running on the
client. The relatively cold blocks (with low-level statuses) can be trimmed from the
stack, if needed to save space, without compromising ULC's ability to distinguish
locality. For example, 8.5 MB of metadata in the client can support a workload working set
of up to 4 GB. This is highly affordable in a system striving for improved file I/O
performance through large caches.
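
The space figures above follow from simple arithmetic, reproduced in the sketch below. The
per-entry layout is taken from the text; the choice of exactly one extra owner byte on the
server is an assumption (the text says one or two), and the names metadata_bytes,
server_1gb, and client_4gb are hypothetical.

```python
BLOCK_BYTES = 8 * 1024           # 8 KB cache block
CLIENT_ENTRY_BYTES = 8 + 8 + 1   # identifier/offset + two pointers + statuses
MIB = 1024 * 1024
GIB = 1024 * MIB


def metadata_bytes(cached_bytes, entry_bytes):
    """Stack metadata needed to track `cached_bytes` worth of 8 KB blocks."""
    return (cached_bytes // BLOCK_BYTES) * entry_bytes


per_block_overhead = CLIENT_ENTRY_BYTES / BLOCK_BYTES          # about 0.002
# ASSUMPTION: one extra owner byte per entry on the server (text: one or two).
server_1gb = metadata_bytes(1 * GIB, CLIENT_ENTRY_BYTES + 1) / MIB
client_4gb = metadata_bytes(4 * GIB, CLIENT_ENTRY_BYTES) / MIB
```

With 18-byte server entries, a 1 GB cache of 131,072 blocks needs 2.25 MB of metadata,
close to the 2.2 MB quoted. With 17-byte client entries, a 4 GB working set of 524,288
blocks needs exactly 8.5 MB.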

5.6 Summary

An effective caching scheme for a multi-level cache hierarchy is important to the
performance of applications because more and more applications rely on the hierarchy for
their file accesses. After carefully investigating the quantification of non-uniform
locality strengths in representative file access patterns, we propose the ULC caching
protocol. Compared with the commonly used independent LRU scheme and other recently
proposed schemes, the ULC protocol shows distinct performance merits: (1) it consistently
and significantly reduces the average block access time perceived by applications; (2) it
can be implemented efficiently with O(1) time complexity, with only a few stack operations
associated with a reference.


Chapter 6

Conclusions and Future Work
This dissertation has provided solutions to several important memory management issues,
aiming at reducing disk accesses: general-purpose replacement algorithms, virtual memory
replacement policies, thrashing prevention, and multi-level buffer cache management. The
proposed solutions are based on extensive application behavior characterization and
accurate access locality quantification. These solutions cover both process virtual memory
accesses and file data accesses in program execution, page replacement both for a single
program and for multiple running programs, and both the buffer cache in a single computer
and multi-level buffer caches in a distributed system. Each proposed algorithm or scheme
has been extensively evaluated using either trace-driven simulations or system
implementations to demonstrate its effectiveness and practical value. Using the techniques
together will comprehensively enhance system performance for memory-intensive and
I/O-intensive applications, and make the system more robust in the face of dynamic memory
accesses.

6.1 General-Purpose Replacement Algorithms

The LRU replacement algorithm has been commonly used in file systems, database systems,
storage systems, and numerous other applications. It has been successful in general due to
its good performance in most cases and its low cost. However, LRU is also well known to
perform poorly on access patterns with weak locality, such as scan and loop accesses.
Because of its important role in today's computing, the negative impact of the cases in
which LRU performs poorly can be high. Thus, it is understandable that so much research
work still focuses on improving LRU performance.

Motivated by the limitations of previous studies, we propose the Low Inter-reference
Recency Set (LIRS) replacement policy. LIRS effectively addresses the limitations of LRU
by using recency to evaluate Inter-Reference Recency (IRR) for making a replacement
decision. This is in contrast to what LRU does: directly using recency to predict next
reference timing. LIRS dynamically and responsively distinguishes low IRR (LIR) blocks
from high IRR (HIR) blocks, and keeps LIR blocks in the cache. Compared with LRU, LIRS
does not directly use recency to make a replacement decision, but uses it to determine the
LIR or HIR status of a block. At the same time, LIRS retains almost the same simple
assumption as LRU to predict the future access behavior of blocks. The only additional
assumption of LIRS is that there is a correlation between consecutive IRRs of a block. It
also does not rely on any detectable regularities.

Performance evaluations with a variety of traces and a wide range of cache sizes show that
LIRS effectively addresses the limitations of LRU, retains the low-cost merit of LRU, and
outperforms replacement policies that rely on access regularity detection.
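
The difference between ranking blocks by recency and ranking them by IRR can be seen on a
short trace. The sketch below only computes the two quantities, taking recency as the
number of other distinct blocks referenced since a block's last reference and IRR as the
number of other distinct blocks referenced between its last two references. It is not an
implementation of LIRS, and the function name recency_and_irr and the example trace are
hypothetical.

```python
import math


def recency_and_irr(trace):
    """Return {block: (recency, irr)} as seen at the end of the trace.

    recency: distinct other blocks referenced since the block's last reference.
    irr:     distinct other blocks referenced between its last two references
             (math.inf if the block has been referenced only once).
    """
    positions = {}
    for i, block in enumerate(trace):
        positions.setdefault(block, []).append(i)
    result = {}
    for block, where in positions.items():
        last = where[-1]
        recency = len(set(trace[last + 1:]))
        if len(where) >= 2:
            irr = len(set(trace[where[-2] + 1:last]))
        else:
            irr = math.inf
        result[block] = (recency, irr)
    return result


stats = recency_and_irr(list("ABCADEB"))
# With room for two blocks, compare the two ranking criteria.
by_recency = sorted(stats, key=lambda b: stats[b][0])[:2]  # what LRU keeps
by_irr = sorted(stats, key=lambda b: stats[b][1])[:2]      # low-IRR blocks
```

On this trace, recency alone favors B and E, although E has been referenced only once. IRR
favors A and B, the two blocks that have actually been re-referenced.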


In this dissertation, we follow the convention of replacement algorithm studies and use
only hit ratio as the performance metric. However, in practical systems the relationship
between hit ratio increase and application performance improvement is more complicated
than a linear one. For example, a number of misses on a set of pages residing
consecutively on the disk would incur almost the same penalty as a single miss, because of
the properties of hard disks. On the other hand, the same number of misses on a set of
pages scattered over the disk will cost much more than in the sequential case. Taking the
ultimate application performance into consideration could significantly affect the design
and evaluation of replacement algorithms. As future work, we will use more
performance-relevant metrics, such as average block access time, in our replacement
algorithm studies. We expect this will make research in this area play a more important
role in system design.

6.2 Low Cost Virtual Memory Replacement Algorithms

The low cost requirement of virtual memory (VM) management makes research on low-cost
approximations of general-purpose replacement algorithms a necessity. However, this is not
easy, considering that only very limited access history information can be used if the
cost is to remain low. This explains why CLOCK, a replacement policy developed at least 35
years ago, still dominates almost all of today's systems.

While pure LRU has an unaffordable cost in VM, CLOCK simulates the LRU replacement
algorithm at a low cost acceptable in VM management. Over the last three decades, the
inability of LRU as well as CLOCK to handle weak locality accesses has become increasingly
serious, and an effective fix is increasingly in demand. However, almost all the major
improved replacement algorithms are built on LRU and have a cost at least equivalent to
that of LRU.
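
For reference, the way CLOCK approximates LRU with one reference bit per frame can be
sketched as follows. This is the textbook second-chance formulation, written here for
illustration. It is not CLOCK-Pro and not the code of any particular kernel, and the
choice to clear the bit of a newly loaded page is an assumption that varies between
implementations.

```python
class Clock:
    """Textbook CLOCK (second-chance) replacement over a fixed set of frames."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames      # page held by each frame
        self.referenced = [False] * num_frames
        self.frame_of = {}                    # page -> frame index
        self.hand = 0

    def access(self, page):
        """Reference a page; return True on a hit, False on a page fault."""
        if page in self.frame_of:
            # A hit only sets a bit; no list reordering as in LRU.
            self.referenced[self.frame_of[page]] = True
            return True
        # Sweep the hand, giving referenced pages a second chance.
        while self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        if victim is not None:
            del self.frame_of[victim]
        self.pages[self.hand] = page
        self.frame_of[page] = self.hand
        # ASSUMPTION: a newly loaded page starts with its bit cleared.
        self.referenced[self.hand] = False
        self.hand = (self.hand + 1) % len(self.pages)
        return False


clock = Clock(3)
hits = [clock.access(p) for p in [1, 2, 3, 1, 4]]

# A loop one page larger than memory defeats CLOCK just as it defeats LRU.
loop_clock = Clock(3)
loop_hits = sum(loop_clock.access(p) for p in [1, 2, 3, 4] * 5)
```

In the first trace the re-referenced page 1 survives and page 2 is replaced, which is the
same choice LRU would make. In the looping trace no reference ever hits, which illustrates
the weak-locality problem that CLOCK inherits from LRU.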

Inspired by the general-purpose replacement algorithm LIRS [33], we propose an enhanced
CLOCK replacement policy, called CLOCK-Pro. By additionally keeping track of a limited
number of replaced pages, CLOCK-Pro works in a fashion similar to CLOCK, with a
VM-affordable cost. Meanwhile, it brings all the much-needed performance advantages of
LIRS into CLOCK. CLOCK-Pro also eliminates the only tunable parameter in LIRS and adapts
itself to changing access locality to serve a broad spectrum of workloads. Extensive
simulation experiments on real-life I/O and VM traces show significant and consistent
performance improvements. We believe that CLOCK-Pro would be very attractive to VM system
designers in industry.

The potential performance advantages of CLOCK-Pro can only be fully demonstrated through
a real system implementation. As future work, we plan to continue our efforts to make
CLOCK-Pro run practically and efficiently on Linux systems, where some system-specific
issues will arise, such as how to keep track of the replaced pages in memory and how to
coordinate replacement decisions with individual process access behaviors in a global
memory replacement policy. Because memory management is a complicated portion of an
operating system and the replacement code is heavily coupled with other parts of memory
management in Linux, there will be a number of technical challenges to be addressed. Our
objective is to have CLOCK-Pro widely used in mainstream operating systems in both the
commercial and open source communities.

6.3 Thrashing Prevention

Operating system designers attem p t to keep high CPU utilization by maintaining an optimal
multiprogramming level (MPL). Although running more processes makes it less likely to
leave the CPU idle, too many processes adversely incur serious memory competition, and
even introduce thrashing, which eventually lowers CPU utilization. A common practice to
address the problem is to lower the MPL w ith the aid of process swapping o u t/in operations.
This approach is expensive and is only used when the system begins serious thrashing. The
objective of our study is to provide highly responsive and cost-effective thrashing protection
by adaptively conducting priority page replacement in a timely manner.
We have designed a dynamic system Thrashing Protection Facility (TPF) in the system
kernel. Once TPF detects system thrashing, one of the active processes is identified for
protection. The identified process is granted a short period of privilege in which it does not
contribute its LRU pages for removal, so that the process can quickly establish its working
set, improving CPU utilization. With the support of TPF, thrashing can be eliminated
in its early stage by adaptive page replacement, so that process swapping is avoided
or delayed until it is truly necessary.
We have implemented TPF in a Linux kernel. Compared with the original Linux page
replacement, we show that TPF consistently and significantly reduces page faults and the
execution time of each individual job in several groups of interacting SPEC2000 programs.
We also show that TPF introduces little additional overhead to program executions, and
that its implementation in Linux (or Unix) systems is easy.
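
The protection mechanism summarized in this section can be illustrated with a small user-level sketch in Python. It is not the kernel implementation: the detection rule, the threshold values, the length of the privilege period, and all names (ThrashingGuard, Page, pick_victim) are assumptions introduced here for illustration. The sketch only shows the core idea that, while a process is protected, its pages are skipped when the globally least recently used page is chosen for removal.

```python
from dataclasses import dataclass

# Illustrative values only; the text above does not give TPF's actual rule.
CPU_UTIL_LOW = 0.40       # assumed: CPU counts as under-utilized below this
FAULT_RATE_HIGH = 10.0    # assumed: page faults per second seen as heavy paging
PROTECTION_TICKS = 5      # assumed length of the short period of privilege


@dataclass
class Page:
    pid: int         # owning process
    number: int      # page number within the process
    last_used: int   # tick of the most recent reference


class ThrashingGuard:
    """Hypothetical helper that exempts one process from LRU eviction."""

    def __init__(self):
        self.protected_pid = None
        self.protected_until = -1

    def is_thrashing(self, cpu_util, fault_rate):
        # Assumed detection rule: low CPU utilization together with heavy paging.
        return cpu_util < CPU_UTIL_LOW and fault_rate > FAULT_RATE_HIGH

    def protect(self, pid, now):
        # Grant the chosen process a bounded period of privilege.
        self.protected_pid = pid
        self.protected_until = now + PROTECTION_TICKS

    def is_protected(self, pid, now):
        return pid == self.protected_pid and now < self.protected_until

    def pick_victim(self, pages, now):
        """Return the LRU page among those not owned by the protected process."""
        eligible = [p for p in pages if not self.is_protected(p.pid, now)]
        if not eligible:
            # Every resident page belongs to the protected process:
            # fall back to plain global LRU.
            eligible = pages
        return min(eligible, key=lambda p: p.last_used)
```

Because the privilege expires after a fixed number of ticks, the protected process returns to ordinary LRU replacement once it has had a chance to rebuild its working set. How the process to protect is selected is left to the caller in this sketch.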


6.4 Multi-Level Buffer Cache Management

In a large client/server cluster system, file blocks are cached in a multi-level storage
hierarchy: client buffer caches, multiple server buffer caches, and built-in caches of disks at
the bottom level. Existing file block placement and replacement are either conducted at
each level of the hierarchy independently, or by applying an LRU policy on more than one
level. One major limitation of these schemes is that hierarchical locality of file blocks
with non-uniform strengths is ignored, resulting in many unnecessary block misses or
additional communication overhead, even when the aggregate size of the multi-level buffer caches
is sufficiently large to hold the working set. To address this limitation, we propose a
client-directed, coordinated file block placement and replacement protocol, in which the non-uniform
strengths of locality are dynamically identified at the client level to direct servers in placing
or replacing file blocks accordingly at different levels of the buffer caches. In other words,
the caching layout of the blocks in the hierarchy dynamically matches the locality of block
accesses. The effectiveness of our proposed protocol comes from achieving the following
three goals: (1) The multi-level cache retains the same hit rate as that of a single-level cache
whose size equals the aggregate size of the multi-level caches. (2) The non-uniform locality
strengths of blocks are fully exploited and ranked to fit into the physical multi-level caches.
(3) The communication overheads between caches are also reduced.
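
Goal (1) can be illustrated with a deliberately simplified Python sketch. It is not the proposed protocol: plain LRU recency stands in for the locality-strength measure, all message passing between clients and servers is omitted, and the class and method names (UnifiedStack, level_of, access) are hypothetical. The sketch keeps one recency-ordered stack and cuts it into consecutive segments, one per cache level, so a block is found somewhere in the hierarchy exactly when a single LRU cache of the aggregate size would hold it.

```python
from collections import OrderedDict


class UnifiedStack:
    """One LRU-ordered stack cut into consecutive segments, one per cache level."""

    def __init__(self, level_sizes):
        self.level_sizes = list(level_sizes)   # e.g. [client, server, disk cache]
        self.capacity = sum(self.level_sizes)  # aggregate size of all levels
        self.stack = OrderedDict()             # most recently used block is last

    def level_of(self, block):
        """Return the level (0 = client) holding block, or None if uncached."""
        if block not in self.stack:
            return None
        # Depth 0 is the most recently used block.
        depth = len(self.stack) - 1 - list(self.stack).index(block)
        for level, size in enumerate(self.level_sizes):
            if depth < size:
                return level
            depth -= size
        return None

    def access(self, block):
        """Reference block; return the level that served it (None on a miss)."""
        level = self.level_of(block)
        if level is not None:
            self.stack.move_to_end(block)        # promote to the top segment
        else:
            if len(self.stack) >= self.capacity:
                self.stack.popitem(last=False)   # evict the globally LRU block
            self.stack[block] = True
        return level
```

With level sizes [2, 2], for example, the two most recently used blocks sit at level 0 and the next two at level 1. The linear scan in level_of keeps the sketch short; a practical design would track segment boundaries incrementally.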
Conducting simulations with a variety of synthetic and real-life traces and a wide
range of system parameters, we show that our caching protocol significantly and consistently
outperforms existing multi-level caching schemes.
In the work on cooperative caching [23, 66, 74], there are schemes that coordinate the
buffer caches of many client machines distributed on a LAN to form a fourth level in the
network file system's cache hierarchy. Because the working sets of applications running on
each client are heterogeneous, and possibly the memory available at each client as well,
allowing memory space to be shared at the client level can reduce the workload on servers
and further increase the performance of applications with large working sets. One of the
design challenges is how to set the priority of memory allocations on a client
between local applications and applications running on other clients. We plan to look into
these technical issues of cooperative caching in distributed environments.

R eproduced with perm ission of the copyright owner. Further reproduction prohibited without permission.

Bibliography
[1] A. V. Aho, P. J. Denning, and J. D. Ullman. Principles of optimal page replacement.
Journal of ACM, 18(1):80-93, 1971.
[2] G. Almási, C. Cascaval, and D. A. Padua. Calculating stack distances efficiently.
In Proceedings of the workshop on Memory system performance, pages 37-43, 2004.
[3] J.-L. Baer and W.-H. Wang. On the inclusion properties for multi-level cache
hierarchies. In Proceedings of Annual International Symposium on Computer Architecture,
pages 73-80, 1988.
[4] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K.
Ousterhout. Measurements of a distributed file system. In Proceedings of Symposium
on Operating System Principles, pages 198-212, 1991.
[5] R. Balasubramonian, D. Albonesi, A. Buyuktos, and S. Dwarkadas. Dynamic
memory hierarchy performance and energy optimization. In Proceedings of Annual
International Symposium on Computer Architecture, pages 245-257, 2000.
[6] S. Bansal and D. Modha. CAR: Clock with adaptive replacement. In Proceedings
of the 3rd USENIX Symposium on File and Storage Technologies, 2004.
[7] L. A. Belady. A study of replacement algorithms for virtual storage. IBM System
Journal, 5(2):78-101, 1966.
[8] L. A. Belady, R. A. Nelson, and G. S. Shedler. An anomaly in space-time
characteristics of certain programs running in a paging machine. Communications of
the ACM, 12(6):349-353, 1969.
[9] B. T. Bennett and V. J. Kruskal. LRU stack processing. IBM Journal of Research
and Development, pages 353-357, 1975.
[10] K. Beyls and E. H. D'Hollander. Reuse distance-based cache hint selection. In
Proceedings of International Conference on Parallel Processing,
pages 265-274, 2002.
[11] P. Cao, E. W. Felten, and K. Li. Application-controlled file caching policies. In
Proceedings of USENIX Summer Technical Conference, pages 171-182, 1994.
[12] R. W. Carr. Virtual Memory Management. UMI Research Press, 1984.

[13] Trace Distribution Center, http://tds.cs.byu.edu. Brigham Young University.
[14] Z. Chen, Y. Zhou, and K. Li. Eviction-based placement for storage caches. In
Proceedings of Annual USENIX Technical Conference, 2003.
[15] T. M. Chilimbi. Efficient representations and abstractions for quantifying and
exploiting data reference locality. In Proceedings of the ACM SIGPLAN conference on
Programming language design and implementation (PLDI), pages 191-202, 2001.
[16] T. M. Chilimbi. On the stability of temporal data reference profiles. In Proceedings of
International Conference on Parallel Architectures and Compilation Techniques, 2001.
[17] T. M. Chilimbi and M. Hirzel. Dynamic hot data stream prefetching for
general-purpose programs. In Proceedings of the ACM SIGPLAN conference on Programming
language design and implementation (PLDI), pages 199-209, 2002.
[18] J. Choi, S. Noh, S. Min, and Y. Cho. Towards application/file-level characterization
of block references: A case for fine-grained buffer management. In Proceedings of
Annual USENIX Technical Conference, pages 286-295, 2000.
[19] J. Choi, S. Min, S. Noh, and Y. Cho. An implementation study of a detection-based
adaptive block replacement scheme. In Proceedings of Annual USENIX Technical
Conference, pages 239-252, 1999.
[20] E. G. Coffman and P. J. Denning. Operating Systems Theory. Prentice-Hall, 1973.
[21] F. J. Corbato. A paging experiment with the Multics system. In In Honor of Philip
Morse, H. Feshbach and U. Ingard, pages 217-228. MIT Press, 1969.
[22] Storage Performance Council. I/O Traces from a Popular Search Engine.
http://www.storageperformance.org.
[23] M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. A. Patterson. Cooperative
caching: Using remote client memory to improve file system performance. In
Symposium on Operating System Design and Implementation, pages 267-280, 1994.
[24] P. J. Denning. The working set model for program behavior. Communications of the
ACM, 11(5):323-333, 1968.
[25] P. J. Denning. Virtual memory. Computer Survey, 2(3):153-189, 1970.
[26] P. J. Denning. Working sets past and present. IEEE Transactions Software
Engineering, 6(1):64-84, 1980.
[27] P. J. Denning. Before memory was virtual. In The Beginning: Personal Recollections
of Software Pioneers. IEEE Press, 1997.
[28] C. Ding and Y. Zhong. Predicting whole-program locality through reuse distance
analysis. In Proceedings of the ACM SIGPLAN conference on Programming language
design and implementation (PLDI), pages 245-257, 2003.


[29] G. Glass. Adaptive page replacement. Master's thesis, University of Wisconsin, 1997.
[30] G. Glass and P. Cao. Adaptive page replacement based on memory reference
behavior. In Proceedings of ACM SIGMETRICS Conference on Measuring and Modeling
of Computer Systems, pages 115-126, 1997.
[31] HP Corporation. HP-UX 10.0 Memory Management White Paper, 1995.
[32] IBM Corporation. AIX Versions 3.2 and 4 Performance Tuning Guide, 1996.
[33] S. Jiang and X. Zhang. LIRS: An efficient low inter-reference recency set
replacement policy to improve buffer cache performance. In Proceedings of ACM
SIGMETRICS Conference on Measuring and Modeling of Computer Systems, pages 31-42, 2002.
[34] S. Jiang and X. Zhang. TPF: a system thrashing protection facility. Software
Practice and Experience, 32(3):295-318, 2002.
[35] S. Jiang and X. Zhang. Token-ordered LRU: An efficient page replacement policy
and implementation for program interactions. In Special Issue on Performance
Modeling and Evaluation of High-Performance Parallel and Distributed Systems in
Performance Evaluation: An International Journal, 2004.
[36] S. Jiang and X. Zhang. ULC: A file block placement and replacement protocol to
effectively exploit hierarchical locality in multi-level buffer caches. In Proceedings of
International Conference on Distributed Computing Systems (ICDCS), pages 168-177,
2004.
[37] T. Johnson and D. Shasha. 2Q: A low overhead high performance buffer
management replacement algorithm. In Proceedings of the International Conference on VLDB
Surveys, pages 439-450, 1994.
[38] E. G. Coffman Jr. and T. A. Ryan. A study of storage partitioning using a
mathematical model of locality. Communications of the ACM, 15(3):185-190, 1972.
[39] S. F. Kaplan, L. A. McGeoch, and M. F. Cole. Adaptive caching for demand
prepaging. In Proceedings of the third International Symposium on Memory Management,
pages 114-126, 2002.
[40] S. F. Kaplan, Y. Smaragdakis, and P. R. Wilson. Trace reduction for virtual
memory simulations. In Proceedings of ACM SIGMETRICS Conference on Measuring
and Modeling of Computer Systems, pages 47-58, 1999.
[41] L. J. Kenah and S. F. Bate. VAX/VMS Internals and Data Structures. Digital
Press, 1984.
[42] J. Kim, J. Choi, J. Kim, S. Noh, S. Min, Y. Cho, and C. Kim. A low-overhead,
high-performance unified buffer management scheme that exploits sequential and
looping references. In Symposium on Operating System Design and Implementation, 2000.


[43] Y. H. Kim, M. D. Hill, and D. A. Wood. Implementing stack simulation for
highly-associative memories. In Proceedings of ACM SIGMETRICS Conference on
Measuring and Modeling of Computer Systems, pages 212-213, 1991.
[44] E. D. Lazowska and J. M. Kelsey. Notes on tuning VAX/VMS. Technical report,
Univ. of Washington, Dept. of Computer Science, 1978.
[45] D. Lee, J. Choi, J. Kim, S. Noh, S. Min, Y. Cho, and C. Kim. On the existence
of a spectrum of policies that subsumes the least recently used (LRU) and least frequently
used (LFU) policies. In Proceedings of ACM SIGMETRICS Conference on Measuring
and Modeling of Computer Systems, pages 134-143, 1999.
[46] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques
for storage hierarchies. IBM System Journal, 9(2):78-117, 1970.
[47] S. Maxwell. Linux Core Kernel Commentary. CoriolisOpen Press, 1999.
[48] S. McFarling. Cache replacement with dynamic exclusion. In Proceedings of Annual
International Symposium on Computer Architecture, pages 191-200, 1992.
[49] K. S. McKinley and O. Temam. Quantifying loop nest locality using SPEC'95 and
the Perfect benchmarks. In ACM Transactions on Computer Systems, pages 288-336,
1999.
[50] M. K. McKusick, K. Bostic, M. J. Karels, and J. S. Quarterman. The Design
and Implementation of the 4.4 BSD Operating System. Addison Wesley, 1996.
[51] N. Megiddo and D. Modha. ARC: a self-tuning, low overhead replacement cache.
In Proceedings of the 2nd USENIX Symposium on File and Storage Technologies, 2003.
[52] R. T. Mills, A. Stathopoulos, and D. Nikolopoulos. Adapting to memory
pressure from within scientific applications on multiprogrammed COWs. In Proceedings
of International Parallel and Distributed Processing Symposium, 2004.
[53] T. C. Mowry, A. K. Demke, and O. Krieger. Automatic compiler-inserted I/O
prefetching for out-of-core application. In Symposium on Operating System Design and
Implementation, pages 297-306, 1993.
[54] D. Muntz and P. Honeyman. Multi-level caching in distributed file system - or
your caching ain't nuthin' but trash. In Proceeding of the USENIX Winter Technical
Conference, 1992.
[55] V. F. Nicola, A. Dan, and D. M. Dias. Analysis of the generalized Clock buffer
replacement scheme for database transaction processing. In Proceedings of ACM
SIGMETRICS Conference on Measuring and Modeling of Computer Systems, pages 35-46,
1992.
[56] D. Nikolopoulos. Malleable memory mapping: User-level control of memory bounds
for effective program adaptation. In Proceedings of International
Parallel and Distributed Processing Symposium, 2003.


[57] E. J. O'Neil, P. E. O'Neil, and G. Weikum. The LRU-K page replacement algorithm
for database disk buffering. In Proceedings of ACM SIGMOD Conference, pages 297-306,
1993.
[58] C. N. Parkinson. Parkinson's Law or the Pursuit of Progress. World Scientific
Publishing Co, 1994.
[59] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka.
Informed prefetching and caching. In Proceedings of Symposium on Operating System
Principles, pages 1-16, 1995.
[60] J. L. Peterson and A. Silberschatz. Operating System Concepts. Addison Wesley,
1985.
[61] F. Petrini, D. Kerbyson, and S. Pakin. The case of the missing supercomputer
performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In
Proceedings of international Supercomputing 2003 Conference, 2003.
[62] V. Phalke and B. Gopinath. An inter-reference gap model for temporal locality in
program behavior. In Proceedings of ACM SIGMETRICS Conference on Measuring
and Modeling of Computer Systems, pages 291-300, 1995.
[63] B. G. Prieve and R. S. Fabry. Min - an optimal variable-space page replacement
algorithm. ACM Press, 19(5):295-297, 1976.
[64] J. A. Rivers, E. S. Tam, G. S. Tyson, E. S. Davidson, and M. Farrens.
Utilizing reuse information in data cache management. In Proceedings of the ACM
International Conference on Supercomputing, 1998.
[65] J. T. Robinson and N. V. Devarakonda. Data cache management using
frequency-based replacement. In Proceedings of ACM SIGMETRICS Conference on
Measuring and Modeling of Computer Systems, pages 134-142, 1990.
[66] P. Sarkar and J. Hartman. Efficient cooperative caching using hints. In Symposium
on Operating System Design and Implementation, 1996.
[67] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: simple and effective
adaptive page replacement. In Proceedings of ACM SIGMETRICS Conference on Measuring
and Modeling of Computer Systems, pages 122-133, 1999.
[68] E. Smirni and D. A. Reed. Lessons from characterizing the input/output behavior
of parallel scientific applications performance evaluation. In Performance Evaluation,
pages 27-44, 1998.
[69] A. J. Smith. Sequentiality and prefetching in database systems. ACM Trans. on
Database Systems, 3(3):223-247, 1978.
[70] A. S. Tanenbaum and A. S. Woodhull. Operating Systems, Design and
Implementation. Prentice Hall, 1997.


[71] M. Uysal, A. Acharya, and J. Saltz. Requirements of I/O systems for parallel
machines: An application-driven study. CS-TR-3802, Dept. of Computer Science, 1997.
[72] R. van Riel. Page replacement in Linux 2.4 memory management. In Proceeding of
USENIX Annual Technical Conference (FREENIX track), 2001.
[73] R. van Riel. Towards an O(1) VM: Making Linux virtual memory management scale
towards large amounts of physical memory. In Proceedings of the Linux Symposium,
2003.
[74] G. Voelker, E. Anderson, T. Kimbrel, M. Feeley, J. Chase, A. Karlin, and
H. Levy. Implementing cooperative prefetching and caching in a globally managed
memory system. In Proceedings of ACM SIGMETRICS Conference on Measuring and
Modeling of Computer Systems, 1998.
[75] P. R. Wilson, S. F. Kaplan, and Y. Smaragdakis. The case for compressed
caching in virtual memory systems. In Proceedings of Annual USENIX Technical
Conference, 1999.
[76] T. M. Wong and J. Wilkes. My cache or yours? making storage more exclusive.
In Proceedings of Annual USENIX Technical Conference, 2002.
[77] Y. Zhong, C. Ding, and K. Kennedy. Reuse distance analysis for scientific
programs. In Proceedings of 6th Workshop on Languages, Compilers, and Run-Time
Systems for Scalable Computers, 2000.
[78] Y. Zhong, M. Orlovich, X. Shen, and C. Ding. Array regrouping and structure
splitting using whole-program reference affinity. In Proceedings of the ACM SIGPLAN
conference on Programming language design and implementation (PLDI), 2004.
[79] Y. Zhou. Memory management for networked servers. In Ph.D Dissertation, Computer
Science Department, Princeton University, 2000.
[80] Y. Zhou, A. Bilas, S. Jagannathan, C. Dubnicki, J. F. Philbin, and K. Li.
Experiences with VI communication for database storage. In Proceedings of Annual
International Symposium on Computer Architecture, pages 257-268, 2002.
[81] Y. Zhou, Z. Chen, and K. Li. Second-level buffer cache management. IEEE
Transactions on Parallel and Distributed Systems, 15(6):505-519, 2004.
[82] Y. Zhou, J. F. Philbin, and K. Li. The multi-queue replacement algorithm for
second level buffe. In Proceedings of Annual USENIX Technical Conference, pages
91-104, 2001.
[83] Q. Zhu, F. M. David, C. F. Devaraj, Z. Li, Y. Zhou, and P. Cao. Reducing
energy consumption of disk storage using power-aware cache management. In
Proceedings of the International Symposium on High Performance Computer Architecture,
pages 118-129, 2004.


[84] Y. Zhu and Y. Hu. Can large disk built-in caches really improve system performance?
In Proceedings of ACM SIGMETRICS Conference on Measuring and Modeling of
Computer Systems, pages 284-285, 2002.


VITA

Song Jiang

Song Jiang was born in Hefei, Anhui, China in October of 1969. He graduated from Hefei
No. 8 High School in July of 1988. Song Jiang received his B.S. at University of Science
and Technology of China (USTC) in 1993 with a degree in Computer Science, where he
also received his M.E. in 1996 with a degree in Computer Science. After that he worked
as a lecturer in the Department of Computer and Technology of the university for another
three years.
In August of 1999, he entered the College of William and Mary as a Ph.D student in
the Computer Science Department.


