Hardware-Assisted Virtual Memory Management : Improving page replacement and migration with on-line memory access information by Neider, Raphael & Bellosa, Frank
System Architecture Group
Department of Informatics
http://os.ibds.kit.edu/
KIT – University of the State of Baden-Wuerttemberg and
National Laboratory of the Helmholtz Association www.kit.edu
1. Motivation
[Figure: two panels showing hot pages A and C and cold pages B and D in physical memory, and the areas they occupy in a physically indexed, physically tagged last-level cache. Cache areas are labeled very hot, very cold, and warm.]
Operating systems with virtual memory support are common
Page placement (and migration) policy required
Aware of caches, NUMA, memory technologies
Page replacement policy required
Optimal, least frequently used (LFU), least recently used (LRU)
Only referenced and dirty bits available
No access frequency or count → no LFU
No time of last access → no LRU
No type of use (read-only/-mostly vs. write-often)
No data on physical memory accesses
Memory traces for off-line analysis desired
Only available from simulations → short time frame
Thesis
More information on memory usage helps virtual memory management perform better!
2. Requirements
Support variety of policies
→ Record timestamps, reads, and writes
Tracing every memory access is too costly
→ Find shortcuts
Cache hits are irrelevant
→ Monitor activity after caches / at memory controller
100 % accuracy is not required
→ “Batch” memory access information
→ Access records per page are (usually) sufficient
Live feedback to OS and software is required
→ Provide efficient interface
Address ranges and granularities should be configurable
→ Allow different policies per memory technology
→ Allow fine-grained examination of cache line utilization
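The configurable address ranges and granularities listed above can be illustrated with a minimal Python sketch. The class name, its fields, and the 64-byte cache line size are assumptions made for illustration; they are not taken from the poster.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MonitoredRange:
    """Hypothetical configuration of one monitored address range."""
    base: int          # first monitored bus address
    size: int          # length of the range in bytes
    granule_log2: int  # log2 of the granularity (12 = 4 KiB page, 6 = 64-byte line)

    def region_index(self, addr: int) -> Optional[int]:
        """Index of the monitored region containing addr, or None if out of range."""
        if not (self.base <= addr < self.base + self.size):
            return None
        return (addr - self.base) >> self.granule_log2


# Page-granular monitoring of a 64 MiB memory (assumed 4 KiB pages).
pages = MonitoredRange(base=0x0000_0000, size=64 << 20, granule_log2=12)
# Cache-line-granular monitoring of a single page (assumed 64-byte lines).
lines = MonitoredRange(base=0x0010_0000, size=4096, granule_log2=6)
```

With one such configuration per memory technology, the same recording logic could serve coarse per-page policies and fine-grained cache line examination.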
3. Memory Profiling Unit (MPU)
[Figure: MPU datapath. The masked bus address (bus address & mask) is compared with the address stored in each record. Each record holds: first access (updated on first access), last access (updated on address hit), number of reads (updated on address hit on read), and number of writes (updated on address hit on write). The records form a shift register that shifts on an address miss and is accessible at runtime. A record leaving the shift register is appended to an in-memory log or used to update an array of records per monitored region.]
Record timestamp of first and last access per page
Record number of reads and writes per page
Keep n such records in associative memory (e.g., 16 ways)
Replace entries via FIFO
Write oldest entry to log on removal
→ Data in the log will never be too old
Scan/consolidate/write-back log in software
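The record handling listed above can be modeled in software. The following Python sketch is only an illustration under stated assumptions: the class and function names are hypothetical, a page shift stands in for the address mask, and the consolidation step is one plausible way to merge log entries, not the authors' implementation.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Record:
    first_access: int
    last_access: int
    reads: int = 0
    writes: int = 0


class MPUModel:
    """Software model of the MPU record store (hypothetical names)."""

    def __init__(self, ways=16, page_shift=12):
        self.ways = ways
        self.page_shift = page_shift   # assumption: 4 KiB pages
        self.records = OrderedDict()   # page -> Record, kept in FIFO order
        self.log = []                  # (page, Record) pairs written on removal

    def access(self, addr, now, is_write):
        page = addr >> self.page_shift            # stands in for "bus address & mask"
        rec = self.records.get(page)
        if rec is None:                           # address miss
            if len(self.records) == self.ways:
                # FIFO replacement: the oldest entry goes to the log.
                self.log.append(self.records.popitem(last=False))
            rec = self.records[page] = Record(first_access=now, last_access=now)
        else:                                     # address hit
            rec.last_access = now
        if is_write:
            rec.writes += 1
        else:
            rec.reads += 1


def consolidate(log):
    """Merge all log entries that refer to the same page."""
    merged = {}
    for page, rec in log:
        m = merged.get(page)
        if m is None:
            merged[page] = Record(rec.first_access, rec.last_access,
                                  rec.reads, rec.writes)
        else:
            m.first_access = min(m.first_access, rec.first_access)
            m.last_access = max(m.last_access, rec.last_access)
            m.reads += rec.reads
            m.writes += rec.writes
    return merged
```

Because a hit only updates a record in place and never reorders it, every record reaches the log after at most `ways` further misses, which is why logged data cannot become arbitrarily old.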
4. Hardware-Assisted Candidate Selection
Hardware remembers m (e.g., 4) best candidates
Candidates are the pages with
smallest timestamp (LRU)
least accesses (LFU)
most accesses (migration)
Remember largest entries on updates
Requires aging policy to prevent overflows
Search smallest entries on
update of largest remembered entry
reset of other records
One unit per physical memory region (memory technology, NUMA node)
One unit to record misses per “cache page color”
→ Place new data in uncontended cache areas
→ Migrate heavily used pages from contended areas
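The per-color miss recording mentioned above can be sketched as follows. This Python example assumes 4 KiB pages and a power-of-two number of colors, and uses the common definition of a page color (cache size divided by associativity times page size). The class and function names are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def num_colors(cache_size, associativity, page_size=1 << PAGE_SHIFT):
    """Number of page colors of a physically indexed cache."""
    return cache_size // (associativity * page_size)


def page_color(phys_addr, colors):
    """Color of the page containing phys_addr (colors must be a power of two)."""
    return (phys_addr >> PAGE_SHIFT) & (colors - 1)


class ColorMissCounters:
    """One miss counter per cache page color (hypothetical model)."""

    def __init__(self, colors):
        self.misses = [0] * colors

    def record_miss(self, phys_addr):
        self.misses[page_color(phys_addr, len(self.misses))] += 1

    def least_contended(self):
        """Color with the fewest misses: a place for new data."""
        return min(range(len(self.misses)), key=self.misses.__getitem__)

    def most_contended(self):
        """Color with the most misses: a source of pages to migrate."""
        return max(range(len(self.misses)), key=self.misses.__getitem__)
```

For example, an assumed 2 MiB, 8-way cache with 4 KiB pages has 64 colors, and the page at physical address 0x41000 has color 1.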
[Figure: an array of raw timestamps or access counters per physical page feeds a "continuously" updated sorted list holding the "best", 2nd, 3rd, and 4th candidates.]
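To illustrate the idea of remembering the m best candidates, the following Python sketch models the LRU case only. It is a simplification under stated assumptions: timestamps only increase, the candidate list is rebuilt by a full rescan whenever a remembered page is touched, and aging is left out. The names are hypothetical and the poster's hardware may search differently.

```python
import heapq


class LRUCandidates:
    """Keeps the m pages with the smallest timestamps (hypothetical model)."""

    def __init__(self, num_pages, m=4):
        self.m = m
        self.stamp = [0] * num_pages   # raw timestamp per physical page
        self.candidates = []           # page numbers, best (oldest) first
        self._rescan()

    def _rescan(self):
        # Search the array for the m smallest timestamps.
        self.candidates = heapq.nsmallest(
            self.m, range(len(self.stamp)), key=self.stamp.__getitem__)

    def touch(self, page, now):
        """Record an access; now must not be smaller than earlier timestamps."""
        self.stamp[page] = now
        # Touching a page outside the list only makes it newer, so the
        # remembered candidates stay valid; otherwise search again.
        if page in self.candidates:
            self._rescan()

    def best(self):
        return self.candidates[0]
```

With this structure the operating system reads a ready-made candidate instead of scanning page tables at replacement time, at the price of an occasional rescan when a remembered page is touched.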
5. First Results of an FPGA-Based Prototype
“Real” hardware is inaccessible
Implemented on the OPENPROCESSOR platform
SoC on FPGA development board
RISC CPU @ 50 MHz
64 MiB DDR SDRAM
[Figure: Effectiveness of MPU ways. x-axis: index of MPU way (0 to 14); y-axis: hit rate [% of all accesses] (0 to 35); series: typeset (init), typeset (small), typeset (large), qsort (small), patricia (small).]
> 98 % hit rate with 16 MPU ways
Median candidate selection cost
2-handed clock: 13 770 µs
Hardware LRU: 211 µs
Up to 90 % fewer swap-ins
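The median candidate selection costs quoted above can be put in relation with a short calculation. The two input values are taken from the poster; the variable names are illustrative.

```python
clock_us = 13_770   # median cost of the 2-handed clock in microseconds
hw_lru_us = 211     # median cost of hardware LRU in microseconds

speedup = clock_us / hw_lru_us          # how many times cheaper hardware LRU is
reduction = 1 - hw_lru_us / clock_us    # fraction of the selection cost saved

print(f"speedup: {speedup:.1f}x, cost reduction: {reduction:.1%}")
```

In other words, hardware LRU selects a candidate roughly 65 times faster than the 2-handed clock in the median case.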
[Figure: Performance of LRU relative to 2-handed clock. x-axis: total memory available to the benchmark (6 MiB, 8 MiB); y-axis: number of swap-ins using LRU [% of 2-handed clock] (0 to 100); series: bzip, patricia, typeset.]
