 
 
EVALUATING TLB (TRANSLATION LOOKASIDE BUFFER) PERFORMANCE OVERHEAD FOR NVM (NON-VOLATILE MEMORY) HYBRID SYSTEM
By 
Xiang Guo 
B.S., Shanghai Dianji University (China), 2014
 
A THESIS 
Submitted in Partial Fulfillment of the 
Requirements for the Degree of 
Master of Science 
in Computer Engineering 
 
The Graduate School
The University of Maine
December 2020

Advisory Committee:
Yifeng Zhu, Libbey Professor of Electrical and Computer Engineering, Advisor 
Vincent Weaver, Associate Professor of Electrical and Computer Engineering 
Bruce Segee, Butler Professor of Electrical and Computer Engineering
 
 
EVALUATING TLB (TRANSLATION LOOKASIDE BUFFER) PERFORMANCE OVERHEAD FOR NVM (NON-VOLATILE MEMORY) HYBRID SYSTEM
By Xiang Guo 
 
Thesis Advisor: Dr. Yifeng Zhu 
 
An Abstract of the Thesis Presented 
in Partial Fulfillment of the Requirements for the 
Degree of Master of Science 
(in Computer Engineering) 
December 2020 
 
As non-volatile memory (NVM) technology offers near-DRAM performance and near-disk
capacity, NVM has emerged as a new storage class. Conventional file systems, designed for hard disk 
drives or solid-state drives, need to be re-examined or even re-designed for NVM storage. For example, 
new file systems such as NOVA, HMFS, HMVFS and Ext4-DAX, have been developed and implemented to 
fully leverage NVM’s characteristics, such as fast fine-grained access. This thesis research uses a variety of 
I/O workloads to evaluate the performance overhead of the TLB (translation lookaside buffer) in various 
file systems on emulated NVM storage systems, in which NVM resides on the memory bus. As NVM’s 
capacity becomes much greater than DRAM and applications’ footprints continue to increase rapidly, the 
number of TLB entries scales up with the same pace, leading to a significant amount of TLB misses. The 
goal of this research is to gain insights into file system optimizations on storage-class memory. 
Experimental results show that NVM-based file systems can have 50% more TLB overhead compared with conventional file systems under the same file operations. Profiling based on performance counters further identifies the file system kernel calls, such as logging and journaling operations, that generate this TLB overhead.





ACKNOWLEDGEMENTS
I would like to express the deepest appreciation to my committee chair, Professor Yifeng Zhu, who has the attitude and substance of a genius. His continual and convincing spirit of adventure lighted my path through the whole research. Without his guidance and persistent help, this thesis would not have been possible.
I would like to thank my committee members, Professor Vincent Weaver and Professor Bruce Segee; their kindness and patience were the most powerful support to my thesis. The advice they have





TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION AND MOTIVATION
1.1 Motivation
1.2 Background
1.2.1 Memory Management Unit (MMU)
1.2.2 Translation lookaside buffer and page table
1.2.3 Huge pages
1.2.4 Tailored page size
1.3 Summary of contribution
2. RELATED WORK
2.1 Traditional file systems
2.2 File systems built for NVM
2.2.1 NOVA
2.2.2 Ext4-DAX
2.3 File benchmarks
2.3.1 Filebench
2.3.2 IOzone
2.4 Performance counter tools
2.4.1 PERF
2.4.2 PAPI
3. EXPERIMENTAL SETUP
3.1 Environment setup
3.2 Emulating persistent memory using DRAM
3.3 File system setup
3.4 PERF tools
4. BENCHMARKS EVALUATION
4.1 Filebench
4.2 IOzone
4.3 Microbenchmark
5. MICROBENCHMARK EVALUATION
5.1 Evaluation of file size with write
5.2 Evaluation on unit size with fwrite
5.3 Evaluation on unit size with write
6. CONCLUSION AND FUTURE WORK
REFERENCES
APPENDICES




LIST OF TABLES

Table 1: TLB configurations on modern processors
Table 2: TLB hierarchy of the test machine
Table 3: Setting up DAX support
Table 4: Defining PMEM regions in the /etc/default/grub file
Table 5: Mounting file system on PMEM
Table 6: PERF events used in this research
Table 7: Metrics calculated by using hardware performance counters






LIST OF FIGURES

Figure 1: Page walk
Figure 2: Virtual address
Figure 3: Comparing TLB traffic under Filebench, normalized to ext4
Figure 4: Comparing TLB miss rates (%) under Filebench
Figure 5: TLB measurements under write/rewrite in IOzone
Figure 6: TLB measurements under rewrite/record in IOzone
Figure 7: TLB measurements under fwrite/re-fwrite in IOzone
Figure 8: Cycles spent (%) in page walks (data) for 10GB write workload
Figure 9: Page walks per 1000 instr. (data) for 10GB write workload
Figure 10: Average cycles per page walk (data) for 10GB write workload
Figure 11: Cycles spent (%) in page walks (instructions) for 10GB write workload
Figure 12: Page walks per 1000 instr. (instructions) for 10GB write workload
Figure 13: Average cycles per page walk (instructions) for 10GB write workload
Figure 14: L1 misses for 10GB write workload
Figure 15: L2 misses for 10GB write workload
Figure 16: LLC misses for 10GB write workload
Figure 17: Page walk overheads
Figure 18: TLB traffic for NOVA and Ext4-DAX from fwrite (1GB file)
Figure 19: TLB traffic for NOVA and Ext4-DAX from fwrite (2GB file)
Figure 20: TLB traffic for NOVA and Ext4-DAX from fwrite (4GB file)
Figure 21: TLB traffic for NOVA and Ext4-DAX from fwrite (8GB file)
Figure 22: TLB traffic for NOVA and Ext4-DAX from fwrite (16GB file)
Figure 23: TLB traffic for NOVA and Ext4-DAX from fwrite (20GB file)
Figure 24: TLB profile of Ext4-DAX when the total size is 20GB, the unit size is 1KB
Figure 25: TLB traffic profile of NOVA when the total size is 20GB, the unit size is 1KB
Figure 26: TLB traffic under different unit sizes when writing to a 1GB file
Figure 27: TLB traffic under different unit sizes when writing to a 2GB file
Figure 28: TLB traffic under different unit sizes when writing to a 4GB file
Figure 29: TLB traffic under different unit sizes when writing to an 8GB file
Figure 30: TLB traffic under different unit sizes when writing to a 16GB file
Figure 31: TLB traffic under different unit sizes when writing to a 32GB file
Figure 32: Profile from PERF tools for writing a 24GB file with a unit size of 1KB in NOVA
Figure 33: Profile from PERF tools for writing a 24GB file with a unit size of 4KB in NOVA





1. INTRODUCTION AND MOTIVATION 
Memory and storage have been two different architectural entities with greatly distinct
characteristics for decades. Memory is fast and byte-addressable, but volatile and expensive. On the 
contrary, storage (such as conventional spinning disks and solid-state drives) is non-volatile and cheap, 
but slow and does not support fine-grained accesses. The emerging byte-addressable NVM [1][2] (non-
volatile memory) technologies are expected to provide near-DRAM performance and near-storage 
capacity cost-effectively. Examples of this new byte-addressable persistent memory include spin-transfer 
torque MRAM (STT-MRAM) [3][4], memristors [5], and most notably, the Intel/Micron 3D-XPoint 
persistent memory. They provide large-capacity storage just as traditional spinning disks but are 100-1000 
times faster than state-of-the-art NAND flash [6]. Byte addressability makes it possible to use load/store instructions for persistence. NVM combines the advantages of both memory and storage, forming a new
entity in computer architecture: Hybrid NVM storage systems.  
Conventional file systems are tailored for slow hard disk drives (HDDs) or solid-state drives (SSDs). 
With the arrival of NVM-based storage systems, conventional file systems need to be reexamined or 
reevaluated. Furthermore, a few new file systems, such as NOVA, have been designed to take full advantage of NVM's characteristics, such as byte-addressability and fast fine-grained access.
This thesis focuses on studying the overhead of the translation lookaside buffer (TLB) in both 
conventional file systems and new NVM-oriented file systems. The TLB is used for translating virtual 
addresses to physical addresses. In NVM storage systems, TLB becomes even more critical for a very 
simple reason: NVM is very fast and typically resides on the same bus as DRAM (Dynamic Random Access 
Memory). Thus, the TLB performance will have much greater impacts in NVM storage systems than in conventional storage systems.





1.1 Motivation
Byte-addressable non-volatile memory (NVM) is a promising technology that provides near-DRAM performance with scalable capacity. A typical 7200 RPM (revolutions per minute) HDD delivers a write speed of 80-160MB/s. A typical SSD delivers a write speed between 200-550MB/s. On the other hand, NVM can deliver a write speed of 2000MB/s, much faster than SATA III SSDs, which are limited to 600MB/s.
In conventional storage systems, the I/O performance bottleneck lies in slow accesses to HDDs or SSDs. To access DRAM, processors always need to translate a virtual address to its corresponding physical address, with the help of the TLB. The performance of the TLB has little impact on the overall I/O performance in conventional storage systems, and therefore the TLB is not widely studied in the research of file systems or data storage. However, because NVM provides read speeds comparable with DRAM and write speeds only slightly slower than DRAM, the performance of the TLB becomes more important in NVM storage than in conventional storage.
In addition, modern NVM-based file systems often use more aggressive journaling or logging techniques to quickly recover the file system back into a consistent state after a system crash. Journaling or logging techniques make extra writes to NVM and accordingly generate more TLB accesses.
Unfortunately, the TLB size, defined by the number of entries the TLB can hold, does not scale up 
with the fast increase in DRAM capacity. For NVM, this situation becomes even worse as NVM’s capacity 
increases much faster than DRAM. As a result, more TLB misses will take place in NVM storage systems. 
One important question we need to answer is: what is the impact of the TLB performance in NVM storage systems? This important question motivates this thesis research, which aims to quantitatively evaluate this impact.





1.2 Background
In computing, a technology called virtual memory has been used for decades. Virtual memory is a memory management technique that creates a large and continuously-addressed virtual memory space for each application. It was designed to provide two important benefits: (1) running an application that has a memory footprint larger than the memory actually available on a given machine, and (2) providing secure isolation between the memory spaces of different applications. However, to access data residing in the memory, each virtual memory address has to be translated to its corresponding physical address. The address translation is performed in the unit of pages, which typically have a size of 4KB. The virtual memory is divided into pages, and the physical memory is divided into frames. The mapping from a virtual page number to a physical frame number is recorded in a special data structure called a page table. The TLB is designed to speed up the translation from a virtual page number to its corresponding physical frame number. Essentially, the TLB is an on-chip cache for off-chip page tables.
1.2.1 Memory Management Unit (MMU) 
The first system that implemented a variant of virtual memory was the ATLAS computer in the 
early 1960s. In ATLAS, “address is an identifier of a required piece of information but not a description of 
where in main memory that piece of information is” [7]. As Denning later said, referring to ATLAS, this theory of virtual memory gives software programmers the illusion that a continuous and very large memory space is freely at their disposal, without worrying about address conflicts with other applications or exceeding the capacity of the physical memory [8].
Address translation is needed to translate a virtual memory address to its actual physical memory address. One special hardware unit, called the memory management unit [9][10] (MMU), which is available on almost every modern processor, can effectively perform virtual-to-physical address translation. An MMU also provides other important functions, such as bus arbitration, access protection, and cache control.
1.2.2 Translation lookaside buffer and page table 
A translation lookaside buffer (TLB) is a special CPU cache, which holds the physical frame numbers of recently accessed virtual pages. It is essentially an on-chip cache for off-chip page tables and aims to reduce the time taken to access page tables. Therefore, the TLB is also called an address-translation cache. The TLB is on the critical path of every memory access. Accordingly, the TLB is crucial for the memory access performance of a processor [11].
 
 
Figure 1: Page walk 
 
  In the x86-64 architecture, page tables are stored as a 4-level hierarchical radix tree [12]. Figure 
1 shows an example, in which the page size is set to 4KB, and 48-bit virtual addresses are mapped to 52-
bit physical addresses. Each entry in the upper-level page table, called a page directory entry (PDE), holds 
the base memory address of the next lower-level page table. The entries in the page tables at the last 
level are called page table entries (PTEs). The page walk traverses the hierarchical radix tree until a PTE is found. This process is called walking the page tables, or a page walk. Upon each TLB miss, the MMU performs a page walk, requiring up to four memory accesses to resolve an address translation.
In the x86-64, the control register CR3 contains the base physical address of the topmost L4 page-
table. Starting with the topmost page table pointed by CR3, a virtual address can be translated into its 
corresponding physical address by iteratively walking through these hierarchical page tables.  If the page 
size is 4KB, the PTE is located at the leaf of the hierarchical radix, i.e., the PTE is in the bottommost page 
table. However, if a virtual address belongs to a page with a larger size, the PTE is located in a page table at a higher tree level. The largest page size supported by x86-64 is 1GB. To translate a 1GB page, only the L4 and L3
page tables are accessed. Therefore, only two memory references are required for the page walk of a 1GB 
page. In comparison, four memory references are needed for the page walk of a 4KB page. 
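To make the decomposition concrete, the following sketch (our own illustration, not code from the thesis) extracts the four 9-bit page-table indices and the 12-bit page offset that a 4KB page walk uses from a 48-bit virtual address:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;   /* an example 48-bit virtual address */

    /* 4KB pages: bits 0-11 are the page offset, followed by four 9-bit indices */
    uint64_t offset = vaddr & 0xFFF;          /* bits 0-11:  offset within the page          */
    uint64_t l1 = (vaddr >> 12) & 0x1FF;      /* bits 12-20: index into the last-level table (PTE) */
    uint64_t l2 = (vaddr >> 21) & 0x1FF;      /* bits 21-29: index into the next table (PDE) */
    uint64_t l3 = (vaddr >> 30) & 0x1FF;      /* bits 30-38: index into the L3 table         */
    uint64_t l4 = (vaddr >> 39) & 0x1FF;      /* bits 39-47: index into the table pointed to by CR3 */

    printf("L4=%lu L3=%lu L2=%lu L1=%lu offset=%lu\n",
           (unsigned long)l4, (unsigned long)l3, (unsigned long)l2,
           (unsigned long)l1, (unsigned long)offset);
    return 0;
}

Each of the four indices selects one entry at one level of the radix tree, which is why a 4KB translation can cost up to four memory references.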
The goal of the TLB is to avoid walking the page tables; the TLB acts as a cache for the paging hierarchy. To avoid ambiguity, in the remainder of this thesis the term cache(s) refers to data and instruction cache(s) and not TLBs. The TLB is usually implemented as a cache-like, fully-associative or set-associative structure.
The data TLB (D-TLB) configurations of several modern processors are presented in Table 1. Each core has 
its own private TLB.  Similar to hierarchical data caches, TLB can also have multiple hierarchical levels. 
Typically, the first level (L1) TLB contains a separate instruction TLB (ITLB) for address translations during instruction accesses, and a data TLB (DTLB) for address translations during data accesses. The second level (L2) TLB is usually unified, performing translations for both data and instruction accesses.
 
Processor Microarchitecture    L1 D-TLB Configuration        L2 TLB Configuration
AMD Zen [13]                   64-entry (all page sizes)     1532-entry (no 1GB pages)
ARM Cortex-A75 [13]            4-way SA, 1024-entry (4KB, 16KB, and 64KB page sizes)
Intel Skylake [14]             64-entry (all page sizes)     1536-entry (no 1GB pages)
Intel Kaby Lake [14]           Same as Skylake               Same as Skylake
Intel Coffee Lake [14]         Same as Skylake               Same as Skylake
Table 1: TLB configurations on modern processors
Table 1 also presents the page sizes supported by each processor. For commonly supported page 
sizes (i.e., 4KB, 2MB, and 1GB), Figure 2 shows example virtual address bits used to index a set-associative 
TLB in x86-64. Assume the TLB is 64-way set-associative. The tag and set-index bit fields in the 64-bit 
address form the page number. The remaining least significant bits form the page offset. 
 
Figure 2: Virtual address 
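As a small illustration of this indexing (our own sketch; the set count below is an assumption for illustration, not the configuration of any processor in Table 1), the set index and tag can be derived from the virtual page number as follows:

#include <stdint.h>
#include <stdio.h>

#define TLB_SETS 16   /* hypothetical number of sets, for illustration only */

/* page_shift is 12 for 4KB, 21 for 2MB, and 30 for 1GB pages */
static void tlb_index(uint64_t vaddr, unsigned page_shift,
                      uint64_t *set, uint64_t *tag)
{
    uint64_t vpn = vaddr >> page_shift;   /* drop the page-offset bits   */
    *set = vpn % TLB_SETS;                /* low VPN bits select the set */
    *tag = vpn / TLB_SETS;                /* remaining bits form the tag */
}

int main(void)
{
    uint64_t set, tag;
    tlb_index(0x00007f1234567abcULL, 12, &set, &tag);   /* 4KB page */
    printf("4KB page: set=%lu tag=%lu\n", (unsigned long)set, (unsigned long)tag);
    tlb_index(0x00007f1234567abcULL, 21, &set, &tag);   /* 2MB page */
    printf("2MB page: set=%lu tag=%lu\n", (unsigned long)set, (unsigned long)tag);
    return 0;
}

Note that a larger page size removes more low-order bits from the indexed portion of the address, so a single entry covers a larger region of memory.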
 
1.2.3 Huge pages 
The virtual memory space is divided into fixed-length contiguous blocks, called pages or virtual 
pages. The standard size of a page is 4KB on many platforms, such as x86-64. Similarly, the physical 
memory space is also divided into fixed-length contiguous blocks, called page frames or frames. Hardware 
needs to translate a virtual page number to its page frame number to access data in the memory. As the 
memory capacity increases at a much faster rate than the number of TLB entries in modern computers, the overhead of hardware address translation has grown significantly. To reduce such overhead, hardware manufacturers started to provide TLB support for pages that are significantly larger than the standard page. With a larger page, hardware performs fewer address translations.
Modern operating systems support huge pages and provide dedicated APIs for huge page 
allocations. These pages are called huge pages in Linux [15][16][17], super pages in BSD [18], or large 
pages in Windows [19][20].  Depending on the processor architecture and operating systems, the size of 
a huge page varies from 2MB to 256MB.  
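As a minimal sketch of the Linux huge-page API mentioned above (assuming huge pages have been reserved beforehand, e.g., through /proc/sys/vm/nr_hugepages), an application can request a 2MB huge page with mmap() and the MAP_HUGETLB flag:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* one 2MB huge page */

int main(void) {
    /* MAP_HUGETLB asks the kernel to back the mapping with huge pages,
       so a single TLB entry covers 2MB instead of 4KB */
    void *buf = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");                       /* fails if no huge pages are reserved */
        return 1;
    }
    memset(buf, 0, HUGE_PAGE_SIZE);           /* touch the mapping */
    munmap(buf, HUGE_PAGE_SIZE);
    return 0;
}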
1.2.4 Tailored page size 
To reduce TLB misses, a new mechanism, called tailored page sizes (TPS) [21], has been proposed to efficiently support pages of size 2^n, for all n greater than a default minimum. For x86, the default minimum page is 4KB, and thus TPS can support all pages of size 2^n, for all n greater than 12. Additionally, it uses only one page table entry (PTE) for each large contiguous virtual memory space.
TPS makes a minor change to the instruction set architecture (ISA) and the microarchitecture. Specifically, the ISA is extended to support pages of size 2^n, and TPS introduces a new hardware page-walk mechanism and an L1 TLB enhancement to support fast virtual-to-physical address translation. While this new design is very promising in reducing TLB misses in a computer system with very large memory, TPS is still a very new technology that has not been deployed in any commercial processors.
1.3 Summary of contribution 
With the help of both file system benchmarks and customized microbenchmarks, this thesis 
research studies the TLB overhead in both traditional file systems (including Ext4 running on disks or 
emulated NVM) and new NVM-based file systems (including Ext4-DAX and NOVA). Results from our 
extensive experiments show that, in different file systems, the TLB can generate different amounts of performance overhead even for exactly the same file operations. In addition, the TLB overhead in NOVA is significantly larger than in Ext4 and Ext4-DAX. By using PERF profiles, we take a deep look at file systems' internal calls and find that logging or journaling operations in modern NVM-based file systems can generate a significant TLB overhead.
In summary, this thesis research makes the following three contributions: (1) quantitatively evaluating the TLB overhead in NVM storage, (2) identifying the file system kernel calls that generate large TLB overheads, and (3) providing insights for file system optimizations on storage-class memory.




2. RELATED WORK 
This chapter discusses traditional file systems and journaling/logging mechanism used, as well as 
modern file systems recently developed for NVM. Various file system benchmarks that are used in this 
thesis research and software tools for collecting performance metrics are also introduced in this chapter. 
2.1 Traditional file systems 
For Linux, the ext4 journaling file system [22], developed as the successor to ext3, has been widely used. Journaling is designed to reduce the time required to recover the storage system back to a consistent state after a system failure. It is similar to the write-ahead logging technique that has been used in databases for ensuring atomic commits of transactions. In a journaling file system, only the updates since the last checkpoint recorded through journaling need to be scanned to perform redo or undo operations. Without journaling, the entire storage system needs to be scanned and all metadata need to be checked, which might take hours or days for a large storage system.
The basic journaling procedures in most modern file systems, such as ext4, Microsoft's NTFS [23], SGI's XFS [24], Solaris UFS [25], and IBM's JFS [26], are very similar to each other. They group multiple related updates together as a transaction, and make sure either all or none of the updates in a transaction are written permanently to the disks. Specifically, a special storage area, called the journal, is allocated. It records the changes that will be made to the storage system. After a crash, all changes recorded in the journal will be replayed to ensure that the file system is in a consistent state again. This journaling procedure generates a write-twice overhead: one write to the journal and the other to in-place locations. To mitigate the write-twice overhead, most modern file systems only log updates on metadata. Linux ext4 provides two modes: a journal mode that performs both data and metadata logging, and an ordered mode that only logs metadata updates and also ensures data are written to disk before metadata. The ordered mode is the default.
2.2 File systems built for NVM 
More and more new file systems [27][28] built for NVM have been proposed, such as NOVA [29], Ext4-DAX [30], and PMFS [31]. To support NVM, some file systems use additional techniques to ensure that data updates are atomic.
2.2.1  NOVA  
NOVA is a log-structured POSIX [32] file system designed specifically for byte-addressable NVM. 
Conventional file systems, such as ext4, were designed based on the performance characteristics of disk 
drives, and their performance is not optimal for fast NVM. NOVA adopts conventional log-structured file system techniques to fully leverage the fast random accesses provided by NVM. To mitigate the write-twice overhead, it does not log file data updates; only metadata updates are logged. In addition, it provides a dedicated log for each inode to improve concurrency. For file data updates, NOVA uses a copy-on-write technique, in which unused memory pages are allocated to hold the new file data and a log entry is appended to the corresponding inode log to record the location of the new file data.
2.2.2 Ext4-DAX 
Direct Access (DAX) is a new hardware mechanism that enables direct access to files stored in 
NVM, without copying data via the page cache. Ext4-DAX extends Ext4 with DAX capabilities to bypass
the page cache and map storage devices directly into the user space. Ext4-DAX uses journaling to 
guarantee the atomicity of metadata updates. Ext4 [33] has an optional data-journal mode to achieve the 
atomicity of file content updates. However, Ext4-DAX does not support this mode, and accordingly file 
data updates are not atomic. 
The page cache is usually used to buffer file reads and writes. It is also used to provide the pages 
that are mapped into the user space via the system call to mmap. For NVM devices that provide fast 
performance and byte-addressability, conventional file systems still make a copy in the page cache for file 
data. This generates unnecessary copies of NVM data and degrades the I/O performance. Ext4-DAX eliminates such unnecessary copies and performs read and write operations directly on the storage device. For file mappings, the storage device is mapped directly into the user space in Ext4-DAX.
To support DAX, the block driver needs to implement a 'direct access' function for the block device. This function translates a sector number (expressed in units of 512-byte sectors) to a page frame number (pfn) that identifies the physical page in NVM. It also returns a kernel virtual address so that software can access the data.
 The direct access function takes a parameter named size, which indicates the number of bytes 
being requested.  The function returns the number of bytes that can be read or written contiguously at 
that offset.  A negative error number (errno) is returned if any error takes place. 
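To illustrate, the sketch below shows roughly what this hook looked like in Linux 4.x kernels, such as the 4.1 kernel used in Chapter 3. The exact signature varies across kernel versions, and the pmem_device structure and its fields are assumptions made for illustration:

#include <linux/blkdev.h>

/* hypothetical driver state: base addresses and size of the PMEM region */
struct pmem_device {
    void *virt_addr;            /* kernel virtual base of the region   */
    unsigned long phys_addr;    /* physical base of the region         */
    unsigned long size;         /* total size of the region in bytes   */
};

static long pmem_direct_access(struct block_device *bdev, sector_t sector,
                               void **kaddr, unsigned long *pfn, long size)
{
    struct pmem_device *pmem = bdev->bd_disk->private_data;
    unsigned long offset = sector << 9;              /* sectors are 512 bytes */

    if (!pmem)
        return -ENODEV;                              /* error as a negative errno */
    *kaddr = (char *)pmem->virt_addr + offset;       /* kernel virtual address */
    *pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT; /* page frame number */

    return pmem->size - offset;   /* bytes contiguously accessible at this offset */
}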
2.3 File benchmarks 
Two file benchmarks with different characteristics, Filebench [34] and IOzone [35], are introduced in this section.
2.3.1 Filebench 
Filebench is a widely used file system and storage benchmark that can generate a variety of file system requests to emulate I/O workloads on web servers, file servers, mail servers, and database servers. It allows users to flexibly define new I/O behaviors, via a workload description, to synthesize a broad range of workloads. In the workload description, almost all parameters, such as the I/O size, can be set as random variables, changing dynamically during the testing.
2.3.2 IOzone  
IOzone is a file system benchmark that has been widely used to evaluate file system performance 
on a number of platforms. It focuses on the measurement of performance under a wide range of file 
operations, such as read, write, re-read, and re-write. It stresses the underlying storage systems by 
generating file system I/O requests with varying file sizes and record sizes. It can also produce I/O requests by following a wide range of pre-specified access patterns, such as strided read, backwards read, and
random read. IOzone allows users to model system-level variations, such as the number of processes, 
processor cache purges, processor cache size, and also model various file-level variations, such as 
synchronous/asynchronous file accesses, file locking, memory-mapped file accesses, and flushing.   
2.4 Performance counter tools 
Nearly all modern processors provide a set of hardware performance counters to count the 
occurrence or measure the timing of hardware-related events. Hardware performance counters are 
valuable tools for system designers to monitor and evaluate the low-level performance. Since these 
counters are built into silicon and managed directly by hardware, they provide precise and low-overhead 
metrics of the CPU performance. More importantly, performance counters are non-intrusive, and they 
are transparent to the OS and user applications. Depending on the processor architecture, the number of 
hardware events monitored by performance counters varies greatly. For example, Intel x86 monitors 
hundreds of events [36], including cycle count, instruction count, branches taken status and prediction 
accuracy, TLB misses and invalidations, pipeline stalls, and memory access behaviors (miss rate at each 
memory hierarchy level). 
Generally, performance counters can be classified into two categories: configuration registers and counting registers. Before monitoring and collecting metrics, the user application needs to program the configuration registers to select target events and to configure interrupts for counter start/stop and overflow. Then, the operating system issues low-level hardware calls according to the user's requests, and software can use special instructions, such as rdpmc on x86 [37], to read the target performance counters.
It has been reported that most performance counters on x86_64 show significant run-to-run variations and often over-count [38]. Judicious consideration should be taken in interpreting the results. In particular, identifying performance counters that are deterministic, over-count little, and show small run-to-run variation is critical for obtaining reliable performance insights.
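As a small illustration of the rdpmc path (our own sketch, not from the thesis): the instruction reads the counter selected by ECX into EDX:EAX; whether it is usable from user space depends on the kernel configuration, and the counter index below is a hypothetical choice:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    /* rdpmc returns the counter selected by ECX in EDX:EAX */
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t before = rdpmc(0);   /* hypothetical: programmable counter 0 */
    /* ... code under measurement ... */
    uint64_t after = rdpmc(0);
    printf("counter delta = %llu\n", (unsigned long long)(after - before));
    return 0;
}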
2.4.1 PERF 
PERF [39] is a powerful lightweight profiling tool in Linux and it is included in the Linux kernel 
source tree, under tools/perf. PERF can make measurements at both the hardware level, such as 
Performance Monitoring Unit (PMU) and CPU performance counters, and the software level, such as 
software counters, probes, and tracepoints. PERF also provides a variety of commands to analyze the 
performance and trace data.   
A probe means that the kernel dynamically modifies an executable program at runtime, replacing 
instructions if necessary, in order to enable tracing.  PERF supports on-the-fly probing. It can directly probe 
a running process and collect performance information, without the need of recompiling the source code. 
PERF can probe into the kernel space via kprobes, and into the user space via uprobes.   
A tracepoint placed in code provides a hook to probe events of interests. In Linux, all tracepoints 
are listed in the directory /sys/kernel/debug/tracing/events, such as system calls, TCP/IP events, file 
system operations. Generally, tracepoints are instrumented in the Linux kernel, instead of user 
applications. A tracepoint has no performance overhead if it is not activated. When activated, its overhead is relatively low.
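As a concrete example of what PERF measures at the hardware level, the following sketch (our own illustration) uses the perf_event_open(2) system call, on which PERF is built, to count dTLB load misses around a region of code:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    /* generic dTLB read-miss event: cache id | (op << 8) | (result << 16) */
    attr.config = PERF_COUNT_HW_CACHE_DTLB |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;            /* start disabled; enable around the workload */

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... workload under measurement ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("dTLB load misses: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}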
2.4.2 PAPI 
The Performance Application Programming Interface (PAPI) project [40] is a portable open-source 
tool for accessing low-level hardware performance counters in near real-time. It is available for most 
modern CPUs. It provides a compact set of robust and useful tools to efficiently diagnose and analyze 
processor-specific performance metrics. PAPI has two important layers: a large hardware-independent layer optimized for platform independence, and a smaller hardware-specific layer containing platform-dependent code. By statically linking the hardware-independent layer with the hardware-specific layer, PAPI can be compiled for a wide range of operating systems and processors. PAPI is still under active development.






3. EXPERIMENTAL SETUP 
This chapter presents the setup of our experiments, including the settings of computer hardware, 
the setup of file systems, and measurement methods for using performance counters. 
Due to the lack of modern NVM devices, this research uses DRAM to emulate NVM. While the read performance of DRAM and NVM is comparable, NVM is slightly slower than DRAM in writes. Another major difference is that NVM is non-volatile: data are retained in NVM even when power is not supplied.
two major differences have no significant impact on evaluating the performance overhead of the TLB.  
Since NVM also resides on the memory bus, TLB access patterns are very similar even when NVM is 
emulated by DRAM.     
3.1 Environment setup 
We conduct our study on a 6-core Intel Core(TM) i7-8700K (Coffee Lake) running at 3.7GHz
equipped with 64GB memory. Each core has a private TLB hierarchy, as shown in Table 2. It has a first-
level Data-TLB (DTLB), a first-level Instruction-TLB (ITLB), and a second-level TLB, shared between ITLB and 
DTLB.  In this thesis, we focus on the second-level TLB traffic because each miss in it will trigger a page 
walk.  
Per-core TLB Hierarchy
I-TLB     4KB:      8-way set associative, 64 entries
          1GB:      none
D-TLB     4KB:      4-way set associative, 64 entries
          1GB:      4-way set associative, 4 entries
L2 TLB    1MB:      4-way set associative, 64-byte line size
          4KB/2MB:  6-way set associative, 1536 entries
Table 2: TLB hierarchy of the test machine.




The system runs Ubuntu 16.04 with kernel version 4.1. We run file benchmarks and microbenchmarks on both traditional file systems, such as ext4, and DAX-supporting file systems, such as NOVA and ext4-DAX.
3.2 Emulating persistent memory using DRAM 
We use DRAM to emulate persistent memory. After special configuration, operating systems view 
a reserved DRAM memory portion as a PMEM region [41]. Because it is a DRAM-based emulation, it is 
likely to be faster than persistent memory, and all data will be lost upon powering down the machine. 
Below are basic steps of configuring a PMEM region. 
1) Identify usable regions in DRAM. 
2) Specify memmap kernel parameters in the GRUB file.
3) After reboot: 
a) Create a PMEM region. 
b) The kernel offers this space to the PMEM driver. 
c) Linux treats this DRAM region as PMEM and creates pmem devices (/dev/pmem0, 
/dev/pmem1,…). 
First, we enable DAX and PMEM in the kernel.  
$ make nconfig 
 
        -> Device Drivers -> NVDIMM Support -> 
 
                    <M>PMEM; <M>BLK; <*>BTT 
Table 3: Setting up DAX support. 
Then, we update the GRUB configuration.
# vi /etc/default/grub
GRUB_CMDLINE_LINUX="memmap=10G!30G memmap=32G!52G"
Table 4: Defining PMEM regions in the /etc/default/grub file.




3.3 File system setup 
We create the file system on the device /dev/pmem0 and then enable DAX on it. We will mount file
systems such as NOVA and Ext4-DAX in the directory /mnt/ramdisk. The file systems we will test include 
Ext4, Ext4-DAX, and NOVA. We will mount those three file systems on a DRAM-based region to emulate a 
file system running on NVM. Additionally, since our operating system uses the ext4 file system on the
hard drive, we will perform experiments to compare the ext4 file performance in different storage systems. 
In addition, our experiments are limited to two major and popular DAX-supporting file systems, NOVA and Ext4-DAX, which are designed specifically for NVM-based storage systems.
# mkdir /mnt/ramdisk1 
# mkdir /mnt/ramdisk2 
# mkfs.ext4 /dev/pmem0 
# mount -o dax /dev/pmem0 /mnt/ramdisk1 
# mount -t NOVA -o init /dev/pmem1 /mnt/ramdisk2
Table 5: Mounting file system on PMEM. 
 
3.4 PERF tools 
We will use PERF tools to collect experimental data. PERF uses event callbacks to achieve accurate data collection. Major events related to the TLB include TLB-loads, TLB-stores, TLB-load-misses, and TLB-store-misses. TLB-loads count the TLB traffic caused by load operations that read data from the memory. TLB-stores count the TLB traffic caused by store operations that write data to the memory. TLB-misses count the number of TLB misses that occur during the process. Each event has three levels: user, kernel, and both. The user level collects only the TLB traffic from the user level, which is caused by user applications. The kernel level only gathers the TLB traffic from the kernel level, which is generally caused by the operating system.
We will use the call graph function to analyze the data collected by PERF. The call graph function 
will display how operations have proceeded, such as which function was called and what percentage of 
TLB traffic has been generated. In our experiments, we will first focus on the total TLB traffic caused by all levels. After analyzing the results, we will go deeper and perform more experiments to collect the TLB traffic of interest.
PERF has a powerful function, called event, that can collect various types of information. Below is a list of the events used in this research.
Name: Meaning
CPU_CLK_UNHALTED.THREAD_P: Thread cycles when the thread is not in a halt state.
L1D.REPLACEMENT: Counts the number of lines brought into the L1 data cache.
MEM_LOAD_UOPS_RETIRED.LLC_HIT: Retired load uops that hit in the last-level (L3) cache without snoops required.
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT: Retired load uops that hit in the last-level (L3) cache and were found in a non-modified state in a neighboring core's private cache (same package).
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM: Retired load uops whose data source was an on-package core cache with HitM responses.
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS: Retired load uops whose data source was an on-package core cache LLC hit and a cross-core snoop missed.
DTLB_LOADS: DTLB loads from all levels.
DTLB_STORES: DTLB stores from all levels.
DTLB_LOAD_MISSES: DTLB load misses from all levels.
DTLB_STORE_MISSES: DTLB store misses from all levels.
INST_RETIRED.ANY_P: Instructions retired (programmable counter and precise event).
DTLB_LOAD_MISSES.WALK_COMPLETED: DTLB load misses that cause completed page walks.
DTLB_LOAD_MISSES.WALK_DURATION: Cycles while the PMH is busy with page walks caused by DTLB load misses.
DTLB_STORE_MISSES.WALK_COMPLETED: Store misses in all DTLB levels that cause completed page walks.
DTLB_STORE_MISSES.WALK_DURATION: Cycles when the PMH is busy with page walks caused by DTLB store misses.
ITLB_MISSES.WALK_COMPLETED: Misses in all ITLB levels that cause completed page walks.
ITLB_MISSES.WALK_DURATION: Cycles when the Page Miss Handler (PMH) is servicing page walks caused by ITLB misses.
Table 6: PERF events used in this research
 
After we collect event data from our experiments, the following equations in Table 7 are used to calculate various performance metrics.




(%) Cycles spent in page walks (data) =
    (DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) /
    CPU_CLK_UNHALTED.THREAD_P
Page walks per 1000 instr. (data) =
    (DTLB_LOAD_MISSES.WALK_COMPLETED + DTLB_STORE_MISSES.WALK_COMPLETED) /
    (INST_RETIRED.ANY_P / 1000)
Average cycles per page walk (data) =
    (DTLB_LOAD_MISSES.WALK_DURATION + DTLB_STORE_MISSES.WALK_DURATION) /
    (DTLB_LOAD_MISSES.WALK_COMPLETED + DTLB_STORE_MISSES.WALK_COMPLETED)
(%) Cycles spent in page walks (instructions) =
    ITLB_MISSES.WALK_DURATION / CPU_CLK_UNHALTED.THREAD_P
Page walks per 1000 instr. (instructions) =
    ITLB_MISSES.WALK_COMPLETED * 1000 / INST_RETIRED.ANY_P
Average cycles per page walk (instructions) =
    ITLB_MISSES.WALK_DURATION / ITLB_MISSES.WALK_COMPLETED
L1 misses = L1D.REPLACEMENT
L2 misses =
    MEM_LOAD_UOPS_RETIRED.LLC_HIT +
    MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT +
    MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM +
    MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS
LLC misses = MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS
Table 7: Metrics calculated by using hardware performance counters.
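As a small worked example (with hypothetical counts; only the formulas come from Table 7), the data-side page-walk metrics can be computed from the raw event totals as follows:

#include <stdio.h>

int main(void)
{
    /* hypothetical raw counts from one run; real values come from PERF */
    double load_walk_cycles  = 1.2e9;    /* DTLB_LOAD_MISSES.WALK_DURATION   */
    double store_walk_cycles = 0.8e9;    /* DTLB_STORE_MISSES.WALK_DURATION  */
    double load_walks_done   = 3.0e7;    /* DTLB_LOAD_MISSES.WALK_COMPLETED  */
    double store_walks_done  = 2.0e7;    /* DTLB_STORE_MISSES.WALK_COMPLETED */
    double unhalted_cycles   = 9.0e10;   /* CPU_CLK_UNHALTED.THREAD_P        */
    double insts_retired     = 6.0e10;   /* INST_RETIRED.ANY_P               */

    double walk_cycles = load_walk_cycles + store_walk_cycles;
    double walks       = load_walks_done + store_walks_done;

    printf("cycles spent in page walks (data): %.2f%%\n",
           100.0 * walk_cycles / unhalted_cycles);
    printf("page walks per 1000 instr. (data): %.3f\n",
           walks / (insts_retired / 1000.0));
    printf("average cycles per page walk (data): %.1f\n",
           walk_cycles / walks);
    return 0;
}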




4. BENCHMARKS EVALUATION 
In this chapter, we first discuss the results obtained from Filebench. The experimental results 
show that, compared with other file systems, NOVA has similar TLB miss rates but 5 times more TLB traffic 
in the mailserver workload. Then, we use IOzone as the benchmark in our experiments. Additionally, a microbenchmark is developed to analyze TLB behaviors in a controlled setting, focusing on specific I/O access patterns.
4.1 Filebench 
The Filebench benchmark provides various setups to simulate different types of workloads. In order to stress the underlying storage systems and achieve the maximum performance on all file systems to be tested, we select three workload setups: fileserver, webserver, and mailserver. The fileserver setup is a workload that repeatedly writes data to files, one after the other. It contains file operations including create, write, append, and read. The webserver setup is a workload that has only open and read operations, reading a total of 1000 files. The mailserver setup is a workload that performs a large number of append and fsync operations.
In this section, we compare the TLB performance across the different workload setups in three file systems: ext4, ext4-DAX, and NOVA. Figure 3 shows the normalized TLB traffic for fileserver, webserver, and mailserver. Each experiment is repeated three times and the average is reported in the figure.




Figure 3: Comparing TLB traffic under Filebench, normalized to ext4 
 
Figure 3 shows that these three file systems have almost the same TLB traffic in the webserver setup. NOVA has slightly less TLB usage compared to ext4 and ext4-DAX. However, in mailserver, NOVA has nearly 5 times more TLB traffic than ext4 and ext4-DAX.
Larger TLB traffic does not necessarily lead to a larger performance overhead. Therefore, we measure and evaluate the TLB miss rates under the same experiment setup, as shown in Figure 4. Because fileserver contains a huge amount of write operations, its TLB miss rate is very low. In the other two workloads, the miss rates are around 0.7% and 0.6%, respectively, much higher than fileserver's. The NOVA file system also has 5 times more TLB traffic than the other two. With higher TLB miss rates and larger




Figure 4: Comparing TLB miss rates (%) under Filebench
 
4.2 IOzone 
In this section, IOzone is used to compare the TLB performance under four different file system setups: ext4 on a hard drive, ext4 on DRAM, ext4-DAX on DRAM, and NOVA on DRAM. Our experiments will





Figure 5: TLB measurements under write/rewrite in IOzone 
 
Figure 5 shows the TLB measurements of file write/rewrite as the file size increases from 1KB to 20GB. It is not surprising that the TLB traffic rises proportionally as the file size increases. The TLB traffic on DRAM-ext4-DAX increases almost linearly with the file size. However, the TLB traffic in NOVA grows faster than in the other two file systems, which confirms the finding observed in the previous Filebench experiments.
To compare the TLB traffic under different file APIs, we run the same experiments with rewrite/record and fwrite/re-fwrite. Figure 6 and Figure 7 show the results. It is observed that the TLB traffic trend is similar in all operations.
 




Figure 7: TLB measurements under fwrite/re-fwrite in IOzone 
 
While using IOzone, our goal is to compare the TLB performance over four different file system setups. However, the data gathered from IOzone do not show a stable trend to analyze. We believe one of the main reasons is I/O traffic interference [42], which IOzone cannot avoid; as a result, the data contain too many incomplete and misleading results, making it virtually impossible to understand which system or approach to use in any particular scenario [43].
Furthermore, we have tried several other file system benchmarks, such as IOmeter, Bonnie and 
Postmark. To the best of our knowledge, none of the existing benchmarks can answer our question. 
Therefore, we decide to write our own microbenchmark to gather the performance data collected via 
PERF.  
4.3 Microbenchmark 
We design and implement a simple microbenchmark, as shown in Program 1, to measure the TLB performance under two different settings: (1) varying the total amount of data written into a file, and (2) varying the amount of data in each write operation. We call the former the total size and the latter the unit size. Since there is no overlap between the address ranges of the write operations, the total size determines the size of the file generated. The unit size is the variable passed to the write/fwrite function to specify the write amount. The microbenchmark takes three arguments: the file directory, the total size, and the unit size.
This microbenchmark is very simple. It repeatedly appends an array of data to the end of an existing file via the function write(). The total size is specified in GB, and the unit size is specified in KB. Local variables are declared with the register keyword to suggest that the compiler store them in processor registers instead of placing them in the memory. This helps avoid unnecessary memory accesses and TLB traffic.










#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

typedef unsigned int uint;

/* buffer appended on every write(); sized for the largest unit tested */
static char array[4 * 1024 * 1024];

int main( int argc, char **argv ) {
    int fd;
    register uint i;
    register uint total_size, unit_size;

    if( argc < 4 ) {
        printf("usage: %s <file> <total size in GB> <unit size in KB>\n", argv[0]);
        return(0);
    }

    printf("input is %s\n", argv[1]);

    /* total size: GB converted to KB; unit size: KB */
    total_size = atoi(argv[2])*1024*1024;
    unit_size = atoi(argv[3]);

    printf("total = %d GB, unit=%d KB\n",
            total_size/(1024*1024), unit_size);

    /* append to the target file, creating it if it does not exist
       (the open flags are our reconstruction; the original listing omitted them) */
    fd = open(argv[1], O_WRONLY | O_CREAT | O_APPEND, 0644);
    if( fd < 0 ) {
         printf("fail to create a file\n");
         return(0);
    }

    /* repeatedly append unit_size KB until total_size KB have been written */
    for( i = 0; i < total_size/unit_size; i++ )
        write(fd, array, unit_size * 1024);

    close(fd);
    return(0);
}

Program 1: The microbenchmark used in this thesis.
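For example, assuming the compiled binary is named file_write (the binary name and target path here are illustrative), running "./file_write /mnt/ramdisk1/test 10 4" would append a total of 10GB to a file on the Ext4-DAX mount in 4KB units.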
 









5. MICROBENCHMARK EVALUATION
In this chapter, the microbenchmark presented in the previous chapter is used to compare the TLB performance in Ext4, Ext4-DAX, and NOVA. Each experiment is repeated 10 times and the average of these 10 measurements is used for comparison. According to Table 6, presented in Section 3.4, a total of 17 raw measurements are collected via PERF. We use the formulas presented in Table 7 to calculate various performance metrics. To better view the data, all measurements are plotted in a log scale on both the x axis and the y axis.
The total write size increases gradually from 1GB to 20GB. The file size increases from 1MB to 1024MB. The file operations tested include write, read, re-write, and re-read. We find that the trends of those data are similar. In order to save space, this thesis only presents the experimental measurements with a total write size of 10GB and file sizes of 1MB, 8MB, 64MB, and 1024MB.
5.1 Evaluation of file size with write 
From the results of the experiments presented in this section, we have two major findings: (1) the CPU can spend up to 8% of its working time on page walks caused by instructions; (2) Ext4-DAX has up to 1000 times fewer L2 misses compared to the other two file systems.
 
Figure 8: Cycles spent (%) in page walks (data) for 10GB write workload 
 
 
In the write workload, a total of 10GB of data was written sequentially to DRAM. The four plots given in Figure 8 show the cycles spent (%) in page walks for data under different file sizes and different write




It is interesting that the unit size has little impact on the cycles spent (%) in page walks in Ext4 and NOVA. But the overhead in Ext4-DAX changes significantly as the unit size increases, especially when the unit size becomes larger than 2MB. The reason is that Ext4-DAX has a bypass setting whose threshold is set at 2MB. This setting perfectly matches the L2 TLB and thus causes almost no page walks. Once the unit size is larger than 2MB, the page walk overhead goes up dramatically.
 
Figure 9: Page walks per 1000 instr. (data) for 10GB write workload 
 
 
Figure 9 shows the number of page walks caused by data per 1000 instructions. The number of 




There is no special page table setup (Sections 1.2.3 and 1.2.4) in our experiments. Therefore, the performance penalties of page walks under the same unit size are almost the same in these experiments.
Figure 10: Average cycles per page walk (data) for 10GB write workload 
 
 
Figure 10 shows the average number of cycles per page walk caused by data accesses. It confirms 
that all file systems have a similar page walk penalty when the unit size is smaller than 1MB. When the 
unit size becomes larger than 1MB, the number of page walks from Ext4-DAX increases (see Figure 8). This 




Figure 8, Figure 9, and Figure 10 focus on the penalty of page walks from data accesses, which are generated by the write operations. Our experiments are set up with an ideal workload of sequential writes and little I/O interference. However, page walks still cause 1% to 3% performance overhead. In addition, Ext4-DAX shows some interesting performance changes when the unit size becomes larger than 1MB.
Figure 11: Cycles spent (%) in page walks (instructions) for 10GB write workload 
 
 
Figure 11 shows the overhead of page walks caused by instruction accesses. The page walk overhead is below 0.1% when the file size is small, such as 1MB, 8MB, and 64MB. When the file size increases to 1GB, the cycle overhead can be up to 8% in all three file systems. The performance of NOVA





Figure 12: Page walks per 1000 instr. (instructions) for 10GB write workload 
 
 
Figure 12 shows that the number of page walks increases as the unit size increases. All the numbers of page walks are too small to interfere with the overall I/O performance. To make experiments more






Figure 13: Average cycles per page walk (instructions) for 10GB write workload 
 
 
Comparing Figure 10 and Figure 13, it can be seen that the average cycles per page walk for data and instructions are similar. Since the architectures of the instruction and data page tables are the













Figure 14: L1 misses for 10GB write workload

Figure 15: L2 misses for 10GB write workload

Figure 14 and Figure 15 show the total numbers of L1 misses and L2 misses, respectively. The total numbers of L1 misses and L2 misses are stable as the unit size increases. However, the performance of Ext4-DAX is greatly influenced by the unit size. For example, when the file size is 1024MB and the unit size is






Figure 16: LLC misses for 10GB write workload 
 
 
A TLB lookup will be performed when an LLC miss takes place, so the number of LLC misses can be used to evaluate the TLB traffic. Figure 16 shows that there are about 10 LLC misses when the file size is 1MB. However, there are at least 10^3 LLC misses when the file size increases to 1024MB.
After analyzing the figures presented above, we find that page walks, caused by TLB misses, still incur a certain amount of overhead in all the file systems (see Figure 17). Compared with the other file systems, Ext4-DAX is more sensitively influenced by the unit size. While the performance overhead in




difference of 100 or even 1000 times. Ext4-DAX bypasses the page cache and moves data directly through memory. A small unit size means more memory traffic, causing a significant impact on DAX file systems.
 
 
Figure 17: Page walk overheads 
 
This evaluation shows that the unit size can have a large impact on both traditional file systems and NVM-based file systems. The performance overhead under the same workload with different unit sizes can differ by a factor of 100. Studying how the unit size influences the file systems' TLB-related performance overhead is our next evaluation goal. We focus on varying the unit size using write/fwrite operations to get a closer look at the results, using the profiles obtained via PERF.
5.2 Evaluation on unit size with fwrite 
The results of the microbenchmark with fwrite on four different file system setups are presented in this section. In these experiments, the total size increases from 1GB to 20GB, and the unit size increases from













presented in Figure 18 - Figure 23, with the total file size being 1GB, 2GB, 4GB, 8GB, 16GB, and 20GB, 
respectively.  
 




















Figure 23: TLB traffic for NOVA and Ext4-DAX from fwrite (20GB file).
 
The experimental results in the above figures (Figure 18 - Figure 23) show that the total TLB traffic from Ext4-DAX is 1.5 to 2.5 times that of NOVA when the unit size is small. In addition, the TLB traffic decreases as the unit size increases. The total traffic under the four different file system setups converges as the unit size gets very large. Since the trend of TLB traffic under different file sizes is similar, we will focus on investigating the impact of the unit size.
According to the figures shown above, the TLB traffic in Ext4-DAX is more than NOVA's until the unit size increases to 256KB. However, when the unit size increases to 4MB, the TLB traffic in Ext4-DAX becomes less than NOVA's. The results show that although the TLB traffic in Ext4-DAX decreases almost linearly with the write unit size when the total size is kept constant, the TLB traffic in NOVA does not present such a linear relationship.
To find the reason behind this, we use the Perf tool to obtain a detailed profile.
+   95.33% 16.50%  file_fwrite  libc-2.21.so                                                                  
   - 78.58% 78.58%  file_fwrite  [kernel.kallsyms]                                                             
   - 78.44% entry_SYSCALL_64_fastpath                                                                                
   - 78.36% sys_write                                                                                          
      - 77.55% vfs_write                                                                                         
         - 72.44% __vfs_write                                                                                    
            - 71.83% ext4_file_write_iter                                                                        
               - 70.73% __generic_file_write_iter                                                                
                  - 69.62% generic_file_direct_write                                                             
                     - 69.05% ext4_direct_IO                                                                     
                        - 68.56% ext4_ind_direct_IO                                                              
                           + 45.15% dax_do_io                                                                    
                           + 5.47% ext4_orphan_add                                                               
                           + 5.47% ext4_orphan_del                                                               
                           + 4.37% ext4_mark_inode_dirty                                                                                    
                           + 4.20% __ext4_journal_start_sb                                                       
                           + 2.84% __ext4_journal_stop                                                           
                           + 0.48% mutex_unlock                                                                  
                           + 0.18% __GI___libc_write                                                             
                           + 0.17% jbd2__journal_start 
 
Figure 24: TLB profile of Ext4-DAX when the total size is 20GB and the unit size is 1KB
 
Figure 24 shows the TLB traffic profile of Ext4-DAX. The fwrite function generates 78% of the total TLB traffic; the remaining 22% is generated by other system operations. In Ext4-DAX, the fwrite path calls the "dax_do_io" function to write data. The profile shows that "dax_do_io" alone accounts for 45% of all TLB traffic. The implementation of this function in Ext4-DAX is shown below in Table 8.
/** 
 * dax_do_io - Perform I/O to a DAX file 
 * @iocb: The control block for this I/O 
 * @inode: The file which the I/O is directed at 
 * @iter: The addresses to do I/O from or to 
 * @pos: The file offset where the I/O starts 
 * @get_block: The filesystem method used to translate file offsets to blocks 
 * @end_io: A filesystem callback for I/O completion 
 * @flags: See below 
 * 
 * This function uses the same locking scheme as do_blockdev_direct_IO: 
 * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the 
 * caller for writes.  For reads, we take and release the i_mutex ourselves. 
 * If DIO_LOCKING is not set, the filesystem takes care of its own locking. 
 * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O 
 * is in progress. 
 */ 
 
ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode, 
    struct iov_iter *iter, loff_t pos, get_block_t get_block, 
    dio_iodone_t end_io, int flags) 
{ 
 struct buffer_head bh; 
 ssize_t retval = -EINVAL; 
 loff_t end = pos + iov_iter_count(iter); 
 
 memset(&bh, 0, sizeof(bh)); 
 
 if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ) { 
  struct address_space *mapping = inode->i_mapping; 
  mutex_lock(&inode->i_mutex); 
  retval = filemap_write_and_wait_range(mapping, pos, end - 1); 
  if (retval) { 
   mutex_unlock(&inode->i_mutex); 
   goto out; 
  } 
 } 
 
 /* Protects against truncate */ 
 if (!(flags & DIO_SKIP_DIO_COUNT)) 
  inode_dio_begin(inode); 
 
 retval = dax_io(inode, iter, pos, end, get_block, &bh); 
 
 if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ) 
  mutex_unlock(&inode->i_mutex); 
 
 if ((retval > 0) && end_io) 
  end_io(iocb, pos, retval, bh.b_private); 
 
 if (!(flags & DIO_SKIP_DIO_COUNT)) 
  inode_dio_end(inode); 
 out:
 return retval;
}

Table 8: The implementation of the dax_do_io function in Ext4-DAX

The dax_do_io function calls filemap_write_and_wait_range to write back and wait on any cached pages in the target file range before performing direct access. The functions "ext4_orphan_add" and "ext4_orphan_del", which link and unlink the inode on the orphan list to protect data in case of a crash, generate 10% of the total TLB traffic.
+ 96.92% 23.38%  file_fwrite  libc-2.21.so                                                                  
- 73.25% 73.25%  file_fwrite  [kernel.kallsyms]                                                             
- 72.74% entry_SYSCALL_64_fastpath                                                                                 
- 72.73% sys_write                                                                                             
- 71.63% vfs_write                                                                                                 
- 64.73% __vfs_write                                                                                     
- 61.69% nova_dax_file_write                                                                          
      + 44.25% nova_cow_file_write                                                                       
      + 17.05% __copy_user_nocache                                                                       
      + 0.22% mutex_unlock                                                                               
      + 0.04% nova_new_data_blocks                                                                       
      + 0.03% mutex_lock                                                                                 
      + 0.03% __sb_start_write                                                                           
      + 0.03% nova_reassign_file_tree                                                                    
      + 0.02% nova_append_file_write_entry 
 
Figure 25: TLB traffic profile of NOVA when the total size is 20GB and the unit size is 1KB
 
Figure 25 shows the TLB traffic profile of NOVA. The fwrite path takes 73% of the total TLB traffic. The function "nova_cow_file_write" performs the actual data write operations, and it alone accounts for about 44% of the total TLB traffic.
5.3 Evaluation on unit size with write 
We now present the same experiments using the write operation. The total size increases from 1GB to 20GB, and the unit size increases from 1KB to 4MB. The total TLB accesses, the total TLB stores, and the total TLB loads are measured and presented in Figure 26 - Figure 31, with the total file size being 1GB, 2GB, 4GB, 8GB, 16GB, and 20GB, respectively.
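The only change from Section 5.2 is the write path: instead of buffered fwrite calls through libc, each unit is written with the write system call, so every unit triggers one syscall and one pass through the kernel write path (sys_write, vfs_write, and so on in the profiles below). A minimal sketch, with the same placeholder file name and sizes as before:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    size_t total = 1UL << 30;   /* total file size: 1 GB (placeholder) */
    size_t unit  = 1024;        /* write unit size: 1 KB (placeholder) */

    char *buf = malloc(unit);
    if (!buf) return 1;
    memset(buf, 'x', unit);

    int fd = open("testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;

    /* one write() syscall per unit: no libc buffering between calls */
    for (size_t done = 0; done < total; done += unit)
        if (write(fd, buf, unit) != (ssize_t)unit) return 1;

    close(fd);
    free(buf);
    return 0;
}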
Figure 26: TLB traffic under different unit sizes when writing to a 1GB file

Figure 27: TLB traffic under different unit sizes when writing to a 2GB file

Figure 28: TLB traffic under different unit sizes when writing to a 4GB file

Figure 29: TLB traffic under different unit sizes when writing to an 8GB file

Figure 30: TLB traffic under different unit sizes when writing to a 16GB file
Figure 31: TLB traffic under different unit sizes when writing to a 32GB file 
 
Figure 26 to Figure 31 show the TLB traffic under different unit sizes in four different file systems, as the total file size increases from 1GB to 32GB. In each figure, the first two plots are the total TLB-loads and TLB-stores as the unit size increases. The results show that, on average, the total number of TLB-loads is 1.5 times that of TLB-stores. The measurements in NOVA and Ext4-DAX are similar. The reason is that NOVA is log-structured and Ext4-DAX keeps ext4's metadata journal, so in both file systems each operation first loads log or journal state before writing. That is why there are more TLB-loads than TLB-stores, and it also explains why the two file systems show similar traffic.
Generally, the TLB traffic in NOVA and Ext4-DAX decreases as the unit size increases, but this trend stops after the unit size grows beyond some threshold. With a 1GB total size, the TLB traffic in NOVA and Ext4-DAX is very similar. As the total size increases, NOVA's TLB traffic grows past Ext4-DAX's: NOVA has around 1.3 times more TLB traffic when the total size is 4GB, around 2.5 times more at 16GB, and around 3.5 times more at 24GB. Since the TLB traffic converges at large unit sizes, we focus on small unit sizes in the rest of this chapter.
From the figures we can clearly see that the TLB traffic in Ext4-DAX increases linearly with the total size, whereas NOVA reaches 3.5 times more TLB traffic at a total size of 24GB. As a result, NOVA has much more TLB traffic when the file size is large. We use the Perf tool to find the reason behind this.
47.61%  file_write   [kernel.kallsyms]         [k] curr_log_entry_invalid.isra.12.constprop.30
  13.71%  file_write   [kernel.kallsyms]        [k] nova_get_append_head                     
   6.09%  file_write   [kernel.kallsyms]         [k] __memcpy                                 
   5.72%  file_write   [kernel.kallsyms]         [k] __copy_user_nocache                      
   4.11%  file_write   [kernel.kallsyms]         [k] nova_cow_file_write                      
   2.06%  file_write   [kernel.kallsyms]         [k] __memset                                 
   1.44%  file_write   [kernel.kallsyms]         [k] __radix_tree_lookup                      
   1.09%  file_write   [kernel.kallsyms]         [k] percpu_down_read                         
   1.00%  file_write   [kernel.kallsyms]         [k] _raw_spin_lock                           
   0.98%  file_write   [kernel.kallsyms]         [k] mutex_lock                               
   0.96%  file_write   [kernel.kallsyms]         [k] __fget_light                             
   0.89%  file_write   [kernel.kallsyms]         [k] __srcu_read_lock                         
   0.88%  swapper    [kernel.kallsyms]         [k] intel_idle                               
   0.80%  file_write   [kernel.kallsyms]         [k] nova_reassign_file_tree                  
   0.80%  file_write   [kernel.kallsyms]         [k] nova_free_blocks                         
   0.79%  file_write   [kernel.kallsyms]         [k] nova_new_blocks                          
   0.68%  file_write   [kernel.kallsyms]         [k] __kmalloc                                
   0.66%  file_write   [kernel.kallsyms]         [k] lockref_get_not_zero                     
   0.58%  file_write   [kernel.kallsyms]         [k] percpu_up_read 
 
Figure 32: Profile from Perf tools for writing a 24GB file with a unit size of 1KB in NOVA. 
The profile in Figure 32 shows that 'curr_log_entry_invalid.isra.12.constprop.30' takes 47.61% of all TLB traffic when the file size is 24GB and the unit size is 1KB. The reason NOVA's TLB traffic grows so sharply as the unit size decreases is that for each write operation, NOVA first writes a copy of the data to new pages (step 1) and then appends the file write entry (step 2). After that, NOVA updates the log tail (step 3) and the radix tree (step 4). Finally, NOVA returns the old version of the data to the allocator (step 5). The function 'curr_log_entry_invalid.isra.12.constprop.30' is called in steps 2, 3, and 4. This explains why NOVA generates more TLB traffic at smaller unit sizes.
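To see why every small write touches the log and the index, the toy user-space model below mimics these five steps for a single block; it is purely illustrative and shares only its structure, not its code, with NOVA. Each call allocates fresh space, appends to a singly linked per-inode log, advances the tail, updates the index, and frees the old copy, so the work (and the TLB traffic) scales with the number of writes rather than the number of bytes.

#include <stdlib.h>
#include <string.h>

/* Toy model of a per-inode, singly linked write log (illustrative only). */
struct log_entry { size_t len; struct log_entry *next; };

struct toy_inode {
    struct log_entry *head, *tail;  /* per-inode log */
    char *data;                     /* stands in for the radix-tree index */
};

static void toy_cow_write(struct toy_inode *ino, const char *buf, size_t len)
{
    char *fresh = malloc(len);                 /* step 1: copy data to new pages  */
    memcpy(fresh, buf, len);

    struct log_entry *e = malloc(sizeof(*e));  /* step 2: append file write entry */
    e->len = len;
    e->next = NULL;

    if (ino->tail) ino->tail->next = e;        /* step 3: update the log tail     */
    else ino->head = e;
    ino->tail = e;

    char *old = ino->data;
    ino->data = fresh;                         /* step 4: update the index        */
    free(old);                                 /* step 5: return old data blocks  */
}

int main(void)
{
    struct toy_inode ino = { 0 };
    for (int i = 0; i < 4; i++)                /* four small writes, four entries */
        toy_cow_write(&ino, "payload", 8);
    return 0;
}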
There is another question: why does NOVA perform worse as the file size increases? The answer is that NOVA uses a radix tree to store the metadata for the file log. For the case of a 24GB file size and a 1KB unit size, NOVA creates about 24 million file write entries, all stored in radix trees, and each of these entries must later be invalidated. That explains why NOVA's performance is worse at larger file sizes.
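Concretely, taking 1GB = 2^30 bytes, the entry count follows directly from the two sizes:
\[
N_{\text{entries}} = \frac{\text{total size}}{\text{unit size}} = \frac{24 \times 2^{30}\ \text{bytes}}{2^{10}\ \text{bytes}} = 24 \times 2^{20} \approx 2.5 \times 10^{7}.
\]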
In addition, a key difference between NOVA and Ext4-DAX is that NOVA gives each inode its own log to keep data updates atomic, uses logging and lightweight journaling for complex atomic updates, and implements the log as a singly linked list. Since each inode has its own log, every file operation involves log updates, which cause a lot of TLB traffic. Ext4-DAX does not provide atomic data updates; it simply writes the data to the target location and cannot guarantee that the data is persisted if a power failure occurs. In the profile report, we also find radix-tree lookups and updates causing TLB traffic. This is because NOVA keeps the radix tree in DRAM to store the metadata for the file log.
 2.13%  file_write   [kernel.kallsyms]        [k] nova_cow_file_write
 0.54%  file_write   [kernel.kallsyms]        [k] nova_append_file_write_entry
10.90%  file_write   [kernel.kallsyms]        [k] nova_get_append_head
38.99%  file_write   [kernel.kallsyms]        [k] curr_log_entry_invalid.isra.12.constprop.30
25.62%  file_write   [kernel.kallsyms]        [k] __copy_user_nocache
 0.12%  file_write   [kernel.kallsyms]        [k] __fsnotify_parent

Figure 33: Profile from Perf tools for writing a 24GB file with a unit size of 4KB in NOVA.
70.70%  file_write   [kernel.kallsyms]        [k] __copy_user_nocache
 1.50%  file_write   [kernel.kallsyms]        [k] nova_cow_file_write
 0.14%  file_write   [kernel.kallsyms]        [k] nova_append_file_write_entry
 2.14%  file_write   [kernel.kallsyms]        [k] nova_get_append_head
 6.64%  file_write   [kernel.kallsyms]        [k] curr_log_entry_invalid.isra.12.constprop.30

Figure 34: Profile from Perf tools for writing a 24GB file with a unit size of 16KB in NOVA.
 
Figure 32, Figure 33 and Figure 34 show that the percentage of TLB traffic generated by 'curr_log_entry_invalid.isra.12.constprop.30' decreases as the unit size increases. For a unit size of 1KB, the amount of TLB traffic caused by 'curr_log_entry_invalid.isra.12.constprop.30' can be calculated as follows:

1.533 × 10^11 × 47.61% ≈ 7.3 × 10^10

Similarly, for a unit size of 4KB, we have 2.41 × 10^10 TLB accesses, and for a unit size of 16KB, we have 8.4 × 10^8 TLB accesses. When the total file size remains unchanged, the total TLB traffic at a 1KB unit size is about 3 times that at a 4KB unit size, while the TLB traffic at a 4KB unit size is almost 28 times that at a 16KB unit size. As the profiles show, the function 'curr_log_entry_invalid.isra.12.constprop.30' dominates the TLB traffic at small unit sizes and fades quickly once the unit size grows past 4KB.
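As a quick arithmetic check of the two ratios quoted above:
\[
\frac{7.3 \times 10^{10}}{2.41 \times 10^{10}} \approx 3.0,
\qquad
\frac{2.41 \times 10^{10}}{8.4 \times 10^{8}} \approx 28.7 .
\]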
6. CONCLUSION AND FUTURE WORK 
The size of TLBs on modern processors has not kept pace with the rapid increase in DRAM capacity. As NVM emerges as a new storage class residing on the same bus as DRAM, the gap between the TLB size and the combined memory/storage capacity grows even wider. More TLB misses inevitably slow down the translation of virtual addresses to physical addresses. Therefore, the performance of the TLB, which is often hidden by the large I/O latency of traditional hard disk drives or solid-state drives, becomes a new research issue for emerging NVM storage.
Traditional file systems are not tailored for emerging NVM storage. Even though a few new file systems have been designed and implemented for NVM storage, the impact of the TLB on I/O performance remains unclear for large-scale, fast NVM storage residing on the memory bus. This thesis research leverages widely used file system benchmarks as well as customized micro-benchmarks to study the TLB overhead in both traditional file systems and new NVM-oriented file systems.
In this thesis, we find that the TLB performance overhead generated by file operations can be as high as 8% in NVM-based file systems, meaning that 8% more CPU cycles are spent performing page walks. Further analysis shows that the TLB overhead generated by data accesses differs from that generated by instruction accesses: TLB misses from data accesses cause up to 4% performance overhead, while TLB misses from instruction accesses produce up to 8% overhead.
The TLB overhead generated by the same file operations varies across file systems. This confirms that the internal design of a file system has a tremendous impact on TLB performance. Extensive experiments have been conducted to compare three different file systems: two new file systems, Ext4-DAX and NOVA, running on emulated NVM, and a conventional file system running on disks and on emulated NVM. Experimental results show that the TLB performance of Ext4-DAX, Ext4, and NOVA can differ dramatically.
To find out how and why a file system influences the TLB performance, a micro-benchmark was developed and implemented to generate specific I/O traffic and to break down the TLB overhead into individual software function calls. The micro-benchmark writes files of various sizes using different write unit sizes, generating writes with a specific pattern to facilitate TLB overhead analysis. With the help of the micro-benchmark and the Perf tools, we find that over 50% of TLB traffic can be produced by file system operations, and that Ext4-DAX does not generate much less TLB traffic than Ext4. In addition, NOVA can generate up to 50% more TLB traffic than the other two file systems. We find that the extra TLB overhead is mostly caused by the logging operations in NOVA. Therefore, more TLB-friendly logging or journaling should be taken into consideration in the future design of NVM-based file systems.
Our future work is to design more advanced experiments with real NVM hardware to evaluate the TLB overhead under a variety of access patterns and I/O operations. A software tool could also be developed to intelligently generate specific file operations to stress a file system and to automatically analyze the resulting TLB overhead.

REFERENCES

[1] “Intel Optane DC Persistent Memory,” https://www.intel.com/content/www/us/en/architecture-
andtechnology/optane-technology/optane-for-data-centers.html, 2019. 
[2] 3D XPoint Technology Revolutionizes Storage Memory.  
http://www.intel.com/content/www/us/en/architecture-andtechnology/3d-xpoint-
technologyanimation.html, 2015 
[3] Huai, Y. “Spin-Transfer Torque MRAM (STT-MRAM): Challenges and Prospects.” (2008). 
[4] Ping Zhou and Bo Zhao and Jun Yang and Youtao Zhang, “Energy Reduction for STT-RAM using 
Early Write Termination,” in Proceedings of 2009 International Conference on Computer-Aided 
Design (ICCAD’09), San Jose, CA, 2009. 
[5] Chua, Leon. "Resistance switching memories are memristors." Applied Physics A 102.4 (2011): 
765-783. 
[6] Micheloni, Rino, Luca Crippa, and Alessia Marelli. Inside NAND flash memories. Springer Science 
& Business Media, 2010. 
[7] P. J. Denning, “Virtual memory," ACM Comput. Surv., vol. 2, no. 3, pp. 153-189, Sep.1970. 
[8] J. Fotheringham, “Dynamic storage allocation in the Atlas computer, including an automatic use of 
a backing store," Commun. ACM, vol. 4, no. 10, pp. 435-436, Oct. 1961. 
[9] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main memory system 
using phase-change memory technology,” in 36th International Symposium on Computer 
Architecture (ISCA’09), June 2009, Austin, TX, USA. 
[10] Karakostas, Vasileios, et al. "Performance analysis of the memory management unit under scale-
out workloads." 2014 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 
2014. 
[11] B. L. Jacob and T. N. Mudge, “A Look at Several Memory Management Units, TLB-refill 
Mechanisms, and Page Table Organizations,” ASPLOS, 1998, pp. 295–306. 
[12] A. S. Tanenbaum, “Modern Operating Systems,” 2nd ed. Prentice Hall Press, 2002. 
[13] “CPU microarchitecture”, https://www.cpu-world.com/CPUs/Core_i7/Intel-
Core%20i7%20Extreme%20Edition%20I7-965%20AT80601000918AA%20(BX80601965).html 
[14] “CPU microarchitecture”, https://developer.arm.com/documentation/100403/0200/functional-
description/memory-management-unit/tlb-organization/main-tlb 
[15] Mathur, Avantika, et al. "The new ext4 filesystem: current status and future plans." Proceedings of 
the Linux symposium. Vol. 2. 2007. 
[16] Love, Robert. Linux kernel development. Pearson Education, 2010. 
[17] Kwon, Youngjin, et al. "Coordinated and efficient huge page management with ingens." 12th 
USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016. 
[18] Talluri, Madhusudhan, and Mark D. Hill. "Surpassing the TLB performance of superpages with less 
operating system support." ACM SIGPLAN Notices 29.11 (1994): 171-182. 
[19] Pham, Binh, et al. "Large pages and lightweight memory management in virtualized environments: 
Can you have it both ways?" Proceedings of the 48th International Symposium on 
Microarchitecture. 2015. 
[20] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, “Operating Systems: Three Easy Pieces,” 0th 
ed. Arpaci-Dusseau Books, May 2015. 
[21] Guvenilir, Faruk, and Yale N. Patt. "Tailored page sizes." 2020 ACM/IEEE 47th Annual 
International Symposium on Computer Architecture (ISCA). IEEE, 2020. 
[22] Mathur, Avantika, et al. "The new ext4 filesystem: current status and future plans." Proceedings of 
the Linux symposium. Vol. 2. 2007. 
[23] M. E. Russinovich and D. A. Solomon, Microsoft Windows Internals, Fourth Edition: Microsoft 
Windows Server 2003, Windows XP, and Windows 2000, 4th ed. Microsoft Press, 2005. 
[24] “XFS Overview and Internals,” XFS team, 2006. [Online]. Available: http://linux-
xfs.sgi.com/projects/xfs/training/index.html 
[25] R. McDougall and J. Mauro, Solaris Internals: Solaris 10 and Open Solaris Kernel Architecture, 2nd 
ed. Prentice Hall, July 10, 2006. 
[26] S. Best, “JFS log: How the journaled file system performs logging,” in Proceedings of the 4th Annual 
Showcase & Conference (LINUX-00). Berkeley, CA: The USENIX Association, Oct. 10–14 2000, 
pp. 163– 168. 
[27] Cai, Miao, Chance C. Coats, and Jian Huang. "Hoop: efficient hardware-assisted out-of-place 
update for non-volatile memory." 2020 ACM/IEEE 47th Annual International Symposium on 
Computer Architecture (ISCA). IEEE, 2020. 
[28] Dong, Mingkai, et al. "Performance and protection in the ZoFS user-space NVM file 
system." Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019. 
[29] Xu, Jian, and Steven Swanson. "NOVA: A log-structured file system for hybrid volatile/non-volatile 
main memories." 14th USENIX Conference on File and Storage Technologies (FAST 16). 2016.  
[30] M. Wilcox. Add support for NV-DIMMs to ext4. https: //lwn.net/Articles/613384/. 
[31] Dulloor, Subramanya R., et al. "System software for persistent memory." Proceedings of the Ninth 
European Conference on Computer Systems. 2014. 
[32] Gardner, Philippa, Gian Ntzik, and Adam Wright. "Local reasoning for the POSIX file 
system." European Symposium on Programming Languages and Systems. Springer, Berlin, 
Heidelberg, 2014. 
[33] Cao, Mingming, Suparna Bhattacharya, and Ted Ts'o. "Ext4: The Next Generation of Ext2/3 
Filesystem." LSF. 2007. 
[34] McDougall, Richard, and Jim Mauro. "FileBench." (2005): 56. https://github.com/filebench/filebench  
[35] NORCOTT, WilliamD. "IOzone filesystem benchmark." http://www.iozone.org/  (2003). 
[36] Intel 64 and IA32 Architectures Performance Monitoring Events, Revision 1.0, December 2017 
[37] Weaver, Vincent M. "Linux perf_event features and overhead." The 2nd International Workshop on 
Performance Analysis of Workload Optimized Systems, FastPath. Vol. 13. 2013. 
[38] V. M. Weaver, D. Terpstra and S. Moore, "Non-determinism and overcount on modern hardware 
performance counter implementations," 2013 IEEE International Symposium on Performance 
Analysis of Systems and Software (ISPASS), Austin, TX, 2013, pp. 215-224 
[39] De Melo, Arnaldo Carvalho. "The new Linux perf tools." Slides from Linux Kongress. Vol. 18. 2010. 
[40] Weaver, Vincent M., et al. "Measuring energy and power with PAPI." 2012 41st International 
Conference on Parallel Processing Workshops. IEEE, 2012. 
[41] How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM), 
https://software.intel.com/content/www/us/en/develop/articles/how-to-emulate-persistent-memory-
on-an-intel-architecture-server.html 
[42] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. C. Lee, D. Burger, and D. Coetzee, “Better I/O 
through byte-addressable, persistent memory,” in Proceedings of the 22nd ACM Symposium on 
Operating Systems Principles (SOSP’09), Big Sky, Montana, Oct. 2009. 
[43] Tarasov, Vasily, et al. "Benchmarking File System Benchmarking: It *IS* Rocket Science." HotOS. 2011.

APPENDIX

%% Derived measurements
% 1. (%) Cycles spent in page walks (data 
% 2. Page walks per 1000 instr. (data) 
% 3. Average cycles per page walk (data) 
% 4. (%) Cycles spent in page walks (instructions)  
% 5. Page walks per 1000 instr. (instructions)  
% 6. Average cycles per page walk (instructions)  
% 7. L1 misses = L1D.REPLACEMENT 
% 8. L2 misses 
% 9. LLC misses  
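% How each derived measure maps to the raw events listed further below
% (read directly from the DB formulas in this script):
%   (1) = (dtlb_load_misses.walk_duration + dtlb_store_misses.walk_duration)
%         * 100 ./ cpu_clk_unhalted.thread_p
%   (2) = (dtlb_load_misses.walk_completed + dtlb_store_misses.walk_completed)
%         ./ inst_retired.any_p * 1000
%   (3) = total data walk duration ./ total data walks completed
%   (4)-(6) use the itlb_misses.* events analogously (note that (6) is
%         computed as walk_completed*1000 ./ walk_duration in this script)
%   (7) = l1d.replacement
%   (8) = sum of the four mem_load_uops_*llc_hit* events
%   (9) = mem_load_uops_llc_hit_retired.xsnp_miss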
  
measure_names = {
    '(%) Cycles spent in page walks (data)'
    'Page walks per 1000 instr. (data)'
    'Average cycles per page walk (data)'
    '(%) Cycles spent in page walks (instructions)'
    'Page walks per 1000 instr. (instructions)'
    'Average cycles per page walk (instructions)'
    'L1 misses = L1D.REPLACEMENT'
    'L2 misses'
    'LLC misses'
    };

%% Raw measurements
% 1. cpu_clk_unhalted.thread_p
% 2. l1d.replacement
% 3. mem_load_uops_retired.llc_hit
% 4. mem_load_uops_llc_hit_retired.xsnp_hit
% 5. mem_load_uops_llc_hit_retired.xsnp_hitm
% 6. mem_load_uops_llc_hit_retired.xsnp_miss
% 7. dTLB-loads
% 8. dTLB-stores
% 9. dTLB-load-misses
% 10. dTLB-store-misses
% 11. inst_retired.any_p
% 12. dtlb_load_misses.walk_completed
% 13. dtlb_load_misses.walk_duration
% 14. dtlb_store_misses.walk_completed
% 15. dtlb_store_misses.walk_duration
% 16. itlb_misses.walk_completed
% 17. itlb_misses.walk_duration

perf_event = {'cpu_clk_unhalted.thread_p' 
    'l1d.replacement' 
    'mem_load_uops_retired.llc_hit' 
    'mem_load_uops_llc_hit_retired.xsnp_hit' 
    'mem_load_uops_llc_hit_retired.xsnp_hitm' 
    'mem_load_uops_llc_hit_retired.xsnp_miss' 
    'dTLB-loads' 
    'dTLB-stores' 
    'dTLB-load-misses' 
    'dTLB-store-misses' 
    'inst_retired.any_p' 
    'dtlb_load_misses.walk_completed' 
    'dtlb_load_misses.walk_duration' 
    'dtlb_store_misses.walk_completed' 
    'dtlb_store_misses.walk_duration' 
    'itlb_misses.walk_completed' 
    'itlb_misses.walk_duration' 
    %'seconds'
    };

workloads1 = { 'EXT4'}; 
workloads2 = { 'EXT4DAX'}; 
workloads3 = { 'NOVA'}; 
%'EXT4' 'NOVA' 
    
%% Loop over all workloads 
% 6160 = 10 experiments * 4 workloads * 154 
% 154 = 11 file sizes * 14 unit sizes 
  
TotalFileSizes = 11; 
TotalUnitSizes = 14; 
TotalRepeats   = 10; 
  
  
%% Benchmark script  
% for 11 fileSize 
%    for 14 unitSize 
%        for 10 repeats 
%             Innermost loop: 'write', 'rewrite', 'read', 'reread'
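%             Each raw-data file interleaves the four operations in that
%             order, so rawdata(1:4:end) below selects 'write', (2:4:end)
%             'rewrite', (3:4:end) 'read', and (4:4:end) 'reread'.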
TotalWorkloads = length(workloads1); 
TotalEvents    = length(perf_event); 
TotalEquation  = 9; 
DataSize =6160; 
for w = 1:TotalWorkloads 
    workload1 =  workloads1{w}; 
    data = zeros(length(perf_event), DataSize);  
     
    workload2 =  workloads2{w}; 
    data = zeros(length(perf_event), DataSize);  
     
    workload3 =  workloads3{w}; 
    data = zeros(length(perf_event), DataSize);  
     
    for i = 1:TotalEvents 
        raw_data_file_name1 = sprintf('form_%s_%s.txt', perf_event{i}, 
workload1); 
        disp(raw_data_file_name1); 
        rawdata1 = [load(raw_data_file_name1)]'; 
  
        raw_data_file_name2 = sprintf('form_%s_%s.txt', perf_event{i}, 
workload2); 
        disp(raw_data_file_name2); 
        rawdata2 = [load(raw_data_file_name2)]'; 
         
        raw_data_file_name3 = sprintf('form_%s_%s.txt', perf_event{i}, 
workload3); 
        disp(raw_data_file_name3); 
        rawdata3 = [load(raw_data_file_name3)]'; 
         
%         if i =18: 
%             for j = 1:3:6160*3 
%                temp1(j,:) = mean()  
%                  
%             end 
%         end 
         
        [M1, N1] = convert2mat(rawdata1(1:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB1(w, i).write = N1; 
        [M1, N1] = convert2mat(rawdata1(2:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB1(w, i).rewrite = N1;  
        [M1, N1] = convert2mat(rawdata1(3:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB1(w, i).read = N1; 
        [M1, N1] = convert2mat(rawdata1(4:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB1(w, i).reread = N1; 
         
        [M2, N2] = convert2mat(rawdata2(1:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB2(w, i).write = N2; 
        [M2, N2] = convert2mat(rawdata2(2:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB2(w, i).rewrite = N2;  
        [M2, N2] = convert2mat(rawdata2(3:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB2(w, i).read = N2; 
        [M2, N2] = convert2mat(rawdata2(4:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB2(w, i).reread = N2; 
         
        [M3, N3] = convert2mat(rawdata3(1:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB3(w, i).write = N3; 
        [M3, N3] = convert2mat(rawdata3(2:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB3(w, i).rewrite = N3;  
        [M3, N3] = convert2mat(rawdata3(3:4:end), TotalFileSizes, 
TotalUnitSizes, TotalRepeats); 
        DB3(w, i).read = N3;
        [M3, N3] = convert2mat(rawdata3(4:4:end), TotalFileSizes, TotalUnitSizes, TotalRepeats);
        DB3(w, i).reread = N3;
    end 
        %1.(%)Cycles spent in page walks (data 
        DB1(w,TotalEvents+1).write     = (DB1(w,13).write + DB1(w,15).write)* 
100 ./ DB1(w,1).write; 
        DB1(w,TotalEvents+1).rewrite   = (DB1(w,13).rewrite + 
DB1(w,15).rewrite)* 100 ./ DB1(w,1).rewrite; 
        DB1(w,TotalEvents+1).read      = (DB1(w,13).read + DB1(w,15).read)* 
100 ./ DB1(w,1).read; 
        DB1(w,TotalEvents+1).reread    = (DB1(w,13).reread + 
DB1(w,15).reread)* 100 ./ DB1(w,1).reread; 
  
        %2.Page walks per 1000 instr. (data) 
        DB1(w,TotalEvents+2).write     = (DB1(w,12).write   + DB1(w,14).write)   ./ DB1(w,11).write*1000;
        DB1(w,TotalEvents+2).rewrite   = (DB1(w,12).rewrite + DB1(w,14).rewrite) ./ DB1(w,11).rewrite*1000;
        DB1(w,TotalEvents+2).read      = (DB1(w,12).read    + DB1(w,14).read)    ./ DB1(w,11).read*1000;
        DB1(w,TotalEvents+2).reread    = (DB1(w,12).reread  + DB1(w,14).reread)  ./ DB1(w,11).reread*1000;
         
        %3.Average cycles per page walk (data) 
        DB1(w,TotalEvents+3).write     = (DB1(w,13).write + 
DB1(w,15).write) ./ (DB1(w,12).write+DB1(w,14).write); 
        DB1(w,TotalEvents+3).rewrite   = (DB1(w,13).rewrite + 
DB1(w,15).rewrite) ./ (DB1(w,12).rewrite+DB1(w,14).rewrite); 
        DB1(w,TotalEvents+3).read      = (DB1(w,13).read + DB1(w,15).read) ./ 
(DB1(w,12).read+DB1(w,14).read); 
        DB1(w,TotalEvents+3).reread    = (DB1(w,13).reread + DB1(w,15).reread) ./ (DB1(w,12).reread+DB1(w,14).reread);
     
        %4.(%) Cycles spent in page walks (instructions)  
        DB1(w,TotalEvents+4).write     = DB1(w,17).write * 100 ./ 
DB1(w,1).write; 
        DB1(w,TotalEvents+4).rewrite   = DB1(w,17).rewrite * 100 ./ 
DB1(w,1).rewrite; 
        DB1(w,TotalEvents+4).read      = DB1(w,17).read * 100 ./ 
DB1(w,1).read; 
        DB1(w,TotalEvents+4).reread    = DB1(w,17).reread * 100 ./ 
DB1(w,1).reread; 
         
        %5.Page walks per 1000 instr. (instructions)  
        DB1(w,TotalEvents+5).write     = DB1(w,16).write*1000 ./ 
DB1(w,11).write; 
        DB1(w,TotalEvents+5).rewrite   = DB1(w,16).rewrite*1000 ./ 
DB1(w,11).rewrite; 
        DB1(w,TotalEvents+5).read      = DB1(w,16).read*1000 ./ 
DB1(w,11).read; 
        DB1(w,TotalEvents+5).reread    = DB1(w,16).reread*1000 ./ 
DB1(w,11).reread; 
         
        %6.Average cycles per page walk (instructions)
        DB1(w,TotalEvents+6).write     = DB1(w,16).write*1000 ./ DB1(w,17).write;
        DB1(w,TotalEvents+6).rewrite   = DB1(w,16).rewrite*1000 ./ 
DB1(w,17).rewrite; 
        DB1(w,TotalEvents+6).read      = DB1(w,16).read*1000 ./ 
DB1(w,17).read; 
        DB1(w,TotalEvents+6).reread    = DB1(w,16).reread*1000 ./ 
DB1(w,17).reread; 
         
        %7.L1 misses = L1D.REPLACEMENT 
        DB1(w,TotalEvents+7).write     = DB1(w,2).write; 
        DB1(w,TotalEvents+7).rewrite   = DB1(w,2).rewrite; 
        DB1(w,TotalEvents+7).read      = DB1(w,2).read; 
        DB1(w,TotalEvents+7).reread    = DB1(w,2).reread; 
         
        %8.L2 misses 
        DB1(w,TotalEvents+8).write     = DB1(w,3).write + DB1(w,4).write + 
DB1(w,5).write + DB1(w,6).write; 
        DB1(w,TotalEvents+8).rewrite   = DB1(w,3).rewrite + DB1(w,4).rewrite 
+ DB1(w,5).rewrite + DB1(w,6).rewrite; 
        DB1(w,TotalEvents+8).read      = DB1(w,3).read + DB1(w,4).read + 
DB1(w,5).read + DB1(w,6).read; 
        DB1(w,TotalEvents+8).reread    = DB1(w,3).reread + DB1(w,4).reread + 
DB1(w,5).reread + DB1(w,6).reread;    
         
        %9.LLC misses  
        DB1(w,TotalEvents+9).write     = DB1(w,6).write; 
        DB1(w,TotalEvents+9).rewrite   = DB1(w,6).rewrite; 
        DB1(w,TotalEvents+9).read      = DB1(w,6).read; 
        DB1(w,TotalEvents+9).reread    = DB1(w,6).reread;        
         
        %DB2 
                %1.Cycles spent in page walks (data 
        DB2(w,TotalEvents+1).write     = (DB2(w,13).write + DB2(w,15).write)* 
100 ./ DB2(w,1).write; 
        DB2(w,TotalEvents+1).rewrite   = (DB2(w,13).rewrite + 
DB2(w,15).rewrite)* 100 ./ DB2(w,1).rewrite; 
        DB2(w,TotalEvents+1).read      = (DB2(w,13).read + DB2(w,15).read)* 
100 ./ DB2(w,1).read; 
        DB2(w,TotalEvents+1).reread    = (DB2(w,13).reread + 
DB2(w,15).reread)* 100 ./ DB2(w,1).reread; 
  
        %2.Page walks per 1000 instr. (data) 
        DB2(w,TotalEvents+2).write     = (DB2(w,12).write   + DB2(w,14).write)   ./ DB2(w,11).write*1000;
        DB2(w,TotalEvents+2).rewrite   = (DB2(w,12).rewrite + DB2(w,14).rewrite) ./ DB2(w,11).rewrite*1000;
        DB2(w,TotalEvents+2).read      = (DB2(w,12).read    + DB2(w,14).read)    ./ DB2(w,11).read*1000;
        DB2(w,TotalEvents+2).reread    = (DB2(w,12).reread  + DB2(w,14).reread)  ./ DB2(w,11).reread*1000;
         
        %3.Average cycles per page walk (data) 
        DB2(w,TotalEvents+3).write     = (DB2(w,13).write + 
DB2(w,15).write) ./ (DB2(w,12).write+DB2(w,14).write); 
        DB2(w,TotalEvents+3).rewrite   = (DB2(w,13).rewrite + 
DB2(w,15).rewrite) ./ (DB2(w,12).rewrite+DB2(w,14).rewrite); 
        DB2(w,TotalEvents+3).read      = (DB2(w,13).read + DB2(w,15).read) ./ 
(DB2(w,12).read+DB2(w,14).read); 
        DB2(w,TotalEvents+3).reread    = (DB2(w,13).reread + DB2(w,15).reread) ./ (DB2(w,12).reread+DB2(w,14).reread);
     
        %4.(%) Cycles spent in page walks (instructions)  
        DB2(w,TotalEvents+4).write     = DB2(w,17).write * 100 ./ 
DB2(w,1).write; 
        DB2(w,TotalEvents+4).rewrite   = DB2(w,17).rewrite * 100 ./ 
DB2(w,1).rewrite; 
        DB2(w,TotalEvents+4).read      = DB2(w,17).read * 100 ./ 
DB2(w,1).read; 
        DB2(w,TotalEvents+4).reread    = DB2(w,17).reread * 100 ./ 
DB2(w,1).reread; 
         
        %5.Page walks per 1000 instr. (instructions)  
        DB2(w,TotalEvents+5).write     = DB2(w,16).write*1000 ./ 
DB2(w,11).write; 
        DB2(w,TotalEvents+5).rewrite   = DB2(w,16).rewrite*1000 ./ 
DB2(w,11).rewrite; 
        DB2(w,TotalEvents+5).read      = DB2(w,16).read*1000 ./ 
DB2(w,11).read; 
        DB2(w,TotalEvents+5).reread    = DB2(w,16).reread*1000 ./ 
DB2(w,11).reread; 
         
        %6.Average cycles per page walk (instructions)  
        DB2(w,TotalEvents+6).write     = DB2(w,16).write*1000 ./ 
DB2(w,17).write; 
        DB2(w,TotalEvents+6).rewrite   = DB2(w,16).rewrite*1000 ./ 
DB2(w,17).rewrite; 
        DB2(w,TotalEvents+6).read      = DB2(w,16).read*1000 ./ 
DB2(w,17).read; 
        DB2(w,TotalEvents+6).reread    = DB2(w,16).reread*1000 ./ 
DB2(w,17).reread; 
         
        %7.L1 misses = L1D.REPLACEMENT 
        DB2(w,TotalEvents+7).write     = DB2(w,2).write; 
        DB2(w,TotalEvents+7).rewrite   = DB2(w,2).rewrite; 
        DB2(w,TotalEvents+7).read      = DB2(w,2).read; 
        DB2(w,TotalEvents+7).reread    = DB2(w,2).reread; 
         
        %8.L2 misses 
        DB2(w,TotalEvents+8).write     = DB2(w,3).write + DB2(w,4).write + 
DB2(w,5).write + DB2(w,6).write; 
        DB2(w,TotalEvents+8).rewrite   = DB2(w,3).rewrite + DB2(w,4).rewrite 
+ DB2(w,5).rewrite + DB2(w,6).rewrite; 
        DB2(w,TotalEvents+8).read      = DB2(w,3).read + DB2(w,4).read + 
DB2(w,5).read + DB2(w,6).read; 
        DB2(w,TotalEvents+8).reread    = DB2(w,3).reread + DB2(w,4).reread + 
DB2(w,5).reread + DB2(w,6).reread;    
         
        %9.LLC misses  
        DB2(w,TotalEvents+9).write     = DB2(w,6).write; 
        DB2(w,TotalEvents+9).rewrite   = DB2(w,6).rewrite; 
        DB2(w,TotalEvents+9).read      = DB2(w,6).read;
        DB2(w,TotalEvents+9).reread    = DB2(w,6).reread;
         
        %DB3 
                %1.Cycles spent in page walks (data 
        DB3(w,TotalEvents+1).write     = (DB3(w,13).write + DB3(w,15).write)* 
100 ./ DB3(w,1).write; 
        DB3(w,TotalEvents+1).rewrite   = (DB3(w,13).rewrite + 
DB3(w,15).rewrite)* 100 ./ DB3(w,1).rewrite; 
        DB3(w,TotalEvents+1).read      = (DB3(w,13).read + DB3(w,15).read)* 
100 ./ DB3(w,1).read; 
        DB3(w,TotalEvents+1).reread    = (DB3(w,13).reread + 
DB3(w,15).reread)* 100 ./ DB3(w,1).reread; 
  
        %2.Page walks per 1000 instr. (data) 
        DB3(w,TotalEvents+2).write     = (DB3(w,12).write   + DB3(w,14).write)   ./ DB3(w,11).write*1000;
        DB3(w,TotalEvents+2).rewrite   = (DB3(w,12).rewrite + DB3(w,14).rewrite) ./ DB3(w,11).rewrite*1000;
        DB3(w,TotalEvents+2).read      = (DB3(w,12).read    + DB3(w,14).read)    ./ DB3(w,11).read*1000;
        DB3(w,TotalEvents+2).reread    = (DB3(w,12).reread  + DB3(w,14).reread)  ./ DB3(w,11).reread*1000;
         
        %3.Average cycles per page walk (data) 
        DB3(w,TotalEvents+3).write     = (DB3(w,13).write + 
DB3(w,15).write) ./ (DB3(w,12).write+DB3(w,14).write); 
        DB3(w,TotalEvents+3).rewrite   = (DB3(w,13).rewrite + 
DB3(w,15).rewrite) ./ (DB3(w,12).rewrite+DB3(w,14).rewrite); 
        DB3(w,TotalEvents+3).read      = (DB3(w,13).read + DB3(w,15).read) ./ 
(DB3(w,12).read+DB3(w,14).read); 
        DB3(w,TotalEvents+3).reread    = (DB3(w,13).reread + DB3(w,15).reread) ./ (DB3(w,12).reread+DB3(w,14).reread);
     
        %4.(%) Cycles spent in page walks (instructions)  
        DB3(w,TotalEvents+4).write     = DB3(w,17).write  * 100 ./ 
DB3(w,1).write; 
        DB3(w,TotalEvents+4).rewrite   = DB3(w,17).rewrite  * 100 ./ 
DB3(w,1).rewrite; 
        DB3(w,TotalEvents+4).read      = DB3(w,17).read  * 100 ./ 
DB3(w,1).read; 
        DB3(w,TotalEvents+4).reread    = DB3(w,17).reread  * 100 ./ 
DB3(w,1).reread; 
         
        %5.Page walks per 1000 instr. (instructions)  
        DB3(w,TotalEvents+5).write     = DB3(w,16).write*1000 ./ 
DB3(w,11).write; 
        DB3(w,TotalEvents+5).rewrite   = DB3(w,16).rewrite*1000 ./ 
DB3(w,11).rewrite; 
        DB3(w,TotalEvents+5).read      = DB3(w,16).read*1000 ./ 
DB3(w,11).read; 
        DB3(w,TotalEvents+5).reread    = DB3(w,16).reread*1000 ./ 
DB3(w,11).reread; 
         
        %6.Average cycles per page walk (instructions)  
        DB3(w,TotalEvents+6).write     = DB3(w,16).write*1000 ./ 
DB3(w,17).write; 
        DB3(w,TotalEvents+6).rewrite   = DB3(w,16).rewrite*1000 ./ 
DB3(w,17).rewrite; 
        DB3(w,TotalEvents+6).read      = DB3(w,16).read*1000 ./ 
DB3(w,17).read; 
        DB3(w,TotalEvents+6).reread    = DB3(w,16).reread*1000 ./ 
DB3(w,17).reread; 
         
        %7.L1 misses = L1D.REPLACEMENT 
        DB3(w,TotalEvents+7).write     = DB3(w,2).write; 
        DB3(w,TotalEvents+7).rewrite   = DB3(w,2).rewrite; 
        DB3(w,TotalEvents+7).read      = DB3(w,2).read; 
        DB3(w,TotalEvents+7).reread    = DB3(w,2).reread; 
         
        %8.L2 misses 
        DB3(w,TotalEvents+8).write     = DB3(w,3).write + DB3(w,4).write + 
DB3(w,5).write + DB3(w,6).write; 
        DB3(w,TotalEvents+8).rewrite   = DB3(w,3).rewrite + DB3(w,4).rewrite 
+ DB3(w,5).rewrite + DB3(w,6).rewrite; 
        DB3(w,TotalEvents+8).read      = DB3(w,3).read + DB3(w,4).read + 
DB3(w,5).read + DB3(w,6).read; 
        DB3(w,TotalEvents+8).reread    = DB3(w,3).reread + DB3(w,4).reread + 
DB3(w,5).reread + DB3(w,6).reread;    
         
        %9.LLC misses  
        DB3(w,TotalEvents+9).write     = DB3(w,6).write; 
        DB3(w,TotalEvents+9).rewrite   = DB3(w,6).rewrite; 
        DB3(w,TotalEvents+9).read      = DB3(w,6).read; 
        DB3(w,TotalEvents+9).reread    = DB3(w,6).reread;    
end 
  
%% X axis: write unit size 
for w = 1:TotalWorkloads 
    for m = 1:9 
                
        %File_size = [1,2,4,8,16,32,64,128,256,512,1024]    11 types 
        %Unit_size = [1,2,4,8,16,32,48,64,128,256,512,1024,2048,4096]  14 
types 
        x = sort(2.^[0:13]);     
              
        h = figure(m); 
        set(h, 'units','normalized','outerposition',[0 0 1 1]); 
        %figure_title = sprintf('%s, total = 10 GB, %s', workload(w), 
perf_event{m}); 
        figure_title = sprintf('%s, total = 10 GB, %s', workloads1{w}, measure_names{m});
        disp(figure_title); 
        set(h, 'Name', figure_title); 
        title(figure_title) 
        j = [1, 4, 7, 11]; 
        for k = 1:4 
            subplot(2, 2, k); 
            loglog(x, DB1(w, m+17).write(j(k), :), '*-', ... 
                   x, DB2(w, m+17).write(j(k), :), 's-', ...
                   x, DB3(w, m+17).write(j(k), :), 'o-', 'LineWidth', 2, 'MarkerSize', 12);
  
            %legend('write', 'rewrite', 'read', 'reread'); 
            h = legend('Ext4', 'Ext4-DAX', 'NOVA', 'Location', 'Northwest');
            set(h,'FontSize',10); 
  
            a = get(gca,'XTickLabel'); 
            set(gca,'XTickLabel',a,'fontsize',13) 
     
            b = get(gca,'YTickLabel'); 
            set(gca,'YTickLabel',b,'fontsize',13) 
  
             
            xlabel('Unit size (KB)','FontSize', 16) 
            ylabel(measure_names{m},'FontSize', 16) 
            title(sprintf('File size = %d MB', 2^(j(k)-1))) 
            %xlim([0, 12]) 
            %ticks = [0:2:12];  
%             xticks(ticks) 
%             xticklabels(2.^ticks) 
            grid on         
          
        end 
        fname = sprintf('%s000%d.png',workloads1{1},m); 
             
        print(fname,'-dpng')   
%         plotFileName = sprintf('./plots_1/%s_%s.png', measure_names{m}, 
workload); 
%         print(plotFileName, '-dpng', '-r600');    
%         plotFileName = sprintf('./plots_2/%s_%s.png', workload, 
measure_names{m}); 
%         print(plotFileName, '-dpng', '-r600');   
    end
end

BIOGRAPHY OF THE AUTHOR 
 
Xiang Guo was born and raised in Shanghai, China. He attended Shanghai Dianji University in China and received a Bachelor of Science degree in Communication Engineering in 2014. He began his graduate study in the Department of Electrical and Computer Engineering at the University of Maine in 2015, where he has worked as a graduate research assistant in the M.S. program in Computer Engineering. He is a candidate for the Master of Science degree in Computer Engineering from the University of Maine in December 2020.
