Exploiting solid state drive parallelism for real-time flash storage by Missimer, Katherine
Boston University
OpenBU http://open.bu.edu
Theses & Dissertations Boston University Theses & Dissertations
2020





GRADUATE SCHOOL OF ARTS AND SCIENCES
Dissertation




B.A., Boston University, 2012
M.A., Boston University, 2012
Submitted in partial fulfillment of the









Professor of Computer Science
Second Reader
Manos Athanassoulis, PhD
Assistant Professor of Computer Science
Third Reader
George Kollios, PhD
Professor of Computer Science
Fourth Reader
Peter Desnoyers, PhD
Associate Professor of Computer Science
Acknowledgments
I would like to thank my advisor Rich West for inspiring me and supporting me through the
years. This work would not have been possible without his guidance and invaluable advice.
I would also like to thank my committee members, Manos Athanassoulis, George Kollios,
Abraham Matta and Peter Desnoyers for providing invaluable feedback and guidance.
To my family, thank you for supporting me every step of the way. To Eric Missimer,
thank you for being the most wonderful husband, patient co-parent, and my best friend. To
my parents and sister, thank you for all the love, support and encouragement. To my son,
Luke, for being the most sweet-natured two-year-old. To my parents-in-law, thank you for
all the help and support these past few months that made completing this work possible.
To Dash, Bitsy and Holly, thank you for reminding us every day of the joy and love in life.
Your enthusiasm is contagious and shines through every day.
Last, but not least, I would like to thank my colleagues and friends in the BOSS research
group who have been on this journey with me through the years – Ye Li, Tom
Cheng, Craig Einstein, Soham Sinha, Sasan Golchin and Anam Farrukh. Thank you for
making my experience at BU a most enjoyable one.
This work is supported in part by the National Science Foundation under Grant No.
1527050. Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the National
Science Foundation.
EXPLOITING SOLID STATE DRIVE PARALLELISM FOR
REAL-TIME FLASH STORAGE
KATHERINE MISSIMER
Boston University, Graduate School of Arts and Sciences, 2020
Major Professor: Richard West, Professor of Computer Science
ABSTRACT
The increased volume of sensor data generated by emerging applications in areas such
as autonomous vehicles requires new technologies for storage and retrieval. NAND flash
memory has desirable characteristics for real-time information storage and retrieval, such
as non-volatility, shock resistance, low power consumption and fast access time. However,
NAND flash memory management suffers high tail latency during storage space
reclamation. This is unacceptable in a real-time system, where missed deadlines can have
potentially catastrophic consequences. Current methods to ensure timing guarantees in
flash storage do not explicitly exploit the internal parallelism in Solid State Drives (SSDs).
Modern SSDs are able to support massive amounts of parallelism, as evidenced by the shift
from the Advanced Host Controller Interface (AHCI) to the Non-Volatile Memory Host
Controller Interface (NVMe), a multi-queue interface. This thesis focuses on providing
predictable, low-latency guarantees for read and write requests in NAND flash memory
by exploiting the internal parallelism in SSDs. The first part of the thesis presents a
partitioned flash design that dynamically assigns each parallel flash unit to perform either
reads or writes. To access data from a flash unit that is busy servicing a write request or
performing garbage collection, the device rebuilds the data using encoding. Consequently,
reads are never blocked by writes or storage space reclamation. In this design, however, low
read latency is achieved at the expense of write throughput. The second part of the thesis
explores how to predictably improve performance by minimizing the garbage collection
cost in flash storage. The root cause of this extra cost is the SSD’s inability to accurately
determine data lifetime and group together data that expires before space needs to
be reclaimed. This is exacerbated by the narrow block I/O interface, which prevents
optimizations from either the device or the application above. By sharing application-specific
knowledge of data lifetime with the device, the SSD is able to efficiently lay out data such




1.1 Problem Statement
1.2 Research Contributions
1.3 Thesis Statement
1.4 Thesis Organization
2 Background and Related Work
2.1 NAND Flash Memory
2.1.1 Internal Parallelism
2.1.2 Flash Translation Layer
2.2 Drive-managed vs. Host-managed Designs
2.3 Performance Guarantees
2.4 Reducing Write Amplification Factor
2.5 OpenSSD Cosmos Board
3 Partitioned Real-Time FTL
3.1 Real-Time Task Model
3.2 Admission Control
3.3 Evaluation
3.3.1 Simulation Experiments
3.3.2 Hardware Experiments
4 Telomere
4.1 Design
4.2 Real-Time Task Model
4.3 Storage Admission Control
4.3.1 Single Task Placement (SiP)
4.3.2 Shared Task Placement (SharP)
4.4 Throughput Admission Control
4.5 Evaluation
4.5.1 Admission Control Simulations
4.5.2 Event-Driven Simulator
4.5.3 Hardware Experiments
5 Infinite Streams
5.1 Task Model
5.2 Bank Reservation
5.2.1 Bank Reservation Policy
5.2.2 Infinite Streams Bank Reservation
5.3 Evaluation
6 Conclusion





1.1 Data Allocation Schemes.
3.1 Latency in milliseconds for flash operations.
3.2 PaRT-FTL symbol definitions.
3.3 Experimental setup.
3.4 Throughput and latency parameters.
4.1 Telomere symbol definitions.
4.2 WAF for MultiLog-Oracle at different over-provisioning levels [SA13].
4.3 Pletka’s block characterization.
4.4 Task set running on the OpenSSD Cosmos board.
5.1 Average WAF and standard deviation for Infinite Streams over different
over-provisioning levels and different percentages of data retained beyond
its lifetime.
5.2 Average WAF and standard deviation for MultiLog heat tracking over different
over-provisioning levels and different percentages of data retained
beyond its lifetime.
List of Figures
1.1 Classification of real-time FTLs with predictable performance.
1.2 Classification of FTLs with no worst-case performance guarantees.
2.1 SSD internal architecture.
2.2 Address mapping in page-level and block-level FTLs.
2.3 Page-level mapping and block reclamation.
2.4 Partial garbage collection splits traditional garbage collection into steps.
2.5 Drive-managed design.
2.6 Host-managed design.
2.7 Cosmos OpenSSD board [SJLK14].
2.8 OpenSSD Cosmos board setup.
3.1 Flash chip layout.
3.2 Flash chips rotation.
3.3 Corresponding GC task period for a write task that writes more than a
block per write period.
3.4 Corresponding GC task period for a write task that writes less than a block
per write period.
3.5 Admission Control Simulation.
3.6 Weighted Schedulability vs Read Request Size.
3.7 Weighted Schedulability vs Write Request Size.
3.8 PaRT-FTL write throughput.
3.9 PaRT-FTL read throughput.
3.10 PaRT-FTL write request latency.
3.11 PaRT-FTL read request latency.
3.12 WAO-GC write throughput.
3.13 WAO-GC read throughput.
3.14 WAO-GC write request latency.
3.15 WAO-GC read request latency.
3.16 Response times of PaRT-FTL and WAO-GC pagemap FTL.
3.17 Flash page latencies of PaRT-FTL and WAO-GC pagemap FTL.
3.18 The effects of garbage collection on write bandwidth with WAO-GC.
4.1 Storage vs. throughput utilization.
4.2 Telomere design.
4.3 Telomere Single Task Placement.
4.4 Telomere Shared Task Placement.
4.5 Storage and throughput utilization with different read and write granularity.
4.6 Storage and throughput utilization with varying number of tasks.
4.7 Storage and throughput admission control with α = 0.10.
4.8 Storage and throughput admission control with α = 0.23.
4.9 Storage and throughput admission control with α = 0.75.
4.10 Graph shows the 25th, 50th, 75th percentiles, the minimum, maximum
and mean (dot) number of block erasures as the variance in lifetimes and
periods increases.
4.11 Graph shows the 25th, 50th, 75th percentiles, the minimum, maximum
and mean (dot) number of GC page copies as the variance in lifetimes and
periods increases.
4.12 Cumulative Distribution Function of the measured RBER at the end of a
simulation run for a task set at over-provisioning λ = 0.10.
4.13 Endurance gain of different methods over pagemap with no wear-leveling
(Pagemap noWL) at different amounts of over-provisioning.
4.14 Write throughput for task set in Table 4.4 for Telomere SiP and SharP,
WAO-GC and Pagemap FTL.
4.15 Read throughput for task set in Table 4.4 for Telomere SiP and SharP,
WAO-GC and Pagemap FTL.
5.1 Pathological case in bank reservation where write throughput cannot be
guaranteed.
5.2 Example of how increasing the granularity allows writes to occur in setg.
5.3 WAF for Infinite Streams over different over-provisioning levels and different
percentages of data retained beyond its lifetime.
5.4 WAF for MultiLog heat tracking over different over-provisioning levels
and different percentages of data retained beyond its lifetime.
5.5 Infinite Streams throughput utilization.




In the past decade, there has been an explosion of data, and it continues to increase
rapidly [PwC15]. In 2019, the number of active Internet of Things (IoT) devices reached
26.66 billion, and an estimated 127 new IoT devices are connected to the web every
second [Maa20]. Data-intensive real-time systems such as autonomous vehicles, cognitive
assistance and mobile health generate massive volumes of sensor data that need to be
stored, retrieved and processed with timing guarantees. For example, Google’s self-driving
car was reported to generate on the order of 1 GB/s of data from its various onboard
sensors in 2013 [Ang13]. Failure to guarantee read and write latency for the real-time
processing of sensor data can result in catastrophic consequences. Autonomous vehicles, from
driverless cars to unmanned aerial vehicles (UAVs), use sensor data to perform collision
avoidance, path planning, object detection, 3D scene reconstruction, simultaneous
localization and mapping (SLAM), and other mission tasks. As technology advances, we can
expect the quantity of sensor data produced per unit time to be even greater, requiring local
as well as cloud storage. The sheer volume of sensor data dictates the need for real-time
information storage and retrieval in order to accomplish machine learning and mission
objectives.
Flash memory has become ubiquitous in the computing world, from embedded systems
to data centers [PwC15, Net16, Cou16, Bra18]. NAND flash memory has desirable
characteristics for real-time information storage and retrieval, such as non-volatility, shock
resistance and low power consumption [Mic17, LO18, San18]. In the past decade, NAND
flash memory has filled a gap between DRAM and hard disks in capacity, latency and cost.
Unfortunately, due to the nature of NAND flash storage devices, providing predictable,
low-latency guarantees is a non-trivial task. Traditional SSDs designed for general or high
performance computing suffer from unpredictability such as high tail-latency and are
unable to meet real-time deadlines [LSZ13]. A large body of literature exists that provides
throughput guarantees [SAW+14, CDH+15, YLH+17, HC18]. However, related work that
provides the strict timing guarantees of a real-time system is scarce despite the increasing
demand for real-time sensor data storage. NAND flash memory also has a limited
lifetime [Moh10]. When a flash block wears out, it is no longer usable. Thus, strategic
placement of data in flash is important in order to reduce garbage collection overhead
and increase device lifetime. Tesla recently announced failures of the eMMC flash memory
card in their vehicles, caused by logging that wears out the flash memory too
quickly [Ruf19]. Inefficient data storage in flash memory leads to poor performance, long
latency and short device lifetime [LSZ13, KHMC14, YPG+14, KLN15, HSKH+16].
Figures 1.1 and 1.2 outline the configuration space for flash memory management designs.
Ideally, a design would have high device utilization, throughput and parallelism,
and low latency and garbage collection (GC) overhead. However, many design decisions
require trade-offs such as device utilization vs. throughput, internal parallelism vs.
GC, latency vs. throughput, etc. Previous real-time NAND flash memory designs focused
on providing predictable performance on a single flash chip. As NAND flash memory
technology advances, SSDs are able to support massive amounts of internal parallelism,
including multiple flash chips that are able to independently execute commands [CLZ11b].
In this thesis, we present flash translation layer designs – PaRT-FTL, Telomere and Infinite
Streams – that exploit the internal parallelism found in SSDs. As Figure 1.2 shows, there
exists a myriad of FTL designs in the non-real-time community for decreasing latency,
providing throughput guarantees and decreasing garbage collection overhead. These designs,
however, do not provide real-time guarantees.
Figure 1.1: Classification of real-time FTLs with predictable performance.
Figure 1.2: Classification of FTLs with no worst-case performance guarantees.
1.1 Problem Statement
NAND flash memory differs from mechanical hard drives in several significant ways.
NAND flash memory is more suitable in moving vehicles because it has no moving parts –
everything is electronic instead of mechanical. Because of this, random reads and writes in
NAND flash do not suffer from rotational delays. However, writes in NAND flash do suffer
long tail-latency under certain garbage collection policies due to NAND flash memory’s
erase-then-write property. Unlike hard disks, where data is updated in place, NAND
flash cells need to be erased before the cells can be written to again. A flash erase unit
(erase block) is a multiple of a flash read or write unit (flash page), resulting in garbage
collection overhead. Each NAND flash cell also endures only a limited number of erasures,
since the number of error bits increases after each write-erase cycle. NAND flash is able to
correct a certain number of error bits with error correction code. However, when the number
of error bits exceeds the maximum number of bits that can be corrected, a flash block, which
consists of multiple flash cells, can no longer be used. To maximize the device lifetime, data
should be stored in NAND flash memory so that flash blocks wear out evenly.
To address these challenges in NAND flash memory, NAND flash manufacturers introduced
a flash translation layer (FTL) in the flash controller of the SSD. The FTL allows
flash memory to appear as a block device by providing a logical to physical address
mapping, thereby permitting file systems to interact with an SSD transparently, similar
to how a file system would interact with a mechanical hard drive. The issue with this
extra layer of indirection is that the design of the FTL, specifically its garbage collection
and wear-leveling algorithms, directly impacts the latency experienced by a read or write
request. A general-purpose FTL that tries to limit its garbage collection blocking time
suffers from high tail-latency in the worst case and would not be suitable for a real-time
system. In addition, over recent years, a large body of literature has pointed out many
issues that arise from supporting a block I/O interface with flash memory in order to
maintain backward compatibility. These issues include log-on-log [LSZ13, YPG+14], large
tail-latency [HSKH+16], unpredictable I/O latency [CLZ11a, KHMC14, KLN15], and
resource under-utilization [APW+08, CLZ11b].
Another challenge in NAND flash memory is the trade-off between garbage collection
overhead and the internal parallelism in the flash device. In the ideal case, the garbage
collection overhead would be low while the internal parallelism in the SSD is fully utilized.
As Table 1.1 shows, traditional FTL approaches of page striping or block striping the data
will either exploit parallelism in the SSD but suffer high garbage collection overhead (page
stripe) or have low garbage collection overhead but be unable to take advantage of the SSD
parallelism (block stripe). Such a trade-off exists because in order to exploit the parallelism
in flash, a request should be striped across parallel units. However, this striping stores a
request across different flash blocks, which exacerbates garbage collection. For example,
if a request is the size of a flash block, storing the request in a single block would incur
low garbage collection when an update occurs. However, reading and writing that request
cannot be performed in parallel since it is to a single flash block in a flash chip.
Table 1.1: Data Allocation Schemes.
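The striping trade-off can be illustrated with a toy placement model. The sketch below is not taken from the thesis: the geometry (four chips, four pages per block), the log-style append within each chip and all function names are assumptions chosen for illustration, and every request is taken to be exactly one block in size.

```python
# Illustrative model only: the geometry and all names are assumptions.
NUM_CHIPS = 4          # parallel flash chips
PAGES_PER_BLOCK = 4    # pages per erase block; each request is one block in size


def place(num_requests, policy):
    """Map each request to the set of (chip, block) pairs holding its pages.

    policy == "page":  consecutive pages of a request go to consecutive chips.
    policy == "block": all pages of a request go to a single chip.
    Every chip is written log-style, so pages fill its blocks in order.
    """
    next_page = [0] * NUM_CHIPS  # append pointer per chip
    placement = {}
    for r in range(num_requests):
        blocks = set()
        for i in range(PAGES_PER_BLOCK):
            chip = i % NUM_CHIPS if policy == "page" else r % NUM_CHIPS
            blocks.add((chip, next_page[chip] // PAGES_PER_BLOCK))
            next_page[chip] += 1
        placement[r] = blocks
    return placement


def chips_used(placement, r):
    """Number of chips that can serve request r in parallel."""
    return len({chip for chip, _ in placement[r]})


def requests_sharing_blocks(placement, r):
    """Other requests whose pages sit in the same blocks as request r.

    Their still-valid pages must be copied before those blocks can be
    erased after r is updated, which is the garbage collection overhead.
    """
    return sum(1 for other, blocks in placement.items()
               if other != r and blocks & placement[r])


for policy in ("page", "block"):
    p = place(4, policy)
    print(policy, chips_used(p, 0), requests_sharing_blocks(p, 0))
```

In this toy model, page striping lets request 0 use all four chips but leaves its blocks shared with three other requests, while block striping confines it to one chip and shares its block with none.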
1.2 Research Contributions
The designs of our FTLs are motivated by the physical properties and the internal
parallelism in NAND flash memory and by real-time applications such as the storage and
retrieval of sensor data.
The Partitioned Real-Time FTL (PaRT-FTL) addresses the problem of guaranteeing
performance and predictability of NAND flash memory in a real-time storage system.
PaRT-FTL splits a set of flash chips into separate read and write sets. This ensures reads
and writes to separate chips proceed in parallel. To access data from a flash unit that is busy
servicing a write request or performing garbage collection, PaRT-FTL rebuilds the data
using encoding. Consequently, reads are never blocked by writes or storage space
reclamation. PaRT-FTL is designed to guarantee hard deadlines with a focus on predictable,
low-latency reads and writes.
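To make the idea of rebuilding data from an encoding concrete, the following sketch uses simple XOR parity across a stripe of chips. This is an assumption made purely for illustration: this chapter does not state which encoding PaRT-FTL uses, and the stripe layout and the helper name `xor_pages` are hypothetical.

```python
from functools import reduce


def xor_pages(pages):
    """Byte-wise XOR of equally sized pages (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


# One stripe: a data page on each of three chips plus a parity page on a fourth.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_pages(stripe)

# Suppose chip 1 is busy with a write or garbage collection. Its page is
# rebuilt from the pages on the remaining chips and the parity page,
# so the read does not have to wait for chip 1.
busy = 1
available = [page for i, page in enumerate(stripe) if i != busy]
rebuilt = xor_pages(available + [parity])
print(rebuilt == stripe[busy])
```

With single parity, one unavailable chip per stripe can be tolerated; whether the thesis design uses this or a different code is not established by this sketch.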
Our second real-time design, called Telomere, departs from the traditional block device
interface to guarantee the high throughput needed to process large volumes of data. Using
data lifetime information from the application layer, Telomere is able to intelligently lay
out data in NAND flash memory to minimize garbage collection overhead and guarantee
high read and write performance.
Finally, our third design, Infinite Streams, is a non-real-time design with throughput
guarantees. Infinite Streams is an extension of the interface introduced in Telomere, where
an application specifies data lifetime. In Infinite Streams, the application can also specify
a percentage of data that is allowed to live beyond the specified lifetime. We show the
advantage of Infinite Streams compared to previous work on estimating data lifetime and
grouping requests based on the predicted update frequency.
The contributions of this thesis are summarized as follows:
1. PaRT-FTL, a real-time partitioned FTL that provides timing guarantees and low
latency reads;
2. Telomere, a real-time FTL for sensor data with block expiration scheduling;
3. Infinite Streams, a non-real-time FTL with throughput guarantees;
4. FTL implementations on the OpenSSD Cosmos board.
1.3 Thesis Statement
As storage applications leveraging Solid State Drive (SSD) technology are being widely
deployed in diverse computing systems, the flash translation layer (FTL) traditionally
designed to support a block I/O interface needs to be revisited. An FTL designed for high
performance systems suffers high tail-latency and is ill-suited for a real-time system. As
NAND flash memory technology advances, SSDs are becoming more parallel. We propose
novel FTL designs that exploit the internal parallelism in the SSD to provide timing
guarantees and expand the narrow block interface to allow better data layout in the SSD to
minimize garbage collection overhead.
1.4 Thesis Organization
The remaining chapters of this thesis are organized as follows. Chapter 2 provides the
necessary background information on NAND flash storage, related work and the OpenSSD
Cosmos board. Chapter 3 introduces the Partitioned Real-Time Flash Translation
Layer. Chapter 4 presents Telomere. Chapter 5 discusses Infinite Streams. Finally,
Chapter 6 concludes the thesis and discusses future work.
Chapter 2
Background and Related Work
This chapter introduces the necessary background for the work presented in this thesis.
Section 2.1 contains information on NAND flash internals. Section 2.2 gives an overview of
drive-managed and host-managed flash designs. Related work on guaranteeing performance
and reducing write amplification is introduced in Section 2.3 and Section 2.4, respectively.
Section 2.5 contains information on the OpenSSD Cosmos board, the hardware we use to
implement our designs.
2.1 NAND Flash Memory
This section explains the internal parallelism in NAND flash memory and data
management in flash using the flash translation layer.
2.1.1 Internal Parallelism
The internal structure of flash storage is significantly different from that of traditional
mechanical hard drives, as shown in Figure 2.1. In flash devices, the smallest read and write
unit is a page. Page sizes used to be standardized at 512 or 2048 bytes [Mic05]. More
recently, much larger page sizes have appeared, ranging from 4 to 32 KB [Mic18]. In
addition to data, a page also contains some extra bytes for an out of band (OOB) area, which
is used to store bookkeeping information (e.g. error correction code) for the corresponding
page. Data in NAND flash memory cannot be overwritten; instead, a block of pages must
first be erased before a page is eligible for reuse. When SSDs initially became available, a
flash block contained 32 or 64 pages [Mic05]. Current flash blocks in SSDs can range from
128 to 512 pages [Mic18]. A 4 MB block, for example, can contain 512 pages with each
page containing 8 KB. As shown in Figure 2.1, multiple blocks form a plane, and typically
1 to 4 planes form a die. The flash die or chip is the smallest unit that can independently
execute commands or report status. Typically, 1 to 16 flash dies form a flash package. A
flash package exists on a specific way on a specific channel. There are usually 4 to 8 ways
on a channel. The ways, also called banks, on a channel share a common flash bus and
an internal data bus. The way arbiter in the channel controller grants access to the shared
buses. This is called way interleaving or package-level parallelism. There are usually 4 to
8 channels in a consumer SSD [KOP+11, Oh13], although high-end enterprise SSDs such
as Flashtec 3016 and 3032 have controllers with 16 and 32 channels, respectively. Each
channel contains its own NAND interface block and error correction code block, so it can
operate independently. This is called channel striping or channel-level parallelism. Way
interleaving and channel striping are the two main methods of parallelization that modern
flash controllers support [CHL16]. With all their internal parallelism, SSDs can support
a massive amount of parallel operations. For example, the 2 TB NAND module in the
OpenSSD Cosmos+ board has 8 channels and 8 ways per channel, so each flash package
contains about 256 Gb. Micron’s MLC NAND flash part catalog shows that they
manufacture flash packages of that size with 4 flash dies. Assuming 2 planes per flash die, the
total number of operations that can be performed in parallel in this example is 512. Note
that because manufacturers do not disclose these details without a signed non-disclosure
agreement, we can only estimate some of the values.
Figure 2.1: SSD internal architecture.
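The arithmetic behind the Cosmos+ example can be checked with a few lines. The figures are those quoted in the paragraph above; the constant names are illustrative, and binary units are assumed for "2 TB", which is what makes the package capacity come out at 256 Gb.

```python
# Figures quoted in the text; names and the use of binary units are assumptions.
CHANNELS = 8
WAYS_PER_CHANNEL = 8
DIES_PER_PACKAGE = 4
PLANES_PER_DIE = 2                 # assumed in the text
MODULE_BYTES = 2 * 2**40           # 2 TB NAND module

packages = CHANNELS * WAYS_PER_CHANNEL                 # one package per way
package_gbit = MODULE_BYTES * 8 // packages // 2**30   # capacity per package in Gb
parallel_ops = packages * DIES_PER_PACKAGE * PLANES_PER_DIE

# Block size example from the text: 512 pages of 8 KB each.
block_mbytes = 512 * 8 * 1024 // 2**20

print(packages, package_gbit, parallel_ops, block_mbytes)
```

The script reproduces the 64 packages, 256 Gb per package, 512 parallel operations and 4 MB block size used in the example.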
2.1.2 Flash Translation Layer
The flash translation layer (FTL) is the firmware on the SSD that addresses challenges
in flash memory such as the lack of in-place updates and block endurance. The FTL
allows flash memory to appear as a block device by providing a logical to physical address
mapping. This permits file systems to interact with an SSD transparently, similar to how a
file system would interact with a mechanical hard drive. The FTL also performs garbage
collection, reclaiming invalid pages in a victim block by copying the valid pages followed
by a block erasure. To increase block endurance, the FTL also performs wear-leveling.
Figure 2.2: Address mapping in page-level and block-level FTLs.
Address Mapping Algorithms generally fall into three categories: page-level, block-level
and hybrid-level [MFL14].
In page-level mapping, there is a one-to-one translation of a logical address to a
physical page. This scheme efficiently utilizes blocks in flash, but it requires a large RAM to
be able to store the mapping table.
In block-level mapping, the logical address is split into a logical block number and a
logical page offset. The logical block number is translated into a physical block within
flash, and the physical page offset within the block is the same as the logical page offset.
This is shown in Figure 2.2(b). This scheme reduces the size of the mapping table, but it
can cause performance bottlenecks in write operations as a given logical page can only be
placed at a particular offset in a block.
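The two translation schemes can be contrasted in a short sketch. The block size, the table contents and the function names below are hypothetical values chosen for illustration; real FTLs keep these tables in device RAM in a more compact form.

```python
PAGES_PER_BLOCK = 64  # assumed block geometry


def page_level_lookup(page_table, lpn):
    """Page-level mapping: one table entry per logical page number (LPN)."""
    return page_table[lpn]


def block_level_lookup(block_table, lpn):
    """Block-level mapping: translate the logical block, keep the page offset."""
    lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
    return block_table[lbn] * PAGES_PER_BLOCK + offset


def table_entries(num_logical_pages, scheme):
    """Number of mapping entries each scheme needs for the same capacity."""
    if scheme == "page":
        return num_logical_pages
    return num_logical_pages // PAGES_PER_BLOCK


# A logical page can sit at any physical page under page-level mapping ...
print(page_level_lookup({70: 12345}, 70))
# ... but only at its fixed offset inside the mapped block under block-level mapping.
print(block_level_lookup({0: 7, 1: 3}, 70))
# Mapping table sizes for 2**20 logical pages.
print(table_entries(2**20, "page"), table_entries(2**20, "block"))
```

Logical page 70 has block number 1 and offset 6, so with logical block 1 mapped to physical block 3 it resolves to physical page 3 × 64 + 6 = 198. The block-level table is 64 times smaller here, at the cost of the fixed-offset constraint described above.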
In hybrid-level mapping, blocks in flash are separated into two types: log blocks and
data blocks. All incoming writes are written to the log blocks, which are later merged
into the data blocks and erased. The log blocks use page-level mapping, while the data
blocks use block-level mapping. Many different sector translation policies have been
proposed [KKN+02, Lee06, Lee08, Par08, Cho09]. However, hybrid-level mapping suffers
from extremely long delays in the worst-case recursive full merge of the log blocks into
the data blocks, so it is not suitable for real-time applications.
Figure 2.3: Page-level mapping and block reclamation. Every logical page number (LPN)
is mapped to a physical page number (PPN). When overwriting LPN=0, the updated data
x’ will be written to a different physical page since in-place updates cannot be performed;
the mapping is updated accordingly. When the SSD fills up, garbage collection is triggered
and valid pages in a victim block are copied to a free block and the victim block is erased.
Block Reclamation: The need to reclaim space in NAND flash memory results in potentially
unacceptable worst-case performance for a real-time system. When free space becomes
limited, garbage collection is triggered and selects a victim block to reclaim. All valid
pages in the victim block are copied to a clean block, the mappings are updated, and the
victim block is erased. In the worst case, only one
invalid page out of P pages in a block is reclaimed. Therefore, if a write request triggers
garbage collection, it could be blocked waiting for one block erasure as well as P − 1 page
reads and P − 1 page writes to copy the valid pages.
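The worst-case bound can be written as t_erase + (P − 1)(t_read + t_write). The sketch below evaluates it for placeholder values; the latencies and the block size are assumptions for illustration and are not the figures measured in the thesis.

```python
# Placeholder latencies in microseconds and block size (assumptions, not measured).
T_READ_US = 50
T_WRITE_US = 1000
T_ERASE_US = 3000
PAGES_PER_BLOCK = 128  # P


def gc_blocking_time_us(valid_pages):
    """Time to reclaim one block: copy every valid page, then erase the block."""
    return valid_pages * (T_READ_US + T_WRITE_US) + T_ERASE_US


# Worst case: only one page in the victim block is invalid, so P - 1 are copied.
worst_case_us = gc_blocking_time_us(PAGES_PER_BLOCK - 1)
# Best case: no valid pages, so the request only waits for the erase.
best_case_us = gc_blocking_time_us(0)
print(worst_case_us, best_case_us)
```

With these placeholder numbers the worst case is 136,350 µs, roughly 45 times the cost of the erase alone, which is why unbounded garbage collection is a problem for deadlines.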
Partial garbage collection is a technique used to reduce the latency of a page write. In
traditional garbage collection, if a read or write request arrives right after garbage collec-
tion is triggered, the request is blocked for the entire duration of the block reclamation. In
the example in Figure 2.4, an I/O will be blocked for four page copies and a block erasure.
Partial garbage collection splits the garbage collection into steps. As Figure 2.4 shows,
13
Figure 2.4: Partial garbage collection splits traditional garbage collection into steps.
the maximum time a read request will be blocked becomes the time it takes to erase a
block under partial garbage collection. Compared to conventional garbage collection, this
technique dramatically reduces the worst-case latency. The FTL also needs to be able to
guarantee that partial garbage collection is able to reclaim pages fast enough for incoming
write requests. For example, in Figure 2.4, the victim block contains four invalid pages,
meaning that four pages will be reclaimed when the victim block is erased. Assuming that
this is not background garbage collection and that there are no more free blocks left, at
most four pages can be written in between the partial steps of the block reclamation in
Figure 2.4. Zhang et al. [ZLW+15] show how to set the device utilization to
bound the number of pages written in between the partial steps of a block reclamation.
Wear-Leveling: Writing to flash pages and erasing blocks are stress events that have
detrimental impact on data retention and block endurance in flash memory. NAND flash
memory can contain single-level, multi-level, triple-level and quad-level cells. Single-level
cells (SLC) only store one bit at a time. SLC flash has the longest lifespan with 100,000
expected Program/Erase (P/E) cycles. It is the most reliable and the most expensive type
of NAND flash. Multi-level cell (MLC) flash stores two bits in each cell. MLC NAND is
less reliable and has a shorter lifespan compared to SLC, with expected P/E cycles around
10,000 per cell. Triple-level cell (TLC) flash and quad-level cell (QLC) flash store three
bits and four bits in each cell, respectively. Generally, performance and reliability degrade
as more bits are stored in a flash cell. In 2D NAND, blocks are two-dimensional arrays
of flash memory cells, and all the cells in the same row share a wordline. All the least
significant bits (LSB) on a wordline form an LSB page, and all the most significant bits
(MSB) on a wordline form an MSB page [CMHM13]. More recently, 3D NAND, also
known as V-NAND, has been developed to overcome 2D NAND’s capacity limitations.
In 3D NAND, flash cells are stacked vertically using multiple layers to achieve higher
density.
While flash endurance degrades with increasing P/E cycles, studies show that a block’s
level of wear is not accurately modeled by its P/E cycle alone. Mohan et al. [Moh10] show
that having a quiescent period of a few hours between successive P/E cycles can provide
two orders of magnitude improvement on flash endurance. Pletka et al. [PT16] show that
there is a huge variability of the maximum endurance between flash blocks. Some blocks
can sustain several times more P/E cycles than the others before reaching the same error
correction code limit, at which point the block cannot be recovered. Balancing P/E cycles
across blocks does not necessarily improve device endurance. Instead, Pletka et al. show
that the raw bit error rate of an MLC 2D NAND flash block can be accurately modeled
using a log-log model of the P/E cycle and that a TLC 3D NAND flash block can be
accurately modeled using a log-lin model of the P/E cycle [PKI+18]. Balancing the raw
bit error rate across blocks improves endurance, as it guarantees that the error correction
codes will be able to correct errors for a longer time.
Wear-leveling can be categorized as dynamic or static. Dynamic wear-leveling in-
volves only the migration of updated (or dynamic) data during garbage collection. For
example, during victim block selection, the age of the data is considered in policies such
as cost-age-time (CAT) [Chi99] and cost-age-time with age sort (CATA) [Han06]. Also,
during garbage collection, hot and cold data are written to different blocks to reduce future
garbage collection overheads. For example, the valid pages copied during garbage collec-
tion are often cold data since they have not been updated yet, so they should be written in
a different block from hot data [Wan12]. Static wear-leveling tries to even out the aging
of the blocks by identifying cold (or static) data. A block that is constantly storing hot
data will wear out its P/E cycles while a block storing cold data may never get reclaimed
and erased. Static wear-leveling is often more effective in prolonging the lifetime of flash
memory, but it may also lead to higher performance overheads and design complexity.
Over-provisioning: SSD over-provisioning is the inclusion of extra storage capacity. The
device utilization, λ, is the fraction of the logical address space over the physical address
space:

$$\lambda = \frac{\text{logical address space}}{\text{physical address space}}$$

Another frequently used metric for measuring the extra FTL capacity is defined as the








Over-provisioning provides a lower bound on the number of invalid pages that exist.
When garbage collection selects the block with the smallest number of valid pages to
reclaim, a maximum of ⌈λ·P⌉ pages need to be copied, where P is the number of pages in a
block. This is because, in the worst case, the valid pages are spread out evenly among all
blocks.
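As a rough illustration of this bound, the sketch below computes ⌈λ·P⌉ and the number of pages that a reclamation is then guaranteed to free. The function names and the example utilization are assumptions chosen for illustration.

```python
import math


def max_valid_pages_in_victim(utilization, pages_per_block):
    """Upper bound ceil(lambda * P) on valid pages in the greedily chosen victim."""
    return math.ceil(utilization * pages_per_block)


def min_reclaimed_pages(utilization, pages_per_block):
    """Pages guaranteed to be freed when that victim block is erased."""
    return pages_per_block - max_valid_pages_in_victim(utilization, pages_per_block)


# Hypothetical configuration: 75% utilization, 256 pages per block.
copies = max_valid_pages_in_victim(0.75, 256)
freed = min_reclaimed_pages(0.75, 256)
print(copies, freed)
```

Lowering the utilization (more over-provisioning) reduces the copy bound and raises the guaranteed number of reclaimed pages.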
Write Amplification: Write Amplification Factor (WAF) is the ratio of the number of ac-
tual physical write operations to the number of write operations issued by the host system.
Write amplification depends on over-provisioning and the workload pattern presented by
applications. If write amplification is reduced, the effective lifetime of an SSD increases,
and the number of valid page copies performed by garbage collection decreases, which
leads to better performance guarantees.
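A minimal sketch of the ratio follows. It assumes that the only extra physical writes are valid-page copies made by garbage collection, which ignores other sources such as mapping metadata; the function name is hypothetical.

```python
def write_amplification_factor(host_page_writes, gc_page_copies):
    """Physical page writes divided by host page writes.

    Assumption: physical writes = host writes + GC valid-page copies.
    """
    if host_page_writes <= 0:
        raise ValueError("host_page_writes must be positive")
    return (host_page_writes + gc_page_copies) / host_page_writes


# Hypothetical counters: 1000 host writes caused 500 GC page copies.
waf = write_amplification_factor(1000, 500)
print(waf)
```

Under this assumption a WAF of 1 means garbage collection copied no valid pages at all.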
Space vs. Performance Trade-off: Traditional FTLs often have a tunable trade-off be-
tween storage and throughput utilization. When WAF increases, the number of valid
pages copied during garbage collection increases, resulting in decreasing throughput uti-
lization. The WAF value can be decreased by increasing the over-provisioning space,
resulting in higher throughput utilization but lower storage utilization. The RUM conjec-
ture [?] also shows that read times, update cost and storage overhead are inherent trade-offs
of every access method, and optimizing any two negatively impacts the third.
2.2 Drive-managed vs. Host-managed Designs
To support the block I/O interface, traditional drive-managed SSDs include an FTL, which
performs address mapping, garbage collection and wear-leveling [MFL14]. However, the
configuration through a block interface prevents optimizations from either the file system
or the FTL, causing redundant and inefficient storage management. This problem is exac-
erbated when using a traditional log-structured file system on top of a commercial SSD. In
a log-on-log configuration [YPG+14], where log-structured applications and file systems
are layered on top of log-structured flash devices, write workloads to the FTL can look
random due to unaligned segment sizes and uncoordinated multi-log garbage collection.
A segment in a log-structured file system is a chunk of data that is buffered and flushed
out to the disk once it is full. When the size of a file system segment is not aligned with
a flash block, a deleted segment results in partial invalidation of multiple flash blocks, in-
curring additional garbage collection overhead. While fragmentation can be mitigated to
some extent by matching segment sizes between upper and lower logs, it is not as straight-
forward when each log has multiple append streams. Furthermore, each log has its own
metadata which is invisible to the higher level logs. As a result, a segment contains meta-
data inter-mixed with data from upper logs. Thus, cleaning of segments at one log layer
does not preclude the need to clean the segments at another layer. Drive-managed SSDs
also suffer from shortcomings such as high tail-latency [HSKH+16] and unpredictable I/O
latency [APW+08, CLZ11b].
Figure 2.5 shows the FTL for a drive-managed design. The I/O interface does not pro-
vide the device with any knowledge of the lifetime of the data being stored. Without this
application-specific knowledge, drive-managed SSDs can only infer update frequencies
to try to store hot (recently updated) and cold data in different blocks. Storing hot and
cold data together increases garbage collection overhead. This is caused by writing hot
Figure 2.5: Drive-managed design. The SSD includes a layer of indirection called the Flash Translation Layer.
Figure 2.6: Host-managed design. The mapping and garbage collection are managed by a software FTL layer in the OS or in the application layer.
data to a new block, invalidating old pages of the updated data in a victim block, copying
valid pages for cold data in the victim block to the new block, and then erasing the victim
block. The copying of the valid data would be eliminated if it were not located in the same
block as hot data. Prior work [KHMC14, LCG+15, YPCB17] shows that writing data with
similar hotness or lifetime to the same flash blocks results in improved performance.
To overcome the shortcomings of drive-managed SSDs, the Open-Channel SSD com-
munity has been pushing for host-managed devices [BBBD13]. For example, approaches
such as LightNVM [BGB17] and Zoned Namespaces [Bjo19, Wes19] expose the SSD in-
ternals to host-level software, which is able to control data placement and I/O scheduling.
Figure 2.6 shows a host-managed device, where the mapping and garbage collection are
handled by a software FTL in the operating system (OS) or in the application itself.
Currently, the Open-Channel SSD community is developing a standard as part of the
NVMe 2.0 specification to allow an SSD to expose a logical address namespace using
zones [Bjo19]. In this new interface, applications can intelligently decide where to place
the data in zones based on their knowledge of data hotness. Samsung shows that by passing
information from the application down to the drive to specify data hotness, they were able
to improve the update throughput by 56% [KHMC14]. The increase in throughput is
attributed to fewer valid page copies and lower garbage collection overhead. However, with
host-managed devices wear-leveling [Moh10, PKI+18] is still implemented on the device,
where it is able to track bad blocks and perform error correction. This leads to movement
of data and erasure of blocks outside the control of the software FTL or the application
layer. In turn, this adds timing unpredictability to the performance of applications, which
is undesirable in a real-time system.
Rethinking the Block I/O Interface: Several works show the value of expanding the
narrow block interface [BBBD13, LSZ13, KHMC14, Wan14]. By providing extra infor-
mation to the SSD from the application level, we can improve performance and reduce
write amplification. The first storage interface proposed for applications to communicate
higher order I/O intentions to the SSD is the TRIM command, which allows the SSD to
handle garbage collection more efficiently and is shown to be critically important for sus-
tained write performance. Kang et al. [KHMC14] show through experimentation that by
using two bits to inform the SSD about the expected lifetime of data being written, write
amplification can be dramatically decreased. Data with different lifetime expectancies are
written to different streams. The multi-streamed SSD ensures that the data in a stream are
not only written together to a physically related NAND flash space (e.g., a NAND flash
block), but also separated from data in other streams. Experimentation with a Cassandra
NoSQL DB system shows that the multi-streamed SSD dramatically decreases valid pages
copied during garbage collection and improves Cassandra’s cumulative latency distribution
when dividing data into three streams.
2.3 Performance Guarantees
In this section, we summarize designs that provide real-time performance guarantees and
FTL designs in the non-real-time community for improving performance in NAND flash
memory.
Real-Time Designs: Chang et al. [CKL04] proposed a real-time garbage collection mechanism
(RTGC) with a real-time task model that schedules periodic read and write tasks with
a corresponding real-time garbage collection task. Although the real-time scheduler exists
on top of the FTL, a standard commercial SSD with a built-in FTL cannot be used because
RTGC requires support to access low-level flash operations such as page read, page write
and block erasure.
Guarantee Flash Translation Layer (GFTL) [CG08] first introduced partial garbage
collection with block-level mapping. Partial garbage collection reduces the latency ex-
perienced by traditional garbage collection by dividing the operation to reclaim one flash
block into multiple steps. Because block-level mapping incurs extra OOB operations to get
the real mapping information, GFTL is shown to have high worst-case latency [ZLW+15].
Real-time Flash Translation Layer (RFTL) [QWLS12] shows that good performance
is possible using “distributed” partial garbage collection. RFTL uses partial garbage
collection and assumes that the request arrival rate is bounded by the block erasure time. A
logical block is mapped to three physical blocks, and partial garbage collection is triggered
when the primary physical block is full. The two other physical blocks serve as a buffer
for write requests to the corresponding logical block during garbage collection and for
copying valid pages from the primary block to allow block erasure. In this way, garbage
collection is managed by each logical block in a distributed manner. Similar to GFTL,
RFTL also suffers from extra OOB operations [ZLW+15] as well as low space utilization
since each logical block is mapped to three physical blocks.
WAO-GC [ZLW+15], which stands for worst-case and average-case joint optimization
for garbage collection, builds upon the partial garbage collection technique. In addition
to providing ideal worst-case bounds for page read and write, WAO-GC is able to achieve
better average-case performance than GFTL and RFTL by using over-provisioning to de-
lay garbage collection. When a victim block is selected, the maximum number of valid
pages in the victim block is guaranteed by the over-provisioning value.
Non-Real-Time Designs: Huang et al. [HC18] exploit the internal parallelism in SSDs by
reserving banks for servicing read requests, write requests and garbage collection in order
to guarantee stable read and write throughput. Partial garbage collection, where each step
is either a page copy or a block erasure, is used so each page read is potentially blocked
by a partial step.
Tiny-Tail Flash [YLH+17] partitions flash planes into two sets, one for garbage col-
lection and one for servicing read and write requests. The garbage collection planes rotate
periodically, and parity-based redundancy is used to rebuild reads. The configuration of
the partitioning differs in our design since we separate read and write requests onto differ-
ent flash dies. Whereas simulation results show that Tiny-Tail Flash reduces GC-blocked
I/Os to 0.003-0.7%, our real-time model guarantees that deadlines will not be missed.
In addition to FTL designs, there exist many designs that treat the SSD as a black
box and use redundancy to guarantee performance by implementing RAID on multiple
SSDs. However, depending on the FTL design in the commercial SSDs, latency can
vary greatly when garbage collection is triggered. Purity [CDH+15] measures the la-
tency of each request and uses Reed-Solomon to reconstruct the requested data whenever
a request takes longer than the 95th percentile latency. Flash on Rails [SAW+14] and
Shin et al. [SKKY15] both partition read and write requests to different SSDs to provide
predictable read performance. Flash on Rails [SAW+14] uses two SSDs, one perform-
ing write requests and the other performing read requests, and periodically exchanges the
SSDs and synchronizes data. While this design provides high read throughput, it can suffer
from unpredictable write performance since it has no direct control of garbage collection
activities.
2.4 Reducing Write Amplification Factor
A large body of work focuses on reducing write amplification by storing together data
with similar update frequencies. Rosenblum and Ousterhout first point out that categoriz-
ing data as hot or cold reduces cleaning overhead in a log-structured file system [RO92].
This was followed by works that tuned LFS on flash systems [WZ94, Kaw95]. Other
research that detects data temperature can be found in flash wear-leveling works such as
Dynamic Age Clustering [Chi99] as well as data placement algorithms based on update
frequency [PD11, SA13]. However, drive-managed SSDs that provide the traditional block
I/O interface suffer from many shortcomings [APW+08, CLZ11b, HSKH+16], including
log-on-log issues [YPG+14]. Solutions to these problems fall into two main categories:
host-managed designs [KYM11, LSZ13, BBBD13, Bjo19] and some form of information
sharing between the host and the FTL [KHMC14, LCG+15, YPCB17]. The latter is the
approach we have taken with Telomere and Infinite Streams.
Multi-stream SSDs show how write amplification can be reduced by assigning data
with different update frequencies to different streams, which are stored at a different
physical location [KHMC14, LCG+15, YPCB17]. Multi-streamed SSDs [KHMC14] and
WARM [LCG+15] assign different stream IDs to different types of data (index files, log
files, sstables, etc.). However, files of the same type may contain data with different
lifetimes. AutoStream [YPCB17] automatically assigns stream IDs to data. Their experiments
show that WAF gets close to 1 when there is a large data lifetime difference with 4 streams
of data. However, with 16 streams of data with different lifetimes, AutoStream takes time
to differentiate blocks, and some requests are mixed into one stream, resulting in WAF
above 2. AutoStream does not disclose crucial details including the over-provisioning,
making it hard to compare against.
2.5 OpenSSD Cosmos Board
A commercial off-the-shelf SSD usually comes with a built-in FTL that performs address
translation, garbage collection and wear-leveling. These FTLs are usually designed for
high average performance at the cost of long tail latency, and since they are built into the drive, their logic
cannot be easily modified. The OpenSSD is an open-source project designed for research-
ing SSD internals, including exploring new FTL designs. We use the OpenSSD Cosmos
platform [SJLK14], an FPGA board whose hardware and software designs are fully mod-
ifiable.
The Cosmos board includes the Zynq-7000 with dual ARM Cortex-A9 and NEON
DSP co-processor for each core. The internal structure of a Zynq-7000 SoC has two com-
ponents: the processing system (PS) and the programmable logic (PL). The PS component
includes the dual-core ARM processor, the memory interfaces and the I/O peripherals. The
PL component includes the FPGA fabric. The flash storage controller is synthesized in the
PL and the FTL firmware runs on the ARM Cortex-A9.
The OpenSSD Cosmos board has two small outline dual in-line memory modules
(SO-DIMMs), each containing Micron Technology’s MLC NAND flash (part number
MT29F256G08CMCABH2). A block contains 256 pages, and a page is 8 KB.
Figure 2.7: Cosmos OpenSSD board [SJLK14].
Each SO-DIMM has a capacity of 128 GB with four channels and four ways per channel. The FTL
sends commands to way controllers directly, but it cannot access the channel controller in-
cluding the way arbiter, page buffer and the Bose-Chaudhuri-Hocquenghem (BCH) error
correction code engine. The way arbiter grants permission in a round-robin manner for
the way controllers to use the common flash bus or the internal data bus to access the page
buffer. The page buffer stores 2 KB of data, 60 bytes of error correction code parity and
90 reserved bytes. Since the page size of the flash device is 8 KB, each page is transferred
through the page buffer in four 2 KB pieces.
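The arithmetic behind these figures can be checked with a few lines. The constants are the values quoted in this section; the variable names are hypothetical, and treating each (channel, way) pair as one flash chip is an assumption based on the layout described for this board.

```python
# Values quoted in the text for the Cosmos board's NAND configuration.
PAGE_SIZE_KB = 8
PAGE_BUFFER_DATA_KB = 2
PAGES_PER_BLOCK = 256
CHANNELS = 4
WAYS_PER_CHANNEL = 4

# Page-buffer transfers needed to move one 8 KB flash page.
transfers_per_page = PAGE_SIZE_KB // PAGE_BUFFER_DATA_KB

# Size of one erase block in KB.
block_size_kb = PAGES_PER_BLOCK * PAGE_SIZE_KB

# Assumption: one flash chip per (channel, way) pair on a SO-DIMM.
flash_chips_per_dimm = CHANNELS * WAYS_PER_CHANNEL

print(transfers_per_page, block_size_kb, flash_chips_per_dimm)
```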
The FTL sends commands to the way controllers directly. To perform a page write,
when the way arbiter grants access to the internal data bus, the command is issued and data
is moved from DRAM to the page buffer. Data is then transferred to the error correction
code encoder which calculates the parity and transfers the data and parity to the page
Figure 2.8: OpenSSD Cosmos board setup.
buffer. Data is then transferred to the way controller and finally to the NAND flash when
the way arbiter grants access to the common flash bus. To perform a page read, data arrives
from the way controller and is transferred to the error correction code decoder. If there are
errors in the data, the error correction code decoder corrects the data and transfers it to the
page buffer. Data is then transferred to DRAM.
As shown in Figure 2.8, Xilinx on the development PC generates the FPGA bitstream,
which is used to configure the programmable logic side of the Zynq FPGA through the
JTAG digilent module. The Cosmos board is connected via an external PCIe cable to a
PC with an ASRock Z68 PRO3-M Motherboard and a 3.10 GHz Intel Core i3-2100 CPU
running the Quest real-time operating system [WLMD16].
Chapter 3
Partitioned Real-Time FTL
The Partitioned Real-Time FTL (PaRT-FTL) is designed for real-time systems with hard
deadlines associated with the storage and retrieval of persistent data. For example, an au-
tonomous vehicle management system might require the processing, storage and retrieval
of multiple data-intensive sensor streams, including video images and point clouds to ren-
der a 3D map of its surroundings as it performs simultaneous localization and mapping.
We envision scenarios for next-generation real-time applications where main memory has
insufficient capacity to store all the data needed for information processing, machine learning,
path planning, decision making and other mission-critical tasks. Low-latency access to
stored data that can be processed and augmented with updated sensor information is
particularly relevant to our intended usage of PaRT-FTL.
The design of PaRT-FTL is motivated by our observations of the behavior of NAND
flash memory. To achieve predictable read performance for real-time workloads, read and
write requests are partitioned onto different flash chips, and parity pages are calculated
using XORs. Read requests of pages on flash chips that are servicing write requests are
rebuilt using the parity page. In this way, read requests are never blocked by write requests
or garbage collection. PaRT-FTL was designed with the following goals:
• an FTL design that takes advantage of internal parallelism in SSDs;
• a real-time task model for read and write requests on multiple flash chips;
• low-latency read requests that are not blocked by writes or garbage collection.
Flash Parallelism Observations: We measured the effects of way interleaving and chan-
nel striping for page-based reads and writes and block-based erasures, using the OpenSSD
Cosmos Board [SJLK14]. Recall that the way arbiter in the Cosmos Board grants access
to the shared buses in a round-robin manner. While our focus is on Micron Technol-
ogy’s MLC NAND flash, way interleaving and channel striping are common characteris-
tics found in other modern NAND flash technologies. Parallelism within a flash package
(i.e. die and plane) is not explored due to hardware limitations. Each Micron Technology
NAND flash package contains one flash die, and plane parallelism could not be exploited
due to limitations of the OpenSSD FPGA implementation.
For each flash operation (read, write and erasure), the latency of the operation is mea-
sured when performed four times on the same flash die, on different flash chips that exist
in the same channel (way interleaving), or on different flash chips that exist in different
channels (channel striping). Table 3.1 shows the results.
Operation              Min (ms)   Max (ms)   Avg (ms)   Stddev (ms)
4 same die writes      0.844      9.70       5.18       2.50
4 way writes           0.826      2.90       1.87       0.645
4 channel writes       0.598      2.37       1.33       0.631
4 same die reads       0.856      1.44       1.04       0.040
4 way reads            0.833      1.25       1.23       0.081
4 channel reads        0.369      0.382      0.375      0.004
4 same die erasures    4.51       16.2       12.6       3.05
4 way erasures         2.58       4.06       3.84       0.111
4 channel erasures     2.77       4.06       3.84       0.114
Table 3.1: Latency in milliseconds for flash operations.
The maximum and average latency for page writes are reduced by both way inter-
leaving and channel striping. The slight slowdown in the way interleaving compared to
channel striping is most likely due to accessing the page buffer, which is shared among all
the flash chips in a channel.
For page read operations, while the maximum and average latency are reduced by
channel striping, no significant improvements are seen for way interleaving. Read opera-
tions do not show any performance benefits under way interleaving because the majority of
the time is spent accessing the page buffer, which cannot be performed in parallel during
way interleaving. On average, reads with way interleaving actually perform worse than
reads on the same flash die. We hypothesize that the extra time comes from performing
status checks on the different flash chips.
For block erasures, the maximum and average latency are reduced by both channel
striping and way interleaving.
In summary, we observe that read, write, and erase operations are parallelizable by
channel striping. However, read operations do not show any performance benefits under
way interleaving while write and erasure operations do.
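To make the summary concrete, the sketch below turns the average latencies from Table 3.1 into speedup ratios relative to issuing the four operations on the same die. The dictionary layout and function name are hypothetical; only the numbers come from the table.

```python
# Average latencies in milliseconds, copied from Table 3.1.
avg_latency_ms = {
    "write": {"same_die": 5.18, "way": 1.87, "channel": 1.33},
    "read": {"same_die": 1.04, "way": 1.23, "channel": 0.375},
    "erasure": {"same_die": 12.6, "way": 3.84, "channel": 3.84},
}


def speedup(operation, mode):
    """Ratio of same-die average latency to the average latency under `mode`."""
    return avg_latency_ms[operation]["same_die"] / avg_latency_ms[operation][mode]


for operation in avg_latency_ms:
    for mode in ("way", "channel"):
        print(f"{operation:8s} {mode:8s} {speedup(operation, mode):.2f}x")
```

A ratio below 1 for reads under way interleaving reflects the observation that interleaved reads are, on average, slower than reads on the same die.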
FTL Data Layout: PaRT-FTL partitions the set of flash chips into write flash chips Fw
and read flash chips Fr. Table 3.2 contains symbol definitions. Fw is the set of flash
chips servicing only write requests and performing garbage collection, and Fr is the set
of flash chips servicing only read requests. These two sets are mutually exclusive. When
servicing a read request, if the physical page exists on a write flash chip, the page is rebuilt
by reading the associated encoding page and data pages in Fr. For example, in the SSD
layout depicted by Figure 3.1, 4 flash chips are servicing write requests, and 12 flash chips
are servicing read requests. Fw contains flash chips on Way 1 of each channel. Read
requests for pages in Ways 2 and 3 are handled normally while read requests for pages in
Way 1 will be rebuilt by reading and decoding the corresponding pages in Ways 2, 3 and
4. Data is encoded and decoded using XORs. In the example in Figure 3.1, each parity
Figure 3.1: Flash chip layout. In this example, there are 12 flash chips storing data, and
4 flash chips storing encoding pages. Flash chips being written to are on Way 1 of each
channel, while other flash chips are servicing read requests.
Figure 3.2: Flash chips rotation. The set of write flash chips rotates to a different way after
a block of data is written to each write flash chip.
page is the XOR of its corresponding data pages in the same channel.
After a block of data is written to each write chip, Fw rotates to a different set of flash
chips. For example, in Figure 3.2, Fw rotates to be the flash chips in Way 2, and flash chips
in Ways 1, 3 and 4 will be used to service read requests.
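The XOR-based rebuild can be sketched as follows. The four-byte pages and the choice of which way holds the parity page are assumptions made only to keep the example small; the helper name is hypothetical.

```python
from functools import reduce


def xor_pages(pages):
    """Bytewise XOR of equally sized pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


# One channel with three data pages and one parity page (tiny pages for brevity).
data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = xor_pages(data)

# Suppose the chip holding data[0] is busy servicing writes. Its page is
# rebuilt from the remaining data pages and the parity page on read chips.
rebuilt = xor_pages([data[1], data[2], parity])
print(rebuilt == data[0])
```

Because XOR is its own inverse, any single missing page in the group can be recovered this way, which is what allows reads to avoid the chips that are currently being written.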
3.1 Real-Time Task Model
Let $\{\tau_1, \tau_2, \ldots, \tau_n\}$ be a set of $n$ periodic tasks. Each task $\tau_i$ guarantees that an application
can perform $r_i$ page reads every $T_i^r$ time units and $w_i$ page writes every $T_i^w$ time units.
Note that $\tau_i$, which has parameters $[(r_i, T_i^r), (w_i, T_i^w)]$, does not account for the CPU
computation time. These tasks exist on the FTL and utilize the NAND bus. A task is assumed
to be scheduled on the CPU in a way that is able to guarantee the above read and write
request rates on the SSD.
Symbol      Definition
$F_w$       Set of flash chips servicing write requests
$F_r$       Set of flash chips servicing read requests
$k$         Number of flash chips storing data
$m$         Number of flash chips storing encoding info
$P$         Number of flash pages in a flash block
$\tau_i$    A periodic task
$r_i$       Number of page read requests from $\tau_i$
$C_i^r$     Read capacity for $\tau_i$
$T_i^r$     Period for read requests from $\tau_i$
$w_i$       Number of page write requests from $\tau_i$
$C_i^w$     Write capacity for $\tau_i$
$T_i^w$     Period for write requests from $\tau_i$
$C_i^e$     Encoding capacity for $\tau_i$
$T_i^e$     Encoding period for $\tau_i$
$C_i^g$     Garbage collection capacity for $\tau_i$
$T_i^g$     Garbage collection period for $\tau_i$
$t_r$       Time to read a page on every flash chip in $F_r$
$t_w$       Time to write a page on every flash chip in $F_w$
$t_e$       Time to erase a block on every flash chip in $F_w$
$t_c^e$     Time to encode a parity page
$t_c^d$     Time to decode a page using parity
$\lambda$   Ratio of logical to physical address space
$\alpha$    Lower bound of reclaimed pages in a block
Table 3.2: PaRT-FTL symbol definitions.
For a read request, the worst-case scenario is that the page exists on a flash chip being
written to and the page needs to be rebuilt. To rebuild the page, all the associated data and
encoding pages have to be read. Note that these pages exist on different flash dies, so they
can be read in parallel. For scheduling purposes, we set the read capacity $C_i^r$ for task $\tau_i$ as
follows:

$$C_i^r = r_i \cdot (t_r + t_c^d) \tag{3.1}$$

where $t_r$ is the time it takes to read a flash page on every read flash chip, and $t_c^d$ is the time
it takes to decode a page.
A write request is first written to a buffer and later written to flash chips in Fw. Since
there are no in-place updates in flash memory, a write request consisting of multiple pages
can be distributed to different flash chips. Page writes are parallelizable through both
channel striping and way interleaving, and they are thus parallelizable across every flash
chip in $|F|$. Since only $|F_w|$ flash chips are servicing write requests, the write capacity $C_i^w$







where tw is the time it takes to write a flash page on every write flash chip. Note that tr
and tw depend on the configurations of Fr and Fw, which determine how much way and
channel parallelism exists.
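The capacities can be illustrated with a short sketch. The read capacity follows Equation (3.1) directly. The write capacity shown here is only an assumption made for illustration: it supposes that writes proceed in rounds of one page per write chip, each round taking the time to write a page on every chip in Fw, and it should not be read as the definition used in this work. All names and numbers are hypothetical.

```python
import math


def read_capacity(num_reads, t_read, t_decode):
    """Equation (3.1): each read is budgeted as a full rebuild (read + decode)."""
    return num_reads * (t_read + t_decode)


def write_capacity(num_writes, num_write_chips, t_write):
    """ASSUMED form, not taken from the text: ceil(w_i / |F_w|) rounds of t_write."""
    return math.ceil(num_writes / num_write_chips) * t_write


# Hypothetical task: 10 reads and 10 writes per period, 4 write chips.
c_read = read_capacity(10, t_read=0.375, t_decode=0.125)
c_write = write_capacity(10, num_write_chips=4, t_write=2.0)
print(c_read, c_write)
```

Budgeting every read as a rebuild is pessimistic for pages that already reside on read chips, but it keeps the bound valid regardless of where a page currently lives.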
Updating Encoding Pages: When a block of new data has been written to each of the data
flash chips, a block of encoding has to be written to each of the encoding flash chips. Let $k$
be the number of flash chips storing data, $m$ be the number of flash chips storing encoding
information, and $P$ be the number of pages in a block. The write granularity is $(k/m)$
pages so that each write operation to the SSD results in one page written to each way and
the corresponding parity page is updated. After $(k \cdot P)$ pages of new data have been written,
$(m \cdot P)$ encoding pages are updated. For a task $\tau_i$, if $w_i = k \cdot P$, then $(m \cdot P)$ encoding
pages are updated every write period $T_i^w$, so the period for the encoding task is the same
as the write period. If $w_i \neq k \cdot P$, then the period for the encoding task is the write period
multiplied by $\frac{k \cdot P}{w_i}$ to ensure that all the encoding pages can be written. For each task $\tau_i$,
if $w_i > 0$, then an encoding task with the following encoding capacity $C_i^e$


















where $t_c^e$ is the time it takes to compute an encoding page.
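The scaling of the encoding period described above can be sketched as follows. Only the period is computed, since it is stated in the prose; the function name and example values are hypothetical.

```python
from fractions import Fraction


def encoding_period(write_period, pages_per_write_period, k, pages_per_block):
    """Write period scaled by (k * P) / w_i, as described in the text.

    This is the time over which a task writing w_i pages per write period
    accumulates k * P pages of new data.
    """
    return write_period * Fraction(k * pages_per_block, pages_per_write_period)


# Hypothetical task: 768 pages every 10 time units, k = 12 data chips, P = 256.
period = encoding_period(10, 768, k=12, pages_per_block=256)
print(period)
```

Using `Fraction` keeps the period exact when the ratio is not an integer.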
Garbage Collection: When a flash chip in $F_w$ runs out of free pages, garbage collection
is triggered on that chip. As a result, every task with $w_i > 0$ will have a corresponding
garbage collection task to ensure that enough free pages are reclaimed for task $\tau_i$ to write
$w_i$ pages every period $T_i^w$.
When garbage collection starts, a victim block is selected and valid pages are copied
from the victim block to a free block. Then, the victim block is erased and the invalid
pages that were previously in the victim block are reclaimed. With device utilization
$\lambda$, the number of valid pages that need to be copied in all the data flash chips
can be upper bounded by $k \cdot \lceil \lambda \cdot P \rceil$. When $k$ blocks of data become invalid, their
corresponding encoding also becomes invalid. Therefore, $m \cdot \lceil \lambda \cdot P \rceil$ is the number of valid
encoding pages that need to be copied. Thus, to reclaim $k \cdot \alpha$ pages, where $\alpha = P - \lceil \lambda \cdot P \rceil$
is the lower bound of reclaimed pages in a block, $(k+m) \cdot \lceil \lambda \cdot P \rceil$ pages need to be copied,
and $(k+m)$ blocks erased. For a task $\tau_i$, if $w_i = k \cdot \alpha$, then garbage collection needs to
Figure 3.3: Corresponding GC task period for a write task that writes more than a block per write period.
Figure 3.4: Corresponding GC task period for a write task that writes less than a block per write period.
reclaim a block every write period $T_i^w$, so the period for the GC task is the same as the
write period. If the number of pages written $w_i$ is more than $k \cdot \alpha$ pages, then the GC
task needs to guarantee that at least $w_i$ pages are reclaimed every write period $T_i^w$. Since
erasures happen at the block level and not at the page level, the GC task needs to round
the number of pages reclaimed up to a multiple of blocks. Thus, even if $w_i$ is just one
more than $k \cdot \alpha$ pages, two blocks will need to be reclaimed every $T_i^w$. Similarly, if
$w_i$ is only half of $k \cdot \alpha$ pages, then the GC task only needs to reclaim a block every $2 \cdot T_i^w$.
However, if more than half of $k \cdot \alpha$ pages is requested, the GC task will need to guarantee
a block every $T_i^w$, and thus, the floor is used. For example, in Figure 3.3, assume that
the number of pages reclaimed when a block is erased equals eight. If $w_i = 14$, then the
corresponding garbage collection task needs to reclaim two blocks in a write period. Thus,







. If wi is less than eight, as shown in an example in Figure 3.4, $T^g_i$ is twice








If wi > 0, then a garbage collection task with the following capacity $C^g_i$ and period $T^g_i$





























where te and tr are the latency for block erasure and page read, respectively, on every write
flash chip in Fw.
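The block-granularity rule for the GC task described above can be sketched in a few lines. This is only an illustration of the prose rule (round up to whole reclamation rounds when wi ≥ k·α, and stretch the period by the floor of k·α/wi otherwise). The function name, its arguments and the 60 msec period are hypothetical, and the sketch does not reproduce the capacity and period equations themselves.

```python
import math


def gc_rounds_and_period(w, reclaim_per_round, write_period):
    """Return (rounds_per_gc_period, gc_period) for a write task.

    w                 -- pages written per write period (w_i > 0)
    reclaim_per_round -- pages reclaimed by one reclamation round (k * alpha)
    write_period      -- the task's write period T^w_i
    Assumption: this encodes only the prose rule, not the original equations.
    """
    if w <= 0 or reclaim_per_round <= 0:
        raise ValueError("w and reclaim_per_round must be positive")
    if w >= reclaim_per_round:
        # Erasures are block-granular, so round up to whole rounds per write period.
        return math.ceil(w / reclaim_per_round), write_period
    # Fewer pages than one round reclaims: one round every floor(k*alpha / w) periods.
    return 1, (reclaim_per_round // w) * write_period


# The example from the text: eight pages reclaimed per erased block.
print(gc_rounds_and_period(14, 8, 60))  # two rounds every write period
print(gc_rounds_and_period(4, 8, 60))   # one round every two write periods
print(gc_rounds_and_period(5, 8, 60))   # more than half: one round every write period
```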
3.2 Admission Control
A schedulability test is invoked for each chip set to ensure that all the read and write
requests are schedulable. We use Earliest Deadline First (EDF) to schedule the tasks. The
equations are derived from Theorem 2 in Baker’s Stack Resource Policy work [Bak90]










where Bk denotes the execution time of the longest critical section of any job¹.
Since read requests are isolated from write requests and occur on separate flash chips,
a separate schedulability test is provided for read and write requests. For the read requests
that are serviced by flash chips in Fr, the longest non-preemptive period is the time it takes
to read a flash page. All flash operations, i.e. read, write and erasure, are non-preemptive,
but since read flash chips are not performing writes or erasures, the longest flash operation
is tr. The schedulability of the read requests with EDF is the following:
¹ In real-time systems, a periodic task releases jobs at regular intervals based on its period. Each job has











where min($T^r$) is the minimum period among all $T^r_i$.
For write requests that are serviced by flash chips in Fw, the largest non-preemptive
period is the longest flash operation that takes place on write flash chips, which is a block
erasure. A block reclamation, for example, can be preempted many times between read-
ing and writing valid pages. However, once a flash operation takes place, it cannot be
















) ≤ 1 (3.7)





At the interface level, the user specifies the number of read and write pages per read
and write period upon the open() syscall. If the admission control fails, the open() would
fail.
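As a rough illustration of how such an admission check behaves, the sketch below assumes the test has the common EDF form "total utilization plus blocking over the minimum period must not exceed one", with the blocking term set to the longest non-preemptive flash operation (tr on read chips, te on write chips). That form is an assumption inferred from the surrounding description; the function name and the task tuples are hypothetical, and the capacities passed in are not the capacities defined by the task model.

```python
def edf_admissible(tasks, blocking):
    """Check sum(C_i / T_i) + blocking / min(T_i) <= 1.

    tasks    -- list of (capacity, period) pairs, in the same time unit
    blocking -- longest non-preemptive operation (assumed t_r or t_e)
    Assumption: this is a generic EDF-with-blocking test, not Equations 3.6/3.7.
    """
    if not tasks:
        return True
    utilization = sum(capacity / period for capacity, period in tasks)
    min_period = min(period for _, period in tasks)
    return utilization + blocking / min_period <= 1.0


# Hypothetical task sets (capacity, period) in milliseconds.
light = [(1.0, 10.0), (2.0, 20.0)]
heavy = [(5.0, 10.0), (9.0, 20.0)]
print(edf_admissible(light, blocking=1.0))
print(edf_admissible(heavy, blocking=1.0))
```

An open() call would be rejected whenever the check returns False for the task set extended with the new task.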
Bandwidth Calculation: We define the write bandwidth as the number of page writes
that can be performed in reclaimed blocks across all k data flash chips divided by the time
it takes to perform garbage collection and said page writes. After garbage collection is
initiated, the block with the largest number of invalid pages is selected as the victim
block to be reclaimed. Let Bv be the number of valid pages in the victim block and Br be
the number of invalid or reclaimed pages. To garbage collect the victim block, Bv pages
need to be copied to a free block, and the victim block is erased. At this point, Br pages are
reclaimed, and garbage collection starts again after Br page writes. The write bandwidth
is defined as follows:
\[ \frac{k \cdot B_r}{(|F|/|F_w|) \cdot [B_v (t_r + t_w) + t_e + B_r \cdot t_w]} \tag{3.8} \]
where F is the set of all flash chips, and Fw is the set of write flash chips. The worst-
case theoretical bandwidth occurs under the worst case scenario for garbage collection.
Since the victim block chosen for garbage collection is the block with the most invalid
pages, the worst case is when the invalid pages are evenly spread out among all the blocks.
Thus, Bv = ⌈λ·P ⌉ and Br = α.
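Equation 3.8 and its worst case can be evaluated directly. The sketch below is a plain transcription of the formula, with the worst case obtained by setting Bv = ⌈λ·P⌉ and Br = P − ⌈λ·P⌉ as stated above. The function and parameter names are hypothetical, and the result is in pages per time unit, so converting to MB/s requires multiplying by the page size.

```python
import math


def write_bandwidth(k, n_chips, n_write_chips, b_valid, b_reclaimed, t_r, t_w, t_e):
    """Pages written per unit time, following Equation 3.8."""
    # Time to copy the valid pages, erase the victim block, then write the reclaimed pages.
    cycle_time = b_valid * (t_r + t_w) + t_e + b_reclaimed * t_w
    return (k * b_reclaimed) / ((n_chips / n_write_chips) * cycle_time)


def worst_case_write_bandwidth(k, n_chips, n_write_chips, lam, pages_per_block,
                               t_r, t_w, t_e):
    """Worst case: invalid pages are spread evenly, so Bv = ceil(lam * P)."""
    b_valid = math.ceil(lam * pages_per_block)
    b_reclaimed = pages_per_block - b_valid  # alpha
    return write_bandwidth(k, n_chips, n_write_chips, b_valid, b_reclaimed,
                           t_r, t_w, t_e)


# Toy numbers chosen for easy checking, not measured values.
print(write_bandwidth(k=2, n_chips=4, n_write_chips=2, b_valid=1, b_reclaimed=3,
                      t_r=1.0, t_w=1.0, t_e=1.0))
```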
The maximum read bandwidth occurs when none of the pages need to be rebuilt. Thus,
the read bandwidth equals the number of page reads that can be performed in parallel, fr,








where $t_r + t^d_c$ is the time to read a page on every flash chip and the time to decode and
rebuild the page.
3.3 Evaluation
The experimental evaluation consists of two sections: 1) simulation-based schedulability
tests, and 2) experiments conducted using two different FTL implementations on the
Cosmos OpenSSD board. The simulations show that PaRT-FTL has a higher feasible
utilization, while the OpenSSD experiments show that PaRT-FTL has lower read and write
latency.
Implementation: Each read or write request is split up into flash page-size requests by the
device driver and then inserted into a request circular buffer. The FTL retrieves the requests
and orders them according to Earliest Deadline First and starts handling the request if the
flash die is not busy.
Write requests are buffered and admission control guarantees that the buffer will not
overflow. Write dies flush pages in the buffer to the SSD. If garbage collection has been
initiated, one flash operation for the garbage collector will be done on that die: reading
a valid page, writing a valid page, or erasing a block. If there is a read request for a page
on a write flash die, a rebuild operation is initiated by adding the pages that need to
be read for the rebuild to the front of the queue. At the beginning of the request loop, page
rebuilds are checked. If all the pages for a rebuild have been read, the requested page is decoded.
Over-provisioning: For PaRT-FTL, 25% of storage is reserved for parity checking, which
is typical for a RAID design. Of the 75% used for data, the ratio of logical to physical
address space is set to 64.3%. Thus, 48.2% of the SSD is used for data, with 26.8%
over-provisioning and 25% for parity information.
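The split quoted above follows from two small calculations; the derivation below only restates the numbers in the paragraph.

```latex
\begin{align*}
\text{data fraction} &= 0.75 \times 0.643 \approx 0.482 \\
\text{over-provisioning} &= 0.75 - 0.482 = 0.268 \\
0.482 + 0.268 + 0.25 &= 1
\end{align*}
```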
When comparing PaRT-FTL against other approaches such as RTGC and WAO-GC,
we use a matching data storage capacity. Thus, in RTGC and WAO-GC, 48.2% of the
SSD is used for data with 51.8% over-provisioning. To ensure that the over-provisioning
for WAO-GC is enough, the upper bound of λ is calculated. We define tr as the time it
takes to read a flash page assuming the other flash dies on the same way are busy rather than idle.
In our hardware, a page is 8 KB. The way arbiter in the Cosmos Board grants access in
a round-robin manner to the common flash bus to access the NAND flash or to use the
internal data bus to access the page buffer, which stores 2 KB of data. Since a flash page
is 8 KB, data is transferred through the page buffer four times for each flash page. This
means that when measuring the latency of a page read on a die, we have to assume
that the other dies on the same way are also reading, which gives tr = 1.23 ms on average
for that page read based on our observations in Table 3.1. Similarly, a page write takes 1.87 ms
and a block erasure takes 3.84 ms, on average. With one partial garbage collection step
consisting of two page copies and P = 256 pages per flash block, the upper bound on the ratio
of logical to physical address space for WAO-GC can be calculated using Equation 3.13 as 66.4%.
3.3.1 Simulation Experiments
Random task sets were generated with varying total utilization using the UUniFast
algorithm [BB05]. 500 task sets were generated for each utilization value ranging from 0.05 to
0.95 in increments of 0.05. Each task set contains 10 tasks. Each task makes read requests
of one flash page, which equals 8 KB, and write requests of three flash pages. The periods









where $U^w_i$ and $U^r_i$ are the write and read utilizations generated, respectively.
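For readers unfamiliar with UUniFast, the sketch below shows the standard form of the algorithm: it draws n per-task utilizations that sum to a target total and are uniformly distributed over that simplex. How the generated utilizations are split into read and write parts and turned into periods is not shown here, and the function name and the seeded generator are illustrative choices.

```python
import random


def uunifast(n, total_utilization, rng=None):
    """Return n task utilizations that sum to total_utilization (UUniFast)."""
    rng = rng or random.Random()
    utilizations = []
    remaining = total_utilization
    for i in range(1, n):
        # Draw the utilization left for the remaining n - i tasks.
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utilizations.append(remaining - next_remaining)
        remaining = next_remaining
    utilizations.append(remaining)
    return utilizations


# One task set of 10 tasks at total utilization 0.5, with a fixed seed for repeatability.
task_set = uunifast(10, 0.5, random.Random(42))
print(len(task_set), round(sum(task_set), 6))
```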
Each task set was tested to see if it was schedulable under PaRT-FTL, RTGC and
WAO-GC pagemap FTL. Figure 3.5 shows the simulation results for admission control
with PaRT-FTL, which is calculated with Equations 3.6 and 3.7, and the admission control
with RTGC [CKL04].
In WAO-GC, when a victim block is selected, over-provisioning guarantees that the
block has at most v valid pages. Let n be the number of partial garbage collection steps
needed to copy v pages and erase the victim block. Since a partial garbage collection step
is executed after a page write request, n is the number of new pages that need to be stored.
Therefore, the following constraint exists: n + v ≤ P. This constraint guarantees that one
free block can hold both the v valid pages from the victim block and the n pages from page
write requests during the reclamation of the victim block. Let c be the number of page





+ 1. Given that v ≤ ⌈λ·P ⌉ due
to the over-provisioning, substituting n into the constraint gives the relationship between





The WAO-GC pagemap FTL has no admission control. The worst-case latency for a
page write occurs after a partial garbage collection step, which is bounded by te. Thus, the
worst-case latency for a page write is the following:
\[ t_w + \max(t_e,\; c\,(t_r + t_w)) \tag{3.14} \]
where c is the number of page copies that can be done in a partial step.
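The two WAO-GC quantities used in this comparison can be checked numerically. The sketch below assumes that the number of partial steps is n = ⌈v/c⌉ + 1 (copy steps plus one erase step); that form is an assumption, since only the "+ 1" survives in the text, but together with the constraint n + v ≤ P it is consistent with the 66.4% bound quoted for c = 2 and P = 256. The second function is a direct transcription of Equation 3.14. All function names are hypothetical.

```python
import math


def max_valid_pages(pages_per_block, copies_per_step):
    """Largest v satisfying n + v <= P, assuming n = ceil(v / c) + 1."""
    best = 0
    for v in range(pages_per_block + 1):
        n = math.ceil(v / copies_per_step) + 1  # assumed: copy steps plus one erase step
        if n + v <= pages_per_block:
            best = v
    return best


def worst_case_write_latency(t_w, t_e, t_r, copies_per_step):
    """Equation 3.14: a page write followed by the longest partial GC step."""
    return t_w + max(t_e, copies_per_step * (t_r + t_w))


v = max_valid_pages(256, 2)
print(v, v / 256)  # upper bound on the logical-to-physical ratio
# Average latencies quoted in the text, in milliseconds.
print(worst_case_write_latency(t_w=1.87, t_e=3.84, t_r=1.23, copies_per_step=2))
```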
Each flash operation in WAO-GC could potentially be blocked by a partial garbage
collection step from a different task. In the worst case, a task will be blocked by all the
other tasks. In our experiments, we used a uniform probability to determine the interfer-
ence from other tasks. We also show a percentage of the likelihood that a write operation
would trigger a partial garbage collection step. When λ is set to the maximum value in
Equation 3.13, partial garbage collection will occur after every write operation. However,


























Figure 3.5: Admission Control Simulation.
We use a percentage to show the effects of higher over-provisioning (WAO-GC 75 and
WAO-GC 50). For example, in our implementation, λ = 48.2%, which is 72.5% of the
maximum value for WAO-GC, so schedulable tasksets would be close to WAO-GC 75.
We also varied the size of the read and write requests, shown in Figures 3.6 and 3.7.
We measured the weighted schedulability [BBA10], which is the sum of the total
utilizations of the task sets that were schedulable divided by the sum of the total
utilizations of all task sets. The weighted schedulability compresses a three-dimensional
plot to two dimensions and places higher value on task sets with higher utilization.
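The weighted schedulability measure defined above reduces to a ratio of two sums. A minimal sketch follows; the function name and the (utilization, schedulable) input format are hypothetical.

```python
def weighted_schedulability(results):
    """Utilization-weighted fraction of schedulable task sets.

    results -- iterable of (total_utilization, schedulable) pairs, one per task set
    """
    results = list(results)
    total = sum(utilization for utilization, _ in results)
    if total == 0:
        raise ValueError("total utilization must be positive")
    schedulable = sum(utilization for utilization, ok in results if ok)
    return schedulable / total


# Failing the high-utilization set costs more than failing the low-utilization one.
print(weighted_schedulability([(0.2, True), (0.8, False)]))
print(weighted_schedulability([(0.2, False), (0.8, True)]))
```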
For various read request sizes, PaRT-FTL consistently shows higher schedulability
than RTGC and WAO-GC. When varying the write request sizes, PaRT-FTL has lower
schedulability than RTGC in general and lower schedulability than WAO-GC 50 when
wi > 10 as seen in Figure 3.7. This is due to PaRT-FTL’s lower write throughput as writes
are only occurring on 4 out of the 16 flash dies. Also, note that WAO-GC 50 has much
lower space utilization compared to our configuration for PaRT-FTL. The peaks at write
size equaling 18, 21 and 33 in Figure 3.7 are due to how the task sets are generated. For






















































Figure 3.7: Weighted Schedulability vs Write Request Size.
write period generated is twice as large as the write period generated when wi = 15, thus
increasing schedulability.
3.3.2 Hardware Experiments
We also implemented PaRT-FTL and WAO-GC pagemap FTL on the OpenSSD Cosmos
board [SJLK14]. The Cosmos board is connected via an external PCIe cable to a PC with
an ASRock Z68 PRO3-M Motherboard and a 3.10 GHz Intel Core i3-2100 CPU running
the Quest real-time operating system [WLMD16].
The experimental setup is as follows with both random accesses and writes: four tasks
make 12-page write requests every 60 msec, and four tasks make 3-page read requests
every 15 msec. Note that in the following experiment, a task either reads or writes. The
SSD is initially only written to so that the effects of garbage collection on write latency
can be observed.
         Tasks   Request Size   Period
Write    4       12 pages       60 msec
Read     4       3 pages        15 msec
Table 3.3: Experimental setup.
3.3.2.1 PaRT-FTL
The experimental setup is given in Table 3.3, with both random accesses and writes.
The XOR implementation using NEON instructions takes 1.2 ms. We verified that the data
is correct with the encoding and decoding functions to rebuild read pages. The following
experiments, however, do not include the overhead to compute the XORs. We assume that
in a production-ready system, computing the XORs will be built into the hardware. While
we assume this cost to be negligible in our experiments, it is accounted for in our real-time
task model.
The maximum write throughput for PaRT-FTL is 7.3 MB/s (Equation 3.8) with k =
12, |Fw| = 4 and the parameters in Table 3.4. The maximum read throughput is 76 MB/s
(Equation 3.9), with the worst-case read throughput equal to 6.3 MB/s (Equation 3.10). In
our experimental setup in Table 3.3, we have 4 tasks each writing 12 pages every 60 msec,
resulting in a write throughput of 6.25 MB/s. With 4 tasks, each reading 3 pages every 15
msec, the read throughput is also 6.25 MB/s.

        PaRT-FTL Fw   PaRT-FTL Fr   WAO-GC
|F|     16            16            16
fr      -             12            16
tr      0.375 msec    1.23 msec     1.23 msec
tw      1.33 msec     -             1.87 msec
te      3.84 msec     -             3.84 msec

Figure 3.9: PaRT-FTL read throughput.
The write throughput for 4 tasks each making 12-page write requests is plotted in
Figure 3.8. The read throughput for 4 tasks each making 12-page write requests and 4
tasks each making 3-page read requests every 15 milliseconds is plotted in Figure 3.9.
Note that the first few points are below average because the tasks are initialized in the
middle of the time slice.
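The 6.25 MB/s figures for the offered load follow directly from the workload in Table 3.3 and the 8 KB page size. The short check below assumes 1 MB = 1024 KB, which is the convention that reproduces the quoted value; the function name is hypothetical.

```python
PAGE_KB = 8  # flash page size stated in the text


def offered_throughput_mb_s(num_tasks, pages_per_request, period_ms, page_kb=PAGE_KB):
    """Aggregate request rate in MB/s, assuming 1 MB = 1024 KB."""
    kb_per_second = num_tasks * pages_per_request * page_kb * 1000 / period_ms
    return kb_per_second / 1024


print(offered_throughput_mb_s(4, 12, 60))  # write tasks
print(offered_throughput_mb_s(4, 3, 15))   # read tasks
```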
The response times of write and read requests are measured and plotted in Figure 3.10
and Figure 3.11, respectively. This is the time it took to complete a 12-page write request
or a 3-page read request. The device driver inserts single-page requests into the request
buffer. Latency is measured in the FTL as the time from when a page read or write is put




















Figure 3.10: PaRT-FTL write request latency for 4 tasks each making 12-page write re-



















Figure 3.11: PaRT-FTL read request latency for 4 tasks each making 3-page read requests
















































Figure 3.13: WAO-GC read throughput.
passes PaRT-FTL’s admission control, so no deadlines are missed.
3.3.2.2 WAO-GC Pagemap FTL
The same experimental setup shown in Table 3.3 is run with WAO-GC pagemap FTL. The
maximum write throughput for WAO-GC is 15.5 MB/s and read throughput is 13.4 MB/s
with parameters in Table 3.4.
As in PaRT-FTL, the response times of write and read requests are measured and plot-
ted in Figure 3.14 and Figure 3.15, respectively. As expected from the high read latency,
some read requests using WAO-GC miss deadlines, as shown in Figure 3.15 by the data
above the 15 millisecond horizontal line.
PaRT-FTL significantly reduces read and write latencies compared to previous real-time
FTL approaches that use partial garbage collection (Figures 3.16 and 3.17). This
is because read requests are never blocked by a busy flash die that is servicing a write
request and potentially performing garbage collection. As shown in Figure 3.17, the
maximum write latency with PaRT-FTL is 20% of the maximum write latency with WAO-GC
pagemap FTL. The maximum read latency with PaRT-FTL is 35% of the maximum








































Figure 3.15: WAO-GC read request latency for 4 tasks each making 3-page read requests
every 15 msec. Note that not all requests have a response time within 15 msec, so some

















































Figure 3.17: Flash page latencies of PaRT-FTL and WAO-GC pagemap FTL.
worst-case latency of RTGC has been measured and compared in previous work [ZLW+15].
PaRT-FTL does sacrifice being able to write at a higher throughput since it partitions
flash dies into read and write dies. Whereas WAO-GC could write in parallel to all 16 flash























Figure 3.18: The effects of garbage collection on write bandwidth with WAO-GC. Garbage
collection is initiated after 7 seconds.
requests. WAO-GC guarantees that a request will not be blocked by more than a partial
garbage collection step given that the inter-arrival time of write requests does not exceed
the time it takes to do a partial garbage collection step. However, there is no admission
control, so the effects of garbage collection can be seen in Figure 3.18. The workload
consists of two streams, each sending requests of 24 pages as fast as possible. Initially, writes
occur at 27 MB/s. After 7 seconds, garbage collection starts and throughput fluctuates,
dropping as low as 14 MB/s.
PaRT-FTL is suited to time-critical systems that require low latency guarantees, whereas
other approaches may be more suitable for tasks that need high throughput and can
tolerate some missed deadlines. Our experiments are also limited by our hardware. Modern
SSDs have much higher throughput and more parallelism such as more channels, ways per
channel, and flash dies per flash chip, which would enable higher throughput.
PaRT-FTL is a partitioning flash translation layer that is motivated by the emerging
need for bounded and low latency access to solid state storage in time-critical systems. As
Figure 1.1 shows, PaRT-FTL optimizes for low latency but has lower device utilization and
higher GC overhead due to encoding data. Because write requests are partitioned to 25%
of the flash chips, write throughput is also lower compared to other methods. PaRT-FTL
is a design that exploits parallelism. However, when a logical page is read by rebuilding
it from the encoding pages, multiple logical pages cannot be read in parallel, so PaRT-FTL
is not as high as Telomere on the parallelism axis. We demonstrate the performance of PaRT-FTL
by comparison to previous work in real-time FTL design. Empirical results show that we
are able to significantly reduce read and write latency.
Chapter 4
Telomere
Modern SSDs achieve high data transfer rates due to their massive internal parallelism.
However, out-of-place updates for flash memory incur garbage collection costs when valid
data needs to be copied during space reclamation. The root cause of this extra cost is that
the SSD cannot accurately determine data lifetime and group together data that
expires before the space needs to be reclaimed. Real-time systems found in autonomous
vehicles, industrial control systems, and assembly-line robots store data from hundreds
of sensors and often have predictable data lifetimes. These systems require guaranteed
high storage bandwidth for read and write operations by mission-critical tasks. In this
work, we depart from the traditional block device interface to guarantee the high
throughput needed to process large volumes of data. Using data lifetime information from the
application layer, our proposed real-time design, called Telomere, is able to intelligently
lay out data in NAND flash memory and eliminate valid page copies during garbage
collection. Telomere’s real-time admission control is able to guarantee tasks their required
read and write operations within their periods and has a 30% higher throughput with 10%
over-provisioning compared to pre-existing techniques.
Garbage collection incurs substantial overhead in flash memory when data with different
expiration times are stored together in the same flash block. When a block needs to be
reclaimed, the valid pages in the block have to be copied to another block. This is why
traditional FTLs often have a tunable trade-off between storage and throughput utilization,
as seen in Figure 4.1. Traditional flash designs copy valid pages during garbage collection
resulting in a WAF > 1, which directly affects throughput utilization. The higher the
WAF value, the lower the throughput utilization. The WAF value can be decreased by
increasing over-provisioning, resulting in lower storage utilization. If the SSD is able to
store data such that all the pages in a block become invalid before it needs
to be reclaimed, garbage collection overhead would only be a block erasure. Data lifetime,
however, is application-specific. Drive-managed models supporting a block I/O interface
do not have the application-specific knowledge needed to make the most informed decision
for data layout. Instead, they must infer data hotness, which often cannot be accurately
predicted. On the other hand, host-managed models cannot provide a real-time solution
when wear-leveling is handled in the device. Our design bridges this gap: the application
provides knowledge of the data lifetime and the throughput requirements, and the
drive guarantees predictable, high throughput and storage utilization.
Figure 4.1: In most flash designs, there is a tunable trade-off between storage and throughput
utilization. Telomere, however, is able to achieve both high throughput and high storage
utilization.
Contributions. We present the design of a drive-managed SSD system, called Telomere,
which allows data lifetime information to be passed from the application to the device.
This is beneficial to real-time systems featuring numerous high bandwidth sensors (e.g.,
cameras) that must store, retrieve and process data according to throughput requirements,
and which must replace stale data once it has expired.
We assume a real-time model where each application defines a set of periodic tasks
with a pre-determined data lifetime per task. These parameters of each task are passed to
the operating system on the open() system call and admission control tests determine if the
task is accepted. If the system cannot guarantee the throughput and lifetime requirements
of a task, the open() syscall fails for that particular task. The contributions are as follows:
• a new interface that provides the drive with information on data lifetime;
• block allocation algorithms, Single Task Placement and Shared Task Placement, for
partitioning blocks among different tasks;
• storage admission control for each block allocation algorithm;
• throughput admission control to guarantee performance.
By properly collocating data, garbage collection overhead is minimized and thus,
Telomere is able to guarantee that tasks meet their deadlines and admits task sets with
high read and write throughput.
4.1 Design
When the FTL has information about data lifetime, it is able to efficiently store data such
that garbage collection overhead is only a block erasure. Our proposed design, called
Telomere, is a NAND flash storage system for sensor data in real-time systems. As
depicted in Figure 4.2, Telomere stripes data across parallel units in the SSD and eliminates
write amplification by using a block expiration time³. Data has a lifetime value and each
block is associated with a future timestamp at which point all the data inside that block
will become invalid. Telomere places data with similar expiration times together in the
same block so that garbage collection overhead is minimized. It eliminates the need to
copy valid pages to a new block in order to reclaim the invalid pages in that block.
Instead, pages with similar expiration times and update frequencies are stored together and
the block is reclaimed when all the pages become invalid. Thus, garbage collection in
Telomere is simply a block erasure.
Figure 4.2: A Telomere compliant application defines a set of periodic tasks with a
pre-determined data lifetime per task. These parameters are passed to the OS on the open()
syscall and admission control tests determine if the task is accepted. Then, Telomere
places data with similar expiration times together in the same block so that garbage
collection overhead is minimized.
³ The name Telomere is inspired by the biological structures that cap the ends of chromosomes which are
truncated during cell division. Over time, the telomere ends become shorter; when they get too short, the
cell can no longer divide and “expires” or dies.
4.2 Real-Time Task Model
On open(), the user specifies a periodic real-time task τi with a read period $T^r_i$, a write
period $T^w_i$, the maximum number of flash pages read ri and written wi during their
respective periods, and a data lifetime $l^w_i$. The data lifetime is the number of write periods after
the data is written during which the data is valid. After $l^w_i$ periods, the data expires and is
no longer accessible. A periodic task releases jobs at regular intervals based on its period.
Each job has a request time and a deadline. We assume the deadline equals the period.
Symbol       Definition
λ            Device utilization: logical over physical address space
α            Over-provisioning: extra capacity over user capacity
τi           A periodic task
$T^r_i$      Read period for τi
ri           Number of pages read in one read period
$T^w_i$      Write period for τi
wi           Number of pages written in one write period
$l^w_i$      Number of write periods before the data expires
g            Read and write page granularity
P            Number of pages in a flash block
$S_{pages}$  Number of pages in the SSD
$S_{blocks}$ Number of blocks in the SSD
$S_{chips}$  Number of flash chips in the SSD
tr           Flash page read latency
tw           Flash page write latency
te           Flash block erasure latency
Table 4.1: Telomere symbol definitions.
During the open() syscall, the admission control executes to ensure there are enough
resources for the new task τi to run. The admission control consists of two parts. First, we
present the storage admission control to guarantee the availability of free pages to write
the data from τi. Next, we provide the throughput admission control to guarantee the read
and write throughput requirement of τi.
4.3 Storage Admission Control
To guarantee that there are enough free pages available to write new data, we need to
bound the number of blocks used. To simplify the problem, we assume that the number of
free pages needed by a task is available at the beginning of each period. Also, to ensure
that the data can be read regardless of when the CPU task is scheduled, we add an extra
period to the expiration time. For a task τi with write period $T^w_i$, we can calculate the time
a write request starts s and the time those pages expire e as follows:
\[ s = \lfloor t / T^w_i \rfloor \cdot T^w_i \tag{4.1} \]





We present two block allocation algorithms to bound the number of blocks needed for
a task set. The Single Task Placement (SiP) allocates different blocks to each task. The
Shared Task Placement (SharP) groups together tasks with similar data lifetime and shares
blocks among the group of tasks.
4.3.1 Single Task Placement (SiP)
SiP allocates independent blocks to store the incoming write jobs from each task. For
example, assume that each block contains 8 pages. Let task τ1 request 1 page write every
2 time units that expires after 3 periods, and task τ2 request 2 pages of data written every
3 time units that expire after 2 periods.
To calculate the number of blocks to allocate to a task, we need to compute how much
valid data the task stores and how many extra pages are stored before the block containing
the oldest data expires. A block expires when all of its pages become invalid.
For example, in Figure 4.3, the maximum number of valid pages stored by task τ1 at
any point in time is 4 pages. Let wi be the number of pages written in one period, and $l^w_i$
be the number of periods before the data expires. K(τi) is the maximum number of valid
pages stored by τi, defined as follows:
\[ K(\tau_i) = w_i \cdot (l^w_i + 1) \tag{4.3} \]
Figure 4.3: Telomere Single Task Placement. A block only contains data written by one
task. τ1 writes data to blocks B1 and B2. τ2 writes data to blocks B3 and B4. No block
contains data from both tasks.
A block is alive if it contains valid data. In order for the block storing the oldest data
to expire, τ1 will write at most another block of data. For example, in Figure 4.3, task τ1
first writes 4 valid pages, but in order for block B1 to expire, it will write at most another
block, or 8 pages, of data. At t = 22, all the data in B1 are expired, so B1 can be erased.
The maximum number of pages that will be written during the time of a block erasure
is ($w_i \cdot \lceil t_e / T^w_i \rceil$), where te is the time it takes to perform a block erasure. Let P be the
number of pages in a block. The total number of pages needed by a task τi under SiP is
therefore ($K(\tau_i) + P + w_i \cdot \lceil t_e / T^w_i \rceil$). Since a block is allocated to a task and is not shared
among tasks, SiP takes the ceiling of the total number of pages over P. We assume in the
following equation that the number of pages written in one period is less than the size of
a flash block (wi < P). When wi ≥ P, the pages in blocks that store data from a single
request will either be all valid or all invalid. This is the trivial case and those blocks are
simply added to the total number of blocks. H1(τi) is the number of blocks needed by task
τi with read and write granularity g = 1. It is defined as follows:
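\[ H_1(\tau_i) = \left\lceil \frac{K(\tau_i) + P + w_i \cdot \lceil t_e / T^w_i \rceil}{P} \right\rceil \tag{4.4} \]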
Equation 4.4 assumes a read and write request size of one page. Oftentimes, requests
of multiple pages are striped across flash chips in order to improve throughput. We extend
the calculation to a granularity size of g pages. Similar to our previous assumption when
g = 1, we assume in the following equation that the number of pages written in one period
is less than the granularity multiplied by the size of a flash block (wi < g·P ). Hg(τi) is the
number of blocks needed by task τi defined as the following:
Hg(τi) = g ·
(⌈








Without the above assumption on wi, K, K′ and Hg are defined as follows:
\[ K(\tau_i) = [w_i \bmod (g \cdot P)] \cdot (l^w_i + 1) \tag{4.6} \]





· (lwi + 1) (4.7)
Hg(τi) = g ·
(⌈









The total number of blocks needed for a task set, D, is the sum of the blocks needed
for each task. Let Sblocks be the total number of physical blocks. The storage admission
control test is:
\[ \sum_{\tau_i \in D} H_g(\tau_i) \le S_{blocks} \cdot \lambda \tag{4.9} \]
The benefit of SiP is the simplicity in its implementation and, as we show in Sec-
tion 4.5, its storage cost is comparable to the more complicated SharP algorithm when the
size of the task set is small.
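The SiP block count for the simplest case (g = 1 and wi < P) can be sketched from the description above: the valid pages K(τi), one extra block of writes, and the pages written during an erasure, rounded up to whole blocks. The class and function names are hypothetical, and the erasure time of 0.5 time units in the example is an assumed value, since the running example does not state one.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    write_period: float    # T^w_i
    pages_per_period: int  # w_i
    lifetime_periods: int  # l^w_i


def max_valid_pages(task):
    """K(tau_i) = w_i * (l^w_i + 1), Equation 4.3."""
    return task.pages_per_period * (task.lifetime_periods + 1)


def sip_blocks(task, pages_per_block, t_erase):
    """Blocks needed by one task under SiP with g = 1 and w_i < P."""
    if task.pages_per_period >= pages_per_block:
        raise ValueError("this sketch only covers w_i < P")
    written_during_erase = task.pages_per_period * math.ceil(t_erase / task.write_period)
    total_pages = max_valid_pages(task) + pages_per_block + written_during_erase
    return math.ceil(total_pages / pages_per_block)


def sip_admit(tasks, pages_per_block, t_erase, total_blocks, utilization):
    """Storage admission: the summed block demand must fit in S_blocks * lambda."""
    needed = sum(sip_blocks(task, pages_per_block, t_erase) for task in tasks)
    return needed <= total_blocks * utilization


# Running example: 8 pages per block; tau1 writes 1 page every 2 time units with
# lifetime 3, tau2 writes 2 pages every 3 time units with lifetime 2.
tau1, tau2 = Task(2, 1, 3), Task(3, 2, 2)
print(sip_blocks(tau1, 8, 0.5), sip_blocks(tau2, 8, 0.5))
```

With these inputs each task needs two blocks, matching the four blocks shown for SiP in Figure 4.3.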
4.3.2 Shared Task Placement (SharP)
SharP uses the observation that since a block is erased when all the data pages expire, data
with similar start and expiration times should be stored together. For example, Figure 4.4
shows how it is more efficient to share blocks among the write requests of tasks τ1 and
τ2 compared to allocating independent blocks for each task. When a block stores write
requests from both tasks, only 3 blocks are needed compared to 4 blocks needed in SiP.
To determine the number of blocks to allocate to a set of tasks τi, ..., τj, SharP first
computes the percentage of a block that will be written to by each task. For example, in
Figure 4.4, task τ1 writes to 3-4 pages of a block and task τ2 will write to 4-5 pages of
a block. SharP determines how much of a block each task writes to based on the task’s
throughput requirement ($w_i / T^w_i$) compared to the total write throughput requirement of the
set of tasks. Q(τi, {τh...τk}) is the fraction of write throughput requirement of τi over the










Figure 4.4: Telomere Shared Task Placement. A block contains data written by multiple
tasks. Blocks B1, B2 and B3 are shared among tasks τ1 and τ2.
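The share of a block that each task fills under SharP can be illustrated with the running example. The sketch below assumes that Q is the task's write rate wi / T^w_i divided by the sum of the write rates of all tasks sharing the block, which is how the text describes it; the function name and the tuple format are hypothetical.

```python
def throughput_share(index, tasks):
    """Fraction of the group's write throughput that comes from tasks[index].

    tasks -- list of (pages_per_period, write_period) pairs for the sharing group
    """
    rates = [pages / period for pages, period in tasks]
    return rates[index] / sum(rates)


# tau1: 1 page every 2 time units; tau2: 2 pages every 3 time units; 8 pages per block.
group = [(1, 2), (2, 3)]
pages_per_block = 8
for i in range(len(group)):
    share = throughput_share(i, group)
    print(f"task {i + 1}: share {share:.3f}, about {share * pages_per_block:.2f} pages per block")
```

The shares come out to 3/7 and 4/7, i.e. between 3 and 4 pages of an 8-page block for τ1 and between 4 and 5 pages for τ2, in line with the figures quoted in the text.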
Since each task can no longer write to all the pages in a block, the denominator in
Equation 4.5 is multiplied by Q(τi, {τh...τk}). Again, we assume that wi < g·P. J(τj, {τi...τk})
is the number of blocks needed by task τj while sharing blocks with other tasks in the set. It
is defined as follows:
J(τj, {τi...τk}) = g ·
(⌈








The total number of blocks needed by the set of tasks is the maximum J(τj, {τi...τk})
of all the tasks⁴.
\[ H_g(\{\tau_i ... \tau_k\}) = \max_{\forall \tau_j} J(\tau_j, \{\tau_i ... \tau_k\}) \tag{4.12} \]
Again, without the assumption on wi, the equations are as follows:








J(τj, {τi...τk}) = g ·
(⌈
K(τj) + wj · ⌈te/T
w
j ⌉





SharP partitions the tasks in a task set such that each partition will use separate blocks
to store data. First, the tasks are sorted by the amount of time the data is alive, (Twi ·
(lwi + 1)). The algorithm keeps track of a current partition. For each task, it determines
whether or not that task should be added to the current partition. The pseudocode is shown
in Algorithm 1.
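The partitioning idea described above can be illustrated with a short Python sketch. It is a simplified greedy reading of the prose (add the next task to the current partition only if sharing does not cost more blocks than separating), not a transcription of Algorithm 1. The `Task` fields and the cost function `toy_blocks` are hypothetical stand-ins for the task parameters and for H; the toy cost simply assumes a shared block is held until its longest-lived data expires.

```python
from dataclasses import dataclass
from math import ceil
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Task:
    name: str
    period: float    # write period T^w (hypothetical time units)
    lifetime: int    # data lifetime l^w, in periods
    pages: int       # pages written per period, w


def alive_time(task: Task) -> float:
    """Sort key from the text: T^w * (l^w + 1)."""
    return task.period * (task.lifetime + 1)


def toy_blocks(group: Sequence[Task], pages_per_block: int = 8) -> int:
    """Hypothetical stand-in for H: blocks needed when the group shares blocks.

    Assumes every shared block stays occupied for the longest lifetime in
    the group, so mixing very different lifetimes is penalized.
    """
    if not group:
        return 0
    longest = max(t.lifetime for t in group) + 1
    return ceil(sum(t.pages for t in group) * longest / pages_per_block)


def partition_tasks(
    tasks: Sequence[Task],
    blocks: Callable[[Sequence[Task]], int] = toy_blocks,
) -> List[List[Task]]:
    """Greedily grow the current partition while sharing is no more costly."""
    ordered = sorted(tasks, key=alive_time)
    partitions: List[List[Task]] = []
    current: List[Task] = []
    for task in ordered:
        if not current:
            current = [task]
            continue
        shared = blocks(current + [task])
        separate = blocks(current) + blocks([task])
        if shared <= separate:
            current.append(task)
        else:
            partitions.append(current)
            current = [task]
    if current:
        partitions.append(current)
    return partitions
```

With two tasks of equal lifetime and one much longer-lived task, this sketch groups the first two and isolates the third, which mirrors the intuition that data with similar expiration times should share blocks.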
4.4 Throughput Admission Control
In addition to the storage admission control, we also need to guarantee the read and write
throughput requirements of the tasks. Because NAND flash performs out-of-place updates, read
and write throughput can fluctuate when garbage collection interferes with I/O.
A read request with granularity g can be performed in parallel when g > 1. The read







where tr is the time it takes to read a flash page.
A write request is first written to a buffer and later written to flash. Since there are
no in-place updates in flash memory, a write request consisting of multiple pages can be
distributed to different flash chips. The write capacity C^w_i for task τi is the following:
Algorithm 1 Telomere SharP partitions tasks such that each partition will use separate
blocks to store data.
1: procedure TELOMERESHARP(tasks)
2: Sort tasks by increasing (Tw · (lw + 1))
3: idx = 0
4: i = 0
5: tlen = tasks.length
6: while i < tlen− 1 do
7: t1 = tasks[i]
8: t2 = tasks[i+ 1]
9: opt1 = H(tasks[idx : i+ 1]) +H(t2)
10: opt2 = H(tasks[idx : i]) +H([t1, t2])
11: opt3 = H(tasks[idx : i]) +H(t1) +H(t2)
12: minBlks = min(opt1, opt2, opt3)
13: if opt1 == minBlks then
14: i = i+ 1
15: else
16: Add tasks[idx : i] to partitions
17: if opt2 == minBlks then
18: idx = i
19: i = i+ 1
20: else
21: Add [t1] to partitions
22: idx = i+ 1




27: opt1 = H(tasks[idx : tlen− 1])
28: opt2 = H(tasks[idx : tlen− 1]) +H(tasks[tlen− 1])
29: if opt1 < opt2 then
30: Add tasks[idx : tlen] to partitions
31: else
32: Add tasks[idx : tlen− 1] to partitions









where tw is the time it takes to write a flash page.



























where te is the latency for block erasure.
A schedulability test is invoked to ensure that all the read and write requests are schedu-
lable. We use Earliest Deadline First (EDF). The largest non-preemptive period is the
longest flash operation that takes place on write flash chips, which is a block erasure. The
feasibility of the real-time write requests can be verified by the following equation derived
from Theorem 2 in Baker’s Stack Resource Policy work [Bak90] which presents a suffi-

















) ≤ 1 (4.18)






The experimental evaluation consists of three sections: simulation-based schedulability
tests, event-driven simulation to measure garbage collection overheads, and hardware
experiments on the OpenSSD Cosmos board. The admission control simulation shows
that Telomere has a higher feasible throughput utilization with an additional storage cost.
The event-driven simulation presents Telomere’s low GC overhead. We also examine the
endurance gain experienced by flash blocks when running Telomere with various wear-
leveling techniques. Hardware experiments with the OpenSSD Cosmos board show that
Telomere is able to maintain high, predictable throughput.
4.5.1 Admission Control Simulations
SSD Parameters. We assume the following parameters for an SSD with 1 TB of storage.
There are 16 flash channels, each containing 4 flash chips. There are 4096 blocks per flash
chip, 128 pages in a block, and each page is 32 KB. The average page read time is 0.2
msec, the average page write time is 0.7 msec and the average block erasure time is 2.7
msec.
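As a quick sanity check on these parameters, the following Python sketch recomputes the drive capacity from the geometry given above. The per-chip page rates at the end are a rough assumption of ours (they ignore channel contention and garbage collection) and are not the bandwidth definitions used in the admission control equations.

```python
# SSD geometry from the text
CHANNELS = 16
CHIPS_PER_CHANNEL = 4
BLOCKS_PER_CHIP = 4096
PAGES_PER_BLOCK = 128
PAGE_KIB = 32

chips = CHANNELS * CHIPS_PER_CHANNEL                     # 64 flash chips
total_pages = chips * BLOCKS_PER_CHIP * PAGES_PER_BLOCK
capacity_tib = total_pages * PAGE_KIB / 2**30            # KiB -> TiB

# Average latencies from the text, in milliseconds
READ_MS, WRITE_MS, ERASE_MS = 0.2, 0.7, 2.7

# Assumption: raw per-chip page rates if a chip did nothing else
reads_per_sec_per_chip = 1000 / READ_MS
writes_per_sec_per_chip = 1000 / WRITE_MS

print(f"{chips} chips, {total_pages} pages, {capacity_tib} TiB")
```

The geometry multiplies out to exactly 1 TiB, which matches the stated 1 TB drive, and the 64 chips match the granularity of 64 used when data is striped across all chips.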
Data Generation Parameters. 50 random task sets were generated with two varying total
utilizations, U^s for storage and U^b for bandwidth, using the UUniFast algorithm [BB05]
with values ranging from 0.01 to 1.0 in 0.01 increments. Each task set contains 500
tasks. Each task τi has a storage utilization U^s













b). The read period T^r_i and write period T^w_i for each task
are calculated based on the number of pages read and written per period. Data is striped
across all flash chips, so the read and write granularity is 64. Let BRmax be the read band-


















The lifetime of the data depends on the storage utilization of the task and the number
of pages written by the task per period to guarantee that the SSD has the capacity to store







Telomere SiP and SharP. The storage cost of Telomere SiP and SharP depends on the
read and write granularity and the number of tasks in a task set. In Figures 4.5 and 4.6, we
show the maximum storage utilization out of a 1 TB drive with over-provisioning λ = 0.90
as we vary these parameters.
Figure 4.5 shows how the read and write granularity affects the storage cost and
throughput of Telomere SiP and SharP. When granularity increases, more pages are striped
across flash chips. This means that multiple pages can be read and written in parallel,
which leads to increased throughput. The striping of data also results in increased storage
cost for Telomere SiP and SharP. Figure 4.5 shows the maximum storage and throughput
utilization at which 100% of the task sets are schedulable. We do not plot the gradi-
ent from 100% to 0% schedulable because it is very small (within 1% to 2% storage or
throughput utilization). As granularity increases, both throughput utilization and storage
cost increase. The storage cost of Telomere SharP grows more slowly than that of Telomere SiP. Note
that in our simulation, there are 64 flash chips, so there will not be any throughput or
storage cost increase for granularity g > 64.
Figure 4.6 shows the storage cost of Telomere SiP and SharP as the number of tasks
in each task set varies. We plot the storage and throughput utilization at which 100% of
the task sets are schedulable for task sets with 100, 250, 500 and 1000 tasks. As expected,
the storage cost increases as the number of tasks increases. However, the storage cost of
Telomere SharP grows at a much slower rate than that of Telomere SiP. The number of
tasks in a task set significantly affects the storage cost of Telomere SiP since each task is
assigned independent blocks to store its data. The slight decrease in throughput utilization




























Figure 4.5: Storage and throughput utilization with different read and write granularity.
Lines indicate the storage and throughput utilization at which 100% of the task sets are
schedulable. As granularity increases, both throughput utilization and storage cost in-
































Figure 4.6: Storage and throughput utilization with varying number of tasks. Lines indi-
cate the storage and throughput utilization at which 100% of the task sets are schedulable.
The number of tasks in a task set significantly affects the storage cost of Telomere SiP
since each task is assigned independent blocks to store its data.
which results in a larger value for the first term (te / min(T)) in Equation 4.18.
Comparison to Previous Work. We compare Telomere to a real-time method, Worst-case
and Average-case joint Optimization for Garbage Collection (WAO-GC) [ZLW+15].
We do not compare with Partitioned Real-Time FTL [MW18] since it partitions writes and
garbage collection onto one fourth of the flash chips in order to provide low, predictable
read latency. Its write bandwidth is therefore significantly lower than that of the methods
we compare against. Due to the limited number of real-time FTLs to compare against, we also
compare Telomere to non-real-time work by Stoica et al. [SA13] on improving flash write
performance using update frequency. Stoica’s work partitions pages into sets with update
frequencies that decrease in powers of two. When a page becomes cold, it moves to a
set with a lower update frequency. Similarly, when it becomes hotter, it moves to a set
with a higher update frequency. Each set is stored as a log structure and this algorithm
is called MultiLog data placement. Stoica et al. also provide algorithms for estimating
page update frequency, including an Oracle algorithm that knows the exact page update
frequency, and thus, has the lowest garbage collection overhead. We compare Telomere
to MultiLog-Oracle. In addition, we apply bank reservation [HC18] to provide throughput
admission control based on the average observed WAF value of MultiLog-Oracle for two
different workloads. Stoica’s work is selected for comparison because the motivation is
similar to ours in that data with similar update frequencies should be placed together. Real-
time FTLs often sacrifice bandwidth for predictability. We show that Telomere is able to
achieve better throughput under certain over-provisioning even compared to non-real-time
methods.
Comparison to WAO-GC. WAO-GC [ZLW+15] builds upon the partial garbage collection
technique. In addition to real-time bounds, WAO-GC shows that it is able to achieve better
average-case performance than Guarantee FTL [CG08] and Real-time FTL [QWLS12] by
using over-provisioning to delay garbage collection. WAO-GC derives a maximum λ value
using SSD parameters to guarantee that a page write will only be blocked by one partial
garbage collection step. With our SSD parameters, the upper bound of λ for WAO-GC is
0.74.
WAO-GC uses page-level mapping. Thus, the storage admission control needs can be
calculated as a function of its over-provisioning. Specifically, the number of pages needed






i + 1)] ≤ λ·Spages (4.22)
Figures 4.7, 4.8 and 4.9 show the maximum throughput and storage utilizations at
which 100% of the task sets are schedulable at different over-provisioning levels. When
over-provisioning α = 0.10, 10% of the physical space is reserved for wear-leveling for
each method, which is why in Figure 4.7, task sets that need more than 10% of the storage
space are not schedulable. Specific wear-leveling techniques are out of the scope of this
work, but we reserve the same capacity of the SSD for wear-leveling purposes in order to
make a fair comparison of the different methods with flash endurance taken into account.
Note that because Telomere does not copy valid pages during GC, it needs less over-
provisioning than other methods that need to perform wear-leveling during GC to achieve
the same flash endurance. We can see that the storage cost of Telomere SiP is 19% and
Telomere SharP is 6%. Note that the storage utilization of WAO-GC cannot be higher than 0.74
since WAO-GC requires a certain amount of over-provisioning in order to guarantee that a
page write will only trigger one partial garbage collection step. This is why at α = 0.10,
WAO-GC rejects more task sets than Telomere SharP even with Telomere’s extra cost.
The Telomere throughput admission control is calculated with Equation 4.18. WAO-
GC does not provide admission control for multiple flash chips. It only guarantees that a
write request will be blocked by no more than a partial garbage collection step. We use







·(tw +max(te, tr + tw)) (4.23)
Cgi = 0 (4.24)
The schedulability simulation shows that WAO-GC starts rejecting task sets at 38%
throughput utilization whereas Telomere SiP and SharP are 100% schedulable at 83%
throughput utilization.
Comparison to MultiLog-Oracle. The MultiLog-Oracle algorithm can be integrated into
a page-level mapping FTL, so the storage admission control can be calculated using Equa-
tion 4.22. For the throughput admission control, because MultiLog-Oracle is not real-time,
we will rely on the observed WAF value for two different workloads and use bank reser-
vation [HC18] to provide a throughput admission control on average. Banks are dynami-
cally partitioned to service read and write requests and perform garbage collection based
on the read and write throughput specified. Unlike Telomere’s throughput admission con-
trol, bank reservation [HC18] does not guarantee that each task will be able to perform its
specified number of page reads and writes within its period. It only guarantees that a total
read and write throughput from all the tasks in the task set can be sustained on average.
α       Random        Zipf 80/20
0.10    WAF = 5.00    WAF = 2.00
0.30    WAF = 2.20    WAF = 1.50
0.75    WAF = 1.25    WAF = 1.13

























Figure 4.7: Storage and throughput admission control with α = 0.10. Lines indicate the
maximum utilization at which 100% of the task sets are schedulable. MultiLog-Oracle
method is plotted in dotted lines because the bank reservation throughput admission con-

















































Figure 4.9: Storage and throughput admission control with α = 0.75.
When over-provisioning α = 0.10, MultiLog-Oracle experiences an average write
amplification factor of WAF = 5.0 under a random workload. Since our generated random
workload may differ from theirs, we also use their lower average write
amplification factor for a skewed Zipf 80/20 distribution, WAF = 2.0. Table 4.2
summarizes the WAF for MultiLog-Oracle at different over-provisioning levels. Note
that these WAF values are averages obtained from empirical observation. They are
not worst-case WAF values. Therefore, the MultiLog-Oracle method is plotted in
dotted lines.
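To make the role of these averages concrete, the sketch below applies the definition of write amplification (flash page writes divided by host page writes) to the WAF values in Table 4.2. It only shows what fraction of raw flash write bandwidth remains for host writes at a given average WAF. It is not the bank reservation calculation of [HC18], and the dictionary layout and function name are our own.

```python
# Average WAF values reported for MultiLog-Oracle (Table 4.2)
WAF = {
    0.10: {"random": 5.00, "zipf_80_20": 2.00},
    0.30: {"random": 2.20, "zipf_80_20": 1.50},
    0.75: {"random": 1.25, "zipf_80_20": 1.13},
}


def host_write_fraction(waf: float) -> float:
    """Fraction of raw flash write bandwidth left for host writes.

    Follows from WAF = flash writes / host writes; valid only on average,
    since the table holds observed averages rather than worst cases.
    """
    if waf < 1.0:
        raise ValueError("WAF cannot be below 1")
    return 1.0 / waf


for alpha, by_workload in WAF.items():
    for workload, waf in by_workload.items():
        print(f"alpha={alpha:.2f} {workload}: {host_write_fraction(waf):.2f}")
```

At α = 0.10 with the random workload, for example, only one fifth of the raw write bandwidth is available to host writes on average.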
Figures 4.8 and 4.9 show the storage/throughput trade-off for MultiLog-Oracle at dif-
ferent levels of over-provisioning. Lower storage utilization in MultiLog-Oracle leads to
higher throughput. The inverse relationship between throughput and storage utilization is
because to achieve higher throughput, the write amplification factor has to be lower, which
means more over-provisioning, which leads to lower storage utilization. Telomere SharP
does not suffer from a decrease in throughput when storage utilization increases.
When over-provisioning α = 0.75, Figure 4.9 shows that MultiLog-Oracle for a work-
load with a skewed Zipf 80/20 distribution has an observed WAF that admits a higher
throughput using bank reservation compared to Telomere. This is because bank reserva-
tion [HC18] only guarantees that a total read and write throughput from all the tasks in
the task set can be sustained on average whereas Telomere’s throughput admission control
guarantees that each task will be able to perform its specified number of page reads and
writes within its period.
MultiLog-Oracle, like traditional FTLs, has a tunable storage/throughput trade-off.
It can achieve either high storage utilization with low throughput utilization, as seen in
Figure 4.7, or high throughput utilization with low storage utilization, as seen in Figure 4.9.
Telomere, as seen in Figure 4.7, can achieve both high throughput and storage utilization.
4.5.2 Event-Driven Simulator
We also measured the garbage collection overhead of WAO-GC compared to Telomere.
Figures 4.10 and 4.11 show that both the number of blocks erased and the number of pages
copied during garbage collection increase as the variance on the periods and lifetimes
increases. Our event-driven simulator has the following flash parameters: 16 flash chips,
32 blocks per chip, 64 pages per block, read latency of 2 time units, write latency of 7
time units, block erasures taking 27 time units, and over-provisioning λ = 0.90. We
generate 100 task sets, each containing 10 tasks. The lifetime of each task is drawn from a
Gaussian distribution with a mean of 200. The standard deviation is calculated as
(mean · v / 100), where v is the relative standard deviation. Each task uses the same
amount of storage s, so the request size is (s / lifetime). The period is calculated
from Equation 4.20, where the
throughput utilization of each task is drawn from a Gaussian distribution with a mean of
( 0.7
num tasks



























Figure 4.10: Graph shows the 25th, 50th, 75th percentiles, the minimum, maximum and
mean (dot) number of block erasures as the variance in lifetimes and periods increases. As
the variance in task periods increases, the number of page write requests in the simulation
time also varies, resulting in variance in the number of block erasures. Since WAO-GC



































Figure 4.11: Graph shows the 25th, 50th, 75th percentiles, the minimum, maximum and
mean (dot) number of GC page copies as the variance in lifetimes and periods increases.
Telomere does not need to copy valid pages during GC, while Pagemap (and WAO-GC)
copies increasingly more pages with task set lifetime variability.
In Figures 4.10 and 4.11, we plot the minimum, first quartile, median, mean, third
quartile and maximum number of block erasures and GC page copies for Telomere and
a page-level mapping FTL (Pagemap). Since WAO-GC does not have a data placement
policy, its garbage collection overhead is the same as Pagemap. WAO-GC has a lower
worst-case latency per page read or write compared to Pagemap due to partial garbage
collection, but its total garbage collection overhead is the same as Pagemap. Figure 4.10
shows that as the relative standard deviation for the task lifetimes and periods increases,
the number of block erasures dramatically increases for Pagemap. Figure 4.11 shows that
Telomere does not copy any pages when reclaiming a block. In Pagemap, as expected, the
number of pages copied during garbage collection grows as the relative standard deviation
for the task lifetimes and periods increases.
Telomere wear-leveling: We also measured the effectiveness of wear-leveling for Telom-
ere and Pagemap under different levels of over-provisioning using the Health Binning
wear-leveling technique [PT16]. Pletka et al. measured raw bit error rates (RBER) of the
worst page obtained from the blocks of a real consumer-level 16 nm MLC flash chip and
found that it is a function of the nominal endurance, defined as the program/erase cycle
(PEC) normalized to the manufacturer-specified block endurance [PKI+18]. The wear of
a 2D flash block of advanced age can be accurately modeled using the following log-log
model:
log10(Wb) = xb + yb·log10(E(b)) (4.25)
where E(b) denotes the PEC of block b normalized to the manufacturer-specified block
endurance, and xb and yb are parameters obtained from large-scale characterization and
are distinct for every block.
Pletka shared with us the xb and yb values of eight blocks that had been carefully se-
lected from a large number of characterized blocks to illustrate certain common properties.
           xb         yb
block 0    -2.5274    1.2942
block 1    -2.1112    2.3164
block 2    -2.7262    2.0544
block 3    -2.7096    1.2374
block 4    -2.1748    1.4439
block 5    -2.0623    2.0867
block 6    -2.3773    2.0620
block 7    -2.5166    1.7698
Table 4.3: Pletka’s block characterization.
They are shown in Table 4.3. As Table 4.3 shows, the maximum endurance varies widely
between blocks. Some blocks can sustain several times more PECs than
others before reaching the same ECC limit.
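The variability can be checked numerically with a small Python sketch that evaluates Equation 4.25 for the parameters in Table 4.3 and inverts it to find the normalized endurance at which each block reaches a given wear limit. The limit of 10^−2 is the RBER threshold this chapter uses for retiring a block; the function names are ours.

```python
from math import log10

# (x_b, y_b) per block, from Table 4.3
BLOCK_PARAMS = {
    0: (-2.5274, 1.2942),
    1: (-2.1112, 2.3164),
    2: (-2.7262, 2.0544),
    3: (-2.7096, 1.2374),
    4: (-2.1748, 1.4439),
    5: (-2.0623, 2.0867),
    6: (-2.3773, 2.0620),
    7: (-2.5166, 1.7698),
}


def wear(x_b: float, y_b: float, e: float) -> float:
    """Equation 4.25: log10(W_b) = x_b + y_b * log10(E(b))."""
    return 10 ** (x_b + y_b * log10(e))


def endurance_at(x_b: float, y_b: float, wear_limit: float = 1e-2) -> float:
    """Invert Equation 4.25: normalized PEC at which W_b hits wear_limit."""
    return 10 ** ((log10(wear_limit) - x_b) / y_b)


endurance = {b: endurance_at(x, y) for b, (x, y) in BLOCK_PARAMS.items()}
best = max(endurance, key=endurance.get)
worst = min(endurance, key=endurance.get)
print(f"best block {best}: {endurance[best]:.2f}x nominal endurance")
print(f"worst block {worst}: {endurance[worst]:.2f}x nominal endurance")
```

Under this model the most durable of the eight blocks tolerates more than three times as many normalized program/erase cycles as the least durable one before reaching the same wear limit.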
In the health binning wear-leveling technique, the health of a block is determined by
Equation 4.25. The hottest data is placed in the healthiest blocks, and cold data is placed
in less healthy blocks.
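The placement rule in the preceding paragraph can be illustrated with a toy rank-matching sketch in Python: streams are ordered from hottest to coldest, blocks from least to most worn, and the two orderings are paired. This is a deliberately simplified illustration, not the health binning procedure as specified in [PT16]; the hotness scores, wear values and names are hypothetical.

```python
from typing import Dict


def assign_by_health(
    stream_hotness: Dict[str, float],
    block_wear: Dict[str, float],
) -> Dict[str, str]:
    """Pair the hottest stream with the healthiest (least-worn) block.

    stream_hotness: higher value = updated more often (hypothetical score).
    block_wear:     estimated wear W_b per block; lower = healthier.
    """
    streams = sorted(stream_hotness, key=stream_hotness.get, reverse=True)
    blocks = sorted(block_wear, key=block_wear.get)
    return dict(zip(streams, blocks))


placement = assign_by_health(
    {"log": 100.0, "archive": 1.0, "index": 10.0},
    {"B0": 0.004, "B1": 0.001, "B2": 0.009},
)
print(placement)
```

Here the frequently rewritten "log" stream lands on the least-worn block B1, while the cold "archive" stream is placed on the most-worn block B2.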
We run the event-driven simulator for Telomere and Pagemap under 1) no wear-leveling,
where the free blocks queue is a first-in-first-out queue, and 2) RBER-balancing
wear-leveling, where the free blocks queue is sorted by the RBER of the blocks. Telomere
is also run with health binning. The Pagemap FTL is not run with health binning because
the WAO-GC pagemap FTL does not co-locate data with similar hotness. WAO-GC’s partial
garbage collection constraints require valid pages of a victim block to be written to the
same block as incoming write requests, regardless of data hotness. Thus, health binning
cannot be used, since it only works with FTLs that co-locate data with similar update
frequency. For the simulation, we generate 20 task sets for each λ value from 0.70 to
0.95 in increments of 0.05. Each task set contains 10 tasks that are generated in the same






















Figure 4.12: Cumulative Distribution Function of the measured RBER at the end of a























Figure 4.13: Endurance gain of different methods over pagemap with no wear-leveling
(Pagemap noWL) at different amounts of over-provisioning.
until 2% of the blocks reach a wear value of Wb = 10^−2, which is the RBER threshold
for declaring a block dead [ZZS+13]. Pletka observed that once a few percent of the
blocks have been retired as they reach the error correction capability of the ECC, write
amplification jumps abruptly, and the performance of the device drops suddenly [PT16].
Thus, we end our simulation when 2% of the total number of flash blocks have been retired.
Figure 4.12 shows the Cumulative Distribution Function (CDF) of the measured RBER
at the end of the simulation run for a task set with λ = 0.10 for Telomere and Pagemap with
different wear-leveling algorithms. We tested the health binning algorithm (HB), the
RBER-balancing wear-leveling algorithm (RBER) and no wear-leveling with a First-In-First-Out
free blocks queue (noWL). In the ideal case, the blocks would all wear out at the same
time: in the graph, the CDF for the RBER would be 0% for RBER less than 10^−2 and
then 100% at 10^−2. Telomere HB and Telomere RBER are both more effective at
wear-leveling than Telomere noWL, Pagemap RBER and Pagemap noWL. The endurance gain
is measured as the area between the Pagemap noWL curve and the curves of the other
methods. Figure 4.13 plots the average endurance gain of 20 task sets for each λ value
from 0.70 to 0.95. At λ = 0.85, there is an inflection point for Telomere HB and Telomere
RBER. When the device utilization is less than 85%, the endurance gain is less sensitive to
over-provisioning changes. Figure 4.13 shows that the endurance gain for Telomere noWL
decreases with lower device utilization. This is because as device utilization decreases,
Pagemap noWL copies fewer pages during garbage collection, resulting in fewer writes from
garbage collection relocation. Telomere, on the other hand, does not copy pages during
block reclamation, so its garbage collection overhead is the same across different
over-provisioning values. Thus, the endurance gain for Telomere noWL decreases at lower
device utilization levels due to increased endurance from Pagemap noWL.
4.5.3 Hardware Experiments
We use the OpenSSD Cosmos board [SJLK14] to implement three FTLs: Telomere, WAO-
GC, and Pagemap. The Cosmos board is connected via an external PCIe cable to a PC with
an ASRock Z68 PRO3-M Motherboard and a 3.10 GHz Intel Core i3-2100 CPU running
the Quest real-time operating system [WLMD16].
The parameters for the tasks are in Table 4.4. We ran the experiment with 21 blocks
per flash chip. The limit on the number of blocks per flash chip is due to time constraints








τ0 16 20 16 80 250
τ1 8 20 16 40 33
τ2 16 20 16 40 330
τ3 0 0 16 40 990
τ4 0 0 11 40 1980
τ5 0 0 5 40 1980
Table 4.4: Task set running on the OpenSSD Cosmos board.
We test 4 methods: Telomere SiP, Telomere SharP, WAO-GC and Pagemap FTL. Fig-
ures 4.14 and 4.15 show the 4 methods, 3 of which run at different write throughputs
due to the rejected tasks. Telomere SharP and Pagemap start out with the highest write
throughput, running all 6 tasks at 1800 pages per second. Telomere SiP’s storage admis-
sion control rejects τ5, so it runs at a lower throughput of 1680 pages per second. WAO-GC
rejects τ3, τ4 and τ5 due to insufficient logical address space and write throughput, so it
runs at 1000 pages per second. Telomere SiP, SharP and WAO-GC are all able to maintain
the throughput accepted by their admission control when read requests arrive at around
t = 97 as shown in Figure 4.15. The non-real-time Pagemap FTL, however, is unable
to maintain the write throughput and read throughput as the number of pages written per




























Figure 4.14: Write throughput for task set in Table 4.4 for Telomere SiP and SharP, WAO-
























Figure 4.15: Read throughput for task set in Table 4.4 for Telomere SiP and SharP, WAO-
GC and Pagemap FTL.
The out-of-place property of flash memory requires garbage collection to be performed
when reclaiming a block. When blocks contain data with different lifetimes, garbage
collection can incur long latency and cause throughput to drop and fluctuate. However, by
intelligently placing data such that all the pages in a block being reclaimed are invalid,
we can reduce the garbage collection overhead to a single block erasure. We present a
new interface that provides the drive with information about the lifetime of the data. With
this additional information, Telomere is able to achieve high device utilization,
throughput and parallelism as well as low GC overhead, as shown in Figure 1.1. Our results
show that Telomere’s real-time admission control is able to guarantee tasks their required
read and write operations within their periods and has a 30% higher throughput with 10%
over-provisioning compared to pre-existing techniques.
Chapter 5
Infinite Streams
Enterprise storage applications requiring high performance and low latency are moving to
flash memory, driven by demand and lower storage prices [PwC15, Net16, Cou16, Bra18].
I/O performance is critically important to cloud service providers. For example, a 2017
Akamai study shows that every 100-millisecond delay in website load time can hurt con-
version rates by 7 percent [Aka17]. Due to their massive internal parallelism, NVMe SSDs
support high data transfer rates, making them a great solution for shared storage in cloud
environments. However, satisfying the read and write throughput requirements of each ap-
plication is challenging for SSDs because garbage collection globally affects the I/O per-
formance of all applications. Bank reservation [HC18] is a recently proposed technique to
distribute throughput among applications in order to satisfy the throughput requirements
of each application. However, we discuss a pathological case where bank reservation fails
to guarantee throughput and propose additional admission control equations so that the
read and write throughput requirements can be guaranteed in the pathological case.
Problem Statement: Sensor networks have been widely used and applied in medical, in-
dustrial, agricultural and environmental monitoring [Pan13]. Infinite Streams is designed
for applications that have a priori knowledge of data lifetime, such as applications that col-
lect sensor data, e.g. temperature, light, sound and pressure. Unlike previous multi-stream
approaches where the application appends an n-bit stream ID with a write request or the
FTL uses an n-bit heat tracker for each logical page number, Infinite Streams does not
place a limit on the maximum number of streams. Instead, the number of streams grows
with the number of tasks. Similar to Telomere, each task writes pages with a fixed life-
time, meaning the pages will become invalid after a pre-determined time period. However,
unlike Telomere, Infinite Streams allows each task to specify a percentage of data that can
be retained and exist beyond its lifetime.
5.1 Task Model
The task model for Infinite Streams extends that of Telomere. Due to asymmetric
read and write performance in flash memory, a task specifies both a read and a write
throughput, a data lifetime during which the data can be accessed after it is written, and
a percentage of data that can be accessed beyond the lifetime value specified. Each task
has two pairs of submission-completion queues, one for read requests and one for write
requests. The FTL fetches requests from the submission queues such that the read and
write throughput requirements are satisfied and places them in dispatch pools. Read pages
are placed in separate dispatch pools based on which flash chips (or banks) contain the data
requested. Write requests are buffered and grouped together using Telomere’s Single Task
Placement algorithm. Read requests in the dispatch pools and write requests in the write
buffer are then dynamically assigned to idle banks using the bank reservation technique
outlined in the next section.
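The task model described above can be pictured with a minimal Python data-structure sketch. All class, field and function names are hypothetical, units are illustrative, and the sketch only covers grouping fetched read requests into per-bank dispatch pools. It does not model throughput-aware fetching, write buffering or bank assignment.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class QueuePair:
    """One submission-completion queue pair (hypothetical layout)."""
    submission: Deque[int] = field(default_factory=deque)
    completion: Deque[int] = field(default_factory=deque)


@dataclass
class StreamTask:
    read_pages_per_sec: float
    write_pages_per_sec: float
    lifetime_sec: float
    retain_fraction: float  # share of data that may outlive its lifetime
    read_queues: QueuePair = field(default_factory=QueuePair)
    write_queues: QueuePair = field(default_factory=QueuePair)

    def __post_init__(self) -> None:
        if not 0.0 <= self.retain_fraction <= 1.0:
            raise ValueError("retain_fraction must be within [0, 1]")


def dispatch_reads(
    task: StreamTask,
    bank_of: Callable[[int], int],
    pools: Dict[int, List[int]],
) -> None:
    """Move pending read pages into the dispatch pool of the bank holding them."""
    while task.read_queues.submission:
        page = task.read_queues.submission.popleft()
        pools.setdefault(bank_of(page), []).append(page)


task = StreamTask(1000.0, 500.0, 60.0, 0.05)
task.read_queues.submission.extend(range(6))
pools: Dict[int, List[int]] = {}
dispatch_reads(task, lambda page: page % 4, pools)  # toy page-to-bank mapping
print(pools)
```

The toy mapping `page % 4` stands in for the FTL's lookup of the physical bank that holds each page.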
5.2 Bank Reservation
Bank reservation is a technique introduced by Huang and Chang [HC18] to provide sta-
ble read and write throughput. They observed that large performance variation is closely
related to the number of banks that are occupied performing garbage collection. Bank
reservation limits the number of banks that are performing read requests, write requests or
garbage collection at any given time. Banks are dynamically partitioned based on the read
and write throughput requirements of the applications. By placing a cap on the number
of banks performing garbage collection and interfering with I/O performance, the bank
reservation technique is able to provide high, stable read and write throughput.
However, we found a pathological case where bank reservation fails to guarantee the
read and write throughput requirements. Thus, we introduce extensions to the original
work after presenting a summary of the original bank reservation policy.
5.2.1 Bank Reservation Policy
We summarize the bank reservation technique [HC18] briefly for completeness. The bank
reservation policy determines how many banks can be servicing read, write and garbage
collection jobs1 at the same time in order to meet the read and write throughput require-
ments and perform enough garbage collection to provide free pages for future write re-
quests.
Let Br and Bw be the throughput needed to satisfy all read and write requests, respec-
tively. Let Fr, Fw, Fcp and Fe be the bandwidth for read, write, copy and erasure per bank.














where N is the number of banks.
Let Nr, Nw and Ngc be the bank reservations for read, write and garbage collection.
1 In a real-time system, a periodic task releases jobs at regular intervals based on its period. Each job
has a request time and a deadline. In a traditional computer process model, a task can be represented as a
process that periodically executes a set of functions.



















Ngc = N − Nr − Nw    (5.2)
5.2.2 Infinite Streams Bank Reservation
The original bank reservation technique will not always guarantee read and write through-
put. We present a pathological case in which the original bank reservation technique fails
to guarantee the read and write throughput requirements. Then, we present additional
admission control equations to avoid the pathological case in our extension to the bank
reservation technique.
5.2.2.1 Pathological Case
We demonstrate the pathological case with an example. Let Br = 600 MB/s and Bw = 350
MB/s. Assume there are 16 banks, λ = 0.75, Fr = 100 MB/s, Fw = 41.4 MB/s and
Fe = 1.4 GB/s. Using the admission control in Equation 5.2, Nr = 6, Nw = 9 and
Ngc = 1.
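The numbers in this example can be reproduced with the short Python sketch below. It assumes that Nr and Nw are the smallest bank counts whose aggregate bandwidth covers Br and Bw, and that garbage collection receives the remaining banks (Ngc = N − Nr − Nw). The ceiling-based form is our assumption, chosen because it yields the values quoted here, and the function name is hypothetical.

```python
from math import ceil
from typing import Tuple


def reserve_banks(
    n_banks: int,
    b_read: float,   # required read throughput Br (MB/s)
    b_write: float,  # required write throughput Bw (MB/s)
    f_read: float,   # per-bank read bandwidth Fr (MB/s)
    f_write: float,  # per-bank write bandwidth Fw (MB/s)
) -> Tuple[int, int, int]:
    """Return (Nr, Nw, Ngc) under the assumptions stated above."""
    n_r = ceil(b_read / f_read)
    n_w = ceil(b_write / f_write)
    n_gc = n_banks - n_r - n_w
    if n_gc < 0:
        raise ValueError("throughput requirements exceed the available banks")
    return n_r, n_w, n_gc


print(reserve_banks(16, 600.0, 350.0, 100.0, 41.4))
```

With the example's parameters, six banks cover the read requirement, nine cover the write requirement, and a single bank is left for garbage collection.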
Without specifying the read and write request granularity, it is impossible to guarantee
read throughput. For example, a read throughput of Br = 600 MB/s cannot be guaranteed
when all the read pages are physically stored on a single bank whose bandwidth is only





so that pages can be striped across multiple banks.





, bank reservation can still fail to guarantee write
throughput. Figure 5.1 shows a pathological example. Let g = 6. Assume that a request
is written to banks 0 to 5 and that request is continuously read. Since Nr·Fr = Br,
banks 0 to 5 will be constantly servicing read requests in order to satisfy the read
throughput. Thus, write requests can only be performed on banks 6 to 15. When write
requests fill up the pages in banks 6 to 15, the fraction of pages in the SSD filled with
valid data is at most 10/16 = 0.625. Since λ = 0.75, garbage collection is not possible, and
subsequent write requests have to be written to free pages in banks 0 to 5, resulting in a
significant decrease in read and write throughput.
Figure 5.1: Pathological case in bank reservation where write throughput cannot be
guaranteed. When write requests fill up the pages in banks 6 to 15, the fraction of pages in the
SSD filled with valid data is less than the over-provisioning λ = 0.75.
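The arithmetic of this example can be checked with a few lines of Python. The sketch treats the banks that are constantly busy with reads as unavailable for writes and compares the fraction of the drive that can still fill with valid data against λ. The function names and the notion of "pinned" banks are ours, and the check only restates the example; it is not the general admission condition.

```python
def writable_fraction(n_banks: int, pinned_banks: int) -> float:
    """Largest fraction of pages that writes can fill outside the pinned banks."""
    return (n_banks - pinned_banks) / n_banks


def is_pathological(n_banks: int, pinned_banks: int, lam: float) -> bool:
    """True when the writable banks fill up before valid data reaches lam."""
    return writable_fraction(n_banks, pinned_banks) < lam


print(writable_fraction(16, 6))      # banks 6 to 15 out of 16
print(is_pathological(16, 6, 0.75))  # the example in the text
print(is_pathological(16, 4, 0.75))  # fewer pinned banks
```

With six pinned banks the writable share is 10/16, below λ = 0.75, so the situation in Figure 5.1 arises. With at most (1 − λ)·N = 4 pinned banks it does not.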
The pathological case above does not occur when the minimum number of banks





≤ (1− λ)·N (5.3)
Figure 5.2: By increasing the granularity g, Infinite Streams allows extra write throughput
to be distributed in setg. For example, when g = 7, only 6 banks are read from at any
moment in time, which means there is an idle bank that could be assigned a write job.
This constraint, however, limits the read throughput. We show that we can prevent the
pathological case from occurring without limiting the read throughput by increasing the
read and write granularity. Note that the pathological case does not occur when g = N
because read requests will be distributed among all the banks. Specifically, there exists a
granularity g that is at least ⌈Br/Fr⌉ and less than or equal to U and that guarantees the read and
write throughput requirements. Recall that U is the minimum number of banks needed to
satisfy the read and write throughput requirements and garbage collection. By increasing
the granularity g, Infinite Streams allows extra write throughput to be distributed among
the g banks that are performing read requests. For example, in Figure 5.2, g = 7. Since
the read throughput requirement is unchanged (Br = 600 MB/s), only 6 banks are read
from at any moment in time, which means there is an idle bank that could be assigned
a write job. In the next section, we discuss how to guarantee that enough pages can be
written to satisfy the over-provisioning.
5.2.2.2 Our Extension to Bank Reservation
Recall that the pathological case only occurs when the read requests contain pages that
exist on the same banks. We denote this set of banks as setg. Since a request is written
across multiple banks, the granularity g of read and write requests is also the number of
banks in setg. Let seth contain the other N−g banks that are not in setg. Banks in seth do
not service any read requests in our pathological example. Let Tg and Th denote the time it
takes to write to all the pages in setg, given that reads are occurring, and seth, respectively.
Our pathological case occurs when seth is filled with valid pages and setg still contains
more free pages than the number of pages in the flash over-provisioned space. When this
occurs, there are no more pages to write to in seth, and the write throughput cannot be
satisfied by writing to setg because of the reads occurring in setg.
To find Tg and Th, first, we calculate the number of banks that could be servicing write
requests at any given moment in each set: ng for setg, and nh for seth. The number of
banks that could be servicing write requests is at most Bw/Fw to satisfy the write throughput
requirement. Since banks in setg are servicing all the read requests with a read throughput
requirement of Br, it has at most (g − Br/Fr) idle banks for performing page writes. seth




















Using ng and nh, we calculate Tg and Th as follows. Let Bpages be the number of pages
in a bank.
Tg = (g·Bpages)/(ng·Fw) (5.6)
Th = ((N − g)·Bpages)/(nh·Fw) (5.7)
If Tg ≤ Th, it takes no longer to write to all the pages in setg than it
takes to write to all the pages in seth, and thus, g is sufficient as a granularity to prevent
the pathological case. If, however, Tg > Th, we need to calculate the number of pages
written in setg during Th time units and add those pages to the number of pages in seth.
If the result is greater than or equal to the total number of pages that are not used for
over-provisioning, g is sufficient to guarantee that the pathological case will not occur.





To find a sufficient granularity, we start with g = ⌈Br/Fr⌉ + 1 and incrementally
test whether the above condition can be satisfied at each value of g. Algorithm 2 presents the
pseudocode.
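The incremental test described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the dissertation's implementation. The write-bank counts ng and nh are passed in as caller-supplied functions n_g and n_h because their defining equations are not reproduced here. The number of pages not used for over-provisioning is assumed to be λ·N·Bpages, the search is assumed to start at ⌈Br/Fr⌉ + 1, and Fw is treated as pages written per time unit. The stand-ins for n_g and n_h in the usage lines are hypothetical and use only the upper bounds stated in the text.

```python
import math
from typing import Callable, Optional


def find_granularity(
    N: int, U: int, B_r: float, F_r: float, F_w: float,
    lam: float, B_pages: int,
    n_g: Callable[[int], int], n_h: Callable[[int], int],
) -> Optional[int]:
    """Return the first tested granularity g that avoids the pathological case."""
    # Assumption: pages not reserved for over-provisioning = lam * N * B_pages.
    usable_pages = lam * N * B_pages
    g = math.ceil(B_r / F_r) + 1
    while g <= U:
        if g >= N:
            # Reads are spread over every bank, so the case cannot arise.
            return g
        ng, nh = n_g(g), n_h(g)
        if nh <= 0:
            raise ValueError("set_h must have at least one bank writing")
        T_h = ((N - g) * B_pages) / (nh * F_w)
        T_g = (g * B_pages) / (ng * F_w) if ng > 0 else math.inf
        if T_g <= T_h:
            return g
        W = ng * F_w * T_h  # pages written to set_g while set_h fills up
        if W + (N - g) * B_pages >= usable_pages:
            return g
        g += 1
    return None


# Hypothetical stand-ins for n_g and n_h, built only from the upper bounds
# stated in the text (Bw/Fw write banks, g - Br/Fr idle banks in set_g).
N, B_r, F_r, B_w, F_w = 16, 600.0, 100.0, 350.0, 41.4
write_banks = math.ceil(B_w / F_w)
n_g = lambda g: min(g - math.ceil(B_r / F_r), write_banks)
n_h = lambda g: min(N - g, write_banks)

print(find_granularity(N, 16, B_r, F_r, F_w, 0.75, 1000, n_g, n_h))
print(find_granularity(N, 16, B_r, F_r, F_w, 0.60, 1000, n_g, n_h))
```

Returning None when no tested g satisfies the condition is a design choice of this sketch; it lets the caller treat the task set as not admissible.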
The job allocation for the banks is described as follows. When a bank is not busy, a
new job is allocated to the bank. If the number of banks performing read requests is less
than Nr, the read dispatch pool is checked for a read request on that bank. Otherwise, if
GC is triggered and the number of banks performing GC jobs is less than Ngc, the bank
performs a partial garbage collection step. Otherwise, if the number of banks performing
write jobs is less than Nw, a write job is allocated to that bank.
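The allocation order described above (read, then garbage collection, then write) can be sketched as a small Python class. The class name, field names and job labels are hypothetical. The sketch also assumes that a bank with no pending read request falls through to the GC and write checks, which the text does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BankScheduler:
    N_r: int   # maximum banks reading at once
    N_w: int   # maximum banks writing at once
    N_gc: int  # maximum banks performing GC at once
    reading: int = 0
    writing: int = 0
    collecting: int = 0
    gc_triggered: bool = False
    # Read dispatch pool: bank id -> number of pending read requests.
    read_pool: Dict[int, int] = field(default_factory=dict)

    def next_job(self, bank: int) -> Optional[str]:
        """Pick a job for an idle bank: read first, then GC, then write."""
        if self.reading < self.N_r and self.read_pool.get(bank, 0) > 0:
            self.read_pool[bank] -= 1
            self.reading += 1
            return "read"
        if self.gc_triggered and self.collecting < self.N_gc:
            self.collecting += 1
            return "gc_step"  # one partial garbage collection step
        if self.writing < self.N_w:
            self.writing += 1
            return "write"
        return None  # every reservation is in use; the bank stays idle

    def finish(self, job: str) -> None:
        """Release the reservation held by a completed job."""
        if job == "read":
            self.reading -= 1
        elif job == "gc_step":
            self.collecting -= 1
        elif job == "write":
            self.writing -= 1


sched = BankScheduler(N_r=6, N_w=9, N_gc=1, read_pool={0: 1})
first = sched.next_job(0)   # a read is pending on bank 0
second = sched.next_job(0)  # no read left, GC not triggered
sched.gc_triggered = True
third = sched.next_job(1)   # GC slot is free
fourth = sched.next_job(2)  # GC slot taken, so the bank writes
print(first, second, third, fourth)
```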
5.3 Evaluation
We use the event-driven simulator described in Section 4.5.2 to estimate average write
amplification factors (WAF). We generate 10 task sets, each containing 100 tasks, for
each device utilization value and percentage of data retained beyond its lifetime. Device
utilization values range from 0.50 to 0.90 in increments of 0.10. The percentages of data
retained beyond its lifetime range from 0% to 90% in increments of 10%. Logical page
numbers (LPNs) for each task are written sequentially. A random subset of LPNs is
retained beyond the lifetime specified by the task. This subset of LPNs will not be updated
or overwritten. When the percentage of data retained is met, there is a one percent chance
that an LPN being retained will be updated.
Algorithm 2 Determine the read and write page granularity to guarantee read and write
throughput given over-provisioning.
...
6: while g ≤ U do
...
9: Tg = (g·Bpages)/(ng·Fw)
10: Th = ((N − g)·Bpages)/(nh·Fw)
11: if Tg ≤ Th then
12: return g
13: else
14: W = ng·Fw·Th
...
The WAF for Infinite Streams is compared with the WAF for the MultiLog data
placement algorithm that estimates the update frequency of a logical page number based
on the last two consecutive writes [SA13]. Pages are placed in bins with exponentially
decreasing update frequency ranges. For example, the first bin holds the hottest pages
with frequencies from f to f/2. The next bin holds pages with frequencies from f/2 to
f/4. Figure 5.3 and Figure 5.4 contain the average WAF values for Infinite Streams and
MultiLog heat tracking, respectively. For both Infinite Streams and MultiLog, the WAF
increases as device utilization increases. Compared to MultiLog, Infinite Streams has
lower average WAF values across all varying parameters with smaller standard deviations
as shown in Tables 5.1 and 5.2.
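The binning scheme used by the MultiLog baseline can be illustrated with a minimal Python sketch. It is not the implementation from [SA13]. The function names are hypothetical, the update frequency is assumed to be the reciprocal of the interval between the last two writes, and the handling of values that fall exactly on a bin boundary or outside the range is an arbitrary choice made for this sketch.

```python
import math


def estimate_frequency(prev_write_time: float, last_write_time: float) -> float:
    """Estimate update frequency from the last two consecutive writes
    (assumed here to be 1 / interval)."""
    interval = last_write_time - prev_write_time
    if interval <= 0:
        raise ValueError("writes must be in increasing time order")
    return 1.0 / interval


def bin_index(freq: float, f_max: float, num_bins: int) -> int:
    """Map a frequency to a bin whose range halves with each index:
    bin 0 covers (f_max/2, f_max], bin 1 covers (f_max/4, f_max/2], and so on."""
    if freq >= f_max:
        return 0                  # hotter than the hottest range
    if freq <= 0:
        return num_bins - 1       # never updated: coldest bin
    idx = math.floor(math.log2(f_max / freq))
    return min(idx, num_bins - 1)  # clamp very cold pages into the last bin


f = estimate_frequency(10.0, 14.0)  # two writes, 4 time units apart
print(f, bin_index(f, f_max=1.0, num_bins=8))
print(bin_index(0.75, 1.0, 8), bin_index(0.3, 1.0, 8), bin_index(0.001, 1.0, 4))
```

Because the bin index grows with the logarithm of the frequency ratio, a fixed number of bins covers a wide range of update rates, at the cost of grouping pages whose lifetimes may differ by up to a factor of two.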
We also measured the throughput utilization of Infinite Streams and MultiLog heat
tracking. The bank reservation technique with our additional admission control test is
used for both Infinite Streams and MultiLog heat tracking. We use the average WAF
values in Tables 5.1 and 5.2 in Equation 5.1. Each flash chip has a read bandwidth of 100
MB/s, a write bandwidth of 41.4 MB/s and an erasure bandwidth of 1.4 GB/s [HC18].
Figures 5.5 and 5.6 show the maximum throughput utilization at which 100% of the
task sets are schedulable. Infinite Streams is able to guarantee higher throughput to task
sets across varying parameters than MultiLog heat tracking. At higher device
utilization and higher percentages of data retained, the throughput at which all the task sets
are schedulable decreases by up to 5%. For MultiLog heat tracking, on the other hand, the



































Figure 5.3: WAF for Infinite Streams over different over-provisioning levels and different
percentages of data retained beyond its lifetime.


































Figure 5.4: WAF for MultiLog heat tracking over different over-provisioning levels and
different percentages of data retained beyond its lifetime.
λ         % data retained
          0     10    20    30    40    50    60    70    80    90
0.5  µ    1     1.0   1.0   1.0   1.0   1.0   1.0   1.01  1.01  1.01
     σ    0     4e-5  2e-4  6e-4  4e-4  9e-4  1e-3  2e-3  3e-3  4e-3
0.6  µ    1     1.0   1.0   1.0   1.01  1.01  1.01  1.02  1.03  1.03
     σ    0     9e-5  5e-4  1e-3  1e-3  2e-3  2e-3  2e-3  3e-3  5e-3
0.7  µ    1     1.0   1.0   1.01  1.01  1.02  1.03  1.04  1.05  1.05
     σ    0     3e-4  9e-4  2e-3  2e-3  3e-3  3e-3  2e-3  4e-3  3e-3
0.8  µ    1     1.0   1.0   1.01  1.02  1.02  1.04  1.05  1.06  1.07
     σ    0     5e-4  9e-4  3e-3  3e-3  3e-3  2e-3  4e-3  5e-3  2e-3
0.9  µ    1     1.0   1.01  1.01  1.02  1.03  1.04  1.06  1.07  1.09
     σ    0     4e-4  9e-4  2e-3  3e-3  3e-3  4e-3  5e-3  6e-3  4e-3
Table 5.1: Average WAF (µ) and standard deviation (σ) for Infinite Streams over different
over-provisioning levels and different percentages of data retained beyond its lifetime.
λ         % data retained
          0     10    20    30    40    50    60    70    80    90
0.5  µ    1.06  1.07  1.09  1.09  1.09  1.08  1.07  1.06  1.06  1.07
     σ    8e-3  7e-3  1e-2  1e-2  1e-2  1e-2  8e-3  1e-2  1e-2  1e-2
0.6  µ    1.07  1.07  1.1   1.13  1.15  1.14  1.11  1.09  1.07  1.09
     σ    7e-3  8e-3  7e-3  7e-3  1e-2  1e-2  1e-2  1e-2  1e-2  2e-2
0.7  µ    1.07  1.08  1.11  1.17  1.2   1.19  1.2   1.16  1.11  1.1
     σ    1e-2  1e-2  7e-3  1e-2  2e-2  1e-2  1e-2  3e-2  1e-2  1e-2
0.8  µ    1.1   1.1   1.13  1.21  1.25  1.28  1.33  1.33  1.2   1.16
     σ    1e-2  1e-2  9e-3  1e-2  1e-2  2e-2  4e-2  3e-2  2e-2  2e-2
0.9  µ    1.22  1.25  1.3   1.37  1.38  1.44  1.53  1.6   1.48  1.39
     σ    3e-2  3e-2  3e-2  2e-2  3e-2  3e-2  4e-2  2e-2  4e-2  4e-2
Table 5.2: Average WAF (µ) and standard deviation (σ) for MultiLog heat tracking over different
over-provisioning levels and different percentages of data retained beyond its lifetime.

























Figure 5.5: Infinite Streams throughput with different percentages of data retained for
different amounts of over-provisioning. Lines indicate the maximum utilization at
which 100% of the task sets are schedulable.



























Figure 5.6: MultiLog heat tracking throughput with different percentages of data retained
for different amounts of over-provisioning. Lines indicate the maximum utilization at
which 100% of the task sets are schedulable.
higher device utilization levels. As the percentage of data retained increases, throughput
initially decreases due to increased garbage collection overhead. When the percentage of
data retained is high, throughput increases. This may be because only a small percentage
of logical page numbers are updated, resulting in a data set with a skewed distribution
where a large percentage of writes are going to a small percentage of the disk.
Infinite Streams is an extension of the Telomere task model. As Figure 1.2 shows, Infi-
nite Streams has slightly higher device utilization and throughput compared to the Multi-
Streams method. Unlike previous multi-stream approaches, we guarantee the read and
write throughput requirements specified by each task. By assigning streams automatically
based on the parameters of the tasks instead of grouping pages with exponentially decreas-




This thesis explores flash translation layer (FTL) designs that provide predictable perfor-
mance. We examined different design objectives such as low latency, high throughput,
high device utilization, low garbage collection overhead and high parallelism. Many de-
sign decisions often face trade-offs between these different objectives. For example, over-
provisioning is used to lower the write amplification factor, which then results in lowered
garbage collection overhead and increased throughput. However, over-provisioning in-
herently decreases device utilization. Low tail-latency is often achieved by redundancy
or partitioning designs with encoding data. However, these designs have lower device
utilization and often suffer from extra overhead for managing encoding data and lower
throughput. On the other end, designs that optimize throughput and parallelism often
experience higher garbage collection overhead and higher tail-latency. Three FTL designs
were introduced in this thesis: one providing low tail-latency, real-time guarantees; one
providing real-time, high-throughput guarantees; and one providing high throughput
with no worst-case guarantees.
In Partitioned Real-Time FTL (PaRT-FTL), we show that by partitioning reads onto
different flash chips from write requests and garbage collection, we can provide low la-
tency guarantees for time-critical systems. The trade-off in this design is the reduced
throughput due to the partitioning. In our configuration, write requests and garbage collection
are limited to 25% of the flash chips, so write throughput is significantly decreased.
However, the redundancy and partitioning enable PaRT-FTL to decrease the maximum
write latency by 80% and the maximum read latency by 65% compared to previous real-
time techniques using partial garbage collection.
In Telomere, we tackle the root of the problem in flash memory that causes high
garbage collection overhead. By providing data lifetime information to the FTL, Telom-
ere is able to guarantee high parallelism and throughput to process large volumes of data.
We show that by intelligently placing data in flash memory, Telomere is able to reduce
garbage collection overhead to only a block erasure. By doing so, Telomere achieves
at least a 30% increase in throughput with 10% of over-provisioning compared to pre-
existing techniques. We also show that Telomere with a wear-leveling technique such as
RBER-balancing or health binning can significantly increase flash endurance.
We also extend the Telomere task model in our non-real-time design, Infinite Streams,
to allow data to be accessed beyond its specified lifetime. Unlike previous multi-stream
approaches, Infinite Streams does not place a limit on the maximum number of streams and
provides read and write throughput guarantees. We show the benefit of Infinite Streams
compared to grouping pages in bins with exponentially decreasing update frequency ranges.
At 10% device utilization, Infinite Streams is able to admit task sets with at least 10%
higher throughput compared to the MultiLog heat tracking approach.
6.1 Future Work & Emerging Research Directions
Future research directions include file system and database designs that incorporate streams,
and thus can take advantage of our task model in Telomere and Infinite Streams. For example,
application data can often be mapped to different streams with different lifetimes.
Files in different levels of a log-structured merge tree, which is used in databases such as
Cassandra and RocksDB, could be assigned to separate streams. In file systems, metadata,
such as inodes, logs and bitmaps, are short-lived and can be written in separate streams
from user data. Future work could incorporate streams at the database or file system level
to alleviate some burden on the application developer. However, we envision that some
application-level customization will be required to specify the lifetime of the streams de-
pending on the application workload.
While flash translation layers have allowed flash memory to be integrated seamlessly
into existing infrastructures replacing mechanical hard disks, having this extra layer of
indirection introduces many issues such as long tail-latency and high garbage collection
overhead. The narrow block I/O interface between the file system and the FTL prevents
optimizations from either side. File semantics are hidden from the SSD, preventing intel-
ligent storage management, and flash memory properties such as garbage collection and
internal parallelism are hidden from the file system, preventing optimizations in the file
system. In Telomere and Infinite Streams, we pass data lifetime information from the ap-
plication to the SSD so that the SSD can intelligently separate hot and cold data to reduce
garbage collection overhead. Future work could look into incorporating data grouping se-
mantics in the FTL for storing objects. For example, pages in the same object are highly
likely to be modified at the same time, so storing them together in the same block would
reduce write amplification. To increase parallelism, objects that are likely to be accessed
together due to spatial or temporal locality should be stored on different parallel units,
such as flash chips. Moving the storage management from the file system layer to the FTL
allows optimizations that are otherwise not possible in traditional designs.
Another direction for future work is the one taken by the Open-Channel SSD com-
munity, where the internal properties of the SSD are revealed to the applications so that
flash memory management can be configured by each application based on its workload
and objective. Currently, there exists a myriad of FTL designs for different applications
with different design objectives. However, an FTL embedded in the SSD cannot be eas-
ily reconfigured for a different application. The advantage of an Open-Channel SSD is
that an application can fully optimize storage on flash memory. The disadvantage is the
burden it places on application developers to manage flash garbage collection and paral-
lelism. Zoned Namespaces, which is being added to the NVMe 2.0 specification, presents
a balance on the spectrum where a fully drive-managed FTL is on one end and a fully
software-managed design is on the other. Future research includes efficient ways to de-
velop algorithms using Zoned Namespaces SSDs as well as exploring new levels of ab-
straction and new interfaces that will support different applications and allow software to
optimize storage on flash memory.
Emerging non-volatile memory (NVM) technologies possess DRAM-like performance
with low latency, good endurance and byte-addressability. However, the price of NVMs,
such as Intel’s Optane memory, is an order of magnitude higher than NAND. For example,
as of August 2020, 960 GB of Intel Optane SSD 905P Series cost $1,297.00 compared
to 1 TB of Samsung 970 EVO SSD with V-NAND Technology, which cost $169.99. Future
research could explore hybrid storage systems situated between NVMs and
SSDs. For example, in Telomere and Infinite Streams, streams with short lifetimes could
be written to NVM to take advantage of NVM’s higher endurance. NVMs could also be
used as a cache, especially in Telomere which specifies task periods, to reduce the average
response time for I/O accesses. By migrating areas of concentrated I/Os to an NVM, new
hybrid designs could provide better real-time performance guarantees for future storage
solutions.
Bibliography




[Ang13] Amara D. Angelica. Google’s Self-driving Car Gathers Nearly 1
GB/sec. http://www.kurzweilai.net/googles-self-driving-car-gathers-nearly-
1-gbsec/, May 2013.
[APW+08] Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark
Manasse, and Rina Panigrahy. Design Tradeoffs for SSD Performance.
In USENIX Annual Technical Conference (ATC), 2008.
[Bak90] T P Baker. A Stack-Based Resource Allocation Policy for Realtime Pro-
cesses. In Proceedings of the Real-Time Systems Symposium (RTSS), 1990.
[BB05] E. Bini and G. C. Buttazzo. Measuring the Performance of Schedulability
Tests. Journal of Real-Time Systems, 30(1-2), 2005.
[BBA10] A. Bastoni, B. Brandenburg, and J. Anderson. Cache-related Preemption and
Migration Delays: Empirical Approximation and Impact on Schedulability.
In Proceedings of the Operating Systems Platforms for Embedded Real-Time
Applications (OSPERT), 2010.
[BBBD13] Matias Bjorling, Philippe Bonnet, Luc Bouganim, and Niv Dayan. The Necessary
Death of the Block Device Interface. In Biennial Conference on Innovative
Data Systems Research (CIDR), 2013.
[BGB17] Matias Bjorling, Javier Gonzalez, and Philippe Bonnet. LightNVM: The
Linux Open-Channel SSD Subsystem. In Proceedings of the USENIX Con-
ference on File and Storage Technologies (FAST), 2017.
[Bjo19] Matias Bjorling. From Open-Channel SSDs to Zoned Namespaces. In Linux
Storage and Filesystems Conference (Vault 19), 2019.




[CDH+15] John Colgrove, John D Davis, John Hayes, Ethan L Miller, Cary Sandvig,
Russell Sears, Ari Tamches, Neil Vachharajani, and Feng Wang. Purity:
Building Fast, Highly-available Enterprise Flash Storage from Commodity
Components. In Proceedings of the ACM SIGMOD International Confer-
ence on Management of Data (SIGMOD), 2015.
[CG08] Siddharth Choudhuri and Tony Givargis. Deterministic Service Guaran-
tees for NAND Flash using Partial Block Cleaning. In Proceedings of the
IEEE/ACM/IFIP International Conference on Hardware/Software Codesign
and System Synthesis, 2008.
[Chi99] Chiang, M.-L. and Chang, R.-C. Cleaning Policies in Mobile Computers
Using Flash Memory. Journal of System Software, 48(3):213–231, 1999.
[CHL16] Feng Chen, Binbing Hou, and Rubao Lee. Internal Parallelism of Flash
Memory-based Solid-state Drives. ACM Transactions on Storage, 12(3),
2016.
[Cho09] Cho, H. and Shin, D. and Eom, Y. I. KAST: K-associative Sector Translation
for NAND Flash Memory in Real-time Systems. In Proceedings of the Con-
ference on Design, Automation and Test in Europe (DATE), pages 507–512,
2009.
[CKL04] Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo. Real-time Garbage Collec-
tion for Flash-memory Storage Systems of Real-time Embedded Systems.
ACM Transactions on Embedded Computing Systems (TECS), 3(4):837–863,
2004.
[CLZ11a] F. Chen, T. Luo, and X. Zhang. CAFTL : A Content-Aware Flash Translation
Layer Enhancing the Lifespan of Flash Memory based Solid State Drives.
In Proceedings of the USENIX Conference on File and Storage Technologies
(FAST), 2011.
[CLZ11b] Feng Chen, Rubao Lee, and Xiaodong Zhang. Essential Roles of Exploit-
ing Internal Parallelism of Flash Memory based Solid State Drives in High-
Speed Data Processing. In IEEE International Symposium on High Perfor-
mance Computer Architecture, 2011.
[CMHM13] Yu Cai, Onur Mutlu, Erich F Haratsch, and Ken Mai. Program interference
in MLC NAND flash memory: Characterization, modeling, and mitigation.
In Proceedings of the IEEE International Conference on Computer Design
(ICCD), 2013.
[Cou16] Tom Coughlin. Flash Memory For Data Centers And Clouds.
https://www.forbes.com/sites/tomcoughlin/2016/04/27/flash-memory-
for-data-centers-and-clouds/, April 2016.
[Han06] Han, L. and Ryu, Y. and Yim, K. CATA: A Garbage Collection Scheme
for Flash Memory File Systems. Ubiquitous Intelligence and Computing,
4159:103–112, 2006.
[HC18] Sheng-Min Huang and Li-Pin Chang. Providing SLO Compliance on NVMe
SSDs Through Parallelism Reservation. ACM Transactions on Design Automation
of Electronic Systems, 23(3), 2018.
[HSKH+16] Mingzhe Hao, Gokul Soundararajan, Deepak Kenchammana-Hosekote, An-
drew A. Chien, and Haryadi S. Gunawi. The Tail at Store: A Revelation from
Millions of Hours of Disk and SSD Deployments. In USENIX Conference
on File and Storage Technologies (FAST), 2016.
[Kaw95] A. Kawaguchi. A Flash-Memory Based File System. In USENIX, 1995.
[KHMC14] Jeong-Uk Kang, Jeeseok Hyun, Hyunjoo Maeng, and Sangyeun Cho. The
Multi-Streamed Solid-State Drive. In USENIX Workshop on Hot Topics in
Storage and File Systems (HotStorage), 2014.
[KKN+02] Jesung Kim, Jong Min Kim, Sam H Noh, Sang Lyul Min, and Yookun Cho.
A Space-efficient Flash Translation Layer for CompactFlash Systems. IEEE
Transactions on Consumer Electronics, 48(2):366–375, 2002.
[KLN15] J. Kim, D. Lee, and S. H. Noh. Towards SLO Complying SSDs Through
OPS Isolation. In Proceedings of the USENIX Conference on File and Stor-
age Technologies (FAST), 2015.
[KOP+11] Sungchan Kim, Hyunok Oh, Chanik Park, Sangyeun Cho, and Sang-Won
Lee. Fast, Energy Efficient Scan Inside Flash Memory SSDs. In Proceedings
of the International Workshop on Accelerating Data Management Systems,
2011.
[KYM11] Y. Kang, J. Yang, and E. L. Miller. SCM: An Efficient Interface for Stor-
age Class Memories. In Proceedings of IEEE Symposium on Mass Storage
Systems and Technologies (MSST), 2011.
[LCG+15] Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi, and Onur Mutlu.
WARM: Improving NAND Flash Memory Lifetime with Write-hotness
Aware Retention Management. In Proceedings of the Mass Storage Systems
and Technology (MSST), 2015.
[Lee06] Lee, S.-W. and Choi, W.-K. and Park, D.-J. FAST: An Efficient Flash Trans-
lation Layer for Flash Memory. In Emerging Directions in Embedded and
Ubiquitous Computing, pages 879–887. Springer, 2006.
[Lee08] Lee, S. and Shin, D. and Kim, Y.-J. and Kim, J. LAST: Locality-aware
Sector Translation for NAND Flash Memory-based Storage Systems. ACM
SIGOPS Operating Systems Review, 42(6):36–42, 2008.
[LO18] LITE-ON. Best-in Class SSD Solutions with High-Reliability, Man-
ageability, and Durability. http://www.liteonssd.com/en/datasheet/J8/Lite-
On%20Automotive.pdf, 2018.
[LSZ13] Youyou Lu, J. Shu, and W. Zheng. Extending the Lifetime of Flash-based
Storage Through Reducing Write Amplification from File Systems. In
Proceedings of the USENIX Conference on File and Storage Technologies
(FAST), 2013.
[Maa20] Gilad David Maayan. The IoT Rundown For 2020: Stats, Risks, and Solu-
tions. https://securitytoday.com/Articles/2020/01/13/The-IoT-Rundown-for-
2020.aspx?Page=2, January 2020.
[MFL14] Dongzhe Ma, Jianhua Feng, and Guoliang Li. A Survey of Address Trans-
lation Technologies for Flash Memories. ACM Computing Surveys (CSUR),
46(3):36, 2014.
[Mic05] Micron. Micron Technical Report: Small-block vs. Large-block NAND
Flash Devices. Technical Report TN-29-07, 2005.
[Mic17] Micron. Micron Reveals Critical Technologies for Autonomous Vehicles.
https://investors.micron.com/news-releases/news-release-details/micron-
reveals-critical-technologies-autonomous-vehicles, 2017.
[Mic18] Micron. NAND Flash Die – 128Gb
Die: x8 300mm MLC MT29F128G08CBECB.
https://prod.micron.com/media/documents/products/data-sheet/nand-
flash/die/l95b die 128gb nand.pdf, February 2018.
[Moh10] Mohan, V. and Siddiqua, T. and Gurumurthi, S. and Stan, M. R. How I
Learned to Stop Worrying and Love Flash Endurance. In Proceedings of
the USENIX Conference on Hot Topics in Storage and File Systems (HotStorage),
2010.
[MW18] Katherine Missimer and Richard West. Partitioned Real-Time NAND Flash
Storage. In Proceedings of the Real-Time Systems Symposium (RTSS), 2018.
[Net16] NetworkComputing. The Future of Data Storage: Flash and Hy-
brid Cloud. https://www.networkcomputing.com/data-centers/future-data-
storage-flash-and-hybrid-cloud, 2016.
[Oh13] HakJune Oh. Single Controller 4/8TB SSD.
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/
2013/20130815 301A Oh.pdf, 2013. Flash Memory Summit Conference.
[Pan13] Pang, Changhyun and Lee, Chanseok and Suh, Kahp-Yang. Recent Advances
in Flexible Sensors for Wearable and Implantable Devices. Journal
of Applied Polymer Science, pages 1429–1441, 2013.
[Par08] Park, C. and Cheon, W. and Kang, J. and Roh, K. and Cho, W. and Kim, J.-S.
A reconfigurable FTL architecture for NAND flash-based applications. ACM
Transactions on Embedded Computing Systems (TECS), 7(4):38, 2008.
[PD11] D. Park and D. Du. Hot Data Identification for Flash-based Storage Systems
Using Multiple Bloom Filters. In Proceedings of Massive Storage Systems
and Technology (MSST), 2011.
[PKI+18] Roman Pletka, Ioannis Koltsidas, Nikolas Ioannou, Sasa Tomic, Nikolaos
Papandreou, Thomas Parnell, Haralampos Pozidis, Aaron Fry, and
Tim Fisher. Management of Next-Generation NAND Flash to Achieve
Enterprise-Level Endurance and Latency Targets. ACM Transactions on
Storage (TOS), 2018.
[PT16] Roman A. Pletka and Sasa Tomic. Health-Binning: Maximizing the Perfor-
mance and the Endurance of Consumer-Level NAND Flash. In Proceedings
of the 9th ACM International on Systems and Storage Conference (SYSTOR),
2016.
[PwC15] PwC. The Internet of Things: The Next
Growth Engine for the Semiconductor Industry.
https://www.pwc.com/gx/en/technology/publications/assets/pwc-iot-
semicon-paper-may-2015.pdf, May 2015.
[QWLS12] Zhiwei Qin, Yi Wang, Duo Liu, and Zili Shao. Real-time Flash Translation
Layer for NAND Flash Memory Storage Systems. In Proceedings of the
IEEE Real-Time and Embedded Technology and Applications Symposium
(RTAS), 2012.
[RO92] M. Rosenblum and J. Ousterhout. The Design and Implementation of a Log-
Structured File System. In Proceedings of the ACM Transactions on Com-
puter Systems, 1992.
[Ruf19] Gustavo Henrique Ruffo. Tesla Cars Have A Memory Problem That May
Cost You A Lot To Repair. https://insideevs.com/news/376037/tesla-mcu-
emmc-memory-issue/, October 2019.
[SA13] Radu Stoica and Anastasia Ailamaki. Improving Flash Write Performance
by Using Update Frequency. In Proceedings of the International Conference
on Very Large Data Bases (VLDB), 2013.
[San18] Sandisk. iNAND Automotive Embedded Flash Drives.
https://www.sandisk.com/oem-design/automotive/inand, 2018.
[SAW+14] Dimitris Skourtis, Dimitris Achlioptas, Noah Watkins, Carlos Maltzahn, and
Scott Brandt. Flash on Rails: Consistent Flash Performance through Redun-
dancy. In Proceedings of the USENIX Annual Technical Conference (ATC),
2014.
[SJLK14] Yong Ho Song, Sanghyuk Jung, Sang-Won Lee, and Jin-Soo Kim.
Cosmos OpenSSD: A PCIe-based Open Source SSD Platform.
http://www.flashmemorysummit.com/English/Collaterals/Proceedings/
2014/20140807 301B Song.pdf, 2014. Flash Memory Summit Conference.
[SKKY15] Woong Shin, Myeongcheol Kim, Kyudong Kim, and Heon Y. Yeom. Provid-
ing QoS through Host Controlled Flash SSD Garbage Collection and Multi-
ple SSDs. In Proceedings of the International Conference on Big Data and
Smart Computing (BIGCOMP), 2015.
[Wan12] Wang, C. and Wong, W.-F. Observational Wear-Leveling: An Efficient Al-
gorithm for Flash Memory Management. In Proceedings of the Design Au-
tomation Conference (DAC), 2012.
[Wan14] Wang, W. and Lu, Y. and Shu, J. p-OFTL: An Object-based Semantic-aware
Parallel Flash Translation Layer. In Proceedings of the Conference on Design,
Automation and Test in Europe (DATE), 2014.
[Wes19] Western Digital. Zoned Storage. http://zonedstorage.io, 2019.
[WLMD16] Richard West, Ye Li, Eric Missimer, and Matthew Danish. A Virtualized
Separation Kernel for Mixed Criticality Systems. ACM Transactions on
Computer Systems, 34, 2016.
[WZ94] M. Wu and W. Zwaenepoel. eNVy: a Non-Volatile, Main Memory Stor-
age System. In Proceedings of the Architectural Support for Programming
Languages and Operating Systems (ASPLOS), 1994.
[YLH+17] Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan
Sundararaman, Andrew A. Chien, and Haryadi S. Gunawi. Tiny-Tail Flash:
Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND
SSDs. In Proceedings of the USENIX Conference on File and Storage Tech-
nologies (FAST), 2017.
[YPCB17] Jingpei Yang, Rajinikanth Pandurangan, Changho Choi, and Vijay Balakr-
ishnan. Autostream: Automatic Stream Management for Multi-streamed
SSDs. In Proceedings of the ACM International Systems and Storage Con-
ference (SYSTOR), 2017.
[YPG+14] Jingpei Yang, Ned Plasson, Greg Gillis, Nisha Talagala, and Swaminathan
Sundararaman. Don’t stack your Log on my Log. In Workshop on Inter-
actions of NVM/Flash with Operating Systems and Workloads (INFLOW),
2014.
[ZLW+15] Qi Zhang, Xuandong Li, Linzhang Wang, Tian Zhang, Yi Wang, and Zili
Shao. Optimizing Deterministic Garbage Collection in NAND Flash Storage
Systems. In Proceedings of the Real-Time and Embedded Technology and
Applications Symposium (RTAS). IEEE, 2015.
[ZZS+13] K. Zhao, W. Zhao, H. Sun, T. Zhang, X. Zhang, and N. Zheng. LDPC-in-
SSD: Making Advanced Error Correction Codes Work Effectively in Solid
State Drives. In Proceedings of the USENIX Conference on File and Storage
Technologies (FAST), 2013.
Curriculum Vitae
