Applying Reinforcement Learning to Mitigate Long Tail Latency Problem in SSD by Won Kyung Kang
 
 
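License: Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Korea.
Users may freely copy, distribute, transmit, display, perform, and
broadcast this work under the following conditions:
Attribution. The original author must be credited.
NonCommercial. The work may not be used for commercial purposes.
NoDerivs. The work may not be altered, transformed, or built upon.
For any reuse or distribution, the license terms applied to this work must
be made clear. These conditions do not apply if permission is obtained
from the copyright holder. The rights of users under copyright law are not
affected by the above.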
Ph.D. Dissertation in Engineering
Committee Chair: 이재욱 (seal)
Committee Vice Chair: 유승주 (seal)
Committee Member: 김진수 (seal)
Committee Member: 신동군 (seal)
Committee Member: 윤성로 (seal)

Abstract
Applying Reinforcement Learning to
Mitigate Long Tail Latency Problem in
SSD
Won Kyung Kang
Department of Computer Science and Engineering
The Graduate School
Seoul National University
NAND flash memory is widely used in a variety of systems, from real-time
embedded systems to high-performance enterprise server systems. Flash
memory has two main limitations: (1) erase-before-write (write-once) and
(2) limited endurance. To handle the erase-before-write property, a flash
translation layer (FTL) is employed. Currently, page-level mapping is
mainly used to reduce the latency increase caused by the write-once and
block-erase characteristics of flash memory.
Garbage collection (GC) is one of the leading causes of long-tail latency:
the latency at the 99th percentile can be more than 100 times the average
latency. As a result, real-time or quality-critical systems may fail to
satisfy given requirements such as QoS constraints.
As flash memory capacity increases, GC latency also tends to in-
crease. This is because the block size (the number of pages included in
one block) of the flash memory increases as the capacity of the flash
memory increases. GC latency is determined by valid page copy and
block erase time. Therefore, as block size increases, GC latency also in-
creases.
In particular, the block size has increased in the transition from 2D to
3D NAND flash memory, e.g., from 256 pages/block in 2D planar NAND flash
memory to 768 pages/block in 3D NAND flash memory. Even within 3D NAND
flash memory, the block size is expected to continue to increase. Thus,
the long write latency problem incurred by GC can become more serious in
3D NAND flash memory-based storage.
In this dissertation, we propose three versions of a novel GC scheduling
method based on reinforcement learning (RL). The purpose of this method
is to reduce the long tail latency caused by GC by utilizing the idle time
of the storage system. We also perform a quantitative analysis of the
RL-assisted GC solution.
We propose an RL-assisted GC scheduling technique that learns the storage
access behavior online and determines the number of GC operations to
perform in order to exploit the idle time. We also present an aggressive
method that further reduces the long tail latency by aggressively
performing fine-grained GC operations.
We also propose a technique that dynamically manages key states in
RL-assisted GC to reduce the long-tail latency. This technique uses many
fine-grained pieces of information as state candidates and manages key
states that suitably represent the characteristics of the workload using a
relatively small amount of memory. Thus, the proposed method can reduce
the long-tail latency even further.
In addition, we present a Q-value prediction network that predicts the
initial Q-value of a newly inserted state in the Q-table cache. The
integrated solution of the Q-table cache and the Q-value prediction
network exploits the short-term history of the system with a low-cost
Q-table cache, and uses a small network, the Q-value prediction network,
to exploit the long-term history and provide good Q-value initialization
for the Q-table cache. The experiments show that our proposed method
reduces the long tail latency by 25%-37% compared to the state-of-the-art
method.
Keywords: Solid state drive, long tail latency, garbage collection, rein-







Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Background
2.1 System Level Tail Latency
2.2 Solid State Drive
2.2.1 Flash Storage Architecture and Garbage Collection
2.3 Reinforcement Learning
Chapter 3 Related Work
Chapter 4 Small Q-table based Solution to Reduce Long Tail Latency
4.1 Problem and Motivation
4.1.1 Long Tail Problem in Flash Storage Access Latency
4.1.2 Idle Time in Flash Storage
4.2 Design and Implementation
4.2.1 Solution Overview
4.2.2 RL-assisted Garbage Collection Scheduling
4.2.3 Aggressive RL-assisted Garbage Collection Scheduling
4.3 Evaluation
4.3.1 Evaluation Setup
4.3.2 Results and Discussion
Chapter 5 Q-table Cache to Exploit a Large Number of States at Small Cost
5.1 Motivation
5.2 Design and Implementation
5.2.1 Solution Overview
5.2.2 Dynamic Key States Management
5.3 Evaluation
5.3.1 Evaluation Setup
5.3.2 Results and Discussion
Chapter 6 Combining Q-table cache and Neural Network to Exploit both Long and Short-term History
6.1 Motivation and Problem
6.1.1 More State Information can Further Reduce Long Tail Latency
6.1.2 Locality Behavior of Workload
6.1.3 Zero Initialization Problem
6.2 Design and Implementation
6.2.1 Solution Overview
6.2.2 Q-table Cache for Action Selection
6.2.3 Q-value Prediction
6.3 Evaluation
6.3.1 Evaluation Setup
6.3.2 Storage-Intensive Workloads
6.3.3 Latency Comparison: Overall
6.3.4 Q-value Prediction Network Effects on Latency
6.3.5 Q-table Cache Analysis
6.3.6 Immature State Analysis
6.3.7 Miscellaneous Analysis
6.3.8 Multi Channel Analysis
Chapter 7 Conclusion and Future Work
7.1 Conclusion






List of Tables
Table 4.1 Workload characteristics.
Table 4.2 NAND Flash memories.
Table 4.3 States.
Table 4.4 Thresholds.
Table 4.5 Latency comparison on 3D 512Gb flash memory.
Table 4.6 Latency comparison on 3D 128Gb flash memory.
Table 4.7 Erase count comparison on 3D 512Gb flash memory.
Table 4.8 Erase count comparison on 3D 128Gb flash memory.
Table 4.9 Standard deviation of normalized latency on 3D 512Gb flash memory.
Table 4.10 Standard deviation of normalized latency on 3D 128Gb flash memory.
Table 4.11 Latency comparison of simple prediction method on 3D 512Gb flash memory.
Table 4.12 Latency comparison of simple prediction method on 3D 128Gb flash memory.
Table 5.1 Top rank states and access counts in home2.
Table 5.2 State information and # of bins.
Table 5.3 Latency comparison.
Table 5.4 Latency comparison for various Q-table cache size on 3D 512Gb flash memory.
Table 5.5 Number of states visited for each workload on 3D 512Gb flash memory.
Table 5.6 Hit rate of Q-table cache in each workload on 3D 512Gb flash memory.
Table 5.7 Erase count for 3D 512Gb flash memory.
Table 5.8 Erase count for 3D 128Gb flash memory.
Table 6.1 Top rank states and access counts in home2.
Table 6.2 Characteristics of flash memories for 3D 128Gb [1] and 3D 512Gb [2].
Table 6.3 Latency comparison.
Table 6.4 Hit rate of Q-table cache only method.
Table 6.5 Pre-training comparison.
Table 6.6 Q-value prediction error comparison.
Table 6.7 Latency comparison of QP Net without Q-table cache (normalized to the baseline).
Table 6.8 Latency comparison of actor-critic and A3C methods (normalized to the baseline).
Table 6.9 Actor-critic network architecture.
Table 6.10 Workload information (hit rate and state counts are from integrated solution).
Table 6.11 Latency comparison of large Q-table cache (normalized to the baseline).
Table 6.12 Memory cost.
Table 6.13 Average latency comparison between negative and zero reward for action 0.
Table 6.14 Computation overhead [µs].
Table 6.15 Computation overhead of Q-table Cache [µs].
Table 6.16 Erase count for 3D 512Gb flash memory.
Table 6.17 Erase count for 3D 128Gb flash memory.
Table 6.18 Latency comparison without block size reduction for 3D 512Gb flash memory (4CH).
Table 6.19 Latency comparison for 3D 512Gb flash memory (4CH).
Table 6.20 Latency comparison for 3D 128Gb flash memory (4CH).
Table 6.21 Latency comparison of applying suspension scheme for 3D 512Gb flash memory (4CH).
Table 6.22 Latency comparison of applying suspension scheme for 3D 128Gb flash memory (4CH).
Table 6.23 Latency comparison of demand based Q-table cache for 3D 512Gb flash memory (4CH).
Table 6.24 Latency comparison of demand based Q-table cache for 3D 128Gb flash memory (4CH).
Table 6.25 Latency comparison of Q-table cache initialized with average for 3D 512Gb flash memory (4CH).
Table 6.26 Latency comparison of Q-table cache initialized with average for 3D 128Gb flash memory (4CH).
Table 6.27 Q-value prediction error comparison of Q-table cache initialized with average (4CH).
List of Figures
Figure 2.1 SSD internal architecture.
Figure 2.2 Garbage collection.
Figure 2.3 Environment-agent interaction.
Figure 4.1 Long tail latency problem: home2.
Figure 4.2 Inter-request interval distribution.
Figure 4.3 Reward function.
Figure 4.4 Comparison of write long-tail latency (3D 512Gb flash memory).
Figure 4.5 Distribution of write traffics.
Figure 4.6 Comparison of number of free blocks.
Figure 5.1 Latency variation according to the number of states in home1.
Figure 5.2 Environment and agent interaction.
Figure 5.3 Q-table cache architecture.
Figure 5.4 Number of states for each access frequency.
Figure 5.5 Number of states for each access frequency (after running the workload).
Figure 6.1 Q-table cache with QP Network.
Figure 6.2 Operation overview.
Figure 6.3 Reward function.
Figure 6.4 QP Net architecture.
Figure 6.5 Latency comparison with NO GC case at 99.9999th percentile on 3D 512Gb flash memory.
Figure 6.6 Latency comparison with NO GC case at 99.9999th percentile on 3D 128Gb flash memory.
Figure 6.7 Sensitivity analysis: latency vs. QP Net size.
Figure 6.8 Latency comparison of QP Net without Q-table cache for various network sizes at 99.9999th percentile (normalized to the baseline).
Figure 6.9 Architecture of asynchronous advantage actor-critic method.
Figure 6.10 Number of states in Q-table cache for each access frequency in 3D 128Gb (after running the workloads).
Figure 6.11 Three types of demand based Q-table cache.




Chapter 1 Introduction

Flash memory storage is widely used in embedded, consumer, and
enterprise server systems. Flash memory has two principal issues: (1) the
erase-before-write (write-once) property, and (2) limited endurance. To
address the erase-before-write property, a flash translation layer (FTL)
is employed. Currently, page-level mapping [3] is widely used to reduce
the write latency induced by the write-once and bulk-erase properties of
flash memory storage. In page-level mapping, when writing new data, the
FTL assigns a new free page and writes the data to it. Thereafter, it
updates the address-mapping information between the logical and physical
addresses. If free blocks are insufficient, they are obtained by
reclaiming the unused space in used blocks: the valid pages of a victim
block are copied to a new block, and the victim block is then erased to
obtain a free block. This procedure is called garbage collection (GC). GC
induces a long-latency problem because the page-copy and block-erase
operations are time-consuming.
GC latency increases as the capacity of flash memory increases, mainly
because the block size (the number of pages per block) increases with
capacity. GC latency is determined by the time for valid page copy and
block erase. Thus, as the block size increases, GC latency also increases.
According to our analysis, the block size has a strong impact on long tail
latency. In particular, the block size has increased in the transition
from 2D to 3D NAND flash memory, e.g., from 256 pages/block in 2D planar
NAND flash memory [4] to 768 pages/block in 3D NAND flash memory [5]. Even
within 3D NAND flash memory, the block size is expected to continue to
increase [6, 7]. Thus, the long write latency problem incurred by GC can
become more serious in 3D NAND flash memory-based storage. Note that GC
can increase not only write latency but also read latency, since GC can
stall the service of subsequent read requests. A long tail is observed in
the distribution of the write latency because of GC. For instance, the
latency at the 99th percentile can be 100× higher than the average latency
[8]. Such long-tail latency causes a significant problem in real-time
embedded and enterprise server systems, which need to meet real-time and
quality of service (QoS) requirements.
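The dependence of GC latency on block size can be illustrated with a small
back-of-the-envelope model. The sketch below assumes that the latency of
one GC operation is the time to copy the valid pages of the victim block
plus one block erase. The per-page copy time, the erase time, and the
valid-page ratio are hypothetical placeholders chosen for illustration,
not measurements from this dissertation; only the block sizes (256 and 768
pages/block) come from the text.

```python
def gc_latency_us(pages_per_block, valid_ratio, page_copy_us, block_erase_us):
    """Estimate the latency of one GC operation on a single victim block.

    Model: (number of valid pages) * (time to copy one page) + (one block erase).
    """
    valid_pages = int(pages_per_block * valid_ratio)
    return valid_pages * page_copy_us + block_erase_us


# Hypothetical timing parameters, chosen only for illustration.
PAGE_COPY_US = 1000     # read + program of one valid page
BLOCK_ERASE_US = 5000   # one block erase
VALID_RATIO = 0.5       # fraction of pages still valid in the victim block

latency_2d = gc_latency_us(256, VALID_RATIO, PAGE_COPY_US, BLOCK_ERASE_US)
latency_3d = gc_latency_us(768, VALID_RATIO, PAGE_COPY_US, BLOCK_ERASE_US)
print(latency_2d, latency_3d)
```

Under these assumed parameters the page-copy term dominates, so the
estimated GC latency grows almost linearly with the number of pages per
block.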
In this dissertation, we propose a reinforcement learning-assisted GC
technique to reduce the long tail latency. The proposed technique is a
new approach that exploits the idle time in the storage with reinforcement
learning. In addition, to reduce the memory cost of managing a large
Q-table while benefiting from a large number of states, we propose a
method called Q-table cache, which stores recently visited states among a
large number of state candidates and further reduces long tail latency by
efficiently exploiting the large number of states. We also report that
the Q-table cache alone has a limitation in initializing new entries: the
insertion of a new entry into the Q-table cache requires an appropriate
initialization of its Q-value, and the naive solution of zero
initialization cannot fully exploit the potential of the Q-table cache. To
resolve this problem, we propose a novel notion of Q-value prediction and
a neural network called the Q-value prediction network (QP Net) as its
realization.
The contributions of this study are as follows:
• The proposed reinforcement learning-assisted solution helps de-
termine the number of GC operations to be executed to exploit the
varying idle time while avoiding the long-tail latency due to the
GC [9].
• We also present an optimization scheme that aggressively performs
fine-grained GC to prepare free blocks in advance, thereby reduc-
ing the blockage due to the GC, which significantly reduces the
long-tail latency [9].
• We propose a technique called Q-table cache that requires only a
small amount of memory space to represent the environment of the
RL model by managing key states dynamically [10].
• This study offers the QP Net for further reduction in long tail
latency by addressing the problem of zero Q-value initialization in
the original Q-table cache. The proposed QP Net learns the system
behavior during runtime and can therefore provide good initial
Q-values when a new entry is inserted into the Q-table cache. This
improves Q-learning on the Q-table cache and contributes to a
further reduction in long-tail latency. Thus, our integrated solution
of Q-table cache and QP Net aims at exploiting both the short-term (by
Q-table cache) and long-term (by QP Net) history of system behavior
[11].
• We perform in-depth analysis of storage workloads and identify
storage-intensive ones. We show that our new scheme offers signif-
icant reductions in long-tail latency for those workloads compared
with the state-of-the-art technique [11].
The rest of this dissertation is organized as follows: Chapter 2
describes the background of long-tail latency, flash storage systems, and
RL. Chapter 3 reviews previous GC and RL techniques. Chapter 4 explains a
small Q-table based solution to reduce the long-tail latency problem.
Chapter 5 presents a Q-table cache based solution to exploit a large
number of states at small cost. Chapter 6 proposes a neural network-based
solution combined with the Q-table cache to exploit both long and short-term




Chapter 2 Background

2.1 System Level Tail Latency
Modern applications require small and predictable response times. The
response behavior is characterized by strict performance service level
objectives (SLOs), e.g., 99.99% of all requests are to be answered within
300 ms. Performance instability can cause delays of several tens of
milliseconds. This can lead to SLO violations and a degraded user
experience, thereby having a negative impact on revenue [12]. Tail latency
is expressed in terms of percentiles; for instance, long tail latency
refers to the latency observed at high percentiles such as the 99th
percentile.
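As a concrete illustration of how a percentile exposes a tail that the
average hides, the following sketch computes the mean, median, and 99th
percentile of a synthetic latency sample. The latency values are made up,
and the nearest-rank percentile definition is one common convention that
we assume here; it is not necessarily the one used in the cited studies.

```python
import math


def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies in ms: 98 fast requests and 2 slow outliers (made-up values).
latencies_ms = [1.0] * 98 + [200.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(mean_ms, p50_ms, p99_ms)
```

In this sample the median equals the fast-path latency, while the 99th
percentile is determined entirely by the two outliers.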
It is difficult to keep the tail latency distribution short from the per-
spective of a service provider. This is due to the increase in the size or
complexity of the interactive system and in total usage that subsequently
affects tail latency. In addition, workloads that temporarily incur high la-
tency often have a significant impact on the performance of the entire
system.
Changes in response time result in high tail latency. The main reasons
for the change in response time are as follows [8, 12, 13]:
• Shared resources: The server system shares key resources (such as
CPU cores, caches, memory bandwidth, and network bandwidth) among
various applications. Even within the same application, different
requests may contend for shared resources.
• Background daemons: Background daemons use only a small amount of
resources on average, but can increase the response time by several
tens of milliseconds when they are scheduled.
• Global resource sharing: Applications running on different machines
compete for global resources, such as network switches or shared
storage.
• Power limits: Modern CPUs throttle the operating speed depending
on their temperature. Therefore, if the CPU is active for a long
time, it can affect the response time.
• Garbage collection: SSDs provide significantly fast random access,
but require periodic garbage collection operations. This operation
can increase read latency by more than 100 times.
• Energy management: Significant energy savings can be achieved
using power saving mode on many devices. However, additional
latency occurs when switching from inactive to active mode.
• Timeouts: Failure tolerance and retry are widely used techniques in
distributed systems. However, even a single retry is sufficient to
increase the latency of the current request.
• Overload: A user sends requests that are too large or too numerous.
These requests can occupy shared resources, such as CPU, memory, and
network, causing other requests to queue up.
• Maintenance activities: Background activities such as data recon-
struction in a distributed file system, periodic log compaction, and
periodic garbage collection in garbage-collected languages are the
cause of periodic latency spikes.
The basic idea for reducing tail latency is hedging. Even when a task is
parallelized, the request completes only when its slowest instance
finishes. For example, one can send more requests than needed and use the
fastest response. In general, smaller task partitions also help to reduce
tail latency.
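The effect of hedging can be sketched with a small simulation. The service
time model below (a request is slow with 2% probability) and all names are
hypothetical, and the sketch ignores the extra load that duplicate
requests place on the system; it only illustrates that taking the faster
of two responses can never be slower than waiting for the first one.

```python
import random


def percentile(samples, p):
    """Approximate p-th percentile by indexing into the sorted samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[index]


def sample_latency_ms(rng):
    """Hypothetical service-time model: mostly fast, occasionally very slow."""
    return 100.0 if rng.random() < 0.02 else 1.0


rng = random.Random(0)  # fixed seed so the run is repeatable
single, hedged = [], []
for _ in range(10_000):
    first = sample_latency_ms(rng)
    second = sample_latency_ms(rng)    # duplicate request sent to another replica
    single.append(first)
    hedged.append(min(first, second))  # keep whichever response arrives first

print(percentile(single, 99), percentile(hedged, 99))
```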
Another approach to reduce tail latency is addressing the problem
of head-of-line blocking. A few expensive requests increase the latency
of cheap requests occurring simultaneously. Therefore, partitioning ex-
pensive requests into uniformly smaller tasks can help to reduce latency
[8, 12, 13].
As mentioned earlier, there are several causes for tail latency. In this
dissertation, we focus on reducing the long tail latency caused by garbage
collection, which is an essential operation of SSDs.
2.2 Solid State Drive
2.2.1 Flash Storage Architecture and Garbage Collection
Figure 2.1 shows the internal architecture of an SSD. The flash mem-
ory chip package stores data and the DRAM is used for mapping table
storage, data buffering and caching. The host interface (such as SATA
and NVMe) exchanges data with the host using a pre-determined proto-
col. The flash memory controller schedules the flash memory accesses
to maximize the parallelism of flash memory interface channels to boost
the performance. The main processor controls the address translation and
overall operation of the SSD.
Flash memory has the limitations of write-once (erase-before-write) and
block erase. A flash translation layer (FTL) is used to overcome them, and
various FTL algorithms have been studied. In recent years, page-level
mapping [3] has been widely adopted to reduce the write latency caused by
these flash memory characteristics.
When writing new data in page-level mapping, the FTL allocates a new free
page and then writes the data to the newly allocated page. The mapping
information between the logical and physical addresses is then updated. If
free blocks are insufficient (typically, if the number of free blocks is
less than a certain threshold), the FTL obtains a new free block by
reclaiming a used block. This process is called garbage collection (GC).
Figure 2.2 exemplifies the GC operation. As shown in the figure, the
FTL reads the valid pages from the victim block and writes them to the
newly allocated free block. This process is called valid page copy. When
all the valid page copies from the victim block are completed, the victim
block is erased. Through this process, a new free block is obtained. The
page copies typically take much longer time than block erase. Thus, the
GC latency is proportional to the number of valid pages of the victim
block. In addition, the flash memory (to be exact, the target plane in flash
memory) is blocked during the page copy operations, which increases
the latency of subsequent requests thereby yielding the long tail latency
problem. Generally, to minimize the valid page copy overhead, the victim
block is selected as the block with the smallest number of valid pages.
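The write path and the GC procedure described above can be sketched as a
toy model. The class below is a minimal illustration under our own
assumptions, not the FTL evaluated in this dissertation: the name ToyFTL,
the gc_threshold parameter, and the single active block are hypothetical,
and timing, wear leveling, and multiple planes are ignored. It keeps a
page-level map, invalidates stale copies on overwrite, and, when free
blocks run low, greedily selects the victim with the fewest valid pages.

```python
class ToyFTL:
    """Minimal page-level-mapping FTL with greedy garbage collection (illustrative only)."""

    def __init__(self, num_blocks=4, pages_per_block=4, gc_threshold=1):
        self.pages_per_block = pages_per_block
        self.gc_threshold = gc_threshold      # run GC when free blocks <= threshold
        self.mapping = {}                     # logical page -> (block, page)
        self.valid = [set() for _ in range(num_blocks)]  # valid logical pages per block
        self.free_blocks = list(range(1, num_blocks))
        self.active = 0                       # block currently being written
        self.next_page = 0                    # next free page in the active block
        self.page_copies = 0
        self.erases = 0

    def _append(self, lpn):
        """Write lpn to the next free page of the active block and update the map."""
        if self.next_page == self.pages_per_block:   # active block is full
            self.active = self.free_blocks.pop(0)
            self.next_page = 0
        old = self.mapping.get(lpn)
        if old is not None:
            self.valid[old[0]].discard(lpn)          # invalidate the stale copy
        self.mapping[lpn] = (self.active, self.next_page)
        self.valid[self.active].add(lpn)
        self.next_page += 1

    def _garbage_collect(self):
        """Greedy GC: pick the used block with the fewest valid pages, copy, erase."""
        candidates = [b for b in range(len(self.valid))
                      if b != self.active and b not in self.free_blocks]
        victim = min(candidates, key=lambda b: len(self.valid[b]))
        for lpn in list(self.valid[victim]):         # valid page copy
            self._append(lpn)
            self.page_copies += 1
        self.erases += 1                             # block erase
        self.free_blocks.append(victim)

    def write(self, lpn):
        """Host write: reclaim a block first if the free-block pool is low."""
        needs_new_block = self.next_page == self.pages_per_block
        if needs_new_block and len(self.free_blocks) <= self.gc_threshold:
            self._garbage_collect()
        self._append(lpn)


# Demo: 8 logical pages on 4 blocks x 4 pages, then overwrites that force one GC.
ftl = ToyFTL()
for lpn in range(8):
    ftl.write(lpn)
for lpn in (0, 1, 4, 5, 0):
    ftl.write(lpn)
print(ftl.page_copies, ftl.erases, ftl.free_blocks)
```

In the demo, the final write triggers GC on a victim that still holds two
valid pages, so the host write has to wait for two page copies and one
erase, which is the blocking behavior that produces long tail latency.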
Recently, 3D flash memory, e.g., V-NAND has gained popularity. The
3D flash memory has more pages in a block than the 2D flash memory,
which increases the latency of valid page copy and, finally, renders the











































Figure 2.2 Garbage collection.
2.3 Reinforcement Learning
Figure 2.3 shows the environment-agent interaction of the RL model.
The environment (storage system in this study) has states. The agent
(GC scheduler in this study) selects an action (number of pages to copy)
that maximizes all the future rewards expected from the current state of
the environment. The environment executes the action received from the
agent, passes the immediate reward (a function of latency) to the agent
as a result, and switches to the next state. The RL model operates by
repeating this process.
The basic components of the RL model are as follows:
State (S): The state can cover all the information of the environment
(SSD system in this study) which is useful for the purpose of RL agent,
i.e., reward maximization.
Action (A): The actions are the output of agent execution, which cor-
responds to the number of page copies to be performed in this study.
Reward (r): The reward is associated with the action. In this study,
the shorter the latency, the larger the reward.
Policy (π): The strategy of the agent for selecting an action. Actions
are selected to maximize the cumulative reward thereby minimizing long
tail latency in this study.
Q-learning [14] is used in this study as the policy learning method.
Q-learning updates the Q-value (the expected cumulative future reward)
using the value function of the state-action pair under the optimal
policy, known as the Bellman equation:
\[ Q(s,a) = \mathbb{E}\{\, r_t + \gamma Q(s_{t+1}, a_{t+1}) \mid s_t = s,\ a_t = a \,\} \tag{2.1} \]
where s_t, a_t, and r_t represent the state and action at time step t and
the reward for a_t, respectively, and γ is the discount factor (set to
0.95 in our experiments). Q(s,a) is the Q-value, i.e., the cumulative
reward expected when action a is taken at time t in state s.
The policy is defined as follows:
\[ \pi(s) = \arg\max_{a} Q(s,a) \tag{2.2} \]
As shown in (2.2), the agent selects the action that maximizes the
Q-value in state s. Because Q-learning is an online method, the policy is
constantly modified, i.e., trained during runtime according to the dynamic
behavior of the system.
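The policy in (2.2) amounts to an argmax over one row of the Q-table. The
sketch below assumes a Q-table stored as a dictionary from state to a list
of Q-values, one per action; the state names and Q-values are hypothetical
and serve only as an illustration.

```python
def greedy_action(q_table, state):
    """Return argmax_a Q(state, a) for a Q-table stored as {state: [Q-value per action]}."""
    q_values = q_table[state]
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-table: two states, three actions (e.g., copy 0, 1, or 2 pages).
q_table = {
    "short_idle": [0.5, 0.2, -0.1],
    "long_idle": [0.1, 0.4, 0.9],
}
print(greedy_action(q_table, "short_idle"), greedy_action(q_table, "long_idle"))
```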
We apply temporal-difference (TD) learning as follows:
\[ Q(s,a) = Q(s,a) + \alpha\{\, r + \gamma Q(s',a') - Q(s,a) \,\} \tag{2.3} \]
where α is the learning rate (set to 0.3 in our experiments), r is the reward
of the action a taken in state s, γ is the discount factor, and s′ and a′ are
the next state and action, respectively [14]. The goal of the equation is
to reduce the gap between the target Q-value, r+ γQ(s′,a′) and current
Q-value, Q(s,a).
In (2.3), the Q-value on the Q-table, i.e., Q(s′,a′) is re-used for a
quick calculation of the Q-value update, which is called bootstrapping,
wherein only the reward r is needed to update the Q-value. The equation
eventually updates the policy because the policy is determined by the
Q-value.
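A single TD update of the form in (2.3) can be sketched as follows. The
learning rate and discount factor are the values reported in the text
(0.3 and 0.95). The dictionary layout of the Q-table and the choice of a′
as the greedy action in the next state (as in standard Q-learning) are
assumptions of this sketch, and the states and rewards are made up.

```python
ALPHA = 0.3   # learning rate reported in the text
GAMMA = 0.95  # discount factor reported in the text


def td_update(q_table, state, action, reward, next_state):
    """Apply one update of the form in (2.3) and return the new Q(state, action).

    Assumption: a' is taken as the greedy action in next_state (standard Q-learning).
    """
    target = reward + GAMMA * max(q_table[next_state])  # r + gamma * Q(s', a')
    q_table[state][action] += ALPHA * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical two-state, two-action Q-table.
q_table = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
new_q = td_update(q_table, "s0", 1, reward=1.0, next_state="s1")
print(new_q)
```

Only the observed reward and the stored value Q(s′,a′) are needed, which
is the bootstrapping property noted above.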
Q-learning uses a Q-table structure to store Q(s,a) values. In order to
build a Q-table, as exemplified in Table 5.2, we need to bin information
with continuous values, e.g., the inter-request interval, into multiple
levels. The size of the Q-table is the product of the number of state bins
(in short, the number of states) and the number of actions. In order to
apply Q-learning to







Figure 2.3 Environment-agent interaction (labels: Action A_t, Reward r_t, State S_t).
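The binning of continuous state information and the resulting Q-table size
discussed in this section can be sketched as follows. The bin edges, the
number of features, and the number of actions are hypothetical; we also
assume that a state is one combination of per-feature bins, so that the
number of states is the product of the per-feature bin counts.

```python
import bisect

# Hypothetical bin edges for one continuous feature (inter-request interval, in ms).
INTERVAL_EDGES_MS = [1, 10, 100]  # 4 bins: <1, [1,10), [10,100), >=100


def bin_index(value, edges):
    """Map a continuous value to the index of its bin."""
    return bisect.bisect_right(edges, value)


def q_table_entries(bins_per_feature, num_actions):
    """Q-table size = (product of bin counts over all state features) * (number of actions)."""
    num_states = 1
    for bins in bins_per_feature:
        num_states *= bins
    return num_states * num_actions


print(bin_index(0.5, INTERVAL_EDGES_MS), bin_index(10, INTERVAL_EDGES_MS),
      bin_index(250, INTERVAL_EDGES_MS))
print(q_table_entries([4, 2, 3], num_actions=3))
```

Because the number of states is multiplicative in the bin counts, adding
finer-grained state information quickly enlarges the table.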




Chapter 3 Related Work

Several studies have been conducted with the aim of reducing the latency
induced by GC. Zhang et al. proposed a real-time lazy GC mechanism that
adopts on-demand page address mapping and a partial GC technique to
improve the performance of the system [15]. In particular, partial GC
divides the GC into several operations to guarantee the worst-case system
response time.
Chang et al. proposed a free-page replenishment mechanism that prevents
real-time tasks from being blocked due to an insufficient number of free
pages. Assuming the write behavior of a real-time task is known, the
number of GC operations and the maximum quantum for a GC operation are
determined to meet the real-time constraints [16].
Choudhuri et al. proposed GFTL, which performs partial GC to ensure fixed
upper bounds on the latency of storage access by eliminating the sources
of non-determinism [17].
Qin et al. proposed a distributed partial GC policy in RFTL, which tries
to hide the long-tail latency due to GC. The method periodically performs
partial GC and exploits buffer blocks to store the data written during the
GC operation, thereby reducing the GC-induced blockage [18].
Shahidi and Kandemir proposed a cache-assisted GC technique (CachedGC),
which resumes host requests immediately after moving valid page data to
the DRAM buffer of the SSD controller. The valid pages in the DRAM buffer
are written back when the SSD has low utilization (i.e., during SSD idle
time) [19]. However, this approach still faces the problem of accurately
estimating the idle time. In addition, the limited DRAM buffer is shared
between host write requests and GC, which reduces the DRAM buffer capacity
usable for host write requests and can affect write performance.
In [20], Wei et al. propose a workload-adaptive flash translation layer
(WAFTL) with data partitions. It employs both page-level and block-
level mapping blocks as normal data blocks. According to the data pat-
tern, WAFTL selects the type of data block. The page-level mapping
block handles random data and conducts partial data updates. The block-
level mapping block stores sequential data. In particular, to reduce the
garbage collection overhead, they utilize offline garbage collection to
erase invalid blocks during idle time.
In [21], Yan et al. propose Tiny-Tail flash, which tries to eliminate
tail latency due to garbage collection. They employ four key techniques:
plane-blocking GC, GC-tolerant reads, rotating GC, and GC-tolerant flushes.
Plane-blocking GC reduces controller and channel blocking to plane
blocking using a fine-grained management scheme. GC-tolerant read prevents
I/O blocking caused by a plane undergoing GC using a technique called
RAIN, which exploits parity pages as in a redundant array of independent
disks (RAID). Rotating GC helps to reduce I/O blocking using a policy by
which at most one plane in each plane group can run GC at a time.
GC-tolerant flush facilitates rapid flushing of the write buffer using
capacitor-backed RAM.
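The parity idea behind GC-tolerant reads can be illustrated with a generic
XOR-parity example. This is a sketch of RAID-style reconstruction in
general, not the Tiny-Tail flash implementation; the page contents, the
stripe width, and the function name are hypothetical.

```python
from functools import reduce


def xor_pages(pages):
    """Byte-wise XOR of equally sized pages."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*pages))


# Three data pages striped across planes, plus one parity page (RAID-style).
data_pages = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity_page = xor_pages(data_pages)

# Suppose the plane holding page 1 is busy with GC: rebuild it from the others.
available = [data_pages[0], data_pages[2], parity_page]
rebuilt = xor_pages(available)
print(rebuilt == data_pages[1])
```

A read that targets the busy plane can thus be served from the remaining
pages of the stripe instead of waiting for GC to finish, at the cost of
extra reads and parity storage.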
A growing number of studies have been conducted with the aim of reducing
storage-level tail latency. Amvrosiadis et al. presented Duet, a framework
that provides notifications about page-level events to maintenance tasks.
The tasks use these events as hints to process cached data. Tasks using
Duet can finish maintenance work more efficiently because they issue
fewer I/O operations and thus complete faster. When tasks run
concurrently, Duet helps to minimize their impact on performance [22].
He et al. proposed Chopper, a tool to discover high-latency operations
within local file systems. They focused on block allocation, which is a
critical contributor to uncommon behavior in recent systems and can
reduce file system performance. Chopper utilizes sophisticated statistical
methodologies to explore the search space and diagnose intricate design
problems efficiently. They pinpointed and removed four layout issues in
ext4. Their improvements significantly reduce the problematic behaviors
causing tail latencies [23].
Yang et al. proposed a split-level I/O scheduling framework that splits
I/O scheduling logic across handlers at three layers of the storage stack
(block, system call, and page cache). Split schedulers can determine which
processes issued I/O and accurately estimate I/O costs. This method can
prevent file systems from imposing orderings that are contrary to
scheduling goals [24].
Studies have been conducted on approaches that utilize the idle time
and workload prediction. Han et al. predicted the future workload and
controlled the number of victim blocks [25]. The victim blocks are
selected based on the age, utilization, and erase counts. The number of
reclaimed blocks is then determined from a prediction based on the
history of the request count and rate.
Lin et al. predicted the future workload and obtained the number of
victim blocks based on the predicted workload, erase count, and invali-
dation period [26].
Reinforcement learning has been widely used in a broad range of
problems including robot control and resource allocation in data centers.
In [27], Ipek et al. proposed a self-optimizing DRAM controller design
based on reinforcement learning. This memory controller observes the
system state and predicts the long-term performance impact of each action
it can perform. In this way, the controller learns to optimize its
scheduling policy to offer maximum performance.
Pritzel et al. proposed a technique called neural episodic control,
which can rapidly learn successful policies once they are experienced [28].
They employ a memory structure called the differentiable neural dictionary
(DND), which pairs slowly changing keys with rapidly updated values and
retrieves values through context-based lookup on the keys. The DND is
similar to the Q-table cache [10] in that the DND stores key-value pairs
and the Q-table cache stores state-value pairs. On the other hand, they
manage memory in different ways. On a lookup, the DND returns a weighted
sum of multiple values from the memory, whereas the Q-table cache reads a
single state-value pair. After the DND is read, a new key-value pair is
saved into the DND. In the Q-table cache, a hit updates the hit entry
according to Q-learning and the replacement policy, whereas a miss
inserts a new entry based on the replacement policy.
Mnih et al. proposed deep reinforcement learning (also called the Deep
Q-Network, DQN for short), where a neural network is trained to produce
Q-value outputs for a given input image [29]. Our proposed QP Net was
inspired by this approach. The key difference is that, instead of
predicting Q-values with a single large network, which is prohibitively
expensive in embedded systems like storage, our scheme exploits the
locality of storage behavior and thus consists of a small,
resource-efficient Q-table cache for short-term behavior and a small
Q-value prediction network for long-term behavior. The DQN approach also
employs an experience replay method to mitigate the non-stationary
distribution problem. Experience replay requires a large replay memory,
which can incur a large memory cost and may not be suitable for embedded
systems with limited memory resources.
Konda and Tsitsiklis proposed an actor-critic method [30]. The actor
network approximates the policy and outputs the probability of an action.
The critic network approximates the target value function, namely the
approximate optimal Q-value function. Our study was also motivated by
the actor-critic method. Our solution is more resource-efficient in that
we predict the Q-value, which takes the role of the action probability,
with a very small Q-table cache instead of a large actor network. Our
QP Net is also small since it only has to predict the initial Q-value;
more accurate Q-learning is then performed on the Q-table cache.
Chapter 4
Small Q-table based Solution to
Reduce Long Tail Latency
4.1 Problem and Motivation
4.1.1 Long Tail Problem in Flash Storage Access
Latency
Figure 4.1 shows the latency comparison for a storage trace called home2
(used in our experiments) between an ideal storage without GC overhead
and a real one with page-level mapping. The figure shows that the
response time is short for the majority of the storage accesses: it is
less than 1 ms for approximately 85% of the accesses. However, the
99th-percentile latency is about 100 times the median. As mentioned
before, such a long-tail latency is a serious problem in real-time and
quality-critical systems. For instance, server storage typically needs
to keep the write latency within 7.5 ms for 99.99% of the storage
accesses. Considering that the GC latency continues to increase due to
the increasing block size, it is important to reduce the long-tail
latency for such real-time and quality-critical systems.
4.1.2 Idle Time in Flash Storage
Figure 4.2 shows the distribution of the request interval time for 60K
requests of the real-world workloads used in our experiments. The x-axis
represents the inter-request interval time, and the y-axis represents the
frequency of the requests in each bin. As the figure shows, the storage
system has frequent and long idle periods. Such idle time can be
exploited to perform GC operations. In idle time-aware GC methods
[25, 26], it is important to determine how many GC operations need to be
performed for a given idle time. The difficulty of this problem is that
the length of the current idle period is unknown. Several techniques
address this problem [25, 26], but they use fixed policies determined at
design time. Thus, they are limited in adapting to storage access
behavior that changes dynamically across program runs or phases.
In this dissertation, we propose an RL-assisted adaptive GC method
[9], which learns the storage access behavior online and adjusts the GC
to it to reduce the long-tail latency.
Figure 4.1 Long tail latency problem: home2.
Figure 4.2 Inter-request interval distribution.
4.2 Design and Implementation
4.2.1 Solution Overview
We aim to reduce the long-tail latency by (1) hiding the GC latency by
exploiting the idle time, and (2) minimizing the GC-induced blocking.
In this section, we present an RL-assisted GC scheduler to hide the GC
latency (Section 4.2.2) and an aggressive fine-grained partial GC scheme
to reduce the blocking time (Section 4.2.3).
Our proposed RL-assisted GC scheduler is triggered in a lazy manner: it
is triggered only when an access request arrives at the storage and the
number of free pages is below a threshold [15]. When triggered, it
chooses an action. Because our GC method is based on partial GC, an
action is to perform a number of partial GC operations, e.g., five page
copies from a victim block to a free block. Thus, the GC scheduler
chooses an action, i.e., determines how many partial GC operations will
be performed after serving the current request. If a block is ready to
be erased when the scheduler chooses an action, the block is erased
instead of executing the action.
After serving the request, the GC scheduler calculates the response
time. Because our goal is to reduce the long-tail latency, we need to
reflect the response time in our reward. We explain the details of how
the reward is calculated using the response time in Section 4.2.2. Note
that the response time of the kth request gives the reward for the
(k-1)th request, because the GC work chosen at the (k-1)th request is
what can delay the kth request. Thus, in the aforementioned Q-learning
(Equation (3)), we update the Q value for the current state s and action
a only after the next request is served and the corresponding reward is
calculated.
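The delayed update can be sketched as follows. This is a minimal illustration that assumes Equation (3) is the standard one-step Q-learning update; the class name, the learning rate, and the discount factor are placeholders, not values from this work.

```python
from collections import defaultdict

ALPHA = 0.3   # learning rate (placeholder value)
GAMMA = 0.8   # discount factor (placeholder value)

class DelayedQLearner:
    """Credits the reward observed at request k to the decision made at request k-1."""

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.q = defaultdict(lambda: [0.0] * num_actions)  # state -> Q-values
        self.pending = None  # (state, action) chosen at the previous request

    def on_request_served(self, state, action, reward):
        """Call after request k is served.

        state, action: state observed and action chosen at request k
        reward: computed from the response time of request k
        """
        if self.pending is not None:
            prev_state, prev_action = self.pending
            td_target = reward + GAMMA * max(self.q[state])
            old = self.q[prev_state][prev_action]
            self.q[prev_state][prev_action] = old + ALPHA * (td_target - old)
        self.pending = (state, action)

learner = DelayedQLearner(num_actions=3)
learner.on_request_served("s0", 1, reward=1.0)   # nothing to update yet
learner.on_request_served("s1", 2, reward=-0.5)  # updates Q("s0", 1)
```

The first call only records the decision; the second call is the point at which the entry for the earlier state-action pair changes.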
4.2.2 RL-assisted Garbage Collection Scheduling
States: In reinforcement learning, the states need to represent the
history that helps in maximizing the reward. We propose using the
following information as the states.
• Previous inter-request interval
• Current inter-request interval
• Previous action
The inter-request interval is an important piece of history information
because it reflects the intensity (i.e., the idleness) of the storage
traffic. Thus, if the interval is large, the RL-assisted GC scheduler
tends to take a more aggressive action, i.e., a larger number of partial
GC operations. From the viewpoint of the agent, both the host and the
SSD subsystem constitute the environment. The inter-request intervals
represent the state of the host. The previous action can represent the
state of the SSD subsystem as well as that of the host, because it not
only serves as a summary of both the recent history and the decision of
the GC scheduler, but also affects the state of the SSD subsystem, i.e.,
whether it is busy with page copies or idle. For instance, if the
previous action is to copy a large number of pages, then the SSD
subsystem tends to be busy in the current state.
We divide each of the three components into multiple bins: 2 bins for
the previous inter-request interval, 17 bins for the current
inter-request interval, and 2 bins for the previous action, which gives
a total of 68 (= 2 × 17 × 2) states. The details of the binning are
given in Section 4.3.1.
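One way to map the three binned components onto a single Q-table row index is sketched below. Only the bin counts (2, 17, 2) come from the text; the bin edges, the assumed unit (microseconds), and the value of MAX_ACTION are hypothetical placeholders.

```python
from bisect import bisect_right

# Hypothetical bin edges; only the number of bins follows the text.
PREV_INTERVAL_EDGES = [100]                               # 1 edge   -> 2 bins
CURR_INTERVAL_EDGES = [100 * 2 ** i for i in range(16)]   # 16 edges -> 17 bins
MAX_ACTION = 8                                            # assumed largest action

NUM_STATES = 2 * 17 * 2  # 68 rows in the Q-table

def encode_state(prev_interval, curr_interval, prev_action):
    """Return a Q-table row index in the range 0..67."""
    prev_bin = bisect_right(PREV_INTERVAL_EDGES, prev_interval)   # 0..1
    curr_bin = bisect_right(CURR_INTERVAL_EDGES, curr_interval)   # 0..16
    action_bin = 0 if prev_action < MAX_ACTION / 2 else 1         # 0..1
    return (prev_bin * 17 + curr_bin) * 2 + action_bin
```

With 68 rows, the whole table fits in a small on-chip memory, which is the motivation for keeping the bin counts low.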
Reward: We need to assign a larger reward for a smaller response time
and to penalize an action that results in a long response time. Figure
4.3 shows our reward function. The reward ranges between -0.5 and 1. For
instance, if the response time is large (larger than the threshold t3),
a negative reward is assigned to penalize the action. The thresholds in
the reward function in Figure 4.3 need to be adjusted to the
characteristics of the storage accesses, because a fixed set of
thresholds will not cover the diverse scenarios in storage accesses. In
particular, we set the three thresholds t1, t2, and t3 to the 70th,
90th, and 99th percentiles of the response time, respectively. Hence,
even if the storage-access behavior changes, the thresholds can be
adjusted based on the new distribution of the response time.
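A percentile-driven reward of this kind might look like the sketch below. The thresholds follow the text (70th, 90th, and 99th percentiles). The step shape, the intermediate reward levels, and the penalty value are assumptions, since the text only states that short response times earn the largest reward and that response times beyond t3 earn a negative one.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list (p in 0..100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def make_reward(response_times):
    """Build a reward function whose thresholds follow the observed latencies."""
    ordered = sorted(response_times)
    t1, t2, t3 = (percentile(ordered, p) for p in (70, 90, 99))

    def reward(response_time):
        # Reward levels below are assumed for illustration.
        if response_time <= t1:
            return 1.0
        if response_time <= t2:
            return 0.5
        if response_time <= t3:
            return 0.0
        return -0.5

    return reward, (t1, t2, t3)

reward, thresholds = make_reward(range(1, 101))  # synthetic latencies 1..100
```

Rebuilding the function from a fresh window of response times is one way to realize the threshold adjustment described above.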
Exploitation and Exploration Balance: The exploration aims at filling
in all the entries of the Q-table, and subsequently, improving them
toward the optimal policy. To do that, we employ the ε-greedy
technique [14]. In the initial period of RL execution (the first 1000
GC operations in our experiments), we utilize a large ε value (80%) to
perform aggressive exploration. Then, we utilize a small ε value (1%)
to balance exploitation and exploration during the rest of the period.
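The two-phase ε schedule can be sketched as follows. The constants come from the text; the function names and the injectable random source are illustrative assumptions.

```python
import random

WARMUP_GC_OPS = 1000   # length of the initial exploration period
EPS_WARMUP = 0.80      # epsilon during the initial period
EPS_STEADY = 0.01      # epsilon afterwards

def epsilon(gc_ops_done):
    """Exploration probability after `gc_ops_done` GC operations."""
    return EPS_WARMUP if gc_ops_done < WARMUP_GC_OPS else EPS_STEADY

def e_greedy(q_values, gc_ops_done, rng=random):
    """Pick an action index from one Q-table row."""
    if rng.random() < epsilon(gc_ops_done):
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit
```

Passing a seeded `random.Random` instance as `rng` makes simulation runs repeatable.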
GC Scheduling: Algorithm 1 shows the pseudo code of the proposed
RL-assisted GC scheduler. For each request to the storage, the GC
scheduler compares the number of free blocks Nfree with the threshold
TGC (= 10 blocks in our experiments). If TGC >= Nfree, we call the
function e_greedy() (line 2), which performs either exploration or
exploitation based on the probability ε, i.e., a random action is
selected with a probability of ε, or an action is selected using the
policy with a probability of 1 - ε [14]. Note that we do not trigger the
GC scheduler in the case of consecutive requests wherein the
inter-request interval is zero (lines 3–5). After serving the request
and obtaining the response time for the current request (line 6), we
perform the selected action, i.e., the partial GC operations (line 7).
We then call the reward function with the response time of the current
request (line 8). Finally, as mentioned previously, we update the
Q-table entry associated with the previous request (line 9).
Intensive Garbage Collection: The baseline method in Algorithm 1 is not
free from a blocked situation wherein the flash storage runs out of free
blocks. To avoid such a situation, we employ the intensive garbage
collection (GC) method from LazyRTGC [15] and modify it for further
improvement. The objective of the intensive GC is to perform more
partial GC operations (5 or 7 valid page copies in our experiments) than
in the normal partial GC (typically, 1 or 2 page copies), thus enabling
faster reclamation of free blocks. The number (5 or 7) of partial GC
operations is determined by considering the number of pages in a block
and other parameters of the flash memory, e.g., erase time [15]. In
[15], the intensive GC is triggered when there is only one free block
left. Under the intensive GC, the action chosen by the RL policy is
ignored and a fixed number of partial GC operations is performed after
serving a write request. In [15], the intensive GC is no longer applied
once the number of free blocks becomes greater than one. In our work, we
propose to stop the intensive GC at a larger threshold (termed the
threshold of stopping intensive GC, TIGC). We use TIGC = 3, which is
obtained via a sensitivity analysis in our experiments.
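The start and stop conditions form a hysteresis, which the following sketch makes explicit. The constants follow the text; the class name and the exact stop comparison (stopping once Nfree exceeds TIGC) are assumptions.

```python
T_IGC = 3               # stop threshold for intensive GC
START_FREE_BLOCKS = 1   # intensive GC starts when one free block is left
INTENSIVE_COPIES = 5    # fixed number of page copies (5 or 7 in the text)

class IntensiveGC:
    """Tracks whether intensive GC is active across write requests."""

    def __init__(self):
        self.active = False

    def copies_after_write(self, n_free, rl_action):
        """Number of partial GC page copies to run after a write request."""
        if n_free <= START_FREE_BLOCKS:
            self.active = True
        elif n_free > T_IGC:   # assumption: stop once N_free exceeds T_IGC
            self.active = False
        # While active, the RL action is ignored in favor of the fixed count.
        return INTENSIVE_COPIES if self.active else rl_action
```

Compared with stopping as soon as more than one free block exists, the wider gap keeps the intensive mode on until a small reserve of free blocks has been rebuilt.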
Algorithm 1: RL-Assisted GC Scheduling
Input: request, state_{t-1} (S_{t-1}), state_t (S_t), action_{t-1} (A_{t-1})
Output: action_t (A_t)
1 if TGC >= Nfree then
2   A_t = e_greedy(interval_{t-1}, interval_t, action_{t-1})
3   if interval_t == 0 then
4     go to line 1
5   end
6   serve the request and obtain response_time
7   run partial_gc(A_t)
8   r = reward(response_time)
9   update the Q-table entry of the previous request, Q(S_{t-1}, A_{t-1}), using r
10 end
Figure 4.3 Reward function.
4.2.3 Aggressive RL-assisted Garbage Collection
Scheduling
In this subsection, we propose two methods of aggressively triggering
the GC to further reduce the long-tail latency [9]. To reduce the
long-tail latency, it is effective to limit the maximum number of
partial GC operations per action. In our experiments, we found that the
best result is obtained when the number of partial GC operations is
limited to two. Thus, when the policy chooses an action with more than
two partial GC operations, we set the number of partial GC operations to
two. When applying this method, we need to consider the blocking
situation (where the flash storage runs out of free blocks) because we
limit the maximum number of partial GC operations.
To avoid the blocking situation, we trigger the GC more aggressively by
introducing a new threshold for the number of free blocks, TAGC. TAGC is
set higher than TGC (10). We call this method early GC triggering with
the maximum limit of partial GC operations, in short, max-limited early
GC triggering. Note that the maximum number of partial GC operations is
limited only when the number of free blocks Nfree is between TAGC and
TGC. When Nfree <= TGC, the maximum limit is not applied to the action
chosen by the RL-assisted GC scheduler.
The aggressive GC operation can increase the erase count. To avoid
this, we carefully select the victim blocks. When Nfree is within the
two thresholds TAGC and TGC, we select a victim block only when it has
a larger number of invalid pages than a threshold (60% of the block size
in our experiments).
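The two rules for the region between the thresholds can be sketched as below. TGC, the cap of two operations, and the 60% ratio come from the text, and TAGC = 100 is the value listed in Table 4.4. The function names, the inclusive or exclusive handling of the boundaries, and the greedy choice of the block with the most invalid pages are assumptions. Both functions assume that the scheduler has already been triggered.

```python
T_GC = 10                 # GC trigger threshold
T_AGC = 100               # aggressive GC trigger threshold
MAX_EARLY_COPIES = 2      # cap on partial GC operations between the thresholds
MIN_INVALID_RATIO = 0.6   # victim must exceed this share of invalid pages

def limit_action(n_free, action):
    """Clamp the RL action while T_GC < N_free <= T_AGC; otherwise keep it."""
    if T_GC < n_free <= T_AGC:
        return min(action, MAX_EARLY_COPIES)
    return action

def pick_victim(n_free, invalid_pages, pages_per_block):
    """invalid_pages: dict block_id -> invalid page count. Returns an id or None."""
    if not invalid_pages:
        return None
    victim = max(invalid_pages, key=invalid_pages.get)
    early = T_GC < n_free <= T_AGC
    if early and invalid_pages[victim] <= MIN_INVALID_RATIO * pages_per_block:
        return None   # reclaiming this block would cost too many page copies
    return victim
```

Returning no victim in the early region simply postpones GC, which is safe because the free-block count is still above TGC.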
In conventional GC methods, a write request triggers GC when the
number of free blocks is less than a certain threshold. In case of the read
request, the GC is not triggered to avoid the increase in the read latency.
We propose triggering a partial GC operation even for a read request
when the triggering condition is met. Note that the latency of the read
request does not increase because the GC operation is performed after
serving the read request. We call this method read-initiated GC trigger-
ing.
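Note that, in our aggressive method, the RL-assisted GC scheduler
is triggered using the two methods: max-limited early GC triggering and
read-initiated GC triggering. Based on our experiments, they prove useful
in obtaining free blocks during the idle time, thereby reducing the
long-tail latency.
4.3 Experiments
4.3.1 Experimental Setup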
We compare our proposed RL-assisted GC methods [9] (baseline in Section
4.2.2 and aggressive in Section 4.2.3) with a typical GC method based on
page-level mapping (page-level) [3] and LazyRTGC [15]. We implemented
our proposed methods, page-level, and LazyRTGC on the FlashSim simulator
[31]. We use the long-tail latency at the 99th, 99.99th, and 99.9999th
percentiles and the erase count as metrics. We use eight real-world
workloads (six workloads from FIU [32] and two workloads from Microsoft
[32]) and a synthetic one (from filebench [33]) as listed in Table 4.1.
The goal of our work is to reduce the long tail latency. In
read-intensive workloads, the problem of long tail latency is not severe
since GC is rarely invoked. Thus, we used write-intensive workloads in
our experiments.
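For reference, tail-latency metrics of this kind can be computed from a list of per-request latencies as sketched below. The nearest-rank definition of a percentile and the function names are assumptions; the simulator may use a different definition. Percentiles are passed as decimal strings so that values such as 99.9999 are handled exactly.

```python
import math
from fractions import Fraction

def tail_latency(latencies, pct):
    """Nearest-rank percentile; pct is a decimal string such as "99.99"."""
    ordered = sorted(latencies)
    rank = math.ceil(Fraction(pct) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def normalized_tail(method, reference, pct):
    """Tail latency of `method` relative to `reference` (e.g., LazyRTGC)."""
    return tail_latency(method, pct) / tail_latency(reference, pct)
```

Note that the 99.9999th percentile is only meaningful for traces with on the order of a million requests or more; for shorter traces it collapses to the maximum latency.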
We started simulations with an empty flash-memory model and measured
the latency of all the requests for each workload. We use two types of
3D flash-memory systems as listed in Table 4.2.
Table 4.3 shows the binning for the components of the state. The
binning was obtained by a sensitivity analysis on the binning choices,
varying the number of bins over 1∼3 and 15∼20 for the previous and
current inter-request intervals, respectively, and over 1∼3 for the
previous action, with the aim of reducing the Q-table size, i.e., the
number of states, while improving the long tail latency. Considering
that accesses to NAND flash memory take 10∼1000 μs, e.g., 49 μs for a
read and 600 μs for a write [1], the runtime overhead of the agent is
negligibly small even though the agent is triggered at every storage
access. This is because the agent accesses the Q-table (in a small SRAM)
at most twice and executes a few instructions on the controller chip.
Thus, the runtime of the agent is much smaller than the read latency of
NAND flash memory.
Table 4.4 summarizes the thresholds used in our method. We obtained
them by conducting a sensitivity analysis with all the storage traces. To
improve the generality of our proposed methods, in our future work, we
will investigate the feasibility of reducing the number of thresholds by
enhancing the RL model, e.g., by introducing the number of free blocks
into the states of the agent.
Table 4.1 Workload characteristics.
Workload         Write ratio   Avg. interval [μs]   Avg. request size [KB]
home1            99%           85565                8.08
home2            91%           320548               9.40
home3            99%           1882329              8.26
home4            94%           693651               7.56
webmail          74%           303762               8.00
webmail+online   78%           127184               8.00
RBESQL           82%           11664                57.85
MSNSFS           67%           739                  21.67
oltp             99%           84                   4.46

Table 4.2 NAND flash memories.
Parameter                  3D 128 Gb   3D 512 Gb
Page size                  8 KB        16 KB
Number of pages / block    384         768
Number of blocks / plane   2731        2874
Number of planes           2           2
Page read time             49 μs       60 μs
Page program time          600 μs      700 μs
Block erase time           4000 μs     3500 μs
Data transfer rate         533 Mbps    1 Gbps




Table 4.3 Binning of the state components.
Component                         Number of bins   Bin boundaries
Previous inter-request interval   2                < 100, > 100
Previous action                   2                < max action/2, > max action/2
Current inter-request interval    17               < 100, < 500, ..., > 100000

Table 4.4 Thresholds.
Threshold   Value   Remark
TGC         10      Triggering GC
TIGC        3       Stopping intensive GC
TAGC        100     Triggering aggressive GC
4.3.2 Results and Discussion
Figure 4.4 compares the long-tail latency (in CDF) for writes. The figure
shows that our proposed methods exhibit better long-tail latency than
page-level and LazyRTGC. Page-level is not shown in the figure because
its latency is too large; it does not adopt any optimization to reduce
the long tail latency. LazyRTGC applies partial GC operations in a lazy
manner and shows better latency than page-level.
Latency: Table 4.5 compares the latency normalized to LazyRTGC
on a 3D 512Gb flash memory. Our baseline method (Base in the table)
gives better (smaller) average latency: 0.86× at the 99.9999th, 0.94×
at the 99.99th, and 0.92× at the 99th percentile. The gain is a result
of the reinforcement learning-assisted action selection. LazyRTGC
utilizes a fixed number of partial GC operations. In contrast, our
proposed RL-assisted method can adapt to the characteristics of the
storage behavior, thereby providing a variable number of partial GC
operations to better exploit the idle time, which contributes to
reducing the long-tail latency. Our aggressive method (Aggr in the
table) gives much smaller latency: 0.76× at the 99.9999th, 0.71× at the
99.99th, and 0.92× at the 99th percentile. This shows that the two
aggressive solutions, max-limited early GC triggering and
read-initiated GC triggering, are effective in further reducing the
long-tail latency.
Figure 4.4 Comparison of write long-tail latency (3D 512Gb flash memory).
In particular, the aggressive method gives much better latency in the
four workloads: home1, home2, webmail, and webmail+online. These
workloads have heavy overwrite traffics distributed across a wide range
of addresses. Figure 4.5 exemplifies the distribution of the write traffics
for home1 and home3. As the figure shows, in the case of home1, the
overwrites are much stronger than that in home3 (see y-axis). In addition,
such strong overwrites are more distributed across a wider address range
than that in home3.
Such a write behavior in home1 increases the ratio of invalid pages
across a large number of blocks, which makes the GC cheaper, i.e., a free
block can be obtained for fewer valid page copies. Thus, our aggressive
method is effective in home1. However, as shown in Figure 4.5(b), home3
has weaker overwrite behavior than home1, which makes it difficult for
the aggressive method to reclaim the free blocks using fine-grained par-
tial GC.
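The cost argument above can be illustrated with a minimal Python sketch. It assumes a greedy victim selection (the block with the fewest valid pages) and hypothetical 8-page blocks; the `Block` class, the helper names, and the page counts are illustrative assumptions and are not taken from the experiments.

```python
from dataclasses import dataclass


@dataclass
class Block:
    valid_pages: int    # pages that must be copied before the block can be erased
    invalid_pages: int  # pages already overwritten elsewhere


def cheapest_victim(blocks):
    """Greedy choice (an assumption): the block with the fewest valid pages."""
    return min(blocks, key=lambda b: b.valid_pages)


def copies_to_reclaim(blocks):
    """Valid-page copies needed to obtain one free block under the greedy choice."""
    return cheapest_victim(blocks).valid_pages


# Hypothetical 8-page blocks.
# Overwrites spread over many blocks (home1-like): most pages are invalid.
wide_overwrites = [Block(1, 7), Block(2, 6), Block(1, 7)]
# Weak overwrites (home3-like): most pages are still valid.
weak_overwrites = [Block(6, 2), Block(7, 1), Block(8, 0)]

print(copies_to_reclaim(wide_overwrites))
print(copies_to_reclaim(weak_overwrites))
```

Under these assumed numbers, the first case frees a block after a single page copy, which fits easily into a short idle period, while the second needs six copies for the same gain.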
In Table 4.5, LazyRTGC and our methods give similar latencies in home3
and oltp. In home3, the inter-request interval is large, as listed in
Table 4.1. In such a case, GC (and its optimization) does not help in
reducing the latency. In contrast, oltp has a very short idle time,
i.e., a small inter-request interval, as listed in Table 4.1. Thus,
there is little opportunity to improve the GC.
Table 4.6 compares the latencies in the case of a 3D 128Gb flash
memory. Compared to the results in Table 4.5, our proposed methods
give further reductions, e.g., 0.66× (in Table 4.6) vs 0.76× (Table 4.5),
XX:14  W. Kang et al. 
ACM Transactions on Embedded Computing Systems, Vol. XX, No. XX, Article XX. Publication date: Month 2017. 
Figure 4.5 Distribution of write traffics.
Table 4.5 Latency comparison on 3D 512Gb flash memory.




99.9999th
Page     465    239    N/A    962    514    697    9353   2246   33.1   1813
Lazy     1.00   1.00   N/A    1.00   1.00   1.00   1.00   1.00   1.00   1.00
Base     0.75   0.86   N/A    0.87   0.82   0.89   0.82   0.81   1.10   0.86
99.99th
Page     769    292    127    1105   679    848    6396   3435   62.9   1823
Lazy     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
Base     0.65   0.90   1.00   0.88   0.66   0.69   0.86   0.84   1.00   0.94
99th
Page     6.67   3.95   1.00   5.93   6.06   5.79   5077   10.5   2.91   568
Lazy     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
Base     1.00   0.44   1.00   1.00   0.98   1.00   0.95   1.00   0.99   0.92
Aggr     1.00   0.44   1.00   1.00   0.93   1.00   0.94   1.00   0.98   0.92
Table 4.6 Latency comparison on 3D 128Gb flash memory.




99.9999th
Page     127    99     N/A    190    121    137    1007   717    1677   509
Lazy     1.00   1.00   N/A    1.00   1.00   1.00   1.00   1.00   1.00   1.00
Base     0.82   0.99   N/A    0.88   0.66   0.81   0.77   0.69   1.28   0.86
99.99th
Page     181    109    16.3   237    185    198    748    454    16.7   238
Lazy     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
Base     0.62   0.93   1.00   0.80   0.77   0.74   0.69   0.37   1.21   0.79
99th
Page     3.07   1.68   1.00   3.73   2.20   2.94   354    371    2.19   82.4
Lazy     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
Base     0.74   0.54   1.00   1.00   0.56   0.76   0.39   0.33   1.90   0.80
Aggr     0.74   0.37   1.00   1.00   0.53   0.76   0.38   0.33   0.92   0.67
for the aggressive method at the 99.9999th percentile. This is largely
because of the low capacity of the 128Gb flash memory. The low capacity
triggers GC more frequently, which increases the GC overhead of the
conventional (page-level) GC method. As Table 4.6 shows, our proposed
methods are more effective than LazyRTGC in reducing the GC overhead
under such a difficult condition.
Free block: Figure 4.6 shows the variation in the number of free
blocks over time in the workload home1 under LazyRTGC and under our
baseline and aggressive methods. As shown in the figure, after an initial
period, LazyRTGC retains only 3 or 4 free blocks, which can lead to
frequent GC operations because so few free blocks are available. Our
baseline method maintains slightly more free blocks (3–6). Our
aggressive method maintains significantly more free blocks, which helps
in reducing the GC operations, thereby contributing to reducing the
long-tail latency. Note that, as mentioned in Section 4.2.3, our
aggressive method increases the number of free blocks only when there
are victim blocks with a large ratio of invalid pages. Thus, although
the aggressive method maintains significantly more free blocks than
LazyRTGC, it does not have a negative impact on the erase count, as
demonstrated later in this section.
Erase Count: Tables 4.7 and 4.8 compare the erase counts (normalized
to LazyRTGC) on the 512Gb and 128Gb flash-memory systems, respectively.
From Tables 4.7 and 4.8, it is clear that our proposed aggressive
method and LazyRTGC give similar erase counts, while the page-level
method gives a higher erase count because of its block-level GC.




Figure 4.6 Comparison of number of free blocks.
Table 4.7 Erase count comparison on 3D 512Gb flash memory.
Page     1.92   1.33   1.50   1.83   1.59   1.69   10.9   2.07   1.67   2.72
Lazy     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
Base     0.98   0.99   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
Aggr     0.91   1.00   1.02   1.01   1.02   1.03   1.03   0.95   1.01   1.00
Table 4.8 Erase count comparison on 3D 128Gb flash memory.
Page     1.26   1.16   1.26   1.56   1.19   1.41   0.56   0.63   1.81   1.20
Lazy     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
Base     0.92   0.97   1.00   1.01   0.98   1.01   0.93   0.97   3.15   1.21
Aggr     0.93   1.00   1.06   1.12   1.01   1.1    0.94   0.98   1.14   1.03
RL-related Analysis: To evaluate the robustness of our method, we
measured the latency over ten executions of each trace. Tables 4.9 and
4.10 show that the results of the proposed method are consistent, with a
very small standard deviation of latency: 3.8% of the average normalized
latency.
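The consistency measure can be sketched as follows. This is a minimal Python sketch that assumes a nearest-rank percentile and a population standard deviation; the text does not state which estimators were used, and all function names and sample values are hypothetical.

```python
import math
import statistics


def percentile(latencies, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def normalized_tail(method_runs, reference_run, p):
    """Tail latency of each repeated run, normalized to the reference method's tail latency."""
    ref = percentile(reference_run, p)
    return [percentile(run, p) / ref for run in method_runs]


def mean_and_spread(normalized):
    """Mean and population standard deviation of the normalized tail latencies."""
    return statistics.mean(normalized), statistics.pstdev(normalized)


# Hypothetical latencies (in microseconds) for a reference run and two repeated runs.
reference = [10, 20, 30, 40]
runs = [[5, 10, 15, 20], [5, 10, 15, 20]]
norm = normalized_tail(runs, reference, p=100)
print(mean_and_spread(norm))
```

In this toy input both repeated runs are identical, so the spread is zero; with real traces the spread would reflect run-to-run variation of the learned policy.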
We evaluated the utilization of Q-table entries for each workload. We
found that the average utilization is 79%, so there is a possibility of
further improvement by adjusting the Q-table size to each workload,
which is left for our future work.
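One plausible way to compute such a utilization figure is shown below. The sketch assumes that utilization means the fraction of Q-table entries (state-action pairs) accessed at least once during a run; this definition, the helper name, and the access log are assumptions rather than the measurement procedure used here.

```python
def q_table_utilization(visited, num_states, num_actions):
    """Fraction of (state, action) entries that were accessed at least once."""
    total = num_states * num_actions
    if total <= 0:
        raise ValueError("the Q-table must have at least one entry")
    return len(set(visited)) / total


# Hypothetical access log of (state, action) pairs for a 2-state, 2-action table.
access_log = [(0, 0), (0, 1), (0, 0), (1, 1)]
print(q_table_utilization(access_log, num_states=2, num_actions=2))
```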
Average application performance: It is important to evaluate the
impact of our proposed method on average application performance.
Since our experiments are trace-based, we take the average request
latency as a proxy for average application performance. Our experiments
(the corresponding results are omitted due to the page limit) show that
the average latency of our proposed baseline and aggressive methods is
slightly better than that of the existing method, LazyRTGC. Thus, we
can state that our proposed methods improve the long-tail latency
without degrading the average application performance.
Long trace experiment: We also evaluated our proposed method with
longer traces obtained by stitching the original traces together. Our
experiments show that, with the long traces, our proposed method
outperforms the existing one, LazyRTGC, as it does with the short ones.
Simple prediction method: Our problem of reducing long-tail latency
could be addressed by existing, possibly simpler, alternatives such as
those based on time-series prediction. In our experiments, we made a
quantitative comparison with GC methods based on two typical methods of
predicting the inter-request interval: moving average and exponential
smoothing. Tables 4.11 and 4.12 show that our proposed method
consistently outperforms them. This is because our method manages
history and learns appropriate actions in a more fine-grained manner
using the Q-table.
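The two prediction baselines can be sketched in Python as follows. This is a minimal sketch, not the implementation evaluated here: the class names, the `partial_gc_budget` helper, and the way a predicted idle time is turned into a number of page copies are assumptions. The row labels Mov10, Mov100, Exp0.1, and Exp0.3 in Tables 4.11 and 4.12 presumably denote window lengths and smoothing factors, which is also an assumption.

```python
from collections import deque


class MovingAverage:
    """Predicts the next inter-request interval as the mean of the last `window` intervals."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)  # the oldest sample is dropped automatically

    def update(self, interval):
        self.samples.append(interval)

    def predict(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


class ExponentialSmoothing:
    """Predicts the next interval as alpha * latest + (1 - alpha) * previous estimate."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.estimate = None

    def update(self, interval):
        if self.estimate is None:
            self.estimate = float(interval)
        else:
            self.estimate = self.alpha * interval + (1 - self.alpha) * self.estimate

    def predict(self):
        return self.estimate if self.estimate is not None else 0.0


def partial_gc_budget(predicted_idle_us, copy_time_us, max_copies):
    """Hypothetical policy: as many page copies as fit into the predicted idle time."""
    return min(max_copies, int(predicted_idle_us // copy_time_us))


ma = MovingAverage(window=2)
es = ExponentialSmoothing(alpha=0.5)
for interval_us in (100, 200, 300):
    ma.update(interval_us)
    es.update(interval_us)
print(ma.predict(), es.predict())
print(partial_gc_budget(ma.predict(), copy_time_us=100, max_copies=5))
```

Both predictors compress the request history into a single number, whereas a Q-table keeps a separate learned value per state-action pair; this difference is the fine-grained history management referred to above.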
In summary, the experimental results show that LazyRTGC does not fully
utilize the idle time available in the storage workload. In contrast,
our baseline method can better exploit the idle time because of the
reinforcement learning-based GC. In addition, our aggressive method
helps in further reducing the long-tail latency by (1) preparing free
blocks with frequent, fine-grained partial GCs, which reduces how often
GC operations are triggered and stall subsequent requests, and (2)
hiding the GC operation by exploiting the idle time based on
reinforcement learning. Consequently, as presented in Tables 4.5 and
4.6, our proposed aggressive method reduces the long-tail latency by
29–36% at the 99.99th percentile for the two flash-storage devices.
Table 4.9 Standard deviation of normalized latency on 3D 512Gb flash
memory.




99.9999th
Base     0.055  0.051  0.000  0.031  0.030  0.043  0.038  0.042  0.055  0.038
Aggr     0.026  0.049  0.000  0.027  0.002  0.002  0.026  0.029  0.008  0.019
99.99th
Base     0.017  0.072  0.000  0.006  0.015  0.022  0.003  0.002  0.003  0.016
Aggr     0.014  0.052  0.000  0.000  0.000  0.000  0.003  0.002  0.005  0.008
99th
Base     0.000  0.000  0.000  0.000  0.004  0.000  0.000  0.000  0.003  0.001
Aggr     0.005  0.000  0.000  0.000  0.004  0.000  0.000  0.000  0.001  0.001
Table 4.10 Standard deviation of normalized latency on 3D 128Gb flash
memory.




99.9999th
Base     0.047  0.047  0.000  0.036  0.033  0.053  0.027  0.046  0.044  0.037
Aggr     0.025  0.111  0.000  0.048  0.018  0.000  0.025  0.029  0.000  0.028
99.99th
Base     0.011  0.035  0.000  0.027  0.006  0.013  0.005  0.004  0.065  0.018
Aggr     0.006  0.070  0.000  0.016  0.000  0.000  0.004  0.005  0.000  0.011
99th
Base     0.000  0.009  0.000  0.000  0.000  0.000  0.000  0.000  0.009  0.002
Aggr     0.000  0.000  0.000  0.000  0.000  0.000  0.001  0.000  0.000  0.000
Table 4.11 Latency comparison of simple prediction method on 3D
512Gb flash memory.




99.9999th
Aggr     0.58   0.90   N/A    0.83   0.35   0.48   0.88   0.92   1.14   0.76
Mov10    0.97   1.01   N/A    1.01   0.67   0.60   0.97   0.70   1.56   0.94
Mov100   1.11   1.02   N/A    1.01   0.90   0.95   0.99   0.76   1.36   1.01
Exp0.1   0.87   1.02   N/A    0.93   0.84   0.90   0.97   0.63   1.56   0.97
99.99th
Aggr     0.50   0.27   1.00   0.88   0.47   0.58   0.84   0.85   1.06   0.71
Mov10    0.61   0.98   1.00   0.95   0.58   0.71   0.98   0.84   1.12   0.86
Mov100   1.00   1.05   1.00   0.95   0.97   0.99   0.97   0.82   1.12   0.99
Exp0.1   0.84   1.04   1.00   0.94   0.84   0.92   0.99   0.82   1.12   0.95
99th
Aggr     1.00   0.44   1.00   1.00   0.93   1.00   0.94   1.00   0.98   0.92
Mov10    1.00   0.75   1.00   1.00   1.00   1.00   1.00   1.00   0.99   0.97
Mov100   1.00   0.75   1.00   1.00   1.00   1.00   0.99   1.00   0.99   0.97
Exp0.1   1.00   0.71   1.00   1.00   1.00   1.00   1.00   1.00   0.99   0.97
Exp0.3   1.00   0.72   1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.97
Table 4.12 Latency comparison of simple prediction method on 3D
128Gb flash memory.
99.9999th
Aggr     0.64   0.54   N/A    0.59   0.26   0.29   0.74   0.80   1.51   0.66
Mov10    0.93   0.99   N/A    0.94   0.76   0.84   0.79   0.64   1.50   0.92
Mov100   1.53   3.51   N/A    1.41   1.05   1.09   0.72   0.68   1.41   1.43
Exp0.1   1.16   1.08   N/A    1.21   1.01   1.07   0.72   0.61   1.59   1.06
99.99th
Aggr     0.62   0.36   1.00   0.65   0.39   0.40   0.55   0.48   1.29   0.64
Mov10    0.77   0.95   1.00   0.98   0.81   0.85   0.86   0.63   1.15   0.89
Mov100   0.99   1.68   1.00   1.03   1.08   1.08   0.85   0.57   1.32   1.07
Exp0.1   0.94   1.01   1.00   1.03   1.07   1.06   0.86   0.56   1.13   0.96
99th
Aggr     0.74   0.37   1.00   1.00   0.53   0.76   0.38   0.33   0.92   0.67
Mov10    0.87   0.83   1.00   1.00   0.84   0.94   0.93   1.00   0.92   0.93
Mov100   1.07   0.86   1.00   1.00   1.09   1.11   0.93   1.00   0.92   1.00
Exp0.1   1.00   0.86   1.00   1.00   0.97   0.99   0.93   1.00   0.93   0.96
Exp0.3   0.89   0.84   1.00   1.00   0.83   0.89   0.93   1.00   0.92   0.92




Q-table Cache to Exploit a Large Number of States at Small Cost
5.1 Motivation
Techniques that apply RL have been studied in an effort to reduce the
long-tail latency induced by GC [9]. In the RL model, multiple states
are used to represent the environment (e.g., the workload
characteristics and SSD internal information). Using appropriate states
that represent the given environment, which we call key states, is
essential to obtain successful RL-assisted solutions.
Based on the work in [9], we conducted experiments and found that the
long-tail latency varies according to the number of states. Figure 5.1
shows the 99.9999th percentile latencies as the number of states is
increased in home1, which is one of the workloads we used in the
experiments (as described in Section 4.2.3). The results show that
increasing the number of states (by taking finer bins in this case)
tends to decrease the latency. On the other hand, as shown in the
figure, increasing the number of states does not improve the
performance consistently, as the design space of most RL-assisted
applications is not always monotone or linear. This is closely related
to the information used as states and how it is divided into multiple
bins.
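The relation between bin granularity and the number of states can be made concrete with a small Python sketch. The three state components (previous interval, current interval, previous action), the bin edges, and the action count below are illustrative assumptions, not the configuration behind Figure 5.1.

```python
from bisect import bisect_right
from math import prod


def bin_index(value, edges):
    """Map a raw value to a bin index; sorted `edges` define len(edges) + 1 bins."""
    return bisect_right(edges, value)


def encode_state(bin_indices, bin_counts):
    """Mixed-radix encoding of the per-component bin indices into one state id."""
    state = 0
    for idx, count in zip(bin_indices, bin_counts):
        if not 0 <= idx < count:
            raise ValueError("bin index out of range")
        state = state * count + idx
    return state


def q_table_entries(bin_counts, num_actions):
    """Number of Q-values to store: (product of bins per component) x actions."""
    return prod(bin_counts) * num_actions


# Hypothetical binning: previous interval, current interval, previous action.
interval_edges_us = [100, 1000]  # three bins: <100, 100..999, >=1000
coarse_bins = [3, 3, 2]
fine_bins = [6, 6, 2]            # halving each interval bin doubles its count

print(bin_index(150, interval_edges_us))
print(encode_state([1, 2, 1], coarse_bins))
print(q_table_entries(coarse_bins, 5), q_table_entries(fine_bins, 5))
```

Halving the width of the two interval bins quadruples the table in this example, which is why finer states quickly collide with the memory budget of a Q-table.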
In addition, we investigated the locality behavior of the RL solution.
To do this, we selected four periods of equal length (10,000 requests
per period) in the middle of the workload home2. We recorded which
states were used and counted how many times they were accessed during
each period. Table 5.1 shows the top ten states sorted by access count
and their access counts in the four periods of home2. The table shows
that the states used in each period are different. The access counts
also differ from period to period. For example, the top two states
account for the majority of the access counts in period 2, while the
other states have relatively low access counts, indicating that the
characteristics corresponding to the two top-ranked states account for
a large proportion of this period.
On the other hand, the access counts of the upper-ranked states are not
high during period 4, indicating a greater variety of characteristics
than in the other periods. We continued the experiment with various
workloads and observed that different workloads have different state
access patterns, though this data is not shown here due to the page
limit.
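The locality analysis described above can be approximated with the following Python sketch. It is a simplification under stated assumptions: it splits the whole state trace into consecutive periods rather than picking four periods from the middle of the workload, and the function names and the toy trace are hypothetical.

```python
from collections import Counter


def top_states_per_period(state_trace, period_len, top_k=10):
    """For each consecutive period, return the top_k (state, access_count) pairs."""
    if period_len <= 0:
        raise ValueError("period_len must be positive")
    periods = []
    for start in range(0, len(state_trace), period_len):
        window = state_trace[start:start + period_len]
        periods.append(Counter(window).most_common(top_k))
    return periods


def top_share(period_counts, window_len, n=2):
    """Fraction of accesses in a period that go to its n most-accessed states."""
    return sum(count for _, count in period_counts[:n]) / window_len


# Hypothetical trace of state ids: one period dominated by state 7, one spread out.
trace = [7, 7, 7, 2, 7, 7, 1, 2, 3, 4, 5, 6]
periods = top_states_per_period(trace, period_len=6, top_k=2)
print(periods[0], top_share(periods[0], 6))
print(periods[1], top_share(periods[1], 6))
```

A high share for the top few states indicates strong locality, as in period 2 of home2, while a low share corresponds to the more varied behavior seen in period 4.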
Through the above two observations, we realized that using more states
is essential to ensure better performance and that a generalized
solution, which adapts to the dynamic behavior, is required.
In [9], RLGC uses Q-learning [14], which is a type of RL method. In
Q-learning, reward values (Q-values) for all state-action pairs are
stored in a table called the Q-table. That is, as the number of states
increases, the size of the Q-table also increases. In the SSD firmware,
there is a stringent constraint on the code size. Thus, the desired
solution needs to be dynamic to adapt to the changing behavior of the
SSD system, and allow for a large
Figure 5.1 Latency variation according to the number of states in home1.





3 MOTIVATION 
Techniques which apply RL have been studied in an effort to 
reduce long-tail latency induced by GC [2]. In the RL model, 
multiple states are used to represent the environment (e.g., the 
workload characteristics and SSD internal information). Using 
appropriate states which represent the given environments, 
which we call key states, is essential to obtain successful RL-
assisted solutions. 
Based on the work in [2], we conducted experiments and found 
that the long-tail latency varies with the number of states. 
Figure 1 shows the 99.9999th percentile latencies as the number 
of states is increased for home1, one of the workloads used in 
our experiments (described in Section 5). The results show that 
increasing the number of states (by taking finer bins in this 
case) tends to decrease the latency. However, as the figure also 
shows, increasing the number of states does not improve the 
performance consistently, because the design space of most 
RL-assisted applications is not always monotone or linear. This 
is closely related to which information is used as states and 
how that information is divided into bins. 
In addition, we investigated the locality behavior of the RL 
solution. To do this, we selected four periods of equal length 
(10,000 requests per period) in the middle of the home2 
workload. We recorded which states were used and counted how 
many times each was accessed during each period. Table 1 shows 
the top ten states sorted by access count, together with their 
access counts, for the four periods in home2. The table shows 
that the states used for 

Figure 1: Latency variation according to the number of 
states in home1. 
Table 1: Top rank states and access counts in home2 
Period 1          | Period 2          | Period 3          | Period 4 
State #    Count  | State #    Count  | State #    Count  | State #    Count 
199813153  3152   | 199803970  5779   | 424000545  1246   | 274455586  123 
349109313  963    | 349133890  2524   | 199822369  328    | 200025122  88 
274455585  853    | 424000545  132    | 423757951  321    | 199803969  88 
423760929  849    | 199804036  55     | 423831585  279    | 199914530  86 
199887871  627    | 423907361  49     | 274621473  181    | 199803938  85 
199969825  593    | 423969825  48     | 423757857  127    | 199803937  79 
199803937  543    | 199804063  35     | 199960609  122    | 274454562  77 
199886881  464    | 199804003  24     | 274454563  111    | 199969857  77 
274482209  323    | 199804899  22     | 199831585  98     | 274474049  73 
349189153  300    | 199803999  22     | 274455585  84     | 274537506  73 

5.2 Design and Implementation
5.2.1 Solution Overview
The purpose of this study is to reduce long-tail latency by (1)
dynamically managing states appropriate for the characteristics of the
environment, including the properties of the workload and the SSD
internal status, and (2) reducing I/O blocking.
The proposed technique [10] is based on RLGC [9]. As its GC method,
RLGC uses the partial GC scheme used in LazyRTGC [15]. To exploit the
inter-request interval (idle time), RLGC employs RL. Figure 5.2 shows
the agent and environment interaction in the RL solution. The agent (the
GC scheduler in our study) is a decision-maker that determines an action.
At time t, the agent receives state s_t and reward r_t (response time,
e.g., the write latency of the previous request in our study) from the
environment (the storage system in our study). The agent selects action
a_t (the number of partial GC instances to be performed in our study)
according to the learned policy and sends it to the environment. The
environment receives the action from the agent and then passes the next
state s_{t+1} and reward r_{t+1} to the agent. The agent learns and
updates the policy using the reward received from the environment.
We use Q-learning [14] as the policy learning method of RL. Q-learning
maintains a value function (representing the expected cumulative
reward when taking a given action in a given state) for each
state-action pair and updates the value function using the state-action
pair and the reward. The value function for the state-action pair is as
follows:
\[ Q(s,a) = E\{\, r_t \mid s_t = s,\ a_t = a \,\} \tag{5.1} \]
where s (= s_t) and a (= a_t) are the state and action at time t,
respectively, and r_t is the reward at time t. The Q-value Q(s,a) is the
expected cumulative reward when action a is taken in state s at time t.
After an action is taken and the associated reward is available, the
Q-value is updated as a weighted sum of the current Q-value and the most
recent reward for the state-action pair [9].
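The weighted-sum update described above can be written out explicitly. This is only a sketch under an assumption: the weight \alpha (a learning rate with 0 < \alpha \le 1) is a symbol introduced here for illustration, and neither its name nor its value is given in the text. Here r_t is the most recent reward observed for the pair (s_t, a_t), as in Equation 5.1.

```latex
\[
  Q(s_t, a_t) \;\leftarrow\; (1 - \alpha)\, Q(s_t, a_t) \;+\; \alpha\, r_t
\]
```

With \alpha close to 1 the Q-value tracks the latest reward closely, while a small \alpha averages over a longer history of rewards.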
The policy that determines the action is as follows:
\[ \Pi(s) = \arg\max_{a} Q(s,a) \tag{5.2} \]
In Equation 5.2, the policy selects the action that maximizes the
Q-value at state s. In terms of RL implementation, the important data
structure is the Q-table, which stores the Q-values. The number of
Q-table entries is (number of states) × (number of actions). Adopting an
RL-assisted solution in embedded systems therefore requires a small
Q-table. The RL agent learns the behavior of the workload and
determines, as an action, when to trigger GC and how many partial GC
instances to execute. RLGC [9] uses a fixed set of states obtained from
three types of predefined information binned into 68 states at design
time. Because of this fixed, small set of states, RLGC is significantly
limited in how far it can reduce long-tail latency; further reduction
requires a large number of states and adaptation to the dynamic
behavior of the storage workload, as explained in our observations.
The proposed method [10] uses a dynamic key states management
technique to overcome the limitations of RLGC. We use approximately
88 × 10^8 state candidates obtained by binning and combining 17 pieces
of information. To store the Q-values for the state-action pairs, we
replace the Q-table of the previous RLGC with a small Q-table cache
whose entries are evicted in least recently used (LRU) order. We use one
Q-table cache per action. As in the aggressive (aggr) method of RLGC,
our method has three actions, which correspond to the number (0, 1, or
2) of page copies in partial GC. Thus, three Q-table caches are used for
the three actions. As the storage system operates, the state is
determined and the reward is calculated according to the action taken.
The state and its corresponding Q-value are stored as a pair in the
Q-table cache. If the determined state is already in the Q-table cache,
its Q-value is updated. Otherwise, the state and its Q-value are added
as a new entry. At this point, if the Q-table cache has no free entry,
one of the existing entries is evicted according to the LRU policy.
Through this process, the proposed RL-assisted GC scheduler further
reduces the long-tail latency while using a small Q-table cache, i.e., a
small memory resource.
Figure 5.2 Environment and agent interaction (labels: action a_t, reward r_t, state s_t).
5.2.2 Dynamic Key States Management
As mentioned earlier, the proposed technique is based on the aggressive
method of RLGC [9]. Therefore, we focus on the differences from
RLGC.
Candidate States: To represent the characteristics of various
workloads, we need a considerable number of candidate states. Table
5.2 shows the information used for the states and the number of bins
for each piece of information. Combining the 17 pieces of information
and their bins yields 88 × 10^8 states. This reflects status changes at
a much finer granularity than the 68 states of the previous RLGC
approach. Note that we selected the information listed in the table
through extensive analyses of candidate information. However, more
information can be used when the method is applied to an actual SSD
product.
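As an illustration of how binned pieces of information can be combined into a single state number, the sketch below packs one bin index per feature into an integer using mixed-radix encoding. This is an assumption for illustration only: the text does not say how state numbers are actually formed, and the names `BINS`, `encode_state`, and `decode_state` are hypothetical. The bin counts follow the order of the state-information table.

```python
# Bin counts per state feature, in the order of the state-information table.
BINS = [32, 32, 3, 3, 3, 3, 3, 12, 5, 5, 2, 2, 2, 2, 2, 2, 2]


def encode_state(bin_indices, bins=BINS):
    """Pack one bin index per feature into one integer (mixed radix).

    Hypothetical encoding; the source does not specify its own scheme.
    """
    if len(bin_indices) != len(bins):
        raise ValueError("expected one bin index per feature")
    state = 0
    for index, radix in zip(bin_indices, bins):
        if not 0 <= index < radix:
            raise ValueError("bin index out of range")
        state = state * radix + index
    return state


def decode_state(state, bins=BINS):
    """Recover the per-feature bin indices from an encoded state."""
    indices = []
    for radix in reversed(bins):
        state, index = divmod(state, radix)
        indices.append(index)
    return list(reversed(indices))
```

Because every combination of bin indices maps to a distinct integer, the encoded value can serve directly as the lookup key of a table or cache.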
Storing the Q-value: As noted earlier, a very large amount of memory
space is necessary when using a Q-table to store a large number of
Q-values for state-action pairs. For example, a Q-table for the
88 × 10^8 states listed in Table 5.2 requires a prohibitively large
table of 98GB (= 88 × 10^8 (states) × 3 (actions) × 4B). It is
impractical to apply this to an embedded system such as SSD firmware.
To solve this problem, we use a Q-table cache that stores only an
active subset of the state and Q-value pairs. Figure 5.3 shows the
organization of the Q-table cache used in our experiments. In this
study, we use three Q-table caches, each with 100 state-Q value pair
entries. The memory space for the three Q-table caches is only 2.34KB
(= 100 (entries) × 2 (state, Q value) × 3 (actions) × 4B), a negligible
amount of memory in an SSD. Each Q-table cache stores a fixed number of
state-Q value pairs. As mentioned before, if there is no free space in
which to add a new entry, an existing entry is evicted according to the
LRU policy.
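The two memory figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the quantities given in the text (88 × 10^8 states, 3 actions, 4-byte values, 100 entries per cache); the variable names are illustrative, and the division by powers of 1024 reflects that the quoted 98GB and 2.34KB match binary units.

```python
# Quantities taken from the text.
NUM_STATES = 88 * 10**8     # candidate states
NUM_ACTIONS = 3             # 0, 1, or 2 page copies
BYTES_PER_VALUE = 4         # size of one stored value

# Full Q-table: one Q-value per state-action pair.
full_table_bytes = NUM_STATES * NUM_ACTIONS * BYTES_PER_VALUE

# Q-table caches: 100 entries per action, each holding a state and a Q-value.
CACHE_ENTRIES = 100
FIELDS_PER_ENTRY = 2
cache_bytes = CACHE_ENTRIES * FIELDS_PER_ENTRY * NUM_ACTIONS * BYTES_PER_VALUE

print(f"full Q-table:   {full_table_bytes / 2**30:.2f} GiB")
print(f"Q-table caches: {cache_bytes / 2**10:.2f} KiB")
```

The caches therefore occupy 2,400 bytes in total, roughly seven to eight orders of magnitude less than the full table.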
When receiving a new request, the GC scheduler (agent) runs as
follows. First, each of the three Q-table caches is looked up with the
current state, just as a typical data cache in computer architecture is
looked up with an address. If there is any match, the action with the
maximum Q value is selected and executed according to Equation 5.2. The
Q value is calculated using the reward, i.e., the latency of the
previous request. On a miss, the state-Q value pair is inserted into the
Q-table cache for the action selected at time t−1. If the state-Q value
pair already exists in the Q-table cache for the action at time t−1, the
Q value is updated and the entry is marked as the most recently used.
More details about how an action is determined are given below.
Eviction policy: The purpose of using the Q-table cache to store the
state-Q value pairs is to keep the key states that properly represent
the environment in a small amount of memory. Meaningful states can
reflect the recent behavior of the workload, which helps the scheduler
adapt quickly to changes in workload behavior. They can also reflect
frequently and repeatedly occurring behaviors. To address these two
factors, we considered the LRU policy and a least frequently used (LFU)
policy. However, LFU is impractical because it must maintain the
access-frequency history for all stored entries [34]. Thus, in this
study, we use only the LRU policy for eviction. Even with only the LRU
policy, we gain a further reduction of the long-tail latency (as
described in Section 5.3.1).
Figure 5.4 shows a histogram of the number of states in each state
access frequency range in home1. The x-axis represents the range of
access counts, and the y-axis the number of states. For readability, the
x-axis uses three different interval scales: 1 for 1 to 10, 10 for 11 to
100, and 1000 for 101 to 10100. Figure 5.4 shows a clear distinction
between states with very low and very high access frequencies. Among all
states that occur in home1, states that occur only once account for a
significant portion. On the other hand, Figure 5.5 shows the state
statistics obtained after running the workload, i.e., at the end of the
workload. This illustrates that, at a given instant of the workload run,
the Q-table cache can hold a high percentage of frequently accessed
states, i.e., a large number of meaningful states. Note that because
state-Q value pairs are repeatedly evicted and newly added under the LRU
eviction policy, some states with an access frequency of 1 can exist.
Action determination: Our GC scheduler selects the action with the
maximum Q value. The difference from Q-learning with a complete Q-table
arises when the state-Q value pair needed to determine an action cannot
be found in the Q-table cache. This occurs when the state-Q value pair
has not yet been added to the Q-table cache or has already been evicted.
If the GC scheduler cannot find the Q value for a given state in any of
the Q-table caches, it selects action 0 (no GC). This is a conservative
approach that lowers the possibility of increased latency due to a lack
of information.
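The lookup, update, eviction, and fallback rules described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler's actual implementation: the names `QTableCache` and `select_action`, the learning rate `ALPHA`, the use of the first reward as the initial Q value of a new entry, and the choice that lookups do not refresh recency are all hypothetical. The examples also assume the reward is the negated latency, so that a larger Q value is better.

```python
from collections import OrderedDict

NUM_ACTIONS = 3      # 0, 1, or 2 page copies (from the text)
CACHE_ENTRIES = 100  # entries per Q-table cache (from the text)
ALPHA = 0.3          # hypothetical learning rate; not given in the text


class QTableCache:
    """LRU-managed cache of state -> Q value for a single action."""

    def __init__(self, capacity=CACHE_ENTRIES):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry first

    def lookup(self, state):
        # Return the cached Q value, or None on a miss.
        return self.entries.get(state)

    def update(self, state, reward, alpha=ALPHA):
        if state in self.entries:
            # Hit: weighted sum of the current Q value and the newest reward.
            q = (1 - alpha) * self.entries[state] + alpha * reward
            self.entries[state] = q
            self.entries.move_to_end(state)  # mark as most recently used
        else:
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict the LRU entry
            # Assumption: a new entry starts from the first observed reward.
            self.entries[state] = reward


def select_action(caches, state):
    """Pick the action with the maximum cached Q value; 0 (no GC) if none."""
    hits = []
    for action, cache in enumerate(caches):
        q = cache.lookup(state)
        if q is not None:
            hits.append((q, action))
    if not hits:
        return 0  # conservative fallback described in the text
    return max(hits, key=lambda hit: hit[0])[1]
```

A scheduler built on this sketch would keep `caches = [QTableCache() for _ in range(NUM_ACTIONS)]`, call `select_action(caches, state)` for each incoming request, and call `caches[previous_action].update(previous_state, reward)` once the latency of the previous request is known.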





Table 2: State information and # of bins 

Information used for state                           # of bins 
Current (t) inter-request interval                   32 
Previous (t-1) inter-request interval                32 
Previous (t-1) action (# of performed partial GC)    3 
Previous (t-2) action (# of performed partial GC)    3 
Previous (t-3) action (# of performed partial GC)    3 
Previous (t-4) action (# of performed partial GC)    3 
Previous (t-5) action (# of performed partial GC)    3 
# of free blocks                                     12 
Previous (t-1) request size                          5 
Previous (t-2) request size                          5 
Previous (t-1) valid page copy (performed or not)    2 
Previous (t-2) valid page copy (performed or not)    2 
Previous (t-1) block erase (performed or not)        2 
Previous (t-2) block erase (performed or not)        2 
Current (t) requested operation                      2 
Previous (t-1) requested operation                   2 
Previous (t-2) requested operation                   2 
 
Table 3: Flash memories 
Parameter                  3D 128Gb [12]   3D 512Gb [11] 
Page size                  8KB             16KB 
Number of pages / block    384             768 
Number of blocks / plane   2731            2874 
Number of planes           2               2 
Page read time             49 μs           60 μs 
Page program time          600 μs          700 μs 
Block erase time           4000 μs         3500 μs 

Figure 4: Number of states for each access frequency. 
Figure 5: Number of states for each access frequency. 

5.3.1 Experimental Setup
We compared the proposed dynamic key states management technique
[10] with the aggressive method, which shows the best performance in
RLGC [9]. We implemented the proposed method and RLGC [9] using the
popular FlashSim simulator [31]. We use the 99th, 99.99th, and
99.9999th percentile latencies as the metrics. Ten real-world workloads
(home1, home2, home3, home4, webmail+online, webmail, MSNSFS, RBESQL,
TPCC, and TPCE) [32] and one synthetic workload [33] are used in our
evaluation. Two types of 3D flash memory are used. Table 4.2 shows the
detailed parameters of the two types [1, 2]. Each Q-table cache has 100
entries to store the state-Q value pairs. To generalize our solution, we
present the results obtained with 100-entry Q-table caches as the
representative result. We also report on the effects of the Q-table
cache size.
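As a sketch of how such tail-latency metrics can be computed from a list of per-request latencies, the function below uses the nearest-rank definition of a percentile. Both the function name `tail_latency` and the choice of the nearest-rank definition are assumptions made for illustration; the text does not state which percentile definition the evaluation uses. The percentile is passed as a string so that exact rational arithmetic can be used, which avoids floating-point rounding at extreme percentiles such as 99.9999.

```python
import math
from fractions import Fraction


def tail_latency(latencies, percentile):
    """Nearest-rank percentile of `latencies`.

    `percentile` is a decimal string such as "99.9999".
    """
    if not latencies:
        raise ValueError("latencies must not be empty")
    ordered = sorted(latencies)
    # Rank of the smallest sample that covers the requested fraction.
    rank = math.ceil(Fraction(percentile) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

For example, with 10,000 recorded latencies the 99.99th percentile is the second-largest sample, which shows why a very long trace is needed before the 99.9999th percentile differs from the maximum.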
5.3.2 Results and Discussion
Latency: Table 5.3 (a) shows the latency comparison with RLGC on
512Gb 3D flash memory. The results are normalized to RLGC. On average,
the proposed method achieves lower latency than the baseline (the
aggressive method of RLGC): 0.78× at the 99.9999th, 0.88× at the
99.99th, and 0.98× at the 99th percentile. The baseline learns the
policy using a small number of states. In contrast, our method utilizes
a much larger number of fine-grained states and manages the key states
among them. As a result, it obtains lower latency by using information
that is useful for selecting actions and by scheduling GC at a finer
granularity.
There was almost no improvement in the latency for the three
workloads home4, webmail+online, and webmail. Their latencies are
already close to the intrinsic program time of flash memory, so there is
no room for further reduction. For home3 and oltp, the baseline and the
proposed method give similar results. home3 has long inter-request
intervals, so long-tail latency due to GC does not occur. oltp has very
short inter-request intervals; thus, there is little opportunity for
optimization [9].
Table 5.3 (b) shows the results on 128Gb 3D flash memory. The average
improvement is larger than on 512Gb 3D flash memory, e.g., a 25%
reduction at the 99.9999th percentile. This is because the 128Gb device
has less capacity than the 512Gb device and therefore requires more
frequent GC, which leaves more opportunity for optimization.
Q-table cache entry size: Table 5.4 shows the latency variation for
four workloads when the number of Q-table cache entries changes. The
number in a label such as "Ours_100" refers to the number of Q-table
cache entries. home1 and home2 show the lowest latency with Ours_100,
whereas MSNSFS and RBESQL show the lowest latency with Ours_2000. As
noted earlier, to generalize the proposed method, we applied a common
Q-table cache size that shows the best average performance improvement
across all workloads. Therefore, we used a Q-table cache size of 100
entries.
Table 5.5 compares the total number of states. In the table, the values
for MSNSFS and RBESQL are about ten times greater than those for home1
and home2. This indicates that these two workloads create a large number
of states because they undergo significant behavior changes, which
explains why they work better with large Q-table caches in Table 5.4.
Table 5.6 shows the Q-table cache hit rate for each of these workloads
with a Q-table cache size of 100. As discussed above, MSNSFS and RBESQL
have more significant behavior changes than the other workloads; thus,
their Q-table cache hit rates are lower. These results also suggest that
the Q-table cache size could be optimized, which is left as future work.
Erase count: Tables 5.7 and 5.8 compare the erase counts on 512Gb
and 128Gb 3D flash memory. These results are normalized to the baseline.

Table 4: Latency comparison (normalized to the baseline; in each row, the eleven workloads are followed by their average) 

(a) 512Gb 3D flash memory 
99.9999th 
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
Ours 0.32 0.23 1.00 0.99 1.00 0.99 1.03 0.95 1.04 0.20 0.82 0.78 
99.99th 
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
Ours 0.82 0.61 1.00 1.00 0.99 0.98 1.02 1.04 1.04 0.18 0.99 0.88 
99th 
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
Ours 1.00 1.00 1.00 1.00 1.00 1.00 1.03 1.00 0.99 0.77 1.00 0.98 

(b) 128Gb 3D flash memory 
99.9999th 
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
Ours 0.36 0.34 1.00 0.71 0.98 0.98 1.01 0.97 0.92 0.12 0.82 0.75 
99.99th 
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
Ours 0.53 0.63 1.00 0.89 1.00 1.00 0.97 1.03 0.88 0.19 0.98 0.83 
99th 
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
Ours 0.98 1.00 1.00 1.00 1.01 1.00 1.00 1.00 0.96 0.66 1.00 0.96 

Table 5: Latency comparison for various Q-table cache sizes 
            home1  home2  MSNSFS  RBESQL 
Base        1.00   1.00   1.00    1.00 
Ours_100    0.32   0.23   1.03    0.98 
Ours_200    0.32   0.25   1.05    0.93 
Ours_500    0.79   0.31   0.97    0.93 
Ours_1000   0.77   0.41   0.93    0.94 
Ours_2000   0.99   0.47   0.92    0.93 
Ours_3000   0.81   0.44   1.06    0.98 

Erase count: Tables 8 and 9 compare the erase counts on 
512Gb and 128Gb of 3D flash memory. These results are 
normalized to the baseline. In both cases, the erase counts are 
similar to the corresponding baselines. 
6 CONCLUSION 
In this paper, we propose a technique that dynamically 
manages key states in RL-assisted GC to reduce the long-tail 
latency. This technique uses many fine-grained pieces of 
information as state candidates and manages key states that 
suitably represent the characteristics of the workload using 
a relatively small amount of memory. Thus, the proposed 
method can reduce the long-tail latency even further. In total, 
eleven workloads are evaluated on two types of flash memory 
storage. As a result, the long-tail latency is improved by 22-25% 
for heavy workloads.  
ACKNOWLEDGEMENTS 
This work was supported by National Research Foundation of 
Korea (NRF-2016M3A7B4909604). 
REFERENCES 
[1] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the 
ACM 56, 2013. 
Table 6: Number of states visited for each workload
# of states  37039 35428 444782 211395 
 
Table 7: Hit rate of Q-table cache in each workload 
Hit rate 81.78% 75.94% 33.51% 45.82% 

Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
Ours 1.13 1.01 1.00 1.00 1.08 1.05 1.03 0.99 1.00 0.99 1.00 1.03 

Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
Ours 1.12 1.03 1.00 1.00 1.08 1.00 1.02 1.00 1.00 0.99 1.00 1.02 
 






Combining Q-table cache and
Neural Network to Exploit both
Long and Short-term History
6.1 Motivation and Problem
6.1.1 More State Information can Further Reduce
Long Tail Latency
In the RL model [9], multiple states are used to represent the dynamic
behavior of the environment. Using appropriate states that represent the
environment well is essential for a successful RL-based solution. We call
such a state a key state. Based on the work in [9], we experimentally
found that the long tail latency varies with the number of states. Figure
5.1 shows the latency of the SSD at the 99.9999th percentile when we
increase the number of states in home1, one of the workloads used in our
experiments. Increasing the number of states by taking finer bins (to
obtain a Q-table) tends to decrease the latency. However, increasing the
number of states does not reduce latency consistently. This is because
the design space of most RL-assisted applications is not always linear or
monotonic. The outcome is closely related to what information is used as
a state and how it is divided into multiple bins to build the Q-table.
6.1.2 Locality Behavior of Workload
We analyzed the locality behavior of the RL-based solution [9]. To do
this, we chose four equal-length periods (10,000 requests per period)
in the middle of home2, one of the workloads used in our experiments.
We recorded which states were used during each period and counted how
often each state was accessed. Table 6.1 shows the top twelve states
sorted by access count. The table shows that the distribution of access
counts is highly skewed, i.e., it exhibits locality. For example, period
2 exhibits strong locality since the top two states dominate the access
counts while the other states have relatively small access counts. The
table also shows that the states used in each period are different, which
means that the state access behavior changes over time. Our idea of the
Q-table cache was motivated by the observation that system behavior
exhibits locality which changes over time.
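The per-period counting described above can be sketched as follows. This is a minimal illustration rather than the tooling used for Table 6.1; the function name top_states_per_period and the toy trace of state IDs are assumptions made for the example.

```python
from collections import Counter

def top_states_per_period(state_trace, period_len, top_k):
    """Split a trace of visited state IDs into fixed-length periods and
    return the top_k most frequently accessed states of each period."""
    periods = []
    for start in range(0, len(state_trace), period_len):
        window = state_trace[start:start + period_len]
        periods.append(Counter(window).most_common(top_k))
    return periods

# Toy trace with made-up state IDs (not measured data).
trace = [7, 7, 7, 3, 7, 3, 9, 9, 9, 9, 2, 9]
result = top_states_per_period(trace, period_len=6, top_k=2)
```

In this toy trace, the dominant state differs between the two periods, which is the kind of time-varying locality that motivates a cache.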
6.1.3 Zero Initialization Problem
Using more state information has the advantage of expressing the
environment in detail, and can further reduce the long tail latency as
shown in Figure 5.1. The Q-table cache can use many state candidates in a
small memory space [10]. However, a state evicted from the Q-table cache
loses its previously learned Q-values. Thus, the Q-table cache relies on
zero initialization of the Q-values for newly inserted state-Q value
pairs. Zero initialization increases the learning time of newly inserted
states and ultimately degrades the quality of action choices on such
states.
In order to mitigate the problem of zero initialization, the size of the
Q-table cache may be increased, thereby reducing the loss of the learned
Q-values of evicted states. However, a large Q-table cache can incur
prohibitively high memory cost. For instance, in order to keep all the
states (88 × 10^8 states) shown in Table 5.2, the Q-table size could reach
98GB, which is prohibitively expensive in embedded systems such as the
SSD (described in Section 6.2.2).
In particular, a large Q-table cache tends to contain a large number of
states that are not sufficiently learned, which we call immature states.
As will be explained later, such immature states can yield inappropriate
action choices, which prevents us from lowering latency.

Period 1 Period 2 Period 3 Period 4 
State # Count State # Count State # Count State # Count 
199813153 3152 199803970 5779 424000545 1246 274455586 123 
349109313 963 349133890 2524 199822369 328 200025122 88 
274455585 853 424000545 132 423757951 321 199803969 88 
423760929 849 199804036 55 423831585 279 199914530 86 
199887871 627 423907361 49 274621473 181 199803938 85 
199969825 593 423969825 48 423757857 127 199803937 79 
199803937 543 199804063 35 199960609 122 274454562 77 
199886881 464 199804003 24 274454563 111 199969857 77 
274482209 323 199804899 22 199831585 98 274474049 73 
349189153 300 199803999 22 274455585 84 274537506 73 
274454561 76 274455523 19 274538529 77 423815202 71 
200025121 58 199804900 17 423760929 77 423942239 65 
 
6.2 Design and Implementation
6.2.1 Solution Overview
Figure 6.1 shows the overall architecture of the proposed solution, which
integrates the Q-table cache (footnote 1) and a small neural network
called the Q-value prediction network (QP Net) [11]. The Q-table cache
learns the short-term behavior to select actions on the given state. In
contrast, the QP Net is trained to learn the long-term behavior of the
system. Since the QP Net has a global view of the Q-function, it can
provide good initial Q-values when new entries are inserted into the
Q-table cache.
Such an integration of the QP Net with the Q-table cache offers a
low-cost, high-performance implementation of an RL-based solution for the
SSD. As will be shown in the experiments, both the Q-table cache and the
QP Net incur very small cost while learning the Q-function of key states
and the global view of the Q-function. As a result, the integrated
solution offers a further reduction in long tail latency compared with
the cases where a small Q-table is used [9] and where only the Q-table
cache is utilized [10, 11].
The overall operation of our solution, shown in Figure 6.2, is as
follows. First, the SSD receives a request from the host and starts
serving it. While serving the request, e.g., writing data to the flash
memory, the RL agent applies (2.2), i.e., updates the Q-table cache
(updating the entry corresponding to the previous action or inserting a
new state-Q value pair) and the QP Net while utilizing the information of
the reward and the new state. Then, the RL agent chooses an action based
on the new state. After finishing the service of the current request, the
RL agent executes the chosen action, e.g., 1-page copy. The above steps
are repeated. Note that our proposed GC solution is triggered only when
the SSD receives write requests (footnote 2).
Footnote 1: In the original Q-table cache solution [10], three Q-table
caches, each with 100 state-Q value pair entries, are used. The Q-table
cache for each action is operated independently to prevent loss of
learned information, e.g., Q-values. In this dissertation, we propose a
Q-table cache where each entry consists of one state and three Q-values
for the three actions.
There are two aspects regarding idle time. First, our solution aims at
exploiting idle time by executing the action during idle time, since idle
time can follow the completion of a request. Second, however, our
solution does not fully exploit idle time, especially very long idle
time, since the partial GC operation is triggered only when a request
arrives at the SSD. Exploiting long idle time is a promising topic for
our future work.
In the following subsections, we describe in detail how the components of
the RL agent, i.e., the Q-table cache and the QP Net, are constructed and
operated.
Footnote 2: In the original Q-table cache solution [10], the partial GC
selected by the RL agent is performed at each SSD access (read and
write). In this dissertation, we propose applying the GC solution only
after the service of a write request. This is mainly because our new
integrated solution aims at hiding the latency overhead of QP Net
execution behind the write latency. Our experiments will compare the two
cases of GC execution: on read/write requests in [10] and on write
requests only in the proposed integrated solution.
























Figure 6.2 Operation overview.
6.2.2 Q-table Cache for Action Selection
The Q-table cache [10, 11] acts as a Q-table that determines the action
for the given state and updates its Q-values based on the given reward.
Compared with RLGC [9], the Q-table cache manages a much larger set of
states with a small cache structure. In this subsection, we describe how
the states are constructed, how the Q-values are accessed and updated in
the cache structure, and how cache replacement is performed.
• States: In order to represent the characteristics of various workloads
in detail, a considerable amount of state information is needed. RLGC [9]
uses a total of 68 states obtained by binning three pieces of information
defined at design time. Such a small amount of state information prevents
RLGC from further reducing the long tail latency. Table 5.2 shows the
information used for the states and the number of bins for each piece of
information used in our work. Such a large amount of state information
helps to reflect more detailed state changes. Note that we selected the
information shown in Table 5.2 through extensive experiments. We expect
that, when our scheme is applied to actual SSD products, more
product-specific information can be added to the state information.
• Q-table Cache: As mentioned earlier, storing a large number of state-Q
value pairs in a Q-table requires a very large amount of memory. For
example, in order to store the 88 × 10^8 states mentioned in Table 5.2,
the Q-table would require approximately 98GB (= 88 × 10^8 (states) × 3
actions × 4B) of memory space. Such a solution is infeasible for an
embedded system such as SSD firmware.
In order to address this problem, we employ a Q-table cache that stores
only the active subset of states. Figure 6.1 shows the structure of the
Q-table cache. In this study, as shown in the figure, we use a Q-table
cache in which each entry consists of one state and three Q-values for
the three actions. Note that we utilize only three actions, i.e., 0-, 1-,
or 2-page copy, in the RL solution. We selected the three actions through
sensitivity analysis. The memory capacity of the Q-table cache is 1.56KB
(= 100 (entries) × 4 (state and actions) × 4B). This is a negligible
memory cost in an SSD.
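The two memory figures can be checked with a few lines of arithmetic. The sketch below assumes that GB and KB are binary units (2^30 and 2^10 bytes), which is the interpretation that reproduces the quoted 98GB and 1.56KB; the function names are illustrative only.

```python
BYTES_PER_VALUE = 4   # 4B per stored value, as stated in the text
NUM_ACTIONS = 3       # 0-, 1-, or 2-page copy

def full_q_table_bytes(num_states):
    """Full Q-table: one Q-value per (state, action) pair."""
    return num_states * NUM_ACTIONS * BYTES_PER_VALUE

def q_table_cache_bytes(num_entries):
    """Q-table cache: each entry holds one state ID plus three Q-values."""
    return num_entries * (1 + NUM_ACTIONS) * BYTES_PER_VALUE

full = full_q_table_bytes(88 * 10**8)   # 105,600,000,000 bytes
cache = q_table_cache_bytes(100)        # 1,600 bytes

full_gib = full / 2**30    # about 98.3 (assumed binary GB)
cache_kib = cache / 2**10  # 1.5625 (assumed binary KB)
```

The ratio between the two sizes is roughly seven to eight orders of magnitude, which is why caching only the active states is attractive for SSD firmware.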
The RL agent operates on the Q-table cache as follows. First, upon
receiving the current state, the agent looks up the Q-table cache to see
if the current state is present. To do this, we compare the current state
with the states stored in the Q-table cache. If there is a match, we
select an action among 0/1/2-page copy based on the policy explained
below.
For each action, its associated reward is obtained by applying a reward
function. Basically, we assign a larger reward when a smaller latency is
obtained as the result of the action. Figure 6.3 shows the reward
function used in the proposed method (footnote 3). To be exact, we
measure the response time of the request immediately following the
action. Then, we obtain the reward of the action using the reward
function in the figure. As shown in the figure, a large reward is
assigned to a small response time, which favors actions that incur small
latency in subsequent requests. Likewise, by assigning a negative reward
to a long response time, actions incurring long latency are penalized.
After obtaining the reward, the Q-value associated with the action is
updated in the Q-table cache using (2.3).
Footnote 3: We obtained the reward function through extensive
experiments.
In case of a miss in the Q-table cache, we need to perform cache
replacement. If there is no free entry in the cache, we select a victim
entry under the LRU policy. Then, we insert into the Q-table cache a new
entry corresponding to the current state. We explain how to initialize
the Q-values of new entries in Section 6.2.3.
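The lookup and LRU replacement steps described above can be sketched as follows. This is a simplified illustration, not the firmware implementation: the class name QTableCache, its method names, and the use of Python's OrderedDict are assumptions, and the Q-value update of (2.3) is omitted.

```python
from collections import OrderedDict

class QTableCache:
    """Fixed-capacity map from a state ID to its per-action Q-values."""

    def __init__(self, capacity=100, num_actions=3):
        self.capacity = capacity
        self.num_actions = num_actions
        self.entries = OrderedDict()  # least recently used entry first

    def lookup(self, state):
        """Return the Q-values on a hit (and mark the entry as recently
        used), or None on a miss."""
        q_values = self.entries.get(state)
        if q_values is not None:
            self.entries.move_to_end(state)
        return q_values

    def insert(self, state, init_q_values=None):
        """Insert (or overwrite) a state, evicting the LRU entry when the
        cache is full. Returns the evicted state ID, or None."""
        evicted = None
        if state not in self.entries and len(self.entries) >= self.capacity:
            evicted, _ = self.entries.popitem(last=False)
        if init_q_values is None:
            init_q_values = [0.0] * self.num_actions  # zero initialization
        self.entries[state] = list(init_q_values)
        self.entries.move_to_end(state)
        return evicted

# Usage with a tiny capacity so that eviction is visible.
cache = QTableCache(capacity=2)
cache.insert(10)
cache.insert(20)
cache.lookup(10)            # state 10 becomes the most recently used
evicted = cache.insert(30)  # state 20 is the LRU victim
```

The optional init_q_values argument is where predicted initial Q-values could be supplied instead of zeros.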
• Handling Negative Reward: It is important to handle negative rewards
in the case of action 0. Action 0, which performs zero page copy, does
not run the GC. Even if action 0 is selected, the latency of the
subsequent request may increase due to workload behavior (i.e., heavy
write requests with short inter-request time), which may result in a
negative reward for action 0. In this case, the long latency is not the
result of the selected action (action 0). The previous work [10]
nevertheless assigned a negative reward to the action, which penalizes
it. In this work, as Figure 6.3 shows, we propose assigning zero reward
instead of a negative one to the action, which avoids penalizing such an
action [11].
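The rule for action 0 can be expressed as a thin wrapper around the reward function. In the sketch below, the actual reward curve is the one in Figure 6.3 and is not reproduced; toy_base_reward and its 1000 μs threshold are hypothetical placeholders used only to make the example runnable.

```python
def shaped_reward(action, response_time_us, base_reward):
    """Apply the base reward function, but replace a negative reward by
    zero when the selected action was 0 (no page copy)."""
    reward = base_reward(response_time_us)
    if action == 0 and reward < 0:
        return 0.0
    return reward

def toy_base_reward(response_time_us):
    # Hypothetical stand-in for the curve in Figure 6.3:
    # positive for short response times, negative for long ones.
    return 1.0 if response_time_us < 1000 else -1.0

r_idle_long = shaped_reward(0, 5000, toy_base_reward)   # 0.0, not penalized
r_copy_long = shaped_reward(1, 5000, toy_base_reward)   # -1.0, penalized
r_idle_short = shaped_reward(0, 100, toy_base_reward)   # 1.0, unchanged
```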
• Policy for Selecting an Action: In order to select an action, we adopt
the deterministic policy of (2.2) under the ε-greedy method [14].
Basically, we select the action that maximizes the Q-value in (2.2)
(footnote 4). In addition to the deterministic policy, we adopt the
ε-greedy method in order to balance exploitation and exploration. Thus,
we select an action at random with probability ε and apply the
deterministic policy of (2.2) with probability 1-ε.
Note that, compared with the conventional policy, our Q-table cache has a
unique situation when there is a miss in the Q-table cache. In such a
case, the agent selects 0-page copy as the default action. This is a
conservative approach to avoid the possibility of increased latency due
to a lack of information.
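A minimal sketch of this selection rule is given below. The function name select_action and the convention that a cache miss is signalled by None are assumptions made for the example; the value of ε is left to the caller.

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Choose among the page-copy actions for the current state.

    q_values is the list of Q-values found in the Q-table cache, or None
    when the lookup missed.
    """
    if q_values is None:
        return 0  # miss: conservative default, 0-page copy
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: random action
    # exploit: action with the maximum Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

miss_action = select_action(None, epsilon=0.5)
greedy_action = select_action([0.1, 0.9, 0.3], epsilon=0.0)
random_action = select_action([0.1, 0.9, 0.3], epsilon=1.0)
```

With ε set to 0 the rule is purely greedy, and with ε set to 1 it is purely random, which makes the two branches easy to exercise separately.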
6.2.3 Q-value Prediction
We propose a neural network for Q-value prediction [11] in order to solve
the zero initialization problem of the original Q-table cache [10].
• Q-value Prediction Network Architecture: Figure 6.4 shows the
architecture of the Q-value prediction network (QP Net) [11]. The QP Net
is a multi-layer perceptron (MLP). We chose a two-layer MLP as the QP Net
architecture based on the fact that a two-layer MLP can approximate any
arbitrary nonlinear function [35–37]. As mentioned earlier, our idea was
motivated by the critic model in the actor-critic model. We also
considered that the critic model is typically designed with an MLP. In
order to determine the specific configuration of the two-layer MLP, we
performed a sensitivity analysis in which we varied the number of neurons
in each hidden layer; the results are given in our experiments (Section
6.3.4). The input layer receives 17 inputs, which are the normalized
values of the 17 pieces of information used in the candidate states
described in Table 5.2. The output layer has three outputs, which are the
predicted Q-values for the 0-, 1-, and 2-page copy actions.
Footnote 4: Selecting the action with the maximum Q-value is
deterministic, in contrast to a stochastic policy where the action is
selected probabilistically in proportion to the associated Q-value.
We train the QP Net whenever the Q-table cache is updated. This is
because the QP Net needs to learn the whole behavior of the Q-function.
The QP Net is similar to the critic network of the actor-critic model
[30] in that both are trained whenever a new reward is available. A QP
Net trained on the whole behavior of the Q-function is advantageous
especially when unseen states occur and their initial Q-values need to be
predicted, as well as when previously evicted states are re-inserted into
the Q-table cache. If the QP Net were trained only on the evicted entries
of the Q-table cache, it could not make useful predictions on unseen
states.
Note that the QP Net is trained with all three Q-values of the Q-table
cache entry even though only one of the three Q-values is updated. Such
training enables the QP Net to learn the relative importance of actions
as well as state-dependent Q-values. When inserting a new entry into the
Q-table cache, we use all three Q-values predicted by the QP Net to
initialize the three Q-values of the newly inserted state.
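To make the data flow concrete, the sketch below shows a small MLP with 17 inputs and 3 outputs that is trained on all three Q-values of an entry and then queried for initial Q-values. It is an illustration under several assumptions that the text does not specify: a single hidden layer of 8 tanh units, plain SGD on a squared error, the learning rate, and the class name QPNet.

```python
import math
import random

class QPNet:
    """Tiny MLP: 17 normalized state inputs -> 3 predicted Q-values."""

    def __init__(self, n_in=17, n_hidden=8, n_out=3, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def _hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def _output(self, h):
        return [sum(w * hi for w, hi in zip(row, h)) + b
                for row, b in zip(self.w2, self.b2)]

    def predict(self, x):
        return self._output(self._hidden(x))

    def train_step(self, x, target_q, lr=0.05):
        """One SGD step on 0.5 * sum((y - target)^2) over all three
        Q-values; returns the loss before the update."""
        h = self._hidden(x)
        y = self._output(h)
        err = [yi - ti for yi, ti in zip(y, target_q)]
        # Gradient w.r.t. hidden pre-activations, using w2 before update.
        grad_h = [sum(err[k] * self.w2[k][j] for k in range(len(err)))
                  * (1.0 - h[j] ** 2) for j in range(len(h))]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * e * hj
            self.b2[k] -= lr * e
        for j, g in enumerate(grad_h):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * g * xi
            self.b1[j] -= lr * g
        return 0.5 * sum(e * e for e in err)

# Toy usage: made-up normalized state vector and target Q-values.
net = QPNet()
state = [0.5] * 17
target = [0.2, -0.1, 0.4]
first_loss = net.train_step(state, target)
for _ in range(200):
    last_loss = net.train_step(state, target)
init_q = net.predict(state)  # would seed a newly inserted cache entry
```

The three values returned by predict correspond to the three actions, so they can be copied directly into a new Q-table cache entry in place of zeros.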
• Pre-training QP Net: We train the QP Net during runtime together with
the Q-table cache. We observed that a randomly initialized QP Net gives
poor performance. This is mainly because a randomly initialized QP Net
can predict large random Q-values, especially at the beginning of the
system run. Such large initial Q-values hurt Q-learning by making the
Q-table cache training difficult. In order to resolve this problem, we
experimented with pre-training the QP Net with random and real traces,
and found that pre-training with random traces is effective because a QP
Net pre-trained with random traces tends to give Q-value predictions with
much smaller variations than one pre-trained with real traces. In our
experiments, we pre-train the QP Net at design time with 10,000 random
requests. We also experimented with pre-training with real traces (as
explained in Section 6.3.4). However, the pre-training with
















[Figure legend: "if action ≠ 0 (Proposed)", "if action = 0 (Proposed)", "if action = 0 (Original QTC)", "if action ≠ 0 (Original QTC)"]
Figure 6.3 Reward function.












We compare the proposed method with RLGC (aggressive scheme) [9],
which shows the best performance (lowest latency) among the techniques
that apply reinforcement learning to GC. We implemented the proposed
method and RLGC using a widely used flash storage simulator, Flashsim
[31].
We used 16 Open Storage traces (home1*, home2*, home3*,
webmail+online*, webmail*, MSNSFS*, RBESQL*, RA*, DADS*, DAP*,
DTR*, MSNSCFS*, homes*, online*, webresearch*, and webuser*) [38],
which are SNIA block traces [32] re-played with an NVMe SSD array and
collected from storage nodes. We also used 13 real-world traces (home1,
home2, home3, home4, webmail+online, webmail, MSNSFS, RBESQL,
TPCC, TPCE, EXCH24, Financial1 and Financial2) [32] and a synthetic
one (oltp) [33]. Experiments were conducted on two flash memory types,
3D 512Gb [2] and 3D 128Gb [1]. The parameters of the two flash memory
types are shown in Table 6.2. The latency at the 99.9999th, 99.99th
and 99th percentiles was used as the evaluation metric.
The QP Net [11] was initialized as described in Section 6.2.3.
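
As an illustration of the evaluation metric, the sketch below computes a tail-latency percentile and normalizes it to a baseline. The nearest-rank percentile definition, the function name `percentile`, and the sample values are assumptions for the example; the text does not state which percentile definition was used.

```python
import math

def percentile(latencies_us, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not latencies_us:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_us)
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]

# Synthetic samples (not measured data): the second method halves every latency.
base = list(range(1, 101))
ours = [0.5 * x for x in base]

# Normalized latency as reported in the tables: method / baseline at the same percentile.
normalized_99th = percentile(ours, 99) / percentile(base, 99)
```

Resolving the 99.9999th percentile meaningfully needs on the order of a million requests or more per trace, since only one request in a million lies above it.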



































Parameters                  3D 128Gb    3D 512Gb
Page size                   8 KB        16 KB
Number of pages / block     384         768
Number of blocks / plane    2731        2874
Number of planes            2           2
Page read time              49 μs       60 μs
Page program time           600 μs      700 μs
Block erase time            4000 μs     3500 μs




[Figure residue: axis "# of neurons per hidden layer"; two value series, each labeled "Converted for Cortex R5": 35.15, 105.55, 215.15, 368.30 and 81.80, 290.80, 643.25, 1119.50]



















































































Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
Ours 1.00 0.99 0.97 0.91 1.00 1.00 1.02 1.00 1.00 1.00 1.00 0.99 
             
 
 
Information used for state                             # of bins
Current (t) inter-request interval                     32
Previous (t-1) inter-request interval                  32
Previous (t-1) request size                            5
Previous (t-2) request size                            5
Previous (t-1) action (# of performed partial GC)      3
Previous (t-2) action (# of performed partial GC)      3
Previous (t-3) action (# of performed partial GC)      3
Previous (t-4) action (# of performed partial GC)      3
Previous (t-5) action (# of performed partial GC)      3
Previous (t-1) valid page copy (performed or not)      2
Previous (t-2) valid page copy (performed or not)      2
Previous (t-1) block erase (performed or not)          2
Previous (t-2) block erase (performed or not)          2
# of free blocks                                       12
Current (t) requested operation                        2
Previous (t-1) requested operation                     2
Previous (t-2) requested operation                     2
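
To give a feel for the size of the candidate state space implied by the bin counts above, the following sketch multiplies the 17 bin counts and packs a binned state into a single integer key. The mixed-radix encoding and the names `BINS` and `encode_state` are illustrative assumptions, not the author's implementation.

```python
from math import prod

# Bin counts of the 17 state features, in the order listed in the table above.
BINS = [32, 32, 5, 5, 3, 3, 3, 3, 3, 2, 2, 2, 2, 12, 2, 2, 2]

def encode_state(features):
    """Pack per-feature bin indices into one integer key (mixed-radix)."""
    if len(features) != len(BINS):
        raise ValueError("expected one bin index per feature")
    key = 0
    for value, n_bins in zip(features, BINS):
        if not 0 <= value < n_bins:
            raise ValueError("bin index out of range")
        key = key * n_bins + value
    return key

num_candidate_states = prod(BINS)   # 9,555,148,800 combinations
```

A full table over roughly 9.6 billion combinations would not fit in an SSD controller's memory, which is consistent with keeping only a cache of key states.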
 
 



















































































Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
QTC 0.99 1.03 1.02 1.00 0.99 1.00 0.98 1.00 1.00 1.00 1.00 1.00 




In the course of the experiments, we observed that some workloads
were already sufficiently optimized by the Q-table cache, leaving no
room to further reduce the long tail latency. Thus, we identified the
storage-intensive workloads and evaluated the proposed method on them.⁵
⁵ Identifying intensive workloads and evaluating on them is a popular
approach, especially in memory sub-system architectures [39–42].
To do this, we compared the Q-table cache method with the NO GC
case. The NO GC case assumes that flash memory is over-writable, which
means that GC does not occur. Thus, the minimum latency can be obtained
in this case [11].
Figures 6.5 and 6.6 compare the Q-table cache method and the NO GC
case on the 3D 512Gb and 3D 128Gb flash memory types, respectively. The
experimental results were normalized to the NO GC case and sorted in
descending order. We assumed that a storage-intensive workload should
exhibit sufficiently longer latency (by 10% in our experiments) than the
NO GC case. For the 3D 512Gb flash memory, as shown in Figure 6.5, the
latency of the Q-table cache method is within 10% of that of the NO GC
case in 17 workloads (home2, home1, home4, webmail, webmail+online,
TPCE, home3*, home3, home2*, DADS*, DAP*, MSNSCFS*, homes*, online*,
webresearch*, webuser* and Financial2). For the 3D 128Gb flash memory,
as shown in Figure 6.6, the difference between the Q-table cache method
and the NO GC case is less than 10% in 19 workloads (webmail*,
webmail+online*, home4, home1*, webmail+online, webmail, home1, TPCE,
home3, home2*, home3*, DADS*, DAP*, MSNSCFS*, homes*, online*,
webresearch*, webuser* and Financial2). Therefore, we performed the
following experiments only for



































































































































































































Figure 6.5 Latency comparison with NO GC case at 99.9999th percentile
on 3D 512Gb flash memory.

































































































































































































Figure 6.6 Latency comparison with NO GC case at 99.9999th percentile
on 3D 128Gb flash memory.
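
The 10% criterion used to separate storage-intensive workloads from the rest can be written as a small filter over latencies normalized to the NO GC case. The function name `storage_intensive`, the workload names `A`, `B`, `C`, and the latency values are hypothetical and used only to illustrate the rule.

```python
def storage_intensive(qtc_latency_us, no_gc_latency_us, threshold=0.10):
    """Return workloads whose latency exceeds the NO GC case by more than threshold,
    sorted by normalized latency in descending order."""
    selected = []
    for name, qtc in qtc_latency_us.items():
        ratio = qtc / no_gc_latency_us[name]   # latency normalized to NO GC
        if ratio > 1.0 + threshold:
            selected.append((name, ratio))
    selected.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in selected]

# Hypothetical tail latencies in microseconds (not measured values).
qtc = {"A": 2000.0, "B": 1050.0, "C": 1500.0}
no_gc = {"A": 1000.0, "B": 1000.0, "C": 1000.0}
intensive = storage_intensive(qtc, no_gc)
```

Here workload `B` is within 10% of the NO GC case and is therefore excluded from further evaluation.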
6.3.3 Latency Comparison: Overall
Table 6.3 shows a latency comparison with RLGC (Base in the table)
[9] on the 3D 512Gb flash memory. The results are normalized to
RLGC. Our proposed method has two approaches: Q-table cache only
(QTC in the table) [10], and the integrated method of Q-table cache and
QP Net (QPN and QTCW in the table)⁶ [11].
Both of our proposed methods (QTC and QPN in the table) show better
(lower) average latency than the baseline. QTC reduces the latency to
0.89× of the baseline at the 99.9999th percentile, 0.90× at the 99.99th
and 0.98× at the 99th. QPN reduces it further, to 0.75× at the
99.9999th, 0.80× at the 99.99th and 0.98× at the 99th.
The baseline uses a small number of states to learn its policy. In
contrast, QTC exploits a much larger number of fine-grained states and
maintains the key states among them. Thus, it offers lower latency than
the baseline.
Our integrated solution of Q-table cache and QP Net (QPN in the
table) gives further latency reductions by training the QP Net during
runtime to provide a better initialization of the Q-table cache than the
zero initialization of the original Q-table cache, which in turn
contributes to better action selection. Note that the latency
improvement of QPN comes from better Q-value initialization, since both
QTC and QPN use the same number of candidate states.
⁶ As mentioned earlier, in the integrated solution we apply the GC
solution only after servicing a write request. To verify the effect of
not applying GC on read requests, we also provide the results of the
Q-table cache only method applied at write requests only (QTCW in the
table). The results are explained later in this subsection.
Table 6.3 also compares the latency on the 3D 128Gb flash memory.
Our methods show better (lower) average latency than the baseline:
0.78× (QTC) and 0.63× (QPN) at the 99.9999th percentile, 0.85× (QTC)
and 0.69× (QPN) at the 99.99th, and 0.97× (QTC) and 0.92× (QPN) at the
99th. That is, at the 99.9999th percentile the QP Net gives an
additional reduction of 14 to 15 percentage points over the original
Q-table cache [10] in the two types of flash memory (0.89× to 0.75× and
0.78× to 0.63×).
The results of the 3D 128Gb flash memory show a larger latency
reduction than those of the 3D 512Gb flash memory. This is because the
3D 128Gb flash memory requires more frequent GCs owing to its smaller
capacity, which is also confirmed in Figures 6.5 and 6.6, where the
3D 128Gb flash memory gives larger latency in the intensive workloads
than the 3D 512Gb flash memory.
In particular, on the 3D 128Gb flash memory, our integrated solution
shows a much larger latency reduction in two workloads, RA* and
MSNSFS*, than in the other workloads. This is because they have lower
Q-table cache hit rates than the others. As shown in Table 6.4, the hit
rates of RA* and MSNSFS* are 0.31 and 0.48, respectively. Therefore,
with the Q-table cache only (QTC), their Q-table caches suffer from
frequent evictions and insertions with zero Q-values, which amplifies
the negative impact of zero-initialized Q-values on action selection.
The QP Net in our integrated solution improves this situation by
providing better initial Q-values, thereby enabling better action
choices, which finally leads to lower latency.
Table 6.3 shows that QPN gives the largest latency reduction in
TPCC. This is because TPCC has a very large latency (above 90,000 µs)
compared to the other workloads. Hence, there is large potential for
further optimization, as shown in Figures 6.5 and 6.6. Thus, our method
could reduce the normalized latency down to 0.02× on the 3D 128Gb flash
memory. The reduction is smaller on the 3D 512Gb flash memory, where the
normalized latency is 0.15×. This is because the latency of the 3D
128Gb flash memory is around 3× longer than that of the 3D 512Gb flash
memory, which shows that the 3D 128Gb flash memory has more potential
for further latency reduction.
In addition, Table 6.3 compares the latency of QTC (Q-table cache
only method applied at each SSD access) and QTCW (Q-table cache only
method applied at write requests). As shown in the table, QTC and QTCW
give similar average latency reductions: 0.89× (QTC) and 0.91× (QTCW)
at the 99.9999th percentile, 0.90× (QTC) and 0.91× (QTCW) at the
99.99th, and 0.98× (QTC) and 0.99× (QTCW) at the 99th on the 3D 512Gb
flash memory. On the 3D 128Gb flash memory, QTC also gives similar
average latency to QTCW. This is because both QTC and QTCW use a large
number of state candidates to represent the environment. Note
that our integrated solution (QPN) adopts QTCW to overlap the QP Net

































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
QTC 0.20 0.98 1.03 1.02 1.04 0.95 0.93 0.93 0.70 1.00 0.92 1.00 0.88 0.89 0.12 0.87 0.68 0.86 1.01 0.97 1.01 0.96 0.92 0.85 0.34 0.78 
QTCW 0.25 0.98 1.03 1.02 1.04 0.95 0.95 0.96 0.73 1.00 0.93 1.00 0.94 0.91 0.18 0.90 0.69 0.88 1.01 0.98 1.01 0.96 0.92 0.87 0.35 0.80 
QPN 0.15 0.82 0.93 0.87 1.00 0.88 0.81 0.61 0.47 0.78 0.86 0.78 0.75 0.75 0.02 0.80 0.31 0.79 0.62 0.84 0.88 0.65 0.88 0.80 0.33 0.63 
99.99th
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
QTC 0.18 1.01 1.02 1.01 1.04 1.04 0.93 0.87 0.99 1.00 0.85 0.93 0.79 0.90 0.19 0.94 0.96 0.89 0.98 1.03 0.97 0.97 0.88 0.87 0.63 0.85 
QTCW 0.20 1.01 1.03 1.01 1.04 1.05 0.93 0.90 1.00 1.00 0.85 0.94 0.86 0.91 0.23 0.95 0.97 0.89 0.98 1.03 0.98 0.97 0.89 0.87 0.63 0.85 
QPN 0.13 0.91 0.90 0.98 0.95 0.99 0.88 0.62 0.97 0.98 0.60 0.75 0.75 0.80 0.04 0.84 0.50 0.78 0.78 0.97 0.80 0.93 0.88 0.60 0.52 0.69 
99th 
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
QTC 0.77 1.01 1.03 1.02 0.99 1.00 1.00 1.00 1.00 1.00 1.00 0.98 1.00 0.98 0.66 1.00 1.00 1.00 1.01 1.00 1.00 1.00 0.96 1.00 1.00 0.97 
QTCW 0.77 1.02 1.04 1.02 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.67 1.00 1.00 1.00 1.01 1.00 1.00 1.00 0.96 1.00 1.00 0.97 
QPN 0.77 0.95 1.01 1.02 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.66 0.99 1.00 1.00 0.73 1.00 0.85 0.99 0.95 1.00 1.00 0.92 
 
TABLE V 
HIT RATE OF Q-TABLE CACHE ONLY METHOD. 
 





































































Hit rate [%] 89 50 31 43 48 45 32 88 85 48 75 
 
TABLE VI 
LATENCY COMPARISON OF QP NET WITHOUT Q-TABLE CACHE (NORMALIZED TO THE BASELINE). 
 
 










































































































































































QPN 0.15 0.82 0.93 0.87 1.00 0.88 0.81 0.61 0.47 0.78 0.86 0.78 0.75 0.75 0.02 0.80 0.31 0.79 0.62 0.84 0.88 0.65 0.88 0.80 0.33 0.63 
NET 0.60 0.91 1.18 0.83 1.52 0.95 0.93 0.62 0.59 3.27 0.85 1.20 0.89 1.10 0.19 0.89 0.92 1.09 1.98 1.02 1.28 0.76 0.91 0.89 1.14 1.01 
 
 
No read GC reflected
 
TABLE VIII 
LATENCY COMPARISON OF ACTOR-CRITIC METHOD (NORMALIZED TO BASELINE). 
 










































































































































































QPN 0.15 0.82 0.93 0.87 1.00 0.88 0.81 0.61 0.47 0.78 0.86 0.78 0.75 0.75 0.02 0.80 0.31 0.79 0.62 0.84 0.88 0.65 0.88 0.80 0.33 0.63 
AC 0.18 0.88 1.21 0.83 0.97 0.98 0.88 0.71 0.76 0.99 0.99 0.88 0.75 0.85 0.08 0.84 0.74 0.83 0.75 0.99 1.09 0.90 0.86 0.73 0.35 0.74 
 
 
No read GC reflected
 
6.3.4 Q-value Prediction Network Effects on Latency
In this subsection, we report how the size of the QP Net is
determined. We also evaluate alternative designs related to the QP Net,
i.e., a QP Net only solution and the actor-critic method [11].
• QP Net size: The QP Net size needs to be minimized to reduce
the additional runtime and memory cost of QP Net execution.
We performed a sensitivity analysis to determine the appropriate QP Net
size by varying the number of neurons in the two hidden layers. Figure
6.7 shows the results (the average latency over all workloads at the
99.9999th percentile) for different sizes of the QP Net on the 3D 512Gb
and 3D 128Gb flash memories. The 3D 512Gb flash memory has the best
(lowest) latency with 50 neurons per hidden layer, and the 3D 128Gb
flash memory has the best (lowest) latency with 100 neurons per hidden
layer.
The fact that the smaller-capacity flash memory requires the larger
QP Net can be explained as follows. In the 3D 128Gb flash memory,
because of the smaller capacity, more GCs are executed and 1.38× more
states are created than in the 3D 512Gb flash memory. Thus, a larger
QP Net is needed to account for the additional states.
Though we assume that the QP Net size is determined at design time
(50/100 neurons in each hidden layer for the 512Gb/128Gb flash
memories), when the type of flash memory is determined, we do not
exclude the possibility of dynamically adjusting the QP Net size during
runtime depending on the available resources, e.g., a storage capacity
change due to aging, and on the application behavior, e.g., a large
number of active states incurring low Q-table cache hit rates.
Investigating the possibility of further latency reduction in such cases
is left as future work.
• Pre-training effects: We compared two pre-training options: random
and realistic traces. After the pre-training, we used the pre-trained
QP Net and a zero-initialized QTC as the initial configuration and
evaluated our integrated solution with the realistic traces. In the
case of random traces, we used 10,000 randomly generated requests to
train both the Q-table cache and the QP Net. In the case of pre-training
with realistic traces, we used seven realistic traces (home2*,
webmail+online*, home1*, RBESQL*, RA*, MSNSFS*, and webmail*). We
performed pre-training and evaluation in 7-fold cross-validation (i.e.,
leave-one-out). For instance, we trained both the Q-table cache and the
QP Net with all traces except home2* and evaluated the Q-table cache
and the pre-trained QP Net with home2*.
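
The leave-one-out protocol over the seven traces can be sketched as below. The generator name `leave_one_out` is an assumption for illustration; the actual pre-training and evaluation steps are not shown.

```python
TRACES = ["home2*", "webmail+online*", "home1*", "RBESQL*",
          "RA*", "MSNSFS*", "webmail*"]

def leave_one_out(traces):
    """Yield (training traces, held-out trace) pairs, one fold per trace."""
    for held_out in traces:
        train = [t for t in traces if t != held_out]
        yield train, held_out

folds = list(leave_one_out(TRACES))
# First fold: pre-train on six traces, evaluate on home2*.
first_train, first_eval = folds[0]
```

Each of the seven folds pre-trains on six traces and evaluates on the remaining one, so no trace is used for both pre-training and evaluation within a fold.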
Table 6.5 compares the normalized latency of the two pre-training
options at the 99.9999th percentile. On average, pre-training with
random traces gives 21% better (lower) latency than pre-training with
realistic traces.
Our analysis is that pre-training with random traces gives a much
smaller Q-value prediction error than pre-training with realistic
traces, as shown in Table 6.6. The table compares the Q-value prediction
error of the QP Net. We define the error as
(Q(QTC)-Q(QPNet))/Q(QTC), where Q(QTC) and Q(QPNet) represent the
Q-value of the Q-table cache and that of the QP Net for the same input
state. We calculate the error with the Q-values after each update of the
Q-table cache and the QP Net. As the table shows, pre-training with
random traces gives a much smaller error than pre-training with
realistic traces. The large error with realistic traces could result
from overfitting to those traces. In our future work, we will further
investigate how to exploit realistic traces as much as possible while
still benefiting from random traces.
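
The error definition above translates directly into code. In the sketch below, averaging the magnitude of the error over all updates and skipping entries whose cached Q-value is zero are assumptions for illustration; the text defines only the per-state error. The function names are likewise hypothetical.

```python
def relative_q_error(q_qtc, q_qpnet):
    """Error as defined in the text: (Q(QTC) - Q(QPNet)) / Q(QTC)."""
    return (q_qtc - q_qpnet) / q_qtc

def mean_abs_relative_error(pairs):
    """Average error magnitude over (Q(QTC), Q(QPNet)) pairs.

    Pairs with Q(QTC) == 0 are skipped (an assumption; the ratio is undefined there).
    """
    errors = [abs(relative_q_error(q, p)) for q, p in pairs if q != 0]
    return sum(errors) / len(errors) if errors else 0.0

# Illustrative values only, not data from the experiments.
samples = [(2.0, 1.5), (4.0, 5.0), (0.0, 0.3)]
avg_error = mean_abs_relative_error(samples)
```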
We also evaluated the case in which the QP Net weights are randomly
initialized without pre-training (called No Pre-Training). As the tables
show, no pre-training gives lower latency and a smaller prediction error
than pre-training with realistic traces, while being inferior to the
proposed pre-training with random traces. The comparison of prediction
errors between no pre-training and pre-training with realistic traces
also suggests that pre-training with realistic traces may suffer from
overfitting.
• Latency of the QP Net without the Q-table cache: Our integrated
solution uses both the Q-table cache and the QP Net. However, one can
think of a solution that uses only the QP Net without the Q-table cache,
because the QP Net can provide Q-values for a given state. We
implemented such a solution using only the QP Net. The performance
difference between our integrated solution and this QP Net only solution
can be regarded as the contribution of the Q-table cache to the
integrated solution. Note that the QP Net continues to be trained during
runtime in both solutions.
Table 6.7 compares the latency of the integrated solution (QPN) and
the QP Net only solution (NET). The table shows that the integrated
solution (QPN) gives better (lower) average latency than the QP Net only
solution (NET). Specifically, at the 99.9999th percentile, the
integrated solution offers an average latency of 0.75× and 0.63× for the
3D 512Gb and 3D 128Gb flash memories, respectively. On the other hand,
the QP Net only solution shows a latency of 1.10× and 1.01× for the 3D
512Gb and 3D 128Gb flash memories, respectively.
The QP Net only solution is not equipped with the Q-table cache.
Thus, to achieve similar performance, it might require a larger network
than the integrated solution. We therefore conducted experiments varying
the network size of the QP Net only solution.
Figure 6.8 shows the average normalized latency of the QP Net only
solution for various QP Net sizes (number of neurons in each hidden
layer) at the 99.9999th percentile. The results show that the integrated
solution (QPN) gives better (lower) average latency than the QP Net only
solution across the evaluated QP Net sizes.
Figure 6.8 supports the benefit of our integrated solution, in which
the Q-table cache is specialized to exploit the short-term history while
the QP Net learns the long-term history in order to provide good initial
Q-values to the Q-table cache, thereby enabling better Q-learning and,
finally, lower latency.


















[Figure: x-axis "# of neurons per hidden layer"; series: 3D 512Gb, 3D 128Gb]
[Figure: x-axis "# of neurons per hidden layer"; series: 3D 512Gb, 3D 128Gb; y-axis: normalized to RLGC; reference: QTC+QPN]
Figure 6.8 Latency comparison of QP Net without Q-table cache for
various network sizes at 99.9999th percentile (normalized to the baseline).
Table 6.5 Pre-training comparison.


































































No Pre-Training 1.00 1.00 0.82 0.99 0.97 0.84 0.69 0.90 
Random 0.75 1.00 0.81 0.65 0.81 0.92 0.68 0.80 
Realistic 1.00 1.00 0.86 0.99 0.81 1.16 1.24 1.01 
 

























































No Pre-Training 0.296 0.206 0.325 0.420 0.123 0.139 0.144 0.236 
Random 0.057 0.012 0.046 0.085 0.086 0.119 0.067 0.067 
Realistic 0.315 0.243 0.357 0.399 0.087 0.098 0.318 0.260 
 
Table 6.6 Q-value prediction error comparison.


















































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































• Latency of applying the actor-critic method: As mentioned
earlier, this study was also motivated by the actor-critic method [30].
The actor-critic method approximates the action probability and the
value function. We implemented a solution using the actor-critic method
to compare its performance with that of our integrated solution. Table
6.9 shows the architecture of the actor-critic network we implemented.
Table 6.8 compares the latency of our integrated solution (QPN) and
the one using the actor-critic method (AC) at the 99.9999th percentile.
The table shows that our solution gives lower (better) average latency
than the actor-critic one for both the 3D 512Gb and the 3D 128Gb flash
memories. Specifically, our solution gives an average latency of 0.75×
and 0.63× for the 3D 512Gb and the 3D 128Gb flash memories,
respectively, while the actor-critic one offers an average latency of
0.85× and 0.74× for the two memory types.
We also implemented a solution using the asynchronous advantage
actor-critic method (A3C) to compare the performance. Figure 6.9 shows
a high-level architecture of A3C. As shown in the figure, a total of
four agents (one global agent and three local agents) are employed. The
global agent works as a real SSD, and the local agents work as virtual
SSDs that collect experiences in independent environments. Table 6.9
shows the architecture of the actor-critic network we implemented for
each agent. Unlike the original A3C method, we use online learning to
apply A3C to real-time SSDs. Note that this A3C method cannot be applied
to a real SSD, because a real SSD cannot virtualize its environment.
Table 6.8 compares the latency of our integrated solution (QPN)
and the one using the A3C method (A3C) at the 99.9999th percentile. The
table shows that our solution gives lower (better) average latency than
the A3C one for both the 3D 512Gb and the 3D 128Gb flash memories.
Specifically, our solution gives an average latency of 0.75× and 0.63×
for the 3D 512Gb and the 3D 128Gb flash memories, respectively, while
the A3C one offers an average latency of 0.87× and 0.76× for the two
memory types. The actor-critic method shows a slightly better latency
reduction than A3C.
The actor-critic method has the advantage of using continuous values
as states instead of binned states. However, its network learning is
slow. In addition, the predicted approximate values are directly used to
determine the action probability and the target value. Therefore, the
effect of the approximation error can be large. In terms of
implementation cost, as shown in Table 6.9, the actor-critic method
requires larger neural networks for the actor and critic networks than
our QP Net.
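
The difference in implementation cost can be made concrete by counting parameters from the layer sizes in Table 6.9 and the QP Net sizes reported earlier (two hidden layers with 50 or 100 neurons). The sketch assumes fully connected layers with bias terms, which the text does not state explicitly; the function name `mlp_param_count` is hypothetical.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network (biases are assumed)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

actor = mlp_param_count([17, 100, 100, 3])          # 2 hidden layers of 100
critic = mlp_param_count([17, 500, 500, 500, 1])    # 3 hidden layers of 500
qp_net_512gb = mlp_param_count([17, 50, 50, 3])     # 50 neurons per hidden layer
qp_net_128gb = mlp_param_count([17, 100, 100, 3])   # 100 neurons per hidden layer

actor_critic_total = actor + critic
```

Under these assumptions the critic alone has about half a million parameters, whereas the QP Net has a few thousand (50 neurons) to about twelve thousand (100 neurons).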
On the other hand, our integrated solution learns the policy faster
than the actor-critic one because it runs on a small Q-table cache. QP
Net learning can be as slow as that of the actor-critic method. However,
the role of the QP Net is to predict the initial values of the Q-table
cache. Therefore, the effect of the approximation error due to slow
learning can be smaller than that in the actor-critic method.
For oltp on both flash memory types and for RBESQL* on the 3D 128Gb
flash memory, the actor-critic method shows slightly better performance,
as shown in Table 6.8. We consider this to be due to the benefit of
using continuous values in the actor-critic method. As Table 6.10
shows, oltp has a high Q-table cache hit rate. With a high hit rate,
the influence of the QP Net is small. In such a case, a better binning
or the use of continuous values as in the actor-critic method could lead
to a better solution.
For RBESQL*, Table 6.10 shows that the hit rate of the Q-table cache
is low. That is, the number of states used in RBESQL* is very large and
the states may vary frequently, which makes it difficult to take full
advantage of the Q-table cache [10]. The low hit rate of the Q-table
cache can have an adverse impact on latency due to frequent evictions,
i.e., information loss, even though the QP Net provides good initial
Q-values. Table 6.10 also shows the workload behavior of RBESQL* in
terms of the average request size and its standard deviation, which are
larger than those of the other workloads, confirming the characteristic
of a large number of states. In such a case, the actor-critic method,
which uses continuous values without suffering from information loss due
to Q-table cache eviction, could outperform the Q-table cache-based
integrated solution.
Note that introducing the QP Net changes the action choice and the state
behavior, which gives different hit rates in Tables 6.4 and 6.10.
Specifically, compared with the QTC-only case, the QP Net can change
the selected action due to the better Q-learning. Different actions can
change the state behavior. For instance, if more partial GC operations
are taken in the QP Net case, then the number of free blocks will tend
to increase, which will frequently generate the states having more free
                                 Actor   Critic
Number of inputs (states)        17      17
Number of hidden layers          2       3
Number of nodes / hidden layer   100     500
Number of outputs                3       1

Hit rate [%]         80      50       55       93     97
# of states          45559   337923   183545   6346   4293
Avg. req. sectors    9.40    21.67    57.85    4.46   17.16
SD of req. sectors   18.11   45.75    162.75   15.5   8.65

Table 6.10 Workload information (hit rate and state counts are from the
integrated solution).
Hit rate [%] 80 50 55 93 97 37 46 
# of states 45559 337923 183545 6346 4293 435194 203081 
Avg. req. sectors 9.40 21.67 57.85 4.46 17.16 21.67 57.85 
SD of req. sectors 18.11 45.75 162.75 15.5 8.65 45.75 162.75 









• Not applicable to real-SSD
• Modified algorithm to apply an SSD
• 1 Global env., 3 Local env.
• Global env. works as a real-SSD
• Local env. works as a virtual-SSD to collect various experiences
• On-line learning
• Network update every 20 requests
Figure 6.9 Architecture of asynchronous advantage actor-critic method.
6.3.5 Q-table Cache Analysis
• Replacement policy of Q-table cache: Our proposed solution employs
an LRU policy as the Q-table cache replacement policy. To evaluate other
replacement policies, we compared the performance of the original
Q-table cache method with LRU, LIRS [43] and ARC [44] under the same
memory constraint, i.e., 1,600 bytes, which is the memory cost of a
100-entry QTC with LRU. Both LIRS and ARC consider frequency as well as
recency to compensate for the disadvantages of LRU. Thus, both policies
require storing metadata, i.e., the state information of evicted entries,
which leaves fewer entries in the QTC under the same memory cost. In our
experiments, to meet the memory cost of 1,600 bytes, we used an 80-entry
QTC for each of LIRS and ARC (see footnote 7). Our experiments (the
results of which are omitted due to the page limit) show that LRU
outperforms both LIRS and ARC on the two flash memory types, mainly
because, under the same memory constraint, LRU provides more QTC entries
than LIRS and ARC.
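The memory accounting behind the 100-entry versus 80-entry comparison can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the byte sizes are the ones stated in the accompanying footnote, while the function name and the assumption of one evicted-entry record per cached entry are ours.

```python
# Byte sizes as stated in the text; names are illustrative, not from the thesis.
ENTRY_BYTES = 3 * 4 + 4   # 3 actions x 4 bytes/action + 4 bytes of state information
GHOST_BYTES = 4           # state information kept per evicted entry (LIRS, ARC)

def entries_under_budget(budget_bytes, ghost_per_entry=0):
    """Number of QTC entries that fit in the budget when each entry also
    carries `ghost_per_entry` evicted-entry records."""
    per_entry = ENTRY_BYTES + ghost_per_entry * GHOST_BYTES
    return budget_bytes // per_entry

budget = 100 * ENTRY_BYTES                          # memory of a 100-entry LRU cache
lru_entries = entries_under_budget(budget)          # LRU keeps no evicted-entry metadata
lirs_arc_entries = entries_under_budget(budget, 1)  # assumed: one evicted record per entry
print(budget, lru_entries, lirs_arc_entries)
```

Under these assumptions the budget is 1,600 bytes, which holds 100 LRU entries but only 80 entries once each one is paired with a 4-byte evicted-entry record.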
7 In our experiments, both LIRS and ARC keep the state information of 80
evicted entries. Each QTC entry takes 16 bytes (= 3 actions × 4
bytes/action + 4 bytes of state information). The state information of an
evicted entry takes 4 bytes. Thus, for each of LIRS and ARC, the 80-entry
QTC meets the memory cost of 1,600 bytes = 80 entries × (16 bytes/entry
+ 4 bytes/evicted entry).
• Large Q-table cache: Our proposed integrated solution adds the memory
cost of the QP Net to that of the original Q-table cache method. For a
fair evaluation, we need to compare latency under the same memory cost
by enlarging the original Q-table cache so that it uses the same amount
of memory as the integrated solution. Table 6.12 shows the memory cost of
the integrated solution (details of the memory cost are discussed later).
As shown in Table 6.12, the integrated solution requires 25.96KB and
88.68KB for 50 and 100 neurons per hidden layer, respectively. One
Q-table cache entry requires 16B (described in Section 6.2.2). Thus, for
the iso-memory comparison, we use 1,107 and 3,783 Q-table cache entries
for the 3D 512Gb and 3D 128Gb flash memories, respectively.
Table 6.11 compares the latency of our integrated solution (QPN) and
that of the large Q-table cache only solution (LQTC) at the 99.9999th
percentile. The table shows that our proposed solution offers lower
(better) average latency than the large Q-table cache for both the 3D
512Gb and 3D 128Gb flash memories. The large Q-table cache can manage
more state information than the original Q-table cache. However, the
learned information is still lost due to state eviction. In particular, a
larger Q-table cache can suffer from wrong action choices in immature
states. On the other hand, the proposed solution tries to avoid such
immature states by
6.3.6 Immature State Analysis
We observed that in some workloads (TPCC in 3D 512Gb flash memory
and TPCC, MSNSFS, MSNSFS*, RBESQL, and RBESQL* in 3D 128Gb
flash memory), the original small Q-table cache (QTC) outperforms the
large Q-table cache. As shown in Table 6.10, MSNSFS, MSNSFS*, RBESQL,
and RBESQL* show a relatively large number of states and low hit rates.
Due to the large number of states and the low locality of the states, the
large Q-table cache also suffers from low hit rates, e.g., 56% (MSNSFS),
64% (MSNSFS*), 62% (RBESQL) and 68% (RBESQL*), respectively.
As will be explained below, in such a case, the problem of immature
states is significant thereby degrading the quality of the large Q-table
cache. On the other hand, TPCC shows a high hit rate in the original
small Q-table cache. In such a case, increasing the size of Q-table cache
could increase the number of immature states worsening the quality of
Q-table cache.
In order to analyze the amount of immature states, we obtained the
histogram of the access frequency of the states that remain in the large
Q-table cache after the workload runs (see footnote 8), as shown in
Figure 6.10. In the figure, the x-axis represents the access frequency and
the y-axis the number of states in the corresponding access frequency
range. As shown in the figure, the workloads discussed above (TPCC,
MSNSFS, MSNSFS*, RBESQL, and RBESQL*) have a large number of states with
a small access frequency (i.e., fewer than ten accesses). We consider
such states immature states (see footnote 9), which degrade the quality
of the large Q-table cache.
8 The figure shows a snapshot of the Q-table cache at the end of the
trace run. We use this as a representative state of the system run.
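The histogram and the "fewer than ten accesses" criterion can be sketched as follows. This is a minimal illustration, not the thesis tooling: the bin width of 10 and both function names are assumptions, and `visit_counts` stands for the per-state access counts read from a snapshot of the Q-table cache.

```python
from collections import Counter

IMMATURE_THRESHOLD = 10   # the text treats fewer than ten accesses as immature

def access_histogram(visit_counts, bin_width=10):
    """Map each bin start (0, 10, 20, ...) to the number of states whose
    access count falls in [start, start + bin_width). Bin width is assumed."""
    return Counter((v // bin_width) * bin_width for v in visit_counts)

def immature_fraction(visit_counts, threshold=IMMATURE_THRESHOLD):
    """Fraction of cached states accessed fewer than `threshold` times."""
    if not visit_counts:
        return 0.0
    return sum(1 for v in visit_counts if v < threshold) / len(visit_counts)

# Hypothetical snapshot of per-state access counts.
snapshot = [1, 3, 9, 10, 25, 120]
hist = access_histogram(snapshot)
frac = immature_fraction(snapshot)
```

A workload whose snapshot puts most of its mass in the first bin would, under this classification, be dominated by immature states.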
The original small Q-table cache mitigates the negative impact of im-
mature states in two ways. First, the small Q-table cache tends to have a
smaller number of immature states. Second, in case of cache miss (which
often corresponds to the access to immature states in the large Q-table
cache), the small Q-table cache takes a conservative action choice, i.e.,
action 0 (No GC) which does not give an immediate latency increase
while the large Q-table cache can select an inappropriate action from an
immature state. Note also that the tail latency (i.e., 99.9999th percentile)
can often be affected by a single inadequate action selection due to its
infrequent occurrence.
On the contrary, RA* and Financial1 show better (low) latency in the
large Q-table cache than in the small cache. As Figure 6.10 shows, these
workloads have a relatively small number of immature states. Therefore,
the possibility of selecting inappropriate actions could be lower.
Another observation for some workloads (MSNSFS, RBESQL, MSNSFS* and
RBESQL*) is that, as shown in Table 6.11, their latency with LQTC is
worse than that with QTC on the 3D 128Gb flash memory, but not on the 3D
512Gb flash memory. As shown in the original Q-table cache solution [10],
the Q-table cache size that shows good performance depends on the
characteristics of the workload. MSNSFS and RBESQL show the smallest
latency in a Q-table cache with 2,000 entries, and the latency increases
as the size of the Q-table cache gets larger. If the size of the Q-table
cache is too small, there is a high probability of losing the learned
information. On the other hand, if the size of the Q-table cache is too
large, the influence of immature states can become more significant.
As mentioned earlier, we used Q-table caches with 1,107 and 3,783
entries for 3D 512Gb and 3D 128Gb, respectively. In the case of 3D 512Gb
(1,107 Q-table cache entries), latency could be reduced as the
possibility of losing the learned information gets reduced. On the other
hand, in the case of 3D 128Gb (3,783 Q-table cache entries), the latency
tends to increase due to the negative effect of a large number of
immature states.
9 The definition of state maturity is difficult to obtain. Thus, our
classification of immature states could be subjective.
In order to resolve the problem of immature states, we also evaluated
a simple heuristic that applies a threshold to the count of state
visits. If the count is smaller than the threshold, we select the
conservative choice, action 0, while continuing to train the Q-table and
the QP Net. Table 6.11 gives the latency comparison. LQ20 and LQ40
represent the large Q-table with thresholds of 20 and 40, respectively.
As the table shows, the heuristic gives results similar to those of the
baseline large Q-table (LQTC). We expected that the conservative choice
of action would help avoid long latency due to wrong choices in immature
states. However, the heuristic itself turns out to suffer from immature
states, because only the Q-values of action 0 are trained while the
counts are small, which leaves the Q-values of the other actions
immature. We expect that there is potential for further improvement by
judiciously handling immature states, especially when a large QTC is
available, which is left for future work.
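The two mechanisms discussed in this subsection, the conservative fallback to action 0 on a cache miss and the visit-count threshold of the LQ20/LQ40 heuristic, can be sketched together. This is a minimal sketch under stated assumptions rather than the thesis implementation: the class and function names are hypothetical, entries are zero-initialized as in the original Q-table cache, and exploration, Q-value updates and the QP Net are omitted.

```python
from collections import OrderedDict

NUM_ACTIONS = 3   # action 0 = no GC; the other actions perform partial GC
NO_GC = 0

class QTableCache:
    """LRU-managed Q-table cache with per-state visit counts (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # state -> {"q": [...], "visits": int}

    def lookup(self, state):
        """Return the entry for `state`, or None on a miss; a hit refreshes recency."""
        entry = self.entries.get(state)
        if entry is not None:
            self.entries.move_to_end(state)
            entry["visits"] += 1
        return entry

    def insert(self, state, init_q=None):
        """Insert a new state, evicting the least recently used entry if full."""
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # eviction loses the learned Q-values
        q = list(init_q) if init_q is not None else [0.0] * NUM_ACTIONS
        self.entries[state] = {"q": q, "visits": 1}
        return self.entries[state]

def select_action(cache, state, visit_threshold=0):
    """Greedy action from the cache; fall back to NO_GC on a miss or when the
    state has been visited fewer than `visit_threshold` times."""
    entry = cache.lookup(state)
    if entry is None:
        cache.insert(state)
        return NO_GC
    if entry["visits"] < visit_threshold:
        return NO_GC
    q = entry["q"]
    return max(range(NUM_ACTIONS), key=lambda a: q[a])
```

With `visit_threshold=20` or `40` the selection corresponds to the LQ20 and LQ40 variants as described. The sketch also makes the reported weakness visible: while a state stays below the threshold only action 0 is ever taken, so the Q-values of the other actions receive no experience.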
6.3.7 Miscellaneous Analysis
• Reward assignment options for action 0: As mentioned in Section
6.2.2, it is important to deal with the negative reward in the case of
action 0. We compared two reward assignment options for action 0 on our
integrated solution: negative reward and zero reward. Table 6.13 compares
the average latency (normalized to the baseline) at the 99.9999th
percentile. As the table shows, the zero-reward option gives 11% and 12%
lower average latency on the 3D 512Gb and 3D 128Gb flash memories,
respectively. This shows the effectiveness of zero reward assignment
against the latency increase associated with action 0.
• Computation and memory overhead: We measured the computation
overhead of the QP Net (prediction and training time) and that of the
Q-table cache access (search and insert time) on the ZED board [45] with
a Cortex A9 processor. The Cortex R5 processor [46] is mainly used in
SSD products [47,48]. Therefore, the results measured on the Cortex A9
are converted to those of the Cortex R5 using the DMIPS/MHz performance
information [46] provided by ARM. Note that, as mentioned earlier, our
integrated solution does not trigger the partial GC in the case of
read requests.
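The conversion from measured Cortex A9 runtimes to Cortex R5 estimates can be sketched as a simple scaling by the ratio of DMIPS/MHz ratings. The ratings used below (2.50 for the A9 and 1.67 for the R5) are our assumption of the commonly quoted ARM figures and are not stated in the text, as is the assumption of equal clock frequency; with them, the Q-table cache times of Table 6.15 (2.90 µs and 12.30 µs measured) scale to about 4.34 µs and 18.41 µs.

```python
# Assumed DMIPS/MHz ratings; these are not given in the text and should be
# checked against ARM's published figures before reuse.
A9_DMIPS_PER_MHZ = 2.50
R5_DMIPS_PER_MHZ = 1.67

def convert_a9_to_r5(time_us_on_a9):
    """Estimate the Cortex R5 runtime from a Cortex A9 measurement, assuming
    equal clock frequency and runtime inversely proportional to DMIPS/MHz."""
    return time_us_on_a9 * A9_DMIPS_PER_MHZ / R5_DMIPS_PER_MHZ

search_r5 = round(convert_a9_to_r5(2.90), 2)    # Q-table cache search
insert_r5 = round(convert_a9_to_r5(12.30), 2)   # Q-table cache insert
print(search_r5, insert_r5)
```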
Table 6.14 shows the training and prediction runtime of the QP Net,
both of which are smaller than a typical program time (tPROG, ∼1ms)
in both cases of 50 and 100 neurons per hidden layer. Therefore, the
computation overhead of QP Net training/prediction can be considered
negligible since it can be hidden by the write latency of the flash
memory. In addition, the computation overhead can be further reduced if
the QP Net runs on a hardware accelerator equipped on the SSD [49–51].
Table 6.15 shows the search and insert time of the Q-table cache
with 100 entries. In the case of insert, the runtime covers a search of
100 entries and eviction/insertion. As both tables show, the computation
overhead of our integrated solution (Q-table cache and QP Net) can be
considered negligible since it can be hidden by the write latency of the
flash memory.
Table 6.12 shows the memory overhead of our integrated solution.
As the table shows, the QP Net includes 6,153 and 22,303 parameters (4
bytes each) for 50 and 100 neurons per hidden layer, respectively. The
Q-table cache with 100 entries for three actions takes 1.56KB (described
in Section 6.2.2). Thus, the total memory overhead of both the Q-table
cache and the QP Net is 25.96KB/88.68KB for the QP Net with 50/100
neurons per hidden layer. The memory cost is negligible considering that
the DRAM buffer of an SSD is much larger, typically 1GB per 1TB of
capacity [52–54].
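The parameter counts and QP Net sizes reported in Table 6.12 can be reproduced by counting weights and biases of a fully connected network. The sketch below is illustrative: the layer widths (17 inputs, three hidden layers of equal width, 3 outputs) are inferred from the per-layer counts in the table, and the function names are ours.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def size_kb(num_params, bytes_per_param=4):
    """Memory footprint in KB (1 KB = 1024 bytes) for 4-byte parameters."""
    return num_params * bytes_per_param / 1024

results = {}
for width in (50, 100):
    # Assumed structure: 17 state inputs, three hidden layers, 3 Q-value outputs.
    layers = [17, width, width, width, 3]
    n = mlp_param_count(layers)
    results[width] = (n, round(size_kb(n), 2))

qtc_kb = round(100 * 16 / 1024, 2)   # 100 entries x 16 bytes/entry
print(results, qtc_kb)
```

Under these assumptions the counts come out to 6,153 and 22,303 parameters (24.04KB and 87.12KB), and the 100-entry Q-table cache to 1.56KB, matching the table rows.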
• Erase count: Tables 6.16 and 6.17 compare the erase count for the
3D 512Gb and 3D 128Gb flash memories. The results are normalized
to the baseline. In both cases, the erase count of the proposed integrated
solution is similar to the baseline.
• Block size effects: In order to evaluate the effects of block size, we
varied the block size (# pages/block) from 384 to 1,536, adjusting the
number of blocks to keep the same capacity for the 3D 512Gb flash memory
(the original block size of which is 768), and measured latency at the
99.9999th percentile on the baseline and the proposed solution. Our
experiments (the results of which are omitted due to the page limit)
show that the latency of both the baseline and QPN increases as block
size increases. However, the latency of QPN is less sensitive to block
size than that of the baseline, because QPN offers lower latency than the
baseline in the first place. Thus, QPN helps to limit the latency
increase as block size grows.
           Average latency at 99.9999th percentile
           3D 512Gb   3D 128Gb
Negative   0.84       0.69

                          # of parameters
                          50 neurons   100 neurons
Input layer               900          1800
Hidden layer              2550         10100
Hidden layer              2550         10100
Output layer              153          303
Total                     6153         22303
QP Net size [KB]          24.04        87.12
Q-table cache size [KB]   1.56         1.56

MSNSFS MSNSFS* TPCC RBESQL RBESQL* RA* Financial1
Figure 6.10 Number of states in Q-table cache for each access frequency
in 3D 128Gb (after running the workloads).
Table 6.13 Average latency comparison between negative and zero reward
for action 0.








































































































Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
QTC 0.99 1.00 1.03 1.02 1.00 0.99 0.99 1.00 0.98 1.00 1.00 1.00 1.00 1.00 
QPN 0.98 0.99 1.02 0.99 0.91 0.99 0.99 1.00 0.99 1.00 1.00 1.00 1.00 0.99 
 
TABLE XVII 














































































Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
QTC 0.99 1.00 0.99 1.00 1.00 1.00 1.02 1.00 1.00 1.00 1.03 1.00 
QPN 0.93 1.00 1.00 1.01 0.97 1.00 0.99 0.97 1.00 1.00 1.03 0.99 
 
TABLE X 
LATENCY COMPARISON OF LARGE Q-TABLE CACHE (NORMALIZED TO THE BASELINE). 
 










































































































































































QPN 0.15 0.82 0.93 0.87 1.00 0.88 0.81 0.61 0.47 0.78 0.86 0.78 0.75 0.75 0.02 0.80 0.31 0.79 0.62 0.84 0.88 0.65 0.88 0.80 0.33 0.63
QTC 0.20 0.98 1.03 1.02 1.04 0.95 0.93 0.93 0.70 1.00 0.92 1.00 0.88 0.89 0.12 0.87 0.68 0.86 1.01 0.97 1.01 0.96 0.92 0.85 0.34 0.78
LQTC 0.31 0.95 0.93 1.00 1.02 0.94 0.89 0.81 0.62 0.90 0.89 0.90 0.83 0.85 0.55 0.86 0.36 0.77 1.06 1.02 1.04 0.71 0.90 0.97 0.33 0.78
LQ20 0.31 1.00 0.99 1.01 1.04 0.98 0.91 0.93 0.90 1.25 1.13 0.82 0.82 0.93 0.46 0.88 0.48 0.97 1.00 0.98 1.14 0.71 0.90 1.02 0.53 0.82
LQ40 0.31 1.00 1.08 1.01 1.04 0.98 0.92 0.93 0.90 1.29 1.17 0.82 0.99 0.96 0.46 0.88 0.56 1.06 1.01 1.05 1.18 0.71 0.90 1.02 0.63 0.86
 
Table XIV 
AVERAGE LATENCY COMPARISON BETWEEN NEGATIVE AND 
ZERO REWARD FOR ACTION 0. 
 
Reward 
Average latency at 99.9999th percentile 
3D 512Gb 3D 128Gb 
Negative 0.86 0.75 
Zero 0.75 0.63 
































Parameters 3D 128Gb 3D 512Gb 
Page size 8KB 16KB 
Number of pages / block 384 768 
Number of blocks / plane 2731 2874 
Number of planes 2 2 
Page read time 49 μs 60 μs 
Page program time 600 μs 700 μs 
Block erase time 4000 μs 3500 μs 




# of neurons per a hidden layer 




35.15 105.55 215.15 368.30 
Converted for 
Cortex R5 




81.80 290.80 643.25 1119.50 
Converted for 
Cortex R5 
122.70 436.20 964.88 1679.25 
 
 
Information used for state # of bins 
Current (t) inter-request interval 32 
Previous (t-1) inter-request interval 32 
Previous (t-1) request size 5 
Previous (t-2) request size 5 
Previous (t-1) action (# of performed partial gc) 3 
Previous (t-2) action (# of performed partial gc) 3 
Previous (t-3) action (# of performed partial gc) 3 
Previous (t-4) action (# of performed partial gc) 3 
Previous (t-5) action (# of performed partial gc) 3 
Previous (t-1) valid page copy (performed or not) 2 
Previous (t-2) valid page copy (performed or not) 2 
Previous (t-1) block erase (performed or not) 2 
Previous (t-2) block erase (performed or not) 2 
# of free blocks 12 
Current (t) requested operation 2 
Previous (t-1) requested operation 2 
Previous (t-2) requested operation 2 
 
 
Parameters 3D 128Gb 3D 512Gb 
Number of planes 2 2 
Number of blocks / plane 2731 2874 
Number of pages / block 384 768 
Page size 8KB 16KB 
Page read time 49 μs 60 μs 
Page program time 600 μs 700 μs 
Block erase time 4000 μs 3500 μs 
Data transfer rate 533Mbps 1Gbps 
 
Table 6.15 Computation overhead of Q-table Cache [µs].
 
                               Measured on Cortex A9   Converted for Cortex R5
Search                         2.90                    4.34
Insert (Search+Erase+Insert)   12.30                   18.41
 








































































































6.3.8 Multi Channel Analysis
The evaluations discussed so far were conducted in a single-channel SSD
environment. To make the evaluation more realistic and to test the
robustness of the proposed solution, we performed additional experiments
in a 4-channel SSD environment.
In addition to the workloads used in the previous experiments, five
additional workloads extracted using RocksDB [55] were utilized
(randomtransaction, readwhilewriting, overwrite, filluniquerandom, and
readrandomwriterandom), together with three additional real-world
workloads (VDI0222LUN1, VDI0223LUN2, and VDI0224LUN3) [56].
Table 6.18 shows the latency comparison for the 3D 512Gb flash memory
in the 4-channel environment. Only a few workloads show latency
reduction with either of our proposed solutions. This is a consequence
of the SSD capacity having quadrupled with the use of 4 channels: GC
occurs less often, so GC-induced latency is not significant.
We therefore reduced the number of blocks of the flash memory by a
factor of four to evaluate the proposed solution in the 4-channel
environment while maintaining the same SSD capacity as in the previous
experiments. All subsequent experiments were performed using this
configuration.
• Latency comparison at four channel configuration: Table 6.19

















































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































memory. Both of our proposed methods (QTC and QPN in the table)
show better (lower) average latency than the baseline. QTC reduces
latency to 0.89× of the baseline at the 99.9999th percentile, 0.90× at
the 99.99th and 0.97× at the 99th. QPN offers further reductions, to
0.83× at the 99.9999th, 0.86× at the 99.99th and 0.95× at the 99th.
The baseline (Base in the table) uses a small number of states to learn
the policy. In contrast, QTC exploits a much larger number of
fine-grained states and maintains the key states among them. Thus, it
offers lower latency than the baseline.
Our integrated solution of Q-table cache and QP Net (QPN in the table)
gives further latency reductions by training the QP Net during runtime
to provide a better initialization of the Q-table cache than the zero
initialization of the original Q-table cache, which in turn contributes
to better action selection. Note that the latency improvement of QPN
comes from better Q-value initialization, since both QTC and QPN utilize
the same number of candidate states.
Table 6.20 compares the latency on the 3D 128Gb flash memory. Our
methods show better (lower) average latency than the baseline:
0.86×(QTC)/0.80×(QPN) at the 99.9999th percentile, 0.89×(QTC)/
0.85×(QPN) at the 99.99th, and 0.99×(QTC)/0.97×(QPN) at the 99th. That
is, at the 99.99th and 99.9999th percentiles, the QP Net gives additional
4-6% reductions over the original Q-table cache [10] on the two types of
flash memory.
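The normalized figures quoted above are ratios of tail latencies, which can be computed from per-request latency samples as sketched below. The text does not say which percentile estimator was used, so the nearest-rank definition here is an assumption, and the function names are illustrative.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def normalized_tail_latency(method_latencies, baseline_latencies, pct=99.9999):
    """Tail latency of a method divided by that of the baseline (lower is better)."""
    return percentile(method_latencies, pct) / percentile(baseline_latencies, pct)

# Hypothetical latency samples (arbitrary units), for illustration only.
baseline = [1] * 99 + [100]
method = [1] * 99 + [80]
ratio = normalized_tail_latency(method, baseline, pct=99.5)
```

A ratio of 0.83, for instance, means the method's tail latency at that percentile is 83% of the baseline's. Very high percentiles such as the 99.9999th are only meaningful when the trace contains well over a million requests.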
















































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































vides a suspension function to achieve fast read latency [57]. When a read
request is received, this scheme suspends an in-progress program or erase
operation, which takes considerably longer than a read operation.
The suspended program or erase operation is resumed after the
latency-critical operation (i.e., the read operation) completes.
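The effect of suspension on read latency can be illustrated with a small timing model. The following is a minimal sketch, not the implementation evaluated in this chapter: the function name, the suspend-overhead parameter, and all timing values are assumptions chosen only for illustration.

```python
def read_latency_us(t_read, busy_remaining, suspend=False, t_suspend=0):
    """Latency (us) of a read that arrives while a program/erase is running.

    t_read:         time to serve the read itself (hypothetical value)
    busy_remaining: time left in the ongoing program/erase operation
    suspend:        whether the flash supports program/erase suspension
    t_suspend:      assumed overhead of suspending the ongoing operation
    """
    if busy_remaining <= 0:
        return t_read                      # chip is idle, no waiting
    if suspend:
        return t_suspend + t_read          # ongoing operation is paused first
    return busy_remaining + t_read         # read waits for the operation to end


# Hypothetical timings: 50 us read, erase with 3000 us left, 20 us suspend overhead.
without_suspension = read_latency_us(50, 3000)                           # 3050
with_suspension = read_latency_us(50, 3000, suspend=True, t_suspend=20)  # 70
```

The model only captures the read that hits a busy chip; the suspended operation still has to be resumed afterwards, so it describes read latency rather than total work.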
Tables 6.21 and 6.22 compare the normalized latency of our original
solutions with that of the same solutions combined with the suspension
scheme on the 3D512G and 3D128G flash memories. We applied the suspension
scheme to both of our solutions (QTCSUS and QPNSUS in the tables). The tables
show that applying the suspension scheme gives an average latency similar
to that of our original solutions on both flash memories. On the 3D512G
flash memory, the normalized average latencies are 0.89×(QTC)/0.87×(QTCSUS)
and 0.83×(QPN)/0.80×(QPNSUS). On the 3D128G flash memory, they are
0.86×(QTC)/0.84×(QTCSUS) and 0.80×(QPN)/0.79×(QPNSUS).
• Demand-based Q-table cache: As mentioned earlier, our proposed
Q-table cache solution uses a small cache of 100 entries. Because the Q-table
cache has the limitation of losing learned information, we can achieve better
performance by increasing its size significantly; in
this case, we use 500,000 entries. However, with a substantial increase in




























































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































random-access memory or dynamic random-access memory) of the SSD
controller. Therefore, a demand-based Q-table cache is more suitable.
Fig. 6.11 shows the three types of demand-based Q-table cache that we
employed in the evaluation. The first type (DE100) uses a Q-table cache
with 100 entries and can store up to 500,000 entries in flash memory.
When a Q-table cache miss occurs, it evicts some entries from the
Q-table cache to flash memory and fetches the necessary entries from
flash memory. Eviction and fetching are performed in units of 20 entries.
The second type (DE1000) uses a Q-table cache with 1000 entries and
can likewise store up to 500,000 entries in flash memory; eviction and
fetching are performed in units of 100 entries. The last type (DEWO) uses a
Q-table cache with 500,000 entries. In this case, we assume that there is no
overhead due to eviction or fetching, i.e., the entire Q-table cache with
500,000 entries is placed in DRAM.
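The miss handling of DE100 and DE1000 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the evaluation: the class name, the LRU victim selection, the grouping of integer state indices into fetch units, and the zero initialization of unseen states are all assumptions, and `backing` is a plain dictionary standing in for the flash-resident Q-table.

```python
from collections import OrderedDict


class DemandQTableCache:
    """Toy demand-based Q-table cache (hypothetical, for illustration only)."""

    def __init__(self, capacity=100, batch=20, num_actions=4):
        self.capacity = capacity          # entries held in controller memory
        self.batch = batch                # eviction/fetch unit, in entries
        self.num_actions = num_actions
        self.cache = OrderedDict()        # state -> Q-values, in LRU order
        self.backing = {}                 # stand-in for the Q-table in flash
        self.misses = 0
        self.flash_writes = 0             # entries written back on eviction
        self.flash_reads = 0              # entries fetched from flash

    def lookup(self, state):
        if state in self.cache:
            self.cache.move_to_end(state)         # mark as most recently used
            return self.cache[state]
        self.misses += 1
        if len(self.cache) + self.batch > self.capacity:
            self._evict()                         # make room for one fetch unit
        self._fetch_group(state)
        return self.cache[state]

    def _evict(self):
        # Write the least recently used entries back to flash, one unit at a time.
        for _ in range(min(self.batch, len(self.cache))):
            victim, q_values = self.cache.popitem(last=False)
            self.backing[victim] = q_values
            self.flash_writes += 1

    def _fetch_group(self, state):
        # Assumption: states are integers and a fetch unit covers the
        # `batch` consecutive state indices that contain `state`.
        base = (state // self.batch) * self.batch
        for s in range(base, base + self.batch):
            if s not in self.cache and s in self.backing:
                self.cache[s] = self.backing.pop(s)
                self.flash_reads += 1
        if state not in self.cache:               # never seen before
            self.cache[state] = [0.0] * self.num_actions
        self.cache.move_to_end(state)
```

With `capacity=100, batch=20` the sketch uses the DE100 parameters, and with `capacity=1000, batch=100` the DE1000 parameters; the counters `flash_writes` and `flash_reads` indicate where the eviction and fetching overhead arises.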
Tables 6.23 and 6.24 compare the normalized latency of our Q-table
cache based solution (QTC) and the demand-based Q-table caches on the 3D
512Gb and 3D 128Gb flash memories. In Table 6.23, QTC offers the best
normalized average latency on the 3D 512Gb flash memory:
0.89×(QTC)/1.57×(DE100)/1.28×(DE1000)/0.92×(DEWO). On the 3D 128Gb flash
memory, the normalized average latencies are
0.87×(QTC)/1.82×(DE100)/1.44×(DE1000)/0.89×(DEWO).
DE100 and DE1000 suffer from considerable eviction and fetching overhead
on both flash memory configurations. DE100 performs more
eviction and fetching operations than DE1000 because it uses a smaller
Q-table cache, so Q-table cache misses occur more frequently in
DE100.
As mentioned earlier, DEWO has no eviction or fetching overhead.
Therefore, the latency of QTC and DEWO can be compared in terms of the
latency change observed when varying the size of the Q-table cache. In the TPCC,
MSNSFS, RBESQL, DTR*, webmail*, and RA* workloads, QTC offers
better average latency than DEWO. These workloads suffer from the
immature state problem that we have already analyzed. In particular, TPCC
and webmail* have higher hit rates of 88% and 77%, respectively.
Increasing the size of the Q-table cache could increase the number of immature
states and subsequently worsen the quality of the Q-table cache.
• Q-table cache initialization with average of Q-values: As
mentioned earlier, the Q-table cache has a zero initialization problem. One
way to alleviate this is to initialize new entries with the average of the
Q-values in the Q-table cache instead of zero. Fig. 6.12 shows a Q-table
cache initialized with the average Q-value. As shown in the figure, the Q-value
of a newly inserted state is initialized with the average of the Q-values in the
Q-table cache.
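The two initialization policies can be contrasted in a few lines. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the cache is modeled as a dictionary from state to a list of per-action Q-values, and the average is taken per action over all cached states (the text does not specify whether the average is per action or over the whole cache).

```python
def average_q_per_action(cache, num_actions):
    """Per-action mean of the Q-values currently held in the cache."""
    if not cache:
        return [0.0] * num_actions        # nothing learned yet
    return [
        sum(q_values[a] for q_values in cache.values()) / len(cache)
        for a in range(num_actions)
    ]


def insert_state(cache, state, num_actions, mode="zero"):
    """Insert a new state, initialized with zero (QTC) or the average (AVE)."""
    if mode == "average":
        init = average_q_per_action(cache, num_actions)
    else:
        init = [0.0] * num_actions
    cache[state] = init
    return init


# Hypothetical cache contents with two actions per state.
example_cache = {"s0": [1.0, 3.0], "s1": [3.0, 5.0]}
new_entry = insert_state(dict(example_cache), "s2", 2, mode="average")  # [2.0, 4.0]
```

Average initialization helps only when the Q-value a state would eventually learn is closer to the cache average than to zero, which is the quantity the prediction-error comparison in this section measures.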
Tables 6.25 and 6.26 compare the normalized latency of our Q-table
cache solution (QTC) and the Q-table cache initialized with the average
Q-value (AVE) on the 3D512G and 3D128G flash memories. QTC offers better






























































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































3D512G flash memory. On the 3D 128G flash memory, the normalized
average latencies are 0.87×(QTC)/0.89×(AVE).
In particular, on both flash memories, AVE shows a smaller latency
reduction in three workloads (TPCC, rdwhilewr, and overwrite) than in
the other workloads. Our analysis is that these three workloads incur a much
larger initialization error for newly inserted entries.
Table 6.27 compares the Q-value prediction error of the Q-table cache
initialized with zero and with the average Q-value. We define the error for
initialization with the average Q-value as follows:
$$E_{\mathrm{average}} = \left| Q_{\mathrm{evicted}}(s) - Q_{\mathrm{average}}(s) \right| \tag{6.1}$$
where $Q_{\mathrm{evicted}}(s)$ and $Q_{\mathrm{average}}(s)$ represent the Q-value of the evicted state $s$
and the Q-value of the newly inserted state $s$, respectively.
The error for initialization with zero is defined as follows:
$$E_{\mathrm{zero}} = \left| Q_{\mathrm{evicted}}(s) - 0 \right| \tag{6.2}$$
As shown in Table 6.27, AVE gives a relatively larger prediction
error than QTC in these three workloads (TPCC, rdwhilewr, and overwrite).
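Equations (6.1) and (6.2) can be evaluated directly on recorded samples. The following is a minimal sketch, assuming each sample is a pair of the Q-value a state held at eviction and the average-based value assigned when that state is inserted again; the function names and the sample values are hypothetical and are not taken from Table 6.27.

```python
def init_errors(q_evicted, q_average):
    """Return (E_average, E_zero) for one state, following Eqs. (6.1) and (6.2)."""
    e_average = abs(q_evicted - q_average)
    e_zero = abs(q_evicted - 0.0)
    return e_average, e_zero


def mean_init_errors(samples):
    """Average both errors over a list of (q_evicted, q_average) samples."""
    errors = [init_errors(q_e, q_a) for q_e, q_a in samples]
    n = len(errors)
    return (sum(e[0] for e in errors) / n, sum(e[1] for e in errors) / n)


# Hypothetical samples: evicted Q-values close to zero, cache average far from zero.
samples = [(0.25, 1.0), (-0.5, 1.0)]
mean_e_average, mean_e_zero = mean_init_errors(samples)  # (1.125, 0.375)
```

In this made-up case the average-based initialization has the larger error, which is the situation reported for the three workloads above; when evicted Q-values cluster around the cache average instead, the relation reverses.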












Figure 6.11 Three types of demand-based Q-table cache.





[Figure panels: Q-table cache initialization; Q-table cache initialization with zero (QTC)]



































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Table 6.27 Q-value prediction error comparison of Q-table cache initial-



















































































































































































Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
QTC 0.82 1.00 1.00 1.00 1.00 1.00 0.71 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.78 0.79 0.59 0.94 
QPN 0.80 1.00 1.00 1.00 1.00 1.00 0.69 1.00 1.00 0.94 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.55 0.57 0.59 0.91 
99. 
99th 
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
QTC 0.82 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 0.79 0.85 0.83 0.97 
QPN 0.77 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 0.78 0.81 0.83 0.96 
99th 
Base 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 
QTC 0.84 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 
QPN 0.83 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 
 
3D 512 4CH 
 
















































Ezero 0.47 2.44 1.62 1.90 0.22 2.61 0.95 1.00 




Conclusion and Future Work
7.1 Conclusion
In this dissertation, we addressed the problem of long-tail latency in flash
memory-based storage systems. To this end, we proposed a reinforcement
learning (RL)-assisted GC scheduling technique and quantitatively examined
the performance of its three versions.
We proposed an RL-assisted GC scheduling technique that learns the
storage access behavior online and determines the number of GC operations
to perform in order to exploit the idle time. We also presented an aggressive
method, which further reduces the long-tail latency by aggressively
performing fine-grained GC operations. We evaluated the proposed methods
with eight real-world workloads on two types of flash memory storage.
We proposed a technique that dynamically manages key states in
RL-assisted GC to reduce the long-tail latency. This technique uses many
fine-grained pieces of information as state candidates and manages the key
states that suitably represent the characteristics of the workload using a
relatively small amount of memory. Thus, the proposed method
can reduce the long-tail latency even further. In total, eleven workloads
were evaluated on two types of flash memory storage.
We also proposed a Q-value prediction network that predicts the ini-
tial Q-value of a new state in the Q-table cache. The integrated solu-
tion of the Q-table cache and Q-value prediction network can exploit
the short-term history of the system with a low-cost Q-table cache. It is
also equipped with a small network called Q-value prediction network to
make use of the long-term history and provide good Q-value initializa-
tion for the Q-table cache. We evaluated the proposed method with 24
workloads on two types of flash memory storage. We also compared our
proposed solution with alternatives, including the actor-critic method,
a Q-value-prediction-network-only method, and a large-Q-table-cache-only method.
We conducted sensitivity analyses to investigate the contribution of each
component in our solution, suitable configurations such as the size of
the Q-value prediction network, as well as the effects of pre-training the
Q-value prediction network. The experiments show that our proposed




Recently, the use of SSDs has become widespread, from personal
computers to server systems in data centers. The performance of computing
units such as CPUs and GPUs continues to increase, and the size of the data
being processed is also growing, which requires higher performance from the
storage system. In addition, due to the virtualization used in cloud systems,
workloads with various characteristics are mixed and delivered to the storage.
Therefore, the workload behavior and the requirements placed on SSDs differ
from system to system. Based on the study conducted for this dissertation, it is
clear that a method that can be easily applied to various SSD specifications
should be developed.
In this study, we use SSD-internal information and workload information
as states. However, the RL-assisted GC scheduler could better understand
the system behavior by using higher-layer information, such as information
from the host. For example, if an SSD can detect changes in access patterns,
running programs, or virtual machine behavior, it will be able to learn quickly
by using a large epsilon and learning rate.
From another perspective, when the characteristics of the workload change,
the information learned so far can be saved. If a similar workload
characteristic occurs and is detected at a later point, learning can start from the
previously stored information. This reduces the probability of selecting
an inappropriate action during the learning process and makes learning
faster.
Support for multi-channel configurations is also an important topic for
future work. This study was mainly conducted in a 1-channel configuration,
and some evaluations were performed in 4-channel configurations. Recent
SSDs actively utilize multiple channels to achieve high capacity and
performance. Therefore, it is important to find solutions that
perform well in a multi-channel configuration. For example, information
related to multiple channels (such as the channel blocking state and the number
of queued requests) can be added to the state. In addition, various
techniques for improving performance in flash memory (or the FTL) have been
studied and applied to products. It is also important to take advantage of
these recent developments in technology.
In this study, we have proposed three solutions based on a Q-table, because
Q-table-based RL is simple and requires few computational resources.
However, this tabular approach has some limitations.
For example, the size of the table increases in proportion to the size of
the state space; when the state space grows beyond a certain level,
it becomes difficult to apply the approach efficiently. In addition, because
binning is required to use a state with continuous values, the performance
and resource overhead of RL vary depending on the accuracy of the
binning. RL techniques using policy approximation have also been studied
to overcome the drawbacks of Q-table-based methods. These methods
use neural networks and can directly use states with continuous values.
However, they take longer to learn and incur more computational overhead than
tabular methods. In order to better understand the environment and
determine more appropriate actions, techniques that incorporate
policy approximation are required. Due to the nature of SSD applications,
however, it is difficult to apply policy approximation methods directly.
These techniques primarily approximate the policy through iterative
learning over a large number of epochs in the same environment, whereas
an SSD operates and learns in real time based on the requests received
from the host. In this regard, finding and improving RL algorithms that can
be effectively applied in SSDs is also an important research direction.
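The dependence of the table size on the binning can be made concrete with a short calculation. This is a minimal sketch under stated assumptions: the function names, the idle-time feature, the bin edges, and the feature and action counts are hypothetical and are not the state definition used in this dissertation.

```python
from bisect import bisect_right
from math import prod


def bin_index(value, edges):
    """Map a continuous value to a bin index, given sorted bin edges."""
    return bisect_right(edges, value)


def q_table_entries(bins_per_feature, num_actions):
    """Number of Q-values a full table needs for a given discretization."""
    return prod(bins_per_feature) * num_actions


# Hypothetical idle-time feature in microseconds: three edges give four bins.
idle_edges = [100, 1_000, 10_000]
coarse_bin = bin_index(5_000, idle_edges)          # 2

# Three features with 4 bins each and 5 actions, then the same with 8 bins each.
coarse_table = q_table_entries([4, 4, 4], 5)       # 320
fine_table = q_table_entries([8, 8, 8], 5)         # 2560
```

Doubling the number of bins for each of three features multiplies the table size by eight in this example, which illustrates why finer binning trades accuracy against memory in a tabular method.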
Bibliography
[1] Samsung Electronics Co., Ltd., “Samsung v-nand technology,”
2014.
[2] C. Kim, J. Cho, W. Jeong, I. Park, H. Park, D. Kim, D. Kang, S. Lee,
J. Lee, W. Kim, J. Park, Y. Ahn, J. Lee, J. Lee, S. Kim, H. Yoon,
J. Yu, N. Choi, Y. Kwon, N. Kim, H. Jang, J. Park, S. Song, Y. Park,
J. Bang, S. Hong, B. Jeong, H. Kim, C. Lee, Y. Min, I. Lee, I. Kim,
S. Kim, D. Yoon, K. Kim, Y. Choi, M. Kim, H. Kim, P. Kwak,
J. Ihm, D. Byeon, J. Lee, K. Park, and K. Kyung, “11.4 a 512gb
3b/cell 64-stacked wl 3d v-nand flash memory,” in 2017 IEEE In-
ternational Solid-State Circuits Conference (ISSCC), pp. 202–203,
Feb 2017.
[3] A. Gupta, Y. Kim, and B. Urgaonkar, “Dftl: A flash translation layer
employing demand-based selective caching of page-level address
mappings,” in Proceedings of the 14th International Conference
on Architectural Support for Programming Languages and Oper-
ating Systems, ASPLOS XIV, (New York, NY, USA), pp. 229–240,
ACM, 2009.
[4] S. Choi, D. Kim, S. Choi, B. Kim, S. Jung, K. Chun, N. Kim,
W. Lee, T. Shin, H. Jin, H. Cho, S. Ahn, Y. Hong, I. Yang, B. Kim,
P. Yoo, Y. Jung, J. Lee, J. Shin, T. Kim, K. Park, and J. Kim, “19.2
a 93.4mm2 64gb mlc nand-flash memory with 16nm cmos technol-
ogy,” in 2014 IEEE International Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), pp. 328–329, Feb 2014.
[5] C. Kim, D. Kim, W. Jeong, H. Kim, I. H. Park, H. Park, J. Lee,
J. Park, Y. Ahn, J. Y. Lee, S. Kim, H. Yoon, J. D. Yu, N. Choi,
N. Kim, H. Jang, J. Park, S. Song, Y. Park, J. Bang, S. Hong,
Y. Choi, M. Kim, H. Kim, P. Kwak, J. Ihm, D. S. Byeon, J. Lee,
K. Park, and K. Kyung, “A 512-gb 3-b/cell 64-stacked wl 3-d-
nand flash memory,” IEEE Journal of Solid-State Circuits, vol. 53,
pp. 124–133, Jan 2018.
[6] R. Micheloni, S. Aritome, and L. Crippa, “Array architectures for
3-d nand flash memories,” Proceedings of the IEEE, vol. 105,
pp. 1634–1649, Sep. 2017.
[7] C. Monzio Compagnoni, A. Goda, A. S. Spinelli, P. Feeley, A. L.
Lacaita, and A. Visconti, “Reviewing the evolution of the nand flash
technology,” Proceedings of the IEEE, vol. 105, pp. 1609–1633,
Sep. 2017.
[8] J. Dean and L. A. Barroso, “The tail at scale,” Commun. ACM,
vol. 56, pp. 74–80, Feb. 2013.
[9] W. Kang, D. Shin, and S. Yoo, “Reinforcement learning-assisted
garbage collection to mitigate long-tail latency in ssd,” ACM Trans.
Embed. Comput. Syst., vol. 16, pp. 134:1–134:20, Sept. 2017.
[10] W. Kang and S. Yoo, “Dynamic management of key states for re-
inforcement learning-assisted garbage collection to reduce long tail
latency in ssd,” in Proceedings of the 55th Annual Design Automa-
tion Conference, DAC ’18, (New York, NY, USA), pp. 8:1–8:6,
ACM, 2018.
[11] W. Kang and S. Yoo, “Q-value prediction for reinforcement learn-
ing assisted garbage collection to reduce long tail latency in ssd,”
IEEE Transactions on Computer-Aided Design of Integrated Cir-
cuits and Systems, 2019. Early Access.
[12] M. Hao, G. Soundararajan, D. Kenchammana-Hosekote, A. A.
Chien, and H. S. Gunawi, “The tail at store: A revelation from
millions of hours of disk and SSD deployments,” in 14th
USENIX Conference on File and Storage Technologies (FAST
16), pp. 263–276, 2016.
[13] B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Site Reliability
Engineering: How Google Runs Production Systems. 2016.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Intro-
duction. MIT Press, 1998.
[15] Q. Zhang, X. Li, L. Wang, T. Zhang, Y. Wang, and Z. Shao, “Lazy-
rtgc: A real-time lazy garbage collection mechanism with jointly
optimizing average and worst performance for nand flash memory
storage systems,” ACM Trans. Des. Autom. Electron. Syst., vol. 20,
pp. 43:1–43:32, June 2015.
[16] L.-P. Chang, T.-W. Kuo, and S.-W. Lo, “Real-time garbage collec-
tion for flash-memory storage systems of real-time embedded sys-
tems,” ACM Trans. Embed. Comput. Syst., vol. 3, pp. 837–863, Nov.
2004.
[17] S. Choudhuri and T. Givargis, “Deterministic service guarantees for
nand flash using partial block cleaning,” in Proceedings of the 6th
IEEE/ACM/IFIP International Conference on Hardware/Software
Codesign and System Synthesis, CODES+ISSS ’08, (New York,
NY, USA), pp. 19–24, ACM, 2008.
[18] Z. Qin, Y. Wang, D. Liu, and Z. Shao, “Real-time flash translation
layer for nand flash memory storage systems,” in 2012 IEEE 18th
Real Time and Embedded Technology and Applications Symposium,
pp. 35–44, April 2012.
[19] N. Shahidi and M. T. Kandemir, “Cachedgc: Cache-assisted
garbage collection in modern solid state drives,” in 2018 IEEE 26th
International Symposium on Modeling, Analysis, and Simulation of
Computer and Telecommunication Systems (MASCOTS), pp. 79–
86, Sept 2018.
[20] Qingsong Wei, Bozhao Gong, S. Pathak, B. Veeravalli, LingFang
Zeng, and K. Okada, “Waftl: A workload adaptive flash translation
layer with data partition,” in 2011 IEEE 27th Symposium on Mass
Storage Systems and Technologies (MSST), pp. 1–12, May 2011.
[21] S. Yan, H. Li, M. Hao, M. H. Tong, S. Sundararaman, A. A. Chien,
and H. S. Gunawi, “Tiny-tail flash: Near-perfect elimination of
garbage collection tail latencies in nand ssds,” ACM Trans. Storage,
vol. 13, pp. 22:1–22:26, Oct. 2017.
[22] G. Amvrosiadis, A. D. Brown, and A. Goel, “Opportunistic storage
maintenance,” in Proceedings of the 25th Symposium on Operating
Systems Principles, SOSP ’15, (New York, NY, USA), pp. 457–473,
ACM, 2015.
[23] J. He, D. Nguyen, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, “Re-
ducing file system tail latencies with chopper,” in 13th USENIX
Conference on File and Storage Technologies (FAST 15), (Santa
Clara, CA), pp. 119–133, USENIX Association, Feb. 2015.
[24] S. Yang, T. Harter, N. Agrawal, S. S. Kowsalya, A. Krishnamurthy,
S. Al-Kiswany, R. T. Kaushik, A. C. Arpaci-Dusseau, and R. H.
Arpaci-Dusseau, “Split-level i/o scheduling,” in Proceedings of the
25th Symposium on Operating Systems Principles, SOSP ’15, (New
York, NY, USA), pp. 474–489, ACM, 2015.
[25] L. Han, Y. Ryu, and K. Yim, “Cata: A garbage collection scheme for
flash memory file systems,” in Ubiquitous Intelligence and Com-
puting (J. Ma, H. Jin, L. T. Yang, and J. J.-P. Tsai, eds.), (Berlin,
Heidelberg), pp. 103–112, Springer Berlin Heidelberg, 2006.
[26] M. Lin and S. Chen, “Efficient and intelligent garbage collection
policy for nand flash-based consumer electronics,” IEEE Transac-
tions on Consumer Electronics, vol. 59, pp. 538–543, August 2013.
[27] E. Ipek, O. Mutlu, J. F. Martı́nez, and R. Caruana, “Self-optimizing
memory controllers: A reinforcement learning approach,” in Pro-
ceedings of the 35th Annual International Symposium on Computer
Architecture, ISCA ’08, (Washington, DC, USA), pp. 39–50, IEEE
Computer Society, 2008.
[28] A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Has-
sabis, D. Wierstra, and C. Blundell, “Neural episodic control,”
in Proceedings of the 34th International Conference on Machine
Learning (D. Precup and Y. W. Teh, eds.), vol. 70 of Proceedings
of Machine Learning Research, (International Convention Centre,
Sydney, Australia), pp. 2827–2836, PMLR, 06–11 Aug 2017.
[29] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement
learning,” 2013. arXiv:1312.5602. NIPS Deep
Learning Workshop 2013.
[30] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Ad-
vances in Neural Information Processing Systems 12 (S. A. Solla,
T. K. Leen, and K. Müller, eds.), pp. 1008–1014, MIT Press, 2000.
[31] Y. Kim, B. Tauras, A. Gupta, and B. Urgaonkar, “Flashsim: A simu-
lator for nand flash-based solid-state drives,” in 2009 First Interna-
tional Conference on Advances in System Simulation, pp. 125–131,
Sept 2009.
[32] SNIA, “I/o trace data files,” 2008.
[33] Filebench, “filebench/filebench,” 2016.
[34] M. Bilal and S.-G. Kang, “A cache management scheme for ef-
ficient content eviction and replication in cache networks,” IEEE
Access, vol. 5, pp. 1692–1701, 2017.
[35] G. Bebis and M. Georgiopoulos, “Feed-forward neural networks,”
IEEE Potentials, vol. 13, pp. 27–31, Oct 1994.
[36] R. Hecht-Nielsen, “III.3 - theory of the backpropagation neural
network,” in Neural Networks for Perception (H. Wechsler, ed.),
pp. 65 – 93, Academic Press, 1992.
[37] J. Heaton, Introduction to neural networks with Java. Heaton Re-
search, Inc., 2008.
[38] M. Kwon, J. Zhang, G. Park, W. Choi, D. Donofrio, J. Shalf,
M. Kandemir, and M. Jung, “Tracetracker: Hardware/software co-
evaluation for large-scale i/o workload reconstruction,” in 2017
IEEE International Symposium on Workload Characterization
(IISWC), pp. 87–96, Oct 2017.
[39] M. Hashemi, O. Mutlu, and Y. N. Patt, “Continuous runahead:
Transparent hardware acceleration for memory intensive work-
loads,” in The 49th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO-49, (Piscataway, NJ, USA), pp. 61:1–
61:12, IEEE Press, 2016.
[40] M. Hashemi, Khubaib, E. Ebrahimi, O. Mutlu, and Y. N. Patt, “Ac-
celerating dependent cache misses with an enhanced memory con-
troller,” in 2016 ACM/IEEE 43rd Annual International Symposium
on Computer Architecture (ISCA), pp. 444–455, June 2016.
[41] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “Mem-
ory bandwidth management for efficient performance isolation in
multi-core platforms,” IEEE Transactions on Computers, vol. 65,
pp. 562–576, Feb 2016.
[42] W. Wang, J. W. Davidson, and M. L. Soffa, “Predicting the memory
bandwidth and optimal core allocations for multi-threaded applica-
tions on large-scale numa machines,” in 2016 IEEE International
Symposium on High Performance Computer Architecture (HPCA),
pp. 419–431, March 2016.
[43] S. Jiang and X. Zhang, “Making lru friendly to weak locality work-
loads: a novel replacement algorithm to improve buffer cache per-
formance,” IEEE Transactions on Computers, vol. 54, pp. 939–952,
Aug 2005.
[44] N. Megiddo and D. S. Modha, “Arc: A self-tuning, low overhead
replacement cache,” in FAST, vol. 3, pp. 115–130, 2003.
[45] AVNET, “Zedboard technical specifications,” 2017.
[46] ARM, “Cortex-r – arm developer,” 2016.
[47] Marvell, “Marvell 88nv11xx ssd controllers,” 2017.
[48] Marvell, “Marvell nvme pcie gen3x4 ssd controllers,” 2018.
[49] Y. Kang, Y. Kee, E. L. Miller, and C. Park, “Enabling cost-effective
data processing with smart ssd,” in 2013 IEEE 29th Symposium on
Mass Storage Systems and Technologies (MSST), pp. 1–12, May
2013.
[50] Samsung, “Smartssd - samsung @firstsamsung.”
[51] J. Do, Y.-S. Kee, J. M. Patel, C. Park, K. Park, and D. J. DeWitt,
“Query processing on smart ssds: Opportunities and challenges,” in
Proceedings of the 2013 ACM SIGMOD International Conference
on Management of Data, SIGMOD ’13, (New York, NY, USA),
pp. 1221–1230, ACM, 2013.
[52] B. Tallis, “The samsung 970 evo plus (250gb, 1tb) nvme ssd review:
96-layer 3d nand,” Jan 2019.
[53] B. Tallis, “The samsung 860 qvo (1tb, 4tb) ssd review: First con-
sumer sata qlc,” Nov 2018.
[54] B. Tallis, “The crucial p1 1tb ssd review: The other consumer qlc
ssd,” Nov 2018.
[55] “A persistent key-value store.”
[56] C. Lee, T. Kumano, T. Matsuki, H. Endo, N. Fukumoto, and
M. Sugawara, “Understanding storage traffic characteristics on en-
terprise virtual desktop infrastructure,” in Proceedings of the 10th
ACM International Systems and Storage Conference, SYSTOR ’17,
(New York, NY, USA), pp. 13:1–13:11, ACM, 2017.
[57] G. Wu and X. He, “Reducing ssd read latency via nand flash program
and erase suspension,” in FAST, vol. 2, p. 3, 2012.
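Abstract in Korean
NAND flash memory is widely used in a variety of systems, from real-time
embedded systems to high-performance enterprise server systems.
Flash memory has (1) erase-before-write (write-once) and (2) endurance
problems. A flash translation layer (FTL) is applied to handle the
erase-before-write characteristic. Currently, the page-level mapping method
is mainly used to reduce the latency increase caused by the write-once and
block-erase characteristics of flash memory.
Garbage collection (GC) is one of the main causes of long-tail latency,
which at the 99th percentile grows to more than 100 times the average
latency. Therefore, real-time systems or quality-critical systems cannot
satisfy given requirements such as Quality of Service (QoS) constraints.
As the capacity of flash memory increases, GC latency also tends to
increase. This is because the block size of the flash memory (the number of
pages contained in one block) increases as its capacity increases.
GC latency is determined by the valid page copy and block erase times.
Therefore, as the block size increases, GC latency also increases.
In particular, with the recent transition from 2D planar flash memory to the
3D vertical flash memory structure, the block size has increased. Even within
3D vertical flash memory, the block size keeps increasing. Therefore,
the long-tail latency problem becomes even more serious in 3D vertical flash
memory.
In this dissertation, we proposed three versions of a new GC scheduling
technique based on reinforcement learning (RL). The goal of the proposed
technique is to reduce the long-tail latency caused by GC by exploiting the
idle time of the storage system. We also conducted a quantitative analysis
of the RL-assisted GC solutions.
We proposed an RL-assisted GC scheduling technique that learns the access
behavior of the storage online and determines the number of GC operations
that can exploit the idle time. In addition, we presented an aggressive
method, which helps to further reduce the long-tail latency by aggressively
performing fine-grained GC operations.
In addition, to further reduce the long-tail latency, RL-assisted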





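Furthermore, we proposed a Q-value prediction network (QP Net) that predicts
the initial value of a state newly added to the Q-table cache. The integrated
solution of the Q-table cache and the QP Net can exploit short-term history
by using a low-cost Q-table cache. It also provides good initial Q-values for
states newly inserted into the Q-table cache by using long-term history
learned with a small neural network called the QP Net. The experimental
results show that the proposed method reduces the long-tail latency by
25%-37% compared with the state-of-the-art method.
Keywords: long-tail latency, reinforcement learning, garbage collection,
solid-state drive, NAND flash memory, storage
Student Number: 2016-30281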
