Flash Memory Devices
Flash memory devices have represented a breakthrough in storage since their inception in the mid-1980s, and innovation is still ongoing. The hallmark of this technology is its inherent flexibility in performance and integration density, depending on the architecture devised for integration. NOR Flash technology is still the workhorse of many code storage applications in the embedded world, ranging from microcontrollers for the automotive environment to IoT smart devices. Its usage is also forecast to be fundamental in emerging AI edge scenarios. When massive data storage is required, on the other hand, NAND Flash memories are the necessary choice. NAND Flash is found in USB sticks and memory cards, but most of all in Solid-State Drives (SSDs). Since SSDs are extremely demanding in terms of storage capacity, they have fueled a new wave of innovation, namely the 3D architecture. Today, "3D" means that multiple layers of memory cells are manufactured within the same piece of silicon, easily reaching a terabit capacity. So far, Flash architectures have always been based on "floating gate" cells, where the information is stored by injecting electrons into a piece of polysilicon surrounded by oxide. Emerging concepts, by contrast, are based on "charge trap" cells. In summary, flash memory devices represent the largest landscape of storage devices, and we expect more advancements in the coming years. This will require a great deal of innovation in process technology, materials, circuit design, flash management algorithms, Error Correction Codes and, finally, system co-design for new applications such as AI and security enforcement.
HMC-Based Accelerator Design For Compressed Deep Neural Networks
Deep Neural Networks (DNNs) offer remarkable performance on classification and regression tasks in many high-dimensional problems and have been widely utilized in real-world cognitive applications. In DNN applications, the high computational cost of DNNs greatly hinders their deployment in resource-constrained applications, real-time systems, and edge computing platforms. Moreover, the energy consumption and performance cost of moving data between the memory hierarchy and the computational units are higher than those of the computation itself. To overcome this memory bottleneck, accelerator designs improve data locality and temporal data reuse. In an attempt to further improve data locality, memory manufacturers have invented 3D-stacked memory, where multiple layers of memory arrays are stacked on top of each other. Following the concept of Processing-In-Memory (PIM), some 3D-stacked memory architectures also include a logic layer that can integrate general-purpose computational logic directly within main memory to take advantage of the high internal bandwidth during computation.
In this dissertation, we investigate hardware/software co-design for neural network accelerators. Specifically, we introduce a two-phase filter pruning framework for model compression and an accelerator tailored for efficient DNN execution on the Hybrid Memory Cube (HMC), which can dynamically offload primitives and functions to the PIM logic layer through a latency-aware scheduling controller.
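As a rough illustration of the offload decision such a controller makes, the following sketch compares an estimated host-side latency (data movement over the external link plus compute) against a PIM-side latency (high internal bandwidth, modest logic throughput). The cost model, bandwidth and throughput figures, and the `schedule` interface are assumptions for illustration, not the dissertation's actual controller.

```python
# Sketch of a latency-aware offload decision (all figures assumed):
# run a primitive on the HMC's PIM logic layer when the estimated
# data-movement cost to the host outweighs the PIM's slower compute.

def estimate_host_latency(bytes_moved, flops, host_bw=20e9, host_perf=1e12):
    return bytes_moved / host_bw + flops / host_perf   # move data out, then compute fast

def estimate_pim_latency(bytes_moved, flops, pim_bw=320e9, pim_perf=1e11):
    return bytes_moved / pim_bw + flops / pim_perf     # high internal bandwidth, modest logic

def schedule(primitive):
    """primitive: dict with 'bytes' and 'flops' (hypothetical interface)."""
    b, f = primitive["bytes"], primitive["flops"]
    return "PIM" if estimate_pim_latency(b, f) < estimate_host_latency(b, f) else "HOST"

# Example: a memory-bound primitive (lots of bytes, few flops) goes to PIM.
print(schedule({"bytes": 64e6, "flops": 1e6}))
```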
In our compression framework, we formulate the filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as the conditional entropy over filters in a convolutional layer, i.e., the distribution of entropy conditioned on network loss. Based on this definition, the pruning efficiencies of global and layer-wise pruning strategies are compared, and a two-phase pruning method is proposed. The proposed pruning method achieves an 88% reduction in filters and a 46% reduction in inference time on VGG16, within 2% accuracy degradation.
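The following is a minimal sketch of the two ingredients described above: a per-filter score based on an entropy estimate conditioned on the binned network loss, and a two-phase (global, then layer-wise) pruning pass. The discretization scheme, the keep ratios, and the reading of the score as filter importance are assumptions for illustration; the dissertation defines its own criterion and schedule.

```python
import numpy as np

def conditional_entropy(filter_acts, losses, n_bins=10):
    """Estimate H(activation | loss) for one filter: discretize the
    filter's mean activations and the per-sample losses, then average
    the activation entropy within each loss bin."""
    a = np.digitize(filter_acts, np.histogram_bin_edges(filter_acts, n_bins))
    l = np.digitize(losses, np.histogram_bin_edges(losses, n_bins))
    h = 0.0
    for lv in np.unique(l):
        mask = l == lv
        _, counts = np.unique(a[mask], return_counts=True)
        p = counts / counts.sum()
        h += mask.mean() * -(p * np.log2(p)).sum()   # P(loss bin) * H(act | bin)
    return h

def two_phase_prune(scores, global_keep=0.5, layer_keep=0.5):
    """Phase 1: one global threshold across all layers; phase 2: a
    per-layer threshold among the phase-1 survivors.
    scores: {layer_name: array of per-filter importance scores}."""
    g_thr = np.quantile(np.concatenate(list(scores.values())), 1 - global_keep)
    kept = {}
    for name, s in scores.items():
        survivors = s[s >= g_thr]                        # phase 1 (global)
        if survivors.size:
            l_thr = np.quantile(survivors, 1 - layer_keep)
            survivors = survivors[survivors >= l_thr]    # phase 2 (layer-wise)
        kept[name] = survivors
    return kept
```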
Digital Circuit Design Using Floating Gate Transistors
Floating gate (flash) transistors are used exclusively for memory applications today. These applications include SD cards of various form factors, USB flash drives, and SSDs. In this thesis, we explore the use of flash transistors to implement digital logic circuits. Since the threshold voltage of flash transistors can be modified at a fine granularity during programming, several advantages are obtained by our flash-based digital circuit design approach. For one, speed binning at the factory can be controlled with precision. Secondly, an IC can be re-programmed in the field to negate effects such as aging, which has been a significant problem in recent times, particularly for mission-critical applications. Thirdly, unlike a regular MOSFET, which has one threshold voltage level, a flash transistor can have multiple threshold voltage levels. The benefit of multiple threshold voltage levels is that more symbols can be encoded in each device than in a regular MOSFET, which allows us to implement multi-valued logic functions natively. In this thesis, we evaluate different flash-based digital circuit design approaches and compare their performance with a traditional CMOS standard cell-based design approach. We begin by evaluating our design approach at the cell level to optimize the design's delay, power, energy, and physical area characteristics. The flash-based approach is demonstrated to be better than the CMOS standard cell approach for these performance metrics. Afterwards, we present the performance of our design approach at the block level. We describe a synthesis flow to decompose a circuit block into a network of interconnected flash-based circuit cells. We also describe techniques to optimize the resulting network of flash-based circuit cells using don't cares. Our optimization approach distinguishes itself from other optimization techniques that use don't cares, since it a) targets a flash-based design flow, b) optimizes clusters of logic nodes at once instead of one node at a time, c) attempts to reduce the number of cubes instead of reducing the number of literals in each cube, and d) performs optimization on the post-technology-mapped netlist, which results in a direct improvement in result quality, as compared to the pre-technology-mapping logic optimization that is typically done in the literature. The resulting network characteristics (delay, power, energy, and physical area) are presented. These results are compared with a standard cell-based realization of the same block (obtained using commercial tools), and we demonstrate significant improvements in all the design metrics. We also study flash-based FPGA designs (both static and dynamic), and present the tradeoffs in delay, power dissipation, and energy consumption of the various designs. Our work differs from previously proposed flash-based FPGAs, since we embed the flash transistors (which store the configuration bits) directly within the logic and interconnect fabrics. We also present a detailed description of how the programming of the configuration bits is accomplished, for all the proposed designs.
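To make the multi-valued encoding concrete, here is a behavioral sketch (an illustration only, not the thesis's circuit-level design): a flash transistor programmed to one of several threshold voltages conducts only when the gate voltage exceeds its stored Vt, so a single multi-level device can discriminate several input levels and hence encode more than one bit. The voltage values are assumed.

```python
# Assumed programmable threshold-voltage (Vt) levels, in volts.
VT_LEVELS = (0.5, 1.5, 2.5)

def read_symbol(v_gate, vt_levels=VT_LEVELS):
    """Count how many Vt levels v_gate exceeds: with three levels, one
    device distinguishes four symbols (0-3), i.e. two bits per device."""
    return sum(v_gate > vt for vt in vt_levels)

if __name__ == "__main__":
    # Sweeping the gate voltage decodes one of four symbols, which is
    # what enables native multi-valued logic with a single device.
    for vg in (0.0, 1.0, 2.0, 3.0):
        print(f"gate={vg:.1f} V -> symbol {read_symbol(vg)}")
```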
A Scalable Flash-Based Hardware Architecture for the Hierarchical Temporal Memory Spatial Pooler
Hierarchical temporal memory (HTM) is a biomimetic machine learning algorithm focused on modeling the structural and algorithmic properties of the neocortex. It comprises two components, realizing pattern recognition of spatial and temporal data, respectively. HTM research has gained momentum in recent years, leading to both hardware and software exploration of its algorithmic formulation. Previous work on HTM has centered on addressing performance concerns; however, the memory-bound operation of HTM presents significant challenges to scalability.
In this work, a scalable flash-based storage processor unit, Flash-HTM (FHTM), is presented along with a detailed analysis of its potential scalability. FHTM leverages SSD flash technology to implement the spatial pooler of the HTM cortical learning algorithm. The ability of FHTM to scale with increasing model complexity is addressed with respect to design footprint, memory organization, and power efficiency. Additionally, a mathematical model of the hardware is evaluated against the MNIST dataset, yielding 91.98% classification accuracy. A fully custom layout is developed to validate the design in a TSMC 180 nm process. The area and power footprints of the spatial pooler are 30.538 mm² and 5.171 mW, respectively. Storage processor units have the potential to be viable platforms to support implementations of HTM at scale.
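For readers unfamiliar with the component FHTM implements, the following is a minimal algorithmic sketch of an HTM spatial pooler (illustrative only; FHTM realizes this in hardware, and all parameters here are assumed rather than taken from the thesis): input bits connect to columns through synapses with scalar permanences, columns compete under k-winners-take-all inhibition, and the winning columns' permanences receive a Hebbian update.

```python
import numpy as np

rng = np.random.default_rng(0)
N_INPUT, N_COLS, SPARSITY, PERM_THR = 784, 1024, 0.02, 0.5
# Synapse permanences between every column and every input bit.
perms = rng.uniform(0.4, 0.6, size=(N_COLS, N_INPUT))

def spatial_pool(x, learn=True, inc=0.03, dec=0.015):
    """Map a binary input vector x (length N_INPUT) to the indices of
    the sparse set of active columns."""
    connected = perms >= PERM_THR                # only strong synapses count
    overlap = connected.astype(int) @ x          # per-column match count
    k = max(1, int(SPARSITY * N_COLS))
    active = np.argsort(overlap)[-k:]            # k-winners-take-all inhibition
    if learn:                                    # Hebbian permanence update
        perms[active] += np.where(x > 0, inc, -dec)
        np.clip(perms, 0.0, 1.0, out=perms)
    return active

# Example: pool a random binary input (e.g., a binarized MNIST image).
print(spatial_pool(rng.integers(0, 2, N_INPUT)))
```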
Program Context-based Optimization Techniques for Improving the Performance and Lifetime of NAND Flash Storage Devices
Thesis (Ph.D.) -- Seoul National University Graduate School: Dept. of Computer Science and Engineering, College of Engineering, February 2019. Advisor: Jihong Kim.
Replacing HDDs with NAND flash-based storage devices (SSDs) has been
one of the major challenges in modern computing systems, especially with regard to better performance and higher mobility. Although continuous semiconductor process scaling and multi-leveling techniques have lowered the price of SSDs to a level comparable to HDDs, the decreasing lifetime of NAND flash memory, as a side effect of recent advanced device technologies, is emerging as one of the major barriers to the wide adoption of SSDs in high-performance computing systems.
In this dissertation, system-level lifetime improvement techniques for recent high-density NAND flash memory are proposed. Unlike existing techniques, the proposed techniques resolve the problems of decreasing performance and lifetime of NAND flash memory by exploiting the I/O context of an application to analyze data lifetime patterns and duplicate data content patterns.
We first show that the I/O activities of an application have distinct data lifetime and duplicate data patterns. In order to effectively utilize this context information, we implemented a program context extraction method. With the program context, we can overcome the limitations of existing techniques for reducing the garbage collection overhead and coping with the limited lifetime of NAND flash memory.
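As a rough user-level analogue of what program context (PC) extraction does, the sketch below summarizes the chain of call sites leading to a write as a single signature, so that writes issued from the same program path share one tag. The kernel-level method in the dissertation works from return addresses on the call stack; the helper names here are hypothetical.

```python
import traceback

def current_program_context():
    """Summarize the current call path as one signature: writes issued
    from the same program path get the same program context (PC) tag."""
    frames = traceback.extract_stack()[:-1]   # drop this helper's own frame
    return hash(tuple((f.filename, f.lineno) for f in frames))

def tagged_write(write_fn, data):
    """Attach the PC tag to a write so lower layers (e.g., a stream
    mapper) can use it -- a hypothetical interface."""
    return write_fn(data), current_program_context()
```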
Second, we propose a system-level approach to reducing the write amplification factor (WAF) that exploits the I/O context of an application to improve the accuracy of data lifetime prediction for multi-streamed SSDs. The key motivation behind the proposed technique is that data lifetimes should be estimated at a higher abstraction level than LBAs, so we employ the write program context as the stream management unit. This way, data with short lifetimes can be effectively separated from data with long lifetimes, improving the efficiency of garbage collection.
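The sketch below illustrates the idea under stated assumptions: track a smoothed average data lifetime per program context, then bucket PCs with similar lifetimes into the same SSD stream. The EWMA update, the bucketing rule, and the class interface are illustrative choices, not the design of PCStream itself.

```python
from collections import defaultdict

class PCStreamMapper:
    """Assign SSD streams by predicted per-PC data lifetime (sketch)."""
    def __init__(self, n_streams=8):
        self.n_streams = n_streams
        self.avg_lifetime = defaultdict(float)
        self.alpha = 0.2                     # EWMA smoothing factor

    def observe(self, pc, lifetime):
        # Lifetime = time between a block's write and its overwrite/TRIM.
        self.avg_lifetime[pc] += self.alpha * (lifetime - self.avg_lifetime[pc])

    def stream_for(self, pc):
        """PCs with similar predicted lifetimes share a stream, so data
        dying together is written together."""
        if pc not in self.avg_lifetime:
            return 0                         # unseen PC: default stream
        ranked = sorted(self.avg_lifetime, key=self.avg_lifetime.get)
        idx = ranked.index(pc)
        return min(self.n_streams - 1,
                   idx * self.n_streams // max(1, len(ranked)))
```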
Lastly, we propose a selective deduplication scheme that avoids unnecessary deduplication work, based on the duplicate data pattern analysis of write program contexts. With the help of selective deduplication, we also propose fine-grained deduplication, which improves the likelihood of eliminating redundant data by introducing sub-page chunks. It also resolves the technical difficulties caused by the finer granularity, i.e., the increased memory requirement and read response time.
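A minimal sketch of the combined idea, under assumptions (the chunk size, data structures, and skip rule are illustrative, not the dissertation's implementation): program contexts observed to produce no duplicates bypass fingerprinting entirely, and all other writes are deduplicated at sub-page chunk granularity.

```python
import hashlib

CHUNK = 1024          # assumed sub-page chunk size (page = 4 KiB)

class SelectiveDedup:
    def __init__(self):
        self.fingerprints = {}     # chunk hash -> stored chunk location
        self.dup_free_pcs = set()  # PCs whose writes never deduplicate
        self.next_loc = 0

    def write(self, pc, page):
        """Write one page tagged with program context pc; return the
        chunk locations the page now maps to."""
        if pc in self.dup_free_pcs:            # skip dedup work entirely
            return [self._store(page[o:o + CHUNK])
                    for o in range(0, len(page), CHUNK)]
        locs = []
        for o in range(0, len(page), CHUNK):   # sub-page granularity
            h = hashlib.sha1(page[o:o + CHUNK]).digest()
            if h not in self.fingerprints:     # unique chunk: store it
                self.fingerprints[h] = self._store(page[o:o + CHUNK])
            locs.append(self.fingerprints[h])  # duplicate: reuse location
        return locs

    def _store(self, chunk):
        self.next_loc += 1                     # stand-in for a flash write
        return self.next_loc
```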
In order to evaluate the effectiveness of the proposed techniques, we performed a series of evaluations using both a trace-driven simulator and an emulator, with I/O traces collected from various real-world systems. To understand the feasibility of the proposed techniques, we also implemented them in the Linux kernel on top of our in-house flash storage prototype and then evaluated their effects on lifetime while running real-world applications. Our experimental results show that the proposed system-level optimization techniques are more effective than existing optimization techniques.

I. Introduction
1.1 Motivation
1.1.1 Garbage Collection Problem
1.1.2 Limited Endurance Problem
1.2 Dissertation Goals
1.3 Contributions
1.4 Dissertation Structure
II. Background
2.1 NAND Flash Memory System Software
2.2 NAND Flash-Based Storage Devices
2.3 Multi-stream Interface
2.4 Inline Data Deduplication Technique
2.5 Related Work
2.5.1 Data Separation Techniques for Multi-streamed SSDs
2.5.2 Write Traffic Reduction Techniques
2.5.3 Program Context-based Optimization Techniques for Operating Systems
III. Program Context-based Analysis
3.1 Definition and Extraction of Program Context
3.2 Data Lifetime Patterns of I/O Activities
3.3 Duplicate Data Patterns of I/O Activities
IV. Fully Automatic Stream Management For Multi-Streamed SSDs Using Program Contexts
4.1 Overview
4.2 Motivation
4.2.1 No Automatic Stream Management for General I/O Workloads
4.2.2 Limited Number of Supported Streams
4.3 Automatic I/O Activity Management
4.3.1 PC as a Unit of Lifetime Classification for General I/O Workloads
4.4 Support for Large Number of Streams
4.4.1 PCs with Large Lifetime Variances
4.4.2 Implementation of Internal Streams
4.5 Design and Implementation of PCStream
4.5.1 PC Lifetime Management
4.5.2 Mapping PCs to SSD Streams
4.5.3 Internal Stream Management
4.5.4 PC Extraction for Indirect Writes
4.6 Experimental Results
4.6.1 Experimental Settings
4.6.2 Performance Evaluation
4.6.3 WAF Comparison
4.6.4 Per-stream Lifetime Distribution Analysis
4.6.5 Impact of Internal Streams
4.6.6 Impact of the PC Attribute Table
V. Deduplication Technique using Program Contexts
5.1 Overview
5.2 Selective Deduplication using Program Contexts
5.2.1 PCDedup: Improving SSD Deduplication Efficiency using Selective Hash Cache Management
5.2.2 2-level LRU Eviction Policy
5.3 Exploiting Small Chunk Size
5.3.1 Fine-Grained Deduplication
5.3.2 Read Overhead Management
5.3.3 Memory Overhead Management
5.3.4 Experimental Results
VI. Conclusions
6.1 Summary and Conclusions
6.2 Future Work
6.2.1 Supporting Applications That Have Unusual Program Contexts
6.2.2 Optimizing Read Requests Based on the I/O Context
6.2.3 Exploiting Context Information to Improve Fingerprint Lookups
Bibliography
- …