1,598 research outputs found

    Towards Data Optimization in Storages and Networks

    Get PDF
    Title from PDF of title page, viewed on August 7, 2015Dissertation advisors: Sejun Song and Baek-Young ChoiVitaIncludes bibliographic references (pages 132-140)Thesis (Ph.D.)--School of Computing and Engineering. University of Missouri--Kansas City, 2015We are encountering an explosion of data volume, as a study estimates that data will amount to 40 zeta bytes by the end of 2020. This data explosion poses significant burden not only on data storage space but also access latency, manageability, and processing and network bandwidth. However, large portions of the huge data volume contain massive redundancies that are created by users, applications, systems, and communication models. Deduplication is a technique to reduce data volume by removing redundancies. Reliability will be even improved when data is replicated after deduplication. Many deduplication studies such as storage data deduplication and network redundancy elimination have been proposed to reduce storage consumption and network bandwidth consumption. However, existing solutions are not efficient enough to optimize data delivery path from clients to servers through network. Hence we propose a holistic deduplication framework to optimize data in their path. Our deduplication framework consists of three components including data sources or clients, networks, and servers. The client component removes local redundancies in clients, the network component removes redundant transfers coming from different clients, and the server component removes redundancies coming from different networks. We designed and developed components for the proposed deduplication framework. For the server component, we developed the Hybrid Email Deduplication System that achieves a trade-off of space savings and overhead for email systems. For the client component, we developed the Structure Aware File and Email Deduplication for Cloudbased Storage Systems that is very fast as well as having good space savings by using structure-based granularity. For the network component, we developed a system called Software-defined Deduplication as a Network and Storage service that is in-network deduplication, and that chains storage data deduplication and network redundancy elimination functions by using Software Defined Network to achieve both storage space and network bandwidth savings with low processing time and memory size. We also discuss mobile deduplication for image and video files in mobile devices. Through system implementations and experiments, we show that the proposed framework effectively and efficiently optimizes data volume in a holistic manner encompassing the entire data path of clients, networks and storage servers.Introduction -- Deduplication technology -- Existing deduplication approaches -- HEDS: Hybrid Email Deduplication System -- SAFE: Structure-aware File and Email Deduplication for cloud-based storage systems -- SoftDance: Software-defined Deduplication as a Network and Storage Service -- Moblie de-duplication -- Conclusion

    Name Disambiguation from link data in a collaboration graph using temporal and topological features

    Get PDF
    In a social community, multiple persons may share the same name, phone number or some other identifying attributes. This, along with other phenomena, such as name abbreviation, name misspelling, and human error leads to erroneous aggregation of records of multiple persons under a single reference. Such mistakes affect the performance of document retrieval, web search, database integration, and more importantly, improper attribution of credit (or blame). The task of entity disambiguation partitions the records belonging to multiple persons with the objective that each decomposed partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes, or auxiliary features that are collected from external sources, such as Wikipedia. However, for many scenarios, such auxiliary features are not available, or they are costly to obtain. Besides, the attempt of collecting biographical or external data sustains the risk of privacy violation. In this work, we propose a method for solving entity disambiguation task from link information obtained from a collaboration network. Our method is non-intrusive of privacy as it uses only the time-stamped graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.Comment: The short version of this paper has been accepted to ASONAM 201

    Automatic Threshold Selections by exploration and exploitation of optimization algorithm in Record Deduplication

    Get PDF
    A deduplication process uses similarity function to identify the two entries are duplicate or not by setting the threshold.ร‚ย  This threshold setting is an important issue to achieve more accuracy and it relies more on human intervention. Swarm Intelligence algorithm such as PSO and ABC have been used for automatic detection of threshold to find the duplicate records. Though the algorithms performed well there is still an insufficiency regarding the solution search equation, which is used to generate new candidate solutions based on the information of previous solutions. ร‚ย The proposed work addressed two problems: first to find the optimal equation using Genetic Algorithm(GA) and next it adopts an modified ร‚ย Artificial Bee Colony (ABC) to get the optimal threshold to detect the duplicate records more accurately and also it reduces human intervention. CORA dataset is considered to analyze the proposed algorithm

    Re-evaluating Retrosynthesis Algorithms with Syntheseus

    Full text link
    The planning of how to synthesize molecules, also known as retrosynthesis, has been a growing focus of the machine learning and chemistry communities in recent years. Despite the appearance of steady progress, we argue that imperfect benchmarks and inconsistent comparisons mask systematic shortcomings of existing techniques. To remedy this, we present a benchmarking library called syntheseus which promotes best practice by default, enabling consistent meaningful evaluation of single-step and multi-step retrosynthesis algorithms. We use syntheseus to re-evaluate a number of previous retrosynthesis algorithms, and find that the ranking of state-of-the-art models changes when evaluated carefully. We end with guidance for future works in this area

    Effects of Data Duplication in Pretraining

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šค๋Œ€ํ•™์› ๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šคํ•™๊ณผ, 2023. 2. ์ด์žฌ์ง„.This paper studies the effect of deduplication in training data on language models, such as BERT (the encoder-based model) and GPT-2 (the decoder-based model). Previous studies focus on memorizing duplicates in the training dataset whereas we perform several experiments with data deduplication. The pretraining data is first clustered by MinhashLSH, a stochastic method for finding near-duplicate documents in large corpus data, and then deduplicated by Jaccard similarity with various threshold values. Then, the models are finetuned with different downstream tasks. The experimental result indicates that GPT-2 works better with the deduplication, whereas BERT works differently depending on the tasks. It is due to the difference in self-supervised learning methods between BERT and GPT-2. The duplicated data may work on BERT as data augmentation through random masking in its data preprocessing stage. Data duplication may introduce biases and lead to overfitting, but the effect depends on the amount of duplicated data. To improve performance, data deduplication with proper granularity is essential in language model training.์ด ์—ฐ๊ตฌ๋Š” BERT(์ธ์ฝ”๋” ๊ธฐ๋ฐ˜ ๋ชจ๋ธ) ๋ฐ GPT-2(๋””์ฝ”๋” ๊ธฐ๋ฐ˜ ๋ชจ๋ธ)์™€ ๊ฐ™์€ ์–ธ์–ด ๋ชจ๋ธ์— ๋Œ€ํ•œ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ ์ค‘๋ณต ์ œ๊ฑฐ ํšจ๊ณผ๋ฅผ ์ œ์‹œํ•˜๋Š” ๋ฐ ๋ชฉ์ ์ด ์žˆ๋‹ค. ๊ธฐ์กด ์—ฐ๊ตฌ์—์„œ๋Š” ์ƒ์„ฑ ๋ชจ๋ธ์— ํ•œํ•˜์—ฌ ์ค‘๋ณต ์ œ๊ฑฐ์˜ ์ด์ ์„ ๋ฐํ˜”์œผ๋ฉฐ, ๋ชจ๋ธ์ด ์•”๊ธฐ๋œ ํ…์ŠคํŠธ๋ฅผ ๋œ ์ƒ์„ฑํ•˜๊ณ  ๋ชจ๋ธ์˜ ํ›ˆ๋ จ ๋‹จ๊ณ„๊ฐ€ ๋” ์ ๊ฒŒ ํ•„์š”ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ•˜์˜€๋‹ค. ์ด์— ๋ง๋ถ™์—ฌ ํ˜„ ์—ฐ๊ตฌ์—์„œ๋Š” ๋ฐ์ดํ„ฐ ์ค‘๋ณต ์ œ๊ฑฐ์— ๋Œ€ํ•ด ๋ช‡ ๊ฐ€์ง€ ์ถ”๊ฐ€์ ์ธ ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์‚ฌ์ „ ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” ์šฐ์„  MinhashLSH(๋Œ€๊ทœ๋ชจ ๋ง๋ญ‰์น˜ ๋ฐ์ดํ„ฐ์—์„œ ์œ ์‚ฌํ•œ ๋ฌธ์„œ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•œ ํ™•๋ฅ ๋ก ์  ๋ฐฉ๋ฒ•)๋กœ ํด๋Ÿฌ์Šคํ„ฐ๋ง ํ•œ ๋‹ค์Œ, ๋‹ค์–‘ํ•œ ์ž„๊ณ„๊ฐ’์˜ Jaccard ์œ ์‚ฌ์„ฑ์œผ๋กœ ์ค‘๋ณต document๋ฅผ ์ œ๊ฑฐํ•˜๋Š” ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์„ ๊ฑฐ์นœ๋‹ค. ๊ตฌ์„ฑ๋œ ๋ฐ์ดํ„ฐ์…‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์‚ฌ์ „ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๊ณ , ์ดํ›„ ๋‹ค์–‘ํ•œ downstream ์ž‘์—…์— finetuningํ•œ๋‹ค. GPT-2๋Š” ์ค‘๋ณต ์ œ๊ฑฐ๋œ ๋ชจ๋ธ์—์„œ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๋ฐ˜๋ฉด, BERT๋Š” downstream ์ž‘์—…์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค. ์ด๋Š” BERT์™€ GPT-2์˜ self-supervised learning ๋ฐฉ์‹์˜ ์ฐจ์ด ๋•Œ๋ฌธ์ด๋‹ค. BERT์—์„œ๋Š” ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๋‹จ๊ณ„์—์„œ ๋žœ๋ค ๋งˆ์Šคํ‚น ๋ฐฉ์‹์„ ํ†ตํ•ด ์ค‘๋ณต๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์˜คํžˆ๋ ค ๋ฐ์ดํ„ฐ augmentation์œผ๋กœ ์ž‘์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ ‡์ง€๋งŒ ๊ฒฐ๊ณผ์ ์œผ๋กœ ๋ฐ์ดํ„ฐ ์ค‘๋ณต์€ ํŽธํ–ฅ์„ ๋„์ž…ํ•˜๊ณ  ๊ณผ์ ํ•ฉ์œผ๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๊ทธ ํšจ๊ณผ๋Š” ์ค‘๋ณต ๋ฐ์ดํ„ฐ์˜ ์–‘์— ๋”ฐ๋ผ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„  ์–ธ์–ด ๋ชจ๋ธ ํ›ˆ๋ จ์—์„œ ์ ์ ˆํ•œ ์ž„๊ณ„๊ฐ’์˜ ๋ฐ์ดํ„ฐ ์ค‘๋ณต ์ œ๊ฑฐ๊ฐ€ ํ•„์ˆ˜์ ์ด๋‹ค.Chapter 1. Introduction ๏ผ‘ 1.1. Study Background ๏ผ‘ 1.2. Purpose of Research ๏ผ“ 1.3. Related Work ๏ผ” Chapter 2. Approach ๏ผ– 2.1. Pretraining Models ๏ผ– 2.2. Pretraining Dataset ๏ผ— 2.3. Near Deduplication ๏ผ— 2.4. Injection of Exact Document Duplication ๏ผ‘๏ผ Chapter 3. Experiments ๏ผ‘๏ผ’ 3.1. Near Deduplication Results ๏ผ‘๏ผ’ 3.2. Duplication Injection Results ๏ผ‘๏ผ” Chapter 4. Conclusion ๏ผ‘๏ผ– 4.1. Discussion and Future work ๏ผ‘๏ผ–์„

    ๋‚ธ๋“œ ํ”Œ๋ž˜์‹œ ์ €์žฅ์žฅ์น˜์˜ ์„ฑ๋Šฅ ๋ฐ ์ˆ˜๋ช… ํ–ฅ์ƒ์„ ์œ„ํ•œ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™” ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2019. 2. ๊น€์ง€ํ™.์ปดํ“จํŒ… ์‹œ์Šคํ…œ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด, ๊ธฐ์กด์˜ ๋Š๋ฆฐ ํ•˜๋“œ๋””์Šคํฌ(HDD)๋ฅผ ๋น ๋ฅธ ๋‚ธ๋“œ ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ๋ฐ˜ ์ €์žฅ์žฅ์น˜(SSD)๋กœ ๋Œ€์ฒดํ•˜๊ณ ์ž ํ•˜๋Š” ์—ฐ๊ตฌ๊ฐ€ ์ตœ๊ทผ ํ™œ๋ฐœํžˆ ์ง„ํ–‰ ๋˜๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ง€์†์ ์ธ ๋ฐ˜๋„์ฒด ๊ณต์ • ์Šค์ผ€์ผ๋ง ๋ฐ ๋ฉ€ํ‹ฐ ๋ ˆ๋ฒจ๋ง ๊ธฐ์ˆ ๋กœ SSD ๊ฐ€๊ฒฉ์„ ๋™๊ธ‰ HDD ์ˆ˜์ค€์œผ๋กœ ๋‚ฎ์•„์กŒ์ง€๋งŒ, ์ตœ๊ทผ์˜ ์ฒจ๋‹จ ๋””๋ฐ”์ด์Šค ๊ธฐ์ˆ ์˜ ๋ถ€์ž‘์šฉ์œผ ๋กœ NAND ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์˜ ์ˆ˜๋ช…์ด ์งง์•„์ง€๋Š” ๊ฒƒ์€ ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ์—์„œ์˜ SSD์˜ ๊ด‘๋ฒ”์œ„ํ•œ ์ฑ„ํƒ์„ ๋ง‰๋Š” ์ฃผ์š” ์žฅ๋ฒฝ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ตœ๊ทผ์˜ ๊ณ ๋ฐ€๋„ ๋‚ธ๋“œ ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์˜ ์ˆ˜๋ช… ๋ฐ ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ์‹œ์Šคํ…œ ๋ ˆ๋ฒจ์˜ ๊ฐœ์„  ๊ธฐ์ˆ ์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ ๋œ ๊ธฐ๋ฒ•์€ ์‘์šฉ ํ”„๋กœ ๊ทธ๋žจ์˜ ์“ฐ๊ธฐ ๋ฌธ๋งฅ์„ ํ™œ์šฉํ•˜์—ฌ ๊ธฐ์กด์—๋Š” ์–ป์„ ์ˆ˜ ์—†์—ˆ๋˜ ๋ฐ์ดํ„ฐ ์ˆ˜๋ช… ํŒจํ„ด ๋ฐ ์ค‘๋ณต ๋ฐ์ดํ„ฐ ํŒจํ„ด์„ ๋ถ„์„ํ•˜์˜€๋‹ค. ์ด์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ, ๋‹จ์ผ ๊ณ„์ธต์˜ ๋‹จ์ˆœํ•œ ์ •๋ณด๋งŒ์„ ํ™œ์šฉํ–ˆ ๋˜ ๊ธฐ์กด ๊ธฐ๋ฒ•์˜ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•จ์œผ๋กœ์จ ํšจ๊ณผ์ ์œผ๋กœ NAND ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ์˜ ์„ฑ๋Šฅ ๋ฐ ์ˆ˜๋ช…์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ์ตœ์ ํ™” ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์‹œํ•œ๋‹ค. ๋จผ์ €, ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์˜ I/O ์ž‘์—…์—๋Š” ๋ฌธ๋งฅ์— ๋”ฐ๋ผ ๊ณ ์œ ํ•œ ๋ฐ์ดํ„ฐ ์ˆ˜๋ช…๊ณผ ์ค‘ ๋ณต ๋ฐ์ดํ„ฐ์˜ ํŒจํ„ด์ด ์กด์žฌํ•œ๋‹ค๋Š” ์ ์„ ๋ถ„์„์„ ํ†ตํ•ด ํ™•์ธํ•˜์˜€๋‹ค. ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ํšจ๊ณผ ์ ์œผ๋กœ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ (์“ฐ๊ธฐ ๋ฌธ๋งฅ) ์ถ”์ถœ ๋ฐฉ๋ฒ•์„ ๊ตฌํ˜„ ํ•˜์˜€๋‹ค. ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ ์ •๋ณด๋ฅผ ํ†ตํ•ด ๊ฐ€๋น„์ง€ ์ปฌ๋ ‰์…˜ ๋ถ€ํ•˜์™€ ์ œํ•œ๋œ ์ˆ˜๋ช…์˜ NAND ํ”Œ ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ์„ ์„ ์œ„ํ•œ ๊ธฐ์กด ๊ธฐ์ˆ ์˜ ํ•œ๊ณ„๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๊ทน๋ณตํ•  ์ˆ˜ ์žˆ๋‹ค. ๋‘˜์งธ, ๋ฉ€ํ‹ฐ ์ŠคํŠธ๋ฆผ SSD์—์„œ WAF๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ์ˆ˜๋ช… ์˜ˆ์ธก์˜ ์ •ํ™• ์„ฑ์„ ๋†’์ด๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ I/O ์ปจํ…์ŠคํŠธ๋ฅผ ํ™œ์šฉ ํ•˜๋Š” ์‹œ์Šคํ…œ ์ˆ˜์ค€์˜ ์ ‘๊ทผ ๋ฐฉ์‹์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ œ์•ˆ๋œ ๊ธฐ๋ฒ•์˜ ํ•ต์‹ฌ ๋™๊ธฐ๋Š” ๋ฐ์ดํ„ฐ ์ˆ˜๋ช…์ด LBA๋ณด๋‹ค ๋†’์€ ์ถ”์ƒํ™” ์ˆ˜์ค€์—์„œ ํ‰๊ฐ€ ๋˜์–ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ ํ”„ ๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฐ์ดํ„ฐ์˜ ์ˆ˜๋ช…์„ ๋ณด๋‹ค ์ •ํ™•ํžˆ ์˜ˆ์ธกํ•จ์œผ๋กœ์จ, ๊ธฐ์กด ๊ธฐ๋ฒ•์—์„œ LBA๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฐ์ดํ„ฐ ์ˆ˜๋ช…์„ ๊ด€๋ฆฌํ•˜๋Š” ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•œ๋‹ค. ๊ฒฐ๋ก ์ ์œผ ๋กœ ๋”ฐ๋ผ์„œ ๊ฐ€๋น„์ง€ ์ปฌ๋ ‰์…˜์˜ ํšจ์œจ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ์ˆ˜๋ช…์ด ์งง์€ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜๋ช…์ด ๊ธด ๋ฐ์ดํ„ฐ์™€ ํšจ๊ณผ์ ์œผ๋กœ ๋ถ„๋ฆฌ ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์“ฐ๊ธฐ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ์˜ ์ค‘๋ณต ๋ฐ์ดํ„ฐ ํŒจํ„ด ๋ถ„์„์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ถˆํ•„์š”ํ•œ ์ค‘๋ณต ์ œ๊ฑฐ ์ž‘์—…์„ ํ”ผํ•  ์ˆ˜์žˆ๋Š” ์„ ํƒ์  ์ค‘๋ณต ์ œ๊ฑฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ค‘๋ณต ๋ฐ ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜์ง€ ์•Š๋Š” ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ๊ฐ€ ์กด์žฌํ•จ์„ ๋ถ„์„์ ์œผ๋กœ ๋ณด์ด๊ณ  ์ด๋“ค์„ ์ œ์™ธํ•จ์œผ๋กœ์จ, ์ค‘๋ณต์ œ๊ฑฐ ๋™์ž‘์˜ ํšจ์œจ์„ฑ์„ ๋†’์ผ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์ค‘๋ณต ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฐœ์ƒ ํ•˜๋Š” ํŒจํ„ด์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ๊ธฐ๋ก๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ด€๋ฆฌํ•˜๋Š” ์ž๋ฃŒ๊ตฌ์กฐ ์œ ์ง€ ์ •์ฑ…์„ ์ƒˆ๋กญ๊ฒŒ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ, ์„œ๋ธŒ ํŽ˜์ด์ง€ ์ฒญํฌ๋ฅผ ๋„์ž…ํ•˜์—ฌ ์ค‘๋ณต ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ฑฐ ํ•  ๊ฐ€๋Šฅ์„ฑ์„ ๋†’์ด๋Š” ์„ธ๋ถ„ํ™” ๋œ ์ค‘๋ณต ์ œ๊ฑฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ ๋œ ๊ธฐ์ˆ ์˜ ํšจ๊ณผ๋ฅผ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ์‹ค์ œ ์‹œ์Šคํ…œ์—์„œ ์ˆ˜์ง‘ ๋œ I/O ํŠธ๋ ˆ์ด์Šค์— ๊ธฐ๋ฐ˜ํ•œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ‰๊ฐ€ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์—๋ฎฌ๋ ˆ์ดํ„ฐ ๊ตฌํ˜„์„ ํ†ตํ•ด ์‹ค์ œ ์‘์šฉ์„ ๋™์ž‘ํ•˜๋ฉด์„œ ์ผ๋ จ์˜ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ–ˆ๋‹ค. ๋” ๋‚˜์•„๊ฐ€ ๋ฉ€ํ‹ฐ ์ŠคํŠธ๋ฆผ ๋””๋ฐ”์ด์Šค์˜ ๋‚ด๋ถ€ ํŽŒ์›จ์–ด๋ฅผ ์ˆ˜์ •ํ•˜์—ฌ ์‹ค์ œ์™€ ๊ฐ€์žฅ ๋น„์Šทํ•˜๊ฒŒ ์„ค์ •๋œ ํ™˜๊ฒฝ์—์„œ ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ•˜ ์˜€๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด ์ œ์•ˆ๋œ ์‹œ์Šคํ…œ ์ˆ˜์ค€ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์ด ์„ฑ๋Šฅ ๋ฐ ์ˆ˜๋ช… ๊ฐœ์„  ์ธก๋ฉด์—์„œ ๊ธฐ์กด ์ตœ์ ํ™” ๊ธฐ๋ฒ•๋ณด๋‹ค ๋” ํšจ๊ณผ์ ์ด์—ˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ํ–ฅํ›„ ์ œ์•ˆ๋œ ๊ธฐ ๋ฒ•๋“ค์ด ๋ณด๋‹ค ๋” ๋ฐœ์ „๋œ๋‹ค๋ฉด, ๋‚ธ๋“œ ํ”Œ๋ž˜์‹œ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ดˆ๊ณ ์† ์ปดํ“จํŒ… ์‹œ์Šคํ…œ์˜ ์ฃผ ์ €์žฅ์žฅ์น˜๋กœ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์— ๊ธ์ •์ ์ธ ๊ธฐ์—ฌ๋ฅผ ํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์œผ๋กœ ๊ธฐ๋Œ€๋œ๋‹ค.Replacing HDDs with NAND flash-based storage devices (SSDs) has been one of the major challenges in modern computing systems especially in regards to better performance and higher mobility. Although the continuous semiconductor process scaling and multi-leveling techniques lower the price of SSDs to the comparable level of HDDs, the decreasing lifetime of NAND flash memory, as a side effect of recent advanced device technologies, is emerging as one of the major barriers to the wide adoption of SSDs in highperformance computing systems. In this dissertation, system-level lifetime improvement techniques for recent high-density NAND flash memory are proposed. Unlike existing techniques, the proposed techniques resolve the problems of decreasing performance and lifetime of NAND flash memory by exploiting the I/O context of an application to analyze data lifetime patterns or duplicate data contents patterns. We first present that I/O activities of an application have distinct data lifetime and duplicate data patterns. In order to effectively utilize the context information, we implemented the program context extraction method. With the program context, we can overcome the limitations of existing techniques for improving the garbage collection overhead and limited lifetime of NAND flash memory. Second, we propose a system-level approach to reduce WAF that exploits the I/O context of an application to increase the data lifetime prediction for the multi-streamed SSDs. The key motivation behind the proposed technique was that data lifetimes should be estimated at a higher abstraction level than LBAs, so we employ a write program context as a stream management unit. Thus, it can effectively separate data with short lifetimes from data with long lifetimes to improve the efficiency of garbage collection. Lastly, we propose a selective deduplication that can avoid unnecessary deduplication work based on the duplicate data pattern analysis of write program context. With the help of selective deduplication, we also propose fine-grained deduplication which improves the likelihood of eliminating redundant data by introducing sub-page chunk. It also resolves technical difficulties caused by its finer granularity, i.e., increased memory requirement and read response time. In order to evaluate the effectiveness of the proposed techniques, we performed a series of evaluations using both a trace-driven simulator and emulator with I/O traces which were collected from various real-world systems. To understand the feasibility of the proposed techniques, we also implemented them in Linux kernel on top of our in-house flash storage prototype and then evaluated their effects on the lifetime while running real-world applications. Our experimental results show that system-level optimization techniques are more effective over existing optimization techniques.I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Garbage Collection Problem . . . . . . . . . . . . . 2 1.1.2 Limited Endurance Problem . . . . . . . . . . . . . 4 1.2 Dissertation Goals . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . 7 II. Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1 NAND Flash Memory System Software . . . . . . . . . . . 9 2.2 NAND Flash-Based Storage Devices . . . . . . . . . . . . . 10 2.3 Multi-stream Interface . . . . . . . . . . . . . . . . . . . . 11 2.4 Inline Data Deduplication Technique . . . . . . . . . . . . . 12 2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.5.1 Data Separation Techniques for Multi-streamed SSDs 13 2.5.2 Write Traffic Reduction Techniques . . . . . . . . . 15 2.5.3 Program Context based Optimization Techniques for Operating Systems . . . . . . . . 18 III. Program Context-based Analysis . . . . . . . . . . . . . . . . 21 3.1 Definition and Extraction of Program Context . . . . . . . . 21 3.2 Data Lifetime Patterns of I/O Activities . . . . . . . . . . . 24 3.3 Duplicate Data Patterns of I/O Activities . . . . . . . . . . . 26 IV. Fully Automatic Stream Management For Multi-Streamed SSDs Using Program Contexts . . 29 4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2.1 No Automatic Stream Management for General I/O Workloads . . . . . . . . . 33 4.2.2 Limited Number of Supported Streams . . . . . . . 36 4.3 Automatic I/O Activity Management . . . . . . . . . . . . . 38 4.3.1 PC as a Unit of Lifetime Classification for General I/O Workloads . . . . . . . . . . . 39 4.4 Support for Large Number of Streams . . . . . . . . . . . . 41 4.4.1 PCs with Large Lifetime Variances . . . . . . . . . 42 4.4.2 Implementation of Internal Streams . . . . . . . . . 44 4.5 Design and Implementation of PCStream . . . . . . . . . . 46 4.5.1 PC Lifetime Management . . . . . . . . . . . . . . 46 4.5.2 Mapping PCs to SSD streams . . . . . . . . . . . . 49 4.5.3 Internal Stream Management . . . . . . . . . . . . . 50 4.5.4 PC Extraction for Indirect Writes . . . . . . . . . . 51 4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . 53 4.6.1 Experimental Settings . . . . . . . . . . . . . . . . 53 4.6.2 Performance Evaluation . . . . . . . . . . . . . . . 55 4.6.3 WAF Comparison . . . . . . . . . . . . . . . . . . . 56 4.6.4 Per-stream Lifetime Distribution Analysis . . . . . . 57 4.6.5 Impact of Internal Streams . . . . . . . . . . . . . . 58 4.6.6 Impact of the PC Attribute Table . . . . . . . . . . . 60 V. Deduplication Technique using Program Contexts . . . . . . 62 5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.2 Selective Deduplication using Program Contexts . . . . . . . 63 5.2.1 PCDedup: Improving SSD Deduplication Efficiency using Selective Hash Cache Management . . . . . . 63 5.2.2 2-level LRU Eviction Policy . . . . . . . . . . . . . 68 5.3 Exploiting Small Chunk Size . . . . . . . . . . . . . . . . . 70 5.3.1 Fine-Grained Deduplication . . . . . . . . . . . . . 70 5.3.2 Read Overhead Management . . . . . . . . . . . . . 76 5.3.3 Memory Overhead Management . . . . . . . . . . . 80 5.3.4 Experimental Results . . . . . . . . . . . . . . . . . 82 VI. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.1 Summary and Conclusions . . . . . . . . . . . . . . . . . . 88 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.2.1 Supporting applications that have unusal program contexts . . . . . . . . . . . . . 89 6.2.2 Optimizing read request based on the I/O context . . 90 6.2.3 Exploiting context information to improve fingerprint lookups . . . . .. . . . . . 91 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92Docto
    • โ€ฆ
    corecore