10 research outputs found

    Malleable Coding with Fixed Reuse

    Full text link
    In cloud computing, storage area networks, remote backup storage, and similar settings, stored data is modified with updates from new versions. Representing information and modifying the representation are both expensive. Therefore it is desirable for the data to not only be compressed but also be easily modified during updates. A malleable coding scheme considers both compression efficiency and ease of alteration, promoting codeword reuse. We examine the trade-off between compression efficiency and malleability cost, the difficulty of synchronizing compressed versions, measured as the length of a reused prefix portion. Through a coding theorem, the region of achievable rates and malleability is expressed as a single-letter optimization. Relationships to common information problems are also described.

    A Running Time Improvement for Two Thresholds Two Divisors Algorithm

    Get PDF
    Chunking algorithms play an important role in data de-duplication systems. The Basic Sliding Window (BSW) algorithm is the first prototype of a content-based chunking algorithm that can handle most types of data. The Two Thresholds Two Divisors (TTTD) algorithm was proposed to improve on the BSW algorithm by controlling the variation of chunk sizes. In this project, we investigate and compare the BSW and TTTD algorithms with respect to several factors through a series of systematic experiments. To date, no other work has conducted such an experimental evaluation of these two algorithms; this is the first contribution of this paper. Based on our analyses and experimental results, we provide a running-time improvement for the TTTD algorithm. Compared with the original TTTD algorithm, our new solution reduces the total running time by about 7%, reduces the number of large chunks by about 50%, and brings the average chunk size closer to the expected chunk size. These results are the second contribution of this project.
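
    As an illustration of the content-defined chunking idea behind BSW and TTTD, the sketch below cuts a chunk where a rolling hash over a sliding window matches a main divisor, keeps a fallback breakpoint from a backup divisor, and enforces minimum and maximum size thresholds. The threshold and divisor values, the window width, and the toy byte-sum hash are assumptions for brevity; real systems use Rabin fingerprints and tuned parameters, and this is not the paper's improved variant.

```python
def tttd_chunks(data: bytes,
                t_min: int = 1024,       # minimum chunk size (assumed value)
                t_max: int = 4096,       # maximum chunk size (assumed value)
                main_div: int = 2048,    # main divisor D
                backup_div: int = 1024,  # backup divisor D'
                window: int = 48):       # sliding-window width for the hash
    """Split data into chunks whose sizes stay within [t_min, t_max]."""
    chunks, start = [], 0
    i, h, backup = 0, 0, -1
    while i < len(data):
        h += data[i]                     # toy rolling hash: byte sum over the window
        if i - start >= window:
            h -= data[i - window]
        size = i - start + 1
        if size >= t_min:                # never cut below the minimum threshold
            if h % backup_div == backup_div - 1:
                backup = i               # remember a fallback breakpoint
            if h % main_div == main_div - 1:
                chunks.append(data[start:i + 1])
                start, h, backup = i + 1, 0, -1
            elif size >= t_max:          # forced cut: prefer the backup breakpoint
                cut = backup if backup != -1 else i
                chunks.append(data[start:cut + 1])
                start, h, backup = cut + 1, 0, -1
                i = cut                  # resume scanning just after the forced cut
        i += 1
    if start < len(data):
        chunks.append(data[start:])      # trailing remainder
    return chunks
```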

    Malleable coding for updatable cloud caching

    Full text link
    In software-as-a-service applications provisioned through cloud computing, locally cached data are often modified with updates from new versions. In some cases, one may want to preserve both the original and new versions with each edit. In this paper, we focus on cases in which only the latest version must be preserved. Furthermore, it is desirable for the data to not only be compressed but also be easily modified during updates, since representing information and modifying the representation both incur cost. We examine whether it is possible to have both compression efficiency and ease of alteration, in order to promote codeword reuse. In other words, we study the feasibility of a malleable and efficient coding scheme. The tradeoff between compression efficiency and malleability cost, the difficulty of synchronizing compressed versions, is measured as the length of a reused prefix portion. The region of achievable rates and malleability is found. Drawing from prior work on common information problems, we show that efficient data compression may not be the best engineering design principle when storing software-as-a-service data. In the general case, the goals of efficiency and malleability are fundamentally in conflict. This work was supported in part by an NSF Graduate Research Fellowship (LRV), Grant CCR-0325774, and Grant CCF-0729069. This work was presented at the 2011 IEEE International Symposium on Information Theory [1] and the 2014 IEEE International Conference on Cloud Engineering [2]. The associate editor coordinating the review of this paper and approving it for publication was R. Thobaben. (CCR-0325774 - NSF Graduate Research Fellowship; CCF-0729069 - NSF Graduate Research Fellowship) Accepted manuscript.
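
    To make the tension between efficiency and malleability concrete, the hedged sketch below (not taken from the paper) compresses a buffer with zlib before and after a one-byte edit and measures how long a prefix of the old codeword survives in the new one; with a conventional compressor the reused prefix tends to be short, which is the kind of malleability cost the paper quantifies. The buffer contents and the single-byte edit are illustrative assumptions.

```python
import zlib

original = b"software-as-a-service cache line " * 64
edited = bytearray(original)
edited[100] ^= 0xFF                         # flip one byte of the source data

c_old = zlib.compress(original, 9)          # previously stored compressed version
c_new = zlib.compress(bytes(edited), 9)     # compressed version after the edit

# Length of the common prefix of the two compressed versions (the reused part).
prefix = next((i for i, (a, b) in enumerate(zip(c_old, c_new)) if a != b),
              min(len(c_old), len(c_new)))
print(f"compressed length: {len(c_old)} bytes, reused prefix: {prefix} bytes")
```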

    A Survey on Data Deduplication

    Get PDF
    Nowadays, the demand for data storage capacity is increasing drastically. Because of this growing demand, the computing community is increasingly drawn toward cloud storage, where data security and cost are important challenges. A duplicate file not only wastes storage but also increases access time, so detecting and removing duplicate data is an essential task. Data deduplication, an efficient approach to data reduction, has gained increasing attention and popularity in large-scale storage systems. It eliminates redundant data at the file or sub-file level and identifies duplicate content by its cryptographically secure hash signature. The task is tricky because duplicate files neither share a common key nor contain errors that would mark them. There are several approaches to identifying and removing redundant data at the file and chunk levels. This paper covers the background and key features of data deduplication and then summarizes and classifies the data deduplication process according to its key workflow.
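
    A minimal sketch of the chunk-level workflow described above: content is split into chunks, each chunk is identified by a cryptographically secure hash (SHA-256 here), and only chunks whose hash has not been seen before are stored. Fixed-size chunking and the class layout are illustrative assumptions, not a result surveyed in the paper.

```python
import hashlib, os

class DedupStore:
    """Toy chunk-level deduplicating store: each unique chunk is kept once."""
    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # SHA-256 digest -> chunk bytes (stored once)
        self.files = {}    # file name      -> list of digests (the file "recipe")

    def put(self, name: str, data: bytes) -> int:
        """Store a file and return how many new bytes actually had to be kept."""
        new_bytes, recipe = 0, []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest not in self.chunks:      # unseen content: store it
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            recipe.append(digest)              # a duplicate only adds a reference
        self.files[name] = recipe
        return new_bytes

    def get(self, name: str) -> bytes:
        """Reassemble a file from its recipe."""
        return b"".join(self.chunks[d] for d in self.files[name])

# Usage: storing the same content under a second name consumes no new chunk space.
store = DedupStore()
payload = os.urandom(10_000)
print(store.put("a.bin", payload))   # roughly 10,000 new bytes stored
print(store.put("b.bin", payload))   # 0 new bytes: every chunk already exists
assert store.get("b.bin") == payload
```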

    Memory Deduplication: An Effective Approach to Improve the Memory System

    Get PDF
    Programs now place more aggressive demands on memory to hold their data than before. This paper analyzes the characteristics of memory data using seven real memory traces. It observes that the traces contain a large volume of memory pages with identical contents. Furthermore, the number of unique memory contents accessed is much smaller than the number of unique memory addresses accessed. This is caused by traditional address-based cache replacement algorithms, which replace memory pages by checking their addresses rather than their contents, resulting in many identical memory contents with different addresses being stored in memory. For example, in the same file system, opening two identical files stored in different directories, or opening two similar files that share a certain amount of content in the same directory, will result in identical data blocks stored in the cache because of the traditional address-based cache replacement algorithms. Based on these observations, this paper evaluates memory compression and memory deduplication. As expected, memory deduplication greatly outperforms memory compression. For example, the best deduplication ratio is 4.6 times higher than the best compression ratio, and the deduplication and restore times are 121 times and 427 times faster than the compression and decompression times, respectively. The experimental results in this paper should offer useful insights for designing systems that require abundant memory to improve system performance.
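
    As a hedged illustration of the content-based sharing the paper argues for (not the authors' trace-analysis tooling), the sketch below keys stored page frames by a hash of their contents, so pages that are identical in content but distinct in address collapse to a single stored copy. The page size and the example page contents are assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size

def deduplicate(pages: dict) -> tuple:
    """Map each page address to a shared frame keyed by its content hash."""
    frames = {}        # content hash -> single stored copy of the page
    page_table = {}    # page address -> content hash
    for addr, content in pages.items():
        key = hashlib.sha256(content).hexdigest()
        frames.setdefault(key, content)   # store each distinct content only once
        page_table[addr] = key
    return frames, page_table

# Usage: address-distinct but content-identical pages collapse to one frame,
# so memory use follows unique contents rather than unique addresses.
pages = {0x1000: b"A" * PAGE_SIZE, 0x2000: b"A" * PAGE_SIZE, 0x3000: b"B" * PAGE_SIZE}
frames, table = deduplicate(pages)
print(f"{len(pages)} pages -> {len(frames)} stored frames")
```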

    Malleable coding: compressed palimpsests

    Full text link
    A malleable coding scheme considers not only compression efficiency but also the ease of alteration, thus encouraging some form of recycling of an old compressed version in the formation of a new one. Malleability cost is the difficulty of synchronizing compressed versions, and malleable codes are of particular interest when representing information and modifying the representation are both expensive. We examine the trade-off between compression efficiency and malleability cost under a malleability metric defined with respect to a string edit distance. This problem introduces a metric topology to the compressed domain. We characterize the achievable rates and malleability as the solution of a subgraph isomorphism problem. This can be used to argue that allowing the conditional entropy of the edited message given the original message to grow linearly with the block length creates an exponential increase in code length. First author draft.

    Performance Evaluation of Lane-Change Classifiers at Freeway Merging Areas Using Machine Learning Algorithms

    Get PDF
    Master's thesis, Department of Civil and Environmental Engineering, College of Engineering, Seoul National University, August 2017. Advisor: ์ด์ฒญ์›. At freeway merging areas, predicting whether a lane change will occur is very difficult, owing not only to the interaction that arises as the mainline and merging-lane traffic streams mix but also to various traffic factors (e.g., traffic conditions, road geometry, weaving, and individual responses that vary with driver behavior). Moreover, the NGSIM US-101 data are inherently imbalanced: in the same traffic situation a driver generates many non-merge events but only a single merge event, so the data consist of a few accepted cases against a large number of rejected cases. With such an imbalanced structure, lane-change classification is biased toward the majority class, and accepted cases are often misclassified as rejected cases. To improve the performance of the proposed classifiers for judging the lane changes of merging vehicles in a mandatory lane change (MLC) environment and to mitigate the data imbalance, the strategy of this study can be summarized in three parts. First, data sampling techniques are introduced to address the class imbalance, and classification performance is reported through a contingency matrix, the skill scores derived from it, and ROC/PR curves. To this end, the Hampel filter built into MATLAB was first used to remove abnormal outliers and reduce measurement error, and two ways of reducing the number of rejected cases were explored: duplicate elimination by averaging, as commonly applied to data stored in Excel spreadsheets, and data reduction by adjusting the sampling time interval. Second, nonparametric machine-learning classifiers widely used in statistics, medicine, and computer science, namely the support vector machine (SVM) and the ensemble boosting method (EBM), were compared with the conventional parametric binary logit model (BLM) for predicting lane changes at the merging area; the BLM decides a lane change with a probability function defined as a linear combination of several parameters. Third, the anticipated gap model proposed by Choudhury at MIT in 2007 was adopted: whereas the conventional adjacent gap is computed under the assumption that the merging and surrounding vehicles travel at constant speed, the anticipated gap accounts for the additional gap variation caused by the dynamics of the lead and lag vehicles, which are assumed to accelerate around the merging vehicle during the lane change. Because the critical gap that governs a lane change strongly affects the decision, the newly proposed model was incorporated to examine its effect on classification performance. To demonstrate the extensibility of the proposed machine-learning-based classifiers, a microscopic traffic analysis was then performed using only the vehicles classified as True Positive in the contingency matrix (vehicles judged as merging by both the classifier and the measurements).
    By plotting vehicle trajectories, the lane-change decision-making process was identified, and merging behavior was classified into direct merging, chase merging, and other merging; a K-means clustering algorithm was applied to distinguish the vehicle trajectories. An error analysis was conducted between the lateral displacements of the merging vehicles produced by each classifier and the actual measurements; in particular, for direct merging the BLM showed larger errors than the SVM and EBM. In addition, for the data reduced by the sampling-time-interval technique, the distribution of lateral displacement over time was plotted, and classifier performance evaluation and error analysis were carried out. Detailed vehicle trajectories from the NGSIM (Next Generation Simulation) US-101 dataset were used. The analyses and evaluations lead to two findings: the machine-learning-based nonparametric classifiers show better prediction accuracy than the conventional parametric classifier regardless of the degree of imbalance in the NGSIM data, and the data sampling techniques together with the anticipated gap model mitigate the data imbalance and improve data quality.
    Table of contents: Chapter 1. Introduction (Motivation; Data Collection; Objective; Research Outline). Chapter 2. Classifiers for Prediction Model (General; Binary Logit Model (BLM); Support Vector Machine (SVM); Ensemble Boosting Method (EBM)). Chapter 3. Data Resampling for Class Imbalance Problem (General; Data Processing by Hampel Filter; Data Under-sampling Technique: Data Reduction by Sampling Time Interval, Duplicate Elimination by Averaging; K-means Clustering). Chapter 4. Metrics for Classification Performance (Contingency Matrix; Skill Scores; ROC and PR Curves). Chapter 5. Lane-Change Characteristics (Anticipated Gap Model; Merging Pattern; Decision Making Process). Chapter 6. Numerical Results (Performance Evaluation of Classifiers by Duplicate Elimination; Performance Evaluation of Classifiers by Sampling Time Interval; Decision-Making Process by Vehicle Trajectory; Classification of Merging Patterns). Chapter 7. Conclusions. References.
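
    As a hedged sketch of two ingredients in this workflow, the Python snippet below undersamples the majority (non-merge) class and scores a trivial majority-class predictor with a contingency matrix and basic skill scores on synthetic labels. The label counts, the always-reject baseline, and the 1:1 sampling ratio are assumptions for illustration; this is not the thesis's MATLAB pipeline or its BLM/SVM/EBM models.

```python
import random

random.seed(0)
# Synthetic, imbalanced labels: 1 = merge (accepted), 0 = non-merge (rejected).
labels = [1] * 50 + [0] * 950

def undersample(labels, ratio=1.0):
    """Keep every minority (merge) sample and a random subset of the majority."""
    minority = [y for y in labels if y == 1]
    majority = [y for y in labels if y == 0]
    keep = random.sample(majority, min(len(majority), int(ratio * len(minority))))
    return minority + keep

def skill_scores(y_true, y_pred):
    """Contingency-matrix counts and a few common skill scores."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": round(precision, 3),
            "recall": round(recall, 3),
            "accuracy": round(accuracy, 3)}

# A predictor biased toward the majority class looks accurate on the raw,
# imbalanced labels yet never detects a merge; undersampling exposes this.
always_reject = lambda ys: [0] * len(ys)
print("raw:     ", skill_scores(labels, always_reject(labels)))
balanced = undersample(labels)
print("balanced:", skill_scores(balanced, always_reject(balanced)))
```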

    Improving Duplicate Elimination in Storage Systems

    No full text
    Minimizing the amount of data that must be stored and managed is a key goal for any storage architecture that purports to be scalable. One way to achieve this goal is to avoid maintaining duplicate copies of the same data. Eliminating redundant data at the source, by not writing data that have already been stored, not only reduces storage overheads but can also improve bandwidth utilization. For these reasons, in the face of today's exponentially growing data volumes, redundant data elimination techniques have assumed critical significance in the design of modern storage systems. Intelligent object partitioning techniques identify data that are new when objects are updated and transfer only those chunks to a storage server. In this paper, we propose a new object partitioning technique, called fingerdiff, that improves upon existing schemes in several important respects. Most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects, in order to improve storage and bandwidth utilization. We present a detailed evaluation of fingerdiff and other existing object partitioning schemes using a set of real-world workloads. We show that for these workloads, the duplicate elimination strategies employed by fingerdiff improve storage utilization on average by 25% and bandwidth utilization on average by 40% over comparable techniques.
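
    The sketch below is not fingerdiff itself, but it illustrates the granularity trade-off that motivates such schemes: after a small in-place edit, comparing versions at a finer sub-chunk size identifies far more of the unchanged data than a coarse fixed chunk size does, so less new data must be stored or transferred. The chunk sizes, the random payload, and the 4-byte edit are assumptions.

```python
import hashlib, os

def unique_bytes_after_update(old: bytes, new: bytes, chunk: int) -> int:
    """Bytes of `new` whose chunk hash was not already present in `old`."""
    seen = {hashlib.sha256(old[i:i + chunk]).digest()
            for i in range(0, len(old), chunk)}
    return sum(len(new[i:i + chunk])
               for i in range(0, len(new), chunk)
               if hashlib.sha256(new[i:i + chunk]).digest() not in seen)

old = os.urandom(64 * 1024)
new = bytearray(old)
new[1000:1004] = b"EDIT"                   # a 4-byte in-place change
for chunk in (8192, 512):                  # coarse vs fine partitioning granularity
    print(chunk, unique_bytes_after_update(old, bytes(new), chunk))
```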