10 research outputs found

    Malleable Coding with Fixed Reuse

    Full text link
    In cloud computing, storage area networks, remote backup storage, and similar settings, stored data is modified with updates from new versions. Representing information and modifying the representation are both expensive. Therefore it is desirable for the data to not only be compressed but also be easily modified during updates. A malleable coding scheme considers both compression efficiency and ease of alteration, promoting codeword reuse. We examine the trade-off between compression efficiency and malleability cost, the difficulty of synchronizing compressed versions, measured as the length of a reused prefix portion. Through a coding theorem, the region of achievable rates and malleability is expressed as a single-letter optimization. Relationships to common information problems are also described.

    A Running Time Improvement for Two Thresholds Two Divisors Algorithm

    Get PDF
    Chunking algorithms play an important role in data de-duplication systems. The Basic Sliding Window (BSW) algorithm is the first prototype of a content-based chunking algorithm that can handle most types of data. The Two Thresholds Two Divisors (TTTD) algorithm was proposed to improve on the BSW algorithm by controlling the variation of chunk sizes. In this project, we investigate and compare the BSW and TTTD algorithms with respect to several factors through a series of systematic experiments. To date, no other work has conducted such an experimental evaluation of these two algorithms; this is the first contribution of this paper. Based on our analyses and experimental results, we provide a running-time improvement for the TTTD algorithm. Compared with the original TTTD algorithm, our new solution reduces the total running time by about 7%, reduces the number of large chunks by about 50%, and brings the average chunk size closer to the expected chunk size. These results are the second contribution of this project.
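
    As an illustration of the content-defined chunking idea behind BSW and TTTD, the sketch below cuts a chunk where a rolling hash over a sliding window matches a main divisor, keeps a fallback breakpoint from a backup divisor, and enforces minimum and maximum size thresholds. The threshold and divisor values, the window width, and the toy byte-sum hash are assumptions for brevity; real systems use Rabin fingerprints and tuned parameters, and this is not the paper's improved variant.

```python
def tttd_chunks(data: bytes,
                t_min: int = 1024,       # minimum chunk size (assumed value)
                t_max: int = 4096,       # maximum chunk size (assumed value)
                main_div: int = 2048,    # main divisor D
                backup_div: int = 1024,  # backup divisor D'
                window: int = 48):       # sliding-window width for the hash
    """Split data into chunks whose sizes stay within [t_min, t_max]."""
    chunks, start = [], 0
    i, h, backup = 0, 0, -1
    while i < len(data):
        h += data[i]                     # toy rolling hash: byte sum over the window
        if i - start >= window:
            h -= data[i - window]
        size = i - start + 1
        if size >= t_min:                # never cut below the minimum threshold
            if h % backup_div == backup_div - 1:
                backup = i               # remember a fallback breakpoint
            if h % main_div == main_div - 1:
                chunks.append(data[start:i + 1])
                start, h, backup = i + 1, 0, -1
            elif size >= t_max:          # forced cut: prefer the backup breakpoint
                cut = backup if backup != -1 else i
                chunks.append(data[start:cut + 1])
                start, h, backup = cut + 1, 0, -1
                i = cut                  # resume scanning just after the forced cut
        i += 1
    if start < len(data):
        chunks.append(data[start:])      # trailing remainder
    return chunks
```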

    Malleable coding for updatable cloud caching

    Full text link
    In software-as-a-service applications provisioned through cloud computing, locally cached data are often modified with updates from new versions. In some cases, one may want to preserve both the original and new versions with each edit. In this paper, we focus on cases in which only the latest version must be preserved. Furthermore, it is desirable for the data to not only be compressed but also be easily modified during updates, since representing information and modifying the representation both incur cost. We examine whether it is possible to have both compression efficiency and ease of alteration, in order to promote codeword reuse. In other words, we study the feasibility of a malleable and efficient coding scheme. The tradeoff between compression efficiency and malleability cost, the difficulty of synchronizing compressed versions, is measured as the length of a reused prefix portion. The region of achievable rates and malleability is found. Drawing from prior work on common information problems, we show that efficient data compression may not be the best engineering design principle when storing software-as-a-service data. In the general case, the goals of efficiency and malleability are fundamentally in conflict. This work was supported in part by an NSF Graduate Research Fellowship (LRV), Grant CCR-0325774, and Grant CCF-0729069. This work was presented at the 2011 IEEE International Symposium on Information Theory [1] and the 2014 IEEE International Conference on Cloud Engineering [2]. The associate editor coordinating the review of this paper and approving it for publication was R. Thobaben. (CCR-0325774 - NSF Graduate Research Fellowship; CCF-0729069 - NSF Graduate Research Fellowship) Accepted manuscript.
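
    To make the tension between efficiency and malleability concrete, the hedged sketch below (not taken from the paper) compresses a buffer with zlib before and after a one-byte edit and measures how long a prefix of the old codeword survives in the new one; with a conventional compressor the reused prefix tends to be short, which is the kind of malleability cost the paper quantifies. The buffer contents and the single-byte edit are illustrative assumptions.

```python
import zlib

original = b"software-as-a-service cache line " * 64
edited = bytearray(original)
edited[100] ^= 0xFF                         # flip one byte of the source data

c_old = zlib.compress(original, 9)          # previously stored compressed version
c_new = zlib.compress(bytes(edited), 9)     # compressed version after the edit

# Length of the common prefix of the two compressed versions (the reused part).
prefix = next((i for i, (a, b) in enumerate(zip(c_old, c_new)) if a != b),
              min(len(c_old), len(c_new)))
print(f"compressed length: {len(c_old)} bytes, reused prefix: {prefix} bytes")
```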

    A Survey on Data Deduplication

    Get PDF
    Nowadays, the demand for data storage capacity is increasing drastically. Because of this growing demand, the computing community is increasingly drawn toward cloud storage, where data security and cost are important challenges. A duplicate file not only wastes storage but also increases access time, so detecting and removing duplicate data is an essential task. Data deduplication, an efficient approach to data reduction, has gained increasing attention and popularity in large-scale storage systems. It eliminates redundant data at the file or sub-file level and identifies duplicate content by its cryptographically secure hash signature. The task is tricky because duplicate files neither share a common key nor contain errors that would mark them. There are several approaches to identifying and removing redundant data at the file and chunk levels. This paper covers the background and key features of data deduplication and then summarizes and classifies the data deduplication process according to its key workflow.
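
    A minimal sketch of the chunk-level workflow described above: content is split into chunks, each chunk is identified by a cryptographically secure hash (SHA-256 here), and only chunks whose hash has not been seen before are stored. Fixed-size chunking and the class layout are illustrative assumptions, not a result surveyed in the paper.

```python
import hashlib, os

class DedupStore:
    """Toy chunk-level deduplicating store: each unique chunk is kept once."""
    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # SHA-256 digest -> chunk bytes (stored once)
        self.files = {}    # file name      -> list of digests (the file "recipe")

    def put(self, name: str, data: bytes) -> int:
        """Store a file and return how many new bytes actually had to be kept."""
        new_bytes, recipe = 0, []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest not in self.chunks:      # unseen content: store it
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            recipe.append(digest)              # a duplicate only adds a reference
        self.files[name] = recipe
        return new_bytes

    def get(self, name: str) -> bytes:
        """Reassemble a file from its recipe."""
        return b"".join(self.chunks[d] for d in self.files[name])

# Usage: storing the same content under a second name consumes no new chunk space.
store = DedupStore()
payload = os.urandom(10_000)
print(store.put("a.bin", payload))   # roughly 10,000 new bytes stored
print(store.put("b.bin", payload))   # 0 new bytes: every chunk already exists
assert store.get("b.bin") == payload
```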

    Memory Deduplication: An Effective Approach to Improve the Memory System

    Get PDF
    Programs now place more aggressive demands on memory to hold their data than before. This paper analyzes the characteristics of memory data using seven real memory traces. It observes that the traces contain a large volume of memory pages with identical contents. Furthermore, the number of unique memory contents accessed is much smaller than the number of unique memory addresses accessed. This is caused by traditional address-based cache replacement algorithms, which replace memory pages by checking their addresses rather than their contents, resulting in many identical memory contents with different addresses being stored in memory. For example, in the same file system, opening two identical files stored in different directories, or opening two similar files that share a certain amount of content in the same directory, will result in identical data blocks stored in the cache because of the traditional address-based cache replacement algorithms. Based on these observations, this paper evaluates memory compression and memory deduplication. As expected, memory deduplication greatly outperforms memory compression. For example, the best deduplication ratio is 4.6 times higher than the best compression ratio, and the deduplication and restore times are 121 times and 427 times faster than the compression and decompression times, respectively. The experimental results in this paper should offer useful insights for designing systems that require abundant memory to improve system performance.
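
    As a hedged illustration of the content-based sharing the paper argues for (not the authors' trace-analysis tooling), the sketch below keys stored page frames by a hash of their contents, so pages that are identical in content but distinct in address collapse to a single stored copy. The page size and the example page contents are assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size

def deduplicate(pages: dict) -> tuple:
    """Map each page address to a shared frame keyed by its content hash."""
    frames = {}        # content hash -> single stored copy of the page
    page_table = {}    # page address -> content hash
    for addr, content in pages.items():
        key = hashlib.sha256(content).hexdigest()
        frames.setdefault(key, content)   # store each distinct content only once
        page_table[addr] = key
    return frames, page_table

# Usage: address-distinct but content-identical pages collapse to one frame,
# so memory use follows unique contents rather than unique addresses.
pages = {0x1000: b"A" * PAGE_SIZE, 0x2000: b"A" * PAGE_SIZE, 0x3000: b"B" * PAGE_SIZE}
frames, table = deduplicate(pages)
print(f"{len(pages)} pages -> {len(frames)} stored frames")
```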

    Malleable coding: compressed palimpsests

    Full text link
    A malleable coding scheme considers not only compression efficiency but also the ease of alteration, thus encouraging some form of recycling of an old compressed version in the formation of a new one. Malleability cost is the difficulty of synchronizing compressed versions, and malleable codes are of particular interest when representing information and modifying the representation are both expensive. We examine the trade-off between compression efficiency and malleability cost under a malleability metric defined with respect to a string edit distance. This problem introduces a metric topology to the compressed domain. We characterize the achievable rates and malleability as the solution of a subgraph isomorphism problem. This can be used to argue that allowing the conditional entropy of the edited message given the original message to grow linearly with the block length creates an exponential increase in code length. First author draft.

    Performance Evaluation of Lane-Change Classifiers at Freeway Merging Areas Using Machine Learning Algorithms

    Get PDF
    Master's thesis, Department of Civil and Environmental Engineering, College of Engineering, Seoul National University, August 2017. Advisor: ์ด์ฒญ์›. At freeway merging areas, predicting whether a lane change will occur is very difficult, owing not only to the interaction that arises as the mainline and merging-lane traffic streams mix but also to various traffic factors (e.g., traffic conditions, road geometry, weaving, and individual responses that vary with driver behavior). Moreover, the NGSIM US-101 data are inherently imbalanced: in the same traffic situation a driver generates many non-merge events but only a single merge event, so the data consist of a few accepted cases against a large number of rejected cases. With such an imbalanced structure, lane-change classification is biased toward the majority class, and accepted cases are often misclassified as rejected cases. To improve the performance of the proposed classifiers for judging the lane changes of merging vehicles in a mandatory lane change (MLC) environment and to mitigate the data imbalance, the strategy of this study can be summarized in three parts. First, data sampling techniques are introduced to address the class imbalance, and classification performance is reported through a contingency matrix, the skill scores derived from it, and ROC/PR curves. To this end, the Hampel filter built into MATLAB was first used to remove abnormal outliers and reduce measurement error, and two ways of reducing the number of rejected cases were explored: duplicate elimination by averaging, as commonly applied to data stored in Excel spreadsheets, and data reduction by adjusting the sampling time interval. Second, nonparametric machine-learning classifiers widely used in statistics, medicine, and computer science, namely the support vector machine (SVM) and the ensemble boosting method (EBM), were compared with the conventional parametric binary logit model (BLM) for predicting lane changes at the merging area; the BLM decides a lane change with a probability function defined as a linear combination of several parameters. Third, the anticipated gap model proposed by Choudhury at MIT in 2007 was adopted: whereas the conventional adjacent gap is computed under the assumption that the merging and surrounding vehicles travel at constant speed, the anticipated gap accounts for the additional gap variation caused by the dynamics of the lead and lag vehicles, which are assumed to accelerate around the merging vehicle during the lane change. Because the critical gap that governs a lane change strongly affects the decision, the newly proposed model was incorporated to examine its effect on classification performance. To demonstrate the extensibility of the proposed machine-learning-based classifiers, a microscopic traffic analysis was then performed using only the vehicles classified as True Positive in the contingency matrix (vehicles judged as merging by both the classifier and the measurements).
    By plotting vehicle trajectories, the lane-change decision-making process was identified, and merging behavior was classified into direct merging, chase merging, and other merging; a K-means clustering algorithm was applied to distinguish the vehicle trajectories. An error analysis was conducted between the lateral displacements of the merging vehicles produced by each classifier and the actual measurements; in particular, for direct merging the BLM showed larger errors than the SVM and EBM. In addition, for the data reduced by the sampling-time-interval technique, the distribution of lateral displacement over time was plotted, and classifier performance evaluation and error analysis were carried out. Detailed vehicle trajectories from the NGSIM (Next Generation Simulation) US-101 dataset were used. The analyses and evaluations lead to two findings: the machine-learning-based nonparametric classifiers show better prediction accuracy than the conventional parametric classifier regardless of the degree of imbalance in the NGSIM data, and the data sampling techniques together with the anticipated gap model mitigate the data imbalance and improve data quality.
    Table of contents: Chapter 1. Introduction (Motivation; Data Collection; Objective; Research Outline). Chapter 2. Classifiers for Prediction Model (General; Binary Logit Model (BLM); Support Vector Machine (SVM); Ensemble Boosting Method (EBM)). Chapter 3. Data Resampling for Class Imbalance Problem (General; Data Processing by Hampel Filter; Data Under-sampling Technique: Data Reduction by Sampling Time Interval, Duplicate Elimination by Averaging; K-means Clustering). Chapter 4. Metrics for Classification Performance (Contingency Matrix; Skill Scores; ROC and PR Curves). Chapter 5. Lane-Change Characteristics (Anticipated Gap Model; Merging Pattern; Decision Making Process). Chapter 6. Numerical Results (Performance Evaluation of Classifiers by Duplicate Elimination; Performance Evaluation of Classifiers by Sampling Time Interval; Decision-Making Process by Vehicle Trajectory; Classification of Merging Patterns). Chapter 7. Conclusions. References.
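
    As a hedged sketch of two ingredients in this workflow, the Python snippet below undersamples the majority (non-merge) class and scores a trivial majority-class predictor with a contingency matrix and basic skill scores on synthetic labels. The label counts, the always-reject baseline, and the 1:1 sampling ratio are assumptions for illustration; this is not the thesis's MATLAB pipeline or its BLM/SVM/EBM models.

```python
import random

random.seed(0)
# Synthetic, imbalanced labels: 1 = merge (accepted), 0 = non-merge (rejected).
labels = [1] * 50 + [0] * 950

def undersample(labels, ratio=1.0):
    """Keep every minority (merge) sample and a random subset of the majority."""
    minority = [y for y in labels if y == 1]
    majority = [y for y in labels if y == 0]
    keep = random.sample(majority, min(len(majority), int(ratio * len(minority))))
    return minority + keep

def skill_scores(y_true, y_pred):
    """Contingency-matrix counts and a few common skill scores."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": round(precision, 3),
            "recall": round(recall, 3),
            "accuracy": round(accuracy, 3)}

# A predictor biased toward the majority class looks accurate on the raw,
# imbalanced labels yet never detects a merge; undersampling exposes this.
always_reject = lambda ys: [0] * len(ys)
print("raw:     ", skill_scores(labels, always_reject(labels)))
balanced = undersample(labels)
print("balanced:", skill_scores(balanced, always_reject(balanced)))
```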

    Improving Duplicate Elimination in Storage Systems

    No full text
    Minimizing the amount of data that must be stored and managed is a key goal for any storage architecture that purports to be scalable. One way to achieve this goal is to avoid maintaining duplicate copies of the same data. Eliminating redundant data at the source, by not writing data that have already been stored, not only reduces storage overheads but can also improve bandwidth utilization. For these reasons, in the face of today's exponentially growing data volumes, redundant data elimination techniques have assumed critical significance in the design of modern storage systems. Intelligent object partitioning techniques identify data that are new when objects are updated and transfer only those chunks to a storage server. In this paper, we propose a new object partitioning technique, called fingerdiff, that improves upon existing schemes in several important respects. Most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects, in order to improve storage and bandwidth utilization. We present a detailed evaluation of fingerdiff and other existing object partitioning schemes using a set of real-world workloads. We show that for these workloads, the duplicate elimination strategies employed by fingerdiff improve storage utilization on average by 25% and bandwidth utilization on average by 40% over comparable techniques.
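
    The sketch below is not fingerdiff itself, but it illustrates the granularity trade-off that motivates such schemes: after a small in-place edit, comparing versions at a finer sub-chunk size identifies far more of the unchanged data than a coarse fixed chunk size does, so less new data must be stored or transferred. The chunk sizes, the random payload, and the 4-byte edit are assumptions.

```python
import hashlib, os

def unique_bytes_after_update(old: bytes, new: bytes, chunk: int) -> int:
    """Bytes of `new` whose chunk hash was not already present in `old`."""
    seen = {hashlib.sha256(old[i:i + chunk]).digest()
            for i in range(0, len(old), chunk)}
    return sum(len(new[i:i + chunk])
               for i in range(0, len(new), chunk)
               if hashlib.sha256(new[i:i + chunk]).digest() not in seen)

old = os.urandom(64 * 1024)
new = bytearray(old)
new[1000:1004] = b"EDIT"                   # a 4-byte in-place change
for chunk in (8192, 512):                  # coarse vs fine partitioning granularity
    print(chunk, unique_bytes_after_update(old, bytes(new), chunk))
```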