
    Doctor of Philosophy

    In the past few years, we have seen a tremendous increase in the amount of digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require efficient storage solutions. To improve space efficiency, compression and deduplication are widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate chunks. While these two approaches have brought great improvements in space efficiency, they still have limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range because they search for redundant data at a fine-grained level (the string level). Second, for deduplication, metadata embedded in an input file changes more frequently than the data itself, which introduces unnecessary unique chunks and degrades deduplication. Third, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication at improving space efficiency, and a new IO scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to exploit similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse-grained level (the block level) and then groups similar blocks together; it can be used as a preprocessing stage for any traditional compressor. We find that metadata has a large impact in reducing the benefit of deduplication. To isolate the impact of metadata, we propose to separate metadata from data, and we present three approaches for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and also a new tar format that deduplicates better. We also present a case study in which we use deduplication to reduce storage consumption for storing disk images while achieving high performance in image deployment. Finally, we apply the same principle of exploiting similarity in IO scheduling to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
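
    The block-grouping idea behind Migratory Compression can be sketched in a few lines of Python. The code below is a minimal illustration only: it uses a crude minimum-shingle hash as the similarity feature (a stand-in for the super-features computed by the real system) and bz2 as the backing compressor, sorts fixed-size blocks by that feature, and records the permutation so the transform can be undone.

        import bz2
        import zlib

        def block_feature(block: bytes, shingle: int = 8) -> int:
            # Similarity feature: the minimum hash over the block's shingles.
            # Blocks that share much content tend to share this minimum, so
            # sorting by it places similar blocks next to each other.
            if len(block) < shingle:
                return zlib.crc32(block)
            return min(zlib.crc32(block[i:i + shingle])
                       for i in range(0, len(block) - shingle + 1, 4))

        def migratory_compress(data: bytes, block_size: int = 4096):
            # Split into fixed-size blocks; a trailing partial block (if any)
            # stays last so block boundaries remain aligned after reordering.
            blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
            order = list(range(len(blocks)))
            tail = [order.pop()] if blocks and len(blocks[-1]) < block_size else []
            order = sorted(order, key=lambda i: block_feature(blocks[i])) + tail
            reordered = b"".join(blocks[i] for i in order)
            # 'order' is the migration metadata needed to undo the transform.
            return bz2.compress(reordered), order

        def migratory_decompress(payload: bytes, order, block_size: int = 4096) -> bytes:
            reordered = bz2.decompress(payload)
            blocks = [reordered[i:i + block_size] for i in range(0, len(reordered), block_size)]
            original = [b""] * len(blocks)
            for new_pos, old_pos in enumerate(order):
                original[old_pos] = blocks[new_pos]
            return b"".join(original)

    Because similar blocks end up adjacent after the reordering, a conventional compressor with a limited search window can find long-range redundancy it would otherwise miss, and the recorded order lets the original layout be restored losslessly.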

    Exploring heterogeneity of unreliable machines for p2p backup

    P2P architecture is a viable option for enterprise backup. In contrast to dedicated backup servers, today's standard solution, making backups directly on an organization's workstations should be cheaper (existing hardware is used), more efficient (there is no single bottleneck server), and more reliable (the machines are geographically dispersed). We present the architecture of a p2p backup system that uses pairwise replication contracts between a data owner and a replicator. In contrast to standard p2p storage systems that use a DHT directly, the contracts allow our system to optimize replica placement according to a specific optimization strategy, and thus to take advantage of the heterogeneity of the machines and the network. Such optimization is particularly appealing in the context of backup: replicas can be geographically dispersed, the load sent over the network can be minimized, or the optimization goal can be to minimize the backup/restore time. However, managing the contracts, keeping them consistent, and adjusting them in response to a dynamically changing environment is challenging. We built a scientific prototype and ran experiments on 150 workstations in the university's computer laboratories and, separately, on 50 PlanetLab nodes. We found that the main factor affecting the quality of the system is the availability of the machines. Yet our main conclusion is that it is possible to build an efficient and reliable backup system on highly unreliable machines (our computers had just 13% average availability).
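
    As a rough illustration of how pairwise contracts can exploit machine heterogeneity, the sketch below greedily picks replicators by observed availability and current contract load while spreading replicas across sites. The Machine fields, thresholds, and greedy policy are illustrative assumptions for this example, not the optimization strategies of the system described above.

        from dataclasses import dataclass

        @dataclass
        class Machine:
            name: str
            availability: float   # observed fraction of time the machine is online
            site: str             # laboratory/building, used for geographic dispersion
            contracts: int = 0    # replication contracts this machine already holds

        def choose_replicators(owner: Machine, peers: list, replicas: int = 3,
                               max_contracts: int = 10) -> list:
            # Greedy placement: prefer highly available, lightly loaded peers and
            # never put two replicas (or a replica and the owner) at the same site.
            chosen, used_sites = [], {owner.site}
            candidates = [p for p in peers if p is not owner and p.contracts < max_contracts]
            candidates.sort(key=lambda p: (-p.availability, p.contracts))
            for p in candidates:
                if len(chosen) == replicas:
                    break
                if p.site in used_sites:
                    continue
                chosen.append(p)
                used_sites.add(p.site)
                p.contracts += 1   # record the new pairwise contract
            return chosen

    Assuming independent failures, data replicated on machines with availabilities a1, ..., ak is reachable with probability 1 - (1 - a1)(1 - a2)...(1 - ak), which is one way to see why measured availability, rather than raw capacity, dominates placement decisions.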

    ON OPTIMIZATIONS OF VIRTUAL MACHINE LIVE STORAGE MIGRATION FOR THE CLOUD

    Virtual Machine (VM) live storage migration is widely performed in the data centers of the Cloud for the purposes of load balancing, reliability, availability, hardware maintenance, and system upgrades. It entails moving all the state information of the VM being migrated, including memory state, network state, and storage state, from one physical server to another within the same data center or across different data centers. To minimize its performance impact, this migration process is required to be transparent to applications running within the migrating VM, meaning that applications keep running inside the VM as if there were no migration operations at all. In this dissertation, a thorough literature review is conducted to provide a big picture of the VM live storage migration process, its problems, and existing solutions. After an in-depth examination, we observe that severe IO interference exists between the VM IO threads and the migration IO threads, causing both types of IO threads to suffer performance degradation. This interference stems from the fact that both types of IO threads share the same critical IO path by reading from and writing to the same shared storage system. Owing to IO resource contention and interference between the two different types of IO requests, not only does the IO request queue in the storage system lengthen, but time-consuming disk seek operations also become more frequent. Based on this fundamental observation, this dissertation research presents three related but orthogonal solutions that tackle the IO interference problem in order to improve VM live storage migration performance. First, we introduce a Workload-Aware IO Outsourcing scheme, called WAIO, to improve VM live storage migration efficiency. Second, we propose a novel scheme, called SnapMig, to improve VM live storage migration efficiency and eliminate its performance impact on user applications at the source server by effectively leveraging the existing VM snapshots in the backup servers. Third, we propose the IOFollow scheme to improve both the VM performance and the migration performance simultaneously. Finally, we outline directions for future research. Advisor: Hong Jiang.
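
    The interference described above can be pictured with a toy dispatcher that keeps the two request streams from forcing the disk head back and forth: foreground VM IO is served immediately, while migration IO is held back and released in large sequential batches when the VM stream goes quiet. This is only a simplified illustration of the general idea, not the WAIO, SnapMig, or IOFollow designs; all names and parameters are invented for the example.

        import collections

        class MigrationAwareDispatcher:
            # Toy single-disk dispatcher: VM requests always go first; migration
            # requests are drained in large sequential batches only when the VM
            # stream is idle, so the disk head is not forced to seek between the
            # two streams on every request.

            def __init__(self, batch_size: int = 64):
                self.vm_queue = collections.deque()
                self.migration_queue = collections.deque()
                self.batch_size = batch_size

            def submit(self, request, is_migration: bool) -> None:
                # Both streams share one device but are queued separately.
                (self.migration_queue if is_migration else self.vm_queue).append(request)

            def next_batch(self) -> list:
                # Foreground VM IO wins, keeping the migration transparent to the VM.
                if self.vm_queue:
                    return [self.vm_queue.popleft()]
                # Only when the VM is quiet, issue migration IO as one sequential run.
                batch = []
                while self.migration_queue and len(batch) < self.batch_size:
                    batch.append(self.migration_queue.popleft())
                return batch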

    Data De-Duplication in NoSQL Databases

    With the popularity and expansion of Cloud Computing, NoSQL databases (DBs) are becoming the preferred choice for storing data in the Cloud. Because they are highly de-normalized, these DBs tend to store significant amounts of redundant data. Data de-duplication (DD) plays an important role in reducing storage consumption to make it affordable to manage in today's explosive data growth. Numerous DD methodologies, such as chunking and delta encoding, are available today to optimize the use of storage. These technologies approach DD at the file and/or sub-file level, but this approach has never been optimal for NoSQL DBs. This research proposes Data De-Duplication in NoSQL Databases (DDNSDB), which applies DD at a higher level of abstraction, namely the DB level. It makes use of the structural information about the data (metadata), exploiting its granularity to identify and remove duplicates. The main goals of this research are: to maximally reduce the amount of duplicates in one type of NoSQL DB, namely the key-value store; to maximally increase the process performance such that the backup window is only marginally affected; and to design with horizontal scaling in mind such that it runs competitively on a Cloud platform. Additionally, this research presents an analysis of the various types of NoSQL DBs (such as key-value, tabular/columnar, and document DBs) to understand the data model required for the design and implementation of DDNSDB. Primary experiments have demonstrated that DDNSDB can further reduce NoSQL DB storage space compared with current archiving methods (from 17% to nearly 69% as more structural information is available). Also, by following an optimized, adapted MapReduce architecture, DDNSDB proves to have a competitive performance advantage in a horizontal-scaling cloud environment compared with a vertical-scaling environment (from 28.8 milliseconds to 34.9 milliseconds as the number of parallel Virtual Machines grows).
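
    The core idea of deduplicating at the database level rather than the file level can be sketched briefly: store each distinct value once under its content hash and let keys reference that hash, with reference counting for safe deletion. The class below is an illustrative toy in Python, not the DDNSDB design or its MapReduce pipeline.

        import hashlib

        class DedupKVStore:
            # Toy key-value store that deduplicates whole values: each distinct
            # value is stored once, indexed by its content hash, and keys only
            # hold references to those hashes.

            def __init__(self):
                self._values = {}   # content hash -> (value bytes, reference count)
                self._index = {}    # user key -> content hash

            def put(self, key: str, value: bytes) -> None:
                digest = hashlib.sha256(value).hexdigest()
                self.delete(key)                      # drop any previous reference
                stored, refs = self._values.get(digest, (value, 0))
                self._values[digest] = (stored, refs + 1)
                self._index[key] = digest

            def get(self, key: str) -> bytes:
                return self._values[self._index[key]][0]

            def delete(self, key: str) -> None:
                digest = self._index.pop(key, None)
                if digest is None:
                    return
                value, refs = self._values[digest]
                if refs <= 1:
                    del self._values[digest]          # last reference is gone
                else:
                    self._values[digest] = (value, refs - 1)

    In a real key-value store the same bookkeeping would live alongside the DB's own metadata and be sharded across nodes, which is where the horizontal-scaling and MapReduce aspects mentioned above would come in.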

    Data security and reliability in cloud backup systems with deduplication.

    Cloud storage is an emerging service model that enables individuals and enterprises to outsource the storage of data backups to remote cloud providers at a low cost. This thesis presents methods to ensure the data security and reliability of cloud backup systems. In the first part of this thesis, we present FadeVersion, a secure cloud backup system that serves as a security layer on top of today's cloud storage services. FadeVersion follows the standard version-controlled backup design, which eliminates the storage of redundant data across different versions of backups. On top of this, FadeVersion applies cryptographic protection to data backups. Specifically, it enables fine-grained assured deletion: cloud clients can assuredly delete particular backup versions or files on the cloud and make them permanently inaccessible to anyone, while other versions or files that share common data with the deleted ones remain unaffected. We implement a proof-of-concept prototype of FadeVersion and conduct an empirical evaluation atop Amazon S3. We show that FadeVersion adds only minimal performance overhead over a traditional cloud backup service that does not support assured deletion. In the second part of this thesis, we present CFTDedup, a distributed proxy system designed to provide storage efficiency via deduplication in cloud storage while ensuring crash fault tolerance among proxies. It synchronizes deduplication metadata among proxies to provide strong consistency, and it batches metadata updates to mitigate synchronization overhead. We implement a preliminary prototype of CFTDedup and evaluate its runtime performance via testbed experiments in deduplication storage for virtual machine images. We also discuss several open issues on how to provide reliable, high-performance deduplication storage. Our CFTDedup prototype provides a platform to explore such issues.
    Rahumed, Arthur. Thesis (M.Phil.)--Chinese University of Hong Kong, 2012. Includes bibliographical references (leaves 47-51). Abstracts also in Chinese.
    Contents: 1 Introduction (Cloud Based Backup and Assured Deletion; Crash Fault Tolerance for Backup Systems with Deduplication; Outline of Thesis); 2 Background and Related Work (Deduplication; Assured Deletion; Policy Based Assured Deletion; Convergent Encryption; Cloud Based Backup Systems; Fault Tolerant Deduplication Systems); 3 Design of FadeVersion (Threat Model and Assumptions for FadeVersion; Motivation; Main Idea; Version Control; Assured Deletion; Assured Deletion for Multiple Policies; Key Management); 4 Implementation of FadeVersion (System Entities; Metadata Format in FadeVersion); 5 Evaluation of FadeVersion (Setup; Backup/Restore Time; Storage Space; Monetary Cost; Conclusions); 6 CFTDedup Design (Failure Model; System Overview; Distributed Deduplication; Crash Fault Tolerance; Implementation); 7 Evaluation of CFTDedup (Setup; Experiment 1: Archival; Experiment 2: Restore; Experiment 3: Recovery; Summary); 8 Future Work and Conclusions of CFTDedup; 9 Conclusion; Bibliography.
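
    The key-wrapping idea behind assured deletion can also be sketched: encrypt each backup version with its own data key, wrap that data key with a policy key held by a key manager, and destroy only the policy key to make the version permanently unreadable. The sketch below uses the third-party Python cryptography package (Fernet) purely for illustration; the per-version policy keys and in-memory dictionaries are simplifying assumptions, not FadeVersion's actual design, which ties keys to deletion policies and layers this on top of version-controlled, deduplicated storage.

        from cryptography.fernet import Fernet

        class AssuredDeletionStore:
            # Each backup version is encrypted with its own data key; the data key
            # is wrapped (encrypted) with a policy key. Destroying the policy key
            # makes that version unrecoverable even though its ciphertext remains
            # in the cloud. (Illustrative sketch only.)

            def __init__(self):
                self.policy_keys = {}    # version id -> policy key (held by a key manager)
                self.wrapped_keys = {}   # version id -> data key encrypted under policy key
                self.ciphertexts = {}    # version id -> encrypted backup payload

            def backup(self, version: str, data: bytes) -> None:
                data_key = Fernet.generate_key()
                policy_key = Fernet.generate_key()
                self.ciphertexts[version] = Fernet(data_key).encrypt(data)
                self.wrapped_keys[version] = Fernet(policy_key).encrypt(data_key)
                self.policy_keys[version] = policy_key

            def restore(self, version: str) -> bytes:
                policy_key = self.policy_keys[version]   # raises if assuredly deleted
                data_key = Fernet(policy_key).decrypt(self.wrapped_keys[version])
                return Fernet(data_key).decrypt(self.ciphertexts[version])

            def assured_delete(self, version: str) -> None:
                # Only the small policy key must be destroyed; the ciphertext can
                # stay on the cloud without ever being readable again.
                del self.policy_keys[version]

    In FadeVersion itself this principle is combined with version-controlled deduplication, so data shared with undeleted versions remains recoverable, as the abstract above describes.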