
    Doctor of Philosophy

    In the past few years, we have seen a tremendous increase in digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of this requires an efficient storage solution. To improve space efficiency, compression and deduplication are widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate blocks. While these two approaches have produced great improvements in space efficiency, some limitations remain. First, traditional compressors are limited in their ability to detect redundancy across a large range, since they search for redundant data at a fine-grained level (the string level). For deduplication, metadata embedded in an input file changes more frequently than the data itself, which introduces unnecessary unique chunks and degrades deduplication. Finally, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads.

    This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication for space efficiency, and a new I/O scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to exploit similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse-grained level (the block level) and then groups similar blocks together; it can be used as a preprocessing stage for any traditional compressor. We find that metadata has a large impact in reducing the benefit of deduplication. To isolate the impact of metadata, we propose separating metadata from data, and present three approaches for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and a new tar format that deduplicates better. We also present a case study in which deduplication reduces storage consumption for disk images while achieving high performance in image deployment. Finally, we apply the same principle of exploiting similarity to I/O scheduling to prevent interference between random and sequential workloads, yielding efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
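    As a concrete illustration of the Migratory Compression idea, the sketch below reorders fixed-size blocks so that similar ones become adjacent before a conventional compressor runs. It is a minimal sketch, not the dissertation's implementation: the block size, the single min-hash `feature` used as a similarity key, and the use of zlib are all illustrative assumptions.

```python
# Minimal sketch of Migratory Compression, assuming fixed-size blocks
# and one min-hash "super-feature" per block (illustrative choices;
# the dissertation's feature extraction and handling are richer).
import zlib

BLOCK = 4096

def feature(block: bytes, shingle: int = 8) -> int:
    # Cheap resemblance key: minimum hash over byte shingles, so blocks
    # sharing content tend to receive nearby keys. (hash() is stable
    # only within one process run, which is all this sketch needs.)
    return min(hash(block[i:i + shingle])
               for i in range(max(1, len(block) - shingle + 1)))

def migrate_and_compress(data: bytes) -> tuple[bytes, list[int], int]:
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    # Reorder blocks so similar ones are adjacent, letting the small
    # compressor window catch redundancy that was originally far apart.
    order = sorted(range(len(blocks)), key=lambda i: feature(blocks[i]))
    packed = b"".join(blocks[i] for i in order)
    return zlib.compress(packed), order, len(data)

def decompress_and_restore(payload: bytes, order: list[int], total: int) -> bytes:
    packed = zlib.decompress(payload)
    n = len(order)
    size = lambda i: BLOCK if i < n - 1 else total - BLOCK * (n - 1)
    restored, off = [b""] * n, 0
    for original_index in order:  # undo the migration
        restored[original_index] = packed[off:off + size(original_index)]
        off += size(original_index)
    return b"".join(restored)
```

    A quick round-trip check: `decompress_and_restore(*migrate_and_compress(data)) == data` should hold for any `data`. The permutation returned alongside the compressed bytes is what makes the transformation reversible.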

    Resumption of virtual machines after adaptive deduplication of virtual machine images in live migration

    In cloud computing, load balancing and energy utilization are critical problems addressed by virtual machine (VM) migration. Live migration is the movement of running VMs from an overloaded or underloaded physical machine to a more suitable one. During this process, transferring large disk image files takes considerable time, increasing both migration time and downtime. In the proposed adaptive deduplication, based on the image file size, the file undergoes both fixed-length and variable-length deduplication. The contribution of this paper is the resumption of VMs from reassembled, deduplicated disk image files. Performance is measured by the percentage reduction in VM image size after deduplication, the time taken to migrate the deduplicated file, and the time taken for each VM to resume after migration. The results show an 83% reduction in overall image size and an 89.76% reduction in migration time. For a deduplication ratio of 92%, the overall time is 3.52 minutes, a 7% reduction in resumption time compared with the full-size QCOW2 files. For VMDK files, the resumption time is reduced by up to 17% (7.63 minutes) compared with the original files.
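    The sketch below illustrates one plausible reading of such an adaptive scheme: a size threshold selects fixed-length chunking for small images and content-defined (variable-length) chunking for large ones. The threshold, chunk sizes, and toy fingerprint are assumptions for illustration, not parameters from the paper.

```python
# Hedged sketch of adaptive deduplication: the threshold, chunk sizes,
# and the toy content-defined chunker below are illustrative only.
import hashlib

FIXED = 4096                   # fixed-length chunk size (assumed)
THRESHOLD = 64 * 1024 * 1024   # size cutoff between the two modes (assumed)
MASK = (1 << 13) - 1           # ~8 KiB average variable chunk (assumed)

def fixed_chunks(data: bytes):
    for i in range(0, len(data), FIXED):
        yield data[i:i + FIXED]

def variable_chunks(data: bytes, min_len: int = 2048):
    # Toy content-defined chunking: cut where a byte-driven fingerprint
    # hits a boundary pattern, so an insertion shifts chunk boundaries
    # locally instead of invalidating every later chunk.
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        if i - start >= min_len and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

def deduplicate(image: bytes, store: dict) -> list[str]:
    chunker = fixed_chunks if len(image) < THRESHOLD else variable_chunks
    recipe = []
    for chunk in chunker(image):
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)  # store each unique chunk once
        recipe.append(fp)            # the recipe reassembles the image
    return recipe

def reassemble(recipe: list[str], store: dict) -> bytes:
    # "Reuniting" the deduplicated image before the VM resumes.
    return b"".join(store[fp] for fp in recipe)
```

    Reassembly is the inverse lookup, which is why resumption time depends on how quickly the recipe's chunks can be fetched and concatenated.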

    Towards Data Optimization in Storages and Networks

    Title from PDF of title page, viewed on August 7, 2015. Dissertation advisors: Sejun Song and Baek-Young Choi. Vita. Includes bibliographic references (pages 132-140). Thesis (Ph.D.)--School of Computing and Engineering, University of Missouri--Kansas City, 2015.

    We are encountering an explosion of data volume: one study estimates that data will amount to 40 zettabytes by the end of 2020. This explosion poses a significant burden not only on data storage space but also on access latency, manageability, and processing and network bandwidth. However, large portions of this data volume contain massive redundancies created by users, applications, systems, and communication models. Deduplication is a technique that reduces data volume by removing redundancies; reliability can even be improved when data is replicated after deduplication. Many deduplication studies, such as storage data deduplication and network redundancy elimination, have been proposed to reduce storage and network bandwidth consumption. However, existing solutions are not efficient enough to optimize the data delivery path from clients to servers across the network. Hence we propose a holistic deduplication framework that optimizes data along that entire path. Our framework consists of three components: data sources (clients), networks, and servers. The client component removes local redundancies at the clients, the network component removes redundant transfers coming from different clients, and the server component removes redundancies coming from different networks. We designed and developed components for the proposed framework. For the server component, we developed the Hybrid Email Deduplication System, which achieves a trade-off between space savings and overhead for email systems. For the client component, we developed Structure-Aware File and Email Deduplication for Cloud-based Storage Systems, which is very fast while achieving good space savings by using structure-based granularity. For the network component, we developed Software-defined Deduplication as a Network and Storage service, an in-network deduplication system that chains storage data deduplication and network redundancy elimination functions using Software-Defined Networking to achieve both storage space and network bandwidth savings with low processing time and memory overhead. We also discuss mobile deduplication for image and video files on mobile devices. Through system implementations and experiments, we show that the proposed framework effectively and efficiently optimizes data volume in a holistic manner encompassing the entire data path of clients, networks, and storage servers.

    Introduction -- Deduplication technology -- Existing deduplication approaches -- HEDS: Hybrid Email Deduplication System -- SAFE: Structure-Aware File and Email Deduplication for Cloud-based Storage Systems -- SoftDance: Software-defined Deduplication as a Network and Storage Service -- Mobile deduplication -- Conclusion
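    As an illustration of the structure-based granularity that SAFE exploits, the sketch below deduplicates an email at MIME-part boundaries, so a shared attachment is stored once across many messages. This is a minimal sketch of the general idea, not the dissertation's system; the one-chunk-per-part policy is an assumption.

```python
# Hedged sketch of structure-aware email deduplication: chunking at
# MIME-part boundaries is the illustrative policy here, so identical
# attachments dedupe across messages even when headers differ.
import email
import hashlib
from email import policy

def dedup_email(raw: bytes, store: dict) -> list[str]:
    msg = email.message_from_bytes(raw, policy=policy.default)
    recipe = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # multipart containers carry no payload themselves
        payload = part.get_payload(decode=True) or b""
        fp = hashlib.sha256(payload).hexdigest()
        store.setdefault(fp, payload)  # one stored copy per unique part
        recipe.append(fp)
    return recipe
```

    Because boundaries fall on structural units rather than fixed byte offsets, a one-byte change to a message body never perturbs the fingerprints of its attachments.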

    A Deterministic Eviction Model for Removing Redundancies in Video Corpus

    Traditional storage approaches are being challenged by huge data volumes. In multimedia content, duplicates are not necessarily exact copies; files are prone to editing, which results in similar copies of the same file. This paper proposes a similarity-based deduplication approach that evicts similar duplicates from archive storage by comparing samples of binary hashes. Eviction begins by dividing the query video into dynamic key frames based on the video length. Binary hash codes of these frames are then compared with those of existing key frames to identify the differences. A similarity score computed from these differences determines the eradication strategy for the duplicate copy. Duplicate elimination proceeds at two levels: removal of exact duplicates and removal of similar duplicates. The proposed approach shortens the comparison window by comparing only candidate hash codes selected from the dynamic key frames, and aims at accurate, lossless duplicate removal. The presented work is implemented and tested on a synthetic video dataset. Results show a reduction in redundant data and an increase in available storage space. Binary hashes and similarity scores contribute to a good deduplication ratio and overall performance.
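    The core comparison step can be pictured as below: each key frame is summarized by a binary hash, and the per-frame Hamming distances are folded into a similarity score. The 64-bit hash width and the 0.9 similarity threshold are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of key-frame similarity scoring; hash width and the
# duplicate threshold are assumed values, not taken from the paper.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")  # number of differing hash bits

def similarity(query: list[int], stored: list[int], bits: int = 64) -> float:
    # Compare aligned key-frame hashes and average per-frame agreement.
    scores = [1 - hamming(a, b) / bits for a, b in zip(query, stored)]
    return sum(scores) / len(scores) if scores else 0.0

# Usage with illustrative hash values: identical frames score 1.0,
# near-identical frames land just below it.
score = similarity([0xDEADBEEF, 0xCAFEBABE], [0xDEADBEEF, 0xCAFEBABF])
verdict = "exact" if score == 1.0 else "similar" if score > 0.9 else "distinct"
print(f"{score:.4f} -> {verdict}")
```

    Exact duplicates reduce to a score of 1.0, while edited copies land between the threshold and 1.0, which is what lets one scoring mechanism drive both levels of duplicate elimination.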