
    Doctor of Philosophy

    In the past few years, we have seen a tremendous increase in digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require efficient storage solutions. To improve space efficiency, compression and deduplication are widely used. Compression identifies repeated strings and replaces them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate blocks. While these two approaches have greatly improved space efficiency, some limitations remain. First, traditional compressors are limited in their ability to detect redundancy across a large range because they search for redundant data at a fine-grained level (the string level). Second, for deduplication, metadata embedded in an input file changes more frequently than the data itself, which introduces unnecessary unique chunks and degrades deduplication. Third, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new I/O scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to exploit similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse-grained level (the block level) and then groups similar blocks together; it can be used as a preprocessing stage for any traditional compressor. We find that metadata can substantially reduce the benefit of deduplication. To isolate the impact of metadata, we propose to separate metadata from data, and present three approaches for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and a new tar format that deduplicates better. We also present a case study in which deduplication reduces storage consumption for disk images while achieving high performance in image deployment. Finally, we apply the same principle of exploiting similarity to I/O scheduling to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
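
    The block-reordering idea behind Migratory Compression can be pictured with a minimal sketch, assuming fixed-size blocks and a crude min-hash-style similarity feature; the block size, feature scheme, and function names below are illustrative assumptions, not the dissertation's implementation. Sorting blocks by their feature places similar blocks next to each other, so a conventional compressor sees the coarse-grained redundancy within its search window:

        import zlib

        BLOCK_SIZE = 8 * 1024  # assumed fixed-size blocks; variable-size chunking is also possible

        def similarity_feature(block: bytes) -> int:
            # Crude min-hash over 8-byte shingles: blocks that share most of
            # their content usually share their minimum shingle hash as well.
            if len(block) < 8:
                return zlib.crc32(block)
            return min(zlib.crc32(block[i:i + 8]) for i in range(0, len(block) - 7, 4))

        def migrate_then_compress(data: bytes) -> bytes:
            blocks = [data[off:off + BLOCK_SIZE] for off in range(0, len(data), BLOCK_SIZE)]
            # Reorder ("migrate") similar blocks next to each other before compression.
            # A real implementation also records the permutation so the original
            # file can be reconstructed after decompression.
            order = sorted(range(len(blocks)), key=lambda i: similarity_feature(blocks[i]))
            return zlib.compress(b"".join(blocks[i] for i in order), 9)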

    XLH: more effective memory deduplication scanners through cross-layer hints

    Limited main memory size is the primary bottleneck for consolidating virtual machines (VMs) on hosting servers. Memory deduplication scanners reduce the memory footprint of VMs by eliminating redundancy. Our approach extends main memory deduplication scanners with cross-layer I/O-based hints (XLH) to find and exploit sharing opportunities earlier without raising the deduplication overhead. Prior work on memory scanners has shown great opportunity for memory deduplication. In our analyses, we have confirmed these results; however, we have found that memory scanners work well only for deduplicating fairly static memory pages. Current scanners need a considerable amount of time to detect new sharing opportunities (e.g., 5 minutes) and therefore do not exploit the full sharing potential. XLH's early detection of sharing opportunities saves more memory by deduplicating otherwise missed short-lived pages and by increasing the time that long-lived duplicates remain shared. Compared to I/O-agnostic scanners such as KSM, our benchmarks show that XLH can merge equal pages that stem from the virtual disk image minutes earlier and can save up to four times as much memory; e.g., XLH saves 290 MiB vs. 75 MiB of main memory for two VMs with 512 MiB of assigned memory each.
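
    The hint mechanism described above can be sketched with assumed data structures (the real XLH work modifies KSM inside the Linux kernel): pages touched by guest I/O are queued as hints and scanned ahead of the regular round-robin position, so short-lived duplicates are found while they still exist.

        from collections import deque

        hint_queue = deque()   # pages recently read/written on behalf of a guest
        scan_ring = deque()    # all scannable pages, visited round-robin by the baseline scanner

        def record_io_hint(page_id):
            # Called from the host I/O path: the source/target page of guest disk I/O
            # is very likely part of the guest's unified buffer cache and therefore a
            # promising sharing candidate.
            hint_queue.append(page_id)

        def next_page_to_scan():
            # Hinted pages are deduplicated preferentially; the linear scan continues
            # in the background so long-lived duplicates are still found eventually.
            if hint_queue:
                return hint_queue.popleft()
            if not scan_ring:
                return None
            scan_ring.rotate(-1)
            return scan_ring[0]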

    KSM++: Using I/O-based hints to make memory-deduplication scanners more efficient

    Memory-scanning deduplication techniques, as implemented in Linux's Kernel Samepage Merging (KSM), work very well for deduplicating fairly static, anonymous pages with equal content across different virtual machines. However, scanners need very aggressive scan rates to identify sharing opportunities with a short life span of up to about 5 minutes; otherwise, the scan process is not fast enough to catch those short-lived pages. Our approach generates I/O-based hints in the host to make the memory scanning process more efficient, enabling it to find and exploit short-lived sharing opportunities without raising the scan rate. Experience with similar techniques for paravirtualized guests has shown that pages in a guest's unified buffer cache are good sharing candidates. We already identify such pages in the host when carrying out I/O operations on behalf of the guest, since the target or source pages in the guest can safely be assumed to be part of the guest's unified buffer cache. In this way, we can determine good sharing hints for the memory scanner without modifying the guest. We have implemented our approach in Linux. By modifying the KSM scanning mechanism to process these hints preferentially, we move the associated sharing opportunities earlier into the merging stage and thereby deduplicate more pages than the baseline system. In our evaluation, we identify sharing opportunities faster and with less overhead than the traditional linear scanning policy: KSM needs to follow about seven times as many pages as we do to find a sharing opportunity.
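
    The merging stage that these hints feed into can be pictured roughly as follows. This is a simplified, hash-keyed sketch with hypothetical names; KSM itself compares full page contents rather than hashes, but the effect of folding equal pages onto one shared copy is the same.

        import hashlib

        content_index = {}   # page-content digest -> id of the page kept as the shared copy
        merged = {}          # merged page id -> id of the page it now shares a frame with

        def try_merge(page_id, read_page):
            # read_page(page_id) is assumed to return the page contents as bytes.
            digest = hashlib.sha256(read_page(page_id)).digest()
            keeper = content_index.setdefault(digest, page_id)
            if keeper != page_id:
                # In the kernel this step remaps the page to the shared, copy-on-write
                # frame; here we only record the sharing decision.
                merged[page_id] = keeper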

    ENTICE VM Image Analysis and Optimised Fragmentation

    Virtual machine (VM) images (VMIs) often share common parts of significant size, yet they are stored individually. Using existing de-duplication techniques for such images is non-trivial, imposes serious technical challenges, and requires direct access to clouds' proprietary image storage, which is not always feasible. We propose an alternative approach that splits images into shared parts, called fragments, which are stored only once. Our solution requires a reasonably small set of base images to be available in the cloud; beyond these, only the increments are stored, without the contents of the base images, providing significant storage space savings. Composite images, each consisting of a base image and one or more fragments, are assembled on demand at VM deployment. Our technique can be used in conjunction with practically any popular cloud solution, and the storage of fragments is independent of the proprietary image storage of the cloud provider.
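
    The composition step can be sketched as applying stored fragments on top of a base image at deployment time; the fragment layout below (offset plus raw bytes) is an assumption chosen for illustration, not ENTICE's actual on-disk format.

        from dataclasses import dataclass

        @dataclass
        class Fragment:
            offset: int   # position of this shared part within the composite image
            data: bytes   # stored once, referenced by every composite image that needs it

        def assemble_composite(base_image: bytes, fragments: list) -> bytes:
            # The base image already resides in the cloud; only the increments are
            # stored per composite image and applied on demand at VM deployment.
            image = bytearray(base_image)
            for frag in fragments:
                end = frag.offset + len(frag.data)
                if end > len(image):
                    image.extend(b"\x00" * (end - len(image)))
                image[frag.offset:end] = frag.data
            return bytes(image)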

    Survey on Deduplication Techniques in Flash-Based Storage

    The importance of data deduplication is growing along with data volumes. The field is under active development and has recently been influenced by the appearance of Solid State Drives (SSDs). This now widely used type of disk differs significantly from both random access memory and hard disk drives. In this paper we propose a novel taxonomy that reflects the main issues related to deduplication on SSDs. We present a survey on deduplication techniques focusing on flash-based storage. We also describe several open-source tools implementing data deduplication and briefly discuss open research problems related to data deduplication in flash-based storage systems.

    Gear: Enable Efficient Container Storage and Deployment with a New Image Format

    Containers have been widely used in various cloud platforms as they enable agile and elastic application deployment through their process-based virtualization and layered image system. However, different layers of a container image may contain substantial duplicate and unnecessary data, which slows down deployment due to long image downloading times and an increased burden on the image registry. To accelerate deployment and reduce the size of the registry, we propose a new image format, named Gear image, that consists of two parts: a Gear index describing the structure of the image's file system and a set of files that are required when running an application. The Gear index is represented as a single-layer image compatible with the existing deployment framework. Containers can be launched by pulling a Gear index and retrieving the files pointed to by the index on demand. Furthermore, the Gear image enables a file-level sharing mechanism, which helps remove duplicate data in the registry and avoids repeated downloading of identical files by a client. We implement a prototype of the container framework, named Gear, supporting the new image format. Evaluation shows that Gear saves 54% of storage capacity in the registry, speeds up container startup by up to 5x, and reduces bandwidth demands by 84%.
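
    A Gear-style index can be pictured as a flat mapping from file paths to content digests: the index is pulled first, and file contents are fetched lazily and shared across images. The JSON layout and function names here are assumptions for illustration, not Gear's actual format.

        import hashlib, json

        def build_index(files):
            # files: dict mapping path -> file contents (bytes).
            # Returns (index_json, blob_store); identical files collapse to one blob.
            index, blobs = {}, {}
            for path, data in files.items():
                digest = hashlib.sha256(data).hexdigest()
                index[path] = digest
                blobs.setdefault(digest, data)   # file-level dedup: one copy per digest
            return json.dumps(index, indent=2), blobs

        def read_file(index_json, path, fetch_blob):
            # At container start only the index is pulled; a file's contents are
            # fetched from the registry the first time the container accesses them.
            return fetch_blob(json.loads(index_json)[path])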
