
    Doctor of Philosophy

    In the past few years, we have seen a tremendous increase in the amount of digital data being generated. By 2011, storage vendors had shipped 905 PB of capacity in purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require efficient storage solutions. To improve space efficiency, compression and deduplication are widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate chunks. While these two approaches have yielded great improvements in space efficiency, they still have limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range because they search for redundant data at a fine granularity (the string level). For deduplication, metadata embedded in an input file changes more frequently than the data itself, which introduces unnecessary unique chunks and leads to poor deduplication. In addition, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new I/O scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to exploit similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse granularity (the block level) and groups similar blocks together; it can be used as a preprocessing stage for any traditional compressor. We find that metadata has a large impact on the benefit of deduplication. To isolate this impact, we propose separating metadata from data, and we present three approaches for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and a new tar format that deduplicates better. We also present a case study in which deduplication reduces the storage consumed by disk images while still achieving high performance in image deployment. Finally, we apply the same principle of exploiting similarity to I/O scheduling to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
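
    The sketch below illustrates the two space-saving ideas this abstract contrasts: chunk-level deduplication and grouping similar blocks before compression, in the spirit of Migratory Compression. It is a minimal illustration only; the 4 KiB chunk size, the byte-histogram "similarity" feature, and the use of zlib are assumptions for demonstration, not the dissertation's actual design.

```python
# Toy deduplication plus block reordering before compression.
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed fixed chunk size

def dedupe(data: bytes):
    """Split data into fixed-size chunks and store each unique chunk once."""
    index = {}    # chunk fingerprint -> position in the unique-chunk store
    chunks = []   # unique chunk payloads
    recipe = []   # sequence of chunk ids needed to rebuild the original data
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).digest()
        if fp not in index:
            index[fp] = len(chunks)
            chunks.append(chunk)
        recipe.append(index[fp])
    return chunks, recipe

def migratory_reorder(blocks):
    """Group similar blocks so a conventional compressor sees related data
    close together; 'similarity' here is just each block's most common byte,
    a stand-in for real similarity fingerprints."""
    def feature(block: bytes) -> int:
        return max(set(block), key=block.count) if block else 0
    order = sorted(range(len(blocks)), key=lambda i: feature(blocks[i]))
    reordered = b"".join(blocks[i] for i in order)
    return reordered, order  # 'order' is the metadata needed to undo the move

if __name__ == "__main__":
    data = (b"A" * 4096 + b"B" * 4096) * 8
    chunks, recipe = dedupe(data)
    print("unique chunks stored:", len(chunks), "out of", len(recipe))
    blocks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    reordered, _ = migratory_reorder(blocks)
    print("zlib, reordered:", len(zlib.compress(reordered)),
          "bytes vs original order:", len(zlib.compress(data)), "bytes")
```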

    Design patterns for tunable and efficient SSD-based indexes


    PEO-Store: Practical and Economical Oblivious Store with Peer-to-Peer Delegation

    The growing popularity of cloud storage has brought attention to the critical need to prevent information leakage through cloud access patterns. To this end, recent efforts have extended Oblivious RAM (ORAM) to the cloud environment in the form of an Oblivious Store. However, two factors limit its effectiveness: it is impractical because it relies on probabilistic encryption and fake accesses to obfuscate the access pattern, and the security requirements of conventional oblivious designs prevent cloud providers from improving storage utilization by removing data that is redundant across users. We therefore propose a practical Oblivious Store, PEO-Store, which integrates the obliviousness property into the cloud while removing redundancy without compromising security. Unlike conventional schemes, PEO-Store randomly selects a delegate for each client to communicate with the cloud, breaking the link between a valid access-pattern sequence and a specific client. Each client encrypts its data and shares it with selected delegates, who act as intermediaries with the cloud provider. This design leverages non-interactive zero-knowledge-based redundancy detection, discrete-logarithm-based key sharing, and secure time-based delivery proofs to protect access-pattern privacy and to accurately identify and remove redundancy in the cloud. A theoretical proof demonstrates that the probability of linking a valid access pattern to a specific user is negligible in our design. Experimental results show that PEO-Store outperforms state-of-the-art methods, achieving up to 3 times higher average throughput and saving 74% of storage space.
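
    As a rough illustration of the delegation idea only, the toy sketch below has each client hand already-encrypted blocks to a randomly chosen peer, so the cloud sees the delegate's identity rather than the data owner's. The zero-knowledge redundancy proofs, discrete-log key sharing, and delivery proofs are omitted; all names, the XOR "cipher", and fingerprint-based deduplication are assumptions for illustration, not PEO-Store's actual protocol.

```python
# Toy delegate-based upload path; not a secure or complete protocol.
import hashlib
import random

class Cloud:
    def __init__(self):
        self.blocks = {}  # fingerprint -> ciphertext, stored once (deduplicated)

    def put(self, delegate_id: str, fingerprint: str, ciphertext: bytes) -> str:
        # The cloud only ever learns which delegate uploaded the block.
        if fingerprint not in self.blocks:
            self.blocks[fingerprint] = ciphertext
        return fingerprint

class Client:
    def __init__(self, name: str, peers: list, cloud: Cloud):
        self.name, self.peers, self.cloud = name, peers, cloud

    def store(self, plaintext: bytes, key: bytes) -> str:
        # Toy cipher only; dedup across users would additionally require that
        # equal plaintexts yield equal ciphertexts (e.g. convergent encryption).
        ciphertext = bytes(b ^ key[i % len(key)] for i, b in enumerate(plaintext))
        fingerprint = hashlib.sha256(ciphertext).hexdigest()
        delegate = random.choice([p for p in self.peers if p != self.name])
        # The delegate, not the data owner, talks to the cloud.
        return self.cloud.put(delegate, fingerprint, ciphertext)

if __name__ == "__main__":
    cloud = Cloud()
    peers = ["alice", "bob", "carol"]
    alice = Client("alice", peers, cloud)
    fp = alice.store(b"backup block", key=b"shared-key")
    print("stored under", fp[:16], "...; unique blocks in cloud:", len(cloud.blocks))
```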

    TailoredRE: A Personalized Cloud-based Traffic Redundancy Elimination for Smartphones

    The exceptional rise in the usage of mobile devices such as smartphones and tablets has contributed to a massive increase in wireless network traffic, both cellular (3G/4G/LTE) and WiFi. This unprecedented growth in wireless traffic not only strains the batteries of mobile devices but also bogs down the last-hop wireless access links. Interestingly, a significant portion of this traffic is highly redundant because popular web content is accessed repeatedly. Hence, a good amount of research in both academia and industry has studied, analyzed, and designed systems that attempt to eliminate redundancy in network traffic. Several existing Traffic Redundancy Elimination (TRE) solutions either do not improve last-hop wireless access links or make inefficient use of the compute resources of resource-constrained mobile devices. In this research, we propose TailoredRE, a personalized cloud-based traffic redundancy elimination system. The main objective of TailoredRE is to tailor the TRE mechanism so that TRE is performed for selected applications rather than application-agnostically, improving efficiency by avoiding the caching of unnecessary data chunks. Our system leverages the rich resources of the cloud to conduct TRE, offloading most of the operational cost from smartphones or mobile devices to their clones (proxies) in the cloud. We cluster the individual user clones in the cloud based on measures of connectedness among users, such as usage of similar applications and common interest in specific web content, to improve the efficiency of caching in the cloud. This thesis encompasses the motivation, system design, and a detailed analysis of the results obtained through simulation and a real implementation of the TailoredRE system.
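
    The sketch below illustrates the per-application chunk caching idea described above: a cloud-side clone remembers chunk fingerprints only for the applications selected for TRE and replaces previously seen chunks with short references. Fixed-size chunking, the class and method names, and the chunk size are assumptions for illustration, not TailoredRE's implementation.

```python
# Toy per-application traffic redundancy elimination cache.
import hashlib

CHUNK = 1024  # assumed chunk size

class CloneCache:
    def __init__(self, tre_apps):
        self.tre_apps = set(tre_apps)             # only these apps are deduplicated
        self.seen = {app: set() for app in tre_apps}

    def encode(self, app: str, payload: bytes):
        """Return a list of ('raw', bytes) or ('ref', fingerprint) tokens."""
        if app not in self.tre_apps:
            return [("raw", payload)]             # untailored apps pass through
        tokens = []
        for off in range(0, len(payload), CHUNK):
            chunk = payload[off:off + CHUNK]
            fp = hashlib.sha1(chunk).hexdigest()
            if fp in self.seen[app]:
                tokens.append(("ref", fp))        # already cached: send a reference
            else:
                self.seen[app].add(fp)
                tokens.append(("raw", chunk))
        return tokens

if __name__ == "__main__":
    cache = CloneCache(tre_apps=["news_app"])
    page = b"<html>breaking news</html>" * 100
    cache.encode("news_app", page)                # first fetch populates the cache
    again = cache.encode("news_app", page)        # second fetch is mostly references
    saved = sum(1 for kind, _ in again if kind == "ref")
    print("chunks replaced by references on the second fetch:", saved)
```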

    An Analysis of Storage Virtualization

    Investigating technologies and writing expansive documentation on their capabilities is like hitting a moving target. Technology is evolving, growing, and expanding what it can do every day, which makes it very difficult to draw a line and investigate competing technologies. Storage virtualization is one of those moving targets. Large corporations develop software and hardware solutions that try to one-up the competition by releasing firmware and patch updates that include their latest developments. Some of these innovations include differing RAID levels, virtualized storage, data compression, data deduplication, file deduplication, thin provisioning, new file system types, tiered storage, solid-state disks, and software updates that align these technologies with the applicable hardware. Even data center environmental considerations, such as reusable energy, data center environmental characteristics, and geographic location, are being used by companies both small and large to reduce operating costs and limit environmental impact. Some companies are even moving to an entirely cloud-based setup to limit their environmental impact, since maintaining their own corporate infrastructure can be cost prohibitive. The trifecta of integrating smart storage architectures that include storage virtualization technologies, reducing footprint to promote energy savings, and migrating to cloud-based services will ensure a long-term sustainable storage subsystem.

    DataComp: In search of the next generation of multimodal datasets

    Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai
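
    As a rough illustration of the kind of filtering step the benchmark asks participants to design, the sketch below keeps only candidate image-text pairs whose precomputed image-text similarity exceeds a threshold and whose captions are non-trivial. The record fields, the 0.3 threshold, and the caption-length rule are assumptions for illustration, not DataComp's prescribed baselines.

```python
# Toy candidate-pool filter for an image-text dataset.
from dataclasses import dataclass

@dataclass
class Pair:
    image_url: str
    caption: str
    similarity: float  # e.g. cosine similarity from a pretrained image-text model

def filter_pool(pool, threshold=0.3, min_caption_words=3):
    """Return the subset of the candidate pool that passes the filter."""
    return [
        p for p in pool
        if p.similarity >= threshold and len(p.caption.split()) >= min_caption_words
    ]

if __name__ == "__main__":
    pool = [
        Pair("http://example.com/cat.jpg", "a cat sleeping on a sofa", 0.41),
        Pair("http://example.com/ad.jpg", "buy now", 0.12),
    ]
    kept = filter_pool(pool)
    print(f"kept {len(kept)} of {len(pool)} pairs")
```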

    Improving Data Management and Data Movement Efficiency in Hybrid Storage Systems

    University of Minnesota Ph.D. dissertation. July 2017. Major: Computer Science. Advisor: David Du. 1 computer file (PDF); ix, 116 pages. In the big data era, the large volumes of data being continuously generated drive the emergence of high-performance, large-capacity storage systems. To reduce the total cost of ownership, storage systems are built in a more composite way with many different types of emerging storage technologies and devices, including Storage Class Memory (SCM), Solid State Drives (SSD), Shingled Magnetic Recording (SMR), Hard Disk Drives (HDD), and even off-premise cloud storage. To make better use of each type of storage, industry provides multi-tier storage that dynamically places hot data in the faster tiers and cold data in the slower tiers. Data movement happens between devices within a single system as well as between devices connected via various networks. Toward improving data management and data movement efficiency in such hybrid storage systems, this work makes the following contributions. First, to bridge the large semantic gap between applications and modern storage systems, passing a small piece of useful information (I/O access hints) from upper layers to the block storage layer can greatly improve application performance or ease data management in heterogeneous storage systems. We present and develop a generic and flexible framework, called HintStor, to execute and evaluate various I/O access hints on heterogeneous storage systems with minor modifications to the kernel and applications. The design of HintStor contains a new application/user-level interface, a file system plugin, and a block storage data manager. With HintStor, storage systems composed of various storage devices can perform pre-devised data placement, space reallocation, and data migration policies assisted by the added access hints. Second, each storage device or technology has its own price-performance tradeoffs and idiosyncrasies with respect to the workload characteristics it best supports. To explore internal access patterns and thus efficiently place data on storage systems with fully connected differential pools (data can move from one device to any other device instead of tier by tier, and each pool consists of storage devices of a particular type), we propose a chunk-level, storage-aware workload analyzer framework, ChewAnalyzer. With ChewAnalyzer, the storage manager can adequately distribute and move data chunks across different storage pools. Third, to reduce the duplicate content transferred between local storage devices and devices in remote data centers, an inline Network Redundancy Elimination (NRE) process with a Content-Defined Chunking (CDC) policy can achieve a higher Redundancy Elimination (RE) ratio but may require considerably more computation than fixed-size chunking. We build an inline NRE appliance that incorporates an improved FPGA-based scheme to speed up CDC processing. To efficiently utilize the hardware resources, the whole NRE process is handled by a Virtualized NRE (VNRE) controller. The uniqueness of this VNRE lies in its ability to exploit the redundancy patterns of different TCP flows and customize the chunking process to achieve a higher RE ratio.
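
    The sketch below illustrates Content-Defined Chunking (CDC), the policy the NRE appliance above accelerates in hardware: a boundary is declared wherever a content-dependent rolling hash matches a mask, so boundaries realign after insertions or deletions, unlike fixed-size chunking. The gear table, mask, and size limits are assumptions for illustration, not the parameters of the FPGA implementation.

```python
# Toy gear-hash-based content-defined chunking.
import hashlib
import os
import random

random.seed(1)
GEAR = [random.getrandbits(32) for _ in range(256)]  # one random value per byte

def cdc_chunks(data: bytes, mask=(1 << 13) - 1, min_size=2048, max_size=65536):
    """Yield variable-size chunks whose boundaries depend on content."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF      # gear-style rolling hash
        size = i - start + 1
        if size >= min_size and ((h & mask) == 0 or size >= max_size):
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

if __name__ == "__main__":
    base = os.urandom(200_000)
    shifted = b"x" * 10 + base                        # simulate an insertion
    a = {hashlib.sha1(c).hexdigest() for c in cdc_chunks(base)}
    b = {hashlib.sha1(c).hexdigest() for c in cdc_chunks(shifted)}
    print(f"chunks shared after a 10-byte insertion: {len(a & b)} of {len(a)}")
```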

    Availability and Preservation of Scholarly Digital Resources

    The dynamic, decentralized World Wide Web has become an essential part of scientific research and communication, representing a relatively new medium for conveying scientific thought and discovery. Researchers create thousands of web sites every year to share software, data, and services. Unlike those for books and journals, however, preservation systems for the web are not yet mature. This carries implications that go to the core of science: the ability to examine another's sources in order to understand and reproduce their work. These valuable resources have been documented as disappearing over time in several subject areas. This dissertation examines the problem by performing a cross-disciplinary investigation, testing the effectiveness of existing remedies and introducing new ones. As part of the investigation, 14,489 unique web pages found in the abstracts within Thomson Reuters' Web of Science citation index were accessed. The median lifespan of these web pages was found to be 9.3 years, with 62% of them being archived. Survival analysis and logistic regression identified significant predictors of URL lifespan, including the year a URL was published, the number of times it was cited, its depth, and its domain. Statistical analysis revealed biases in current static web-page solutions.
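
    A minimal sketch of the survival-analysis step described above: estimate URL lifespan with a Kaplan-Meier fit, treating still-reachable URLs as right-censored observations. The toy durations below are assumptions for illustration; the dissertation's dataset and covariates (publication year, citation count, depth, domain) are not reproduced here.

```python
# Toy Kaplan-Meier estimate of URL lifespan using the lifelines library.
from lifelines import KaplanMeierFitter

# Years each URL was observed, and whether it eventually went dead (1) or was
# still reachable at the end of the study (0, i.e. right-censored).
years_alive = [2.1, 9.3, 4.0, 12.5, 7.8, 15.0]
went_dead   = [1,   1,   1,   0,    1,   0]

kmf = KaplanMeierFitter()
kmf.fit(durations=years_alive, event_observed=went_dead)
print("estimated median URL lifespan (years):", kmf.median_survival_time_)
```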