27 research outputs found

    A survey and classification of storage deduplication systems

    The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique to reduce storage costs. It has therefore been applied to different storage types, including archives and backups, primary storage, solid state disks, and even random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, which leads to the relevance of new research and development being underestimated. The first contribution of this paper is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. This classification identifies and describes the different approaches used for each of them. As a second contribution, we describe which combinations of these design decisions have been proposed and found more useful for the challenges in each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed. This work is funded by the European Regional Development Fund (ERDF) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the Fundacao para a Ciencia e a Tecnologia (FCT; Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156 and by the FCT through PhD scholarship SFRH-BD-71372-2010.

    Fragmentation in storage systems with duplicate elimination

    Deduplication inevitably results in data fragmentation, because logically continuous data is scattered across many disk locations. Even though this significantly increases restore time from backup, the problem is still not well examined. In this work I close this gap by designing algorithms that reduce the negative impact of fragmentation on restore time for two major types of fragmentation: internal and inter-version. Internal stream fragmentation is caused by blocks appearing many times within a single backup. This phenomenon occurs surprisingly often and can reduce restore bandwidth by as much as a factor of three. With an algorithm that uses available forward knowledge to enable efficient caching, I managed to improve this result on average by 62%-88% with only about 5% extra memory used. Although these results are achieved with limited forward knowledge, they are very close to the ones measured with no such limitation. Inter-version fragmentation is caused by duplicates from previous backups of the same backup set. Since such duplicates are very common due to repeated full backups containing a lot of unchanged data, this type of fragmentation may double the restore time after even a few backups. The context-based rewriting algorithm minimizes this effect by selectively rewriting a small percentage of duplicates during backup, limiting the bandwidth drop from 21.3% to 2.48% on average with only a small increase in writing time and temporary space overhead. The two algorithms combined form a very effective symbiosis, resulting in an average 142% restore bandwidth increase with a standard 256 MB of per-stream cache memory. In many cases such a setup achieves results close to the theoretical maximum achievable with unlimited cache size. Moreover, all the above experiments were performed assuming only one spindle, even though the majority of today's systems use many spindles. 
In a sample setup with ten spindles, the restore bandwidth results are on average 5 times higher than in the standard LRU case. Fragmentation is an unavoidable consequence of deduplication, because a single data stream is scattered across many disk locations. This significantly lengthens the time needed to restore data from backups. Nevertheless, the problem is still not well studied. This work fills that gap by proposing algorithms that reduce the negative impact of fragmentation on read time for its two most important types: internal stream fragmentation and fragmentation between different versions of the data. Internal stream fragmentation is caused by blocks that repeat many times within a single data stream. This phenomenon occurs surprisingly often and can lower read performance by as much as a factor of three. The memory management algorithm proposed in this work, which uses available knowledge about the data, can raise read performance by 62-88% while using only 5% additional memory. Fragmentation between different versions of the data is caused by duplicates originating from earlier writes of the same data set. Because full backups are created regularly and contain large amounts of repeated data, such duplicates occur very often. On a later read, their presence can as much as double the time needed to restore the data after only a few backups have been created. The context-based rewriting algorithm minimizes this effect by selectively rewriting a small number of duplicates during the write. 
This approach can limit the average drop in read performance from 21.3% to 2.48%, at the cost of a minimal increase in write time and a small amount of disk space for temporary storage. Used together, the two algorithms work even more effectively, improving read throughput by 142% on average with a standard 256 MB of cache memory per stream. In addition, because the above results assume reading from a single disk, tests were carried out that simulate using the throughput of many disks, since such configurations are very common in today's systems. For example, using ten disks and the proposed algorithms, one can achieve on average five times higher performance than the standard approach with an LRU-type algorithm.
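
The benefit of forward knowledge for restore caching, as described in the abstract above, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis's algorithm: it compares plain LRU with a Belady-style policy that evicts the cached block whose next use lies farthest ahead within a limited look-ahead window. The function names, the `window` parameter, and the toy block stream are all hypothetical.

```python
from collections import OrderedDict

def lru_misses(stream, capacity):
    """Count cache misses for a block stream under plain LRU eviction."""
    cache = OrderedDict()
    misses = 0
    for block in stream:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = True
    return misses

def forward_misses(stream, capacity, window):
    """Count misses when eviction uses a look-ahead of `window` blocks (assumed policy)."""
    cache = set()
    misses = 0
    for i, block in enumerate(stream):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            future = stream[i + 1:i + 1 + window]

            def next_use(b):
                # Distance to the next use; blocks not seen in the window rank last.
                return future.index(b) if b in future else len(future)

            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

# Toy stream with internal duplicates: blocks 1, 2, 3 repeat within one backup.
stream = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(lru_misses(stream, 2), forward_misses(stream, 2, 8))
```

On this toy stream LRU misses on every access, while the look-ahead policy keeps the block that is needed soonest and so avoids some of the re-reads.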

    Doctor of Philosophy

    Dissertation. In the past few years, we have seen a tremendous increase in digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require an efficient storage solution. To improve space efficiency, compression and deduplication are being widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate blocks. While we have seen great improvements in space efficiency from these two approaches, there are still some limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range, since they search for redundant data at a fine-grained level (string level). For deduplication, metadata embedded in an input file changes more frequently, and this introduces more unnecessary unique chunks, leading to poor deduplication. Cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new IO scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to utilize similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse-grained level (block level) and then groups similar blocks together. It can be used as a preprocessing stage for any traditional compressor. 
We find that metadata has a huge impact in reducing the benefit of deduplication. To isolate the impact of metadata, we propose to separate metadata from data. Three approaches are presented for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and also a new tar format that deduplicates better. We also present a case study where we use deduplication to reduce storage consumption for storing disk images, while at the same time achieving high performance in image deployment. Finally, we apply the same principle of utilizing similarity in IO scheduling to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
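
The abstract above describes deduplication as partitioning data into chunks and removing duplicate blocks. A minimal sketch of the fixed-size variant follows, assuming SHA-256 fingerprints and an in-memory dictionary as the chunk store; the function names and the 4096-byte chunk size are illustrative choices, not taken from the dissertation.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each distinct chunk once."""
    store = {}   # fingerprint -> chunk bytes (unique chunks only)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # first copy wins, repeats are dropped
        recipe.append(fingerprint)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the chunk store and the recipe."""
    return b"".join(store[fp] for fp in recipe)

# Three identical chunks and one different chunk: only two chunks are stored.
data = b"A" * 4096 * 3 + b"B" * 4096
store, recipe = deduplicate(data)
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

With fixed-size chunks, inserting a single byte near the start of the input shifts every later chunk boundary, which is one reason variable-size chunking is also used.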

    Survey on Deduplication Techniques in Flash-Based Storage

    The importance of data deduplication is growing along with data volumes, and the field is under active development. Recently it has been influenced by the appearance of the solid-state drive (SSD). This type of drive differs significantly from random access memory and hard disk drives and is now widely used. In this paper we propose a novel taxonomy that reflects the main issues related to deduplication in SSDs. We present a survey of deduplication techniques focusing on flash-based storage. We also describe several open-source tools that implement data deduplication and briefly describe open research problems related to data deduplication in flash-based storage systems.

    Off-line Deduplication Method for Solid-State Disk Based on Hot and Cold Data

    Solid-state disk (SSD) deduplication refers to the identification and deletion of duplicate data stored in an SSD. The reliability of SSDs is improved by deduplication. At present, SSD deduplication is commonly performed online with Field Programmable Gate Array (FPGA) acceleration. The disadvantage is that the FPGA has a complex structure. An off-line deduplication method for SSDs based on hot and cold data was proposed in this study to simplify the structure of SSD deduplication, reduce the cost, and improve the efficiency of deduplication and the access performance of SSDs. First, the wear-leveling algorithm was employed in the SSD to divide the data into cold and hot. Then, the corresponding fingerprint was generated for the cold data. Second, the fingerprints were compared, and the cold data with the same fingerprint were deleted. Finally, the cold and hot data were exchanged after deduplication. Results demonstrate that the duplicate recognition rate of the proposed method is 5% - 38%, which is close to that of the online deduplication method. In terms of access performance, the performance of SSDs using the proposed method is improved by 20% compared with that of traditional SSDs and is near the access performance of SSDs using online deduplication. This study provides a reference for improving the reliability of existing SSDs.
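
The fingerprint-and-compare steps in the abstract above can be sketched as follows. This is a simplified model under stated assumptions, not the paper's implementation: the hot/cold classification is taken as given (the paper derives it from wear leveling), the final exchange of cold and hot data is not modelled, and the function name, the address-to-data dictionary, and the choice of SHA-256 are all hypothetical.

```python
import hashlib

def offline_dedup(blocks: dict, hot: set):
    """Deduplicate cold blocks only; hot blocks are left untouched."""
    physical = {}  # address -> data that is actually kept
    mapping = {}   # logical address -> address that holds its data
    seen = {}      # fingerprint -> address of the first cold copy
    for addr in sorted(blocks):
        data = blocks[addr]
        if addr in hot:
            # Hot data is skipped by the off-line pass.
            physical[addr] = data
            mapping[addr] = addr
            continue
        fingerprint = hashlib.sha256(data).digest()
        if fingerprint in seen:
            # Duplicate cold block: drop it and point at the first copy.
            mapping[addr] = seen[fingerprint]
        else:
            seen[fingerprint] = addr
            physical[addr] = data
            mapping[addr] = addr
    return physical, mapping

blocks = {0: b"x", 1: b"x", 2: b"y", 3: b"x"}
physical, mapping = offline_dedup(blocks, hot={3})
print(sorted(physical), mapping)
```

Block 3 holds the same data as block 0 but is kept because it is marked hot, which shows why an off-line method that skips hot data can recognize fewer duplicates than an online one.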

    ON OPTIMIZATIONS OF VIRTUAL MACHINE LIVE STORAGE MIGRATION FOR THE CLOUD

    Virtual Machine (VM) live storage migration is widely performed in the data centers of the Cloud, for the purposes of load balancing, reliability, availability, hardware maintenance, and system upgrade. It entails moving all the state information of the VM being migrated, including memory state, network state, and storage state, from one physical server to another within the same data center or across different data centers. To minimize its performance impact, this migration process is required to be transparent to applications running within the migrating VM, meaning that applications will keep running inside the VM as if there were no migration operations at all. In this dissertation, a thorough literature review is conducted to provide a big picture of the VM live storage migration process, its problems, and existing solutions. After an in-depth examination, we observe that severe IO interference exists between the VM IO threads and the migration IO threads and causes both types of IO threads to suffer from performance degradation. This interference stems from the fact that both types of IO threads share the same critical IO path by reading from and writing to the same shared storage system. Owing to IO resource contention and interference between the two types of IO requests, not only will the IO request queue in the storage system lengthen, but the time-consuming disk seek operations will also become more frequent. Based on this fundamental observation, this dissertation research presents three related but orthogonal solutions that tackle the IO interference problem in order to improve VM live storage migration performance. First, we introduce the Workload-Aware IO Outsourcing scheme, called WAIO, to improve VM live storage migration efficiency. 
Second, we address this problem by proposing a novel scheme, called SnapMig, to improve VM live storage migration efficiency and eliminate its performance impact on user applications at the source server by effectively leveraging the existing VM snapshots in the backup servers. Third, we propose the IOFollow scheme to improve both VM performance and migration performance simultaneously. Finally, we outline directions for future research. Advisor: Hong Jian