Search CORE

32 research outputs found

A survey and classification of storage deduplication systems

Author: Anand Ashok
Arcangeli Andrea
Berliner Brian
Bolosky William J.
Broder Andrei
Chen Feng
Chute Christopher
Clements Austin T.
Collberg Christian
Debnath Biplob
Dong Wei
Douglis Fred
Dubnicki Cezary
Dutch
El-Shimi Ahmed
Eshghi Kave
Guo Fanglu
Gupta Aayush
Hong Bo
José Pereira
João Paulo
Kruus Erik
Liguori Anthony
Lillibridge Mark
Lu Guanlin
Manber Udi
Milos Grzegorz
Nath Partho
Ng Chun-Ho
Quinlan Sean
Rhea Sean
Shilane Philip
Srinivasan Kiran
Suzaki Kuniyasu
Tarasov Vasily
Ungureanu Cristian
Wright Jeff
Xia Wen
You Lawrence
Zhu Benjamin
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/07/2014

The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique to reduce storage costs. Thus, it has been applied to different storage types, including archives and backups, primary storage, within solid state disks, and even to random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, leading to the relevance of new research and development being underestimated. The first contribution of this paper is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. This classification identifies and describes the different approaches used for each of them. As a second contribution, we describe which combinations of these design decisions have been proposed and found more useful for challenges in each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed. This work is funded by the European Regional Development Fund (ERDF) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the Fundação para a Ciência e a Tecnologia (FCT; Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156 and by the FCT through PhD scholarship SFRH-BD-71372-2010.
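The general approach the abstract refers to can be illustrated with a minimal sketch. It assumes fixed-size chunks, SHA-256 fingerprints, and an in-memory index; the class `DedupStore` and its methods are hypothetical names chosen for illustration and are not taken from the survey, which classifies many other design choices.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps one copy of each distinct chunk (illustrative only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> chunk bytes (the index)
        self.files = {}   # file name -> ordered list of fingerprints (the recipe)

    def write(self, name, data):
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this content has not been seen before.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.files[name] = recipe

    def read(self, name):
        # Rebuild the file by following its recipe through the index.
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())
```

Writing `b"AAAABBBBAAAA"` with `chunk_size=4` stores two distinct chunks rather than three, while `read` still returns the original bytes.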

Universidade do Minho: RepositoriUM

Crossref

Towards an accurate evaluation of deduplicated storage systems

Author: Paulo João
Pereira José
Reis Pedro
Sousa António Luís
Publication venue: CRL Publishing
Publication date: 01/11/2013

Deduplication has proven to be a valuable technique for eliminating duplicate data in backup and archival systems and is now being applied to new storage environments with distinct requirements and performance trade-offs. Namely, deduplication systems are now targeting large-scale cloud computing storage infrastructures holding unprecedented data volumes with a significant share of duplicate content. It is, however, hard to assess the usefulness of deduplication in particular settings and which techniques provide the best results. In fact, existing disk I/O benchmarks follow simplistic approaches for generating data content, leading to unrealistic amounts of duplicates that do not evaluate deduplication systems accurately. Moreover, deduplication systems are now targeting heterogeneous storage environments, with specific duplication ratios, that benchmarks must also simulate. We address these issues with DEDISbench, a novel micro-benchmark for evaluating disk I/O performance of block-based deduplication systems. As the main contribution, DEDISbench generates content by following realistic duplicate content distributions extracted from real datasets. Then, as a second contribution, we analyze and extract the duplicates found on three real storage systems, proving that DEDISbench can easily simulate several workloads. The usefulness of DEDISbench is shown by comparing it with the Bonnie++ and IOzone open-source disk I/O micro-benchmarks on assessing two open-source deduplication systems, Opendedup and Lessfs, using Ext4 as a baseline. Our results lead to novel insights into the performance of these file systems. This work is funded by ERDF - European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156 and by FCT through Ph.D. scholarship SFRH-BD-71372-2010.

Universidade do Minho: RepositoriUM

Fragmentation in storage systems with duplicate elimination

Author: Kaczmarczyk Michał [APD]

Deduplication inevitably results in data fragmentation, because logically continuous data is scattered across many disk locations. Even though this significantly increases restore time from backup, the problem is still not well examined. In this work I close this gap by designing algorithms that reduce the negative impact of fragmentation on restore time for two major types of fragmentation: internal and inter-version. Internal stream fragmentation is caused by blocks appearing many times within a single backup. This phenomenon occurs surprisingly often and can result in up to three times lower restore bandwidth. With an algorithm utilizing available forward knowledge to enable efficient caching, I managed to improve this result on average by 62%-88% with only about 5% extra memory used. Although these results are achieved with limited forward knowledge, they are very close to the ones measured with no such limitation. Inter-version fragmentation is caused by duplicates from previous backups of the same backup set. Since such duplicates are very common due to repeated full backups containing a lot of unchanged data, this type of fragmentation may double the restore time after even a few backups. The context-based rewriting algorithm minimizes this effect by selectively rewriting a small percentage of duplicates during backup, limiting the bandwidth drop from 21.3% to 2.48% on average with only a small increase in writing time and temporary space overhead. The two algorithms combine into a very effective symbiosis, resulting in an average 142% restore bandwidth increase with a standard 256 MB of per-stream cache memory. In many cases such a setup achieves results close to the theoretical maximum achievable with unlimited cache size. Moreover, all the above experiments were performed assuming only one spindle, even though in the majority of today's systems many spindles are used. In a sample setup with ten spindles, the restore bandwidth results are on average 5 times higher than in the standard LRU case.
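To illustrate why forward knowledge helps a restore cache, the sketch below compares LRU eviction with eviction of the block whose next use lies furthest ahead. It is a simplified stand-in that assumes full lookahead over the restore sequence; it is not the thesis's algorithm, which works with limited forward knowledge, and the function names are hypothetical.

```python
from collections import OrderedDict


def lru_misses(sequence, capacity):
    """Count cache misses when evicting the least recently used block."""
    cache = OrderedDict()
    misses = 0
    for block in sequence:
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # drop the least recently used block
        cache[block] = True
    return misses


def forward_knowledge_misses(sequence, capacity):
    """Count misses when evicting the block needed furthest in the future."""
    cache = set()
    misses = 0
    for position, block in enumerate(sequence):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(candidate):
                # Position of the next read of this block, or infinity if none.
                try:
                    return sequence.index(candidate, position + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses
```

For the restore sequence `[1, 2, 3, 1, 2, 3]` with room for two blocks, LRU misses on every read, while the forward-knowledge policy misses only four times.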

Repozytorium UW

Data Reduction and Deep-Learning Based Recovery for Geospatial Visualization and Satellite Imagery

Author: Tasnim Jarin
Publication venue: 'University of Saskatchewan Library'
Publication date: 16/03/2021

The storage, retrieval and distribution of data are critical aspects of big data management. Data scientists and decision-makers often need to share large datasets and make decisions on archiving or deleting historical data to cope with resource constraints. As a consequence, there is an urgent need to reduce storage and transmission requirements. A potential approach to mitigate such problems is to reduce big datasets into smaller ones, which will not only lower storage requirements but also allow light load transfer over the network. High dimensional data often exhibit high repetitiveness and paradigm across different dimensions. Data carefully prepared by removing redundancies, along with a machine learning model capable of reconstructing the whole dataset from its reduced version, can improve storage scalability and data transfer, and speed up the overall data management pipeline. In this thesis, we explore some data reduction strategies for big datasets, while ensuring that the data can be transferred and used ubiquitously by all stakeholders, i.e., the entire dataset can be reconstructed with high quality whenever necessary. One of our data reduction strategies follows a straightforward uniform pattern, which guarantees a minimum of 75% data size reduction. We also propose a novel variance-based reduction technique, which focuses on removing only redundant data and offers an additional 1% to 2% deletion rate. We have adopted various traditional machine learning and deep learning approaches for high-quality reconstruction. We evaluated our pipelines with big geospatial data and satellite imagery. Among them, our deep learning approaches have performed very well both quantitatively and qualitatively, with the capability of reconstructing high quality features. We also show how to leverage temporal data for better reconstruction. For uniform deletion, the reconstruction accuracy observed is as high as 98.75% on average for spatial meteorological data (e.g., soil moisture and albedo), and 99.09% for satellite imagery. When the deletion rate is pushed further using the variance-based deletion method, the decrease in accuracy remains within 1% for spatial meteorological data and 7% for satellite imagery.
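The 75% figure is what one would obtain by keeping a single sample from every 2x2 tile of a grid. The sketch below assumes that pattern, which the abstract does not spell out, and uses nearest-neighbour filling as a placeholder for the learned reconstruction models described in the thesis; all function names are hypothetical.

```python
def uniform_reduce(grid):
    """Keep the top-left sample of every 2x2 tile (assumed uniform pattern)."""
    return [row[::2] for row in grid[::2]]


def reduction_ratio(grid, reduced):
    """Fraction of samples removed by the reduction."""
    original = sum(len(row) for row in grid)
    kept = sum(len(row) for row in reduced)
    return 1 - kept / original


def nearest_reconstruct(reduced, rows, cols):
    """Fill each removed sample with the kept sample of its tile.

    This is only a stand-in for the machine learning reconstruction
    the thesis evaluates.
    """
    return [[reduced[r // 2][c // 2] for c in range(cols)] for r in range(rows)]
```

On a grid with even dimensions, such as 4x4, this keeps 4 of 16 samples, so `reduction_ratio` is exactly 0.75.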

University of Saskatchewan Research Archive

Making Data Storage Efficient in the Era of Cloud Computing

Author: Tang Yang
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2020

We entered the era of cloud computing in the last decade, as many paradigm shifts are happening in how people write and deploy applications. Despite the advancement of cloud computing, data storage abstractions have not evolved much, causing inefficiencies in performance, cost, and security. This dissertation proposes a novel approach to make data storage efficient in the era of cloud computing by building new storage abstractions and systems that bridge the gap between cloud computing and data storage and simplify development. We build four systems to address four data inefficiencies in cloud computing. The first system, Grandet, solves the data storage inefficiency caused by the paradigm shift from upfront provisioning to a variety of pay-as-you-go cloud services. Grandet is an extensible storage system that significantly reduces storage costs for web applications deployed in the cloud. Under the hood, it supports multiple heterogeneous stores and unifies them by placing each data object at the store deemed most economical. Our results show that Grandet reduces their costs by an average of 42.4%, and it is fast, scalable, and easy to use. The second system, Unic, solves the data inefficiency caused by the paradigm shift from single-tenancy to multi-tenancy. Unic securely deduplicates general computations. It exports a cache service that allows cloud applications running on behalf of mutually distrusting users to memoize and reuse computation results, thereby improving performance. Unic achieves both integrity and secrecy through a novel use of code attestation, and it provides a simple yet expressive API that enables applications to deduplicate their own rich computations. Our results show that Unic is easy to use and speeds up applications by an average of 7.58x with little storage overhead. The third system, Lambdata, solves the data inefficiency caused by the paradigm shift to serverless computing, where developers only write core business logic, and cloud service providers maintain all the infrastructure. Lambdata is a novel serverless computing system that enables developers to declare a cloud function's data intents, including both data read and data written. Once data intents are made explicit, Lambdata performs a variety of optimizations to improve speed, including caching data locally and scheduling functions based on code and data locality. Our results show that Lambdata achieves an average speedup of 1.51x on the turnaround time of practical workloads and reduces monetary cost by 16.5%. The fourth system, CleanOS, solves the data inefficiency caused by the paradigm shift from desktop computers to smartphones always connected to the cloud. CleanOS is a new Android-based operating system that manages sensitive data rigorously and maintains a clean environment at all times. It identifies and tracks sensitive data, encrypts it with a key, and evicts that key to the cloud when the data is not in active use on the device. Our results show that CleanOS limits sensitive-data exposure drastically while incurring acceptable overheads on mobile networks.

Columbia University Academic Commons

A Survey on the Integration of NAND Flash Storage in the Design of File Systems and the Host Storage Software Stack

Author: Doekemeijer Krijn
Tehrany Nick
Trivedi Animesh
Publication date: 21/07/2023

With the ever-increasing amount of data generated in the world, estimated to reach over 200 Zettabytes by 2025, pressure on efficient data storage systems is intensifying. The shift from HDD to flash-based SSD is one of the most fundamental shifts in storage technology, increasing performance capabilities significantly. However, flash storage comes with different characteristics than prior HDD storage technology. Therefore, existing storage software was unsuitable for leveraging the capabilities of flash storage. As a result, a plethora of storage applications have been designed to better integrate with flash storage and align with flash characteristics. In this literature study we evaluate the effect the introduction of flash storage has had on the design of file systems, which provide one of the most essential mechanisms for managing persistent storage. We analyze the mechanisms for effectively managing flash storage, managing overheads of introduced design requirements, and leveraging the capabilities of flash storage. Numerous methods have been adopted in file systems; however, they predominantly revolve around similar design decisions, adhering to the flash hardware constraints and limiting software intervention. Future design of storage software remains prominent with the constant growth in flash-based storage devices and interfaces, providing an increasing possibility to enhance flash integration in the host storage software stack.

arXiv.org e-Print Archive

A Survey on the Integration of NAND Flash Storage in the Design of File Systems and the Host Storage Software Stack

Author: Doekemeijer Krijn
Tehrany Nick
Trivedi Animesh
Publication date: 21/07/2023

VU Research Portal