Data Deduplication Technology for Cloud Storage
With the explosive growth of digital data, storage systems have entered the cloud storage era. Although distributed file systems form the core of cloud storage systems and solve the problem of storing massive data, large amounts of duplicate data exist in every storage system. File systems control how files are stored and retrieved, yet few studies address deduplication in cloud file systems at the application level, especially for the Hadoop distributed file system (HDFS). In this paper, we design a file deduplication framework on HDFS for cloud application developers. The proposed RFD-HDFS and FD-HDFS deduplication solutions perform deduplication online, which improves storage space utilisation and reduces redundancy. At the end of the paper, we evaluate disk utilisation and file upload performance on RFD-HDFS and FD-HDFS, and compare the disk utilisation of the two frameworks against plain HDFS. The results show that both frameworks not only implement data deduplication but also effectively reduce the disk space consumed by duplicate files. The proposed framework can therefore reduce storage space by eliminating redundant HDFS files.
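The abstract does not describe the internals of RFD-HDFS or FD-HDFS, but the general idea of online file-level deduplication can be sketched as follows: fingerprint each incoming file by a cryptographic hash, store the bytes only the first time that fingerprint is seen, and record a reference otherwise. The class and method names below are invented for illustration.

```python
import hashlib

class FileDedupIndex:
    """Toy file-level deduplication index (illustrative only; the paper's
    RFD-HDFS/FD-HDFS internals are not reproduced here)."""

    def __init__(self):
        self.index = {}   # fingerprint -> path of the stored copy
        self.refs = {}    # fingerprint -> reference count

    def fingerprint(self, data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def upload(self, path: str, data: bytes) -> bool:
        """Return True if the bytes were actually stored, False if an
        identical file already existed and only a reference was added."""
        fp = self.fingerprint(data)
        if fp in self.index:
            self.refs[fp] += 1          # duplicate: record a reference only
            return False
        self.index[fp] = path           # unique content: store once
        self.refs[fp] = 1
        return True

idx = FileDedupIndex()
stored_first = idx.upload("/user/a/report.pdf", b"same bytes")
stored_dup = idx.upload("/user/b/copy.pdf", b"same bytes")
```

Uploading the second, byte-identical file consumes no extra storage; only its reference count grows, which is the source of the disk-utilisation savings the paper measures.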
Doctor of Philosophy dissertation
In the past few years, we have seen a tremendous increase in the amount of digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion, and Facebook had stored 20 PB of photos by 2010. All of this data requires an efficient storage solution. To improve space efficiency, compression and deduplication are widely used. Compression identifies repeated strings and replaces them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate chunks. While these two approaches yield great improvements in space efficiency, they still have limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range, since they search for redundant data at a fine granularity (string level). For deduplication, metadata embedded in an input file changes more frequently than the data itself, which introduces unnecessary unique chunks and leads to poor deduplication. In addition, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, as well as a new I/O scheduling algorithm to improve performance predictability and efficiency in cloud storage systems. The common idea is to exploit similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse granularity (block level) and groups similar blocks together. It can be used as a preprocessing stage for any traditional compressor.
We find that metadata can greatly reduce the benefit of deduplication. To isolate the impact of metadata, we propose separating metadata from data, and present three approaches for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and a new tar format that deduplicates better. We also present a case study in which deduplication reduces storage consumption for disk images while achieving high performance in image deployment. Finally, we apply the same principle of exploiting similarity to I/O scheduling, to prevent interference between random and sequential workloads; this yields efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
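The variable-size (content-defined) chunking that the dissertation contrasts with fixed-size chunking can be illustrated with a toy rolling-hash chunker: boundaries are chosen by the content itself, so an insertion early in a file shifts only nearby chunk boundaries rather than all of them. The window size, mask, and length bounds below are invented parameters, and real systems use stronger rolling hashes (e.g., Rabin fingerprints) than this rolling byte sum.

```python
def cdc_chunks(data: bytes, window=16, mask=0x3F, min_len=32, max_len=1024):
    """Toy content-defined chunking: cut whenever a rolling sum over the
    last `window` bytes satisfies a boundary condition, subject to
    minimum and maximum chunk lengths."""
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i - start >= window:
            rolling -= data[i - window]   # slide the window forward
        length = i - start + 1
        if (length >= min_len and (rolling & mask) == 0) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0     # start a fresh chunk
    if start < len(data):
        chunks.append(data[start:])       # trailing partial chunk
    return chunks

data = bytes(range(256)) * 20
chunks = cdc_chunks(data)
```

Deduplication then fingerprints each chunk (typically with SHA-1 or SHA-256) and stores each unique fingerprint once.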
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines capture a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still needs to be annotated manually, in an ad hoc manner, by domain scientists. Systematic and transparent acquisition of rich metadata is a crucial prerequisite for sustaining and accelerating the pace of scientific innovation. Yet a ubiquitous, domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset so that it permeates metadata-oblivious processes and can be queried through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling, and neuroscience, and then discuss research challenges and possible solutions.
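The idea of storing metadata within each dataset, so that it travels with the data and remains readable through a standard interface, can be sketched with a minimal self-describing container: a length-prefixed JSON metadata header followed by the raw payload. This format and its field names are purely hypothetical; scientific formats such as HDF5 provide this capability natively via per-dataset attributes.

```python
import io
import json
import struct

def write_self_describing(buf, metadata: dict, payload: bytes):
    """Write a hypothetical self-describing container: a 4-byte big-endian
    header length, a JSON metadata header, then the raw payload."""
    header = json.dumps(metadata).encode("utf-8")
    buf.write(struct.pack(">I", len(header)))
    buf.write(header)
    buf.write(payload)

def read_metadata(buf) -> dict:
    """Read only the embedded metadata, without touching the payload."""
    buf.seek(0)
    (hlen,) = struct.unpack(">I", buf.read(4))
    return json.loads(buf.read(hlen).decode("utf-8"))

f = io.BytesIO()
write_self_describing(f, {"experiment": "plasma-run-42", "units": "eV"}, b"\x00" * 16)
meta = read_metadata(f)
```

Because the metadata is embedded in the dataset itself, a metadata-oblivious tool can copy or move the file without losing annotations, while a metadata-aware reader can query them without scanning the payload.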
HIODS: hybrid inline and offline deduplication system
Integrated Master's dissertation in Informatics Engineering
Deduplication is a technique for finding and removing duplicate data in storage systems. With the current exponential growth of digital information, this mechanism is becoming increasingly desirable for reducing the infrastructure costs of persisting such data. Deduplication is therefore now widely applied in storage appliances serving applications with different requirements (e.g., archival, backup, primary storage). However, deduplication requires additional processing for each storage request in order to detect and eliminate duplicate content. Traditionally, this processing is done in the I/O critical path (inline), introducing a performance penalty on the throughput and latency of requests served by the storage appliance. An alternative is to perform it as a background task, outside the I/O critical path (offline), at the cost of additional storage space, since duplicate content is not found and eliminated immediately. The choice between the two strategies, however, is typically made manually and does not take into account changes in the applications' workloads.
This dissertation proposes HIODS, a hybrid deduplication solution capable of automatically switching between inline and offline deduplication according to the requirements (e.g., a desired storage I/O throughput goal) of applications and their dynamic workloads. The goal is to choose the strategy that fulfils the targeted I/O performance objectives while optimising deduplication space savings.
Finally, a prototype of HIODS is implemented and evaluated extensively with different storage workloads. Results show that HIODS is able to change its deduplication mode dynamically, according to the storage workload being served, while balancing I/O performance
and space savings requirements efficiently.
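The hybrid policy described above can be sketched as a simple controller: deduplicate inline while the measured throughput meets the application's goal, and defer deduplication to a background scan when it does not. HIODS's actual decision logic is not reproduced here; the class, thresholds, and return values below are invented for illustration.

```python
import hashlib

class HybridDedup:
    """Illustrative hybrid inline/offline deduplication controller
    (a sketch of the general idea, not the HIODS implementation)."""

    def __init__(self, throughput_goal_mbps: float):
        self.goal = throughput_goal_mbps
        self.mode = "inline"
        self.index = {}          # fingerprint -> id of the first write
        self.offline_queue = []  # writes awaiting the background scan

    def adapt(self, measured_mbps: float):
        """Switch mode based on the currently measured throughput."""
        self.mode = "inline" if measured_mbps >= self.goal else "offline"

    def write(self, write_id: int, block: bytes) -> str:
        fp = hashlib.sha256(block).hexdigest()
        if self.mode == "inline":
            if fp in self.index:
                return "dedup"            # duplicate removed in the I/O path
            self.index[fp] = write_id
            return "stored"
        self.offline_queue.append((write_id, fp))
        return "stored-pending"           # dedup deferred, extra space used

    def offline_scan(self) -> int:
        """Background pass: resolve queued writes against the index."""
        removed = 0
        for write_id, fp in self.offline_queue:
            if fp in self.index:
                removed += 1              # reclaim the duplicate copy
            else:
                self.index[fp] = write_id
        self.offline_queue.clear()
        return removed

d = HybridDedup(throughput_goal_mbps=100.0)
d.adapt(measured_mbps=120.0)   # meeting the goal: stay inline
r1 = d.write(1, b"block-A")
r2 = d.write(2, b"block-A")
d.adapt(measured_mbps=60.0)    # below goal: defer dedup off the critical path
r3 = d.write(3, b"block-A")
removed = d.offline_scan()
```

The trade-off is visible in the example: the inline duplicate is caught immediately, while the offline duplicate temporarily occupies space until the background scan reclaims it.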
HIV transmission networks among transgender women in Los Angeles County, CA, USA: a phylogenetic analysis of surveillance data.
Background: Transgender women are among the groups at highest risk for HIV infection, with a prevalence of 27.7% in the USA; despite this known high risk, undiagnosed infection is common in this population. We set out to identify transgender women and their partners in a molecular transmission network to prioritise public health activities.
Methods: Since 2006, HIV protease and reverse transcriptase gene (pol) sequences from drug resistance testing have been reported to the Los Angeles County Department of Public Health and linked to demographic, gender, and HIV transmission risk factor data for each case in the enhanced HIV/AIDS Reporting System. We reconstructed a molecular transmission network using the HIV-TRAnsmission Cluster Engine (with a pairwise genetic distance threshold of 0.015 substitutions per site) from the earliest pol sequences of 22 398 unique individuals, including 412 (2%) self-identified transgender women. We examined possible predictors of clustering with multivariate logistic regression. We characterised the genetically linked partners of transgender women and calculated assortativity (the tendency for people to link to others with the same attributes) for each transmission risk group.
Findings: 8133 (36.3%) of 22 398 individuals clustered in the network across 1722 molecular transmission clusters. Transgender women who indicated a sexual risk factor clustered at the highest frequency in the network, with 147 (43%) of 345 linked to at least one other person (adjusted odds ratio [aOR] 2.0, p=0.0002). Transgender women were assortative in the network (assortativity 0.06, p<0.001), indicating that they tended to link to other transgender women. Transgender women were more likely than expected to link to other transgender women (OR 4.65, p<0.001) and to cisgender men who did not identify as men who have sex with men (MSM; OR 1.53, p<0.001). Transgender women were less likely than expected to link to MSM (OR 0.75, p<0.001), despite the high prevalence of HIV among MSM. Transgender women were distributed across 126 clusters, and cisgender individuals linked to one transgender woman were 9.2 times more likely than other individuals in the surveillance database to link to a second transgender woman. Reconstruction of the transmission network is limited by sample availability, but sequences were available for more than 40% of diagnoses.
Interpretation: The clustering of transgender women and their observed tendency to link with cisgender men who did not identify as MSM show the potential of molecular epidemiology both to identify clusters likely to include undiagnosed transgender women with HIV and to improve the targeting of public health prevention and treatment services to transgender women.
Funding: California HIV and AIDS Research Program and National Institutes of Health-National Institute of Allergy and Infectious Diseases.
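The thresholded-linkage idea behind the network reconstruction (linking any two sequences whose pairwise genetic distance is at or below 0.015 substitutions per site, then reading clusters off the resulting graph) can be sketched with a union-find pass over the pairwise distances. The sequence IDs and distances below are illustrative, not real surveillance data, and this sketch omits everything else HIV-TRACE does (alignment, distance computation, quality filtering).

```python
def cluster_by_distance(ids, pairwise, threshold=0.015):
    """Group sequences whose pairwise genetic distance is <= threshold,
    using union-find to merge linked pairs into clusters."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), dist in pairwise.items():
        if dist <= threshold:
            parent[find(a)] = find(b)      # link the pair's clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return [sorted(members) for members in clusters.values()]

ids = ["s1", "s2", "s3", "s4"]
pairwise = {("s1", "s2"): 0.010, ("s2", "s3"): 0.040, ("s3", "s4"): 0.012}
clusters = cluster_by_distance(ids, pairwise)
```

Here s1 and s2 fall within the threshold of each other, as do s3 and s4, while the s2-s3 distance exceeds it, so the four sequences form two clusters.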