Data Deduplication Technology for Cloud Storage
With the explosive growth of digital data, storage systems have entered the cloud storage era. Although distributed file systems form the core of cloud storage systems and solve the problem of storing massive data, large amounts of duplicate data exist in every storage system. File systems control how files are stored and retrieved, yet few studies address deduplication in cloud file systems at the application level, especially for the Hadoop distributed file system (HDFS). In this paper, we design a file deduplication framework on HDFS for cloud application developers. The proposed RFD-HDFS and FD-HDFS deduplication solutions perform deduplication online, which improves storage space utilisation and reduces redundancy. At the end of the paper, we evaluate disk utilisation and file upload performance on RFD-HDFS and FD-HDFS, and compare the disk utilisation of the two frameworks against plain HDFS. The results show that both frameworks not only implement data deduplication but also effectively reduce the disk space consumed by duplicate files. The proposed framework can therefore reduce storage space by eliminating redundant HDFS files.
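The abstract does not describe the internals of RFD-HDFS or FD-HDFS, but the general idea of online file-level deduplication can be sketched as follows: fingerprint each incoming file by a cryptographic hash, store the bytes only the first time that fingerprint is seen, and record a reference otherwise. The class and method names below are invented for illustration.

```python
import hashlib

class FileDedupIndex:
    """Toy file-level deduplication index (illustrative only; the paper's
    RFD-HDFS/FD-HDFS internals are not reproduced here)."""

    def __init__(self):
        self.index = {}   # fingerprint -> path of the stored copy
        self.refs = {}    # fingerprint -> reference count

    def fingerprint(self, data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def upload(self, path: str, data: bytes) -> bool:
        """Return True if the bytes were actually stored, False if an
        identical file already existed and only a reference was added."""
        fp = self.fingerprint(data)
        if fp in self.index:
            self.refs[fp] += 1          # duplicate: record a reference only
            return False
        self.index[fp] = path           # unique content: store once
        self.refs[fp] = 1
        return True

idx = FileDedupIndex()
stored_first = idx.upload("/user/a/report.pdf", b"same bytes")
stored_dup = idx.upload("/user/b/copy.pdf", b"same bytes")
```

Uploading the second, byte-identical file consumes no extra storage; only its reference count grows, which is the source of the disk-utilisation savings the paper measures.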
Doctor of Philosophy dissertation
In the past few years, we have seen a tremendous increase in the amount of digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion, and Facebook had stored 20 PB of photos by 2010. All of this data requires an efficient storage solution. To improve space efficiency, compression and deduplication are widely used. Compression identifies repeated strings and replaces them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate chunks. While these two approaches yield great improvements in space efficiency, they still have limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range, since they search for redundant data at a fine granularity (string level). For deduplication, metadata embedded in an input file changes more frequently than the data itself, which introduces unnecessary unique chunks and leads to poor deduplication. In addition, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, as well as a new I/O scheduling algorithm to improve performance predictability and efficiency in cloud storage systems. The common idea is to exploit similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse granularity (block level) and groups similar blocks together. It can be used as a preprocessing stage for any traditional compressor.
We find that metadata can greatly reduce the benefit of deduplication. To isolate the impact of metadata, we propose separating metadata from data, and present three approaches for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and a new tar format that deduplicates better. We also present a case study in which deduplication reduces storage consumption for disk images while achieving high performance in image deployment. Finally, we apply the same principle of exploiting similarity to I/O scheduling, to prevent interference between random and sequential workloads; this yields efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
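The variable-size (content-defined) chunking that the dissertation contrasts with fixed-size chunking can be illustrated with a toy rolling-hash chunker: boundaries are chosen by the content itself, so an insertion early in a file shifts only nearby chunk boundaries rather than all of them. The window size, mask, and length bounds below are invented parameters, and real systems use stronger rolling hashes (e.g., Rabin fingerprints) than this rolling byte sum.

```python
def cdc_chunks(data: bytes, window=16, mask=0x3F, min_len=32, max_len=1024):
    """Toy content-defined chunking: cut whenever a rolling sum over the
    last `window` bytes satisfies a boundary condition, subject to
    minimum and maximum chunk lengths."""
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i - start >= window:
            rolling -= data[i - window]   # slide the window forward
        length = i - start + 1
        if (length >= min_len and (rolling & mask) == 0) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0     # start a fresh chunk
    if start < len(data):
        chunks.append(data[start:])       # trailing partial chunk
    return chunks

data = bytes(range(256)) * 20
chunks = cdc_chunks(data)
```

Deduplication then fingerprints each chunk (typically with SHA-1 or SHA-256) and stores each unique fingerprint once.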
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines capture a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still needs to be annotated manually, in an ad hoc manner, by domain scientists. Systematic and transparent acquisition of rich metadata is a crucial prerequisite for sustaining and accelerating the pace of scientific innovation. Yet a ubiquitous, domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset so that it permeates metadata-oblivious processes and can be queried through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling, and neuroscience, and then discuss research challenges and possible solutions.
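The idea of storing metadata within each dataset, so that it travels with the data and remains readable through a standard interface, can be sketched with a minimal self-describing container: a length-prefixed JSON metadata header followed by the raw payload. This format and its field names are purely hypothetical; scientific formats such as HDF5 provide this capability natively via per-dataset attributes.

```python
import io
import json
import struct

def write_self_describing(buf, metadata: dict, payload: bytes):
    """Write a hypothetical self-describing container: a 4-byte big-endian
    header length, a JSON metadata header, then the raw payload."""
    header = json.dumps(metadata).encode("utf-8")
    buf.write(struct.pack(">I", len(header)))
    buf.write(header)
    buf.write(payload)

def read_metadata(buf) -> dict:
    """Read only the embedded metadata, without touching the payload."""
    buf.seek(0)
    (hlen,) = struct.unpack(">I", buf.read(4))
    return json.loads(buf.read(hlen).decode("utf-8"))

f = io.BytesIO()
write_self_describing(f, {"experiment": "plasma-run-42", "units": "eV"}, b"\x00" * 16)
meta = read_metadata(f)
```

Because the metadata is embedded in the dataset itself, a metadata-oblivious tool can copy or move the file without losing annotations, while a metadata-aware reader can query them without scanning the payload.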
HIODS: hybrid inline and offline deduplication system
Integrated Master's dissertation in Informatics Engineering
Deduplication is a technique for finding and removing duplicate data in storage systems. With the current exponential growth of digital information, this mechanism is becoming increasingly desirable for reducing the infrastructure costs of persisting such data. Deduplication is therefore now widely applied in storage appliances serving applications with different requirements (e.g., archival, backup, primary storage). However, deduplication requires additional processing for each storage request in order to detect and eliminate duplicate content. Traditionally, this processing is done in the I/O critical path (inline), introducing a performance penalty on the throughput and latency of requests served by the storage appliance. An alternative is to perform it as a background task, outside the I/O critical path (offline), at the cost of additional storage space, since duplicate content is not found and eliminated immediately. The choice between the two strategies, however, is typically made manually and does not take into account changes in the applications' workloads.
This dissertation proposes HIODS, a hybrid deduplication solution capable of automatically switching between inline and offline deduplication according to the requirements (e.g., a desired storage I/O throughput goal) of applications and their dynamic workloads. The goal is to choose the strategy that fulfils the targeted I/O performance objectives while optimising deduplication space savings.
Finally, a prototype of HIODS is implemented and evaluated extensively with different storage workloads. Results show that HIODS is able to change its deduplication mode dynamically, according to the storage workload being served, while balancing I/O performance
and space savings requirements efficiently.
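The hybrid policy described above can be sketched as a simple controller: deduplicate inline while the measured throughput meets the application's goal, and defer deduplication to a background scan when it does not. HIODS's actual decision logic is not reproduced here; the class, thresholds, and return values below are invented for illustration.

```python
import hashlib

class HybridDedup:
    """Illustrative hybrid inline/offline deduplication controller
    (a sketch of the general idea, not the HIODS implementation)."""

    def __init__(self, throughput_goal_mbps: float):
        self.goal = throughput_goal_mbps
        self.mode = "inline"
        self.index = {}          # fingerprint -> id of the first write
        self.offline_queue = []  # writes awaiting the background scan

    def adapt(self, measured_mbps: float):
        """Switch mode based on the currently measured throughput."""
        self.mode = "inline" if measured_mbps >= self.goal else "offline"

    def write(self, write_id: int, block: bytes) -> str:
        fp = hashlib.sha256(block).hexdigest()
        if self.mode == "inline":
            if fp in self.index:
                return "dedup"            # duplicate removed in the I/O path
            self.index[fp] = write_id
            return "stored"
        self.offline_queue.append((write_id, fp))
        return "stored-pending"           # dedup deferred, extra space used

    def offline_scan(self) -> int:
        """Background pass: resolve queued writes against the index."""
        removed = 0
        for write_id, fp in self.offline_queue:
            if fp in self.index:
                removed += 1              # reclaim the duplicate copy
            else:
                self.index[fp] = write_id
        self.offline_queue.clear()
        return removed

d = HybridDedup(throughput_goal_mbps=100.0)
d.adapt(measured_mbps=120.0)   # meeting the goal: stay inline
r1 = d.write(1, b"block-A")
r2 = d.write(2, b"block-A")
d.adapt(measured_mbps=60.0)    # below goal: defer dedup off the critical path
r3 = d.write(3, b"block-A")
removed = d.offline_scan()
```

The trade-off is visible in the example: the inline duplicate is caught immediately, while the offline duplicate temporarily occupies space until the background scan reclaims it.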
HIV transmission networks among transgender women in Los Angeles County, CA, USA: a phylogenetic analysis of surveillance data.
Background: Transgender women are among the groups at highest risk for HIV infection, with a prevalence of 27.7% in the USA; despite this known high risk, undiagnosed infection is common in this population. We set out to identify transgender women and their partners in a molecular transmission network to prioritise public health activities.
Methods: Since 2006, HIV protease and reverse transcriptase gene (pol) sequences from drug resistance testing have been reported to the Los Angeles County Department of Public Health and linked to demographic, gender, and HIV transmission risk factor data for each case in the enhanced HIV/AIDS Reporting System. We reconstructed a molecular transmission network using the HIV-TRAnsmission Cluster Engine (with a pairwise genetic distance threshold of 0.015 substitutions per site) from the earliest pol sequences of 22 398 unique individuals, including 412 (2%) self-identified transgender women. We examined possible predictors of clustering with multivariate logistic regression. We characterised the genetically linked partners of transgender women and calculated assortativity (the tendency for people to link to others with the same attributes) for each transmission risk group.
Findings: 8133 (36.3%) of 22 398 individuals clustered in the network across 1722 molecular transmission clusters. Transgender women who indicated a sexual risk factor clustered at the highest frequency in the network, with 147 (43%) of 345 linked to at least one other person (adjusted odds ratio [aOR] 2.0, p=0.0002). Transgender women were assortative in the network (assortativity 0.06, p<0.001), indicating that they tended to link to other transgender women. Transgender women were more likely than expected to link to other transgender women (OR 4.65, p<0.001) and to cisgender men who did not identify as men who have sex with men (MSM; OR 1.53, p<0.001). Transgender women were less likely than expected to link to MSM (OR 0.75, p<0.001), despite the high prevalence of HIV among MSM. Transgender women were distributed across 126 clusters, and cisgender individuals linked to one transgender woman were 9.2 times more likely than other individuals in the surveillance database to link to a second transgender woman. Reconstruction of the transmission network is limited by sample availability, but sequences were available for more than 40% of diagnoses.
Interpretation: The clustering of transgender women and their observed tendency to link with cisgender men who did not identify as MSM show the potential of molecular epidemiology both to identify clusters likely to include undiagnosed transgender women with HIV and to improve the targeting of public health prevention and treatment services to transgender women.
Funding: California HIV and AIDS Research Program and National Institutes of Health-National Institute of Allergy and Infectious Diseases.
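The thresholded-linkage idea behind the network reconstruction (linking any two sequences whose pairwise genetic distance is at or below 0.015 substitutions per site, then reading clusters off the resulting graph) can be sketched with a union-find pass over the pairwise distances. The sequence IDs and distances below are illustrative, not real surveillance data, and this sketch omits everything else HIV-TRACE does (alignment, distance computation, quality filtering).

```python
def cluster_by_distance(ids, pairwise, threshold=0.015):
    """Group sequences whose pairwise genetic distance is <= threshold,
    using union-find to merge linked pairs into clusters."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), dist in pairwise.items():
        if dist <= threshold:
            parent[find(a)] = find(b)      # link the pair's clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return [sorted(members) for members in clusters.values()]

ids = ["s1", "s2", "s3", "s4"]
pairwise = {("s1", "s2"): 0.010, ("s2", "s3"): 0.040, ("s3", "s4"): 0.012}
clusters = cluster_by_distance(ids, pairwise)
```

Here s1 and s2 fall within the threshold of each other, as do s3 and s4, while the s2-s3 distance exceeds it, so the four sequences form two clusters.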