
    A Running Time Improvement for Two Thresholds Two Divisors Algorithm

    Chunking algorithms play an important role in data de-duplication systems. The Basic Sliding Window (BSW) algorithm is the first prototype of content-based chunking and can handle most types of data. The Two Thresholds Two Divisors (TTTD) algorithm was proposed to improve on BSW by controlling the variation of chunk sizes. In this project, we investigate and compare the BSW and TTTD algorithms across several factors through a series of systematic experiments; to date, no paper has conducted such an experimental evaluation of these two algorithms, and this is the first contribution of this work. Based on our analyses and experimental results, we also provide a running-time improvement for the TTTD algorithm. Compared with the original TTTD algorithm, our solution reduces total running time by about 7%, reduces the number of large chunks by about 50%, and brings the average chunk size closer to the expected chunk size. These results are the second contribution of this project.
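    To make the TTTD rule concrete, the following minimal sketch (with an illustrative window size, thresholds, and divisors that are assumptions rather than the parameters evaluated in this paper) shows how the backup divisor supplies a fallback breakpoint when the maximum chunk size is reached.

```python
# Hypothetical sketch of TTTD-style content-defined chunking (illustrative
# parameters; not the exact hash function or thresholds from the paper).
import hashlib

T_MIN, T_MAX = 2048, 16384       # assumed minimum / maximum chunk sizes
D, D_BACKUP = 4096, 2048         # assumed main and backup divisors
WINDOW = 48                      # sliding-window width in bytes

def window_fingerprint(window: bytes) -> int:
    """Stand-in for a rolling hash: fingerprint of the current window."""
    return int.from_bytes(hashlib.md5(window).digest()[:4], "big")

def tttd_chunks(data: bytes):
    """Yield chunks using the two-threshold, two-divisor breakpoint rule."""
    start, backup = 0, -1
    for i in range(len(data)):
        length = i - start + 1
        if length < T_MIN:
            continue                      # ignore boundaries below the minimum
        fp = window_fingerprint(data[max(i - WINDOW + 1, 0):i + 1])
        if fp % D_BACKUP == 0:
            backup = i                    # remember a fallback breakpoint
        if fp % D == 0:
            yield data[start:i + 1]       # main divisor hit: cut here
            start, backup = i + 1, -1
        elif length >= T_MAX:             # forced cut at the maximum size
            cut = backup if backup >= 0 else i
            yield data[start:cut + 1]
            start, backup = cut + 1, -1
    if start < len(data):
        yield data[start:]                # trailing chunk

if __name__ == "__main__":
    import os
    sizes = [len(c) for c in tttd_chunks(os.urandom(200_000))]
    print(len(sizes), "chunks, average size", sum(sizes) // len(sizes))
```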

    A survey and classification of storage deduplication systems

    The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique to reduce storage costs. Thus, it has been applied to different storage types, including archives and backups, primary storage, solid-state disks, and even random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, leading to an underestimation of the relevance of new research and development. The first contribution of this paper is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. This classification identifies and describes the different approaches used for each of them. As a second contribution, we describe which combinations of these design decisions have been proposed and found most useful for the challenges of each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed. This work is funded by the European Regional Development Fund (ERDF) through the COMPETE Programme (Operational Programme for Competitiveness) and by National Funds through the Fundação para a Ciência e a Tecnologia (FCT; Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156, and by the FCT through PhD scholarship SFRH-BD-71372-2010.
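    The six classification criteria can be pictured as axes of a design space. The short sketch below is only an illustration of that idea; the option values listed under each axis are assumptions, not necessarily the exact categories used in the survey.

```python
# Hypothetical sketch of part of the six-criteria design space (granularity,
# timing, scope shown). The option values are illustrative assumptions.
from enum import Enum

class Granularity(Enum):
    FILE = "whole-file"
    FIXED_BLOCK = "fixed-size block"
    VARIABLE_BLOCK = "content-defined block"

class Timing(Enum):
    INLINE = "deduplicate on the write path"
    OFFLINE = "deduplicate after data is stored"

class Scope(Enum):
    LOCAL = "single node"
    GLOBAL = "across the whole cluster"

def classify(name: str, granularity: Granularity, timing: Timing, scope: Scope) -> dict:
    """Record one system's position in (a subset of) the design space."""
    return {"system": name, "granularity": granularity, "timing": timing, "scope": scope}

# Example: a hypothetical backup appliance classified along three of the six axes.
print(classify("example-backup-appliance",
               Granularity.VARIABLE_BLOCK, Timing.INLINE, Scope.LOCAL))
```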

    Efficient, Dependable Storage of Human Genome Sequencing Data

    The understanding of the human genome impacts several areas of human life. Data from human genomes is massive because there are millions of samples to be sequenced, and each sequenced human genome may occupy hundreds of gigabytes of storage space. Human genomes are critical because they are extremely valuable to research and may provide hints about individuals' health status, identify their donors, or reveal information about donors' relatives. Their size and criticality, plus the amount of data being produced by medical and life-sciences institutions, require systems to scale while being secure, dependable, auditable, and affordable. Current storage infrastructures are too expensive to ignore cost efficiency in storing human genomes, and they lack the proper knowledge and mechanisms to protect the privacy of sample donors. This thesis proposes an efficient storage system for human genomes that medical and life-sciences institutions may trust and afford. It enhances traditional storage ecosystems with privacy-aware, data-reduction, and auditability techniques to enable the efficient, dependable use of multi-tenant infrastructures to store human genomes. Contributions of this thesis include (1) a study on the privacy sensitivity of human genomes; (2) a method to systematically detect the privacy-sensitive portions of genomes; (3) data-reduction algorithms specialised for sequencing data; (4) an independent auditability scheme for secure dispersed storage; and (5) a complete storage pipeline that obtains reasonable privacy protection, security, and dependability guarantees at modest costs (e.g., less than $1/Genome/Year) by integrating the proposed mechanisms with appropriate storage configurations.
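    The independent auditability contribution is only summarized above. The sketch below is a rough, assumed illustration of the general idea behind auditable dispersed storage, not the scheme proposed in the thesis: data is split across several stores, and a per-fragment digest manifest lets a third party verify the fragments without trusting any single store.

```python
# Hypothetical illustration of independently auditable dispersed storage.
# This is NOT the thesis's scheme; it only shows the general idea of keeping
# per-fragment digests that an auditor can verify on its own.
import hashlib

def disperse(blob: bytes, n_stores: int):
    """Split a blob into n fragments and record a digest for each fragment."""
    size = -(-len(blob) // n_stores)                  # ceiling division
    fragments = [blob[i * size:(i + 1) * size] for i in range(n_stores)]
    manifest = [hashlib.sha256(f).hexdigest() for f in fragments]
    return fragments, manifest                        # fragments to stores, manifest to the auditor

def audit(fragments, manifest) -> bool:
    """Auditor re-hashes the fetched fragments and compares against the manifest."""
    return all(hashlib.sha256(f).hexdigest() == d for f, d in zip(fragments, manifest))

if __name__ == "__main__":
    frags, manifest = disperse(b"ACGT" * 1000, n_stores=4)
    assert audit(frags, manifest)                     # unmodified data passes the audit
    frags[2] = b"tampered"
    print("audit after tampering:", audit(frags, manifest))  # False
```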

    Doctor of Philosophy

    In the past few years, we have seen a tremendous increase in digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require an efficient storage solution. To improve space efficiency, compression and deduplication are widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate blocks. While we have seen great improvements in space efficiency from these two approaches, there are still some limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range because they search for redundant data at a fine-grained level (string level). For deduplication, metadata embedded in an input file changes more frequently, which introduces more unnecessary unique chunks and leads to poor deduplication. Cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new IO scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to utilize similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse-grained level (block level) and then groups similar blocks together. It can be used as a preprocessing stage for any traditional compressor. We find that metadata has a huge impact in reducing the benefit of deduplication. To isolate the impact of metadata, we propose to separate metadata from data. Three approaches are presented for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and a new tar format that deduplicates better. We also present a case study where we use deduplication to reduce storage consumption for storing disk images, while at the same time achieving high performance in image deployment. Finally, we apply the same principle of utilizing similarity in IO scheduling to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
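    Migratory Compression is described above only at a high level. The sketch below is an assumed illustration of the core idea, namely reordering similar blocks next to each other before invoking a standard compressor; the similarity key used here is a crude stand-in, not the feature extraction used in the dissertation.

```python
# Hypothetical illustration of the Migratory Compression idea: reorder similar
# blocks next to each other, then run a standard compressor over the stream.
# The similarity key (a few sampled bytes per block) is an assumption.
import zlib

BLOCK = 4096

def similarity_key(block: bytes) -> bytes:
    """Crude stand-in for a similarity feature: sample a handful of bytes."""
    return bytes(block[i] for i in range(0, len(block), max(len(block) // 8, 1)))

def migratory_compress(data: bytes) -> bytes:
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    order = sorted(range(len(blocks)), key=lambda i: similarity_key(blocks[i]))
    reordered = b"".join(blocks[i] for i in order)
    # A real implementation must also store `order` to restore the original layout.
    return zlib.compress(reordered, level=6)

if __name__ == "__main__":
    data = (b"A" * 5000 + b"B" * 5000) * 20             # toy input with repeated patterns
    baseline = len(zlib.compress(data, level=6))
    migrated = len(migratory_compress(data))
    print("baseline:", baseline, "bytes  after reordering:", migrated, "bytes")
```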

    THREE TEMPORAL PERSPECTIVES ON DECENTRALIZED LOCATION-AWARE COMPUTING: PAST, PRESENT, FUTURE

    During the past four decades, due to miniaturization, computing devices have become ubiquitous and pervasive. Today, the number of objects connected to the Internet is increasing at a rapid pace, and this trend does not seem to be slowing down. These objects, which can be smartphones, vehicles, or any kind of sensors, generate large amounts of data that are almost always associated with a spatio-temporal context. The amount of this data is often so large that processing it requires the creation of a distributed system, which involves the cooperation of several computers. The ability to process these data is important for society. For example, the data collected during car journeys already makes it possible to avoid traffic jams or to organize carpools. Another example: in the near future, maintenance interventions on the road network will be planned with data collected using gyroscopes that detect potholes. The application domains are therefore numerous, as are the problems associated with them. The articles that make up this thesis deal with systems that share two key characteristics: a spatio-temporal context and a decentralized architecture. In addition, the systems described in these articles revolve around three temporal perspectives: the present, the past, and the future. Systems associated with the present perspective enable a very large number of connected objects to communicate in near real time, according to a spatial context. Our contributions in this area enable this type of decentralized system to be scaled out on commodity hardware, i.e., to adapt as the volume of data arriving in the system increases. Systems associated with the past perspective, often referred to as trajectory indexes, provide access to the large volumes of spatio-temporal data collected by connected objects. Our contributions in this area make it possible to handle particularly dense trajectory datasets, a problem that had not been addressed previously. Finally, systems associated with the future perspective rely on past trajectories to predict the trajectories that connected objects will follow. Our contributions predict the trajectories followed by connected objects with a previously unmet granularity. Although they involve different domains, these contributions are structured around common denominators of the underlying systems, which opens the possibility of dealing with these problems more generically in the near future.
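    The past-perspective systems are trajectory indexes. As a rough, assumed illustration of what such an index does (not the data structure contributed by this thesis), the sketch below buckets trajectory points into spatio-temporal grid cells so that a range query probes only the cells it covers.

```python
# Hypothetical sketch of a spatio-temporal grid index for trajectory points.
# Cell sizes are arbitrary assumptions; this is not the index from the thesis.
from collections import defaultdict

CELL_DEG, CELL_SEC = 0.01, 3600          # assumed spatial and temporal cell sizes

def cell(lat: float, lon: float, t: float):
    return (int(lat / CELL_DEG), int(lon / CELL_DEG), int(t / CELL_SEC))

class GridIndex:
    def __init__(self):
        self.cells = defaultdict(list)    # cell -> list of (object_id, lat, lon, t)

    def insert(self, obj_id, lat, lon, t):
        self.cells[cell(lat, lon, t)].append((obj_id, lat, lon, t))

    def query(self, lat_rng, lon_rng, t_rng):
        """Return points inside a lat/lon/time box by probing only the covered cells."""
        hits = []
        for clat in range(int(lat_rng[0] / CELL_DEG), int(lat_rng[1] / CELL_DEG) + 1):
            for clon in range(int(lon_rng[0] / CELL_DEG), int(lon_rng[1] / CELL_DEG) + 1):
                for ct in range(int(t_rng[0] / CELL_SEC), int(t_rng[1] / CELL_SEC) + 1):
                    for p in self.cells.get((clat, clon, ct), []):
                        if (lat_rng[0] <= p[1] <= lat_rng[1] and
                                lon_rng[0] <= p[2] <= lon_rng[1] and
                                t_rng[0] <= p[3] <= t_rng[1]):
                            hits.append(p)
        return hits

if __name__ == "__main__":
    idx = GridIndex()
    idx.insert("car-1", 46.52, 6.63, 1000)
    idx.insert("car-2", 46.90, 7.40, 5000)
    print(idx.query((46.5, 46.6), (6.6, 6.7), (0, 2000)))   # only car-1 matches
```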

    Hardware accelerated redundancy elimination in network system

    With the tremendous growth in the amount of information stored in remote locations and cloud systems, many service providers are seeking ways to reduce the amount of redundant information sent across networks by using data de-duplication techniques. Data de-duplication can reduce network traffic without loss of information and consequently increase available network bandwidth by reducing redundant traffic. However, due to the heavy computation required for detecting and reducing redundant data transmission, de-duplication itself can become a bottleneck on high-capacity links. We completed two parts of work in this research study: Hardware Accelerated Redundancy Elimination in Network Systems (HARENS) and Distributed Redundancy Elimination System Simulation (DRESS). HARENS can significantly improve the performance of the redundancy elimination algorithm in a network system by leveraging General-Purpose Graphics Processing Unit (GPGPU) techniques as well as other big-data optimizations such as a hierarchical multi-threaded pipeline, single-machine MapReduce, and memory-efficiency techniques. Our results indicate that throughput can be increased by a factor of nine compared to a naive implementation of the data de-duplication algorithm, providing a net transmission increase of up to 3.0 Gigabits per second (Gbps). DRESS provides further acceleration of redundancy elimination in network systems by deploying HARENS as the server-side redundancy elimination module and four cooperative distributed byte caches on the client side. A client-side distributed byte cache broadcasts its cached chunks by sending their hash values to the other byte caches, so that each cache keeps a record of all the chunks in the cooperative distributed cache system. When duplicates are detected, a client-side byte cache can fetch a chunk directly from either its own cache or a peer byte cache rather than from the server-side redundancy elimination module. Our results indicate that the bandwidth savings of the redundancy elimination system with cooperative distributed byte caches can be increased by 12% compared to the system without them, when transferring about 48 Gigabits of data.
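    The cooperative client-side byte cache can be pictured as follows. The sketch below is an assumed illustration, not the actual DRESS implementation: each client advertises the hashes of its cached chunks to its peers and resolves a duplicate chunk from its own cache or a peer before falling back to the server.

```python
# Hypothetical sketch of a cooperative client-side byte cache: clients share
# chunk hashes with peers and fetch duplicates locally or from a peer instead
# of from the server. Not the actual DRESS protocol.
import hashlib

class ByteCache:
    def __init__(self, name):
        self.name = name
        self.chunks = {}                       # hash -> chunk bytes held locally
        self.peer_dirs = {}                    # peer name -> set of advertised hashes

    def store(self, chunk: bytes) -> str:
        h = hashlib.sha1(chunk).hexdigest()
        self.chunks[h] = chunk
        return h

    def advertise(self, peers):
        """Broadcast the hashes of locally cached chunks to the peer caches."""
        for peer in peers:
            peer.peer_dirs[self.name] = set(self.chunks)

    def fetch(self, h: str, peers, server):
        """Resolve a chunk hash: local cache, then peer caches, then the server."""
        if h in self.chunks:
            return self.chunks[h], "local"
        for peer in peers:
            if h in self.peer_dirs.get(peer.name, ()):
                return peer.chunks[h], f"peer:{peer.name}"
        return server[h], "server"             # last resort: ask the server

if __name__ == "__main__":
    a, b = ByteCache("A"), ByteCache("B")
    server = {}
    h = b.store(b"shared chunk of data")
    server[h] = b"shared chunk of data"
    b.advertise([a])
    print(a.fetch(h, [b], server)[1])          # -> "peer:B", no server transfer needed
```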

    Data De-Duplication in NoSQL Databases

    With the popularity and expansion of Cloud Computing, NoSQL databases (DBs) are becoming the preferred choice for storing data in the Cloud. Because they are highly de-normalized, these DBs tend to store significant amounts of redundant data. Data de-duplication (DD) plays an important role in reducing storage consumption and making it affordable to manage amid today's explosive data growth. Numerous DD methodologies, like chunking and delta encoding, are available today to optimize the use of storage. These technologies approach DD at the file and/or sub-file level, but this approach has never been optimal for NoSQL DBs. This research proposes Data De-Duplication in NoSQL Databases (DDNSDB), which applies DD at a higher level of abstraction, namely at the DB level. It makes use of the structural information about the data (metadata), exploiting its granularity to identify and remove duplicates. The main goals of this research are: to maximally reduce the amount of duplicates in one type of NoSQL DB, namely the key-value store; to maximally increase process performance such that the backup window is only marginally affected; and to design with horizontal scaling in mind such that the system runs competitively on a Cloud platform. Additionally, this research presents an analysis of the various types of NoSQL DBs (such as key-value, tabular/columnar, and document DBs) to understand the data model required for the design and implementation of DDNSDB. Preliminary experiments have demonstrated that DDNSDB can further reduce NoSQL DB storage space compared with current archiving methods (from 17% to nearly 69% as more structural information is available). Also, by following an optimized, adapted MapReduce architecture, DDNSDB shows a competitive performance advantage in a horizontally scaled cloud environment compared with a vertically scaled environment (from 28.8 milliseconds to 34.9 milliseconds as the number of parallel Virtual Machines grows).
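    Deduplication at the database level for a key-value store can be illustrated with a simple hash-indirection layer. The sketch below is an assumption for illustration only, not DDNSDB's design: each distinct value is stored once under its content hash, and keys map to hashes.

```python
# Hypothetical sketch of value-level deduplication in a key-value store:
# each distinct value is stored once under its content hash, and keys map to
# hashes. This illustrates the general idea only, not DDNSDB itself.
import hashlib

class DedupKVStore:
    def __init__(self):
        self.key_to_hash = {}                  # logical key -> content hash
        self.hash_to_value = {}                # content hash -> value stored once

    def put(self, key: str, value: bytes):
        h = hashlib.sha256(value).hexdigest()
        self.hash_to_value.setdefault(h, value)   # duplicates add no new storage
        self.key_to_hash[key] = h

    def get(self, key: str) -> bytes:
        return self.hash_to_value[self.key_to_hash[key]]

    def stored_bytes(self) -> int:
        return sum(len(v) for v in self.hash_to_value.values())

if __name__ == "__main__":
    kv = DedupKVStore()
    profile = b'{"country": "US", "plan": "free"}'
    for i in range(1000):
        kv.put(f"user:{i}", profile)             # highly de-normalized, redundant values
    print(kv.get("user:42"), kv.stored_bytes())  # the value is stored only once
```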

    Leveraging content properties to optimize distributed storage systems

    Cloud service providers, social networks, and data-management companies are witnessing a tremendous increase in the amount of data they receive every day. All this data creates new opportunities to expand human knowledge in fields like healthcare and human behavior, and to improve offered services like search, recommendation, and many others. It is not by accident that many academics, as well as the public media, refer to our era as the Big Data era. But these huge opportunities come with the requirement for better data management systems that, on one hand, can safely accommodate this huge and constantly increasing volume of data and, on the other, serve it in a timely and useful manner so that applications can benefit from processing it. This document focuses on these two challenges of Big Data. In more detail, we study (i) backup storage systems as a means to safeguard data against a number of factors that may render them unavailable, and (ii) data placement strategies on geographically distributed storage systems, with the goal of reducing user-perceived latencies while using network and storage resources efficiently. Throughout our study, data are placed at the centre of our design choices, as we try to leverage content properties for both placement and efficient storage.
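    A latency-aware placement strategy of the kind studied here can be illustrated with a small example. The sketch below is an assumption, not the thesis's actual algorithm; the site names and latency figures are made up: each object is placed in the datacenter that minimizes the request-weighted latency of its accessors.

```python
# Hypothetical sketch of latency-aware placement on geo-distributed storage:
# put each object in the datacenter that minimizes request-weighted latency
# to its accessors. Latency numbers and site names are made-up assumptions.
LATENCY_MS = {                                  # (client region, datacenter) -> latency
    ("eu", "dc-frankfurt"): 15, ("eu", "dc-virginia"): 95,
    ("us", "dc-frankfurt"): 95, ("us", "dc-virginia"): 12,
}

def place(object_accesses: dict, datacenters) -> str:
    """Pick the datacenter with the lowest request-weighted total latency."""
    def cost(dc):
        return sum(count * LATENCY_MS[(region, dc)]
                   for region, count in object_accesses.items())
    return min(datacenters, key=cost)

if __name__ == "__main__":
    # An object read 900 times from Europe and 100 times from the US.
    accesses = {"eu": 900, "us": 100}
    print(place(accesses, ["dc-frankfurt", "dc-virginia"]))   # -> dc-frankfurt
```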