10 research outputs found

    Design and Implementation of Storage System Using Byte-index Chunking Scheme

    In this paper, we present an enhanced storage system that supports the Byte-index chunking algorithm. The storage system aims to provide efficient, high-performance data deduplication that completes quickly. We describe the overall procedure of the Byte-index chunking based storage system, including the read/write procedure and how the system works. The key idea of Byte-index chunking is to adapt a fixed-size block chunking scheme in which chunks are distributed to an “Index-table” according to the boundary values on both sides of each chunk. We have found that Byte-index chunking in a storage system provides high performance compared with other chunking schemes. Experimental results show that the storage system with Byte-index chunking compresses the overall data with high deduplication capability and reduces file processing time.

    Faster Compression of Deterministic Finite Automata

    Deterministic finite automata (DFA) are a classic tool for high throughput matching of regular expressions, both in theory and practice. Due to their high space consumption, extensive research has been devoted to compressed representations of DFAs that still support efficient pattern matching queries. Kumar et al. [SIGCOMM 2006] introduced the delayed deterministic finite automaton (D²FA), which exploits the large redundancy between inter-state transitions in the automaton. They showed it to obtain up to two orders of magnitude compression of real-world DFAs, and their work formed the basis of numerous subsequent results. Their algorithm, as well as later algorithms based on their idea, has an inherent quadratic-time bottleneck, as it considers every pair of states to compute the optimal compression. In this work we present a simple, general framework based on locality-sensitive hashing for speeding up these algorithms to achieve sub-quadratic construction times for D²FAs. We apply the framework to speed up several algorithms to near-linear time, and experimentally evaluate their performance on real-world regular expression sets extracted from modern intrusion detection systems. We find an order of magnitude improvement in compression times, with either little or no loss of compression, or even significantly better compression in some cases.
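
To make the idea of default transitions concrete, here is a minimal Python sketch. It is not the construction from Kumar et al. or from this paper: it assumes a greedy rule in which each state may only point its default transition at a lower-numbered state (which rules out cycles), and the names `compress` and `step` are hypothetical. The inner loop over earlier states is the pairwise comparison that the abstract identifies as the quadratic bottleneck; the locality-sensitive hashing framework itself is not shown.

```python
def compress(table):
    """table[s][c] is the next state from state s on symbol c.

    Returns (default, labelled): default[s] is the state s falls back to
    (or None), labelled[s] holds only the transitions that differ from it.
    Greedy simplification (an assumption): defaults point to earlier states.
    """
    n = len(table)
    default = [None] * n
    labelled = [dict(enumerate(row)) for row in table]
    for s in range(1, n):
        best, shared = None, 0
        # Pairwise comparison with every earlier state: the quadratic step.
        for t in range(s):
            common = sum(1 for a, b in zip(table[s], table[t]) if a == b)
            if common > shared:
                best, shared = t, common
        if best is not None:
            default[s] = best
            labelled[s] = {
                c: nxt for c, nxt in enumerate(table[s]) if table[best][c] != nxt
            }
    return default, labelled


def step(default, labelled, state, c):
    """Follow default transitions (consuming no input) until c is labelled."""
    while c not in labelled[state]:
        state = default[state]
    return labelled[state][c]
```

On a transition table whose rows are mostly identical, the labelled transitions shrink to the few entries that differ, while `step` still returns the same next state as the uncompressed table.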

    Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

    The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics, drawing on techniques from the delay-constrained scheduling and spanning tree literature, to solve these problems. We have built a prototype version management system that aims to serve as a foundation for our DATAHUB system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.
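
The storage-recreation trade-off can be illustrated with a toy cost model. The sketch below is only an illustration under stated assumptions: the version names, sizes, and the rule that recreation cost equals the bytes read along the delta chain are all hypothetical and are not taken from the paper.

```python
# Hypothetical sizes: full copies of each version and deltas between them.
full = {"v1": 100, "v2": 105, "v3": 110}
delta = {("v1", "v2"): 8, ("v2", "v3"): 9}


def storage_cost(plan):
    """plan maps version -> parent it is stored as a delta from (None = full copy)."""
    return sum(full[v] if p is None else delta[(p, v)] for v, p in plan.items())


def recreation_cost(plan, v):
    """Bytes read to rebuild v: every delta on the path plus the full copy at its root."""
    cost = 0
    while plan[v] is not None:
        cost += delta[(plan[v], v)]
        v = plan[v]
    return cost + full[v]


all_full = {"v1": None, "v2": None, "v3": None}   # maximum storage, fastest recreation
chain = {"v1": None, "v2": "v1", "v3": "v2"}      # minimum storage, slowest recreation
```

Under these numbers, storing every version in full costs 315 units of storage and 110 to recreate `v3`, while the delta chain costs 117 units of storage and 117 to recreate `v3`.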

    Data deduplication in a data store

    In our thesis we presented the area of data deduplication and implemented an algorithm for object storage with support for elimination of duplicate chunks within those objects. In the first part we presented a storage system as a tree-like structure of directories and files. We described the features of a storage system and simple ways of storing data on a medium. We examined in detail the properties of the distributed storage system Ceph, its components, and its operation. In the second part we presented deduplication as an important feature of modern storage systems. We surveyed deduplication techniques for centralized as well as distributed systems. In the last part we implemented an example of a deduplication technique along with a simple object storage system. Using the described techniques we implemented detection of variable-length duplicated chunks within objects and added CLI tools for manipulating objects in the store.
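
A minimal Python sketch of variable-length (content-defined) chunking with a deduplicating object store follows. It is not the thesis implementation: the constants, the `ChunkStore` class, and the use of a BLAKE2b hash over a fixed window (instead of the rolling hash, such as a Rabin fingerprint, that real systems typically use for speed) are all assumptions chosen for brevity.

```python
import hashlib

WINDOW = 16      # bytes of context that decide a boundary (assumed)
MASK = 0x3F      # cut when the low 6 hash bits are zero, roughly every 64 bytes (assumed)
MIN_CHUNK = 16   # never cut before this many bytes (assumed)
MAX_CHUNK = 256  # always cut at this many bytes (assumed)


def chunks(data: bytes):
    """Yield variable-length chunks whose boundaries depend on local content."""
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        if length < MIN_CHUNK:
            continue
        window = data[i - WINDOW + 1 : i + 1]
        h = int.from_bytes(hashlib.blake2b(window, digest_size=4).digest(), "big")
        if (h & MASK) == 0 or length >= MAX_CHUNK:
            yield data[start : i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]


class ChunkStore:
    """Objects are stored as lists of chunk fingerprints; each chunk is kept once."""

    def __init__(self):
        self.chunks = {}   # SHA-256 fingerprint -> chunk bytes
        self.objects = {}  # object name -> ordered list of fingerprints

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for chunk in chunks(data):
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # duplicate chunks are not stored again
            recipe.append(fp)
        self.objects[name] = recipe

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[fp] for fp in self.objects[name])
```

Because boundaries are chosen from the content rather than from fixed offsets, inserting a few bytes at the front of an object shifts only the first chunks; later boundaries tend to realign, so most chunks are still recognized as duplicates.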

    A survey and classification of storage deduplication systems

    The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique to reduce storage costs. Thus, it has been applied to different storage types, including archives and backups, primary storage, within solid state disks, and even to random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, thus underestimating the relevance of new research and development. The first contribution of this paper is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. This classification identifies and describes the different approaches used for each of them. As a second contribution, we describe which combinations of these design decisions have been proposed and found more useful for challenges in each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed. This work is funded by the European Regional Development Fund (ERDF) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the Fundação para a Ciência e a Tecnologia (FCT; Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156 and the FCT by PhD scholarship SFRH-BD-71372-2010.

    Deployment Distribuito di codice e dati su Grid mediante Tecniche di Compressione e di Caching (Distributed Deployment of Code and Data on Grids Using Compression and Caching Techniques)

    Study, design, and implementation of a scalable deployment system for Grids. The prototype multicasts large amounts of data by decomposing them into fingerprinted blocks and replicating them in a distributed fashion. It uses compression and caching techniques to optimize network bandwidth and data access times, and to reuse data produced by previous deployments. The system is optimized for sending sets of files to sets of nodes, all of which may be disjoint. The library that was designed and implemented keeps deployment time nearly constant as the number of destination nodes grows, and maintains a relative efficiency of up to 100% as the amount of data to send grows.

    Application-specific Delta-encoding via Resemblance Detection

    Many objects, such as files, electronic messages, and web pages, contain overlapping content. Numerous past research projects have observed that one can compress one object relative to another one by computing the differences between the two, but these delta-encoding systems have almost invariably required knowledge of a specific relationship between them, most commonly two versions using the same name at different points in time. We consider cases in which this relationship is determined dynamically, by efficiently determining when a sufficient resemblance exists between two objects in a relatively large collection. We look at specific examples of this technique, namely web pages, email, and files in a file system, and evaluate the potential data reduction and the factors that influence this reduction. We find that delta-encoding using this resemblance detection technique can improve on simple compression by up to a factor of two, depending on workload, and that a small fraction of objects can potentially account for a large portion of these savings.
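
The two steps described here, finding a sufficiently similar base object and then encoding the target as differences against it, can be sketched in Python as follows. This is an illustration only: the abstract does not state which features or thresholds the authors use, so the word shingles, the Jaccard score, the 0.3 threshold, and all function names are assumptions. The delta step uses the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def shingles(text, k=4):
    """Set of overlapping k-word windows, used as the object's features."""
    words = text.split()
    return {" ".join(words[i : i + k]) for i in range(max(1, len(words) - k + 1))}


def resemblance(a, b):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def best_base(target, collection, threshold=0.3):
    """Name of the most similar object at or above the threshold, or None."""
    best_name, best_score = None, threshold
    for name, text in collection.items():
        score = resemblance(target, text)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def delta_encode(base, target):
    """Describe target as copies of base ranges plus literal additions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": new text comes from the target
            ops.append(("add", target[j1:j2]))
        # "delete": the base range is simply not copied
    return ops


def delta_decode(base, ops):
    """Rebuild the target from the base object and the delta operations."""
    parts = []
    for op in ops:
        parts.append(base[op[1] : op[2]] if op[0] == "copy" else op[1])
    return "".join(parts)
```

Comparing the target against every object, as `best_base` does, is the simplest choice but scales linearly with the collection; a large collection would need an index over the features instead.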