
    File Updates Under Random/Arbitrary Insertions And Deletions

    A client/encoder edits a file, as modeled by an insertion-deletion (InDel) process. An old copy of the file is stored remotely at a data-centre/decoder, and is also available to the client. We consider the problem of throughput- and computationally-efficient communication from the client to the data-centre, to enable the server to update its copy to the newly edited file. We study two models for the source files/edit patterns: the random pre-edit sequence left-to-right random InDel (RPES-LtRRID) process, and the arbitrary pre-edit sequence arbitrary InDel (APES-AID) process. In both models, we consider the regime in which the number of insertions/deletions is a small (but constant) fraction of the length of the original file. For both models we prove information-theoretic lower bounds on the best possible compression rates that enable file updates. Conversely, our compression algorithms use dynamic programming (DP) and entropy coding, and achieve rates that are approximately optimal. Comment: The paper is an extended version of our paper to appear at ITW 201
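
As a rough illustration of the setting in the abstract above (not the paper's algorithm), the sketch below uses a longest-common-subsequence DP to derive an insertion/deletion-only script that a client could send so that a server holding the old copy can rebuild the new file. The names `indel_script` and `apply_script` and the tuple encoding of operations are assumptions made for this example; the entropy-coding stage mentioned in the abstract is omitted.

```python
from typing import List, Tuple

# ("del", position_in_old, "") or ("ins", position_in_old, inserted_char)
Op = Tuple[str, int, str]


def indel_script(old: str, new: str) -> List[Op]:
    """Return a shortest insertion/deletion-only script turning old into new."""
    n, m = len(old), len(new)
    # lcs[i][j] = length of a longest common subsequence of old[i:] and new[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    ops: List[Op] = []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and old[i] == new[j]:
            i += 1  # characters match: nothing to transmit
            j += 1
        elif j < m and (i == n or lcs[i][j + 1] >= lcs[i + 1][j]):
            ops.append(("ins", i, new[j]))  # insert new[j] before old[i]
            j += 1
        else:
            ops.append(("del", i, ""))  # delete old[i]
            i += 1
    return ops


def apply_script(old: str, ops: List[Op]) -> str:
    """Server side: rebuild the new file from the old copy and the script."""
    out = []
    i = 0
    for kind, pos, ch in ops:
        out.append(old[i:pos])  # copy the unchanged run before this edit
        i = pos
        if kind == "ins":
            out.append(ch)
        else:
            i += 1  # skip the deleted character
    out.append(old[i:])
    return "".join(out)
```

The script has `len(old) + len(new) - 2 * LCS` operations, so when only a small fraction of the file is edited the script is short relative to the file itself.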

    Compressing DNA sequence databases with coil

    Background: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work

    CloudTree: A Library to Extend Cloud Services for Trees

    In this work, we propose a library that lets a cloud client create and manage tree data structures in a cloud. As a proof of concept, we implement a new cloud service, CloudTree. With CloudTree, users are able to organize big data into tree data structures of their choice that are physically stored in a cloud. We use caching, prefetching, and aggregation techniques in the design and implementation of CloudTree to enhance performance. We have implemented Binary Search Trees (BST) and Prefix Trees as the current members of CloudTree and have benchmarked their performance using the Amazon Cloud. The ideas and techniques in the design and implementation of the BST and prefix tree are generic and thus can also be used for other types of trees, such as B-trees, and for other link-based data structures, such as linked lists and graphs. Preliminary experimental results show that CloudTree is useful and efficient for various big data applications

    Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing

    This paper presents a general technique for optimally transforming any dynamic data structure that operates on atomic and indivisible keys by constant-time comparisons into a data structure that handles unbounded-length keys whose comparison cost is not a constant. Examples of these keys are strings, multi-dimensional points, multiple-precision numbers, multi-key data (e.g. records), XML paths, URL addresses, etc. The technique is more general than what has been done in previous work, as no particular exploitation of the underlying structure of the keys is required. The only requirement is that the insertion of a key must identify its predecessor or its successor. Using the proposed technique, an online suffix tree can be constructed in worst-case time $O(\log n)$ per input symbol (as opposed to amortized $O(\log n)$ time per symbol, achieved by previously known algorithms). To our knowledge, our algorithm is the first that achieves $O(\log n)$ worst-case time per input symbol. Searching for a pattern of length $m$ in the resulting suffix tree takes $O(\min(m \log |\Sigma|, m + \log n) + tocc)$ time, where $tocc$ is the number of occurrences of the pattern. The paper also describes more applications and shows how to obtain alternative methods for dealing with suffix sorting, dynamic lowest common ancestors and order maintenance

    B-tree indexes for high update rates

    In some applications, data capture dominates query processing. For example, monitoring moving objects often requires more insertions and updates than queries. Data gathering using automated sensors often exhibits this imbalance. More generally, indexing streams apparently is considered an unsolved problem. For those applications, B-tree indexes are reasonable choices if some trade-off decisions are tilted towards optimization of updates rather than of queries. This paper surveys techniques that let B-trees sustain very high update rates, up to multiple orders of magnitude higher than traditional B-trees, at the expense of query processing performance. Perhaps not surprisingly, some of these techniques are reminiscent of those employed during index creation, index rebuild, etc., while others are derived from other well known technologies such as differential files and log-structured file systems

    Propagation of updates to replicas using error-correcting codes

    With the increasing share of replicated data on the Internet, reducing the bandwidth needed to propagate updates across replicas has become a major issue. The objective of our investigation is to design an update propagation mechanism that reduces the bandwidth needed to propagate a change across multiple distinct versions of the replicas in a distributed system. We obtain from the user an estimate of the number of bytes changed and generate the parity information needed to correct these bytes using Error Correcting Codes. Transferring the parity information propagates the update. The updated data can be constructed using the parity information and the outdated data. Our investigation showed that the approach is bandwidth efficient but computationally intensive. We conclude our investigation with an update propagation mechanism that we believe would be less computationally intensive and would also reduce bandwidth requirements
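
The idea in the abstract above, sending parity instead of data and letting the receiver "correct" its outdated copy, can be sketched in a toy form. The example below is not the paper's mechanism: it assumes that at most one byte of a block of at most 256 bytes was overwritten (no insertions or deletions), and it uses a Reed-Solomon-style pair of checksums over the prime field GF(257). The names `parity` and `apply_parity` are hypothetical.

```python
from typing import Tuple

P = 257  # prime just above the byte range, so byte values fit in GF(257)


def parity(block: bytes) -> Tuple[int, int]:
    """Two checksums: plain sum and position-weighted sum, both mod P."""
    s0 = sum(block) % P
    s1 = sum((idx + 1) * b for idx, b in enumerate(block)) % P
    return s0, s1


def apply_parity(old: bytes, new_parity: Tuple[int, int]) -> bytes:
    """Rebuild the new block from the outdated block and the new parity.

    Assumes the new block differs from `old` in at most one byte.
    """
    if len(old) >= P:
        raise ValueError("block too long for this toy code")
    o0, o1 = parity(old)
    d0 = (new_parity[0] - o0) % P  # equals the byte delta e
    d1 = (new_parity[1] - o1) % P  # equals position * e
    if d0 == 0 and d1 == 0:
        return old  # nothing changed
    if d0 == 0:
        raise ValueError("change pattern not correctable by this code")
    # Modular inverse via Fermat's little theorem (P is prime).
    pos = d1 * pow(d0, P - 2, P) % P  # 1-based position of the changed byte
    if not 1 <= pos <= len(old):
        raise ValueError("change pattern not correctable by this code")
    value = (old[pos - 1] + d0) % P
    if value > 255:
        raise ValueError("change pattern not correctable by this code")
    patched = bytearray(old)
    patched[pos - 1] = value
    return bytes(patched)
```

Here the sender transmits two small integers per block rather than the block itself; correcting more changed bytes would need more parity symbols and a proper decoder, which is where the computational cost noted in the abstract comes from.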

    Simple, compact and robust approximate string dictionary

    This paper is concerned with practical implementations of approximate string dictionaries that allow edit errors. In this problem, we have as input a dictionary $D$ of $d$ strings of total length $n$ over an alphabet of size $\sigma$. Given a bound $k$ and a pattern $x$ of length $m$, a query has to return all the strings of the dictionary which are at edit distance at most $k$ from $x$, where the edit distance between two strings $x$ and $y$ is defined as the cost of a minimum-cost sequence of edit operations that transforms $x$ into $y$. The cost of a sequence of operations is defined as the sum of the costs of the operations involved in the sequence. In this paper, we assume that each of these operations has unit cost and consider only three operations: deletion of one character, insertion of one character and substitution of a character by another. We present a practical implementation of the data structure we recently proposed, which works only for one error. We extend the scheme to $2 \leq k < m$. Our implementation has many desirable properties: it has a very fast and space-efficient building algorithm. The dictionary data structure is compact and has fast and robust query time. Finally, our data structure is simple to implement as it only uses basic techniques from the literature, mainly hashing (linear probing and hash signatures) and succinct data structures (bitvectors supporting rank queries). Comment: Accepted to a journal (19 pages, 2 figures)
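
To make the query definition in the abstract above concrete, here is a minimal sketch of unit-cost edit distance (insertion, deletion, substitution) computed by the standard dynamic program, together with a naive linear-scan query. This is only a baseline for what a query must return; it is not the hashing-based data structure the paper implements, and the names `edit_distance` and `query` are chosen for this example.

```python
from typing import Iterable, List


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance with insertion, deletion and substitution."""
    # prev[j] = distance between the processed prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty string
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def query(dictionary: Iterable[str], x: str, k: int) -> List[str]:
    """Return every dictionary string within edit distance k of x."""
    return [s for s in dictionary if edit_distance(s, x) <= k]
```

For example, `query(["cat", "cart", "dog", "coat"], "cat", 1)` keeps `"cat"`, `"cart"` and `"coat"`. The scan costs time proportional to $m$ times the total dictionary length $n$ per query, which is the cost a dedicated dictionary structure aims to avoid.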

    Communication Cost for Updating Linear Functions when Message Updates are Sparse: Connections to Maximally Recoverable Codes

    We consider a communication problem in which an update of the source message needs to be conveyed to one or more distant receivers that are interested in maintaining specific linear functions of the source message. The setting is one in which the updates are sparse in nature, and where neither the source nor the receiver(s) knows the exact difference vector; each knows only the amount of sparsity that is present in the difference vector. Under this setting, we are interested in devising linear encoding and decoding schemes that minimize the communication cost involved. We show that the optimal solution to this problem is closely related to the notion of maximally recoverable codes (MRCs), which were originally introduced in the context of coding for storage systems. In the context of storage, MRCs guarantee optimal erasure protection when the system is partially constrained to have local parity relations among the storage nodes. In our problem, we show that optimal solutions exist if and only if MRCs of a certain kind (identified by the desired linear functions) exist. We consider point-to-point and broadcast versions of the problem, and identify connections to MRCs under both these settings. For the point-to-point setting, we show that our linear-encoder-based achievable scheme is optimal even when non-linear encoding is permitted. The theory is illustrated in the context of updating erasure-coded storage nodes. We present examples based on modern storage codes such as the minimum bandwidth regenerating codes. Comment: To appear in IEEE Transactions on Information Theory