String Reconstruction from Substring Compositions
Motivated by mass-spectrometry protein sequencing, we consider a
simply stated problem of reconstructing a string from the multiset of its
substring compositions. We show that all strings whose length is at most 7,
one less than a prime, or one less than twice a prime can be reconstructed uniquely up to
reversal. For all other lengths we show that reconstruction is not always
possible and provide sometimes-tight bounds on the largest number of strings
with given substring compositions. The lower bounds are derived by
combinatorial arguments and the upper bounds by algebraic considerations that
precisely characterize the set of strings with the same substring compositions
in terms of the factorization of bivariate polynomials. The problem can be
viewed as a combinatorial simplification of the turnpike problem, and its
solution may shed light on this long-standing problem as well. Using well-known
results on the transience of multi-dimensional random walks, we also provide a
reconstruction algorithm that reconstructs random strings over fixed-size
alphabets in optimal near-quadratic time.
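The object being reconstructed here can be made concrete with a small sketch (Python; the function name `composition_multiset` is ours, not the paper's): the composition of a substring is its letter-count pair, order discarded, and the "up to reversal" caveat is visible because reversing a string permutes its substrings without changing any composition.

```python
from collections import Counter

def composition_multiset(s):
    """Multiset of substring compositions of a binary string s.

    The composition of a substring ignores order: it is the pair
    (#0s, #1s). The multiset ranges over all n*(n+1)/2 nonempty
    substrings.
    """
    n = len(s)
    comps = Counter()
    for i in range(n):
        zeros = ones = 0
        for j in range(i, n):
            if s[j] == '0':
                zeros += 1
            else:
                ones += 1
            comps[(zeros, ones)] += 1
    return comps

# A string and its reversal always share the same composition multiset,
# which is why reconstruction is only ever possible up to reversal.
s = "0110100"
print(composition_multiset(s) == composition_multiset(s[::-1]))  # True
```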
Adaptive learning of compressible strings
Suppose an oracle knows a string S that is unknown to us and that we want to determine. The oracle can answer queries of the form "Is s a substring of S?". In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm needs to ask the oracle σn/4 − O(n) queries in order to be able to reconstruct the hidden string, where σ is the size of the alphabet of S and n its length, and gave an algorithm that spends (σ − 1)n + O(σ√n) queries to reconstruct S. The main contribution of our paper is to improve the above upper bound in the context where the string is compressible. We first present a universal algorithm that, given a (computable) compressor that compresses the string to τ bits, performs q = O(τ) substring queries; this algorithm, however, runs in exponential time. For this reason, the second part of the paper focuses on more time-efficient algorithms whose number of queries is bounded by specific compressibility measures. We first show that any string of length n over an integer alphabet of size σ whose run-length encoding consists of r runs can be reconstructed with q = O(r(σ + log(n/r))) substring queries in linear time and space. We then present an algorithm that spends q ∈ O(σ g log n) substring queries and runs in O(n(log n + log σ) + q) time using linear space, where g is the size of a smallest straight-line program generating the string. © 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
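The flavor of substring-query reconstruction can be sketched with the classic greedy-extension strategy (a simplification in Python; the names `Oracle` and `reconstruct` are ours, and this is the baseline O(σn)-query idea rather than the compression-aware algorithms of the paper): keep a candidate known to be a substring, extend it one character at a time on the right while any extension is still a substring, then extend on the left; the process can only stall when the candidate equals S.

```python
class Oracle:
    """Knows the hidden string and answers substring queries, counting them."""
    def __init__(self, hidden):
        self._hidden = hidden
        self.queries = 0

    def is_substring(self, s):
        self.queries += 1
        return s in self._hidden

def reconstruct(oracle, alphabet):
    t = ""
    # Extend to the right while possible; t always stays a substring of S.
    extended = True
    while extended:
        extended = False
        for a in alphabet:
            if oracle.is_substring(t + a):
                t += a
                extended = True
                break
    # t now occurs only as a suffix of S; extend to the left the same way.
    extended = True
    while extended:
        extended = False
        for a in alphabet:
            if oracle.is_substring(a + t):
                t = a + t
                extended = True
                break
    # When neither side extends, every occurrence of t is both a prefix
    # and a suffix occurrence, so t must be the whole string.
    return t

oracle = Oracle("abracadabra")
print(reconstruct(oracle, "abcdr"))  # abracadabra
```

Each successful extension costs at most σ queries, giving the O(σn) total that the compressibility-aware algorithms in the abstract improve upon.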
Scalable string reconciliation by recursive content-dependent shingling
We consider the problem of reconciling similar strings in a distributed system. Specifically, we are interested in performing this reconciliation efficiently, minimizing the communication cost. Our problem applies to several types of large-scale distributed networks, file synchronization utilities, and any system that manages the consistency of string-encoded ordered data. We present the novel Recursive Content-Dependent Shingling (RCDS) protocol, which can handle large strings and whose communication complexity scales with the edit distance between the reconciling strings. We also provide analysis, experimental results, and comparisons of an implementation of our protocol to existing synchronization software such as the Rsync utility.
Low-Complexity Interactive Algorithms for Synchronization From Deletions, Insertions, and Substitutions
Consider two remote nodes having binary sequences X and Y, respectively,
where Y is an edited version of X and the editing involves random deletions,
insertions, and substitutions, possibly in bursts. The goal is for the node
with Y to reconstruct X with minimal exchange of information over a
noiseless link. The communication is measured in terms of both the total number
of bits exchanged and the number of interactive rounds of communication.
This paper focuses on the setting where the number of edits is small
compared to the length n of X. We first consider the
case where the edits are a mixture of insertions and deletions (indels), and
propose an interactive synchronization algorithm with near-optimal
communication rate and low average computational complexity in terms of
arithmetic operations. The algorithm uses interaction to efficiently split the source
sequence into substrings containing exactly one deletion or insertion. Each of
these substrings is then synchronized using an optimal one-way synchronization
code based on the single-deletion correcting channel codes of Varshamov and
Tenengolts (VT codes).
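The one-way building block can be made concrete: a VT code fixes the checksum Σ i·x_i mod (n+1), and a single deletion can be undone from that checksum alone. A small sketch (Python; the names `vt_syndrome` and `restore_deletion` are ours, and this is the textbook single-deletion decoder, not the paper's full protocol):

```python
def vt_syndrome(bits):
    """Varshamov-Tenengolts checksum: sum of i * x_i (1-indexed) mod (n + 1)."""
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)

def restore_deletion(received, syndrome, n):
    """Recover the length-n sequence with VT checksum `syndrome` from
    `received`, which is that sequence with exactly one bit deleted."""
    assert len(received) == n - 1
    # Checksum deficiency caused by the deletion; for a valid input it
    # equals p*b + (#1s right of the deletion), a value in [0, n].
    d = (syndrome - sum(i * b for i, b in enumerate(received, start=1))) % (n + 1)
    ones = sum(received)
    r = list(received)
    if d <= ones:
        # A 0 was deleted: reinsert it with exactly d ones to its right.
        pos, count = len(r), 0
        while count < d:
            pos -= 1
            count += r[pos]
        r.insert(pos, 0)
    else:
        # A 1 was deleted: reinsert it with exactly d - ones - 1 zeros to its left.
        zeros, pos, seen = d - ones - 1, 0, 0
        while seen < zeros:
            seen += 1 - r[pos]
            pos += 1
        r.insert(pos, 1)
    return r

x = [1, 0, 1, 1, 0, 1, 0, 0]
a = vt_syndrome(x)
y = x[:3] + x[4:]  # delete the bit at index 3
print(restore_deletion(y, a, len(x)) == x)  # True
```

The synchronization algorithm in the paper applies this decoder to each substring that interaction has isolated down to a single indel.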
We then build on this synchronization algorithm in three different ways.
First, it is modified to work with a single round of interaction. The reduction
in the number of rounds comes at the expense of higher communication, which is
quantified. Next, we present an extension to the practically important case
where the insertions and deletions may occur in (potentially large) bursts.
Finally, we show how to synchronize the sources to within a target Hamming
distance. This feature can be used to differentiate between substitution and
indel edits. In addition to theoretical performance bounds, we provide several
validating simulation results for the proposed algorithms.
This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/TIT.2015.246663
Lossless Differential Compression for Synchronizing Arbitrary Single-Dimensional Strings
Differential compression allows expressing a modified document as differences relative to another version of the document. A compressed string then requires space proportional to the amount of changes, irrespective of the original document sizes. The purpose of this study was to determine which algorithms are suitable for universal lossless differential compression for synchronizing two arbitrary documents, either locally or remotely.
Two main problems in differential compression are finding the differences (differencing) and compactly communicating the differences (encoding). We discussed local differencing algorithms based on subsequence searching, hashtable lookups, suffix searching, and projection. We also discussed probabilistic remote algorithms based on both recursive comparison and characteristic polynomial interpolation of hashes computed from variable-length content-defined substrings. We described various heuristics for approximating optimal algorithms when arbitrarily long strings and memory limitations force discarding information. The discussion also covered compact delta encoding and in-place reconstruction. We presented results from empirical testing of the discussed algorithms.
The conclusions were that multiple algorithms need to be integrated into a hybrid implementation that heuristically chooses algorithms based on an evaluation of the input data. Algorithms based on hashtable lookups are faster on average and require less memory, but algorithms based on suffix searching find the fewest differences. Interpolating characteristic polynomials was found to be too slow for general use. With remote hash comparison, content-defined chunks and recursive comparison can reduce protocol overhead. A differential compressor should be merged with a state-of-the-art non-differential compressor to enable more compact delta encoding. Input should be processed multiple times to allow a constant space bound without significant reduction in compression efficiency. The compression efficiency of current popular synchronizers could be improved, as our empirical testing showed that a non-differential compressor produced smaller files without having access to one of the two strings.
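The hashtable-lookup family of differencing algorithms mentioned above can be illustrated with a minimal delta encoder (Python; the block size, the use of raw blocks as dictionary keys, and the COPY/ADD encoding are our illustrative choices; real tools such as rsync index blocks by a weak rolling checksum plus a strong hash instead): index the old version by fixed-size blocks, slide over the new version looking the current block up, and emit COPY operations for matches and ADD operations for unmatched literals.

```python
def make_delta(old, new, block=8):
    """Encode `new` as COPY/ADD operations against `old`."""
    # Index every aligned block of the old version by its content.
    index = {}
    for i in range(0, len(old) - block + 1, block):
        index.setdefault(old[i:i + block], i)
    delta, lit = [], bytearray()
    j = 0
    while j < len(new):
        i = index.get(new[j:j + block])
        if i is not None:
            if lit:  # flush pending literal bytes first
                delta.append(("ADD", bytes(lit)))
                lit = bytearray()
            delta.append(("COPY", i, block))
            j += block
        else:
            lit.append(new[j])
            j += 1
    if lit:
        delta.append(("ADD", bytes(lit)))
    return delta

def apply_delta(old, delta):
    """Reconstruct the new version from `old` and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "COPY":
            _, i, length = op
            out += old[i:i + length]
        else:
            out += op[1]
    return bytes(out)

old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick red fox jumps over the lazy dog!"
print(apply_delta(old, make_delta(old, new)) == new)  # True
```

The delta grows with the amount of change rather than the document size, which is the space behavior described in the abstract; suffix-searching differencers find smaller deltas at a higher cost in time and memory.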