18 research outputs found

    Subpolynomial trace reconstruction for random strings and arbitrary deletion probability

    The insertion-deletion channel takes as input a bit string $\mathbf{x}\in\{0,1\}^n$, and outputs a string where bits have been deleted and inserted independently at random. The trace reconstruction problem is to recover $\mathbf{x}$ from many independent outputs (called "traces") of the insertion-deletion channel applied to $\mathbf{x}$. We show that if $\mathbf{x}$ is chosen uniformly at random, then $\exp(O(\log^{1/3} n))$ traces suffice to reconstruct $\mathbf{x}$ with high probability. For the deletion channel with deletion probability $q < 1/2$ the earlier upper bound was $\exp(O(\log^{1/2} n))$. The case of $q \geq 1/2$, or the case where insertions are allowed, had not been previously analyzed, and therefore the earlier upper bound was as for worst-case strings, i.e., $\exp(O(n^{1/3}))$. We also show that our reconstruction algorithm runs in $n^{1+o(1)}$ time. A key ingredient in our proof is a delicate two-step alignment procedure where we estimate the location in each trace corresponding to a given bit of $\mathbf{x}$. The alignment is done by viewing the strings as random walks and comparing the increments in the walks associated with the input string and the trace, respectively.
    Comment: Analysis of running time added and proof simplified. Alex Zhai added as author. 37 pages, 7 figures.
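    As a rough illustration of the channel model described above, one trace can be simulated as follows. This is a minimal sketch of one common variant of the insertion-deletion model; the function and parameter names are illustrative, and the paper's exact insertion mechanism may differ.

```python
import random

def insertion_deletion_trace(x, q_del, q_ins, rng=random):
    """Simulate one trace: each bit of x is deleted with probability
    q_del, and before each position (and at the end) a uniform random
    bit is inserted with probability q_ins."""
    out = []
    for bit in x:
        if rng.random() < q_ins:      # insert a uniform random bit
            out.append(rng.randint(0, 1))
        if rng.random() >= q_del:     # keep the original bit
            out.append(bit)
    if rng.random() < q_ins:          # possible insertion at the end
        out.append(rng.randint(0, 1))
    return out

# Example: many independent traces of the same source string
x = [random.randint(0, 1) for _ in range(20)]
traces = [insertion_deletion_trace(x, q_del=0.3, q_ins=0.1) for _ in range(5)]
```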

    New Lower Bounds for Trace Reconstruction

    We improve the lower bound on worst-case trace reconstruction from $\Omega\left(\frac{n^{5/4}}{\sqrt{\log n}}\right)$ to $\Omega\left(\frac{n^{3/2}}{\log^{7} n}\right)$. As a consequence, we improve the lower bound on average-case trace reconstruction from $\Omega\left(\frac{\log^{9/4}n}{\sqrt{\log\log n}}\right)$ to $\Omega\left(\frac{\log^{5/2}n}{(\log\log n)^{7}}\right)$.
    Comment: 20 pages.

    Polynomial-time trace reconstruction in the smoothed complexity model

    In the trace reconstruction problem, an unknown source string $x \in \{0,1\}^n$ is sent through a probabilistic deletion channel which independently deletes each bit with probability $\delta$ and concatenates the surviving bits, yielding a trace of $x$. The problem is to reconstruct $x$ given independent traces. This problem has received much attention in recent years, both in the worst-case setting where $x$ may be an arbitrary string in $\{0,1\}^n$ [DOS17, NazarovPeres17, HHP18, HL18, Chase19] and in the average-case setting where $x$ is drawn uniformly at random from $\{0,1\}^n$ [PeresZhai17, HPP18, HL18, Chase19]. This paper studies trace reconstruction in the smoothed analysis setting, in which a "worst-case" string $x^{\mathrm{worst}}$ is chosen arbitrarily from $\{0,1\}^n$, and then a perturbed version $\mathbf{x}$ of $x^{\mathrm{worst}}$ is formed by independently replacing each coordinate by a uniform random bit with probability $\sigma$. The problem is to reconstruct $\mathbf{x}$ given independent traces from it. Our main result is an algorithm which, for any constant perturbation rate $0<\sigma<1$ and any constant deletion rate $0<\delta<1$, uses $\mathrm{poly}(n)$ running time and traces and succeeds with high probability in reconstructing the string $\mathbf{x}$. This stands in contrast with the worst-case version of the problem, for which $\exp(O(n^{1/3}))$ is the best known time and sample complexity [DOS17, NazarovPeres17]. Our approach is based on reconstructing $\mathbf{x}$ from the multiset of its short subwords and is quite different from previous algorithms for either the worst-case or average-case versions of the problem. The heart of our work is a new $\mathrm{poly}(n)$-time procedure for reconstructing the multiset of all $O(\log n)$-length subwords of any source string $x \in \{0,1\}^n$ given access to traces of $x$.
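    The smoothed-analysis input model above can be sketched in a few lines: perturb a worst-case string coordinate-wise, then pass it through the standard deletion channel. This is an illustrative sketch only; the names are mine, not the paper's.

```python
import random

def smooth(x_worst, sigma, rng=random):
    """Replace each coordinate of a worst-case string by a uniform
    random bit, independently with probability sigma."""
    return [rng.randint(0, 1) if rng.random() < sigma else b for b in x_worst]

def deletion_trace(x, delta, rng=random):
    """Standard deletion channel: each bit survives with prob 1 - delta."""
    return [b for b in x if rng.random() >= delta]

x_worst = [0] * 30                      # an adversarially chosen input
x = smooth(x_worst, sigma=0.2)          # the perturbed string to reconstruct
traces = [deletion_trace(x, delta=0.4) for _ in range(10)]
```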

    New Upper Bounds for Trace Reconstruction

    We improve the upper bound on trace reconstruction to $\exp(\widetilde{O}(n^{1/5}))$.
    Comment: 18 pages.

    Tree trace reconstruction using subtraces

    Tree trace reconstruction aims to learn the binary node labels of a tree, given independent samples of the tree passed through an appropriately defined deletion channel. In recent work, Davies, Rácz, and Rashtchian used combinatorial methods to show that $\exp(O(k \log_{k} n))$ samples suffice to reconstruct a complete $k$-ary tree with $n$ nodes with high probability. We provide an alternative proof of this result, which allows us to generalize it to a broader class of tree topologies and deletion models. In our proofs, we introduce the notion of a subtrace, which enables us to connect with and generalize recent mean-based complex analytic algorithms for string trace reconstruction.
    Comment: 13 pages, 2 figures.

    Limitations of Mean-Based Algorithms for Trace Reconstruction at Small Distance

    Trace reconstruction considers the task of recovering an unknown string $x \in \{0,1\}^n$ given a number of independent "traces", i.e., subsequences of $x$ obtained by randomly and independently deleting every symbol of $x$ with some probability $p$. The information-theoretic limit on the number of traces needed to recover a string of length $n$ is still unknown. This limit is essentially the same as the number of traces needed to determine, given strings $x$ and $y$ and traces of one of them, which string is the source. The most studied class of algorithms for the worst-case version of the problem are "mean-based" algorithms. These are a restricted class of distinguishers that only use the mean value of each coordinate on the given samples. In this work we study limitations of mean-based algorithms on strings at small Hamming or edit distance. We show on the one hand that distinguishing strings that are nearby in Hamming distance is "easy" for such distinguishers. On the other hand, we show that distinguishing strings that are nearby in edit distance is "hard" for mean-based algorithms. Along the way we also describe a connection to the famous Prouhet-Tarry-Escott (PTE) problem, which shows a barrier to finding explicit hard-to-distinguish strings: namely, such strings would imply explicit short solutions to the PTE problem, a well-known difficult problem in number theory. Our techniques rely on complex analysis arguments that involve careful trigonometric estimates, and algebraic techniques that include applications of Descartes' rule of signs for polynomials over the reals.
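    The mean-based statistic described above can be sketched concretely: zero-pad each trace to length $n$ and average coordinate-wise; a mean-based distinguisher is allowed to use only this vector. This is a sketch under the zero-padding convention commonly used in this literature, with illustrative names.

```python
import random

def deletion_trace(x, p, rng=random):
    """Delete each symbol of x independently with probability p."""
    return [b for b in x if rng.random() >= p]

def mean_trace(x, p, num_traces, rng=random):
    """Coordinate-wise mean of zero-padded traces: the only statistic
    a mean-based distinguisher may use."""
    n = len(x)
    sums = [0.0] * n
    for _ in range(num_traces):
        t = deletion_trace(x, p, rng)
        for i, b in enumerate(t):   # positions past len(t) count as 0
            sums[i] += b
    return [s / num_traces for s in sums]
```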

    Trace Reconstruction: Generalized and Parameterized

    In the beautifully simple-to-state problem of trace reconstruction, the goal is to reconstruct an unknown binary string $x$ given random "traces" of $x$, where each trace is generated by deleting each coordinate of $x$ independently with probability $p<1$. The problem is well studied both when the unknown string is arbitrary and when it is chosen uniformly at random. For both settings, there is still an exponential gap between upper and lower sample complexity bounds, and our understanding of the problem is still surprisingly limited. In this paper, we consider natural parameterizations and generalizations of this problem in an effort to attain a deeper and more comprehensive understanding. We prove that $\exp(O(n^{1/4} \sqrt{\log n}))$ traces suffice for reconstructing arbitrary matrices. In the matrix version of the problem, each row and column of an unknown $\sqrt{n}\times\sqrt{n}$ matrix is deleted independently with probability $p$. Our result contrasts with the best known results for sequence reconstruction, where the best known upper bound is $\exp(O(n^{1/3}))$. We obtain an optimal result for random matrix reconstruction: we show that $\Theta(\log n)$ traces are necessary and sufficient. This is in contrast to the problem for random sequences, where there is a super-logarithmic lower bound and the best known upper bound is $\exp(O(\log^{1/3} n))$. We show that $\exp(O(k^{1/3}\log^{2/3} n))$ traces suffice to reconstruct $k$-sparse strings, providing an improvement over the best known sequence reconstruction results when $k = o(n/\log^2 n)$. Finally, we show that $\mathrm{poly}(n)$ traces suffice if $x$ is $k$-sparse and we additionally have a "separation" promise, specifically that the indices of 1's in $x$ all differ by $\Omega(k \log n)$.
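    The matrix deletion channel described above deletes whole rows and columns rather than individual bits. A minimal sketch, with indexing conventions of my own choosing:

```python
import random

def matrix_trace(M, p, rng=random):
    """Delete each row and each column of M independently with
    probability p, keeping the submatrix of surviving entries."""
    rows = [i for i in range(len(M)) if rng.random() >= p]
    cols = [j for j in range(len(M[0])) if rng.random() >= p]
    return [[M[i][j] for j in cols] for i in rows]
```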

    Statistical Windows in Testing for the Initial Distribution of a Reversible Markov Chain

    We study the problem of hypothesis testing between two discrete distributions, where we only have access to samples after the action of a known reversible Markov chain, which plays the role of noise. We derive instance-dependent minimax rates for the sample complexity of this problem, and show how its dependence on time is related to the spectral properties of the Markov chain. We show that there exists a wide statistical window, in terms of sample complexity, for hypothesis testing between different pairs of initial distributions. We illustrate these results in several concrete examples.

    Lower bounds for trace reconstruction

    In the trace reconstruction problem, an unknown bit string $\mathbf{x}\in\{0,1\}^n$ is sent through a deletion channel where each bit is deleted independently with some probability $q\in(0,1)$, yielding a contracted string $\widetilde{\mathbf{x}}$. How many i.i.d. samples of $\widetilde{\mathbf{x}}$ are needed to reconstruct $\mathbf{x}$ with high probability? We prove that there exist $\mathbf{x},\mathbf{y}\in\{0,1\}^n$ such that at least $c\,n^{5/4}/\sqrt{\log n}$ traces are required to distinguish between $\mathbf{x}$ and $\mathbf{y}$ for some absolute constant $c$, improving the previous lower bound of $c\,n$. Furthermore, our result improves the previously known lower bound for reconstruction of random strings from $c \log^2 n$ to $c \log^{9/4}n/\sqrt{\log\log n}$.
    Comment: Minor changes. 23 pages, 3 figures.

    Algorithms for reconstruction over single and multiple deletion channels

    Recent advances in DNA sequencing technology and DNA storage systems have rekindled interest in deletion channels. Multiple recent works have looked at variants of sequence reconstruction over a single and over multiple deletion channels, a notoriously difficult problem due to its highly combinatorial nature. Although works in theoretical computer science have provided algorithms which guarantee perfect reconstruction with multiple independent observations from the deletion channel, they are only applicable in the large-blocklength regime and, more restrictively, when the number of observations is also large. Indeed, with only a few observations, perfect reconstruction of the input sequence may not even be possible in most cases. In such situations, maximum likelihood (ML) and maximum a posteriori (MAP) estimates for the deletion channels are natural questions that arise, and these have remained open to the best of our knowledge. In this work, we take steps to answer the two aforementioned questions. Specifically: 1. We show that solving for the ML estimate over the single deletion channel (which can be cast as a discrete optimization problem) is equivalent to solving its relaxation, a continuous optimization problem; 2. We exactly compute the symbolwise posterior distributions (under some assumptions on the priors) for both the single as well as multiple deletion channels. As part of our contributions, we also introduce tools to visualize and analyze error events, which we believe could be useful in other related problems concerning deletion channels.
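    For the single deletion channel, the likelihood of a trace $y$ given an input $x$ of fixed length $n$ is proportional to the number of ways $y$ embeds in $x$ as a subsequence, since every deletion pattern of a given size is equally likely: $P(y \mid x) = N(x,y)\, q^{n-|y|} (1-q)^{|y|}$. The standard counting DP below is a hedged sketch of this building block for likelihood computations, not the paper's relaxation or posterior algorithm itself.

```python
def num_subsequence_embeddings(x, y):
    """Count the deletion patterns of x that yield y, i.e. the number
    of ways y occurs in x as a subsequence. Over inputs of a fixed
    length, the ML input maximizes this count, since the q-dependent
    factor of the likelihood depends only on the lengths."""
    m = len(y)
    # dp[j] = number of embeddings of y[:j] into the prefix of x seen so far
    dp = [0] * (m + 1)
    dp[0] = 1
    for c in x:
        for j in range(m, 0, -1):   # reverse order: each x-symbol used once
            if y[j - 1] == c:
                dp[j] += dp[j - 1]
    return dp[m]
```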