Subpolynomial trace reconstruction for random strings and arbitrary deletion probability
The insertion-deletion channel takes as input a bit string $x \in \{0,1\}^n$ and outputs a string in which bits have been deleted and inserted independently at random. The trace reconstruction problem is to recover $x$ from many independent outputs (called ``traces'') of the insertion-deletion channel applied to $x$. We show that if $x$ is chosen uniformly at random, then $\exp(O(\log^{1/3} n))$ traces suffice to reconstruct $x$ with high probability. For the deletion channel with deletion probability $q < 1/2$, the earlier upper bound was $\exp(O(\log^{1/2} n))$. The case $q \ge 1/2$, as well as the case where insertions are allowed, had not been previously analyzed, and there the earlier upper bound was the same as for worst-case strings, i.e., $\exp(O(n^{1/3}))$. We also analyze the running time of our reconstruction algorithm.
A key ingredient in our proof is a delicate two-step alignment procedure in which we estimate the location in each trace corresponding to a given bit of $x$. The alignment is done by viewing the strings as random walks and comparing the increments in the walks associated with the input string and the trace, respectively.
Comment: Analysis of running time added and proof simplified. Alex Zhai added as author. 37 pages, 7 figures
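To make the channel concrete, here is a minimal simulator of an insertion-deletion channel. The exact insertion rule used in the paper may differ; the rule below (at most one uniform random bit inserted in each gap) is an illustrative assumption.

```python
import random

def insertion_deletion_channel(x, q_del=0.1, q_ins=0.1, rng=None):
    """Toy insertion-deletion channel (illustrative, not the paper's exact
    parameterization): before each position, insert a uniform random bit
    with probability q_ins; then keep the original bit with probability
    1 - q_del."""
    rng = rng or random.Random()
    out = []
    for bit in x:
        if rng.random() < q_ins:           # possible insertion in this gap
            out.append(rng.randint(0, 1))
        if rng.random() >= q_del:          # bit survives deletion
            out.append(bit)
    if rng.random() < q_ins:               # possible insertion at the end
        out.append(rng.randint(0, 1))
    return out

def traces(x, num, **kw):
    """Independent traces of x, as seen by the reconstruction algorithm."""
    return [insertion_deletion_channel(x, **kw) for _ in range(num)]
```

With `q_ins=0` this reduces to the pure deletion channel, so every trace is a subsequence of $x$.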
New Lower Bounds for Trace Reconstruction
We improve the lower bound on worst-case trace reconstruction from $\tilde{\Omega}(n^{1.25})$ to $\tilde{\Omega}(n^{1.5})$. As a consequence, we improve the lower bound on average-case trace reconstruction from $\tilde{\Omega}(\log^{2.25} n)$ to $\tilde{\Omega}(\log^{2.5} n)$.
Comment: 20 pages
Polynomial-time trace reconstruction in the smoothed complexity model
In the \emph{trace reconstruction problem}, an unknown source string $x \in \{0,1\}^n$ is sent through a probabilistic \emph{deletion channel} which independently deletes each bit with probability $\delta$ and concatenates the surviving bits, yielding a \emph{trace} of $x$. The problem is to reconstruct $x$ given independent traces. This problem has received much attention in recent years, both in the worst-case setting where $x$ may be an arbitrary string in $\{0,1\}^n$ \cite{DOS17,NazarovPeres17,HHP18,HL18,Chase19} and in the average-case setting where $x$ is drawn uniformly at random from $\{0,1\}^n$ \cite{PeresZhai17,HPP18,HL18,Chase19}.
This paper studies trace reconstruction in the \emph{smoothed analysis} setting, in which a ``worst-case'' string $x^{\mathrm{worst}}$ is chosen arbitrarily from $\{0,1\}^n$, and then a perturbed version $\mathbf{x}$ of $x^{\mathrm{worst}}$ is formed by independently replacing each coordinate by a uniform random bit with probability $\sigma$. The problem is to reconstruct $\mathbf{x}$ given independent traces from it.
Our main result is an algorithm which, for any constant perturbation rate $\sigma > 0$ and any constant deletion rate $\delta < 1$, uses $\mathrm{poly}(n)$ running time and traces and succeeds with high probability in reconstructing the string $\mathbf{x}$. This stands in contrast with the worst-case version of the problem, for which $\exp(O(n^{1/3}))$ is the best known time and sample complexity \cite{DOS17,NazarovPeres17}.
Our approach is based on reconstructing $\mathbf{x}$ from the multiset of its short subwords and is quite different from previous algorithms for either the worst-case or average-case versions of the problem. The heart of our work is a new $\mathrm{poly}(n)$-time procedure for reconstructing the multiset of all $O(\log n)$-length subwords of any source string, given access to $\mathrm{poly}(n)$ traces of it.
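The smoothed-analysis setup described above can be sketched in a few lines. The perturbation rate, deletion rate, and the choice of worst-case string below are illustrative, not the paper's.

```python
import random

def perturb(x_worst, sigma, rng):
    """Smoothed-analysis perturbation: each coordinate of the worst-case
    string is replaced by a fresh uniform random bit with probability sigma."""
    return [rng.randint(0, 1) if rng.random() < sigma else b for b in x_worst]

def deletion_channel(x, delta, rng):
    """Delete each bit independently with probability delta and
    concatenate the survivors."""
    return [b for b in x if rng.random() >= delta]

rng = random.Random(0)
x_worst = [0] * 50 + [1] * 50          # an adversarial-looking string
x = perturb(x_worst, sigma=0.1, rng=rng)  # the string to be reconstructed
traces = [deletion_channel(x, delta=0.2, rng=rng) for _ in range(5)]
```

The reconstruction algorithm sees only `traces`; the perturbation is what makes the smoothed problem tractable compared to reconstructing `x_worst` directly.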
New Upper Bounds for Trace Reconstruction
We improve the upper bound on trace reconstruction to $\exp(\tilde{O}(n^{1/5}))$.
Comment: 18 pages
Tree trace reconstruction using subtraces
Tree trace reconstruction aims to learn the binary node labels of a tree,
given independent samples of the tree passed through an appropriately defined
deletion channel. In recent work, Davies, R\'acz, and Rashtchian used combinatorial methods to bound the number of samples that suffice to reconstruct a complete $k$-ary tree with $n$ nodes with high probability. We provide an alternative proof of this result, which allows us to
generalize it to a broader class of tree topologies and deletion models. In our
proofs, we introduce the notion of a subtrace, which enables us to connect with
and generalize recent mean-based complex analytic algorithms for string trace
reconstruction.
Comment: 13 pages, 2 figures
Limitations of Mean-Based Algorithms for Trace Reconstruction at Small Distance
Trace reconstruction considers the task of recovering an unknown string $x \in \{0,1\}^n$ given a number of independent ``traces'', i.e., subsequences of $x$ obtained by randomly and independently deleting every symbol of $x$ with some probability $q$. The information-theoretic limit on the number of traces needed to recover a string of length $n$ is still unknown. This limit is essentially the same as the number of traces needed to determine, given strings $x$ and $y$ and traces of one of them, which string is the source. The most
studied class of algorithms for the worst-case version of the problem are
"mean-based" algorithms. These are a restricted class of distinguishers that
only use the mean value of each coordinate on the given samples. In this work
we study limitations of mean-based algorithms on strings at small Hamming or
edit distance. We show on the one hand that distinguishing strings that are
nearby in Hamming distance is "easy" for such distinguishers. On the other
hand, we show that distinguishing strings that are nearby in edit distance is
"hard" for mean-based algorithms. Along the way we also describe a connection
to the famous Prouhet-Tarry-Escott (PTE) problem, which shows a barrier to
finding explicit hard-to-distinguish strings: namely such strings would imply
explicit short solutions to the PTE problem, a well-known difficult problem in
number theory. Our techniques rely on complex analysis arguments that involve
careful trigonometric estimates, and algebraic techniques that include
applications of Descartes' rule of signs for polynomials over the reals.
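Mean-based algorithms use only the coordinate-wise means of the traces, zero-padded to the input length. A minimal sketch of that statistic (the string, deletion rate, and trace count below are illustrative):

```python
import random

def deletion_channel(x, q, rng):
    """Delete each symbol independently with probability q."""
    return [b for b in x if rng.random() >= q]

def coordinate_means(traces, n):
    """The mean-based statistic: pad each trace with zeros to length n and
    average coordinate-wise. A mean-based distinguisher decides between two
    candidate source strings using only this vector."""
    means = [0.0] * n
    for t in traces:
        for i, b in enumerate(t):
            means[i] += b
    return [m / len(traces) for m in means]

rng = random.Random(0)
x = [1, 0, 1, 1, 0, 0, 1, 0] * 4
traces = [deletion_channel(x, q=0.2, rng=rng) for _ in range(2000)]
mu = coordinate_means(traces, len(x))
```

For instance, $\mathbb{E}[\mu_0] = (1-q)\sum_j q^j x_j$, which is exactly the kind of polynomial-in-$q$ expression the complex-analytic arguments study.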
Trace Reconstruction: Generalized and Parameterized
In the beautifully simple-to-state problem of trace reconstruction, the goal is to reconstruct an unknown binary string $x$ given random ``traces'' of $x$, where each trace is generated by deleting each coordinate of $x$ independently with probability $q$. The problem is well studied both when the unknown
string is arbitrary and when it is chosen uniformly at random. For both
settings, there is still an exponential gap between upper and lower sample
complexity bounds and our understanding of the problem is still surprisingly
limited. In this paper, we consider natural parameterizations and
generalizations of this problem in an effort to attain a deeper and more
comprehensive understanding.
We prove the following results. (1) Arbitrary matrices: in the matrix version of the problem, each row and column of an unknown $n \times n$ matrix is deleted independently with probability $q$, and we give an upper bound on the number of traces that suffice for reconstructing an arbitrary matrix. This contrasts with the best known results for sequence reconstruction, where the best known upper bound is $\exp(O(n^{1/3}))$. (2) An optimal result for random matrix reconstruction: we show that $\Theta(\log n)$ traces are necessary and sufficient. This is in contrast to the problem for random sequences, where there is a super-logarithmic lower bound and the best known upper bound is $\exp(O(\log^{1/3} n))$. (3) We show that fewer traces suffice to reconstruct $k$-sparse strings, providing an improvement over the best known sequence reconstruction results when $k$ is small. (4) We show that still fewer traces suffice if $x$ is $k$-sparse and we additionally have a ``separation'' promise, specifically that the indices of 1's in $x$ all differ by a sufficiently large amount.
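The matrix deletion channel described above can be simulated directly: rows and columns are deleted independently and the survivors are concatenated. The matrix and deletion rate below are illustrative.

```python
import random

def matrix_deletion_channel(M, q, rng):
    """Matrix trace: delete each row and each column of M independently
    with probability q, then concatenate the surviving entries into a
    smaller matrix."""
    rows = [i for i in range(len(M)) if rng.random() >= q]
    cols = [j for j in range(len(M[0])) if rng.random() >= q]
    return [[M[i][j] for j in cols] for i in rows]

rng = random.Random(0)
M = [[(i * j) % 2 for j in range(8)] for i in range(8)]  # toy 8x8 binary matrix
T = matrix_deletion_channel(M, q=0.25, rng=rng)
```

Note that a single deletion event here removes an entire row or column, which is why the matrix problem behaves so differently from the sequence problem.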
Statistical Windows in Testing for the Initial Distribution of a Reversible Markov Chain
We study the problem of hypothesis testing between two discrete
distributions, where we only have access to samples after the action of a known
reversible Markov chain, playing the role of noise. We derive
instance-dependent minimax rates for the sample complexity of this problem, and show how their dependence on time is related to the spectral properties of the Markov chain. We show that there exists a wide statistical window, in terms of sample complexity, for hypothesis testing between different pairs of initial distributions. We illustrate these results in several concrete examples.
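A minimal numeric illustration of how the chain's spectrum governs testing difficulty, assuming a toy two-state reversible chain (this is not the paper's algorithm): the total-variation distance between the two pushed-forward hypotheses shrinks geometrically with the number of noise steps, at the rate of the second eigenvalue.

```python
# Two-state chains are always reversible; this one has second eigenvalue 0.7.
P = [[0.9, 0.1],
     [0.2, 0.8]]

def step(mu, P):
    """One step of the noise: push the distribution mu through the chain."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu)))
            for j in range(len(P[0]))]

def tv(mu, nu):
    """Total-variation distance, a proxy for hypothesis-testing hardness."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

mu, nu = [1.0, 0.0], [0.0, 1.0]       # the two candidate initial distributions
dists = []
for t in range(10):
    dists.append(tv(mu, nu))
    mu, nu = step(mu, P), step(nu, P)
```

Here `dists[t]` equals $0.7^t$, so the sample complexity of distinguishing the two hypotheses grows as the noise acts for longer, which is the "statistical window" phenomenon in miniature.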
Lower bounds for trace reconstruction
In the trace reconstruction problem, an unknown bit string $x \in \{0,1\}^n$ is sent through a deletion channel where each bit is deleted independently with some probability $q \in (0,1)$, yielding a contracted string $\tilde{x}$. How many i.i.d.\ samples of $\tilde{x}$ are needed to reconstruct $x$ with high probability? We prove that there exist $x, x' \in \{0,1\}^n$ such that at least $c\, n^{5/4}/\sqrt{\log n}$ traces are required to distinguish between $x$ and $x'$ for some absolute constant $c$, improving the previous lower bound of $c\, n$. Furthermore, our result improves the previously known lower bound for reconstruction of random strings from $c \log^2 n$ to $c \log^{9/4} n / \sqrt{\log \log n}$.
Comment: Minor changes. 23 pages, 3 figures
Algorithms for reconstruction over single and multiple deletion channels
Recent advances in DNA sequencing technology and DNA storage systems have
rekindled the interest in deletion channels. Multiple recent works have looked
at variants of sequence reconstruction over a single and over multiple deletion
channels, a notoriously difficult problem due to its highly combinatorial
nature. Although works in theoretical computer science have provided algorithms
which guarantee perfect reconstruction with multiple independent observations
from the deletion channel, they are only applicable in the large blocklength
regime and more restrictively, when the number of observations is also large.
Indeed, with only a few observations, perfect reconstruction of the input
sequence may not even be possible in most cases. In such situations, maximum
likelihood (ML) and maximum a posteriori (MAP) estimates for the deletion
channels are natural questions that arise and these have remained open to the
best of our knowledge. In this work, we take steps to answer the two
aforementioned questions. Specifically: 1. We show that solving for the ML
estimate over the single deletion channel (which can be cast as a discrete
optimization problem) is equivalent to solving its relaxation, a continuous
optimization problem; 2. We exactly compute the symbolwise posterior
distributions (under some assumptions on the priors) for both the single as
well as multiple deletion channels. As part of our contributions, we also
introduce tools to visualize and analyze error events, which we believe could
be useful in other related problems concerning deletion channels.
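For the single deletion channel, the trace likelihood underlying ML and MAP estimation has a clean form: every kept subset of size $m$ has probability $q^{n-m}(1-q)^m$, so $P(y \mid x)$ is that weight times the number of ways $y$ embeds into $x$ as a subsequence. A sketch using the standard counting DP (this is background for the paper's results, not its specific algorithm):

```python
def num_embeddings(y, x):
    """Count the ways y can arise from x as a kept subsequence.
    dp[j] holds the number of embeddings of y[:j] into the prefix of x
    scanned so far; iterating j backwards ensures each symbol of x is
    used at most once per embedding."""
    dp = [0] * (len(y) + 1)
    dp[0] = 1
    for xi in x:
        for j in range(len(y), 0, -1):
            if y[j - 1] == xi:
                dp[j] += dp[j - 1]
    return dp[len(y)]

def trace_likelihood(y, x, q):
    """P(trace = y | input = x) for the i.i.d. deletion channel with
    deletion probability q."""
    m, n = len(y), len(x)
    return num_embeddings(y, x) * (q ** (n - m)) * ((1 - q) ** m)
```

The ML estimate maximizes `trace_likelihood` (or its product over multiple observed traces) over candidate inputs `x`, which is the discrete optimization problem the paper relaxes.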