Subpolynomial trace reconstruction for random strings and arbitrary deletion probability
The insertion-deletion channel takes as input a bit string $x \in \{0,1\}^n$ and outputs a string in which bits have been deleted and inserted independently at random. The trace reconstruction problem is to recover $x$ from many independent outputs (called ``traces'') of the insertion-deletion channel applied to $x$. We show that if $x$ is chosen uniformly at random, then $\exp(O(\log^{1/3} n))$ traces suffice to reconstruct $x$ with high probability. For the deletion channel with deletion probability $q < 1/2$, the earlier upper bound was $\exp(O(\log^{1/2} n))$. The case $q \ge 1/2$, as well as the case where insertions are allowed, had not been previously analyzed, and there the earlier upper bound was the same as for worst-case strings, i.e., $\exp(O(n^{1/3}))$. We also analyze the running time of our reconstruction algorithm.
A key ingredient in our proof is a delicate two-step alignment procedure in which we estimate the location in each trace corresponding to a given bit of $x$. The alignment is done by viewing the strings as random walks and comparing the increments in the walks associated with the input string and the trace, respectively.
Comment: Analysis of running time added and proof simplified. Alex Zhai added as author. 37 pages, 7 figures
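To make the channel concrete, here is a minimal simulator of an insertion-deletion channel. The exact insertion rule used in the paper may differ; the rule below (at most one uniform random bit inserted in each gap) is an illustrative assumption.

```python
import random

def insertion_deletion_channel(x, q_del=0.1, q_ins=0.1, rng=None):
    """Toy insertion-deletion channel (illustrative, not the paper's exact
    parameterization): before each position, insert a uniform random bit
    with probability q_ins; then keep the original bit with probability
    1 - q_del."""
    rng = rng or random.Random()
    out = []
    for bit in x:
        if rng.random() < q_ins:           # possible insertion in this gap
            out.append(rng.randint(0, 1))
        if rng.random() >= q_del:          # bit survives deletion
            out.append(bit)
    if rng.random() < q_ins:               # possible insertion at the end
        out.append(rng.randint(0, 1))
    return out

def traces(x, num, **kw):
    """Independent traces of x, as seen by the reconstruction algorithm."""
    return [insertion_deletion_channel(x, **kw) for _ in range(num)]
```

With `q_ins=0` this reduces to the pure deletion channel, so every trace is a subsequence of $x$.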
New Lower Bounds for Trace Reconstruction
We improve the lower bound on worst-case trace reconstruction from $\tilde{\Omega}(n^{1.25})$ to $\tilde{\Omega}(n^{1.5})$. As a consequence, we improve the lower bound on average-case trace reconstruction from $\tilde{\Omega}(\log^{2.25} n)$ to $\tilde{\Omega}(\log^{2.5} n)$.
Comment: 20 pages
Polynomial-time trace reconstruction in the smoothed complexity model
In the \emph{trace reconstruction problem}, an unknown source string $x \in \{0,1\}^n$ is sent through a probabilistic \emph{deletion channel} which independently deletes each bit with probability $\delta$ and concatenates the surviving bits, yielding a \emph{trace} of $x$. The problem is to reconstruct $x$ given independent traces. This problem has received much attention in recent years, both in the worst-case setting where $x$ may be an arbitrary string in $\{0,1\}^n$ \cite{DOS17,NazarovPeres17,HHP18,HL18,Chase19} and in the average-case setting where $x$ is drawn uniformly at random from $\{0,1\}^n$ \cite{PeresZhai17,HPP18,HL18,Chase19}.
This paper studies trace reconstruction in the \emph{smoothed analysis} setting, in which a ``worst-case'' string $x^{\mathrm{worst}}$ is chosen arbitrarily from $\{0,1\}^n$, and then a perturbed version $\mathbf{x}$ of $x^{\mathrm{worst}}$ is formed by independently replacing each coordinate by a uniform random bit with probability $\sigma$. The problem is to reconstruct $\mathbf{x}$ given independent traces from it.
Our main result is an algorithm which, for any constant perturbation rate $\sigma > 0$ and any constant deletion rate $\delta < 1$, uses $\mathrm{poly}(n)$ running time and traces and succeeds with high probability in reconstructing the string $\mathbf{x}$. This stands in contrast with the worst-case version of the problem, for which $\exp(O(n^{1/3}))$ is the best known time and sample complexity \cite{DOS17,NazarovPeres17}.
Our approach is based on reconstructing $\mathbf{x}$ from the multiset of its short subwords and is quite different from previous algorithms for either the worst-case or average-case versions of the problem. The heart of our work is a new $\mathrm{poly}(n)$-time procedure for reconstructing the multiset of all $O(\log n)$-length subwords of any source string, given access to $\mathrm{poly}(n)$ traces of it.
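The smoothed-analysis setup described above can be sketched in a few lines. The perturbation rate, deletion rate, and the choice of worst-case string below are illustrative, not the paper's.

```python
import random

def perturb(x_worst, sigma, rng):
    """Smoothed-analysis perturbation: each coordinate of the worst-case
    string is replaced by a fresh uniform random bit with probability sigma."""
    return [rng.randint(0, 1) if rng.random() < sigma else b for b in x_worst]

def deletion_channel(x, delta, rng):
    """Delete each bit independently with probability delta and
    concatenate the survivors."""
    return [b for b in x if rng.random() >= delta]

rng = random.Random(0)
x_worst = [0] * 50 + [1] * 50          # an adversarial-looking string
x = perturb(x_worst, sigma=0.1, rng=rng)  # the string to be reconstructed
traces = [deletion_channel(x, delta=0.2, rng=rng) for _ in range(5)]
```

The reconstruction algorithm sees only `traces`; the perturbation is what makes the smoothed problem tractable compared to reconstructing `x_worst` directly.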
New Upper Bounds for Trace Reconstruction
We improve the upper bound on trace reconstruction to $\exp(\tilde{O}(n^{1/5}))$.
Comment: 18 pages
Tree trace reconstruction using subtraces
Tree trace reconstruction aims to learn the binary node labels of a tree,
given independent samples of the tree passed through an appropriately defined
deletion channel. In recent work, Davies, R\'acz, and Rashtchian used combinatorial methods to bound the number of samples that suffice to reconstruct a complete $k$-ary tree with $n$ nodes with high probability. We provide an alternative proof of this result, which allows us to
generalize it to a broader class of tree topologies and deletion models. In our
proofs, we introduce the notion of a subtrace, which enables us to connect with
and generalize recent mean-based complex analytic algorithms for string trace
reconstruction.
Comment: 13 pages, 2 figures
Limitations of Mean-Based Algorithms for Trace Reconstruction at Small Distance
Trace reconstruction considers the task of recovering an unknown string $x \in \{0,1\}^n$ given a number of independent ``traces'', i.e., subsequences of $x$ obtained by randomly and independently deleting every symbol of $x$ with some probability $q$. The information-theoretic limit on the number of traces needed to recover a string of length $n$ is still unknown. This limit is essentially the same as the number of traces needed to determine, given strings $x$ and $y$ and traces of one of them, which string is the source. The most
studied class of algorithms for the worst-case version of the problem are
"mean-based" algorithms. These are a restricted class of distinguishers that
only use the mean value of each coordinate on the given samples. In this work
we study limitations of mean-based algorithms on strings at small Hamming or
edit distance. We show on the one hand that distinguishing strings that are
nearby in Hamming distance is "easy" for such distinguishers. On the other
hand, we show that distinguishing strings that are nearby in edit distance is
"hard" for mean-based algorithms. Along the way we also describe a connection
to the famous Prouhet-Tarry-Escott (PTE) problem, which shows a barrier to
finding explicit hard-to-distinguish strings: namely such strings would imply
explicit short solutions to the PTE problem, a well-known difficult problem in
number theory. Our techniques rely on complex analysis arguments that involve
careful trigonometric estimates, and algebraic techniques that include
applications of Descartes' rule of signs for polynomials over the reals.
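Mean-based algorithms use only the coordinate-wise means of the traces, zero-padded to the input length. A minimal sketch of that statistic (the string, deletion rate, and trace count below are illustrative):

```python
import random

def deletion_channel(x, q, rng):
    """Delete each symbol independently with probability q."""
    return [b for b in x if rng.random() >= q]

def coordinate_means(traces, n):
    """The mean-based statistic: pad each trace with zeros to length n and
    average coordinate-wise. A mean-based distinguisher decides between two
    candidate source strings using only this vector."""
    means = [0.0] * n
    for t in traces:
        for i, b in enumerate(t):
            means[i] += b
    return [m / len(traces) for m in means]

rng = random.Random(0)
x = [1, 0, 1, 1, 0, 0, 1, 0] * 4
traces = [deletion_channel(x, q=0.2, rng=rng) for _ in range(2000)]
mu = coordinate_means(traces, len(x))
```

For instance, $\mathbb{E}[\mu_0] = (1-q)\sum_j q^j x_j$, which is exactly the kind of polynomial-in-$q$ expression the complex-analytic arguments study.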
Trace Reconstruction: Generalized and Parameterized
In the beautifully simple-to-state problem of trace reconstruction, the goal is to reconstruct an unknown binary string $x$ given random ``traces'' of $x$, where each trace is generated by deleting each coordinate of $x$ independently with probability $q$. The problem is well studied both when the unknown
string is arbitrary and when it is chosen uniformly at random. For both
settings, there is still an exponential gap between upper and lower sample
complexity bounds and our understanding of the problem is still surprisingly
limited. In this paper, we consider natural parameterizations and
generalizations of this problem in an effort to attain a deeper and more
comprehensive understanding.
We prove the following results. (1) Arbitrary matrices: in the matrix version of the problem, each row and column of an unknown $n \times n$ matrix is deleted independently with probability $q$, and we give an upper bound on the number of traces that suffice for reconstructing an arbitrary matrix. This contrasts with the best known results for sequence reconstruction, where the best known upper bound is $\exp(O(n^{1/3}))$. (2) An optimal result for random matrix reconstruction: we show that $\Theta(\log n)$ traces are necessary and sufficient. This is in contrast to the problem for random sequences, where there is a super-logarithmic lower bound and the best known upper bound is $\exp(O(\log^{1/3} n))$. (3) We show that fewer traces suffice to reconstruct $k$-sparse strings, providing an improvement over the best known sequence reconstruction results when $k$ is small. (4) We show that still fewer traces suffice if $x$ is $k$-sparse and we additionally have a ``separation'' promise, specifically that the indices of 1's in $x$ all differ by a sufficiently large amount.
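The matrix deletion channel described above can be simulated directly: rows and columns are deleted independently and the survivors are concatenated. The matrix and deletion rate below are illustrative.

```python
import random

def matrix_deletion_channel(M, q, rng):
    """Matrix trace: delete each row and each column of M independently
    with probability q, then concatenate the surviving entries into a
    smaller matrix."""
    rows = [i for i in range(len(M)) if rng.random() >= q]
    cols = [j for j in range(len(M[0])) if rng.random() >= q]
    return [[M[i][j] for j in cols] for i in rows]

rng = random.Random(0)
M = [[(i * j) % 2 for j in range(8)] for i in range(8)]  # toy 8x8 binary matrix
T = matrix_deletion_channel(M, q=0.25, rng=rng)
```

Note that a single deletion event here removes an entire row or column, which is why the matrix problem behaves so differently from the sequence problem.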
Statistical Windows in Testing for the Initial Distribution of a Reversible Markov Chain
We study the problem of hypothesis testing between two discrete
distributions, where we only have access to samples after the action of a known
reversible Markov chain, playing the role of noise. We derive
instance-dependent minimax rates for the sample complexity of this problem, and show how their dependence on time is related to the spectral properties of the Markov chain. We show that there exists a wide statistical window, in terms of sample complexity, for hypothesis testing between different pairs of initial distributions. We illustrate these results in several concrete examples.
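A minimal numeric illustration of how the chain's spectrum governs testing difficulty, assuming a toy two-state reversible chain (this is not the paper's algorithm): the total-variation distance between the two pushed-forward hypotheses shrinks geometrically with the number of noise steps, at the rate of the second eigenvalue.

```python
# Two-state chains are always reversible; this one has second eigenvalue 0.7.
P = [[0.9, 0.1],
     [0.2, 0.8]]

def step(mu, P):
    """One step of the noise: push the distribution mu through the chain."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu)))
            for j in range(len(P[0]))]

def tv(mu, nu):
    """Total-variation distance, a proxy for hypothesis-testing hardness."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

mu, nu = [1.0, 0.0], [0.0, 1.0]       # the two candidate initial distributions
dists = []
for t in range(10):
    dists.append(tv(mu, nu))
    mu, nu = step(mu, P), step(nu, P)
```

Here `dists[t]` equals $0.7^t$, so the sample complexity of distinguishing the two hypotheses grows as the noise acts for longer, which is the "statistical window" phenomenon in miniature.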
Lower bounds for trace reconstruction
In the trace reconstruction problem, an unknown bit string $x \in \{0,1\}^n$ is sent through a deletion channel where each bit is deleted independently with some probability $q \in (0,1)$, yielding a contracted string $\tilde{x}$. How many i.i.d.\ samples of $\tilde{x}$ are needed to reconstruct $x$ with high probability? We prove that there exist $x, x' \in \{0,1\}^n$ such that at least $c\, n^{5/4}/\sqrt{\log n}$ traces are required to distinguish between $x$ and $x'$ for some absolute constant $c$, improving the previous lower bound of $c\, n$. Furthermore, our result improves the previously known lower bound for reconstruction of random strings from $c \log^2 n$ to $c \log^{9/4} n / \sqrt{\log \log n}$.
Comment: Minor changes. 23 pages, 3 figures
Algorithms for reconstruction over single and multiple deletion channels
Recent advances in DNA sequencing technology and DNA storage systems have
rekindled the interest in deletion channels. Multiple recent works have looked
at variants of sequence reconstruction over a single and over multiple deletion
channels, a notoriously difficult problem due to its highly combinatorial
nature. Although works in theoretical computer science have provided algorithms
which guarantee perfect reconstruction with multiple independent observations
from the deletion channel, they are only applicable in the large blocklength
regime and more restrictively, when the number of observations is also large.
Indeed, with only a few observations, perfect reconstruction of the input
sequence may not even be possible in most cases. In such situations, maximum
likelihood (ML) and maximum a posteriori (MAP) estimates for the deletion
channels are natural questions that arise and these have remained open to the
best of our knowledge. In this work, we take steps to answer the two
aforementioned questions. Specifically: 1. We show that solving for the ML
estimate over the single deletion channel (which can be cast as a discrete
optimization problem) is equivalent to solving its relaxation, a continuous
optimization problem; 2. We exactly compute the symbolwise posterior
distributions (under some assumptions on the priors) for both the single as
well as multiple deletion channels. As part of our contributions, we also
introduce tools to visualize and analyze error events, which we believe could
be useful in other related problems concerning deletion channels.
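For the single deletion channel, the trace likelihood underlying ML and MAP estimation has a clean form: every kept subset of size $m$ has probability $q^{n-m}(1-q)^m$, so $P(y \mid x)$ is that weight times the number of ways $y$ embeds into $x$ as a subsequence. A sketch using the standard counting DP (this is background for the paper's results, not its specific algorithm):

```python
def num_embeddings(y, x):
    """Count the ways y can arise from x as a kept subsequence.
    dp[j] holds the number of embeddings of y[:j] into the prefix of x
    scanned so far; iterating j backwards ensures each symbol of x is
    used at most once per embedding."""
    dp = [0] * (len(y) + 1)
    dp[0] = 1
    for xi in x:
        for j in range(len(y), 0, -1):
            if y[j - 1] == xi:
                dp[j] += dp[j - 1]
    return dp[len(y)]

def trace_likelihood(y, x, q):
    """P(trace = y | input = x) for the i.i.d. deletion channel with
    deletion probability q."""
    m, n = len(y), len(x)
    return num_embeddings(y, x) * (q ** (n - m)) * ((1 - q) ** m)
```

The ML estimate maximizes `trace_likelihood` (or its product over multiple observed traces) over candidate inputs `x`, which is the discrete optimization problem the paper relaxes.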