String Reconstruction from Substring Compositions
Motivated by mass-spectrometry protein sequencing, we consider a
simply-stated problem of reconstructing a string from the multiset of its
substring compositions. We show that all strings of length 7, one less than a
prime, or one less than twice a prime, can be reconstructed uniquely up to
reversal. For all other lengths we show that reconstruction is not always
possible and provide sometimes-tight bounds on the largest number of strings
with given substring compositions. The lower bounds are derived by
combinatorial arguments and the upper bounds by algebraic considerations that
precisely characterize the set of strings with the same substring compositions
in terms of the factorization of bivariate polynomials. The problem can be
viewed as a combinatorial simplification of the turnpike problem, and its
solution may shed light on this long-standing problem as well. Using well-known
results on transience of multi-dimensional random walks, we also provide a
reconstruction algorithm that reconstructs random strings over alphabets of
size in optimal near-quadratic time.
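To make the object in this abstract concrete, here is a minimal sketch that computes the multiset of substring compositions of a string, where a composition records how many times each symbol occurs in a contiguous substring, ignoring order. The function name `composition_multiset` and the tuple encoding of a composition are assumptions made for illustration; this is a brute-force enumeration, not the reconstruction algorithm described in the paper.

```python
from collections import Counter


def composition_multiset(s):
    """Return the multiset of compositions of all contiguous substrings of s.

    A composition is stored as a sorted tuple of (symbol, count) pairs,
    so the order of symbols inside the substring is discarded.
    """
    result = Counter()
    n = len(s)
    for i in range(n):
        counts = Counter()
        for j in range(i, n):
            # Extend the substring s[i:j+1] by one symbol and record it.
            counts[s[j]] += 1
            result[tuple(sorted(counts.items()))] += 1
    return result


if __name__ == "__main__":
    # A string and its reversal always share the same multiset,
    # which is why uniqueness can only hold "up to reversal".
    print(composition_multiset("ABC") == composition_multiset("CBA"))
```

A string of length n has n(n+1)/2 contiguous substrings, so the multiset always has that many elements counted with multiplicity.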
Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies
Existing sequence alignment algorithms use heuristic scoring schemes which
cannot be used as objective distance metrics. Therefore one relies on measures
like the p- or log-det distances, or makes explicit, and often simplistic,
assumptions about sequence evolution. Information theory provides an
alternative, in the form of mutual information (MI) which is, in principle, an
objective and model-independent similarity measure. MI can be estimated by
concatenating and zipping sequences, yielding thereby the "normalized
compression distance". So far this has produced promising results, but with
uncontrolled errors. We describe a simple approach to get robust estimates of
MI from global pairwise alignments. Using standard alignment algorithms, this
gives for animal mitochondrial DNA estimates that are strikingly close to
estimates obtained from the alignment-free methods mentioned above. Our main
result uses algorithmic (Kolmogorov) information theory, but we show that
similar results can also be obtained from Shannon theory. Because
it is not additive, normalized compression distance is not an optimal metric
for phylogenetics, but we propose a simple modification that overcomes the
issue of additivity. We test several versions of our MI based distance measures
on a large number of randomly chosen quartets and demonstrate that they all
perform better than traditional measures like the Kimura or log-det (resp.
paralinear) distances. Even a simplified version based on single letter Shannon
entropies, which can be easily incorporated in existing software packages, gave
superior results throughout the entire animal kingdom. But we see the main
virtue of our approach in a more general way. For example, it can also help to
judge the relative merits of different alignment algorithms, by estimating the
significance of specific alignments.
Comment: 19 pages + 16 pages of supplementary material
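The "concatenating and zipping" estimate mentioned in this abstract can be sketched as follows. This is a minimal illustration using the standard normalized compression distance formula with `zlib` as the compressor; the helper names `compressed_size` and `ncd`, the choice of compressor, and the toy sequences are assumptions, and this is not the alignment-based estimator the authors propose.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is the concatenation.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"ACGTTGCA" * 100          # toy, highly repetitive sequence
    c = bytes(range(256)) * 4      # unrelated byte pattern
    print(ncd(a, a), ncd(a, c))    # the first value should be the smaller one
```

Values near 0 indicate that one sequence adds little information given the other, and values near 1 indicate that the sequences share almost nothing the compressor can exploit. A general-purpose compressor only approximates the ideal (Kolmogorov) quantity, which is one source of the uncontrolled errors the abstract refers to.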
From Hammersley's lines to Hammersley's trees
We construct a stationary random tree, embedded in the upper half plane, with
prescribed offspring distribution and whose vertices are the atoms of a unit
Poisson point process. This process, which we call Hammersley's tree process,
extends the usual Hammersley's line process. Just as Hammersley's process is
related to the problem of the longest increasing subsequence, this model also
has a combinatorial interpretation: it counts the number of heaps (i.e.
increasing trees) required to store a random permutation. This problem was
initially considered by Byers et al. (2011) and Istrate and Bonchis (2015) in
the case of regular trees. We show, in particular, that the number of heaps
grows logarithmically with the size of the permutation.
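The combinatorial problem in this abstract, storing a permutation in as few heaps (increasing trees) as possible, can be illustrated with a simple greedy rule: attach each incoming value under the largest smaller value that still has a free child position, and start a new heap when no such value exists. The function name `count_heaps_greedy` and the parameter `k` (maximum number of children per node) are assumptions for illustration. This sketch only counts the heaps the greedy rule produces and is not claimed to be the construction analyzed in the paper.

```python
from bisect import bisect_left, insort


def count_heaps_greedy(perm, k=2):
    """Greedily pack perm into increasing trees with at most k children per node.

    Returns the number of trees (heaps) the greedy rule opens.
    """
    slots = []  # sorted list: one entry per free child position, keyed by parent value
    heaps = 0
    for y in perm:
        i = bisect_left(slots, y)  # slots[:i] belong to parents smaller than y
        if i > 0:
            slots.pop(i - 1)       # attach y under the largest smaller parent
        else:
            heaps += 1             # no valid parent: y becomes a new root
        for _ in range(k):
            insort(slots, y)       # y now offers k free child positions
    return heaps


if __name__ == "__main__":
    print(count_heaps_greedy([1, 3, 2], k=2))
    print(count_heaps_greedy([3, 2, 1], k=2))
```

With `k=1` every tree is a single increasing chain, which corresponds to the line case related to the longest increasing subsequence problem.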
Masur's criterion does not hold in Thurston metric
We construct a counterexample to Masur's criterion in the setting of
Teichmüller space with the Thurston metric. To do so, we find a minimal,
non-uniquely ergodic lamination on a seven-times punctured sphere
with uniformly bounded annular projection coefficients. Then we show that
the geodesic in the corresponding Teichmüller space that converges to
this lamination stays in the thick part for the whole time.
Comment: 16 pages, 9 figures. Comments are welcome.