String Reconstruction from Substring Compositions

Author: Acharya Jayadev
Das Hirakendu
Milenkovic Olgica
Orlitsky Alon
Pan Shengjun
Publication date: 10/03/2014

Motivated by mass-spectrometry protein sequencing, we consider a simply-stated problem of reconstructing a string from the multiset of its substring compositions. We show that all strings of length 7, one less than a prime, or one less than twice a prime, can be reconstructed uniquely up to reversal. For all other lengths we show that reconstruction is not always possible and provide sometimes-tight bounds on the largest number of strings with given substring compositions. The lower bounds are derived by combinatorial arguments and the upper bounds by algebraic considerations that precisely characterize the set of strings with the same substring compositions in terms of the factorization of bivariate polynomials. The problem can be viewed as a combinatorial simplification of the turnpike problem, and its solution may shed light on this long-standing problem as well. Using well known results on transience of multi-dimensional random walks, we also provide a reconstruction algorithm that reconstructs random strings over alphabets of size $\ge 4$ in optimal near-quadratic time.
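To make the notion of "substring compositions" concrete, here is a minimal brute-force sketch in Python. It is not the reconstruction algorithm from the paper: it simply enumerates all binary strings of a given length and groups them by their composition multiset. The function names (`composition_multiset`, `uniquely_reconstructable`) and the restriction to the alphabet `"01"` are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def composition_multiset(s):
    """Return the multiset of compositions of all contiguous substrings of s.

    A composition records how often each symbol occurs in a substring,
    ignoring order. The multiset is returned as a hashable frozenset of
    (composition, multiplicity) pairs.
    """
    comps = Counter()
    for i in range(len(s)):
        counts = Counter()
        for j in range(i, len(s)):
            counts[s[j]] += 1
            comps[tuple(sorted(counts.items()))] += 1
    return frozenset(comps.items())


def uniquely_reconstructable(n, alphabet="01"):
    """Brute-force check: is every length-n string determined, up to
    reversal, by its composition multiset? Exponential in n."""
    groups = {}
    for chars in product(alphabet, repeat=n):
        s = "".join(chars)
        # Identify a string with its reversal by keeping the smaller one.
        canonical = min(s, s[::-1])
        groups.setdefault(composition_multiset(s), set()).add(canonical)
    return all(len(group) == 1 for group in groups.values())


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, uniquely_reconstructable(n))
```

For example, the length-8 strings `01001101` and `01101001` have the same composition multiset although neither is the reversal of the other, which is consistent with 8 being neither 7, one less than a prime, nor one less than twice a prime.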

arXiv.org e-Print Archive

CiteSeerX

Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies

Author: A Kraskov
A Milosavljević
G Navarro
J Felsenstein
J Lake
J Rissanen
J Thompson
J Varre
Konrad Scheffler
L Allison
M Brudno
M Cao
M Li
M Mahoney
M Nei
M Steel
Maya Paczuski
N Bray
N Saitou
Orion Penner
P Buneman
P Lockhart
P Viola
Peter Grassberger
R Cilibrasi
R Durbin
S Altschul
S McGinnis
S Vinga
T Cover
T Lassmann
W Press
X Chen
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 19/08/2010

Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI) which is, in principle, an objective and model-independent similarity measure. MI can be estimated by concatenating and zipping sequences, thereby yielding the "normalized compression distance". So far this has produced promising results, but with uncontrolled errors. We describe a simple approach to get robust estimates of MI from global pairwise alignments. Using standard alignment algorithms, this gives for animal mitochondrial DNA estimates that are strikingly close to estimates obtained from the alignment-free methods mentioned above. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. Because it is not additive, normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances. Even a simplified version based on single-letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments.
Comment: 19 pages + 16 pages of supplementary material
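The "concatenating and zipping" estimate mentioned in the abstract can be sketched as follows. This is only an illustration of the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length. Using `zlib` as the compressor is an assumption made here; the abstract does not say which compressor the authors used, and the alignment-based MI estimate they propose is not shown. The helper names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the sequences share most of their information;
    values near 1 suggest they share little. With a real compressor the
    result can fall slightly outside [0, 1].
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    a = "".join(rng.choices("ACGT", k=2000))
    # b: a copy of a with a handful of point substitutions.
    b_list = list(a)
    for pos in rng.sample(range(len(a)), 20):
        b_list[pos] = rng.choice("ACGT")
    b = "".join(b_list)
    # c: an unrelated random sequence of the same length.
    c = "".join(rng.choices("ACGT", k=2000))
    print("related  :", ncd(a.encode(), b.encode()))
    print("unrelated:", ncd(a.encode(), c.encode()))
```

A general-purpose compressor such as zlib only sees matches inside a limited window, so for long genomes a different compressor would likely be needed; that limitation is one source of the "uncontrolled errors" a compression-based estimate can carry.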

arXiv.org e-Print Archive

Directory of Open Access Journals

From Hammersley's lines to Hammersley's trees

Author: Basdevant Anne-Laure
Gerin Lucas
Gouere Jean-Baptiste
Singh Arvind
Publication date: 10/05/2016

We construct a stationary random tree, embedded in the upper half plane, with prescribed offspring distribution and whose vertices are the atoms of a unit Poisson point process. This process, which we call Hammersley's tree process, extends the usual Hammersley's line process. Just as Hammersley's process is related to the problem of the longest increasing subsequence, this model also has a combinatorial interpretation: it counts the number of heaps (i.e. increasing trees) required to store a random permutation. This problem was initially considered by Byers et al. (2011) and Istrate and Bonchis (2015) in the case of regular trees. We show, in particular, that the number of heaps grows logarithmically with the size of the permutation.
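The combinatorial interpretation can be illustrated with a small greedy sketch for regular trees in which every node has at most `arity` children. Each new value is attached below the largest already-stored value that is smaller than it and still has a free child slot; if no such node exists, it starts a new heap. Treat the greedy rule and the function name `greedy_heap_count` as assumptions made for illustration, not as the construction used in the paper. For `arity=1` the heaps are increasing chains, and the rule reduces to the classical patience-sorting partition into increasing subsequences.

```python
import bisect


def greedy_heap_count(perm, arity=2):
    """Count the heaps (increasing trees) a greedy rule uses to store perm.

    perm  : sequence of distinct comparable values, read left to right
    arity : maximum number of children per node (assumed regular trees)
    """
    open_values = []  # sorted values of nodes that still have a free slot
    free_slots = {}   # value -> number of remaining child slots
    heaps = 0
    for x in perm:
        i = bisect.bisect_left(open_values, x)
        if i == 0:
            # No open node is smaller than x: x becomes the root of a new heap.
            heaps += 1
        else:
            # Attach x below the largest open node smaller than x.
            parent = open_values[i - 1]
            free_slots[parent] -= 1
            if free_slots[parent] == 0:
                del open_values[i - 1]
                del free_slots[parent]
        bisect.insort(open_values, x)
        free_slots[x] = arity
    return heaps


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    for n in (100, 1000, 10000):
        perm = list(range(n))
        rng.shuffle(perm)
        print(n, greedy_heap_count(perm, arity=2))
```

Running the script on random permutations of increasing size gives a rough empirical feel for how slowly the heap count grows compared with the chain count obtained with `arity=1`.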

arXiv.org e-Print Archive

Masur's criterion does not hold in Thurston metric

Author: Telpukhovskiy Vanya
Publication date: 03/03/2019

We construct a counterexample for Masur's criterion in the setting of Teichmüller space with Thurston metric. For that, we find a minimal, non-uniquely ergodic lamination $\lambda$ on a seven-times punctured sphere with uniformly bounded annular projection coefficients. Then we show that the geodesic in the corresponding Teichmüller space that converges to $\lambda$ stays in the thick part for the whole time.
Comment: 16 pages, 9 figures. Comments are welcome

arXiv.org e-Print Archive