Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics
Next Generation Sequencing (NGS) technologies generate large amounts of short
read data for many different organisms. The fact that NGS reads are generally
short makes it challenging to assemble the reads and reconstruct the original
genome sequence. For clustering genomes using such NGS data, word-count based
alignment-free sequence comparison is a promising approach, but for this
approach, the underlying expected word counts are essential.
A plausible model for this underlying distribution of word counts is given
through modelling the DNA sequence as a Markov chain (MC). For single long
sequences, efficient statistics are available to estimate the order of MCs and
the transition probability matrix for the sequences. As NGS data do not provide
a single long sequence, inference methods on Markovian properties of sequences
based on single long sequences cannot be directly used for NGS short read data.
Here we derive a normal approximation for such word counts. We also show that
the traditional Chi-square statistic has an approximate gamma distribution,
using the Lander-Waterman model for physical mapping. We propose several
methods to estimate the order of the MC based on NGS reads and evaluate them
using simulations. We illustrate the applications of our results by clustering
genomic sequences of several vertebrate and tree species based on NGS reads
using alignment-free sequence dissimilarity measures. We find that the
estimated order of the MC has a considerable effect on the clustering results,
and that the clustering results that use a MC of the estimated order give a
plausible clustering of the species.
Comment: accepted by RECOMB-SEQ 201
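The Markov-chain view of word counts in the abstract above can be illustrated with a minimal Python sketch. It pools k-mer counts over short reads (words never span two reads) and forms plug-in transition estimates for a chain of a given order. The names `count_words` and `transition_probs` and the toy reads are illustrative assumptions, not the authors' estimators; the sketch ignores reverse complements, sequencing errors, and the order-selection statistics the abstract mentions.

```python
from collections import Counter

def count_words(reads, k):
    """Pool k-mer counts over all reads; a word never spans two reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def transition_probs(reads, order=1):
    """Plug-in estimate of P(next base | previous `order` bases).

    Uses pooled (order+1)-mer counts, normalised by the total count of
    words sharing the same `order`-base prefix.
    """
    counts = count_words(reads, order + 1)
    prefix_totals = Counter()
    for word, n in counts.items():
        prefix_totals[word[:-1]] += n
    return {word: n / prefix_totals[word[:-1]] for word, n in counts.items()}

# Toy reads (hypothetical data, for illustration only).
reads = ["ACGT", "CGTA", "ACGA"]
word_counts = count_words(reads, 2)
probs = transition_probs(reads, order=1)
for word in sorted(probs):
    print(word[:-1], "->", word[-1], round(probs[word], 3))
```

Each key of `probs` is a word whose last base is the predicted symbol, so `probs["GT"]` is the estimated probability of `T` following `G`.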
A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances
Spaced seeds have been recently shown to not only detect more alignments, but
also to give a more accurate measure of phylogenetic distances (Boden et al.,
2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower
misclassification rate when used with Support Vector Machines (SVMs) (Onodera
and Shibuya, 2013). We confirm these two results by independent experiments,
and propose in this article to use a coverage criterion (Benson and Mak, 2008,
Martin, 2013, Martin and Noé, 2014) to measure the seed efficiency in both
cases in order to design better seed patterns. We show first how this coverage
criterion can be directly measured by a full automaton-based approach. We then
illustrate how this criterion performs when compared with two other criteria
frequently used, namely the single-hit and multiple-hit criteria, through
correlation coefficients with the correct classification/the true distance.
Finally, for alignment-free distances, we propose an extension by adopting the
coverage criterion, show how it performs, and indicate how it can be
efficiently computed.
Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
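To make the idea of seed coverage concrete, here is a small brute-force Python sketch. It assumes that an alignment is encoded as a string of `1` (match) and `0` (mismatch), that a seed is a pattern of `1` (must match) and `0` (don't care), and that coverage means the number of alignment positions lying under a must-match position of at least one seed hit. The function names are hypothetical, and this direct enumeration is not the automaton-based computation the abstract refers to.

```python
def seed_hits(seed, alignment):
    """Start positions where every must-match position of the seed sees a match."""
    care = [j for j, c in enumerate(seed) if c == "1"]
    return [i for i in range(len(alignment) - len(seed) + 1)
            if all(alignment[i + j] == "1" for j in care)]

def coverage(seed, alignment):
    """Number of alignment positions covered by a must-match position of some hit."""
    care = [j for j, c in enumerate(seed) if c == "1"]
    covered = set()
    for i in seed_hits(seed, alignment):
        covered.update(i + j for j in care)
    return len(covered)

alignment = "1101101"            # toy match/mismatch string (assumption)
print(seed_hits("1101", alignment), coverage("1101", alignment))
print(seed_hits("111", alignment), coverage("111", alignment))
```

On this toy alignment the spaced seed `1101` hits twice, while the contiguous seed `111` of the same weight never hits, which is the kind of difference such criteria are meant to quantify.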
The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.
Comment: 41 pages, 4 figure
The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis.
Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies
Existing sequence alignment algorithms use heuristic scoring schemes which
cannot be used as objective distance metrics. Therefore one relies on measures
like the p- or log-det distances, or makes explicit, and often simplistic,
assumptions about sequence evolution. Information theory provides an
alternative, in the form of mutual information (MI) which is, in principle, an
objective and model independent similarity measure. MI can be estimated by
concatenating and zipping sequences, yielding thereby the "normalized
compression distance". So far this has produced promising results, but with
uncontrolled errors. We describe a simple approach to get robust estimates of
MI from global pairwise alignments. Using standard alignment algorithms, this
gives for animal mitochondrial DNA estimates that are strikingly close to
estimates obtained from the alignment-free methods mentioned above. Our main
result uses algorithmic (Kolmogorov) information theory, but we show that
similar results can also be obtained from Shannon theory. Because it is not
additive, normalized compression distance is not an optimal metric
for phylogenetics, but we propose a simple modification that overcomes the
issue of additivity. We test several versions of our MI-based distance measures
on a large number of randomly chosen quartets and demonstrate that they all
perform better than traditional measures like the Kimura or log-det (resp.
paralinear) distances. Even a simplified version based on single-letter Shannon
entropies, which can be easily incorporated in existing software packages, gave
superior results throughout the entire animal kingdom. But we see the main
virtue of our approach in a more general way. For example, it can also help to
judge the relative merits of different alignment algorithms, by estimating the
significance of specific alignments.
Comment: 19 pages + 16 pages of supplementary materia
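The single-letter Shannon quantity mentioned in the abstract above can be sketched as a plug-in mutual information computed over the columns of a global pairwise alignment. The following Python example is only an illustration under stated assumptions: the function name `alignment_mutual_information` is hypothetical, columns containing a gap character are simply skipped, and no bias correction or conversion to a distance is attempted, so this is not the authors' estimator.

```python
from collections import Counter
from math import log2

def alignment_mutual_information(row_a, row_b, gap="-"):
    """Plug-in mutual information (bits per column) between two aligned rows.

    Columns in which either row has a gap are ignored (a simplifying assumption).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (left[x] * right[y]).
    return sum((c / n) * log2(c * n / (left[x] * right[y]))
               for (x, y), c in joint.items())

identical = alignment_mutual_information("ACGT" * 5, "ACGT" * 5)
unrelated = alignment_mutual_information("AACC", "ACAC")
print(identical, unrelated)
```

Identical rows over four equally frequent letters give the maximum of 2 bits per column, while rows whose letters pair up uniformly give 0 bits.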
Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans
We have used whole genome paired-end Illumina sequence data to identify
tandem duplications in 20 isofemale lines of D. yakuba and 20 isofemale lines
of D. simulans, and performed genome-wide validation with PacBio long-molecule
sequencing. We identify 1,415 tandem duplications that are segregating in D.
yakuba as well as 975 duplications in D. simulans, indicating greater variation
in D. yakuba. Additionally, we observe high rates of secondary deletions at
duplicated sites, with 8% of duplicated sites in D. simulans and 17% of sites
in D. yakuba modified with deletions. These secondary deletions are consistent
with the large loop mismatch repair system acting to remove
polymorphic tandem duplications, resulting in rapid dynamics of gain and loss in
duplicated alleles and a richer substrate of genetic novelty than has been
previously reported. Most duplications are present in only single strains,
suggesting deleterious impacts are common. D. simulans shows larger numbers of
whole gene duplications in comparison to larger proportions of gene fragments
in D. yakuba. D. simulans displays an excess of high frequency variants on the
X chromosome, consistent with adaptive evolution through duplications on the D.
simulans X or demographic forces driving duplicates to high frequency. We
identify 78 chimeric genes in D. yakuba and 38 chimeric genes in D. simulans,
as well as 143 cases of recruited non-coding sequence in D. yakuba and 96 in D.
simulans, in agreement with rates of chimeric gene origination in D.
melanogaster. Together, these results suggest that tandem duplications often
result in complex variation beyond whole gene duplications that offers a rich
substrate of standing variation that is likely to contribute both to
detrimental phenotypes and disease and to adaptive evolutionary change.
Comment: Revised Version - Accepted at Molecular Biology and Evolutio
Non-alignment comparison of human and high primate genomes
Compositional spectra (CS) analysis based on k-mer scoring of DNA sequences
was employed in this study for dot-plot comparison of human and primate
genomes. The detection of extended conserved synteny regions was based on
continuous fuzzy similarity rather than on chains of discrete anchors (genes or
highly conserved noncoding elements). In addition to the high correspondence
found in the comparisons of whole-genome sequences, a good similarity was also
found after masking gene sequences, indicating that CS analysis manages to
reveal phylogenetic signal in the organization of the noncoding part of the genome
sequences, including repetitive DNA and the genome "dark matter". Obviously,
the possibility to reveal parallel ordering depends on the signal of common
ancestor sequence organization varying locally along the corresponding segments
of the compared genomes. We explored two sources contributing to this signal:
sequence composition (GC content) and sequence organization (abundances of
k-mers in the usual A,T,G,C or purine-pyrimidine alphabets). Whole-genome
comparisons based on GC distribution along the analyzed sequences indeed give
reasonable results, but combining it with k-mer abundances dramatically
improves the ordering quality, indicating that compositional and organizational
heterogeneity comprise complementary sources of information on evolutionarily
conserved similarity of genome sequences.
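The two signal sources named in the abstract above, GC content and k-mer abundances, can be illustrated for a pair of sequence windows with a short Python sketch. It assumes non-empty windows over the A, T, G, C alphabet and compares normalised k-mer spectra with a plain Euclidean distance; the function names and the choice of distance are illustrative assumptions, not the scoring used in the compositional spectra analysis itself.

```python
from collections import Counter
from math import sqrt

def gc_content(seq):
    """Fraction of G and C bases in a non-empty window."""
    return sum(base in "GC" for base in seq) / len(seq)

def kmer_spectrum(seq, k):
    """Relative abundance of every k-mer occurring in the window."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def spectrum_distance(a, b):
    """Euclidean distance between two spectra; absent k-mers count as 0."""
    words = set(a) | set(b)
    return sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in words))

window_a = "GGCCAATT"   # toy windows (hypothetical data)
window_b = "AAAAAAAA"
print(gc_content(window_a), gc_content(window_b))
print(spectrum_distance(kmer_spectrum(window_a, 2), kmer_spectrum(window_b, 2)))
```

Scoring every pair of windows from two genomes with such a distance would give a similarity matrix that can be drawn as a dot plot.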