Distinguishing regional from within-codon rate heterogeneity in DNA sequence alignments
We present an improved phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to (1) recombination and (2) rate heterogeneity. The focus of the present work is on improving the modelling of the latter aspect. Earlier papers have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. This approach fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-range periodic patterns within the codons, which merely capture the signature of the genetic code. We propose an improved model that explicitly distinguishes between these two effects, and we assess its performance on a set of simulated DNA sequence alignments.
Horseshoe-based Bayesian nonparametric estimation of effective population size trajectories
Phylodynamics is an area of population genetics that uses genetic sequence
data to estimate past population dynamics. Modern state-of-the-art Bayesian
nonparametric methods for recovering population size trajectories of unknown
form use either change-point models or Gaussian process priors. Change-point
models suffer from computational issues when the number of change-points is
unknown and needs to be estimated. Gaussian process-based methods lack local
adaptivity and cannot accurately recover trajectories that exhibit features
such as abrupt changes in trend or varying levels of smoothness. We propose a
novel, locally-adaptive approach to Bayesian nonparametric phylodynamic
inference that has the flexibility to accommodate a large class of functional
behaviors. Local adaptivity results from modeling the log-transformed effective
population size a priori as a horseshoe Markov random field, a recently
proposed statistical model that blends together the best properties of the
change-point and Gaussian process modeling paradigms. We use simulated data to
assess model performance, and find that our proposed method results in reduced
bias and increased precision when compared to contemporary methods. We also use
our models to reconstruct past changes in genetic diversity of human hepatitis
C virus in Egypt and to estimate population size changes of ancient and modern
steppe bison. These analyses show that our new method captures features of the
population size trajectories that were missed by the state-of-the-art methods.
Comment: 36 pages, including supplementary information
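As a rough illustration of the local adaptivity described in this abstract, the following sketch draws a log effective population size trajectory from a first-order horseshoe Markov random field prior. It is a minimal sketch under stated assumptions, not the authors' implementation: the first-order (random-walk) form, the fixed global scale `tau`, the starting value `theta0`, and the function names are all choices made here for illustration. Each increment receives its own half-Cauchy local scale, so most steps stay small while a few can be large, which is what lets a trajectory combine smooth stretches with abrupt changes.

```python
import math
import random


def half_cauchy(rng: random.Random, scale: float = 1.0) -> float:
    """Draw from a half-Cauchy(0, scale) distribution via its inverse CDF."""
    # F(x) = (2 / pi) * atan(x / scale), so x = scale * tan(pi * u / 2).
    return scale * math.tan(0.5 * math.pi * rng.random())


def sample_hsmrf(n: int, tau: float, theta0: float, rng: random.Random) -> list[float]:
    """Draw n values of log N_e from a first-order horseshoe MRF prior.

    Assumption for this sketch: the global scale tau is held fixed,
    whereas a full Bayesian treatment would place a prior on it as well.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    theta = [theta0]
    for _ in range(n - 1):
        lam = half_cauchy(rng)  # local scale, one per increment
        theta.append(theta[-1] + rng.gauss(0.0, tau * lam))
    return theta


if __name__ == "__main__":
    rng = random.Random(1)
    path = sample_hsmrf(n=20, tau=0.05, theta0=math.log(1000.0), rng=rng)
    print([round(math.exp(t), 1) for t in path])
```

Replacing the half-Cauchy local scales with a single shared scale would give an ordinary Gaussian random-walk prior, which smooths every increment equally.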
In search of lost introns
Many fundamental questions concerning the emergence and subsequent evolution
of eukaryotic exon-intron organization are still unsettled. Genome-scale
comparative studies, which can shed light on crucial aspects of eukaryotic
evolution, require adequate computational tools.
We describe novel computational methods for studying spliceosomal intron
evolution. Our goal is to give a reliable characterization of the dynamics of
intron evolution. Our algorithmic innovations address the identification of
orthologous introns, and the likelihood-based analysis of intron data. We
discuss a compression method for the evaluation of the likelihood function,
which is noteworthy for phylogenetic likelihood problems in general. We prove
that after preprocessing time, subsequent evaluations take time almost surely in the Yule-Harding random model of -taxon
phylogenies, where is the input sequence length.
We illustrate the practicality of our methods by compiling and analyzing a
data set involving 18 eukaryotes, more than in any other study to date. The
study yields the surprising result that ancestral eukaryotes were fairly
intron-rich. For example, the bilaterian ancestor is estimated to have had more
than 90% as many introns as vertebrates do now.
Efficient FPT algorithms for (strict) compatibility of unrooted phylogenetic trees
In phylogenetics, a central problem is to infer the evolutionary
relationships between a set of species ; these relationships are often
depicted via a phylogenetic tree -- a tree having its leaves univocally labeled
by elements of and without degree-2 nodes -- called the "species tree". One
common approach for reconstructing a species tree consists in first
constructing several phylogenetic trees from primary data (e.g. DNA sequences
originating from some species in ), and then constructing a single
phylogenetic tree maximizing the "concordance" with the input trees. The
so-obtained tree is our estimation of the species tree and, when the input
trees are defined on overlapping -- but not identical -- sets of labels, is
called a "supertree". In this paper, we focus on two problems that are central
when combining phylogenetic trees into a supertree: the compatibility and the
strict compatibility problems for unrooted phylogenetic trees. These problems
are strongly related, respectively, to the notions of "containing as a minor"
and "containing as a topological minor" in the graph community. Both problems
are known to be fixed-parameter tractable in the number of input trees , by
using their expressibility in Monadic Second Order Logic and a reduction to
graphs of bounded treewidth. Motivated by the fact that the dependency on
of these algorithms is prohibitively large, we give the first explicit dynamic
programming algorithms for solving these problems, both running in time
, where is the total size of the input.
Comment: 18 pages, 1 figure
Nocardia kroppenstedtii sp. nov., a novel actinomycete isolated from a lung transplant patient with a pulmonary infection
An actinomycete, strain N1286T, isolated from a lung transplant patient with a pulmonary infection, was provisionally assigned to the genus Nocardia. The strain had chemotaxonomic and morphological properties typical of members of the genus Nocardia and formed a distinct phyletic line in the Nocardia 16S rRNA gene tree. It was most closely related to Nocardia farcinica DSM 43665T (99.8% gene similarity) but was distinguished from the latter by a low level of DNA:DNA relatedness. These strains were also distinguished by a broad range of phenotypic properties. On the basis of these data, it is proposed that isolate N1286T (=DSM 45810T = NCTC 13617T) should be classified as the type strain of a new Nocardia species for which the name Nocardia kroppenstedtii is proposed.
Fast computation of distance estimators
BACKGROUND: Some distance methods are among the most commonly used methods for reconstructing phylogenetic trees from sequence data. The input to a distance method is a distance matrix containing estimated pairwise distances between all pairs of taxa. Distance methods themselves are often fast; e.g., the popular Neighbor Joining (NJ) algorithm reconstructs a phylogeny of n taxa in time O(n^3). Unfortunately, the fastest practical algorithms known for computing the distance matrix from n sequences of length l take time proportional to l·n^2. Since the sequence length is typically much larger than the number of taxa, distance estimation is the bottleneck in phylogeny reconstruction. This bottleneck is especially apparent in the reconstruction of large phylogenies and in applications where many trees have to be reconstructed, e.g., bootstrapping and genome-wide applications. RESULTS: We give an advanced algorithm for computing the number of mutational events between DNA sequences which is significantly faster than both Phylip and Paup. Moreover, we give a new method for estimating pairwise distances between sequences which contain ambiguity symbols. This new method is shown to be more accurate as well as faster than earlier methods. CONCLUSION: Our novel algorithm for computing distance estimators provides a valuable tool in phylogeny reconstruction. Since the running time of our distance estimation algorithm is comparable to that of most distance methods, the previous bottleneck is removed. All distance methods, such as NJ, require a distance matrix as input and, hence, our novel algorithm significantly improves the overall running time of all distance methods. In particular, we show for real-world biological applications how the running time of phylogeny reconstruction using NJ is improved from a matter of hours to a matter of seconds.
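To make the l·n^2 bottleneck concrete, here is a minimal sketch of the naive baseline that the abstract describes, not the faster algorithm the authors propose. It scans every pair of aligned sequences once, counts mismatches, and applies the Jukes-Cantor correction. The function names are hypothetical, the Jukes-Cantor model is an assumption chosen for simplicity, and ambiguity symbols are treated as ordinary characters rather than handled specially.

```python
import math
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def jc69_distance(a: str, b: str) -> float:
    """Jukes-Cantor corrected distance; infinite when p >= 3/4."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs: list[str]) -> list[list[float]]:
    """Naive construction: one O(l) scan for each of the n(n-1)/2 pairs."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = jc69_distance(seqs[i], seqs[j])
    return d


if __name__ == "__main__":
    alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
    for row in distance_matrix(alignment):
        print(["%.4f" % x for x in row])
```

The per-pair scan in `p_distance` is the part that dominates when l is much larger than n, so that is where any speed-up of the kind the abstract reports would have to come from.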
Frequentist and Bayesian measures of confidence via multiscale bootstrap for testing three regions
A new method for computing frequentist p-values and Bayesian posterior
probabilities based on the bootstrap probability is discussed for the
multivariate normal model with unknown expectation parameter vector. The null
hypothesis is represented as an arbitrary-shaped region. We introduce new
parametric models for the scaling-law of bootstrap probability so that the
multiscale bootstrap method, which was designed for one-sided tests, can also
compute confidence measures for two-sided tests, extending applicability to a
wider class of hypotheses. Parameter estimation is improved by the two-step
multiscale bootstrap and also by including higher-order terms. Model selection
is important not only as a motivating application of our method, but also as an
essential ingredient in the method. A compromise between frequentist and
Bayesian is attempted by showing that the Bayesian posterior probability with
a noninformative prior can be interpreted as a frequentist p-value of a
"zero-sided" test.
Inference of population splits and mixtures from genome-wide allele frequency data
Many aspects of the historical relationships between populations in a species
are reflected in genetic data. Inferring these relationships from genetic data,
however, remains a challenging task. In this paper, we present a statistical
model for inferring the patterns of population splits and mixtures in multiple
populations. In this model, the sampled populations in a species are related to
their common ancestor through a graph of ancestral populations. Using
genome-wide allele frequency data and a Gaussian approximation to genetic
drift, we infer the structure of this graph. We applied this method to a set of
55 human populations and a set of 82 dog breeds and wild canids. In both
species, we show that a simple bifurcating tree does not fully describe the
data; in contrast, we infer many migration events. While some of the migration
events that we find have been detected previously, many have not. For example,
in the human data we infer that Cambodians trace approximately 16% of their
ancestry to a population ancestral to other extant East Asian populations. In
the dog data, we infer that both the boxer and basenji trace a considerable
fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to
domestication, and that East Asian toy breeds (the Shih Tzu and the Pekingese)
result from admixture between modern toy breeds and "ancient" Asian breeds.
Software implementing the model described here, called TreeMix, is available at
http://treemix.googlecode.com
Comment: 28 pages, 6 figures in main text. Attached supplement is 22 pages, 15
figures. This is an updated version of the preprint available at
http://precedings.nature.com/documents/6956/version/
Recombination dramatically speeds up evolution of finite populations
We study the role of recombination, as practiced by genetically-competent
bacteria, in speeding up Darwinian evolution. This is done by adding a new
process to a previously-studied Markov model of evolution on a smooth fitness
landscape; this new process allows alleles to be exchanged with those in the
surrounding medium. Our results, both numerical and analytic, indicate that for
a wide range of intermediate population sizes, recombination dramatically
speeds up the evolutionary advance.