Evolutionary Inference via the Poisson Indel Process
We address the problem of the joint statistical inference of phylogenetic
trees and multiple sequence alignments from unaligned molecular sequences. This
problem is generally formulated in terms of string-valued evolutionary
processes along the branches of a phylogenetic tree. The classical evolutionary
process, the TKF91 model, is a continuous-time Markov chain model comprised of
insertion, deletion and substitution events. Unfortunately this model gives
rise to an intractable computational problem: the computation of the marginal
likelihood under the TKF91 model is exponential in the number of taxa. In this
work, we present a new stochastic process, the Poisson Indel Process (PIP), in
which the complexity of this computation is reduced to linear. The new model is
closely related to the TKF91 model, differing only in its treatment of
insertions, but the new model has a global characterization as a Poisson
process on the phylogeny. Standard results for Poisson processes allow key
computations to be decoupled, which yields the favorable computational profile
of inference under the PIP model. We present illustrative experiments in which
Bayesian inference under the PIP model is compared to separate inference of
phylogenies and alignments. Comment: 33 pages, 6 figures
Inference of single-cell phylogenies from lineage tracing data using Cassiopeia.
The pairing of CRISPR/Cas9-based gene editing with massively parallel single-cell readouts now enables large-scale lineage tracing. However, the rapid growth in complexity of data from these assays has outpaced our ability to accurately infer phylogenetic relationships. First, we introduce Cassiopeia, a suite of scalable maximum parsimony approaches for tree reconstruction. Second, we provide a simulation framework for evaluating algorithms and exploring lineage tracer design principles. Finally, we generate the most complex experimental lineage tracing dataset to date, 34,557 human cells continuously traced over 15 generations, and use it for benchmarking phylogenetic inference approaches. We show that Cassiopeia outperforms traditional methods by several metrics and under a wide variety of parameter regimes, and provide insight into the principles for the design of improved Cas9-enabled recorders. Together, these should broadly enable large-scale mammalian lineage tracing efforts. Cassiopeia and its benchmarking resources are publicly available at www.github.com/YosefLab/Cassiopeia
Accurate reconstruction of insertion-deletion histories by statistical phylogenetics
The Multiple Sequence Alignment (MSA) is a computational abstraction that
represents a partial summary either of indel history, or of structural
similarity. Taking the former view (indel history), it is possible to use
formal automata theory to generalize the phylogenetic likelihood framework for
finite substitution models (Dayhoff's probability matrices and Felsenstein's
pruning algorithm) to arbitrary-length sequences. In this paper, we report
results of a simulation-based benchmark of several methods for reconstruction
of indel history. The methods tested include a relatively new algorithm for
statistical marginalization of MSAs that sums over a stochastically-sampled
ensemble of the most probable evolutionary histories. For mammalian
evolutionary parameters on several different trees, the single most likely
history sampled by our algorithm appears less biased than histories
reconstructed by other MSA methods. The algorithm can also be used for
alignment-free inference, where the MSA is explicitly summed out of the
analysis. As an illustration of our method, we discuss reconstruction of the
evolutionary histories of human protein-coding genes. Comment: 28 pages, 15 figures. arXiv admin note: text overlap with
arXiv:1103.434
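The pruning recursion for finite substitution models mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes the Jukes-Cantor (JC69) substitution model with uniform base frequencies, and the tree encoding, branch lengths and function names (`jc69_prob`, `partial_likelihoods`, `site_likelihood`, `make_tree`) are hypothetical choices made for the example. It handles a single aligned site and does not model indels.

```python
import itertools
import math

BASES = "ACGT"


def jc69_prob(a, b, t):
    """JC69 probability of base a becoming base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Return {base: P(leaves below node | node carries base)}.

    A leaf is a one-character string. An internal node is a tuple of
    (child, branch_length) pairs (an assumed encoding for this sketch).
    """
    if isinstance(node, str):
        return {x: 1.0 if x == node else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        child_l = partial_likelihoods(child)
        for x in BASES:
            # Sum over the child's possible states, weighted by the
            # transition probability along the connecting branch.
            result[x] *= sum(jc69_prob(x, y, t) * child_l[y] for y in BASES)
    return result


def site_likelihood(tree):
    """Likelihood of one site, with a uniform prior on the root base."""
    return sum(0.25 * value for value in partial_likelihoods(tree).values())


def make_tree(a, b, c):
    """Three-leaf example tree ((a, b), c) with arbitrary branch lengths."""
    return ((((a, 0.1), (b, 0.2)), 0.05), (c, 0.3))


if __name__ == "__main__":
    print(site_likelihood(make_tree("A", "C", "A")))
    # Sanity check: likelihoods over all leaf patterns sum to one.
    total = sum(site_likelihood(make_tree(*p))
                for p in itertools.product(BASES, repeat=3))
    print(total)
```

The recursion visits each node once, so the cost per site is linear in the number of nodes. Extending it to arbitrary-length sequences, as the abstract describes, requires automata-based machinery that this sketch does not attempt.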
Global Alignment of Molecular Sequences via Ancestral State Reconstruction
Molecular phylogenetic techniques do not generally account for such common
evolutionary events as site insertions and deletions (known as indels). Instead
tree building algorithms and ancestral state inference procedures typically
rely on substitution-only models of sequence evolution. In practice these
methods are extended beyond this simplified setting with the use of heuristics
that produce global alignments of the input sequences--an important problem
which has no rigorous model-based solution. In this paper we consider a new
version of the multiple sequence alignment in the context of stochastic indel
models. More precisely, we introduce the following trace reconstruction
problem on a tree (TRPT): a binary sequence is broadcast through a tree
channel where we allow substitutions, deletions, and insertions; we seek to
reconstruct the original sequence from the sequences received at the leaves of
the tree. We give a recursive procedure for this problem with strong
reconstruction guarantees at low mutation rates, providing also an alignment of
the sequences at the leaves of the tree. The TRPT problem without indels has
been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a
bootstrapping step towards obtaining optimal phylogenetic reconstruction
methods. The present work sets up a framework for extending these works to
evolutionary models with indels.
Alignment-free phylogenetic reconstruction: Sample complexity via a branching process analysis
We present an efficient phylogenetic reconstruction algorithm allowing
insertions and deletions which provably achieves a sequence-length requirement
(or sample complexity) growing polynomially in the number of taxa. Our
algorithm is distance-based, that is, it relies on pairwise sequence
comparisons. More importantly, our approach largely bypasses the difficult
problem of multiple sequence alignment. Comment: Published at http://dx.doi.org/10.1214/12-AAP852 in the Annals of
Applied Probability (http://www.imstat.org/aap/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study
Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made. Methods: We simulated data from a defined "true tree" using a realistic evolutionary model. We built phylogenies from this data using a range of methods, and compared reconstructed trees to the true tree using two measures, noting the computational time needed for different phylogenetic reconstructions. We also used real data from Streptococcus pneumoniae alignments to compare individual core gene trees to a core genome tree. Results: We found that, as expected, maximum likelihood trees from good quality alignments were the most accurate, but also the most computationally intensive. Using less accurate phylogenetic reconstruction methods, we were able to obtain results of comparable accuracy; we found that approximate results can rapidly be obtained using genetic distance based methods. In real data we found that highly conserved core genes, such as those involved in translation, gave an inaccurate tree topology, whereas genes involved in recombination events gave inaccurate branch lengths. We also show a tree-of-trees, relating the results of different phylogenetic reconstructions to each other. Conclusions: We recommend three approaches, depending on requirements for accuracy and computational time. Quicker approaches that do not perform full maximum likelihood optimisation may be useful for many analyses requiring a phylogeny, as generating a high quality input alignment is likely to be the major limiting factor of accurate tree topology. We have publicly released our simulated data and code to enable further comparisons
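Comparing a reconstructed tree to a known true tree can be sketched as below. The abstract does not name its two measures, so this example assumes one common choice, the Robinson-Foulds distance in its rooted form (the number of clades present in one tree but not the other). The nested-tuple tree encoding and the function names `clades` and `rooted_rf_distance` are hypothetical.

```python
def clades(tree):
    """Return (leaf set of tree, set of clades as frozensets of leaf names).

    A leaf is a string; an internal node is a tuple of subtrees
    (an assumed encoding for this sketch). Single-leaf clades are skipped.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def rooted_rf_distance(tree_a, tree_b):
    """Count clades that appear in exactly one of the two rooted trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(clades_a ^ clades_b)


if __name__ == "__main__":
    true_tree = (("A", "B"), ("C", "D"))
    estimate = ((("A", "B"), "C"), "D")
    print(rooted_rf_distance(true_tree, estimate))
```

This distance looks only at topology. A study that also cares about branch lengths, as this one does, would need a second, length-aware measure.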
Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement
Multiple genome alignment remains a challenging problem. Effects of
recombination including rearrangement, segmental duplication, gain, and loss
can create a mosaic pattern of homology even among closely related organisms.
We describe a method to align two or more genomes that have undergone
large-scale recombination, particularly genomes that have undergone substantial
amounts of gene gain and loss (gene flux). The method utilizes a novel
alignment objective score, referred to as a sum-of-pairs breakpoint score. We
also apply a probabilistic alignment filtering method to remove erroneous
alignments of unrelated sequences, which are commonly observed in other genome
alignment methods. We describe new metrics for quantifying genome alignment
accuracy which measure the quality of rearrangement breakpoint predictions and
indel predictions. The progressive genome alignment algorithm demonstrates
markedly improved accuracy over previous approaches in situations where genomes
have undergone realistic amounts of genome rearrangement, gene gain, loss, and
duplication. We apply the progressive genome alignment algorithm to a set of 23
completely sequenced genomes from the genera Escherichia, Shigella, and
Salmonella. The 23 enterobacteria have an estimated 2.46Mbp of genomic content
conserved among all taxa and total unique content of 15.2Mbp. We document
substantial population-level variability among these organisms driven by
homologous recombination, gene gain, and gene loss. Free, open-source software
implementing the described genome alignment approach is available from
http://gel.ahabs.wisc.edu/mauve. Comment: Revision dated June 19, 200
Using Genetic Algorithm to solve Median Problem and Phylogenetic Inference
Genome rearrangement analysis has attracted considerable attention in phylogenetic computation and comparative genomics. Solving the median problem under various distance definitions has been a focus because it provides the building blocks for maximum parsimony analysis of phylogeny and ancestral genomes. The Median Problem (MP) has been proved to be NP-hard, and although several exact and heuristic algorithms are available, all of them struggle with three distant genomes separated by many evolutionary events. Current approaches such as MGR [1] and GRAPPA [2] are restricted to small collections of genomes and to low-resolution gene order data with a few hundred rearrangement events. In this work, we focus on heuristic algorithms that combine genomic sorting algorithms with a genetic algorithm (GA) to produce new methods and directions for whole-genome median solving, ancestor inference and phylogeny reconstruction.
For the equal-genome median problem, we propose a genetic algorithm based on the DCJ sorting operation, called GA-DCJ. Following the classic genetic algorithm framework, we develop our own procedure for each step and substitute it for the corresponding traditional genetic algorithm procedure. The final results of our GA-based algorithm are the optimal median genome(s) and the median score. With limited time and space, especially on large-scale and distant datasets, our algorithm obtains better results than GRAPPA and AsMedian.
Extending the ideas of the equal-genome median solver, we develop another genetic algorithm based solver, GaDCJ-Indel, which solves the median problem for unequal genomes (without duplication). In the DCJ-Indel model, one of the key steps is still the sorting operation [3]. The difference from the equal-genome median is that there are two sorting directions: a minimal DCJ operation path or a minimal indel operation path. By following different sorting paths, at each step we obtain various genome structures to fill our population pool. In addition, we adopt an adaptive surcharge-triangle inequality instead of the classic triangle inequality in our fitness function, in order to fit the restrictions of unequal genomes and obtain results more efficiently. Our experimental results show that the GaDCJ-Indel method not only converges to an accurate median score but also infers ancestors that are very close to the true ancestors.
An important application of genome rearrangement analysis is to infer ancestral genomes, which is valuable for identifying patterns of evolution and for modeling the evolutionary processes. However, computing ancestral genomes is very difficult, and we have to rely on heuristic methods that have various limitations. We propose a GA-Tree algorithm which adapts meta-population [4], co-evolution and repopulation pool methods. In this paper, we describe and illustrate, step by step, the first genetic algorithm for ancestor inference, which uses fitness scores designed to account for co-evolution and uses sorting-based methods to initialize and evolve populations. Our extensive experiments show that, compared with other existing tools, our method is accurate and can infer ancestors that are much closer to the true ancestors.
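The median problem described in this abstract can be made concrete with a toy sketch. Two substitutions are made here and should be read as assumptions: the breakpoint distance on unsigned linear gene orders stands in for the DCJ distance, and exhaustive search stands in for the genetic algorithm, so the code is only feasible for a handful of genes. All function names are hypothetical.

```python
from itertools import permutations


def adjacencies(genome):
    """Unordered pairs of neighbouring genes in a linear gene order."""
    return {frozenset(pair) for pair in zip(genome, genome[1:])}


def breakpoint_distance(g, h):
    """Number of adjacencies of g that are absent from h."""
    return len(adjacencies(g) - adjacencies(h))


def median_score(candidate, genomes):
    """Sum of distances from the candidate to every input genome."""
    return sum(breakpoint_distance(candidate, g) for g in genomes)


def brute_force_median(genomes):
    """Try every gene order and keep one with the lowest median score.

    Assumes all genomes contain the same genes exactly once.
    """
    genes = sorted(genomes[0])
    return min(permutations(genes), key=lambda c: median_score(c, genomes))


if __name__ == "__main__":
    genomes = [(1, 2, 3, 4), (1, 2, 3, 4), (1, 2, 4, 3)]
    best = brute_force_median(genomes)
    print(best, median_score(best, genomes))
```

The search space grows factorially with the number of genes, which is why heuristic solvers such as the genetic algorithms described above are used in practice.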