Maximum likelihood estimates of pairwise rearrangement distances
Accurate estimation of evolutionary distances between taxa is important for
many phylogenetic reconstruction methods. In the case of bacteria, distances
can be estimated using a range of different evolutionary models, from single
nucleotide polymorphisms to large-scale genome rearrangements. In the case of
sequence evolution, models (such as the Jukes-Cantor model and associated
metric) have been used to correct pairwise distances. Similar correction
methods for genome rearrangement processes are required to improve inference.
Current attempts at correction fall into three categories: empirical computational
studies, Bayesian/MCMC approaches, and combinatorial approaches. Here we
introduce a maximum likelihood estimator for the inversion distance between a
pair of genomes, using the group-theoretic approach to modelling inversions
introduced recently. This MLE functions as a corrected distance: in particular,
we show that because of the way sequences of inversions interact with each
other, it is quite possible for minimal distance and MLE distance to
differently order the distances of two genomes from a third. This has obvious
implications for the use of minimal distance in phylogeny reconstruction. The
work also tackles the above problem allowing free rotation of the genome.
Generally a frame of reference is locked, and all computation made accordingly.
This work incorporates the action of the dihedral group so that distance
estimates are free from any a priori frame of reference.
Comment: 21 pages, 7 figures. To appear in the Journal of Theoretical Biolog
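The abstract above cites the Jukes-Cantor model as the standard example of a corrected pairwise distance for sequence evolution. As a minimal sketch of what such a correction looks like (this is the classical Jukes-Cantor formula, not the inversion-distance MLE the paper introduces), the observed fraction of differing sites p is mapped to an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The function names are illustrative assumptions.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed fraction of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance d = -(3/4) ln(1 - 4p/3).

    The correction diverges as p approaches 3/4 (saturation), so infinity
    is returned for p >= 0.75.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = p_distance("ACGTACGT", "ACGTACGA")
    print(p, jukes_cantor_distance(p))
```

The corrected value always exceeds the raw p for p > 0, because the raw count misses multiple substitutions at the same site; the rearrangement MLE described in the abstract plays the analogous role for minimal inversion distance.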
Evolution signatures in genome network properties
Genomes may be organized as networks where protein-protein association plays the role of network links. The resulting networks are far from being random and their topological properties are a consequence of the underlying mechanisms for genome evolution. Considering data on protein-protein association networks from the STRING database, we present experimental evidence that degree distribution is not scale free, presenting an increased probability for high degree nodes. We also show that the degree distribution approaches a scale invariant state as the number of genes in the network increases, although real genomes still present finite size effects. Based on the experimental evidence unveiled by these data analyses, we propose a simulation model for genome evolution, where genes in a network are either acquired de novo using a preferential attachment rule, or duplicated, with a duplication probability that linearly grows with gene degree and decreases with its clustering coefficient. The results show that topological distributions are better described than in previous genome evolution models. This model correctly predicts that, in order to produce protein-protein association networks with number of links and number of nodes in the observed range, 90% gene duplication and 10% de novo gene acquisition are necessary. If this scenario is true, it implies a universal mechanism for genome evolution.
Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes
Demographic events shape a population's genetic diversity, a process described by the coalescent-with-recombination model that relates demography and genetics by an unobserved sequence of genealogies along the genome. As the space of genealogies over genomes is large and complex, inference under this model is challenging. Formulating the coalescent-with-recombination model as a continuous-time and -space Markov jump process, we develop a particle filter for such processes, and use waypoints that under appropriate conditions allow the problem to be reduced to the discrete-time case. To improve inference, we generalise the Auxiliary Particle Filter for discrete-time models, and use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization. Using real and simulated genomes, we show that past population sizes can be accurately inferred over a larger range of epochs than was previously possible, opening the possibility of jointly analyzing multiple genomes under complex demographic models. Code is available at https://github.com/luntergroup/smcsmc.
The similarity metric
A new class of distances appropriate for measuring similarity relations
between sequences, say one type of similarity per distance, is studied. We
propose a new "normalized information distance", based on the noncomputable
notion of Kolmogorov complexity, and show that it is in this class and it
minorizes every computable distance in the class (that is, it is universal in
that it discovers all computable similarities). We demonstrate that it is a
metric and call it the similarity metric. This theory forms the
foundation for a new practical tool. To evidence generality and robustness we
give two distinctive applications in widely divergent areas using standard
compression programs like gzip and GenCompress. First, we compare whole
mitochondrial genomes and infer their evolutionary history. This results in a
first completely automatically computed whole mitochondrial phylogeny tree.
Secondly, we fully automatically compute the language tree of 52 different
languages.
Comment: 13 pages, LaTeX, 5 figures. Part of this work appeared in Proc. 14th
ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected,
version to appear in IEEE Trans Inform. T
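Because Kolmogorov complexity is noncomputable, the practical tool mentioned in the abstract above substitutes the output length of a real compressor. A minimal sketch of that idea is the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is compressed size. The use of Python's zlib (DEFLATE, the same algorithm family as gzip) and the function names are assumptions made for illustration, not the authors' implementation.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    repetitive = b"ACGT" * 200
    unrelated = bytes(range(256)) * 4
    print(ncd(repetitive, repetitive))  # small: second copy adds little
    print(ncd(repetitive, unrelated))   # near 1: almost nothing is shared
```

A matrix of such pairwise values can then be passed to any standard distance-based tree-building method, which is the general shape of the mitochondrial and language-tree applications the abstract describes.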
Ancestral population genomics
The full genomes of several closely related species are now available, opening an emerging field of investigation borrowing both from population genetics and phylogenetics. Provided we can properly model sequence evolution within populations undergoing speciation events, this resource enables us to estimate key population genetics parameters, such as ancestral population sizes and split times. Furthermore, we can enhance our understanding of the recombination process and investigate various selective forces. We discuss the basic speciation models for closely related species, including the isolation and isolation-with-migration models. A major point in our discussion is that only a few complete genomes contain much information about the whole population. The reason is that recombination unlinks genomic regions, and therefore a few genomes contain many segments with distinct histories. The challenge of population genomics is to decode this mosaic of histories in order to infer scenarios of demography and selection. We survey different approaches for understanding ancestral species from analyses of genomic data from closely related species. In particular, we emphasize core assumptions and working hypotheses. Finally, we discuss computational and statistical challenges that arise in the analysis of population genomics data sets.
A Quantitative Approach to Investigating the Hypothesis of Prokaryotic Intron Loss
Using a novel method, we show that ordered triplets of motifs usually associated with spliceosomal intron recognition are underrepresented in the protein coding sequence of complete Thermotogae, archaeal and bacterial genomes. The underrepresentation observed does not extend to the noncoding strand, suggesting that the cause of the asymmetry is related to mRNA rather than DNA. Our data do not suggest that the underrepresentation is due to gene transfer from eukaryotes. We speculate that one possible explanation for these observations is that the protein coding sequence of Thermotogae, Archaea and Bacteria was at some time in the past subjected to selection against certain motifs appearing in an order which might initiate splicing in environments harboring a functional spliceosome. This is consistent with, but certainly does not prove, a hypothetical scenario in which at least some prokaryote lineages once possessed a functional spliceosome. Thus, we present a new quantitative method, observations obtained using the method, and a speculative discussion of a possible explanation of the observations
Assessing molecular variability in cancer genomes
The dynamics of tumour evolution are not well understood. In this paper we
provide a statistical framework for evaluating the molecular variation observed
in different parts of a colorectal tumour. A multi-sample version of the Ewens
Sampling Formula forms the basis for our modelling of the data, and we provide
a simulation procedure for use in obtaining reference distributions for the
statistics of interest. We also describe the large-sample asymptotics of the
joint distributions of the variation observed in different parts of the tumour.
While actual data should be evaluated with reference to the simulation
procedure, the asymptotics serve to provide theoretical guidelines, for
instance with reference to the choice of possible statistics.
Comment: 22 pages, 1 figure. Chapter 4 of "Probability and Mathematical
Genetics: Papers in Honour of Sir John Kingman" (Editors N.H. Bingham and
C.M. Goldie), Cambridge University Press, 201
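The abstract above builds on a multi-sample version of the Ewens Sampling Formula. As a hedged illustration of the underlying object only, the sketch below evaluates the classical single-sample formula, P(a_1, ..., a_n) = n! / (theta (theta+1) ... (theta+n-1)) * prod_j (theta/j)^(a_j) / a_j!, where a_j is the number of types observed exactly j times. It does not implement the multi-sample extension or the simulation procedure of the chapter, and the function name is an assumption.

```python
from math import factorial


def ewens_probability(counts, theta):
    """Single-sample Ewens Sampling Formula.

    counts[j-1] is the number of types seen exactly j times, so the sample
    size is n = sum(j * counts[j-1]); theta > 0 is the mutation parameter.
    """
    n = sum(j * a for j, a in enumerate(counts, start=1))
    if n == 0 or theta <= 0:
        raise ValueError("need a non-empty sample and theta > 0")
    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = 1.0
    for i in range(n):
        rising *= theta + i
    prob = factorial(n) / rising
    for j, a in enumerate(counts, start=1):
        prob *= (theta / j) ** a / factorial(a)
    return prob


if __name__ == "__main__":
    # All allelic partitions of a sample of size 3; probabilities sum to 1.
    partitions = [[3, 0, 0], [1, 1, 0], [0, 0, 1]]
    print([ewens_probability(c, 2.0) for c in partitions])
```

Summing the formula over every partition of n gives 1, which is a convenient sanity check before using it as a reference distribution.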
Accurate Reconstruction of Molecular Phylogenies for Proteins Using Codon and Amino Acid Unified Sequence Alignments (CAUSA)
Based on the molecular clock hypothesis and the neutral theory of molecular evolution, molecular phylogenies have been widely used for inferring the evolutionary history of organisms and individual genes. Traditionally, alignments and phylogenetic trees of proteins and of their coding DNA sequences are constructed separately, so different conclusions were often drawn. Here we present a new strategy for sequence alignment and phylogenetic tree reconstruction, codon and amino acid unified sequence alignment (CAUSA), which aligns DNA and protein sequences and draws phylogenetic trees in a unified manner. We demonstrate that CAUSA improves the accuracy of both multiple sequence alignments and phylogenetic trees by solving a variety of molecular evolutionary problems in viruses, bacteria and mammals. Our results support the hypothesis that the molecular clock for proteins has two pointers existing separately in DNA and protein sequences. It is more accurate to read the molecular clock by an additive combination of these two pointers, since their ticking rates are sometimes consistent and sometimes different. The CAUSA software was released as open source under the GNU GPL license and is downloadable free of charge from the website www.dnapluspro.com.