32 research outputs found

    The correlation space of Gaussian latent tree models and model selection without fitting

    Get PDF
    We provide a complete description of possible covariance matrices consistent with a Gaussian latent tree model for any tree. We then present techniques for utilising these constraints to assess whether observed data is compatible with that Gaussian latent tree model. Our method does not require us first to fit such a tree. We demonstrate the usefulness of the inverse-Wishart distribution for performing preliminary assessments of tree-compatibility using semialgebraic constraints. Using results from Drton et al. (2008) we then provide the appropriate moments required for test statistics for assessing adherence to these equality constraints. These are shown to be effective even for small sample sizes and can be easily adjusted to test either the entire model or only certain macrostructures hypothesized within the tree. We illustrate our exploratory tetrad analysis using a linguistic application and our confirmatory tetrad analysis using a biological application.Comment: 15 page

    Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics

    Get PDF
    BACKGROUND: A large number of bioinformatics applications in the fields of bio-sequence analysis, molecular evolution and population genetics typically share input/ouput methods, data storage requirements and data analysis algorithms. Such common features may be conveniently bundled into re-usable libraries, which enable the rapid development of new methods and robust applications. RESULTS: We present Bio++, a set of Object Oriented libraries written in C++. Available components include classes for data storage and handling (nucleotide/amino-acid/codon sequences, trees, distance matrices, population genetics datasets), various input/output formats, basic sequence manipulation (concatenation, transcription, translation, etc.), phylogenetic analysis (maximum parsimony, markov models, distance methods, likelihood computation and maximization), population genetics/genomics (diversity statistics, neutrality tests, various multi-locus analyses) and various algorithms for numerical calculus. CONCLUSION: Implementation of methods aims at being both efficient and user-friendly. A special concern was given to the library design to enable easy extension and new methods development. We defined a general hierarchy of classes that allow the developer to implement its own algorithms while remaining compatible with the rest of the libraries. Bio++ source code is distributed free of charge under the CeCILL general public licence from its website

    In search of lost introns

    Full text link
    Many fundamental questions concerning the emergence and subsequent evolution of eukaryotic exon-intron organization are still unsettled. Genome-scale comparative studies, which can shed light on crucial aspects of eukaryotic evolution, require adequate computational tools. We describe novel computational methods for studying spliceosomal intron evolution. Our goal is to give a reliable characterization of the dynamics of intron evolution. Our algorithmic innovations address the identification of orthologous introns, and the likelihood-based analysis of intron data. We discuss a compression method for the evaluation of the likelihood function, which is noteworthy for phylogenetic likelihood problems in general. We prove that after O(nL)O(nL) preprocessing time, subsequent evaluations take O(nL/log⁥L)O(nL/\log L) time almost surely in the Yule-Harding random model of nn-taxon phylogenies, where LL is the input sequence length. We illustrate the practicality of our methods by compiling and analyzing a data set involving 18 eukaryotes, more than in any other study to date. The study yields the surprising result that ancestral eukaryotes were fairly intron-rich. For example, the bilaterian ancestor is estimated to have had more than 90% as many introns as vertebrates do now

    Selecton 2007: advanced models for detecting positive and purifying selection using a Bayesian inference approach

    Get PDF
    Biologically significant sites in a protein may be identified by contrasting the rates of synonymous (Ks) and non-synonymous (Ka) substitutions. This enables the inference of site-specific positive Darwinian selection and purifying selection. We present here Selecton version 2.2 (http://selecton.bioinfo.tau.ac.il), a web server which automatically calculates the ratio between Ka and Ks (ω) at each site of the protein. This ratio is graphically displayed on each site using a color-coding scheme, indicating either positive selection, purifying selection or lack of selection. Selecton implements an assembly of different evolutionary models, which allow for statistical testing of the hypothesis that a protein has undergone positive selection. Specifically, the recently developed mechanistic-empirical model is introduced, which takes into account the physicochemical properties of amino acids. Advanced options were introduced to allow maximal fine tuning of the server to the user's specific needs, including calculation of statistical support of the ω values, an advanced graphic display of the protein's 3-dimensional structure, use of different genetic codes and inputting of a pre-built phylogenetic tree. Selecton version 2.2 is an effective, user-friendly and freely available web server which implements up-to-date methods for computing site-specific selection forces, and the visualization of these forces on the protein's sequence and structure

    The EM Algorithm and the Rise of Computational Biology

    Get PDF
    In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma"; of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis.Comment: Published in at http://dx.doi.org/10.1214/09-STS312 the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Latent tree models

    Full text link
    Latent tree models are graphical models defined on trees, in which only a subset of variables is observed. They were first discussed by Judea Pearl as tree-decomposable distributions to generalise star-decomposable distributions such as the latent class model. Latent tree models, or their submodels, are widely used in: phylogenetic analysis, network tomography, computer vision, causal modeling, and data clustering. They also contain other well-known classes of models like hidden Markov models, Brownian motion tree model, the Ising model on a tree, and many popular models used in phylogenetics. This article offers a concise introduction to the theory of latent tree models. We emphasise the role of tree metrics in the structural description of this model class, in designing learning algorithms, and in understanding fundamental limits of what and when can be learned

    Identification of the Otopetrin Domain, a conserved domain in vertebrate otopetrins and invertebrate otopetrin-like family members

    Get PDF
    <p>Abstract</p> <p>Background</p> <p><it>Otopetrin 1 (Otop1) </it>encodes a multi-transmembrane domain protein with no homology to known transporters, channels, exchangers, or receptors. Otop1 is necessary for the formation of otoconia and otoliths, calcium carbonate biominerals within the inner ear of mammals and teleost fish that are required for the detection of linear acceleration and gravity. Vertebrate <it>Otop1 </it>and its paralogues <it>Otop2 </it>and <it>Otop3 </it>define a new gene family with homology to the invertebrate Domain of Unknown Function 270 genes (<it>DUF270</it>; pfam03189).</p> <p>Results</p> <p>Multi-species comparison of the predicted primary sequences and predicted secondary structures of 62 vertebrate otopetrin, and arthropod and nematode DUF270 proteins, has established that the genes encoding these proteins constitute a single family that we renamed the Otopetrin Domain Protein (<it>ODP</it>) gene family. Signature features of ODP proteins are three "Otopetrin Domains" that are highly conserved between vertebrates, arthropods and nematodes, and a highly constrained predicted loop structure.</p> <p>Conclusion</p> <p>Our studies suggest a refined topologic model for ODP insertion into the lipid bilayer of 12 transmembrane domains, and highlight conserved amino-acid residues that will aid in the biochemical examination of ODP family function. The high degree of sequence and structural similarity of the ODP proteins may suggest a conserved role in the intracellular trafficking of calcium and the formation of biominerals.</p

    Probabilistic Graphical Model Representation in Phylogenetics

    Get PDF
    Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (1) reproducibility of an analysis, (2) model development and (3) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and non-specialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies. Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well-suited to teach statistical models, to facilitate communication among phylogeneticists and in the development of generic software for simulation and statistical inference. Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis-Hastings or Gibbs sampling of the posterior distribution
    corecore