32 research outputs found
The correlation space of Gaussian latent tree models and model selection without fitting
We provide a complete description of possible covariance matrices consistent
with a Gaussian latent tree model for any tree. We then present techniques for
utilising these constraints to assess whether observed data is compatible with
that Gaussian latent tree model. Our method does not require us first to fit
such a tree. We demonstrate the usefulness of the inverse-Wishart distribution
for performing preliminary assessments of tree-compatibility using
semialgebraic constraints. Using results from Drton et al. (2008) we then
provide the appropriate moments required for test statistics for assessing
adherence to these equality constraints. These are shown to be effective even
for small sample sizes and can be easily adjusted to test either the entire
model or only certain macrostructures hypothesized within the tree. We
illustrate our exploratory tetrad analysis using a linguistic application and
our confirmatory tetrad analysis using a biological application. Comment: 15 page
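As a rough illustration of the tetrad constraints mentioned in the abstract above, the sketch below builds the covariance matrix of the simplest Gaussian latent tree (a star: one latent variable H with four observed children) and checks that all tetrad differences vanish. The function names, loadings, and noise variances are assumptions chosen for illustration; this is not the paper's test statistic, only the algebraic identity it builds on.

```python
from itertools import combinations


def star_tree_covariance(loadings, noise_vars):
    """Covariance of X_i = loadings[i] * H + eps_i, assuming Var(H) = 1."""
    n = len(loadings)
    return [
        [loadings[i] * loadings[j] + (noise_vars[i] if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


def tetrads(cov, i, j, k, l):
    """The three tetrad differences for the quadruple (i, j, k, l)."""
    return (
        cov[i][j] * cov[k][l] - cov[i][k] * cov[j][l],
        cov[i][j] * cov[k][l] - cov[i][l] * cov[j][k],
        cov[i][k] * cov[j][l] - cov[i][l] * cov[j][k],
    )


def max_abs_tetrad(cov):
    """Largest absolute tetrad difference over all quadruples of variables."""
    n = len(cov)
    return max(
        abs(t)
        for quad in combinations(range(n), 4)
        for t in tetrads(cov, *quad)
    )


# Hypothetical loadings and noise variances (illustration only).
cov = star_tree_covariance([0.9, 0.8, 0.7, 0.6], [0.19, 0.36, 0.51, 0.64])
star_value = max_abs_tetrad(cov)  # zero up to floating-point rounding

# Perturbing one covariance entry breaks the star structure.
perturbed = [row[:] for row in cov]
perturbed[0][1] = perturbed[1][0] = 0.5
perturbed_value = max_abs_tetrad(perturbed)  # clearly non-zero
```

With sample data the tetrads are only approximately zero, which is why the abstract turns to moments of the test statistics rather than an exact check like this one.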
Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics
BACKGROUND: A large number of bioinformatics applications in the fields of bio-sequence analysis, molecular evolution and population genetics typically share input/output methods, data storage requirements and data analysis algorithms. Such common features may be conveniently bundled into re-usable libraries, which enable the rapid development of new methods and robust applications. RESULTS: We present Bio++, a set of object-oriented libraries written in C++. Available components include classes for data storage and handling (nucleotide/amino-acid/codon sequences, trees, distance matrices, population genetics datasets), various input/output formats, basic sequence manipulation (concatenation, transcription, translation, etc.), phylogenetic analysis (maximum parsimony, Markov models, distance methods, likelihood computation and maximization), population genetics/genomics (diversity statistics, neutrality tests, various multi-locus analyses) and various algorithms for numerical calculus. CONCLUSION: The implementation of methods aims to be both efficient and user-friendly. Special attention was given to the library design to enable easy extension and the development of new methods. We defined a general hierarchy of classes that allows developers to implement their own algorithms while remaining compatible with the rest of the libraries. Bio++ source code is distributed free of charge under the CeCILL general public licence from its website
In search of lost introns
Many fundamental questions concerning the emergence and subsequent evolution
of eukaryotic exon-intron organization are still unsettled. Genome-scale
comparative studies, which can shed light on crucial aspects of eukaryotic
evolution, require adequate computational tools.
We describe novel computational methods for studying spliceosomal intron
evolution. Our goal is to give a reliable characterization of the dynamics of
intron evolution. Our algorithmic innovations address the identification of
orthologous introns, and the likelihood-based analysis of intron data. We
discuss a compression method for the evaluation of the likelihood function,
which is noteworthy for phylogenetic likelihood problems in general. We prove
that after $O(n\ell)$ preprocessing time, subsequent evaluations take $O(n\ell/\log\ell)$ time almost surely in the Yule-Harding random model of $n$-taxon
phylogenies, where $\ell$ is the input sequence length.
We illustrate the practicality of our methods by compiling and analyzing a
data set involving 18 eukaryotes, more than in any other study to date. The
study yields the surprising result that ancestral eukaryotes were fairly
intron-rich. For example, the bilaterian ancestor is estimated to have had more
than 90% as many introns as vertebrates do now
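To give a feel for why compression helps likelihood evaluation, the sketch below shows the common baseline of site-pattern compression: identical alignment columns are collapsed and each unique pattern is evaluated once, weighted by its count. This is not the compression method of the paper above, and `toy_site_likelihood` and `BASE_FREQ` are hypothetical placeholders standing in for an expensive per-site computation.

```python
import math
from collections import Counter

# Hypothetical base frequencies (illustration only).
BASE_FREQ = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def toy_site_likelihood(column):
    """Placeholder for an expensive per-site likelihood computation."""
    p = 1.0
    for base in column:
        p *= BASE_FREQ[base]
    return p


def log_likelihood_naive(alignment):
    """Evaluate every alignment column separately."""
    return sum(math.log(toy_site_likelihood(col)) for col in zip(*alignment))


def log_likelihood_compressed(alignment):
    """Evaluate each distinct column pattern once, weighted by its count."""
    pattern_counts = Counter(zip(*alignment))
    return sum(
        count * math.log(toy_site_likelihood(pattern))
        for pattern, count in pattern_counts.items()
    )


# Three aligned sequences with seven columns, five of them distinct.
alignment = ["AACGAAT", "AACGAAT", "ATCGATT"]
naive = log_likelihood_naive(alignment)
compressed = log_likelihood_compressed(alignment)
n_patterns = len(Counter(zip(*alignment)))
```

Both functions return the same value; the compressed version simply makes fewer calls to the per-site routine when columns repeat.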
Selecton 2007: advanced models for detecting positive and purifying selection using a Bayesian inference approach
Biologically significant sites in a protein may be identified by contrasting the rates of synonymous (Ks) and non-synonymous (Ka) substitutions. This enables the inference of site-specific positive Darwinian selection and purifying selection. We present here Selecton version 2.2 (http://selecton.bioinfo.tau.ac.il), a web server which automatically calculates the ratio between Ka and Ks (ω) at each site of the protein. This ratio is graphically displayed on each site using a color-coding scheme, indicating either positive selection, purifying selection or lack of selection. Selecton implements an assembly of different evolutionary models, which allow for statistical testing of the hypothesis that a protein has undergone positive selection. Specifically, the recently developed mechanistic-empirical model is introduced, which takes into account the physicochemical properties of amino acids. Advanced options were introduced to allow maximal fine tuning of the server to the user's specific needs, including calculation of statistical support of the ω values, an advanced graphic display of the protein's 3-dimensional structure, use of different genetic codes and inputting of a pre-built phylogenetic tree. Selecton version 2.2 is an effective, user-friendly and freely available web server which implements up-to-date methods for computing site-specific selection forces, and the visualization of these forces on the protein's sequence and structure
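The interpretation of the Ka/Ks ratio described above can be sketched in a few lines. The code below is only a toy classifier on already-estimated rates; it is not Selecton's Bayesian inference, and the function names and the `tolerance` band around 1 are assumptions made for illustration.

```python
def omega(ka, ks):
    """Ratio of non-synonymous (Ka) to synonymous (Ks) substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    return ka / ks


def classify_site(w, tolerance=0.1):
    """Label a site from its omega value; `tolerance` is a hypothetical band."""
    if w > 1 + tolerance:
        return "positive"
    if w < 1 - tolerance:
        return "purifying"
    return "neutral"


# Hypothetical per-site (Ka, Ks) estimates.
sites = [(0.6, 0.2), (0.05, 0.5), (0.2, 0.2)]
labels = [classify_site(omega(ka, ks)) for ka, ks in sites]
```

A real analysis would attach statistical support to each ω value, as the abstract notes, rather than apply a fixed threshold.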
The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
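For readers unfamiliar with the EM algorithm surveyed above, here is a minimal sketch on a textbook toy problem: a 50/50 mixture of two coins with unknown head probabilities, where the coin used in each trial is latent. The data, starting values, and function name are assumptions for illustration and are not taken from the article.

```python
import math


def em_two_coins(heads, flips, theta=(0.6, 0.5), iterations=20):
    """EM for an equal-weight mixture of two coins with unknown biases.

    heads[i] is the number of heads among `flips` tosses in trial i.
    Returns the two estimated head probabilities and the log-likelihood
    (up to an additive constant) recorded at the start of each iteration.
    """
    theta_a, theta_b = theta
    history = []
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        loglik = 0.0
        for h in heads:
            t = flips - h
            like_a = theta_a ** h * (1 - theta_a) ** t
            like_b = theta_b ** h * (1 - theta_b) ** t
            loglik += math.log(0.5 * like_a + 0.5 * like_b)
            # E-step: posterior probability that coin A produced this trial.
            resp_a = like_a / (like_a + like_b)
            heads_a += resp_a * h
            tails_a += resp_a * t
            heads_b += (1 - resp_a) * h
            tails_b += (1 - resp_a) * t
        history.append(loglik)
        # M-step: re-estimate each bias from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b, history


# Hypothetical data: heads observed in five trials of ten tosses each.
theta_a, theta_b, history = em_two_coins([5, 9, 8, 4, 7], flips=10)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the defining property that makes EM attractive for the latent-variable problems listed in the abstract.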
Latent tree models
Latent tree models are graphical models defined on trees, in which only a
subset of variables is observed. They were first discussed by Judea Pearl as
tree-decomposable distributions to generalise star-decomposable distributions
such as the latent class model. Latent tree models, or their submodels, are
widely used in: phylogenetic analysis, network tomography, computer vision,
causal modeling, and data clustering. They also contain other well-known
classes of models like hidden Markov models, Brownian motion tree model, the
Ising model on a tree, and many popular models used in phylogenetics. This
article offers a concise introduction to the theory of latent tree models. We
emphasise the role of tree metrics in the structural description of this model
class, in designing learning algorithms, and in understanding fundamental
limits of what can be learned and when
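The role of tree metrics can be illustrated with the four-point condition. In a Gaussian latent tree the correlation between two leaves is the product of the edge correlations along the path joining them, so d(i, j) = -log|rho_ij| adds up along paths. The sketch below builds such distances for a hypothetical quartet tree with split {0, 1} | {2, 3} and checks the condition; all edge correlations and function names are assumptions.

```python
import math


def distance_from_correlation(rho):
    """Map a correlation to a path-additive distance, d = -log|rho|."""
    return -math.log(abs(rho))


def satisfies_four_point(d, i, j, k, l, tol=1e-9):
    """Four-point condition: the two largest pairwise sums are equal."""
    sums = sorted([
        d[i][j] + d[k][l],
        d[i][k] + d[j][l],
        d[i][l] + d[j][k],
    ])
    return abs(sums[2] - sums[1]) <= tol


# Hypothetical edge correlations for a quartet tree with split {0,1} | {2,3}.
LEAF_EDGE = [0.9, 0.8, 0.7, 0.6]
INNER_EDGE = 0.5


def leaf_correlation(i, j):
    """Product of edge correlations along the path between leaves i and j."""
    if i == j:
        return 1.0
    rho = LEAF_EDGE[i] * LEAF_EDGE[j]
    same_side = (i < 2) == (j < 2)
    return rho if same_side else rho * INNER_EDGE


d = [[distance_from_correlation(leaf_correlation(i, j)) for j in range(4)]
     for i in range(4)]
tree_like = satisfies_four_point(d, 0, 1, 2, 3)

# Inflating one distance symmetrically breaks the condition.
broken = [row[:] for row in d]
broken[0][2] += 1.0
broken[2][0] += 1.0
not_tree_like = satisfies_four_point(broken, 0, 1, 2, 3)
```

The smallest of the three sums identifies the split of the quartet, which is the basic step behind distance-based structure learning.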
Identification of the Otopetrin Domain, a conserved domain in vertebrate otopetrins and invertebrate otopetrin-like family members
BACKGROUND: Otopetrin 1 (Otop1) encodes a multi-transmembrane domain protein with no homology to known transporters, channels, exchangers, or receptors. Otop1 is necessary for the formation of otoconia and otoliths, calcium carbonate biominerals within the inner ear of mammals and teleost fish that are required for the detection of linear acceleration and gravity. Vertebrate Otop1 and its paralogues Otop2 and Otop3 define a new gene family with homology to the invertebrate Domain of Unknown Function 270 genes (DUF270; pfam03189). RESULTS: Multi-species comparison of the predicted primary sequences and predicted secondary structures of 62 vertebrate otopetrin, and arthropod and nematode DUF270 proteins, has established that the genes encoding these proteins constitute a single family that we renamed the Otopetrin Domain Protein (ODP) gene family. Signature features of ODP proteins are three "Otopetrin Domains" that are highly conserved between vertebrates, arthropods and nematodes, and a highly constrained predicted loop structure. CONCLUSION: Our studies suggest a refined topologic model for ODP insertion into the lipid bilayer of 12 transmembrane domains, and highlight conserved amino-acid residues that will aid in the biochemical examination of ODP family function. The high degree of sequence and structural similarity of the ODP proteins may suggest a conserved role in the intracellular trafficking of calcium and the formation of biominerals.
Probabilistic Graphical Model Representation in Phylogenetics
Recent years have seen a rapid expansion of the model space explored in
statistical phylogenetics, emphasizing the need for new approaches to
statistical model representation and software development. Clear communication
and representation of the chosen model is crucial for: (1) reproducibility of
an analysis, (2) model development and (3) software design. Moreover, a
unified, clear and understandable framework for model representation lowers the
barrier for beginners and non-specialists to grasp complex phylogenetic models,
including their assumptions and parameter/variable dependencies.
Graphical modeling is a unifying framework that has gained in popularity in
the statistical literature in recent years. The core idea is to break complex
models into conditionally independent distributions. The strength lies in the
comprehensibility, flexibility, and adaptability of this formalism, and the
large body of computational work based on it. Graphical models are well-suited
to teach statistical models, to facilitate communication among phylogeneticists
and in the development of generic software for simulation and statistical
inference.
Here, we provide an introduction to graphical models for phylogeneticists and
extend the standard graphical model representation to the realm of
phylogenetics. We introduce a new graphical model component, tree plates, to
capture the changing structure of the subgraph corresponding to a phylogenetic
tree. We describe a range of phylogenetic models using the graphical model
framework and introduce modules to simplify the representation of standard
components in large and complex models. Phylogenetic model graphs can be
readily used in simulation, maximum likelihood inference, and Bayesian
inference using, for example, Metropolis-Hastings or Gibbs sampling of the
posterior distribution