16 research outputs found
Getting started in probabilistic graphical models
Probabilistic graphical models (PGMs) have become a popular tool for
computational analysis of biological data in a variety of domains. But, what
exactly are they and how do they work? How can we use PGMs to discover patterns
that are biologically relevant? And to what extent can PGMs help us formulate
new hypotheses that are testable at the bench? This note sketches out some
answers and illustrates the main ideas behind the statistical approach to
biological pattern discovery.Comment: 12 pages, 1 figur
Subtree power analysis finds optimal species for comparative genomics
Sequence comparison across multiple organisms aids in the detection of
regions under selection. However, resource limitations require a prioritization
of genomes to be sequenced. This prioritization should be grounded in two
considerations: the lineal scope encompassing the biological phenomena of
interest, and the optimal species within that scope for detecting functional
elements. We introduce a statistical framework for optimal species subset
selection, based on maximizing power to detect conserved sites. In a study of
vertebrate species, we show that the optimal species subset is not in general
the most evolutionarily diverged subset. Our results suggest that marsupials
are prime sequencing candidates.Comment: 16 pages, 3 figures, 3 table
A Mutagenetic Tree Hidden Markov Model for Longitudinal Clonal HIV Sequence Data
RNA viruses provide prominent examples of measurably evolving populations. In
HIV infection, the development of drug resistance is of particular interest,
because precise predictions of the outcome of this evolutionary process are a
prerequisite for the rational design of antiretroviral treatment protocols. We
present a mutagenetic tree hidden Markov model for the analysis of longitudinal
clonal sequence data. Using HIV mutation data from clinical trials, we estimate
the order and rate of occurrence of seven amino acid changes that are
associated with resistance to the reverse transcriptase inhibitor efavirenz.Comment: 20 pages, 6 figure
Reference based annotation with GeneMapper
We introduce GeneMapper, a program for transferring annotations from a well annotated genome to other genomes. Drawing on high quality curated annotations, GeneMapper enables rapid and accurate annotation of newly sequenced genomes and is suitable for both finished and draft genomes. GeneMapper uses a profile based approach for mapping genes into multiple species, improving upon the standard pairwise approach. GeneMapper is freely available for academic use
Accelerating Science: A Computing Research Agenda
The emergence of "big data" offers unprecedented opportunities for not only
accelerating scientific advances but also enabling new modes of discovery.
Scientific progress in many disciplines is increasingly enabled by our ability
to examine natural phenomena through the computational lens, i.e., using
algorithmic or information processing abstractions of the underlying processes;
and our ability to acquire, share, integrate and analyze disparate types of
data. However, there is a huge gap between our ability to acquire, store, and
process data and our ability to make effective use of the data to advance
discovery. Despite successful automation of routine aspects of data management
and analytics, most elements of the scientific process currently require
considerable human expertise and effort. Accelerating science to keep pace with
the rate of data acquisition and data processing calls for the development of
algorithmic or information processing abstractions, coupled with formal methods
and tools for modeling and simulation of natural processes as well as major
innovations in cognitive tools for scientists, i.e., computational tools that
leverage and extend the reach of human intellect, and partner with humans on a
broad range of tasks in scientific discovery (e.g., identifying, prioritizing
formulating questions, designing, prioritizing and executing experiments
designed to answer a chosen question, drawing inferences and evaluating the
results, and formulating new questions, in a closed-loop fashion). This calls
for concerted research agenda aimed at: Development, analysis, integration,
sharing, and simulation of algorithmic or information processing abstractions
of natural processes, coupled with formal methods and tools for their analyses
and simulation; Innovations in cognitive tools that augment and extend human
intellect and partner with humans in all aspects of science.Comment: Computing Community Consortium (CCC) white paper, 17 page
A phylogenetic generalized hidden Markov model for predicting alternatively spliced exons
BACKGROUND: An important challenge in eukaryotic gene prediction is accurate identification of alternatively spliced exons. Functional transcripts can go undetected in gene expression studies when alternative splicing only occurs under specific biological conditions. Non-expression based computational methods support identification of rarely expressed transcripts. RESULTS: A non-expression based statistical method is presented to annotate alternatively spliced exons using a single genome sequence and evidence from cross-species sequence conservation. The computational method is implemented in the program ExAlt and an analysis of prediction accuracy is given for Drosophila melanogaster. CONCLUSION: ExAlt identifies the structure of most alternatively spliced exons in the test set and cross-species sequence conservation is shown to improve the precision of predictions. The software package is available to run on Drosophila genomes to search for new cases of alternative splicing
Evolutionary Sequence Modeling for Discovery of Peptide Hormones
There are currently a large number of “orphan” G-protein-coupled receptors (GPCRs) whose endogenous ligands (peptide hormones) are unknown. Identification of these peptide hormones is a difficult and important problem. We describe a computational framework that models spatial structure along the genomic sequence simultaneously with the temporal evolutionary path structure across species and show how such models can be used to discover new functional molecules, in particular peptide hormones, via cross-genomic sequence comparisons. The computational framework incorporates a priori high-level knowledge of structural and evolutionary constraints into a hierarchical grammar of evolutionary probabilistic models. This computational method was used for identifying novel prohormones and the processed peptide sites by producing sequence alignments across many species at the functional-element level. Experimental results with an initial implementation of the algorithm were used to identify potential prohormones by comparing the human and non-human proteins in the Swiss-Prot database of known annotated proteins. In this proof of concept, we identified 45 out of 54 prohormones with only 44 false positives. The comparison of known and hypothetical human and mouse proteins resulted in the identification of a novel putative prohormone with at least four potential neuropeptides. Finally, in order to validate the computational methodology, we present the basic molecular biological characterization of the novel putative peptide hormone, including its identification and regional localization in the brain. This species comparison, HMM-based computational approach succeeded in identifying a previously undiscovered neuropeptide from whole genome protein sequences. This novel putative peptide hormone is found in discreet brain regions as well as other organs. The success of this approach will have a great impact on our understanding of GPCRs and associated pathways and help to identify new targets for drug development
Vertebrate gene finding from multiple-species alignments using a two-level strategy
BACKGROUND: One way in which the accuracy of gene structure prediction in vertebrate DNA sequences can be improved is by analyzing alignments with multiple related species, since functional regions of genes tend to be more conserved. RESULTS: We describe DOGFISH, a vertebrate gene finder consisting of a cleanly separated site classifier and structure predictor. The classifier scores potential splice sites and other features, using sequence alignments between multiple vertebrate species, while the structure predictor hypothesizes coding transcripts by combining these scores using a simple model of gene structure. This also identifies and assigns confidence scores to possible additional exons. Performance is assessed on the ENCODE regions. We predict transcripts and exons across the whole human genome, and identify over 10,000 high confidence new coding exons not in the Ensembl gene set. CONCLUSION: We present a practical multiple species gene prediction method. Accuracy improves as additional species, up to at least eight, are introduced. The novel predictions of the whole-genome scan should support efficient experimental verification