Getting started in probabilistic graphical models
Probabilistic graphical models (PGMs) have become a popular tool for
computational analysis of biological data in a variety of domains. But, what
exactly are they and how do they work? How can we use PGMs to discover patterns
that are biologically relevant? And to what extent can PGMs help us formulate
new hypotheses that are testable at the bench? This note sketches out some
answers and illustrates the main ideas behind the statistical approach to
biological pattern discovery.Comment: 12 pages, 1 figur
Efficient compression of biological sequences using a neural network
Background: The increasing production of genomic data has led to
an intensified need for models that can cope efficiently with the lossless
compression of biosequences. Important applications include long-term
storage and compression-based data analysis. In the literature, only a
few recent articles propose the use of neural networks for biosequence
compression. However, they fall short when compared with specific
DNA compression tools, such as GeCo2. This limitation is due to the
absence of models specifically designed for DNA sequences. In this
work, we combine the power of neural networks with specific DNA and
amino acid models. For this purpose, we created GeCo3 and AC2, two
new biosequence compressors. Both use a neural network for mixing
the opinions of multiple specific models.
Findings: We benchmark GeCo3 as a reference-free DNA compressor
in five datasets, including a balanced and comprehensive dataset
of DNA sequences, the Y-chromosome and human mitogenome, two
compilations of archaeal and virus genomes, four whole genomes, and
two collections of FASTQ data of a human virome and ancient DNA.
GeCo3 achieves a solid improvement in compression over the previous
version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively.
As a reference-based DNA compressor, we benchmark GeCo3 in four
datasets consisting of the pairwise compression of the chromosomes
of the genomes of several primates. GeCo3 improves the compression in
12.4%, 11.7%, 10.8%, and 10.1% over the state-of-the-art. The cost of
this compression improvement is some additional computational time
(1.7× to 3.0× slower than GeCo2). RAM usage is constant, and the tool
scales efficiently, independently of the sequence size. Overall, these
compression ratios outperform the state-of-the-art. For AC2, the improvements and
costs over AC are similar, which allows the tool to also outperform the
state-of-the-art.
Conclusions: GeCo3 and AC2 are biosequence compressors with
a neural-network mixing approach that provides additional gains over
top specific biocompressors. The proposed mixing method is portable,
requiring only the probabilities of the models as inputs, providing easy
adaptation to other data compressors or compression-based data analysis
tools. GeCo3 and AC2 are released under GPLv3 and are available
for free download at https://github.com/cobilab/geco3 and
https://github.com/cobilab/ac2.
Master's degree in Computer and Telematics Engineering
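The mixing step described above feeds the probability estimates of several specific models into a neural network that outputs a single prediction. A minimal sketch of logistic mixing, the classic way context-mixing compressors combine expert probabilities (illustrative only; GeCo3's and AC2's actual network, inputs, and training details differ):

```python
import math

def stretch(p):
    # map a probability to the logit domain
    return math.log(p / (1.0 - p))

def squash(x):
    # inverse of stretch: map a logit back to a probability
    return 1.0 / (1.0 + math.exp(-x))

def mix(probs, weights):
    # logistic mixing: weighted sum of per-model predictions in the
    # logit domain, squashed back to a single probability
    return squash(sum(w * stretch(p) for w, p in zip(weights, probs)))

def update(probs, weights, bit, lr=0.02):
    # online gradient step that lowers the code length of the observed
    # bit, shifting weight toward the models that predicted it well
    err = bit - mix(probs, weights)
    return [w + lr * err * stretch(p) for w, p in zip(weights, probs)]
```

Because the weights are trained online on both the compressor and decompressor side, no network parameters need to be stored in the compressed stream; for multi-symbol alphabets such as DNA, the same idea applies per symbol with a softmax output.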
A Mutagenetic Tree Hidden Markov Model for Longitudinal Clonal HIV Sequence Data
RNA viruses provide prominent examples of measurably evolving populations. In
HIV infection, the development of drug resistance is of particular interest,
because precise predictions of the outcome of this evolutionary process are a
prerequisite for the rational design of antiretroviral treatment protocols. We
present a mutagenetic tree hidden Markov model for the analysis of longitudinal
clonal sequence data. Using HIV mutation data from clinical trials, we estimate
the order and rate of occurrence of seven amino acid changes that are
associated with resistance to the reverse transcriptase inhibitor efavirenz.
Comment: 20 pages, 6 figures
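The hidden Markov machinery underlying such models scores an observed sequence by summing over all hidden state paths, which the forward algorithm does in polynomial time. A generic sketch with toy parameters (the paper's hidden states are resistance mutation patterns on a mutagenetic tree, not these):

```python
def forward(obs, pi, A, B):
    # forward algorithm: P(obs) = sum over hidden paths of
    # P(path) * P(obs | path), computed in O(T * S^2)
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]      # initialisation
    for o in obs[1:]:                                      # recursion
        alpha = [B[s][o] * sum(alpha[r] * A[r][s] for r in range(n))
                 for s in range(n)]
    return sum(alpha)                                      # termination
```

For example, with two hidden states, uniform start probabilities `pi = [0.5, 0.5]`, sticky transitions `A = [[0.9, 0.1], [0.1, 0.9]]`, and emissions `B = [[0.8, 0.2], [0.2, 0.8]]`, the likelihoods of all observation sequences of a given length sum to one, as they must.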
11th Intelligent Systems for Molecular Biology 2003 (ISMB 2003)
This report profiles the keynote talks given at ISMB 2003 in Brisbane, Australia, by Ron Shamir, David Haussler, John Mattick, Yoshihide Hayashizaki, Sydney Brenner, the
Overton Prize winner, Jim Kent, and the ISCB Senior Accomplishment Awardee,
David Sankoff.
The UCSC Archaeal Genome Browser
As more archaeal genomes are sequenced, effective research and analysis tools are needed to integrate the diverse information available for any given locus. The feature-rich UCSC Genome Browser, created originally to annotate the human genome, can be applied to any sequenced organism. We have created a UCSC Archaeal Genome Browser, available at , currently with 26 archaeal genomes. It displays G/C content, gene and operon annotation from multiple sources, sequence motifs (promoters and Shine-Dalgarno), microarray data, multi-genome alignments, and protein conservation across phylogenetic and habitat categories. We encourage submission of new experimental and bioinformatic analyses from contributors. The purpose of this tool is to aid biological discovery and facilitate greater collaboration within the archaeal research community.
JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions
BACKGROUND: Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near-perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene-finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures.
RESULTS: Here we describe our general-purpose eukaryotic gene-finding pipeline and its major components, as well as the methodological adaptations that we found necessary in accommodating human DNA in our pipeline, noting that a similar level of effort may be necessary by ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states into our models and the resulting impact these variations have had on predictive accuracy.
CONCLUSION: While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based 'combiner' program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.
Estimating empirical codon hidden Markov models
Empirical codon models (ECMs) estimated from a large number of globular protein families outperformed mechanistic codon models in their description of the general process of protein evolution. Among other factors, ECMs implicitly model the influence of amino acid properties and multiple nucleotide substitutions (MNS). However, the estimation of ECMs requires large quantities of data, and until recently, only few suitable data sets were available. Here, we take advantage of several new Drosophila species genomes to estimate codon models from genome-wide data. The availability of large numbers of genomes over varying phylogenetic depths in the Drosophila genus allows us to explore various divergence levels. In consequence, we can use these data to determine the appropriate level of divergence for the estimation of ECMs, avoiding overestimation of MNS rates caused by saturation. To account for variation in evolutionary rates along the genome, we develop new empirical codon hidden Markov models (ecHMMs). These models significantly outperform previous ones with respect to maximum likelihood values, suggesting that they provide a better fit to the evolutionary process. Using ECMs and ecHMMs derived from genome-wide data sets, we devise new likelihood ratio tests (LRTs) of positive selection. We found classical LRTs very sensitive to the presence of MNSs, showing high false-positive rates, especially with small phylogenies. The new LRTs are more conservative than the classical ones, having acceptable false-positive rates and reduced power.
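A likelihood ratio test of the kind mentioned here compares a null and an alternative model via twice the log-likelihood difference, referred to a chi-squared distribution. A minimal sketch for one degree of freedom (illustrative only; the degrees of freedom and mixture corrections used in practice depend on the specific test):

```python
import math

def lrt_pvalue(lnL_null, lnL_alt):
    # LRT statistic: 2 * (lnL_alt - lnL_null), clamped at 0 because a
    # nested null model can never fit better than the alternative
    stat = max(0.0, 2.0 * (lnL_alt - lnL_null))
    # chi-squared survival function with df = 1:
    # P(X >= stat) = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2.0))
```

An improvement of about 1.92 log-likelihood units (statistic 3.84) recovers the familiar 5% significance threshold.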