226 research outputs found

    Computational annotation of eukaryotic gene structures: algorithms development and software systems

    An important foundation for the advancement of both basic and applied biological science is correct annotation of protein-coding gene repertoires in model organisms. Accurate automated annotation of eukaryotic gene structures remains a challenging, open-ended and critical problem for modern computational biology. The use of extrinsic (homology) information has proven a quite successful strategy for this task, though it is not a perfect solution, for a variety of reasons. More recently, gene prediction methods leveraging information present in syntenic genomic sequences have become favored, though these, too, have limitations. Identifying genes by inspection of genomic sequence alone thoroughly tests our theoretical understanding of the gene recognition process as it occurs in vivo, and where we encounter failure, excellent opportunities for meaningful research are revealed. Therefore, the continued development of methods not reliant on homology information (the so-called ab initio gene prediction methods) should help to more rapidly achieve a comprehensive understanding of gene content, at least in our model organisms. This thesis explores the development of novel algorithms in an attempt to advance the current state of the art in gene prediction, with particular emphasis on ab initio approaches. The work has been conducted with an eye towards contributing open source, well-documented, and extensible software systems implementing the methods, and towards generating novel biological knowledge with respect to plant taxa in particular.

    Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency

    BACKGROUND: Transcription regulatory regions in higher eukaryotes are often represented by cis-regulatory modules (CRMs) and are responsible for the formation of specific spatial and temporal gene expression patterns. These extended, ~1 kb, regions are found far from coding sequences and cannot be extracted from the genome on the basis of their position relative to the coding regions. RESULTS: To explore the feasibility of CRM extraction from a genome, we generated an original training set containing annotated sequence data for most of the known developmental CRMs from Drosophila. Based on this set of experimental data, we developed a strategy for statistical extraction of cis-regulatory modules from the genome, using exhaustive analysis of local word frequency (LWF). To assess the performance of our analysis, we measured the correlation between predictions generated by the LWF algorithm and the distribution of conserved non-coding regions in a number of Drosophila developmental genes. CONCLUSIONS: In most of the cases tested, we observed high correlation (up to 0.6–0.8, measured on the entire gene locus) between the two independent techniques. We discuss computational strategies available for extraction of Drosophila CRMs and possible extensions of these methods.

    Generalizations of Markov model to characterize biological sequences

    BACKGROUND: The currently used kth-order Markov models estimate the probability of generating a single nucleotide conditional upon the immediately preceding (gap = 0) k units. However, this neither takes into account the joint dependency of multiple neighboring nucleotides, nor does it consider the long-range dependency with gap > 0. RESULT: We describe a configurable tool to explore generalizations of the standard Markov model. We evaluated whether the sequence classification accuracy can be improved by using an alternative set of model parameters. The evaluation was done on four classes of biological sequences: CpG-poor promoters, all promoters, exons and nucleosome positioning sequences. Using di- and tri-nucleotides as the model unit significantly improved the sequence classification accuracy relative to the standard single-nucleotide model. In the case of nucleosome positioning sequences, optimal accuracy was achieved at a gap length of 4. Furthermore, in the plot of classification accuracy versus the gap, a periodicity of 10–11 bp was observed, which might indicate structural preferences in the nucleosome positioning sequence. The tool is implemented in Java and is available for download at . CONCLUSION: Markov modeling is an important component of many sequence analysis tools. We have extended the standard Markov model to incorporate joint and long-range dependencies between the sequence elements. The proposed generalizations of the Markov model are likely to improve the overall accuracy of sequence analysis tools.
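The gapped dependency described in this abstract can be illustrated with a small sketch. It is not the authors' Java tool: the function names, the add-one pseudocount, and the restriction to single-nucleotide units are all assumptions made here for illustration. Each nucleotide is conditioned on the k symbols that end `gap` positions before it, so `gap=0` gives the standard kth-order model.

```python
from collections import defaultdict
import math


def train_gapped_markov(seqs, k=1, gap=0, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(symbol | k-symbol context ending `gap` positions earlier).

    Hypothetical helper: returns a function prob(context, symbol).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k + gap, len(s)):
            # Context is the k symbols that end `gap` positions before i.
            context = s[i - gap - k : i - gap]
            counts[context][s[i]] += 1

    def prob(context, symbol):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(alphabet)
        return (row.get(symbol, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, k=1, gap=0):
    """Log-likelihood of seq under a model trained with the same k and gap."""
    return sum(
        math.log(prob(seq[i - gap - k : i - gap], seq[i]))
        for i in range(k + gap, len(seq))
    )
```

A sequence could then be classified by training one such model per class and picking the class with the highest `log_likelihood`; scanning `gap` over a range of values would mirror the accuracy-versus-gap comparison the abstract mentions.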

    Markov Chain-based Promoter Structure Modeling for Tissue-specific Expression Pattern Prediction

    Transcriptional regulation is the first level of regulation of gene expression and is therefore a major topic in computational biology. Genes with similar expression patterns can be assumed to be co-regulated at the transcriptional level by promoter sequences with a similar structure. Current approaches for modeling shared regulatory features tend to focus mainly on clustering of cis-regulatory sites. Here we introduce a Markov chain-based promoter structure model that uses both shared motifs and shared features from an input set of promoter sequences to predict candidate genes with similar expression. The model uses positional preference, order, and orientation of motifs. The trained model is used to score a genomic set of promoter sequences: high-scoring promoters are assumed to have a structure similar to the input sequences and are thus expected to drive similar expression patterns. We applied our model to two datasets, in Caenorhabditis elegans and in Ciona intestinalis. Both computational and experimental verifications indicate that this model is capable of predicting candidate promoters that drive expression patterns similar to those of the input regulatory sequences. This model can be useful for finding promising candidate genes for wet-lab experiments and for increasing our understanding of transcriptional regulation.

    Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences.

    Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs, nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k - 1 act as priors for those of order k. This Bayesian Markov model (BaMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de novo discovery of enriched motifs. For transcription factor binding, BaMMs achieve significantly (P = 1/16) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26-101%. BaMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BaMMs.
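The idea of letting order k - 1 probabilities act as priors for order k can be sketched with pseudocount interpolation. The abstract does not give the formula, so the version below is an assumption: a position-independent model in which the lower-order estimate, weighted by a hypothetical strength `alpha`, is added to the observed counts. It is not the BaMM implementation and omits the EM motif discovery step.

```python
from collections import Counter


def interpolated_probs(seqs, max_order=2, alpha=2.0, alphabet="ACGT"):
    """Return prob(context, x) where each order is smoothed by the order below.

    Hypothetical sketch: `alpha` is an assumed prior strength shared by all
    orders, and order 0 falls back to a uniform prior over `alphabet`.
    """
    counts = Counter()
    for s in seqs:
        for k in range(max_order + 1):
            for i in range(k, len(s)):
                counts[s[i - k : i + 1]] += 1  # context of length k plus symbol

    def _p(context, x):
        if context == "":
            total = sum(counts[a] for a in alphabet)
            return (counts[x] + alpha / len(alphabet)) / (total + alpha)
        lower = _p(context[1:], x)  # drop the most distant context symbol
        n_ctx = sum(counts[context + a] for a in alphabet)
        return (counts[context + x] + alpha * lower) / (n_ctx + alpha)

    def prob(context, x):
        # Keep at most max_order trailing symbols of the context.
        context = context[-max_order:] if max_order > 0 else ""
        return _p(context, x)

    return prob
```

With few observations of a context, the estimate stays close to the lower-order value; with many, the counts dominate. This is one way model complexity can adapt to the amount of data, as the abstract describes.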

    Hidden Markov Model with Binned Duration and Its Application

    Hidden Markov models (HMMs) have been widely used in various applications such as speech processing and bioinformatics. However, the standard hidden Markov model requires state occupancy durations to be geometrically distributed, which can be inappropriate in some real-world applications where the distributions on state intervals deviate significantly from the geometric distribution, such as multi-modal distributions and heavy-tailed distributions. The hidden Markov model with duration (HMMD) avoids this limitation by explicitly incorporating the appropriate state duration distribution, at the price of significant computational expense. As a result, the applications of HMMD are still quite limited. In this work, we present a new algorithm, Hidden Markov Model with Binned Duration (HMMBD), whose result shows no loss of accuracy compared to the HMMD decoding performance and a computational expense that only differs from the much simpler and faster HMM decoding by a constant factor. More precisely, we further improve the computational complexity of HMMD from O(TNN + TND) to O(TNN + TND*), where TNN stands for the computational complexity of the HMM, D is the max duration value allowed and can be very large, and D* generally could be a small constant value.
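The geometric-duration constraint mentioned in this abstract follows from the self-transition structure of a standard HMM. As a brief worked form, writing $a_{ii}$ for the self-transition probability of state $i$ (notation assumed here, not taken from the abstract), the probability of occupying state $i$ for exactly $d$ consecutive steps is:

```latex
P_i(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \dots
\qquad\text{with mean}\qquad
\mathbb{E}[d] = \frac{1}{1 - a_{ii}} .
```

Because this distribution always has its mode at $d = 1$ and decays exponentially, a single state cannot represent multi-modal or heavy-tailed durations, which is the limitation explicit-duration models address.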

    Quantitative modeling and statistical analysis of protein-DNA binding sites


    An empirical analysis of training protocols for probabilistic gene finders

    BACKGROUND: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has as yet been paid to their proper training. The few hints available in the literature, together with anecdotal observations, suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters. RESULTS: We decided to investigate the utility of applying a more systematic optimization approach to the tuning of global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that significant improvement in prediction accuracy can be achieved by this method. CONCLUSIONS: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements.