
    An overview of the role of context-sensitive HMMs in the prediction of ncRNA genes

    Non-coding RNAs (ncRNAs) are RNA molecules that function in the cell without being translated into proteins. In recent years, much evidence has been found that ncRNAs play a crucial role in various biological processes. As a result, there has been increasing interest in the prediction of ncRNA genes. Because the secondary structure of ncRNAs is conserved, there exist pairwise dependencies between distant bases. These dependencies cannot be effectively modeled using traditional HMMs, and we need a more complex model such as the context-sensitive HMM (csHMM). In this paper, we give an overview of the role of csHMMs in RNA secondary structure analysis and the prediction of ncRNA genes. It is demonstrated that context-sensitive HMMs can serve as an efficient framework for these purposes.
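The pairwise dependencies mentioned in this abstract can be made concrete with a small sketch. It is not taken from the paper: the function names, the dot-bracket input format, and the set of allowed pairs (Watson-Crick plus G-U wobble) are assumptions chosen for illustration. Matching each closing bracket to its opening partner needs a stack, which is the kind of memory a plain HMM with a fixed number of states does not have.

```python
# Illustrative sketch only; names and the allowed-pair set are assumptions.
CANONICAL_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),  # wobble pairs
}


def base_pairs(structure):
    """Return (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)              # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return pairs


def is_compatible(sequence, structure):
    """Check that every paired position holds an allowed base pair."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    return all(
        (sequence[i], sequence[j]) in CANONICAL_PAIRS
        for i, j in base_pairs(structure)
    )
```

For the hairpin `((..))`, position 0 depends on position 5 and position 1 on position 4, so the constraint links bases that are arbitrarily far apart as the stem grows.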

    XRate: a fast prototyping, training and annotation tool for phylo-grammars

    BACKGROUND: Recent years have seen the emergence of genome annotation methods based on the phylo-grammar, a probabilistic model combining continuous-time Markov chains and stochastic grammars. Previously, phylo-grammars have required considerable effort to implement, limiting their adoption by computational biologists. RESULTS: We have developed an open source software tool, xrate, for working with reversible, irreversible or parametric substitution models combined with stochastic context-free grammars. xrate efficiently estimates maximum-likelihood parameters and phylogenetic trees using a novel "phylo-EM" algorithm that we describe. The grammar is specified in an external configuration file, allowing users to design new grammars, estimate rate parameters from training data and annotate multiple sequence alignments without the need to recompile code from source. We have used xrate to measure codon substitution rates and predict protein and RNA secondary structures. CONCLUSION: Our results demonstrate that xrate estimates biologically meaningful rates and makes predictions whose accuracy is comparable to that of more specialized tools.

    Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction

    BACKGROUND: RNA secondary structure prediction methods based on probabilistic modeling can be developed using stochastic context-free grammars (SCFGs). Such methods can readily combine different sources of information that can be expressed probabilistically, such as an evolutionary model of comparative RNA sequence analysis and a biophysical model of structure plausibility. However, the number of free parameters in an integrated model for consensus RNA structure prediction can become untenable if the underlying SCFG design is too complex. Thus a key question is: what small, simple SCFG designs perform best for RNA secondary structure prediction? RESULTS: Nine different small SCFGs were implemented to explore the tradeoffs between model complexity and prediction accuracy. Each model was tested for single sequence structure prediction accuracy on a benchmark set of RNA secondary structures. CONCLUSIONS: Four SCFG designs had prediction accuracies near the performance of current energy minimization programs. One of these designs, introduced by Knudsen and Hein in their PFOLD algorithm, has only 21 free parameters and is significantly simpler than the others.
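As a rough illustration of how small such a grammar is, the sketch below runs the inside algorithm for the Knudsen-Hein grammar as it is usually written (S → LS | L, L → s | dFd, F → dFd | LS). All probabilities here are untrained placeholder values assumed for the example, and the function and table names are hypothetical; this is not code from the paper or from PFOLD. One common way to arrive at 21 free parameters is 3 rule choices, 3 single-base emissions and 15 base-pair emissions.

```python
from itertools import product

BASES = "ACGU"

# Placeholder (untrained) parameters, assumed purely for illustration.
RULE = {
    "S->LS": 0.5, "S->L": 0.5,
    "L->s": 0.5, "L->dFd": 0.5,
    "F->dFd": 0.5, "F->LS": 0.5,
}
SINGLE = {b: 0.25 for b in BASES}                              # unpaired base s
PAIR = {(a, b): 1 / 16 for a, b in product(BASES, repeat=2)}   # paired bases d...d


def inside(seq):
    """Total probability that S derives seq, summed over all parses."""
    n = len(seq)
    # Tables are indexed [i][j] for the half-open span seq[i:j].
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    L = [[0.0] * (n + 1) for _ in range(n + 1)]
    F = [[0.0] * (n + 1) for _ in range(n + 1)]

    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # X -> L S: split the span into a left L part and a right S part.
            split = sum(L[i][k] * S[k][j] for k in range(i + 1, j))
            # X -> d F d: outer bases pair and F derives the interior.
            wrapped = 0.0
            if length >= 2:
                wrapped = PAIR[(seq[i], seq[j - 1])] * F[i + 1][j - 1]
            unpaired = SINGLE[seq[i]] if length == 1 else 0.0

            L[i][j] = RULE["L->s"] * unpaired + RULE["L->dFd"] * wrapped
            F[i][j] = RULE["F->dFd"] * wrapped + RULE["F->LS"] * split
            S[i][j] = RULE["S->LS"] * split + RULE["S->L"] * L[i][j]

    return S[0][n]
```

Replacing the sums with maximizations and keeping traceback pointers would turn the same recursion into a CYK-style search for the single most probable structure.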

    Ambivalent covariance models

    Genomics and proteomics: a signal processor's tour

    The theory and methods of signal processing are becoming increasingly important in molecular biology. Digital filtering techniques, transform domain methods, and Markov models have played important roles in gene identification, biological sequence analysis, and alignment. This paper contains a brief review of molecular biology, followed by a review of the applications of signal processing theory. This includes the problem of gene finding using digital filtering, and the use of transform domain methods in the study of protein binding spots. The relatively new topic of noncoding genes, and the associated problem of identifying ncRNA buried in DNA sequences, are also described. This includes a discussion of hidden Markov models and context-free grammars. Several new directions in genomic signal processing are briefly outlined at the end.

    Effective ambiguity checking in biosequence analysis

    BACKGROUND: Ambiguity is a problem in biosequence analysis that arises in various analysis tasks solved via dynamic programming, and in particular, in the modeling of families of RNA secondary structures with stochastic context-free grammars. Several types of analysis are invalidated by the presence of ambiguity. As this problem inherits undecidability (as we show here) from the corresponding problem for context-free languages, there is no complete algorithmic solution to the problem of ambiguity checking. RESULTS: We explain frequently observed sources of ambiguity, and show how to avoid them. We suggest four testing procedures that may help to detect ambiguity when present, including a just-in-time test that makes it possible to work safely with a potentially ambiguous grammar. We introduce, for the special case of stochastic context-free grammars and RNA structure modeling, an automated partial procedure for proving non-ambiguity. It is used to demonstrate non-ambiguity for several relevant grammars. CONCLUSION: Our mechanical proof procedure and our testing methods provide a powerful arsenal of methods to ensure non-ambiguity.

    Network Analysis with Stochastic Grammars

    Digital forensics requires significant manual effort to identify items of evidentiary interest from the ever-increasing volume of data in modern computing systems. One of the tasks digital forensic examiners conduct is mentally extracting and constructing insights from unstructured sequences of events. This research assists examiners with the association and individualization analysis processes that make up this task by developing a Stochastic Context-Free Grammar (SCFG) knowledge representation for digital forensics analysis of computer network traffic. SCFG is leveraged to provide context to the low-level data collected as evidence and to build behavior profiles. Upon discovering patterns, the analyst can begin the association or individualization process to answer criminal investigative questions. Three contributions resulted from this research. First, domain characteristics suitable for SCFG representation were identified and a step-by-step approach to adapt SCFG to novel domains was developed. Second, a novel iterative graph-based method of identifying similarities in context-free grammars was developed to compare behavior patterns represented as grammars. Finally, the SCFG capabilities were demonstrated in performing association and individualization, reducing the suspect pool and the volume of evidence to examine in a computer network traffic analysis use case.

    Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints

    BACKGROUND: We are interested in the problem of predicting secondary structure for small sets of homologous RNAs, by incorporating limited comparative sequence information into an RNA folding model. The Sankoff algorithm for simultaneous RNA folding and alignment is a basis for approaches to this problem. There are two open problems in applying a Sankoff algorithm: development of a good unified scoring system for alignment and folding and development of practical heuristics for dealing with the computational complexity of the algorithm. RESULTS: We use probabilistic models (pair stochastic context-free grammars, pairSCFGs) as a unifying framework for scoring pairwise alignment and folding. A constrained version of the pairSCFG structural alignment algorithm was developed which assumes knowledge of a few confidently aligned positions (pins). These pins are selected based on the posterior probabilities of a probabilistic pairwise sequence alignment. CONCLUSION: Pairwise RNA structural alignment improves on structure prediction accuracy relative to single sequence folding. Constraining on alignment is a straightforward method of reducing the runtime and memory requirements of the algorithm. Five practical implementations of the pairwise Sankoff algorithm – this work (Consan), David Mathews' Dynalign, Ian Holmes' Stemloc, Ivo Hofacker's PMcomp, and Jan Gorodkin's FOLDALIGN – have comparable overall performance with different strengths and weaknesses.