68 research outputs found

    Genome wide search for pseudo knotted non-coding RNAs

    Non-coding RNAs (ncRNAs) are functional RNA molecules that are involved in many biological processes including gene regulation, chromosome replication and RNA modification. Searching genomes using computational methods has become an important asset for prediction and annotation of ncRNAs. To annotate an individual genome for a specific family of ncRNAs, a computational tool is used to scan through the genome and align its sequence segments to some structure model for the ncRNA family. With the recent advances in detecting an ncRNA in the genome, heuristic techniques are designed to perform an accurate search and sequence-structure alignment. This study uses a novel approach for such genome wide search of ncRNAs using the RNATOPS and Infernal software tools, which incorporate heuristic dynamic programming algorithms to carry out the sequence analysis using the profiles of RNA consensus secondary structures. Genome wide search for ncRNAs from thirteen genomes is performed using RNATOPS and Infernal. The training set of ncRNA multiple sequence alignments is prepared from RFAM and homologous genomes are retrieved from the RNASTRAND database. Through the experiments, performance of each tool is analyzed and compared with respect to their ncRNA search accuracies. It is further inferred that Infernal, compared to RNATOPS, is more accurate in detecting an ncRNA in all the thirteen genomes tested

    Accurate classification of RNA structures using topological fingerprints

    While RNAs are well known to possess complex structures, functionally similar RNAs often have little sequence similarity. While the exact size and spacing of base-paired regions vary, functionally similar RNAs have pronounced similarity in the arrangement, or topology, of base-paired stems. Furthermore, predicted RNA structures often lack pseudoknots (a crucial aspect of biological activity), and are only partially correct, or incomplete. A topological approach addresses all of these difficulties. In this work we describe each RNA structure as a graph that can be converted to a topological spectrum (RNA fingerprint). The set of subgraphs in an RNA structure, its RNA fingerprint, can be compared with the fingerprints of other RNA structures to identify and correctly classify functionally related RNAs. Topologically similar RNAs can be identified even when a large fraction, up to 30%, of the stems are omitted, indicating that highly accurate structures are not necessary. We investigate the performance of the RNA fingerprint approach on a set of eight highly curated RNA families, with diverse sizes and functions, containing pseudoknots, and with little sequence similarity: an especially difficult test set. In spite of the difficult test set, the RNA fingerprint approach is very successful (ROC AUC > 0.95). Due to the inclusion of pseudoknots, the RNA fingerprint approach covers a wider range of possible structures than methods based only on secondary structure, and its tolerance for incomplete structures suggests that it can be applied even to predicted structures. Source code is freely available at https://github.rcac.purdue.edu/mgribsko/XIOS_RNA_fingerprint

    From RNA folding to inverse folding: a computational study: Folding and design of RNA molecules

    Since the discovery of the structure of DNA in the early 1950s and its double-chained complement of information hinting at its means of replication, biologists have recognized the strong connection between molecular structure and function. In the past two decades, there has been a surge of research on an ever-growing class of RNA molecules that are non-coding but whose various folded structures allow a diverse array of vital functions. From the well-known splicing and modification of ribosomal RNA, non-coding RNAs (ncRNAs) are now known to be intimately involved in possibly every stage of DNA transcription and protein translation, as well as RNA signalling and gene regulation processes. Despite the rapid development and declining cost of modern molecular methods, they typically can only describe ncRNA's structural conformations in vitro, which differ from their in vivo counterparts. Moreover, it is estimated that only a tiny fraction of known ncRNAs has been documented experimentally, often at a high cost. There is thus a growing realization that computational methods must play a central role in the analysis of ncRNAs. Not only do computational approaches hold the promise of rapidly characterizing many ncRNAs yet to be described, but there is also the hope that by understanding the rules that determine their structure, we will gain better insight into their function and design. Many studies revealed that the ncRNA functions are performed by high-level structures that often depend on their low-level structures, such as the secondary structure. This thesis studies the computational folding mechanism and inverse folding of ncRNAs at the secondary level. In this thesis, we describe the development of two bioinformatic tools that have the potential to improve our understanding of RNA secondary structure.
These tools are as follows: (1) RAFFT for efficient prediction of pseudoknot-free RNA folding pathways using the fast Fourier transform (FFT); (2) aRNAque, an evolutionary algorithm inspired by Lévy flights for RNA inverse folding with or without pseudoknots (a secondary structure motif that often poses difficulties for bio-computational detection). The first tool, RAFFT, implements a novel heuristic to predict RNA secondary structure formation pathways that has two components: (i) a folding algorithm and (ii) a kinetic ansatz. When considering the best prediction in the ensemble of 50 secondary structures predicted by RAFFT, its performance matches the recent deep-learning-based structure prediction methods. RAFFT also acts as a folding kinetic ansatz, which we tested on two RNAs: the CFSE and a classic bi-stable sequence. In both test cases, fewer structures were required to reproduce the full kinetics, whereas known methods (such as Treekin) required a sample of 20,000 structures or more. The second tool, aRNAque, implements an evolutionary algorithm (EA) inspired by the Lévy flight, which allows both local and global search and supports pseudoknotted target structures. The number of point mutations at every step of aRNAque's EA is drawn from a Zipf distribution. Therefore, our proposed method increases the diversity of designed RNA sequences and reduces the average number of evaluations of the evolutionary algorithm. The overall performance showed improved empirical results compared to existing tools through intensive benchmarks on both pseudoknotted and pseudoknot-free datasets. In conclusion, we highlight some promising extensions of the versatile RAFFT method to RNA-RNA interaction studies. We also provide an outlook on both tools' implications in studying evolutionary dynamics
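    The abstract above states that the number of point mutations per step is drawn from a Zipf distribution. A minimal sketch of that idea follows. It is not aRNAque's code: the function names, the exponent value 1.5, and the truncation of the distribution at the sequence length are all assumptions made for illustration.

```python
import random

def zipf_mutation_count(max_mutations, exponent=1.5, rng=random):
    """Draw k in 1..max_mutations with probability proportional to k**-exponent.

    Truncating at max_mutations and the default exponent are illustrative
    assumptions, not values taken from the thesis.
    """
    ks = list(range(1, max_mutations + 1))
    weights = [k ** -exponent for k in ks]
    return rng.choices(ks, weights=weights, k=1)[0]

def mutate(sequence, rng=random, exponent=1.5):
    """Apply a Zipf-distributed number of point mutations to an RNA sequence."""
    n = zipf_mutation_count(len(sequence), exponent, rng)
    positions = rng.sample(range(len(sequence)), n)  # distinct positions
    bases = list(sequence)
    for pos in positions:
        # Always substitute a different base so each mutation is a real change.
        bases[pos] = rng.choice([b for b in "ACGU" if b != bases[pos]])
    return "".join(bases), n
```

    Most draws give one or two mutations (local search), while the heavy tail occasionally changes many positions at once (a long jump), which is the Lévy-flight behaviour the abstract refers to.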

    Tree Diet: Reducing the Treewidth to Unlock FPT Algorithms in RNA Bioinformatics

    Hard graph problems are ubiquitous in Bioinformatics, inspiring the design of specialized Fixed-Parameter Tractable algorithms, many of which rely on a combination of tree-decomposition and dynamic programming. The time/space complexities of such approaches hinge critically on low values for the treewidth tw of the input graph. In order to extend their scope of applicability, we introduce the Tree-Diet problem, i.e. the removal of a minimal set of edges such that a given tree-decomposition can be slimmed down to a prescribed treewidth tw'. Our rationale is that the time gained thanks to a smaller treewidth in a parameterized algorithm compensates the extra post-processing needed to take deleted edges into account. Our core result is an FPT dynamic programming algorithm for Tree-Diet, using 2^{O(tw)}n time and space. We complement this result with parameterized complexity lower-bounds for stronger variants (e.g., NP-hardness when tw' or tw-tw' is constant). We propose a prototype implementation for our approach which we apply on difficult instances of selected RNA-based problems: RNA design, sequence-structure alignment, and search of pseudoknotted RNAs in genomes, revealing very encouraging results. This work paves the way for a wider adoption of tree-decomposition-based algorithms in Bioinformatics

    Data mining in computational proteomics and genomics

    This dissertation addresses data mining in bioinformatics by investigating two important problems, namely peak detection and structure matching. Peak detection is useful for biological pattern discovery while structure matching finds many applications in clustering and classification. The first part of this dissertation focuses on elastic peak detection in 2D liquid chromatographic mass spectrometry (LC-MS) data used in proteomics research. These data can be modeled as a time series, in which the X-axis represents time points and the Y-axis represents intensity values. A peak occurs in a set of 2D LC-MS data when the sum of the intensity values in a sliding time window exceeds a user-determined threshold. The elastic peak detection problem is to locate all peaks across multiple window sizes of interest in the dataset. A new method, called PeakID, is proposed in this dissertation, which solves the elastic peak detection problem in 2D LC-MS data without yielding any false negative. PeakID employs a novel data structure, called a Shifted Aggregation Tree or AggTree for short, to find the different peaks in the dataset. This method works by first constructing an AggTree in a bottom-up manner from the dataset, and then searching the AggTree for the peaks in a top-down manner. PeakID uses a state-space algorithm to find the topology and structure of an efficient AggTree. Experimental results demonstrate the superiority of the proposed method over other methods on both synthetic and real-world data. The second part of this dissertation focuses on RNA pseudoknot structure matching and alignment. RNA pseudoknot structures play important roles in many genomic processes. Previous methods for comparative pseudoknot analysis mainly focus on simultaneous folding and alignment of RNA sequences. Little work has been done to align two known RNA secondary structures with pseudoknots taking into account both sequence and structure information of the two RNAs. 
A new method, called RKalign, is proposed in this dissertation for aligning two known RNA secondary structures with pseudoknots. RKalign adopts the partition function methodology to calculate the posterior log-odds scores of the alignments between bases or base pairs of the two RNAs with a dynamic programming algorithm. The posterior log-odds scores are then used to calculate the expected accuracy of an alignment between the RNAs. The goal is to find an optimal alignment with the maximum expected accuracy. RKalign employs a greedy algorithm to achieve this goal. The performance of RKalign is investigated and compared with existing tools for RNA structure alignment. An extension of the proposed method to multiple alignment of pseudoknot structures is also discussed. RKalign is implemented in Java and freely accessible on the Internet. As more and more pseudoknots are revealed, collected and stored in public databases, it is anticipated that a tool like RKalign will play a significant role in data comparison, annotation, analysis, and retrieval in these databases
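    The first part of the abstract above defines a peak as a sliding time window whose summed intensity exceeds a user-determined threshold, and the elastic problem as finding all such windows across several window sizes. The sketch below is a naive baseline for that definition using prefix sums. It is not PeakID and does not use a Shifted Aggregation Tree; the function name and the (start, size, sum) return format are assumptions for illustration.

```python
from itertools import accumulate

def elastic_peaks(intensities, window_sizes, threshold):
    """Return (start_index, window_size, window_sum) for every window whose
    summed intensity exceeds the threshold, for each requested window size.

    Brute-force baseline: it checks every window, so it reports no false
    negatives, but costs O(len(intensities) * len(window_sizes)).
    """
    # prefix[i] holds the sum of the first i intensity values.
    prefix = [0, *accumulate(intensities)]
    peaks = []
    for size in window_sizes:
        for start in range(len(intensities) - size + 1):
            total = prefix[start + size] - prefix[start]
            if total > threshold:
                peaks.append((start, size, total))
    return peaks
```

    For example, `elastic_peaks([1, 5, 6, 1, 0], [1, 2], 5)` reports the single-point window at index 2 and the three two-point windows starting at indices 0, 1 and 2.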

    Graphical methods in RNA structure matching

    Eukaryotic genomes are pervasively transcribed; almost every base can be found in an RNA transcript. This is a surprising observation since most of the genome does not encode proteins. This RNA must serve an important regulatory function – important because producing non-coding RNA is an energy intensive process, and in the absence of strong selection one would expect it to disappear. RNA families with common functions have specifically conserved structural motifs, which are directly related to the functional roles of RNA in catalysis and regulation. Because the conserved structures depend on base-pairing, similar RNA structures may have little or no detectable sequence similarity, making the identification of conserved RNAs difficult. This is a particularly serious problem when studying regulatory structures in RNA. In many cases, such as that of cellular internal ribosome entry sites, although we can identify RNAs that have similar regulatory responses, it is difficult to tell whether the RNAs have common structural features using current methods. Available tools for identifying common structures based on RNA sequence suffer from one or more of the following problems: they do not consider pseudoknots, which are important in many catalytic and regulatory structures; they do not consider near minimum free energy structures, which is important as many RNAs exist as an ensemble of structures of nearly equal energy; they require many examples of known structures in order to train a computational model; they require impractical amounts of computational time, precluding their use on long sequences or genomic scale; or they use a similarity function that cannot identify RNAs as having similar structure, even when they are from one of the well characterized known classes. The approach presented here has the potential to address all of these issues, allowing novel RNA structures that are shared between RNAs with little or no sequence similarity to be discovered. 
This provides a powerful tool to investigate and explain the pervasive transcription observed in eukaryotic genomes

    Thermodynamics of RNA structures by Wang–Landau sampling

    Motivation: Thermodynamics-based dynamic programming RNA secondary structure algorithms have been of immense importance in molecular biology, where applications range from the detection of novel selenoproteins using expressed sequence tag (EST) data, to the determination of microRNA genes and their targets. Dynamic programming algorithms have been developed to compute the minimum free energy secondary structure and partition function of a given RNA sequence, the minimum free-energy and partition function for the hybridization of two RNA molecules, etc. However, the applicability of dynamic programming methods depends on disallowing certain types of interactions (pseudoknots, zig-zags, etc.), as their inclusion renders structure prediction a nondeterministic polynomial time (NP)-complete problem. Nevertheless, such interactions have been observed in X-ray structures

    From Structure Prediction to Genomic Screens for Novel Non-Coding RNAs

    Non-coding RNAs (ncRNAs) are receiving more and more attention not only as an abundant class of genes, but also as regulatory structural elements (some located in mRNAs). A key feature of RNA function is its structure. Computational methods were developed early for folding and prediction of RNA structure with the aim of assisting in functional analysis. With the discovery of more and more ncRNAs, it has become clear that a large fraction of these are highly structured. Interestingly, a large part of the structure is comprised of regular Watson-Crick and GU wobble base pairs. This and the increased amount of available genomes have made it possible to employ structure-based methods for genomic screens. The field has moved from folding prediction of single sequences to computational screens for ncRNAs in genomic sequence using the RNA structure as the main characteristic feature. Whereas early methods focused on energy-directed folding of single sequences, comparative analysis based on structure preserving changes of base pairs has been efficient in improving accuracy, and today this constitutes a key component in genomic screens. Here, we cover the basic principles of RNA folding and touch upon some of the concepts in current methods that have been applied in genomic screens for de novo RNA structures in searches for novel ncRNA genes and regulatory RNA structure on mRNAs. We discuss the strengths and weaknesses of the different strategies and how they can complement each other