
    Big data analytics in computational biology and bioinformatics

    Big data analytics in computational biology and bioinformatics refers to an array of operations including biological pattern discovery, classification, prediction, inference, clustering as well as data mining in the cloud, among others. This dissertation addresses big data analytics by investigating two important operations, namely pattern discovery and network inference. The dissertation starts by focusing on biological pattern discovery at a genomic scale. Research reveals that the secondary structure in non-coding RNA (ncRNA) is more conserved during evolution than its primary nucleotide sequence. Using a covariance model approach, the stems and loops of an ncRNA secondary structure are represented as a statistical image against which an entire genome can be efficiently scanned for matching patterns. The covariance model approach is then further extended, in combination with a structural clustering algorithm and a random forests classifier, to perform genome-wide search for similarities in ncRNA tertiary structures. The dissertation then presents methods for gene network inference. Vast bodies of genomic data containing gene and protein expression patterns are now available for analysis. One challenge is to apply efficient methodologies to uncover more knowledge about the cellular functions. Very little is known concerning how genes regulate cellular activities. A gene regulatory network (GRN) can be represented by a directed graph in which each node is a gene and each edge or link is a regulatory effect that one gene has on another gene. By evaluating gene expression patterns, researchers perform in silico data analyses in systems biology, in particular GRN inference, where the “reverse engineering” is involved in predicting how a system works by looking at the system output alone. Many algorithmic and statistical approaches have been developed to computationally reverse engineer biological systems. 
However, there are no known bioinformatics tools capable of performing perfect GRN inference. Here, extensive experiments are conducted to evaluate and compare recent bioinformatics tools for inferring GRNs from time-series gene expression data. Standard performance metrics for these tools, on both simulated and real data sets, are generally low, suggesting that further efforts are needed to develop more reliable GRN inference tools. It is also observed that using multiple tools together can help identify true regulatory interactions between genes, a finding consistent with those reported in the literature. Finally, the dissertation discusses and presents a framework for parallelizing GRN inference methods using Apache Hadoop in a cloud environment.
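
As a rough illustration of two ideas in the abstract above (a GRN as a directed graph, and combining the predictions of several tools), here is a minimal Python sketch. The tool names, gene names and vote threshold are hypothetical and are not taken from the dissertation; simple vote counting stands in for whatever combination scheme the author actually evaluated.

```python
from collections import Counter


def consensus_edges(predictions, min_votes=2):
    """Keep directed edges (regulator, target) predicted by at least min_votes tools."""
    votes = Counter()
    for edges in predictions.values():
        votes.update(set(edges))  # each tool votes at most once per edge
    return {edge for edge, count in votes.items() if count >= min_votes}


def to_adjacency(edges):
    """Represent the network as a directed graph: regulator -> set of targets."""
    adjacency = {}
    for regulator, target in edges:
        adjacency.setdefault(regulator, set()).add(target)
    return adjacency


# Hypothetical output of three inference tools on the same expression data.
predictions = {
    "tool_a": [("g1", "g2"), ("g2", "g3")],
    "tool_b": [("g1", "g2"), ("g3", "g1")],
    "tool_c": [("g1", "g2"), ("g2", "g3"), ("g2", "g1")],
}

consensus = consensus_edges(predictions)
grn = to_adjacency(consensus)
```

Raising `min_votes` trades recall for precision: fewer edges survive, but each is supported by more tools.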

    A data science approach to pattern discovery in complex structures with applications in bioinformatics

    Pattern discovery aims to find interesting, non-trivial, implicit, previously unknown and potentially useful patterns in data. This dissertation presents a data science approach for discovering patterns or motifs from complex structures, particularly complex RNA structures. RNA secondary and tertiary structure motifs are very important in biological molecules, which play multiple vital roles in cells. Much work has been done on RNA motif annotation. However, pattern discovery in RNA structure is less studied. In the first part of this dissertation, an ab initio algorithm, named DiscoverR, is introduced for pattern discovery in RNA secondary structures. This algorithm works by representing RNA secondary structures as ordered labeled trees and performs tree pattern discovery using a quadratic-time dynamic programming algorithm. The algorithm is able to identify and extract the largest common substructures from two RNA molecules of different sizes, without prior knowledge of the locations and topologies of these substructures. One application of DiscoverR is to locate RNA structural elements in genomes. Experimental results show that this tool complements the currently used approaches for mining conserved structural RNAs in the human genome. DiscoverR can also be extended to find repeated regions in an RNA secondary structure. Specifically, this extended method is used to detect structural repeats in the 3'-untranslated region of a protein kinase gene.
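
To make the "ordered labeled tree" representation mentioned above concrete, the following Python sketch parses a dot-bracket secondary structure into such a tree. It covers only the representation step. The node labels ('R', 'P', 'U') are an assumption made for illustration, and the code does not reproduce DiscoverR's pattern discovery algorithm.

```python
def dot_bracket_to_tree(structure):
    """Parse a dot-bracket string into an ordered labeled tree.

    Each node is a (label, children) tuple. Labels are illustrative:
    'R' is an artificial root, 'P' a base pair, 'U' an unpaired base.
    Children keep their 5'-to-3' order, which makes the tree ordered.
    """
    root = ("R", [])
    stack = [root]
    for ch in structure:
        if ch == "(":
            node = ("P", [])
            stack[-1][1].append(node)
            stack.append(node)  # later symbols are nested inside this pair
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        elif ch == ".":
            stack[-1][1].append(("U", []))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return root


def size(node):
    """Count the nodes in the tree, including the root."""
    return 1 + sum(size(child) for child in node[1])


tree = dot_bracket_to_tree("((..)).")
```

A tree comparison method can then operate on these nested tuples instead of on the raw string.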

    Design and implementation of a cyberinfrastructure for RNA motif search, prediction and analysis

    RNA secondary and tertiary structure motifs play important roles in cells. However, very few web servers are available for RNA motif search and prediction. In this dissertation, a cyberinfrastructure, named RNAcyber, capable of performing RNA motif search and prediction, is proposed, designed and implemented. The first component of RNAcyber is a web-based search engine, named RmotifDB. This web-based tool integrates an RNA secondary structure comparison algorithm with the secondary structure motifs stored in the Rfam database. With a user-friendly interface, RmotifDB provides the ability to search for ncRNA structure motifs by both structure and sequence. The second component of RNAcyber is an enhanced version of RmotifDB. This enhanced version combines data from multiple sources, incorporates a variety of well-established structure-based search methods, and is integrated with the Gene Ontology. To display RmotifDB’s search results graphically, a software tool, called RSview, is developed. Finally, RNAcyber contains a web-based tool called Junction-Explorer, which employs a data mining method for predicting tertiary motifs in RNA junctions. Specifically, the tool is trained on solved RNA tertiary structures obtained from the Protein Data Bank, and is able to predict the configuration of coaxial helical stacks and families (topologies) in RNA junctions at the secondary structure level. Junction-Explorer employs several algorithms for motif prediction, including a random forest classification algorithm, a pseudoknot removal algorithm, and a feature ranking algorithm based on the Gini impurity measure. A series of experiments including 10-fold cross-validation has been conducted to evaluate the performance of the Junction-Explorer tool. Experimental results demonstrate the effectiveness of the proposed algorithms and the superiority of the tool over existing methods.
The RNAcyber infrastructure is fully operational, with all of its components accessible on the Internet.
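
The Gini impurity measure mentioned in the abstract above can be sketched in a few lines of Python. This is the textbook definition, together with the impurity decrease obtained from a binary split, which is a common basis for ranking features in tree-based classifiers. The label names and the example split are hypothetical, and the code is not taken from Junction-Explorer.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(labels, mask):
    """Impurity decrease when labels are split by a boolean mask.

    A larger decrease means the feature behind the mask separates
    the classes better.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Hypothetical junction family labels and two candidate binary features.
labels = ["A", "A", "B", "B"]
informative = [True, True, False, False]
uninformative = [True, False, True, False]
```

Here the first feature separates the two classes perfectly and removes all impurity, while the second leaves the impurity unchanged.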

    Modulating RNA structure and catalysis: lessons from small cleaving ribozymes

    RNA is a key molecule in life, and comprehending its structure/function relationships is a crucial step towards a more complete understanding of molecular biology. Even though most of the information required for the correct folding of RNA species is contained in their primary sequences, we are as yet unable to accurately predict either their folding pathways or their active tertiary structures. Ribozymes are interesting molecules to study when addressing these questions because any modifications in their structures are often reflected in their catalytic properties. Recent progress in the study of the structures, the folding pathways and the modulation of the small ribozymes derived from natural, self-cleaving RNA motifs has significantly contributed to today’s knowledge in the field.

    A new paradigm for the folding of ribonucleic acids

    Recent findings show the important role of ribonucleic acid (RNA) within the cell, whether in the control of gene expression, the regulation of several homeostatic processes, or the transcription and translation of deoxyribonucleic acid (DNA) into protein. If we wish to understand how the cell works, we first need to understand its components and how they interact, in particular for RNA.
The function of a molecule depends on its three-dimensional (3D) structure. However, experimental determination of RNA 3D structures is very costly. Current computational methods for RNA structure prediction only take into account the classical or canonical base pairs, similar to those found in the celebrated DNA double helix. Here, we improved RNA structure prediction by taking into account all possible types of base pairs, including the so-called non-canonical ones. This is made possible in the context of a new paradigm for the folding of RNA, based on nucleotide cyclic motifs (NCM): basic building blocks for the construction of RNA. Furthermore, we have developed new metrics to quantify the precision of RNA 3D structure prediction methods, given the recent introduction of many such methods. Finally, we have evaluated the predictive power of the latest low-resolution RNA structure probing techniques.

    The Role of Topological Constraints in RNA Tertiary Folding and Dynamics.

    Functional RNA molecules must fold into highly complex three-dimensional (3D) structures and undergo precise structural dynamics in order to carry out their biological functions. However, the principles that govern RNA 3D folding and dynamics remain poorly understood. Recent studies have proposed that topological constraints arising from the basic connectivity and steric properties of RNA secondary structure strongly confine the 3D conformation of RNA junctions and thus may contribute to the specificity of RNA 3D folding and dynamics. Herein, this hypothesis is explored in quantitative detail using a combination of computational heuristic models and the specially developed coarse-grained molecular dynamics model TOPRNA. First, studies of two-way junctions provide new insight into the significance and mechanism of action of topological constraints. It is demonstrated that topological constraints explain the directionality and amplitude of bulge-induced bends, and that long-range tertiary interactions can modify topological constraints by disrupting non-canonical pairing in internal loops. Furthermore, topological constraints are shown to define free energy landscapes that coincide with the distribution of bulge conformations in structural databases and reproduce solution NMR measurements made on bulges. Next, TOPRNA is used to investigate the contributions of topological constraints to tRNA folding and dynamics. Topological constraints strongly constrain tRNA 3D conformation and notably discriminate against formation of non-native tertiary contacts, providing a sequence-independent source of folding specificity. Furthermore, topological constraints are observed to give rise to thermodynamic cooperativity between distinct tRNA tertiary interactions and encode functionally important 3D dynamics. 
Mutant tRNAs with unnatural secondary structures are shown to lack these favorable characteristics, suggesting that topological constraints underlie the evolutionary conservation of tRNA secondary structure. Additional studies of a non-canonical mitochondrial tRNA show that increased topological constraints can reduce the entropic cost of tertiary folding, and that disruptions of topological constraints explain the pathogenicity of an insertion mutation in this tRNA. UV melting experiments verify these findings. Finally, TOPRNA is used to study the topological constraints of the 197-nucleotide Azoarcus Group I ribozyme. It is shown that topological constraints strongly confine this RNA and provide a mechanism for encoding tertiary structure specificity and cooperative hierarchical folding behavior. PhD, Biophysics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/110505/1/amustoe_1.pd

    Modelling the folding pathway of DNA nanostructures

    DNA origami is a robust technique for bottom-up nano-fabrication. It encodes a target shape into uniquely addressable interactions between a set of short 'staple' strands and a long 'scaffold' strand. The mechanisms of self-assembly, particularly regarding kinetics, need to be better understood. Origami design usually relies on optimising the thermodynamic stability of the target structure, and thermal annealing remains the most fool-proof assembly protocol. This work focuses on studying the folding pathway of three types of origami through simulations: a reconfigurable T-junction origami, several traditional origami, and origami with coated scaffolds. The T-junction origami is intended as an economically feasible method of changing the uniqueness of interactions. My contribution to this work is to characterise the basic structural motif through oxDNA, a nucleotide-resolution model of DNA. The thesis then focuses on extending a domain-level model of DNA origami to study several experimental origami designs. We reveal design-dependent free energy barriers using biased simulations and relate this to the observed hysteresis in experiments. We also highlight the role of specific design elements in determining the folding pathway. A novel method of lowering the temperature of error-free assembly using coated scaffolds is then presented, with simulations indicating the existence of an activation barrier. By exposing particular regions of the scaffold, we can lower assembly time and temperature

    Ab initio RNA folding

    RNA molecules are essential cellular machines performing a wide variety of functions for which a specific three-dimensional structure is required. Over the last several years, experimental determination of RNA structures through X-ray crystallography and NMR seems to have reached a plateau in the number of structures resolved each year, but as more and more RNA sequences are being discovered, the need for structure prediction tools to complement experimental data is strong. Theoretical approaches to RNA folding have been developed since the late nineties when the first algorithms for secondary structure prediction appeared. Over the last 10 years a number of prediction methods for 3D structures have been developed, first based on bioinformatics and data mining, and more recently based on a coarse-grained physical representation of the systems. In this review we present the challenges of RNA structure prediction and the main ideas behind bioinformatic approaches and physics-based approaches. We focus on the description of the more recent physics-based phenomenological models and on how they are built to include the specificity of the interactions of RNA bases, whose role is critical in folding. Through examples from different models, we point out the strengths of physics-based approaches, which are able not only to predict equilibrium structures, but also to investigate dynamical and thermodynamical behavior, and the open challenges to include more key interactions ruling RNA folding. Comment: 28 pages, 18 figures

    Algorithms for RNA secondary structure analysis : prediction of pseudoknots and the consensus shapes approach

    Reeder J. Algorithms for RNA secondary structure analysis: prediction of pseudoknots and the consensus shapes approach. Bielefeld (Germany): Bielefeld University; 2007. Our understanding of the role of RNA has undergone a major change in the last decade. Once believed to be only a mere carrier of information and structural component of the ribosomal machinery, RNA is now, with the advent of the genomic age, clearly seen to play a much more active role. RNAs can act as regulators and can have catalytic activity, roles previously only attributed to proteins. There is still much speculation in the scientific community as to what extent RNAs are responsible for the complexity in higher organisms, which can hardly be explained with only proteins as regulators. In order to investigate the roles of RNA, it is therefore necessary to search for new classes of RNA. For those and already known classes, analyses of their presence in different species of the tree of life will provide further insight into the evolution of biomolecules and especially RNAs. Since RNA function often follows its structure, computer programs for RNA structure prediction are an inherent part of this procedure. The secondary structure of RNA, the level of base pairing, strongly determines the tertiary structure. As the latter is computationally intractable and experimentally expensive to obtain, secondary structure analysis has become an accepted substitute. In this thesis, I present two new algorithms (and a few variations thereof) for the prediction of RNA secondary structures. The first algorithm addresses the problem of predicting a secondary structure from a single sequence including RNA pseudoknots. Pseudoknots have been shown to be functionally relevant in many RNA-mediated processes. However, pseudoknots are excluded from consideration by state-of-the-art RNA folding programs for reasons of computational complexity.
While folding a sequence of length n into unknotted structures requires O(n^3) time and O(n^2) space, finding the best structure including arbitrary pseudoknots has been proven to be NP-complete. Nevertheless, I demonstrate in this work that certain types of pseudoknots can be included in the folding process with only a moderate increase of computational cost. In analogy to protein coding RNA, where a conserved encoded protein hints at a similar metabolic function, structural conservation in RNA may give clues to RNA function and to finding of RNA genes. However, structure conservation is more complex to deal with computationally than sequence conservation. The method considered to be at least conceptually the ideal approach in this situation is the Sankoff algorithm. It simultaneously aligns two sequences and predicts a common secondary structure. Unfortunately, it is computationally rather expensive - O(n^6) time and O(n^4) space for two sequences, and for more than two sequences it becomes exponential in the number of sequences! Therefore, several heuristic implementations emerged in the last decade trying to make the Sankoff approach practical by introducing pragmatic restrictions on the search space. In this thesis, I propose to redefine the consensus structure prediction problem in a way that does not imply a multiple sequence alignment step. For a family of RNA sequences, my method explicitly and independently enumerates the near-optimal abstract shape space and predicts an abstract shape as the consensus for all sequences. For each sequence, it delivers the thermodynamically best structure which has this shape. The technique of abstract shapes analysis is employed here for a synoptic view of the suboptimal folding space. As the shape space is much smaller than the structure space, and identification of common shapes can be done in linear time (in the number of shapes considered), the method is essentially linear in the number of sequences. 
Evaluations show that the new method compares favorably with available alternatives.
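
The O(n^3) time and O(n^2) space bounds quoted above for folding into unknotted structures can be seen in a simple dynamic program. The Python sketch below uses the classic Nussinov-style recursion, which maximizes the number of nested base pairs rather than minimizing free energy. It is not the thesis's algorithm and it does not handle pseudoknots. The minimum hairpin loop length of 3 and the inclusion of G-U wobble pairs are assumptions made for illustration.

```python
def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style recursion).

    The table dp has n*n entries (O(n^2) space), and each entry scans
    up to n split points (O(n^3) time overall).
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in allowed:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


hairpin_pairs = max_pairs("GGGAAACCC")
```

Because the recursion only combines intervals that are nested or side by side, crossing pairs (pseudoknots) are never considered, which is what keeps the computation polynomial.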