907 research outputs found

    Computational Methods for Comparative Non-coding RNA Analysis: From Structural Motif Identification to Genome-wide Functional Classification

    Recent advances in biological research show that many ribonucleic acids (RNAs) are transcribed from the genome to perform a variety of cellular functions, rather than merely acting as information carriers for protein synthesis. These RNAs are usually referred to as non-coding RNAs (ncRNAs). The versatile regulatory mechanisms and functions of ncRNAs contribute to the remarkable complexity of biological systems. ncRNAs perform their biological functions by folding into specific structures, so the comparative study of ncRNA structures is key to inferring their molecular and cellular functions. We are especially interested in two computational problems for the comparative analysis of ncRNA structures: the alignment of ncRNA structures and their classification. Specifically, we aim to develop algorithms to align and cluster RNA structural motifs (recurrent RNA 3D fragments), as well as RNA secondary structures. A thorough understanding of RNA structural motifs will help us to decompose large RNA 3D structures into functional modules, which can significantly facilitate the analysis of their detailed molecular functions. In turn, efficient alignment and clustering of RNA secondary structures will provide insights into ncRNA expression and interaction on a genomic scale. In this dissertation, we present a suite of computational algorithms and software packages to solve the RNA structural motif alignment and clustering problem, as well as the RNA secondary structure alignment and clustering problem. The contributions of this dissertation are summarized as follows. (1) We developed RNAMotifScan for comparing and searching RNA structural motifs. Recent studies have shown that RNA structural motifs play an essential role in RNA folding and interaction with other molecules. Computational identification and analysis of RNA structural motifs remain challenging tasks.
Existing motif identification methods based on 3D structure may not properly compare motifs with high structural variation. We present a novel RNA structural alignment method for RNA structural motif identification, RNAMotifScan, which takes into consideration the isosteric (both canonical and non-canonical) base-pairs and multi-pairings in RNA structural motifs. The utility and accuracy of RNAMotifScan are demonstrated by searching for Kink-turn, C-loop, Sarcin-ricin, Reverse Kink-turn and E-loop motifs against a 23S rRNA (PDB ID: 1S72), which is well characterized for the occurrences of these motifs. (2) We improved upon RNAMotifScan by incorporating base-stacking information and devising a new branch-and-bound algorithm called RNAMotifScanX. Model-based search of RNA structural motifs has focused on finding instances with similar 3D geometry and base-pairing patterns. Although these methods have successfully identified many of the true motif instances, each of them has its own limitations, and their accuracy and sensitivity can be further improved. We introduce a novel approach to modeling RNA structural motifs that incorporates both base-pairing and base-stacking information. We also develop a new algorithm to search for known motif instances that considers both kinds of information. Benchmarking of RNAMotifScanX on searching known RNA structural motifs, including kink-turn, C-loop, sarcin-ricin, reverse kink-turn, and E-loop, clearly shows improved performance compared to its predecessor RNAMotifScan and other state-of-the-art RNA structural motif search tools. (3) We develop an RNA structural motif clustering and de novo identification pipeline called RNAMSC. RNA structural motifs are the building blocks of complex RNA architecture. Identification of non-coding RNA structural motifs is a critical step towards understanding their structures and functions.
We present a clustering approach for de novo RNA structural motif identification. We applied our approach to a data set containing 5S, 16S and 23S rRNAs and rediscovered many known motifs, including GNRA tetraloop, kink-turn, C-loop, sarcin-ricin, reverse kink-turn, hook-turn, E-loop and tandem-sheared motifs, with higher accuracy than the current state-of-the-art clustering method. More importantly, several novel structural motif families were revealed by our clustering analysis. (4) We propose an improved RNA structural clustering pipeline that takes into account the length-dependent distribution of the structural similarity measure. We also devise a more efficient and robust CLique finding CLustering algorithm (CLCL) to replace the traditional hierarchical clustering approach. A benchmark of the proposed pipeline on Rfam data clearly demonstrates a performance gain of over 10% compared to a traditional hierarchical clustering pipeline. We applied this new computational pipeline to cluster the post-transcriptional control elements in fly 3'-UTRs. The ncRNA elements in the 3' untranslated regions (3'-UTRs) are known to participate in the genes' post-transcriptional regulation, such as their stability, translation efficiency, and subcellular localization. Inferring co-expression patterns of the genes by clustering their 3'-UTR ncRNA elements will provide valuable knowledge for further studies of their functions and interactions under specific physiological processes. (5) We develop an ultra-efficient RNA secondary structure alignment algorithm, ERA, using a sparse dynamic programming technique. Recent advances in next-generation sequencing technology have revealed a large number of unannotated RNA transcripts. Comparative study of the RNA structurome is an important approach to assessing the biological functions of these RNA transcripts.
Due to the large sizes and abundance of the RNA transcripts, an efficient and accurate RNA structure-structure alignment algorithm is urgently needed to facilitate the comparative study. By using the sparse dynamic programming technique, we devised a new alignment algorithm that is as efficient as the tree-based alignment algorithms and as accurate as the general edit-distance alignment algorithms. We implemented the new algorithm in a program called ERA (Efficient RNA Alignment). Benchmark results indicate that ERA can significantly speed up RNA structure-structure alignments compared to other state-of-the-art RNA alignment tools, while maintaining high alignment accuracy. These novel algorithms have led to the discovery of many novel RNA structural motif instances, which have significantly deepened our understanding of RNA molecular functions. The genome-wide clustering of ncRNA elements in fly 3'-UTRs has predicted a cluster of genes that are responsible for the spermatogenesis process. More importantly, these genes are very likely to be co-regulated by their common 3'-UTR elements. We anticipate that these algorithms and the corresponding software tools will significantly promote comparative ncRNA research in the future.
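
The abstract above names a clique-finding clustering algorithm (CLCL) as a replacement for hierarchical clustering but does not describe how it works. The following is only a generic sketch of the underlying idea, clustering by repeatedly extracting a greedy clique from a thresholded similarity graph; it is not the dissertation's CLCL algorithm. The function name `greedy_clique_clusters`, the toy similarity function and the threshold are all assumptions made for illustration.

```python
def greedy_clique_clusters(items, similarity, threshold):
    """Cluster items by repeatedly peeling a greedy clique off a similarity graph.

    Illustrative only: a generic clique-based scheme, not the CLCL algorithm.
    Two items are connected when similarity(a, b) >= threshold.
    """
    n = len(items)
    # Build the thresholded similarity graph as adjacency sets over indices.
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(items[i], items[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)

    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Seed with the unassigned node that has the most unassigned neighbours
        # (ties go to the lowest index because the candidates are sorted).
        seed = max(sorted(unassigned), key=lambda v: len(adj[v] & unassigned))
        clique = [seed]
        candidates = adj[seed] & unassigned
        while candidates:
            # Add the candidate connected to the most other candidates, then
            # keep only candidates adjacent to every member chosen so far.
            nxt = max(sorted(candidates), key=lambda v: len(adj[v] & candidates))
            clique.append(nxt)
            candidates = candidates & adj[nxt]
        clusters.append([items[v] for v in clique])
        unassigned -= set(clique)
    return clusters


# Toy usage: numbers stand in for structures, negative distance for similarity.
toy_clusters = greedy_clique_clusters(
    [1.0, 1.1, 1.2, 5.0, 5.1, 9.0],
    similarity=lambda a, b: -abs(a - b),
    threshold=-0.25,
)
```

Unlike single-linkage hierarchical clustering, every pair of members in a cluster produced this way is directly similar, which is one plausible reason to prefer cliques when the similarity measure is noisy.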

    Context based bioinformatics

    The goal of bioinformatics is to develop innovative and practical methods and algorithms for biological questions. In many cases, these questions are driven by new biotechnological techniques, especially by genome-wide and cell-wide high-throughput experimental studies. In principle there are two approaches: 1. Reduction and abstraction of the question to a clearly defined optimization problem, which can be solved with appropriate and efficient algorithms. 2. Development of context-based methods, incorporating as much contextual knowledge as possible in the algorithms, and derivation of practical solutions for relevant biological questions on the high-throughput data. These methods can often be supported by appropriate software tools and visualizations, allowing for interactive evaluation of the results by experts. Context-based methods are often much more complex and require more involved algorithmic techniques to obtain practically relevant and efficient solutions for real-world problems, as in many cases even the simplified abstraction of a problem results in NP-hard problem instances. In many cases, solving these complex problems requires efficient data structures and heuristic search methods that solve clearly defined sub-problems using efficient (polynomial) optimization (such as dynamic programming, greedy, path or tree algorithms). In this thesis, we present new methods and analyses addressing open questions of bioinformatics from different contexts by incorporating the corresponding contextual knowledge. The two main contexts in this thesis are the protein structure similarity context (Part I) and the network-based interpretation of high-throughput data (Part II). For the protein structure similarity context (Part I), we analyze the consistency of gold standard structure classification systems and derive a consistent benchmark set usable for different applications.
We introduce two methods (Vorolign, PPM) for the protein structure similarity recognition problem, based on different features of the structures. Derived from the idea and results of Vorolign, we introduce the concept of a contact neighborhood potential, aiming to improve the results of protein fold recognition and threading. For the re-scoring problem of predicted structure models we introduce the method Vorescore, which clearly improves the fold-recognition performance and enables the evaluation of the contact neighborhood potential for structure prediction methods in general. We introduce a contact-consistent Vorolign variant, ccVorolign, which further improves the structure-based fold recognition performance and enables direct optimization of the neighborhood potential in the future. Due to the enforcement of contact consistency, the ccVorolign method has much higher computational complexity than the polynomial Vorolign method, which is the cost of computing interpretable and consistent alignments. Finally, we introduce a novel structural alignment method (PPM) enabling the explicit modeling and handling of phenotypic plasticity in protein structures. We employ PPM to analyze the effects of alternative splicing on protein structures. With the help of PPM we test the hypothesis that splice isoforms of the same protein can lead to protein structures with different folds (fold transitions). In Part II of the thesis we present methods for generating and using context information for the interpretation of high-throughput experiments. For the generation of context information on molecular regulations we introduce novel text-mining approaches that extract relations automatically from scientific publications.
In addition to the fast NER (named entity recognition) method (syngrep) we also present a novel, fully ontology-based, context-sensitive method (SynTree) that allows context-specific disambiguation of ambiguous synonyms and results in much better identification performance. This context information is important for the interpretation of high-throughput data, but is often missing in current databases. Despite all improvements, the results of automated text-mining methods are error-prone. The RelAnn application presented in this thesis helps to curate the automatically extracted regulations, enabling manual and ontology-based curation and annotation. To use high-throughput data one needs additional methods for data processing, for example methods to map hundreds of millions of short DNA/RNA fragments (so-called reads) to a reference genome or transcriptome. Such data (RNA-seq reads) are the output of next-generation sequencing methods measured by sequencing machines, which are becoming more and more efficient and affordable. Unlike current state-of-the-art methods, our novel read-mapping method ContextMap resolves the occurring ambiguities at the final step of the mapping process, thereby employing the knowledge of the complete set of possible ambiguous mappings. This approach allows for higher precision, even if more nucleotide errors are tolerated in the read mappings in the first step. The consistency between context information on molecular regulations (stored in databases and extracted by text mining) and the measured data can be used to identify and score consistent regulations (GGEA). This method substantially extends the commonly used gene-set-based methods such as over-representation analysis (ORA) and gene set enrichment analysis (GSEA).
Finally, we introduce the novel method RelExplain, which uses the extracted contextual knowledge and generates network-based and testable hypotheses for the interpretation of high-throughput data.
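
The abstract above contrasts GGEA with gene-set-based methods such as over-representation analysis (ORA). As background, ORA is commonly formulated as an upper-tail hypergeometric test; the sketch below illustrates that standard formulation only and is not code from the thesis. The function name `ora_p_value` and all numbers in the usage line are assumptions chosen for illustration.

```python
from math import comb


def ora_p_value(universe_size, set_size, hits_total, hits_in_set):
    """Upper-tail hypergeometric p-value P(X >= hits_in_set).

    universe_size: number of genes measured (N)
    set_size:      number of those genes in the gene set (K)
    hits_total:    number of genes called significant (n)
    hits_in_set:   number of significant genes that are in the gene set (k)
    """
    largest_overlap = min(set_size, hits_total)
    # Count the ways to draw i set members and (n - i) non-members, for i >= k.
    favourable = sum(
        comb(set_size, i) * comb(universe_size - set_size, hits_total - i)
        for i in range(hits_in_set, largest_overlap + 1)
    )
    return favourable / comb(universe_size, hits_total)


# Hypothetical numbers: 20000 genes, a set of 100, 500 significant, 10 overlapping.
example_p = ora_p_value(20000, 100, 500, 10)
```

This test looks only at set membership counts, which is the limitation the abstract points to: it ignores the regulatory relations between genes that network-based methods take into account.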

    The identification and characterisation of the causative gene mutation for keratolytic winter erythema (KWE) in South African families

    A thesis submitted to the Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Doctor of Philosophy. Johannesburg, 2017. Keratolytic winter erythema (KWE) is a rare autosomal dominant skin disorder characterized by recurrent episodes of palmoplantar erythema and epidermal peeling, with symptoms that worsen in winter. KWE is relatively common in South African (SA) Afrikaners and was mapped to 8p23.1-p22 through a common haplotype in SA families. The aim of this study was to identify and characterize the causal mutation for KWE in SA families. Targeted resequencing of 8p23.1-p22 was performed in three families and seven unrelated controls. Reads were aligned to the reference genome using BWA. GATK and Pindel were used to call small and large structural variants, respectively. A 7.67 kb tandem duplication was identified upstream of the CTSB gene, encompassing an enhancer element that is active in keratinocytes (based on H3K27ac data). The tandem duplication segregated completely with KWE. It overlaps with a 15.93 kb tandem duplication identified in two Norwegian families at a 2.62 kb region encompassing the active enhancer, suggesting that duplication of the enhancer leads to the KWE phenotype. Existing chromatin structure, CTCF binding and chromatin interaction data from several cell lines, including keratinocytes, were analysed, and three potential topological subdomains were identified, all containing the enhancer and CTSB, or CTSB and FDFT1, or both genes and NEIL2. Additionally, we showed that the enhancer's activity correlated with CTSB expression, but not with FDFT1 and NEIL2 expression, in differentiating keratinocytes and other cell lines. RNA polymerase II ChIA-PET interaction data in cancer cell lines showed that the enhancer interacts with CTSB but not with FDFT1 or NEIL2. These data suggest that the enhancer normally regulates CTSB expression.
Relative gene expression and immunohistochemistry from palmar biopsies from South African and Norwegian participants (7 affected and 7 controls) showed significantly higher expression of CTSB, but not of FDFT1 and NEIL2, in affected individuals compared to controls, and showed that CTSB was significantly more abundant in the granular layer of affected individuals compared to controls. We conclude that the enhancer duplication causes KWE by upregulating CTSB expression and causing an overabundance of CTSB in the granular layer of the epidermis.

    Single-molecule experiments in biological physics: methods and applications

    I review single-molecule experiments (SME) in biological physics. Recent technological developments have provided the tools to design and build scientific instruments of high enough sensitivity and precision to manipulate and visualize individual molecules and measure microscopic forces. Using SME it is possible to manipulate molecules one at a time and measure distributions describing molecular properties; characterize the kinetics of biomolecular reactions; and detect molecular intermediates. SME provide additional information about the thermodynamics and kinetics of biomolecular processes, which complements the information obtained in traditional bulk assays. In SME it is also possible to measure small energies and detect large Brownian deviations in biomolecular reactions, thereby offering new methods and systems to scrutinize the basic foundations of statistical mechanics. This review is written at a very introductory level, emphasizing the importance of SME to scientists interested in knowing the common playground of ideas and the interdisciplinary topics accessible by these techniques. The review discusses SME from an experimental perspective, first presenting the most common experimental methodologies and then various molecular systems where such techniques have been applied. I briefly discuss experimental techniques such as atomic-force microscopy (AFM), laser optical tweezers (LOT), magnetic tweezers (MT), biomembrane force probe (BFP) and single-molecule fluorescence (SMF). I then present several applications of SME to the study of nucleic acids (DNA, RNA and DNA condensation) and proteins (protein-protein interactions, protein folding and molecular motors). Finally, I discuss applications of SME to the study of the nonequilibrium thermodynamics of small systems and the experimental verification of fluctuation theorems. I conclude with a discussion of open questions and future perspectives. Comment: LaTeX, 60 pages, 12 figures, Topical Review for J. Phys. C (Cond. Matt

    Expression and structural studies of multidomain proteins and complexes

    It is generally accepted that there is a level of organization in proteins that overlaps the classical definitions of tertiary and quaternary structure, i.e. sequentially consecutive residues in polypeptide chains fold into distinct compact regions called domains. Many multidomain proteins are flexible and are not amenable to X-ray crystallography or are too big for multidimensional nuclear magnetic resonance techniques, while other proteins form oligomeric structures from subunits. Using small-angle X-ray and neutron scattering, coupled with molecular modelling techniques, it is possible to locate the positions of these domains or subunits relative to each other within the full protein structure. This PhD thesis examined a variety of native and recombinant oligomeric proteins and domains, and attempts were made to produce low-resolution structures of their oligomerisation or their multidomain structures. Expression systems used include a Pseudomonas aeruginosa over-expression system and the baculovirus expression system. One multidomain protein was studied, namely factor I of the complement system. Two forms of factor I were studied: a native form purified from human plasma, and a recombinant form produced in insect cells. Scattering modelling was used to elucidate a bilobal domain arrangement in factor I, in which the different types of carbohydrate present on the two different forms could be modelled. The quaternary structures of two complexes were determined, namely the homo-oligomeric complexes of the Ps. aeruginosa amidase regulatory protein, AmiC, and the Mycobacterium leprae Holliday junction protein, RuvA. It was determined that in solution AmiC exists as a monomer-trimer equilibrium, and that RuvA adopts an octameric structure, both when free and when complexed with DNA, within which the Holliday junction is buried in the RuvA-DNA complex.

    PROTEIN FUNCTION, DIVERSITY AND FUNCTIONAL INTERPLAY

    Functional annotation of novel or unknown proteins is one of the central problems in post-genomics bioinformatics research. With the vast expansion of genomic and proteomic data and technologies over the last decade, the development of automated function prediction (AFP) methods for large-scale identification of protein function has become imperative in many respects. In this research, we address two important divergences from the “one protein – one function” concept on which all existing AFP methods are developed