738 research outputs found

    Improving the Performance and Precision of Bioinformatics Algorithms

    Recent advances in biotechnology have enabled scientists to generate and collect huge amounts of biological experimental data. Software tools for analyzing both genomic (DNA) and proteomic (protein) data with high speed and accuracy have thus become very important in modern biological research. This thesis presents several techniques for improving the performance and precision of bioinformatics algorithms used by biologists. Improvements in both the speed and cost of automated DNA sequencers have allowed scientists to sequence the DNA of an increasing number of organisms. One way biologists can take advantage of this genomic DNA data is to use it in conjunction with expressed sequence tag (EST) and cDNA sequences to find genes and their splice sites. This thesis describes ESTmapper, a tool designed to use an eager write-only top-down (WOTD) suffix tree to efficiently align DNA sequences against known genomes. Experimental results show that ESTmapper can be much faster than previous techniques for aligning and clustering DNA sequences, and produces alignments of comparable or better quality. Peptide identification by tandem mass spectrometry (MS/MS) is becoming the dominant high-throughput proteomics workflow for protein characterization in complex samples. Biologists currently rely on protein database search engines to identify peptides producing experimentally observed mass spectra. This thesis describes two approaches for improving peptide identification precision using statistical machine learning. HMMatch (HMM MS/MS Match) is a hidden Markov model approach to spectral matching, in which many examples of a peptide fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. Experimental results show that HMMatch can identify many peptides missed by traditional spectral matching and search engines. 
PepArML (Peptide Identification Arbiter by Machine Learning) is a machine learning based framework for improving the precision of peptide identification. It uses classification algorithms to effectively utilize spectra features and scores from multiple search engines in a single model-free framework that can be trained in an unsupervised manner. Experimental results show that PepArML can improve the sensitivity of peptide identification for several synthetic protein mixtures compared with individual search engines
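The idea of indexing a genome once and then looking up query substrings quickly can be illustrated with a much simpler stand-in. The sketch below is not ESTmapper's WOTD suffix tree: it swaps in a naive suffix array with binary search, and the function names and the toy sequence are assumptions made for illustration only.

```python
from bisect import bisect_left


def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction that compares whole suffixes; suitable for toy inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions at which pattern occurs exactly in text."""
    if not pattern:
        return []
    # Truncating sorted suffixes to the pattern length keeps them sorted,
    # so binary search can locate the first candidate match.
    # The prefix list is built eagerly here for clarity, not efficiency.
    prefixes = [text[i:i + len(pattern)] for i in suffix_array]
    lo = bisect_left(prefixes, pattern)
    hits = []
    while lo < len(prefixes) and prefixes[lo] == pattern:
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


# Toy usage (made-up sequence, not real genomic data)
genome = "ACGTACGTTACG"
index = build_suffix_array(genome)
seed_hits = find_occurrences(genome, index, "ACG")
```

A real aligner would use such exact matches only as seeds and then extend them across splice sites; that step is outside this sketch.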

    Bayesian methods for small molecule identification

    Confident identification of small molecules remains a major challenge in untargeted metabolomics, natural product research and related fields. Liquid chromatography-tandem mass spectrometry is a predominant technique for the high-throughput analysis of small molecules and can detect thousands of different compounds in a biological sample. The automated interpretation of the resulting tandem mass spectra is highly non-trivial, and many studies are limited to re-discovering known compounds by searching mass spectra in spectral reference libraries. But these libraries are vastly incomplete, and a large portion of measured compounds remains unidentified. This constitutes a major bottleneck in the comprehensive, high-throughput analysis of metabolomics data. In this thesis, we present two computational methods that address different steps in the identification process of small molecules from tandem mass spectra. ZODIAC is a novel method for de novo (that is, database-independent) molecular formula annotation in complete datasets. It exploits similarities of compounds co-occurring in a sample to find the most likely molecular formula for each individual compound. ZODIAC improves on the currently best-performing method, SIRIUS; on one dataset, the improvement is 16.5-fold. We show that de novo molecular formula annotation is not just a theoretical advantage: we discover multiple novel molecular formulas absent from PubChem, one of the biggest structure databases. Furthermore, we introduce a novel scoring for CSI:FingerID, a state-of-the-art method for searching tandem mass spectra in a structure database. This scoring models dependencies between different molecular properties in a predicted molecular fingerprint via Bayesian networks. This problem has the unusual property that the marginal probabilities differ for each predicted query fingerprint. Thus, we need to apply Bayesian networks in a novel, non-standard fashion. Modeling dependencies improves on the currently best scoring

    Proteome-Wide Prediction of Acetylation Substrates

    Eukaryotic DNA is found packaged with proteins and RNA, forming a substance called chromatin. This packaging is dynamic and regulates access to DNA for essential cellular processes such as transcription, replication, and repair. In recent years, studies have shown that regulated changes in the chemical and physical properties of chromatin often lead to dynamic changes in multiple cellular processes by affecting the accessibility of the DNA. These changes can be brought about in part through post-translational modifications of histone proteins, which act by disrupting chromatin contacts or by recruiting effector proteins to chromatin. Acetylation is one of the well-studied post-translational modifications that has been associated with chromatin-associated processes, notably gene regulation. Many studies have contributed to our knowledge of the enzymology underlying acetylation, including efforts to understand the molecular mechanism of substrate recognition by several acetyltransferases, but traditional experiments to determine intrinsic features of substrate and site specificity have proven challenging. In my thesis work, I hypothesize that the primary amino acid sequence surrounding an acetylated lysine plays a critical role in acetylation site selection, and I ask whether there are sequence preferences that enable a lysine acetyltransferase to recognize target lysines. A computational method was devised to examine this hypothesis, and an experimental approach was taken to test my computationally derived predictions. In Chapter 2, I describe my basic computational methods, using a clustering analysis of protein sequences to predict lysine acetylation based on the sequence characteristics of acetylated lysines within histones. I define a local amino acid sequence composition that represents potential acetylation sites by implementing a clustering analysis of histone and non-histone sequences. 
I demonstrate that this sequence composition has predictive power on two independent experimental datasets of acetylation marks. In Chapter 3, I describe the experimental validation approach used to detect acetylation in histone and non-histone proteins using mass spectrometry. I also report several novel non-histone acetylated substrates in S. cerevisiae. My approach, combined with more traditional experimental methods, may be useful for identifying additional proteins in the acetylome. Finally, in Chapter 4, I describe two bioinformatics approaches: one to predict additional chromatin-associated effector proteins, and another to further understand the evolutionary history and complexity of the Polycomb Group (PcG) proteins in multicellular organisms in order to infer gene expansion, co-evolution, and deletion events
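The notion of a local amino acid sequence composition around a lysine can be made concrete with a short sketch. It assumes a fixed window of `flank` residues on each side of the lysine and a simple residue-frequency summary; the window size, the padding character, the function names and the toy peptide are all illustrative assumptions, not the feature definition used in the thesis.

```python
from collections import Counter


def lysine_windows(sequence: str, flank: int = 3, pad: str = "-") -> list[tuple[int, str]]:
    """Return (1-based position, window) for every lysine (K) in sequence.

    Each window spans `flank` residues on either side of the lysine; positions
    beyond the protein termini are filled with `pad`.
    """
    seq = sequence.upper()
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue == "K":
            # In the padded string the lysine sits at index i + flank,
            # so the window starts at index i.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


def window_composition(window: str, pad: str = "-") -> dict[str, float]:
    """Fraction of each amino acid among the flanking residues of a window."""
    centre = len(window) // 2
    flanking = [aa for j, aa in enumerate(window) if j != centre and aa != pad]
    counts = Counter(flanking)
    total = len(flanking)
    return {aa: n / total for aa, n in counts.items()} if total else {}


# Toy usage with a made-up peptide
sites = lysine_windows("MAKGKQ", flank=2)
first_composition = window_composition(sites[0][1])
```

Composition vectors of this kind could then be passed to any clustering routine; the choice of clustering method is left open here.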

    Alternative Splicing and Protein Structure Evolution

    In recent years, there has been a dramatic increase in available experimental data across many areas of biology. For the first time, these data permit a detailed analysis of how cellular components such as genes and proteins function, of how they are connected in cellular networks, and of their evolutionary history. Bioinformatics in particular plays an important role here in processing the data and interpreting them biologically. This doctoral thesis investigates two important areas of current bioinformatics research: the analysis of protein structure evolution and similarities between protein structures, and the analysis of alternative splicing, an integral process in eukaryotic cells that contributes to functional diversity. In particular, this work introduces the idea of a combined analysis of the two mechanisms (structure evolution and splicing). We show that a combined view yields new insights into how structure evolution and alternative splicing, as well as a coupling of the two mechanisms, contribute to functional and structural complexity in higher organisms. The methods, hypotheses and results presented in this thesis can contribute to our understanding of how structure evolution and alternative splicing operate in the emergence of complex organisms, so that these two traditionally separate areas of bioinformatics can benefit from each other in the future

    HVint: a strategy for identifying novel protein-protein interactions in Herpes Simplex Virus Type 1

    Human herpesviruses are widespread human pathogens with a remarkable impact on worldwide public health. Despite decades of intense research, the molecular details of many aspects of their function remain to be fully characterized. To unravel the details of how these viruses operate, a thorough understanding of the relationships between the involved components is key. Here, we present HVint, a novel protein-protein intra-viral interaction resource for herpes simplex virus type 1 (HSV-1) integrating data from five external sources. To assess each interaction, we used a scoring scheme that takes into consideration aspects such as the type of detection method and the number of lines of evidence. The coverage of the initial interactome was further increased using evolutionary information, by importing interactions reported for other human herpesviruses. These latter interactions therefore constitute computational predictions of potential novel interactions in HSV-1. An independent experimental analysis was performed to confirm a subset of our predicted interactions. This subset covers proteins that contribute to nuclear egress and primary envelopment events, including VP26, pUL31, pUL40 and the recently characterized pUL32 and pUL21. Our findings support a coordinated crosstalk between VP26 and proteins such as pUL31, pUS9 and the CSVC complex, contributing to the development of a model describing the nuclear egress and primary envelopment pathways of newly synthesized HSV-1 capsids. The results are also consistent with recent findings on the involvement of pUL32 in capsid maturation and early tegumentation events. Further, they open the door to new hypotheses on virus-specific regulators of pUS9-dependent transport. To make this repository of interactions readily accessible to the scientific community, we also developed a user-friendly and interactive web interface. 
Our approach demonstrates the power of computational predictions to assist in the design of targeted experiments for the discovery of novel protein-protein interactions
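A scoring scheme that depends on the detection method and on the number of lines of evidence can be sketched in a few lines. The version below is not HVint's actual scheme: it assumes a noisy-OR combination, and the method categories, weights and function name are hypothetical values chosen purely for illustration.

```python
from math import prod

# Hypothetical reliability weights per detection-method category (assumed values).
METHOD_WEIGHTS = {
    "biophysical": 0.9,
    "coimmunoprecipitation": 0.6,
    "yeast_two_hybrid": 0.4,
}
DEFAULT_WEIGHT = 0.2  # assumed fallback for methods not listed above


def interaction_score(evidence_methods: list[str]) -> float:
    """Combine independent lines of evidence with a noisy-OR rule.

    The score is 1 - product(1 - w_i): every additional line of evidence
    raises it, and the result always stays within [0, 1].
    """
    weights = [METHOD_WEIGHTS.get(m, DEFAULT_WEIGHT) for m in evidence_methods]
    return 1.0 - prod(1.0 - w for w in weights)


# Toy usage: one interaction supported by two different methods
score = interaction_score(["yeast_two_hybrid", "coimmunoprecipitation"])
```

With this rule a single high-confidence method can outweigh several weak ones, which is one plausible way to balance method type against evidence count.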

    DEVELOPMENT AND APPLICATION OF MASS SPECTROMETRY-BASED PROTEOMICS TO GENERATE AND NAVIGATE THE PROTEOMES OF THE GENUS POPULUS

    Historically, there has been tremendous synergy between biology and analytical technology, such that one drives the development of the other. Over the past two decades, their interrelatedness has catalyzed entirely new experimental approaches and unlocked new types of biological questions, as exemplified by the advancements of the field of mass spectrometry (MS)-based proteomics. MS-based proteomics, which provides a more complete measurement of all the proteins in a cell, has revolutionized a variety of scientific fields, ranging from characterizing proteins expressed by a microorganism to tracking cancer-related biomarkers. Though MS technology has advanced significantly, the analysis of complicated proteomes, such as those of plants or humans, remains challenging because of the incongruity between the complexity of the biological samples and the analytical techniques available. In this dissertation, analytical methods utilizing state-of-the-art MS instrumentation have been developed to address challenges associated with both qualitative and quantitative characterization of eukaryotic organisms. In particular, these efforts focus on characterizing Populus, a model organism and potential feedstock for bioenergy. The effectiveness of pre-existing MS techniques, initially developed to identify proteins reliably in microbial proteomes, was tested to define the boundaries and characterize the landscape of functional genome expression in Populus. Although these approaches were generally successful, achieving maximal proteome coverage was still limited by a number of factors, including genome complexity, the dynamic range of protein identification, and the abundance of protein variants. To overcome these challenges, improvements were needed in sample preparation, MS instrumentation, and bioinformatics. 
Optimization of experimental procedures and implementation of current state-of-the-art instrumentation afforded the most detailed look into the predicted proteome space of Populus, offering varying proteome perspectives: 1) network-wide, 2) pathway-specific, and 3) protein-level viewpoints. In addition, we implemented two bioinformatic approaches that were capable of decoding the plasticity of the Populus proteome, facilitating the identification of single amino acid polymorphisms and generating a more accurate profile of protein expression. Though the methods and results presented in this dissertation have direct implications in the study of bioenergy research, more broadly this dissertation focuses on developing techniques to contend with the notorious challenges associated with protein characterization in all eukaryotic organisms

    Computationally Comparing Biological Networks and Reconstructing Their Evolution

    Biological networks, such as protein-protein interaction, regulatory, or metabolic networks, provide information about biological function beyond what can be gleaned from sequence alone. Unfortunately, most computational problems associated with these networks are NP-hard. In this dissertation, we develop algorithms to tackle numerous fundamental problems in the study of biological networks. First, we present a system for classifying the binding affinity of peptides to a diverse array of immunoglobulin antibodies. Computational approaches to this problem are integral to virtual screening and modern drug discovery. Our system is based on an ensemble of support vector machines and exhibits state-of-the-art performance. It placed 1st in the 2010 DREAM5 competition. Second, we investigate the problem of biological network alignment. Aligning the biological networks of different species allows for the discovery of shared structures and conserved pathways. We introduce an original procedure for network alignment based on a novel topological node signature. The pairwise global alignments of biological networks produced by our procedure, when evaluated under multiple metrics, are both more accurate and more robust to noise than those of previous work. Next, we explore the problem of ancestral network reconstruction. Knowing the state of ancestral networks allows us to examine how biological pathways have evolved, and how pathways in extant species have diverged from those of their common ancestor. We describe a novel framework for representing the evolutionary histories of biological networks and present efficient algorithms for reconstructing either a single parsimonious evolutionary history, or an ensemble of near-optimal histories. Under multiple models of network evolution, our approaches are effective at inferring the ancestral network interactions. 
Additionally, the ensemble approach is robust to noisy input, and can be used to impute missing interactions in experimental data. Finally, we introduce a framework, GrowCode, for learning network growth models. While previous work focuses on developing growth models manually, or on procedures for learning parameters for existing models, GrowCode learns fundamentally new growth models that match target networks in a flexible and user-defined way. We show that models learned by GrowCode produce networks whose target properties match those of real-world networks more closely than existing models
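Aligning networks by comparing topological node signatures can be illustrated with a deliberately simple stand-in. The signature used below (a node's degree followed by its neighbours' degrees) and the greedy pairing are assumptions for illustration; they are not the novel signature or the alignment procedure described in the dissertation, and the toy graphs are made up.

```python
Graph = dict[str, set[str]]


def node_signature(graph: Graph, node: str) -> tuple[int, ...]:
    """Degree of the node followed by its neighbours' degrees, largest first."""
    neighbour_degrees = sorted((len(graph[n]) for n in graph[node]), reverse=True)
    return (len(graph[node]), *neighbour_degrees)


def signature_distance(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """L1 distance between two signatures after zero-padding the shorter one."""
    width = max(len(a), len(b))
    a_pad = a + (0,) * (width - len(a))
    b_pad = b + (0,) * (width - len(b))
    return sum(abs(x - y) for x, y in zip(a_pad, b_pad))


def greedy_align(g1: Graph, g2: Graph) -> dict[str, str]:
    """Pair nodes of g1 with nodes of g2 in order of increasing signature distance."""
    sig1 = {n: node_signature(g1, n) for n in g1}
    sig2 = {n: node_signature(g2, n) for n in g2}
    candidates = sorted(
        (signature_distance(sig1[u], sig2[v]), u, v) for u in g1 for v in g2
    )
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for _, u, v in candidates:
        # Each node may be matched at most once on either side.
        if u not in mapping and v not in used:
            mapping[u] = v
            used.add(v)
    return mapping


# Two toy networks with the same shape: a hub with three neighbours, one of which has a tail
net1 = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a", "e"}, "e": {"d"}}
net2 = {"x": {"y", "z", "w"}, "y": {"x"}, "z": {"x"}, "w": {"x", "v"}, "v": {"w"}}
alignment = greedy_align(net1, net2)
```

Greedy pairing is fast but can lock in poor early matches; an assignment solver would be the natural next step if optimal matching under the same distance were needed.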

    High throughput prediction of inter-protein coevolution

    Inter-protein co-evolution analysis can reveal direct or indirect functional or physical protein interactions. It compares the correlation of evolutionary changes between residues in aligned orthologous sequences. Modern methods used in experimental cell biology to screen for protein-protein interactions, often based on mass spectrometry, frequently identify a large number of possible interacting proteins. If automated, inter-protein co-evolution analysis can serve as a valuable step in refining such results, which typically contain hundreds of hits, for further experiments. Manual retrieval of tens of orthologous sequences, followed by alignment and phylogenetic tree preparation, is not feasible for this amount of data. The aim of this thesis is to create a set of scripts that automate high-throughput inter-protein co-evolution analysis. The scripts were written in Python. They use an API client to access online databases holding the sequences for the input protein identifiers. Through matched identifiers, over 85 representative orthologous sequences from vertebrate species are retrieved from the OrthoDB orthologue database. The scripts align these sequences with the PRANK MSA algorithm and create the corresponding phylogenetic tree. All protein pairs are structured for multicore computation with the CAPS programme on a CSC supercomputer. Multiple CAPS outputs are condensed into a comprehensive form for comparing relative co-adaptive co-evolution between proposed protein pairs. In this work, I have developed automation for a protein-interactome screen done by proximity labelling of B cell receptor and plasma membrane-associated proteins under activating or non-activating conditions. Applying high-throughput co-evolutionary analysis to these data provides a completely new approach to identifying new players in B cell activation, which is critical for autoimmunity, hypo-immunity or cancer. 
The results showed unsatisfactory performance of CAPS; an explanation and alternatives are given
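The step of structuring all protein pairs for multicore computation can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the round-robin batching and the placeholder identifiers are hypothetical, and no call to CAPS, PRANK or OrthoDB is shown because their interfaces are not described in the abstract.

```python
from itertools import combinations


def protein_pairs(protein_ids: list[str]) -> list[tuple[str, str]]:
    """All unordered pairs of distinct protein identifiers, in sorted order."""
    return list(combinations(sorted(set(protein_ids)), 2))


def split_into_batches(pairs: list[tuple[str, str]], n_workers: int) -> list[list[tuple[str, str]]]:
    """Distribute pairs round-robin so that each worker gets a similar share."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [pairs[i::n_workers] for i in range(n_workers)]


# Toy usage with placeholder identifiers; the duplicate is dropped
pairs = protein_pairs(["P3", "P1", "P2", "P1"])
batches = split_into_batches(pairs, 2)
```

A screen with n distinct hits yields n(n-1)/2 pairs, so a few hundred hits already produce tens of thousands of pairwise jobs, which is why batching across cores matters.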