    Protein interface prediction using graph convolutional networks

    2017 Fall. Includes bibliographical references. Proteins play a critical role in processes both within and between cells, through their interactions with each other and other molecules. Proteins interact via an interface forming a protein complex, which is difficult, expensive, and time consuming to determine experimentally, giving rise to computational approaches. These computational approaches utilize known electrochemical properties of protein amino acid residues in order to predict if they are a part of an interface or not. Prediction can occur in a partner-independent fashion, where amino acid residues are considered independently of their neighbor, or in a partner-specific fashion, where pairs of potentially interacting residues are considered together. Ultimately, prediction of protein interfaces can help illuminate cellular biology, improve our understanding of diseases, and aid pharmaceutical research. Interface prediction has historically been performed with a variety of methods, including docking, template matching, and, more recently, machine learning approaches. The field of machine learning has undergone a revolution of sorts with the emergence of convolutional neural networks as the leading method of choice for a wide swath of tasks. Enabled by large quantities of data and the increasing power and availability of computing resources, convolutional neural networks efficiently detect patterns in grid-structured data and generate hierarchical representations that prove useful for many types of problems. This success has motivated the work presented in this thesis, which seeks to improve upon state-of-the-art interface prediction methods by incorporating concepts from convolutional neural networks. Proteins are inherently irregular, so they don't easily conform to a grid structure, whereas a graph representation is much more natural. Various convolution operations have been proposed for graph data, each geared towards a particular application.
We adapted these convolutions for use in interface prediction, and proposed two new variants. Neural networks were trained on the Docking Benchmark Dataset version 4.0 complexes and tested on the new complexes added in version 5.0. Results were compared against the state-of-the-art partner-specific method, PAIRpred [1]. Results show that multiple variants of graph convolution outperform PAIRpred, with no method emerging as the clear winner. In the future, additional training data may be incorporated from other sources, unsupervised pretraining such as autoencoding may be employed, and a generalization of convolution to simplicial complexes may also be explored. In addition, the various graph convolution approaches may be applied to other applications with graph-structured data, such as Quantitative Structure Activity Relationship (QSAR) learning and knowledge base inference.
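As a rough illustration of what a convolution over a residue graph can look like, the sketch below implements one generic neighbor-averaging layer in plain Python. It is not one of the variants from the thesis, whose details the abstract does not give; the function name `graph_conv`, the weight matrices, and the toy three-residue graph are all assumptions made for this example.

```python
def graph_conv(features, neighbors, w_center, w_neigh):
    """One generic graph convolution layer (illustrative, not the thesis's method).

    features  : list of per-residue feature vectors
    neighbors : dict mapping residue index -> list of neighboring residue indices
    w_center  : weight matrix (in_dim x out_dim) applied to the residue itself
    w_neigh   : weight matrix (in_dim x out_dim) applied to the neighbor mean
    """
    def matvec(vec, weights):
        # Multiply a row vector by an (in_dim x out_dim) matrix.
        return [sum(v * weights[i][j] for i, v in enumerate(vec))
                for j in range(len(weights[0]))]

    out = []
    for node, feat in enumerate(features):
        nbrs = neighbors.get(node, [])
        in_dim = len(feat)
        if nbrs:
            # Average the neighbors' features so the result does not
            # depend on neighbor ordering or count.
            mean = [sum(features[n][k] for n in nbrs) / len(nbrs)
                    for k in range(in_dim)]
        else:
            mean = [0.0] * in_dim
        combined = [a + b for a, b in zip(matvec(feat, w_center),
                                          matvec(mean, w_neigh))]
        out.append([max(0.0, x) for x in combined])  # ReLU nonlinearity
    return out


# Hypothetical toy graph: three residues in a chain, two features each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = graph_conv(features, neighbors, identity, identity)
```

Averaging over neighbors is one simple way to handle the irregular neighborhoods mentioned above, since a grid-style filter with a fixed position per neighbor is not available on a graph.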

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

    Transmembrane protein structure prediction using machine learning

    This thesis describes the development and application of machine learning-based methods for the prediction of alpha-helical transmembrane protein structure from sequence alone. It is divided into six chapters. Chapter 1 provides an introduction to membrane structure and dynamics, membrane protein classes and families, and membrane protein structure prediction. Chapter 2 describes a topological study of the transmembrane protein CLN3 using a consensus of bioinformatic approaches constrained by experimental data. Mutations in CLN3 can cause juvenile neuronal ceroid lipofuscinosis, or Batten disease, an inherited neurodegenerative lysosomal storage disease affecting children; such studies are therefore important for directing further experimental work into this incurable illness. Chapter 3 explores the possibility of using biologically meaningful signatures described as regular expressions to influence the assignment of inside and outside loop locations during transmembrane topology prediction. Using this approach, it was possible to modify a recent topology prediction method, leading to a 6% improvement in prediction accuracy on a standard data set. Chapter 4 describes the development of a novel support vector machine-based topology predictor that integrates both signal peptide and re-entrant helix prediction, benchmarked with full cross-validation on a novel data set of sequences with known crystal structures. The method achieves state-of-the-art performance in predicting topology and discriminating between globular and transmembrane proteins. We also present the results of applying these tools to a number of complete genomes. Chapter 5 describes a novel approach to predict lipid exposure, residue contacts, helix-helix interactions and finally the optimal helical packing arrangement of transmembrane proteins.
It is based on two support vector machine classifiers that predict per-residue lipid exposure and residue contacts, which are used to determine helix-helix interaction with up to 65% accuracy. The method is also able to discriminate native from decoy helical packing arrangements with up to 70% accuracy. Finally, a force-directed algorithm is employed to construct the optimal helical packing arrangement, which demonstrates success for proteins containing up to 13 transmembrane helices. The final chapter summarises the major contributions of this thesis to biology, before future perspectives for TM protein structure prediction are discussed.

    Quantitative modeling and statistical analysis of protein-DNA binding sites

    Sequence based methods for the prediction and analysis of the structural topology of transmembrane beta barrel proteins

    Transmembrane proteins play a major role in the normal functioning of the cell. Many transmembrane proteins act as drug targets and hence are of utmost importance to the pharmaceutical industry. In spite of the significance of transmembrane proteins, relatively few transmembrane 3D structures are available due to experimental bottlenecks. It is therefore imperative to develop novel computational methods to elucidate the structure and function of these proteins. The two major classes of transmembrane proteins are helical membrane proteins and transmembrane beta barrel proteins. Relatively more 3D structures of helical membrane proteins have been experimentally determined and, in general, the majority of computational methods in the realm of transmembrane proteins deal with helical membrane proteins. However, in recent years there has been an increased interest in the development of computational methods for the transmembrane beta barrel proteins. In this study, I focus on the transmembrane beta barrel proteins. More specifically, I present here computational methods for the prediction of the exposure status of the residues in the membrane spanning region of the transmembrane beta barrel proteins. To the best of our knowledge, the exposure status prediction is a novel problem in the realm of transmembrane beta barrel proteins. The knowledge about the exposure status of the membrane spanning residues is then used to analyse the structural properties of transmembrane beta strands. The exposure status information is also employed to identify relevant physico-chemical properties that differ with statistical significance between the transmembrane beta strands at the oligomeric interfaces and the rest of the protein surface. A method for the prediction of the beta strands in the membrane spanning regions of putative transmembrane beta barrel proteins from protein sequence has also been developed.
The computational method for strand prediction is novel in that it also gives the exposure status of the residues in the predicted transmembrane beta strands. The two computational methods developed in this study have been made available as web services. In the future, the information about the exposure status of the residues in the transmembrane beta strands can be used to identify putative transmembrane beta barrels from proteomic data. The exposure status prediction can also be extended to predict the pore region of transmembrane beta barrel proteins from sequence, which could in turn be used in the function prediction of putative transmembrane beta barrels.

    Specificity Determination by paralogous winged helix-turn-helix transcription factors

    Transcription factors (TFs) localize to regulatory regions throughout the genome, where they exert physical or enzymatic control over the transcriptional machinery and regulate expression of target genes. Despite the substantial diversity of TFs found across all kingdoms of life, most belong to a relatively small number of structural families characterized by homologous DNA-binding domains (DBDs). In homologous DBDs, highly-conserved DNA-contacting residues define a characteristic ‘recognition potential’, or the limited sequence space containing high-affinity binding sites. Specificity-determining residues (SDRs) alter DNA binding preferences to further delineate this sequence space between homologous TFs, enabling functional divergence through the recognition of distinct genomic binding sites. This thesis explores the divergent DNA-binding preferences among dimeric, winged helix-turn-helix (wHTH) TFs belonging to the OmpR sub-family. As the terminal effectors of orthogonal two-component signaling pathways in Escherichia coli, OmpR paralogs bind distinct genomic sequences and regulate the expression of largely non-overlapping gene networks. Using high-throughput SELEX, I discover multiple sources of variation in DNA-binding, including the spacing and orientation of monomer sites as well as a novel binding ‘mode’ with unique half-site preferences (but retaining dimeric architecture). Surprisingly, given the diversity of residues observed occupying positions in contact with DNA, there are only minor quantitative differences in sequence-specificity between OmpR paralogs. Combining phylogenetic, structural, and biological information, I then define a comprehensive set of putative SDRs, which, although distributed broadly across the protein:DNA interface, preferentially localize to the major groove of the DNA helix. 
Direct specificity profiling of SDR variants reveals that individual SDRs impact local base preferences as well as global structural properties of the protein:DNA complex. This study demonstrates clearly that OmpR family TFs possess multiple ‘axes of divergence’, including base recognition, dimeric architecture, and structural attributes of the protein:DNA complex. It also provides evidence for a common structural ‘code’ for DNA-binding by OmpR homologues, and demonstrates that surprisingly modest residue changes can enable recognition of highly divergent sequence motifs. Importantly, well-characterized genomic binding sites for many of the TFs in this study diverge substantially from the presented de novo models, and it is unclear how mutations may affect binding in more complex environments. Further analysis using native sequences is required to build combined models of cis- and trans-evolution of two-component regulatory networks.

    From gene to function: using new technologies for solving old problems.

    Recent advances in DNA sequencing have changed the field of genomics as well as that of proteomics, making it possible to generate gigabases of genome and transcriptome sequence data at substantially lower cost than was possible just ten years ago. In recent years, many high-throughput technologies have been developed to interrogate various aspects of cellular processes, including sequence and structural variation and the transcriptome, epigenome, proteome and interactome. These Next Generation Sequencing (NGS) experimental technologies are more mature and accessible than the computational tools available for individual researchers to move, store, analyse and present data in a user-friendly and reproducible fashion. My research work is placed in this scenario and focuses on the analysis of data produced by NGS technologies as well as on the development of new tools aimed at solving the different problems that arise during NGS data analysis. To this end, my group and I have dealt with several open biomedical problems in collaboration with different research groups of the Sapienza University. Some of these experiments have already given interesting results, but most have served as the occasion and starting point for the development of new tools able to improve some crucial steps of the analyses, solve problems arising from the system complexity and make the results easier for researchers to understand. Some examples are IsomirT, a tool for small RNA-Seq analysis and isomiR identification; Phagotto, a tool for analysing deep sequencing data derived from phage-displayed libraries; and FIDEA, a web server for the functional interpretation of differential expression analysis. Recent reports have demonstrated that individual microRNAs can be heterogeneous in length and/or sequence, producing multiple mature variants that have been dubbed isomiRs.
IsomirT is a useful tool to improve and simplify the search for isomiRs starting directly from the results of a miRNA-sequencing experiment. By using it, we observed the behaviour of isomiRs in different cell types and in different biological replicates. Our results indicate that the distribution of the microRNA variants is similar among replicates and different among cells/tissues, suggesting that the isomiRs have a functional role in the cell. The use of NGS technologies for the analysis of antibody-selected sequences, using both phage display libraries and in vitro selection processes, is becoming increasingly popular. By using these technologies, the experimental group headed by Prof. Felici has introduced a new experimental pipeline, named PROFILER, aimed at significantly empowering the analysis of antigen-specific libraries. A key step in exploiting this idea has been the development of a new tool, Phagotto, for processing and analysing the sequencing data. PROFILER, in combination with Phagotto, seems ideally suited to streamline and guide rational antigen design, adjuvant selection, and quality control of newly produced vaccines. The publicly available web server FIDEA allows experimentalists to obtain a functional interpretation of the results derived from differential expression analysis and to test their hypotheses quickly and easily. The tool performs an enrichment analysis, i.e. an analysis of specific properties that are distributed in a non-random fashion in the up-regulated and down-regulated genes, taken both together and separately. It has been shown to be very useful and is heavily used by scientists all over the world: more than 1500 requests for analysis have been submitted to the server in six months. Furthermore, during the course of the PhD I implemented pipelines for speeding up and optimizing protocols for NGS data analysis and applied them to biomedical projects.
Of course, not all proteins have a complete functional annotation, and consequently the issue arises of predicting the function of proteins with partial or no functional annotation. This can be done either by exploiting the 3D structure of the protein or by inferring the function directly from the sequence. A real challenge, however, is the assessment of the accuracy of existing methods. In this context the help that critical assessment experiments can give is essential. We have had the opportunity to be involved, as assessors, in the worldwide experiment CASP (Critical Assessment of protein Structure Prediction). In particular, we are involved in the assessment of residue-residue contacts, in which the participant groups provide a list of predicted contacts between residues that can hopefully be used as constraints to fold the protein. We proposed and implemented new methodologies to understand which method works better and where future efforts should be focused.
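The enrichment analysis mentioned for FIDEA in the abstract above asks whether a property is over-represented among differentially expressed genes. The abstract does not say which statistic FIDEA uses, so the sketch below assumes a one-sided hypergeometric test, a common choice for this kind of question. The function name `enrichment_p_value` and the toy counts are hypothetical.

```python
from math import comb  # available in Python 3.8 and later


def enrichment_p_value(population, annotated, selected, hits):
    """One-sided hypergeometric tail probability P(X >= hits).

    population : total number of genes considered
    annotated  : genes in the population carrying the property of interest
    selected   : genes in the list under test (e.g. up-regulated genes)
    hits       : selected genes that carry the property
    """
    max_hits = min(annotated, selected)
    total = comb(population, selected)
    # Sum the probabilities of seeing `hits` or more annotated genes
    # when `selected` genes are drawn at random without replacement.
    tail = sum(comb(annotated, k) * comb(population - annotated, selected - k)
               for k in range(hits, max_hits + 1))
    return tail / total


# Hypothetical toy numbers: 10 genes, 4 annotated, 3 selected, 2 of them annotated.
p = enrichment_p_value(population=10, annotated=4, selected=3, hits=2)
```

A small p-value would suggest the property is distributed non-randomly in the selected gene list. In practice the test would be repeated over many properties, so some multiple-testing correction would normally follow.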

    Deep Evolutionary Generative Molecular Modeling for RNA Aptamer Drug Design

    Deep Aptamer Evolutionary Model (DAPTEV Model). Typical drug development processes are costly, time consuming and often manual with regard to research. Aptamers are short, single-stranded oligonucleotides (RNA/DNA) that, like antibodies, bind to and inhibit target proteins and other types of molecules. Compared with small-molecule drugs, these aptamers can bind to their targets with high affinity (binding strength) and specificity (designed to uniquely interact with the target only). The typical development process for aptamers utilizes a manual process known as Systematic Evolution of Ligands by Exponential Enrichment (SELEX), which is costly, slow, and often produces mild results. The focus of this research is to create a deep learning approach for generating and evolving aptamer sequences to support aptamer-based drug development. These sequences must be unique, contain at least some level of structural complexity, and have a high level of affinity and specificity for the intended target. Moreover, after training, the deep learning system, known as a Variational Autoencoder (VAE), must possess the ability to be queried for new sequences without the need for further training. Currently, this research is applied to the SARS-CoV-2 (Covid-19) spike protein’s receptor-binding domain (RBD). However, careful consideration has been given to the intentional design of a general solution for future viral applications. Each individual run took five and a half days to complete. Over the course of two months, three runs were performed for three different models. After some sequence, score, and statistical comparisons, it was observed that the deep learning model was able to produce structurally complex aptamers with strong binding affinities and specificities to the target Covid-19 RBD. Furthermore, due to the nature of VAEs, this model can indeed be queried for new aptamers of similar quality based on previous training.
Results suggest that VAE-based deep learning methods are capable of optimizing aptamer-target binding affinities and specificities (multi-objective learning), and are a strong tool to aid in aptamer-based drug development.

    Protein function prediction by integrating sequence, structure and binding affinity information

    Indiana University-Purdue University Indianapolis (IUPUI). Proteins are nano-machines that work inside every living organism. Functional disruption of one or several proteins is the cause of many diseases. However, the functions of most proteins are yet to be annotated, because inexpensive sequencing techniques dramatically speed up the discovery of new protein sequences (265 million and counting) and experimental examination of every protein in all its possible functional categories is simply impractical. Thus, it is necessary to develop computational function-prediction tools that complement and guide experimental studies. In this study, we developed a series of predictors for highly accurate prediction of proteins with DNA-binding, RNA-binding and carbohydrate-binding capability. These predictors use a template-based technique that combines sequence and structural information with predicted binding affinity. Both sequence-based and structure-based approaches were developed. Results indicate the importance of binding affinity prediction for improving sensitivity and precision of function prediction. Application of these methods to the human genome and structural genomics targets demonstrated their usefulness in annotating proteins of unknown function and discovering moonlighting proteins with DNA, RNA, or carbohydrate binding function. In addition, we also investigated disruption of protein functions by naturally occurring genetic variations due to insertions and deletions (INDELs). We found that protein structures are the most critical features in recognising disease-causing non-frame-shifting INDELs. The predictors for function prediction are available at http://sparks-lab.org/spot, and the predictor for classification of non-frame-shifting INDELs is available at http://sparks-lab.org/ddig