    A creature with a hundred waggly tails: intrinsically disordered proteins in the ribosome

    This article is made available for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. Intrinsic disorder (i.e., lack of a unique 3-D structure) is a common phenomenon, and many biologically active proteins are disordered as a whole, or contain long disordered regions. These intrinsically disordered proteins/regions constitute a significant part of all proteomes, and their functional repertoire is complementary to functions of ordered proteins. In fact, intrinsic disorder represents an important driving force for many specific functions. An illustrative example of such a disorder-centric functional class is RNA-binding proteins. In this study, we present the results of comprehensive bioinformatics analyses of the abundance and roles of intrinsic disorder in 3,411 ribosomal proteins from 32 species. We show that many ribosomal proteins are intrinsically disordered or hybrid proteins that contain ordered and disordered domains. Predicted globular domains of many ribosomal proteins contain noticeable regions of intrinsic disorder. We also show that disorder in ribosomal proteins has different characteristics compared to other proteins that interact with RNA and DNA, including overall abundance, evolutionary conservation, and involvement in protein–protein interactions. Furthermore, intrinsic disorder is not only abundant in the ribosomal proteins, but we demonstrate that it is absolutely necessary for their various functions.

    Computational Analysis and Prediction of Intrinsic Disorder and Intrinsic Disorder Functions in Proteins

    By Akila Imesha Katuwawala. A dissertation submitted in partial fulfillment of the requirements for the degree of Engineering, Doctor of Philosophy with a concentration in Computer Science at Virginia Commonwealth University. Virginia Commonwealth University, 2021. Director: Lukasz Kurgan, Professor, Department of Computer Science. Proteins, as a fundamental class of biomolecules, have been studied from various perspectives over the past two centuries. The traditional notion is that proteins require fixed and stable three-dimensional structures to carry out biological functions. However, there is mounting evidence regarding a “special” class of proteins, named intrinsically disordered proteins, which do not have fixed three-dimensional structures though they perform a number of important biological functions. Computational approaches have been a vital component in the study of these intrinsically disordered proteins over the past few decades. Prediction of the intrinsic disorder and functions of intrinsic disorder from protein sequences is one such important computational approach that has recently gained attention, particularly with the advent of modern machine learning techniques. This dissertation runs along two basic themes, namely, prediction of the intrinsic disorder and prediction of the intrinsic disorder functions. The work related to the prediction of intrinsic disorder covers a novel approach to evaluate the predictive performance of the current computational disorder predictors. This approach evaluates the intrinsic disorder predictors at the individual protein level, in contrast to traditional studies that evaluate them over large protein datasets. We address several interesting aspects concerning the differences in the protein-level vs. dataset-level predictive quality, complementarity and predictive performance of the current predictors. Based on the findings from this assessment we have conceptualized, developed, tested and deployed an innovative platform called DISOselect that recommends the most suitable computational disorder predictors for a given protein, with an underlying goal to maximize the predictive performance. DISOselect provides advice on whether a given disorder predictor would provide an accurate prediction for a given protein of the user’s interest, and recommends the most suitable disorder predictor together with an estimate of its expected predictive quality. The second theme, prediction of the intrinsic disorder functions, includes a first-of-its-kind evaluation of the current computational disorder predictors on two functional sub-classes of the intrinsically disordered proteins. This study introduces several novel evaluation strategies to assess predictive performance of disorder prediction methods and focuses on the evaluation for disorder functions associated with interactions with partner molecules. Results of this analysis motivated us to conceptualize, design, test and deploy a new and accurate machine learning-based predictor of the disordered lipid-binding residues, DisoLipPred. We empirically show that the strong predictive performance of DisoLipPred stems from several innovative design features and that its predictions complement results produced by current disorder predictors, disorder function predictors and predictors of transmembrane regions. We deploy DisoLipPred as a convenient webserver and discuss its predictions on the yeast proteome.
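
    The distinction between protein-level and dataset-level evaluation can be illustrated with a minimal sketch. It is not taken from the dissertation: the toy labels, the choice of accuracy as the metric, and the function names are all assumptions made for illustration. It shows how pooling residues across a dataset can hide a poor prediction for an individual protein.

```python
def accuracy(pred, truth):
    """Fraction of residues whose binary disorder label is predicted correctly."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def dataset_level_accuracy(proteins):
    """Pool the residues of all proteins, then score once."""
    pred = [p for prot in proteins for p in prot["pred"]]
    truth = [t for prot in proteins for t in prot["truth"]]
    return accuracy(pred, truth)

def protein_level_accuracies(proteins):
    """Score every protein separately."""
    return {prot["id"]: accuracy(prot["pred"], prot["truth"]) for prot in proteins}

# Hypothetical toy data: 1 = disordered residue, 0 = ordered residue.
proteins = [
    # Long protein, predicted perfectly.
    {"id": "long_protein", "truth": [0] * 90 + [1] * 10, "pred": [0] * 90 + [1] * 10},
    # Short, fully disordered protein, mostly mispredicted.
    {"id": "short_protein", "truth": [1] * 10, "pred": [0] * 8 + [1] * 2},
]

pooled = dataset_level_accuracy(proteins)        # 102 of 110 residues correct
per_protein = protein_level_accuracies(proteins)  # 1.0 and 0.2
print(f"dataset-level: {pooled:.3f}")
print(f"protein-level: {per_protein}")
```

    In this toy case the pooled score is dominated by the long protein, while the per-protein view exposes the weak prediction for the short one.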

    The Echinococcus canadensis (G7) genome: A key knowledge of parasitic platyhelminth human diseases

    Background: The parasite Echinococcus canadensis (G7) (phylum Platyhelminthes, class Cestoda) is one of the causative agents of echinococcosis. Echinococcosis is a worldwide chronic zoonosis affecting humans as well as domestic and wild mammals, which has been reported as a prioritized neglected disease by the World Health Organisation. No genomic data, comparative genomic analyses or efficient therapeutic and diagnostic tools are available for this severe disease. The information presented in this study will help to understand the peculiar biological characters and to design species-specific control tools. Results: We sequenced, assembled and annotated the 115-Mb genome of E. canadensis (G7). Comparative genomic analyses using whole genome data of three Echinococcus species not only confirmed the status of E. canadensis (G7) as a separate species but also demonstrated a high nucleotide sequence divergence in relation to E. granulosus (G1). The E. canadensis (G7) genome contains 11,449 genes with a core set of 881 orthologs shared among five cestode species. Comparative genomics revealed that there are more single nucleotide polymorphisms (SNPs) between E. canadensis (G7) and E. granulosus (G1) than between E. canadensis (G7) and E. multilocularis. This result was unexpected since E. canadensis (G7) and E. granulosus (G1) were considered to belong to the species complex E. granulosus sensu lato. We described SNPs in known drug targets and metabolism genes in the E. canadensis (G7) genome. Regarding gene regulation, we analysed three particular features: CpG island distribution along the three Echinococcus genomes, DNA methylation system and small RNA pathway. The results suggest the occurrence of yet unknown gene regulation mechanisms in Echinococcus. Conclusions: This is the first work that addresses Echinococcus comparative genomics. The resources presented here will promote the study of mechanisms of parasite development as well as new tools for drug discovery. The availability of a high-quality genome assembly is critical for fully exploring the biology of a pathogenic organism. The E. canadensis (G7) genome presented in this study provides a unique opportunity to address the genetic diversity within the genus Echinococcus and its particular developmental features. At present, there is no unequivocal taxonomic classification of Echinococcus species; however, the genome-wide SNP analysis performed here revealed the phylogenetic distance among these three Echinococcus species. Additional cestode genomes need to be sequenced to be able to resolve their phylogeny.
    Author affiliations:
    Maldonado, Lucas Luciano: Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Assis, Juliana: Fundación Oswaldo Cruz; Brazil
    Gomes Araújo, Flávio M.: Fundación Oswaldo Cruz; Brazil
    Salim, Anna C. M.: Fundación Oswaldo Cruz; Brazil
    Macchiaroli, Natalia: Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Cucher, Marcela Alejandra: Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Camicia, Federico: Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Fox, Adolfo: Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Rosenzvit, Mara Cecilia: Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
    Oliveira, Guilherme: Instituto Tecnológico Vale; Brazil. Fundación Oswaldo Cruz; Brazil
    Kamenetzky, Laura: Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina

    Computational approaches to predict protein functional families and functional sites.

    Understanding the mechanisms of protein function is indispensable for many biological applications, such as protein engineering and drug design. However, experimental annotations are sparse, and therefore, theoretical strategies are needed to fill the gap. Here, we present the latest developments in building functional subclassifications of protein superfamilies and using evolutionary conservation to detect functional determinants, for example, catalytic, binding and specificity-determining residues important for delineating the functional families. We also briefly review other features exploited for functional site detection and new machine learning strategies for combining multiple features.

    DescribePROT: database of amino acid-level protein structure and function predictions

    We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accessed via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can also be downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included in DescribePROT are useful for a broad range of studies, including investigations of protein function, applied projects focusing on therapeutics and diseases, and the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/

    FastRNABindR: Fast and Accurate Prediction of Protein-RNA Interface Residues

    A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses, are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational effort needed for generating PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that randomly sampled databases produce better PSSM profiles (in terms of the number of hits used to generate the profile, the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data, and the accuracy of the machine learning classifier trained using these profiles). Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission. Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence based predictors of protein-protein and protein-DNA interfaces. Funding: Edward Frymoyer Endowed Professorship in Information Sciences and Technology; the Center for Big Data Analytics and Discovery Informatics, which is co-sponsored by the Institute for Cyberscience, the Huck Institutes of the Life Sciences, the Social Science Research Institute, and the College of Information Sciences and Technology at the Pennsylvania State University; NPRP grant No. 4-1454-1-233 from the Qatar National Research Fund (a member of Qatar Foundation).
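
    The idea of drawing a small uniform random subset of a large sequence database can be sketched as follows. This is only an illustration under stated assumptions, not the FastRNABindR pipeline: the toy FASTA text, the helper names `read_fasta` and `reservoir_sample`, and the use of reservoir sampling (a standard single-pass technique for uniform sampling without replacement) are choices made here, not details given in the abstract.

```python
import random
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def reservoir_sample(records, k, rng):
    """Uniformly sample k records from a stream in one pass (Algorithm R)."""
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample

# Hypothetical toy database of 200 sequences standing in for a large FASTA file.
toy_fasta = "".join(f">seq{i}\nACDEFGHIKL\n" for i in range(200))
records = list(read_fasta(StringIO(toy_fasta)))

k = max(1, len(records) // 100)  # keep 1% of the records
subset = reservoir_sample(iter(records), k, random.Random(42))
print(len(records), "records ->", len(subset), "sampled")
```

    Reservoir sampling is used here because it never needs the whole database in memory, which matters for a file with tens of millions of sequences; the sampled records would then be written out and indexed as a search database by whatever tool builds the profiles.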

    Identification of RNA Binding Proteins and RNA Binding Residues Using Effective Machine Learning Techniques

    Identification and annotation of RNA-binding proteins (RBPs) and RNA-binding residues from sequence information alone is one of the most challenging problems in computational biology. RBPs play crucial roles in several fundamental biological functions, including transcriptional regulation of RNAs, RNA metabolism, and splicing. Existing experimental techniques are time-consuming and costly. Thus, efficient computational identification of RBPs directly from the sequence can be useful to annotate RBPs and assist the experimental design. Here, we introduce AIRBP, a computational sequence-based method that uses features extracted from evolutionary information, physicochemical properties, and disordered properties to train a model built with stacking, an advanced machine learning technique, for effective prediction of RBPs. It makes use of efficient machine learning algorithms such as Support Vector Machine, Logistic Regression, K-Nearest Neighbor and XGBoost (Extreme Gradient Boosting Algorithm). In this research work, we also propose another predictor for efficient annotation of RNA-binding residues. This residue predictor also uses stacking and evolutionary algorithms, and it likewise draws on various evolutionary, physicochemical and disordered properties to train a robust model. This thesis presents a possible solution to the RBP and RNA-binding residue prediction problem through two independent predictors, both of which outperform existing state-of-the-art approaches.

    Protein-DNA binding sites prediction based on pre-trained protein language model and contrastive learning

    Protein-DNA interaction is critical for life activities such as replication, transcription, and splicing. Identifying protein-DNA binding residues is essential for modeling their interaction and downstream studies. However, developing accurate and efficient computational methods for this task remains challenging. Improvements in this area have the potential to drive novel applications in biotechnology and drug design. In this study, we propose a novel approach called CLAPE, which combines a pre-trained protein language model and the contrastive learning method to predict DNA binding residues. We trained the CLAPE-DB model on the protein-DNA binding sites dataset and evaluated the model performance and generalization ability through various experiments. The results showed that the AUC values of the CLAPE-DB model on the two benchmark datasets reached 0.871 and 0.881, respectively, indicating superior performance compared to other existing models. CLAPE-DB showed better generalization ability and was specific to DNA-binding sites. In addition, we trained CLAPE on different protein-ligand binding sites datasets, demonstrating that CLAPE is a general framework for binding sites prediction. To facilitate the scientific community, the benchmark datasets and codes are freely available at https://github.com/YAndrewL/clape
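
    For readers unfamiliar with the AUC figures quoted above, the metric can be computed from per-residue scores as in the minimal sketch below. The labels and scores are invented toy values, not CLAPE-DB outputs, and the pairwise formulation (probability that a random binding residue scores above a random non-binding residue, with ties counted as one half) is one standard definition chosen here for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Count positive-negative pairs ranked correctly; a tie counts as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-residue data: 1 = DNA-binding residue, 0 = non-binding residue.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

    An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of binding from non-binding residues.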

    Computational and Experimental Approaches to Reveal the Effects of Single Nucleotide Polymorphisms with Respect to Disease Diagnostics

    DNA mutations are the cause of many human diseases, and they are the reason for natural differences among individuals because they affect the structure, function, interactions, and other properties of DNA and expressed proteins. The ability to predict whether a given mutation is disease-causing or harmless is of great importance for the early detection of patients with a high risk of developing a particular disease and would pave the way for personalized medicine and diagnostics. Here we review existing methods and techniques to study and predict the effects of DNA mutations from three different perspectives: in silico, in vitro and in vivo. It is emphasized that the problem is complicated and that successful detection of a pathogenic mutation frequently requires a combination of several methods and a knowledge of the biological phenomena associated with the corresponding macromolecules.