33 research outputs found

    Predicting RNA-Protein Interactions Using Only Sequence Information

    Background: RNA-protein interactions (RPIs) play important roles in a wide variety of cellular processes, ranging from transcriptional and post-transcriptional regulation of gene expression to host defense against pathogens. High throughput experiments to identify RNA-protein interactions are beginning to provide valuable information about the complexity of RNA-protein interaction networks, but are expensive and time consuming. Hence, there is a need for reliable computational methods for predicting RNA-protein interactions. Results: We propose RPISeq, a family of classifiers for predicting RNA-protein interactions using only sequence information. Given the sequences of an RNA and a protein as input, RPISeq predicts whether or not the RNA-protein pair interacts. The RNA sequence is encoded as a normalized vector of its ribonucleotide 4-mer composition, and the protein sequence is encoded as a normalized vector of its 3-mer composition, based on a 7-letter reduced alphabet representation. Two variants of RPISeq are presented: RPISeq-SVM, which uses a Support Vector Machine (SVM) classifier, and RPISeq-RF, which uses a Random Forest classifier. On two non-redundant benchmark datasets extracted from the Protein-RNA Interface Database (PRIDB), RPISeq achieved an AUC (Area Under the Receiver Operating Characteristic (ROC) curve) of 0.96 and 0.92. On a third dataset containing only mRNA-protein interactions, the performance of RPISeq was competitive with that of a published method that requires information regarding many different features (e.g., mRNA half-life, GO annotations) of the putative RNA and protein partners. In addition, RPISeq classifiers trained using the PRIDB data correctly predicted the majority (57-99%) of non-coding RNA-protein interactions in NPInter-derived networks from E. coli, S. cerevisiae, D. melanogaster, M. musculus, and H. sapiens. Conclusions: Our experiments with RPISeq demonstrate that RNA-protein interactions can be reliably predicted using only sequence-derived information. RPISeq offers an inexpensive method for computational construction of RNA-protein interaction networks, and should provide useful insights into the function of non-coding RNAs. RPISeq is freely available as a web-based server at http://pridb.gdcb.iastate.edu/RPISeq/.
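
The sequence encoding described in this abstract (RNA 4-mer composition plus protein 3-mer composition over a 7-letter reduced alphabet) can be illustrated with a minimal Python sketch. Several details are assumptions, since the abstract does not state them: the 7-class amino acid grouping used here is the commonly cited conjoint-triad grouping, normalization is taken to mean dividing each count by the total number of valid k-mers, and the names `REDUCED_ALPHABET`, `kmer_composition`, `encode_rna`, `encode_protein` and `encode_pair` are hypothetical.

```python
from itertools import product

# ASSUMPTION: conjoint-triad style grouping into 7 classes; the abstract
# only says "7-letter reduced alphabet" and does not list the groups.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
REDUCED_ALPHABET = {
    aa: str(i + 1) for i, group in enumerate(GROUPS) for aa in group
}


def kmer_composition(seq, alphabet, k):
    """Return the k-mer frequency vector of seq over alphabet.

    Entries follow the lexicographic order of all len(alphabet)**k k-mers.
    Windows containing symbols outside the alphabet are skipped.
    ASSUMPTION: 'normalized' means counts divided by the number of
    valid windows, so the vector sums to 1 when any window is valid.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def encode_rna(rna):
    """256-dimensional 4-mer composition of an RNA (T is read as U)."""
    return kmer_composition(rna.upper().replace("T", "U"), "ACGU", 4)


def encode_protein(protein):
    """343-dimensional 3-mer composition over the reduced alphabet."""
    reduced = "".join(REDUCED_ALPHABET.get(aa, "X") for aa in protein.upper())
    return kmer_composition(reduced, "1234567", 3)


def encode_pair(rna, protein):
    """Concatenate both encodings into one 599-dimensional feature vector."""
    return encode_rna(rna) + encode_protein(protein)
```

A vector produced by `encode_pair` could then be passed to any standard SVM or Random Forest implementation; how RPISeq itself trains and tunes those classifiers is not described in the abstract.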

    Computational prediction of RNA-protein interaction partners and interfaces

    RNA-protein interactions play important roles in fundamental cellular processes involved in human diseases, viral replication and defense against pathogens in plants, animals and microbes. However, the detailed recognition mechanisms underlying these interactions are poorly understood. To gain a better understanding of the molecular recognition code for RNA-protein interactions, this dissertation has three related goals: i) to develop methods for predicting RNA-protein interaction partners; ii) to develop an approach for predicting interfacial residues in both the RNA and protein components of RNA-protein complexes; and iii) to develop computational tools and resources for investigating RNA-protein interactions. First, we present machine learning classifiers for predicting RNA-protein interaction partners. The classifiers use the amino acid composition of proteins and the ribonucleotide composition of RNAs as input to predict whether a given RNA-protein pair interacts. We show that protein and RNA sequences alone (i.e., in the absence of any structural information) contain enough signal to allow reliable prediction of interaction partners. Second, we present RPISeq, a webserver that predicts the interaction probabilities of input RNA-protein pairs, using the above-mentioned machine learning classifiers. A comprehensive database of RNA-protein interactions, RPIntDB, is integrated with the webserver to allow users to search for homologous proteins and their known interacting RNA partners. Finally, we perform an analysis of contiguous interfacial amino acids and ribonucleotides in RNA-protein complexes for which structures are known. We generate a dataset of bipartite RNA-protein motifs that can be used to predict interfacial residues in both the RNA and protein sequences of a given RNA-protein pair simultaneously. We show that taking binding partner information into account leads to higher precision in the prediction of RNA-binding residues in proteins. 
Taken together, these studies have increased our understanding of how RNA and proteins interact.

    Computational Tools for Investigating RNA-Protein Interaction Partners

    RNA-protein interactions are important in a wide variety of cellular and developmental processes. Recently, high-throughput experiments have begun to provide valuable information about RNA partners and binding sites for many RNA-binding proteins (RBPs), but these experiments are expensive and time consuming. Thus, computational methods for predicting RNA-protein interactions (RPIs) can be valuable tools for identifying potential interaction partners of a given protein or RNA, and for identifying likely interfacial residues in RNA-protein complexes. This review focuses on the “partner prediction” problem and summarizes available computational methods, web servers and databases that are devoted to it. New computational tools for addressing the related “interface prediction” problem are also discussed. Together, these computational methods for investigating RNA-protein interactions provide the basis for new strategies for integrating RNA-protein interactions into existing genetic and developmental regulatory networks, an important goal of future research.

    Multivariate Information Fusion With Fast Kernel Learning to Kernel Ridge Regression in Predicting LncRNA-Protein Interactions

    Long non-coding RNAs (lncRNAs) constitute a large class of transcribed RNA molecules that are longer than 200 nucleotides and do not encode proteins. They play an important role in regulating gene expression by interacting with the corresponding RNA-binding proteins. Because wet-lab experimental methods are laborious and time-consuming, computational approaches for predicting lncRNA-protein interactions (LPIs) deserve greater attention. An in-depth review of state-of-the-art in silico studies leads to the conclusion that there is still room to improve both accuracy and speed. This paper proposes a novel method for identifying LPIs that employs Kernel Ridge Regression based on Fast Kernel Learning (LPI-FKLKRR). The approach uses four distinct similarity measures for the lncRNA space and four for the protein space. Notably, we extract Gene Ontology (GO) information for proteins in order to improve the quality of information in the protein space. To integrate the heterogeneous kernels, Fast Kernel Learning (FastKL) is applied to optimize the kernel weights. The final predicted associations are then obtained with Kernel Ridge Regression (KRR). Experimental results show that LPI-FKLKRR performs markedly better than other LPI prediction schemes. On a benchmark dataset, the proposed model LPI-FKLKRR obtains the best Area Under the Precision-Recall Curve (AUPR) of 0.6950, outperforming the integrated LPLNP (AUPR: 0.4584), RWR (AUPR: 0.2827), CF (AUPR: 0.2357), LPIHN (AUPR: 0.2299), and LPBNI (AUPR: 0.3302). Together with the experimental results of a case study on a novel dataset, this suggests that LPI-FKLKRR will be a useful tool for LPI prediction.
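
The two steps named in this abstract, combining several similarity kernels with weights and then scoring pairs with kernel ridge regression, can be illustrated with a generic NumPy sketch. This is not the LPI-FKLKRR implementation: the weights are supplied by the caller rather than learned by FastKL, averaging the lncRNA-side and protein-side predictions is an assumption made for illustration, and the names `combine_kernels`, `krr_scores` and `lam` are hypothetical.

```python
import numpy as np


def combine_kernels(kernels, weights):
    """Weighted sum of same-shaped kernel matrices.

    ASSUMPTION: weights are given and rescaled to sum to 1; in the paper
    they would come from Fast Kernel Learning, which is not shown here.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))


def krr_scores(K_lnc, K_prot, Y, lam=1.0):
    """Score every lncRNA-protein pair with kernel ridge regression.

    K_lnc : (n, n) combined lncRNA kernel
    K_prot: (m, m) combined protein kernel
    Y     : (n, m) known interaction matrix (1 = known pair, 0 = unknown)
    lam   : ridge regularization strength

    Standard KRR is applied once in each space and the two score
    matrices are averaged (an illustrative choice).
    """
    n, m = Y.shape
    # lncRNA space: F = K (K + lam I)^-1 Y
    F_lnc = K_lnc @ np.linalg.solve(K_lnc + lam * np.eye(n), Y)
    # protein space: same formula applied to the transposed problem
    F_prot = (K_prot @ np.linalg.solve(K_prot + lam * np.eye(m), Y.T)).T
    return 0.5 * (F_lnc + F_prot)
```

Higher entries in the returned matrix indicate pairs the model ranks as more likely to interact; ranking the unknown (zero) entries of `Y` by score is one simple way to propose candidates.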

    Aberrant KDM5B expression promotes aggressive breast cancer through MALAT1 overexpression and downregulation of hsa-miR-448

    Relative expression of KDM5B, MALAT1, SNAIL, Vimentin and miR-448 normalized against GAPDH in MCF10A WT, MCF10A OE, MDA-MB-231 WT and MDA-MB-231 KD cells. Data are representative of 3 independent experiments and were analyzed by Student’s t-test. All data are shown as mean ± SEM. WT, wild type; OE, KDM5B overexpressed; KD, knockdown using shKDM5B clone II. (DOCX 519 kb)

    Discovery and validation of clinically relevant long non-coding RNAs in colorectal cancer

    Colorectal cancer (CRC) is the third most prevalent cancer worldwide, with nearly two million newly diagnosed cases each year. The survival of patients with CRC greatly depends on the cancer stage at the time of diagnosis, with worse prognosis for more advanced cases. Consequently, considerable effort has been directed towards improving population screening programs for early diagnosis and identifying prognostic markers that can better inform treatment strategies. In recent years, long non-coding RNAs (lncRNAs) have been recognized as promising molecules, with diagnostic and prognostic potential in many cancers, including CRC. Although large-scale genome and transcriptome sequencing surveys have identified many lncRNAs that are altered in CRC, most of their roles in disease onset and progression remain poorly understood. Here, we critically review the variety of detection methods and types of supporting evidence for the involvement of lncRNAs in CRC. In addition, we provide a reference catalog that features the most clinically relevant lncRNAs in CRC. These lncRNAs were selected based on recent studies sorted by stringent criteria for both supporting experimental evidence and reproducibility. This research was funded by the Spanish Ministry of Science and Innovation with grant PGC2018-099921-B-I00, co-funded by the European Regional Development Fund (ERDF); by the Catalan Research Agency (AGAUR) SGR423; by the European Union’s Horizon 2020 research and innovation programme (Grant ERC-2016-724173); by TRANSCOLONCAN COST action network (CA17118); by the Gordon and Betty Moore Foundation (Grant GBMF9742); by the “La Caixa” foundation (Grant LCF/PR/HR21/00737), and by the Instituto de Salud Carlos III (IMPACT grant IMP/00019) and CIBERINFEC (grant CB21/13/00061-ISCIII-SGEFI/ERDF). This research was made possible by the Fulbright U.S. Student Grant Program, sponsored by the U.S. Department of State, Bureau of Education and Cultural Affairs. Peer reviewed. Postprint (published version).

    Zooming in on protein–RNA interactions: a multilevel workflow to identify interaction partners

    Interactions between proteins and RNA are at the base of numerous cellular regulatory and functional phenomena. The investigation of the biological relevance of non-coding RNAs has led to the identification of numerous novel RNA-binding proteins (RBPs). However, defining the RNA sequences and structures that are selectively recognised by an RBP remains challenging, since these interactions can be transient and highly dynamic, and may be mediated by unstructured regions in the protein, as in the case of many non-canonical RBPs. Numerous experimental and computational methodologies have been developed to predict, identify and verify the binding between a given RBP and potential RNA partners, but navigating across the vast ocean of data can be frustrating and misleading. In this mini-review, we propose a workflow for the identification of the RNA binding partners of putative, newly identified RBPs. The large pool of potential binders selected by in-cell experiments can be enriched by in silico tools such as catRAPID, which is able to predict the RNA sequences more likely to interact with specific RBP regions with high accuracy. The RNA candidates with the highest potential can then be analysed in vitro to determine the binding strength and to precisely identify the binding sites. The results thus obtained can furthermore validate the computational predictions, offering an all-round solution to the issue of finding the most likely RNA binding partners for a newly identified potential RBP.

    Feature-Based and String-Based Models for Predicting RNA-Protein Interaction

    In this work, we study two approaches to the problem of predicting RNA-Protein Interaction (RPI). In the first approach, we use a feature-based technique that combines features extracted from both sequences and secondary structures. The feature-based approach enhanced the prediction accuracy because it included much more of the available information about the RNA-protein pairs. In the second approach, we apply search algorithms and data structures to extract effective string patterns for prediction of RPI, using both sequence information (protein and RNA sequences) and structure information (protein and RNA secondary structures). This led to different string-based models for predicting interacting RNA-protein pairs. We show results that demonstrate the effectiveness of the proposed approaches, including comparative results against leading state-of-the-art methods.

    LncRNAs in CONDBITs perspectives, from genetics towards theranostics

    Long noncoding RNAs (lncRNAs) are a novel group of ncRNAs that have been found to be pervasively transcribed across the genome; they are characterized as endogenous cellular RNAs of more than 200 nucleotides. They are classified according to function, transcript length, relation to protein-coding genes and other functional DNA elements, and subcellular localization. Theranostics is a novel field of medicine that combines specifically targeted biomolecules with molecular-based tests. As a novel finding in molecular medicine, lncRNAs are indispensable tools in theranostics-based medicine, allowing specific targeting of molecular pathways for diagnostics and therapeutics. In their natural capacities, lncRNAs may act as signals, decoys, guides, and scaffolds. LncRNA expression is controlled by transcriptional and epigenetic factors and processes. LncRNAs are also linked to detracting biological programs. Here we thoroughly review lncRNAs in disorders and diseases from the CONDBITs perspectives, i.e., cardiology, oncology, neurology and neuroscience, dermatology, molecular biology and bioinformatics, immunology, and technologies (related to “-omics”, i.e., transcriptomics, and “nano”, i.e., nanotechnology). We describe the lncRNA biomarkers that are abundant from the cardiovascular, neurodegenerative, dermatological, and immunological perspectives. However, as cancer is the most widely studied disease, more biomarkers are available for this particular case. Cancer-associated lncRNAs are abundant. The most frequently studied lncRNA molecules in cancer are HOTAIR, MALAT1, LincRNA-p21, H19, GAS5, ANRIL, MEG3, XIST, and HULC. LncRNAs used in cancer diagnosis and monitoring include H19 and AA174084 (gastric), HULC (hepatocellular), and PCA3 (prostate). Prognostic lncRNAs include HOTAIR and NKILA (breast), MEG3 (meningioma), NBAT-1 (neuroblastoma), and SCHLAP1 (prostate). LncRNAs that predict therapeutic responsiveness include CCAT1 (colorectal) and HOTAIR (ovarian). Thus, it is concluded that the CONDBIT perspective is useful for describing the encouraging outlook of this transcriptomics-based medicinal approach.

    A Deep Learning Approach to LncRNA Subcellular Localization Using Inexact q-mer

    Long non-coding ribonucleic acids (lncRNAs) can be localized to different cellular components, such as the nucleus, exosome, cytoplasm, ribosome, etc. Their biological functions can be influenced by the region of the cell in which they are located. Many of these lncRNAs are associated with different challenging diseases. Thus, it is crucial to study their subcellular localization. However, compared to the vast number of lncRNAs, only relatively few have annotations in terms of their subcellular localization. Conventional computational methods extract q-mer profiles from lncRNA sequences and then train machine learning models, such as support vector machines and logistic regression, on the profiles. These methods focus on the exact q-mer. Given possible sequence mutations and other uncertainties in genomic sequences and their role in biological function, a consideration of these changes might improve our ability to model lncRNAs and their localization. I hypothesize that considering these changes may improve the ability to predict subcellular localization of lncRNAs. To test this hypothesis, I propose a deep learning model with inexact q-mers for the localization of lncRNAs in the cell. The proposed method can obtain a high overall accuracy of 94.7%, an average of 91.3% on a benchmark dataset, using the 8-mers with mismatches. In comparison, the exact 8-mer result was 89.8%. The proposed approach outperformed existing state-of-the-art lncRNA predictors on two different datasets. Therefore, the results support the hypothesis that deep learning models using inexact q-mers can improve the performance of computational lncRNA localization algorithms. The lengths of the lncRNAs vary from hundreds to thousands of nucleotides. In this work, I also check whether the length of the lncRNA affects the prediction accuracy. The results show that when the lncRNA sequence's length is between 2000 and 3000 nucleotides, our model is more accurate.
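
The difference between exact and inexact q-mer profiles mentioned in this abstract can be made concrete with a short Python sketch. It assumes that "mismatches" means Hamming distance, so each window of the sequence contributes to every q-mer within a fixed number of substitutions; the abstract does not define the mismatch model, and the names `neighbors`, `inexact_qmer_counts` and `max_mismatch` are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def neighbors(qmer, max_mismatch, alphabet="ACGU"):
    """Return all q-mers within Hamming distance max_mismatch of qmer."""
    result = {qmer}
    for d in range(1, max_mismatch + 1):
        # choose d positions, then try every letter combination there
        for positions in combinations(range(len(qmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(qmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))
    return result


def inexact_qmer_counts(seq, q, max_mismatch=1):
    """Count q-mers, crediting each window to all of its neighbors.

    With max_mismatch=0 this reduces to the exact q-mer profile.
    ASSUMPTION: mismatches are substitutions only (no insertions or
    deletions), measured by Hamming distance.
    """
    counts = Counter()
    for i in range(len(seq) - q + 1):
        for n in neighbors(seq[i:i + q], max_mismatch):
            counts[n] += 1
    return counts
```

Enumerating neighbors costs far more for 8-mers than for the 3-mers used when testing this sketch, so a practical pipeline would likely need a more efficient counting scheme; the abstract does not say how the deep learning model consumes these profiles.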