51 research outputs found
Computational analysis and prediction of protein-RNA interactions
Protein-RNA interactions are essential for many important processes including all phases of protein production, regulation of gene expression, and replication and assembly of many viruses. This dissertation has two related goals: 1) predicting RNA-binding sites in proteins from protein sequence, structure, and conservation information, and 2) characterizing protein-RNA interactions.
We present several machine learning classifiers for predicting RNA-binding sites in proteins based on the protein sequence, protein structure, and conservation information. Our first classifier uses only amino acid sequence information as input and predicts RNA-binding sites with an area under the receiver operating characteristic curve (AUC) of 0.74. Using the neighboring amino acids in the protein structure improves prediction performance over using sequence alone. We show that using evolutionary information in the form of position-specific scoring matrices (PSSMs) provides a further significant improvement in predictions. Finally, we create an ensemble classifier that combines the predictions of the sequence, structure, and PSSM-based classifiers and gives the best prediction performance, with an AUC of 0.81.
We construct the Protein-RNA Interaction Database, PRIDB, a comprehensive collection of all protein-RNA complexes in the PDB. PRIDB focuses on characterizing the molecular interaction at the protein-RNA interface in terms of van der Waals contacts, direct hydrogen bonds, and water-mediated hydrogen bonds. We perform an extensive analysis of the RNA-binding characteristics of a non-redundant dataset of 181 proteins to determine general characteristics of protein-RNA binding sites. We find that the overall interaction propensities for Watson-Crick paired nucleotides and non-Watson-Crick paired nucleotides are very similar, with the propensities for amino acids binding to single-stranded nucleotides showing more differences. We find that van der Waals contacts are more numerous than hydrogen bonds and that amino acids interact with RNA through their side chain atoms more frequently than through their main chain atoms. We also find that contacts to the RNA base are not as frequent as contacts to the RNA backbone.
Together, the prediction and characterization presented in this dissertation have increased our understanding of how proteins and RNA interact.
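The AUC figures quoted above can be read as the probability that a randomly chosen binding residue receives a higher score than a randomly chosen non-binding residue. The sketch below computes AUC that way on toy data. The abstract does not say how the ensemble combines its members, so the simple score averaging in `average_ensemble` is an assumption made for illustration only, and all function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_ensemble(*score_lists):
    """ASSUMPTION: combine classifiers by averaging their per-residue scores.

    The dissertation's actual combination rule is not stated in the abstract.
    """
    return [sum(column) / len(column) for column in zip(*score_lists)]


if __name__ == "__main__":
    # Toy per-residue labels (1 = RNA-binding) and scores from three
    # hypothetical classifiers; these are not data from the dissertation.
    labels = [1, 1, 0, 0]
    seq_scores = [0.9, 0.4, 0.6, 0.1]
    struct_scores = [0.8, 0.6, 0.5, 0.2]
    pssm_scores = [0.7, 0.7, 0.3, 0.3]
    combined = average_ensemble(seq_scores, struct_scores, pssm_scores)
    print(roc_auc(labels, seq_scores), roc_auc(labels, combined))
```

The pairwise loop is quadratic in the number of residues, which is fine for illustration; a rank-based formulation would be preferable on full datasets.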
Identifying Interaction Sites in "Recalcitrant" Proteins: Predicted Protein and RNA Binding Sites in Rev Proteins of HIV-1 and EIAV Agree with Experimental Data
Protein-protein and protein-nucleic acid interactions are vitally important for a wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses. We have developed machine learning approaches for predicting which amino acids of a protein participate in its interactions with other proteins and/or nucleic acids, using only the protein sequence as input. In this paper, we describe an application of classifiers trained on datasets of well-characterized protein-protein and protein-RNA complexes for which experimental structures are available. We apply these classifiers to the problem of predicting protein and RNA binding sites in the sequence of a clinically important protein for which the structure is not known: the regulatory protein Rev, essential for the replication of HIV-1 and other lentiviruses. We compare our predictions with published biochemical, genetic and partial structural information for HIV-1 and EIAV Rev and with our own published experimental mapping of RNA binding sites in EIAV Rev. The predicted and experimentally determined binding sites are in very good agreement. The ability to predict reliably the residues of a protein that directly contribute to specific binding events - without the requirement for structural information regarding either the protein or complexes in which it participates - can potentially generate new disease intervention strategies.
Comment: Pacific Symposium on Biocomputing, Hawaii, In press, Accepted, 200
Predicting DNA-binding sites of proteins from amino acid sequence
BACKGROUND: Understanding the molecular details of protein-DNA interactions is critical for deciphering the mechanisms of gene regulation. We present a machine learning approach for the identification of amino acid residues involved in protein-DNA interactions.
RESULTS: We start with a Naïve Bayes classifier trained to predict whether a given amino acid residue is a DNA-binding residue based on its identity and the identities of its sequence neighbors. The input to the classifier consists of the identities of the target residue and 4 sequence neighbors on each side of the target residue. The classifier is trained and evaluated (using leave-one-out cross-validation) on a non-redundant set of 171 proteins. Our results indicate the feasibility of identifying interface residues based on local sequence information. The classifier achieves 71% overall accuracy with a correlation coefficient of 0.24, 35% specificity and 53% sensitivity in identifying interface residues as evaluated by leave-one-out cross-validation. We show that the performance of the classifier is improved by using sequence entropy of the target residue (the entropy of the corresponding column in multiple alignment obtained by aligning the target sequence with its sequence homologs) as additional input. The classifier achieves 78% overall accuracy with a correlation coefficient of 0.28, 44% specificity and 41% sensitivity in identifying interface residues. Examination of the predictions in the context of 3-dimensional structures of proteins demonstrates the effectiveness of this method in identifying DNA-binding sites from sequence information. In 33% (56 out of 171) of the proteins, the classifier identifies the interaction sites by correctly recognizing at least half of the interface residues. In 87% (149 out of 171) of the proteins, the classifier correctly identifies at least 20% of the interface residues. This suggests the possibility of using such classifiers to identify potential DNA-binding motifs and to gain potentially useful insights into sequence correlates of protein-DNA interactions.
CONCLUSION: Naïve Bayes classifiers trained to identify DNA-binding residues using sequence information offer a computationally efficient approach to identifying putative DNA-binding sites in DNA-binding proteins and recognizing potential DNA-binding motifs.
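The window-based input described above (the target residue plus 4 sequence neighbors on each side, giving 9 positions) can be sketched as a small position-wise Naive Bayes model. This is an illustration under stated assumptions, not the authors' implementation: the padding symbol for sequence ends, the add-one smoothing, and the class and function names are all hypothetical, and the entropy feature from the abstract is not included.

```python
import math
from collections import defaultdict

FLANK = 4   # 4 neighbors on each side of the target residue, as in the abstract
PAD = "-"   # ASSUMPTION: padding symbol for positions beyond the sequence ends
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + PAD


def windows(sequence, flank=FLANK):
    """Return one window of length 2*flank+1 centered on each residue."""
    padded = PAD * flank + sequence + PAD * flank
    return [padded[i:i + 2 * flank + 1] for i in range(len(sequence))]


class WindowNaiveBayes:
    """Naive Bayes over window positions with add-one smoothing (assumed)."""

    def __init__(self, flank=FLANK):
        self.flank = flank
        self.class_counts = defaultdict(int)
        # counts[label][position][residue] -> number of training windows
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def fit(self, sequences, label_strings):
        """label_strings use '1' for interface residues and '0' otherwise."""
        for seq, labels in zip(sequences, label_strings):
            for window, label in zip(windows(seq, self.flank), labels):
                self.class_counts[label] += 1
                for pos, residue in enumerate(window):
                    self.counts[label][pos][residue] += 1
        return self

    def log_posteriors(self, window):
        """Unnormalized log posterior of each class for one window."""
        total = sum(self.class_counts.values())
        result = {}
        for label, n in self.class_counts.items():
            log_p = math.log(n / total)
            for pos, residue in enumerate(window):
                seen = self.counts[label][pos][residue]
                log_p += math.log((seen + 1) / (n + len(ALPHABET)))
            result[label] = log_p
        return result

    def predict(self, sequence):
        """Return a label string with one predicted label per residue."""
        predictions = []
        for window in windows(sequence, self.flank):
            scores = self.log_posteriors(window)
            predictions.append(max(scores, key=scores.get))
        return "".join(predictions)


if __name__ == "__main__":
    # Toy training data, not from the paper.
    model = WindowNaiveBayes().fit(["KKKKKKKKKK", "AAAAAAAAAA"],
                                   ["1111111111", "0000000000"])
    print(model.predict("KKAAK"))
```

Leave-one-out cross-validation over proteins, as used in the abstract, would amount to calling `fit` on all proteins but one and `predict` on the held-out protein, repeated for each protein.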
Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art
Background: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition "code" that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction.
Results: We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues.
Conclusions: Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.
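The remark that residue-level and protein-level performance can differ is easy to see with the Matthews Correlation Coefficient. The sketch below uses the standard MCC formula and contrasts two ways of aggregating it: pooling all residues before scoring versus scoring each protein and averaging. The confusion counts are invented for illustration, the function names are hypothetical, and returning 0 for a zero denominator is a common convention rather than something taken from the article.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


def residue_level_mcc(per_protein_counts):
    """Pool the residues of all proteins, then compute a single MCC."""
    tp, tn, fp, fn = (sum(column) for column in zip(*per_protein_counts))
    return mcc(tp, tn, fp, fn)


def protein_level_mcc(per_protein_counts):
    """Compute MCC per protein, then average over proteins."""
    values = [mcc(*counts) for counts in per_protein_counts]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical (tp, tn, fp, fn) counts for a large and a small protein.
    counts = [(8, 80, 2, 10), (1, 5, 3, 1)]
    print(round(residue_level_mcc(counts), 3))
    print(round(protein_level_mcc(counts), 3))
```

Pooling lets large proteins dominate the score, while per-protein averaging weights every protein equally, which is why the two numbers can diverge.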
RNABindR: a server for analyzing and predicting RNA-binding sites in proteins
Understanding interactions between proteins and RNA is key to deciphering the mechanisms of many important biological processes. Here we describe RNABindR, a web-based server that identifies and displays RNA-binding residues in known protein-RNA complexes and predicts RNA-binding residues in proteins of unknown structure. RNABindR uses a distance cutoff to identify which amino acids contact RNA in solved complex structures (from the Protein Data Bank) and provides a labeled amino acid sequence and a Jmol graphical viewer in which RNA-binding residues are displayed in the context of the three-dimensional structure. Alternatively, RNABindR can use a Naive Bayes classifier trained on a non-redundant set of protein-RNA complexes from the PDB to predict which amino acids in a protein sequence of unknown structure are most likely to bind RNA. RNABindR automatically displays "high specificity" and "high sensitivity" predictions of RNA-binding residues. RNABindR is freely available at http://bindr.gdcb.iastate.edu/RNABindR
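The distance-cutoff labeling described above can be sketched as follows: a residue counts as RNA-binding if any of its atoms lies within the cutoff of any RNA atom. This is a minimal illustration, not RNABindR's code. The abstract does not state the cutoff value, so the 5.0 used in the example is a placeholder assumption, and the input layout, the `+`/`-` labels, and the function names are hypothetical.

```python
import math


def rna_binding_residues(protein_atoms, rna_atoms, cutoff):
    """Return the set of residue ids with any atom within `cutoff` of RNA.

    protein_atoms: list of (residue_id, (x, y, z)) tuples
    rna_atoms:     list of (x, y, z) tuples
    cutoff:        distance threshold in the same units as the coordinates
    """
    binding = set()
    for residue_id, position in protein_atoms:
        if residue_id in binding:
            continue  # residue already labeled by another of its atoms
        if any(math.dist(position, rna) <= cutoff for rna in rna_atoms):
            binding.add(residue_id)
    return binding


def label_sequence(sequence, binding, first_id=1):
    """Mark each residue '+' (binding) or '-' (non-binding)."""
    ids = range(first_id, first_id + len(sequence))
    return "".join("+" if i in binding else "-" for i in ids)


if __name__ == "__main__":
    # Toy coordinates, not taken from any PDB entry.
    protein = [(1, (0.0, 0.0, 0.0)), (2, (10.0, 0.0, 0.0)), (3, (2.0, 0.0, 0.0))]
    rna = [(0.0, 0.0, 4.0)]
    contacts = rna_binding_residues(protein, rna, cutoff=5.0)  # placeholder cutoff
    print(label_sequence("KGR", contacts))
```

Because the labels depend directly on `cutoff`, changing it changes which residues count as interface residues, and with them any classifier trained on those labels.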
Assessing dose rate distributions in VMAT plans
Dose rate is an essential factor in radiobiology. As modern radiotherapy delivery techniques such as volumetric modulated arc therapy (VMAT) introduce dynamic modulation of the dose rate, it is important to assess the changes in dose rate. Both the rate of monitor units per minute (MU rate) and collimation are varied over the course of a fraction, leading to different dose rates in every voxel of the calculation volume at any point in time during dose delivery. Given the radiotherapy plan and machine specific limitations, a VMAT treatment plan can be split into arc sectors between Digital Imaging and Communications in Medicine control points (CPs) of constant and known MU rate. By calculating dose distributions in each of these arc sectors independently and multiplying them with the MU rate, the dose rate in every single voxel at every time point during the fraction can be calculated. Independently calculated and then summed dose distributions per arc sector were compared to the whole arc dose calculation for validation. Dose measurements and video analysis were performed to validate the calculated datasets. A clinical head and neck, cranial and liver case were analyzed using the tool developed. Measurement validation of synthetic test cases showed linac agreement to precalculated arc sector times within ±0.4 s and doses within ±0.1 MU (one standard deviation). Two methods for the visualization of dose rate datasets were developed: the first method plots a two-dimensional (2D) histogram of the number of voxels receiving a given dose rate over the course of the arc treatment delivery. Analogous to the dose display of a treatment planning system, the second method displays the dose rate as color wash on top of the corresponding computed tomography image, allowing the user to scroll through the variation over time. Examining clinical cases showed dose rates spread over a continuous spectrum, with mean dose rates hardly exceeding 100 cGy min⁻¹ for conventional fractionation. A tool to analyze dose rate distributions in VMAT plans with sub-second accuracy was successfully developed and validated. Dose rates encountered in clinical VMAT test cases show a continuous spectrum with a mean less than or near 100 cGy min⁻¹ for conventional fractionation.
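The per-sector calculation described above can be sketched numerically. One reading of the abstract, which is an assumption here, is that each arc sector has a per-voxel dose, a number of monitor units, and a constant MU rate, so that the sector lasts MU / MU rate minutes and the voxel dose rate is the sector dose divided by that duration. The function names and the toy numbers are hypothetical, and the binning is only a one-sector slice of the 2D histogram the abstract mentions.

```python
def sector_dose_rates(sector_dose, sector_mu, mu_rate):
    """Per-voxel dose rate (cGy/min) within one arc sector.

    sector_dose: dose per voxel from this sector alone (cGy)
    sector_mu:   monitor units delivered in the sector (MU)
    mu_rate:     constant MU rate in the sector (MU/min)

    ASSUMPTION: dose rate = dose / duration, with duration = MU / MU rate.
    """
    if sector_mu <= 0 or mu_rate <= 0:
        raise ValueError("sector_mu and mu_rate must be positive")
    return [dose * mu_rate / sector_mu for dose in sector_dose]


def sector_duration_s(sector_mu, mu_rate):
    """Duration of the sector in seconds."""
    return 60.0 * sector_mu / mu_rate


def dose_rate_histogram(rates, bin_width):
    """Count voxels per dose-rate bin; bin k covers [k*w, (k+1)*w)."""
    counts = {}
    for rate in rates:
        bin_index = int(rate // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return counts


if __name__ == "__main__":
    # Toy sector: three voxels, 10 MU delivered at 600 MU/min.
    rates = sector_dose_rates([2.0, 0.0, 1.0], sector_mu=10.0, mu_rate=600.0)
    print(rates, sector_duration_s(10.0, 600.0))
    print(dose_rate_histogram(rates, bin_width=50.0))
```

Repeating the histogram for every sector and stacking the results along the time axis would give a voxel-count map over dose rate and delivery time, in the spirit of the first visualization method.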