620 research outputs found

    An SVM-based system for predicting protein subnuclear localizations

    BACKGROUND: The large gap between the number of protein sequences in databases and the number of functionally characterized proteins calls for fast, generally applicable computational tools for predicting the subnuclear and subcellular localizations of protein sequences. Information on localization may reveal the molecular function of novel proteins, in addition to providing insight into the biological pathways in which they function. The bulk of past work has focused on protein subcellular localizations. Furthermore, no specific tool has been dedicated to prediction at the subnuclear level, despite its high importance. The key to designing a suitable predictive system is the extraction of subtle sequence signals that can discriminate among proteins with different subnuclear localizations. RESULTS: New kernel functions used in a support vector machine (SVM) learning model are introduced for the measurement of sequence similarity. The k-peptide vectors are first mapped by a matrix of high-scoring pairs of k-peptides, which are measured by BLOSUM62 scores. The kernels, measuring the similarity of sequences, are then defined on the mapped vectors. By combining these new encoding methods, a multi-class classification system for the prediction of protein subnuclear localizations is established for the first time. The performance of the system is evaluated with a set of proteins collected in the Nuclear Protein Database (NPD). The overall accuracy of prediction for 6 localizations is about 50% (vs. 16.7% for random prediction) for single-localization proteins in leave-one-out cross-validation, and 65% for an independent set of multi-localization proteins. This integrated system can be accessed at . CONCLUSION: The integrated system benefits from the combination of predictions from several SVMs based on selected encoding methods. Finally, the predictive power of the system is expected to improve as more proteins with known subnuclear localizations become available.
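
    The idea of comparing sequences through k-peptide vectors can be illustrated with a plain spectrum kernel. The sketch below is not the kernel described in the abstract: it omits the BLOSUM62-based mapping matrix and simply takes the dot product of raw k-peptide counts. The function names `kpeptide_counts` and `spectrum_kernel`, the choice k = 2, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def kpeptide_counts(seq, k=2):
    """Count every overlapping k-peptide (length-k substring) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=2):
    """Dot product of the two k-peptide count vectors (no BLOSUM62 mapping)."""
    counts_a = kpeptide_counts(seq_a, k)
    counts_b = kpeptide_counts(seq_b, k)
    # A Counter returns 0 for k-peptides that are absent from seq_b.
    return sum(n * counts_b[pep] for pep, n in counts_a.items())


# Toy sequences (hypothetical): they share the dipeptides "KK" and "KL".
print(spectrum_kernel("MKKL", "KKLM"))  # 2

# Chance level for 6 equally likely localizations, as quoted in the abstract.
print(round(100 / 6, 1))  # 16.7
```

    A kernel of this kind could be passed to an SVM as a precomputed Gram matrix; the mapped kernels of the paper would replace the raw counts with vectors transformed by the high-scoring k-peptide pair matrix.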

    The proteins of intra-nuclear bodies: a data-driven analysis of sequence, interaction and expression

    BACKGROUND: Cajal bodies, nucleoli, PML nuclear bodies, and nuclear speckles are morphologically distinct intra-nuclear structures that dynamically respond to cellular cues. Such nuclear bodies are hypothesized to play important regulatory roles, e.g. by sequestering and releasing transcription factors in a timely manner. While the nucleolus and nuclear speckles have received more attention experimentally, the PML nuclear body and the Cajal body are still incompletely characterized in terms of their roles and protein complement. RESULTS: By collating recent experimentally verified data, we find that almost 1000 proteins in the mouse nuclear proteome are known to associate with one or more of the nuclear bodies. Their gene ontology terms highlight their regulatory roles: splicing is confirmed to be a core activity of speckles, and PML nuclear bodies house a range of proteins involved in DNA repair. We train support-vector machines to show that nuclear proteins contain discriminative sequence features that can be used to identify their intra-nuclear body associations. Prediction accuracy is highest for nucleoli and nuclear speckles. The trained models are also used to estimate the full protein complement of each nuclear body. Protein interactions are found primarily to link proteins in the nuclear speckles with proteins from other compartments. Cell cycle expression data provide support for increased activity in nucleoli, nuclear speckles and PML nuclear bodies, especially during S and G2 phases. CONCLUSIONS: The large-scale analysis of the mouse nuclear proteome sheds light on the functional organization of physically embodied intra-nuclear compartments. We observe partial support for the hypothesis that the physical organization of the nucleus mirrors functional modularity. However, we are unable to unambiguously identify proteins' intra-nuclear destination, suggesting that critical drivers behind intra-nuclear translocation are yet to be identified.

    Application of machine learning and deep learning for proteomics data analysis

    Boosting accuracy of automated classification of fluorescence microscope images for location proteomics

    BACKGROUND: Detailed knowledge of the subcellular location of each expressed protein is critical to a full understanding of its function. Fluorescence microscopy, in combination with methods for fluorescent tagging, is the most suitable current method for proteome-wide determination of subcellular location. Previous work has shown that neural network classifiers can distinguish all major protein subcellular location patterns in both 2D and 3D fluorescence microscope images. Building on these results, we evaluate here new classifiers and features to improve the recognition of protein subcellular location patterns in both 2D and 3D fluorescence microscope images. RESULTS: We report here a thorough comparison of the performance on this problem of eight different state-of-the-art classification methods, including neural networks, support vector machines with linear, polynomial, radial basis, and exponential radial basis kernel functions, and ensemble methods such as AdaBoost, Bagging, and Mixtures-of-Experts. Ten-fold cross-validation was used to evaluate each classifier with various parameters on different Subcellular Location Feature sets representing both 2D and 3D fluorescence microscope images, including new feature sets incorporating features derived from Gabor and Daubechies wavelet transforms. After optimal parameters were chosen for each of the eight classifiers, optimal majority-voting ensemble classifiers were formed for each feature set. Comparison of results for each image for all eight classifiers permits estimation of the lower bound classification error rate for each subcellular pattern, which we interpret to reflect the fraction of cells whose patterns are distorted by mitosis, cell death or acquisition errors. Overall, we obtained statistically significant improvements in classification accuracy over the best previously published results, with the overall error rate being reduced by one-third to one-half and with the average accuracy for single 2D images being higher than 90% for the first time. In particular, the classification accuracy for the easily confused endomembrane compartments (endoplasmic reticulum, Golgi, endosomes, lysosomes) was improved by 5–15%. We achieved further improvements when classification was conducted on image sets rather than on individual cell images. CONCLUSIONS: The availability of accurate, fast, automated classification systems for protein location patterns, in conjunction with high-throughput fluorescence microscope imaging techniques, enables a new subfield of proteomics, location proteomics. The accuracy and sensitivity of this approach represent an important alternative to low-resolution assignments by curation or sequence-based prediction.
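
    The majority-voting step mentioned in the results can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every classifier gets one unweighted vote, ties go to the label that appears first in the vote list, and the names `majority_vote` and `ensemble_predict` as well as the example labels are hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_classifier_labels):
    """Combine predictions image by image.

    per_classifier_labels holds one list of labels per classifier,
    all aligned so that position i refers to the same image.
    """
    return [majority_vote(votes) for votes in zip(*per_classifier_labels)]


# Three hypothetical classifiers labelling the same three images.
predictions = [
    ["ER", "Golgi", "lysosome"],
    ["ER", "ER", "endosome"],
    ["Golgi", "Golgi", "lysosome"],
]
print(ensemble_predict(predictions))  # ['ER', 'Golgi', 'lysosome']
```

    The abstract does not say how ties or classifier weights were handled, so the tie rule here is only one simple option.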

    Psiscan: a computational approach to identify H/ACA-like and AGA-like non-coding RNA in trypanosomatid genomes

    BACKGROUND: Detection of non-coding RNA (ncRNA) molecules is a major bioinformatics challenge. This challenge is particularly difficult when attempting to detect H/ACA molecules, which are involved in converting uridine to pseudouridine on rRNA, in trypanosomes, because these organisms have unique H/ACA molecules (termed H/ACA-like) that lack several of the features that characterize H/ACA molecules in most other organisms. RESULTS: We present here a computational tool called Psiscan, which was designed to detect H/ACA-like molecules in trypanosomes. We started by analyzing known H/ACA-like molecules and characterized their crucial elements both computationally and experimentally. Next, we set up constraints based on this analysis and additional phylogenic and functional data to rapidly scan three trypanosome genomes (T. brucei, T. cruzi and L. major) for sequences that observe these constraints and are conserved among the species. In the next step, we used minimal energy calculation to select the molecules that are predicted to fold into a lowest energy structure that is consistent with the constraints. In the final computational step, we used a Support Vector Machine that was trained on known H/ACA-like molecules as positive examples and on negative examples of molecules that were identified by the computational analyses but were shown experimentally not to be H/ACA-like molecules. The leading candidate molecules predicted by the SVM model were then subjected to experimental validation. CONCLUSION: The experimental validation showed 11 molecules to be expressed (4 out of 25 in the intermediate stage and 7 out of 19 in the final validation after the machine learning stage). Five of these 11 molecules were further shown to be bona fide H/ACA-like molecules. As snoRNAs in trypanosomes are organized in clusters, the new H/ACA-like molecules could be used as starting points to manually search for additional molecules in their neighbourhood. Altogether, this study increased our repertoire by fourteen H/ACA-like and six C/D snoRNA molecules from T. brucei and L. major. In addition, the experimental analysis revealed that six ncRNA molecules that are expressed are not downregulated in CBF5-silenced cells, suggesting that they have structural features of H/ACA-like molecules but do not have their standard function. We termed this novel class of molecules AGA-like, and we are exploring their function. This study demonstrates the power of tight collaboration between computational and experimental approaches in a combined effort to reveal the repertoire of ncRNA molecules.

    Achieving High Accuracy Prediction of Minimotifs

    The low complexity of minimotif patterns results in a high false-positive prediction rate, hampering protein function prediction. A multi-filter algorithm, trained and tested on a linear regression model, a support vector machine model, and a neural network model using a large dataset of verified minimotifs, vastly improves minimotif prediction accuracy while generating few false positives. An optimal threshold for the best accuracy reaches an overall accuracy above 90%, while a stringent threshold for the best specificity generates fewer than 1% false positives, or even none, and still produces more than 90% true positives for the linear regression and neural network models. The minimotif multi-filter, with its excellent accuracy, represents the state of the art in minimotif prediction and is expected to be very useful to biologists investigating protein function and how missense mutations cause disease.

    SnoReport 2.0: new features and a refined Support Vector Machine to improve snoRNA identification

    Background: snoReport uses RNA secondary structure prediction combined with machine learning as the basis to identify the two main classes of small nucleolar RNAs, the box H/ACA snoRNAs and the box C/D snoRNAs. Here, we present snoReport 2.0, which substantially improves and extends the original method by: extracting new features for both box C/D and H/ACA box snoRNAs; developing a more sophisticated technique in the SVM training phase with recent data from vertebrate organisms and a careful choice of the SVM parameters C and γ; and using updated versions of the tools and databases used for the construction of the original version of snoReport. To validate the new version and to demonstrate its improved performance, we tested snoReport 2.0 in different organisms. Results: Results of the training and test phases of box H/ACA and box C/D snoRNAs, in both versions of snoReport, are discussed. Validation on real data was performed to evaluate the predictions of snoReport 2.0. Our program was applied to a set of previously annotated sequences, some of them experimentally confirmed, of humans, nematodes, drosophilids, platypus, chickens and leishmania. We significantly improved the predictions for vertebrates, since the training phase used information from these organisms, but H/ACA box snoRNA identification was also improved for the other organisms. Conclusion: We presented snoReport 2.0, which predicts H/ACA box and C/D box snoRNAs and is an efficient method to find true positives and avoid false positives in vertebrate organisms. The H/ACA box snoRNA classifier showed an F-score of 93% (an improvement of 10% over the previous version), while the C/D box snoRNA classifier showed an F-score of 94% (an improvement of 14%). In addition, both classifiers exhibited performance measures above 90%. These results show that snoReport 2.0 avoids false positives and false negatives, allowing high-quality snoRNA prediction. In the validation phase, snoReport 2.0 predicted 67.43% of the snoRNAs of vertebrate organisms for both classes. For nematodes and drosophilids, 69% and 76.67% of H/ACA box snoRNAs were predicted, respectively, showing that snoReport 2.0 is well suited to identifying snoRNAs in vertebrates and also H/ACA box snoRNAs in invertebrate organisms.
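
    For readers unfamiliar with the F-score quoted above, the sketch below computes it from a confusion count. It assumes the balanced F1 variant (harmonic mean of precision and recall), which the abstract does not specify; the function name `f_score` and the counts in the example are invented for illustration and are not taken from the paper.

```python
def f_score(true_pos, false_pos, false_neg):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if true_pos == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 correct hits, 10 false alarms, 10 missed snoRNAs.
score = f_score(true_pos=90, false_pos=10, false_neg=10)
print(f"{score:.2f}")  # 0.90
```

    Because the F-score penalizes both false positives and false negatives, a high value is consistent with the abstract's claim that the classifiers avoid both kinds of error.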