10 research outputs found

    Semi-supervised protein subcellular localization

    Background: Protein subcellular localization is concerned with predicting the location of a protein within a cell using computational methods. The location information can indicate key functionalities of proteins. Accurate predictions of the subcellular localizations of proteins can aid the prediction of protein function and genome annotation, as well as the identification of drug targets. Computational methods based on machine learning, such as support vector machine approaches, have already been widely used in the prediction of protein subcellular localization. However, a major drawback of these machine learning-based approaches is that a large amount of data must be labeled for the prediction system to learn a classifier with good generalization ability. In real-world cases, it is laborious, expensive and time-consuming to experimentally determine the subcellular localization of a protein and prepare instances of labeled data. Results: In this paper, we present an approach based on a new learning framework, semi-supervised learning, which can use far fewer labeled instances to construct a high-quality prediction model. We first construct an initial classifier using a small set of labeled examples, and then use unlabeled instances to refine the classifier for future predictions. Conclusion: Experimental results show that our methods can effectively reduce the workload of labeling data by using the unlabeled data. Our method is shown to improve on the state-of-the-art prediction results of SVM classifiers by more than 10%.
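
The abstract above describes building an initial classifier from a few labeled examples and then refining it with unlabeled instances. One common way to do this is self-training, sketched below. This is an assumption for illustration only: the abstract does not say which semi-supervised algorithm the authors use, and the nearest-centroid base classifier, the function names, and the `min_margin` threshold are all hypothetical stand-ins (the paper's baseline is an SVM).

```python
from math import dist

def centroids(points, labels):
    """Mean vector per class; a simple stand-in for any base classifier."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(model, p):
    """Return (label, margin); margin is the distance gap between the two nearest centroids."""
    ranked = sorted((dist(p, c), y) for y, c in model.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin

def self_train(labeled, labels, unlabeled, min_margin=1.0, max_rounds=10):
    """Assumed self-training loop: pseudo-label confident points, then refit."""
    points, ys, pool = list(labeled), list(labels), list(unlabeled)
    model = centroids(points, ys)              # initial classifier from labeled data only
    for _ in range(max_rounds):
        confident, rest = [], []
        for p in pool:
            y, margin = predict(model, p)
            (confident if margin >= min_margin else rest).append((p, y))
        if not confident:                      # nothing left that the model is sure about
            break
        for p, y in confident:                 # adopt pseudo-labels as training data
            points.append(p)
            ys.append(y)
        pool = [p for p, _ in rest]
        model = centroids(points, ys)          # refine the classifier
    return model

# Toy usage with made-up 2-D feature vectors and two location labels.
model = self_train(
    labeled=[(0.0, 0.0), (10.0, 10.0)],
    labels=["cytoplasm", "nucleus"],
    unlabeled=[(1.0, 1.0), (9.0, 9.0), (2.0, 1.0), (8.0, 9.0)],
)
print(predict(model, (3.0, 2.0))[0])
```

The margin threshold controls how aggressively pseudo-labels are accepted; a low threshold adds more unlabeled data per round but risks reinforcing early mistakes.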

    Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models

    Background: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data. Results: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) expectation maximization (EM) based; and (iii) co-training based approaches to semi-supervised training of MMs, both of which make use of unlabeled data. Conclusions: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.

    Identifying factors controlling protein release from combinatorial biomaterial libraries via hybrid data mining methods

    Polyanhydrides are a class of degradable biomaterials that have shown much promise for applications in drug and vaccine delivery. Their properties can be tailored for controlled drug release, drug/protein stability, and immune regulation (adjuvant effect). Identifying the relationship between the molecular structures of the polymers and the drug release kinetics profiles would help understand the release mechanism and aid in the accurate prediction of drug release and the rational design of polymer-based drug carrier systems. The molecular structure descriptors that had the most impact on the release kinetics were identified using a prediction/optimization data mining approach. Using this new approach for modeling nonlinear release kinetics behavior, we determined that the descriptors which had the greatest effect on the release kinetics were the number of backbone -COO- nonconjugated bonds, the number of aromatic rings, and the number of -CH2- bonds.

    Fast subcellular localization by cascaded fusion of signal-based and homology-based methods

    Background: The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means. Results: This paper proposes mitigating the computational burden of alignment-based approaches to subcellular localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal sorting signals. Then, the sequences are truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localizations are subsequently predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on perturbational discriminant analysis (PDA). Conclusions: Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can make use of the best properties of signal- and homology-based approaches and can attain an accuracy comparable to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggests that both profile creation time and profile alignment time can be reduced without significant reduction in subcellular localization accuracy. It was found that PDA enjoys a short training time compared to the conventional SVM. We expect the method to be important for biologists conducting large-scale protein annotation or for bioinformaticians performing preliminary investigations on new algorithms that involve pairwise alignments.

    The effect of organelle discovery upon sub-cellular protein localisation.

    Prediction of protein sub-cellular localisation by employing quantitative mass spectrometry experiments is an expanding field. Several methods have led to the assignment of proteins to specific subcellular localisations by partial separation of organelles across a fractionation scheme coupled with computational analysis. Methods developed to analyse organelle data have largely employed supervised machine learning algorithms to map unannotated abundance profiles to known protein–organelle associations. Such approaches are likely to make association errors if organelle-related groupings present in experimental output are not included in data used to create a protein–organelle classifier. Currently, there is no automated way to detect organelle-specific clusters within such datasets. In order to address the above issues we adapted a phenotype discovery algorithm, originally created to filter image-based output for RNAi screens, to identify putative subcellular groupings in organelle proteomics experiments. We were able to mine datasets to a deeper level and extract interesting phenotype clusters for more comprehensive evaluation in an unbiased fashion upon application of this approach. Organelle-related protein clusters were identified beyond those sufficiently annotated for use as training data. Furthermore, we propose avenues for the incorporation of observations made into general practice for the classification of protein–organelle membership from quantitative MS experiments. Biological significance: Protein sub-cellular localisation plays an important role in molecular interactions, signalling and transport mechanisms. The prediction of protein localisation by quantitative mass-spectrometry (MS) proteomics is a growing field and an important endeavour in improving protein annotation. Several such approaches use gradient-based separation of cellular organelle content to measure relative protein abundance across distinct gradient fractions.
The distribution profiles are commonly mapped in silico to known protein–organelle associations via supervised machine learning algorithms, to create classifiers that associate unannotated proteins to specific organelles. These strategies are prone to error, however, if organelle-related groupings present in experimental output are not represented, for example owing to the lack of existing annotation, when creating the protein–organelle mapping. Here, the application of a phenotype discovery approach to LOPIT gradient-based MS data identifies candidate organelle phenotypes for further evaluation in an unbiased fashion. Software implementation and usage guidelines are provided for application to wider protein–organelle association experiments. In the wider context, semi-supervised organelle discovery is discussed as a paradigm with which to generate new protein annotations from MS-based organelle proteomics experiments. This article is part of a Special Issue entitled: New Horizons and Applications for Proteomics [EuPA 2012]

    Prediction of eukaryotic protein subcellular multi-localisation with a combined KNN-SVM ensemble classifier

    Proteins may exist in or shift among two or more different subcellular locations, and this phenomenon is closely related to biological function. It is challenging to deal with multiple locations during eukaryotic protein subcellular localisation prediction with routine methods; therefore, a reliable and automatic ensemble classifier for protein subcellular localisation is needed. We propose a new ensemble classifier that combines the KNN (K-nearest neighbour) and SVM (support vector machine) algorithms to predict the subcellular localisation of eukaryotic proteins from their GO (gene ontology) annotations. This method was developed by fusing basic individual classifiers through a voting system. The overall prediction accuracies obtained for eukaryotic proteins via the jackknife test and the resubstitution test were 70.5% and 77.6%, respectively, which are significantly higher than those of other methods presented in previous studies and indicate that our strategy better predicts eukaryotic protein subcellular localisation.
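
The fusion of individual classifiers through a voting system, as mentioned in the abstract above, can be illustrated with a plain majority vote. This is only a sketch under assumptions: the abstract does not specify the voting rule, and the `vote` function and its tie-breaking behaviour (ties go to the classifier listed first) are hypothetical.

```python
from collections import Counter

def vote(predictions):
    """Majority vote over the labels proposed by several base classifiers.

    Ties are broken in favour of the earliest classifier in the list
    (an assumed rule, chosen here only to keep the result deterministic).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

# Toy usage: made-up outputs of three base classifiers for one protein.
print(vote(["nucleus", "cytoplasm", "nucleus"]))
```

A weighted vote, where each base classifier's ballot counts in proportion to its validation accuracy, would be a natural alternative design.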

    An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets


    Metabolic profiling on 2D NMR TOCSY spectra using machine learning

    Due to the dynamicity of biological cells, the role of metabolic profiling in discovering biological fingerprints of diseases and their evolution, as well as the cellular pathways of different biological or chemical stimuli, is most significant. Two-dimensional nuclear magnetic resonance (2D NMR) is one of the fundamental and strong analytical instruments for metabolic profiling. Although total correlation spectroscopy (2D NMR 1H-1H TOCSY) can be used to reduce the spectral overlap of 1D NMR, strong peak shift, signal overlap, spectral crowding and matrix effects in complex biological mixtures are extremely challenging in 2D NMR analysis. In this work, we introduce an automated metabolic deconvolution and assignment based on the deconvolution of 2D TOCSY of real breast cancer tissue, in addition to different differentiation pathways of adipose tissue-derived human mesenchymal stem cells. As a major alternative to the common approaches in NMR-based machine learning, where images of the spectra are used as input, our metabolic assignment is based only on the vertical and horizontal frequencies of metabolites in the 1H-1H TOCSY. One- and multi-class kernel null Foley–Sammon transform, support vector machines, polynomial classifier kernel density estimation, and support vector data description classifiers were tested in semi-supervised learning and novelty detection settings. The classifiers' performance was evaluated by comparing the conventional human-based methodology and automatic assignments under different initial training size settings. The results of our novel metabolic profiling methods demonstrate their suitability, robustness, and speed in automated nontargeted NMR metabolic analysis.

    Knowledge derivation and data mining strategies for probabilistic functional integrated networks

    PhD. One of the fundamental goals of systems biology is the experimental verification of the interactome: the entire complement of molecular interactions occurring in the cell. Vast amounts of high-throughput data have been produced to aid this effort. However, these data are incomplete and contain high levels of both false positives and false negatives. In order to combat these limitations in data quality, computational techniques have been developed to evaluate the datasets and integrate them in a systematic fashion using graph theory. The result is an integrated network which can be analysed using a variety of network analysis techniques to draw new inferences about biological questions and to guide laboratory experiments. Individual research groups are interested in specific biological problems and, consequently, network analyses are normally performed with regard to a specific question. However, the majority of existing data integration techniques are global and do not focus on specific areas of biology. Currently this issue is addressed by using known annotation data (such as that from the Gene Ontology) to produce process-specific subnetworks. However, this approach discards useful information and is of limited use in poorly annotated areas of the interactome. Therefore, there is a need for network integration techniques that produce process-specific networks without loss of data. The work described here addresses this requirement by extending one of the most powerful integration techniques, probabilistic functional integrated networks (PFINs), to incorporate a concept of biological relevance. Initially, the available functional data for the baker's yeast Saccharomyces cerevisiae was evaluated to identify areas of bias and specificity which could be exploited during network integration. This information was used to develop an integration technique which emphasises interactions relevant to specific biological questions, using yeast ageing as an exemplar.
The integration method improves performance during network-based protein functional prediction in relation to this process. Further, the process-relevant networks complement classical network integration techniques and significantly improve network analysis in a wide range of biological processes. The method developed has been used to produce novel predictions for 505 Gene Ontology biological processes. Of these predictions 41,610 are consistent with existing computational annotations, and 906 are consistent with known expert-curated annotations. The approach significantly reduces the hypothesis space for experimental validation of genes hypothesised to be involved in the oxidative stress response. Therefore, incorporation of biological relevance into network integration can significantly improve network analysis with regard to individual biological questions