Search CORE

10 research outputs found

Semi-supervised protein subcellular localization

Author: A Blum
A Levin
A Pierleoni
A Reinhardt
A Sarkar
B JD
C Yu
C Zhang
C Zhang
CJL Chine-Sheng Yu
D Xie
Derek Hao Hu
ECY Su
G Zhou
G Zhou
G Zhou
H Nakashima
HB Shen
Hong Xue
I Bahar
J Gardy
J Wang
K Chou
K Chou
K Chou
K Chou
K Chou
K Chou
K Nakai
K Nigam
K Park
L Breiman
L Breiman
L Breiman
M Bhasin
M Claros
M Li
O Emanuelsson
P Horton
Qian Xu
Qiang Yang
R Luo
R Nair
R Nair
RPC Nair
S Hua
S Muskal
T Guo
T Joachims
T Joachims
TK Ho
W Liu
Weichuan Yu
X Zhu
Y Cai
Y Cai
Y Freund
Y Huang
Z Lu
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Protein subcellular localization is concerned with predicting the location of a protein within a cell using computational methods. The location information can indicate key functionalities of proteins. Accurate prediction of the subcellular localization of proteins can aid the prediction of protein function and genome annotation, as well as the identification of drug targets. Computational methods based on machine learning, such as support vector machine approaches, have already been widely used in the prediction of protein subcellular localization. A major drawback of these machine learning-based approaches, however, is that a large amount of data must be labeled for the prediction system to learn a classifier with good generalization ability. In real-world cases, it is laborious, expensive and time-consuming to experimentally determine the subcellular localization of a protein and prepare instances of labeled data. Results: In this paper, we present an approach based on a new learning framework, semi-supervised learning, which can use far fewer labeled instances to construct a high-quality prediction model. We first construct an initial classifier using a small set of labeled examples, and then use unlabeled instances to refine the classifier for future predictions. Conclusion: Experimental results show that our methods can effectively reduce the workload of labeling data by using the unlabeled data. Our method is shown to enhance the state-of-the-art prediction results of SVM classifiers by more than 10%.
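The two-step procedure in this abstract (train on a small labeled set, then refine with unlabeled instances) resembles the generic self-training pattern. The following is a minimal sketch of that pattern only, not the authors' method: a nearest-centroid classifier stands in for the SVM mentioned in the abstract, and the function names, the `min_margin` confidence threshold, and the toy 2-D feature vectors are all assumptions made for illustration.

```python
from math import dist


def centroids(points, labels):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def predict(cents, p):
    """Return (label, margin); margin is the gap between the two nearest centroids."""
    ranked = sorted((dist(p, c), y) for y, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled, labels, unlabeled, min_margin=1.0, max_rounds=10):
    """Hypothetical self-training loop: pseudo-label confident points, then refit."""
    points, ys, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(points, ys)
        accepted = []
        for p in pool:
            y, margin = predict(cents, p)
            if margin >= min_margin:  # assumed confidence rule
                accepted.append((p, y))
        if not accepted:
            break
        for p, y in accepted:
            points.append(p)
            ys.append(y)
        taken = {p for p, _ in accepted}  # feature vectors must be hashable tuples
        pool = [p for p in pool if p not in taken]
    return centroids(points, ys)


# Toy data: two labeled proteins and four unlabeled ones (made-up features).
seed_points = [(0.0, 0.0), (10.0, 10.0)]
seed_labels = ["cytoplasm", "nucleus"]
unlabeled_points = [(1.0, 1.0), (9.0, 9.0), (2.0, 1.0), (8.0, 9.0)]
model = self_train(seed_points, seed_labels, unlabeled_points)
```

Calling `predict(model, (3.0, 2.0))` then classifies a new point against the refined centroids. In a real setting the confidence rule and the base classifier would need to be chosen and validated on held-out labeled data, since self-training can reinforce early mistakes.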

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models

Author: A Blum
A Goldberg
A Höglund
Adrian Silvescu
AP Dempster
Cornelia Caragea
CS Ong
D Ron
Doina Caragea
G Camps-valls
G Casella
J Lafferty
J Lin
J Weston
J Zhang
JL Gardy
K Nigam
K Park
L Breiman
L Käll
M Belkin
M Li
M Szummer
MS Scott
ND Lawrence
O Emanuelsson
P Baldi
P Kuksa
Q Xu
T Jaakkola
T Jebara
T Joachims
TG Dietterich
Vasant Honavar
W Ansorge
X Zhu
Y Bengio
Y Grandvalet
Y Qi
Y Yuan
ZY Niu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data. Results: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) an expectation maximization (EM) based approach; and (iii) a co-training based approach to semi-supervised training of MMs, both of which make use of unlabeled data. Conclusions: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

UNT Digital Library

Identifying factors controlling protein release from combinatorial biomaterial libraries via hybrid data mining methods

Author: Broderick Scott
Li Xue
Narasimhan Balaji
Petersen Latrisha
Rajan Krishna
Publication venue: Iowa State University Digital Repository
Publication date: 10/11/2010

Polyanhydrides are a class of degradable biomaterials that have shown much promise for applications in drug and vaccine delivery. Their properties can be tailored for controlled drug release, drug/protein stability, and immune regulation (adjuvant effect). Identifying the relationship between the molecular structures of the polymers and the drug release kinetics profiles would help to understand the release mechanism and aid in the accurate prediction of drug release and the rational design of polymer-based drug carrier systems. The molecular structure descriptors that had the most impact on the release kinetics were identified using a prediction/optimization data mining approach. Using this new approach for modeling nonlinear release kinetics behavior, we determined that the descriptors which had the greatest effect on the release kinetics were the number of backbone -COO- nonconjugated bonds, the number of aromatic rings, and the number of -CH2- bonds.

Digital Repository @ Iowa State University (ISU)

Crossref

Fast subcellular localization by cascaded fusion of signal-based and homology-based methods

Author: Kung Sun-Yuan
Mak Man-Wai
Wang Wei
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means. Results: This paper proposes mitigating the computational burden of alignment-based approaches to subcellular localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal sorting signals. Then, the sequences are truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localizations are subsequently predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on perturbational discriminant analysis (PDA). Conclusions: Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can make use of the best properties of signal- and homology-based approaches and can attain an accuracy comparable to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggests that both profile creation time and profile alignment time can be reduced without significant reduction in subcellular localization accuracy. It was found that PDA enjoys a short training time compared to the conventional SVM. We expect that the method will be important for biologists conducting large-scale protein annotation and for bioinformaticians performing preliminary investigations on new algorithms that involve pairwise alignments.

The Hong Kong Polytechnic University Pao Yue-kong Library

Princeton University Open Access Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PolyU Institutional Repository

PubMed Central

The effect of organelle discovery upon sub-cellular protein localisation.

Author: Breckels L. M.
Christoforou A.
Gatto Laurent
Groen A. J.
Lilley K. S.
Trotter M. W. B.
Publication date: 01/08/2013

Prediction of protein sub-cellular localisation by employing quantitative mass spectrometry experiments is an expanding field. Several methods have led to the assignment of proteins to specific subcellular localisations by partial separation of organelles across a fractionation scheme coupled with computational analysis. Methods developed to analyse organelle data have largely employed supervised machine learning algorithms to map unannotated abundance profiles to known protein–organelle associations. Such approaches are likely to make association errors if organelle-related groupings present in experimental output are not included in data used to create a protein–organelle classifier. Currently, there is no automated way to detect organelle-specific clusters within such datasets. In order to address the above issues we adapted a phenotype discovery algorithm, originally created to filter image-based output for RNAi screens, to identify putative subcellular groupings in organelle proteomics experiments. We were able to mine datasets to a deeper level and extract interesting phenotype clusters for more comprehensive evaluation in an unbiased fashion upon application of this approach. Organelle-related protein clusters were identified beyond those sufficiently annotated for use as training data. Furthermore, we propose avenues for the incorporation of observations made into general practice for the classification of protein–organelle membership from quantitative MS experiments. Biological significance Protein sub-cellular localisation plays an important role in molecular interactions, signalling and transport mechanisms. The prediction of protein localisation by quantitative mass-spectrometry (MS) proteomics is a growing field and an important endeavour in improving protein annotation. Several such approaches use gradient-based separation of cellular organelle content to measure relative protein abundance across distinct gradient fractions. 
The distribution profiles are commonly mapped in silico to known protein–organelle associations via supervised machine learning algorithms, to create classifiers that associate unannotated proteins to specific organelles. These strategies are prone to error, however, if organelle-related groupings present in experimental output are not represented, for example owing to the lack of existing annotation, when creating the protein–organelle mapping. Here, the application of a phenotype discovery approach to LOPIT gradient-based MS data identifies candidate organelle phenotypes for further evaluation in an unbiased fashion. Software implementation and usage guidelines are provided for application to wider protein–organelle association experiments. In the wider context, semi-supervised organelle discovery is discussed as a paradigm with which to generate new protein annotations from MS-based organelle proteomics experiments. This article is part of a Special Issue entitled: New Horizons and Applications for Proteomics [EuPA 2012]

ZENODO

Prediction of eukaryotic protein subcellular multi-localisation with a combined KNN-SVM ensemble classifier

Author: Hong Kuang
Kaifa Wang
Liqi Li
Ying Wan
Yuan Zhang
Yue Zhou
Publication date: 24/04/2020

Proteins may exist in or shift among two or more different subcellular locations, and this phenomenon is closely related to biological function. It is challenging to deal with multiple locations during eukaryotic protein subcellular localisation prediction with routine methods; therefore, a reliable and automatic ensemble classifier for protein subcellular localisation is needed. We propose a new ensemble classifier that combines the KNN (K-nearest neighbour) and SVM (support vector machine) algorithms to predict the subcellular localisation of eukaryotic proteins from GO (gene ontology) annotations. This method was developed by fusing basic individual classifiers through a voting system. The overall prediction accuracies obtained for eukaryotic proteins via the jackknife test and the resubstitution test were 70.5% and 77.6%, respectively, which are significantly higher than those of other methods presented in previous studies and indicate that our strategy better predicts eukaryotic protein subcellular localisation.
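The fusion of base classifiers "through a voting system" can be illustrated with a plain plurality vote. This is a minimal sketch under stated assumptions and not the published classifier: only k-nearest-neighbour base learners (with several values of k) are included, the SVM component is omitted, and the helper names, the tie-breaking rule, and the toy 2-D features are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classifier(train, k):
    """Build a k-nearest-neighbour predictor from (features, label) pairs."""
    def classify(x):
        nearest = sorted(train, key=lambda item: dist(x, item[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return classify


def vote(classifiers, x):
    """Fuse base predictions by plurality vote; ties go to the label seen first."""
    ballots = Counter(clf(x) for clf in classifiers)
    return ballots.most_common(1)[0][0]


# Toy training set with made-up 2-D features for two locations.
train = [
    ((0.0, 0.0), "nucleus"),
    ((0.0, 1.0), "nucleus"),
    ((1.0, 0.0), "nucleus"),
    ((5.0, 5.0), "mitochondrion"),
    ((5.0, 6.0), "mitochondrion"),
    ((6.0, 5.0), "mitochondrion"),
]

# Assumed ensemble: three k-NN base classifiers that differ only in k.
ensemble = [knn_classifier(train, k) for k in (1, 3, 5)]
```

With this setup, `vote(ensemble, (0.5, 0.5))` returns the label chosen by most base classifiers. Any callable that maps a feature vector to a label, such as a trained SVM's prediction function, could be appended to `ensemble` without changing `vote`.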

CiteSeerX

An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

Author: Ana Stanescu
Doina Caragea
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Crossref

Metabolic profiling on 2D NMR TOCSY spectra using machine learning

Author: Migdadi Lubaba Yousef Hazza
Publication date: 01/01/2023

Due to the dynamic nature of biological cells, metabolic profiling plays a highly significant role in discovering the biological fingerprints of diseases and their evolution, as well as the cellular pathways of different biological or chemical stimuli. Two-dimensional nuclear magnetic resonance (2D NMR) is one of the fundamental and powerful analytical instruments for metabolic profiling. Although total correlation spectroscopy (2D NMR 1H-1H TOCSY) can be used to reduce the spectral overlap of 1D NMR, strong peak shifts, signal overlap, spectral crowding and matrix effects in complex biological mixtures are extremely challenging in 2D NMR analysis. In this work, we introduce an automated metabolic deconvolution and assignment based on the deconvolution of 2D TOCSY of real breast cancer tissue, in addition to different differentiation pathways of adipose tissue-derived human mesenchymal stem cells. As an alternative to the common approaches in NMR-based machine learning, where images of the spectra are used as input, our metabolic assignment is based only on the vertical and horizontal frequencies of metabolites in the 1H-1H TOCSY. One- and multi-class kernel null Foley–Sammon transform, support vector machines, polynomial classifier kernel density estimation, and support vector data description classifiers were tested in semi-supervised learning and novelty detection settings. The classifiers’ performance was evaluated by comparing the conventional human-based methodology and automatic assignments under different initial training size settings. The results of our novel metabolic profiling methods demonstrate their suitability, robustness, and speed in automated nontargeted NMR metabolic analysis.

Eldorado - Ressourcen aus und für Lehre, Studium und Forschung

Knowledge derivation and data mining strategies for probabilistic functional integrated networks

Author: James Katherine
Publication venue: Newcastle University
Publication date: 01/01/2012

PhD
One of the fundamental goals of systems biology is the experimental verification of the interactome: the entire complement of molecular interactions occurring in the cell. Vast amounts of high-throughput data have been produced to aid this effort. However, these data are incomplete and contain high levels of both false positives and false negatives. In order to combat these limitations in data quality, computational techniques have been developed to evaluate the datasets and integrate them in a systematic fashion using graph theory. The result is an integrated network which can be analysed using a variety of network analysis techniques to draw new inferences about biological questions and to guide laboratory experiments. Individual research groups are interested in specific biological problems and, consequently, network analyses are normally performed with regard to a specific question. However, the majority of existing data integration techniques are global and do not focus on specific areas of biology. Currently this issue is addressed by using known annotation data (such as that from the Gene Ontology) to produce process-specific subnetworks. However, this approach discards useful information and is of limited use in poorly annotated areas of the interactome. Therefore, there is a need for network integration techniques that produce process-specific networks without loss of data. The work described here addresses this requirement by extending one of the most powerful integration techniques, probabilistic functional integrated networks (PFINs), to incorporate a concept of biological relevance. Initially, the available functional data for the baker’s yeast Saccharomyces cerevisiae was evaluated to identify areas of bias and specificity which could be exploited during network integration. This information was used to develop an integration technique which emphasises interactions relevant to specific biological questions, using yeast ageing as an exemplar.
The integration method improves performance during network-based protein functional prediction in relation to this process. Further, the process-relevant networks complement classical network integration techniques and significantly improve network analysis in a wide range of biological processes. The method developed has been used to produce novel predictions for 505 Gene Ontology biological processes. Of these predictions 41,610 are consistent with existing computational annotations, and 906 are consistent with known expert-curated annotations. The approach significantly reduces the hypothesis space for experimental validation of genes hypothesised to be involved in the oxidative stress response. Therefore, incorporation of biological relevance into network integration can significantly improve network analysis with regard to individual biological questions

Newcastle University eTheses