NR-2L: A Two-Level Predictor for Identifying Nuclear Receptor Subfamilies Based on Sequence-Derived Features

Author: DJ Mangelsdorf
GP Zhou
H Florence
H Mohabatkar
H Nakashima
JM Keller
KC Chou
KK Kandaswamy
Kuo-Chen Chou
L Altucci
M Bhasin
M Masso
M Robinson-Rechavi
Niall James Haslam
PC Mahalanobis
Pu Wang
QB Gao
RR Joshi
SF Altschul
T Cover
T Liu
T Wang
VD Gusev
W Li
W Liu
X Xiao
Xuan Xiao
Publication venue: Public Library of Science

Nuclear receptors (NRs) are one of the most abundant classes of transcriptional regulators in animals. They regulate diverse functions, such as homeostasis, reproduction, development and metabolism, which makes them a very important target for drug development. Nuclear receptors form a superfamily of phylogenetically related proteins and have been subdivided into different subfamilies according to their domain diversity. In this study, a two-level predictor, called NR-2L, was developed that identifies whether a query protein is a nuclear receptor based on its sequence information alone; if it is, the prediction automatically continues and assigns it to one of the following seven subfamilies: (1) thyroid hormone like (NR1), (2) HNF4-like (NR2), (3) estrogen like (NR3), (4) nerve growth factor IB-like (NR4), (5) fushi tarazu-F1 like (NR5), (6) germ cell nuclear factor like (NR6), and (7) knirps like (NR0). The identification is made by the fuzzy K nearest neighbor (FK-NN) classifier based on the pseudo amino acid composition formed by incorporating various physicochemical and statistical features derived from the protein sequences, such as amino acid composition, dipeptide composition, complexity factor, and low-frequency Fourier spectrum components. On low-redundancy benchmark datasets derived from NucleaRDB and UniProt, the overall success rates achieved by the jackknife test were about 93% and 89% at the first and second level, respectively. The high success rates indicate that the novel two-level predictor can be a useful vehicle for identifying NRs and their subfamilies. As a user-friendly web server, NR-2L is freely accessible at either http://icpr.jci.edu.cn/bioinfo/NR2L or http://www.jci-bioinfo.cn/NR2L. Each job submitted to NR-2L can contain up to 500 query protein sequences and finishes in less than 2 minutes; the fewer the query proteins, the shorter the run time usually is. All the program code for NR-2L is available for non-commercial purposes upon request.
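The classification step described in this abstract can be illustrated with a minimal sketch. It is not the NR-2L implementation: it uses only the plain 20-D amino acid composition (one ingredient of the pseudo amino acid composition, without the dipeptide, complexity or Fourier features), a fuzzy K-NN vote with crisp training labels, and toy sequences and labels that are purely hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq):
    """Return the 20-D amino acid composition (residue fractions) of a sequence."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def fuzzy_knn(query, train_vectors, train_labels, k=3, m=2.0):
    """Fuzzy K-NN with crisp training labels.

    Each of the k nearest neighbors votes for its class with weight
    1 / d^(2/(m-1)); the normalized votes are the class memberships.
    """
    nearest = sorted(
        (math.dist(query, vec), label)
        for vec, label in zip(train_vectors, train_labels)
    )[:k]
    votes = {}
    for d, label in nearest:
        weight = 1.0 / (d ** (2.0 / (m - 1.0)) + 1e-12)  # epsilon avoids division by zero
        votes[label] = votes.get(label, 0.0) + weight
    total = sum(votes.values())
    return {label: w / total for label, w in votes.items()}


# Hypothetical toy data, only to show the call pattern.
train_vectors = [aa_composition("AAAA"), aa_composition("CCCC")]
train_labels = ["NR", "non-NR"]
memberships = fuzzy_knn(aa_composition("AAAC"), train_vectors, train_labels, k=2)
predicted = max(memberships, key=memberships.get)  # "NR", with membership of about 0.9
```

A two-level scheme in this spirit would run one such classifier for NR versus non-NR and, only for sequences predicted as NR, a second one over the seven subfamilies.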

Crossref

Directory of Open Access Journals

PubMed Central

Predicting Anatomical Therapeutic Chemical (ATC) Classification of Drugs by Integrating Chemical-Chemical Interactions and Similarities

Author: DN Georgiou
GA Watson
GP Zhou
H Gurulingappa
H Mohabatkar
IW Althaus
J Andraos
J Lin
Kai-Yan Feng
KC Chou
Kuo-Chen Chou
L Hu
Lei Chen
M Dunkel
M Esmaeili
M Hattori
M Kanehisa
M Kuhn
Ozlem Keskin
P Jaccard
P Wang
Q Gu
R Sharan
T Huang
U Karaoz
Wei-Ming Zeng
WZ Lin
X Xiao
YD Cai
Yu-Dong Cai
ZC Wu
Publication venue: Public Library of Science
Publication date: 13/04/2012

The Anatomical Therapeutic Chemical (ATC) classification system, recommended by the World Health Organization, categorizes drugs into different classes according to their therapeutic and chemical characteristics. For a set of query compounds, how can we identify which ATC-class (or classes) they belong to? It is an important and challenging problem because the information thus obtained would be quite useful for drug development and utilization. By hybridizing the information on chemical-chemical interactions and chemical-chemical similarities, a novel method was developed for this purpose. The jackknife test on a benchmark dataset of 3,883 drug compounds showed that the overall success rate achieved by the prediction method was about 73% in identifying the drugs among the following 14 main ATC-classes: (1) alimentary tract and metabolism; (2) blood and blood forming organs; (3) cardiovascular system; (4) dermatologicals; (5) genitourinary system and sex hormones; (6) systemic hormonal preparations, excluding sex hormones and insulins; (7) anti-infectives for systemic use; (8) antineoplastic and immunomodulating agents; (9) musculoskeletal system; (10) nervous system; (11) antiparasitic products, insecticides and repellents; (12) respiratory system; (13) sensory organs; (14) various. Such a success rate is substantially higher than the roughly 7% (1 in 14) expected from random guessing. It has not escaped our notice that the current method can be straightforwardly extended to identify the drugs for their 2nd-level, 3rd-level, 4th-level, and 5th-level ATC-classifications once statistically significant benchmark data are available for these lower levels.
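The jackknife (leave-one-out) test and the random-guess baseline mentioned in this abstract can be sketched as follows. The nearest-neighbor predictor and the one-dimensional toy samples are hypothetical stand-ins, not the interaction and similarity based method of the paper; only the evaluation loop is the point here.

```python
def jackknife_success_rate(samples, labels, predict):
    """Leave each sample out in turn, predict it from the rest, and
    return the fraction of samples predicted correctly."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbor(train_x, train_y, query):
    """Hypothetical stand-in predictor: label of the closest training sample."""
    best = min(range(len(train_x)), key=lambda j: abs(train_x[j] - query))
    return train_y[best]


# Toy one-dimensional "compounds" in two well-separated classes.
samples = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = ["a", "a", "a", "b", "b", "b"]
rate = jackknife_success_rate(samples, labels, nearest_neighbor)

# With 14 equally likely classes, random guessing succeeds 1 time in 14.
random_baseline = 1 / 14  # about 0.071, i.e. roughly 7%
```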

Public Library of Science (PLOS)

Crossref

PubMed Central

FigShare

The relative roles of upper and lower tropospheric thermal contrasts and tropical influences in driving Asian summer monsoons

Author: Chou C.
Dai A.
Ho L.
Hong L.
Li H.
Sun Y.
Zhou T.
Publication venue: 'Wiley'
Publication date: 07/07/2013

Summer thermal structure and winds over Asia show a larger land-ocean thermal gradient in the upper than in the lower troposphere, implying a bigger role of the upper troposphere in driving the Asian summer monsoon circulation. Using data from atmospheric re-analyses and model simulations, we show that the land-ocean thermal contrast in the mid-upper (200-500 hPa) troposphere (TCupper) contributes about three times as much as the thermal contrast in the mid-lower (500-850 hPa) troposphere (TClower) in determining both the strength and variations of Asian summer monsoon circulations. Tropical sea surface temperature anomalies associated with the annual cycle, El Niño-Southern Oscillation, decadal changes, and global warming are all accompanied by much larger variations and changes in TCupper than in TClower, partly due to enhanced latent heating aloft from convection. The variations and changes in TCupper and TClower are highly correlated with the strength of the South Asian Summer Monsoon (SASM) and the East Asian Summer Monsoon (EASM) in their respective sectors during the past 50-60 years. In particular, the weakening of the EASM since the 1950s is caused mainly by weakening in TCupper and secondarily by weakening in TClower, both induced mainly by recent tropical surface warming, although spurious cooling over East Asia seen in reanalysis data may have enhanced this weakening. However, the strength of the SASM and EASM follows TCupper but decouples from TClower in the global warming case in the 21st century. The results suggest that TCupper plays a dominant role and provides an efficient mechanism through which tropical oceans can influence extratropical monsoons. Key Points: (1) The land-sea T gradient is more important in the upper than in the lower troposphere. (2) Tropical oceans influence extratropical monsoons through the upper troposphere. (3) Monsoon response to global warming depends on tropospheric warming patterns. ©2013 American Geophysical Union. All Rights Reserved.

MPG.PuRe

Imbalanced Multi-Modal Multi-Label Learning for Subcellular Localization Prediction of Human Proteins with Both Single and Multiple Sites

Author: A Hoglund
B Liao
CE Rasmussen
DN Georgiou
FM Li
Franca Fraternali
G Tsoumakas
GP Zhou
H Mohabatkar
H Nakashima
HB Shen
HN Lin
Hong Gu
J Ma
J Tian
J Yin
Jianjun He
JY Shi
K Imai
KC Chou
KY Lee
L Chen
L Hu
LJ Foster
LL Hu
M Esmaeili
MS Scott
O Emanuelsson
P Wang
RE Schapire
S Briesemeister
S Hua
S Mei
S Zhang
T Huang
T Liu
Wenqi Liu
WZ Lin
X Jiang
X Xiao
YH Zeng
YL Chen
Z He
Z Lu
ZC Wu
Publication venue: Public Library of Science
Publication date: 08/06/2012

It is well known that an important step toward understanding the functions of a protein is to determine its subcellular location. Although numerous prediction algorithms have been developed, most of them focus on proteins with only one location. In recent years, researchers have begun to pay attention to subcellular localization prediction for proteins with multiple sites. However, almost all existing approaches fail to take into account the correlations among locations induced by multi-site proteins, which may be important information for improving prediction accuracy for such proteins. In this paper, a new algorithm that effectively exploits the correlations among locations is proposed, using a Gaussian process model. In addition, the algorithm can realize an optimal linear combination of various feature extraction technologies and can be robust to imbalanced data sets. Experimental results on a human protein data set show that the proposed algorithm is valid and achieves better performance than existing approaches.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

Classification and Analysis of Regulatory Pathways Using Graph Property, Biochemical and Physicochemical Property, and Functional Property

Author: A Bairoch
A Barabasi
C Chen
C Klukas
C Krieger
Cathal Seoighe
CF Gao
D Chakrabarti
D Frishman
DN Georgiou
E Camon
F Chiti
G Pollastri
GF Cooper
GP Zhou
GY Zhang
H Ding
H Lin
H Mohabatkar
H Ogata
H Peng
I Althaus
I Dubchak
I Schomburg
IH Witten
J Andraos
J Cheng
JD Qiu
JM Dale
K Chou
KC Chou
Kuo-Chen Chou
L Chen
L Lu
L Yu
Lei Chen
M Chang
M Esmaeili
M Kanehisa
N Chazal
N Friedman
P Carmona-Saez
P Pharkya
Q Gu
R Caspi
RR Bouckaert
S Salzberg
SS Keerthi
T Denoeux
T Huang
Tao Huang
U Stelzl
W Buntine
X Xiao
XB Zhou
Y Cai
Y Qi
YH Zeng
YS Lobanova
Yu-Dong Cai
Z He
ZC Wu
Publication venue: Public Library of Science
Publication date: 01/01/2011

Given a regulatory pathway system consisting of a set of proteins, can we predict which pathway class it belongs to? Such a problem is closely related to the biological function of the pathway in cells and hence is quite fundamental and essential in systems biology and proteomics. This is also an extremely difficult and challenging problem due to its complexity. To address this problem, a novel approach was developed that can be used to predict query pathways among the following six functional categories: (i) “Metabolism”, (ii) “Genetic Information Processing”, (iii) “Environmental Information Processing”, (iv) “Cellular Processes”, (v) “Organismal Systems”, and (vi) “Human Diseases”. The prediction method was established through the following procedures: (i) according to the general form of pseudo amino acid composition (PseAAC), each of the pathways concerned is formulated as a 5570-D (dimensional) vector; (ii) each of the components in the 5570-D vector was derived by a series of feature extractions from the pathway system according to its graphic property, biochemical and physicochemical property, as well as functional property; (iii) the minimum redundancy maximum relevance (mRMR) method was adopted to operate the prediction. A cross-validation by the jackknife test on a benchmark dataset consisting of 146 regulatory pathways indicated that an overall success rate of 78.8% was achieved by our method in identifying query pathways among the above six classes, indicating the outcome is quite promising and encouraging. To the best of our knowledge, the current study represents the first effort in attempting to identify the type of a pathway system or its biological function. It is anticipated that our report may stimulate a series of follow-up investigations in this new and challenging area.

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

Gene ontology based transfer learning for protein subcellular localization

Author: A Bateman
A Dijk
A Hoglund
A Pierleoni
C Chen
C Leslie
DH Haft
E Marcotte
EM Zdobnov
F Corpet
FM Li
G Lanckriet
G Schneider
H Ding
H Lin
H Liu
H Rangwala
H Shen
HB Shen
J Cedano
J Schultz
J Shen
JD Qiu
K Chou
K Hofmann
K Lee
KC Chou
L Nanni
M Ashburner
M Esmaeili
M Mak
M Wang
Q Gu
Q Yang
R Apweiler
R Kuang
S Mei
S Pan
Shuigeng Zhou
Suyu Mei
T Blum
T Tung
TK Attwood
W Dai
W Huang
Wang Fei
X Jiang
X Xiao
XB Zhou
YH Zeng
YS Ding
Z Lei
Z Lu
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenated the GO terms into a flat binary vector or applied majority-vote based ensemble learning for protein subcellular localization, neither of which can estimate the individual discriminative abilities of the three aspects of gene ontology. Results: In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of the potentially false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time and space complexities are greatly reduced compared with the complicated semi-definite programming and semi-infinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate GO-TLM performance against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves a substantial accuracy improvement over the baseline models: 80.38% against 67.40% for Euk-mPLoc, an increase of 12.98%; 96.65% and 96.27% against 89.60% and 89.60% for MultiLoc-GO, increases of 7.05% and 6.67% on the MultiLoc plant and MultiLoc animal datasets, respectively; 97.14%, 95.90% and 96.85% against 83.70%, 90.10% and 85.70% for MultiLoc-GO, increases of 13.44%, 5.8% and 11.15% on the BaCelLoc plant, BaCelLoc fungi and BaCelLoc animal datasets, respectively. For the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on the BaCelLoc plant holdout, BaCelLoc fungi holdout and BaCelLoc animal holdout datasets, respectively, compared with 76%, 60.00% and 73.00% for the baseline model MultiLoc-GO, increases of 5.25%, 20.45% and 6.46%, respectively. Conclusions: Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test results show that the homology-based GO term transfer and the explicit weighting of the GO kernels substantially improve the prediction performance.
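The explicit weighting and linear merging of kernels described in this abstract might look like the sketch below. The abstract does not give the weighting formula, so the rule used here (weights proportional to each kernel's own cross-validation score, normalized to sum to 1) is an assumption, as are the function name and the toy matrices.

```python
def combine_kernels(kernels, cv_scores):
    """Merge several kernel matrices into one weighted sum.

    kernels:   list of n x n kernel matrices (lists of lists)
    cv_scores: one cross-validation score per kernel
    Assumed rule: weight_i = cv_score_i / sum(cv_scores).
    """
    total = sum(cv_scores)
    weights = [score / total for score in cv_scores]
    n = len(kernels[0])
    combined = [
        [sum(w * kernel[i][j] for w, kernel in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]
    return weights, combined


# Toy example with two 2 x 2 kernels instead of the five in the abstract.
identity_kernel = [[1.0, 0.0], [0.0, 1.0]]
ones_kernel = [[1.0, 1.0], [1.0, 1.0]]
weights, merged = combine_kernels([identity_kernel, ones_kernel], [3.0, 1.0])
# weights -> [0.75, 0.25]; merged -> [[1.0, 0.25], [0.25, 1.0]]
```

Because each weight comes from a separate cross-validation run rather than from a joint optimization, the cost grows only linearly with the number of kernels, which is consistent with the complexity argument made in the abstract.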

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Texture of Yukawa coupling matrices in general two-Higgs-doublet model

Author: Atwood D
Bowser-Chao D
Chang D
Cheng T P
Chiu S-H
Chou K C
Gill P S
Giudice G F
Glashow S L
Gupta S N
Hagiwara K (Particle Data Group)
Hall L J
Kusenko A
Wu Y L
Wu Y-L
Xiao Z-J
Yu-Feng Zhou
Zhou Y-F
Publication venue: 'IOP Publishing'
Publication date: 18/07/2003

We discuss possible parallel textures of the Yukawa coupling matrices in the general two-Higgs-doublet model (2HDM). In those textures the flavor changing neutral currents are naturally suppressed. Motivated by a phenomenologically successful texture with four texture zeros in the standard model, we propose a predictive ansatz for the Yukawa coupling matrices with the same texture in the general 2HDM. Compared with the six texture-zero based ansatz proposed by Cheng and Sher, it is in better agreement with the data on quark mixing and CP violation. The four texture-zero based ansatz predicts a different hierarchy in the Yukawa coupling matrix elements. As a consequence, in the lepton sector, the related Yukawa couplings are less constrained by the experimental upper bound on $\mu\to e\gamma$, which allows significantly larger predictions for other processes. The contributions from neutral scalar interactions to the lepton number violating decay modes $\ell\to \ell_{1}\ell_{2}\ell_{3}$ are calculated for both ansätze. It is shown that the predictions from the four texture-zero based ansatz could be two orders of magnitude greater than those from the six texture-zero based one. The branching ratios of $\mu\to 3e$ and $\tau\to 3\mu$ can reach $7.5 \times 10^{-17}$ and $1.3 \times 10^{-10}$, respectively. The predicted ratio $Br(\mu\to 3e)/Br(\tau\to 3e)$ is also larger and almost parameter independent. These differences make the two ansätze easy to distinguish in future experiments. Comment: 12 pages, 2 figures

arXiv.org e-Print Archive

Crossref

CERN Document Server

Identification of Colorectal Cancer Related Genes with mRMR and Shortest Path in Protein-Protein Interaction Network

Author: B Bakall
B Hoeft
BC Christensen
Bi-Qing Li
C Deves
C Hiranuma
CA Borgono
D Landi
D Liu
D Menendez
D Szklarczyk
DN Georgiou
DW Parsons
E Dijkstra
E Nabieva
EP Diamandis
G Lagger
G Thomas
GP Zhou
GR Howe
H Mohabatkar
H Peng
H Stohr
H Tsukahara
HE MacLean
I Niittymaki
I Ohkubo
IJ Kim
IW Althaus
J Andraos
J Cui
J Li
J Sabates-Bellver
JH Friedman
JL Huret
JR Reeves
K Hibi
K Imai
K Yu
KC Chou
KL Ng
Kuo-Chen Chou
L Castagnetta
L Chen
L Hu
LD Wood
Lei Liu
LL Hu
M Esmaeili
M Katoh
M Levesque
M Talieri
M Thangaraju
MG Catalano
ML Slattery
MS Kim
MW Medina
P Bogdanov
P Polakis
Paulo Lee Ho
Q Gu
Q Liu
R Sharan
RA Irizarry
S Jones
S Letovsky
SA Gayther
SA Johnson
SH Nagaraj
SM Lipkin
T Denoeux
T Hinoue
T Huang
T Morikawa
Tao Huang
TS Keshava Prasad
U Karaoz
W Huang da
W van Criekinge
WL Allen
X Xiao
XY Yang
Y Benjamini
Y Cai
YA Kourmpetis
YD Cai
Yu-Dong Cai
ZC Wu
Publication venue: Public Library of Science
Publication date: 01/01/2012

One of the most important and challenging problems in biomedicine and genomics is how to identify disease genes. In this study, we developed a computational method to identify colorectal cancer-related genes based on (i) gene expression profiles, and (ii) shortest path analysis of functional protein association networks. The former has long been used to select differentially expressed genes as disease genes, while the latter has been widely used to study the mechanisms of diseases. With the existing protein-protein interaction data from STRING (Search Tool for the Retrieval of Interacting Genes), a weighted functional protein association network was constructed. By means of the mRMR (Maximum Relevance Minimum Redundancy) approach, six genes were identified that can distinguish colorectal tumors from normal adjacent colonic tissues based on their gene expression profiles. Meanwhile, using the shortest path approach, we further found an additional 35 genes, of which some have been reported to be relevant to colorectal cancer and some are very likely to be relevant to it. Interestingly, the genes we identified from both the gene expression profiles and the functional protein association network include more cancer genes than the genes identified from the gene expression profiles alone. In addition, these genes also have greater functional similarity with the reported colorectal cancer genes than the genes identified from the gene expression profiles alone. All these results indicate that the method presented in this paper is quite promising. The method may become a useful tool, or at least play a complementary role to the existing methods, for identifying colorectal cancer genes. It has not escaped our notice that the method can be applied to identify the genes of other diseases as well.
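The shortest path step in this abstract can be sketched with Dijkstra's algorithm on a small weighted graph. The gene names and edge costs below are hypothetical, and how STRING confidence scores are turned into edge costs is not stated in the abstract, so the costs are simply taken as given.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative costs.

    Returns the list of nodes on a cheapest path, or None if unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry
        visited.add(node)
        if node == target:
            break
        for neighbor, cost in graph.get(node, {}).items():
            new_dist = d + cost
            if new_dist < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_dist
                prev[neighbor] = node
                heapq.heappush(heap, (new_dist, neighbor))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical network: G1 and G3 stand for genes selected from expression data.
network = {
    "G1": {"G2": 1.0, "G3": 5.0},
    "G2": {"G1": 1.0, "G3": 1.0},
    "G3": {"G1": 5.0, "G2": 1.0},
}
path = shortest_path(network, "G1", "G3")
candidates = path[1:-1]  # interior nodes on the path, here ["G2"]
```

Genes that lie on shortest paths between already selected genes, such as `G2` in this toy network, are the kind of additional candidates the abstract refers to.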

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

Thermodynamics of Competitive Molecular Channel Transport: Application to Artificial Nuclear Pores

Author: A Zilman
AB Kolomeisky
AM Berezhkovskii
CW Gardiner
D Branton
D Reguera
DG Levitt
H Zhou
Laurent Kreplak
N van Kampen
P Kohli
P Läuger
R Eisenberg
R Noble
SM Bezrukov
SM Iqbal
T Chou
T Jovanovic-Talisman
Walter Nadler
Wolfgang R. Bauer
WR Bauer
Y Astier
Publication venue: Public Library of Science
Publication date: 01/01/2010

In an analytical model, channel transport is analyzed as a function of the key parameters that determine the efficiency and selectivity of particle transport in a competitive molecular environment. These key parameters are the concentration of particles, the solvent-channel exchange dynamics, and the particle-in-channel and interparticle interactions. These parameters are explicitly related to translocation dynamics and channel occupation probability. Slowing down the exchange dynamics at the channel ends, or elevating the particle concentration, reduces the in-channel binding strength necessary to maintain maximum transport. The optimal in-channel interaction may even shift from binding to repulsion. A simple equation gives the interrelation of access dynamics and concentration at this transition point. The model is readily transferred to competitive transport of different species, each having its own in-channel affinity. Combinations of channel affinities are determined that differentially favor selectivity of certain species at the cost of others. Selectivity for a species increases if its in-channel binding enhances the species' translocation probability compared with that of the other species. Selectivity increases particularly for a wide binding site, long channels, and fast access dynamics. Recent experiments on competitive transport of in-channel binding and inert molecules through artificial nuclear pores serve as a paradigm for our model. It explains qualitatively and quantitatively how binding molecules are favored for transport at the cost of the transport of inert molecules.

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

Juelich Shared Electronic Resources

Online-Publikations-Server der Universität Würzburg

Semi-supervised protein subcellular localization

Author: A Blum
A Levin
A Pierleoni
A Reinhardt
A Sarkar
B JD
C Yu
C Zhang
CJL Chine-Sheng Yu
D Xie
Derek Hao Hu
ECY Su
G Zhou
H Nakashima
HB Shen
Hong Xue
I Bahar
J Gardy
J Wang
K Chou
K Nakai
K Nigam
K Park
L Breiman
M Bhasin
M Claros
M Li
O Emanuelsson
P Horton
Qian Xu
Qiang Yang
R Luo
R Nair
RPC Nair
S Hua
S Muskal
T Guo
T Joachims
TK Ho
W Liu
Weichuan Yu
X Zhu
Y Cai
Y Freund
Y Huang
Z Lu
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Protein subcellular localization is concerned with predicting the location of a protein within a cell using computational methods. The location information can indicate key functionalities of proteins. Accurate predictions of the subcellular localization of proteins can aid the prediction of protein function and genome annotation, as well as the identification of drug targets. Computational methods based on machine learning, such as support vector machine approaches, have already been widely used in the prediction of protein subcellular localization. However, a major drawback of these machine learning-based approaches is that a large amount of data must be labeled for the prediction system to learn a classifier with good generalization ability. In real-world cases, it is laborious, expensive and time-consuming to experimentally determine the subcellular localization of a protein and prepare instances of labeled data. Results: In this paper, we present an approach based on a new learning framework, semi-supervised learning, which can use far fewer labeled instances to construct a high quality prediction model. We first construct an initial classifier using a small set of labeled examples, and then use unlabeled instances to refine the classifier for future predictions. Conclusion: Experimental results show that our methods can effectively reduce the workload of labeling data by using unlabeled data. Our method is shown to improve on the state-of-the-art prediction results of SVM classifiers by more than 10%.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central