
159 research outputs found

Design of nearest neighbor classifiers: multi-objective approach

Author: Chen Hung-Ming
Chen Jian-Hung
Ho Shinn-Ying
Publication venue: Elsevier Inc.
Publication date: 31/07/2005

Abstract: The goal of designing optimal nearest neighbor classifiers is to maximize classification accuracy while minimizing the sizes of both the reference and feature sets. A common approach is to adaptively weight the three objectives into a single objective function and then apply a single-objective optimization method. This paper proposes a multi-objective approach that spares practitioners the weight tuning problem. A novel intelligent multi-objective evolutionary algorithm (IMOEA) is used to simultaneously edit compact reference and feature sets for nearest neighbor classification. Three comparison studies are designed to evaluate the performance of the proposed approach. It is shown empirically that the IMOEA-designed classifiers have high classification accuracy and small reference and feature sets. Moreover, IMOEA can provide a set of good solutions for practitioners to choose from in a single run. The simulation results indicate that the IMOEA-based approach is an expedient method for designing nearest neighbor classifiers, compared with an existing single-objective approach.
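The idea of returning "a set of good solutions" rather than one weighted optimum can be illustrated with a plain Pareto-dominance filter. The sketch below is not IMOEA itself (the abstract does not describe its operators); it only shows, under assumed toy values, how candidate classifiers scored on the three objectives (accuracy, reference set size, feature set size) reduce to a non-dominated set. The `Candidate` tuple layout and all numbers are hypothetical.

```python
from typing import List, Tuple

# Hypothetical layout: (accuracy, reference_set_size, feature_set_size).
# Accuracy is maximized; both set sizes are minimized.
Candidate = Tuple[float, int, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on every objective and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Illustrative values only; they are not results from the paper.
candidates = [
    (0.92, 40, 10),
    (0.90, 25, 8),
    (0.90, 30, 8),   # dominated by (0.90, 25, 8): same accuracy, larger reference set
    (0.85, 25, 9),   # dominated by (0.90, 25, 8): lower accuracy, more features
    (0.95, 60, 12),
]
front = pareto_front(candidates)
print(front)
```

A practitioner would then pick one trade-off from `front` instead of tuning objective weights in advance.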

Elsevier - Publisher Connector

ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization

Author: Ho Shih-Wen
Ho Shinn-Ying
Huang Wen-Lin
Hwang Shiow-Fen
Tung Chun-Wei
Publication venue: BioMed Central
Publication date: 01/01/2008

Crossref

Springer - Publisher Connector

PubMed Central

Improving protein secondary structure prediction based on short subsequences with local structure similarity

Author: Ho Shinn-Ying
Hsu Wen-Lian
Lin Hsin-Nan
Sung Ting-Yi
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: When characterizing the structural topology of proteins, protein secondary structure (PSS) plays an important role in analyzing and modeling protein structures because it represents the local conformation of amino acids into regular structures. Although PSS prediction has been studied for decades, prediction accuracy has reached a bottleneck at around 80%, and further improvement is very difficult. Results: In this paper, we present an improved dictionary-based PSS prediction method called SymPred, and a meta-predictor called SymPsiPred. We adopt the concept behind natural language processing techniques and propose synonymous words to capture local sequence similarities in a group of similar proteins. A synonymous word is an n-gram pattern of amino acids that reflects the sequence variation in a protein’s evolution. We generate a protein-dependent synonymous dictionary from a set of protein sequences for PSS prediction. On a large non-redundant dataset of 8,297 protein chains (DsspNr-25), the average Q3 values of SymPred and SymPsiPred are 81.0% and 83.9%, respectively. On the two latest independent test sets (EVA Set_1 and EVA_Set2), the average Q3 values of SymPred are 78.8% and 79.2%, respectively. SymPred outperforms other existing methods by 1.4% to 5.4%. We study two factors that may affect the performance of SymPred and find that it is very sensitive to the number of proteins of both known and unknown structures. This finding implies that SymPred and SymPsiPred have the potential to achieve higher accuracy as the number of protein sequences in the NCBInr and PDB databases increases. Conclusions: Our experimental results show that local similarities in protein sequences typically exhibit conserved structures, which can be used to improve the accuracy of secondary structure prediction.
To illustrate the application of synonymous words, we present a sequence alignment generated from the distribution of synonymous words shared by a pair of protein sequences. The two sequences, which are very dissimilar at the sequence level but very similar at the structural level, can be aligned nearly perfectly. The SymPred and SymPsiPred prediction servers are available at http://bio-cluster.iis.sinica.edu.tw/SymPred/.
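Two building blocks named in this abstract are simple enough to sketch: overlapping n-grams of an amino acid sequence, and the Q3 score, which is the percentage of residues whose predicted three-state label matches the observed one. The code below is a minimal illustration and does not reproduce SymPred's dictionary construction; the state alphabet (H, E, C), the function names, and the example strings are assumptions made for the sketch.

```python
from typing import List


def ngrams(sequence: str, n: int) -> List[str]:
    """Return all overlapping windows of length n from the sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def q3(predicted: str, observed: str) -> float:
    """Percentage of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical strings: H = helix, E = strand, C = coil.
observed = "CCHHHHHCCEEEECC"
predicted = "CCHHHHCCCEEEHCC"

words = ngrams("MKVLA", 3)
score = q3(predicted, observed)
print(words, round(score, 1))
```

Averaging `q3` over every chain in a dataset gives the kind of dataset-level figure the abstract reports.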

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

POPISK: T-cell reactivity prediction using support vector machines and string kernels

Author: Ho Shinn-Ying
Kohlbacher Oliver
Kämper Andreas
Tung Chun-Wei
Ziehm Matthias
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Accurate prediction of peptide immunogenicity and characterization of the relation between peptide sequences and peptide immunogenicity would greatly help vaccine design and the understanding of the immune system. In contrast to the prediction of the antigen processing and presentation pathway, the prediction of subsequent T-cell reactivity is a much harder problem. Previous studies identifying T-cell receptor (TCR) recognition positions were based on small-scale analyses using only a few peptides and arrived at different recognition positions, such as positions 4, 6 and 8 of peptides of length 9. Large-scale analyses are necessary to better characterize the effect of peptide sequence variations on T-cell reactivity and to design predictors of a peptide's T-cell reactivity (and thus immunogenicity). The identification and characterization of important positions influencing T-cell reactivity will provide insights into the underlying mechanism of immunogenicity. RESULTS: This work establishes a large dataset by collecting immunogenicity data from three major immunology databases. To account for the effect of MHC restriction, peptides are classified by their associated MHC alleles. Subsequently, a computational method (named POPISK) using a support vector machine with a weighted degree string kernel is proposed to predict T-cell reactivity and identify important recognition positions. POPISK yields a mean 10-fold cross-validation accuracy of 68% in predicting T-cell reactivity of HLA-A2-binding peptides. POPISK is capable of predicting immunogenicity with scores that can also correctly predict the change in T-cell reactivity related to point mutations in epitopes reported in previous studies using crystal structures. Thorough analyses of the prediction results identify the important positions 4, 6, 8 and 9, and yield insights into the molecular basis for TCR recognition.
Finally, we relate this finding to physicochemical properties and structural features of the MHC-peptide-TCR interaction. CONCLUSIONS: A computational method, POPISK, is proposed to predict immunogenicity with scores that are useful for predicting immunogenicity changes caused by single-residue modifications. The web server of POPISK is freely available at http://iclab.life.nctu.edu.tw/POPISK.
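For readers unfamiliar with the weighted degree string kernel, the following is a minimal sketch of its commonly used form: for two equal-length strings it counts position-wise matching substrings of each length d up to a maximum degree D, weighting length d by beta_d = 2(D - d + 1) / (D(D + 1)). The degree of 3, the weighting scheme, and the peptide strings are assumptions for illustration; the abstract does not state the settings POPISK uses.

```python
def wd_kernel(x: str, y: str, degree: int = 3) -> float:
    """Weighted degree kernel between two equal-length strings.

    For each substring length d in 1..degree, count the positions where the
    length-d substrings of x and y are identical, and weight the count by
    beta_d = 2 * (degree - d + 1) / (degree * (degree + 1)).
    """
    if len(x) != len(y):
        raise ValueError("the weighted degree kernel needs equal-length strings")
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1))
        total += beta * matches
    return total


# Illustrative 9-mers that differ only at position 4 (1-based).
reference = "GILGFVFTL"
mutant = "GILAFVFTL"

k_self = wd_kernel(reference, reference)
k_mut = wd_kernel(reference, mutant)
print(k_self, k_mut)
```

Because matches are counted per position, a single substitution lowers the kernel value through every substring window that covers the mutated residue, which is why position-specific effects can be read off a model built on this kernel.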

Springer - Publisher Connector

PubMed Central

UCL Discovery

Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties

Author: Ho Shinn-Jang
Ho Shinn-Ying
Hsu Kai-Ti
Huang Hui-Lin
Huang Wen-Lin
Lin I-Che
Liou Yi-Fan
Tsai Chia-Ta
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

Springer - Publisher Connector

PubMed Central

Protein subcellular localization prediction of eukaryotes using a knowledge-based approach

Author: Chen Ching-Tai
Ho Shinn-Ying
Hsu Wen-Lian
Lin Hsin-Nan
Sung Ting-Yi
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. However, determining the localization sites of a protein through wet-lab experiments can be time-consuming and labor-intensive, so computational approaches are highly desirable. Most PSL prediction systems are built for single-localized proteins. However, a significant number of eukaryotic proteins are known to be localized in multiple subcellular organelles. Many studies have shown that proteins may simultaneously reside in, or move between, different cellular compartments and be involved in different biological processes with different roles. Results: In this study, we propose a knowledge-based method, called KnowPredsite, to predict the localization site(s) of both single-localized and multi-localized proteins. Based on local similarity, we can identify the "related sequences" for prediction. We construct a knowledge base to record the possible sequence variations for protein sequences. When predicting the localization annotation of a query protein, we search against the knowledge base and use a scoring mechanism to determine the predicted sites. We downloaded the dataset from ngLOC, which consists of ten distinct subcellular organelles from 1923 species, and performed ten-fold cross-validation experiments to evaluate KnowPredsite's performance. The experimental results show that KnowPredsite achieves higher prediction accuracy than ngLOC and the Blast-hit method. For single-localized proteins, the overall accuracy of KnowPredsite is 91.7%. For multi-localized proteins, the overall accuracy of KnowPredsite is 72.1%, which is significantly higher than that of ngLOC by 12.4%. Notably, half of the proteins in the dataset that cannot find any Blast hit sequence above a specified threshold can still be correctly predicted by KnowPredsite.
Conclusion: KnowPredsite demonstrates the power of identifying related sequences in the knowledge base. The experimental results show that even when overall sequence similarity is low, local similarity is effective for prediction, and that KnowPredsite is a highly accurate prediction method for both single- and multi-localized proteins. It is worth mentioning that the prediction process of KnowPredsite is transparent and biologically interpretable, and that it shows the set of template sequences used to generate the prediction result. The KnowPredsite prediction server is available at http://bio-cluster.iis.sinica.edu.tw/kbloc/.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

NeurphologyJ: An automatic neuronal morphology quantification method and its application in pharmacological discovery

Author: Chao Chih-Yuan
Charoenkwan Phasit
Chiu Tzai-Wen
Ho Shinn-Ying
Huang Hui-Ling
Hwang Eric
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

Springer

Springer - Publisher Connector

PubMed Central

SCMPSP: Prediction and characterization of photosynthetic proteins based on a scoring card method

Author: Hong-An Chen
Hui-Ling Huang
Phasit Charoenkwan
Shinn-Ying Ho
Tamara Vasylenko
Yi-Fan Liou
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Crossref

Springer - Publisher Connector

Computational identification of ubiquitylation sites from protein sequences

Author: A Dey
AL Chernorudskiy
AL Hitchcock
C Denison
CC Chang
Chun-Wei Tung
CW Tung
D Plewczynski
DS Kirkpatrick
DT Jones
E Tomlinson
GE Crooks
H Kaur
H Meirovitch
HB Jeon
IH Witten
J Cedano
J Herrmann
J Peng
JL Cornette
JR Quinlan
M Matsumoto
NJ Denis
Q Wu
RA George
RL Welchman
SF Altschul
Shinn-Ying Ho
SY Ho
W Li
WL Huang
Y Harpaz
Y Xue
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Ubiquitylation plays an important role in regulating protein functions. Recently, experimental methods were developed toward effective identification of ubiquitylation sites. To efficiently explore more undiscovered ubiquitylation sites, this study aims to develop an accurate sequence-based prediction method to identify promising ubiquitylation sites. Results: We established a ubiquitylation dataset consisting of 157 ubiquitylation sites and 3676 putative non-ubiquitylation sites extracted from 105 proteins in the UbiProt database. This study first evaluates promising sequence-based features and classifiers for the prediction of ubiquitylation sites by assessing three kinds of features (amino acid identity, evolutionary information, and physicochemical property) and three classifiers (support vector machine, k-nearest neighbor, and Naïve Bayes). Results show that the set of 531 physicochemical properties and the support vector machine (SVM) are the best feature type and classifier, respectively; their combination achieves a prediction accuracy of 72.19% under leave-one-out cross-validation. Consequently, an informative physicochemical property mining algorithm (IPMA) is proposed to select an informative subset of the 531 physicochemical properties. A prediction system, UbiPred, was implemented using an SVM with the feature set of 31 informative physicochemical properties selected by IPMA, which improves the accuracy from 72.19% to 84.44%. To further analyze the informative physicochemical properties, the decision tree method C5.0 was used to acquire if-then rule-based knowledge for predicting ubiquitylation sites. UbiPred can screen promising ubiquitylation sites from putative non-ubiquitylation sites using prediction scores. By applying UbiPred, 23 promising ubiquitylation sites were identified from an independent dataset of 3424 putative non-ubiquitylation sites, which were also validated using the obtained prediction rules.
Conclusion: We have proposed an algorithm, IPMA, for mining informative physicochemical properties from protein sequences to build an SVM-based prediction system, UbiPred. UbiPred can predict ubiquitylation sites, each accompanied by a prediction score, to help biologists identify promising sites for experimental verification. UbiPred has been implemented as a web server and is available at http://iclab.life.nctu.edu.tw/ubipred.
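The leave-one-out cross-validation protocol behind the reported accuracies can be sketched generically: each sample is held out once, a model is fitted on the rest, and accuracy is the fraction of held-out samples predicted correctly. To keep the example self-contained, a 1-nearest-neighbor rule stands in for the SVM used in the paper; that substitution, the function names, and the two-dimensional toy data are all assumptions of this sketch.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


def nearest_neighbor_predict(train: List[Sample], x: Sequence[float]) -> int:
    """Stand-in classifier: return the label of the closest training sample."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    _, label = min(train, key=lambda sample: sq_dist(sample[0], x))
    return label


def leave_one_out_accuracy(
    data: List[Sample],
    predict: Callable[[List[Sample], Sequence[float]], int],
) -> float:
    """Hold out each sample once, predict it from the others, and average the hits."""
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if predict(train, x) == y:
            correct += 1
    return correct / len(data)


# Toy data, not from UbiProt: the last point sits near the class-0 cluster.
data = [
    ((0.0, 0.1), 0),
    ((0.2, 0.0), 0),
    ((0.1, 0.3), 0),
    ((1.0, 1.1), 1),
    ((0.9, 1.0), 1),
    ((0.4, 0.4), 1),
]
accuracy = leave_one_out_accuracy(data, nearest_neighbor_predict)
print(accuracy)
```

With a class ratio as skewed as the 157 positive to 3676 negative sites mentioned above, plain accuracy can be dominated by the majority class, so in practice it is usually read alongside sensitivity and specificity.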

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Protein-Protein Interaction Site Predictions with Three-Dimensional Probability Distributions of Interacting Atoms on Protein Surfaces

Author: A Koike
A Porollo
AA Bogan
An-Suei Yang
Attila Gursoy
BJ McConkey
BW Matthews
CC Chang
CD Manning
Ching-Tai Chen
CJC Burges
CM Yu
CT Chen
DE Rumelhart
DT Chang
ED Levy
Ei-Wen Yang
F Glaser
F Rodier
FB Sheinerman
G Moont
H Neuvirth
Hung-Pin Peng
HX Zhou
I Ezkurdia
I Kufareva
I Res
IS Moreira
J Janin
Jeng-Yih Chang
Jhih-Wei Jian
JM Elkins
Jun-Bo Chen
K Henrick
K Levenberg
Keng-Chang Tsai
L Breiman
L Jiang
L Lo Conte
M Reidmiller
M Riedmiller
M Sikic
MH Li
MN Wass
N Tuncbag
O Keskin
P Chakrabarti
PJ Kundrotas
QC Zhang
RA Laskowski
S Engelen
S Jones
S Sacquin-Mora
Shinn-Ying Ho
SJ de Vries
SJ Hubbard
SS Negi
Wen-Lian Hsu
X Gallet
Y Murakami
Y Ofran
Publication venue: Public Library of Science
Publication date: 06/06/2012

Protein-protein interactions are key to many biological processes. Computational methodologies devised to predict protein-protein interaction (PPI) sites on protein surfaces are important tools in providing insights into the biological functions of proteins and in developing therapeutics targeting the protein-protein interaction sites. One of the general features of PPI sites is that the core regions from the two interacting protein surfaces are complementary to each other, similar to the interior of proteins in packing density and in the physicochemical nature of the amino acid composition. In this work, we simulated the physicochemical complementarities by constructing three-dimensional probability density maps of non-covalent interacting atoms on the protein surfaces. The interacting probabilities were derived from the interior of known structures. Machine learning algorithms were applied to learn the characteristic patterns of the probability density maps specific to the PPI sites. The trained predictors for PPI sites were cross-validated with the training cases (consisting of 432 proteins) and were tested on an independent dataset (consisting of 142 proteins). The residue-based Matthews correlation coefficient for the independent test set was 0.423; the accuracy, precision, sensitivity, specificity were 0.753, 0.519, 0.677, and 0.779 respectively. The benchmark results indicate that the optimized machine learning models are among the best predictors in identifying PPI sites on protein surfaces. In particular, the PPI site prediction accuracy increases with increasing size of the PPI site and with increasing hydrophobicity in amino acid composition of the PPI interface; the core interface regions are more likely to be recognized with high prediction confidence. 
The results indicate that the physicochemical complementarity patterns on protein surfaces are important determinants in PPIs, and a substantial portion of the PPI sites can be predicted correctly with the physicochemical complementarity features based on the non-covalent interaction data derived from protein interiors.
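The residue-based measures quoted in this abstract (Matthews correlation coefficient, accuracy, precision, sensitivity, specificity) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not the paper's data.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary classification measures from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        # Each ratio falls back to 0.0 when its denominator is zero.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical residue counts: interface residues are the positive class.
metrics = binary_metrics(tp=60, fp=40, tn=160, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the MCC uses all four cells, which makes it more informative when interface residues are a minority of surface residues.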

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

The Francis Crick Institute