Search CORE

254 research outputs found

Machine learning methods for omics data integration

Author: Zhou Wengang
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2011
Field of study

High-throughput technologies produce genome-scale transcriptomic and metabolomic (omics) datasets that allow for the system-level studies of complex biological processes. The limitation lies in the small number of samples versus the larger number of features represented in these datasets. Machine learning methods can help integrate these large-scale omics datasets and identify key features from each dataset. A novel class dependent feature selection method integrates the F statistic, maximum relevance binary particle swarm optimization (MRBPSO), and class dependent multi-category classification (CDMC) system. A set of highly differentially expressed genes are pre-selected using the F statistic as a filter for each dataset. MRBPSO and CDMC function as a wrapper to select desirable feature subsets for each class and classify the samples using those chosen class-dependent feature subsets. The results indicate that the class-dependent approaches can effectively identify unique biomarkers for each cancer type and improve classification accuracy compared to class independent feature selection methods. The integration of transcriptomics and metabolomics data is based on a classification framework. Compared to principal component analysis and non-negative matrix factorization based integration approaches, our proposed method achieves 20-30% higher prediction accuracies on Arabidopsis tissue development data. Metabolite-predictive genes and gene-predictive metabolites are selected from transcriptomic and metabolomic data respectively. The constructed gene-metabolite correlation network can infer the functions of unknown genes and metabolites. Tissue-specific genes and metabolites are identified by the class-dependent feature selection method. Evidence from subcellular locations, gene ontology, and biochemical pathways support the involvement of these entities in different developmental stages and tissues in Arabidopsis

Digital Repository @ Iowa State University (ISU)

ESLpred2: improved method for predicting subcellular localization of eukaryotic proteins

Author: A Garg
A Garg
A Hoglund
A Pierleoni
A Reinhardt
Aarti Garg
CS Yu
D Sarda
D Szafron
D Xie
DT Jones
E Tantoso
Gajendra PS Raghava
H Kaur
J Guo
JD Bendtsen
JL Gardy
K Nakai
K Nakai
K Park
KC Chou
KC Chou
KJ Park
M Bhasin
M Bhasin
O Emanuelsson
O Emanuelsson
Q Cui
R Nair
R Nair
S Hua
S Matsuda
SF Altschul
T Habib
WL Huang
Y Huang
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

Abstract Background The expansion of raw protein sequence databases in the post genomic era and availability of fresh annotated sequences for major localizations particularly motivated us to introduce a new improved version of our previously forged eukaryotic subcellular localizations prediction method namely "ESLpred". Since, subcellular localization of a protein offers essential clues about its functioning, hence, availability of localization predictor would definitely aid and expedite the protein deciphering studies. However, robustness of a predictor is highly dependent on the superiority of dataset and extracted protein attributes; hence, it becomes imperative to improve the performance of presently available method using latest dataset and crucial input features. Results Here, we describe augmentation in the prediction performance obtained for our most popular ESLpred method using new crucial features as an input to Support Vector Machine (SVM). In addition, recently available, highly non-redundant dataset encompassing three kingdoms specific protein sequence sets; 1198 fungi sequences, 2597 from animal and 491 plant sequences were also included in the present study. First, using the evolutionary information in the form of profile composition along with whole and N-terminal sequence composition as an input feature vector of 440 dimensions, overall accuracies of 72.7, 75.8 and 74.5% were achieved respectively after five-fold cross-validation. Further, enhancement in performance was observed when similarity search based results were coupled with whole and N-terminal sequence composition along with profile composition by yielding overall accuracies of 75.9, 80.8, 76.6% respectively; best accuracies reported till date on the same datasets. Conclusion These results provide confidence about the reliability and accurate prediction of SVM modules generated in the present study using sequence and profile compositions along with similarity search based results. The presently developed modules are implemented as web server "ESLpred2" available at <url>http://www.imtech.res.in/raghava/eslpred2/</url>.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Gene ontology based transfer learning for protein subcellular localization

Author: A Bateman
A Dijk
A Hoglund
A Hoglund
A Pierleoni
C Chen
C Leslie
C Leslie
DH Haft
E Marcotte
EM Zdobnov
F Corpet
FM Li
G Lanckriet
G Schneider
H Ding
H Lin
H Lin
H Liu
H Rangwala
H Shen
HB Shen
HB Shen
HB Shen
HB Shen
HB Shen
J Cedano
J Schultz
J Shen
JD Qiu
JD Qiu
K Chou
K Chou
K Chou
K Hofmann
K Lee
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
L Nanni
M Ashburner
M Esmaeili
M Mak
M Wang
Q Gu
Q Yang
R Apweiler
R Kuang
R Kuang
S Mei
S Pan
Shuigeng Zhou
Suyu Mei
T Blum
T Tung
TK Attwood
W Dai
W Dai
W Huang
W Huang
Wang Fei
X Jiang
X Xiao
XB Zhou
YH Zeng
YS Ding
YS Ding
Z Lei
Z Lu
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Abstract Background Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as <it>GO</it>, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenated the <it>GO </it>terms into a flat binary vector or applied majority-vote based ensemble learning for protein subcellular localization, both of which can not estimate the individual discriminative abilities of the three aspects of gene ontology. Results In this paper, we propose a Gene Ontology Based Transfer Learning Model (<it>GO-TLM</it>) for large-scale protein subcellular localization. The model transfers the signature-based homologous <it>GO </it>terms to the target proteins, and further constructs a reliable learning system to reduce the adverse affect of the potential false <it>GO </it>terms that are resulted from evolutionary divergence. We derive three <it>GO </it>kernels from the three aspects of gene ontology to measure the <it>GO </it>similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time & space computational complexities are greatly reduced when compared to the complicated semi-definite programming and semi-indefinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate <it>GO-TLM </it>performance against three baseline models: <it>MultiLoc, MultiLoc-GO </it>and <it>Euk-mPLoc </it>on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that <it>GO-TLM </it>achieves substantial accuracy improvement against the baseline models: 80.38% against model <it>Euk-mPLoc </it>67.40% with <it>12.98% </it>substantial increase; 96.65% and 96.27% against model <it>MultiLoc-GO </it>89.60% and 89.60%, with <it>7.05% </it>and <it>6.67% </it>accuracy increase on dataset <it>MultiLoc plant </it>and dataset <it>MultiLoc animal</it>, respectively; 97.14%, 95.90% and 96.85% against model <it>MultiLoc-GO </it>83.70%, 90.10% and 85.70%, with accuracy increase <it>13.44%</it>, <it>5.8% </it>and <it>11.15% </it>on dataset <it>BaCelLoc plant</it>, dataset <it>BaCelLoc fungi </it>and dataset <it>BaCelLoc animal </it>respectively. For <it>BaCelLoc </it>independent sets, <it>GO-TLM </it>achieves 81.25%, 80.45% and 79.46% on dataset <it>BaCelLoc plant holdout</it>, dataset <it>BaCelLoc plant holdout </it>and dataset <it>BaCelLoc animal holdout</it>, respectively, as compared against baseline model <it>MultiLoc-GO </it>76%, 60.00% and 73.00%, with accuracy increase <it>5.25%</it>, <it>20.45% </it>and <it>6.46%</it>, respectively. Conclusions Since direct homology-based <it>GO </it>term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, <it>GO-TLM</it>) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based <it>GO </it>term transfer and explicitly weighing the <it>GO </it>kernels substantially improve the prediction performance.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence

Author: A Kumar
A Reinhardt
B Matthews
C Guda
C Guda
CH Wu
G Dellaire
G-P Zhou
H Liu
H-B Shen
H-B Shen
HGE Sutherland
J Cedano
JL Heazlewood
K Nakai
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-C Chou
K-J Park
M Wang
M Wang
MA Andrade
MS Scott
O Emanuelsson
P Lio
Pufeng Du
Q-B Gao
RA Gottlieb
S Hua
S Kawashima
S-Q Wang
W Jassem
W Li
WA BickMore
Y Huang
Y-D Cai
Y-D Cai
Y-D Cai
Yanda Li
Z Lei
Z Yuan
Z-P Feng
Publication venue: BioMed Central
Publication date: 01/01/2006
Field of study

BACKGROUND: Knowing the submitochondria localization of a mitochondria protein is an important step to understand its function. We develop a method which is based on an extended version of pseudo-amino acid composition to predict the protein localization within mitochondria. This work goes one step further than predicting protein subcellular location. We also try to predict the membrane protein type for mitochondrial inner membrane proteins. RESULTS: By using leave-one-out cross validation, the prediction accuracy is 85.5% for inner membrane, 94.5% for matrix and 51.2% for outer membrane. The overall prediction accuracy for submitochondria location prediction is 85.2%. For proteins predicted to localize at inner membrane, the accuracy is 94.6% for membrane protein type prediction. CONCLUSION: Our method is an effective method for predicting protein submitochondria location. But even with our method or the methods at subcellular level, the prediction of protein submitochondria location is still a challenging problem. The online service SubMito is now available at

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Support vector machine (SVM) based multiclass prediction with basic statistical analysis of plasminogen activators

Author: Lefevre Christophe
Muthukrishnan Selvaraj
Puri Munish
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014
Field of study

Plasminogen (Pg), the precursor of the proteolytic and fibrinolytic enzyme of blood, is converted to the active enzyme plasmin (Pm) by different plasminogen activators (tissue plasminogen activators and urokinase), including the bacterial activators streptokinase and staphylokinase, which activate Pg to Pm and thus are used clinically for thrombolysis. The identification of Pg-activators is therefore an important step in understanding their functional mechanism and derives new therapies

Deakin Research Online

Crossref

Springer - Publisher Connector

PubMed Central

University of Melbourne Institutional Repository

Fast subcellular localization by cascaded fusion of signal-based and homology-based methods

Author: Kung Sun-Yuan
Mak Man-Wai
Wang Wei
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Abstract Background The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means. Results This paper proposes mitigating the computation burden of alignment-based approaches to subcellular localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal shorting signals. Then, the sequences are truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localization are subsequently predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on the perturbational discriminant analysis (PDA). Conclusions Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can make use of the best property of signal- and homology-based approaches and can attain an accuracy comparable to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggest that both profile creation time and profile alignment time can be reduced without significant reduction in subcellular localization accuracy. It was found that PDA enjoys a short training time as compared to the conventional SVM. We advocate that the method will be important for biologists to conduct large-scale protein annotation or for bioinformaticians to perform preliminary investigations on new algorithms that involve pairwise alignments.</p

The Hong Kong Polytechnic University Pao Yue-kong Library

Princeton University Open Access Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PolyU Institutional Repository

PubMed Central

Imbalanced Multi-Modal Multi-Label Learning for Subcellular Localization Prediction of Human Proteins with Both Single and Multiple Sites

Author: A Hoglund
B Liao
CE Rasmussen
DN Georgiou
FM Li
Franca Fraternali
G Tsoumakas
GP Zhou
H Mohabatkar
H Mohabatkar
H Nakashima
HB Shen
HB Shen
HB Shen
HB Shen
HN Lin
Hong Gu
J Ma
J Ma
J Tian
J Yin
Jianjun He
JY Shi
K Imai
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KY Lee
L Chen
L Chen
L Hu
LJ Foster
LL Hu
M Esmaeili
MS Scott
O Emanuelsson
P Wang
P Wang
RE Schapire
S Briesemeister
S Hua
S Mei
S Mei
S Zhang
T Huang
T Huang
T Huang
T Liu
Wenqi Liu
WZ Lin
X Jiang
X Xiao
X Xiao
X Xiao
YH Zeng
YL Chen
YL Chen
Z He
Z Lu
ZC Wu
ZC Wu
Publication venue: Public Library of Science
Publication date: 08/06/2012
Field of study

It is well known that an important step toward understanding the functions of a protein is to determine its subcellular location. Although numerous prediction algorithms have been developed, most of them typically focused on the proteins with only one location. In recent years, researchers have begun to pay attention to the subcellular localization prediction of the proteins with multiple sites. However, almost all the existing approaches have failed to take into account the correlations among the locations caused by the proteins with multiple sites, which may be the important information for improving the prediction accuracy of the proteins with multiple sites. In this paper, a new algorithm which can effectively exploit the correlations among the locations is proposed by using Gaussian process model. Besides, the algorithm also can realize optimal linear combination of various feature extraction technologies and could be robust to the imbalanced data set. Experimental results on a human protein data set show that the proposed algorithm is valid and can achieve better performance than the existing approaches

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

Protein Interactome and Its Application to Protein Function Prediction

Author: Hyun-Hwan Jeong
KiYoung Lee
Woojin Jung
Publication venue: 'IntechOpen'
Publication date: 30/03/2012
Field of study

IntechOpen

Ensemble deep learning: A review

Author: Ganaie M. A.
Hu Minghui
Malik A. K.
Suganthan P. N.
Tanveer M.
Publication venue
Publication date: 06/04/2021
Field of study

Ensemble learning combines several individual models to obtain better generalization performance. Currently, deep learning models with multilayer processing architecture is showing better performance as compared to the shallow or traditional classification models. Deep ensemble learning models combine the advantages of both the deep learning models as well as the ensemble learning such that the final model has better generalization performance. This paper reviews the state-of-art deep ensemble models and hence serves as an extensive summary for the researchers. The ensemble models are broadly categorised into ensemble models like bagging, boosting and stacking, negative correlation based deep ensemble models, explicit/implicit ensembles, homogeneous /heterogeneous ensemble, decision fusion strategies, unsupervised, semi-supervised, reinforcement learning and online/incremental, multilabel based deep ensemble models. Application of deep ensemble models in different domains is also briefly discussed. Finally, we conclude this paper with some future recommendations and research directions

arXiv.org e-Print Archive

Qatar University Institutional Repository