
Aberystwyth Research Portal

Kent Academic Repository

One-Class Classification: Taxonomy of Study and Review of Techniques

Author: Khan Shehroz S.
Madden Michael G.
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 29/11/2013
Field of study

One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This situation constrains the learning of efficient classifiers, because the class boundary must be defined using knowledge of the positive class alone. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC through a taxonomy of study for OCC problems, based on the availability of training data, the algorithms used and the application domains. We then examine each category of the proposed taxonomy and present a comprehensive literature review of OCC algorithms, techniques and methodologies, with a focus on their significance, limitations and applications. We conclude by discussing some open research problems in the field of OCC and present our vision for future research.
Comment: 24 pages + 11 pages of references, 8 figures
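To make the idea of a boundary learned from positive examples alone concrete, here is a minimal sketch. It is not a method taken from the survey: it assumes a simple distance-to-centroid rule, and the names `fit_one_class`, `is_positive` and the `quantile` parameter are hypothetical.

```python
import math

def fit_one_class(positives, quantile=0.95):
    """Fit a centroid and an acceptance radius from positive samples only.

    Assumption: the positive class is roughly compact around its mean, so a
    distance threshold is a usable boundary. No negative samples are needed.
    """
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    # Distances of the training positives to the centroid, sorted ascending.
    dists = sorted(math.dist(p, centroid) for p in positives)
    # Use the distance at the chosen quantile as the acceptance radius.
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return centroid, dists[idx]

def is_positive(x, centroid, radius):
    """Accept x as a member of the positive class if it lies within the radius."""
    return math.dist(x, centroid) <= radius

# Hypothetical training data: four positive points, no negatives.
centroid, radius = fit_one_class([(0, 0), (1, 0), (0, 1), (1, 1)])
```

Anything far from the training positives is rejected, which is the outlier/novelty-detection reading of OCC mentioned in the abstract.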

arXiv.org e-Print Archive

Access to Research at National University of Ireland, Galway

SiteSeek: Post-translational modification analysis using adaptive locality-effective kernel methods and new profiles

Author: A Radzicka
A Zanzoni
AA Salamov
Albert Y Zomaya
B Amos
B Rost
B Zhang
BA Ballif
Bing Bing Zhou
C Li
CJC Burgess
D Frishman
DT Larose
F Diella
G Horváth
G Rose
GD Rose
H Hu
H Kim
HH Jang
HM Berman
J Liu
J Shawe-Taylor
JC Obenauer
JH Kim
K Koliba
K Lin
L Graves
L Johnson
LA Pinna
LM Iakoucheva
M Hjerrild
M Mann
M Scholz
MA Kramer
MB Yaffe
MJ Korenberg
MJ Zvelebil
N Blom
NL Daly
P Baldi
P Cohen
Paul D Yoo
R David
R Lohmann
RD King
RE Schapire
SA Beausoleil
SB Ficarro
T Hunter
TG Dietterich
W Hardle
Y Freund
Y Xue
Yung Shwen Ho
Z Songyang
ZR Yang
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Post-translational modifications have a substantial influence on the structure and functions of proteins. Post-translational phosphorylation is one of the most common modifications that occur in intracellular proteins. Accurate prediction of protein phosphorylation sites is of great importance for the understanding of diverse cellular signalling processes in both the human body and in animals. In this study, we propose a new machine learning based protein phosphorylation site predictor, SiteSeek. SiteSeek is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites for a target sequence. The newly proposed method proves to be more accurate and exhibits a much more stable predictive performance than currently existing phosphorylation site predictors. Results: The performance of the proposed model was compared with nine existing machine learning models and four widely known phosphorylation site predictors on the newly proposed PS-Benchmark_1 dataset, contrasting their accuracy, sensitivity, specificity and correlation coefficient. SiteSeek showed better predictive performance, with 86.6% accuracy, 83.8% sensitivity, 92.5% specificity and a correlation coefficient of 0.77 on the four main kinase families (CDK, CK2, PKA, and PKC). Conclusion: The newly proposed methods used in SiteSeek were shown to be useful for the identification of protein phosphorylation sites, as SiteSeek performed much better than widely known predictors on the newly built PS-Benchmark_1 dataset.
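The four measures reported above can all be derived from a binary confusion matrix. The sketch below assumes that the "correlation coefficient" is the Matthews correlation coefficient, which the abstract does not state explicitly. The function name `binary_metrics` and the example counts are hypothetical, not values from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and Matthews correlation
    coefficient (MCC) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity: fraction of true sites that are predicted as sites.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-sites that are predicted as non-sites.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```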

Prediction of protein binding sites in protein structures using hidden Markov support vector machine

Author: A Henschel
A Koike
A Kouranov
A Porollo
A Rossi
AJ Bordner
B Wang
Bin Liu
Buzhou Tang
C Chothia
C Yan
C-T Chen
C-W Cheng
H Chen
H Kim
H Neuvirth
H-X Zhou
HX Zhou
I Ezkurdia
I Res
I Tsochantaridis
J Lafferty
J Song
J-L Chung
JD Fischer
JL Chung
JR Bradford
JW Torrance
K Henrick
L Holm
L Lo Conte
L Wang
Lei Lin
LR Rabiner
M Gribskov
M Vincent
M Šikić
MH Li
N Li
NJ Burgoyne
P Fariselli
Q Dong
Qiwen Dong
S Ahmad
S Liang
S Qin
SF Altschul
T Joachims
T Zhang
TH Dang
W Kabsch
WK Kim
X-w Chen
Xiaolong Wang
Xuan Wang
Y Altun
Y Liu
Y Ofran
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Predicting the binding sites between two interacting proteins provides important clues to the function of a protein. Recent research on protein binding site prediction has been mainly based on widely known machine learning techniques, such as artificial neural networks, support vector machines and conditional random fields. However, the prediction performance is still too low to be used in practice. It is necessary to explore new algorithms, theories and features to further improve the performance. Results: In this study, we introduce a novel machine learning model, the hidden Markov support vector machine, for protein binding site prediction. The model treats protein binding site prediction as a sequential labelling task based on the maximum margin criterion. Common features derived from protein sequences and structures, including the protein sequence profile and residue accessible surface area, are used to train the hidden Markov support vector machine. When tested on six data sets, the method based on the hidden Markov support vector machine shows better performance than some state-of-the-art methods, including artificial neural networks, support vector machines and conditional random fields. Furthermore, its running time is several orders of magnitude shorter than that of the compared methods. Conclusion: The improved prediction performance and computational efficiency of the method based on the hidden Markov support vector machine can be attributed to three factors. Firstly, the relation between labels of neighbouring residues is useful for protein binding site prediction. Secondly, the kernel trick is very advantageous in this field. Thirdly, with the cutting-plane algorithm, the complexity of the training step for the hidden Markov support vector machine is linear in the number of training samples.

ScholarBank@NUS

Is EC class predictable from reaction mechanism?

Author: A Statnikov
AG McDonald
BV Dasarathy
BW Matthews
C Andreini
DARS Latino
DE Almonacid
GL Holliday
IUBMB
J Gorodkin
J Menke
John BO Mitchell
JW Torrance
K Astikainen
KM Borgwardt
L Breiman
L De Ferrari
LD Hughes
M Aizerman
M Kanehisa
M Leber
N Furnham
N Nagano
Neetika Nath
NM O'Boyle
O Sacher
PC Babbitt
PD Dobson
R Lowe
RD Uriarte
SA Rahman
SCH Pegg
T Bray
T Bylander
V Egelhofer
VN Vapnik
WS Noble
X Hu
Y Yamanishi
Publication venue: BioMed Central
Publication date: 01/01/2012

We thank the Scottish Universities Life Sciences Alliance (SULSA) and the Scottish Overseas Research Student Awards Scheme of the Scottish Funding Council (SFC) for financial support. Background: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN). We consider two ways of encoding the reaction mechanism in descriptors, and three approaches that encode only the overall chemical reaction. Both cross-validation and an external test set are used. Results: The three descriptor sets encoding the overall chemical transformation perform better than the two descriptions of mechanism. SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases and hydrolases are relatively well predicted by all types of descriptor; isomerases are well predicted by overall reaction descriptors but not by mechanistic ones. Conclusions: Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective. kNN is less successful, given the role that non-local information plays in successful classification.
We note also that, despite a lack of clarity in the literature, EC number prediction is not a single problem; the challenge of predicting protein function from available sequence data is quite different from assigning an EC classification from a cheminformatics representation of a reaction.
Publisher PDF. Peer reviewed.

University of St. Andrews - Pure

St Andrews Research Repository

Predicting the network of substrate-enzyme-product triads by combining compound similarity and functional domain composition

Author: Cai Yu-Dong
Chen Lei
Chou Kuo-Chen
Feng Kai-Yan
Li Hai-Peng
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: A metabolic pathway is a highly regulated network consisting of many metabolic reactions involving substrates, enzymes, and products, where substrates can be transformed into products with particular catalytic enzymes. Since experimental determination of the network of substrate-enzyme-product triads (whether the substrate can be transformed into the product with a given enzyme) is both time-consuming and expensive, it would be very useful to develop a computational approach for predicting this network. Results: A mathematical model for predicting the network of substrate-enzyme-product triads was developed. A benchmark dataset was constructed that contains 744,192 substrate-enzyme-product triads, of which 14,592 are networking triads and 729,600 are non-networking triads; i.e., the number of negative triads was about 50 times the number of positive triads. The molecular graph was introduced to calculate the similarity between the substrate compounds and between the product compounds, while the functional domain composition was introduced to calculate the similarity between enzyme molecules. The nearest neighbour algorithm was utilized as a prediction engine, in which a novel metric was introduced to measure the "nearness" between triads. To train and test the prediction engine, one tenth of the positive triads and one tenth of the negative triads were randomly picked from the benchmark dataset as the testing samples, while the remainder were used to train the prediction model. The overall success rate in predicting the network for the testing samples was 98.71%, with a 95.41% success rate for the 1,460 testing networking triads and 98.77% for the 72,960 testing non-networking triads.
Conclusions: It is quite promising and encouraging to use the molecular graph to calculate the similarity between compounds and the functional domain composition to calculate the similarity between enzymes for studying the substrate-enzyme-product network system. The software is available upon request.
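The reported figures can be cross-checked with a little arithmetic, which also shows how strongly the overall success rate is dominated by the much larger negative class. The sketch below uses only numbers quoted in the abstract; the function name `overall_rate` is hypothetical, and the per-class rates are rounded values, so the recomputed overall rate agrees with the reported 98.71% only up to rounding.

```python
def overall_rate(rate_pos, n_pos, rate_neg, n_neg):
    """Overall success rate as the size-weighted mean of per-class rates."""
    return (rate_pos * n_pos + rate_neg * n_neg) / (n_pos + n_neg)

# Benchmark composition as quoted in the abstract.
positives_total = 14_592
negatives_total = 729_600
dataset_total = positives_total + negatives_total   # 744,192 triads
imbalance = negatives_total / positives_total       # negatives per positive

# Test split: about one tenth of each class.
test_pos, test_neg = 1_460, 72_960
overall = overall_rate(0.9541, test_pos, 0.9877, test_neg)
```

Because negatives outnumber positives by a factor of 50, the overall rate sits very close to the negative-class rate, so the 95.41% figure for networking triads is the more informative one.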

Protein model construction and evaluation.

Author: Klose D.P.
Publication venue: 'Queen Mary University of London'
Publication date: 01/01/2008

The prediction of protein secondary and tertiary structure is becoming increasingly important as the number of sequences available to the biological community far exceeds the number of unique native structures. The following chapters describe the conception, construction, evaluation and application of a series of algorithms for the prediction and evaluation of two- and three-dimensional protein structure. Chapter 1 gives a brief overview of protein structure and the resources required to predict protein features. Chapter 2 describes the investigation of sequence identity and alignments on the prediction of two-dimensional protein structure in the form of long- and short-range protein contacts, a feature which is known to correlate with solvent accessibility. It also describes the identification of a feature referred to as the 'Empty Quarter', which forms the basis of an evaluation function described in Chapter 3 and developed in Chapter 4. Chapter 3 introduces the Dynamic Domain Threading method used during round six of the CASP exercise. Phobic, a protein evaluation function based on predicted solvent accessibility, is described in Chapter 4. The de novo prediction of α/β proteins is described in Chapter 5; the method introduces a new approach to the old problem of combinatorial modelling and breaks the size limit previously imposed on de novo prediction. The final experimental chapter describes the prediction of solvent accessibility and secondary structure using a novel combination of the fuzzy k-nearest neighbour and support vector machine. Chapter 7 closes this piece of work with a review of the field and suggests potential improvements to the way work is conducted.

UCL Discovery

OpenGrey Repository

Kinome-wide interaction modelling using alignment-based and alignment-independent approaches for kinase description and linear and non-linear data analysis techniques

Author: A Golbraikh
A Kamb
A Linusson
A Navia-Vázquez
B Ustün
CC Chang
D Aha
E Freyhult
G Cruciani
G Manning
G Scapin
H Daub
H Drucker
HM Berman
I Dubchak
IH Witten
J Trygg
Jarl ES Wikberg
JD Griffin
JE Wikberg
K Illergård
KC Chou
LH Alifrangis
M Bhasin
M Lapinsh
M Reczko
M Sandberg
M Van Heel
MA Fabian
MA Larkin
Maris Lapins
MS Cohen
MW Karaman
NP Shah
O Devos
P Bamborough
P Geladi
QB Gao
RJ Quinlan
S Hua
S Madhusudan
S Wold
SD Peterson
T Lundstedt
TA Carter
V Vapnik
ZR Li
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Protein kinases play crucial roles in cell growth, differentiation, and apoptosis. Abnormal function of protein kinases can lead to many serious diseases, such as cancer. Kinase inhibitors have potential for treatment of these diseases. However, current inhibitors interact with a broad variety of kinases and interfere with multiple vital cellular processes, which causes toxic effects. Bioinformatics approaches that can predict inhibitor-kinase interactions from the chemical properties of the inhibitors and the kinase macromolecules might aid in the design of more selective therapeutic agents that show better efficacy and lower toxicity. Results: We applied proteochemometric modelling to correlate the properties of 317 wild-type and mutated kinases and 38 inhibitors (12,046 inhibitor-kinase combinations) to the respective combination's interaction dissociation constant (Kd). We compared six approaches for description of protein kinases and several linear and non-linear correlation methods. The best performing models encoded kinase sequences with amino acid physico-chemical z-scale descriptors and used support vector machines or partial least-squares projections to latent structures for the correlations. Modelling performance was estimated by double cross-validation. The best models showed high predictive ability: the squared correlation coefficient for new kinase-inhibitor pairs ranged over P² = 0.67-0.73, and for new kinases over P²kin = 0.65-0.70. Models could also separate interacting from non-interacting inhibitor-kinase pairs with high sensitivity and specificity, with areas under the ROC curves of AUC = 0.92-0.93. We also investigated the relationship between the number of protein kinases in the dataset and the modelling results. A valid model was still obtained using only 10% of all data, with P² = 0.47, P²kin = 0.42 and AUC = 0.83.
Conclusions: Our results strongly support the applicability of proteochemometrics for kinome-wide interaction modelling. Proteochemometrics might be used to speed up identification and optimization of protein kinase targeted and multi-targeted inhibitors.

arXiv.org e-Print Archive

A Survey on Metric Learning for Feature Vectors and Structured Data

Author: Bellet Aurélien
Habrard Amaury
Sebban Marc
Publication venue
Publication date: 01/01/2013

The need for appropriate ways to measure the distance or similarity between data is ubiquitous in machine learning, pattern recognition and data mining, but handcrafting such good metrics for specific problems is generally difficult. This has led to the emergence of metric learning, which aims at automatically learning a metric from data and has attracted a lot of interest in machine learning and related fields for the past ten years. This survey paper proposes a systematic review of the metric learning literature, highlighting the pros and cons of each approach. We pay particular attention to Mahalanobis distance metric learning, a well-studied and successful framework, but additionally present a wide range of methods that have recently emerged as powerful alternatives, including nonlinear metric learning, similarity learning and local metric learning. Recent trends and extensions, such as semi-supervised metric learning, metric learning for histogram data and the derivation of generalization guarantees, are also covered. Finally, this survey addresses metric learning for structured data, in particular edit distance learning, and attempts to give an overview of the remaining challenges in metric learning for the years to come.
Comment: Technical report, 59 pages. Changes in v2: fixed typos and improved presentation. Changes in v3: fixed typos. Changes in v4: fixed typos and new methods
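As a reminder of what Mahalanobis distance metric learning parameterizes, the sketch below evaluates the standard form d_M(x, y) = sqrt((x - y)^T M (x - y)) for a given matrix M. It only computes the distance and does not learn M. The function name `mahalanobis` and the example matrices are hypothetical, and M is assumed to be symmetric positive semidefinite so that the square root is real.

```python
import math

def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a symmetric PSD matrix M,
    given as a list of rows."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    # Matrix-vector product M @ d.
    Md = [sum(M[i][j] * d[j] for j in range(n)) for i in range(n)]
    # Quadratic form d^T (M d).
    return math.sqrt(sum(d[i] * Md[i] for i in range(n)))

identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[4.0, 0.0], [0.0, 1.0]]   # weights the first feature more heavily

euclidean = mahalanobis([0.0, 0.0], [3.0, 4.0], identity)
weighted = mahalanobis([1.0, 0.0], [0.0, 0.0], stretched)
```

With the identity matrix the distance reduces to the Euclidean distance; metric learning methods choose M from data so that, for example, same-class points end up closer than different-class points.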

HAL-UJM

Applicability of semi-supervised learning assumptions for gene ontology terms prediction

Author: Castellanos Cesar German
Jaramillo-Garzón Jorge Alberto
Perera Lluna Alexandre
Publication venue
Publication date: 01/01/2016

Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task for bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods arise as a powerful solution that exploits the information contained in unlabelled data in order to improve the estimations of traditional supervised approaches. However, semi-supervised learning methods have to make strong assumptions about the nature of the training data, and thus the performance of the predictor is highly dependent on these assumptions. This paper presents an analysis of the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, focused on providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform the traditional supervised methods and that the highest performances are reached when applying the cluster assumption. In addition, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis is provided of which GO terms are more likely to be correctly predicted with each assumption.
Postprint (published version)

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

ZENODO