
    Complexity of pattern classes and Lipschitz property

    Rademacher and Gaussian complexities are successfully used in learning theory for measuring the capacity of the class of functions to be learned. One of the most important properties of these complexities is their Lipschitz property: composing a class of functions with a fixed Lipschitz function may increase its complexity by at most a factor of twice the Lipschitz constant. The proof of this property is non-trivial (in contrast to the other properties), and the proof in the Gaussian case is believed to be conceptually more difficult than the one for the Rademacher case. In this paper we give a detailed proof of the Lipschitz property for the Rademacher case and generalize the same idea to an arbitrary complexity (including the Gaussian one). We also discuss a related topic: the Rademacher complexity of the class consisting of all Lipschitz functions with a given Lipschitz constant. We show that this complexity is surprisingly low in the one-dimensional case; the question for higher dimensions remains open.
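
    For reference, the Lipschitz (contraction) property discussed above can be stated as follows; the notation here is the standard one and is ours rather than the paper's.

```latex
% Empirical Rademacher complexity of a class F on a sample x_1,...,x_n
% (the sigma_i are i.i.d. uniform {-1,+1} signs):
\hat{\mathcal{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[\sup_{f \in \mathcal{F}}
      \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\right].

% Lipschitz (contraction) property: if \phi is L-Lipschitz, composing it with
% every function in F increases the complexity by at most a factor of 2L:
\hat{\mathcal{R}}_S(\phi \circ \mathcal{F}) \le 2L\,\hat{\mathcal{R}}_S(\mathcal{F}).
```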

    Learning Kernel Perceptrons on Noisy Data and Random Projections

    In this paper we address the issue of learning nonlinearly separable concepts with a kernel classifier when the data at hand are altered by uniform classification noise. Our approach combines random or deterministic projections with a classification-noise-tolerant perceptron learning algorithm that assumes distributions defined over finite-dimensional spaces. Provided the problem is characterized by a sufficient separation margin, this strategy makes it possible to learn from a noisy distribution in any separable Hilbert space, regardless of its dimension; learning with any appropriate Mercer kernel is therefore possible. We prove that the required sample complexity and running time of our algorithm are polynomial in the classical PAC learning parameters. Numerical simulations on toy datasets and on data from the UCI repository support the validity of our approach.
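
    As a rough illustration of the ingredients involved, the sketch below projects an RBF kernel feature space to a finite dimension (random Fourier features standing in for the projections) and trains an ordinary perceptron on labels corrupted by uniform classification noise. The dataset, parameters, and the use of a plain perceptron instead of the paper's noise-tolerant variant are all illustrative assumptions, not the authors' algorithm.

```python
# Illustrative sketch: kernel feature approximation via a random projection,
# followed by a perceptron trained on labels with uniform classification noise.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_circles(n_samples=600, noise=0.05, factor=0.4, random_state=0)

# Uniform classification noise: flip each label with probability eta.
eta = 0.1
flip = rng.random(len(y)) < eta
y_noisy = np.where(flip, 1 - y, y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y_noisy, test_size=0.3,
                                          random_state=0)

# Random Fourier features play the role of a finite-dimensional projection
# of the RBF kernel feature space.
proj = RBFSampler(gamma=2.0, n_components=100, random_state=0)
Z_tr, Z_te = proj.fit_transform(X_tr), proj.transform(X_te)

clf = Perceptron(max_iter=1000, tol=1e-3).fit(Z_tr, y_tr)
print("test accuracy on noisy labels:", clf.score(Z_te, y_te))
```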

    Long-term mortality prediction after operations for type A ascending aortic dissection

    Background: There are few long-term mortality prediction studies after acute aortic dissection (AAD) Type A, and none were performed using newer models such as neural networks (NN) or support vector machines (SVM), which may show a higher discriminatory potency than standard multivariable models.
    Methods: We used 32 risk factors identified by literature search and previously assessed in short-term outcome investigations. Models were trained (50%) and validated (50%) on two random samples from a consecutive 235-patient cohort. NN were run only on patients with complete data for all included variables (N = 211); SVM were run on the overall group. Discrimination was assessed by receiver operating characteristic area under the curve (AUC) and Gini's coefficients, along with classification performance.
    Results: There were 84 deaths (36%) occurring at 564 ± 48 days (95% CI 470 to 658 days). Patients with complete variables had a slightly lower death rate (60 of 211, 28%). NN correctly classified 44 of 60 (73%) dead patients and 147 of 151 (97%) long-term survivors using 5 covariates: immediate post-operative chronic renal failure, circulatory arrest time, the type of surgery on the ascending aorta plus hemi-arch, extracorporeal circulation time, and the presence of Marfan habitus. Global accuracies of the training and validation NN were excellent, with AUC of 0.871 and 0.870 respectively, but classification errors were high among patients who died. The training SVM, using a larger number of covariates, showed no false negative or false positive cases among 118 randomly selected patients (error = 0%, AUC 1.0), whereas the validation SVM, among 117 patients, produced 5 false negative and 11 false positive cases (error = 22%, AUC 0.821, p < 0.01 versus NN results). An html file was produced to adopt and manipulate the selected parameters for practical predictive purposes.
    Conclusions: Both NN and SVM accurately selected a few operative and immediate post-operative factors and the Marfan habitus as long-term mortality predictors in AAD Type A. Although these factors were not new per se, their combination may be used in practice to index death risk post-operatively with good accuracy.
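
    For readers who want to reproduce this kind of evaluation, a minimal sketch follows, using synthetic data in place of the patient cohort; the feature count and class balance merely echo the numbers in the abstract, and the SVM settings are illustrative rather than the study's configuration.

```python
# Minimal sketch: an SVM classifier with ROC-AUC evaluation on synthetic data
# standing in for the (unavailable) patient covariates.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 235 "patients", 32 "risk factors", ~36% positive (deceased) class.
X, y = make_classification(n_samples=235, n_features=32, n_informative=5,
                           weights=[0.64, 0.36], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
model.fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]
print("validation AUC:", roc_auc_score(y_te, scores))
```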

    Enhanced protein fold recognition through a novel data integration approach

    Background: Protein fold recognition is a key step in protein three-dimensional (3D) structure discovery. There are multiple fold discriminatory data sources which use physicochemical and structural properties, as well as further data sources derived from local sequence alignments. This raises the issue of finding the most efficient method for combining these different informative data sources and exploring their relative significance for protein fold classification. Kernel methods have been extensively used for biological data analysis. They can incorporate separate fold discriminatory features into kernel matrices which encode the similarity between samples in their respective data sources.
    Results: In this paper we consider the problem of integrating multiple data sources using a kernel-based approach. We propose a novel information-theoretic approach based on a Kullback-Leibler (KL) divergence between the output kernel matrix and the input kernel matrix so as to integrate heterogeneous data sources. One of the most appealing properties of this approach is that it can easily cope with multi-class classification and multi-task learning by an appropriate choice of the output kernel matrix. Based on the positions of the output and input kernel matrices in the KL-divergence objective, there are two formulations, which we respectively refer to as MKLdiv-dc and MKLdiv-conv. We propose to solve MKLdiv-dc efficiently by a difference of convex (DC) programming method and MKLdiv-conv by a projected gradient descent algorithm. The effectiveness of the proposed approaches is evaluated on a benchmark dataset for protein fold recognition and a yeast protein function prediction problem.
    Conclusion: Our proposed methods MKLdiv-dc and MKLdiv-conv are able to achieve state-of-the-art performance on the SCOP PDB-40D benchmark dataset for protein fold prediction and provide useful insights into the relative significance of informative data sources. In particular, MKLdiv-dc further improves the fold discrimination accuracy to 75.19%, a more than 5% improvement over competitive Bayesian probabilistic and SVM margin-based kernel learning methods. Furthermore, we report competitive performance on the yeast protein function prediction problem.
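
    One common way to write down this kind of objective is shown below, in our own notation and purely as an illustration rather than the paper's exact formulation: the input kernels K_1, ..., K_m are combined through nonnegative weights, and the weights are chosen so that the combined kernel is close, in the KL divergence between the associated zero-mean Gaussians, to an output kernel K_y built from the labels.

```latex
% Combined input kernel with nonnegative weights on the simplex:
K_{\lambda} = \sum_{l=1}^{m} \lambda_l K_l,
\qquad \lambda_l \ge 0,\quad \sum_{l=1}^{m} \lambda_l = 1.

% KL divergence between the zero-mean Gaussians N(0, K_y) and N(0, K_lambda).
% (Illustrative objective; the paper's MKLdiv-dc and MKLdiv-conv variants
% differ in which matrix appears in which argument and in regularization.)
\mathrm{KL}\!\left(\mathcal{N}(0, K_y)\,\|\,\mathcal{N}(0, K_{\lambda})\right)
  = \tfrac{1}{2}\,\mathrm{tr}\!\left(K_{\lambda}^{-1} K_y\right)
  + \tfrac{1}{2}\log\det K_{\lambda}
  - \tfrac{1}{2}\log\det K_y
  - \tfrac{n}{2}.
```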

    Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology

    Background: DNA repair is the general term for the collection of critical mechanisms which repair many forms of DNA damage, such as methylation or ionizing radiation. DNA repair has mainly been studied in experimental and clinical settings, and relatively few information-based approaches to extracting new DNA repair knowledge exist. As a first step, automatic detection of DNA repair proteins in genomes via informatics techniques is desirable; however, there are many forms of DNA repair, and it is not a straightforward process to identify and classify repair proteins with a single optimal method. We perform a study of the ability of homology- and machine-learning-based methods to identify and classify DNA repair proteins, and scan vertebrate genomes for the presence of novel repair proteins. Combinations of primary sequence polypeptide frequency, secondary structure, and homology information are used as feature information for input to a Support Vector Machine (SVM).
    Results: We find that SVM techniques are capable of identifying portions of DNA repair protein datasets without admitting false positives; at low levels of false-positive tolerance, homology can also identify and classify proteins with good performance. Secondary structure information provides improved performance compared to using primary structure alone. Furthermore, we observe that machine learning methods incorporating homology information perform best when the data are filtered by some clustering technique. Applying these methodologies to the scanning of multiple vertebrate genomes confirms a positive correlation between the size of a genome and the number of DNA repair protein transcripts it is likely to contain, and simultaneously suggests that all organisms have a non-zero minimum number of repair genes. In addition, the scan results cluster several organisms' repair abilities in an evolutionarily consistent fashion. The analysis also identifies several functionally unconfirmed proteins that are highly likely to be involved in the repair process. A new web service, INTREPED, has been made available for the immediate search and annotation of DNA repair proteins in newly sequenced genomes.
    Conclusion: Despite the complexity due to a multitude of repair pathways, combining sequence, structure, and homology with Support Vector Machines offers a good method, in addition to existing homology searches, for DNA repair protein identification and functional annotation. Most importantly, this study has uncovered relationships between the size of a genome and the genome's available repair repertoire, and offers a number of new predictions as well as a prediction service, both of which reduce the search time and cost for novel repair genes and proteins.
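
    A minimal sketch of one of the feature types mentioned above, not the paper's pipeline: amino-acid (polypeptide) frequency features fed to an SVM. The sequences, labels, and parameters are toy placeholders; the real system combines these features with secondary structure and homology information.

```python
# Minimal sketch: amino-acid frequency features plus an SVM classifier.
import numpy as np
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequencies(seq):
    """Return the 20-dimensional amino-acid frequency vector of a sequence."""
    seq = seq.upper()
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(seq), 1)

# Toy sequences with made-up labels (1 = putative repair protein).
train_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
              "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ"]
train_labels = [1, 0]
X = np.vstack([aa_frequencies(s) for s in train_seqs])

clf = SVC(kernel="rbf", gamma="scale").fit(X, train_labels)
query = aa_frequencies("MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQ").reshape(1, -1)
print("predicted label:", clf.predict(query))
```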

    Estimation of Relevant Variables on High-Dimensional Biological Patterns Using Iterated Weighted Kernel Functions

    Background: The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables, which represents a high-dimensional problem space. The occurrence of noise, redundancy, or combinatorial interactions in the profile makes the selection of relevant variables harder.
    Methodology/Principal Findings: Here we propose a method to select variables based on their estimated relevance to hidden patterns. Our method combines a weighted-kernel discriminant with an iterative stochastic probability estimation algorithm to discover the relevance distribution over the set of variables. We verified the ability of our method to select predefined relevant variables in synthetic proteome-like data and then assessed its performance on biological high-dimensional problems. Experiments were run on serum proteomic datasets of infectious diseases. The resulting variable subsets achieved classification accuracies of 99% on Human African Trypanosomiasis, 91% on Tuberculosis, and 91% on Malaria serum proteomic profiles with fewer than 20% of the variables selected. Our method scaled up to much higher dimensionalities, as shown on gene expression microarray datasets, where we obtained classification accuracies close to 90% with fewer than 1% of the total number of variables.
    Conclusions: Our method consistently found relevant variables attaining high classification accuracies across synthetic and biological datasets. Notably, it yielded very compact subsets compared to the original number of variables, which should simplify downstream biological experimentation.
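
    The sketch below conveys the general flavour of such an approach rather than the authors' algorithm: a feature-weighted RBF kernel combined with a naive stochastic search that accumulates a relevance score for variables whose weightings perform well. The data, the update rule, and all parameters are assumptions made for illustration only.

```python
# Hedged sketch: weighted-kernel classification with a stochastic estimate of
# per-variable relevance (not the published algorithm).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=120, n_features=50, n_informative=5,
                           random_state=0)

def weighted_rbf(A, B, w, gamma=0.01):
    """RBF kernel computed on feature-weighted inputs."""
    Aw, Bw = A * w, B * w
    d2 = ((Aw[:, None, :] - Bw[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

relevance = np.ones(X.shape[1])
for _ in range(30):
    w = rng.dirichlet(relevance) * X.shape[1]    # candidate weights, mean 1
    K = weighted_rbf(X, X, w)
    acc = cross_val_score(SVC(kernel="precomputed"), K, y, cv=3).mean()
    relevance += acc * w                         # reinforce useful weightings

print("estimated most relevant variables:", np.argsort(relevance)[::-1][:5])
```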

    Text Mining for Literature Review and Knowledge Discovery in Cancer Risk Assessment and Research

    Research in biomedical text mining is starting to produce technology which can make information in the biomedical literature more accessible to bio-scientists. One of the current challenges is to integrate and refine this technology to support real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB, a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of the existing scientific data on a particular chemical. Such data cover human, animal, cellular and other mechanistic studies from various fields of biomedicine; they are highly varied and therefore difficult to harvest from literature databases by manual means. Our tool automates the process by extracting relevant scientific data from published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows navigating the classified dataset in various ways and sharing the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can be used to support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We discuss further development and application of our technology to other types of chemical risk assessment in the future.
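
    As a generic illustration of the kind of classification component such a pipeline relies on (not CRAB itself), the sketch below labels abstracts with a qualitative category using a TF-IDF and linear SVM pipeline; the documents, categories, and training data are invented placeholders.

```python
# Generic sketch: classifying abstracts into qualitative categories with a
# TF-IDF + linear SVM pipeline (illustrative, not the CRAB system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data; a real system is trained on expert-annotated abstracts.
docs = [
    "Increased tumour incidence was observed in rats exposed to the compound.",
    "In vitro assays showed DNA adduct formation in human cell lines.",
    "No carcinogenic effect was detected in the two-year mouse bioassay.",
    "The chemical induced chromosomal aberrations in cultured lymphocytes.",
]
labels = ["animal", "cellular", "animal", "cellular"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["Micronucleus formation was seen in exposed cell cultures."]))
```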

    Maximizing upgrading and downgrading margins for ordinal regression

    In ordinal regression, a score function and threshold values are sought to classify a set of objects into a set of ranked classes. Classifying an individual into a class with a higher (respectively, lower) rank than its actual rank is called an upgrading (respectively, downgrading) error. Since upgrading and downgrading errors may not have the same importance, they should be considered as two different criteria when measuring the quality of a classifier. In Support Vector Machines, margin maximization is used as an effective and computationally tractable surrogate for the minimization of misclassification errors. As an extension, we consider in this paper the maximization of the upgrading and downgrading margins as a surrogate for the minimization of upgrading and downgrading errors, and we address the biobjective problem of finding a classifier that maximizes both margins simultaneously. The whole set of Pareto-optimal solutions of this biobjective problem is described as translations of the optimal solutions of a scalar optimization problem. For the most popular case, in which the Euclidean norm is used, the scalar problem has a unique solution, so all the Pareto-optimal solutions of the biobjective problem are translations of each other. Hence, the Pareto-optimal solutions can easily be provided to the analyst who, after inspecting the misclassification errors incurred, can choose the most convenient classifier at a later stage. This analysis thus provides a theoretical foundation for a popular strategy among practitioners, based on the so-called ROC curve, which is shown here to coincide with the set of Pareto-optimal solutions of the simultaneous maximization of the downgrading and upgrading margins.
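
    In our own illustrative notation, for the separable case with a linear score f(x) = w'x and ordered thresholds b_0 = -infinity < b_1 < ... < b_{K-1} < b_K = +infinity, the biobjective problem sketched above can be written as follows; the exact formulation and normalization used in the paper may differ.

```latex
% Maximize the downgrading margin rho_d and the upgrading margin rho_u
% simultaneously, keeping each object of class y_i strictly between its two
% thresholds (the constraint ||w|| <= 1 normalizes the score):
\max_{w,\,b,\,\rho_d \ge 0,\,\rho_u \ge 0} \;\; (\rho_d,\ \rho_u)
\quad \text{s.t.} \quad
b_{y_i - 1} + \rho_d \;\le\; w^{\top} x_i \;\le\; b_{y_i} - \rho_u
\;\; \forall i,
\qquad \|w\|_2 \le 1 .

% Since the feasible set is convex, the Pareto-optimal solutions can be traced
% by the scalarizations  \max \; \alpha\rho_d + (1-\alpha)\rho_u, \; \alpha \in [0,1].
```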