325 research outputs found

A preliminary approach to the multilabel classification problem of Portuguese juridical documents

Author: A. McCallum, B. Schölkopf, C. Cortes, G. Salton, I. Witten, N. Cancedda, N. Cristianini, P. Quaresma, R. Quinlan, T. Joachims, V. Vapnik
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2003

Portuguese juridical documents from Supreme Courts and the Attorney General’s Office are manually classified by juridical experts into a set of classes belonging to a taxonomy of concepts. This paper proposes a preliminary approach to developing techniques that automatically classify these juridical documents. The basic strategy integrates natural language processing techniques with machine learning. Support Vector Machines (SVM) are used as the learning algorithm, and the results obtained are presented and compared with other approaches, such as C4.5 and Naive Bayes.

Crossref

Repositório Científico da Universidade de Évora

Generalization properties of finite size polynomial Support Vector Machines

Author: A. Buhot, C. Cortes, C. Marangi, H. Yoon, M. B. Gordon, M. Opper, Mirta B. Gordon, R. Dietrich, R. Monasson, S. Risau-Gusman, Sebastian Risau-Gusman, T. Cover, V. Vapnik
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2000

The learning properties of finite size polynomial Support Vector Machines are analyzed in the case of realizable classification tasks. The normalization of the high order features acts as a squeezing factor, introducing a strong anisotropy in the distribution of patterns in feature space. As a function of the training set size, the corresponding generalization error presents a crossover between a fast-decreasing and a slowly decreasing regime; the crossover is more or less abrupt depending on the distribution's anisotropy and on the task to be learned. This behaviour corresponds to the stepwise decrease found by Dietrich et al. [Phys. Rev. Lett. 82 (1999) 2975-2978] in the thermodynamic limit. The theoretical results are in excellent agreement with the numerical simulations. Comment: 12 pages, 7 figures

arXiv.org e-Print Archive

Crossref

HAL-CEA

Statistical Mechanics of Support Vector Networks

Author: C. Cortes, H. Seung, H. Yoon, Haim Sompolinsky, M. Opper, Manfred Opper, R. Kühn, R. Monasson, Rainer Dietrich, T. Cover, T. L. H. Watkin, V. N. Vapnik
Publication venue: 'American Physical Society (APS)'
Publication date: 25/02/1999

Using methods of Statistical Physics, we investigate the generalization performance of support vector machines (SVMs), which have been recently introduced as a general alternative to neural networks. For nonlinear classification rules, the generalization error saturates on a plateau when the number of examples is too small to properly estimate the coefficients of the nonlinear part. When trained on simple rules, we find that SVMs overfit only weakly. The performance of SVMs is strongly enhanced when the distribution of the inputs has a gap in feature space. Comment: REVTeX, 4 pages, 2 figures, accepted by Phys. Rev. Lett. (typos corrected)

arXiv.org e-Print Archive

Crossref

Aston Publications Explorer

High-probability minimax probability machines

Author: AW Marshall, C Cortes, D Bertsimas, GRG Lanckriet, J Shawe-Taylor, John Shawe-Taylor, K Huang, M Marchand, N Alon, RA Fisher, Simon Cousins, V Vapnik, Vladimir Vapnik
Publication venue: 'Springer Science and Business Media LLC'
Publication date

Crossref

Using Linguistic Information and Machine Learning Techniques to Identify Entities from Juridical Documents

Author: A. Tikhonov, C. Cortes, D. Wilkins, E. Bick, E. Schweighofer, F. Borges, G. Salton, J. Cowie, J. Shawe-Taylor, J. Zeleznikow, N. Chomsky, P. Quaresma, S. Brüninghaus, T. Gonçalves, T. Joachims, V. Vapnik
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Information extraction from legal documents is an important and open problem. This paper describes a mixed approach that uses linguistic information and machine learning techniques. In this approach, top-level legal concepts are identified and used for document classification using Support Vector Machines. Named entities, such as locations, organizations, dates, and document references, are identified using semantic information from the output of a natural language parser. This information (legal concepts and named entities) may be used to populate a simple ontology, allowing the enrichment of documents and the creation of high-level legal information retrieval systems. The proposed methodology was applied to a corpus of legal documents from the EUR-Lex site and evaluated. The results obtained were quite good and indicate that this may be a promising approach to the legal information extraction problem.

Crossref

Repositório Científico da Universidade de Évora

The Huller: A Simple and Efficient Online SVM

Author: C. Cortes, C. Gentile, D.J. Crisp, E.G. Gilbert, F. Rosenblatt, K. Crammer, K.P. Bennett, M.A. Aizerman, P. Haffner, T.T. Frieß, V. Vapnik, V.N. Vapnik, Y. Freund, Y. Li
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

Crossref

Analysing Part-of-Speech for Portuguese Text Classification

Author: A. Moschitti, C. Cortes, E. Bick, G. Salton, I. Witten, J. Shawe-Taylor, T. Gonçalves, T. Joachims, T.M. Cover, V. Vapnik
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

Crossref

Statistical Mechanics of Soft Margin Classifiers

Author: A. Buhot, B. Martos, C. Cortes, C. J. C. Burges, E. Gardner, G. Györgyi, H. S. Seung, J. Hertz, J.-I. Inoue, M. B. Gordon, M. Opper, M. Seeger, Mirta B. Gordon, O. Kinouchi, P. Peretto, P. Reimann, P. Sollich, R. Dietrich, R. Meir, S. Risau-Gusman, S.-I. Amari, Sebastian Risau-Gusman, T. Cover, T. L. H. Watkin, T. Uezu, V. Vapnik, W. Krauth
Publication venue: 'American Physical Society (APS)'
Publication date: 18/02/2001

We study the typical learning properties of the recently introduced Soft Margin Classifiers (SMCs), learning realizable and unrealizable tasks, with the tools of Statistical Mechanics. We derive analytically the behaviour of the learning curves in the regime of very large training sets. We obtain exponential and power laws for the decay of the generalization error towards the asymptotic value, depending on the task and on general characteristics of the distribution of stabilities of the patterns to be learned. The optimal learning curves of the SMCs, which give the minimal generalization error, are obtained by tuning the coefficient controlling the trade-off between the error and the regularization terms in the cost function. If the task is realizable by the SMC, the optimal performance is better than that of a hard margin Support Vector Machine and is very close to that of a Bayesian classifier. Comment: 26 pages, 12 figures, submitted to Physical Review

arXiv.org e-Print Archive

Crossref

Robustness and Generalization

We derive generalization bounds for learning algorithms based on their robustness: the property that if a testing sample is "similar" to a training sample, then the testing error is close to the training error. This provides a novel approach, different from the complexity or stability arguments, to study generalization of learning algorithms. We further show that a weak notion of robustness is both sufficient and necessary for generalizability, which implies that robustness is a fundamental property for learning algorithms to work.

arXiv.org e-Print Archive

CiteSeerX

Crossref

ScholarBank@NUS

Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline

Author: A Arcuri, AL Rector, AM Wood, AS Glas, B Kulis, C Cortes, C Sammut, CC Diamond, CD Kidd, CR MacIntyre, DP Lewis, E Koumoundouros, E Rahm, EM Knorr, ES Fisher, GE Box, GM Weber, H Carter, H He, H Meyer, H Quan, HH Hoos, I Yoo, J Andreu-Perez, J Fan, J Zhao, JD Lafferty, JM Bland, JW Graham, K Lange, KP Murphy, LA King, LM Collins, M Azarm-Daigle, M Kantardzic, M Sokolova, MA Stoto, N Oreskes, PB Jensen, PK Lindenauer, PM Visscher, RJ Little, V López, V Sessions, VN Vapnik, W Raghupathi, Y Luo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 26/01/2018

From medical charts to national census, healthcare has traditionally operated under a paper-based paradigm. However, the past decade has marked a long and arduous transformation bringing healthcare into the digital age. Ranging from electronic health records, to digitized imaging and laboratory reports, to public health datasets, healthcare now generates an incredible amount of digital information. Such a wealth of data presents an exciting opportunity for integrated machine learning solutions to address problems across multiple facets of healthcare practice and administration. Unfortunately, the ability to derive accurate and informative insights requires more than the ability to execute machine learning models. Rather, a deeper understanding of the data on which the models are run is imperative for their success. While a significant effort has been undertaken to develop models able to process the volume of data obtained during the analysis of millions of digitalized patient records, it is important to remember that volume represents only one aspect of the data. In fact, drawing on data from an increasingly diverse set of sources, healthcare data presents an incredibly complex set of attributes that must be accounted for throughout the machine learning pipeline. This chapter focuses on highlighting such challenges, and is broken down into three distinct components, each representing a phase of the pipeline. We begin with attributes of the data accounted for during preprocessing, then move to considerations during model building, and end with challenges to the interpretation of model output. For each component, we present a discussion around data as it relates to the healthcare domain and offer insight into the challenges each may impose on the efficiency of machine learning techniques. Comment: Healthcare Informatics, Machine Learning, Knowledge Discovery: 20 Pages, 1 Figure

arXiv.org e-Print Archive

Crossref