325 research outputs found
A preliminary approach to the multilabel classification problem of Portuguese juridical documents
Portuguese juridical documents from Supreme Courts and the Attorney General’s Office are manually classified by juridical experts into a set of classes belonging to a taxonomy of concepts. This paper proposes a preliminary approach to developing techniques that automatically classify these juridical documents. The basic strategy is to integrate natural language processing techniques with machine learning ones. Support Vector Machines (SVM) are used as the learning algorithm, and the obtained results are presented and compared with other approaches, such as C4.5 and Naive Bayes.
Generalization properties of finite size polynomial Support Vector Machines
The learning properties of finite size polynomial Support Vector Machines are
analyzed in the case of realizable classification tasks. The normalization of
the high order features acts as a squeezing factor, introducing a strong
anisotropy in the patterns distribution in feature space. As a function of the
training set size, the corresponding generalization error presents a crossover,
more or less abrupt depending on the distribution's anisotropy and on the task
to be learned, between a fast-decreasing and a slowly decreasing regime. This
behaviour corresponds to the stepwise decrease found by Dietrich et al. [Phys.
Rev. Lett. 82 (1999) 2975-2978] in the thermodynamic limit. The theoretical
results are in excellent agreement with the numerical simulations.
Comment: 12 pages, 7 figures
Statistical Mechanics of Support Vector Networks
Using methods of Statistical Physics, we investigate the generalization
performance of support vector machines (SVMs), which have been recently
introduced as a general alternative to neural networks. For nonlinear
classification rules, the generalization error saturates on a plateau, when the
number of examples is too small to properly estimate the coefficients of the
nonlinear part. When trained on simple rules, we find that SVMs overfit only
weakly. The performance of SVMs is strongly enhanced, when the distribution of
the inputs has a gap in feature space.
Comment: REVTeX, 4 pages, 2 figures, accepted by Phys. Rev. Lett. (typos corrected)
Using Linguistic Information and Machine Learning Techniques to Identify Entities from Juridical Documents
Information extraction from legal documents is an important and open problem. A mixed approach, using linguistic information and machine learning techniques, is described in this paper. In this approach, top-level legal concepts are identified and used for document classification using Support Vector Machines. Named entities, such as locations, organizations, dates, and document references, are identified using semantic information from the output of a natural language parser. This information (legal concepts and named entities) may be used to populate a simple ontology, allowing the enrichment of documents and the creation of high-level legal information retrieval systems.
The proposed methodology was applied to a corpus of legal documents from the EUR-Lex site and evaluated. The obtained results were quite good and indicate that this may be a promising approach to the legal information extraction problem.
Statistical Mechanics of Soft Margin Classifiers
We study the typical learning properties of the recently introduced Soft
Margin Classifiers (SMCs), learning realizable and unrealizable tasks, with the
tools of Statistical Mechanics. We derive analytically the behaviour of the
learning curves in the regime of very large training sets. We obtain
exponential and power laws for the decay of the generalization error towards
the asymptotic value, depending on the task and on general characteristics of
the distribution of stabilities of the patterns to be learned. The optimal
learning curves of the SMCs, which give the minimal generalization error, are
obtained by tuning the coefficient controlling the trade-off between the error
and the regularization terms in the cost function. If the task is realizable by
the SMC, the optimal performance is better than that of a hard margin Support
Vector Machine and is very close to that of a Bayesian classifier.
Comment: 26 pages, 12 figures, submitted to Physical Review
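For orientation, the trade-off coefficient mentioned in this abstract can be illustrated with the standard textbook soft-margin formulation. This is a sketch under assumptions: the symbols $\mathbf{w}$ (weight vector), $\xi_i$ (slack variables), $C$ (trade-off coefficient), and $N$ (number of training patterns) are conventional notation and are not taken from the paper, whose exact cost function may differ.

```latex
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad
\frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{N} \xi_i
\qquad \text{subject to} \qquad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i) \ge 1 - \xi_i, \quad
\xi_i \ge 0, \quad i = 1,\dots,N.
```

In this form, a larger $C$ penalizes margin violations more heavily, and the hard margin Support Vector Machine is recovered in the limit $C \to \infty$, where every $\xi_i$ is forced to zero.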
Robustness and Generalization
We derive generalization bounds for learning algorithms based on their
robustness: the property that if a testing sample is "similar" to a training
sample, then the testing error is close to the training error. This provides a
novel approach, different from the complexity or stability arguments, to study
generalization of learning algorithms. We further show that a weak notion of
robustness is both sufficient and necessary for generalizability, which implies
that robustness is a fundamental property for learning algorithms to work.
Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline
From medical charts to national census, healthcare has traditionally operated
under a paper-based paradigm. However, the past decade has marked a long and
arduous transformation bringing healthcare into the digital age. Healthcare now
generates an incredible amount of digital information, ranging from electronic
health records to digitized imaging and laboratory reports to public health
datasets. Such a wealth of data presents an exciting opportunity for
integrated machine learning solutions to address problems across multiple
facets of healthcare practice and administration. Unfortunately, the ability to
derive accurate and informative insights requires more than the ability to
execute machine learning models. Rather, a deeper understanding of the data on
which the models are run is imperative for their success. While a significant
effort has been undertaken to develop models able to process the volume of data
obtained during the analysis of millions of digitalized patient records, it is
important to remember that volume represents only one aspect of the data. In
fact, drawing on data from an increasingly diverse set of sources, healthcare
data presents an incredibly complex set of attributes that must be accounted
for throughout the machine learning pipeline. This chapter focuses on
highlighting such challenges, and is broken down into three distinct
components, each representing a phase of the pipeline. We begin with attributes
of the data accounted for during preprocessing, then move to considerations
during model building, and end with challenges to the interpretation of model
output. For each component, we present a discussion around data as it relates
to the healthcare domain and offer insight into the challenges each may impose
on the efficiency of machine learning techniques.
Comment: Healthcare Informatics, Machine Learning, Knowledge Discovery: 20 pages, 1 figure