New models and methods for classification and feature selection: a mathematical optimization perspective
The objective of this PhD dissertation is the development of new models for Supervised
Classification and Benchmarking, making use of Mathematical Optimization and Statistical
tools. In particular, we address the fusion of instruments from both disciplines,
with the aim of extracting knowledge from data. In this way, we obtain innovative
methodologies that outperform existing ones, bridging theoretical Mathematics
with real-life problems.
The works developed along this thesis focus on two fundamental methodologies
in Data Science: support vector machines (SVM) and Benchmarking. Regarding
the former, the SVM classifier is based on the search for the separating hyperplane of
maximum margin, and it is written as a convex quadratic problem. In the Benchmarking
context, the goal is to compute the efficiencies of different entities through a non-parametric
deterministic approach. In this thesis we focus on Data Envelopment Analysis
(DEA), which consists of a Linear Programming formulation.
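As an illustration of the first of these two base models, the soft-margin SVM primal can be sketched with plain subgradient descent on the hinge loss (the dataset, step size, and iteration count below are illustrative assumptions, not taken from the thesis, which solves the exact convex quadratic problem):

```python
import numpy as np

# Minimal sketch of the soft-margin SVM primal:
#   minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w.x_i + b))
# solved here by subgradient descent on a synthetic two-class dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    viol = y * (X @ w + b) < 1            # points violating the margin
    gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    gb = -C * y[viol].sum()
    w, b = w - lr * gw, b - lr * gb

acc = np.mean(np.sign(X @ w + b) == y)
print(f"training accuracy: {acc:.2f}")
```

A dedicated QP solver would reach the exact maximum-margin solution; the subgradient loop above only approximates it, which is enough to separate the two well-spaced clusters.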
This dissertation is structured as follows. In Chapter 1 we briefly present the
different challenges this thesis addresses, as well as their state of the art. In the same
vein, the formulations used as base models are presented, together with the
notation used throughout the chapters of this thesis.
In Chapter 2, we tackle the construction of a version of the SVM
that controls misclassification errors. To do this, we incorporate new performance
constraints into the SVM formulation, imposing upper bounds on the misclassification
errors. The resulting formulation is a convex quadratic problem with linear constraints.
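The chapter's exact model enforces the error bounds inside the optimization problem itself. A much cruder stand-in, shown only to illustrate the trade-off between the two classes' error rates, is to train an unconstrained linear classifier and then shift its decision threshold until a chosen bound on one class's training error holds (all data and numbers below are assumptions):

```python
import numpy as np

# Hedged sketch, not the thesis' formulation: bound the training error of
# class -1 by moving the decision threshold of a fixed linear classifier.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.2, (60, 2)), rng.normal(1, 1.2, (60, 2))])
y = np.hstack([-np.ones(60), np.ones(60)])

# unconstrained hinge-loss fit by subgradient descent
w, b = np.zeros(2), 0.0
for _ in range(800):
    viol = y * (X @ w + b) < 1
    w -= 0.01 * (w - (y[viol, None] * X[viol]).sum(axis=0))
    b += 0.01 * y[viol].sum()

max_neg_err = 0.05                    # required bound on class -1 errors
scores = X @ w + b
neg_scores = np.sort(scores[y < 0])
k = int(np.ceil(len(neg_scores) * (1 - max_neg_err)))
t = neg_scores[min(k, len(neg_scores) - 1)]   # shifted threshold
pred = np.where(scores > t, 1.0, -1.0)

neg_err = np.mean(pred[y < 0] == 1)
pos_err = np.mean(pred[y > 0] == -1)
print(f"class -1 error {neg_err:.2f}, class +1 error {pos_err:.2f}")
```

Note how, exactly as the chapter describes, the bound on one class is typically achieved at the expense of the other class's error rate.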
Chapter 3 continues with the SVM as the basis, and addresses the problem of providing
not only a hard label for each individual in the dataset, but also a
class probability estimate. Furthermore, confidence intervals for both the score values
and the posterior class probabilities are provided. In addition, as in the previous
chapter, we carry the obtained results over to the setting in which misclassification errors are
considered. For this purpose, we have to solve either a convex quadratic problem
or a convex quadratic problem with linear constraints and integer variables, always
taking advantage of the information produced during the parameter tuning of the SVM, which is usually wasted.
Based on the results in Chapter 2, in Chapter 4 we handle the problem of feature selection, again taking the misclassification errors into account. In order to build this
technique, the feature selection is embedded in the classifier model. The process is
divided into two steps. In the first step, feature selection is performed while, at
the same time, the data are separated via a hyperplane or linear classifier, subject to the
performance constraints. In the second step, we build the maximum-margin classifier
(SVM) using the features selected in the first step, again taking
the same performance constraints into account.
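The two-step scheme can be sketched as follows (a simplification, not the thesis' model: the performance constraints are omitted, and an l1 penalty stands in for the embedded selection mechanism). Step 1 fits an l1-penalized hinge-loss classifier whose penalty drives weights of irrelevant features to zero; step 2 refits a plain hinge-loss classifier on the surviving features only:

```python
import numpy as np

# Step 1: l1-penalized linear classifier selects features.
# Step 2: plain hinge-loss (SVM-style) refit on the selected features.
rng = np.random.default_rng(2)
n = 200
X_inf = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
X = np.hstack([X_inf, rng.normal(0, 1, (n, 8))])   # plus 8 noise features
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

def hinge_fit(X, y, lam=0.0, lr=0.05, iters=2000):
    """Subgradient descent on mean hinge loss, with an l1 proximal step."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        viol = y * (X @ w + b) < 1
        w += lr * (y[viol, None] * X[viol]).sum(axis=0) / len(y)
        b += lr * y[viol].sum() / len(y)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox_l1
    return w, b

w1, _ = hinge_fit(X, y, lam=0.05)                  # step 1: select
selected = np.flatnonzero(np.abs(w1) > 1e-6)
w2, b2 = hinge_fit(X[:, selected], y)              # step 2: refit
acc = np.mean(np.sign(X[:, selected] @ w2 + b2) == y)
print(f"kept features {selected.tolist()}, accuracy {acc:.2f}")
```

On this synthetic data the penalty typically retains the two informative features and discards most of the noise ones, after which the refit classifier separates the classes well.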
In Chapter 5, we move to the problem of Benchmarking, where the practices of
different entities are compared through the products or services they provide, with
the aim of making changes or improvements in each of them. Concretely,
in this chapter we propose a Mixed Integer Linear Programming formulation based on
Data Envelopment Analysis (DEA), with the aim of performing feature selection and thus improving
the interpretability and comprehension of the obtained model and efficiencies.
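The base model that this chapter extends is the classical input-oriented CCR DEA linear program. The sketch below solves it for a made-up dataset of five units with two inputs and one output (this is only the standard textbook model, not the thesis' mixed-integer feature-selection extension):

```python
import numpy as np
from scipy.optimize import linprog

# Input-oriented CCR envelopment model for unit o:
#   min theta  s.t.  sum_j lambda_j * x_j <= theta * x_o,
#                    sum_j lambda_j * y_j >= y_o,  lambda >= 0.
X = np.array([[2.0, 3.0], [4.0, 2.0], [4.0, 6.0],
              [6.0, 7.0], [3.0, 5.0]])              # inputs (made-up)
Y = np.array([[1.0], [1.0], [1.0], [1.0], [1.0]])   # outputs (made-up)

def ccr_efficiency(o):
    n, m = X.shape
    s = Y.shape[1]
    c = np.r_[1.0, np.zeros(n)]                 # variables: theta, lambdas
    A_in = np.hstack([-X[[o]].T, X.T])          # input constraints
    A_out = np.hstack([np.zeros((s, 1)), -Y.T]) # output constraints
    res = linprog(c, A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.r_[np.zeros(m), -Y[o]],
                  bounds=[(None, None)] + [(0, None)] * n)
    return res.x[0]

effs = [round(ccr_efficiency(o), 3) for o in range(len(X))]
print("CCR efficiencies:", effs)
```

Units on the efficient frontier score 1.0; dominated units receive a score below 1 indicating the proportional input contraction the frontier would allow.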
Finally, in Chapter 6 we collect the conclusions of this thesis as well as future lines
of research.
Statistical models in pharmacokinetics and pharmacodynamics
Universidad de Sevilla. Grado en Matemática
Constrained support vector machines: theory and applications to health science
In recent years, data science has become a very important tool for handling data, as well as for discovering patterns and generating information that is useful for decision making. One of the most important tasks in data science is supervised classification, which has been applied successfully in many areas, such as biology and medicine. In this work we focus on Support Vector Machines (SVM), introduced by Vapnik in the early 1990s and nowadays among the most widely used methods in supervised classification. First, we briefly review the general theory of SVMs, focusing on the binary case and giving a brief overview of the multiclass case. We then present a new formulation in which additional constraints are added in order to guarantee minimum values of certain performance measures, such as the correct classification probabilities. In addition, experiments are carried out using the statistical software R, as well as AMPL.
Universidad de Sevilla. Máster Universitario en Matemática
Cost-sensitive probabilistic predictions for support vector machines
Support vector machines (SVMs) are widely used and constitute one of the most
thoroughly studied machine learning models for two-class classification.
Classification in SVM is based on a score procedure, yielding a deterministic
classification rule, which can be transformed into a probabilistic rule (as
implemented in off-the-shelf SVM libraries), but is not probabilistic in
nature. On the other hand, the tuning of the regularization parameters in SVM
is known to require a high computational effort, and it generates pieces of
information that are not fully exploited, since they are not used to build a
probabilistic classification rule. In this paper we propose a novel approach to
generate probabilistic outputs for the SVM. The new method has the following
three properties. First, it is designed to be cost-sensitive, and thus the
different importance of sensitivity (or true positive rate, TPR) and
specificity (true negative rate, TNR) is readily accommodated in the model. As
a result, the model can deal with imbalanced datasets, which are common in
operational business problems such as churn prediction or credit scoring. Second,
the SVM is embedded in an ensemble method to improve its performance, making
use of the valuable information generated in the parameter tuning process.
Finally, the probability estimation is done via bootstrap estimates, avoiding
the parametric models used by competing approaches. Numerical tests on a wide
range of datasets show the advantages of our approach over benchmark
procedures.
Comment: European Journal of Operational Research (2023).
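The bootstrap idea behind the third property can be sketched in a few lines (heavily simplified: a plain linear hinge-loss classifier stands in for the tuned SVM ensemble, and cost-sensitivity is omitted; all data are synthetic assumptions):

```python
import numpy as np

# Train one classifier per bootstrap resample and estimate P(y=+1|x)
# as the fraction of resampled classifiers labelling x positive.
rng = np.random.default_rng(3)
n = 120
X = np.vstack([rng.normal(-1, 1, (n // 2, 2)), rng.normal(1, 1, (n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

def hinge_fit(X, y, lr=0.01, iters=400):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        viol = y * (X @ w + b) < 1
        w += lr * ((y[viol, None] * X[viol]).sum(axis=0) / len(y) - 0.01 * w)
        b += lr * y[viol].sum() / len(y)
    return w, b

B, votes = 30, np.zeros(n)
for _ in range(B):
    idx = rng.integers(0, n, n)               # bootstrap resample
    w, b = hinge_fit(X[idx], y[idx])
    votes += (X @ w + b > 0)
proba = votes / B                             # bootstrap estimate of P(y=+1|x)

print(f"mean P(+1) on true positives: {proba[y > 0].mean():.2f}")
```

Because the probabilities are empirical vote fractions, no parametric link function (such as a sigmoid) needs to be assumed, which mirrors the nonparametric flavour of the paper's proposal.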
Cost-sensitive feature selection for support vector machines
Feature Selection (FS) is a crucial procedure in Data Science tasks such as
Classification, since it identifies the relevant variables, thus making the classification procedures more interpretable and more effective by reducing noise and data overfitting. The relevance of features in a classification procedure is linked to the fact that misclassification costs are frequently asymmetric, since false positive and false negative cases may have very different consequences. However, off-the-shelf FS procedures seldom take such cost-sensitivity of errors into account. In this paper we propose a mathematical-optimization-based FS procedure embedded in one of the most popular classification procedures, namely Support Vector Machines (SVM), accommodating asymmetric misclassification costs. The key idea is to replace the traditional margin maximization by minimizing the number of features selected, while imposing upper bounds on the false positive and false negative rates. The problem is written as an integer linear problem plus a convex quadratic problem for SVM with both linear and radial kernels. The reported numerical experience demonstrates the usefulness of the proposed FS procedure. Indeed, our results on benchmark data sets show that a substantial decrease in the number of features is obtained, whilst the desired trade-off between false positive and false negative rates is achieved.
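The paper's key idea, minimizing the number of selected features subject to bounds on the false positive and false negative rates, is solved there as an integer linear program. As a loose illustration only, the greedy forward search below grows a feature set until both training-error bounds hold, using a simple nearest-centroid classifier in place of the SVM (every name and number is an assumption of this sketch):

```python
import numpy as np

# Greedy stand-in for "fewest features subject to FPR/FNR bounds".
rng = np.random.default_rng(4)
n = 200
X = np.hstack([
    np.vstack([rng.normal(-1.5, 1, (n // 2, 2)),
               rng.normal(1.5, 1, (n // 2, 2))]),   # 2 informative features
    rng.normal(0, 1, (n, 6)),                       # 6 irrelevant features
])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])
max_fpr, max_fnr = 0.10, 0.10

def rates(feats):
    """Training FPR/FNR of a nearest-centroid rule on the chosen features."""
    Z = X[:, feats]
    mu_n, mu_p = Z[y < 0].mean(axis=0), Z[y > 0].mean(axis=0)
    pred = np.where(((Z - mu_p) ** 2).sum(1) < ((Z - mu_n) ** 2).sum(1), 1, -1)
    return np.mean(pred[y < 0] == 1), np.mean(pred[y > 0] == -1)

chosen, remaining = [], list(range(X.shape[1]))
while remaining:
    best = min(remaining, key=lambda j: sum(rates(chosen + [j])))
    chosen.append(best)
    remaining.remove(best)
    fpr, fnr = rates(chosen)
    if fpr <= max_fpr and fnr <= max_fnr:
        break

print(f"selected {sorted(chosen)} with FPR {fpr:.2f}, FNR {fnr:.2f}")
```

Unlike the paper's exact formulation, a greedy search carries no optimality guarantee; it only conveys the shape of the problem: few features, rate bounds satisfied.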
Diseño para el consumo cultural, la innovación y la inclusión social
This volume presents various research works that share design proposals grounded in culture, inclusion, and social innovation, developed by national and international researchers affiliated with various universities and graduate programs.
On sparse ensemble methods: an application to short-term predictions of the evolution of COVID-19
Since the seminal paper by Bates and Granger in 1969, a vast number of ensemble methods that combine different base regressors to generate a unique one have been proposed in the literature. The so-obtained regressor may have better accuracy than its components, but at the same time it may overfit, it may be distorted by base regressors with low accuracy, and it may be too complex to understand and explain. This paper proposes and studies a novel Mathematical Optimization model to build a sparse ensemble, which trades off the accuracy of the ensemble and the number of base regressors used. The latter is controlled by means of a regularization term that penalizes regressors with a poor individual performance. Our approach is flexible enough to incorporate desirable properties one may wish the ensemble to have, such as controlling the performance of the ensemble on critical groups of records, or the costs associated with the base regressors involved in the ensemble. We illustrate our approach with real data sets arising in the COVID-19 context.
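The sparse-ensemble idea can be illustrated with a toy sketch (not the paper's model: here an l1 penalty on the combination weights plays the role of the regularization term, and the base regressors are fabricated for the example):

```python
import numpy as np

# Learn combination weights over base-regressor predictions by least
# squares with an l1 proximal step, so weak regressors get weight zero.
rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(0, 1, n)
target = np.sin(2 * np.pi * x)

preds = np.column_stack([
    target + rng.normal(0, 0.10, n),      # good base regressor
    target + rng.normal(0, 0.15, n),      # good base regressor
    rng.normal(0, 1, n),                  # pure-noise regressor
    np.full(n, target.mean()),            # constant regressor
])

lam, lr = 0.05, 0.001
beta = np.zeros(preds.shape[1])
for _ in range(3000):
    grad = preds.T @ (preds @ beta - target) / n
    beta -= lr * grad
    beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)  # prox_l1

print("ensemble weights:", np.round(beta, 3))
```

The two accurate base regressors end up carrying essentially all of the weight, while the noise and constant regressors are driven out of the ensemble, which is the sparsity behaviour the paper's regularization term is designed to induce.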
On support vector machines under a multiple-cost scenario
Support vector machine (SVM) is a powerful tool in binary classification, known to
attain excellent misclassification rates. On the other hand, many realworld classification
problems, such as those found in medical diagnosis, churn or fraud prediction,
involve misclassification costs which may be different in the different classes. However,
it may be hard for the user to provide precise values for such misclassification
costs, whereas it may be much easier to identify acceptable misclassification rates
values. In this paper we propose a novel SVM model in which misclassification costs
are considered by incorporating performance constraints in the problem formulation.
Specifically, our aim is to seek the hyperplane with maximal margin yielding misclassification
rates below given threshold values. Such maximal margin hyperplane
is obtained by solving a quadratic convex problem with linear constraints and integer
variables. The reported numerical experience shows that our model gives the user control
on the misclassification rates in one class (possibly at the expense of an increase
in misclassification rates for the other class) and is feasible in terms of running times
Population pharmacokinetics of colistin: implications for clinical use for Gram-negative pathogens
The objective of this study was to characterize the pharmacokinetics of colistin methanesulphonate (CMS) and colistin in critically ill patients
following the administration of a 4.5 MU CMS loading dose followed by 3 MU CMS every 8 h. A population PK model and Monte Carlo simulation were used to calculate the probability of target attainment (PTA) against Acinetobacter baumannii and Pseudomonas aeruginosa by considering a range of MIC values seen in the clinic.
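The PTA calculation itself can be illustrated with a deliberately simplified Monte Carlo sketch (all pharmacokinetic numbers below are invented for the example, and a one-compartment steady-state approximation stands in for the study's actual population PK model for CMS and colistin):

```python
import numpy as np

# Draw clearance values from a log-normal population distribution, compute
# the average steady-state concentration Css = daily dose / (CL * 24 h),
# and report the fraction of simulated patients with Css/MIC >= 4.
rng = np.random.default_rng(6)
n_sim = 10_000
dose_per_day = 270.0                                 # mg/day, assumed
cl = np.exp(rng.normal(np.log(3.0), 0.4, n_sim))     # L/h, assumed CL
css = dose_per_day / (cl * 24.0)                     # mg/L

mics = [0.25, 0.5, 1.0, 2.0]
ptas = [float(np.mean(css >= 4.0 * mic)) for mic in mics]
for mic, pta in zip(mics, ptas):
    print(f"MIC {mic:4.2f} mg/L -> PTA {pta:.2f}")
```

As in the study, the PTA decreases as the MIC rises, which is what makes the clinically observed MIC range the decisive input of the analysis.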