238 research outputs found
Estudio de métodos de construcción de ensembles de clasificadores y aplicaciones
La inteligencia artificial se dedica a la creación de sistemas informáticos con un comportamiento inteligente. Dentro de este área el aprendizaje computacional estudia la creación de sistemas que aprenden por sà mismos.
Un tipo de aprendizaje computacional es el aprendizaje supervisado, en el cual, se le proporcionan al sistema tanto las entradas como la salida esperada y el sistema aprende a partir de estos datos. Un sistema de este tipo se denomina clasificador.
En ocasiones ocurre, que en el conjunto de ejemplos que utiliza el sistema para aprender, el número de ejemplos de un tipo es mucho mayor que el número de ejemplos de otro tipo. Cuando esto ocurre se habla de conjuntos desequilibrados.
La combinación de varios clasificadores es lo que se denomina "ensemble", y a menudo ofrece mejores resultados que cualquiera de los miembros que lo forman. Una de las claves para el buen funcionamiento de los ensembles es la diversidad.
Esta tesis, se centra en el desarrollo de nuevos algoritmos de construcción de ensembles, centrados en técnicas de incremento de la diversidad y en los problemas desequilibrados. Adicionalmente, se aplican estas técnicas a la solución de varias problemas industriales.Ministerio de EconomÃa y Competitividad, proyecto TIN-2011-2404
Absent Data Generating Classifier for Imbalanced Class Sizes
We propose an algorithm for two-class classification problems when the training data are imbalanced. This means the number of training instances in one of the classes is so low that the conventional classification algorithms become ineffective in detecting the minority class. We present a modification of the kernel Fisher discriminant analysis such that the imbalanced nature of the problem is explicitly addressed in the new algorithm formulation. The new algorithm exploits the properties of the existing minority points to learn the effects of other minority data points, had they actually existed. The algorithm proceeds iteratively by employing the learned properties and conditional sampling in such a way that it generates sufficient artificial data points for the minority set, thus enhancing the detection probability of the minority class. Implementing the proposed method on a number c©2015 Arash Pourhabib, Bani K. Mallick and Yu Ding. Article preprint--accepted for publication, Feb 201
Application of advanced machine learning techniques to early network traffic classification
The fast-paced evolution of the Internet is drawing a complex context which
imposes demanding requirements to assure end-to-end Quality of Service. The
development of advanced intelligent approaches in networking is envisioning
features that include autonomous resource allocation, fast reaction against
unexpected network events and so on. Internet Network Traffic Classification
constitutes a crucial source of information for Network Management, being decisive
in assisting the emerging network control paradigms. Monitoring traffic flowing
through network devices support tasks such as: network orchestration, traffic
prioritization, network arbitration and cyberthreats detection, amongst others.
The traditional traffic classifiers became obsolete owing to the rapid Internet
evolution. Port-based classifiers suffer from significant accuracy losses due to port
masking, meanwhile Deep Packet Inspection approaches have severe user-privacy
limitations. The advent of Machine Learning has propelled the application of
advanced algorithms in diverse research areas, and some learning approaches have
proved as an interesting alternative to the classic traffic classification approaches.
Addressing Network Traffic Classification from a Machine Learning perspective
implies numerous challenges demanding research efforts to achieve feasible
classifiers. In this dissertation, we endeavor to formulate and solve important
research questions in Machine-Learning-based Network Traffic Classification. As a
result of numerous experiments, the knowledge provided in this research constitutes
an engaging case of study in which network traffic data from two different
environments are successfully collected, processed and modeled.
Firstly, we approached the Feature Extraction and Selection processes providing our
own contributions. A Feature Extractor was designed to create Machine-Learning
ready datasets from real traffic data, and a Feature Selection Filter based on fast
correlation is proposed and tested in several classification datasets. Then, the
original Network Traffic Classification datasets are reduced using our Selection
Filter to provide efficient classification models. Many classification models based on
CART Decision Trees were analyzed exhibiting excellent outcomes in identifying
various Internet applications. The experiments presented in this research comprise
a comparison amongst ensemble learning schemes, an exploratory study on Class
Imbalance and solutions; and an analysis of IP-header predictors for early traffic
classification. This thesis is presented in the form of compendium of JCR-indexed
scientific manuscripts and, furthermore, one conference paper is included.
In the present work we study a wide number of learning approaches employing the
most advance methodology in Machine Learning. As a result, we identify the
strengths and weaknesses of these algorithms, providing our own solutions to
overcome the observed limitations. Shortly, this thesis proves that Machine
Learning offers interesting advanced techniques that open prominent prospects in
Internet Network Traffic Classification.Departamento de TeorÃa de la Señal y Comunicaciones e IngenierÃa TelemáticaDoctorado en TecnologÃas de la Información y las Telecomunicacione
Semi-supervised learning and fairness-aware learning under class imbalance
With the advent of Web 2.0 and the rapid technological advances, there is a plethora of data in every field; however, more data does not necessarily imply more information, rather the quality of data (veracity aspect) plays a key role. Data quality is a major issue, since machine learning algorithms are solely based on historical data to derive novel hypotheses. Data may contain noise, outliers, missing values and/or class labels, and skewed data distributions. The latter case, the so-called class-imbalance problem, is quite old and still affects dramatically machine learning algorithms. Class-imbalance causes classification models to learn effectively one particular class (majority) while ignoring other classes (minority). In extend to this issue, machine learning models that are applied in domains of high societal impact have become biased towards groups of people or individuals who are not well represented within the data. Direct and indirect discriminatory behavior is prohibited by international laws; thus, there is an urgency of mitigating discriminatory outcomes from machine learning algorithms.
In this thesis, we address the aforementioned issues and propose methods that tackle class imbalance, and mitigate discriminatory outcomes in machine learning algorithms. As part of this thesis, we make the following contributions:
• Tackling class-imbalance in semi-supervised learning – The class-imbalance problem is very often encountered in classification. There is a variety of methods that tackle this problem; however, there is a lack of methods that deal with class-imbalance in the semi-supervised learning. We address this problem by employing data augmentation in semi-supervised learning process in order to equalize class distributions. We show that semi-supervised learning coupled with data augmentation methods can overcome class-imbalance propagation and significantly outperform the standard semi-supervised annotation process.
• Mitigating unfairness in supervised models – Fairness in supervised learning has received a lot of attention over the last years. A growing body of pre-, in- and postprocessing approaches has been proposed to mitigate algorithmic bias; however, these methods consider error rate as the performance measure of the machine learning algorithm, which causes high error rates on the under-represented class. To deal with this problem, we propose approaches that operate in pre-, in- and post-processing layers while accounting for all classes. Our proposed methods outperform state-of-the-art methods in terms of performance while being able to mitigate unfair outcomes
The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information ManagementIn remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for
various applications, promoting environmental sustainability and good resource management.
Although, their production continues to be a challenging task. There are various factors
that contribute towards the difficulty of generating accurate, timely updated LULC maps,
both via automatic or photo-interpreted LULC mapping. Data preprocessing, being a
crucial step for any Machine Learning task, is particularly important in the remote sensing
domain due to the overwhelming amount of raw, unlabeled data continuously gathered
from multiple remote sensing missions. However a significant part of the state-of-the-art
focuses on scenarios with full access to labeled training data with relatively balanced class
distributions. This thesis focuses on the challenges found in automatic LULC classification
tasks, specifically in data preprocessing tasks. We focus on the development of novel
Active Learning (AL) and imbalanced learning techniques, to improve ML performance in
situations with limited training data and/or the existence of rare classes. We also show
that much of the contributions presented are not only successful in remote sensing problems,
but also in various other multidisciplinary classification problems. The work presented
in this thesis used open access datasets to test the contributions made in imbalanced
learning and AL. All the data pulling, preprocessing and experiments are made available at
https://github.com/joaopfonseca/publications. The algorithmic implementations are made
available in the Python package ml-research at https://github.com/joaopfonseca/ml-research
Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications
Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. The real data are frequently affected by outliers, uncertain labels, and uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning method (CSL) to deal with the classification of imperfect data. Typically, most traditional approaches for classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with Support Vector Machine, which is a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures to tackle imperfect data along with addressing real problems in quality control and business analytics
Mixed-Integer Projections for Automated Data Correction of EMRs Improve Predictions of Sepsis among Hospitalized Patients
Machine learning (ML) models are increasingly pivotal in automating clinical
decisions. Yet, a glaring oversight in prior research has been the lack of
proper processing of Electronic Medical Record (EMR) data in the clinical
context for errors and outliers. Addressing this oversight, we introduce an
innovative projections-based method that seamlessly integrates clinical
expertise as domain constraints, generating important meta-data that can be
used in ML workflows. In particular, by using high-dimensional mixed-integer
programs that capture physiological and biological constraints on patient
vitals and lab values, we can harness the power of mathematical "projections"
for the EMR data to correct patient data. Consequently, we measure the distance
of corrected data from the constraints defining a healthy range of patient
data, resulting in a unique predictive metric we term as "trust-scores". These
scores provide insight into the patient's health status and significantly boost
the performance of ML classifiers in real-life clinical settings. We validate
the impact of our framework in the context of early detection of sepsis using
ML. We show an AUROC of 0.865 and a precision of 0.922, that surpasses
conventional ML models without such projections
Deep learning for fast and robust medical image reconstruction and analysis
Medical imaging is an indispensable component of modern medical research as well as clinical practice. Nevertheless, imaging techniques such as magnetic resonance imaging (MRI) and computational tomography (CT) are costly and are less accessible to the majority of the world. To make medical devices more accessible, affordable and efficient, it is crucial to re-calibrate our current imaging paradigm for smarter imaging. In particular, as medical imaging techniques have highly structured forms in the way they acquire data, they provide us with an opportunity to optimise the imaging techniques holistically by leveraging data. The central theme of this thesis is to explore different opportunities where we can exploit data and deep learning to improve the way we extract information for better, faster and smarter imaging.
This thesis explores three distinct problems. The first problem is the time-consuming nature of dynamic MR data acquisition and reconstruction. We propose deep learning methods for accelerated dynamic MR image reconstruction, resulting in up to 10-fold reduction in imaging time. The second problem is the redundancy in our current imaging pipeline. Traditionally, imaging pipeline treated acquisition, reconstruction and analysis as separate steps. However, we argue that one can approach them holistically and optimise the entire pipeline jointly for a specific target goal. To this end, we propose deep learning approaches for obtaining high fidelity cardiac MR segmentation directly from significantly undersampled data, greatly exceeding the undersampling limit for image reconstruction. The final part of this thesis tackles the problem of interpretability of the deep learning algorithms. We propose attention-models that can implicitly focus on salient regions in an image to improve accuracy for ultrasound scan plane detection and CT segmentation. More crucially, these models can provide explainability, which is a crucial stepping stone for the harmonisation of smart imaging and current clinical practice.Open Acces
- …