Noisy multi-label semi-supervised dimensionality reduction
Noisily labeled data represent a rich source of information that is often
easily accessible and cheap to obtain, but label noise may also have many
negative consequences if not accounted for. How to fully utilize noisy labels
has been studied extensively within the framework of standard supervised
machine learning over a period of several decades. However, very little
research has been conducted on solving the challenge posed by noisy labels in
non-standard settings. This includes situations where only a fraction of the
samples are labeled (semi-supervised) and each high-dimensional sample is
associated with multiple labels. In this work, we present a novel
semi-supervised and multi-label dimensionality reduction method that
effectively utilizes information from both noisy multi-labels and unlabeled
data. With the proposed Noisy multi-label semi-supervised dimensionality
reduction (NMLSDR) method, the noisy multi-labels are denoised and unlabeled
data are labeled simultaneously via a specially designed label propagation
algorithm. NMLSDR then learns a projection matrix for reducing the
dimensionality by maximizing the dependence between the enlarged and denoised
multi-label space and the features in the projected space. Extensive
experiments on synthetic data, benchmark datasets, as well as a real-world case
study, demonstrate the effectiveness of the proposed algorithm and show that it
outperforms state-of-the-art multi-label feature extraction algorithms.
Comment: 38 pages
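The propagation step can be pictured with a generic graph-based label-propagation routine of the kind NMLSDR builds on. The RBF affinity, symmetric normalisation, and hyper-parameters below (`alpha`, `sigma`) are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def propagate_labels(X, Y, alpha=0.9, sigma=1.0, n_iter=50):
    """Graph-based propagation of soft multi-labels.

    X : (n, d) features; Y : (n, q) initial label matrix, with
    all-zero rows for unlabeled samples. Returns the propagated
    soft label matrix F, which both labels the unlabeled points
    and smooths (denoises) the given labels.
    """
    # RBF affinity matrix with zero diagonal
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # symmetric normalisation S = D^{-1/2} W D^{-1/2}
    dinv = 1.0 / np.sqrt(np.maximum(W.sum(1), 1e-12))
    S = dinv[:, None] * W * dinv[None, :]
    # fixed-point iteration F <- alpha * S F + (1 - alpha) * Y
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F
```

Nearby samples exchange label mass through S while the (1 − alpha) term anchors each row to its initial labels, which is what lets the same iteration denoise noisy labels and fill in missing ones.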
Supervised Feature Space Reduction for Multi-Label Nearest Neighbors
With the ability to address many real-world problems, multi-label classification has received considerable attention in recent years, and the instance-based ML-kNN classifier is today considered one of the most efficient. But it is sensitive to noisy and redundant features, and its performance decreases with increasing data dimensionality. To overcome these problems, dimensionality reduction is an alternative, but current methods optimize reduction objectives which ignore the impact on the ML-kNN classification. We here propose ML-ARP, a novel dimensionality reduction algorithm which, using a variable neighborhood search meta-heuristic, learns a linear projection of the feature space that specifically optimizes the ML-kNN classification loss. Numerical comparisons have confirmed that ML-ARP outperforms ML-kNN without data processing and four standard multi-label dimensionality reduction algorithms.
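The core idea, a projection evaluated directly by the multi-label k-NN loss it will be used with, can be sketched in a few lines. The leave-one-out Hamming loss and the random hill-climbing below are simplifying assumptions standing in for ML-ARP's variable neighborhood search:

```python
import numpy as np

def knn_hamming_loss(Z, Y, k=3):
    """Hamming loss of a majority-vote k-NN multi-label classifier,
    evaluated leave-one-out on the projected data Z."""
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)          # exclude the sample itself
    idx = np.argsort(D, axis=1)[:, :k]   # k nearest neighbours
    pred = (Y[idx].mean(axis=1) >= 0.5).astype(int)
    return (pred != Y).mean()

def local_search_projection(X, Y, r=2, n_steps=100, step=0.1, seed=0):
    """Hill-climb a linear projection P (d x r) that lowers the k-NN
    Hamming loss -- a crude stand-in for a variable neighbourhood
    search meta-heuristic."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((X.shape[1], r))
    best = knn_hamming_loss(X @ P, Y)
    for _ in range(n_steps):
        Q = P + step * rng.standard_normal(P.shape)  # perturb
        loss = knn_hamming_loss(X @ Q, Y)
        if loss <= best:                             # accept improvement
            P, best = Q, loss
    return P, best
```

Unlike reduction objectives that are agnostic to the downstream classifier, the search criterion here is exactly the classification loss the projection will be judged by.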
API design for machine learning software: experiences from the scikit-learn project
Scikit-learn is an increasingly popular machine learning library. Written
in Python, it is designed to be simple and efficient, accessible to
non-experts, and reusable in various contexts. In this paper, we present and
discuss our design choices for the application programming interface (API) of
the project. In particular, we describe the simple and elegant interface shared
by all learning and processing units in the library and then discuss its
advantages in terms of composition and reusability. The paper also comments on
implementation details specific to the Python ecosystem and analyzes obstacles
faced by users and developers of the library.
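The shared interface described above is visible in a few lines: every estimator implements fit(), transformers add transform(), predictors add predict(), and a Pipeline composes them into a single estimator. The dataset and hyper-parameters here are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Heterogeneous steps compose freely because they share the same API:
# the transformer's fit/transform feeds the predictor's fit/predict.
pipe = Pipeline([
    ("reduce", PCA(n_components=5)),              # transformer
    ("clf", LogisticRegression(max_iter=1000)),   # predictor
])
pipe.fit(X, y)
acc = pipe.score(X, y)   # training accuracy of the composed model
```

The composed Pipeline itself satisfies the estimator contract, so it can in turn be cross-validated or grid-searched like any single estimator.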
DeepAMR for predicting co-occurrent resistance of Mycobacterium tuberculosis.
MOTIVATION: Resistance co-occurrence within first-line anti-tuberculosis (TB) drugs is a common phenomenon. Existing methods based on genetic data analysis of Mycobacterium tuberculosis (MTB) have been able to predict resistance of MTB to individual drugs, but have not considered resistance co-occurrence and cannot capture the latent structure of genomic data that corresponds to lineages.
RESULTS: We used a large cohort of TB patients from 16 countries across six continents, for which whole-genome sequences for each isolate and the associated phenotypes to anti-TB drugs were obtained using drug susceptibility testing recommended by the World Health Organization. We then proposed an end-to-end multi-task model with a deep denoising auto-encoder (DeepAMR) for multiple drug classification, and developed DeepAMR_cluster, a clustering variant based on DeepAMR, for learning clusters in the latent space of the data. The results showed that DeepAMR outperformed the baseline model and four machine learning models, with mean AUROC from 94.4% to 98.7%, for predicting resistance to four first-line drugs [i.e. isoniazid (INH), ethambutol (EMB), rifampicin (RIF), pyrazinamide (PZA)], multi-drug-resistant TB (MDR-TB) and pan-susceptible TB (PANS-TB: MTB that is susceptible to all four first-line anti-TB drugs). In the case of INH, EMB, PZA and MDR-TB, DeepAMR achieved its best mean sensitivity of 94.3%, 91.5%, 87.3% and 96.3%, respectively, while in the case of RIF and PANS-TB it generated 94.2% and 92.2% sensitivity, lower than the baseline model by 0.7% and 1.9%, respectively. t-SNE visualization shows that DeepAMR_cluster captures lineage-related clusters in the latent space.
AVAILABILITY AND IMPLEMENTATION: The details of the source code are provided at http://www.robots.ox.ac.uk/?davidc/code.php.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
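As a rough illustration of the multi-task denoising idea (not the published DeepAMR architecture), the toy model below corrupts its input, reconstructs it through a shared latent code, and jointly trains a multi-label head on the same code; layer sizes, learning rate, and initialisation are arbitrary assumptions:

```python
import numpy as np

def train_dae_multitask(X, Y, k=3, noise=0.3, lr=0.05, n_iter=300, seed=0):
    """Minimal denoising autoencoder with an extra multi-label head,
    trained jointly by gradient descent on a shared latent code."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    q = Y.shape[1]
    W1 = 0.1 * rng.standard_normal((d, k))   # encoder
    W2 = 0.1 * rng.standard_normal((k, d))   # decoder (reconstruction task)
    W3 = 0.1 * rng.standard_normal((k, q))   # multi-label head
    losses = []
    for _ in range(n_iter):
        Xn = X + noise * rng.standard_normal(X.shape)  # corrupt input
        H = Xn @ W1                                    # shared latent code
        R = H @ W2                                     # reconstruction
        P = 1.0 / (1.0 + np.exp(-(H @ W3)))            # label probabilities
        rec = ((R - X) ** 2).mean()                    # denoising loss
        bce = -(Y * np.log(P + 1e-9)
                + (1 - Y) * np.log(1 - P + 1e-9)).mean()
        losses.append(rec + bce)
        # gradients of the joint loss; both tasks back-propagate into H
        dR = 2.0 * (R - X) / R.size
        dP = (P - Y) / P.size
        dH = dR @ W2.T + dP @ W3.T
        W2 -= lr * H.T @ dR
        W3 -= lr * H.T @ dP
        W1 -= lr * Xn.T @ dH
    return W1, W2, W3, losses
```

Because both losses share the encoder, the latent code is shaped jointly by reconstruction and by the labels, which is the mechanism that lets a multi-task model exploit resistance co-occurrence.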
Modular Autoencoders for Ensemble Feature Extraction
We introduce the concept of a Modular Autoencoder (MAE), capable of learning
a set of diverse but complementary representations from unlabelled data, that
can later be used for supervised tasks. The learning of the representations is
controlled by a trade-off parameter, and we show on six benchmark datasets
that the optimum lies between two extremes, a set of smaller, independent
autoencoders each with low capacity and a single monolithic encoding, and that
it outperforms an appropriate baseline. In the present paper we explore the special case of
linear MAE, and derive an SVD-based algorithm which converges several orders of
magnitude faster than gradient descent.
Comment: 18 pages, 8 figures, to appear in a special issue of The Journal of
Machine Learning Research (vol. 44, Dec 2015).
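For the linear special case mentioned at the end, a single linear autoencoder has a closed-form optimum given by the SVD of the centred data, which is why an SVD-based algorithm can replace gradient descent entirely. The sketch below shows that classical single-autoencoder solution, not the modular coupling itself:

```python
import numpy as np

def linear_autoencoder_svd(X, k):
    """Closed-form optimal rank-k linear autoencoder: project onto the
    top-k right singular vectors of the centred data (the classical
    SVD/PCA solution that gradient descent would converge to)."""
    Xc = X - X.mean(0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T          # encoder (d x k); optimal decoder is W.T
    return W
```

Reconstruction error ||Xc - Xc W W^T||^2 is minimised over all rank-k linear maps and shrinks monotonically as k grows, so no iterative optimisation is needed in this case.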
Latent representation for the characterisation of mental diseases
International Mention in the doctoral degree. Machine learning (ML) techniques are becoming crucial in the field of health and, in particular,
in the analysis of mental diseases. These are usually studied with neuroimaging, which is
characterised by a large number of input variables compared to the number of samples available.
The main objective of this PhD thesis is to propose different ML techniques to analyse mental
diseases from neuroimaging data, including different extensions of these models to adapt
them to the neuroscience scenario. In particular, the thesis focuses on brain-imaging latent
representations, since they endow the problem with a reduced, low-dimensional
representation while offering better insight into the internal relations between the disease and
the available data. The aim is thus to provide interpretable
results that are competitive with the state of the art in the analysis of mental diseases.
This thesis starts by proposing a model based on classic latent representation formulations:
Regularised Bagged Canonical Correlation Analysis (RB-CCA), which relies on a bagging
process to obtain the relevance of each brain-imaging voxel. The learnt relevance is combined with a
statistical test to obtain a selection of features. Moreover, the proposal obtains a class-wise
selection which, in turn, further improves the analysis of the effect of each brain area on the
stages of the mental disease. In addition, RB-CCA uses the relevance measure to guide the
feature extraction process, penalising the least informative voxels when obtaining the
low-dimensional representation. Results obtained on two databases for the characterisation of
Alzheimer’s disease and Attention Deficit Hyperactivity Disorder show that the model is able to
perform as well as or better than the baselines while providing interpretable solutions.
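At the core of RB-CCA sits regularised CCA. A sketch of its first canonical pair (ridge-regularised covariances, solved via a whitened SVD) is below; the bagging and relevance machinery of the thesis are not reproduced, and `lam` is an illustrative ridge parameter:

```python
import numpy as np

def regularised_cca(X, Y, lam=1e-2):
    """First pair of canonical directions of X and Y with ridge
    regularisation -- the classical core that RB-CCA builds on
    (a sketch, not the thesis implementation)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + lam * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + lam * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)          # symmetric PD matrix
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # whiten both views, then the top singular pair gives the
    # canonical directions and the canonical correlation
    K = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(K)
    a = inv_sqrt(Cxx) @ U[:, 0]
    b = inv_sqrt(Cyy) @ Vt[0]
    return a, b, s[0]
```

The ridge term is what keeps this solvable when, as in neuroimaging, the number of voxels far exceeds the number of samples and the raw covariance matrices are singular.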
Subsequently, this thesis continues with a second model that uses Bayesian approximations
to obtain a latent representation. Specifically, this model focuses on providing different
functionalities to build a common representation from different data sources, each with its own
particularities. For this purpose, the proposed generative model, Sparse Semi-supervised
Heterogeneous Interbattery Bayesian Factor Analysis (SSHIBA), can learn feature relevances to
perform feature selection, as well as automatically select the number of latent factors. In
addition, it can model heterogeneous data (real, multi-label and categorical), work with kernels,
and use a semi-supervised formulation, which naturally imputes missing values by sampling
from the learnt distributions.
Results using this model demonstrate the versatility of the formulation, which allows these extensions
to be combined interchangeably, expanding the scenarios in which the model can be
applied and improving the interpretability of the results.
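SSHIBA belongs to the family of latent factor models. Its simplest relative, maximum-likelihood probabilistic PCA (Tipping & Bishop), already shows how a latent subspace and a noise level are recovered in closed form; the sketch below is that classical solution, not SSHIBA's sparse, multi-view Bayesian formulation:

```python
import numpy as np

def ppca_ml(X, k):
    """Maximum-likelihood probabilistic PCA: the data covariance is
    modelled as W W^T + sigma2 * I, and the ML solution falls out of
    the eigendecomposition of the sample covariance."""
    Xc = X - X.mean(0)
    C = Xc.T @ Xc / len(X)
    w, V = np.linalg.eigh(C)
    w, V = w[::-1], V[:, ::-1]            # eigenvalues in descending order
    sigma2 = w[k:].mean()                 # noise = mean discarded variance
    W = V[:, :k] * np.sqrt(np.maximum(w[:k] - sigma2, 0.0))
    return W, sigma2
```

Having an explicit generative model is what makes the richer Bayesian variants able to impute missing values by sampling from the learnt distributions, as described above.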
Finally, this thesis includes a comparison of the proposed models on the Alzheimer’s disease
dataset, where both provide similar results in terms of performance; however, RB-CCA provides
a more robust analysis of mental diseases that is more easily interpretable. On the other hand,
while RB-CCA is more limited to specific scenarios, the SSHIBA formulation allows a wider
variety of data to be combined and is easily adapted to more complex real-life scenarios.
PhD Programme in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. Committee: President: Manuel Martínez Ramón; Secretary: Emilio Parrado Hernández; Vocal: Sancho Salcedo San
The Emerging Trends of Multi-Label Learning
Exabytes of data are generated daily by humans, leading to the growing need
for new efforts in dealing with the grand challenges for multi-label learning
brought by big data. For example, extreme multi-label classification is an
active and rapidly growing research area that deals with classification tasks
with an extremely large number of classes or labels; utilizing massive data
with limited supervision to build a multi-label classification model becomes
valuable for practical applications, etc. Besides these, there are tremendous
efforts on how to harvest the strong learning capability of deep learning to
better capture the label dependencies in multi-label learning, which is the key
for deep learning to address real-world classification tasks. However, there
has been a lack of systematic studies that focus explicitly on analyzing the
emerging trends and new challenges of multi-label learning in the era of big
data, and a comprehensive survey is needed to fulfill this mission and
delineate future research directions and new applications.
Comment: Accepted to TPAMI 202