mldr.resampling: Efficient Reference Implementations of Multilabel Resampling Algorithms
Resampling algorithms are a useful approach to deal with imbalanced learning
in multilabel scenarios. These methods have to deal with singularities in the
multilabel data, such as the occurrence of frequent and infrequent labels in
the same instance. Implementations of these methods are sometimes limited to
the pseudocode provided by their authors in a paper. This Original Software
Publication presents mldr.resampling, a software package that provides
reference implementations for eleven multilabel resampling methods, with an
emphasis on efficiency, since these algorithms are usually time-consuming.
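The co-occurrence of frequent and infrequent labels in one instance can be illustrated with a short sketch. It is not taken from mldr.resampling (an R package). It assumes the per-label imbalance ratio (IRLbl) and its mean (MeanIR) commonly used in the multilabel literature, and the function names and toy data are hypothetical.

```python
from collections import Counter


def label_imbalance(dataset):
    """Return (IRLbl per label, MeanIR) for a list of label sets.

    Assumed definition: IRLbl(l) = count of the most frequent label
    divided by count of l; MeanIR is the average IRLbl over all labels.
    """
    counts = Counter(label for labels in dataset for label in labels)
    most_frequent = max(counts.values())
    irlbl = {label: most_frequent / n for label, n in counts.items()}
    mean_ir = sum(irlbl.values()) / len(irlbl)
    return irlbl, mean_ir


def mixed_instances(dataset):
    """Indices of instances that hold both a minority and a majority label.

    A label counts as minority here when its IRLbl exceeds MeanIR
    (an illustrative convention, not necessarily the package's).
    """
    irlbl, mean_ir = label_imbalance(dataset)
    mixed = []
    for i, labels in enumerate(dataset):
        has_minority = any(irlbl[l] > mean_ir for l in labels)
        has_majority = any(irlbl[l] <= mean_ir for l in labels)
        if has_minority and has_majority:
            mixed.append(i)
    return mixed


# Hypothetical toy data: "a" is frequent, "c" is rare.
toy = [{"a"}, {"a", "b"}, {"a"}, {"a", "c"}, {"b"}, {"a", "b"}]
print(mixed_instances(toy))
```

Instances flagged this way are the awkward ones for naive resampling, because cloning them to boost a rare label also boosts a label that is already frequent.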
Resampling Algorithms for Multi-Label Classification
Statistics and Actuarial Scienc
Analysis of Cellular and Subcellular Morphology using Machine Learning in Microscopy Images
Human cells undergo various morphological changes due to progression in the cell-cycle or environmental factors. Classification of these morphological states is vital for effective clinical decisions. Automated classification systems based on machine learning models are data-driven and efficient and help to avoid subjective outcomes. However, the efficacy of these models is highly dependent on the feature description along with the amount and nature of the training data.
This thesis presents three studies of automated image-based classification of cellular and subcellular morphologies. The first study presents 3D Sorted Random Projections (SRP) which includes the proposed approach to compute 3D plane information for texture description of 3D nuclear images. The proposed 3D SRP is used to classify nuclear morphology and measure changes in heterochromatin, which in turn helps to characterise cellular states. Classification performance evaluated on 3D images of the human fibroblast and prostate cancer cell lines shows that 3D SRP provides better classification than other feature descriptors.
The second study is on imbalanced multiclass and single-label classification of blood cell images. The scarcity of minority samples causes a drop in classification performance on minority classes. This study proposes oversampling of minority samples using data augmentation approaches, namely mixup, WGAN-div and a novel nonlinear mixup, along with a minority-class-focussed sampling strategy. Classification performance evaluated using F1-score shows that the proposed deep learning framework outperforms state-of-the-art approaches on publicly available images of human T-lymphocyte cells and red blood cells.
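As a rough illustration of mixup-based oversampling, the sketch below applies standard (linear) mixup to pairs drawn from a single minority class. It is not the thesis's nonlinear mixup or WGAN-div. The function names, the `alpha` value and the flattened-list image representation are assumptions made for the example.

```python
import random


def mixup_pair(x_i, x_j, alpha=0.4, rng=random):
    """Standard mixup: convex combination with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    return mixed, lam


def oversample_minority(samples, labels, minority, n_new, alpha=0.4, seed=0):
    """Create n_new synthetic samples by mixing pairs from the minority class.

    Both parents share the same label, so each synthetic sample keeps it.
    """
    rng = random.Random(seed)
    pool = [x for x, y in zip(samples, labels) if y == minority]
    synthetic = []
    for _ in range(n_new):
        x_i, x_j = rng.choice(pool), rng.choice(pool)
        mixed, _ = mixup_pair(x_i, x_j, alpha, rng)
        synthetic.append(mixed)
    return synthetic
```

Mixing only within the minority class keeps the label unambiguous. Mixing across classes would also require interpolating the one-hot labels with the same `lam`.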
The third study is on protein subcellular localisation, which is an imbalanced multiclass and multilabel classification problem. In order to handle data imbalance, this study proposes an oversampling method which includes synthetic images constructed using nonlinear mixup and geometric/colour transformations. The regularisation capability of nonlinear mixup is further improved for protein images. In addition, an imbalance-aware sampling strategy is proposed to identify minority and medium classes in the dataset and include them during training. Classification performance evaluated on the Human Protein Atlas Kaggle challenge dataset using F1-score shows that the proposed deep learning framework achieves better predictions than existing methods.
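One way to picture an imbalance-aware sampling strategy is to bucket classes by frequency and then upweight images carrying rarer classes. The sketch below is only an assumption about how such a strategy could look. The thresholds, boost factors and function names are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def tier_classes(label_sets, minority_frac=0.05, medium_frac=0.25):
    """Bucket each class by its count relative to the most frequent class.

    The two thresholds are hypothetical illustration values.
    """
    counts = Counter(c for labels in label_sets for c in labels)
    largest = max(counts.values())
    tiers = {}
    for c, n in counts.items():
        ratio = n / largest
        if ratio <= minority_frac:
            tiers[c] = "minority"
        elif ratio <= medium_frac:
            tiers[c] = "medium"
        else:
            tiers[c] = "majority"
    return tiers


def sampling_weights(label_sets, tiers, boost=None):
    """Weight each multilabel image by the rarest tier among its classes."""
    if boost is None:
        # Hypothetical boost factors per tier.
        boost = {"minority": 4.0, "medium": 2.0, "majority": 1.0}
    return [
        max((boost[tiers[c]] for c in labels), default=1.0)
        for labels in label_sets
    ]
```

The resulting weights could be passed to a weighted sampler, for example `random.choices(indices, weights=weights, k=batch_size)`, so that minority and medium classes appear more often during training.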
Latent representation for the characterisation of mental diseases
International Mention in the doctoral degree
Machine learning (ML) techniques are becoming crucial in the field of health and, in particular,
in the analysis of mental diseases. These are usually studied with neuroimaging, which is
characterised by a large number of input variables compared to the number of samples available.
The main objective of this PhD thesis is to propose different ML techniques to analyse mental
diseases from neuroimaging data including different extensions of these models in order to adapt
them to the neuroscience scenario. In particular, this thesis focuses on using brain imaging latent
representations, since they allow us to endow the problem with a reduced low dimensional
representation while obtaining a better insight on the internal relations between the disease and
the available data. This way, the main objective of this PhD thesis is to provide interpretable
results that are competitive with the state of the art in the analysis of mental diseases.
This thesis starts by proposing a model based on classic latent representation formulations,
which relies on a bagging process to obtain the relevance of each brain imaging voxel: Regularised
Bagged Canonical Correlation Analysis (RB-CCA). The learnt relevance is combined with a
statistical test to obtain a selection of features. Moreover, the proposal obtains a class-wise
selection which, in turn, further improves the analysis of the effect of each brain area on the
stages of the mental disease. In addition, RB-CCA uses the relevance measure to guide the
feature extraction process by using it to penalise the least informative voxels for obtaining the
low-dimensional representation. Results obtained on two databases for the characterisation of
Alzheimer’s disease and Attention Deficit Hyperactivity Disorder show that the model is able to
perform as well as or better than the baselines while providing interpretable solutions.
Subsequently, this thesis continues with a second model that uses Bayesian approximations
to obtain a latent representation. Specifically, this model focuses on providing different functionalities
to build a common representation from different data sources and particularities. For
this purpose, the proposed generative model, Sparse Semi-supervised Heterogeneous Interbattery
Bayesian Factor Analysis (SSHIBA), can learn the feature relevance to perform feature selection,
as well as automatically select the number of latent factors. In addition, it can also model heterogeneous
data (real, multi-label and categorical), work with kernels and use a semi-supervised
formulation, which naturally imputes missing values by sampling from the learnt distributions.
Results using this model demonstrate the versatility of the formulation, which allows these extensions
to be combined interchangeably, expanding the scenarios in which the model can be
applied and improving the interpretability of the results.
Finally, this thesis includes a comparison of the proposed models on the Alzheimer’s disease
dataset, where both provide similar results in terms of performance; however, RB-CCA provides
a more robust analysis of mental diseases that is more easily interpretable. On the other hand,
while RB-CCA is more limited to specific scenarios, the SSHIBA formulation allows a wider
variety of data to be combined and is easily adapted to more complex real-life scenarios.
Doctoral Programme in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos
Chair: Manuel Martínez Ramón. Secretary: Emilio Parrado Hernández. Member: Sancho Salcedo San