Data mining in soft computing framework: a survey
This article surveys the available literature on data mining using soft computing. The literature is categorized by the soft computing tools and hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally, fuzzy sets are suitable for handling issues related to the understandability of patterns, incomplete or noisy data, mixed media information, and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion or objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.
Swarm Intelligence-Based Hybrid Models for Short-Term Power Load Prediction
Swarm intelligence (SI) is widely and successfully applied in engineering to solve practical optimization problems, and various hybrid models that combine SI algorithms with statistical models have been developed to further improve predictive ability. In this paper, hybrid intelligent forecasting models based on cuckoo search (CS), singular spectrum analysis (SSA), time series, and machine learning methods are proposed for short-term power load prediction. The forecasting performance of the proposed models is augmented by a rolling multistep strategy over the prediction horizon. The test results show that using SSA and CS to tune the seasonal autoregressive integrated moving average (SARIMA) and support vector regression (SVR) models improves load forecasting, which indicates that both SSA-based data denoising and the SI-based intelligent optimization strategy can effectively improve a model’s predictive performance. Additionally, the proposed CS-SSA-SARIMA and CS-SSA-SVR models provide strong forecasting results, demonstrating robustness and general forecasting capacity for short-term power load prediction 24 hours in advance.
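One common reading of a rolling multistep strategy is recursive forecasting: a one-step model is applied repeatedly, and each prediction is fed back into the input window. The sketch below illustrates that idea only. The function names and the moving-average placeholder model are assumptions for illustration and are not the paper's SARIMA or SVR models.

```python
from typing import Callable, List, Sequence


def rolling_multistep_forecast(
    history: Sequence[float],
    one_step_model: Callable[[Sequence[float]], float],
    horizon: int,
    window: int,
) -> List[float]:
    """Forecast `horizon` steps ahead by feeding each prediction back as input."""
    if window <= 0 or len(history) < window:
        raise ValueError("need at least `window` past observations")
    buffer = list(history[-window:])
    forecasts: List[float] = []
    for _ in range(horizon):
        y_hat = one_step_model(buffer)
        forecasts.append(y_hat)
        # Roll the window: drop the oldest value, append the newest prediction.
        buffer = buffer[1:] + [y_hat]
    return forecasts


def moving_average(window_values: Sequence[float]) -> float:
    """Placeholder one-step model (an assumption): mean of the window."""
    return sum(window_values) / len(window_values)


load = [10.0, 12.0, 14.0, 16.0]  # toy load series, not real data
forecast = rolling_multistep_forecast(load, moving_average, horizon=3, window=2)
print(forecast)
```

In practice the placeholder would be replaced by a fitted one-step predictor, and the horizon would match the 24-hour-ahead setting the abstract describes.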
Medical data mining using Bayesian network and DNA sequence analysis.
Lee Kit Ying. Thesis (M.Phil.), Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 115-117). Abstracts in English and Chinese.
Contents:
Abstract
Acknowledgement
Chapter 1 Introduction
  1.1 Project Background
  1.2 Problem Specifications
  1.3 Contributions
  1.4 Thesis Organization
Chapter 2 Background
  2.1 Medical Data Mining
    2.1.1 General Information
    2.1.2 Related Research
    2.1.3 Characteristics and Difficulties Encountered
  2.2 DNA Sequence Analysis
  2.3 Hepatitis B Virus
    2.3.1 Virus Characteristics
    2.3.2 Important Findings on the Virus
  2.4 Bayesian Network and its Classifiers
    2.4.1 Formal Definition
    2.4.2 Existing Learning Algorithms
    2.4.3 Evolutionary Algorithms and Hybrid EP (HEP)
    2.4.4 Bayesian Network Classifiers
    2.4.5 Learning Algorithms for BN Classifiers
Chapter 3 Bayesian Network Classifier for Clinical Data
  3.1 Related Work
  3.2 Proposed BN-augmented Naive Bayes Classifier (BAN)
    3.2.1 Definition
    3.2.2 Learning Algorithm with HEP
    3.2.3 Modifications on HEP
  3.3 Proposed General Bayesian Network with Markov Blanket (GBN)
    3.3.1 Definition
    3.3.2 Learning Algorithm with HEP
  3.4 Findings on Bayesian Network Parameters Calculation
    3.4.1 Situation and Errors
    3.4.2 Proposed Solution
  3.5 Performance Analysis on Proposed BN Classifier Learning Algorithms
    3.5.1 Experimental Methodology
    3.5.2 Benchmark Data
    3.5.3 Clinical Data
    3.5.4 Discussion
  3.6 Summary
Chapter 4 Classification in DNA Analysis
  4.1 Related Work
  4.2 Problem Definition
  4.3 Proposed Methodology Architecture
    4.3.1 Overall Design
    4.3.2 Important Components
  4.4 Clustering
  4.5 Feature Selection Algorithms
    4.5.1 Information Gain
    4.5.2 Other Approaches
  4.6 Classification Algorithms
    4.6.1 Naive Bayes Classifier
    4.6.2 Decision Tree
    4.6.3 Neural Networks
    4.6.4 Other Approaches
  4.7 Important Points on Evaluation
    4.7.1 Errors
    4.7.2 Independent Test
  4.8 Performance Analysis on Classification of DNA Data
    4.8.1 Experimental Methodology
    4.8.2 Using Naive-Bayes Classifier
    4.8.3 Using Decision Tree
    4.8.4 Using Neural Network
    4.8.5 Discussion
  4.9 Summary
Chapter 5 Adaptive HEP for Learning Bayesian Network Structure
  5.1 Background
    5.1.1 Objective
    5.1.2 Related Work - AEGA
  5.2 Feasibility Study
  5.3 Proposed A-HEP Algorithm
    5.3.1 Structural Dissimilarity Comparison
    5.3.2 Dynamic Population Size
  5.4 Evaluation on Proposed Algorithm
    5.4.1 Experimental Methodology
    5.4.2 Comparison on Running Time
    5.4.3 Comparison on Fitness of Final Network
    5.4.4 Comparison on Similarity to the Original Network
    5.4.5 Parameter Study
  5.5 Applications on Medical Domain
    5.5.1 Discussion
    5.5.2 An Example
  5.6 Summary
Chapter 6 Conclusion
  6.1 Summary
  6.2 Future Work
Bibliography
A survey on multi-output regression
In recent years, a plethora of approaches have been proposed to deal
with the increasingly challenging task of multi-output regression. This paper
provides a survey of state-of-the-art multi-output regression methods,
which are categorized as problem transformation and algorithm adaptation
methods. In addition, we present the most commonly used performance evaluation
measures, publicly available data sets for real-world multi-output regression
problems, as well as open-source software frameworks.
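To make the problem transformation category concrete, the sketch below shows its simplest form: a multi-output problem is split into independent single-output problems, and one model is fitted per target column. The one-feature least-squares model, the function names, and the toy data are assumptions chosen to keep the example self-contained; they are not taken from the survey.

```python
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for one feature and one target: y ~ a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_single_target(
    x: Sequence[float], targets: Sequence[Sequence[float]]
) -> List[Tuple[float, float]]:
    """Problem transformation: fit one independent model per output column."""
    return [fit_line(x, column) for column in zip(*targets)]


def predict(models: Sequence[Tuple[float, float]], x_new: float) -> List[float]:
    """Evaluate every per-target model at a new input."""
    return [a * x_new + b for a, b in models]


x = [0.0, 1.0, 2.0, 3.0]
# Toy targets (assumed): first output is 2x + 1, second output is -x.
Y = [[1.0, 0.0], [3.0, -1.0], [5.0, -2.0], [7.0, -3.0]]
models = fit_single_target(x, Y)
prediction = predict(models, 4.0)
print(prediction)
```

Because each target is modelled on its own, this baseline ignores any dependence between outputs. Algorithm adaptation methods, by contrast, modify a single learner so that it predicts all outputs jointly.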
Clinical microbiology with multi-view deep probabilistic models
Clinical microbiology is one of the critical topics of this century. Identification
and discrimination of microorganisms is considered a global public health
threat by the main international health organisations, such as World Health
Organisation (WHO) or the European Centre for Disease Prevention and Control
(ECDC). Rapid spread, high morbidity and mortality, as well as the economic
burden associated with their treatment and control are the main causes of their
impact. Discrimination of microorganisms is crucial for clinical applications, for
instance, Clostridium difficile (C. diff) increases the mortality and morbidity of
healthcare-related infections. Furthermore, in the past two decades, other bacteria,
including Klebsiella pneumoniae (K. pneumoniae), have demonstrated a significant
propensity to acquire antibiotic resistance mechanisms. Consequently, the use of
an ineffective antibiotic may result in mortality. Machine Learning (ML) has the
potential to be applied in the clinical microbiology field to automatise current
methodologies and provide more efficient guided personalised treatments.
However, microbiological data are challenging to exploit owing to the presence
of a heterogeneous mix of data types, such as real-valued high-dimensional data,
categorical indicators, multilabel epidemiological data, binary targets, or even
time-series data representations. This problem, which in the field of ML is known
as multi-view or multi-modal representation learning, has been studied in other
application fields such as mental health monitoring or haematology. Multi-view
learning combines different modalities or views representing the same data to extract
richer insights and improve understanding. Each modality or view corresponds
to a distinct encoding mechanism for the data, and this dissertation specifically
addresses the issue of heterogeneity across multiple views.
In the probabilistic ML field, the exploitation of multi-view learning is also
known as Bayesian Factor Analysis (FA). Current solutions face limitations when
handling high-dimensional data and non-linear associations. Recent research
proposes deep probabilistic methods to learn hierarchical representations of the data,
which can capture intricate non-linear relationships between features. However,
some Deep Learning (DL) techniques rely on complicated representations, which
can hinder the interpretation of the outcomes. In addition, some inference methods
used in DL approaches can be computationally burdensome, which can hinder their
practical application in real-world situations. Therefore, there is a demand for
more interpretable, explainable, and computationally efficient techniques for high-dimensional
data. By combining multiple views representing the same information, such as genomic, proteomic, and epidemiologic data, multi-modal representation
learning could provide a better understanding of the microbial world. Hence,
this dissertation proposes two deep probabilistic models that can
handle current limitations in the state of the art of clinical microbiology.
Moreover, both models are also tested in two real scenarios regarding antibiotic
resistance prediction in K. pneumoniae and automatic ribotyping of C. diff in
collaboration with the Instituto de Investigación Sanitaria Gregorio Marañón
(IISGM) and the Instituto Ramón y Cajal de Investigación Sanitaria (IRyCIS).
The first presented algorithm is the Kernelised Sparse Semi-supervised Heterogeneous
Interbattery Bayesian Analysis (KSSHIBA). This algorithm uses a kernelised
formulation to handle non-linear data relationships while providing compact representations
through the automatic selection of relevant vectors. Additionally, it
uses Automatic Relevance Determination (ARD) over the kernel to determine
the relevance of the input features. Then, it is tailored and applied to the
microbiological laboratories of the IISGM and IRyCIS to predict antibiotic resistance
in K. pneumoniae. To do so, specific kernels that handle Matrix-Assisted
Laser Desorption Ionization (MALDI)-Time-Of-Flight (TOF) mass spectrometry
of bacteria are used. Moreover, by exploiting the multi-modal learning between
the spectra and epidemiological information, it outperforms other state-of-the-art
algorithms. Presented results demonstrate the importance of heterogeneous models
that can analyse epidemiological information and can automatically be adjusted for
different data distributions. The implementation of this method in microbiological
laboratories could significantly reduce the time required to obtain resistance results
in 24-72 hours and, moreover, improve patient outcomes.
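As a rough illustration of what ARD over a kernel means, the sketch below uses a generic RBF kernel with one non-negative relevance weight per input feature: a weight near zero removes that feature from the similarity computation. This is a textbook ARD-RBF form under assumed names and toy values, not the exact KSSHIBA formulation or its MALDI-TOF-specific kernels.

```python
import math
from typing import Sequence


def ard_rbf_kernel(
    x: Sequence[float], z: Sequence[float], relevance: Sequence[float]
) -> float:
    """k(x, z) = exp(-sum_d relevance[d] * (x[d] - z[d])**2)."""
    if not (len(x) == len(z) == len(relevance)):
        raise ValueError("x, z and relevance must have the same length")
    if any(w < 0 for w in relevance):
        raise ValueError("relevance weights must be non-negative")
    weighted_sq_dist = sum(w * (a - b) ** 2 for a, b, w in zip(x, z, relevance))
    return math.exp(-weighted_sq_dist)


# Two toy samples that agree on feature 0 and differ strongly on feature 1.
a = [1.0, 5.0]
b = [1.0, -3.0]

# With feature 1 switched off, the samples look identical to the kernel.
k_ignoring_second = ard_rbf_kernel(a, b, relevance=[1.0, 0.0])
# With both features active, the large gap on feature 1 drives similarity to ~0.
k_using_both = ard_rbf_kernel(a, b, relevance=[1.0, 1.0])
print(k_ignoring_second, k_using_both)
```

In a Bayesian treatment the relevance weights would be inferred from data rather than set by hand, and inspecting the learned weights is what indicates which input features matter.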
The second algorithm is a hierarchical Variational AutoEncoder (VAE) for
heterogeneous data using an explainable FA latent space, called FA-VAE. The
FA-VAE model is built on the foundation of the successful KSSHIBA approach for
dealing with semi-supervised heterogeneous multi-view problems. This approach
further expands the range of data domains it can handle. With the ability to
work with a wide range of data types, including multilabel, continuous, binary,
categorical, and even image data, the FA-VAE model offers a versatile and powerful
solution for real-world data sets, depending on the VAE architecture. Additionally,
this model is adapted and used in the microbiological laboratory of IISGM, resulting
in an innovative technique for automatic ribotyping of C. diff, using MALDI-TOF
data. To the best of our knowledge, this is the first demonstration of using any
kind of ML for C. diff ribotyping. Experiments have been conducted on strains
of Hospital General Universitario Gregorio Marañón (HGUGM) to evaluate the
viability of the proposed approach. The results have demonstrated high accuracy
rates where KSSHIBA even achieved perfect accuracy in the first data collection.
These models have also been tested in a real-life outbreak scenario at the HGUGM,
where successful classification of all outbreak samples has been achieved by FA-VAE. The presented results have not only shown high accuracy in predicting
each strain’s ribotype but also revealed an explainable latent space. Furthermore,
traditional ribotyping methods, which rely on PCR, required 7 days, while FA-VAE
predicted the same results on the same day. This improvement has significantly
reduced the response time, helping in the decision to isolate patients
with hyper-virulent ribotypes of C. diff on the same day of infection. The promising
results, obtained in a real outbreak, have provided a solid foundation for further
advancements in the field. This study has been a crucial stepping stone towards
realising the full potential of MALDI-TOF for bacterial ribotyping and advancing
our ability to tackle bacterial outbreaks.
In conclusion, this doctoral thesis has significantly contributed to the field of
Bayesian FA by addressing its drawbacks in handling various data types through
the creation of novel models, namely KSSHIBA and FA-VAE. Additionally, a
comprehensive analysis of the limitations of automating laboratory procedures in
the microbiology field has been carried out. The effectiveness of the newly
developed models has been demonstrated through their successful implementation in
critical problems, such as predicting antibiotic resistance and automating ribotyping.
As a result, KSSHIBA and FA-VAE, both in terms of their technical and practical
contributions, signify noteworthy progress both in the clinical and the Bayesian
statistics fields. This dissertation opens up possibilities for future advancements in
automating microbiological laboratories.
Doctoral Programme in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. Committee: Juan José Murillo Fuentes (chair), Jerónimo Arenas García (secretary), María de las Mercedes Marín Arriaz (member).