
    Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation

    Detecting similarity in biological sequences is key to understanding the mechanisms of life. Researchers infer potential structural, functional or evolutionary relationships from similarity. However, the concept of similarity is complex in biology. Sequences consist of different molecules with different chemical properties, have short- and long-distance interactions, form 3D structures and change through evolutionary processes. Amino acids are among the key molecules of life. Most importantly, sequences of amino acids are the building blocks of proteins, which play an essential role in cellular processes. This thesis investigates similarity amongst proteins. In this area of research there are two important and closely related classification tasks: the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied to the detection task as they model sequence similarity very well. From a Machine Learning point of view these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins, neglecting the vast number of dissimilar ones. Our basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, we transform the problem representation from a one-class to a binary one. Equipped with a sound understanding of Machine Learning, especially concerning problem representation and statistically significant evaluation, our work pursues and combines two different avenues for this transformation. First, we introduce a binary HMM that discriminates significantly better than the standard one, even when only a fraction of the negative information is used. Second, we interpret the HMM as a structured graph of information. This information cannot be accessed by highly optimised standard Machine Learning classifiers as they expect a fixed-length feature vector representation.
Propositionalisation is a technique to transform the former representation into the latter. This thesis introduces new propositionalisation techniques. The change in representation changes the learning problem from a one-class, generative one to a propositional, discriminative one. It is a common assumption that discriminative techniques are better suited for classification tasks, and our results validate this assumption. We suggest a new way to significantly improve discriminative power and runtime: terminate the time-intensive training of HMMs early, then apply propositionalisation and classify with a discriminative, binary learner.
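
The move from a one-class to a binary representation described in this abstract can be illustrated with a minimal sketch. It is not the thesis's method: simple residue-composition models stand in for profile HMMs, and the function names and toy sequences are hypothetical. The one-class score uses only a model of the positive family, while the binary score also uses a model fitted on negative sequences.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def fit_composition(sequences, pseudocount=1.0):
    """Estimate smoothed per-residue probabilities (stand-in for an HMM)."""
    counts = Counter(ch for seq in sequences for ch in seq if ch in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}

def log_likelihood(seq, model):
    """One-class score: log-probability of the sequence under one model."""
    return sum(math.log(model[ch]) for ch in seq if ch in ALPHABET)

def log_odds(seq, pos_model, neg_model):
    """Binary score: evidence for the positive model relative to the negative one."""
    return log_likelihood(seq, pos_model) - log_likelihood(seq, neg_model)

# Hypothetical toy data: family members versus unrelated sequences.
positives = ["AAKAALA", "AKAAALK", "LAAKAAA"]
negatives = ["GGSGPGS", "PGGSGGP", "SGPGGSG"]
pos_model = fit_composition(positives)
neg_model = fit_composition(negatives)
```

A query is assigned to the family when `log_odds(query, pos_model, neg_model)` is above zero. The one-class `log_likelihood` alone has no such natural threshold, which is one way to see why the negative information helps.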

    A Literature Review of Fault Diagnosis Based on Ensemble Learning

    The accuracy of fault diagnosis is an important indicator for ensuring the reliability of key equipment systems. Ensemble learning combines several weak learners into a stronger one and has achieved remarkable results in the field of fault diagnosis. This paper reviews recent research on ensemble learning from both technical and field application perspectives. It surveys 209 papers from 87 journals drawn from recent Web of Science listings and other academic resources. It summarizes 78 different ensemble learning based fault diagnosis methods, involving 18 public datasets and more than 20 different equipment systems. In detail, the paper summarizes the accuracy rates, fault classification types, fault datasets, data signals used, learners (traditional machine learning or deep learning-based learners) and ensemble learning methods (bagging, boosting, stacking and other ensemble models) of these fault diagnosis models. The paper uses fault diagnosis accuracy as the main evaluation metric, supplemented by generalization and imbalanced data processing ability, to evaluate the performance of those ensemble learning methods. The discussion and evaluation of these methods provide useful references for identifying and developing appropriate intelligent fault diagnosis models for various equipment. This paper also discusses the technical challenges, the lessons learned from the review and future development directions in the field of ensemble learning based fault diagnosis and intelligent maintenance.
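
Of the ensemble families named in this abstract, bagging is the simplest to sketch. The following is only an illustration under stated assumptions, not a method taken from the review: the weak learner is a one-feature threshold stump, and the data (a single hypothetical sensor reading, label 1 meaning "fault") are invented. Each stump is trained on a bootstrap resample and the ensemble predicts by majority vote.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Weak learner: threshold halfway between the two class means."""
    by_class = {0: [], 1: []}
    for x, y in sample:
        by_class[y].append(x)
    if not by_class[0] or not by_class[1]:
        # Degenerate bootstrap sample containing one class: always predict it.
        only = 0 if by_class[0] else 1
        return lambda x: only
    mean0 = sum(by_class[0]) / len(by_class[0])
    mean1 = sum(by_class[1]) / len(by_class[1])
    threshold = (mean0 + mean1) / 2
    # Orient the stump so the class with the larger mean lies above the threshold.
    above, below = (1, 0) if mean1 > mean0 else (0, 1)
    return lambda x: above if x > threshold else below

def bagging_fit(data, n_learners=25, seed=0):
    """Train each weak learner on a bootstrap resample of the data."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        learners.append(fit_stump(sample))
    return learners

def bagging_predict(learners, x):
    """Combine the weak learners by majority vote."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]

# Hypothetical readings: low values are healthy (0), high values are faulty (1).
data = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(20, 30)]
ensemble = bagging_fit(data)
```

Boosting and stacking differ in how the learners are trained and combined (reweighting hard samples, or learning a meta-model over the base predictions), but the voting step above is the part bagging contributes.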

    Improving Engagement Assessment by Model Individualization and Deep Learning

    This dissertation studies methods that improve engagement assessment for pilots. The major work addresses two challenging problems involved in the assessment: individual variation among pilots and the lack of labeled data for training assessment models. Task engagement is usually assessed by analyzing physiological measurements collected from subjects who are performing a task. However, physiological measurements such as electroencephalography (EEG) vary from subject to subject. An assessment model trained for one subject may not be applicable to other subjects. We proposed a dynamic classifier selection algorithm for model individualization and compared it to two other methods: baseline normalization and similarity-based model replacement. Experimental results showed that baseline normalization and dynamic classifier selection can significantly improve cross-subject engagement assessment. For complex tasks such as piloting an airplane, labeling engagement levels for pilots is challenging. Without enough labeled data, it is very difficult for traditional methods to train valid models for effective engagement assessment. This dissertation proposes using deep learning models to address this challenge. Deep learning models are capable of learning valuable feature hierarchies by taking advantage of both labeled and unlabeled data. Our results showed that deep models are better tools for engagement assessment when label information is scarce. To further verify the power of deep learning techniques when labeled data is scarce, we applied the deep learning algorithm to another small data set, the ADNI data set. The ADNI data set is a public data set containing MRI and PET scans of Alzheimer's Disease (AD) patients for AD diagnosis. We developed a robust deep learning system incorporating dropout and stability selection techniques to identify the different progression stages of AD patients.
The experimental results showed that deep learning is very effective in AD diagnosis. In addition, we studied several imbalance learning techniques that are useful when data is highly unbalanced, i.e., when majority classes have many more training samples than minority classes. Conventional machine learning techniques tend to classify all data samples into majority classes and to perform poorly on minority classes. Unbalanced learning techniques can balance data sets before training and can improve learning performance.
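
The abstract does not say which imbalance learning techniques were studied, so the sketch below shows only one common option, random oversampling, as an assumption. It balances a data set before training by duplicating randomly chosen minority-class samples until every class matches the size of the largest one. The function name and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes are equally large."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical toy data: six majority samples, two minority samples.
samples = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
labels = ["low", "low", "low", "low", "low", "low", "high", "high"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

Oversampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into the evaluation data.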

    Latent representation for the characterisation of mental diseases

    International Mention in the doctoral degree. Machine learning (ML) techniques are becoming crucial in the field of health and, in particular, in the analysis of mental diseases. These are usually studied with neuroimaging, which is characterised by a large number of input variables compared to the number of samples available. The main objective of this PhD thesis is to propose different ML techniques to analyse mental diseases from neuroimaging data, including different extensions of these models in order to adapt them to the neuroscience scenario. In particular, this thesis focuses on using brain-imaging latent representations, since they provide a reduced, low-dimensional representation of the problem while giving better insight into the internal relations between the disease and the available data. In this way, the thesis aims to provide interpretable results that are competitive with the state of the art in the analysis of mental diseases. This thesis starts by proposing a model based on classic latent representation formulations, which relies on a bagging process to obtain the relevance of each brain-imaging voxel: Regularised Bagged Canonical Correlation Analysis (RB-CCA). The learnt relevance is combined with a statistical test to obtain a selection of features. Moreover, the proposal obtains a class-wise selection which, in turn, further improves the analysis of the effect of each brain area on the stages of the mental disease. In addition, RB-CCA uses the relevance measure to guide the feature extraction process, penalising the least informative voxels when obtaining the low-dimensional representation. Results obtained on two databases for the characterisation of Alzheimer’s disease and Attention Deficit Hyperactivity Disorder show that the model is able to perform as well as or better than the baselines while providing interpretable solutions.
Subsequently, this thesis continues with a second model that uses Bayesian approximations to obtain a latent representation. Specifically, this model focuses on providing different functionalities to build a common representation from different data sources and particularities. For this purpose, the proposed generative model, Sparse Semi-supervised Heterogeneous Interbattery Bayesian Factor Analysis (SSHIBA), can learn the feature relevance to perform feature selection, as well as automatically select the number of latent factors. In addition, it can also model heterogeneous data (real, multi-label and categorical), work with kernels and use a semi-supervised formulation, which naturally imputes missing values by sampling from the learnt distributions. Results using this model demonstrate the versatility of the formulation, which allows these extensions to be combined interchangeably, expanding the scenarios in which the model can be applied and improving the interpretability of the results. Finally, this thesis includes a comparison of the proposed models on the Alzheimer’s disease dataset, where both provide similar results in terms of performance; however, RB-CCA provides a more robust analysis of mental diseases that is more easily interpretable. On the other hand, while RB-CCA is more limited to specific scenarios, the SSHIBA formulation allows a wider variety of data to be combined and is easily adapted to more complex real-life scenarios.
Doctoral Programme in Multimedia and Communications of Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. President: Manuel Martínez Ramón. Secretary: Emilio Parrado Hernández. Member: Sancho Salcedo San