76 research outputs found
Recommended from our members
Building trajectories through clinical data to model disease progression
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Clinical trials are typically conducted over a population within a defined time period
in order to illuminate certain characteristics of a health issue or disease process. These cross-sectional studies provide a snapshot of these disease processes over a large number of people but do not allow us to model the temporal nature of disease, which is essential for modeling detailed prognostic predictions. Longitudinal studies, on the other hand, are used to explore how these processes develop over time in a number of people but can be expensive and time-consuming, and many studies only cover a relatively small window within the disease process. This thesis describes the application of intelligent data analysis techniques for extracting information from time series generated by different diseases. The aim of this thesis is to identify intermediate stages
in a disease process and sub-categories of the disease exhibiting subtly different symptoms. It explores the use of a bootstrap technique that fits trajectories through the data generating “pseudo time-series”. It addresses issues including: how clinical variables interact as a disease progresses along the trajectories in the data; and how to automatically identify different disease states along these trajectories, as well as the transitions between them. The thesis documents how reliable time-series models can be created from large amounts of historical cross-sectional data and a novel relabling/latent variable approach has enabled the exploration of the temporal nature of disease progression. The proposed algorithms are tested extensively on simulated data and on three real clinical datasets. Finally, a study is carried out to explore whether we can “calibrate” pseudo time-series models with real longitudinal data in order to improve them. Plausible directions for future research are discussed at the end of the thesis
Statistical Methods to Enhance Clinical Prediction with High-Dimensional Data and Ordinal Response
Der technologische Fortschritt ermöglicht es heute, die moleculare
Konfiguration einzelner Zellen oder ganzer Gewebeproben zu
untersuchen. Solche in großen Mengen produzierten
hochdimensionalen Omics-Daten aus der Molekularbiologie lassen sich
zu immer niedrigeren Kosten erzeugen und werden so immer
häufiger auch in klinischen Fragestellungen eingesetzt.
Personalisierte Diagnose oder auch die Vorhersage eines
Behandlungserfolges auf der Basis solcher Hochdurchsatzdaten stellen
eine moderne Anwendung von Techniken aus dem maschinellen Lernen dar.
In der Praxis werden klinische Parameter, wie etwa der
Gesundheitszustand oder die Nebenwirkungen einer Therapie, häufig auf
einer ordinalen Skala erhoben (beispielsweise gut, normal,
schlecht).
Es ist verbreitet, Klassifikationsproblme mit ordinal skaliertem
Endpunkt wie generelle Mehrklassenproblme zu behandeln und somit die
Information, die in der Ordnung zwischen den Klassen enthalten ist, zu
ignorieren. Allerdings kann das Vernachlässigen dieser Information zu
einer verminderten Klassifikationsgüte führen oder sogar eine
ungünstige ungeordnete Klassifikation erzeugen.
Klassische Ansätze, einen ordinal skalierten Endpunkt direkt zu
modellieren, wie beispielsweise mit einem kumulativen Linkmodell,
lassen sich typischerweise nicht auf hochdimensionale Daten anwenden.
Wir präsentieren in dieser Arbeit hierarchical twoing (hi2) als
einen Algorithmus für die Klassifikation hochdimensionler Daten in
ordinal Skalierte Kategorien. hi2 nutzt die Mächtigkeit der
sehr gut verstandenen binären Klassifikation, um auch in ordinale
Kategorien zu klassifizieren. Eine Opensource-Implementierung von
hi2 ist online verfügbar.
In einer Vergleichsstudie zur Klassifikation von echten wie von
simulierten Daten mit ordinalem Endpunkt produzieren etablierte
Methoden, die speziell für geordnete Kategorien entworfen wurden,
nicht generell bessere Ergebnisse als state-of-the-art
nicht-ordinale Klassifikatoren. Die Fähigkeit eines Algorithmus, mit
hochdimensionalen Daten umzugehen, dominiert die
Klassifikationsleisting. Wir zeigen, dass unser Algorithmus hi2
konsistent gute Ergebnisse erzielt und in vielen Fällen besser
abschneidet als die anderen Methoden
Machine Learning
Machine Learning can be defined in various ways related to a scientific domain concerned with the design and development of theoretical and implementation tools that allow building systems with some Human Like intelligent behavior. Machine learning addresses more specifically the ability to improve automatically through experience
Discovering Higher-order SNP Interactions in High-dimensional Genomic Data
In this thesis, a multifactor dimensionality reduction based method on associative classification is employed to identify higher-order SNP interactions for enhancing the understanding of the genetic architecture of complex diseases. Further, this thesis explored the application of deep learning techniques by providing new clues into the interaction analysis. The performance of the deep learning method is maximized by unifying deep neural networks with a random forest for achieving reliable interactions in the presence of noise
Spatial and Content-based Audio Processing using Stochastic Optimization Methods
Stochastic optimization (SO) represents a category of numerical optimization approaches, in which the search for the optimal solution involves randomness in a constructive manner. As shown also in this thesis, the stochastic optimization techniques and models have become an important and notable paradigm in a wide range of application areas, including transportation models, financial instruments, and network design. Stochastic optimization is especially developed for solving the problems that are either too difficult or impossible to solve analytically by deterministic optimization approaches.
In this thesis, the focus is put on applying several stochastic optimization algorithms to two audio-specific application areas, namely sniper positioning and content-based audio classification and retrieval. In short, the first application belongs to an area of spatial audio, whereas the latter is a topic of machine learning and, more specifically, multimedia information retrieval. The SO algorithms considered in the thesis are particle filtering (PF), particle swarm optimization (PSO), and simulated annealing (SA), which are extended, combined and applied to the specified problems in a novel manner. Based on their iterative and evolving nature, especially the PSO algorithms are often included to the category of evolutionary algorithms.
Considering the sniper positioning application, in this thesis the PF and SA algorithms are employed to optimize the parameters of a mathematical shock wave model based on observed firing event wavefronts. Such an inverse problem is suitable for Bayesian approach, which is the main motivation for including the PF approach among the considered optimization methods. It is shown – also with SA – that by applying the stated shock wave model, the proposed stochastic parameter estimation approach provides statistically reliable and qualified results.
The content-based audio classification part of the thesis is based on a dedicated framework consisting of several individual binary classifiers. In this work, artificial neural networks (ANNs) are used within the framework, for which the parameters and network structures are optimized based the desired item outputs, i.e. the ground truth class labels. The optimization process is carried out using a multi-dimensional extension of the regular PSO algorithm (MD PSO). The audio retrieval experiments are performed in the context of feature generation (synthesis), which is an approach for generating new audio features/attributes based on some conventional features originally extracted from a particular audio database. Here the MD PSO algorithm is applied to optimize the parameters of the feature generation process, wherein the dimensionality of the generated feature vector is also optimized.
Both from practical perspective and the viewpoint of complexity theory, stochastic optimization techniques are often computationally demanding. Because of this, the practical implementations discussed in this thesis are designed as directly applicable to parallel computing. This is an important and topical issue considering the continuous increase of computing grids and cloud services. Indeed, many of the results achieved in this thesis are computed using a grid of several computers. Furthermore, since also personal computers and mobile handsets include an increasing number of processor cores, such parallel implementations are not limited to grid servers only
Reservoir Computing: computation with dynamical systems
In het onderzoeksgebied Machine Learning worden systemen onderzocht die kunnen leren op basis van voorbeelden. Binnen dit onderzoeksgebied zijn de recurrente neurale netwerken een belangrijke deelgroep. Deze netwerken zijn abstracte modellen van de werking van delen van de hersenen. Zij zijn in staat om zeer complexe temporele problemen op te lossen maar zijn over het algemeen zeer moeilijk om te trainen. Recentelijk zijn een aantal gelijkaardige methodes voorgesteld die dit trainingsprobleem elimineren. Deze methodes worden aangeduid met de naam Reservoir Computing. Reservoir Computing combineert de indrukwekkende rekenkracht van recurrente neurale netwerken met een eenvoudige trainingsmethode. Bovendien blijkt dat deze trainingsmethoden niet beperkt zijn tot neurale netwerken, maar kunnen toegepast worden op generieke dynamische systemen. Waarom deze systemen goed werken en welke eigenschappen bepalend zijn voor de prestatie is evenwel nog niet duidelijk.
Voor dit proefschrift is onderzoek gedaan naar de dynamische eigenschappen van generieke Reservoir Computing systemen. Zo is experimenteel aangetoond dat de idee van Reservoir Computing ook toepasbaar is op niet-neurale netwerken van dynamische knopen. Verder is een maat voorgesteld die gebruikt kan worden om het dynamisch regime van een reservoir te meten. Tenslotte is een adaptatieregel geïntroduceerd die voor een breed scala reservoirtypes de dynamica van het reservoir kan afregelen tot het gewenste dynamisch regime. De technieken beschreven in dit proefschrift zijn gedemonstreerd op verschillende academische en ingenieurstoepassingen
Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data
The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool, called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated to machine learning models that will predict efficiently a specified outcome. To this purpose, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML
A new approach of top-down induction of decision trees for knowledge discovery
Top-down induction of decision trees is the most popular technique for classification in the field of data mining and knowledge discovery. Quinlan developed the basic induction algorithm of decision trees, ID3 (1984), and extended to C4.5 (1993). There is a lot of research work for dealing with a single attribute decision-making node (so-called the first-order decision) of decision trees. Murphy and Pazzani (1991) addressed about multiple-attribute conditions at decision-making nodes. They show that higher order decision-making generates smaller decision trees and better accuracy. However, there always exist NP-complete combinations of multiple-attribute decision-makings.;We develop a new algorithm of second-order decision-tree inductions (SODI) for nominal attributes. The induction rules of first-order decision trees are combined by \u27AND\u27 logic only, but those of SODI consist of \u27AND\u27, \u27OR\u27, and \u27OTHERWISE\u27 logics. It generates more accurate results and smaller decision trees than any first-order decision tree inductions.;Quinlan used information gains via VC-dimension (Vapnik-Chevonenkis; Vapnik, 1995) for clustering the experimental values for each numerical attribute. However, many researchers have discovered the weakness of the use of VC-dim analysis. Bennett (1997) sophistically applies support vector machines (SVM) to decision tree induction. We suggest a heuristic algorithm (SVMM; SVM for Multi-category) that combines a TDIDT scheme with SVM. In this thesis it will be also addressed how to solve multiclass classification problems.;Our final goal for this thesis is IDSS (Induction of Decision Trees using SODI and SVMM). We will address how to combine SODI and SVMM for the construction of top-down induction of decision trees in order to minimize the generalized penalty cost
Estudio de métodos de construcción de ensembles de clasificadores y aplicaciones
La inteligencia artificial se dedica a la creación de sistemas informáticos con un comportamiento inteligente. Dentro de este área el aprendizaje computacional estudia la creación de sistemas que aprenden por sí mismos.
Un tipo de aprendizaje computacional es el aprendizaje supervisado, en el cual, se le proporcionan al sistema tanto las entradas como la salida esperada y el sistema aprende a partir de estos datos. Un sistema de este tipo se denomina clasificador.
En ocasiones ocurre, que en el conjunto de ejemplos que utiliza el sistema para aprender, el número de ejemplos de un tipo es mucho mayor que el número de ejemplos de otro tipo. Cuando esto ocurre se habla de conjuntos desequilibrados.
La combinación de varios clasificadores es lo que se denomina "ensemble", y a menudo ofrece mejores resultados que cualquiera de los miembros que lo forman. Una de las claves para el buen funcionamiento de los ensembles es la diversidad.
Esta tesis, se centra en el desarrollo de nuevos algoritmos de construcción de ensembles, centrados en técnicas de incremento de la diversidad y en los problemas desequilibrados. Adicionalmente, se aplican estas técnicas a la solución de varias problemas industriales.Ministerio de Economía y Competitividad, proyecto TIN-2011-2404
High Dimensional Analysis of Genetic Data for the Classification of Type 2 Diabetes Using Advanced Machine Learning Algorithms
The prevalence of type 2 diabetes (T2D) has increased steadily over the last thirty years and has now reached epidemic proportions. The secondary complications associated with T2D have significant health and economic impacts worldwide and it is now regarded as the seventh leading cause of mortality. Therefore, understanding the underlying causes of T2D is high on government agendas. The condition is a multifactorial disorder with a complex aetiology. This means that T2D emerges from the convergence between genetics, the environment and diet, and lifestyle choices. The genetic determinants remain largely elusive, with only a handful of identified candidate genes. Genome-wide association studies (GWAS) have enhanced our understanding of genetic-based determinants in common complex human diseases. To date, 120 single nucleotide polymorphisms (SNPs) for T2D have been identified using GWAS. Standard statistical tests for single and multi-locus analysis, such as logistic regression, have demonstrated little effect in understanding the genetic architecture of complex human diseases. Logistic regression can capture linear interactions between SNPs and traits however it neglects the non-linear epistatic interactions that are often present within genetic data. Complex human diseases are caused by the contributions made by many interacting genetic variants. However, detecting epistatic interactions and understanding the underlying pathogenesis architecture of complex human disorders remains a significant challenge. This thesis presents a novel framework based on deep learning to reduce the high-dimensional space in GWAS and learn non-linear epistatic interactions in T2D genetic data for binary classification tasks. This framework includes traditional GWAS quality control, association analysis, deep learning stacked autoencoders, and a multilayer perceptron for classification. Quality control procedures are conducted to exclude genetic variants and individuals that do not meet a pre-specified criterion. Logistic association analysis under an additive genetic model adjusted for genomic control inflation factor is also conducted. SNPs generated with a p-value threshold of 10−2 are considered, resulting in 6609 SNPs (features), to remove statistically improbable SNPs and help minimise the computational requirements needed to process all SNPs. The 6609 SNPs are used for epistatic analysis through progressively smaller hidden layer units. Latent representations are extracted using stacked autoencoders to initialise a multilayer feedforward network for binary classification. The classifier is fine-tuned to discriminate between cases and controls using T2D genetic data. The performance of a deep learning stacked autoencoder model is evaluated and benchmarked against a multilayer perceptron and a random forest learning algorithm. The findings show that the best results were obtained using 2500 compressed hidden units (AUC=94.25%). However, the classification accuracy when using 300 compressed neurons remains reasonable with (AUC=80.78%). The results are promising. Using deep learning stacked autoencoders, it is possible to reduce high-dimensional features in T2D GWAS data and learn non-linear epistatic interactions between SNPs while enhancing overall model performance for binary classification purposes
- …