76 research outputs found

    Statistical Methods to Enhance Clinical Prediction with High-Dimensional Data and Ordinal Response

    Get PDF
    Der technologische Fortschritt ermöglicht es heute, die moleculare Konfiguration einzelner Zellen oder ganzer Gewebeproben zu untersuchen. Solche in großen Mengen produzierten hochdimensionalen Omics-Daten aus der Molekularbiologie lassen sich zu immer niedrigeren Kosten erzeugen und werden so immer häufiger auch in klinischen Fragestellungen eingesetzt. Personalisierte Diagnose oder auch die Vorhersage eines Behandlungserfolges auf der Basis solcher Hochdurchsatzdaten stellen eine moderne Anwendung von Techniken aus dem maschinellen Lernen dar. In der Praxis werden klinische Parameter, wie etwa der Gesundheitszustand oder die Nebenwirkungen einer Therapie, häufig auf einer ordinalen Skala erhoben (beispielsweise gut, normal, schlecht). Es ist verbreitet, Klassifikationsproblme mit ordinal skaliertem Endpunkt wie generelle Mehrklassenproblme zu behandeln und somit die Information, die in der Ordnung zwischen den Klassen enthalten ist, zu ignorieren. Allerdings kann das Vernachlässigen dieser Information zu einer verminderten Klassifikationsgüte führen oder sogar eine ungünstige ungeordnete Klassifikation erzeugen. Klassische Ansätze, einen ordinal skalierten Endpunkt direkt zu modellieren, wie beispielsweise mit einem kumulativen Linkmodell, lassen sich typischerweise nicht auf hochdimensionale Daten anwenden. Wir präsentieren in dieser Arbeit hierarchical twoing (hi2) als einen Algorithmus für die Klassifikation hochdimensionler Daten in ordinal Skalierte Kategorien. hi2 nutzt die Mächtigkeit der sehr gut verstandenen binären Klassifikation, um auch in ordinale Kategorien zu klassifizieren. Eine Opensource-Implementierung von hi2 ist online verfügbar. In einer Vergleichsstudie zur Klassifikation von echten wie von simulierten Daten mit ordinalem Endpunkt produzieren etablierte Methoden, die speziell für geordnete Kategorien entworfen wurden, nicht generell bessere Ergebnisse als state-of-the-art nicht-ordinale Klassifikatoren. Die Fähigkeit eines Algorithmus, mit hochdimensionalen Daten umzugehen, dominiert die Klassifikationsleisting. Wir zeigen, dass unser Algorithmus hi2 konsistent gute Ergebnisse erzielt und in vielen Fällen besser abschneidet als die anderen Methoden

    Machine Learning

    Get PDF
    Machine Learning can be defined in various ways related to a scientific domain concerned with the design and development of theoretical and implementation tools that allow building systems with some Human Like intelligent behavior. Machine learning addresses more specifically the ability to improve automatically through experience

    Discovering Higher-order SNP Interactions in High-dimensional Genomic Data

    Get PDF
    In this thesis, a multifactor dimensionality reduction based method on associative classification is employed to identify higher-order SNP interactions for enhancing the understanding of the genetic architecture of complex diseases. Further, this thesis explored the application of deep learning techniques by providing new clues into the interaction analysis. The performance of the deep learning method is maximized by unifying deep neural networks with a random forest for achieving reliable interactions in the presence of noise

    Spatial and Content-based Audio Processing using Stochastic Optimization Methods

    Get PDF
    Stochastic optimization (SO) represents a category of numerical optimization approaches, in which the search for the optimal solution involves randomness in a constructive manner. As shown also in this thesis, the stochastic optimization techniques and models have become an important and notable paradigm in a wide range of application areas, including transportation models, financial instruments, and network design. Stochastic optimization is especially developed for solving the problems that are either too difficult or impossible to solve analytically by deterministic optimization approaches. In this thesis, the focus is put on applying several stochastic optimization algorithms to two audio-specific application areas, namely sniper positioning and content-based audio classification and retrieval. In short, the first application belongs to an area of spatial audio, whereas the latter is a topic of machine learning and, more specifically, multimedia information retrieval. The SO algorithms considered in the thesis are particle filtering (PF), particle swarm optimization (PSO), and simulated annealing (SA), which are extended, combined and applied to the specified problems in a novel manner. Based on their iterative and evolving nature, especially the PSO algorithms are often included to the category of evolutionary algorithms. Considering the sniper positioning application, in this thesis the PF and SA algorithms are employed to optimize the parameters of a mathematical shock wave model based on observed firing event wavefronts. Such an inverse problem is suitable for Bayesian approach, which is the main motivation for including the PF approach among the considered optimization methods. It is shown – also with SA – that by applying the stated shock wave model, the proposed stochastic parameter estimation approach provides statistically reliable and qualified results. The content-based audio classification part of the thesis is based on a dedicated framework consisting of several individual binary classifiers. In this work, artificial neural networks (ANNs) are used within the framework, for which the parameters and network structures are optimized based the desired item outputs, i.e. the ground truth class labels. The optimization process is carried out using a multi-dimensional extension of the regular PSO algorithm (MD PSO). The audio retrieval experiments are performed in the context of feature generation (synthesis), which is an approach for generating new audio features/attributes based on some conventional features originally extracted from a particular audio database. Here the MD PSO algorithm is applied to optimize the parameters of the feature generation process, wherein the dimensionality of the generated feature vector is also optimized. Both from practical perspective and the viewpoint of complexity theory, stochastic optimization techniques are often computationally demanding. Because of this, the practical implementations discussed in this thesis are designed as directly applicable to parallel computing. This is an important and topical issue considering the continuous increase of computing grids and cloud services. Indeed, many of the results achieved in this thesis are computed using a grid of several computers. Furthermore, since also personal computers and mobile handsets include an increasing number of processor cores, such parallel implementations are not limited to grid servers only

    Reservoir Computing: computation with dynamical systems

    Get PDF
    In het onderzoeksgebied Machine Learning worden systemen onderzocht die kunnen leren op basis van voorbeelden. Binnen dit onderzoeksgebied zijn de recurrente neurale netwerken een belangrijke deelgroep. Deze netwerken zijn abstracte modellen van de werking van delen van de hersenen. Zij zijn in staat om zeer complexe temporele problemen op te lossen maar zijn over het algemeen zeer moeilijk om te trainen. Recentelijk zijn een aantal gelijkaardige methodes voorgesteld die dit trainingsprobleem elimineren. Deze methodes worden aangeduid met de naam Reservoir Computing. Reservoir Computing combineert de indrukwekkende rekenkracht van recurrente neurale netwerken met een eenvoudige trainingsmethode. Bovendien blijkt dat deze trainingsmethoden niet beperkt zijn tot neurale netwerken, maar kunnen toegepast worden op generieke dynamische systemen. Waarom deze systemen goed werken en welke eigenschappen bepalend zijn voor de prestatie is evenwel nog niet duidelijk. Voor dit proefschrift is onderzoek gedaan naar de dynamische eigenschappen van generieke Reservoir Computing systemen. Zo is experimenteel aangetoond dat de idee van Reservoir Computing ook toepasbaar is op niet-neurale netwerken van dynamische knopen. Verder is een maat voorgesteld die gebruikt kan worden om het dynamisch regime van een reservoir te meten. Tenslotte is een adaptatieregel geïntroduceerd die voor een breed scala reservoirtypes de dynamica van het reservoir kan afregelen tot het gewenste dynamisch regime. De technieken beschreven in dit proefschrift zijn gedemonstreerd op verschillende academische en ingenieurstoepassingen

    Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data

    Get PDF
    The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool, called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated to machine learning models that will predict efficiently a specified outcome. To this purpose, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML

    A new approach of top-down induction of decision trees for knowledge discovery

    Get PDF
    Top-down induction of decision trees is the most popular technique for classification in the field of data mining and knowledge discovery. Quinlan developed the basic induction algorithm of decision trees, ID3 (1984), and extended to C4.5 (1993). There is a lot of research work for dealing with a single attribute decision-making node (so-called the first-order decision) of decision trees. Murphy and Pazzani (1991) addressed about multiple-attribute conditions at decision-making nodes. They show that higher order decision-making generates smaller decision trees and better accuracy. However, there always exist NP-complete combinations of multiple-attribute decision-makings.;We develop a new algorithm of second-order decision-tree inductions (SODI) for nominal attributes. The induction rules of first-order decision trees are combined by \u27AND\u27 logic only, but those of SODI consist of \u27AND\u27, \u27OR\u27, and \u27OTHERWISE\u27 logics. It generates more accurate results and smaller decision trees than any first-order decision tree inductions.;Quinlan used information gains via VC-dimension (Vapnik-Chevonenkis; Vapnik, 1995) for clustering the experimental values for each numerical attribute. However, many researchers have discovered the weakness of the use of VC-dim analysis. Bennett (1997) sophistically applies support vector machines (SVM) to decision tree induction. We suggest a heuristic algorithm (SVMM; SVM for Multi-category) that combines a TDIDT scheme with SVM. In this thesis it will be also addressed how to solve multiclass classification problems.;Our final goal for this thesis is IDSS (Induction of Decision Trees using SODI and SVMM). We will address how to combine SODI and SVMM for the construction of top-down induction of decision trees in order to minimize the generalized penalty cost

    Estudio de métodos de construcción de ensembles de clasificadores y aplicaciones

    Get PDF
    La inteligencia artificial se dedica a la creación de sistemas informáticos con un comportamiento inteligente. Dentro de este área el aprendizaje computacional estudia la creación de sistemas que aprenden por sí mismos. Un tipo de aprendizaje computacional es el aprendizaje supervisado, en el cual, se le proporcionan al sistema tanto las entradas como la salida esperada y el sistema aprende a partir de estos datos. Un sistema de este tipo se denomina clasificador. En ocasiones ocurre, que en el conjunto de ejemplos que utiliza el sistema para aprender, el número de ejemplos de un tipo es mucho mayor que el número de ejemplos de otro tipo. Cuando esto ocurre se habla de conjuntos desequilibrados. La combinación de varios clasificadores es lo que se denomina "ensemble", y a menudo ofrece mejores resultados que cualquiera de los miembros que lo forman. Una de las claves para el buen funcionamiento de los ensembles es la diversidad. Esta tesis, se centra en el desarrollo de nuevos algoritmos de construcción de ensembles, centrados en técnicas de incremento de la diversidad y en los problemas desequilibrados. Adicionalmente, se aplican estas técnicas a la solución de varias problemas industriales.Ministerio de Economía y Competitividad, proyecto TIN-2011-2404

    High Dimensional Analysis of Genetic Data for the Classification of Type 2 Diabetes Using Advanced Machine Learning Algorithms

    Get PDF
    The prevalence of type 2 diabetes (T2D) has increased steadily over the last thirty years and has now reached epidemic proportions. The secondary complications associated with T2D have significant health and economic impacts worldwide and it is now regarded as the seventh leading cause of mortality. Therefore, understanding the underlying causes of T2D is high on government agendas. The condition is a multifactorial disorder with a complex aetiology. This means that T2D emerges from the convergence between genetics, the environment and diet, and lifestyle choices. The genetic determinants remain largely elusive, with only a handful of identified candidate genes. Genome-wide association studies (GWAS) have enhanced our understanding of genetic-based determinants in common complex human diseases. To date, 120 single nucleotide polymorphisms (SNPs) for T2D have been identified using GWAS. Standard statistical tests for single and multi-locus analysis, such as logistic regression, have demonstrated little effect in understanding the genetic architecture of complex human diseases. Logistic regression can capture linear interactions between SNPs and traits however it neglects the non-linear epistatic interactions that are often present within genetic data. Complex human diseases are caused by the contributions made by many interacting genetic variants. However, detecting epistatic interactions and understanding the underlying pathogenesis architecture of complex human disorders remains a significant challenge. This thesis presents a novel framework based on deep learning to reduce the high-dimensional space in GWAS and learn non-linear epistatic interactions in T2D genetic data for binary classification tasks. This framework includes traditional GWAS quality control, association analysis, deep learning stacked autoencoders, and a multilayer perceptron for classification. Quality control procedures are conducted to exclude genetic variants and individuals that do not meet a pre-specified criterion. Logistic association analysis under an additive genetic model adjusted for genomic control inflation factor is also conducted. SNPs generated with a p-value threshold of 10−2 are considered, resulting in 6609 SNPs (features), to remove statistically improbable SNPs and help minimise the computational requirements needed to process all SNPs. The 6609 SNPs are used for epistatic analysis through progressively smaller hidden layer units. Latent representations are extracted using stacked autoencoders to initialise a multilayer feedforward network for binary classification. The classifier is fine-tuned to discriminate between cases and controls using T2D genetic data. The performance of a deep learning stacked autoencoder model is evaluated and benchmarked against a multilayer perceptron and a random forest learning algorithm. The findings show that the best results were obtained using 2500 compressed hidden units (AUC=94.25%). However, the classification accuracy when using 300 compressed neurons remains reasonable with (AUC=80.78%). The results are promising. Using deep learning stacked autoencoders, it is possible to reduce high-dimensional features in T2D GWAS data and learn non-linear epistatic interactions between SNPs while enhancing overall model performance for binary classification purposes
    corecore