5 research outputs found

    Using Markov Boundary Approach for Interpretable and Generalizable Feature Selection

    The predictive power and generalizability of a model depend on the quality of the features selected for it. Machine learning (ML) models in banks consider a large number of features, which are often correlated or dependent. Incorporating all of these features may hinder model stability, and prior feature screening can improve the long-term performance of the models. A Markov boundary (MB) of features is the minimal set of features that guarantees that other potential predictors do not affect the target given the boundary, while ensuring maximal predictive accuracy. Identifying the Markov boundary is straightforward under assumptions of Gaussianity on the features and linear relationships between them. This paper outlines common problems associated with identifying the Markov boundary in structured data when relationships are non-linear and predictors are of mixed data type. We propose a multi-group forward-backward selection strategy that not only handles continuous features but also addresses some of the issues with MB identification in a mixed-data setup, and we demonstrate its capabilities on simulated and real datasets.

    Kernel Distribution Embeddings: Universal Kernels, Characteristic Kernels and Kernel Metrics on Distributions

    Kernel mean embeddings have recently attracted the attention of the machine learning community. They map measures μ from some set M to functions in a reproducing kernel Hilbert space (RKHS) with kernel k. The RKHS distance of two mapped measures is a semi-metric d_k over M. We study three questions. (I) For a given kernel, what sets M can be embedded? (II) When is the embedding injective over M (in which case d_k is a metric)? (III) How does the d_k-induced topology compare to other topologies on M? The existing machine learning literature has addressed these questions in cases where M is (a subset of) the finite regular Borel measures. We unify, improve and generalise those results. Our approach naturally leads to continuous and possibly even injective embeddings of (Schwartz-) distributions, i.e., generalised measures, but the reader is free to focus on measures only. In particular, we systemise and extend various (partly known) equivalences between different notions of universal, characteristic and strictly positive definite kernels, and show that on an underlying locally compact Hausdorff space, d_k metrises the weak convergence of probability measures if and only if k is continuous and characteristic. Comment: Old and longer version of the JMLR paper with same title (published 2018). Please start with the JMLR version. 55 pages (33 pages main text, 22 pages appendix), 2 tables, 1 figure (in appendix).
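
    As a rough illustration of the semi-metric d_k described above, the following sketch estimates d_k(P, Q)^2 from two finite samples using the standard expansion E k(x, x') - 2 E k(x, y) + E k(y, y'). The function names, the one-dimensional Gaussian kernel and the bandwidth value are assumptions made for this example and are not taken from the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on the real line (an assumed choice of k)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, kernel):
    """Average of k(x, y) over all pairs drawn from the two samples."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of d_k(P, Q)^2 from samples xs ~ P, ys ~ Q."""
    return (
        mean_kernel(xs, xs, kernel)
        - 2.0 * mean_kernel(xs, ys, kernel)
        + mean_kernel(ys, ys, kernel)
    )


if __name__ == "__main__":
    near = [0.0, 1.0, 2.0]
    far = [5.0, 6.0, 7.0]
    print(mmd_squared(near, near))  # identical samples give zero distance
    print(mmd_squared(near, far))   # well-separated samples give a positive value
```

    The estimate is zero when both samples coincide and is symmetric in its arguments. Whether a positive population value is guaranteed for every pair of distinct distributions depends on the kernel being characteristic, which is question (II) in the abstract.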

    A Kernel Multiple Change-point Algorithm via Model Selection

    We tackle the change-point problem with data belonging to a general set. We build a penalty for choosing the number of change-points in the kernel-based method of Harchaoui and Cappé (2007). This penalty generalizes the one proposed by Lebarbier (2005) for one-dimensional signals. We prove a non-asymptotic oracle inequality for the proposed method, thanks to a new concentration result for some function of Hilbert-space valued random variables. Experiments on synthetic data illustrate the accuracy of our method, showing that it can detect changes in the whole distribution of data, even when the mean and variance are constant.
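
    The following is a minimal sketch of a kernel least-squares segmentation criterion of the kind the abstract builds on, restricted to a single change-point. It does not implement the paper's penalty for choosing the number of change-points. The function names, the Gaussian kernel, the bandwidth and the toy signal are all assumptions made for illustration. The toy signal keeps a constant mean and changes only its spread, which is a simpler setting than the constant mean and variance case mentioned in the abstract.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on the real line (an assumed choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def segment_cost(gram, start, stop):
    """Within-segment kernel scatter for indices start..stop-1.

    Equals sum_i k(x_i, x_i) - (1/n) * sum_{i,j} k(x_i, x_j), i.e. the
    squared RKHS distance of the mapped points to their segment mean.
    """
    n = stop - start
    diag = sum(gram[i][i] for i in range(start, stop))
    block = sum(gram[i][j] for i in range(start, stop) for j in range(start, stop))
    return diag - block / n


def best_single_change_point(xs, kernel=gaussian_kernel):
    """Return the split index t minimising cost(0, t) + cost(t, n)."""
    n = len(xs)
    gram = [[kernel(a, b) for b in xs] for a in xs]
    costs = {
        t: segment_cost(gram, 0, t) + segment_cost(gram, t, n)
        for t in range(1, n)
    }
    return min(costs, key=costs.get)


# Toy signal: the mean is zero throughout, the spread changes at index 20.
signal = [0.1 * (-1) ** i for i in range(20)] + [2.0 * (-1) ** i for i in range(20)]

if __name__ == "__main__":
    print(best_single_change_point(signal))
```

    Extending this to several change-points would typically use dynamic programming over the same segment costs, with a penalty term deciding how many segments to keep.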

    GRAPHICAL MODELS FOR HIGH DIMENSIONAL DATA WITH GENOMIC APPLICATIONS

    Many previous studies have demonstrated that gene expression or other types of -omic features collected from patients can help disease diagnosis or treatment selection. For example, a few recent studies demonstrated that gene expression data collected from cancer cell lines are highly informative for predicting cancer drug sensitivity (Garnett et al., 2012; Barretina et al., 2012; Chen et al., 2016b). This is partly because many cancer drugs are targeted drugs that perturb a particular mutated gene or protein, and thus having that mutation, or observing the consequence of such a mutation in gene expression data, is highly informative for drug sensitivity prediction. Such systematic studies of drug sensitivities require giving different drugs in a series of doses to the same cell line, which is not possible in human studies. More sophisticated methods are needed to estimate the potential effects of cancer drugs based on observational data. Since the effect of a targeted cancer drug can be considered as an intervention on the molecular system of cancer cells, a directed graphical model for gene-gene associations is a natural choice to model the molecular system and to study the consequences of such interventions. In this dissertation, we develop new statistical methods to estimate DAGs using high dimensional -omic data under two scenarios: i) with a model-free approach and ii) with single cell RNA-seq (scRNAseq) data. In Chapter 1, we give a brief introduction to graphical models, the various statistical characterizations of graphical models and the most current approaches to estimating graph structures. Then, we review scRNAseq data and current approaches to analyzing it. Next, in Chapter 2, we propose a model-free method to estimate graphical models in two steps. The first step uses a model-free variable selection method based on the principles of sufficient dimension reduction. The second step uses a non-parametric conditional independence testing method which utilizes embeddings of the conditional spaces into reproducing kernel Hilbert spaces. We review some theoretical background in order to establish the asymptotic graphical model estimation consistency of this two-step approach. We examine its performance in simulations and on TCGA breast cancer data, where we find significant improvements over current methods that require strong model assumptions. In Chapter 3, we propose a graphical model algorithm to analyze scRNAseq data. Similar to the previous algorithm, we create a two-step estimation method which utilizes a joint penalized zero-inflation model. We assess its performance and drawbacks in simulations. Then, we examine its utility when applied after clustering to a sample of 68k peripheral blood mononuclear cells with multiple subpopulations. Doctor of Philosophy

    Feature Extraction and Selection in Automatic Sleep Stage Classification

    Sleep stage classification is vital for diagnosing many sleep-related disorders, and polysomnography (PSG) is an important tool in this regard. The visual process of sleep stage classification is time-consuming, subjective and costly. To improve the accuracy and efficiency of sleep stage classification, researchers have been trying to develop automatic classification algorithms. Automatic sleep stage classification mainly consists of three steps: pre-processing, feature extraction and classification. In this research work, we focused on the feature extraction and selection steps. The main goal of this thesis was to identify a robust and reliable feature set that can lead to efficient classification of sleep stages. To achieve this goal, three types of contributions were introduced, in feature selection, feature extraction and feature vector quality enhancement. Several feature ranking and rank aggregation methods were evaluated and compared to find the best feature set. Evaluation results indicated that the choice of feature selection method depends on system design requirements such as low computational complexity, high stability or high classification accuracy. In addition to conventional feature ranking methods, novel methods such as the Stacked Sparse AutoEncoder (SSAE) were used for dimensionality reduction. In the feature extraction area, new and effective features such as distance-based features were utilized for the first time in sleep stage classification. The results showed that these features contribute positively to the classification performance. For signal quality enhancement, a loss-less EEG artefact removal algorithm was proposed. The proposed adaptive algorithm led to a significant enhancement in the overall classification accuracy.