Using Markov Boundary Approach for Interpretable and Generalizable Feature Selection
Predictive power and generalizability of models depend on the quality of
features selected in the model. Machine learning (ML) models in banks consider
a large number of features which are often correlated or dependent.
Incorporating such features may hinder model stability, and prior feature
screening can improve the long-term performance of the models. A Markov
boundary (MB) is the minimal set of features that, once conditioned on,
renders all other potential predictors irrelevant to the target while
preserving maximal predictive accuracy. Identifying the Markov boundary is straightforward
under assumptions of Gaussianity on the features and linear relationships
between them. This paper outlines common problems associated with identifying
the Markov boundary in structured data when relationships are non-linear, and
predictors are of mixed data types. We have proposed a multi-group
forward-backward selection strategy that not only handles continuous
features but also addresses some of the issues with MB identification in a
mixed-data setup, and we have demonstrated its capabilities on simulated and
real datasets.
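The multi-group strategy itself is not spelled out in the abstract, but the general forward-backward (grow-shrink) idea can be sketched for the Gaussian/linear case the abstract mentions, with a partial-correlation test standing in for the conditional independence check. All function names and parameters here are illustrative, not the authors' implementation:

```python
import math
import numpy as np

def partial_corr_pvalue(x, y, Z, n):
    """p-value for the partial correlation of x and y given the columns of Z
    (Fisher z-test; valid under Gaussian/linear assumptions)."""
    if Z.shape[1] > 0:
        # Residualize x and y on the conditioning set via least squares.
        A = np.column_stack([Z, np.ones(len(x))])
        x = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
        y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    r = max(min(r, 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - Z.shape[1] - 3)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def grow_shrink_mb(X, y, alpha=0.001):
    """Forward (grow) then backward (shrink) Markov-boundary search."""
    n, p = X.shape
    mb = []
    changed = True
    while changed:  # grow: repeatedly add the most dependent feature
        changed = False
        pvals = {j: partial_corr_pvalue(X[:, j], y, X[:, mb], n)
                 for j in range(p) if j not in mb}
        if pvals:
            j_best = min(pvals, key=pvals.get)
            if pvals[j_best] < alpha:
                mb.append(j_best)
                changed = True
    for j in list(mb):  # shrink: drop features that became redundant
        rest = [k for k in mb if k != j]
        if partial_corr_pvalue(X[:, j], y, X[:, rest], n) >= alpha:
            mb.remove(j)
    return sorted(mb)

rng = np.random.default_rng(0)
n = 2000
x0, x1 = rng.normal(size=n), rng.normal(size=n)
x2 = x0 + 0.1 * rng.normal(size=n)       # nearly redundant copy of x0
y = x0 + x1 + 0.5 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
print(grow_shrink_mb(X, y))              # the redundant x2 should be screened out
```

The correlated copy x2 illustrates the point made in the abstract: marginal screening would keep it, while the boundary search discards it once x0 is conditioned on.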
Kernel Distribution Embeddings: Universal Kernels, Characteristic Kernels and Kernel Metrics on Distributions
Kernel mean embeddings have recently attracted the attention of the machine
learning community. They map measures from some set M to functions in a
reproducing kernel Hilbert space (RKHS) with kernel k. The RKHS distance d_k
of two mapped measures is a semi-metric over M. We study three questions.
(I) For a given kernel, what sets M can be embedded? (II) When is the
embedding injective over M (in which case d_k is a metric)? (III) How does
the d_k-induced topology compare to other topologies on M? The existing
machine learning literature has addressed these questions in cases where M is
(a subset of) the finite regular Borel measures. We unify, improve and
generalise those results. Our approach naturally leads to continuous and
possibly even injective embeddings of (Schwartz-) distributions, i.e.,
generalised measures, but the reader is free to focus on measures only. In
particular, we systemise and extend various (partly known) equivalences between
different notions of universal, characteristic and strictly positive definite
kernels, and show that, on an underlying locally compact Hausdorff space, d_k
metrises the weak convergence of probability measures if and only if k is
continuous and characteristic.
Comment: Old and longer version of the JMLR paper with same title (published
2018). Please start with the JMLR version. 55 pages (33 pages main text, 22
pages appendix), 2 tables, 1 figure (in appendix).
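As a concrete illustration of the embedding, the empirical RKHS distance between two mean embeddings (the maximum mean discrepancy, MMD) with a Gaussian kernel can be computed directly from kernel evaluations. This is the standard construction rather than code from the paper:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd(X, Y, sigma=1.0):
    """Empirical RKHS distance between the mean embeddings of samples X and Y
    (biased V-statistic estimator)."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return np.sqrt(max(Kxx.mean() + Kyy.mean() - 2 * Kxy.mean(), 0.0))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))
Y = rng.normal(0.0, 1.0, size=(500, 1))   # same distribution as X
Z = rng.normal(2.0, 1.0, size=(500, 1))   # shifted distribution
print(mmd(X, Y))                          # near zero: same distribution
print(mmd(X, Z))                          # clearly positive: distinct distributions
```

Because the Gaussian kernel is continuous and characteristic, this distance is a metric on probability measures, which is exactly the situation covered by the paper's question (II).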
A Kernel Multiple Change-point Algorithm via Model Selection
We tackle the change-point problem with data belonging to a general set. We build a penalty for choosing the number of change-points in the kernel-based method of Harchaoui and Cappé (2007). This penalty generalizes the one proposed by Lebarbier (2005) for one-dimensional signals. We prove a non-asymptotic oracle inequality for the proposed method, thanks to a new concentration result for some function of Hilbert-space-valued random variables. Experiments on synthetic data illustrate the accuracy of our method, showing that it can detect changes in the whole distribution of the data, even when the mean and variance are constant.
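A minimal sketch of the kernel change-point idea: score each segment by its within-segment scatter in the RKHS, optimize over segmentations by dynamic programming, and choose the number of change-points with a penalty. A simple linear penalty stands in for the paper's calibrated one, and an RBF kernel is assumed; this is an illustration, not the authors' algorithm:

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """Gram matrix of a 1-D signal under an RBF kernel."""
    return np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2))

def segment_cost(K):
    """RKHS least-squares cost of every segment [i, j):
    sum_t k(x_t, x_t) - (1/m) * sum_{s,t} k(x_s, x_t)."""
    n = len(K)
    cost = np.full((n, n + 1), np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = K[i:j, i:j]
            cost[i, j] = np.trace(sub) - sub.sum() / (j - i)
    return cost

def kernel_changepoints(x, max_cp=3, pen=3.0, sigma=1.0):
    """Dynamic programming over segmentations; the number of change-points
    minimizes total cost + pen * (number of change-points)."""
    n = len(x)
    cost = segment_cost(rbf_gram(x, sigma))
    # best[k][j]: optimal cost of splitting x[:j] into k+1 segments
    best = np.full((max_cp + 1, n + 1), np.inf)
    arg = np.zeros((max_cp + 1, n + 1), dtype=int)
    best[0, :] = cost[0, :]
    for k in range(1, max_cp + 1):
        for j in range(k + 1, n + 1):
            vals = [best[k - 1, i] + cost[i, j] for i in range(k, j)]
            i_best = int(np.argmin(vals)) + k
            best[k, j], arg[k, j] = vals[i_best - k], i_best
    k_hat = int(np.argmin([best[k, n] + pen * k for k in range(max_cp + 1)]))
    cps, j = [], n          # backtrack the selected change-point locations
    for k in range(k_hat, 0, -1):
        j = arg[k, j]
        cps.append(int(j))
    return sorted(cps)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.3, 30), rng.normal(3.0, 0.3, 30)])
print(kernel_changepoints(x))    # a single change-point near index 30
```

Because the cost is computed from kernel evaluations only, the same routine applies to any data type with a kernel, which is the "general set" point of the abstract; a distribution change that leaves mean and variance fixed would still alter the Gram matrix.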
GRAPHICAL MODELS FOR HIGH DIMENSIONAL DATA WITH GENOMIC APPLICATIONS
Many previous studies have demonstrated that gene expression or other types of -omic features collected from patients can help disease diagnosis or treatment selection. For example, a few recent studies demonstrated that gene expression data collected from
cancer cell lines are highly informative to predict cancer drug sensitivity (Garnett et al., 2012; Barretina et al., 2012; Chen et al., 2016b). This is partly because many cancer drugs are targeted drugs that perturb a particular mutated gene or protein, and thus having that mutation, or observing the consequence of such mutation in gene expression data, is highly informative for drug sensitivity prediction. Such systematic studies of drug sensitivities require giving different drugs in a series of doses to the same cell line, which is obviously not possible for human studies. More sophisticated methods are needed to estimate potential effects of cancer drugs based on observational data. Since the effect of a targeted cancer drug can be considered as an intervention to the molecular system of cancer cells, a directed graphical model for gene-gene associations is a natural choice to model the molecular system and to study the consequence of such interventions.
In this dissertation, we develop new statistical methods to estimate DAGs using high dimensional -omic data under two scenarios: i) using a model-free approach and ii) for single cell RNA-seq (scRNAseq) data. In Chapter 1, we give a brief introduction to graphical models, the various statistical characterizations of graphical models, and the most current approaches to estimate graph structures. Then, we review scRNAseq data and current approaches to analyze them. Next, in Chapter 2, we propose a model-free method to estimate graphical models in two steps. The first step uses a model-free variable selection method based on the principles of sufficient dimension reduction. The second step uses a non-parametric conditional independence testing method which utilizes embeddings of the conditional spaces into reproducing kernel Hilbert spaces. We review some theoretical background
in order to establish the asymptotic graphical model estimation consistency of this two-step approach. We examine its performance on simulations and on TCGA breast cancer data, where we find significant improvements over current methods that require strong model assumptions. In Chapter 3, we propose a graphical model algorithm to analyze scRNAseq data. Similar to the previous algorithm, we create a two-step estimation method which utilizes a joint penalized zero-inflation model. We assess its performance and drawbacks in simulations. Then, we examine its utility when applied after clustering to a sample of 68k peripheral blood mononuclear cells with multiple subpopulations.
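The two-step idea of Chapter 2 (screen candidate neighbors, then prune with conditional independence tests) can be sketched as follows. Marginal correlation and a partial-correlation threshold stand in for the sufficient-dimension-reduction and kernel CI steps of the dissertation, and all names are illustrative:

```python
import numpy as np

def screen_neighbors(X, j, top_k=5):
    """Step 1 (screening): rank candidate neighbors of variable j.
    Marginal correlation is a stand-in for the SDR-based selection."""
    r = np.abs([np.corrcoef(X[:, j], X[:, m])[0, 1] for m in range(X.shape[1])])
    order = [m for m in np.argsort(r)[::-1] if m != j]
    return order[:top_k]

def ci_partial_corr(X, a, b, cond, thresh=0.1):
    """Step 2 (pruning): declare a independent of b given cond when the
    partial correlation is small. A linear stand-in for the kernel CI test."""
    A = np.column_stack([X[:, cond], np.ones(len(X))])
    ra = X[:, a] - A @ np.linalg.lstsq(A, X[:, a], rcond=None)[0]
    rb = X[:, b] - A @ np.linalg.lstsq(A, X[:, b], rcond=None)[0]
    return abs(np.corrcoef(ra, rb)[0, 1]) < thresh

def skeleton(X, top_k=5, thresh=0.1):
    """Estimate the undirected skeleton: screen, then prune with CI tests."""
    p = X.shape[1]
    edges = set()
    for j in range(p):
        for m in screen_neighbors(X, j, top_k):
            cond = [c for c in range(p) if c not in (j, m)]
            if not ci_partial_corr(X, j, m, cond, thresh):
                edges.add((int(min(j, m)), int(max(j, m))))
    return sorted(edges)

rng = np.random.default_rng(1)
n = 2000
x0 = rng.normal(size=n)
x1 = x0 + 0.5 * rng.normal(size=n)      # chain: x0 -> x1 -> x2
x2 = x1 + 0.5 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
print(skeleton(X))                      # chain skeleton: edge (0,2) pruned away
```

The chain example shows why the CI step matters: x0 and x2 are strongly marginally correlated and survive screening, but conditioning on x1 removes the spurious edge.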
Feature Extraction and Selection in Automatic Sleep Stage Classification
Sleep stage classification is vital for diagnosing many sleep-related
disorders, and polysomnography (PSG) is an important tool in this regard.
The visual process of sleep stage classification is time-consuming, subjective
and costly. To improve the accuracy and efficiency of sleep stage
classification, researchers have been trying to develop automatic
classification algorithms.
Automatic sleep stage classification mainly consists of three steps:
pre-processing, feature extraction and classification. In this research work,
we focused on the feature extraction and selection steps. The main goal of
this thesis was to identify a robust and reliable feature set that can lead to
efficient classification of sleep stages. To achieve this goal, three types of
contributions were introduced: in feature selection, feature extraction and
feature vector quality enhancement.
Several feature ranking and rank aggregation methods were evaluated and
compared to find the best feature set. The evaluation results indicated that
the choice of feature selection method depends on the system design
requirements, such as low computational complexity, high stability
or high classification accuracy. In addition to conventional feature ranking
methods, novel methods such as the Stacked Sparse AutoEncoder (SSAE) were
used in this thesis for dimensionality reduction.
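As one example of rank aggregation over several feature rankers, a Borda count combines per-method rankings into a consensus ordering. This is a generic scheme, not necessarily the aggregation method evaluated in the thesis, and the ranker names are hypothetical:

```python
import numpy as np

def borda_aggregate(rankings, n_features):
    """Aggregate feature rankings with a Borda count: each ranking awards
    n_features - position points; higher total means better consensus rank."""
    scores = np.zeros(n_features)
    for ranking in rankings:           # each ranking lists feature indices, best first
        for pos, feat in enumerate(ranking):
            scores[feat] += n_features - pos
    return [int(i) for i in np.argsort(-scores)]   # consensus ranking, best first

# Three hypothetical rankers (e.g. ReliefF, mRMR, mutual information) over 4 features.
r1 = [2, 0, 1, 3]
r2 = [2, 1, 0, 3]
r3 = [0, 2, 1, 3]
print(borda_aggregate([r1, r2, r3], 4))   # -> [2, 0, 1, 3]: feature 2 wins
```

Aggregation of this kind trades a little single-criterion accuracy for stability across rankers, which matches the thesis's observation that the right selection method depends on the design requirement being optimized.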
In the feature extraction area, new and effective features such as
distance-based features were utilized for the first time in sleep stage
classification. The results showed that these features contribute positively
to the classification performance. For signal quality enhancement, a lossless
EEG artefact removal algorithm was proposed. The proposed adaptive algorithm
led to a significant enhancement in the overall classification accuracy.