5 research outputs found

Using Markov Boundary Approach for Interpretable and Generalizable Feature Selection

Authors: Bhattacharyya Anwesha; Nair Vijayan N.; Vaughan Joel; Wang Yaqun
Publication date: 26/07/2023

The predictive power and generalizability of a model depend on the quality of the features selected for it. Machine learning (ML) models in banks consider a large number of features, which are often correlated or dependent. Incorporating all of these features may hinder model stability, and prior feature screening can improve the long-term performance of the models. A Markov boundary (MB) of features is the minimal set of features that guarantees that other potential predictors do not affect the target given the boundary, while ensuring maximal predictive accuracy. Identifying the Markov boundary is straightforward under assumptions of Gaussianity on the features and linear relationships between them. This paper outlines common problems associated with identifying the Markov boundary in structured data when relationships are non-linear and predictors are of mixed data types. We propose a multi-group forward-backward selection strategy that not only handles continuous features but also addresses some of the issues with MB identification in a mixed-data setup, and we demonstrate its capabilities on simulated and real datasets.
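
The Markov boundary condition in the abstract above can be written out as follows. The notation is an assumption made here for illustration and is not taken from the paper: $Y$ is the target, $X = (X_1, \dots, X_p)$ are the candidate features, and $S$ is a subset of feature indices. The second line states the standard Gaussian special case, which is one way to see why identification is straightforward under Gaussianity and linearity.

```latex
% Markov boundary: a minimal index set S such that the target is
% conditionally independent of all remaining features given X_S.
Y \perp\!\!\!\perp X_{\{1,\dots,p\} \setminus S} \;\big|\; X_S

% Gaussian special case (assumption: (Y, X) is jointly Gaussian with
% positive-definite covariance \Sigma and precision \Omega = \Sigma^{-1}).
% The boundary is the set of features whose precision entry with Y is nonzero.
S = \{\, j \in \{1,\dots,p\} : \Omega_{Y j} \neq 0 \,\}
```

In the Gaussian case this set coincides with the support of the coefficient vector in the linear regression of $Y$ on all of $X$. No such closed form is available for the non-linear, mixed-type setting the paper addresses.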

arXiv.org e-Print Archive

Kernel Distribution Embeddings: Universal Kernels, Characteristic Kernels and Kernel Metrics on Distributions

Authors: Schölkopf Bernhard; Simon-Gabriel Carl-Johann
Publication date: 01/01/2018

Kernel mean embeddings have recently attracted the attention of the machine learning community. They map measures $\mu$ from some set $M$ to functions in a reproducing kernel Hilbert space (RKHS) with kernel $k$. The RKHS distance of two mapped measures is a semi-metric $d_k$ over $M$. We study three questions. (I) For a given kernel, what sets $M$ can be embedded? (II) When is the embedding injective over $M$ (in which case $d_k$ is a metric)? (III) How does the $d_k$-induced topology compare to other topologies on $M$? The existing machine learning literature has addressed these questions in cases where $M$ is (a subset of) the finite regular Borel measures. We unify, improve and generalise those results. Our approach naturally leads to continuous and possibly even injective embeddings of (Schwartz-) distributions, i.e., generalised measures, but the reader is free to focus on measures only. In particular, we systemise and extend various (partly known) equivalences between different notions of universal, characteristic and strictly positive definite kernels, and show that on an underlying locally compact Hausdorff space, $d_k$ metrises the weak convergence of probability measures if and only if $k$ is continuous and characteristic.

Comment: Old and longer version of the JMLR paper with the same title (published 2018). Please start with the JMLR version. 55 pages (33 pages main text, 22 pages appendix), 2 tables, 1 figure (in appendix)
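
To make the semi-metric $d_k$ concrete, here is a minimal sketch that estimates its square from two finite samples. It is an illustration and not code from the paper. The function names `gaussian_kernel` and `mmd_squared`, the choice of a Gaussian kernel on real numbers, and the bandwidth default are all assumptions made here. The estimator is the plain (biased) plug-in version, obtained by replacing each measure with its empirical counterpart.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on real numbers; an illustrative choice of k."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Plug-in estimate of d_k(P, Q)^2 from samples xs ~ P and ys ~ Q.

    Uses the expansion
        d_k^2 = E k(X, X') - 2 E k(X, Y) + E k(Y, Y'),
    with each expectation replaced by an average over all sample pairs.
    """
    def mean_kernel(a, b):
        # Average kernel value over every pair (u, v) with u in a, v in b.
        return sum(kernel(u, v) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


if __name__ == "__main__":
    near_zero = [0.0, 0.1, 0.2]
    near_three = [3.0, 3.1, 3.2]
    print(mmd_squared(near_zero, near_zero))   # identical samples
    print(mmd_squared(near_zero, near_three))  # well-separated samples
```

Identical samples give an estimate of zero up to rounding, while well-separated samples give a clearly positive value. The Gaussian kernel is a commonly cited example of a characteristic kernel, which is the case in which $d_k$ is a metric on probability measures rather than only a semi-metric.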

arXiv.org e-Print Archive

Publikationsserver der Universität Tübingen

MPG.PuRe

A Kernel Multiple Change-point Algorithm via Model Selection

Authors: Arlot Sylvain; Celisse Alain; Harchaoui Zaid
Publication venue: Microtome Publishing
Publication date: 14/03/2019

We tackle the change-point problem with data belonging to a general set. We build a penalty for choosing the number of change-points in the kernel-based method of Harchaoui and Cappé (2007). This penalty generalizes the one proposed by Lebarbier (2005) for one-dimensional signals. We prove a non-asymptotic oracle inequality for the proposed method, thanks to a new concentration result for some function of Hilbert-space valued random variables. Experiments on synthetic data illustrate the accuracy of our method, showing that it can detect changes in the whole distribution of data, even when the mean and variance are constant.
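
The kernel least-squares criterion that underlies this family of methods can be sketched as follows. This is a simplified illustration, not the algorithm of the paper: it searches for a single change-point by brute force and does not implement the penalty for choosing the number of change-points. The function names, the Gaussian kernel on real numbers, and the bandwidth default are assumptions made here.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on real numbers; an illustrative choice."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def segment_cost(xs, start, end, kernel=gaussian_kernel):
    """Kernel least-squares cost of the segment xs[start:end].

    This equals the sum of squared RKHS distances between each embedded
    point and the empirical mean embedding of the segment:
        sum_i k(x_i, x_i) - (1 / n) * sum_{i, j} k(x_i, x_j).
    """
    n = end - start
    diag = sum(kernel(xs[i], xs[i]) for i in range(start, end))
    gram_sum = sum(
        kernel(xs[i], xs[j])
        for i in range(start, end)
        for j in range(start, end)
    )
    return diag - gram_sum / n


def best_single_change_point(xs, kernel=gaussian_kernel):
    """Return the split index t minimising cost(xs[:t]) + cost(xs[t:]).

    Brute-force search over all splits with two non-empty segments.
    """
    n = len(xs)
    return min(
        range(1, n),
        key=lambda t: segment_cost(xs, 0, t, kernel)
        + segment_cost(xs, t, n, kernel),
    )


if __name__ == "__main__":
    signal = [0.0] * 20 + [5.0] * 20
    print(best_single_change_point(signal))
```

Because the cost depends on the data only through kernel evaluations, the same code applies to any data type for which a kernel is available, which matches the "general set" setting of the abstract. Handling several change-points would require dynamic programming over segmentations together with a penalty term; both are omitted here.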

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

HAL-Rennes 1

GRAPHICAL MODELS FOR HIGH DIMENSIONAL DATA WITH GENOMIC APPLICATIONS

Author: Yang Jenny
Publication venue: University of North Carolina at Chapel Hill Graduate School
Publication date: 01/01/2017

Many previous studies have demonstrated that gene expression or other types of -omic features collected from patients can aid disease diagnosis or treatment selection. For example, a few recent studies demonstrated that gene expression data collected from cancer cell lines are highly informative for predicting cancer drug sensitivity (Garnett et al., 2012; Barretina et al., 2012; Chen et al., 2016b). This is partly because many cancer drugs are targeted drugs that perturb a particular mutated gene or protein, and thus having that mutation, or observing the consequence of such a mutation in gene expression data, is highly informative for drug sensitivity prediction. Such systematic studies of drug sensitivities require giving different drugs in a series of doses to the same cell line, which is obviously not possible in human studies. More sophisticated methods are needed to estimate the potential effects of cancer drugs based on observational data. Since the effect of a targeted cancer drug can be considered as an intervention on the molecular system of cancer cells, a directed graphical model for gene-gene associations is a natural choice to model the molecular system and to study the consequences of such interventions. In this dissertation, we develop new statistical methods to estimate DAGs using high dimensional -omic data under two scenarios: i) with a model-free approach and ii) with single cell RNA-seq (scRNAseq) data. In Chapter 1, we give a brief introduction to graphical models, the various statistical characterizations of graphical models and the most current approaches to estimating graph structures. We then review scRNAseq data and current approaches to analyzing them. Next, in Chapter 2, we propose a model-free method to estimate graphical models in two steps. The first step uses a model-free variable selection method based on the principles of sufficient dimension reduction. The second step uses a non-parametric conditional independence testing method which utilizes embeddings of the conditional spaces into reproducing kernel Hilbert spaces. We review some theoretical background in order to establish the asymptotic graphical model estimation consistency of this two-step approach. We examine its performance in simulations and on TCGA breast cancer data, where we find significant improvements over current methods that require strong model assumptions. In Chapter 3, we propose a graphical model algorithm to analyze scRNAseq data. Similar to the previous algorithm, we create a two-step estimation method which utilizes a joint penalized zero-inflation model. We assess its performance and drawbacks in simulations. We then examine its utility when applied after clustering to a sample of 68k peripheral blood mononuclear cells with multiple subpopulations.

Doctor of Philosophy

Carolina Digital Repository

Feature Extraction and Selection in Automatic Sleep Stage Classification

Author: Najdi Shirin
Publication date: 01/01/2018

Sleep stage classification is vital for diagnosing many sleep-related disorders, and polysomnography (PSG) is an important tool in this regard. The visual process of sleep stage classification is time-consuming, subjective and costly. To improve the accuracy and efficiency of sleep stage classification, researchers have been trying to develop automatic classification algorithms. Automatic sleep stage classification mainly consists of three steps: pre-processing, feature extraction and classification. In this research work, we focused on the feature extraction and selection steps. The main goal of this thesis was to identify a robust and reliable feature set that can lead to efficient classification of sleep stages. To achieve this goal, three types of contributions were introduced: in feature selection, feature extraction and feature vector quality enhancement. Several feature ranking and rank aggregation methods were evaluated and compared to find the best feature set. Evaluation results indicated that the choice of feature selection method depends on the system design requirements, such as low computational complexity, high stability or high classification accuracy. In addition to conventional feature ranking methods, novel methods such as the Stacked Sparse AutoEncoder (SSAE) were used in this thesis for dimensionality reduction. In the feature extraction area, new and effective features such as distance-based features were utilized for the first time in sleep stage classification. The results showed that these features contribute positively to the classification performance. For signal quality enhancement, a loss-less EEG artefact removal algorithm was proposed. The proposed adaptive algorithm led to a significant enhancement in the overall classification accuracy.

Repositório da Universidade Nova de Lisboa