The fused Kolmogorov filter: A nonparametric model-free screening method

Authors: Mai Qing; Zou Hui
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 30/07/2015

A new model-free screening method called the fused Kolmogorov filter is proposed for high-dimensional data analysis. The method is fully nonparametric and works with many types of covariates and response variables, including continuous, discrete and categorical variables. We apply the fused Kolmogorov filter to variable screening problems arising in a wide range of applications, such as multiclass classification, nonparametric regression and Poisson regression, among others. We show that the fused Kolmogorov filter enjoys the sure screening property under weak regularity conditions that are much milder than those required by many existing nonparametric screening methods. In particular, the fused Kolmogorov filter remains powerful when covariates are strongly dependent on each other. We further demonstrate its superior performance over existing screening methods through simulations and real data examples. Comment: Published at http://dx.doi.org/10.1214/14-AOS1303 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).

arXiv.org e-Print Archive

CiteSeerX
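
The abstract above does not spell out the statistic, so the following is only a rough sketch of how a Kolmogorov-type screening score could be computed, not the authors' implementation. It assumes the response is cut into G roughly equal slices by rank, that each covariate is scored by the largest two-sample Kolmogorov-Smirnov distance between any pair of slices, and that the scores from several slicing schemes are summed ("fused"). The function names and the default `slice_counts` are hypothetical choices made for illustration.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov distance: sup_x |F_a(x) - F_b(x)|."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the supremum is attained at a data point
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def fused_kolmogorov_scores(X, y, slice_counts=(3, 4, 5)):
    """Sum, over several slicing schemes, of the max pairwise KS distance.

    Assumption: slices are formed from the ranks of y; ties are broken
    arbitrarily, so a categorical response would need its own slicing rule.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    ranks = np.argsort(np.argsort(y))  # ranks 0 .. n-1
    scores = np.zeros(p)
    for G in slice_counts:
        labels = (ranks * G) // n  # slice index 0 .. G-1
        for j in range(p):
            groups = [X[labels == g, j] for g in range(G)]
            best = 0.0
            for l in range(G):
                for m in range(l + 1, G):
                    best = max(best, ks_statistic(groups[l], groups[m]))
            scores[j] += best
    return scores
```

Covariates would then be ranked by score and the top few retained for a downstream model; how many to keep is a separate tuning decision that this sketch does not address.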

Robust distance correlation for variable screening

Authors: Ke Hongjie; Ma Tianzhou; Ren Zhao
Publication date: 26/12/2022

High-dimensional data are common in modern statistical applications, and variable selection methods play an indispensable role in identifying the critical features for scientific discovery. Traditional best subset selection methods are computationally intractable with a large number of features, while regularization methods such as Lasso, SCAD and their variants perform poorly on ultrahigh-dimensional data because of low computational efficiency and algorithmic instability. Sure screening methods have become popular alternatives: they first rapidly reduce the dimension using simple measures such as marginal correlation and then apply a regularization method. A number of screening methods for different models or problems have been developed; however, none has targeted heavy-tailed data, and heavy-tailedness is another important characteristic of modern big data. In this paper, we propose a robust distance correlation (RDC) based sure screening method for ultrahigh-dimensional regression with heavy-tailed data. The proposed method shares the good properties of the original model-free distance correlation based screening while having the additional merit of robustly estimating the distance correlation when the data are heavy-tailed, which improves model selection performance in screening. We conducted extensive simulations under different scenarios of heavy-tailedness to demonstrate the advantage of the proposed procedure over existing model-based and model-free screening procedures, with improved feature selection and prediction performance. We also applied the method to high-dimensional heavy-tailed RNA sequencing (RNA-seq) data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort, where RDC outperformed the other methods in prioritizing the most essential and biologically meaningful genes.

arXiv.org e-Print Archive
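
For orientation, the original (non-robust) distance correlation screening that the abstract above builds on can be sketched as follows. This is the classical sample distance correlation for univariate inputs, not the robust RDC estimator proposed in the paper, whose details are not given in the abstract. The function names and the cutoff `d` are hypothetical.

```python
import numpy as np


def _double_centered_distances(v):
    """Pairwise |v_i - v_k| matrix with row, column and grand means removed."""
    v = np.asarray(v, dtype=float)
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Classical (biased) sample distance correlation of two 1-D samples."""
    A = _double_centered_distances(x)
    B = _double_centered_distances(y)
    dcov2 = (A * B).mean()  # squared sample distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0.0:
        return 0.0  # a constant sample carries no dependence information
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def dc_screen(X, y, d):
    """Indices of the d covariates with the largest distance correlation to y."""
    X = np.asarray(X, dtype=float)
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-scores)[:d]
```

Because the statistic is built from raw pairwise distances, a few extreme observations can dominate it, which is the sensitivity to heavy tails that motivates a robust variant.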

Are the dimensions of private information more multiple than expected? Information asymmetries in the market of supplementary private health insurance in England

Authors: Karlsson M.; Klohn F.; Rickayzen B. D.
Publication venue: Faculty of Actuarial Science & Insurance, City University London
Publication date: 01/01/2012

Our study reexamines standard econometric approaches for detecting information asymmetries in insurance markets. We claim that evidence based on a standard two-equation framework, which uses potential sources of information asymmetries, should stress the importance of heterogeneity in the parameters. We argue that conclusions derived from this methodology can be misleading if the estimated coefficients in such an 'unused characteristics' framework are driven by different parts of the population. We show formally that an individual's expected risk from the perspective of insurance, conditioned on certain characteristics (which are not used for calculating the risk premium), can equal the population's expected risk, even though such characteristics are related to both risk and insurance probability, which is usually interpreted as an indicator of information asymmetries. We provide empirical evidence on the existence of information asymmetries in the market for supplementary private health insurance in the UK. Overall, we found evidence of advantageous selection into the private risk pool; i.e., people with lower health risk tend to insure more. The main drivers of this phenomenon seem to be characteristics such as income and wealth. Nevertheless, we also found parameter heterogeneity to be relevant, leading to possible misinterpretation if the standard 'unused characteristics' approach is applied.

High-dimensional Variable Screening via Conditional Martingale Difference Divergence

Authors: Fang Lei; Ye Chenglong; Yin Xiangrong; Yuan Qingcong
Publication date: 27/10/2022

Variable screening is a useful research area for dealing with ultrahigh-dimensional data. When some predictors are marginally dependent on the response and others only jointly dependent, existing methods such as conditional screening or iterative screening often suffer from instability with respect to the choice of the conditional set or from a heavy computational burden, respectively. In this article, we propose a new independence measure, named conditional martingale difference divergence (CMDH), that can be treated as either a conditional or a marginal independence measure. Under regularity conditions, we show that the sure screening property of CMDH holds for both marginally and jointly active variables. Based on this measure, we propose a kernel-based model-free variable screening method that is efficient, flexible, and stable against high correlation among predictors and heterogeneity of the response. In addition, we provide a data-driven method to select the conditional set. In simulations and real data applications, we demonstrate the superior performance of the proposed method.

arXiv.org e-Print Archive