Grouped feature screening for ultrahigh-dimensional classification via Gini distance correlation
Gini distance correlation (GDC) was recently proposed to measure the
dependence between a categorical variable, Y, and a numerical random vector, X.
It characterizes independence: the GDC is zero if and only if X and Y are independent. In this article, we
utilize the GDC to establish a feature screening for ultrahigh-dimensional
discriminant analysis where the response variable is categorical. It can be
used for screening individual features as well as grouped features. The
proposed procedure possesses several appealing properties. It is model-free: no
model specification is needed. It enjoys the sure independence screening
property and the ranking consistency property. The proposed screening method
can also handle a response with a diverging number of
categories. We conduct several Monte Carlo simulation studies to examine the
finite sample performance of the proposed screening procedure. Real data
analyses of two real-life datasets illustrate the proposed method.
Comment: 25 pages, 1 figure
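As a rough illustration of this kind of marginal screening, the sketch below ranks features by a Gini-distance-style statistic. It assumes the form gCor(X, Y) = Σ_k p_k (Δ − Δ_k) / Δ, where Δ is the overall Gini mean distance E|X − X'| and Δ_k its within-category analogue; the authors' actual estimator and the grouped-feature extension may differ in detail.

```python
import numpy as np

def gini_mean_distance(x):
    """U-statistic estimate of the Gini mean distance E|X - X'| for a 1-D sample."""
    n = len(x)
    d = np.abs(x[:, None] - x[None, :])
    return d.sum() / (n * (n - 1))

def gini_distance_correlation(x, y):
    """Sketch of gCor(X, Y) for a numeric feature x and categorical labels y,
    assuming the form gCor = sum_k p_k (Delta - Delta_k) / Delta."""
    delta = gini_mean_distance(x)
    if delta == 0:
        return 0.0
    num = 0.0
    for k in np.unique(y):
        xk = x[y == k]
        if len(xk) < 2:
            continue
        num += (len(xk) / len(x)) * (delta - gini_mean_distance(xk))
    return num / delta

def screen_features(X, y, d):
    """Rank features by gCor and keep the top d (sure independence screening)."""
    scores = np.array([gini_distance_correlation(X[:, j], y)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d]

# Toy example: only feature 0 depends on the class label.
rng = np.random.default_rng(0)
n, p = 200, 50
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, p))
X[:, 0] += 2.0 * y
kept = screen_features(X, y, d=5)
```

Features that are independent of Y have gCor near zero, so the informative feature should survive the screening step.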
Robust distance correlation for variable screening
High-dimensional data are commonly seen in modern statistical applications, and
variable selection methods play an indispensable role in identifying the critical
features for scientific discovery. Traditional best subset selection methods
are computationally intractable with a large number of features, while
regularization methods such as the Lasso, SCAD, and their variants perform poorly on
ultrahigh-dimensional data due to low computational efficiency and unstable
algorithms. Sure screening methods have become popular alternatives by first
rapidly reducing the dimension using simple measures such as marginal
correlation and then applying a regularization method. A number of screening
methods for different models or problems have been developed; however, none of
them targets data with heavy tails, which is another
important characteristic of modern big data. In this paper, we propose a
robust distance correlation (``RDC'') based sure screening method to perform
screening in ultrahigh-dimensional regression with heavy-tailed data. The
proposed method shares the same good properties as the original model-free
distance-correlation-based screening, while having the additional merit of robustly
estimating the distance correlation when the data are heavy-tailed, which improves
model selection performance in screening. We conducted extensive simulations
under different scenarios of heavy tailedness to demonstrate the advantage of
our proposed procedure as compared to other existing model-based or model-free
screening procedures with improved feature selection and prediction
performance. We also applied the method to high-dimensional heavy-tailed RNA
sequencing (RNA-seq) data of The Cancer Genome Atlas (TCGA) pancreatic cancer
cohort, where RDC outperformed the other methods in prioritizing the
most essential and biologically meaningful genes.
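For intuition, the standard (non-robust) sample distance correlation of Székely and Rizzo, which underlies this family of screening methods, can be computed from double-centered pairwise distance matrices; the paper's robust variant modifies this estimator, so the sketch below shows only the baseline quantity.

```python
import numpy as np

def distance_correlation(x, y):
    """Standard sample distance correlation for 1-D arrays x and y.
    This is the plain Szekely-Rizzo V-statistic; the RDC of the paper
    replaces it with a robust estimate for heavy-tailed data."""
    def double_centered(a):
        d = np.abs(a[:, None] - a[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = double_centered(x), double_centered(y)
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

# Distance correlation detects nonlinear dependence that Pearson
# correlation misses (here y depends on x only through x**2).
rng = np.random.default_rng(1)
n = 300
x = rng.standard_normal(n)
y_dep = x**2 + 0.1 * rng.standard_normal(n)   # nonlinear dependence on x
y_ind = rng.standard_normal(n)                # independent of x
dep_score = distance_correlation(x, y_dep)
ind_score = distance_correlation(x, y_ind)
```

In a screening context, each candidate feature would be ranked by its distance correlation with the response, and only the top-ranked features kept.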
New developments of dimension reduction
Variable selection has become more crucial than before, since high-dimensional data are frequently seen in many research areas. Many model-based variable selection methods have been developed; however, their performance may be poor when the model is misspecified. Sufficient dimension reduction (SDR; Li, 1991; Cook, 1998) provides a general framework for model-free variable selection methods.
In this thesis, we first propose a novel model-free variable selection method to deal with multi-population data by incorporating the grouping information. Theoretical properties of our proposed method are also presented. Simulation studies show that our new method significantly improves the selection performance compared with methods that ignore the grouping information. In the second part of this dissertation, we apply a partial SDR method to conduct conditional model-free variable (feature) screening for ultrahigh-dimensional data, when researchers have prior information regarding the importance of certain predictors based on experience or previous investigations. Compared with the state-of-the-art conditional screening method, conditional sure independence screening (CSIS; Barut, Fan, and Verhasselt, 2016), our method performs substantially better for nonlinear models. The sure screening consistency property of our proposed method is also established. --Abstract, page iv
Feature screening for ultrahigh-dimensional binary classification via linear projection
Linear discriminant analysis (LDA) is one of the most widely used methods in discriminant classification and pattern recognition. However, with the rapid development of information science and technology, collected data are often high- or ultrahigh-dimensional, which causes LDA to fail. To address this issue, a feature screening procedure based on Fisher's linear projection and a marginal score test is proposed for the ultrahigh-dimensional binary classification problem. The sure screening property is established to ensure that the important features are retained and the irrelevant predictors are eliminated. The finite sample properties of the proposed procedure are assessed by Monte Carlo simulation studies and a real-life data example.
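As a simplified stand-in for such marginal screening in binary classification, one can rank features by the absolute two-sample t-statistic between the classes and keep the top d; the paper's marginal score test and Fisher-projection step differ from this, but the screening logic is the same.

```python
import numpy as np

def marginal_t_screen(X, y, d):
    """Rank features by |two-sample t-statistic| between classes y==0 and
    y==1 and keep the top d. A simplified sketch; the paper's procedure
    uses a marginal score test combined with Fisher's linear projection."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    se = np.sqrt(X0.var(axis=0, ddof=1) / n0 + X1.var(axis=0, ddof=1) / n1)
    t = np.abs(X0.mean(axis=0) - X1.mean(axis=0)) / np.maximum(se, 1e-12)
    return np.argsort(t)[::-1][:d]

# Ultrahigh-dimensional toy data: only the first three of 1000 features
# actually separate the two classes.
rng = np.random.default_rng(2)
n, p = 150, 1000
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, p))
X[:, :3] += 1.5 * y[:, None]
kept = marginal_t_screen(X, y, d=10)
```

After screening, a classifier such as LDA can be fit on the retained features, where it is no longer degenerate.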
Feature Augmentation via Nonparametrics and Selection (FANS) in High Dimensional Classification
We propose a high dimensional classification method that involves
nonparametric feature augmentation. Knowing that marginal density ratios are
the most powerful univariate classifiers, we use the ratio estimates to
transform the original feature measurements. Subsequently, penalized logistic
regression is invoked, taking as input the newly transformed or augmented
features. This procedure trains models equipped with local complexity and
global simplicity, thereby avoiding the curse of dimensionality while creating
a flexible nonlinear decision boundary. The resulting method is called Feature
Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by
generalizing the Naive Bayes model, writing the log ratio of joint densities as
a linear combination of those of marginal densities. It is related to
generalized additive models, but has better interpretability and computability.
Risk bounds are developed for FANS. In numerical analysis, FANS is compared
with competing methods, so as to provide a guideline on its best application
domain. Real data analysis demonstrates that FANS performs very competitively
on benchmark email spam and gene expression data sets. Moreover, FANS is
implemented by an extremely fast algorithm through parallel computing.
Comment: 30 pages, 2 figures
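The core FANS transform can be sketched as follows: estimate each feature's class-conditional marginal densities (here with a simple Gaussian kernel density estimate and Silverman's bandwidth, an assumption for illustration) and replace the feature by its estimated log density ratio. The subsequent penalized logistic regression step is omitted.

```python
import numpy as np

def kde_logpdf(train, query, eps=1e-12):
    """Gaussian KDE log-density with Silverman's rule-of-thumb bandwidth."""
    h = 1.06 * train.std(ddof=1) * len(train) ** (-0.2)
    z = (query[:, None] - train[None, :]) / h
    dens = np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    return np.log(dens + eps)

def fans_transform(X_train, y_train, X):
    """Replace each feature by its estimated marginal log density ratio
    log f1_j(x) - log f0_j(x); penalized logistic regression would then
    be fit on the transformed (augmented) features."""
    Z = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        x0 = X_train[y_train == 0, j]
        x1 = X_train[y_train == 1, j]
        Z[:, j] = kde_logpdf(x1, X[:, j]) - kde_logpdf(x0, X[:, j])
    return Z

# Toy data where the classes differ only in variance: a linear rule on the
# raw features fails, but the density-ratio transform recovers the signal.
rng = np.random.default_rng(3)
n = 400
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, 2)) * np.where(y[:, None] == 1, 3.0, 1.0)
Z = fans_transform(X, y, X)
corr = np.corrcoef(Z[:, 0], y)[0, 1]
```

Because the log density ratio is the most powerful univariate classifier, the transformed features carry the marginal class information in a form a linear model can use.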
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On one hand, Big Data hold great promises for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottleneck, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require a new computational and statistical paradigm. This
article gives an overview of the salient features of Big Data and how these
features drive paradigm changes in statistical and computational methods as
well as computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the
viability of the sparsest solution in a high-confidence set and point out that
the exogeneity assumptions in most statistical methods for Big Data cannot be
validated due to incidental endogeneity. They can lead to wrong statistical
inferences and, consequently, wrong scientific conclusions.
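The spurious-correlation phenomenon mentioned above is easy to demonstrate: with a modest sample size and many independent noise features, the maximum absolute sample correlation with an unrelated response can be large purely by chance.

```python
import numpy as np

# Spurious correlation under high dimensionality: every feature below is
# independent of y, yet the maximum sample correlation over many features
# is far from zero, simply because we searched over so many of them.
rng = np.random.default_rng(4)
n = 50
y = rng.standard_normal(n)
X = rng.standard_normal((n, 10000))       # pure noise, independent of y

Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = np.abs(Xc.T @ yc) / np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())

small = r[:10].max()     # best of only 10 candidate features
large = r.max()          # best of all 10,000 candidate features
```

The gap between `small` and `large` illustrates why naive marginal selection over huge feature sets produces false discoveries unless the multiplicity is accounted for.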
Rank discriminants for predicting phenotypes from RNA expression
Statistical methods for analyzing large-scale biomolecular data are
commonplace in computational biology. A notable example is phenotype prediction
from gene expression data, for instance, detecting human cancers,
differentiating subtypes and predicting clinical outcomes. Still, clinical
applications remain scarce. One reason is that the complexity of the decision
rules that emerge from standard statistical learning impedes biological
understanding, in particular, any mechanistic interpretation. Here we explore
decision rules for binary classification utilizing only the ordering of
expression among several genes; the basic building blocks are then two-gene
expression comparisons. The simplest example, just one comparison, is the TSP
classifier, which has appeared in a variety of cancer-related discovery
studies. Decision rules based on multiple comparisons can better accommodate
class heterogeneity, and thereby increase accuracy, and might provide a link
with biological mechanism. We consider a general framework ("rank-in-context")
for designing discriminant functions, including a data-driven selection of the
number and identity of the genes in the support ("context"). We then specialize
to two examples: voting among several pairs and comparing the median expression
in two groups of genes. Comprehensive experiments assess accuracy relative to
other, more complex, methods, and reinforce earlier observations that simple
classifiers are competitive.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS738 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
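The simplest member of this family, the Top Scoring Pair (TSP) classifier, can be sketched directly: choose the gene pair whose ordering probability differs most between classes, and classify a sample by which gene is larger. This is an illustrative sketch; the paper's rank-in-context framework generalizes it to multiple comparisons and data-driven contexts.

```python
import numpy as np

def fit_tsp(X, y):
    """Top Scoring Pair: pick the gene pair (i, j) maximizing
    |P(Xi < Xj | y=1) - P(Xi < Xj | y=0)|, using only expression orderings."""
    less = X[:, :, None] < X[:, None, :]     # less[s, i, j] = (X[s,i] < X[s,j])
    p1 = less[y == 1].mean(axis=0)           # per-pair ordering freq in class 1
    p0 = less[y == 0].mean(axis=0)           # per-pair ordering freq in class 0
    score = np.abs(p1 - p0)
    i, j = np.unravel_index(score.argmax(), score.shape)
    # Predict class 1 for the ordering that is more frequent in class 1.
    return i, j, bool(p1[i, j] > p0[i, j])

def predict_tsp(X, pair):
    i, j, sign = pair
    return ((X[:, i] < X[:, j]) == sign).astype(int)

# Toy expression data: gene 0 is up in class 1, gene 1 is up in class 0,
# so the ordering of this single pair is highly informative.
rng = np.random.default_rng(5)
n, p = 200, 20
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, p))
X[:, 0] += 2.0 * y
X[:, 1] += 2.0 * (1 - y)
pair = fit_tsp(X, y)
acc = (predict_tsp(X, pair) == y).mean()
```

Because the rule depends only on which of two genes is expressed higher, it is invariant to monotone normalization of each sample and directly interpretable as a biological switch.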