Robust causal structure learning with some hidden variables
We introduce a new method to estimate the Markov equivalence class of a
directed acyclic graph (DAG) in the presence of hidden variables, in settings
where the underlying DAG among the observed variables is sparse, and there are
a few hidden variables that have a direct effect on many of the observed ones.
Building on the so-called low rank plus sparse framework, we suggest a
two-stage approach which first removes the effect of the hidden variables, and
then estimates the Markov equivalence class of the underlying DAG under the
assumption that there are no remaining hidden variables. This approach is
consistent in certain high-dimensional regimes and performs favourably when
compared to the state of the art, both in terms of graphical structure recovery
and total causal effect estimation.
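The first stage of the two-stage approach can be illustrated with a minimal sketch: a few pervasive hidden variables that affect many observed ones induce an approximately low-rank component, which can be approximated by the top principal components and projected out before structure learning. This is an illustrative stand-in for the low-rank-plus-sparse decomposition, not the paper's exact estimator; the function name `deconfound_top_pcs` and the choice `num_hidden=2` are assumptions for the example.

```python
import numpy as np

def deconfound_top_pcs(X, num_hidden=2):
    """Stage 1 sketch: approximate the dense effect of a few pervasive
    hidden variables by the top principal components and remove it.
    (Illustrative stand-in for a low-rank-plus-sparse decomposition;
    `num_hidden` is an assumed, user-chosen rank.)"""
    Xc = X - X.mean(axis=0)                      # center the columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rank-`num_hidden` component attributed to the hidden variables
    low_rank = (U[:, :num_hidden] * s[:num_hidden]) @ Vt[:num_hidden, :]
    return Xc - low_rank                         # deconfounded data

# Stage 2 would run a standard structure learner (e.g. the PC algorithm)
# on the deconfounded data, assuming no remaining hidden variables.
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 2))                    # two pervasive hiddens
X = H @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(500, 20))
X_clean = deconfound_top_pcs(X, num_hidden=2)
print(X_clean.shape)                             # (500, 20)
```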
Sure Screening for Transelliptical Graphical Models
We propose a sure screening approach for recovering the structure of a
transelliptical graphical model in the high dimensional setting. We estimate
the partial correlation graph by thresholding the elements of an estimator of
the sample correlation matrix obtained using Kendall's tau statistic. Under a
simple assumption on the relationship between the correlation and partial
correlation graphs, we show that with high probability, the estimated edge set
contains the true edge set, and the size of the estimated edge set is
controlled. We develop a threshold value that allows for control of the
expected false positive rate. In simulation and on an equities data set, we
show that transelliptical graphical sure screening performs quite competitively
with more computationally demanding techniques for graph estimation.
Comment: The paper won the David Byar travel award at the Joint Statistical Meetings (JSM) 201
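The screening step described above can be sketched directly: estimate the correlation matrix robustly via the sine transform of Kendall's tau, then threshold its entries to obtain a candidate edge set. The function name `screen_edges` and the threshold value are assumptions for the example; the paper additionally derives a principled threshold controlling the expected false positive rate.

```python
import numpy as np
from scipy.stats import kendalltau

def screen_edges(X, threshold):
    """Sketch of correlation-thresholding screening: estimate the
    correlation matrix via the sine transform of Kendall's tau
    (consistent under transelliptical models, robust to heavy tails)
    and keep edges whose absolute correlation exceeds `threshold`
    (an assumed tuning value here)."""
    n, d = X.shape
    R = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            tau, _ = kendalltau(X[:, j], X[:, k])
            R[j, k] = R[k, j] = np.sin(np.pi / 2 * tau)  # sine transform
    # Screened edge set: pairs with large estimated correlation
    edges = [(j, k) for j in range(d) for k in range(j + 1, d)
             if abs(R[j, k]) > threshold]
    return R, edges

rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 4))
Z[:, 1] = Z[:, 0] + 0.3 * rng.normal(size=200)   # one strongly correlated pair
R, edges = screen_edges(Z, threshold=0.5)
print(edges)                                     # the planted pair should survive
```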
ECA: High Dimensional Elliptical Component Analysis in non-Gaussian Distributions
We present a robust alternative to principal component analysis (PCA) ---
called elliptical component analysis (ECA) --- for analyzing high dimensional,
elliptically distributed data. ECA estimates the eigenspace of the covariance
matrix of the elliptical data. To cope with heavy-tailed elliptical
distributions, a multivariate rank statistic is exploited. At the model-level,
we consider two settings: either that the leading eigenvectors of the
covariance matrix are non-sparse or that they are sparse. Methodologically, we
propose ECA procedures for both non-sparse and sparse settings. Theoretically,
we provide both non-asymptotic and asymptotic analyses quantifying the
theoretical performances of ECA. In the non-sparse setting, we show that ECA's
performance is highly related to the effective rank of the covariance matrix.
In the sparse setting, the results are twofold: (i) We show that the sparse ECA
estimator based on a combinatoric program attains the optimal rate of
convergence; (ii) Based on some recent developments in estimating sparse
leading eigenvectors, we show that a computationally efficient sparse ECA
estimator attains the optimal rate of convergence under a suboptimal scaling.
Comment: to appear in JASA (T&M)
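The multivariate rank statistic underlying ECA can be illustrated with a short sketch: the sample multivariate Kendall's tau averages outer products of normalized pairwise differences, and under an elliptical model its eigenvectors coincide with those of the covariance matrix while remaining stable under heavy tails. The function name and the simulated heavy-tailed design are assumptions for the example, not the paper's exact procedure.

```python
import numpy as np

def multivariate_kendalls_tau(X):
    """Sketch of a multivariate rank statistic: the sample multivariate
    Kendall's tau, an average of outer products of normalized pairwise
    differences. Under elliptical models its eigenvectors match those
    of the covariance matrix, but it is robust to heavy tails."""
    n, d = X.shape
    K = np.zeros((d, d))
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = X[i] - X[j]
            nrm = np.linalg.norm(diff)
            if nrm > 0:
                K += np.outer(diff, diff) / nrm**2  # self-normalized pair
                count += 1
    return K / count

rng = np.random.default_rng(2)
# Heavy-tailed elliptical data: Gaussian scaled by a random radius,
# with coordinate 0 given much larger scale than the rest
scale = 1.0 / np.sqrt(rng.chisquare(3, size=(300, 1)) / 3)
X = scale * rng.normal(size=(300, 5)) * np.array([3.0, 1.0, 1.0, 1.0, 1.0])
K = multivariate_kendalls_tau(X)
evals, evecs = np.linalg.eigh(K)
lead = evecs[:, -1]                 # estimated leading eigenvector
print(np.argmax(np.abs(lead)))      # the high-variance coordinate
```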
Optimal computational and statistical rates of convergence for sparse nonconvex learning problems
We provide theoretical analysis of the statistical and computational
properties of penalized M-estimators that can be formulated as the solution
to a possibly nonconvex optimization problem. Many important estimators fall in
this category, including least squares regression with nonconvex
regularization, generalized linear models with nonconvex regularization and
sparse elliptical random design regression. For these problems, it is
intractable to calculate the global solution due to the nonconvex formulation.
In this paper, we propose an approximate regularization path-following method
for solving a variety of learning problems with nonconvex objective functions.
Under a unified analytic framework, we simultaneously provide explicit
statistical and computational rates of convergence for any local solution
attained by the algorithm. Computationally, our algorithm attains a global
geometric rate of convergence for calculating the full regularization path,
which is optimal among all first-order algorithms. Unlike most existing methods
that only attain geometric rates of convergence for one single regularization
parameter, our algorithm calculates the full regularization path with the same
iteration complexity. In particular, we provide a refined iteration complexity
bound to sharply characterize the performance of each stage along the
regularization path. Statistically, we provide sharp sample complexity analysis
for all the approximate local solutions along the regularization path. In
particular, our analysis improves upon existing results by providing a more
refined sample complexity bound as well as an exact support recovery result for
the final estimator. These results show that the final estimator attains an
oracle statistical property due to the use of the nonconvex penalty.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1238 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
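The multi-stage path-following scheme can be sketched in a few lines: solve a sequence of penalized problems at geometrically decreasing regularization levels, warm-starting each stage at the previous solution. For clarity the sketch uses the convex l1 penalty with proximal gradient steps; the paper's algorithm applies the same multi-stage idea to nonconvex penalties such as SCAD or MCP, and the function names and stage counts here are assumptions for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def path_following_lasso(X, y, n_stages=10, eta=0.7, n_iters=50):
    """Sketch of approximate regularization path-following with warm
    starts: a sequence of penalized least-squares problems at
    geometrically decreasing lambda, each initialized at the previous
    stage's solution. (l1 penalty shown for clarity; the paper's method
    handles nonconvex penalties with the same multi-stage scheme.)"""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n       # Lipschitz constant of the gradient
    lam = np.max(np.abs(X.T @ y)) / n       # lambda_max: zero solution is optimal
    beta = np.zeros(d)
    path = []
    for _ in range(n_stages):
        lam *= eta                          # geometric decrease along the path
        for _ in range(n_iters):            # proximal gradient at this stage
            grad = X.T @ (X @ beta - y) / n
            beta = soft_threshold(beta - grad / L, lam / L)
        path.append((lam, beta.copy()))     # warm start carries to next stage
    return path

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=100)
path = path_following_lasso(X, y)
print(np.count_nonzero(path[-1][1]))        # sparse final estimate
```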
Large-Scale Nonparametric and Semiparametric Inference for Large, Complex, and Noisy Datasets
Massive Data bring new opportunities and challenges to data scientists and statisticians. On the one hand, Massive Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the size and dimensionality of Massive Data introduce unique statistical challenges and consequences for model misspecification. Some important factors are as follows. Complexity: since Massive Data are often aggregated from multiple sources, they often exhibit heavy-tailed behavior with nontrivial tail dependence. Noise: Massive Data usually contain various types of measurement error, outliers, and missing values. Dependence: in many data types, such as financial time series, functional magnetic resonance imaging (fMRI), and time course microarray data, the samples are dependent with relatively weak signals. These challenges are difficult to address and require new computational and statistical tools. More specifically, to handle these challenges, it is necessary to develop statistical methods that are robust to data complexity, noise, and dependence. Our work aims to make headway in resolving these issues. Notably, we give a unified framework for analyzing high dimensional, complex, noisy datasets having temporal/spatial dependence. The proposed methods enjoy good theoretical properties, and their empirical usefulness is verified in large-scale neuroimage and financial data analysis.