The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R
This paper describes an R package named flare, which implements a family of
new high dimensional regression methods (LAD Lasso, SQRT Lasso, Lasso,
and Dantzig selector) and their extensions to sparse precision matrix
estimation (TIGER and CLIME). These methods exploit different nonsmooth loss
functions to gain modeling flexibility, estimation robustness, and
insensitivity to tuning. The solver is based on the alternating direction
method of multipliers (ADMM). The package flare is coded in double precision C,
and called from R by a user-friendly interface. The memory usage is optimized
by using the sparse matrix output. The experiments show that flare is efficient
and can scale up to large problems.
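As a rough illustration of the ADMM approach the abstract mentions, here is a minimal NumPy sketch of ADMM for the plain Lasso. This is the generic textbook algorithm, not flare's double-precision C implementation, and it covers only the squared loss, not the LAD/SQRT variants:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t * ||.||_1, applied elementwise.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def admm_lasso(X, y, lam, rho=10.0, n_iter=500):
    """Minimize 0.5*||Xb - y||^2 + lam*||b||_1 by ADMM (generic sketch)."""
    n, p = X.shape
    # The beta-update solves a ridge system; factor it once up front.
    L = np.linalg.cholesky(X.T @ X + rho * np.eye(p))
    Xty = X.T @ y
    z = np.zeros(p)
    u = np.zeros(p)
    for _ in range(n_iter):
        b = np.linalg.solve(L.T, np.linalg.solve(L, Xty + rho * (z - u)))
        z = soft_threshold(b + u, lam / rho)   # sparse copy of the iterate
        u = u + b - z                          # scaled dual update
    return z

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
beta_hat = admm_lasso(X, y, lam=5.0)
```

Returning the `z` iterate (rather than `b`) gives the exactly sparse copy of the solution, which is what makes the sparse-matrix output mentioned above possible.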
Homotopy Parametric Simplex Method for Sparse Learning
High dimensional sparse learning has imposed a great computational challenge
to large scale data analysis. In this paper, we are interested in a broad class
of sparse learning approaches formulated as linear programs parametrized by a
{\em regularization factor}, and solve them by the parametric simplex method
(PSM). Our parametric simplex method offers significant advantages over other
competing methods: (1) PSM naturally obtains the complete solution path for all
values of the regularization parameter; (2) PSM provides a high precision dual
certificate stopping criterion; (3) PSM yields sparse solutions through very
few iterations, and the solution sparsity significantly reduces the
computational cost per iteration. Particularly, we demonstrate the superiority
of PSM over various sparse learning approaches, including Dantzig selector for
sparse linear regression, LAD-Lasso for sparse robust linear regression, CLIME
for sparse precision matrix estimation, sparse differential network estimation,
and sparse Linear Programming Discriminant (LPD) analysis. We then provide
sufficient conditions under which PSM always outputs sparse solutions such that
its computational performance can be significantly boosted. Thorough numerical
experiments are provided to demonstrate the outstanding performance of the PSM
method.
Comment: Accepted by NIPS 201
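To make the linear-program formulation concrete, here is a sketch of the Dantzig selector posed as an LP and solved with SciPy's off-the-shelf solver at a single regularization value. This is only the problem setup; it is not PSM, which would instead trace the entire solution path in the regularization factor:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Dantzig selector as a linear program:
        min ||b||_1  subject to  ||X'(y - Xb)||_inf <= lam.
    Solved with a generic LP solver at one lambda (not path-following)."""
    n, p = X.shape
    G = X.T @ X
    Xty = X.T @ y
    # Split b = u - v with u, v >= 0 so the l1 objective becomes linear.
    c = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([G, -G]),    #  X'X b <= X'y + lam
                      np.hstack([-G, G])])   # -X'X b <= lam - X'y
    b_ub = np.concatenate([Xty + lam, lam - Xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 15))
beta_true = np.zeros(15)
beta_true[:2] = [1.5, -1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(80)
beta_hat = dantzig_selector(X, y, lam=4.0)
```

The point of PSM is that the optimal basis changes only a few times as lambda decreases, so pivoting along the path is far cheaper than re-solving an LP like this from scratch at each lambda.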
Sparse transition matrix estimation for high-dimensional and locally stationary vector autoregressive models
We consider the estimation of the transition matrix in the high-dimensional
time-varying vector autoregression (TV-VAR) models. Our model builds on a
general class of locally stationary VAR processes that evolve smoothly in time.
We propose a hybridized kernel smoothing and $\ell_1$-regularized method to
directly estimate the sequence of time-varying transition matrices. Under the
sparsity assumption on the transition matrix, we establish the rate of
convergence of the proposed estimator and show that the convergence rate
depends on the smoothness of the locally stationary VAR processes only through
the smoothness of the transition matrix function. In addition, for our
estimator followed by thresholding, we prove that the false positive rate (type
I error) and false negative rate (type II error) in the pattern recovery can
asymptotically vanish in the presence of weak signals without assuming the
minimum nonzero signal strength condition. Favorable finite sample performances
over the $\ell_1$-penalized least-squares estimator and the unstructured
maximum likelihood estimator are shown on simulated data. We also provide two
real examples on estimating the dependence structures on financial stock prices
and economic exchange rate datasets.
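The core idea, kernel weights in time combined with an $\ell_1$ penalty, can be sketched as follows for a single row of the transition matrix. The kernel choice, the proximal-gradient solver, and all names here are my assumptions for illustration, not the paper's exact estimator:

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def tv_var_row(X_past, y_next, t0, h, lam, n_steps=500):
    """Estimate one row of a time-varying VAR(1) transition matrix at time
    t0 by kernel-weighted l1-penalized least squares (illustrative only)."""
    T, d = X_past.shape
    u = (np.arange(T) - t0) / (h * T)      # scaled distance to t0
    w = np.maximum(1.0 - u ** 2, 0.0)      # Epanechnikov-style weights
    w /= w.sum()
    # Lipschitz constant of the kernel-weighted quadratic loss.
    Lc = np.linalg.norm(X_past.T @ (X_past * w[:, None]), 2)
    b = np.zeros(d)
    for _ in range(n_steps):               # proximal gradient iterations
        grad = X_past.T @ (w * (X_past @ b - y_next))
        b = soft(b - grad / Lc, lam / Lc)
    return b

# Simulate a (here time-constant) sparse VAR(1) and fit its first row.
rng = np.random.default_rng(2)
d, T = 5, 300
A = 0.5 * np.eye(d)
A[0, 4] = 0.3
x = np.zeros((T + 1, d))
for t in range(T):
    x[t + 1] = A @ x[t] + rng.standard_normal(d)
row0 = tv_var_row(x[:-1], x[1:, 0], t0=150, h=0.3, lam=0.05)
```

Repeating this at each time point t0 yields the estimated sequence of transition matrices; the thresholding step described above would then be applied on top of these estimates.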
Estimation of Large Covariance and Precision Matrices from Temporally Dependent Observations
We consider the estimation of large covariance and precision matrices from
high-dimensional sub-Gaussian or heavier-tailed observations with slowly
decaying temporal dependence. The temporal dependence is allowed to be
long-range, with longer memory than is considered in the current
literature. We show that several commonly used methods for independent
observations can be applied to the temporally dependent data. In particular,
the rates of convergence are obtained for the generalized thresholding
estimation of covariance and correlation matrices, and for the constrained
$\ell_1$-minimization and the $\ell_1$-penalized likelihood estimation of the
precision matrix. Properties of sparsistency and sign-consistency are also
established. A gap-block cross-validation method is proposed for the tuning
parameter selection, which performs well in simulations. As a motivating
example, we study the brain functional connectivity using resting-state fMRI
time series data with long-range temporal dependence.
Comment: The result for the banding estimator of the covariance matrix is
given in version 2 of this article. See arXiv:1412.5059v
A study on tuning parameter selection for the high-dimensional lasso
High-dimensional predictive models, those with more measurements than
observations, require regularization to be well defined, perform well
empirically, and possess theoretical guarantees. The amount of regularization,
often determined by tuning parameters, is integral to achieving good
performance. One can choose the tuning parameter in a variety of ways, such as
through resampling methods or generalized information criteria. However, the
theory supporting many regularized procedures relies on an estimate for the
variance parameter, which is complicated in high dimensions. We develop a suite
of information criteria for choosing the tuning parameter in lasso regression
by leveraging the literature on high-dimensional variance estimation. We derive
intuition showing that existing information-theoretic approaches work poorly in
this setting. We compare our risk estimators to existing methods with an
extensive simulation and derive some theoretical justification. We find that
our new estimators perform well across a wide range of simulation conditions
and evaluation criteria.
Comment: 64 pages, 11 figures
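The role of the variance estimate can be seen in a generic BIC-type criterion. The sketch below hands in `sigma2_hat` as a plug-in; the paper's proposed criteria differ precisely in how that variance is estimated in high dimensions, and the simple ISTA solver here is only for illustration:

```python
import numpy as np

def ista_lasso(X, y, lam, n_steps=500):
    # Plain proximal-gradient (ISTA) lasso solver, illustration only.
    L = np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(n_steps):
        g = b - X.T @ (X @ b - y) / L
        b = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return b

def bic_select(X, y, lams, sigma2_hat):
    """Choose lambda minimizing a BIC-type criterion
        RSS / sigma2_hat + df * log(n),   df = number of nonzeros.
    The plug-in sigma2_hat is the crux: in high dimensions it must come
    from a dedicated variance estimator, not the residuals of one fit."""
    n = len(y)
    best = None
    for lam in lams:
        b = ista_lasso(X, y, lam)
        crit = np.sum((y - X @ b) ** 2) / sigma2_hat \
               + np.count_nonzero(b) * np.log(n)
        if best is None or crit < best[0]:
            best = (crit, lam, b)
    return best[1], best[2]

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.standard_normal(100)
lam_sel, b_sel = bic_select(X, y, [1.0, 5.0, 10.0, 30.0, 80.0],
                            sigma2_hat=0.25)  # oracle variance, demo only
```

With a badly biased `sigma2_hat`, the RSS term is mis-scaled relative to the `df * log(n)` penalty and the criterion systematically over- or under-selects, which is the failure mode the abstract attributes to existing information-theoretic approaches.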
Confidence intervals for high-dimensional Cox models
The purpose of this paper is to construct confidence intervals for the
regression coefficients in high-dimensional Cox proportional hazards regression
models where the number of covariates may be larger than the sample size. Our
debiased estimator construction is similar to those in Zhang and Zhang (2014)
and van de Geer et al. (2014), but the time-dependent covariates and censored
risk sets introduce considerable additional challenges. Our theoretical
results, which provide conditions under which our confidence intervals are
asymptotically valid, are supported by extensive numerical experiments.
Comment: 36 pages, 1 figure
Design of Experiments for Screening
The aim of this paper is to review methods of designing screening
experiments, ranging from designs originally developed for physical experiments
to those especially tailored to experiments on numerical models. The strengths
and weaknesses of the various designs for screening variables in numerical
models are discussed. First, classes of factorial designs for experiments to
estimate main effects and interactions through a linear statistical model are
described, specifically regular and nonregular fractional factorial designs,
supersaturated designs and systematic fractional replicate designs. Generic
issues of aliasing, bias and cancellation of factorial effects are discussed.
Second, group screening experiments are considered including factorial group
screening and sequential bifurcation. Third, random sampling plans are
discussed including Latin hypercube sampling and sampling plans to estimate
elementary effects. Fourth, a variety of modelling methods commonly employed
with screening designs are briefly described. Finally, a novel study
demonstrates six screening methods on two frequently-used exemplars, and their
performances are compared.
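Of the random sampling plans mentioned above, Latin hypercube sampling is the simplest to sketch: each axis of the unit cube is cut into n equal bins, and each bin along every axis contains exactly one point. A minimal NumPy version:

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """Basic Latin hypercube sample of n points in [0,1]^d: one random bin
    permutation per dimension, with a uniform jitter inside each bin."""
    bins = np.column_stack([rng.permutation(n) for _ in range(d)])
    return (bins + rng.random((n, d))) / n

pts = latin_hypercube(10, 3, np.random.default_rng(5))
```

This is the unoptimized construction; screening practice often adds criteria such as maximin distance on top of the basic stratification.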
An Imputation-Consistency Algorithm for High-Dimensional Missing Data Problems and Beyond
Missing data are frequently encountered in high-dimensional problems, but
they are usually difficult to deal with using standard algorithms, such as the
expectation-maximization (EM) algorithm and its variants. To tackle this
difficulty, some problem-specific algorithms have been developed in the
literature, but a general algorithm is still lacking. This work fills the
gap: we propose a general algorithm for high-dimensional missing data problems.
The proposed algorithm works by iterating between an imputation step and a
consistency step. At the imputation step, the missing data are imputed
conditional on the observed data and the current estimate of parameters; and at
the consistency step, a consistent estimate is found for the minimizer of a
Kullback-Leibler divergence defined on the pseudo-complete data. For high
dimensional problems, the consistent estimate can be found under sparsity
constraints. The consistency of the averaged estimate for the true parameter
can be established under quite general conditions. The proposed algorithm is
illustrated using high-dimensional Gaussian graphical models, high-dimensional
variable selection, and a random coefficient model.
Comment: 30 pages, 1 figure
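The imputation/consistency alternation can be seen in a deliberately tiny example: estimating a sparse Gaussian mean vector from data with missing entries. The hard-thresholded column mean below is a didactic stand-in for a general sparsity-constrained consistent estimator, not the paper's method:

```python
import numpy as np

def ic_sparse_mean(Y, mask, tau, n_iter=20):
    """Toy imputation-consistency loop for a sparse Gaussian mean.
    I-step: fill missing entries with the current parameter estimate.
    C-step: hard-threshold the column means of the pseudo-complete data."""
    mu = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        Z = np.where(mask, Y, mu)               # I-step: impute
        m = Z.mean(axis=0)                      # fit on pseudo-complete data
        mu = np.where(np.abs(m) > tau, m, 0.0)  # C-step: enforce sparsity
    return mu

rng = np.random.default_rng(6)
n, p = 200, 50
mu_true = np.zeros(p)
mu_true[:3] = [1.5, -1.2, 1.0]
Y = mu_true + rng.standard_normal((n, p))
mask = rng.random((n, p)) < 0.8                 # ~20% of entries missing
mu_hat = ic_sparse_mean(Y, mask, tau=0.5)
```

In the general framework, the I-step draws imputations conditional on the observed data and current parameters, and the C-step can be any sparsity-constrained estimator consistent for the Kullback-Leibler minimizer; the loop structure stays the same.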
Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python
We describe a new library named picasso, which implements a unified framework
of pathwise coordinate optimization for a variety of sparse learning problems
(e.g., sparse linear regression, sparse logistic regression, sparse Poisson
regression and scaled sparse linear regression) combined with efficient active
set selection strategies. In addition, the library allows users to choose
different sparsity-inducing regularizers, including the convex $\ell_1$, nonconvex MCP
and SCAD regularizers. The library is coded in C++ and has user-friendly R and
Python wrappers. Numerical experiments demonstrate that picasso can scale up to
large problems efficiently.
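The pathwise idea, solving from large to small regularization with warm starts, can be sketched with a plain coordinate-descent lasso. This shows the general scheme only; picasso's contribution is layering efficient active-set selection and the nonconvex MCP/SCAD regularizers on top:

```python
import numpy as np

def cd_lasso(X, y, lam, b0=None, n_sweeps=100):
    # One lasso fit by cyclic coordinate descent, warm-started at b0.
    n, p = X.shape
    b = np.zeros(p) if b0 is None else b0.copy()
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ b                      # running residual
    for _ in range(n_sweeps):
        for j in range(p):
            old = b[j]
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ r + col_sq[j] * old
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if b[j] != old:
                r -= X[:, j] * (b[j] - old)
    return b

def lasso_path(X, y, lams):
    """Pathwise optimization: sweep lambda from large to small,
    warm-starting each fit at the previous solution."""
    b, path = None, []
    for lam in sorted(lams, reverse=True):
        b = cd_lasso(X, y, lam, b0=b)
        path.append((lam, b.copy()))
    return path

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 30))
beta_true = np.zeros(30)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
path = lasso_path(X, y, [50.0, 20.0, 5.0])
```

Warm starts keep the active set small at every lambda, which is why each fit needs only a few cheap sweeps; this is the property the active-set strategies exploit.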
Machine learning in solar physics
The application of machine learning in solar physics has the potential to
greatly enhance our understanding of the complex processes that take place in
the atmosphere of the Sun. By using techniques such as deep learning, we are
now in the position to analyze large amounts of data from solar observations
and identify patterns and trends that may not have been apparent using
traditional methods. This can help us improve our understanding of explosive
events like solar flares, which can have a strong effect on the Earth
environment. Predicting hazardous events on Earth becomes crucial for our
technological society. Machine learning can also improve our understanding of
the inner workings of the Sun itself by allowing us to go deeper into the data
and to propose more complex models to explain them. Additionally, the use of
machine learning can help to automate the analysis of solar data, reducing the
need for manual labor and increasing the efficiency of research in this field.
Comment: 100 pages, 13 figures, 286 references, accepted for publication as a
Living Review in Solar Physics (LRSP)