Sparse-Input Neural Networks for High-dimensional Nonparametric Regression and Classification
Neural networks are usually not the tool of choice for nonparametric
high-dimensional problems where the number of input features is much larger
than the number of observations. Though neural networks can approximate complex
multivariate functions, they generally require a large number of training
observations to obtain reasonable fits, unless one can learn the appropriate
network structure. In this manuscript, we show that neural networks can be
applied successfully to high-dimensional settings if the true function falls in
a low-dimensional subspace and proper regularization is used. We propose
fitting a neural network with a sparse group lasso penalty on the first-layer
input weights. This results in a neural net that only uses a small subset of
the original features. In addition, we characterize the statistical convergence
of the penalized empirical risk minimizer to the optimal neural network: we
show that the excess risk of this penalized estimator only grows with the
logarithm of the number of input features; and we show that the weights of
irrelevant features converge to zero. Via simulation studies and data analyses,
we show that these sparse-input neural networks outperform existing
nonparametric high-dimensional estimation methods when the data has complex
higher-order interactions.
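As a hedged illustration of the penalty described above (not the authors' code), here is a minimal PyTorch sketch of a sparse group lasso on the first-layer input weights, where each input feature owns one column of the weight matrix; the network shape, lam, and alpha are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseInputNet(nn.Module):
    """Small MLP; only the first layer's input weights are penalized."""
    def __init__(self, n_features, n_hidden=32):
        super().__init__()
        self.first = nn.Linear(n_features, n_hidden)
        self.rest = nn.Sequential(nn.ReLU(), nn.Linear(n_hidden, 1))

    def forward(self, x):
        return self.rest(self.first(x))

def sparse_group_lasso(first_layer, lam=1e-2, alpha=0.5):
    # Feature j owns column W[:, j]: the group (L2) term can zero out a
    # whole feature; the L1 term adds within-column sparsity.
    W = first_layer.weight                    # shape (n_hidden, n_features)
    group = W.pow(2).sum(dim=0).sqrt().sum()  # sum_j ||W[:, j]||_2
    lasso = W.abs().sum()                     # ||W||_1
    return lam * (alpha * lasso + (1 - alpha) * group)

# usage: add the penalty to the empirical risk during training
net = SparseInputNet(n_features=1000)
x, y = torch.randn(50, 1000), torch.randn(50, 1)
loss = nn.functional.mse_loss(net(x), y) + sparse_group_lasso(net.first)
loss.backward()
```

Plain (sub)gradient descent on this objective shrinks but rarely exactly zeroes columns; in practice a proximal update is typically used to obtain exact feature-level sparsity.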
Feature Selection using Stochastic Gates
Feature selection problems have been extensively studied for linear
estimation (for instance, the Lasso), but less emphasis has been placed on feature
selection for non-linear functions. In this study, we propose a method for
feature selection in high-dimensional non-linear function estimation problems.
The new procedure is based on minimizing the $\ell_0$ norm of the vector of
indicator variables that represent if a feature is selected or not. Our
approach relies on the continuous relaxation of Bernoulli distributions, which
allows our model to learn the parameters of the approximate Bernoulli
distributions via gradient descent. This general framework simultaneously
minimizes a loss function while selecting relevant features. Furthermore, we
provide an information-theoretic justification of incorporating Bernoulli
distribution into our approach and demonstrate the potential of the approach on
synthetic and real-life applications.
Comment: Published in ICML 2020
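The abstract does not pin down the relaxation; as one hedged sketch, the stochastic-gates literature often uses a clipped-Gaussian relaxation of Bernoulli gates, which in PyTorch might look as follows (sigma and the regularization weight are illustrative assumptions).

```python
import torch
from torch.distributions import Normal

class StochasticGates(torch.nn.Module):
    """Per-feature gates z_j = clip(mu_j + eps, 0, 1), eps ~ N(0, sigma^2):
    a continuous relaxation of Bernoulli feature indicators."""
    def __init__(self, n_features, sigma=0.5):
        super().__init__()
        self.mu = torch.nn.Parameter(0.5 * torch.ones(n_features))
        self.sigma = sigma

    def forward(self, x):
        eps = self.sigma * torch.randn_like(self.mu) if self.training else 0.0
        z = (self.mu + eps).clamp(0.0, 1.0)   # gate values in [0, 1]
        return x * z

    def expected_open_gates(self):
        # sum_j P(gate_j > 0): a smooth surrogate for the expected
        # number of selected features
        return Normal(0.0, 1.0).cdf(self.mu / self.sigma).sum()

# usage: gate the inputs, penalize the expected number of open gates
gates = StochasticGates(n_features=100)
x = torch.randn(8, 100)
x_gated = gates(x)                       # feeds a downstream predictor
reg = 0.1 * gates.expected_open_gates()  # added to the task loss
```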
Harmless interpolation of noisy data in regression
A continuing mystery in understanding the empirical success of deep neural
networks is their ability to achieve zero training error and generalize well,
even when the training data is noisy and there are more parameters than data
points. We investigate this overparameterized regime in linear regression,
where all solutions that minimize training error interpolate the data,
including noise. We characterize the fundamental generalization (mean-squared)
error of any interpolating solution in the presence of noise, and show that
this error decays to zero with the number of features. Thus,
overparameterization can be explicitly beneficial in ensuring harmless
interpolation of noise. We discuss two root causes for poor generalization that
are complementary in nature -- signal "bleeding" into a large number of alias
features, and overfitting of noise by parsimonious feature selectors. For the
sparse linear model with noise, we provide a hybrid interpolating scheme that
mitigates both these issues and achieves order-optimal MSE over all possible
interpolating solutions.
Comment: 52 pages; expanded version of the paper presented at ITA in San Diego
in February 2019, at ISIT in Paris in July 2019, at Simons in July, and as a
plenary at ITW in Visby in August 2019
Merging versus Ensembling in Multi-Study Machine Learning: Theoretical Insight from Random Effects
A critical decision point when training predictors using multiple studies is
whether these studies should be combined or treated separately. We compare two
multi-study learning approaches in the presence of potential heterogeneity in
predictor-outcome relationships across datasets. We consider 1) merging all of
the datasets and training a single learner, and 2) multi-study ensembling,
which involves training a separate learner on each dataset and combining the
predictions resulting from each learner. In a linear regression setting, we
show analytically and confirm via simulation that merging yields lower
prediction error than ensembling when the predictor-outcome relationships are
relatively homogeneous across studies. However, as cross-study heterogeneity
increases, there exists a transition point beyond which ensembling outperforms
merging. We provide analytic expressions for the transition point in various
scenarios, study asymptotic properties, and illustrate how transition point
theory can be used for deciding when studies should be combined with an
application from metabolomics.
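A hedged numpy sketch of the comparison in a linear random-effects setting; the equal-weight ensemble, study sizes, and heterogeneity level sigma2_beta are illustrative assumptions (the paper derives the transition point analytically). Varying sigma2_beta lets one explore the transition empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, p, sigma2_beta = 5, 100, 3, 0.5   # studies, study size, predictors, heterogeneity

beta0 = rng.normal(size=p)
Xs, ys = [], []
for _ in range(K):
    b = beta0 + np.sqrt(sigma2_beta) * rng.normal(size=p)  # study-level random effects
    X = rng.normal(size=(n, p))
    Xs.append(X); ys.append(X @ b + rng.normal(size=n))

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_merge = ols(np.vstack(Xs), np.concatenate(ys))              # one merged learner
beta_ens = np.mean([ols(X, y) for X, y in zip(Xs, ys)], axis=0)  # equal-weight ensemble

# compare prediction error on a fresh study from the same random-effects model
b_new = beta0 + np.sqrt(sigma2_beta) * rng.normal(size=p)
X_new = rng.normal(size=(1000, p)); y_new = X_new @ b_new
for name, b in [("merged", beta_merge), ("ensembled", beta_ens)]:
    print(name, np.mean((X_new @ b - y_new) ** 2))
```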
Machine Learning Methods Economists Should Know About
We discuss the relevance of the recent Machine Learning (ML) literature for
economics and econometrics. First we discuss the differences in goals, methods
and settings between the ML literature and the traditional econometrics and
statistics literatures. Then we discuss some specific methods from the machine
learning literature that we view as important for empirical researchers in
economics. These include supervised learning methods for regression and
classification, unsupervised learning methods, as well as matrix completion
methods. Finally, we highlight newly developed methods at the intersection of
ML and econometrics, methods that typically perform better than either
off-the-shelf ML or more traditional econometric methods when applied to
particular classes of problems, problems that include causal inference for
average treatment effects, optimal policy estimation, and estimation of the
counterfactual effect of price changes in consumer choice models.
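Among the methods listed, matrix completion lends itself to a compact sketch; below is a hedged numpy implementation of one standard approach, iterative soft-thresholded SVD in the spirit of SoftImpute, with an illustrative penalty lam (not a method endorsed by the survey's authors specifically).

```python
import numpy as np

def soft_impute(M, mask, lam=1.0, n_iters=100):
    """Fill the missing entries of M (mask True where observed) by
    iteratively soft-thresholding the singular values of the completion."""
    Z = np.where(mask, M, 0.0)                # start with zeros in missing slots
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)          # soft-threshold the spectrum
        low_rank = (U * s) @ Vt
        Z = np.where(mask, M, low_rank)       # keep observed entries fixed
    return Z

# usage on a toy low-rank matrix with ~50% of entries observed
rng = np.random.default_rng(2)
M = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 30))
mask = rng.random(M.shape) < 0.5
M_hat = soft_impute(M, mask, lam=0.5)
print("RMSE on missing entries:", np.sqrt(np.mean((M_hat[~mask] - M[~mask]) ** 2)))
```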
Horseshoe Regularization for Machine Learning in Complex and Deep Models
Since the advent of the horseshoe priors for regularization, global-local
shrinkage methods have proved to be a fertile ground for the development of
Bayesian methodology in machine learning, specifically for high-dimensional
regression and classification problems. They have achieved remarkable success
in computation, and enjoy strong theoretical support. Most of the existing
literature has focused on the linear Gaussian case; see Bhadra et al. (2019b)
for a systematic survey. The purpose of the current article is to demonstrate
that the horseshoe regularization is useful far more broadly, by reviewing both
methodological and computational developments in complex models that are more
relevant to machine learning applications. Specifically, we focus on
methodological challenges in horseshoe regularization in nonlinear and
non-Gaussian models; multivariate models; and deep neural networks. We also
outline the recent computational developments in horseshoe shrinkage for
complex models along with a list of available software implementations that
allows one to venture out beyond the comfort zone of the canonical linear
regression problems.
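For concreteness, a minimal numpy sketch of draws from the canonical horseshoe hierarchy reviewed above; the global scale tau0 is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
p, tau0 = 1000, 0.1

# global-local horseshoe hierarchy:
#   lambda_j ~ C+(0, 1)     heavy-tailed local scales
#   tau      ~ C+(0, tau0)  global scale pulling everything toward zero
#   beta_j   ~ N(0, tau^2 * lambda_j^2)
lam = np.abs(rng.standard_cauchy(p))        # half-Cauchy via |Cauchy|
tau = tau0 * np.abs(rng.standard_cauchy())
beta = rng.normal(scale=tau * lam)

# the signature behavior: most draws shrunk near zero, a few left untouched
print("fraction |beta| < 0.01:", np.mean(np.abs(beta) < 0.01))
print("largest |beta|:", np.abs(beta).max())
```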
Bridgeout: stochastic bridge regularization for deep neural networks
A major challenge in training deep neural networks is overfitting, i.e.
inferior performance on unseen test examples compared to performance on
training examples. To reduce overfitting, stochastic regularization methods
have shown superior performance compared to deterministic weight penalties on a
number of image recognition tasks. Stochastic methods such as Dropout and
Shakeout, in expectation, are equivalent to imposing a ridge and elastic-net
penalty on the model parameters, respectively. However, the choice of the norm
of the weight penalty is problem dependent and is not restricted to $\{L_1, L_2\}$.
Therefore, in this paper we propose the Bridgeout stochastic regularization
technique and prove that it is equivalent to an $L_q$ penalty on the weights,
where the norm $q$ can be learned as a hyperparameter from data. Experimental
results show that Bridgeout results in sparse model weights, improved gradients
and superior classification performance compared to Dropout and Shakeout on
synthetic and real datasets.
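The stochastic perturbation scheme itself is defined in the paper; as a hedged stand-in, this sketch implements the deterministic $L_q$ (bridge) penalty that Bridgeout is proven equivalent to in expectation, with q treated here as a fixed hyperparameter rather than learned.

```python
import torch

def bridge_penalty(weights, q=1.5, lam=1e-4):
    """Deterministic L_q ("bridge") penalty: lam * sum_i |w_i|^q.
    q = 1 recovers the lasso, q = 2 the ridge; intermediate q
    interpolates between sparsity and smooth shrinkage."""
    return lam * weights.abs().pow(q).sum()

# usage: add the penalty to the task loss for each penalized layer
layer = torch.nn.Linear(128, 64)
x, target = torch.randn(16, 128), torch.randn(16, 64)
loss = torch.nn.functional.mse_loss(layer(x), target)
loss = loss + bridge_penalty(layer.weight, q=1.5)
loss.backward()
```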
A weighted random survival forest
A weighted random survival forest is presented in the paper. It can be
regarded as a modification of the random forest that improves its performance.
The main idea underlying the proposed model is to replace the standard
procedure of averaging used to estimate the random survival forest hazard
function with weighted averaging, where a weight is assigned to every tree.
These weights can be viewed as training parameters, computed in an optimal way
by solving a standard quadratic optimization problem that maximizes Harrell's
C-index. Numerical examples with real data illustrate that the proposed model
outperforms the original random survival forest.
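A hedged numpy/scipy sketch of the weighting idea: given per-tree risk scores (synthetic here, standing in for real survival trees), optimize a weight vector on the simplex. The paper solves a quadratic program for Harrell's C-index; this sketch instead maximizes a smooth sigmoid surrogate of C, which is an assumption for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def c_index(risk, time, event):
    # Harrell's C: among comparable pairs (i has an observed event before
    # j's time), the fraction where the earlier death has the higher risk
    num = den = 0
    for i in range(len(time)):
        if not event[i]:
            continue
        later = time > time[i]
        den += later.sum()
        num += (risk[i] > risk[later]).sum()
    return num / den

def smooth_c(w, temp=0.1):
    # differentiable sigmoid surrogate of the C-index for the optimizer
    r = w @ tree_risks
    num = den = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue
        later = time > time[i]
        den += later.sum()
        num += (1.0 / (1.0 + np.exp(-(r[i] - r[later]) / temp))).sum()
    return num / den

# synthetic per-tree risk scores (assumption: higher risk = earlier event)
rng = np.random.default_rng(4)
n_trees, n = 10, 60
time = rng.exponential(size=n)
event = rng.random(n) < 0.7
tree_risks = -np.log(time)[None, :] + rng.normal(scale=2.0, size=(n_trees, n))

w0 = np.full(n_trees, 1.0 / n_trees)          # plain averaging as the baseline
res = minimize(lambda w: -smooth_c(w), w0, method="SLSQP",
               bounds=[(0, 1)] * n_trees,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
print("plain C:   ", c_index(w0 @ tree_risks, time, event))
print("weighted C:", c_index(res.x @ tree_risks, time, event))
```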
An Interpretable and Sparse Neural Network Model for Nonlinear Granger Causality Discovery
While most classical approaches to Granger causality detection rest on
linear time series assumptions, many interactions in neuroscience and economics
applications are nonlinear. We develop an approach to nonlinear Granger
causality detection using multilayer perceptrons where the input to the network
is the past time lags of all series and the output is the future value of a
single series. A sufficient condition for Granger non-causality in this setting
is that all of the outgoing weights of the input data, the past lags of a
series, to the first hidden layer are zero. For estimation, we utilize a group
lasso penalty to shrink groups of input weights to zero. We also propose a
hierarchical penalty for simultaneous Granger causality and lag estimation. We
validate our approach on simulated data from both a sparse linear
autoregressive model and the sparse and nonlinear Lorenz-96 model.
Comment: Accepted to the NIPS Time Series Workshop 2017
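A hedged PyTorch sketch of the group lasso construction described above: one group per candidate driving series, consisting of all first-layer weights attached to that series' lags. The dimensions and the series-major column layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

p, K, H = 5, 4, 16    # series, lags per series, hidden units (illustrative)

class GrangerMLP(nn.Module):
    """Predicts one target series at time t from K past lags of all p
    series; input layout: [series 0 lags, series 1 lags, ...]."""
    def __init__(self):
        super().__init__()
        self.first = nn.Linear(p * K, H)
        self.out = nn.Sequential(nn.ReLU(), nn.Linear(H, 1))

    def forward(self, x):
        return self.out(self.first(x))

def granger_group_lasso(first, lam=1e-2):
    # One group per driving series: if the whole weight block for series i
    # shrinks to zero, series i is estimated Granger non-causal for the target.
    W = first.weight.view(H, p, K)                      # regroup columns by series
    return lam * W.pow(2).sum(dim=(0, 2)).sqrt().sum()  # sum_i ||W[:, i, :]||_F

net = GrangerMLP()
x, y = torch.randn(32, p * K), torch.randn(32, 1)
loss = nn.functional.mse_loss(net(x), y) + granger_group_lasso(net.first)
loss.backward()
```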
Machine Learning for Survival Analysis: A Survey
Accurately predicting the time of occurrence of an event of interest is a
critical problem in longitudinal data analysis. One of the main challenges in
this context is the presence of instances whose event outcomes become
unobservable after a certain time point, or that do not experience any event
during the monitoring period. This phenomenon is called censoring, and it can
be effectively handled using survival analysis techniques.
Traditionally, statistical approaches have been widely developed in the
literature to overcome this censoring issue. In addition, many machine learning
algorithms have been adapted to effectively handle survival data and to tackle other
challenging problems that arise in real-world data. In this survey, we provide
a comprehensive and structured review of the representative statistical methods
along with the machine learning techniques used in survival analysis and
provide a detailed taxonomy of the existing methods. We also discuss several
topics that are closely related to survival analysis and illustrate several
successful applications in various real-world application domains. We hope that
this paper will provide a more thorough understanding of the recent advances in
survival analysis and offer some guidelines on applying these approaches to
solve new problems that arise in applications with censored data.
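Since censoring is the survey's central concept, a minimal numpy sketch of the canonical estimator for right-censored data, the Kaplan-Meier survival curve, may help fix ideas (assuming distinct event times for simplicity).

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimate of the survival function S(t) from
    right-censored data: event=1 observed event, event=0 censored."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    n = len(time)
    at_risk = n - np.arange(n)                 # subjects still under observation
    # the survival curve drops only at observed event times
    factors = np.where(event == 1, 1.0 - 1.0 / at_risk, 1.0)
    return time, np.cumprod(factors)

# toy right-censored sample: (duration, event indicator)
time = np.array([2.0, 3.0, 3.5, 5.0, 6.0, 7.5])
event = np.array([1, 0, 1, 1, 0, 1])
t, S = kaplan_meier(time, event)
for ti, si in zip(t, S):
    print(f"t = {ti:.1f}  S(t) = {si:.3f}")
```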