Sparse canonical correlation analysis from a predictive point of view
Canonical correlation analysis (CCA) describes the associations between two
sets of variables by maximizing the correlation between linear combinations of
the variables in each data set. However, in high-dimensional settings where the
number of variables exceeds the sample size or when the variables are highly
correlated, traditional CCA is no longer appropriate. This paper proposes a
method for sparse CCA. Sparse estimation produces linear combinations of only a
subset of variables from each data set, thereby increasing the interpretability
of the canonical variates. We consider the CCA problem from a predictive point
of view and recast it into a regression framework. By combining an alternating
regression approach with a lasso penalty, we induce sparsity in the
canonical vectors. We compare the method's performance with that of other
sparse CCA techniques in different simulation settings and illustrate its
usefulness on a genomic data set.
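To make the alternating-regression idea concrete, below is a minimal sketch in Python for the first canonical pair, assuming standardized data matrices X (n x p) and Y (n x q). The function name sparse_cca, the fixed penalty values and the iteration count are illustrative choices, not the paper's tuning procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_cca(X, Y, alpha_x=0.1, alpha_y=0.1, n_iter=50, seed=0):
    """Alternating lasso regressions for one pair of sparse canonical vectors (a, b)."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(Y.shape[1])
    b /= np.linalg.norm(Y @ b)
    for _ in range(n_iter):
        # Update the X-side vector: lasso regression of the current Y-variate on X.
        a = Lasso(alpha=alpha_x, fit_intercept=False).fit(X, Y @ b).coef_
        if np.linalg.norm(X @ a) > 0:
            a /= np.linalg.norm(X @ a)
        # Update the Y-side vector: lasso regression of the current X-variate on Y.
        b = Lasso(alpha=alpha_y, fit_intercept=False).fit(Y, X @ a).coef_
        if np.linalg.norm(Y @ b) > 0:
            b /= np.linalg.norm(Y @ b)
    return a, b  # sparse canonical vectors; many entries are exactly zero
```

Subsequent canonical pairs are typically obtained by deflating the data matrices before re-running the loop.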
Robust Sparse Canonical Correlation Analysis
Canonical correlation analysis (CCA) is a multivariate statistical method
which describes the associations between two sets of variables. The objective
is to find linear combinations of the variables in each data set having maximal
correlation. This paper discusses a method for Robust Sparse CCA. Sparse
estimation produces canonical vectors with some of their elements estimated as
exactly zero. As such, their interpretability is improved. We also robustify
the method such that it can cope with outliers in the data. To estimate the
canonical vectors, we convert the CCA problem into an alternating regression
framework, and use the sparse Least Trimmed Squares estimator. We illustrate
the good performance of the Robust Sparse CCA method in several simulation
studies and two real data examples.
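As a rough illustration of how the robustification enters, the sketch below uses a simplified trimmed lasso in place of the sparse Least Trimmed Squares estimator of the paper: it repeatedly refits a lasso on the h observations with the smallest absolute residuals. The trimming fraction, penalty and number of refinement steps are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def trimmed_lasso(X, y, alpha=0.1, h_frac=0.75, n_steps=10):
    """Crude stand-in for sparse LTS: refit a lasso on the h least-outlying observations."""
    h = int(h_frac * len(y))
    idx = np.argsort(np.abs(y - np.median(y)))[:h]       # rough initial clean subset
    for _ in range(n_steps):
        fit = Lasso(alpha=alpha, fit_intercept=False).fit(X[idx], y[idx])
        idx = np.argsort(np.abs(y - X @ fit.coef_))[:h]  # keep the h smallest residuals
    return fit.coef_
```

Replacing the plain lasso steps in the alternating sparse CCA sketch above with trimmed_lasso gives a robust variant in the same spirit.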
Sparse cointegration
Cointegration analysis is used to estimate the long-run equilibrium relations
between several time series. The coefficients of these long-run equilibrium
relations are the cointegrating vectors. In this paper, we provide a sparse
estimator of the cointegrating vectors. The estimation technique is sparse in
the sense that some elements of the cointegrating vectors will be estimated as
zero. For this purpose, we combine a penalized estimation procedure for vector
autoregressive models with sparse reduced rank regression. The sparse
cointegration procedure achieves a higher estimation accuracy than the
traditional Johansen cointegration approach in settings where the true
cointegrating vectors have a sparse structure, and/or when the sample size is
low compared to the number of time series. We also discuss a criterion to
determine the cointegration rank and we illustrate its good performance in
several simulation settings. In a first empirical application we investigate
whether the expectations hypothesis of the term structure of interest rates,
implying sparse cointegrating vectors, holds in practice. In a second empirical
application we show that forecast performance in high-dimensional systems can
be improved by sparsely estimating the cointegration relations.
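The sketch below illustrates the basic alternation for a vector error correction model Δy_t = αβ'y_{t-1} + ε_t, with the short-run dynamics omitted for brevity. The rank r, the penalty lam and the SVD-based initialization are illustrative; the paper's rank-selection criterion and its penalized VAR step are not reproduced.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_vecm(Y, r=1, lam=0.05, n_iter=25):
    """Alternate OLS updates of the adjustment matrix alpha with lasso updates of beta,
    whose columns are the (sparse) cointegrating vectors. Y is a T x k matrix of levels."""
    dY, Ylag = np.diff(Y, axis=0), Y[:-1]
    # Initialize from an unpenalized reduced-rank decomposition of the OLS estimate of Pi.
    Pi = np.linalg.lstsq(Ylag, dY, rcond=None)[0].T
    U, s, Vt = np.linalg.svd(Pi)
    alpha, beta = U[:, :r] * s[:r], Vt[:r].T
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(alpha)            # orthonormalize alpha
        for j in range(r):                    # sparse update of each cointegrating vector
            beta[:, j] = Lasso(alpha=lam, fit_intercept=False).fit(Ylag, dY @ Q[:, j]).coef_
        # Given beta, update alpha by OLS of the differences on the error-correction terms.
        alpha = np.linalg.lstsq(Ylag @ beta, dY, rcond=None)[0].T
    return alpha, beta
```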
Commodity Dynamics: A Sparse Multi-class Approach
The correct understanding of commodity price dynamics can bring relevant
improvements in terms of policy formulation both for developing and developed
countries. Agricultural, metal and energy commodity prices might depend on each
other: although we expect only a few of the many possible cross-commodity price
effects to be important, some of them might still be
substantial. Moreover, the increasing integration of the world economy suggests
that these effects should be comparable for different markets. This paper
introduces a sparse estimator of the Multi-class Vector AutoRegressive model to
detect common price effects between a large number of commodities, for
different markets or investment portfolios. In a first application, we consider
agricultural, metal and energy commodities for three different markets. We show
a large prevalence of effects involving metal commodities in the Chinese and
Indian markets, and the existence of asymmetric price effects. In a second
application, we analyze commodity prices for five different investment
portfolios, and highlight the existence of important effects from energy to
agricultural commodities, confirming the relevance of biofuels.
Overall, we find stronger similarities in commodity price effects among
portfolios than among markets.
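As a minimal illustration of the multi-class idea, the sketch below estimates one VAR(1) per class (market or portfolio) while a group-lasso penalty ties each autoregressive coefficient across classes, so that price effects are selected jointly. This stand-in penalty, the lag order of one, and the plain proximal-gradient solver are simplifications of the paper's sparse multi-class estimator.

```python
import numpy as np

def multiclass_var(Ys, lam=0.1, n_iter=500):
    """Ys: list of T_c x k arrays (one per class). Returns a C x k x k array of VAR(1)
    coefficient matrices with a shared sparsity pattern across classes."""
    C, k = len(Ys), Ys[0].shape[1]
    X = [Y[:-1] for Y in Ys]                                  # lagged values
    Z = [Y[1:] for Y in Ys]                                   # responses
    A = np.zeros((C, k, k))
    step = 1.0 / max(np.linalg.norm(x.T @ x, 2) for x in X)   # conservative step size
    for _ in range(n_iter):
        for c in range(C):    # gradient step on each class's least-squares loss
            A[c] -= step * (X[c] @ A[c].T - Z[c]).T @ X[c] / len(Z[c])
        # Proximal step: group soft-thresholding of every coefficient (i, j) across classes.
        norms = np.sqrt((A ** 2).sum(axis=0))
        A *= np.maximum(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0)
    return A
```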
Sparse Identification and Estimation of Large-Scale Vector AutoRegressive Moving Averages
The Vector AutoRegressive Moving Average (VARMA) model is fundamental to the
theory of multivariate time series; however, in practice, identifiability
issues have led many authors to abandon VARMA modeling in favor of the simpler
Vector AutoRegressive (VAR) model. Such a practice is unfortunate since even
very simple VARMA models can have quite complicated VAR representations. We
narrow this gap with a new optimization-based approach to VARMA identification
that is built upon the principle of parsimony. Among all equivalent
data-generating models, we seek the parameterization that is "simplest" in a
certain sense. A user-specified strongly convex penalty is used to measure
model simplicity, and that same penalty is then used to define an estimator
that can be efficiently computed. We show that our estimator converges to a
parsimonious element in the set of all equivalent data-generating models, in a
double asymptotic regime where the number of component time series is allowed
to grow with sample size. Further, we derive non-asymptotic upper bounds on the
estimation error of our method relative to our specially identified target.
Novel theoretical machinery includes non-asymptotic analysis of infinite-order
VAR, elastic net estimation under a singular covariance structure of
regressors, and new concentration inequalities for quadratic forms of random
variables from Gaussian time series. We illustrate the competitive performance
of our methods in simulation and several application domains, including
macro-economic forecasting, demand forecasting, and volatility forecasting.
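For intuition about how such a model can be estimated in practice, the sketch below uses the classical two-stage regression device: a long sparse VAR first approximates the innovations, and each series is then regressed on lagged values and lagged residual proxies with an elastic-net penalty. The lag orders, penalties and off-the-shelf scikit-learn solvers are illustrative; the paper's identification penalty, theory and tuning are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def varma_two_stage(Y, p_long=8, p=2, q=1, pen=0.05, l1_ratio=0.7):
    """Two-stage sketch for a VARMA(p, q) fit to a T x k matrix Y."""
    T, k = Y.shape
    # Stage 1: long sparse VAR, one lasso per series, to obtain residual proxies.
    X1 = np.hstack([Y[p_long - l - 1:T - l - 1] for l in range(p_long)])
    Z1 = Y[p_long:]
    B1 = np.column_stack([Lasso(alpha=pen, fit_intercept=False).fit(X1, Z1[:, i]).coef_
                          for i in range(k)])
    E = Z1 - X1 @ B1                                    # innovation proxies, aligned with Z1
    # Stage 2: regress y_t on its own lags and on lagged innovation proxies.
    m = max(p, q)
    X2 = np.hstack([np.hstack([Z1[m - l - 1:-l - 1] for l in range(p)]),
                    np.hstack([E[m - l - 1:-l - 1] for l in range(q)])])
    Z2 = Z1[m:]
    B2 = np.column_stack([ElasticNet(alpha=pen, l1_ratio=l1_ratio, fit_intercept=False)
                          .fit(X2, Z2[:, i]).coef_ for i in range(k)])
    return B2  # column i: AR coefficients (first p*k rows) and MA coefficients (last q*k rows) of equation i
```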
Monitoring Machine Learning Forecasts for Platform Data Streams
Data stream forecasts are essential inputs for decision making at digital
platforms. Machine learning algorithms are appealing candidates to produce such
forecasts. Yet, digital platforms require a large-scale forecast framework that
can flexibly respond to sudden performance drops. Re-training ML algorithms at
the same speed as new data batches enter is usually computationally too costly.
On the other hand, infrequent re-training requires specifying the re-training
frequency and typically comes with a severe cost of forecast deterioration. To
ensure accurate and stable forecasts, we propose a simple data-driven
monitoring procedure to answer the question of when the ML algorithm should be
re-trained. Instead of investigating instability of the data streams, we test
if the incoming streaming forecast loss batch differs from a well-defined
reference batch. Using a novel dataset consisting of 15-minute frequency data
streams from an on-demand logistics platform operating in London, we apply the
monitoring procedure to popular ML algorithms including random forest, XGBoost
and lasso. We show that monitor-based re-training produces accurate forecasts
compared to viable benchmarks while preserving computational feasibility.
Moreover, the choice of monitoring procedure is more important than the choice
of ML algorithm, thereby permitting practitioners to combine the proposed
monitoring procedure with their favorite forecasting algorithm.
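The sketch below illustrates the monitoring logic with a simple two-sample test on batches of forecast losses: flag re-training whenever an incoming loss batch is significantly worse than a reference batch. The Welch t-test and the significance level are illustrative stand-ins for the paper's monitoring procedure.

```python
import numpy as np
from scipy.stats import ttest_ind

def retrain_signals(reference_losses, incoming_loss_batches, level=0.01):
    """Yield True for every incoming loss batch that looks significantly worse than the reference."""
    reference_losses = np.asarray(reference_losses)
    for batch in incoming_loss_batches:
        batch = np.asarray(batch)
        _, pval = ttest_ind(batch, reference_losses, equal_var=False)  # Welch two-sample test
        yield bool(pval < level and batch.mean() > reference_losses.mean())
```

In a streaming deployment, each True signal would trigger re-training of the forecaster and a reset of the reference batch; that bookkeeping is left out of the sketch.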
Interpretable Vector AutoRegressions with Exogenous Time Series
The Vector AutoRegressive (VAR) model is fundamental to the study of
multivariate time series. Although VAR models are intensively investigated by
many researchers, practitioners often show more interest in analyzing VARX
models that incorporate the impact of unmodeled exogenous variables (X) into
the VAR. However, since the parameter space grows quadratically with the number
of time series, estimation quickly becomes challenging. While several proposals
have been made to sparsely estimate large VAR models, the estimation of large
VARX models is under-explored. Moreover, typically these sparse proposals
involve a lasso-type penalty and do not incorporate lag selection into the
estimation procedure. As a consequence, the resulting models may be difficult
to interpret. In this paper, we propose a lag-based hierarchically sparse
estimator, called "HVARX", for large VARX models. We illustrate the usefulness
of HVARX on a cross-category management marketing application. Our results show
how it provides a highly interpretable model, and improves out-of-sample
forecast accuracy compared to a lasso-type approach. Comment: Presented at NIPS 2017 Symposium on Interpretable Machine Learning.
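To illustrate what lag-based hierarchical sparsity means computationally, the sketch below fits a VARX by proximal gradient descent with a nested lag-group penalty per equation, applied separately to the endogenous and exogenous blocks, so that a longer lag can be active only if all shorter lags are. The group structure, the common penalty weight and the solver settings are simplifications and do not reproduce the exact HVARX formulation.

```python
import numpy as np

def nested_lag_prox(b, lam, k):
    """Prox of sum_l lam * ||coefficients at lags l..L||_2 for one row b of length L*k.
    For nested groups, thresholding from the deepest lag group outward gives the exact prox."""
    L = len(b) // k
    for l in range(L, 0, -1):
        g = b[(l - 1) * k:]
        nrm = np.linalg.norm(g)
        if nrm > 0:
            b[(l - 1) * k:] = g * max(1.0 - lam / nrm, 0.0)
    return b

def hvarx_sketch(Y, X, p=4, s=2, lam=0.1, n_iter=300):
    """Lag-hierarchical sparse VARX(p, s): T x k endogenous Y, T x kx exogenous X (lags 1..s)."""
    T, k = Y.shape
    kx = X.shape[1]
    m = max(p, s)
    Z = np.hstack([np.hstack([Y[m - l - 1:T - l - 1] for l in range(p)]),
                   np.hstack([X[m - l - 1:T - l - 1] for l in range(s)])])
    R = Y[m:]
    B = np.zeros((k, Z.shape[1]))
    step = len(R) / np.linalg.norm(Z.T @ Z, 2)      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        B -= step * (Z @ B.T - R).T @ Z / len(R)    # gradient step on the least-squares loss
        for i in range(k):                          # hierarchical prox, equation by equation
            B[i, :p * k] = nested_lag_prox(B[i, :p * k], step * lam, k)
            B[i, p * k:] = nested_lag_prox(B[i, p * k:], step * lam, kx)
    return B   # row i: [own lags 1..p | exogenous lags 1..s] coefficients of equation i
```

Because every coefficient at lag l belongs to all groups covering lags l and beyond, zeroing out a group removes that lag together with all longer ones, which is the built-in lag selection the hierarchical penalty is designed to deliver.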
- …