    Sparse canonical correlation analysis from a predictive point of view

    Canonical correlation analysis (CCA) describes the associations between two sets of variables by maximizing the correlation between linear combinations of the variables in each data set. However, in high-dimensional settings where the number of variables exceeds the sample size, or when the variables are highly correlated, traditional CCA is no longer appropriate. This paper proposes a method for sparse CCA. Sparse estimation produces linear combinations of only a subset of variables from each data set, thereby increasing the interpretability of the canonical variates. We consider the CCA problem from a predictive point of view and recast it into a regression framework. By combining an alternating regression approach with a lasso penalty, we induce sparsity in the canonical vectors. We compare the performance with other sparse CCA techniques in different simulation settings and illustrate the usefulness of the method on a genomic data set.
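
    The alternating-regression idea can be sketched in a few lines of code. The following is a minimal sketch, not the authors' implementation: it estimates a single pair of sparse canonical vectors, uses scikit-learn's Lasso for the penalized regression step, and assumes centered data matrices X and Y; the penalty value, initialization and convergence rule are placeholder choices.

        import numpy as np
        from sklearn.linear_model import Lasso

        def sparse_cca(X, Y, alpha=0.1, n_iter=100, tol=1e-6):
            """One pair of sparse canonical vectors via alternating lasso regressions."""
            rng = np.random.default_rng(0)
            b = rng.standard_normal(Y.shape[1])          # initial weights for the Y block
            b /= np.linalg.norm(Y @ b)
            a_old = np.zeros(X.shape[1])
            for _ in range(n_iter):
                # regress the current Y-variate on X to update the X-side canonical vector
                a = Lasso(alpha=alpha, fit_intercept=False).fit(X, Y @ b).coef_
                if np.linalg.norm(X @ a) > 0:
                    a /= np.linalg.norm(X @ a)           # normalize the canonical variate
                # regress the current X-variate on Y to update the Y-side canonical vector
                b = Lasso(alpha=alpha, fit_intercept=False).fit(Y, X @ a).coef_
                if np.linalg.norm(Y @ b) > 0:
                    b /= np.linalg.norm(Y @ b)
                if np.linalg.norm(a - a_old) < tol:      # stop once the X-side vector settles
                    break
                a_old = a
            return a, b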

    Robust Sparse Canonical Correlation Analysis

    Canonical correlation analysis (CCA) is a multivariate statistical method that describes the associations between two sets of variables. The objective is to find linear combinations of the variables in each data set having maximal correlation. This paper discusses a method for Robust Sparse CCA. Sparse estimation produces canonical vectors with some of their elements estimated as exactly zero, which improves their interpretability. We also robustify the method so that it can cope with outliers in the data. To estimate the canonical vectors, we convert the CCA problem into an alternating regression framework and use the sparse least trimmed squares (LTS) estimator. We illustrate the good performance of the Robust Sparse CCA method in several simulation studies and two real data examples.
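
    In this scheme each regression step is a sparse least trimmed squares fit rather than a plain lasso. The sketch below shows one way such a step could look, using concentration steps around scikit-learn's Lasso; it is an illustrative approximation under assumed tuning choices (subset fraction, penalty, iteration count), not the authors' algorithm.

        import numpy as np
        from sklearn.linear_model import Lasso

        def sparse_lts(X, y, alpha=0.1, h_frac=0.75, n_csteps=25, seed=0):
            """Sparse least trimmed squares sketch: repeatedly refit a lasso on the
            h observations with the smallest squared residuals (concentration steps)."""
            n = X.shape[0]
            h = int(h_frac * n)                           # size of the retained "clean" subset
            rng = np.random.default_rng(seed)
            subset = rng.choice(n, size=h, replace=False) # random starting subset
            for _ in range(n_csteps):
                fit = Lasso(alpha=alpha, fit_intercept=False).fit(X[subset], y[subset])
                resid = y - X @ fit.coef_
                new_subset = np.argsort(resid ** 2)[:h]   # keep the best-fitting observations
                if np.array_equal(np.sort(new_subset), np.sort(subset)):
                    break                                 # subset no longer changes
                subset = new_subset
            return fit.coef_

    Replacing the plain lasso step in the alternating regressions with such an estimator gives a rough robust sparse CCA, since trimmed observations no longer influence the canonical vectors.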

    Sparse cointegration

    Cointegration analysis is used to estimate the long-run equilibrium relations between several time series. The coefficients of these long-run equilibrium relations are the cointegrating vectors. In this paper, we provide a sparse estimator of the cointegrating vectors. The estimation technique is sparse in the sense that some elements of the cointegrating vectors will be estimated as exactly zero. For this purpose, we combine a penalized estimation procedure for vector autoregressive models with sparse reduced rank regression. The sparse cointegration procedure achieves higher estimation accuracy than the traditional Johansen approach in settings where the true cointegrating vectors have a sparse structure and/or when the sample size is low relative to the number of time series. We also discuss a criterion to determine the cointegration rank and illustrate its good performance in several simulation settings. In a first empirical application, we investigate whether the expectations hypothesis of the term structure of interest rates, which implies sparse cointegrating vectors, holds in practice. In a second empirical application, we show that forecast performance in high-dimensional systems can be improved by sparsely estimating the cointegration relations.
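
    As a heavily simplified illustration, the sketch below estimates a single sparse cointegrating vector in a rank-one error-correction model without short-run dynamics, alternating a least squares update of the loadings with a lasso update of the cointegrating vector. The rank-one restriction, the omission of lagged differences, and the penalty value are simplifying assumptions and do not reproduce the paper's full procedure.

        import numpy as np
        from sklearn.linear_model import Lasso

        def sparse_coint_rank1(Y, alpha_pen=0.05, n_iter=50, tol=1e-6):
            """Rank-one sketch: Delta y_t ~ a * (b' y_{t-1}), with a lasso penalty
            on the cointegrating vector b; short-run dynamics are omitted."""
            dY = np.diff(Y, axis=0)                       # first differences, (T-1) x k
            Ylag = Y[:-1]                                 # lagged levels
            k = Y.shape[1]
            b = np.ones(k) / np.sqrt(k)                   # initial cointegrating vector
            for _ in range(n_iter):
                u = Ylag @ b                              # current equilibrium error
                a = dY.T @ u / (u @ u)                    # loadings by least squares
                # update b: vec(dY) = (a kron Ylag) b + error, estimated with a lasso
                design = np.kron(a.reshape(-1, 1), Ylag)
                b_new = Lasso(alpha=alpha_pen, fit_intercept=False).fit(
                    design, dY.flatten(order="F")).coef_
                if np.linalg.norm(b_new) > 0:
                    b_new /= np.linalg.norm(b_new)        # fix the scale of b
                if np.linalg.norm(b_new - b) < tol:
                    break
                b = b_new
            return a, b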

    Commodity Dynamics: A Sparse Multi-class Approach

    A correct understanding of commodity price dynamics can bring relevant improvements in policy formulation for both developing and developed countries. Agricultural, metal and energy commodity prices might depend on each other: although we expect only a few of the many possible effects to be important, some price effects among different commodities might still be substantial. Moreover, the increasing integration of the world economy suggests that these effects should be comparable across markets. This paper introduces a sparse estimator of the Multi-class Vector AutoRegressive model to detect common price effects between a large number of commodities, for different markets or investment portfolios. In a first application, we consider agricultural, metal and energy commodities for three different markets. We show a large prevalence of effects involving metal commodities in the Chinese and Indian markets, and the existence of asymmetric price effects. In a second application, we analyze commodity prices for five different investment portfolios and highlight the existence of important effects from energy to agricultural commodities. The relevance of biofuels is hereby confirmed. Overall, we find stronger similarities in commodity price effects among portfolios than among markets.

    Sparse Identification and Estimation of Large-Scale Vector AutoRegressive Moving Averages

    The Vector AutoRegressive Moving Average (VARMA) model is fundamental to the theory of multivariate time series; however, in practice, identifiability issues have led many authors to abandon VARMA modeling in favor of the simpler Vector AutoRegressive (VAR) model. Such a practice is unfortunate since even very simple VARMA models can have quite complicated VAR representations. We narrow this gap with a new optimization-based approach to VARMA identification that is built upon the principle of parsimony. Among all equivalent data-generating models, we seek the parameterization that is "simplest" in a certain sense. A user-specified strongly convex penalty is used to measure model simplicity, and that same penalty is then used to define an estimator that can be efficiently computed. We show that our estimator converges to a parsimonious element in the set of all equivalent data-generating models, in a double asymptotic regime where the number of component time series is allowed to grow with sample size. Further, we derive non-asymptotic upper bounds on the estimation error of our method relative to our specially identified target. Novel theoretical machinery includes non-asymptotic analysis of infinite-order VAR, elastic net estimation under a singular covariance structure of regressors, and new concentration inequalities for quadratic forms of random variables from Gaussian time series. We illustrate the competitive performance of our methods in simulation and several application domains, including macro-economic forecasting, demand forecasting, and volatility forecasting.
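
    The abstract mentions infinite-order VAR approximations and elastic net estimation. One practical route in that spirit, shown below as a rough sketch rather than the authors' exact algorithm, is a two-stage fit: approximate the VARMA by a long penalized VAR to obtain residuals, then regress each series on its own lags and the lagged residuals with an elastic net. The lag orders and penalty values are placeholder choices.

        import numpy as np
        from sklearn.linear_model import ElasticNet, Lasso

        def lagged(Y, p):
            """Stack lags 1..p of the T x k matrix Y into a (T-p) x (k*p) design matrix."""
            T = Y.shape[0]
            return np.hstack([Y[p - j - 1:T - j - 1] for j in range(p)])

        def varma_two_stage(Y, p_long=10, p=2, q=2, alpha=0.1, l1_ratio=0.5):
            k = Y.shape[1]
            # Stage 1: long penalized VAR approximation, kept only for its residuals
            X1, Y1 = lagged(Y, p_long), Y[p_long:]
            resid = np.column_stack([
                Y1[:, i] - X1 @ Lasso(alpha=alpha, fit_intercept=False).fit(X1, Y1[:, i]).coef_
                for i in range(k)])
            # Stage 2: regress y_t on its own lags and lagged stage-1 residuals (elastic net)
            m = max(p, q)
            X2 = np.hstack([lagged(Y1, m)[:, :k * p], lagged(resid, m)[:, :k * q]])
            Y2 = Y1[m:]
            coefs = np.vstack([
                ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False)
                .fit(X2, Y2[:, i]).coef_ for i in range(k)])
            return coefs    # one row of AR and MA coefficients per equation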

    Monitoring Machine Learning Forecasts for Platform Data Streams

    Data stream forecasts are essential inputs for decision making at digital platforms. Machine learning (ML) algorithms are appealing candidates to produce such forecasts. Yet, digital platforms require a large-scale forecast framework that can flexibly respond to sudden performance drops. Re-training ML algorithms at the same speed as new data batches arrive is usually computationally too costly. On the other hand, infrequent re-training requires specifying the re-training frequency and typically comes at a severe cost of forecast deterioration. To ensure accurate and stable forecasts, we propose a simple data-driven monitoring procedure to answer the question of when the ML algorithm should be re-trained. Instead of investigating instability of the data streams, we test whether the incoming batch of streaming forecast losses differs from a well-defined reference batch. Using a novel dataset of 15-minute frequency data streams from an on-demand logistics platform operating in London, we apply the monitoring procedure to popular ML algorithms including random forest, XGBoost and the lasso. We show that monitor-based re-training produces accurate forecasts compared to viable benchmarks while preserving computational feasibility. Moreover, the choice of monitoring procedure is more important than the choice of ML algorithm, allowing practitioners to combine the proposed monitoring procedure with their favorite forecasting algorithm.
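
    A minimal sketch of such a monitoring rule: compare the incoming batch of forecast losses with a reference batch and trigger re-training when the incoming losses are significantly larger. The specific test (a one-sided Mann-Whitney rank test), the significance level and the placeholder loss batches are illustrative assumptions; the paper's monitoring procedure is not reproduced here.

        import numpy as np
        from scipy import stats

        def needs_retraining(reference_losses, incoming_losses, level=0.01):
            """Flag re-training when incoming forecast losses are significantly
            worse (larger) than the reference batch, via a one-sided rank test."""
            _, p_value = stats.mannwhitneyu(incoming_losses, reference_losses,
                                            alternative="greater")
            return p_value < level

        # usage sketch with placeholder data: the reference batch would come from
        # the losses observed right after the most recent (re-)training
        reference = np.abs(np.random.normal(0.0, 1.0, size=200))
        incoming = np.abs(np.random.normal(0.5, 1.0, size=200))
        if needs_retraining(reference, incoming):
            pass  # re-fit the forecasting model on the extended data window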

    Interpretable Vector AutoRegressions with Exogenous Time Series

    The Vector AutoRegressive (VAR) model is fundamental to the study of multivariate time series. Although VAR models are intensively investigated by many researchers, practitioners often show more interest in analyzing VARX models that incorporate the impact of unmodeled exogenous variables (X) into the VAR. However, since the parameter space grows quadratically with the number of time series, estimation quickly becomes challenging. While several proposals have been made to sparsely estimate large VAR models, the estimation of large VARX models is under-explored. Moreover, these sparse proposals typically involve a lasso-type penalty and do not incorporate lag selection into the estimation procedure. As a consequence, the resulting models may be difficult to interpret. In this paper, we propose a lag-based hierarchically sparse estimator, called "HVARX", for large VARX models. We illustrate the usefulness of HVARX on a cross-category management marketing application. Our results show that it provides a highly interpretable model and improves out-of-sample forecast accuracy compared to a lasso-type approach.
    Comment: Presented at the NIPS 2017 Symposium on Interpretable Machine Learning.
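
    To give a flavour of lag-based sparsity, the sketch below fits a single VARX equation with a per-lag group penalty via proximal gradient descent, so that an entire lag of the endogenous or exogenous block can be zeroed out at once. This is a simplified, non-hierarchical stand-in for the HVARX estimator described above, with illustrative tuning choices.

        import numpy as np

        def group_soft_threshold(v, t):
            """Shrink a whole coefficient group toward zero; a zero group drops that lag."""
            norm = np.linalg.norm(v)
            return np.zeros_like(v) if norm <= t else (1 - t / norm) * v

        def lagwise_sparse_varx_equation(y, Y_lags, X_lags, lam=0.1, n_iter=500):
            """Fit one VARX equation with a per-lag group penalty (proximal gradient).
            Y_lags / X_lags: lists of lag matrices (one per lag, equal row counts)
            for the endogenous and exogenous series."""
            groups = Y_lags + X_lags
            Z = np.hstack(groups)
            sizes = [g.shape[1] for g in groups]
            beta = np.zeros(Z.shape[1])
            step = 1.0 / np.linalg.norm(Z, 2) ** 2        # 1 / Lipschitz constant of the gradient
            for _ in range(n_iter):
                grad = Z.T @ (Z @ beta - y)               # gradient of the squared error loss
                b = beta - step * grad
                out, pos = [], 0
                for s in sizes:                           # proximal step: group-wise soft threshold
                    out.append(group_soft_threshold(b[pos:pos + s], step * lam))
                    pos += s
                beta = np.concatenate(out)
            return beta, sizes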