
    Predictive PAC Learning and Process Decompositions

    We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (\beta-mixing) processes, of independent probability-theoretic interest. Comment: 9 pages, accepted in NIPS 201
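
    A hedged sketch of the kind of statement involved (the notation is illustrative, written as if each component were stationary, and is not taken from the paper): if each mixture component \theta satisfies a uniform convergence bound over a concept class \mathcal{F} of finite VC-dimension,

        P_\theta\Big( \sup_{f \in \mathcal{F}} \Big| \tfrac{1}{n} \sum_{t=1}^{n} f(X_t) - E_\theta[f(X_1)] \Big| > \epsilon \Big) \leq \delta(n, \epsilon),

    then conditioning the guarantee on the mixture component of the observed sample path, rather than on past observations, lets the mixture inherit a bound of the same order \delta(n, \epsilon).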

    Fast rates in learning with dependent observations

    In this paper we tackle the problem of fast rates in time series forecasting from a statistical learning perspective. A series of papers (e.g. Meir 2000, Modha and Masry 1998, Alquier and Wintenberger 2012) shows that the main tools used in learning theory with iid observations can be extended to the prediction of time series. The main message of these papers is that, given a family of predictors, we are able to build a new predictor that predicts the series as well as the best predictor in the family, up to a remainder of order 1/\sqrt{n}. It is known that this rate cannot be improved in general. In this paper, we show that in the particular case of the least square loss, and under a strong assumption on the time series (\phi-mixing), the remainder is actually of order 1/n. Thus, the optimal rate for iid variables (see e.g. Tsybakov 2003) and for individual sequences (see \cite{lugosi}) is, for the first time, achieved for uniformly mixing processes. We also show that our method is optimal for aggregating sparse linear combinations of predictors.
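
    In schematic form (notation mine, not the paper's): writing R for the prediction risk, \{f_\theta\} for the family of predictors and \hat{f} for the predictor built from n observations, the oracle inequalities discussed here read roughly

        R(\hat{f}) \leq \inf_{\theta} R(f_\theta) + O(1/\sqrt{n})

    in general, with the remainder improving to O(1/n) for the least square loss under the uniform (\phi-)mixing assumption.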

    Limit theorems for a class of identically distributed random variables

    A new type of stochastic dependence for a sequence of random variables is introduced and studied. Precisely, (X_n)_{n\geq 1} is said to be conditionally identically distributed (c.i.d.), with respect to a filtration (G_n)_{n\geq 0}, if it is adapted to (G_n)_{n\geq 0} and, for each n\geq 0, (X_k)_{k>n} is identically distributed given the past G_n. In case G_0={\varnothing,\Omega} and G_n=\sigma(X_1,...,X_n), a result of Kallenberg implies that (X_n)_{n\geq 1} is exchangeable if and only if it is stationary and c.i.d. After giving some natural examples of nonexchangeable c.i.d. sequences, it is shown that (X_n)_{n\geq 1} is exchangeable if and only if (X_{\tau(n)})_{n\geq 1} is c.i.d. for any finite permutation \tau of {1,2,...}, and that the distribution of a c.i.d. sequence agrees with an exchangeable law on a certain sub-\sigma-field. Moreover, (1/n)\sum_{k=1}^n X_k converges a.s. and in L^1 whenever (X_n)_{n\geq 1} is (real-valued) c.i.d. and E[|X_1|]<\infty. As to the CLT, three types of random centering are considered. One such centering, significant in Bayesian prediction and discrete time filtering, is E[X_{n+1}|G_n]. For each centering, convergence in distribution of the corresponding empirical process is analyzed under uniform distance. Comment: Published by the Institute of Mathematical Statistics (http://www.imstat.org) in the Annals of Probability (http://www.imstat.org/aop/) at http://dx.doi.org/10.1214/00911790400000067
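
    Written out in the abstract's notation, the c.i.d. property asks that for every n \geq 0, every k > n and every bounded measurable f,

        E[ f(X_k) | G_n ] = E[ f(X_{n+1}) | G_n ]   a.s.,

    i.e. all future observations share one conditional distribution given the past G_n; with the natural filtration and trivial G_0, exchangeability is exactly stationarity plus this property.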

    Model selection for weakly dependent time series forecasting

    Observing a stationary time series, we propose a two-step procedure for the prediction of the next value of the time series. The first step follows the machine learning theory paradigm and consists in determining a set of possible predictors as randomized estimators in (possibly numerous) different predictive models. The second step follows the model selection paradigm and consists in choosing one predictor with good properties among all the predictors of the first step. We study our procedure for two different types of observations: causal Bernoulli shifts and bounded weakly dependent processes. In both cases, we give oracle inequalities: the risk of the chosen predictor is close to the best prediction risk in all predictive models that we consider. We apply our procedure to predictive models such as linear predictors, neural network predictors and non-parametric autoregressive predictors.
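
    Schematically (notation mine, not the paper's): if the first step returns one randomized predictor \hat{f}_j per predictive model \Theta_j, j = 1, ..., M, and the second step selects an index \hat{j}, an oracle inequality of the kind described bounds the risk as

        R(\hat{f}_{\hat{j}}) \leq \min_{j} \Big( \inf_{f \in \Theta_j} R(f) + \Delta_{n,j} \Big),

    up to constants, so the chosen predictor performs nearly as well as the best prediction risk over all the models considered.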

    Prediction of time series by statistical learning: general losses and fast rates

    We establish rates of convergence in time series forecasting using the statistical learning approach based on oracle inequalities. A series of papers extends the oracle inequalities obtained for iid observations to time series under weak dependence conditions. Given a family of predictors and n observations, oracle inequalities state that a predictor forecasts the series as well as the best predictor in the family up to a remainder term \Delta_n. Using the PAC-Bayesian approach, we establish oracle inequalities with optimal rates of convergence under weak dependence conditions. We extend previous results for the absolute loss function to any Lipschitz loss function with rates \Delta_n \sim \sqrt{c(\Theta)/n}, where c(\Theta) measures the complexity of the model. We apply the method with quantile loss functions to forecast the French GDP. Under additional conditions on the loss function (satisfied by the quadratic loss function) and on the time series, we refine the rates of convergence to \Delta_n \sim c(\Theta)/n. We achieve these fast rates for uniformly mixing processes for the first time. These rates are known to be optimal in the iid case and for individual sequences. In particular, we generalize the results of Dalalyan and Tsybakov on sparse regression estimation to the case of autoregression.
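
    As a hedged illustration of the PAC-Bayesian machinery behind such bounds (the construction and notation below are standard in this line of work, not taken verbatim from the paper): given a prior \pi on \Theta, an empirical prediction risk r_n and an inverse temperature \lambda > 0, one aggregates or draws predictors from the Gibbs posterior

        \hat{\rho}(d\theta) \propto \exp(-\lambda r_n(\theta)) \, \pi(d\theta),

    and oracle inequalities of the form R(\hat{f}) \leq \inf_{\theta \in \Theta} R(f_\theta) + \Delta_n are then derived for the resulting predictor, with \Delta_n of the orders quoted above.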