
    Risk bounds of learning processes for Lévy processes

    Lévy processes are a class of stochastic processes that includes, for example, Poisson processes and Brownian motions, and they play an important role in the study of stochastic processes and in machine learning. It is therefore essential to study risk bounds of the learning process for time-dependent samples drawn from a Lévy process (briefly, the learning process for a Lévy process). Notably, the samples in this learning process are not independently and identically distributed (i.i.d.), so results from traditional statistical learning theory are not applicable (or at least cannot be applied directly), because they are obtained under the sample-i.i.d. assumption. In this paper, we study risk bounds of the learning process for time-dependent samples drawn from a Lévy process, and then analyze the asymptotic behavior of the learning process. In particular, we first develop deviation inequalities and a symmetrization inequality for the learning process. Using the resulting inequalities, we then obtain risk bounds based on the covering number. Finally, based on the resulting risk bounds, we study the asymptotic convergence and the rate of convergence of the learning process for a Lévy process. We also give a comparison with the related results under the sample-i.i.d. assumption. © 2013 Chao Zhang and Dacheng Tao
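    As a concrete illustration of the kind of time-dependent, non-i.i.d. samples the paper studies, here is a minimal numpy sketch that simulates a simple Lévy process (drifted Brownian motion plus compound Poisson jumps). All parameter values are illustrative assumptions, not taken from the paper.

        import numpy as np

        rng = np.random.default_rng(0)

        def levy_path(n, dt, mu=0.1, sigma=1.0, jump_rate=0.5, jump_scale=2.0):
            """Jump-diffusion Levy process: drift + Brownian motion
            + compound Poisson jumps with Gaussian jump sizes."""
            # Gaussian increments of the continuous (Brownian) part
            gauss = mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
            # Compound Poisson part: jump counts per interval, then jump sizes
            counts = rng.poisson(jump_rate * dt, size=n)
            jumps = np.array([rng.normal(0.0, jump_scale, k).sum() for k in counts])
            return np.cumsum(gauss + jumps)

        # The time-dependent training sample: one path observed on a grid
        X = levy_path(n=1000, dt=0.01)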

    Concentration for high-dimensional linear processes with dependent innovations

    We develop concentration inequalities for the $\ell_\infty$ norm of vector linear processes on mixingale sequences with sub-Weibull tails. These inequalities make use of the Beveridge-Nelson decomposition, which reduces the problem to concentration for the sup-norm of a vector-mixingale or its weighted sum. This inequality is used to obtain a concentration bound for the maximum entrywise norm of the lag-$h$ autocovariance matrices of linear processes. These results are useful for estimation bounds for high-dimensional vector-autoregressive processes estimated using $\ell_1$ regularisation, high-dimensional Gaussian bootstrap for time series, and long-run covariance matrix estimation.
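    The bounded quantity is easy to compute empirically. The sketch below simulates a simple vector linear process (a VAR(1); for simplicity the innovations are i.i.d. rather than the mixingale innovations covered by the paper) and evaluates the maximum entrywise error of the lag-$h$ sample autocovariance. Dimensions and coefficients are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(1)

        def lag_h_autocov(X, h):
            """Sample lag-h autocovariance matrix of a (T, d) array X."""
            T = X.shape[0]
            Xc = X - X.mean(axis=0)
            return Xc[h:].T @ Xc[:T - h] / T

        T, d, rho = 2000, 50, 0.5
        X = np.zeros((T, d))                      # VAR(1) with A = rho * I
        for t in range(1, T):
            X[t] = rho * X[t - 1] + rng.standard_normal(d)

        # For this diagonal VAR(1), Gamma(1) = rho / (1 - rho^2) * I; the max
        # entrywise deviation below is the quantity the bound controls
        # (the zero start adds a transient that is negligible at T = 2000)
        err = np.abs(lag_h_autocov(X, 1) - rho / (1 - rho**2) * np.eye(d)).max()
        print(err)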

    Optimal Spatial Prediction Using Ensemble Machine Learning

    Spatial prediction is an important problem in many scientific disciplines. Super Learner is an ensemble prediction approach related to stacked generalization that uses cross-validation to search for the optimal predictor amongst all convex combinations of a heterogeneous candidate set. It has been applied to non-spatial data, where theoretical results demonstrate that it will perform asymptotically at least as well as the best candidate under consideration. We review these optimality properties and discuss the assumptions required for them to hold in spatial prediction problems. We present results of a simulation study confirming that Super Learner works well in practice under a variety of sample sizes, sampling designs, and data-generating functions. We also apply Super Learner to a real-world dataset.
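    A minimal Python sketch of the Super Learner idea (the reference implementation is the R SuperLearner package): cross-validated predictions are computed for every candidate, and a convex weight vector minimizing the cross-validated squared error is found by constrained optimization. The candidate library and simulated data are illustrative assumptions.

        import numpy as np
        from scipy.optimize import minimize
        from sklearn.model_selection import cross_val_predict
        from sklearn.linear_model import LinearRegression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.neighbors import KNeighborsRegressor

        def super_learner_weights(candidates, X, y, cv=10):
            """Convex combination of candidate learners minimizing CV MSE."""
            Z = np.column_stack([cross_val_predict(m, X, y, cv=cv)
                                 for m in candidates])
            k = Z.shape[1]
            obj = lambda w: np.mean((y - Z @ w) ** 2)
            cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
            res = minimize(obj, np.full(k, 1.0 / k),
                           bounds=[(0.0, 1.0)] * k, constraints=cons)
            return res.x

        rng = np.random.default_rng(2)
        X = rng.uniform(-2, 2, (300, 2))
        y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.standard_normal(300)
        library = [LinearRegression(),
                   RandomForestRegressor(n_estimators=100),
                   KNeighborsRegressor(n_neighbors=5)]
        print(super_learner_weights(library, X, y))   # convex weights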

    Lasso Inference for High-Dimensional Time Series

    In this paper we develop valid inference methods for high-dimensional time series. We extend the desparsified lasso to a time series setting under Near-Epoch Dependence (NED) assumptions allowing for non-Gaussian, serially correlated and heteroskedastic processes, where the number of regressors can possibly grow faster than the time dimension. We first derive an oracle inequality for the (regular) lasso, relaxing the commonly made exact sparsity assumption to a weaker alternative, which permits many small but non-zero parameters. The weak sparsity coupled with the NED assumption means this inequality can also be applied to the (inherently misspecified) nodewise regressions performed in the desparsified lasso. This allows us to establish the uniform asymptotic normality of the desparsified lasso under general conditions. Additionally, we show consistency of a long-run variance estimator, thus providing a complete set of tools for performing inference in high-dimensional linear time series models. Finally, we perform a simulation exercise to demonstrate the small-sample properties of the desparsified lasso in common time series settings.
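    For orientation, here is a standard i.i.d.-style sketch of the desparsified-lasso construction itself (lasso fit, nodewise regressions, one-step bias correction); the paper's contribution is establishing its validity under NED dependence and weak sparsity. The tuning parameters lam and lam_node are user-chosen assumptions.

        import numpy as np
        from sklearn.linear_model import Lasso

        def desparsified_lasso(X, y, lam, lam_node):
            """Debiased lasso point estimates via nodewise regressions."""
            n, p = X.shape
            beta = Lasso(alpha=lam).fit(X, y).coef_
            Theta = np.zeros((p, p))              # approximate precision matrix
            for j in range(p):
                idx = np.arange(p) != j
                gamma = Lasso(alpha=lam_node).fit(X[:, idx], X[:, j]).coef_
                resid = X[:, j] - X[:, idx] @ gamma
                tau2 = resid @ X[:, j] / n        # KKT-based scaling tau_j^2
                row = np.zeros(p)
                row[j], row[idx] = 1.0, -gamma
                Theta[j] = row / tau2
            # One-step bias correction of the lasso estimate
            return beta + Theta @ X.T @ (y - X @ beta) / n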

    Three Essays on Growth and Innovation of Digital Platforms

    Digital platforms are complex digital technology arrangements that enable the interaction of otherwise unaffiliated organisations. This interaction often generates novel outputs, and as a result digital platforms are seen as a powerful driver of digital innovation. Yet exactly how digital platforms generate innovations by facilitating interaction merits further investigation. This dissertation illustrates aspects of how platforms grow and innovate using the case of the open geo-data platform OpenStreetMap. The study draws on both quantitative and qualitative analysis techniques applied to highly detailed data capturing the use, design, and operation of the platform over more than ten years. A series of computationally intensive, mixed-methods studies were conducted to utilise the full scale of available empirical material while maintaining contextual richness relevant to the case. Embedded in recent topics on digital platforms, three empirical studies are presented, each focusing on one aspect of growth and innovation on digital platforms. The studies specifically examine: (i) how platform operators can stimulate generativity, that is, the generation of novel outputs without direct input by the operator; (ii) how the unique attributes of digital technologies enable the creation of complex ecosystems that allow for high-paced changes in a platform's architecture even if that increases the structural complexity of the platform; and (iii) how participants coordinate contributions to a platform's operation when they cannot rely on stable interfaces. Collectively, these studies contribute to the understanding of how platforms generate new digital innovations.

    Finite-sample analysis of M-estimators using self-concordance

    The classical asymptotic theory for parametric $M$-estimators guarantees that, in the limit of infinite sample size, the excess risk has a chi-square type distribution, even in the misspecified case. We demonstrate how self-concordance of the loss allows one to characterize the critical sample size sufficient to guarantee a chi-square type in-probability bound for the excess risk. Specifically, we consider two classes of losses: (i) self-concordant losses in the classical sense of Nesterov and Nemirovski, i.e., whose third derivative is uniformly bounded by the $3/2$ power of the second derivative; (ii) pseudo self-concordant losses, for which the power is removed. These classes contain losses corresponding to several generalized linear models, including the logistic loss and pseudo-Huber losses. Our basic result under minimal assumptions bounds the critical sample size by $O(d \cdot d_{\text{eff}})$, where $d$ is the parameter dimension and $d_{\text{eff}}$ is the effective dimension that accounts for model misspecification. In contrast to existing results, we only impose local assumptions that concern the population risk minimizer $\theta_*$. Namely, we assume that the calibrated design, i.e., the design scaled by the square root of the second derivative of the loss, is subgaussian at $\theta_*$. Besides, for type-(ii) losses we require boundedness of a certain measure of curvature of the population risk at $\theta_*$. Our improved result bounds the critical sample size from above by $O(\max\{d_{\text{eff}}, d \log d\})$ under slightly stronger assumptions; namely, the local assumptions must hold in the neighborhood of $\theta_*$ given by the Dikin ellipsoid of the population risk. Interestingly, we find that, for logistic regression with Gaussian design, there is no actual restriction of conditions: the subgaussian parameter and curvature measure remain near-constant over the Dikin ellipsoid. Finally, we extend some of these results to $\ell_1$-penalized estimators in high dimensions.
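    To make the effective dimension concrete, the sketch below computes an empirical sandwich-type plug-in, tr(H^{-1} G), for the logistic loss, with H the Hessian of the empirical risk and G the score covariance; under correct specification it is close to $d$. This is an assumed, illustrative definition that may differ in detail from the paper's exact $d_{\text{eff}}$.

        import numpy as np

        def effective_dimension(X, y, theta):
            """Sandwich-type plug-in tr(H^{-1} G) for the logistic loss."""
            p = 1.0 / (1.0 + np.exp(-X @ theta))
            w_h = p * (1.0 - p)                    # second derivative of loss
            w_g = (y - p) ** 2                     # squared score weights
            H = (X * w_h[:, None]).T @ X / len(y)  # empirical Hessian
            G = (X * w_g[:, None]).T @ X / len(y)  # score covariance
            return np.trace(np.linalg.solve(H, G))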

    A Copula-based approach to differential gene expression analysis

    Thesis submitted in total fulfillment of the requirements for the degree of Doctor of Philosophy in Biostatistics at Strathmore University. Microarray technology has revolutionized genomic studies by enabling the study of differential expression of thousands of genes simultaneously. The main objective in microarray experiments is to identify a panel of genes that are associated with a disease outcome or trait. In this thesis, we develop and evaluate a semi-parametric copula-based algorithm for gene selection that does not depend on the distributions of the covariates, except that their marginal distributions are continuous. A comparison of the developed method with existing methods is made, via a simulation study, based on the power to identify differentially expressed genes (DEGs) and control of the Type I error rate. Simulations indicate that the copula-based model has reasonable power in selecting differentially expressed genes and good control of the Type I error rate. These results are validated in a publicly available melanoma dataset. The copula-based approach turns out to be useful in finding genes that are clinically important. Relaxing parametric assumptions on microarray data may yield procedures that have good power for differential gene expression analysis.
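    As a loose illustration of the copula idea (not necessarily the thesis's exact algorithm), the sketch below computes a normal-scores (Gaussian-copula) correlation between each gene's expression and the outcome. Only continuity of the margins is used: the ranks, and hence the scores, are invariant to monotone transformations of each margin.

        import numpy as np
        from scipy.stats import norm, rankdata

        def gaussian_copula_assoc(expr, outcome):
            """Normal-scores correlation of each gene (columns of expr,
            an n-by-genes array) with a continuous outcome vector."""
            n = len(outcome)
            scores = lambda v: norm.ppf(rankdata(v) / (n + 1))
            z_out = scores(outcome)
            return np.array([np.corrcoef(scores(g), z_out)[0, 1]
                             for g in expr.T])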

    Spatial Dependence and Heterogeneity in Empirical Analyses of Regional Labour Market Dynamics

    Are regions within a country really independent islands? Do economic relations and effects really have a homogeneous, uniform magnitude across an entire country? These two assumptions are often imposed implicitly in empirical economic and social research. In his doctoral thesis, the author discusses how statistical methods can depart from this unrealistic model structure by exploiting spatial patterns in both observable variables and presumed relations. Opportunities to improve our understanding of the economy, as well as chances and perils in the application of such methods, are demonstrated in a number of studies on aspects of regional labour market dynamics.

    Essays in Statistics

    This thesis comprises several contributions to the field of mathematical statistics, particularly with regard to computational issues of Bayesian statistics and functional data analysis. The first two chapters are concerned with computational Bayesian approaches that allow one to generate samples from an approximation to the posterior distribution in settings where the likelihood function of some statistical model of interest is unknown. This has led to a class of Approximate Bayesian Computation (ABC) methods whose performance depends on the ability to effectively summarize the information content of the data sample by a lower-dimensional vector of summary statistics. Ideally, these statistics are sufficient for the parameter of interest. However, it is difficult to establish sufficiency in a straightforward way if the likelihood of the model is unavailable.

    In Chapter 1 we propose an indirect approach to selecting sufficient summary statistics for ABC methods that borrows its intuition from the indirect estimation literature in econometrics. More precisely, we introduce an auxiliary statistical model that is large enough to contain the structural model of interest. Summary statistics are then identified in this auxiliary model and mapped to the structural model of interest. We show sufficiency of these statistics for Indirect ABC methods based on parameter estimates (ABC-IP), likelihood functions (ABC-IL), and scores (ABC-IS) of the auxiliary model. A detailed simulation study investigates the performance of each proposal and compares it to a traditional, moment-based ABC approach. In particular, the ABC-IL and ABC-IS algorithms are shown to perform better than both standard ABC and the ABC-IP method.

    In Chapter 2 we extend Indirect ABC methods by proposing an efficient way of weighting the individual entries of the vector of summary statistics obtained from the score-based Indirect ABC approach (ABC-IS). In particular, the weighting matrix is given by the inverse of the asymptotic covariance matrix of the score vector of the auxiliary model, and it allows us to appropriately assess the distance between the true posterior distribution and the approximation based on the ABC-IS method. We illustrate the performance gain in a simulation study. An empirical application then applies the weighted ABC-IS method to the problem of estimating a continuous-time stochastic volatility model based on non-Gaussian Ornstein-Uhlenbeck processes. We show how a suitable auxiliary model can be constructed and confirm estimation results from concurring Bayesian estimation approaches suggested in the literature. The basic rejection mechanism that these indirect variants refine is sketched below.
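    A minimal sketch of plain rejection ABC, with a hand-picked summary statistic; in the indirect approaches above, the summaries would instead be derived from the auxiliary model. All names and parameter values here are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(3)

        def abc_rejection(data, simulate, summarize, prior_draw, n_draws, eps):
            """Keep prior draws whose simulated summaries land within
            eps of the observed summaries."""
            s_obs = summarize(data)
            kept = []
            for _ in range(n_draws):
                theta = prior_draw()
                if np.linalg.norm(summarize(simulate(theta)) - s_obs) < eps:
                    kept.append(theta)
            return np.array(kept)

        # Toy example: infer the mean of a normal with known variance;
        # the sample mean is a sufficient summary statistic here
        data = rng.normal(1.5, 1.0, 100)
        post = abc_rejection(
            data,
            simulate=lambda th: rng.normal(th, 1.0, 100),
            summarize=lambda x: np.array([x.mean()]),
            prior_draw=lambda: rng.normal(0.0, 5.0),
            n_draws=20000, eps=0.05)
        print(post.mean(), post.std())   # approximate posterior moments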
    In Chapter 3 we consider the problem of sampling from high-dimensional probability distributions that exhibit multiple, well-separated modes. Such distributions arise frequently, for instance, in the Bayesian estimation of macroeconomic DSGE models. Standard Markov Chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings algorithm, are prone to getting trapped in local neighborhoods of the target distribution, severely limiting the use of these methods in more complex models. We suggest the use of a Sequential Markov Chain Monte Carlo approach to overcome these difficulties and investigate its finite-sample properties. The results show that Sequential MCMC methods clearly outperform standard MCMC approaches in a multimodal setting and can recover both the locations and the mixture weights in a 12-dimensional mixture model. Moreover, we provide a detailed comparison of the effects that different choices of tuning parameters have on the approximation to the true sampling distribution. These results can serve as valuable guidelines when applying this method to more complex economic models, such as the (Bayesian) estimation of Dynamic Stochastic General Equilibrium models.

    Chapters 4 and 5 study the statistical problem of prediction from a functional perspective. In many statistical applications, data is becoming available at ever increasing frequencies, and it has thus become natural to think of discrete observations as realizations of a continuous function, say over the course of one day. However, as functions are, generally speaking, infinite-dimensional objects, the statistical analysis of such functional data is intrinsically different from standard multivariate techniques.

    In Chapter 4 we consider prediction in functional additive models of first-order autoregressive type for a time series of functional observations. This is a generalization of the functional linear models commonly considered in the literature, with two advantages in a functional time series setting. First, it allows us to introduce a very general notion of time dependence for functional data: it is rooted in the correlation structure of the functional principal component scores and even allows for long-memory behavior in the score series across the time dimension. Second, prediction in this framework is straightforward to implement, as it only concerns conditional means of scalar random variables, for which we suggest a k-nearest neighbors estimation scheme. The theoretical contributions of this chapter are twofold. In a first step, we verify the applicability of functional principal components analysis under our notion of time dependence and obtain precise rates of convergence for the mean function and the covariance operator associated with the observed sample of functions. In a second step, we derive precise rates of convergence of the mean squared error for the proposed predictor, taking into account both the effect of truncating the infinite series expansion at some finite integer L and the effect of estimating the covariance operator and the associated eigenelements from a sample of N curves.

    In Chapter 5 we investigate the performance of functional models in a forecasting study of ground-level ozone-concentration surfaces over the geographical domain of Germany. Our perspective thus differs from the literature on spatially distributed functional processes (considered there as univariate functions of time that show spatial dependence) in that we consider smooth surfaces defined over some spatial domain that are sampled consecutively over time. In particular, we treat discrete observations that are sampled both over a spatial domain and over time as noisy realizations of a time series of smooth bivariate functions. In a first step we therefore discuss how smooth functions can be reconstructed from such noisy measurements through a finite element spline smoother defined over a triangulation of the spatial domain. In a second step we consider two forecasting approaches for functional time series: the first is a functional linear model of first-order autoregressive type, whereas the second is the non-parametric extension to functional additive models discussed in Chapter 4. Both approaches are applied to predicting ground-level ozone concentrations over the spatial domain of Germany and are shown to yield similar predictions.
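    A stylized sketch of the score-based prediction idea of Chapter 4: curves are reduced to functional principal component scores, next-period scores are predicted by k-nearest neighbors, and the forecast curve is reconstructed. The truncation level L and neighborhood size k are illustrative tuning assumptions, and this is a stand-in for, not a reproduction of, the thesis's estimator.

        import numpy as np
        from sklearn.neighbors import KNeighborsRegressor

        def fpca(Y, L):
            """Truncated functional PCA of curves Y (N x grid) via the SVD."""
            mu = Y.mean(axis=0)
            U, s, Vt = np.linalg.svd(Y - mu, full_matrices=False)
            return mu, Vt[:L], U[:, :L] * s[:L]   # mean, basis, N x L scores

        def predict_next_curve(Y, L=3, k=5):
            """One-step-ahead forecast: kNN regression of scores at t+1
            on scores at t, then reconstruction on the original grid."""
            mu, basis, scores = fpca(Y, L)
            knn = KNeighborsRegressor(n_neighbors=k).fit(scores[:-1], scores[1:])
            return mu + (knn.predict(scores[-1:]) @ basis).ravel()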