
    Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics

    In a wide range of statistical learning problems such as ranking, clustering or metric learning among others, the risk is accurately estimated by $U$-statistics of degree $d \geq 1$, i.e. functionals of the training data with low variance that take the form of averages over $k$-tuples. From a computational perspective, the calculation of such statistics is highly expensive even for a moderate sample size $n$, as it requires averaging $O(n^d)$ terms. This makes learning procedures relying on the optimization of such data functionals hardly feasible in practice. It is the major goal of this paper to show that, strikingly, such empirical risks can be replaced by drastically computationally simpler Monte-Carlo estimates based on $O(n)$ terms only, usually referred to as incomplete $U$-statistics, without damaging the $O_{\mathbb{P}}(1/\sqrt{n})$ learning rate of Empirical Risk Minimization (ERM) procedures. For this purpose, we establish uniform deviation results describing the error made when approximating a $U$-process by its incomplete version under appropriate complexity assumptions. Extensions to model selection, fast rate situations and various sampling techniques are also considered, as well as an application to stochastic gradient descent for ERM. Finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques.
    Comment: To appear in Journal of Machine Learning Research. 34 pages. v2: minor correction to Theorem 4 and its proof, added 1 reference. v3: typo corrected in Proposition 3. v4: improved presentation, added experiments on model selection for clustering, fixed minor typo
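
    As a toy illustration of the trade-off described above (a sketch under our own assumptions, not the authors' code), the snippet below compares a complete degree-2 U-statistic, which averages over all pairs, with an incomplete Monte-Carlo version that averages over only B randomly sampled pairs; the kernel and sample size are made up for the example.

```python
# Minimal sketch (not from the paper): complete vs. incomplete U-statistic
# of degree 2, where the incomplete version averages only B sampled pairs.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)          # toy scalar data; the kernel acts on pairs (x_i, x_j)

def kernel(a, b):
    # Illustrative symmetric kernel, e.g. a pairwise "discrepancy"
    return (a - b) ** 2

# Complete U-statistic: average over all O(n^2) pairs
complete = np.mean([kernel(x[i], x[j]) for i, j in combinations(range(n), 2)])

# Incomplete U-statistic: average over B = O(n) pairs drawn at random
B = n
idx_i = rng.integers(0, n, size=B)
idx_j = rng.integers(0, n, size=B)
mask = idx_i != idx_j                      # discard degenerate pairs (i == j)
incomplete = np.mean(kernel(x[idx_i[mask]], x[idx_j[mask]]))

print(f"complete: {complete:.4f}  incomplete (B={B}): {incomplete:.4f}")
```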

    Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning

    The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is nicely separable across individual data points, such as classification and regression. In contrast, statistical learning tasks involving pairs (or more generally tuples) of data points, such as metric learning, clustering or ranking, do not lend themselves as easily to data-parallelism and in-memory computing. In this paper, we investigate how to balance statistical performance and computational efficiency in such distributed tuplewise statistical problems. We first propose a simple strategy based on occasionally repartitioning data across workers between parallel computation stages, where the number of repartitioning steps governs the trade-off between accuracy and runtime. We then present some theoretical results highlighting the benefits brought by the proposed method in terms of variance reduction, and extend our results to design distributed stochastic gradient descent algorithms for tuplewise empirical risk minimization. Our results are supported by numerical experiments in pairwise statistical estimation and learning on synthetic and real-world datasets.
    Comment: 23 pages, 6 figures, ECML 201
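
    The repartitioning idea can be sketched as follows (a minimal single-machine simulation under assumed names and sizes, not the paper's distributed implementation): each "worker" averages a pairwise statistic only over the pairs inside its own block, and the data are reshuffled across workers between stages to reduce the variance of the combined estimate.

```python
# Minimal sketch (assumptions, not the paper's implementation): estimate a
# pairwise statistic by averaging only within-worker pairs, and reshuffle
# (repartition) the data between stages to reduce the estimator's variance.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(size=2000)
n_workers, n_stages = 8, 4          # illustrative cluster size and number of repartitions

def within_block_pairwise_mean(block):
    # Average of |x_i - x_j| over all pairs inside one worker's block
    diff = np.abs(block[:, None] - block[None, :])
    m = len(block)
    return diff.sum() / (m * (m - 1))

stage_estimates = []
for _ in range(n_stages):
    perm = rng.permutation(len(x))           # repartitioning step (a shuffle here)
    blocks = np.array_split(x[perm], n_workers)
    stage_estimates.append(np.mean([within_block_pairwise_mean(b) for b in blocks]))

print("estimate after repartitioned stages:", np.mean(stage_estimates))
```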

    Density functionals, with an option-pricing application

    We present a method of estimating density-related functionals without prior knowledge of the density’s functional form. The approach revolves around the specification of an explicit formula for a new class of distributions that encompasses many of the known cases in statistics, including the normal, gamma, inverse gamma, and mixtures thereof. The functionals are based on a couple of hypergeometric functions. Their parameters can be estimated, and the estimates then reveal both the functional form of the density and the parameters that determine centering, scaling, etc. By design, the function to be estimated always leads to a valid density, namely one that is nonnegative everywhere and integrates to 1. Unlike fully nonparametric methods, our approach can be applied to small datasets. To illustrate our methodology, we apply it to finding risk-neutral densities associated with different types of financial options. We show that our approach fits the data uniformly very well. We also find that our estimated densities’ functional forms vary over the dataset, so that existing parametric methods will not do uniformly well.
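
    The paper's hypergeometric-function-based family is not reproduced here, but the general workflow it describes (fit a flexible parametric density to a small sample, then read functionals off the fitted form) can be sketched with a stand-in family; the choice of the generalized gamma distribution and of differential entropy as the target functional are purely illustrative assumptions.

```python
# Sketch of the general workflow only (the paper's hypergeometric-based class is
# not reproduced here): fit a flexible parametric density by maximum likelihood,
# then evaluate a density functional by numerical integration.
import numpy as np
from scipy import stats, integrate

rng = np.random.default_rng(2)
data = rng.gamma(shape=3.0, scale=2.0, size=200)     # small illustrative sample

# Stand-in family: generalized gamma (three parameters), fitted by MLE
a, c, loc, scale = stats.gengamma.fit(data, floc=0)
pdf = lambda z: stats.gengamma.pdf(z, a, c, loc=loc, scale=scale)

# Example density functional: differential entropy -int f log f
entropy, _ = integrate.quad(lambda z: -pdf(z) * np.log(max(pdf(z), 1e-300)), 1e-9, np.inf)
print(f"fitted (a={a:.2f}, c={c:.2f}, scale={scale:.2f}), entropy approx. {entropy:.3f}")
```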

    Integrating Multiple Commodities in a Model of Stochastic Price Dynamics

    In this paper we develop a multi-factor model for the joint dynamics of related commodity spot prices in continuous time. We contribute to the existing literature by simultaneously considering various commodity markets in a single, consistent model. In an application we show the economic significance of our approach. We assume that the spot price processes can be characterized by the weighted sum of latent factors. Employing an essentially-affine model structure allows for rich dependencies among the latent factors and thus among the commodity prices. The co-integrated behavior between the different spot price dynamics is explicitly taken into account. Within this framework we derive closed-form solutions for futures prices. The Kalman filter methodology is applied to estimate the model for crude oil, heating oil and gasoline futures contracts traded on the NYMEX. Empirically, we are able to identify a common non-stationary equilibrium factor driving the long-term price behavior and stationary factors affecting all three markets in a common way. Additionally, we identify factors which only impact subsets of the commodities considered. To demonstrate the economic consequences of our integrated approach, we evaluate the investment in a refinery from a financial management perspective and compare the results with an approach neglecting the co-movement of prices. This neglect leads to radical changes in the project's assessment.
    Keywords: Commodities; Integrated Model; Crude Oil; Heating Oil; Gasoline; Futures; Kalman Filter
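
    A generic linear-Gaussian Kalman filter of the kind used to extract latent factors from observed futures prices can be sketched as follows; the dimensions, system matrices and data below are placeholders, not the essentially-affine specification estimated in the paper.

```python
# Minimal sketch (illustrative, not the paper's specification): Kalman filter for a
# linear-Gaussian state-space model y_t = Z a_t + eps_t, a_t = T a_{t-1} + eta_t,
# of the kind used to extract latent factors from observed (log) futures prices.
import numpy as np

def kalman_filter(y, Z, T, H, Q, a0, P0):
    a, P = a0, P0
    loglik = 0.0
    for y_t in y:
        # Prediction step
        a, P = T @ a, T @ P @ T.T + Q
        # Update step
        v = y_t - Z @ a                      # innovation
        F = Z @ P @ Z.T + H                  # innovation covariance
        K = P @ Z.T @ np.linalg.inv(F)       # Kalman gain
        a, P = a + K @ v, P - K @ Z @ P
        loglik += -0.5 * (np.log(np.linalg.det(F)) + v @ np.linalg.solve(F, v))
    return a, P, loglik

# Toy usage: 2 latent factors observed through 3 "futures" series
rng = np.random.default_rng(3)
Z = rng.normal(size=(3, 2)); T = 0.95 * np.eye(2)
H = 0.1 * np.eye(3); Q = 0.05 * np.eye(2)
y = rng.normal(size=(100, 3))                # placeholder observations
a, P, ll = kalman_filter(y, Z, T, H, Q, np.zeros(2), np.eye(2))
print("log-likelihood (up to a constant):", round(ll, 2))
```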

    Bounding Optimality Gap in Stochastic Optimization via Bagging: Statistical Efficiency and Stability

    We study a statistical method to estimate the optimal value and the optimality gap of a given solution in stochastic optimization, as an assessment of solution quality. Our approach is based on bootstrap aggregating, or bagging, applied to resampled sample average approximation (SAA). We show how this approach leads to valid statistical confidence bounds for non-smooth optimization. We also demonstrate its statistical efficiency and stability, which are especially desirable in limited-data situations, and compare these properties with those of some existing methods. We present our theory that views SAA as a kernel in an infinite-order symmetric statistic, which can be approximated via bagging. We substantiate our theoretical findings with numerical results.
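
    On a toy one-dimensional problem (an assumption for illustration, not the paper's experiments), the bagged-SAA idea looks as follows: resample the data, solve the SAA problem on each resample, average the resulting optimal values, and subtract this average from the objective value of a candidate solution to estimate its optimality gap.

```python
# Minimal sketch (toy problem, not the paper's code): bagging resampled SAA to
# estimate the optimal value of min_x E|x - xi| and the optimality gap of a
# candidate solution x0.  The SAA optimum of this problem is the sample median.
import numpy as np

rng = np.random.default_rng(4)
data = rng.lognormal(size=60)                # limited-data regime
x0 = np.mean(data)                           # candidate solution to be assessed

def saa_optimal_value(sample):
    # Solve the SAA problem on one (re)sample: optimum at the sample median
    x_star = np.median(sample)
    return np.mean(np.abs(x_star - sample))

B, k = 500, 40                               # number of bags, resample size
bagged_vals = [saa_optimal_value(rng.choice(data, size=k, replace=True)) for _ in range(B)]
z_star_hat = np.mean(bagged_vals)            # bagged estimate of the optimal value

gap_hat = np.mean(np.abs(x0 - data)) - z_star_hat
print(f"estimated optimal value: {z_star_hat:.3f}, estimated gap of x0: {gap_hat:.3f}")
```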

    Sensitivity of the Eisenberg-Noe clearing vector to individual interbank liabilities

    We quantify the sensitivity of the Eisenberg-Noe clearing vector to estimation errors in the bilateral liabilities of a financial system in a stylized setting. The interbank liabilities matrix is a crucial input to the computation of the clearing vector. However, in practice central bankers and regulators must often estimate this matrix because complete information on bilateral liabilities is rarely available. As a result, the clearing vector may suffer from estimation errors in the liabilities matrix. We quantify the clearing vector's sensitivity to such estimation errors and show that its directional derivatives are, like the clearing vector itself, solutions of fixed point equations. We describe estimation errors utilizing a basis for the space of matrices representing permissible perturbations and derive analytical solutions for the maximal deviations of the Eisenberg-Noe clearing vector. This allows us to compute upper bounds for the worst case perturbations of the clearing vector in our simple setting. Moreover, we quantify the probability of observing clearing vector deviations of a certain magnitude, for uniformly or normally distributed errors in the relative liability matrix. Applying our methodology to a dataset of European banks, we find that perturbations to the relative liabilities can result in economically sizeable differences that could lead to an underestimation of the risk of contagion. Our results are a first step towards allowing regulators to quantify errors in their simulations.
    Comment: 37 pages
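
    The clearing vector itself is the solution of the standard Eisenberg-Noe fixed-point equation and can be computed by Picard iteration, as sketched below; the liabilities matrix and external assets are toy numbers chosen only to make the example run.

```python
# Minimal sketch (toy numbers, standard construction): Picard iteration on the
# Eisenberg-Noe fixed-point map p = min(pbar, c + Pi^T p), where Pi is the
# relative liabilities matrix, pbar the total obligations and c external assets.
import numpy as np

L = np.array([[0.0, 2.0, 1.0],               # L[i, j]: nominal liability of bank i to bank j
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])
c = np.array([1.0, 0.5, 2.0])                # external assets
pbar = L.sum(axis=1)                         # total obligations per bank
Pi = L / pbar[:, None]                       # relative liabilities matrix

p = pbar.copy()
for _ in range(1000):                        # fixed-point (Picard) iteration
    p_new = np.minimum(pbar, c + Pi.T @ p)
    if np.max(np.abs(p_new - p)) < 1e-12:
        break
    p = p_new

print("clearing vector:", np.round(p, 6))
```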

    Sampling Plans for Control-Inspection Schemes Under Independent and Dependent Sampling Designs With Applications to Photovoltaics

    The evaluation of produced items at the time of delivery is, in practice, usually supplemented by at least one inspection at a later time point. We extend the methodology of acceptance sampling for variables for arbitrary unknown distributions to settings where such additional sampling information is available. Based on appropriate approximations of the operating characteristic, we derive new acceptance sampling plans that control the overall operating characteristic. The results cover the case of independent sampling as well as the case of dependent sampling. In particular, we study a modified panel sampling design and the case of spatial batch sampling. The latter is advisable in photovoltaic field monitoring studies, since it allows one to detect and analyze local clusters of degraded or damaged modules. Some finite sample properties are examined by a simulation study, focusing on the accuracy of estimation.
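
    As a rough illustration of how an operating characteristic can be evaluated (a simulation sketch with made-up plan parameters, not the plans derived in the paper), the snippet below estimates the acceptance probability of a simple variables sampling plan, which accepts a lot when (USL - xbar)/s >= k, across several true fractions of nonconforming items.

```python
# Minimal sketch (illustrative plan, not the paper's derived plans): Monte-Carlo
# operating characteristic of a variables sampling plan that accepts a lot when
# (USL - xbar) / s >= k, evaluated over a grid of true fractions nonconforming.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, k, USL = 30, 1.9, 10.0                    # sample size, acceptance constant, upper spec limit
reps = 20000

for p in (0.01, 0.02, 0.05, 0.10):           # true fraction nonconforming
    mu = USL - norm.ppf(1 - p)               # normal lot (sigma = 1) with defect rate p
    samples = rng.normal(loc=mu, scale=1.0, size=(reps, n))
    xbar, s = samples.mean(axis=1), samples.std(axis=1, ddof=1)
    oc = np.mean((USL - xbar) / s >= k)      # acceptance probability
    print(f"p = {p:.2f}:  P(accept) approx. {oc:.3f}")
```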