Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics
In a wide range of statistical learning problems such as ranking, clustering
or metric learning among others, the risk is accurately estimated by
U-statistics of degree d >= 2, i.e. functionals of the training data with
low variance that take the form of averages over d-tuples. From a
computational perspective, the calculation of such statistics is highly
expensive even for a moderate sample size n, as it requires averaging O(n^d)
terms. This makes learning procedures relying on the optimization of
such data functionals hardly feasible in practice. It is the major goal of this
paper to show that, strikingly, such empirical risks can be replaced by
drastically computationally simpler Monte-Carlo estimates based on O(n) terms
only, usually referred to as incomplete U-statistics, without damaging the
learning rate of Empirical Risk Minimization (ERM)
procedures. For this purpose, we establish uniform deviation results describing
the error made when approximating a U-process by its incomplete version under
appropriate complexity assumptions. Extensions to model selection, fast rate
situations and various sampling techniques are also considered, as well as an
application to stochastic gradient descent for ERM. Finally, numerical examples
are displayed in order to provide strong empirical evidence that the approach
we promote largely surpasses more naive subsampling techniques.

Comment: To appear in Journal of Machine Learning Research. 34 pages. v2:
minor correction to Theorem 4 and its proof, added 1 reference. v3: typo
corrected in Proposition 3. v4: improved presentation, added experiments on
model selection for clustering, fixed minor typo
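As a toy illustration of the idea (not the paper's experiments), the sketch below contrasts a complete degree-2 U-statistic, which averages over all O(n^2) pairs, with an incomplete Monte-Carlo estimate built from O(n) sampled pairs. The kernel, data, and with-replacement pair sampling are illustrative choices only:

```python
import itertools
import random
import statistics

def complete_u_statistic(data, kernel):
    """Degree-2 U-statistic: average of the kernel over all C(n, 2) pairs."""
    pairs = list(itertools.combinations(data, 2))
    return sum(kernel(x, y) for x, y in pairs) / len(pairs)

def incomplete_u_statistic(data, kernel, budget, seed=0):
    """Incomplete version: average the kernel over `budget` pairs drawn
    uniformly at random (sampling with replacement across pairs)."""
    rng = random.Random(seed)
    n = len(data)
    total = 0.0
    for _ in range(budget):
        i, j = rng.sample(range(n), 2)  # one random pair of distinct indices
        total += kernel(data[i], data[j])
    return total / budget

# Toy kernel h(x, y) = (x - y)^2 / 2, whose complete degree-2
# U-statistic equals the unbiased sample variance.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]
kernel = lambda x, y: (x - y) ** 2 / 2
full = complete_u_statistic(data, kernel)                   # averages 124,750 terms
approx = incomplete_u_statistic(data, kernel, budget=2000)  # averages 2,000 terms
```

With n = 500 the complete statistic already averages over a hundred thousand terms, while the incomplete estimate with a budget of 2,000 sampled pairs typically lands within a few percent of it.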
Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning
The development of cluster computing frameworks has allowed practitioners to
scale out various statistical estimation and machine learning algorithms with
minimal programming effort. This is especially true for machine learning
problems whose objective function is nicely separable across individual data
points, such as classification and regression. In contrast, statistical
learning tasks involving pairs (or more generally tuples) of data points - such
as metric learning, clustering or ranking - do not lend themselves as easily to
data-parallelism and in-memory computing. In this paper, we investigate how to
balance between statistical performance and computational efficiency in such
distributed tuplewise statistical problems. We first propose a simple strategy
based on occasionally repartitioning data across workers between parallel
computation stages, where the number of repartitioning steps rules the
trade-off between accuracy and runtime. We then present some theoretical
results highlighting the benefits brought by the proposed method in terms of
variance reduction, and extend our results to design distributed stochastic
gradient descent algorithms for tuplewise empirical risk minimization. Our
results are supported by numerical experiments in pairwise statistical
estimation and learning on synthetic and real-world datasets.

Comment: 23 pages, 6 figures, ECML 201
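A minimal sketch of the repartitioning idea, with all function names and parameters hypothetical (the abstract does not fix a concrete scheme): each worker averages a pairwise kernel over its own partition only, and a global reshuffle between stages lets more distinct pairs contribute across stages:

```python
import itertools
import random
import statistics

def within_partition_estimate(partitions, kernel):
    """Each 'worker' averages the kernel over pairs inside its own
    partition (no data exchange); the driver averages the results."""
    per_worker = []
    for part in partitions:
        pairs = list(itertools.combinations(part, 2))
        per_worker.append(sum(kernel(x, y) for x, y in pairs) / len(pairs))
    return statistics.fmean(per_worker)

def estimate_with_repartitioning(data, kernel, n_workers, n_stages, seed=0):
    """Reshuffle the data across workers between computation stages;
    the number of stages trades runtime for variance reduction."""
    rng = random.Random(seed)
    items = list(data)
    stage_estimates = []
    for _ in range(n_stages):
        rng.shuffle(items)  # the (costly) global repartitioning step
        partitions = [items[w::n_workers] for w in range(n_workers)]
        stage_estimates.append(within_partition_estimate(partitions, kernel))
    return statistics.fmean(stage_estimates)

rng = random.Random(2)
data = [rng.gauss(0.0, 1.0) for _ in range(400)]
kernel = lambda x, y: (x - y) ** 2 / 2  # pairwise kernel for the sample variance
est = estimate_with_repartitioning(data, kernel, n_workers=8, n_stages=5)
```

Averaging only within partitions keeps each stage embarrassingly parallel; the occasional shuffle is what recovers accuracy relative to the full pairwise average.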
Density functionals, with an option-pricing application
We present a method of estimating density-related functionals, without prior knowledge of the density’s functional form. The approach revolves around the specification of an explicit formula for a new class of distributions that encompasses many of the known cases in statistics, including the normal, gamma, inverse gamma, and mixtures thereof. The functionals are based on a couple of hypergeometric functions. Their parameters can be estimated, and the estimates then reveal both the functional form of the density and the parameters that determine centering, scaling, etc. The function to be estimated always leads to a valid density, by design, namely, one that is nonnegative everywhere and integrates to 1. Unlike fully nonparametric methods, our approach can be applied to small datasets. To illustrate our methodology, we apply it to finding risk-neutral densities associated with different types of financial options. We show how our approach fits the data uniformly very well. We also find that our estimated densities’ functional forms vary over the dataset, so that existing parametric methods will not do uniformly well.
Bounding Optimality Gap in Stochastic Optimization via Bagging: Statistical Efficiency and Stability
We study a statistical method to estimate the optimal value and the
optimality gap of a given solution for stochastic optimization as an assessment
of the solution quality. Our approach is based on bootstrap aggregating, or
bagging, resampled sample average approximation (SAA). We show how this
approach leads to valid statistical confidence bounds for non-smooth
optimization. We also demonstrate its statistical efficiency and stability that
are especially desirable in limited-data situations, and compare these
properties with some existing methods. We present our theory that views SAA as
a kernel in an infinite-order symmetric statistic, which can be approximated
via bagging. We substantiate our theoretical findings with numerical results.
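The bagging-of-SAA idea can be sketched on a toy problem where the answer is known in closed form; the problem, subsample size, and bag count below are illustrative assumptions, not the paper's setup:

```python
import random
import statistics

def saa_optimal_value(sample):
    """SAA for min_x E[(x - xi)^2]: the minimizer is the sample mean,
    and the optimal value is the average squared deviation around it."""
    x_star = statistics.fmean(sample)
    return statistics.fmean([(xi - x_star) ** 2 for xi in sample])

def bagged_saa(data, subsample_size, n_bags, seed=0):
    """Bagging: resample subsamples with replacement, solve each SAA
    problem, and aggregate the optimal values by averaging."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_bags):
        sub = rng.choices(data, k=subsample_size)  # bootstrap resample
        values.append(saa_optimal_value(sub))
    return statistics.fmean(values)

rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]  # limited-data regime
estimate = bagged_saa(data, subsample_size=50, n_bags=200)
```

For standard normal inputs the true optimal value is the variance, 1; the bagged estimate aggregates many cheap resampled SAA solves instead of relying on a single one.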
Integrating Multiple Commodities in a Model of Stochastic Price Dynamics
In this paper we develop a multi-factor model for the joint dynamics of related commodity spot prices in continuous time. We contribute to the existing literature by simultaneously considering various commodity markets in a single, consistent model. In an application we show the economic significance of our approach. We assume that the spot price processes can be characterized by the weighted sum of latent factors. Employing an essentially-affine model structure allows for rich dependencies among the latent factors and thus, the commodity prices. The co-integrated behavior between the different spot price dynamics is explicitly taken into account. Within this framework we derive closed-form solutions of futures prices. The Kalman Filter methodology is applied to estimate the model for crude oil, heating oil and gasoline futures contracts traded on the NYMEX. Empirically, we are able to identify a common non-stationary equilibrium factor driving the long-term price behavior and stationary factors affecting all three markets in a common way. Additionally, we identify factors which only impact subsets of the commodities considered. To demonstrate the economic consequences of our integrated approach, we evaluate the investment into a refinery from a financial management perspective and compare the results with an approach neglecting the co-movement of prices. This negligence leads to radical changes in the project's assessment.

Keywords: Commodities; Integrated Model; Crude Oil; Heating Oil; Gasoline; Futures; Kalman Filter
Sensitivity of the Eisenberg-Noe clearing vector to individual interbank liabilities
We quantify the sensitivity of the Eisenberg-Noe clearing vector to
estimation errors in the bilateral liabilities of a financial system in a
stylized setting. The interbank liabilities matrix is a crucial input to the
computation of the clearing vector. However, in practice central bankers and
regulators must often estimate this matrix because complete information on
bilateral liabilities is rarely available. As a result, the clearing vector may
suffer from estimation errors in the liabilities matrix. We quantify the
clearing vector's sensitivity to such estimation errors and show that its
directional derivatives are, like the clearing vector itself, solutions of
fixed point equations. We describe estimation errors utilizing a basis for the
space of matrices representing permissible perturbations and derive analytical
solutions to the maximal deviations of the Eisenberg-Noe clearing vector. This
allows us to compute upper bounds for the worst case perturbations of the
clearing vector in our simple setting. Moreover, we quantify the probability of
observing clearing vector deviations of a certain magnitude, for uniformly or
normally distributed errors in the relative liability matrix.
Applying our methodology to a dataset of European banks, we find that
perturbations to the relative liabilities can result in economically sizeable
differences that could lead to an underestimation of the risk of contagion. Our
results are a first step towards allowing regulators to quantify errors in
their simulations.

Comment: 37 pages
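The clearing vector itself is the solution of a fixed point equation, which can be computed by Picard iteration; the two-bank example below uses made-up numbers purely for illustration:

```python
def eisenberg_noe_clearing(p_bar, Pi, e, tol=1e-12, max_iter=10_000):
    """Picard iteration for the Eisenberg-Noe clearing vector:
        p_i = min(p_bar_i, e_i + sum_j Pi[j][i] * p_j)
    p_bar: total interbank liabilities of each bank,
    Pi[j][i]: fraction of bank j's liabilities owed to bank i,
    e: external assets. Starting from full payment, the iterates
    decrease monotonically to the greatest clearing vector."""
    n = len(p_bar)
    p = list(p_bar)
    for _ in range(max_iter):
        new_p = [
            min(p_bar[i], e[i] + sum(Pi[j][i] * p[j] for j in range(n)))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Two-bank toy system: bank 0 owes 1.0 entirely to bank 1,
# bank 1 owes 2.0 entirely to bank 0; external assets e = (0.0, 0.1).
p = eisenberg_noe_clearing(
    p_bar=[1.0, 2.0],
    Pi=[[0.0, 1.0], [1.0, 0.0]],
    e=[0.0, 0.1],
)
```

In this toy system bank 1 can only pay 1.1 of its 2.0 liabilities, so a small error in the liabilities matrix Pi would propagate through the fixed point, which is exactly the sensitivity the paper quantifies.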
Sampling Plans for Control-Inspection Schemes Under Independent and Dependent Sampling Designs With Applications to Photovoltaics
The evaluation of produced items at the time of delivery is, in practice,
usually amended by at least one inspection at later time points. We extend the
methodology of acceptance sampling for variables for arbitrary unknown
distributions when additional sampling information is available to such
settings. Based on appropriate approximations of the operating characteristic,
we derive new acceptance sampling plans that control the overall operating
characteristic. The results cover the case of independent sampling as well as
the case of dependent sampling. In particular, we study a modified panel
sampling design and the case of spatial batch sampling. The latter is advisable
in photovoltaic field monitoring studies, since it allows one to detect and analyze
local clusters of degraded or damaged modules. Some finite sample properties
are examined by a simulation study, focusing on the accuracy of estimation
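A generic variables acceptance sampling plan and its operating characteristic can be sketched by simulation; the plan form (accept if xbar - k*s clears the lower specification limit) and all numbers below are standard textbook assumptions, not the paper's plans:

```python
import random
import statistics

def accepts(sample, lower_spec, k):
    """Variables plan: accept the lot if xbar - k * s >= lower_spec."""
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)
    return xbar - k * s >= lower_spec

def operating_characteristic(mu, sigma, lower_spec, n, k, reps=2000, seed=0):
    """Simulated acceptance probability of the plan (n, k) when item
    quality is N(mu, sigma): one point on the OC curve."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        hits += accepts(sample, lower_spec, k)
    return hits / reps

# A clearly conforming lot versus a clearly degraded one.
oc_good = operating_characteristic(mu=10.0, sigma=1.0, lower_spec=5.0, n=20, k=2.0)
oc_bad = operating_characteristic(mu=6.0, sigma=1.0, lower_spec=5.0, n=20, k=2.0)
```

Sweeping mu traces out the whole operating characteristic; controlling that curve across the initial delivery and later inspections is what the proposed plans achieve analytically rather than by simulation.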