Random model trees: an effective and scalable regression method
We present and investigate ensembles of randomized model trees as a novel regression
method. Such ensembles combine the scalability of tree-based methods with predictive
performance rivaling the state of the art in numeric prediction. An extensive empirical
investigation shows that Random Model Trees produce predictive performance competitive
with state-of-the-art methods such as Gaussian Process Regression or Additive Groves of
Regression Trees. Training and optimization of Random Model Trees scale better than
Gaussian Process Regression to larger datasets, and are consistently one to two orders
of magnitude faster than for Additive Groves.
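A minimal sketch of the idea, under assumptions: each ensemble member fits a shallow, randomized regression tree and then a ridge model inside every leaf, and predictions are averaged over the ensemble. The class name and hyper-parameters below are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

class RandomModelTreeEnsemble:
    """Illustrative ensemble of randomized model trees (assumed design):
    random-split trees partition the input space, ridge models fit the leaves."""

    def __init__(self, n_trees=20, max_depth=4, alpha=1.0, random_state=0):
        self.n_trees, self.max_depth, self.alpha = n_trees, max_depth, alpha
        self.rng = np.random.default_rng(random_state)
        self.members = []  # list of (tree, {leaf_id: ridge model})

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        n = len(X)
        for _ in range(self.n_trees):
            idx = self.rng.integers(0, n, size=n)              # bootstrap sample
            Xb, yb = X[idx], y[idx]
            tree = DecisionTreeRegressor(
                max_depth=self.max_depth,
                splitter="random",                             # randomized split points
                random_state=int(self.rng.integers(1 << 30)),
            ).fit(Xb, yb)
            leaves = tree.apply(Xb)
            models = {leaf: Ridge(alpha=self.alpha).fit(Xb[leaves == leaf],
                                                        yb[leaves == leaf])
                      for leaf in np.unique(leaves)}           # linear model per leaf
            self.members.append((tree, models))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        preds = np.zeros((self.n_trees, len(X)))
        for i, (tree, models) in enumerate(self.members):
            leaves = tree.apply(X)
            for leaf, model in models.items():
                mask = leaves == leaf
                if mask.any():
                    preds[i, mask] = model.predict(X[mask])
        return preds.mean(axis=0)                              # average over the ensemble
```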
Quasirandom Load Balancing
We propose a simple distributed algorithm for balancing indivisible tokens on
graphs. The algorithm is completely deterministic, though it tries to imitate
(and enhance) a random algorithm by keeping the accumulated rounding errors as
small as possible.
Our new algorithm surprisingly closely approximates the idealized process
(where the tokens are divisible) on important network topologies. On
d-dimensional torus graphs with n nodes it deviates from the idealized process
only by an additive constant. In contrast to that, the randomized rounding
approach of Friedrich and Sauerwald (2009) can deviate up to Omega(polylog(n))
and the deterministic algorithm of Rabani, Sinclair and Wanka (1998) has a
deviation of Omega(n^{1/d}). This makes our quasirandom algorithm the first
known algorithm for this setting which is optimal both in time and achieved
smoothness. We further show that also on the hypercube our algorithm has a
smaller deviation from the idealized process than the previous algorithms.
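A minimal sketch of the quasirandom rounding idea, under assumptions: each round, every edge carries the flow prescribed by the idealized divisible process, rounded to an integer so that the per-edge accumulated rounding error stays as small as possible. The diffusion constant and the networkx-based interface are illustrative choices, not the paper's exact algorithm.

```python
import math
import networkx as nx

def quasirandom_balance(G: nx.Graph, load: dict, rounds: int) -> dict:
    """Deterministically balance integer token loads on graph G (sketch)."""
    load = dict(load)
    err = {e: 0.0 for e in G.edges}                   # accumulated rounding error per edge
    for _ in range(rounds):
        flows = {}
        for u, v in G.edges:
            deg = max(G.degree[u], G.degree[v])
            ideal = (load[u] - load[v]) / (2 * deg)   # idealized (divisible) flow u -> v
            lo, hi = math.floor(ideal), math.ceil(ideal)
            # round so that the accumulated deviation from the ideal flow stays smallest
            send = lo if abs(err[(u, v)] + ideal - lo) <= abs(err[(u, v)] + ideal - hi) else hi
            err[(u, v)] += ideal - send
            flows[(u, v)] = send
        for (u, v), f in flows.items():               # apply all flows synchronously
            load[u] -= f
            load[v] += f
    return load

# example: balance an initially concentrated load on a 2-dimensional torus
# G = nx.grid_graph(dim=[8, 8], periodic=True)
# final = quasirandom_balance(G, {v: 0 for v in G} | {list(G)[0]: 640}, rounds=50)
```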
Fast learning rates in statistical inference through aggregation
We develop minimax optimal risk bounds for the general learning task
consisting in predicting as well as the best function in a reference set
up to the smallest possible additive term, called the convergence rate.
When the reference set is finite and when n denotes the size of the
training data, we provide minimax convergence rates of the form
C ((log d)/n)^v, where d is the cardinality of the reference set, with tight
evaluation of the positive constant C and with exact exponent v, the latter
value depending on the convexity of the loss function and on the level of
noise in the output distribution. The risk upper bounds are based on a
sequential randomized algorithm, which at each step concentrates on functions
having both low risk and low variance with respect to the previous step's
prediction function. Our analysis puts forward the links between the
probabilistic and worst-case viewpoints, and allows us to obtain risk bounds
unachievable with the standard statistical learning approach. One of the key
ideas of this work is to use probabilistic inequalities with respect to
appropriate (Gibbs) distributions on the prediction function space instead of
using them with respect to the distribution generating the data. The risk
lower bounds are based on refinements of the Assouad lemma that take the
properties of the loss function into particular account. Our key example to
illustrate the upper and lower bounds is the L_q-regression setting, for which
an exhaustive analysis of the convergence rates is given as q ranges in
[1, +infinity).
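As an illustration of the Gibbs-distribution viewpoint (a sketch under assumptions, not the paper's exact sequential procedure), the snippet below aggregates a finite reference set by exponentially reweighting each function according to its cumulative loss, so that the mixture concentrates on low-risk functions; the learning rate eta and the squared loss are illustrative choices.

```python
import numpy as np

def gibbs_aggregate(predictions, y, eta=1.0):
    """Sequential aggregation over a finite reference set (illustrative sketch).

    predictions: array of shape (d, n) with the d reference functions'
    predictions on n sequentially observed examples; y: array of shape (n,).
    Returns the n aggregated predictions made before seeing each output.
    """
    predictions, y = np.asarray(predictions, float), np.asarray(y, float)
    d, n = predictions.shape
    cum_loss = np.zeros(d)
    out = np.zeros(n)
    for t in range(n):
        weights = np.exp(-eta * (cum_loss - cum_loss.min()))   # Gibbs weights (stabilized)
        weights /= weights.sum()
        out[t] = weights @ predictions[:, t]                   # predict with the mixture
        cum_loss += (predictions[:, t] - y[t]) ** 2            # update cumulative losses
    return out
```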
An Emulator for the Lyman-alpha Forest
We present methods for interpolating between the 1-D flux power spectrum of
the Lyman-alpha forest, as output by cosmological hydrodynamic simulations.
Interpolation is necessary for cosmological parameter estimation due to the
limited number of simulations possible. We construct an emulator for the
Lyman-alpha forest flux power spectrum from small simulations using
Latin hypercube sampling and Gaussian process interpolation. We show that this
emulator has a typical accuracy of 1.5% and a worst-case accuracy of 4%, which
compares well to the current statistical error of 3 - 5% from BOSS
DR9. We compare to the previous state of the art, quadratic polynomial
interpolation. The Latin hypercube samples the entire volume of parameter
space, while quadratic polynomial emulation samples only lower-dimensional
subspaces. The Gaussian process provides an estimate of the emulation error and
we show using test simulations that this estimate is reasonable. We construct a
likelihood function and use it to show that the posterior constraints generated
using the emulator are unbiased. We show that our Gaussian process emulator has
lower emulation error than quadratic polynomial interpolation and thus produces
tighter posterior confidence intervals, which will be essential for future
Lyman-alpha surveys such as DESI.
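A minimal sketch of the emulation strategy described above, under assumptions: draw training points with a Latin hypercube design, run a hypothetical simulator `run_simulation` at those points, and fit a Gaussian process whose predictive standard deviation serves as the emulation error estimate. The function name, parameter bounds, and kernel are illustrative, not the authors' pipeline.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def build_emulator(run_simulation, bounds, n_samples=30, seed=0):
    """bounds: array of shape (n_params, 2) with lower/upper limits per parameter."""
    bounds = np.asarray(bounds, dtype=float)
    sampler = qmc.LatinHypercube(d=bounds.shape[0], seed=seed)
    unit = sampler.random(n_samples)                          # design points in [0, 1]^d
    params = qmc.scale(unit, bounds[:, 0], bounds[:, 1])      # map to the parameter volume
    spectra = np.array([run_simulation(p) for p in params])   # e.g. binned flux power spectra
    kernel = ConstantKernel() * RBF(length_scale=np.ones(bounds.shape[0]))
    return GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(params, spectra)

# usage: mean prediction and an internal error estimate at a new parameter point
# gp = build_emulator(run_simulation, bounds)
# mean, std = gp.predict(theta.reshape(1, -1), return_std=True)
```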
Efficient Localization of Discontinuities in Complex Computational Simulations
Surrogate models for computational simulations are input-output
approximations that allow computationally intensive analyses, such as
uncertainty propagation and inference, to be performed efficiently. When a
simulation output does not depend smoothly on its inputs, the error and
convergence rate of many approximation methods deteriorate substantially. This
paper details a method for efficiently localizing discontinuities in the input
parameter domain, so that the model output can be approximated as a piecewise
smooth function. The approach comprises an initialization phase, which uses
polynomial annihilation to assign function values to different regions and thus
seed an automated labeling procedure, followed by a refinement phase that
adaptively updates a kernel support vector machine representation of the
separating surface via active learning. The overall approach avoids structured
grids and exploits any available simplicity in the geometry of the separating
surface, thus reducing the number of model evaluations required to localize the
discontinuity. The method is illustrated on examples of up to eleven
dimensions, including algebraic models and ODE/PDE systems, and demonstrates
improved scaling and efficiency over other discontinuity localization
approaches.
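The refinement phase lends itself to a short sketch, simplified and under assumptions: the polynomial-annihilation initialization is replaced here by a generic `label_point` oracle that reports which side of the discontinuity a sample lies on, after which a kernel SVM is fit to the current labels and the candidate points closest to its decision boundary are queried next.

```python
import numpy as np
from sklearn.svm import SVC

def localize_discontinuity(label_point, seed_X, n_iters=10,
                           n_candidates=2000, batch=10, seed=0):
    """Active-learning refinement of a separating surface (illustrative sketch).

    label_point(x) must return +1 or -1; seed_X must contain points from both
    regions (in the paper this seeding comes from polynomial annihilation).
    Inputs are assumed to live in the unit hypercube [0, 1]^dim.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(seed_X, dtype=float)
    y = np.array([label_point(x) for x in X])
    for _ in range(n_iters):
        svm = SVC(kernel="rbf", C=10.0).fit(X, y)       # current separating surface
        cand = rng.random((n_candidates, X.shape[1]))   # unlabeled candidate points
        margin = np.abs(svm.decision_function(cand))
        pick = cand[np.argsort(margin)[:batch]]         # candidates nearest the boundary
        X = np.vstack([X, pick])                        # query the model only there
        y = np.concatenate([y, [label_point(x) for x in pick]])
    return SVC(kernel="rbf", C=10.0).fit(X, y)
```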