Global and Local Two-Sample Tests via Regression
Two-sample testing is a fundamental problem in statistics. Despite its long
history, there has been renewed interest in this problem with the advent of
high-dimensional and complex data. Specifically, in the machine learning
literature, there have been recent methodological developments such as
classification accuracy tests. The goal of this work is to present a regression
approach to comparing multivariate distributions of complex data. Depending on
the chosen regression model, our framework can efficiently handle different
types of variables and various structures in the data, with competitive power
under many practical scenarios. Whereas previous work has been largely limited
to global tests which conceal much of the local information, our approach
naturally leads to a local two-sample testing framework in which we identify
local differences between multivariate distributions with statistical
confidence. We demonstrate the efficacy of our approach both theoretically and
empirically, under some well-known parametric and nonparametric regression
methods. Our proposed methods are applied to simulated data as well as a
challenging astronomy data set to assess their practical usefulness.
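The regression view of two-sample testing can be illustrated with a small sketch. Pool the two samples, label each point 0 or 1 by sample of origin, and fit a regression estimate of m(x) = P(Y = 1 | X = x). If the two distributions are equal, m(x) equals the pooled class proportion everywhere, so the mean squared gap between the fitted values and that proportion is one natural global statistic, and the pointwise gap is a local one. Everything below is an assumption made for illustration rather than the authors' implementation: the k-NN regressor, the univariate data, the permutation calibration, and all function names.

```python
import random

def neighbour_indices(xs, k):
    """For each point, indices of its k nearest pooled points (1-D, includes itself)."""
    return [sorted(range(len(xs)), key=lambda j: abs(xs[j] - x))[:k] for x in xs]

def fitted_values(labels, neighbours):
    """k-NN regression estimate of P(Y = 1 | X = x) at every pooled point."""
    return [sum(labels[j] for j in nb) / len(nb) for nb in neighbours]

def global_statistic(labels, neighbours):
    """Mean squared gap between the fitted values and the class proportion."""
    pi_hat = sum(labels) / len(labels)
    fitted = fitted_values(labels, neighbours)
    return sum((m - pi_hat) ** 2 for m in fitted) / len(labels)

def regression_two_sample_test(sample0, sample1, k=10, n_perm=200, seed=0):
    """Return (observed statistic, permutation p-value) for H0: equal distributions."""
    rng = random.Random(seed)
    xs = list(sample0) + list(sample1)
    labels = [0] * len(sample0) + [1] * len(sample1)
    neighbours = neighbour_indices(xs, k)  # depends on xs only, so compute once
    observed = global_statistic(labels, neighbours)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # permuting labels mimics the null hypothesis
        if global_statistic(shuffled, neighbours) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical usage: two Gaussian samples whose means differ by three units.
data_rng = random.Random(1)
shifted0 = [data_rng.gauss(0.0, 1.0) for _ in range(40)]
shifted1 = [data_rng.gauss(3.0, 1.0) for _ in range(40)]
stat, p_value = regression_two_sample_test(shifted0, shifted1)
print(f"statistic={stat:.3f}, p-value={p_value:.4f}")
```

Swapping the k-NN regressor for another regression model changes only `fitted_values`, which is the flexibility the abstract refers to.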
Recent advances in directional statistics
Mainstream statistical methodology is generally applicable to data observed
in Euclidean space. There are, however, numerous contexts of considerable
scientific interest in which the natural supports for the data under
consideration are Riemannian manifolds like the unit circle, torus, sphere and
their extensions. Typically, such data can be represented using one or more
directions, and directional statistics is the branch of statistics that deals
with their analysis. In this paper we provide a review of the many recent
developments in the field since the publication of Mardia and Jupp (1999),
still the most comprehensive text on directional statistics. Many of those
developments have been stimulated by interesting applications in fields as
diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics,
image analysis, text mining, environmetrics, and machine learning. We begin by
considering developments for the exploratory analysis of directional data
before progressing to distributional models, general approaches to inference,
hypothesis testing, regression, nonparametric curve estimation, methods for
dimension reduction, classification and clustering, and the modelling of time
series, spatial and spatio-temporal data. An overview of currently available
software for analysing directional data is also provided, and potential future
developments discussed.
Comment: 61 pages
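A small sketch shows why Euclidean summaries can mislead on the circle. The arithmetic mean of the angles 355, 5 and 15 degrees is 125 degrees, far from all three observations, whereas the circular mean (the direction of the summed unit vectors) is 5 degrees. The function name and the example angles are assumptions chosen for illustration; they are not taken from the review.

```python
import math

def circular_mean(angles_deg):
    """Direction, in degrees within [0, 360), of the sum of the unit vectors."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

angles = [355.0, 5.0, 15.0]
arithmetic = sum(angles) / len(angles)   # treats the angles as points on a line
directional = circular_mean(angles)      # respects the wrap-around at 360 degrees
print(arithmetic, directional)
```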
Bandwidth choice for nonparametric classification
It is shown that, for kernel-based classification with univariate
distributions and two populations, optimal bandwidth choice has a dichotomous
character. If the two densities cross at just one point, where their curvatures
have the same signs, then minimum Bayes risk is achieved using bandwidths which
are an order of magnitude larger than those which minimize pointwise estimation
error. On the other hand, if the curvature signs are different, or if there are
multiple crossing points, then bandwidths of conventional size are generally
appropriate. The range of different modes of behavior is narrower in
multivariate settings. There, the optimal size of bandwidth is generally the
same as that which is appropriate for pointwise density estimation. These
properties motivate empirical rules for bandwidth choice.
Comment: Published at http://dx.doi.org/10.1214/009053604000000959 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
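To make the role of the bandwidth concrete, here is a minimal sketch of a kernel-based classifier for two univariate populations. Each density is estimated with a Gaussian kernel, and a point is assigned to the population whose estimated density, weighted by sample size, is larger. The Gaussian kernel, the single shared bandwidth `h`, the toy data and the function names are assumptions for illustration; the sketch does not implement the optimal bandwidth rates derived in the paper.

```python
import math

def gaussian_kde(sample, h):
    """Return a function x -> kernel density estimate with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm
    return density

def kernel_classifier(sample0, sample1, h):
    """Return a function x -> 0 or 1 using sample-size-weighted density estimates."""
    f0, f1 = gaussian_kde(sample0, h), gaussian_kde(sample1, h)
    n0, n1 = len(sample0), len(sample1)
    def classify(x):
        return 1 if n1 * f1(x) > n0 * f0(x) else 0
    return classify

def error_rate(classify, test0, test1):
    """Fraction of labelled test points that the classifier gets wrong."""
    wrong = sum(classify(x) != 0 for x in test0) + sum(classify(x) != 1 for x in test1)
    return wrong / (len(test0) + len(test1))

# Hypothetical toy data: two well-separated populations.
pop0 = [-1.5, -1.0, -0.5, 0.0, 0.5]
pop1 = [1.5, 2.0, 2.5, 3.0, 3.5]
for h in (0.1, 0.5, 2.0):
    rule = kernel_classifier(pop0, pop1, h)
    print(h, error_rate(rule, pop0, pop1))
```

Evaluating `error_rate` on held-out data over a grid of `h` values is one simple empirical way to compare bandwidths, since the bandwidth that is best for classification need not be the one that is best for density estimation.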
k-nearest neighbors prediction and classification for spatial data
We propose a nonparametric predictor and a supervised classification rule based on
the regression function estimate of a real-valued spatial variable, using the k-nearest
neighbors (k-NN) method. Under some assumptions, we establish almost complete
or almost sure convergence of the proposed estimates, which incorporate the spatial
proximity between observations. Numerical results on simulated and real fish
data illustrate the behavior of the proposed predictor and classification method.
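One simple way to let a k-NN predictor account for spatial proximity is sketched below: only observations whose sites lie within a given radius of the query site are eligible, and among those the k nearest in covariate space are averaged. This is an illustrative assumption, not the estimator studied in the paper; the hard spatial cutoff, the scalar covariate, the 0.5 classification threshold and all names are hypothetical.

```python
import math

def spatial_knn_predict(sites, covariates, responses,
                        query_site, query_cov, k, radius):
    """Average the responses of the k covariate-nearest points near query_site."""
    # Step 1: keep only observations whose site is spatially close to the query.
    nearby = [i for i, s in enumerate(sites) if math.dist(s, query_site) <= radius]
    if not nearby:
        raise ValueError("no observations within the spatial radius")
    # Step 2: among those, take the k nearest neighbours in covariate space.
    nearest = sorted(nearby, key=lambda i: abs(covariates[i] - query_cov))[:k]
    return sum(responses[i] for i in nearest) / len(nearest)

def spatial_knn_classify(sites, covariates, labels,
                         query_site, query_cov, k, radius, threshold=0.5):
    """Binary rule: threshold the k-NN regression estimate of P(label = 1)."""
    estimate = spatial_knn_predict(sites, covariates, labels,
                                   query_site, query_cov, k, radius)
    return 1 if estimate >= threshold else 0

# Hypothetical toy data: three sites near the origin, two far away.
sites = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
covariates = [1.0, 1.2, 0.9, 1.1, 1.0]
responses = [2.0, 4.0, 3.0, 100.0, 110.0]
prediction = spatial_knn_predict(sites, covariates, responses,
                                 query_site=(0.5, 0.5), query_cov=1.0,
                                 k=2, radius=2.0)
print(prediction)
```

In the toy data the far-away sites have similar covariates but very different responses, so ignoring location would pull the prediction toward them; the spatial restriction prevents that.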