Recent advances in directional statistics
Mainstream statistical methodology is generally applicable to data observed
in Euclidean space. There are, however, numerous contexts of considerable
scientific interest in which the natural supports for the data under
consideration are Riemannian manifolds like the unit circle, torus, sphere and
their extensions. Typically, such data can be represented using one or more
directions, and directional statistics is the branch of statistics that deals
with their analysis. In this paper we provide a review of the many recent
developments in the field since the publication of Mardia and Jupp (1999),
still the most comprehensive text on directional statistics. Many of those
developments have been stimulated by interesting applications in fields as
diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics,
image analysis, text mining, environmetrics, and machine learning. We begin by
considering developments for the exploratory analysis of directional data
before progressing to distributional models, general approaches to inference,
hypothesis testing, regression, nonparametric curve estimation, methods for
dimension reduction, classification and clustering, and the modelling of time
series, spatial and spatio-temporal data. An overview of currently available
software for analysing directional data is also provided, and potential future
developments discussed.
Comment: 61 pages
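As a small illustration of why Euclidean methodology can fail on the unit circle, the sketch below compares the arithmetic mean of two angles with the mean direction obtained from the resultant of the corresponding unit vectors. The function names `circular_mean` and `mean_resultant_length` are chosen here for illustration and do not come from the paper.

```python
import numpy as np

def circular_mean(angles_rad):
    """Mean direction: the angle of the average unit vector, in [0, 2*pi)."""
    angles_rad = np.asarray(angles_rad, dtype=float)
    s = np.sin(angles_rad).mean()
    c = np.cos(angles_rad).mean()
    return np.arctan2(s, c) % (2 * np.pi)

def mean_resultant_length(angles_rad):
    """Length of the average unit vector: near 1 means concentrated data."""
    angles_rad = np.asarray(angles_rad, dtype=float)
    return np.hypot(np.sin(angles_rad).mean(), np.cos(angles_rad).mean())

# Two directions on either side of 0 degrees.
angles = np.deg2rad([350.0, 10.0])

# The arithmetic mean lands at about 180 degrees, opposite to both observations.
naive_deg = np.rad2deg(angles.mean())

# The mean direction is 0 degrees (equivalently 360), up to rounding error.
circ_deg = np.rad2deg(circular_mean(angles))
concentration = mean_resultant_length(angles)

print(naive_deg, circ_deg, concentration)
```

The mean resultant length is reported alongside the mean direction because the direction is poorly determined when that length is close to zero.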
A One-Sample Test for Normality with Kernel Methods
We propose a new one-sample test for normality in a Reproducing Kernel
Hilbert Space (RKHS). Namely, we test the null hypothesis of belonging to a
given family of Gaussian distributions. Hence our procedure may be applied
either to test data for normality or to test parameters (mean and covariance)
if data are assumed Gaussian. Our test is based on the same principle as the
MMD (Maximum Mean Discrepancy) which is usually used for two-sample tests such
as homogeneity or independence testing. Our method makes use of a special kind
of parametric bootstrap (typical of goodness-of-fit tests) which is
computationally more efficient than standard parametric bootstrap. Moreover, an
upper bound for the Type-II error highlights the dependence on influential
quantities. Experiments illustrate the practical improvement allowed by our
test in high-dimensional settings where common normality tests are known to
fail. We also consider an application to covariance rank selection through a
sequential procedure.
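The MMD principle mentioned in this abstract can be sketched as follows. This is not the authors' one-sample test, which compares the data with a Gaussian family and uses a special parametric bootstrap. It is only the standard biased two-sample estimate of the squared MMD with a Gaussian kernel, computed against a reference sample drawn from a standard Gaussian. The bandwidth, sample sizes, and function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
n, d = 300, 2

# Reference sample from the hypothesised Gaussian (assumed here: N(0, I)).
reference = rng.standard_normal((n, d))

# One sample that follows the hypothesis and one that is shifted away from it.
gaussian_data = rng.standard_normal((n, d))
shifted_data = rng.standard_normal((n, d)) + 2.0

mmd_null = mmd2_biased(gaussian_data, reference)
mmd_alt = mmd2_biased(shifted_data, reference)
print(mmd_null, mmd_alt)
```

A complete test would still need a calibrated rejection threshold, for example by a bootstrap under the null hypothesis, which this sketch leaves out.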
In Search of Non-Gaussian Components of a High-Dimensional Distribution
Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the "non-Gaussian subspace" within a very general semi-parametric framework. Our proposed method, called NGCA (Non-Gaussian Component Analysis), is essentially based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector which belongs to the low-dimensional non-Gaussian target subspace up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.
Keywords: non-Gaussian components, dimension reduction
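The two steps described in this abstract (map many smooth functions to vectors, then apply PCA to those vectors) can be sketched as below. The abstract does not spell out the operator, so the sketch assumes a Stein-type choice on whitened data, beta(h) = E[x h(x)] - E[grad h(x)], which vanishes along purely Gaussian directions. The family h(x) = tanh(w . x) with random unit vectors w, the uncentred PCA via SVD, and all names are illustrative assumptions. Refinements such as normalising the vectors are omitted.

```python
import numpy as np

def whiten(x):
    """Centre the data and rescale it to (empirical) identity covariance."""
    centered = x - x.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return centered @ evecs @ np.diag(evals ** -0.5) @ evecs.T

def ngca_sketch(x, n_functions=200, n_components=1, seed=0):
    """Estimate non-Gaussian directions (in whitened coordinates)."""
    rng = np.random.default_rng(seed)
    z = whiten(x)
    n, d = z.shape
    # Family of smooth functions h_k(z) = tanh(w_k . z), random unit w_k.
    w = rng.standard_normal((n_functions, d))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    h = np.tanh(z @ w.T)              # shape (n, n_functions)
    dh = 1.0 - h ** 2                 # tanh'(w_k . z); grad h_k = dh * w_k
    # beta_k = mean(z * h_k(z)) - mean(grad h_k(z)), one row per function.
    betas = (h.T @ z) / n - dh.mean(axis=0)[:, None] * w
    # Uncentred PCA of the beta vectors: leading right singular vectors.
    _, _, vt = np.linalg.svd(betas, full_matrices=False)
    return vt[:n_components]

# Synthetic data: one bimodal (non-Gaussian) coordinate with unit variance,
# four Gaussian coordinates, then a random rotation to hide the structure.
rng = np.random.default_rng(1)
n, d = 20000, 5
x = rng.standard_normal((n, d))
signs = rng.choice([-1.0, 1.0], size=n)
x[:, 0] = 0.9 * signs + np.sqrt(1.0 - 0.81) * rng.standard_normal(n)
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
mixed = x @ q.T
true_direction = q[:, 0]

estimated = ngca_sketch(mixed)[0]
alignment = abs(estimated @ true_direction)
print(alignment)
```

The alignment is the absolute cosine between the estimated and the true direction, so values near 1 indicate that the hidden non-Gaussian direction was recovered up to sign.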
An Infinitesimal Probabilistic Model for Principal Component Analysis of Manifold Valued Data
We provide a probabilistic and infinitesimal view of how the principal
component analysis procedure (PCA) can be generalized to analysis of nonlinear
manifold valued data. Starting with the probabilistic PCA interpretation of the
Euclidean PCA procedure, we show how PCA can be generalized to manifolds in an
intrinsic way that does not resort to linearization of the data space. The
underlying probability model is constructed by mapping a Euclidean stochastic
process to the manifold using stochastic development of Euclidean
semimartingales. The construction uses a connection and bundles of covariant
tensors to allow global transport of principal eigenvectors, and the model is
thereby an example of how principal fiber bundles can be used to handle the
lack of global coordinate system and orientations that characterizes manifold
valued statistics. We show how curvature implies non-integrability of the
equivalent of Euclidean principal subspaces, and how the stochastic flows
provide an alternative to explicit construction of such subspaces. We describe
estimation procedures for inference of parameters and prediction of principal
components, and we give examples of properties of the model on embedded
surfaces.
Kernel methods in machine learning
We review machine learning methods employing positive definite kernels. These
methods formulate learning and estimation problems in a reproducing kernel
Hilbert space (RKHS) of functions defined on the data domain, expanded in terms
of a kernel. Working in linear spaces of functions has the benefit of
facilitating the construction and analysis of learning algorithms while at the
same time allowing large classes of functions. The latter include nonlinear
functions as well as functions defined on nonvectorial data. We cover a wide
range of methods, ranging from binary classifiers to sophisticated methods for
estimation with structured data.
Comment: Published at http://dx.doi.org/10.1214/009053607000000677 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
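To make the phrase "expanded in terms of a kernel" concrete, here is a minimal sketch of kernel ridge regression, one of the simplest estimators of this type. The fitted function is a kernel expansion f(x) = sum_i alpha_i k(x_i, x) with alpha = (K + lambda I)^{-1} y. The Gaussian kernel, the bandwidth, the regularisation value, and the function names are assumptions chosen for illustration, not taken from the review.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))

def fit_kernel_ridge(x_train, y_train, reg=1e-3, bandwidth=1.0):
    """Solve (K + reg * I) alpha = y for the expansion coefficients alpha."""
    gram = gaussian_kernel(x_train, x_train, bandwidth)
    return np.linalg.solve(gram + reg * np.eye(len(x_train)), y_train)

def predict_kernel_ridge(x_new, x_train, alpha, bandwidth=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x) at the new inputs."""
    return gaussian_kernel(x_new, x_train, bandwidth) @ alpha

# Noise-free samples of a nonlinear target function.
x_train = np.linspace(0.0, 2.0 * np.pi, 50)
y_train = np.sin(x_train)

alpha = fit_kernel_ridge(x_train, y_train)

# Evaluate away from the training grid, inside the sampled interval.
x_test = np.linspace(0.5, 5.5, 37)
y_pred = predict_kernel_ridge(x_test, x_train, alpha)
max_error = np.max(np.abs(y_pred - np.sin(x_test)))
print(max_error)
```

The estimator is linear in the coefficients alpha even though the fitted function is nonlinear in x, which is the benefit of working in a linear function space that the abstract points to.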
Representation Learning: A Review and New Perspectives
The success of machine learning algorithms generally depends on data
representation, and we hypothesize that this is because different
representations can entangle and hide more or less the different explanatory
factors of variation behind the data. Although specific domain knowledge can be
used to help design representations, learning with generic priors can also be
used, and the quest for AI is motivating the design of more powerful
representation-learning algorithms implementing such priors. This paper reviews
recent work in the area of unsupervised feature learning and deep learning,
covering advances in probabilistic models, auto-encoders, manifold learning,
and deep networks. This motivates longer-term unanswered questions about the
appropriate objectives for learning good representations, for computing
representations (i.e., inference), and the geometrical connections between
representation learning, density estimation and manifold learning.