Representation Learning for Clustering: A Statistical Framework
We address the problem of communicating domain knowledge from a user to the
designer of a clustering algorithm. We propose a protocol in which the user
provides a clustering of a relatively small random sample of a data set. The
algorithm designer then uses that sample to come up with a data representation
under which k-means clustering results in a clustering (of the full data set)
that is aligned with the user's clustering. We provide a formal statistical
model for analyzing the sample complexity of learning a clustering
representation with this paradigm. We then introduce a notion of capacity of a
class of possible representations, in the spirit of the VC-dimension, showing
that classes of representations that have finite such dimension can be
successfully learned with sample size error bounds, and end our discussion with
an analysis of that dimension for classes of representations induced by linear
embeddings. Comment: To be published in Proceedings of UAI 201
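The protocol described in this abstract can be illustrated with a small sketch. The abstract does not specify the loss function or how the representation is searched for, so everything here is an assumption made for illustration: the candidate representations are a finite, hand-picked dictionary of linear maps, agreement is measured by a pair-counting disagreement, and the helper names (`embed`, `kmeans_labels`, `pair_disagreement`, `pick_representation`) are hypothetical.

```python
from itertools import combinations

def embed(matrix, point):
    """Apply a linear map (given as a list of rows) to a point."""
    return tuple(sum(m * x for m, x in zip(row, point)) for row in matrix)

def kmeans_labels(points, k, iters=50):
    """Plain Lloyd's algorithm; centers start at the first k points (assumed init)."""
    centers = list(points[:k])
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def pair_disagreement(labels_a, labels_b):
    """Fraction of point pairs on which the two clusterings disagree
    (together in one clustering, apart in the other)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    bad = sum((labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
              for i, j in pairs)
    return bad / len(pairs)

def pick_representation(candidates, sample, user_labels, k):
    """Return the name of the candidate map whose k-means clustering of the
    user-labelled sample is closest to the user's own clustering."""
    def loss(name):
        embedded = [embed(candidates[name], p) for p in sample]
        return pair_disagreement(kmeans_labels(embedded, k), user_labels)
    return min(candidates, key=loss)
```

For example, with a sample `[(0, 0), (1, 10), (0, 10), (1, 0)]` that the user clusters by the first coordinate, a candidate that keeps only the first coordinate is preferred over the identity map, because k-means on the raw data splits along the second coordinate instead.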
Supervised classification and mathematical optimization
Data Mining techniques often require solving optimization problems. Supervised Classification, and, in particular, Support Vector Machines, can be seen as a paradigmatic instance. In this paper, some links between Mathematical Optimization methods and Supervised Classification are emphasized. It is shown that many different areas of Mathematical Optimization play a central role in off-the-shelf Supervised Classification methods. Moreover, Mathematical Optimization turns out to be extremely useful to address important issues in Classification, such as identifying relevant variables, improving the interpretability of classifiers or dealing with vagueness/noise in the data.
Ministerio de Ciencia e Innovación; Junta de Andalucí
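As one standard illustration of how a Support Vector Machine is an optimization problem (this formulation is textbook material and is not quoted from the abstract), the soft-margin SVM for training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ can be written as a convex quadratic program. The symbols $w$, $b$, $\xi_i$, and the trade-off constant $C > 0$ follow the usual convention and are assumed here.

```latex
\min_{w,\,b,\,\xi}\ \ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n.
```

The slack variables $\xi_i$ absorb margin violations, which is one way noise in the data is handled.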
A Geometric Approach to Sound Source Localization from Time-Delay Estimates
This paper addresses the problem of sound-source localization from time-delay
estimates using arbitrarily-shaped non-coplanar microphone arrays. A novel
geometric formulation is proposed, together with a thorough algebraic analysis
and a global optimization solver. The proposed model is thoroughly described
and evaluated. The geometric analysis, stemming from the direct acoustic
propagation model, leads to necessary and sufficient conditions for a set of
time delays to correspond to a unique position in the source space. Such sets
of time delays are referred to as feasible sets. We formally prove that every
feasible set corresponds to exactly one position in the source space, whose
value can be recovered using a closed-form localization mapping. Therefore, we
seek the optimal feasible set of time delays given, as input, the received
microphone signals. This time delay estimation problem is naturally cast into a
programming task, constrained by the feasibility conditions derived from the
geometric analysis. A global branch-and-bound optimization technique is
proposed to solve the problem at hand, hence estimating the best set of
feasible time delays and, subsequently, localizing the sound source. Extensive
experiments with both simulated and real data are reported; we compare our
methodology to four state-of-the-art techniques. This comparison clearly shows
that the proposed method combined with the branch-and-bound algorithm
outperforms existing methods. This in-depth geometric understanding, the
practical algorithms, and the encouraging results open several opportunities
for future work. Comment: 13 pages, 2 figures, 3 table, journa
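The direct acoustic propagation model mentioned in this abstract can be sketched as follows. The abstract gives no formulas, so this is the usual direct-path assumption rather than the paper's own notation: the delay between two microphones is the difference in their distances to the source divided by the speed of sound. The function names and the value 343 m/s are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 °C

def time_delay(source, mic_a, mic_b, c=SPEED_OF_SOUND):
    """Arrival time at mic_a minus arrival time at mic_b, assuming the sound
    travels in a straight line from the source to each microphone."""
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / c

def all_delays(source, mics, c=SPEED_OF_SOUND):
    """Delays for every ordered pair of distinct microphones, keyed by (i, j)."""
    n = len(mics)
    return {
        (i, j): time_delay(source, mics[i], mics[j], c)
        for i in range(n)
        for j in range(n)
        if i != j
    }
```

Delays generated this way from a single source position are antisymmetric (the delay for `(i, j)` is the negative of the delay for `(j, i)`) and add up along chains of microphones. Arbitrary measured delays need not have these properties, which gives some intuition for why the abstract restricts attention to feasible sets.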
Principal Component Analysis for Functional Data on Riemannian Manifolds and Spheres
Functional data analysis on nonlinear manifolds has drawn recent interest.
Sphere-valued functional data, which are encountered for example as movement
trajectories on the surface of the earth, are an important special case. We
consider an intrinsic principal component analysis for smooth Riemannian
manifold-valued functional data and study its asymptotic properties. Riemannian
functional principal component analysis (RFPCA) is carried out by first mapping
the manifold-valued data through Riemannian logarithm maps to tangent spaces
around the time-varying Fréchet mean function, and then performing a
classical multivariate functional principal component analysis on the linear
tangent spaces. Representations of the Riemannian manifold-valued functions and
the eigenfunctions on the original manifold are then obtained with exponential
maps. The tangent-space approximation through functional principal component
analysis is shown to be well-behaved in terms of controlling the residual
variation if the Riemannian manifold has nonnegative curvature. Specifically,
we derive a central limit theorem for the mean function, as well as root-n
uniform convergence rates for other model components, including the covariance
function, eigenfunctions, and functional principal component scores. Our
applications include a novel framework for the analysis of longitudinal
compositional data, achieved by mapping longitudinal compositional data to
trajectories on the sphere, illustrated with longitudinal fruit fly behavior
patterns. RFPCA is shown to be superior in terms of trajectory recovery in
comparison to an unrestricted functional principal component analysis in
applications and simulations and is also found to produce principal component
scores that are better predictors for classification compared to traditional
functional principal component scores.