8,008 research outputs found
Robust Recovery of Subspace Structures by Low-Rank Representation
In this work we address the subspace recovery problem. Given a set of data
samples (vectors) approximately drawn from a union of multiple subspaces, our
goal is to segment the samples into their respective subspaces and correct the
possible errors as well. To this end, we propose a novel method termed Low-Rank
Representation (LRR), which seeks the lowest-rank representation among all the
candidates that can represent the data samples as linear combinations of the
bases in a given dictionary. It is shown that LRR well solves the subspace
recovery problem: when the data is clean, we prove that LRR exactly captures
the true subspace structures; for the data contaminated by outliers, we prove
that under certain conditions LRR can exactly recover the row space of the
original data and detect the outlier as well; for the data corrupted by
arbitrary errors, LRR can also approximately recover the row space with
theoretical guarantees. Since the subspace membership is provably determined by
the row space, these further imply that LRR can perform robust subspace
segmentation and error correction, in an efficient way.Comment: IEEE Trans. Pattern Analysis and Machine Intelligenc
Discriminative variable selection for clustering with the sparse Fisher-EM algorithm
The interest in variable selection for clustering has increased recently due
to the growing need in clustering high-dimensional data. Variable selection
allows in particular to ease both the clustering and the interpretation of the
results. Existing approaches have demonstrated the efficiency of variable
selection for clustering but turn out to be either very time consuming or not
sparse enough in high-dimensional spaces. This work proposes to perform a
selection of the discriminative variables by introducing sparsity in the
loading matrix of the Fisher-EM algorithm. This clustering method has been
recently proposed for the simultaneous visualization and clustering of
high-dimensional data. It is based on a latent mixture model which fits the
data into a low-dimensional discriminative subspace. Three different approaches
are proposed in this work to introduce sparsity in the orientation matrix of
the discriminative subspace through -type penalizations. Experimental
comparisons with existing approaches on simulated and real-world data sets
demonstrate the interest of the proposed methodology. An application to the
segmentation of hyperspectral images of the planet Mars is also presented
Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors,
"Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
The discriminative functional mixture model for a comparative analysis of bike sharing systems
Bike sharing systems (BSSs) have become a means of sustainable intermodal
transport and are now proposed in many cities worldwide. Most BSSs also provide
open access to their data, particularly to real-time status reports on their
bike stations. The analysis of the mass of data generated by such systems is of
particular interest to BSS providers to update system structures and policies.
This work was motivated by interest in analyzing and comparing several European
BSSs to identify common operating patterns in BSSs and to propose practical
solutions to avoid potential issues. Our approach relies on the identification
of common patterns between and within systems. To this end, a model-based
clustering method, called FunFEM, for time series (or more generally functional
data) is developed. It is based on a functional mixture model that allows the
clustering of the data in a discriminative functional subspace. This model
presents the advantage in this context to be parsimonious and to allow the
visualization of the clustered systems. Numerical experiments confirm the good
behavior of FunFEM, particularly compared to state-of-the-art methods. The
application of FunFEM to BSS data from JCDecaux and the Transport for London
Initiative allows us to identify 10 general patterns, including pathological
ones, and to propose practical improvement strategies based on the system
comparison. The visualization of the clustered data within the discriminative
subspace turns out to be particularly informative regarding the system
efficiency. The proposed methodology is implemented in a package for the R
software, named funFEM, which is available on the CRAN. The package also
provides a subset of the data analyzed in this work.Comment: Published at http://dx.doi.org/10.1214/15-AOAS861 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
- …