Four lectures on probabilistic methods for data science
Methods of high-dimensional probability play a central role in applications
to statistics, signal processing, theoretical computer science, and related
fields. These lectures present a sample of particularly useful tools of
high-dimensional probability, focusing on the classical and matrix Bernstein
inequalities and the uniform matrix deviation inequality. We illustrate these
tools with applications to dimension reduction, network analysis, covariance
estimation, matrix completion and sparse signal recovery. The lectures are
geared towards beginning graduate students who have taken a rigorous course in
probability but may not have any experience in data science applications.
Comment: Lectures given at the 2016 PCMI Graduate Summer School in Mathematics of Data. Some typos and inaccuracies fixed.
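A minimal numpy sketch of the dimension-reduction application these lectures cover: a Gaussian random projection approximately preserves pairwise distances between high-dimensional points (a Johnson-Lindenstrauss-style embedding). The dimensions and data below are illustrative, not taken from the lectures.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 50, 1000, 300  # points, ambient dimension, target dimension
X = rng.standard_normal((n, d))

# Gaussian projection with entries scaled by 1/sqrt(k), so squared norms
# are preserved in expectation and concentrate by Bernstein-type bounds.
G = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ G

def pdist(A):
    """All pairwise Euclidean distances between rows of A."""
    sq = (A ** 2).sum(1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0))

D0, D1 = pdist(X), pdist(Y)
mask = ~np.eye(n, dtype=bool)
ratio = D1[mask] / D0[mask]
print(ratio.min(), ratio.max())  # all distortion ratios concentrate near 1
```

Here 1000-dimensional points are embedded into 300 dimensions while every pairwise distance is distorted by only a small factor.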
Sparse Subspace Clustering: Algorithm, Theory, and Applications
In many real-world problems, we are dealing with collections of
high-dimensional data, such as images, videos, text and web documents, DNA
microarray data, and more. Often, high-dimensional data lie close to
low-dimensional structures corresponding to the several classes or categories
to which the data belong. In this paper, we propose and study an algorithm, called
Sparse Subspace Clustering (SSC), to cluster data points that lie in a union of
low-dimensional subspaces. The key idea is that, among infinitely many possible
representations of a data point in terms of other points, a sparse
representation corresponds to selecting a few points from the same subspace.
This motivates solving a sparse optimization program whose solution is used in
a spectral clustering framework to infer the clustering of data into subspaces.
Since solving the sparse optimization program is in general NP-hard, we
consider a convex relaxation and show that, under appropriate conditions on the
arrangement of subspaces and the distribution of data, the proposed
minimization program succeeds in recovering the desired sparse representations.
The proposed optimization program can be solved efficiently, and the algorithm
can handle data points near the intersections of subspaces. Another key advantage of the proposed
algorithm with respect to the state of the art is that it can deal with data
nuisances, such as noise, sparse outlying entries, and missing entries,
directly by incorporating the model of the data into the sparse optimization
program. We demonstrate the effectiveness of the proposed algorithm through
experiments on synthetic data as well as the two real-world problems of motion
segmentation and face clustering.
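A toy numpy sketch of the SSC pipeline on two orthogonal lines in R^3: each point is sparsely represented by one other point (a 1-sparse greedy stand-in for the paper's convex relaxation), and spectral clustering on the resulting affinity recovers the two subspaces. The data, the 1-sparse selection rule, and the label-extraction step are simplifying assumptions, not the paper's algorithm in full.

```python
import numpy as np

# Two orthogonal 1-D subspaces (lines) in R^3, five points on each.
u1, u2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
scales = np.arange(1.0, 6.0)
X = np.vstack([np.outer(scales, u1), np.outer(scales, u2)])  # 10 x 3
n = len(X)

# Sparse self-representation: greedily pick, for each point, the single
# other point that best explains it, and its least-squares coefficient.
C = np.zeros((n, n))
for i in range(n):
    corr = np.abs(X @ X[i])
    corr[i] = -np.inf                       # forbid the trivial x_i = x_i
    j = int(np.argmax(corr))
    C[i, j] = (X[j] @ X[i]) / (X[j] @ X[j])

# Spectral clustering on the symmetrized affinity: for exact subspace
# data the affinity is block-diagonal, so the Laplacian null space is
# spanned by cluster indicators, constant on each connected component.
W = np.abs(C) + np.abs(C).T
L = np.diag(W.sum(1)) - W
_, V = np.linalg.eigh(L)
V = V[:, :2]
labels = (np.linalg.norm(V - V[0], axis=1) > 1e-6).astype(int)
print(labels)  # -> [0 0 0 0 0 1 1 1 1 1]
```

Because cross-subspace correlations vanish here, every point selects a neighbor from its own line, which is exactly the "sparse representation selects points from the same subspace" idea the abstract describes.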
Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means
We analyze a compression scheme for large data sets that randomly keeps a
small percentage of the components of each data sample. The benefit is that the
output is a sparse matrix and therefore subsequent processing, such as PCA or
K-means, is significantly faster, especially in a distributed-data setting.
Furthermore, the sampling is single-pass and applicable to streaming data. The
sampling mechanism is a variant of previous methods proposed in the literature
combined with a randomized preconditioning to smooth the data. We provide
guarantees for PCA in terms of the covariance matrix, and guarantees for
K-means in terms of the error in the center estimators at a given step. We
present numerical evidence to show both that our bounds are nearly tight and
that our algorithms provide a real benefit when applied to standard test data
sets, as well as providing certain benefits over related sampling approaches.
Comment: 28 pages, 10 figures.
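A minimal numpy sketch of the core sparsification step: keep each entry of the data matrix independently with probability p and rescale the survivors by 1/p, so the sparse matrix is an unbiased entrywise estimate of the original. The preconditioning step and the PCA/K-means guarantees are not reproduced here; sizes and the keep probability are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 2000, 50, 0.2  # samples, features, keep probability

X = rng.standard_normal((n, d))

# Keep each entry with probability p; rescale kept entries by 1/p so
# that E[Y] = X entrywise. Y has roughly p*n*d nonzeros.
mask = rng.random((n, d)) < p
Y = np.where(mask, X / p, 0.0)

print(mask.mean())  # fraction of entries kept, close to p
# Column means of the sparse matrix track those of the dense one:
err = np.abs(Y.mean(0) - X.mean(0)).max()
```

The sampling is single-pass, as the abstract notes: each entry's keep/drop decision and rescaling depend only on that entry, so it applies directly to streaming or distributed data.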
Painting Analysis Using Wavelets and Probabilistic Topic Models
In this paper, computer-based techniques for stylistic analysis of paintings
are applied to the five panels of the 14th century Peruzzi Altarpiece by Giotto
di Bondone. Features are extracted by combining a dual-tree complex wavelet
transform with a hidden Markov tree (HMT) model. Hierarchical clustering is
used to identify stylistic keywords in image patches, and keyword frequencies
are calculated for sub-images, each of which contains many patches. A generative
hierarchical Bayesian model learns stylistic patterns of keywords; these
patterns are then used to characterize the styles of the sub-images, which in
turn permits discrimination between paintings. Results suggest that such
unsupervised probabilistic topic models can be useful for distilling
characteristic elements of style.
Comment: 5 pages, 4 figures, ICIP 201
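A small numpy sketch of the bag-of-keywords step: patches are assigned to their nearest "stylistic keyword" in a codebook, and a sub-image's style signature is the resulting keyword-frequency histogram. The random codebook and synthetic patch features below are placeholders; in the paper, keywords come from hierarchical clustering of wavelet-HMT features.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for patch features; wavelet-HMT descriptors would go here.
k, d = 4, 8  # codebook size ("stylistic keywords"), feature dimension
codebook = rng.standard_normal((k, d))

def keyword_histogram(patches, codebook):
    """Assign each patch to its nearest keyword; return keyword frequencies."""
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    counts = np.bincount(d2.argmin(1), minlength=len(codebook))
    return counts / counts.sum()

# One "sub-image" = many patches; its style signature is the histogram.
patches = codebook[rng.integers(0, k, 200)] + 0.1 * rng.standard_normal((200, d))
hist = keyword_histogram(patches, codebook)
print(hist.round(2))
```

These per-sub-image frequency vectors are exactly the kind of input a topic model can then summarize into stylistic patterns.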