Spectral Harmonics: Bridging Spectral Embedding and Matrix Completion in Self-Supervised Learning
Self-supervised methods have received tremendous attention thanks to their
seemingly heuristic approach to learning representations that respect the
semantics of the data without any apparent supervision in the form of labels. A
growing body of literature is already being published in an attempt to build a
coherent and theoretically grounded understanding of the workings of a zoo of
losses used in modern self-supervised representation learning methods. In this
paper, we attempt to provide an understanding from the perspective of a Laplace
operator and connect the inductive bias stemming from the augmentation process
to a low-rank matrix completion problem. To this end, we leverage the results
from low-rank matrix completion to provide theoretical analysis on the
convergence of modern SSL methods and a key property that affects their
downstream performance.
Comment: 12 pages, 3 figures
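The Laplace-operator perspective this abstract describes can be illustrated with a minimal spectral-embedding sketch. The toy graph (two cliques joined by a bridge edge) and all choices below are mine, not from the paper: representations are read off the smoothest nontrivial eigenvectors of the symmetric normalized graph Laplacian.

```python
import numpy as np

def spectral_embedding(A, dim=2):
    """Embed graph nodes using eigenvectors of the symmetric normalized Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    # L_sym = I - D^{-1/2} A D^{-1/2}
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)  # eigenvalues ascending
    # Skip the trivial (constant-direction) eigenvector; keep the next `dim`.
    return vecs[:, 1:dim + 1]

# Toy graph: two 5-node cliques joined by a single bridge edge.
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1; A[5:, 5:] = 1
np.fill_diagonal(A, 0)
A[4, 5] = A[5, 4] = 1
emb = spectral_embedding(A, dim=1)
# The sign of the Fiedler coordinate separates the two cliques.
print((emb[:5, 0] > 0).all() != (emb[5:, 0] > 0).all())
```

On this graph the one-dimensional embedding already encodes the cluster structure, which is the kind of inductive bias the abstract ties to the augmentation process.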
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provable optimal recovery using the algorithm is analytically shown
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed.
Comment: 13 figures, 35 references
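The combination the abstract describes, a Laplacian eigenspace coordinate fed into a finite mixture model to get fuzzy memberships, can be sketched as follows. The toy network, the deterministic EM initialization, and the two-component choice are mine, not from the paper:

```python
import numpy as np

def fiedler_coordinate(A):
    """1-D Laplacian eigenspace coordinate: second-smallest eigenvector of L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1]

def gmm_1d(x, k=2, iters=200):
    """EM for a 1-D Gaussian mixture; returns the soft (fuzzy) membership matrix."""
    mu = np.linspace(x.min(), x.max(), k)      # deterministic, spread-out init
    var = np.full(k, x.var() + 1e-9)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances from the responsibilities
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
        pi = nk / len(x)
    return r

# Toy network: two 5-node cliques joined by one bridge edge.
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1; A[5:, 5:] = 1
np.fill_diagonal(A, 0)
A[4, 5] = A[5, 4] = 1
memberships = gmm_1d(fiedler_coordinate(A))   # rows sum to 1: fuzzy memberships
labels = memberships.argmax(axis=1)
```

The membership matrix is probabilistic rather than hard, which is what makes overlapping regions of influence representable.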
Data Representation for Learning and Information Fusion in Bioinformatics
This thesis deals with the rigorous application of nonlinear dimension reduction and data organization techniques to biomedical data analysis. The Laplacian Eigenmaps algorithm is representative of these methods and has been widely applied in manifold learning and related areas. While their asymptotic manifold recovery behavior has been well-characterized, the clustering properties of Laplacian embeddings with finite data are largely motivated by heuristic arguments. We develop a precise bound, characterizing cluster structure preservation under Laplacian embeddings. From this foundation, we introduce flexible and mathematically well-founded approaches for information fusion and feature representation. These methods are applied to three substantial case studies in bioinformatics, illustrating their capacity to extract scientifically valuable information from complex data.
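The Laplacian Eigenmaps algorithm the abstract refers to can be sketched on feature vectors with the classical heat-kernel construction. The two-blob data, the kernel width, and the assertion that the embedding preserves cluster structure on this toy example are my own illustration, not the thesis's bound:

```python
import numpy as np

def laplacian_eigenmaps(X, dim=1, t=1.0):
    """Laplacian Eigenmaps: heat-kernel weights W_ij = exp(-||xi - xj||^2 / t),
    then the smoothest nontrivial solutions of L f = lam D f (via L_sym)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / t)
    np.fill_diagonal(W, 0.0)
    d_is = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalized Laplacian; its eigenvectors map back to the
    # generalized eigenvectors of (D - W) f = lam D f via f = D^{-1/2} v.
    L_sym = np.eye(len(X)) - d_is[:, None] * W * d_is[None, :]
    vals, vecs = np.linalg.eigh(L_sym)
    return (d_is[:, None] * vecs)[:, 1:dim + 1]  # drop the trivial eigenvector

# Two well-separated Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(4.0, 0.3, (20, 2))])
Y = laplacian_eigenmaps(X, dim=1, t=2.0)
# Cluster structure survives the embedding: the two blobs get opposite signs.
```

This finite-sample cluster-preservation behavior is exactly what the thesis sets out to bound precisely rather than argue heuristically.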
Robust Principal Component Analysis?
This paper is about a curious phenomenon. Suppose we have a data matrix,
which is the superposition of a low-rank component and a sparse component. Can
we recover each component individually? We prove that under some suitable
assumptions, it is possible to recover both the low-rank and the sparse
components exactly by solving a very convenient convex program called Principal
Component Pursuit; among all feasible decompositions, simply minimize a
weighted combination of the nuclear norm and the L1 norm. This suggests the
possibility of a principled approach to robust principal component analysis
since our methodology and results assert that one can recover the principal
components of a data matrix even though a positive fraction of its entries are
arbitrarily corrupted. This extends to the situation where a fraction of the
entries are missing as well. We discuss an algorithm for solving this
optimization problem, and present applications in the area of video
surveillance, where our methodology allows for the detection of objects in a
cluttered background, and in the area of face recognition, where it offers a
principled way of removing shadows and specularities in images of faces.
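The convex program the abstract names, Principal Component Pursuit, is commonly solved by ADMM with alternating singular-value and soft thresholding. The sketch below follows that standard scheme; the step size mu, the synthetic rank-2 matrix, and the 5% corruption level are my choices, not from the paper:

```python
import numpy as np

def pcp(M, lam=None, mu=None, iters=500, tol=1e-7):
    """Principal Component Pursuit via ADMM:
    minimize ||L||_* + lam * ||S||_1  subject to  L + S = M."""
    n1, n2 = M.shape
    lam = lam or 1.0 / np.sqrt(max(n1, n2))          # weight from the paper's theory
    mu = mu or n1 * n2 / (4.0 * np.abs(M).sum())     # common step-size heuristic
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(iters):
        # Singular value thresholding -> low-rank update
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Entrywise soft thresholding -> sparse update
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual ascent on the constraint L + S = M
        Y += mu * (M - L - S)
        if np.linalg.norm(M - L - S) <= tol * np.linalg.norm(M):
            break
    return L, S

rng = np.random.default_rng(0)
L0 = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 60))   # rank-2 component
S0 = np.zeros((60, 60))
idx = rng.random((60, 60)) < 0.05                          # 5% corrupted entries
S0[idx] = rng.normal(scale=10.0, size=idx.sum())
L_hat, S_hat = pcp(L0 + S0)
print(np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))     # relative error, expected small
```

Note how the corruption magnitudes are large and arbitrary; recovery relies only on the low-rank-plus-sparse structure, which is the "curious phenomenon" the abstract describes.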
Sparse Modeling for Image and Vision Processing
In recent years, a large amount of multi-disciplinary research has been
conducted on sparse models and their applications. In statistics and machine
learning, the sparsity principle is used to perform model selection---that is,
automatically selecting a simple model among a large collection of them. In
signal processing, sparse coding consists of representing data with linear
combinations of a few dictionary elements. Subsequently, the corresponding
tools have been widely adopted by several scientific communities such as
neuroscience, bioinformatics, or computer vision. The goal of this monograph is
to offer a self-contained view of sparse modeling for visual recognition and
image processing. More specifically, we focus on applications where the
dictionary is learned and adapted to data, yielding a compact representation
that has been successful in various contexts.
Comment: 205 pages, to appear in Foundations and Trends in Computer Graphics
and Vision
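Sparse coding, representing a signal as a linear combination of a few dictionary elements, can be sketched with Orthogonal Matching Pursuit, one standard greedy solver (a choice of mine; the monograph covers a range of methods). The random dictionary and planted 3-sparse code below are synthetic:

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal Matching Pursuit: greedily select k dictionary atoms for x."""
    residual, support = x.copy(), []
    for _ in range(k):
        # Pick the atom most correlated with the current residual
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        # Least-squares fit on the selected atoms, then refresh the residual
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    code = np.zeros(D.shape[1])
    code[support] = coef
    return code

rng = np.random.default_rng(0)
D = rng.normal(size=(32, 64))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
true = np.zeros(64); true[[3, 17, 42]] = [1.5, -2.0, 1.0]
x = D @ true                              # signal built from 3 atoms
code = omp(D, x, k=3)
print(np.allclose(D @ code, x))
```

In the dictionary-learning setting the abstract emphasizes, D itself would also be optimized to fit the data rather than drawn at random.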
Nonlinear Dimensionality Reduction Methods in Climate Data Analysis
Linear dimensionality reduction techniques, notably principal component
analysis, are widely used in climate data analysis as a means to aid in the
interpretation of datasets of high dimensionality. These linear methods may not
be appropriate for the analysis of data arising from nonlinear processes
occurring in the climate system. Numerous techniques for nonlinear
dimensionality reduction have been developed recently that may provide a
potentially useful tool for the identification of low-dimensional manifolds in
climate data sets arising from nonlinear dynamics. In this thesis I apply three
such techniques to the study of El Nino/Southern Oscillation variability in
tropical Pacific sea surface temperatures and thermocline depth, comparing
observational data with simulations from coupled atmosphere-ocean general
circulation models from the CMIP3 multi-model ensemble.
The three methods used here are a nonlinear principal component analysis
(NLPCA) approach based on neural networks, the Isomap isometric mapping
algorithm, and Hessian locally linear embedding. I use these three methods to
examine El Nino variability in the different data sets and assess the
suitability of these nonlinear dimensionality reduction approaches for climate
data analysis.
I conclude that although, for the application presented here, analysis using
NLPCA, Isomap and Hessian locally linear embedding does not provide additional
information beyond that already provided by principal component analysis, these
methods are effective tools for exploratory data analysis.
Comment: 273 pages, 76 figures; University of Bristol Ph.D. thesis; version
with high-resolution figures available from
http://www.skybluetrades.net/thesis/ian-ross-thesis.pdf (52Mb download
- …
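Of the three methods the thesis compares, Isomap is the easiest to sketch end to end: build a k-nearest-neighbor graph, approximate geodesic distances by shortest paths, then apply classical MDS. The spiral stand-in for a nonlinear low-dimensional manifold, and the neighborhood size, are my own choices, not the thesis's climate data:

```python
import numpy as np

def isomap(X, n_neighbors=8, dim=1):
    """Isomap sketch: kNN graph -> geodesic distances (Floyd-Warshall) -> classical MDS."""
    n = len(X)
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    # Keep only each point's n_neighbors nearest edges; np.inf marks a non-edge
    G = np.full((n, n), np.inf)
    rows = np.arange(n)[:, None]
    nn = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]
    G[rows, nn] = d[rows, nn]
    G = np.minimum(G, G.T)                # symmetrize: undirected graph
    np.fill_diagonal(G, 0.0)
    # All-pairs shortest paths (Floyd-Warshall) approximate geodesic distance
    for k in range(n):
        G = np.minimum(G, G[:, k, None] + G[None, k, :])
    # Classical MDS on the squared geodesic distances
    J = np.eye(n) - 1.0 / n               # double-centering matrix
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# A 1-D spiral manifold embedded in the plane; Isomap should "unroll" it
t = np.linspace(0.5, 3 * np.pi, 200)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
emb = isomap(X, n_neighbors=8, dim=1)
corr = np.corrcoef(emb[:, 0], t)[0, 1]   # coordinate should track position along the spiral
```

Linear PCA cannot unroll such a curve, which is the motivation the abstract gives for trying nonlinear methods on ENSO variability.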