On Low Distortion Embeddings of Statistical Distance Measures into Low Dimensional Spaces
Statistical distance measures have found wide applicability in information
retrieval tasks that typically involve high dimensional datasets. In order to
reduce the storage space and ensure efficient performance of queries,
dimensionality reduction while preserving the inter-point similarity is highly
desirable. In this paper, we investigate various statistical distance measures
from the point of view of discovering low distortion embeddings into
low-dimensional spaces. More specifically, we consider the Mahalanobis distance
measure, the Bhattacharyya class of divergences and the Kullback-Leibler
divergence. We present a dimensionality reduction method based on the
Johnson-Lindenstrauss Lemma for the Mahalanobis measure that achieves
arbitrarily low distortion. By using the Johnson-Lindenstrauss Lemma again, we
further demonstrate that the Bhattacharyya distance admits dimensionality
reduction with arbitrarily low additive error. We also examine the question of
embeddability into metric spaces for these distance measures due to the
availability of efficient indexing schemes on metric spaces. We provide
explicit constructions of point sets under the Bhattacharyya and the
Kullback-Leibler divergences whose embeddings into any metric space incur
arbitrarily large distortions. We show that the lower bound presented for
Bhattacharyya distance is nearly tight by providing an embedding that
approaches the lower bound for relatively low-dimensional datasets.

Comment: 18 pages. The short version of this paper was accepted for
presentation at the 20th International Conference on Database and Expert
Systems Applications, DEXA 2009.
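The whitening-then-projection pipeline for the Mahalanobis measure can be sketched concretely: factor the PSD matrix as M = L Lᵀ so Mahalanobis distance becomes Euclidean distance on the transformed points, then apply a Johnson-Lindenstrauss random projection. A minimal sketch (sizes and variable names are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n points in d dimensions and a PSD matrix M defining the
# Mahalanobis distance d_M(x, y) = sqrt((x - y)^T M (x - y)).
n, d, k = 50, 100, 40
X = rng.normal(size=(n, d))
A = rng.normal(size=(d, d))
M = A @ A.T + np.eye(d)                   # positive definite by construction

# Step 1: whiten. With M = L L^T (Cholesky), the Mahalanobis distance
# equals the Euclidean distance between the transformed points L^T x.
L = np.linalg.cholesky(M)
Y = X @ L                                 # row i is (L^T x_i)^T

# Step 2: Johnson-Lindenstrauss. A scaled Gaussian random projection
# preserves all pairwise Euclidean distances up to a (1 +/- eps) factor
# with high probability, for k = O(log n / eps^2).
R = rng.normal(size=(d, k)) / np.sqrt(k)
Z = Y @ R                                 # n points in k << d dimensions

def mahalanobis(x, y, M):
    diff = x - y
    return float(np.sqrt(diff @ M @ diff))

orig = mahalanobis(X[0], X[1], M)
emb = float(np.linalg.norm(Z[0] - Z[1]))
print(orig, emb)                          # embedded distance tracks the original
```

The whitening step is exact; only the random projection introduces (arbitrarily small) distortion, which is what lets the overall reduction achieve any target error.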
Learning Embeddings into Entropic Wasserstein Spaces
Euclidean embeddings of data are fundamentally limited in their ability to
capture latent semantic structures, which need not conform to Euclidean spatial
assumptions. Here we consider an alternative, which embeds data as discrete
probability distributions in a Wasserstein space, endowed with an optimal
transport metric. Wasserstein spaces are much larger and more flexible than
Euclidean spaces, in that they can successfully embed a wider variety of metric
structures. We exploit this flexibility by learning an embedding that captures
semantic information in the Wasserstein distance between embedded
distributions. We examine empirically the representational capacity of our
learned Wasserstein embeddings, showing that they can embed a wide variety of
metric structures with smaller distortion than an equivalent Euclidean
embedding. We also investigate an application to word embedding, demonstrating
a unique advantage of Wasserstein embeddings: We can visualize the
high-dimensional embedding directly, since it is a probability distribution on
a low-dimensional space. This obviates the need for dimensionality reduction
techniques like t-SNE for visualization.

Comment: ICLR 2019.
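For intuition on the underlying machinery: the entropy-regularized Wasserstein distance between two discrete distributions can be computed with Sinkhorn iterations. A standard sketch (not the paper's learned-embedding code; the histograms and ground metric are illustrative):

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=0.1, iters=500):
    """Entropy-regularized optimal transport cost <P, C> between
    histograms a and b under cost matrix C (plain Sinkhorn iterations)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]       # the (approximate) transport plan
    return float(np.sum(P * C))

# Two distributions supported on 5 points of a 1-D ground space.
x = np.linspace(0.0, 1.0, 5)
C = np.abs(x[:, None] - x[None, :])       # ground metric |x_i - x_j|
a = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 0.0, 0.5, 0.5])
print(sinkhorn_cost(a, b, C))             # ~0.75: the cost of shifting the mass
```

Because each embedded point is itself a distribution over a low-dimensional support, plotting that support directly is what makes the visualization claim in the abstract possible.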
Planar Earthmover is not in $L_1$
We show that any $L_1$ embedding of the transportation cost (a.k.a.
Earthmover) metric on probability measures supported on the grid
$\{0,1,\ldots,n\}^2$ incurs distortion $\Omega(\sqrt{\log n})$. We
also use Fourier analytic techniques to construct a simple $L_1$ embedding of
this space which has distortion $O(\log n)$.
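For intuition about the metric being embedded: when both distributions are uniform over equally many points, the earthmover cost reduces to a minimum-cost perfect matching, which can be brute-forced for tiny instances. A minimal sketch with two illustrative 4-point grid configurations:

```python
import itertools, math

# For uniform distributions over equally many points, the earthmover
# distance reduces to a minimum-cost perfect matching, solvable by brute
# force for tiny instances (here: two 4-point configurations on a grid).
src = [(0, 0), (0, 1), (1, 0), (1, 1)]
dst = [(2, 2), (2, 3), (3, 2), (3, 3)]

def ground_dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Average ground distance moved under the best one-to-one matching.
emd = min(
    sum(ground_dist(p, q) for p, q in zip(src, perm)) / len(src)
    for perm in itertools.permutations(dst)
)
print(emd)  # 2*sqrt(2): every unit of mass travels the (2, 2) shift
```

Exact solvers scale poorly, which is exactly why low-distortion embeddings of this metric into normed spaces are sought, and why the lower bound above is significant.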
Metric dimension reduction: A snapshot of the Ribe program
The purpose of this article is to survey some of the context, achievements,
challenges and mysteries of the field of metric dimension reduction, including
new perspectives on major older results as well as recent advances.

Comment: proceedings of ICM 2018.
Mining Mass Spectra: Metric Embeddings and Fast Near Neighbor Search
Mining large-scale high-throughput tandem mass spectrometry data sets is a
very important problem in mass spectrometry based protein identification. One
of the fundamental problems in large scale mining of spectra is to design
appropriate metrics and algorithms to avoid all-pairwise comparisons of
spectra. In this paper, we present a general framework based on vector spaces
to avoid pair-wise comparisons. We first robustly embed spectra in a high
dimensional space in a novel fashion and then apply fast approximate near
neighbor algorithms for tasks such as constructing filters for database search,
indexing and similarity searching. We formally prove that our embedding has low
distortion compared to the cosine similarity, and, along with locality
sensitive hashing (LSH), we design filters for database search that can filter
out more than 98.9% of peptides (118 times less) while missing at most 0.29% of
the correct sequences. We then show how our framework can be used in similarity
searching, which can then be used to detect tight clusters or replicates. On
average, for a cluster size of 16 spectra, LSH misses only 1 spectrum and
admits only 1 false spectrum. In addition, our framework in conjunction with
dimension reduction techniques allows us to visualize large datasets in 2D
space. Our framework also has the potential to embed and compare datasets with
post-translational modifications (PTMs).

Comment: Computational Proteomics, Mass Spectrometry.
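The locality sensitive hashing step for cosine similarity can be illustrated with the standard random-hyperplane (SimHash) construction; this is a generic sketch, not the paper's spectra embedding, and the vectors below are synthetic stand-ins for embedded spectra:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random-hyperplane LSH for cosine similarity: each hash bit is the sign
# of a projection onto a random direction; two vectors agree on a bit
# with probability 1 - theta/pi, where theta is the angle between them.
d, n_bits = 64, 256
R = rng.normal(size=(n_bits, d))

def simhash(x):
    return (R @ x > 0).astype(np.uint8)

x = rng.normal(size=d)
y = x + 0.1 * rng.normal(size=d)          # a near-duplicate "spectrum"
z = rng.normal(size=d)                    # an unrelated vector

# The fraction of agreeing bits estimates angular similarity, so near
# neighbors can be found by bucketing on short hash prefixes instead of
# comparing all pairs.
agree_near = float(np.mean(simhash(x) == simhash(y)))
agree_far = float(np.mean(simhash(x) == simhash(z)))
print(agree_near, agree_far)
```

Bucketing on a few such bits is what lets filters discard the vast majority of candidate peptides without pairwise comparison.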
Kernel Cuts: MRF meets Kernel & Spectral Clustering
We propose a new segmentation model combining common regularization energies,
e.g. Markov Random Field (MRF) potentials, and standard pairwise clustering
criteria like Normalized Cut (NC), average association (AA), etc. These
clustering and regularization models are widely used in machine learning and
computer vision, but they were not combined before due to significant
differences in the corresponding optimization, e.g. spectral relaxation and
combinatorial max-flow techniques. On the one hand, we show that many common
applications using MRF segmentation energies can benefit from a high-order NC
term, e.g. enforcing balanced clustering of arbitrary high-dimensional image
features combining color, texture, location, depth, motion, etc. On the other
hand, standard clustering applications can benefit from an inclusion of common
pairwise or higher-order MRF constraints, e.g. edge alignment, bin-consistency,
label cost, etc. To address joint energies like NC+MRF, we propose efficient
Kernel Cut algorithms based on bound optimization. While focusing on graph cut
and move-making techniques, our new unary (linear) kernel and spectral bound
formulations for common pairwise clustering criteria allow them to be integrated
with any regularization functionals using existing discrete or continuous
solvers.

Comment: The main ideas of this work are published in our conference papers:
"Normalized cut meets MRF" [70] (ECCV 2016) and "Secrets of Grabcut and
kernel K-means" [41] (ICCV 2015).
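The kernel clustering side of these joint energies can be illustrated with a minimal Lloyd-style kernel k-means, the building block referenced in "Secrets of Grabcut and kernel K-means". This is a generic sketch, not the authors' bound-optimization Kernel Cut algorithm, and the data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_kmeans(K, k, iters=20):
    """Lloyd-style kernel k-means on a precomputed kernel matrix K.
    Squared feature-space distance from point i to the mean of cluster c:
    K_ii - 2 * mean_{j in c} K_ij + mean_{j,l in c} K_jl."""
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)
    for _ in range(iters):
        D = np.zeros((n, k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                D[:, c] = np.inf          # empty cluster: never chosen
                continue
            D[:, c] = (np.diag(K)
                       - 2.0 * K[:, idx].mean(axis=1)
                       + K[np.ix_(idx, idx)].mean())
        labels = D.argmin(axis=1)
    return labels

# Two well-separated blobs with a linear kernel (reduces to plain k-means);
# swapping in an RBF kernel handles non-linearly separable structure.
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 10.0])
K = X @ X.T
labels = kernel_kmeans(K, 2)
print(labels)
```

Because the update only touches the kernel matrix, the same loop accepts arbitrary high-dimensional feature kernels (color, texture, depth, motion), which is the setting where the NC+MRF combination in the abstract applies.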
Stochastic approximation of lamplighter metrics
We observe that embeddings into random metrics can be fruitfully used to
study the $L_1$-embeddability of lamplighter graphs or groups, and more
generally lamplighter metric spaces. Once this connection has been established,
several new upper bound estimates on the $L_1$-distortion of lamplighter
metrics follow from known related estimates about stochastic embeddings into
dominating tree-metrics. For instance, every lamplighter metric on an $n$-point
metric space embeds bi-Lipschitzly into $L_1$ with distortion $O(\log n)$. In
particular, for every finite group $G$ the lamplighter group $\mathbb{Z}_2 \wr
G$ bi-Lipschitzly embeds into $L_1$ with distortion $O(\log |G|)$. In the case
where the ground space in the lamplighter construction is a graph with some
topological restrictions, better distortion estimates can be achieved. Finally,
we discuss how a coarse embedding into $L_1$ of the lamplighter group over the
$d$-dimensional infinite lattice can be constructed from bi-Lipschitz
embeddings of the lamplighter graphs over finite $d$-dimensional grids, and we
include a remark on Lipschitz free spaces over finite metric spaces.

Comment: The paper has been completely rewritten (now 14 pages). It contains
more results and better quantitative estimates. The title has been changed to
reflect the different, more general, and more efficient approach taken.
Exploring High-Dimensional Structure via Axis-Aligned Decomposition of Linear Projections
Two-dimensional embeddings remain the dominant approach to visualize high
dimensional data. The choice of embeddings ranges from highly non-linear ones,
which can capture complex relationships but are difficult to interpret
quantitatively, to axis-aligned projections, which are easy to interpret but
are limited to bivariate relationships. Linear projections can be considered a
compromise between complexity and interpretability, as they allow explicit axis
labels, yet provide significantly more degrees of freedom compared to
axis-aligned projections. Nevertheless, interpreting the axes directions, which
are linear combinations often with many non-trivial components, remains
difficult. To address this problem, we introduce a structure-aware decomposition
of (multiple) linear projections into sparse sets of axis-aligned projections,
which jointly capture all information of the original linear ones. In
particular, we use tools from Dempster-Shafer theory to formally define how
relevant a given axis-aligned projection is to explain the neighborhood relations
displayed in some linear projection. Furthermore, we introduce a new approach
to discover a diverse set of high quality linear projections and show that in
practice the information of linear projections is often jointly encoded in
axis-aligned plots. We have integrated these ideas into an interactive
visualization system that allows users to jointly browse both linear
projections and their axis-aligned representatives. Using a number of case
studies, we show how the resulting plots lead to more intuitive visualizations
and new insights.
Near-Isometric Binary Hashing for Large-scale Datasets
We develop a scalable algorithm to learn binary hash codes for indexing
large-scale datasets. Near-isometric binary hashing (NIBH) is a data-dependent
hashing scheme that quantizes the output of a learned low-dimensional embedding
to obtain a binary hash code. In contrast to conventional hashing schemes,
which typically rely on an $\ell_2$-norm (i.e., average distortion)
minimization, NIBH is based on an $\ell_\infty$-norm (i.e., worst-case
distortion) minimization that provides several benefits, including superior
distance, ranking, and near-neighbor preservation performance. We develop a
practical and efficient algorithm for NIBH based on column generation that
scales well to large datasets. A range of experimental evaluations demonstrate
the superiority of NIBH over ten state-of-the-art binary hashing schemes.
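The average-versus-worst-case distinction at the heart of the scheme can be made concrete: quantize a low-dimensional embedding with sign() and compare the mean and maximum per-pair errors. In this sketch a random projection stands in for the learned embedding, so only the quantization step and the two error norms are illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pipeline shape: low-dimensional embedding -> sign quantization -> binary
# code. W is a placeholder for a learned map, not the NIBH solution.
n, d, bits = 30, 20, 64
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, bits))
codes = (X @ W > 0).astype(np.uint8)

def angular(u, v):
    """Angle between u and v, normalized to [0, 1]."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(c, -1.0, 1.0)) / np.pi)

# Normalized Hamming distance approximates the normalized angle; collect
# per-pair errors and compare average (l2-style) vs worst-case
# (l_inf-style) distortion, the two objectives contrasted in the abstract.
errs = []
for i in range(n):
    for j in range(i + 1, n):
        ham = float(np.mean(codes[i] != codes[j]))
        errs.append(abs(ham - angular(X[i], X[j])))
errs = np.array(errs)
print(errs.mean(), errs.max())
```

Minimizing the maximum (rather than the mean) of such per-pair errors is what gives a near-isometric guarantee for every pair, which in turn preserves rankings and near neighbors.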
Social Connection Induces Cultural Contraction: Evidence from Hyperbolic Embeddings of Social and Semantic Networks
Research has repeatedly demonstrated the influence of social connection and
communication on convergence in cultural tastes, opinions and ideas. Here we
review recent studies and consider the implications of social connection on
cultural, epistemological and ideological contraction, then formalize these
intuitions within the language of information theory. To systematically examine
connectivity and cultural diversity, we introduce new methods of manifold
learning to map both social networks and topic combinations into comparable,
two-dimensional hyperbolic spaces or Poincaré disks, which represent both
hierarchy and diversity within a system. On a Poincaré disk, radius from
center traces the position of an actor in a social hierarchy or an idea in a
cultural hierarchy. The angle of the disk required to inscribe connected actors
or ideas captures their diversity. Using this method in the epistemic culture
of 21st Century physics, we empirically demonstrate that denser university
collaborations systematically contract the space of topics discussed and ideas
investigated more than shared topics drive collaboration, despite the extreme
commitments academic physicists make to research programs over the course of
their careers. Dense connections unleash flows of communication that contract
otherwise fragmented semantic spaces into convergent hubs or polarized
clusters. We theorize the dynamic interplay between structural expansion and
cultural contraction and explore how this introduces an essential tension
between the enjoyment and protection of difference.
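The geometric reading of radius-as-hierarchy and angle-as-diversity follows from the Poincaré disk's distance formula, which a few lines make concrete (the example points are illustrative, not data from the study):

```python
import math

# Hyperbolic distance in the Poincare disk model (points with |u| < 1):
# d(u, v) = arcosh(1 + 2|u - v|^2 / ((1 - |u|^2) (1 - |v|^2))).
def poincare_dist(u, v):
    nu = u[0] ** 2 + u[1] ** 2
    nv = v[0] ** 2 + v[1] ** 2
    duv = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    return math.acosh(1.0 + 2.0 * duv / ((1.0 - nu) * (1.0 - nv)))

# Radius encodes hierarchy: the center acts like a root, points near the
# rim like leaves, and the angle between rim points captures diversity.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)
print(poincare_dist(root, leaf_a))        # root-to-leaf
print(poincare_dist(leaf_a, leaf_b))      # leaf-to-leaf: much larger
```

Because rim-to-rim distances blow up relative to center-to-rim distances, many near-rim items can stay mutually far apart, which is why a single disk can encode both a hierarchy and the angular spread (diversity) of its leaves.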