
    On Low Distortion Embeddings of Statistical Distance Measures into Low Dimensional Spaces

    Statistical distance measures have found wide applicability in information retrieval tasks that typically involve high-dimensional datasets. In order to reduce the storage space and ensure efficient performance of queries, dimensionality reduction while preserving the inter-point similarity is highly desirable. In this paper, we investigate various statistical distance measures from the point of view of discovering low-distortion embeddings into low-dimensional spaces. More specifically, we consider the Mahalanobis distance measure, the Bhattacharyya class of divergences and the Kullback-Leibler divergence. We present a dimensionality reduction method based on the Johnson-Lindenstrauss Lemma for the Mahalanobis measure that achieves arbitrarily low distortion. By using the Johnson-Lindenstrauss Lemma again, we further demonstrate that the Bhattacharyya distance admits dimensionality reduction with arbitrarily low additive error. We also examine the question of embeddability into metric spaces for these distance measures, motivated by the availability of efficient indexing schemes on metric spaces. We provide explicit constructions of point sets under the Bhattacharyya and the Kullback-Leibler divergences whose embeddings into any metric space incur arbitrarily large distortions. We show that the lower bound presented for the Bhattacharyya distance is nearly tight by providing an embedding that approaches the lower bound for datasets of relatively low dimension. Comment: 18 pages. A short version of this paper was accepted for presentation at the 20th International Conference on Database and Expert Systems Applications (DEXA 2009).
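
    The Mahalanobis reduction is a composition of two standard steps: factor the positive-definite matrix $M = LL^T$, map each point through $L$ so that the Mahalanobis distance becomes Euclidean, then apply a Johnson-Lindenstrauss random projection. Below is a minimal numpy sketch of that composition; the Gaussian projection and all dimensions are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 100, 30, 50  # original dimension, reduced dimension, number of points

# A random positive-definite matrix M defining the Mahalanobis distance,
# and its Cholesky factor L with M = L @ L.T.
A = rng.standard_normal((d, d))
M = A @ A.T + d * np.eye(d)
L = np.linalg.cholesky(M)

X = rng.standard_normal((n, d))                # data points
Y = X @ L                                      # Mahalanobis on X equals Euclidean on Y
P = rng.standard_normal((d, k)) / np.sqrt(k)   # JL random Gaussian projection
Z = Y @ P                                      # reduced k-dimensional points

def mahalanobis(x, y):
    diff = x - y
    return np.sqrt(diff @ M @ diff)

# The original Mahalanobis distance and the reduced Euclidean distance
# should agree up to a small multiplicative distortion.
i, j = 0, 1
print(mahalanobis(X[i], X[j]), np.linalg.norm(Z[i] - Z[j]))
```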

    Learning Embeddings into Entropic Wasserstein Spaces

    Euclidean embeddings of data are fundamentally limited in their ability to capture latent semantic structures, which need not conform to Euclidean spatial assumptions. Here we consider an alternative, which embeds data as discrete probability distributions in a Wasserstein space, endowed with an optimal transport metric. Wasserstein spaces are much larger and more flexible than Euclidean spaces, in that they can successfully embed a wider variety of metric structures. We exploit this flexibility by learning an embedding that captures semantic information in the Wasserstein distance between embedded distributions. We examine empirically the representational capacity of our learned Wasserstein embeddings, showing that they can embed a wide variety of metric structures with smaller distortion than an equivalent Euclidean embedding. We also investigate an application to word embedding, demonstrating a unique advantage of Wasserstein embeddings: we can visualize the high-dimensional embedding directly, since it is a probability distribution on a low-dimensional space. This obviates the need for dimensionality reduction techniques like t-SNE for visualization. Comment: ICLR 2019.
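
    The entropy-regularized optimal transport distance underlying such embeddings is typically computed with the Sinkhorn iteration. Below is a minimal numpy sketch of Sinkhorn on two 1-D histograms; it illustrates the metric only, not the paper's learned embedding or training pipeline.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=200):
    """Entropy-regularized OT cost between histograms a, b under cost matrix C."""
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)             # alternating Sinkhorn scaling updates
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]   # resulting transport plan
    return np.sum(T * C)

# Two discrete distributions supported on a 1-D grid.
x = np.linspace(0, 1, 50)
C = (x[:, None] - x[None, :]) ** 2    # squared-distance ground cost
a = np.exp(-((x - 0.3) ** 2) / 0.01); a /= a.sum()
b = np.exp(-((x - 0.7) ** 2) / 0.01); b /= b.sum()
print(sinkhorn(a, b, C))
```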

    Planar Earthmover is not in $L_1$

    We show that any $L_1$ embedding of the transportation cost (a.k.a. Earthmover) metric on probability measures supported on the grid $\{0,1,\dots,n\}^2 \subseteq \mathbb{R}^2$ incurs distortion $\Omega(\sqrt{\log n})$. We also use Fourier analytic techniques to construct a simple $L_1$ embedding of this space which has distortion $O(\log n)$.
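
    For concreteness, the transportation cost between two measures on a small grid can be computed exactly as a linear program. A short sketch using the POT library (an assumed dependency; `ot.dist` and `ot.emd2` are its standard calls) follows.

```python
import numpy as np
import ot  # Python Optimal Transport (POT), an assumed dependency

n = 8
grid = np.array([(i, j) for i in range(n + 1) for j in range(n + 1)], dtype=float)
M = ot.dist(grid, grid, metric='euclidean')  # ground metric on the grid {0,...,n}^2

rng = np.random.default_rng(1)
a = rng.random(len(grid)); a /= a.sum()      # two random probability measures
b = rng.random(len(grid)); b /= b.sum()

print(ot.emd2(a, b, M))                      # exact Earthmover distance via an LP
```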

    Metric dimension reduction: A snapshot of the Ribe program

    The purpose of this article is to survey some of the context, achievements, challenges and mysteries of the field of metric dimension reduction, including new perspectives on major older results as well as recent advances. Comment: proceedings of ICM 2018.

    Mining Mass Spectra: Metric Embeddings and Fast Near Neighbor Search

    Mining large-scale high-throughput tandem mass spectrometry data sets is a very important problem in mass spectrometry based protein identification. One of the fundamental problems in large-scale mining of spectra is to design appropriate metrics and algorithms to avoid all-pairwise comparisons of spectra. In this paper, we present a general framework based on vector spaces to avoid pairwise comparisons. We first robustly embed spectra in a high-dimensional space in a novel fashion and then apply fast approximate near neighbor algorithms for tasks such as constructing filters for database search, indexing and similarity searching. We formally prove that our embedding has low distortion compared to the cosine similarity, and, along with locality sensitive hashing (LSH), we design filters for database search that can filter out more than 98.9% of peptides (118 times less) while missing at most 0.29% of the correct sequences. We then show how our framework can be used in similarity searching, which can in turn be used to detect tight clusters or replicates. On average, for a cluster size of 16 spectra, LSH misses only 1 spectrum and admits only 1 false spectrum. In addition, our framework in conjunction with dimension reduction techniques allows us to visualize large datasets in 2D space. Our framework also has the potential to embed and compare datasets with post-translational modifications (PTMs). Comment: Computational Proteomics, Mass Spectrometry.
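
    The standard LSH family for cosine similarity is random-hyperplane hashing (SimHash), where the probability that two vectors disagree on a hash bit grows with the angle between them. A minimal sketch follows; the spectrum embedding itself is paper-specific, so generic vectors stand in for embedded spectra.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 2000, 64                  # embedded-spectrum dimension, hash length

H = rng.standard_normal((n_bits, dim))  # random hyperplanes

def simhash(v):
    """Random-hyperplane LSH: the sign pattern approximates cosine similarity."""
    return H @ v > 0

def hamming(h1, h2):
    return np.count_nonzero(h1 != h2)

# A near-duplicate vector collides on most hash bits; an unrelated one differs
# on roughly half of them.
x = rng.standard_normal(dim)
y = x + 0.1 * rng.standard_normal(dim)  # near-duplicate
z = rng.standard_normal(dim)            # unrelated
print(hamming(simhash(x), simhash(y)), hamming(simhash(x), simhash(z)))
```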

    Kernel Cuts: MRF meets Kernel & Spectral Clustering

    We propose a new segmentation model combining common regularization energies, e.g. Markov Random Field (MRF) potentials, and standard pairwise clustering criteria like Normalized Cut (NC), average association (AA), etc. These clustering and regularization models are widely used in machine learning and computer vision, but they were not combined before due to significant differences in the corresponding optimization, e.g. spectral relaxation and combinatorial max-flow techniques. On the one hand, we show that many common applications using MRF segmentation energies can benefit from a high-order NC term, e.g. enforcing balanced clustering of arbitrary high-dimensional image features combining color, texture, location, depth, motion, etc. On the other hand, standard clustering applications can benefit from the inclusion of common pairwise or higher-order MRF constraints, e.g. edge alignment, bin consistency, label cost, etc. To address joint energies like NC+MRF, we propose efficient Kernel Cut algorithms based on bound optimization. While focusing on graph cut and move-making techniques, our new unary (linear) kernel and spectral bound formulations for common pairwise clustering criteria allow them to be integrated with any regularization functionals that have existing discrete or continuous solvers. Comment: The main ideas of this work are published in our conference papers: "Normalized cut meets MRF" [70] (ECCV 2016) and "Secrets of Grabcut and kernel K-means" [41] (ICCV 2015).
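
    The link the authors exploit between pairwise clustering criteria such as NC/AA and kernel methods is commonly expressed through kernel k-means. As a rough illustration only (toy data, not the paper's bound-optimization Kernel Cut algorithm), here is a minimal kernel k-means on a precomputed Gram matrix.

```python
import numpy as np

def kernel_kmeans(K, k, iters=20, seed=0):
    """Lloyd-style kernel k-means on a precomputed kernel (Gram) matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)
    diag = np.diag(K)
    for _ in range(iters):
        D = np.empty((n, k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                D[:, c] = np.inf
                continue
            # Squared distance to the cluster mean in feature space,
            # expanded purely in kernel terms.
            D[:, c] = (diag
                       - 2 * K[:, idx].mean(axis=1)
                       + K[np.ix_(idx, idx)].mean())
        labels = D.argmin(axis=1)
    return labels

# Toy example: RBF kernel on two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
print(kernel_kmeans(K, 2))
```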

    Stochastic approximation of lamplighter metrics

    We observe that embeddings into random metrics can be fruitfully used to study the $L_1$-embeddability of lamplighter graphs or groups, and more generally lamplighter metric spaces. Once this connection has been established, several new upper bound estimates on the $L_1$-distortion of lamplighter metrics follow from known related estimates about stochastic embeddings into dominating tree-metrics. For instance, every lamplighter metric on an $n$-point metric space embeds bi-Lipschitzly into $L_1$ with distortion $O(\log n)$. In particular, for every finite group $G$ the lamplighter group $H = \mathbb{Z}_2 \wr G$ bi-Lipschitzly embeds into $L_1$ with distortion $O(\log\log|H|)$. In the case where the ground space in the lamplighter construction is a graph with some topological restrictions, better distortion estimates can be achieved. Finally, we discuss how a coarse embedding into $L_1$ of the lamplighter group over the $d$-dimensional infinite lattice $\mathbb{Z}^d$ can be constructed from bi-Lipschitz embeddings of the lamplighter graphs over finite $d$-dimensional grids, and we include a remark on Lipschitz free spaces over finite metric spaces. Comment: The paper has been completely rewritten (now 14 pages). It contains more results and better quantitative estimates. The title has been changed to reflect the different, more general, and more efficient approach taken.
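
    For readers unfamiliar with the object: in the lamplighter graph over a graph $G$, a state pairs a lamp configuration with a lamplighter position, and one standard description of the metric (assumed here; conventions on lamp-move costs vary) is:

```latex
% State: (f, x), where f : V(G) -> Z_2 is a finitely supported lamp
% configuration and x in V(G) is the lamplighter's position. To travel from
% (f, x) to (g, y), the lamplighter must visit and toggle every vertex at
% which f and g disagree:
d\bigl((f,x),(g,y)\bigr)
  = \lvert \operatorname{supp}(f - g) \rvert
  + \min\bigl\{ \operatorname{len}(W) \;:\; W \text{ a walk from } x \text{ to } y
      \text{ visiting all of } \operatorname{supp}(f - g) \bigr\}.
```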

    Exploring High-Dimensional Structure via Axis-Aligned Decomposition of Linear Projections

    Two-dimensional embeddings remain the dominant approach to visualizing high-dimensional data. The choice of embeddings ranges from highly non-linear ones, which can capture complex relationships but are difficult to interpret quantitatively, to axis-aligned projections, which are easy to interpret but are limited to bivariate relationships. Linear projections can be considered a compromise between complexity and interpretability, as they allow explicit axis labels, yet provide significantly more degrees of freedom than axis-aligned projections. Nevertheless, interpreting the axis directions, which are linear combinations often with many non-trivial components, remains difficult. To address this problem we introduce a structure-aware decomposition of (multiple) linear projections into sparse sets of axis-aligned projections, which jointly capture all information of the original linear ones. In particular, we use tools from Dempster-Shafer theory to formally define how relevant a given axis-aligned projection is to explaining the neighborhood relations displayed in some linear projection. Furthermore, we introduce a new approach to discovering a diverse set of high-quality linear projections and show that in practice the information of $k$ linear projections is often jointly encoded in $\sim k$ axis-aligned plots. We have integrated these ideas into an interactive visualization system that allows users to jointly browse both linear projections and their axis-aligned representatives. Using a number of case studies we show how the resulting plots lead to more intuitive visualizations and new insights.
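
    The paper's relevance measure is built on Dempster-Shafer theory; as a much cruder stand-in that conveys the flavor, one can score each coordinate pair by how well distances in its axis-aligned projection correlate with distances in a given linear projection. The sketch below does exactly that on random data; everything in it is a hypothetical simplification, not the authors' method.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))                  # toy high-dimensional data
W = np.linalg.qr(rng.standard_normal((6, 2)))[0]   # a random 2-D linear projection
P = X @ W

def pdist(Y):
    """Upper-triangular vector of pairwise Euclidean distances."""
    D = np.sqrt(((Y[:, None] - Y[None, :]) ** 2).sum(-1))
    return D[np.triu_indices(len(Y), 1)]

d_lin = pdist(P)
# Score every axis-aligned (coordinate-pair) projection by how well its
# pairwise distances correlate with those of the linear projection.
scores = {(i, j): np.corrcoef(d_lin, pdist(X[:, [i, j]]))[0, 1]
          for i, j in combinations(range(X.shape[1]), 2)}
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])  # best surrogate pairs
```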

    Near-Isometric Binary Hashing for Large-scale Datasets

    We develop a scalable algorithm to learn binary hash codes for indexing large-scale datasets. Near-isometric binary hashing (NIBH) is a data-dependent hashing scheme that quantizes the output of a learned low-dimensional embedding to obtain a binary hash code. In contrast to conventional hashing schemes, which typically rely on an $\ell_2$-norm (i.e., average distortion) minimization, NIBH is based on an $\ell_\infty$-norm (i.e., worst-case distortion) minimization that provides several benefits, including superior distance, ranking, and near-neighbor preservation performance. We develop a practical and efficient algorithm for NIBH based on column generation that scales well to large datasets. A range of experimental evaluations demonstrate the superiority of NIBH over ten state-of-the-art binary hashing schemes.
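
    The contrast between the two objectives is easy to see numerically: after quantization, an $\ell_2$-style criterion controls the average discrepancy between original and (rescaled) Hamming distances, while an $\ell_\infty$ criterion targets the worst pair. In the sketch below a plain random projection stands in for the learned embedding, which is an assumption; only the two distortion measurements are the point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, bits = 100, 50, 16

X = rng.standard_normal((n, d))
P = rng.standard_normal((d, bits)) / np.sqrt(bits)  # stand-in for a learned embedding
B = np.sign(X @ P)                                  # quantized +/-1 hash codes

iu = np.triu_indices(n, 1)
orig = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))[iu]  # Euclidean distances
hamm = (B[:, None] != B[None, :]).sum(-1)[iu].astype(float)   # Hamming distances

scale = (orig @ hamm) / (hamm @ hamm)         # best single rescaling of Hamming
rel = np.abs(scale * hamm - orig) / orig
print("average distortion   :", rel.mean())   # what an l2-style objective controls
print("worst-case distortion:", rel.max())    # what an l_inf objective controls
```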

    Social Connection Induces Cultural Contraction: Evidence from Hyperbolic Embeddings of Social and Semantic Networks

    Research has repeatedly demonstrated the influence of social connection and communication on convergence in cultural tastes, opinions and ideas. Here we review recent studies and consider the implications of social connection for cultural, epistemological and ideological contraction, then formalize these intuitions in the language of information theory. To systematically examine connectivity and cultural diversity, we introduce new methods of manifold learning to map both social networks and topic combinations into comparable, two-dimensional hyperbolic spaces or Poincaré disks, which represent both hierarchy and diversity within a system. On a Poincaré disk, radius from center traces the position of an actor in a social hierarchy or an idea in a cultural hierarchy. The angle of the disk required to inscribe connected actors or ideas captures their diversity. Using this method on the epistemic culture of 21st-century physics, we empirically demonstrate that denser university collaborations systematically contract the space of topics discussed and ideas investigated, more than shared topics drive collaboration, despite the extreme commitments academic physicists make to research programs over the course of their careers. Dense connections unleash flows of communication that contract otherwise fragmented semantic spaces into convergent hubs or polarized clusters. We theorize the dynamic interplay between structural expansion and cultural contraction and explore how this introduces an essential tension between the enjoyment and protection of difference.
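
    The geometric reading of the Poincaré disk, where equal Euclidean steps cost more near the boundary so radius naturally encodes hierarchy, can be checked with the standard hyperbolic distance formula; the snippet below is generic and not tied to the paper's embedding pipeline.

```python
import numpy as np

def poincare_distance(u, v):
    """Hyperbolic distance between points u, v inside the unit (Poincare) disk."""
    uu, vv = np.dot(u, u), np.dot(v, v)
    duv = np.dot(u - v, u - v)
    return np.arccosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))

center = np.array([0.0, 0.0])
mid    = np.array([0.5, 0.0])
edge   = np.array([0.95, 0.0])

# Radius encodes hierarchy: equal Euclidean steps cost more near the boundary.
print(poincare_distance(center, mid))  # ~1.10
print(poincare_distance(mid, edge))    # ~2.57, though the Euclidean gap is similar
```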