568 research outputs found

    Compressive Embedding and Visualization using Graphs

    Visualizing high-dimensional data has been a focus in data analysis communities for decades, which has led to the design of many algorithms, some of which are now considered references (such as t-SNE). In our era of overwhelming data volumes, the scalability of such methods has become more and more important. In this work, we present a method that makes it possible to apply any visualization or embedding algorithm to very large datasets by considering only a fraction of the data as input and then extending the information to all data points using a graph encoding its global similarity. We show that in most cases, using only $\mathcal{O}(\log(N))$ samples is sufficient to diffuse the information to all $N$ data points. In addition, we propose quantitative methods to measure the quality of embeddings and demonstrate the validity of our technique on both synthetic and real-world datasets.

    Sketching for Large-Scale Learning of Mixture Models

    Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass on the training set, and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical Expectation-Maximization (EM) technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over $10^8$ training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive information preservation guarantees, in the spirit of infinite-dimensional compressive sensing.
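
The single-pass, mergeable nature of such a sketch can be illustrated with a minimal sketch in Python. It assumes, as an illustration only, that the generalized moments are empirical averages of complex exponentials at random Gaussian frequencies (a random-feature choice suggested by the abstract but not specified in it). The names `draw_frequencies`, `sketch_sum`, `merge`, and `finalize` are hypothetical, and the parameter-estimation step is not shown.

```python
import numpy as np

def draw_frequencies(m, d, scale=1.0, seed=0):
    """Draw m random frequency vectors in R^d (Gaussian choice is an assumption)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=scale, size=(m, d))

def sketch_sum(X, W):
    """Unnormalized sketch of a data chunk: (sum over rows x of exp(i W x), row count)."""
    return np.exp(1j * X @ W.T).sum(axis=0), X.shape[0]

def merge(a, b):
    """Combine two partial sketches, e.g. from two stream chunks or two machines."""
    return a[0] + b[0], a[1] + b[1]

def finalize(partial):
    """Turn the running sum into an empirical average (the sketch itself)."""
    total, count = partial
    return total / count

# Toy data: 1000 points in 2 dimensions, sketched with 20 frequencies.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
W = draw_frequencies(m=20, d=2)

# Sketch computed in one pass over the full dataset.
full = finalize(sketch_sum(X, W))

# Same sketch computed from two chunks and merged afterwards.
streamed = finalize(merge(sketch_sum(X[:400], W), sketch_sum(X[400:], W)))
```

Because the sketch is an average, the chunked and full computations agree up to floating-point error, and the memory needed is the sketch size `m` regardless of the number of samples.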

    Optimal approximate matrix product in terms of stable rank

    We prove, using the subspace embedding guarantee in a black box way, that one can achieve the spectral norm guarantee for approximate matrix multiplication with a dimensionality-reducing map having $m = O(\tilde{r}/\varepsilon^2)$ rows. Here $\tilde{r}$ is the maximum stable rank, i.e. squared ratio of Frobenius and operator norms, of the two matrices being multiplied. This is a quantitative improvement over previous work of [MZ11, KVZ14], and is also optimal for any oblivious dimensionality-reducing map. Furthermore, due to the black box reliance on the subspace embedding property in our proofs, our theorem can be applied to a much more general class of sketching matrices than what was known before, in addition to achieving better bounds. For example, one can apply our theorem to efficient subspace embeddings such as the Subsampled Randomized Hadamard Transform or sparse subspace embeddings, or even with subspace embedding constructions that may be developed in the future. Our main theorem, via connections with spectral error matrix multiplication shown in prior work, implies quantitative improvements for approximate least squares regression and low rank approximation. Our main result has also already been applied to improve dimensionality reduction guarantees for $k$-means clustering [CEMMP14], and implies new results for nonparametric regression [YPW15]. We also separately point out that the proof of the "BSS" deterministic row-sampling result of [BSS12] can be modified to show that for any matrices $A, B$ of stable rank at most $\tilde{r}$, one can achieve the spectral norm guarantee for approximate matrix multiplication of $A^T B$ by deterministically sampling $O(\tilde{r}/\varepsilon^2)$ rows that can be found in polynomial time. The original result of [BSS12] was for rank instead of stable rank. Our observation leads to a stronger version of a main theorem of [KMST10]. Comment: v3: minor edits; v2: fixed one step in proof of Theorem 9 which was wrong by a constant factor (see the new Lemma 5 and its use; final theorem unaffected
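
The stable rank defined above, and the idea of multiplying sketched matrices instead of the originals, can be made concrete with a minimal sketch in Python. It rests on several assumptions that are not from the abstract: a dense Gaussian map stands in for the subspace embeddings it mentions, the hidden constant in $O(\tilde{r}/\varepsilon^2)$ is taken to be 1, and the error is reported relative to the product of the operator norms. The names `stable_rank` and `gaussian_sketch` are hypothetical.

```python
import numpy as np

def stable_rank(A):
    """Squared ratio of the Frobenius norm to the operator (spectral) norm."""
    fro = np.linalg.norm(A, "fro")
    op = np.linalg.norm(A, 2)
    return (fro / op) ** 2

def gaussian_sketch(m, n, rng):
    """Dense m x n Gaussian map, scaled so that E[S^T S] is the identity."""
    return rng.normal(size=(m, n)) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d = 2000, 30

# Two tall matrices that are close to rank 3, so their stable rank is small.
A = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d)) + 0.01 * rng.normal(size=(n, d))
B = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d)) + 0.01 * rng.normal(size=(n, d))

r = max(stable_rank(A), stable_rank(B))
eps = 0.5
m = int(np.ceil(r / eps**2))  # assumption: hidden constant set to 1

# Multiply the sketched matrices instead of the full ones.
S = gaussian_sketch(m, n, rng)
approx = (S @ A).T @ (S @ B)
exact = A.T @ B

# Spectral-norm error relative to ||A|| * ||B||.
rel_err = np.linalg.norm(approx - exact, 2) / (
    np.linalg.norm(A, 2) * np.linalg.norm(B, 2)
)
print(f"stable rank ~ {r:.2f}, sketch rows m = {m}, relative error = {rel_err:.3f}")
```

The point of the example is that `m` depends on the stable rank and on `eps`, not on the number of rows `n` or on the exact rank of the matrices.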

    An Approximate Shapley-Folkman Theorem

    The Shapley-Folkman theorem shows that Minkowski averages of uniformly bounded sets tend to be convex when the number of terms in the sum becomes much larger than the ambient dimension. In optimization, Aubin and Ekeland [1976] show that this produces an a priori bound on the duality gap of separable nonconvex optimization problems involving finite sums. This bound is highly conservative and depends on unstable quantities, and we relax it in several directions to show that nonconvexity can have a much milder impact on finite sum minimization problems such as empirical risk minimization and multi-task classification. As a byproduct, we show a new version of Maurey's classical approximate Carathéodory lemma where we sample a significant fraction of the coefficients, without replacement, as well as a result on sampling constraints using an approximate Helly theorem, both of independent interest. Comment: Added constraint sampling result, simplified sampling results, reformat, et
