    When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors

    Full text link
    Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, where pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold τ. In contrast to previous work where τ is assumed to be quite close to 1, we focus on recommendation applications where τ is small, but still meaningful. The all-pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small τ. To the best of our knowledge, there is no practical solution for computing all user pairs with, say, τ = 0.2 on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm, WHIMP, that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near-optimal communication cost while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.
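    As a rough illustration of the SimHash ingredient mentioned in the abstract (a sketch of Charikar's random-projection signatures, not of WHIMP itself), the snippet below estimates cosine similarity from the fraction of agreeing sign bits; the function names, bit counts, and toy data are illustrative assumptions, not part of the paper.

        import numpy as np

        def simhash_signatures(X, num_bits=256, seed=0):
            """SimHash signatures: each bit is the sign of a random hyperplane
            projection, so bit agreement reflects the angle between vectors."""
            rng = np.random.default_rng(seed)
            planes = rng.standard_normal((X.shape[1], num_bits))
            return (X @ planes) >= 0  # one row of boolean bits per vector

        def estimated_cosine(sig_a, sig_b):
            """Estimate cos(theta) from the agreement rate: for random
            hyperplanes, P[bits agree] = 1 - theta / pi."""
            agreement = np.mean(sig_a == sig_b)
            return np.cos(np.pi * (1.0 - agreement))

        # Toy usage: two random user vectors (the small-tau regime in the paper).
        users = np.random.default_rng(1).standard_normal((2, 1000))
        sigs = simhash_signatures(users)
        print(estimated_cosine(sigs[0], sigs[1]))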

    Sample Complexity Analysis for Learning Overcomplete Latent Variable Models through Tensor Methods

    Full text link
    We provide guarantees for learning latent variable models, with an emphasis on the overcomplete regime, where the dimensionality of the latent space can exceed the observed dimensionality. In particular, we consider multiview mixtures, spherical Gaussian mixtures, ICA, and sparse coding models. We provide tight concentration bounds for empirical moments through novel covering arguments. We analyze parameter recovery through a simple tensor power update algorithm. In the semi-supervised setting, we exploit the label or prior information to get a rough estimate of the model parameters, and then refine it using the tensor method on unlabeled samples. We establish that learning is possible when the number of components scales as k = o(d^{p/2}), where d is the observed dimension and p is the order of the observed moment employed in the tensor method. Our concentration bound analysis also leads to a minimax sample complexity for semi-supervised learning of spherical Gaussian mixtures. In the unsupervised setting, we use a simple initialization algorithm based on SVD of the tensor slices, and provide guarantees under the stricter condition that k ≤ βd (where the constant β can be larger than 1), where the tensor method recovers the components in polynomial running time (and exponential in β). Our analysis establishes that a wide range of overcomplete latent variable models can be learned efficiently with low computational and sample complexity through tensor decomposition methods.
    Comment: Title change
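    For intuition about the "simple tensor power update" the abstract refers to, here is a minimal sketch of a power-iteration step on a symmetric third-order tensor, assuming an orthogonal toy decomposition; it omits the paper's initialization, deflation, and noise analysis, and all names are illustrative.

        import numpy as np

        def tensor_power_update(T, v, num_iters=100):
            """Repeatedly apply v <- T(I, v, v) / ||T(I, v, v)|| for a symmetric
            3rd-order tensor T, then read off the recovered weight."""
            for _ in range(num_iters):
                # Contraction: w_i = sum_{j,k} T[i, j, k] v_j v_k
                w = np.einsum('ijk,j,k->i', T, v, v)
                v = w / np.linalg.norm(w)
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # estimated component weight
            return lam, v

        # Toy usage: rank-2 tensor with orthogonal components of weights 2 and 1.
        d = 5
        a1, a2 = np.eye(d)[0], np.eye(d)[1]
        T = 2.0 * np.einsum('i,j,k->ijk', a1, a1, a1) + np.einsum('i,j,k->ijk', a2, a2, a2)
        v0 = np.random.default_rng(0).standard_normal(d)
        lam, v = tensor_power_update(T, v0 / np.linalg.norm(v0))
        print(lam, v)  # converges to the dominant component (weight ~2, vector ~a1)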

    A differential method for bounding the ground state energy

    Get PDF
    For a wide class of Hamiltonians, a novel method to obtain lower and upper bounds for the lowest energy is presented. Unlike perturbative or variational techniques, this method does not involve the computation of any integral (a normalisation factor or a matrix element). It just requires the determination of the absolute minimum and maximum, over the whole configuration space, of the local energy associated with a normalisable trial function (the calculation of the norm is not needed). After a general introduction, the method is applied to three non-integrable systems: the asymmetric annular billiard, the many-body spinless Coulombian problem, and the hydrogen atom in a constant and uniform magnetic field. Being more sensitive than variational methods to any local perturbation of the trial function, this method can be used to systematically improve the energy bounds through a skilled local analysis; an algorithm relying on this method can therefore be constructed, and an explicit example for a one-dimensional problem is given.
    Comment: Accepted for publication in Journal of Physics
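    The following sketch illustrates the core idea of bracketing the ground state energy by the extrema of the local energy E_L(x) = (Hψ)(x)/ψ(x), using an assumed toy setup (1D harmonic oscillator, slightly mismatched Gaussian trial function, finite-difference derivatives on a grid) rather than the non-integrable systems treated in the paper.

        import numpy as np

        def local_energy(psi, x, potential):
            """Local energy E_L(x) = (H psi)(x) / psi(x) for H = -1/2 d^2/dx^2 + V(x),
            with the second derivative taken by central finite differences."""
            h = x[1] - x[0]
            d2psi = (psi[2:] - 2.0 * psi[1:-1] + psi[:-2]) / h**2
            return -0.5 * d2psi / psi[1:-1] + potential(x[1:-1])

        # Toy setup: harmonic oscillator (exact ground state energy 0.5) with the
        # trial function exp(-a x^2 / 2), a = 0.9 (slightly too wide).
        x = np.linspace(-5.0, 5.0, 2001)
        psi = np.exp(-0.5 * 0.9 * x**2)
        E_L = local_energy(psi, x, lambda x: 0.5 * x**2)

        # The bounds are min E_L <= E_0 <= max E_L; a genuine bound requires the
        # extrema over the whole configuration space, while this grid only
        # illustrates the idea (here roughly 0.45 <= 0.5 <= 2.8).
        print(E_L.min(), E_L.max())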

    Parameter Selection and Uncertainty Measurement for Variable Precision Probabilistic Rough Set

    Get PDF
    In this paper, we consider the problem of parameter selection and uncertainty measurement for a variable precision probabilistic rough set. Firstly, within the framework of the variable precision probabilistic rough set model, the relative discernibility of a variable precision rough set in probabilistic approximation space is discussed, and the conditions that make the precision parameter α discernible in a variable precision probabilistic rough set are put forward. At the same time, we address the difficulty of specifying the precision parameter in advance, and we propose a systematic threshold selection method based on the relative discernibility of sets, using the concept of relative discernibility in probabilistic approximation space. A numerical example is then used to test the validity of the proposed method. Secondly, we discuss the problem of uncertainty measurement for the variable precision probabilistic rough set. The concept of classical fuzzy entropy is introduced into probabilistic approximation space, and the uncertain information that comes from both the approximation space and the approximated objects is fully considered. An axiomatic approach is then established for uncertainty measurement in a variable precision probabilistic rough set, and several related properties are discussed. Thirdly, we study attribute reduction for the variable precision probabilistic rough set; the definition of a reduct and its characteristic theorems are given. The main contribution of this paper is twofold: one is to propose a method of parameter selection for a variable precision probabilistic rough set, and the other is to present a new approach to uncertainty measurement and a method of attribute reduction for a variable precision probabilistic rough set.
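    For readers unfamiliar with the underlying model, here is a minimal sketch of variable precision (probabilistic) lower and upper approximations with a single precision parameter α; it shows only the textbook definitions, not the paper's parameter-selection procedure or fuzzy-entropy measure, and all names and the toy data are illustrative.

        from collections import defaultdict

        def vp_approximations(universe, equiv_class_of, target, alpha):
            """Variable precision rough approximations of a target set: a block [x]
            joins the lower approximation when P(target | [x]) >= alpha and the
            upper approximation when P(target | [x]) > 1 - alpha, for alpha in (0.5, 1]."""
            blocks = defaultdict(set)
            for x in universe:
                blocks[equiv_class_of(x)].add(x)

            lower, upper = set(), set()
            for block in blocks.values():
                p = len(block & target) / len(block)  # conditional probability P(X | [x])
                if p >= alpha:
                    lower |= block
                if p > 1.0 - alpha:
                    upper |= block
            return lower, upper

        # Toy usage: universe partitioned by an attribute value, alpha = 0.7.
        U = set(range(10))
        X = {0, 1, 2, 3, 8}
        lower, upper = vp_approximations(U, lambda x: x // 3, X, alpha=0.7)
        print(lower, upper)  # the boundary region is upper - lower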

    The investigation of the Bayesian rough set model

    Get PDF
    The original Rough Set model is concerned primarily with algebraic properties of approximately defined sets. The Variable Precision Rough Set (VPRS) model extends the basic rough set theory to incorporate probabilistic information. The article presents a non-parametric modification of the VPRS model called the Bayesian Rough Set (BRS) model, where the set approximations are defined by using the prior probability as a reference. Mathematical properties of BRS are investigated. It is shown that the quality of BRS models can be evaluated using a probabilistic gain function, which is suitable for the identification and elimination of redundant attributes.
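    The sketch below illustrates the non-parametric idea described in the abstract: regions are defined relative to the prior probability P(X) rather than a fixed precision threshold. It is a minimal reading of that idea with assumed names and toy data, not the paper's full BRS formulation or its gain function.

        from collections import defaultdict

        def brs_regions(universe, equiv_class_of, target):
            """Bayesian-rough-set-style regions: a block [x] is positive when
            P(X | [x]) exceeds the prior P(X), negative when it falls below the
            prior, and boundary when it equals the prior."""
            prior = len(target) / len(universe)
            blocks = defaultdict(set)
            for x in universe:
                blocks[equiv_class_of(x)].add(x)

            positive, negative, boundary = set(), set(), set()
            for block in blocks.values():
                p = len(block & target) / len(block)
                if p > prior:
                    positive |= block
                elif p < prior:
                    negative |= block
                else:
                    boundary |= block
            return positive, negative, boundary

        # Toy usage: universe partitioned by an attribute value.
        U = set(range(10))
        X = {0, 1, 2, 3, 8}
        print(brs_regions(U, lambda x: x // 3, X))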
