When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors
Finding similar user pairs is a fundamental task in social networks, with
numerous applications in ranking and personalization tasks such as link
prediction and tie strength detection. A common manifestation of user
similarity is based upon network structure: each user is represented by a
vector that represents the user's network connections, where pairwise cosine
similarity among these vectors defines user similarity. The predominant task
for user similarity applications is to discover all similar pairs that have a
pairwise cosine similarity value larger than a given threshold σ. In
contrast to previous work where σ is assumed to be quite close to 1, we
focus on recommendation applications where σ is small, but still
meaningful. The all-pairs cosine similarity problem is computationally
challenging on networks with billions of edges, and especially so for settings
with small σ. To the best of our knowledge, there is no practical solution
for computing all user pairs at such small values of σ on large social networks,
even using the power of distributed algorithms.
Our work directly addresses this challenge by introducing a new algorithm ---
WHIMP --- that solves this problem efficiently in the MapReduce model. The key
insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for
approximate matrix multiplication with the SimHash random projection techniques
of Charikar. We provide a theoretical analysis of WHIMP, proving that it has
near optimal communication costs while maintaining computation cost comparable
with the state of the art. We also empirically demonstrate WHIMP's scalability
by computing all highly similar pairs on four massive data sets, and show that
it accurately finds high similarity pairs. In particular, we note that WHIMP
successfully processes the entire Twitter network, which has tens of billions
of edges.
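To make the SimHash ingredient mentioned above concrete, here is a minimal sketch of Charikar's random-projection scheme for estimating cosine similarity. This is not the WHIMP implementation (the function names and the toy vectors are my own); it only illustrates the fact WHIMP relies on: for unit vectors, the probability that a random hyperplane separates them is angle/π, so cosine similarity can be recovered from signature agreement.

```python
import numpy as np

def simhash_signatures(vectors, n_bits=256, seed=0):
    """Project each row vector onto random hyperplanes and keep the sign bits.
    For unit vectors u, v, P[one bit differs] = angle(u, v) / pi."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    return (vectors @ planes) >= 0  # boolean signature matrix

def estimated_cosine(sig_u, sig_v):
    """Recover an estimate of cosine similarity from the fraction
    of disagreeing signature bits."""
    hamming = np.mean(sig_u != sig_v)
    return np.cos(np.pi * hamming)

# toy check on two correlated vectors
u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([1.0, 2.0, 3.0, 5.0])
sigs = simhash_signatures(np.stack([u, v]), n_bits=4096)
true_cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
approx = estimated_cosine(sigs[0], sigs[1])
```

With enough bits the estimate concentrates tightly around the true cosine; WHIMP's contribution is combining such sketches with wedge sampling so that low-threshold pairs can be found without all-pairs comparison.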
Sample Complexity Analysis for Learning Overcomplete Latent Variable Models through Tensor Methods
We provide guarantees for learning latent variable models with an emphasis on the
overcomplete regime, where the dimensionality of the latent space can exceed
the observed dimensionality. In particular, we consider multiview mixtures,
spherical Gaussian mixtures, ICA, and sparse coding models. We provide tight
concentration bounds for empirical moments through novel covering arguments. We
analyze parameter recovery through a simple tensor power update algorithm. In
the semi-supervised setting, we exploit the label or prior information to get a
rough estimate of the model parameters, and then refine it using the tensor
method on unlabeled samples. We establish that learning is possible when the
number of components scales as k = o(d^{p/2}), where d is the observed
dimension, and p is the order of the observed moment employed in the tensor
method. Our concentration bound analysis also leads to minimax sample
complexity for semi-supervised learning of spherical Gaussian mixtures. In the
unsupervised setting, we use a simple initialization algorithm based on SVD of
the tensor slices, and provide guarantees under the stricter condition that
k ≤ βd (where the constant β can be larger than 1), under which the
tensor method recovers the components in polynomial running time (and
exponential in β). Our analysis establishes that a wide range of
overcomplete latent variable models can be learned efficiently with low
computational and sample complexity through tensor decomposition methods.
Comment: Title change
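The "simple tensor power update" the abstract refers to can be sketched as follows. This is an illustrative toy (the tensor, weights, and iteration count are my own), not the paper's analysis: for a symmetric third-order tensor with orthogonal components, the update v ← T(I, v, v)/‖T(I, v, v)‖ converges to one of the components.

```python
import numpy as np

def tensor_power_update(T, v):
    """One tensor power step: v <- T(I, v, v), normalized.
    T is a symmetric third-order tensor of shape (d, d, d)."""
    w = np.einsum('ijk,j,k->i', T, v, v)
    return w / np.linalg.norm(w)

# toy orthogonal rank-2 tensor T = 2 * a1^{x3} + 1 * a2^{x3}
d = 5
a1, a2 = np.eye(d)[0], np.eye(d)[1]
T = (2.0 * np.einsum('i,j,k->ijk', a1, a1, a1)
     + 1.0 * np.einsum('i,j,k->ijk', a2, a2, a2))

rng = np.random.default_rng(1)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
for _ in range(50):
    v = tensor_power_update(T, v)
# v converges to one of the components a1 or a2
```

The overcomplete regime studied in the paper (more components than dimensions) needs the concentration and initialization machinery from the abstract; the fixed-point behavior of the update itself is what this snippet shows.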
A differential method for bounding the ground state energy
For a wide class of Hamiltonians, a novel method to obtain lower and upper
bounds for the lowest energy is presented. Unlike perturbative or variational
techniques, this method does not involve the computation of any integral (a
normalisation factor or a matrix element). It just requires the determination
of the absolute minimum and maximum in the whole configuration space of the
local energy associated with a normalisable trial function (the calculation of
the norm is not needed). After a general introduction, the method is applied to
three non-integrable systems: the asymmetric annular billiard, the many-body
spinless Coulomb problem, and the hydrogen atom in a constant and uniform
magnetic field. Being more sensitive than variational methods to any local
perturbation of the trial function, this method can be used to systematically
improve the energy bounds through a careful local analysis; an algorithm relying
on this method can therefore be constructed, and an explicit example for a
one-dimensional problem is given.
Comment: Accepted for publication in Journal of Physics
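The principle behind the bounds is that, for a normalizable trial function ψ, the local energy E_loc(x) = (Hψ)(x)/ψ(x) satisfies min E_loc ≤ E₀ ≤ max E_loc. The following toy, assuming a 1D harmonic oscillator (not one of the paper's three systems), checks this numerically: for H = -½ d²/dx² + ½ x² and a Gaussian trial ψ = exp(-a x²/2), one computes E_loc(x) = a/2 + (1 - a²)x²/2, which is bounded below for a < 1 and bounded above for a > 1.

```python
import numpy as np

def local_energy(a, x):
    """Local energy E_loc = (H psi) / psi for H = -1/2 d^2/dx^2 + x^2/2
    and Gaussian trial psi(x) = exp(-a x^2 / 2):
    E_loc(x) = a/2 + (1 - a^2) * x^2 / 2."""
    return a / 2 + (1 - a**2) * x**2 / 2

x = np.linspace(-10, 10, 100001)
lower = local_energy(0.9, x).min()   # a < 1: E_loc bounded below
upper = local_energy(1.1, x).max()   # a > 1: E_loc bounded above
# exact ground state energy is 0.5, bracketed by [lower, upper]
```

No integral is evaluated anywhere, which is the point of the method: only the extrema of E_loc over configuration space are needed, here lower = 0.45 and upper = 0.55 around the exact E₀ = 0.5.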
Parameter Selection and Uncertainty Measurement for Variable Precision Probabilistic Rough Set
In this paper, we consider the problem of parameter selection and uncertainty measurement for a variable precision probabilistic rough set. Firstly, within the framework of the variable precision probabilistic rough set model, the relative discernibility of a variable precision rough set in probabilistic approximation space is discussed, and the conditions that make the precision parameter α discernible in a variable precision probabilistic rough set are put forward. At the same time, we consider the lack of predictability of precision parameters in a variable precision probabilistic rough set, and we propose a systematic threshold selection method based on the relative discernibility of sets, using the concept of relative discernibility in probabilistic approximation space. Furthermore, a numerical example is applied to test the validity of the method proposed in this paper. Secondly, we discuss the problem of uncertainty measurement for the variable precision probabilistic rough set. The concept of classical fuzzy entropy is introduced into probabilistic approximation space, and the uncertain information that comes from the approximation space and the approximated objects is fully considered. Then, an axiomatic approach is established for uncertainty measurement in a variable precision probabilistic rough set, and several related interesting properties are also discussed. Thirdly, we study attribute reduction for the variable precision probabilistic rough set. The definition of reduction and its characteristic theorems are given for the variable precision probabilistic rough set. The main contribution of this paper is twofold. One is to propose a method of parameter selection for a variable precision probabilistic rough set. The other is to present a new approach to uncertainty measurement and a method of attribute reduction for a variable precision probabilistic rough set.
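A minimal sketch of the variable precision probabilistic approximations the abstract builds on may help. This is a generic textbook-style formulation, not the paper's specific construction (function and variable names are my own): a class E of the partition goes into the lower approximation of X when P(X | E) ≥ α, and into the upper approximation when P(X | E) > 1 - α, for a precision parameter 0.5 < α ≤ 1.

```python
from collections import defaultdict

def vprs_approximations(universe, equiv_class_of, target, alpha):
    """Variable precision (probabilistic) approximations of `target`:
    lower = union of classes E with P(target | E) >= alpha,
    upper = union of classes E with P(target | E) >  1 - alpha,
    for a precision parameter 0.5 < alpha <= 1."""
    classes = defaultdict(set)
    for x in universe:
        classes[equiv_class_of(x)].add(x)
    lower, upper = set(), set()
    for members in classes.values():
        p = len(members & target) / len(members)
        if p >= alpha:
            lower |= members
        if p > 1 - alpha:
            upper |= members
    return lower, upper

# toy example: equivalence classes by parity
U = set(range(10))
X = {0, 2, 4, 6, 7}              # mostly even numbers, plus 7
low, up = vprs_approximations(U, lambda x: x % 2, X, alpha=0.7)
```

With α = 1 this reduces to the classical rough set; lowering α admits classes that are only mostly inside X, which is exactly the behavior whose threshold the paper's parameter selection method targets.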
The investigation of the Bayesian rough set model
The original Rough Set model is concerned primarily with algebraic properties of approximately defined sets. The Variable Precision Rough Set (VPRS) model extends the basic rough set theory to incorporate probabilistic information. The article presents a non-parametric modification of the VPRS model called the Bayesian Rough Set (BRS) model, where the set approximations are defined by using the prior probability as a reference. Mathematical properties of BRS are investigated. It is shown that the quality of BRS models can be evaluated using a probabilistic gain function, which is suitable for identification and elimination of redundant attributes.
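The non-parametric idea can be sketched directly: instead of fixed precision thresholds, each class is compared against the prior P(X). This is an illustrative rendering under that reading of the abstract, not the article's formal definitions (names are my own).

```python
from collections import defaultdict

def bayesian_rough_regions(universe, equiv_class_of, target):
    """Bayesian rough set regions: a class E is positive when
    P(target | E) > P(target), negative when below the prior, and
    boundary when equal -- no user-supplied precision parameter."""
    prior = len(target) / len(universe)
    classes = defaultdict(set)
    for x in universe:
        classes[equiv_class_of(x)].add(x)
    pos, neg, bnd = set(), set(), set()
    for members in classes.values():
        p = len(members & target) / len(members)
        if p > prior:
            pos |= members
        elif p < prior:
            neg |= members
        else:
            bnd |= members
    return pos, neg, bnd

# toy example: two classes {0..4} and {5..9}, prior P(X) = 0.4
U = set(range(10))
X = {0, 1, 2, 5}
pos, neg, bnd = bayesian_rough_regions(U, lambda x: x // 5, X)
```

An attribute whose removal leaves these regions unchanged carries no probabilistic gain, which is the intuition behind the redundancy elimination mentioned in the abstract.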