80,492 research outputs found

    Probabilistic Neighborhood Selection in Collaborative Filtering Systems

    Get PDF
    This paper presents a novel probabilistic method for recommending items in the neighborhood-based collaborative filtering framework. For the probabilistic neighborhood selection phase, we use an efficient method for weighted sampling of k neighbors without replacement that also takes into consideration the similarity levels between the target user and the candidate neighbors. We conduct an empirical study showing that the proposed method alleviates the over-specialization and concentration biases in common recommender systems by generating recommendation lists that are very different from the classical collaborative filtering approach and also increasing the aggregate diversity and mobility of recommendations. We also demonstrate that the proposed method outperforms both the previously proposed user based k-nearest neighbors and k-furthest neighbors collaborative filtering approaches in terms of item prediction accuracy and utility based ranking measures across various experimental settings. This accuracy performance improvement is in accordance with ensemble learning theory.NYU Stern School of Busines

    Parallel Weighted Random Sampling

    Get PDF
    Data structures for efficient sampling from a set of weighted items are an important building block of many applications. However, few parallel solutions are known. We close many of these gaps both for shared-memory and distributed-memory machines. We give efficient, fast, and practicable algorithms for sampling single items, k items with/without replacement, permutations, subsets, and reservoirs. We also give improved sequential algorithms for alias table construction and for sampling with replacement. Experiments on shared-memory parallel machines with up to 158 threads show near linear speedups both for construction and queries

    Adaptive Threshold Sampling and Estimation

    Full text link
    Sampling is a fundamental problem in both computer science and statistics. A number of issues arise when designing a method based on sampling. These include statistical considerations such as constructing a good sampling design and ensuring there are good, tractable estimators for the quantities of interest as well as computational considerations such as designing fast algorithms for streaming data and ensuring the sample fits within memory constraints. Unfortunately, existing sampling methods are only able to address all of these issues in limited scenarios. We develop a framework that can be used to address these issues in a broad range of scenarios. In particular, it addresses the problem of drawing and using samples under some memory budget constraint. This problem can be challenging since the memory budget forces samples to be drawn non-independently and consequently, makes computation of resulting estimators difficult. At the core of the framework is the notion of a data adaptive thresholding scheme where the threshold effectively allows one to treat the non-independent sample as if it were drawn independently. We provide sufficient conditions for a thresholding scheme to allow this and provide ways to build and compose such schemes. Furthermore, we provide fast algorithms to efficiently sample under these thresholding schemes

    Probabilistic Neighborhood Selection in Collaborative Filtering Systems

    Get PDF
    This paper presents a novel probabilistic method for recommending items in the neighborhood-based collaborative filtering framework. For the probabilistic neighborhood selection phase, we use an efficient method for weighted sampling of k neighbors without replacement that also takes into consideration the similarity levels between the target user and the candidate neighbors. We conduct an empirical study showing that the proposed method alleviates the over-specialization and concentration biases in common recommender systems by generating recommendation lists that are very different from the classical collaborative filtering approach and also increasing the aggregate diversity and mobility of recommendations. We also demonstrate that the proposed method outperforms both the previously proposed user based k-nearest neighbors and k-furthest neighbors collaborative filtering approaches in terms of item prediction accuracy and utility based ranking measures across various experimental settings. This accuracy performance improvement is in accordance with ensemble learning theory.NYU Stern School of Busines

    Random sampling of bandlimited signals on graphs

    Get PDF
    We study the problem of sampling k-bandlimited signals on graphs. We propose two sampling strategies that consist in selecting a small subset of nodes at random. The first strategy is non-adaptive, i.e., independent of the graph structure, and its performance depends on a parameter called the graph coherence. On the contrary, the second strategy is adaptive but yields optimal results. Indeed, no more than O(k log(k)) measurements are sufficient to ensure an accurate and stable recovery of all k-bandlimited signals. This second strategy is based on a careful choice of the sampling distribution, which can be estimated quickly. Then, we propose a computationally efficient decoder to reconstruct k-bandlimited signals from their samples. We prove that it yields accurate reconstructions and that it is also stable to noise. Finally, we conduct several experiments to test these techniques

    Weighted Reservoir Sampling from Distributed Streams

    Get PDF
    We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied, and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream has only a few heavy items which may dominate a random sample when chosen with replacement. Weighted sampling \textit{without replacement} (weighted SWOR) eludes this issue, since such heavy items can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking \textit{heavy hitters with residual error}. Here the goal is to identify stream items that contribute significantly to the residual stream, once the heaviest items are removed. Residual heavy hitters generalize the notion of 1\ell_1 heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we also provide a lower bound on the message complexity that is nearly tight up to a log(1/ϵ)\log(1/\epsilon) factor. Finally, we use our weighted sampling algorithm to improve the message complexity of distributed L1L_1 tracking, also known as count tracking, which is a widely studied problem in distributed streaming. We also derive a tight message lower bound, which closes the message complexity of this fundamental problem.Comment: To appear in PODS 201
    corecore