10,740 research outputs found

    Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS)

    Recently, it was shown that the problem of Maximum Inner Product Search (MIPS) is efficiently solvable and admits provably sub-linear hashing algorithms. Asymmetric transformations before hashing were the key to solving MIPS, which is otherwise hard. In the prior work, the authors use asymmetric transformations which convert the problem of approximate MIPS into the problem of approximate near neighbor search, which can be efficiently solved using hashing. In this work, we provide a different transformation which converts the problem of approximate MIPS into the problem of approximate cosine similarity search, which can be efficiently solved using signed random projections. Theoretical analysis shows that the new scheme is significantly better than the original scheme for MIPS. Experimental evaluations strongly support the theoretical findings.
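
    For intuition, a minimal Python sketch of this kind of reduction follows (the norm-completing transformation and all function names here are illustrative assumptions, not the paper's exact construction): give every data vector unit norm by appending a completing coordinate, append a zero to the normalized query, and then hash both with signed random projections.

```python
import numpy as np

rng = np.random.default_rng(0)

def transform_data(X):
    """Append sqrt(1 - ||x||^2) so every transformed data vector has unit norm.
    Assumes the rows of X were pre-scaled so that max ||x|| <= 1."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return np.hstack([X, np.sqrt(np.clip(1.0 - norms ** 2, 0.0, None))])

def transform_query(q):
    """Normalize the query and append a zero coordinate (the asymmetric step)."""
    return np.append(q / np.linalg.norm(q), 0.0)

def srp_hash(V, n_bits, seed=42):
    """Signed random projections: one sign bit per random hyperplane."""
    R = np.random.default_rng(seed).standard_normal((V.shape[-1], n_bits))
    return (V @ R >= 0).astype(np.uint8)

# Toy check: cosine ranking on the transforms matches the inner-product ranking.
X = rng.standard_normal((1000, 32))
X /= np.linalg.norm(X, axis=1).max()        # pre-scale so max norm <= 1
q = rng.standard_normal(32)

P, Q = transform_data(X), transform_query(q)
assert np.argmax(X @ q) == np.argmax(P @ Q)

codes = srp_hash(P, n_bits=64)
qcode = srp_hash(Q[None, :], n_bits=64)
hamming = (codes != qcode).sum(axis=1)      # small distance ~ high cosine ~ large inner product
print(np.argmax(X @ q), np.argmin(hamming))
```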

    In Defense of MinHash Over SimHash

    MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, e.g., in search. The collision probability of MinHash is a function of the resemblance similarity ($\mathcal{R}$), while the collision probability of SimHash is a function of the cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and SimHash. This evaluation is valid because we can prove that MinHash is a valid LSH with respect to $\mathcal{S}$, using the general inequality $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our worst-case analysis shows that MinHash significantly outperforms SimHash in the high-similarity region. Interestingly, our extensive experiments reveal that MinHash is also substantially better than SimHash even on datasets where most of the data points are not too similar to each other. This is partly because, in practical data, $\mathcal{R} \geq \frac{\mathcal{S}}{z-\mathcal{S}}$ often holds, where $z$ is only slightly larger than 2 (e.g., $z \leq 2.1$). Our restricted worst-case analysis, assuming $\frac{\mathcal{S}}{z-\mathcal{S}} \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$, shows that MinHash indeed significantly outperforms SimHash even in the low-similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.
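
    For intuition, the sketch below (not from the paper; the vector sizes and helper names are arbitrary) estimates both collision probabilities on a pair of sparse binary vectors: MinHash collides with probability $\mathcal{R}$, while SimHash collides with probability $1 - \arccos(\mathcal{S})/\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

def minhash_collision_rate(a_idx, b_idx, dim, n_hashes=2000, seed=1):
    """Empirical MinHash collision rate; its expectation is the resemblance R."""
    r = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_hashes):
        perm = r.permutation(dim)
        hits += int(perm[a_idx].min() == perm[b_idx].min())
    return hits / n_hashes

def simhash_collision_rate(a, b, n_hashes=2000, seed=1):
    """Empirical SimHash collision rate; its expectation is 1 - arccos(S)/pi."""
    R = np.random.default_rng(seed).standard_normal((a.size, n_hashes))
    return float(np.mean((a @ R >= 0) == (b @ R >= 0)))

# Two sparse binary vectors that share most of their nonzeros.
dim = 500
a = (rng.random(dim) < 0.05).astype(float)
b = a.copy()
flip = rng.choice(dim, 20, replace=False)
b[flip] = 1 - b[flip]
a_idx, b_idx = np.flatnonzero(a), np.flatnonzero(b)

inter = len(np.intersect1d(a_idx, b_idx))
union = len(np.union1d(a_idx, b_idx))
R = inter / union                                  # resemblance
S = inter / np.sqrt(len(a_idx) * len(b_idx))       # cosine similarity (binary data)
print("R =", round(R, 3), "vs MinHash collisions:",
      minhash_collision_rate(a_idx, b_idx, dim))
print("1 - arccos(S)/pi =", round(1 - np.arccos(S) / np.pi, 3),
      "vs SimHash collisions:", simhash_collision_rate(a, b))
```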

    Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization

    Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations to limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working set of in-use data in accessible storage; unused data can be evicted to a highly protected store. However, many of today's big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads in order to improve performance or scalability. We present Pyramid, a limited-exposure data management system that builds upon count featurization to enhance data protection. As such, Pyramid uniquely introduces both the idea and a proof of concept for leveraging training set minimization methods to instill rigor and selectivity into big data management. We integrated Pyramid into Spark Velox, a framework for ML-based targeting and personalization, evaluated it on three applications, and show that Pyramid approaches the performance of state-of-the-art models while training on less than 1% of the raw data.
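
    As a rough illustration of count featurization itself (a toy sketch, not Pyramid's implementation; real systems add smoothing, multiple count tables, and protective noise), each high-cardinality categorical value is replaced by a small, fixed-width summary of per-label counts, so models can be retrained from the count table rather than from the raw records.

```python
from collections import defaultdict

class CountFeaturizer:
    """Toy count featurization: map a categorical value to per-label counts.

    Real systems add smoothing, multiple count tables, and protective noise;
    none of that is modeled here.
    """

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])   # value -> [negatives, positives]

    def update(self, value, label):
        self.counts[value][int(label)] += 1

    def featurize(self, value):
        neg, pos = self.counts[value]
        total = neg + pos
        return [neg, pos, pos / total if total else 0.0]   # fixed-width features

# A stream of (ad_id, clicked) events; only the small count table is retained.
events = [("ad1", 1), ("ad1", 0), ("ad2", 0), ("ad1", 1), ("ad2", 0)]
cf = CountFeaturizer()
for ad, clicked in events:
    cf.update(ad, clicked)

print(cf.featurize("ad1"))   # [1, 2, 0.666...]
print(cf.featurize("ad2"))   # [2, 0, 0.0]
```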

    Improved Densification of One Permutation Hashing

    The existing work on densification of one permutation hashing reduces the query processing cost of the $(K,L)$-parameterized Locality Sensitive Hashing (LSH) algorithm with minwise hashing from $O(dKL)$ to merely $O(d + KL)$, where $d$ is the number of nonzeros of the data vector, $K$ is the number of hashes in each hash table, and $L$ is the number of hash tables. While that is a substantial improvement, our analysis reveals that the existing densification scheme is sub-optimal. In particular, there is not enough randomness in that procedure, which affects its accuracy on very sparse datasets. In this paper, we provide a new densification procedure which is provably better than the existing scheme. The improvement is more significant for very sparse datasets, which are common over the web. The improved technique has the same query processing cost of $O(d + KL)$, thereby making it strictly preferable over the existing procedure. Experimental evaluations on public datasets, in the task of hashing-based near neighbor search, support our theoretical findings.
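
    A rough sketch of the underlying idea follows (illustrative only; the bin layout and the way empty bins are refilled are simplified assumptions, not the paper's exact densification rule): a single permutation is split into K bins, each bin keeps the minimum offset it sees, and empty bins borrow from a neighbouring non-empty bin in a per-bin random direction, which is the kind of extra randomness the improved scheme argues for.

```python
import numpy as np

def one_permutation_hash(nonzero_idx, universe, k_bins, seed=0):
    """One permutation hashing: K minima obtained from a single permutation.

    Returns a list of length k_bins; None marks an empty bin."""
    perm = np.random.default_rng(seed).permutation(universe)
    width = -(-universe // k_bins)                  # ceil(universe / k_bins)
    bins = [None] * k_bins
    for pos in perm[np.asarray(nonzero_idx)]:
        b, offset = divmod(int(pos), width)
        if bins[b] is None or offset < bins[b]:
            bins[b] = offset
    return bins

def densify(bins, seed=1):
    """Fill each empty bin from a neighbouring non-empty bin, walking left or
    right according to an independent per-bin direction bit (a simplified nod
    to the extra randomness, not the paper's exact rule). Assumes at least one
    bin is non-empty."""
    k = len(bins)
    direction = np.random.default_rng(seed).integers(0, 2, size=k)
    out = list(bins)
    for b in range(k):
        if out[b] is not None:
            continue
        step = 1 if direction[b] else -1
        j = b
        while bins[j % k] is None:
            j += step
        out[b] = bins[j % k]
    return out

# A very sparse vector: most of the 16 bins are empty before densification.
sketch = one_permutation_hash([3, 17, 250, 901], universe=1024, k_bins=16)
print(sketch)
print(densify(sketch))
```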

    Revisiting Shared Data Protection Against Key Exposure

    This paper sheds new light on secure data storage inside distributed systems. Specifically, it revisits computational secret sharing in a situation where the encryption key is exposed to an attacker. It makes several contributions. First, it defines a security model for encryption schemes in which we ask for additional resilience against exposure of the encryption key. Precisely, we ask for (1) indistinguishability of plaintexts under full ciphertext knowledge, and (2) indistinguishability for an adversary who learns the encryption key plus all but one share of the ciphertext. Property (2) relaxes the "all-or-nothing" property to a more realistic setting, where the ciphertext is transformed into a number of shares such that the adversary cannot access one of them. Property (1) asks that, unless the user's key is disclosed, no one other than the user can retrieve information about the plaintext. Second, it introduces a new computationally secure encryption-then-sharing scheme that protects the data in the previously defined attacker model. It consists of data encryption followed by a linear transformation of the ciphertext, then its fragmentation into shares, along with secret sharing of the randomness used for encryption. The computational overhead on top of data encryption is reduced by half with respect to the state of the art. Third, it provides, for the first time, cryptographic proofs in this context of key exposure. It emphasizes that the security of our scheme relies only on a simple cryptanalysis-resilience assumption for block ciphers in public-key mode: indistinguishability from random of the sequence of differentials of a random value. Fourth, it provides an alternative scheme relying on the more theoretical random permutation model. It consists of encrypting with sponge functions in duplex mode and then, as before, secret-sharing the randomness.
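
    To make the encrypt-then-share pattern concrete, here is a toy sketch (entirely illustrative: it uses a SHA-256 counter keystream as a stand-in for a real block cipher, plain XOR sharing, and omits the linear transformation of the ciphertext described above): encrypt under a fresh nonce, XOR-share the ciphertext and the nonce into n fragments, and note that the key together with all but one fragment reveals nothing about the plaintext.

```python
import hashlib
import secrets

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    """Toy SHA-256 counter-mode keystream; a stand-in for a real block cipher."""
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def xor_share(data: bytes, n_shares: int):
    """n-of-n XOR sharing: any n-1 shares are jointly uniform random."""
    shares = [secrets.token_bytes(len(data)) for _ in range(n_shares - 1)]
    last = data
    for s in shares:
        last = xor(last, s)
    return shares + [last]

def encrypt_then_share(plaintext: bytes, key: bytes, n_shares: int):
    """Encrypt under a fresh nonce, then XOR-share ciphertext and nonce."""
    nonce = secrets.token_bytes(16)
    ciphertext = xor(plaintext, keystream(key, nonce, len(plaintext)))
    return list(zip(xor_share(ciphertext, n_shares), xor_share(nonce, n_shares)))

def reconstruct(fragments, key: bytes) -> bytes:
    ct, nonce = fragments[0]
    for c, r in fragments[1:]:
        ct, nonce = xor(ct, c), xor(nonce, r)
    return xor(ct, keystream(key, nonce, len(ct)))

key = secrets.token_bytes(32)
frags = encrypt_then_share(b"sensitive record", key, n_shares=3)
assert reconstruct(frags, key) == b"sensitive record"
# Holding the key and only 2 of the 3 fragments, the missing XOR share leaves
# the ciphertext (and hence the plaintext) information-theoretically hidden.
```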

    Editorial: Sub/Verse
