Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS)
Recently it was shown that the problem of Maximum Inner Product Search (MIPS)
admits provably sub-linear hashing algorithms. Asymmetric
transformations before hashing were the key to solving MIPS, which was otherwise
hard. In the prior work, the authors use asymmetric transformations which
convert the problem of approximate MIPS into the problem of approximate near
neighbor search which can be efficiently solved using hashing. In this work, we
provide a different transformation which converts the problem of approximate
MIPS into the problem of approximate cosine similarity search which can be
efficiently solved using signed random projections. Theoretical analysis shows
that the new scheme is significantly better than the original scheme for MIPS.
Experimental evaluations strongly support the theoretical findings.
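To make the reduction concrete, below is a minimal sketch, assuming NumPy, of one simple asymmetric transformation that turns MIPS into cosine similarity search answerable by signed random projections. The transform shown (append a norm-completing coordinate to data vectors, a zero to queries) is a well-known variant used here for illustration, not necessarily the paper's exact construction; all function names are ours.

```python
import numpy as np

def preprocess_data(X):
    """Asymmetric transform P(x): scale all vectors to norm <= 1, then
    append sqrt(1 - ||x||^2) so every transformed vector is unit-norm."""
    M = np.linalg.norm(X, axis=1).max()
    Xs = X / M
    extra = np.sqrt(np.maximum(0.0, 1.0 - np.sum(Xs ** 2, axis=1)))
    return np.hstack([Xs, extra[:, None]])

def transform_query(q):
    """Asymmetric transform Q(q): normalize and append a zero, so that
    cos(P(x), Q(q)) = <x, q> / (M * ||q||): MIPS order = cosine order."""
    return np.append(q / np.linalg.norm(q), 0.0)

def srp_signature(V, num_bits, seed=0):
    """Signed random projections (SimHash): one bit per random hyperplane."""
    planes = np.random.default_rng(seed).standard_normal((V.shape[1], num_bits))
    return V @ planes >= 0

rng = np.random.default_rng(0)
X, q = rng.standard_normal((1000, 50)), rng.standard_normal(50)
sigs = srp_signature(preprocess_data(X), 256)
qsig = srp_signature(transform_query(q)[None, :], 256)
agreement = (sigs == qsig).mean(axis=1)          # fraction of agreeing bits
print("SRP top-5:      ", np.argsort(-agreement)[:5])
print("true MIPS top-5:", np.argsort(-(X @ q))[:5])
```

Because both transformed vectors are unit-norm, the original inner product is a monotone function of the cosine between P(x) and Q(q), which is exactly the regime where signed random projections give sub-linear search.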
In Defense of MinHash Over SimHash
MinHash and SimHash are the two widely adopted Locality Sensitive Hashing
(LSH) algorithms for large-scale data processing applications. Deciding which
LSH to use for a particular problem at hand is an important question, which has
no clear answer in the existing literature. In this study, we provide a
theoretical answer (validated by experiments) that MinHash virtually always
outperforms SimHash when the data are binary, as is common in practice
(e.g., in search).
The collision probability of MinHash is a function of resemblance similarity
($R$), while the collision probability of SimHash is a function of
cosine similarity ($S$). To provide a common basis for comparison, we
evaluate retrieval results in terms of $S$ for both MinHash and
SimHash. This evaluation is valid because we can prove that MinHash is a valid
LSH with respect to $S$, by using the general inequality
$S^2 \le R \le \frac{S}{2-S}$. Our worst-case analysis shows
that MinHash significantly outperforms SimHash in the high-similarity region.
Interestingly, our intensive experiments reveal that MinHash is also
substantially better than SimHash even in datasets where most of the data
points are not too similar to each other. This is partly because, in practical
data, $R \ge \frac{S}{z-S}$ often holds, where $z$
is only slightly larger than 2 (e.g., $z \le 2.1$). Our restricted worst-case
analysis, which assumes $\frac{S}{z-S} \le R \le \frac{S}{2-S}$, shows that
MinHash indeed significantly outperforms SimHash even in the low-similarity
region.
We believe the results in this paper will provide valuable guidelines for
search in practice, especially when the data are sparse.
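As a concrete illustration of the two collision probabilities being compared (Pr[MinHash collision] $= R$ versus Pr[SimHash collision] $= 1 - \arccos(S)/\pi$), here is a minimal simulation sketch, assuming NumPy; the data sizes and parameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, num_hashes = 10_000, 512

# Two sparse binary vectors with substantial overlap.
a = rng.random(d) < 0.02
b = a.copy()
b[rng.random(d) < 0.01] ^= True

R = (a & b).sum() / (a | b).sum()               # resemblance
S = (a & b).sum() / np.sqrt(a.sum() * b.sum())  # cosine for binary data

def minhash_collision_rate(a, b, k, rng):
    """MinHash via random permutations; a collision is 'same min index'.
    The collision probability equals the resemblance R."""
    hits = 0
    for _ in range(k):
        perm = rng.permutation(len(a))
        hits += perm[a].min() == perm[b].min()
    return hits / k

def simhash_collision_rate(a, b, k, rng):
    """SimHash (signed random projections); the collision probability
    is 1 - angle(a, b) / pi."""
    planes = rng.standard_normal((len(a), k))
    return ((a @ planes >= 0) == (b @ planes >= 0)).mean()

print("R =", round(R, 3),
      "MinHash rate =", round(minhash_collision_rate(a, b, num_hashes, rng), 3))
print("1 - acos(S)/pi =", round(1 - np.arccos(S) / np.pi, 3),
      "SimHash rate =", round(simhash_collision_rate(a, b, num_hashes, rng), 3))
```

Both estimators concentrate around their respective collision probabilities; the paper's question is which probability, viewed as a function of $S$, yields better retrieval.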
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches
state-of-the-art models while training on less than 1% of the raw data.
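For readers unfamiliar with count featurization, here is a minimal sketch of the underlying idea: high-cardinality categorical values are replaced by per-label counts, so models train on compact count tables rather than raw rows, which can then be evicted to protected storage. The column names and API are hypothetical, and Pyramid's actual design is considerably more involved (e.g., it hardens the stored counts themselves).

```python
from collections import defaultdict

class CountFeaturizer:
    """Minimal count featurization: map a categorical value to its
    per-label counts (and derived rate), accumulated in small count
    tables instead of retained raw rows."""

    def __init__(self):
        # counts[feature_name][value] = [negatives, positives]
        self.counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def update(self, row, label):
        for name, value in row.items():
            self.counts[name][value][label] += 1

    def featurize(self, row):
        feats = []
        for name, value in row.items():
            neg, pos = self.counts[name][value]
            rate = pos / (pos + neg) if pos + neg else 0.0
            feats.extend([neg, pos, rate])
        return feats

# Hypothetical ad-targeting rows: once counted, the raw rows are no
# longer needed to (re)train the model.
cf = CountFeaturizer()
cf.update({"ad_id": "a1", "user_region": "EU"}, label=1)
cf.update({"ad_id": "a1", "user_region": "US"}, label=0)
print(cf.featurize({"ad_id": "a1", "user_region": "EU"}))  # [1, 1, 0.5, 0, 1, 1.0]
```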
Improved Densification of One Permutation Hashing
The existing work on densification of one permutation hashing reduces the
query processing cost of the $(K, L)$-parameterized Locality Sensitive Hashing
(LSH) algorithm with minwise hashing from $O(dKL)$ to merely $O(d + KL)$,
where $d$ is the number of nonzeros of the data vector, $K$ is the number of
hashes in each hash table, and $L$ is the number of hash tables. While that is
a substantial improvement, our analysis reveals that the existing densification
scheme is sub-optimal. In particular, there is not enough randomness in that
procedure, which hurts its accuracy on very sparse datasets.
In this paper, we provide a new densification procedure which is provably
better than the existing scheme. This improvement is more significant for very
sparse datasets, which are common over the web. The improved technique has the
same cost of $O(d + KL)$ for query processing, thereby making it strictly
preferable over the existing procedure. Experimental evaluations on public
datasets, in the task of hashing based near neighbor search, support our
theoretical findings.
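For intuition, here is a minimal sketch, assuming NumPy, of one permutation hashing with a densification pass in the spirit described here: an empty bin borrows from the nearest non-empty bin, and an independent random direction bit per bin supplies the extra randomness. The offset scheme and all names are illustrative simplifications, not the paper's exact construction.

```python
import numpy as np

def one_permutation_hash(elements, universe, K, perm):
    """One permutation hashing: permute the universe once, split it into
    K bins, and keep the minimum permuted value per bin (None if empty)."""
    bins = [None] * K
    for e in elements:
        v = perm[e]
        b = (v * K) // universe                  # bin index in [0, K)
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    return bins

def densify(bins, direction, offset=1_000_003):
    """Densification sketch: each empty bin copies the nearest non-empty
    bin, searching left or right according to its own random bit; the
    distance-dependent offset keeps borrowed values distinguishable."""
    K = len(bins)
    out = list(bins)
    for i in range(K):
        if out[i] is not None:
            continue
        for step in range(1, K):                 # assumes >= 1 non-empty bin
            j = (i + step) % K if direction[i] == 0 else (i - step) % K
            if bins[j] is not None:
                out[i] = bins[j] + step * offset
                break
    return out

rng = np.random.default_rng(2)
universe, K = 100_000, 64
perm = rng.permutation(universe)             # shared by all vectors
direction = rng.integers(0, 2, size=K)       # shared by all vectors

a = rng.choice(universe, size=30, replace=False)     # very sparse vectors
b = np.concatenate([a[:20], rng.choice(universe, size=10, replace=False)])
sa = densify(one_permutation_hash(a, universe, K, perm), direction)
sb = densify(one_permutation_hash(b, universe, K, perm), direction)
print("estimated resemblance:", np.mean([x == y for x, y in zip(sa, sb)]))
```

Note that the permutation and the direction bits must be shared across all vectors in a hash table; drawing fresh direction bits per vector would destroy the LSH property.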
Revisiting Shared Data Protection Against Key Exposure
This paper sheds new light on secure data storage inside distributed
systems. Specifically, it revisits computational secret sharing in a situation
where the encryption key is exposed to an attacker. It comes with several
contributions: First, it defines a security model for encryption schemes, where
we ask for additional resilience against exposure of the encryption key.
Precisely, we ask for (1) indistinguishability of plaintexts under full
ciphertext knowledge, and (2) indistinguishability for an adversary who learns
the encryption key plus all but one share of the ciphertext. Property (2)
relaxes the "all-or-nothing" property to a more realistic setting, where the
ciphertext is transformed into a number of shares such that the adversary
cannot access one of them. Property (1) asks that, unless the user's key is
disclosed, no one other than the user can retrieve information about the
plaintext. Second, it introduces a new computationally secure
encryption-then-sharing scheme that protects the data in the previously
defined attacker model. It consists of data encryption followed by a linear
transformation of the ciphertext, then its fragmentation into shares, along
with secret sharing of the randomness used for encryption.
The computational overhead beyond data encryption is reduced by half with
respect to the state of the art. Third, it provides, for the first time,
cryptographic proofs in this context of key exposure. It emphasizes that the
security of our scheme relies only on a simple cryptanalysis-resilience
assumption for blockciphers in public-key mode: indistinguishability from
random of the sequence of differentials of a random value. Fourth, it provides
an alternative scheme relying on the more theoretical random permutation model.
It consists of encrypting with sponge functions in duplex mode and then, as
before, secret-sharing the randomness.
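To fix ideas, here is a minimal sketch, using only the Python standard library, of the general encrypt-then-share dataflow: encrypt under fresh randomness, fragment the ciphertext, and secret-share the randomness across the fragments. It substitutes a toy SHAKE-based stream cipher and n-of-n XOR sharing for the paper's blockcipher construction and linear ciphertext transform; every name and parameter is illustrative, and this sketch makes no claim to the paper's security level.

```python
import secrets, hashlib
from functools import reduce

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # SHAKE-256 used as a toy stream cipher keyed by key || nonce; a real
    # deployment would use an authenticated blockcipher mode.
    return hashlib.shake_256(key + nonce).digest(n)

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def share(plaintext: bytes, key: bytes, n: int):
    """Encrypt under fresh randomness, fragment the ciphertext into n
    pieces, and n-of-n XOR-share the nonce across the fragments, so an
    adversary holding the key but missing one share cannot decrypt."""
    nonce = secrets.token_bytes(32)
    ct = xor(plaintext, keystream(key, nonce, len(plaintext)))
    pads = [secrets.token_bytes(32) for _ in range(n - 1)]
    pads.append(reduce(xor, pads, nonce))        # XOR of all pads == nonce
    return [(ct[i::n], pads[i]) for i in range(n)]

def reassemble(shares, key: bytes) -> bytes:
    frags, pads = zip(*shares)
    nonce = reduce(xor, pads)                    # recover the nonce
    total = sum(len(f) for f in frags)
    ct = bytearray(total)
    for i, f in enumerate(frags):                # re-interleave fragments
        ct[i::len(frags)] = f
    return xor(bytes(ct), keystream(key, nonce, total))

key = secrets.token_bytes(32)
shares = share(b"fragmented storage demo", key, n=4)
assert reassemble(shares, key) == b"fragmented storage demo"
```

Even with the key exposed, decryption requires the nonce, and the nonce is recoverable only from all n pads; this is the "all but one share" resilience the security model asks for, here in its simplest n-of-n form.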