b-Bit Minwise Hashing
This paper establishes the theoretical framework of b-bit minwise hashing.
The original minwise hashing method has become a standard technique for
estimating set similarity (e.g., resemblance) with applications in information
retrieval, data management, social networks and computational advertising.
By only storing the lowest b bits of each (minwise) hashed value (e.g., b=1
or 2), one can gain substantial advantages in terms of computational efficiency
and storage space. We prove the basic theoretical results and provide an
unbiased estimator of the resemblance for any b. We demonstrate that, even in
the least favorable scenario, using b=1 can reduce the storage space by at
least a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is
interested in resemblance > 0.5.
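The b=1 scheme above can be illustrated with a short sketch. This is not the paper's exact construction: it uses simple linear hash functions modulo a large prime in place of true random permutations, and the estimator R_hat = 2*A_hat - 1 (where A_hat is the fraction of matching stored bits) is the b=1 estimator under the sparse-data approximation in which the collision baseline is 1/2.

```python
import random

def minhash_lowest_bits(s, num_perm=512, b=1, seed=0):
    """Return the lowest b bits of each minwise-hashed value of set s.
    Linear hashes h(x) = (a*x + c) mod p stand in for random permutations."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_perm)]
    mask = (1 << b) - 1
    return [min((a * x + c) % p for x in s) & mask for a, c in params]

def estimate_resemblance_b1(sig1, sig2):
    """b=1 estimator under the sparse-data approximation (baseline match
    probability 1/2): R_hat = 2*A_hat - 1, A_hat = bit-match rate."""
    a_hat = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
    return 2.0 * a_hat - 1.0
```

With 512 stored bits this costs 64 bytes per set versus 4 KB for 512 full 64-bit minwise values, which is the storage saving the abstract quantifies.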
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is the problem of finding, in a
large database, the data items whose distances to a query item are smallest.
Various methods have been developed to address this problem, and recently a lot
of efforts have been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work on locality sensitive hashing. We divide the hashing
algorithms into two main categories: locality sensitive hashing, which designs
hash functions without exploring the data distribution, and learning to hash,
which learns hash functions according to the data distribution. We review them
from various aspects, including hash function design, distance measure, and
the search scheme in the hash coding space.
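A minimal example of the first category is signed random projections (SimHash), a locality sensitive hash for angular distance: two vectors agree on a given bit with probability 1 - angle(u, v)/pi, independent of the data distribution. This sketch is an illustrative instance, not any specific method from the survey.

```python
import random

def signed_random_projections(vec, num_bits=64, seed=0):
    """One bit per random Gaussian hyperplane: bit = sign of the dot product.
    P(bit agrees for u, v) = 1 - angle(u, v) / pi."""
    rng = random.Random(seed)
    bits = []
    for _ in range(num_bits):
        plane = [rng.gauss(0, 1) for _ in vec]
        dot = sum(p * x for p, x in zip(plane, vec))
        bits.append(1 if dot >= 0 else 0)
    return bits

def hamming(a, b):
    """Distance in the hash coding space: number of disagreeing bits."""
    return sum(x != y for x, y in zip(a, b))
```

Learning-to-hash methods replace the data-oblivious random hyperplanes with projections fit to the data distribution, but the search scheme over binary codes is the same.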
Top-K Queries Over Digital Traces
Recent advances in social and mobile technology have enabled an abundance of digital traces (in the form of mobile check-ins, WiFi hotspot handshakes, etc.) revealing the physical presence history of diverse sets of entities. One challenging, yet important, task is to identify the k entities that are most closely associated with a given query entity based on their digital traces. We propose a suite of hierarchical indexing techniques and algorithms to enable fast query processing for this problem at scale. We theoretically analyze the pruning effectiveness of the proposed methods based on a human mobility model which we propose and validate in real-life situations. Finally, we conduct extensive experiments on both synthetic and real datasets at scale, evaluating the performance of our techniques and confirming the effectiveness and superiority of our approach over other applicable approaches across a variety of parameter settings and datasets.
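For orientation, the brute-force baseline that the paper's hierarchical indexes are designed to prune can be sketched as follows. The association score here (count of shared observations) and the data layout are illustrative assumptions, not the paper's actual model or index.

```python
import heapq

def top_k_associated(query_entity, traces, k):
    """Naive baseline: score each candidate by the number of (location, time)
    observations it shares with the query entity, then keep the k best.
    A hierarchical index would prune candidates instead of scoring all of
    them; this linear scan is only the reference point."""
    query_obs = traces[query_entity]
    scores = {
        entity: len(obs & query_obs)
        for entity, obs in traces.items()
        if entity != query_entity
    }
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```

The scan is linear in the number of entities; pruning pays off precisely when most candidates can be discarded without computing their scores.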
Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates
Many datasets such as market basket data, text or hypertext documents, and
sensor observations recorded in different locations or time periods, are
modeled as a collection of sets over a ground set of keys. We are interested in
basic aggregates such as the weight or selectivity of keys that satisfy some
selection predicate defined over keys' attributes and membership in particular
sets. This general formulation includes basic aggregates such as the Jaccard
coefficient, Hamming distance, and association rules.
On massive data sets, exact computation can be inefficient or infeasible.
Sketches based on coordinated random samples are classic summaries that support
approximate query processing.
Queries are resolved by generating a sketch (sample) of the union of the sets
used in the predicate from the sketches of these sets, and then applying an
estimator to this union-sketch.
We derive novel tighter (unbiased) estimators that leverage sampled keys that
are present in the union of applicable sketches but excluded from the union
sketch. We establish analytically that our estimators dominate estimators
applied to the union-sketch for all queries and data sets. Empirical
evaluation on synthetic and real data reveals that on typical applications we
can expect a 25% to 4-fold reduction in estimation error.
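A concrete instance of coordinated sketches is bottom-k sampling with a hash function shared across all sets; coordination means the same key draws the same hash everywhere, so per-set sketches can be merged into a valid union-sketch. The sketch below estimates the number of distinct keys in a union with the classic (k-1)/h_(k) estimator; it shows the union-sketch baseline only, not the paper's tighter estimators that additionally use the discarded samples.

```python
import hashlib

def shared_hash(key, seed=0):
    """Uniform hash in (0,1); sharing it across sets is what makes the
    per-set samples coordinated."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def bottom_k_sketch(s, k):
    """Per-set sketch: the k keys with the smallest shared-hash values."""
    return sorted((shared_hash(x), x) for x in s)[:k]

def union_size_estimate(sketches, k):
    """Merge per-set sketches, keep the k smallest distinct keys, and apply
    the (k-1)/h_(k) estimator to the k-th smallest hash value."""
    merged = {}
    for sketch in sketches:
        for hv, x in sketch:
            merged[x] = hv
    smallest = sorted(merged.values())[:k]
    if len(smallest) < k:  # the union has fewer than k keys: count exactly
        return len(merged)
    return (k - 1) / smallest[k - 1]
```

Keys that appear in some per-set sketch but not among the union's k smallest are exactly the "discarded samples" of the title; the paper's contribution is estimators that extract information from them rather than throwing them away.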