3,346 research outputs found
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is a problem of pursuing the data
items whose distances to a query item are the smallest from a large database.
Various methods have been developed to address this problem, and recently a lot
of efforts have been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work locality sensitive hashing. We divide the hashing
algorithms two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution and learning to hash, which
learns hash functions according the data distribution, and review them from
various aspects, including hash function design and distance measure and search
scheme in the hash coding space
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of characters and we have precomputed character tables
mapping characters to random hash values. A key
is hashed to . This schemes is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not
THRIVE: Threshold Homomorphic encryption based secure and privacy preserving bIometric VErification system
In this paper, we propose a new biometric verification and template
protection system which we call the THRIVE system. The system includes novel
enrollment and authentication protocols based on threshold homomorphic
cryptosystem where the private key is shared between a user and the verifier.
In the THRIVE system, only encrypted binary biometric templates are stored in
the database and verification is performed via homomorphically randomized
templates, thus, original templates are never revealed during the
authentication stage. The THRIVE system is designed for the malicious model
where the cheating party may arbitrarily deviate from the protocol
specification. Since threshold homomorphic encryption scheme is used, a
malicious database owner cannot perform decryption on encrypted templates of
the users in the database. Therefore, security of the THRIVE system is enhanced
using a two-factor authentication scheme involving the user's private key and
the biometric data. We prove security and privacy preservation capability of
the proposed system in the simulation-based model with no assumption. The
proposed system is suitable for applications where the user does not want to
reveal her biometrics to the verifier in plain form but she needs to proof her
physical presence by using biometrics. The system can be used with any
biometric modality and biometric feature extraction scheme whose output
templates can be binarized. The overall connection time for the proposed THRIVE
system is estimated to be 336 ms on average for 256-bit biohash vectors on a
desktop PC running with quad-core 3.2 GHz CPUs at 10 Mbit/s up/down link
connection speed. Consequently, the proposed system can be efficiently used in
real life applications
Privacy-Aware Recommender Systems Challenge on Twitter's Home Timeline
Recommender systems constitute the core engine of most social network
platforms nowadays, aiming to maximize user satisfaction along with other key
business objectives. Twitter is no exception. Despite the fact that Twitter
data has been extensively used to understand socioeconomic and political
phenomena and user behaviour, the implicit feedback provided by users on Tweets
through their engagements on the Home Timeline has only been explored to a
limited extent. At the same time, there is a lack of large-scale public social
network datasets that would enable the scientific community to both benchmark
and build more powerful and comprehensive models that tailor content to user
interests. By releasing an original dataset of 160 million Tweets along with
engagement information, Twitter aims to address exactly that. During this
release, special attention is drawn on maintaining compliance with existing
privacy laws. Apart from user privacy, this paper touches on the key challenges
faced by researchers and professionals striving to predict user engagements. It
further describes the key aspects of the RecSys 2020 Challenge that was
organized by ACM RecSys in partnership with Twitter using this dataset.Comment: 16 pages, 2 table
- …