Search CORE

3,346 research outputs found

Hashing for Similarity Search: A Survey

Author: Ji Jianqiu
Shen Heng Tao
Song Jingkuan
Wang Jingdong
Publication venue
Publication date: 13/08/2014
Field of study

Similarity search (nearest neighbor search) is a problem of pursuing the data items whose distances to a query item are the smallest from a large database. Various methods have been developed to address this problem, and recently a lot of efforts have been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work locality sensitive hashing. We divide the hashing algorithms two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution and learning to hash, which learns hash functions according the data distribution, and review them from various aspects, including hash function design and distance measure and search scheme in the hash coding space

arXiv.org e-Print Archive

CiteSeerX

Fast and Powerful Hashing using Tabulation

Author: Thorup Mikkel
Publication venue
Publication date: 01/01/2016
Field of study

Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees. Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of

c

characters and we have precomputed character tables

h_1,...,h_c

mapping characters to random hash values. A key

x=(x_1,...,x_c)

is hashed to

h_1[x_1] \oplus h_2[x_2].....\oplus h_c[x_c]

. This schemes is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing. Next we consider twisted tabulation where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data-structures. Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. While these tabulation schemes are all easy to implement and use, their analysis is not

arXiv.org e-Print Archive

Copenhagen University Research Information System

Dagstuhl Research Online Publication Server

THRIVE: Threshold Homomorphic encryption based secure and privacy preserving bIometric VErification system

Author: Erdogan Hakan
Karabat Cagatay
Kiraz Mehmet Sabir
Savas Erkay
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 29/09/2014
Field of study

In this paper, we propose a new biometric verification and template protection system which we call the THRIVE system. The system includes novel enrollment and authentication protocols based on threshold homomorphic cryptosystem where the private key is shared between a user and the verifier. In the THRIVE system, only encrypted binary biometric templates are stored in the database and verification is performed via homomorphically randomized templates, thus, original templates are never revealed during the authentication stage. The THRIVE system is designed for the malicious model where the cheating party may arbitrarily deviate from the protocol specification. Since threshold homomorphic encryption scheme is used, a malicious database owner cannot perform decryption on encrypted templates of the users in the database. Therefore, security of the THRIVE system is enhanced using a two-factor authentication scheme involving the user's private key and the biometric data. We prove security and privacy preservation capability of the proposed system in the simulation-based model with no assumption. The proposed system is suitable for applications where the user does not want to reveal her biometrics to the verifier in plain form but she needs to proof her physical presence by using biometrics. The system can be used with any biometric modality and biometric feature extraction scheme whose output templates can be binarized. The overall connection time for the proposed THRIVE system is estimated to be 336 ms on average for 256-bit biohash vectors on a desktop PC running with quad-core 3.2 GHz CPUs at 10 Mbit/s up/down link connection speed. Consequently, the proposed system can be efficiently used in real life applications

arXiv.org e-Print Archive

Springer - Publisher Connector

De Montfort University Open Research Archive

Privacy-Aware Recommender Systems Challenge on Twitter's Home Timeline

Author: Andrade Nazareno
Anelli Walter
Belli Luca
Bronstein Michael
Delić Amra
Gupta Akshay
Ktena Sofia Ira
Lung-Yut-Fon Alexandre
Portman Frank
Shi Wenzhe
Smith Jessie
Sottocornola Gabriele
Tejani Alykhan
Xie Yuanpu
Zhu Xiao
Publication venue
Publication date: 07/10/2020
Field of study

Recommender systems constitute the core engine of most social network platforms nowadays, aiming to maximize user satisfaction along with other key business objectives. Twitter is no exception. Despite the fact that Twitter data has been extensively used to understand socioeconomic and political phenomena and user behaviour, the implicit feedback provided by users on Tweets through their engagements on the Home Timeline has only been explored to a limited extent. At the same time, there is a lack of large-scale public social network datasets that would enable the scientific community to both benchmark and build more powerful and comprehensive models that tailor content to user interests. By releasing an original dataset of 160 million Tweets along with engagement information, Twitter aims to address exactly that. During this release, special attention is drawn on maintaining compliance with existing privacy laws. Apart from user privacy, this paper touches on the key challenges faced by researchers and professionals striving to predict user engagements. It further describes the key aspects of the RecSys 2020 Challenge that was organized by ACM RecSys in partnership with Twitter using this dataset.Comment: 16 pages, 2 table

arXiv.org e-Print Archive