19,717 research outputs found
Distance Sensitive Bloom Filters Without False Negatives
A Bloom filter is a widely used data-structure for representing a set and
answering queries of the form "Is in ?". By allowing some false positive
answers (saying "yes" when the answer is in fact `no') Bloom filters use space
significantly below what is required for storing . In the distance sensitive
setting we work with a set of (Hamming) vectors and seek a data structure
that offers a similar trade-off, but answers queries of the form "Is close
to an element of ?" (in Hamming distance). Previous work on distance
sensitive Bloom filters have accepted false positive and false negative
answers. Absence of false negatives is of critical importance in many
applications of Bloom filters, so it is natural to ask if this can be also
achieved in the distance sensitive setting. Our main contributions are upper
and lower bounds (that are tight in several cases) for space usage in the
distance sensitive setting where false negatives are not allowed.Comment: Published in SODA 201
Scalable Techniques for Similarity Search
Document similarity is similar to the nearest neighbour problem and has applications in various domains. In order to determine the similarity / dissimilarity of the documents first they need to be converted into sets containing shingles. Each document is converted into k-shingles, k being the length of each shingle. The similarity is calculated using Jaccard distance between sets and output into a characteristic matrix, the complexity to parse this matrix is significantly high especially when the sets are large. In this project we explore various approaches such as Min hashing, LSH & Bloom Filter to decrease the matrix size and to improve the time complexity. Min hashing creates a signature matrix which significantly smaller compared to a characteristic matrix. In this project we will look into Min-Hashing implementation, pros and cons. Also we will explore Locality Sensitive Hashing, Bloom Filters and their advantages
Recommended from our members
Usable Secure Private Search
Real-world applications commonly require untrusting parties to share sensitive information securely. This article describes a secure anonymous database search (SADS) system that provides exact keyword match capability. Using a new reroutable encryption and the ideas of Bloom filters and deterministic encryption, SADS lets multiple parties efficiently execute exact-match queries over distributed encrypted databases in a controlled manner. This article further considers a more general search setting allowing similarity searches, going beyond existing work that considers similarity in terms of error tolerance and Hamming distance. This article presents a general framework, built on the cryptographic and privacy-preserving guarantees of the SADS primitive, for engineering usable private secure search systems
A neural data structure for novelty detection
Novelty detection is a fundamental biological problem that organisms must solve to determine whether a given stimulus departs from those previously experienced. In computer science, this problem is solved efficiently using a data structure called a Bloom filter. We found that the fruit fly olfactory circuit evolved a variant of a Bloom filter to assess the novelty of odors. Compared with a traditional Bloom filter, the fly adjusts novelty responses based on two additional features: the similarity of an odor to previously experienced odors and the time elapsed since the odor was last experienced. We elaborate and validate a framework to predict novelty responses of fruit flies to given pairs of odors. We also translate insights from the fly circuit to develop a class of distance- and time-sensitive Bloom filters that outperform prior filters when evaluated on several biological and computational datasets. Overall, our work illuminates the algorithmic basis of an important neurobiological problem and offers strategies for novelty detection in computational systems
Data Leak Detection As a Service: Challenges and Solutions
We describe a network-based data-leak detection (DLD)
technique, the main feature of which is that the detection
does not require the data owner to reveal the content of the
sensitive data. Instead, only a small amount of specialized
digests are needed. Our technique – referred to as the fuzzy
fingerprint – can be used to detect accidental data leaks due
to human errors or application flaws. The privacy-preserving
feature of our algorithms minimizes the exposure of sensitive
data and enables the data owner to safely delegate the
detection to others.We describe how cloud providers can offer
their customers data-leak detection as an add-on service
with strong privacy guarantees.
We perform extensive experimental evaluation on the privacy,
efficiency, accuracy and noise tolerance of our techniques.
Our evaluation results under various data-leak scenarios
and setups show that our method can support accurate
detection with very small number of false alarms, even
when the presentation of the data has been transformed. It
also indicates that the detection accuracy does not degrade
when partial digests are used. We further provide a quantifiable
method to measure the privacy guarantee offered by our
fuzzy fingerprint framework
- …