815 research outputs found
Scalable Techniques for Similarity Search
Document similarity is similar to the nearest neighbour problem and has applications in various domains. In order to determine the similarity / dissimilarity of the documents first they need to be converted into sets containing shingles. Each document is converted into k-shingles, k being the length of each shingle. The similarity is calculated using Jaccard distance between sets and output into a characteristic matrix, the complexity to parse this matrix is significantly high especially when the sets are large. In this project we explore various approaches such as Min hashing, LSH & Bloom Filter to decrease the matrix size and to improve the time complexity. Min hashing creates a signature matrix which significantly smaller compared to a characteristic matrix. In this project we will look into Min-Hashing implementation, pros and cons. Also we will explore Locality Sensitive Hashing, Bloom Filters and their advantages
Bloom Filters in Adversarial Environments
Many efficient data structures use randomness, allowing them to improve upon
deterministic ones. Usually, their efficiency and correctness are analyzed
using probabilistic tools under the assumption that the inputs and queries are
independent of the internal randomness of the data structure. In this work, we
consider data structures in a more robust model, which we call the adversarial
model. Roughly speaking, this model allows an adversary to choose inputs and
queries adaptively according to previous responses. Specifically, we consider a
data structure known as "Bloom filter" and prove a tight connection between
Bloom filters in this model and cryptography.
A Bloom filter represents a set of elements approximately, by using fewer
bits than a precise representation. The price for succinctness is allowing some
errors: for any it should always answer `Yes', and for any it should answer `Yes' only with small probability.
In the adversarial model, we consider both efficient adversaries (that run in
polynomial time) and computationally unbounded adversaries that are only
bounded in the number of queries they can make. For computationally bounded
adversaries, we show that non-trivial (memory-wise) Bloom filters exist if and
only if one-way functions exist. For unbounded adversaries we show that there
exists a Bloom filter for sets of size and error , that is
secure against queries and uses only
bits of memory. In comparison, is the best
possible under a non-adaptive adversary
khmer: Working with Big Data in Bioinformatics
We introduce design and optimization considerations for the 'khmer' package.Comment: Invited chapter for forthcoming book on Performance of Open Source
Application
A secure searcher for end-to-end encrypted email communication
Email has become a common mode of communication for confidential personal as well as business needs. There are different approaches to authenticate the sender of an email message at the receiverâs client and ensure that the message can be read only by the intended recipient. A typical approach is to use an email encryption standard to encrypt the message on the senderâs client and decrypt it on the receiverâs client for secure communication. A major drawback of this approach is that only the encrypted email messages are stored in the mail servers and the default search does not work on encrypted data. This project details an approach that could be adopted for securely searching email messages protected using end-to-end encrypted email communication. This project proposes an overall design for securely searching encrypted email messages and provides an implementation in Java based on a cryptographically secure Bloom filter technique to create a secure index. The implemented library is then integrated with an open source email client to depict its usability in a live environment. The technique and the implemented library are further evaluated for security and scalability while allowing remote storage of the created secure index. The research in this project would enhance email clients that support encrypted email transfer with a full secure search functionality
Towards Application of Cuckoo Filters in Network Security Monitoring
In this paper, we study the feasibility of applying the recently proposed cuckoo filters to improve space efficiency for set membership testing in Network Security Monitoring, focusing on the example of Threat Intelligence matching. We present conceptual insights for the practical application of cuckoo filters and provide a cuckoo filter implementation that allows runtime configuration. To evaluate the practical applicability of cuckoo filters, we integrate our implementation into the Bro Network Security Monitor, compare it to traditional data structures and conduct a brief operational evaluation. We find that cuckoo filters allow remarkable memory savings, while potential performance trade-offs, caused by introducing false positives, have to be carefully evaluated on a case-by-case basis
Scalable Hash Tables
The term scalability with regards to this dissertation has two meanings: It means
taking the best possible advantage of the provided resources (both computational
and memory resources) and it also means scaling data structures in the literal sense,
i.e., growing the capacity, by ârescalingâ the table.
Scaling well to computational resources implies constructing the fastest best per-
forming algorithms and data structures. On todayâs many-core machines the best
performance is immediately associated with parallelism. Since CPU frequencies
have stopped growing about 10-15 years ago, parallelism is the only way to take ad-
vantage of growing computational resources. But for data structures in general and
hash tables in particular performance is not only linked to faster computations. The
most execution time is actually spent waiting for memory. Thus optimizing data
structures to reduce the amount of memory accesses or to take better advantage of
the memory hierarchy especially through predictable access patterns and prefetch-
ing is just as important.
In terms of scaling the size of hash tables we have identified three domains where
scaling hash-based data structures have been lacking previously, i.e., space effi-
cient growing, concurrent hash tables, and Approximate Membership Query data
structures (AMQ-filter). Throughout this dissertation, we describe the problems
in these areas and develop efficient solutions. We highlight three different libraries
that we have developed over the course of this dissertation, each containing mul-
tiple implementations that have shown throughout our testing to be among the
best implementations in their respective domains. In this composition they offer
a comprehensive toolbox that can be used to solve many kinds of hashing related
problems or to develop individual solutions for further ones.
DySECT is a library for space efficient hash tables specifically growing space effi-
cient hash tables that scale with their input size. It contains the namesake DySECT
data structure in addition to a number of different probing and cuckoo based im-
plementations. Growt is a library for highly efficient concurrent hash tables. It
contains a very fast base table and a number of extensions to adapt this table to
match any purpose. All extension can be combined to create a variety of different
interfaces. In our extensive experimental evaluation, each adaptation has shown
to be among the best hash tables for their specific purpose. Lpqfilter is a library
for concurrent approximate membership query (AMQ) data structures. It contains
some original data structures, like the linear probing quotient filter, as well as some
novel approaches to dynamically sized quotient filters
- âŠ