815 research outputs found

    Scalable Techniques for Similarity Search

    Get PDF
    Document similarity is similar to the nearest neighbour problem and has applications in various domains. In order to determine the similarity / dissimilarity of the documents first they need to be converted into sets containing shingles. Each document is converted into k-shingles, k being the length of each shingle. The similarity is calculated using Jaccard distance between sets and output into a characteristic matrix, the complexity to parse this matrix is significantly high especially when the sets are large. In this project we explore various approaches such as Min hashing, LSH & Bloom Filter to decrease the matrix size and to improve the time complexity. Min hashing creates a signature matrix which significantly smaller compared to a characteristic matrix. In this project we will look into Min-Hashing implementation, pros and cons. Also we will explore Locality Sensitive Hashing, Bloom Filters and their advantages

    Bloom Filters in Adversarial Environments

    Get PDF
    Many efficient data structures use randomness, allowing them to improve upon deterministic ones. Usually, their efficiency and correctness are analyzed using probabilistic tools under the assumption that the inputs and queries are independent of the internal randomness of the data structure. In this work, we consider data structures in a more robust model, which we call the adversarial model. Roughly speaking, this model allows an adversary to choose inputs and queries adaptively according to previous responses. Specifically, we consider a data structure known as "Bloom filter" and prove a tight connection between Bloom filters in this model and cryptography. A Bloom filter represents a set SS of elements approximately, by using fewer bits than a precise representation. The price for succinctness is allowing some errors: for any x∈Sx \in S it should always answer `Yes', and for any x∉Sx \notin S it should answer `Yes' only with small probability. In the adversarial model, we consider both efficient adversaries (that run in polynomial time) and computationally unbounded adversaries that are only bounded in the number of queries they can make. For computationally bounded adversaries, we show that non-trivial (memory-wise) Bloom filters exist if and only if one-way functions exist. For unbounded adversaries we show that there exists a Bloom filter for sets of size nn and error Δ\varepsilon, that is secure against tt queries and uses only O(nlog⁥1Δ+t)O(n \log{\frac{1}{\varepsilon}}+t) bits of memory. In comparison, nlog⁥1Δn\log{\frac{1}{\varepsilon}} is the best possible under a non-adaptive adversary

    khmer: Working with Big Data in Bioinformatics

    Full text link
    We introduce design and optimization considerations for the 'khmer' package.Comment: Invited chapter for forthcoming book on Performance of Open Source Application

    A secure searcher for end-to-end encrypted email communication

    Get PDF
    Email has become a common mode of communication for confidential personal as well as business needs. There are different approaches to authenticate the sender of an email message at the receiver‟s client and ensure that the message can be read only by the intended recipient. A typical approach is to use an email encryption standard to encrypt the message on the sender‟s client and decrypt it on the receiver‟s client for secure communication. A major drawback of this approach is that only the encrypted email messages are stored in the mail servers and the default search does not work on encrypted data. This project details an approach that could be adopted for securely searching email messages protected using end-to-end encrypted email communication. This project proposes an overall design for securely searching encrypted email messages and provides an implementation in Java based on a cryptographically secure Bloom filter technique to create a secure index. The implemented library is then integrated with an open source email client to depict its usability in a live environment. The technique and the implemented library are further evaluated for security and scalability while allowing remote storage of the created secure index. The research in this project would enhance email clients that support encrypted email transfer with a full secure search functionality

    Towards Application of Cuckoo Filters in Network Security Monitoring

    Get PDF
    In this paper, we study the feasibility of applying the recently proposed cuckoo filters to improve space efficiency for set membership testing in Network Security Monitoring, focusing on the example of Threat Intelligence matching. We present conceptual insights for the practical application of cuckoo filters and provide a cuckoo filter implementation that allows runtime configuration. To evaluate the practical applicability of cuckoo filters, we integrate our implementation into the Bro Network Security Monitor, compare it to traditional data structures and conduct a brief operational evaluation. We find that cuckoo filters allow remarkable memory savings, while potential performance trade-offs, caused by introducing false positives, have to be carefully evaluated on a case-by-case basis

    Scalable Hash Tables

    Get PDF
    The term scalability with regards to this dissertation has two meanings: It means taking the best possible advantage of the provided resources (both computational and memory resources) and it also means scaling data structures in the literal sense, i.e., growing the capacity, by “rescaling” the table. Scaling well to computational resources implies constructing the fastest best per- forming algorithms and data structures. On today’s many-core machines the best performance is immediately associated with parallelism. Since CPU frequencies have stopped growing about 10-15 years ago, parallelism is the only way to take ad- vantage of growing computational resources. But for data structures in general and hash tables in particular performance is not only linked to faster computations. The most execution time is actually spent waiting for memory. Thus optimizing data structures to reduce the amount of memory accesses or to take better advantage of the memory hierarchy especially through predictable access patterns and prefetch- ing is just as important. In terms of scaling the size of hash tables we have identified three domains where scaling hash-based data structures have been lacking previously, i.e., space effi- cient growing, concurrent hash tables, and Approximate Membership Query data structures (AMQ-filter). Throughout this dissertation, we describe the problems in these areas and develop efficient solutions. We highlight three different libraries that we have developed over the course of this dissertation, each containing mul- tiple implementations that have shown throughout our testing to be among the best implementations in their respective domains. In this composition they offer a comprehensive toolbox that can be used to solve many kinds of hashing related problems or to develop individual solutions for further ones. DySECT is a library for space efficient hash tables specifically growing space effi- cient hash tables that scale with their input size. It contains the namesake DySECT data structure in addition to a number of different probing and cuckoo based im- plementations. Growt is a library for highly efficient concurrent hash tables. It contains a very fast base table and a number of extensions to adapt this table to match any purpose. All extension can be combined to create a variety of different interfaces. In our extensive experimental evaluation, each adaptation has shown to be among the best hash tables for their specific purpose. Lpqfilter is a library for concurrent approximate membership query (AMQ) data structures. It contains some original data structures, like the linear probing quotient filter, as well as some novel approaches to dynamically sized quotient filters
    • 

    corecore