
    Order Preserving Minimal Perfect Hash Functions and Information Retrieval

    Rapid access to information is essential for a wide variety of retrieval systems and applications. Hashing has long been used when the fastest possible direct search is desired, but is generally not appropriate when sequential or range searches are also required. This paper describes a hashing method, developed for collections that are relatively static, that supports both direct and sequential access. The algorithms described give hash functions that are optimal in terms of time and hash table space utilization, and that preserve any a priori ordering desired. Furthermore, the resulting order preserving minimal perfect hash functions (OPMPHFs) can be found using time and space that are linear in the number of keys involved, and so are close to optimal.
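    To make the OPMPHF contract concrete, the toy sketch below (not the paper's linear-time construction, which it does not attempt to reproduce) illustrates the three properties such a function must satisfy: perfect (collision-free), minimal (range exactly 0..n-1), and order preserving (the a priori key order is reflected in the hash values). All names are illustrative.

```python
# Toy illustration of the OPMPHF properties; NOT the paper's algorithm.
from bisect import bisect_left

def build_toy_opmphf(keys):
    """Return a hash function mapping each key to its rank in sorted order.

    The result is perfect (no collisions), minimal (range is exactly
    0..n-1), and order preserving (k1 < k2 implies h(k1) < h(k2)).
    """
    ordered = sorted(keys)
    def h(key):
        i = bisect_left(ordered, key)
        assert i < len(ordered) and ordered[i] == key, "key not in static set"
        return i
    return h

keys = ["apple", "banana", "cherry", "durian"]
h = build_toy_opmphf(keys)
assert [h(k) for k in keys] == [0, 1, 2, 3]              # minimal and perfect
assert all(h(a) < h(b) for a, b in zip(keys, keys[1:]))  # order preserving
```

    Because the hash values follow the key order, the same table supports both direct lookups and sequential or range scans, which is the combination the paper targets; a real OPMPHF additionally achieves constant-time evaluation, which this sketch does not.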

    Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search

    The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a general technique for constructing a data structure to answer approximate near neighbor queries by using a distribution $\mathcal{H}$ over locality-sensitive hash functions that partition space. For a collection of $n$ points, after preprocessing, the query time is dominated by $O(n^{\rho} \log n)$ evaluations of hash functions from $\mathcal{H}$ and $O(n^{\rho})$ hash table lookups and distance computations, where $\rho \in (0,1)$ is determined by the locality-sensitivity properties of $\mathcal{H}$. It follows from a recent result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive hash functions can be reduced to $O(\log^2 n)$, leaving the query time to be dominated by $O(n^{\rho})$ distance computations and $O(n^{\rho} \log n)$ additional word-RAM operations. We state this result as a general framework and provide a simpler analysis showing that the number of lookups and distance computations closely matches the Indyk-Motwani framework, making it a viable replacement in practice. Using ideas from another locality-sensitive hashing framework by Andoni and Indyk (SODA 2006), we are able to reduce the number of additional word-RAM operations to $O(n^{\rho})$. Comment: 15 pages, 3 figures
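    As a rough illustration of the generic framework (not the specific constructions analysed in the paper), the sketch below builds $L$ hash tables, each keyed by the concatenation of $k$ locality-sensitive hash values, and answers a query by collecting candidates from the matching buckets and filtering them with exact distance computations. Random-hyperplane (sign) hashing stands in for the abstract family $\mathcal{H}$; the class name LSHIndex and all parameter values are illustrative.

```python
# Minimal sketch of the generic LSH framework: L tables, each keyed by
# k concatenated locality-sensitive hash bits; queries collect candidates
# from the matching buckets and filter them by exact distance.
import numpy as np
from collections import defaultdict

class LSHIndex:
    def __init__(self, dim, k=8, L=16, seed=0):
        rng = np.random.default_rng(seed)
        # L tables, each using k random hyperplanes -> k sign bits per bucket key
        self.planes = [rng.standard_normal((k, dim)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, planes, x):
        return tuple((planes @ x) > 0)          # k-bit bucket key

    def add(self, idx, x):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, x)].append(idx)

    def query(self, x, data, radius):
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, x), []))
        # exact distance computations dominate query time, as in the framework
        return [i for i in candidates
                if np.linalg.norm(data[i] - x) <= radius]

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 32))
index = LSHIndex(dim=32)
for i, v in enumerate(data):
    index.add(i, v)
near = index.query(data[0], data, radius=2.0)
```

    In this sketch the candidate set plays the role of the $O(n^{\rho})$ bucket lookups and distance computations; the paper's contribution concerns how few distinct hash function evaluations are needed to drive such tables.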

    Finding evidence of wordlists being deployed against SSH Honeypots - implications and impacts

    This paper is an investigation focusing on activities detected by three SSH honeypots that utilise the Kippo honeypot software. The honeypots were located on the same /24 IPv4 network and configured as identically as possible, using the same base software and hardware configurations. The data from the honeypots were collected between 17th July 2012 and 26th November 2013, a total of 497 active day periods. The analysis in this paper focuses on the techniques used by attacking entities to attempt to gain access to these systems. Although all three honeypots have the same configuration settings and are located in the same IPv4 /24 network address space, there is variation in the number of activities recorded on each honeypot. Automated password guessing using wordlists is one technique employed by cyber criminals in attempts to gain access to devices on the Internet. The research suggests that automated password tools and wordlists are widely used in attempts to gain access to the SSH honeypots, and that a wide range of account types is being probed.
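    One plausible way such wordlist reuse could be surfaced from honeypot login records is sketched below: if several distinct source IPs submit long, identical password sequences, the attempts were likely driven by a shared wordlist. The record format (src_ip, password), the threshold, and the function name are hypothetical and are not taken from the paper or from Kippo's log schema.

```python
# Hedged sketch of a wordlist-reuse heuristic over honeypot login attempts.
# Input format is hypothetical: an iterable of (src_ip, password) pairs
# in arrival order, one pair per failed login attempt.
from collections import defaultdict

def suspected_wordlists(attempts, min_len=20):
    per_ip = defaultdict(list)
    for src_ip, password in attempts:
        per_ip[src_ip].append(password)
    # group sources by the exact sequence of passwords they tried
    by_sequence = defaultdict(list)
    for src_ip, passwords in per_ip.items():
        if len(passwords) >= min_len:
            by_sequence[tuple(passwords)].append(src_ip)
    # sequences replayed by more than one source suggest a shared wordlist
    return {seq: ips for seq, ips in by_sequence.items() if len(ips) > 1}
```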

    Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time, $O(m + occ)$, within $O(r \log\log_w(\sigma + n/r))$ space, for a text of length $n$ over an alphabet of size $\sigma$ on a RAM machine with words of $w = \Omega(\log n)$ bits. Within that space, our index can also count in optimal time, $O(m)$. Multiplying the space by $O(w/\log \sigma)$, we support count and locate in $O(\lceil m \log(\sigma)/w \rceil)$ and $O(\lceil m \log(\sigma)/w \rceil + occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r \log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r) + \ell \log(\sigma)/w)$. Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in $O(\log(n/r))$ time per operation. Comment: submitted version; optimal count and locate in smaller space: $O(r \log\log_w(n/r + \sigma))$
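    The measure $r$ itself is easy to illustrate. The sketch below (deliberately naive, and unrelated to the index machinery described in the paper) computes the BWT of a string by sorting its rotations and then counts the equal-letter runs; on repetitive inputs $r$ grows much more slowly than $n$.

```python
# Sketch of the compressibility measure r: number of equal-letter runs in
# the BWT. The rotation sort below is quadratic and only illustrates the
# definition, not the paper's data structures.
def bwt(text, terminator="\0"):
    s = text + terminator                      # unique smallest sentinel
    rotations = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return "".join(s[i - 1] for i in rotations) # last column of sorted rotations

def run_length_encode(s):
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

b = bwt("abracadabra" * 4)              # highly repetitive input
r = len(run_length_encode(b))           # r stays small relative to len(b)
```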

    Mercury BLAST dictionaries: analysis and performance measurement

    This report describes a hashing scheme for a dictionary of short bit strings. The scheme, which we call near-perfect hashing, was designed as part of the construction of Mercury BLAST, an FPGA-based accelerator for the BLAST family of biosequence comparison algorithms. Near-perfect hashing is a heuristic variant of the well-known displacement hashing approach to building perfect hash functions. It uses a family of hash functions composed from linear transformations on bit vectors and lookups in small precomputed tables, both of which are especially appropriate for implementation in hardware logic. We show empirically that for inputs derived from genomic DNA sequences, our scheme obtains a good tradeoff between the size of the hash table and the time required to compute it from a set of input strings, while generating few or no collisions between keys in the table. One of the building blocks of our scheme is the H_3 family of hash functions, which are linear transformations on bit vectors. We show that the uniformity of hashing performed with randomly chosen linear transformations depends critically on their rank, and that randomly chosen transformations have a high probability of having the maximum possible uniformity. A simple test is sufficient to ensure that a randomly chosen H_3 hash function will not cause an unexpectedly large number of collisions. Moreover, if two such functions are chosen independently at random, the second function is unlikely to hash together two keys that were hashed together by the first. Hashing schemes based on H_3 hash functions therefore tend to distribute their inputs more uniformly than would be expected under a simple uniform hashing model, and schemes using pairs of these functions are more uniform than would be assumed for a pair of independent hash functions.
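    The sketch below illustrates an H_3-style hash function as a random linear transformation over GF(2): each set bit of the key selects a precomputed random row, and the selected rows are XORed together. It also includes a GF(2) rank computation of the kind that could serve as the simple uniformity check mentioned above; the parameter values and function names are illustrative, not those used in Mercury BLAST.

```python
# Sketch of an H_3-style hash: a random linear map over GF(2) from
# key_bits-bit keys to hash_bits-bit values, computed as the XOR of the
# rows selected by the key's set bits. Parameters are illustrative.
import random

def make_h3(key_bits, hash_bits, seed=0):
    rng = random.Random(seed)
    rows = [rng.getrandbits(hash_bits) for _ in range(key_bits)]
    def h(key):
        out = 0
        for i, row in enumerate(rows):
            if (key >> i) & 1:          # bit i of the key selects row i
                out ^= row
        return out
    return h, rows

def gf2_rank(rows):
    """Rank of the transformation over GF(2); uniformity of the resulting
    hash improves as the rank approaches its maximum possible value."""
    rank, rows = 0, list(rows)
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        lsb = pivot & -pivot            # eliminate this bit from the rest
        rows = [r ^ pivot if r & lsb else r for r in rows]
    return rank

h, rows = make_h3(key_bits=22, hash_bits=16)
print(gf2_rank(rows), h(0b1010110011101010101011))
```

    In hardware, each row lookup and XOR maps to a handful of gates, which is why this family suits FPGA implementation; the rank check costs a small one-time computation when a transformation is drawn at random.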